LexYacc Final

This document provides an overview of Lex and Yacc, tools for generating lexical analyzers and parsers. It discusses the structure and components of Lex specification (.l) files and Yacc specification (.y) files. The key points covered are:
- Lex generates scanners (lexers) that break input streams into tokens, using regular expressions to define patterns and actions.
- Yacc generates parsers from context-free grammars, defined as production rules for a language.
- Both Lex and Yacc files have definitions, rules, and user code sections that specify the language and its processing.

Uploaded by Vinutha K

Tutorial on Lex & Yacc
Compiler Phases

Front End: a high-level program flows through the Lexical Analyzer (producing tokens), the Syntax Analyzer (producing a syntax tree), the Semantic Analyzer, and the Intermediate Code Generator, producing intermediate code.

Back End: the intermediate code flows through the (machine-independent) Code Optimizer and the (machine-dependent) Code Generator, producing target machine code.
Lex: what is it?
1. Lex: a tool for automatically generating a lexer or scanner given a lex specification (.l file).
2. A lexer or scanner performs lexical analysis: breaking up an input stream into meaningful units, or tokens.
3. For example, consider breaking a text file up into individual words.

x = a + b * 2;

[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b),
(operator, *), (literal, 2), (separator, ;)]
Skeleton of a lex specification (.l file)
• Input specification for Lex
• Three parts: Definitions, Rules, User subroutines
• Use “%%” as the delimiter between parts

• First part: Definitions (Declarations)
  • Defines the options used by Lex and, generally, the running environment of the lexer.
  • The programming code within “%{” and “%}” is copied verbatim into the lexer.
  • The second part can use these definitions.

• Second part: Rules
  • This part contains patterns and the corresponding action code that is executed when a pattern matches.
  • Patterns are defined by regular expressions.
Skeleton of a lex specification (.l file)

[_a-zA-Z][_a-zA-Z0-9]*   {action}

• Third part: User subroutines
  – The code in this area is copied verbatim into ‘lex.yy.c’, the output of Lex.
  – This part generally holds the sub-programs used by the action code.
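Putting the three sections together, a minimal .l file might look like the following sketch (the NUMBER pattern and the printed message are illustrative, not taken from the slides):

```lex
%{
/* definitions section: this C code is copied verbatim into lex.yy.c */
#include <stdio.h>
%}
DIGIT   [0-9]
%%
{DIGIT}+   { printf("NUMBER: %s\n", yytext); }   /* rules section */
.|\n       { /* ignore everything else */ }
%%
/* user subroutines section: helpers for the action code */
int yywrap(void) { return 1; }
```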
The rules section
%%
[RULES SECTION]

<pattern> { <action to take when matched> }


<pattern> { <action to take when matched> }

%%

Patterns are specified by regular expressions.


For example:
%%
[A-Za-z]* { printf("this is a word"); }
%%
Regular Expression Basics
. : matches any single character except \n
* : matches 0 or more instances of the preceding regular expression
+ : matches 1 or more instances of the preceding regular expression
? : matches 0 or 1 of the preceding regular expression
| : matches the preceding or following regular expression
[ ] : defines a character class
() : groups enclosed regular expression into a new regular expression
"…" : matches everything within the quotes literally
Lex Regular Exp (cont)
x|y      x or y
{i}      the expansion of the named definition i
x/y      x, only if followed by y (y not removed from input)
x{m,n}   m to n occurrences of x
^x       x, but only at beginning of line
x$       x, but only at end of line
"s"      exactly what is in the quotes (except for "\" and the following character)
Meta-characters
– meta-characters do not match themselves, because they serve as the regular-expression operators above:
  • ( ) [ ] { } < > + / , ^ * | . \ " $ ? - %

– to match a meta-character, prefix with "\"

– to match a backslash, tab or newline, use \\, \t, or \n


Regular Expression Examples
• an integer: 12345
[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
[-+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*"."[0-9]+
Two Rules
1. Lex will always match the longest possible token (in number of characters).
   If two or more possible tokens are of the same length, then the token whose regular expression is defined first in the lex specification is favoured.

2. Lex patterns only match a given input character or string once.


Regular Expression Examples
• a delimiter for an English sentence

"." | "?" | "!"    or, as a character class:   [.?!]

• C++ comment: // call foo() here!!

"//".*

• white space

[ \t]+

• English sentence: Look at this!

([ \t]+|[a-zA-Z]+)+("."|"?"|"!")
Generation of a lexical analyzer using LEX

Lex specification file x.l → LEX Compiler → lex.yy.c
lex.yy.c → C compiler → a.out (executable program)
Input strings from the source program → a.out → stream of tokens
Lex program example 1
%{
/* program to count blanks, words, lines and characters in the given file */
#include <stdio.h>
int nblk, nword, nchar, nline;
%}
%%
\n        { nline++; nchar++; }
[^ \t\n]+ { nword++; nchar += yyleng; }
" "       { nblk++; nchar++; }
.         { nchar++; }
%%
int yywrap()
{
    return 1;
}
Contd …
int main()
{
    yyin = fopen("in.dat", "r");
    yylex();
    fclose(yyin);
    printf("\n char count is :%d", nchar);
    printf("\n blk count is :%d", nblk);
    printf("\n word count is :%d", nword);
    printf("\n line count is :%d\n", nline);
    return 0;
}
Lex program compilation steps

• 1) lex example.l
• 2) cc lex.yy.c -ll
• 3) ./a.out
When both lex and yacc programs are involved

• 1) lex 1a.l
• 2) yacc -d 1b.y
• 3) cc lex.yy.c y.tab.c -ll
• 4) ./a.out
Example program 2
%{
/* program to count comments in the input file and delete them
   (note: this simple pattern only handles comments contained on one line) */
#include <stdio.h>
#include <stdlib.h>
int comment = 0;
%}

%%

"/*".*"*/" { comment++; }
.          ECHO;

%%
Contd …
int main(int argc, char *argv[])
{
    FILE *fp;
    if (argc > 2)
    {
        fp = fopen(argv[1], "r");
        yyout = fopen(argv[2], "w");
        if (!fp)
        {
            printf("error in opening the file");
            exit(1);
        }
        yyin = fp;
        yylex();
    }
    printf("no of comment lines:%d\n", comment);
    return 0;
}
Special Functions

• yytext
  – where the text matched most recently is stored
• yyleng
  – number of characters in the text most recently matched
• yylval
  – associated value of the current token
• yymore()
  – append the next matched string to the current contents of yytext
• yyless(n)
  – remove from yytext all but the first n characters
• unput(c)
  – return character c to the input stream
• yywrap()
  – when the end of the file is reached, the return value of yywrap() is checked
  – if it is non-zero, scanning terminates; if it is 0, scanning continues with the next input file
Grammar
• A grammar is a collection of rules.
• These rules arrange the tokens in a proper sequence so that the syntax of the language can be defined.
• Ex: to be a sentence, some rule must be followed.
  – Ex: noun verb adjective, noun verb, noun verb noun
• For defining such rules, we need a grammar. Grammar for YACC is described using BNF.
• A BNF grammar can describe context-free languages.
• The general form of writing a grammar in YACC is

      A → bb | c

  Here → means the symbol on the left can be replaced by the productions on the right-hand side, and | indicates “or”.
Parser-lexer communication
• In the process of compilation, the lexical analyzer and parser work together.
• When the parser requires a token, it invokes the lexical analyzer; in turn, the lexical analyzer supplies tokens to the parser on demand.

[Diagram: source program → lexical analyzer ⇄ parser (demand for / supply of tokens) → parse tree → rest of compiler → output code; both phases share the symbol table and an error handler.]
Yacc: what is it?
Yacc: a tool for automatically generating a parser given a grammar written in a yacc specification (.y file).

A grammar specifies a set of production rules, which define a language. A production rule specifies a sequence of symbols (a sentence) that is legal in the language.
Structure of yacc File

Definition section
declarations of tokens
type of values used on parser stack

Rules section
list of grammar rules with semantic routines

User code
Skeleton of a yacc specification (.y file)

Running yacc on x.y generates a *.c file.

%{
< C global variables, prototypes, comments >
%}
     This part is embedded into the generated *.c file.

[DEFINITION SECTION]
     Contains token declarations. Tokens are recognized in the lexer.

%%
[PRODUCTION RULES SECTION]
     Defines how to “understand” the input language, and what actions to take for each “sentence”.
%%

< C auxiliary subroutines >
     Any user code, for example a main function to call the parser function yyparse().
The Production Rules Section
%%
production : symbol1 symbol2 …   { action }
           | symbol3 symbol4 …   { action }
           | …
           ;

production : symbol1 symbol2     { action }
           ;
%%
An example
%%
statement  : expression { printf(" = %g\n", $1); }
expression : expression '+' expression { $$ = $1 + $3; }
           | expression '-' expression { $$ = $1 - $3; }
           | NUMBER { $$ = $1; }
%%

According to these productions, 5 + 4 - 3 + 2 is parsed left-to-right: each NUMBER reduces to an expression, and the binary productions group the operators as ((5 + 4) - 3) + 2, which finally reduces to a statement.
Choosing a Grammar

Grammar 1 (unambiguous, precedence built in):   Grammar 2 (ambiguous):
S -> E                                          S -> E
E -> E + T                                      E -> E + E
E -> E - T                                      E -> E - E
E -> T                                          E -> E * E
T -> T * F                                      E -> E / E
T -> T / F                                      E -> ( E )
T -> F                                          E -> ID
F -> ( E )
F -> ID
Precedence and Associativity

%right '='
%left '-' '+'
%left '*' '/'
%right '^'
Precedence rules are used in two situations:

• In expression grammars
• To resolve the “dangling else” conflict in grammars for if-then-else language constructs.
Defining Values

expr   : expr '+' term   { $$ = $1 + $3; }
       | term            { $$ = $1; }
       ;
term   : term '*' factor { $$ = $1 * $3; }
       | factor          { $$ = $1; }
       ;
factor : '(' expr ')'    { $$ = $2; }
       | ID
       | NUM
       ;

In an action, $1, $2, $3, … refer to the values of the first, second, third, … symbols on the right-hand side of the production, and $$ is the value of the left-hand side. When no action is given, the default is $$ = $1;
Example: Lex scanner.l

%{
#include <stdio.h>
#include "y.tab.h"
%}
id [_a-zA-Z][_a-zA-Z0-9]*
wspc [ \t\n]+
semi [;]
comma [,]
%%
int { return INT; }
char { return CHAR; }
float { return FLOAT; }
{comma} { return COMMA; } /* Necessary? */
{semi} { return SEMI; }
{id} { return ID;}
{wspc} {;}
decl.y
Example: Definitions

%{
#include <stdio.h>
#include <stdlib.h>
%}
%start decl
%token CHAR COMMA FLOAT ID INT SEMI
%%
decl.y

Example: Rules

decl : type ID list  { printf("Success!\n"); }
     ;
list : COMMA ID list
     | SEMI
     ;
type : INT | CHAR | FLOAT
     ;

%%
decl.y

Example: Supplementary Code

extern FILE *yyin;

int main()
{
    do {
        yyparse();
    } while (!feof(yyin));
    return 0;
}

int yyerror(char *s)
{
    /* Don't have to do anything! */
    return 0;
}
What yacc cannot parse
• It cannot deal with ambiguous grammars.
• It also cannot deal with grammars that need more than one token of lookahead to tell whether it has matched a rule.
Example
Phrase      -> cart_animal AND CART
             | work_animal AND PLOW
cart_animal -> HORSE | GOAT
work_animal -> HORSE | OX

Input: HORSE AND CART
After reading HORSE, the parser must decide whether to reduce it to cart_animal or work_animal, but with one token of lookahead it only sees AND; the deciding token (CART or PLOW) is two tokens away.

A rewrite yacc can handle, since the token after the animal now decides the rule:
Phrase -> cart_animal CART
        | work_animal PLOW
Conflicts
• A conflict occurs when the parser has
multiple possible actions in some state for a
given next token.
• Two kinds of conflicts:
– shift-reduce conflict:
• The parser can either keep reading more of the input
(“shift action”), or it can mimic a derivation step using
the input it has read already (“reduce action”).
– reduce-reduce conflict:
• There is more than one production that can be used
for mimicking a derivation step at that point.
Example of a conflict
Grammar rules:
S → if ( e ) S             /* 1 */
  | if ( e ) S else S      /* 2 */

Input: if ( e1 ) if ( e2 ) S2 else S3

Parser state when the input token is ‘else’:
– Input already seen: if ( e1 ) if ( e2 ) S2
– Choices for continuing:
  1. Keep reading input (“shift”): ‘else’ becomes part of the innermost if; eventual parse structure: if (e1) { if (e2) S2 else S3 }
  2. Mimic a derivation step using S → if ( e ) S (“reduce”): ‘else’ becomes part of the outermost if; eventual parse structure: if (e1) { if (e2) S2 } else S3

This is a shift-reduce conflict.
Handling Conflicts
General approach, iterating as necessary:
1. Use “yacc -v” to generate the file y.output.
2. Examine y.output to find parser states with conflicts.
3. For each such state, examine the items to figure out why the conflict is occurring.
4. Transform the grammar to eliminate the conflict.

Reason for conflict → possible grammar transformation:
• Ambiguity with operators in expressions → specify associativity and precedence
• Error action → move or eliminate the offending error action
• Semantic action → move the offending semantic action
• Insufficient lookahead → “expand out” the nonterminal involved
• Other → …???…
Reference Books
• lex & yacc, 2nd Edition, by John R. Levine, Tony Mason & Doug Brown, O'Reilly. ISBN: 1-56592-000-7
• Mastering Regular Expressions, by Jeffrey E. F. Friedl, O'Reilly. ISBN: 1-56592-257-3
