CD Unit-1
Topics to be Covered
1.1 Translators:
The widely used translators that translate the code of a computer program into machine code are:
1. Assemblers
2. Interpreters
3. Compilers
Assembler:
An Assembler converts an assembly program into machine code.
Compiler:
A compiler is a program that reads a program written in one language – the source language –
and translates it into an equivalent program in another language – the target language.
As an important part of this translation process, the compiler reports to its user the presence of
errors in the source program.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.
Advantages of Compiler:
1. Fast in execution
2. The object/executable code produced by a compiler can be distributed or executed without
having to have the compiler present.
3. The object program can be used whenever required without the need for recompilation.
Disadvantages of Compiler:
1. Debugging a program is much harder; therefore a compiler is not so good at finding errors.
2. When an error is found, the whole program has to be re-compiled.
1.2.2 Interpretation:
Interpretation is the conceptual process of translating a high level source code into executable
code.
Interpreter:
An Interpreter is also a program that translates high-level source code into executable code.
However the difference between a compiler and an interpreter is that an interpreter translates
one line at a time and then executes it: no object code is produced, and so the program has to
be interpreted each time it is to be run. If the program executes a section of code 1000 times, then
that section is translated into machine code 1000 times, since each line is interpreted and then
executed.
Disadvantages of an Interpreter:
1. Rather slow
2. No object code is produced, so a translation has to be done every time the program is run.
3. For the program to run, the Interpreter must be present
Hybrid Compiler:
A hybrid compiler is a compiler which translates human-readable source code into an intermediate
byte code for later interpretation. So these languages have features of both a compiler and an
interpreter. These types of compilers are commonly known as Just-In-Time (JIT) compilers.
Java is a good example of this type of compiler: Java language processors combine
compilation and interpretation. A Java source program is first compiled into an
intermediate form called byte codes, and the byte codes are then interpreted by a virtual machine.
(Diagram: the source program is fed to a translator; the translated program then runs on a machine together with the input to produce output.)
Compilers are not only used to translate a source language into assembly or machine
language; the same techniques are used in many other language-processing tools.
A language processor is a program that processes programs written in a programming language
(the source language). A part of a language processor is a language translator, which translates the
program from the source language into machine code, assembly language or another language.
1. Pre Processor
The pre-processor is system software which processes the source program before it is fed
into the compiler. It may perform functions such as macro processing, file inclusion and
language extension.
2. Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code. A compiler reads the whole
source program at once and then translates it, whereas an interpreter reads and executes it
statement by statement.
3. Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.
4. Linker
Linker is a computer program that links and merges various object files together in order to make
an executable file. All these files might have been compiled by separate assemblers. The major
task of a linker is to search and locate referenced modules/routines in a program and to determine
the memory locations where these codes will be loaded, so that program instructions have
absolute references.
5. Loader
Loader is a part of the operating system and is responsible for loading executable files into memory
and executing them. It calculates the size of a program's instructions and data and creates memory
space for it. It initializes various registers to initiate execution.
Analysis and Synthesis:
There are two parts of compilation: analysis and synthesis.
1. Analysis:
The first three phases form the bulk of the analysis portion of a compiler. The analysis part
breaks up the source program into constituent pieces and creates an intermediate representation
of the source program. During analysis, the operations implied by the source program are
determined and recorded in a hierarchical structure called a syntax tree, in which each node
represents an operation and the children of a node represent the arguments of the operation.
(Syntax tree for the assignment position := initial + rate * 60: the root := has children position and +, the + node has children initial and *, and the * node has children rate and 60.)
2. Synthesis Part:
The synthesis part constructs the desired target program from the intermediate representation.
This part requires the most specialized techniques.
Lexical Analysis: The lexical analysis phase reads the characters in the source program and
groups them into a stream of tokens in which each token represents a logically cohesive sequence
of characters, such as an identifier, a keyword (if, while, etc.), a punctuation character, or a
multi-character operator like :=. The character sequence forming a token is called the lexeme for
the token.
Certain tokens will be augmented by a "lexical value". For example, when the identifier rate is found, the
lexical analyzer generates the token id and also enters rate into the symbol table, if it does not
already exist. The lexical value associated with this id then points to the symbol-table entry for
rate.
Tokens: For the statement position := initial + rate * 60, the lexical analyzer produces the tokens id1, :=, id2, +, id3, *, 60, where id1, id2 and id3 point to the symbol-table entries for position, initial and rate respectively.
Syntax Analysis:
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical
analysis as input and generates a parse tree or syntax tree. In this phase, the token arrangement is
checked against the grammar of the source language, i.e. the parser checks whether the expression
made by the tokens is syntactically correct.
It imposes a hierarchical structure on the token stream in the form of a parse tree or syntax tree.
The syntax tree can be represented using a suitable data structure.
(Syntax trees for the assignment: one built with the identifier names position, initial, rate and the constant 60, and the equivalent tree in which the identifiers are replaced by the symbol-table references id1, id2, id3.)
Semantic Analysis:
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For
example, it checks that values are assigned between compatible data types and reports errors such
as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types
and expressions, and whether identifiers are declared before use. It produces an annotated syntax
tree as output.
This analysis inserts a conversion from integer to real in the above syntax tree.
(Annotated syntax tree in which an inttoreal conversion has been inserted above the integer constant 60.)
Intermediate Code Generation:
After semantic analysis the compiler generates an intermediate code of the source program for the
target machine. It represents a program for some abstract machine and is in between the high-level
language and the machine language. This intermediate code should be generated in such a way
that it is easy to translate into the target machine code.
Intermediate code has two properties: it should be easy to produce and easy to translate into the
target program. An intermediate representation can have many forms. One of them is three-address
code, which is like the assembly language for a machine in which every memory location can act
like a register; each three-address instruction has at most three operands.
Example: The output of the semantic analysis can be represented in the following intermediate
form:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
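As a rough illustration of how three-address instructions can be stored, the sketch below uses a quadruple structure in C; the type names, enum values and field names are assumptions for illustration, not taken from the text.

    /* Minimal sketch of a quadruple representation for three-address code.
       Enum values and field names are illustrative, not from the notes. */
    #include <stdio.h>

    typedef enum { OP_INTTOREAL, OP_MUL, OP_ADD, OP_ASSIGN } Op;

    typedef struct {
        Op op;              /* the operator                        */
        const char *arg1;   /* first operand (name or constant)    */
        const char *arg2;   /* second operand, or NULL if unused   */
        const char *result; /* destination temporary or variable   */
    } Quad;

    int main(void) {
        /* the running example: id1 := id2 + id3 * inttoreal(60) */
        Quad code[] = {
            { OP_INTTOREAL, "60",    NULL,    "temp1" },
            { OP_MUL,       "id3",   "temp1", "temp2" },
            { OP_ADD,       "id2",   "temp2", "temp3" },
            { OP_ASSIGN,    "temp3", NULL,    "id1"   },
        };
        for (int i = 0; i < 4; i++)
            printf("(%d, %s, %s, %s)\n", code[i].op, code[i].arg1,
                   code[i].arg2 ? code[i].arg2 : "-", code[i].result);
        return 0;
    }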
Code Optimization:
The next phase performs code optimization of the intermediate code. Optimization can be viewed as
removing unnecessary code lines and rearranging the sequence of statements in order
to speed up program execution without wasting resources (CPU, memory).
Example: In the running example the conversion of 60 can be done once at compile time and temp3
can be eliminated, giving:
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation:
This is the final phase of the compiler which generates the target code, consisting normally of
relocatable machine code or assembly code. Variables are assigned to the registers.
Example:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The first and second operands of each instruction specify a source and a destination
respectively. The F in each instruction denotes that the instruction deals with floating-point
numbers. The # signifies that 60.0 is to be treated as a constant.
Activities of the Compiler:
Symbol-table management and error handling are the other two activities in the compiler, which are
also referred to as phases. These two activities interact with all six phases of a compiler.
The symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.
The attributes of an identifier may provide information about the storage allocated for it,
its type, its scope (where in the program it is valid), and, in the case of procedure names, such
things as the number and types of its arguments and the method of passing each argument.
The symbol table allows us to find the record for each identifier quickly and to store or retrieve
data from that record quickly. The attributes of identifiers cannot usually be determined during the
lexical analysis phase, but they can be determined during the syntax and semantic analysis phases.
The other phases, such as the code generator, use the symbol table to retrieve details about the
identifiers.
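A minimal sketch of such a symbol table in C, using a fixed-size array with linear search (real compilers typically use hashing); the field names and the lookup_or_insert() helper are illustrative assumptions.

    /* Sketch of a symbol table: one record per identifier with attribute fields. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_SYMS 256

    struct symbol {
        char name[32];   /* the identifier's lexeme              */
        char type[16];   /* e.g. "int", "real"                   */
        int  scope;      /* nesting level where it was declared  */
    };

    static struct symbol table[MAX_SYMS];
    static int n_syms = 0;

    /* return the index of the entry for name, inserting it if not present */
    int lookup_or_insert(const char *name, const char *type, int scope) {
        for (int i = 0; i < n_syms; i++)
            if (strcmp(table[i].name, name) == 0)
                return i;
        if (n_syms >= MAX_SYMS) return -1;          /* table full */
        strncpy(table[n_syms].name, name, sizeof table[n_syms].name - 1);
        strncpy(table[n_syms].type, type, sizeof table[n_syms].type - 1);
        table[n_syms].scope = scope;
        return n_syms++;
    }

    int main(void) {
        int i = lookup_or_insert("rate", "real", 0);
        printf("rate stored at entry %d with type %s\n", i, table[i].type);
        return 0;
    }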
Each phase can encounter errors. After the detection of an error, a phase must somehow deal
with that error, so that the compilation can proceed, allowing further errors in the source program
to be detected.
Lexical Analysis Phase: If the characters remaining in the input do not form any token of the
language, then the lexical analysis phase detects the error.
Syntax Analysis Phase: A large fraction of errors is handled by the syntax and semantic analysis
phases. If the token stream violates the structure rules (syntax) of the language, then this phase
detects the error.
Semantic Analysis Phase: If a construct has the right syntactic structure but no meaning for the
operation involved, then this phase detects the error, e.g. adding two identifiers, one of which is
the name of an array and the other the name of a procedure.
There are relatively few errors which can be detected during lexical analysis.
i. Strange characters
Some programming languages do not use all possible characters, so any strange ones
which appear can be reported. However almost any character is allowed within a quoted
string.
ii. Missing quotes
Many programming languages do not allow quoted strings to extend over more than one
line; in such cases a missing closing quote can be detected.
If quoted strings can extend over multiple lines, then a missing quote can cause quite a lot
of text to be 'swallowed up' before an error is detected.
For example:
fi ( a == 1) ....
Here fi is a valid identifier. But the open parenthesis following the identifier may indicate that fi is
a misspelling of the keyword if or an undeclared function identifier.
During syntax analysis, the compiler is usually trying to decide what to do next on the basis of
expecting one of a small number of tokens. Hence in most cases it is possible to automatically
generate a useful error message just by listing the tokens which would be acceptable at that
point.
Source: A + * B
Error: | Found '*', expected one of: Identifier, Constant, '('
More specific hand-tailored error messages may be needed in cases of bracket mismatch.
A parser should be able to detect and report any error in the program. It is expected that when an
error is encountered, the parser should be able to handle it and carry on parsing the rest of the
input. The parser is mostly expected to check for errors, but errors may be encountered at
various stages of the compilation process. A program may have the following kinds of errors at
various stages:
Lexical: name of some identifier typed incorrectly
Syntactic: missing semicolon or unbalanced parentheses
Semantic: incompatible value assignment
Logical: unreachable code, infinite loop
There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.
Panic mode
When a parser encounters an error anywhere in the statement, it ignores the rest of the statement
by not processing input from the erroneous token up to a delimiter, such as a semicolon. This is the
easiest way of error recovery, and it also prevents the parser from developing infinite loops.
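A small C sketch of the idea, assuming a hypothetical token stream and token codes; on an error the parser simply discards tokens up to the next semicolon.

    /* Illustrative panic-mode recovery: skip tokens until a delimiter is seen.
       The token codes and the canned token stream are stand-ins, not a real API. */
    #include <stdio.h>

    enum tok { TOK_ID, TOK_PLUS, TOK_STAR, TOK_SEMI, TOK_EOF };

    static enum tok stream[] = { TOK_PLUS, TOK_STAR, TOK_ID, TOK_SEMI, TOK_ID, TOK_EOF };
    static int pos = 0;

    static enum tok next_token(void) { return stream[pos++]; }

    /* discard input up to (and including) the next ';' or the end of input */
    static void panic_mode_recover(void) {
        enum tok t;
        do {
            t = next_token();
        } while (t != TOK_SEMI && t != TOK_EOF);
    }

    int main(void) {
        panic_mode_recover();
        printf("resumed parsing at token index %d\n", pos);
        return 0;
    }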
Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the inputs
of the statement allow the parser to parse ahead, for example by inserting a missing semicolon or
replacing a comma with a semicolon. Parser designers have to be careful here because one wrong
correction may lead to an infinite loop.
Error productions
Some common errors that may occur in the code are known to the compiler designers. In
addition, the designers can create an augmented grammar to be used, with productions that generate
erroneous constructs when these errors are encountered.
Global correction
The parser considers the program as a whole, tries to figure out what the program is intended to do,
and finds the closest error-free match for it, making as few corrections as possible. Because of its
cost and complexity, this strategy is rarely implemented in practice.
Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way).
If possible, the compiler should make the appropriate correction(s). Semantic errors are much
harder and sometimes impossible for a computer to detect.
Front End:
The front end includes the phases of the compiler that depend on the source language and are
largely independent of the target machine. The phases of the front end are:
Lexical Analysis
Syntactic Analysis
Creation of the symbol table
Semantic Analysis
Generation of the intermediate code
A part of code optimization
Error Handling that goes along with the above said phases
Back End:
The back end includes the phases of the compiler that depend on the target machine, and these
phases do not depend on the source language, but depend on the intermediate language. The
phases of back end are:
Code Optimization
Code Generation
Necessary Symbol table and error handling operations
Based on this grouping of phases, two types of compiler design are possible.
Compiler Construction Tools:
In order to automate the development of compilers, some general tools have been created. These
tools use specialized languages for specifying and implementing the components. The most
successful tools hide the details of the generation algorithm and produce components
which can be easily integrated into the remainder of the compiler. These tools are often referred to
as compiler-compilers, compiler-generators, or translator-writing systems.
Syntax-directed translation engines: produce collections of routines for walking a parse tree
and generating intermediate code. Other commonly used tools include scanner generators, parser
generators, automatic code generators and data-flow analysis engines.
Scope Rules: The scope of a declaration of x is the context in which uses of x refer to this
declaration. A language uses static scope or lexical scope if it is possible to determine the scope
of a declaration by looking only at the program, so it can be determined by the compiler. Otherwise,
the language uses dynamic scope.
Example in Java:
public static int x;
The compiler can determine the location of integer x in memory.
The static-scope policy is as follows:
1. A C program consists of a sequence of top-level declarations of variables and functions.
2. Functions may have variable declarations within them, where variables include local
variables and parameters. The scope of each such declaration is restricted to the function
in which it appears.
3. The scope of a top-level declaration of a name x consists of the entire program that
follows, with the exception of those statements that lie within a function that also has a
declaration of x.
Block Structures:
Languages that allow blocks to be nested are said to have block structure. A name x in a nested
block B is in the scope of a declaration D of x in an enclosing block if there is no other
declaration of x in an intervening block.
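For instance, in C (a block-structured language) an inner declaration of x shadows the outer one:

    /* Block structure: each use of x refers to the declaration in the
       nearest enclosing block. */
    #include <stdio.h>

    int main(void) {
        int x = 1;                 /* declaration D1 of x            */
        {
            int x = 2;             /* declaration D2 shadows D1      */
            printf("%d\n", x);     /* prints 2 (refers to D2)        */
        }
        printf("%d\n", x);         /* prints 1 (refers to D1 again)  */
        return 0;
    }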
5. Dynamic Scope:
The language uses dynamic scope if it is not possible to determine the scope of a
declaration during compile time.
Example in Java:
public int x;
With dynamic scope, as the program runs, the same use of x could refer to any of several
different declarations of x.
6. Parameter Passing Mechanism: Parameters are passed from a calling procedure to the callee
either by value (call by value) or by reference (call by reference). Depending on the mechanism
used, the effect of the call on the actual parameters associated with the formal parameters will differ.
Call-By-Value:
In call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a
variable), and the value is placed in the location belonging to the corresponding formal parameter
of the callee. Changes to the formal parameter therefore do not affect the actual parameter.
Call-By-Reference:
In call-by-reference, the address of the actual parameter is passed to the callee as the value of the
corresponding formal parameter. Uses of the formal parameter in the code of the callee are
implemented by following this pointer to the location indicated by the caller. Changes to the
formal parameter thus appear as changes to the actual parameter.
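As a concrete illustration in C: C itself passes parameters by value, and call-by-reference can be simulated by explicitly passing the address of the actual parameter.

    /* Call-by-value vs. simulated call-by-reference in C. */
    #include <stdio.h>

    void by_value(int x)      { x = 10; }   /* modifies only the local copy   */
    void by_reference(int *x) { *x = 10; }  /* modifies the caller's variable */

    int main(void) {
        int a = 1, b = 1;
        by_value(a);       /* a is still 1  */
        by_reference(&b);  /* b becomes 10  */
        printf("a = %d, b = %d\n", a, b);
        return 0;
    }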
Call-By-Name:
A third mechanism — call-by-name — was used in the early programming language Algol 60. It
requires that the callee execute as if the actual parameter were substituted literally for the formal
parameter in the code of the callee, as if the formal parameter were a macro standing for the
actual parameter (with renaming of local names in the called procedure, to keep them distinct).
When large objects are passed by value, the values passed are really references to the objects
themselves, resulting in an effective call-by-reference.
7. Aliasing: When parameters are (effectively) passed by reference, two formal parameters can
refer to the same object; this is called aliasing. This possibility allows a change in one variable to
change another.
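A small C illustration of aliasing, where both pointer parameters are passed the address of the same variable:

    /* Aliasing: p and q name the same object, so an update through p is
       visible through q as well. */
    #include <stdio.h>

    void add_twice(int *p, int *q) {
        *p += 1;
        *q += 1;      /* if p and q alias, the same location is updated twice */
    }

    int main(void) {
        int x = 0;
        add_twice(&x, &x);   /* p and q are aliases for x */
        printf("%d\n", x);   /* prints 2 */
        return 0;
    }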
Topics to be Covered
Lexical Analysis
Its main task is to read the input characters and produce as output a sequence of tokens that
the parser uses for syntax analysis.
(Diagram: the parser issues a "get next token" request to the lexical analyzer, which returns the next token; both the lexical analyzer and the parser consult the symbol table.)
The above diagram illustrates that the lexical analyzer is a subroutine or a coroutine of the
parser. Upon receiving a "get next token" command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
Since the lexical analyzer is the part of the compiler that reads the source text, it may also
perform certain secondary tasks at the user interface, such as stripping out comments and white
space and correlating compiler error messages with the source program. Sometimes lexical
analyzers are divided into a cascade of two phases:
Scanning – the scanner is responsible for doing simple tasks (for example, a Fortran
compiler may use a scanner to eliminate blanks from the input).
Lexical analysis – the lexical analyzer proper does the more complex operations.
There are several reasons for separating the analysis phase of compiling into lexical analysis
and parsing:
1. To make the design simpler. The separation of lexical analysis from syntax analysis
allows the other phases to be simpler. For example, a parser that had to deal with
comments and white space would be more complex than one for which they have already
been removed in a previous phase.
2. To improve the efficiency of the compiler. A separate lexical analyzer allows us to
construct a specialized and more efficient processor for the task. A large amount of time is
spent in reading the source program and partitioning it into tokens; specialized buffering
techniques speed up this work.
3. To enhance the compiler portability. Input alphabets and device specific anomalies
can be restricted to the lexical analyzer.
Token: A token is an atomic unit that represents a logically cohesive sequence of characters, such
as an identifier, a keyword, an operator, a constant, a literal string, or a punctuation symbol such as
a parenthesis, comma or semicolon.
+, -   operators
if     keyword
Pattern: A pattern is a rule used to describe the lexemes of a token; it is the set of strings in the
input for which the same token is produced as output.
Token      Sample Lexemes          Pattern
if         if                      if
relation   <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
When more than one pattern matches a lexeme, the lexical analyzer must provide additional
information about the particular lexeme that matched to the subsequent phases of the
compiler.
For example, the pattern relation matches the operators like <, <=, >, >=, =, < >. It is
necessary to identify operator which is matched with the pattern.
The lexical analyzer collects other information about tokens as their attributes. In practice, a
token usually has only a single attribute: a pointer to the symbol-table entry in which the
information about the token is kept.
For example, the tokens and associated attribute values for the Fortran statement
X = Y * Z ** 4
are:
<id, pointer to symbol-table entry for X>
<assign_op, >
<id, pointer to symbol-table entry for Y>
<mult_op, >
<id, pointer to symbol-table entry for Z>
<exp_op, >
<num, integer value 4>
Some tokens, such as <assign_op, >, need no attribute value. For others, such as num, the compiler
stores the character string that forms the value in a symbol table.
Lexical Errors:
Few errors are discernible at the lexical level alone, because a lexical analyzer has a very
localized view of the source program.
For example:
fi ( a == 1) ....
Here fi is a valid identifier. The open parenthesis following the identifier may indicate that fi is a
misspelling of the keyword if or an undeclared function identifier; the lexical analyzer alone
cannot decide which.
INPUT BUFFERING:
Input buffering is a method used to read the source program and to identify tokens
efficiently. There are three general approaches to the implementation of a lexical analyzer:
use a lexical-analyzer generator such as Lex, write the lexical analyzer in a conventional
systems-programming language, or write it in assembly language.
Since the lexical analyzer is the only phase of the compiler that reads the source program
character by character, it is possible to spend a considerable amount of time in the lexical
analysis phase. Thus the speed of lexical analysis is a concern in compiler design.
Buffer Pairs:
The lexical analyzer may need to look ahead many characters beyond the lexeme in order to find a
pattern. The lexical analyzer can use a function such as ungetc() to push look-ahead characters back
into the input stream. In order to reduce the amount of overhead required to process an input
character, specialized buffering techniques have been developed.
A buffer is divided into two N-character halves, where N is the number of characters in one disk
block, for example 1024 or 4096.
(Diagram: an input buffer divided into two halves, holding the source characters followed by an eof marker; the lexeme_beginning and forward pointers are shown.)
1. Read N input characters into each half of the buffer using one system read command
instead of reading each input character separately.
2. If fewer than N characters remain in the input, then an eof marker is read into the buffer
after the input characters.
3. Two pointers to the input buffer are maintained. Initially both pointers point to the
first character of the next lexeme to be found.
a. The lexeme_beginning pointer marks the start of the lexeme.
b. The forward pointer scans ahead and is finally set to the character at the lexeme's right end.
4. Once the lexeme is identified, both pointers are set to the character immediately past
the lexeme.
If the forward pointer reaches the halfway mark, the right half is filled with N new input
characters. If the forward pointer is about to move past the right end of the buffer, the left
half is filled with N new characters and the forward pointer wraps around to the beginning of
the buffer. With this scheme, each advance of the forward pointer requires two tests, which is
expensive.
Sentinels:
In the scheme described above, each time the forward pointer is moved a check must be made
that we have not moved off one half of the buffer; if we have, the other half must be reloaded.
Thus two tests are needed for each character read, since there is only one real eof marker, at the
end of the input.
A sentinel is a special character which is not a part of the source program and is used to represent
the end of a buffer half and of the file (eof).
Instead of testing the forward pointer with two tests each time, we extend each buffer half to hold a
sentinel character at the end and reduce the number of tests in the common case to one.
(Diagram: the same buffer pair with a sentinel eof at the end of each half; the lexeme_beginning and forward pointers are shown.)
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
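The following C sketch shows one way this sentinel scheme could be implemented; the buffer size, the EOF_CH sentinel value and the helper names (reload, next_char) are illustrative assumptions, and it presumes the sentinel byte never occurs in the source text.

    /* Sketch of the buffer-pair scheme with sentinels (illustrative only).
       Each half ends in an EOF_CH sentinel, so the common case needs a
       single test per character. */
    #include <stdio.h>

    #define N 4096            /* characters per buffer half */
    #define EOF_CH '\0'       /* sentinel character          */

    static char buf[2 * N + 2];    /* half 1, sentinel, half 2, sentinel */
    static char *forward;
    static FILE *src;

    static void reload(char *half) {          /* fill one half from the file */
        size_t got = fread(half, 1, N, src);
        half[got] = EOF_CH;                    /* sentinel right after the data */
    }

    static void init_buffer(FILE *f) {
        src = f;
        reload(buf);                           /* fill the first half */
        forward = buf;
    }

    static int next_char(void) {               /* advance forward, return next char */
        char c = *forward++;
        if (c != EOF_CH)
            return (unsigned char)c;           /* usual case: one test only */
        if (forward == buf + N + 1) {          /* hit sentinel of first half */
            reload(buf + N + 1);
            forward = buf + N + 1;
            return next_char();
        }
        if (forward == buf + 2 * N + 2) {      /* hit sentinel of second half */
            reload(buf);
            forward = buf;
            return next_char();
        }
        return EOF;                            /* eof within a half: end of input */
    }

    int main(void) {
        init_buffer(stdin);
        int count = 0;
        while (next_char() != EOF)
            count++;
        printf("read %d characters\n", count);
        return 0;
    }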
SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying patterns. Each pattern matches
a set of strings, so regular expressions serve as names for sets of strings.
Alphabet: An alphabet or character class denotes any finite set of symbols. Examples:
letters, the ASCII characters, the EBCDIC characters.
String: A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
For example, 101011 is a string over {0, 1}; Ɛ denotes the empty string.
Length of a String: The length of the string 101 is denoted |101| = 3, i.e. the number
of occurrences of symbols in the string.
Language: A language denotes any set of strings over some fixed alphabet Σ.
Operations on Languages:
There are several important operations that can be applied to languages. For lexical analysis
the following operations are applied:
OPERATION                               DEFINITION
union of L and M, written L U M         L U M = { s | s is in L or s is in M }
concatenation of L and M, written LM    LM = { st | s is in L and t is in M }
Kleene closure of L, written L*         L* = union of Li for i = 0 to ∞  ("zero or more concatenations of" L)
positive closure of L, written L+       L+ = union of Li for i = 1 to ∞  ("one or more concatenations of" L)

Example: Let L = {A, B, . . . , Z, a, b, . . . , z} and D = {0, 1, . . . , 9}.
By applying the operators defined above to the languages L and D we get the following new
languages:
1. L U D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including the empty string Ɛ.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Regular Expressions:
A regular expression is built out of simpler regular expressions using a set of defining rules.
Each regular expression r denotes a language L(r).
Basis:
i) Ɛ is a regular expression that denotes { Ɛ }, the set containing only the empty string.
ii) If a is a symbol in Σ, then a is a regular expression that denotes { a }.
Induction:
iii) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. ( r ) | ( s ) is a regular expression denoting L( r ) U L( s ).
b. ( r ) ( s ) is a regular expression denoting L( r ) L( s ).
c. ( r )* is a regular expression denoting ( L( r ) )*.
d. ( r ) is a regular expression denoting L( r ).
1. the unary operator * has the highest precedence and is left associative.
2. concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.
Unnecessary parentheses can be avoided in the regular expression if the above precedence is
adopted. For example the regular expression: (a) | ((b)* (c)) is equivalent to a | b*c.
Example:
Let Σ = {a, b}.
The regular expression a | b denotes the set { a, b }; ( a | b ) ( a | b ) denotes { aa, ab, ba, bb };
a* denotes the set of all strings of zero or more a's; and ( a | b )* denotes the set of all strings
of a's and b's.
If two regular expressions r and s denote the same language, then we say r and s are
equivalent and write r = s. For example, ( a | b ) = (b | a ).
There are number of algebraic laws obeyed by regular expressions and these laws can be used
to manipulate regular expressions into equivalent forms.
Let r, s and t be regular expressions. The following algebraic laws hold for regular expressions:
r | s = s | r                                  | is commutative
r | ( s | t ) = ( r | s ) | t                  | is associative
( r s ) t = r ( s t )                          concatenation is associative
r ( s | t ) = r s | r t,  ( s | t ) r = s r | t r    concatenation distributes over |
Ɛ r = r,  r Ɛ = r                              Ɛ is the identity element for concatenation
r* = ( r | Ɛ )*                                relation between * and Ɛ
r** = r*                                       * is idempotent
Regular Definitions:
The regular expressions can be given names, and defining regular expressions using these
names is called a regular definition. If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
d1 -> r1
d2 -> r2
.......
dn -> rn
where each di is a distinct name, and each ri is a regular expression over the symbols in
Σ U { d1, d2, . . . , di-1 }, i.e., the basic symbols and the previously defined names.
Example: The following regular definition describes unsigned numbers such as 5280, 39.37 or 6.336E4:
digit -> 0 | 1 | . . . | 9
digits -> digit digit*
optional_fraction -> . digits | Ɛ
optional_exponent -> ( E ( + | - | Ɛ ) digits ) | Ɛ
num -> digits optional_fraction optional_exponent
Notational Shorthands:
1. One or more instances( + ): The unary postfix operator + means “one or more
instances of”. Example – (r)+ - Set of all strings of one or more occurrences of r.
2. Zero or One Instance (?): The unary postfix operator ? means “ zero or one instance
of”. Example – (r)? – One or zero occurrence of r.
The regular definition for num can be written using the unary + and ? operators
as follows:
digit -> 0 | 1 | . . . | 9
digits -> digit+
optional_fraction -> ( . digits ) ?
optional_exponent -> ( E ( + | - ) ? digits ) ?
num -> digits optional_fraction optional_exponent
3. Character Classes: The notation [abc], where a, b and c are alphabet symbols, denotes the
regular expression a | b | c. An abbreviated character class such as [a-z] denotes
the regular expression a | b | . . . | z.
Using character classes, identifiers can be described as the strings generated by the regular
expression: [A-Za-z][A-Za-z0-9]*
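For quick experimentation, this identifier pattern can be tried out with the POSIX regex library in C; this is only an illustration, since a real lexer would use a generated or hand-written automaton.

    /* Checking strings against the identifier pattern with POSIX regex. */
    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        regex_t re;
        /* ^ and $ anchor the match to the whole string */
        regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB);

        const char *samples[] = { "rate", "x9", "9x", "initial_" };
        for (int i = 0; i < 4; i++)
            printf("%-9s %s\n", samples[i],
                   regexec(&re, samples[i], 0, NULL, 0) == 0 ? "identifier" : "no match");

        regfree(&re);
        return 0;
    }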
RECOGNITION OF TOKENS:
Example: Consider the following grammar fragment for conditional statements and expressions:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | Ɛ
expr -> term relop term
      | term
term -> id
      | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions:
if -> if
then -> then
else -> else
relop -> < | <= | = | <> | > | >=
id -> letter ( letter | digit )*
num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
letter -> A | B | . . . | Z | a | b | . . . | z
digit -> 0 | 1 | . . . | 9
In addition, lexemes for the token ws (white space) are recognized but not returned to the parser:
delim -> blank | tab | newline
ws -> delim+
The goal of the lexical analyzer is to isolate the lexeme for the next token in the input buffer
and produce as output a pair consisting of the appropriate token and attribute value using the
table given below:
Regular Expression    Token    Attribute-Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to symbol-table entry
num                   num      pointer to symbol-table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Transition Diagrams:
Positions in a transition diagram are drawn as circles and are called states. The states are
connected by arrows, called edges. Edges leaving state s have labels indicating the input
characters that can next appear after the transition diagram has reached state s. The label
other refers to any character that is not indicated by any of the other edges leaving s.
One state is labeled as the start state; it is the initial state of the transition diagram where control
resides when we begin to recognize a token. Certain states may have actions that are executed
when the flow of control reaches that state. On entering a state we read the next input
character. If there is an edge from the current state whose label matches this input character,
then we go to the state pointed to by the edge. Otherwise, we indicate failure.
The symbol * is used to indicate states on which the input retraction must take place.
There may be several transition diagrams, each specifying a group of tokens. If failure occurs
in one transition diagram, then the forward pointer is retracted to where it was in the start
state of this diagram, and the next transition diagram is activated. Since the lexeme_beginning
and forward pointers mark the same position in the start state of the diagram, the forward
pointer is retracted to the position marked by the lexeme_beginning pointer. If failure occurs
in all transition diagrams, then a lexical error has been detected and an error-recovery routine
is invoked.
(Transition diagrams, figures omitted:
- for the relational operators: from start state 0 an edge labeled > leads to state 6; from state 6, = leads to accepting state 7 (return relop with attribute GE), and any other character leads to accepting state 8* (return relop GT, with retraction of the forward pointer);
- for identifiers: a letter edge leads from the start state to a state that loops on letter or digit, and an 'other' edge leads to an accepting state* that executes return(gettoken(), install_id());
- for unsigned numbers: digit loops handle the integer part, an optional fraction and an optional exponent E with an optional + or -; the accepting states execute return(gettoken(), install_num());
- for white space: a delim edge loops until a non-delimiter is seen.)
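A transition diagram can be hand-coded as a loop over a state variable with a switch statement. The sketch below implements the diagram for > and >= described above; the token names and the scan_gt() helper are illustrative assumptions.

    /* Hand-coded transition diagram for > and >= (states 0, 6, 7, 8*). */
    #include <stdio.h>

    enum relop { GT, GE, NOT_RELOP };

    /* scan s; on success *len holds the number of characters consumed */
    enum relop scan_gt(const char *s, int *len) {
        int state = 0;
        for (;;) {
            char c = s[*len];
            switch (state) {
            case 0:
                if (c == '>') { state = 6; (*len)++; }   /* edge labeled '>'    */
                else return NOT_RELOP;
                break;
            case 6:
                if (c == '=') { (*len)++; return GE; }   /* state 7: accept >=  */
                else return GT;       /* state 8*: accept >, do not consume c   */
            }
        }
    }

    int main(void) {
        int len = 0;
        enum relop t = scan_gt(">= x", &len);
        printf("token %s, lexeme length %d\n", t == GE ? "relop GE" : "relop GT", len);
        return 0;
    }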
Finite Automata
A recognizer for a language is a program that takes as input a string x and answers yes if x is
a sentence of the language and no otherwise. A finite automaton is specified by:
Q – a finite set of states
Σ – the input alphabet
δ : Q x Σ → Q – the transition function
q0 – the start state
F – the set of final (accepting) states
In a non-deterministic finite automaton (NFA), more than one transition may occur for an input
symbol from a state, and Ɛ-transitions are allowed. In a deterministic finite automaton (DFA),
for each state and for each input symbol, exactly one transition occurs from that state.
Construction of an NFA from a regular expression (rules for the operators):
• Union: r = r1 + r2 – a new start state has Ɛ-transitions to the NFAs for r1 and r2, and their accepting states have Ɛ-transitions to a new accepting state.
• Concatenation: r = r1 r2 – the accepting state of the NFA for r1 is merged with the start state of the NFA for r2.
• Closure: r = r1* – Ɛ-transitions from a new start state and back around the NFA for r1 allow zero or more passes through it.
Ɛ-closure
Ɛ-closure(s) is the set of states that are reachable from the state s on taking only the empty
string as input. It describes the paths that consume the empty string (Ɛ) to reach other states of
the NFA.
(Example: an NFA with Ɛ-transitions illustrating the Ɛ-closure of each of its states; figure omitted.)
Sub-set Construction
Steps
1. Convert the regular expression into an NFA using the above rules for the operators (union,
concatenation and closure) and their precedence.
2. The start state of the DFA is the Ɛ-closure of the start state of the NFA.
3. For a DFA state and an input symbol, find the set of NFA states reachable on that symbol (move).
4. Take the Ɛ-closure of that set; the result is a DFA state, possibly a new one.
5. Record the transition from the current DFA state on that symbol to the resulting state.
6. If a new state is found, repeat step 4 and step 5 until no more new states are found.
7. Mark as final every DFA state that contains a final state of the NFA.
8. Draw the transition diagram with the start state as the Ɛ-closure of the start state of the NFA and
the final states as those that contain a final state of the NFA.
Direct Method
The direct method is used to convert a given regular expression directly into a DFA, without
building an NFA first.
Computation of followpos
A position in a regular expression can follow another in the following ways:
If n is a cat node with left child c1 and right child c2, then for every position i in
lastpos(c1), all positions in firstpos(c2) are in followpos(i).
For cat node, for each position i in lastpos of its left child, the firstpos of its
right child will be in followpos(i).
If n is a star node and i is a position in lastpos(n), then all positions in firstpos(n) are
in followpos(i).
For a star node, the firstpos of that node is in followpos of all positions in lastpos of
that node.
Example: the regular expression (a+b)*abb, augmented with an end marker # (positions 1 to 6).
A = firstpos(n0) = {1, 2, 3}
Dtran[A, a] = followpos(1) U followpos(3) = {1, 2, 3, 4} = B
Dtran[A, b] = followpos(2) = {1, 2, 3} = A
Dtran[B, a] = followpos(1) U followpos(3) = B
Dtran[B, b] = followpos(2) U followpos(4) = {1, 2, 3, 5} = C
….
Equivalent automata: the DFA states correspond to the position sets
{A, C} = {1, 2, 3}
{B} = {1, 2, 3, 4}
{D} = {1, 2, 3, 5}
{E} = {1, 2, 3, 6}
States A and C are equivalent, so an equivalent minimum-state DFA exists.
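Once the DFA for (a+b)*abb is known, it can be simulated with a simple transition table. The sketch below uses the minimum-state DFA with four states (0 = start, 3 = accepting); the table layout and function name are illustrative assumptions.

    /* Table-driven simulation of the minimum-state DFA for (a+b)*abb. */
    #include <stdio.h>

    static const int dtran[4][2] = {   /* columns: input 'a', input 'b' */
        /* state 0 */ {1, 0},
        /* state 1 */ {1, 2},
        /* state 2 */ {1, 3},
        /* state 3 */ {1, 0},
    };

    int accepts(const char *s) {
        int state = 0;
        for (; *s; s++) {
            if (*s != 'a' && *s != 'b') return 0;  /* symbol not in the alphabet */
            state = dtran[state][*s == 'b'];       /* column 0 = 'a', 1 = 'b'    */
        }
        return state == 3;                         /* 3 is the accepting state   */
    }

    int main(void) {
        printf("%d %d %d\n", accepts("abb"), accepts("aabb"), accepts("ab"));
        return 0;   /* prints 1 1 0 */
    }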
Lex is a computer program that generates lexical analyzers. Lex is commonly used with
the yacc parser generator.
Creating a lexical analyzer
First, a specification of a lexical analyzer is prepared by creating a program lex.l in
the Lex language. Then, lex.l is run through the Lex compiler to produce a C program
lex.yy.c.
Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which
is the lexical analyzer that transforms an input stream into a sequence of tokens.
(Fragment of an example Lex program: the declarations section defines counters int v = 0, c = 0; and main() calls yylex(); the rules section is not shown.)
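For completeness, a full Lex specification of the kind the fragment above suggests (counting vowels and consonants) might look as follows; the rule patterns are an assumption for illustration, since only the declarations and main() survive in the fragment.

    %{
    /* declarations section: copied verbatim into lex.yy.c */
    int v = 0, c = 0;         /* counters seen in the fragment above */
    %}
    %%
    [aeiouAEIOU]   { v++; }   /* assumed rule: count vowels          */
    [a-zA-Z]       { c++; }   /* assumed rule: count other letters   */
    .|\n           { ; }      /* ignore everything else              */
    %%
    int main(void)
    {
        yylex();
        printf("Vowels = %d, Consonants = %d\n", v, c);
        return 0;
    }
    int yywrap(void) { return 1; }

It would be processed as described above: the Lex compiler turns the specification into lex.yy.c, and the C compiler turns lex.yy.c into the executable a.out.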