
UNIT 1 – INTRODUCTION TO COMPILERS

Topics to be Covered

Translators – Compilation and Interpretation – Language Processors – The Phases of a Compiler – Errors Encountered in Different Phases – The Grouping of Phases – Compiler Construction Tools – Programming Language Basics.

1.1 Translators:

A translator is a computer program that translates a program written in a given programming language into a functionally equivalent program in a different computer language, without losing the functional or logical structure of the original code (the "essence" of the program).

Types of Computer Language Translators:

The widely used translators that translate the code of a computer program into a machine code
are:

1. Assemblers
2. Interpreters
3. Compilers

Assembler:
An Assembler converts an assembly program into machine code.

1.2 Compilation and Interpretation:


1.2.1 Compilation:
Compilation is the conceptual process of translating source code into a CPU-executable binary
target code.



Compiler:

A compiler is a program that reads a program written in one language – the source language –
and translates it into an equivalent program in another language – the target language.

source program → Compiler → target program
(error messages are reported to the user)

As an important part of this translation process, the compiler reports to its user the presence of
errors in the source program.

If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.

input → target program → output

Advantages of Compiler:
1. Fast in execution
2. The object/executable code produced by a compiler can be distributed or executed without
having to have the compiler present.
3. The object program can be used whenever required, without the need for recompilation.

Disadvantages of Compiler:
1. Debugging a program is much harder; a compiler is not as good at pinpointing errors.
2. When an error is found, the whole program has to be re-compiled.



History of Compiler:
 Until 1952, most programs were written in assembly language.
 In 1952, Grace Hopper wrote the first compiler, for the A-0 programming language.
 Between 1957 and 1958, John Backus wrote the first Fortran compiler; optimization of the code was an integral component of the compiler.

Applications of Compiler Technology:


 Implementation of High Level Programming Languages
 Optimizations for Computer Architectures (both parallelism and memory hierarchies
improve the potential performance of a machine, but they must be harnessed effectively
by the compiler to deliver real performance of an application)
 Design of a new computer architecture
 Program Translations ( Program Translation techniques are: Binary Translation,
Hardware Synthesis, Database Query Interpreters, Compiled Simulation)
 Software Productivity Tools (Ex. Structure editors, type checking, bound checking,
memory management tools, etc)

1.2.2 Interpretation:
Interpretation is the conceptual process of translating high-level source code into executable code.
Interpreter:

An interpreter is also a program that translates high-level source code into executable code. However, the difference between a compiler and an interpreter is that an interpreter translates one line at a time and then executes it: no object code is produced, so the program has to be interpreted each time it is run. If the program executes a section of code 1000 times, that section is translated into machine code 1000 times, since each line is interpreted and then executed.



Advantages of an Interpreter:
1. Good at locating errors in programs
2. Debugging is easier since the interpreter stops when it encounters an error.
3. If an error is detected, there is no need to retranslate the whole program

Disadvantages of an Interpreter:
1. Rather slow
2. No object code is produced, so a translation has to be done every time the program is run.
3. For the program to run, the Interpreter must be present

Difference between Compiler and Interpreter:

1. Compiler: works on the complete program at once; it takes the entire program as input.
   Interpreter: works line by line; it takes one statement at a time as input.

2. Compiler: generates intermediate code, called the object code or machine code.
   Interpreter: does not generate intermediate object code or machine code.

3. Compiler: executes conditional control statements (like if-else and switch-case) and logical constructs faster than an interpreter.
   Interpreter: executes conditional control statements at a much slower speed.

4. Compiler: compiled programs take more memory because the entire object code has to reside in memory.
   Interpreter: does not generate intermediate object code, so interpreted programs are more memory efficient.

5. Compiler: compile once and run any time; a compiled program does not need to be compiled every time.
   Interpreter: interpreted programs are interpreted line by line every time they are executed.

6. Compiler: errors are reported after the entire program is checked for syntactical and other errors.
   Interpreter: an error is reported as soon as the first error is encountered; the rest of the program is not checked until the existing error is removed.

7. Compiler: a compiled language is more difficult to debug.
   Interpreter: debugging is easy because the interpreter stops and reports errors as it encounters them.

8. Compiler: does not allow a program to run until it is completely error-free.
   Interpreter: runs the program from the first line and stops execution only if it encounters an error.

9. Compiler: compiled languages are more efficient but difficult to debug.
   Interpreter: interpreted languages are less efficient but easier to debug.

10. Compiler examples: C, C++, COBOL.
    Interpreter examples: BASIC, Visual Basic, Python, Ruby, Perl, MATLAB, Lisp.

Hybrid Compiler:

A hybrid compiler is a compiler which translates human-readable source code into an intermediate byte code for later interpretation, so these languages have features of both a compiler and an interpreter. Compilers of this type are commonly known as just-in-time (JIT) compilers.

Example of a Hybrid Compiler:

Java is a good example of this type of compiler. Java language processors combine compilation and interpretation: a Java source program is first compiled into an intermediate form called byte code, and the byte code is then interpreted by a virtual machine.



A benefit of this arrangement is that byte code compiled on one machine can be interpreted on another machine, perhaps across a network. In order to process inputs to outputs faster, some Java compilers, called just-in-time compilers, translate the byte code into machine language immediately before running the intermediate program.

source program → Translator → intermediate program
intermediate program + input → Virtual Machine → output
Compilers are not only used to translate a source language into assembly or machine language, but are also used in other places.

Example:

1. Text Formatters: A text formatter takes as input a stream of characters, most of which is text, some of which includes commands to indicate paragraphs, figures, or mathematical structures like subscripts and superscripts.
2. Silicon compilers: A silicon compiler has a source language that is similar
or identical to a conventional programming language. The variable of the language
represent logical signals (0 or 1) or groups of signals in a switching circuit. The output is
a circuit design in an appropriate language.
3. Query Interpreters: A query interpreter translates a predicate containing
relational and Boolean operators into commands to search a database for records
satisfying that predicate.



1.3 Language Processors:

A language processor is a program that processes programs written in a programming language (the source language). A part of a language processor is a language translator, which translates the program from the source language into machine code, assembly language or another language.

An integrated software developmental environment includes many different kinds of language


processors. They are:
1. Pre Processor
2. Compiler / Interpreter
3. Assembler
4. Linker
5. Loader

1. Pre Processor
The pre-processor is system software which processes the source program before it is fed into the compiler. It may perform the following functions:

1. Macro Processing: A preprocessor may allow a user to define macros that


are shorthand for longer constructs.
2. File Inclusion: A preprocessor may include header files into the program
text. For example, the C pre-processor causes the contents of the file <global.h> to
replace the statement #include <global.h> when it processes a file containing this
statement.
3. Rational Preprocessors: These processors provide the user with built-in
macros for constructs like while-statements or if-statements.
4. Language Extensions: These processors provide features similar to built-in macros. For
example, the language Equel is a database query language embedded in C.

2. Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source. A compiler reads the whole source code at once, creates tokens, checks semantics and generates intermediate code; the translation may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to intermediate code, executes it, and then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.

3. Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.

4. Linker
A linker is a computer program that links and merges various object files together to produce an executable file. These files may have been produced by separate assemblers. The major tasks of a linker are to search for and locate referenced modules/routines in a program and to determine the memory locations where these codes will be loaded, making the program instructions have absolute references.

5. Loader
A loader is a part of the operating system that is responsible for loading executable files into memory and executing them. It calculates the size of a program's instructions and data, creates memory space for it, and initializes various registers to initiate execution.

1.4 Phases of Compiler:


A compiler operates in phases, each of which transforms the source program from one
representation to another.



The Analysis – Synthesis Model of Compilation:

There are two parts to compilation:

 Analysis and
 Synthesis

1. Analysis:

The first three phases form the bulk of the analysis portion of a compiler. The analysis part
breaks up the source program into constituent pieces and creates an intermediate representation
of the source program. During analysis, the operations implied by the source program are
determined and recorded in a hierarchical structure called a syntax tree, in which each node
represents an operation and the children of a node represent the arguments of the operation.



Example:

Syntax tree for position := initial + rate * 60

          :=
         /  \
  position    +
            /   \
      initial     *
                /   \
            rate     60

2. Synthesis Part:

The synthesis part constructs the desired target program from the intermediate representation. This part requires the most specialized techniques.

The Analysis Phase:

Lexical Analysis: The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier, a keyword (if, while, etc.), a punctuation character, or a multi-character operator like :=. The character sequence forming a token is called the lexeme for the token.

Certain tokens are augmented by a "lexical value". For example, when an identifier rate is found, the lexical analyzer generates the token id and also enters rate into the symbol table, if it does not already exist. The lexical value associated with this id then points to the symbol-table entry for rate.

Example: position := initial + rate * 60

Tokens:



1. position, initial and rate - identifiers (id)
2. :=, + and * - operators
3. 60 - a number

Thus the lexical analyzer will give the output as:

id1 := id2 + id3 * 60
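
Concretely, a lexical analyzer can emit this stream as (token, attribute) pairs. A minimal C sketch of one possible representation (the token names and struct layout are illustrative, not part of the original text):

#include <stdio.h>

/* Illustrative token kinds for the statement position := initial + rate * 60 */
enum TokenKind { TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_MUL, TOK_NUM };

struct Token {
    enum TokenKind kind;
    int attr;   /* symbol-table index for TOK_ID, numeric value for TOK_NUM */
};

int main(void) {
    /* id1 := id2 + id3 * 60 */
    struct Token stream[] = {
        {TOK_ID, 1}, {TOK_ASSIGN, 0}, {TOK_ID, 2},
        {TOK_PLUS, 0}, {TOK_ID, 3}, {TOK_MUL, 0}, {TOK_NUM, 60}
    };
    for (int i = 0; i < 7; i++)
        printf("<%d, %d>\n", stream[i].kind, stream[i].attr);
    return 0;
}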

Syntax Analysis:

The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse tree or syntax tree. In this phase, token arrangements are checked against the source-code grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
It imposes a hierarchical structure on the token stream in the form of a parse tree or syntax tree. The syntax tree can be represented using a suitable data structure.

Example: position := initial + rate * 60

          :=
         /  \
  position    +
            /   \
      initial     *
                /   \
            rate     60



Data structure of the above tree:

          :=
         /  \
      id1     +
            /   \
         id2      *
                /   \
             id3     60
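
One possible concrete form of this data structure, sketched in C (the Node layout and the mk helper are illustrative assumptions, not a prescribed design):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative syntax-tree node: interior nodes carry an operator label,
   leaves carry "id1", "60", etc., and have NULL children. */
struct Node {
    const char *label;
    struct Node *left, *right;
};

static struct Node *mk(const char *label, struct Node *l, struct Node *r) {
    struct Node *n = malloc(sizeof *n);
    n->label = label; n->left = l; n->right = r;
    return n;
}

int main(void) {
    /* id1 := id2 + id3 * 60 */
    struct Node *t =
        mk(":=", mk("id1", NULL, NULL),
                 mk("+", mk("id2", NULL, NULL),
                         mk("*", mk("id3", NULL, NULL),
                                 mk("60", NULL, NULL))));
    printf("root operator: %s\n", t->label);
    return 0;
}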

Semantic Analysis:

Semantic analysis checks whether the parse tree constructed follows the rules of the language; for example, that values are assigned between compatible data types and that a string is not added to an integer. The semantic analyzer also keeps track of identifiers, their types and expressions, and whether identifiers are declared before use. It produces an annotated syntax tree as output.
This analysis inserts a conversion from integer to real in the above syntax tree.
          :=
         /  \
  position    +
            /   \
      initial     *
                /   \
            rate     inttoreal
                         |
                         60



Synthesis Phase:

Intermediate Code Generation:

After semantic analysis the compiler generates intermediate code of the source code for the target machine. It represents a program for some abstract machine, in between the high-level language and the machine language. The intermediate code should be generated in such a way that it is easy to translate into the target machine code.
Intermediate code has two properties: it is easy to produce and easy to translate into the target program. An intermediate representation can have many forms. One form is three-address code, which is like an assembly language for a machine in which every memory location can act like a register; each three-address instruction has at most three operands.

Example: The output of the semantic analysis can be represented in the following intermediate
form:

temp1 := inttoreal ( 60 )

temp2 := id3 * temp1

temp3 := id2 + temp2

id1 := temp3
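
Three-address instructions are commonly stored as quadruples: an operator, up to two arguments and a result. A small C sketch holding the four instructions above (the struct layout is an assumption for illustration):

#include <stdio.h>

/* Illustrative quadruple form of the three-address code above */
struct Quad {
    const char *op, *arg1, *arg2, *result;
};

int main(void) {
    struct Quad code[] = {
        {"inttoreal", "60",    "",      "temp1"},
        {"*",         "id3",   "temp1", "temp2"},
        {"+",         "id2",   "temp2", "temp3"},
        {":=",        "temp3", "",      "id1"},
    };
    for (int i = 0; i < 4; i++)   /* print each quadruple */
        printf("(%s, %s, %s, %s)\n", code[i].op, code[i].arg1,
               code[i].arg2, code[i].result);
    return 0;
}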

Code Optimization:

The next phase performs code optimization on the intermediate code. Optimization can be thought of as something that removes unnecessary code lines and arranges the sequence of statements to speed up program execution without wasting resources (CPU, memory). In the following example, the integer-to-real conversion is performed once at compile time and the redundant temporaries are eliminated.
Example:

The output of intermediate code can be optimized as:

temp1 := id3 * 60.0

id1 := id2 + temp1



Compilers that do the most code optimization are called "optimizing compilers".

Code Generation:

This is the final phase of the compiler. It generates the target code, consisting normally of relocatable machine code or assembly code. Variables are assigned to registers.

Example:

The output of above optimized code can be generated as:

MOVF id3, R2

MULF #60.0, R2

MOVF id2, R1

ADDF R2, R1

MOVF R1, id1

The first and second operands of each instruction specify a source and a destination, respectively. The F in each instruction denotes that the instruction deals with floating-point numbers. The # signifies that 60.0 is to be treated as a constant.

Activities of Compiler:

The symbol-table manager and the error handler are the other two activities in the compiler, which are also referred to as phases. These two activities interact with all six phases of a compiler.

Symbol Table Manager:

The symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.

The attributes of the identifiers may provide the information about the storage allocated for an
identifier, its type, its scope (where in the program it is valid), and in the case of procedure



names the attributes provide information about the number and types of its arguments, the
method of passing each argument (eg. by reference), and the type returned, if any.

The symbol table allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. Attributes of the identifiers cannot be determined during the lexical analysis phase, but they can be determined during the syntax and semantic analysis phases. Later phases, such as the code generator, use the symbol table to retrieve details about the identifiers.
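
A minimal sketch of such a table in C, using linear search (the record layout and lookup interface are illustrative assumptions, not a prescribed design):

#include <stdio.h>
#include <string.h>

/* One record per identifier; real compilers store many more attributes. */
struct Symbol {
    char name[32];
    char type[16];    /* e.g. "real"; filled in by later phases */
};

static struct Symbol table[100];
static int nsyms = 0;

/* Return the index of name, inserting a new record if it is absent. */
int lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0) return i;
    strcpy(table[nsyms].name, name);
    return nsyms++;
}

int main(void) {
    int id = lookup("rate");          /* entered during lexical analysis */
    strcpy(table[id].type, "real");   /* type added during semantic analysis */
    printf("id%d -> %s : %s\n", id + 1, table[id].name, table[id].type);
    return 0;
}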

Error Handler: ( Error Detection and Reporting)

Each phase can encounter errors. After the detection of an error, a phase must somehow deal with that error, so that compilation can proceed, allowing further errors in the source program to be detected.

Lexical Analysis Phase: If the characters remaining in the input do not form any token of the language, then the lexical analysis phase detects the error.

Syntax Analysis Phase: A large fraction of errors is handled by the syntax and semantic analysis phases. If the token stream violates the structure rules (syntax) of the language, then this phase detects the error.

Semantic Analysis Phase: If a construct has the right syntactic structure but no meaning for the operation involved, then this phase detects the error. Example: adding two identifiers, one of which is the name of an array and the other the name of a procedure.



Figure: translation of the statement position := initial + rate * 60 through the phases of a compiler.



1.5 Errors Encountered in Different Phases:
Programs submitted to a compiler often have errors of various kinds, so a good compiler should be able to detect as many errors as possible and recover from them.
Each phase can encounter errors. After the detection of an error, a phase must somehow deal with that error, so that compilation can proceed, allowing further errors in the source program to be detected.

Errors during Lexical Analysis:

If the characters remaining in the input do not form any token of the language, then the lexical analysis phase detects the error.

There are relatively few errors which can be detected during lexical analysis.

i. Strange characters

Some programming languages do not use all possible characters, so any strange ones
which appear can be reported. However almost any character is allowed within a quoted
string.

ii. Long quoted strings (1)

Many programming languages do not allow quoted strings to extend over more than one
line; in such cases a missing quote can be detected.

iii. Long quoted strings (2)

If quoted strings can extend over multiple lines then a missing quote can cause quite a lot
of text to be 'swallowed up' before an error is detected.

iv. Invalid numbers

A number such as 123.45.67 could be detected as invalid during lexical analysis


(provided the language does not allow a full stop to appear immediately after a number).
Some compiler writers prefer to treat this as two consecutive numbers 123.45 and .67 as
far as lexical analysis is concerned and leave it to the syntax analyser to report an error.
Some languages do not allow a number to start with a full stop/decimal point, in which
case the lexical analyzer can easily detect this situation.



Error Recovery Actions:

The possible error-recovery actions are:

i) Deleting an extraneous character


ii) Inserting a missing character
iii) Replacing an incorrect character by correct character
iv) Transposing two adjacent characters

For example:

fi ( a == 1) ....

Here fi is a valid identifier. But fi followed by an open parenthesis suggests that fi is a misspelling of the keyword if, or an undeclared function identifier.

Errors in Syntax Analysis:


A large fraction of errors is handled by the syntax and semantic analysis phases. If the token stream violates the structure rules (syntax) of the language, then this phase detects the error.
The errors detected in this phase include misplaced semicolons and extra or missing braces ("{" or "}"). As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code). Unbalanced parentheses in expressions are also handled in this phase.

During syntax analysis, the compiler is usually trying to decide what to do next on the basis of
expecting one of a small number of tokens. Hence in most cases it is possible to automatically
generate a useful error message just by listing the tokens which would be acceptable at that
point.

Source: A + * B
Error: | Found '*', expect one of: Identifier, Constant, '('

More specific hand-tailored error messages may be needed in cases of bracket mismatch.



Source: C := ( A + B * 3 ;
Error: | Missing ')' or earlier surplus '('

A parser should be able to detect and report any error in the program. It is expected that when an
error is encountered, the parser should be able to handle it and carry on parsing the rest of the
input. Mostly it is expected from the parser to check for errors but errors may be encountered at
various stages of the compilation process. A program may have the following kinds of errors at
various stages:
 Lexical : name of some identifier typed incorrectly
 Syntactical : missing semicolon or unbalanced parenthesis
 Semantical : incompatible value assignment
 Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.

Panic mode
When a parser encounters an error anywhere in a statement, it ignores the rest of the statement by discarding input up to a delimiter, such as a semicolon. This is the easiest way of error recovery, and it also prevents the parser from developing infinite loops.

Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the statement allows the parser to parse ahead; for example, inserting a missing semicolon or replacing a comma with a semicolon. Parser designers have to be careful here, because one wrong correction may lead to an infinite loop.

Error productions
Some errors that commonly occur in code are known to the compiler designers. The designers can use an augmented grammar containing productions that generate these erroneous constructs, so that the parser can recognize and report them when they are encountered.



Global correction
The parser considers the program in hand as a whole, tries to figure out what the program is intended to do, and finds the closest error-free match for it. When an erroneous input (statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may allow the parser to make minimal changes in the source code, but due to the time and space complexity of this strategy, it has not been implemented in practice yet.

Errors during Semantic Analysis


Semantic errors are mistakes concerning the meaning of a program construct; they may be either
type errors, logical errors or run-time errors:
(i) Type errors occur when an operator is applied to an argument of the wrong type, or to
the wrong number of arguments.
(ii) Logical errors occur when a badly conceived program is executed, for example: while x
= y do ... when x and y initially have the same value and the body of the loop need not
change the value of either x or y.
(iii)Run-time errors are errors that can be detected only when the program is executed, for
example:
var x : real; readln(x); writeln(1/x)
which would produce a run time error if the user input 0.

Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way).
If possible, the compiler should make the appropriate correction(s). Semantic errors are much
harder and sometimes impossible for a computer to detect.

1.6 The Grouping of Phases:


Depending on the relationship between phases, the phases are grouped together as front end and
a back end.

Front End:



The front end consists of phases that depend primarily on the source language and are largely
independent of the target machine. The phases of front end are:

 Lexical Analysis
 Syntactic Analysis
 Creation of the symbol table
 Semantic Analysis
 Generation of the intermediate code
 A part of code optimization
 Error Handling that goes along with the above said phases

character stream → Lexical Analyzer → token stream → Syntax-Directed Translator → intermediate representation

Back End:

The back end includes the phases of the compiler that depend on the target machine, and these
phases do not depend on the source language, but depend on the intermediate language. The
phases of back end are:

 Code Optimization
 Code Generation
 Necessary Symbol table and error handling operations

Categories of Compiler Design:

Based on the grouping of phases, two types of compiler design are possible:

1. A Single Compiler for Different Machines – It is possible to produce a compiler for the same source language on a different machine by keeping the front end of the compiler common and redoing its associated back end.
2. Several Compilers for One Machine – It is possible to produce several compilers for one machine by using a common back end for the different front ends.



1.7 Compiler Construction Tools:

In order to automate the development of compilers, some general tools have been created. These tools use specialized languages for specifying and implementing the components. The most successful tools hide the details of the generation algorithm and produce components which can be easily integrated into the remainder of the compiler. These tools are often referred to as compiler-compilers, compiler-generators, or translator-writing systems.

Some of the compiler-construction tools are:

Parser generators: Automatically produce syntax analyzers from a grammatical description


of a programming language.

Scanner generators: Produce lexical analyzers from a regular-expression description of the


tokens of a language.

Syntax-directed translation engines: Produce collections of routines for walking a parse tree
and generating intermediate code.

Code-generator generators: Produce a code generator from a collection of rules for


translating each operation of the intermediate language into the machine language for a target
machine.
Data-flow analysis engines: Facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.

Compiler-construction toolkits: Provide an integrated set of routines for constructing various


phases of a compiler.



1.8 Programming Language Basics:
The important terminology and distinctions that appear in the programming languages are:

1. The Static / Dynamic Distinction:


 A programming language can have a static policy and a dynamic policy.
 Static Policy: issues that can be decided at compile time by the compiler follow a static policy.
 Dynamic Policy: issues that can only be decided at run time of the program follow a dynamic policy.
 One issue governed by such a policy is the scope of declarations.
 Scope Rules: The scope of a declaration of x is the context in which uses of x refer to this declaration. A language uses static scope or lexical scope if the scope of a declaration can be determined by looking only at the program, i.e. by the compiler. Otherwise, the language uses dynamic scope.
 Example in Java:
public static int x;
The compiler can determine the location of integer x in memory.

2. Environments and States:


The association of names with locations in memory (the store) and then with values can be described by two mappings that change as the program runs; this is a two-stage mapping from names to values.

The environment is a mapping from names to locations in the store.
The state is a mapping from locations in the store to their values. That is, the state maps l-values to their corresponding r-values, in the terminology of C.



Example:
The storage address 100, associated with variable pi, holds 0. After the assignment pi := 3.14,
the same storage is associated with pi, but the value held there is 3.14.
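
A small C sketch of the same two-stage idea (the names are illustrative): the pointer plays the role of the environment entry for pi, and the pointed-to location holds the state.

#include <stdio.h>

int main(void) {
    static double storage;     /* the store: one location, say "address 100" */
    double *pi = &storage;     /* environment: name pi -> location */
    *pi = 0.0;                 /* state: location -> value 0 */
    *pi = 3.14;                /* after pi := 3.14, same location, new value */
    printf("pi = %g\n", *pi);
    return 0;
}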

3. Static Scope and Block Structure:

Scope Rules: The scope of a declaration of x is the context in which uses of x refer to this declaration. A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the program, i.e. by the compiler. Otherwise, the language uses dynamic scope.
 Example in Java:
public static int x;
The compiler can determine the location of integer x in memory.
The static-scope policy is as follows:
1. A C program consists of a sequence of top-level declarations of variables and functions.
2. Functions may have variable declarations within them, where variables include local
variables and parameters. The scope of each such declaration is restricted to the function
in which it appears.
3. The scope of a top-level declaration of a name x consists of the entire program that
follows, with the exception of those statements that lie within a function that also has a
declaration of x.

Block Structures:
Languages that allow blocks to be nested are said to have block structure. A name x in a nested block B is in the scope of a declaration D of x in an enclosing block if there is no other declaration of x in an intervening block.



4. Explicit Access Control:
 Classes and structures introduce a new scope for their members.
 Through keywords like public, private, and protected, object-oriented languages such as C++ or Java provide explicit control over access to member names in a superclass.
 These keywords support encapsulation by restricting access.
 Thus,
o Private names are purposely given a scope that includes only the method
declarations and definitions associated with that class and any "friend" classes
(the C + + term).
o Protected names are accessible to subclasses.
o Public names are accessible from outside the class.

5. Dynamic Scope:
 Scope Rules: The scope of a declaration of x is the context in which uses of x refer to this
declaration.
 A language uses static scope or lexical scope if it is possible to determine the scope of a
declaration by looking only at the program and can be determined by compiler.
 Example in Java:
public static int x;
The compiler can determine the location of integer x in memory.
 The language uses dynamic scope if it is not possible to determine the scope of a
declaration during compile time.
 Example in Java:
public int x;
 With dynamic scope, as the program runs, the same use of x could refer to any of several
different declarations of x.

6. Parameter Passing Mechanisms: Parameters are passed from a calling procedure to the callee either by value (call by value) or by reference (call by reference). Depending on the procedure call, the actual parameters associated with the formal parameters will differ.



Call-By-Value: In call-by-value, the actual parameter is evaluated (if it is an expression) or
copied (if it is a variable). The value is placed in the location belonging to the corresponding
formal parameter of the called procedure.

Call-By-Reference:
In call-by-reference, the address of the actual parameter is passed to the callee as the value of the
corresponding formal parameter. Uses of the formal parameter in the code of the callee are
implemented by following this pointer to the location indicated by the caller. Changes to the
formal parameter thus appear as changes to the actual parameter.
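
A minimal C illustration of the two mechanisms (C itself passes parameters by value; passing a pointer simulates call-by-reference):

#include <stdio.h>

void by_value(int x)  { x = 99; }    /* the callee gets a copy */
void by_ref(int *x)   { *x = 99; }   /* the callee gets the address */

int main(void) {
    int a = 1, b = 1;
    by_value(a);   /* a is unchanged: only the copy was modified */
    by_ref(&b);    /* b is changed through the pointer */
    printf("a = %d, b = %d\n", a, b);   /* prints a = 1, b = 99 */
    return 0;
}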

Call-By-Name:
A third mechanism — call-by-name — was used in the early programming language Algol 60. It
requires that the callee execute as if the actual parameter were substituted literally for the formal
parameter in the code of the callee, as if the formal parameter were a macro standing for the
actual parameter (with renaming of local names in the called procedure, to keep them distinct).

When large objects are passed by value, the values passed are really references to the objects
themselves, resulting in an effective call-by-reference.

7. Aliasing: When parameters are (effectively) passed by reference, two formal parameters can refer to the same object; this is called aliasing. This possibility allows a change in one variable to change another.



UNIT 2 – LEXICAL ANALYSIS

Topics to be Covered

Need and Role of the Lexical Analyzer – Lexical Errors – Expressing Tokens by Regular Expressions – Converting a Regular Expression to a DFA – Minimization of DFA – A Language for Specifying Lexical Analyzers – LEX – Design of a Lexical Analyzer for a Sample Language.

Lexical Analysis

The Role of the Lexical Analyzer

The lexical analyzer is the first phase of a compiler.

Main Task of Lexical Analyzer:

Its main task is to read the input characters and produce as output a sequence of tokens that
the parser uses for syntax analysis.

source program → Lexical Analyzer → token → Parser
(the parser requests each token with "get next token"; both consult the Symbol Table)

The above diagram illustrates that the lexical analyzer is a subroutine or a coroutine of the parser. Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.

Secondary Tasks of Lexical Analyzer:

Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface.



1. Stripping comments and white space (in the form of blank, tab and newline characters) out of the source program.
2. Correlating error messages from the compiler with the source program. For example, the lexical analyzer may keep track of the number of newline characters seen, so that a line number can be associated with an error message.

Phases of Lexical Analyzer:

Lexical analyzers are divided into a cascade of two phases:

Scanning – the scanner is responsible for doing simple tasks (Example – Fortran
compiler use a scanner to eliminate blanks from the input)
Lexical analysis – the lexical analyzer does the more complex operations.

Issues in Lexical Analysis:

There are several reasons for separating the analysis phase of compiling into lexical analysis
and parsing:

1. To make the design simpler. The separation of lexical analysis from syntax analysis allows the other phases to be simpler. For example, parsing a document with comments and white space is more complex than parsing one from which they have already been removed in a previous phase.
2. To improve the efficiency of the compiler. A separate lexical analyzer allows us to construct an efficient processor. A large amount of time is spent in reading the source program and partitioning it into tokens; specialized buffering techniques speed up this work.
3. To enhance the compiler portability. Input alphabets and device specific anomalies
can be restricted to the lexical analyzer.

Tokens, Patterns and Lexemes:

Token: A token is an atomic unit that represents a logically cohesive sequence of characters, such as an identifier, a keyword, an operator, a constant, a literal string, or a punctuation symbol such as a parenthesis, comma or semicolon.

Eg. rate - identifier

+, - - operator

if - keyword

Pattern: A pattern is a rule used to describe lexemes. It is the set of strings in the input for which the same token is produced as output.



Lexeme: A lexeme is a sequence of characters in the source program which is matched by the pattern for a token, i.e. a lexeme is an instance of a token.

Token      Sample Lexemes         Informal Description of Pattern

const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or > or >=
id         pi, count, A2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "garbage collection"   any characters between " and " except "

Attributes for Tokens:

When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler.

For example, the pattern relation matches the operators <, <=, >, >=, =, <>. It is necessary to identify which operator was matched with the pattern.

The lexical analyzer collects such other information about tokens as their attributes. In practice, a token has a single attribute: a pointer to the symbol-table entry in which the information about the token is kept.

For example: The tokens and associated attribute-values for the Fortran statement

X = Y * Z ** 4

are written below as a sequence of pairs:

<id, pointer to symbol-table entry for X>

<assign_op,>

<id, pointer to symbol-table entry for Y>

<mult_op,>

<id, pointer to symbol-table entry for Z>

<exp_op,>

<num, integer value 4>



For certain tokens, there is no need for an attribute value.

Eg. <assign_op,>

For others, the compiler stores the character string that forms a value in a symbol table.

Lexical Errors:

A lexical analyzer has a very localized view of the source program.

The possible error-recovery actions are:

i) Deleting an extraneous character


ii) Inserting a missing character
iii) Replacing an incorrect character by correct character
iv) Transposing two adjacent characters

For example:

fi ( a == 1) ....

Here fi is a valid identifier. But fi followed by an open parenthesis suggests that fi is a misspelling of the keyword if, or an undeclared function identifier.

INPUT BUFFERING:

Input buffering is a method used to read the source program and to identify the tokens
efficiently. There are three general approaches to the implementation of a lexical analyzer.

1. Use a lexical-analyzer generator to produce the lexical analyzer from a regular-


expression based specification. In this case, the generator provides routines for
reading and buffering the input. Example – Lex Compiler
2. Write the lexical analyzer in a conventional systems-programming language, using
the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of
input.

Since the lexical analyzer is the only phase of the compiler that reads the source program character by character, it is possible to spend a considerable amount of time in the lexical analysis phase. Thus the speed of lexical analysis is a concern in compiler design.



The following technique uses a two-buffer input scheme to identify tokens. The speed of the lexical analyzer can be improved by using sentinels to mark the buffer end.

Buffer Pairs:

The lexical analyzer may need to look ahead many characters beyond the lexeme to find the pattern. The lexical analyzer uses a function ungetc() to push look-ahead characters back into the input stream. In order to reduce the amount of overhead required to process an input character, specialized buffering techniques have been developed.

A buffer is divided into two N-character halves, where N is the number of characters on one disk block, e.g. 1024 or 4096.

: : X : = : M : * : C : * : * : 4 : eof : : : :
  ↑ lexeme_beginning              ↑ forward

Input buffer with two halves

The processing of buffer pair is as follows:

1. Read N input characters into each half of the buffer using one system read command, instead of reading each input character separately.
2. If fewer than N characters remain in the input, then the eof marker is read into the buffer after the input characters.
3. Two pointers into the input buffer are maintained. Initially both pointers point to the first character of the next lexeme to be found.
a. The lexeme_beginning pointer marks the start of the lexeme
b. The forward pointer is set to the character at its right end
4. Once the lexeme is identified, both pointers are set to the character immediately past the lexeme.

If the forward pointer reaches the halfway mark, the right half is filled with N new input characters. If the forward pointer is about to move past the right end of the buffer, the left half is filled with N new characters and the forward pointer wraps around to the beginning of the buffer. The drawback of this scheme is that the number of tests required is large.



Code to advance forward pointer:

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else
    forward := forward + 1;

Sentinels:

In the scheme just mentioned, each time the forward pointer is moved, a check must be made to ensure that it has not moved off one half of the buffer; i.e. there is only one eof marker, at the end of the input.

A sentinel is a special character, not part of the source program, used to represent the end of file (eof).

Instead of testing the forward pointer twice each time, extend each buffer half to hold a sentinel character at its end; this reduces the number of tests to one.

: : X : = : M : * : eof : C : * : * : 4 : eof : : : eof
  (lexeme_beginning and forward pointers as before)

Sentinels at the end of each buffer half



In most cases the code performs only one test, to see whether forward points to an eof. Only when it reaches the end of a buffer half or the end of the file do we perform more tests, to determine which half is exhausted and to reload the other half of the buffer.

Look-ahead code with sentinels:

forward := forward + 1;
if forward = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
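
The same scheme can be written in C. A minimal sketch assuming the two-half layout above (the names buf, forward, advance and SENTINEL are illustrative; binary input containing '\0' is not handled):

#include <stdio.h>

#define N 1024            /* characters per buffer half */
#define SENTINEL '\0'

static char buf[2 * N + 2];   /* buf[N] and buf[2N+1] hold the sentinels */
static char *forward = buf;

/* Advance the forward pointer, reloading a buffer half when a sentinel is hit. */
static int advance(FILE *in) {
    forward++;
    if (*forward == SENTINEL) {
        if (forward == buf + N) {                      /* end of first half */
            size_t got = fread(buf + N + 1, 1, N, in); /* reload second half */
            buf[N + 1 + got] = SENTINEL;               /* eof may land mid-half */
            forward++;
        } else if (forward == buf + 2 * N + 1) {       /* end of second half */
            size_t got = fread(buf, 1, N, in);         /* reload first half */
            buf[got] = SENTINEL;
            forward = buf;                             /* wrap to the beginning */
        } else {
            return EOF;     /* sentinel inside a half: real end of input */
        }
    }
    return (unsigned char)*forward;
}

int main(void) {
    size_t got = fread(buf, 1, N, stdin);   /* initial fill of the first half */
    buf[got] = SENTINEL;
    buf[2 * N + 1] = SENTINEL;
    int ch = *buf ? (unsigned char)*buf : EOF;
    while (ch != EOF && ch != 0) {
        putchar(ch);                        /* echo the input as a demo */
        ch = advance(stdin);
    }
    return 0;
}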

SPECIFICATION OF TOKENS:

Regular expressions are an important notation for specifying patterns. Each pattern matches
a set of strings, so regular expressions will serve as names for set of strings.

Strings and Languages:

Alphabet: An alphabet or character class denotes any finite set of symbols; for example, the set of letters, the ASCII characters, or the EBCDIC characters.

String: A string over some alphabet is a finite sequence of symbols drawn from that alphabet. For example, 1 0 1 0 1 1 is a string over {0, 1}; ε is the empty string.

Length of the String: The length of a string s, denoted |s|, is the number of occurrences of symbols in s; e.g. | 1 0 1 | = 3.

Language: A language denotes any set of strings over some fixed alphabet Σ.



Example: the language L = { 0^n 1^n | n > 0 }

Some common terms associated with parts of a string are as follows:

Let s be a string; the examples below use the string banana.

TERM                        DEFINITION

prefix of s                 A string obtained by removing zero or more trailing symbols of string s; e.g. ban is a prefix of banana.

suffix of s                 A string formed by deleting zero or more of the leading symbols of s; e.g. nana is a suffix of banana.

substring of s              A string obtained by deleting a prefix and a suffix from s; e.g. nan is a substring of banana.

proper prefix, suffix or    Any nonempty string x that is, respectively, a prefix, suffix, or
substring of s              substring of s such that s ≠ x.

subsequence of s            Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g. baaa is a subsequence of banana.

Operations on Languages:

There are several important operations that can be applied to languages. For lexical analysis the following operations are used:

OPERATION                                  DEFINITION

union of L and M, written L ∪ M            L ∪ M = { s | s is in L or s is in M }

concatenation of L and M, written LM       LM = { st | s is in L and t is in M }

Kleene closure of L, written L*            L* = ∪ (i = 0 to ∞) L^i
                                           L* denotes "zero or more concatenations of" L

positive closure of L, written L+          L+ = ∪ (i = 1 to ∞) L^i
                                           L+ denotes "one or more concatenations of" L

Example:

Let L = {A, B, . . . , Z, a, b, . . . , z} and

D = {0, 1, . . . , 9}

By applying operators defined above on these languages L and D we get the following new
languages:

1. L ∪ D is the set of letters and digits
   i.e. L ∪ D = { A, B, ..., Z, a, b, ..., z, 0, 1, ..., 9 }
2. LD is the set of strings consisting of a letter followed by a digit
   i.e. LD = { A0, A1, ..., A9, B0, ..., Z9, a0, ..., z9 }
3. L^4 is the set of all four-letter strings, e.g. L^4 = { aBAC, MNop, ... }
4. L* is the set of all strings of letters, including ε, the empty string
   i.e. L* = { ε, A, B, ..., Z, a, b, ..., z, AB, BA, aB, ... }
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter
6. D+ is the set of all strings of one or more digits

Regular Expressions:

A regular expression is built out of simple regular expressions using a set of defining rules.
Each regular expression r denotes a language L(r).

Rules that define the regular expressions:

Basis:

i) ε is a regular expression that denotes the language { ε }.
ii) If a is a symbol in Σ, then a is a regular expression that denotes the language { a }.

Induction:

iii) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. ( r ) | ( s ) is a regular expression denoting L(r) ∪ L(s).
b. ( r ) ( s ) is a regular expression denoting L(r) L(s).
c. ( r )* is a regular expression denoting ( L(r) )*.
d. ( r ) is a regular expression denoting L(r).

A language denoted by a regular expression is said to be a regular set.

The precedence and associativity of operators are as follows:

1. the unary operator * has the highest precedence and is left associative.
2. concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.

Unnecessary parentheses can be avoided in the regular expression if the above precedence is
adopted. For example the regular expression: (a) | ((b)* (c)) is equivalent to a | b*c.

Example:

Let Σ = { a, b }

1. The regular expression a | b denotes the set { a, b }


2. The regular expression ( a | b ) ( a | b ) denotes {aa, ab, ba, bb}, the set of all strings
of a’s and b’s of length two. Another regular expression for this same set is aa | ab |
ba | bb.
3. The regular expression a* denotes the set of all strings of zero or more a's, i.e. { ε, a, aa, aaa, ... }
4. The regular expression ( a | b )* denotes the set of all strings containing zero or more
instances of an a or b, that is, the set of all strings of a’s and b’s. An equivalent
regular expression for this set is ( a*b* )*
5. The regular expression a | a*b denotes the set containing the string a and all strings
consisting of zero or more a’s followed by a b.

If two regular expressions r and s denote the same language, then we say r and s are
equivalent and write r = s. For example, ( a | b ) = (b | a ).

There are number of algebraic laws obeyed by regular expressions and these laws can be used
to manipulate regular expressions into equivalent forms.

Let r, s and t be the regular expression. The following are the algebraic laws for these regular
expressions:



AXIOM                                    DESCRIPTION

r | s = s | r                            | is commutative
r | ( s | t ) = ( r | s ) | t            | is associative
( rs ) t = r ( st )                      concatenation is associative
r ( s | t ) = rs | rt
( s | t ) r = sr | tr                    concatenation distributes over |
εr = r
rε = r                                   ε is the identity element for concatenation
r* = ( r | ε )*                          relation between * and ε
r** = r*                                 * is idempotent

Regular Definitions:

Regular expressions can be given names, and defining regular expressions using these names is called a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

d1 -> r1

d2 -> r2

.......

d n -> rn

where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ { d1, d2, ..., di-1 }, i.e. the basic symbols and the previously defined names.

Example:

1. Regular Definition for identifiers:


letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*



2. Regular Definition for num:
digit → 0 | 1 | ... | 9
digits → digit digit*

optional_fraction → . digits | ε

optional_exponent → ( E ( + | - | ε ) digits ) | ε

num → digits optional_fraction optional_exponent

Notational Shorthands:

Certain constructs occur so frequently in regular expressions that it is convenient to introduce


notational shorthands for them.

1. One or more instances (+): The unary postfix operator + means "one or more instances of". Example: (r)+ denotes the set of all strings of one or more occurrences of r.
2. Zero or one instance (?): The unary postfix operator ? means "zero or one instance of". Example: (r)? denotes one or zero occurrences of r.
The regular definition for num can be written using the unary + and unary ? operators as follows:

digit → 0 | 1 | ... | 9
digits → digit+

optional_fraction → ( . digits ) ?

optional_exponent → ( E ( + | - )? digits ) ?

num → digits optional_fraction optional_exponent

3. Character Classes: The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ... | z.
Using character classes, identifiers can be described as the strings generated by the regular expression: [A-Za-z][A-Za-z0-9]*
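
For illustration, the same identifier pattern can be checked with POSIX regular expressions in C (a sketch only; a generated lexical analyzer would use transition tables rather than calling regexec for every token):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* letter ( letter | digit )* as a POSIX extended regular expression */
    regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED);
    const char *words[] = { "rate", "A2", "2rate" };
    for (int i = 0; i < 3; i++)
        printf("%s: %s\n", words[i],
               regexec(&re, words[i], 0, NULL, 0) == 0 ? "id" : "not an id");
    regfree(&re);
    return 0;
}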



Recognition of Tokens:

The tokens are recognized by following the grammatical specification of tokens.

Example:

Consider the following grammar fragment:

stmt → if expr then stmt
     | if expr then stmt else stmt

expr → term relop term
     | term

term → id
     | num

where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:

if → if

then → then

else → else

relop → < | <= | = | <> | > | >=

id → letter ( letter | digit )*

num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

letter → A | B | ... | Z | a | b | ... | z

digit → 0 | 1 | ... | 9



Regular definition for White Space (ws) is:

delim → blank | tab | newline

ws → delim+

The goal of the lexical analyzer is to isolate the lexeme for the next token in the input buffer
and produce as output a pair consisting of the appropriate token and attribute value using the
table given below:

Regular Expression    Token    Attribute-Value

ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE

Regular Expression Patterns for Tokens



Transition Diagrams:

As an intermediate step in the construction of a lexical analyzer, a stylized flowchart called a transition diagram is produced. Transition diagrams depict the actions that take place when the lexical analyzer is called by the parser to get the next token. They are used to keep track of information about characters that are seen as the forward pointer scans the input.

Positions in a transition diagram are drawn as circles and are called states. The states are
connected by arrows, called edges. Edges leaving state s have labels indicating the input
characters that can next appear after the transition diagram has reached state s. The label
other refers to any character that is not indicated by any of the other edges leaving s.

One state is labeled as start state; it is the initial state of the transition diagram where control
resides when we begin to recognize token. Certain states may have actions that are executed
when the flow of control reaches that state. On entering a state we read the next input
character. If there is an edge from the current state whose label matches this input character,
then we go to the state pointed by the edge. Otherwise, we indicate failure.

The symbol * is used to indicate states on which the input retraction must take place.

There may be several transition diagrams, each specifying a group of tokens. If failure occurs in one transition diagram, the forward pointer is retracted to where it was in the start state of this diagram, and the next transition diagram is activated. Since the lexeme_beginning and forward pointers mark the same position in the start state of the diagram, the forward pointer is retracted to the position marked by the lexeme_beginning pointer. If failure occurs in all transition diagrams, then a lexical error has been detected and an error-recovery routine is invoked.

Transition Diagram for >=:

start → 0 --'>'--> 6 --'='--> 7
                   6 --other--> 8*



Transition Diagram for Relational Operators:

start → 0 --'<'--> 1 --'='--> 2       return(relop, LE)
                   1 --'>'--> 3       return(relop, NE)
                   1 --other--> 4*    return(relop, LT)
        0 --'='--> 5                  return(relop, EQ)
        0 --'>'--> 6 --'='--> 7       return(relop, GE)
                   6 --other--> 8*    return(relop, GT)

(* marks states on which the input must be retracted)
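
This diagram translates almost mechanically into code. A minimal C sketch (the names scan_relop and NOT_RELOP, and the *len out-parameter that plays the role of retraction, are illustrative):

#include <stdio.h>

enum Relop { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

/* Scan a relational operator at the start of s; *len gets the number of
   characters consumed (the "other" cases consume one character less). */
enum Relop scan_relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;              /* other: retract one character */
    case '=':
        *len = 1; return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;              /* other: retract one character */
    default:
        *len = 0; return NOT_RELOP;       /* failure: try the next diagram */
    }
}

int main(void) {
    int n;
    printf("%d\n", scan_relop("<=", &n));   /* prints 1 (LE); n == 2 */
    return 0;
}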

Transition Diagram for identifiers and keywords:

start → 9 --letter--> 10 --letter or digit--> 10
                      10 --other--> 11*       return(gettoken(), install_id())

Transition Diagram for Unsigned Numbers in Pascal:

Numbers with fraction and exponent:
start → 12 --digit--> 13 --digit--> 13
        13 --'.'--> 14 --digit--> 15 --digit--> 15
        15 --'E'--> 16 --'+' or '-'--> 17 --digit--> 18
        16 --digit--> 18 --digit--> 18
        18 --other--> 19*       return(gettoken(), install_num())

Numbers with fraction only:
start → 20 --digit--> 21 --digit--> 21
        21 --'.'--> 22 --digit--> 23 --digit--> 23
        23 --other--> 24*       return(gettoken(), install_num())

Integers:
start → 25 --digit--> 26 --digit--> 26
        26 --other--> 27*

Transition Diagram for white space:

start → 28 --delim--> 29 --delim--> 29
        29 --other--> 30*

Converting a Regular Expression to a DFA

A regular expression is used to represent the language (the set of lexemes) of a finite automaton (the lexical analyzer).

Finite automata

A recognizer for a language is a program that takes as input a string x and answers yes if x is
a sentence of the language and no otherwise.

A regular expression is compiled into a recognizer by constructing a generalized transition


diagram called a Finite Automaton (FA).

Finite automata can be Non-deterministic Finite Automata (NFA) or Deterministic Finite


Automata (DFA).

It is given by M = (Q, Σ, q0, F, δ), where

Q - set of states
Σ - set of input symbols
q0 - start state
F - set of final states
δ - transition function (mapping state-symbol pairs to states): δ : Q × Σ → Q

• Non-deterministic Finite Automata (NFA)

o More than one transition occurs for any input symbol from a state.

o Transition can occur even on empty string (Ɛ).

• Deterministic Finite Automata (DFA)

o For each state and for each input symbol, exactly one transition occurs from that state.

A regular expression can be converted into a DFA by the following methods:

(i) Thompson's construction followed by subset construction

• Given regular expression is converted into NFA

• Resultant NFA is converted into DFA

(ii) Direct Method

• In direct method, given regular expression is converted directly into DFA.

Rules for Conversion of Regular Expression to NFA

(the standard NFA constructions for each operator; the figures are omitted here)

• Union: r = r1 + r2

• Concatenation: r = r1 r2

• Closure: r = r1*

ε-closure

The ε-closure of a state is the set of states that are reachable from it on taking only the empty string as input; it describes the paths that consume the empty string (ε) to reach states of the NFA.

Example 1 (for an NFA, figure omitted, with ε-moves from q0 to q1 and from q1 to q2):

ε-closure(q0) = { q0, q1, q2 }

ε-closure(q1) = { q1, q2 }

ε-closure(q2) = { q2 }



Example 2 (for an NFA with states 1-7, figure omitted):

ε-closure(1) = {1, 2, 3, 4, 6}

ε-closure(2) = {2, 3, 6}

ε-closure(3) = {3, 6}

ε-closure(4) = {4}

ε-closure(5) = {5, 7}

ε-closure(6) = {6}

ε-closure(7) = {7}
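
A minimal sketch of the ε-closure computation in C, assuming the NFA's ε-edges are given as an edge list (the edges below encode a chain q0 →ε q1 →ε q2, matching the shape assumed for Example 1):

#include <stdio.h>

#define NSTATES 3
#define NEDGES  2

static int eps_from[NEDGES] = { 0, 1 };   /* q0 -e-> q1, q1 -e-> q2 */
static int eps_to[NEDGES]   = { 1, 2 };

/* Mark every state reachable from start using only epsilon edges. */
static void eps_closure(int start, int in_closure[NSTATES]) {
    int stack[NSTATES], top = 0;
    stack[top++] = start;
    in_closure[start] = 1;
    while (top > 0) {
        int s = stack[--top];
        for (int e = 0; e < NEDGES; e++)
            if (eps_from[e] == s && !in_closure[eps_to[e]]) {
                in_closure[eps_to[e]] = 1;
                stack[top++] = eps_to[e];
            }
    }
}

int main(void) {
    int cl[NSTATES] = { 0 };
    eps_closure(0, cl);
    for (int q = 0; q < NSTATES; q++)
        if (cl[q]) printf("q%d ", q);      /* prints: q0 q1 q2 */
    printf("\n");
    return 0;
}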

Sub-set Construction

• Given regular expression is converted into NFA.

• Then, NFA is converted into DFA.

Steps

1. Convert the regular expression into an NFA using the above rules for the operators (union, concatenation and closure) and their precedence.

2. Find the ε-closure of all states.

3. Start with the ε-closure of the start state of the NFA.

4. Apply the input symbols and find the ε-closure of the result:

   Dtran[state, input symbol] = ε-closure(move(state, input symbol))

   where Dtran is the transition function of the DFA.

5. Analyze the output state to find whether it is a new state.

6. If a new state is found, repeat step 4 and step 5 until no more new states are found.

7. Construct the transition table for the Dtran function.

8. Draw the transition diagram with start state ε-closure(start state of NFA); the final states are those that contain a final state of the NFA. (A compact sketch of this worklist loop appears after the list.)
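
A compact C sketch of steps 3 to 7, representing each NFA state set as a bit mask (the tables eps[] and delta[] are left empty here and must be filled in from the NFA of the given regular expression; all names are illustrative):

#include <stdio.h>
#include <stdint.h>

#define MAXN 16     /* NFA states */
#define MAXD 32     /* DFA states */
#define NSYM  2     /* input symbols: 0 = 'a', 1 = 'b' */

static uint32_t eps[MAXN];          /* epsilon edges out of each NFA state */
static uint32_t delta[MAXN][NSYM];  /* labeled edges out of each NFA state */

static uint32_t closure(uint32_t set) {     /* epsilon-closure of a state set */
    uint32_t prev;
    do {
        prev = set;
        for (int s = 0; s < MAXN; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

static uint32_t move(uint32_t set, int c) { /* states reachable on symbol c */
    uint32_t out = 0;
    for (int s = 0; s < MAXN; s++)
        if (set & (1u << s)) out |= delta[s][c];
    return out;
}

int main(void) {
    /* fill eps[] and delta[] from the NFA before running */
    uint32_t dstates[MAXD];
    int ndfa = 0;
    dstates[ndfa++] = closure(1u << 0);          /* step 3: closure of start */
    for (int i = 0; i < ndfa; i++)               /* worklist of DFA states */
        for (int c = 0; c < NSYM; c++) {
            uint32_t t = closure(move(dstates[i], c));          /* step 4 */
            int j = 0;
            while (j < ndfa && dstates[j] != t) j++;            /* step 5 */
            if (j == ndfa && ndfa < MAXD) dstates[ndfa++] = t;  /* step 6 */
            printf("Dtran[%d, %c] = %d\n", i, "ab"[c], j);      /* step 7 */
        }
    return 0;
}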

Direct Method

The direct method is used to convert a given regular expression directly into a DFA.

1. It uses the augmented regular expression r#.
2. The important states of the NFA correspond to the positions in the regular expression that hold symbols of the alphabet.
3. The regular expression is represented as a syntax tree where interior nodes correspond to operators representing union, concatenation and closure operations.
4. Leaf nodes correspond to the input symbols.
5. The DFA is constructed directly from the regular expression by computing the functions nullable(n), firstpos(n), lastpos(n) and followpos(i) from the syntax tree.
6. nullable(n): true for a * node and for a node labeled ε; false for other nodes.
7. firstpos(n): the set of positions that correspond to the first symbols of strings generated by the sub-expression rooted at n.
8. lastpos(n): the set of positions that correspond to the last symbols of strings generated by the sub-expression rooted at n.
9. followpos(i): the set of positions that can follow position i in a string generated by the given regular expression.



Rules for computing nullable, firstpos and lastpos

Node n: a leaf labeled ε
    nullable(n) = true;  firstpos(n) = Ø;  lastpos(n) = Ø

Node n: a leaf with position i
    nullable(n) = false;  firstpos(n) = {i};  lastpos(n) = {i}

Node n: an or node n = c1 | c2
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n) = lastpos(c1) ∪ lastpos(c2)

Node n: a cat node n = c1 c2
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n) = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)

Node n: a star node n = c1*
    nullable(n) = true;  firstpos(n) = firstpos(c1);  lastpos(n) = lastpos(c1)

Computation of followpos

A position of the regular expression can follow another in the following ways:

 If n is a cat node with left child c1 and right child c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i). That is, for a cat node, for each position i in the lastpos of its left child, the firstpos of its right child is in followpos(i).
 If n is a star node and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i). That is, for a star node, the firstpos of that node is in the followpos of all positions in the lastpos of that node.



Example:

Thompson's construction followed by subset construction for (a+b)*abb, and the direct method for the augmented expression (a+b)*abb# (the NFA and syntax-tree figures are omitted here).



FollowPos

Reading (a+b)*abb# left to right, positions 1, 2, 3, 4, 5, 6 label a, b, a, b, b, #:

followpos(1) = {1, 2, 3}    followpos(2) = {1, 2, 3}
followpos(3) = {4}          followpos(4) = {5}
followpos(5) = {6}          followpos(6) = Ø

A = firstpos(n0) = {1, 2, 3}
Dtran[A, a] = followpos(1) ∪ followpos(3) = {1, 2, 3, 4} = B
Dtran[A, b] = followpos(2) = {1, 2, 3} = A
Dtran[B, a] = followpos(1) ∪ followpos(3) = B
Dtran[B, b] = followpos(2) ∪ followpos(4) = {1, 2, 3, 5} = C

….



Minimizing the Number of States of a DFA

For (a+b)*abb, the DFA obtained from Thompson's NFA by subset construction has states A, B, C, D, E, of which A and C turn out to be equivalent. Merging them gives the minimum-state DFA with states {A, C}, {B}, {D}, {E}, which correspond to the direct-method states 123, 1234, 1235 and 1236 respectively. So there exists a minimum-state DFA.

A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS

There is a wide range of tools for constructing lexical analyzers.


 LEX
 YACC
LEX

Lex is a computer program that generates lexical analyzers. Lex is commonly used with
the yacc parser generator.
Creating a lexical analyzer
First, a specification of a lexical analyzer is prepared by creating a program lex.l in
the Lex language. Then, lex.l is run through the Lex compiler to produce a C program
lex.yy.c.

Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.



Example (a Lex specification that counts vowels and consonants):

%{
int v = 0, c = 0;
%}
%%
[aeiouAEIOU] v++;
[a-zA-Z] c++;
%%
main()
{
    printf("ENTER INPUT : \n");
    yylex();
    printf("VOWELS=%d\nCONSONANTS=%d\n", v, c);
}
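
Assuming the usual lex/flex toolchain, this specification (saved, say, as count.l) is built by running lex count.l to generate lex.yy.c, then compiling with cc lex.yy.c -ll (with flex, the library flag is -lfl) to produce a.out. Running a.out, typing some text and ending the input with Ctrl-D prints the vowel and consonant counts. The file name and the exact library flag depend on the installation.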

