2. Compiler
A compiler reads the entire source code and then translates it into machine
code. The machine code, also known as object code, is stored in an object
file.
If the compiler encounters any errors during the compilation process, it
continues to read the source code to the end and then reports the errors and
their line numbers to the user.
Compiled programming languages are high-level and machine-independent.
Examples of compiled programming languages are C, C++, C#, Java, Rust, and
Go.
3. Interpreter
An interpreter receives the source code and reads it line by line,
translating each line into machine code and executing it before moving on
to the next line.
If the interpreter encounters an error during this process, it stops and
shows an error message to the user.
Interpreted programming languages are also high-level and machine-independent.
Python, JavaScript, PHP, and Ruby are examples of interpreted programming
languages.
STRUCTURE OF COMPILER
• In a compiler,
o linear analysis
▪ is called LEXICAL ANALYSIS or SCANNING and
▪ is performed by the LEXICAL ANALYZER or LEXER,
o hierarchical analysis
▪ is called SYNTAX ANALYSIS or PARSING and
▪ is performed by the SYNTAX ANALYZER or PARSER.
• During the analysis, the compiler manages a SYMBOL TABLE by
o recording the identifiers of the source program
o collecting information (called ATTRIBUTES) about them: storage
allocation, type, scope, and (for functions) signature.
• When the lexical analyzer finds an identifier x, it
o generates the token id,
o enters the lexeme x in the symbol table (if it is not already there), and
o associates with the generated token a pointer to the symbol-table entry
for x. This pointer is called the LEXICAL VALUE of the token (see the
sketch after this list).
• During the analysis or synthesis, the compiler may DETECT ERRORS and
report on them.
o However, after detecting an error, the compilation should proceed
allowing further errors to be detected.
o The syntax and semantic phases usually handle a large fraction of the
errors detectable by the compiler.
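A minimal sketch (not from the original notes) of how a lexer might record an identifier in the symbol table and attach its lexical value to the token; the names SymbolTable, install, and the token format are illustrative assumptions.

class SymbolTable:
    def __init__(self):
        self.entries = []      # each entry: {"lexeme": ..., "type": None, "scope": None}
        self.index = {}        # lexeme -> position in entries

    def install(self, lexeme):
        """Enter the lexeme if it is not already present; return its entry index."""
        if lexeme not in self.index:
            self.index[lexeme] = len(self.entries)
            self.entries.append({"lexeme": lexeme, "type": None, "scope": None})
        return self.index[lexeme]

table = SymbolTable()
ptr = table.install("x")       # the lexeme x is entered (or found) in the symbol table
token = ("id", ptr)            # token id carrying a pointer to the entry: its lexical value
print(token, table.entries[ptr])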
PHASES OF COMPILER
1. Lexical Analysis
The first phase of the compiler, lexical analysis, receives the source
code of the program as input. Lexical analysis is also referred to as
linear analysis or scanning; it is the process of tokenizing.
The lexer scans the input source code one character at a time. The instant
it identifies the end of a lexeme, it transforms that lexeme into a token.
In this manner the input is transformed into a sequence of tokens. A token
is a meaningful group of characters from the source which the compiler
recognizes. The lexical analyzer then passes these tokens to the next
phase of the compiler. Scanning also eliminates non-token structures from
the input stream, such as comments and unnecessary white space.
The program that implements lexical analysis is known as a lexer,
lexical analyzer, or scanner.
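A minimal tokenizer sketch in Python, assuming a tiny language with identifiers, integers, and a few operators; the token names and regular expressions are illustrative, not those of any particular compiler.

import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"[ \t\n]+"),   # non-token structures: white space is dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())   # each lexeme becomes a (token name, lexeme) pair

print(list(tokenize("sum = a + 42")))
# [('ID', 'sum'), ('OP', '='), ('ID', 'a'), ('OP', '+'), ('NUMBER', '42')]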
2. Syntax Analysis
The second phase of the compiler receives as input the stream of tokens
from the previous phase and uses them to create an intermediate tree-like
data structure known as the parse tree. The parse tree is generated with
the help of the pre-determined grammar rules of the language that the
compiler targets. The syntax analyzer checks whether or not a given
program follows the rules of the context-free grammar. If it does, the
syntax analyzer creates the parse tree for the input source program; if
the syntax is incorrect, it reports a syntax error. Syntax analysis is
also known as hierarchical analysis or parsing, and the program that
performs it is referred to as a parser.
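A minimal parsing sketch for the illustrative grammar expr -> term (('+' | '-') term)*, term -> NUMBER | ID. It consumes a token stream like the one produced by the tokenizer above and builds a nested-tuple parse tree; a token that breaks the grammar raises a syntax error.

def parse_expr(tokens):
    pos = 0

    def term():
        nonlocal pos
        kind, lexeme = tokens[pos]
        if kind in ("NUMBER", "ID"):
            pos += 1
            return (kind, lexeme)
        raise SyntaxError(f"unexpected token {lexeme!r}")

    tree = term()
    while pos < len(tokens) and tokens[pos][0] == "OP" and tokens[pos][1] in "+-":
        op = tokens[pos][1]
        pos += 1
        tree = ("expr", tree, op, term())   # grow the tree as each rule is matched
    return tree

print(parse_expr([("ID", "a"), ("OP", "+"), ("NUMBER", "42")]))
# ('expr', ('ID', 'a'), '+', ('NUMBER', '42'))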
3. Semantic Analysis
Semantic Analysis is the third phase of a compiler, coming after
syntax analysis. While syntax analysis checks whether the source
code follows the grammatical structure of the programming language,
semantic analysis ensures that the code is meaningful and logically
correct. It looks at the meaning of the code to find errors that grammar
checks can’t catch. If something doesn’t make sense in the code, it
gives a semantic error. So, semantic analysis helps ensure that the
program is not just written correctly, but also works correctly. It
makes sure that all variables and functions are properly declared and
used, and that the types of data being used are correct and compatible
with each other.
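A minimal semantic-check sketch: it verifies that every identifier used in a tree like the one above was declared, and that the operands of an operator have compatible types. The declarations table and the type rules are illustrative assumptions.

def check(node, declarations, errors):
    kind = node[0]
    if kind == "ID":
        if node[1] not in declarations:
            errors.append(f"undeclared identifier {node[1]!r}")
            return "unknown"
        return declarations[node[1]]
    if kind == "NUMBER":
        return "int"
    if kind == "expr":                      # node = ('expr', left, op, right)
        left = check(node[1], declarations, errors)
        right = check(node[3], declarations, errors)
        if "unknown" not in (left, right) and left != right:
            errors.append(f"type mismatch: {left} {node[2]} {right}")
        return left
    return "unknown"

errors = []
check(("expr", ("ID", "a"), "+", ("NUMBER", "42")), {"a": "string"}, errors)
print(errors)   # ['type mismatch: string + int']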
4. Intermediate Code Generation
Intermediate Code Generation (ICG) in compiler design is the fourth
phase in the process of compilation which involves converting high-
level source code into an intermediate representation (IR). This step
improves portability and efficiency, acting as a bridge between source
code and machine code. The IR is independent of machine architecture,
facilitating optimization and easier translation into target machine code
across different platforms. The IR can be represented using various
notations, such as postfix notation, directed acyclic graphs, syntax
trees, three-address code, quadruples, and triples.
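A minimal sketch of three-address code generation for the nested-tuple tree used above. The temporary names t1, t2, ... and the exact instruction format are illustrative assumptions.

def gen_tac(node, code, temps):
    kind = node[0]
    if kind in ("ID", "NUMBER"):
        return node[1]                       # leaves are their own addresses
    left = gen_tac(node[1], code, temps)
    right = gen_tac(node[3], code, temps)
    temps[0] += 1
    name = f"t{temps[0]}"
    code.append(f"{name} = {left} {node[2]} {right}")
    return name

code = []
gen_tac(("expr", ("expr", ("ID", "b"), "+", ("ID", "c")), "*", ("ID", "d")), code, [0])
print("\n".join(code))
# t1 = b + c
# t2 = t1 * d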
5. Code Optimization
Code optimization is a program transformation approach that aims to
enhance code by reducing resource consumption (i.e., CPU and
memory) while maintaining high performance. In code optimization,
high-level generic programming structures are replaced with more
efficient low-level code (a small constant-folding sketch follows the
guidelines below). The three guidelines for code optimization
are as follows:
• In no way should the output code alter the program's meaning.
• The program's speed should be increased, and it should use fewer
resources if at all feasible.
• The optimization step should be quick and not hinder the
compilation process.
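A minimal constant-folding sketch over three-address code like the lines above: an instruction whose operands are both numeric literals is evaluated at compile time, so no work is left for run time. This is only one of many possible optimizations.

def fold_constants(code):
    optimized = []
    for line in code:
        target, expr = line.split(" = ", 1)
        parts = expr.split()
        if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
            value = eval(expr)               # safe here: both operands are numeric literals
            optimized.append(f"{target} = {value}")
        else:
            optimized.append(line)
    return optimized

print(fold_constants(["t1 = 2 * 8", "t2 = a + t1"]))
# ['t1 = 16', 't2 = a + t1']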
6. Code Generation
In the sixth and final phase of the compiler, code generation receives
the optimized intermediate code as input and translates it into machine
code or assembly code that the computer’s hardware can understand and
execute. The main goal of this phase is to produce efficient and correct
machine-level instructions that do exactly what the source code was
intended to do. It converts each part of the intermediate code into
low-level instructions and assigns variables to physical memory locations
or registers.
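A minimal code-generation sketch: each three-address instruction is translated into a made-up two-address assembly (LOAD/ADD/MUL/STORE), keeping every value in register R0 for simplicity. The instruction set is an illustrative assumption, not a real target architecture.

def gen_asm(tac):
    asm = []
    ops = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
    for line in tac:
        target, expr = line.split(" = ", 1)
        left, op, right = expr.split()
        asm.append(f"LOAD  R0, {left}")      # bring the first operand into a register
        asm.append(f"{ops[op]:<5} R0, {right}")
        asm.append(f"STORE R0, {target}")    # assign the result to its memory location
    return asm

print("\n".join(gen_asm(["t1 = b + c", "t2 = t1 * d"])))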
7. Error Handler
Error Handling in a compiler is the process of detecting, reporting, and
recovering from errors in a program. The compiler’s job is to find these
errors and tell you about them clearly, so you can fix them. Error
handling is done in almost every phase of the compiler. The compiler
tries not to stop immediately after finding one error. Instead, it continues
checking the rest of the code to find more errors, so the programmer can
fix them all at once. This is called error recovery. A good compiler
doesn't just say “there’s an error” — it also gives helpful messages like
where the error happened and what kind of mistake it is. This makes it
easier for programmers to understand and correct their code. Error
handling helps make programming easier by catching and explaining
mistakes during the compilation process.
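A minimal sketch of error recovery in this spirit: instead of stopping at the first problem, the checker records every error with its line number and reports them all at the end. The error format and the crude identifier scan are illustrative assumptions.

def report_errors(source_lines, declared):
    errors = []
    for lineno, line in enumerate(source_lines, start=1):
        for name in line.replace("=", " ").replace("+", " ").split():
            if name.isidentifier() and name not in declared:
                errors.append(f"line {lineno}: undeclared identifier '{name}'")
    return errors                            # compilation continues; all errors are shown at once

for msg in report_errors(["x = y + 1", "z = x + w"], declared={"x", "z"}):
    print(msg)
# line 1: undeclared identifier 'y'
# line 2: undeclared identifier 'w'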
8. Symbol Table
The symbol table is the data structure, built during lexical analysis and
consulted by all later phases, in which the compiler records the
identifiers of the source program together with their attributes: storage
allocation, type, scope, and (for functions) signature.
INPUT BUFFERING
Input buffering in a compiler is a method used to increase the speed of
reading the source code by reducing the number of times the compiler needs
to access the source file. Without input buffering, the compiler reads each
character from the file one at a time, which is slow and time consuming.
Input buffering solves this problem by reading large blocks of characters
into memory at once, thus minimizing the number of input operations.
How Does Input Buffering Work?
The basic idea of input buffering is to use a buffer, which is a block of
memory where the source code is temporarily stored. There are typically
two types of buffers used:
1. Single Buffer: A single large block of memory that holds part of the
source code.
2. Double Buffer: Two blocks of memory, used alternately, so that while one
buffer is being processed, the other can be filled with new characters from
the source file.
Single Buffer
In a single buffer system, the compiler reads a large block of the source file
into a buffer. The lexical analyser then processes this buffer character by
character to identify tokens.
When the buffer is exhausted, the next block of characters is read into the
same buffer, and the process repeats. While simple, this method can be
inefficient because the processing has to stop every time the buffer needs to
be refilled.
Here's how single buffering works in detail:
- One buffer is used named Buffer A.
- The compiler fills Buffer A with characters from the source file.
- The lexical analyser starts processing characters from Buffer A.
- When Buffer A has been completely processed, the compiler refills it with
the next set of characters.
- This process repeats until the end of the file is reached (see the sketch
below).
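A minimal single-buffer sketch: the file is read in fixed-size blocks, and the scanner takes one character at a time; a new block is read only when the current buffer is exhausted. The buffer size and the generator interface are illustrative assumptions.

BUFFER_SIZE = 4096

def chars_from(path):
    with open(path) as source:
        while True:
            buffer = source.read(BUFFER_SIZE)    # one input operation fills Buffer A
            if not buffer:                       # end of file
                return
            for ch in buffer:                    # processing pauses only while refilling
                yield ch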
Double Buffer
A more efficient approach is double buffering. In this system, there are two
buffers. While the lexical analyser processes characters from one buffer, the
other buffer can be filled with the next block of characters from the source
file. This overlapping of processing and reading helps in maintaining a
continuous flow of characters and reduces the waiting time.
Here's how double buffering works in detail:
- Buffer A and Buffer B are the two buffers.
- The compiler fills Buffer A with characters from the source file.
- The lexical analyser starts processing characters from Buffer A.
- When Buffer A is half-processed, the compiler starts filling Buffer
B with the next set of characters.
- Once Buffer A is completely processed, the lexical analyser switches
to Buffer B.
- This process continues, keeping the flow smooth and uninterrupted (a
minimal sketch follows).
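A minimal double-buffer sketch: while the scanner consumes characters from one buffer, a helper thread is already reading the next block into the other buffer, so reading and processing overlap. The buffer size, the use of a thread, and the queue of size one are illustrative assumptions.

import threading
import queue

BUFFER_SIZE = 4096

def double_buffered_chars(path):
    blocks = queue.Queue(maxsize=1)              # holds the "other" buffer while one is scanned

    def reader():
        with open(path) as source:
            while True:
                block = source.read(BUFFER_SIZE)
                blocks.put(block)                # fill Buffer B while Buffer A is processed
                if not block:
                    return

    threading.Thread(target=reader, daemon=True).start()
    while True:
        block = blocks.get()
        if not block:                            # an empty block marks the end of the file
            return
        for ch in block:
            yield ch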
Sentinels in Input Buffering
To make the process of input buffering fast, sentinels can be used. Sentinels
are special characters placed at the end of each buffer to signify the end.
This eliminates the need for checking the buffer's end condition repeatedly,
which can slow down the process.
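A minimal sentinel sketch: a character that cannot occur in the source (here the assumed sentinel '\0') is appended to the buffer, so the scanner tests for the end of the buffer only when it actually sees the sentinel instead of checking an index bound on every character.

SENTINEL = "\0"

def scan_buffer(buffer):
    buffer += SENTINEL                           # mark the end of this buffer
    i = 0
    while True:
        ch = buffer[i]
        if ch == SENTINEL:                       # one check replaces "i < len(buffer)" everywhere
            break
        # ... normal lexer work on ch would happen here ...
        i += 1
    return i                                     # number of real characters processed

print(scan_buffer("int x"))   # 5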
Advantages of Input Buffering
1. Efficiency:
By reading large blocks of data at once, input buffering reduces the number
of input operations, making the process faster.
2. Reduced Latency:
Double buffering ensures that while one buffer is being processed, the other
is being filled, reducing waiting time and increasing the overall speed of
the lexical analysis.
3. Smooth Processing:
The use of sentinels helps in seamless buffer transitions, avoiding constant
end-of-buffer checks.
REGULAR EXPRESSIONS
A regular expression can be described as a pattern that defines a set of
strings. In compiler design, regular expressions are concise notations
used to define and recognize patterns in source code. They play a crucial
role in lexical analysis, where they are used to identify and extract
tokens (e.g., keywords, identifiers) from code. Regular expressions help
in defining the lexical structure of programming languages, facilitating
the transformation of human-readable code into machine-readable form
during compilation.
For instance:
• In a regular expression, x* means zero or more occurrences of x. It
can generate {ε, x, xx, xxx, xxxx, ...}.
• In a regular expression, x+ means one or more occurrences of x. It can
generate {x, xx, xxx, xxxx, ...}.
Here are some examples of regular expressions commonly used in compiler
design, particularly for lexical analysis (a quick test of these patterns
follows the list):
1. Identifiers:
• Regex: ^[a-zA-Z_][a-zA-Z0-9_]*$
• Explanation: This regular expression matches valid identifiers in
programming languages. It ensures that an identifier starts with a
letter or underscore, followed by any combination of letters, digits,
or underscores. This pattern is used to recognize variable names,
function names, and other identifiers.
2. Numeric Literals:
• Regex: ^(\d+(\.\d+)?|\.\d+)$
• Explanation: This regex matches numeric literals, including both
integers and floating-point numbers. It allows for optional decimal
points, ensuring that integers can be expressed as whole numbers and
floating-point numbers can start with digits, end with digits, or begin
with a decimal point.
3. String Literals:
• Regex: ^"([^"\\]|\\.)*"$
• Explanation: This regular expression matches string literals
enclosed in double quotes. It allows for any characters except for
unescaped double quotes, and it accommodates escape sequences
(like \" or \\). This pattern is essential for recognizing string data
types in programming languages.
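A quick, illustrative check of the three patterns above with Python's re module; the sample inputs are arbitrary.

import re

patterns = {
    "identifier":      r"^[a-zA-Z_][a-zA-Z0-9_]*$",
    "numeric literal": r"^(\d+(\.\d+)?|\.\d+)$",
    "string literal":  r'^"([^"\\]|\\.)*"$',
}

for text in ["_count1", "3.14", ".5", '"say \\"hi\\""', "9abc"]:
    kinds = [name for name, pat in patterns.items() if re.match(pat, text)]
    print(f"{text!r:16} -> {kinds or ['no match']}")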
ROLE OF PARSERS
Applications
CFG has great practical importance. Some of the applications are given
below:
• For defining programming languages.
• For the construction of compilers.
• For describing arithmetic expressions.
• For the translation of programming languages.
SHIFT-REDUCE PARSING
Bottom-Up Parser
A bottom-up parser is a type of parsing algorithm that starts with the input
symbols to construct a parse tree by repeatedly applying production rules in
reverse until the start symbol is reached. Bottom-up parsers are also known
as shift-reduce parsers because they shift input symbols onto the parse
stack until a set of consecutive symbols can be reduced by a production
rule.
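A minimal shift-reduce sketch for the toy grammar E -> E + E | id. It shifts tokens onto a stack and reduces whenever the top of the stack matches the right-hand side of a production; a real LR parser would consult parsing tables instead of this naive handle matching, and the grammar itself is an illustrative assumption.

PRODUCTIONS = [("E", ["E", "+", "E"]), ("E", ["id"])]

def shift_reduce(tokens):
    stack, trace = [], []
    tokens = tokens + ["$"]                      # end-of-input marker
    i = 0
    while True:
        reduced = True
        while reduced:                           # reduce as long as a handle is on top
            reduced = False
            for head, body in PRODUCTIONS:
                if stack[-len(body):] == body:
                    del stack[-len(body):]
                    stack.append(head)
                    trace.append(f"reduce {head} -> {' '.join(body):8} stack={stack}")
                    reduced = True
        if tokens[i] == "$":
            break
        stack.append(tokens[i])                  # shift the next input symbol
        trace.append(f"shift  {tokens[i]:13} stack={stack}")
        i += 1
    return stack == ["E"], trace

accepted, trace = shift_reduce(["id", "+", "id"])
print("\n".join(trace))
print("accepted:", accepted)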
Top-Down Parser
A top-down parser in compiler design constructs a parse tree for an input
string in preorder, starting from the root. Equivalently, it builds a
leftmost derivation for the input string: starting from the grammar’s
start symbol, it chooses a suitable production rule at each step so that
the sentential form matches the input string from left to right.
Leftmost derivation:
It is a process of exploring the production rules from left to right and
selecting the leftmost non-terminal in the current string as the next symbol
to expand. This approach ensures that the parser always chooses the
leftmost derivation and tries to match the input string. If a match cannot be
found, the parser backtracks and tries another production rule. This process
continues until the parser reaches the end of the input string or fails to find
a valid parse tree.
Example of Top-Down Parsing
Consider the lexical analyzer’s input string ‘acb’ for the following
grammar, using a leftmost derivation.
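As an illustration, assume the grammar is S -> a S b | c (the notes do not reproduce the original grammar). A top-down parser builds the leftmost derivation S => a S b => a c b for the input ‘acb’. A minimal recursive-descent sketch of that parser:

def parse(inp):
    pos = 0

    def S():
        nonlocal pos
        if pos < len(inp) and inp[pos] == "a":   # try production S -> a S b
            pos += 1
            S()
            if pos < len(inp) and inp[pos] == "b":
                pos += 1
                return
            raise SyntaxError("expected 'b'")
        elif pos < len(inp) and inp[pos] == "c": # otherwise production S -> c
            pos += 1
        else:
            raise SyntaxError(f"unexpected input at position {pos}")

    S()
    if pos != len(inp):
        raise SyntaxError("trailing input")
    return True

print(parse("acb"))   # True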
LR(0) Parser
SLR(1) Parser
CLR(1) Parser
LALR(1) Parser