Lexical analysis partitions a program's source code string into tokens. It defines a set of token types like identifiers, integers, keywords, and whitespace. A lexical analyzer recognizes substrings that correspond to each token type and returns the lexeme (substring) and token type. Regular expressions provide a notation for specifying the patterns that define each token type. A lexical analyzer implementation uses these regular expression patterns to efficiently scan the source code and classify its substrings into tokens.
Intro To Compilers Lecture 2
Compilers
Lexical Analysis
• What is the goal?
if (i==0)
	z=0;
else
	z=1;
• The input is just a string of characters:
• if (i==0)\n\tz=0;\nelse\n\tz=1;
• Goal: Partition the input string into substrings
• where the substrings are tokens
Token
• A token is a word: the smallest unit above letters.
• It is the minimal syntactic category.
• English: noun, verb, adjective, …
• Programming language: identifier, integer, keyword, whitespace, …
• Tokens correspond to sets of strings
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: "else" or "if" …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
Contd…
• Tokens classify program substrings according to their role
• The output of lexical analysis is a stream of tokens.
• The parser relies on token distinctions.
• An identifier is treated differently from a keyword.
Designing a lexical analyser
• Define a finite set of tokens
• Tokens describe all items of interest
• The choice of tokens depends on the language, the design of the parser, …
• Recall
• \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Useful tokens for this expression:
• Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;
• N.B., (, ), =, ; are tokens, not characters, here
• The next step is to describe which substrings belong to each token.
Implementation
• An implementation is responsible for two things:
• Recognizing the substrings that correspond to tokens
• Returning the value or lexeme (substring) of each token
• First, it discards unneeded tokens that do not contribute to parsing:
• Whitespace and comments.
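The two responsibilities listed under Implementation, plus the discarding of whitespace and comments, can be sketched as a small hand-written scanner. This is only an illustrative sketch, not the lecture's implementation: the function name `scan`, the `KEYWORDS` set, and the choice to label `(`, `)`, `=`, and `;` with themselves are assumptions made for the example. It peeks one character ahead to tell `=` from `==`.

```python
import string

# Assumed for illustration: only the two keywords named in the slides.
KEYWORDS = {"if", "else"}
LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
WHITESPACE = set(" \t\n")


def scan(source):
    """Scan left to right and return a list of (token_type, lexeme) pairs."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in WHITESPACE:
            # Whitespace: consume the whole run and emit nothing.
            while i < n and source[i] in WHITESPACE:
                i += 1
        elif source.startswith("//", i):
            # Line comment: skip up to (not including) the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif source.startswith("/*", i):
            # Block comment: skip past the closing */ (or to end of input).
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch in LETTERS:
            # Identifier or keyword: a letter followed by letters or digits.
            start = i
            while i < n and (source[i] in LETTERS or source[i] in DIGITS):
                i += 1
            lexeme = source[start:i]
            kind = "Keyword" if lexeme in KEYWORDS else "Identifier"
            tokens.append((kind, lexeme))
        elif ch in DIGITS:
            # Integer: a non-empty string of digits.
            start = i
            while i < n and source[i] in DIGITS:
                i += 1
            tokens.append(("Integer", source[start:i]))
        elif ch == "=":
            # One character of lookahead separates '==' from '='.
            if i + 1 < n and source[i + 1] == "=":
                tokens.append(("Relation", "=="))
                i += 2
            else:
                tokens.append(("=", "="))
                i += 1
        elif ch in "();":
            tokens.append((ch, ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

For instance, `scan("z=0;")` returns `[('Identifier', 'z'), ('=', '='), ('Integer', '0'), (';', ';')]`. Because the identifier branch keeps consuming letters and digits for as long as it can, `if` comes out as one keyword rather than the two identifiers `i` and `f`.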
if (i ==0) //if clause
	z=0;
else /*else clause is located here*/
	z=1;
• The same input with the comments removed:
• if (i == 0)\n\tz=0;\nelse\n\tz=1;
Some examples
• C++
• Most tokens are easy to recognize.
• Template syntax: Foo<Bar>
• Stream syntax: cin >> var;
• With nested templates there is a conflict: in Foo<Bar<Bazz>> the closing >> looks like the stream operator
• Is if two variables i and f?
• Is == two equal signs = = or a single token?
Solution
• Left-to-right scan
• Lookahead is sometimes required.
Regular languages
• Regular languages are one of several formalisms for specifying tokens.
• They are a simple and useful theory
• Easy to understand
• Efficient implementations
• Definition: Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.
Examples of languages
English
• Alphabet = characters
• Language = sentences
Programming language
• Alphabet = ASCII
• Language = programs
Notations
• Languages are sets of strings.
• Need some notation for specifying which sets we want
• The standard notation for regular languages is regular expressions.
Regular expressions
• Single character: 'c' = {"c"}
• Epsilon: ε = {""}
• Union: A + B = {s | s ∈ A or s ∈ B}
• Concatenation: AB = {ab | a ∈ A and b ∈ B}
• Iteration: A* = the union of A^i over all i ≥ 0, where A^i = A A … A (i times) and A^0 = {""}
Regular expressions
• Definition: The regular expressions over Σ are the smallest set of expressions including
• ε
• 'c' where c ∈ Σ
• A + B where A, B are rexp over Σ
• AB where A, B are rexp over Σ
• A* where A is a rexp over Σ
Examples
• Keywords: "else" or "if" or …
• 'else' + 'if' + …
• 'else' abbreviates 'e' 'l' 's' 'e'
• Integer: a non-empty string of digits
• digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
• integer = digit digit*
• Abbreviation: A^+ = AA*
• Identifier: strings of letters or digits, starting with a letter
• letter = 'A' + … + 'Z' + 'a' + … + 'z'
• identifier = letter (letter + digit)*
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
Examples
• Phone number
• +251-911-00 00 00
• Σ = digits ∪ { -, +, ' ' }
• Email address
• Abc@abc.com
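Since each regular expression denotes a set of strings, the union, concatenation, and iteration operations can be tried out directly on small finite sets. The sketch below is an illustration only: the helper names (`union`, `concat`, `power`, `star_up_to`) are made up for this example, and because A* is infinite whenever A contains a non-empty string, `star_up_to` builds just the finite part up to a chosen bound.

```python
def union(a, b):
    """A + B: every string that is in A or in B."""
    return a | b


def concat(a, b):
    """AB: every string xy with x taken from A and y taken from B."""
    return {x + y for x in a for y in b}


def power(a, i):
    """A^i: A concatenated with itself i times; A^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result


def star_up_to(a, max_i):
    """Finite approximation of A*: the union of A^i for 0 <= i <= max_i."""
    result = set()
    for i in range(max_i + 1):
        result |= power(a, i)
    return result


# digit = '0' + '1' + ... + '9'
digit = set("0123456789")

# integer = digit digit*, cut off at three digits so the set stays finite.
integers_up_to_3_digits = concat(digit, star_up_to(digit, 2))
```

With these definitions, `"42" in integers_up_to_3_digits` is true while the empty string is not a member, which matches "a non-empty string of digits": the leading `digit` guarantees at least one character.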
• There are regular expressions everywhere.
• Everything discussed so far is syntax, not semantics (meaning).
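To tie the pieces together, the token definitions from the slides can be written as patterns for Python's `re` module and combined into one scanner. This is a minimal sketch under stated assumptions, not the course's implementation: the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, and `(`, `)`, `=`, `;` are labelled `LParen`, `RParen`, `Assign`, `Semicolon` only because Python group names must be identifiers. In `re` syntax the slides' union `+` is written `|`, and `+` means "one or more".

```python
import re

# Order matters: Python's `re` takes the first alternative that matches,
# not the longest one, so '==' is listed before '=' and keywords before
# identifiers.
TOKEN_SPEC = [
    ("Whitespace", r"[ \n\t]+"),             # non-empty run of blanks, newlines, tabs
    ("Keyword",    r"(?:else|if)\b"),        # 'else' + 'if', not followed by a word character
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"), # letter (letter + digit)*
    ("Integer",    r"[0-9]+"),               # digit digit*
    ("Relation",   r"=="),
    ("Assign",     r"="),
    ("LParen",     r"\("),
    ("RParen",     r"\)"),
    ("Semicolon",  r";"),
]

# One named group per token type, joined into a single alternation.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (token_type, lexeme) pairs, discarding whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "Whitespace":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;")` yields fifteen tokens, starting with `('Keyword', 'if')`, `('LParen', '(')`, `('Identifier', 'i')`, `('Relation', '==')`. The `\b` after the keyword alternatives keeps a name such as `iffy` from being split into the keyword `if` followed by an identifier.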