
Lexical Analysis

Dr. Nguyen Hua Phung

HCMC University of Technology, Viet Nam

August 2016

Dr. Nguyen Hua Phung Lexical Analysis 1 / 38


Outline

1 Introduction

2 Roles

3 Implementation

4 Use ANTLR to generate Lexer

5 Regex and Parser Libraries in Scala



Compilation Phases

source program
  ↓
front end:
  lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator
  ↓
back end:
  code optimizer → code generator
  ↓
target program



Lexical Analysis

Like a word extractor
    i n ⇒ in  (individual characters are grouped into a word)

Like a spell checker
    I og to sochol  (flags "og" and "sochol"; intended: "I go to school")

Like a classifier
    I       am    a        student
    pronoun  verb  article  noun



Lexical Analysis Roles

Identify lexemes: substrings of the source program that belong to a grammar unit
Return tokens: the lexical category of each lexeme
Ignore whitespace such as blanks, newlines, tabs
Record the positions of tokens for use in later phases



Example on Lexeme and Token

result = oldsum - value / 100;

Lexemes Tokens
result IDENT
= ASSIGN_OP
oldsum IDENT
- SUBTRACT_OP
value IDENT
/ DIV_OP
100 INT_LIT
; SEMICOLON



How to build a lexical analyzer?

How to build a lexical analyzer for English?

About 65,000 words
Simply build a dictionary:
{(I, pronoun); (We, pronoun); (am, verb); ...}
Extract, search, compare
But for a programming language?
How many words?
Identifiers: abc, cab, Abc, aBc, cAb, ...
Integer literals: 1, 10, 120, 20, 210, ...
...
Too many words to build a dictionary, so how?



Finite Automata

[Figure: a finite automaton — a read-only head scans an input tape (e.g. ... b b a a a a ...) one symbol at a time, driven by a finite control with states q0, q1, ..., qn]


State Diagram

[State diagram: start state q0; transitions q0 −b→ q1 and q1 −b→ q0; self-loops q0 −a→ q0 and q1 −a→ q1]

Input: abaabb

Current state Read New State


q0 a q0
q0 b q1
q1 a q1
q1 a q1
q1 b q0
q0 b q1
Deterministic Finite Automata

Definition
A Deterministic Finite Automaton (DFA) is a 5-tuple
M = (K, Σ, δ, s, F) where
K is a finite set of states
Σ is an alphabet
s ∈ K is the initial state
F ⊆ K is the set of final states
δ is a transition function from K × Σ to K



Example

M = (K, Σ, δ, s, F)
where K = {q0, q1}, Σ = {a, b}, s = q0, F = {q1},
and δ is:

K  Σ  δ(K, Σ)
q0 a  q0
q0 b  q1
q1 a  q1
q1 b  q0

[State diagram: start state q0; q0 −b→ q1, q1 −b→ q0; self-loops q0 −a→ q0, q1 −a→ q1]
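The 5-tuple above maps directly onto code. A minimal sketch in Scala (the object name and the `accepts` helper are illustrative, not from the slides):

```scala
// DFA from the slide: K = {q0, q1}, Σ = {a, b}, s = q0, F = {q1}
object DfaExample {
  type State = String
  val start: State = "q0"
  val finals: Set[State] = Set("q1")

  // transition function δ: K × Σ → K
  val delta: Map[(State, Char), State] = Map(
    ("q0", 'a') -> "q0",
    ("q0", 'b') -> "q1",
    ("q1", 'a') -> "q1",
    ("q1", 'b') -> "q0"
  )

  // run the DFA over the input and check whether it ends in a final state
  def accepts(input: String): Boolean = {
    val end = input.foldLeft(start)((q, c) => delta((q, c)))
    finals.contains(end)
  }
}
```

Running it on the trace input from the previous slide, `accepts("abaabb")` ends in q1 and is accepted, matching the table.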



Nondeterministic Finite Automata

Permits several possible "next states" for a given
combination of current state and input symbol
Allows ε-transitions (moves on the empty string) in the state diagram
Helps simplify the description of automata
Every NFA is equivalent to some DFA



Example

Language L = ({ab} ∪ {aba})*

[State diagram of an NFA with states q0 (start), q1, q2 accepting ({ab} ∪ {aba})*]



Example

Language L = ({ab} ∪ {aba})*

[State diagram of an equivalent automaton with states q0 (start), q1, q2 for ({ab} ∪ {aba})*]



Regular Expression (regex)

Describe regular sets of strings

Symbols other than ( ) | * stand for themselves
Use ε for the empty string
Concatenation αβ: the first part matches α, the second part matches β
Union α | β: matches α or β
Kleene star α*: zero or more matches of α
Use ( ) for grouping



Example

(i|I)(f|F)
The Pascal keyword if, in any letter case:
if
IF
If
iF

E(0|1|2|3|4|5|6|7|8|9)*
An E followed by a (possibly empty) sequence of digits
E123
E9
E
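The same pattern can be checked programmatically with Scala's regex support, which these slides introduce later. `matchesWhole` is an illustrative helper that anchors the match over the whole string via the underlying `java.util.regex.Pattern`:

```scala
object RegexDemo {
  // An E followed by a (possibly empty) sequence of digits
  val pat = "E(0|1|2|3|4|5|6|7|8|9)*".r

  // whole-string (anchored) match using the underlying Java Pattern
  def matchesWhole(s: String): Boolean = pat.pattern.matcher(s).matches()
}
```

So `matchesWhole("E123")`, `matchesWhole("E9")`, and `matchesWhole("E")` all hold, exactly the examples above.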



Regular Expression and Finite Automata

[NFA constructions: ab — an a-transition followed by a b-transition; a|b — ε-branches into parallel a and b paths that ε-join; a* — an a-loop entered and exited via ε-transitions]


Convenience Notation

α+ = one or more (i.e. αα*)
α? = zero or one (i.e. (α|ε))
[xyz] = x|y|z
[x-y] = all characters from x to y, e.g. [0-9] = all ASCII digits
[^x-y] = all characters other than [x-y]
. matches any character
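These shorthands carry over unchanged into Scala regex literals; a small sketch with illustrative pattern names:

```scala
object CharClassDemo {
  // [0-9]+ : the + shorthand, one or more digits
  val digits = "[0-9]+".r
  // a typical identifier pattern built from character classes
  val ident = "[A-Za-z_][A-Za-z0-9_]*".r
  // -? : the ? shorthand, zero or one minus sign
  val signed = "-?[0-9]+".r
}
```

For example, `CharClassDemo.digits.findFirstIn("abc 42")` yields `Some("42")`.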



Example

Write a regular expression for each of:
Integer
Hexadecimal number
Fixed-point number
Floating-point number
String
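The slide leaves these as exercises. One possible set of answers, sketched as Scala regexes (the exact rules depend on the language being tokenized; the `string` pattern below ignores escape sequences for simplicity):

```scala
object TokenPatterns {
  val integer  = "[0-9]+".r                               // e.g. 0, 42
  val hex      = "0[Xx][0-9A-Fa-f]+".r                    // e.g. 0x1F
  val fixed    = "[0-9]+\\.[0-9]+".r                      // e.g. 3.14
  val floating = "[0-9]+(\\.[0-9]+)?[Ee][+-]?[0-9]+".r    // e.g. 1.5e-3, 2E10
  val string   = "\"[^\"]*\"".r                           // e.g. "hello" (no escapes)
}
```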



ANTLR [1]

ANother Tool for Language Recognition


Created by Terence Parr, Professor of Computer Science at the
University of San Francisco
A powerful parser/lexer generator



Lexer

/**
 * Filename: Hello.g4
 */
lexer grammar Hello;

// match any digits
INT : [0-9]+ ;

// hexadecimal number
HEX : 0[Xx][0-9A-Fa-f]+ ;

// match lower-case identifiers
ID : [a-z]+ ;

// skip spaces, tabs, newlines
WS : [ \t\r\n]+ -> skip ;

Lexical Analyzer

[Figure: the lexical analyzer reads the input tape (e.g. 1 0 . 0 e 2 0 ...) with lookahead and, using its rules r0, r1, ..., rn, returns the token whose rule gives the longest prefix match]
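Longest-prefix matching (maximal munch) can be sketched with the Scala regex methods introduced later in these slides; the token names and rule list here are illustrative:

```scala
import scala.util.matching.Regex

object MaxMunch {
  // candidate rules; each is matched as a prefix of the remaining input
  val rules: List[(String, Regex)] = List(
    "FLOAT" -> "[0-9]+\\.[0-9]+([Ee][+-]?[0-9]+)?".r,
    "INT"   -> "[0-9]+".r,
    "ID"    -> "[a-z]+".r,
    "WS"    -> "[ \\t\\r\\n]+".r
  )

  // return the (token, lexeme) pair with the longest prefix match, if any
  def nextToken(input: String): Option[(String, String)] = {
    val hits = for {
      (name, re) <- rules
      lexeme <- re.findPrefixOf(input)
    } yield (name, lexeme)
    if (hits.isEmpty) None else Some(hits.maxBy(_._2.length))
  }
}
```

On the tape from the figure, `nextToken("10.0e20 ...")` prefers FLOAT's match "10.0e20" over INT's shorter match "10".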



Scala Regex Library

Library: import scala.util.matching.Regex

Construction: new Regex(String)
    new Regex("[0-9]+")
    "[0-9]+".r
Methods:
    findFirstIn(String): Option[String]
    findFirstMatchIn(String): Option[Match]
    findPrefixOf(String): Option[String]
    findPrefixMatchOf(String): Option[Match]
    findAllIn(String): MatchIterator
    ...



Example

import scala.util.matching.Regex
val pat = new Regex("[0-9]+")
val pattern = "[a-z][a-z]*".r
val str = "123 abc 456"
pat.findFirstIn(str)      // Some("123")
pattern.findFirstIn(str)  // Some("abc")



Scala Parser Library

Library: scala.util.parsing.combinator.Parsers.Parser
Construction: new Parser[T]
    new Parser[Token]
    new Parser[Any]
Methods:
    ~    p1 ~ p2: matches p1 followed by p2
    |    p1 | p2: matches either p1 or p2, with preference given to p1
    ?    p1.? : optionally matches p1
    *    p1.* : matches any number of repetitions of p1
    ^^   p1 ^^ f: applies function f to the result of p1
    ^^^  p1 ^^^ v: replaces a successful result with the value v
    ...
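Putting the combinators above together, a small sketch, assuming the scala-parser-combinators module is on the classpath (`AssignParser` and its rules are illustrative, not from the slides):

```scala
import scala.util.parsing.combinator.RegexParsers

// A tiny parser: an identifier, '=', and an integer literal,
// e.g. "result = 100", built from ~, ~>, and ^^.
object AssignParser extends RegexParsers {
  val ident: Parser[String] = "[a-z]+".r
  val intLit: Parser[Int]   = "[0-9]+".r ^^ (_.toInt)   // ^^ applies a function

  // ~ sequences parsers; ~> keeps only the right-hand result
  val assign: Parser[(String, Int)] =
    ident ~ ("=" ~> intLit) ^^ { case name ~ value => (name, value) }

  def run(s: String): Option[(String, Int)] =
    parseAll(assign, s) match {
      case Success(result, _) => Some(result)
      case _                  => None
    }
}
```

Here `run("result = 100")` yields `Some(("result", 100))`; RegexParsers skips whitespace between tokens by default.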



Summary

A lexical analyzer is a pattern matcher that isolates the
small-scale parts of a program
Regular expressions describe tokens; finite automata
recognize them
How to write a lexical analyzer (lexer) in Scala



References I

[1] ANTLR, http://antlr.org, accessed 19 Aug 2016.

