Chapter 4 - Context-Free Grammars and Languages

Uploaded by

lehuy5923
Chapter 4

Context-Free Grammars and Languages

Context Free Grammars
• The difficulty with regular languages:
• Many languages are simply not regular.
• Some important languages cannot be expressed using regular
expressions.

• Consider the language:

{ (^n )^n | n >= 0 }, that is, n opening parentheses followed by n closing parentheses

Context Free Grammars
• What can regular languages express?
• Why aren’t they sufficient for recognizing arbitrarily nested structure?
• Look at a simple two-state machine

Context Free Grammars
• A context free grammar is a formal notation for expressing such recursive definitions of
languages.
• An example: the language of palindromes
• A palindrome is a string that reads the same forward and backward such as otto or madamimadam
(“Madam, I’m Adam”).
• String w is a palindrome if and only if w = w^R (w^R is the reverse of w)
• To make things simple, we shall consider describing only the palindromes with alphabet {0, 1}
• This language includes strings like 0110, 11011, and ε, but not 011 or 0101
• Context-free grammar for palindromes
P → ε
P → 0
P → 1
P → 0P0
P → 1P1

• Or P → 0P0 | 1P1 | 0 | 1 | ε
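The five palindrome productions can be read directly as a recursive test. The following is a minimal sketch in Python; the function name derives_palindrome is an assumption made for illustration and is not part of the slides.

```python
def derives_palindrome(w: str) -> bool:
    """Return True if w can be derived from P -> ε | 0 | 1 | 0P0 | 1P1."""
    # Base cases correspond to P -> ε, P -> 0, and P -> 1.
    if w in ("", "0", "1"):
        return True
    # Recursive cases correspond to P -> 0P0 and P -> 1P1:
    # the first and last symbols must match, and the middle must derive from P.
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return derives_palindrome(w[1:-1])
    return False


if __name__ == "__main__":
    for s in ["0110", "11011", "", "011", "0101"]:
        print(repr(s), derives_palindrome(s))
```

Each recursive call strips one matching pair of outer symbols, which mirrors one application of P → 0P0 or P → 1P1.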
Definition of Context-Free Grammars
• There are four important components in a grammatical description of a language:
1. A finite set of terminals (terminal symbols) that form the strings of the language being
defined.
• This set was {0, 1} in the palindrome example we just saw.
2. A finite set of variables (nonterminals or syntactic categories)
• Each variable represents a language; i.e., a set of strings
• In our example above, there was only one variable P, which we used to represent the class of palindromes
over alphabet {0, 1}
3. A start symbol
• One of the variables represents the language being defined
4. A finite set of productions (rules) that represent the recursive definition of a language. Each
production consists of:
• A variable that is being (partially) defined by the production. This variable is often called the head of the
production.
• The production symbol →
• A string of zero or more terminals and variables (the body of the production)

Definition of Context-Free Grammars (CFG)
• A CFG consists of
• A set of terminals T
• A set of non-terminals N
• A start symbol S (S ∈ N)
• A set of productions (rules) R
A → B1 B2 … Bn,
A ∈ N
Bi ∈ N ∪ T ∪ {ε}

which means A can be replaced by B1 B2 … Bn
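One possible way to hold the four components (T, N, S, R) in a program is sketched below. The class name CFG, its field names, and the is_well_formed check are assumptions chosen for illustration; an empty tuple stands for an ε body.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S
    productions: tuple        # R: (head, body) pairs; body is a tuple of symbols, () means ε

    def is_well_formed(self) -> bool:
        """Check S ∈ N, T ∩ N = ∅, every head in N, every body symbol in N ∪ T."""
        if self.start not in self.nonterminals:
            return False
        if self.terminals & self.nonterminals:
            return False
        symbols = self.terminals | self.nonterminals
        return all(
            head in self.nonterminals and all(s in symbols for s in body)
            for head, body in self.productions
        )


# A small grammar for nested parentheses: S → ( S ) | ε
PARENS = CFG(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=(("S", ("(", "S", ")")), ("S", ())),
)

if __name__ == "__main__":
    print(PARENS.is_well_formed())
```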

Notation for CFG Derivations
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, …, 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.

2. These symbols are nonterminals:


(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be
used to represent nonterminals for the constructs. For example, nonterminals for
expressions, terms, and factors are often represented by E, T, and F, respectively.

3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols;
that is, either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, …, z, represent (possibly empty)
strings of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of
grammar symbols. Thus, a generic production can be written as A → α, where A is
the head and α is the body.
6. A set of productions A → α1, A → α2, …, A → αk
with a common head A (call them A-productions), may be written A → α1 | α2 | … | αk.
Call α1, α2, …, αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.

Example
• Example of a Context-free Grammar
S → ( S )
S → ε

• Set of terminals T = {(, )}
• Set of non-terminals N = {S}

Example
• Example of a Context-free Grammar
expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id

• Set of terminals T = ?
• Set of non-terminals N = ?

Example
CFG that represents expressions in a typical programming language (addition and
multiplication).
• Every identifier must begin with a or b, which may be followed by any string in {a, b, 0, 1}*
-> We need two variables in this grammar:
• E: represents expressions
• I: represents identifiers

1. E → I
2. E → E+E
3. E → E*E
4. E → (E)
5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1

Context Free Grammars
Key idea
1. Begin with a string consisting of the start symbol “S”
2. Replace any non-terminal A in the string by the right-hand side of
some production
A → B1 … Bn
3. Repeat (2) until there are no non-terminals in the string

Derivations Using a Grammar
• An approach to defining the language of a grammar, in which we use
the productions from head to body.
• We expand the start symbol using one of its productions (i.e. using a
production whose head is the start symbol).
• We further expand the resulting string by replacing one of the
variables by the body of one of its productions, and so on, until we
derive a string consisting entirely of terminals.
• The language of the grammar is all strings of terminals that we can
obtain in this way.
• This use of grammars is called derivation.

Derivations Using a Grammar
Given the grammar G:
1. E → I
2. E → E + E
3. E → E * E
4. E → (E)
5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1

• Show that a*(a+b00) is in the language of G

E ⇒ E*E ⇒ I*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(a+E)
⇒ a*(a+I) ⇒ a*(a+I0) ⇒ a*(a+I00) ⇒ a*(a+b00)

Derivations Using a Grammar

• We can use the relation ⇒* (derives in zero or more steps) to condense the derivation.

E ⇒ E*E ⇒ I*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(a+E)
⇒ a*(a+I) ⇒ a*(a+I0) ⇒ a*(a+I00) ⇒ a*(a+b00)

E ⇒* a*(a+b00)

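A derivation can be checked mechanically: each sentential form must follow from the previous one by replacing exactly one variable with one of its production bodies. The sketch below assumes single-character symbols so that sentential forms can be plain strings; the names PRODUCTIONS, is_one_step, and is_derivation are illustrative.

```python
# Grammar G: every symbol is a single character, so a sentential form is a string.
PRODUCTIONS = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}


def is_one_step(before: str, after: str) -> bool:
    """True if `after` results from replacing one variable of `before` by one of its bodies."""
    for i, symbol in enumerate(before):
        for body in PRODUCTIONS.get(symbol, []):
            if before[:i] + body + before[i + 1:] == after:
                return True
    return False


def is_derivation(forms) -> bool:
    """True if every consecutive pair of sentential forms is a valid single step."""
    return all(is_one_step(x, y) for x, y in zip(forms, forms[1:]))


DERIVATION = [
    "E", "E*E", "I*E", "a*E", "a*(E)", "a*(E+E)", "a*(I+E)", "a*(a+E)",
    "a*(a+I)", "a*(a+I0)", "a*(a+I00)", "a*(a+b00)",
]

if __name__ == "__main__":
    print(is_derivation(DERIVATION))
```

A list that passes is_derivation, starts with E, and ends in a terminal string witnesses E ⇒* a*(a+b00).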
Leftmost and Rightmost Derivations
• Leftmost derivation: at each step we replace the leftmost variable by
one of its production bodies.
• We use the relations ⇒_lm and ⇒*_lm for one step and for zero or more steps, respectively.
• Rightmost derivation: at each step the rightmost variable is replaced
by one of its bodies.
• We use the relations ⇒_rm and ⇒*_rm for one step and for zero or more steps, respectively.

Leftmost Derivation
Given the grammar G:
E → I | E+E | E*E | (E)
I → Ia | Ib | I0 | I1 | a | b
• Show that a*(a+b00) is in the language of G using leftmost derivation
• E ⇒_lm E*E ⇒_lm I*E ⇒_lm a*E ⇒_lm a*(E) ⇒_lm a*(E+E) ⇒_lm a*(I+E) ⇒_lm a*(a+E)
⇒_lm a*(a+I) ⇒_lm a*(a+I0) ⇒_lm a*(a+I00) ⇒_lm a*(a+b00)

• We can also summarize the leftmost derivation by saying: E ⇒*_lm a*(a+b00)

Rightmost Derivation
Given the grammar G:
E → I | E+E | E*E | (E)
I → Ia | Ib | I0 | I1 | a | b
• Show that a*(a+b00) is in the language of G using rightmost
derivation
• E ⇒_rm E*E ⇒_rm E*(E) ⇒_rm E*(E+E) ⇒_rm E*(E+I) ⇒_rm E*(E+I0) ⇒_rm E*(E+I00) ⇒_rm E*(E+b00)
⇒_rm E*(I+b00) ⇒_rm E*(a+b00) ⇒_rm I*(a+b00) ⇒_rm a*(a+b00)

• We can also summarize the rightmost derivation by saying: E ⇒*_rm a*(a+b00)

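Leftmost and rightmost derivations differ only in which variable is chosen at each step. The sketch below replays a list of production bodies, always rewriting the leftmost (or rightmost) variable. Single-character symbols are assumed, and the names step, replay, LEFTMOST, and RIGHTMOST are illustrative.

```python
PRODUCTIONS = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["Ia", "Ib", "I0", "I1", "a", "b"],
}
VARIABLES = set(PRODUCTIONS)


def step(form: str, body: str, leftmost: bool = True) -> str:
    """Replace the leftmost (or rightmost) variable of `form` by `body`."""
    positions = [i for i, s in enumerate(form) if s in VARIABLES]
    if not positions:
        raise ValueError("no variable left to replace")
    i = positions[0] if leftmost else positions[-1]
    if body not in PRODUCTIONS[form[i]]:
        raise ValueError(f"{form[i]} -> {body} is not a production")
    return form[:i] + body + form[i + 1:]


def replay(bodies, leftmost: bool = True):
    """Return all sentential forms obtained by applying `bodies` in order, starting from E."""
    forms = ["E"]
    for body in bodies:
        forms.append(step(forms[-1], body, leftmost))
    return forms


# Bodies used, in order, by the leftmost and rightmost derivations of a*(a+b00).
LEFTMOST = ["E*E", "I", "a", "(E)", "E+E", "I", "a", "I", "I0", "I0", "b"]
RIGHTMOST = ["E*E", "(E)", "E+E", "I", "I0", "I0", "b", "I", "a", "I", "a"]

if __name__ == "__main__":
    print(" => ".join(replay(LEFTMOST)))
    print(" => ".join(replay(RIGHTMOST, leftmost=False)))
```

Both replays use the same multiset of productions and end in the same terminal string; only the order of application differs.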
The Language of a Grammar
• If G = (T, N, S, R) is a CFG, the language of G, denoted L(G), is the
set of terminal strings that have derivations from the start symbol.
• L(G) = { w in T* | S ⇒*_G w }

• If a language L is the language of some context-free grammar, then L
is said to be a context-free language, or CFL.

• The grammar P → 0P0 | 1P1 | 0 | 1 | ε defined the language of
palindromes over alphabet {0,1}. Thus, the set of palindromes is a
context-free language.

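For a small grammar, the members of L(G) up to a given length can be listed by exploring sentential forms breadth-first. The sketch below assumes single-character symbols and a grammar such as the palindrome grammar, in which every body that lengthens a form also adds terminals (otherwise the search might not terminate). The function name language_up_to is illustrative.

```python
from collections import deque


def language_up_to(productions, start, max_len):
    """Collect all terminal strings of length <= max_len derivable from `start`."""
    variables = set(productions)
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        form = queue.popleft()
        # Expanding only the leftmost variable is enough to reach every terminal string.
        idx = next((i for i, s in enumerate(form) if s in variables), None)
        if idx is None:
            words.add(form)
            continue
        for body in productions[form[idx]]:
            new = form[:idx] + body + form[idx + 1:]
            n_terminals = sum(1 for s in new if s not in variables)
            # Prune forms that already contain too many terminals.
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words


PALINDROMES = {"P": ["0P0", "1P1", "0", "1", ""]}  # "" stands for ε

if __name__ == "__main__":
    print(sorted(language_up_to(PALINDROMES, "P", 3), key=lambda w: (len(w), w)))
```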
Exercises
Exercise 1: The following grammar generates the language of the regular
expression 0*1(0+1)*
S → A1B
A → 0A | ε
B → 0B | 1B | ε
Give leftmost and rightmost derivations of the following strings:
a) 00101
b) 1001
c) 00011

Exercises
Exercise 2: Design context-free grammars for the following languages
a) The set {a^n | n > 0}
b) The set {0^n 1^n | n ≥ 1}, that is, the set of all strings of one or more 0’s
followed by an equal number of 1’s.
c) The set {a^n b^m | n > 0, m > 0}
d) The set {a^n b^m | n > 1, 0 < m < n}
e) The set {a^n b^m | n > 0, 0 <= m < n}
f) The set of all strings with an equal number of a’s and b’s.

Parse Trees and Derivations
• A derivation can be drawn as a tree.
• Start symbol is the root of the tree
• For a production A → B1 B2 … Bn add children B1, B2, …, Bn to node A

[Figure: node A with children B1, B2, …, Bn]

Parse Trees and Derivations
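• Grammar
E → E + E | E * E | - E | ( E ) | id
• String
- (id + id)
• Derivation
E
⇒ - E
⇒ - ( E )
⇒ - ( E + E )
⇒ - ( id + E )
⇒ - ( id + id )

[Figure: parse tree for - (id + id). The root E has children - and E; that E has children (, E, and ); the inner E has children E, +, and E, each of which has the single child id.]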
Parse Trees and Derivations
• A parse tree is a graphical representation of a derivation that filters out the
order in which productions are applied to replace nonterminals.
• A parse tree has
• Terminals at the leaves
• Non-terminals at the interior nodes
• Each interior node of a parse tree represents the application of a production
• The interior node is labeled with the nonterminal (A) in the head of the production
• The children of the node are labeled, from left to right, by the symbols in the body of
the production by which this A was replaced during the derivation
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations; the input string does not.

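The properties above can be illustrated with a small tree structure: interior nodes carry nonterminals, leaves carry terminals, and reading the leaves from left to right gives back the input. The class Node and the function tree_yield below are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # grammar symbol at this node
    children: list = field(default_factory=list)  # empty list means a leaf


def tree_yield(node: Node) -> list:
    """Return the leaf labels of the tree, read from left to right."""
    if not node.children:
        return [node.label]
    leaves = []
    for child in node.children:
        leaves.extend(tree_yield(child))
    return leaves


# Parse tree for - ( id + id ) under E → E + E | E * E | - E | ( E ) | id
TREE = Node("E", [
    Node("-"),
    Node("E", [
        Node("("),
        Node("E", [
            Node("E", [Node("id")]),
            Node("+"),
            Node("E", [Node("id")]),
        ]),
        Node(")"),
    ]),
])

if __name__ == "__main__":
    print(" ".join(tree_yield(TREE)))
```

The tree records which production was applied at each interior node but not the order in which the nodes were expanded.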
Ambiguity
• A grammar that produces more than one parse tree for some string is
said to be ambiguous.
• Equivalently, there is more than one left-most or right-most derivation for
some string.
• Grammar
E → E + E | E * E | - E | ( E ) | id
• String
id + id * id
• This string has two parse trees

Ambiguity
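• The string id + id * id has two parse trees, corresponding to two leftmost derivations:

E => E + E                E => E * E
  => id + E                 => E + E * E
  => id + E * E             => id + E * E
  => id + id * E            => id + id * E
  => id + id * id           => id + id * id

[Figure: the two parse trees. In the first, the root expands to E + E, with left child id and right child E * E. In the second, the root expands to E * E, with left child E + E and right child id.]
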
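Ambiguity of this particular grammar can be observed by counting parse trees. The sketch below is specific to the grammar E → E + E | E * E | - E | ( E ) | id and counts, for each span of the token list, the trees rooted at E. The function name count_parse_trees is an assumption; this is a counting aid, not a general ambiguity test.

```python
from functools import lru_cache


def count_parse_trees(tokens) -> int:
    """Count parse trees for `tokens` under E → E + E | E * E | - E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # Number of trees deriving tokens[i:j] from E.
        if i >= j:
            return 0
        total = 0
        if j - i == 1 and tokens[i] == "id":             # E → id
            total += 1
        if tokens[i] == "-":                               # E → - E
            total += count(i + 1, j)
        if tokens[i] == "(" and tokens[j - 1] == ")":      # E → ( E )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                      # E → E + E and E → E * E
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


if __name__ == "__main__":
    print(count_parse_trees(["id", "+", "id", "*", "id"]))
```

A count greater than 1 for some string shows that the grammar is ambiguous.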
Ambiguity
• Ambiguity is BAD: an ambiguous grammar leaves the structure, and hence the meaning, of some strings unspecified.

• Removing Ambiguity From Grammars
• Have to know the reason for ambiguity
• Eliminate ambiguity

Exercises
Exercise 1: Consider the context-free grammar:
S → S S + | S S * | a
and the string aa + a*
a) Give a leftmost derivation for the string.
b) Give a rightmost derivation for the string.
c) Give a parse tree for the string
d) Is the grammar ambiguous or unambiguous? Justify your answer
e) Describe the language generated by this grammar

Exercises
Exercise 2: Repeat Exercise 1 for each of the following grammars and
strings:
a) S → 0 S 1 | 0 1 with string 000111
b) S → + S S | * S S | a with string + * aaa
c) S → S ( S ) S | ε with string (()())
d) S → S + S | S S | ( S ) | S * | a with string (a + a)*a
e) S → ( L ) | a
L → L , S | S with string ((a, a), a, (a))
f) S → a S b S | b S a S | ε with string aabbab

Exercises
Exercise 3: Design grammars for the following languages:

a) The set of all strings of 0’s and 1’s that are palindromes; that is, the
string reads the same backward as forward.

b) The set of all strings of 0’s and 1’s with an equal number of 0’s and
1’s

Exercises
Exercise 4: Let Σ = {void, int, float, double, name, (, ), ,, ;}.
• Let's write a CFG for C-style function prototypes!
• Examples:
void name(int name, double name);
int name();
int name(double name);
int name(int, int name, float);
void name(void);

Exercises
Exercise 5: The following is a sample C++ program. Keywords are written in bold.
void add(int a, float b) {
    int sum = 0;
    while (sum <= 50000) {
        sum = sum - 10.43 + 34E4;
        if (sum == 4000)
            break;
    }
}
• Let's write a CFG for the above program.

Applications of Context-Free Grammars
1. Grammars are used to describe programming languages.
• There is a mechanical way of turning the language description as a CFG into a
parser, the component of the compiler that discovers the structure of the source
program and represents that structure by a parse tree.
• This application is one of the earliest uses of CFG’s; in fact it is one of the first
ways in which theoretical ideas in Computer Science found their way into
practice
2. The development of XML (Extensible Markup Language)
• XML is widely predicted to facilitate electronic commerce by allowing participants
to share conventions regarding the format of orders, product descriptions, and
many other kinds of documents.

Parsers
• Many aspects of a programming language have a structure that may be
described by regular expressions
• For instance, identifiers could be represented by regular expressions.
• However, there are also some very important aspects of typical
programming languages that cannot be represented by regular
expressions alone.
• One example is the use of parentheses and/or brackets in a nested and balanced fashion.
• Strings of balanced parentheses are (()), ()(), (()()), and ε, while )( and (() are not
• A grammar Gbal = ({S}, {(, )}, R, S) generates all and only the strings of balanced
parentheses, where R consists of the productions: S → SS | (S) | ε

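The language of Gbal can also be recognized with a single counter, which is exactly the unbounded memory a finite automaton lacks. The sketch below does not use the grammar directly; it relies on the standard characterization of balanced strings (no prefix closes more parentheses than it opens, and the totals match). The function name is_balanced is illustrative.

```python
def is_balanced(s: str) -> bool:
    """True if s is a string of balanced parentheses, i.e. s is in L(Gbal)."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' appeared with no matching '(' before it
                return False
        else:
            return False       # only '(' and ')' are terminals of Gbal
    return depth == 0          # every '(' must have been closed


if __name__ == "__main__":
    for s in ["(())", "()()", "(()())", "", ")(", "(()"]:
        print(repr(s), is_balanced(s))
```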
The YACC Parser Generator
• The input to YACC is a CFG.
• Associated with each production is an action, which is a fragment of C code
that is performed whenever a node of the parse tree that (with its children)
corresponds to this production is created.
• A sample of a CFG in the YACC notation.

Markup Languages
• The “strings” in these languages are documents with certain marks
(called tags) in them.
• Tags tell us something about the semantics of various strings within
the document.
• The most familiar markup language, HTML, has two major functions: creating links between
documents and describing the format (“look”) of a document.

Normal Forms for Context Free Grammars
Chomsky Normal Form
• A CFG in which all productions are of the form A → BC or A → a,
where A, B and C are variables and a is a terminal.
• To get Chomsky Normal Form, we need to:
• eliminate ε-productions
• the form A → ε for some variable A
• eliminate unit productions
• the form A → B for variables A and B
• eliminate useless symbols
• variables or terminals that do not appear in any derivation of a terminal string from the
start symbol.

Eliminating Productions
• productions, while a convenience in many grammar design
problems, are not essential.
• Of course without a production that has an body, it is impossible to generate
the empty string as a member of the language
• Our strategy is to begin by discovering which variables are “nullable”.
• If A  is a production of G then A is nullable
• If there is a production BC1C2….Ck where each Ci is nullable then B is
nullable

40
Example
Consider the grammar G, eliminate ε-productions:
S → AB
A → aAA | ε
B → bBB | ε
Find the nullable symbols.
• A and B are directly nullable because they have productions with ε as the
body.
• S is nullable, because the production S → AB has a body consisting of
nullable symbols only

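The two rules for nullable variables amount to a fixed-point iteration: keep adding variables until nothing changes. A minimal sketch follows, assuming single-character symbols, bodies written as strings, and "" standing for ε; the function name nullable_variables is illustrative.

```python
def nullable_variables(productions) -> set:
    """Return the set of variables A with A ⇒* ε."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            # A body qualifies if all its symbols are nullable;
            # the empty body "" (ε) qualifies trivially.
            if any(all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


GRAMMAR = {
    "S": ["AB"],
    "A": ["aAA", ""],
    "B": ["bBB", ""],
}

if __name__ == "__main__":
    print(sorted(nullable_variables(GRAMMAR)))
```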
Example
Consider the grammar G, eliminate ε-productions:
S → AB
A → aAA | ε
B → bBB | ε
A, B, S are nullable

Construct the productions of the resulting grammar:

• Consider S → AB
• All symbols of the body are nullable, so there are four ways we could choose present or absent for A and B, independently.
• We are not allowed to choose to make all symbols absent, so there are only three productions
S → AB | A | B
• Consider production A → aAA
• The second and third positions hold nullable symbols, so again there are four choices of present/absent.
• In this case, all four choices are allowable since the nonnullable symbol a will be present in any case.
• Our four choices yield productions: A → aAA | aA | aA | a
• The two middle choices happen to yield the same production, so we eliminate one of them: A → aAA | aA | a
• Similarly, the production B → bBB yields: B → bBB | bB | b

Example
Consider the grammar G, eliminate ε-productions:
S → AB
A → aAA | ε
B → bBB | ε

Grammar G after eliminating ε-productions:

S → AB | A | B
A → aAA | aA | a
B → bBB | bB | b

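The present/absent choices can be enumerated mechanically. The sketch below assumes single-character symbols, bodies written as strings, and "" standing for ε; it first computes the nullable variables and then generates every variant of each body, discarding the empty one. The function names are illustrative.

```python
from itertools import product


def nullable_variables(productions) -> set:
    """Fixed-point computation of the variables that derive ε."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head not in nullable and any(
                all(s in nullable for s in body) for body in bodies
            ):
                nullable.add(head)
                changed = True
    return nullable


def eliminate_epsilon(productions) -> dict:
    """Return an equivalent grammar (except possibly for ε itself) without ε-productions."""
    nullable = nullable_variables(productions)
    result = {}
    for head, bodies in productions.items():
        new_bodies = []
        for body in bodies:
            # Each nullable symbol may be kept or dropped; other symbols are always kept.
            options = [(s, "") if s in nullable else (s,) for s in body]
            for choice in product(*options):
                candidate = "".join(choice)
                # Skip the all-absent choice and duplicates.
                if candidate and candidate not in new_bodies:
                    new_bodies.append(candidate)
        result[head] = new_bodies
    return result


GRAMMAR = {"S": ["AB"], "A": ["aAA", ""], "B": ["bBB", ""]}

if __name__ == "__main__":
    for head, bodies in eliminate_epsilon(GRAMMAR).items():
        print(head, "→", " | ".join(bodies))
```

Note that the resulting grammar cannot generate ε, even though the original grammar could (S is nullable).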
Eliminating Unit Productions
• A unit production is a production of the form A → B, where both A and
B are variables.
• Unit productions can introduce extra steps into derivations that technically
need not be there
• Example:
I → a | b | Ia | Ib | I0 | I1
F → I | (E)
T → F | T * F
E → T | E + T
• (A, A) is a unit pair for every variable A
• Suppose we have determined that (A, B) is a unit pair, and B → C is a
production where C is a variable. Then (A, C) is a unit pair
Eliminating Unit Productions
• To eliminate unit productions we proceed as follows. Given a CFG
G=(V, T, P, S), construct CFG G1=(V, T, P1, S):
1. Find all the unit pairs of G
2. For each unit pair (A, B), add to P1 all the productions A → α,
where B → α is a nonunit production in P.
Note that A = B is possible; in that way, P1 contains all the nonunit
productions in P

Eliminating Unit Productions
• Example: Given a grammar for simple arithmetic expressions, eliminate the unit productions.
E → T | E + T
T → F | T * F
F → I | (E)
I → a | b | Ia | Ib | I0 | I1

1. Find all the unit pairs of G
(E, E), (T, T), (F, F), (I, I), (E, T), (T, F), (F, I), (E, F), (T, I), (E, I)
2. Eliminate unit productions
E → E + T | T * F | (E) | a | b | Ia | Ib | I0 | I1
T → T * F | (E) | a | b | Ia | Ib | I0 | I1
F → (E) | a | b | Ia | Ib | I0 | I1
I → a | b | Ia | Ib | I0 | I1
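The two-step procedure can be sketched as follows, assuming single-character symbols and bodies written as strings. The function names unit_pairs and eliminate_unit are illustrative; the pair set includes the trivial pairs (A, A).

```python
def unit_pairs(productions) -> set:
    """Return all pairs (A, B) such that A ⇒* B using unit productions only."""
    variables = set(productions)
    pairs = {(a, a) for a in variables}          # basis: (A, A) for every variable
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for body in productions[b]:
                # B → C is a unit production when the body is a single variable.
                if body in variables and (a, body) not in pairs:
                    pairs.add((a, body))
                    changed = True
    return pairs


def eliminate_unit(productions) -> dict:
    """For each unit pair (A, B), give A every nonunit body of B."""
    variables = set(productions)
    result = {a: [] for a in productions}
    for a, b in sorted(unit_pairs(productions)):
        for body in productions[b]:
            if body not in variables and body not in result[a]:
                result[a].append(body)
    return result


GRAMMAR = {
    "E": ["T", "E+T"],
    "T": ["F", "T*F"],
    "F": ["I", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}

if __name__ == "__main__":
    for head, bodies in eliminate_unit(GRAMMAR).items():
        print(head, "→", " | ".join(bodies))
```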
Eliminating Useless Symbols
• We say a symbol X is useful for a grammar G=(V, T, P, S) if there is
some derivation of the form S ⇒* αXβ ⇒* w, where w is in T*.
• X may be in either V or T, and the sentential form αXβ might be the first or
last in the derivation.
• If X is not useful we say it is useless

=> omitting useless symbols from a grammar will not change the
language generated, so we may as well detect and eliminate all useless
symbols

Eliminating Useless Symbols
• The approach to eliminating useless symbols begins by identifying two properties that a useful symbol must have:
• We say X is generating if X ⇒* w for some terminal string w
• every terminal is generating, since w can be that terminal itself, which is derived by zero
steps
• We say X is reachable if there is a derivation S ⇒* αXβ for some α and β

=> a symbol that is useful will be both generating and reachable

Example
Consider the grammar: S → AB | a
A → b

• All symbols but B are generating
• a and b generate themselves
• S generates a
• A generates b
• If we eliminate B, we must eliminate the production S → AB, leaving the grammar:
S → a
A → b
• Now, we find that only S and a are reachable from S
• Eliminating A and b leaves only the production S → a
• That production by itself is a grammar whose language is {a}, just as is the language of the original grammar

Notes:
- eliminate the symbols that are not generating first
- then eliminate from the remaining grammar those symbols that are not reachable

Computing the Generating Symbols
Let G=(V, T, P, S) be a grammar.

To compute the generating symbols of G we perform:

• Every terminal is generating

• If there is a production A → α, and every symbol of α is already known to be
generating => A is generating (includes the case where α = ε)

Example
Consider the grammar: S → AB | a
A → b

• a and b are generating
• use the production A → b to conclude that A is generating
• use the production S → a to conclude that S is generating
• !!! We cannot use the production S → AB because B has not been
established to be generating

=> the set of generating symbols is {a, b, A, S}

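The computation of generating symbols is again a fixed-point iteration. A minimal sketch follows, assuming single-character symbols and bodies written as strings ("" would stand for ε); the function name generating_symbols is illustrative.

```python
def generating_symbols(productions, terminals) -> set:
    """Return every symbol X with X ⇒* w for some terminal string w."""
    generating = set(terminals)                  # basis: every terminal is generating
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            # A is generating if some body consists of generating symbols only.
            if any(all(s in generating for s in body) for body in bodies):
                generating.add(head)
                changed = True
    return generating


# B has no productions at all, so it never becomes generating.
GRAMMAR = {"S": ["AB", "a"], "A": ["b"]}
TERMINALS = {"a", "b"}

if __name__ == "__main__":
    print(sorted(generating_symbols(GRAMMAR, TERMINALS)))
```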
Computing the Reachable Symbols
Let G=(V, T, P, S) be a grammar.

• S is surely reachable
• If A is reachable, then for every production with A in the head, all the
symbols of the body of that production are also reachable

Example
Consider the grammar: S → AB | a
A → b

• S is reachable
• S has production bodies AB and a (S → AB | a) => we conclude that A,
B and a are reachable
• A has A → b, so we conclude that b is reachable

=> the set of reachable symbols is {S, A, B, a, b}

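Reachability is a graph search that starts at S and follows production bodies. The sketch below assumes single-character symbols and bodies written as strings; the function name reachable_symbols is illustrative.

```python
def reachable_symbols(productions, start) -> set:
    """Return every symbol X such that S ⇒* αXβ for some α and β."""
    reachable = {start}                          # basis: S is reachable
    stack = [start]
    while stack:
        symbol = stack.pop()
        # Terminals and variables without productions contribute nothing further.
        for body in productions.get(symbol, []):
            for s in body:
                if s not in reachable:
                    reachable.add(s)
                    stack.append(s)
    return reachable


GRAMMAR = {"S": ["AB", "a"], "A": ["b"]}
# The same grammar after the nongenerating symbol B has been removed.
WITHOUT_B = {"S": ["a"], "A": ["b"]}

if __name__ == "__main__":
    print(sorted(reachable_symbols(GRAMMAR, "S")))
    print(sorted(reachable_symbols(WITHOUT_B, "S")))
```

Running the search on the grammar with B already removed shows why the order matters: only then do A and b turn out to be unreachable.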
• To convert any CFG G into an equivalent CFG that has no useless
symbols, ε-productions or unit productions, a safe order is:

1. Eliminate ε-productions

2. Eliminate unit productions

3. Eliminate useless symbols

Chomsky Normal Form (CNF)
• A grammar in which all productions are in one of two simple forms,
either:
• A → BC, where A, B, and C are each variables, or
• A → a, where A is a variable and a is a terminal

• To put a grammar in Chomsky Normal Form
1. Eliminate ε-productions, unit productions, and useless symbols.
2. For every production that is not already in one of the two forms above:
a) Arrange that all bodies of length 2 or more consist only of variables
b) Break bodies of length 3 or more into a cascade of productions, each with a body
consisting of two variables

Chomsky Normal Form (CNF)
• Convert the following grammar to Chomsky Normal Form
E → E + T | T * F | (E) | a | b | Ia | Ib | I0 | I1
T → T * F | (E) | a | b | Ia | Ib | I0 | I1
F → (E) | a | b | Ia | Ib | I0 | I1
I → a | b | Ia | Ib | I0 | I1

• The grammar has no ε-productions, unit productions, or useless symbols

• Step (a): make all bodies of length 2 or more consist only of variables
E → EPT | TMF | LER | a | b | IA | IB | IZ | IO
T → TMF | LER | a | b | IA | IB | IZ | IO
F → LER | a | b | IA | IB | IZ | IO
I → a | b | IA | IB | IZ | IO
P → +
M → *
L → (
R → )
A → a
B → b
Z → 0
O → 1

• Step (b): break bodies of length 3 into cascades of two-variable bodies
E → EC1 | TC2 | LC3 | a | b | IA | IB | IZ | IO
T → TC2 | LC3 | a | b | IA | IB | IZ | IO
F → LC3 | a | b | IA | IB | IZ | IO
I → a | b | IA | IB | IZ | IO
C1 → PT
C2 → MF
C3 → ER
P → +
M → *
L → (
R → )
A → a
B → b
Z → 0
O → 1

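Steps (a) and (b) can be sketched in code. The sketch below assumes the input already has no ε-productions, unit productions, or useless symbols, and it represents bodies as tuples of symbols. The generated variable names (T_+ for a terminal, C1, C2, … for tails) are hypothetical and are assumed not to clash with existing variables; they differ from the letters P, M, L, R used on the slide.

```python
def to_cnf(productions, terminals) -> dict:
    """Apply steps (a) and (b) to a grammar free of ε-productions and unit productions."""
    result = {head: [] for head in productions}
    terminal_var = {}   # terminal -> variable deriving exactly that terminal
    tail_var = {}       # tuple of variables (length >= 2) -> variable deriving it

    def var_for_terminal(t):
        if t not in terminal_var:
            terminal_var[t] = f"T_{t}"           # hypothetical fresh variable name
            result[terminal_var[t]] = [(t,)]
        return terminal_var[t]

    def var_for_tail(tail):
        if tail not in tail_var:
            name = f"C{len(tail_var) + 1}"       # hypothetical fresh variable name
            tail_var[tail] = name
            result[name] = [binarize(tail)]
        return tail_var[tail]

    def binarize(body):
        # Step (b): body holds variables only and has length >= 2.
        if len(body) == 2:
            return body
        return (body[0], var_for_tail(body[1:]))

    for head, bodies in productions.items():
        for body in bodies:
            if len(body) >= 2:
                # Step (a): replace each terminal in a long body by its variable.
                body = tuple(var_for_terminal(s) if s in terminals else s for s in body)
                body = binarize(body)
            result[head].append(body)
    return result


TERMINALS = {"+", "*", "(", ")", "a", "b", "0", "1"}
IDENT = [("a",), ("b",), ("I", "a"), ("I", "b"), ("I", "0"), ("I", "1")]
GRAMMAR = {
    "E": [("E", "+", "T"), ("T", "*", "F"), ("(", "E", ")")] + IDENT,
    "T": [("T", "*", "F"), ("(", "E", ")")] + IDENT,
    "F": [("(", "E", ")")] + IDENT,
    "I": list(IDENT),
}

if __name__ == "__main__":
    for head, bodies in to_cnf(GRAMMAR, TERMINALS).items():
        print(head, "→", " | ".join(" ".join(b) for b in bodies))
```

Tails are shared through the tail_var table, so the body T * F produces the same helper variable whether it occurs under E or under T, matching the shared C1, C2, C3 above.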
Exercises
Exercise 1. Find a grammar equivalent to
S → AB | CA
A → a
B → BC | AB
C → aB | b
with no useless symbols.

Exercises
Exercise 2. Begin with the grammar
S → ASB | ε
A → aAS | a
B → SbS | A | bb
a. Eliminate ε-productions
b. Eliminate any unit productions in the resulting grammar
c. Eliminate any useless symbols in the resulting grammar
d. Put the resulting grammar into Chomsky Normal Form

Exercises
Exercise 3. Begin with the grammar
S → AAA | B
A → aA | B
B → ε
a. Eliminate ε-productions
b. Eliminate any unit productions in the resulting grammar
c. Eliminate any useless symbols in the resulting grammar
d. Put the resulting grammar into Chomsky Normal Form

Exercises
Exercise 4. Design a CNF grammar for the set of strings of balanced
parentheses. You need not start from any particular non-CNF grammar.
