Chapter 4 - Context-Free Grammars and Languages
Context Free Grammars
• The difficulty with regular languages:
• A lot of languages are simply not regular.
• Some important languages cannot be expressed using regular expressions.
• What can regular languages express?
• Why aren’t they sufficient for recognizing arbitrary nesting structure?
• Look at a simple two-state machine.
• A context-free grammar is a formal notation for expressing recursive definitions of languages, such as nested structure.
• An example: the language of palindromes
• A palindrome is a string that reads the same forward and backward, such as otto or madamimadam (“Madam, I’m Adam”).
• String w is a palindrome if and only if w = w^R, where w^R is the reverse of w.
• To keep things simple, we consider only the palindromes over the alphabet {0, 1}.
• This language includes strings like 0110, 11011, and ε, but not 011 or 0101.
• Context-free grammar for palindromes
P → ε
P → 0
P → 1
P → 0P0
P → 1P1
• Or, more compactly: P → 0P0 | 1P1 | 0 | 1 | ε
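The five palindrome productions can be read directly as a recursive test. The following is a minimal Python sketch; the function name `is_palindrome` is a hypothetical choice for illustration, not part of the slides.

```python
def is_palindrome(w: str) -> bool:
    """Return True if w over {0, 1} can be derived from P.

    Mirrors the productions directly:
      P -> ε | 0 | 1      (base cases)
      P -> 0P0 | 1P1      (recursive cases)
    """
    if any(c not in "01" for c in w):
        return False                      # not a string over the alphabet {0, 1}
    if len(w) <= 1:
        return True                       # P -> ε, P -> 0, P -> 1
    if w[0] == w[-1]:
        return is_palindrome(w[1:-1])     # P -> 0P0 or P -> 1P1
    return False
```

For example, `is_palindrome("0110")` holds while `is_palindrome("011")` does not, matching the strings listed above.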
Definition of Context-Free Grammars
• There are four important components in a grammatical description of a language:
1. A finite set of terminals (terminal symbols) that form the strings of the language being
defined.
• This set was {0, 1} in the palindrome example we just saw.
2. A finite set of variables (nonterminals or syntactic categories)
• Each variable represents a language; i.e., a set of strings
• In our example above, there was only one variable P, which we used to represent the class of palindromes
over alphabet {0, 1}
3. A start symbol
• One of the variables represents the language being defined
4. A finite set of productions (rules) that represent the recursive definition of a language. Each
production consists of:
• A variable that is being (partially) defined by the production. This variable is often called the head of the
production.
• The production symbol →
• A string of zero or more terminals and variables (the body of the production)
Definition of Context-Free Grammars (CFG)
• A CFG consists of
• A set of terminals T
• A set of non-terminals N
• A start symbol S (S ∈ N)
• A set of productions (rules) R, each of the form
A → B1B2…Bn,
where A ∈ N and each Bi ∈ N ∪ T ∪ {ε}
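One possible way to hold the four components in a program is sketched below in Python. The class name `CFG`, its field names, and the method `is_well_formed` are assumptions made for this illustration; an empty tuple is used as the body of an ε-production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S, which must be a member of N
    productions: tuple        # R: pairs (head, body); body is a tuple of symbols

    def is_well_formed(self) -> bool:
        """Check S ∈ N, T and N disjoint, heads in N, body symbols in N ∪ T."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                head in self.nonterminals and all(s in symbols for s in body)
                for head, body in self.productions
            )
        )


# The palindrome grammar; the empty tuple () is the ε body.
PALINDROMES = CFG(
    terminals=frozenset("01"),
    nonterminals=frozenset({"P"}),
    start="P",
    productions=(
        ("P", ()), ("P", ("0",)), ("P", ("1",)),
        ("P", ("0", "P", "0")), ("P", ("1", "P", "1")),
    ),
)
```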
Notation for CFG Derivations
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, ….., 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols;
that is, either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, …, z, represent (possibly empty)
strings of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of
grammar symbols. Thus, a generic production can be written as A → α, where A is
the head and α is the body.
6. A set of productions A → α1, A → α2, …, A → αk
with a common head A (call them A-productions) may be written A → α1 | α2 | … | αk.
Call α1, α2, …, αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.
Example
• Example of a Context-free Grammar
S → ( S )
S → ε
Example
• Example of a Context-free Grammar
expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id
Example
• A CFG that represents simple expressions in a typical programming language, with the
operators + and * (addition and multiplication).
• Every identifier must begin with a or b, which may be followed by any string in {a, b, 0, 1}*
• We need two variables in this grammar:
• E: represents expressions
• I: represents identifiers
1. E → I
2. E → E+E
3. E → E*E
4. E → (E)
5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1
Context Free Grammars
Key idea
1. Begin with a string consisting of the start symbol “S”
2. Replace any non-terminal A in the string by the right-hand side of
some production
A → B1 … Bn
3. Repeat (2) until there are no non-terminals in the string
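The three steps can be sketched in Python for the grammar S → ( S ) | ε. The names `PRODUCTIONS` and `derive` are hypothetical. For simplicity the sketch always replaces the leftmost nonterminal, although step 2 allows any nonterminal to be chosen.

```python
# Grammar S -> ( S ) | ε; bodies are strings and "" stands for ε.
PRODUCTIONS = {"S": ["(S)", ""]}


def derive(bodies):
    """Start from "S" and apply the chosen production bodies in order.

    Returns every intermediate string, ending with a string of terminals.
    """
    form = "S"                                   # step 1: the start symbol
    forms = [form]
    for body in bodies:
        # step 2: find a nonterminal (here: the leftmost one) and replace it
        index = next(i for i, s in enumerate(form) if s in PRODUCTIONS)
        if body not in PRODUCTIONS[form[index]]:
            raise ValueError(f"{form[index]} -> {body!r} is not a production")
        form = form[:index] + body + form[index + 1:]
        forms.append(form)
    # step 3: we are done only when no nonterminals remain
    if any(s in PRODUCTIONS for s in form):
        raise ValueError("derivation is incomplete: nonterminals remain")
    return forms
```

Calling `derive(["(S)", "(S)", ""])` walks through S, (S), ((S)), (()).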
Derivations Using a Grammar
• An approach to defining the language of a grammar, in which we use
the productions from head to body.
• We expand the start symbol using one of its productions (i.e. using a
production whose head is the start symbol).
• We further expand the resulting string by replacing one of the
variables by the body of one of its productions, and so on, until we
derive a string consisting entirely of terminals.
• The language of the grammar is all strings of terminals that we can
obtain in this way.
• This use of grammars is called derivation.
Given the grammar G:
1. E → I
2. E → E + E
3. E → E * E
4. E → (E)
5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1
• We can use the relation ⇒* (derives in zero or more steps) to condense the derivation:
E ⇒* a*(a+b00)
Leftmost and Rightmost Derivations
• Leftmost derivation: at each step we replace the leftmost variable by
one of its production bodies.
• We use the relations ⇒_lm and ⇒*_lm for one step and for zero or more steps, respectively.
• Rightmost derivation: at each step the rightmost variable is replaced
by one of its bodies.
• We use the relations ⇒_rm and ⇒*_rm for one step and for zero or more steps, respectively.
Leftmost Derivation
Given the grammar G:
E → I | E+E | E*E | (E)
I → Ia | Ib | I0 | I1 | a | b
• Show that a*(a+b00) is in the language of G using a leftmost derivation
• E ⇒_lm E*E ⇒_lm I*E ⇒_lm a*E ⇒_lm a*(E) ⇒_lm a*(E+E) ⇒_lm a*(I+E) ⇒_lm a*(a+E)
⇒_lm a*(a+I) ⇒_lm a*(a+I0) ⇒_lm a*(a+I00) ⇒_lm a*(a+b00)
Rightmost Derivation
Given the grammar G:
E → I | E+E | E*E | (E)
I → Ia | Ib | I0 | I1 | a | b
• Show that a*(a+b00) is in the language of G using a rightmost
derivation
• E ⇒_rm E*E ⇒_rm E*(E) ⇒_rm E*(E+E) ⇒_rm E*(E+I) ⇒_rm E*(E+I0) ⇒_rm E*(E+I00) ⇒_rm E*(E+b00)
⇒_rm E*(I+b00) ⇒_rm E*(a+b00) ⇒_rm I*(a+b00) ⇒_rm a*(a+b00)
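Leftmost and rightmost derivations of a*(a+b00) can be replayed mechanically. The Python sketch below assumes every grammar symbol is a single character; the names `G`, `derive`, `LEFTMOST` and `RIGHTMOST` are hypothetical. Each list gives the production body used at each step.

```python
# Grammar G; every symbol is one character, so bodies can be plain strings.
G = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}


def derive(bodies, leftmost=True):
    """Replay a derivation from E, expanding the leftmost (or rightmost)
    variable with the next body in `bodies` at every step."""
    form, forms = "E", ["E"]
    for body in bodies:
        positions = [i for i, s in enumerate(form) if s in G]
        i = positions[0] if leftmost else positions[-1]
        if body not in G[form[i]]:
            raise ValueError(f"{form[i]} -> {body} is not a production")
        form = form[:i] + body + form[i + 1:]
        forms.append(form)
    return forms


LEFTMOST = ["E*E", "I", "a", "(E)", "E+E", "I", "a", "I", "I0", "I0", "b"]
RIGHTMOST = ["E*E", "(E)", "E+E", "I", "I0", "I0", "b", "I", "a", "I", "a"]
```

Both `derive(LEFTMOST)` and `derive(RIGHTMOST, leftmost=False)` end in a*(a+b00); they use the same productions but apply them in a different order.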
The Language of a Grammar
• If G = (T, N, S, R) is a CFG, the language of G, denoted L(G), is the
set of terminal strings that have derivations from the start symbol.
• L(G) = {w in T* | S ⇒*_G w}
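For small grammars, the definition of L(G) can be explored by enumerating derivations. The Python sketch below is only an illustration under stated assumptions: symbols are single characters, and every body other than ε contains at least one terminal, so that bounding the number of terminals guarantees termination. The names `language_up_to` and `PALINDROMES` are hypothetical.

```python
from collections import deque


def language_up_to(productions, start, max_len):
    """Return all terminal strings w in L(G) with len(w) <= max_len.

    Explores leftmost derivations breadth-first. Assumes single-character
    symbols and that every non-ε body contains at least one terminal.
    """
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        i = next((k for k, s in enumerate(form) if s in productions), None)
        if i is None:                       # no variables left: a terminal string
            found.add(form)
            continue
        for body in productions[form[i]]:   # expand the leftmost variable
            new = form[:i] + body + form[i + 1:]
            terminals = sum(1 for s in new if s not in productions)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found


PALINDROMES = {"P": ["", "0", "1", "0P0", "1P1"]}
```

Restricting the search to leftmost derivations loses nothing, since every string in L(G) has a leftmost derivation.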
Exercises
Exercise 1: The following grammar generates the language of the regular
expression 0*1(0+1)*
S → A1B
A → 0A | ε
B → 0B | 1B | ε
Give leftmost and rightmost derivations of the following strings:
a) 00101
b) 1001
c) 00011
Exercise 2: Design context-free grammars for the following languages
a) The set {a^n | n > 0}
b) The set {0^n 1^n | n ≥ 1}, that is, the set of all strings of one or more 0’s
followed by an equal number of 1’s.
c) The set {a^n b^m | n > 0, m > 0}
d) The set {a^n b^m | n > 1, 0 < m < n}
e) The set {a^n b^m | n > 0, 0 <= m < n}
f) The set of all strings with an equal number of a’s and b’s.
Parse Trees and Derivations
• A derivation can be drawn as a tree.
• Start symbol is the root of the tree
• For a production A → B1B2…Bn, add children B1, B2, …, Bn to node A
• Grammar
E → E + E | E * E | - E | ( E ) | id
• String
-(id + id)
• Derivation
E
⇒ -E
⇒ -(E)
⇒ -(E+E)
⇒ -(id + E)
⇒ -(id + id)
• Parse tree (each node's children are indented beneath it, left to right)
E
  -
  E
    (
    E
      E
        id
      +
      E
        id
    )
• A parse tree is a graphical representation of a derivation that filters out the
order in which productions are applied to replace nonterminals.
• A parse tree has
• Terminals at the leaves
• Non-terminals at the interior nodes
• Each interior node of a parse tree represents the application of a production
• The interior node is labeled with the nonterminal (A) in the head of the production
• The children of the node are labeled, from left to right, by the symbols in the body of
the production by which this A was replaced during the derivation
• Reading the leaves from left to right gives the original input
• The parse tree shows the association of operations; the input string does not.
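A parse tree can be represented as a small recursive data structure. The Python sketch below builds the tree for -(id + id) under the grammar E → E + E | E * E | - E | ( E ) | id and reads its leaves from left to right. The names `Node`, `E`, `t` and `TREE` are hypothetical helpers for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # a nonterminal or a terminal
    children: list = field(default_factory=list)

    def leaves(self):
        """Left-to-right list of leaf labels (the yield of the tree)."""
        if not self.children:
            return [self.label]
        return [leaf for child in self.children for leaf in child.leaves()]


def E(*children):
    """Interior node: one application of a production with head E."""
    return Node("E", list(children))


def t(symbol):
    """Leaf node holding a terminal."""
    return Node(symbol)


# E -> - E, then E -> ( E ), then E -> E + E, then E -> id twice.
TREE = E(t("-"), E(t("("), E(E(t("id")), t("+"), E(t("id"))), t(")")))
```

Here `TREE.leaves()` gives the tokens `-`, `(`, `id`, `+`, `id`, `)` in order, which is the original input.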
Ambiguity
• A grammar that produces more than one parse tree for some string is
said to be ambiguous.
• Equivalently, there is more than one left-most or right-most derivation for
some string.
• Grammar
E → E + E | E * E | - E | ( E ) | id
• String
id + id * id
• This string has two parse trees
• The string id + id * id has two parse trees, corresponding to two leftmost derivations
First tree (root production E → E + E):
E ⇒ E + E
⇒ id + E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Second tree (root production E → E * E):
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
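The number of parse trees of a token string under this particular grammar can be counted with a short recursive function, one case per production. This Python sketch is specific to E → E + E | E * E | - E | ( E ) | id; the name `count_parse_trees` is hypothetical.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees deriving `tokens` from E in the grammar
    E -> E + E | E * E | - E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees for tokens[i:j]."""
        if i >= j:
            return 0
        total = 0
        if j - i == 1 and tokens[i] == "id":
            total += 1                                   # E -> id
        if tokens[i] == "-":
            total += count(i + 1, j)                     # E -> - E
        if tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)                 # E -> ( E )
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)   # E -> E + E | E * E
        return total

    return count(0, len(tokens))
```

A result greater than 1 for some string shows that the grammar is ambiguous; for `id + id * id` the count is 2.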
• Ambiguity is BAD: it leaves the structure, and hence the meaning, of some strings ill-defined.
Exercises
Exercise 2: Repeat Exercise 1 for each of the following grammars and
strings:
a) S → 0 S 1 | 0 1 with string 000111
b) S → + S S | * S S | a with string + * aaa
c) S → S ( S ) S | ε with string (()())
d) S → S + S | S S | ( S ) | S * | a with string (a + a)*a
e) S → ( L ) | a
L → L , S | S with string ((a, a), a, (a))
f) S → a S b S | b S a S | ε with string aabbab
Exercise 3: Design grammars for the following languages:
a) The set of all strings of 0’s and 1’s that are palindromes; that is, the
string reads the same backward as forward.
b) The set of all strings of 0’s and 1’s with an equal number of 0’s and
1’s
Exercise 4: Let Σ = {void, int, float, double, name, (, ), ,, ;}.
• Let's write a CFG for C-style function prototypes!
• Examples:
void name(int name, double name);
int name();
int name(double name);
int name(int, int name, float);
void name(void);
Exercise 5: Following is a sample C++ program. Keywords are written in bold.
void add(int a, float b) {
    int sum = 0;
    while (sum <= 50000) {
        sum = sum - 10.43 + 34E4;
        if (sum == 4000)
            break;
    }
}
• Let's write a CFG for the above program.
Applications of Context-Free Grammars
1. Grammars are used to describe programming languages.
• There is a mechanical way of turning the language description as a CFG into a
parser, the component of the compiler that discovers the structure of the source
program and represents that structure by a parse tree.
• This application is one of the earliest uses of CFG’s; in fact it is one of the first
ways in which theoretical ideas in Computer Science found their way into
practice
2. The development of XML (Extensible Markup Language)
• XML is widely predicted to facilitate electronic commerce by allowing participants
to share conventions regarding the format of orders, product descriptions, and
many other kinds of documents.
Parsers
• Many aspects of a programming language have a structure that may be
described by regular expressions
• For instance, identifiers could be represented by regular expressions.
• However, there are also some very important aspects of typical
programming languages that cannot be represented by regular
expressions alone.
• Parentheses and/or brackets must appear in a nested and balanced fashion.
• Strings of balanced parentheses include (()), ()(), (()()), and ε, while )( and (() are not balanced.
• The grammar Gbal = ({S}, {(, )}, R, S) generates all and only the strings of balanced
parentheses, where R consists of the productions: S → SS | (S) | ε
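The language of balanced parentheses can also be recognized with a single counter: a string is balanced exactly when no prefix has more right than left parentheses and the totals are equal. The Python sketch below uses that characterization rather than the grammar itself; the name `is_balanced` is hypothetical.

```python
def is_balanced(w: str) -> bool:
    """Return True if w is a string of balanced parentheses."""
    depth = 0                      # number of currently open parentheses
    for c in w:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:          # a ")" with no matching "(" before it
                return False
        else:
            return False           # not a string over the alphabet {(, )}
    return depth == 0              # every "(" has been closed
```

The counter can grow without bound, which is one way to see why a fixed, finite number of states is not enough for this language.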
The YACC Parser Generator
• The input to YACC is a CFG.
• Associated with each production is an action, which is a fragment of C code
that is performed whenever a node of the parse tree that (with its children)
corresponds to this production is created.
• A sample of a CFG in the YACC notation.
Markup Languages
• The “strings” in these languages are documents with certain marks
(called tags) in them.
• Tags tell us something about the semantics of various strings within
the document.
• The most familiar markup language, HTML (HyperText Markup Language), has two major
functions: creating links between documents and describing the format (“look”) of a document.
Normal Forms for Context Free Grammars
Chomsky Normal Form
• A CFG in which all productions are of the form A → BC or A → a,
where A, B and C are variables and a is a terminal.
• To get Chomsky Normal Form, we first need to:
• eliminate ε-productions
• those of the form A → ε for some variable A
• eliminate unit productions
• those of the form A → B for variables A and B
• eliminate useless symbols
• variables or terminals that do not appear in any derivation of a terminal string from the
start symbol.
Eliminating ε-Productions
• ε-productions, while a convenience in many grammar design
problems, are not essential.
• Of course, without a production that has an ε body, it is impossible to generate
the empty string as a member of the language
• Our strategy is to begin by discovering which variables are “nullable”.
• If A → ε is a production of G, then A is nullable
• If there is a production B → C1C2…Ck where each Ci is nullable, then B is
nullable
Example
Consider the grammar G and eliminate its ε-productions:
S → AB
A → aAA | ε
B → bBB | ε
Find the nullable symbols.
• A and B are directly nullable because they have productions with ε as the
body.
• S is nullable, because the production S → AB has a body consisting of
nullable symbols only
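The two rules for nullable variables amount to a fixed-point computation: keep adding variables until nothing changes. A minimal Python sketch, assuming single-character symbols and using `""` for an ε body (the names `nullable_symbols` and `G` are hypothetical):

```python
def nullable_symbols(productions):
    """Return the set of nullable variables.

    `productions` maps each variable to a list of bodies; a body is a
    string of single-character symbols and "" stands for ε.
    """
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            # all() over an empty body is True, which covers A -> ε.
            if any(all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


G = {"S": ["AB"], "A": ["aAA", ""], "B": ["bBB", ""]}
```

For this grammar the function reports that S, A and B are all nullable.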
A, B, S are nullable
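Once the nullable symbols are known, each production body is replaced by every version in which nullable symbols are either kept or dropped, except the empty body. The Python sketch below applies this to the example grammar; the names `eliminate_epsilon`, `G` and `NULLABLE` are hypothetical, and symbols are assumed to be single characters.

```python
from itertools import product


def eliminate_epsilon(productions, nullable):
    """Return an ε-free grammar: for each body, keep or drop every nullable
    symbol in all possible ways, but never emit the empty body."""
    result = {}
    for head, bodies in productions.items():
        new_bodies = []
        for body in bodies:
            # Each nullable symbol may stay or vanish; other symbols must stay.
            options = [(s, "") if s in nullable else (s,) for s in body]
            for choice in product(*options):
                candidate = "".join(choice)
                if candidate and candidate not in new_bodies:
                    new_bodies.append(candidate)
        result[head] = new_bodies
    return result


G = {"S": ["AB"], "A": ["aAA", ""], "B": ["bBB", ""]}
NULLABLE = {"S", "A", "B"}          # as found in the example above
```

The resulting grammar is S → AB | A | B, A → aAA | aA | a, B → bBB | bB | b. It generates the same language as G except for the empty string.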
Eliminating Unit Productions
• A unit production is a production of the form A → B, where both A and
B are variables.
• Unit productions can introduce extra steps into derivations that technically
need not be there
• Example:
I → a | b | Ia | Ib | I0 | I1
F → I | (E)
T → F | T*F
E → T | E+T
• (A, A) is a unit pair for every variable A.
• Suppose we have determined that (A, B) is a unit pair, and B → C is a
production where C is a variable. Then (A, C) is a unit pair
• To eliminate unit productions we proceed as follows. Given a CFG
G=(V, T, P, S), construct CFG G1=(V, T, P1, S):
1. Find all the unit pairs of G
2. For each unit pair (A, B), add to P1 all the productions A → α,
where B → α is a nonunit production in P.
Note that A = B is possible; in that way, P1 contains all the nonunit
productions in P
• Example: Given a grammar for simple arithmetic expressions. Eliminate unit productions.
E → T | E+T
T → F | T*F
F → I | (E)
I → a | b | Ia | Ib | I0 | I1
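The two steps, finding the unit pairs and then copying nonunit bodies, can be sketched in Python for this expression grammar. Symbols are assumed to be single characters, so a body is a unit production exactly when it is itself a variable name. The names `unit_pairs`, `eliminate_unit_productions` and `G` are hypothetical.

```python
def unit_pairs(productions):
    """All pairs (A, B) such that A =>* B using unit productions only."""
    pairs = {(a, a) for a in productions}           # basis: (A, A)
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for body in productions[b]:
                # B -> C is a unit production when the body is a variable.
                if body in productions and (a, body) not in pairs:
                    pairs.add((a, body))
                    changed = True
    return pairs


def eliminate_unit_productions(productions):
    """For each unit pair (A, B), give A every nonunit body of B."""
    result = {a: [] for a in productions}
    for a, b in sorted(unit_pairs(productions)):
        for body in productions[b]:
            if body not in productions and body not in result[a]:
                result[a].append(body)
    return result


G = {
    "E": ["T", "E+T"],
    "T": ["F", "T*F"],
    "F": ["I", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}
```

For this grammar, E ends up with the bodies E+T, T*F, (E), a, b, Ia, Ib, I0 and I1, and no variable keeps a unit production.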
• Omitting useless symbols from a grammar will not change the
language generated, so we may as well detect and eliminate all useless
symbols
Eliminating Useless Symbols
• Our approach to eliminating useless symbols begins by identifying the two things a symbol must be able to do to be useful:
• We say X is generating if X ⇒* w for some terminal string w
• Every terminal is generating, since w can be that terminal itself, which is derived in zero
steps
• We say X is reachable if there is a derivation S ⇒* αXβ for some α and β
Example
Consider the grammar:
S → AB | a
A → b
Notes:
- eliminate the symbols that are not generating first
- then eliminate from the remaining grammar those symbols that are not reachable
Computing the Generating Symbols
Let G=(V, T, P, S) be a grammar.
Example
Consider the grammar:
S → AB | a
A → b
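One way to compute the generating symbols is a fixed-point iteration: start from the terminals, and add a variable whenever one of its bodies consists only of symbols already known to be generating. A minimal Python sketch for the example grammar, with hypothetical names `generating_symbols` and `G`, and single-character symbols assumed:

```python
def generating_symbols(productions, terminals):
    """Return all symbols X with X =>* w for some terminal string w."""
    generating = set(terminals)             # basis: every terminal
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head not in generating and any(
                all(s in generating for s in body) for body in bodies
            ):
                generating.add(head)        # some body is fully generating
                changed = True
    return generating


# B has no productions, so it never becomes generating.
G = {"S": ["AB", "a"], "A": ["b"]}
```

Here every symbol except B is generating, so B and the production S → AB can be removed.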
Computing the Reachable Symbols
Let G=(V, T, P, S) be a grammar.
• S is surely reachable
• If A is reachable, then for every production with head A, all the
symbols in the body of that production are also reachable
Example
Consider the grammar:
S → AB | a
A → b
• S is reachable
• S has production bodies AB and a (S → AB | a), so we conclude that A,
B and a are reachable
• A has the production A → b, so we conclude that b is reachable
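Reachability is a graph search from the start symbol. The Python sketch below (hypothetical names `reachable_symbols`, `G`, `AFTER_REMOVING_B`; single-character symbols assumed) runs it on the original grammar and on the grammar left after the non-generating symbol B and the production S → AB have been removed.

```python
def reachable_symbols(productions, start):
    """Return all symbols that appear in some string derivable from start."""
    reachable, stack = {start}, [start]
    while stack:
        head = stack.pop()
        # Terminals and variables without productions have no bodies.
        for body in productions.get(head, []):
            for s in body:
                if s not in reachable:
                    reachable.add(s)
                    stack.append(s)
    return reachable


G = {"S": ["AB", "a"], "A": ["b"]}
AFTER_REMOVING_B = {"S": ["a"], "A": ["b"]}     # B is not generating
```

In the original grammar every symbol is reachable, but once B is gone only S and a remain reachable, leaving the grammar S → a. This is why non-generating symbols are eliminated first.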
• To convert any CFG G into an equivalent CFG that has no useless
symbols, ε-productions or unit productions, a safe order is:
1. Eliminate ε-productions
2. Eliminate unit productions
3. Eliminate useless symbols
Chomsky Normal Form (CNF)
• A grammar in which all productions are in one of two simple forms,
either:
• A → BC, where A, B, and C are each variables, or
• A → a, where A is a variable and a is a terminal
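Whether a grammar is already in CNF can be checked production by production. The Python sketch below assumes single-character symbols and that every variable appears as a key of the dictionary; the name `is_cnf` is hypothetical.

```python
def is_cnf(productions):
    """True if every body is two variables (A -> BC) or one terminal (A -> a).

    Assumes every variable has at least one production, so the variables
    are exactly the keys of `productions`.
    """
    variables = set(productions)
    for bodies in productions.values():
        for body in bodies:
            two_variables = len(body) == 2 and all(s in variables for s in body)
            one_terminal = len(body) == 1 and body not in variables
            if not (two_variables or one_terminal):
                return False
    return True
```

Bodies such as ε, a single variable (a unit production), or a mix like aB all make the check fail.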
• Convert the following grammar to Chomsky Normal Form
E → E + T | T * F | (E) | a | b | Ia | Ib | I0 | I1
T → T * F | (E) | a | b | Ia | Ib | I0 | I1
F → (E) | a | b | Ia | Ib | I0 | I1
I → a | b | Ia | Ib | I0 | I1
Exercises
Exercise 1. Find a grammar equivalent to
S → AB | CA
A → a
B → BC | AB
C → aB | b
with no useless symbols.
Exercise 2. Begin with the grammar
S → ASB | ε
A → aAS | a
B → SbS | A | bb
a. Eliminate ε-productions
b. Eliminate any unit productions in the resulting grammar
c. Eliminate any useless symbols in the resulting grammar
d. Put the resulting grammar into Chomsky Normal Form
Exercise 3. Begin with the grammar
S → AAA | B
A → aA | B
B → ε
a. Eliminate ε-productions
b. Eliminate any unit productions in the resulting grammar
c. Eliminate any useless symbols in the resulting grammar
d. Put the resulting grammar into Chomsky Normal Form
Exercise 4. Design a CNF grammar for the set of strings of balanced
parentheses. You need not start from any particular non-CNF grammar.