Formal Theory New - 090050
The English language, for instance, in its spoken form relies on some
finite set of basic sounds as a set of primitives. The words are defined in
term of finite sequences of such sounds. Sentences are derived from finite
sequences of words. Conversations are achieved from finite sequences of
sentences, and so forth.
Alphabets
In computer science and formal language, an alphabet or vocabulary is
a finite set of symbols or letters, e.g. characters or digits. The most
common alphabet is {0, 1}, the binary alphabet. A finite string is a finite
sequence of letters from an alphabet; for instance, a binary string is a
string drawn from the alphabet {0, 1}. An infinite sequence of letters may
be constructed from elements of an alphabet as well.
Given an alphabet Σ, we write Σ* to denote the set of all finite strings over
the alphabet Σ. Here, the * denotes the Kleene star operator. We write
Σ^ω (or occasionally Σ^∞) to denote the set of all infinite sequences over
the alphabet Σ.
For example, if we use the binary alphabet {0, 1}, the strings ε, 0, 1, 00,
01, 10, 11, 000, etc. would all be in the Kleene closure of the alphabet
(where ε represents the empty string).
Please note that alphabets are important in the use of formal languages,
automata and semi-automata. In most cases, for defining instances of
automata, such as deterministic finite automata (DFAs), it is required to
specify an alphabet from which the input strings for the automaton are
built.
String
In formal languages, which are used in mathematical logic and
theoretical computer science, a string is a finite sequence of symbols that
are chosen from a set or alphabet.
Formal Theory
Let Σ be an alphabet, a non-empty finite set. Elements of Σ are called
symbols or characters. A string (or word) over Σ is any finite sequence
of characters from Σ. For example, if Σ = {0, 1}, then 0101 is a string over
Σ.
The length of a string is the number of characters in the string (the length
of the sequence) and can be any non-negative integer. The empty string
is the unique string over Σ of length 0, and is denoted ε or λ. The set of
all strings over Σ of length n is denoted Σ^n. For example, if Σ = {0, 1},
then Σ^2 = {00, 01, 10, 11}. Note that Σ^0 = {ε} for any alphabet Σ.
The set of all strings over Σ of any length is the Kleene closure of Σ and
is denoted Σ*. In terms of Σ^n,

Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ …

For example, if Σ = {0, 1}, Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011,
…}. Although Σ* itself is countably infinite, all elements of Σ* have finite
length.
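Since Σ* is the union of the sets Σ^n, any finite slice of it can be enumerated directly. A minimal Python sketch (the function names `sigma_n` and `kleene_star` are our own illustration, not from the text):

```python
from itertools import product

def sigma_n(alphabet, n):
    """All strings over the alphabet of exactly length n (the set Sigma^n)."""
    return {"".join(p) for p in product(alphabet, repeat=n)}

def kleene_star(alphabet, max_len):
    """A finite prefix of Sigma*: all strings of length 0..max_len."""
    result = set()
    for n in range(max_len + 1):
        result |= sigma_n(alphabet, n)
    return result

print(sorted(sigma_n("01", 2)))    # ['00', '01', '10', '11']
print(len(kleene_star("01", 3)))   # 1 + 2 + 4 + 8 = 15 strings, including the empty one
```

Note that `sigma_n(alphabet, 0)` correctly yields the singleton set containing the empty string, matching Σ^0 = {ε}.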
A set of strings over Σ (i.e. any subset of Σ*) is called a formal language
over Σ. For example, if Σ = {0, 1}, the set of strings with an even number
of zeros, {ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001,
1010, 1100, 1111, …}, is a formal language over Σ.
Note that {0, 1}+ excludes the empty string, but {ε, 0, 1}+ = {0, 1}* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, ...}.
{a}+ = {a, a^2, a^3, ..., a^n, ...} = {a^n : n ≥ 1}
Also, X^n = XX...X (n times), for any n ≥ 0. Of course X^0 = {ε}.
e.g. {0, 1}{a, b, c} = {0a, 0b, 0c, 1a, 1b, 1c}
If x is a string, then |x| denotes the length of x, and this is the number of
indivisible symbols in x. Of course |ε| = 0.
EXERCISE
1. Determine the following sets:
(a) {0, 1}{ε, a, ba} (b) {b, aa}*
2. Let V be a set of strings. Does V+ = V V*?
Alphabet of a String
The alphabet of a string is the set of all of the letters that occur in a
particular string. If s is a string, its alphabet is denoted by Alph(s).
String Substitution
Let L be a language, and let Σ be its alphabet. A string substitution or
simply a substitution is a mapping f that maps letters in Σ to languages
(possibly in a different alphabet). Thus, for example, given a letter a ∈
Σ, one has f(a) = La, where La ⊂ Δ* is some language whose alphabet is
Δ. This mapping may be extended to strings as f(ε) = ε for the empty
string, and f(sa) = f(s)f(a) for a string s and letter a.
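A substitution of this kind can be sketched in Python, with each letter mapped to a finite set of strings; the function name and the example mapping below are our own illustration:

```python
def substitute(s, f):
    """Extend a letter-to-language substitution f to a string s.

    f maps each letter to a set of strings; the empty string maps to
    {""}, and each further letter concatenates its language on the right.
    """
    result = {""}                       # f(empty string) = {empty string}
    for letter in s:
        result = {u + v for u in result for v in f[letter]}
    return result

# Hypothetical substitution: a -> {0, 00}, b -> {1}
f = {"a": {"0", "00"}, "b": {"1"}}
print(sorted(substitute("ab", f)))      # ['001', '01']
```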
CONCATENATION AND SUBSTRINGS
Concatenation is an important binary operation on Σ*. For any two
strings s and t in Σ*, their concatenation is defined as the sequence of
characters in s followed by the sequence of characters in t, and is denoted
st. For example, if Σ = {a, b, …, z}, s = bear, and t = hug, then st = bearhug
and ts = hugbear.
String Length
Although formal strings can have an arbitrary (but finite) length, the
length of strings in real languages is often constrained to an artificial
maximum. In general, there are two types of string datatypes: fixed length
strings which have a fixed maximum length and which use the same
amount of memory whether this maximum is reached or not, and variable
length strings whose length is not arbitrarily fixed and which use varying
amounts of memory depending on their actual size. Most strings in
modern programming languages are variable length strings. Despite the
name, even variable length strings are limited in length; although,
generally, the limit depends only on the amount of memory available.
There are many string functions which exist in other languages with
similar or exactly the same syntax or parameters. For example in many
languages the length function is usually represented as len(string). Even
though string functions are very useful, a programmer using them
should be mindful that a string function in one language could behave
differently in another language, or have a similar or completely different
function name, parameters, syntax, and results.
Representations
Given the preceding definitions of alphabets and strings, representations
of information can be viewed as the mapping of objects into strings in
accordance with some rules. That is, formally speaking, a representation
or encoding over an alphabet Σ of a set D is a function f from D to 2^Σ*
that satisfies the following condition: f(e1) and f(e2) are disjoint
nonempty sets for each pair of distinct elements e1 and e2 in D.
Example 1
Fig 1: Representations for the Natural Numbers
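As one concrete sketch of a representation, the natural numbers can be mapped to their binary strings, and the disjointness and nonemptiness conditions checked directly. The encoding chosen here is our own example, one of many valid representations:

```python
def f(n):
    """Encode a natural number as a singleton set of binary strings."""
    return {bin(n)[2:]}          # e.g. 5 -> {'101'}

# f is a representation: distinct numbers map to disjoint nonempty sets.
codes = [f(n) for n in range(8)]
assert all(c for c in codes)                       # every f(n) is nonempty
assert all(codes[i].isdisjoint(codes[j])           # pairwise disjoint
           for i in range(8) for j in range(8) if i != j)
```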
NOTE
• an alphabet or vocabulary is a finite set of symbols or letters
• a string is a finite sequence of symbols that are chosen from a set
or alphabet
• string functions are used to manipulate a string or change or edit
the contents of a string
• for any two strings s and t in Σ*, their concatenation is defined as
the sequence of characters in s followed by the sequence of
characters in t, and is denoted st
• string concatenation is an associative, but non-commutative
operation
• a string s is said to be a substring or factor of t if there exist
(possibly empty) strings u and v such that t = usv
• a representation or encoding over an alphabet Σ of a set D is a function
f from D to 2^Σ* that satisfies the following condition: f(e1) and f(e2)
are disjoint nonempty sets for each pair of distinct elements e1 and e2
in D.
Having learnt about strings and alphabets in the previous unit, you will
be taken through another important concept in formal language and
automata theory, which is grammar. This is because it is often convenient
to specify languages in terms of grammars. The advantage in doing so
arises mainly from the usage of a small number of rules for describing a
language with a large number of sentences.
• <sentence> → <subject><predicate>. (The names in angular
brackets are assumed to belong to the grammar metalanguage.)
• Similarly, the possibility that the subject phrase consists of a noun
phrase can be expressed by a grammatical rule of the form:
• <subject> → <noun>.
You may, therefore, think of a grammar as a set of rules for your native
language: subject, predicate, prepositional phrase, past participle, and so
on. This is a reasonably accurate, or at least helpful, description of a
human language, but it is not entirely rigorous. Chomsky formalised the
concept of a grammar, and made important observations regarding the
complexity of the grammar, which in turn establishes the complexity of
the language.
Formal Grammar
A grammar makes it possible to take a sentence apart part by part and
look at its analyzed form (known as its parse tree in computer science,
and as its deep structure in generative grammar).
Introductory Example
For example, assume the alphabet consists of a and b, the start symbol is
S, and we have the following production rules:
1. S → aSb
2. S → ba
then we start with S, and can choose a rule to apply to it. If we choose
rule 1, we obtain the string aSb. If we choose rule 1 again, we
replace S with aSb and obtain the string aaSbb. If we now choose
rule 2, we replace S with ba and obtain the string aababb, and are
done. We can write this series of choices more briefly, using
symbols: S ⇒ aSb ⇒ aaSbb ⇒ aababb. The language of the
grammar is then the infinite set
{a^n ba b^n | n ≥ 0} = {ba, abab, aababb, aaababbb, …},
where a^k is a repeated k times (and n in particular represents the
number of times production rule 1 has been applied).
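The derivation in this example can be reproduced with a short sketch that applies rule 1 (S → aSb) a chosen number of times and then finishes with rule 2 (S → ba):

```python
def derive(n):
    """Apply rule 1 (S -> aSb) n times, then rule 2 (S -> ba)."""
    s = "S"
    for _ in range(n):
        s = s.replace("S", "aSb")   # rule 1
    return s.replace("S", "ba")     # rule 2

print([derive(n) for n in range(3)])   # ['ba', 'abab', 'aababb']
```

Note that `derive(2)` reproduces the derivation in the text: S ⇒ aSb ⇒ aaSbb ⇒ aababb.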
Formally, a grammar G consists of the following components:
• A finite set N of nonterminal symbols.
• A finite set Σ of terminal symbols that is disjoint from N.
• A finite set P of production rules, each rule of the form
(Σ ∪ N)* N (Σ ∪ N)* → (Σ ∪ N)*,
where * is the Kleene star operator and ∪ denotes set union. That is,
each production rule maps from one string of symbols to another, where
the first string (the "head") contains an arbitrary number of symbols
provided at least one of them is a nonterminal. In the case that the
second string (the "body") consists solely of the empty string – i.e., that
it contains no symbols at all – it may be denoted with a special notation
(often Λ, e or ε) in order to avoid confusion.
• A distinguished start symbol S ∈ N.
Example 1
Please note that for these examples, formal languages are specified
using set-builder notation.
Consider the grammar G with N = {S, B}, Σ = {a, b, c}, S the start
symbol, and the following production rules:
1. S → aBSc
2. S → abc
3. Ba → aB
4. Bb → bb
This grammar defines the language
L(G) = {a^n b^n c^n | n ≥ 1}
Context-free grammars (Type 2) and regular grammars (Type 3) define
the context-free languages and regular languages, respectively. Although
much less powerful than unrestricted grammars
(Type 0), which can in fact express any language that can be accepted by
a Turing machine, these two restricted types of grammars are most often
used because parsers for them can be efficiently implemented. For
example, all regular languages can be recognised by a finite state
machine, and for useful subsets of context-free grammars there are
well-known algorithms to generate efficient LL parsers and LR parsers to
recognise the corresponding languages those grammars generate.
The language defined above is not a context-free language, and this can
be strictly proven using the pumping lemma for context-free languages,
but for example the language L(G) = {a^n b^n | n ≥ 1} (at least one a
followed by the same number of b's) is context-free, as it can be defined
by the grammar G2 with N = {S}, Σ = {a, b}, S the start symbol, and the
following production rules:
1. S → aSb
2. S → ab
In regular grammars, the left hand side is again only a single nonterminal
symbol, but now the right-hand side is also restricted. The right side may
be the empty string, or a single terminal symbol, or a single terminal
symbol followed by a nonterminal symbol, but nothing else. (Sometimes
a broader definition is used: one can allow longer strings of terminals or
single nonterminals without anything else, making languages easier to
denote while still defining the same class of languages.)
The language defined above is not regular, but the language {a^n b^m | m, n
≥ 1} (at least one a followed by at least one b, where the numbers may be
different) is, as it can be defined by the grammar G3 with N = {S, A, B},
Σ = {a, b}, S the start symbol, and the following production rules:
1. S → aA
2. A → aA
3. A → bB
4. B → bB
5. B → ε
Formal languages are studied in the fields of logic, computer science and
linguistics. Their most important practical application is for the precise
definition of syntactically correct programs for a programming language.
The branch of mathematics and computer science that is concerned only
with the purely syntactical aspects of such languages, i.e. their internal
structural patterns, is known as formal language theory.
Although it is not formally part of the language, the words of a formal
language often have a semantical dimension as well. In practice this is
always tied very closely to the structure of the language, and a formal
grammar (a set of formation rules that recursively defines the language)
can help to deal with the meaning of (well-formed) words. Well-known
examples for this are “Tarski’s definition of truth” in terms of a
T-schema for first-order logic, and compiler generators like lex and yacc.
While formal language theory usually concerns itself with formal
languages that are defined by some syntactical rules, the actual definition
of a formal language is only as above: a (possibly infinite) set of finite-
length strings, no more no less. In practice, there are many languages that
can be defined by rules, such as regular languages or context-free
languages. The notion of a formal grammar may be closer to the intuitive
concept of a "language," one defined by syntactic rules.
Example 1
The following rules define a formal language L over the alphabet Σ = {0,
1, 2, 3, 4, 5, 6, 7, 8, 9, +, =}:
• Every nonempty string that does not contain + or = and does not
start with 0 is in L.
• The string 0 is in L.
• A string containing = is in L if and only if there is exactly one =,
and it separates two strings in L.
• A string containing + is in L if and only if every + in the string
separates two valid strings in L.
• No string is in L other than those implied by the previous rules.
Under these rules, the string “23+4=555” is in L, but the string
“=234=+” is not. This formal language expresses natural
numbers, well-formed addition statements, and well-formed
addition equalities, but it expresses only what they look like (their
syntax), not what they mean (semantics). For instance, nowhere
in these rules is there any indication that 0 means the number zero
or that + means addition.
For finite languages one can simply enumerate all well-formed
words. For example, we can define a language L as just L = {“a”,
“b”, “ab”, “cba”}.
Over even a small alphabet, however, there are infinitely many strings,
such as “aaababbbbaab”. Therefore, formal languages are typically infinite,
and defining an infinite formal language is not as simple as writing L =
{“a”, “b”, “ab”, “cba”}. Here are some examples of formal languages:
• those strings accepted by some automaton, such as a Turing
machine or finite state automaton
• those strings for which some decision procedure (an algorithm
that asks a sequence of related YES/NO questions) produces the
answer YES.
Operations on Languages
Example 2
Suppose L1 and L2 are languages over some common alphabet. The
concatenation L1L2 consists of all strings of the form vw where v is a
string from L1 and w is a string from L2.
The intersection L1 ∩ L2 of L1 and L2 consists of all strings which are
contained in both languages.
The complement ¬L of a language with respect to a given alphabet
consists of all strings over that alphabet that are not in the language.
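For finite languages, these operations are ordinary set manipulations. A sketch (our own function names; the complement is truncated to a maximum string length, since Σ* itself is infinite):

```python
from itertools import product

def concat(L1, L2):
    """Concatenation L1L2: every v in L1 followed by every w in L2."""
    return {v + w for v, w in product(L1, L2)}

def complement(L, alphabet, max_len):
    """The complement of L, restricted to strings of length <= max_len."""
    universe = {"".join(p)
                for n in range(max_len + 1)
                for p in product(alphabet, repeat=n)}
    return universe - L

L1, L2 = {"a", "ab"}, {"b", ""}
print(sorted(concat(L1, L2)))   # ['a', 'ab', 'abb']
print(sorted(L1 & L2))          # intersection is empty: []
```

Union and intersection come for free as Python's built-in set operators `|` and `&`.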
Such operations are used to investigate closure properties of classes of
languages. A class of languages is closed under a particular operation
when the operation, applied to languages in the class, always produces a
language in the same class again. For instance, the context-free languages
are known to be closed under union, concatenation, and intersection with
regular languages, but not closed under intersection or complementation.
Other Operations on Languages
• The Kleene star: the language consisting of all words that are
concatenations of 0 or more words in the original language;
• Reversal: the language consisting of the reversals of all the words
in the original language.
The language generated by a grammar G with terminal set T and a set S
of start symbols is
L(G) = {t ∈ T* : Z ⇒* t for some Z ∈ S} = {t ∈ T* : t ∈ S or Z ⇒+ t for
some Z ∈ S}.
So, the elements of L(G) are those elements of T* which are elements of
S or which are derivable from elements of S.
• a formal language is a set of words, i.e. finite strings of letters,
or symbols and the inventory from which these letters are taken is
called the alphabet over which the language is defined.
• a formal language is often defined by means of a formal grammar.
• a vocabulary (or alphabet or character set or word list) is a finite
non-empty set of indivisible symbols
• common operations on languages are the standard set operations,
such as union, intersection, and complementation.
• another class of operation that can be performed on languages is
the element-wise application of string operations
Automata Theory
In theoretical computer science, automata theory is the study of abstract
machines (or more appropriately, abstract ‘mathematical’ machines or
systems) and the computational problems that can be solved using these
machines. These abstract machines are called automata.
• Are certain classes of automata closed under union, intersection,
or complementation of formal languages? (Closure properties)
• How expressive is a type of automaton in terms of recognising a
class of formal languages? And what is the relative expressive
power of different types? (Language hierarchy)
Automata theory also studies whether an effective algorithm exists to
solve problems similar to the following list:
3.2 Automata
The automaton reads the input word one symbol after another in the
sequence and transits from state to state according to the transition
function, until the word is read completely. Once the input word has been
read, the automaton is said to have stopped, and the state at which the
automaton stopped is called the final state. Depending on the final state,
it is said that the automaton either accepts or rejects the input word. There
is a subset of states of the automaton, which is defined as the set of
accepting states. If the final state is an accepting state, then the
automaton accepts the word. Otherwise, the word is rejected.
The set of all the words accepted by an automaton is called the language
recognized by the automaton.
Formal Definitions
Automaton
Input Word
Run
Accepting Word
Recognized Language
Recognisable languages
Input
States
• Finite states: An automaton that contains only a finite number of
states. The above introductory definition describes automata with
finite numbers of states.
• Infinite states: An automaton that may not have a finite number of
states, or even a countable number of states. For example, the
quantum finite automaton or topological automaton has an
uncountably infinite set of states.
• Stack memory: An automaton may also contain some extra
memory in the form of a stack in which symbols can be pushed
and popped. This kind of automaton is called a pushdown
automaton.
Transition Function
• Deterministic: For a given current state and an input symbol, if an
automaton can only jump to one and only one state then it is a
deterministic automaton.
• Nondeterministic: An automaton that, after reading an input
symbol, may jump into any of a number of states, as licensed by
its transition relation. Notice that the term transition function is
replaced by transition relation: the automaton
nondeterministically decides to jump into one of the allowed
choices. Such automata are called nondeterministic automata.
• Alternation: This idea is quite similar to tree automata, but
orthogonal. The automaton may run multiple copies of itself on the
same next read symbol. Such automata are called alternating
automata. The acceptance condition must be satisfied by all runs
of such copies in order to accept the input.
Acceptance Condition
• Acceptance of finite words: Same as described in the informal
definition above.
• Acceptance of infinite words: an omega automaton cannot have
final states, as infinite words never terminate. Rather, acceptance
of the word is decided by looking at the infinite sequence of
visited states during the run.
• Probabilistic acceptance: An automaton need not strictly accept
or reject an input. It may accept the input with some probability
between zero and one. For example, quantum finite automata,
geometric automata and metric automata have probabilistic
acceptance.
CLASSES OF AUTOMATA
Automaton — Recognisable language
Nondeterministic/Deterministic Streett automata — ω-regular languages
Nondeterministic/Deterministic parity automata — ω-regular languages
Nondeterministic/Deterministic Muller automata — ω-regular languages
• generally, an automaton
- has some form of input
- has some form of output
- has internal states
- may or may not have some form of storage
- is hard-wired rather than programmable
Like grammars, finite state automata define languages. The finite state
automata are often abbreviated FSA or FA (for finite automata);
however, some texts use the term finite state machine, or FSM to
correlate with Turing machines that will be discussed in module 4 of this
course.
Some states are designated “final” states, and strings that leave the FSA
in one of these final states are, by definition, in the language.
One state is designated the start state. When the start state is also a final
state, ε is necessarily in the language.
Example 1
The transition table below defines a DFA with states a, b, c, d over the
alphabet {0, 1}; each row lists the state reached on input 0 and on input 1:
      0  1
a →  b  c
b →  a  d
c →  d  a
d →  c  b
DFAs are represented as directed graphs: nodes represent states, and one
state is designated the start state.
Some states (possibly including the start state) can be designated as final
states.
Arcs between states represent state transitions. Each such arc is labelled
with the symbol that triggers the transition.
Fig 1: Example of DFA
• Start with the “current state” set to the start state and a “read head”
at the beginning of the input string.
• While there are still characters in the string:
- Read the next character and advance the read head.
- From the current state, follow the arc that is labelled with the
character just read; the state that the arc points to becomes the next
current state.
• When all characters have been read, accept the string if the current
state is a final state; otherwise reject the string.
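The procedure above translates almost line for line into code. A sketch, using a hypothetical DFA over {0, 1} that accepts strings with an even number of 1's (the transition table is our own example, not the one in the figure):

```python
def run_dfa(delta, start, finals, s):
    """Follow the arc for each input character; accept iff we end in a final state."""
    state = start
    for ch in s:
        state = delta[(state, ch)]   # follow the labelled arc
    return state in finals

# Hypothetical DFA: state "e" = even number of 1's seen, "o" = odd.
delta = {("e", "0"): "e", ("e", "1"): "o",
         ("o", "0"): "o", ("o", "1"): "e"}
print(run_dfa(delta, "e", {"e"}, "1011"))   # False (three 1's)
print(run_dfa(delta, "e", {"e"}, "11"))     # True
```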
Example 2
q0 -1→ q1 -0→ q3 -0→ q1 -1→ q0 -1→ q1 -1→ q0 -0→ q2 -0→ q0
3.2.2.1 Using a GO TO Statement
If you are not allowed to use a go to statement, you can fake it with a
combination of a loop and a case statement:
state := q0;
loop
  case state of
    q0: read char;
        if eof then accept string;
        if char = 0 then state := q2;
        if char = 1 then state := q1;
M = (Q, Σ, δ, q0, F)
where
• Q is a finite set of states,
• Σ is a finite set of symbols, the input alphabet,
• δ : Q × Σ → Q is a transition function,
• q0 ∈ Q is the initial state, and
• F ⊆ Q is a set of final states.
Note: The fact that δ is a function implies that every vertex has an
outgoing arc for each member of Σ.
Fig.3
• q0 ∈ Q is the initial state, and {q1} ⊆ Q is the
set of final states.
3.3.1 Abbreviated Acceptor for Ada Identifiers
The difference is that, in this automaton, δ does not appear to be a function. It looks like
a partial function, that is, it is not defined for all values of Q × Σ.
The automaton represented in Figure 4 above is really exactly the same as the automaton
in Figure 3; we just have not bothered to draw one state and a whole bunch of arcs that
we know must be there.
I do not think you will find abbreviated automata in the textbook. They are not usually
allowed in a formal course. However, if you ever use an automaton to design a lexical
scanner, putting in an explicit error state just clutters up the diagram.
This seems to add a great deal of power, but in fact it does not. Any NFA can be emulated
by an FSA with more states. Start with an NFA containing n states x1, x2, x3, etc., and
construct a deterministic FSA with 2^n states as follows. Each state in the new FSA
corresponds to a unique combination of states in the original NFA. The initial state y0
corresponds to the union of the initial state x0 and all other xj states that are accessible
from x0 via ε-transitions. The state yi in the FSA is a final state if any of the corresponding
xj states, represented by yi, is a final state in the original NFA. To determine the transition
function f(yi, c), apply c to each corresponding xj state, and bring in any new states that
are accessible via ε-transitions. The combination of all these states determines a particular
yk. Thus state yi, reading character c, moves to state yk.
By induction on string length, any string that leaves the constructed FSA in state yi also
leaves the original NFA in one of the corresponding states xj. One machine says yes to the
input word if and only if the other one does. Therefore, nondeterministic FSAs are no
more powerful than their deterministic counterparts.
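The construction just described can be sketched as follows. Here δ maps a (state, symbol) pair to a set of states, ε-arcs are keyed by the empty string, and all names are our own illustration:

```python
def eclose(states, delta):
    """All states reachable from `states` via epsilon-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), set()) - seen:
            seen.add(r)
            stack.append(r)
    return frozenset(seen)

def nfa_to_dfa(delta, start, finals, alphabet):
    """Subset construction: each DFA state is a set of NFA states."""
    d_start = eclose({start}, delta)
    d_delta, todo, seen = {}, [d_start], {d_start}
    while todo:
        S = todo.pop()
        for c in alphabet:
            step = set()
            for q in S:                       # apply c to each NFA state in S
                step |= delta.get((q, c), set())
            T = eclose(step, delta)           # bring in epsilon-reachable states
            d_delta[(S, c)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    d_finals = {S for S in seen if S & finals}  # final if any member is final
    return d_delta, d_start, d_finals

# Hypothetical NFA over {a}: accepts strings containing at least one a.
delta = {("x0", "a"): {"x0", "x1"}, ("x1", "a"): {"x1"}}
d_delta, d_start, d_finals = nfa_to_dfa(delta, "x0", {"x1"}, "a")
print(d_start in d_finals)    # False: the empty string is rejected
```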
A state may have two or more arcs emanating from it labelled with the same symbol. When
the symbol occurs in the input, either arc may be followed.
A state may have one or more arcs emanating from it labelled with ε (the empty string).
These arcs may optionally be followed without looking at the input or consuming an input
symbol.
Due to non-determinism, the same string may cause an NFA to end up in one of several
different states, some of which may be final while others are not. The string is accepted if
any possible ending state is a final state.
Fig 6: Examples of NFAs
• When the automaton is faced with a choice, it always (magically) chooses correctly.
We sometimes think of the automaton as consulting an oracle which advises it as
to the correct choice.
• When the automaton is faced with a choice, it spawns a new process, so that all
possible paths are followed simultaneously.
• The first of these alternatives, using an oracle, is sometimes attractive
mathematically. But if we want to write a program to implement an NFA, that is
not feasible.
There are three ways, two feasible and one not yet feasible, to simulate the second
alternative:
• Use a quantum computer. Quantum computers explore literally all possibilities
simultaneously. They are theoretically possible, but are at the cutting edge of
physics. It may (or may not) be feasible to build such a device.
An NFA can be implemented by means of a recursive search from the start state for a path
(directed by the symbols of the input string) to a final state.
One problem with this implementation is that it could get into an infinite loop if there is a
cycle of transitions. This could be prevented by maintaining a simple counter.
Another way to implement an NFA is to keep either a state set or a bit vector of all the
states that the NFA could be in at any given time. Implementation is easier if you
use a bit-vector approach (v[i] is True if state i is a possible state), since most
languages provide vectors, but not sets, as a built-in datatype. However, it is a bit easier
to describe the algorithm if you use a state-set approach, so that is what we will do. The
logic is the same in either case.
A := {start state};
for each ε-transition from some state a in A
    to some state b not in A do
  add b to A;
while there is a next symbol do
{ read next symbol (x);
  B := {};
  for each a in A do
  { for each x-transition from a to some state b do
      add b to B;
  }
  for each ε-transition from some state b in B to some
      state c not in B do
    add c to B;
  A := B;
}
if any element of A is a final state then
  return True;
else
  return False;
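The state-set algorithm above, sketched in Python. ε-transitions are keyed by the empty string, and the example NFA is our own, built for (ab)* with an ε-arc back to the start:

```python
def close(states, delta):
    """Add everything reachable by epsilon-transitions (the closure step)."""
    states = set(states)
    changed = True
    while changed:
        changed = False
        for q in list(states):
            for r in delta.get((q, ""), set()):
                if r not in states:
                    states.add(r)
                    changed = True
    return states

def nfa_accepts(delta, start, finals, word):
    A = close({start}, delta)          # start with the closure of the start state
    for x in word:
        B = set()
        for a in A:                    # all states reachable on symbol x
            B |= delta.get((a, x), set())
        A = close(B, delta)            # then close under epsilon-transitions
    return bool(A & finals)            # accept if any possible state is final

# Hypothetical NFA for (ab)*: s -a-> t -b-> u, with an epsilon-arc u -> s.
delta = {("s", "a"): {"t"}, ("t", "b"): {"u"}, ("u", ""): {"s"}}
print(nfa_accepts(delta, "s", {"s"}, "abab"))   # True
print(nfa_accepts(delta, "s", {"s"}, "aba"))    # False
```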
M = (Q, Σ, δ, q0, F)
where the components are all the same as for a DFA except for the definition of δ:
• δ : Q × (Σ ∪ {ε}) → 2^Q; that is, from a given state, on a given symbol (or on
no input at all), the automaton may move to any member of a set of possible
next states.
• A DFA is just a special case of an NFA that happens not to have any null transitions
or multiple transitions on the same symbol. So DFAs are not more powerful than
NFAs.
• For any NFA, we can construct an equivalent DFA (see below). So NFAs are not
more powerful than DFAs. DFAs and NFAs define the same class of languages –
the regular languages.
• To translate an NFA into a DFA, the trick is to label each state in the DFA with a
set of states from the NFA. Each state in the DFA summarises all the states that the
NFA might be in. If the NFA contains |Q| states, the resultant DFA could contain
as many as 2^|Q| states.
The following are primitive regular expressions:
• x, for each x ∈ Σ,
• ε, the empty string, and
• ∅, indicating no strings at all.
Thus, if |Σ| = n, then there are n + 2 primitive regular expressions defined over Σ.
• For each x ∈ Σ, the primitive regular expression x denotes the language {x}. That
is, the only string in the language is the string “x”.
• The primitive regular expression ε denotes the language {ε}.
The only string in this language is the empty string.
• The primitive regular expression ∅ denotes the language {}.
There are no strings in this language.
• The plus (+) or | sign, read as "or," denotes the language containing strings
described by either of the component regular expressions, i.e. the union of the
component languages. For example, if x, y ∈ Σ, then the regular expression
x+y or x|y describes the language {x, y}.
Precedence
• The unary operator * (Kleene closure) has the highest precedence and is left
associative. For example, a+bc* or a|bc* denotes the language {a, b, bc, bcc, bccc,
bcccc, ...}.
• Concatenation has the second highest precedence and is left associative.
• Union has the lowest precedence and is left associative.
• Parentheses override operator precedence as usual. For example, (0|1)* stands for
all possible binary strings, 0|1* stands for either a 0 or an arbitrarily long string of
1's, and 01* stands for 0 followed by an arbitrarily long string of 1's.
• The symbol ε represents the null string, and can be used like any other alphabetic
character. Thus, (0| ε)(1(0| ε))* stands for all binary strings without adjacent zeros.
• Computer languages such as ed, sed, grep, and perl employ regular expressions, but
there are many more features for your convenience. For instance, s+ = ss*, s? = (s|
ε), s{7,} = sssssss+, and so on. Check out ‘man perlre’ for more details.
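These precedence rules carry over to Python's re module, where union is written | rather than +. A quick spot-check of the a|bc* example:

```python
import re

# a|bc* : union binds loosest and * tightest, so this matches
# either "a" or "b" followed by any number of "c"s.
pat = re.compile(r"a|bc*")
print(bool(pat.fullmatch("a")))      # True
print(bool(pat.fullmatch("bccc")))   # True
print(bool(pat.fullmatch("abc")))    # False: neither alternative covers it
```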
There is a simple correspondence between regular expressions and the languages they
denote:
Regular expression — Language denoted
ε — {ε}
∅ — {}
x — {x}
(r1) — L(r1)
r1* — (L(r1))*
r1 r2 — L(r1)L(r2)
r1 + r2 — L(r1) ∪ L(r2)
Here are some hints on building regular expressions. We will assume Σ = {a, b, c}.
Zero or more
a* means “zero or more a’s.” To say “zero or more ab’s,” that is, {ε, ab, abab, ababab,
...}, you need to say (ab)*. Don't say ab*, because that denotes the language {a, ab, abb,
abbb, abbbb, ...}.
One or more
Since a* means “zero or more a’s”, you can use aa* (or equivalently, a*a) to mean “one
or more a’s.” Similarly, to describe “one or more ab's,” that is, {ab, abab, ababab, ...}, you
can use ab(ab)*.
Zero or one
To say “zero or one a,” you can use (a+ε).
Any string at all
To describe any string at all (with Σ = {a, b, c}), you can use (a+b+c)*.
Any nonempty string
This can be written as any character from Σ followed by any string at all:
(a+b+c)(a+b+c)*.
To describe any string at all that does not contain an a (with Σ = {a, b, c}), you can use
(b+c)*.
To describe any string that contains exactly one a, put “any string not containing an a,” on
either side of the a, like this: (b+c)*a(b+c)*.
3.4.1 Example Regular Expressions
All strings containing exactly one a:
• (b+c)*a(b+c)*
All strings containing no more than three a's:
We can describe the string containing zero, one, two, or three a's (and nothing else) as
(ε+a)(ε+a)(ε+a). Now we want to allow arbitrary strings not containing a's at the places
marked by X's: X(ε+a)X(ε+a)X(ε+a)X, which gives
(b+c)*(ε+a)(b+c)*(ε+a)(b+c)*(ε+a)(b+c)*.
All strings containing exactly one each of a, b, and c:
The problem here is that we cannot assume the symbols are in any particular order. We
have no way of saying “in any order”, so we have to list the possible orders:
• abc+acb+bac+bca+cab+cba
To make it easier to see what's happening, let's put an X in every place we want to allow
an arbitrary string:
• XaXbXcX + XaXcXbX + XbXaXcX + XbXcXaX + XcXaXbX + XcXbXaX
Finally, replacing the X's with (a+b+c)* gives the final (unwieldy) answer:
• (a+b+c)*a(a+b+c)*b(a+b+c)*c(a+b+c)* +
• (a+b+c)*a(a+b+c)*c(a+b+c)*b(a+b+c)* +
• (a+b+c)*b(a+b+c)*a(a+b+c)*c(a+b+c)* +
• (a+b+c)*b(a+b+c)*c(a+b+c)*a(a+b+c)* +
• (a+b+c)*c(a+b+c)*a(a+b+c)*b(a+b+c)* +
• (a+b+c)*c(a+b+c)*b(a+b+c)*a(a+b+c)*
All strings which contain no runs of a's of length greater than two:
• (b+c)*(ε+a+aa)(b+c)*
but if we want to repeat this, we need to be sure to have at least one non-a between
repetitions:
All strings in which all runs of a's have lengths that are multiples of three
(aaa+b+c)*
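This expression can be spot-checked with Python's re module (writing + as |):

```python
import re

# (aaa+b+c)* : every run of a's must come in blocks of exactly three.
pat = re.compile(r"(aaa|b|c)*")
print(bool(pat.fullmatch("baaac")))    # True: one run of length 3
print(bool(pat.fullmatch("baac")))     # False: run of length 2
print(bool(pat.fullmatch("aaaaaab")))  # True: run of length 6
```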
Regular expressions also describe regular languages. We will show that regular
expressions are equivalent to NFAs by doing two things:
• For any given regular expression, we will show how to build an NFA that accepts
the same language. (This is the easy part.)
• For any given NFA, we will show how to construct a regular expression that
describes the same language. (This is the hard part.)
Every NFA we construct will have a single start state and a single final state. We will build
more complex NFAs out of simpler NFAs, each with a single start state and a single final
state. The simplest NFAs will be those for the primitive regular expressions.
For any x in , the regular expression x denotes the language {x}. This NFA represents
exactly that language.
Fig 6:
Note that if this were a DFA, we would have to include arcs for all the other elements of
Σ.
Fig 7: NFA for the regular expression ε
The regular expression ε denotes the language {ε}, that is, the
language containing only the empty string.
Fig 8: NFA for the regular expression ∅
The regular expression ∅ denotes the empty language ∅; no strings belong to this
language, not even the empty string.
Since the final state is unreachable, why bother to have it at all? The answer is that it
simplifies the construction if every NFA has exactly one start state and one final state. We
could do without this final state, but we would have more special cases to consider, and it
does not hurt anything to include it.
We will build more complex NFAs out of simpler NFAs, each with a single start state and
a single final state. Since we have NFAs for primitive regular expressions, we need to
compose them for the operations of grouping, juxtaposition, union, and Kleene star (*).
For grouping (parentheses), we do not really need to do anything. The NFA that represents
the regular expression (r1) is the same as the NFA that represents r1.
For juxtaposition (strings in L(r1) followed by strings in L(r2)), we simply chain the NFAs
together, as shown. The initial and final states of the original NFAs (boxed) stop being
initial and final states; we include new initial and final states. (We could make do with
fewer states and fewer transitions here, but we aren't trying for the best construction; we're
just trying to show that a construction is possible.)
The + denotes “or” in a regular expression, so it makes sense that we would use an NFA
with a choice of paths. (This is one of the reasons that it is easier to build an NFA than a
DFA.)
The star denotes zero or more applications of the regular expression, so we need to set up
a loop in the NFA. We can do this with a backward-pointing arc. Since we might want to
traverse the regular expression zero times (thus matching the null string), we also need a
forward-pointing arc to bypass the NFA entirely.
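These constructions can be sketched directly in code. The Python below is a minimal illustration of the idea, not a library implementation (names such as symbol, concat, union, and star are our own): each function returns an NFA fragment with a single start state and a single final state, glued to its sub-NFAs with ε-arcs (represented here by None), following the constructions described above.

```python
from itertools import count

_ids = count()

def _new():
    """Allocate a fresh state number."""
    return next(_ids)

class NFA:
    """An NFA fragment with one start state and one final state.
    trans maps (state, symbol) to a set of successor states;
    the symbol None stands for an ε-transition."""
    def __init__(self, start, final, trans):
        self.start, self.final, self.trans = start, final, trans

def _merge(*tables):
    out = {}
    for t in tables:
        for key, targets in t.items():
            out.setdefault(key, set()).update(targets)
    return out

def symbol(c):
    """NFA for the primitive regular expression c (a single symbol)."""
    s, f = _new(), _new()
    return NFA(s, f, {(s, c): {f}})

def concat(a, b):
    """Juxtaposition: chain a and b together with ε-arcs."""
    s, f = _new(), _new()
    return NFA(s, f, _merge(a.trans, b.trans, {
        (s, None): {a.start},
        (a.final, None): {b.start},
        (b.final, None): {f},
    }))

def union(a, b):
    """The + operator: a choice of two paths."""
    s, f = _new(), _new()
    return NFA(s, f, _merge(a.trans, b.trans, {
        (s, None): {a.start, b.start},
        (a.final, None): {f},
        (b.final, None): {f},
    }))

def star(a):
    """Kleene star: a backward-pointing arc to loop, and a
    forward-pointing arc to bypass the NFA entirely."""
    s, f = _new(), _new()
    return NFA(s, f, _merge(a.trans, {
        (s, None): {a.start, f},       # bypass arc (zero repetitions)
        (a.final, None): {a.start, f}  # loop back for another repetition
    }))

def accepts(nfa, string):
    """Simulate the NFA, tracking the ε-closure of the current state set."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in nfa.trans.get((q, None), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = closure({nfa.start})
    for ch in string:
        step = set()
        for q in current:
            step |= nfa.trans.get((q, ch), set())
        current = closure(step)
    return nfa.final in current

# a(b+c)* : an a followed by any string of b's and c's
machine = concat(symbol('a'), star(union(symbol('b'), symbol('c'))))
```

Note that every fragment, however complex, still has exactly one start state and one final state, which is what makes the gluing uniform.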
3.5.3 From NFAs to Regular Expressions
Creating a regular expression to recognise the same strings as an NFA is trickier than you
might expect, because the NFA may have arbitrary loops and cycles. Here is the basic
approach (details supplied later):
• If the NFA has more than one final state, convert it to an NFA with only one final
state. Make the original final states nonfinal, and add an ε-transition from each to the
new (single) final state.
• Consider the NFA to be a generalized transition graph, which is just like an NFA
except that the edges may be labelled with arbitrary regular expressions. Since the
labels on the edges of an NFA may be either ε or members of Σ, each of these can
be considered to be a regular expression.
• Remove states one by one from the NFA, relabeling edges as you go, until only the
initial and the final state remain.
• Read the final regular expression from the two-state automaton that results.
• The regular expression derived in the final step accepts the same language as the
original NFA.
• Since we can convert an NFA to a regular expression, and we can convert a regular
expression to an NFA, the two are equivalent formalisms--that is, they both
describe the same class of languages, the regular languages.
• There are two complicated parts to extracting a regular expression from an NFA:
removing states, and reading the regular expression off the resultant two-state
generalised transition graph.
Here is how to delete a state:
To delete state Q, where Q is neither the initial state nor the final state, proceed as
follows. Suppose the arc from Qi to Q is labelled r1, Q has a self-loop labelled r2, and
the arc from Q to Qj is labelled r3. Replace this path through Q with a single arc from
Qi directly to Qj, labelled with the regular expression r1r2*r3.
You should convince yourself that this transformation is “correct”, in the sense that paths
which leave you in Qi in the original will leave you in Qi in the replacement, and similarly
for Qj.
What if state Q has connections to more than two other states, say, Qi, Qj, and Qk? Then
you have to consider these states pairwise: Qi with Qj, Qj with Qk, and Qi with Qk.
What if some of the arcs in the original state are missing? There are too many cases to
work this out in detail, but you should be able to figure it out for any specific case, using
the above as a model.
You will end up with an NFA that looks like this, where r1, r2, r3, and r4 are (probably
very complex) regular expressions. The resultant NFA in figure 13 below represents the
regular expression r1*r2(r4 + r3r1*r2)*
(you should verify that this is indeed the correct regular expression). All you have to do is
plug in the correct values for r1, r2, r3, and r4.
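We can spot-check this formula with concrete labels. In the Python sketch below (our own choice of labels), each ri is a single symbol — r1 = b, r2 = a, r3 = c, r4 = d — so the two-state automaton can be simulated directly and compared against the regular expression b*a(d + cb*a)*:

```python
import re
from itertools import product

# The two-state automaton, with state 1 initial and state 2 final:
# r1 = b labels the loop on state 1, r2 = a the arc from 1 to 2,
# r4 = d the loop on state 2, and r3 = c the arc from 2 back to 1.
edges = {(1, 'b'): 1, (1, 'a'): 2, (2, 'd'): 2, (2, 'c'): 1}

def run(s):
    state = 1
    for ch in s:
        state = edges.get((state, ch))
        if state is None:   # no arc for this symbol: reject
            return False
    return state == 2

# r1*r2(r4 + r3 r1* r2)* with these labels, in Python regex syntax:
pattern = re.compile(r'b*a(?:d|cb*a)*')

# The automaton and the regular expression agree on all short strings.
for n in range(6):
    for letters in product('abcd', repeat=n):
        s = ''.join(letters)
        assert run(s) == (pattern.fullmatch(s) is not None)
```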
The following presents an example solved three different ways. No new information is
presented.
Problem: Define a language containing all strings over Σ = {a, b, c} where no symbol ever
follows itself; that is, no string contains any of the substrings aa, bb, or cc.
To define the language by a grammar, we take Σ = {a, b, c}, the variables
V = {S, NOTA, NOTB, NOTC}, and start symbol S; the set of productions P is given below.
These should be pretty obvious except for the set V, which we generally make up as we
construct P.
Since the empty string belongs to the language, we need the production
• S → ε
Some strings belonging to the language begin with the symbol a. The a can be followed
by any other string in the language, so long as this other string does not begin with a. So
we make up a variable, call it NOTA, to produce these other strings, and add the
production
• S → a NOTA
By similar logic, we add the variables NOTB and NOTC and the productions
• S → b NOTB
• S → c NOTC
Now, NOTA is either the empty string, or some string that begins with b, or some string
that begins with c. If it begins with b, then it must be followed by a (possibly empty) string
that does not begin with b--and we already have a variable for that case, NOTB. Similarly,
if NOTA is some string beginning with c, the c must be followed by NOTC. This gives
the productions
• NOTA → ε
• NOTA → b NOTB
• NOTA → c NOTC
Similar logic gives the following productions for NOTB and NOTC:
• NOTB → ε
• NOTB → a NOTA
• NOTB → c NOTC
• NOTC → ε
• NOTC → a NOTA
• NOTC → b NOTB
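The grammar translates almost line for line into a recursive recogniser. In the Python sketch below (function names are ours), one function handles S and one handles the three NOTX variables; each clause mirrors one production:

```python
def derives_S(s):
    """True iff s is derivable from the start symbol S."""
    if s == '':
        return True                      # S -> ε
    if s[0] in 'abc':
        return derives_NOT(s[1:], s[0])  # S -> a NOTA | b NOTB | c NOTC
    return False

def derives_NOT(s, x):
    """True iff s is derivable from NOTX, i.e. s does not begin with x."""
    if s == '':
        return True                      # NOTX -> ε
    if s[0] in 'abc' and s[0] != x:
        return derives_NOT(s[1:], s[0])  # NOTX -> y NOTY  (for y != x)
    return False
```

For example, derives_S('abca') is True, while derives_S('abb') is False because NOTB has no production beginning with b.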
Example derivation: S ⇒ a NOTA ⇒ ab NOTB ⇒ abc NOTC ⇒ abca NOTA ⇒ abca
Defining the language by an NFA follows almost exactly the same logic as defining the
language by a grammar. Whenever an input symbol is read, go to a state that will accept
any symbol other than the one read.
To emphasise the similarity with the preceding grammar, we will name our states to
correspond to variables in the grammar.
Fig 14: Definition of Language by NFA
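In code, the automaton is just a transition table keyed by (state, symbol). The Python sketch below mirrors the grammar's variables in the state names and assumes, as the grammar's ε-productions suggest, that every state is accepting, so a string is accepted exactly when the automaton never dies:

```python
# Transition table: from each state, an arc for every symbol other than
# the one just read. All states are accepting.
delta = {
    ('S', 'a'): 'NOTA', ('S', 'b'): 'NOTB', ('S', 'c'): 'NOTC',
    ('NOTA', 'b'): 'NOTB', ('NOTA', 'c'): 'NOTC',
    ('NOTB', 'a'): 'NOTA', ('NOTB', 'c'): 'NOTC',
    ('NOTC', 'a'): 'NOTA', ('NOTC', 'b'): 'NOTB',
}

def accepts(s):
    state = 'S'
    for ch in s:
        state = delta.get((state, ch))
        if state is None:   # missing arc: a symbol followed itself
            return False
    return True             # every state is accepting
```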
As usual, it is more difficult to find a suitable regular expression to define this language,
and the regular expression we do find bears little resemblance to the grammar or to the
NFA.
The key insight is that strings of the language can be viewed as consisting of zero or more
repetitions of the symbol a, and between them must be strings of the form bcbcbc... or
cbcbcb.... So we can start with
X a Y a Y a Y a ... Y a Z
where we have to find suitable expressions for X, Y, and Z. But first, let's get the above
expression in a proper form, by getting rid of the "...".
This gives
X a (Y a)* Z
but this form requires at least one a; to also allow the strings of the language that
contain no a's at all, we use
(X a (Y a)* Z) + X
Now, X must be either empty or a string of alternating b's and c's, so a first attempt is
X = (ε + b + c + (bc)* + (cb)*)
This isn't quite right, because it does not allow (bc)*b or (cb)*c. When we include these,
we get
X = (ε + b + c + (bc)* + (cb)* + (bc)*b + (cb)*c)
This is now correct, but could be simplified. The last four terms include the ε+b+c cases,
so we can drop those three terms. Then we can combine the last four terms into
X = (bc)*(b + ε) + (cb)*(c + ε)
Now, what about Z? As it happens, there isn't any difference between what we need for Z
and what we need for X, so we can also use the above expression for Z.
Finally, what about Y? This is just like the others, except that Y cannot be empty. We
must be a little careful here: simply dropping the ε terms from the simplified X would
lose the even-length strings such as bc, which can certainly occur between two a's. A
nonempty string of alternating b's and c's begins with either b or c, which gives
Y = b(cb)*(c + ε) + c(bc)*(b + ε)
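Putting the pieces together, we can verify the whole expression mechanically. The Python check below (our own sketch, with + rendered as | and ε as an optional suffix) takes Y to be the nonempty alternating strings b(cb)*(c+ε) + c(bc)*(b+ε), so that even-length strings such as bc between two a's are included, and compares (X a (Y a)* Z) + X against a direct no-repeated-symbol test on all short strings:

```python
import re
from itertools import product

# X and Z: possibly empty strings of strictly alternating b's and c's,
# i.e. (bc)*(b + ε) + (cb)*(c + ε) in the notes' notation.
X = r'(?:(?:bc)*b?|(?:cb)*c?)'
# Y: the nonempty alternating strings, b(cb)*(c + ε) + c(bc)*(b + ε).
Y = r'(?:b(?:cb)*c?|c(?:bc)*b?)'
# The full expression (X a (Y a)* Z) + X, with Z = X:
pattern = re.compile(f'{X}a(?:{Y}a)*{X}|{X}')

def no_repeats(s):
    """Direct check: no symbol ever follows itself."""
    return all(s[i] != s[i + 1] for i in range(len(s) - 1))

# The regular expression and the direct check agree on all strings
# over {a, b, c} of length at most 6.
for n in range(7):
    for letters in product('abc', repeat=n):
        s = ''.join(letters)
        assert (pattern.fullmatch(s) is not None) == no_repeats(s)
```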