Formal Theory New - 090050

The document discusses the representation of information through languages, focusing on alphabets, strings, and formal languages in computer science. It defines key concepts such as alphabets, strings, string concatenation, and formal grammars, emphasising their importance in communication and computation. Additionally, it highlights the role of string functions and representations in manipulating and encoding data.

INTRODUCTION

The ability to represent information is crucial to communicating and


processing information. Human societies created spoken languages to
communicate on a basic level, and developed writing to reach a more
sophisticated level.

The English language, for instance, in its spoken form relies on some
finite set of basic sounds as a set of primitives. The words are defined in
terms of finite sequences of such sounds. Sentences are derived from finite
sequences of words. Conversations are achieved from finite sequences of
sentences, and so forth.

Written English uses some finite set of symbols as a set of primitives.


The words are defined by finite sequences of symbols. Sentences are
derived from finite sequences of words. Paragraphs are obtained from
finite sequences of sentences, and so forth.

Similar approaches have also been developed for representing elements
of other sets. For instance, the natural numbers can be represented by finite
sequences of decimal digits. Computations, like natural languages, are
expected to deal with information in its most general form. Consequently,
computations function as manipulators of integers, graphs, programs, and
many other kinds of entities. However, in reality computations only
manipulate strings of symbols that represent the objects. The subsequent
discussions in this course necessitate the following definitions.

Alphabets
In computer science and formal language, an alphabet or vocabulary is
a finite set of symbols or letters, e.g. characters or digits. The most
common alphabet is {0, 1}, the binary alphabet. A finite string is a finite
sequence of letters from an alphabet; for instance, a binary string is a
string drawn from the alphabet {0, 1}. An infinite sequence of letters may
be constructed from elements of an alphabet as well.

Given an alphabet Σ, we write Σ* to denote the set of all finite strings over
the alphabet Σ. Here, the * denotes the Kleene star operator. We write
Σ∞ (or occasionally, ΣN or Σω) to denote the set of all infinite sequences over
the alphabet Σ.

For example, if we use the binary alphabet {0, 1}, the strings {ε, 0, 1, 00,
01, 10, 11, 000, …} would all be in the Kleene closure of the alphabet
(where ε represents the empty string).
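As an illustrative sketch (not part of the original text), the Kleene closure can be enumerated length by length; the helper name `kleene` is our own:

```python
from itertools import product

def kleene(alphabet, max_len):
    """Enumerate all strings over `alphabet` of length 0..max_len,
    shortest first -- a finite prefix of the Kleene closure."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

print(list(kleene("01", 2)))  # ['', '0', '1', '00', '01', '10', '11']
```

Note that the closure itself is infinite; only a finite prefix can ever be listed.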

Please note that alphabets are important in the use of formal languages,
automata and semi-automata. In most cases, for defining instances of
automata, such as deterministic finite automata (DFAs), it is required to
specify an alphabet from which the input strings for the automaton are
built.

String
In formal languages, which are used in mathematical logic and
theoretical computer science, a string is a finite sequence of symbols that
are chosen from a set or alphabet.

In computer programming, a string is, essentially, a sequence of


characters. A string is generally understood as a data type storing a
sequence of data values, usually bytes, in which elements usually stand
for characters according to a character encoding, which differentiates it
from the more general array data type. In this context, the terms binary
string and byte string are used to suggest strings in which the stored
data does not (necessarily) represent text.

A variable declared to have a string data type usually causes storage to


be allocated in memory that is capable of holding some predetermined
number of symbols. When a string appears literally in source code, it is
known as a string literal and has a representation that denotes it as such.

Formal Theory
Let Σ be an alphabet, a non-empty finite set. Elements of Σ are called
symbols or characters. A string (or word) over Σ is any finite sequence
of characters from Σ. For example, if Σ = {0, 1}, then 0101 is a string over
Σ.

The length of a string is the number of characters in the string (the length
of the sequence) and can be any non-negative integer. The empty string
is the unique string over Σ of length 0, and is denoted ε or λ. The set of
all strings over Σ of length n is denoted Σⁿ. For example, if Σ = {0, 1},
then Σ² = {00, 01, 10, 11}. Note that Σ⁰ = {ε} for any alphabet Σ.

The set of all strings over Σ of any length is the Kleene closure of Σ and
is denoted Σ*. In terms of Σⁿ,

Σ* = ⋃ₙ≥₀ Σⁿ = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ …

For example, if Σ = {0, 1}, Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011,
…}. Although Σ* itself is countably infinite, all elements of Σ* have finite
length.

A set of strings over Σ (i.e. any subset of Σ*) is called a formal language
over Σ. For example, if Σ = {0, 1}, the set of strings with an even number
of zeros ({ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001,
1010, 1100, 1111, …}) is a formal language over Σ.

Strings and Sets of Strings


If V is a set, then V* denotes the set of all finite strings of elements of V,
including the empty string, which will be denoted by ε. e.g.

{0, 1}* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, …}

The set of all non-empty strings of elements of V is denoted by V+.

Usually, V+ = V* \ {ε}, but when ε ∈ V, V+ = V*. e.g. {0, 1}+ = {0, 1,
00, 01, 10, 11, 000, 001, …}

but {ε, 0, 1}+ = {0, 1}* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, …}

If x ∈ V* and y ∈ V*, then xy will denote their concatenation, that is, the
string consisting of x followed by y.

If x ∈ V*, then xⁿ = xx…x (n times), for n ≥ 0. We assume x⁰ = ε, the
empty string.

e.g. {a}* = {ε, a, a², a³, …, aⁿ, …} = {aⁿ : n ≥ 0}

{a}+ = {a, a², a³, …, aⁿ, …} = {aⁿ : n ≥ 1}

Similarly, if X and Y are sets of strings, then their concatenation is also
denoted by XY. Of course, XY = {xy : x ∈ X and y ∈ Y}.

Also, Xⁿ = XX…X (n times), for n ≥ 0. Of course, X⁰ = {ε}.

e.g. {0, 1}{a, b, c} = {0a, 0b, 0c, 1a, 1b, 1c}

{0, 1}³ = {000, 001, 010, 011, 100, 101, 110, 111}

If x is a string, then |x| denotes the length of x, and this is the number of
indivisible symbols in x. Of course |ε| = 0.

EXERCISE
1. Determine the following sets:
(a) {0, 1}{ε, a, ba} (b) {b, aa}*
2. Let V be a set of strings. Does V+ = VV*?

Alphabet of a String
The alphabet of a string is the set of all the letters that occur in a
particular string. If s is a string, its alphabet is denoted by Alph(s).

String Substitution
Let L be a language, and let Σ be its alphabet. A string substitution or
simply a substitution is a mapping f that maps letters in Σ to languages
(possibly in a different alphabet). Thus, for example, given a letter a ∈
Σ, one has f(a) = La where La ⊂ Δ* is some language whose alphabet is
Δ. This mapping may be extended to strings as

f(ε) = ε

for the empty string ε, and

f(sa) = f(s)f(a)

for any string s and letter a ∈ Σ. String substitution may be extended to
the entire language as

f(L) = ⋃_{s ∈ L} f(s)

An example of string substitution occurs in regular languages, which are


closed under string substitution. That is, if the letters of a regular
language are substituted by other regular languages, the result is still a
regular language.
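A minimal sketch of how a substitution extends from letters to strings; the mapping `f` below is a made-up example, not taken from the text:

```python
def substitute(f, s):
    """Extend a letter-to-language mapping f to a string s:
    f(epsilon) = {epsilon}, and f(sa) = f(s)f(a), where the product
    of two languages is elementwise concatenation."""
    result = {""}                      # start from f(epsilon)
    for a in s:
        result = {x + y for x in result for y in f[a]}
    return result

# hypothetical substitution over the alphabet {a, b}
f = {"a": {"0", "01"}, "b": {"1"}}
print(sorted(substitute(f, "ab")))  # ['01', '011']
```

Each letter of the input contributes one factor, so the result of substituting into a string is the concatenation of the per-letter languages.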

Another example is the conversion of an EBCDIC-encoded string to


ASCII.

CONCATENATION AND SUBSTRINGS
Concatenation is an important binary operation on Σ*. For any two
strings s and t in Σ*, their concatenation is defined as the sequence of
characters in s followed by the sequence of characters in t, and is denoted
st. For example, if Σ = {a, b, …, z}, s = bear, and t = hug, then st = bearhug
and ts = hugbear.

String concatenation is an associative, but non-commutative operation.


The empty string serves as the identity element; for any string s, εs = sε
= s. Therefore, the set Σ* and the concatenation operation form a monoid,
the free monoid generated by Σ. In addition, the length function defines
a monoid homomorphism from Σ* to the non-negative integers. A string
s is said to be a substring or factor of t if there exist (possibly empty)
strings u and v such that t = usv. The relation "is a substring of" defines
a partial order on Σ*, the least element of which is the empty string.
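These monoid properties can be checked directly in code (a small sanity-check sketch using Python's `+` for concatenation; `is_substring` is our own helper name):

```python
s, t, u = "bear", "hug", "s"

# concatenation is associative but not commutative
assert (s + t) + u == s + (t + u)
assert s + t != t + s

# the empty string is the identity element
assert "" + s == s + "" == s

# the length function is a monoid homomorphism: |st| = |s| + |t|
assert len(s + t) == len(s) + len(t)

def is_substring(s, t):
    """s is a factor of t if t = usv for some (possibly empty) u, v."""
    return any(t[i:i + len(s)] == s for i in range(len(t) - len(s) + 1))

print(is_substring("arh", "bearhug"))  # True
```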

String Length
Although formal strings can have an arbitrary (but finite) length, the
length of strings in real languages is often constrained to an artificial
maximum. In general, there are two types of string datatypes: fixed length
strings which have a fixed maximum length and which use the same
amount of memory whether this maximum is reached or not, and variable
length strings whose length is not arbitrarily fixed and which use varying
amounts of memory depending on their actual size. Most strings in
modern programming languages are variable length strings. Despite the
name, even variable length strings are limited in length; although,
generally, the limit depends only on the amount of memory available.

Character String Functions


String functions are used to manipulate a string or change or edit the
contents of a string. They are also used to query information about a
string. They are usually used within the context of a computer
programming language.
The most basic example of a string function is the length(string)
function, which returns the length of a string (not counting any terminator
characters or any of the string's internal structural information) and does
not modify the string. For example, length(“hello world”) returns 11.

There are many string functions which exist in other languages with
similar or exactly the same syntax or parameters. For example, in many
languages the length function is represented as len(string). Even
though string functions are very useful, a programmer using these
functions should be mindful that a string function in one language could
behave differently in another language, or have a different name,
parameters, syntax, and results.

Representations
Given the preceding definitions of alphabets and strings, representations
of information can be viewed as the mapping of objects into strings in
accordance with some rules. That is, formally speaking, a representation
or encoding over an alphabet Σ of a set D is a function f from D to 2Σ*
that satisfies the following condition: f(e1) and f(e2) are disjoint
nonempty sets for each pair of distinct elements e1 and e2 in D.

If Σ is a unary alphabet, then the representation is said to be a unary


representation. If Σ is a binary alphabet, then the representation is said to
be a binary representation.

In what follows each element in f(e) will be referred to as a


representation, or encoding, of e.

Example 1

f1 is a binary representation over {0, 1} of the natural numbers if f1(0) =
{0, 00, 000, 0000, …}, f1(1) = {1, 01, 001, 0001, …}, f1(2) = {10, 010,
0010, 00010, …}, f1(3) = {11, 011, 0011, 00011, …}, and f1(4) = {100,
0100, 00100, 000100, …}, etc.

Similarly, f2 is a binary representation over {0, 1} of the natural numbers
if it assigns to the ith natural number the set consisting of the ith
canonically smallest binary string. In such a case, f2(0) = {ε}, f2(1) = {0},
f2(2) = {1}, f2(3) = {00}, f2(4) = {01}, f2(5) = {10}, f2(6) = {11}, f2(7) =
{000}, f2(8) = {001}, f2(9) = {010}, …

On the other hand, f3 is a unary representation over {1} of the natural
numbers if it assigns to the ith natural number the set consisting of the
ith alphabetically (= canonically) smallest unary string. In such a case,
f3(0) = {ε}, f3(1) = {1}, f3(2) = {11}, f3(3) = {111}, f3(4) = {1111}, …,
f3(i) = {1ⁱ}, …

The three representations f1, f2, and f3 are illustrated in Figure 1.
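The representations f2 and f3 can be computed mechanically. The following sketch is not from the text; the helper names `f2` and `f3` are ours, and each function returns the singleton set assigned to i:

```python
def f2(i):
    """f2(i) = {the i-th canonically smallest binary string},
    ordering strings by length, then lexicographically:
    e, 0, 1, 00, 01, 10, 11, 000, 001, ..."""
    n, length = i, 0
    while n >= 2 ** length:     # skip past all the shorter lengths
        n -= 2 ** length
        length += 1
    return {format(n, "b").zfill(length) if length else ""}

def f3(i):
    """f3(i) = {1^i}, the unary representation."""
    return {"1" * i}

print([f2(i) for i in range(8)])
# [{''}, {'0'}, {'1'}, {'00'}, {'01'}, {'10'}, {'11'}, {'000'}]
```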

Fig 1: Representations for the Natural Numbers

NOTE
• an alphabet or vocabulary is a finite set of symbols or letters
• a string is a finite sequence of symbols that are chosen from a set
or alphabet
• string functions are used to manipulate a string or change or edit
the contents of a string
• for any two strings s and t in Σ*, their concatenation is defined as
the sequence of characters in s followed by the sequence of
characters in t, and is denoted st
• string concatenation is an associative, but non-commutative
operation
• a string s is said to be a substring or factor of t if there exist
(possibly empty) strings u and v such that t = usv
• a representation or encoding over an alphabet Σ of a set D is a function f
from D to 2^Σ* that satisfies the following condition: f(e1) and f(e2) are
disjoint nonempty sets for each pair of distinct elements e1 and e2 in D.

Having learnt about strings and alphabets in the previous unit, you will
be taken through another important concept in formal language and
automata theory, which is grammar. This is because it is often convenient
to specify languages in terms of grammars. The advantage in doing so
arises mainly from the usage of a small number of rules for describing a
language with a large number of sentences.

For instance, the possibility that an English sentence consists of a subject


phrase followed by a predicate phrase can be expressed by a grammatical
rule of the form:

• <sentence> → <subject><predicate>. (The names in angular
brackets are assumed to belong to the grammar metalanguage.)

Similarly, the possibility that the subject phrase consists of a noun
phrase can be expressed by a grammatical rule of the form:

• <subject> → <noun>.

You may, therefore, think of a grammar as a set of rules for your native
language: subject, predicate, prepositional phrase, past participle, and so
on. This is a reasonably accurate, or at least helpful, description of a
human language, but it is not entirely rigorous. Chomsky formalised the
concept of a grammar, and made important observations regarding the
complexity of the grammar, which in turn establishes the complexity of
the language.
Formal Grammar

A formal grammar (sometimes simply called a grammar) is a set of


rules for forming strings in a formal language. The rules describe how to
form strings from the language's alphabet that are valid according to the
language's syntax. A grammar does not describe the meaning of the
strings or what can be done with them in whatever context – only their
form.

Formal language theory, the discipline which studies formal grammars


and languages, is a branch of applied mathematics. Its applications are
found in theoretical computer science, theoretical linguistics, formal
semantics, mathematical logic, and other areas.

A formal grammar is a set of rules for rewriting strings, along with a


"start symbol" from which rewriting must start. Therefore, a grammar is
usually thought of as a language generator. However, it can also
sometimes be used as the basis for a “recognizer”—a function in
computing that determines whether a given string belongs to the
language or is grammatically incorrect. To describe such recognisers,
formal language theory uses separate formalisms, known as automata
theory. One of the interesting results of automata theory is that it is not
possible to design a recogniser for certain formal languages.
Parsing is the process of recognising an utterance (a string in natural
languages) by breaking it down to a set of symbols and analysing each
one against the grammar of the language. Most languages have the
meanings of their utterances structured according to their syntax—a
practice known as compositional semantics. As a result, the first step to
describing the meaning of an utterance in language is to break it down

part by part and look at its analyzed form (known as its parse tree in
computer science, and as its deep structure in generative grammar).

Introductory Example

A grammar mainly consists of a set of rules for transforming strings. (If


it only consisted of these rules, it would be a semi-Thue system.) To
generate a string in the language, one begins with a string consisting of
only a single start symbol. The production rules are then applied in any
order, until a string that contains neither the start symbol nor designated
nonterminal symbols is produced. The language formed by the grammar
consists of all distinct strings that can be generated in this manner. Any
particular sequence of production rules on the start symbol yields a
distinct string in the language. If there are multiple ways of generating
the same single string, the grammar is said to be ambiguous.

For example, assume the alphabet consists of a and b, the start symbol is
S, and we have the following production rules:

1. S → aSb
2. S → ba

then we start with S, and can choose a rule to apply to it. If we choose
rule 1, we obtain the string aSb. If we choose rule 1 again, we
replace S with aSb and obtain the string aaSbb. If we now choose
rule 2, we replace S with ba and obtain the string aababb, and are
done. We can write this series of choices more briefly, using
symbols: S ⇒ aSb ⇒ aaSbb ⇒ aababb. The language of the
grammar is then the infinite set

{aⁿbabⁿ : n ≥ 0} = {ba, abab, aababb, aaababbb, …},

where aᵏ is a repeated k times (and n in particular represents the
number of times production rule 1 has been applied).
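The derivation above can be replayed mechanically. This sketch (function and rule names are ours) performs leftmost rewriting with the two rules S → aSb and S → ba:

```python
RULES = {1: ("S", "aSb"), 2: ("S", "ba")}

def derive(choices):
    """Start from S and apply the numbered rules in order,
    rewriting the leftmost occurrence of each rule's head."""
    s = "S"
    for i in choices:
        head, body = RULES[i]
        s = s.replace(head, body, 1)
    return s

print(derive([1, 1, 2]))  # aababb  (S => aSb => aaSbb => aababb)
```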

3.2 Formal Definition

3.2.1 The Syntax of Grammars

In the classic formalisation of generative grammars first proposed by


Noam Chomsky in the 1950s, a grammar G consists of the following
components:

• A finite set N of nonterminal symbols, none of which appear


in strings formed from G.

• A finite set Σ of terminal symbols that is disjoint from N.
• A finite set P of production rules, each rule of the form

(Σ ∪ N)* N (Σ ∪ N)* → (Σ ∪ N)*

where * is the Kleene star operator and ∪ denotes set union. That is,
each production rule maps from one string of symbols to another, where
the first string (the "head") contains an arbitrary number of symbols
provided at least one of them is a nonterminal. In the case that the
second string (the "body") consists solely of the empty string – i.e., that
it contains no symbols at all – it may be denoted with a special notation
(often Λ, e or ε) in order to avoid confusion.
• A distinguished symbol S ∈ N that is the start symbol.

A grammar is formally defined as the tuple (N,Σ,P,S). Such a formal


grammar is often called a rewriting system or a phrase structure
grammar in the literature.

3.2.2 The Semantics of Grammars

The operation of a grammar can be defined in terms of relations on


strings:

• Given a grammar G = (N, Σ, P, S), the binary relation ⇒G
(pronounced as "G derives in one step") on strings in (Σ ∪ N)* is
defined by:

x ⇒G y if and only if there exist strings u, v, p, q in (Σ ∪ N)* such
that x = upv, y = uqv, and p → q is a production rule in P.

• the relation ⇒G* (pronounced as "G derives in zero or more steps") is
defined as the reflexive transitive closure of ⇒G
• a sentential form is a member of (Σ ∪ N)* that can be derived in a finite
number of steps from the start symbol S; that is, a sentential form
is a member of {w ∈ (Σ ∪ N)* : S ⇒G* w}. A sentential form
that contains no nonterminal symbols (i.e. is a member of Σ*) is
called a sentence.
• the language of G, denoted as L(G), is defined as all those
sentences that can be derived in a finite number of steps from the
start symbol S; that is, the set {w ∈ Σ* : S ⇒G* w}.

Note that the grammar G = (N, Σ, P, S) is effectively the semi-Thue
system (N ∪ Σ, P), rewriting strings in exactly the same way; the only
difference is that we distinguish specific nonterminal symbols, which
must be rewritten in rewrite rules, and are only interested in rewritings
from the designated start symbol S to strings without nonterminal
symbols.

Example 1

Please note that for these examples, formal languages are specified
using set-builder notation.

Consider the grammar G where N = {S, B}, Σ = {a, b, c}, S is the
start symbol, and P consists of the following production rules:

1. S → aBSc
2. S → abc
3. Ba → aB
4. Bb → bb

This grammar defines the language

L(G) = {aⁿbⁿcⁿ | n ≥ 1},

where aⁿ denotes a string of n consecutive a's. Thus, the language is the
set of strings that consist of 1 or more a's, followed by the same number
of b's, followed by the same number of c's.

Some examples of the derivation of strings in L(G) are:

S ⇒₂ abc
S ⇒₁ aBSc ⇒₂ aBabcc ⇒₃ aaBbcc ⇒₄ aabbcc

(Note on notation: P ⇒ᵢ Q reads "String P generates string Q by means
of production i".)
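The same mechanical-rewriting idea applies to the four rules of G. A sketch (our own helper names) that replays the derivation of aabbcc, assuming the rules S → aBSc, S → abc, Ba → aB, Bb → bb reconstructed above:

```python
# production rules of G as (head, body) pairs
RULES = {1: ("S", "aBSc"), 2: ("S", "abc"), 3: ("Ba", "aB"), 4: ("Bb", "bb")}

def apply_rules(sequence, start="S"):
    """Apply the numbered rules in order; each rule must be applicable,
    and the leftmost occurrence of its head is rewritten."""
    s = start
    for i in sequence:
        head, body = RULES[i]
        assert head in s, f"rule {i} is not applicable to {s!r}"
        s = s.replace(head, body, 1)
    return s

print(apply_rules([2]))            # abc
print(apply_rules([1, 2, 3, 4]))   # aabbcc
```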

3.3 The Chomsky Hierarchy

When Noam Chomsky first formalised generative grammars in 1956, he


classified them into types now known as the Chomsky hierarchy. The
difference between these types is that they have increasingly strict
production rules and can express fewer formal languages. Two important
types are context-free grammars (Type 2) and regular grammars (Type
3). The languages that can be described with such a grammar are called

context-free languages and regular languages, respectively. Although
much less powerful than unrestricted grammars
(Type 0), which can in fact express any language that can be accepted by
a Turing machine, these two restricted types of grammars are most often
used because parsers for them can be efficiently implemented. For
example, all regular languages can be recognised by a finite state
machine, and for useful subsets of context-free grammars there are
well-known algorithms to generate efficient LL parsers and LR parsers to
recognise the corresponding languages those grammars generate.

3.4 Context-Free Grammars

A context-free grammar is a grammar in which the left-hand side of each


production rule consists of only a single nonterminal symbol. This
restriction is non-trivial; not all languages can be generated by
context-free grammars. Those that can are called context-free languages.

The language defined above is not a context-free language, and this can
be strictly proven using the pumping lemma for context-free languages,
but for example the language L(G) = {aⁿbⁿ | n ≥ 1} (at least 1 a followed
by the same number of b's) is context-free, as it can be defined by the
grammar G2 with N = {S}, Σ = {a, b}, S the start symbol, and the
following production rules:

1. S → aSb
2. S → ab
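Membership in this particular language can also be tested directly, without a parser; a sketch (the function name is ours):

```python
def in_anbn(s):
    """Membership test for {a^n b^n | n >= 1}."""
    n = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == "a" * n + "b" * n

print([w for w in ["ab", "aabb", "aab", "ba", ""] if in_anbn(w)])
# ['ab', 'aabb']
```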

A context-free language can be recognised in O(n³) time (see Big O
notation) by an algorithm such as Earley's algorithm. That is, for every
context-free language, a machine can be built that takes a string as input
and determines in O(n³) time whether the string is a member of the
language, where n is the length of the string. Further, some important
subsets of the context-free languages can be recognised in linear time
using other algorithms.

3.5 Regular Grammars

In regular grammars, the left-hand side is again only a single nonterminal
symbol, but now the right-hand side is also restricted. The right side may
be the empty string, or a single terminal symbol, or a single terminal
symbol followed by a nonterminal symbol, but nothing else. (Sometimes
a broader definition is used: one can allow longer strings of terminals or

single nonterminals without anything else, making languages easier to
denote while still defining the same class of languages.)

The language defined above is not regular, but the language {aⁿbᵐ | m, n
≥ 1} (at least 1 a followed by at least 1 b, where the numbers may be
different) is, as it can be defined by the grammar G3 with N = {S, A, B},
Σ = {a, b}, S the start symbol, and the following production rules:

1. S → aA
2. A → aA
3. A → bB
4. B → bB
5. B → ε

All languages generated by a regular grammar can be recognised in linear


time by a finite state machine. Although, in practice, regular grammars
are commonly expressed using regular expressions, some forms of
regular expression used in practice do not strictly generate the regular
languages and do not show linear recognition performance due to
those deviations.
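A finite state machine for {aⁿbᵐ | m, n ≥ 1} needs only a handful of states; a sketch (state names and the function `accepts` are our own):

```python
def accepts(s):
    """Finite state machine for {a^n b^m | n, m >= 1}."""
    state = "start"
    for ch in s:
        if state == "start" and ch == "a":
            state = "saw_a"            # one or more a's read
        elif state == "saw_a" and ch == "a":
            state = "saw_a"
        elif state in ("saw_a", "saw_b") and ch == "b":
            state = "saw_b"            # now reading the b's
        else:
            return False               # dead state: reject
    return state == "saw_b"

print([w for w in ["ab", "aaabb", "ba", "a", "abab"] if accepts(w)])
# ['ab', 'aaabb']
```

Each input character causes exactly one state transition, which is why recognition runs in linear time.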

3.6 Other Forms of Generative Grammars

Many extensions and variations on Chomsky’s original hierarchy of


formal grammars have been developed, both by linguists and by
computer scientists, usually either in order to increase their expressive
power or in order to make them easier to analyse or parse. Some forms
of grammars developed include:

• Tree-adjoining grammars increase the expressiveness of
conventional generative grammars by allowing rewrite rules to
operate on parse trees instead of just strings.
• Affix grammars and attribute grammars allow rewrite rules to be
augmented with semantic attributes and operations, useful both
for increasing grammar expressiveness and for constructing
practical language translation tools.

3.7 Analytic Grammars

Though there is a tremendous body of literature on parsing algorithms,


most of these algorithms assume that the language to be parsed is
initially described by means of a generative formal grammar, and that
the goal is to transform this generative grammar into a working parser.
Strictly speaking, a generative grammar does not in any way correspond
to the algorithm used to parse a language, and various algorithms have
different restrictions on the form of production rules that are considered
well-formed.

An alternative approach is to formalise the language in terms of an


analytic grammar in the first place, which more directly corresponds to
the structure and semantics of a parser for the language. Examples of
analytic grammar formalisms include the following:

• The Language Machine directly implements unrestricted analytic


grammars. Substitution rules are used to transform an input to
produce outputs and behaviour. The system can also produce the
im-diagram which shows what happens when the rules of an
unrestricted analytic grammar are being applied.
• Top-down parsing language (TDPL): a highly minimalist analytic
grammar formalism developed in the early 1970s to study the
behaviour of top-down parsers.
• Link grammars: a form of analytic grammar designed for
linguistics, which derives syntactic structure by examining the
positional relationships between pairs of words.
• Parsing expression grammars (PEGs): a more recent generalisation
of TDPL designed around the practical expressiveness needs of
programming language and compiler writers.
Formal Language

A formal language is a set of words, i.e. finite strings of letters, or


symbols. The inventory from which these letters are taken is called the
alphabet over which the language is defined. A formal language is often
defined by means of a formal grammar. Formal languages are a purely
syntactical notion, so there is not necessarily any meaning associated
with them. To distinguish the words that belong to a language from
arbitrary words over its alphabet, the former are sometimes called
well-formed words (or, in their application in logic, well-formed
formulas).

Formal languages are studied in the fields of logic, computer science and
linguistics. Their most important practical application is for the precise
definition of syntactically correct programs for a programming language.
The branch of mathematics and computer science that is concerned only
with the purely syntactical aspects of such languages, i.e. their internal
structural patterns, is known as formal language theory.

Although it is not formally part of the language, the words of a formal
language often have a semantical dimension as well. In practice this is
always tied very closely to the structure of the language, and a formal
grammar (a set of formation rules that recursively defines the language)
can help to deal with the meaning of (well-formed) words. Well-known
examples for this are “Tarski’s definition of truth” in terms of a Tschema
for first-order logic, and compiler generators like lex and yacc.

3.2 Words Over an Alphabet

An alphabet, in the context of formal languages, can be any set, although


it often makes sense to use an alphabet in the usual sense of the word, or
more generally a character set such as ASCII. Alphabets can also be
infinite; e.g. first-order logic is often expressed using an alphabet which,
besides symbols such as ∧, ¬, ∀ and parentheses, contains infinitely many
elements x0, x1, x2… that play the role of variables. The elements of an
alphabet are called its letters.

A word over an alphabet can be any finite sequence, or string, of letters.


The set of all words over an alphabet Σ is usually denoted by Σ* (using
the Kleene star). For any alphabet there is only one word of length 0, the
empty word, which is often denoted by e, ε or Λ. By concatenation one
can combine two words to form a new word, whose length is the sum of
the lengths of the original words. The result of concatenating a word with
the empty word is the original word.

As you learnt in the first unit of this course, in some applications,


especially in logic, the alphabet is also known as the vocabulary and
words are known as formulas or sentences; this breaks the letter/word
metaphor and replaces it by a word/sentence metaphor.

3.2.1 Formal Definition

A formal language L over an alphabet Σ is just a subset of Σ*, that is, a


set of words over that alphabet.

In computer science and mathematics, which do not deal with natural


languages, the adjective “formal” is usually omitted as redundant.

While formal language theory usually concerns itself with formal
languages that are defined by some syntactical rules, the actual definition
of a formal language is only as above: a (possibly infinite) set of
finite-length strings, no more no less. In practice, there are many languages that
can be defined by rules, such as regular languages or context-free
languages. The notion of a formal grammar may be closer to the intuitive
concept of a "language," one defined by syntactic rules.

By an abuse of the definition, a particular formal language is often


thought of as being equipped with a formal grammar that defines it.

Example 1

The following rules define a formal language L over the alphabet Σ = {0,
1, 2, 3, 4, 5, 6, 7, 8, 9, +, =}:

• Every nonempty string that does not contain + or = and does not
start with 0 is in L.
• The string 0 is in L.
• A string containing = is in L if and only if there is exactly one =,
and it separates two strings in L.
• A string containing + is in L if and only if every + in the string
separates two valid strings in L.
• No string is in L other than those implied by the previous rules.
Under these rules, the string “23+4=555” is in L, but the string
“=234=+” is not. This formal language expresses natural
numbers, well-formed addition statements, and well-formed
addition equalities, but it expresses only what they look like (their
syntax), not what they mean (semantics). For instance, nowhere
in these rules is there any indication that 0 means the number zero
or that + means addition.

For finite languages one can simply enumerate all well-formed
words. For example, we can define a language L as just L = {“a”,
“b”, “ab”, “cba”}.
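The five rules above translate almost directly into a recursive membership test; a sketch, where `in_L` and `is_number` are our own names:

```python
import re

def is_number(s):
    """Rules 1 and 2: the string '0', or a nonempty digit string
    that does not start with 0."""
    return s == "0" or bool(re.fullmatch(r"[1-9][0-9]*", s))

def in_L(s):
    if "=" in s:
        # rule 3: exactly one '=', separating two members of L
        left, _, right = s.partition("=")
        return "=" not in right and in_L(left) and in_L(right)
    if "+" in s:
        # rule 4: every '+' must separate two valid members of L
        return all(in_L(part) for part in s.split("+"))
    return is_number(s)

print(in_L("23+4=555"), in_L("=234=+"))  # True False
```

Note the recursion mirrors the rules exactly: an equality splits into two members of L, and a sum splits into number parts.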

However, even over a finite (non-empty) alphabet such as Σ = {a, b}


there are infinitely many words: “a”, “abb”, “ababba”,

“aaababbbbaab”, …. Therefore, formal languages are typically infinite,
and defining an infinite formal language is not as simple as writing L =
{“a”, “b”, “ab”, “cba”}. Here are some examples of formal languages:

• L = Σ*, the set of all words over Σ


• L = {a}* = {aⁿ}, where n ranges over the natural numbers and aⁿ
means “a” repeated n times (this is the set of words consisting only
of the symbol “a”)
• the set of syntactically correct programs in a given programming
language (the syntax of which is usually defined by a context-free
grammar
• the set of inputs upon which a certain Turing machine halts; or
• the set of maximal strings of alphanumeric ASCII characters on
this line, (i.e., the set {“the”, “set”, “of”, “maximal”, “strings”,
“alphanumeric”, “ASCII”, “characters”, “on”, “this”, “line”, “I”,
“e”}).

3.2.2 Vocabulary and Language

A vocabulary (or alphabet or character set or word list) is a finite
nonempty set of indivisible symbols (letters, digits, punctuation marks,
operators, etc.).

A language over a vocabulary V is any subset L of V* which has a finite
description. There are two approaches to making this mathematically
precise. One is to use a grammar, a form of inductive definition of L.
The other is to describe a method for recognising whether a given element
x ∈ V* belongs to the language L; automata theory is based on this approach.

3.3 Language-Specification Formalisms

Formal language theory rarely concerns itself with particular languages
(except as examples), but is mainly concerned with the study of various
types of formalisms to describe languages. For instance, a language can
be given as:

• those strings generated by some formal grammar (see Chomsky hierarchy)
• those strings described or matched by a particular regular
expression
• those strings accepted by some automaton, such as a Turing
machine or finite state automaton
• those strings for which some decision procedure (an algorithm
that asks a sequence of related YES/NO questions) produces the
answer YES.

Typical questions asked about such formalisms include:

• What is their expressive power? (Can formalism X describe every
language that formalism Y can describe? Can it describe other
languages?)
• What is their recognisability? (How difficult is it to decide
whether a given word belongs to a language described by
formalism X?)
• What is their comparability? (How difficult is it to decide whether
two languages, one described in formalism X and one in
formalism Y, or in X again, are actually the same language?).
Surprisingly often, the answer to these decision problems is “it
cannot be done at all” or “it is extremely expensive” (with a
precise characterisation of how expensive exactly). Therefore,
formal language theory is a major application area of
computability theory and complexity theory.

Operations on Languages

Certain operations on languages are common. These include the standard
set operations, such as union, intersection, and complementation.
Another class of operations is the element-wise application of string
operations.

Example 2
Suppose L1 and L2 are languages over some common alphabet. The
concatenation L1L2 consists of all strings of the form vw where v is a
string from L1 and w is a string from L2.
The intersection L1 ∩ L2 of L1 and L2 consists of all strings which are
contained in both languages.
The complement ¬L of a language with respect to a given alphabet
consists of all strings over the alphabet that are not in the language.
Such operations are used to investigate closure properties of classes of
languages. A class of languages is closed under a particular operation
when the operation, applied to languages in the class, always produces a
language in the same class again. For instance, the context-free languages
are known to be closed under union, concatenation, and intersection with
regular languages, but not closed under intersection or complementation.
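For finite languages, these operations can be sketched directly in Python. The helper names below are my own, and the complement is necessarily restricted to a finite slice of Σ* (words up to a length bound), since the full complement is infinite:

```python
from itertools import product

def concat(L1, L2):
    """Concatenation L1L2: every v in L1 followed by every w in L2."""
    return {v + w for v in L1 for w in L2}

def words_upto(alphabet, n):
    """All words over the alphabet of length at most n (a finite slice of Sigma*)."""
    return {"".join(t) for k in range(n + 1) for t in product(alphabet, repeat=k)}

def complement_upto(L, alphabet, n):
    """Complement of L, restricted to words of length at most n."""
    return words_upto(alphabet, n) - set(L)

L1, L2 = {"a", "ab"}, {"b", ""}
print(sorted(concat(L1, L2)))   # ['a', 'ab', 'abb']
print(L1 & L2)                  # intersection is just Python's set intersection
```

Note that "a"+"b" and "ab"+"" both yield "ab", so the concatenation of these two-element languages has only three elements.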

Other Operations on Languages

Some other operations frequently used in the study of formal languages
are the following:

• The Kleene star: the language consisting of all words that are
concatenations of 0 or more words in the original language;
• Reversal:

  a. Let ε be the empty word; then εR = ε, and
  b. for each non-empty word w = x1…xn over some alphabet, let
     wR = xn…x1.

  Then, for a formal language L, LR = {wR | w ∈ L}.

• String homomorphism.
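For finite languages both operations are one-liners. A sketch (function names are my own; the homomorphism is given as a dict mapping each symbol to its image string):

```python
def reverse_lang(L):
    """L^R: reverse every word in L."""
    return {w[::-1] for w in L}

def apply_homomorphism(h, L):
    """String homomorphism: replace each symbol c of each word by the string h[c]."""
    return {"".join(h[c] for c in w) for w in L}

print(sorted(reverse_lang({"ab", "c"})))                          # ['ba', 'c']
print(sorted(apply_homomorphism({"a": "01", "b": "1"}, {"ab", "b"})))  # ['011', '1']
```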

Derivations and Language of a Grammar


Let G=(N,T,P,S) be any phrase structure grammar and let u, v∈(N∪T)*.
We write u⇒v and say v is derived in one step from u by the rule x→y,
provided that u = pxq and v = pyq. (Here the rule x→y is used to replace x by
y in u to produce v. Note that p, q∈(N∪T)*.)

If u1⇒u2⇒u3⇒……⇒un we say un is derived from u1 in G and write
u1⇒+ un.

Also, if u1 = un or u1⇒+ un, we write u1⇒* un.

L(G) the language of G is defined by:

L(G) = {t∈T*: Z⇒* t for some Z∈S} = {t∈T*: t∈S or Z⇒+ t for some Z∈S}.

So, the elements of L(G) are those elements of T* which are elements of
S or which are derivable from elements of S.
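To make L(G) concrete, here is a small sketch (the enumeration strategy and names are my own) that derives the words of L(G) up to a length bound by breadth-first search over sentential forms, for the grammar S → aSb | ε, which generates {a^n b^n}:

```python
from collections import deque

def language(rules, start, max_len):
    """Enumerate the words of L(G) of length <= max_len.
    Nonterminals are uppercase letters; rules is a list of (lhs, rhs) pairs."""
    words, seen = set(), {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if not any(c.isupper() for c in u):   # u is in T*: a derived word
            if len(u) <= max_len:
                words.add(u)
            continue
        for lhs, rhs in rules:
            i = u.find(lhs)
            while i != -1:                    # apply the rule at every position
                v = u[:i] + rhs + u[i + len(lhs):]
                # Generous pruning bound; adequate here because the one
                # nonterminal S can derive the empty word.
                if len(v) <= max_len + 1 and v not in seen:
                    seen.add(v)
                    queue.append(v)
                i = u.find(lhs, i + 1)
    return words

print(sorted(language([("S", "aSb"), ("S", "")], "S", 4)))   # ['', 'aabb', 'ab']
```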

• A formal language is a set of words, i.e. finite strings of letters
or symbols; the inventory from which these letters are taken is
called the alphabet over which the language is defined.
• A formal language is often defined by means of a formal grammar.
• A vocabulary (or alphabet or character set or word list) is a finite
non-empty set of indivisible symbols.
• Common operations on languages are the standard set operations,
such as union, intersection, and complementation.
• Another class of operations that can be performed on languages is
the element-wise application of string operations.

Automata Theory
In theoretical computer science, automata theory is the study of abstract
machines (or more appropriately, abstract ‘mathematical’ machines or
systems) and the computational problems that can be solved using these
machines. These abstract machines are called automata.

Fig. 2: An Example of Automata

Figure 2 above illustrates a finite state machine, which is one well-known
variety of automaton. This automaton consists of states (represented in the
figure by circles) and transitions (represented by arrows). As the
automaton sees a symbol of input, it makes a transition (or jump) to
another state, according to its transition function (which takes the current
state and the recent symbol as its inputs).

• Automata theory is also closely related to formal language theory,
as automata are often classified by the class of formal
languages they are able to recognise. An automaton can be a finite
representation of a formal language that may be an infinite set.
• In other words, automata theory is a subject which studies the
properties of various types of automata. For example, the following
questions are studied about a given type of automata:
• Which class of formal languages is recognisable by some type of
automata? (Recognisable languages)

• Are certain types of automata closed under union, intersection, or
complementation of formal languages? (Closure properties)
• How expressive is a type of automata in terms of the classes of
formal languages it can recognise? And what is the relative
expressive power of different types? (Language hierarchy)

Automata theory also studies whether there exists an effective algorithm
to solve problems such as the following:

• Does an automaton accept any input word? (Emptiness checking)
• Is it possible to transform a given non-deterministic automaton
into a deterministic automaton without changing the recognised
language? (Determinisation)
• For a given formal language, what is the smallest automaton that
recognises it? (Minimisation)

Automata play a major role in compiler design and parsing.

3.2 Automata

In the following sections, you will be presented with an introductory
definition of one type of automaton, which should help you grasp the
essential concepts involved in automata theory.

INFORMAL DESCRIPTION OF AUTOMATON

An automaton is supposed to run on some given sequence or string of
inputs in discrete time steps. At each time step, an automaton gets one
input that is picked from a set of symbols or letters, which is called an
alphabet. At any time, the symbols so far fed to the automaton as input
form a finite sequence of symbols, which is called a word. An automaton
contains a finite set of states. At each instant in time of some run, the
automaton is in one of its states. At each time step when the automaton
reads a symbol, it jumps or transits to the next state depending on its
current state and on the symbol currently read. This function of the
current state and input symbol is called the transition function.

The automaton reads the input word one symbol after another in the
sequence and transits from state to state according to the transition
function, until the word is read completely. Once the input word has been
read, the automaton is said to have stopped, and the state in which the
automaton stopped is called the final state. Depending on the final state,
it is said that the automaton either accepts or rejects an input word. There
is a subset of states of the automaton, which is defined as the set of
accepting states. If the final state is an accepting state, then the
automaton accepts the word. Otherwise, the word is rejected.

The set of all the words accepted by an automaton is called the language
recognized by the automaton.

Formal Definitions

Automaton

An automaton is represented formally by the 5-tuple ⟨Q, Σ, δ, q0, F⟩,


where:

• Q is a finite set of states.
• Σ is a finite set of symbols, called the alphabet of the automaton.
• δ is the transition function, that is, δ: Q × Σ → Q.
• q0 is the start state, that is, the state which the automaton is in
when no input has been processed yet, where q0 ∈ Q.
• F is a set of states of Q (i.e. F⊆Q) called accept states.

Input Word

An automaton reads a finite string of symbols a1, a2, ...., an, where ai ∈ Σ,
which is called an input word. The set of all words is denoted by Σ*.

RUN

A run of the automaton on an input word w = a1, a2, ...., an ∈ Σ* is a
sequence of states q0, q1, q2, ...., qn, where qi ∈ Q, such that q0 is the start
state and qi = δ(qi-1, ai) for 0 < i ≤ n. In words, at first the automaton is in
the start state q0, and then it reads the symbols of the input word in
sequence. When the automaton reads symbol ai, it jumps to state qi =
δ(qi-1, ai). qn is said to be the final state of the run.

Accepting Word

A word w ∈ Σ* is accepted by the automaton if qn ∈ F.

Recognized Language

An automaton can recognise a formal language. The language L ⊆ Σ*
recognised by an automaton is the set of all the words that are accepted by
the automaton.

Recognisable languages

The recognisable languages are the set of languages that are recognised by
some automaton. For the above definition of automata, the recognisable
languages are the regular languages. For different definitions of automata,
the recognisable languages are different.

Variations in Definition of Automata

Automata are defined to study useful machines under mathematical
formalism. So, the definition of an automaton is open to variations
according to the “real world machine” which we want to model using
the automaton. People have studied many variations of automata.

Above, the most standard variant is described, which is called the
deterministic finite automaton. The following are some popular
variations in the definition of the different components of automata.

Input

• Finite input: An automaton that accepts only finite words (finite
sequences of symbols). The above introductory definition only
accepts finite words.
• Infinite input: An automaton that accepts infinite words (ω-words).
Such automata are called ω-automata.
• Tree word input: The input may be a tree of symbols instead of
sequence of symbols. In this case after reading each symbol, the
automaton reads all the successor symbols in the input tree. It is
said that the automaton makes one copy of itself for each successor
and each such copy starts running on one of the successor symbols
from the state according to the transition relation of the automaton.
Such an automaton is called a tree automaton.

States

• Finite states: An automaton that contains only a finite number of
states. The above introductory definition describes automata with
finite numbers of states.
• Infinite states: An automaton that may not have a finite number of
states, or even a countable number of states. For example, the
quantum finite automaton or topological automaton has an
uncountably infinite set of states.
• Stack memory: An automaton may also contain some extra
memory in the form of a stack in which symbols can be pushed
and popped. This kind of automaton is called a pushdown
automaton.

Transition Function
• Deterministic: For a given current state and an input symbol, if an
automaton can only jump to one and only one state then it is a
deterministic automaton.
• Nondeterministic: An automaton that, after reading an input
symbol, may jump into any of a number of states, as licensed by
its transition relation. Notice that the term transition function is
replaced by transition relation: The automaton
nondeterministically decides to jump into one of the allowed
choices. Such automata are called nondeterministic automata.
• Alternation: This idea is quite similar to tree automata, but
orthogonal. The automaton may run multiple copies of itself on the
same next read symbol. Such automata are called alternating
automata. The acceptance condition must be satisfied by all runs of
such copies for the input to be accepted.

Acceptance Condition
• Acceptance of finite words: Same as described in the informal
definition above.
• Acceptance of infinite words: An ω-automaton cannot have
final states, as infinite words never terminate. Rather, acceptance
of the word is decided by looking at the infinite sequence of
states visited during the run.
• Probabilistic acceptance: An automaton need not strictly accept
or reject an input. It may accept the input with some probability
between zero and one. For example, quantum finite automata,
geometric automata and metric automata have probabilistic
acceptance.

Different combinations of the above variations produce many varieties
of automata.

CLASSES OF AUTOMATA

Table 1: Types of Automata


Automata                                           Recognisable language
-------------------------------------------------  --------------------------------
Deterministic finite automata (DFA)                regular languages
Nondeterministic finite automata (NFA)             regular languages
Nondeterministic finite automata with
  ε transitions (FND-ε or ε-NFA)                   regular languages
Pushdown automata (PDA)                            context-free languages
Linear bounded automata (LBA)                      context-sensitive languages
Turing machines                                    recursively enumerable languages
Timed automata
Deterministic Büchi automata                       omega limit languages
Nondeterministic Büchi automata                    omega regular languages
Nondeterministic/Deterministic Rabin automata      omega regular languages
Nondeterministic/Deterministic Streett automata    omega regular languages
Nondeterministic/Deterministic parity automata     omega regular languages
Nondeterministic/Deterministic Muller automata     omega regular languages

Discrete, Continuous, and Hybrid Automata

Normally, automata theory describes the states of abstract machines, but
there are analog automata, continuous automata, and hybrid
discrete-continuous automata, using analog data, continuous time, or
both. An automaton that computes a Boolean (yes-no) function is called
an acceptor. Acceptors may be used as the membership criterion of a
language. An automaton that produces more general output (typically a
string) is called a transducer.

Applications of Automata Theory

Each model in automata theory plays varied roles in several applied areas.
Finite automata are used in text processing, compilers, and hardware
design. Context-free grammars (CFGs) are used in programming languages
and artificial intelligence. Originally, CFGs were used in the study of
human languages. Cellular automata are used in the field of biology, the
most common example being John Conway's Game of Life. Some other
examples which could be explained using automata theory in biology
include mollusc shell and pine cone growth and pigmentation patterns. Going
further, Stephen Wolfram claims that the entire universe could be
explained by machines with a finite set of states and rules and a single
initial condition. Other areas of interest which he has related to automata
theory include: fluid flow, snowflake and crystal formation, chaos
theory, cosmology, and financial analysis.

• An automaton is a simple model of a computer. There is no formal
definition for “automaton”; instead, there are various kinds of
automata, each with its own formal definition.

• Generally, an automaton
  - has some form of input
  - has some form of output
  - has internal states
  - may or may not have some form of storage
  - is hard-wired rather than programmable

• An automaton can recognise a formal language.

• The recognisable languages are the set of languages that are
recognised by some automaton.

Finite State Automata

Like grammars, finite state automata define languages. The finite state
automata are often abbreviated FSA or FA (for finite automata);
however, some texts use the term finite state machine, or FSM to
correlate with Turing machines that will be discussed in module 4 of this
course.

An FSA is a virtual device that manipulates a candidate string, one
character at a time, and determines whether that string is in the language
implemented by the machine. The simplest state machine reads the string
exactly once and has no memory beyond its state registers. It is therefore
a finite state automaton.

An FSA is defined by a set of states and a transition function that maps
state/input pairs into states: in this state, reading this character, move to
that state and advance to the next character.

Some states are designated “final” states, and strings that leave the FSA
in one of these final states are, by definition, in the language.

One state is designated the start state. When the start state is also a final
state, ε is necessarily in the language.
Example 1

The following 4-state machine defines binary strings with an even
number of 1's and an even number of 0's. The states are a, b, c, d, and the
input characters are 0 and 1. State a is both the start state and the final
state.

        0   1
  a →   b   c
  b →   a   d
  c →   d   a
  d →   c   b
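As a sketch (the state labels follow the table; the function name and dictionary encoding are my own), this transition table translates directly into Python:

```python
# Transition table of the even-0s/even-1s machine; state 'a' is both
# the start state and the only final state.
DELTA = {
    ("a", "0"): "b", ("a", "1"): "c",
    ("b", "0"): "a", ("b", "1"): "d",
    ("c", "0"): "d", ("c", "1"): "a",
    ("d", "0"): "c", ("d", "1"): "b",
}

def accepts(word: str) -> bool:
    state = "a"
    for ch in word:
        state = DELTA[(state, ch)]
    return state == "a"

print(accepts("0011"))   # True: two 0's and two 1's
print(accepts("011"))    # False: odd number of 0's
```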

3.2 Deterministic Finite Acceptors/Automata (DFA)

DFAs are:

• deterministic, i.e. there is no element of choice;
• finite, i.e. there are only a finite number of states and arcs;
• acceptors, i.e. they produce only a yes/no answer.
A DFA is drawn as a graph, with each state represented by a circle.
One designated state is the start state. Some states (possibly including
the start state) can be designated as final states.

Arcs between states represent state transitions. Each such arc is labelled
with the symbol that triggers the transition.

Fig 1: Example of DFA

3.2.1 Algorithm for the Operation of a DFA

• Start with the “current state” set to the start state and a “read head”
at the beginning of the input string.
• While there are still characters in the string:
  - read the next character and advance the read head;
  - from the current state, follow the arc that is labelled with the
    character just read; the state that the arc points to becomes the
    next current state.
• When all characters have been read, accept the string if the current
state is a final state; otherwise, reject the string.

Example 2

Consider the following input string: 1 0 0 1 1 1 0 0. Using the DFA in
Fig. 1 above, a sample trace is as follows:

q0 →1→ q1 →0→ q3 →0→ q1 →1→ q0 →1→ q1 →1→ q0 →0→ q2 →0→ q0

Since q0 is a final state, the string is accepted.

3.2.2 Implementing a DFA

3.2.2.1 Using a GO TO Statement

If you do not object to the go to statement, below is an easy way to
implement a DFA:

q0: read char;
    if eof then accept string;
    if char = 0 then go to q2;
    if char = 1 then go to q1;

q1: read char;
    if eof then reject string;
    if char = 0 then go to q3;
    if char = 1 then go to q0;

q2: read char;
    if eof then reject string;
    if char = 0 then go to q0;
    if char = 1 then go to q3;

q3: read char;
    if eof then reject string;
    if char = 0 then go to q1;
    if char = 1 then go to q2;

3.2.2.2 Using a CASE Statement

If you are not allowed to use a go to statement, you can fake it with a
combination of a loop and a case statement:

state := q0;
loop
  case state of
    q0: read char;
        if eof then accept string;
        if char = 0 then state := q2;
        if char = 1 then state := q1;
    q1: read char;
        if eof then reject string;
        if char = 0 then state := q3;
        if char = 1 then state := q0;
    q2: read char;
        if eof then reject string;
        if char = 0 then state := q0;
        if char = 1 then state := q3;
    q3: read char;
        if eof then reject string;
        if char = 0 then state := q1;
        if char = 1 then state := q2;
  end case;
end loop;
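In a language with first-class data structures, both versions collapse into a single table-driven loop: the transition function becomes a lookup table, so no per-state code is needed. A sketch in Python (the encoding and names are my own), using the same four states:

```python
# The same machine as the go to and case versions: q0 is the start state
# and the only accepting state.
delta = {
    ("q0", "0"): "q2", ("q0", "1"): "q1",
    ("q1", "0"): "q3", ("q1", "1"): "q0",
    ("q2", "0"): "q0", ("q2", "1"): "q3",
    ("q3", "0"): "q1", ("q3", "1"): "q2",
}

def run_dfa(delta, start, finals, word):
    state = start
    for ch in word:            # "read char" until end of input
        state = delta[(state, ch)]
    return state in finals     # accept iff we stop in a final state

print(run_dfa(delta, "q0", {"q0"}, "10011100"))   # True (the trace from Example 2)
```

The table-driven form also makes it trivial to swap in a different machine: only the dictionary changes, not the control flow.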

3.2.3 Formal Definition of a DFA

A deterministic finite acceptor/automaton or DFA is a quintuple:

M = (Q, Σ, δ, q0, F)

where
• Q is a finite set of states,
• Σ is a finite set of symbols, the input alphabet,
• δ: Q × Σ → Q is a transition function,
• q0 ∈ Q is the initial state, and
• F ⊆ Q is a set of final states.

Note: The fact that δ is a function implies that every vertex has an
outgoing arc for each member of Σ.

We can also define an extended transition function as

δ*: Q × Σ* → Q.

If a DFA M = (Q, Σ, δ, q0, F) is used as a membership criterion, then
the set of strings accepted by M is a language. That is,

L(M) = {w ∈ Σ* : δ*(q0, w) ∈ F}.

Languages that can be defined by DFAs are called regular languages.

3.3 Acceptor for Ada Identifiers

In Ada, an identifier consists of a letter followed by any number of letters,
digits, and underlines. However, the identifier may not end in an
underline or have two underlines in a row.

Here is an automaton to recognise Ada identifiers.

Fig.3

M = (Q, Σ, δ, q0, F), where

• Q is {q0, q1, q2, q3}
• Σ is {letter, digit, underline}
• δ is given by

  δ(q0, letter) = q1       δ(q1, letter) = q1
  δ(q0, digit) = q3        δ(q1, digit) = q1
  δ(q0, underline) = q3    δ(q1, underline) = q2
  δ(q2, letter) = q1       δ(q3, letter) = q3
  δ(q2, digit) = q1        δ(q3, digit) = q3
  δ(q2, underline) = q3    δ(q3, underline) = q3

• q0 ∈ Q is the initial state, and
• {q1} ⊆ Q is the set of final states.
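A sketch of this acceptor in Python (the character-classification helper and function names are my own; “underline” is the '_' character, and q3 serves as the trap state for ill-formed input):

```python
DELTA = {
    ("q0", "letter"): "q1", ("q0", "digit"): "q3", ("q0", "underline"): "q3",
    ("q1", "letter"): "q1", ("q1", "digit"): "q1", ("q1", "underline"): "q2",
    ("q2", "letter"): "q1", ("q2", "digit"): "q1", ("q2", "underline"): "q3",
    ("q3", "letter"): "q3", ("q3", "digit"): "q3", ("q3", "underline"): "q3",
}

def classify(c):
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    if c == "_":
        return "underline"
    return None              # not in the alphabet at all

def is_ada_identifier(s: str) -> bool:
    state = "q0"
    for c in s:
        kind = classify(c)
        if kind is None:
            return False
        state = DELTA[(state, kind)]
    return state == "q1"     # q1 is the only final state

print(is_ada_identifier("Count_2"))   # True
print(is_ada_identifier("a__b"))      # False: two underlines in a row
```

Note that a string ending in an underline stops in q2, which is not final, so it is rejected as required.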

3.3.1 Abbreviated Acceptor for Ada Identifiers

The following is an abbreviated automaton (my terminology) to recognise Ada identifiers.
You might use something like this in a course on compiler construction.

The difference is that, in this automaton, δ does not appear to be a function. It looks like
a partial function, that is, it is not defined for all values of Q × Σ.

We can complete the definition of δ by assuming the existence of an
“invisible” state and some “invisible” arcs. Specifically:

• There is exactly one implicit error state.
• If there is no path shown from a state for a given symbol in Σ,
there is an implicit path for that symbol to the error state.
• The error state is a trap state: once you get into it, all arcs (one for each symbol in
Σ) lead back to it.
• The error state is not a final state.

The automaton represented in Figure 4 above is really exactly the same as the automaton
in Figure 3; we just have not bothered to draw one state and a whole bunch of arcs that
we know must be there.

I do not think you will find abbreviated automata in the textbook. They are not usually
allowed in a formal course. However, if you ever use an automaton to design a lexical
scanner, putting in an explicit error state just clutters up the diagram.

3.4 Nondeterministic Finite Automata/Acceptors (NFA)

An FSA is nondeterministic if it is confronted with several choices when processing each
character. Thus the transition function of a nondeterministic FSA maps state/input pairs
into sets of states. The machine somehow traverses all possible paths in parallel. In
addition, ε-transitions are permitted, allowing the machine to change states without
reading an input character. A string is accepted by an NFA if one of its parallel transition
sequences leads to a final state.

This seems to add a great deal of power, but in fact it does not. Any NFA can be emulated
by an FSA with more states. Start with an NFA containing states x0, x1, x2, etc., n in all, and
construct a deterministic FSA with up to 2^n states as follows. Each state in the new FSA
corresponds to a unique combination of states in the original NFA. The initial state y0
corresponds to the union of the initial state x0 and all other xj states that are accessible
from x0 via ε-transitions. The state yi in the FSA is a final state if any of the corresponding
xj states, represented by yi, is a final state in the original NFA. To determine the transition
function f(yi, c), apply c to each corresponding xj state, and bring in any new states that
are accessible via ε-transitions. The combination of all these states determines a particular
yk. Thus state yi, reading character c, moves to state yk.
By induction on string length, any string that leaves the constructed FSA in state yi also
leaves the original NFA in one of the corresponding states xj. One machine says yes to the
input word if and only if the other one does. Therefore nondeterministic FSAs are no
more powerful than their deterministic counterparts.
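This subset construction can be sketched in Python. All names below are my own; δ is given as a dict from (state, symbol) to a set of states, ε-moves as a dict from state to a set of states, and each DFA state is a frozenset of NFA states:

```python
def eps_closure(states, eps):
    """All states reachable from `states` by zero or more epsilon-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(delta, eps, start, finals, alphabet):
    """Subset construction: each DFA state is a set of NFA states."""
    start_set = eps_closure({start}, eps)
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        S = todo.pop()
        for a in alphabet:
            moves = set()
            for s in S:                      # apply a to each NFA state in S
                moves |= delta.get((s, a), set())
            T = eps_closure(moves, eps)      # then bring in epsilon-successors
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_finals = {S for S in seen if S & set(finals)}
    return dfa_delta, start_set, dfa_finals

# NFA over {a, b} accepting strings that end in "ab":
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
d, s0, F = nfa_to_dfa(delta, {}, 0, {2}, "ab")

def run(d, S, F, word):
    for c in word:
        S = d[(S, c)]
    return S in F

print(run(d, s0, F, "aab"))   # True
print(run(d, s0, F, "aba"))   # False
```

In this example only three subset-states are ever reachable, illustrating the remark below that far fewer than 2^n states are usually needed.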

A finite-state automaton can be nondeterministic in either or both of two ways:

Fig. 5: Nondeterministic Finite Acceptor

A state may have two or more arcs emanating from it labelled with the same symbol. When
the symbol occurs in the input, either arc may be followed.

A state may have one or more arcs emanating from it labelled with ε (the empty string).
These arcs may optionally be followed without looking at the input or consuming an input
symbol.

Due to non-determinism, the same string may cause an NFA to end up in one of several
different states, some of which may be final while others are not. The string is accepted if
any possible ending state is a final state.

Fig 6: Examples of NFAs

3.4.1 Implementing an NFA

If you think of an automaton as a computer, how does it handle nondeterminism?

There are two ways that this could, in theory, be done:

• When the automaton is faced with a choice, it always (magically) chooses correctly.
We sometimes think of the automaton as consulting an oracle which advises it as
to the correct choice.
• When the automaton is faced with a choice, it spawns a new process, so that all
possible paths are followed simultaneously.

The first of these alternatives, using an oracle, is sometimes attractive
mathematically. But if we want to write a program to implement an NFA, that is
not feasible.

• Use a recursive backtracking algorithm. Whenever the automaton has to make a
choice, cycle through all the alternatives and make a recursive call to determine
whether any of the alternatives leads to a solution (final state).
• Maintain a state set or a state vector, keeping track of all the states that the NFA
could be in at any given point in the string.
• Use a quantum computer. Quantum computers explore literally all possibilities
simultaneously. They are theoretically possible, but are at the cutting edge of
physics. It may (or may not) be feasible to build such a device.

3.4.1.1 Recursive Implementation of NFAs

An NFA can be implemented by means of a recursive search from the start state for a path
(directed by the symbols of the input string) to a final state.

Here is a rough outline of such an implementation:

function NFA (state A) returns Boolean:
    local state B, symbol x;
    for each ε transition from state A to some state B do
        if NFA (B) then return True;
    if there is a next symbol then
    {   read next symbol (x);
        for each x transition from state A to some state B do
            if NFA (B) then return True;
        return False;
    }
    else
    {   if A is a final state then return True;
        else return False;
    }

One problem with this implementation is that it could get into an infinite loop if there is a
cycle of transitions. This could be prevented by maintaining a simple counter.

3.4.1.2 State-Set Implementation of NFAs

Another way to implement an NFA is to keep either a state set or a bit vector of all the
states that the NFA could be in at any given time. Implementation is easier if you
use a bit-vector approach (v[i] is True if state i is a possible state), since most
languages provide vectors, but not sets, as a built-in datatype. However, it is a bit easier
to describe the algorithm if you use a state-set approach, so that is what we will do. The
logic is the same in either case.

function NFA (state set A) returns Boolean:
    local state set B, state a, state b, state c, symbol x;
    for each ε transition from some state a in A
            to some state b not in A do
        add b to A;
    while there is a next symbol do
    {   read next symbol (x);
        B := ∅;
        for each a in A do
            for each x transition from a to some state b do
                add b to B;
        for each ε transition from some state b in B
                to some state c not in B do
            add c to B;
        A := B;
    }
    if any element of A is a final state then return True;
    else return False;
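The same state-set idea, as a direct Python simulation sketch (all names are my own; a full ε-closure is computed at each step, using sets rather than bit vectors):

```python
def closure(states, eps):
    """epsilon-closure: keep adding epsilon-successors until nothing changes."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result

def nfa_accepts(delta, eps, start, finals, word):
    A = closure({start}, eps)          # all states we could be in before reading
    for x in word:
        B = set()
        for a in A:                    # follow every x-transition in parallel
            B |= delta.get((a, x), set())
        A = closure(B, eps)
    return bool(A & set(finals))       # accept if any possible end state is final

# Tiny NFA with an epsilon-move: 0 --eps--> 1, and 1 --b--> 1; final state 1.
delta = {(1, "b"): {1}}
eps = {0: {1}}
print(nfa_accepts(delta, eps, 0, {1}, ""))     # True
print(nfa_accepts(delta, eps, 0, {1}, "bb"))   # True
print(nfa_accepts(delta, eps, 0, {1}, "a"))    # False
```

Because the state set can only ever hold |Q| states, this runs in time proportional to the input length times the size of the automaton, with no backtracking.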

3.4.1.3 Formal Definition of NFAs

The extension of our notation to NFAs is somewhat strained. A non-deterministic finite
acceptor/automaton or NFA is defined by the quintuple

M = (Q, Σ, δ, q0, F)

where

• Q is a finite set of states,
• Σ is a finite set of symbols, the input alphabet,
• δ: Q × (Σ ∪ {ε}) → 2^Q is a transition function,
• q0 ∈ Q is the initial state, and
• F ⊆ Q is a set of final states.

These are all the same as for a DFA except for the definition of δ:

• Transitions on ε are allowed in addition to transitions on elements of Σ, and
• The range of δ is 2^Q rather than Q. This means that the values of δ are not elements
of Q, but rather are sets of elements of Q.

The language defined by NFA M is defined as:

L(M) = {w ∈ Σ* : δ*(q0, w) ∩ F ≠ ∅}

3.5 Equivalence of FAs

Two acceptors are equivalent if they accept the same language.

• A DFA is just a special case of an NFA that happens not to have any null transitions
or multiple transitions on the same symbol. So DFAs are not more powerful than
NFAs.
• For any NFA, we can construct an equivalent DFA (see below). So NFAs are not
more powerful than DFAs. DFAs and NFAs define the same class of languages –
the regular languages.
• To translate an NFA into a DFA, the trick is to label each state in the DFA with a
set of states from the NFA. Each state in the DFA summarises all the states that the
NFA might be in. If the NFA contains |Q| states, the resultant DFA could contain
as many as 2^|Q| states. (Usually far fewer states will be needed.)

• Finite state automata define languages.
• An FSA is nondeterministic if it is confronted with several choices when processing
each character.
• A finite-state automaton can be nondeterministic in either or both of two ways.
• Two acceptors are equivalent if they accept the same language.
• A DFA is just a special case of an NFA that happens not to have any null transitions
or multiple transitions on the same symbol.
• For any NFA, we can construct an equivalent DFA.
• In Ada, an identifier consists of a letter followed by any number of letters, digits,
and underlines.

Primitive Regular Expressions

A regular expression can be used to define a language. A regular expression represents a
“pattern”; strings that match the pattern are in the language, strings that do not match
the pattern are not in the language.

As usual, the strings are over some alphabet Σ.

The following are primitive regular expressions:

• x, for each x ∈ Σ,
• ε, the empty string, and
• ∅, indicating no strings at all.

Thus, if |Σ| = n, then there are n+2 primitive regular expressions defined over Σ.

Here are the languages defined by the primitive regular expressions:

• For each x ∈ Σ, the primitive regular expression x denotes the language {x}. That
is, the only string in the language is the string “x”.
• The primitive regular expression ε denotes the language {ε}.
The only string in this language is the empty string.
• The primitive regular expression ∅ denotes the language {}.
There are no strings in this language.

3.2 Regular Expressions

Every primitive regular expression is a regular expression. We can compose additional


regular expressions by applying the following rules a finite number of times:

• If r1 is a regular expression, then so is (r1).


• If r1 is a regular expression, then so is r1*.
• If r1 and r2 are regular expressions, then so is r1r2 (Concatenation)
• If r1 and r2 are regular expressions, then so is r1+r2 or r1|r2 (Union)

Here is what the above notation means:

• Parentheses are just used for grouping.


• The postfix star (Kleene closure) indicates zero or more repetitions of the preceding
regular expression. Thus, if x ∈ Σ, then the regular expression x* denotes the language
{ε, x, xx, xxx, ...}.
• Juxtaposition/concatenation of r1 and r2 indicates any string described by r1
immediately followed by any string described by r2. For example, if x, y ∈ Σ, then
the regular expression xy describes the language {xy}.

• The plus (+) or | sign, read as "or," denotes the language containing strings
described by either of the component regular expressions i.e. the union of the
component regular expressions. For example, if x, y ∈ Σ, then the regular expression
x+y or x|y describes the language {x, y}.

Precedence
• The unary postfix operator * (Kleene closure) has the highest precedence. For
example, a+bc* or a|bc* denotes the language {a, b, bc, bcc, bccc, bcccc, ...}.
• Concatenation has the second highest precedence and is left associative.
• Union has the lowest precedence and is left associative.
• Parentheses override operator precedence as usual. For example, (0|1)* stands for
all possible binary strings, 0|1* stands for either a 0 or an arbitrarily long string of
1's, and 01* stands for 0 followed by an arbitrarily long string of 1's.
• The symbol ε represents the null string, and can be used like any other alphabetic
character. Thus, (0| ε)(1(0| ε))* stands for all binary strings without adjacent zeros.
• Tools such as ed, sed, grep, and Perl employ regular expressions, but add many
more features for convenience. For instance, s+ = ss*, s? = (s|ε),
s{7,} = sssssss+, and so on. See ‘man perlre’ for more details.
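These precedence rules can be checked against a practical engine. A short demonstration using Python's re module, in which union is written |; the patterns are direct transcriptions of the examples above, with an empty alternative playing the role of ε:

```python
import re

# a|bc* : * binds tightest, union loosest, so this parses as a | (b(c*)).
pat = re.compile(r"a|bc*")
assert pat.fullmatch("a")
assert pat.fullmatch("bccc")
assert not pat.fullmatch("ac")

# (0|1)* : parentheses override precedence, giving all binary strings.
assert re.fullmatch(r"(0|1)*", "0110")

# 0|1* : either a single 0 or an arbitrarily long (possibly empty) run of 1's.
assert re.fullmatch(r"0|1*", "111")
assert not re.fullmatch(r"0|1*", "01")

# (0|)(1(0|))* : all binary strings without adjacent zeros
# (the empty alternative "|" plays the role of ε).
no_adjacent_zeros = re.compile(r"(0|)(1(0|))*")
assert no_adjacent_zeros.fullmatch("010101")
assert not no_adjacent_zeros.fullmatch("100")
```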

3.3 Languages Defined by Regular Expressions

There is a simple correspondence between regular expressions and the languages they
denote:

Regular expression       L(regular expression)

x, for each x ∈ Σ        {x}
ε                        {ε}
∅                        {}
(r1)                     L(r1)
r1*                      (L(r1))*
r1 r2                    L(r1)L(r2)
r1 + r2                  L(r1) ∪ L(r2)

3.4 Building Regular Expressions

Here are some hints on building regular expressions. We will assume Σ = {a, b, c}.

Zero or more

a* means “zero or more a’s.” To say “zero or more ab’s,” that is, {ε, ab, abab, ababab,
...}, you need to say (ab)*. Don't say ab*, because that denotes the language {a, ab, abb,
abbb, abbbb, ...}.

One or more

Since a* means “zero or more a’s”, you can use aa* (or equivalently, a*a) to mean “one
or more a’s.” Similarly, to describe “one or more ab's,” that is, {ab, abab, ababab, ...}, you
can use ab(ab)*.

Zero or one

You can describe an optional a with (a+ε).

Any string at all

To describe any string at all (with Σ = {a, b, c}), you can use (a+b+c)*.

Any nonempty string

This can be written as any character from followed by any string at all:
(a+b+c)(a+b+c)*.

Any string not containing....

To describe any string at all that does not contain an a (with Σ = {a, b, c}), you can use
(b+c)*.

Any string containing exactly one...

To describe any string that contains exactly one a, put “any string not containing an a,” on
either side of the a, like this: (b+c)*a(b+c)*.
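Each building block above has a direct counterpart in practical regex syntax. A small sketch using Python's re module (with + written as |, and ε as an empty alternative):

```python
import re

# Zero or more ab's: (ab)*  (not ab*, which is one a followed by b's).
assert re.fullmatch(r"(ab)*", "ababab")
assert re.fullmatch(r"ab*", "abbb")
assert not re.fullmatch(r"ab*", "abab")

# One or more a's: aa* (practical engines also offer a+ for this).
assert re.fullmatch(r"aa*", "aaa")
assert not re.fullmatch(r"aa*", "")

# Optional a: (a|) plays the role of (a+ε).
assert re.fullmatch(r"(a|)", "")

# Any string over {a, b, c} not containing an a: (b|c)*.
assert re.fullmatch(r"(b|c)*", "bccb")
assert not re.fullmatch(r"(b|c)*", "bac")

# Exactly one a: (b|c)*a(b|c)*.
exactly_one_a = re.compile(r"(b|c)*a(b|c)*")
assert exactly_one_a.fullmatch("bcacb")
assert not exactly_one_a.fullmatch("aa")
```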

3.4.1 Example Regular Expressions

Give regular expressions for the following languages on Σ = {a, b, c}.

All strings containing exactly one a

(b+c)*a(b+c)*

All strings containing no more than three a's

We can describe the string containing zero, one, two, or three a's (and nothing else) as

• (ε+a)(ε+a)(ε+a)

Now we want to allow arbitrary strings not containing a’s at the places marked by X’s:

• X(ε+a)X(ε+a)X(ε+a)X

so we put in (b+c)* for each X:

• (b+c)*(ε+a)(b+c)*(ε+a)(b+c)*(ε+a)(b+c)*

All strings which contain at least one occurrence of each symbol in Σ

The problem here is that we cannot assume the symbols are in any particular order. We
have no way of saying “in any order”, so we have to list the possible orders:

• abc+acb+bac+bca+cab+cba

To make it easier to see what's happening, let's put an X in every place we want to allow
an arbitrary string:

• XaXbXcX + XaXcXbX + XbXaXcX + XbXcXaX + XcXaXbX


+ XcXbXaX

Finally, replacing the X's with (a+b+c)* gives the final (unwieldy) answer:

• (a+b+c)*a(a+b+c)*b(a+b+c)*c(a+b+c)* +
• (a+b+c)*a(a+b+c)*c(a+b+c)*b(a+b+c)* +
• (a+b+c)*b(a+b+c)*a(a+b+c)*c(a+b+c)* +
• (a+b+c)*b(a+b+c)*c(a+b+c)*a(a+b+c)* +
• (a+b+c)*c(a+b+c)*a(a+b+c)*b(a+b+c)* +
• (a+b+c)*c(a+b+c)*b(a+b+c)*a(a+b+c)*

All strings which contain no runs of a's of length greater than two

We can fairly easily build an expression containing no a, one a, or one aa:

• (b+c)*(ε+a+aa)(b+c)*

but if we want to repeat this, we need to be sure to have at least one non-a between
repetitions:

• (b+c)*(ε+a+aa)(b+c)*((b+c)(b+c)*(ε+a+aa)(b+c)*)*

All strings in which all runs of a's have lengths that are multiples of three

(aaa+b+c)*
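Answers like these can be sanity-checked by brute force: enumerate every short string over the alphabet and compare the regular expression against a direct test of the property. A sketch in Python (the helper names are this sketch's own):

```python
import re
from itertools import product

def all_strings(sigma, max_len):
    """All strings over sigma of length at most max_len."""
    for n in range(max_len + 1):
        for t in product(sigma, repeat=n):
            yield "".join(t)

# "No more than three a's": compare the regex to a simple count.
at_most_three = re.compile(r"(b|c)*(a|)(b|c)*(a|)(b|c)*(a|)(b|c)*")
for s in all_strings("abc", 6):
    assert bool(at_most_three.fullmatch(s)) == (s.count("a") <= 3)

# "All runs of a's have lengths that are multiples of three": (aaa|b|c)*.
runs_of_three = re.compile(r"(aaa|b|c)*")
def runs_ok(s):
    # Every maximal run of a's must have length divisible by 3.
    return all(len(r) % 3 == 0 for r in re.findall(r"a+", s))
for s in all_strings("abc", 6):
    assert bool(runs_of_three.fullmatch(s)) == runs_ok(s)
```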

3.5 Regular Expressions and Automata

• Languages described by deterministic finite acceptors (DFAs) are called regular


languages.
• For any nondeterministic finite acceptor (NFA) we can find an equivalent DFA.
Thus NFAs also describe regular languages.

Regular expressions also describe regular languages. We will show that regular
expressions are equivalent to NFAs by doing two things:

• For any given regular expression, we will show how to build an NFA that accepts
the same language. (This is the easy part.)
• For any given NFA, we will show how to construct a regular expression that
describes the same language. (This is the hard part.)

3.5.1 From Primitive Regular Expressions to NFAs

Every NFA we construct will have a single start state and a single final state. We will build
more complex NFAs out of simpler NFAs, each with a single start state and a single final
state. The simplest NFAs will be those for the primitive regular expressions.

For any x in Σ, the regular expression x denotes the language {x}. This NFA represents
exactly that language.

Fig 6: NFA for the regular expression x

Note that if this were a DFA, we would have to include arcs for all the other elements of
Σ.

Fig 7: NFA for the regular expression ε
The regular expression ε denotes the language {ε}, that is, the
language containing only the empty string.

Fig 8: NFA for the regular expression ∅

The regular expression ∅ denotes the language {}; no strings belong to this language, not
even the empty string.

Since the final state is unreachable, why bother to have it at all? The answer is that it
simplifies the construction if every NFA has exactly one start state and one final state. We
could do without this final state, but we would have more special cases to consider, and it
does not hurt anything to include it.

3.5.2 From Regular Expressions to NFAs

We will build more complex NFAs out of simpler NFAs, each with a single start state and
a single final state. Since we have NFAs for primitive regular expressions, we need to
compose them for the operations of grouping, juxtaposition, union, and Kleene star (*).

For grouping (parentheses), we do not really need to do anything. The NFA that represents
the regular expression (r1) is the same as the NFA that represents r1.

For juxtaposition (strings in L(r1) followed by strings in L(r2)), we simply chain the NFAs
together, as shown. The initial and final states of the original NFAs (boxed) stop being
initial and final states; we include new initial and final states. (We could make do with
fewer states and fewer transitions here, but we aren't trying for the best construction; we're
just trying to show that a construction is possible.)

The + denotes “or” in a regular expression, so it makes sense that we would use an NFA
with a choice of paths. (This is one of the reasons that it is easier to build an NFA than a
DFA.)

The star denotes zero or more applications of the regular expression, so we need to set up
a loop in the NFA. We can do this with a backward-pointing arc. Since we might want to
traverse the regular expression zero times (thus matching the null string), we also need a
forward-pointing arc to bypass the NFA entirely.
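The compositions just described (chaining for concatenation, a choice of paths for +, and loop/bypass arcs for the star) can be written as executable code. Here is a minimal Thompson-style sketch in Python; the (start, final, edges) triple representation and the ε-closure simulator are conventions of this sketch, not of the text:

```python
import itertools

# Each NFA is (start, final, edges); edges is a list of (from, label, to),
# and a label of None stands for an ε-move.
_new = itertools.count()
def _state():
    return next(_new)

def symbol(x):
    s, f = _state(), _state()
    return (s, f, [(s, x, f)])

def concat(n1, n2):
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    # Chain the two machines with an ε-arc.
    return (s1, f2, e1 + e2 + [(f1, None, s2)])

def union(n1, n2):
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    s, f = _state(), _state()
    # A new start state offers a choice of ε-paths into either machine.
    return (s, f, e1 + e2 + [(s, None, s1), (s, None, s2),
                             (f1, None, f), (f2, None, f)])

def star(n1):
    s1, f1, e1 = n1
    s, f = _state(), _state()
    # Backward ε-arc for repetition, forward ε-arc to bypass entirely.
    return (s, f, e1 + [(s, None, s1), (f1, None, f),
                        (f1, None, s1), (s, None, f)])

def accepts(nfa, w):
    start, final, edges = nfa
    def closure(states):
        # Follow ε-moves until no new state appears.
        states = set(states)
        while True:
            more = {t for (q, x, t) in edges if q in states and x is None}
            if more <= states:
                return states
            states |= more
    cur = closure({start})
    for c in w:
        cur = closure({t for (q, x, t) in edges if q in cur and x == c})
    return final in cur

# (a+b)*abb as a worked example:
m = concat(star(union(symbol("a"), symbol("b"))),
           concat(symbol("a"), concat(symbol("b"), symbol("b"))))
assert accepts(m, "abb") and accepts(m, "babb")
assert not accepts(m, "ab")
```

As in the text, the construction is not the most economical one possible; it simply shows that a construction exists, with one start and one final state at every step.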

3.5.3 From NFAs to Regular Expressions

Creating a regular expression to recognise the same strings as an NFA is trickier than you
might expect, because the NFA may have arbitrary loops and cycles. Here is the basic
approach (details supplied later):

• If the NFA has more than one final state, convert it to an NFA with only one final
state. Make the original final states non-final, and add an ε-transition from each to
the new (single) final state.
• Consider the NFA to be a generalised transition graph, which is just like an NFA
except that the edges may be labelled with arbitrary regular expressions. Since the
labels on the edges of an NFA may be either ε or members of Σ, each of these can
be considered to be a regular expression.
• Remove states one by one from the NFA, relabeling edges as you go, until only the
initial and the final state remain.
• Read the final regular expression from the two-state automaton that results.
• The regular expression derived in the final step accepts the same language as the
original NFA.
• Since we can convert an NFA to a regular expression, and we can convert a regular
expression to an NFA, the two are equivalent formalisms; that is, they both
describe the same class of languages, the regular languages.
• There are two complicated parts to extracting a regular expression from an NFA:
removing states, and reading the regular expression off the resultant two-state
generalised transition graph.

Here is how to delete a state:

To delete state Q, where Q is neither the initial state nor the final state, replace the
configuration on the left of Fig 12 with the configuration on the right.

Fig 12: Deleting a State

You should convince yourself that this transformation is “correct”, in the sense that paths
which leave you in Qi in the original will leave you in Qi in the replacement, and similarly
for Qj.

What if state Q has connections to more than two other states, say, Qi, Qj, and Qk? Then
you have to consider these states pairwise: Qi with Qj, Qj with Qk, and Qi with Qk.

What if some of the arcs in the original diagram are missing? There are too many cases to
work this out in detail, but you should be able to figure it out for any specific case, using
the above as a model.

You will end up with an NFA that looks like this, where r1, r2, r3, and r4 are (probably
very complex) regular expressions. The resultant NFA in figure 13 below represents the
regular expression r1*r2(r4 + r3r1*r2)*

Fig 13: NFA for r1*r2(r4 + r3r1*r2)*

(you should verify that this is indeed the correct regular expression). All you have to do is
plug in the correct values for r1, r2, r3, and r4.
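One way to convince yourself is to instantiate r1 through r4 with single symbols and compare the expression against a direct simulation of the two-state machine. A Python sketch, taking r1 = a, r2 = b, r3 = c, r4 = d (these concrete labels are this sketch's own choice):

```python
import re
from itertools import product

# Fig 13's two-state machine with r1 = a (self-loop on the start state),
# r2 = b (start -> final), r4 = d (self-loop on the final state), and
# r3 = c (final -> start). The derived expression becomes a*b(d|ca*b)*.
moves = {("start", "a"): "start", ("start", "b"): "final",
         ("final", "d"): "final", ("final", "c"): "start"}

def machine_accepts(w):
    state = "start"
    for ch in w:
        state = moves.get((state, ch))
        if state is None:          # no arc for this symbol: reject
            return False
    return state == "final"

expr = re.compile(r"a*b(?:d|ca*b)*")
# The machine and the expression agree on every string up to length 6.
for n in range(7):
    for t in product("abcd", repeat=n):
        w = "".join(t)
        assert bool(expr.fullmatch(w)) == machine_accepts(w)
```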

3.6 Three Ways of Defining a Language

The following presents an example solved three different ways. No new information is
presented.

Problem: Define a language containing all strings over Σ = {a, b, c} where no symbol ever
follows itself; that is, no string contains any of the substrings aa, bb, or cc.

3.6.1 Definition by Grammar

Define the grammar G = (V, T, S, P) where

• V = {S, ...some other variables...}.


• T = Σ = {a, b, c}.

The start symbol is S.

P is given below.

These should be pretty obvious except for the set V, which we generally make up as we
construct P.

Since the empty string belongs to the language, we need the production

• S → ε

Some strings belonging to the language begin with the symbol a. The a can be followed
by any other string in the language, so long as this other string does not begin with a. So
we make up a variable, call it NOTA, to produce these other strings, and add the
production

• S → a NOTA

By similar logic, we add the variables NOTB and NOTC and the productions

• S → b NOTB
• S → c NOTC

Now, NOTA is either the empty string, or some string that begins with b, or some string
that begins with c. If it begins with b, then it must be followed by a (possibly empty) string
that does not begin with b, and we already have a variable for that case, NOTB. Similarly,
if NOTA is some string beginning with c, the c must be followed by NOTC. This gives
the productions
• NOTA → ε
• NOTA → b NOTB
• NOTA → c NOTC

Similar logic gives the following productions for NOTB and NOTC:

• NOTB → ε
• NOTB → a NOTA
• NOTB → c NOTC
• NOTC → ε
• NOTC → a NOTA
• NOTC → b NOTB

We add NOTA, NOTB, and NOTC to set V, and we're done.

Example derivation:

S ⇒ a NOTA ⇒ a b NOTB ⇒ a b a NOTA ⇒ a b a c NOTC ⇒ a b a c.
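The grammar can be tested mechanically: expand its productions up to a length bound and compare the result against a direct check for adjacent repeated symbols. A Python sketch (the dict representation of the productions is this sketch's own):

```python
from itertools import product

# The grammar's productions, one entry per variable; "" stands for the
# ε-production, and every other alternative is a terminal then a variable.
productions = {
    "S":    ["", "a NOTA", "b NOTB", "c NOTC"],
    "NOTA": ["", "b NOTB", "c NOTC"],
    "NOTB": ["", "a NOTA", "c NOTC"],
    "NOTC": ["", "a NOTA", "b NOTB"],
}

def derives(var, max_len):
    """All terminal strings of length <= max_len derivable from var."""
    out = set()
    def expand(prefix, v):
        for alt in productions[v]:
            parts = alt.split()
            if not parts:                 # the ε-production ends a derivation
                out.add(prefix)
            elif len(prefix) < max_len:   # append one terminal, recurse
                expand(prefix + parts[0], parts[1])
    expand("", var)
    return out

generated = derives("S", 4)

# Cross-check: the grammar generates exactly the strings in which no
# symbol ever follows itself.
def no_repeats(s):
    return all(x != y for x, y in zip(s, s[1:]))

expected = {"".join(t) for n in range(5)
            for t in product("abc", repeat=n) if no_repeats("".join(t))}
assert generated == expected
```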

3.6.2 Definition by NFA

Defining the language by an NFA follows almost exactly the same logic as defining the
language by a grammar. Whenever an input symbol is read, go to a state that will accept
any symbol other than the one read.

To emphasise the similarity with the preceding grammar, we will name our states to
correspond to variables in the grammar.

Fig 14: Definition of Language by NFA

3.6.3 Definition by Regular Expression

As usual, it is more difficult to find a suitable regular expression to define this language,
and the regular expression we do find bears little resemblance to the grammar or to the
NFA.
The key insight is that strings of the language can be viewed as consisting of zero or more
repetitions of the symbol a, and between them must be strings of the form bcbcbc... or
cbcbcb.... So we can start with
X a Y a Y a Y a ... Y a Z

where we have to find suitable expressions for X, Y, and Z. But first, let's get the above
expression in a proper form, by getting rid of the "...".
This gives

X a (Y a)* Z

and, since we might not have any a's at all,

(X a (Y a)* Z) + X

Now X can be empty, a single b, a single c, or can consist of an alternating sequence of
b's and c's. This gives

X = (ε + b + c + (bc)* + (cb)*)

This isn't quite right, because it does not allow (bc)*b or (cb)*c. When we include these,
we get

X = (ε + b + c + (bc)* + (cb)* + (bc)*b + (cb)*c)

This is now correct, but could be simplified. The last four terms include the ε + b + c
cases, so we can drop those three terms. Then we can combine the remaining four terms
into

X = (bc)*(b + ε) + (cb)*(c + ε)

Now, what about Z? As it happens, there isn't any difference between what we need for Z
and what we need for X, so we can also use the above expression for Z.

Finally, what about Y? This is just like the others, except that Y cannot be empty. Note
that dropping just the ε term from X is not enough: Y must also cover even-length
alternating strings such as bc (the string abca, for example, needs Y = bc between its two
a's). All nonempty alternating strings over {b, c} are described by

Y = (b(cb)*(c + ε) + c(bc)*(b + ε))

Substituting into (X a (Y a)* Z) + X, we get

((bc)*(b + ε) + (cb)*(c + ε)) a ((b(cb)*(c + ε) + c(bc)*(b + ε)) a)*
((bc)*(b + ε) + (cb)*(c + ε)) + (bc)*(b + ε) + (cb)*(c + ε)
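The expression is unwieldy, but it can be verified mechanically. A Python sketch that transcribes it for the re module (+ written as |, ε as an empty alternative), taking Y to be all nonempty alternating b/c strings (Y must include even-length strings such as bc, e.g. for the input abca), and compares it against a direct no-adjacent-repeats test on all short strings:

```python
import re
from itertools import product

# X = Z = (bc)*(b+ε) + (cb)*(c+ε): all possibly-empty alternating b/c strings.
X = r"(?:bc)*(?:b|)|(?:cb)*(?:c|)"
# Y: all nonempty alternating b/c strings.
Y = r"b(?:cb)*(?:c|)|c(?:bc)*(?:b|)"
# The full expression (X a (Y a)* Z) + X, with Z = X.
full = re.compile(rf"(?:{X})a(?:(?:{Y})a)*(?:{X})|(?:{X})")

def no_repeats(s):
    return all(x != y for x, y in zip(s, s[1:]))

# The expression and the direct test agree on every string up to length 6.
for n in range(7):
    for t in product("abc", repeat=n):
        s = "".join(t)
        assert bool(full.fullmatch(s)) == no_repeats(s)
```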
