ADVANCED DATABASES
1 Introduction
1.1 Fundamental concepts and features
1.1.1 Data vs Information
1.1.2 Non-database vs Database-Management Systems
1.1.3 Characteristics of the Database Approach
1.1.4 Database Management System (DBMS)
1.1.5 Classification of DBMS users
1.1.6 Advantages of the Database Approach
1.1.7 Disadvantages of the Database Approach
1.2 History of Database Management Systems
1.2.1 Flat file processing
1.2.2 Basic Direct Access Method (BDAM)
1.2.3 The Indexed Sequential Access Method (ISAM)
1.2.4 Shortcomings of Flat Files
1.2.5 The Era of Formal Database Management
1.2.6 The Hierarchical Database Model
1.2.7 The CODASYL Network Model (known as the Navigational Model)
1.2.8 The Relational Database Model
1.2.9 The Object-Oriented Database Model
1.3 Conclusion
2 Relational Database Model
2.1 Manufacturing Company Database Example
2.2 Formalization of relations
2.2.1 Relation scheme, Relation and Tuple
2.2.2 Keys
2.2.3 NULL values in the Tuples
2.2.4 Relational Model Notation
2.3 Integrity Constraints
2.3.1 Domain constraints
2.3.2 Key and Primary Key Constraints
2.3.3 Referential and Foreign Key Constraints
2.3.4 Constraints on NULL Values
2.4 Relational Databases and Database Scheme
2.5 Functional Dependencies
2.5.1 Closure of a set of functional dependencies
2.5.2 Closure of a set of attributes X under a set of FDs F
2.5.3 Cover and equivalence
2.5.4 Irreducible sets of dependencies (or minimal cover)
2.6 Normalization
2.6.1 Trivial and nontrivial dependencies
1 INTRODUCTION
Today, databases and database systems are used in almost all organizations' computer systems, including business, electronic commerce, engineering, medicine, genetics, and education. We may begin with a general definition of what a database is: "A database is a collection of related data" [9]. This collection of data is stored for later retrieval and processing [16]. Data refers to known facts, such as person names, phone numbers, and product prices, that can be recorded or stored on physical drives and that have some implicit meaning. By related data, we mean that a random assortment of data cannot be qualified as a database; rather, the term database is usually associated with the following properties [9]:
• A database represents some aspect of the real world, and the set of facts represented
in a database is sometimes called the Universe of Discourse (UoD).
• A database is designed, built, and populated with data for a specific purpose. It
has an intended group of users and some preconceived applications in which these
users are interested.
• Data
Data is a collection of raw (or unorganized) facts. It simply exists and has no significance beyond its existence. It can exist in any form, usable or not, and it always needs to be processed [16]. Raw data are typically the results of measurements and can be the observation of a set of variables. For example, the ages of students, the heights of students, and the blood groups of students are generally considered "data".
• Information
Valuable or useful data is called information. After data is processed, organized, and presented so as to make it useful in a given situation, it becomes information. For example, the height of a particular student may be considered "information" [16].
Before the advent of database management, data was stored in data files. A non-database (or file-based) system is a program or a collection of programs responsible for data file processing and management.
• Data Abstraction: A data model is used to hide storage details and present the
users with a conceptual view of the database.
• Support of multiple views of the data: a database may serve several users, each using a different view tailored to that user's interests.
A Database Management System (DBMS) is a software system used to define, create, manipulate, and maintain databases [16]. Through a DBMS, users may interact with and
manage the database. Popular DBMSs include Microsoft Access, Oracle, DB2, MySQL,
and Microsoft SQL Server.
A DBMS user is anyone who directly or indirectly uses the DBMS. DBMS users operate at different levels according to their roles. The term DBMS user often refers to:
1. Database Administrator (DBA)
The DBA is responsible for authorizing access to the database, coordinating and monitoring its use, and acquiring the software and hardware resources needed.
2. Database Designer
The database designer is responsible for defining the detailed database design, including data structures, relationships between data, constraints, and other database-specific constructs needed for the creation of the database.
3. End Users
End users are the people who interact directly with the database through utilities. They are authorized to query and update all or part of the database using standard types of queries.
4. Application Programmers
To access and possibly change data, an application programmer interacts with the database from programs written in high-level programming languages such as C++ and Java.
• Getting more information from the same amount of data – The primary goal of a computer data system is to turn data (recorded facts) into information (the knowledge gained by processing those facts). In a non-database, or file-based, system this task can be extremely difficult to fulfill, and in some cases impossible. Given the power of a DBMS, the information is available, and the process of getting it is quick and easy [14].
• Data integrity and security – A DBMS can provide security to databases by assigning privileges to different users. It provides authorization or access controls allowing different classes of users to perform different operations on databases, such as creation, modification, deletion, and update of data [16]. An integrity constraint is a rule that data in the database must follow [14]. For example, the "department number" given for any "employee" must be one that is already in the database (see the sketch below).
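Such a rule can be declared once to the DBMS and enforced automatically for every program that updates the database. Below is a minimal SQL sketch of this example; the table and column names are hypothetical, not taken from the text:

CREATE TABLE department (
    dept_no   INT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE employee (
    emp_no   INT PRIMARY KEY,
    emp_name VARCHAR(50) NOT NULL,
    dept_no  INT NOT NULL,
    -- hypothetical illustration: the DBMS now rejects any employee
    -- whose department number is not already in the database
    FOREIGN KEY (dept_no) REFERENCES department (dept_no)
);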
• Cost of staff training – The supply and operation of a DBMS is often very complex and quite costly, so hiring and training of users at all levels is required, and a lot of money has to be paid to run the DBMS [16].
• Cost of Hardware & Software – A computer with a high speed of data processing and a large storage capacity is required to run the DBMS software hosting the database. Similarly, the DBMS software itself may be very costly [16].
• More difficult recovery – Recovery is more difficult when the database is being updated by many users at the same time. The database must first be restored to the condition it was in when it was last known to be correct; any updates made by users since that time must then be redone. The greater the number of users involved in updating the database, the more complicated this task becomes [14].
To gain insight into the present state of database models and architectures, we need
to present a historical introduction of the evolution of database management systems.
This evolution has been influenced by the increasing demand of organizations for new
database architectures.
A good historical presentation of the evolution of data processing and database management systems may be found in [2]. We will see that each new database architecture had to add new concepts and features to address problems that occurred in previous architectures.
The early computers were very large and cumbersome to maintain, but they were perfect for performing repetitive tasks such as payroll calculations, and organizations soon began to see that high-volume, repetitive data storage and processing tasks such as payroll were ideal applications for these computers. At that time, many "database" systems were really nothing more than loosely coupled collections of files [2].
These were called "flat files" because data was stored as fixed-length records in a linear fashion, such that it was necessary to read the file from front to end until the desired record was retrieved. There were no structures for indexing or recognizing relationships between records, because flat-file systems were bound by their linear nature. Flat files are said to be non-keyed files because records were always retrieved in the same order.
An access method defines the technique that is used to store and retrieve data. The terms QSAM and BSAM, standing respectively for Queued Sequential Access Method (pronounced "quesam") and Basic Sequential Access Method (pronounced "bee-sam"), were often used to describe physical sequential files in the IBM mainframe environment [2].
As more and more data was stored on disks, organizations struggled to bypass the linear nature of flat file organization. When storing records on disk, each block can be identified by a unique disk address. When we know the address of a record on disk, it can be retrieved very quickly.
Unlike the sequential access methods used in flat file systems, the BDAM (pronounced "bee-damn") method uses a hashing algorithm to determine the disk address of the target stored record [2]. Each record is identified by a unique symbolic key. Using the record's symbolic key value, the BDAM algorithm computes the target location (address) of the record. Since we can go directly to the record, BDAM provides much faster access and retrieval of records. A direct access file is sometimes called a "keyed" file because the key is used to generate the disk address (see Figure 1). A disk address includes the disk number, the cylinder address, the track, and the block address.
While the BDAM method provides a good solution for fast storage and retrieval of data, the high cost of disk storage made it a very expensive proposition. However, new methods like ISAM (Indexed Sequential Access Method) and VSAM (Virtual Storage Access Method) used flat files with indexes to speed data retrieval. These new methods became a very popular alternative to BDAM.
To understand indexing, let's take an example from a book. Just as you use an index in a book to find what you want quickly, a computer index can speed the retrieval of information. In the simplest index structures, the index contains only two fields: one field is the symbolic key, and the second field contains the disk address of the record that contains that key value. In most file management systems, the index is kept as a completely separate file from the master file. When a record is requested based upon an index key, the program will scan the index, locate the symbolic key, and then retrieve the record from the file based upon the location specified in the index column [2].
In this fashion (see Figure 2), a flat file can be logically reorganized to retrieve records in any desired order, regardless of the physical sequencing of the data. As new records are added to the end of the master table, the ISAM file system will automatically adjust the indexes to account for the change.
ISAM was developed at IBM in the late 1960s [6]. Like physical sequential files, ISAM stores the records back-to-back, making for very efficient use of disk space, with only the Inter-Block Gap (IBG1) between records. However, unlike the physical sequential format, which may be stored on tape, ISAM files must be stored on disk, since addresses are needed to create the indexes. The physical sequence of records within ISAM is not important, since the indexes take care of the access to the records [2].
1 IBG is a break between data records on hard drives and magnetic tape to prevent data overwrites.
A single ISAM file may have dozens of indexes, each allowing the file to be retrieved in some predefined order. In some cases, the size of the indexes will exceed the size of the master file, but this is still less expensive than the disk wastage that occurs with BDAM files [2]. The main advantages of the ISAM organization are its simplicity, small space overhead, and fast query time [6].
Another popular file-access method, called the Virtual Storage Access Method or VSAM, was introduced by IBM. VSAM, like its cousin ISAM, allows physical sequential files to be indexed on multiple data items. By having multiple indexes, data can be retrieved directly in several ways, and you can access data anywhere in the file using a different index [2].
Needless to say, there were many problems and difficulties with flat-file database sys-
tems, such as:
• Data sharing – Before the invention of centralized computer resources, each department within an organization would develop its own system, usually implementing its own unique file structures and programming language. Because of this, "islands of information" sprang up within companies, and it was very difficult for departments to share information.
• Data file structure modification – If a data file structure ever changed, correcting
all of the programs that needed modification was almost impossible.
• Backup and recovery – Flat files possessed no real backup or recovery methods. Programmers had to write programs to back up a system before updates started. If a failure occurred at any time during the update process, the corrupted files had to be restored from the backup and the update rerun from the beginning.
The early database managers' access methods were based on BDAM or VSAM, and indexes were often created to speed up data access. In the 1960s, IBM developed a prototype computer database to demonstrate that data could be stored, retrieved, and updated in a structured format. This database became known as the Information Management System, or IMS.
IMS was a revolutionary idea, since it allowed data access by numerous programs in different languages and was designed to support the multiuser needs of larger organizations. Even more important, the creation of IMS codified the industry's belief that data was important and needed to be managed and controlled in a consistent fashion.
As we noted, the major difference between flat-file systems and database management systems is the ability to store both the data and the relationships between the data. The hierarchical database model was first introduced in IMS (Information Management System), which was released in 1968. The hierarchical database model used pointers to logically link related data items, and it does this with the use of "child" and "twin" pointers. As late as 1998, IMS still enjoyed a large following among users with large databases and high-volume transaction requirements. A hierarchical database is very well suited to modeling relationships that are naturally hierarchical. For example, within an organization, we see that each executive has many managers, each manager has many supervisors, and each supervisor has many workers. Basically, a hierarchy is a method of organizing data into descending one-to-many relationships, with each level in the hierarchy having a higher precedence than those below it.
A hierarchy is just an arrangement of structures called nodes, and the nodes are connected by lines or "branches". You can think of these lines or branches as a connection to the next level of more specific information. The highest node is called the root node, and queries must pass through this node on their way down the hierarchy. In our example (Figure 3), UNIVERSITY is the root node. Every node, except the root node, is connected upward to only one "parent" node. Nodes have a parent-child relationship, and a parent node is directly above the child node. We also see that the node called COLLEGE OF ENGINEERING is the parent of ELECTRICAL ENGINEERING. Since a child node is always one level directly below its parent node, the ELECTRICAL ENGINEERING node is a child of the COLLEGE OF ENGINEERING node. Note that a parent node can have more than one child node, but a child node may only have one parent.
When we talk about a hierarchical database, the nodes we talked about become "segment types". A segment type is simply a user-defined category of data, and each segment contains fields. Each segment has a key field, and the key field is used to retrieve the data from the segment. There can be one or more fields in a segment, and most segments also contain multiple search fields.
Expanding on Figure 3, let’s add some data fields to our UNIVERSITY segment. Let’s
begin by describing the data for each UNIVERSITY. UNIVERSITY information might
include the UNIVERSITY name, the mailing address, and phone number. IMS is well-
suited for modeling systems in which the entities (segment types) are composed of
descending one-to-many relationships. Relationships are established with "child" and
"twin" pointers, and these pointers are embedded into the prefix of every record in the
database.
1. UNIVERSITY
2. College of Engineering
4. Electrical Engineering
5. Computer Science
6. Mechanical Engineering
A hierarchical path defines the access method, and the path is like an imaginary line that begins at the root segment and passes through segment types until it reaches the segment type at the bottom of the inverted tree. One advantage of a hierarchical database is that if you only wanted information on COLLEGES, the program would only have to know the format of, and access, the COLLEGE segment. You would not have to know that any of the other segments even exist, what their fields are, or what relationships exist between the segments.
Hierarchical databases have rigid rules in relationships and data access. For example,
all segments have to be accessed through the parent segment. The exception to this is, of
course, the root segment because it has no parent.
The IMS database has concurrency control and a full backup and recovery mechanism. The backup and recovery protects the system from a failure of IMS itself, an application program failure, a database failure, and an operating system failure. The recovery mechanism for application programs stores "before" and "after" images of each record that was changed, and these images could be used to "roll back" the database if a transaction failed to complete.
If there was a disk failure, the images could be "rolled forward". IMS was used with the CICS1 (Customer Information Control System) teleprocessing monitor to develop the first online database systems for the mainframe.
1 CICS is an IBM transaction-processing middleware product that provides online transaction management and connectivity for applications on IBM mainframe systems under the z/OS and z/VSE operating systems.
Three main advantages of hierarchical databases are: (1) a large user base with a proven technology that has been around for decades, (2) the ease of using a hierarchy or tree structure, and (3) the speed of the system (exceeding 2,000 transactions per second). The disadvantages of hierarchical databases stem from their rigid rules on relationships: insertion and deletion can become very complex, and access to a child segment can only be done through the parent segment (starting at the root segment). While IMS is very good at modeling hierarchical relationships, complex data relationships such as many-to-many and recursive many-to-many, like BOM (Bill-Of-Material) relationships, had to be implemented in a very clumsy fashion by using "phantom" records. The IMS database also suffered from its complexity: to become proficient in IMS, you need months of training and experience. As a result, IMS development remains very slow and cumbersome.
During the 1960s, several major database products were created using the CODASYL2 Network Database Management System (DBMS) specifications developed by the Conference on Data Systems Languages (CODASYL). Also involved were two subgroups of CODASYL: the Database Task Group (DBTG) and the Data Description Language Committee (DDLC).
CODASYL and its subgroups were an organization of volunteer representatives of computer manufacturers and users. While CODASYL began in 1959, the first set of DBMS specifications was not produced until 1969. This set of specifications was revised, and the first real CODASYL DBTG specifications of the CODASYL standard approach appeared in 1971.
The CODASYL approach was a very complicated system and required substantial training. It depended on a "manual" navigation technique using a linked data set, which formed a large network. Searching for records could be accomplished by one of three techniques:
1. Using the primary key (also known as the CALC key).
2. Navigating from one record to related records through the sets (relationships) in which the record participates.
3. Scanning all the records in sequential order.
2 CODASYL was also responsible for designing the COBOL (COmmon Business Oriented Language) language in 1959.
These specifications became the basis for new database systems like IDMS (Integrated Database Management System) from Cullinet in 1970.
The Data Base Task Group (DBTG) CODASYL specifications included the Schema definition, the Device Media Control Language (DMCL) definition, and the Data Manipulation Language (DML) definition. They also included the concept of a database "area", which referred to the physical structure of the data files. The logical structure of the database was defined by a Data Definition Language (DDL), and a user view of the data was defined by a subschema. The DML commands were used to navigate through the linked-list structures that comprised the database. The CODASYL DML verbs included FIND, GET, STORE, MODIFY, and DELETE. The Data Base Administrator (DBA) functions in a CODASYL database included: data structure or schema, data integrity, security, and authorization. Also, a Data Base Manager (DBM) function was defined, which included: operation, backup/recovery, performance, statistics, and auditing.
The CODASYL model used two storage methods, BDAM and a linked-list data structure; BDAM used a hashing algorithm to store and retrieve records.
Because of the many choices that can be made in the design of a network database, it is important to review the design with as many people as possible. Charles W. Bachman1 developed a "diagram" that represented the data structures as required by CODASYL. This diagram method became known as the Bachman diagram (Figure 4).
The Bachman diagram describes the physical constructs of all record types in the database. The rectangles of the Bachman diagram are subdivided into four rows. The top row of the box contains the record name. Row two contains the numeric identification (ID) number (each record is given a number which is associated with the record name), the length mode (fixed or variable), the length of records, and the location mode (CALC or VIA). Row three contains, for CALC, the field serving as the CALC key, and, for VIA SET, the set name. Row four contains the designated area.
1 Charles Bachman is best known for his invention of the first random-access database management system, the Integrated Data Store (IDS), and he won the ACM's 1973 A.M. Turing Award for his outstanding contributions to database technology.
Figure 5: A set type of one owner record type and one or more member record types
The set type is shown by a Bachman arrow pointing from the owner record type to the member record type (see Figure 5).
The set name is formed from the owner name, a hyphen, and the member name. The pointers are Next, Prior, and Owner; the option used for insertion and retention is one of (MA, OA, MM, OM); the set order is First, Last, Next, Prior, or Sorted; and the mode is Chain or Index.
The design of the CODASYL network model was very complex and consequently very difficult to use. Network databases, very much like hierarchical databases, are very difficult to navigate. Using the Data Manipulation Language (DML), designed for complex navigation, was a skill that required months of training.
Implementing structural changes was extremely difficult with network databases, since data relationships are "hard-linked" with embedded pointers. Adding an index or a new relationship requires special utility programs that "sweep" each and every record in the database. As records are located, the prefix is restructured to accommodate the new pointers. Object-oriented databases encounter this same problem if a class hierarchy needs to be modified.
Eventually, the CODASYL approach lost its popularity as simpler, easier-to-work-with
systems came on the market.
Edgar Codd worked at IBM in San Jose, California, in one of their offshoot offices that was primarily involved in the development of hard disk systems. He was unhappy with the navigational model of the CODASYL approach and the IMS model. In 1970, he wrote a series of papers outlining novel ways to construct databases. His ideas eventually evolved into a paper titled A Relational Model of Data for Large Shared Data Banks [3], which described a new method for storing data and processing large databases. Records would not be stored in a free-form list of linked records, as in the CODASYL navigational model,
but would instead be stored in "tables" (also referred to as "entities"). Each table has "rows" (also referred to as "records" or "tuples") and "columns" (also referred to as "attributes"). In the relational database model, any tables can be linked together. The relational model is based on a collection of mathematical principles drawn primarily from set theory and predicate logic. These principles were first applied to the field of data modeling in the late 1960s by Dr. Edgar Codd. Tables basically correspond to segment types in the hierarchical model and record types in the network model. Unlike the pointer-connected hierarchical and network models, relational tables are independent. A table can contain only one type of record, and each record has a fixed number of fields that are explicitly named. A table will always have a field or a combination of fields that makes up a unique identifier, called the "primary key". Data redundancy is reduced via a process called normalization.
A primary key uniquely identifies a row in a table, and a "foreign key" allows you to join two or more tables together by matching the primary key in one table with a non-key field in another table (see Figure 6).
Relational databases made the following improvements over hierarchical and network
databases:
1. Simplicity: The concept of tables with rows and columns is extremely simple and
easy to understand. End users have a simple data model. Complex network dia-
grams used with hierarchical and network databases are not used with a relational
database.
2. Data independence: Data independence is the ability to modify data structures (in this case, tables) without affecting existing programs. Much of this is because tables are not hard-linked to one another. Columns can be added to tables, tables can be added to the database, and new data relationships can be added with little or no change to existing application programs.
3. Declarative data access: One of the best improvements of the relational model over its predecessors was its simplicity of access. Rather than having dozens of DML commands, the relational model introduced a declarative language, the Structured Query Language (SQL), also known as "sequel", to simplify data access and manipulation. The SQL user specifies what data they want, and the DBMS then determines how to get the data. In relational data access, the user tells the system the conditions for the retrieval of data, and the system gets the data that meets the selection conditions in the SQL statement. The database navigation is hidden from the end user or the programmer, unlike the CODASYL navigational DML, where the programmer had to know the details of the access path.
The term Object-Oriented, abbreviated OO, has its origins in OO programming languages. The main idea of the OO paradigm is to couple the data with behavior. The first programming language to do this was Simula, which was proposed in the late 1960s [9] and used in operations research tasks to simulate the behavior of entities. The object in the data management sense was first developed by Xerox Corporation's Palo Alto Research Center (PARC) in the early 1970s [2]. Xerox PARC developed the programming language Smalltalk, which was one of the first languages to explicitly incorporate additional OO concepts, such as message passing and inheritance. It is known as a pure OO programming language, meaning that it was explicitly designed to be object-oriented.
The OO database management approach emphasizes a more natural representation of the data. In the application environments of the late 1990s, data models became more demanding: they needed to handle audio, video, text, graphics, etc. These requirements demand a more flexible storage format than hierarchical, network, and relational databases can provide. Only OO databases would be able to support this kind of demand [2].
1.3 conclusion
Throughout the chapter, we introduced key concepts and terminology, including data
modeling, database schema, database instances, and transactions. We also discussed the
role of database administrators and the importance of database performance tuning.
In conclusion, this chapter provides a foundational understanding of DBMSs, includ-
ing their history, basic concepts, and features. By understanding the principles of database
management, readers will be better equipped to design, implement, and manage databases
effectively, which are essential for businesses to store and manage their data efficiently
and securely.
2 RELATIONAL DATABASE MODEL
The relational data model was first introduced by Ted Codd of IBM Research in 1970, in a paper titled A Relational Model of Data for Large Shared Data Banks [3], and it attracted immediate attention due to its simplicity and mathematical foundation. One of the major advantages of the relational model is its uniformity: all data is viewed as stored in tables, with each row in the table having the same format. The model uses the concept of a mathematical relation and has its theoretical basis in set theory and first-order predicate logic [9].
This chapter presents the main concepts of the model, together with the relational algebra and relational calculus associated with the relational model.
Most of the illustrative examples used in the rest of this chapter are taken from the database of a manufacturing company inspired by Date, C.J. [4] (see Section 2.1).
We consider the database of an example manufacturing company called KnowWare Inc. in a little detail. Such an enterprise will typically wish to record information about the projects it has on hand; the parts that are used in those projects; the suppliers who are under contract to supply those projects; the warehouses in which those parts are stored; the employees who work on those projects; and so on. Projects, parts, suppliers, and so on thus constitute the basic entities about which KnowWare Inc. needs to record information. Refer to Figure 7.
A relation scheme R is a finite set of attribute names {A1, A2, ..., An}. Corresponding to each attribute name Ai is a set Di, 1 ≤ i ≤ n, called the domain of Ai. We also denote the domain of Ai by dom(Ai). Attribute names are sometimes called attribute symbols or simply attributes. The domains are arbitrary, non-empty sets, finite or countably infinite. Let D = D1 ∪ D2 ∪ ... ∪ Dn.
A relation (or relation state) r on relation scheme R, denoted by r(R), is a finite set of mappings {t1, t2, ..., tp} from R to D, with the restriction that for each mapping t ∈ r, t(Ai) must be in Di, 1 ≤ i ≤ n. The mappings are called n-tuples [12]. Each n-tuple is an ordered list of n values t = <v1, v2, ..., vn>, where each value vi, 1 ≤ i ≤ n, is an element of dom(Ai) or is a special NULL value [9]. The tuples of a relation r are all distinct, so a tuple cannot be duplicated in the same relation.
[Figure 7: Entity/Association diagram of the KnowWare Inc. database, showing entities such as Supplier, Project, Part, Employee, Department, Warehouse, and Location, together with the associations (SP, SJ, SPJ, PJ, PP, EJ, MJ, ED, WE, WL) among them.]
Example 1
Consider the relation scheme "Supplier" of the KnowWare Inc. company database described in Section 2.1. We can write Supplier = {Sno, Sname, Status, City} or Supplier(Sno, Sname, Status, City).
The domain of each attribute name might be:
Table 1 shows a relation of five tuples (or rows). Each row presents one supplier. The first two tuples may be written as t1 = <1, Smith, 20, London> and t2 = <2, Jones, 10, Paris>.
Example 2
Every flight listed in the airline schedule presented in Table 2 has an origin and a destination, and it is scheduled to depart at a specific time and arrive at a later time. It has a flight number. The FROM column contains names of airports served by the airline, and the ARRIVES column contains times of day. The order of the columns is immaterial as far as information content is concerned. The DEPARTS and ARRIVES columns could be interchanged with no change in meaning. Finally, since each flight has a unique number, no flight is represented by more than one row.
In Table 2 the relation scheme is
FLIGHTS = {NUMBER, FROM, TO, DEPARTS, ARRIVES}.
The domain of each attribute name might be:
1. dom(NUMBER) = a set of decimal numbers.
2.2.2 Keys
By definition, the set of all attributes of a relation scheme R has the uniqueness property, meaning that, at any given time, no two tuples in a relation r(R) at that time are duplicates of one another. Based on this property, the following definitions are given.
(a) Candidate Key – A candidate key of a relation r(R) is a subset K = {B1, B2, ..., Bn} of R with the following properties: (i) uniqueness, which means that for any two distinct tuples t1 and t2 in r, t1(K) ≠ t2(K); (ii) irreducibility, that is to say, no proper subset K′ of K has the uniqueness property.
(b) Super Key – A superkey is a superset of a candidate key; that is, if r has a candidate key K and K ⊆ K′, then K′ is called a superkey.
(c) Primary Key – A given relation r(R) may have more than one candidate key. The primary key is exactly one chosen key among those candidate keys; the others are then called alternate keys.
Example
An important concept is that of NULL values, which are used to represent the values
of attributes that may be unknown or may not apply to a tuple. A special value, called
NULL, is used in these cases [9].
• To denote the key of a relation, we underline the attribute names in the key. The relation r on scheme ABCD with AC as a key is written r(ABCD) with A and C underlined. We can also incorporate the key into the relation scheme R as R(ABCD). Any relation r(R) is then restricted to have no two tuples that agree on the key.
• An attribute A can be qualified with the relation scheme name R to which it belongs by using the dot notation R.A. For example, FLIGHTS.NUMBER.
An integrity constraint is a boolean expression that is associated with some database and is required to evaluate to TRUE at all times [4]. A DBMS should provide capabilities for defining and enforcing these constraints [9]. Integrity constraints are the part of the relational model that has changed (and evolved) the most over the years. The original emphasis was on primary and foreign keys specifically ("keys" for short). Gradually, however, the importance (indeed, the crucial importance) of integrity constraints in general began to be better understood and more widely appreciated [4].
Some constraints can be specified to the DBMS and automatically enforced. Other
constraints may have to be checked by update programs or at the time of data entry. For
typical large applications, it is customary to call such constraints business rules [9].
Here are some integrity constraints, expressed in natural language, all based on the KnowWare Inc. database.
4. No supplier with status less than 20 supplies any part in a quantity greater than 500.
1. Constraints that can be expressed in the database scheme using the DDL (Data Definition Language). These constraints must be formally declared to the DBMS, and the DBMS must then enforce them. For example, the first and the last two constraints (numbers 1, 3, and 4) expressed above belong to this category of constraints.
2. Constraints that cannot be directly expressed in the database scheme. These types of constraints are not understood by the DBMS, but they specify what the data means to the users. Constraint number 2 belongs to this category of constraints.
The simplest type of integrity constraint involves specifying a data type for each data item. This kind of constraint is known as a domain constraint. Domain constraints specify that within each tuple t = <v1, v2, ..., vn>, the value vi of each attribute Ai, 1 ≤ i ≤ n, must be an atomic value from the domain dom(Ai).
A superkey SK, defined in Section 2.2.2, specifies a uniqueness constraint: no two distinct tuples t1 and t2 in any relation r of relation scheme R can have the same value for SK; that is, t1(SK) ≠ t2(SK). If SK is the chosen primary key, this specifies a particular uniqueness constraint called the primary key constraint.
Figure 8: Sample values for the Supplier, Part, and SP relations.

Supplier
Sno Sname Status City
S1  Smith 20     London
S2  Jones 10     Paris
S3  Black 30     Paris
S4  Clark 20     London
S5  Adams 30     Athens

Part
Pno Name  Color Weight City
P1  Nut   Red   12.0   London
P2  Bolt  Green 17.0   Paris
P3  Screw Blue  17.0   Oslo
P4  Screw Red   14.0   London
P5  Cam   Blue  12.0   Paris
P6  Cog   Red   19.0   London

SP
Sno Pno Qty
S1  P1  300
S1  P2  200
S1  P3  400
S1  P4  200
S1  P5  100
S1  P6  100
S2  P1  300
S2  P2  400
S2  P3  500
S3  P2  200
S3  P3  200
S3  P4  300
S4  P2  300
S4  P3  400
Example
Figure 8 shows three relations: "Supplier", "Part", and "SP". The "SP" relation contains the quantity of parts to be supplied by each supplier. For example, the first tuple of "SP" shows that the supplier "Smith" (Sno=S1) is under contract to supply 300 parts of type "Nut" (Pno=P1).
Here are the three relation schemes Supplier, Part, and SP. The underlined attributes constitute the primary key in each relation scheme, while the attributes preceded by the symbol # represent the foreign keys. The "Sno" and "Pno" attributes together constitute the primary key in the "SP" relation scheme. These two attributes are also used to join the "Supplier" and "Part" relations, respectively. For this purpose, two referential integrity constraints are specified to state that each tuple in the "SP" relation must refer to two existing tuples in the "Supplier" and "Part" relations.
This constraint specifies whether NULL values are or are not permitted for the value of a particular attribute in a tuple. For example, if every FLIGHT tuple must have valid, non-NULL values for the FROM and TO attributes, then FROM and TO are constrained to be NOT NULL.
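The constraints of Sections 2.3.1 through 2.3.4 can all be declared in SQL DDL. The following is a minimal sketch for the schemes of Figure 8; the exact data types are assumptions, not given in the text:

CREATE TABLE Supplier (
    Sno    CHAR(3) PRIMARY KEY,        -- key constraint (Section 2.3.2)
    Sname  VARCHAR(30) NOT NULL,       -- NULL constraint (Section 2.3.4)
    Status INT CHECK (Status >= 0),    -- domain constraint (Section 2.3.1)
    City   VARCHAR(30)
);

CREATE TABLE Part (
    Pno    CHAR(3) PRIMARY KEY,
    Name   VARCHAR(30) NOT NULL,
    Color  VARCHAR(10),
    Weight DECIMAL(5,1),
    City   VARCHAR(30)
);

CREATE TABLE SP (
    Sno CHAR(3) NOT NULL,
    Pno CHAR(3) NOT NULL,
    Qty INT,
    PRIMARY KEY (Sno, Pno),                       -- composite primary key
    FOREIGN KEY (Sno) REFERENCES Supplier (Sno),  -- referential constraints (Section 2.3.3)
    FOREIGN KEY (Pno) REFERENCES Part (Pno)
);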
A relational database scheme S is a set of relation schemes S = {R1 , R2 , ..., Rm } and a set
of integrity constraints IC. A relational database state DB of S is a set of relation states
DB = {r1 , r2 , ..., rm } such that each ri is a state of Ri and such that the ri relation states
satisfy the specified integrity constraints [9].
Example
The set of relation schemes of the KnowWare Inc. database, described by the Entity/Association model presented in Section 2.1, together with all the integrity constraints that may be expressed within it, constitutes the KnowWare Inc. relational database scheme.
Definition: Let X and Y be subsets of the relation scheme R; then the functional dependency (FD) X → Y holds in R if and only if, whenever two tuples of R agree on X, they also agree on Y. X and Y are the determinant and the dependent, respectively, and the FD overall can be read as "X functionally determines Y", "Y is functionally dependent on X", or more simply just "X arrow Y" [5].
Example
For example, the relation shown in Table 3 satisfies the FD {SNO#} → {CITY}.
Definition. Formally, the set of all dependencies that includes F, as well as all dependencies that can be inferred from F, is called the closure of F. It is denoted by F+ [9].
To compute F+ from F, Armstrong [1] gave a set of inference rules (more usually called Armstrong's axioms) by which new FDs can be inferred from given ones [4].
Let A, B, and C be arbitrary subsets of the set of attributes of the given scheme R, and let us agree to write AB to mean the union of A and B. Then:
1. Reflexivity: if B ⊆ A, then A → B.
2. Augmentation: if A → B, then AC → BC.
3. Transitivity: if A → B and B → C, then A → C.
Several further rules can be derived from the three given above, the following among them. (Let D be another arbitrary subset of the set of attributes of R.)
1. Self-determination: A → A.
2. Decomposition: if A → BC, then A → B and A → C.
3. Union: if A → B and A → C, then A → BC.
4. Composition: if A → B and C → D, then AC → BD.
Example
The following set of FDs F is specified on the relation scheme "EMP-DEPT" in Figure 9 [9].
F = {Ssn → {Ename, Bdate, Address, Dnumber}, Dnumber → {Dname, Dmgr_ssn}}
Some of the additional functional dependencies that we can infer from F are the follow-
ing:
Ssn → {Dname, Dmgr_ssn}
Ssn → Ssn
Dnumber → Dname
Exercise 2.1
Suppose we are given the scheme R(ABCDEF) and the FDs:
A → BC
B → E
CD → EF
Prove that the FD AD → F ∈ F+ (meaning that we can infer AD → F from F).
1. A → BC (given)
2. A → C (decomposition)
3. AD → CD (2, augmentation)
4. CD → EF (given)
5. AD → EF (3 and 4, transitivity)
6. AD → F (5, decomposition)
Example
Definition of cover. Let F and G be two sets of FDs. If every FD implied by F is implied by G (i.e., if F+ is a subset of G+), we say that G is a cover for F [4].
Definition of equivalence. Two sets of FDs F and G over scheme R are equivalent, written
F ≡ G, if and only if F+ = G+ . If F ≡ G, then F is a cover for G [12].
Lemma. Given two sets of FDs F and G over scheme R, F ≡ G if and only if F |= G and G |= F [12].
Algorithm 1 Compute the closure X+ of X under F
1: function CLOSURE(X,F)
2: X+ ← X;
3: oldX+ ← ∅;
4: while (oldX+ ≠ X+) do
5: oldX+ ← X+ ;
6: for every FD Y → Z in F do
7: if X+ ⊇ Y then
8: X+ ← X+ ∪ Z;
9: end if
10: end for
11: end while
12: return X+ ;
13: end function
Algorithm 2 Test if F |= X → Y
1: function MEMBER(F,X → Y)
2: if Y ⊆ CLOSURE(X, F) then
3: return true;
4: else
5: return false;
6: end if
7: end function
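As a brief illustration of the two functions above, suppose F = {A → B, B → C}. Computing CLOSURE(A, F): initially X+ = A; the first pass over F adds B (since X+ ⊇ A), the next pass adds C (since X+ now contains B), and a further pass changes nothing, so CLOSURE(A, F) = ABC. Consequently, MEMBER(F, A → C) returns true, since C ⊆ ABC; in other words, F |= A → C.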
Exercise 2.2
Let F and G be two sets of FDs, with F = {A → BC, A → D, CD → E} and G = {A → BCE, A → ABD, CD → E}.
Are F and G equivalent?
Exercise 2.3
Let F and G be two sets of FDs,
• G = {A → B, C → DE, AC → F}.
– Compute the closure {A, B}+ under F and {A, C}+ under G.
Definition. A set F of FDs is irreducible, also called a minimal cover, if and only if it satisfies the following three properties:
1. The right-hand side (the dependent) of every FD in F involves just one attribute (i.e., it is a singleton set).
2. The left-hand side (the determinant) of every FD in F is irreducible; that is, no attribute can be discarded from the determinant without changing the closure F+.
3. No FD in F can be discarded from F without changing the closure F+.
Exercise 2.4
Compute the minimal cover of the set of FDs E on the scheme R(ABCD), where E = {A → BC, B → C, A → B, AB → C, AC → D}.
1. Step 1,
a) A → C
b) A → B (removed because it occurs twice)
c) B → C
d) A → B
e) AB → C
f) AC → D
Algorithm 3 Compute a minimal cover F of a set of FDs E
1: F ← E;
2: Replace each FD X → {A1, A2, ..., An} in F by the n FDs X → A1, X → A2, ..., X → An;
3: for each FD X → A in F do
4: for each attribute B in X do
5: if {{F − {X → A}} ∪ {(X − {B}) → A}} is equivalent to F then
6: replace X → A with (X − {B}) → A in F;
7: end if
8: end for
9: end for
10: for each remaining FD X → A in F do
11: if {F − {X → A}} is equivalent to F then
12: remove X → A from F;
13: end if
14: end for
2. Step 2,
AC → D (f) is replaced by A → D because:
– A → AC is implied by composition of A → A and A → C (a).
– Then, A → D is implied by transitivity of A → AC and AC → D (f).
3. Step 3,
AB → C (e) is removed because
– We can imply AB → CB by composition of A → C (given in a) and B → B.
– Then, AB → C is implied by decomposition of AB → CB.
4. Step 4,
A → C is removed because it is implied by transitivity of A → B (d) and B → C (c)
Exercise 2.5
Let F be the following set of FDs:
ABD → E
AB → G
B→F
C→J
CJ → I
G→H
Is F an irreducible set?
Table 4: gender relation not in 1NF
NAME               SEXE
{John, Jean, Ivan} Male
{Mary, Marie}      Female

Table 5: gender relation in 1NF
NAME  SEXE
John  Male
Jean  Male
Ivan  Male
Mary  Female
Marie Female
2.6 normalization
A normal form is a restriction on the database scheme that presumably precludes certain
undesirable properties from the database [12].
Let R(A1 A2 ... An) be a relation scheme. A relation r(R) is in first normal form (1NF) if and only if, for every tuple t appearing in r, the value of each attribute Ai, of type dom(Ai), is atomic. To say it in different words, 1NF means that every tuple contains exactly one value for each attribute [4].
Example
The relation gender, shown in Table 4, is not in 1NF because it contains values that are
sets of atomic values. To be in 1NF, gender should be stored like it’s in Table 5
Definition : A scheme R is in second normal form (2NF) if and only if, for every
key K of R and every nonkey attribute A of R, the FD K → {A} (which holds in R,
necessarily) is irreducible [5].
Example
The relation shown in Table 3 contains some values on the scheme SCP(SNO#, CITY, PNO#, QTY). This scheme clearly suffers from redundancy: every tuple for supplier S1 tells us S1 is in London, every tuple for supplier S2 tells us S2 is in Paris, and so on. The scheme SCP isn't in 2NF because its key is {SNO, PNO} and the FD {SNO, PNO} → {CITY} isn't irreducible: we can drop PNO from the determinant and what remains, {SNO} → {CITY}, is still an FD that holds in SCP.
To be in 2NF, SCP should be divided into two new relation schemes SPQ(SNO#, PNO#, QTY) and SC(SNO#, CITY). Table 6 and Table 7 show sample values for the new schemes SPQ and SC.
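In SQL, this decomposition could be carried out using the CREATE TABLE ... AS SELECT form supported by many DBMSs (a sketch; it assumes the original SCP table exists with the column names above):

CREATE TABLE SC AS
SELECT DISTINCT SNO, CITY
FROM SCP;  -- each supplier's city is now recorded exactly once

CREATE TABLE SPQ AS
SELECT SNO, PNO, QTY
FROM SCP;  -- the shipment data, without the redundant city column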
Definition: The scheme R is in third normal form (3NF) if and only if, for every nontrivial FD X → Y that holds in R, either (a) X is a superkey or (b) Y is a subkey [5].
Contrary to definitions of 3NF commonly found in the literature, which often take the form "R is in 3NF if and only if it's in 2NF and ...", Date C.J. [5] prefers a definition that makes no mention of 2NF. It also follows from this definition that 3NF implies 2NF.
Example
The scheme Supplier in Figure 10 isn't in 3NF. It's clear that the FD CITY → STATUS holds in that scheme, yet CITY isn't a superkey and STATUS isn't a subkey.
To be in 3NF, this scheme is subdivided into two new schemes, as shown in Figure 11.
Definition: A scheme R is in Boyce/Codd Normal Form (BCNF) if and only if, for every nontrivial FD X → Y that holds in R, X is a superkey [5].
This definition makes no mention of 2NF and 3NF. Note, however, that it can be derived from the definition of 3NF by dropping condition (b) ("Y is a subkey"). It clearly follows that if a scheme R is in BCNF, then it's certainly in 3NF.
Example
The scheme SNP(SNO, SNAME, PNO, QTY) has two candidate keys, {SNO, PNO} and {SNAME, PNO}; for that reason no attributes are underlined. The FDs {SNO} → {SNAME} and {SNAME} → {SNO} hold in SNP. In both cases the determinant isn't a superkey, and so the scheme SNP isn't in BCNF.
To be in BCNF, we decompose SNP into either of the following decompositions: {SNO, SNAME} together with {SNO, PNO, QTY}, or {SNO, SNAME} together with {SNAME, PNO, QTY}.
SQL (Structured Query Language) is a standard programming language used for managing and manipulating data in relational databases like MySQL, Oracle, SQL Server, and PostgreSQL. It is used to create, modify, and query databases and is widely used in
modern data-driven applications. SQL is used to communicate with databases, allowing
users to retrieve, insert, update, and delete data. SQL is also used for database man-
agement tasks such as creating tables, defining relationships between tables, and setting
constraints on data. SQL is a powerful and versatile language, and its popularity has
made it a critical skill for anyone working with data or databases.
SQL, which stands for Structured Query Language, was first developed in the early
1970s by Donald D. Chamberlin and Raymond F. Boyce at IBM. The initial version of
SQL, called SEQUEL (Structured English Query Language), was designed as a way to
access and manipulate data in IBM’s experimental System R database.
In the late 1970s and early 1980s, SQL was standardized by ANSI (American National
Standards Institute) and ISO (International Organization for Standardization), which
helped to establish it as the dominant language for managing relational databases. The
first ANSI SQL standard was published in 1986, followed by subsequent revisions and
updates over the years.
As relational databases became more popular in the 1980s and 1990s, SQL evolved
to support a wider range of functionality, including data manipulation, data definition,
and data control. In addition, various commercial database management systems (DBMS)
were developed that supported SQL, such as Oracle, IBM DB2, and Microsoft SQL Server.
Syntax of SQL statements
The syntax of SQL statements refers to the specific rules and conventions used to write SQL commands and queries that interact with a database.
Proper syntax is essential for ensuring that SQL commands are executed correctly and
that the resulting data is accurate and useful. Some key elements of SQL syntax include:
1. Clauses: SQL commands typically consist of one or more clauses that specify the
action to be taken. For example, the SELECT command consists of the SELECT and
FROM clauses, which specify which columns to retrieve and from which table.
2. Keywords: SQL commands are made up of keywords that define the action to be
taken, such as SELECT, INSERT, UPDATE, DELETE, and JOIN.
3. Operators: SQL commands use various operators, such as comparison operators (=,
<>, >, <, etc.) and logical operators (AND, OR, NOT), to filter and manipulate data.
4. Values and expressions: SQL statements often require the use of values and expres-
sions, such as dates, strings, and mathematical expressions, to perform calculations
or filter data.
Overall, understanding the syntax of SQL statements is critical for writing correct
and efficient SQL code. With proper syntax, SQL statements can be used to retrieve,
manipulate, and analyze data from relational databases.
Data types in SQL refer to the categories of data that can be stored and manipulated in a
relational database. SQL supports various data types that are used to define the structure
and behavior of data in a table. Some of the commonly used data types in SQL include:
1. Numeric data types: Numeric data types are used to store numeric values such as
integers, decimals, and floating-point numbers. Examples of numeric data types in
SQL include INT, BIGINT, DECIMAL, and FLOAT.
2. Character data types: Character data types are used to store text values such as
names, addresses, and descriptions. Examples of character data types in SQL in-
clude VARCHAR, CHAR, and TEXT.
3. Date and time data types: Date and time data types are used to store dates and
times. Examples of date and time data types in SQL include DATE, DATETIME,
and TIMESTAMP.
4. Boolean data types: Boolean data types are used to store true/false values. In SQL,
the BOOLEAN data type is represented by BIT or BOOL.
5. Binary data types: Binary data types are used to store binary data such as images
or files. Examples of binary data types in SQL include BLOB and VARBINARY.
6. Other data types: SQL also supports other data types such as XML, JSON, and
ARRAY.
Understanding data types in SQL is important for designing efficient and accurate
database schemas. By choosing the appropriate data type for each column in a table, you
can ensure that the data is stored and manipulated correctly and that the database can
perform efficiently.
In this section we use the database shown in Figure 13. This database describes a set of employees who work in a set of departments and are assigned to different projects.
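Figure 13 itself is not reproduced here, so the following DDL is only a plausible sketch of the schema the queries below assume; the column types, and any names beyond those appearing in the queries, are assumptions:

CREATE TABLE departments (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(50) NOT NULL
);

CREATE TABLE employees (
    employee_id    INT PRIMARY KEY,
    employee_name  VARCHAR(50) NOT NULL,
    employee_email VARCHAR(100),
    salary         DECIMAL(10,2),
    department_id  INT REFERENCES departments (department_id)
);

CREATE TABLE projects (
    project_id   INT PRIMARY KEY,
    project_name VARCHAR(50) NOT NULL,
    start_date   DATE,
    end_date     DATE
);

-- Associative table linking employees to the projects they are assigned to.
CREATE TABLE employee_projects (
    employee_id INT REFERENCES employees (employee_id),
    project_id  INT REFERENCES projects (project_id),
    PRIMARY KEY (employee_id, project_id)
);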
SELECT statements
WHERE clause
1. Retrieve the names and email addresses of all employees who are not working on
any projects:
SELECT employee_name, employee_email
FROM employees
WHERE employee_id NOT IN (SELECT employee_id FROM employee_projects);
2. Retrieve the names and dates of all projects that are currently active:
SELECT project_name, start_date, end_date
FROM projects
WHERE end_date >= CURRENT_DATE();
Joins in SQL
SQL provides several types of join operations to combine data from two or more tables.
Some of the most commonly used join operations in SQL are:
1. Inner Join: Returns only the rows that have matching values in both tables being
joined.
2. Left Join: Returns all the rows from the left table and the matching rows from the
right table. If there are no matching rows in the right table, it returns NULL values.
3. Right Join: Returns all the rows from the right table and the matching rows from
the left table. If there are no matching rows in the left table, it returns NULL values.
4. Full Outer Join: Returns all the rows from both tables and combines the rows with
matching values. If there are no matching rows in one of the tables, it returns NULL
values for the missing values.
5. Cross Join: Returns the Cartesian product of the two tables, i.e., all possible combi-
nations of rows from the two tables.
6. Self Join: Joins a table with itself to create a new table that contains only the rows
where a certain condition is true.
The following query joins the employees table with the departments table based on the
department_id column. It selects the employee_name column from the employees table
and the department_name column from the departments table where there is a match
between the department_id values in both tables.
SELECT e.employee_name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.department_id = d.department_id;
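For comparison, a left join on the same tables would also return employees whose department_id matches no department, with NULL for the missing department name (a sketch):

SELECT e.employee_name, d.department_name
FROM employees e
LEFT JOIN departments d
ON e.department_id = d.department_id;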
GROUP BY clause
The following query groups the employees by their department and returns the number of employees in each department. The GROUP BY clause is used to group the data by the department_id column, and the COUNT(*) function is used to count the number of employees in each group. The output will show each department ID and the corresponding number of employees.
SELECT department_id, COUNT(*) as num_employees
FROM employees
GROUP BY department_id;
Having clause
The following query groups the employees by their department and returns only the departments with more than 3 employees. The GROUP BY clause is used to group the data by the department_id column, and the COUNT(*) function is used to count the number of employees in each group. The HAVING clause is then used to filter the results to only show departments with more than 3 employees.
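SELECT department_id, COUNT(*) AS num_employees
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 3;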
Note that the HAVING clause is similar to the WHERE clause, but it is used specif-
ically for filtering the results of grouped data. The HAVING clause is evaluated after
the GROUP BY clause, so it can be used to filter the results based on the results of the
grouping function.
ORDER BY clause
The ORDER BY clause is typically used in conjunction with the SELECT statement to sort
the results based on one or more columns. The default ordering direction is ascending
(ASC), but you can specify DESC to sort the results in descending order. You can also
specify multiple columns in the ORDER BY clause to sort the results by multiple criteria.
The following query selects the employee_name and salary columns from the employees table and orders the results in descending order by the salary column. The ORDER BY clause is used to sort the results based on the specified column and ordering direction ("DESC" in this case). The output will show the employee names and salaries in descending order, with the highest paid employees first.
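SELECT employee_name, salary
FROM employees
ORDER BY salary DESC;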
INSERT statement
The following INSERT statement adds three new employees to the employees table in
a single query. Each set of values is separated by a comma, and the column names are
specified in parentheses before the VALUES keyword. Note that the order of the values
in each set must match the order of the columns specified in the INSERT INTO clause.
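A sketch of such a statement; the column names follow the earlier examples, while the
concrete values are hypothetical:
INSERT INTO employees (employee_id, employee_name, salary, department_id)
VALUES (6, 'Alice Martin', 52000, 1),
       (7, 'Omar Haddad', 61000, 2),
       (8, 'Lena Costa', 45000, 3);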
DELETE clause
The DELETE statement is used to delete one or more rows that satisfy some predicate. The fol-
lowing DELETE statement deletes a single row from the employees table where the em-
ployee_id is equal to 5. The DELETE FROM clause specifies the table to delete the row
from, and the WHERE clause is used to specify the condition for which rows to delete.
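The single-row delete described above, as a sketch:
DELETE FROM employees
WHERE employee_id = 5;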
You can also delete multiple rows at once by specifying a more general condition in
the WHERE clause. For example, to delete all employees with a salary less than 40,000:
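A sketch of that more general delete:
DELETE FROM employees
WHERE salary < 40000;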
2.8 conclusion
In this chapter, we delved into the world of relational databases, which are an essential
component of modern data storage and retrieval systems. We discussed the key princi-
ples of functional dependencies and normalization, which are critical for ensuring data
integrity, minimizing redundancy, and optimizing performance. We explored the normal-
ization process, which involves breaking down large tables into smaller, more specific
tables to eliminate repeating groups and reduce the risk of anomalies.
We also covered SQL, which is the primary language used for interacting with rela-
tional databases. We discussed how SQL enables users to create and modify tables, insert,
update, and delete data, and perform complex queries to extract information from the
database. We highlighted some of the key features of SQL, including its ability to handle
complex join operations, grouping and aggregation, sorting, and filtering.
Overall, this chapter provides a comprehensive overview of relational databases and
their key components, including functional dependencies, normalization, and SQL. By
understanding these principles, readers will be better equipped to design and implement
robust, scalable, and efficient data storage and retrieval systems.
O B J E C T A N D O B J E C T - R E L A T I O N A L D A T A B A S E S
3
Object-Oriented Databases, now known as Object Databases (ODB), were originally
developed as an alternative to relational database technology for the modeling, storage,
and access of complex data forms that were increasingly found in advanced applications
of database technology like Engineering Design, Geographical Information Systems, and Mul-
timedia. After much debate regarding Object-Oriented versus relational database technol-
ogy, object features were eventually incorporated into relational database systems to
create a new model and systems characterized as Object-Relational Databases (ORDB)
and Object Relational Database Management Systems (ORDBMS). Another reason for
the creation of ODBs is the rapidly increasing use of Object-Oriented Programming
Languages (OOPLs) for developing such complex applications; an ODB can be used
directly to store and retrieve the object data of these applications.
The term object-oriented, abbreviated OO, was originally used in OO Programming
Languages, or OOPLs, such as Java, C++, and C#. Today, OO concepts are incorporated into differ-
ent areas of software engineering, multi-agent systems, databases and computer systems
in general [9]. The main idea of the OO paradigm is to couple the data with behavior.
Simula, designed in Norway in the late 1960s, was the first procedural language to do
this and was used, as its name suggests, for simulations [2]. The Simula language intro-
duced objects, classes, inheritance, subclasses, and many other OO features required of
any OOPL.
Once the advantages of the object-oriented approach over the procedural one were
recognized, Simula rapidly developed into a general-purpose programming language [11].
The following sections describe the fundamental concepts commonly used in object
technology: objects, encapsulation, classes, and inheritance.
The object is the most important concept in the object paradigm. An object is an encap-
sulated unit of data (or properties) and behavior (or methods). Encapsulation means that
the internal implementation is hidden: neither program code nor variables are visible
outside the object. Instead, objects expose a public interface consisting of public methods
and properties. In general, properties cannot be accessed or manipulated directly, but
only through data access methods.
To execute a method, we need to send the object a message including the method
name and the set of parameters required for its execution. The object then immediately
invokes the desired method, passing it the parameter values.
An object class, or simply class, is a blueprint to build a specific type of object. It defines
how an object will behave and what the object will contain. The incarnations of a class
are called instances of that class. For example, the Circle class defines the class of geometric
circles. Each circle is defined by a center point, characterized by two coordinates x and
y, and a radius r. The behavior is provided by a set of methods such as Draw, Move,
Color, etc. Figure 17 shows the class “Circle” and three different instances.
Inheritance
Inheritance is the mechanism by which more specific elements incorporate structure and
behavior defined by more general elements [15].
Inheritance is one of the most important features of Object-Oriented Programming. It
is the ability to derive a class B from another class A. The subclass B has an “is
a” relationship to the superclass A. In this case, we say that class B extends class A, so B
inherits from A all its properties and methods. Thus, the subclass has the same interface
as the super class and can be used in its place. This principle is called substitutability and
means that every message you can send to an instance of the super class is also valid for
instances of the subclass [11].
Example
Figure 18 demonstrates inheritance relationships between some classes. The class “Form”
is the super class of three subclasses “Triangle”, “Ellipse”, and “Rectangle”. All these
classes model geometric forms. Using the “is a” relationship we can simply say for ex-
ample – a “Triangle” is a “Form”.
Specialization
A subclass B can be specialized by adding new properties and methods that the super
class A does not have. Similarly, inherited methods can be overridden so that the sub-
class behaves differently from the superclass.
Generalization
The more general description is called the parent; an element in the transitive closure is
an ancestor. The more specific description is called the child; an element in the transitive
closure is a descendant [15].
The superclass “Form” shown in Figure 18 is more general than its child classes
“Triangle”, “Ellipse”, and “Rectangle” and all other inherited classes.
3.1.4.1 Persistence
From the OOPL point of view, objects are instantiated in RAM during the execution of a
program, meaning that all these objects disappear when the program execution ends.
Persistence means that an object can continue to exist after the program that instanti-
ated it ends, or even after the system that ran the program is turned off.
Not all objects are meant to be stored permanently in the object database; there are
two types of objects, persistent and transient. The typical mechanisms for making objects
persistent are naming and reachability.
In ODBs, a complex type may be constructed from other types by nesting of type con-
structors. The three most basic constructors are atom, struct (or tuple), and collection.
1. Atom types are similar to the basic types in many programming languages: inte-
gers, strings, floating-point numbers, enumerated types, Booleans, and so on. The
term atom is not used in the latest object standard; they are called single-valued or
atomic types, since each value of the type is considered an atomic (indivisible) sin-
gle value.
2. The struct (or tuple) constructor can create standard structured types, such as the
tuples (record types) in the basic relational model. It is considered to be a type generator. For ex-
ample, two different structured types that can be created are: struct Name<FirstName:
string, MiddleInitial: char, LastName: string>, and struct CollegeDegree<Major:
string, Degree: string, Year: date>. To create complex nested type structures in the
object model, the collection type constructors are needed, which we discuss next.
3. Collection (or multivalued) type constructors include the set(T), list(T), bag(T),
array(T), and dictionary(K,T) type constructors. These allow part of an object or
literal value to include a collection of other objects or values when needed. These
constructors are also considered to be type generators because many different types
can be created. For example, set(string), set(integer), and set(Employee) are three
different types that can be created from the set type constructor. All the elements
in a particular collection value must be of the same type. For example, all values in
a collection of type set(string) must be string values.
Example
An object definition language (ODL) 1 that incorporates the preceding type construc-
tors can be used to define the object types for a particular database application. In Fig-
ure 19, the attributes that refer to other objects, such as Dept of EMPLOYEE or Projects
of DEPARTMENT, are basically OIDs that serve as references to other objects to represent
relationships among the objects.
1 This corresponds to the DDL (data definition language) of the relational database system.
Figure 19: Specifying the object types EMPLOYEE, DATE, and DEPARTMENT using type con-
structors.
3.2 object-relational features: object database extensions to sql
3.2.1 SQL/Foundation
SQL was first specified in 1974 and continued its evolution with a new standard, ini-
tially called SQL3 while being developed, and later known as SQL:99 for the parts of
SQL3 that were approved into the standard. Starting with the version of SQL known as
SQL3, features from object databases were incorporated into the SQL standard. At first,
these extensions were known as SQL/Object, but later they were incorporated in the
main part of SQL, known as SQL/Foundation [9].
The following are some of the object database features that have been included in SQL:
• Some type constructors have been added to specify complex objects. These include
the row type, which corresponds to the tuple (or struct) constructor. An array type
for specifying collections is also provided. Other collection type constructors, such
as set, list, and bag constructors, were not part of the original SQL/Object specifica-
tions but were later included in the standard.
Example
Figure 20 illustrates some of the object features of SQL: (a) using UDTs as types for at-
tributes such as Address and Phone, (b) specifying a UDT for PERSON_TYPE, (c) speci-
fying UDTs for STUDENT_TYPE and EMPLOYEE_TYPE as two subtypes of PERSON_TYPE,
(d) creating tables based on some of the UDTs and illustrating table inheritance, and
(e) specifying relationships using REF and SCOPE.
An anonymous row type can also be specified directly within a type or table definition
by using the keyword ROW. For example, we could use the following instead of declar-
ing STREET_ADDR_TYPE as a separate type as in Figure 20.
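A sketch of such an inline declaration; the attribute names are assumed to match those
of Figure 20:
CREATE TYPE USA_ADDR_TYPE AS (
  STREET_ADDR ROW (
    NUMBER      VARCHAR(5),
    STREET_NAME VARCHAR(25),
    APT_NO      VARCHAR(5),
    SUITE_NO    VARCHAR(5)
  ),
  CITY VARCHAR(25),
  ZIP  VARCHAR(10)
);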
The ODL is designed to support the semantic constructs of the ODMG object model and
is independent of any particular programming language. Its main use is to create object
specifications, that is, classes and interfaces. Hence, ODL is not a full programming lan-
guage. A user can specify a database schema in ODL independently of any programming
language, and then use the specific language bindings to specify how ODL constructs
can be mapped to the constructs of specific programming languages.
The object query language OQL is the query language proposed for the ODMG object
model. It is designed to work closely with the programming languages for which an
ODMG binding is defined, such as C++, Smalltalk, and Java.
Figure 23: Possible ODL schema for the UNIVERSITY database (Part 1)
Figure 24: Possible ODL schema for the UNIVERSITY database (Part 2)
The basic OQL syntax is a select ... from ... where ... structure, as it is for SQL. For
example, the query to retrieve the names of all departments in the college of ’Engineering’
can be written as follows:
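A sketch of this query in OQL, assuming the entry point DEPARTMENTS and the
attributes Dname and College from the ODL schema of Figures 23 and 24:
SELECT D.Dname
FROM D IN DEPARTMENTS
WHERE D.College = 'Engineering';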
In this query, there are three equivalent ways to specify the iterator variable: (i) D IN
DEPARTMENTS, (ii) DEPARTMENTS D, and (iii) DEPARTMENTS AS D.
Because of its overall complexity, the complete OQL standard has not yet been fully
implemented in any system, but many DBMSs provide their own corresponding lan-
guages. This section presents the implementation of the object-relational
school database specified in the UML class diagram shown in Figure 25. The implementa-
tion illustrates the use of user-defined types, reference types, and typed tables, as well as
Oracle’s support for collections in the form of variable-sized arrays (varrays) and nested
tables, which is a feature that is not supported by the SQL standard. Object type hierar-
chies and a feature known as substitutable tables are used as a means to support class
hierarchies [7].
Figure 25: Class diagram for the Case Study of the School Database
In conformance with the SQL standard, Oracle supports the basic object-relational feature
of object types. An object type is a composite structure that can have a set of attributes
and methods. Below is a simple example of a ’person_t’ object type definition for the
Person class without methods.
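The definition might look as follows; this is a sketch whose attribute names are taken
from the PL/SQL block below, while the column sizes and the NOT FINAL clause (which
permits the subtypes required by the class hierarchy of Figure 25) are assumptions:
CREATE TYPE person_t AS OBJECT (
  pId       VARCHAR2(11),
  firstName VARCHAR2(20),
  lastName  VARCHAR2(20),
  dob       DATE
) NOT FINAL;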
To assign values to an object type, a variable can be created having the type of the
object type. The following example illustrates how to assign values into the attributes of
the person_t object type in PL/SQL.
declare
p person_t;
begin
p.pId := ’PR123456789’;
p.firstName := ’Terrence’;
p.lastName := ’Grand’;
p.dob := ’10-NOV-1975’;
end;
3.3 conclusion
In this chapter, we explored the differences between object-oriented and relational databases,
and discussed how modern databases have evolved to support both paradigms through
the use of hybrid object-relational and object databases. We provided an example of
Oracle database, which is a popular database management system that supports both
relational and object-oriented data models.
We discussed the key features of object databases, such as support for complex data
types, encapsulation, inheritance, and polymorphism, and highlighted their benefits in
scenarios where the data is naturally hierarchical or complex. We also explored the lim-
itations of relational databases in handling complex data types and relationships, and
how object-oriented databases provide a more natural solution for such scenarios.
We then introduced the concept of hybrid object-relational databases, which combine
the strengths of both relational and object-oriented databases to create a more versatile
and flexible data management solution. We provided an example of how Oracle database
supports hybrid models, allowing developers to define and store complex data types,
such as nested tables and user-defined types, within the context of a relational database.
Q U E R Y I N G X M L D A T A B A S E S : X P A T H A N D X Q U E R Y L A N G U A G E S
4
XML stands for eXtensible Markup Language. The use of XML has exploded in recent
years. A huge amount of information is now stored in XML, both in XML databases and
in documents on a file system. This includes highly structured data, such as sales figures,
and semi-structured data such as product catalogs and yellow pages. Even more information
is transmitted between systems as transient XML documents [18].
XML documents form a tree structure that starts at "the root" and branches to "the leaves".
Figure 26 shows an example of hierarchical tree structure of an XML schema.
Example
The document below contains books in this XML representation. The document is valid
with respect to the XML schema shown in Figure 26.
<bookstore>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
4.2 querying xml documents
One might assume that it is necessary to convert XML data to some other representation (say,
relational) and query that converted data using some language (such as SQL). Sometimes
that is the most appropriate strategy - for example, if the XML data is highly regular and
will be queried many times in the same way, you may be able to query it more efficiently
in a purely relational context. Often, though, you want to store and represent the data as
XML throughout its life (or at least preserve the XML abstraction over your data when
querying) [13].
Querying XML data is different from querying relational data - it requires navigating
around a tree structure that may or may not be well defined. XQuery is an SQL-like
language designed by the W3C to address these needs. It allows you to select the XML
data elements of interest, reorganize and possibly transform them, and return the results
in a structure of your choosing [18].
XPath is a language for selecting elements and attributes from an XML document while
traversing its hierarchy and filtering out unwanted content. XPath 1.0 is a useful recom-
mendation that specifies path expressions and a limited set of functions. XPath 2.0 has
become much more than that, encompassing a wide variety of expressions and functions,
not just path expressions. XQuery 1.0 and XPath 2.0 overlap to a very large degree. They
have the same data model and the same set of built-in functions and operators. XQuery
has a number of features that are not included in XPath, such as FLWORs (standing
for For, Let, Where, and Order By clauses) and XML constructors. This is because these
features are not relevant to selecting, but instead have to do with structuring or sorting
query results.
A simple example of a processing model for XQuery is shown in Figure 27. This section
describes the various components of this model.
4.3 xpath
The XML Path Language - XPath, as it is more commonly known - was first published
as a recommendation by the W3C in 1999. XPath is a language for addressing parts
of XML documents. Because querying facilities in general function to locate or identify
certain information, it is easy to see that XPath is itself a sort of query language [13].
The typical XPath expression deliberately uses a “path-like” syntax similar to the nota-
tion used by operating systems for referencing files and directory paths. A path expression
is used by XPath to identify and navigate nodes in an XML document. As a result, it
evaluates to a value of one of four possible types: a node set (an unordered collection of
nodes without duplicates), a Boolean value, a number, or a string.
There are many different kinds of expressions. The most important are: (i) path expres-
sions, (ii) value expressions, and (iii) node-set expressions.
/bookstore/book[1]
    Selects the first book element that is the child of the bookstore element.
/bookstore/book[last()]
    Selects the last book element that is the child of the bookstore element.
/bookstore/book[last()-1]
    Selects the last but one book element that is the child of the bookstore element.
/bookstore/book[position()<3]
    Selects the first two book elements that are children of the bookstore element.
//title[@lang]
    Selects all the title elements that have an attribute named lang.
//title[@lang='en']
    Selects all the title elements that have a lang attribute with a value of "en".
/bookstore/book[price>35.00]
    Selects all the book elements of the bookstore element that have a price element
    with a value greater than 35.00.
/bookstore/book[price>35.00]/title
    Selects all the title elements of the book elements of the bookstore element that
    have a price element with a value greater than 35.00.
• Function invocations
• Node-set expressions: the union operator | (pronounced “union”) combines two
node sets into one.
4.4 xquery
XQuery is a query language designed specifically for XML. It is
also flexible enough to query a broad spectrum of XML information sources, including
both databases and documents.
XQuery has a data model that is used to define formally all the values used within
queries, including those from the input document(s), those in the results, and any inter-
mediate values. The XQuery data model is officially known as the XQuery 1.0 and XPath
2.0 Data Model, or XDM.
Understanding the XQuery data model is analogous to understanding tables, columns,
and rows when learning SQL. It describes the structure of both the inputs and outputs
of the query. The basic components of XDM are:
• Atomic value: A simple data value with no markup associated with it.
Nodes
Nodes are used to represent XML constructs such as elements and attributes. For ex-
ample, the path expression doc("catalog.xml")/catalog/product returns four product el-
ement nodes.
1. Element nodes: Represent XML elements
2. Attribute nodes: Represent XML attributes
3. Document nodes: Represent an entire XML document (not its outermost element)
Example
When translating the following small XML document to the XQuery data model, it looks
like the diagram in Figure 29
<catalog xmlns="http://datypic.com/cat">
<product dept="MEN" xmlns="http://datypic.com/prod">
<number>784</number>
<name language="en">Cotton Dress Shirt</name>
<colorChoices>white gray</colorChoices>
<desc>Our <i>favorite</i> shirt!</desc>
</product>
</catalog>
Root node
XPath 1.0 has a separate concept of a root node, which is equivalent to a document node
in XQuery (and XPath 2.0). A root node represents the entire document and would be the
parent of the catalog element in our previous example.
Every node has a unique identity. You may have two XML elements in the input doc-
ument that contain the exact same data, but that does not mean they have the same
identity.
There are two kinds of values for a node: string and typed. The string value of an element
node is its character data content and that of all its descendant elements concatenated
together. The string value of an attribute node is simply the attribute value. The string
value of a node can be accessed using the string function. For example:
string(doc("catalog.xml")/catalog/product[4]/number)
4.4 xquery 59
returns the string Our favorite shirt!, without the i start and end tags.
An element or attribute might have a particular type if it has been validated with a
schema. The typed value of a node can be accessed using the data function. For example:
data(doc("catalog.xml")/catalog/product[4]/number)
returns the integer 784, if the number element is declared in a schema to be an integer. If
it is not declared in the schema, its typed value is still 784, but the value is considered to
be untyped.
Atomic Values
An atomic value is a simple data value such as 784 or ACC, with no markup, and no
association with any particular element or attribute. An atomic value can have a specific
type, such as xs:integer or xs:string, or it can be untyped.
Sequences
Sequences are ordered collections of items. A sequence can contain zero, one, or many
items. Each item in a sequence can be either an atomic value or a node. For example, the
expression doc("catalog.xml")/catalog/product returns a sequence of four items, which
happen to be product element nodes.
A sequence can also be created explicitly using a sequence constructor. The syntax
of a sequence constructor is a series of values, delimited by commas, surrounded by
parentheses. For example, the expression (1, 2, 3) creates a sequence consisting of those
three atomic values.
Some of the most used functions on sequences are the aggregation functions (min,
max, avg, sum). In addition, union, except, and intersect expressions allow sequences
to be combined. There are also a number of functions that operate generically on any
sequence, such as index-of and insert-before.
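A small sketch of these functions on a hypothetical sequence:
let $s := (10, 20, 30)
return (sum($s), index-of($s, 20), insert-before($s, 2, 15))
(: evaluates to the sequence 60, 2, 10, 15, 20, 30 :)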
Types
The XQuery type system is based on that of XML Schema. XML Schema has built-in
simple types representing common datatypes such as xs:integer, xs:string, and xs:date.
Literals
Literals are simply constant values that are directly represented in a query, such as "ACC"
and 29.99. There are two kinds of literals: string literals and numeric literals. They can be
used in expressions anywhere a constant value is needed, for example the strings in the
comparison expression: if ($department = "ACC") then "accessories" else "other" or the
numbers 1 and 30 in the function call: substring($name, 1, 30)
Variables
Variables in XQuery are identified by names that are preceded by a dollar sign ($).
For example, an initialized variable $num can be declared as follows.
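A sketch of such a declaration in the query prolog; the value 7 is hypothetical:
declare variable $num as xs:integer := 7;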
The heart of XQuery is in five clauses: for, let, where, order by, and return. These
five clauses are abbreviated in the term FLWOR, pronounced “flower”. The presence of
these clauses in a query follows a set of rules:
3. The “where” and “order by” clauses cannot be used if the “for” clause is not used
in the query.
<catalog>
<product dept="WMN">
<number>557</number>
<name language="en">Fleece Pullover</name>
<colorChoices>navy black</colorChoices>
</product>
<product dept="ACC">
<number>563</number>
<name language="en">Floppy Sun Hat</name>
</product>
<product dept="ACC">
<number>443</number>
<name language="en">Deluxe Travel Bag</name>
</product>
<product dept="MEN">
<number>784</number>
<name language="en">Cotton Dress Shirt</name>
<colorChoices>white gray</colorChoices>
<desc>Our <i>favorite</i> shirt!</desc>
</product>
</catalog>
for $p in doc("catalog.xml")/catalog/product
return $p
Because the whole document contains only one kind of product element, the following
query is equivalent to the previous one.
for $p in doc("catalog.xml")//product
return $p
4.5.3 Return all elements that have a specific child element or attribute
The following two queries return, respectively, all products that have a desc element
and all products that have a dept attribute.
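Sketches of these two queries over catalog.xml:
for $p in doc("catalog.xml")//product[desc]
return $p

for $p in doc("catalog.xml")//product[@dept]
return $p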
If we want to retrieve the products that have dept = "ACC", we can modify the last query
(Example 2) as follows.
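A sketch of the modified query:
for $p in doc("catalog.xml")//product[@dept = "ACC"]
return $p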
The queries in the remaining sections are run against the document “price.xml”. This
document contains the price lists of a set of products seen in the document “catalog.xml”.
When the prices of some products change, their new prices are kept in this document.
<prices>
<priceList effDate="2006-11-15">
<prod num="557">
<price currency="USD">29.99</price>
<discount type="CLR">10.00</discount>
</prod>
<prod num="563">
<price currency="USD">69.99</price>
<discount type="CLR">10.00</discount>
</prod>
<prod num="443">
<price currency="USD">39.99</price>
<discount type="CLR">3.99</discount>
</prod>
</priceList>
<priceList effDate="2009-01-01">
<prod num="557">
<price currency="USD">40.99</price>
<discount type="CLR">10.00</discount>
</prod>
</priceList>
</prices>
The following query returns all the prices of the product number 557 ordered by price
value. We intentionally used all five FLWOR clauses in this query.
for $p in doc("price.xml")//prod
let $num := $p/@num
let $price := $p/price
where $num = 557
order by $price
return $p
We can construct XML elements in the output of an XQuery script. In the following
example, we return the price of each product as a constructed element carrying the date
on which that price took effect.
for $p in doc("price.xml")//prod
let $a := string($p/price),
$d := string($p/parent::priceList/@effDate)
return <price date="{$d}">{$a}</price>
<price date="2006-11-15">29.99</price>
<price date="2006-11-15">69.99</price>
<price date="2006-11-15">39.99</price>
<price date="2022-01-01">40.99</price>
4.6 conclusion
In this chapter, we introduced XPath and XQuery, two powerful query languages used
for querying and transforming XML documents. XPath is a language for navigating
through the hierarchical structure of an XML document and selecting specific elements
or attributes. XQuery is a more expressive language that builds on XPath and allows for
more complex queries and transformations.
We discussed the syntax and semantics of both languages, and how they can be used
to extract data from XML documents and transform them into different formats. We also
highlighted the similarities and differences between XPath and XQuery, and when to use
each language based on the complexity of the task at hand.
We explored the key features of both languages, such as their support for a wide range
of data types, functions, and operators, and their ability to handle complex queries and
transformations. We provided examples of how to use XPath and XQuery to query and
transform XML documents, including selecting specific elements or attributes, filtering
results, and performing mathematical and logical operations.
Overall, this chapter provides an overview of XPath and XQuery, two essential query
languages for working with XML documents. By understanding the syntax and seman-
tics of these languages, readers will be better equipped to extract and transform data
from XML documents and integrate them with other systems and applications.
D I S T R I B U T E D D A T A B A S E S
5
5.1 introduction
Distributed database system (DDBS) technology is the union of two approaches to data
processing: database system and computer network technologies. Database systems have
taken us from a paradigm of data processing in which each application defined and
maintained its own data to one in which the data are defined and administered centrally.
This new orientation results in data independence, whereby the application programs are
immune to changes in the logical or physical organization of the data, and vice versa. The
technology of computer networks, on the other hand, promotes a mode of work that goes
against all centralization efforts. At first glance it might be difficult to understand how
these two contrasting approaches can possibly be synthesized to produce a technology
that is more powerful and more promising than either one alone.
A distributed database system is a type of database system that is spread across multi-
ple computers geographically distributed. In a distributed database system, the data is
partitioned or replicated across multiple nodes, and the nodes work together to process
queries and transactions from clients.
A DDBS is not, however, a system where, despite the existence of a network, the database
resides at only one node of the network. What we are interested in is an environment
where data are distributed among a number of sites (Figure 30).
Distributed database systems offer several benefits:
• Fault tolerance: Distributed database systems can continue to operate even if one
or more nodes fail. Data can be replicated across multiple nodes, so if one node
fails, another node can take over without loss of data.
Figure 30: Centralized vs Distributed Database Systems: (A) Centralized Database System on
network, (B) DDBS environment.
5.2 distributed dbms architecture
The architecture of a system defines its structure. This means that the components of
the system are identified, the function of each component is specified, and the interrela-
tionships and interactions among these components are defined. The specification of the
architecture of a system requires identification of the various modules, with their inter-
faces and interrelationships, in terms of the data and control flow through the system.
Before presenting the architecture of a distributed DBMS, we first need to present the
ANSI/SPARC architecture, which distinguishes three levels:
• External level: This is the level of the database system that is visible to end-users
and applications. It describes how data is viewed by different users and groups,
and how data is accessed and manipulated by applications. Each external schema
is tailored to meet the specific needs of a particular user or application.
• Conceptual level: This is the level of the database system that describes the overall
logical structure of the database. It defines the relationships between different types
of data and how they are organized and stored in the database. The conceptual
schema is independent of any particular application or user, and is used to ensure
that all data in the database is consistent and integrated.
• Internal level: This is the level of the database system that describes how data
is physically stored and accessed by the computer system. It defines the storage
structures and access methods used by the DBMS to manage the data. The internal
schema is hidden from users and applications, and is optimized for efficient storage
and retrieval of data.
The ways in which a distributed DBMS can be architected can be classified in terms of:
(1) the autonomy of local systems, (2) their distribution, and (3) their heterogeneity. (see
Figure 32).
5.2.2.1 Autonomy
Autonomy, in this context, refers to the distribution of control, not of data. It indicates
the degree to which individual DBMSs can operate independently. Requirements of an
autonomous system have been specified as follows [10].
• The local operations of the individual DBMSs are not affected by their participation
in the distributed system.
• The manner in which the individual DBMSs process queries and optimize them
should not be affected by the execution of global queries that access multiple
databases.
On the other hand, the dimensions of autonomy can be specified as follows [8]:
• Design autonomy: Individual DBMSs are free to use the data models and transac-
tion management techniques that they prefer.
• Communication autonomy: Each of the individual DBMSs is free to make its own
decision as to what type of information it wants to provide to the other DBMSs or
to the software that controls their global execution.
• Execution autonomy: Each DBMS can execute the transactions that are submitted
to it in any way that it wants to.
5.2.2.2 Distribution
The distribution dimension of the taxonomy deals with data. We are considering the
physical distribution of data over multiple sites; as we discussed earlier, the user sees
the data as one logical pool. There are a number of ways DBMSs have been distributed.
We abstract these alternatives into two classes: client/server distribution and peer-to-peer
distribution (or full distribution).
In the context of distributed databases, the difference between client/server distribu-
tion and peer-to-peer distribution is as follows:
Client/server distribution of a distributed database involves a centralized server that
manages and coordinates the database, while the clients are responsible for accessing and
querying the data. The server is responsible for maintaining the consistency and integrity
of the data by ensuring that all clients see the same version of the data at any given time.
This approach is often used when there is a large amount of data that needs to be shared
among multiple clients. On the other hand, in peer-to-peer distribution of a distributed
database, all participants in the network have equal status, and each participant can act
as both a client and a server. Each participant maintains a local copy of the data and
shares it with other participants in the network. This approach is often used in situations
where there is no centralized authority, and where participants need to share data with
each other without relying on a central server.
5.2.2.3 Heterogeneity
Heterogeneity may occur in various forms in distributed systems, ranging from hardware
heterogeneity and differences in networking protocols to variations in data models, query
languages, and transaction management protocols.
Client/Server Systems
Client/server DBMSs entered the computing scene at the beginning of the 1990s and have
made a significant impact on both the DBMS technology and the way we do computing.
The general idea is very simple and elegant: distinguish the functionality that needs
to be provided and divide these functions into two classes: server functions and client
functions. This provides a two-level architecture which makes it easier to manage the
complexity of modern DBMSs and the complexity of distribution. Many DDBMSs use
the client/server architecture; examples include Microsoft SQL Server, Oracle Database,
MySQL, and PostgreSQL.
Figure 33 depicts the architecture of a client/server DDB system.
1 https://cassandra.apache.org/
Distributed query processing is the process of executing a database query that involves
data stored on multiple nodes or servers in a distributed database system. When a query
is submitted, it must be broken down into smaller subqueries that can be executed on dif-
ferent nodes in parallel, and then the results must be combined to form the final result set.
Distributed query processing involves several steps, including query optimization, query
decomposition, data fragmentation and distribution, data transfer, local processing, and
result consolidation. The goal of distributed query processing is to minimize the amount
of data that needs to be transferred between nodes and to maximize parallelism in the
execution of subqueries, in order to improve query performance and scalability in a dis-
tributed environment. It is a key challenge in designing and implementing distributed
database systems.
Because it is a critical performance issue, query processing has received (and contin-
ues to receive) considerable attention in the context of both centralized and distributed
DBMSs. However, the query processing problem is much more difficult in distributed
environments than in centralized ones, because a larger number of parameters affect the
performance of distributed queries. In particular, the relations involved in a distributed
query may be fragmented and/or replicated, thereby inducing communication overhead
costs. Furthermore, with many sites to access, query response time may become very
high.
The main function of a relational query processor is to transform a high-level query (typi-
cally, in relational calculus) into an equivalent lower-level query (typically, in some vari-
ation of relational algebra). The low-level query actually implements the execution strat-
egy for the query. The transformation must achieve both correctness and efficiency. It is
correct if the low-level query has the same semantics as the original query, that is, if
both queries produce the same result. Since each equivalent execution strategy can lead
to very different consumptions of computer resources, the main difficulty is to select the
execution strategy that minimizes resource consumption.
Example
We consider the following subset of the engineering database schema given in Figure 37
and the following simple user query: “Find the names of employees who are managing
a project.”
The expression of the query in relational calculus using the SQL syntax is
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = "Manager"
Two equivalent relational algebra queries that are correct transformations of the query
above are:
ΠENAME (σRESP="Manager" ∧ EMP.ENO=ASG.ENO (EMP × ASG)), and
ΠENAME (EMP ⋈ENO (σRESP="Manager" (ASG)))
It is intuitively obvious that the second query, which avoids the Cartesian product of
EMP and ASG, consumes much less computing resources than the first, and thus should
be retained. In a centralized context, query execution strategies can be well expressed in
an extension of relational algebra. The main role of a centralized query processor is to
choose, for a given query, the best relational algebra query among all equivalent ones.
In a distributed system, relational algebra is not enough to express execution strategies.
It must be supplemented with operators for exchanging data between sites. Besides the
choice of ordering relational algebra operators, the distributed query processor must also
select the best sites to process data, and possibly the way data should be transformed.
This increases the solution space from which to choose the distributed execution strategy,
making distributed query processing significantly more difficult.
5.3 distributed database design
In the design of distributed DBMSs, the distribution of applications involves two things:
the distribution of the distributed DBMS software and the distribution of the application
programs that run on it. Two major strategies that have been identified for designing
distributed databases are the top-down approach and the bottom-up approach.
The top-down design process (Figure 36) starts with a requirements analysis that
defines the environment of the system and “elicits both the data and processing needs
of all potential database users”. The requirements study also specifies where the final
system is expected to stand with respect to the objectives of a distributed DBMS. These
objectives are defined with respect to performance, reliability and availability, economics,
and expandability (flexibility).
The requirements document is input to two parallel activities: view design and concep-
tual design. The view design activity deals with defining the interfaces for end users. The
conceptual design, on the other hand, is the process by which the enterprise is examined
to determine entity types and relationships among these entities. The global conceptual
schema (GCS) and access pattern information collected as a result of view design are
inputs to the distribution design step. The objective at this stage is to design the local
conceptual schemas (LCSs) by distributing the entities over the sites of the distributed
system.
A relation in a distributed database can be divided into subrelations called frag-
ments. Thus, the distribution design activity consists of two steps: fragmentation and
allocation.
Fragmentation in distributed databases occurs when data is divided into smaller subsets
and distributed across multiple nodes in a network. Relation instances are essentially
tables, so the issue is one of finding alternative ways of dividing a table into smaller
ones. There are clearly two alternatives for this: dividing it horizontally or dividing it
vertically.
Figure 37: Relational schema. Set of employees (EMP) assigned (ASG) to projects (PROJ) for
different payments (PAY).
Example
Figure 38 shows the PROJ relation of Figure 37 divided horizontally into two relations.
Subrelation PROJ1 contains information about projects whose budgets are less than
$200,000, whereas PROJ2 stores information about projects with larger budgets.
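As a sketch in SQL, assuming a BUDGET column on PROJ as in Figure 37, the two
fragments could be derived as follows:
CREATE TABLE PROJ1 AS
  SELECT * FROM PROJ WHERE BUDGET < 200000;

CREATE TABLE PROJ2 AS
  SELECT * FROM PROJ WHERE BUDGET >= 200000;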
We will enforce the following three rules during fragmentation, which, together, ensure
that the database does not undergo semantic change during fragmentation.
1. Completeness
If a relation instance R is decomposed into fragments FR = {R1 , R2 , ..., Rn }, each data item
that can be found in R can also be found in one or more of Ri ’s. This property, which
is identical to the lossless decomposition property of normalization, is also important
in fragmentation since it ensures that the data in a global relation are mapped into
fragments without any loss. Note that in the case of horizontal fragmentation, the “item”
typically refers to a tuple, while in the case of vertical fragmentation, it refers to an
attribute.
Figure 38: (A) Horizontal fragmentation of the relation PROJ; (B) vertical fragmentation of
the relation PROJ.
2. Reconstruction
If a relation instance R is decomposed into fragments FR = {R1 , R2 , ..., Rn }, it should be
possible to define a relational operator ∇ such that R = ∇Ri , ∀Ri ∈ FR . The operator will be differ-
ent for different forms of fragmentation; it is important, however, that it can be identified.
The reconstructability of the relation from its fragments ensures that constraints defined
on the data in the form of dependencies are preserved.
3. Disjointness
If a relation R is horizontally decomposed into fragments FR = {R1 , R2 , ..., Rn } and a data
item di is in Rj , then it is not in any other fragment Rk (k ≠ j). This criterion ensures that
the horizontal fragments are disjoint. If a relation is vertically decomposed, its primary
key attributes are typically repeated in all its fragments, so disjointness is defined only
on the nonprimary key attributes.
Allocation
If the database is properly fragmented, the fragments can be stored on different sites on
the network, and the data can either be replicated or maintained as a single copy. Repli-
cation helps to ensure reliability and the efficiency of read-only queries, as it increases
the chances of accessible data even in the event of system failures. This also allows for
parallel execution of read-only queries that access the same data items, as multiple copies
exist on multiple sites. On the other hand, the execution of update queries causes trouble
since the system has to ensure that all copies of the data are updated properly. Hence,
the decision regarding replication is a trade-off that depends on the ratio of the read-only
queries to the update queries.
A non-replicated database (commonly called a partitioned database) contains fragments
that are allocated to sites, and there is only one copy of any fragment on the network. In
case of replication, either the database exists in its entirety at each site (fully replicated
database), or fragments are distributed to the sites in such a way that copies of a fragment
may reside in multiple sites (partially replicated database). In the latter case, the number
of copies of a fragment may be an input to the allocation algorithm or a decision variable
whose value is determined by the algorithm.
5.4 conclusion
In this chapter, we discussed the concept of distributed databases, which are designed
to manage data across multiple physical locations and provide a more scalable and fault-
tolerant data management solution. We explored the key concepts and technologies in-
volved in distributed data processing, including distributed data systems, distributed
DBMS architecture, and distributed data design.
We highlighted the benefits of distributed databases, including improved scalability,
increased availability, and better performance for large-scale data processing. We also
discussed the challenges of designing and implementing distributed databases, such as
data consistency, network latency, and system complexity.
We examined the different types of distributed database architectures, such as client-
server, peer-to-peer, and hybrid models, and how they can be used to balance perfor-
mance, scalability, and fault tolerance. We also discussed the importance of distributed
data design.
BIBLIOGRAPHY
[2] Burleson, D. K. (1998). Inside the Database Object Model. CRC Press.
[3] Codd, E. F. (1970). A relational model of data for large shared data banks. Commun.
ACM, 13(6):377–387.
[5] Date, C. J. (2012). Database Design and Relational Theory: Normal Forms and All That Jazz.
O’Reilly Media.
[6] Delis, A. and Tsotras, V. J. (2009). Indexed Sequential Access Method, pages 1435–1438.
Springer US, Boston, MA.
[7] Dietrich, S. W. and Urban, S. D. (2011). Fundamentals of Object Databases: Object-
Oriented and Object-Relational Design. Morgan & Claypool, 1st edition.
[11] Kirsten, W., Ihringer, M., Schulte, P., and Röhrig, B. (2001). Object-Oriented Applica-
tion Development Using the Caché Postrelational Database. Springer Berlin Heidelberg.
[13] Melton, J. and Buxton, S. (2006). Querying XML: XQuery, XPath, and SQL/XML in
Context. Morgan Kaufmann, Amsterdam.
[14] Pratt, P. J. (2014). Concepts of Database Management. Cengage Learning, 8th edition.
[15] Rumbaugh, J. (2004). The Unified Modeling Language Reference Manual (2nd Edition)
(The Addison-Wesley Object Technology Series). Addison-Wesley Professional.
[16] Singha, N. P. and Gupta, C. (2014). Relational Database Management Systems. Ab-
hishek Publications.
[17] Strickland, R. (2016). Cassandra 3.x High Availability - Second Edition. Packt Publish-
ing.