
Wolaita Sodo University

School of Informatics
Department of Information Technology
2015 E.C- National Exit Exam Module
Theme Name: Database System and Information
Management

Prepared By: Biruk S. (MSc.)

Reviewed By: Alemayehu Dereje (MSc.)


Nuhamin Nigussie (MSc.)

Wolaita Sodo, Ethiopia


March, 2023
Fundamentals of Database Systems

Table of Contents
CHAPTER ONE
1. Introduction
1.1 Database System
1.2 Data Handling approaches
1.2.1 Manual Data Handling approach
1.2.2 Traditional File Based Data Handling approach
1.2.3 Database Data Handling approach
1.3 Roles in Database Design and Development
1.3.1 Database Designer
1.3.2 Database Administrator
1.3.3 Application Developers
1.3.4 End-Users
1.4 The ANSI/SPARC and Database Architecture
1.5 Types of Database Systems
1.5.1 Client-Server Database System
1.5.2 Parallel Database System
1.5.3 Distributed Database System
1.6 Database Management System (DBMS)
1.6.1 Components and Interfaces of DBMS
1.6.2 Functions of DBMS
1.7 Data models and conceptual models
1.7.1 Record-based Data Models
1.7.2 Database Languages
CHAPTER TWO
2. Relational Data Model
2.1 Introduction
2.1 Properties of Relational Databases
2.2 Building Blocks of the Relational Data Model
2.2.1 The ENTITIES
2.2.2 The ATTRIBUTES
2.2.3 The RELATIONSHIPS
2.2.4 Key constraints
2.2.5 Integrity, Referential Integrity and Foreign Keys Constraints
CHAPTER THREE
3. Conceptual Database Design- E-R Modeling
3.1 Database Development Life Cycle
3.2 Basic concepts of E-R model
3.3 Developing an E-R Diagram
3.4 Graphical Representations in Entity Relationship Diagram
3.4.1 Conceptual ER diagram symbols
3.5 Problem with E-R models
CHAPTER FOUR
4. Logical Database Design
4.1 Introduction
4.2 Normalization
4.3 Process of normalization (1NF, 2NF, 3NF)
CHAPTER FIVE
5. Physical Database Design
5.1 Conceptual, Logical, and Physical Data Models
5.2 Physical Database Design Process
5.1.1 Overview of the Physical Database Design Methodology
CHAPTER SIX
6. Query Languages
6.1 Relational Algebra
6.1.1 Unary Operations
6.1.2 Set Operations
6.1.3 Aggregation and Grouping Operations
6.2 Relational Calculus
6.2.1 Tuple Relational Calculus
6.3 Structured Query Languages (SQL)
6.3.1 Introduction to SQL
6.3.2 Writing SQL Commands
6.3.3 SQL Data Definition and Data Types
6.3.4 Basic Queries in SQL
CHAPTER SEVEN
7. Advanced Database Concepts
7.1 Integrity and Security
7.1.1 Levels of Security Measures
7.1.2 Countermeasures: Computer Based Controls
7.2 Distributed Database Systems
7.3 Data Warehousing and Data Mining
7.3.1 Data Warehousing
7.3.2 Data Mining
CHAPTER ONE
1. Introduction

A database is a shared collection of logically related data and its description, designed to meet the
information needs of an organization. We now examine the definition of a database so that
you can understand the concept fully. The database is a single, possibly large repository of
data that can be used simultaneously by many departments and users. Instead of disconnected
files with redundant data, all data items are integrated with a minimum amount of
duplication. The database is no longer owned by one department but is a shared corporate
resource. The database holds not only the organization‘s operational data, but also a
description of this data. For this reason, a database is also defined as a self-describing
collection of integrated records. The description of the data is known as the system catalog
(or data dictionary or metadata—the "data about data"). It is the self-describing nature of a
database that provides program–data independence.
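For instance, most relational DBMSs expose the system catalog through ordinary queries. The following is a minimal sketch using the standard INFORMATION_SCHEMA views (supported, with minor variations, by PostgreSQL, MySQL, and SQL Server); the table name staff is a hypothetical example:

    -- Ask the DBMS to describe its own data: list the columns,
    -- data types, and nullability it records for a table "staff".
    SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_name = 'staff';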

The approach taken with database systems, where the definition of data is separated from the
application programs, is similar to the approach taken in modern software development,
where an internal definition of an object and a separate external definition are provided. The
users of an object see only the external definition and are unaware of how the object is
defined and how it functions. One advantage of this approach, known as data abstraction, is
that we can change the internal definition of an object without affecting the users of the
object, provided that the external definition remains the same. In the same way, the database
approach separates the structure of the data from the application programs and stores it in the
database. If new data structures are added or existing structures are modified, then the
application programs are unaffected, provided that they do not directly depend upon what has
been modified. For example, if we add a new field to a record or create a new file, existing
applications are unaffected. However, if we remove a field from a file that an application
program uses, then that application program is affected by this change and must be modified
accordingly.
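In SQL terms, this behaviour might be sketched as follows (the table and column names are illustrative assumptions). A query that names only the columns it needs keeps working when a new column is added, but breaks if a column it depends on is removed:

    -- Existing application query: names only the columns it needs.
    SELECT staff_no, name, salary FROM staff;

    -- Adding a new field leaves the query above unaffected.
    ALTER TABLE staff ADD COLUMN mobile_phone VARCHAR(20);

    -- Removing a field the query depends on breaks that application.
    ALTER TABLE staff DROP COLUMN salary;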
Another expression in the definition of a database that we should explain is "logically
related". When we analyze the information needs of an organization, we attempt to identify
entities, attributes, and relationships. An entity is a distinct object (a person, place, thing,
concept, or event) in the organization that is to be represented in the database. An attribute is
a property that describes some aspect of the object that we wish to record, and a relationship
is an association between entities.

Databases and database systems are an essential component of life in modern society: most of
us encounter several activities every day that involve some interaction with a database. For
example, if we go to the bank to deposit or withdraw funds, if we make a hotel or airline
reservation, if we access a computerized library catalog to search for a bibliographic item, or
if we purchase something online—such as a book, toy, or computer—chances are that our
activities will involve someone or some computer program accessing a database. Even
purchasing items at a supermarket often automatically updates the database that holds the
inventory of grocery items.

These interactions are examples of what we may call traditional database applications, in
which most of the information that is stored and accessed is either textual or numeric. In the
past few years, advances in technology have led to exciting new applications of database
systems. The proliferation of social media Web sites, such as Facebook, Twitter,
and Flickr, among many others, has required the creation of huge databases that store
nontraditional data, such as posts, tweets, images, and video clips. New types of database
systems, often referred to as big data storage systems, or NOSQL systems, have been created
to manage data for social media applications. These types of systems are also used by
companies such as Google, Amazon, and Yahoo, to manage the data required in their Web
search engines, as well as to provide cloud storage, whereby users are provided with storage
capabilities on the Web for managing all types of data including documents, programs,
images, videos, and emails.



We now mention some other applications of databases. The wide availability of photo and
video technology on cell phones and other devices has made it possible to store images, audio
clips, and video streams digitally. These types of files are becoming an important component
of multimedia databases. Geographic information systems (GISs) can store and analyze
maps, weather data, and satellite images. Data warehouses and online analytical processing
(OLAP) systems are used in many companies to extract and analyze useful business
information from very large databases to support decision making. Real-time and active
database technology is used to control industrial and manufacturing processes. And database
search techniques are being applied to the World Wide Web to improve the search for
information that is needed by users browsing the Internet.
Databases and database technology have had a major impact on the growing use of
computers. It is fair to say that databases play a critical role in almost all areas where
computers are used, including business, electronic commerce, social media, engineering,
medicine, genetics, law, education, and library science. The word database is so commonly
used that we must begin by defining what a database is. Our initial definition is quite general.

A database is a collection of related data. By data, we mean known facts that can be recorded
and that have implicit meaning. For example, consider the names, telephone numbers, and
addresses of the people you know. Nowadays, this data is typically stored in mobile phones,
which have their own simple database software. This data can also be recorded in an indexed
address book or stored on a hard drive, using a personal computer and software such as
Microsoft Access or Excel. This collection of related data with an implicit meaning is a
database.

The preceding definition of database is quite general; for example, we may consider the
collection of words that make up this page of text to be related data and hence to constitute a
database. However, the common use of the term database is usually more restricted. A
database has the following implicit properties:
• A database represents some aspect of the real world, sometimes called the mini-world
or the universe of discourse (UoD).
• Changes to the mini-world are reflected in the database.
• A database is a logically coherent collection of data with some inherent meaning. A
random assortment of data cannot correctly be referred to as a database.
• A database is designed, built, and populated with data for a specific purpose. It has an
intended group of users and some preconceived applications in which these users are
interested.

The database is now such an integral part of our day-to-day life that often we are not aware
that we are using one. To start our discussion of databases, in this section we examine some
applications of database systems. For the purposes of this discussion, we consider a database
to be a collection of related data and a database management system (DBMS) to be the
software that manages and controls access to the database. A database application is simply a
program that interacts with the database at some point in its execution. We also use the more
inclusive term database system as a collection of application programs that interact with the
database along with the DBMS and the database itself.

1.1 Database System


An organization must have accurate and reliable data for effective decision making. To this end,
the organization maintains records on its various facets of operation and the relationships among them.
Such related data are called a database. A database system is an integrated collection of related
files, along with details of the interpretation of the data contained therein. Basically, a database
system is nothing more than a computer-based record-keeping system, i.e., a system whose overall
purpose is to record and maintain information/data. A database management system (DBMS) is a
software system that allows access to data contained in a database. The objective of the DBMS is
to provide a convenient and effective method of defining, storing and retrieving the information
contained in the database.

The DBMS interfaces with the application programs, so that the data contained in the database can
be used by multiple applications and users. In addition, the DBMS exerts centralized control of
the database, prevents fraudulent or unauthorized users from accessing the data, and ensures the
privacy of the data. Generally, a database is an organized collection of related information. The
organized information or database serves as a base from which desired information can be retrieved



or decisions made by further organizing or processing the data. People use several databases in
their day-to-day life. Dictionaries, telephone directories, library catalogs, etc. are examples of
databases where the entries are arranged in alphabetical or classified order.

The term ‘DATA’ can be defined as the value of an attribute of an entity. Any collection of related
data items of entities having the same attributes may be referred to as a ‘DATABASE’. Mere
collection of data does not make it a database; the way it is organized for effective and efficient
use makes it a database. Database technology has been described as "one of the most rapidly
growing areas of computer and information science". It emerged in the late 1960s as a result of
a combination of various circumstances. There was a growing demand among users for more
information to be provided by the computer relating to the day-to-day running of the organization
as well as information for planning and control purposes.

The technology that emerged to process data of various kinds is broadly termed 'DATABASE
MANAGEMENT TECHNOLOGY', and the resulting software is known as a 'DATABASE
MANAGEMENT SYSTEM' (DBMS), which manages a computer-stored database or
collection of data.

An entity may be concrete, such as a person or a book, or abstract, such as a loan, a holiday, or a
concept. Entities are the basic units of objects which can have concrete existence or constitute
ideas or concepts. An entity set is a set of entities of the same type that share the same properties
or attributes. An entity is represented by a set of attributes. An attribute is also referred to as a data item,
data element, data field, etc. Attributes are descriptive properties possessed by each member of an
entity set. A grouping of related entities becomes an entity set.

For example, in a library environment:

Entity Set: Catalogue

Entities: books, journals, AV materials, etc.

Attributes: ISBN, title, author, publisher, etc.
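Expressed as a relational table, the catalogue example might be sketched in SQL as follows (all names and types are illustrative assumptions):

    -- One entity set (the catalogue): each row is one entity (a book,
    -- journal, or AV material) and each column is an attribute.
    CREATE TABLE catalogue (
        isbn      VARCHAR(13) PRIMARY KEY,   -- attribute: ISBN
        title     VARCHAR(200) NOT NULL,     -- attribute: title
        author    VARCHAR(100),              -- attribute: author
        publisher VARCHAR(100),              -- attribute: publisher
        item_type VARCHAR(20)                -- book, journal, AV material
    );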

The word 'DATA' means a fact, or more specifically, a value of an attribute of an entity. An entity, in
general, may be an object, idea, event, condition, or situation. A set of attributes describes an entity.
Information in a form that can be processed by a computer is called data. Data are the raw
material of information. The term 'BASE' means the support, foundation, or key ingredient of
anything; the base therefore supports the data. A 'DATABASE' can be conceived as a system whose
base, whose key concept, is simply a particular way of handling data. In other words, a database
is nothing more than a computer-based record-keeping system.

The objective of a database is to record and maintain information. The primary function of the
database is the service and support of information systems in a cost-effective manner. In short, "A
database is an organized collection of related information stored with minimum redundancy, in a
manner that makes it accessible for multiple applications".

1.2 Data Handling approaches

1.2.1 Manual Data Handling approach


In the manual approach, data storage and retrieval follow the primitive and traditional way of
information handling, where cards and paper are used for the purpose. Data storage and
retrieval are performed using human labor.

• Files for as many events and objects as the organization has are used to store information.
• Each of the files containing various kinds of information is labelled and stored in one or more
cabinets.
• The cabinets could be kept in safe places for security purposes, based on the sensitivity of the
information contained in them.
• Insertion and retrieval are done by searching first for the right cabinet, then for the right file,
and then for the information.
• One could have an indexing system to facilitate access to the data.

Limitations of the Manual approach



• Prone to error
• Difficult to update, retrieve, and integrate
• You have the data but it is difficult to compile the information
• Limited to small amounts of information
• Cross-referencing is difficult

An alternative approach to data handling is a computerized way of dealing with the information.
The computerized approach could be either decentralized or centralized, based on where the
data resides in the system.

1.2.2 Traditional File Based Data Handling approach


After the introduction of the computer for data processing to the business community, the need to
use the device for data storage and processing increased. There were, and still are, several
computer applications with file-based processing used for the purpose of data handling. Even
though the approach evolved over time, the basic structure is still similar, if not identical.

• File-based systems were an early attempt to computerize the manual filing system.
• This approach is the decentralized computerized data handling method.
• A collection of application programs performs services for the end-users. In such systems, every
application program that provides service to end-users defines and manages its own data.
• Such systems have a number of programs for each of the different applications in the organization.
• Since every application defines and manages its own data, the system is subject to serious data
duplication problems.
• A file, in the traditional file-based approach, is a collection of records containing logically related
data.



Limitations of the Traditional File Based approach

As business applications became more complex, demanding more flexible and reliable data
handling methods, the shortcomings of the file-based system became evident. These
shortcomings include, but are not limited to:

• Separation or isolation of data: available information in one application may not be known to
others; data synchronisation is done manually.
• Limited data sharing: every application maintains its own data.
• Lengthy development and maintenance time.
• Duplication or redundancy of data (cost in money and time, and loss of data integrity).
• Data dependency on the application: the data structure is embedded in the application; hence, a
change in the data structure requires changing the application as well.
• Incompatible file formats or data structures (e.g., C and COBOL) between different
applications and programs, creating inconsistency and difficulty in processing them jointly.
• Fixed query processing, defined during application development.



The limitations of the traditional file-based data handling approach arise from two basic reasons:

1. The definition of the data is embedded in the application program, which makes it difficult to modify
the database definition easily.
2. There is no control over the access and manipulation of the data beyond that imposed by the application
programs.

The most significant problems experienced in the traditional file-based approach to data handling
can be formalized as what are called "update anomalies". There are three types of update
anomalies:

1. Modification anomalies: a problem experienced when one or more data values are modified in
one application program but not in others containing the same data set.
2. Deletion anomalies: a problem encountered when a record set is deleted from one
application but remains untouched in other application programs.
3. Insertion anomalies: a problem experienced whenever there is a new data item to be recorded and
the recording is not made in all the applications; and when the same data item is inserted in different
applications, there could be errors in encoding that make the new data item be treated as
a totally different object.
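A small sketch makes the modification anomaly concrete. Suppose, hypothetically, that two applications each keep their own copy of a customer's address; an update applied through one application leaves the other copy stale:

    -- Two hypothetical tables standing in for separate application files,
    -- each storing the same customer address redundantly.
    CREATE TABLE billing_customers  (cust_id INT, address VARCHAR(100));
    CREATE TABLE shipping_customers (cust_id INT, address VARCHAR(100));

    INSERT INTO billing_customers  VALUES (1, '12 Old Street');
    INSERT INTO shipping_customers VALUES (1, '12 Old Street');

    -- Modification anomaly: only one copy is updated, so the two
    -- applications now disagree about customer 1's address.
    UPDATE billing_customers SET address = '99 New Avenue' WHERE cust_id = 1;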

1.2.3 Database Data Handling approach


Following a famous paper written by Dr. Edgar Frank Codd in 1970, database systems changed
significantly. Codd proposed that database systems should present the user with a view of data
organized as tables called relations. Behind the scenes, there might be a complex data structure
that allowed rapid response to a variety of queries. But, unlike the user of earlier database
systems, the user of a relational system would not be concerned with the storage structure.
Queries could be expressed in a very high-level language, which greatly increased the efficiency
of database programmers. The database approach emphasizes the integration and sharing of data
throughout the organization.
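The practical effect of Codd's proposal is that a query states what data is wanted, not how to find it. A minimal sketch, with hypothetical table and column names:

    -- The programmer names the result, not the access path: the DBMS
    -- decides whether to scan, use an index, or process in parallel.
    SELECT name, salary
    FROM staff
    WHERE branch_no = 'B003'
    ORDER BY salary DESC;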

Thus in Database Approach:



• A database is just a computerized record-keeping system, a kind of electronic filing cabinet.
• A database is a repository for a collection of computerized data files.
• A database is a shared collection of logically related data, and a description of that data, designed to meet
the information needs of an organization. Since it is a shared corporate resource, the database is
integrated with a minimum amount of, or no, duplication.
• A database is a collection of logically related data, where these logically related data comprise the
entities, attributes, relationships, and business rules of an organization's information.
• In addition to containing the data required by an organization, a database also contains a description of
the data, known as "metadata", "data dictionary", "systems catalogue", "data about data", or
sometimes "data directory".
• Since a database contains information about the data (metadata), it is called a self-describing
collection of integrated records.
• The purpose of a database is to store information and to allow users to retrieve and update that
information on demand.
• A database is designed once and used simultaneously by many users.
• Unlike the traditional file-based approach, in the database approach there is program–data
independence, that is, the separation of the data definition from the application. Thus the
application is not affected by changes made in the data structure and file organization.
• Each database application will perform some combination of creating the database and reading,
updating, and deleting data, as sketched in SQL below.
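A minimal sketch of those four operations, using a hypothetical product table:

    -- Create: define and populate a table.
    CREATE TABLE product (
        product_id INT PRIMARY KEY,
        name       VARCHAR(50),
        price      DECIMAL(8,2)
    );
    INSERT INTO product VALUES (1, 'Notebook', 49.99);

    -- Read: retrieve rows that satisfy a condition.
    SELECT name, price FROM product WHERE price < 100;

    -- Update: change existing data.
    UPDATE product SET price = 44.99 WHERE product_id = 1;

    -- Delete: remove data.
    DELETE FROM product WHERE product_id = 1;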

Benefits of the database approach

• Data can be shared: two or more users can access and use the same data, instead of the data being
stored redundantly for each user.
• Improved accessibility of data: by using structured query languages, users can easily access
data without programming experience.
• Redundancy can be reduced: isolated data are integrated in the database to decrease the redundant
data stored in different applications.
• Quality data can be maintained: the different integrity constraints in the database approach
maintain data quality, leading to better decision making.
• Inconsistency can be avoided: controlled data redundancy avoids inconsistency of the data
in the database to some extent.
• Transaction support can be provided: the basic demands of any transaction support system are
implemented in a full-scale DBMS.
• Integrity can be maintained: data in different applications are integrated together with
additional constraints to facilitate the validity and consistency of the shared data resource.
• Security measures can be enforced: the shared data can be secured by having different levels
of clearance and other data security mechanisms (see the GRANT sketch after this list).
• Improved decision support: the database provides information useful for decision making.
• Standards can be enforced: the different ways of using and dealing with data by different units
of an organization can be balanced and standardized by using the database approach.
• Compactness: since it is an electronic data handling method, the data is stored compactly (no
voluminous papers).
• Speed: data storage and retrieval are fast, as they use modern, fast computer systems.
• Less labour: unlike the other data handling methods, data maintenance does not demand many
resources.
• Centralized information control: since the relevant data of the organization are stored in one
repository, they can be controlled and managed at the central level.
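As noted in the security bullet above, clearance levels can be expressed as SQL privileges. A minimal sketch, assuming a DBMS with SQL-standard roles (for example, PostgreSQL) and hypothetical role and table names:

    -- Different clearance levels as database roles and privileges.
    CREATE ROLE clerk;
    CREATE ROLE manager;

    GRANT SELECT ON staff TO clerk;            -- read-only clearance
    GRANT SELECT, UPDATE ON staff TO manager;  -- read-and-modify clearance
    REVOKE UPDATE ON staff FROM manager;       -- clearance can be withdrawn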



Limitations and risk of Database Approach

• Introduction of new professional and specialized personnel
• Complexity in designing and managing data
• The cost and risk during conversion from the old to the new system
• High cost incurred to develop and maintain the system
• Complex backup and recovery services from the users' perspective
• Reduced performance due to centralization and data independence
• High impact on the system when failure occurs to the central system



1.3 Roles in Database Design and Development
As people are one of the components in the DBMS environment, there is a group of roles played by
different stakeholders in the design and operation of a database system.

1.3.1 Database Designer


Database designers are responsible for identifying the data to be stored in the database and for
choosing appropriate structures to represent and store this data. These tasks are mostly undertaken
before the database is actually implemented and populated with data. It is the responsibility of
database designers to communicate with all prospective database users in order to understand their
requirements and to create a design that meets these requirements. In many cases, the designers
are on the staff of the DBA and may be assigned other staff responsibilities after the database
design is completed.

Database designers typically interact with each potential group of users and develop views of the
database that meet the data and processing requirements of these groups. Each view is then
analyzed and integrated with the views of other user groups. The final database design must be
capable of supporting the requirements of all user groups.

In large database design projects, we can distinguish between two types of
designer: logical database designers and physical database designers. The logical database
designer is concerned with identifying the data (that is, the entities and attributes), the relationships
between the data, and the constraints on the data that is to be stored in the database.

The logical database designer must have a thorough and complete understanding of the
organization‘s data and any constraints on this data (the constraints are sometimes called business
rules). These constraints describe the main characteristics of the data as viewed by the
organization. Examples of constraints for DreamHome are:

• a member of staff cannot manage more than 100 properties for rent or sale at the same time;
• a member of staff cannot handle the sale or rent of his or her own property;
• a solicitor cannot act for both the buyer and seller of a property.
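Rules such as these are often enforced inside the DBMS itself. The following is a minimal sketch of the first rule, assuming PostgreSQL-style triggers and hypothetical DreamHome table and column names:

    -- Reject an assignment that would push a staff member past
    -- 100 managed properties.
    CREATE FUNCTION enforce_property_limit() RETURNS trigger AS $$
    BEGIN
        IF (SELECT COUNT(*) FROM property_for_rent
            WHERE staff_no = NEW.staff_no) >= 100 THEN
            RAISE EXCEPTION 'Staff % already manages 100 properties',
                            NEW.staff_no;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER property_limit
    BEFORE INSERT ON property_for_rent
    FOR EACH ROW EXECUTE FUNCTION enforce_property_limit();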



To be effective, the logical database designer must involve all prospective database users in the
development of the data model, and this involvement should begin as early in the process as
possible. In this book, we split the work of the logical database designer into two stages:

• Conceptual database design, which is independent of implementation details such as the target
DBMS, application programs, programming languages, or any other physical considerations;
• Logical database design, which targets a specific data model, such as relational, network,
hierarchical, or object-oriented.

The physical database designer decides how the logical database design is to be physically
realized. This involves:

• Mapping the logical database design into a set of tables and integrity constraints;
• Selecting specific storage structures and access methods for the data to achieve good
performance;
• Designing any security measures required on the data.

Many parts of physical database design are highly dependent on the target DBMS, and there may
be more than one way of implementing a mechanism. Consequently, the physical database
designer must be fully aware of the functionality of the target DBMS and must understand the
advantages and disadvantages of each alternative implementation. The physical database
designer must be capable of selecting a suitable storage strategy that takes account of usage.
Whereas conceptual and logical database designs are concerned with the what, physical database
design is concerned with the how. It requires different skills, which are often found in different
people.

1.3.2 Database Administrator

• Responsible for overseeing, controlling, and managing the database resources (the database itself, the
DBMS, and other related software)
• Authorizing access to the database
• Coordinating and monitoring the use of the database



• Responsible for determining and acquiring hardware and software resources
• Accountable for problems like poor security or poor performance of the system
• Involved in all steps of database development
• We can have further classifications of this role in big organizations having huge amounts of data
and user requirements:
o Data Administrator (DA): responsible for the management of data resources. This involves
database planning, development, and maintenance of standards, policies, and procedures at the
conceptual and logical design phases.
o Database Administrator (DBA): a more technically oriented role. The DBA is responsible for
the physical realization of the database and is involved in the physical design, implementation,
security, and integrity control of the database.

In any organization where many people use the same resources, there is a need for a chief
administrator to oversee and manage these resources. In a database environment, the primary
resource is the database itself, and the secondary resource is the DBMS and related software.
Administering these resources is the responsibility of the database administrator (DBA). The
DBA is responsible for authorizing access to the database, coordinating and monitoring its use,
and acquiring software and hardware resources as needed. The DBA is accountable for problems
such as security breaches and poor system response time. In large organizations, the DBA is
assisted by a staff that carries out these functions.

1.3.3 Application Developers


Once the database has been implemented, the application programs that provide the required
functionality for the end-users must be implemented. This is the responsibility of the application
developers. Typically, the application developers work from a specification produced by systems
analysts. Each program contains statements that request the DBMS to perform some operation on
the database, which includes retrieving data, inserting, updating, and deleting data. The programs
may be written in a third-generation or fourth-generation programming language, as discussed
previously.



1.3.4 End-Users
End users are the people whose jobs require access to the database for querying, updating, and
generating reports; the database primarily exists for their use. There are several categories of end
users:

• Casual end users occasionally access the database, but they may need different information
each time. They use a sophisticated database query interface to specify their requests and are
typically middle- or high-level managers or other occasional browsers.
• Naive or parametric end users make up a sizable portion of database end users. Their main job
function revolves around constantly querying and updating the database, using standard types of
queries and updates, called canned transactions, that have been carefully programmed and
tested. Many of these tasks are now available as mobile apps for use with mobile devices. The
tasks that such users perform are varied. A few examples are:
o Bank customers and tellers check account balances and post withdrawals and deposits.
o Reservation agents or customers for airlines, hotels, and car rental companies check availability
for a given request and make reservations.
o Employees at receiving stations for shipping companies enter package identifications via bar
codes and descriptive information through buttons to update a central database of received and
in-transit packages.
o Social media users post and read items on social media Web sites.
• Sophisticated end users include engineers, scientists, business analysts, and others who
thoroughly familiarize themselves with the facilities of the DBMS in order to implement their
own applications to meet their complex requirements.
• Standalone users maintain personal databases by using ready-made program packages that
provide easy-to-use menu-based or graphics-based interfaces. An example is the user of a
financial software package that stores a variety of personal financial data.

A typical DBMS provides multiple facilities to access a database. Naive end users need to learn
very little about the facilities provided by the DBMS; they simply have to understand the user
interfaces of the mobile apps or standard transactions designed and implemented for their use.



Casual users learn only a few facilities that they may use repeatedly. Sophisticated users try to
learn most of the DBMS facilities in order to achieve their complex requirements.

Standalone users typically become very proficient in using a specific software package.

1.4 The ANSI/SPARC and Database Architecture


An early proposal for a standard terminology and general architecture for database systems was
produced in 1971 by the Data Base Task Group (DBTG) appointed by the Conference on Data
Systems Languages (CODASYL). The DBTG recognized the need for a two-level approach
with a system view called the schema and user views called subschemas. The American
National Standards Institute (ANSI) Standards Planning and Requirements Committee
(SPARC), or ANSI/X3/SPARC, produced a similar terminology and architecture in 1975
(ANSI, 1975). The ANSI-SPARC architecture recognized the need for a three-level approach
with a system catalog. These proposals reflected those published by the IBM user organizations
Guide and Share some years previously, and concentrated on the need for an implementation-
independent layer to isolate programs from underlying representational issues (Guide/Share,
1970). Although the ANSI-SPARC model did not become a standard, it still provides a basis for
understanding some of the functionality of a DBMS.

The purpose and origin of the Three-Level database architecture

• All users should be able to access the same data. This is important since the database has a shared
data feature: all the data is stored in one location, and all users will have their own customized
way of interacting with the data.
• A user's view is unaffected by, or immune to, changes made in other views. Since the requirements of
one user are independent of another's, a change made in one user's view should not affect other
users.
• Users should not need to know physical database storage details. As there are naïve users of the
system, hardware-level or physical details should be a black box for such users.



• The DBA should be able to change database storage structures without affecting the users' views. A
change in file organization or access method should not affect the structure of the data, which in turn
will have no effect on the users.
• The internal structure of the database should be unaffected by changes to physical aspects of storage, such
as a change of hard disk.
• The DBA should be able to change the conceptual structure of the database without affecting all users. In any
database system, the DBA has the privilege to change the structure of the database, like
adding tables, adding and deleting an attribute, or changing the specification of the objects in the
database.

The American National Standards Institute (ANSI) Standards Planning and Requirements
Committee (SPARC) recognized the need for a three-level approach with a system catalog. The
levels form a three-level architecture comprising an external, a conceptual, and an internal level.
The way users perceive the data is called the external level. The way the DBMS and the operating
system perceive the data is the internal level, where the data is actually stored using data
structures and file organizations.

A. External level: The users' view of the database. This level describes the part of the database that
is relevant to each user. The external level consists of a number of different external views of the
database. Each user has a view of the 'real world' represented in a form that is familiar to that
user. The external view includes only those entities, attributes, and relationships in the 'real world'
that the user is interested in. Other entities, attributes, or relationships that are not of interest may
be represented in the database, but the user will be unaware of them.

In addition, different views may have different representations of the same data. For example, one
user may view dates in the form (day, month, year), while another may view dates as (year, month,
day). Some views might include derived or calculated data: data not actually stored in the database
as such, but created when needed. For example, in the DreamHome case study, we may wish to
view the age of a member of staff. However, it is unlikely that ages would be stored, as this data
would have to be updated daily. Instead, the member of staff's date of birth would be stored and
the age would be calculated by the DBMS when it is referenced. Views may even include data
combined or derived from several entities.
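A derived field such as age is typically exposed through a view, so that it is computed each time it is referenced rather than stored. A minimal sketch, assuming PostgreSQL date functions and a hypothetical staff table with a date_of_birth column:

    -- External view: age is calculated on reference, never stored.
    CREATE VIEW staff_with_age AS
    SELECT staff_no,
           name,
           EXTRACT(YEAR FROM AGE(CURRENT_DATE, date_of_birth)) AS age
    FROM staff;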

B. Conceptual level: The community view of the database. This level describes what data is stored
in the database and the relationships among the data. The middle level in the three-level architecture
is the conceptual level. This level contains the logical structure of the entire database as seen by
the DBA. It is a complete view of the data requirements of the organization that is independent of
any storage considerations. The conceptual level represents:
o all entities, their attributes, and their relationships;
o the constraints on the data;
o semantic information about the data;
o security and integrity information.

The conceptual level supports each external view, in that any data available to a user must be
contained in or derivable from, the conceptual level. However, this level must not contain any
storage-dependent details. For instance, the description of an entity should contain only data types
of attributes (for example, integer, real, character) and their length (such as the maximum number
of digits or characters), but not any storage considerations, such as the number of bytes occupied.
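A conceptual-level description in SQL terms might therefore record only this much (hypothetical names; how many bytes a row occupies, and where it is placed, is left entirely to the internal level):

    -- Logical data types and lengths only; no storage details.
    CREATE TABLE staff (
        staff_no CHAR(5),       -- exactly 5 characters
        name     VARCHAR(50),   -- at most 50 characters
        salary   DECIMAL(9,2)   -- up to 9 digits, 2 after the point
    );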

C. Internal level: The physical representation of the database on the computer. This level describes
how the data is stored in the database. The internal level covers the physical implementation of the
database to achieve optimal runtime performance and storage space utilization. It covers the data
structures and file organizations used to store data on storage devices. It interfaces with the
operating system access methods (file management techniques for storing and retrieving data
records) to place the data on the storage devices, build the indexes, retrieve the data, and so on.
The internal level is concerned with such things as:
o storage space allocation for data and indexes;
o record descriptions for storage (with stored sizes for data items);
o record placement;
o data compression and data encryption techniques.



All of the above functionality, and much more, is possible due to the three-level ANSI-SPARC
architecture.

1.5 Types of Database Systems
1.5.1 Client-Server Database System
Client/server is developed to deal with various computing environments that have a large number
of computers and servers connected together via a network. In this architecture, a Client is a user
machine which provides the user interface and local processing capabilities.

When any client requires additional functionality, such as database access, it can connect to a Server
that is capable of providing the functionality needed by the client. Basically, a Server is a machine
that provides services to the Client, i.e., the user machine.



Advantages of client server database system

• A Client/Server system uses less expensive platforms to support applications that had previously
been running only on large and expensive mini or mainframe computers.
• Clients offer icon-based, menu-driven interfaces, which are superior to the traditional command-line,
dumb-terminal interfaces typical of mini and mainframe computer systems.
• A Client/Server environment facilitates more productive work by the users and better
use of existing data.
• A Client/Server database system is more flexible than a centralized system.
• Response time and throughput are high.
• The server (database) machine can be custom-built (tailored) to the DBMS function and thus can
provide better DBMS performance.
• The client (application database) might be a personal workstation, tailored to the needs of the
end users and thus able to provide better interfaces, high availability, faster responses, and overall
improved ease of use to the user. A single database (on the server) can be shared across several
distinct client (application) systems.

Disadvantages of Client/Server Database System

• Programming cost is high in client/server environments, particularly in initial phases.
• There is a lack of management tools for diagnosis, performance monitoring and tuning, and security
control for the DBMS, client, and operating systems and networking environments.

1.5.2 Parallel Database System


Parallel database system architecture consists of multiple Central Processing Units (CPUs) and
data storage disks in parallel. Hence, they improve processing and Input/Output (I/O) speeds.
Parallel database systems are used in applications that have to query extremely large databases
or that have to process an extremely large number of transactions per second.

Advantages of a Parallel Database System

• Parallel database systems are very useful for applications that have to query extremely large
databases (of the order of terabytes, i.e., 10^12 bytes) or that have to process an extremely
large number of transactions per second (of the order of thousands of transactions per second).
• In a parallel database system, the throughput (the number of tasks that can be completed in
a given time interval) and the response time (the amount of time it takes to complete a
single task from the time it is submitted) are very high.

Disadvantages of a Parallel Database System

• In a parallel database system, there is a startup cost associated with initiating a single process,
and the startup time may overshadow the actual processing time, affecting speedup adversely.
• Since processes executing in a parallel system often access shared resources, a slowdown may
result from the interference of each new process as it competes with existing processes for
commonly held resources, such as shared data storage disks, the system bus, and so on.

1.5.3 Distributed Database System


A logically interrelated collection of shared data physically distributed over a computer network
is called a distributed database, and the software system that permits the management of the
distributed database and makes the distribution transparent to users is called a Distributed
DBMS. It consists of a single logical database that is split into a number of fragments. Each
fragment is stored on one or more computers under the control of a separate DBMS, with the
computers connected by a communications network. In a distributed database system,
data is spread across a variety of different databases. These are managed by a variety of different
DBMS software packages running on a variety of different operating systems. These machines are
spread (or distributed) geographically and connected together by a variety of communication
networks.


Figure 2: Distributed DBMS Architecture

Advantages of Distributed Database System

• Distributed database architecture provides greater efficiency and better performance.
• A single database (on a server) can be shared across several distinct client (application) systems.
• As data volumes and transaction rates increase, users can grow the system incrementally.
• It causes less impact on ongoing operations when adding new locations.
• A distributed database system provides local autonomy.

Disadvantages of Distributed Database System

 Recovery from failure is more complex in distributed database systems than in centralized
systems.

1.6 Database Management System (DBMS)
Database Management System (DBMS) is a Software package used for providing EFFICIENT,
CONVENIENT and SAFE MULTI-USER (many people/programs accessing same database, or
even same data, simultaneously) storage of and access to MASSIVE amounts of PERSISTENT
(data outlives programs that operate on it) data. A DBMS also provides a systematic method for
creating, updating, storing, and retrieving data in a database. A DBMS also provides the services of
controlling data access, enforcing data integrity, managing concurrency control, and recovery.

1.6.1 Components and Interfaces of DBMS


We can identify five major components in the DBMS environment: hardware, software, data,
procedures, and people.

Hardware: The DBMS and the applications require hardware to run. The hardware can range
from a single personal computer to a single mainframe or a network of computers.

The particular hardware depends on the organization‘s requirements and the DBMS used. Some
DBMSs run only on particular hardware or operating systems, while others run on a wide variety
of hardware and operating systems. A DBMS requires a minimum amount of main memory and
disk space to run, but this minimum configuration may not necessarily give acceptable
performance. A simplified hardware configuration for DreamHome is illustrated in the figure. It
consists of a network of small servers, with a central server located in London running the backend
of the DBMS, that is, the part of the DBMS that manages and controls access to the database. It
also shows several computers at various locations running the frontend of the DBMS, that is, the
part of the DBMS that interfaces with the user. This is called a client–server architecture.

Software: The software component comprises the DBMS software itself and the application
programs, together with the operating system, including network software if the DBMS is being
used over a network. Typically, application programs are written in a third-generation
programming language (3GL), such as C, C++, C#, Java, Visual Basic, COBOL, Fortran, Ada, or
Pascal, or a fourth-generation language (4GL), such as SQL, embedded in a third-generation
language. The target DBMS may have its own fourth-generation tools that allow rapid development
of applications through the provision of nonprocedural query languages, report generators, forms
generators, graphics generators, and application generators. The use of fourth-generation tools can
improve productivity significantly and produce programs that are easier to maintain.

Data: Perhaps the most important component of the DBMS environment (certainly from the end-
users' point of view) is the data. The data acts as a bridge between the machine
components and the human components. The database contains both the operational data and the
metadata, the "data about data". The structure of the database is called the schema.

Procedures refer to the instructions and rules that govern the design and use of the database. The
users of the system and the staff who manage the database require documented procedures on how
to use or run the system. These may consist of instructions on how to:

 Log on to the DBMS.


 Use a particular DBMS facility or application program.
 Start and stop the DBMS.
 Make backup copies of the database.
 Handle hardware or software failures. This may include procedures on how to identify the failed
component, how to fix the failed component (for example, telephone the appropriate hardware
engineer), and, following the repair of the fault, how to recover the database.
 Change the structure of a table, reorganize the database across multiple disks, improve
performance, or archive data to secondary storage.

People: The final component is the people involved with the system. We can identify four distinct
types of people who participate in the DBMS environment: data and database administrators,
database designers, application developers, and end-users. This component is composed of the
people in the organization that are responsible for, or play a role in, designing, implementing,
managing, administering and using the resources in the database. It ranges from people with a high
level of knowledge about the database and the design technology to those with no knowledge of
the system except using the data in the database.

1.6.2 Functions of DBMS


In this section we look at the types of functions and services we would expect a DBMS to provide.
The following lists the different services that should be provided by any full-scale DBMS.

a. Data storage, retrieval, and update: DBMS must furnish users with the ability to store,
retrieve, and update data in the database.
b. A user-accessible catalog: A DBMS must furnish a catalog in which descriptions of data items
are stored and which is accessible to users. A key feature of the ANSI-SPARC architecture is the
recognition of an integrated system catalog to hold data about the schemas, users, applications,
and so on. The catalog is expected to be accessible to users as well as to the DBMS. A system
catalog, or data dictionary, is a repository of information describing the data in the database: it is
the 'data about the data' or metadata. The amount of information and the way the information is
used vary with the DBMS. Typically, the system catalog stores:
1. names, types, and sizes of data items;
2. names of relationships;
3. integrity constraints on the data;
4. names of authorized users who have access to the data.
c. Transaction support: A DBMS must furnish a mechanism which will ensure either that all the
updates corresponding to a given transaction are made or that none of them is made. A
transaction is a series of actions, carried out by a single user or application program, which
accesses or changes the contents of the database.
d. Concurrency control services: A DBMS must furnish a mechanism to ensure
that the database is updated correctly when multiple users are updating the database concurrently.
One major objective in using a DBMS is to enable many users to access shared data concurrently.
Concurrent access is relatively easy if all users are only reading data, as there is no way that they
can interfere with one another. However, when two or more users are accessing the database
simultaneously and at least one of them is updating data, there may be interference that can result
in inconsistencies. The DBMS must ensure that, when multiple users are accessing the database,
interference cannot occur.

e. Recovery services: A DBMS must furnish a mechanism for recovering the database in the event
that the database is damaged in any way.
f. Authorization services: A DBMS must furnish a mechanism to ensure that only authorized users
can access the database.

1.7 Data models and conceptual models


A specific DBMS has its own specific Data Definition Language to define a database schema, but
this type of language is too low level to describe the data requirements of an organization in a way
that is readily understandable by a variety of users. We need a higher-level language. Such a
higher-level description of the database schema is called a data model.

Data Model: a set of concepts to describe the structure of a database, and certain constraints that
the database should obey.

A data model is a description of the way that data is stored in a database. A data model helps to
understand the relationship between entities and to create the most effective structure to hold
data.

Data Model is a collection of tools or concepts for describing

 Data
 Data relationships
 Data semantics
 Data constraints

The main purpose of a data model is to represent the data in an understandable way. Categories
of data models include:

 Object-based
 Record-based
 Physical

1.7.1 Record-based Data Models

Record-based data models consist of a number of fixed-format records. Each record type defines a
fixed number of fields; each field is typically of a fixed length.

 Hierarchical Data Model


 Network Data Model
 Relational Data Model
 Hierarchical Model
o The simplest data model
o Record type is referred to as node or segment
o The top node is the root node
o Nodes are arranged in a hierarchical structure as a sort of upside-down tree
o A parent node can have more than one child node
o A child node can only have one parent node
o The relationship between parent and child is one-to-many
o Relation is established by creating physical link between stored records (each is stored with a
predefined access path to other records)
o To add new record type or relationship, the database must be redefined and then stored in a new
form.

ADVANTAGES of Hierarchical Data Model:

 Hierarchical Model is simple to construct and operate on


o Corresponds to a number of natural hierarchically organized domains – e.g., assemblies in
manufacturing, personnel organization in companies
o Language is simple; uses constructs like GET, GET UNIQUE, GET NEXT, GET NEXT
WITHIN PARENT etc.

DISADVANTAGES of Hierarchical Data Model:

 Navigational and procedural nature of processing


o Database is visualized as a linear arrangement of records
o Little scope for “query optimization”
 Network Model
o Allows record types to have more than one parent, unlike the hierarchical model
o A network data model sees records as set members
o Each set has an owner and one or more members
o Does not allow direct many-to-many relationships between entities
o Like the hierarchical model, the network model is a collection of physically linked records
o Allows member records to have more than one owner

ADVANTAGES of Network Data Model:

 Network Model is able to model complex relationships and represents semantics of add/delete on
the relationships.
o Can handle most situations for modeling using record types and relationship types.
o Language is navigational; uses constructs like FIND, FIND member, FIND owner, FIND NEXT
within set, GET etc. Programmers can do optimal navigation through the database.

DISADVANTAGES of Network Data Model:

 Navigational and procedural nature of processing


o Database contains a complex array of pointers that thread through a set of records.
o Little scope for automated "query optimization"
 Relational Data Model
o Developed by Dr. Edgar Frank Codd in 1970 (famous paper, 'A Relational Model of Data for
Large Shared Data Banks')
o Terminologies originate from the branch of mathematics called set theory and predicate logic,
and it is based on the mathematical concept called Relation
o Can define more flexible and complex relationships
o Viewed as a collection of tables called "Relations", equivalent to a collection of record types
o Relation: two-dimensional table
o Stores information or data in the form of tables, as rows and columns

o A row of the table is called a tuple, equivalent to a record
o A column of a table is called an attribute, equivalent to a field
o A data value is the value of the Attribute
o Records are related by the data stored jointly in the fields of records in two tables or files. The
related tables contain information that creates the relation; the tables seem to be independent
but are related somehow
o No physical consideration of the storage is required by the user
o Many tables are merged together to come up with a new virtual view of the relationship

 The rows represent records (collections of information about separate items)


o The columns represent fields (particular attributes of a record)
o Conducts searches by using data in specified columns of one table to find additional data in
another table
o In conducting searches, a relational database matches information from a field in one table with
information in a corresponding field of another table to produce a third table that combines
requested data from both tables
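As a minimal sketch of this matching process (the Employee and Department tables and their
shared Dept_Num column here are hypothetical, for illustration only), such a search is expressed
in SQL as a join:

    -- match the Dept_Num field of Employee with the Dept_Num field of Department
    -- to produce a third table combining requested data from both
    SELECT Employee.Name, Department.Dept_Name
    FROM Employee
    JOIN Department ON Employee.Dept_Num = Department.Dept_Num;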

Many data models have been proposed, which we can categorize according to the types of concepts
they use to describe the database structure. High-level or conceptual data models provide
concepts that are close to the way many users perceive data, whereas low-level or physical data
models provide concepts that describe the details of how data is stored on the computer storage
media, typically magnetic disks. Concepts provided by low-level data models are generally meant
for computer specialists, not for end users. Between these two extremes is a class of
representational (or implementation) data models, which provide concepts that may be easily
understood by end users but that are not too far removed from the way data is organized in
computer storage. Representational data models hide many details of data storage on disk but can
be implemented on a computer system directly.

Conceptual data models use concepts such as entities, attributes, and relationships. An
entity represents a real-world object or concept, such as an employee or a project from the mini-
world that is described in the database. An attribute represents some property of interest that
further describes an entity, such as the employee‘s name or salary. A relationship among two or
more entities represents an association among the entities, for example, a works-on relationship
between an employee and a project.

1.7.2 Database Languages


A database language consists of four parts: a Data Definition Language (DDL), Data
Manipulation Language (DML), Transaction control language (TCL) and Data control
language (DCL). The DDL is used to specify the database schema and the DML is used to both
read and update the database. These languages are called data sublanguages because they do not
include constructs for all computing needs such as conditional or iterative statements, which are
provided by the high-level programming languages. Many DBMSs have a facility for embedding
the sublanguage in a high-level programming language such as COBOL, FORTRAN, Pascal, Ada,
C, C++, Java, or Visual Basic. In this case, the high-level language is sometimes referred to
as the host language. To compile the embedded file, the commands in the data sublanguage are
first removed from the host-language program and replaced by function calls. The pre-processed
file is then compiled, placed in an object module, linked with a DBMS-specific library containing
the replaced functions, and executed when required. Most data sublanguages also provide non-
embedded, or interactive, commands that can be input directly from a terminal.

A. The Data Definition Language (DDL)

It is a language that allows the DBA or user to describe and name the entities, attributes, and
relationships required for the application, together with any associated integrity and security
constraints.

The database schema is specified by a set of definitions expressed by means of a special language
called a Data Definition Language. The DDL is used to define a schema or to modify an existing
one. It cannot be used to manipulate data.

The result of the compilation of the DDL statements is a set of tables stored in special files
collectively called the system catalog. The system catalog integrates the metadata, that is, data that
describes objects in the database and makes it easier for those objects to be accessed or
manipulated. The metadata contains definitions of records, data items, and other objects that are
of interest to users or are required by the DBMS. The DBMS normally consults the system catalog
before the actual data is accessed in the database. The terms data dictionary and data directory are
also used to describe the system catalog, although the term 'data dictionary' usually refers to a
more general software system than a catalog for a DBMS. The DDL defines the database structure
or schema and specifies additional properties or constraints on the data; the database system
checks these constraints every time the database is updated.

Example: CREATE: create object in database

ALTER: alter the structure of database

DROP: deletes object from database

RENAME: rename the object
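A minimal sketch of these DDL statements in SQL (the table and column names are hypothetical,
and the RENAME syntax varies between DBMSs):

    CREATE TABLE Student (                  -- create an object in the database
        studentId   CHAR(10),
        studentName CHAR(50)
    );
    ALTER TABLE Student ADD DOB DATE;       -- alter the structure of the database
    ALTER TABLE Student RENAME TO Learner;  -- rename the object (DBMS-dependent syntax)
    DROP TABLE Learner;                     -- delete the object from the database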

B. The Data Manipulation Language (DML)

It is a language that provides a set of operations to support the basic data manipulation operations
on the data held in the databases. Data manipulation operations usually include the following:

 insertion of new data into the database;
 modification of data stored in the database;
 retrieval of data contained in the database;
 deletion of data from the database.

Therefore, one of the main functions of the DBMS is to support a data manipulation language in
which the user can construct statements that will cause such data manipulation to occur. Data
manipulation applies to the external, conceptual, and internal levels. However, at the internal
level we must define rather complex low-level procedures that allow efficient data access. In
contrast, at higher levels, emphasis is placed on ease of use and effort is directed at providing
efficient user interaction with the system.
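For illustration, the basic manipulation operations listed above map to the following SQL
statements (the table and values are hypothetical):

    INSERT INTO Student (studentId, studentName) VALUES ('S001', 'Abebe');  -- insertion
    UPDATE Student SET studentName = 'Abebe K.' WHERE studentId = 'S001';   -- modification
    SELECT studentId, studentName FROM Student;                             -- retrieval
    DELETE FROM Student WHERE studentId = 'S001';                           -- deletion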

C. Transaction Control Language (TCL)

It is used to manage transactions in the database and the changes made by data manipulation
language statements.

Transaction: a logical unit of work which consists of one or more operations.

Example: COMMIT: used to permanently save any transaction into the database.

ROLLBACK: restore the database to the last committed state.
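A sketch of how these statements wrap a unit of work (the Account table is hypothetical, and the
statement that starts a transaction varies by DBMS):

    BEGIN;                                                           -- start a transaction (DBMS-dependent)
    UPDATE Account SET balance = balance - 100 WHERE accNo = 'A1';
    UPDATE Account SET balance = balance + 100 WHERE accNo = 'A2';
    COMMIT;                                                          -- permanently save both updates
    -- on error, ROLLBACK; would restore the last committed state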

D. Data Control Language (DCL)

It is used to control access to data stored in the database (authorization).

Example: GRANT: allows specified users to perform specified tasks

REVOKE: cancels previously granted or denied permissions
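For example (the user name and table are hypothetical):

    GRANT SELECT, INSERT ON Student TO clerk1;  -- allow clerk1 to read and insert rows
    REVOKE INSERT ON Student FROM clerk1;       -- cancel the previously granted permission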

The part of a DML that involves data retrieval is called a query language. A query language can
be defined as a high-level, special-purpose language used to satisfy diverse requests for the
retrieval of data held in the database. The term 'query' is therefore reserved to denote a retrieval
statement expressed in a query language; a query specifies what data to retrieve rather than how
to retrieve it.

CHAPTER TWO
2. Relational Data Model
2.1 Introduction

Relation: a table with rows and columns

Attribute: a named column of a relation

In the relational model, relations are used to hold information about the objects to be represented
in the database. A relation is represented as a two-dimensional table in which the rows of the
table correspond to individual records and the table columns correspond to attributes. Attributes
can appear in any order and the relation will still be the same relation, and therefore convey the
same meaning.For example, the information on branch offices is represented by
the Branch relation, with columns for attributes branchNo (the branch number), street, city,
and postcode. Similarly, the information on staff is represented by the Staff relation, with
columns for attributes staffNo (the staff number), fName, lName, position, sex, DOB (date of
birth), salary, and branchNo (the number of the branch the staff member works at).

Domain: a set of allowable values for one or more attributes

Domains are an extremely powerful feature of the relational model. Every attribute in a relation
is defined on a domain. Domains may be distinct for each attribute, or two or more attributes
may be defined on the same domain. The domain concept is important because it allows the user
to define in a central place the meaning and source of values that attributes can hold. As a result,
more information is available to the system when it undertakes the execution of a relational
operation, and operations that are semantically incorrect can be avoided. For example, it is not
sensible to compare a street name with a telephone number, even though the domain definitions
for both these attributes are character strings. On the other hand, the monthly rental on a property
and the number of months a property has been leased have different domains (the first a
monetary value, the second an integer value), but it is still a legal operation to multiply two
values from these domains.

Tuple: a row of a relation

The elements of a relation are the rows or tuples in the table. In the Branch relation, each row
contains four values, one for each attribute. Tuples can appear in any order and the relation will
still be the same relation, and therefore convey the same meaning. The structure of a relation,
together with a specification of the domains and any other restrictions on possible values, is
sometimes called its intension, which is usually fixed unless the meaning of a relation is changed
to include additional attributes. The tuples are called the extension (or state) of a relation, which
changes over time.

Degree: the degree of a relation is the number of attributes it contains (unary relation, binary
relation, ternary relation, n-ary relation).

The Branch relation in the above Figure has four attributes or degree four. This means that each
row of the table is a four-tuple, containing four values. A relation with only one attribute would
have degree one and be called a unary relation or one-tuple. A relation with two attributes is
called binary, one with three attributes is called ternary, and after that the term n-ary is usually
used. The degree of a relation is a property of the intension of the relation.

Cardinality: the cardinality of a relation is the number of tuples it contains. By contrast with the
degree, the cardinality changes as tuples are added or deleted. The cardinality is a property of the
extension of the relation and is determined from the particular instance of the relation at any given
moment. Finally, we have the definition of a relational database.

Relational Database: a collection of normalized relations with distinct relation names. A
relational database consists of relations that are appropriately structured.

Relation Schema: a named relation defined by a set of attribute-domain name pairs. Let A1,
A2, ..., An be attributes with domains D1, D2, ..., Dn. Then the set {A1:D1, A2:D2, ..., An:Dn}
is a relation schema. A relation R, defined by a relation schema S, is a set of mappings from
attribute names to their corresponding domains. Thus a relation is a set of n-tuples of the form
(A1:d1, A2:d2, ..., An:dn) where d1 ∈ D1, d2 ∈ D2, ..., dn ∈ Dn.

Example: Student(studentId char(10), studentName char(50), DOB date) is a relation schema for
the student entity in SQL.
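In SQL, this relation schema can be declared directly (a transcription of the schema above):

    CREATE TABLE Student (
        studentId   CHAR(10),
        studentName CHAR(50),
        DOB         DATE
    );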

Relational Database Schema: a set of relation schemas, each with a distinct name. Suppose
R1, R2, ..., Rn is the set of relation schemas in a relational database; then the relational database
schema (R) can be stated as: R = {R1, R2, ..., Rn}.

2.2 Properties of Relational Databases

 A relation has a name that is distinct from all other relation names in the relational schema.
 Each tuple in a relation must be unique.
 All tables are LOGICAL ENTITIES.
 Each cell of a relation contains exactly one atomic (single) value.
 Each column (field or attribute) has a distinct name.
 The values of an attribute are all from the same domain.
 A table is either a BASE TABLE (Named Relation) or a VIEW (Unnamed Relation).
 Only Base Tables are physically stored.
 VIEWS are derived from BASE TABLES with SQL statements like: [SELECT .. FROM ..
WHERE .. ORDER BY]
 A relational database is a collection of tables:
o Each entity is in one table
o Attributes are fields (columns) in the table
 The order of rows and columns is theoretically immaterial (though in practice row order can
have an impact on performance).
 Entries with repeating groups are said to be un-normalized. All values in a column represent the
same attribute and have the same data format.

2.3 Building Blocks of the Relational Data Model


The building blocks of the relational data model are:

 Entities: real world physical or logical object


 Attributes: properties used to describe each Entity or real world object.
 Relationship: the association between Entities
 Constraints: rules that should be obeyed while manipulating the data.

2.3.1 The ENTITIES
The entities are persons, places, things, etc. which the organization has to deal with; relations can
also describe relationships. The name given to an entity should always be a singular noun
descriptive of each item to be stored in it.

Example: student NOT students.

Every relation has a schema, which describes the columns or fields; the relation itself
corresponds to our familiar notion of a table. A relation is a collection of tuples, each of which
contains values for a fixed number of attributes.

 Existence Dependency: the dependence of an entity on the existence of one or more entities.
 Weak entity : an entity that cannot exist without the entity with which it has a relationship – it is indicated
by a double rectangle

2.3.2 The ATTRIBUTES


The attributes are the items of information which characterize and describe these entities.
Attributes are pieces of information ABOUT entities. The analysis must of course identify those
which are actually relevant to the proposed application. Attributes will give rise to recorded items
of data in the database.

At this level we need to know such things as:

 Attribute name (explanatory words or phrases)
 The domain from which attribute values are taken (a DOMAIN is a set of values from which
attribute values may be taken). Each attribute has values taken from a domain. For example, the
domain of Name is string and that for Salary is real. However, these are not shown on E-R models.

 Whether the attribute is part of the entity identifier (attributes which just describe an entity and those
which help to identify it uniquely)
 Whether it is permanent or time-varying (which attributes may change their values over time)
 Whether it is required or optional for the entity (whose values will sometimes be unknown or
irrelevant)

Types of Attributes

1. Simple (atomic) Vs Composite attributes

 Simple : contains a single value (not divided into sub parts)

Example: Age, gender

 Composite: divided into sub parts (composed of other attributes). Example: Name, address

2. Single-valued Vs Multi-valued attributes
o Single-valued: have only a single value (the value may change but has only one value at one time)

Example: Name, Sex, Id. No. color_of_eyes

 Multi-Valued: have more than one value

Example: Address, dependent-name

Person may have several college degrees

3. Stored Vs Derived attributes


o Stored : not possible to derive or compute

Example: Name, Address

 Derived: The value may be derived (computed) from the values of other attributes.

Example: Age (current year – year of birth)

Length of employment (current date - start date), Profit (earnings - cost), G.P.A. (grade points /
credit hours). Such derived values are typically computed in queries, as in the SQL sketch after
this list.

4. Null Values
o NULL applies to attributes which are not applicable or which do not have values.
o You may enter the value NA (meaning not applicable).
o The value of a key attribute cannot be null.
o Default value: an assumed value if no explicit value is entered.
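As referenced in the list above, a derived attribute such as Age is usually computed in a query
rather than stored; a minimal sketch (assuming the Student table above with its DOB column;
date functions vary by DBMS):

    SELECT studentName,
           EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM DOB) AS age  -- approximate age
    FROM Student;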

Entity versus Attributes

When designing the conceptual specification of the database, one should pay attention to the
distinction between an Entity and an Attribute.

 Consider designing a database of employees for an organization:
 Should address be an attribute of Employees or an entity (connected to Employees by a relationship)?
o If we have several addresses per employee, address must be an entity (relational attributes cannot
be set-valued/multi-valued).
o If the structure (city, Woreda, Kebele, etc.) is important, e.g. to retrieve employees in a given city,
address must be modeled as an entity (attribute values are atomic).

2.3.3 The RELATIONSHIPS


The Relationships between entities which exist and must be taken into account when processing
information. In any business processing one object may be associated with another object due to
some event. Such kind of association is what we call a RELATIONSHIP between entity objects.

 One external event or process may affect several related entities.


 Related entities require setting of LINKS from one part of the database to another.

 A relationship should be named by a word or phrase which explains its function
 Role names are different from the names of entities forming the relationship: one entity may take
on many roles, the same role may be played by different entities
 For each RELATIONSHIP, one can talk about the number of entities and the number of tuples
participating in the association. These two concepts are called the DEGREE and the
CARDINALITY of a relationship, respectively.

2.3.3.1 Degree of a Relationship

 An important point about a relationship is how many entities participate in it. The number of entities
participating in a relationship is called the DEGREE of the relationship. Among the Degrees of
relationship, the following are the basic:
 UNARY/RECURSIVE RELATIONSHIP: tuples/records of a single entity are related with each
other.
 BINARY RELATIONSHIPS: Tuples/records of two entities are associated in a relationship
 TERNARY RELATIONSHIP: Tuples/records of three different entities are associated
 And a generalized one:

o N-ARY RELATIONSHIP: tuples from an arbitrary number of entity sets participate in a
relationship.

2.3.3.2 Cardinality of a Relationship


Another important concept about relationship is the number of instances/tuples that can be
associated with a single instance from one entity in a single relationship. The number of
instances participating or associated with a single instance from an entity in a relationship is
called the CARDINALITY of the relationship. The major cardinalities of a relationship are:

 ONE-TO-ONE: one tuple is associated with only one other tuple.


o Example: Building – Location: a single building will be located in a single location, and a single
location will accommodate only a single building.
 ONE-TO-MANY, one tuple can be associated with many other tuples, but not the reverse.

o Example: Department-Student as one department can have multiple students.
 MANY-TO-ONE, many tuples are associated with one tuple but not the reverse.
o Example: Employee – Department: as many employees belong to a single department.
 MANY-TO-MANY: one tuple is associated with many other tuples and from the other side, with a
different role name one tuple will be associated with many tuples
o Example: Student – Course as a student can take many courses and a single course can be attended by
many students.
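As an illustration of how a many-to-many cardinality such as Student – Course is realized in the
relational model (the table and column names are hypothetical, assuming studentId and courseId
are the primary keys of their tables), a third linking table holds one row per association:

    CREATE TABLE Enrollment (
        studentId CHAR(10) REFERENCES Student(studentId),
        courseId  CHAR(10) REFERENCES Course(courseId),
        PRIMARY KEY (studentId, courseId)   -- one row per student-course pair
    );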

However, the degree and cardinality of a relation are different from the degree and cardinality of
a relationship.

2.3.4 Key Constraints


If tuples need to be unique in the database, then we need to make each tuple distinct. To do this
we need relational keys that uniquely identify each record.

1. Super Key: an attribute/set of attributes that uniquely identify a tuple within a relation.
2. Candidate Key: a super key such that no proper subset of that collection is a Super Key within the
relation.

A candidate key has two properties:

1. Uniqueness
2. Irreducibility

 If a super key is having only one attribute, it is automatically a Candidate key.


 If a candidate key consists of more than one attribute it is called Composite Key.

3. Primary Key: the candidate key that is selected to identify tuples uniquely within the relation.

 The entire set of attributes in a relation can be considered as a primary key in the worst case.

4. Foreign Key: an attribute, or set of attributes, within one relation that matches the candidate key of
some (possibly the same) relation. A foreign key is a link between different relations used to create a
view or an unnamed relation.

2.3.5 Integrity, Referential Integrity and Foreign Key Constraints


The entity integrity constraint states that no primary key value can be NULL. This is because the
primary key value is used to identify individual tuples in a relation. Having NULL values for the
primary key implies that we cannot identify some tuples. For example, if two or more tuples had
NULL for their primary keys, we may not be able to distinguish them if we try to reference them
from other relations.

Key constraints and entity integrity constraints are specified on individual relations.

The referential integrity constraint is specified between two relations and is used to maintain the
consistency among tuples in the two relations. Informally, the referential integrity constraint states
that a tuple in one relation that refers to another relation must refer to an existing tuple in that
relation. For example, the attribute Dept_Num of EMPLOYEE gives the department number for
which each employee works; hence, its value in every EMPLOYEE tuple must match the
Dept_Num value of some tuple in the DEPARTMENT relation.

To define referential integrity more formally, first we define the concept of a foreign key. The
conditions for a foreign key, given below, specify a referential integrity constraint between the two
relation schemas R1 and R2.

A set of attributes FK (foreign key) in relation schema R1 is a foreign key of R1 that references
relation R2 if it satisfies the following rules:

 Rule 1: The attributes in FK have the same domain(s) as the primary key attributes PK of R2; the
attributes FK are said to reference or refer to the relation R2.
 Rule 2: A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value

of PK for some tuple t2 in the current state r2(R2) or is NULL. In the former case, we have t1[FK]
= t2[PK], and we say that the tuple t1 references or refers to the tuple t2.

 In this definition, R1 is called the referencing relation and R2 is the referenced relation. If these
two conditions hold, a referential integrity constraint from R1 to R2 is said to hold. In a database
of many relations, there are usually many referential integrity constraints.

To specify these constraints, first we must have a clear understanding of the meaning or roles that
each attribute or set of attributes plays in the various relation schemas of the database.

Referential integrity constraints typically arise from the relationships among the entities
represented by the relation schemas.

For example, consider the EMPLOYEE and DEPARTMENT relations. In the EMPLOYEE relation,
the attribute Dept_Num refers to the department for which an employee works; hence, we designate
Dept_Num to be a foreign key of EMPLOYEE referencing the DEPARTMENT relation. This means
that a value of Dept_Num in any tuple t1 of the EMPLOYEE relation must match the Dept_Num
value of some tuple in the DEPARTMENT relation, or the value of Dept_Num can be NULL if the
employee does not belong to a department or will be assigned to a department later. For example,
the tuple for employee 'John Smith' references the tuple for the 'Research' department, indicating
that 'John Smith' works for this department.

Notice that a foreign key can refer to its own relation. For example, the attribute Super_ssn in
EMPLOYEE refers to the supervisor of an employee; this is another employee, represented by a
tuple in the EMPLOYEE relation. Hence, Super_ssn is a foreign key that references the
EMPLOYEE relation itself. The tuple for employee 'John Smith' references the tuple for
employee 'Franklin Wong', indicating that 'Franklin Wong' is the supervisor of 'John Smith'.
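A self-referencing foreign key such as Super_ssn can be declared directly in SQL (a sketch; the
column sizes are assumptions):

    CREATE TABLE EMPLOYEE (
        Ssn       CHAR(9) PRIMARY KEY,
        Name      VARCHAR(50),
        Super_ssn CHAR(9),
        FOREIGN KEY (Super_ssn) REFERENCES EMPLOYEE(Ssn)  -- refers to its own relation
    );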

We can diagrammatically display referential integrity constraints by drawing a directed arc from
each foreign key to the relation it references. For clarity, the arrowhead may point to the primary
key of the referenced relation with the referential integrity constraints displayed in this manner.

All integrity constraints should be specified on the relational database schema (i.e., defined as part
of its definition) if we want to enforce these constraints on the database states. Hence, the DDL
includes provisions for specifying the various types of constraints so that the DBMS can
automatically enforce them. Most relational DBMSs support key, entity integrity, and referential
integrity constraints. These constraints are specified as a part of data definition in the DDL.

The following constraints are specified as a part of data definition using DDL:

 Domain Integrity: No value of the attribute should be beyond the allowable limits
 Entity Integrity: In a base relation, no attribute of a Primary Key can assume a value of NULL
 Referential Integrity: If a Foreign Key exists in a relation, either the Foreign Key value must
match a Candidate Key value in its home relation or the Foreign Key value must be NULL
 Enterprise Integrity: Additional rules specified by the users or database administrators of a
database are incorporated
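A minimal DDL sketch combining these constraint types (the names follow the
EMPLOYEE/DEPARTMENT example; the salary rule stands in for a hypothetical enterprise
constraint):

    CREATE TABLE DEPARTMENT (
        Dept_Num INT PRIMARY KEY                     -- entity integrity: PK cannot be NULL
    );
    CREATE TABLE EMPLOYEE (
        Ssn      CHAR(9) PRIMARY KEY,                -- entity integrity
        Salary   DECIMAL(10,2) CHECK (Salary >= 0),  -- domain/enterprise integrity
        Dept_Num INT,                                -- may be NULL
        FOREIGN KEY (Dept_Num) REFERENCES DEPARTMENT(Dept_Num)  -- referential integrity
    );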

CHAPTER THREE
3. Conceptual Database Design - E-R Modeling

3.1 Database Development Life Cycle


As it is one component in most information system development tasks, there are several steps in
designing a database system. Here more emphasis is given to the design phases of the system
development life cycle. The major steps in database design are:

1. Planning: identifying the information gap in an organization and proposing a database solution
to solve the problem.
2. Analysis: concentrates more on fact finding about the problem or the opportunity. Feasibility
analysis, requirement determination and structuring, and selection of the best design method are also
performed at this phase.
3. Design: in database development more emphasis is given to this phase. The phase is further divided
into three sub-phases.
a. Conceptual Design: a concise description of the data, the data types, the relationships between
data and the constraints on the data.
o There is no implementation or physical detail consideration.
o Used to elicit and structure all information requirements.
b. Logical Design: a higher level conceptual abstraction with a selected specific data model to
implement the data structure.
o It is independent of a particular DBMS and involves no other physical considerations.
c. Physical Design: the physical implementation of the logical design of the database with respect to
internal storage and file structure of the database for the selected DBMS.
o Used to develop all technology and organizational specifications.
4. Implementation: the testing and deployment of the designed database for use.
5. Operation and Support: administering and maintaining the operation of the database system and
providing support to users. Tuning the database operations for best performance.

3.1.1 Conceptual Database Design

 Conceptual design revolves around discovering and analyzing organizational and user data
requirements
 The important activities are to identify

Entities
Attributes
Relationships
Constraints

 And based on these identified components then develop the ER model using ER diagrams

3.2 Basic concepts of E-R model

 Entity-Relationship modeling is used to represent conceptual view of the database


 The main components of ER Modeling are:

Entities

o Corresponds to entire table, not row


o Represented by Rectangle

Attributes

o Represents the property used to describe an entity or a relationship


o Represented by Oval

Relationships

o Represents the association that exist between entities


o Represented by Diamond

Constraints

o Represent the constraint in the data


o Cardinality and Participation Constraints

Before working on the conceptual design of the database, one has to know and answer the
following basic questions.

 What are the entities and relationships in the enterprise?


 What information about these entities and relationships should we store in the database?

 What are the integrity constraints that hold? Constraints on the data with respect to update,
retrieval and storage.
 Represent this information pictorially in ER diagrams, then map ER diagram into a relational schema.

3.3 Developing an E-R Diagram


Designing a conceptual model for the database is not a linear process but an iterative activity
where the design is refined again and again. To identify the entities, attributes, relationships, and
constraints on the data, there are different set of methods used during the analysis phase. These
include information gathered by:

 Interviewing end users individually and in a group


 Questionnaire survey
 Direct observation
 Examining different documents

Analysis of requirements gathered:

 Nouns → prospective entities
 Adjectives → prospective attributes
 Verbs/verb phrases → prospective relationships

The basic E-R model is graphically depicted and presented for review. The process is repeated
until the end users and designers agree that the E-R diagram is a fair representation of the
organization's activities and functions.

Checking for Redundant Relationships in the ER Diagram

Relationships between entities indicate access from one entity to another; it is therefore possible
to access one entity occurrence from another entity occurrence even if there are other entities and
relationships that separate them. This is often referred to as 'navigation' of the ER diagram. The
last phase in ER modeling is validating an ER Model against the requirements of the user.

3.4 Graphical Representations in Entity Relationship Diagram
3.4.1 Conceptual ER diagram symbols
3.4.1.1 Entity
Entities are represented by means of rectangles. Rectangles are named with the entity set they
represent.

Entities are objects or concepts that represent important data. Entities are typically nouns such as
product, customer, location, or promotion. There are three types of entities commonly used in
entity relationship diagrams.

3.4.1.2 Attributes
Attributes are the properties of entities. Attributes are represented by means of ellipses. An
attribute is a property, trait, or characteristic of an entity, relationship, or another attribute.
Attributes are represented by OVALS and are connected to the entity by a line.

Types of Attributes

Multivalued attribute: multivalued attributes are those that can take on more than one value.
Such attributes are depicted by a double ellipse.

Derived Attribute: derived attributes are those which are derived based on other attributes; for
example, age can be derived from date of birth. To represent a derived attribute, a dotted
ellipse is used.

3.4.1.4 Degree of Relationship
The number of participating entities in a relationship defines the degree of the
relationship. There are three categories of degree of relationship:

1. Binary = degree 2
2. Ternary = degree 3
3. n-ary = n degree

3.4.1.5 Binary Relationship and Cardinality


A relationship where two entities are participating is called a binary relationship. Cardinality is
the number of instances of one entity that can be associated with a single instance of the related
entity.

Cardinality refers to the maximum number of times an instance in one entity can relate to instances
of another entity.

Ordinality, on the other hand, is the minimum number of times an instance in one entity can be
associated with an instance in the related entity.

1. One-to-one: When only one instance of an entity is associated with the relationship, it is marked as ‘1:1’.
The following image reflects that only one instance of each entity should be associated with the
relationship. It depicts a one-to-one relationship.

Participation of DEPARTMENT in the "Manages" relationship with EMPLOYEE is total, since
every department should have a manager.

3.5 Problem with E-R models
The Entity-Relationship Model is a conceptual data model that views the real world as consisting
of entities and relationships. The model visually represents these concepts by the Entity-
Relationship diagram. The basic constructs of the ER model are entities, relationships, and
attributes. Entities are concepts, real or abstract, about which information is collected.

Relationships are associations between the entities. Attributes are properties which describe the
entities.

While designing the ER model one could face a design problem called a connection
trap. Connection traps are problems arising from misinterpreting certain relationships.

There are two types of connection traps:

1. Fan trap:

Occurs where a model represents a relationship between entity types, but the pathway between
certain entity occurrences is ambiguous. May exist where two or more one-to-many (1:M)
relationships fan out from an entity. The problem could be avoided by restructuring the model so
that there would be no 1:M relationships fanning out from a single entity and all the semantics of
the relationship is preserved.

Example on Fan trap

Problem: Which car (Car1, Car3, or Car5) is used by employee Emp6 working in branch Br1?
From this ER model one cannot tell which car is used by which staff member, since a branch can
have more than one car and a branch is also populated by more than one employee. Thus we need
to restructure the model to avoid the connection trap.

To avoid the Fan Trap problem we can go for restructuring of the E-R Model. This will result in
the following E-R Model.

Semantic description of the problem:

3.6 Enhanced Entity Relationship (EER) Model

 EER is important when we have a relationship between two entities and the participation is
partial between entity occurrences. In such cases EER is used to reduce the complexity in
participation and relationship complexity.
 ER diagrams consider entity types to be primitive objects; EER diagrams allow refinements
within the structures of entity types.
 EER is a diagrammatic technique for displaying the following concepts:
o Sub Class and Super Class
o Specialization and Generalization
o Union or Category
o Aggregation
 Features of the EER Model:
o EER creates a design more accurate to database schemas.
o It reflects the data properties and constraints more precisely.
o It includes all modeling concepts of the ER model.
o The diagrammatic technique helps in displaying the EER schema.
o It includes the concepts of specialization and generalization.
o It is used to represent a collection of objects that is the union of objects of different entity types.

 Generalization

o Generalization occurs when two or more entities represent categories of the same real-world
object.
o Generalization is the process of defining a more general entity type from a set of more
specialized entity types.
o A generalization hierarchy is a form of abstraction that specifies that two or more entities that
share common attributes can be generalized into a higher level entity type.
o Is considered as bottom-up definition of entities.
o Generalization hierarchy depicts relationship between higher level superclass and lower level
subclass.

Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a supertype
of another. The level of nesting is limited only by the constraint of simplicity.

Example: Account is a generalized form for Saving and Current Accounts.

 Specialization

o Is the result of subset of a higher level entity set to form a lower level entity set.
o The specialized entities will have additional set of attributes (distinguishing characteristics) that
distinguish them from the generalized entity.
o Is considered as Top-Down definition of entities.
o Specialization process is the inverse of the Generalization process. Identify the distinguishing
features of some entity occurrences, and specialize them into different subclasses.
o Reasons for Specialization:
 Attributes only partially applying to the superclass
 Relationship types only partially applicable to the superclass
o In many cases, an entity type has numerous sub-groupings of its entities that are meaningful and
need to be represented explicitly. This need requires the representation of each subgroup in the ER
model. The generalized entity is a superclass and the set of specialized entities will be subclasses
for that specific Superclass.

o Example: Saving Accounts and Current Accounts are specialized entities for the generalized
entity Account. Manager, Sales, and Secretary are specialized employees.

 Subclass/Subtype
o An entity type whose tuples have attributes that distinguish its members from tuples of the
generalized or Superclass entities.
o When one generalized Superclass has various subgroups with distinguishing features and these
subgroups are represented by specialized form, the groups are called subclasses.
o Subclasses can be either mutually exclusive (disjoint) or overlapping (inclusive).
o A single subclass may inherit attributes from two distinct superclasses.
o A mutually exclusive category/subclass is when an entity instance can be in only one of the
subclasses.

E.g.: An EMPLOYEE can either be SALARIED or PART-TIMER but not both.

 An overlapping category/subclass is when an entity instance may be in two or more subclasses.

E.g.: A PERSON who works for a university can be both EMPLOYEE and a STUDENT at the
same time.

 Superclass/Supertype
o An entity type whose tuples share common attributes. Attributes that are shared by all entity
occurrences (including the identifier) are associated with the supertype.
o Is the generalized entity
 Relationship Between Superclass and Subclass
o The relationship between a superclass and any of its subclasses is called a superclass/subclass or
class/subclass relationship
o An instance cannot be a member of a subclass only; i.e., every instance of a subclass is also an
instance of the Superclass.
o A member of a subclass is represented as a distinct database object, a distinct record that is related
via the key attribute to its super-class entity.
o An entity cannot exist in the database merely by being a member of a subclass; it must also be a
member of the super-class.
o An entity occurrence of the superclass does not necessarily belong to any of the subclasses
unless there is full participation in the specialization.
o The relationship between a subclass and a Superclass is an "IS A" or "IS PART OF" type.

 Subclass IS PART OF Superclass


o Manager IS AN Employee
o All subclasses or specialized entity sets should be connected with the superclass using a line to a
circle where there is a subset symbol indicating the direction of subclass/superclass relationship.

 Each subtype has properties not shared by other subtypes. But whether the employee is HOURLY
or SALARIED, the same attributes (EmployeeId, Name, and DateHired) are shared.
 The Supertype EMPLOYEE stores all properties that subclasses have in common.
HOURLY employees have the unique attribute Wage (hourly wage rate),
while SALARIED employees have two unique attributes, StockOption and Salary.
Constraints on specialization and generalization

 Completeness Constraint

 The Completeness Constraint addresses the issue of whether or not an occurrence of a Superclass
must also have a corresponding Subclass occurrence.
o The completeness constraint addresses whether every instance of the supertype must also be
represented in some subtype.
o The Total Specialization Rule specifies that an entity occurrence should at least be a member of
one of the subclasses. Total Participation of superclass instances on subclasses is diagrammed with
a double line from the Supertype to the circle as shown below.

E.g.: If we have EXTENSION and REGULAR as subclasses of a superclass STUDENT, then it
is mandatory for each student to be either an EXTENSION or a REGULAR student. Thus the
participation of instances of STUDENT in the EXTENSION and REGULAR subclasses will be total.

The Partial Specialization Rule specifies that it is not necessary for all entity occurrences in the
superclass to be a member of one of the subclasses. Here we have an optional participation on the
specialization. Partial Participation of superclass instances on subclasses is diagrammed with
a single line from the Supertype to the circle.

E.g.: If we have MANAGER and SECRETARY as subclasses of a superclass EMPLOYEE, then
it is not the case that all employees are either managers or secretaries. Thus the participation of
instances of EMPLOYEE in the MANAGER and SECRETARY subclasses will be partial.

 Disjointness Constraint

 Specifies whether one entity occurrence can be a member of more than one subclass, i.e.,
it is a type of business rule that deals with the situation where an entity occurrence of a Superclass
may also have more than one Subclass occurrence.
o The Disjoint Rule restricts one entity occurrence of a superclass to be a member of only one of
the subclasses. Example: an EMPLOYEE can be either SALARIED or PART-TIMER, but not
both at the same time.
o The Overlap Rule allows one entity occurrence to be a member of more than one
subclass. Example: an EMPLOYEE working at the university can be both a STUDENT and an
EMPLOYEE at the same time.
o This is diagrammed by placing either the letter “d” for disjoint or “o” for overlapping inside the
circle on the Generalization Hierarchy portion of the E-R diagram.

The two types of constraints on generalization and specialization (Disjointness and
Completeness constraints) are not dependent on one another. That is, being disjoint does not
determine whether the tuples in the superclass should have Total or Partial participation for that
specific specialization.

From the two types of constraints we can have four possible constraints

 Disjoint AND Total


 Disjoint AND Partial
 Overlapping AND Total
 Overlapping AND Partial

CHAPTER FOUR
4. Logical Database Design
4.1 Introduction

The whole purpose of database design is to create an accurate representation of the data, the
relationships between the data, and the business constraints pertinent to the organization.
One can use one or more techniques to design a database; one such technique was
the E-R model. In this chapter we use another technique known as "Normalization", with a
different emphasis. The logical database design defines the structure of a database with a specific
data model.

Logical design is the process of constructing a model of the information used in an enterprise based on a specific data model (e.g., relational, hierarchical, network, or object), but independent of a particular DBMS and other physical considerations.

Logical database design is the process of deciding how to arrange the attributes of the entities in
a given business environment into database structures, such as the tables of a relational database.
The goal of logical database design is to create well-structured tables that properly reflect the
company’s business environment. The tables will be able to store data about the company’s
entities in a non-redundant manner and foreign keys will be placed in the tables so that all the



relationships among the entities will be supported. Physical database design, which will be
treated in the next chapter, is the process of modifying the logical database design to improve
performance.

The focus in logical database design is the normalization process.

Normalization process:

- A collection of rules (tests) to be applied on relations to obtain the minimal, non-redundant set of attributes.
- Discovers new entities in the process.
- Revises attributes based on the rules and the discovered entities.
- Works by examining the relationships between attributes, known as functional dependencies.

The purpose of normalization is to find the suitable set of relations that supports the data requirements of an enterprise. A suitable set of relations has the following characteristics:

- Minimal number of attributes to support the data requirements of the enterprise.
- Attributes with a close logical relationship (functional dependency) should be placed in the same relation.
- Minimal redundancy, with each attribute represented only once, with the exception of attributes that form the whole or part of a foreign key, which are used for joining related tables.

The first step before applying the rules in relational data model is converting the conceptual
design to a form suitable for relational logical model, which is in a form of tables.

Converting ER Diagram to Relational Tables

Three basic rules to convert ER into tables or relations:

Rule 1: Entity Names will automatically be table names

Rule 2: Mapping of attributes: attributes will be columns of the respective tables.



- Atomic (single-valued) attributes, whether stored or derived, will be columns.
- Composite attributes: the parent attribute will be ignored and the decomposed attributes (child attributes) will be columns of the table.
- Multi-valued attributes: will be mapped to a new table, where the primary key of the main table will be posted as a foreign key for cross-referencing.

Rule 3: Relationships: relationship will be mapped by using a foreign key attribute. Foreign key
is a primary or candidate key of one relation used to create association between tables.

 For a relationship with One-to-One Cardinality: post the primary or candidate key of one of the tables into the other as a foreign key. In cases where one entity has partial participation in the relationship, it is recommended to post the candidate key of the partial participant to the total participant, so as to save some memory space due to null values on the foreign key attribute. E.g.: for a relationship between Employee and Department where an employee manages a department, the cardinality is one-to-one, as one employee will manage only one department and one department will have one manager. Here the PK of the Employee can be posted to the Department, or the PK of the Department can be posted to the Employee. But the Employee has partial participation in the relationship "Manages", as not all employees are managers of departments. Thus, even though both ways are possible, it is recommended to post the primary key of the Employee to the Department table as a foreign key.
 For a relationship with One-to-Many Cardinality: post the primary key or candidate key from the "one" side as a foreign key attribute to the "many" side. E.g.: for a relationship called "Belongs To" between Employee (Many) and Department (One), the primary or candidate key of the one side, which is Department, should be posted to the many side, which is the Employee table.
 For a relationship with Many-to-Many Cardinality: for relationships having many to many cardinality,
one has to create a new table (which is the associative entity) and post primary key or candidate key from
the participant entities as foreign key attributes in the new table along with some additional attributes (if
applicable). The same approach should be used for relationships with degree greater than binary.
 For a relationship having Associative Entity property: in cases where the relationship has its own
attributes (associative entity), one has to create a new table for the associative entity and post primary key
or candidate key from the participating entities as foreign key attributes in the new table.



Example to illustrate the major rules in mapping ER to relational schema: The following ER has been designed to represent the requirements of an organization to capture Employee, Department, and Project information. An employee works for a department, and an employee might be assigned to manage a department. Employees might participate in different projects within the organization. An employee might as well be assigned to lead a project, in which case the starting and ending dates of his/her project leadership and the bonus will be registered.
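Since the figures for this example are not reproduced here, the following is a minimal SQL sketch of the tables that the mapping rules above would produce. All attribute names and types are assumed for illustration:

Create Table Department (DeptID int primary key, DName varchar (15), ManagerID int) -- 1:1 "Manages": PK of Employee (the partial participant) is posted here; its FK is added below once Employee exists

Create Table Employee (EmpID int primary key, FName varchar (20), LName varchar (20), DeptID int foreign key references Department (DeptID)) -- 1:M "Works For": PK of the one side (Department) posted to the many side

Alter Table Department Add foreign key (ManagerID) references Employee (EmpID)

Create Table Project (ProjID int primary key, PName varchar (15))

Create Table Participates (EmpID int foreign key references Employee (EmpID), ProjID int foreign key references Project (ProjID), primary key (EmpID, ProjID)) -- M:N relationship mapped to a new table with both keys posted

Create Table Leads (EmpID int foreign key references Employee (EmpID), ProjID int foreign key references Project (ProjID), StartDate varchar (20), EndDate varchar (20), Bonus decimal, primary key (EmpID, ProjID)) -- associative entity "Leads" with its own attributes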



After converting the ER diagram into table form, the next phase is implementing the process of normalization, which is a collection of rules each table should satisfy.



4.2 Normalization
A relational database is merely a collection of data, organized in a particular manner. As the father of the relational database approach, Codd created a series of rules (tests) called normal forms that help define that organization.

One of the best ways to determine what information should be stored in a database is to clarify
what questions will be asked of it and what data would be included in the answers.

Database normalization is a series of steps followed to obtain a database design that allows for
consistent storage and efficient access of data in a relational database. These steps reduce data
redundancy and the risk of data becoming inconsistent.

NORMALIZATION is the process of identifying the logical associations between data items and designing a database that will represent such associations, but without suffering the update anomalies, which are:

1. Insertion Anomalies
2. Deletion Anomalies
3. Modification Anomalies

Normalization may reduce system performance since data will be cross referenced from many
tables. Thus denormalization is sometimes used to improve performance, at the cost of reduced
consistency guarantees.

Normalization normally is considered "good" if it is a lossless decomposition.

All the normalization rules will eventually remove the update anomalies that may exist during data manipulation after the implementation. The types of problems that can occur in an insufficiently normalized table are called update anomalies, and they include:



 Insertion anomalies

An "insertion anomaly" is a failure to place information about a new database entry into all the places in the database where information about that new entry needs to be stored; additionally, we may have difficulty inserting some data at all. In a properly normalized database, information about a new entry needs to be inserted into only one place in the database; in an inadequately normalized database, information about a new entry may need to be inserted into more than one place and, human fallibility being what it is, some of the needed additional insertions may be missed.

 Deletion anomalies
o A "deletion anomaly" is a failure to remove information about an existing database entry when it is time to remove that entry; additionally, deletion of one piece of data may result in the loss of other information. In a properly normalized database, information about an old, to-be-gotten-rid-of entry needs to be deleted from only one place in the database; in an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and, human fallibility being what it is, some of the needed additional deletions may be missed.
 Modification anomalies
o A "modification anomaly" occurs when changing a single fact requires updating several rows. Modification of a database involves changing some value of an attribute of a table. In a properly normalized database table, each fact is stored once, so whatever information is modified by the user, the change is made in one place and takes effect everywhere it is used; in an inadequately normalized table, the same fact may be stored in several rows, and missing one of them leaves the data inconsistent.

In order to avoid the update anomalies in a given table, the solution is to decompose it into smaller tables based on the rules of normalization. However, the decomposition must satisfy two important properties:

a. The Lossless-join property ensures that any instance of the original relation can be recovered from the instances of the smaller relations.



b. The Dependency preservation property implies that the constraints (dependencies) on the original relation can be maintained by enforcing constraints on the smaller relations; i.e., we don't have to perform a Join operation to check whether a constraint on the original relation is violated or not.

The purpose of normalization is to reduce the chances for anomalies to occur in a database.

Example of problems related with Anomalies

EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel

12 Abebe Mekuria 2 SQL Database AAU Sidist_Kilo 5

16 Lemma Alemu 5 C++ Programming Unity Gerji 6

28 Chane Kebede 2 SQL Database AAU Sidist_Kilo 10

25 Abera Taye 6 VB6 Programming Helico Piazza 8

65 Almaz Belay 2 SQL Database Helico Piazza 9

24 Dereje Tamiru 8 Oracle Database Unity Gerji 5

51 Selam Belay 4 Prolog Programming Jimma Jimma City 8

94 Alem Kebede 3 Cisco Networking AAU Sidist_Kilo 7

18 Girma Dereje 1 IP Programming Jimma Jimma City 4

13 Yared Gizaw 7 Java Programming AAU Sidist_Kilo 6

Deletion Anomalies:

If the employee with ID 16 is deleted, then all information about the skill C++ and its skill type is deleted from the database. We will then not have any information about C++ and its skill type.

Insertion Anomalies:



What if we have a new employee with a skill called Pascal? We cannot decide whether Pascal is allowed as a value for Skill, and we have no clue about the type of skill that Pascal should be categorized as.

Modification Anomalies:

What if the address for Helico is changed from Piazza to Mexico? We need to look for every occurrence of Helico and change the value of SchoolAdd from Piazza to Mexico, which is prone to error.

A database-management system can work only with the information that we put explicitly into its tables for a given database and into its rules for working with those tables, where such rules are appropriate and possible.

Functional Dependency (FD)

Before moving to the definition and application of normalization, it is important to have an understanding of "functional dependency."

Data Dependency

The logical associations between data items that point the database designer in the direction of a good database design are referred to as determinant or dependent relationships.

Two data items A and B are said to be in a determinant or dependent relationship if certain values of data item B always appear with certain values of data item A. If data item A is the determinant data item and B the dependent data item, then the direction of the association is from A to B and not vice versa.

The essence of this idea is that if the existence of something, call it A, implies that B must exist
and have a certain value, then we say that “B is functionally dependent on A.” We also often
express this idea by saying that “A functionally determines B,” or that “B is a function of A,” or
that “A functionally governs B.” Often, the notions of functionality and functional dependency are



expressed briefly by the statement, “If A, then B.” It is important to note that the value of B must
be unique for a given value of A, i.e., any given value of A must imply just one and only one value
of B, in order for the relationship to qualify for the name “function.” (However, this does not
necessarily prevent different values of A from implying the same value of B.)

However, for the purpose of normalization, we are interested in finding one-to-one (1..1) dependencies that hold for all time (on the intension rather than the extension of the database), with the determinant having the minimal number of attributes.

X → Y holds if whenever two tuples have the same value for X, they must have the same value for Y.

The notation A → B is read as: B is functionally dependent on A.

In general, a functional dependency is a relationship among attributes. In relational databases, we can have a determinant that governs one or several other attributes.

FDs are derived from the real-world constraints on the attributes and they are properties on the
database intension not extension.

Example

Dinner Course Type of Wine

Meat Red

Fish White

Cheese Rose

Since the type of Wine served depends on the type of Dinner, we say Wine is functionally
dependent on Dinner.

Dinner → Wine



Dinner Course Type of Wine Type of Fork

Meat Red Meat fork

Fish White Fish fork

Cheese Rose Cheese fork

Since both Wine type and Fork type are determined by the Dinner type, we say Wine is
functionally dependent on Dinner and Fork is functionally dependent on Dinner.

Dinner → Wine

Dinner → Fork

Partial Dependency

If an attribute which is not a member of the primary key is dependent on some part of the primary
key (if we have composite primary key) then that attribute is partially functionally dependent on
the primary key.

Let {A,B} be the primary key and C a non-key attribute.

Then if {A,B} → C and B → C,

then C is partially functionally dependent on {A,B}.

Full Functional Dependency

If an attribute which is not a member of the primary key is not dependent on some part of the
primary key but the whole key (if we have composite primary key) then that attribute is fully
functionally dependent on the primary key.

Let {A,B} be the primary key and C a non-key attribute.

Then if {A,B} → C holds, but neither A → C nor B → C holds,



then C is fully functionally dependent on {A,B}.

Transitive Dependency

In mathematics and logic, a transitive relationship is a relationship of the following form: “If A
implies B, and if also B implies C, then A implies C.”

Example:

If Mr X is a Human, and if every Human is an Animal, then Mr X must be an Animal.

Generalized way of describing transitive dependency is that:

If A functionally governs B, AND

If B functionally governs C

THEN A functionally governs C

provided that neither C nor B determines A, i.e., B ↛ A and C ↛ A. In the usual notation:

{(A → B) AND (B → C)} ⟹ A → C, provided that B ↛ A and C ↛ A

4.3 Process of normalization (1NF, 2NF, 3NF)


We have various levels or steps in normalization, called Normal Forms. The level of complexity, the strength of the rules, and the degree of decomposition increase as we move from a lower-level Normal Form to a higher one.

A table in a relational database is said to be in a certain normal form if it satisfies certain constraints.

Each normal form below represents a stronger condition than the previous one.

Normalization towards a logical design consists of the following steps:



UnNormalized Form (UNF): Identify all data elements.

First Normal Form (1NF): Find the key with which you can find all data, i.e., remove any repeating groups.

Second Normal Form (2NF): Remove part-key dependencies (partial dependencies). Make all data dependent on the whole key.

Third Normal Form (3NF): Remove non-key dependencies (transitive dependencies). Make all data dependent on nothing but the key.

For most practical purposes, databases are considered normalized if they adhere to the third normal form (there is no transitive dependency).

First Normal Form (1NF)

Requires that all column values in a table are atomic (e.g., a number is an atomic value, while a list or a set is not). We have two ways of achieving this:

- Putting each repeating group into a separate table and connecting them with a primary key-foreign key relationship, or
- Moving these repeating groups to a new row by repeating the non-repeating attributes, known as "flattening" the table. If so, then find the key with which you can find all data.

Definition: a table (relation) is in 1NF

If

- There are no duplicated rows in the table (there is a unique identifier).
- Each cell is single-valued (i.e., there are no repeating groups).
- Entries in a column (attribute, field) are of the same kind.

Example for First Normal Form (1NF)

EmpID FirstName LastName Skill SkillType School SchoolAdd SkillLevel

12 Abebe Mekuria SQL, VB6 Database, Programming AAU, Helico Sidist_Kilo, Piazza 5, 8

16 Lemma Alemu C++, IP Programming, Programming Unity, Jimma Gerji, Jimma City 6, 4

28 Chane Kebede SQL Database AAU Sidist_Kilo 10

65 Almaz Belay SQL, Prolog, Java Database, Programming, Programming Helico, Jimma, AAU Piazza, Jimma City, Sidist_Kilo 9, 8, 6

24 Dereje Tamiru Oracle Database Unity Gerji 5

94 Alem Kebede Cisco Networking AAU Sidist_Kilo 7

4.2.1 First Normal Form (1NF)

Remove all repeating groups. Distribute the multi-valued attributes into different rows and identify a unique identifier for the relation, so that it can be said to be a relation in a relational database. Flatten the table:

EmpID FirstName LastName SkillID Skill SkillType School SchoolAdd SkillLevel

12 Abebe Mekuria 1 SQL Database AAU Sidist_Kilo 5

12 Abebe Mekuria 3 VB6 Programming Helico Piazza 8

16 Lemma Alemu 2 C++ Programming Unity Gerji 6

16 Lemma Alemu 7 IP Programming Jimma Jimma City 4

28 Chane Kebede 1 SQL Database AAU Sidist_Kilo 10



65 Almaz Belay 1 SQL Database Helico Piazza 9

65 Almaz Belay 5 Prolog Programming Jimma Jimma City 8

65 Almaz Belay 8 Java Programming AAU Sidist_Kilo 6

24 Dereje Tamiru 4 Oracle Database Unity Gerji 5

94 Alem Kebede 6 Cisco Networking AAU Sidist_Kilo 7

4.2.2 Second Normal Form (2NF)

No partial dependency of a non-key attribute on part of the primary key. This will result in a set of relations in Second Normal Form.

Any table that is in 1NF and has a single-attribute (i.e., non-composite) key is automatically also in 2NF.

Definition: a table (relation) is in 2NF

If

- It is in 1NF and
- All non-key attributes are dependent on the entire primary key; i.e., there is no partial dependency.

Example for 2NF:

EMP_PROJ
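The EMP_PROJ figure is not reproduced here, so the following is a minimal sketch with assumed attributes. Suppose EMP_PROJ (EmpID, ProjNo, Hours, EmpName, ProjName) has the composite primary key {EmpID, ProjNo}. EmpName depends only on EmpID, and ProjName only on ProjNo, so both are partial dependencies; 2NF decomposes the relation as follows:

Create Table Employee_2NF (EmpID int primary key, EmpName varchar (20)) -- EmpID → EmpName

Create Table Project_2NF (ProjNo int primary key, ProjName varchar (15)) -- ProjNo → ProjName

Create Table Works_On_2NF (EmpID int, ProjNo int, Hours decimal (3, 1), primary key (EmpID, ProjNo)) -- Hours depends on the whole key

The _2NF suffixes are used only to keep these illustrative tables distinct from the earlier examples.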



4.2.3 Third Normal Form (3NF)

Definition: a table (relation) is in 3NF

If

- It is in 2NF and
- There are no transitive dependencies between the primary key and non-primary-key attributes.



Example for (3NF)

Assumption: Students of the same batch (same year) live in one building or dormitory.

STUDENT

StudID Stud_F_Name Stud_L_Name Dept Year Dormitory

125/97 Abebe Mekuria Info Sc 1 401

654/95 Lemma Alemu Geog 3 403

842/95 Chane Kebede CompSc 3 403

165/97 Alem Kebede InfoSc 1 401

985/95 Almaz Belay Geog 3 403

This schema is in 2NF since the primary key is a single attribute and there are no repeating groups (multi-valued attributes).

Let's take StudID, Year, and Dormitory and see the dependencies.

StudID → Year AND Year → Dormitory

Year cannot determine StudID, and Dormitory cannot determine StudID; then, transitively, StudID → Dormitory.

To convert it to 3NF we need to remove all transitive dependencies of non-key attributes on other non-key attributes.

The non-primary-key attributes that are dependent on each other will be moved to another table and linked with the main table using a Candidate Key-Foreign Key relationship.

The result is two tables, STUDENT and DORM:
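Since the decomposed tables are not shown here, a minimal SQL sketch of the decomposition (column names follow the example; types are assumed):

Create Table Dorm (Year int primary key, Dormitory varchar (10)) -- Year → Dormitory, so Year becomes the key of the new table; note that Year may need quoting in some DBMSs

Create Table Student (StudID varchar (10) primary key, Stud_F_Name varchar (20), Stud_L_Name varchar (20), Dept varchar (15), Year int foreign key references Dorm (Year)) -- Year links the two tables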

CHAPTER FIVE
5. Physical Database Design
5.1 Conceptual, Logical, and Physical Data Models
You begin with a summary-level business data model that's most often used on strategic data projects. It typically describes an entire enterprise, which allows you to understand at a high level the different entities in your data and how they relate to one another. Due to its highly abstract nature, it may be referred to as a conceptual model.

Common characteristics of a conceptual data model:

A conceptual data model identifies important entities and the high-level relationships among them.
This means no attribute or primary key is specified. Moreover, complexity increases as you expand
from a conceptual data model.

A logical data model, otherwise known as a fully attributed data model, allows you to understand
the details of your data without worrying about how the data will be implemented in the database.
Additionally, a logical data model will normally be derived from and or linked back to objects in
a conceptual data model. It is independent of DBMS, technology, data storage or organizational
constraints.

Common characteristics of a logical data model:

Unlike the conceptual model, a logical model includes all entities and relationships among them.
Additionally, all attributes, the primary key, and foreign keys (keys identifying the relationship
between different entities) are specified. As a result, normalization occurs at this level.

The steps for designing the logical data model are as follows:

- First, specify primary keys for all entities.
- Then find the relationships between different entities.
- Find all attributes for each entity.
- Lastly, resolve many-to-many relationships.

Finally, the physical data model will show you exactly how to implement your data model in the database of choice. This model shows all table structures, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables. Correspondingly, the target implementation technology may be a relational DBMS, an XML document, a spreadsheet, or any other data implementation option.

5.2 Physical Database Design Process


Physical database design is the process of transforming a logical data model into the physical data structure of a particular database management system (DBMS). Normally, physical design is accomplished in multiple steps, which include expanding a business model into a fully attributed model (FAM) and then transforming the fully attributed model into a physical design model.

Physical database design is the process of producing a description of the implementation of the
database on secondary storage; it describes the base relations, file organizations, and indexes used
to achieve efficient access to the data, and any associated integrity constraints and security
measures.

The physical database design phase allows the designer to make decisions on how the database is
to be implemented. Therefore, physical design is tailored to a specific DBMS. There is feedback



between physical and logical design, because decisions taken during physical design for improving
performance may affect the logical data model.

Database design process in database design methodology is divided into three main phases:
conceptual, logical, and physical database design. The phase prior to physical design—logical
database design—is largely independent of implementation details, such as the specific
functionality of the target DBMS and application programs, but is dependent on the target data
model. The output of this process is a logical data model consisting of an ER/relation diagram,
relational schema, and supporting documentation that describes this model, such as a data
dictionary. Together, these represent the sources of information for the physical design process
and provide the physical database designer with a vehicle for making tradeoffs that are so
important to an efficient database design.

Whereas logical database design is concerned with the what, physical database design is concerned
with the how. It requires different skills that are often found in different people. In particular, the
physical database designer must know how the computer system hosting the DBMS operates and
must be fully aware of the functionality of the target DBMS. As the functionality provided by
current systems varies widely, physical design must be tailored to a specific DBMS. However,
physical database design is not an isolated activity—there is often feedback between physical,
logical, and application design. For example, decisions taken during physical design for improving
performance, such as merging relations together, might affect the structure of the logical data
model, which will have an associated effect on the application design.

Physical database design translates the logical data model into a set of SQL statements that define
the database. For relational database systems, it is relatively easy to translate from a logical data
model into a physical database.

5.2.1. Overview of the Physical Database Design Methodology

 Choosing a physical data structure for the data constructs in the data model.
 Optionally choosing DBMS options for the existence constraints in the data model.
 Does not change the business meaning of the data.



 First transformation of a data model should be a "one-to-one" transformation.
 Should not denormalize the data unless required for performance.
 Based on shop design standards and DBA experience & biases.
 Entity Subsetting – Choosing to transform only a subset of the attributes in an entity.
 Dependent Encasement – Collapsing a dependent entity into its parent to form a repeating group of
attributes or collapsing a dependent entity into its parent to form a new set of attributes.
 Category Encasement
 Category Discriminator Collapse
 Horizontal Split
 Vertical Split
 Data Migration
 Synthetic Keys – The merging of similar entities using a made-up key. This always loses business vocabulary and is never more than BCNF. Also used a lot in packages to make them easily extendable.
 Adding Summaries – Adding new entities that are each a summary of data upon a single level of a
dimension.
 Adding Dimensions

Moreover, it is often necessary to apply multiple transforms to a single entity to get the desired physical performance characteristics. All physical design transformations are compromises.
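As an illustration of one of these transformations, here is a hypothetical vertical split (all table and column names are invented for the example): frequently accessed columns stay in the main table, while rarely accessed ones move to a companion table that shares the same key.

Create Table Employee_Main (EmpID int primary key, FName varchar (20), LName varchar (20)) -- frequently accessed columns

Create Table Employee_Detail (EmpID int primary key references Employee_Main (EmpID), Photo varchar (100), Resume varchar (100)) -- rarely accessed columns moved out to keep the main table narrow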



CHAPTER SIX
6. Query Languages
6.1. Relational Algebra
The relational algebra is a theoretical language with operations that work on one or more relations
to define another relation without changing the original relation(s). Thus both the operands and
the results are relations, and so the output from one operation can become the input to another
operation. This ability allows expressions to be nested in the relational algebra, just as we can nest
arithmetic operations. This property is called closure: relations are closed under the algebra, just
as numbers are closed under arithmetic operations.

The relational algebra is a relation-at-a-time (or set) language in which all tuples, possibly from
several relations, are manipulated in one statement without looping. There are many variations of
the operations that are included in relational algebra. The five fundamental operations in relational
algebra—Selection, Projection, Cartesian product, Union, and Set difference—perform most of
the data retrieval operations that we are interested in. In addition, there are also the Join,
Intersection, and Division operations, which can be expressed in terms of the five basic operations.

The Selection and Projection operations are unary operations, as they operate on one relation. The
other operations work on pairs of relations and are therefore called binary operations. In the
following definitions, let R and S be two relations defined over the attributes A = (a1, a2, … , aN)
and B = (b1, b2, … , bM), respectively. Use the following figures for the examples in this chapter.



6.1.1. Unary Operations
Selection (σ predicate (R))
The Selection operation works on a single relation R and defines a relation that contains only
those tuples of R that satisfy the specified condition (predicate). Example 6.1: List all staff with a
salary greater than 10000 birr.
σ salary > 10000 ( Staff )
Here, the input relation is Staff and the predicate is salary > 10000. The Selection operation
defines a relation containing only those Staff tuples with a salary greater than 10000 birr. The



result of this operation is shown in Figure 6.2. More complex predicates can be generated using the logical operators ∧ (AND), ∨ (OR), and ~ (NOT).

Projection (π a1 … an (R) )

The Projection operation works on a single relation R and defines a relation that contains a
vertical subset of R, extracting the values of specified attributes and eliminating duplicates.

Example 6.2: Produce a list of salaries for all staff, showing only the staffNo, fName, IName,
and salary details.

π staffNo, fName, IName, salary (Staff)

In this example, the Projection operation defines a relation that contains only the designated Staff attributes staffNo, fName, IName, and salary, in the specified order. The result of this operation is shown in Figure 6.3.
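Although SQL is introduced only later in this chapter, it may help to see hedged SQL counterparts of these two unary operations (using the same Staff attributes):

SELECT * FROM Staff WHERE salary > 10000; -- σ salary > 10000 (Staff)

SELECT DISTINCT staffNo, fName, IName, salary FROM Staff; -- π staffNo, fName, IName, salary (Staff); DISTINCT mirrors the duplicate elimination of Projection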



6.1.2. Set Operations
The Selection and Projection operations extract information from only one relation. There are
obviously cases where we would like to combine information from several relations. In the
remainder of this section, we examine the binary operations of the relational algebra, starting
with the set operations of Union, Set difference, Intersection, and Cartesian product.
Union ( R U S )
The union of two relations R and S defines a relation that contains all the tuples of R, or S, or
both R and S, duplicate tuples being eliminated. R and S must be union-compatible.
If R and S have I and J tuples, respectively, their union is obtained by concatenating them into
one relation with a maximum of (I + J) tuples. Union is possible only if the schemas of the two
relations match, that is, if they have the same number of attributes with each pair of
corresponding attributes having the same domain. In other words, the relations must be union-
compatible. Note that attributes names are not used in defining union-compatibility. In some
cases, the Projection operation may be used to make two relations union-compatible.
Example 6.3: List all cities where there is either a branch office or a property for rent.
π city (Branch) ∪ π city (PropertyForRent)

To produce union-compatible relations, we first use the Projection operation to project the Branch and PropertyForRent relations over the attribute city, eliminating duplicates where necessary.



We then use the Union operation to combine these new relations to produce the result shown in
Figure 6.4.

Set difference ( R – S )

The Set difference operation defines a relation consisting of the tuples that are in relation R, but
not in S. R and S must be union-compatible.

Example 6.4: List all cities where there is a branch office but no properties for rent.

π city (Branch) – π city (PropertyForRent)

As in the previous example, we produce union-compatible relations by projecting the Branch and
PropertyForRent relations over the attribute city . We then use the Set difference operation to
combine these new relations to produce the result shown in Figure 6.5.

Intersection ( R ∩ S )



The Intersection operation defines a relation consisting of the set of all tuples that are in both R
and S. R and S must be union-compatible.

Example 6.5: List all cities where there is both a branch office and at least one property for rent.

π city ( Branch ) ∩ π city ( PropertyForRent )

As in the previous example, we produce union-compatible relations by projecting the Branch and PropertyForRent relations over the attribute city. We then use the Intersection operation to combine these new relations to produce the result shown in Figure 6.6.

Note that we can express the Intersection operation in terms of the Set difference operation: R ∩ S = R – (R – S).

Figure 6.6. Intersection based on city attribute from the Branch and PropertyForRent relations.
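Standard SQL offers direct counterparts of these set operations, sketched below (some DBMSs spell set difference MINUS instead of EXCEPT):

SELECT city FROM Branch UNION SELECT city FROM PropertyForRent; -- Union
SELECT city FROM Branch EXCEPT SELECT city FROM PropertyForRent; -- Set difference
SELECT city FROM Branch INTERSECT SELECT city FROM PropertyForRent; -- Intersection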

Cartesian product (R x S)

The Cartesian product operation defines a relation that is the concatenation of every tuple of
relation R with every tuple of relation S.

The Cartesian product operation multiplies two relations to define another relation consisting of
all possible pairs of tuples from the two relations. Therefore, if one relation has I tuples and N
attributes and the other has J tuples and M attributes, Cartesian product relation will contain (I *
J) tuples with (N + M) attributes. It is possible that the two relations may have attributes with the
same name. In this case, the attribute names are prefixed with the relation name to maintain the
uniqueness of attribute names within a relation.



Example 6.6: List the names and comments of all clients who have viewed a property for rent.

The names of clients are held in the Client relation and the details of viewings are held in the
Viewing relation. To obtain the list of clients and the comments on properties they have viewed,
we need to combine these two relations:

(π clientNo, fName, IName (Client)) X (π clientNo, propertyNo, comment (Viewing))

The result of this operation is shown in Figure 6.7. In its present form, this relation contains more
information than we require. For example, the first tuple of this relation contains different clientNo
values. To obtain the required list, we need to carry out a Selection operation on this relation to
extract those tuples where Client.clientNo = Viewing.clientNo.

The complete operation is thus:

σ Client.clientNo = Viewing.clientNo ((π clientNo, fName, IName (Client)) X (π clientNo, propertyNo, comment (Viewing)))

The result of this operation is shown in Figure 6.8.



Figure 6.7. Cartesian product of reduced Client and Viewing relations



6.1.3. Aggregation and Grouping Operations
As well as simply retrieving certain tuples and attributes of one or more relations, we often want
to perform some form of summation or aggregation of data, similar to the totals at the bottom of a
report, or some form of grouping of data, similar to subtotals in a report. These operations cannot
be performed using the basic relational algebra operations considered earlier. However, additional
operations have been proposed, as we now discuss.

Aggregate operations ( AL(R) )

Applies the aggregate function list, AL, to the relation R to define a relation over the aggregate
list.

AL contains one or more (<aggregate_function>, <attribute>) pairs.

The main aggregate functions are:

 COUNT – returns the number of values in the associated attribute.


 SUM – returns the sum of the values in the associated attribute.
 AVG – returns the average of the values in the associated attribute.
 MIN – returns the smallest value in the associated attribute.
 MAX – returns the largest value in the associated attribute.

Example 6.9: (a) How many properties cost more than £350 per month to rent? We can use the
aggregate function COUNT to produce the relation R shown in Figure 6.10(a):

ρR (myCount) COUNT propertyNo (σ rent > 350 (PropertyForRent))

(b) Find the minimum, maximum, and average staff salary.

We can use the aggregate functions—MIN, MAX, and AVERAGE—to produce the relation R shown in Figure 6.10(b) as follows:



ρR (myMin, myMax, myAverage) MIN salary, MAX salary, AVERAGE salary (Staff)

Figure 6.10. Result of the Aggregate operations: (a) finding the number of properties whose rent
is greater than £350; (b) finding the minimum, maximum, and average staff salary.
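The corresponding SQL, as a sketch using the same relations, would be:

SELECT COUNT(propertyNo) AS myCount FROM PropertyForRent WHERE rent > 350;

SELECT MIN(salary) AS myMin, MAX(salary) AS myMax, AVG(salary) AS myAverage FROM Staff;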

Grouping operation ( GA AL (R) )

Groups the tuples of relation R by the grouping attributes, GA, and then applies the aggregate
function list AL to define a new relation. AL contains one or more (<aggregate_function>,
<attribute>) pairs. The resulting relation contains the grouping attributes, GA, along with the
results of each of the aggregate functions.

The general form of the grouping operation is as follows:

a1, a2, … , an <Ap ap>, <Aq aq>, … , <Az az> (R)

where R is any relation; a1, a2, … , an are attributes of R on which to group; ap, aq, … , az are other attributes of R; and Ap, Aq, … , Az are aggregate functions.

The tuples of R are partitioned into groups such that:

- all tuples in a group have the same value for a1, a2, … , an;
- tuples in different groups have different values for a1, a2, … , an.

We illustrate the use of the grouping operation with the following example.



Example 6.10: Find the number of staff working in each branch and the sum of their salaries.

We first need to group tuples according to the branch number, branchNo , and then use the
aggregate functions COUNT and SUM to produce the required relation. The relational algebra
expression is as follows:

ρR (branchNo, myCount, mySum) branchNo COUNT staffNo, SUM salary (Staff)

The resulting relation is shown in Figure 6.11.

Figure 6.11. Result of the grouping operation to find the number of staff working in each branch
and the sum of their salaries.
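In SQL, the same result would be obtained with a GROUP BY query (a sketch):

SELECT branchNo, COUNT(staffNo) AS myCount, SUM(salary) AS mySum
FROM Staff
GROUP BY branchNo;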

6.2. Relational Calculus


A certain order is always explicitly specified in a relational algebra expression and a strategy for
evaluating the query is implied. In the relational calculus, there is no description of how to evaluate
a query; a relational calculus query specifies what is to be retrieved rather than how to retrieve it.

The relational calculus is not related to differential and integral calculus in mathematics, but takes
its name from a branch of symbolic logic called predicate calculus. When applied to databases,
it is found in two forms: tuple relational calculus, as originally proposed by Codd,
and domain relational calculus, as proposed by Lacroix and Pirotte.



In first-order logic or predicate calculus, a predicate is a truth-valued function with arguments.
When we substitute values for the arguments, the function yields an expression, called
a proposition, which can be either true or false. For example, the sentences "John White is a member of staff" and "John White earns more than Ann Beech" are both propositions, because we can determine whether they are true or false. In the first case, we have a function, "is a member of staff," with one argument (John White); in the second case, we have a function, "earns more than," with two arguments (John White and Ann Beech).

If a predicate contains a variable, as in "x is a member of staff," there must be an associated range for x. When we substitute some values of this range for x, the proposition may be true; for other values, it may be false. For example, if the range is the set of all people and we replace x by John White, the proposition "John White is a member of staff" is true. If we replace x by the name of a person who is not a member of staff, the proposition is false.

If P is a predicate, then we can write the set of all x such that P is true for x, as:

{x | P(x)}

We may connect predicates by the logical connectives ∧ (AND), ∨ (OR), and ~ (NOT) to form compound predicates.

6.2.1. Tuple Relational Calculus


In the tuple relational calculus, we are interested in finding tuples for which a predicate is true. The calculus is based on the use of tuple variables. A tuple variable is a variable that "ranges over" a named relation: that is, a variable whose only permitted values are tuples of the relation. (The word "range" here does not correspond to the mathematical use of range, but corresponds to a mathematical domain.) For example, to specify the range of a tuple variable S as the Staff relation, we write:

Staff(S)

To express the query "Find the set of all tuples S such that F(S) is true," we can write:



{S | F(S)}

F is called a formula (well-formed formula, or wff in mathematical logic). For example, to express the query "Find the staffNo, fName, IName, position, sex, DOB, salary, and branchNo of all staff earning more than £10,000," we can write:

{S | Staff(S) ∧ S.salary > 10000}

S.salary means the value of the salary attribute for the tuple variable S. To retrieve a particular attribute, such as salary, we would write:

{S.salary | Staff(S) ∧ S.salary > 10000}

The existential and universal quantifiers

There are two quantifiers we can use with formulae to tell how many instances the predicate applies to. The existential quantifier ∃ ("there exists") is used in formulae that must be true for at least one instance, such as:

Staff(S) ∧ (∃B) (Branch(B) ∧ (B.branchNo = S.branchNo) ∧ B.city = 'London')

This means, "There exists a Branch tuple that has the same branchNo as the branchNo of the current Staff tuple, S, and is located in London." The universal quantifier ∀ ("for all") is used in statements about every instance, such as:

(∀B) (B.city ≠ 'Paris')

This means, "For all Branch tuples, the address is not in Paris." We can apply a generalization of De Morgan's laws to the existential and universal quantifiers. For example:

(∃X)(F(X)) ≡ ~(∀X)(~(F(X)))

(∀X)(F(X)) ≡ ~(∃X)(~(F(X)))



(∃X)(F1(X) ∨ F2(X)) ≡ ~(∀X)(~(F1(X)) ∧ ~(F2(X)))

(∀X)(F1(X) ∧ F2(X)) ≡ ~(∃X)(~(F1(X)) ∨ ~(F2(X)))

Using these equivalence rules, we can rewrite the previous formula as:

~(∃B) (B.city = 'Paris')

which means, "There are no branches with an address in Paris."

Tuple variables that are qualified by ∃ or ∀ are called bound variables; the other tuple variables are called free variables. The only free variables in a relational calculus expression should be those on the left side of the bar (|). For example, in the following query:

{S.fName, S.lName | Staff(S) ∧ (∃B) (Branch(B) ∧ (B.branchNo = S.branchNo) ∧ B.city = 'London')}

S is the only free variable, and S is then bound successively to each tuple of Staff.
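A hedged SQL counterpart of this calculus query, using the Staff and Branch relations from the earlier examples, would be:

SELECT s.fName, s.lName
FROM Staff s, Branch b
WHERE b.branchNo = s.branchNo AND b.city = 'London';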

6.3. Structured Query Languages (SQL)


6.3.1. Introduction to SQL
Objectives of SQL

Ideally, a database language should allow a user to:

- create the database and relation structures;
- perform basic data management tasks, such as the insertion, modification, and deletion of data from the relations;
- perform both simple and complex queries.

A database language must perform these tasks with minimal user effort, and its command structure
and syntax must be relatively easy to learn. Finally, the language must be portable; that is, it must
conform to some recognized standard so that we can use the same command structure and syntax
when we move from one DBMS to another. SQL is intended to satisfy these requirements.



SQL is an example of a transform-oriented language, or a language designed to use relations to
transform inputs into required outputs. As a language, the ISO SQL standard has two major
components:

 Data Definition Language (DDL) for defining the database structure and controlling access to the
data;
 Data Manipulation Language (DML) for retrieving and updating data.

Until the 1999 release of the standard, known as SQL:1999 or SQL3, SQL contained only these
definitional and manipulative commands; it did not contain flow of control commands, such as IF
. . . THEN . . . ELSE, GO TO, or DO . . . WHILE. These commands had to be implemented using
a programming or job-control language, or interactively by the decisions of the user. Owing to this
lack of computational completeness, SQL can be used in two ways. The first way is to use SQL
interactively by entering the statements at a terminal. The second way is to embed SQL statements
in a procedural language. SQL is also a relatively easy language to learn:

 It is a nonprocedural language; you specify what information you require, rather than how to get
it. In other words, SQL does not require you to specify the access methods to the data.
 Like most modern languages, SQL is essentially free-format, which means that parts of statements
do not have to be typed at particular locations on the screen.
 The command structure consists of standard English words such as CREATE TABLE, INSERT,
SELECT.

For example:

 CREATE TABLE Staff (staffNo VARCHAR(5), IName VARCHAR(15), salary DECIMAL(7,2));

 INSERT INTO Staff VALUES ('SG16', 'Brown', 8300);

 SELECT staffNo, IName, salary FROM Staff WHERE salary > 10000;



 SQL can be used by a range of users including database administrators (DBAs), management personnel, application developers, and many other types of end-user. An international standard now exists for the SQL language, making it both the formal and de facto standard language for defining and manipulating relational databases.

Importance of SQL

SQL is the first and, so far, only standard database language to gain wide acceptance. The only
other standard database language, the Network Database Language (NDL), based on the
CODASYL network model, has few followers. Nearly every major current vendor provides
database products based on SQL or with an SQL interface, and most are represented on at least
one of the standard-making bodies.

There is a huge investment in the SQL language both by vendors and by users. It has become part
of application architectures such as IBM‘s Systems Application Architecture (SAA) and is the
strategic choice of many large and influential organizations, for example, the Open Group
consortium for UNIX standards. SQL has also become a Federal Information Processing Standard
(FIPS) to which conformance is required for all sales of DBMSs to the U.S. government. The SQL
Access Group, a consortium of vendors, defined a set of enhancements to SQL that would support
interoperability across disparate systems.

SQL is used in other standards and even influences the development of other standards as a
definitional tool. Examples include ISO‘s Information Resource Dictionary System (IRDS)
standard and Remote Data Access (RDA) standard. The development of the language is supported
by considerable academic interest, providing both a theoretical basis for the language and the
techniques needed to implement it successfully. This is especially true in query optimization,
distribution of data, and security. There are now specialized implementations of SQL that are
directed at new markets, such as OnLine Analytical Processing (OLAP).

Terminology



The ISO SQL standard does not use the formal terms of relations, attributes, and tuples, instead
using the terms tables, columns, and rows.

6.3.2. Writing SQL Commands


An SQL statement consists of reserved words and user-defined words. Reserved words are a
fixed part of the SQL language and have a fixed meaning. They must be spelled exactly as required
and cannot be split across lines. User-defined words are made up by the user (according to certain
syntax rules) and represent the names of various database objects such as tables, columns, views,
indexes, and so on. The words in a statement are also built according to a set of syntax rules.
Although the standard does not require it, many dialects of SQL require the use of a statement
terminator to mark the end of each SQL statement (usually the semicolon ";" is used).

Most components of an SQL statement are case-insensitive, which means that letters can be typed in either upper- or lowercase. The one important exception to this rule is that literal character data must be typed exactly as it appears in the database. For example, if we store a person's surname as "SMITH" and then search for it using the string "Smith," the row will not be found.

Although SQL is free-format, an SQL statement or set of statements is more readable if indentation
and lineation are used. For example:

 each clause in a statement should begin on a new line;


 the beginning of each clause should line up with the beginning of other clauses;
 if a clause has several parts, they should each appear on a separate line and be indented under the
start of the clause to show the relationship.

Throughout this and the next three chapters, we use the following extended form of the Backus
Naur Form (BNF) notation to define SQL statements:

 uppercase letters are used to represent reserved words and must be spelled exactly as shown;
 lowercase letters are used to represent user-defined words;
 a vertical bar ( | ) indicates a choice among alternatives; for example, a | b | c;
 curly braces indicate a required element; for example, {a};



 square brackets indicate an optional element; for example, [a];
 an ellipsis (. . .) is used to indicate optional repetition of an item zero or more times.

For example:

{a|b} (, c . . .) means either a or b followed by zero or more repetitions of c separated by commas.

In practice, the DDL statements are used to create the database structure (that is, the tables) and
the access mechanisms (that is, what each user can legally access), and then the DML statements
are used to populate and query the tables.

6.3.3. SQL Data Definition and Data Types


SQL uses the terms table, row, and column for the formal relational model terms relation, tuple,
and attribute, respectively. We will use the corresponding terms interchangeably. The main SQL
command for data definition is the CREATE statement, which can be used to create schemas,
tables (relations), and domains (as well as other constructs such as views, assertions, and triggers).

The Create Table Command in SQL

The CREATE TABLE command is used to specify a new relation by giving it a name and
specifying its attributes and initial constraints. The attributes are specified first, and each
attribute is given a name, a data type to specify its domain of values, and any attribute
constraints, such as NOT NULL. The key, entity integrity, and referential integrity constraints
can be specified within the CREATE TABLE statement after the attributes are declared, or they
can be added later using the ALTER TABLE command.

Create Table Employee (Fname varchar (20), Lname varchar (20), ID int primary key, Bdate
varchar (20), Address varchar (20), Sex char, Salary decimal, SuperID int foreign key references
Employee (ID))

Create Table Department (DName Varchar (15), Dnumber Int primary key, MgrID int foreign
key references employee (ID), Mgrstartdate varchar (20))



Create table Dept_Locations (Dnumber Int, Dlocation Varchar (15), Primary Key (Dnumber, Dlocation), Foreign Key (Dnumber) References Department (Dnumber))

Create table project (Pname Varchar (15), Pnumber Int primary key, Plocation Varchar (15),
Dnum Int foreign key references Department (Dnumber))

Create table works_On (EID int, Pno Int, Hours Decimal (3, 1), Primary Key (EID, Pno), Foreign Key (EID) References Employee (ID), Foreign Key (Pno) References project (Pnumber))

Create table dependent (EID int, Dependent_Name Varchar (15), Sex Char, Bdate Date, Relationship Varchar (8), Primary Key (EID, Dependent_Name), Foreign Key (EID) References Employee (ID))

Inserting Values into the Table

In its simplest form, INSERT is used to add a single tuple to a relation. We must specify the relation name and a list of values for the tuple. The values should be listed in the same order in which the corresponding attributes were specified in the CREATE TABLE command. For example, statements that add new tuples to the Employee, Department, and Project tables are shown below:

insert into employee values ('sara', 'girma', 19, '12/2/2004', 'hossana', 'F', 1000, 19)

insert into employee values ('selam', 'john', 15, '12/2/1997', 'hossana', 'F', 1000, 19)

insert into department values ('computer scien', 19, 19, '12/2/2004')

insert into department values ('Mathematics', 11, 19, '12/2/2004')

insert into project values ('research', 1, 'hossana', 19)



When you are inserting values into the Department table, the value that you insert into MgrID should exist in the Employee table's ID column, because it references Employee (ID). A similar concept applies to every table that references another table.

Schema Change Statements in SQL

In this section, we give an overview of the schema evolution commands available in SQL, which
can be used to alter a schema by adding or dropping tables, attributes, constraints, and other
schema elements.

The DROP Command

The DROP command can be used to drop named schema elements, such as tables, domains, or
constraints. One can also drop a schema. For example, if a whole schema is not needed any more,
the DROP SCHEMA command can be used. For example, to remove the COMPANY database
and all its tables, domains, and other elements, it is used as follows:

Drop Database Company

In order to drop the table employee from company database we use the following command

Drop table employee

The ALTER Command

The definition of a base table or of other named schema elements can be changed by using the
ALTER command. For base tables, the possible alter table actions include adding or dropping a
column (attribute), changing a column definition, and adding or dropping table constraints. For
example, to add an attribute for keeping track of jobs of employees to the EMPLOYEE base
relations in the COMPANY schema, we can use the command

Alter Table employee Add Job Varchar (12)



We must still enter a value for the new attribute JOB for each individual EMPLOYEE tuple. This can be done either by specifying a default clause or by using the UPDATE command. If no default clause is specified, the new attribute will have NULLs in all the tuples of the relation immediately after the command is executed; hence, the NOT NULL constraint is not allowed in this case.

To drop a column, we must choose either CASCADE or RESTRICT for drop behavior. If
CASCADE is chosen, all constraints and views that reference the column are dropped
automatically from the schema, along with the column. If RESTRICT is chosen, the command is
successful only if no views or constraints (or other elements) reference the column. For example,
the following command removes the attribute ADDRESS from the EMPLOYEE base table:

Alter Table Employee Drop column Address

It is also possible to alter a column definition by dropping an existing default clause or by defining
a new default clause. The following examples illustrate this clause:

Alter Table Employee Alter Job Drop DEFAULT;

Alter Table Employee Alter Job Set Default '333445555';

The DELETE Command

The DELETE command removes tuples from a relation. It includes a WHERE clause, similar to
that used in an SQL query, to select the tuples to be deleted. Tuples are explicitly deleted from
only one table at a time. However, the deletion may propagate to tuples in other relations
if referential triggered actions are specified in the referential integrity constraints. Depending on
the number of tuples selected by the condition in the WHERE clause, zero, one, or several tuples
can be deleted by a single DELETE command. A missing WHERE clause specifies that all tuples
in the relation are to be deleted; however, the table itself remains in the database as an empty table. The DELETE commands in Q4A to Q4D below illustrate these possibilities:

Q4A: DELETE FROM EMPLOYEE WHERE LNAME='Brown'



Q4B: DELETE FROM EMPLOYEE
WHERE ID='12'

Q4C: DELETE FROM EMPLOYEE
WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT
WHERE DNAME='Research')

Q4D: DELETE FROM EMPLOYEE

The UPDATE Command

The UPDATE command is used to modify attribute values of one or more selected tuples. As in
the DELETE command, a WHERE clause in the UPDATE command selects the tuples to be
modified from a single relation. However, updating a primary key value may propagate to the
foreign key values of tuples in other relations if such a referential triggered action is specified in
the referential integrity constraints. An additional SET clause in the UPDATE command specifies
the attributes to be modified and their new values. For example, to change the location and
controlling department number of project number 10 to ‘Bellaire’ and 5, respectively, we use U5:

U5: UPDATE PROJECT

SET PLOCATION = 'Bellaire', DNUM = 5

WHERE PNUMBER=10;

Several tuples can be modified with a single UPDATE command. An example is to give all
employees in the ‘Research’ department a 10 percent raise in salary, as shown in U6. In this
request, the modified SALARY value depends on the original SALARY value in each tuple, so
two references to the SALARY attribute are needed. In the SET clause, the reference to the
SALARY attribute on the right refers to the old SALARY value before modification, and the one
on the left refers to the new SALARY value after modification:



U6: UPDATE EMPLOYEE

SET SALARY = SALARY *1.1

WHERE DNO IN (SELECT DNUMBER

FROM DEPARTMENT

WHERE DNAME='Research');

It is also possible to specify NULL or DEFAULT as the new attribute value. Notice that each
UPDATE command explicitly refers to a single relation only. To modify multiple relations, we
must issue several UPDATE commands.
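A minimal sketch of these two options (hypothetical predicates, assuming the JOB column added earlier has a declared default and ADDRESS is nullable):

-- Reset JOB to its declared default value for department 5.
UPDATE EMPLOYEE SET JOB = DEFAULT WHERE DNO = 5;
-- Clear the address of employee 12 by setting it to NULL.
UPDATE EMPLOYEE SET ADDRESS = NULL WHERE ID = '12';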

6.3.4. Basic Queries in SQL


SQL has one basic statement for retrieving information from a database: the SELECT statement.
The SELECT statement has no relationship to the SELECT operation of relational algebra, which
was discussed in Chapter 6. There are many options and flavors to the SELECT statement in SQL,
so we will introduce its features gradually.

The SELECT-FROM-WHERE Structure of Basic SQL Queries

Queries in SQL can be very complex. We will start with simple queries, and then progress to more
complex ones in a step-by-step manner. The basic form of the SELECT statement, sometimes
called a mapping or a select-from-where block, is formed of the three clauses SELECT, FROM,
and WHERE and has the following form:

SELECT<attribute list>

FROM<table list>

WHERE<condition>



Where <attribute list> is a list of attribute names whose values are to be retrieved by the query.
<Table list> is a list of the relation names required to process the query.

<Condition> is a conditional (Boolean) expression that identifies the tuples to be retrieved by the
query.

In SQL, the basic logical comparison operators for comparing attribute values with one another
and with literal constants are =, <, <=, >, >=, and <>. SQL has many additional comparison
operators that we shall present gradually as needed.

QUERY 0: Retrieve the birth date and address of the employee whose name is 'sara girma'.

SELECT Bdate, Address

FROM EMPLOYEE

WHERE FNAME='sara' AND LNAME='girma'

This query involves only the EMPLOYEE relation listed in the FROM clause. The
query selects the EMPLOYEE tuples that satisfy the condition of the WHERE clause,
then projects the result on the BDATE and ADDRESS attributes listed in the SELECT clause. Q0
is similar to the following relational algebra expression, except that duplicates, if any, would not be
eliminated:

Π Bdate, Address (σ Fname='sara' AND Lname='girma' (EMPLOYEE))

Hence, a simple SQL query with a single relation name in the FROM clause is similar to a
SELECT-PROJECT pair of relational algebra operations. The SELECT clause of SQL specifies
the projection attributes, and the WHERE clause specifies the selection condition. The only
difference is that in the SQL query we may get duplicate tuples in the result, because the constraint
that a relation is a set is not enforced.

QUERY1



Retrieve the name and address of all employees who work for the ‘Research’ department.

Q1:

SELECT Fname, Lname, Address

FROM EMPLOYEE, DEPARTMENT

WHERE DNAME='Research' AND DNUMBER=DNO

Q1 is similar to a SELECT-PROJECT-JOIN sequence of relational algebra operations. Such queries are often called select-project-join queries. In the WHERE clause of Q1, the condition
DNAME = ‘Research’ is a selection condition and corresponds to a SELECT operation in the
relational algebra. The condition DNUMBER = DNO is a join condition, which corresponds to a
JOIN condition in the relational algebra.

In general, any number of select and join conditions may be specified in a single SQL query.

The next example is a select-project-join query with two join conditions.

QUERY2

For every project located in ‘Stafford’, list the project number, the controlling department number,
and the department manager’s last name, address, and birth date.

Q2:

SELECT Pnumber, Dnum, Lname, Address, Bdate

FROM Project, Department, Employee

WHERE Dnum=Dnumber AND MgrID=ID AND Plocation='Stafford'



The join condition DNUM = DNUMBER relates a project to its controlling department, whereas
the join condition MgrID = ID relates the controlling department to the employee who manages
that department.

Ambiguous Attribute Names, Aliasing, and Tuple Variables

In SQL the same name can be used for two (or more) attributes as long as the attributes are
in different relations. If this is the case, and a query refers to two or more attributes with the same
name, we must qualify the attribute name with the relation name to prevent ambiguity. This is
done by prefixing the relation name to the attribute name and separating the two by a period. To
illustrate this, suppose that DNO and LNAME attributes of the EMPLOYEE relation were called
DNUMBER and NAME, and the DNAME attribute of DEPARTMENT was also called NAME;
then, to prevent ambiguity, query Q1 would be rephrased as shown in Q1A. We must prefix the
attributes NAME and DNUMBER in Q1A to specify which ones we are referring to, because the
attribute names are used in both relations:

Q1A:

SELECT Fname, Employee.Name, Address

FROM Employee, Department

WHERE Department.Name='Research' AND Department.Dnumber=Employee.Dnumber;

Ambiguity also arises in the case of queries that refer to the same relation twice, as in the
following example.

QUERY 8

For each employee, retrieve the employee’s first and last name and the first and last name of his
or her immediate supervisor.



Q8:

SELECT E.Fname, E.Lname, S.Fname, S.Lname

FROM EMPLOYEE AS E, EMPLOYEE AS S

WHERE E.SUPERID=S.ID

In this case, we are allowed to declare alternative relation names E and S, called aliases or tuple
variables, for the EMPLOYEE relation. An alias can follow the keyword AS, as shown in Q8, or
it can directly follow the relation name-for example, by writing EMPLOYEE E, EMPLOYEE S
in the FROM clause of Q8. It is also possible to rename the relation attributes within the query in
SQL by giving them aliases. For example, if we write EMPLOYEE AS E (FN, MI, LN, ID, SD,
ADDR, SEX, SAL, SID, DNO) in the FROM clause, FN becomes an alias for FNAME, MI for MINIT, LN for LNAME, and so on.

In Q8, we can think of E and S as two different copies of the EMPLOYEE relation; the first, E,
represents employees in the role of supervisees; the second, S, represents employees in the role of
supervisors. We can now join the two copies. Of course, in reality there is only one EMPLOYEE
relation, and the join condition is meant to join the relation with itself by matching the tuples that
satisfy the join condition E.SUPERID = S.ID. Notice that this is an example of a one-level
recursive query.

Whenever one or more aliases are given to a relation, we can use these names to represent different references to that same relation. This permits multiple references to the same relation within a query. Note that it is possible to use this alias-naming mechanism in any SQL query to specify tuple variables for every table in the WHERE clause, whether or not the same relation needs to be referenced more than once. In fact, this practice is recommended, since it results in queries that are easier to comprehend.

For example, we could specify query Q1A as in Q1B:



Q1B:

SELECT E.FNAME, E.NAME, E.ADDRESS

FROM EMPLOYEE E, DEPARTMENT D

WHERE D.NAME='Research' AND D.DNUMBER=E.DNUMBER

If we specify tuple variables for every table in the WHERE clause, a select-project-join query in
SQL closely resembles the corresponding tuple relational calculus expression (except for duplicate
elimination).

Unspecified WHERE Clause and Use of the Asterisk

We discuss two more features of SQL here. A missing WHERE clause indicates no condition on
tuple selection; hence, all tuples of the relation specified in the FROM clause qualify and are
selected for the query result. If more than one relation is specified in the FROM clause and there
is no WHERE clause, then the CROSS PRODUCT (all possible tuple combinations) of these relations is selected. For example, Query 9 selects all EMPLOYEE IDs, and Query 10 selects all combinations of an EMPLOYEE ID and a DEPARTMENT DNAME.

QUERIES 9 AND 10: Select all EMPLOYEE IDs (Q9), and all combinations of EMPLOYEE ID and DEPARTMENT DNAME (Q10) in the database.

Q9: SELECT ID

FROM EMPLOYEE

Q10: SELECT ID, DNAME

FROM EMPLOYEE, DEPARTMENT

It is extremely important to specify every selection and join condition in the WHERE clause; if
any such condition is overlooked, incorrect and very large relations may result.



Notice that Q10 is similar to a CROSS PRODUCT operation followed by a PROJECT operation in relational algebra. If we specify all the attributes of EMPLOYEE and DEPARTMENT in Q10, we get the CROSS PRODUCT itself (except for duplicate elimination, if any).

To retrieve all the attribute values of the selected tuples, we do not have to list the attribute names
explicitly in SQL; we just specify an asterisk (*), which stands for all the attributes. For example,
query Q1C retrieves all the attribute values of any EMPLOYEE who works in DEPARTMENT
number 5, query Q1D retrieves all the attributes of an EMPLOYEE and the attributes of the
DEPARTMENT in which he or she works for every employee of the 'Research' department, and Q10A specifies the CROSS PRODUCT of the EMPLOYEE and DEPARTMENT relations.

Q1C:

Select *

From Employee

Where Dno=5

Q1D:

Select *

From Employee, Department

Where Dname='Research' and Dno=Dnumber

Q10A:

Select *

From Employee, Department

Tables as Sets in SQL

As we mentioned earlier, SQL usually treats a table not as a set but rather as a multiset; duplicate
tuples can appear more than once in a table, and in the result of a query. SQL does not
automatically eliminate duplicate tuples in the results of queries, for the following reasons:



- Duplicate elimination is an expensive operation. One way to implement it is to sort the tuples first and then eliminate duplicates.
- The user may want to see duplicate tuples in the result of a query.
- When an aggregate function is applied to tuples, in most cases we do not want to eliminate duplicates.

An SQL table with a key is restricted to being a set, since the key value must be distinct in each
tuple. If we do want to eliminate duplicate tuples from the result of an SQL query, we use the
keyword DISTINCT in the SELECT clause, meaning that only distinct tuples should remain in the
result. In general, a query with SELECT DISTINCT eliminates duplicates, whereas a query with
SELECT ALL does not. Specifying SELECT with neither ALL nor DISTINCT (as in our previous examples) is equivalent to SELECT ALL. For example, Query 11 retrieves the salary of every employee; if several employees have the same salary, that salary value will appear as many times in the result of the query. If we are interested only in distinct salary values, we want each value to appear only once, regardless of how many employees earn that salary; we achieve this by using the keyword DISTINCT, as in Q11A.

QUERY 11

Retrieve the salary of every employee (Q11) and all distinct salary values (Q11A).

Q11:

SELECT ALL SALARY

FROM EMPLOYEE

Q11A:

SELECT DISTINCT SALARY

FROM EMPLOYEE

SQL has directly incorporated some of the set operations of relational algebra. There are set union
(UNION), set difference (EXCEPT), and set intersection (INTERSECT) operations. The relations



resulting from these set operations are sets of tuples; that is, duplicate tuples are eliminated from
the result. Because these set operations apply only to union-compatible relations, we must make
sure that the two relations on which we apply the operation have the same attributes and that the
attributes appear in the same order in both relations. The next example illustrates the use of
UNION.

QUERY 4

Make a list of all project numbers for projects that involve an employee whose last name is ‘girma’,
either as a worker or as a manager of the department that controls the project.

Q4:

(SELECT DISTINCT PNUMBER

FROM PROJECT, DEPARTMENT, EMPLOYEE

WHERE DNUM=DNUMBER AND MGRID=ID AND LNAME='girma')

UNION

(SELECT DISTINCT PNUMBER

FROM PROJECT, WORKS_ON, EMPLOYEE

WHERE PNUMBER=PNO AND EID=ID AND LNAME='girma');

The first SELECT query retrieves the projects that involve a ‘girma’ as manager of the department
that controls the project, and the second retrieves the projects that involve a ‘girma’ as a worker
on the project. Notice that if several employees have the last name ‘girma’, the project names
involving any of them will be retrieved. Applying the UNION operation to the two SELECT
queries gives the desired result. SQL also has corresponding multiset operations, which are
followed by the keyword ALL (UNION ALL, EXCEPT ALL,



INTERSECT ALL). Their results are multisets (duplicates are not eliminated).

Substring Pattern Matching and Arithmetic Operators

In this section we discuss several more features of SQL. The first feature allows comparison
conditions on only parts of a character string, using the LIKE comparison operator. This can be
used for string pattern matching. Partial strings are specified using two reserved characters: %
replaces an arbitrary number of zero or more characters, and the underscore (_) replaces a single
character. For example, consider the following query.

QUERY 12 Retrieve all employees whose address is in Houston, Texas.

Q12:

SELECT FNAME, LNAME

FROM EMPLOYEE

WHERE ADDRESS LIKE '%Houston, TX%';

To retrieve all employees who were born during the 1950s, we can use Query 12A. Here, '5' must be the third character of the string (according to our format for date), so we use the value '_ _ 5 _ _ _ _', with each underscore serving as a placeholder for an arbitrary character.

QUERY 12A: Find all employees who were born during the 1950s.

Q12A:

SELECT FNAME, LNAME

FROM EMPLOYEE

WHERE BDATE LIKE '_ _ 5 _ _ _ _'



If an underscore or % is needed as a literal character in the string, the character should be preceded
by an escape character, which is specified after the string using the keyword ESCAPE. For
example, ‘AB\_CD\%EF’ ESCAPE ‘\’ represents the literal string ‘AB_CD%EF’, because \ is
specified as the escape character. Any character not used in the string can be chosen as the escape
character. Also, we need a rule to specify apostrophes, or single quotation marks ('), if they are to be included in a string, because they are used to begin and end strings. If an apostrophe (') is needed, it is represented as two consecutive apostrophes ('') so that it will not be interpreted as ending the string.
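A minimal sketch of both escaping rules (hypothetical data values):

-- The backslash escapes the underscore, so 'C_lang' is matched literally.
SELECT PNAME FROM PROJECT WHERE PNAME LIKE '%C\_lang%' ESCAPE '\';
-- An embedded apostrophe is doubled: this matches the last name O'Brien.
SELECT FNAME, LNAME FROM EMPLOYEE WHERE LNAME = 'O''Brien';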

Another feature allows the use of arithmetic in queries. The standard arithmetic operators for
addition (+), subtraction (-), multiplication (*), and division (/) can be applied to numeric values
or attributes with numeric domains. For example, suppose that we want to see the effect of giving
all employees who work on the 'ProductX' project a 10 percent raise; we can issue Query 13 to see
what their salaries would become. This example also shows how we can rename an attribute in the
query result using AS in the SELECT clause.

QUERY 13 Show the resulting salaries if every employee working on the ‘ProductX’ project is
given a 10 percent raise.

Q13:

SELECT FNAME, LNAME, 1.1*SALARY AS INCREASED_SAL

FROM EMPLOYEE, WORKS_ON, PROJECT

WHERE ID=EID AND PNO=PNUMBER AND PNAME='ProductX'

For string data types, the concatenation operator || can be used in a query to append two string values.
For date, time, timestamp, and interval data types, operators include incrementing (+) or
decrementing (-) a date, time, or timestamp by an interval. In addition, an interval value is the
result of the difference between two date, time, or timestamp values. Another comparison operator



that can be used for convenience is BETWEEN, which is illustrated in Query 14.

QUERY 14 Retrieve all employees in department 5 whose salary is between $30,000 and $40,000.

Q14: SELECT *

FROM EMPLOYEE

WHERE (SALARY BETWEEN 30000 AND 40000) AND DNO =5;

The condition (SALARY BETWEEN 30000 AND 40000) in Q14 is equivalent to the condition
((SALARY >= 30000) AND (SALARY <= 40000)).
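A minimal sketch of the concatenation operator and the BETWEEN equivalence above (assuming the EMPLOYEE table):

-- Append first and last names into a single result column.
SELECT FNAME || ' ' || LNAME AS FULL_NAME FROM EMPLOYEE;
-- The explicit form of the BETWEEN condition in Q14.
SELECT * FROM EMPLOYEE WHERE (SALARY >= 30000) AND (SALARY <= 40000) AND DNO = 5;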

Ordering of Query Results

SQL allows the user to order the tuples in the result of a query by the values of one or more
attributes, using the ORDER BY clause. This is illustrated by Query 15.

QUERY 15 Retrieve a list of employees and the projects they are working on, ordered by
department and, within each department, ordered alphabetically by last name, first name.

Q15: SELECT DNAME, LNAME, FNAME, PNAME

FROM DEPARTMENT, EMPLOYEE, WORKS_ON, PROJECT

WHERE DNUMBER=DNO AND ID=EID AND PNO=PNUMBER

ORDER BY DNAME, LNAME, FNAME

The default order is in ascending order of values. We can specify the keyword DESC if we want
to see the result in a descending order of values. The keyword ASC can be used to specify
ascending order explicitly. For example, if we want descending order on DNAME and ascending
order on LNAME, FNAME, the ORDER BY clause of Q15 can be written as

ORDER BY DNAME DESC, LNAME ASC, FNAME ASC



More Complex SQL Queries

In the previous section, we described some basic types of queries in SQL. Because of the generality
and expressive power of the language, there are many additional features that allow users to specify
more complex queries. We discuss several of these features in this section.

Nested Queries, Tuples, and Set/Multiset Comparisons

Some queries require that existing values in the database be fetched and then used in a comparison
condition. Such queries can be conveniently formulated by using nested queries, which are
complete select-from-where blocks within the WHERE clause of another query. That other query
is called the outer query. Query 4 is formulated in Q4 without a nested query, but it can be
rephrased to use nested queries as shown in Q4A. Q4A introduces the comparison operator IN,
which compares a value v with a set (or multiset) of values V and evaluates to TRUE if v is one of
the elements in V

Q4A: SELECT DISTINCT PNUMBER

FROM PROJECT

WHERE PNUMBER IN (SELECT PNUMBER

FROM PROJECT, DEPARTMENT,

EMPLOYEE

WHERE DNUM=DNUMBER AND MGRID=ID AND

LNAME='Girma')



OR

PNUMBER IN (SELECT PNO

FROM WORKS_ON, EMPLOYEE

WHERE EID=ID AND LNAME='Girma')

The first nested query selects the project numbers of projects that have a ‘GIRMA’ involved as
manager, while the second selects the project numbers of projects that have a ‘GIRMA’ involved
as worker. In the outer query, we use the OR logical connective to retrieve a PROJECT tuple if
the PNUMBER value of that tuple is in the result of either nested query. If a nested query returns
a single attribute and a single tuple, the query result will be a single (scalar) value. In such cases,
it is permissible to use = instead of IN for the comparison operator. In general, the nested query
will return a table (relation), which is a set or multiset of tuples.

SQL allows the use of tuples of values in comparisons by placing them within parentheses.

To illustrate this, consider the following query:

SELECT DISTINCT EID

FROM WORKS_ON

WHERE (PNO, HOURS) IN (SELECT PNO, HOURS FROM WORKS_ON

WHERE EID='12')

This query will select the identity numbers of all employees who work the same (project, hours) combination on some project that employee 'John Smith' (whose ID = '12') works on. In this
example, the IN operator compares the sub tuple of values in parentheses (PNO, HOURS) for each
tuple in WORKS_ON with the set of union-compatible tuples produced by the nested query.



In addition to the IN operator, a number of other comparison operators can be used to compare a
single value v (typically an attribute name) to a set or multiset V (typically a nested query). The =
ANY (or = SOME) operator returns TRUE if the value v is equal to some value in the set V and is
hence equivalent to IN. The keywords ANY and SOME have the same meaning. Other operators
that can be combined with ANY (or SOME) include >, >=, <, <=, and < >. The keyword ALL can
also be combined with each of these operators.

For example, the comparison condition (v > ALL V) returns TRUE if the value v is greater
than all the values in the set (or multiset) V. An example is the following query, which returns the
names of employees whose salary is greater than the salary of all the employees in department 5:

SELECT LNAME, FNAME

FROM EMPLOYEE

WHERE SALARY > ALL (SELECT SALARY

FROM EMPLOYEE

WHERE DNO=5)

In general, we can have several levels of nested queries. We can once again be faced with possible
ambiguity among attribute names if attributes of the same name exist-one in a relation in the
FROM clause of the outer query, and another in a relation in the FROM clause of the nested
query. The rule is that a reference to an unqualified attribute refers to the relation declared in the
innermost nested query. For example, in the SELECT clause and WHERE clause of the first nested
query of Q4A, a reference to any unqualified attribute of the PROJECT relation refers to the
PROJECT relation specified in the FROM clause of the nested query. To refer to an attribute of
the PROJECT relation specified in the outer query, we can specify and refer to an alias (tuple
variable) for that relation. These rules are similar to scope rules for program variables in most
programming languages that allow nested procedures and functions. To illustrate the potential



ambiguity of attribute names in nested queries, consider Query 16.

QUERY 16 Retrieve the name of each employee who has a dependent with the same first name and same sex as the employee.

Q16: SELECT E.FNAME, E.LNAME

FROM EMPLOYEE AS E

WHERE E.ID IN (SELECT EID

FROM DEPENDENT

WHERE E.FNAME=DEPENDENT_NAME

AND E.SEX=SEX)

In the nested query of Q16, we must qualify E.SEX because it refers to the SEX attribute of
EMPLOYEE from the outer query, and DEPENDENT also has an attribute called SEX. All
unqualified references to SEX in the nested query refer to SEX of DEPENDENT. However, we
do not have to qualify FNAME and ID because the DEPENDENT relation does not have attributes
called FNAME and ID, so there is no ambiguity.

It is generally advisable to create tuple variables (aliases) for all the tables referenced in an SQL
query to avoid potential errors and ambiguities.

Correlated Nested Queries

Whenever a condition in the WHERE clause of a nested query references some attribute of a
relation declared in the outer query, the two queries are said to be correlated. We can understand
a correlated query better by considering that the nested query is evaluated once for each tuple (or
combination of tuples) in the outer query. For example, we can think of Q16 as follows:
For each EMPLOYEE tuple, evaluate the nested query, which retrieves the EID values for all DEPENDENT tuples with the same sex and name as that EMPLOYEE tuple; if the ID value of the EMPLOYEE tuple is in the result of the nested query, then select that EMPLOYEE tuple.



In general, a query written with nested select-from-where blocks and using the = or IN comparison
operators can always be expressed as a single block query. For example, Q16 may be written as in
Q16A:

Q16A: SELECT E.FNAME, E.LNAME

FROM EMPLOYEE AS E, DEPENDENT AS D

WHERE E.ID=D.EID AND E.SEX=D.SEX AND E.FNAME=D.DEPENDENT_NAME

The original SQL implementation on SYSTEM R also had a CONTAINS comparison operator,
which was used to compare two sets or multisets. This operator was subsequently dropped from
the language, possibly because of the difficulty of implementing it efficiently. Most commercial
implementations of SQL do not have this operator. The CONTAINS operator compares two sets
of values and returns TRUE if one set contains all values in the other set. Query 3 illustrates the
use of the CONTAINS operator.

QUERY 3 Retrieve the name of each employee who works on all the projects controlled by
department number 5.

Q3: SELECT FNAME, LNAME

FROM EMPLOYEE

WHERE ((SELECT PNO

FROM WORKS_ON

WHERE ID=EID)

CONTAINS

(SELECT PNUMBER



FROM PROJECT

WHERE DNUM=5))

In Q3, the second nested query (which is not correlated with the outer query) retrieves the project
numbers of all projects controlled by department 5. For each employee tuple, the first nested query
(which is correlated) retrieves the project numbers on which the employee works; if these contain
all projects controlled by department 5, the employee tuple is selected and the name of that
employee is retrieved. Notice that the CONTAINS comparison operator has a similar function to
the DIVISION operation of the relational algebra.

Because the CONTAINS operation is not part of SQL, we have to use other techniques, such as the EXISTS function, to specify these types of queries.

The EXISTS and UNIQUE Functions in SQL

The EXISTS function in SQL is used to check whether the result of a correlated nested query is
empty (contains no tuples) or not. We illustrate the use of EXISTS-and NOT EXISTS-with some
examples. First, we formulate Query 16 in an alternative form that uses EXISTS:

Q16B: SELECT E.FNAME, E.LNAME

FROM EMPLOYEE AS E

WHERE

EXISTS (SELECT *

FROM DEPENDENT



WHERE E.ID=EID AND E.SEX=SEX

AND E.FNAME=DEPENDENT_NAME)

EXISTS and NOT EXISTS are usually used in conjunction with a correlated nested query. In Q16B, the nested query references the ID, FNAME, and SEX attributes of the EMPLOYEE relation from
the outer query. We can think of Q16B as follows: For each EMPLOYEE tuple, evaluate the nested
query, which retrieves all DEPENDENT tuples with the same Identity number, sex, and name as
the EMPLOYEE tuple; if at least one tuple EXISTS in the result of the nested query, then select
that EMPLOYEE tuple. In general, EXISTS (Q) returns TRUE if there is at least one tuple in the
result of the nested query Q, and it returns FALSE otherwise. On the other hand, NOT EXISTS (Q) returns TRUE if there are no tuples in the result of nested query Q, and it returns FALSE otherwise. Next, we illustrate the use of NOT EXISTS.

QUERY 6 Retrieve the names of employees who have no dependents.

Q6: SELECT FNAME, LNAME

FROM EMPLOYEE

WHERE NOT EXISTS (SELECT *

FROM DEPENDENT

WHERE ID=EID)

In Q6, the correlated nested query retrieves all DEPENDENT tuples related to a particular
EMPLOYEE tuple. If none exist, the EMPLOYEE tuple is selected. We can explain Q6 as follows:

For each EMPLOYEE tuple, the correlated nested query selects all DEPENDENT tuples whose
EID value matches the EMPLOYEE ID; if the result is empty, no dependents are related to the employee, so we select that EMPLOYEE tuple and retrieve its FNAME and



LNAME.

QUERY 7 List the names of managers who have at least one dependent.

Q7: SELECT FNAME, LNAME

FROM EMPLOYEE

WHERE EXISTS (SELECT *
FROM DEPENDENT
WHERE ID=EID)

AND EXISTS (SELECT *
FROM DEPARTMENT
WHERE ID=MGRID)

One way to write this query is shown in Q7, where we specify two nested correlated queries; the
first selects all DEPENDENT tuples related to an EMPLOYEE, and the second selects all
DEPARTMENT tuples managed by the EMPLOYEE. If at least one of the first and at least one
of the second exists, we select the EMPLOYEE tuple. Can you rewrite this query using only a
single nested query or no nested queries?
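One possible rewrite, offered as a sketch rather than the module's official answer, replaces the two correlated subqueries with a join and a single non-correlated nested query:

SELECT DISTINCT FNAME, LNAME
FROM EMPLOYEE, DEPARTMENT
WHERE ID = MGRID                         -- keep only employees who manage a department
AND ID IN (SELECT EID FROM DEPENDENT)    -- ...and who have at least one dependent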

Query 3 ("Retrieve the name of each employee who works on all the projects controlled by department number 5") can be stated using EXISTS and NOT EXISTS in SQL systems. There are two options. The first is to use the well-known set-theory transformation that (S1 CONTAINS S2) is logically equivalent to "(S2 EXCEPT S1) is empty". This option is shown as Q3A.

Q3A: SELECT FNAME, LNAME

FROM EMPLOYEE



WHERE NOT EXISTS ((SELECT PNUMBER

FROM PROJECT WHERE DNUM=5) EXCEPT

(SELECT PNO

FROM WORKS_ON

WHERE ID=EID))

In Q3A, the first sub query (which is not correlated) selects all projects controlled by department
5, and the second sub query (which is correlated) selects all projects that the particular employee
being considered works on. If the set difference of the first sub query MINUS (EXCEPT) the
second sub query is empty, it means that the employee works on all the projects and is hence
selected. The second option is shown as Q3B. Notice that we need two-level nesting in Q3B and
that this formulation is quite a bit more complex than Q3, which used the CONTAINS comparison
operator, and Q3A, which uses NOT EXISTS and EXCEPT. However, CONTAINS is not part of
SQL, and not all relational systems have the EXCEPT operator even though it is part of SQL-99.

Q3B: SELECT LNAME, FNAME

FROM EMPLOYEE

WHERE NOT EXISTS

(SELECT *

FROM WORKS_ON B

WHERE (B.PNO IN (SELECT PNUMBER



FROM PROJECT WHERE DNUM=5))

AND

NOT EXISTS (SELECT *

FROM WORKS_ON C

WHERE C.EID=ID

AND C.PNO=B.PNO))

In Q3B, the outer nested query selects any WORKS_ON (B) tuples whose PNO is of a project
controlled by department 5, if there is not a WORKS_ON (C) tuple with the same PNO and the same EID as the ID of the EMPLOYEE tuple under consideration in the outer query. If no such tuple
exists, we select the EMPLOYEE tuple. The form of Q3B matches the following rephrasing of
Query 3: Select each employee such that there does not exist a project controlled by department 5
that the employee does not work on. There is another SQL function, UNIQUE (Q), which returns
TRUE if there are no duplicate tuples in the result of query Q; otherwise, it returns FALSE. This
can be used to test whether the result of a nested query is a set or a multiset.

Explicit Sets and Renaming of Attributes in SQL

We have seen several queries with a nested query in the WHERE clause. It is also possible to use
an explicit set of values in the WHERE clause, rather than a nested query. Such a set is enclosed
in parentheses in SQL.

QUERY 17 Retrieve the IDENTITY numbers of all employees who work on project numbers

1, 2, or 3.



Q17: SELECT DISTINCT EID

FROM WORKS_ON

WHERE PNO IN (1, 2, 3)

In SQL, it is possible to rename any attribute that appears in the result of a query by adding the
qualifier AS followed by the desired new name. Hence, the AS construct can be used to alias both
attribute and relation names, and it can be used in both the SELECT and FROM clauses. For
example, Q8A shows how query Q8 can be slightly changed to retrieve the last name of each
employee and his or her supervisor, while renaming the resulting attribute names as
EMPLOYEE_NAME and SUPERVISOR_NAME. The new names will appear as column headers
in the query result.

Q8A: SELECT E.LNAME AS EMPLOYEE_NAME, S.LNAME AS

SUPERVISOR_NAME

FROM EMPLOYEE AS E, EMPLOYEE AS S

WHERE E.SUPERID=S.ID

Joined Tables in SQL

The concept of a joined table (or joined relation) was incorporated into SQL to permit users to
specify a table resulting from a join operation in the FROM clause of a query. This construct may
be easier to comprehend than mixing together all the select and join conditions in the WHERE
clause. For example, consider query Q1, which retrieves the name and address of every employee
who works for the ‘Research’ department. It may be easier first to specify the join of the
EMPLOYEE and DEPARTMENT relations, and then to select the desired tuples and attributes.
This can be written in SQL as in Q1A:

Q1A: SELECT FNAME, LNAME, ADDRESS



FROM (EMPLOYEE JOIN DEPARTMENT ON DNO=DNUMBER)

WHERE DNAME='Research'

The FROM clause in Q1A contains a single joined table. The attributes of such a table are all the
attributes of the first table, EMPLOYEE, followed by all the attributes of the second table,
DEPARTMENT. The concept of a joined table also allows the user to specify different types of
join, such as NATURAL JOIN and various types of OUTER JOIN. In a NATURAL JOIN on two
relations R and S, no join condition is specified; an implicit equijoin condition for each pair of
attributes with the same name from R and S is created. If the names of the join attributes are not
the same in the base relations, it is possible to rename the attributes so that they match, and then
to apply NATURAL JOIN. In this case, the AS construct can be used to rename a relation and all
its attributes in the FROM clause. This is illustrated in Q1B, where the DEPARTMENT relation
is renamed as DEPT and its attributes are renamed as DNAME, DNO (to match the name of the
desired join attribute DNO in EMPLOYEE), MID, and MSDATE. The implied join condition for
this NATURAL JOIN is EMPLOYEE.DNO = DEPT.DNO, because this is the only pair of
attributes with the same name after renaming.

Q1B: SELECT FNAME, LNAME, ADDRESS

FROM (EMPLOYEE NATURAL JOIN (DEPARTMENT AS DEPT (DNAME, DNO, MID, MSDATE)))

WHERE DNAME='Research';

The default type of join in a joined table is an inner join, where a tuple is included in the result
only if a matching tuple exists in the other relation. For example, in query Q8A, only employees
that have a supervisor are included in the result; an EMPLOYEE tuple whose value for SUPERID
is NULL is excluded. If the user requires that all employees be included, an OUTER JOIN must
be used explicitly. In SQL, this is handled by explicitly specifying the OUTER JOIN in a joined
table, as illustrated in Q8B:



Q8B: SELECT E.LNAME AS EMPLOYEE_NAME, S.LNAME AS

SUPERVISOR_NAME

FROM (EMPLOYEE AS E LEFT OUTER JOIN EMPLOYEE AS S ON

E.SUPERID=S.ID)

The options available for specifying joined tables in SQL include INNER JOIN (same as JOIN),
LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN. In the latter three options,
the keyword OUTER may be omitted. If the join attributes have the same name, one may also
specify the natural join variation of outer joins by using the keyword NATURAL before the
operation (for example, NATURAL LEFT OUTER JOIN). The keyword CROSS JOIN is used to specify the Cartesian product operation.

It is also possible to nest join specifications; that is, one of the tables in a join may itself be a joined
table. This is illustrated by Q2A, which is a different way of specifying query Q2, using the
concept of a joined table:

Q2A: SELECT PNUMBER, DNUM, LNAME, ADDRESS, BDATE

FROM ((PROJECT JOIN DEPARTMENT ON DNUM=DNUMBER)

JOIN EMPLOYEE ON MGRID=ID)

WHERE PLOCATION='Stafford';

Aggregate Functions in SQL



In chapter six, we introduced the concept of an aggregate function as a relational operation.

Because grouping and aggregation are required in many database applications, SQL has features
that incorporate these concepts. A number of built-in functions exist: COUNT, SUM, MAX, MIN,
and AVG. The COUNT function returns the number of tuples or values as specified in a query.
The functions SUM, MAX, MIN, and AVG are applied to a set or multiset of numeric values and
return, respectively, the sum, maximum value, minimum value, and average (mean) of those
values. These functions can be used in the SELECT clause or in a HAVING clause (which we
introduce later). The functions MAX and MIN can also be used with attributes that have
nonnumeric domains if the domain values have a total ordering among one another. We illustrate
the use of these functions with example queries.

QUERY 19 Find the sum of the salaries of all employees, the maximum salary, the minimum
salary, and the average salary.

Q19: SELECT SUM (SALARY), MAX (SALARY), MIN (SALARY),

AVG (SALARY)

FROM EMPLOYEE

If we want to get the preceding function values for employees of a specific department-say, the
‘Research’ department-we can write Query 20, where the EMPLOYEE tuples are restricted by the
WHERE clause to those employees who work for the ‘Research’ department.

QUERY 20 Find the sum of the salaries of all employees of the ‘Research’ department, as well as
the maximum salary, the minimum salary, and the average salary in this department.

Q20: SELECT SUM (SALARY), MAX (SALARY), MIN (SALARY),

AVG (SALARY)

FROM (EMPLOYEE JOIN DEPARTMENT ON DNO=DNUMBER)



WHERE DNAME='Research'

QUERIES 21 AND 22 Retrieve the total number of employees in the company (Q21) and the
number of employees in the ‘Research’ department (Q22).

Q21: SELECT COUNT (*)

FROM EMPLOYEE;

Q22: SELECT COUNT (*)

FROM EMPLOYEE, DEPARTMENT

WHERE DNO=DNUMBER AND DNAME='Research'

Here the asterisk (*) refers to the rows (tuples), so COUNT (*) returns the number of rows in the
result of the query. We may also use the COUNT function to count values in a column rather than
tuples, as in the next example.

QUERY 23 Count the number of distinct salary values in the database.

Q23: SELECT COUNT (DISTINCT SALARY)

FROM EMPLOYEE

If we write COUNT (Salary) instead of COUNT (DISTINCT SALARY) in Q23, then duplicate
values will not be eliminated. However, any tuples with NULL for SALARY will not be counted.
In general, NULL values are discarded when aggregate functions are applied to a particular column
(attribute).
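A minimal sketch contrasting the three counting forms (assuming some tuples may have a NULL SALARY):

-- COUNT(*) counts all rows; COUNT(SALARY) skips NULL salaries;
-- COUNT(DISTINCT SALARY) additionally collapses duplicate values.
SELECT COUNT(*), COUNT(SALARY), COUNT(DISTINCT SALARY)
FROM EMPLOYEE;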

The preceding examples summarize a whole relation (Q19, Q21, Q23) or a selected subset of
tuples (Q20, Q22), and hence all produce single tuples or single values. They illustrate how
functions are applied to retrieve a summary value or summary tuple from the database. These
functions can also be used in selection conditions involving nested queries. We can specify a



correlated nested query with an aggregate function, and then use the nested query in the WHERE
clause of an outer query. For example, to retrieve the names of all employees who have two or
more dependents (Query 5), we can write the following:

Q5: SELECT LNAME, FNAME

FROM EMPLOYEE

WHERE

(SELECT COUNT (*) FROM DEPENDENT

WHERE ID=EID) >= 2;

The correlated nested query counts the number of dependents that each employee has; if this is
greater than or equal to two, the employee tuple is selected.

Grouping: The GROUP BY and HAVING Clauses

In many cases we want to apply the aggregate functions to subgroups of tuples in a relation, where
the subgroups are based on some attribute values. For example, we may want to find the average
salary of employees in each department or the number of employees who work on each project. In
these cases we need to partition the relation into nonoverlapping subsets (or groups) of tuples. Each group (partition) will consist of the tuples that have the same value of some attribute(s), called the grouping attribute(s). We can then apply the function to each such group independently. SQL
has a GROUP BY clause for this purpose.

The GROUP BY clause specifies the grouping attributes, which should also
appear in the SELECT clause, so that the value resulting from applying each aggregate function
to a group of tuples appears along with the value of the grouping attribute(s).

QUERY 24 For each department, retrieve the department number, the number of employees in the
department, and their average salary.



Q24: SELECT DNO, COUNT (*), AVG (SALARY)

FROM EMPLOYEE

GROUP BY DNO;

In Q24, the EMPLOYEE tuples are partitioned into groups-each group having the same value for
the grouping attribute DNO. The COUNT and AVG functions are applied to each such group of
tuples. Notice that the SELECT clause includes only the grouping attribute and the functions to be
applied on each group of tuples. If NULLs exist in the grouping attribute, then a separate group is
created for all tuples with a NULL value in the grouping attribute. For example, if the
EMPLOYEE table had some tuples that had NULL for the grouping attribute DNO, there would
be a separate group for those tuples in the result of Q24.

QUERY 25 For each project, retrieve the project number, the project name, and the number of
employees who work on that project.

Q25: SELECT PNUMBER, PNAME, COUNT (*)

FROM PROJECT, WORKS_ON

WHERE PNUMBER=PNO

GROUP BY PNUMBER, PNAME

Q25 shows how we can use a join condition in conjunction with GROUP BY. In this case, the
grouping and functions are applied after the joining of the two relations. Sometimes we want to
retrieve the values of these functions only for groups that satisfy certain conditions. For example,
suppose that we want to modify Query 25 so that only projects with more than two employees
appear in the result. SQL provides a HAVING clause, which can appear in conjunction with a
GROUP BY clause, for this purpose. HAVING provides a condition on the group of tuples
associated with each value of the grouping attributes. Only the groups that satisfy the condition
are retrieved in the result of the query. This is illustrated by Query 26.



QUERY 26 For each project on which more than two employees work, retrieve the project number, the project name, and the number of employees who work on the project.

Q26: SELECT PNUMBER, PNAME, COUNT (*)

FROM PROJECT, WORKS_ON

WHERE PNUMBER=PNO

GROUP BY PNUMBER, PNAME

HAVING COUNT (*) > 2

Notice that, while selection conditions in the WHERE clause limit the tuples to which functions
are applied, the HAVING clause serves to choose whole groups.

QUERY 27 For each project, retrieve the project number, the project name, and the number of
employees from department 5 who work on the project.

Q27: SELECT PNUMBER, PNAME, COUNT (*)

FROM PROJECT, WORKS_ON, EMPLOYEE

WHERE PNUMBER=PNO AND ID=EID AND DNO=5

GROUP BY PNUMBER, PNAME

Here we restrict the tuples in the relation (and hence the tuples in each group) to those that satisfy the condition specified in the WHERE clause, namely, that they work in department number 5.
Notice that we must be extra careful when two different conditions apply (one to the function in
the SELECT clause and another to the function in the HAVING clause). For example, suppose
that we want to count the total number of employees whose salaries exceed $40,000 in each
department, but only for departments where more than five employees work. Here, the condition



(SALARY > 40000) applies only to the COUNT function in the SELECT clause. Suppose that we
write the following incorrect query:

SELECT DNAME, COUNT (*)

FROM DEPARTMENT, EMPLOYEE

WHERE DNUMBER=DNO AND SALARY>40000

GROUP BY DNAME

HAVING COUNT (*) > 5

This is incorrect because it will select only departments that have more than five employees who each earn more than $40,000. The rule is that the WHERE clause is executed first, to select
individual tuples; the HAVING clause is applied later, to select individual groups of tuples. Hence,
the tuples are already restricted to employees who earn more than $40,000, before the function in
the HAVING clause is applied. One way to write this query correctly is to use a nested query, as
shown in Query 28.

QUERY 28 For each department that has more than five employees, retrieve the department
number and the number of its employees who are making more than $40,000.

Q28: SELECT DNUMBER, COUNT (*)

FROM DEPARTMENT, EMPLOYEE

WHERE DNUMBER=DNO AND SALARY>40000 AND

DNO IN (SELECT DNO

FROM EMPLOYEE

GROUP BY DNO



HAVING COUNT (*) > 5)

GROUP BY DNUMBER

CHAPTER SEVEN
7. Advanced Database Concepts

7.1. Integrity and Security


A database represents an essential corporate resource that should be properly secured using
appropriate controls. A multi-user database system must provide a database security and
authorization subsystem to enforce limits on individual and group access rights and privileges.

Database security and integrity is about protecting the database from being inconsistent and being
disrupted. We can also call it database misuse. Database security encompasses hardware, software,
people, and data. Database misuse could be intentional or accidental; accidental misuse is easier to cope with than intentional misuse.

Accidental inconsistency could occur due to:

- System crash during transaction processing
- Anomalies due to concurrent access
- Anomalies due to redundancy
- Logical errors



Likewise, although many different threats fall into this category, intentional misuse could include:

- Unauthorized reading of data
- Unauthorized modification of data
- Unauthorized destruction of data

Most systems implement good database integrity to protect against accidental misuse, while the many computer-based measures that protect the system from intentional misuse are termed database security measures.

Database security is considered in relation to the following situations:

- Theft and fraud
- Loss of confidentiality (secrecy)
- Loss of privacy
- Loss of integrity
- Loss of availability

Security Issues and general considerations

- Legal, ethical and social issues regarding the right to access information
- Physical control
- Policy issues regarding privacy of individuals at enterprise and national level
- Operational considerations on the techniques used (passwords, etc.)
- System-level security, including operating system and hardware controls
- Security levels and security policies at the enterprise level

Database security refers to the mechanisms that protect the database against intentional or accidental threats. A threat is any situation or event, whether intentional or accidental, that may adversely affect a system and consequently the organization. A threat may be caused by a situation



or event involving a person, action, or circumstance that is likely to bring harm to an organization.
The harm to an organization may be tangible or intangible:

- Tangible: loss of hardware, software, or data
- Intangible: loss of credibility or client confidence

Examples of threats:

- Using another person's means of access
- Unauthorized amendment/modification or copying of data
- Program alteration
- Inadequate policies and procedures that allow a mix of confidential and normal output
- Wire-tapping
- Illegal entry by a hacker
- Blackmail
- Creating a 'trapdoor' into the system
- Theft of data, programs, and equipment
- Failure of security mechanisms, giving greater access than normal
- Staff shortages or strikes
- Inadequate staff training
- Viewing and disclosing unauthorized data
- Electronic interference and radiation
- Data corruption owing to power loss or surge
- Fire (electrical fault, lightning strike, arson), flood, bomb
- Physical damage to equipment
- Breaking or disconnection of cables
- Introduction of viruses

7.1.1. Levels of Security Measures


Security measures can be implemented at several levels and for different components of the
system. These levels are:

1. Physical Level: concerned with securing the site containing the computer system; the site should be physically secured. Backup systems should also be physically protected from access except for authorized users.



2. Human Level: concerned with the authorization of database users to access the content at different levels and with different privileges.
3. Operating System: concerned with the weaknesses and strengths of the operating system's security on data files. A weakness may serve as a means of unauthorized access to the database. This also includes protection of data in primary and secondary memory from unauthorized access.
4. Database System: concerned with data access limits enforced by the database system, such as passwords, isolated transactions, and so on.

Even though we can have different levels of security and authorization on data objects and users, deciding who accesses which data is a policy matter rather than a technical one. These policies should be:

- Known by the system: should be encoded in the system
- Remembered: should be saved somewhere (the catalogue)

An organization needs to identify the types of threat it may be subjected to and initiate appropriate
plans and countermeasures, bearing in mind the costs of implementing them.

7.1.2. Countermeasures: Computer Based Controls


The types of countermeasure to threats on computer systems range from physical controls to
administrative procedures. Despite the range of computer-based controls that are available, it is
worth noting that, generally, the security of a DBMS is only as good as that of the operating system,
owing to their close association. The following are computer-based security controls for a multi-
user environment:

Authorization

Authorization is the granting of a right or privilege that enables a subject to have legitimate access
to a system or a system's object. The process of authorization involves authentication of subjects
(i.e. a user or program) requesting access to objects (i.e. a database table, view, procedure, trigger,
or any other object that can be created within the system). Authorization controls, also known as
access controls, can be built into the software and govern not only what system or object a specified
user can access, but also what the user may do with it.



There are different forms of user authorization on the resources of the database. These forms are privileges specifying which operations are allowed on a specific data object. Forms of user authorization on the data/extension include:

1. Read Authorization: the user with this privilege is allowed only to read the content of the data object.
2. Insert Authorization: the user with this privilege is allowed only to insert new records or items to the
data object.
3. Update Authorization: users with this privilege are allowed to modify content of attributes but are not
authorized to delete the records.
4. Delete Authorization: users with this privilege are only allowed to delete a record and not anything else.

Different users, depending on their power, can have one or a combination of the above forms of authorization on different data objects.
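In standard SQL these privileges are granted and revoked with the GRANT and REVOKE statements; a minimal sketch with a hypothetical user name:

-- Allow user abebe to read from and insert into the EMPLOYEE table.
GRANT SELECT, INSERT ON EMPLOYEE TO abebe;
-- Later, withdraw only the insert privilege.
REVOKE INSERT ON EMPLOYEE FROM abebe;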

The database administrator is responsible for making the database as secure as possible. For this reason, the DBA has more powerful privileges than any other user. The DBA provides capabilities for database users while they access the content of the database.

The major responsibilities of the DBA in relation to authorization of users are:

1. Account Creation: involves creating different accounts for different users as well as user groups.
2. Security Level Assignment: involves assigning different users to different categories of access levels.
3. Privilege Grant: involves giving different levels of privileges to different users and user groups.



4. Privilege Revocation: involves denying or canceling previously granted privileges of users for various reasons.
5. Account Deletion: involves deleting an existing account of a user or user group. It is similar to denying all privileges of a user on the database.

Views

A view is the dynamic result of one or more relational operations on base relations, producing another relation. It is a virtual relation that does not actually exist in the database but is produced upon request by a particular user. Views provide a powerful and flexible security mechanism by hiding parts of the database from certain users. Therefore, using a view is more restrictive than simply having certain privileges granted to a user on the base relation(s).
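A minimal sketch (hypothetical view and user names): the view hides the SALARY column, and access is granted on the view instead of the base table:

-- Users of this view see names and departments, never salaries.
CREATE VIEW EMP_PUBLIC AS
SELECT FNAME, LNAME, DNO
FROM EMPLOYEE;
GRANT SELECT ON EMP_PUBLIC TO abebe;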

Integrity

Integrity constraints contribute to maintaining a secure database system by preventing data from
becoming invalid and hence giving misleading or incorrect results:

- Entity integrity: The first integrity rule applies to the primary keys of base relations. In a base relation, no attribute of a primary key can be null.
- Referential integrity: The second integrity rule applies to foreign keys. If a foreign key exists in a relation, either the foreign key value must match a candidate key value of some tuple in its home relation or the foreign key value must be wholly null.
- Domain integrity
- Key constraints
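A minimal sketch of how these integrity rules can be declared in SQL (hypothetical table and column names; assumes an EMPLOYEE table whose primary key is ID):

CREATE TABLE DEPT_DEMO (
DNUMBER INT PRIMARY KEY,                   -- entity integrity: never NULL
DNAME VARCHAR(20) NOT NULL UNIQUE,         -- key constraint on a candidate key
MGRID INT REFERENCES EMPLOYEE(ID),         -- referential integrity
BUDGET DECIMAL(10,2) CHECK (BUDGET >= 0)   -- domain integrity
);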

Backup and Recovery

Backup is the process of periodically taking a copy of the database and log file (and possibly
programs) onto offline storage media. A DBMS should provide backup facilities to assist with

Database Systems and Information Management Module Page 160


the recovery of a database following failure. Database recovery is the process of restoring the
database to a correct state in the event of a failure.

Journaling is the process of keeping and maintaining a log file (or journal) of all changes made
to the database to enable recovery to be undertaken effectively in the event of a failure. The
advantage of journaling is that, in the event of a failure, the database can be recovered to its last
known consistent state using a backup copy of the database and the information contained in the
log file.

If no journaling is enabled on a failed system, the only means of recovery is to restore the database
using the latest backup version of the database. However, without a log file, any changes made
after the last backup to the database will be lost.

Encryption

Encryption is the process of encoding the data by a special algorithm that renders the data
unreadable by any program without the decryption key. If a database system holds particularly
sensitive data, it may be deemed necessary to encode it as a precaution against possible external
threats or attempts to access it.

The DBMS can access the data after decoding it, although there is a degradation in performance
because of the time taken to decode it.
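
As a small illustration, MySQL offers AES_ENCRYPT and AES_DECRYPT functions for column-level encryption; the table, columns, and key string below are hypothetical, and in practice the key would be managed outside the SQL text:

-- Store a sensitive value in encrypted (binary) form
INSERT INTO ACCOUNT (ACC_NO, CARD_NO_ENC)
VALUES (1001, AES_ENCRYPT('5500005555555559', 'secret-key'));

-- Decrypt on retrieval; without the key the stored bytes are unreadable
SELECT ACC_NO, AES_DECRYPT(CARD_NO_ENC, 'secret-key') AS CARD_NO
FROM ACCOUNT;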

Encryption also protects data transmitted over communication lines. To transmit data securely over
insecure networks requires the use of a cryptosystem, which includes:

 An encryption key to encrypt the data (plaintext);
 An encryption algorithm that, with the encryption key, transforms the plaintext into ciphertext;
 A decryption key to decrypt the ciphertext;
 A decryption algorithm that, with the decryption key, transforms the ciphertext back into the
original plaintext.

Authentication

All users of the database will have different access levels and permissions for different data objects,
and authentication is the process of checking whether a user requesting access is genuinely the
holder of the identity to which those privileges were granted. It is the process of checking that
users are who they say they are.

Each user is given a unique identifier, which is used by the operating system to determine who
they are. Associated with each identifier is a password, chosen by the user and known to the
operating system, which must be supplied to enable the operating system to authenticate who
the user claims to be. Thus the system will check whether the user supplying a specific username
and password is entitled to use the resource.
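
In SQL Server, for instance, the identifier/password pair and the database-level identity are created separately; the names below are hypothetical:

-- Server-level login: holds the authentication credentials
CREATE LOGIN almaz WITH PASSWORD = 'Str0ng!Passw0rd';

-- Database-level user mapped to that login: holds the privileges
CREATE USER almaz FOR LOGIN almaz;
GRANT SELECT ON EMPLOYEE TO almaz;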

Any database access request will have the following three major components

1. Requested Operation: what kind of operation is requested by a specific query?


2. Requested Object: on which resource or data of the database is the operation sought to be applied?
3. Requesting User: who is the user requesting the operation on the specified object?

The database should be able to check all three components before processing any request.
The checking is performed by the security subsystem of the DBMS.

7.2. Distributed Database Systems


A distributed database (DDB) is a collection of multiple, logically interrelated databases
distributed over a computer network.

A distributed database management system (D–DBMS) is the software that manages the DDB
and provides an access mechanism that makes this distribution transparent to the users.

Distributed database system (DDBS) = DDB + D–DBMS

Distributed DBMS Environment

Distributed DBMS Promises

 Transparent management of distributed, fragmented, and replicated data


 Improved reliability/availability through distributed transactions
 Improved performance
 Easier and more economical system expansion

Distributed DBMS – Reality

7.3 Data Warehousing and Data Mining
7.3.1. Data Warehousing
Since the 1970s, organizations have largely focused their investment in new computer systems
(called online transaction processing or OLTP systems) that automate business processes. In this
way, organizations gained competitive advantage through systems that offered more efficient and
cost-effective services to the customer. Throughout this period, organizations accumulated
growing amounts of data stored in their operational databases. However, in recent times, where
such systems are commonplace, organizations are focusing on ways to use operational data to
support decision making as a means of regaining competitive advantage.

Operational systems were never primarily designed to support business decision making and so
using such systems may never be an easy solution. The legacy is that a typical organization may
have numerous operational systems with overlapping and sometimes contradictory definitions,
such as data types. The challenge for an organization is to turn its archives of data into a source of

knowledge, so that a single integrated/consolidated view of the organization's data is presented to
the user.

The concept of a data warehouse was deemed the solution to meet the requirements of a system
capable of supporting decision making, receiving data from multiple operational data sources.

Data warehouse is a consolidated/integrated view of corporate data drawn from disparate
operational data sources and a range of end-user access tools capable of supporting simple to
highly complex queries to support decision making.

The data held in a data warehouse is described as being subject-oriented, integrated, time-variant,
and nonvolatile (Inmon, 1993).

 Subject-oriented, as the warehouse is organized around the major subjects of the organization
(such as customers, products, and sales) rather than the major application areas (such as customer
invoicing, stock control, and product sales). This is reflected in the need to store decision-support
data rather than application oriented data.
 Integrated, because of the coming together of source data from different organization-wide
applications systems. The source data is often inconsistent, using, for example, different data types
and/or formats. The integrated data source must be made consistent to present a unified view of
the data to the users.
 Time-variant, because data in the warehouse is accurate and valid only at some point in time or
over some time interval.
 Nonvolatile, as the data is not updated in real time but is refreshed from operational systems on a
regular basis. New data is always added as a supplement to the database, rather than a replacement.

The typical architecture of a data warehouse is shown in Figure 7.3.

The source of operational data for the data warehouse is supplied from mainframes, proprietary
file systems, private workstations and servers, and external systems such as the Internet.
An operational data store (ODS) is a repository of current and integrated operational data used for
analysis. It is often structured and supplied with data in the same way as the data warehouse, but

may in fact act simply as a staging area for data to be moved into the warehouse. The load
manager performs all the operations associated with the extraction and loading of data into the
warehouse.

The warehouse manager performs all the operations associated with the management of the data,
such as the transformation and merging of source data; creation of indexes and views on base
tables; generation of aggregations; and backing up and archiving data. The query manager
performs all the operations associated with the management of user queries.

Detailed data is not stored online but is made available by summarizing the data to the next level
of detail. However, on a regular basis, detailed data is added to the warehouse to supplement the
summarized data. The warehouse stores all the predefined lightly and highly summarized data
generated by the warehouse manager. The purpose of summary information is to speed up the
performance of queries. Although there are increased operational costs associated with initially
summarizing the data, this cost is offset by removing the requirement to continually perform
summary operations (such as sorting or grouping) in answering user queries. The summary data
is updated continuously as new data is loaded into the warehouse.

Detailed and summarized data is stored offline for the purposes of archiving and backup.
Metadata (data about data) definitions are used by all the processes in the warehouse, including
the extraction and loading processes; the warehouse management process; and as part of the
query management process.

The principal purpose of data warehousing is to provide information to business users for strategic
decision making. These users interact with the warehouse using end-user access tools. The data
warehouse must efficiently support ad hoc and routine analysis as well as more complex data
analysis. The types of end-user access tools typically include reporting and query tools, application
development tools, executive information system (EIS) tools, online analytical processing (OLAP)
tools, and data mining tools.

7.3.2. Data Mining


Simply storing information in a data warehouse does not provide the benefits that an organization
is seeking. To realize the value of a data warehouse, it is necessary to extract the knowledge hidden
within the warehouse. However, as the amount and complexity of the data in a data warehouse

grows, it becomes increasingly difficult, if not impossible, for business analysts to identify trends
and relationships in the data using simple query and reporting tools. Data mining is one of the best
ways to extract meaningful trends and patterns from huge amounts of data. Data mining discovers
within data warehouses information that queries and reports cannot effectively reveal.

There are numerous definitions of what data mining is, ranging from the broadest definitions of
any tool that enables users to access directly large amounts of data to more specific definitions
such as tools and applications that perform statistical analysis on the data. In this chapter, we use
a more focused definition of data mining by Simoudis (1996).

Data mining is the process of extracting valid, previously unknown, comprehensible, and
actionable information from large databases and using it to make crucial business decisions.

Data mining is concerned with the analysis of data and the use of software techniques for finding
hidden and unexpected patterns and relationships in sets of data. The focus of data mining is to
reveal information that is hidden and unexpected, as there is less value in finding patterns and
relationships that are already intuitive. Examining the underlying rules and features in the data
identifies the patterns and relationships.

Data mining analysis tends to work from the data up, and the techniques that produce the most
accurate results normally require large volumes of data to deliver reliable conclusions. The process
of analysis starts by developing an optimal representation of the structure of sample data, during
which time knowledge is acquired. This knowledge is then extended to larger sets of data, working
on the assumption that the larger data set has a structure similar to the sample data.

Data mining can provide huge paybacks for companies that have made a significant investment
in data warehousing. Data mining is used in a wide range of industries including retail/marketing,
banking, insurance, and medicine.

Data Mining Techniques

There are four main operations associated with data mining techniques:
predictive modeling, database segmentation, link analysis, and deviation detection.
Although any of the four major operations can be used to implement any typical business
application, there are certain recognized associations between the applications and
the corresponding operations. For example, direct marketing strategies are normally implemented
using the database segmentation operation, whereas fraud detection could be implemented by any of
the four operations.

OPERATIONS AND ASSOCIATED DATA MINING TECHNIQUES

 Predictive modeling: classification; value prediction
 Database segmentation: demographic clustering; neural clustering
 Link analysis: association discovery; sequential pattern discovery; similar time sequence discovery
 Deviation detection: statistics; visualization

Table 7.1. Data mining operations and associated techniques

Wolaita Sodo University
School of Informatics
Department of Information Technology
2015 E.C- National Exit Exam Module
Theme Name: Database System and Information
Management

Prepared By: Alemayehu Dereje (MSc.)

Reviewed By: Biruk Sidamo (MSc.)


Nuhamin Nigussie (MSc.)

Wolaita Sodo, Ethiopia


March, 2023

Advanced Database Systems

Table of Contents
CHAPTER ONE ...........................................................................................................................................................................................3
1. Introduction ..........................................................................................................................................................................................3
1.1 Database System............................................................................................................................................................................. 6
1.2 Data Handling approaches .............................................................................................................................................. 8
1.2.1 Manual Data Handling approach .............................................................................................................................. 8
1.2.2 Traditional File Based Data Handling approach ............................................................................................ 9
1.2.3 Database Data Handling approach .......................................................................................................................... 11
1.3 Roles in Database Design and Development ............................................................................................................... 15
1.3.1 Database Designer .............................................................................................................................................................. 15
1.3.2 Database Administrator ................................................................................................................................................. 16
1.3.3 Application Developers.................................................................................................................................................... 17
1.3.4 End-Users ................................................................................................................................................................................. 18
1.4 The ANSI/SPARC and Database Architecture ......................................................................................................... 19
1.5 Types of Database Systems.................................................................................................................................................... 24
1.5.1 Client-Server Database System ................................................................................................................................. 24
1.5.2 Parallel Database System .............................................................................................................................................. 26
1.5.3 Distributed Database System...................................................................................................................................... 27
1.6 Database Management System (DBMS)....................................................................................................................... 29
1.6.1 Components and Interfaces of DBMS .................................................................................................................. 29
1.6.2 Functions of DBMS ................................................................................................................................................................. 31
1.7 Data models and conceptual models ............................................................................................................................... 32
1.7.1 Record-based Data Models .......................................................................................................................................... 33
1.7.2 Database Languages ......................................................................................................................................................... 37
CHAPTER TWO ...................................................................................................................................................................................... 40
2. Relational Data Model ...................................................................................................................................................................... 40
2. 1 Introduction ................................................................................................................................................................................... 40
2.1 Properties of Relational Databases ..................................................................................................................................44
2.2 Building Blocks of the Relational Data Model .................................................................................................44
2.2.1 The ENTITIES..................................................................................................................................................................... 45
2.2.2 The ATTRIBUTES ........................................................................................................................................................... 45
2.2.3 The RELATIONSHIPS.................................................................................................................................................. 47
2.2.4 Key constraints ....................................................................................................................................................................49
2.2.5 Integrity, Referential Integrity and Foreign Keys Constraints ........................................................... 50

CHAPTER THREE................................................................................................................................................................................. 52
3. Conceptual Database Design- E-R Modeling .................................................................................................................... 52
3.1 Database Development Life Cycle .................................................................................................................................... 52
3.2 Basic concepts of E-R model ................................................................................................................................................ 54
3.3 Developing an E-R Diagram................................................................................................................................................. 55
3.4 Graphical Representations in Entity Relationship Diagram .......................................................................... 56
3.4.1 Conceptual ER diagram symbols ............................................................................................................................ 56
3.5 Problem with E-R models ...................................................................................................................................................... 65
CHAPTER FOUR .................................................................................................................................................................................... 77
4. Logical Database Design ................................................................................................................................................................. 77
4.1 Introduction .................................................................................................................................................................................... 77
4.2 Normalization ................................................................................................................................................................................ 85
4.3 Process of normalization (1NF, 2NF, 3NF) .................................................................................................................. 91
CHAPTER FIVE.......................................................................................................................................................................................98
5. Physical Database Design ...............................................................................................................................................................98
5.1 Conceptual, Logical, and Physical Data Models .....................................................................................................98
5.2 Physical Database Design Process ....................................................................................................................................99
5.1.1. Overview of the Physical Database Design Methodology ......................................................................100
CHAPTER SIX.......................................................................................................................................................................................... 102
6. Query Languages ................................................................................................................................................................................ 102
6.1. Relational Algebra .................................................................................................................................................................... 102
6.1.1. Unary Operations ............................................................................................................................................................ 103
6.1.2. Set Operations ....................................................................................................................................................................105
6.1.3. Aggregation and Grouping Operations ............................................................................................................ 110
6.2. Relational Calculus ................................................................................................................................................................... 112
6.2.1. Tuple Relational Calculus ........................................................................................................................................... 113
6.3. Structured Query Languages (SQL) ............................................................................................................................. 115
6.3.1. Introduction to SQL........................................................................................................................................................ 115
6.3.2. Writing SQL Commands ............................................................................................................................................. 118
6.3.3. SQL Data Definition and Data Types .................................................................................................................. 119
6.3.4. Basic Queries in SQL ....................................................................................................................................................124
CHAPTER SEVEN................................................................................................................................................................................. 155
7. Advanced Database Concepts..................................................................................................................................................... 155
7.1. Integrity and Security ............................................................................................................................................................ 155

7.1.1. Levels of Security Measures ..................................................................................................................................... 157
7.1.2. Countermeasures: Computer Based Controls ..............................................................................................158
7.2. Distributed Database Systems ..........................................................................................................................................162
7.3 Data Warehousing and Data Mining ............................................................................................................................ 164
7.3.1. Data Warehousing.......................................................................................................................................................... 164
7.3.2. Data Mining ........................................................................................................................................................................ 166
CHAPTER ONE .................................................................................................................................................................................................. 6
QUERY PROCESSING AND OPTIMIZATION ........................................................................................................................................ 6
Relational Algebra ........................................................................................................................................................................................ 6
TRANSLATING SQL QUERIES INTO RELATIONAL ALGEBRA ................................................................................................ 6
QUERY PROCESSING ............................................................................................................................................................................... 13
Heuristic Query Tree Optimization ..................................................................................................................................................... 15
Summary of Heuristics for Algebraic Optimization: ..................................................................................................................... 19
Cost-based query optimization: ............................................................................................................................................................. 19
Semantic Query Optimization: ............................................................................................................................................................. 20
CHAPTER TWO............................................................................................................................................................................................... 23
DATABASE SECURITY AND AUTHORIZATION ................................................................................................................................ 23
Introduction to Database Security Issues ......................................................................................................................................... 23
Authentication............................................................................................................................................................................................. 23
Authorization/Privilege ............................................................................................................................................................................ 23
Database Security and the DBA ........................................................................................................................................................... 24
Comparing DAC and MAC .................................................................................................................................................................... 28
Introduction to Statistical Database Security .................................................................................................................................. 28
Types of Cryptosystems .......................................................................................................................................................................... 29
CHAPTER THREE ............................................................................................................................................................................................ 31
TRANSACTION PROCESSING CONCEPTS ........................................................................................................................................... 31
Introduction to Transaction Processing ............................................................................................................................................. 31
ACID properties of the transactions .................................................................................................................................................. 32
SIMPLE MODEL OF A DATABASE .................................................................................................................................................... 32
Transaction Atomicity and Durability................................................................................................................................................ 33
Serializability................................................................................................................................................................................................ 36
Transactions as SQL Statements.......................................................................................................................................................... 39
Summary ....................................................................................................................................................................................................... 40
CHAPTER FOUR ............................................................................................................................................................................................. 43
CONCURRENCY CONTROL TECHNIQUES ......................................................................................................................................... 43
Introduction to Concurrency control techniques .......................................................................................................................... 43
The Two-Phase Locking Protocol ........................................................................................................................................................44

Deadlock Detection and Recovery .......................................................................................................................................................46
Recovery from Deadlock .........................................................................................................................................................................46
Timestamp-Based Protocols ................................................................................................................................................................... 47
The Timestamp-Ordering Protocol .....................................................................................................................................................48
Purpose of Concurrency Control .........................................................................................................................................................49
CHAPTER FIVE ................................................................................................................................................................................................. 51
DATABASE RECOVERY TECHNIQUES ................................................................................................................................................... 51
Recovery Outline and Categorization of Recovery Algorithms ................................................................................................. 51
Write-Ahead Logging, Steal/No-Steal, and Force/No-Force ........................................................................................................ 52
Transaction Actions That Do Not Affect the Database ............................................................................................................... 53
Recovery Techniques Based on Immediate Update ...................................................................................................................... 54
Shadow Paging............................................................................................................................................................................................ 54
The ARIES Recovery Algorithm ............................................................................................................................................................ 55
Recovery in Multidata base Systems .................................................................................................................................................. 56
Database Backup and Recovery from Catastrophic Failures ..................................................................................................... 56
CHAPTER SIX ................................................................................................................................................................................................... 58
DISTRIBUTED DATABASE SYSTEMS ..................................................................................................................................................... 58
Distributed Database Concepts ............................................................................................................................................................ 58
Parallel Versus Distributed Technology ............................................................................................................................................ 58
Additional Functions of Distributed Databases .............................................................................................................................. 60
Data Fragmentation ................................................................................................................................................................................... 61
Data Replication and Allocation ........................................................................................................................................................... 63
Types of Distributed Database Systems ............................................................................................................................................64
Federated Database Management Systems Issues .........................................................................................................................64
Semantic Heterogeneity........................................................................................................................................................................... 65
Query Processing in Distributed Databases ....................................................................................................................................66
Data Transfer Costs of Distributed Query Processing ...........................................................................................................66
An Overview of Client-Server Architecture and Its Relationship to Distributed Databases.................................... 67
CHAPTER SEVEN............................................................................................................................................................................................69
Spatial /multimedia/mobile databases .....................................................................................................................................................69
What is a Spatial Database? ..................................................................................................................................................................69
Spatial Database Applications ..........................................................................................................................................................69
Spatial Data Types ...............................................................................................................................................................................69
Spatial Relationship ............................................................................................................................................................................. 70
Spatial Queries ...................................................................................................................................................................................... 70
Mobile Database .......................................................................................................................................................................................... 71
Multimedia Database Management System (MMDBMS) ........................................................................................................... 73

A true MMDBMS should be able to: .................................................................................................................................................. 73
Multimedia Data Types ........................................................................................................................................................................... 73
Multimedia Database Models ................................................................................................................................................................ 73
Applications of Multimedia Databases............................................................................................................................................... 75
CHAPTER EIGHT ............................................................................................................................................................................................ 76
WEB- BASED DATABASES ......................................................................................................................................................................... 76
Databases on the World Wide Web ................................................................................................................................................... 76
Providing Access to Databases on the World Wide Web ........................................................................................................... 76
The Web Integration Option of INFORMIX..................................................................................................................................... 77
The ORACLE Web Server ...................................................................................................................................................................... 78
Open Problems with Web Databases ................................................................................................................................................. 79
CHAPTER NINE ................................................................................................................................................................................................ 81
DATA WAREHOUSING ................................................................................................................................................................................. 81
What is Data Warehouse? ....................................................................................................................................................................... 81
Understanding a Data Warehouse....................................................................................................................................................... 82
Why a Data Warehouse is separated from Operational Databases? ...................................................................................... 82
Data Warehouse Applications.......................................................................................................................................................... 82
Types of Data Warehouse ................................................................................................................................................................. 82
OLTP vs. OLAP ........................................................................................................................................................................................... 83
Data Warehouse Architecture ...............................................................................................................................................................84
What is an Aggregation? ................................................................................................................................................................... 85
What is Data Mining? .............................................................................................................................................................................. 85
Reference

CHAPTER ONE
QUERY PROCESSING AND OPTIMIZATION
Relational Algebra
 The basic set of operations for the relational model is known as the relational algebra. These
operations enable a user to specify basic retrieval requests.
 The result of retrieval is a new relation, which may have been formed from one or more relations.
 The algebra operations thus produce new relations, which can be further manipulated using
operations of the same algebra.
 A sequence of relational algebra operations forms a relational algebra expression, whose result
will also be a relation that represents the result of a database query (or retrieval request).

TRANSLATING SQL QUERIES INTO RELATIONAL ALGEBRA


Query block
 The basic unit that can be translated into the algebraic operators and optimized.
 A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and
HAVING clause if these are part of the block.
 Nested queries within a query are identified as separate query blocks.
 Aggregate operators in SQL must be included in the extended algebra
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > (SELECT MAX (SALARY) FROM
EMPLOYEE WHERE DNO = 5);

This query is decomposed into two query blocks, where C represents the result returned by the
inner block:

Outer block: SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > C
Inner block: SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO = 5

These translate into relational algebra as:

π LNAME, FNAME (σ SALARY>C (EMPLOYEE))
ℱ MAX SALARY (σ DNO=5 (EMPLOYEE))

Relational Algebra consists of several groups of operations


1. Unary Relational Operations
 SELECT (symbol: σ (sigma))
 PROJECT (symbol: π (pi))
 RENAME (symbol: ρ (rho))
2. Relational Algebra Operations from Set Theory
 UNION (∪), INTERSECTION (∩), DIFFERENCE (or MINUS, –)
 CARTESIAN PRODUCT (X)
3. Binary Relational Operations
 JOIN (several variations of JOIN exist)
 DIVISION

4. Additional Relational Operations
 OUTER JOINS, OUTER UNION
 AGGREGATE FUNCTIONS (These compute summary of information: for
example, SUM, COUNT, AVG, MIN, MAX)
1. Unary Relational Operations
A) SELECT Operation
 SELECT operation is used to select a subset of the tuples from a relation that satisfy a selection
condition.
 It is a filter that keeps only those tuples that satisfy a qualifying condition: tuples satisfying the
condition are selected, while the others are discarded.
 Example: To select the EMPLOYEE tuples whose department number is 4, or those whose
salary is greater than $30,000, the following notations are used (EMPLOYEE is the relation name):
σ DNO=4 (EMPLOYEE)
σ SALARY>30,000 (EMPLOYEE)
 In general, the selection operation is denoted by σ<selection condition>(R) where the symbol σ
(Sigma) is used to denote the select operator and the selection condition is a Boolean expression
specified on the attributes of relation(R).

What are the SQL statements of the above unary relational operations?

SELECT * FROM EMPLOYEE WHERE DNO=4;
SELECT * FROM EMPLOYEE WHERE SALARY>30000;

SELECT Operation Properties


The SELECT operation σ<selection condition>(R) produces a relation S that has the same schema as R.
 The SELECT operation σ is commutative, i.e. σ<condition1>(σ<condition2>(R)) =
σ<condition2>(σ<condition1>(R)).
 A cascade of SELECT operations may therefore be applied in any order, i.e.
σ<condition1>(σ<condition2>(σ<condition3>(R))) = σ<condition2>(σ<condition3>(σ<condition1>(R))).
 A cascade of SELECT operations may be replaced by a single selection with a conjunction of all the
conditions, i.e. σ<condition1>(σ<condition2>(σ<condition3>(R))) =
σ<condition1 AND condition2 AND condition3>(R).
What are the results of the following SELECT and PROJECT operations, and what are their SQL statements?
 Results of SELECT and PROJECT operations
A. σ (DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000) (EMPLOYEE)
B. π LNAME, FNAME, SALARY (EMPLOYEE)
C. π SEX, SALARY (EMPLOYEE)
The SQL statements of the above relational algebra expressions are the following.
A. SELECT * FROM EMPLOYEE WHERE (DNO=4 AND SALARY>25000) OR
(DNO=5 AND SALARY>30000);
B. SELECT LNAME, FNAME, SALARY FROM EMPLOYEE;
C. SELECT SEX, SALARY FROM EMPLOYEE;

B) PROJECT Operation
 This operation selects certain columns from the table and discards the other columns.
 The PROJECT operation creates a vertical partitioning: one part with the needed columns
(attributes) containing the results of the operation, and the other containing the discarded columns.
 Example: To list each employee's first and last name and salary, the following is used:
π LNAME, FNAME, SALARY (EMPLOYEE)
 The general form of the project operation is: π<attribute list>(R)
 π (pi) is the symbol used to represent the project operation, and <attribute list> is the desired
list of attributes from relation R.
 The project operation removes any duplicate tuples. This is because the result of the project
operation must be a set of tuples. Mathematical sets do not allow duplicate elements.
PROJECT Operation Properties
 The number of tuples in the result of the projection π<attribute list>(R) is always less than or equal
to the number of tuples in R. If the attribute list includes a key of R, then the number of tuples is
equal to the number of tuples in R.
 π<list1>(π<list2>(R)) = π<list1>(R) as long as <list2> contains the attributes in <list1>.
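
Note that, unlike the algebra's PROJECT, SQL's SELECT keeps duplicates unless DISTINCT is specified; a small sketch:

-- May return duplicate (SEX, SALARY) pairs
SELECT SEX, SALARY FROM EMPLOYEE;

-- DISTINCT makes the result a set, matching π SEX, SALARY (EMPLOYEE)
SELECT DISTINCT SEX, SALARY FROM EMPLOYEE;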
2. Relational Algebra Operations from Set Theory
A. UNION OPERATION
 The result of this operation, denoted by R ∪ S, is a relation that includes all tuples that are either in R,
in S, or in both R and S. Duplicate tuples are eliminated.
 Example: To retrieve the social security numbers of all employees who either work in department 5 or
directly supervise an employee who works in department 5, we can use the union operation as follows:
DEP5_EMPS ← σ DNO=5 (EMPLOYEE)
RESULT1 ← π SSN (DEP5_EMPS)
RESULT2(SSN) ← π SUPERSSN (DEP5_EMPS)
RESULT ← RESULT1 ∪ RESULT2
 The union operation produces the tuples that are in either RESULT1 or RESULT2 or both.
 The two operands must be “type compatible”.
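
Assuming the EMPLOYEE schema used above, the same request can be sketched in SQL; UNION eliminates duplicates, matching the algebra:

SELECT SSN FROM EMPLOYEE WHERE DNO = 5
UNION
SELECT SUPERSSN FROM EMPLOYEE WHERE DNO = 5;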
TYPE COMPATIBILITY

 Type compatibility of operands is required for the binary set operations UNION (∪),
INTERSECTION (∩), and SET DIFFERENCE (–).
 R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) are type compatible if:
 They have the same number of attributes, and
 The domains of corresponding attributes are type compatible (i.e. dom(Ai)=dom(Bi) for
i=1, 2, ..., n).
 The resulting relation for R1∪R2 (also for R1∩R2, or R1–R2) has the same attribute names as the
first operand relation R1 (by convention)
Example: given type-compatible relations Instructor and Student, Instructor ∪ Student contains
all tuples that appear in either relation or in both.

SELECT * FROM Student UNION SELECT * FROM Instructor;
(UNION eliminates duplicate rows; UNION ALL would retain them.)


B. INTERSECTION OPERATION
 The result of this operation, denoted by R ∩ S, is a relation that includes all tuples that are in both R and S.
 The two operands must be "type compatible". Example: the result of Instructor ∩ Student includes
only those who are both students and instructors.
C. Set Difference (or MINUS) Operation
 The result of this operation, denoted by R – S, is a relation that includes all tuples that are in R
but not in S. The two operands must be "type compatible".
 Note: both union and intersection are commutative operations; i.e. R ∪ S = S ∪ R and R ∩ S = S ∩ R.
 Both union and intersection can be treated as n-ary operations applicable to any number of relations,
as both are associative operations; that is, R ∪ (S ∪ T) = (R ∪ S) ∪ T and (R ∩ S) ∩ T = R ∩ (S ∩ T).
 The minus operation is not commutative; that is, in general, R – S ≠ S – R.
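
In SQL, set difference is written with EXCEPT in the standard and in many systems (Oracle uses MINUS); a sketch, assuming type-compatible Student and Instructor relations:

-- Students who are not also instructors
SELECT * FROM Student
EXCEPT
SELECT * FROM Instructor;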
D. CARTESIAN (or cross product) Operation
 This operation is used to combine tuples from two relations in a combinational fashion.
 In general, the result of R(A1, A2,….An) x S(B1, B2,….,Bm) is a relation Q with degree n+m
attributes Q(A1, A2,….An, B1, B2,….,Bm), in that order.

 The resulting relation Q has one tuple for each combination of tuples—one from R and one from
S. Hence, if R has nR tuples (denoted as |R| = nR) and S has nS tuples, then |R × S| will have
nR*nS tuples.
 The two operands do not have to be "type compatible".
Example:
FEMALE_EMPS ← σ SEX='F' (EMPLOYEE)
EMPNAMES ← π FNAME, LNAME, SSN (FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
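
The same sequence can be sketched in SQL with a cross join, using the schema assumed above:

-- Each female employee's name paired with every DEPENDENT tuple
SELECT E.FNAME, E.LNAME, E.SSN, D.*
FROM (SELECT FNAME, LNAME, SSN FROM EMPLOYEE WHERE SEX = 'F') AS E
CROSS JOIN DEPENDENT AS D;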
3. Binary Relational Operations
A. JOIN Operation
 The sequence of Cartesian product followed by select is used quite commonly to identify and select
related tuples from two relations, a special operation, called JOIN.
 It is denoted by a
 This operation is very important for any relational database with more than a single relation,
because it allows us to process relationships among relations.
 The general form of a join operation on two relations R(A1, A2, ..., An) and S(B1, B2, ..., Bm) is
R ⋈<join condition> S, where R and S can be any relations that result from general
relational algebra expressions.
 EXAMPLE: Suppose that we want to retrieve the name of the manager of each
department. To get the manager's name, we need to combine each DEPARTMENT tuple with
the EMPLOYEE tuple whose SSN value matches the MGRSSN value in the department
tuple. We do this by using the join operation:
DEPT_MGR ← DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
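
In SQL the same join would be sketched as follows (the DNAME column is an assumed attribute of DEPARTMENT):

SELECT D.DNAME, E.FNAME, E.LNAME
FROM DEPARTMENT AS D
JOIN EMPLOYEE AS E ON D.MGRSSN = E.SSN;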
B. EQUIJOIN Operation
 The most common use of join involves join conditions with equality comparisons only.
 Such a join, where the only comparison operator used is =, is called an EQUIJOIN.
 In the result of an EQUIJOIN we always have one or more pairs of attributes (whose names need
not be identical) that have identical values in every tuple.
 The JOIN seen in the previous example was EQUIJOIN.
C. NATURAL JOIN Operation
 Because one of each pair of attributes with identical values is superfluous, a new operation called
natural join—denoted by *—was created to get rid of the second (superfluous) attribute in an
EQUIJOIN condition.
 The standard definition of natural join requires that the two join attributes, or each pair of
corresponding join attributes, have the same name in both relations.
 If this is not the case, a renaming operation is applied first.
EXAMPLE: To apply a natural join on the DNUMBER attributes of DEPARTMENT and
DEPT_LOCATIONS, it is sufficient to write:
DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS
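
Many SQL systems offer a direct counterpart that joins on all identically named columns, here assumed to be only DNUMBER:

SELECT * FROM DEPARTMENT NATURAL JOIN DEPT_LOCATIONS;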
 The set of operations including SELECT (σ), PROJECT (π), UNION (∪), SET DIFFERENCE (–),
and CARTESIAN PRODUCT (×) is called a complete set because any other relational algebra
expression can be expressed as a combination of these five operations.
Example: R ∩ S = (R ∪ S) – ((R – S) ∪ (S – R))
4. Additional Relational Operations
A. Aggregate Functions and Grouping

 A type of request that cannot be expressed in the basic relational algebra is to specify mathematical
aggregate functions on collections of values from the database.
 Examples of such functions include retrieving the average or total salary of all employees or the
total number of employee tuples.
 These functions are used in simple statistical queries that summarize information from the database
tuples.
 Common functions applied to collections of numeric values include SUM, AVERAGE,
MAXIMUM, and MINIMUM.
 The COUNT function is used for counting tuples or values.
Use of the Functional operator ℱ
 ℱ MAX Salary (EMPLOYEE) retrieves the maximum salary value from the EMPLOYEE relation.
 ℱ MIN Salary (EMPLOYEE) retrieves the minimum salary value from the EMPLOYEE relation.
 ℱ SUM Salary (EMPLOYEE) retrieves the sum of the salaries from the EMPLOYEE relation.
 DNO ℱ COUNT SSN, AVERAGE Salary (EMPLOYEE) groups employees by DNO (department number) and
computes the count of employees and the average salary per department. [Note: COUNT just counts the
number of rows, without removing duplicates.]
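
The grouping example corresponds directly to SQL's GROUP BY with aggregate functions:

-- Count of employees and average salary per department
SELECT DNO, COUNT(SSN), AVG(SALARY)
FROM EMPLOYEE
GROUP BY DNO;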
What is Query Processing?
 The steps required to transform a high-level SQL query into a correct and "efficient" strategy for
execution and retrieval; it covers the activities involved in parsing, validating, optimizing, and
executing a query.
What is the aim of query processing?
 To transform a query written in a high-level language, typically SQL, into a correct and efficient
execution strategy expressed in a low-level language (implementing the relational algebra), and
to execute the strategy to retrieve the required data.
What is Query Optimization?
 The activity of choosing a single “efficient” execution strategy (from hundreds) as determined
by database catalog statistics for processing a query. An important aspect of query processing is
query optimization.
What is the aim of query Optimization?
 As there are many equivalent transformations of the same high-level query, the aim of query
optimization is to choose the one that minimizes resource usage.
 Generally, we try to reduce the total execution time of the query, which is the sum of the execution
times of all individual operations that make up the query.
Example for query optimization: identify all managers who work in a London branch.
SQL: SELECT * FROM Staff s, Branch b WHERE s.branchNo = b.branchNo AND s.position =
'Manager' AND b.city = 'London';
This results in three equivalent relational algebra statements:
(1) σ (position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo) (Staff × Branch)
(2) σ (position='Manager') ∧ (city='London') (Staff ⋈ Staff.branchNo=Branch.branchNo Branch)
(3) [σ position='Manager' (Staff)] ⋈ Staff.branchNo=Branch.branchNo [σ city='London' (Branch)]
Assume:
 1000 tuples in Staff
 50 Managers
 50 tuples in Branch
 5 London branches
 No indexes or sort keys
 All temporary results are written back to disk (memory is small)
 Tuples are accessed one at a time (not in blocks)
Query 1 (Bad)

 Requires (1000+50) disk accesses to read from the Staff and Branch relations
 Creates a temporary relation holding the Cartesian product: (1000*50) tuples
 Requires (1000*50) disk accesses to write the temporary relation and another (1000*50) to read
it back and test the predicate
Total Work = (1000+50) + 2*(1000*50) = 101,050 I/O operations
Query 2 (Better)

 Again requires (1000+50) disk accesses to read from Staff and Branch
 Joins Staff and Branch on branchNo, producing a 1000-tuple result (1 employee : 1 branch)
 Requires (1000) disk accesses to write the joined relation and (1000) to read it back and check
the predicate
Total Work = (1000+50) + 2*(1000) = 3,050 I/O operations
3300% Improvement over Query 1
Query 3 (Best)

 Read Staff relation to determine 'Managers' (1000 reads)
 Create 50 tuple relation (50 writes)
 Read Branch relation to determine 'London' branches (50 reads)
 Create 5 tuple relation (5 writes)
 Join reduced relations and check predicate (50 + 5 reads)
Total Work = 1000 + 2*(50) + 5 + (50 + 5) = 1,160 I/O operations
8700% Improvement over Query 1
Two main Techniques for Query Optimization
1. Heuristic rules: rules for ordering the operations in query optimization.
2. Systematic cost estimation: estimates the cost of different execution strategies and chooses the
execution plan with the lowest estimated cost.
Steps in Processing High-Level Query
Scanning, Parsing, and Validating
 Scanner: The scanner identifies the language tokens such as SQL Keywords, attribute names, and
relation names in the text of the query.
 Parser: The parser checks the query syntax to determine whether it is formulated according to the
syntax rules of the query language.
 Validation: The query must be validated by checking that all attributes and relation names are valid
and semantically meaningful names in the schema of the particular database being queried.
QUERY PROCESSING
 Query Optimization: The process of choosing a suitable execution strategy for processing a query.
This module has the task of producing an execution plan.
 Query Code Generator: It generates the code to execute the plan.
 Runtime Database Processor: It has the task of running the query code whether in compiled or
interpreted mode. If a runtime error results an error message is generated by the runtime database
processor.
Query Processing Steps
Query processing can be divided into four main phases: decomposition, optimization, execution, and
code generation.
1. Query Decomposition
 It is the process of transforming a high-level query into a relational algebra query, and to check that
the query is syntactically and semantically correct.
 It consists of parsing and validation.
Typical stages in query decomposition are:
1. Analysis:
 Lexical and syntactical analysis of the query (correctness) based on attributes, data type...
 Query tree will be built for the query containing
 leaf node for base relations,
 one or many non-leaf nodes for relations produced by relational algebra operations and
 Root node for the result of the query.
 The sequence of operations is from the leaves to the root. Example: SELECT * FROM Catalog c,
Author a WHERE a.authorid = c.authorid AND c.price > 200 AND a.country = 'USA'
2. Normalization:
 Convert the query into a normalized form.
 The WHERE predicate is converted to conjunctive normal form (a conjunction, ∧, of disjunctions)
or disjunctive normal form (a disjunction, ∨, of conjunctions); a sketch is given after these four stages.
3. Semantic Analysis
 To reject normalized queries that are incorrectly formulated or contradictory.
 A query is incorrect if its components do not contribute to generating the result.
 A query is contradictory if its predicate cannot be satisfied by any tuple.
 For example, (Catalog = "BS" ∧ Catalog = "CS") is contradictory, since a given book can only be
classified in one of the categories at a time.
4. Simplification
 to detect redundant qualifications,
 eliminate common sub-expressions, and
 Transform the query to a semantically equivalent but more easily and effectively computed
form.
 For example, if a user doesn’t have the necessary access to all of the objects of the query, it
should be rejected.
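As an illustration of the normalization stage, a hypothetical WHERE predicate (reusing the Staff table from the earlier example) can be rewritten into conjunctive normal form by distributing OR over AND:

-- Original predicate
SELECT * FROM Staff
WHERE position = 'Manager' OR (salary > 20000 AND city = 'London');

-- Equivalent conjunctive normal form: a conjunction of disjunctions
SELECT * FROM Staff
WHERE (position = 'Manager' OR salary > 20000)
AND (position = 'Manager' OR city = 'London');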
2. Query Optimization
 Everyone wants the performance of their database to be optimal.
 In particular, there is often a requirement for a specific query or object that is query based, to run faster.
 The problem of query optimization is to find the sequence of steps that produces the answer to the
user's request in the most efficient manner, given the database structure.
 The performance of a query is affected by
 the tables or queries that underlies the query and
 The complexity of the query.
 Given a request for data manipulation or retrieval, an optimizer will choose an optimal plan for
evaluating the request from among the manifold alternative strategies; i.e., there are many ways (access
paths) of accessing the desired file/record. Hence, the DBMS is responsible for picking the best
execution strategy based on various considerations (least amount of I/O and CPU resources).
Transformational rules for relational Algebra
1. Cascade of SELECTION: conjunctive SELECTION operations can cascade into individual
selection operations and vice versa:
σ c1 ∧ c2 ∧ ... ∧ cn (R) ≡ σ c1 (σ c2 (...(σ cn (R))...))
2. Commutativity of SELECTION operations:
σ c1 (σ c2 (R)) ≡ σ c2 (σ c1 (R))
3. Cascade of PROJECTION: in a sequence of PROJECTION operations, only the last in the
sequence is required:
π L1 (π L2 (...(π Ln (R))...)) ≡ π L1 (R)
4. Commutativity of SELECTION with PROJECTION and vice versa
a. If the predicate c1 involves only the attributes in the projection list (L1), then the selection
and projection operations commute:
π L1 (σ c1 (R)) ≡ σ c1 (π L1 (R))
5. Commutativity of THETA JOIN / Cartesian product: R × S is equivalent to S × R. This also holds
for equi-join and natural join:
R ⋈ c S ≡ S ⋈ c R
6. Commutativity of SELECTION with THETA JOIN
a. If the predicate c1 involves only attributes of one of the relations (R) being joined, then
the selection and join operations commute:
σ c1 (R ⋈ c S) ≡ (σ c1 (R)) ⋈ c S
b. If the predicate is of the form c1 ∧ c2, where c1 involves only attributes of R and c2
involves only attributes of S, then the selection and theta join operations commute:
σ c1 ∧ c2 (R ⋈ c S) ≡ (σ c1 (R)) ⋈ c (σ c2 (S))
7. Commutativity of PROJECTION and THETA JOIN
If the projection list is of the form L1 ∪ L2, where L1 involves only attributes of R and L2 involves
only attributes of S being joined, and the join predicate c involves only attributes in the projection
list, then the PROJECTION and JOIN operations commute:
π L1 ∪ L2 (R ⋈ c S) ≡ (π L1 (R)) ⋈ c (π L2 (S))
8. Commutativity of the set operations: UNION and INTERSECTION are commutative, but SET
DIFFERENCE is not:
R ∪ S ≡ S ∪ R and R ∩ S ≡ S ∩ R
9. Associativity of THETA JOIN, CARTESIAN PRODUCT, UNION and INTERSECTION: if θ
stands for any one of these four operations, then
(R θ S) θ T ≡ R θ (S θ T)
10. Commuting SELECTION with SET OPERATIONS: if θ stands for UNION, INTERSECTION or
SET DIFFERENCE, then
σ c (R θ S) ≡ (σ c (R)) θ (σ c (S))
11. Commuting PROJECTION with UNION:
π L (R ∪ S) ≡ (π L (R)) ∪ (π L (S))
The heuristic approach is implemented by applying the above transformation rules in the following
sequence of steps.
Sequence for applying the transformation rules:
1. Use Rule 1 (cascade of SELECTION) to break up conjunctive selections.
2. Use Rule 2 (commutativity of SELECTION), Rule 4 (commuting SELECTION with
PROJECTION), Rule 6 (commuting SELECTION with JOIN and CARTESIAN PRODUCT) and
Rule 10 (commuting SELECTION with SET OPERATIONS) to move each SELECTION as far
down the tree as possible.
3. Use Rule 9 (associativity of the binary operations JOIN, CARTESIAN PRODUCT, UNION and
INTERSECTION) to rearrange the leaf nodes so that the most restrictive operations are performed
first (moving them as far down the tree as possible).
4. Combine a CARTESIAN PRODUCT with a subsequent SELECTION operation into a JOIN.
5. Use Rule 3 (cascade of PROJECTION), Rule 4 (commuting PROJECTION with SELECTION),
Rule 7 (commuting PROJECTION with JOIN and CARTESIAN PRODUCT) and Rule 11
(commuting PROJECTION with UNION) to move projection lists down the tree.
Heuristic Query Tree Optimization
 It has some rules which utilize equivalence expressions to transform the initial tree into final,
optimized query tree.
 Process for heuristics optimization
1. The parser of a high-level query generates an initial internal representation;
2. Apply heuristics rules to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations based on the access
paths available on the files involved in the query.
A) The main heuristic is to apply first the operations that reduce the size of intermediate
results.
E.g. Apply SELECT and PROJECT operations before applying the JOIN or other binary
operations.
 The main idea behind is to reduce intermediate results. This includes performing
 SELECT operation to reduce the number of tuples &
 PROJECT operation to reduce number of attributes.
Query tree:
 A tree data structure that corresponds to a relational algebra expression. It represents the input
relations of the query as leaf nodes of the tree, and represents the relational algebra operations as
internal nodes.
 Example: For every project located in ‘Stafford’, retrieve the project number, the controlling
department number and the department manager’s last name, address and birthdate.
 SQL query Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE FROM
PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E WHERE P.DNUM=D.DNUMBER
AND D.MGRSSN=E.SSN AND P.PLOCATION='STAFFORD';
 Relational algebra: πPNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σPLOCATION='STAFFORD'
(PROJECT)) ⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))
 The same query could correspond to many different relational algebra expressions, and hence
many different query trees.
Query graph:
 A graph data structure that corresponds to a relational calculus expression. It does not indicate an
order in which operations are to be performed. Nodes represent relations; ovals represent constant
nodes; edges represent join and selection conditions; attributes to be retrieved from relations are
shown in square brackets.
 Drawback: - Does not indicate an order on which operations are performed first.
 There is only a single graph corresponding to each query.
Example 2 of Heuristic Optimization of Query Trees:
 The task of heuristic optimization of query trees is to find a final query tree that is efficient to
execute.
 For example: Q: SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT WHERE
PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN AND BDATE > '1957-12-31'
Fig 1: Initial (canonical) query tree for SQL query Q
Fig 2: move SELECT down the tree using the cascade and commutativity rules of the SELECT operation.
Fig 3: rearrange the leaf nodes using commutativity and associativity of binary operations.
Fig 4: convert SELECT and CARTESIAN PRODUCT into JOIN.
Fig 5: break up and move PROJECT using the cascade and commuting rules of PROJECT operations.
Query Execution Plans
 An execution plan for a relational algebra query consists of a combination of the relational algebra
query tree and information about the access methods to be used for each relation as well as the
methods to be used in computing the relational operators stored in the tree.
 Materialized evaluation: the result of an operation is stored as a temporary relation.
 Pipelined evaluation: as the result of an operator is produced, it is forwarded to the next operator
in sequence.
Summary of Heuristics for Algebraic Optimization:
1) The main heuristic is to apply first the operations that reduce the size of intermediate results.
2) Perform select operations as early as possible to reduce the number of tuples and perform project
operations as early as possible to reduce the number of attributes. (This is done by moving select
and project operations as far down the tree as possible.)
3) The select and join operations that are most restrictive should be executed before other similar
operations. (This is done by reordering the leaf nodes of the tree among themselves and adjusting
the rest of the tree appropriately.)
Cost-based query optimization:
 The optimizer examines alternative access paths and operator algorithms and chooses the
execution plan with lowest estimate cost.
 The query cost is calculated based on the estimated usage of resources such as I/O, CPU and
memory needed.
 Application developers could specify hints to the Oracle query optimizer; the idea is that an
application developer might know more about the data than the optimizer does. A sketch of hint
syntax is given after this list.
 Issues of cost-based query optimization are:
 Cost function
 Number of execution strategies to be considered
 Cost Components for Query Execution
 Access cost to secondary storage
 Storage cost
 Computation cost
 Memory usage cost
 Communication cost
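Returning to optimizer hints: a sketch using Oracle's /*+ ... */ comment syntax, reusing the Staff table from the earlier example; the index name below is an assumed illustration, not from the module:

-- Ask the optimizer to use a full table scan of Staff
SELECT /*+ FULL(s) */ * FROM Staff s WHERE s.position = 'Manager';

-- Ask the optimizer to use a specific index (idx_staff_position is hypothetical)
SELECT /*+ INDEX(s idx_staff_position) */ * FROM Staff s WHERE s.position = 'Manager';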
 Note: Different database systems may focus on different cost components.
 Catalog Information Used in Cost Functions
 Information about the size of a file:
 number of records (tuples) (r)
 record size (R)
 number of blocks (b)
 blocking factor (bfr)
 Information about indexes and indexing attributes of a file:
 number of levels (x) of each multilevel index
 number of first-level index blocks (bI1)
 number of distinct values (d) of an attribute
 selectivity (sl) of an attribute
 selection cardinality (s) of an attribute (s = sl * r)
1. Access Cost of Secondary Storage
 Data is going to be accessed from secondary storage, as a query will need some part of the data stored
in the database. The disk access cost can be analyzed in terms of:
 searching,
 reading, and
 writing data blocks used to store some portion of a relation.
 Remark: The disk access cost will vary depending on
 The file organization used and the access method implemented for the file organization.
 Whether the data is stored contiguously or in scattered manner, will affect the disk access cost.
2. Storage Cost
 While processing a query, as any query would be composed of many database operations, there
could be one or more intermediate results before reaching the final output.
 These intermediate results should be stored in primary memory for further processing.
 The bigger the intermediate relation, the larger the memory requirement, which will have impact
on the limited available space.
3. Computation Cost
 Query is composed of many operations. The operations could be database operations like reading
and writing to a disk, or mathematical and other operations like:
 Searching
 Sorting
 Merging
 Computation on field values
4. Communication Cost
 In most database systems the database resides in one station and various queries originate from
different terminals.
 This will have impact on the performance of the system adding cost for query processing.
 Thus, the cost of transporting data between the database site and the terminal from where the query
originate should be analyzed.
3. Query Execution Plans
 As noted above, an execution plan for a relational algebra query consists of a combination of the
relational algebra query tree and information about the access methods to be used for each relation,
as well as the methods to be used in computing the relational operators stored in the tree.
Semantic Query Optimization:
 Uses constraints specified on the database schema in order to modify one query into another query
that is more efficient to execute. Consider the following SQL query:
SELECT E.LNAME, M.LNAME FROM EMPLOYEE E, EMPLOYEE M WHERE
E.SUPERSSN = M.SSN AND E.SALARY > M.SALARY;
Explanation:
 Suppose that we had a constraint on the database schema that stated that no employee can earn
more than his or her direct supervisor.
 If the semantic query optimizer checks for the existence of this constraint, it need not execute the
query at all because it knows that the result of the query will be empty. Techniques known as
theorem proving can be used for this purpose.
Exercise
I. Choose the correct answer from the given options
1. Among the list below, one is not an issue of cost-based optimization.
A) Cost function
B) Number of execution strategies to be considered
C) Computational cost
D) None of the above
2. The basic set of operations for the relational model is known as the _________________.
A) Relational algebra
B) Relational models
C) Relational query
D) Relational theory
3. Nested queries within a query are identified as separate ______________________.
A) query blocks
B) query summary
C) query invention
D) query control
4. The _____________ operation is used to select a subset of the tuples from a relation that satisfy a
selection condition.
A) Delete
B) Select
C) Update
D) Project
5. Which one is a binary relational operation?
A) Selection
B) Deletion
C) Join
D) Projection
6. Which one is a relational algebra operation from set theory?
A) Select
B) Project
C) Rename
D) Cartesian product
7. Which one is a unary relational algebra operation?
A) Cartesian product
B) Rename
C) Division
D) Projection
8. Which one is not catalog information used in cost functions?
A) Record size
B) Number of blocks
C) Cost function
D) Blocking factor
9. Which one is not part of the information about indexes and indexing attributes of a file?
A) Number of levels
B) Selectivity
C) Selection cardinality
D) Number of blocks
10. The disk access cost can be analyzed in terms of:
A) Searching
B) Reading
C) Writing
D) All of the above
II. Write short answers for the following questions
1. To increase your understanding, generalize relational algebra in terms of databases.
2. Illustrate query processing and explain each step clearly.
3. How do you define heuristic query optimization?
CHAPTER TWO

DATABASE SECURITY AND AUTHORIZATION


Introduction to Database Security Issues
 In today's society, some information is extremely important that needs to be protected.
 For example, disclosure or modification of military information could cause danger to
national security.
 A good database security management system has to handle the possible database threats.
 Threat may be any situation or event, whether intentional (planned) or accidental, that may
adversely affect a system and consequently the organization.
 Threats to databases may result in the degradation of some or all of the security goals, such as:
 Loss of Integrity
 Only authorized users should be allowed to modify data.
 For example, students may be allowed to see their grades, but not allowed to modify them.
 Loss of Availability – occurs if the database is not available to those users who have a legitimate
right to use the data.
 Authorized users should not be denied access.
 For example, an instructor who wishes to change a grade should be allowed to do so.
 Loss of Confidentiality
 Information should not be disclosed to unauthorized users.
 For example, a student should not be allowed to examine other students' grades.
Authentication
 All users of the database will have different access levels and permission for different data
objects.
 Authentication is the process of checking whether the user is the one with the privilege for the
access level. Thus, the system will check whether the user with a specific username and password
is trying to use the resource.
Authorization/Privilege
 Authorization refers to the process that determines the mode in which a particular (previously
authenticated) client is allowed to access a specific resource controlled by a server.
 Any database access request will have the following three major components.
1. Requested Operation: what kind of operation is requested by a specific query?
2. Requested Object: on which resource or data of the database is the operation sought to be
applied?
3. Requesting User: who is the user requesting the operation on the specified object?
Forms of user authorization
There are different forms of user authorization on the resource of the database. These include:
1. Read Authorization: the user with this privilege is allowed only to read the content of the data
object.
2. Insert Authorization: the user with this privilege is allowed only to insert new records or items
to the data object.
3. Update Authorization: users with this privilege are allowed to modify content of attributes but
are not authorized to delete the records.
4. Delete Authorization: users with this privilege are only allowed to delete a record and not
anything else.
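In SQL, these forms of authorization correspond directly to privileges granted with the GRANT statement; a minimal sketch, assuming a hypothetical user account user1 and table Student:

GRANT SELECT ON Student TO user1;  -- read authorization
GRANT INSERT ON Student TO user1;  -- insert authorization
GRANT UPDATE ON Student TO user1;  -- update authorization
GRANT DELETE ON Student TO user1;  -- delete authorization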
Note: Different users, depending on the power of the user, can have one or the combination of the above
forms of authorization on different data objects.
Database Security and the DBA
 The database administrator (DBA) is the central authority for managing a database system.
 The DBA’s responsibilities include
 Account creation
 granting privileges to users who need to use the system
 Privilege revocation
 classifying users and data in accordance with the policy of the organization
Access Protection, User Accounts, and Databases Audits
 Whenever a person or group of persons need to access a database system, the individual or group must
first apply for a user account.
 The DBA will then create a new account id and password for the user if he/she believes there is a
legitimate need to access the database.
 The user must log in to the DBMS by entering account id and password whenever database access is
needed.
 The database system must also keep track of all operations on the database that are applied by a
certain user throughout each login session.
 If any tampering with the database is assumed, a database audit is performed
 A database audit consists of reviewing the log to examine all accesses and operations
applied to the database during a certain time period.
 A database log that is used mainly for security purposes is sometimes called an audit trail.
 To protect databases against the possible threats two kinds of countermeasures can be implemented:
1. Access control, and 2. Encryption
Access Control (AC)
1. Discretionary Access Control (DAC)
 The typical method of enforcing discretionary access control in a database system is based on the
granting and revoking of privileges.
 The granting and revoking of discretionary privileges is described by the access matrix
model, where
 The rows of a matrix M represents subjects (users, accounts, programs)
 The columns represent objects (relations, records, columns, views, operations).
 Each position M(i,j) in the matrix represents the types of privileges (read, write, update)
that subject i holds on object j.
 To control the granting and revoking of relation privileges, each relation R in a database is
assigned an owner account, which is typically the account that was used when the relation was
created in the first place.
 The owner of a relation is given all privileges on that relation.
 The owner account holder can pass privileges on any of the owned relation to other users
by granting privileges to their accounts.
Privileges Using Views
 The mechanism of views is an important discretionary authorization mechanism in its own right.
 For example, If the owner A of a relation R wants another account B to be able to retrieve
only some fields of R, then A can create a view V of R that includes only those attributes
and then grant SELECT on V to B.
Revoking Privileges
 In some cases, it is desirable to grant a privilege to a user temporarily.
 For example, the owner of a relation may want to grant the SELECT privilege to a user for a
specific task and then revoke that privilege once the task is completed. Hence, a mechanism for
revoking privileges is needed.
 In SQL, a REVOKE command is included for the purpose of canceling privileges.
Propagation of Privileges using the GRANT OPTION
 Whenever the owner A of a relation R grants a privilege on R to another account B, privilege can
be given to B with or without the GRANT OPTION.
 If the GRANT OPTION is given, this means that B can also grant that privilege on R to other
accounts.
 Suppose that B is given the GRANT OPTION by A and that B then grants the privilege on R
to a third account C, also with GRANT OPTION.
 In this way, privileges on R can propagate to other accounts without the knowledge of the
owner of R.
 If the owner account A now revokes the privilege granted to B, all the privileges that B
propagated based on that privilege should automatically be revoked by the system.
Example 1
 Suppose that the DBA creates four accounts: A1, A2, A3, A4 and wants only A1 to be able to
create relations. Then the DBA must issue the following GRANT command in SQL: -
GRANT CREATE TABLE TO A1;
Example 2
 Suppose that A1 creates the two base relations EMPLOYEE and DEPARTMENT. A1 is then
owner of these two relations and hence A1 has all the relation privileges on each of them.
 Suppose that A1 wants to grant A2 the privilege to insert and delete rows in both of these relations,
but A1 does not want A2 to be able to propagate these privileges to additional accounts:
 GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;
Example 3
 Suppose that A1 wants to allow A3 to retrieve information from either of the table (Department
or Employee) and also to be able to propagate the SELECT privilege to other accounts. A1 can
issue the command:
 GRANT SELECT ON EMPLOYEE, DEPARTMENT TO A3 WITH GRANT
OPTION;
 A3 can grant the SELECT privilege on the EMPLOYEE relation to A4 by issuing:
 GRANT SELECT ON EMPLOYEE TO A4;
 Notice that A4 can’t propagate the SELECT privilege because GRANT OPTION was not given
to A4
Example 4
Suppose that A1 decides to revoke the SELECT privilege on the EMPLOYEE relation from A3;
A1 can issue:
 REVOKE SELECT ON EMPLOYEE FROM A3;
 The DBMS must now automatically revoke the SELECT privilege on EMPLOYEE from A4, too,
because A3 granted that privilege to A4 and A3 does not have the privilege any more.
Example 5
 Suppose that A1 wants to give back to A3 a limited capability to SELECT from the EMPLOYEE
relation and wants to allow A3 to be able to propagate the privilege. The limitation is to retrieve
only the NAME, BDATE, and ADDRESS attributes and only for the tuples with DNO=5.
 A1 then creates the view:
 CREATE VIEW A3EMPLOYEE AS SELECT NAME, BDATE, ADDRESS
FROM EMPLOYEE WHERE DNO = 5;
 After the view is created, A1 can grant SELECT on the view A3EMPLOYEE to A3 as follows:
 GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;
Example 6
Finally, suppose that A1 wants to allow A4 to update only the SALARY attribute of
EMPLOYEE; A1 can issue:
 GRANT UPDATE ON EMPLOYEE (SALARY) TO A4;
2. Mandatory Access Control (MAC)
 DAC techniques are an all-or-nothing method: A user either has or does not have a certain privilege.
 In many applications, additional security policy is needed that classifies data and users based on
security classes.
 Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U),
where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U
 The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies
each subject (user, account, program) and object (relation, tuple, column, view, operation) into
one of the security classifications, TS, S, C, or U:
 The clearance (classification) of a subject S is referred to as class(S), and the classification of an
object O as class(O).
 Two restrictions are enforced on data access based on the subject/object classifications:
 A subject S is not allowed read access to an object O unless class(S) ≥ class(O).
 A subject S is not allowed to write an object O unless class(S) ≤ class(O).
 To incorporate multilevel security notions into the relational database model, it is common to
consider attribute values and rows as data objects. Hence, each attribute A is associated with a
classification attribute C in the schema.
 In addition, in some models, a tuple classification attribute TC is added to the relation attributes
to provide a classification for each tuple as a whole.
 Hence, a multilevel relation schema R with n attributes would be represented as
 R(A1,C1,A2,C2, …, An,Cn,TC) where each Ci represents the classification attribute
associated with attribute Ai.
 The value of the TC attribute in each tuple t – which is the highest of all attribute classification
values within t – provides a general classification for the tuple itself.
 Whereas, each Ci provides a finer security classification for each attribute value within the tuple.
 A multilevel relation will appear to contain different data to subjects (users) with different
clearance levels.
 In some cases, it is possible to store a single tuple in the relation at a higher classification
level and produce the corresponding tuples at a lower-level classification through a process
known as filtering.
 In other cases, it is necessary to store two or more tuples at different classification levels
with the same value for the apparent key.
 This leads to the concept of polyinstantiation, where several tuples can have the same apparent
key value but have different attribute values for users at different classification levels.
 Example: Consider the query SELECT * FROM EMPLOYEE against a multilevel relation (the four
versions of the table are not reproduced here):
a. the original EMPLOYEE table,
b. the EMPLOYEE table after filtering for classification C users,
c. the EMPLOYEE table after filtering for classification U users,
d. polyinstantiation of the Smith row for C users who want to modify some value.
 A user with a security clearance S would see the same relation shown above (a) since all row
classification are less than or equal to S as shown in (a).
 However, a user with security clearance C would not be allowed to see the values for the salary of
Brown and the job performance of Smith, since they have a higher classification, as shown in (b)
 For a user with security clearance U, filtering introduces null values for attributes values whose
security classification is higher than the user’s security clearance as shown in (c)
 A user with security clearance C may request an update on the job performance value of Smith
to 'Excellent', and the view will allow him to do so. However, the user shouldn't be allowed to
overwrite the existing value at the higher classification level.
 Solution: create a polyinstantiated tuple for the Smith row at the lower classification level C, as
shown in (d)
Comparing DAC and MAC
 DAC policies are characterized by a high degree of flexibility, which makes them suitable for a large
variety of application domains.
 The main drawback of DAC models is their vulnerability to malicious attacks, such as Trojan horses
embedded in application programs.
 By contrast, mandatory policies ensure a high degree of protection in a way; they prevent any illegal
flow of information.
 Mandatory policies have the drawback of being too rigid and they are only applicable in limited
environments.
 In many practical situations, discretionary policies are preferred because they offer a better trade-off
between security and applicability.
3. Role-Based Access Control
 Its basic notion is that permissions are associated with roles, and users are assigned to appropriate
roles.
 Roles can be created using the CREATE ROLE and DESTROY ROLE commands.
 The GRANT and REVOKE commands discussed under DAC can then be used to assign
and revoke privileges from roles.
 RBAC appears to be a feasible alternative to discretionary and mandatory access controls;
 It ensures that only authorized users are given access to certain data or resources.
 Many DBMSs have allowed the concept of roles, where privileges can be assigned to roles.
 A role hierarchy in RBAC is a natural way of organizing roles to reflect the organization's lines of
authority and responsibility.
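A minimal sketch of role-based authorization in SQL, assuming a hypothetical role registrar, table Student, and user account abebe (exact syntax varies slightly between DBMSs):

CREATE ROLE registrar;
GRANT SELECT, UPDATE ON Student TO registrar;  -- attach privileges to the role
GRANT registrar TO abebe;                      -- assign the role to a user
REVOKE registrar FROM abebe;                   -- later, withdraw the role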
Introduction to Statistical Database Security
 Statistical databases are used mainly to produce statistics on various populations.
 The database may contain confidential data on individuals, which should be protected from user
access.
 Users are permitted to retrieve statistical information on the populations, such as averages, sums,
counts, maximums, minimums, and standard deviations.
 A population is a set of rows of a relation (table) that satisfy some selection condition.
 Statistical queries involve applying statistical functions to a population of rows.
 For example, we may want to retrieve the number of individuals in a population or the average
income in the population.
 However, statistical users are not allowed to retrieve individual data, such as the income of
a specific person.
 Statistical database security techniques must disallow the retrieval of individual data.
This can be achieved by eliminating queries that retrieve attribute values directly and by allowing only
queries that involve statistical aggregate functions such as SUM, MIN, MAX, AVERAGE, and
COUNT. Such queries are sometimes called statistical queries.
 It is DBMS’s responsibility to ensure confidentiality of information about individuals, while still
providing useful statistical summaries of data about those individuals to users.
 Provision of privacy protection of users in a statistical database is paramount.
 In some cases, it is possible to infer the values of individual rows from sequence statistical queries.
 This is particularly true when the conditions result in a population consisting of a small
number of rows.
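A sketch of the distinction, assuming a hypothetical Person table with attributes Name, City and Income: the first (statistical) query would be permitted, while the second retrieves an individual value and must be rejected:

-- Permitted: statistical aggregate over a population
SELECT COUNT(*), AVG(Income) FROM Person WHERE City = 'Wolaita Sodo';

-- Rejected: retrieves an individual's data
SELECT Income FROM Person WHERE Name = 'Almaz';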
Encryption
 Authorization may not be sufficient to protect data in database systems, especially when there
is a situation where data should be moved from one location to the other using network facilities.
 Encryption is used to protect information stored at a particular site or transmitted between sites
from being accessed by unauthorized users.
 Encryption is the encoding of the data by a special algorithm that renders the data unreadable by
any program without the decryption key.
 It is not possible for encrypted data to be read unless the reader knows how to decipher/decrypt the
encrypted data.
 If a database system holds particularly sensitive data, it may be deemed necessary to encode it as
insurance against possible external threats or attempts to access it.
 The DBMS can access data after decoding it, although there is degradation in performance
because of the time taken to decode it.
 Encryption also protects data transmitted over communication lines.
 To transmit data securely over insecure networks requires the use of a Cryptosystem, which
includes:
1. An encryption key to encrypt the data (plaintext)
2. An encryption algorithm that, with the encryption key, transforms the plaintext into
cipher text
3. A decryption key to decrypt the cipher text
4. A decryption algorithm that, with the decryption key, transforms the cipher text back
into plaintext
 The Data Encryption Standard (DES) is an approach that performs both a substitution of characters
and a rearrangement of their order on the basis of an encryption key.
Types of Cryptosystems
 Cryptosystems can be categorized into two:
1. Symmetric encryption – uses the same key for both encryption and decryption and relies on
safe communication lines for exchanging the key.
2. Asymmetric encryption – uses different keys for encryption and decryption.
 Generally,
 Symmetric algorithms are much faster to execute on a computer than those that are
asymmetric.
 Asymmetric algorithms are more secure than symmetric algorithms.
Public Key Encryption algorithm: Asymmetric encryption
 This algorithm operates with modular arithmetic – mod n, where n is the product of two large prime
numbers.
 Two keys, d and e, are used for decryption and encryption.
 n is chosen as a large integer that is a product of two large distinct prime numbers, p and q.
 The encryption key e is a randomly chosen number between 1 and n that is relatively prime
to (p-1) x (q-1).
 The plaintext m is encrypted as C = m^e mod n.
 However, the decryption key d is carefully chosen so that C^d mod n = m.
 The decryption key d can be computed from the condition that d × e − 1 is divisible by (p−1) × (q−1).
 Thus, the legitimate receiver who knows d simply computes C^d mod n = m and recovers m.
Simple Example: Asymmetric encryption
1. Select primes p = 11, q = 3.
2. n = pq = 11*3 = 33
3. Compute phi, which is given by phi = (p-1)(q-1) = 10*2 = 20
4. Choose e = 3 (1 < e < phi)
5. Check that gcd(e, phi) = gcd(e, (p-1)(q-1)) = gcd(3, 20) = 1
6. Compute d (1 < d < phi) such that d*e − 1 is divisible by phi.
Simple testing (d = 2, 3, ...) gives d = 7.
7. Check: ed − 1 = 3*7 − 1 = 20, which is divisible by phi (20).
Given
Public key = (n, e) = (33, 3)
Private key = (n, d) = (33, 7)
 Now say we want to encrypt the message m = 7:
 c = m^e mod n = 7^3 mod 33 = 343 mod 33 = 13
 Hence the ciphertext c = 13.
 To check decryption, we compute
 m = c^d mod n = 13^7 mod 33 = 62,748,517 mod 33 = 7
CHAPTER THREE

TRANSACTION PROCESSING CONCEPTS


Introduction to Transaction Processing
 Single-User System:
 At most one user at a time can use the database management system. Eg. Personal computer
system.
 Multi-user System:
 Many users can access the DBMS concurrently. Eg. Airline reservation, Bank and the like
system are operated by many users who submit transaction concurrently to the system.
 This is achieved by multi programming, which allows the computer to execute multiple
programs /processes at the same time.
Concurrency
 Interleaved processing:
 Concurrent execution of processes is interleaved in a single CPU using for example, round robin
algorithm
 Advantages:
 keeps the CPU busy when the process requires I/O by switching to execute another process rather
than remaining idle during I/O time and hence this will increase system throughput (average no.
of transactions completed within a given time)
 Prevents long process from delaying other processes (minimize unpredictable delay in the
response time).
 Parallel processing:
 Processes are executed concurrently on multiple CPUs.
 A Transaction
 Logical unit of database processing that includes one or more access operations (read -
retrieval, write - insert or update, delete). Examples include ATM transactions, credit
card approvals, flight reservations, hotel check-in, phone calls, supermarket scanning,
academic registration and billing.
 Collections of operations that form a single logical unit of work are called transactions.
 A transaction is a unit of program execution that accesses and possibly updates various
data items. Usually, a transaction is initiated by a user program written in a high-level data-
manipulation language (typically SQL), or programming language (for example, C++, or
Java), with embedded database accesses in JDBC or ODBC.
 Transaction boundaries
 Any single transaction in an application program is bounded with Begin and End
statements. An application program may contain several transactions separated by the
Begin and End transaction boundaries.
 This collection of steps must appear to the user as a single, indivisible unit. Since a
transaction is indivisible, it either executes in its entirety or not at all. Thus, if a transaction
begins to execute but fails for whatever reason, any changes to the database that the
transaction may have made must be undone. This requirement holds regardless of whether
the transaction itself failed. For example,
 it divided by zero,
 the operating system crashed, or
 the computer itself stopped operating.
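A minimal sketch of transaction boundaries in SQL, assuming a hypothetical Account table with attributes AccNo and Balance (the statement that opens a transaction varies slightly between DBMSs):

BEGIN TRANSACTION;  -- Begin boundary
UPDATE Account SET Balance = Balance - 50 WHERE AccNo = 'A';
UPDATE Account SET Balance = Balance + 50 WHERE AccNo = 'B';
COMMIT;             -- End boundary: make all changes permanent
-- If any step fails inside the boundary, ROLLBACK undoes every change instead:
-- ROLLBACK;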
ACID properties of the transactions
1. Atomicity. Either all operations of the transaction are reflected properly in the database, or none
are. This “all-or-none” property is referred to as atomicity.
2. Consistency. Execution of a transaction in isolation (that is, with no other transaction executing
concurrently) preserves the consistency of the database.
A transaction must preserve database consistency—if a transaction is run atomically in isolation
starting from a consistent database, the database must again be consistent at the end of the
transaction.
3. Isolation. Even though multiple transactions may execute concurrently, the system guarantees that,
for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti
started or Tj started execution after Ti finished. Thus, each transaction is unaware of other
transactions executing concurrently in the system.
The database system must take special actions to ensure that transactions operate properly without
interference from concurrently executing database statements. This property is referred to as
isolation.
4. Durability. After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures.
SIMPLE MODEL OF A DATABASE
 A database is a collection of named data items.
 Because SQL is a powerful and complex language, we focus on when data are moved from disk
to main memory and from main memory to disk.
 Granularity of data: a field, a record, or a whole disk block; this measures the size of a data item.
 Basic operations that a transaction can perform are read and write. Transactions access data
using two operations.
 read item(X): Reads a database item named X into a program variable. To simplify our
notation, we assume that the program variable is also named X.
 write item(X): Writes the value of program variable X into the database item named X.
 Basic unit of data transfer from the disk to the computer main memory is one block.
 read item(X) command includes the following steps:
 Find the address of the disk block that contains item X.
 Copy that disk block into a buffer in main memory (if that disk block is not already
in some main memory buffer).
 Copy item X from the buffer to the program variable named X.
 write item(X) command includes the following steps:
 Find the address of the disk block that contains item X.
 Copy that disk block into a buffer in main memory (if that disk block is not already in
some main memory buffer).
 Copy item X from the program variable named X into its correct location in the buffer.
 Store the updated block from the buffer back to disk (either immediately or later).
 The DBMS maintains a number of buffers in the main memory that holds database disk blocks
which contains the database items being processed.
 When this buffer is occupied and
 if there is a need for additional database block to be copied to the main memory.
 Some buffer management policy is used to choose for replacement but if the chosen buffer has been
modified, it must be written back to disk before it is used.
Transaction Atomicity and Durability
 A transaction may not always complete its execution successfully. Such a transaction is termed
aborted.
 If we are to ensure the atomicity property, an aborted transaction must have no effect on the state
of the database. Thus, any changes that the aborted transaction made to the database must be
undone.
 Once the changes caused by an aborted transaction have been undone, we say that the transaction
has been rolled back. It is part of the responsibility of the recovery scheme to manage transaction
aborts. This is done typically by maintaining a log.
 Each database modification made by a transaction is first recorded in the log. We record the
identifier of the transaction performing the modification, the identifier of the data item being
modified, and both the old value (prior to modification) and the new value (after modification) of
the data item. Only then is the database itself modified.
 Maintaining a log provides the possibility of redoing a modification to ensure atomicity and
durability as well as the possibility of undoing a modification to ensure atomicity in case of a failure
during transaction execution.
 A transaction that completes its execution successfully is said to be committed. A committed
transaction that has performed updates transforms the database into a new consistent state, which must
persist even if there is a system failure.
 Once a transaction has committed, we cannot undo its effects by aborting it.
 The only way to undo the effects of a committed transaction is to execute a compensating
transaction (pay costs). For instance, if a transaction added $20 to an account, the compensating
transaction would subtract $20 from the account.
Transaction States
A transaction must be in one of the following states:
 Active, the initial state; the transaction stays in this state while it is executing.
 Partially committed, after the final statement has been executed.
 Failed, after the discovery that normal execution can no longer proceed.
 Aborted, after the transaction has been rolled back and the database has been restored to its state
prior to the start of the transaction.
 Committed, after successful completion.
 A transaction has committed only if it has entered the committed state.
 A transaction has aborted only if it has entered the aborted state.
 A transaction is said to have terminated if it has either committed or aborted.
Figure 1: State diagram of a transaction
 A transaction starts in the active state. When it finishes its final statement, it enters the partially
committed state. At this point, the transaction has completed its execution, but it is still possible
that it may have to be aborted, since the actual output may still be temporarily residing in main
memory, and thus a hardware failure may preclude its successful completion.
 The database system then writes out enough information to disk that, even in the event of a failure, the
updates performed by the transaction can be re-created when the system restarts after the failure. When
the last of this information is written out, the transaction enters the committed state.
A transaction enters the failed state after the system determines that the transaction can no longer proceed
with its normal execution (for example, because of hardware or logical errors). Such a transaction must
be rolled back. Then, it enters the aborted state. At this point, the system has two options:
1. It can restart the transaction, but only if the transaction was aborted as a result of some hardware
or software error that was not created through the internal logic of the transaction. A restarted
transaction is considered to be a new transaction.
2. It can kill the transaction. It usually does so because of some internal logical error that can be
corrected only by rewriting the application program, or because the input was bad, or because the
desired data were not found in the database.
Transaction Isolation
 Transaction-processing systems usually allow multiple transactions to run concurrently.
Allowing multiple transactions to update data concurrently causes several complications with
consistency of the data. There are two good reasons for allowing concurrency:
1. Improved throughput and resource utilization. To increases the throughput of the system—
that is, the number of transactions executed in a given amount of time. Correspondingly, the
processor (CPU) and disk utilization also increase; in other words, the processor and disk
spend less time idle, or not performing any useful work.
2. Reduced waiting time
 If transactions run serially, a short transaction may have to wait for a preceding long
transaction to complete, which can lead to unpredictable delays in running a transaction.
 If the transactions are operating on different parts of the database, it is better to let them
run concurrently, sharing the CPU cycles and disk accesses among them.
 Concurrent execution reduces the unpredictable delays in running transactions.
Moreover, it also reduces the average response time: the average time for a transaction to
be completed after it has been submitted.
 The database system must control the interaction among the concurrent transactions to prevent
them from destroying the consistency of the database. It does so through a variety of mechanisms
called concurrency-control schemes.
Example: Let T1 and T2 be two transactions that transfer funds from one account to another.
A. Transaction T1 transfers 50 birr from account A to account B. It is defined as:
T1: read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B)
B. Transaction T2 transfers 10 percent of the balance from account A to account B. It is defined as:
T2: read(A);
temp := A * 0.1;
A := A − temp;
write(A);
read(B);
B := B + temp;
write(B)
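For concreteness, T2 might be written in SQL as follows; a sketch assuming the same hypothetical Account(AccNo, Balance) table and a MySQL-style session variable for temp:

BEGIN TRANSACTION;
SELECT Balance * 0.1 INTO @temp FROM Account WHERE AccNo = 'A';  -- read(A); temp := A * 0.1
UPDATE Account SET Balance = Balance - @temp WHERE AccNo = 'A';  -- write(A)
UPDATE Account SET Balance = Balance + @temp WHERE AccNo = 'B';  -- read(B); write(B)
COMMIT;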
 Suppose the current values of accounts A and B are 1000 birr and 2000 birr, respectively. Suppose
two transactions are executed one at a time in the order T1 followed by T2.
 This execution sequence appears in Figure 2. In the figure, the sequence of instruction steps is in
chronological order from top to bottom. The final values of accounts A and B, after the execution
in Figure 2 takes place, are 855 birr and 2145 birr, respectively. Thus, the total amount of money
in accounts A and B—that is, the sum A + B—is preserved after the execution of both transactions.
Schedule 1 (serial, T1 followed by T2):
T1: 1 read(A); 2 A := A − 50; 3 write(A); 4 read(B); 5 B := B + 50; 6 write(B); 7 commit
T2: 8 read(A); 9 temp := A * 0.1; 10 A := A − temp; 11 write(A); 12 read(B); 13 B := B + temp; 14 write(B); 15 commit
Figure 2: Schedule 1 – a serial schedule in which T1 is followed by T2.
 Schedule 1—a serial schedule in which T1 is followed by T2. Similarly, if the transactions are
executed one at a time in the order T2 followed by T1, then the corresponding execution sequence
is that of Figure 3. Again, as expected, the sum A + B is preserved, and the final values of accounts
A and B are 850 birr and 2150 birr, respectively.
Schedule 2 (serial, T2 followed by T1):
T2: 1 read(A); 2 temp := A * 0.1; 3 A := A − temp; 4 write(A); 5 read(B); 6 B := B + temp; 7 write(B); 8 commit
T1: 9 read(A); 10 A := A − 50; 11 write(A); 12 read(B); 13 B := B + 50; 14 write(B); 15 commit
Figure 3: Schedule 2 – a serial schedule in which T2 is followed by T1.
 The execution sequences just described are called schedules. They represent the chronological
order in which instructions are executed in the system. These schedules are serial.
 Each serial schedule consists of a sequence of instructions from various transactions, where the
instructions belonging to one single transaction appear together in that schedule.
 When the database system executes several transactions concurrently, the corresponding schedule
no longer needs to be serial. If two transactions are running concurrently, the operating system may
execute one transaction for a little while, then perform a context switch, execute the second
transaction for some time, and then switch back to the first transaction for some time, and so on.
With multiple transactions, the CPU time is shared among all the transactions.
 Not all concurrent executions result in a correct state. Consider the schedule of Figure 5. After the
execution of this schedule, we arrive at a state where the final values of accounts A and B are 950
birr and 2100 birr, respectively. This final state is an inconsistent state, since we have gained 50
birr in the process of the concurrent execution. Indeed, the sum A + B is not preserved by the
execution of the two transactions.
 It is the job of the database system to ensure that any schedule that is executed will leave the
database in a consistent state. The concurrency-control component of the database system carries
out this task.
Schedule 3 (concurrent, equivalent to Schedule 1):
T1: 1 read(A); 2 A := A − 50; 3 write(A)
T2: 4 read(A); 5 temp := A * 0.1; 6 A := A − temp; 7 write(A)
T1: 8 read(B); 9 B := B + 50; 10 write(B); 11 commit
T2: 12 read(B); 13 B := B + temp; 14 write(B); 15 commit
Figure 4: Schedule 3 – a concurrent schedule equivalent to Schedule 1.
 We can ensure consistency of the database under concurrent execution by making sure that any
schedule that is executed has the same effect as a schedule that could have occurred without any
concurrent execution. That is, the schedule should, in some sense, be equivalent to a serial
schedule. Such schedules are called serializable schedules.
Schedule 4 (concurrent, resulting in an inconsistent state):
T1: 1 read(A); 2 A := A − 50
T2: 3 read(A); 4 temp := A * 0.1; 5 A := A − temp; 6 write(A); 7 read(B)
T1: 8 write(A); 9 read(B); 10 B := B + 50; 11 write(B); 12 commit
T2: 13 B := B + temp; 14 write(B); 15 commit
Figure 5: Schedule 4 – a concurrent schedule resulting in an inconsistent state.
Serializability
 Serial schedules are serializable, but if steps of multiple transactions are interleaved, it is harder to
determine whether a schedule is serializable. Since transactions are programs, it is difficult to
determine exactly
 What operations a transaction performs and
 How operations of various transactions interact.
 For this reason, we shall consider only on two types of operations that a transaction can perform
on a data item: read and write.
 We assume that, between a read (Q) instruction and a write (Q) instruction on a data item Q, a
transaction may perform an arbitrary sequence of operations on the copy of Q that is residing in
the local buffer of the transaction.
 Let us consider a schedule S in which there are two consecutive instructions, I and J, of
transactions Ti and Tj, respectively (i ≠ j).
 If I and J refer to different data items, then we can swap (exchange) I and J without affecting the
results of any instruction in the schedule.
 If I and J refer to the same data item Q, then the order of the two steps may matter.
1. I = read (Q), J = read (Q). The order of I and J does not matter.
2. I = read (Q), J = write (Q). If I come before J, then Ti does not read the value of Q
that is written by Tj in instruction J. If J comes before I, then Ti reads the value of Q
that is written by Tj. Thus, the order of I and J matters.
T1: 1 read(A); 2 write(A)
T2: 3 read(A); 4 write(A)
T1: 5 read(B); 6 write(B)
T2: 7 read(B); 8 write(B)
Figure 6: Schedule 3 showing only the read and write instructions.
T1: 1 read(A); 2 write(A)
T2: 3 read(A)
T1: 4 read(B)
T2: 5 write(A)
T1: 6 write(B)
T2: 7 read(B); 8 write(B)
Figure 7: Schedule 5 – Schedule 3 after swapping a pair of instructions.
3. I = write (Q), J = read (Q). The order of I and J matters for reasons similar to those of
the previous case.
4. I = write (Q), J = write (Q). Since both instructions are write operations, the order of
these instructions does not affect either Ti or Tj.
However, the value obtained by the next read (Q) instruction of S is affected, since the
result of only the latter of the two write instructions is preserved in the database.
If there is no other write (Q) instruction after I and J in S, then the order of I and J
directly affects the final value of Q in the database state that results from schedule S.
We say that I and J conflict if they are operations by different transactions on the same data item, and at
least one of these instructions is a write operation.
To illustrate the concept of conflicting instructions, we consider schedule 3. The write (A) instruction of
T1 conflicts with the read (A) instruction of T2. However, the write (A) instruction of T2 does not conflict
with the read (B) instruction of T1, because the two instructions access different data items.
T1: 1 read(A); 2 write(A); 3 read(B); 4 write(B)
T2: 5 read(A); 6 write(A); 7 read(B); 8 write(B)
Figure 8: Schedule 6 – a serial schedule that is equivalent to Schedule 3.
T3: 1 read(Q)
T4: 2 write(Q)
T3: 3 write(Q)
Figure 9: Schedule 7.
 Let I and J be consecutive instructions of a schedule S. If I and J are instructions of different
transactions and I and J do not conflict, then we can swap the order of I and J to produce a new
schedule S'.
 S' is equivalent to S, since all instructions appear in the same order in both schedules except for I
and J, whose order does not matter.
 Since the write (A) instruction of T2 in schedule 3 does not conflict with the read (B) instruction
of T1, we can swap these instructions to generate an equivalent schedule, schedule 5, in Figure
7. Regardless of the initial system state, schedules 3 and 5 both produce the same final system
state.
 We continue to swap non conflicting instructions:
 Swap the read (B) instruction of T1 with the read (A) instruction of T2.
 Swap the write (B) instruction of T1 with the write (A) instruction of T2.
 Swap the write (B) instruction of T1 with the read (A) instruction of T2.
 The final result of these swaps, schedule 6 of Figure 8, is a serial schedule.
Transaction Isolation and Atomicity
 Effect of transaction failures during concurrent execution.
T1: 1 read(A); 2 write(A)
T5: 3 read(A); 4 commit
T1: 5 read(B)
Example: Schedule 9, a non-recoverable schedule.
 If a transaction Ti fails, for whatever reason, we need to undo the effect of this transaction to
ensure the atomicity property of the transaction.
 In a system that allows concurrent execution, the atomicity property requires that any
transaction Tj that is dependent on Ti (that is, Tj has read data written by Ti) is also
aborted.
 To achieve this, we need to place restrictions on the type of schedules permitted in the
system. In the following two subsections, we address the issue of what schedules are
acceptable from the viewpoint of recovery from transaction failure.
Recoverable Schedules
 A recoverable schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a
data item previously written by Ti, the commit operation of Ti appears before the commit
operation of Tj.
Cascadeless Schedules
 Even if a schedule is recoverable, to recover correctly from the failure of a transaction Ti, we may
have to roll back several transactions. Such situations occur if transactions have read data written
by Ti.
 Example transaction T8 writes a value of A that is read by transaction T9. Transaction T9 writes
a value of A that is read by transaction T10.
 Suppose that, at this point, T8 fails. T8 must be rolled back. Since T9 is dependent on T8,
T9 must be rolled back. Since T10 is dependent on T9, T10 must be rolled back.
 This phenomenon, in which a single transaction failure leads to a series of transaction rollbacks, is called cascading rollback.

T8                      T9                      T10
read(A)
read(B)
write(A)
                        read(A)
                        write(A)
                                                read(A)
abort

Example Schedule 10.
 Cascading rollback is undesirable, since it leads to the undoing of a significant amount of work.
 It is desirable to restrict the schedules to those where cascading rollbacks cannot occur. Such schedules are called cascadeless schedules.
 Formally, a cascadeless schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj. It is easy to verify that every cascadeless schedule is also recoverable.
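Both definitions can be checked mechanically over a schedule. A minimal Python sketch, assuming a schedule is represented as a list of (transaction, action, item) tuples with action in {"read", "write", "commit"} (this representation is ours):

def reads_from(schedule):
    # Yield (reader, writer, read_position) triples where Tj reads an item
    # whose most recent writer is a different transaction Ti.
    last_writer = {}
    for pos, (t, action, item) in enumerate(schedule):
        if action == "write":
            last_writer[item] = t
        elif action == "read":
            w = last_writer.get(item)
            if w is not None and w != t:
                yield (t, w, pos)

def commit_pos(schedule, t):
    for pos, (t2, action, _) in enumerate(schedule):
        if t2 == t and action == "commit":
            return pos
    return None   # t never commits in this schedule

def is_recoverable(schedule):
    # Ti must commit before Tj commits whenever Tj reads from Ti.
    for reader, writer, _ in reads_from(schedule):
        cw, cr = commit_pos(schedule, writer), commit_pos(schedule, reader)
        if cr is not None and (cw is None or cw > cr):
            return False
    return True

def is_cascadeless(schedule):
    # Ti must commit before the read operation of Tj itself.
    for reader, writer, read_pos in reads_from(schedule):
        cw = commit_pos(schedule, writer)
        if cw is None or cw > read_pos:
            return False
    return True

schedule9 = [("T1", "read", "A"), ("T1", "write", "A"),
             ("T5", "read", "A"), ("T5", "commit", None),
             ("T1", "read", "B")]
print(is_recoverable(schedule9))   # False: T5 commits before T1 does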
Transaction Isolation Levels
 Serializability allows programmers to ignore issues related to concurrency when they code transactions.
 The SQL standard also allows a transaction to specify that it may be executed in such a way that
it becomes non-serializable with respect to other transactions. For instance, a transaction may
operate at the isolation level of read uncommitted, which permits the transaction to read a data
item even if it was written by a transaction that has not been committed.
The isolation levels specified by the SQL standard are as follows:
 Serializable usually ensures serializable execution.
 Repeatable read allows only committed data to be read and further requires that, between two reads of a data item by a transaction, no other transaction is allowed to update it.
 Read committed allows only committed data to be read, but does not require repeatable reads. For
instance, between two reads of a data item by the transaction, another transaction may have updated
the data item and committed.
 Read uncommitted allows uncommitted data to be read. It is the lowest isolation level allowed
by SQL.
All the isolation levels above additionally disallow dirty writes, that is, they disallow writes to a data item
that has already been written by another transaction that has not yet committed or aborted.
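In application code, the isolation level is typically chosen per transaction with the SQL-standard SET TRANSACTION statement. A hedged Python sketch, assuming a DB-API-style connection object conn (the helper name set_isolation is ours, not a library call):

VALID_LEVELS = {"READ UNCOMMITTED", "READ COMMITTED",
                "REPEATABLE READ", "SERIALIZABLE"}

def set_isolation(conn, level):
    # Issues the SQL-standard statement; a given DBMS may silently run
    # the transaction at a stricter level than the one requested.
    if level not in VALID_LEVELS:
        raise ValueError("unknown isolation level: " + level)
    conn.cursor().execute("SET TRANSACTION ISOLATION LEVEL " + level)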
Locking
 Instead of locking the entire database, a transaction could lock only those data items that it accesses. Under such a policy, the transaction must hold locks long enough to ensure serializability, but for a period short enough not to harm performance excessively.
 There are two kinds of locks: shared and exclusive.
1. Shared locks are used for data that the transaction reads and
2. Exclusive locks are used for those it writes.
 Many transactions can hold shared locks on the same data item at the same time, but a transaction
is allowed an exclusive lock on a data item only if no other transaction holds any lock (regardless
of whether shared or exclusive) on the data item. This use of two modes of locks along with two-
phase locking allows concurrent reading of data while still ensuring Serializability.
Transactions as SQL Statements
 In SQL, insert statements create new data and delete statements delete data.
 These two statements are write operations, since they change the database, but their interactions with the actions of other transactions are different.
 For example, consider the following SQL query on our university database that finds all instructors who earn more than 90,000 birr.
Select ID, name from instructor where salary > 90000;
 Using our sample instructor relation, we find that only Einstein and Brandt satisfy the condition. Now assume that around the same time we are running our query, another user inserts a new instructor named “James” whose salary is 100,000 birr.
Insert into instructor values (’11111’, ’James’, ’Marketing’, 100000);
 The result of our query will be different depending on whether this insert comes before or after
our query is run. In a concurrent execution of these transactions, it is intuitively clear that they
conflict, but this is a conflict not captured by our simple model. This situation is referred to as the
phantom phenomenon, because a conflict may exist on “phantom” data.
 But in an SQL statement, the specific data items (tuples) referenced may be determined by a where
clause predicate. So, the same transaction, if run more than once, might reference different data
items each time it is run if the values in the database change between runs.
 One way of dealing with the above problem is to recognize that it is not sufficient for concurrency
control to consider only the tuples that are accessed by a transaction; the information used to find
the tuples that are accessed by the transaction must also be considered for the purpose of
concurrency control.
 The information used to find tuples could be updated by an insertion or deletion, or in the case of
an index, even by an update to a search-key attribute.
 For example, if locking is used for concurrency control, the data structures that track the tuples in a relation, as well as index structures, must be appropriately locked. However, such locking can lead to poor concurrency in some situations; index-locking protocols maximize concurrency while ensuring serializability in spite of inserts, deletes, and predicates in queries.
 Let us consider again the query:
Select ID, name from instructor where salary> 90000;
 And the following SQL update:
Update instructor set salary = salary *0.9 where name = ’Wu’;
 If our query reads the entire instructor relation, then it reads the tuple with Wu’s data and conflicts
with the update. However, if an index were available that allowed our query direct access to those
tuples with salary > 90000, then our query would not have accessed Wu’s data at all because
Wu’s salary is initially 90,000 birr in our example instructor relation, and reduces to 81,000 birr
after the update.
 However, using the above approach, it would appear that the existence of a conflict depends on a
low-level query processing decision by the system that is unrelated to a user-level view of the
meaning of the two SQL statements! An alternative approach to concurrency control treats an
insert, delete or update as conflicting with a predicate on a relation, if it could affect the set of
tuples selected by a predicate.
 In our example query above, the predicate is “salary > 90000”, and an update of Wu’s salary from
90,000 birr to a value greater than 90,000 birr, or an update of Einstein’s salary from a value
greater than 90,000 birr to a value less than or equal to 90,000 birr, would conflict with this
predicate. Locking based on this idea is called predicate locking; however, predicate locking is expensive and is not used in practice.
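The intuition behind predicate-based conflict detection can be shown with a tiny check: an update conflicts with a query’s predicate exactly when it moves a tuple into or out of the predicate’s result set. A Python sketch using our running predicate (the second salary pair is a hypothetical illustration, not from the example relation):

def satisfies(salary):
    return salary > 90000      # the query predicate "salary > 90000"

def update_conflicts_with_predicate(old_salary, new_salary):
    # Conflict if the tuple enters or leaves the predicate's result set.
    return satisfies(old_salary) != satisfies(new_salary)

# Wu: 90,000 -> 81,000 birr; the tuple never satisfies the predicate, so no conflict:
print(update_conflicts_with_predicate(90000, 81000))    # False
# A salary moving from above 90,000 birr to below it would conflict:
print(update_conflicts_with_predicate(95000, 85000))    # True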
Summary
 A transaction is a unit of program execution that accesses and updates various data items.
 Transactions are required to have the ACID properties: atomicity, consistency, isolation, and
durability.
 Atomicity ensures that either all the effects of a transaction are reflected in the
database, or none are; a failure cannot leave the database in a state where a transaction
is partially executed.
 Consistency ensures that, if the database is initially consistent, the execution of the
transaction (by itself) leaves the database in a consistent state.
 Isolation ensures that concurrently executing transactions are isolated from one
another, so that each has the impression that no other transaction is executing
concurrently with it.
 Durability ensures that, once a transaction has been committed, that transaction’s
updates do not get lost, even if there is a system failure.
 Concurrent execution of transactions improves throughput of transactions and system utilization,
and also reduces waiting time of transactions.
 The various types of storage in a computer are volatile storage, nonvolatile storage, and stable
storage. Data in volatile storage, such as in RAM, are lost when the computer crashes. Data in
nonvolatile storage, such as disk, are not lost when the computer crashes, but may occasionally be
lost because of failures such as disk crashes. Data in stable storage are never lost.
 Stable storage that must be accessible online is approximated with mirrored disks, or other forms
of RAID, which provide redundant data storage. Offline or archival, stable storage may consist of
multiple tape copies of data stored in physically secure locations.
 When several transactions execute concurrently on the database, the consistency of data may no
longer be preserved. It is therefore necessary for the system to control the interaction among the
concurrent transactions.
 Since a transaction is a unit that preserves consistency, a serial execution of transactions
guarantees that consistency is preserved.
 A schedule captures the key actions of transactions that affect concurrent execution, such as read
and write operations, while abstracting away internal details of the execution of the transaction.
 We require that any schedule produced by concurrent processing of a set of transactions will
have an effect equivalent to a schedule produced when these transactions are run serially in some
order.
 A system that guarantees this property is said to ensure serializability.
 There are several different notions of equivalence leading to the concepts of conflict
serializability and view serializability.
 Serializability of schedules generated by concurrently executing transactions can be ensured
through one of a variety of mechanisms called concurrency-control policies.
 We can test a given schedule for conflict serializability by constructing a precedence graph for the
schedule, and by searching for absence of cycles in the graph. However, there are more efficient
concurrency-control policies for ensuring serializability.
 Schedules must be recoverable, to make sure that if transaction a sees the effects of transaction b, and b then aborts, then a also gets aborted.
 Schedules should preferably be cascadeless, so that the abort of a transaction does not result in cascading aborts of other transactions. Cascadelessness is ensured by allowing transactions to read only committed data.
 The concurrency-control–management component of the database is responsible for handling the
concurrency-control policies.
CHAPTER FOUR

CONCURRENCY CONTROL TECHNIQUES


Introduction to Concurrency control techniques
 When several transactions execute concurrently in the database, however, the isolation property
may no longer be preserved. To ensure that it is, the system must control the interaction among the
concurrent transactions; this control is achieved by the mechanisms called concurrency-control
schemes.
 There are a variety of concurrency-control schemes. The most frequently used schemes are two-
phase locking and snapshot isolation.
Lock-Based Protocols
 One way to ensure isolation is to require that data items be accessed in a mutually exclusive
manner; that is, while one transaction is accessing a data item, no other transaction can modify
that data item.
 The most common method used to implement this requirement is to allow a transaction to access a
data item only if it is currently holding a lock on that item.

Locks
 There are various modes in which a data item may be locked. The two lock modes on which we focus are:
1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on item Q, then
Ti can read, but cannot write, Q.
2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on item
Q, then Ti can both read and write Q.
 Every transaction requests a lock in an appropriate mode on data item Q, depending on the types of
operations that it will perform on Q. The transaction makes the request to the concurrency-control
manager.
 The transaction can proceed with the operation only after the concurrency-control manager grants
the lock to the transaction. The use of these two lock modes allows multiple transactions to read
a data item but limits write access to just one transaction at a time.
 Example: Let A and B represent arbitrary lock modes. Suppose that a transaction Ti requests a lock of mode A on item Q on which transaction Tj (Ti ≠ Tj) currently holds a lock of mode B. If transaction Ti can be granted a lock on Q immediately, in spite of the presence of the mode B lock, then we say mode A is compatible with mode B. Such a function can be represented conveniently by a matrix.
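For the two modes above, the compatibility function is small enough to write out in full. A Python sketch (True means a requested mode can be granted alongside a held mode):

# compatible[(requested_mode, held_mode)]
compatible = {
    ("S", "S"): True,    # many transactions may read the same item
    ("S", "X"): False,   # cannot read while another transaction writes
    ("X", "S"): False,   # cannot write while another transaction reads
    ("X", "X"): False,   # at most one writer at a time
}

def is_compatible(requested, held_modes):
    return all(compatible[(requested, h)] for h in held_modes)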

Granting of Locks
 When a transaction requests a lock on a data item in a particular mode, and no other transaction
has a lock on the same data item in a conflicting mode, the lock can be granted. However, care
must be taken to avoid the following scenario.
 Suppose a transaction T2 has a shared-mode lock on a data item, and transaction T1 requests
an exclusive-mode lock on the data item. Clearly, T1 has to wait for T2 to release the shared-
mode lock.
 Meanwhile, a transaction T3 may request a shared-mode lock on the same data item. The lock
request is compatible with the lock granted to T2, so T3 may be granted the shared-mode lock.
 At this point T2 may release the lock, but still T1 has to wait for T3 to finish. But again, there
may be a new transaction T4 that requests a shared-mode lock on the same data item, and is
granted the lock before T3 releases it.
 In fact, it is possible that there is a sequence of transactions that each requests a shared-mode
lock on the data item, and each transaction releases the lock a short while after it is granted, but
T1 never gets the exclusive-mode lock on the data item. The transaction T1 may never make
progress, and is said to be starved.
 We can avoid starvation of transactions by granting locks in the following manner: When a
transaction Ti requests a lock on a data item Q in a particular mode M, the concurrency-control
manager grants the lock provided that:
1. There is no other transaction holding a lock on Q in a mode that conflicts with M.
2. There is no other transaction that is waiting for a lock on Q and that made its lock request
before Ti. Thus, a lock request will never get blocked by a lock request that is made later.
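The granting rule above amounts to a FIFO queue per data item. A minimal lock-manager sketch (the class and its structure are ours; a real lock manager must also handle lock upgrades, deadlock, and so on):

from collections import defaultdict, deque

compatible = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

class LockManager:
    def __init__(self):
        self.held = defaultdict(list)      # item -> [(txn, mode), ...]
        self.waiting = defaultdict(deque)  # item -> FIFO queue of (txn, mode)

    def request(self, txn, item, mode):
        ok = all(compatible[(mode, m)] for _, m in self.held[item])
        # Rule 2: refuse if any earlier request is still waiting, so a
        # later request can never overtake one made before it.
        if ok and not self.waiting[item]:
            self.held[item].append((txn, mode))
            return True          # lock granted
        self.waiting[item].append((txn, mode))
        return False             # transaction must wait

    def release(self, txn, item):
        self.held[item] = [(t, m) for t, m in self.held[item] if t != txn]
        # Grant queued requests strictly in FIFO order.
        while self.waiting[item]:
            t, m = self.waiting[item][0]
            if all(compatible[(m, m2)] for _, m2 in self.held[item]):
                self.held[item].append((t, m))
                self.waiting[item].popleft()
            else:
                break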
The Two-Phase Locking Protocol
 One protocol that ensures serializability is the two-phase locking protocol. 2-phase locking
protocol is one in which there are 2 phases that a transaction goes through. The first is the growing
phase in which it is acquiring locks, the second is one in which it is releasing locks. This protocol
requires that each transaction issue lock and unlock requests in two phases:
1. Growing phase. A transaction may obtain locks, but may not release any lock.
2. Shrinking phase. A transaction may release locks, but may not obtain any new locks.
 Initially, a transaction is in the growing phase. Once the transaction releases a lock, it enters the
shrinking phase, and it can issue no more lock requests.
 For example, transactions T3 and T4 are two phases. On the other hand, transactions T1 and T2
are not two phase. Note that the unlock instructions do not need to appear at the end of the
transaction.
 For example, in the case of transaction T3, we could move the unlock (B) instruction to just after
the lock-X (A) instruction, and still retain the two-phase locking property.
 We can show that the two-phase locking protocol ensures conflict serializability. Consider any
transaction. The point in the schedule where the transaction has obtained its final lock (the end of
its growing phase) is called the lock point of the transaction.
 Now, transactions can be ordered according to their lock points— this ordering is, in fact, a
serializability ordering for the transactions.
 Two-phase locking does not ensure freedom from deadlock. Observe that transactions T3 and T4
are two phase, but, in schedule 2 they are deadlocked.
 Cascading rollback may occur under two-phase locking.
 Cascading rollbacks can be avoided by a modification of two-phase locking called the strict two-
phase locking protocol.
 This protocol requires not only that locking be two phase, but also that all exclusive-mode
locks taken by a transaction be held until that transaction commits.
 This requirement ensures that any data written by an uncommitted transaction are locked
in exclusive mode until the transaction commits, preventing any other transaction from
reading the data.
 Another variant of two-phase locking is the rigorous two-phase locking protocol, which requires
that all locks be held until the transaction commits.
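The two-phase rule itself is easy to enforce per transaction: once the first lock is released, every later lock request must be refused. A sketch, assuming a lock manager object like the one sketched earlier:

class TwoPhaseTransaction:
    def __init__(self, name, lock_manager):
        self.name = name
        self.lm = lock_manager
        self.shrinking = False     # False = still in the growing phase

    def lock(self, item, mode):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after an unlock")
        return self.lm.request(self.name, item, mode)

    def unlock(self, item):
        self.shrinking = True      # the first unlock starts the shrinking phase
        self.lm.release(self.name, item)

Under strict two-phase locking, unlock would additionally be refused for exclusive locks until commit; under rigorous two-phase locking, for all locks.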
 Deadlock Handling
 A system is in a deadlock state if there exists a set of transactions such that every transaction in the set is waiting for another transaction in the set.
 There exists a set of waiting transactions {T0, T1,..., Tn} such that T0 is waiting for a data item that
T1 holds, and T1 is waiting for a data item that T2 holds, and ... ,and Tn−1 is waiting for a data
item that Tn holds, and Tn is waiting for a data item that T0 holds. None of the transactions can
make progress in such a situation.
 The only remedy to this situation is for the system to roll back some of the transactions involved in the deadlock.
 There are two principal methods for dealing with the deadlock problem.
1. Use a deadlock prevention protocol to ensure that the system will never enter a deadlock state.
2. Allow the system to enter a deadlock state, and then try to recover by using a deadlock detection
and deadlock recovery scheme.
 As we shall see, both methods may result in transaction rollback. Prevention is commonly used if the
probability that the system would enter a deadlock state is relatively high; otherwise, detection and recovery
are more efficient.
 Deadlock Prevention
 There are two approaches to deadlock prevention.
1. Ensures that no cyclic waits can occur by ordering the requests for locks, or requiring all locks
to be acquired together.
2. Performs transaction rollback instead of waiting for a lock, whenever the wait could
potentially result in a deadlock.
 The simplest scheme under the first approach requires that each transaction locks all its data
items before it begins execution. Moreover, either all are locked in one step or none are locked.
There are two main disadvantages to this protocol:
1. It is often hard to predict, before the transaction begins, what data items need to be locked;
2. Data-item utilization may be very low, since many of the data items may be locked but unused for
a long time.
 Another approach for preventing deadlocks is to impose an ordering of all data items, and to
require that a transaction lock data items only in a sequence consistent with the ordering.
 The second approach for preventing deadlocks is to use preemption and transaction rollbacks.
 In preemption, when a transaction Tj requests a lock that transaction Ti holds, the lock
granted to Ti may be preempted by rolling back of Ti, and granting of the lock to Tj.
 To control the preemption, we assign a unique timestamp, based on a counter or on the
system clock, to each transaction when it begins.
 The system uses these timestamps only to decide whether a transaction should wait or roll
back.
 Locking is still used for concurrency control.
 If a transaction is rolled back, it retains its old timestamp when restarted.
 Two different deadlock-prevention schemes using timestamps have been proposed:
1. The wait–die scheme is a non-preemptive technique.
 When transaction Ti requests a data item currently held by Tj, Ti is allowed
to wait only if it has a timestamp smaller than that of Tj (that is, Ti is older
than Tj). Otherwise, Ti is rolled back (dies).
 For example, suppose that transactions T14, T15, and T16 have timestamps
5, 10, and 15, respectively.
i. If T14 requests a data item held by T15, then T14 will wait.
ii. If T16 requests a data item held by T15, then T16 will be rolled back.
2. The wound–wait scheme is a preemptive technique.
 It is a counterpart to the wait–die scheme.
 When transaction Ti requests a data item currently held by Tj, Ti is allowed to wait
only if it has a timestamp larger than that of Tj (that is, Ti is younger than Tj).
Otherwise, Tj is rolled back (Tj is wounded by Ti).
 Returning to our example, with transactions T14, T15, and T16,
i. If T14 requests a data item held by T15, then the data item will be preempted from
T15, and T15 will be rolled back.
ii. If T16 requests a data item held by T15, then T16 will wait.
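Both schemes reduce to a single comparison of timestamps. A compact Python sketch, returning the fate of the request when Ti (timestamp ts_i) asks for an item held by Tj (timestamp ts_j):

def wait_die(ts_i, ts_j):
    # Non-preemptive: an older requester waits; a younger one dies.
    return "Ti waits" if ts_i < ts_j else "Ti is rolled back"

def wound_wait(ts_i, ts_j):
    # Preemptive: an older requester wounds (rolls back) the holder;
    # a younger requester waits.
    return "Tj is rolled back" if ts_i < ts_j else "Ti waits"

# T14, T15, and T16 have timestamps 5, 10, and 15:
print(wait_die(5, 10))      # T14 requests from T15 -> 'Ti waits'
print(wait_die(15, 10))     # T16 requests from T15 -> 'Ti is rolled back'
print(wound_wait(5, 10))    # T14 requests from T15 -> 'Tj is rolled back' (T15 wounded)
print(wound_wait(15, 10))   # T16 requests from T15 -> 'Ti waits'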
 The major problem with both of these schemes is that unnecessary rollbacks may occur.
 Another simple approach to deadlock prevention is based on lock timeouts.
 In this approach, a transaction that has requested a lock waits for at most a specified amount
of time.
 If the lock has not been granted within that time, the transaction is said to time out, and it
rolls itself back and restarts.
 If there was in fact a deadlock, one or more transactions involved in the deadlock will time
out and roll back, allowing the others to proceed. This scheme falls somewhere between
deadlock prevention, where a deadlock will never occur, and deadlock detection and
recovery.
 The timeout scheme is particularly easy to implement, and works well if transactions are
short and if long waits are likely to be due to deadlocks. However, in general it is hard to
decide how long a transaction must wait before timing out. Too long a wait results in
unnecessary delays once a deadlock has occurred. Too short a wait results in transaction
rollback even when there is no deadlock, leading to wasted resources.
 Starvation is also a possibility with this scheme. Hence, the timeout-based scheme has
limited applicability.

Deadlock Detection and Recovery


 If a system does not employ some protocol that ensures deadlock freedom, then a detection and
recovery scheme must be used.
 An algorithm that examines the state of the system is invoked periodically to determine whether a
deadlock has occurred.
 If one has, then the system must attempt to recover from the deadlock. To do so, the system must:
 Maintain information about the current allocation of data items to transactions, as well as any
outstanding data item requests.
 Provide an algorithm that uses this information to determine whether the system has entered
a deadlock state.
 Recover from the deadlock when the detection algorithm determines that a deadlock exists.

Recovery from Deadlock


 When a detection algorithm determines that a deadlock exists, the system must recover from the
deadlock.
 The most common solution is to roll back one or more transactions to break the deadlock. Three
actions need to be taken:
1. Selection of a victim: - Given a set of deadlocked transactions, we must determine which
transaction to roll back to break the deadlock. We should roll back those transactions that will
incur the minimum cost. Many factors may determine the cost of a rollback, including:
A. How long the transaction has computed, and how much longer the transaction will
compute before it completes its designated task.
B. How many data items the transaction has used.
C. How many more data items the transaction needs for it to complete.
D. How many transactions will be involved in the rollback.
2. Rollback: - Once we have decided that a particular transaction must be rolled back, we must
determine how far this transaction should be rolled back. The simplest solution is a total
rollback: Abort the transaction and then restart it.
3. Starvation. In a system where the selection of victims is based primarily on cost factors, it
may happen that the same transaction is always picked as a victim. As a result, this transaction
never completes its designated task, thus there is starvation. We must ensure that a transaction
can be picked as a victim only a (small) finite number of times. The most common solution is
to include the number of rollbacks in the cost factor.

Timestamp-Based Protocols
 The locking protocols that we have described thus far determine the order between every pair of
conflicting transactions at execution time by the first lock that both members of the pair request that
involves incompatible modes.
 Another method for determining the serializability order is to select an ordering among
transactions in advance. The most common method for doing so is to use a timestamp-ordering
scheme.
Timestamps
 With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by TS(Ti).
This timestamp is assigned by the database system before the transaction Ti starts execution.
 If a transaction Ti has been assigned timestamp TS(Ti), and a new transaction Tj enters the system,
then TS(Ti) < TS(Tj).
 There are two simple methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a transaction’s timestamp is equal
to the value of the clock when the transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp has been assigned; that is,
a transaction’s timestamp is equal to the value of the counter when the transaction enters the
system.
 The timestamps of the transactions determine the serializability order. Thus, if TS(Ti) < TS(Tj),
then the system must ensure that the produced schedule is equivalent to a serial schedule in which
transaction Ti appears before transaction Tj.
 To implement this scheme, we associate with each data item Q two timestamp values:
1. W-timestamp(Q) denotes the largest timestamp of any transaction that executed write(Q) successfully.
2. R-timestamp(Q) denotes the largest timestamp of any transaction that executed read(Q) successfully.
 These timestamps are updated whenever a new read(Q) or write(Q) instruction is executed.
The Timestamp-Ordering Protocol
 The timestamp-ordering protocol ensures that any conflicting read and write operations are
executed in timestamp order. This protocol operates as follows:

1. Suppose that transaction Ti issues read(Q).


A. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already
overwritten. Hence, the read operation is rejected, and Ti is rolled back.
B. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q)
is set to the maximum of R-timestamp(Q) and TS(Ti).

2. Suppose that transaction Ti issues write(Q).


A. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed
previously, and the system assumed that that value would never be produced. Hence,
the system rejects the write operation and rolls Ti back.
B. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
Hence, the system rejects this write operation and rolls Ti back.
C. Otherwise, the system executes the write operation and sets W-timestamp(Q) to TS(Ti).
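The two rules translate almost line for line into code. A Python sketch that keeps W-timestamp and R-timestamp per item in dictionaries (timestamps are assumed to start at 1; restarting a rolled-back transaction with a fresh timestamp is omitted):

class TimestampOrdering:
    def __init__(self):
        self.w_ts = {}   # item -> largest timestamp that has written it
        self.r_ts = {}   # item -> largest timestamp that has read it

    def read(self, ts, item):
        if ts < self.w_ts.get(item, 0):
            return "rejected: roll back"   # Q was already overwritten
        self.r_ts[item] = max(self.r_ts.get(item, 0), ts)
        return "read executed"

    def write(self, ts, item):
        if ts < self.r_ts.get(item, 0):
            return "rejected: roll back"   # a later reader needed the old value
        if ts < self.w_ts.get(item, 0):
            return "rejected: roll back"   # attempting an obsolete write
        self.w_ts[item] = ts
        return "write executed"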
 If a transaction Ti is rolled back by the concurrency-control scheme as a result of the issuance of either a read or a write operation, the system assigns it a new timestamp and restarts it.
 To illustrate this protocol, we consider transactions T25 and T26. Transaction T25 displays the
contents of accounts A and B:
T25: read(B);
read(A);
display (A + B).
 Transaction T26 transfers $50 from account B to account A, and then displays the contents of
both:
T26: read(B);
B := B − 50;
write(B);
read(A);
A := A + 50;
write(A);
display (A + B).
Validation-Based Protocols
 The validation protocol requires that each transaction Ti executes in two or three different phases
in its lifetime, depending on whether it is a read-only or an update transaction. The phases are,
in order:
1. Read phase. The system executes transaction Ti. It reads the values of the various data items
and stores them in variables local to Ti. It performs all write operations on temporary local
variables, without updates of the actual database.
2. Validation phase. The validation test is applied to transaction Ti. This determines whether Ti
is allowed to proceed to the write phase without causing a violation of serializability. If a
transaction fails the validation test, the system aborts the transaction.
3. Write phase. If the validation test succeeds for transaction Ti, the temporary local variables
that hold the results of any write operations performed by Ti are copied to the database. Read-
only transactions omit this phase.
 To perform the validation test, we need to know when the various phases of transactions took place.
We shall, therefore, associate three different timestamps with each transaction Ti:
1. Start (Ti), the time when Ti started its execution.
2. Validation (Ti), the time when Ti finished its read phase and started its validation phase.
3. Finish (Ti), the time when Ti finished its write phase.
 The validation test for transaction Ti requires that, for all transactions Tk with TS(Tk) < TS(Ti),
one of the following two conditions must hold:
1. Finish (Tk) < Start (Ti). Since Tk completes its execution before Ti started, the serializability
order is indeed maintained.
2. The set of data items written by Tk does not intersect with the set of data items read by Ti, and
Tk completes its write phase before Ti starts its validation phase (Start(Ti) < Finish(Tk) <
Validation(Ti)). This condition ensures that the writes of Tk and Ti do not overlap. Since the
writes of Tk do not affect the read of Ti, and since Ti cannot affect the read of Tk, the
serializability order is indeed maintained.
T25                     T26
read(B)
                        read(B)
                        B := B − 50
                        read(A)
                        A := A + 50
read(A)
<validate>
display(A + B)
                        <validate>
                        write(B)
                        write(A)

Figure Schedule 6, a schedule produced by using validation.

 As an illustration, consider again transactions T25 and T26. Suppose that TS(T25) < TS(T26).
Then, the validation phase succeeds in the schedule 6. Note that the writes to the actual variables
are performed only after the validation phase of T26. Thus, T25 reads the old values of B and A,
and this schedule is serializable.
 The validation scheme automatically guards against cascading rollbacks, since the actual writes
take place only after the transaction issuing the write has committed. However, there is a possibility
of starvation of long transactions, due to a sequence of conflicting short transactions that cause
repeated restarts of the long transaction.
 To avoid starvation, conflicting transactions must be temporarily blocked, to enable the long
transaction to finish.
 This validation scheme is called the optimistic concurrency-control scheme since transactions
execute optimistically, assuming they will be able to finish execution and validate at the end. In
contrast, locking and timestamp ordering are pessimistic in that they force a wait or a rollback
whenever a conflict is detected, even though there is a chance that the schedule may be conflict
serializable.
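The validation test itself is a direct check over the three timestamps and the read and write sets. A Python sketch (the dictionary representation is ours):

def validate(ti, earlier):
    # ti, and every tk in earlier, is a dict with keys 'start', 'validation',
    # 'finish', 'read_set', and 'write_set'; earlier holds all transactions
    # Tk with TS(Tk) < TS(Ti).
    for tk in earlier:
        if tk["finish"] < ti["start"]:
            continue      # condition 1: Tk finished before Ti started
        if (not (tk["write_set"] & ti["read_set"])
                and tk["finish"] < ti["validation"]):
            continue      # condition 2: writes of Tk do not affect reads of Ti
        return False      # validation fails; Ti is aborted
    return True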

Purpose of Concurrency Control


 To enforce Isolation (through mutual exclusion) among conflicting transactions.
 To preserve database consistency through consistency preserving execution of
transactions.
 To resolve read-write and write-write conflicts.
Concurrency Control Techniques
 Locking
 Timestamp
 Optimistic
 Multi-version
 Lock granularity
 Locking and timestamp are conservative/traditional approaches.

Starvation
 Starvation occurs when a particular transaction consistently waits or is restarted and never gets a chance to proceed further, while other transactions continue normally.
 This may occur if the waiting scheme for item locking:
 Gives priority to some transactions over others
 Has a problem in the victim selection algorithm – it is possible that the same transaction is consistently selected as victim and rolled back, for example in wound–wait
 Solutions
 Use FIFO ordering of waiting transactions
 Give priority to transactions that have waited longer
 Give higher priority to transactions that have been aborted many times
CHAPTER FIVE

DATABASE RECOVERY TECHNIQUES


Recovery Outline and Categorization of Recovery Algorithms
 Recovery from transaction failures usually means that the database is restored to the most recent
consistent state just before the time of failure.
 To do this, the system must keep information about the changes that were applied to data
items by the various transactions. This information is typically kept in the system log.
 A typical strategy for recovery may be summarized informally as follows:
1. If there is extensive damage to a wide portion of the database due to catastrophic failure,
such as a disk crash, the recovery method restores a past copy of the database that was
backed up to archival storage and reconstructs a more current state by reapplying or redoing
the operations of committed transactions from the backed up log, up to the time of failure.
2. When the database on disk is not physically damaged, and a non-catastrophic failure has
occurred, the recovery strategy is to identify any changes that may cause an inconsistency
in the database.
 For non-catastrophic failure, the recovery protocol does not need a complete
archival copy of the database. Rather, the entries kept in the online system log on
disk are analyzed to determine the appropriate actions for recovery.
 Conceptually, we can distinguish two main techniques for recovery from non-catastrophic
transaction failures:
1. Deferred update and
2. Immediate update.

Deferred Update
 The deferred update techniques do not physically update the database on disk until after a
transaction reaches its commit point; then the updates are recorded in the database.
 Before reaching commit, all transaction updates are recorded in the local transaction
workspace or in the main memory buffers that the DBMS maintains.
 Before commit, the updates are recorded persistently in the log, and then after commit,
the updates are written to the database on disk.
 If a transaction fails before reaching its commit point, it will not have changed the database in any
way, so UNDO is not needed.
 It may be necessary to REDO the effect of the operations of a committed transaction from the
log, because their effect may not yet have been recorded in the database on disk. Hence, deferred
update is also known as the NO-UNDO/REDO algorithm.

Immediate Update
 In the immediate update techniques, the database may be updated by some operations of a
transaction before the transaction reaches its commit point.
 However, these operations must also be recorded in the log on disk by force-writing before
they are applied to the database on disk, making recovery still possible.
 If a transaction fails after recording some changes in the database on disk but before reaching its
commit point, the effect of its operations on the database must be undone; that is, the transaction
must be rolled back.
 In the general case of immediate update, both undo and redo may be required during recovery.
This technique, known as the UNDO/REDO algorithm, requires both operations during recovery,
and is used most often in practice.
 A variation of the algorithm where all updates are required to be recorded in the database on disk
before a transaction commits requires undo only, so it is known as the UNDO/NO-REDO
algorithm.
Caching (Buffering) of Disk Blocks
 Typically, multiple disk pages that include the data items to be updated are cached into main
memory buffers and then updated in memory before being written back to disk.
 The caching of disk pages is traditionally an operating system function, but because of its importance to the efficiency of recovery procedures, it is handled by the DBMS by calling low-level operating system routines.
 When the DBMS requests action on some item,
1. First it checks the cache directory to determine whether the disk page containing the
item is in the DBMS cache.
2. Second if it is not, the item must be located on disk, and the appropriate disk pages are
copied into the cache.
 It may be necessary to replace (or flush) some of the cache buffers to make space available for the
new item.
 Some page replacement strategy similar to those used in operating systems, such as least recently used (LRU) or first-in-first-out (FIFO), or a strategy that is DBMS-specific, such as DBMIN or Least-Likely-to-Use, can be used to select the buffers for replacement.
 Two main strategies can be employed when flushing a modified buffer back to disk.
1. In-place updating, writes the buffer to the same original disk location, thus
overwriting the old value of any changed data items on disk.
2. Shadowing, writes an updated buffer at a different disk location, so multiple versions
of data items can be maintained.
 In general, the old value of the data item before updating is called the before image (BFIM),
and the new value after updating is called the after image (AFIM).
 If shadowing is used, both the BFIM and the AFIM can be kept on disk; hence, it is not strictly
necessary to maintain a log for recovering.

Write-Ahead Logging, Steal/No-Steal, and Force/No-Force


 When in-place updating is used, it is necessary to use a log for recovery.
 In this case, the recovery mechanism must ensure that the BFIM of the data item is recorded in
the appropriate log entry and that the log entry is flushed to disk before the BFIM is overwritten
with the AFIM in the database on disk. This process is generally known as write-ahead logging.
 There are two types of log entry information included for a write command:
1. The information needed for UNDO: The UNDO-type log entries include the old value
(BFIM) of the item since this is needed to undo the effect of the operation from the log (by
setting the item value in the database back to its BFIM).
2. The information needed for REDO. A REDO-type log entry includes the new value
(AFIM) of the item written by the operation since this is needed to redo the effect of the
operation from the log (by setting the item value in the database on disk to its AFIM).
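A write-ahead log entry therefore carries both images. A minimal Python sketch of such a record and of how recovery applies it (the in-memory dictionary stands in for the database on disk):

from collections import namedtuple

# transaction id, data item, BFIM (old value), AFIM (new value)
LogRecord = namedtuple("LogRecord", "txn item before after")

def undo(db, rec):
    db[rec.item] = rec.before    # restore the before image (BFIM)

def redo(db, rec):
    db[rec.item] = rec.after     # reapply the after image (AFIM)

db = {"A": 100}
rec = LogRecord("T1", "A", before=100, after=50)
redo(db, rec)    # db["A"] is now 50
undo(db, rec)    # db["A"] is back to 100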
 Standard DBMS recovery terminology includes the terms steal/no-steal and force/no-force,
which specify the rules that govern when a page from the database can be written to disk from
the cache:
1. If a cache buffer page updated by a transaction cannot be written to disk before the
transaction commits, the recovery method is called a no-steal approach.
 On the other hand, if the recovery protocol allows writing an updated buffer before the transaction commits, it is called a steal approach.
2. If all pages updated by a transaction are immediately written to disk before the transaction
commits, it is called a force approach. Otherwise, it is called no-force.
 The force rule means that REDO will never be needed during recovery, since
any committed transaction will have all its updates on disk before it is committed.
 The deferred update (NO-UNDO) recovery scheme follows a no-steal approach. However, typical
database systems employ a steal/no-force strategy.
 The advantage of steal is that it avoids the need for a very large buffer space to store all
updated pages in memory.
 The advantage of no-force is that an updated page of a committed transaction may still
be in the buffer when another transaction needs to update it, thus eliminating the I/O
cost to write that page multiple times to disk, and possibly to have to read it again from
disk.
 This may provide a substantial saving in the number of disk I/O operations when
a specific page is updated heavily by multiple transactions.

Transaction Actions That Do Not Affect the Database


 In general, a transaction will have actions that do not affect the database, such as generating and
printing messages or reports from information retrieved from the database.
 If a transaction fails before completion, we may not want the user to get these reports, since the
transaction has failed to complete.
 If such erroneous reports are produced, part of the recovery process would have to inform
the user that these reports are wrong, since the user may take an action based on these
reports that affects the database. Hence, such reports should be generated only after the
transaction reaches its commit point.
 A common method of dealing with such actions is to issue the commands that generate the reports
but keep them as batch jobs, which are executed only after the transaction reaches its commit
point. If the transaction fails, the batch jobs are canceled.

NO-UNDO/REDO Recovery Based on Deferred Update


 The idea behind deferred update is to defer or postpone any actual updates to the database on disk
until the transaction completes its execution successfully and reaches its commit point.
 During transaction execution, the updates are recorded only in the log and in the cache buffers.
 We can state a typical deferred update protocol as follows:
1. A transaction cannot change the database on disk until it reaches its commit point.
2. A transaction does not reach its commit point until all its REDO-type log entries are
recorded in the log and the log buffer is force-written to disk.
 Notice that step 2 of this protocol is a restatement of the write-ahead logging (WAL) protocol.
 Because the database is never updated on disk until after the transaction commits,
there is never a need to UNDO any operations.
 REDO is needed in case the system fails after a transaction commits but before all its
changes are recorded in the database on disk.
 In this case, the transaction operations are redone from the log entries
during recovery.
 For multiuser systems with concurrency control, the concurrency control and recovery processes
are interrelated.
 Consider a system in which concurrency control uses strict two-phase locking, so the locks
on items remain in effect until the transaction reaches its commit point. After that, the locks
can be released. This ensures strict and serializable schedules.
 If a transaction is aborted for any reason (say, by the deadlock detection method), it is simply
resubmitted, since it has not changed the database on disk.
 A drawback of the method described here is that it limits the concurrent execution of
transactions because all write-locked items remain locked until the transaction reaches
its commit point.
 Additionally, it may require excessive buffer space to hold all updated items until the
transactions commit.
 The method’s main benefit is that transaction operations never need to be undone, for two reasons:
1. A transaction does not record any changes in the database on disk until after it reaches
its commit point—that is, until it completes its execution successfully. Hence, a transaction
is never rolled back because of failure during transaction execution.
2. A transaction will never read the value of an item that is written by an uncommitted
transaction, because items remain locked until a transaction reaches its commit point.
Hence, no cascading rollback will occur.

Recovery Techniques Based on Immediate Update


 In these techniques, when a transaction issues an update command, the database on disk can be
updated immediately, without any need to wait for the transaction to reach its commit point.
 Notice that it is not a requirement that every update be applied immediately to disk; it is
just possible that some updates are applied to disk before the transaction commits.
 Theoretically, we can distinguish two main categories of immediate update algorithms.
1. If the recovery technique ensures that all updates of a transaction are recorded in the
database on disk before the transaction commits, there is never a need to REDO any
operations of committed transactions. This is called the UNDO/NO-REDO recovery
algorithm.
 In this method, all updates by a transaction must be recorded on disk before the
transaction commits, so that REDO is never needed. Hence, this method must utilize the
force strategy for deciding when updated main memory buffers are written back to disk.
2. If the transaction is allowed to commit before all its changes are written to the database, we
have the most general case, known as the UNDO/REDO recovery algorithm. In this case, the
steal/no-force strategy is applied.

Shadow Paging
 This recovery scheme does not require the use of a log in a single-user environment.
 In a multiuser environment, a log may be needed for the concurrency control method.
 Shadow paging considers the database to be made up of a number of fixed size disk pages
(or disk blocks)—say, n—for recovery purposes.
 A directory with n entries is constructed, where the ith entry points to the ith database page on
disk.
 The directory is kept in main memory if it is not too large, and all references—read or write—to database pages on disk go through it.
 When a transaction begins executing, the current directory—whose entries point to the most recent
or current database pages on disk—is copied into a shadow directory.
 The shadow directory is then saved on disk while the current directory is used by the transaction.
 During transaction execution, the shadow directory is never modified. When a write item operation
is performed, a new copy of the modified database page is created, but the old copy of that page is
not overwritten.
 To recover from a failure during transaction execution, it suffices to discard the modified pages and to reinstate the shadow directory. The database thus is returned to its state prior to the transaction that was executing when the crash occurred, and any modified pages are discarded. Committing a transaction corresponds to discarding the previous shadow directory. Since recovery involves neither undoing nor redoing data items, this technique can be categorized as a NO-UNDO/NO-REDO technique for recovery.
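The directory mechanics fit in a few lines. A toy Python sketch, with a growing list standing in for the pool of disk pages (a real implementation would also garbage-collect old pages):

class ShadowPaging:
    def __init__(self, pages):
        self.disk = list(pages)                  # pages are never updated in place
        self.current = list(range(len(pages)))   # current directory
        self.shadow = None

    def begin(self):
        self.shadow = list(self.current)   # copy saved as the shadow directory

    def write(self, i, value):
        self.disk.append(value)                 # new copy of the modified page
        self.current[i] = len(self.disk) - 1    # only the current directory changes

    def commit(self):
        self.shadow = None                 # discard the previous shadow directory

    def crash_recover(self):
        self.current = list(self.shadow)   # reinstate the shadow directory

    def read(self, i):
        return self.disk[self.current[i]]

sp = ShadowPaging(["a", "b"])
sp.begin()
sp.write(0, "a-modified")
sp.crash_recover()
print(sp.read(0))    # 'a': the state before the failed transaction, with no undo or redo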

The ARIES Recovery Algorithm


 It is used in many relational database-related products of IBM.
 ARIES uses a steal/no-force approach for writing, and it is based on three concepts: write-ahead logging, repeating history during redo, and logging changes during undo.
1. The first concept, write-ahead logging, requires that a log record describing a change be written to stable storage before the corresponding change is applied to the database on disk.
2. The second concept, repeating history, means that ARIES will retrace all actions of the database system prior to the crash to reconstruct the database state when the crash occurred. Transactions that were uncommitted at the time of the crash (active transactions) are undone.
3. The third concept, logging during undo, will prevent ARIES from repeating the completed undo operations if a failure occurs during recovery, which causes a restart of the recovery process.
 The ARIES recovery procedure consists of three main steps: analysis, REDO, and UNDO.
1. Analysis: The analysis step identifies the dirty (updated) pages in the buffer and the set of
transactions active at the time of the crash.
2. Redo
 The appropriate point in the log where the REDO operation should start is determined.
 The REDO phase actually reapplies updates from the log to the database.
 Generally, the REDO operation is applied only to committed transactions.
 In ARIES, every log record has an associated log sequence number (LSN) that is monotonically increasing and indicates the address of the log record on disk.
 Each LSN corresponds to a specific change (action) of some transaction.
 Also, each data page will store the LSN of the latest log record corresponding to a change for that page.
3. Undo: The log is scanned backward, and the operations of the transactions that were active at the time of the crash are undone in reverse order.
 A log record is written for any of the following actions: updating a page (write), committing a
transaction (commit), aborting a transaction (abort), undoing an update (undo), and ending
a transaction (end).
Recovery in Multidatabase Systems
 These databases may even be stored on different types of DBMSs; for example, some DBMSs may
be relational, whereas others are object oriented, hierarchical, or network DBMSs.
 In such a case, each DBMS involved in the multidatabase transaction may have its own recovery technique and transaction manager separate from those of the other DBMSs.
 This situation is somewhat similar to the case of a distributed database management
system, where parts of the database reside at different sites that are connected by a
communication network.
 To maintain the atomicity of a multidatabase transaction, it is necessary to have a two-level recovery mechanism.
 A global recovery manager, or coordinator, is needed to maintain information needed for recovery,
in addition to the local recovery managers and the information they maintain (log, tables).
 The coordinator usually follows a protocol called the two-phase commit protocol; whose two
phases can be stated as follows:
 Phase 1. When all participating databases signal the coordinator that the part of the multidatabase transaction involving each has concluded, the coordinator sends a message prepare for commit to each participant to get ready for committing the transaction.
 Each participating database receiving that message will force-write all log records
and needed information for local recovery to disk and then send a ready to commit or
OK signal to the coordinator.
 If the force-writing to disk fails or the local transaction cannot commit for some
reason, the participating database sends cannot commit or not OK signal to the
coordinator.
 If the coordinator does not receive a reply from the database within a certain timeout interval, it assumes a not OK response.
 Phase 2. If all participating databases reply OK, and the coordinator’s vote is also OK, the
transaction is successful, and the coordinator sends a commit signal for the transaction to the
participating databases.
 The net effect of the two-phase commit protocol is that either all participating databases
commit the effect of the transaction or none of them do.
 In case any of the participants—or the coordinator—fails, it is always possible to
recover to a state where either the transaction is committed or it is rolled back.
 A failure during or before Phase 1 usually requires the transaction to be rolled back, whereas
a failure during Phase 2 means that a successful transaction can recover and commit.
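The coordinator’s decision rule is simple to state in code. A Python sketch of the two phases, with message passing abstracted into method calls (participants are assumed to expose prepare, commit, and rollback; a real coordinator must also force-write its own log records at each step):

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare (force-write its log to disk).
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())   # True means "ready to commit / OK"
        except Exception:               # no reply within the timeout interval
            votes.append(False)         # treated as a "not OK" response
    # Phase 2: commit only if every participant (and the coordinator) voted OK.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"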

Database Backup and Recovery from Catastrophic Failures


 So far, all the techniques we have discussed apply to non-catastrophic failures.
 A key assumption has been that the system log is maintained on the disk and is not lost as a result
of the failure.
 Similarly, the shadow directory must be stored on disk to allow recovery when shadow paging is
used.
 The recovery techniques we have discussed use the entries in the system log or the shadow
directory to recover from failure by bringing the database back to a consistent state.
 The recovery manager of a DBMS must also be equipped to handle more catastrophic failures such
as disk crashes.
 The main technique used to handle such crashes is a database backup, in which the
whole database and the log are periodically copied onto a cheap storage medium such
as magnetic tapes or other large capacity offline storage devices.
 In case of a catastrophic system failure, the latest backup copy can be reloaded from the tape to the
disk, and the system can be restarted.
 Data from critical applications such as banking, insurance, stock market, and other databases is
periodically backed up in its entirety and moved to physically separate safe locations.
 To avoid losing all the effects of transactions that have been executed since the last
backup, it is customary to back up the system log at more frequent intervals than full
database backup by periodically copying it to magnetic tape.
 The system log is usually substantially smaller than the database itself and hence can be backed up
more frequently. Therefore, users do not lose all transactions they have performed since the last
database backup.
 All committed transactions recorded in the portion of the system log that has been backed up to
tape can have their effect on the database redone.
 A new log is started after each database backup.
 Hence, to recover from disk failure, the database is first recreated on disk from its latest
backup copy on tape.
 Following that, the effects of all the committed transactions whose operations have been recorded
in the backed-up copies of the system log are reconstructed.
CHAPTER SIX

DISTRIBUTED DATABASE SYSTEMS


Distributed Database Concepts
 Distributed databases bring the advantages of distributed computing to the database management
domain.
 A distributed computing system consists of a number of processing elements, not necessarily
homogeneous, that are interconnected by a computer network, and that cooperate in performing
certain assigned tasks.
 As a general goal, distributed computing systems partition a big, unmanageable problem into
smaller pieces and solve it efficiently in a coordinated manner.
 The economic viability of this approach stems from two reasons:
1. More computer power is harnessed (connected) to solve a complex task, and
2. Each autonomous (independent) processing element can be managed independently and can develop its own applications.

Generally
 A distributed database (DDB) is a collection of multiple logically interrelated databases distributed over a computer network.
 A distributed database management system (DDBMS) is a software system that manages a distributed database while making the distribution transparent to the user.
 A collection of files stored at different nodes of a network, with the interrelationships among them maintained via hyperlinks, has become a common organization on the Internet, with files of Web pages.
 Examples of databases managed by DDBMSs: operational databases, analytical databases, hypermedia databases.

Parallel Versus Distributed Technology


 Turning our attention to system architectures, there are two main types of multiprocessor system
architectures that are commonplace:
1. Shared memory (tightly coupled) architecture: Multiple processors share secondary (disk)
storage and also share primary memory.
2. Shared disk (loosely coupled) architecture: Multiple processors share secondary (disk)
storage but each has their own primary memory.
 These architectures enable processors to communicate without the overhead of exchanging
messages over a network.
 Database management systems developed using the above types of architectures are termed
parallel database management systems rather than DDBMS, since they utilize parallel processor
technology.
 Another type of multiprocessor architecture is called shared nothing architecture. In this
architecture, every processor has its own primary and secondary (disk) memory, no common
memory exists, and the processors communicate over a high-speed interconnection network (bus
or switch).

Advantages of Distributed Databases
 Distributed database management has been proposed for various reasons ranging from
organizational decentralization and economical processing to greater autonomy.
 We highlight some of these advantages here.
1. Management of distributed data with different levels of transparency:
 Ideally, a DBMS should be distribution transparent in the sense of hiding the details of
where each file (table, relation) is physically stored within the system.
 The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented
horizontally (that is, into sets of rows and stored with possible replication).
 The following types of transparencies are possible:
 Distribution or network transparency: This refers to freedom for the user from the
operational details of the network. It may be divided into location transparency and
naming transparency.
A. Location transparency refers to the fact that the command used to perform a task is
independent of the location of data and the location of the system where the command
was issued.
B. Naming transparency implies that once a name is specified, the named objects can
be accessed unambiguously without additional specification.
 Replication transparency: copies of data may be stored at multiple sites for better
availability, performance, and reliability. Replication transparency makes the user
unaware (uninformed) of the existence of copies.
 Fragmentation transparency: Two types of fragmentation are possible.
A. Horizontal fragmentation distributes a relation into sets of tuples (rows).
B. Vertical fragmentation distributes a relation into sub relations where each sub relation
is defined by a subset of the columns of the original relation.
A global query by the user must be transformed into several fragment queries.
Fragmentation transparency makes the user unaware of the existence of
fragments.
2. Increased reliability and availability:
 These are two of the most common potential advantages cited for distributed databases.
A. Reliability is broadly defined as the probability that a system is running (not
down) at a certain time point.
B. Availability is the probability that the system is continuously available during a
time interval.
 When the data and DBMS software are distributed over several sites, one site may fail
while other sites continue to operate.
 Only the data and software that exist at the failed site cannot be accessed. This improves
both reliability and availability. Further improvement is achieved by judiciously
replicating data and software at more than one site.
 In a centralized system, failure at a single site makes the whole system unavailable to all
users.
 In a distributed database, some of the data may be unreachable, but users may still be
able to access other parts of the database.

3. Improved performance:



 A distributed DBMS fragments the database by keeping the data closer to where it is
needed most.
 Data localization reduces the contention for CPU and I/O services and simultaneously
reduces access delays involved in wide area networks.
 When a large database is distributed over multiple sites, smaller databases exist at each
site.
 As a result, local queries and transactions accessing data at a single site have better
performance because of the smaller local databases.
 In addition, each site has a smaller number of transactions executing than if all
transactions are submitted to a single centralized database.
 Moreover, inter query and intra query parallelism can be achieved by executing
multiple queries at different sites, or by breaking up a query into a number of sub
queries that execute in parallel. This contributes to improved performance.
4. Easier expansion (scalability):
 In a distributed environment, expansion of the system in terms of adding more data,
increasing database sizes, or adding more processors is much easier.
 The transparencies we discussed in (1) above lead to a compromise between ease of use
and the overhead cost of providing transparency.
 Total transparency provides the global user with a view of the entire DDBS as if it is a
single centralized system.
 Transparency is provided as a complement to autonomy, which gives the users tighter
control over their own local databases.
 Transparency features may be implemented as a part of the user language, which may
translate the required services into appropriate operations. In addition, transparency
impacts the features that must be provided by the operating system and the DBMS.

Disadvantages of Distributed Databases


 Complexity- Data replication, failure recovery, network management, and so on make the system more
complex than a centralized DBMS.
 Cost- Since a DDBMS needs more people and more hardware, maintaining and running the system
can be more expensive than a centralized system.
 Problem of connecting dissimilar machines- Additional layers of operating system software are
needed to translate and coordinate the flow of data between machines.
 Data integrity and security problem- Because data maintained by distributed systems can be
accessed at any location in the network, controlling the integrity of a database can be difficult.
Additional Functions of Distributed Databases
 Distribution leads to increased complexity in the system design and implementation.
 To achieve the potential advantages listed previously, the DDBMS software must be able to
provide the following functions in addition to those of a centralized DBMS:
 Keeping track of data: The ability to keep track of the data distribution, fragmentation, and
replication by expanding the DDBMS catalog.
 Distributed query processing: The ability to access remote sites and transmit queries and data
among the various sites via a communication network.



 Distributed transaction management: The ability to devise/plan execution strategies for
queries and transactions that access data from more than one site and to synchronize the access
to distributed data and maintain integrity of the overall database.
 Replicated data management: The ability to decide which copy of a replicated data item to
access and to maintain the consistency of copies of a replicated data item.
 Distributed database recovery: The ability to recover from individual site crashes and from
new types of failures, such as the failure of a communication link.
 Security: Distributed transactions must be executed with the proper management of the
security of the data and the authorization/access privileges of users.
 Distributed directory (catalog) management: A directory contains information (metadata)
about data in the database. The directory may be global for the entire DDB, or local for each
site. The placement and distribution of the directory are design and policy issues.
 These functions themselves increase the complexity of a DDBMS over a centralized
DBMS.
 Before we can realize the full potential advantages of distribution, we must find satisfactory
solutions to these design issues and problems.
 Including all this additional functionality is hard to accomplish, and finding optimal solutions is a
step beyond that.
 At the physical hardware level, the following main factors distinguish a DDBMS from a centralized
system:
 There are multiple computers, called sites or nodes.
 These sites must be connected by some type of communication network to transmit data and
commands among sites.
 The sites may all be located in physical proximity—say, within the same building or group of
adjacent buildings—and connected via a local area network, or they may be geographically
distributed over large distances and connected via a long-haul or wide area network.
 Local area networks typically use cables, whereas long-haul networks use telephone lines or
satellites. It is also possible to use a combination of the two types of networks.

Data Fragmentation, Replication, and Allocation Techniques for Distributed Database


Design
 This section covers techniques that are used to break up the database into logical units, called fragments, which may
be assigned for storage at the various sites.
 It also covers data replication, which permits certain data to be stored in more than one site, and the process
of allocating fragments—or replicas of fragments—for storage at the various sites.
 These techniques are used during the process of distributed database design.
 The information concerning data fragmentation, allocation, and replication is stored in a global
directory that is accessed by the DDBS applications as needed.

Data Fragmentation
 There are two approaches to storing a relation in a distributed database: replication and fragmentation.



 Data Fragmentation is a technique used to break up the database into logically related units called
fragments. A database can be fragmented as:
1. Horizontal Fragmentation
2. Vertical Fragmentation
3. Mixed (Hybrid) Fragmentation
 The main reasons for fragmenting a relation are
1. Efficiency (good organization)- data that is not needed by the local applications is not stored
2. Parallelism- a transaction can be divided into several sub queries that operate on fragments
which will increase the degree of concurrency.
 In a DDB, decisions must be made regarding which site should be used to store which portions of
the database.
 Before we decide on how to distribute the data, we must determine the logical units of the database
that are to be distributed.

Horizontal Fragmentation
 A horizontal fragment of a relation is a subset of the tuples in that relation.
 The tuples that belong to the horizontal fragment are specified by a condition on one or more
attributes of the relation. Often, only a single attribute is involved.
 For example, we may define three horizontal fragments on the EMPLOYEE relation: (DNO= 5),
(DNO= 4), and (DNO= 1)—each fragment contains the EMPLOYEE tuples working for a
particular department.
 Similarly, we may define three horizontal fragments for the PROJECT relation, with the
conditions (DNUM= 5), (DNUM= 4), and (DNUM= 1)—each fragment contains the PROJECT
tuples controlled by a particular department.
 Horizontal fragmentation divides a relation "horizontally" by grouping rows to create subsets
of tuples, where each subset has a certain logical meaning. These fragments can then be assigned
to different sites in the distributed system.
 Derived horizontal fragmentation applies the partitioning of a primary relation (DEPARTMENT
in our example) to other secondary relations (EMPLOYEE and PROJECT in our example), which
are related to the primary via a foreign key. This way, related data between the primary and the
secondary relations gets fragmented in the same way.
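
As a hedged illustration (a sketch, not how any particular DDBMS defines fragments internally), the three EMPLOYEE fragments above could be expressed in SQL as materialized selections on DNO:

-- Horizontal fragments of EMPLOYEE, one per department
CREATE TABLE EMPLOYEE_D5 AS SELECT * FROM EMPLOYEE WHERE DNO = 5;
CREATE TABLE EMPLOYEE_D4 AS SELECT * FROM EMPLOYEE WHERE DNO = 4;
CREATE TABLE EMPLOYEE_D1 AS SELECT * FROM EMPLOYEE WHERE DNO = 1;
-- The original relation is recovered by taking the union of the fragments:
-- SELECT * FROM EMPLOYEE_D5 UNION ALL
-- SELECT * FROM EMPLOYEE_D4 UNION ALL
-- SELECT * FROM EMPLOYEE_D1;

Each fragment would then be allocated to the site of the corresponding department.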

Vertical Fragmentation
 Each site may not need all the attributes of a relation, which would indicate the need for a different
type of fragmentation.
 Vertical fragmentation divides a relation "vertically" by columns.
 A vertical fragment of a relation keeps only certain attributes of the relation.
 For example, we may want to fragment the EMPLOYEE relation into two vertical fragments.
 The first fragment includes personal information—NAME, BDATE, ADDRESS, and
SEX—and
 The second includes work-related information—SSN, SALARY, SUPERSSN, DNO.
 This vertical fragmentation is not quite proper because, if the two fragments are stored separately,
we cannot put the original employee tuples back together, since there is no common attribute
between the two fragments.



 It is necessary to include the primary key or some candidate key attribute in every vertical
fragment so that the full relation can be reconstructed from the fragments. Hence, we must add the
SSN attribute to the personal information fragment.
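
A minimal SQL sketch of this vertical fragmentation, with SSN repeated in both fragments so the full relation can be reconstructed, might look like:

-- Vertical fragments of EMPLOYEE; SSN is kept in both so a join can rebuild the relation
CREATE TABLE EMP_PERSONAL AS SELECT SSN, NAME, BDATE, ADDRESS, SEX FROM EMPLOYEE;
CREATE TABLE EMP_WORK     AS SELECT SSN, SALARY, SUPERSSN, DNO FROM EMPLOYEE;
-- Reconstruction of the full relation:
-- SELECT * FROM EMP_PERSONAL NATURAL JOIN EMP_WORK;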

Data Replication and Allocation


 Replication is useful in improving the availability of data.
 The most extreme case is replication of the whole database at every site in the distributed system,
thus creating a fully replicated distributed database.
 This can improve availability remarkably because the system can continue to operate as
long as at least one site is up.
 It also improves performance of retrieval for global queries, because the result of such
a query can be obtained locally from any one site; hence, a retrieval query can be
processed at the local site where it is submitted, if that site includes a server module.
 The disadvantage of full replication is that it can slow down update operations drastically, since
a single logical update must be performed on every copy of the database to keep the copies
consistent. This is especially true if many copies of the database exist.
 Full replication makes the concurrency control and recovery techniques more expensive than
they would be if there were no replication.
 The other extreme from full replication involves having no replication—that is, each fragment is
stored at exactly one site.
 In this case all fragments must be disjoint, except for the repetition of primary keys among vertical
(or mixed) fragments. This is also called non redundant allocation.
 Between these two extremes, we have a wide spectrum of partial replication of the data—that is,
some fragments of the database may be replicated whereas others may not.
 The number of copies of each fragment can range from one up to the total number of sites in
the distributed system.
 A special case of partial replication occurs frequently in applications where mobile workers—
such as sales forces, financial planners, and claims adjustors—carry partially replicated databases
with them on laptops and personal digital assistants and synchronize them periodically with the
server database.
 A description of the replication of fragments is sometimes called a replication schema.
 Each fragment—or each copy of a fragment—must be assigned to a particular site in the
distributed system. This process is called data distribution (or data allocation).
 The choice of sites and the degree of replication depend on the performance and availability goals
of the system and on the types and frequencies of transactions submitted at each site.
 For example, if high availability is required and transactions can be submitted at any site and if
most transactions are retrieval only, a fully replicated database is a good choice.
 However, if certain transactions that access particular parts of the database are mostly submitted
at a particular site, the corresponding set of fragments can be allocated at that site only.
 Data that is accessed at multiple sites can be replicated at those sites. If many updates are
performed, it may be useful to limit replication. Finding an optimal or even a good solution to
distributed data allocation is a complex optimization problem.



Types of Distributed Database Systems
 The term distributed database management system can describe various systems that differ from
one another in many respects.
 The main thing that all such systems have in common is the fact that data and software are
distributed over multiple sites connected by some form of communication network.
 In this section we discuss a number of types of DDBMSs and the criteria and factors that make
some of these systems different. The first factor we consider is the degree of homogeneity of the
DDBMS software. If all servers (or individual local DBMSs) use identical software and all users
(clients) use identical software, the DDBMS is called homogeneous; otherwise, it is called
heterogeneous.
 Another factor related to the degree of homogeneity is the degree of local autonomy.
 If there is no provision for the local site to function as a stand-alone DBMS, then the system has
no local autonomy.
 On the other hand, if direct access by local transactions to a server is permitted, the system has
some degree of local autonomy. At one extreme of the autonomy spectrum, we have a DDBMS
that "looks like" a centralized DBMS to the user.
 A single conceptual schema exists, and all access to the system is obtained through a site that is
part of the DDBMS—which means that no local autonomy exists.
 At the other extreme we encounter a type of DDBMS called a federated DDBMS (or a multi
database system).
 In such a system, each server is an independent and autonomous centralized DBMS that has its
own local users, local transactions, and DBA and hence has a very high degree of local
autonomy.
 The term federated database system (FDBS) is used when there is some global view or schema of
the federation of databases that is shared by the applications.
 On the other hand, a multi database system does not have a global schema and interactively
constructs one as needed by the application. Both systems are hybrids between distributed and
centralized systems and the distinction we made between them is not strictly followed.
 In a heterogeneous FDBS, one server may be a relational DBMS, another network DBMS, and a
third an object or hierarchical DBMS; in such a case it is necessary to have a canonical system
language and to include language translators to translate sub queries from the canonical language
to the language of each server.

Federated Database Management Systems Issues


 The type of heterogeneity present in FDBSs may arise from several sources.
 We discuss these sources first and then point out how the different types of autonomies contribute
to a semantic heterogeneity that must be resolved in a heterogeneous FDBS.
A. Differences in data models: Databases in an organization come from a variety of data models
including the so-called legacy models, the relational data model, the object data model, and
even files.
The modeling capabilities of the models vary. Hence, to deal with them uniformly via a single
global schema or to process them in a single language is challenging.
Even if two databases are both from the RDBMS environment, the same information may be
represented as an attribute name, as a relation name, or as a value in different databases.



This calls for an intelligent query processing mechanism that can relate information based on
metadata.
B. Differences in constraints: Constraint facilities for specification and implementation vary
from system to system. There are comparable features that must be reconciled in the
construction of a global schema. For example, the relationships from ER models are
represented as referential integrity constraints in the relational model. Triggers may have to
be used to implement certain constraints in the relational model. The global schema must also
deal with potential conflicts among constraints.
C. Differences in query languages: Even with the same data model, the languages and their
versions vary. For example, SQL has multiple versions like SQL-89, SQL-92 (SQL2), and
SQL3, and each system has its own set of data types, comparison operators, string
manipulation features, and so on.

Semantic Heterogeneity
 Semantic heterogeneity occurs when there are differences in the meaning, interpretation, and
intended use of the same or related data.
 Semantic heterogeneity among component database systems (DBSs) creates the biggest hurdle in
designing global schemas of heterogeneous databases.
 The design autonomy of component DBSs refers to their freedom of choosing the following design
parameters, which in turn affect the eventual complexity of the FDBS:
A. The universe of discourse from which the data is drawn: For example, two customer accounts
databases in the federation may be from United States and Japan with entirely different sets of
attributes about customer accounts required by the accounting practices. Currency rate
fluctuations would also present a problem. Hence, relations in these two databases which have
identical names—CUSTOMER or ACCOUNT—may have some common and some entirely
distinct information.
B. Representation and naming: The representation and naming of data elements and the
structure of the data model may be pre specified for each local database.
C. The understanding, meaning, and subjective interpretation of data. This is a chief contributor
to semantic heterogeneity.
D. Transaction and policy constraints: These deal with serializability criteria, compensating
transactions, and other transaction policies.
E. Derivation of summaries: Aggregation, summarization, and other data-processing features
and operations supported by the system.
 Communication autonomy of a component DBS refers to its ability to decide whether to
communicate with other component DBSs.
 Execution autonomy refers to the ability of a component DBS to execute local operations without
interference from external operations by other component DBSs and its ability to decide the order
in which to execute them.
 The association autonomy of a component DBS implies that it has the ability to decide whether
and how much to share its functionality (operations it supports) and resources (data it manages)
with other component DBSs.
 The major challenge of designing FDBSs is to let component DBSs interoperate while still
providing the above types of autonomies to them.



 A typical five-level schema architecture supports global applications in the FDBS environment. In
this architecture, the local schema is the conceptual schema (full database definition) of a
component database, and the component schema is derived by translating the local schema into a
canonical data model or common data model (CDM) for the FDBS.
 Schema translation from the local schema to the component schema is accompanied by generation
of mappings to transform commands on a component schema into commands on the corresponding
local schema.
 The export schema represents the subset of a component schema that is available to the FDBS.
 The federated schema is the global schema or view, which is the result of integrating all the
shareable export schemas.
 The external schemas define the schema for a user group or an application, as in the three-level
schema architecture.

Query Processing in Distributed Databases


Data Transfer Costs of Distributed Query Processing
 In a distributed system, several additional factors further complicate query processing.
 The first is the cost of transferring data over the network.
 This data includes intermediate files that are transferred to other sites for further processing, as
well as the final result files that may have to be transferred to the site where the query result is
needed.
 Although these costs may not be very high if the sites are connected via a high-performance
local area network, they become quite significant in other types of networks.
 Hence, DDBMS query optimization algorithms consider the goal of reducing the amount of
data transfer as an optimization criterion in choosing a distributed query execution strategy.
 We illustrate this with two simple example queries.
 Suppose that the EMPLOYEE relation (10,000 records of 100 bytes each) is stored at site 1 and
the DEPARTMENT relation (100 records of 35 bytes each) is stored at site 2. Then:
 The size of the EMPLOYEE relation is 100 * 10,000 = 1,000,000 (10^6) bytes, and
 The size of the DEPARTMENT relation is 35 * 100 = 3,500 bytes.
 Consider the query Q: "For each employee, retrieve the employee name and the name of the
department for which the employee works." This can be stated as follows in the relational algebra:

Q: πFNAME, LNAME, DNAME (EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT)
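
For comparison, the same query expressed in SQL over the given attributes (a straightforward equi-join):

SELECT FNAME, LNAME, DNAME
FROM EMPLOYEE, DEPARTMENT
WHERE DNO = DNUMBER;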

 The result of this query will include 10,000 records, assuming that every employee is related to a
department.
 Suppose that each record in the query result is 40 bytes long.
 The query is submitted at a distinct site 3, which is called the result site because the query result is
needed there.
 Neither the EMPLOYEE nor the DEPARTMENT relations reside at site 3.

There are three simple strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform
the join at site 3. In this case a total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.
The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000
bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to
site 3. In this case 400,000 + 3,500 = 403,500 bytes must be transferred.
If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3. Now
consider another query Q′: "For each department, retrieve the department name and the name of the
department manager." This can be stated as follows in the relational algebra:

Q′: πFNAME, LNAME, DNAME (DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE)
An Overview of Client-Server Architecture and Its Relationship to Distributed Databases
 Distributed database applications are being developed in the context of the client-server
architecture.
 Exactly how to divide the DBMS functionality between client and server has not yet been
established.
 Different approaches have been proposed. One possibility is to include the functionality
of a centralized DBMS at the server level.
 A number of relational DBMS products have taken this approach, where an SQL server is provided
to the clients.
 Each client must then formulate the appropriate SQL queries and provide the user interface and
programming language interface functions.
 Since SQL is a relational standard, various SQL servers, possibly provided by different vendors,
can accept SQL commands.
 The client may also refer to a data dictionary that includes information on the distribution of data
among the various SQL servers, as well as modules for decomposing a global query into a number
of local queries that can be executed at the various sites.
 Interaction between client and server might proceed as follows during the processing of an SQL
query:
1. The client parses a user query and decomposes it into a number of independent site queries.
Each site query is sent to the appropriate server site.
2. Each server processes the local query and sends the resulting relation to the client site.
3. The client site combines the results of the sub queries to produce the result of the originally
submitted query.
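
As a hedged sketch of the three steps above, suppose EMPLOYEE is horizontally fragmented across two server sites (the fragment names here are hypothetical); the client might decompose one global query as follows:

-- Global query submitted by the user:
--   SELECT FNAME, LNAME FROM EMPLOYEE WHERE SALARY > 30000;
-- Site query sent to server site 1, which holds EMPLOYEE_D5:
SELECT FNAME, LNAME FROM EMPLOYEE_D5 WHERE SALARY > 30000;
-- Site query sent to server site 2, which holds EMPLOYEE_D4:
SELECT FNAME, LNAME FROM EMPLOYEE_D4 WHERE SALARY > 30000;
-- Step 3: the client unions the two partial results to form the final answer.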
 In this approach, the SQL server has also been called a transaction server (or a database processor
(DP) or a back-end machine), whereas the client has been called an application processor (AP) (or
a front-end machine).
 The interaction between client and server can be specified by the user at the client level or via a
specialized DBMS client module that is part of the DBMS package.
 For example, the user may know what data is stored in each server, break down a query
request into site sub queries manually, and submit individual sub queries to the various sites.
 The resulting tables may be combined explicitly by a further user query at the client level.
 The alternative is to have the client module undertake these actions automatically.
 In a typical DDBMS, it is customary to divide the software modules into three levels:
1. The server software is responsible for local data management at a site, much like
centralized DBMS software.
2. The client software is responsible for most of the distribution functions; it accesses data
distribution information from the DDBMS catalog and processes all requests that require
access to more than one site. It also handles all user interfaces.
3. The communications software (sometimes in conjunction with a distributed operating
system) provides the communication primitives that are used by the client to transmit



commands and data among the various sites as needed. This is not strictly part of the
DDBMS, but it provides essential communication primitives and services.
 The client is responsible for generating a distributed execution plan for a multisite query or
transaction and for supervising distributed execution by sending commands to servers.
 These commands include local queries and transactions to be executed, as well as commands to
transmit data to other clients or servers.
 Hence, client software should be included at any site where multisite queries are submitted. Another
function controlled by the client (or coordinator) is that of ensuring consistency of replicated copies
of a data item by employing distributed (or global) concurrency control techniques.
 The client must also ensure the atomicity of global transactions by performing global recovery
when certain sites fail.

One possible function of the client is to hide the details of data distribution from the user; that is, it
enables the user to write global queries and transactions as though the database were centralized,
without having to specify the sites at which the data referenced in the query or transaction resides. This
property is called distribution transparency.



CHAPTER SEVEN
Spatial /multimedia/mobile databases
What is a Spatial Database?
 A spatial database is a database that is enhanced to store and access spatial data or data that
defines a geometric space.
 Data on spatial databases are stored as coordinates, points, lines, polygons and topology.
 A spatial RDBMS allows the use of ordinary SQL data types, such as int and varchar, as well as spatial data
types, such as Point, LineString and Polygon, for geometric calculations like distance or
relationships between shapes.
 A conventional RDBMS uses B-tree or hash indexes, which can only process one-dimensional data.
Since spatial data types are two- or three-dimensional:
 R-Tree series or Quad Trees can be used in Spatial RDBMS to process such data;
 Or it is necessary to transform two- or three-dimensional data to one dimensional data,
then B-Tree can be used.
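
As a hedged illustration, PostGIS-style SQL (the syntax varies across spatial DBMSs) can declare a table with a spatial column and back it with an R-tree-like index:

-- A table mixing an ordinary SQL type with a spatial data type
CREATE TABLE cities (
    cname  VARCHAR(50),
    center GEOMETRY(Point, 4326)
);
-- A GiST index provides R-tree-style access to the two-dimensional data
CREATE INDEX cities_center_idx ON cities USING GIST (center);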
Spatial Database Applications
GIS applications (maps):
 Urban planning, route optimization, fire or pollution monitoring, utility networks, etc.
Other applications:
 VLSI design, CAD/CAM, model of human brain, etc.
Traditional applications:
 Multidimensional records
Spatial Data Types

 Point: 2 real numbers
 Line: sequence of points
 Region: area included inside n points

An Example of Spatial Representation
[Figure: sample point, line, and region objects.]



Spatial Relationship
 Topological relationships:
 adjacent, inside, disjoint, etc.
 Direction relationships:
 Above, below, north of, etc.
 Metric relationships: “distance < 100”
EXAMPLE
A database:
 Relation states (sname: string, area: region, spop: int)
 Relation cities (cname: string, center: point; ext: region)
 Relation rivers (rname: string, route:line)
SELECT * FROM rivers WHERE route intersects Ethiopia
SELECT cname, sname FROM cities, states WHERE center inside area
SELECT rname, length (intersection (route, Oromia)) FROM rivers WHERE route intersects
Oromia
Spatial Queries
 Selection queries:
“Find all objects inside query q”, inside-> intersects, north
 Nearest Neighbor queries:
“Find the closest object to a query point q”, k-closest objects
 Spatial join queries: Two spatial relations S1 and S2, find all pairs: {x in S1, y in S2, and x
rel y= true}, rel= intersect, inside, etc



Example of Spatial SQL Nearest Neighbor search
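The figure originally shown here is not reproduced; a minimal PostGIS-style sketch of such a nearest-neighbor search (assuming the cities table above, with a geometry column named center) is:

-- Find the 5 cities closest to a query point q (longitude 38.74, latitude 9.03):
SELECT cname
FROM cities
ORDER BY center <-> ST_SetSRID(ST_MakePoint(38.74, 9.03), 4326)
LIMIT 5;

Here <-> is the PostGIS distance operator, which a GiST (R-tree-style) index can serve efficiently.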

Access Methods
Point Access Methods (PAMs):
 Index methods for 2 or 3-dimensional points (k-d trees, Z-ordering, grid-file)
Spatial Access Methods (SAMs):
 Index methods for 2 or 3-dimensional regions and points (R-trees)
Mobile Database
A mobile database is either a stationary database that can be connected to by a mobile computing device
such as smart phones or PDAs over a mobile network, or a database which is actually carried by the
mobile device. This could be a list of contacts, price information, distance travelled, or any other
information.
 Many applications require the ability to download information from an information
repository and operate on this information even when out of range or disconnected.

Mobile databases are highly concentrated in the retail and logistics industries. They are increasingly
being used in aviation and transportation industry.

Home Directory
An example of this is a mobile workforce. In this scenario, a user would require access to update
information from files in the home directories on a server or customer records from a database.

A home directory is a file system directory on a multi-user operating system containing files for a
given user of the system.



The specifics of the home directory (such as its name and location) are defined by the operating
system involved; for example, Windows systems between 2000 and 2003 keep home directories in
a folder called Documents and Settings.

Mobile Database Considerations


Mobile users must be able to work without a network connection due to poor or even non-existent
connections.

A cache could be maintained to hold recently accessed data and transactions so that they are not lost due
to connection failure.

Users might not require access to truly live data, only recently modified data, and uploading of
changes might be deferred until reconnected.

Bandwidth must be conserved (a common requirement on wireless networks that charge per
megabyte).

Mobile computing devices tend to have slower CPUs and limited battery life.

Users with multiple devices (i.e.: smartphone and tablet) may need to synchronize their devices to
a centralized data store. This may require application-specific automation features.

Users may change location geographically and on the network. Dealing with this is usually left to
the operating system, which is responsible for maintaining the wireless network connection.

Mobile Database Capabilities


 Can physically move around without affecting data availability
 Can reach to the place data is stored
 Can process special types of data efficiently
 Not subjected to connection restrictions
 Very high reachability
 Highly portable
Mobile Database Limitations
 Limited wireless bandwidth
 Wireless communication speed
 Limited energy source (battery power)
 Less secured
 Vulnerable to physical activities
 Hard to make theft proof.
Mobile Database Applications
 Insurance companies
 Emergencies services (Police, medical, etc.)
 Traffic control
 Taxi dispatch



 E-commerce
Multimedia Database Management System (MMDBMS)
A true MMDBMS should be able to:
Operate (create, update, delete, retrieve) with at least all the audio-visual multimedia data types
(text, images, video, audio) and perhaps some non-audio-visual types (e.g. olfactory, taste,
haptic)
Fulfil all the standard requirements of a DBMS, i.e. data integrity, persistence, transaction
management, concurrency control, system recovery, queries, versioning, data
security, etc.
Manipulate huge data volumes (virtually no restriction concerning the number of multimedia
structures and their size)
Allow interaction with the user e.g. object searches, generated media
Retrieve multimedia data based on their content (attributes, features and concepts)
Efficiently manage data distribution over the nodes of a computer network (distributed
architecture)
By way of combination of basic multimedia objects, it should be possible to create new complex
objects (i.e. “specialisation” and “inheritance”)
Because many different data types are involved, special methods may be needed for optimal
storage, access, indexing, and retrieval - in particular, for time-based media objects
May include advanced features such as character recognition, voice recognition, and image
analysis
With hypermedia databases (“hyperbases”), links between generated pages/reports are derived
from associations between objects
Many hypermedia systems are based on relational or object-relational database technologies e.g.
Active Server Pages, ColdFusion, Java Server Pages

Multimedia Data Types


 DBMS provide different kinds of domains for multimedia data:
Large object domains – these are long unstructured sequences of data, e.g. Binary
Large Objects (BLOBs) or Character Large Objects (CLOBs)
File references – instead of holding the data, a file reference contains a link to the data
MS Windows Object Linking and Embedding (OLE) as in Microsoft Access
Actual multimedia domains
 object-relational databases support specialised multimedia data types
 object-oriented databases support specialised multimedia classes
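
A hedged sketch of the large-object domains mentioned above (standard SQL type names; the exact names vary by DBMS):

-- A table storing multimedia content in large-object columns
CREATE TABLE media_assets (
    asset_id   INT PRIMARY KEY,
    title      VARCHAR(200),
    mime_type  VARCHAR(50),   -- e.g. 'image/jpeg', 'audio/mpeg'
    content    BLOB,          -- Binary Large Object holding the media data itself
    transcript CLOB           -- Character Large Object, e.g. subtitles or captions
);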
Multimedia Database Models
 Multimedia File Managers
Programs which organise, index, catalogue, search and retrieve multimedia objects stored
as conventional standalone files in the operating system environment
May include thumbnail panel describing each object and its file characteristics
May also include a “preview” facility, and / or facility to launch associated media-editing
tools
Vary from simple “scrapbook” catalogue to complex programs offering off-line storage
or mounting of files via network



May support data compression, such as JPEG
 Relation DBMS with Extended / Add-on Capabilities
extension of ANSI SQL data types to facilitate storage and manipulation of binary data
e.g. BLOBs
use of filters / windows / external viewers / OLE to handle rich multimedia data
 Object-Oriented Databases
handle all kinds of multimedia data as specialised classes
quite low market share in database technology
 Object-Relational Databases
hybrid combination of features of RDBMS and OODB
fully SQL-compliant, with added support for complex data types via object definition
language (ODL)
 Document Management Systems / Content Management Systems
Specialised systems for managing documents/content of many different types
Automate workflow, pages/reports generated automatically
Very popular for Web sites where content changes rapidly e.g. Kenny’s Bookshop
(www.kennys.ie), Irish Times (ireland.com)
Can buy off-the-shelf (e.g. Terminal Four, Broad Vision), acquire via open source, or
build it yourself! (e.g. Lotus Notes)
Multimedia database management system: system requirements
 High network and communications bandwidth
network standards such as asynchronous transfer mode (ATM), fibre distributed data
interface (FDDI), and Fast Ethernet
improved communications through ISDN, fibre optic technologies, wireless networks,
and cable TV
 High capacity storage, and fast retrieval mechanisms
enhancements in compression algorithms
 Uncompressed CD = 10 MB/min (30 songs approx. 1GB)
 MP3 high-quality compression = 1 MB/min (300 songs approx. 1GB)
 MPEG-AAC (advanced audio) = 500 KB/min (600 songs approx. 1 GB)
optical disks and magneto-optic technologies can store many terabytes (TB) of data
which is accessible near-line
 Highly-specified desktop PCs / workstations
desktop machines must be capable of processing multimedia data, and must be fitted with
requisite peripheral devices
 User interfaces
 interfaces need to be redesigned to facilitate real-time multi-user interactive multimedia
applications
Multimedia data storage
 High capacity storage, and fast retrieval mechanisms
 the preferred storage approach is HSM (hierarchical storage management)



[Figure: the HSM storage hierarchy. Main Memory (RAM) sits at the top, followed by on-line
devices (magnetic disks, optical disks), near-line devices (optical storage), and off-line devices
(magnetic tapes, optical document storage). Moving down the hierarchy, storage capacity,
permanence, and access time increase; moving up, probability of access, cost, and performance
increase.]

Applications of Multimedia Databases

Generic business & office information systems: document imaging, document editing tools,
multimedia email, multimedia conferencing, multimedia workflow management, teleworking
Software development: multimedia object libraries; computer-aided software engineering
(CASE); multimedia communication histories; multimedia system artefacts
Education: multimedia encyclopaedia; multimedia courseware & training materials; education-
on-demand (distance education, JIT learning)
Banking: tele-banking
Retail: home shopping; customer guidance
Tourism & hospitality: trip visualisation & planning; sales & marketing
Publishing: electronic publishing; document editing tools; multimedia archives
Medicine: digitised x-rays; patient histories; image analysis; tele-consultation
Criminal investigation: biometrics; fingerprint matching; “photo-fit” analysis
Entertainment: video-on-demand; interactive TV; Virtual Reality gaming
Museums & Libraries
Science: spatial data analysis; cartographic databases; geographic information systems (GIS’s)
Engineering: computer-aided design / manufacture (CAD / CAM); collaborative design;
concurrent engineering
Pharmaceuticals: dossier management for new drug applications (NDA’s)



CHAPTER EIGHT
WEB- BASED DATABASES
Databases on the World Wide Web
The World Wide Web (WWW)—popularly known as "the Web"—was originally developed in Switzerland at
CERN in the early 1990s as a large-scale hypermedia information service system allowing research scientists to
share information. Today this technology allows universal access to this shared information to anyone
having access to the Internet, and the Web contains hundreds of millions of Web pages within the reach of
millions of users.

In Web technology, basic client-server architecture underlies all activities. Information is stored on
computers designated as Web servers in publicly accessible shared files encoded using Hyper Text Markup
Language (HTML).

A number of tools enable users to create Web pages formatted with HTML tags, freely mixed with
multimedia content from graphics to audio and even video. A page has many interspersed hyperlinks,
each literally a link that enables a user to "browse" or move from one page to another across the Internet. This
ability has given tremendous power to end users in searching and navigating related information, often
across different continents.

Information on the Web is organized according to a Uniform Resource Locator (URL), something similar
to an address that provides the complete pathname of a file. The pathname consists of a string of machine
and directory names separated by slashes and ends in a filename. For example, the table of contents of this
book is currently at the following URL:

A URL always begins with a hypertext transfer protocol (http), which is the protocol used by the Web
browsers, a program that communicates with the Web server, and vice versa. Web browsers interpret and
present HTML documents to users. Popular Web browsers include the Internet Explorer of Microsoft and
the Netscape Navigator. A collection of HTML documents and other files accessible via the URL on a Web
server is called a Web site. In the above URL, "www.awl.com" may be called the Web site of Addison
Wesley Publishing.

Providing Access to Databases on the World Wide Web


Today’s technology has been moving rapidly from static to dynamic Web pages, where content may be in
a constant state of flux. The Web server uses a standard interface called the Common Gateway Interface
(CGI) to act as the middleware the additional software layer between the user interface front-end and the
DBMS back-end that facilitates access to heterogeneous databases.



The CGI middleware executes external programs or scripts to obtain the dynamic information, and it returns
the information to the server in HTML, which is given back to the browser. As the Web undergoes its latest
transformations, it has become necessary to allow users access not only to file systems but to databases and
DBMSs to support query processing, report generation, and so forth. The existing approaches may be
divided into two categories:

1. Access using CGI scripts: The database server can be made to interact with the Web server via
CGI. The main disadvantage of this approach is that for each user request, the Web server must
start a new CGI process: each process makes a new connection with the DBMS and the Web server
must wait until the results are delivered to it. No efficiency is achieved by any grouping of multiple
users’ requests; moreover, the developer must keep the scripts in the CGI-bin subdirectories only,
which opens it to a possible breach of security. The fact that CGI has no language associated with
it but requires database developers to learn PERL or Tcl is also a drawback. Manageability of
scripts is another problem if the scripts are scattered everywhere.
2. Access using JDBC: JDBC is a set of Java classes developed by Sun Microsystems to allow access
to relational databases through the execution of SQL statements. It is a way of connecting with
databases, without any additional processes for each client request. Note that JDBC is a name
trademarked by Sun; it does not stand for Java Data Base connectivity as many believe. JDBC has
the capability to connect to a database, to send SQL statements to a database, and to retrieve the
results of a query, using the Java classes Connection, Statement, and ResultSet respectively. With
Java’s claimed platform independence, an application may run on any Java-capable browser, which
loads the Java code from the server and runs it on the client’s browser.
The Java code is DBMS transparent; the JDBC drivers for individual DBMSs on the server end
carry the task of interacting with that DBMS. If the JDBC driver is on the client, the application
runs on the client and its requests are communicated to the DBMS directly by the driver. For
standard SQL requests, many RDBMSs can be accessed this way. A drawback of using JDBC
is the prospect of executing Java through virtual machines, with their inherent inefficiency. The JDBC
bridge to Open Database Connectivity (ODBC) remains another way of getting to the RDBMSs.

Besides CGI, other Web server vendors are launching their own middleware products for providing multiple
database connectivity. These include Internet Server API (ISAPI) from Microsoft and Netscape API
(NSAPI) from Netscape

The Web Integration Option of INFORMIX


Informix has addressed the limitations of CGI and the incompatibilities of CGI, NSAPI, and ISAPI by
creating the Web Integration Option (WIO).



WIO eliminates the need for scripts. Developers use tools to create intelligent HTML pages called
Application Pages (or App Pages) directly within the database. They execute SQL statements dynamically,
format the results inside HTML, and return the resulting Web page to the end users.

Requests are handled by the Web driver, a lightweight CGI process that is invoked when a URL request is received by the Web server. A
unique session identifier is generated for each request, but the WIO application is persistent and does not
terminate after each request. When the WIO application receives a request from the Web driver, it connects
to the database and executes Web Explode, a function that executes queries within Web pages and formats
results as a Web page that goes back to the browser via the Web driver.

Informix HTML tag extensions allow Web authors to create applications that can dynamically construct
Web page templates from the Informix Dynamic Server and present them to the end users.

WIO also lets users create their own customized tags to perform specialized tasks. Thus, without resorting
to any programming or script development, powerful applications can be designed.

Another feature of WIO helps transaction-oriented applications by providing an application programming
interface (API) that offers a collection of basic services, such as connection and session management, that
can be incorporated into Web applications.

WIO supports applications developed in C, C++, and Java. This flexibility lets developer’s port existing
applications to the Web or develops new applications in these languages. The WIO is integrated with Web
server software and utilizes the native security mechanism of the Informix Dynamic Server. The open
architecture of WIO allows the use of various Web browsers and servers.

The ORACLE Web Server


The client requests files that are called "static" or "dynamic" files from the Web server. Static files have a
fixed content whereas dynamic files may have content that includes results of queries to the database.

There is an HTTP daemon (a process that runs continuously) called Web Listener running on the server that
listens for the requests originating in the clients. A static file (document) is retrieved from the file system
of the server and displayed on the Web browser at the client. A request for a dynamic page is passed by the
listener to a Web Request Broker (WRB), which is a multi-threaded dispatcher that connects to cartridges.
Cartridges are software modules that perform specific functions on specific types of data; they can
communicate among themselves.

Currently cartridges are provided for PL/SQL, Java, and Live HTML; customized cartridges may be
provided as well. Web Server has been fully integrated with PL/SQL, making it efficient and scalable.



The cartridges give it additional flexibility, making it possible to work with other languages and software
packages. An advanced secure sockets layer may be used for secure communication over the Internet. The
Designer 2000 development environment has a Web generator that enables previous applications developed for LANs
to be ported to the Internet and intranet environments.

Open Problems with Web Databases


The Web is an important factor in planning for enterprise-wide computing environments, both for providing
external access to the enterprise’s systems and information for customers and suppliers and for marketing
and advertising purposes. At the same time, due to security requirements, employees of some organizations
are restricted to operate within intranets—sub networks that cannot be accessed freely from the outside
world.

Among the prominent applications of the intranet and the WWW are databases to support electronic
storefronts, parts and product catalogs, directories and schedules, newsstands, and bookstores. Electronic
commerce the purchasing of products and services electronically on the Internet is likely to become a major
application supported by such databases.

The future challenges of managing databases on the Web will be many, among them the following:

 Web technology needs to be integrated with the object technology. Currently, the web can be
viewed as a distributed object system, with HTML pages functioning as objects identified by the
URL.
 HTML functionality is too simple to support complex application requirements. As we saw, the
Web Integration Option of Informix adds further tags to HTML. In general, additional facilities
will be needed:
1. To make Web clients function as application front ends, integrating data from multiple
heterogeneous databases;
2. To make Web clients present different views of the same data to different users; and
3. To make Web clients "intelligent" by providing additional data mining functionality
 Web page content can be made more dynamic by adding more "behavior" to it as an object. In this
respect:
1. client and server objects (HTML pages) can be made to interact;
2. Web pages can be treated as collections of programmable objects; and
3. Client-side code can access these objects and manipulate them dynamically.
 The support for a large number of clients coupled with reasonable response times for queries
against very large (several tens of gigabytes in size) databases will be major challenges for Web



databases. They will have to be addressed both by Web servers and by the underlying DBMSs.
Efforts are underway to address the limitations of the current data structuring technology,
particularly by the World Wide Web Consortium (W3C). The W3C is designing a Web Object
Model. W3C is also proposing an Extensible Markup Language (XML) for structured document
interchange on the Web.

XML defines a subset of SGML (the Standard Generalized Markup Language), allowing customization of
markup languages with application-specific tags. XML is rapidly gaining ground due to its extensibility in
defining new tags.

W3C’s Document Object Model (DOM) defines an object-oriented API for HTML or XML documents
presented by a Web client. W3C is also defining metadata modeling standards for describing Internet
resources.

The technology to model information using the standards discussed above and to find information on the
Web is undergoing a major evolution. Overall, the Web servers have to gain robustness as a reliable
technology to handle production-level databases for supporting 24x7 applications—24 hours a day, 7 days
a week.

Security remains a critical problem for supporting applications involving financial and medical databases.
Moreover, transferring from existing database application environments to those on the Web will need
adequate support that will enable users to continue their current mode of operation and an expensive
infrastructure for handling migration of data among systems without introducing inconsistencies. The
traditional database functionality of querying and transaction processing must undergo appropriate
modifications to support Web-based applications. One such area is mobile databases.



CHAPTER NINE
DATA WAREHOUSING
What is Data Warehouse?
 A data warehouse is a subject oriented, integrated, time-variant, and non-volatile
collection of data.
 This data helps analysts to take informed decisions in an organization.
 A Data Warehouse consists of data from multiple heterogeneous data sources and is used
for analytical reporting and decision making.
 Data Warehouse is a central place where data is stored from different data sources and
applications.
 An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place.
 Suppose a business executive wants to analyze previous feedback on any data such as
a product, a supplier, or any consumer data, then the executive will have no data
available to analyze because the previous data has been updated due to transactions.
 A data warehouse provides us generalized and consolidated data in a multidimensional
view.
 Along with a generalized and consolidated view of data, a data warehouse also provides
us Online Analytical Processing (OLAP) tools.
These tools help us in interactive and effective analysis of data in a multidimensional space. This
analysis results in data generalization and data mining.
 Data mining functions such as association, clustering, classification, prediction can be
integrated with OLAP operations to enhance the interactive mining of knowledge at
multiple level of abstraction. That's why data warehouse has now become an important
platform for data analysis and online analytical processing.

 Data comes from multiple heterogeneous data sources into a Data Warehouse. Common data
sources for a data warehouse include:
 Operational databases
 Flat files
 SAP and non-SAP applications



 Data in the data warehouse is accessed by BI (Business Intelligence) users for analytical
reporting, data mining and analysis. This is used for decision making by business users, sales
managers, and analysts to define future strategy.
Understanding a Data Warehouse
 A data warehouse is a database, which is kept separate from the organization's
operational database.
 There is no frequent updating done in a data warehouse.
 It possesses consolidated historical data, which helps the organization to analyze its
business.
 A data warehouse helps executives to organize, understand, and use their data to take
strategic decisions.
 Data warehouse systems help in the integration of diversity of application systems.
Why is a Data Warehouse Separated from Operational Databases?
A data warehouse is kept separate from operational databases due to the following reasons −
 An operational database is constructed for well-known tasks and workloads such as
searching particular records, indexing, etc. In contrast, data warehouse queries are often
complex and they present a general form of data.
 Operational databases support concurrent (parallel) processing of multiple transactions.
Concurrency control and recovery mechanisms are required for operational databases to
ensure robustness and consistency of the database.
 An operational database query allows read and modify operations (INSERT, DELETE and
UPDATE), while an OLAP query needs only read-only access to stored data (SELECT statements).
 An operational database maintains current data of an organization. On the other hand, a
data warehouse maintains historical data.
Data Warehouse Applications
 A data warehouse helps business executives to organize, analyze, and use their data for
decision making.
 A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback
system for the enterprise management.
 Data warehouses are widely used in the following fields:
 Financial services
 Banking services
 Consumer goods
 Retail sectors
 Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data warehouse applications, discussed below:
A. Information Processing − A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs (see the crosstab sketch after this list).
B. Analytical Processing − A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP operations,
including slice-and-dice, drill down, drill up, and pivoting.
C. Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. The mining results can be presented using visualization tools.
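As a small illustration of basic information processing, the following is a minimal sketch of a crosstab report built with conditional aggregation; the sales table and its columns (region, sale_year, amount) are hypothetical names used only for this example.

-- Hypothetical detail table: sales(region, sale_year, amount).
-- A crosstab of total sales per region, one column per year,
-- built with conditional aggregation (SUM over CASE).
SELECT region,
       SUM(CASE WHEN sale_year = 2021 THEN amount ELSE 0 END) AS sales_2021,
       SUM(CASE WHEN sale_year = 2022 THEN amount ELSE 0 END) AS sales_2022,
       SUM(CASE WHEN sale_year = 2023 THEN amount ELSE 0 END) AS sales_2023
FROM   sales
GROUP  BY region;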
Characteristics of a Data Warehouse
The following are the key characteristics of a Data Warehouse −
A. Subject Oriented − In a DW system, the data is categorized and stored by business subject (for example equity plans, shares, or loans) rather than by application.
B. Integrated − Data from multiple data sources are integrated in a Data Warehouse.
C. Non-Volatile − Data in a data warehouse is non-volatile: once data is loaded into the DW system, it is not altered.
D. Time Variant − A DW system contains historical data, as compared to a transactional system, which contains only current data. In a data warehouse you can see data for 3 months, 6 months, 1 year, 5 years, etc.
Note − A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is stored physically separate from the operational database.
OLTP vs. OLAP
 Firstly, OLTP stands for Online Transaction Processing, while OLAP stands for Online
Analytical Processing
 In an OLTP system, there are a large number of short online transactions such as INSERT, UPDATE, and DELETE. An effective measure for an OLTP system is the processing time of these short transactions, which is very low; effectiveness is also measured by the number of transactions per second. OLTP systems control data integrity in multi-access environments.
 An OLTP system contains current and detailed data and is maintained in schemas based on the entity model (3NF).
 For example − a day-to-day transaction system in a retail store, where customer records are inserted, updated, and deleted on a daily basis. It provides fast query processing. OLTP databases contain detailed and current data, and the schema used to store an OLTP database is the entity model.
 In an OLAP system, there is a smaller number of transactions as compared to a transactional system. The queries executed are complex in nature and involve data aggregations.
Data Warehouse Architecture
 Data Warehousing involves data cleaning, data integration, and data consolidation.
 A Data Warehouse has a 3-layer architecture.
1. Data Source Layer: - It defines how the data comes into the Data Warehouse. It involves various data sources, operational transaction systems, flat files, applications, etc.
2. Integration Layer: - It consists of an Operational Data Store and a Staging Area. The staging area is used to perform data cleansing and data transformation and to load data from different sources into the data warehouse. As multiple data sources are available for extraction at different time zones, the staging area is used to hold the data and to apply the transformations later.
3. Presentation Layer: - This is used to perform BI reporting by end users. The data in a DW system is accessed by BI users and used for reporting and analysis.
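To make the integration layer concrete, here is a minimal, hedged sketch of one load step from a staging table into a warehouse fact table; the table and column names (stg_sales, fact_sales, and their columns) are hypothetical, and the cleansing shown is only illustrative.

-- Hypothetical staging table: stg_sales(product_id, store_id, sale_date, amount).
-- Hypothetical warehouse fact table: fact_sales(product_id, store_id, sale_date, amount).
-- Cleansing: discard rows with missing keys; standardize amounts.
INSERT INTO fact_sales (product_id, store_id, sale_date, amount)
SELECT product_id,
       store_id,
       sale_date,
       ROUND(amount, 2)   -- simple standardization step
FROM   stg_sales
WHERE  product_id IS NOT NULL
  AND  store_id IS NOT NULL;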
What is an Aggregation?
 We store tables with aggregated data, for example yearly (1 row), quarterly (4 rows), or monthly (12 rows). If someone has to do a year-to-year comparison, only one row per year will be processed, whereas in an unaggregated table all the detail rows would have to be compared. This is called aggregation.
 There are various aggregation functions that can be used in an OLAP system, such as Sum, Avg, Max, and Min.
 For example − SELECT AVG(salary) FROM employee WHERE title = 'Programmer';
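Building on the yearly example above, the following is a minimal sketch of how such an aggregated table could be produced; the sales detail table and the agg_sales_yearly target are hypothetical names.

-- Hypothetical detail table: sales(sale_date, amount).
-- Build a yearly aggregate (one row per year) so that a
-- year-to-year comparison touches only one row per year.
INSERT INTO agg_sales_yearly (sale_year, total_amount)
SELECT EXTRACT(YEAR FROM sale_date) AS sale_year,
       SUM(amount)                  AS total_amount
FROM   sales
GROUP  BY EXTRACT(YEAR FROM sale_date);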
Key Differences
These are the major differences between an OLAP and an OLTP system.
 Indexes − An OLTP system has only a few indexes, while an OLAP system has many indexes for performance optimization.
 Joins − In an OLTP system there are a large number of joins and the data is normalized; in an OLAP system there are fewer joins and the data is de-normalized.
 Aggregation − In an OLTP system data is not aggregated, while an OLAP database uses many aggregations.
 Normalization − An OLTP system contains normalized data, whereas data is not normalized in an OLAP system.
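To make the contrast concrete, here is a minimal sketch of a typical OLTP statement next to a typical OLAP query; all table and column names (orders, sales, and their columns) are hypothetical.

-- Typical OLTP statement: a short transaction touching one row.
UPDATE orders
SET    status = 'SHIPPED'
WHERE  order_id = 1001;

-- Typical OLAP query: a read-only aggregation over many rows.
SELECT region,
       EXTRACT(YEAR FROM sale_date) AS sale_year,
       SUM(amount)                  AS total_sales
FROM   sales
GROUP  BY region, EXTRACT(YEAR FROM sale_date);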
What is Data Mining?
 Data Mining is the process of extracting useful information and patterns from enormous amounts of data.
 Data Mining includes the collection, extraction, analysis, and statistical treatment of data.
 It is also known as the knowledge discovery process, knowledge mining from data, or data/pattern analysis.
 Data Mining is a logical process of finding useful information in large amounts of data.
 Once the information and patterns are found, they can be used to make decisions for developing the business.
 Data mining tools can give answers to various questions related to your business that were too difficult to resolve before. They also forecast future trends, which lets business people make proactive decisions.
 The information or knowledge extracted in this way can be used in any of the following application areas:
 Market Analysis
 Fraud Detection
 Customer Retention (maintenance)
 Production Control
 Science Exploration

Data mining involves three steps:
 Exploration – In this step the data is cleaned and converted into another form. The nature of the data is also determined.
 Pattern Identification – The next step is to choose the pattern which will make the best prediction.
 Deployment – The identified patterns are used to get the desired outcome.
Benefits of Data Mining
 Automated prediction of trends and behaviors
 It can be implemented on new systems as well as existing platforms
 It can analyze huge databases in minutes
 Automated discovery of hidden patterns
 There are many models available to understand complex data easily
 It is fast, which makes it easy for users to analyze huge amounts of data in less time
 It yields improved predictions
Data Mining Techniques
 Several major data mining techniques have been developed and used in data mining projects recently, including association, classification, clustering, prediction, sequential patterns, and decision trees. We briefly examine these data mining techniques in the following sections.
1. Association
 In association, a pattern is discovered based on a relationship between items in the same transaction.
 That is the reason why the association technique is also known as the relation technique.
 The association technique is used in market basket analysis to identify a set of products that customers frequently purchase together, as the sketch below illustrates.
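The following is a minimal sketch of the first step of market basket analysis in SQL, counting how often two products appear in the same order; the order_items table and its columns are hypothetical.

-- Hypothetical table: order_items(order_id, product_id).
-- Count how often each pair of products appears in the same order.
SELECT a.product_id AS product_a,
       b.product_id AS product_b,
       COUNT(*)     AS times_bought_together
FROM   order_items a
JOIN   order_items b
  ON   a.order_id = b.order_id
 AND   a.product_id < b.product_id
GROUP  BY a.product_id, b.product_id
ORDER  BY times_bought_together DESC;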
2. Classification
 Classification is a classic data mining technique based on machine learning.
 Basically, classification is used to classify each item in a set of data into one of a
predefined set of classes or groups.
 The classification method makes use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics.
 In classification, we develop the software that can learn how to classify the data
items into groups.
 For example, we can apply classification in the application that “given all
records of employees who left the company; predict who will probably leave
the company in a future period.”
 In this case, we divide the records of employees into two groups named "leave" and "stay". We can then ask our data mining software to classify the employees into these separate groups.
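Once such a model has been learned, applying it to records can be as simple as a scoring expression. The following is a purely illustrative sketch: the employees table and the rule itself (the overtime and salary thresholds) are hypothetical stand-ins for whatever a real classifier would actually learn from the data.

-- Hypothetical table: employees(emp_id, overtime_hours, salary).
-- Apply a (hypothetical) learned rule to label each employee.
SELECT emp_id,
       CASE
         WHEN overtime_hours > 20 AND salary < 30000 THEN 'leave'
         ELSE 'stay'
       END AS predicted_group
FROM   employees;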
3. Clustering
 Clustering is a data mining technique that makes meaningful or useful clusters of objects which have similar characteristics, using an automatic technique.
 The clustering technique defines the classes and puts objects in each class, while in the classification technique, objects are assigned into predefined classes.
 To make the concept clearer, we can take book management in the library as an
example.
 In a library, there is a wide range of books on various topics available.
 The challenge is how to keep those books in a way that readers can take
several books on a particular topic without hassle.
 By using the clustering technique, we can keep books that have some kinds
of similarities in one cluster or one shelf and label it with a meaningful name.
 If readers want to grab books in that topic, they would only have to go to that
shelf instead of looking for the entire library.
4. Prediction
 Prediction, as its name implies, is a data mining technique that discovers relationships between independent variables as well as between dependent and independent variables.
 For instance, the prediction analysis technique can be used in sales to predict future profit, if we consider sales an independent variable and profit a dependent variable.
 Then, based on the historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.
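Several DBMSs (for example PostgreSQL and Oracle) ship the standard linear-regression aggregates used below, so a simple fitted line can be computed directly in SQL; the monthly_figures table is a hypothetical name.

-- Hypothetical table: monthly_figures(month, sales, profit).
-- Fit profit = slope * sales + intercept over the historical data.
SELECT REGR_SLOPE(profit, sales)     AS slope,
       REGR_INTERCEPT(profit, sales) AS intercept
FROM   monthly_figures;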
5. Sequential Patterns
 Sequential pattern analysis is a data mining technique that seeks to discover or identify similar patterns, regular events, or trends in transaction data over a business period.
 In sales, with historical transaction data, businesses can identify sets of items that customers buy together at different times in a year.
 Businesses can then use this information to recommend these items to customers with better deals, based on their purchasing frequency in the past.
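A minimal sketch of one simple sequential pattern, which product a customer tends to buy right after another, using a standard SQL window function; the purchases table and its columns are hypothetical.

-- Hypothetical table: purchases(customer_id, purchase_date, product_id).
-- Pair each purchase with the customer's previous purchase, then
-- count the most common "bought X, then bought Y" sequences.
WITH ordered AS (
  SELECT customer_id,
         product_id,
         LAG(product_id) OVER (PARTITION BY customer_id
                               ORDER BY purchase_date) AS prev_product
  FROM   purchases
)
SELECT prev_product, product_id, COUNT(*) AS occurrences
FROM   ordered
WHERE  prev_product IS NOT NULL
GROUP  BY prev_product, product_id
ORDER  BY occurrences DESC;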
6. Decision trees
 A decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand.
 In the decision tree technique, the root of the decision tree is a simple question or condition that has multiple answers.
 Each answer then leads to a further set of questions or conditions that help us narrow down the data, so that we can make the final decision based on it.
Example: a decision tree for deciding whether to play tennis
 Starting at the root node, if the outlook is overcast then we should definitely play tennis. If it is rainy, we should only play tennis if the wind is weak. And if it is sunny, then we should play tennis if the humidity is normal. (A sketch of this tree as SQL follows below.)
 We often combine two or more of those data mining techniques together to form
an appropriate process that meets the business needs.
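To show how such a tree reads as executable logic, here is a minimal sketch encoding the play-tennis tree above as a SQL CASE expression; the weather table and its columns are hypothetical.

-- Hypothetical table: weather(obs_date, outlook, wind, humidity).
-- The play-tennis decision tree encoded as nested conditions:
-- the root tests outlook; branches test wind (rain) or humidity (sunny).
SELECT obs_date,
       CASE
         WHEN outlook = 'Overcast'                      THEN 'Play'
         WHEN outlook = 'Rain'  AND wind     = 'Weak'   THEN 'Play'
         WHEN outlook = 'Sunny' AND humidity = 'Normal' THEN 'Play'
         ELSE 'Do not play'
       END AS decision
FROM   weather;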
