02 Database Systems
School of Informatics
Department of Information Technology
2015 E.C. - National Exit Exam Module
Theme Name: Database System and Information
Management
Table of Contents
CHAPTER ONE .................................................................................................................................................................................... 3
1. Introduction .................................................................................................................................................................................... 3
1.1 Database System ....................................................................................................................................................................... 6
1.2 Data Handling approaches .......................................................................................................................................... 8
1.2.1 Manual Data Handling approach ............................................................................................................................ 8
1.2.2 Traditional File Based Data Handling approach ............................................................................................ 9
1.2.3 Database Data Handling approach ........................................................................................................................ 11
1.3 Roles in Database Design and Development ............................................................................................................ 15
1.3.1 Database Designer ......................................................................................................................................................... 15
1.3.2 Database Administrator .............................................................................................................................................. 16
1.3.3 Application Developers .............................................................................................................................................. 17
1.3.4 End-Users........................................................................................................................................................................... 18
1.4 The ANSI/SPARC and Database Architecture ....................................................................................................... 19
1.5 Types of Database Systems .............................................................................................................................................. 24
1.5.1 Client-Server Database System ............................................................................................................................. 24
1.5.2 Parallel Database System.......................................................................................................................................... 26
1.5.3 Distributed Database System .................................................................................................................................. 27
1.6 Database Management System (DBMS) ................................................................................................................... 29
1.6.1 Components and Interfaces of DBMS ............................................................................................................... 29
1.6.2 Functions of DBMS ........................................................................................................................................................... 31
1.7 Data models and conceptual models ............................................................................................................................ 32
1.7.1 Record-based Data Models...................................................................................................................................... 33
1.7.2 Database Languages .................................................................................................................................................... 37
CHAPTER TWO .................................................................................................................................................................................40
2. Relational Data Model .................................................................................................................................................................40
2.1 Introduction..............................................................................................................................................................40
2.2 Properties of Relational Databases .............................................................................................................................. 44
2.3 Building Blocks of the Relational Data Model ............................................................................................. 44
2.3.1 The ENTITIES ...............................................................................................................................45
2.3.2 The ATTRIBUTES......................................................................................................................45
2.3.3 The RELATIONSHIPS ............................................................................................................................. 47
2.3.4 Key constraints.............................................................................................................................................. 49
2.3.5 Integrity, Referential Integrity and Foreign Keys Constraints.............................................................. 50
CHAPTER THREE ............................................................................................................................................................................ 52
3. Conceptual Database Design- E-R Modeling ................................................................................................................. 52
3.1 Database Development Life Cycle ............................................................................................................................... 52
3.2 Basic concepts of E-R model ...........................................................................................................................................54
3.3 Developing an E-R Diagram ............................................................................................................................................ 55
3.4 Graphical Representations in Entity Relationship Diagram ............................................................................56
3.4.1 Conceptual ER diagram symbols .........................................................................................................................56
3.5 Problem with E-R models..................................................................................................................................................65
CHAPTER FOUR ............................................................................................................................................................................... 77
4. Logical Database Design............................................................................................................................................................ 77
4.1 Introduction ............................................................................................................................................................................... 77
4.2 Normalization...........................................................................................................................................................................85
4.3 Process of normalization (1NF, 2NF, 3NF) .............................................................................................................. 91
CHAPTER FIVE ................................................................................................................................................................................ 98
5. Physical Database Design ......................................................................................................................................................... 98
5.1 Conceptual, Logical, and Physical Data Models .................................................................................................. 98
5.2 Physical Database Design Process ............................................................................................................................... 99
5.2.1. Overview of the Physical Database Design Methodology .................................................................. 100
CHAPTER SIX....................................................................................................................................................................................102
6. Query Languages...........................................................................................................................................................................102
6.1. Relational Algebra...............................................................................................................................................................102
6.1.1. Unary Operations........................................................................................................................................................103
6.1.2. Set Operations ............................................................................................................................................................. 105
6.1.3. Aggregation and Grouping Operations...........................................................................................................110
6.2. Relational Calculus .............................................................................................................................................................. 112
6.2.1. Tuple Relational Calculus....................................................................................................................................... 113
6.3. Structured Query Languages (SQL) ........................................................................................................................... 115
6.3.1. Introduction to SQL ................................................................................................................................................... 115
6.3.2. Writing SQL Commands .........................................................................................................................................118
6.3.3. SQL Data Definition and Data Types...............................................................................................................119
6.3.4. Basic Queries in SQL .............................................................................................................................................. 124
CHAPTER SEVEN ...........................................................................................................................................................................155
7. Advanced Database Concepts ................................................................................................................................................155
7.1. Integrity and Security ........................................................................................................................................................155
7.1.1. Levels of Security Measures ................................................................................................................................157
7.1.2. Countermeasures: Computer Based Controls ............................................................................................. 158
7.2. Distributed Database Systems ...................................................................................................................................... 162
7.3 Data Warehousing and Data Mining.......................................................................................................................... 164
7.3.1. Data Warehousing..................................................................................................................................................... 164
7.3.2. Data Mining.................................................................................................................................................................. 166
CHAPTER ONE
1. Introduction
A database is a shared collection of logically related data and its description, designed to meet the
information needs of an organization. We now examine this definition so that you can understand
the concept fully. The database is a single, possibly large repository of data that can be used
simultaneously by many departments and users. Instead of disconnected files with redundant data,
all data items are integrated with a minimum amount of duplication. The database is no longer
owned by one department but is a shared corporate resource. The database holds not only the
organization's operational data, but also a description of this data. For this reason, a database is
also defined as a self-describing collection of integrated records. The description of the data is
known as the system catalog (or data dictionary or metadata, the "data about data"). It is the
self-describing nature of a database that provides program-data independence.
The approach taken with database systems, where the definition of data is separated from the
application programs, is similar to the approach taken in modern software development,
where an internal definition of an object and a separate external definition are provided. The
users of an object see only the external definition and are unaware of how the object is
defined and how it functions. One advantage of this approach, known as data abstraction, is
that we can change the internal definition of an object without affecting the users of the
object, provided that the external definition remains the same. In the same way, the database
approach separates the structure of the data from the application programs and stores it in the
database. If new data structures are added or existing structures are modified, then the
application programs are unaffected, provided that they do not directly depend upon what has
been modified. For example, if we add a new field to a record or create a new file, existing
applications are unaffected. However, if we remove a field from a file that an application
program uses, then that application program is affected by this change and must be modified
accordingly.
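To make this point concrete, here is a minimal SQL sketch; the Staff table and its column names are invented for the example:

    -- Adding a new column: existing applications that do not reference
    -- the new column continue to work unchanged.
    ALTER TABLE Staff ADD COLUMN email VARCHAR(100);

    -- Dropping a column that an application still uses would break that
    -- application, which must then be modified accordingly.
    -- ALTER TABLE Staff DROP COLUMN salary;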
Another expression in the definition of a database that we should explain is "logically
related". When we analyze the information needs of an organization, we attempt to identify
entities, attributes, and relationships. An entity is a distinct object (a person, place, thing,
concept, or event) in the organization that is to be represented in the database. An attribute is
a property that describes some aspect of the object that we wish to record, and a relationship
is an association between entities.
Databases and database systems are an essential component of life in modern society: most of
us encounter several activities every day that involve some interaction with a database. For
example, if we go to the bank to deposit or withdraw funds, if we make a hotel or airline
reservation, if we access a computerized library catalog to search for a bibliographic item, or
if we purchase something online—such as a book, toy, or computer—chances are that our
activities will involve someone or some computer program accessing a database. Even
purchasing items at a supermarket often automatically updates the database that holds the
inventory of grocery items.
These interactions are examples of what we may call traditional database applications, in
which most of the information that is stored and accessed is either textual or numeric. In the
past few years, advances in technology have led to exciting new applications of database
systems. The proliferation of social media Web sites, such as Facebook, Twitter,
and Flickr, among many others, has required the creation of huge databases that store
nontraditional data, such as posts, tweets, images, and video clips. New types of database
systems, often referred to as big data storage systems, or NOSQL systems, have been created
to manage data for social media applications. These types of systems are also used by
companies such as Google, Amazon, and Yahoo, to manage the data required in their Web
search engines, as well as to provide cloud storage, whereby users are provided with storage
capabilities on the Web for managing all types of data including documents, programs,
images, videos, and emails.
A database is a collection of related data. By data, we mean known facts that can be recorded
and that have implicit meaning. For example, consider the names, telephone numbers, and
addresses of the people you know. Nowadays, this data is typically stored in mobile phones,
which have their own simple database software. This data can also be recorded in an indexed
address book or stored on a hard drive, using a personal computer and software such as
Microsoft Access or Excel. This collection of related data with an implicit meaning is a
database.
The preceding definition of database is quite general; for example, we may consider the
collection of words that make up this page of text to be related data and hence to constitute a
database. However, the common use of the term database is usually more restricted. A
database has the following implicit properties:
A database represents some aspect of the real world, sometimes called the mini world
or the universe of discourse (UoD).
Changes to the mini world are reflected in the database.
A database is a logically coherent collection of data with some inherent meaning. A
random assortment of data cannot correctly be referred to as a database.
The database is now such an integral part of our day-to-day life that often we are not aware
that we are using one. To start our discussion of databases, in this section we examine some
applications of database systems. For the purposes of this discussion, we consider a database
to be a collection of related data and a database management system (DBMS) to be the
software that manages and controls access to the database. A database application is simply a
program that interacts with the database at some point in its execution. We also use the more
inclusive term database system as a collection of application programs that interact with the
database along with the DBMS and the database itself.
The DBMS interfaces with the application programs, so that the data contained in the database can
be used by multiple applications and users. In addition, the DBMS exerts centralized control of
the database, prevents fraudulent or unauthorized users from accessing the data, and ensures the
privacy of the data. Generally a database is an organized collection of related information. The
organized information or database serves as a base from which desired information can be retrieved.
The term ‘DATA’ can be defined as the value of an attribute of an entity. Any collection of related
data items of entities having the same attributes may be referred to as a ‘DATABASE’. Mere
collection of data does not make it a database; the way it is organized for effective and efficient
use makes it a database. Database technology has been described as “one of the most rapidly
growing areas of computer and information science". It emerged in the late 1960s as a result of a
combination of various circumstances. There was a growing demand among users for more
information to be provided by the computer relating to the day-to-day running of the organization
as well as information for planning and control purposes.
The technology that emerged to process data of various kinds is collectively termed 'DATABASE
MANAGEMENT TECHNOLOGY', and the resulting software is known as a 'DATABASE
MANAGEMENT SYSTEM' (DBMS), which manages a computer-stored database or
collection of data.
An entity may be concrete, such as a person or a book, or it may be abstract, such as a loan, a
holiday, or a concept. Entities are the basic units of objects which can have concrete existence or
constitute ideas or concepts. An entity set is a set of entities of the same type that share the same
properties or attributes. An entity is represented by a set of attributes. An attribute is also referred
to as a data item, data element, data field, etc. Attributes are descriptive properties possessed by
each member of an entity set. A grouping of related entities becomes an entity set.
The word 'DATA' means a fact, or more specifically a value of an attribute of an entity. An entity,
in general, may be an object, idea, event, condition, or situation. A set of attributes describes an
entity. Raw information in a form which can be processed by a computer is called data. Data are
the raw material of information. The term 'BASE' means the support, foundation, or key ingredient
of anything; therefore, the base supports data. A 'DATABASE' can be conceived as a system whose
base, whose key concept, is simply a particular way of handling data. In other words, a database
is nothing more than a computer-based record-keeping system.
The objective of a database is to record and maintain information. The primary function of the
database is the service and support of information systems in a cost-effective manner. In short, "a
database is an organized collection of related information stored with minimum redundancy, in a
manner that makes it accessible for multiple applications".
1.2 Data Handling approaches
1.2.1 Manual Data Handling approach
In the manual approach:
Files are used to store information for as many events and objects as the organization has.
Each of the files containing various kinds of information is labelled and stored in one or more
cabinets.
The cabinets could be kept in safe places for security purposes, based on the sensitivity of the
information contained in them.
Insertion and retrieval are done by searching first for the right cabinet, then for the right file,
then for the information.
One could have an indexing system to facilitate access to the data.
An alternative approach of data handling is a computerized way of dealing with the information.
The computerized approach could be either decentralized or centralized, based on where the
data resides in the system.
1.2.2 Traditional File Based Data Handling approach
File-based systems were an early attempt to computerize the manual filing system. This
approach is the decentralized computerized data handling method: a collection of application
programs perform services for the end-users, and every application program that provides a
service to end users defines and manages its own data. Such systems have a number of programs
for each of the different applications in the organization. Since every application defines and
manages its own data, the system is subject to a serious data duplication problem.
A file, in the traditional file-based approach, is a collection of records which contain logically
related data.
As business applications became more complex, demanding more flexible and reliable data
handling methods, the shortcomings of the file-based system became evident. These
shortcomings include, but are not limited to:
Separation or isolation of data: information available in one application may not be known
to the others; data synchronisation is done manually.
Limited data sharing: every application maintains its own data.
Lengthy development and maintenance time.
Duplication or redundancy of data (money and time costs and loss of data integrity).
Data dependency on the application: the data structure is embedded in the application; hence, a
change in the data structure requires changing the application as well.
Incompatible file formats or data structures (e.g. C and COBOL) between different
applications and programs, creating inconsistency and difficulty in processing them jointly.
Fixed query processing, which is defined during application development.
1. Definition of the data is embedded in the application program which makes it difficult to modify
the database definition easily.
2. No control over the access and manipulation of the data beyond that imposed by the application
programs.
The most significant problems experienced by the traditional file-based approach of data handling
can be formalized by what are called "update anomalies". There are three types of update
anomalies:
1. Modification Anomalies: a problem experienced when one or more data values are modified in
one application program but not in others containing the same data set (for example, a customer's
address updated in the sales application but not in the billing application).
2. Deletion Anomalies: a problem encountered when one record set is deleted from one
application but remains untouched in other application programs.
3. Insertion Anomalies: a problem experienced whenever there is a new data item to be recorded and
the recording is not made in all the applications; and when the same data item is inserted in different
applications, there could be errors in encoding which cause the new data item to be considered
a totally different object.
1.2.3 Database Data Handling approach
The database approach offers, among others, the following benefits:
Data can be shared: two or more users can access and use the same data instead of storing data in
a redundant manner for each user.
Improved accessibility of data: by using structured query languages, users can easily access
data without programming experience.
Redundancy can be reduced: isolated data is integrated in the database to decrease the redundant
data stored at different applications.
1.3 Roles in Database Design and Development
1.3.1 Database Designer
Database designers typically interact with each potential group of users and develop views of the
database that meet the data and processing requirements of these groups. Each view is then
analyzed and integrated with the views of other user groups. The final database design must be
capable of supporting the requirements of all user groups.
The logical database designer must have a thorough and complete understanding of the
organization's data and any constraints on this data (the constraints are sometimes called business
rules). These constraints describe the main characteristics of the data as viewed by the
organization. Examples of constraints for DreamHome are:
a member of staff cannot manage more than 100 properties for rent or sale at the same time;
a member of staff cannot handle the sale or rent of his or her own property;
a solicitor cannot act for both the buyer and seller of a property.
The work of the logical database designer can be split into two stages:
Conceptual database design, which is independent of implementation details, such as the target
DBMS, application programs, programming languages, or any other physical considerations;
Logical database design, which targets a specific data model, such as relational, network,
hierarchical, or object-oriented.
The physical database designer decides how the logical database design is to be physically
realized. This involves:
Mapping the logical database design into a set of tables and integrity constraints;
Selecting specific storage structures and access methods for the data to achieve good
performance;
Designing any security measures required on the data.
Many parts of physical database design are highly dependent on the target DBMS, and there may
be more than one way of implementing a mechanism. Consequently, the physical database
designer must be fully aware of the functionality of the target DBMS and must understand the
advantages and disadvantages of each alternative implementation. The physical database
designer must be capable of selecting a suitable storage strategy that takes account of usage.
Whereas conceptual and logical database designs are concerned with the what, physical database
design is concerned with the how. It requires different skills, which are often found in different
people.
1.3.2 Database Administrator
The database administrator is:
Responsible for overseeing, controlling, and managing the database resources (the database itself, the
DBMS and other related software);
Authorizing access to the database;
Coordinating and monitoring the use of the database.
In any organization where many people use the same resources, there is a need for a chief
administrator to oversee and manage these resources. In a database environment, the primary
resource is the database itself, and the secondary resource is the DBMS and related software.
Administering these resources is the responsibility of the database administrator (DBA). The
DBA is responsible for authorizing access to the database, coordinating and monitoring its use,
and acquiring software and hardware resources as needed. The DBA is accountable for problems
such as security breaches and poor system response time. In large organizations, the DBA is
assisted by a staff that carries out these functions.
1.3.4 End-Users
Casual end users occasionally access the database, but they may need different information
each time. They use a sophisticated database query interface to specify their requests and are
typically middle- or high-level managers or other occasional browsers.
Naive or parametric end users make up a sizable portion of database end users. Their main job
function revolves around constantly querying and updating the database, using standard types of
queries and updates— called canned transactions—that have been carefully programmed and
tested. Many of these tasks are now available as mobile apps for use with mobile devices. The
tasks that such users perform are varied. A few examples are:
o Bank customers and tellers check account balances and post withdrawals and deposits.
o Reservation agents or customers for airlines, hotels, and car rental companies check availability
for a given request and make reservations.
o Employees at receiving stations for shipping companies enter package identifications via bar
codes and descriptive information through buttons to update a central database of received and
in-transit packages.
o Social media users post and read items on social media Web sites.
Sophisticated end users include engineers, scientists, business analysts, and others who
thoroughly familiarize themselves with the facilities of the DBMS in order to implement their
own applications to meet their complex requirements.
Standalone users maintain personal databases by using ready-made program packages that
provide easy-to-use menu-based or graphics-based interfaces. An example is the user of a
financial software package that stores a variety of personal financial data.
A typical DBMS provides multiple facilities to access a database. Naive end users need to learn
very little about the facilities provided by the DBMS; they simply have to understand the user
interfaces of the mobile apps or standard transactions designed and implemented for their use.
Standalone users typically become very proficient in using a specific software package.
All users should be able to access the same data. This is important since the database has a shared-
data feature where all the data is stored in one location and all users have their own customized
way of interacting with it.
A user's view is unaffected by, or immune to, changes made in other views. Since the requirements of
one user are independent of the others, a change made in one user's view should not affect other
users.
Users should not need to know physical database storage details. As there are naïve users of the
system, hardware-level or physical details should be a black box for such users.
1.4 The ANSI/SPARC and Database Architecture
The American National Standards Institute (ANSI) Standards Planning and Requirements
Committee (SPARC) recognized the need for a three-level approach with a system catalog. The
levels form a three-level architecture comprising an external, a conceptual, and an internal level.
The way users perceive the data is called the external level. The way the DBMS and the operating
system perceive the data is the internal level, where the data is actually stored using data
structures and files.
A. External level: The users' view of the database. This level describes the part of the database that
is relevant to each user. The external level consists of a number of different external views of the
database. Each user has a view of the 'real world' represented in a form that is familiar to that
user. The external view includes only those entities, attributes, and relationships in the 'real world'
that the user is interested in. Other entities, attributes, or relationships that are not of interest may
be represented in the database, but the user will be unaware of them.
In addition, different views may have different representations of the same data. For example, one
user may view dates in the form (day, month, year), while another may view dates as (year, month,
day). Some views might include derived or calculated data: data not actually stored in the database
as such, but created when needed. For example, in the DreamHome case study, we may wish to
view the age of a member of staff. However, it is unlikely that ages would be stored, as this data
would have to be updated daily. Instead, the member of staff's date of birth would be stored, and
the age would be calculated by the DBMS when it is referenced.
B. Conceptual level: The community view of the database. This level describes what data is stored
in the database and the relationships among the data. The middle level in the three-level architecture
is the conceptual level. This level contains the logical structure of the entire database as seen by
the DBA. It is a complete view of the data requirements of the organization that is independent of
any storage considerations. The conceptual level represents:
o all entities, their attributes, and their relationships;
o the constraints on the data; semantic information about the data; Security and integrity
information.
The conceptual level supports each external view, in that any data available to a user must be
contained in, or derivable from, the conceptual level. However, this level must not contain any
storage-dependent details. For instance, the description of an entity should contain only data types
of attributes (for example, integer, real, character) and their length (such as the maximum number
of digits or characters), but not any storage considerations, such as the number of bytes occupied.
C. Internal level: The physical representation of the database on the computer. This level describes
how the data is stored in the database. The internal level covers the physical implementation of the
database to achieve optimal runtime performance and storage space utilization. It covers the data
structures and file organizations used to store data on storage devices. It interfaces with the
operating system access methods (file management techniques for storing and retrieving data
records) to place the data on the storage devices, build the indexes, retrieve the data, and so on.
The internal level is concerned with such things as:
o storage space allocation for data and indexes;
o record descriptions for storage (with stored sizes for data items);
o record placement;
o data compression and data encryption techniques
1.5 Types of Database Systems
1.5.1 Client-Server Database System
When a client requires additional functionality, such as database access, it can connect to a server
that is capable of providing the functionality needed by the client. Basically, a server is a machine
that provides services to the client, i.e., the user machine.
Client/server systems run on less expensive platforms to support applications that had previously
been running only on large and expensive mini or mainframe computers.
Clients offer icon-based, menu-driven interfaces, which are superior to the traditional command-line,
dumb-terminal interfaces typical of mini and mainframe computer systems.
The client/server environment facilitates more productive work by the users and better
use of existing data.
A client/server database system is more flexible compared to a centralized system.
Response time and throughput are high.
The server (database) machine can be custom-built (tailored) to the DBMS function and thus can
provide better DBMS performance.
The client (application database) might be a personal workstation, tailored to the needs of the
end users and thus able to provide better interfaces, high availability, faster responses, and overall
better ease of use.
1.5.2 Parallel Database System
Parallel database systems are very useful for applications that have to query extremely large
databases (of the order of terabytes, for example, 10^12 bytes) or that have to process an extremely
large number of transactions per second (of the order of thousands of transactions per second).
In a parallel database system, the throughput (that is, the number of tasks that can be completed in
a given time interval) and the response time (that is, the amount of time it takes to complete a
single task from the time it is submitted) are very high.
However, in a parallel database system there is a startup cost associated with initiating a single
process, and the startup time may overshadow the actual processing time, affecting speedup adversely.
1.5.3 Distributed Database System
Recovery from failure is more complex in distributed database systems than in centralized
systems.
1.6 Database Management System (DBMS)
1.6.1 Components and Interfaces of DBMS
Hardware: The DBMS and the applications require hardware to run. The hardware can range
from a single personal computer to a single mainframe or a network of computers.
The particular hardware depends on the organization's requirements and the DBMS used. Some
DBMSs run only on particular hardware or operating systems, while others run on a wide variety
of hardware and operating systems. A DBMS requires a minimum amount of main memory and
disk space to run, but this minimum configuration may not necessarily give acceptable
performance. A simplified hardware configuration for DreamHome consists of a network of small
servers, with a central server located in London running the backend of the DBMS, that is, the
part of the DBMS that manages and controls access to the database.
Software: The software component comprises the DBMS software itself and the application
programs, together with the operating system, including network software if the DBMS is being
used over a network. Typically, application programs are written in a third-generation
programming language (3GL), such as C, C++, C#, Java, Visual Basic, COBOL, Fortran, Ada, or
Pascal, or a fourth-generation language (4GL), such as SQL, embedded in a third-generation
language. The target DBMS may have its own fourth-generation tools that allow rapid development
of applications through the provision of nonprocedural query languages, report generators, forms
generators, graphics generators, and application generators. The use of fourth-generation tools can
improve productivity significantly and produce programs that are easier to maintain.
Data: Perhaps the most important component of the DBMS environment, certainly from the end-
users' point of view, is the data. Observe that the data acts as a bridge between the machine
components and the human components. The database contains both the operational data and the
metadata, the "data about data". The structure of the database is called the schema.
Procedures refer to the instructions and rules that govern the design and use of the database. The
users of the system and the staff who manage the database require documented procedures on how
to use or run the system.
1.6.2 Functions of DBMS
a. Data storage, retrieval, and update: A DBMS must furnish users with the ability to store,
retrieve, and update data in the database.
b. A user-accessible catalog: A DBMS must furnish a catalog in which descriptions of data items
are stored and which is accessible to users. A key feature of the ANSI-SPARC architecture is the
recognition of an integrated system catalog to hold data about the schemas, users, applications,
and so on. The catalog is expected to be accessible to users as well as to the DBMS. A system
catalog, or data dictionary, is a repository of information describing the data in the database: it is
the 'data about the data', or metadata. The amount of information and the way the information is
used vary with the DBMS. Typically, the system catalog stores:
1. names, types, and sizes of data items;
2. names of relationships;
3. integrity constraints on the data;
4. names of authorized users who have access to the data;
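As a hedged illustration, many DBMSs expose the catalog through the SQL-standard INFORMATION_SCHEMA views (the exact views available and their coverage vary by product):

    -- Read metadata about a table's columns from the system catalog
    -- rather than from the data itself.
    SELECT column_name, data_type, character_maximum_length
    FROM information_schema.columns
    WHERE table_name = 'EMPLOYEE';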
c. Transaction support: A DBMS must furnish a mechanism which will ensure either that all the
updates corresponding to a given transaction are made or that none of them is made. A
transaction is a series of actions, carried out by a single user or application program, which
accesses or changes the contents of the database.
d. Concurrency control services: A DBMS must furnish a mechanism to ensure that the database
is updated correctly when multiple users are updating the database concurrently.
e. Recovery services: A DBMS must furnish a mechanism for recovering the database in the event
that the database is damaged in any way.
f. Authorization services: A DBMS must furnish a mechanism to ensure that only authorized users
can access the database.
1.7 Data models and conceptual models
Data Model: a set of concepts to describe the structure of a database, and certain constraints that
the database should obey.
A data model is a description of the way that data is stored in a database. A data model helps us
understand the relationships between entities and create the most effective structure to hold
data. A data model describes:
Data
Data relationships
Data semantics
Data constraints
Data models can be grouped into three categories:
Object-based
Record-based
Physical
1.7.1 Record-based Data Models
Record-based data models consist of a number of fixed-format records. Each record type defines
a fixed number of fields; each field is typically of a fixed length.
The network model is able to model complex relationships and represents the semantics of
add/delete operations on the relationships:
o It can handle most situations for modeling using record types and relationship types.
o Its language is navigational, using constructs like FIND, FIND member, FIND owner, FIND NEXT
within set, GET, etc. Programmers can perform optimal navigation through the database.
Many data models have been proposed, which we can categorize according to the types of concepts
they use to describe the database structure. High-level or conceptual data models provide
concepts that are close to the way many users perceive data, whereas low-level or physical data
models provide concepts that describe the details of how data is stored on the computer storage
media, typically magnetic disks. Concepts provided by low-level data models are generally meant
for computer specialists, not for end users. Between these two extremes is a class of
representational (or implementation) data models, which provide concepts that may be easily
understood by end users but that are not too far removed from the way data is organized in
computer storage.
Conceptual data models use concepts such as entities, attributes, and relationships. An
entity represents a real-world object or concept, such as an employee or a project from the mini-
world that is described in the database. An attribute represents some property of interest that
further describes an entity, such as the employee‘s name or salary. A relationship among two or
more entities represents an association among the entities, for example, a works-on relationship
between an employee and a project.
1.7.2 Database Languages
The database schema is specified by a set of definitions expressed by means of a special language
called a Data Definition Language (DDL). The DDL is used to define a schema or to modify an existing
one. It cannot be used to manipulate data.
The result of the compilation of the DDL statements is a set of tables stored in special files
collectively called the system catalog. The system catalog integrates the metadata that is data that
describes objects in the database and makes it easier for those objects to be accessed or
manipulated. The metadata contains definitions of records, data items, and other objects that are
of interest to users or are required by the DBMS. The DBMS normally consults the system catalog
before the actual data is accessed in the database. The terms data dictionary and data directory are
also used to describe the system catalog, although the term 'data dictionary' usually refers to a
more general software system than a catalog for a DBMS. The DDL thus defines the database
structure or schema and specifies additional properties or constraints on the data, and the database
system checks these constraints every time the database is updated.
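As a sketch, the following standard-SQL DDL statements define and then modify a schema; the Branch table is borrowed from the DreamHome examples used earlier, with assumed column types:

    -- DDL describes structure and constraints; it does not manipulate rows.
    CREATE TABLE Branch (
        branchNo CHAR(4)     NOT NULL,
        street   VARCHAR(50),
        city     VARCHAR(30),
        postcode VARCHAR(10),
        PRIMARY KEY (branchNo)
    );

    -- Modifying an existing schema is also a DDL operation.
    ALTER TABLE Branch ADD COLUMN phone VARCHAR(15);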
A Data Manipulation Language (DML) provides a set of operations to support the basic data
manipulation operations on the data held in the database. Data manipulation operations usually
include the following: insertion of new data, modification of existing data, retrieval of data, and
deletion of data.
Therefore, one of the main functions of the DBMS is to support a data manipulation language in
which the user can construct statements that will cause such data manipulation to occur. Data
manipulation applies to the external, conceptual, and internal levels. However, at the internal
level we must define rather complex low-level procedures that allow efficient data access. In
contrast, at higher levels, emphasis is placed on ease of use and effort is directed at providing
efficient user interaction with the system.
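A brief sketch of the four basic DML operations, reusing the hypothetical Branch table from the DDL example above:

    -- Insertion of new data
    INSERT INTO Branch (branchNo, street, city, postcode)
    VALUES ('B005', '22 Deer Rd', 'London', 'SW1 4EH');

    -- Modification of existing data
    UPDATE Branch SET street = '23 Deer Rd' WHERE branchNo = 'B005';

    -- Retrieval of data
    SELECT branchNo, street FROM Branch WHERE city = 'London';

    -- Deletion of data
    DELETE FROM Branch WHERE branchNo = 'B005';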
A Transaction Control Language (TCL) is used to manage transactions in the database and the
changes made by data manipulation language statements.
Transaction: the logical unit of work, which consists of some operations to control some tasks.
Example: COMMIT is used to permanently save any transaction into the database.
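A minimal sketch of transaction control, assuming a hypothetical Account table; the statement that begins a transaction varies between DBMSs (START TRANSACTION and BEGIN are common forms):

    START TRANSACTION;
    -- Both updates form one logical unit of work (a funds transfer):
    UPDATE Account SET balance = balance - 100 WHERE accountNo = 'A1';
    UPDATE Account SET balance = balance + 100 WHERE accountNo = 'A2';
    COMMIT;    -- permanently save the changes

    -- ROLLBACK would instead undo everything done since the transaction began.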
The part of a DML that involves data retrieval is called a query language. A query language can
be defined as a high-level, special-purpose language used to satisfy diverse requests for the
retrieval of data held in the database. The term 'query' is therefore reserved to denote a retrieval
statement expressed in a query language; it specifies what data to retrieve rather than how to retrieve it.
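For example, the following query states only what to retrieve, and the DBMS decides how to locate the rows; a Staff table with staffNo and salary columns is assumed for the illustration:

    SELECT staffNo, salary
    FROM Staff
    WHERE salary > 20000
    ORDER BY salary DESC;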
CHAPTER TWO
2. Relational Data Model
2.1 Introduction
In the relational model, relations are used to hold information about the objects to be represented
in the database. A relation is represented as a two-dimensional table in which the rows of the
table correspond to individual records and the columns correspond to attributes.
Domains are an extremely powerful feature of the relational model. Every attribute in a relation
is defined on a domain. Domains may be distinct for each attribute, or two or more attributes
may be defined on the same domain. The domain concept is important because it allows the user
to define in a central place the meaning and source of values that attributes can hold. As a result,
more information is available to the system when it undertakes the execution of a relational
operation, and operations that are semantically incorrect can be avoided. For example, it is not
sensible to compare a street name with a telephone number, even though the domain definitions
for both these attributes are character strings. On the other hand, the monthly rental on a property
and the number of months a property has been leased have different domains (the first a
monetary value, the second an integer value), but it is still a legal operation to multiply two
values from these domains.
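The SQL standard offers CREATE DOMAIN for defining such domains in one central place, although not every DBMS implements it; a sketch, with the table and column names assumed from the rental example above:

    -- A domain defined once and reused wherever the attribute appears.
    CREATE DOMAIN MonthlyRental AS DECIMAL(7,2)
        CHECK (VALUE >= 0);

    CREATE TABLE PropertyForRent (
        propertyNo CHAR(5)       NOT NULL,
        rent       MonthlyRental,
        PRIMARY KEY (propertyNo)
    );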
The elements of a relation are the rows or tuples in the table. In the Branch relation, each row
contains four values, one for each attribute. Tuples can appear in any order and the relation will
still be the same relation, and therefore convey the same meaning. The structure of a relation,
together with a specification of the domains and any other restrictions on possible values, is
sometimes called its intension, which is usually fixed unless the meaning of a relation is changed
to include additional attributes. The tuples are called the extension (or state) of a relation, which
changes over time.
The Branch relation in the above Figure has four attributes or degree four. This means that each
row of the table is a four-tuple, containing four values. A relation with only one attribute would
have degree one and be called a unary relation or one-tuple. A relation with two attributes is
called binary, one with three attributes is called ternary, and after that the term n-ary is usually
used. The degree of a relation is a property of the intension of the relation.
Cardinality: the number of tuples a relation has. By contrast with the degree, the cardinality
changes as tuples are added or deleted. The cardinality is a property of the extension of the
relation and is determined from the particular instance of the relation at any given moment.
Finally, we have the definition of a relational database.
Relation Schema: a named relation defined by a set of attribute and domain name pairs. Let A1,
A2, ..., An be attributes with domains D1, D2, ..., Dn. Then the set {A1:D1, A2:D2, ..., An:Dn} is a
Relation Schema. A relation R, defined by a relation schema S, is a set of mappings from attribute
names to their corresponding domains. Thus a relation is a set of n-tuples of the form
(A1:d1, A2:d2, ..., An:dn) where d1 ∈ D1, d2 ∈ D2, ..., dn ∈ Dn.
Example: Student (studentId char(10), studentName char(50), DOB date) is a relation schema for
the student entity in SQL.
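In SQL DDL this relation schema might be declared as follows; the attribute types are taken directly from the schema, while the choice of primary key is an assumption for the sketch:

    CREATE TABLE Student (
        studentId   CHAR(10) NOT NULL,
        studentName CHAR(50),
        DOB         DATE,
        PRIMARY KEY (studentId)   -- assuming studentId identifies each tuple
    );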
Relational Database Schema: a set of relation schemas, each with a distinct name. If R1, R2, ..., Rn
are a set of relation schemas, then the relational database schema, or simply relational schema, R,
can be written as R = {R1, R2, ..., Rn}.
2.2 Properties of Relational Databases
A relation has a name that is distinct from all other relation names in the relational schema.
Each tuple in a relation must be unique
All tables are LOGICAL ENTITIES
Each cell of a relation contains exactly one atomic (single) value.
Each column (field or attribute) has a distinct name.
The values of an attribute are all from the same domain.
A table is either a BASE TABLE (named relation) or a VIEW (unnamed relation). Only
base tables are physically stored.
VIEWS are derived from BASE TABLES with SQL statements like: [SELECT .. FROM ..
WHERE .. ORDER BY]
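A small sketch of this distinction, reusing the hypothetical Branch base table from earlier; the view name is invented:

    -- Only the base table is physically stored; the view holds no rows of
    -- its own and is derived from Branch on demand.
    CREATE VIEW LondonBranch AS
        SELECT branchNo, street, postcode
        FROM Branch
        WHERE city = 'London';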
A relational database is a collection of tables:
o Each entity is stored in one table.
o Attributes are fields (columns) in the table.
The order of rows (theoretically, although in practice it has an impact on performance) and of
columns is immaterial.
Entries with repeating groups are said to be un-normalized. All values in a column represent the
same attribute and have the same data format.
Every relation has a schema, which describes the columns or fields; the relation itself
corresponds to our familiar notion of a table: a relation is a collection of tuples, each of which
contains values for a fixed number of attributes.
Existence Dependency: the dependence of an entity on the existence of one or more other entities.
Weak entity: an entity that cannot exist without the entity with which it has a relationship; it is indicated
by a double rectangle.
Composite: divided into subparts (composed of other attributes). Example: Name, Address.
Single-valued vs. multi-valued attributes
o Single-valued: have only a single value (the value may change, but there is only one value at any one time).
o Multi-valued: may have more than one value at a time.
Derived: the value may be derived (computed) from the values of other attributes.
Null Values
o NULL applies to attributes which are not applicable or which do not have values.
o You may enter the value NA (meaning not applicable); the value of a key attribute cannot be null.
Default value: the value assumed if no explicit value is entered.
When designing the conceptual specification of the database, one should pay attention to the
distinction between an Entity and an Attribute.
If the structure (city, Woreda, Kebele, etc.) is important, e.g. if we want to retrieve employees in a given
city, then address must be modeled as an entity (since attribute values are atomic).
An important point about a relationship is how many entities participate in it. The number of entities
participating in a relationship is called the DEGREE of the relationship. Among the Degrees of
relationship, the following are the basic:
UNARY/RECURSIVE RELATIONSHIP: Tuples/records of a single entity are related with each
other.
BINARY RELATIONSHIPS: Tuples/records of two entities are associated in a relationship
TERNARY RELATIONSHIP: Tuples/records of three different entities are associated
And a generalized one:
o N-ARY RELATIONSHIP: Tuples from an arbitrary number of entity sets participate in a
relationship.
However, the degree and cardinality of a relation are different from the degree and cardinality of
a relationship.
1. Super Key: an attribute or set of attributes that uniquely identifies a tuple within a relation.
2. Candidate Key: a super key such that no proper subset of it is itself a super key within the
relation. A candidate key has two properties:
1. Uniqueness
2. Irreducibility
3. Primary Key: the candidate key that is selected to identify tuples uniquely within the relation.
In the worst case, the entire set of attributes in a relation can serve as the primary key.
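A sketch of how these notions surface in SQL, assuming a relation in which both studentId and an email attribute are unique: both are candidate keys, and one is chosen as the primary key while the other remains an alternate key:

    CREATE TABLE StudentKeyDemo (
        studentId CHAR(10)    NOT NULL,
        email     VARCHAR(50) NOT NULL,
        PRIMARY KEY (studentId),  -- the candidate key selected as primary key
        UNIQUE (email)            -- the remaining (alternate) candidate key
    );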
Key constraints and entity integrity constraints are specified on individual relations.
The referential integrity constraint is specified between two relations and is used to maintain the
consistency among tuples in the two relations. Informally, the referential integrity constraint states
that a tuple in one relation that refers to another relation must refer to an existing tuple in that
relation. For example, the attribute Dept_Num of EMPLOYEE gives the department number for
which each employee works; hence, its value in every EMPLOYEE tuple must match the
Dept_Num value of some tuple in the DEPARTMENT relation.
To define referential integrity more formally, first we define the concept of a foreign key. The
conditions for a foreign key, given below, specify a referential integrity constraint between the two
relation schemas R1 and R2.
A set of attributes FK (foreign key) in relation schema R1 is a foreign key of R1 that references
relation R2 if it satisfies the following rules:
Rule 1: The attributes in FK have the same domain(s) as the primary key attributes PK of R2; the
attributes FK are said to reference or refer to the relation R2.
Rule 2: A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value
of PK for some tuple t2 in the current state r2(R2) or is NULL. In the former case, we have t1[FK]
= t2[PK], and we say that the tuple t1 references or refers to the tuple t2.
To specify these constraints, first we must have a clear understanding of the meaning or roles that
each attribute or set of attributes plays in the various relation schemas of the database.
Referential integrity constraints typically arise from the relationships among the entities
represented by the relation schemas.
For example, consider the EMPLOYEE relation: the attribute Dept_Num refers to
the department for which an employee works; hence, we designate Dept_Num to be a foreign key
of EMPLOYEE referencing the DEPARTMENT relation. This means that a value of Dept_Num in
any tuple t1 of the EMPLOYEE relation must match a Dept_Num value of some tuple in the
DEPARTMENT relation, or the value of Dept_Num can be NULL if the employee does not belong
to a department or will be assigned to a department later. For example, the tuple for employee
'John Smith' references the tuple for the 'Research' department, indicating that 'John Smith'
works for this department.
Notice that a foreign key can refer to its own relation. For example, the attribute Super_ssn in
EMPLOYEE refers to the supervisor of an employee; this is another employee, represented by a
tuple in the EMPLOYEE relation. Hence, Super_ssn is a foreign key that references the
EMPLOYEE relation itself. The tuple for employee 'John Smith' references the tuple for
employee 'Franklin Wong', indicating that 'Franklin Wong' is the supervisor of 'John Smith'.
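The EMPLOYEE and DEPARTMENT constraints just described could be declared as in the following sketch; the column types and the Name/Dname attributes are assumptions:

    CREATE TABLE DEPARTMENT (
        Dept_Num INT         NOT NULL,
        Dname    VARCHAR(30),
        PRIMARY KEY (Dept_Num)
    );

    CREATE TABLE EMPLOYEE (
        Ssn       CHAR(9)     NOT NULL,
        Name      VARCHAR(50),
        Dept_Num  INT,     -- NULL allowed: employee may not yet have a department
        Super_ssn CHAR(9), -- the supervisor, who is another employee
        PRIMARY KEY (Ssn),
        FOREIGN KEY (Dept_Num)  REFERENCES DEPARTMENT (Dept_Num),
        FOREIGN KEY (Super_ssn) REFERENCES EMPLOYEE (Ssn)  -- self-referencing
    );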
We can diagrammatically display referential integrity constraints by drawing a directed arc from
each foreign key to the relation it references. For clarity, the arrowhead may point to the primary
key of the referenced relation with the referential integrity constraints displayed in this manner.
All integrity constraints should be specified on the relational database schema (i.e., defined as part
of its definition) if we want to enforce these constraints on the database states. Hence, the DDL
includes provisions for specifying the various types of constraints so that the DBMS can enforce
them automatically.
The following constraints are specified as a part of data definition using DDL:
Domain Integrity: No value of the attribute should be beyond the allowable limits
Entity Integrity: In a base relation, no attribute of a Primary Key can assume a value of NULL
Referential Integrity: If a Foreign Key exists in a relation, either the Foreign Key value must
match a Candidate Key value in its home relation or the Foreign Key value must be NULL
Enterprise Integrity: Additional rules specified by the users or database administrators of a
database are incorporated
CHAPTER THREE
3. Conceptual Database Design- E-R Modeling
Conceptual design revolves around discovering and analyzing organizational and user data
requirements. The important activities are to identify:
Entities
Attributes
Relationships
Constraints
Before working on the conceptual design of the database, one has to know and answer the
following basic questions.
The basic E-R model is graphically depicted and presented for review. The process is repeated
until the end users and designers agree that the E-R diagram is a fair representation of the
organization's activities and functions.
Checking for redundant relationships in the ER diagram: relationships between entities indicate
access from one entity to another. It is therefore possible to access one entity occurrence from
another entity occurrence even if there are other entities and relationships that separate them;
this is often referred to as 'navigation' of the ER diagram. The last phase in ER modeling is
validating the ER model against the requirements of the user.
Entities are objects or concepts that represent important data. Entities are typically nouns such as
product, customer, location, or promotion. There are three types of entities commonly used in
entity relationship diagrams.
Types of Attributes
Multivalued attribute: Multivalued attributes are those that can take on more than one
value. Such attributes are depicted by a double ellipse. It can be represented by the following
symbol
1. Binary = degree 2
2. Ternary = degree 3
3. n-ary = degree n
Ordinality, on the other hand, is the minimum number of times an instance in one entity can be
associated with an instance in the related entity.
1. One-to-one: When only one instance of an entity is associated with the relationship, it is marked as ‘1:1’.
The following image reflects that only one instance of each entity should be associated with the
relationship. It depicts one-to-one relationship
Relationships are associations between the entities. Attributes are properties which describe the
entities.
While designing the ER model, one could face a problem in the design called a connection
trap. Connection traps are problems arising from misinterpreting certain relationships.
1. Fan trap:
Occurs where a model represents a relationship between entity types, but the pathway between
certain entity occurrences is ambiguous. May exist where two or more one-to-many (1:M)
relationships fan out from an entity. The problem could be avoided by restructuring the model so
that there would be no 1:M relationships fanning out from a single entity and all the semantics of
the relationship is preserved.
Problem: Which car (Car1, Car3, or Car5) is used by Employee 6 (Emp6) working in Branch 1
(Br1)? From this ER model one cannot tell which car is used by which member of staff, since a
branch can have more than one car and a branch is also populated by more than one employee.
Thus we need to restructure the model to avoid the connection trap.
To avoid the Fan Trap problem we can go for restructuring of the E-R Model. This will result in
the following E-R Model.
Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a supertype
of another. The level of nesting is limited only by the constraint of simplicity.
Specialization
o Example: Saving Accounts and Current Accounts are Specialized entities for the generalized
entity Accounts. Manager, Sales, Secretary: are specialized employees.
Subclass/Subtype
o An entity type whose tuples have attributes that distinguish its members from tuples of the
generalized or Superclass entities.
o When one generalized Superclass has various subgroups with distinguishing features and these
subgroups are represented by specialized form, the groups are called subclasses.
o Subclasses can be either mutually exclusive (disjoint) or overlapping (inclusive).
o A single subclass may inherit attributes from two distinct superclasses.
o A mutually exclusive category/subclass is when an entity instance can be in only one of the
subclasses.
Superclass /Supertype
o An entity type whose tuples share common attributes. Attributes that are shared by all entity
occurrences (including the identifier) are associated with the supertype. It is the generalized entity.
Relationship Between Superclass and Subclass
o The relationship between a superclass and any of its subclasses is called a superclass/subclass or
class/subclass relationship
o An instance cannot be a member of a subclass only; every instance of a subclass is also an
instance of the Superclass.
o A member of a subclass is represented as a distinct database object, a distinct record that is related
via the key attribute to its super-class entity.
o An entity cannot exist in the database merely by being a member of a subclass; it must also be a
member of the super-class.
o An entity occurrence of the superclass need not belong to any of the subclasses unless
there is full (total) participation in the specialization.
o The relationship between a subclass and a Superclass is an "IS A" or "IS PART OF" type.
The Supertype EMPLOYEE stores all properties that subclasses have in common.
And HOURLY employees have the unique attribute Wage (hourly wage rate),
while SALARIED employees have two unique attributes, StockOption and Salary.
The Completeness Constraint addresses the issue of whether or not an occurrence of a Superclass
must also have a corresponding Subclass occurrence.
o The completeness constraint determines whether every instance of the supertype must also be
represented in some subtype.
o The Total Specialization Rule specifies that an entity occurrence should at least be a member of
one of the subclasses. Total Participation of superclass instances on subclasses is diagrammed with
a double line from the Supertype to the circle as shown below.
The Partial Specialization Rule specifies that it is not necessary for all entity occurrences in the
superclass to be a member of one of the subclasses. Here we have an optional participation on the
specialization. Partial Participation of superclass instances on subclasses is diagrammed with
a single line from the Supertype to the circle.
From these two types of constraints (disjointness and completeness) we can have four possible
combinations: disjoint-total, disjoint-partial, overlapping-total, and overlapping-partial.
The whole purpose of database design is to create an accurate representation of the data, the
relationships between the data, and the business constraints pertinent to that organization.
Therefore, one can use one or more techniques to design a database. One such technique was
the E-R model. In this chapter we use another technique, known as normalization, which
approaches database design with a different emphasis and defines the structure of a database
with a specific data model.
Logical design is the process of constructing a model of the information used in an enterprise
based on a specific data model (e.g. relational, hierarchical or network or object), but
independent of a particular DBMS and other physical considerations.
Logical database design is the process of deciding how to arrange the attributes of the entities in
a given business environment into database structures, such as the tables of a relational database.
The goal of logical database design is to create well-structured tables that properly reflect the
company's business environment. The tables will be able to store data about the company's
entities in a non-redundant manner, and foreign keys will be placed in the tables so that all the
relationships among the entities will be maintained.
Normalization process
Collection of rules (tests) to be applied on relations to obtain the minimal, non-redundant set of
attributes.
Discover new entities in the process
Revise attributes based on the rules and the discovered entities
Works by examining the relationship between attributes known as functional dependency.
The purpose of normalization is to find a suitable set of relations that supports the data
requirements of an enterprise. A suitable set of relations has the following characteristics:
The first step, before applying the normalization rules of the relational data model, is converting
the conceptual design into a form suitable for the relational logical model, i.e., into tables.
Rule 3: Relationships: a relationship will be mapped by using a foreign key attribute. A foreign key
is a primary or candidate key of one relation used to create an association between tables.
For a relationship with One-to-One Cardinality: post the primary or candidate key of one of the tables
into the other as a foreign key. In cases where one entity has partial participation in the relationship,
it is recommended to post the candidate key of the partial participant to the total participant, so as to avoid
wasted storage due to null values in the foreign key attribute. E.g., for a relationship between
Employee and Department where an employee manages a department, the cardinality is one-to-one, as one
employee will manage only one department and one department will have one manager. Here the PK of
the Employee can be posted to the Department, or the PK of the Department can be posted to the
Employee. But the Employee has partial participation in the relationship "Manages", as not all
employees are managers of departments. Thus, even though both ways are possible, it is recommended to
post the primary key of the Employee to the Department table as a foreign key.
For a relationship with One-to-Many Cardinality: post the primary key or candidate key from the
"one" side as a foreign key attribute to the "many" side. E.g., for a relationship called "Belongs To"
between Employee (many) and Department (one), the primary or candidate key of the one side, which is
Department, should be posted to the many side, which is the Employee table.
For a relationship with Many-to-Many Cardinality: for relationships having many-to-many cardinality,
one has to create a new table (which is the associative entity) and post the primary key or candidate key from
the participant entities as foreign key attributes in the new table, along with some additional attributes (if
applicable). The same approach should be used for relationships with degree greater than binary.
For a relationship having the Associative Entity property: in cases where the relationship has its own
attributes (associative entity), one has to create a new table for the associative entity and post the primary key
or candidate key from the participating entities as foreign key attributes in the new table. These mapping
rules are sketched in SQL below.
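A rough SQL sketch of the three mapping rules, using the Employee, Department, and Project entities from the examples above (column names and types are illustrative assumptions; in practice the mutually referencing tables would be created first and the foreign keys added afterwards with ALTER TABLE):
Create Table Department (DeptID Int Primary Key, DName Varchar (15),
MgrID Int References Employee (EmpID))         -- 1:1 "Manages": PK of the partial participant posted here
Create Table Employee (EmpID Int Primary Key, FName Varchar (20),
DeptID Int References Department (DeptID))     -- 1:M "Belongs To": PK of the "one" side posted to the "many" side
Create Table Works_On (EmpID Int References Employee (EmpID),
ProjID Int References Project (ProjID),
Hours Decimal (3, 1),                          -- attribute of the relationship itself (associative entity)
Primary Key (EmpID, ProjID))                   -- M:N: a new table holding both participants' keys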
One of the best ways to determine what information should be stored in a database is to clarify
what questions will be asked of it and what data would be included in the answers.
Database normalization is a series of steps followed to obtain a database design that allows for
consistent storage and efficient access of data in a relational database. These steps reduce data
redundancy and the risk of data becoming inconsistent.
NORMALIZATION is the process of identifying the logical associations between data items and
designing a database that will represent such associations but without suffering the update
anomalies which are;
1. Insertion Anomalies
2. Deletion Anomalies
3. Modification Anomalies
Normalization may reduce system performance since data will be cross referenced from many
tables. Thus denormalization is sometimes used to improve performance, at the cost of reduced
consistency guarantees.
All the normalization rules will eventually remove the update anomalies that may exist during data
manipulation after the implementation. The types of problems that can occur in an insufficiently
normalized table are called update anomalies, which include the following:
An “insertion anomaly” is a failure to place information about a new database entry into all the
places in the database where information about that new entry needs to be stored; additionally, we
may find it difficult to insert some data at all. In a properly normalized database, information about a
new entry needs to be inserted into only one place in the database; in an inadequately normalized
database, information about a new entry may need to be inserted into more than one place and,
human fallibility being what it is, some of the needed additional insertions may be missed.
Deletion anomalies
o A "deletion anomaly" is a failure to remove information about an existing database entry when it is time to
remove that entry. Additionally, deletion of one piece of data may result in the loss of other information. In a
properly normalized database, information about an old, to-be-gotten-rid-of entry needs to be deleted from only one
place in the database; in an inadequately normalized database, information about that old entry may need
to be deleted from more than one place, and, human fallibility being what it is, some of the needed additional
deletions may be missed.
Modification anomalies
o Modification of a database involves changing some value of an attribute of a table. In a properly normalized
database table, whatever information is modified by the user, the change needs to be made in only one place
and will be used accordingly.
In order to avoid the update anomalies in a given table, the solution is to decompose it into smaller
tables based on the rules of normalization. However, the decomposition must have two important
properties:
a. The lossless-join property ensures that any instance of the original relation can be recovered from the
instances of the smaller relations.
b. The dependency-preservation property ensures that the functional dependencies of the original relation
can be checked on the smaller relations without joining them.
The purpose of normalization is to reduce the chances for anomalies to occur in a database
EmpID | FName | LName | SkillID | Skill | SkillType | School | SchoolAdd | SkillLevel
Deletion Anomalies:
If the employee with ID 16 is deleted, then all information about the skill C++ and the type of that
skill is deleted from the database; we would then have no information about C++ and its skill type.
Insertion Anomalies:
Modification Anomalies:
What if the address for Helico is changed from Piazza to Mexico? We need to look for every
occurrence of Helico and change the value of School_Add from Piazza to Mexico, which is prone
to error.
A database-management system can work only with the information that we put explicitly into its
tables for a given database and into its rules for working with those tables, where such rules are
appropriate and possible.
Data Dependency
The logical associations between data items that point the database designer in the direction of a
good database design are referred to as determinant or dependent relationships.
Two data items A and B are said to be in a determinant or dependent relationship if certain values
of data item B always appear with certain values of data item A. If data item A is the
determinant data item and B the dependent data item, then the direction of the association is from
A to B and not vice versa.
The essence of this idea is that if the existence of something, call it A, implies that B must exist
and have a certain value, then we say that “B is functionally dependent on A.” We also often
express this idea by saying that “A functionally determines B,” or that “B is a function of A,” or
that "A functionally governs B." Often, the notions of functionality and functional dependency are
used interchangeably. However, for the purpose of normalization, we are interested in finding 1..1
(one-to-one) dependencies, lasting for all times (the intension rather than the extension of the
database), with the determinant having the minimal number of attributes.
X → Y holds if whenever two tuples have the same value for X, they must have the same value
for Y.
FDs are derived from the real-world constraints on the attributes and they are properties on the
database intension not extension.
Example
Dinner   Wine
Meat     Red
Fish     White
Cheese   Rose
Since the type of Wine served depends on the type of Dinner, we say Wine is functionally
dependent on Dinner.
Dinner → Wine
Since both Wine type and Fork type are determined by the Dinner type, we say Wine is
functionally dependent on Dinner and Fork is functionally dependent on Dinner.
Dinner → Wine
Dinner → Fork
Partial Dependency
If an attribute which is not a member of the primary key is dependent on some part of the primary
key (if we have composite primary key) then that attribute is partially functionally dependent on
the primary key.
Full Functional Dependency
If an attribute which is not a member of the primary key is dependent not on some part of the
primary key but on the whole key (if we have a composite primary key), then that attribute is fully
functionally dependent on the primary key.
Transitive Dependency
In mathematics and logic, a transitive relationship is a relationship of the following form: “If A
implies B, and if also B implies C, then A implies C.”
Example:
If A functionally governs B,
and if B functionally governs C,
then A functionally governs C,
provided that neither C nor B determines A, i.e., (B ↛ A and C ↛ A). In the normal notation:
A → B, B → C ⟹ A → C (transitively).
Each normal form below represents a stronger condition than the previous one.
First Normal Form (1NF):
Find the key with which you can find all data, i.e., remove any repeating groups.
Second Normal Form (2NF):
Remove part-key dependencies (partial dependencies). Make all data dependent on the whole
key.
Third Normal Form (3NF):
Remove non-key dependencies (transitive dependencies). Make all data dependent on nothing
but the key.
For most practical purposes, databases are considered normalized if they adhere to the third
normal form (there is no transitive dependency).
First Normal Form (1NF): a relation is in 1NF if it contains no repeating groups, i.e., no attribute
holds more than one value for a single tuple. Consider the following unnormalized EMPLOYEE
table, in which several skills, schools, and skill levels are crammed into each row:
EmpID | FName | LName | Skill         | SkillType              | School             | SchoolAdd                        | SkillLevel
12    | Abebe | Mekuria | SQL, VB6     | Database, Programming  | AAU, Helico        | Sidist_Kilo, Piazza              | 5, 8
16    | Lemma | Alemu   | C++, IP, SQL | Programming, Database  | Unity, Jimma       | Gerji, Jimma City                | 6, 4
65    | Almaz | Belay   | Prolog, Java | Programming            | Helico, Jimma, AAU | Piazza, Jimma City, Sidist_Kilo  | 9, 8, 6
Remove all repeating groups. Distribute the multi-valued attributes into different rows and
identify a unique identifier for the relation, so that it can be said to be a relation in a relational
database. Flatten the table.
No partial dependency of a non-key attribute on part of the primary key. This will result in a set of
relations at the level of Second Normal Form.
Any table that is in 1NF and has a single-attribute (i.e., a non-composite) key is automatically also
in 2NF.
A relation is in 2NF if:
It is in 1NF, and
All non-key attributes are dependent on the entire primary key, i.e., there is no partial dependency.
EMP_PROJ
A relation is in 3NF if:
It is in 2NF, and
There are no transitive dependencies between the primary key and non-primary-key attributes.
Assumption: Students of same batch (same year) live in one building or dormitory
STUDENT
This schema is in 2NF since the primary key is a single attribute and there are no repeating
groups (multi-valued attributes).
Let's take StudID, Year, and Dormitory and see the dependencies: StudID → Year and
Year → Dormitory, while Year cannot determine StudID and Dormitory cannot determine StudID.
Then, transitively, StudID → Dormitory.
To convert it to 3NF we need to remove all transitive dependencies of non-key attributes on
another non-key attribute.
The non-primary key attributes, dependent on each other will be moved to another table and
linked with the main table using Candidate Key- Foreign Key relationship.
STUDENT DORM
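A minimal SQL sketch of this STUDENT/DORM decomposition (attribute names follow the example above; the types and the omission of other STUDENT attributes are assumptions):
Create Table Dorm (Year Int Primary Key,       -- holds the dependency Year -> Dormitory
Dormitory Varchar (20))
Create Table Student (StudID Int Primary Key,  -- StudID -> Year remains; Dormitory is now reached via the key
Year Int References Dorm (Year))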
A conceptual data model identifies important entities and the high-level relationships among them.
This means no attribute or primary key is specified. Moreover, complexity increases as you expand
from a conceptual data model.
A logical data model, otherwise known as a fully attributed data model, allows you to understand
the details of your data without worrying about how the data will be implemented in the database.
Additionally, a logical data model will normally be derived from and/or linked back to objects in
a conceptual data model. It is independent of the DBMS, technology, data storage, and
organizational constraints.
Unlike the conceptual model, a logical model includes all entities and relationships among them.
Additionally, all attributes, the primary key, and foreign keys (keys identifying the relationship
between different entities) are specified. As a result, normalization occurs at this level.
The steps for designing the logical data model are as follows:
Finally, the physical data model will show you exactly how to implement your data model in the
database of choice. This model shows all table structures, including column name, column data
type, column constraints, primary key, foreign key, and relationships between tables.
Correspondingly, the target implementation technology may be a relational DBMS, an XML
document, a spreadsheet, or any other data implementation option
Physical database design is the process of producing a description of the implementation of the
database on secondary storage; it describes the base relations, file organizations, and indexes used
to achieve efficient access to the data, and any associated integrity constraints and security
measures.
The physical database design phase allows the designer to make decisions on how the database is
to be implemented. Therefore, physical design is tailored to a specific DBMS, and there is
feedback between physical and logical design. The database design process in database design
methodology is divided into three main phases:
conceptual, logical, and physical database design. The phase prior to physical design—logical
database design—is largely independent of implementation details, such as the specific
functionality of the target DBMS and application programs, but is dependent on the target data
model. The output of this process is a logical data model consisting of an ER/relation diagram,
relational schema, and supporting documentation that describes this model, such as a data
dictionary. Together, these represent the sources of information for the physical design process
and provide the physical database designer with a vehicle for making tradeoffs that are so
important to an efficient database design.
Whereas logical database design is concerned with the what, physical database design is concerned
with the how. It requires different skills that are often found in different people. In particular, the
physical database designer must know how the computer system hosting the DBMS operates and
must be fully aware of the functionality of the target DBMS. As the functionality provided by
current systems varies widely, physical design must be tailored to a specific DBMS. However,
physical database design is not an isolated activity—there is often feedback between physical,
logical, and application design. For example, decisions taken during physical design for improving
performance, such as merging relations together, might affect the structure of the logical data
model, which will have an associated effect on the application design.
Physical database design translates the logical data model into a set of SQL statements that define
the database. For relational database systems, it is relatively easy to translate from a logical data
model into a physical database.
Physical design involves:
Choosing a physical data structure for the data constructs in the data model.
Optionally choosing DBMS options for the existence constraints in the data model.
A physical design transformation does not change the business meaning of the data.
Moreover, it is often necessary to apply multiple transforms to a single entity to get the desired
physical performance characteristics. All physical design transformations are compromises.
The relational algebra is a relation-at-a-time (or set) language in which all tuples, possibly from
several relations, are manipulated in one statement without looping. There are many variations of
the operations that are included in relational algebra. The five fundamental operations in relational
algebra—Selection, Projection, Cartesian product, Union, and Set difference—perform most of
the data retrieval operations that we are interested in. In addition, there are also the Join,
Intersection, and Division operations, which can be expressed in terms of the five basic operations.
The Selection and Projection operations are unary operations, as they operate on one relation. The
other operations work on pairs of relations and are therefore called binary operations. In the
following definitions, let R and S be two relations defined over the attributes A = (a1, a2, … , aN)
and B = (b1, b2, … , bM), respectively. Use the following figures for the examples in this chapter.
Projection (π a1 … an (R) )
The Projection operation works on a single relation R and defines a relation that contains a
vertical subset of R, extracting the values of specified attributes and eliminating duplicates.
Example 6.2: Produce a list of salaries for all staff, showing only the staffNo, fName, lName,
and salary details:
π staffNo, fName, lName, salary (Staff)
Set difference ( R – S )
The Set difference operation defines a relation consisting of the tuples that are in relation R, but
not in S. R and S must be union-compatible.
Example 6.4: List all cities where there is a branch office but no properties for rent. π city (
Branch ) – π city ( PropertyForRent )
As in the previous example, we produce union-compatible relations by projecting the Branch and
PropertyForRent relations over the attribute city . We then use the Set difference operation to
combine these new relations to produce the result shown in Figure 6.5.
Intersection ( R ∩ S )
The Intersection operation defines a relation consisting of the set of all tuples that are in both R
and S; R and S must be union-compatible.
Example 6.5: List all cities where there is both a branch office and at least one property for rent.
π city ( Branch ) ∩ π city ( PropertyForRent )
Note that we can express the Intersection operation in terms of the Set difference
operation: R ∩ S = R – (R – S).
Figure 6.6. Intersection based on city attribute from the Branch and PropertyForRent relations.
Cartesian product (R x S)
The Cartesian product operation defines a relation that is the concatenation of every tuple of
relation R with every tuple of relation S.
The Cartesian product operation multiplies two relations to define another relation consisting of
all possible pairs of tuples from the two relations. Therefore, if one relation has I tuples and N
attributes and the other has J tuples and M attributes, Cartesian product relation will contain (I *
J) tuples with (N + M) attributes. It is possible that the two relations may have attributes with the
same name. In this case, the attribute names are prefixed with the relation name to maintain the
uniqueness of attribute names within a relation.
The names of clients are held in the Client relation and the details of viewings are held in the
Viewing relation. To obtain the list of clients and the comments on properties they have viewed,
we need to combine these two relations:
The result of this operation is shown in Figure 6.7. In its present form, this relation contains more
information than we require. For example, the first tuple of this relation contains different clientNo
values. To obtain the required list, we need to carry out a Selection operation on this relation to
extract those tuples where Client.clientNo = Viewing.clientNo.
Applies the aggregate function list, AL, to the relation R to define a relation over the aggregate
list.
Example 6.9: (a) How many properties cost more than £350 per month to rent? We can use the
aggregate function COUNT to produce the relation R shown in Figure 6.10(a). (b) Find the
minimum, maximum, and average staff salary. We can use the aggregate functions MIN, MAX,
and AVERAGE to produce the relation R shown in Figure 6.10(b).
Figure 6.10. Result of the Aggregate operations: (a) finding the number of properties whose rent
is greater than £350; (b) finding the minimum, maximum, and average staff salary.
Groups the tuples of relation R by the grouping attributes, GA, and then applies the aggregate
function list AL to define a new relation. AL contains one or more (<aggregate_function>,
<attribute>) pairs. The resulting relation contains the grouping attributes, GA, along with the
results of each of the aggregate functions.
We illustrate the use of the grouping operation with the following example.
We first need to group tuples according to the branch number, branchNo , and then use the
aggregate functions COUNT and SUM to produce the required relation. The relational algebra
expression is as follows:
Figure 6.11. Result of the grouping operation to find the number of staff working in each branch
and the sum of their salaries.
The relational calculus is not related to differential and integral calculus in mathematics, but takes
its name from a branch of symbolic logic called predicate calculus. When applied to databases,
it is found in two forms: tuple relational calculus, as originally proposed by Codd,
and domain relational calculus, as proposed by Lacroix and Pirotte.
If P is a predicate, then we can write the set of all x such that P is true for x, as:
{x | P(x)}
We may connect predicates by the logical connectives ∧ (AND), ∨ (OR), and ~ (NOT) to form
compound predicates.
In the tuple relational calculus, a tuple variable that ranges over a relation is written, for example,
as Staff(S).
To express the query "Find the set of all tuples S such that F(S) is true," we can write:
{S | F(S)}
S.salary means the value of the salary attribute for the tuple variable S. To retrieve a particular
attribute, such as salary, we would write:
{S.salary | Staff(S)}
There are two quantifiers we can use with formulae to tell how many instances the predicate
applies to. The existential quantifier ∃ ("there exists") is used in formulae that must be true for
at least one instance, such as:
(∃B)(Branch(B) ∧ (B.branchNo = S.branchNo) ∧ (B.city = 'London'))
This means, "There exists a Branch tuple that has the same branchNo as the branchNo of the
current Staff tuple, S, and is located in London." The universal quantifier ∀ ("for all") is used in
statements about every instance, such as:
(∀B)(B.city ≠ 'Paris')
This means, "For all Branch tuples, the address is not in Paris." We can apply a generalization
of De Morgan's laws to the existential and universal quantifiers. For example:
(∃X)(F(X)) ≡ ~(∀X)(~(F(X)))
(∀X)(F(X)) ≡ ~(∃X)(~(F(X)))
Using these equivalence rules, we can rewrite the previous formula as:
~(∃B)(B.city = 'Paris')
Tuple variables that are qualified by ∃ or ∀ are called bound variables; the other tuple variables
are called free variables. The only free variables in a relational calculus expression should be
those on the left side of the bar (|). For example, in the following query:
{S.fName, S.lName | Staff(S) ∧ S.salary > 10000}
S is the only free variable and S is then bound successively to each tuple of Staff .
A database language must perform these tasks with minimal user effort, and its command structure
and syntax must be relatively easy to learn. Finally, the language must be portable; that is, it must
conform to some recognized standard so that we can use the same command structure and syntax
when we move from one DBMS to another. SQL is intended to satisfy these requirements.
Data Definition Language (DDL) for defining the database structure and controlling access to the
data;
Data Manipulation Language (DML) for retrieving and updating data.
Until the 1999 release of the standard, known as SQL:1999 or SQL3, SQL contained only these
definitional and manipulative commands; it did not contain flow of control commands, such as IF
. . . THEN . . . ELSE, GO TO, or DO . . . WHILE. These commands had to be implemented using
a programming or job-control language, or interactively by the decisions of the user. Owing to this
lack of computational completeness, SQL can be used in two ways. The first way is to use SQL
interactively by entering the statements at a terminal. The second way is to embed SQL statements
in a procedural language. SQL is a relatively easy language to learn:
It is a nonprocedural language; you specify what information you require, rather than how to get
it. In other words, SQL does not require you to specify the access methods to the data.
Like most modern languages, SQL is essentially free-format, which means that parts of statements
do not have to be typed at particular locations on the screen.
The command structure consists of standard English words such as CREATE TABLE, INSERT,
SELECT.
For example:
CREATE TABLE Staff (staffNo VARCHAR(5),
lName VARCHAR(15),
salary DECIMAL(7,2));
Importance of SQL
SQL is the first and, so far, only standard database language to gain wide acceptance. The only
other standard database language, the Network Database Language (NDL), based on the
CODASYL network model, has few followers. Nearly every major current vendor provides
database products based on SQL or with an SQL interface, and most are represented on at least
one of the standard-making bodies.
There is a huge investment in the SQL language both by vendors and by users. It has become part
of application architectures such as IBM's Systems Application Architecture (SAA) and is the
strategic choice of many large and influential organizations, for example, the Open Group
consortium for UNIX standards. SQL has also become a Federal Information Processing Standard
(FIPS) to which conformance is required for all sales of DBMSs to the U.S. government. The SQL
Access Group, a consortium of vendors, defined a set of enhancements to SQL that would support
interoperability across disparate systems.
SQL is used in other standards and even influences the development of other standards as a
definitional tool. Examples include ISO's Information Resource Dictionary System (IRDS)
standard and Remote Data Access (RDA) standard. The development of the language is supported
by considerable academic interest, providing both a theoretical basis for the language and the
techniques needed to implement it successfully. This is especially true in query optimization,
distribution of data, and security. There are now specialized implementations of SQL that are
directed at new markets, such as OnLine Analytical Processing (OLAP).
Terminology
Most components of an SQL statement are case-insensitive, which means that letters can be typed
in either upper- or lowercase. The one important exception to this rule is that literal character data
must be typed exactly as it appears in the database. For example, if we store a person's surname
as 'SMITH' and then search for it using the string 'Smith', the row will not be found.
Although SQL is free-format, an SQL statement or set of statements is more readable if indentation
and lineation are used. For example:
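One possible illustration, using the Staff relation from the relational algebra examples earlier in this chapter (the selection condition is an assumption, chosen only for display purposes):
SELECT staffNo, fName, lName, salary
FROM Staff
WHERE salary > 10000
ORDER BY lName;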
Throughout this and the next three chapters, we use the following extended form of the Backus
Naur Form (BNF) notation to define SQL statements:
uppercase letters are used to represent reserved words and must be spelled exactly as shown;
lowercase letters are used to represent user-defined words;
a vertical bar ( | ) indicates a choice among alternatives; for example, a | b | c;
curly braces indicate a required element; for example, {a};
In practice, the DDL statements are used to create the database structure (that is, the tables) and
the access mechanisms (that is, what each user can legally access), and then the DML statements
are used to populate and query the tables.
The CREATE TABLE command is used to specify a new relation by giving it a name and
specifying its attributes and initial constraints. The attributes are specified first, and each
attribute is given a name, a data type to specify its domain of values, and any attribute
constraints, such as NOT NULL. The key, entity integrity, and referential integrity constraints
can be specified within the CREATE TABLE statement after the attributes are declared, or they
can be added later using the ALTER TABLE command.
Create Table Employee (Fname Varchar (20), Lname Varchar (20), ID Int Primary Key,
Bdate Varchar (20), Address Varchar (20), Sex Char, Salary Decimal,
SuperID Int Foreign Key References Employee (ID))
Create Table Department (DName Varchar (15), Dnumber Int Primary Key,
MgrID Int Foreign Key References Employee (ID), Mgrstartdate Varchar (20))
Create Table Project (Pname Varchar (15), Pnumber Int Primary Key, Plocation Varchar (15),
Dnum Int Foreign Key References Department (Dnumber))
Create Table Works_On (EID Int, Pno Int, Hours Decimal (3, 1), Primary Key (EID, Pno),
Foreign Key (EID) References Employee (ID), Foreign Key (Pno) References Project (Pnumber))
Create Table Dependent (EID Int, Dependent_Name Varchar (15), Sex Char, Bdate Date,
Foreign Key (EID) References Employee (ID))
In its simplest form, INSERT is used to add a single tuple to a relation. We must specify the relation
name and a list of values for the tuple. The values should be listed in the same order in which the
corresponding attributes were specified in the CREATE TABLE command. For example, commands
to add new tuples to the EMPLOYEE, Department, Project, Works_On, Dept_Locations, and
Dependent tables are shown below.
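Illustrative INSERT statements for the tables created above (the values are made-up samples consistent with those definitions; Dept_Locations is omitted because its CREATE TABLE statement is not shown here):
Insert Into Employee Values ('Sara', 'Girma', 12, '02-Jan-1990', 'Piazza', 'F', 4000, NULL)
Insert Into Department Values ('Research', 5, 12, '01-Sep-2015')
Insert Into Project Values ('ProductX', 10, 'Bellaire', 5)
Insert Into Works_On Values (12, 10, 32.5)
Insert Into Dependent Values (12, 'Abel', 'M', '05-May-2012')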
In this section, we give an overview of the schema evolution commands available in SQL, which
can be used to alter a schema by adding or dropping tables, attributes, constraints, and other
schema elements.
The DROP command can be used to drop named schema elements, such as tables, domains, or
constraints. One can also drop a schema. For example, if a whole schema is not needed any more,
the DROP SCHEMA command can be used. For example, to remove the COMPANY database
and all its tables, domains, and other elements, it is used as follows:
In order to drop the table employee from company database we use the following command
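Sketches of both commands (standard SQL; the drop-behavior keywords supported vary by DBMS):
DROP SCHEMA COMPANY CASCADE;   -- removes the whole schema and every element in it
DROP TABLE Employee;           -- removes only the Employee table from the COMPANY database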
The definition of a base table or of other named schema elements can be changed by using the
ALTER command. For base tables, the possible alter table actions include adding or dropping a
column (attribute), changing a column definition, and adding or dropping table constraints. For
example, to add an attribute for keeping track of jobs of employees to the EMPLOYEE base
relations in the COMPANY schema, we can use the command
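For example (the column name Job and its size are illustrative assumptions):
ALTER TABLE Employee ADD Job VARCHAR(12);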
To drop a column, we must choose either CASCADE or RESTRICT for drop behavior. If
CASCADE is chosen, all constraints and views that reference the column are dropped
automatically from the schema, along with the column. If RESTRICT is chosen, the command is
successful only if no views or constraints (or other elements) reference the column. For example,
the following command removes the attribute ADDRESS from the EMPLOYEE base table:
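In standard SQL this command could be written as:
ALTER TABLE Employee DROP COLUMN Address CASCADE;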
It is also possible to alter a column definition by dropping an existing default clause or by defining
a new default clause. The following examples illustrate this clause:
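For instance (illustrative statements against the Department table defined earlier; the default value chosen here is an assumption):
ALTER TABLE Department ALTER COLUMN MgrID DROP DEFAULT;
ALTER TABLE Department ALTER COLUMN MgrID SET DEFAULT 12;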
The DELETE command removes tuples from a relation. It includes a WHERE clause, similar to
that used in an SQL query, to select the tuples to be deleted. Tuples are explicitly deleted from
only one table at a time. However, the deletion may propagate to tuples in other relations
if referential triggered actions are specified in the referential integrity constraints. Depending on
the number of tuples selected by the condition in the WHERE clause, zero, one, or several tuples
can be deleted by a single DELETE command. A missing WHERE clause specifies that all tuples
in the relation are to be deleted; however, the table remains in the database as an empty table. The
DELETE commands in Q4A to Q4D illustrate this; for example:
DELETE FROM Employee
WHERE ID='12';
The UPDATE command is used to modify attribute values of one or more selected tuples. As in
the DELETE command, a WHERE clause in the UPDATE command selects the tuples to be
modified from a single relation. However, updating a primary key value may propagate to the
foreign key values of tuples in other relations if such a referential triggered action is specified in
the referential integrity constraints. An additional SET clause in the UPDATE command specifies
the attributes to be modified and their new values. For example, to change the location and
controlling department number of project number 10 to ‘Bellaire’ and 5, respectively, we use U5:
U5: UPDATE PROJECT
SET PLOCATION='Bellaire', DNUM=5
WHERE PNUMBER=10;
Several tuples can be modified with a single UPDATE command. An example is to give all
employees in the ‘Research’ department a 10 percent raise in salary, as shown in U6. In this
request, the modified SALARY value depends on the original SALARY value in each tuple, so
two references to the SALARY attribute are needed. In the SET clause, the reference to the
SALARY attribute on the right refers to the old SALARY value before modification, and the one
on the left refers to the new SALARY value after modification:
U6: UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO IN (SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research');
It is also possible to specify NULL or DEFAULT as the new attribute value. Notice that each
UPDATE command explicitly refers to a single relation only. To modify multiple relations, we
must issue several UPDATE commands.
Queries in SQL can be very complex. We will start with simple queries, and then progress to more
complex ones in a step-by-step manner. The basic form of the SELECT statement, sometimes
called a mapping or a select-from-where block, is formed of the three clauses SELECT, FROM,
and WHERE and has the following form:
SELECT<attribute list>
FROM<table list>
WHERE<condition>
<Condition> is a conditional (Boolean) expression that identifies the tuples to be retrieved by the
query.
In SQL, the basic logical comparison operators for comparing attribute values with one another
and with literal constants are =, <, <=, >, >=, and <>. SQL has many additional comparison
operators that we shall present gradually as needed.
QUERY 0: Retrieve the birth date and address of the employee whose name is 'sara girma'.
Q0: SELECT BDATE, ADDRESS
FROM EMPLOYEE
WHERE FNAME='sara' AND LNAME='girma';
This query involves only the EMPLOYEE relation listed in the FROM clause. The
query selects the EMPLOYEE tuples that satisfy the condition of the WHERE clause,
then projects the result on the BDATE and ADDRESS attributes listed in the SELECT clause. Q0
is similar to the following relational algebra expression, except that duplicates, if any, would not be
eliminated:
Hence, a simple SQL query with a single relation name in the FROM clause is similar to a
SELECT-PROJECT pair of relational algebra operations. The SELECT clause of SQL specifies
the projection attributes, and the WHERE clause specifies the selection condition. The only
difference is that in the SQL query we may get duplicate tuples in the result, because the constraint
that a relation is a set is not enforced.
QUERY1
Q1:
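The text of Q1 can be inferred from the discussion that follows (a join of EMPLOYEE and DEPARTMENT on DNO/DNUMBER with a condition on DNAME); a plausible formulation, retrieving the name and address of every employee who works for the 'Research' department, is:
SELECT FNAME, LNAME, ADDRESS
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNUMBER=DNO;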
In general, any number of select and join conditions may be specified in a single SQL query.
QUERY2
For every project located in ‘Stafford’, list the project number, the controlling department number,
and the department manager’s last name, address, and birth date.
Q2:
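A formulation consistent with this description (assuming the department manager is recorded as MgrID, as in the CREATE TABLE statements earlier) would be:
SELECT PNUMBER, DNUM, LNAME, ADDRESS, BDATE
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRID=ID AND PLOCATION='Stafford';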
In SQL the same name can be used for two (or more) attributes as long as the attributes are
in different relations. If this is the case, and a query refers to two or more attributes with the same
name, we must qualify the attribute name with the relation name to prevent ambiguity. This is
done by prefixing the relation name to the attribute name and separating the two by a period. To
illustrate this, suppose that DNO and LNAME attributes of the EMPLOYEE relation were called
DNUMBER and NAME, and the DNAME attribute of DEPARTMENT was also called NAME;
then, to prevent ambiguity, query Q1 would be rephrased as shown in Q1A. We must prefix the
attributes NAME and DNUMBER in QIA to specify which ones we are referring to, because the
attribute names are used in both relations:
Q1A: SELECT FNAME, EMPLOYEE.NAME, ADDRESS
FROM EMPLOYEE, DEPARTMENT
WHERE DEPARTMENT.NAME='Research'
AND DEPARTMENT.DNUMBER=EMPLOYEE.DNUMBER;
Ambiguity also arises in the case of queries that refer to the same relation twice, as in the
following example.
QUERY 8
For each employee, retrieve the employee’s first and last name and the first and last name of his
or her immediate supervisor.
Q8: SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.SUPERID=S.ID;
In this case, we are allowed to declare alternative relation names E and S, called aliases or tuple
variables, for the EMPLOYEE relation. An alias can follow the keyword AS, as shown in Q8, or
it can directly follow the relation name, for example, by writing EMPLOYEE E, EMPLOYEE S
in the FROM clause of Q8. It is also possible to rename the relation attributes within the query in
SQL by giving them aliases. For example, if we write EMPLOYEE AS E (FN, MI, LN, ID, SD,
ADDR, SEX, SAL, SID, DNO) in the FROM clause, FN becomes an alias for FNAME, MI for
MINIT, LN for LNAME, and so on.
In Q8, we can think of E and S as two different copies of the EMPLOYEE relation; the first, E,
represents employees in the role of supervisees; the second, S, represents employees in the role of
supervisors. We can now join the two copies. Of course, in reality there is only one EMPLOYEE
relation, and the join condition is meant to join the relation with itself by matching the tuples that
satisfy the join condition E. SUPERID = S. ID. Notice that this is an example of a one-level
recursive query.
The result of query Q8 is shown in Figure 8.3d. Whenever one or more aliases are given to a
relation, we can use these names to represent different references to that relation. This permits
multiple references to the same relation within a query. Note that we can use this alias-naming
mechanism in any SQL query to specify tuple variables for every table in the FROM clause,
whether or not the same relation needs to be referenced more than once. In fact, this practice
is recommended since it results in queries that are easier to comprehend.
If we specify tuple variables for every table in the WHERE clause, a select-project-join query in
SQL closely resembles the corresponding tuple relational calculus expression (except for duplicate
elimination).
We discuss two more features of SQL here. A missing WHERE clause indicates no condition on
tuple selection; hence, all tuples of the relation specified in the FROM clause qualify and are
selected for the query result. If more than one relation is specified in the FROM clause and there
is no WHERE clause, then the CROSS PRODUCT (all possible tuple combinations) of these
relations is selected. For example, Query 9 selects all EMPLOYEE IDs and Query 10 selects all
combinations of an EMPLOYEE ID and a DEPARTMENT DNAME.
QUERIES 9 AND 10: Select all EMPLOYEE IDs (Q9), and all combinations of EMPLOYEE ID
and DEPARTMENT DNAME (Q10) in the database.
Q9: SELECT ID
FROM EMPLOYEE;
Q10: SELECT ID, DNAME
FROM EMPLOYEE, DEPARTMENT;
It is extremely important to specify every selection and join condition in the WHERE clause; if
any such condition is overlooked, incorrect and very large relations may result.
To retrieve all the attribute values of the selected tuples, we do not have to list the attribute names
explicitly in SQL; we just specify an asterisk (*), which stands for all the attributes. For example,
query Q1C retrieves all the attribute values of any EMPLOYEE who works in DEPARTMENT
number 5, query Q1D retrieves all the attributes of an EMPLOYEE and the attributes of the
DEPARTMENT in which he or she works for every employee of the ‘Research’ department, and
Ql0A specifies the CROSS PRODUCT of the EMPLOYEE and DEPARTMENT relations.
Q1C: SELECT *
FROM EMPLOYEE
WHERE DNO=5;
Q1D: SELECT *
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNO=DNUMBER;
Q10A: SELECT *
FROM EMPLOYEE, DEPARTMENT;
As we mentioned earlier, SQL usually treats a table not as a set but rather as a multiset; duplicate
tuples can appear more than once in a table, and in the result of a query. SQL does not
automatically eliminate duplicate tuples in the results of queries, for the following reasons:
duplicate elimination is an expensive operation, the user may want to see duplicate tuples in the
result of a query, and the semantics of aggregate functions may depend on duplicates being kept.
An SQL table with a key is restricted to being a set, since the key value must be distinct in each
tuple. If we do want to eliminate duplicate tuples from the result of an SQL query, we use the
keyword DISTINCT in the SELECT clause, meaning that only distinct tuples should remain in the
result. In general, a query with SELECT DISTINCT eliminates duplicates, whereas a query with
SELECT ALL does not. Specifying SELECT with neither
ALL nor DISTINCT-as in our previous examples-is equivalent to SELECT ALL. For example,
Query 11 retrieves the salary of every employee; if several employees have the same salary, that
salary value will appear as many times in the result of the query. If we are interested only in distinct
salary values, we want each value to appear only once, regardless of how many employees earn
that salary; we achieve this by using the keyword DISTINCT as in Q11A.
QUERY 11
Retrieve the salary of every employee (Q11) and all distinct salary values (Q11A).
Q11: SELECT ALL SALARY
FROM EMPLOYEE;
Q11A: SELECT DISTINCT SALARY
FROM EMPLOYEE;
SQL has directly incorporated some of the set operations of relational algebra. There are set union
(UNION), set difference (EXCEPT), and set intersection (INTERSECT) operations. The relations
resulting from these set operations are sets of tuples, that is, duplicate tuples are eliminated from
the result, and the operations apply only to union-compatible relations.
QUERY 4
Make a list of all project numbers for projects that involve an employee whose last name is ‘girma’,
either as a worker or as a manager of the department that controls the project.
Q4: (SELECT DISTINCT PNUMBER
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRID=ID AND LNAME='girma')
UNION
(SELECT DISTINCT PNUMBER
FROM PROJECT, WORKS_ON, EMPLOYEE
WHERE PNUMBER=PNO AND EID=ID AND LNAME='girma');
The first SELECT query retrieves the projects that involve a ‘girma’ as manager of the department
that controls the project, and the second retrieves the projects that involve a ‘girma’ as a worker
on the project. Notice that if several employees have the last name ‘girma’, the project names
involving any of them will be retrieved. Applying the UNION operation to the two SELECT
queries gives the desired result. SQL also has corresponding multiset operations, which are
followed by the keyword ALL (UNION ALL, EXCEPT ALL, INTERSECT ALL); their results
are multisets, so duplicates are retained.
In this section we discuss several more features of SQL. The first feature allows comparison
conditions on only parts of a character string, using the LIKE comparison operator. This can be
used for string pattern matching. Partial strings are specified using two reserved characters: %
replaces an arbitrary number of zero or more characters, and the underscore (_) replaces a single
character. For example, consider the following query.
Q12: SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE ADDRESS LIKE '%Piazza%';
This retrieves all employees whose address contains the substring 'Piazza'.
To retrieve all employees who were born during the 1950s, we can use Query 12A. Here, ‘5’ must
be the third character of the string (according to our format for date), so we use the value ‘_ _ 5_
_ _ _ ‘, with each underscore serving as a placeholder for an arbitrary character.
QUERY 12A: Find all employees who were born during the 1950s.
Q12A: SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE BDATE LIKE '__5_____';
Another feature allows the use of arithmetic in queries. The standard arithmetic operators for
addition (+), subtraction (-), multiplication (*), and division (/) can be applied to numeric values
or attributes with numeric domains. For example, suppose that we want to see the effect of giving
all employees who work on the ‘ProductX’ project a 10 percent raise; we can issue Query13 to see
what their salaries would become. This example also shows how we can rename an attribute in the
query result using AS in the SELECT clause.
QUERY 13 Show the resulting salaries if every employee working on the ‘ProductX’ project is
given a 10 percent raise.
Q13: SELECT FNAME, LNAME, 1.1*SALARY AS INCREASED_SAL
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE ID=EID AND PNO=PNUMBER AND PNAME='ProductX';
For string data types, the concatenation operator || can be used in a query to append two string values.
For date, time, timestamp, and interval data types, operators include incrementing (+) or
decrementing (-) a date, time, or timestamp by an interval. In addition, an interval value is the
result of the difference between two date, time, or timestamp values. Another comparison operator,
provided for convenience, is BETWEEN, illustrated in Q14.
Q14: SELECT *
FROM EMPLOYEE
WHERE SALARY BETWEEN 30000 AND 40000;
The condition (SALARY BETWEEN 30000 AND 40000) in Q14 is equivalent to the condition
((SALARY >= 30000) AND (SALARY <= 40000)).
SQL allows the user to order the tuples in the result of a query by the values of one or more
attributes, using the ORDER BY clause. This is illustrated by Query 15.
QUERY 15 Retrieve a list of employees and the projects they are working on, ordered by
department and, within each department, ordered alphabetically by last name, first name.
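One way to write Q15 (the join attribute pairings DNO/DNUMBER, ID/EID, and PNO/PNUMBER follow the conventions used throughout this chapter):
SELECT DNAME, LNAME, FNAME, PNAME
FROM DEPARTMENT, EMPLOYEE, WORKS_ON, PROJECT
WHERE DNUMBER=DNO AND ID=EID AND PNO=PNUMBER
ORDER BY DNAME, LNAME, FNAME;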
The default order is in ascending order of values. We can specify the keyword DESC if we want
to see the result in a descending order of values. The keyword ASC can be used to specify
ascending order explicitly. For example, if we want descending order on DNAME and ascending
order on LNAME, FNAME, the ORDER BY clause of Q15 can be written as:
ORDER BY DNAME DESC, LNAME ASC, FNAME ASC
In the previous section, we described some basic types of queries in SQL. Because of the generality
and expressive power of the language, there are many additional features that allow users to specify
more complex queries. We discuss several of these features in this section.
Some queries require that existing values in the database be fetched and then used in a comparison
condition. Such queries can be conveniently formulated by using nested queries, which are
complete select-from-where blocks within the WHERE clause of another query. That other query
is called the outer query. Query 4 is formulated in Q4 without a nested query, but it can be
rephrased to use nested queries as shown in Q4A. Q4A introduces the comparison operator IN,
which compares a value v with a set (or multiset) of values V and evaluates to TRUE if v is one of
the elements in V
Q4A: SELECT DISTINCT PNUMBER
FROM PROJECT
WHERE PNUMBER IN (SELECT PNUMBER
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRID=ID AND LNAME='Girma')
OR PNUMBER IN (SELECT PNO
FROM WORKS_ON, EMPLOYEE
WHERE EID=ID AND LNAME='Girma');
The first nested query selects the project numbers of projects that have a ‘GIRMA’ involved as
manager, while the second selects the project numbers of projects that have a ‘GIRMA’ involved
as worker. In the outer query, we use the OR logical connective to retrieve a PROJECT tuple if
the PNUMBER value of that tuple is in the result of either nested query. If a nested query returns
a single attribute and a single tuple, the query result will be a single (scalar) value. In such cases,
it is permissible to use = instead of IN for the comparison operator. In general, the nested query
will return a table (relation), which is a set or multiset of tuples.
SQL allows the use of tuples of values in comparisons by placing them within parentheses.
SELECT DISTINCT EID
FROM WORKS_ON
WHERE (PNO, HOURS) IN (SELECT PNO, HOURS
FROM WORKS_ON
WHERE EID='12');
This query will select the Identity numbers of all employees who work the same (project, hours)
combination on some project that employee ‘John Smith’ (whose ID =’12’) works on. In this
example, the IN operator compares the sub tuple of values in parentheses (PNO, HOURS) for each
tuple in WORKS_ON with the set of union-compatible tuples produced by the nested query.
For example, the comparison condition (v > ALL V) returns TRUE if the value v is greater
than all the values in the set (or multiset) V. An example is the following query, which returns the
names of employees whose salary is greater than the salary of all the employees in department 5:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ALL (SELECT SALARY
FROM EMPLOYEE
WHERE DNO=5);
In general, we can have several levels of nested queries. We can once again be faced with possible
ambiguity among attribute names if attributes of the same name exist-one in a relation in the
FROM clause of the outer query, and another in a relation in the FROM clause of the nested
query. The rule is that a reference to an unqualified attribute refers to the relation declared in the
innermost nested query. For example, in the SELECT clause and WHERE clause of the first nested
query of Q4A, a reference to any unqualified attribute of the PROJECT relation refers to the
PROJECT relation specified in the FROM clause of the nested query. To refer to an attribute of
the PROJECT relation specified in the outer query, we can specify and refer to an alias (tuple
variable) for that relation. These rules are similar to scope rules for program variables in most
programming languages that allow nested procedures and functions. To illustrate the potential
ambiguity, consider Query 16: retrieve the name of each employee who has a dependent with the
same first name and the same sex as the employee.
Q16: SELECT E.FNAME, E.LNAME
FROM EMPLOYEE AS E
WHERE E.ID IN (SELECT EID
FROM DEPENDENT
WHERE E.FNAME=DEPENDENT_NAME
AND E.SEX=SEX);
In the nested query of Q16, we must qualify E.SEX because it refers to the SEX attribute of
EMPLOYEE from the outer query, and DEPENDENT also has an attribute called SEX. All
unqualified references to SEX in the nested query refer to SEX of DEPENDENT. However, we
do not have to qualify FNAME and ID because the DEPENDENT relation does not have attributes
called FNAME and ID, so there is no ambiguity.
It is generally advisable to create tuple variables (aliases) for all the tables referenced in an SQL
query to avoid potential errors and ambiguities.
Whenever a condition in the WHERE clause of a nested query references some attribute of a
relation declared in the outer query, the two queries are said to be correlated. We can understand
a correlated query better by considering that the nested query is evaluated once for each tuple (or
combination of tuples) in the outer query. For example, we can think of Q16 as follows:
For each EMPLOYEE tuple, evaluate the nested query, which retrieves the EID values for all
DEPENDENT tuples with the same sex and name as that EMPLOYEE tuple; if the ID value of
the EMPLOYEE tuple is in the result of the nested query, then select that EMPLOYEE tuple.
Q16A: SELECT E.FNAME, E.LNAME
FROM EMPLOYEE AS E
WHERE E.ID IN (SELECT D.EID
FROM DEPENDENT AS D
WHERE E.FNAME=D.DEPENDENT_NAME
AND E.SEX=D.SEX);
The original SQL implementation on SYSTEM R also had a CONTAINS comparison operator,
which was used to compare two sets or multisets. This operator was subsequently dropped from
the language, possibly because of the difficulty of implementing it efficiently. Most commercial
implementations of SQL do not have this operator. The CONTAINS operator compares two sets
of values and returns TRUE if one set contains all values in the other set. Query 3 illustrates the
use of the CONTAINS operator.
QUERY 3 Retrieve the name of each employee who works on all the projects controlled by
department number 5.
Q3: SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE ((SELECT PNO
FROM WORKS_ON
WHERE ID=EID)
CONTAINS
(SELECT PNUMBER
FROM PROJECT
WHERE DNUM=5));
In Q3, the second nested query (which is not correlated with the outer query) retrieves the project
numbers of all projects controlled by department 5. For each employee tuple, the first nested query
(which is correlated) retrieves the project numbers on which the employee works; if these contain
all projects controlled by department 5, the employee tuple is selected and the name of that
employee is retrieved. Notice that the CONTAINS comparison operator has a similar function to
the DIVISION operation of the relational algebra.
Because the CONTAINS operation is not part of SQL, we have to use other techniques, such as
the EXISTS function, to specify these types of queries.
The EXISTS function in SQL is used to check whether the result of a correlated nested query is
empty (contains no tuples) or not. We illustrate the use of EXISTS (and NOT EXISTS) with some
examples. First, we formulate Query 16 in an alternative form that uses EXISTS.
Q16B: SELECT E.FNAME, E.LNAME
FROM EMPLOYEE AS E
WHERE EXISTS (SELECT *
FROM DEPENDENT
WHERE E.ID=EID
AND E.SEX=SEX
AND E.FNAME=DEPENDENT_NAME);
EXISTS and NOT EXISTS are usually used in conjunction with a correlated nested query. In Q16B,
the nested query references the ID, FNAME, and SEX attributes of the EMPLOYEE relation from
the outer query. We can think of Q16B as follows: For each EMPLOYEE tuple, evaluate the nested
query, which retrieves all DEPENDENT tuples with the same Identity number, sex, and name as
the EMPLOYEE tuple; if at least one tuple EXISTS in the result of the nested query, then select
that EMPLOYEE tuple. In general, EXISTS (Q) returns TRUE if there is at least one tuple in the
result of the nested query Q, and it returns FALSE otherwise. On the other hand, NOT EXISTS
(Q) returns TRUE if there are no tuples in the result of nested query Q, and it returns FALSE
otherwise. Next, we illustrate the use of NOT EXISTS.
QUERY 6 Retrieve the names of employees who have no dependents.
Q6: SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE NOT EXISTS (SELECT *
FROM DEPENDENT
WHERE ID=EID)
In Q6, the correlated nested query retrieves all DEPENDENT tuples related to a particular
EMPLOYEE tuple. If none exist, the EMPLOYEE tuple is selected. We can explain Q6 as follows:
For each EMPLOYEE tuple, the correlated nested query selects all DEPENDENT tuples whose
EID value matches the EMPLOYEE ID; if the result is empty, no dependents are related to the
employee, so we select that EMPLOYEE tuple and retrieve its FNAME and LNAME values.
QUERY 7 List the names of managers who have at least one dependent.
Q7: SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE EXISTS (SELECT *
FROM DEPENDENT
WHERE ID=EID)
AND
EXISTS
(SELECT *
FROM DEPARTMENT
WHERE ID=MGRID)
One way to write this query is shown in Q7, where we specify two nested correlated queries; the
first selects all DEPENDENT tuples related to an EMPLOYEE, and the second selects all
DEPARTMENT tuples managed by the EMPLOYEE. If at least one of the first and at least one
of the second exists, we select the EMPLOYEE tuple. Can you rewrite this query using only a
single nested query or no nested queries?
Query 3 ("Retrieve the name of each employee who works on all the projects controlled by
department number 5") can be stated using EXISTS and NOT EXISTS in SQL systems. There are
two options. The first is to use the well-known set theory transformation that (S1 CONTAINS S2)
is logically equivalent to "(S2 EXCEPT S1) is empty." This option is shown as Q3A.
Q3A: SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE NOT EXISTS ((SELECT PNUMBER
FROM PROJECT
WHERE DNUM=5)
EXCEPT
(SELECT PNO
FROM WORKS_ON
WHERE ID=EID))
In Q3A, the first sub query (which is not correlated) selects all projects controlled by department
5, and the second sub query (which is correlated) selects all projects that the particular employee
being considered works on. If the set difference of the first sub query MINUS (EXCEPT) the
second sub query is empty, it means that the employee works on all the projects and is hence
selected. The second option is shown as Q3B. Notice that we need two-level nesting in Q3B and
that this formulation is quite a bit more complex than Q3, which used the CONTAINS comparison
operator, and Q3A, which uses NOT EXISTS and EXCEPT. However, CONTAINS is not part of
SQL, and not all relational systems have the EXCEPT operator even though it is part of SQL-99.
Q3B: SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE NOT EXISTS (SELECT *
FROM WORKS_ON B
WHERE (B.PNO IN (SELECT PNUMBER
FROM PROJECT
WHERE DNUM=5))
AND
NOT EXISTS (SELECT *
FROM WORKS_ON C
WHERE C.EID=ID
AND C.PNO=B.PNO))
In Q3B, the outer nested query selects any WORKS_ON (B) tuples whose PNO is that of a project
controlled by department 5, if there is not a WORKS_ON (C) tuple with the same PNO and the
same EID as the ID of the EMPLOYEE tuple under consideration in the outer query. If no such tuple
exists, we select the EMPLOYEE tuple. The form of Q3B matches the following rephrasing of
Query 3: Select each employee such that there does not exist a project controlled by department 5
that the employee does not work on. There is another SQL function, UNIQUE (Q), which returns
TRUE if there are no duplicate tuples in the result of query Q; otherwise, it returns FALSE. This
can be used to test whether the result of a nested query is a set or a multiset.
We have seen several queries with a nested query in the WHERE clause. It is also possible to use
an explicit set of values in the WHERE clause, rather than a nested query. Such a set is enclosed
in parentheses in SQL.
QUERY 17 Retrieve the IDENTITY numbers of all employees who work on project numbers
1, 2, or 3.
Q17: SELECT DISTINCT EID
FROM WORKS_ON
WHERE PNO IN (1, 2, 3);
In SQL, it is possible to rename any attribute that appears in the result of a query by adding the
qualifier AS followed by the desired new name. Hence, the AS construct can be used to alias both
attribute and relation names, and it can be used in both the SELECT and FROM clauses. For
example, Q8A shows how query Q8 can be slightly changed to retrieve the last name of each
employee and his or her supervisor, while renaming the resulting attribute names as
EMPLOYEE_NAME and SUPERVISOR_NAME. The new names will appear as column headers
in the query result.
Q8A: SELECT E.LNAME AS EMPLOYEE_NAME, S.LNAME AS SUPERVISOR_NAME
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.SUPERID=S.ID
The concept of a joined table (or joined relation) was incorporated into SQL to permit users to
specify a table resulting from a join operation in the FROM clause of a query. This construct may
be easier to comprehend than mixing together all the select and join conditions in the WHERE
clause. For example, consider query Q1, which retrieves the name and address of every employee
who works for the ‘Research’ department. It may be easier first to specify the join of the
EMPLOYEE and DEPARTMENT relations, and then to select the desired tuples and attributes.
This can be written in SQL as in Q1A:
Q1A: SELECT FNAME, LNAME, ADDRESS
FROM (EMPLOYEE JOIN DEPARTMENT ON DNO=DNUMBER)
WHERE DNAME='Research'
The FROM clause in Q1A contains a single joined table. The attributes of such a table are all the
attributes of the first table, EMPLOYEE, followed by all the attributes of the second table,
DEPARTMENT. The concept of a joined table also allows the user to specify different types of
join, such as NATURAL JOIN and various types of OUTER JOIN. In a NATURAL JOIN on two
relations R and S, no join condition is specified; an implicit equijoin condition for each pair of
attributes with the same name from R and S is created. If the names of the join attributes are not
the same in the base relations, it is possible to rename the attributes so that they match, and then
to apply NATURAL JOIN. In this case, the AS construct can be used to rename a relation and all
its attributes in the FROM clause. This is illustrated in Q1B, where the DEPARTMENT relation
is renamed as DEPT and its attributes are renamed as DNAME, DNO (to match the name of the
desired join attribute DNO in EMPLOYEE), MID, and MSDATE. The implied join condition for
this NATURAL JOIN is EMPLOYEE.DNO = DEPT.DNO, because this is the only pair of
attributes with the same name after renaming.
Q1B: SELECT FNAME, LNAME, ADDRESS
FROM (EMPLOYEE NATURAL JOIN (DEPARTMENT AS DEPT (DNAME, DNO, MID, MSDATE)))
WHERE DNAME='Research'
The default type of join in a joined table is an inner join, where a tuple is included in the result
only if a matching tuple exists in the other relation. For example, in query Q8A, only employees
that have a supervisor are included in the result; an EMPLOYEE tuple whose value for SUPERID
is NULL is excluded. If the user requires that all employees be included, an OUTER JOIN must
be used explicitly. In SQL, this is handled by explicitly specifying the OUTER JOIN in a joined
table, as illustrated in Q8B:
Q8B: SELECT E.LNAME AS EMPLOYEE_NAME, S.LNAME AS SUPERVISOR_NAME
FROM (EMPLOYEE AS E LEFT OUTER JOIN EMPLOYEE AS S ON
E.SUPERID=S.ID)
The options available for specifying joined tables in SQL include INNER JOIN (same as JOIN),
LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN. In the latter three options,
the keyword OUTER may be omitted. If the join attributes have the same name, one may also
specify the natural join variation of outer joins by using the keyword NATURAL before the
operation (for example, NATURAL LEFT OUTER JOIN). The keyword CROSS JOIN is used to
specify the CARTESIAN PRODUCT operation.
It is also possible to nest join specifications; that is, one of the tables in a join may itself be a joined
table. This is illustrated by Q2A, which is a different way of specifying query Q2, using the
concept of a joined table:
Q2A: SELECT PNUMBER, DNUM, LNAME, ADDRESS, BDATE
FROM ((PROJECT JOIN DEPARTMENT ON DNUM=DNUMBER)
JOIN EMPLOYEE ON MGRID=ID)
WHERE PLOCATION='Stafford';
Because grouping and aggregation are required in many database applications, SQL has features
that incorporate these concepts. A number of built-in functions exist: COUNT, SUM, MAX, MIN,
and AVG. The COUNT function returns the number of tuples or values as specified in a query.
The functions SUM, MAX, MIN, and AVG are applied to a set or multiset of numeric values and
return, respectively, the sum, maximum value, minimum value, and average (mean) of those
values. These functions can be used in the SELECT clause or in a HAVING clause (which we
introduce later). The functions MAX and MIN can also be used with attributes that have
nonnumeric domains if the domain values have a total ordering among one another. We illustrate
the use of these functions with example queries.
QUERY 19 Find the sum of the salaries of all employees, the maximum salary, the minimum
salary, and the average salary.
Q19: SELECT SUM (SALARY), MAX (SALARY), MIN (SALARY), AVG (SALARY)
FROM EMPLOYEE;
If we want to get the preceding function values for employees of a specific department, say the
'Research' department, we can write Query 20, where the EMPLOYEE tuples are restricted by the
WHERE clause to those employees who work for the 'Research' department.
QUERY 20 Find the sum of the salaries of all employees of the ‘Research’ department, as well as
the maximum salary, the minimum salary, and the average salary in this department.
Q20: SELECT SUM (SALARY), MAX (SALARY), MIN (SALARY), AVG (SALARY)
FROM (EMPLOYEE JOIN DEPARTMENT ON DNO=DNUMBER)
WHERE DNAME='Research';
QUERIES 21 AND 22 Retrieve the total number of employees in the company (Q21) and the
number of employees in the ‘Research’ department (Q22).
Q21: SELECT COUNT (*)
FROM EMPLOYEE;
Q22: SELECT COUNT (*)
FROM EMPLOYEE, DEPARTMENT
WHERE DNO=DNUMBER AND DNAME='Research';
Here the asterisk (*) refers to the rows (tuples), so COUNT (*) returns the number of rows in the
result of the query. We may also use the COUNT function to count values in a column rather than
tuples, as in the next example.
QUERY 23 Count the number of distinct salary values in the database.
Q23: SELECT COUNT (DISTINCT SALARY)
FROM EMPLOYEE;
If we write COUNT (SALARY) instead of COUNT (DISTINCT SALARY) in Q23, then duplicate
values will not be eliminated. However, any tuples with NULL for SALARY will not be counted.
In general, NULL values are discarded when aggregate functions are applied to a particular column
(attribute).
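To make this NULL behavior concrete, the following sketch contrasts the three COUNT variants on the same table (the output column aliases are illustrative only):
SELECT COUNT (*) AS ROW_COUNT,
COUNT (SALARY) AS NON_NULL_SALARIES,
COUNT (DISTINCT SALARY) AS DISTINCT_SALARIES
FROM EMPLOYEE;
A row whose SALARY is NULL contributes to ROW_COUNT but to neither of the other two counts.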
The preceding examples summarize a whole relation (Q19, Q21, Q23) or a selected subset of
tuples (Q20, Q22), and hence all produce single tuples or single values. They illustrate how
functions are applied to retrieve a summary value or summary tuple from the database. These
functions can also be used in selection conditions involving nested queries. We can specify a
correlated nested query with an aggregate function. For example, to retrieve the names of all
employees who have two or more dependents (Query 5), we can write:
Q5: SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE (SELECT COUNT (*)
FROM DEPENDENT
WHERE ID=EID) >= 2;
The correlated nested query counts the number of dependents that each employee has; if this is
greater than or equal to two, the employee tuple is selected.
In many cases we want to apply the aggregate functions to subgroups of tuples in a relation, where
the subgroups are based on some attribute values. For example, we may want to find the average
salary of employees in each department or the number of employees who work on each project. In
these cases we need to partition the relation into non overlapping subsets (or groups) of tuples.
Each group (partition) will consist of the tuples that have the same value of some attribute(s), called
the grouping attribute(s). We can then apply the function to each such group independently. SQL
has a GROUP BY clause for this purpose.
The GROUP BY clause specifies the grouping attributes, which should also
appear in the SELECT clause, so that the value resulting from applying each aggregate function
to a group of tuples appears along with the value of the grouping attribute(s).
QUERY 24 For each department, retrieve the department number, the number of employees in the
department, and their average salary.
Q24: SELECT DNO, COUNT (*), AVG (SALARY)
FROM EMPLOYEE
GROUP BY DNO;
In Q24, the EMPLOYEE tuples are partitioned into groups, each group having the same value for
the grouping attribute DNO. The COUNT and AVG functions are applied to each such group of
tuples. Notice that the SELECT clause includes only the grouping attribute and the functions to be
applied on each group of tuples. If NULLs exist in the grouping attribute, then a separate group is
created for all tuples with a NULL value in the grouping attribute. For example, if the
EMPLOYEE table had some tuples that had NULL for the grouping attribute DNO, there would
be a separate group for those tuples in the result of Q24.
QUERY 25 For each project, retrieve the project number, the project name, and the number of
employees who work on that project.
Q25: SELECT PNUMBER, PNAME, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE PNUMBER=PNO
GROUP BY PNUMBER, PNAME;
Q25 shows how we can use a join condition in conjunction with GROUP BY. In this case, the
grouping and functions are applied after the joining of the two relations. Sometimes we want to
retrieve the values of these functions only for groups that satisfy certain conditions. For example,
suppose that we want to modify Query 25 so that only projects with more than two employees
appear in the result. SQL provides a HAVING clause, which can appear in conjunction with a
GROUP BY clause, for this purpose. HAVING provides a condition on the group of tuples
associated with each value of the grouping attributes. Only the groups that satisfy the condition
are retrieved in the result of the query. This is illustrated by Query 26.
Q26: SELECT PNUMBER, PNAME, COUNT (*)
FROM PROJECT, WORKS_ON
WHERE PNUMBER=PNO
GROUP BY PNUMBER, PNAME
HAVING COUNT (*) > 2;
Notice that, while selection conditions in the WHERE clause limit the tuples to which functions
are applied, the HAVING clause serves to choose whole groups.
QUERY 27 For each project, retrieve the project number, the project name, and the number of
employees from department 5 who work on the project.
Q27: SELECT PNUMBER, PNAME, COUNT (*)
FROM PROJECT, WORKS_ON, EMPLOYEE
WHERE PNUMBER=PNO AND EID=ID AND DNO=5
GROUP BY PNUMBER, PNAME;
Here we restrict the tuples in the relation (and hence the tuples in each group) to those that satisfy
the condition specified in the WHERE clause, namely, that they work in department number 5.
Notice that we must be extra careful when two different conditions apply (one to the function in
the SELECT clause and another to the function in the HAVING clause). For example, suppose
that we want to count the total number of employees whose salaries exceed $40,000 in each
department, but only for departments where more than five employees work. Here, the condition
on the number of employees applies to the department as a whole, while the salary condition
applies to individual tuples. A naive attempt would be:
SELECT DNAME, COUNT (*)
FROM DEPARTMENT, EMPLOYEE
WHERE DNUMBER=DNO AND SALARY>40000
GROUP BY DNAME
HAVING COUNT (*) > 5;
This is incorrect because it will select only departments that have more than five employees who
each earn more than $40,000. The rule is that the WHERE clause is executed first, to select
individual tuples; the HAVING clause is applied later, to select individual groups of tuples. Hence,
the tuples are already restricted to employees who earn more than $40,000, before the function in
the HAVING clause is applied. One way to write this query correctly is to use a nested query, as
shown in Query 28.
QUERY28 For each department that has more than five employees, retrieve the department
number and the number of its employees who are making more than $40,000.
Q28: SELECT DNUMBER, COUNT (*)
FROM DEPARTMENT, EMPLOYEE
WHERE DNUMBER=DNO AND SALARY>40000
AND DNO IN (SELECT DNO
FROM EMPLOYEE
GROUP BY DNO
HAVING COUNT (*) > 5)
GROUP BY DNUMBER;
CHAPTER SEVEN
7. Advanced Database Concepts
Database security and integrity are about protecting the database from becoming inconsistent and
from being disrupted; we can also call such events database misuse. Database security encompasses
hardware, software, people, and data. Database misuse can be intentional or accidental, and
accidental misuse is easier to cope with than intentional misuse.
Most systems implement good database integrity to protect the system from accidental misuse,
while there are many computer-based measures that protect the system from intentional misuse;
these are termed database security measures.
Database security addresses issues such as:
Legal, ethical, and social issues regarding the right to access information
Physical control
Policy issues regarding privacy of individuals at the enterprise and national level
Operational considerations on the techniques used (passwords, etc.)
System-level security, including operating system and hardware control
Security levels and security policies at the enterprise level
Database security comprises the mechanisms that protect the database against intentional or
accidental threats. A threat is any situation or event, whether intentional or accidental, that may
adversely affect a system and consequently the organization. A threat may be caused by a situation
or event involving a person, an action, or a circumstance that is likely to bring harm to the
organization.
Security can be enforced at several levels, for example:
1. Physical Level: the site containing the computer system should be physically secured. The
backup systems should also be physically protected from access except for authorized users.
Even though we can have different levels of security and authorization on data objects and users,
who accesses which data is a policy matter rather than a technical one. An organization needs to
identify the types of threat it may be subjected to and initiate appropriate plans and
countermeasures, bearing in mind the costs of implementing them.
Authorization
Authorization is the granting of a right or privilege that enables a subject to have legitimate access
to a system or a system's object. The process of authorization involves authentication of subjects
(i.e., a user or program) requesting access to objects (i.e., a database table, view, procedure, trigger,
or any other object that can be created within the system). Authorization controls, also known as
access controls, can be built into the software, and govern not only what system or object a specified
user can access, but also what the user may do with it. Common forms of authorization on a data
object include:
1. Read Authorization: the user with this privilege is allowed only to read the content of the data object.
2. Insert Authorization: the user with this privilege is allowed only to insert new records or items to the
data object.
3. Update Authorization: users with this privilege are allowed to modify content of attributes but are not
authorized to delete the records.
4. Delete Authorization: users with this privilege are only allowed to delete a record and not anything else.
Different users, depending on the power of the user, can have one or the combination of the
above forms of authorization on different data objects.
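These forms of authorization map naturally onto SQL privileges. As a minimal sketch (the table name EMPLOYEE follows earlier chapters; the user names clerk1 and hr_manager are hypothetical):
GRANT SELECT ON EMPLOYEE TO clerk1;                      -- read authorization only
GRANT SELECT, INSERT, UPDATE ON EMPLOYEE TO hr_manager;  -- read, insert, and update
REVOKE UPDATE ON EMPLOYEE FROM hr_manager;               -- privileges can later be withdrawn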
The database administrator is responsible for making the database as secure as possible. For
this, the DBA must hold more powerful privileges than any other user. The DBA provides
capabilities for database users while they access the content of the database, including:
1. Account Creation: involves creating different accounts for different USERS as well as USER
GROUPS.
2. Security Level Assignment: involves in assigning different users at different categories of
access levels.
3. Privilege Grant: involves giving different levels of privileges for different users and user
groups.
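As a sketch of these DBA actions in SQL (CREATE USER syntax varies between DBMSs; the names and password shown are hypothetical):
CREATE USER clerk1 IDENTIFIED BY 'initial_password';   -- account creation
CREATE ROLE payroll_staff;                             -- a user group modeled as a role
GRANT payroll_staff TO clerk1;                         -- assignment to an access category
GRANT SELECT, UPDATE ON EMPLOYEE TO payroll_staff;     -- privilege grant to the group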
Views
A view is the dynamic result of one or more relational operations on the base relations
to produce another relation. A view is a virtual relation that does not actually exist in the
database, but is produced upon request by a particular user. It is a mechanism that provides a
powerful and flexible security mechanism by hiding parts of the database from certain users.
Therefore, using a view is more restrictive than simply having certain privileges granted to a user
on the base relation(s).
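For example, a view can expose only the non-sensitive attributes of EMPLOYEE to a particular user (a sketch; the view name EMP_PUBLIC and the user name clerk1 are hypothetical):
CREATE VIEW EMP_PUBLIC AS
SELECT FNAME, LNAME, DNO
FROM EMPLOYEE;
GRANT SELECT ON EMP_PUBLIC TO clerk1;  -- clerk1 queries the view and never sees SALARY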
Integrity
Integrity constraints contribute to maintaining a secure database system by preventing data from
becoming invalid and hence giving misleading or incorrect results:
Entity integrity: The first integrity rule applies to the primary keys of base relations. In a base
relation, no attribute of a primary key can be null.
Referential integrity: The second integrity rule applies to foreign keys. If a foreign key exists in a
relation, either the foreign key value must match a candidate key value of some tuple in its home
relation or the foreign key value must be wholly null.
Domain Integrity
Key constraints
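A minimal sketch of how these constraints are declared in SQL (the column set is abbreviated from the EMPLOYEE/DEPARTMENT schema used earlier; the data types are illustrative):
CREATE TABLE EMPLOYEE
( ID     CHAR(9)       NOT NULL,                -- entity integrity: key attributes not null
  FNAME  VARCHAR(15)   NOT NULL,
  SALARY DECIMAL(10,2) CHECK (SALARY >= 0),     -- domain integrity
  DNO    INT,
  PRIMARY KEY (ID),                             -- key constraint
  FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER) );  -- referential integrity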
Backup is the process of periodically taking a copy of the database and log file (and possibly
programs) on to an offline storage medium. A DBMS should provide backup facilities to assist
with the recovery of a database after failure.
Journaling is the process of keeping and maintaining a log file (or journal) of all changes made
to the database to enable recovery to be undertaken effectively in the event of a failure. The
advantage of journaling is that, in the event of a failure, the database can be recovered to its last
known consistent state using a backup copy of the database and the information contained in the
log file.
If no journaling is enabled on a failed system, the only means of recovery is to restore the database
using the latest backup version of the database. However, without a log file, any changes made
after the last backup to the database will be lost.
Encryption
Encryption is the process of encoding the data by a special algorithm that renders the data
unreadable by any program without the decryption key. If a database system holds particularly
sensitive data, it may be deemed necessary to encode it as a precaution against possible external
threats or attempts to access it.
The DBMS can access data after decoding it, although there is a degradation in performance
because of the time taken to decode it.
Encryption also protects data transmitted over communication lines. To transmit data securely over
insecure networks requires the use of a cryptosystem, which includes:
1. An encryption key to encode the data (plaintext)
2. An encryption algorithm that, with the encryption key, transforms the plaintext into ciphertext
3. A decryption key to decode the ciphertext
4. A decryption algorithm that, with the decryption key, transforms the ciphertext back into the
original plaintext
Authentication
All users of the database will have different access levels and permission for different data objects,
and authentication is the process of checking whether the user is the one with the privilege for the
access level. It is the process of checking the users are who they say they are.
Any database access request will have the following three major components:
1. Requested Operation: the type of operation to be performed (e.g., read, insert, update, delete)
2. Requested Object: the data object on which the operation is to be performed
3. Requesting User: the user who initiates the request
The database should be able to check all three components before processing any request.
The checking is performed by the security subsystem of the DBMS.
A distributed database management system (D–DBMS) is the software that manages the DDB
and provides an access mechanism that makes this distribution transparent to the users.
Operational systems were never primarily designed to support business decision making, so
using such systems may never be an easy solution. The legacy is that a typical organization may
have numerous operational systems with overlapping and sometimes contradictory definitions,
such as data types. The challenge for an organization is to turn its archives of data into a source of
knowledge, so that a single integrated view of the organization's data is presented to the user.
The concept of a data warehouse was deemed the solution to meet the requirements of a system
capable of supporting decision making, receiving data from multiple operational data sources.
The data held in a data warehouse is described as being subject-oriented, integrated, time-variant,
and nonvolatile (Inmon, 1993).
Subject-oriented, as the warehouse is organized around the major subjects of the organization
(such as customers, products, and sales) rather than the major application areas (such as customer
invoicing, stock control, and product sales). This is reflected in the need to store decision-support
data rather than application oriented data.
Integrated, because of the coming together of source data from different organization-wide
applications systems. The source data is often inconsistent, using, for example, different data types
and/or formats. The integrated data source must be made consistent to present a unified view of
the data to the users.
Time-variant, because data in the warehouse is accurate and valid only at some point in time or
over some time interval.
Nonvolatile, as the data is not updated in real time but is refreshed from operational systems on a
regular basis. New data is always added as a supplement to the database, rather than a replacement.
The source of operational data for the data warehouse is supplied from mainframes, proprietary
file systems, private workstations and servers, and external systems such as the Internet.
An operational data store (ODS) is a repository of current and integrated operational data used for
analysis. It is often structured and supplied with data in the same way as the data warehouse, but
may in fact simply act as a staging area for data to be moved into the warehouse.
The warehouse manager performs all the operations associated with the management of the data,
such as the transformation and merging of source data; creation of indexes and views on base
tables; generation of aggregations, and backing up and archiving data. The query manager
performs all the operations associated with the management of user queries. Detailed data is not
stored online but is made available by summarizing the data to the next level of detail. However,
on a regular basis, detailed data is added to the warehouse to supplement the summarized data.
The warehouse stores all the predefined lightly and highly summarized data generated by the
warehouse manager. The purpose of summary information is to speed up the performance of
queries. Although there are increased operational costs associated with initially summarizing the
data, this cost is offset by removing the requirement to continually perform summary operations
(such as sorting or grouping) in answering user queries. The summary data is updated
continuously as new data is loaded into the warehouse. Detailed and summarized data is stored
offline for the purposes of archiving and backup. Metadata (data about data) definitions are used
by all the processes in the warehouse, including the extraction and loading processes; the
warehouse management process; and as part of the query management process.
The principal purpose of data warehousing is to provide information to business users for strategic
decision making. These users interact with the warehouse using end-user access tools. The data
warehouse must efficiently support ad hoc and routine analysis as well as more complex data
analysis. The types of end-user access tools typically include reporting and query tools, application
development tools, executive information system (EIS) tools, online analytical processing (OLAP)
tools, and data mining tools.
There are numerous definitions of what data mining is, ranging from the broadest definitions of
any tool that enables users to access directly large amounts of data to more specific definitions
such as tools and applications that perform statistical analysis on the data. In this chapter, we use
a more focused definition of data mining by Simoudis (1996).
Data mining is the process of extracting valid, previously unknown, comprehensible, and
actionable information from large databases and using it to make crucial business decisions.
Data mining is concerned with the analysis of data and the use of software techniques for finding
hidden and unexpected patterns and relationships in sets of data. The focus of data mining is to
reveal information that is hidden and unexpected, as there is less value in finding patterns and
relationships that are already intuitive. Examining the underlying rules and features in the data
identifies the patterns and relationships.
Data mining analysis tends to work from the data up, and the techniques that produce the most
accurate results normally require large volumes of data to deliver reliable conclusions. The process
of analysis starts by developing an optimal representation of the structure of sample data, during
which time knowledge is acquired. This knowledge is then extended to larger sets of data, working
on the assumption that the larger data set has a structure similar to the sample data.
Data mining can provide huge paybacks for companies who have made a significant investment
in data warehousing. Data mining is used in a wide range of industries including retail/marketing,
banking, insurance, and medicine.
Typical data mining techniques include link analysis, demographic clustering, neural clustering,
and similar time sequence discovery.
C. Set Difference (or MINUS) Operation
The result of this operation, denoted by R − S, is a relation that includes all tuples that are in R
but not in S. The two operands must be "type compatible".
Note: both union and intersection are commutative operations; i.e., R ∪ S = S ∪ R and R ∩ S = S ∩ R.
Both union and intersection can be treated as n-ary operations applicable to any number of relations,
as both are associative operations; that is, R ∪ (S ∪ T) = (R ∪ S) ∪ T and (R ∩ S) ∩ T = R ∩ (S ∩ T).
The minus operation is not commutative; that is, in general, R − S ≠ S − R.
D. CARTESIAN (or cross product) Operation
This operation is used to combine tuples from two relations in a combinatorial fashion.
In general, the result of R(A1, A2,….An) x S(B1, B2,….,Bm) is a relation Q with degree n+m
attributes Q(A1, A2,….An, B1, B2,….,Bm), in that order.
As an example of how the formulation of a query affects its cost, suppose the Staff relation holds
1000 tuples and the Branch relation holds 50 tuples, and compare three equivalent formulations of
the same selection-and-join query.
Query 1 (Worst)
Requires (1000+50) disk accesses to read the Staff and Branch relations
Creates a temporary relation for the Cartesian product, (1000*50) tuples, which must be written to disk
Requires (1000*50) disk accesses to read the temporary relation back and test the predicate
Total Work = (1000+50) + 2*(1000*50) = 101,050 I/O operations
Query 2 (Better)
Again requires (1000+50) disk accesses to read Staff and Branch
Joins Staff and Branch on branchNo, giving 1000 tuples (one branch per member of staff), which
must be written to disk
Requires (1000) disk accesses to read the joined relation back and check the predicate
Total Work = (1000+50) + 2*(1000) = 3,050 I/O operations
This is roughly a 33-fold (about 3300%) improvement over Query 1.
Query 3 (Best)
Applies the selection predicates to Staff and Branch first, and only then joins the much smaller
results, reducing the I/O cost still further.
Query processing can be divided into four main categories: decomposition, optimization,
execution, and code generation.
1. Query Decomposition
It is the process of transforming a high-level query into a relational algebra query and checking
that the query is syntactically and semantically correct.
It consists of parsing and validation.
Typical stages in query decomposition are:
1. Analysis:
Lexical and syntactical analysis of the query (correctness), based on attributes, data types, etc.
A query tree will be built for the query, containing:
a leaf node for each base relation,
one or more non-leaf nodes for relations produced by relational algebra operations, and
a root node for the result of the query.
The sequence of operation is from the leaves to the root, e.g., SELECT * FROM Catalog c, Author a
WHERE a.authorid = c.authorid AND c.price > 200 AND a.country = 'USA'.
2. Normalization:
Convert the query into a normalized form.
The WHERE predicate will be converted to Conjunctive (∧) or Disjunctive (∨) Normal Form.
3. Semantic Analysis
Its purpose is to reject normalized queries that are incorrectly formulated or contradictory.
A query is incorrect if its components do not contribute to the generation of the result.
5. Commutativity of THETA JOIN / Cartesian product: R × S is equivalent to S × R. This also
holds for Equi-join and Natural join.
b. If the predicate is of the form c1 ∧ c2, where c1 involves only attributes of R and c2
involves only attributes of S, then the selection and theta join operations commute.
8. Commutativity of the set operations: UNION and INTERSECTION are commutative, but SET
DIFFERENCE is not.
Heuristic approach will be implemented by using the above transformation rules in the following sequence
or steps. Sequence for applying Transformation Rules
1. Use Rule 1: Cascade of SELECTION
2. Use:
Rule 2: Commutativity of SELECTION
Rule 4: Commuting SELECTION with PROJECTION
Rule 6: Commuting SELECTION with JOIN and CARTESIAN
Rule 10: Commuting SELECTION with SET OPERATIONS
3. Use: Rule 9 Associativity of binary operations (JOIN, CARTESIAN, UNION and
INTERSECTION). Rearrange nodes by making the most restrictive operations to be performed
first (moving it as far down the tree as possible)
4. Combine a CARTESIAN product with a subsequent SELECTION operation into a JOIN
5. Use:
Rule 3: Cascade of PROJECTION
Rule 4: Commuting PROJECTION with SELECTION
Rule 7: Commuting PROJECTION with JOIN and CARTESIAN
Rule 11: Commuting PROJECTION with UNION
Heuristic Query Tree Optimization
It applies rules that utilize equivalence expressions to transform the initial tree into a final,
optimized query tree.
Process for heuristics optimization
1. The parser of a high-level query generates an initial internal representation;
Query graph:
A graph data structure that corresponds to a relational calculus expression. It does not indicate an
order in which to perform the operations. Nodes represent relations; ovals represent constant
values.
Fig. 3: Rearrangement of leaf nodes using the commutativity and associativity of binary operations.
Two restrictions are enforced on data access based on the subject/object classifications:
A subject S is not allowed read access to an object O unless class(S) ≥ class (O).
A subject S is not allowed to write an object O unless class(S) ≤ class (O).
To incorporate multilevel security notions into the relational database model, it is common to
consider attribute values and rows as data objects. Hence, each attribute A is associated with a
classification attribute C in the schema.
In addition, in some models, a tuple classification attribute TC is added to the relation attributes
to provide a classification for each tuple as a whole.
Hence, a multilevel relation schema R with n attributes would be represented as
R(A1,C1,A2,C2, …, An,Cn,TC) where each Ci represents the classification attribute
associated with attribute Ai.
The value of the TC attribute in each tuple t – which is the highest of all attribute classification
values within t – provides a general classification for the tuple itself.
Whereas, each Ci provides a finer security classification for each attribute value within the tuple.
A multilevel relation will appear to contain different data to subjects (users) with different
clearance levels.
A user with a security clearance S would see the relation as stored, since all row
classifications are less than or equal to S, as shown in (a).
However, a user with security clearance C would not be allowed to see the values for the salary of
Brown and the job performance of Smith, since they have a higher classification, as shown in (b).
For a user with security clearance U, filtering introduces null values for attribute values whose
security classification is higher than the user's security clearance, as shown in (c).
A user with security clearance C may request an update of the job performance value of Smith
to 'Excellent', and the view will allow it. However, the user should not be allowed to
overwrite the existing value at the higher classification level.
Solution: create a polyinstantiated tuple for the Smith row at the lower classification level C, as
shown in (d).
A Transaction
Logical unit of database processing that includes one or more access operations (read -
retrieval, write - insert or update, delete). Examples include ATM transactions, credit
card approvals, flight reservations, hotel check-in, phone calls, supermarket scanning,
academic registration and billing.
Collections of operations that form a single logical unit of work are called transactions.
Locks
There are various modes in which a data item may be locked. The two lock modes most often
used are:
1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on item Q, then
Ti can read, but cannot write, Q.
2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on item
Q, then Ti can both read and write Q.
Every transaction requests a lock in an appropriate mode on data item Q, depending on the types of
operations that it will perform on Q. The transaction makes the request to the concurrency-control
manager.
The transaction can proceed with the operation only after the concurrency-control manager grants
the lock to the transaction. The use of these two lock modes allows multiple transactions to read
a data item but limits write access to just one transaction at a time.
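Many SQL systems expose these modes explicitly. A minimal sketch (syntax varies by DBMS; SELECT ... FOR SHARE / FOR UPDATE is PostgreSQL-style syntax, and the ACCOUNT table and ACCNO column are hypothetical):
BEGIN;
SELECT BALANCE FROM ACCOUNT WHERE ACCNO = 101 FOR SHARE;   -- shared (S) lock: others may still read
SELECT BALANCE FROM ACCOUNT WHERE ACCNO = 101 FOR UPDATE;  -- upgrade to an exclusive (X) lock
UPDATE ACCOUNT SET BALANCE = BALANCE - 500 WHERE ACCNO = 101;
COMMIT;  -- locks are released at the end of the transaction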
Examples: Let A and B represent arbitrary lock modes. Suppose that a transaction Ti requests a
lock of mode A on item Q on which transaction Tj (Ti ≠ Tj) currently holds a lock of mode B. If
transaction Ti can be granted a lock on Q immediately, in spite of the presence of the mode B lock,
then we say mode A is compatible with mode B. Such a function can be represented conveniently
by a matrix.
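For the two modes above, the compatibility matrix is as follows (true means the requested lock
can be granted immediately):
          S        X
   S     true    false
   X     false   false
Shared mode is compatible only with shared mode; exclusive mode is compatible with nothing.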
Granting of Locks
When a transaction requests a lock on a data item in a particular mode, and no other transaction
has a lock on the same data item in a conflicting mode, the lock can be granted. However, care
must be taken to avoid the following scenario.
Suppose a transaction T2 has a shared-mode lock on a data item, and transaction T1 requests
an exclusive-mode lock on the data item. Clearly, T1 has to wait for T2 to release the shared-
mode lock.
Meanwhile, a transaction T3 may request a shared-mode lock on the same data item. The lock
request is compatible with the lock granted to T2, so T3 may be granted the shared-mode lock.
If a stream of such transactions keeps acquiring shared-mode locks, T1 may wait indefinitely for
its exclusive-mode lock; that is, T1 may be starved.
Timestamp-Based Protocols
The locking protocols that we have described thus far determine the order between every pair of
conflicting transactions at execution time by the first lock that both members of the pair request that
involves incompatible modes.
Another method for determining the serializability order is to select an ordering among
transactions in advance. The most common method for doing so is to use a timestamp-ordering
scheme.
Timestamps
With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by TS(Ti).
This timestamp is assigned by the database system before the transaction Ti starts execution.
If a transaction Ti has been assigned timestamp TS(Ti), and a new transaction Tj enters the system,
then TS(Ti) < TS(Tj).
There are two simple methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a transaction’s timestamp is equal
to the value of the clock when the transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp has been assigned; that is,
a transaction’s timestamp is equal to the value of the counter when the transaction enters the
system.
The timestamps of the transactions determine the serializability order. Thus, if TS(Ti) < TS(Tj),
then the system must ensure that the produced schedule is equivalent to a serial schedule in which
transaction Ti appears before transaction Tj.
To implement this scheme, we associate with each data item Q two timestamp values:
1. W-timestamp(Q) denotes the largest timestamp of any transaction that executed write(Q) successfully.
2. R-timestamp(Q) denotes the largest timestamp of any transaction that executed read(Q) successfully.
These timestamps are updated whenever a new read(Q) or write(Q) instruction is executed.
Starvation
Starvation occurs when a particular transaction consistently waits or restarted and
never gets a chance to proceed further while other transaction continues normally
This may occur, if the waiting method for item locking:
Gave priority for some transaction over others
Problem in Victim selection algorithm- it is possible that the same transaction
may consistently be selected as victim and rolled-back .example In Wound-
Wait
Solution
FIFO
Allow for transaction that wait for a longer time
Give higher priority for transaction that have been aborted for many time
Deferred Update
The deferred update techniques do not physically update the database on disk until after a
transaction reaches its commit point; then the updates are recorded in the database.
Before reaching commit, all transaction updates are recorded in the local transaction
workspace or in the main memory buffers that the DBMS maintains.
Before commit, the updates are recorded persistently in the log, and then after commit,
the updates are written to the database on disk.
If a transaction fails before reaching its commit point, it will not have changed the database in any
way, so UNDO is not needed.
It may be necessary to REDO the effect of the operations of a committed transaction from the
log, because their effect may not yet have been recorded in the database on disk. Hence, deferred
update is also known as the NO-UNDO/REDO algorithm.
Immediate Update
In the immediate update techniques, the database may be updated by some operations of a
transaction before the transaction reaches its commit point.
However, these operations must also be recorded in the log on disk by force-writing before
they are applied to the database on disk, making recovery still possible.
If a transaction fails after recording some changes in the database on disk but before reaching its
commit point, the effect of its operations on the database must be undone; that is, the transaction
must be rolled back.
In the general case of immediate update, both undo and redo may be required during recovery.
This technique, known as the UNDO/REDO algorithm, requires both operations during recovery,
and is used most often in practice.
A variation of the algorithm where all updates are required to be recorded in the database on disk
before a transaction commits requires undo only, so it is known as the UNDO/NO-REDO
algorithm.
Caching (Buffering) of Disk Blocks
Typically, multiple disk pages that include the data items to be updated are cached into main
memory buffers and then updated in memory before being written back to disk.
The caching of disk pages is traditionally an operating system function, but because of its
importance to the efficiency of recovery procedures, it is handled by the DBMS, which calls
low-level operating system routines.
When the DBMS requests action on some item,
1. First it checks the cache directory to determine whether the disk page containing the
item is in the DBMS cache.
2. Second if it is not, the item must be located on disk, and the appropriate disk pages are
copied into the cache.
It may be necessary to replace (or flush) some of the cache buffers to make space available for the
new item.
Some page replacement strategy similar to these used in operating systems, such as least recently
used (LRU) or first-in-first out (FIFO), or a new strategy that is DBMS-specific can be used to
select the buffers for replacement, such as DBMIN or Least-Likely-to-Use.
Two main strategies can be employed when flushing a modified buffer back to disk.
1. In-place updating, writes the buffer to the same original disk location, thus
overwriting the old value of any changed data items on disk.
2. Shadowing, writes an updated buffer at a different disk location, so multiple versions
of data items can be maintained.
In general, the old value of the data item before updating is called the before image (BFIM),
and the new value after updating is called the after image (AFIM).
If shadowing is used, both the BFIM and the AFIM can be kept on disk; hence, it is not strictly
necessary to maintain a log for recovery.
Shadow Paging
This recovery scheme does not require the use of a log in a single-user environment.
Generally, a distributed database (DDB) is a collection of multiple logically interrelated databases
distributed over a computer network, and a distributed database management system (DDBMS)
is a software system that manages a distributed database while making the distribution
transparent to the user.
A collection of files stored at different nodes of a network, with interrelationships among them
maintained via hyperlinks, has become a common organization on the Internet, with files of Web
pages.
Examples of databases that may be managed by a DDBMS include operational databases,
analytical databases, and hypermedia databases.
One advantage of data distribution is improved performance, since data can be stored close to the
sites where it is most heavily used.
Data Fragmentation
There are two main approaches to storing a relation in a distributed database:
Horizontal Fragmentation
A horizontal fragment of a relation is a subset of the tuples in that relation.
The tuples that belong to the horizontal fragment are specified by a condition on one or more
attributes of the relation. Often, only a single attribute is involved.
For example, we may define three horizontal fragments on the EMPLOYEE relation: (DNO= 5),
(DNO= 4), and (DNO= 1)—each fragment contains the EMPLOYEE tuples working for a
particular department.
Similarly, we may define three horizontal fragments for the PROJECT relation, with the
conditions (DNUM= 5), (DNUM= 4), and (DNUM= 1)—each fragment contains the PROJECT
tuples controlled by a particular department.
Horizontal fragmentation divides a relation "horizontally" by grouping rows to create subsets
of tuples, where each subset has a certain logical meaning. These fragments can then be assigned
to different sites in the distributed system.
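As a sketch, the three EMPLOYEE fragments could be defined declaratively as follows (the fragment names are illustrative; real DDBMSs have their own fragmentation DDL):
CREATE VIEW EMP_D5 AS SELECT * FROM EMPLOYEE WHERE DNO = 5;
CREATE VIEW EMP_D4 AS SELECT * FROM EMPLOYEE WHERE DNO = 4;
CREATE VIEW EMP_D1 AS SELECT * FROM EMPLOYEE WHERE DNO = 1;
The original relation is recovered as the UNION of the fragments.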
Derived horizontal fragmentation applies the partitioning of a primary relation (DEPARTMENT
in our example) to other secondary relations (EMPLOYEE and PROJECT in our example), which
are related to the primary via a foreign key. This way, related data between the primary and the
secondary relations gets fragmented in the same way.
Vertical Fragmentation
Each site may not need all the attributes of a relation, which would indicate the need for a different
type of fragmentation.
Vertical fragmentation divides a relation "vertically" by columns.
A vertical fragment of a relation keeps only certain attributes of the relation.
For example, we may want to fragment the EMPLOYEE relation into two vertical fragments.
The first fragment includes personal information—NAME, BDATE, ADDRESS, and
SEX—and
The second includes work-related information—SSN, SALARY, SUPERSSN, DNO.
This vertical fragmentation is not quite proper because, if the two fragments are stored separately,
we cannot put the original employee tuples back together, since there is no common attribute
between the two fragments. It is therefore necessary to include the primary key (or some candidate
key) in every vertical fragment so that the full relation can be reconstructed by joining the fragments.
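A sketch of a proper vertical fragmentation that repeats the key in both fragments (the attribute names follow the example above, where SSN is the key):
CREATE VIEW EMP_PERSONAL AS SELECT SSN, NAME, BDATE, ADDRESS, SEX FROM EMPLOYEE;
CREATE VIEW EMP_WORK AS SELECT SSN, SALARY, SUPERSSN, DNO FROM EMPLOYEE;
EMPLOYEE can then be reconstructed as the natural join of the two fragments on SSN.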
Semantic Heterogeneity
Semantic heterogeneity occurs when there are differences in the meaning, interpretation, and
intended use of the same or related data.
Semantic heterogeneity among component database systems (DBSs) creates the biggest hurdle in
designing global schemas of heterogeneous databases.
The design autonomy of component DBSs refers to their freedom of choosing the following design
parameters, which in turn affect the eventual complexity of the FDBS:
A. The universe of discourse from which the data is drawn: For example, two customer accounts
databases in the federation may be from United States and Japan with entirely different sets of
attributes about customer accounts required by the accounting practices. Currency rate
fluctuations would also present a problem. Hence, relations in these two databases which have
identical names—CUSTOMER or ACCOUNT—may have some common and some entirely
distinct information.
B. Representation and naming: The representation and naming of data elements and the
structure of the data model may be pre specified for each local database.
C. The understanding, meaning, and subjective interpretation of data. This is a chief contributor
to semantic heterogeneity.
D. Transaction and policy constraints: These deal with serializability criteria, compensating
transactions, and other transaction policies.
E. Derivation of summaries: Aggregation, summarization, and other data-processing features
and operations supported by the system.
Communication autonomy of a component DBS refers to its ability to decide whether to
communicate with other component DBSs.
Execution autonomy refers to the ability of a component DBS to execute local operations without
interference from external operations by other component DBSs and its ability to decide the order
in which to execute them.
The association autonomy of a component DBS implies that it has the ability to decide whether
and how much to share its functionality (operations it supports) and resources (data it manages)
with other component DBSs.
The major challenge of designing FDBSs is to let component DBSs interoperate while still
providing the above types of autonomies to them.
Consider a query that retrieves, for each employee, the employee name and the name of the
department for which the employee works, where the EMPLOYEE relation (10,000 records of
100 bytes each, 1,000,000 bytes in total) is stored at site 1 and the DEPARTMENT relation
(100 records of 35 bytes each, 3,500 bytes in total) is stored at site 2.
The result of this query will include 10,000 records, assuming that every employee is related to a
department.
Suppose that each record in the query result is 40 bytes long.
The query is submitted at a distinct site 3, which is called the result site because the query result is
needed there.
Neither the EMPLOYEE nor the DEPARTMENT relations reside at site 3.
There are three simple strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform
the join at site 3. In this case a total of 1,000,000+ 3500 = 1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3.
The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000
bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to
site 3. In this case, 400,000 + 3,500 = 403,500 bytes must be transferred.
If minimizing the amount of data transferred is our optimization criterion, we should choose strategy 3. Now
consider another query Q: "For each department, retrieve the department name and the name of the
department manager." This can be stated as follows in the relational algebra:
π DNAME, FNAME, LNAME (DEPARTMENT ⋈ MGRID=ID EMPLOYEE)
An Overview of Client-Server Architecture and Its Relationship to Distributed Databases
Distributed database applications are being developed in the context of the client-server
architecture.
Exactly how to divide the DBMS functionality between client and server has not yet been
established.
Different approaches have been proposed. One possibility is to include the functionality
of a centralized DBMS at the server level.
A number of relational DBMS products have taken this approach, where an SQL server is provided
to the clients.
Each client must then formulate the appropriate SQL queries and provide the user interface and
programming language interface functions.
Since SQL is a relational standard, various SQL servers, possibly provided by different vendors,
can accept SQL commands.
The client may also refer to a data dictionary that includes information on the distribution of data
among the various SQL servers, as well as modules for decomposing a global query into a number
of local queries that can be executed at the various sites.
Interaction between client and server might proceed as follows during the processing of an SQL
query:
1. The client parses a user query and decomposes it into a number of independent site queries.
Each site query is sent to the appropriate server site.
2. Each server processes the local query and sends the resulting relation to the client site.
3. The client site combines the results of the sub queries to produce the result of the originally
submitted query.
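As a sketch of this interaction, suppose (hypothetically) that EMPLOYEE is stored at server site A and DEPARTMENT at server site B; in step 1 the client could decompose a global join into two site queries:
SELECT ID, FNAME, LNAME, DNO FROM EMPLOYEE;   -- sent to the SQL server at site A
SELECT DNUMBER, DNAME FROM DEPARTMENT;        -- sent to the SQL server at site B
In step 3 the client then combines the two returned relations locally by joining them on DNO = DNUMBER.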
In this approach, the SQL server has also been called a transaction server (or a database processor
(DP) or a back-end machine), whereas the client has been called an application processor (AP)(or
a front-end machine).
The interaction between client and server can be specified by the user at the client level or via a
specialized DBMS client module that is part of the DBMS package.
For example, the user may know what data is stored in each server, break down a query
request into site sub queries manually, and submit individual sub queries to the various sites.
The resulting tables may be combined explicitly by a further user query at the client level.
The alternative is to have the client module undertake these actions automatically.
In a typical DDBMS, it is customary to divide the software modules into three levels:
1. The server software is responsible for local data management at a site, much like
centralized DBMS software.
2. The client software is responsible for most of the distribution functions; it accesses data
distribution information from the DDBMS catalog and processes all requests that require
access to more than one site. It also handles all user interfaces.
3. The communications software (sometimes in conjunction with a distributed operating
system) provides the communication primitives that are used by the client to transmit
commands and data among the various sites as needed.
One possible function of the client is to hide the details of data distribution from the user; that is, it
enables the user to write global queries and transactions as though the database were centralized,
without having to specify the sites at which the data referenced in the query or transaction resides. This
property is called distribution transparency.
Access Methods
Point Access Methods (PAMs):
Index methods for 2 or 3-dimensional points (k-d trees, Z-ordering, grid-file)
Spatial Access Methods (SAMs):
Index methods for 2 or 3-dimensional regions and points (R-trees)
Mobile Database
A mobile database is either a stationary database that can be connected to by a mobile computing device
such as smart phones or PDAs over a mobile network, or a database which is actually carried by the
mobile device. This could be a list of contacts, price information, distance travelled, or any other
information.
Many applications require the ability to download information from an information
repository and operate on this information even when out of range or disconnected.
Mobile databases are highly concentrated in the retail and logistics industries. They are increasingly
being used in aviation and transportation industry.
Home Directory
An example of this is a mobile workforce. In this scenario, a user would require access to update
information from files in the home directories on a server or customer records from a database.
A home directory is a file system directory on a multi-user operating system containing files for a
given user of the system.
A cache could be maintained to hold recently accessed data and transactions so that they are not
lost due to connection failure.
Users might not require access to truly live data, only recently modified data, and uploading of
changes might be deferred until reconnected.
Bandwidth must be conserved (a common requirement on wireless networks that charge per
megabyte).
Mobile computing devices tend to have slower CPUs and limited battery life.
Users with multiple devices (i.e.: smartphone and tablet) may need to synchronize their devices to
a centralized data store. This may require application-specific automation features.
Users may change location geographically and on the network. Usually dealing with this, is left to
the operating system, which is responsible for maintaining the wireless network connection.
In Web technology, basic client-server architecture underlies all activities. Information is stored on
computers designated as Web servers in publicly accessible shared files encoded using Hyper Text Markup
Language (HTML).
A number of tools enable users to create Web pages formatted with HTML tags, freely mixed with
multimedia content from graphics to audio and even to video. A page has many interspersed hyperlinks
literally a link that enables a user to "browse" or move from one page to another across the Internet. This
ability has given a tremendous power to end users in searching and navigating related information often
across different continents.
Information on the Web is organized according to a Uniform Resource Locator (URL), something
similar to an address that provides the complete pathname of a file. The pathname consists of a
string of machine and directory names separated by slashes and ends in a filename. For example,
the table of contents of this
book is currently at the following URL:
A URL always begins with a hypertext transfer protocol (http), which is the protocol used by a Web
browser, a program that communicates with the Web server. Web browsers interpret and
present HTML documents to users. Popular Web browsers include the Internet Explorer of Microsoft and
the Netscape Navigator. A collection of HTML documents and other files accessible via the URL on a Web
server is called a Web site. In the above URL, "www.awl.com" may be called the Web site of Addison
Wesley Publishing.
1. Access using CGI scripts: The database server can be made to interact with the Web server via
CGI. The main disadvantage of this approach is that for each user request, the Web server must
start a new CGI process: each process makes a new connection with the DBMS and the Web server
must wait until the results are delivered to it. No efficiency is achieved by any grouping of multiple
users’ requests; moreover, the developer must keep the scripts in the CGI-bin subdirectories only,
which opens it to a possible breach of security. The fact that CGI has no language associated with
it but requires database developers to learn PERL or Tcl is also a drawback. Manageability of
scripts is another problem if the scripts are scattered everywhere.
2. Access using JDBC: JDBC is a set of Java classes developed by Sun Microsystems to allow access
to relational databases through the execution of SQL statements. It is a way of connecting with
databases, without any additional processes for each client request. Note that JDBC is a name
trademarked by Sun; it does not stand for Java Data Base connectivity as many believe. JDBC has
the capabilities to connect to a database, send SQL statements to a database, and retrieve the
results of a query using the Java classes Connection, Statement, and ResultSet, respectively. With
Java’s claimed platform independence, an application may run on any Java-capable browser, which
loads the Java code from the server and runs it on the client’s browser.
The Java code is DBMS transparent; the JDBC drivers for individual DBMSs on the server end
carry the task of interacting with that DBMS. If the JDBC driver is on the client, the application
runs on the client and its requests are communicated to the DBMS directly by the driver. For
standard SQL requests, many RDBMSs can be accessed this way. The drawback of using JDBC
is the prospect of executing Java through virtual machines with their inherent inefficiency. The JDBC
bridge to Open Database Connectivity (ODBC) remains another way of getting to the RDBMSs.
Besides CGI, other Web server vendors are launching their own middleware products for providing multiple database connectivity. These include the Internet Server API (ISAPI) from Microsoft and the Netscape API (NSAPI) from Netscape.
The Web Integration Option (WIO) of Informix includes a Web driver, a lightweight CGI process that is invoked when a URL request is received by the Web server. A unique session identifier is generated for each request, but the WIO application is persistent and does not terminate after each request. When the WIO application receives a request from the Web driver, it connects to the database and executes Web Explode, a function that executes queries within Web pages and formats the results as a Web page that goes back to the browser via the Web driver.
Informix HTML tag extensions allow Web authors to create applications that dynamically construct Web page templates from the Informix Dynamic Server and present them to end users. WIO also lets users create their own customized tags to perform specialized tasks. Thus, powerful applications can be designed without resorting to any programming or script development.
WIO supports applications developed in C, C++, and Java. This flexibility lets developers port existing applications to the Web or develop new applications in these languages. WIO is integrated with the Web server software and uses the native security mechanism of the Informix Dynamic Server. The open architecture of WIO allows the use of various Web browsers and servers.
In the Oracle Web server architecture, an HTTP daemon (a process that runs continuously) called the Web Listener runs on the server and listens for requests originating from clients. A static file (document) is retrieved from the file system of the server and displayed on the Web browser at the client. A request for a dynamic page is passed by the listener to a Web Request Broker (WRB), a multi-threaded dispatcher that works with software modules called cartridges. Cartridges perform specific functions on specific types of data and can communicate among themselves. Currently, cartridges are provided for PL/SQL, Java, and LiveHTML; customized cartridges may be provided as well. The Web server is fully integrated with PL/SQL, making it efficient and scalable.
Among the prominent applications of intranets and the WWW are databases to support electronic storefronts, parts and product catalogs, directories and schedules, newsstands, and bookstores. Electronic commerce, the purchasing of products and services electronically over the Internet, is likely to become a major application supported by such databases.
The future challenges of managing databases on the Web will be many, among them the following:
Web technology needs to be integrated with object technology. Currently, the Web can be viewed as a distributed object system, with HTML pages functioning as objects identified by URLs.
HTML functionality is too simple to support complex application requirements. As we saw, the Web Integration Option of Informix adds further tags to HTML. In general, additional facilities will be needed:
1. to make Web clients function as application front ends, integrating data from multiple heterogeneous databases;
2. to make Web clients present different views of the same data to different users; and
3. to make Web clients "intelligent" by providing additional data-mining functionality.
Web page content can be made more dynamic by adding more "behavior" to it as an object. In this respect:
1. client and server objects (HTML pages) can be made to interact;
2. Web pages can be treated as collections of programmable objects; and
3. client-side code can access these objects and manipulate them dynamically.
Support for a large number of clients, coupled with reasonable response times for queries against very large databases (several tens of gigabytes in size), will be a major challenge for Web databases.
XML defines a subset of SGML (the Standard Generalized Markup Language), allowing customization of
markup languages with application-specific tags. XML is rapidly gaining ground due to its extensibility in
defining new tags.
W3C’s Document Object Model (DOM) defines an object-oriented API for HTML or XML documents
presented by a Web client. W3C is also defining metadata modeling standards for describing Internet
resources.
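As a small illustration of the DOM idea, the following Java sketch parses an XML fragment that uses application-specific tags (the catalog, book, and title tags are made up for this example) with the standard javax.xml.parsers API and then walks the resulting object tree.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomSketch {
    public static void main(String[] args) throws Exception {
        // A hypothetical XML document with application-specific tags.
        String xml = "<catalog><book><title>Database Systems</title></book></catalog>";
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // The parsed document is an object tree; elements can be queried as objects.
        NodeList titles = doc.getElementsByTagName("title");
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getTextContent()); // Database Systems
        }
    }
}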
The technology for modeling information using the standards discussed above and for finding information on the Web is undergoing a major evolution. Overall, Web servers must gain robustness as a reliable technology for handling production-level databases that support 24x7 applications (24 hours a day, 7 days a week).
Security remains a critical problem for supporting applications involving financial and medical databases. Moreover, moving from existing database application environments to Web-based ones will require adequate support that lets users continue their current mode of operation, as well as an infrastructure for migrating data among systems without introducing inconsistencies. The traditional database functionality of querying and transaction processing must undergo appropriate modifications to support Web-based applications. One such area is mobile databases.
Data flows into a data warehouse from multiple heterogeneous data sources. Common data sources for a data warehouse include operational databases, flat files, and SAP and non-SAP applications. Typical applications of data warehousing include market analysis, fraud detection, customer retention (maintenance), production control, and science exploration.