DBMS Notes Unit 1
Purpose of Database System – Views of data – Data Models – Database System Architecture –
Introduction to relational databases – Relational Model – Keys – Relational Algebra – SQL
fundamentals – Advanced SQL features – Embedded SQL– Dynamic SQL
1.1 INTRODUCTION
A database is a collection of data elements (facts) stored in a computer in a systematic way, such
that a computer program can consult it to answer questions. The answers to those questions become
information that can be used to make decisions that may not be made with the data elements alone. The
computer program used to manage and query a database is known as a database management system
(DBMS).
So a database is a collection of related data that we can use for
Defining - specifying types of data
Constructing - storing & populating
Manipulating - querying, updating, reporting
A Database Management System (DBMS) is a software package to facilitate the creation and
maintenance of a computerized database.
Database management systems were developed to handle the difficulties of typical file-
processing systems supported by conventional operating systems.
Advantages of DBMS
A DBMS addresses the following problems of file-processing systems:
Data redundancy and inconsistency - Since different programmers create the files and
application programs over a long period, the various files are likely to have different structures and the
programs may be written in several programming languages. Moreover, the same information may be
duplicated in several places (files).
For example, if a student has a double major (say, music and mathematics) the address and
telephone number of that student may appear in a file that consists of student records of students in the
Music department and in a file that consists of student records of students in the Mathematics
department. This redundancy leads to higher storage and access cost. In addition, it may lead to data
inconsistency; that is, the various copies of the same data may no longer agree. For example, a
changed student address may be reflected in the Music department records but not elsewhere in the
system.
Difficulty in accessing data - Suppose that one of the university clerks needs to find out the
names of all students who live within a particular postal-code area. The clerk asks the data-processing
department to generate such a list. Because the designers of the original system did not anticipate this
request, there is no application program on hand to meet it. There is, however, an application program to
generate the list of all students.
Data isolation – Because data are scattered in various files, and files may be in different
formats, writing new application programs to retrieve the appropriate data is difficult.
Integrity problems - The data values stored in the database must satisfy certain types of
consistency constraints. Suppose the university maintains an account for each department, and records
the balance amount in each account. Suppose also that the university requires that the account balance
of a department may never fall below zero. Developers enforce these constraints in the system by
adding appropriate code in the various application programs. However, when new constraints are
added, it is difficult to change the programs to enforce them.
Atomicity of updates - A computer system, like any other device, is subject to failure. In
many applications, it is crucial that, if a failure occurs, the data be restored to the consistent state that
existed prior to the failure. Consider a program to transfer $500 from the account balance of
department A to the account balance of department B. If a system failure occurs during the execution
of the program, it is possible that the $500 was removed from the balance of department A, but was
not credited to the balance of department B, resulting in an inconsistent database state. Clearly, it is
essential to database consistency that either both the credit and debit occur, or that neither occur.
Concurrent access by multiple users – To improve the overall performance of the system,
many systems allow multiple users to update the data simultaneously. In such an environment,
interaction of concurrent updates is possible and may result in inconsistent data.
Security problems - Not every user of the database system should be able to access all the
data. For example, in a university, payroll personnel need to see only that part of the database that has
financial information. They do not need access to information about academic records. But, since
application programs are added to the file-processing system in an ad hoc manner, enforcing such
security constraints is difficult.
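The integrity and atomicity points above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module, not a prescribed implementation; the table name department, the starting balances, and the helper names make_db, transfer and balances are assumptions made for the example. The CHECK clause states the "balance never below zero" rule once, in the database, and the transaction makes the two updates of a transfer succeed or fail together.

```python
import sqlite3

def make_db():
    # Hypothetical schema: one account balance per department.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE department ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"  # integrity constraint
    )
    conn.executemany("INSERT INTO department VALUES (?, ?)", [("A", 800), ("B", 100)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE department SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE department SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would make the balance negative, so the credit is undone too.
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM department"))
```

With these assumed balances, a first transfer of 500 from A to B succeeds, while a second one is rejected and leaves both balances unchanged, because the credit to B is rolled back together with the failed debit.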
Disadvantages of DBMS
It is complex. Because a DBMS supports a wide range of functionality, the underlying software
is complex. Designers and developers need thorough knowledge of the software to get the
most out of it.
Because of its complexity and functionality, a DBMS needs a large amount of memory to run
efficiently.
A DBMS is typically a centralized system, i.e., all users access the same database. Hence any
failure of the DBMS affects all the users.
A DBMS is generalized software, i.e., it is written to work across many systems rather than for
one specific application. Hence some applications may run slowly.
Data Abstraction
For the system to be usable, it must retrieve data efficiently. The need for efficiency has led
designers to use complex data structures to represent data in the database. Since many database-system
users are not computer trained, developers hide the complexity from users through several levels of
abstraction, to simplify users’ interactions with the system. A database system provides three levels of
abstraction, namely:
i. Physical level (or Internal View / Schema): The lowest level of abstraction describes how
the data are actually stored. The physical level describes complex low-level data structures in detail.
ii. Logical level (or Conceptual View / Schema): The next-higher level of abstraction
describes what data are stored in the database, and what relationships exist among those data. The
logical level thus describes the entire database in terms of a small number of relatively simple structures.
Although implementation of the simple structures at the logical level may involve complex physical-
level structures, the user of the logical level does not need to be aware of this complexity. This is
referred to as physical data independence. Database administrators, who must decide what information
to keep in the database, use the logical level of abstraction.
iii. View level (or External View / Schema): The highest level of abstraction describes only
part of the entire database. Even though the logical level uses simpler structures, complexity remains
because of the variety of information stored in a large database. Many users of the database system do
not need all this information; instead, they need to access only a part of the database. The view level of
abstraction exists to simplify their interaction with the system. The system may provide many views for
the same database. Fig. 1.1 shows the relationship among the three levels of abstraction.
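As a rough illustration of the view level, the sketch below defines a view over a logical-level table using Python's sqlite3 module. The instructor attributes follow the relation used later in these notes; the view name instructor_public and the sample row are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Logical level: the full instructor relation.
conn.execute(
    "CREATE TABLE instructor (ID TEXT PRIMARY KEY, name TEXT, dept_name TEXT, salary INTEGER)"
)
conn.execute("INSERT INTO instructor VALUES ('10101', 'Asha', 'Physics', 65000)")  # made-up row

# View level: a hypothetical view that hides the salary attribute.
conn.execute("CREATE VIEW instructor_public AS SELECT ID, name, dept_name FROM instructor")

cur = conn.execute("SELECT * FROM instructor_public")
view_columns = [d[0] for d in cur.description]
view_rows = cur.fetchall()
```

A user who queries only instructor_public sees three attributes and never the salary, while the stored table is unchanged.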
Data Models
i) Relational Model. The relational model uses a collection of tables to represent both data
and the relationships among those data. Each table has multiple columns, and each column has a
unique name. Tables are also known as relations. The relational model is an example of a record-based
model. Record-based models are so named because the database is structured in fixed-format records of
several types. Each table contains records of a particular type. Each record type defines a fixed number
of fields, or attributes. The columns of the table correspond to the attributes of the record type. The
relational data model is the most widely used data model, and a vast majority of current database
systems are based on the relational model.
ii) Entity-Relationship Model. The entity-relationship (E-R) data model uses a collection of
basic objects, called entities, and relationships among these objects. An entity is a “thing” or “object”
in the real world that is distinguishable from other objects. The entity-relationship model is widely
used in database design.
iii) Object-Based Data Model. Object-oriented programming has become the dominant
software-development methodology. This led to the development of an object-oriented data model
that can be seen as extending the E-R model with notions of encapsulation, methods, and object
identity.
iv) Semi-structured Data Model. The semi-structured data model differs from the other
three data models: it permits the specification of data where individual data items of the
same type may have different sets of attributes.
The Extensible Markup Language (XML) is widely used for representing
semi-structured data. Although XML was initially designed for adding markup information to
text documents, it has gained importance because of its application in the exchange of data.
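To make "different sets of attributes" concrete, here is a small sketch that parses an XML fragment with Python's standard xml.etree.ElementTree module. The element names and values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two items of the same type (student) that carry different attribute sets.
doc = """
<students>
  <student><name>Shyam</name><phone>123456789</phone></student>
  <student><name>Rakesh</name><email>rakesh@example.com</email><major>Music</major></student>
</students>
"""

root = ET.fromstring(doc)

# Collect the child element names of each student record.
attribute_sets = [sorted(child.tag for child in s) for s in root.findall("student")]
```

Both records are valid students even though one has a phone and the other has an email and a major, which a fixed-format record type would not allow.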
Database Languages
A database system provides a data-definition language to specify the database schema and a
data-manipulation language to express database queries and updates. In practice, the data-definition and
data-manipulation languages are not two separate languages; instead they simply form parts of a single
database language, such as the widely used SQL language.
Data-Manipulation Language
A data-manipulation language (DML) is a language that enables users to access or manipulate
data as organized by the appropriate data model. The types of access are:
Retrieval of information stored in the database
Insertion of new information into the database
Deletion of information from the database
Modification of information stored in the database
A query is a statement requesting the retrieval of information. The portion of a DML that
involves information retrieval is called a query language. Although technically incorrect, it is common
practice to use the terms query language and data-manipulation language synonymously.
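The four types of access can be sketched with one SQL statement each. The example below runs them through Python's sqlite3 module; the student table mirrors the STUDENT example used later in these notes, and the new address value is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition (DDL): specify the schema.
conn.execute("CREATE TABLE student (stud_no INTEGER PRIMARY KEY, sname TEXT, address TEXT)")

# Data manipulation (DML): insertion, modification, deletion, retrieval.
conn.execute("INSERT INTO student VALUES (1, 'Shyam', 'Delhi')")
conn.execute("INSERT INTO student VALUES (2, 'Rakesh', 'Kolkata')")
conn.execute("UPDATE student SET address = 'Mumbai' WHERE stud_no = 2")
conn.execute("DELETE FROM student WHERE stud_no = 1")
rows = conn.execute("SELECT stud_no, sname, address FROM student").fetchall()
```

Only the final SELECT is a query in the strict sense; the other three statements change the stored data.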
The storage manager is important because databases typically require a large amount of storage
space. Corporate databases range in size from hundreds of gigabytes to, for the largest databases,
terabytes of data. A gigabyte is approximately 1000 megabytes (about 1 billion bytes), and a
terabyte is approximately 1 million megabytes (about 1 trillion bytes). Since the main memory of computers cannot store this
much information, the information is stored on disks. Data are moved between disk storage and main
memory as needed. Since the movement of data to and from disk is slow relative to the speed of the
central processing unit, it is imperative that the database system structure the data so as to minimize the
need to move data between disk and main memory.
The query processor is important because it helps the database system to simplify and facilitate
access to data. The query processor allows database users to obtain good performance while being able
to work at the view level. It is the job of the database system to translate updates and queries written in a
nonprocedural language, at the logical level, into an efficient sequence of operations at the physical
level.
Storage Manager:
The storage manager is the component of a database system that provides the interface between
the low level data stored in the database and the application programs and queries submitted to the
system. The storage manager is responsible for the interaction with the file manager. The raw data are
stored on the disk using the file system provided by the operating system. The storage manager translates
the various DML statements into low-level file-system commands. Thus, the storage manager is
responsible for storing, retrieving, and updating data in the database.
The storage manager components include:
Authorization and integrity manager, which tests for the satisfaction of integrity
constraints and checks the authority of users to access data.
Transaction manager, which ensures that the database remains in a consistent (correct) state despite
system failures and that concurrent transaction executions proceed without conflicting.
File manager, which manages the allocation of space on disk storage and the data
structures used to represent information stored on disk.
Buffer manager, which is responsible for fetching data from disk storage into main
memory, and deciding what data to cache in main memory. The buffer manager is a
critical part of the database system, since it enables the database to handle data sizes that
are much larger than the size of main memory.
The storage manager implements several data structures as part of the physical system
implementation:
Data files, which store the database itself.
Data dictionary, which stores metadata about the structure of the database, in particular the
schema of the database.
Indices, which can provide fast access to data items. A database index provides pointers to
those data items that hold a particular value. For example, we could use an index to find
the instructor record with a particular ID, or all instructor records with a particular name.
Hashing is an alternative to indexing that is faster in some but not all cases.
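The effect of an index can be observed by asking the database how it plans to run a query. The sketch below uses SQLite through Python's sqlite3 module; the index name idx_instructor_name is an assumption, and the exact wording of the plan differs between SQLite versions, so the code only checks whether the index is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (ID TEXT PRIMARY KEY, name TEXT, dept_name TEXT, salary INTEGER)"
)

# Secondary index on name, so lookups by name need not scan the whole table.
conn.execute("CREATE INDEX idx_instructor_name ON instructor (name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM instructor WHERE name = ?", ("Asha",)
).fetchall()

# The last column of each plan row is a human-readable description.
plan_text = " ".join(row[-1] for row in plan)
uses_index = "idx_instructor_name" in plan_text
```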
Database Architecture
The architecture of a database system is greatly influenced by the underlying computer system on
which the database system runs. Database systems can be centralized, or client-server, where one server
machine executes work on behalf of multiple client machines. Database systems can also be designed to
exploit parallel computer architectures. Distributed databases span multiple geographically separated
machines.
Most users of a database system today are not present at the site of the database system, but
connect to it through a network. We can therefore differentiate between client machines, on which remote
database users work, and server machines, on which the database system runs.
Database applications are usually partitioned into two or three parts. In a two-tier architecture, the
application resides at the client machine, where it invokes database system functionality at the server
machine through query language statements. Application program interface standards like ODBC and
JDBC are used for interaction between the client and the server.
In contrast, in a three-tier architecture, the client machine acts as merely a front end and does not
contain any direct database calls. Instead, the client end communicates with an application server, usually
through a forms interface. The application server in turn communicates with a database system to access
data. The business logic of the application, which says what actions to carry out under what conditions, is
embedded in the application server, instead of being distributed across multiple clients. Three-tier
applications are more appropriate for large applications, and for applications that run on the World Wide
Web.
Thus, a row in the prereq table indicates that two courses are related in the sense that one course
is a prerequisite for the other. As another example, consider the table instructor: a row in the table
can be thought of as representing the relationship between a specified ID and the corresponding values
for name, dept_name, and salary.
Thus, in the relational model the term relation is used to refer to a table, while the term tuple is
used to refer to a row. Similarly, the term attribute refers to a column of a table.
Examining Table 1.1, we can see that the relation instructor has four attributes: ID, name,
dept_name, and salary. We use the term relation instance to refer to a specific instance of a relation, i.e.,
containing a specific set of rows. The instance of instructor shown in Table 1.1 contains one tuple for
each instructor.
Database Schema
Database schema is the logical design of the database and the database instance is a snapshot of
the data in the database at a given instant in time.
The concept of relation corresponds to the programming language notion of a variable, while the
concept of a relation schema corresponds to the programming language notion of type definition.
In general, a relation schema consists of a list of attributes and their corresponding domains.
The concept of a relation instance corresponds to the programming language notion of the value of a
variable.
Schema Diagram
A schema is the blueprint or structure that defines how data is organized and stored in a
database. It outlines the tables, fields, relationships, views, indexes, and other elements within the
database. The schema defines the logical view of the entire database and specifies the rules that govern
the data, including its types, constraints, and relationships.
Database Schema
A database schema is the design or structure of a database that defines how data is organized and
how different data elements relate to each other. It acts as a blueprint, outlining tables, fields,
relationships, and rules that govern the data.
Using Entity-Relationship (ER) modeling, the logical schema outlines the relationships between
different data components. It also defines integrity constraints to ensure the quality of data during
insertion and updates. This schema represents a higher level of abstraction compared to the physical
schema, focusing on logical constraints and how the data is structured, without dealing with the physical
storage details.
Star Schema
A star schema is well suited to storing and analyzing large amounts of data. It has a fact table at its
center and multiple dimension tables connected to it, like a star. The fact table contains the
numerical measures of business processes, and each dimension table contains descriptive data for a dimension
such as product, time, or people; in other words, the dimension tables describe the facts in the fact table.
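A very small star schema might look like the following sketch, run through Python's sqlite3 module. The table names (fact_sales, dim_product, dim_time), their columns, and the sample figures are all assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive data.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);

-- Fact table at the center: numeric measures plus one key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    amount INTEGER
);

INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Book');
INSERT INTO dim_time VALUES (1, 2023), (2, 2024);
INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 20), (2, 2, 50);
""")

# A typical analytical query: join the facts to their dimensions and aggregate.
totals = conn.execute("""
SELECT p.product_name, t.year, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_time t ON t.time_id = f.time_id
GROUP BY p.product_name, t.year
ORDER BY p.product_name, t.year
""").fetchall()
```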
1.8 Keys
In databases, keys are fundamental in maintaining data integrity and organization. They serve as
unique identifiers and establish relationships between tables, enabling efficient data retrieval and
manipulation.
Primary Key
There can be more than one candidate key in a relation, out of which one is chosen as the
primary key. For example, STUD_NO and PHONE are both candidate keys for relation
STUDENT, but STUD_NO can be chosen as the primary key (only one out of many candidate keys).
A primary key is a unique key, meaning it can uniquely identify each record (tuple) in a table.
It must have unique values and cannot contain any duplicate values.
A primary key cannot be NULL, as it needs to provide a valid, unique identifier for every
record.
A primary key does not have to consist of a single column. In some cases, a composite
primary key (made of multiple columns) can be used to uniquely identify records in a table.
Many database systems store rows physically ordered by primary key (a clustered index), which
gives fast access to records by primary key.
Example:
STUDENT table -> Student(STUD_NO, SNAME, ADDRESS, PHONE), where STUD_NO is the
primary key.
Table STUDENT
STUD_NO   SNAME    ADDRESS   PHONE
1         Shyam    Delhi     123456789
2         Rakesh   Kolkata   223365796
3         Suraj    Delhi     175468965
Candidate Key
A minimal set of attributes that can uniquely identify a tuple is known as a candidate key.
For example, STUD_NO in the STUDENT relation.
A candidate key is a minimal super key: it uniquely identifies a record but contains
no extra attributes, so removing any attribute from it destroys uniqueness.
Example:
STUD_NO is a candidate key for relation STUDENT.
Super Key
A set of one or more attributes (columns) that can uniquely identify a tuple (record) is known
as a super key. For example, STUD_NO, (STUD_NO, SNAME), etc.
Every candidate key is a super key, but a super key need not be minimal. Attributes of a
super key that are not part of the primary key may contain NULL values.
A super key can contain extra attributes that aren’t necessary for uniqueness. For example, if
the “STUD_NO” column can uniquely identify a student, adding “SNAME” to it will still
form a valid super key, though it’s unnecessary.
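The difference between super keys and candidate keys can be explored on the sample STUDENT rows shown earlier. The sketch below is plain Python; the function names is_super_key and candidate_keys are assumptions, and the check looks only at this one sample instance.

```python
from itertools import combinations

ATTRS = ("STUD_NO", "SNAME", "ADDRESS", "PHONE")
ROWS = [
    (1, "Shyam", "Delhi", "123456789"),
    (2, "Rakesh", "Kolkata", "223365796"),
    (3, "Suraj", "Delhi", "175468965"),
]

def is_super_key(attrs):
    """True if no two sample rows agree on all the given attributes."""
    idx = [ATTRS.index(a) for a in attrs]
    projected = [tuple(row[i] for i in idx) for row in ROWS]
    return len(set(projected)) == len(ROWS)

def candidate_keys():
    """Minimal super keys of the sample: no proper subset is itself a super key."""
    keys = []
    for size in range(1, len(ATTRS) + 1):
        for combo in combinations(ATTRS, size):
            # Skip any combination that contains an already-found smaller key.
            if is_super_key(combo) and not any(set(k) <= set(combo) for k in keys):
                keys.append(combo)
    return keys
```

On these three rows SNAME also comes out as unique, but only by coincidence: a key is a constraint on every legal instance of the relation, so a check on sample data can rule a key out (as it does for ADDRESS) but cannot prove one.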
Foreign Key
A foreign key is an attribute in one table that refers to the primary key in another table. The table
that contains the foreign key is called the referencing table, and the table that is referenced is called the
referenced table.
A foreign key in one table points to the primary key in another table, establishing a
relationship between them.
It helps connect two or more tables, enabling you to create relationships between them. This is
essential for maintaining data integrity and preventing data redundancy.
They act as a cross-reference between the tables.
Example:
STUD_NO in STUDENT_COURSE is a foreign key to STUD_NO in STUDENT relation.
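The following sketch shows a foreign key being enforced, using SQLite through Python's sqlite3 module. The course_no column and the sample values are assumptions; note that SQLite checks foreign keys only after PRAGMA foreign_keys = ON has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Referenced table.
conn.execute("CREATE TABLE student (stud_no INTEGER PRIMARY KEY, sname TEXT)")

# Referencing table: stud_no must match an existing student.
conn.execute(
    "CREATE TABLE student_course ("
    " stud_no INTEGER REFERENCES student(stud_no),"
    " course_no TEXT)"
)

conn.execute("INSERT INTO student VALUES (1, 'Shyam')")
conn.execute("INSERT INTO student_course VALUES (1, 'C1')")  # accepted: student 1 exists

try:
    conn.execute("INSERT INTO student_course VALUES (99, 'C1')")  # no student 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```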
Alternate Key
An alternate key is any candidate key in a table that is not chosen as the primary key. In other
words, all the keys that are not selected as the primary key are considered alternate keys.
An alternate key is also referred to as a secondary key because it can uniquely identify records in
a table, just like the primary key.
An alternate key can consist of one or more columns (fields) that can uniquely identify a record,
but it is not the primary key
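Example: PHONE is an alternate key in the STUDENT table, since it is a candidate key that was not
chosen as the primary key. ADDRESS is not a key at all: two students in the sample table share the address Delhi.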
Composite Key
Sometimes, a table does not have a single column/attribute that uniquely identifies all the
records of the table. To uniquely identify rows of a table, a combination of two or more columns/attributes
can be used. Not every combination is unique, so we need to find a minimal set of
attributes whose combined values are guaranteed to be unique for every row.
A composite key can serve as the primary key when no single attribute is unique.
Two or more attributes are used together to make a composite key.
Different combinations of attributes differ in whether they identify the rows uniquely.
Example (select operation, σ)
σ subject = "database" (Books)
Output: Selects tuples from Books where subject is 'database'.
σ subject = "database" and price = 450 (Books)
Output: Selects tuples from Books where subject is 'database' and price is 450.
Example (Cartesian product, ×)
σ author = 'tutorialspoint' (Books × Articles)
Output: Yields a relation that shows all the books and articles written by tutorialspoint.
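Selection and Cartesian product can be sketched over lists of dictionaries, one dictionary per tuple. The helper names select and cartesian, the attribute names, and the sample tuples are all invented for this illustration.

```python
from itertools import product

books = [
    {"title": "DB Basics", "subject": "database", "price": 450, "author": "tutorialspoint"},
    {"title": "Algebra", "subject": "math", "price": 300, "author": "someone"},
]
articles = [{"headline": "Joins", "writer": "tutorialspoint"}]

def select(relation, predicate):
    """Sigma: keep the tuples for which the predicate holds."""
    return [t for t in relation if predicate(t)]

def cartesian(r, s):
    """Cross product: pair every tuple of r with every tuple of s."""
    return [{**a, **b} for a, b in product(r, s)]

db_books = select(books, lambda t: t["subject"] == "database" and t["price"] == 450)
pairs = select(cartesian(books, articles), lambda t: t["author"] == "tutorialspoint")
```

The product of two books and one article has two tuples; the selection then keeps the one whose author matches.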
Join operations
1. Natural Join
A natural join is the set of all combinations of tuples in R and S that are equal on their common
attribute names. It is denoted by ⋈.
Example: Consider the EMPLOYEE table and the SALARY table.
EMPLOYEE
EMP_CODE   EMP_NAME
101        Ajay
102        Suresh
103        Ram
SALARY
EMP_CODE   SALARY
101        50000
102        30000
103        25000
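The natural join of the two tables above can be computed with SQL's NATURAL JOIN. This sketch loads the same rows into SQLite through Python's sqlite3 module; the lower-case table and column names are just a convention chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_code INTEGER, emp_name TEXT);
CREATE TABLE salary (emp_code INTEGER, salary INTEGER);
INSERT INTO employee VALUES (101, 'Ajay'), (102, 'Suresh'), (103, 'Ram');
INSERT INTO salary VALUES (101, 50000), (102, 30000), (103, 25000);
""")

# NATURAL JOIN matches rows on the common attribute emp_code and keeps it once.
joined = conn.execute(
    "SELECT emp_code, emp_name, salary FROM employee NATURAL JOIN salary ORDER BY emp_code"
).fetchall()
```

Every employee code appears in both tables, so the result has one tuple per employee, combining the name with the salary.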
2. Outer Join
The outer join operation is an extension of the join operation. It is used to deal with missing
information.
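Example
EMPLOYEE
EMP_NAME   STREET         CITY
Ajay       M.G.R Street   Chennai
Suresh     Park Street    Trichy
Ram        Anna Nagar     Madurai
FACT_WORKERS
EMP_NAME   BRANCH      SALARY
Ajay       Wipro       50000
Ram        TCS         30000
Bala       Honeywell   25000
Input:
(EMPLOYEE ⋈ FACT_WORKERS)
Output:
EMP_NAME   STREET         CITY      BRANCH   SALARY
Ajay       M.G.R Street   Chennai   Wipro    50000
Ram        Anna Nagar     Madurai   TCS      30000
The natural join keeps only the employees that appear in both relations. A left outer join would
also keep Suresh, and a right outer join would also keep Bala, with NULL in the attributes that
have no matching value.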
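A left outer join on the EMPLOYEE and FACT_WORKERS data can be sketched as follows, using SQLite through Python's sqlite3 module. Only the left variant is shown because right and full outer joins are not available in older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_name TEXT, street TEXT, city TEXT);
CREATE TABLE fact_workers (emp_name TEXT, branch TEXT, salary INTEGER);
INSERT INTO employee VALUES
    ('Ajay', 'M.G.R Street', 'Chennai'),
    ('Suresh', 'Park Street', 'Trichy'),
    ('Ram', 'Anna Nagar', 'Madurai');
INSERT INTO fact_workers VALUES
    ('Ajay', 'Wipro', 50000),
    ('Ram', 'TCS', 30000),
    ('Bala', 'Honeywell', 25000);
""")

# Every employee is kept; missing branch and salary values become NULL (None in Python).
left_outer = conn.execute("""
SELECT e.emp_name, e.city, f.branch, f.salary
FROM employee e
LEFT OUTER JOIN fact_workers f ON e.emp_name = f.emp_name
ORDER BY e.emp_name
""").fetchall()
```

Suresh has no row in FACT_WORKERS, so he appears with missing values instead of being dropped; Bala, who exists only in FACT_WORKERS, is not part of a left outer join.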
3. Equi Join
An equi join is a form of inner join. It returns only the matched data, as determined by an equality
condition, i.e., it uses the comparison operator (=).
Example:
CUSTOMER
CUST_ID   NAME
1         Ajay
2         Suresh
3         Ram
PRODUCT
PRODUCT_ID   CITY
1            Chennai
2            Trichy
3            Madurai
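The notes do not state the join condition for this example, so the sketch below assumes it is CUSTOMER.CUST_ID = PRODUCT.PRODUCT_ID. It runs the join in SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER, name TEXT);
CREATE TABLE product (product_id INTEGER, city TEXT);
INSERT INTO customer VALUES (1, 'Ajay'), (2, 'Suresh'), (3, 'Ram');
INSERT INTO product VALUES (1, 'Chennai'), (2, 'Trichy'), (3, 'Madurai');
""")

# Equi join: the ON condition uses only the equality operator.
# The condition itself is an assumption made for this example.
equi = conn.execute("""
SELECT c.cust_id, c.name, p.product_id, p.city
FROM customer c
JOIN product p ON c.cust_id = p.product_id
ORDER BY c.cust_id
""").fetchall()
```

Unlike a natural join, an equi join keeps both compared columns in the result, which is why cust_id and product_id both appear.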
Example (select with a compound condition)
σ subject = "database" and price = 450 or year > 2010 (Books)
Output: Selects tuples from Books where subject is 'database' and price is 450, or those books
published after 2010.
Aggregate functions are functions that take a collection of values as input and return a single
value. Aggregate functions supported by SQL are
Average: avg
Minimum: min
Maximum: max
Total: sum
Count: count
Group by clause is used to apply aggregate functions to a set of tuples. The attributes given in the
group by clause are used to form groups. Tuples with the same value on all attributes in the group by
clause are placed in one group.
Aggregate Functions
- A type of request that cannot be expressed in the basic relational algebra is to specify
mathematical aggregate functions on collections of values from the database.
- Examples of such functions include retrieving the average or total salary of all employees or the
total number of employee tuples. These functions are used in simple statistical queries that
summarize information from the database tuples.
- Common functions applied to collections of numeric values include SUM, AVERAGE,
MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or values
a)
DNO   NO_OF_EMPLOYEES   AVERAGE_SAL
5     4                 33250
4     3                 31000
1     1                 55000
b)
DNO   COUNT_SSN   AVERAGE_SALARY
5     4           33250
4     3           31000
1     1           55000
c) Aggregates over the whole relation, with no grouping:
COUNT_SSN   AVERAGE_SALARY
8           35125
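The grouped and ungrouped aggregates can be reproduced with GROUP BY. In the sketch below, the individual salaries and SSN values are assumptions chosen so that the per-department counts and averages match the tables above. The overall average then follows from the group figures: (4 × 33250 + 3 × 31000 + 1 × 55000) / 8 = 35125.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ssn TEXT, dno INTEGER, salary INTEGER)")

# Hypothetical salaries consistent with the department averages in the notes.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("s1", 5, 30000), ("s2", 5, 40000), ("s3", 5, 38000), ("s4", 5, 25000),
        ("s5", 4, 25000), ("s6", 4, 43000), ("s7", 4, 25000),
        ("s8", 1, 55000),
    ],
)

# One result row per department: tuples with the same dno form one group.
per_dept = conn.execute(
    "SELECT dno, COUNT(ssn), AVG(salary) FROM employee GROUP BY dno ORDER BY dno DESC"
).fetchall()

# Without GROUP BY, the whole relation is a single group.
overall = conn.execute("SELECT COUNT(ssn), AVG(salary) FROM employee").fetchone()
```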
Example:
Consider the following SQL query on the EMPLOYEE relation:
SELECT Lname, Fname
FROM EMPLOYEE
WHERE Salary > ( SELECT MAX (Salary)
FROM EMPLOYEE
WHERE Dno=5 );
This query retrieves the names of employees (from any department in the company) who
earn a salary that is greater than the highest salary in department 5. The query includes a nested
subquery and hence would be decomposed into two blocks.
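The decomposition into two blocks can be imitated by hand: evaluate the inner block once, then pass its result to the outer block as a constant. The sketch below does this in SQLite through Python's sqlite3 module; the employee rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Lname TEXT, Fname TEXT, Salary INTEGER, Dno INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [
        ("Rao", "Anil", 30000, 5),
        ("Das", "Mira", 40000, 5),
        ("Iyer", "Kiran", 55000, 1),
        ("Khan", "Sara", 43000, 4),
    ],
)

# The query as written in the notes (ORDER BY added only to make the output deterministic).
nested = conn.execute("""
SELECT Lname, Fname FROM EMPLOYEE
WHERE Salary > (SELECT MAX(Salary) FROM EMPLOYEE WHERE Dno = 5)
ORDER BY Lname
""").fetchall()

# Inner block: highest salary in department 5.
(max_dept5,) = conn.execute("SELECT MAX(Salary) FROM EMPLOYEE WHERE Dno = 5").fetchone()

# Outer block: uses the inner result as a constant.
outer = conn.execute(
    "SELECT Lname, Fname FROM EMPLOYEE WHERE Salary > ? ORDER BY Lname", (max_dept5,)
).fetchall()
```

Both forms return the same employees, which is why a query processor is free to evaluate the inner block once and reuse its value.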
Embedded SQL is portable to other databases and other environments, and is functionally
equivalent in all operating environments. It is a comprehensive, low-level interface that provides all the
functionality available in the product. Embedded SQL requires knowledge of C or C++ programming
languages.
Embedded SQL provides a means by which a program can interact with a database server.
However, under embedded SQL, the SQL statements are identified at compile time using a preprocessor,
which translates requests expressed in embedded SQL into function calls. At runtime, these function calls
connect to the database using an API that provides dynamic SQL facilities but may be specific to the
database that is being used.
Example:
The embedded SQL statements are parsed by an embedded SQL preprocessor and replaced by host-
language calls to a code library. The output from the preprocessor is then compiled by the host compiler.
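The following script dynamically retrieves data from the table geektable:
Query:
-- Build the statement text at run time, then execute it.
DECLARE
    @tab NVARCHAR(128),
    @st NVARCHAR(MAX);
SET @tab = N'geektable';
-- QUOTENAME delimits the identifier so the table name cannot inject SQL.
SET @st = N'SELECT * FROM ' + QUOTENAME(@tab);
EXEC sp_executesql @st;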
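The same idea, building the statement text at run time, can be sketched in a host language. The example below uses Python's sqlite3 module rather than SQL Server; the function name fetch_all and the sample row are assumptions. Because a table name cannot be passed as a bound parameter, the sketch checks it against the tables that actually exist before placing it in the query string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geektable (id INTEGER, name TEXT)")
conn.execute("INSERT INTO geektable VALUES (1, 'a')")  # made-up row

def fetch_all(conn, table):
    """Build and run SELECT * for a table whose name is known only at run time."""
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table not in known:
        raise ValueError("unknown table: " + repr(table))
    # Quote the identifier, doubling any embedded double quotes.
    query = 'SELECT * FROM "' + table.replace('"', '""') + '"'
    return conn.execute(query).fetchall()
```

Calling fetch_all(conn, "geektable") returns the stored rows, while a name that is not an existing table raises ValueError instead of reaching the database.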