Paper 2 Database-Chapter A.3
o Keep up with current technological trends
o Predict future changes
o Emphasis on established off-the-shelf products
How many major threats to database security can you think of?
1. Accidental loss due to human error or software/hardware error.
2. Theft and fraud that could come from hackers or disgruntled employees.
3. Improper data access to personal or confidential data.
4. Loss of data integrity.
5. Loss of data availability through sabotage, a virus, or a worm.
4. Data backup
o automatic dump: a facility that produces a backup copy of the entire
database
o periodic backup: done on a periodic basis, such as nightly or weekly
o cold backup: the database is shut down during the backup
o hot backup: a selected portion of the database is shut down and
backed up at a given time
o backups are stored in a secure, off-site location
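As a rough illustration of how such a backup is scripted, here is a minimal sketch in SQL Server's T-SQL; the database name Sales and the file path are placeholders, and the syntax differs across DBMSs:

    -- Full (hot) backup of the entire database to a file
    BACKUP DATABASE Sales
    TO DISK = 'E:\backups\sales_full.bak'
    WITH INIT;  -- overwrite any earlier backup set in this file

A job scheduler would typically run a statement like this nightly or weekly, after which the backup file is copied to the secure, off-site location.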
5. Database recovery
If there are backup facilities, are there also journalizing, checkpoint, and recovery
facilities?
Yes
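For example, SQL Server exposes its checkpoint facility directly; a minimal illustration (the database name is a placeholder, and checkpoints are also taken automatically):

    USE Sales;
    CHECKPOINT;  -- flush dirty pages to disk so crash recovery can start from here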
1. Database design
2. Database implementation
DBA
o establish security controls
o supervise database loading
o specify test procedures
o develop programming standards
o establish backup/recovery procedures
Both
o specify access policies
o user training
DBA
o monitor database performance
o tune and reorganize databases as needed
o enforce standards and procedures
Both
o support users
Both
o implement change-control procedures
o plan for growth and change
o evaluate new technologies
New functions
1. Data warehouse administration
o (massively) integrated decision support databases from various
sources
Emphasis on integration and coordination of data and metadata from
multiple databases
Specific functions
o support decision-oriented applications
o manage data warehouse (exponential) growth
o establish service level agreements
A. Checkpoint facility
B. Recovery Manager
C. Biometric Device
D. Journalizing facilities
5. Which of the following is the goal of database security?
A. To protect primarily against accidental or intentional loss of data
B. To protect against misuse of data
C. To protect against destruction of data
D. All of the above
Answers:
1. A
2. B
3. A
4. C
5. D
Questions:
2. What impact has the internet had on the management of data security?
As a result of the internet, managing data security effectively has become more
difficult because access to data has become open through the internet and
corporate intranets.
5. Loss of data availability through sabotage, a virus, or a worm.
The recovery manager is a module of the DBMS that restores the database to
a correct condition when a failure occurs and then resumes processing user
requests.
5. What is the difference between backward (rollback) and forward (rollforward)
recovery?
Rollback is the backing out, or undoing, of unwanted changes to the database.
Before-images of the records that have been changed are applied to the
database, and the database is returned to an earlier state. It is used to reverse
the changes made by transactions that have been aborted or terminated abnormally.
Rollforward is the technique that starts with an earlier copy of the database.
After-images (the results of good transactions) are applied to the database, and
the database is quickly moved forward to a later state.
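A minimal sketch of rollback at the SQL level, assuming a hypothetical account table; the DBMS uses before-images of the updated row to undo the change:

    BEGIN TRANSACTION;
    UPDATE account SET balance = balance - 100 WHERE ano = 'A-101';
    -- the transaction is aborted, so back out the unwanted change
    ROLLBACK;  -- before-images return the database to its earlier state

Rollforward, by contrast, is driven by the DBMS itself: it restores the earlier backup copy and reapplies the after-images from the log.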
The data a user fills into the forms is saved in the database.
How can I use the data for functionality that is performed at another place and
time?
I can't find an answer.
Also, if a user posts 2 numbers in a form, then both numbers are stored in the DB.
Then there is a button somewhere else on the page, which calculates the sum
of the two numbers.
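One way to handle this, sketched in SQL with a hypothetical form_numbers table: the sum need not be stored at all, since the button's handler can recompute it from the stored values at any later time:

    -- Hypothetical table holding the two numbers posted from the form
    CREATE TABLE form_numbers (
        id   INT PRIMARY KEY,
        num1 INT NOT NULL,
        num2 INT NOT NULL
    );

    -- Elsewhere on the page, at another place and time:
    SELECT num1, num2, num1 + num2 AS total
    FROM form_numbers
    WHERE id = 1;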
Recovery Techniques for Database Systems
Definitions:
o Failure: An event at which the system does not perform according to
specifications. There are three kinds of failures:
1. failure of a program or transaction
2. failure of the total system
3. hardware failure
o Recovery Data: Data required by the recovery system for the
recovery of the primary data. In very high reliability systems, this
data might also need to be covered by a recovery mechanism.
Recovery data is divided into two categories: 1) data required to
keep current values, and 2) data needed to make the restoration of
previous values possible.
o Transaction: The base unit of locking and recovery (for undo, redo, or
completion); it appears atomic to the user.
o Database: A collection of related storage objects together with
controlled redundancy that serves one or more applications. Data is
stored in a way that is independent of programs using it, with a single
approach used to add, modify, or retrieve data.
o Correct State: Information in the database consists of the most
recent copies of data put in the database by users and contains no
data deleted by users.
o Valid State: The database contains part of the information of the
correct state. There is no spurious data, although pieces may be
missing.
o Consistent State: A valid state in which the information satisfies the
user consistency constraints. These constraints vary depending on the
database and its users.
o Crash: A failure of a system that is covered by a recovery technique.
o Catastrophe: A failure of a system not covered by a recovery
technique.
1. Recovery to the correct state.
2. Recovery to a checkpointed (past) correct state.
3. Recovery to a possible previous state.
4. Recovery to a valid state.
5. Recovery to a consistent state.
6. Crash resistance (prevention).
The bigger the damage, the cruder the recovery technique used.
Recovery Techniques:
1. Salvation program: Run after a crash to attempt to restore the
system to a valid state. No recovery data used. Used when all other
techniques fail or were not used. Good for cases where buffers were
lost in a crash and one wants to reconstruct what was lost. (4,5)
2. Incremental dumping: Modified files copied to archive after job
completed or at intervals. (3,4)
3. Audit trail: Sequences of actions on files are recorded. Optimal for
"backing out" of transactions. (Ideal if trail is written out before
changes). (1,2,3)
4. Differential files: Separate file is maintained to keep track of
changes, periodically merged with the main file. (2,3)
5. Backup/current version: Present files form the current version of the
database. Files containing previous values form a consistent backup
version. (2,3)
6. Multiple copies: Multiple active copies of each file are maintained
during normal operation of the database. In cases of failure,
comparison between the versions can be used to find a consistent
version. (6)
7. Careful replacement: Nothing is updated in place; the original is
deleted only after the operation is complete. (2,6)
(The numbers in parentheses indicate which of the recovery levels
above each technique supports.)
Combinations of two techniques can be used to offer similar protection
against different kinds of failures. The techniques above, when
implemented, force changes to:
o The way data is updated and manipulated (7).
o Nothing (available as utilities) (1,2,3).
Examples and bits of wisdom:
o Original Multics system: all disk files updated or created by the user
are copied when the user signs off. All newly created or modified files
not previously dumped are copied to tapes once per hour. High
reliability, but very high overhead. Later changed to a system using a
mix of incremental dumping, full checkpointing, and salvage programs.
o Several other systems maintain backup copies of data through the
paging system (keep backups in the swap space).
o Use of buffers is dangerous for consistency.
o Intention lists: specify audit trail before it actually occurs.
o Recovery among interacting processes is hard. You can either
prevent the interaction or synchronize with respect to recovery.
o Error detection is difficult, and can be costly.
Relevance
Recovery from failure is a critical factor in databases. In case of disaster, it is very
important that as much as possible (if not everything) is recovered. This paper
surveys the methods that were in use at the time for data recovery.
Economies of Scale
By consolidating data and applications across departments, wasteful overlap of
resources and personnel can be avoided.
Data Matching:
The Data Quality Services (DQS) data matching process enables you to reduce
data duplication and improve data accuracy in a data source. Matching analyzes
the degree of duplication in all records of a single data source, returning weighted
probabilities of a match between each set of records compared. You can then
decide which records are matches and take the appropriate action on the source
data.
The DQS matching process has the following benefits:
You can perform the matching process in conjunction with other data cleansing
processes to improve overall data quality. You can also perform data
de-duplication using DQS functionality built into Master Data Services. For more
information, see Master Data Services Overview.
[Illustration: how data matching is done in DQS]
As with other data quality processes in DQS, you perform matching by building a
knowledge base and executing a matching activity in a data quality project.
Data Mining
Data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful
information - information that can be used to increase revenue, cut costs, or
both. Data mining software is one of a number of analytical tools for analyzing
data. It allows users to analyze data from many different dimensions or angles,
categorize it, and summarize the relationships identified. Technically, data mining
is the process of finding correlations or patterns among dozens of fields in large
relational databases.
Data
Data are any facts, numbers, or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data in different
formats and different databases. This includes:
operational or transactional data such as sales, cost, inventory, payroll, and
accounting
metadata - data about the data itself, such as logical database design or
data dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide
information. For example, analysis of retail point of sale transaction data can yield
information on which products are selling and when.
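A minimal sketch of such an analysis in SQL, assuming a hypothetical point-of-sale table sales(product_id, sale_date, qty); the EXTRACT syntax is standard SQL and varies by DBMS:

    -- Which products sell, and in which months (hypothetical schema)
    SELECT product_id,
           EXTRACT(MONTH FROM sale_date) AS sale_month,
           SUM(qty) AS units_sold
    FROM sales
    GROUP BY product_id, EXTRACT(MONTH FROM sale_date)
    ORDER BY units_sold DESC;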
Knowledge
Information can be converted into knowledge about historical patterns and future
trends. For example, summary information on retail supermarket sales can be
analyzed in light of promotional efforts to provide knowledge of consumer buying
behavior. Thus, a manufacturer or retailer could determine which items are most
susceptible to promotional efforts.
Data Warehouses
Dramatic advances in data capture, processing power, data transmission, and
storage capabilities are enabling organizations to integrate their various
databases into data warehouses. Data warehousing is defined as a process of
centralized data management and retrieval. Data warehousing, like data mining,
is a relatively new term although the concept itself has been around for years.
Data warehousing represents an ideal vision of maintaining a central repository of
all organizational data. Centralization of data is needed to maximize user access
and analysis. Dramatic technological advances are making this vision a reality for
many companies. And, equally dramatic advances in data analysis software are
allowing users to access this data freely. The data analysis software is what
supports data mining.
For example, Blockbuster Entertainment mines its video rental history database
to recommend rentals to individual customers. American Express can suggest
products to its cardholders based on analysis of their monthly expenditures.
WalMart is pioneering massive data mining to transform its supplier relationships.
WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries
and continuously transmits this data to its massive 7.5 terabyte Teradata data
warehouse. WalMart allows more than 3,500 suppliers to access data on their
products and perform data analyses. These suppliers use this data to identify
customer buying patterns at the store display level. They use this information to
manage local store inventory and identify new merchandising opportunities. In
1995, WalMart computers processed over 1 million complex data queries.
The National Basketball Association (NBA) is exploring a data mining application
that can be used in conjunction with image recordings of basketball games. The
Advanced Scout software analyzes the movements of players to help coaches
orchestrate plays and strategies. For example, an analysis of the play-by-play
sheet of the game played between the New York Knicks and the Cleveland
Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard
position, John Williams attempted four jump shots and made each one! Advanced
Scout not only finds this pattern, but explains that it is interesting because it
differs considerably from the average shooting percentage of 49.30% for the
Cavaliers during that game.
By using the NBA universal clock, a coach can automatically bring up the video
clips showing each of the jump shots attempted by Williams with Price on the
floor, without needing to comb through hours of video footage. Those clips show
a very successful pick-and-roll play in which Price draws the Knicks' defense and
then finds Williams for an open jump shot.
How does data mining work?
While large-scale information technology has been evolving separate transaction
and analytical systems, data mining provides the link between the two. Data
mining software analyzes relationships and patterns in stored transaction data
based on open-ended user queries. Several types of analytical software are
available: statistical, machine learning, and neural networks. Generally, any of
four types of relationships are sought:
determine when customers visit and what they typically order. This
information could be used to increase traffic by having daily specials.
Extract, transform, and load transaction data onto the data warehouse
system.
Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
Size of the database: the more data being processed and maintained, the
more powerful the system required.
Query complexity: the more complex the queries and the greater the
number of queries being processed, the more powerful the system
required.
Relational database storage and management technology is adequate for many
data mining applications of less than 50 gigabytes. However, this infrastructure
needs to be significantly enhanced to support larger applications. Some vendors
have added extensive indexing capabilities to improve query performance. Others
use new hardware architectures such as Massively Parallel Processors (MPP) to
achieve order-of-magnitude improvements in query time. For example, MPP
systems from NCR link hundreds of high-speed Pentium processors to achieve
performance levels exceeding those of the largest supercomputers.
Multiple Choice:
1. Database programs can do all of the following EXCEPT:
B. Create graphics.
C. Communicate data.
D. Manage information.
Answer: B
B. computerized typewriter
C. office desktop
D. computerized calculator
Answer: A
A. DBA.
B. application.
D. operating system.
Answer: B
4. Advantages of databases include all of the following EXCEPT:
Answer: D
A. database.
B. database program.
C. operating system.
D. data warehouse.
Answer: B
A. database.
B. DBMS.
C. operating system.
D. utility.
Answer: A
B. tables.
C. folders.
D. DBMS.
Answer: B
8. In a database table, a ____________ is a collection of data fields.
A. vector
B. query
C. descriptor
D. record
Answer: D
9. In a customer database table, all of the information for one customer is kept in
a:
A. field type.
B. field.
C. record.
D. column.
Answer: C
A. row.
B. text field.
C. record.
D. computed field.
Answer: B
11. In a database, a ____________ field shows results of calculations performed
on data in other numeric fields.
A. configured
B. concatenated
C. key
D. computed
Answer: D
12. The number of newspapers sold on May 30 would be kept in a ____________
field.
A. date
B. numeric
C. text
D. key
Answer: B
13. Bringing data from a word processing program into a database program is
known as:
A. exporting.
B. batch processing.
C. importing.
D. mining.
Answer: C
A. Browsing
B. Mining
C. Scrubbing
D. Cleansing
Answer: A
A. surfing
B. keying
C. scrubbing
D. querying
Answer: D
16. Arranging all customer records in customer number order is an example of:
A. querying.
B. sorting.
C. inquiring.
D. filtering.
Answer: B
17. An ordered list of specific records and specific fields printed in an easy-to-read
format is known as a(n):
A. query.
B. sort.
C. inquiry.
D. report.
Answer: D
18. The process of ____________ would be used when sending data from a
database to a word processor so that mailing labels could be produced.
A. exporting
B. sorting
C. mining
D. querying
Answer: A
19. Database queries must be:
A. contiguous.
B. unambiguous.
C. contoured.
D. batched.
Answer: B
20. The following is an example of:
SELECT Student_ID FROM Students WHERE Major = 'Business' AND Credits >= 46
A. query language.
B. BASIC language.
C. HTML language.
D. a spreadsheet formula.
Answer: A
Answer: A
A. PIM
B. intranet
C. SPSS
D. GIS
Answer: D
23. A ____________ manipulates data in a large collection of files and
cross-references those files.
A. DBA
B. GIS
C. PIM
D. DBMS
Answer: D
24. A large corporation would use a ____________ to keep records for many
employees and customers along with all of its inventory data.
A. GIS
B. spreadsheet program
C. PIM
Answer: D
25. For a customer database, a good choice of key field would be:
A. address.
B. customer ID.
C. phone number.
D. last name.
Answer: B
Answer: A
27. In a(n) ____________, data from more than one table can be combined.
A. key field
B. relational database
C. file manager
D. XML
Answer: B
28. ____________ processing is used when a large mail-order company
accumulates orders and processes them together in one large set.
A. Interactive
B. Group
C. Real-time
D. Batch
Answer: D
29. When making an airline reservation through the Internet, you use
____________ processing.
A. interactive
B. group
C. digitization
D. batch
Answer: A
A. interactive
B. digitization
C. real-time
D. batch
Answer: D
31. In a typical client/server environment, the client can be any of the following
EXCEPT a:
A. desktop computer.
B. mainframe.
C. PDA.
D. notebook.
Answer: B
32. In a client/server environment, the server:
A. processes a query from a client and then sends the answer back to
the client.
Answer: A
A. CRM
B. XML
C. Middleware
D. Firmware
Answer: C
Answer: B
A. SQL
B. CRM
C. PIM
D. XML
Answer: D
36. A CRM system organizes and tracks information on:
A. consulates.
B. computer registers.
C. customers.
D. privacy violations.
Answer: C
A. table
B. field
C. class
D. record.
Answer: C
38. When a person uses language like ordinary English to query a database, it is
known as a(n) ____________ language query.
A. HTML
B. object-oriented
C. natural
D. XML
Answer: C
39. The act of accessing data about other people through credit card information,
credit bureau data, and public records and then using that data without
permission is known as:
A. identity theft.
B. personal theft.
C. data mining.
Answer: A
40. An aspect of the USA Patriot Act is the requirement that when presented with
appropriate warrants:
Answer: C
Answer: A
Answer: database
43. A(n) ____________ field shows results of a calculation done using values in
other numeric fields.
Answer: computed
44. A(n) ____________ is a collection of related information stored in a database
program.
Answer: table
45. In a university database table, all of the information for one student (e.g.
student ID, name, address) would be stored in one ____________.
Answer: record
47. In a university database, a student’s birth date would be stored in a(n)
____________ type field.
Answer: date
48. In ____________ view, the database program shows the data one record at a
time.
Answer: form
49. Bringing a list of names and addresses from a Word document into a database
program is called ____________ data.
Answer: importing
50. A request for information from a database that can be saved and reused later
is known as a(n) ____________.
Answer: query
Answer: records
53. A specialized database program that can store addresses and phone numbers,
keep a calendar, and set alarms is known as a(n) ____________.
Answer: PIM
Answer: GIS
57. PIM stands for ____________.
Answer: personal information manager
Answer: relational
60. Timesheet transactions collected and used to update payroll files once a
week is an example of ____________ processing.
Answer: batch
61. In ____________ computing, users can view and change data online.
Answer: Distributed
Answer: middleware
64. In a client/server environment, a desktop computer is known as the
____________.
Answer: client
65. Some large companies keep all of their corporate data in an integrated data
repository called a data ____________.
Answer: warehouse
67. Data ____________ uses artificial intelligence and statistical methods to find
trends and patterns in data.
Answer: mining
Answer: customers
69. A company’s self-contained network that uses a search engine and Web
browser is called a(n) ____________.
Answer: intranet
Answer: Object-oriented
71. Using software to search for and replace data that contains errors is called
____________.
72. Since the ____________ Act was passed, libraries and bookstores can be
required to turn over their customer records to the FBI.
Answer: USA Patriot
III. Privacy Act of 1974 C. parents must give consent if
an Internet-based business wishes to collect data from children under 13
years of age
Answers: D, A, F, B, G, E, C
III. XML C. process that locates hidden predictive information in large databases
V. DBMS E. data description language designed for database access on the Web
Answers: C, A, E, B, D
Normal Forms
Forms of normalization are given below:
Consider an example relation Customer(cid, name, address, contact_no).
Here, address is a composite attribute, which is further subdivided into the two columns
society and city, and the attribute contact_no is a multivalued attribute.
Problems with this relation are:
o It is not possible to store multiple values in a single field in a relation, so if any customer has
more than one contact number, those numbers cannot be stored.
o Another problem is related to information retrieval. Suppose there is a need to find all
customers belonging to a particular city; this is very difficult to retrieve, because the city name
for every customer is combined with the society name and stored as a whole address.
1. First Approach:
In the first approach, determine the maximum number of values allowed for the multi-valued
attribute. In our case, if a maximum of two numbers is allowed, insert two separate attributes to
store the contact numbers, as shown.
Customer:

cid | name | society   | city  | contact_no1 | contact_no2
C01 | aaa  | Amul avas | Anand | 1234567988  |
Now, if a customer has only one contact number, or no contact number at all, keep the related
field(s) empty in that customer's tuple. If a customer has two contact numbers, store both numbers
in the related fields. If a customer has more than two contact numbers, store two of them and
ignore all the others.
2. Second Approach:
In the second approach, remove the multi-valued attribute that violates 1NF and place it in a
separate relation, along with the primary key of the original relation. The primary key of the new
relation is the combination of the multi-valued attribute and the primary key of the old relation.
For example, in our case, remove the contact_no attribute and place it with cid in a separate
relation Customer_contact. The primary key for the relation Customer_contact will be the
combination of cid and contact_no.
Customer:

cid | name | society   | city
C01 | aaa  | Amul avas | Anand

Customer_contact:

cid | contact_no
C01 | 1234567988
C02 | 123
C02 | 333
C02 | 4445
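A sketch of this second approach in SQL, using the column names above (the column types are assumptions):

    CREATE TABLE Customer (
        cid     CHAR(3) PRIMARY KEY,
        name    VARCHAR(50),
        society VARCHAR(50),
        city    VARCHAR(50)
    );

    CREATE TABLE Customer_contact (
        cid        CHAR(3) REFERENCES Customer(cid),
        contact_no VARCHAR(15),
        PRIMARY KEY (cid, contact_no)  -- multi-valued attribute + old primary key
    );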
The first approach is simple. But it is not always possible to put a restriction on the maximum
number of allowable values, and it also introduces null values for many fields.
The second approach is superior, as it does not suffer from the drawbacks of the first approach,
but it is somewhat more complicated. For example, to display all information about any or all
customers, two relations - Customer and Customer_contact - need to be accessed.
"A relation schema R is in 2NF if it is in First Normal Form, and every non-prime attribute of the
relation is fully functionally dependent on the primary key."
A relation can violate 2NF only when it has more than one attribute in combination as a primary
key. If a relation has only a single attribute as its primary key, then the relation will definitely be
in 2NF.
Example:
Consider the following relation Depositor_Account (primary key: cid and ano):

Depositor_Account(cid, ano, access_date, balance, bname)

In this relation schema, access_date, balance, and bname are non-prime attributes. Among these
three attributes, access_date is fully dependent on the primary key (cid and ano). But balance and
bname are not fully dependent on the primary key; they depend on ano only.
So, this relation is not in Second Normal Form. Such partial dependencies result in data redundancy.
Solution:
Decompose the relation such that the resultant relations do not have any partial functional
dependency. For this purpose, remove the partially dependent non-prime attributes that violate
2NF from the relation. Place them in a separate new relation, along with the prime attribute on
which they fully depend.
In our example, balance and bname are partially dependent on the primary key. So, remove them
and place them in a separate relation called Account, along with the prime attribute ano. For the
relation Account, ano will be the primary key.
The Depositor_Account relation will be decomposed into two separate relations, called
Account_holder and Account.

Account(ano, balance, bname)

Account_holder(cid, ano, access_date)
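The same decomposition sketched in SQL (the column types are assumptions):

    CREATE TABLE Account (
        ano     CHAR(5) PRIMARY KEY,
        balance DECIMAL(12,2),
        bname   VARCHAR(50)
    );

    CREATE TABLE Account_holder (
        cid         CHAR(3),
        ano         CHAR(5) REFERENCES Account(ano),
        access_date DATE,
        PRIMARY KEY (cid, ano)  -- access_date is fully dependent on this key
    );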
Example:
Consider the following relation schema Account_Branch (primary key: ano):

Account_Branch(ano, balance, bname, baddress)
This relation contains the following functional dependencies:
FD1: ano -> {balance, bname, baddress}, and
FD2: bname -> baddress
In this relation schema, there is a functional dependency ano -> bname between ano and bname,
as shown in FD1. Also, there is another functional dependency bname -> baddress between bname
and baddress, as shown in FD2. Moreover, bname is a non-prime attribute. So, there is a transitive
dependency from ano to baddress, denoted by ano -> baddress.
Such transitive dependencies result in data redundancy. In this relation, the branch address will be
stored repeatedly for each account of the same branch, occupying a larger amount of memory.
Solution:
Decompose the relation in such a way that the resultant relations do not have any non-prime
attribute transitively dependent on the primary key. For this purpose, remove the transitively
dependent non-prime attributes that violate 3NF from the relation. Place them in a separate new
relation, along with the non-prime attribute due to which the transitive dependency occurred. The
primary key of the new relation will be that non-prime attribute.
In our example, baddress is transitively dependent on ano due to the non-prime attribute bname.
So, remove baddress and place it in a separate relation called Branch, along with the non-prime
attribute bname. For the relation Branch, bname will be the primary key.
The Account_Branch relation will be decomposed into two separate relations, called Account and
Branch.

Account(ano, balance, bname)

Branch(bname, baddress)
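The 3NF decomposition sketched in SQL (the column types are assumptions):

    CREATE TABLE Branch (
        bname    VARCHAR(50) PRIMARY KEY,
        baddress VARCHAR(100)
    );

    CREATE TABLE Account (
        ano     CHAR(5) PRIMARY KEY,
        balance DECIMAL(12,2),
        bname   VARCHAR(50) REFERENCES Branch(bname)  -- baddress now reachable via Branch
    );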