DBT 1
UNIT 1
ER Diagram
referential integrity constraint - a foreign key either points to an existing primary key or is null
Eg - constraint key_name primary key (attribute_name), or constraint key_name foreign key (attribute_name) references other_table
Functional Dependencies - X → Y holds when the value of X determines the value of Y (eg - if two tuples have the same value for X they must have the same value for Y)
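The definition above can be checked directly over a set of tuples. A minimal sketch in Python (the `fd_holds` helper, attribute names and sample rows are all invented for illustration):

```python
# Check whether an FD X -> Y holds in a set of tuples:
# no two tuples may agree on X but differ on Y.
def fd_holds(tuples, X, Y):
    seen = {}  # projection on X -> projection on Y seen so far
    for t in tuples:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False  # two tuples agree on X but differ on Y
        seen[x_val] = y_val
    return True

rows = [
    {"id": 1, "name": "A", "dept": "CS"},
    {"id": 2, "name": "B", "dept": "CS"},
    {"id": 1, "name": "A", "dept": "EE"},  # same id, different dept
]
print(fd_holds(rows, ["id"], ["name"]))  # True
print(fd_holds(rows, ["id"], ["dept"]))  # False
```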
IR2 - augmentation, if X→Y then XZ→YZ
IR3 - transitive, if X→Y and Y→Z then X→Z
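These inference rules are what the attribute-closure algorithm applies repeatedly: keep adding the right side of any FD whose left side is already covered. A small sketch (the `closure` helper and sample FDs are illustrative):

```python
# Attribute closure X+ under a set of FDs (each FD is a (lhs, rhs) pair of sets).
def closure(X, fds):
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # if the LHS is already in the closure, pull in the RHS
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(closure({"A"}, fds))  # transitivity gives {'A', 'B', 'C'}
```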
Normalization - top-down approach, testing a relation to check which normal form it satisfies;
each subsequent normal form also satisfies all the previous normal forms
BCNF - for every non-trivial dependency X→A that holds, X must be a superkey of R; eg - student and course
together give time, and time gives course, so split into two relations;
the LHS of every dependency should be a candidate key or superkey
4NF - in 1NF we split multi-valued attributes into multiple rows; now if two attributes have
multiple values, eg 3 each, we will need 9 rows for the same person; instead split into two
relations - one with person and the first attribute giving 3 rows, and a second with person and
the second attribute giving 3 rows
test for non-additive or lossless join - create a table with all attributes on top and all relations
involved in the join on the left; fill the whole table with b(i,j); if a relation has an attribute, set
that cell to a(j); then apply the given dependencies to the table: if X gives Y, for all rows where X
has a's, make Y have an a there too; at the end, if one entire row has all a's the join is lossless.
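The matrix test above can be sketched in Python (a simplified chase: within a group of rows that agree on X, the Y symbols are equated, preferring 'a' symbols; all names are invented for the example):

```python
from itertools import count

# attrs: list of attribute names; relations: list of attribute sets;
# fds: list of (X, Y) pairs, each a list of attribute names.
def lossless(attrs, relations, fds):
    fresh = count()
    # initial tableau: ('a', j) if the relation holds attribute j, else a unique b
    table = [[('a', j) if a in R else ('b', next(fresh))
              for j, a in enumerate(attrs)] for R in relations]
    idx = {a: j for j, a in enumerate(attrs)}
    changed = True
    while changed:
        changed = False
        for X, Y in fds:
            groups = {}
            for row in table:  # group rows that agree on the X columns
                key = tuple(row[idx[a]] for a in X)
                groups.setdefault(key, []).append(row)
            for rows in groups.values():
                for a in Y:
                    j = idx[a]
                    syms = {r[j] for r in rows}
                    if len(syms) > 1:
                        best = min(syms)  # ('a', j) sorts before ('b', n)
                        for r in rows:
                            if r[j] != best:
                                r[j] = best
                                changed = True
    # lossless iff some row became all a's
    return any(all(s[0] == 'a' for s in row) for row in table)

attrs = ["student", "course", "time"]
rels = [{"student", "time"}, {"time", "course"}]
fds = [(["student", "course"], ["time"]), (["time"], ["course"])]
print(lossless(attrs, rels, fds))  # True
```

This reproduces the BCNF example from above: {student, time} and {time, course} with student,course → time and time → course give a lossless join.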
3 - Storage
Disk Failures
write failure - a power outage while trying to write, so the data is never
stored; can't complete the write and can't read the partially written part
level 0 - striped disk array without fault tolerance; data broken down into blocks; if one
disk fails all data is lost, so not to be used for mission-critical tasks
level 1 - mirroring and duplexing; one write to both disks or two independent reads per
mirrored pair; high overhead (100% duplication); no rebuild needed in case of disk failure,
just copy from the mirror
level 2 - hamming code ECC (parity); high ratio of ECC disks to data disks for small word
sizes, so inefficient; corrects errors on the fly; high cost but relatively simple design;
transaction rate equal to a single disk; the ECC code confirms correct data or corrects disk errors
level 3 - parallel transfer with parity; striped; high read and write transfer rate; low ratio
of parity to data disks, so high efficiency; resource intensive
level 4 - independent data disks with a shared parity disk; high read efficiency; worst write
transaction rate; low parity-to-data ratio meaning good efficiency
level 5 - independent disks with distributed parity blocks; highest read rate, medium write
rate; high efficiency; parity is distributed block by block across all disks instead of a
separate disk holding only parity blocks; most complex design
level 6 - independent data disk with 2 independent parity schemes, xor generation,
protects against multiple bad block failures, same as level 5 with extended fault tolerance,
two independent parity computations with separate algorithms to give protection against
double disk failure
level 10 - high reliability and high performance; mirroring plus striping; expensive and
high overhead; limited scalability (minimum 4 disks); combination of levels 1 and 0
level 50 - more fault tolerant than raid 5 but double the parity overhead, very expensive,
striped version of raid 3
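The parity idea behind levels 3 through 6 can be illustrated with XOR (a toy sketch, not real controller code): the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the parity and the surviving blocks.

```python
# XOR a list of equal-length blocks together.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"abcd", b"efgh", b"ijkl"   # data blocks on three disks
parity = xor_blocks([d0, d1, d2])        # parity block on a fourth disk

# the disk holding d1 fails; rebuild it from parity + survivors
rebuilt = xor_blocks([d0, d2, parity])
print(rebuilt == d1)  # True
```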
flash storage - sits between DRAM and magnetic disk; high density and performance; fast access
speed, but an entire block needs to be erased before it can be rewritten; USB (universal serial
bus) drives are the most common form
database buffer - minimizes the number of transfers between main memory and disk, since it is
difficult to keep so many blocks in main memory; reduces the latency of accessing disk; temporary storage
buffer manager - responsible for allocating buffer space; allocates space for data not already in
the buffer; if there is no space it throws out an existing block; a request for a block brings it from
disk to memory and passes the address of the block back, making it available for the
user to access (including when the user wants a previously thrown-out block)
replacement strategy - typically LRU, least recently used
pinned blocks - blocks that are not allowed to be written back to disk (e.g. while an update is in progress)
forced output block - written back to disk even when the space is not needed in the buffer, so that
data isn't lost in case of a system crash
Blocking factor = floor(B/R),
where B is the block length and R is the fixed
length of each record; floor because only whole
records fit in a block
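A quick worked example of the formula (the block and record sizes are illustrative):

```python
import math

# bfr = floor(B / R): how many fixed-length records fit in one block
def blocking_factor(block_size, record_size):
    return math.floor(block_size / record_size)  # floor: whole records only

# e.g. 512-byte blocks, 100-byte records -> 5 records per block,
# leaving 512 - 5*100 = 12 bytes unused per block
print(blocking_factor(512, 100))  # 5
```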
records are stored on disk in blocks; an
offset table gives the offset of each record
within the block; a record's address is the
block's physical address plus the offset
in SQL Server - data is stored as pages of size 8 KB each, each starting with a 96-byte
header
swizzling - when one record in memory is referencing or pointing to another record in
memory, the pointer's database address normally has to be translated via a translation table to a
virtual memory address before it can be followed; by "swizzling" we cut out the latency of using
the translation table and directly use a pointer to the required location within memory itself
types of swizzling -
automatic - as soon as a block comes into memory, its addresses and pointers are added to the table and swizzled
on demand - leave all pointers unswizzled; when a pointer is first referenced, swizzle it
memory back to disk → unswizzle: the memory address needs to be replaced by the database
address again
a block is said to be pinned if it can't be sent back to disk safely; a bit in the header indicates this;
need to make sure a block is not pinned before unswizzling it and writing it back
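On-demand swizzling can be sketched as follows (a toy model: the `Ptr` class, `translation` table and addresses are invented for illustration, not a real DBMS API):

```python
# translation table: database address -> in-memory record
translation = {"db:1002": {"name": "record B"}}

class Ptr:
    def __init__(self, db_addr):
        self.db_addr = db_addr
        self.ref = None          # direct memory reference once swizzled

    def follow(self):
        if self.ref is None:     # unswizzled: go through the translation table
            self.ref = translation[self.db_addr]
        return self.ref          # swizzled: direct access, no table lookup

p = Ptr("db:1002")
print(p.follow()["name"])  # first follow swizzles the pointer
print(p.ref is not None)   # True: later follows skip the table
```

Unswizzling on write-back would be the reverse: `p.ref` is cleared and only `p.db_addr` is stored on disk.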
if a record doesn't fit in one block, keep a pointer to the next
block and a previous pointer from that block back to the current block;
bits in the header mention whether the record is a fragment,
and whether it is the first fragment or the last fragment.
Record Modification -
insert - if the file is not ordered, make space anywhere; if ordered, find the
location where the record should be stored; if there is no space, slide records to make space;
if still no room, create an overflow block
index structure - structure for faster retrieval; an index file associates a value of the search key with
a location in the data file; primary index on the primary key, secondary index on other attributes
secondary indexes - unlike a primary index, find records given the value of one or more fields;
returns the current record address; place a bucket in between for indirection
document retrieval - uses inverted indexes; every word is treated as a boolean attribute recording whether
that word exists in the document; a secondary index for each word; use buckets for the inverted index's indirection
B-trees - root at the first layer, leaves at the last layer; keys in leaf nodes are copies from the data file and
are ordered/sorted from left to right; allow lookup, insertion and deletion of blocks with very
little disk I/O; if anything changes, only two leaves and their parent are affected.
Hash tables - use a hash function to hash the key, get a bucket number and store the record at that
bucket; when searching, use the same hash function to find where it is stored; distributes
records evenly among the buckets
insert - for key K, compute h(K); if the bucket for h(K) has space, insert; else insert into an
overflow block
delete - in the h(K) bucket, delete the record for K; move the remaining records around to consolidate
the available space
efficient since a lookup is one disk I/O and insert and delete are only two disk I/Os, but with
increase in size more overflow blocks appear, degrading toward linear search
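The insert/delete behaviour with overflow blocks can be sketched like this (a toy in-memory model; the class, method names and block capacity are invented):

```python
# Static hash table: n buckets, each a chain of blocks holding `capacity` keys.
class HashTable:
    def __init__(self, n_buckets, capacity=2):
        self.n = n_buckets
        self.capacity = capacity
        self.buckets = [[[]] for _ in range(n_buckets)]  # one primary block each

    def insert(self, key):
        blocks = self.buckets[hash(key) % self.n]
        if len(blocks[-1]) >= self.capacity:
            blocks.append([])          # primary block full: add an overflow block
        blocks[-1].append(key)

    def delete(self, key):
        blocks = self.buckets[hash(key) % self.n]
        for blk in blocks:
            if key in blk:
                blk.remove(key)
        while len(blocks) > 1 and not blocks[-1]:
            blocks.pop()               # consolidate: drop empty overflow blocks

    def lookup(self, key):
        return any(key in blk for blk in self.buckets[hash(key) % self.n])

t = HashTable(4)
for k in [1, 5, 9, 13]:   # all hash to the same bucket -> overflow block
    t.insert(k)
print(t.lookup(9))   # True
t.delete(9)
print(t.lookup(9))   # False
```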
Extensible hashing - convert the hash into binary, take the first i bits and use them to
index into the bucket array
insert - for key K, compute h(K), take the first i bits and index into the bucket array;
follow the pointer in the bucket array to block B; if there is
space, enter the record; if not, check the number j stored in the block, which tells how many
bits of the hash value are used to determine membership in it; if j < i, split
block B into two, distribute the records on the next bit and adjust the pointers; if j = i,
first increment i by 1, doubling the length of the bucket array to make
space, then split
Example - with i = 1 and blocks holding two records, inserting 0000 puts it in the
bucket for first bit 0; if that bucket is already full, the block is split, its j becoming
2; since j was equal to i, the bucket array also doubles (i = 2), so 00 and 01 now point
to separate buckets; for the blocks where j equals i
there is only one directory entry mapping to them; for blocks where j < i there are
multiple mappings (2^(i−j) entries)
→ if i and j are equal we need to double the global depth by incrementing i by 1; if
before it was 2 it considered 00 01 10 and 11, now it considers 3 bits, giving length 8
→ if j is smaller, we only need to split that block and increment its j
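The whole scheme can be sketched in Python (a simplified model with 4-bit hashes and 2-record blocks; the class and method names are invented):

```python
# Extensible hashing: i = global depth (directory bits), each block
# stores its local depth j and up to `capacity` keys.
class Block:
    def __init__(self, j):
        self.j = j        # local depth: bits determining membership
        self.keys = []

class ExtensibleHash:
    def __init__(self, bits=4, capacity=2):
        self.bits = bits              # hash values are `bits`-bit ints
        self.capacity = capacity
        self.i = 1                    # global depth: 2**i directory entries
        self.dir = [Block(1), Block(1)]

    def _index(self, h):
        return h >> (self.bits - self.i)   # first i bits of the hash

    def lookup(self, h):
        return h in self.dir[self._index(h)].keys

    def insert(self, h):
        blk = self.dir[self._index(h)]
        if len(blk.keys) < self.capacity:
            blk.keys.append(h)
            return
        if blk.j == self.i:           # j = i: double the directory first
            self.i += 1
            self.dir = [b for b in self.dir for _ in (0, 1)]
        blk.j += 1                    # j < i now: split on the new bit
        new = Block(blk.j)
        for k, b in enumerate(self.dir):
            # entries whose bit j (from the top) is 1 move to the new block
            if b is blk and (k >> (self.i - blk.j)) & 1:
                self.dir[k] = new
        old, blk.keys = blk.keys + [h], []
        for key in old:               # redistribute the old keys
            self.insert(key)

ht = ExtensibleHash()
for h in [0b0000, 0b0001, 0b0111]:
    ht.insert(h)
print(ht.i)               # 2: the third insert doubled the directory
print(ht.lookup(0b0111))  # True
```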
Linear hash tables - n buckets; overflow allowed but kept below 1 overflow block on average per bucket;
number of bits used to address a bucket is given by ceiling(log2(n))
Multiple key indexes - two layers of indexing; the first indexes the first attribute and the second indexes the
second attribute
query for age = 50 and salary > 50 ⇒ look up 50 in the first-layer index, then the pointer to the
index table of the second attribute will get the rest of the matching records
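A toy version of the two-layer lookup (the index contents, ages, salaries and record ids are all invented):

```python
# first layer indexes age; each entry points to a second-level index on salary
index = {
    50: [(40, "r1"), (60, "r2"), (75, "r3")],  # sorted (salary, record_id)
    51: [(55, "r4")],
}

def query(age, min_salary):
    # follow the age entry, then scan the salary index for values above the bound
    return [rid for sal, rid in index.get(age, []) if sal > min_salary]

print(query(50, 50))  # ['r2', 'r3']
```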
R-trees - sub-regions can have overlaps
search - start at the root, examine which subregions contain point P; if 0 regions, point P is not in any data
region; if at least 1 region, recursively search for P in those children
insert - start at the root, find a subregion into which region R fits; if more than one is possible, pick one;
if no subregion fits, expand the subregion whose expansion is as small as possible; then insert into the
leaf; if the leaf is full, split the leaf
Bitmap indexes - if an attribute has few possible values (a traffic light can be green,
yellow or red) use a bitmap; let's say there are 7 traffic lights, the first three are red, the next 3
are yellow and the last one is green, then the bitmaps for them will be - red: 1110000, yellow:
0001110, green: 0000001;
to check when a light is red or yellow, OR the two bitmaps to get 1111110; if for another attribute you get
1011001, AND these two to get 1011000 - only those rows satisfy the query
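The same computation can be done with Python ints as bit vectors (reproducing the values from the notes; the leftmost bit is the first traffic light):

```python
# bitmaps for the traffic-light example, one bit per row
red    = int("1110000", 2)
yellow = int("0001110", 2)
green  = int("0000001", 2)
other  = int("1011001", 2)   # bitmap for some other predicate

red_or_yellow = red | yellow  # rows where the light is red OR yellow
both = red_or_yellow & other  # ...AND the other predicate also holds

print(format(red_or_yellow, "07b"))  # 1111110
print(format(both, "07b"))           # 1011000
```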
indexes are automatically created for primary key and unique key constraints - called implicit
indexes