File Organization Lec910
Introduction
Data Compression
➢ Data Compression:
• The encoding of data in such a way as to reduce its size.
• Data compression is the process of encoding, restructuring, or otherwise modifying data in order to reduce its size.
➢ Redundancy Compression:
• Any form of compression which removes only redundant information.
➢ Advantages:
• Smaller files use less storage space.
• The transfer time for disk access is reduced, since there is less data to read.
• The transmission time to transfer files over a network is reduced.
➢ Disadvantages:
• Program complexity and size are increased (an encoding/decoding module is required).
• Computation time is increased.
• There is a time cost for encoding and decoding.
➢ Data compression is possible because most data contains redundant (repeated) data
or unnecessary information.
(Figure: Data Compression Methods)
➢ When the data is represented in a sparse array, we can use a type of compression called run-length encoding (RLE).
➢ Example 1:
Input: “AAAABBBCCDAA”
Output: “4A3B2C1D2A”
➢ Example 2:
Input: “WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB”
Output: “12WB12W3B24WB”
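➢ A minimal sketch (not from the slides) of run-length encoding in Python; it produces the counted form of Example 1. Note that Example 2 above additionally drops the count for runs of length one, a common variant.
def rle_encode(text):
    # Run-length encode a string: "AAAABBBCCDAA" -> "4A3B2C1D2A".
    if not text:
        return ""
    out = []
    run_char, run_len = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(f"{run_len}{run_char}")   # emit the finished run
            run_char, run_len = ch, 1
    out.append(f"{run_len}{run_char}")           # emit the final run
    return "".join(out)

print(rle_encode("AAAABBBCCDAA"))                # prints 4A3B2C1D2A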
➢ Run-length encoding (RLE) works best with simple images and animations that
have a lot of redundant pixels. It is useful for black and white images in particular.
➢ For complex images and animations that do not have many redundant sections,
RLE can make the file size bigger rather than smaller. Therefore it is important to
understand the content and whether this algorithm will help or hinder compression.
➢ Huffman Coding is a variable-length code in which the codeword length depends on the frequency with which each letter occurs in the data set.
➢ Example 1: suppose the content is:
➢ Encoding Message:
• 1010010011011011001111100
➢ Huffman Tree (for easy encoding)
➢ Encoding Message:
• 1010010011011011001111100
➢ Interpret 0’s as “go left” and 1’s as “go right”.
➢ The codeword for a character corresponds to the path from the root of the Huffman tree to the leaf containing that character.
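➢ A minimal decoding sketch in Python. The actual tree and codewords are shown in the slide figure, which is not reproduced here; the code table below is a hypothetical prefix code used only to illustrate the “go left on 0, go right on 1” walk.
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}    # hypothetical codewords

# Build a small tree as nested dicts: each internal node maps "0"/"1" to a child.
root = {}
for ch, code in codes.items():
    node = root
    for bit in code[:-1]:
        node = node.setdefault(bit, {})
    node[code[-1]] = ch                      # a leaf stores the character

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]                     # 0 = go left, 1 = go right
        if isinstance(node, str):            # reached a leaf: emit the character
            out.append(node)
            node = root                      # restart from the root
    return "".join(out)

print(decode("0101100111", root))            # prints ABCAD with the codes above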
Example 2:
Initial string:
Size = 8 * 15 = 120 bits (15 characters at 8 bits each)
1. Calculate the frequency of each character in the string.
2. Sort the characters in increasing order of frequency. These are stored in a priority queue Q.
3. Make each unique character as a leaf node.
4. Create an empty node z. Make the node with the minimum frequency the left child of z and the node with the second minimum frequency the right child of z. Set the value of z to the sum of these two minimum frequencies.
5. Remove these two minimum frequencies from Q and add the sum into the list of
frequencies.
6. Insert node z into the tree.
7. Repeat steps 3 to 5 for all the characters.
8. For each non-leaf node, assign 0 to the left edge and 1 to the right edge.
Without encoding, the total size of the string was 120 bits. After encoding, the size is reduced to 32 + 9 + 28 = 69 bits.
➢ The techniques we have discussed so far preserve all the information in the original data (they are lossless).
Constructing Huffman Codes
While there is more than one TREE in the FOREST:
    Let i and j be the two trees in FOREST with the smallest weights
    Create a new node with left child FOREST(i)--> root and right child FOREST(j)--> root
    Replace TREE i in FOREST by a tree whose root is the new node and whose weight is
        FOREST(i)--> weight + FOREST(j)--> weight
    Delete TREE j from FOREST
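➢ A minimal sketch (not from the slides) of the loop above in Python, using heapq as the priority queue Q from the earlier steps; the frequencies in the example call are illustrative only.
import heapq
from itertools import count

def huffman_codes(freq):
    # Build {character: codeword} by repeatedly merging the two lowest-weight
    # trees in the forest, then labelling left edges 0 and right edges 1.
    tie = count()                                             # tie-breaker for equal weights
    forest = [(w, next(tie), ch) for ch, w in freq.items()]   # one single-leaf tree per character
    heapq.heapify(forest)
    while len(forest) > 1:
        w1, _, left = heapq.heappop(forest)                   # smallest weight
        w2, _, right = heapq.heappop(forest)                  # second smallest weight
        heapq.heappush(forest, (w1 + w2, next(tie), (left, right)))
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):                           # internal node: (left, right)
            walk(node[0], code + "0")                         # 0 = left edge
            walk(node[1], code + "1")                         # 1 = right edge
        else:
            codes[node] = code or "0"                         # leaf: a character
    walk(forest[0][2], "")
    return codes

print(huffman_codes({"A": 5, "B": 1, "C": 6, "D": 3}))        # illustrative frequencies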
Reclaiming Space in Files
➢ Storage Compaction (rewriting the file so that the space left by deleted records is squeezed out) can be used with both fixed- and variable-length records.
➢ Deleting Fixed-Length Records for Reclaiming Space Dynamically:
➢ In some applications, it is necessary to reclaim space immediately.
➢ To do so, we can:
• Mark deleted records in some special way.
• Find the space that deleted records once occupied so that we can reuse that space
when we add records.
• Come up with a way to know immediately if there are empty slots in the file and
jump directly to them.
➢ Solution: Use an avail (List of Available Space) linked list in the form of a stack.
Relative Record Numbers (RRNs) play the role of pointers.
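➢ A minimal sketch (not from the slides) of this idea in Python. The record length, the header layout, and the ‘*’ deletion marker are assumptions for illustration; -1 marks the end of the avail list, and the file is assumed open in "r+b" mode.
RECLEN = 64                  # assumed fixed record length in bytes
HEADER = RECLEN              # assume one header-sized block precedes record 0

def delete_record(f, rrn, head):
    # Mark slot `rrn` as deleted and push it onto the avail stack.
    f.seek(HEADER + rrn * RECLEN)
    f.write(("*" + str(head)).ljust(RECLEN).encode())   # '*' marker + link to the old head
    return rrn                                          # the deleted slot becomes the new head

def add_record(f, data, head):
    # Reuse the slot at the head of the avail stack if there is one.
    if head == -1:                                      # avail list empty: append at the end
        f.seek(0, 2)
        rrn = (f.tell() - HEADER) // RECLEN
    else:                                               # pop the head of the stack
        rrn = head
        f.seek(HEADER + rrn * RECLEN)
        head = int(f.read(RECLEN).decode()[1:])         # follow the link to the next free slot
    f.seek(HEADER + rrn * RECLEN)
    f.write(data.ljust(RECLEN).encode())
    return rrn, head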
➢ Deleting Variable-Length Records for Reclaiming Space Dynamically:
• Same ideas as for Fixed-Length Records, but a different implementation must be
used.
• In particular, we must keep a byte count of each record, and the links to the next record on the avail list cannot be RRNs (byte offsets are used instead).
Storage Fragmentation
➢ Wasted Space within a record is called internal fragmentation.
➢ Variable-Length records do not suffer from internal fragmentation. However,
external fragmentation is not avoided.
Placement Strategies I
➢ First Fit Strategy: accept the first available record slot that can accommodate
the new record.
➢ Best Fit Strategy: choose the smallest available record slot that can accommodate the new record.
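➢ A minimal sketch (not from the slides) of the two strategies over an avail list of (offset, size) entries for variable-length record slots.
def first_fit(avail, needed):
    # Accept the first slot large enough for the new record; None if nothing fits.
    for i, (offset, size) in enumerate(avail):
        if size >= needed:
            return avail.pop(i)
    return None

def best_fit(avail, needed):
    # Accept the smallest slot large enough for the new record; None if nothing fits.
    fits = [(size, i) for i, (offset, size) in enumerate(avail) if size >= needed]
    if not fits:
        return None
    _, i = min(fits)                              # tightest fit leaves the least waste
    return avail.pop(i)

avail = [(0, 120), (200, 40), (400, 64)]          # illustrative (byte offset, size) slots
print(first_fit(list(avail), 50))                 # (0, 120): first slot that fits
print(best_fit(list(avail), 50))                  # (400, 64): smallest slot that fits
In practice, best fit is often implemented by keeping the avail list sorted by size, so the first slot that fits is also the smallest.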
Placement Strategies II
➢ Some general remarks about placement strategies:
• Placement strategies only apply to variable-length records.
• If space is lost due to internal fragmentation, the choice is between first fit and best fit. A worst-fit strategy truly makes internal fragmentation worse.
• If the space is lost due to external fragmentation, one should give careful
consideration to a worst-fit strategy.
Finding Things Quickly
➢ The cost of Seeking is very high.
➢ This cost has to be taken into consideration when determining a strategy for
searching a file for a particular piece of information.
➢ The same question also arises with respect to sorting, which often is the first step
to searching efficiently.
➢ Rather than simply trying to sort and search, we concentrate on doing so in a way
that minimizes the number of seeks.
➢ So far, the only way we have to retrieve or find records quickly is by using their RRN (in the case of fixed-length records).
➢ Without an RRN, or in the case of variable-length records, the only way, so far, to look for a record is by doing a sequential search. This is a very inefficient method.
➢ We are interested in more efficient ways to retrieve records based on their key value.
➢ Binary Search:
• Let’s assume that the file is sorted and that we are looking for the record whose key is Kelly in a file of 1000 fixed-length records.
(Figure: the first comparison probes Johnson, the second probes Monroe, and the next comparison narrows the search further.)
➢ Binary Search versus Sequential Search:
➢ Binary Search of a file with n records takes O(log2 n) comparisons.
➢ Sequential search takes O(n) comparisons.
➢ When sequential search is used, doubling the number of records in the file doubles the number of comparisons required.
➢ When binary search is used, doubling the number of records in the file only adds
one more guess to our worst case.
➢ In order to use binary search, though, the file first has to be sorted. This can be
very expensive.
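➢ A minimal sketch (not from the slides) of binary search by key over a file of fixed-length records sorted by key. RECLEN and KEYLEN (the key padded into the first bytes of each record) are assumptions for illustration; note that every comparison costs one seek.
RECLEN, KEYLEN = 64, 12      # assumed record length and key field length

def binary_search(f, key):
    # Return the RRN of the matching record, or -1 if the key is not present.
    f.seek(0, 2)
    low, high = 0, f.tell() // RECLEN - 1        # RRNs of the first and last records
    while low <= high:
        mid = (low + high) // 2
        f.seek(mid * RECLEN)                     # one seek per comparison
        rec_key = f.read(KEYLEN).decode().strip()
        if rec_key == key:
            return mid
        if rec_key < key:
            low = mid + 1                        # key is in the upper half
        else:
            high = mid - 1                       # key is in the lower half
    return -1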
➢ Sorting a Disk File in Memory:
➢ If the entire content of a file can be held in memory, then we can perform an
internal sort. Sorting in memory is very efficient.
➢ However, if the file does not fit entirely in memory, any sorting algorithm will require a large number of seeks. Sorting would, thus, be extremely slow.
Unfortunately, this is often the case, and solutions have to be found.
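➢ A minimal sketch (not from the slides) of an internal sort: the entire file of fixed-length records is read into memory, sorted by its key field, and written back sequentially. The record and key lengths are the same illustrative assumptions as above.
def internal_sort(path, reclen=64, keylen=12):
    with open(path, "rb") as f:
        data = f.read()                                        # the whole file in memory
    records = [data[i:i + reclen] for i in range(0, len(data), reclen)]
    records.sort(key=lambda r: r[:keylen])                     # sort on the key field
    with open(path, "wb") as f:
        f.write(b"".join(records))                             # one sequential write back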
➢ The limitations of Binary Search and Internal Sorting:
➢ Binary Search requires more than one or two accesses. Accessing a record using
the RRN can be done with a single access ==> We would like to achieve RRN
retrieval performance while keeping the advantage of key access.
➢ Keeping a file sorted is very expensive: in addition to searching for the right location for the insert, once this location is found, we have to shift records to open up the space for insertion.
➢ KeySorting:
➢ Overview: when sorting a file in memory, the only things that really need sorting are the record keys.
➢ Keysort algorithms work like internal sorting, but with two important differences:
• Rather than read an entire record into a memory array, we simply read each record into a temporary buffer, extract the key, and then discard the record.
• If we want to write the records in sorted order, we have to read them a second time.
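➢ A minimal sketch (not from the slides) of keysort: only (key, RRN) pairs are sorted in memory, and the records themselves are re-read in key order when the sorted file is written. The record and key lengths are illustrative assumptions.
def keysort(in_path, out_path, reclen=64, keylen=12):
    keynodes = []                                    # (key, RRN) pairs only
    with open(in_path, "rb") as f:
        rrn = 0
        while True:
            rec = f.read(reclen)                     # read each record into a temporary buffer
            if not rec:
                break
            keynodes.append((rec[:keylen], rrn))     # keep the key, discard the record
            rrn += 1
        keynodes.sort()                              # internal sort of the keys alone
        with open(out_path, "wb") as out:
            for _, rrn in keynodes:                  # one random seek per record: the costly part
                f.seek(rrn * reclen)
                out.write(f.read(reclen))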
➢ Limitation of the KeySort Method:
➢ Writing the records in sorted order requires as many random seeks as there are records.
➢ Since writing is interspersed with reading, writing also requires as many seeks as there are
records.
➢ Pinned records (records that other parts of the file refer to by their physical position, so they cannot be moved) make sorting very difficult. One solution is to use an ordered index and not to move the records.