
CSW241-File Organization and Processing

Organizing Files for Performance

Dr. Riham Moharam


Faculty of Information Technology & Computer Science
Sinai University
North Sinai, Egypt
Outline
➢ Introduction.
➢ Data Compression.
➢ Compression in Unix.
➢ Reclaiming Space in Files.
➢ Finding Things Quickly.

2
Introduction

➢ We will be looking at four different issues:


• Data Compression: how to make files smaller.
• Reclaiming space in files that have undergone deletions and updates.
• Sorting Files in order to support binary searching ==> Internal Sorting.
• A better Sorting Method: KeySorting.

3
Introduction

➢ Question: Why do we want to make files smaller?


Answer:
• To use less storage, saving costs.
• To transmit files faster, either decreasing access time or keeping the same access
time with a lower, cheaper bandwidth.

• To process the file sequentially faster.

4
Data Compression

➢ Data Compression:
• The encoding of data in such a way as to reduce its size; that is, the process of
encoding, restructuring, or otherwise modifying data in order to make it smaller.

➢ Redundancy Compression:
• Any form of compression which removes only redundant information.

5
Data Compression

➢ Advantages:
• Smaller files use less storage space.
• The transfer time of disk access is reduced.
• The transmission time to transfer files over a network is reduced.

➢ Disadvantages:
• Program complexity and size are increased (Encoding/Decoding Module).
• Computation time is increased.
• Cost of Encoding/Decoding Time.

6
Data Compression
➢ Data compression is possible because most data contains redundant (repeated) data
or unnecessary information.

➢ Reversible compression removes only redundant information, making it possible to


restore the data to its original form.

➢ Irreversible compression goes further, removing information which is not actually


necessary, making it impossible to recover the original form.

7
Data Compression

Data Compression Methods:

➢ Lossless methods (text & programs): Run-Length, Huffman, Lempel-Ziv.
➢ Lossy methods (images, audio & video): JPEG, MPEG, MP3.

8
Data Compression
➢ When the data is represented in a sparse array, we can use a type of compression
called: run-length encoding.

➢ Run-length encoding (RLE) is a lossless compression method in which sequences
of redundant data are stored as a single data value together with a count of how
many times that value is repeated.

➢ The goal of RLE is to compress a string by replacing sequences of repeated


characters with a single character followed by the number of times it’s repeated.

9
Data Compression
➢ Example 1:
Input: “AAAABBBCCDAA”

Output: “4A3B2C1D2A”

➢ Example 2:
Input:
“WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB”

Output: “12WB12W3B24WB”

10
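The examples above can be sketched as a short function. This is a minimal sketch, assuming the convention of Example 1, in which the count is always written, even for runs of length one (Example 2's output omits the count for single characters).

```python
def rle_encode(s):
    """Run-length encode a string: each run becomes count + character."""
    if not s:
        return ""
    out = []
    count = 1
    for prev, cur in zip(s, s[1:]):
        if cur == prev:
            count += 1            # extend the current run
        else:
            out.append(f"{count}{prev}")
            count = 1             # start a new run
    out.append(f"{count}{s[-1]}")  # flush the final run
    return "".join(out)
```

For example, `rle_encode("AAAABBBCCDAA")` yields `"4A3B2C1D2A"`, matching Example 1.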
Data Compression

11
Data Compression
➢ Run-length encoding (RLE) works best with simple images and animations that
have a lot of redundant pixels. It is useful for black and white images in particular.

➢ For complex images and animations that do not have many redundant sections,
RLE can make the file size bigger rather than smaller. Therefore it is important to
understand the content and whether this algorithm will help or hinder compression.

12
Data Compression
➢ Huffman Coding assigns variable-length codes that depend on the frequency with
which each letter occurs in the data set.
➢ Example 1: suppose the content is:

➢ There are 10 characters

➢ Encoding Message:
• 1010010011011011001111100

13
Data Compression
➢ Huffman Tree (for easy encoding)
➢ Encoding Message:
• 1010010011011011001111100
➢ Interpret 0’s as “go left” and 1’s as
“go right”.
➢ The codeword for a character
corresponds to the path from the root
of the Huffman tree to the leaf
containing the character.

14
Data Compression
Example 2:

Initial string:
Size = 8*15=120

1. Calculate the frequency of each character in the string.

2. Sort the characters in increasing order of the frequency. These are stored in a
priority queue Q.

15
Data Compression
3. Make each unique character a leaf node.
4. Create an empty node z. Assign the minimum frequency to the left child of z and
assign the second minimum frequency to the right child of z. Set the value of the z as
the sum of the above two minimum frequencies.

16
Data Compression
5. Remove these two minimum frequencies from Q and add the sum into the list of
frequencies.
6. Insert node z into the tree.
7. Repeat steps 4 to 6 until all the characters are merged into a single tree.

17
Data Compression
8. For each non-leaf node, assign 0 to the left edge and 1 to the right edge.

18
Data Compression

Without encoding, the total size of the string was 120 bits. After encoding, the
size is reduced to 32 + 9 + 28 = 69 bits.

19
Data Compression
➢ The techniques we have discussed so far preserve all information in the original
data.

➢ Irreversible compression techniques are based on the assumption that some
information can be sacrificed. (Irreversible compression is also called entropy
reduction.)

20
Constructing Huffman Codes
while there is more than one TREE in the FOREST:
    i = index of the TREE in FOREST with smallest weight
    j = index of the TREE in FOREST with 2nd smallest weight
    create a new node with left child FOREST(i)->root and right child FOREST(j)->root
    replace TREE i in FOREST by a tree whose root is the new node and whose weight
        is FOREST(i)->weight + FOREST(j)->weight
    delete TREE j from FOREST

➢ (A FOREST is a collection of TREES; each TREE has a root and a weight.)


21
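The construction above can be sketched with a priority queue standing in for the FOREST. This is a minimal sketch, not the exact algorithm from the slides: it uses a heap of `(weight, tiebreak, tree)` tuples rather than indexed TREES, and the `tiebreak` counter is only there to keep tuple comparisons away from the tree objects.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build Huffman codes by repeatedly merging the two lowest-weight trees."""
    # A tree is either a single character or a (left, right) pair.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # tree with smallest weight
        w2, _, t2 = heapq.heappop(heap)   # tree with 2nd smallest weight
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # merged tree
        count += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, str):
            codes[tree] = path or "0"     # single-character edge case
        else:
            walk(tree[0], path + "0")     # left edge carries 0
            walk(tree[1], path + "1")     # right edge carries 1
    walk(heap[0][2], "")
    return codes
```

For `"AAAABBC"`, the most frequent character A gets a 1-bit code, so the 7 characters encode in 10 bits instead of 56.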
Reclaiming Space in Files
➢ Record Deletion and Storage Compaction.
• Recognizing Deleted Records
• Reusing the space from the record ==> Storage Compaction.
• Storage Compaction: after deleted records have accumulated for some time, a special
program is used to reconstruct the file with the deleted records squeezed out.

• Storage Compaction can be used with both fixed- and variable-length records.

22
Reclaiming Space in Files
➢ Deleting Fixed-Length Records for Reclaiming Space Dynamically:
➢ In some applications, it is necessary to reclaim space immediately.
➢ To do so, we can:
• Mark deleted records in some special ways
• Find the space that deleted records once occupied so that we can reuse that space
when we add records.

• Come up with a way to know immediately if there are empty slots in the file and
jump directly to them.

➢ Solution: Use an avail (List of Available Space) linked list in the form of a stack.
Relative Record Numbers (RRNs) play the role of pointers.
23
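The avail-stack idea can be sketched as follows. This is a minimal sketch under simplifying assumptions: records live in an in-memory list indexed by RRN, and `"*"` marks a deleted slot; a real implementation would operate on a binary file of fixed-length records.

```python
class FixedRecordFile:
    """Fixed-length record store with an avail stack of deleted slots."""

    def __init__(self):
        self.records = []      # slot i holds the record with RRN i
        self.avail_head = -1   # RRN at the top of the avail stack; -1 = empty

    def delete(self, rrn):
        # Mark the slot deleted; the freed slot stores the RRN of the
        # previous top of the stack, so RRNs play the role of pointers.
        self.records[rrn] = ("*", self.avail_head)
        self.avail_head = rrn

    def add(self, data):
        if self.avail_head == -1:          # no reusable slot: extend the file
            self.records.append(data)
            return len(self.records) - 1
        rrn = self.avail_head              # pop the avail stack
        self.avail_head = self.records[rrn][1]
        self.records[rrn] = data           # reuse the slot immediately
        return rrn
```

Because the stack head is a single RRN stored in the header, finding an empty slot takes one jump rather than a scan of the file.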
Reclaiming Space in Files
➢ Deleting Variable-Length Records for Reclaiming Space Dynamically:
• Same ideas as for Fixed-Length Records, but a different implementation must be
used.

• In particular, we must keep a byte count of each record, and the links to the next
records on the avail list cannot be RRNs; byte offsets are used instead.

24
Storage Fragmentation
➢ Wasted Space within a record is called internal fragmentation.
➢ Variable-Length records do not suffer from internal fragmentation. However,
external fragmentation is not avoided.

➢ 3 ways to deal with external fragmentation:


• Storage Compaction
• Coalescing the holes
• Use a clever placement strategy

25
Placement Strategies I
➢ First Fit Strategy: accept the first available record slot that can accommodate
the new record.

➢ Best Fit Strategy: choose the smallest available record slot that can
accommodate the new record.

➢ Worst Fit Strategy: choose the largest available record slot.

26
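The three strategies can be sketched as choices over an avail list of hole sizes. This is a minimal sketch with assumed inputs: `holes` is a list of slot sizes in bytes and each function returns the index of the chosen slot, or `None` if nothing fits.

```python
def first_fit(holes, need):
    """Accept the first slot large enough for the record."""
    return next((i for i, h in enumerate(holes) if h >= need), None)

def best_fit(holes, need):
    """Choose the smallest slot that still accommodates the record."""
    fits = [(h, i) for i, h in enumerate(holes) if h >= need]
    return min(fits)[1] if fits else None

def worst_fit(holes, need):
    """Choose the largest available slot."""
    fits = [(h, i) for i, h in enumerate(holes) if h >= need]
    return max(fits)[1] if fits else None
```

For `holes = [30, 10, 25, 40]` and a 20-byte record, first fit picks index 0, best fit picks index 2 (the 25-byte hole), and worst fit picks index 3 (the 40-byte hole).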
Placement Strategies II
➢ Some general remarks about placement strategies:
• Placement strategies only apply to variable-length records.
• If space is lost due to internal fragmentation, the choice is between first fit and best
fit. A worst-fit strategy truly makes internal fragmentation worse.

• If the space is lost due to external fragmentation, one should give careful
consideration to a worst-fit strategy.

27
Finding Things Quickly
➢ The cost of Seeking is very high.
➢ This cost has to be taken into consideration when determining a strategy for
searching a file for a particular piece of information.

➢ The same question also arises with respect to sorting, which often is the first step
to searching efficiently.

➢ Rather than simply trying to sort and search, we concentrate on doing so in a way
that minimizes the number of seeks.

28
Finding Things Quickly
➢ So far, the only way we have to retrieve or find records quickly is by using their
RRN (in case the record is of fixed-length).

➢ Without a RRN or in the case of variable-length records, the only way, so far, to
look for a record is by doing a sequential search. This is a very inefficient method.

➢ We are interested in more efficient ways to retrieve records based on their key
values.

29
Finding Things Quickly
➢ Binary Search:
• Let’s assume that the file is sorted and that we are looking for record whose key is
Kelly in a file of 1000 fixed-length records.

[Diagram: 1000 sorted records; comparison 1 probes record 500 (Johnson),
comparison 2 probes record 750 (Monroe), and the next comparison narrows
the range between them.]

30
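The search above can be sketched as follows. This is a minimal sketch assuming two hypothetical helpers: `read_record(rrn)` fetches the record stored at a given RRN (one seek per probe), and `key_of(record)` extracts its key.

```python
def binary_search(read_record, key_of, n, target):
    """Find the RRN whose record has key == target in a sorted file of n records."""
    lo, hi = 0, n - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = key_of(read_record(mid))   # one seek + read per comparison
        if key == target:
            return mid
        if key < target:
            lo = mid + 1                 # target is in the upper half
        else:
            hi = mid - 1                 # target is in the lower half
    return -1                            # not found
```

Searching for "Kelly" among 1000 records takes at most about 10 probes, versus up to 1000 for a sequential scan.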
Finding Things Quickly
➢ Binary Search versus Sequential Search:
➢ Binary search of a file with n records takes O(log2 n) comparisons.
➢ Sequential search takes O(n) comparisons.
➢ When sequential search is used, doubling the number of records in the file
doubles the number of comparisons required for sequential search.

➢ When binary search is used, doubling the number of records in the file only adds
one more guess to our worst case.

➢ In order to use binary search, though, the file first has to be sorted. This can be
very expensive.
31
Finding Things Quickly
➢ Sorting a Disk File in Memory:
➢ If the entire content of a file can be held in memory, then we can perform an
internal sort. Sorting in memory is very efficient.

➢ However, if the file does not fit entirely in memory, any sorting algorithm will
require a large number of seeks. Sorting would, thus, be extremely slow.
Unfortunately, this is often the case, and solutions have to be found.

32
Finding Things Quickly
➢ The limitations of Binary Search and Internal Sorting:
➢ Binary Search requires more than one or two accesses. Accessing a record using
the RRN can be done with a single access ==> We would like to achieve RRN
retrieval performance while keeping the advantage of key access.

➢ Keeping a file sorted is very expensive: in addition to searching for the right
location for the insert, once this location is found, we have to shift records to open
up space for the insertion.

➢ Internal Sorting only works on small files. ==> Keysorting

33
Finding Things Quickly
➢ KeySorting:
➢ Overview: when sorting a file in memory, the only thing that really needs sorting
is the record keys.

➢ Keysort algorithms work like internal sorts, but with 2 important differences:
• Rather than read an entire record into a memory array, we simply read each record
into a temporary buffer, extract the key, and then discard the record.

• If we want to write the records in sorted order, we have to read them a second time.

34
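The two passes described above can be sketched as follows, reusing the same hypothetical `read_record(rrn)` and `key_of(record)` helpers as before. This is a minimal sketch; a real keysort would write the records (or just the index) back to disk rather than return a list.

```python
def keysort(read_record, key_of, n):
    """Sort a file of n records by key, holding only (key, RRN) pairs in memory."""
    index = []
    for rrn in range(n):                      # first pass: keys only
        index.append((key_of(read_record(rrn)), rrn))
    index.sort()                              # internal sort of keys, not records
    # second pass: one random seek per record to emit them in key order
    return [read_record(rrn) for _, rrn in index]
```

Note that the second pass is exactly the limitation the next slide discusses: one seek per record. Writing back only the sorted `index` avoids it.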
Finding Things Quickly
➢ KeySorting:

35
Finding Things Quickly
➢ Limitation of the KeySort Method:
➢ Writing the records in sorted order requires as many random seeks as there are records.
➢ Since writing is interspersed with reading, writing also requires as many seeks as there are
records.

➢ Solution: why bother to write the file of records in key order?

• Simply write back the sorted index instead.
36
Finding Things Quickly
➢ Pinned Records:
➢ Indexes are also useful with regard to deleted records.
➢ The avail list indicating the location of unused records consists of pinned records
in the sense that these unused records cannot be moved since moving them would
create dangling pointers.

➢ Pinned records make sorting very difficult. One solution is to use an ordered
index and not to move the records.

37
