
Chapter 6 Organizing Files For Performance Not Complete

CS2202‐File Organization

2021‐2022
Chapter 6
Organizing Files for
Performance
File Compression
Contents

• We will be looking at four different issues:


• Data Compression: how to make files
smaller
• Reclaiming space in files that have
undergone deletions and updates
• Sorting Files in order to support binary
searching ==> Internal Sorting
• A better Sorting Method: KeySorting
Overview

• In this lecture, we continue to focus on
file organization, but with a different
motivation.
• This time we look at ways to organize or
re-organize files in order to improve
performance.
Data Compression
• Introduction to Data Compression
• Techniques of Data Compression
• Compact Notation
• Run-Length Encoding
• Variable length codes: Huffman Coding
• Lempel-Ziv Codes
Data Compression
• Data Compression = Encoding the information in a file in
such a way that it takes less space.
• Reasons for data compression
• less storage
• transmitting faster, decreasing access time
• processing faster sequentially
Using Compact Notation
• Ex.: a file with fields lastname, province, postal code, etc.
• Province field uses 2 bytes (e.g. 'ON', 'BC') but there are only
13 provinces and territories which could be encoded by
using only 4 bits (compact notation).
• 16 bits are encoded by 4 bits (12 bits were redundant, i.e.
added no extra information)
• Disadvantages
• The field "province" becomes unreadable by humans
• Time is spent encoding ("ON" ➔ 0001) and decoding
(0001 ➔ 'ON')
• It increases the complexity of software (Encoding/
Decoding Module)
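The encoding/decoding module mentioned above can be sketched as a small lookup table. This is only an illustration: the table order is hypothetical (only "ON" ➔ 0001 comes from the slide), and the names kProvinces, encodeProvince, decodeProvince and packPair are made up for this sketch.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical code table: a province's 4-bit code is its index in this array.
// Only "ON" -> 0001 is taken from the slide; the rest of the order is arbitrary.
const std::array<std::string, 13> kProvinces = {
    "AB", "ON", "BC", "MB", "NB", "NL", "NS",
    "NT", "NU", "PE", "QC", "SK", "YT"};

// Returns the 4-bit code (0..12) for a two-letter abbreviation, or -1 if unknown.
int encodeProvince(const std::string& abbr) {
    for (std::size_t i = 0; i < kProvinces.size(); ++i)
        if (kProvinces[i] == abbr) return static_cast<int>(i);
    return -1;
}

// Returns the abbreviation for a code, or an empty string if the code is invalid.
std::string decodeProvince(int code) {
    if (code < 0 || code >= static_cast<int>(kProvinces.size())) return "";
    return kProvinces[static_cast<std::size_t>(code)];
}

// Packs two 4-bit codes into one byte, so two province fields share 8 bits.
unsigned char packPair(int first, int second) {
    return static_cast<unsigned char>(((first & 0x0F) << 4) | (second & 0x0F));
}
```

The table itself is the extra software complexity the slide warns about: every program that reads the file needs the same table to make the field readable again.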
Run-Length Encoding
• Good for files in which runs of the same byte are
frequent.
• Example: an image of the sky
• A pixel is represented by 8 bits
• Background is represented by the pixel value 0
• The idea is to avoid repeating, say, 200 bytes equal to 0
and to represent them by (0, 200).
Suppressing Repeating Sequences
• Run-length encoding algorithm
• read through the pixels, copying pixel values to the file in
sequence, except where the same pixel value occurs more than
once in succession
• when the same value occurs more than once in
succession, substitute the following three bytes:
• a special run-length code indicator (e.g. 0xFF)
• the pixel value that is repeated
• the number of times that value is repeated
• Example:
• 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
• RL-coded stream: 22 23 ff 24 07 25 ff 26 06 25 24
• 18 bytes reduced to 11 bytes
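A minimal sketch of the algorithm above, treating the example values as hexadecimal bytes. Two details are assumptions that the slide does not state: a run is capped at 255 so the count fits in one byte, and a literal 0xFF in the data is always written as a run so the decoder cannot confuse it with the indicator. The names Bytes, kRunIndicator, rleEncode and rleDecode are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
const std::uint8_t kRunIndicator = 0xFF;  // run-length code indicator from the slide

Bytes rleEncode(const Bytes& in) {
    Bytes out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint8_t value = in[i];
        std::size_t run = 1;
        // Count how many times `value` repeats (assumption: at most 255 per run).
        while (i + run < in.size() && in[i + run] == value && run < 255) ++run;
        if (run > 1 || value == kRunIndicator) {
            out.push_back(kRunIndicator);                    // indicator
            out.push_back(value);                            // repeated value
            out.push_back(static_cast<std::uint8_t>(run));   // repeat count
        } else {
            out.push_back(value);                            // single value copied as is
        }
        i += run;
    }
    return out;
}

Bytes rleDecode(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == kRunIndicator && i + 2 < in.size()) {
            // Expand (indicator, value, count) back into `count` copies of `value`.
            out.insert(out.end(), static_cast<std::size_t>(in[i + 2]), in[i + 1]);
            i += 2;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Note that a run of only two equal bytes becomes three bytes under this rule, which is one way the compressed output can end up larger than the input.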
Suppressing Repeating Sequences
• Run-length encoding (cont’d)
• example of redundancy reduction
• cons.
• does not guarantee any particular amount of space savings
• under some circumstances, the compressed image is
larger than the original image
Variable Length Codes, Morse Code
• Morse code: one of the oldest and most common variable-length
codes
• Some values occur more frequently than others
• Those values should take the least amount of space
• Morse Code (each letter is assigned a sequence of two symbols, dot and dash)
A ._
B _...
...
E .
F .._.
...
T _
U .._
• Since E and T are the most frequent letters, they are
assigned the shortest codes, . and _ respectively
Huffman coding
• Huffman Code is a variable length code, but unlike Morse
Code the encoding depends on the frequency of letters
occurring in the data set.
• based on probability of occurrence
• determine probabilities of each value occurring
• build binary tree with search path for each value
• more frequently occurring values are given shorter
search paths in tree
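The tree-building steps above can be sketched as follows. This is a minimal illustration, not the deck's own implementation: it counts frequencies from the input text, repeatedly merges the two lightest nodes, and reads each codeword off the path from the root (0 = left, 1 = right). The names Node and huffmanCodes are hypothetical. Ties between equal weights may be broken differently than in a hand-drawn tree, so individual codewords can differ while the total encoded length stays the same.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::size_t weight;  // number of occurrences under this node
    int left;            // index of left child, -1 for a leaf
    int right;           // index of right child, -1 for a leaf
    char symbol;         // meaningful only for leaves
};

std::map<char, std::string> huffmanCodes(const std::string& text) {
    std::map<char, std::size_t> freq;
    for (char c : text) ++freq[c];

    std::vector<Node> nodes;
    using Item = std::pair<std::size_t, int>;  // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (const auto& kv : freq) {
        nodes.push_back(Node{kv.second, -1, -1, kv.first});
        heap.push(Item(kv.second, static_cast<int>(nodes.size()) - 1));
    }
    // Merge the two least frequent nodes until a single root remains.
    while (heap.size() > 1) {
        Item a = heap.top(); heap.pop();
        Item b = heap.top(); heap.pop();
        nodes.push_back(Node{a.first + b.first, a.second, b.second, '\0'});
        heap.push(Item(a.first + b.first, static_cast<int>(nodes.size()) - 1));
    }

    std::map<char, std::string> codes;
    if (nodes.empty()) return codes;
    // Walk the tree; the path to each leaf is that symbol's codeword.
    std::vector<std::pair<int, std::string>> stack;
    stack.push_back(std::make_pair(heap.top().second, std::string()));
    while (!stack.empty()) {
        std::pair<int, std::string> cur = stack.back();
        stack.pop_back();
        const Node& n = nodes[static_cast<std::size_t>(cur.first)];
        if (n.left == -1) {
            // A one-symbol alphabet still needs a non-empty codeword.
            codes[n.symbol] = cur.second.empty() ? "0" : cur.second;
            continue;
        }
        stack.push_back(std::make_pair(n.left, cur.second + "0"));
        stack.push_back(std::make_pair(n.right, cur.second + "1"));
    }
    return codes;
}
```

Because frequent symbols are merged last, they end up closer to the root and receive the shorter search paths.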
Huffman Code Example I
• Suppose the file content is:

• Total: 10 characters

• Encoded message:
1010010011011011001111100

• 25 bits rather than 80 bits (10 bytes)!


Huffman Code Example I
• Interpret 0's as "go left" and
the 1's as "go right".
• A codeword for a character
corresponds to the path
from the root of the
Huffman tree to the leaf
containing the character.
• Following the labeled edges
in the Huffman tree we
decode the above message.
• 1010 leads us to I
• 01 leads us to /b
• 00 leads us to A
• 11 leads us to M
• 01 leads us to /b
• etc.
Huffman Code Example II
Lempel-Ziv Codes
• There are several variations of Lempel-Ziv Codes.
• We will look at LZ78
• Used in many applications
Example
• Let us look at an example for an alphabet having only two
letters:
aaababbbaaabaaaaaaabaabb
• Rule
• Separate this stream of characters into pieces of text so
that each piece is the shortest string of characters that
we have not seen yet.
a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
Example
a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
1. We see "a".
2. "a" has been seen. We now see "aa".
3. We see "b".
4. "a" has been seen. We now see "ab".
5. "b" has been seen. We now see "bb".
6. "aa" has been seen. We now see "aaa".
7. "b" has been seen. We now see "ba".
8. "aaa" has been seen. We now see "aaaa".
9. "aa" has been seen. We now see "aab".
10. "aab" has been seen. We now see "aabb".
Example
• Index the pieces from 1 to n. In the previous example:

Index:  0     1  2   3  4   5   6    7   8     9    10
Piece:  null  a  aa  b  ab  bb  aaa  ba  aaaa  aab  aabb
(index 0 = null string)

Encoding (index of a previously seen piece + new character):

Index:     1   2   3   4   5   6   7   8   9   10
Encoding:  0a  1a  0b  1b  3b  2a  3a  6a  2b  9b
Lempel-Ziv Codes
• Since each piece is the concatenation of a piece already seen
with a new character, the message can be encoded by a
previous index plus a new character.
• A tree can be built when encoding
• Note that this tree is not binary in general. Here, it is
binary because the alphabet has only 2 letters.
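The parsing rule and the (previous index, new character) encoding can be sketched as below. This is a minimal illustration that keeps the dictionary in a std::map rather than a tree. The function name lz78Encode is hypothetical, and emitting a NUL character for a leftover piece at the end of the input is an assumption of this sketch, not something the slides specify.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Each output pair is (index of the longest previously seen piece, new character).
// Index 0 stands for the null string.
std::vector<std::pair<int, char>> lz78Encode(const std::string& text) {
    std::map<std::string, int> dictionary;  // piece -> index, starting at 1
    std::vector<std::pair<int, char>> output;
    std::string piece;    // piece currently being grown
    int prefixIndex = 0;  // dictionary index of `piece`

    for (char c : text) {
        const std::string candidate = piece + c;
        std::map<std::string, int>::const_iterator it = dictionary.find(candidate);
        if (it != dictionary.end()) {
            // Already seen: keep extending the piece.
            piece = candidate;
            prefixIndex = it->second;
        } else {
            // Shortest string not seen yet: emit it and add it to the dictionary.
            output.push_back(std::make_pair(prefixIndex, c));
            const int nextIndex = static_cast<int>(dictionary.size()) + 1;
            dictionary[candidate] = nextIndex;
            piece.clear();
            prefixIndex = 0;
        }
    }
    // Assumption: a trailing piece that was already seen is flushed with a NUL.
    if (!piece.empty()) output.push_back(std::make_pair(prefixIndex, '\0'));
    return output;
}
```

Applied to the two-letter example stream, the pairs read 0a, 1a, 0b, 1b, 3b, 2a, 3a, 6a, 2b, 9b, matching the encoding table.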
Exercise #1
• Encode the file containing the following characters, drawing
the corresponding tree

"aaabbcbcdddeab"
Solution
Encoding Tree
Exercise #2
• Encode the file containing the following characters, drawing
the corresponding digital tree

"I AM SAM. SAM I AM"


Solution
Encoding Tree
Irreversible (Lossy) Compression
Techniques
• All previous techniques preserve all the information in the
original data.
• Irreversible compression is used when some information can
be sacrificed.
• Less common in data files
• Shrinking a raster image
• 400-by-400 pixels to 100-by-100 pixels
• 1 pixel in the new image for each 16 pixels in the original
image
• Speech compression
• voice coding (the lost information is of little or no
value)
Reclaiming Spaces in
Files
Motivation
• Let us consider a file of records (fixed length or
variable length)
• We know how to create a file, how to add records to
a file, modify the content of a record. These actions
can be performed physically by using the various
basic file operations we have seen.
• What happens if records need to be deleted?
• There is no basic operation that allows us to remove
part of a file.
Motivation
• Modification of a variable-length record (new record
is longer than original record )
1. append the extra data to the end of the file and
put a pointer from the original record space to
the extension => slower
2. rewrite the whole record at the end of the file (if
the file is not sorted), leaving a hole at the original
location => wasted space
• Record deletion should be taken care of by the
program responsible for file organization
Reclaiming Space in Files
• Three forms of modification
1. record addition
2. record updating : deletion -> addition
3. record deletion
Record Deletion and Storage
Compaction
• Approach to record deletion: place a special mark in a
special field of each deleted record (e.g. an asterisk in
the first field): Fig. (a), (b)
• a program reading the file ignores records marked as deleted
• advantage
• A record can be undeleted with very little effort
• disadvantage
• The space is not reused for a while (reuse relies on storage
compaction)
Record Deletion and Storage
Compaction
1. Storage compaction
• make files smaller by looking for unused places in
a file and then recovering this space (how often ?)
• a special program reconstructs a file with all the
deleted records squeezed out : Fig. (c)
• Compaction methods
1. through a file copy program (out of place)
2. through a more complicated and time-consuming
compacting algorithm (in place)
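A minimal sketch of the out-of-place (file copy) method, under the assumption that the file is modelled as a vector of record strings and that a deleted record carries an asterisk as its first character. The names markDeleted and compactByCopy are illustrative.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Marks a record as deleted by overwriting its first character with '*'.
void markDeleted(std::vector<std::string>& file, std::size_t rrn) {
    if (rrn < file.size() && !file[rrn].empty()) file[rrn][0] = '*';
}

// Storage compaction by copying: returns a new "file" with every
// marked record squeezed out. The original file is left untouched.
std::vector<std::string> compactByCopy(const std::vector<std::string>& file) {
    std::vector<std::string> compacted;
    for (const std::string& record : file) {
        const bool deleted = !record.empty() && record[0] == '*';
        if (!deleted) compacted.push_back(record);
    }
    return compacted;
}
```

Until compactByCopy runs, a marked record still holds its other fields, which is why undeleting is cheap; after compaction the record is gone and every later record has a new RRN.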
Example
Strategies for Record Deletion
2. Deleting Fixed-Length Records and Reclaiming Space
Dynamically
▪ In some applications, it is necessary to reclaim space
immediately.
▪ To do so, we must:
▪ Mark deleted records in some special ways.
▪ Find the space that deleted records once occupied,
so that we can reuse that space when we add
records.
▪ Come up with a way to know immediately if there are
empty slots in the file and jump directly to them.
▪ Solution: Use an avail linked list in the form of a stack.
Relative Record Numbers (RRNs) play the role of
pointers.
Example

• If we add a record, it goes into the first available slot on the
avail list (RRN=4), and the header then points to RRN=2
• If we delete a record (Edwards), the header's RRN becomes 5, and
RRN 5 links to RRN 4
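The avail list as a stack of RRNs can be sketched in memory as follows. This is only an illustration under assumptions: the file is a vector of slots, the header is a single integer holding the RRN at the top of the stack (-1 when the list is empty), and the class and member names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class FixedLengthFile {
public:
    // Adds a record, reusing the slot at the top of the avail list if there is one.
    // Returns the RRN where the record was stored.
    int addRecord(const std::string& data) {
        if (availHead_ != -1) {
            const int rrn = availHead_;
            Slot& slot = slots_[static_cast<std::size_t>(rrn)];
            availHead_ = slot.nextFree;  // pop: header now points to the next free slot
            slot.deleted = false;
            slot.nextFree = -1;
            slot.data = data;
            return rrn;
        }
        Slot slot;
        slot.deleted = false;
        slot.nextFree = -1;
        slot.data = data;
        slots_.push_back(slot);  // no free slot: append at the end of the file
        return static_cast<int>(slots_.size()) - 1;
    }

    // Marks a record as deleted and pushes its RRN onto the avail list.
    bool deleteRecord(int rrn) {
        if (rrn < 0 || rrn >= static_cast<int>(slots_.size())) return false;
        Slot& slot = slots_[static_cast<std::size_t>(rrn)];
        if (slot.deleted) return false;
        slot.deleted = true;
        slot.nextFree = availHead_;  // link to the previous top of the stack
        slot.data.clear();
        availHead_ = rrn;            // push: header now points to this slot
        return true;
    }

    int availHead() const { return availHead_; }

private:
    struct Slot {
        bool deleted;
        int nextFree;  // RRN of the next free slot, used only while deleted
        std::string data;
    };
    std::vector<Slot> slots_;
    int availHead_ = -1;  // header: RRN of the first free slot, -1 if none
};
```

Both operations touch only the header and one slot, so finding a free slot never requires scanning the file.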
Strategies for Record Deletion

3. Deleting Variable-Length Records
▪ Use an AVAIL LIST as before, but take care of the
variable-length difficulties
▪ Each record in the AVAIL LIST must store its size as a
field.
▪ RRNs cannot be used; exact byte offsets must
be used instead
▪ Adding a record requires finding a large enough
record slot in the AVAIL LIST.
Example of Deletion
Removal of a record from an avail list
Storage Fragmentation
• Wasted Space within a record is called internal
fragmentation.
• Variable-Length records do not suffer from internal
fragmentation. However, external fragmentation is
not avoided.
• 3 ways to deal with external fragmentation:
1. Storage Compaction.
2. Coalescing the holes.
• If two record slots on the avail list are physically
adjacent, combine them to make a single,
larger record slot.
3. Using a clever placement strategy
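Coalescing the holes can be sketched as follows, assuming each avail list entry is known by its byte offset and size. The struct Hole and the function coalesce are hypothetical names; sorting by offset first is a choice made for this sketch so that physically adjacent holes become neighbours in the list.

```cpp
#include <algorithm>
#include <vector>

struct Hole {
    long offset;  // byte offset of the free slot in the file
    long size;    // size of the free slot in bytes
};

// Combines physically adjacent holes into single, larger holes.
std::vector<Hole> coalesce(std::vector<Hole> holes) {
    std::sort(holes.begin(), holes.end(),
              [](const Hole& a, const Hole& b) { return a.offset < b.offset; });
    std::vector<Hole> merged;
    for (const Hole& hole : holes) {
        // Adjacent means the previous hole ends exactly where this one starts.
        if (!merged.empty() &&
            merged.back().offset + merged.back().size == hole.offset) {
            merged.back().size += hole.size;
        } else {
            merged.push_back(hole);
        }
    }
    return merged;
}
```

For instance, holes at offsets 40 (size 10), 50 (size 50) and 100 (size 20) merge into one hole of size 80 at offset 40.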
Placement Strategies
1. First Fit Strategy
– Avail list is not sorted by size.
– Choose the first available record slot that
can hold the new record.
• Example
– Avail list: size=10, size=50, size=22, size=60
– record to be added: size=20
– Which record from AVAIL LIST is used for the
new record?
– Choose size=50
Placement Strategies
2. Best Fit Strategy
– Avail list is sorted by size.
– Choose the smallest available record slot that can hold
the new record.
– After inserting the new record, the free area left may be
too small to be useful.
▪ May cause serious external fragmentation
(dependent on the implementation).
▪ Increase the search time for the best-fit space.
• Example
– Avail list: size=10, size=22, size=50, size=60
– New record: size=20
– Which record from AVAIL LIST is used for the new record?
– Choose size=22
Placement Strategies
3. Worst Fit Strategy
– Avail list is sorted by decreasing order of size.
– Largest record is used for holding new record;
unused space is placed again in AVAIL LIST.
Example
– Avail list: size=60, size=50, size=22, size=10
– New record: size=20
– Which record from AVAIL LIST is used for the new
record?
– Choose size=60
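The three strategies can be sketched on a list of slot sizes. This is a simplified illustration: it returns the size of the chosen slot (or -1 if nothing fits) and does not put leftover space back on the list. Best fit and worst fit are written as first fit on a sorted copy of the list, which mirrors how the slides describe the avail list ordering; the function names are hypothetical.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// First fit: take the first slot, in list order, that can hold the record.
int firstFit(const std::vector<int>& availSizes, int recordSize) {
    for (int size : availSizes)
        if (size >= recordSize) return size;
    return -1;  // no slot is large enough
}

// Best fit: with the list in ascending order, the first slot that fits
// is also the smallest slot that fits.
int bestFit(std::vector<int> availSizes, int recordSize) {
    std::sort(availSizes.begin(), availSizes.end());
    return firstFit(availSizes, recordSize);
}

// Worst fit: with the list in descending order, the first slot is the largest.
int worstFit(std::vector<int> availSizes, int recordSize) {
    std::sort(availSizes.begin(), availSizes.end(), std::greater<int>());
    return firstFit(availSizes, recordSize);
}
```

With the avail list 10, 50, 22, 60 and a new record of size 20, these pick 50, 22 and 60 respectively, as in the three examples.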
How to Choose Between Strategies
• We must consider two types of fragmentation within
a file:
• Internal Fragmentation
– wasted space within a record.
• External Fragmentation
– space is available on the AVAIL LIST, but it is so small
that it cannot be reused.
Binary Searching,
Keysorting &
Indexing
Content

• Binary Searching
• Keysorting
• Introduction to Indexing
Binary Searching
• Let us consider fixed-length records that must be
searched by a key value
• If we knew the RRN of the record identified by this key
value, we could jump directly to the record (by using
Seek function)
• In practice, we do not have this information and we
must search for the record containing this key value
• If the file is not sorted by the key value we may have to
look at every possible record before we find the
desired record
• An alternative to this is to maintain the file sorted by
key value and use binary searching
Binary Search Algorithm
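// Stream, RecordType, KeyType, getFileLength and readRecord are assumed
// to be defined elsewhere. rec is passed by reference so the caller
// receives the record that was found.
bool BinarySearch(Stream file, RecordType& rec, KeyType key)
{
    // Fixed-length records: record count = file length / record size.
    int low = 0, high = getFileLength(file) / sizeof(RecordType) - 1;
    int guess;
    while (low <= high)
    {
        guess = (high + low) / 2;       // RRN of the middle record
        readRecord(file, rec, guess);
        if (rec.key() == key)
            return true;                // found
        if (rec.key() > key)
            high = guess - 1;           // continue in the lower half
        else
            low = guess + 1;            // continue in the upper half
    }
    return false;                       // range is empty: key is not in the file
}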
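For a version that can actually be compiled, here is a minimal sketch using standard C++ streams. Everything specific in it is an assumption: the 32-byte Record layout, the helper appendRecord, and the use of any std::istream (for example a std::stringstream) as the sorted file. Keys longer than the 7 usable characters of the key field are compared only on their stored prefix.

```cpp
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical fixed-length record: 8-byte key field, 24-byte payload field.
struct Record {
    char key[8];
    char payload[24];
};

// Appends one fixed-length record; the caller must append in key order.
void appendRecord(std::ostream& file, const std::string& key,
                  const std::string& payload) {
    Record rec{};  // zero-filled, so both fields stay NUL-terminated
    std::strncpy(rec.key, key.c_str(), sizeof rec.key - 1);
    std::strncpy(rec.payload, payload.c_str(), sizeof rec.payload - 1);
    file.write(reinterpret_cast<const char*>(&rec), sizeof rec);
}

// Binary search by RRN over a file of records sorted by key.
bool binarySearchFile(std::istream& file, const std::string& key, Record& rec) {
    file.clear();
    file.seekg(0, std::ios::end);
    const long count = static_cast<long>(file.tellg()) /
                       static_cast<long>(sizeof(Record));
    long low = 0, high = count - 1;
    while (low <= high) {
        const long guess = low + (high - low) / 2;
        // Jump directly to record number `guess` (its RRN).
        file.seekg(static_cast<std::streamoff>(guess) *
                       static_cast<std::streamoff>(sizeof(Record)),
                   std::ios::beg);
        if (!file.read(reinterpret_cast<char*>(&rec), sizeof(Record))) return false;
        const int cmp = std::strncmp(rec.key, key.c_str(), sizeof rec.key);
        if (cmp == 0) return true;
        if (cmp > 0) high = guess - 1;
        else low = guess + 1;
    }
    return false;
}
```

Each iteration costs one seek and one read, so the number of disk accesses grows with the logarithm of the record count rather than with the count itself.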
Binary Search vs. Sequential Search
• Sequential Search: O(n)
• Binary Search: O(log2 n)
• If the file size is doubled, sequential search time is
doubled, while binary search needs only 1 more comparison
Keysorting
• Suppose a file needs to be sorted, but it is too big
to fit into main memory.
• To sort the file, we only need the keys.
• Suppose that all the keys fit into main memory
• Idea
– Bring the keys to main memory plus
corresponding RRN
– Do internal sorting of keys
– Rewrite the file in sorted order
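The idea can be sketched in memory as follows, with the file modelled as a vector of records whose position is the RRN. The struct names KeyNode and Record and the function keysort are assumptions made for this illustration; in a real file, fetching file[node.rrn] would be a seek followed by a read.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Record {
    std::string key;
    std::string rest;  // remaining fields, never examined while sorting
};

struct KeyNode {
    std::string key;
    int rrn;  // where the full record lives in the original file
};

std::vector<Record> keysort(const std::vector<Record>& file) {
    // 1. Bring only the keys, plus each record's RRN, into memory.
    std::vector<KeyNode> nodes;
    nodes.reserve(file.size());
    for (std::size_t rrn = 0; rrn < file.size(); ++rrn)
        nodes.push_back(KeyNode{file[rrn].key, static_cast<int>(rrn)});

    // 2. Internal sort of the key nodes.
    std::sort(nodes.begin(), nodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });

    // 3. Rewrite the file in sorted order, fetching each record by its RRN.
    std::vector<Record> sorted;
    sorted.reserve(file.size());
    for (const KeyNode& node : nodes)
        sorted.push_back(file[static_cast<std::size_t>(node.rrn)]);
    return sorted;
}
```

Only step 3 touches full records, and it visits them in key order rather than file order, which is where the random seeks come from.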
Example
How much work must we do?
• Read the file sequentially once
• Go through each record in random order (seek)
• Write each record once (sequentially)
Why bother to write the file back?
• Use the keynode array to create an index file instead.

This is called indexing!


Pinned Records
• Remember that in order to support deletions we
used AVAIL LIST, a list of available records
• The AVAIL LIST contains information on the physical
location of records. In such a file, a record is
said to be pinned
• If we use an index file for sorting, the AVAIL LIST
and the positions of records remain unchanged.
• This is good news
Introduction to Indexing
• Simple indexes use simple arrays.
• An index lets us impose order on a file without
rearranging the file.
• Indexes provide multiple access paths to a file -
multiple indexes (like a library catalog providing
search for author, book and title)
• An index can provide keyed access to variable
length record files
A Simple Index for Entry-Sequenced File
• Records (variable-length), each located by the address of the record
• Primary key = company label + record ID
• Index: each entry holds a key and a reference field
Index
• Index is sorted (main memory)
• Records appear in file in the order they entered
• How to search for a recording with a given LABEL ID?
– Binary search (in main memory) in the index:
find LABEL ID, which leads us to the
reference field
– Seek to the record at the position given by the
reference field
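The search procedure above can be sketched with an in-memory index over an entry-sequenced file. Several things here are assumptions for illustration only: the file is modelled as one std::string, records are newline-terminated, the reference field is a byte offset, and the names IndexEntry, appendRecord, insertEntry and lookup are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;     // primary key, e.g. label + ID
    std::size_t offset;  // reference field: byte offset of the record
};

// Orders an index entry against a search key for std::lower_bound.
bool entryLess(const IndexEntry& entry, const std::string& key) {
    return entry.key < key;
}

// Appends a variable-length record in arrival order; returns its byte offset.
std::size_t appendRecord(std::string& file, const std::string& record) {
    const std::size_t offset = file.size();
    file += record;
    file += '\n';
    return offset;
}

// Inserts a new entry so that the index stays sorted by key.
void insertEntry(std::vector<IndexEntry>& index, const std::string& key,
                 std::size_t offset) {
    std::vector<IndexEntry>::iterator pos =
        std::lower_bound(index.begin(), index.end(), key, entryLess);
    index.insert(pos, IndexEntry{key, offset});
}

// Binary search in the index, then "seek" to the referenced record.
bool lookup(const std::vector<IndexEntry>& index, const std::string& file,
            const std::string& key, std::string& record) {
    std::vector<IndexEntry>::const_iterator pos =
        std::lower_bound(index.begin(), index.end(), key, entryLess);
    if (pos == index.end() || pos->key != key) return false;
    const std::size_t end = file.find('\n', pos->offset);
    record = file.substr(pos->offset, end - pos->offset);
    return true;
}
```

The data file is never reordered: only the small index is kept sorted, so records stay where they were written.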
Some Issues
• How to make a persistent index
– i.e. how to store the index into a file when it
is not in main memory
• How to guarantee that the index is an accurate
reflection of the contents of the file
– This is tricky when there are lots of additions,
deletions and updates
End of Chapter 6
