Chapter 6 Organizing Files For Performance Not Complete

Uploaded by

Abdallah Saber Khalifa


CS2202 - File Organization
2021-2022
Chapter 6
Organizing Files for Performance
File Compression
Contents

• We will be looking at four different issues:
• Data compression: how to make files smaller
• Reclaiming space in files that have undergone deletions and updates
• Sorting files in order to support binary searching (internal sorting)
• A better sorting method: keysorting
Overview

• In this lecture, we continue to focus on file organization, but with a different motivation.
• This time we look at ways to organize or re-organize files in order to improve performance.
Data Compression
• Introduction to Data Compression
• Techniques of Data Compression
• Compact Notation
• Run-Length Encoding
• Variable length codes: Huffman Coding
• Lempel-Ziv Codes
Data Compression
• Data Compression = Encoding the information in a file in
such a way that it takes less space.
• Reasons for data compression
• less storage
• transmitting faster, decreasing access time
• processing faster sequentially
Using Compact Notation
• Example: a file with the fields lastname, province, postal code, etc.
• The province field uses 2 bytes (e.g. 'ON', 'BC'), but there are only 13 provinces and territories, which can be encoded using only 4 bits (compact notation).
• 16 bits are replaced by 4 bits (the other 12 bits were redundant, i.e. they added no extra information)
• Disadvantages
• The field "province" becomes unreadable by humans
• Time is spent encoding ('ON' ➔ 0001) and decoding (0001 ➔ 'ON')
• It increases the complexity of the software (an encoding/decoding module is needed)
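As a rough illustration of the encoding/decoding module mentioned above, the sketch below maps each two-letter abbreviation to a small integer that fits in 4 bits. The table order (alphabetical) and the names `kProvinces`, `encodeProvince`, `decodeProvince` and `packTwo` are assumptions made for this example; the slide's code 0001 for 'ON' is only illustrative and is not reproduced by this table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Two-letter abbreviations of the 13 provinces and territories.
// The position in this table is used as the 4-bit code (assumed ordering).
const std::array<std::string, 13> kProvinces = {
    "AB", "BC", "MB", "NB", "NL", "NS", "NT",
    "NU", "ON", "PE", "QC", "SK", "YT"};

// Encode: 2-byte abbreviation -> code in 0..12 (fits in 4 bits).
// Returns -1 if the abbreviation is not in the table.
int encodeProvince(const std::string& abbrev) {
    for (std::size_t i = 0; i < kProvinces.size(); ++i)
        if (kProvinces[i] == abbrev) return static_cast<int>(i);
    return -1;
}

// Decode: 4-bit code -> 2-byte abbreviation.
const std::string& decodeProvince(int code) {
    assert(code >= 0 && code < static_cast<int>(kProvinces.size()));
    return kProvinces[static_cast<std::size_t>(code)];
}

// Pack two 4-bit codes into one byte, so two province fields share 8 bits.
std::uint8_t packTwo(int first, int second) {
    assert(first >= 0 && first < 16 && second >= 0 && second < 16);
    return static_cast<std::uint8_t>((first << 4) | second);
}
```

The lookup loop and the table are exactly the extra software complexity and encoding time that the slide lists as disadvantages.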
Run-Length Encoding
• Good for files in which sequences of the same byte may be frequent.
• Example: an image of the sky
• A pixel is represented by 8 bits
• The background is represented by the pixel value 0
• The idea is to avoid repeating, say, 200 bytes equal to 0 and to represent them by (0, 200) instead.
Suppressing Repeating Sequences
• Run-length encoding algorithm
• Read through the pixels, copying pixel values to the file in sequence, except where the same pixel value occurs more than once in succession
• When the same value occurs more than once in succession, substitute the following three bytes:
1. a special run-length code indicator (e.g. 0xFF)
2. the pixel value that is repeated
3. the number of times that value is repeated
• Example:
• 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
• RL-coded stream: 22 23 ff 24 07 25 ff 26 06 25 24
• 18 bytes reduced to 11 bytes
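The algorithm above can be sketched as follows. This is a minimal illustration, not a production codec: it assumes that 0xFF never occurs as a pixel value and that runs longer than 255 are split; the function names `rleEncode` and `rleDecode` are chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Run-length encode a sequence of pixel bytes.
// A run of 2..255 equal values becomes three bytes: 0xFF, value, count.
// Assumption of this sketch: 0xFF is never used as a pixel value.
std::vector<std::uint8_t> rleEncode(const std::vector<std::uint8_t>& in) {
    const std::uint8_t kIndicator = 0xFF;
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        assert(in[i] != kIndicator);
        std::size_t run = 1;   // length of the run starting at position i
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        if (run > 1) {
            out.push_back(kIndicator);
            out.push_back(in[i]);
            out.push_back(static_cast<std::uint8_t>(run));
        } else {
            out.push_back(in[i]);
        }
        i += run;
    }
    return out;
}

// Expand every (0xFF, value, count) triple back into `count` copies of `value`.
std::vector<std::uint8_t> rleDecode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0xFF && i + 2 < in.size()) {
            out.insert(out.end(), static_cast<std::size_t>(in[i + 2]), in[i + 1]);
            i += 2;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Because the rule is applied to every run of two or more, a run of exactly two equal pixels grows from 2 bytes to 3, which is one way the encoded stream can end up larger than the input.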
Suppressing Repeating Sequences
• Run-length encoding (cont'd)
• It is an example of redundancy reduction
• Cons
• It does not guarantee any particular amount of space savings
• Under some circumstances, the compressed image is larger than the original image
Variable-Length Codes, Morse Code
• Morse code: the oldest and most common variable-length coding scheme
• Some values occur more frequently than others
• The most frequent values should take the least amount of space
• Morse code (a sequence built from two symbols, . and _, is associated with each letter)
A ._
B _...
...
E .
F .._.
...
T _
U .._
• Since E and T are the most frequent letters, they are associated with the shortest codes, . and _ respectively
Huffman Coding
• Huffman code is a variable-length code, but unlike Morse code the encoding depends on the frequency of the letters occurring in the data set.
• It is based on the probability of occurrence
• Determine the probability of each value occurring
• Build a binary tree with a search path for each value
• More frequently occurring values are given shorter search paths in the tree
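The tree-building step can be sketched with the usual greedy construction: repeatedly merge the two least frequent subtrees until one tree remains. The slides do not spell out the construction, so treat this as one common way to do it; the names `HuffNode`, `collectCodes` and `huffmanCodes` are assumptions of this example, and ties may be broken differently in other implementations (code lengths are what matter).

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct HuffNode {
    long freq;
    char symbol;   // meaningful only in leaves
    int left;      // index of the child reached by bit 0, or -1 in a leaf
    int right;     // index of the child reached by bit 1, or -1 in a leaf
};

// Walk the tree and record the root-to-leaf path of every symbol.
void collectCodes(const std::vector<HuffNode>& nodes, int at,
                  const std::string& path, std::map<char, std::string>& codes) {
    if (nodes[at].left < 0) {                         // leaf
        codes[nodes[at].symbol] = path.empty() ? "0" : path;
        return;
    }
    collectCodes(nodes, nodes[at].left, path + "0", codes);
    collectCodes(nodes, nodes[at].right, path + "1", codes);
}

std::map<char, std::string> huffmanCodes(const std::string& text) {
    assert(!text.empty());
    std::map<char, long> freq;
    for (char c : text) ++freq[c];

    std::vector<HuffNode> nodes;
    // Min-heap of (frequency, node index): least frequent subtree on top.
    typedef std::pair<long, int> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    for (const auto& kv : freq) {
        nodes.push_back({kv.second, kv.first, -1, -1});
        heap.push(Entry(kv.second, static_cast<int>(nodes.size()) - 1));
    }
    // Merge the two least frequent subtrees until a single tree is left.
    while (heap.size() > 1) {
        Entry a = heap.top(); heap.pop();
        Entry b = heap.top(); heap.pop();
        nodes.push_back({a.first + b.first, '\0', a.second, b.second});
        heap.push(Entry(a.first + b.first, static_cast<int>(nodes.size()) - 1));
    }
    std::map<char, std::string> codes;
    collectCodes(nodes, heap.top().second, "", codes);
    return codes;
}
```

For the text "aaaabbc", the most frequent symbol 'a' receives a 1-bit code while 'b' and 'c' receive 2-bit codes, so the whole text takes 10 bits instead of 56.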
Huffman Code Example I
• Suppose the file content is:

• Total: 10 characters

• Encoded message:
1010010011011011001111100

• 25 bits rather than 80 bits (10 bytes)!

Huffman Code Example I
• Interpret the 0's as 'go left' and the 1's as 'go right'.
• A codeword for a character corresponds to the path from the root of the Huffman tree to the leaf containing the character.
• Following the labeled edges in the Huffman tree, we decode the above message.
• 1010 leads us to I
• 01 leads us to /b (the blank character)
• 00 leads us to A
• 11 leads us to M
• 01 leads us to /b
• etc.
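The decoding walk can be sketched without the tree picture by using a codeword table: accumulating bits until they match a codeword is equivalent to following edges until a leaf is reached. The function name `decodePrefixCode` is an assumption of this example, and the table in the usage note contains only the four codewords listed above, so it decodes only the first 12 bits of the message.

```cpp
#include <cassert>
#include <map>
#include <string>

// Decode a bit string with a prefix code given as codeword -> character.
// Because no codeword is a prefix of another, the first match is the only match.
std::string decodePrefixCode(const std::string& bits,
                             const std::map<std::string, char>& table) {
    std::string out;
    std::string pending;   // bits read since the last complete codeword
    for (char bit : bits) {
        assert(bit == '0' || bit == '1');
        pending += bit;
        std::map<std::string, char>::const_iterator hit = table.find(pending);
        if (hit != table.end()) {      // reached a leaf of the tree
            out += hit->second;
            pending.clear();           // start again from the root
        }
    }
    return out;   // bits left in `pending` did not form a complete codeword
}
```

With the table {1010 ➔ 'I', 01 ➔ ' ', 00 ➔ 'A', 11 ➔ 'M'}, decoding the prefix 101001001101 yields "I AM " (including the trailing blank).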
Huffman Code Example II
Lempel-Ziv Codes
• There are several variations of Lempel-Ziv Codes.
• We will look at LZ78
• Used in many applications
Example
• Let us look at an example for an alphabet having only two
letters:
aaababbbaaabaaaaaaabaabb
• Rule
• Separate this stream of characters into pieces of text so
that each piece is the shortest string of characters that
we have not seen yet.
a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
Example
a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
1. We see "a".
2. "a" has been seen. We now see "aa".
3. We see "b".
4. "a" has been seen. We now see "ab".
5. "b" has been seen. We now see "bb".
6. "aa" has been seen. We now see "aaa".
7. "b" has been seen. We now see "ba".
8. "aaa" has been seen. We now see "aaaa".
9. "aa" has been seen. We now see "aab".
10. "aab" has been seen. We now see "aabb".
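Example
• Index the pieces from 1 to n (index 0 is reserved for the null string). In the previous example:

Index:  0 | 1 | 2  | 3 | 4  | 5  | 6   | 7  | 8    | 9   | 10
Piece:  0 | a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb
0 = Null string

Encoding:

Index:  1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10
Code:   0a | 1a | 0b | 1b | 3b | 2a | 3a | 6a | 2b | 9b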
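The parsing rule and the (index, character) encoding can be sketched together. This is a minimal illustration that uses a `std::map` as the dictionary rather than the tree; the names `Token` and `lz78Encode` are assumptions of this example, and the handling of a trailing piece that is already in the dictionary (emitted with a '\0' character) is one possible convention, not something the slides specify.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One output token: the index of a previously seen piece (0 = null string)
// followed by the one new character that extends it.
typedef std::pair<int, char> Token;

std::vector<Token> lz78Encode(const std::string& text) {
    std::map<std::string, int> dictionary;   // piece -> index, starting at 1
    std::vector<Token> out;
    std::string piece;                       // longest already-seen prefix so far
    for (char c : text) {
        assert(c != '\0');                   // '\0' is reserved as an end marker
        std::string candidate = piece + c;
        if (dictionary.count(candidate)) {
            piece = candidate;               // still a known piece: keep extending
        } else {
            int prefixIndex = piece.empty() ? 0 : dictionary[piece];
            out.push_back(Token(prefixIndex, c));
            int next = static_cast<int>(dictionary.size()) + 1;
            dictionary[candidate] = next;    // shortest piece not seen before
            piece.clear();
        }
    }
    // Input ended inside a known piece: emit its index with no new character.
    if (!piece.empty()) out.push_back(Token(dictionary[piece], '\0'));
    return out;
}
```

Running it on aaababbbaaabaaaaaaabaabb produces the ten tokens 0a, 1a, 0b, 1b, 3b, 2a, 3a, 6a, 2b, 9b shown above.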
Lempel-Ziv Codes
• Since each piece is the concatenation of a piece already seen with a new character, the message can be encoded by a previous index plus a new character.
• A tree can be built when encoding
• Note that this tree is not binary in general. Here, it is binary because the alphabet has only 2 letters.
Exercise #1
• Encode the file containing the following characters, drawing the corresponding tree

"aaabbcbcdddeab"
Solution
Encoding Tree
Exercise #2
• Encode the file containing the following characters, drawing the corresponding digital tree

"I AM SAM. SAM I AM"

Solution
Encoding Tree
Irreversible (Lossy) Compression Techniques
• With all the previous techniques, we preserve all the information in the original data.
• Irreversible compression is used when some information can be sacrificed.
• It is less common in data files
• Shrinking a raster image
• 400-by-400 pixels to 100-by-100 pixels
• 1 pixel in the new image for each 16 pixels in the original image
• Speech compression
• Voice coding (the lost information is of little or no value)
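The raster-shrinking example can be sketched as below. The slides do not say how the 16 original pixels are combined into one; this sketch assumes a simple average over each 4-by-4 block, and the names `Image` and `shrink` are chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::vector<std::uint8_t> > Image;   // rows of 8-bit pixels

// Replace every factor-by-factor block by one pixel holding the block average
// (averaging is an assumption; the detail discarded cannot be recovered).
Image shrink(const Image& src, std::size_t factor) {
    assert(factor > 0 && !src.empty());
    assert(src.size() % factor == 0 && src[0].size() % factor == 0);
    Image dst(src.size() / factor,
              std::vector<std::uint8_t>(src[0].size() / factor));
    for (std::size_t r = 0; r < dst.size(); ++r) {
        for (std::size_t c = 0; c < dst[r].size(); ++c) {
            std::size_t sum = 0;
            for (std::size_t dr = 0; dr < factor; ++dr)
                for (std::size_t dc = 0; dc < factor; ++dc)
                    sum += src[r * factor + dr][c * factor + dc];
            dst[r][c] = static_cast<std::uint8_t>(sum / (factor * factor));
        }
    }
    return dst;
}
```

Calling `shrink(image, 4)` on a 400-by-400 image gives a 100-by-100 image, i.e. 10,000 pixels instead of 160,000.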
Reclaiming Spaces in
Files
Motivation
• Let us consider a file of records (fixed length or
variable length)
• We know how to create a file, how to add records to
a file, modify the content of a record. These actions
can be performed physically by using the various
basic file operations we have seen.
• What happens if records need to be deleted?
• There is no basic operation that allows us to remove
part of a file.
Motivation
• Modification of a variable-length record (the new record is longer than the original record)
1. Append the extra data to the end of the file and put a pointer from the original record space to the extension => slower
2. Rewrite the whole record at the end of the file (if the file is not sorted), leaving a hole at the original location => wasted space
• Record deletion should be taken care of by the program responsible for file organization
Reclaiming Space in Files
• Three forms of modification
1. record addition
2. record updating (can be treated as a deletion followed by an addition)
3. record deletion
Record Deletion and Storage Compaction
• Approach to record deletion: place a special mark in a special field of each deleted record (e.g. an asterisk in the first field): Fig. (a), (b)
• Programs that read the file ignore records marked as deleted
• Advantage
• A record can be undeleted with very little effort
• Disadvantage
• The space is not reused for a while (reuse relies on storage compaction)
Record Deletion and Storage Compaction
1. Storage compaction
• Make files smaller by looking for unused places in a file and then recovering this space (how often?)
• A special program reconstructs the file with all the deleted records squeezed out: Fig. (c)
• Compaction methods
1. through a file copy program (out of place)
2. through a more complicated and time-consuming compacting algorithm (in place)
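Marking and out-of-place compaction can be sketched on an in-memory stand-in for the file. The `File` type (a vector of record strings), the use of a leading '*' as the deletion mark, and the function names are assumptions of this example; a real program would read and write the records on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// The "file" is modelled as a vector of records.
// A leading '*' marks a record as deleted.
typedef std::vector<std::string> File;

// Deleting only overwrites the first byte; the record keeps its place.
void markDeleted(File& file, std::size_t rrn) {
    assert(rrn < file.size() && !file[rrn].empty());
    file[rrn][0] = '*';
}

bool isDeleted(const std::string& record) {
    return !record.empty() && record[0] == '*';
}

// Out-of-place compaction: copy every live record to a new file, in order.
File compact(const File& file) {
    File result;
    for (const std::string& record : file)
        if (!isDeleted(record)) result.push_back(record);
    return result;
}
```

Until `compact` is run, the file keeps its original size, which is the disadvantage noted above; in exchange, undeleting a record only requires restoring its first byte.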
Example
Strategies for Record Deletion
2. Deleting Fixed-Length Records and Reclaiming Space
Dynamically
▪ In some applications, it is necessary to reclaim space
immediately.
▪ To do so, we must:
▪ Mark deleted records in some special ways.
▪ Find the space that deleted records once occupied,
so that we can reuse that space when we add
records.
▪ Come up with a way to know immediately if there are
empty slots in the file and jump directly to them.
▪ Solution: Use an avail linked list in the form of a stack.
Relative Record Numbers (RRNs) play the role of
pointers.
Example
• If we add a record, it can go to the first available spot in the avail list, where RRN=4 ➔ the header's RRN becomes 2
• If we delete a record (Edwards), the header's RRN becomes 5, and RRN 5 links to RRN 4
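The avail list as a stack of RRNs can be sketched as follows. The structure names `Slot` and `RecordFile`, the in-memory vector standing in for the file, and the use of -1 for "end of list" are assumptions of this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One fixed-length record slot. In a deleted slot, nextFree links to the
// next free slot, so the avail list is threaded through the file itself.
struct Slot {
    bool deleted;
    int nextFree;        // valid only when deleted; -1 means end of the list
    std::string data;
};

struct RecordFile {
    int availHead = -1;  // header field: RRN of the first free slot
    std::vector<Slot> slots;

    // Delete: push the freed RRN onto the avail stack.
    void remove(int rrn) {
        assert(rrn >= 0 && rrn < static_cast<int>(slots.size()));
        assert(!slots[rrn].deleted);
        slots[rrn].deleted = true;
        slots[rrn].nextFree = availHead;
        availHead = rrn;
    }

    // Add: pop a free RRN if there is one, otherwise append at the end.
    // Returns the RRN where the record was stored.
    int add(const std::string& data) {
        Slot fresh = {false, -1, data};
        if (availHead == -1) {
            slots.push_back(fresh);
            return static_cast<int>(slots.size()) - 1;
        }
        int rrn = availHead;
        availHead = slots[rrn].nextFree;
        slots[rrn] = fresh;
        return rrn;
    }
};
```

Starting from a list whose head is RRN 4 and whose next free slot is RRN 2, deleting RRN 5 makes the head 5 with a link to 4, and a later add reuses RRN 5 first. Both operations touch only the header and one slot, so no search through the file is needed.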
Strategies for Record Deletion

3. Deleting Variable-Length Records

▪ Use an AVAIL LIST as before, but take care of the difficulties caused by variable lengths
▪ Each record slot in the AVAIL LIST must store its size as a field.
▪ RRNs cannot be used; exact byte offsets must be used instead
▪ Adding a record requires finding a large enough slot in the AVAIL LIST.
Example of Deletion
Removal of a record from an avail list
Storage Fragmentation
• Wasted Space within a record is called internal
fragmentation.
• Variable-Length records do not suffer from internal
fragmentation. However, external fragmentation is
not avoided.
• 3 ways to deal with external fragmentation:
1. Storage Compaction.
2. Coalescing the holes.
• If two record slots on the avail list are physically
adjacent, combine them to make a single,
larger record slot.
3. Using a clever placement strategy
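Coalescing can be sketched as a pass over the free slots ordered by byte offset. The `Hole` structure and the `coalesce` function are hypothetical names for this example; the avail list is represented here as a plain vector rather than as links stored inside the file.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A free slot in a variable-length record file.
struct Hole {
    long offset;   // byte offset of the slot in the file
    long size;     // number of bytes in the slot
};

bool byOffset(const Hole& a, const Hole& b) { return a.offset < b.offset; }

// Merge holes that are physically adjacent (one ends where the next begins).
std::vector<Hole> coalesce(std::vector<Hole> holes) {
    std::sort(holes.begin(), holes.end(), byOffset);
    std::vector<Hole> merged;
    for (const Hole& h : holes) {
        assert(h.size > 0);
        if (!merged.empty() &&
            merged.back().offset + merged.back().size == h.offset)
            merged.back().size += h.size;   // extend the previous hole
        else
            merged.push_back(h);            // not adjacent: keep it separate
    }
    return merged;
}
```

For example, holes at offsets 40 (10 bytes), 50 (50 bytes) and 100 (20 bytes) merge into a single 80-byte hole at offset 40, which can hold records that none of the three could hold alone.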
Placement Strategies
1. First Fit Strategy
– Avail list is not sorted by size.
– Choose the first available record slot that
can hold the new record.
• Example
– Avail list: size=10, size=50, size=22, size=60
– record to be added: size=20
– Which record from AVAIL LIST is used for the
new record?
– Choose size=50
Placement Strategies
2. Best Fit Strategy
– Avail list is sorted by size.
– Choose the smallest available record slot that can hold
the new record.
– After inserting the new record, the free area left may be
too small to be useful.
▪ May cause serious external fragmentation
(dependent on the implementation).
▪ Increase the search time for the best-fit space.
• Example
– Avail list: size=10, size=22, size=50, size=60
– New record: size=20
– Which record from AVAIL LIST is used for the new record?
– Choose size=22
Placement Strategies
3. Worst Fit Strategy
– Avail list is sorted by decreasing order of size.
– Largest record is used for holding new record;
unused space is placed again in AVAIL LIST.
• Example
– Avail list: size=60, size=50, size=22, size=10
– New record: size=20
– Which record from AVAIL LIST is used for the new record?
– Choose size=60
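The three strategies can be compared side by side with a small sketch. For simplicity, each function here scans an unsorted list of slot sizes instead of relying on the sorted avail lists described in the slides, so the choice of slot is the same but the search cost is not; the function names are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each function returns the position in `avail` of the slot to use,
// or -1 if no slot is large enough. `avail` holds slot sizes in list order.

// First fit: the first slot that is big enough.
int firstFit(const std::vector<int>& avail, int need) {
    assert(need > 0);
    for (std::size_t i = 0; i < avail.size(); ++i)
        if (avail[i] >= need) return static_cast<int>(i);
    return -1;
}

// Best fit: the smallest slot that is big enough.
int bestFit(const std::vector<int>& avail, int need) {
    int best = -1;
    for (std::size_t i = 0; i < avail.size(); ++i)
        if (avail[i] >= need && (best == -1 || avail[i] < avail[best]))
            best = static_cast<int>(i);
    return best;
}

// Worst fit: the largest slot, provided it is big enough.
int worstFit(const std::vector<int>& avail, int need) {
    int worst = -1;
    for (std::size_t i = 0; i < avail.size(); ++i)
        if (avail[i] >= need && (worst == -1 || avail[i] > avail[worst]))
            worst = static_cast<int>(i);
    return worst;
}
```

With slot sizes 10, 50, 22, 60 and a new record of size 20, first fit picks the slot of size 50, best fit the slot of size 22, and worst fit the slot of size 60, matching the three examples above.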
How to Choose Between Strategies
• We must consider two types of fragmentation within a file:
• Internal Fragmentation
– wasted space within a record.
• External Fragmentation
– space is available in the AVAIL LIST, but it is so small that it cannot be reused.
Binary Searching, Keysorting & Indexing
Content

• Binary Searching
• Keysorting
• Introduction to Indexing
Binary Searching
• Let us consider fixed-length records that must be
searched by a key value
• If we knew the RRN of the record identified by this key
value, we could jump directly to the record (by using
Seek function)
• In practice, we do not have this information and we
must search for the record containing this key value
• If the file is not sorted by the key value we may have to
look at every possible record before we find the
desired record
• An alternative to this is to maintain the file sorted by
key value and use binary searching
Binary Search Algorithm
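bool BinarySearch(Stream file, RecordType& rec, KeyType key)
{
    // Records are fixed length, so record count = file length / record size.
    int low = 0, high = getFileLength(file) / sizeof(RecordType) - 1;
    while (low <= high)
    {
        int guess = low + (high - low) / 2;  // RRN of the middle record
        readRecord(file, rec, guess);        // seek to RRN guess and read it
        if (rec.key() == key)
            return true;                     // found: rec holds the record
        if (rec.key() > key)
            high = guess - 1;                // continue in the lower half
        else
            low = guess + 1;                 // continue in the upper half
    }
    return false;                            // range is empty: key not in file
}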
Binary Search vs. Sequential Search
• Sequential Search: O(n)
• Binary Search: O(log₂ n)
• If the file size is doubled, the sequential search time is doubled, while binary search needs only 1 more comparison
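The doubling claim can be checked with a short sketch that counts worst-case record reads. The function names are chosen for this example; the binary search count is floor(log₂ n) + 1, computed by repeated halving.

```cpp
#include <cassert>

// Worst-case number of records read by binary search on n sorted records:
// floor(log2(n)) + 1, obtained by halving n until nothing is left.
int maxBinaryProbes(long n) {
    assert(n >= 0);
    int probes = 0;
    while (n > 0) {
        n /= 2;
        ++probes;
    }
    return probes;
}

// Worst-case number of records read by sequential search: all n of them.
long maxSequentialProbes(long n) {
    assert(n >= 0);
    return n;
}
```

For 1000 records the worst cases are 1000 reads (sequential) and 10 reads (binary); for 2000 records they are 2000 and 11.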
Keysorting
• Suppose a file needs to be sorted, but it is too big
to fit into main memory.
• To sort the file, we only need the keys.
• Suppose that all the keys fit into main memory
• Idea
– Bring the keys to main memory plus
corresponding RRN
– Do internal sorting of keys
– Rewrite the file in sorted order
Example
How much work must we do?
• Read the file sequentially once
• Go through each record in random order (seek)
• Write each record once (sequentially)
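The three steps of keysorting can be sketched on an in-memory stand-in for the file. The `Record` structure, the vector used as the file, and the function name `keysort` are assumptions of this example; on disk, step 3 would perform one seek per record.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A record of the (simulated) file; only `key` is needed for sorting.
struct Record {
    std::string key;
    std::string rest;
};

// Keysort: sort (key, RRN) pairs in memory, then copy records in key order.
std::vector<Record> keysort(const std::vector<Record>& file) {
    // Step 1: one sequential pass collects each key together with its RRN.
    std::vector<std::pair<std::string, std::size_t> > keynodes;
    for (std::size_t rrn = 0; rrn < file.size(); ++rrn)
        keynodes.push_back(std::make_pair(file[rrn].key, rrn));

    // Step 2: internal sort of the keynode array (pairs compare by key first).
    std::sort(keynodes.begin(), keynodes.end());

    // Step 3: fetch each record by RRN and write it out in sorted order.
    std::vector<Record> sorted;
    for (const auto& node : keynodes) {
        assert(node.second < file.size());
        sorted.push_back(file[node.second]);
    }
    return sorted;
}
```

Only the keynode array has to fit in main memory; the full records are touched one at a time in step 3, which is where the random-order seeks come from.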
Why bother to write the file back?
• Use the keynode array to create an index file instead.
• This is called indexing!
Pinned Records
• Remember that in order to support deletions we used the AVAIL LIST, a list of available records
• The AVAIL LIST stores the physical locations of records. In such a file, a record is said to be pinned: it cannot be moved without invalidating the references to it
• If we use an index file for sorting, the AVAIL LIST and the positions of the records remain unchanged.
• This is good news
Introduction to Indexing
• Simple indexes use simple arrays.
• An index lets us impose order on a file without
rearranging the file.
• Indexes provide multiple access paths to a file through multiple indexes (like a library catalog that supports searching by author, book and title)
• An index can provide keyed access to variable
length record files
A Simple Index for Entry-Sequenced File
• Records (variable-length), each located by the address of the record
• Primary key = company label + record ID
• Index: each entry has a key and a reference field
Index
• The index is sorted (and kept in main memory)
• Records appear in the file in the order in which they were entered
• How to search for a recording with a given LABEL ID?
– Binary search (in main memory) in the index: find the LABEL ID, which leads us to the reference field
– Seek to the record at the position given by the reference field
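The two-step lookup can be sketched as a binary search over an in-memory index followed by a seek. The `IndexEntry` structure, the function names, and the keys and offsets used in the usage note are hypothetical; the seek itself is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One index entry: the primary key plus the reference field
// (byte offset of the record in the data file).
struct IndexEntry {
    std::string key;
    long offset;
};

bool keyLess(const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; }

bool entryBeforeKey(const IndexEntry& entry, const std::string& key) {
    return entry.key < key;
}

// Binary search the in-memory index, which must be sorted by key.
// Returns the record's byte offset, or -1 if the key is not present.
long findOffset(const std::vector<IndexEntry>& index, const std::string& key) {
    assert(std::is_sorted(index.begin(), index.end(), keyLess));
    std::vector<IndexEntry>::const_iterator it =
        std::lower_bound(index.begin(), index.end(), key, entryBeforeKey);
    if (it != index.end() && it->key == key) return it->offset;
    return -1;
}
```

Given an index with entries ("AAA1", 120), ("BBB2", 0) and ("CCC3", 60), `findOffset(index, "BBB2")` returns 0, and the program would then seek to byte 0 of the data file to read the record. The data file itself stays in entry order.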
Some Issues
• How to make a persistent index
– i.e. how to store the index into a file when it
is not in main memory
• How to guarantee that the index is an accurate
reflection of the contents of the file
– This is tricky when there are lots of additions,
deletions and updates
End of Chapter 6
