File Organization & Data Processing
LECTURE NOTE
ON
AAUA-CSC 202 & CSC216
Compiled by
S.O. Ogunlana, Ph.D
Dept of Informatics and Information Systems
2025
Data processing concepts and systems; data processing techniques; EDP equipment and EDP using COBOL; programming, output and auxiliary storage devices; types of memory access; concepts of data; physical and logical records; inter-record gaps; record structuring; types of and operations on files; labels; buffering, blocking and deblocking; relevant I/O facilities for file processing in some high-level programming languages such as FORTRAN, COBOL and PL/I.
COURSE REQUIREMENTS:
This is a compulsory course for all computer science students in the University. In view of this, students are expected to participate in all the course activities and to have a minimum of 75% attendance to be able to write the final examination.
DATA PROCESSING
Electronic data processing is any process that a computer program does to enter data and
summarise, analyse or otherwise convert data into usable information. The process may be
automated and run on a computer. It involves recording, analysing, sorting, summarising,
calculating, disseminating and storing data. Because data is most useful when well-presented
and actually informative, data-processing systems are often referred to as information systems.
Nevertheless, the terms are roughly synonymous, performing similar conversions; data-processing systems typically manipulate raw data into information, and likewise information systems typically take raw data as input to produce information as output.
Data processing is sometimes distinguished from data conversion, in which the process merely converts data to another format and does not involve any data manipulation.
In information processing, a Data Processing System is a system which processes data which
has been captured and encoded in a format recognizable by the data processing system or has
been created and stored by another unit of an information processing system.
DEFINITION OF DATA
Data is the basic fact about an entity. It is unprocessed information. Examples are
a. Student records which contain items like the Student Name, Number, Age, Sex etc.
b. Payroll data which contain employee number, name, department, date joined, basic salary
and other allowances.
c. Driving License which contains Driver's Name, Date of Birth, Home Address, Class of license and its expiry date.
Data can be regarded as the raw material from which information is produced by data
processing. When data is suitably processed, results (or output data) are interpreted to derive
information that would assist in decision-making.
Data processing may be divided into five separate but related steps. They are:
a. Origination
b. Input
c. Manipulation
d. Output
e. Storage
Origination. It should be kept in mind that "to process" means to do something with or to
"manipulate" existing information so that the result is meaningful and can be used by the
organization to make decisions. The existing information is generally original in nature and may
be handwritten or typewritten. Original documents are commonly referred to as source
documents. Examples of source documents are cheques, attendance sheets, sales orders, invoices,
receipts etc. Producing such source documents, then, is the first step in the data processing cycle.
Input. After source documents are originated or made available, the next step is to introduce the
information they contain into the data processing system. This system may be manual,
mechanical, electromechanical or electronic. However, our focus is on electronic data processing.
This is done using any of the available input devices (keyboard, joystick etc) or data capture
devices.
Processing. When input data has been recorded and verified, it is ready to be processed. Processing or "manipulation" involves the actual work performed on the source data to produce meaningful results. This may require performing any or all of the following functions: classifying, sorting, calculating, recording and summarizing.
Output. After input data has been fed into a data processing system and properly processed, the result is called output. Output can be either in summary or detail form. The type of output desired must be planned in advance so that no time or resources are wasted. Included with
output is communication. Output is of little value unless it is communicated properly and
effectively. The output is the ultimate goal of data processing. The system must be capable of
producing results quickly, completely and accurately. The data processing cycle is incomplete
without the concept of control. In an organization, controls depend basically upon the
comparison of attained results with predetermined goals. When the results are compared, they
either agree or differ. However, if a disagreement is detected, a decision is made to make the
necessary changes and the processing cycle repeated. This feedback concept of control is an essential
part of data processing. That is, output is compared with a predetermined standard and a
decision is made (if necessary) on a course of action, and is communicated to the stage where it
is taken.
Storage. Data related to or resulting from the previous four data processing steps can be stored, either temporarily or permanently, for future reference and usage. It is necessary to store data, especially data relating to periodic reports, since they are used over and over again in other related applications. A monthly attendance report or profit and loss statement will be useful in preparing annual reports; similarly, in student result computation, the previous semester's results will be useful in preparing the present semester's results. Both instances require intermittent storage. Stored information can be raw, semi-processed or output data. Quite often, the output of one problem becomes the input to another. In the case of inventory, any items unsold at the end of a year (ending inventory) become the beginning inventory for the next year. There are various ways of storing data, ranging from simple recording on paper to storage on diskettes, hard disks, CDs etc.
ERRORS
Errors (numerical errors). An error occurs when the value used to represent some quantity is not the true value of that quantity, e.g. errors occur if we use approximate values such as 2.5 instead of 2.53627.
Note that:
a. A value may be intentionally used despite the fact that it incurs an error, for example because the approximate value is simpler or cheaper to work with.
b. An error may be caused by a mistake, which is said to occur if the value used is other than the one intended, e.g. writing 5.2 when meaning to write 2.5.
OTHER ERRORS
The term "error" may also be used to describe situations that occur when a program either is not executed in the manner intended or produces results that are not correct. Causes for such errors include:
a. Faulty data
b. Faulty software.
c. Faulty hardware.
Note: The last is by far the least common.
ii. Since the true value may not always be known, the relative error may be approximated by using:

Relative error = (absolute error estimate) / (used value)

For small absolute errors this gives a reasonably accurate value.
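For example, using the values quoted earlier: if 2.5 is used in place of the true value 2.53627, the absolute error is 2.53627 - 2.5 = 0.03627, and the relative error is approximately 0.03627 / 2.5 ≈ 0.0145, i.e. about 1.45%.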
Week Two
A discussion of data and the different sources of error, error avoidance and reduction techniques, and data processing methods. We shall also discuss the different modes and methods of data processing, file accessing, organization and processing methods.
Objective: The objective is for the student to understand the various sources of error and the error avoidance and reduction techniques, compare and contrast them, and examine the advantages and disadvantages one technique has over another. Students should be able to recognise and handle these various error types.
SOURCES OF ERROR
These may include the following:
a. Data errors.
b. Transcription errors.
c. Conversion errors.
d. Rounding errors.
e. Computational errors.
f. Truncation errors.
g. Algorithmic errors.
Data errors
The data input to the computer may be subject to error because of limitations on the method of
data collection. These limitations may include:
i. The accuracy with which it was possible to make measurements.
ii. The skill of the observer, or
iii. The resources available to obtain the data.
Transcription errors
These are errors (mistakes) in copying from one form to another.
a. Examples:
i. Transposition, e.g., typing 369 for 396.
ii. "Mixed doubles", e.g., typing 3226 for 3326.
b. These errors may be reduced or avoided by using:
i. Direct encoding (e.g., OCR/OMR).
ii. Validation checks.
Conversion
When converting data from its input form (BCD, say) to its stored form (pure binary, say), some errors may be unavoidably incurred because of practical limits to accuracy. On output similar errors may occur. Further discussion will follow later.
Rounding errors
These frequently occur when doing manual decimal arithmetic. They may occur with even greater frequency in computer arithmetic.
a. Examples:
i. Writing 2.53627 as 2.54.
ii. Writing 1/3 as 0.3333.
b. A rounding error occurs when not all the significant digits (figures) are given, e.g., when writing 2.54 we omit the less significant digits 6, 2 and 7.
Types of rounding
a. Rounding down, sometimes called truncating, involves leaving off some of the less
significant digits, thus producing a lower or smaller number, e.g. writing 2.53 for
2.53627.
b. Rounding up involves leaving off some of the less significant digits, but the
remaining least significant digit is increased by 1, thus making a larger number,
e.g. writing 2.54 for 2.53627.
c. Rounding off involves rounding up or down according to which of these makes the least change in the stated value, e.g., 2.536 would be rounded up to 2.54 but 2.533 would be rounded down to 2.53. What to do in the case of 2.535 can be decided by an arbitrary rule such as "if the next significant digit is odd round up, if even round down." Using this rule 2.535 would round up to 2.54 because "3" is odd.
Significant digits (figures) and decimal places: These are the two methods of describing
rounded-off results. They are defined as follows.
a. Decimal places. A number is given to n decimal places (or nD) if there are n digits to the right of the decimal point. Examples:
2.53627 = 2.54 (2D)
2.53627 = 2.536 (3D)
4.203 = 4.20 (2D)
0.00351 = 0.0035 (4D)
b. Significant figures. A number is given to n significant figures (or nS) if there are n digits used to express the number, excluding:
i. all leading zeros, and
ii. trailing zeros to the left of the decimal point.
Examples:
2.53627 = 2.54 (3S)
57640 = 58000 (2S)
0.00351 = 0.0035 (2S)
4.203 = 4.20 (3S)
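To make the two notions concrete, here is a minimal Python sketch (round_sig is an illustrative helper, not a standard library function; round_dp simply wraps the built-in round):

import math

def round_dp(x, n):
    # Round x to n decimal places (nD).
    return round(x, n)

def round_sig(x, n):
    # Round x to n significant figures (nS).
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, n - 1 - exponent)

print(round_dp(2.53627, 2))   # 2.54   (2D)
print(round_dp(2.53627, 3))   # 2.536  (3D)
print(round_sig(2.53627, 3))  # 2.54   (3S)
print(round_sig(57640, 2))    # 58000  (2S)
print(round_sig(0.00351, 2))  # 0.0035 (2S)

Note that Python's built-in round uses an unbiased "round half to even" rule on ties, in the spirit of the rounding-off rule described above.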
Computational errors
These occur as a result of performing arithmetic operations and are usually caused by overflow or by rounding intermediate results.
Truncation errors
Firstly we need to define some terms. When numbers are placed in some specified order they form a sequence, e.g., 1, 3, 5, 7, 9, … or 1/2, 1/4, 1/8, 1/16, …. When a sequence is added it is called a series, e.g., 1 + 3 + 5 + 7 + 9 + … or 1/2 + 1/4 + 1/8 + 1/16 + …. Some series have many practical uses. For example the quantity π, used extensively in mathematics, can be evaluated to any required accuracy by summing the series:
π = 4 x (1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + 1/13 - …)
The series is an infinite series since it goes on as far as we care to take it. In practice we might only use the first few terms to get an approximate value. We truncate a series if, when we calculate its sum, we leave off all terms past a given one, e.g., taking only the first four terms above gives π ≈ 4 x (1 - 1/3 + 1/5 - 1/7) = 2.8952 (4D). The resulting error is called a truncation error.
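A minimal Python sketch of this truncation, using the series for π given above (pi_truncated is an illustrative name):

import math

def pi_truncated(n_terms):
    # Sum only the first n_terms of pi = 4 x (1 - 1/3 + 1/5 - 1/7 + ...),
    # leaving off all later terms (truncation).
    total = 0.0
    for k in range(n_terms):
        total += (-1) ** k / (2 * k + 1)
    return 4 * total

print(pi_truncated(4))     # 2.8952..., a large truncation error
print(pi_truncated(1000))  # 3.1405..., much closer to the true value
print(math.pi)             # 3.141592653589793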
Algorithmic errors
An algorithm is a set of procedural steps used in the solution of a given problem and can be
represented by pseudocode. Errors incurred by the execution of an algorithm are called
algorithmic errors. A computer program is one type of algorithm. If two programs are available
to perform a specified task, the one that produces the result with the greatest accuracy will have
the smaller algorithmic error. Since each step in an algorithm may involve a computational error,
the algorithm that achieves the result in fewer steps may have a smaller algorithmic error.
a. For fixed-point integer representation there is good control over accuracy within the allowed range since there is no fractional part to be rounded.
b. In other fixed-point representations where part or all of the number is
fractional, rounding will occur often, but the precision provided may still
allow reasonable control over accuracy during addition and subtraction.
c. In floating – point representations almost all storage and calculations can
lead to rounding errors.
d. Rounding should be unbiased if possible, i.e., numbers should be rounded off rather than up or down when stored.
Example: Consider a very simple case where only the first two binary fraction places are available, and consider values between 1/4 and 1/2. Numbers whose third binary place is "0" are rounded down whilst those whose third binary place is "1" are rounded up. This suggests a simple rule for rounding stored binary fractions: round up when the first bit to be dropped is a 1, and round down when it is a 0.
Conversion errors
In converting fractions from decimal to binary for storage, rounding errors are often introduced. Example: 4/5 is easily represented as the decimal fraction 0.8. However, if we convert 0.8 to binary we discover that it can only be represented as a recurring fraction, i.e., 0.1100110011001100…. Suppose we are able to store only 6 bits of this fraction, i.e., 0.110011. If we convert this stored value back to decimal we get the value 0.796875, not 0.8! Conversion errors like this are very common.
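A minimal Python sketch of this conversion error (store_fraction is an illustrative helper that keeps only a given number of binary places; Python's own floats cannot hold 0.8 exactly either, which is the same phenomenon):

def store_fraction(x, bits):
    # Keep only 'bits' binary places of the fraction x, truncating the rest.
    scaled = int(x * (2 ** bits))  # drop everything past the last binary place
    return scaled / (2 ** bits)

print(store_fraction(0.8, 6))  # 0.796875, not 0.8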
Computational errors
a. In general every arithmetic operation performed by a computer may produce a
rounding error. The cause of this error will be one of:
i. the limited number of bits available to store the result, i.e., finite word
length.
ii. Overflow or underflow (also a consequence of the first cause of this error
type).
iii. Rounding in order to normalize a result.
b. The size of the error will depend on these two main factors:
i. The size of the word length.
ii. The method of rounding: up, down or off.
Control over these errors depends on the factors listed under the heading of rounding errors discussed earlier.
The following paragraphs outline a number of factors that either reduce errors or help in
avoiding them. Detailed discussion of how these factors work is not merited but you should be
able to verify the results from the examples given.
Order of operations
It is better to add "floating-point" numbers in ascending order of magnitude if possible. For example, try
calculating 0.595000 + 0.003662 + 0.000680 using only 3 digit accuracy for each intermediate
result.
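One possible working (assuming "3 digit accuracy" means keeping three significant figures after each step): adding left to right gives 0.595000 + 0.003662 = 0.598662, which rounds to 0.599; then 0.599 + 0.000680 = 0.599680, which rounds to 0.600. Adding the smallest values first gives 0.000680 + 0.003662 = 0.004342, which rounds to 0.00434; then 0.00434 + 0.595000 = 0.599340, which rounds to 0.599. The true sum is 0.599342, so adding in ascending order of magnitude gives the more accurate result.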
Algorithmic error
The errors produced when using an algorithm will frequently depend on the number and nature of the steps involved. If the error from one stage of the algorithm is carried over to successive stages then the size of the error may "grow". These accumulated errors, as they are called, may ultimately make the result obtained very unreliable.
Nesting
This reduces the number of operations, thereby reducing error accumulation. For example, to evaluate 3x³ + 2x² + 5x + 1 for a given x value use ((3x + 2)x + 5)x + 1, starting with the innermost bracket and working outwards.
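A minimal Python sketch of this nested (Horner) evaluation (the function name horner is illustrative):

def horner(coeffs, x):
    # Evaluate a polynomial by nesting (Horner's method). Coefficients are
    # listed from the highest power down, e.g. [3, 2, 5, 1] for 3x^3 + 2x^2 + 5x + 1.
    result = 0
    for c in coeffs:
        result = result * x + c
    return result

print(horner([3, 2, 5, 1], 2))  # ((3*2 + 2)*2 + 5)*2 + 1 = 43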
Batch adding
This is an extension of the method described above. A set of numbers to be added is
grouped into several batches containing numbers of similar magnitude. Each batch total is
found, and then the batch totals are added.
Ill conditioning
A calculation is ill conditioned if small errors in the data used lead to large errors in the answer.
Equations are ill conditioned if small changes in coefficients lead to large changes in the solution.
Algebraic formulae as used in basic mathematics and statistics at school or college can easily
become ill conditioned when certain specific values are substituted and should therefore only be
used with caution. Software specially designed to solve such problems is normally based on
alternative methods specially devised for the job. Such methods are relatively easy to find in
suitable reference books.
Data processing methods
Data may be processed manually, mechanically or electronically, and not every task is best done the same way. There are some that are best suited for electronic processing, while others are better done by manual methods.
Manual Method
This involves preparing data by means of using such tools as pens, pencils, ledgers, files, folders
etc. Improvements on these include using multi-copy forms, carbon paper etc. A good example
is the daily marking of attendance register in school.
Advantages
a. They are generally cheap.
b. Simple to operate.
c. Easily adaptable to changes.
d. Easily accessible.
Disadvantages
a. Slow in operation.
b. Prone to human error.
c. Not suitable for large volumes of data.
Mechanical Method
This method involves the use of a combination of manual processes and mechanical equipment
to carry out the function. Examples are Typewriters, Calculators etc.
Advantages
a. Widely used in large and small organizations.
b. Can serve as input to electronic system.
c. Quality and level of output greatly improved as compared to the manual method.
d. Requires less manpower than the manual method.
Disadvantages
a. Costly to purchase and maintain.
b. Possibility of equipment breakdown.
c. Produces lots of noise due to moving parts in the equipment.
d. Usually slow in operation.
Electronic Method
Here, the processing is done electronically by the computer system. There are two modes: batch processing and on-line processing.
Advantages
a. Faster analysis and results of processing
b. Handles complex calculations and problems
c. Can provide information in different and varied formats
d. Provides more accurate results than the other two methods
e. Work load capacity can be increased easily without hitches
f. Provides for standardization of method
g. Frees staff from clerical tasks for other tasks e.g. planning
Disadvantages
a. Initial acquisition cost may be high as well as maintenance costs
b. Specialist personnel may be required
c. Decreased flexibility as tasks become standards
There are two modes of computer data processing; Batch Processing and On-line Processing.
Batch Processing
A method of processing information in which transactions are accumulated and stored until a
specified time when it is necessary or convenient to process them as a group is called Batch
Processing. This method is usually adopted in payroll processing and sales ledger updates.
On-line Processing
A method of processing information in which, transactions are entered directly into the
computer and processed immediately. The on-line method can take different forms. These forms
are examined below.
Real Time Processing This is an on-line processing technique in which a transaction undergoes all
the data processing stages immediately on data capture. This method is used in Airline ticket
reservation and modern retail banking software.
Multiprogramming - This is a method that permits two or more programs to share a computer system's resources at the same time, in such a way that only one program is actually using the CPU at any given moment, but the input/output needs of other programs can be serviced at the same time. Two or more programs are active at the same time, but they do not use the same computer resources simultaneously. With multiprogramming, a set of programs takes turns using the processor.
Time Sharing - This capability allows many users to share computer-processing resources
simultaneously. It differs from multiprogramming in that the CPU spends a fixed amount of
time on one program before moving on to another. In a time-sharing environment, the different
users are each allocated a time slice of computer time. In this time slot, each user is free to
perform any required operations; at the end of the period, another user is given a time slice of
the CPU. This arrangement permits many users to be connected to a CPU simultaneously, with
each receiving only a tiny amount of CPU time. Time-sharing is also known as interactive
processing. This enables many users to gain on-line access to the CPU at the same time, while the CPU allocates time to each user, as if each were the only one using the computer.
Virtual Storage - Virtual storage was developed after some problems of multiprogramming
became apparent. It handles programs more efficiently because the computer divides the
programs into small fixed or variable length portions, storing only a small portion of the program
in primary memory at one time, due to memory size constraints as compared with program needs.
Virtual storage breaks a program into a number of fixed-length portions called pages or variable-length portions called segments. Page boundaries are fixed by the operating system, whereas the programmer determines the actual breakpoints between segments. All other program pages are stored on a disk unit until
they are ready for execution and then loaded into primary memory. Virtual storage has a number
of advantages. First, primary storage is utilized more fully. Many more programs can be in
primary storage because only one page of each program actually resides there. Secondly,
programmers need not worry about the size of the primary storage area. With virtual storage, there is in effect no limit to a program's storage requirements.
Week Three
A discussion of data and the different validation techniques for both on-line and batch systems of processing data. We shall also discuss the data hierarchy and the different file accessing, organization and processing methods.
Objective: The objective is for the student to understand the various validation techniques, compare and contrast them, and examine the advantages and disadvantages one validation technique has over another. Students should be able to know and implement these various validation techniques.
GIGO stands for Garbage-In, Garbage-Out. This means that whatever data you pass or enter into the computer system is what will be processed. The computer is a machine and therefore has
no means of knowing whether the data supplied is the right one or not. To minimize such
situations that may lead to the computer processing wrong data and producing erroneous
output, data entered into a computer is validated within specific criteria to check for correctness
before being processed by the system. This process is called DATA VALIDATION. We stated
above that computer data processing is done in batch and on-line processing modes and we
shall therefore discuss data validation techniques under each of these two modes.
Batch Control
This type of input control requires the counting of transactions or any selected quantity field in
a batch of transactions prior to processing for comparison and reconciliation after processing.
Also, all input forms should be clearly identified with the appropriate application name and
transaction type (e.g. Deposits, Withdrawals etc). In addition, prenumbered and pre-printed
forms can be used where constant data are already printed and used to reduce data entry or
recording errors.
• Total Monetary Amount - This is used to verify that the total monetary value of items
processed equals the total monetary value of the batch documents.
• Total Items - This verifies that the total number of items included on each document in the batch agrees with the total number of items processed. For example, the total number of items entered from the batch must equal the total number of items processed.
• Total Documents - This verifies that the total number of documents in the batch equals the total number of documents processed. For example, the total number of invoices agrees with the number of invoices processed.
• Hash Total - Hashing is the process of assigning a value to represent some original data string. The value is known as the hash total. Hashing provides an efficient method of checking the validity of data by removing the need for the system to compare the actual data, instead allowing it to compare the value of the hash, known as the hash total, to determine if the data is the same or different. For example, totals are obtained on identifier (otherwise meaningless) data fields such as account number, part number or employee number. These totals have no significance other than for internal system control purposes. The hash total is entered at the start of the input process; after completion, the system re-calculates the hash total using the selected fields (e.g. account number) and compares the entered and calculated hash totals. If they are the same, the batch is accepted; otherwise it is rejected.
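A minimal Python sketch of a hash total over account numbers (the account numbers are made up for illustration):

# Account numbers keyed in from a batch of source documents (illustrative values).
accounts_entered = [10234, 20456, 30987, 40112]
hash_total_entered = sum(accounts_entered)  # entered at the start of input

# After processing, the system recomputes the total from the records it read.
accounts_processed = [10234, 20456, 30987, 40112]
hash_total_computed = sum(accounts_processed)

if hash_total_computed == hash_total_entered:
    print("Batch accepted")
else:
    print("Batch rejected: hash totals differ")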
An advantage of on-line real time systems is that data editing and validation can be done up front, before any processing occurs. As each transaction is input and entered, the operator can be prompted immediately an error is found, and the system can be designed to reject additional input until the error is corrected. The most important data edit and validation techniques are discussed below, but the list is by no means exhaustive.
• Reasonableness Check - Data must fall within certain limits set in advance or they will be rejected. For example, if an order transaction is for 20,000 units and orders are normally for not more than 100 units, then the transaction will be rejected.
• Range Check - Data must fall within a predetermined range of values. For example, if a
human weighs more than 150kg, the data would be rejected for further verification and
authorization.
• Existence Check - Data are entered correctly and agree with valid predetermined criteria.
For example, the computer compares input reference data like Product type to tables or
master files to make sure the codes are valid.
• Check Digit - An extra reference number called a check digit follows an identification code
and bears a mathematical relationship to the other digits. This extra digit is input with the
data, recomputed by the computer and the result compared with the one entered.
1. It is calculated using a modulus. Several moduli are used in practice and each has had varying degrees of success at preventing certain types of errors; MODULUS 11 (eleven) is used here.
2. Modulus notation. Two numbers are congruent in a modulus if both yield the same remainder when divided by the modulus. The symbol ≡ means "congruent to", e.g., 8 ≡ 3 (mod 5), i.e., 8 has remainder 3 when divided by 5, and so does 3.
CALCULATIONS
3. Check digits are calculated by a computer in the first place and are generally used in conjunction with fixed data (i.e., customers' numbers, etc.). As a result of a test done on modulus 11 it was discovered that it detected all transcription and transposition errors and 91% of random errors.
4. Checking numbers. When the code number is input to the computer precisely the same calculation can be carried out (using a weight of 1 for the rightmost digit) and the resultant remainder should be 0. If not, then the number is incorrect. For example, for the number 63495: (6x5) + (3x4) + (4x3) + (9x2) + (5x1) = 30 + 12 + 12 + 18 + 5 = 77; dividing by 11 gives remainder 0, so the number is accepted.
5. This check can be carried out off-line by a machine called a CHECK DIGIT VERIFIER.
6. This is a useful programming exercise and it may also be worth including in an examination project in which account numbers or similar keys are used.
7. Check digits are used in many situations. The ISBN in any book (see the back cover of a textbook) is just one example.
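A minimal Python sketch of a modulus 11 scheme consistent with the worked example above (the function names are illustrative):

def append_check_digit(number):
    # Weight the digits 2, 3, 4, ... from the rightmost digit of the base
    # number, total the products, and choose the check digit that makes the
    # overall total divisible by 11.
    digits = [int(d) for d in str(number)]
    total = sum(w * d for w, d in zip(range(2, 2 + len(digits)), reversed(digits)))
    check = (11 - total % 11) % 11
    if check == 10:
        return None  # numbers needing a check digit of 10 are often not issued
    return number * 10 + check

def is_valid(coded):
    # Repeat the calculation with a weight of 1 for the rightmost (check) digit;
    # a valid number leaves remainder 0 when the total is divided by 11.
    digits = [int(d) for d in str(coded)]
    total = sum(w * d for w, d in zip(range(1, 1 + len(digits)), reversed(digits)))
    return total % 11 == 0

print(append_check_digit(6349))  # 63495, as in the worked example
print(is_valid(63495))           # True
print(is_valid(63459))           # False - a transposition is detected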
• Completeness Check - A field should always contain data and not zeros or blanks. A check of the field is performed to ensure that some form of data, not blanks or zeros, is present. For example, employee number should not be left blank as it identifies that employee in the employee record.
• Validity Check - This is the programmed checking of data validity in accordance with
predetermined criteria. For example, a gender field should contain only M(ale) or
F(emale). Any other entry should be rejected.
• Key Verification - The key-in process is repeated by another individual using a program that compares the original entry to the repeated keyed input. For example, the account number, date and amount on a cheque are keyed in twice and compared to verify the keying process.
• Duplicate Check - New transactions are matched to those previously entered. For example, an invoice number is checked to ensure that it is not the same as one previously entered, so that payment is not made twice.
• Logical Relationship Check - If a particular condition is true, then one or more additional conditions or data input relationships might be required to be true before the input can be accepted.
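A minimal Python sketch of a few of these edits applied to a single transaction (the field names and limits are illustrative):

def validate(record):
    # Apply simple versions of the edits described above and collect any errors.
    errors = []
    # Completeness check: the employee number must not be blank.
    if not record.get("employee_no"):
        errors.append("employee number missing")
    # Validity check: gender must be M or F.
    if record.get("gender") not in ("M", "F"):
        errors.append("invalid gender code")
    # Reasonableness/range check: order quantity must fall within set limits.
    if not 1 <= record.get("quantity", 0) <= 100:
        errors.append("quantity outside expected range")
    return errors

print(validate({"employee_no": "E123", "gender": "M", "quantity": 20000}))
# ['quantity outside expected range']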
Week Four
Data hierarchy, file accessing (sequential, direct, index sequential and object oriented file access), flat file and database files, file processing (updating, sorting, merging, blocking, searching and matching), physical storage considerations, initialization, formatting and defragmentation methods.
Objective:
• To explore programming language constructs that support data abstraction.
• To discuss the general concept of file access and processing in data processing.
• To impart to students the knowledge required in choosing an appropriate file access and processing technique when developing data processing application software.
Description: The constituents of data in their hierarchy are discussed in detail, and the preparatory requirements of the storage devices needed are emphasized.
A computer system organizes data in a hierarchy that starts with bits and bytes and progresses
to fields, records, files, and databases.
Bit: This represents the smallest unit of data a computer can handle. A group of bits, called a
byte, represents a single character, which can be a letter, number or other symbol.
Field: A field is a particular place in a record where an item of information can be held; a grouping of characters into a word, group of words or a complete number (e.g. a person's first name or age) is called a field.
Record: A record is a collection of related items of data treated as a unit; a group of related fields, such as a student's name, class, date admitted and age, forms a record.
File: A file is an organized collection of related records which are processed together. It is also referred to as a data set. A file is a collection of records relating to some class of object, e.g. records of all insurance policies issued by an insurance company, records of all employees of a firm, student records etc. A group of records of the same type (e.g. the records of all students in a class) is called a file.
Database: A group of related files (e.g. the personal history, examination records and payment history files) makes up a database. A record describes an entity. An entity is a person, place, thing, or event on which we maintain information. An employee record is an entity in a personnel records file and maintains information on the employees in that organization. Each characteristic or quality describing a particular entity is called an attribute. For example, employee name, address, age, gender and date employed are each attributes of the entity employee. The specific values that these attributes can have are found in the fields of the record describing the entity. Every record in the file contains at least one field that uniquely identifies that record so that the record can be retrieved, changed, modified or sorted. This identifier is called the key field. An example of a key field is the employee number for a personnel record containing employee data such as name, address, age, job title etc.
Computer systems store files in secondary storage (e.g. hard disks) devices. The records can be
arranged in several ways on the storage media, and the arrangement determines the manner in
which the individual records can be accessed or retrieved.
Sequential Access File Organization - In sequential file organization, data records must be retrieved in the same physical sequence in which they are stored. Sequential file organization is the only method that can be used on magnetic tape (e.g. data or audio tape). This method is used when large volumes of records are involved; it is slow, and so is suitable for batch processing.
Direct/Random Access File Organization - This is a method of storing records so that they can be accessed in any sequence without regard to their actual physical order on the storage media. This method permits data to be read from, and written back to, the same location. The physical
location of the record in the file can be computed from the record key and the physical address
of the first record in the file, using a transform algorithm, not an index. (The transform algorithm
is a mathematical formula used to translate the key field directly into the record's physical
location on disk.) Random access file organization is good for large files when the volume of
transactions to be processed against the file is low. It is used to identify and update an
individual's record on a real-time basis. It is fast and suitable for on-line processing where many
searches for data are required. It is faster than sequential file access method. An example is an
on-line hotel reservation system.
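A minimal Python sketch of one possible transform algorithm, division-remainder hashing (the key, addresses and sizes are illustrative, not from the notes):

def record_address(key, first_record_address, record_length, n_slots):
    # Division-remainder hashing: the remainder on dividing the key gives a
    # slot number, which is turned into a physical address on the disk.
    slot = key % n_slots
    return first_record_address + slot * record_length

# e.g. record key 63495 in a file of 997 slots of 120 bytes starting at byte 4096
print(record_address(63495, 4096, 120, 997))  # 86176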
Index Sequential Access Method (ISAM) - This file access method directly accesses records
organized sequentially using an index of key fields. An index to a file is similar to the index of a
book, as it lists the key fields of each record and where that record is physically located in storage
to ensure speedy location of that record. ISAM is employed in applications that require
sequential processing of large numbers of records but occasionally require direct access of
individual records. An example is in airline reservation systems where bookings can take place in different parts of the world at the same time, all accessing information from one file. ISAM allows access to records in the most efficient manner.
Flat File - This supports a batch-processed file where each record contains the same type of data elements in the same order, with each data element needing the same number of storage spaces. It supports a few users' needs and is inflexible to changes. It is used to enter data into an application
automatically in a batch mode, instead of record by record. This process of automatic batch data
entry is also referred to as a File Upload process.
Database File - A database supports multiple users' needs. The records are related to each other differently for each file structure. It removes the disadvantages of flat files.
Object Oriented File Access - Here, the application program accesses data objects and uses a
separate method to translate to and from the physical format of the object.
File Processing
Different processes can be performed on files stored in the computer system. These processes
include:
• Updating - The process of bringing information contained in the file up to date by feeding
in current information.
• Sorting - Arranging the records in a file in a particular order (e.g. in alphabetical or
numerical order within a specified field).
• Merging - Appending or integrating two or more files into a bigger file.
• Blocking - This is to logically arrange the records in a file into fixed or variable blocks or sets that can be treated as a single record at a time during processing. The gap between each block is known as the inter-block gap.
• Searching - This involves going through a whole file to locate a particular record or a set
of records, using the key field.
• Matching - This involves going through a whole file to locate a particular record or a set
of records, using one or a combination of the file attributes or fields.
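A minimal Python sketch of the sorting, merging and searching operations on in-memory records keyed by student number (the data is illustrative):

# Hypothetical student records, each a (key field, name) pair.
file_a = [(310, "Chidi"), (101, "Ade"), (205, "Bola")]
file_b = [(400, "Eze"), (150, "Dayo")]

# Sorting: arrange the records of a file in key order.
sorted_a = sorted(file_a, key=lambda rec: rec[0])

# Merging: integrate two files into one bigger file, keeping key order.
merged = sorted(file_a + file_b, key=lambda rec: rec[0])

# Searching: locate a particular record using the key field.
def search(records, key):
    for rec in records:
        if rec[0] == key:
            return rec
    return None

print(search(merged, 205))  # (205, 'Bola')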
Disk defragmentation.
Fragmentation: As data is stored on a newly formatted disk the data is written to unused
contiguous sectors (i.e., those sectors which follow one another). If data is erased then the deleted
sectors may leave "holes" of free space among used sectors. Over time, after many inserts and
deletes, these free sectors may be scattered across the disk so that there may be very little
contiguous free space. This phenomenon is called "disk fragmentation". If a file, such as a
document file say, is written to the disk the read-write heads will have to move about as they
access the fragmented free space. This slows down the writing process and will also slow down
any subsequent reads. Therefore, performance suffers. When this happens it may be possible to
use a special disk defragmenter program to re-organise the data on the disk so as to eliminate
the fragmentation.
Medium | Access | Capacity | Transfer rate | Typical systems
5. Magnetic tape cartridge | A search is required. | 50 Mbytes - 10 Gbytes | 160,000 bps - 2.6 Mbps | Microcomputer and minicomputer systems
6. Magnetic tape cassette | A search is required. | Up to 145,000 bytes | 10 bps - 33,000 bps (SAS) | Small microcomputer systems
Fig. 4: Comparative performance of backing storage media and devices.
Points to note
a. Note the terms "on-line" and "off-line". "On-line" means being accessible to and under the control of the processor. Conversely, "off-line" means not accessible to or under the control of the processor. Thus, fixed magnetic disks are permanently "on-line"; a magnetic tape reel or an exchangeable magnetic disk pack is "on-line" when placed in its respective unit, but "off-line" when stored away from the computer. Terminals, wherever their physical location, are said to be "on-line" when directly linked to the processor.
b. On exchangeable disks, the read-write heads serving each surface will be positioned over
the same relative track on each surface because the arms on which they are fixed move
simultaneously.
c. The "jukebox" was introduced in this segment as an option used with CDs but jukeboxes
are available for a variety of disk devices.
d. The term "cartridge" is ambiguous unless prefixed by "tape" or "disk".
e. The devices and media have been separated for ease of explanation. It should be noted,
however, that in the case of the fixed disk the media are permanently fixed to the device,
and therefore they cannot be separated.
f. Backing storage is also called auxiliary storage.
g. Input, output and storage devices are referred to collectively as peripheral devices.
Week Five
Objective:
• To discuss the general concept of the distributed systems approach to data processing.
• To impart to students the knowledge of processing in a distributed system environment.
• To discuss the techniques used by various printer types in printing hardcopy output.
• To acquire the know-how required in choosing an appropriate printer for output design.
Description: Distributed processing, and printers with the kind of hardcopy output quality they generate, are discussed in detail and the features of the printers are emphasized.
The term "terminal" is normally synonymous with VDU and is often used instead. There are many different types of VDU terminals in use today. Only the more common features and variants will be described.
c. Characters are displayed on the screen in a manner that resembles printed text. A typical full screen display is 24 rows by 80 columns (i.e., 1,920 characters).
How it works: When a key is pressed on the keyboard the character's code (in ASCII, say) is
generated and transmitted to the computer along the lead connecting the terminal to the
computer. Normally the character code received by the computer is immediately "echoed" back
out to the terminal by the computer. When a character code is received by the terminal it is
interpreted by the control circuitry and the appropriate character symbol is displayed on the
screen, or printed depending upon the type of terminal. In what follows the basic operation of a
VDU is covered in more detail.
For the more basic VDU models the character codes are stored in a memory array inside the
VDU with each location in memory corresponding to one of the 24 x 80 character positions on
the screen. The VDU is able to interpret control characters affecting the text format. Each
character in the character set has its display symbol defined in terms of a grid of bits. These
predetermined patterns are normally held in a special symbol table in ROM. Common grid sizes
are 8 x 14 and 9 x 16. The character array is scanned by circuitry in the VDU and the character
map generator refers to the ROM to produce the appropriate bit-map image for each character
to be displayed on the screen. Often the device is also able to interpret sequences of control characters (often beginning with the ASCII ESC control character) which may alter display characteristics such as reverse video or colour. This kind of VDU is only able to form images on
the screen by constructing them from character symbols by means of the character map
generator. It is therefore called a character terminal. The superior alternative is a graphics
terminal which has high quality displays that can be used for line drawings, draughtsmen's
drawings, etc. In a Raster Scan Display, which is just one of many types of display technology,
the character codes received from the computer are interpreted by the terminal's map generator
which then loads the appropriate bit pattern into special memory (video RAM) acting as a bit-map for the whole screen.
Uses: Workstations are normally used by professionals for particular kinds of work such as
Finance (dealer rooms), Science and Research, and Computer Aided Design. They are also
very popular for programming.
Examples of workstations are the Digital VAXSTATIONs and the SUN
SPARCstations.
Output Devices
Printers
Features
a. As with all character printers the device mimics the action of a typewriter by printing
single characters at a time in lines across the stationery. The print is produced by a
small "print head" that moves to and fro across the page stopping momentarily in
each character position to strike a print ribbon against the stationery with an array of
wires.
b. According to the number of wires in the print head, the character matrix may be 7 x 5,
7x 7, 9 x 7, 9 x 9 or even 24 x 24. The more dots the better the image.
c. Line widths are typically 80, 120, 132, or 160 characters across.
d. Speeds are typically from 30 cps to 200 cps.
e. Multiple print copies may be produced by the use of carboned paper (e.g. 4-6 copies
using NCR (No Carbon Required) paper).
Some higher quality versions can produce NLQ (Near Letter Quality) print, have inbuilt alternative character sets, plus features for producing graphs, pictures, and colour.
Inkjet printers
The original models of these printers were character matrix printers and had only limited
success. Modern inkjet printers can act as character printers or page printers producing high
print quality relatively quietly, and have therefore replaced dot matrix printers for most low-speed office printing.
a. These are non-impact page printers often having inbuilt sets of scaleable fonts.
b. They operate by firing very tiny ink droplets onto the paper by using an
"electrostatic field". By this means a medium quality bit-mapped image can be
produced at a resolution of about 300-600dpi or above. Those using oil-based inks
tend to produce higher quality print than those using water based inks.
c. They are very quiet but of low speed (4-6 ppm). Their lower speed is reflected in a lower price.
d. Some models print colour images (typically at 2ppm), by means of multiple print
heads each firing droplets of a different colour.
e. Some can print on plain paper, glossy paper and transparencies.
Daisywheel printers. This was once a popular type of low-speed printer that was favoured over dot matrix printers, but it is now far less common because it has been superseded by superior inkjet printers.
Loudspeakers come into their own when used in conjunction with digitised sound. For example, by means of special software
a desktop computer may be turned into a sound synthesiser unit which can be hooked up to an
audio system.
Summary
The features of the main hardware units and media for the output of data from the computer have
been covered. They are:
a. Printers - Single sheet or continuous stationery.
b. Microform recorder - Microfilm or Microfiche.
c. Graph Plotters - Single sheet or continuous stationery.
d. Actuators.
Week Six
Concepts of data capture and data entry, problems of data entry, data collection stages, data capture techniques and devices, computer file concepts, computer file processing, computer disk storage file processing and the elements of a computer file.
Objective:
• To explore data capture techniques that support data collection in data processing.
• To discuss the general concept of data capture and data entry.
• To enable the students to select appropriate data collection devices for data processing.
• To introduce file concepts in computers, followed by an extended discussion of the ways of viewing file store in a computer and the purpose of data files in a data processing environment.
Description: Data capture versus data entry, the features of data capture devices and the features of documents captured by such devices are explained to aid better understanding.
Introduction
These days the majority of computer end-users input data to the computer via keyboards on PCs,
workstations or terminals. However, for many medium and large scale commercial and industrial
applications involving large volumes of data the use of keyboards is not practical or economical.
Instead, specialist methods, devices and media are used and these are the subject of this segment.
The segment begins by examining the problems of data entry. It goes on to consider the stages
involved and the alternatives available. It then examines the factors that influence the choice of
methods, devices and media for data input. Finally, the segment examines the overall controls
that are needed over data as it is entered into the computer for processing. The selection of the
best method of data entry is often the biggest single problem faced by those designing
commercial or industrial computer systems, because of the high costs involved and numerous
practical considerations. The best methods of data entry may still not give satisfactory facilities
if the necessary controls over their use are not in place.
Many of the problems of data entry can be avoided if the data can be obtained in a computer-sensible form at the point of origin. This is known as data capture. This segment will
describe several methods of data capture. The capture of data does not necessarily mean its
immediate input to the computer. The captured data may be stored in some intermediate form
for later entry into the main computer in the required form. If data is input directly into the
computer at its point of origin the data entry is said to be on-line. In addition, the method of direct input via a terminal or workstation is known as Direct Data Entry (DDE). The term Data Entry used in the segment title usually means not only the process of
physical input by a device but also any methods directly associated with the input.
Character recognition
The methods described so far have been concerned with turning data into a machine sensible
form as a prerequisite to input. By using Optical Character Recognition (OCR) and Magnetic Ink
Character Recognition (MICR) techniques, the source documents themselves are prepared in a
machine-sensible form and thus eliminate the transcription stage. Notice, however, that such
characters can also be recognised by the human eye. We will first examine the devices used.
Document readers
Optical readers and documents. There are two basic methods of optical document reading:
a. Optical Character Recognition (OCR).
b. Optical Mark Recognition (OMR).
These two methods are often used in conjunction with one another, and have much in
common. Their common and distinguishing features are covered in the next few
paragraphs.
Features of an optical reader.
a. It has a document-feed hopper and several stackers, including a stacker for "rejected"
documents.
b. Reading of documents prepared in optical characters or marks is accomplished as follows:
i. Characters. A scanning device recognises each character by the amount of reflected light
(i.e., OCR). The method of recognition, although essentially an electronic one, is similar
in principle to matching photographic pictures with their negatives by holding the
negative in front of the picture. The best match lets through the least light.
ii. Marks. A mark in a particular position on the document will trigger off a response. It is
the position of the mark that is converted to a value by the reader (i.e., OMR). The method
involves directing thin beams of light onto the paper surface which are reflected into a
light detector, unless the beam is absorbed by a dark pencil mark, i.e., a mark is
recognised by the reduction of reflected light.
Note. An older method of mark reading called mark sensing involved pencil marks conducting
between two contacts and completing a circuit.
c. Documents may be read at up to 10,000 A4 documents per hour.
Features of a document.
a. Documents are printed in a stylised form (by printers, etc, fitted with a special
typeface) that can be recognised by a machine. The stylised print is also
recognisable to the human eye. Printing must be on specified areas on the
document.
b. Some documents incorporate optical marks. Predetermined positions on the
document are given values. A mark is made in a specific position using a pencil
and is read by the reader.
c. Good-quality printing and paper are vital.
d. Documents must be undamaged for accurate reading.
e. Sizes of documents, and scanning area, may be limited.
Magnetic ink reader and documents
The method of reading these documents is known as Magnetic Ink Character Recognition (MICR).
Features of magnetic ink readers
a. Documents are passed through a strong magnetic field, causing the iron oxide in
the ink encoded characters to become magnetised. Documents are then passed
under a read head, where a current flows at a strength according to the size of
the magnetised area (i.e., characters are recognised by a magnetic pattern).
b. Documents can be read at up to 2,400 per minute.
Optical character recognition (OCR)
b. Applications:
OCR is used extensively in connection with billing, e.g., gas and electricity bills and
insurance premium renewals and security printing. In these applications the bills are
prepared in OC by the computer, then sent out to the customers, who return them with
payment cheques. The documents re-enter the computer system (via the OC reader) as
evidence of payment. This is an example of the "turnaround" technique. Notice that no
transcription is required.
c. OCR/keyboard devices:
These permit a combination of OCR reading with manual keying. Printed data (e.g.,
account numbers) is read by OCR; hand-written data (e.g., amounts) is keyed by the
operator. This method is used in credit card systems.
Optical mark reading (OMR)
a. Technique explained:
Mark reading is discussed here because it is often used in conjunction with OCR, although
it must be pointed out that it is a technique in itself. Positions on a document are given
certain values. These positions when "marked" with a pencil are interpreted by a machine.
Notice it is the "position" that the machine interprets and that has a predetermined value.
b. Application:
Meter reader documents are a good example of the use of OMR in conjunction with OCR.
The computer prints out the document for each customer (containing name, address, last
reading, etc,) in OC. The meter reader records the current reading in the form of "marks"
on the same document. The document reenters the computer system (via a reader that
reads OC and OM) and is processed (i.e., results in a bill being sent to the customer). Note
that this is another example of a "turnaround document".
Magnetic ink character recognition (MICR)
a. Techniques explained:
Numeric characters are created in a highly stylised type by special encoding machines
using magnetic ink. Documents encoded thus are "read" by special machines.
b. Application. One major application is in banking (look at a cheque book), although some local authorities use it for payment of rates by installments. Cheques are encoded at the bottom
with account number, branch code and cheque number before being given to the customer
(i.e., pre-encoded). When the cheques are received from the customers the bottom line is
completed by encoding the amount of the cheque (i.e., post-encoded). Thus all the details
necessary for processing are now encoded in MIC and the cheque enters the computer
system via a magnetic ink character reader to be processed.
Features
The specific features of these devices tend to depend upon the application for which they are used. However, data captured by the device must ultimately be represented in some binary form in order to be processed by a digital computer. For some devices, the input may merely be a single-bit representation that corresponds to some instrument, such as a pressure switch, being on or off.
Introduction
The purpose of this segment is to look at the general concepts that lie behind the subject of
computer files before going on to discuss the different methods of organising them. At all times
the term "file" will refer to computer data files.
Purpose of data files
A file holds data that is required for providing information. Some files are processed at regular intervals to provide this information (e.g., a payroll file) and others will hold data that is required
at regular intervals (e.g., a file containing prices of items). There are two common ways of
viewing files:
a. Logical files. A "logical file" is a file viewed in terms of what data items its records contain
and what processing operations may be performed upon the file.
The user of the file will normally adopt such a view.
b. Physical files. A "physical file" is a file viewed in terms of how the data is stored on a storage device such as a magnetic disk and how the processing operations are made possible. For example, a single record might be physically stored as a string of field values such as: 1201 10 02 80 M 12 5500
Week Seven
Mid Semester Test
Objective:
• To evaluate the performance of students on the lectures/teachings received in this course.
Description: To know how far the student can apply the knowledge gained from the course.

Week Eight
Description: Demonstration of how fields, records and files, including databases, are created using a query language approach.
INTRODUCTION
A database contains one or more tables. Each table is identified by a name (e.g. "Customers" or "Orders"). Tables contain records (rows) with data. Below is an example of a table called "Persons":

P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger

The table above contains three records (one for each person) and five columns (P_Id, LastName, FirstName, Address, and City).
Using SQL, you can create the table structures within the database you have designated; for example, a STORAGE table would be created with a CREATE TABLE statement. Most DBMSs now use interfaces that allow you to type the attribute names into a template and to select the attribute characteristics you want from pick lists. You can even insert comments that will be reproduced on the screen to prompt the user for input. For example, the preceding STORAGE table structure might be created in such a template rather than typed as SQL. If you want to generate a LA (lab assistant) schedule, you need data from two tables, LABASSISTANT and WORK-SCHEDULE. Because the report output is ordered by semester, LA, weekday, and time, indexes must be available for the primary key fields in each table; using SQL, these would be created with CREATE INDEX statements. Most modern DBMSs automatically index on the primary key components. Views are often used for security purposes. However, views are also used to streamline the system's processing requirements. For example, output limits may be defined efficiently through appropriate views. To create the views necessary for the LA schedule report for the fall semester of 1999, we would use the CREATE VIEW command.
Figure 8.1: You can access a MySQL database server from the command window.
Figure 8.2: (a) The show databases command displays all available databases in the MySQL database server; (b) the use test command selects the test database.
The mysql database contains the tables that store information about the server and its users. This database is intended for the server administrator to use. For example, the administrator can use it to create users and grant or revoke user privileges. Since you are the owner of the server installed on your system, you have full access to the mysql database. However, you should not create user tables in the mysql database. You can use the test database to store data or create new databases. You can also create a new database using the command create database <database name>, or drop an existing database using the command drop database <database name>. To select a database for use, type the use <database name> command. Since the test database is created by default in every MySQL installation, let us use it to demonstrate SQL commands. As shown in the figure above, the test database is selected. Enter the statement to create the Course table as shown in the figure below.
Figure 8.3: The execution result of the SQL statements is displayed in the MySQL
monitor.
If you make typing errors, you have to retype the whole command. To avoid
retyping the whole command, you can save the command in a file, and then run the
command from the file. To do so, create a text file to contain the commands, named,
for example, test.sql. You can create the text file using any text editor, such as
Notepad, as shown in the figure below. To comment a line, precede it with two
dashes. You can then run the script file by typing source test.sql at the MySQL
command prompt, as shown in the figure below.
Figure 8.4: You can use Notepad to create a text file for SQL commands.
Figure 8.5: You can run the SQL commands in a script file from MySQL.
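For instance, test.sql might contain the course-table statement used under SQL STATEMENTS below, preceded by a comment line:

-- test.sql: creates the Course table
create table Course(
  courseId char(5),
  subjectId char(4) not null,
  courseNumber integer,
  title varchar(50) not null,
  numOfCredits integer,
  primary key (courseId)
);

The script is then executed by typing source test.sql at the mysql> prompt.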
SQL STATEMENTS
Table 8.2: SQL Commands and Functions (grouped as Basic SQL, Advanced SQL and SQL Functions).
Tables are the essential objects in a database. To create a table, use the create table
statement to specify a table name, attributes, and types, as in the following example:

create table Course(
  courseId char(5),
  subjectId char(4) not null,
  courseNumber integer,
  title varchar(50) not null,
  numOfCredits integer,
  primary key (courseId)
);

This statement creates the Course table with attributes courseId, subjectId,
courseNumber, title and numOfCredits. Each attribute has a data type that specifies
the type of data stored in the attribute. char(5) specifies that courseId consists of five
characters. varchar(50) specifies that title is a variable-length string with a maximum
of fifty characters. integer specifies that courseNumber is an integer. The primary
key is courseId. The tables Student and Enrollment can be created as follows:
create table Student (
  ssn char(9),
  firstName varchar(25),
  mi char(1),
  lastName varchar(25),
  birthDate date,
  street varchar(25),
  phone char(11),
  zipCode char(5),
  deptId char(4),
  primary key (ssn)
);

create table Enrollment (
  ssn char(9),
  courseId char(5),
  dateRegistered date,
  grade char(1),
  primary key (ssn, courseId),
  foreign key (ssn) references Student,
  foreign key (courseId) references Course
);
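Once the tables exist, rows can be added with the insert statement; the values below are purely illustrative:

insert into Course (courseId, subjectId, courseNumber, title, numOfCredits)
values ('11113', 'CSCI', 3720, 'Database Systems', 3);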
If a table is no longer needed, it can be dropped permanently using the drop table
command. For example, the following statement drops the Course table:
drop table Course;
If a table to be dropped is referenced by other tables, you have to
drop the other tables first. For example, if you have created the tables Course, Student
and Enrollment and want to drop Course, you have to first drop Enrollment, because
Course is referenced by Enrollment.
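In that case the drop statements would be issued in this order:

drop table Enrollment;
drop table Course;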
Now we want to select the content of the columns named "LastName" and
"FirstName" from the table above. We use the following SELECT statement:
SELECT LastName, FirstName FROM Persons
The result-set will look like this:

LastName   FirstName
Hansen     Ola
Svendson   Tove
Pettersen  Kari

SELECT * Example
Now we want to select all the columns from the "Persons" table.
We use the following SELECT statement: SELECT * FROM Persons
Tip: The asterisk (*) is a quick way of selecting all columns! The result-set will look
like this:

P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
3     Pettersen  Kari       Storgt 20     Stavanger
The WHERE Clause
The WHERE clause is used to filter records: it extracts only those records that
fulfill a specified criterion. SQL WHERE syntax:
SELECT column_name(s) FROM table_name WHERE column_name operator value
WHERE Clause Example
Now we want to select only the persons living in the city "Sandnes" from the "Persons" table
above. We use the following SELECT statement:
SELECT * FROM Persons WHERE City='Sandnes'
The result-set will look like this:

P_Id  LastName   FirstName  Address       City
1     Hansen     Ola        Timoteivn 10  Sandnes
2     Svendson   Tove       Borgvn 23     Sandnes
Note: In some versions of SQL the <> operator may be written as !=.
The AND & OR Operators
The AND and OR operators are used to filter records based on more than one condition.
The AND operator displays a record if both the first condition
and the second condition are true. The OR operator displays a record if either the first
condition or the second condition is true.
AND Operator Example
Now we want to select only the persons with the first name equal to "Tove" AND the
last name equal to "Svendson". We use the following SELECT statement:
SELECT * FROM Persons WHERE FirstName='Tove' AND LastName='Svendson'
OR Operator Example
Now we want to select only the persons with the first name equal to "Tove" OR the
first name equal to "Ola".
We use the following SELECT statement:
SELECT * FROM Persons WHERE FirstName='Tove' OR FirstName='Ola'
Combining AND & OR
Now we want to select only the persons with the last name equal to "Svendson" AND
the first name equal to "Tove" OR to "Ola". We use the following SELECT statement:
SELECT * FROM Persons WHERE LastName='Svendson' AND (FirstName='Tove' OR FirstName='Ola')
Week Ten
Types of file, access to file, Storage devices, Processing activities of files, Fixed-length
and variable-length records, Hit rate
Objectives:
• To discuss the general concept of types of file and access to file in data processing.
• To impart to students the knowledge of processing files in a computing environment.
• To discuss the processing activities of files and their application to computer files.
• To acquire the know-how required for choosing between fixed-length and
variable-length records in record design.
Description: Updating the master file, transaction files, reference files, file
interrogation and file characteristics; the major processing activities are discussed in
detail and their features emphasised.
Types of files
a. Master file.
These are files of a fairly permanent nature, e.g., customer ledger, payroll,
inventory, etc. A feature to note is the regular updating of these files to show
a current position. For example, customer orders will be processed
against the customer ledger to show the current position of each account.
Access to files
Key fields: When files of data are created one needs a means of access to
particular records within those files. In general terms this is usually done
by giving each record a "key" field by which the record will be recognised
or identified. Such a key is normally a unique identifier of a record and is
then called the primary key. Sometimes the primary key is made from the
combination of two fields in which case it may be called a composite key or
compound key. Any other field used for the purpose of identifying records,
or sets of records, is called a secondary key. Examples of primary key
fields are:
a. Customer number in a customer ledger record.
b. Stock code number in a stock record.
c. Employee clock number in a payroll record.
Not only does the key field assist in accessing records but also the records
themselves can, if required, be sorted into the sequence indicated by the
key.
Storage devices
Two storage devices may be considered in connection with the storage
of files (i.e., physical files):
a. Magnetic or optical disk. These are direct access media and are the
primary means of storing files on-line
b. Magnetic tape. This medium has significant limitations because it is a serial
access medium and therefore is the primary means of storing files offline.
These characteristics loom large in our considerations about files in the segments
that follow. Note then that they are inherent in the physical make-up of the
devices and will clearly influence what types of files can be stored on each one,
and how the files can be organised and accessed.
Processing activities
We will need to have access to particular records in the files in order to process
them. The major processing activities include updating master records from
transaction data, interrogating (referencing) records to extract information, and
general file maintenance.
Fixed-length records make it easy for the programmer because he or she is dealing
with a known quantity of characters each time. On the other hand they result in less
efficient utilisation of storage. Variable-length records mean difficulties for the
programmer but better utilisation.
Hit rate
This is the term used to describe the rate of processing of master files in terms of
active records. For example, if 1,000 transactions are processed each day against a
master file of 10,000 records, then the hit rate is said to be 10%. Hit rate is a measure
of the "activity" of the file.
Study questions:
1. An organisation runs a simple savings scheme for its members. Members pay
in sums of money to their own accounts and gain interest on the money saved.
Data about the accounts is stored in a master file. What would you suggest
would be the entities used in this system? Also suggest what attributes these
entities might have.
2. Define the term "key field". Discuss the suitability of the following data items
as key fields.
a. A person's surname in a personnel file.
b. A national insurance number in a payroll file.
c. A candidate number in an examinations file.
3. Define the terms "hit rate" and "volatility" with regard to computer files. Where
else have you come across the term "volatility" in computer science?
4. Suppose the storage of data in a master file is required. Comment on the probable
characteristics of the file in terms of:
a. Volatility,
b. Activity,
c. Size,
d. Growth.
Introduction
This segment describes the ways in which files may be organised and accessed on
disks. Before tackling this segment the reader needs to be thoroughly conversant with the
relevant physical attributes of disks (fixed and exchangeable) and disk units
("reading" and "writing" devices).
Today most file processing is carried out using files stored on hard magnetic disks.
Optical disks only have a minority use at present although they are being used
increasingly for applications requiring large volumes of archived or reference data.
Flash drives and floppy disks are not normally used as file processing media because of
their limited capacity. They are more often used to transfer small files between
computers, particularly PCs. They are only used as the main file processing medium
on a few very small microcomputers. The principles covered by this segment
concerning the use of disks are applicable to all disk types. Any relevant differences
will be highlighted when appropriate.
There is still some file processing carried out using files stored on magnetic tape but
it is almost all done on mainframes in large commercial, industrial or financial
institutions. Magnetic tape continues to be an important backup medium especially
in its cartridge forms.
The simplest methods of organising and accessing files on disk are very similar to
the standard ones used for magnetic tape. Where appropriate this similarity will be
drawn to the reader's attention. Otherwise little mention will be made of magnetic
tape.
File organisation is the arrangement of records within a particular file. We start from
the point where the individual physical record layout has already been designed,
i.e., the file "structure" has already been decided. How do we organise our many
hundreds, or even thousands, of such records (e.g., customer records) on disk? When
we wish to access one or more of the records how do we do it? This segment explains
how these things are done.
Writing on disk
In order to process files stored on disk the disk cartridge or pack must first be loaded
into a disk unit. For a fixed disk the disk is permanently in the disk unit. Records are
"written" onto a disk as the disk pack revolves at a constant speed within its disk
unit. Each record is written in response to a "write" instruction. Data goes from main
storage through a read-write head onto a track on the disk surface. Records are
recorded one after the other on each track. (On magnetic tape the records are also
written one after the other along the tape.)
Note. All references to "records" in this segment should be taken to mean
"physical records" unless otherwise stated.
Reading from disk
In order to process files stored on disk the disk cartridge or pack must first be loaded
into a disk unit. Records are read from the disk as it revolves at a constant speed.
Each record is read in response to a "read" instruction. Data goes from the disk to the
main storage through the read-write head already mentioned. Both reading and
writing of data are accomplished at a fixed number (thousands) of bytes per second.
We will take for our discussion on file organisation a "6-disk" pack, meaning it has
ten usable surfaces (the outer two are not used for recording purposes). But before
describing how files are organised let us look first at the basic underlying concepts.
Cylinder concept
Consider the disk pack as illustrated, and note the following:
i. There are ten recording surfaces. Each surface has 200 tracks.
ii. There is a read-write head for each surface on the disk pack.
iii. All the read-write arms are fixed to one mechanism and are like a comb.
iv. When the "access" mechanism moves, all ten read-write heads move
in unison across the disk surfaces.
v. Whenever the access mechanism comes to rest, each read-write head
will be positioned on the equivalent track on each of the ten surfaces.
vi. For one movement of the access mechanism, access is possible to
ten tracks of data.
In the case of a floppy disk the situation is essentially the same but simpler. There is
just one recording surface on a "single-sided" floppy disk and two recording surfaces
on a "double-sided" floppy disk. The other significant differences are in terms of
capacity and speed.
Use is made of the physical features already described when organising the storage of
records on disk. Records are written onto the disk starting with track 1 on surface 1,
then track 1 on surface 2, then track 1 on surface 3, and so on to track 1 on surface 10.
One can see that conceptually the ten tracks of data can be regarded as forming a
CYLINDER.
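To get a feel for the capacities involved (the byte figure is assumed purely for illustration): if each track holds 10,000 bytes, then one cylinder holds 10 tracks × 10,000 bytes = 100,000 bytes, and the full pack of 200 cylinders holds 200 × 100,000 = 20,000,000 bytes, i.e. about 20 MB.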
Week Eleven
Basic address concepts of disk files, Access time, File organisation on disk, Access,
Methods of addressing, File labels, Control totals, Buffers and buffering
Objective:
• To introduce students to storage-address creation, the arrangement of stored files
on storage media, and access in different data processing environments, as the
fundamental building blocks for understanding how the computer locates
stored files.
• To explore storage-address concepts, including fetching from the storage
media.
Description: The organisation, access time and methods of addressing of serial/sequential
file organisation, indexed sequential organisation and random file
organisation are discussed in detail.
b. Hard-sectored disk. [Figure key: sectors (i.e., blocks) are numbered 1, 2, 3, ...;
logical records are numbered R1, R2, R3, ...; shading indicates wasted storage space.]
Access time
Access time on disk is the time interval between the moment the command is given
to transfer data from disk to main storage and the moment this transfer is completed.
It is made up of three components:
a. Seek time. This is the time it takes the access mechanism to position
itself at the appropriate cylinder.
b. Rotational delay. This is the time taken for the bucket to come round
and position itself under the read-write head. On average this will be the
time taken for half a revolution of the disk pack. This average is called the
"latency" of the disk.
c. Data transfer time. This is the total time taken to read the contents of the
bucket into main storage.
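As an illustration with assumed figures: suppose the seek takes 8 ms, the disk rotates at 7,200 revolutions per minute (8.33 ms per revolution, so the latency is about 4.2 ms), and reading the bucket takes 2 ms. The access time is then approximately 8 + 4.2 + 2 = 14.2 ms.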
Access time will vary mainly according to the position of the access mechanism at the
time the command is given. For example if the access mechanism is already
positioned at cylinder 1 and the record required happens to be in cylinder 1 no
movement of the access mechanism is required. If, however, the record required is
in cylinder 200, the access mechanism has to move right across the surface of the
disk. Once the bucket has arrived at the read-write head, the transfer of data to
storage begins. Speed of transfer of data to main storage is very fast and is a constant
rate of so many thousand bytes per second. A hard disk will operate at speeds
roughly 10 times faster than a floppy disk or flash disk.
Note. Magnetic tape is limited to methods (a) and (b) above. These limited methods
of organisation and access have led to tape becoming very much less common than
disk as an on-line medium for the storage of master files. Tape continues as a major
storage medium for purposes such as offline data storage and back-up.
c. Indexed sequential files. There are three methods of access:
i. Sequential. This is almost the same as in (b) above; the complete file is
read in
sequential order using the index. The method is used when the hit rate is
high. The method makes minimal use of the index, minimises head
movement and processes all records in each block in a single read.
Therefore, the index is used once per block rather than once per record.
Any transaction file must be pre-sorted into the same key sequence as the
master file.
ii. Selective sequential. Again the transaction file must be pre-sorted into
the same sequence as the master file. The transaction file is processed
against the master file and only those master records for which there is a
transaction are selected. Notice that the access mechanism is going
forward in an ordered progression (never backtracking) because both
files are in the same sequence. This minimises head movement and saves
processing time. This method is suitable when the hit rate is low, as only
those records for which there is a transaction are accessed.
iii. Random. Transactions are processed in a sequence that is not that of the
master file. The transactions may be in another sequence, or may be
unsequenced. In contrast to the selective sequential method, the access
mechanism will move not in an ordered progression but back and forth
along the file. Here the index is used when transactions are processed
immediately - i.e., there is no time to assemble files and sort them into
sequence. It is also used when updating two files simultaneously. For
example, a transaction file of orders might be used to update a stock file
and a customer file during the same run. If the order file were sorted into
customer sequence, the customer file would be updated on a selective
sequential basis and the stock file on a random basis. (Examples will be
given in later segments.)
Note. In c.i and c.ii the ordered progression of the heads relies upon an orderly
organisation of the data and no other program performing reads from the disk at the
same time, which would cause head movement to other parts of the disk. In multi-
user systems these things cannot always be relied upon.
Methods of addressing
For direct access one must be able to "address" (locate) each record whenever one
wants to process it. The main methods of obtaining the appropriate address are as
follows:
a. Index: The record keys are listed with the appropriate disk address. The
incoming transaction record key is used to locate the disk address of the
master record in the index. This address is then used to locate the
appropriate master record.
c. Record key = disk address: It would be convenient if we could use the actual
disk hardware address as our record key. Our transaction record keys would
then also be the appropriate disk addresses and thus no preliminary action
such as searching an index or address generation would be required in order
to access the appropriate master records. This is not a very practical method,
however, and has very limited application.
File labels
In addition to its data records, a file will normally have two special records, usually
referred to as labels. One comes at the
beginning of the file and the other at the end. This applies to magnetic tape too.
a. Header label. This is the first record and its main function is to identify the file. It will
contain the following data:
i. A specified field to identify the particular record as a label.
ii. File name - e.g., PAYROLL; LEDGER; STOCK.
iii. Date written.
iv. Purge date - being the date from which the information on the particular
file is no longer required and from which it can be deleted and the
storage space re-used.
This label will be checked by the program before the file is processed to ensure
that the correct tape has been opened.
b. Trailer label. This will come at the end of the file and will contain the
following data:
i. A specific field to identify the particular record as a label.
ii. A count of the number of records on file. This will be checked
against the total accumulated by the program during processing.
iii. Volume number if the file takes up more than one cartridge or pack
(or tape).
Control totals
Mention is made here of one further type of record sometimes found on sequential
files - one which will contain control totals, e.g., financial totals.
Such a record will precede the trailer label.
Buffers and buffering
The area of main storage used to hold the individual blocks, when they are read in
or written out, is called a buffer. Records are transferred between the disk (or tape)
unit and main memory one complete block at a time. So, for example, if the blocking
factor is 6, the buffer will be at least as long as 6 logical records. A program that was
processing each record in a file in turn would only have to wait for records to be read
in after processing the sixth record in each block when a whole block would be read
in. The use of just one buffer for the file is called single buffering.
In some systems double buffering is used. Two buffers are used. For the sake of
argument call them A and B and assume that data is to be read into main storage
from the file. (The principle applies equally well to output.) When processing begins
the first block in the file is read into buffer A and then the logical records in A are
processed in turn. While the records in A are being processed the next block (block
2) is read into B. Once the records in A have been processed those in B can be
processed immediately, without waiting for a read. As these records in B are processed
the next block (block 3) is read into A replacing what was there before. This sequence
of alternately filling and processing buffers carries on until the whole file has been
processed. There can be considerable saving in time through using double buffering
because of the absence of waits for block reads.
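As a rough illustration with assumed figures: suppose each block takes 10 ms to read and its records take 12 ms to process. With single buffering a file of n blocks takes about n × (10 + 12) = 22n ms, because each read must complete before processing can resume. With double buffering each read (except the first) overlaps the processing of the previous block, so the file takes about 10 + 12n ms, a saving approaching 45% for large n.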
Note: Single and double buffering are generally carried out by the operating system
not by the application program.
Week Twelve
Non-sequential updating of disk files, File reorganisation, Physical file
organisations, File access methods and File calculations
Objective:
• To introduce students to file creation and maintenance in different data
processing environments as the fundamental building blocks for understanding
the relevance of computer files.
• To explore file organisation design concepts, including overflow handling
in file storage, and their pitfalls.
Description: Serial/sequential file organisation, indexed sequential organisation and
random file organisation, together with their access methods, are illustrated and
discussed with an emphasis on the storage media.
File reorganisation
As a result of the foregoing the number of records in the overflow area will increase.
As a consequence the time taken to locate such a record will involve first seeking the
home track and then the overflow track.
Periodically it will be necessary to reorganise the file. This will entail rewriting the
file onto another disk:
i. Putting the records that are in the overflow area into the home area
in the proper sequence.
ii. Leaving out the records that have a deletion marker on them.
iii. Rewriting any index that is associated with the file.
The generated address identifies the block in which the record is stored: the
block is input, and the record is searched for within the block. We thus have
organisation by address generation and access by address generation. Sometimes an
index of generated addresses is produced as the file is created. This index is then
stored with the file. It is then possible to access the file by means of this random
index. We then have organisation by address generation and access by random
index.
Hashed keys: When disk addresses are generated directly from keys, as in the
example just given, there tends to be an uneven distribution of records over
available tracks. This can be avoided by applying some algorithm to the key first.
In this case we say the key is hashed. Examples:
a. Squaring, e.g., for key number 188: 188² = 35344, which can be read as a disc
address of track number 35, surface number 3, bucket number 4, block number 4
(Fig 10: Disc organisation structure).
b. Division method, e.g., for key number 188.
188 ÷ 7 = 26 Remainder 6. So we could use track 26 surface 6 say.
Hashing reduces the chances of overflow occurring, but when overflow
does occur records are normally placed on the next available surface in the
same cylinder so as to minimise head movement.
b. Direct files. These are files that provide fast and efficient direct access, i.e., they
are normally random files with one of a number of appropriate addressing methods.
A common type of direct file is the Relative file. The logical organisation of a relative
file is like this:
Record:            R1  R2  R3  R4  R5  R6  etc.
Relative address:   1   2   3   4   5   6  etc.
File calculations
Two basic types of calculation are often needed when using files:
a. The storage space occupied by the file (for magnetic tape the length of tape
may be required).
b. The time taken to read or write the file.
For a sequential file on disk the basic calculation for estimating the required space is
as follows; a simple worked example is given after the list.
a. Divide the block size by the record size to find how many whole records can
fit into a block. This is the blocking factor.
b. Divide the total number of records by the blocking factor to obtain the total
number of blocks required.
c. Multiply the block size in bytes by the total number of blocks required.
Note. This basic method can be modified if the records are variable in length (e.g.,
use an average record size).
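As a worked example with assumed figures: a file of 10,000 records of 120 bytes each is to be stored in blocks of 1,000 bytes. The blocking factor is 1,000 ÷ 120 = 8 whole records per block; the file needs 10,000 ÷ 8 = 1,250 blocks; and the space required is 1,250 × 1,000 bytes = 1,250,000 bytes, i.e. about 1.25 MB.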
For non-sequential files on disk the storage space required is greater than that for
sequential because of the space allowed for insertions and overflow. The exact
calculations depend on the software used to organise the files and on the ways in
which it is possible to configure the settings. However, a typical overhead is 20%
more than that for sequential.
Total read time = (seek time + latency + data transfer time per sector × sectors per cylinder) × number of cylinders per file
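For example, with assumed figures: a file occupying 40 cylinders, a seek time of 10 ms, a latency of 4.2 ms, and a transfer time of 0.5 ms per sector with 200 sectors per cylinder gives a total read time of 40 × (10 + 4.2 + 0.5 × 200) = 40 × 114.2 ms, i.e. about 4.6 seconds.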
FILE ORGANISATION METHOD      METHOD OF ACCESS
1. Serial (Sequential)        Serial (Sequential)
3. Indexed sequential         a. Sequential
                              b. Selective sequential
                              c. Random (Direct)