Chapter 1.4 Data: Its Representation, Structure and Management
1.4 (a) Number Systems and Character Sets
Counting is one of the first skills that a young child masters, and none of us consider
counting from 1 to 100 difficult. However, to count, we have to learn, by heart, the
meanings of the symbols 0,1,2…9 and also to understand that two identical symbols
mean totally different things according to their ‘place’ in the number. For instance, in 23
the 2 actually means 2 * 10. But why multiply by 10? Why not multiply by 6? The
answer is simply that we were taught to do that because we have 10 fingers, so we can
count on our fingers until we get to the last one, which we remember in the next column
and then start again.
We don’t need to count in tens. The ancient Babylonians used a system which is
similar to counting in sixties. This is very difficult to learn because of all the symbols
needed, but we still use a system based on sixties today: 60 minutes = 1 hour; 60 seconds
= 1 minute; 6 * 60 degrees = 1 revolution.
Instead of increasing the number of symbols in a system, which makes the system more
difficult, it seems reasonable that if we decrease the number of symbols the system will
be easier to use.
A computer is an electronic machine. Electricity can be either on or off. If electricity is
not flowing through a wire, that can stand for 0. If electricity is flowing, then it stands for
1. The difficulty is what to do for the number 2. We can’t just pump through twice as
much electricity; what we need is a carry system, just like the one we use when we run out
of fingers. What we need is another wire.
         Third wire       The ‘twos’ wire      The ‘units’ wire
=0                        no electricity 0     no electricity 0
ADD 1
=1                        no electricity 0     electricity 1
ADD 1
=2                        electricity 1        no electricity 0 (Carry)
ADD 1
=3                        electricity 1        electricity 1
ADD 1
=4       electricity 1    no electricity 0     no electricity 0
                          (Carry)              (Carry)
The computer can continue like this for ever, just adding more wires when it gets bigger
numbers.
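The carry system above can be sketched in a few lines of Python. This is only an illustration: the function name add_one and the idea of holding the wires in a list (units wire first) are our own assumptions, not part of any real machine.

```python
def add_one(wires):
    """Add 1 to a number held as a list of bits, units wire first."""
    wires = list(wires)          # work on a copy
    position = 0
    while True:
        if position == len(wires):
            wires.append(0)      # bring in another wire when we run out
        if wires[position] == 0:
            wires[position] = 1  # no carry needed: switch this wire on
            return wires
        wires[position] = 0      # wire was on: switch it off and carry
        position += 1

wires = [0, 0]                   # 'units' wire and 'twos' wire, both off
for value in range(1, 5):
    wires = add_one(wires)
    print(value, wires[::-1])    # show the highest wire on the left
```

When the count reaches 4 both existing wires switch off and a third wire is added, just as in the table.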
This system, where there are only two digits, 0 and 1, is known as the binary system.
Each wire, or digit, is known as a binary digit. This name is normally shortened to BIT.
So each digit, 0 or 1, is one bit. A single bit has very few uses so they are grouped
together. A group of bits is called a BYTE. Usually a byte has 8 bits in it.
One type of data that needs to be stored in computer systems is the letters of the alphabet.
These are stored as codes which look like binary numbers. For instance A could be stored
as 000, B as 001 and so on. Unfortunately, there are only 8 possible codes using 3 bits, so
we could store the letters A to H but not the rest, and what about the lower case letters
and punctuation and…? The computer can store as many characters as necessary simply
by using more and more bits for the code. Some systems don’t need to be able to
recognise a lot of characters so they only use a few bits for each character. The number of
bits needed to store one character is called a byte which is usually said to have 8 bits
because most systems use 8 bits to store the code for each character.
We now have enough codes, but another problem arises. If my computer stores A as
01000001 and your computer stores A as 01000010 then the computers cannot
communicate because they cannot understand each other’s codes. In the 1960s a meeting
in America agreed a standard set of codes so that computers could communicate with
each other. This standard set of codes is known as the ASCII set. Most systems use
ASCII so you can be fairly sure that when you type in A it is stored in the computer’s
memory as 01000001. However, you can’t be certain because some systems use other
codes. A less common code is EBCDIC. It was developed for use by larger scale
computer systems, and the code it uses for each character is different from the one used
in ASCII.
Notes: All the characters that a system can recognise are called its character set.
ASCII uses 8 bits so there are 256 different codes that can be used and hence 256
different characters. (This is not quite true, we will see why in chapter 1.6.)
A problem arises when the computer retrieves a piece of data from its memory. Imagine
that the data is 01000001. Is this the number 65, or is it A?
They are both stored in the same way, so how can it tell the difference?
The answer is that characters and numbers are stored in different parts of the memory, so
it knows which one it is by knowing whereabouts it was stored.
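The point that one bit pattern can be read in two ways can be seen in a short Python sketch. It assumes only Python's built-in int, chr and ord; these work with Unicode code points, which agree with ASCII for the letter A.

```python
pattern = "01000001"             # the eight bits from the example

as_number = int(pattern, 2)      # read the bits as a binary number
as_character = chr(as_number)    # read the same code as a character

print(as_number)                 # 65
print(as_character)              # A
print(format(ord("A"), "08b"))   # 01000001
```

The bits never change; only the way the program chooses to interpret them does.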
1.4 (b) Data Types
The computer needs to use different types of data in the operation of the system. All of
these different types of data will look the same because they all have to be stored as
binary numbers. The computer can distinguish one type of data from another by seeing
whereabouts in memory it is stored.
Numeric data.
There are different types of numbers that the computer must be able to recognise.
Numbers can be restricted to whole numbers. These are called INTEGERS and are stored
by the computer as binary numbers using a whole number of bytes. It is usual to use
either 2 bytes (called short integers) or 4 bytes (called long integers), the difference being
simply that long integers can store larger numbers. Sometimes it is necessary to store
negative integers or fractions or, perhaps, some other types of numbers. These other types
do not concern us until later in the course; we will see how they are stored in
chapter 3.4.
Boolean data
Sometimes the answer to a question is either yes or no, true or false. There are only two
options. The computer uses binary data which consists of bits of information that can be
either 0 or 1, so it seems reasonable that the answer to such questions can be stored as a
single bit with 1 standing for true and 0 standing for false. Data which can only have two
states like this is known as BOOLEAN data.
A simple example of its use would be in the control program for an automatic washing
machine. One of the important pieces of information for the processor would be to know
whether the door was shut. A boolean variable could be set to 0 if it was open and to 1 if
it was shut. A simple check of that value would tell the processor whether it was safe to
fill the machine with water.
Some types of data are used so often by computer systems that they are considered to be
special forms of data. These special forms of data can be set up by the computer system
so that they are recognised when entered. Two examples of such data types are
Date/Time and Currency. The computer has simply been told the rules that govern such
data types and then checks the data that is input against the rules. Students will probably
be familiar with these data types through their use in databases. Such data types are
fundamentally different from the others mentioned here because the others are
characterised by the operating system while these two are set up by applications software.
Characters
A character is anything that is represented in the character set of the computer by
a character code held in a single byte.
1.4 (c) and (d) Expressing numbers in binary
These two sections can be combined. We are only interested in expressing numbers in
binary form rather than in our decimal number system.
When a question asks for a conversion either to binary or back to decimal, always draw
the box diagram that the numbers will be put into and put the headings on the boxes. The
headings start from 1 on the right and are multiplied by two each time moving left, so that a
question which wanted 8 bits for the answer would look like this
128 64 32 16 8 4 2 1
Then consider the number that needs turning into binary. E.g. Turn 165 into binary.
Start on the left, in this case with 128. Does 128 go into 165? Yes. Put a 1 in the box.
128 has now been used up so take 128 from 165, there is 37 left.
Next box is 64. Does 64 go into 37? No. Put a 0 in the box.
Next box is 32. Does 32 go into 37? Yes. Put a 1 in the box.
32 has now been used up so take 32 from 37, there is 5 left.
Next box is 16. Does 16 go into 5? No. Put a 0 in the box.
Next box is 8. Does 8 go into 5? No. Put a 0 in the box.
Next box is 4. Does 4 go into 5? Yes. Put a 1 in the box.
4 has now been used up so take 4 from 5, there is 1 left.
Next box is 2. Does 2 go into 1? No. Put a 0 in the box.
Next box is 1. Does 1 go into 1? Yes. Put a 1 in the box.
1 has now been used up so take 1 from 1, there is 0 left.
No more boxes. End
(Notice that this is an algorithm which could be adapted into a general algorithm for
working out binary numbers. Try it.)
The result is
128 64 32 16 8 4 2 1
1 0 1 0 0 1 0 1 =165
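The box method can be turned into a general routine. The following Python sketch is one possible version; the function name to_binary and the bits parameter are our own choices for illustration.

```python
def to_binary(number, bits=8):
    """Convert a whole number to a string of binary digits using the box method."""
    heading = 2 ** (bits - 1)    # leftmost box heading, 128 for 8 bits
    if not 0 <= number < 2 * heading:
        raise ValueError("number does not fit in the boxes")
    digits = ""
    while heading >= 1:
        if heading <= number:    # does the heading go into what is left?
            digits += "1"
            number -= heading    # the heading has been used up
        else:
            digits += "0"
        heading //= 2            # move to the next box on the right
    return digits

print(to_binary(165))            # 10100101
```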
To turn a binary number into a denary (decimal) number, put the number into the boxes,
with the headings on, and then just add up the headings that have a one in the box.
E.g.
128 64 32 16 8 4 2 1
1 0 1 1 0 1 1 0
128 + 32 + 16 + 4 + 2 = 182.
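Adding up the headings can be sketched in Python as well. The function name to_denary is an assumption made for this example; the binary number is passed in as a string of 0s and 1s.

```python
def to_denary(digits):
    """Add up the headings of the boxes that contain a 1."""
    total = 0
    heading = 1                          # rightmost box
    for digit in reversed(digits):       # work from right to left
        if digit == "1":
            total += heading
        heading *= 2                     # next heading to the left
    return total

print(to_denary("10110110"))             # 182
```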
Don’t worry about other kinds of number; we will see those in chapter 3.4.
1.4 (e) Arrays
Data stored in a computer is stored at any location in memory that the computer decides
to use. This means that similar pieces of data can be scattered all over memory. This, in
itself, doesn’t matter to the user, except that to find each piece of data it has to be referred
to by a variable name.
e.g. If it is necessary to store the 20 names of students in a group then each location
would have to be given a different variable name. The first, Iram, might be stored in
location NAME, the second, Sahin, might be stored in FORENAME, the third, Rashid,
could be stored in CHRINAME, but I’m now struggling, and certainly 20 different
variable names that made sense will be very taxing to come up with. Apart from anything
else, the variable names are all going to have to be remembered.
Far more sensible would be to force the computer to store them all together using the
variable name NAME. However, this doesn’t let me identify individual names, so if I call
the first one NAME(1) and the second NAME(2) and so on, it is obvious that they are all
people’s names and that they are distinguishable by their position in the list. Lists like
this are called ARRAYS.
Because the computer is being forced to store all the data in an array together, it is
important to tell the computer about it before it does anything else so that it can reserve
that amount of space in its memory, otherwise there may not be enough space left when
you want to use it. This warning of the computer that an array is going to be used is
called INITIALISING the array. Initialising should be done before anything else so that
the computer knows what is coming.
Initialising consists of telling the computer
• what sort of data is going to be stored in the array so that the computer knows what
part of memory it will have to be stored in
• how many items of data are going to be stored, so that it knows how much space to
reserve
• the name of the array so that it can find it again.
Different programming languages have different commands for doing this but they all do
the same sort of thing. A typical command would be
DIM NAME$(20)
DIM is a command telling the computer that an array is going to be used
NAME is the name of the array
$ tells the computer that the data is going to be characters
(20) tells it that there are going to be up to 20 pieces of data.
Notes: Just because the computer was told 20 does not mean that we have to fill the array,
the 20 simply tells the computer the maximum size at any one time.
The array that has been described so far is really only a list of single data items. It would
be far more useful if each student had a number of pieces of information about them,
perhaps their name, address, date of birth. The array would now have 20 students and
more than one thing about each; this is called a two dimensional array. Obviously
everything gets more complicated now, but don’t worry as it is enough that you
understand that an array may well be two dimensional. If you consider that the names,
addresses and dates of birth in this array are then repeated for every group of students in
the school, it now can be held as a three dimensional array. Lots more dimensions are
also possible; let’s just call them multi dimensional.
We should now have a picture of a part of memory which has been reserved for the array
NAME$
            Iram     NAME$(1)
            Sahin    NAME$(2)
NAME$       -        -
            -        -
            Zaid     NAME$(20)
To read data into the array simply tell the computer what the data is and tell it the
position to place it in e.g. NAME$(11) = Rashid will place Rashid in position 11 in the
array (incidentally, erasing any other data that happened to be in there first).
To read data from the array is equally simple, tell the computer which position in the
array and assign the data to another value
e.g. RESULT$ = NAME$(2) will place Sahin into a variable called RESULT$.
Searching for a particular person in the array involves a simple loop and a question
e.g. search for Liu in the array NAME$
Answer:
Counter = 1
While Counter is less than 21, Do
If NAME$(Counter) = Liu Then Print “Found” and End.
Else Add 1 to Counter
Endwhile
Print “Name not in array”
End
Notice that this is an algorithm written in pseudocode. Try to produce an equivalent
algorithm using a Repeat…Until loop structure.
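The while-loop search can also be sketched in Python. The list contents are hypothetical, since only some of the 20 names are given in the text, and the function name find_name is our own.

```python
def find_name(names, target):
    """Return the 1-based position of target in names, or 0 if absent."""
    counter = 1
    while counter <= len(names):
        if names[counter - 1] == target:   # Python lists count from 0
            return counter
        counter += 1
    return 0

# Hypothetical contents; the text names only some of the 20 students.
names = ["Iram", "Sahin", "Rashid", "Zaid"]
position = find_name(names, "Liu")
print("Found" if position else "Name not in array")
```

Returning the position rather than just printing a message means the result can be used again, for example to read the rest of that student's data.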
1.4 (f) Linked Lists
When data is stored in a computer the processor can store it in any location provided it
can get the data back. In order to get the data back, each of the locations where the data is
stored is given an address, so that if the computer can remember the address of where it
put some information it can easily retrieve it. These addresses are kept in an index. If
every item has to have its own entry in the index then the index can get very large, so it
seems reasonable to try to cut down the size of the index by
grouping things together under one index entry. One method for doing this is by using an
array: the twenty people in the set could all be found by the one reference to NAME$
which pointed to the location of the array. There are two major problems with arrays. The
first is that if the size of the set grows because a new student joins there is no room in the
array to store the information. This is because the array size has to be predetermined. The
array could be made much bigger, say size 50, so that we are sure it will never be too
small, but this leads to the second problem, that most of this space will never be used,
consequently wasting valuable memory. These problems can be overcome by using a
linked list.
A linked list of data items tells the computer to store the data in any location and to link it
to the previous data item by giving the previous data item the address of the new one.
That sounds very complex, but the idea is simple if we look at it in diagram form.
Imagine the list of names used in the example for the array.
[Diagram: a linked list of the names, each item pointing to the next: ... Sahin, then a jagged line, then Zaid, whose pointer is marked XX.]
Note: The jagged line signifies that there are a number of others which would fit in there,
but they are not shown.
To initialise a list, all that needs to be done is to create a new start pointer for this list and
add it to the index of start pointers for all the other lists.
To search through the list for a particular piece of data follow these rules
1. Find the correct list in the index of lists
2. Follow the pointer to the next item
3. If the item is the one being searched for, report that it is found and end.
4. If the pointer shows that the end of the list has been reached, report that the item is not
there and end.
5. Go to step 2.
(Try to write this algorithm in pseudocode using a while…endwhile loop)
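The search rules above can be sketched in Python. This is an illustration under our own assumptions: the class name Node, the use of None as the end marker, and the three-name list are not from the original diagram.

```python
class Node:
    """One item in the list: the data plus a pointer to the next item."""
    def __init__(self, data, next_node=None):
        self.data = data
        self.next = next_node     # None plays the part of the end marker

def search(start, target):
    """Follow the pointers from start; return True if target is found."""
    current = start
    while current is not None:    # None means the end of the list
        if current.data == target:
            return True
        current = current.next    # follow the pointer to the next item
    return False

# Build a short list: Iram -> Sahin -> Zaid (names taken from the array example)
start = Node("Iram", Node("Sahin", Node("Zaid")))
print(search(start, "Zaid"))      # True
print(search(start, "Liu"))       # False
```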
To remove a value from a list, simply change the pointer that points to it into one that
points to the next value after it. E.g. to remove Sahin from the example
[Diagram: the same list, with the pointer that pointed to Sahin changed to point to the item after Sahin.]
Note that Sahin’s data is still there, it is just that there is no way of getting to it so it might
just as well not be.
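Removing an item by changing one pointer might look like this in Python. The names Node, remove and to_list are assumptions for the sketch, and None stands for the end of the list. (Python itself reclaims an item once nothing points to it, which the simple model in the text does not describe.)

```python
class Node:
    def __init__(self, data, next_node=None):
        self.data = data
        self.next = next_node

def remove(start, target):
    """Unlink the first node holding target; return the (possibly new) start."""
    if start is None:
        return None
    if start.data == target:
        return start.next              # the start pointer skips the first item
    previous = start
    while previous.next is not None:
        if previous.next.data == target:
            previous.next = previous.next.next   # point past the removed item
            break
        previous = previous.next
    return start

def to_list(start):
    """Collect the data reachable from start, in order."""
    items = []
    while start is not None:
        items.append(start.data)
        start = start.next
    return items

start = Node("Iram", Node("Sahin", Node("Zaid")))
start = remove(start, "Sahin")
print(to_list(start))                  # ['Iram', 'Zaid']
```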
1.4 (g) Stacks and Queues
Queues.
Information arrives at a computer in a particular order. It may not be numeric, or
alphabetic, but there is an order dependent on the time that it arrives. Imagine Zaid, Iram,
Sahin, Rashid send jobs for printing, in that order. When these jobs arrive they are put in
a queue awaiting their turn to be dealt with. It is only fair that, when a job is called for by
the printer, Zaid’s job is sent first because his has been waiting longest. These jobs
are held, just like the other data we have been talking about, in an array. The jobs are put
in at one end and taken out of the other. All the computer needs is a pointer showing it
which one is next to be done (start pointer (SP)) and another pointer showing where the
next job to come along will be put (end pointer (EP)).
1. Zaid is in the queue for printing, the end pointer is pointing at where the next job will
go.
2. Iram’s job is input and goes as the next in the queue, the end pointer moves to the next
available space.
3. Zaid’s job goes for printing so the start pointer moves to the next job, also Sahin’s job
has been input so the end pointer has to move.
                                        EP
                    EP                  Sahin
EP                  Iram                SP Iram
SP Zaid             SP Zaid
1.                  2.                  3.
Notes: The array is limited in size, and the effect of this seems to be that the contents of
the array are gradually moving up. Sooner or later the queue will reach the end of the
array. The queue does not have to be held in an array, it could be stored in a linked list.
This would solve the problem of running out of space for the queue, but does not feature
in this course until the second year.
The example of jobs being sent to a printer is not really a proper queue, it is called a
spool, but we don’t need to know about the difference until chapter 3.1.
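A queue held in an array with a start pointer and an end pointer can be sketched as follows. The class name ArrayQueue and the array size of 5 are our own assumptions; the sketch also shows the limitation mentioned in the notes, because the pointers only ever move up.

```python
class ArrayQueue:
    """A queue in a fixed-size array with a start pointer and an end pointer."""
    def __init__(self, size):
        self.slots = [None] * size
        self.sp = 0               # start pointer: next job to be dealt with
        self.ep = 0               # end pointer: where the next job will go

    def add(self, job):
        if self.ep == len(self.slots):
            raise OverflowError("queue has reached the end of the array")
        self.slots[self.ep] = job
        self.ep += 1

    def remove(self):
        if self.sp == self.ep:
            raise IndexError("queue is empty")
        job = self.slots[self.sp]
        self.sp += 1
        return job

queue = ArrayQueue(5)
queue.add("Zaid")                 # diagram 1
queue.add("Iram")                 # diagram 2
first = queue.remove()            # diagram 3: Zaid's job goes for printing
queue.add("Sahin")
print(first, queue.sp, queue.ep)  # Zaid 1 3
```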
Stacks.
Imagine a queue where the data was taken off the array at the same end that it was put on.
This would be a grossly unfair queue because the first one there would be the last one
dealt with. This type of unfair queue is called a stack.
A stack will only need one pointer because adding things to it and taking things off it are
only done at one end
1. Zaid and Iram are in the stack. Notice that the pointer is pointing to the next space.
2. A job has been taken off the stack. It is found by the computer at the space under the
pointer (Iram’s job), and the pointer moves down one.
3. Sahin’s job has been placed on the stack in the position signified by the pointer, the
pointer then moves up one. This seems to be wrong, but there are reasons for this being
appropriate in some circumstances which we will see later in the course.
Pointer                                 Pointer
Iram                Pointer             Sahin
Zaid                Zaid                Zaid
1.                  2.                  3.
In a queue, the Last one to come In is the Last one to come Out. This gives the acronym
LILO, or FIFO (First in is the first out).
In a stack, the Last one In is the First one Out. This gives the acronym LIFO, or FILO
(First in is the last out).
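The stack in the three diagrams can be sketched in the same way as the queue, but with a single pointer. The class name ArrayStack and the array size are assumptions for this example.

```python
class ArrayStack:
    """A stack in a fixed-size array with one pointer to the next free space."""
    def __init__(self, size):
        self.slots = [None] * size
        self.pointer = 0          # points at the next free space

    def push(self, job):
        if self.pointer == len(self.slots):
            raise OverflowError("stack is full")
        self.slots[self.pointer] = job
        self.pointer += 1         # pointer moves up one

    def pop(self):
        if self.pointer == 0:
            raise IndexError("stack is empty")
        self.pointer -= 1         # pointer moves down one
        return self.slots[self.pointer]

stack = ArrayStack(5)
stack.push("Zaid")
stack.push("Iram")                # diagram 1
taken = stack.pop()               # diagram 2: Iram's job comes off first
stack.push("Sahin")               # diagram 3
print(taken, stack.pointer)       # Iram 2
```

Iram was the last in and is the first out, while Zaid, the first in, is still at the bottom.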
1.4 (h) Files, Records, Items, Fields.
Data stored in computers is normally connected in some way. For example, the data
about the 20 students in the set that has been the example over the last three sections has
a connection because it all refers to the same set of people. Each person will have their
own information stored, but it seems sensible that each person will have the same
information stored about them, for instance their name, address, telephone number, exam
grades…
All the information stored has an identity because it is all about the set of students. This
large quantity of data is called a FILE.
Each student has their own information stored. This information refers to a particular
student and is called their RECORD of information. A number of records make up a file.
Each record of information contains the same type of information, name, address and so
on. Each type of information is called a FIELD. A number of fields make up a record and
all records from the same file must contain the same fields.
The data that goes into each field, for example “Iram Dahar”, “3671 Jaipur, 2415” will be
different in most of the records. The data that goes in a field is called an ITEM of data.
Note: Some fields may contain the same items of data in more than one record. For
example, there may be two people in the set who happen to be called Iram Dahar. If Iram
Dahar’s brother Bilal is in this set he will presumably have the same address as Iram. It is
important that the computer can identify individual records, and it can only do this if it
can be sure that one of the fields will always contain different data in all the records.
Because of this quality, that particular field in the record is different from all the others
and is known as the KEY FIELD. The key field is unique and is used to identify the
record. In our example the records would contain a field called school number which
would be different for each student.
Note: Iram Dahar is 10 characters (1 for the space), Pervais Durrani is 15 characters. It
makes it easier for the computer to store things if the same amount of space is allocated
to the name field in each record. It might waste some space, but the searching for
information can be done more quickly. When each of the records is assigned a certain
amount of space the records are said to be FIXED LENGTH. Sometimes a lot of space is
wasted and sometimes data has to be abbreviated to make it fit. The alternative is to be
able to change the field size in every record; this comes later in the course.
1.4 (i) Record Formats
To design a record format, the first thing to do is to decide what information would be
sensible to be stored in that situation.
e.g. A teacher is taking 50 students on a rock-climbing trip. The students are being
charged 20 dollars each and, because of the nature of the exercise, their parents may need
to be contacted if there is an accident. The teacher decides to store the information as a
file on a computer. Design the record format for the file.
Answer.
The fields necessary will be Student name, Amount paid, Emergency telephone number,
Form (so that contact can be made in school). There are other fields that could be
included but we will add just one more, the school number (to act as the key field).
For each one of these fields it is necessary to decide what type of data they will be and
also to decide how many characters will be allowed for the data in that field, remember
these are fixed length records.
The easiest way is to write them in a table

Field               Data type    Size
School number       Integer      1 byte
Student name        Character    20 bytes
Amount paid         Integer      1 byte
Emergency number    Character    12 bytes
Form (e.g. 3RJ)     Character    3 bytes
Notes: It would be perfectly reasonable to say that the school number was not a proper
number so it should be stored as characters (probably 4 bytes).
The student name is quite arbitrary. 15 bytes would be perfectly reasonable, as would 25
bytes, but 5 bytes would not because it would not be long enough for most names. In
other words there is no single right answer, but there are wrong ones.
The amount paid is listed as an integer as the teacher will store the number of whole
dollars so far paid. This is not necessarily the best data type, in this example currency
may be better. As you get to learn about other data types you may well be able to
consider better ones still.
Many students expect that the emergency number should be an integer, but phone
numbers, in Britain, start with a 0, and an integer cannot keep a leading 0. If most numbers do
start with 0, the computer can be programmed to put the 0 in at the start of the rest of the
number, but you would have to say this in your answer. As we are not going to do any
arithmetic with these numbers, why make life more complicated than necessary?
3 characters were allowed for form. If in doubt, give an example of what you mean by the
data, it can’t hurt and it may save you a mark in the question.
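To see what a fixed length record of this design might look like in practice, here is a sketch using Python's standard struct module. The format string is our own translation of the table, and the school number and phone number used are invented for illustration.

```python
import struct

# "<" means no padding between fields; B = 1-byte unsigned integer,
# 20s / 12s / 3s = character fields of fixed length.
RECORD_FORMAT = "<B20sB12s3s"

def pack_record(number, name, paid, phone, form):
    """Pack one student's data into a fixed-length record."""
    return struct.pack(RECORD_FORMAT, number, name.encode("ascii"),
                       paid, phone.encode("ascii"), form.encode("ascii"))

# Hypothetical values: the phone number and school number are invented.
record = pack_record(1, "Iram Dahar", 20, "01234567890", "3RJ")
print(struct.calcsize(RECORD_FORMAT))   # 37
print(len(record))                      # 37
```

The name uses only 10 of its 20 bytes; struct fills the rest with zero bytes, which is the wasted space that fixed length records accept in return for simplicity.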
1.4 (j) Sizing a File
We have just designed the record format for a given situation. It may be necessary to
calculate how large the file is going to be.
Having decided on the size of each field, it is a simple matter of adding up the individual
field sizes to get the size of a record, in this case 37 bytes.
There are 50 students going on the trip, each of them having their own record, so the size
of the data in the file will be 50 * 37 = 1,850 bytes.
All files need a few extra pieces of information that the user may not see, such as
information at the start of the file saying when it was last updated, which file it is, and
whether it is protected in any way. These extra pieces of information are known as overheads,
and it is usual to add 10% to the size of a file because of the need for overheads.
Therefore the size of the file is 1,850 bytes + (10% of 1,850 bytes) =2,035 bytes.
The final stage is to ensure that the units are sensible for the size of the file.
There are 1024 bytes in 1 Kbyte, so the size of this file is 2,035/1024 = 1.99 Kbytes.
Note: Don’t worry about dividing by 1024, because, after all, this is only an
approximation anyway. If you gave the final answer as 2 (approx) then that is just as
acceptable. Just make sure that you write down somewhere that you know that there are
1024 bytes in a Kbyte, otherwise you can’t be given the mark for knowing that.
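The whole calculation can be checked with a few lines of Python. The dictionary simply restates the field sizes from the record format, and the variable names are our own.

```python
field_sizes = {                    # bytes per field, from the record format
    "school number": 1,
    "student name": 20,
    "amount paid": 1,
    "emergency number": 12,
    "form": 3,
}
records = 50

record_size = sum(field_sizes.values())        # 37 bytes
data_size = records * record_size              # 1,850 bytes
file_size = data_size + data_size // 10        # add 10% for overheads
print(file_size)                               # 2035
print(round(file_size / 1024, 2))              # 1.99
```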
1.4 (k) Access Methods to Data
Computers can store large volumes of data. The difficulty is to be able to get it back. In
order to be able to retrieve data it must be stored in some sort of order. Imagine the phone
book. It contains large volumes of data which can be used, fairly easily, to look up a
particular telephone number because they are stored in alphabetic order of the
subscriber’s name. Imagine how difficult it would be to find a number if they had just
been placed in the book at random. The value of the book is not just that it contains all
the data that may be needed, but that it has a structure that makes it accessible. Similarly,
the structure of the data in a computer file is just as important as the data that it contains.
There are a number of ways of arranging the data that will aid access under different
circumstances.
Serial access.
Data is stored in the computer in the order in which it arrives. This is the simplest form of
storage, but the data is effectively unstructured, so finding it again can be very difficult.
This sort of data storage is only used when it is unlikely that the data will be needed
again, or when the order of the data should be determined by when it is input. A good
example of a serial file is what you are reading now. The characters were all typed in, in
order, and that is how they should be read. Reading this book would be impossible if all
the words were in alphabetic order. Another example of the use of a serial file will be
seen in section 1.4.n.
Sequential access.
In previous sections of this chapter we used the example of a set of students whose data
was stored in a computer. The data could have been stored in alphabetic order of their
name. It could have been stored in the order that they came in a Computing exam, or by
age with the oldest first. However it is done the data has been arranged so that it is easier
to find a particular record. If the data is in alphabetic order of name and the computer is
asked for Zaid’s record it won’t start looking at the beginning of the file, but at the end,
and consequently it should find the data faster.
A file of data that is held in sequence like this is known as a sequential file.
Indexed sequential.
Imagine a large amount of data, like the names and numbers in a phone book. To look up
a particular name will still take a long time even though it is being held in sequence.
Perhaps it would be more sensible to have a table at the front of the file listing the first
letters of people’s names and giving a page reference to where those letters start. So to
look up Jawad, a J is found in the table which gives the page number 232, and the search is
then started at page 232 (where all the Js will be stored). This method of access involves
looking up the first piece of information in an index, which narrows the search to a
smaller area; having done this, the data is then searched alphabetically in sequence. This
type of data storage is called Indexed Sequential.
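The idea of an index followed by a short sequential search can be sketched in Python. The list of names is hypothetical, positions in the list stand in for page numbers, and the function name lookup is our own.

```python
# Hypothetical file held in alphabetic order of name.
names = ["Hameed", "Hinna", "Jaheed", "Jawad", "Khurram", "Mahmood", "Zaid"]

# Build the index: first letter -> position where that letter starts.
index = {}
for position, name in enumerate(names):
    index.setdefault(name[0], position)

def lookup(target):
    """Use the index to find the starting point, then search in sequence."""
    position = index.get(target[0])
    if position is None:
        return None                      # no names start with this letter
    while position < len(names) and names[position][0] == target[0]:
        if names[position] == target:
            return position
        position += 1
    return None

print(index["J"])                        # 2
print(lookup("Jawad"))                   # 3
```

Only the names beginning with J are examined, however long the rest of the file is.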
Random access.
A file that stores data in no order is very useful because it makes adding new data or
taking data away very simple. In any form of sequential file an individual item of data is
very dependent on other items of data. Jawad cannot be placed after Mahmood because
that is the wrong ‘order’. However, it is necessary to have some form of order because
otherwise the file cannot be read easily. What would be wonderful is if, by looking at the
data that is to be retrieved, the computer can work out where that data is stored. In other
words, the user asks for Jawad’s record and the computer can go straight to it because the
word Jawad tells it where it is being stored. How this can be done is explained in section
1.4.l.
1.4 (l) Implementation of File Access Methods
This section is about how the different access methods to data in files can be put into
practice. There will not be a lot of detail, and some questions will remain unanswered.
Don’t worry, because those will appear in further work.
Serial access.
Serial files have no order, no aids to searching, and no complicated methods for adding
new data. The data is simply placed on the end of the existing file and searches for data
require a search of the whole file, starting with the first record and ending, either with
finding the data being searched for, or getting to the end of the file without finding the
data.
Sequential access.
Because sequential files are held in order, adding a new record is more complex, because
it has to be placed in the correct position in the file. To do this, all the records that come
after it have to be moved in order to make space for the new one.
e.g. A section of a school pupil file might look like this
…
Hameed, Ali, 21……..
Khurram, Saeed, 317………
Khwaja, Shaffi, 169………..
Naghman, Yasmin, 216………..
…
If a new pupil arrives whose name is Hinna, space must be found between Hameed and
Khurram. To do this all the other records have to be moved down one place, starting with
Naghman, then Khwaja, and then Khurram.
…
Hameed, Ali, 21…………
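The moving of records described above, to make room for Hinna, can be sketched in Python. To keep the example short each record is reduced to a surname only, which is our own simplification, and the function name insert_in_order is an assumption.

```python
def insert_in_order(records, new_record):
    """Insert new_record into a list kept in alphabetic order."""
    records.append(None)                 # make the file one place longer
    position = len(records) - 1
    # Move records down one place, starting from the end of the file.
    while position > 0 and records[position - 1] > new_record:
        records[position] = records[position - 1]
        position -= 1
    records[position] = new_record

pupils = ["Hameed", "Khurram", "Khwaja", "Naghman"]
insert_in_order(pupils, "Hinna")
print(pupils)   # ['Hameed', 'Hinna', 'Khurram', 'Khwaja', 'Naghman']
```

Naghman moves first, then Khwaja, then Khurram, exactly the order given in the text.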
[Diagram, partly lost: 100 second level indexes (third and fourth digits in account number); 10,000 final index blocks, each containing up to 1000 account numbers.]
Random access.
To access a random file, the data itself is used to give the address of where it is stored.
This is done by carrying out some arithmetic (known as pseudo arithmetic because it
doesn’t make much sense) on the data that is being searched for.
E.g. imagine that you are searching for Jawad’s data.
The rules that we shall use are that the alphabetic position of the first and last letters in
the name should be multiplied together, this will give the address of the student’s data.
So Jawad = 10 * 04 = 40. Therefore Jawad’s data is being held at address 40 in memory.
This algorithm is particularly simplistic, and does not give good results, as we shall soon
see, but it illustrates the principle. Any algorithm can be used as long as it remains the
same for all the data.
This type of algorithm is known as a HASHING algorithm.
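The rule used for Jawad can be written as a short Python function. The names letter_position and hash_address are our own; the rule itself is the one given above.

```python
def letter_position(letter):
    """Alphabetic position of a letter: A = 1, B = 2, ... Z = 26."""
    return ord(letter.upper()) - ord("A") + 1

def hash_address(name):
    """Multiply the positions of the first and last letters of the name."""
    return letter_position(name[0]) * letter_position(name[-1])

print(hash_address("Jawad"))     # 40
```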
The problem with this example can be seen if we try to find Jaheed’s data.
Jaheed = 10 * 04 = 40. The data for Jaheed cannot be here because Jawad’s data is here.
This is called a CLASH. When a clash occurs, the simple solution is to work down
sequentially until there is a free space. So the computer would inspect address 41, and if
that was being used, 42, and so on until a blank space. The algorithm suggested here will
result in a lot of clashes which will slow access to the data. A simple change in the
algorithm will eliminate all clashes. If the algorithm is to write down the alphabetic
position of all the letters in the name as 2 digit numbers and then join them together there
could be no clashes unless two people had the same name.
e.g. Jawad = 10, 01, 23, 01, 04 giving an address 1001230104
Jaheed = 10, 01, 08, 05, 05, 04 giving an address 100108050504
The problem of clashes has been solved, but at the expense of using up vast amounts of
memory (in fact more memory than the computer will have at its disposal). This is known
as REDUNDANCY. Having so much redundancy in the algorithm is obviously not
acceptable. The trick in producing a sensible hashing algorithm is to come up with a
compromise that minimizes redundancy without producing too many clashes.
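As a rough sketch of how a clash might be handled, the Python below stores each record at its hashed address and, if that address is taken, works down sequentially to the next free one. The table size of 1000, the function names store and find, and the choice to wrap round to the start when the end of the table is reached are all assumptions made for this illustration.

```python
TABLE_SIZE = 1000  # assumed number of storage locations in the file


def letter_position(ch):
    """Alphabetic position of a letter: A=1, B=2, ..., Z=26."""
    return ord(ch.upper()) - ord("A") + 1


def product_hash(name):
    """Multiply the positions of the first and last letters of the name."""
    return letter_position(name[0]) * letter_position(name[-1])


def store(table, name, record):
    """Store the record at its hashed address, or at the next free address."""
    address = product_hash(name) % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        if table[address] is None:
            table[address] = (name, record)
            return address
        # Clash: this address is in use, so try the next one down.
        address = (address + 1) % TABLE_SIZE
    raise RuntimeError("no free space left in the file")


def find(table, name):
    """Start at the hashed address and search sequentially for the name."""
    address = product_hash(name) % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        slot = table[address]
        if slot is None:
            return None  # reached a blank space, so the name is not stored
        if slot[0] == name:
            return address, slot[1]
        address = (address + 1) % TABLE_SIZE
    return None


table = [None] * TABLE_SIZE
jawad_address = store(table, "Jawad", "Jawad's data")
jaheed_address = store(table, "Jaheed", "Jaheed's data")
print(jawad_address, jaheed_address)
print(find(table, "Jaheed"))
```

Every clash adds extra addresses to inspect when storing or searching, which is why an algorithm that clashes often slows down access to the data.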
1.4 (m) Selection of Data Types and Structures
Data types.
When the computer is expected to store data, it has to be told what type of data it is going
to be, because different types of data are stored in different ways and need different
amounts of memory. In addition to the types of data that we have already described, there
are other, more specialised, data types. Most can be covered by calling them characters (or
string data, which is just a set of characters one after the other), but there are two others
that are useful. Currency data is, as the name suggests, set up to deal with money. It
automatically inserts the currency symbol and places two digits after the point. The other
is date, which stores the date in either 6 or 8 bytes depending on whether it is to use 2 or
4 digits for the year. Care should be taken with the date because different cultures write
the three elements of a date in different orders. For example, Americans put the month
first and then the day, whereas the British put the day first and then the month.
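The effect of the two date orders, and of a simple currency format, can be seen in the short Python sketch below. The example date 03/04/2024, the function name format_currency and the choice of currency symbol are assumptions made purely for illustration.

```python
from datetime import datetime

text = "03/04/2024"

# The same characters are read as two different dates.
british = datetime.strptime(text, "%d/%m/%Y").date()   # day first
american = datetime.strptime(text, "%m/%d/%Y").date()  # month first

print(british.isoformat())
print(american.isoformat())


def format_currency(amount, symbol="$"):
    """Show an amount with a currency symbol and two digits after the point."""
    return f"{symbol}{amount:.2f}"


print(format_currency(12.5))
```

The British reading gives the 3rd of April and the American reading gives the 4th of March, so a system must know which order was intended before it stores the date.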
Data structures
Students should be able to justify the use of a particular type of structure for storing data
in given circumstances. Questions based on this will be restricted to the particular
structures mentioned in 1.4.e and 1.4.f and will be non-contentious. E.g. Jobs are sent to a
printer from a number of sources on a network. State a suitable data structure for storing
the jobs that are waiting to be printed giving a reason for your answer.
Answer: A queue, because the next one to be printed should be the one that has been
waiting longest.
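A minimal Python sketch of the print queue in this answer is given below. The job names and the function print_all are invented for the example; the deque from Python's standard library is used here as one possible way of holding a queue.

```python
from collections import deque


def print_all(queue):
    """Take jobs from the front of the queue until it is empty."""
    printed = []
    while queue:
        job = queue.popleft()  # the job that has been waiting longest
        printed.append(job)
    return printed


print_queue = deque()
# Jobs join at the back of the queue as they arrive from the network.
print_queue.append("report from PC 1")
print_queue.append("photo from PC 2")
print_queue.append("letter from PC 3")

order = print_all(print_queue)
print(order)
```

The jobs come out in the same order in which they went in, which is the first in, first out behaviour that makes a queue suitable here.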
1.4 (n) Backing up and Archiving Data
Backing up data.
Data stored in files is very valuable. It has taken a long time to input to the system and
is often irreplaceable. If a bank loses the file of customer accounts because the hard disk
crashes, then the bank is out of business.
It makes sense to take precautions against a major disaster. The simplest solution is to
make a copy of the data in the file, so that if the disk is destroyed, the data can be
recovered. This copy is known as a BACK-UP. In most applications the data is so
valuable that it makes sense to produce more than one back-up copy of a file. Some of
these copies will be stored away from the computer system in case of something like a
fire, which would destroy everything in the building.
The first problem with backing up files is how often to do it. There are no right answers,
but there are wrong ones. It all depends on the application. An application that involves
the file being altered on a regular basis will need to be backed up more often than one
that is very rarely changed (what is the point of making another copy if it hasn’t changed
since the previous copy was made?). A school pupil file may be backed up once a week,
whereas a bank customer file may be backed up hourly.
The second problem is that the back-up copy will rarely be the same as the original file
because the original file keeps changing. If a back-up is made at 9.00am, an alteration
is made to the file at 9.05am, and the disk then crashes, the back-up will not include the
change that has been made. It is very nearly the same, but not quite. Because of this, a
separate file of all the changes that have been made since the last back up is kept. This
file is called the transaction log and it can be used to update the copy if the original is
destroyed. This transaction log is very rarely used. Once a new back up is made the old
transaction log can be destroyed. Speed of access to the data on the transaction log is not
important because it is rarely used, so a transaction log tends to use serial storage of the
data and is the best example of a serial file if an examination question asks for one.
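The way a transaction log brings a back-up up to date can be sketched in Python as follows. The account numbers, balances, the three kinds of change and the function name apply_log are all assumptions made for this illustration.

```python
def apply_log(backup, transaction_log):
    """Rebuild the current file from a back-up and the changes made since."""
    restored = dict(backup)  # work on a copy so the back-up is not altered
    # The log is read serially, in the order the changes were made.
    for action, account, balance in transaction_log:
        if action == "delete":
            restored.pop(account, None)
        else:  # "add" or "update"
            restored[account] = balance
    return restored


backup = {"1001": 250.00, "1002": 80.00}  # back-up made at 9.00am
transaction_log = [
    ("update", "1001", 300.00),  # alteration made at 9.05am
    ("add", "1003", 15.00),
    ("delete", "1002", None),
]

restored = apply_log(backup, transaction_log)
print(restored)
```

Because the log is only ever read from start to finish, there is no need for fast access to individual entries, which fits the use of serial storage.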
Archiving data.
Sometimes data is no longer being used. A good example would be in a school when
pupils leave. All their data is still on the computer file of pupils, taking up valuable space.
It is not sensible to just delete it. There are all sorts of reasons why the data may still be
important, for instance a past pupil may ask for a reference. If all the data has been erased
it may be impossible to write a sensible reference. Data that is no longer needed on
the file but may be needed in the future should be copied onto a long term storage medium
and stored away in case it is needed. This is known as producing an ARCHIVE of the
data. (Schools normally archive data for 7 years before destroying it).
Note: Archived data is NOT used for retrieving the file if something goes wrong. It is
used for storing little used or redundant data in case it is ever needed again, so that space
on the hard drive can be freed up.
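As a simple illustration of archiving, the Python sketch below moves the records of pupils who have left out of the working file and into a separate archive. The pupil names, the left_school field and the function name archive_leavers are invented for this example, and a real archive would be written to long term storage rather than held in memory.

```python
def archive_leavers(pupil_file, archive):
    """Move records of pupils who have left from the live file to the archive."""
    for name in list(pupil_file):  # copy of the keys, as the file is changed
        if pupil_file[name]["left_school"]:
            archive[name] = pupil_file.pop(name)


pupil_file = {
    "Jawad": {"form": "11A", "left_school": False},
    "Khurram": {"form": "13B", "left_school": True},
}
archive = {}

archive_leavers(pupil_file, archive)
print(sorted(pupil_file))
print(sorted(archive))
```

The record is removed from the working file, freeing space, but it is still available in the archive if a reference is requested later.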
Example Questions
8. a) Explain the difference between a serial file and a sequential file. (2)
b) Describe what is meant by a hashing algorithm and explain why such an
algorithm can lead to clashes. (3)
9. A library keeps both a book file and a member file. The library does a stock take
twice a year and orders new books only once a year. Members can join or cancel
their membership at any time.
a) Describe how the library can implement a sensible system of backing up their
files. (4)
b) Explain the part that would be played by archiving in the management of the
files. (4)
Note: This chapter, or section of the syllabus, is by far the largest portion of module
1, and consequently, candidates should expect a higher proportion of marks on the
exam paper to relate to this work than to the other sections.