Operating Systems, Third Edition
Achyut Godbole is currently the Managing Director of Softexcel Consultancy Services, Mumbai.
His professional career spans 32 years, and during this time, he has served in world-renowned
software companies in India, UK and USA. He has contributed to the multifold growth of companies
such as Patni, Syntel, L&T Infotech, Apar and Disha. He did his BTech in Chemical Engineering from
IIT Bombay, and thereafter worked for the welfare of Adivasi tribes for one year.
Godbole has authored best-selling textbooks from Tata McGraw-Hill such as Operating Systems,
Data Communications and Networking, and Web Technologies, including international editions
and Chinese translations. In addition, he has authored several highly rated Marathi books on subjects
such as computers, management and economics, and has written several popular columns in Marathi
newspapers/magazines on science, literature, medicine, and technology. He has conducted numerous programmes on
television pertaining to technology, science, and economics. He has traveled abroad on more than 150 occasions to
several countries to promote software business. Godbole also runs a school for autistic children.
He has won several awards, including an award from the Prime Minister of India, ‘Udyog Ratna’, ‘Distinguished
Alumnus’ from IIT, ‘Kumar Gandharva’ award at the hands of Pandit Bhimsen Joshi, ‘Navaratna’ from Sahyadri TV
channel, the ‘Indradhanu Puraskar’, and ‘Parnerkar Puraskar’ for his contributions to Economics. Besides this, he was
ranked 16th in merit in the Maharashtra Board Examination. A brilliant student, he was always a topper in his class and won many
prizes in Mathematics.
He has a website (www.achyutgodbole.com) and can be reached at achyut.godbole@gmail.com.
Atul Kahate has been working with Oracle Financial Services Software Limited (earlier i-flex solutions
limited) as Head—Technology Practice for over eight years. He has 15 years of experience in
Information Technology in India and abroad in various capacities. Previously, he has worked with
Syntel, L&T Infotech, American Express and Deutsche Bank. He has a Bachelor of Science degree
in Statistics and a Master of Business Administration in Computer Systems.
He has authored 24 highly acclaimed books on Technology, Cricket, and History, published by Tata
McGraw-Hill and other reputed publishers. Some of these titles include Web Technologies—TCP/
IP to Internet Application Architectures, Cryptography and Network Security, Fundamentals of Computers, Information
Technology and Numerical Methods, Introduction to Database Management Systems, Object Oriented Analysis and
Design, and Schaum’s Series Outlines—Programming in C++, XML and Related Technologies as well as international
and Chinese translated editions. Several of his books are being used as course textbooks or sources of reference in a
number of universities/colleges/IT companies all over the world. He has authored ‘Flu chi kahani— Influenza te Swine
flu’ (The story of Flu) and has also co-authored a book in Marathi titled IT t ch jayachay (I want to enter into IT). He
has authored two books on cricket, and has written over 3000 articles on IT and cricket in leading Marathi newspapers/
magazines/journals in India and abroad.
He has a deep interest in history, teaching, science, economics, music, and cricket, besides technology. He has conducted
several training programs on a wide range of technologies in a number of educational institutions and IT organizations,
including prestigious institutions such as IIT, Symbiosis, I2IT, MET, and the Indira Institute of Management. He has done
a series of programmes for the IBN Lokmat, Star Majha, and SAAM TV channels to explain complex technologies.
Kahate has also worked as the official cricket statistician and scorer in a number of Test and limited-overs international
cricket matches. He has contributed to cricket websites, such as CricInfo and Cricket Archive. He is also a member of the
Association of Cricket Statisticians, England.
He has won several awards, both in India and abroad, including the Computer Society of India (CSI) award for IT
education and literacy, the noted ‘Yuvonmesh Puraskar’ from Indradhanu-Maharashtra Times, and the ‘IT Excellence
Award’ from Indira Group of Institutes.
He has a website (www.atulkahate.com) and can be reached at akahate@gmail.com.
Managing Director
Softexcel Consultancy Services
Mumbai
Head–Technology Practice
PrimeSourcing Division™
Oracle Financial Services Software Limited
Information contained in this work has been obtained by Tata McGraw-Hill, from sources believed to be reliable.
However, neither Tata McGraw-Hill nor its authors guarantee the accuracy or completeness of any information
published herein, and neither Tata McGraw-Hill nor its authors shall be responsible for any errors, omissions, or
damages arising out of use of this information. This work is published with the understanding that Tata McGraw-
Hill and its authors are supplying information but are not attempting to render engineering or other professional
services. If such services are required, the assistance of an appropriate professional should be sought.
Typeset at Print-O-World, 2579, Mandir Lane, Shadipur, New Delhi 110 008, and printed at Lalit Offset Printer, 219,
F.I.E., Patpar Ganj, Industrial Area, Delhi 110 092
Shobha Godbole
and
Anita Kahate
for their support, understanding,
patience, and perseverance
CONTENTS
Preface xxix
2. COMPUTER ARCHITECTURE 17
2.1 Introduction 17
2.2 A 4GL Program 18
2.3 A 3GL (HLL) Program 18
2.4 A 2GL (Assembly) Program 19
2.5 A 1GL (Machine Language) Program 21
2.5.1 Assembler 21
2.5.2 Instruction Format 21
2.5.3 Loading/Relocation 22
2.6 0GL (Hardware Level) 24
2.6.1 Basic Concepts 24
2.6.2 CPU Registers 26
2.6.3 The ALU 27
2.6.4 The Switches 27
2.6.5 The Decoder Circuit 28
2.6.6 The Machine Cycle 28
2.6.7 Some Examples 29
2.7 The Context of a Program 33
2.8 Interrupts 33
2.8.1 The Need for Interrupts 33
2.8.2 Computer Hardware for Interrupts and Hardware Protection 33
2.9 Storage Structure 38
2.9.1 Random Access Memory (RAM) 38
2.9.2 Secondary Memory 40
2.10 Storage Hierarchy 43
Terms and Concepts Used 44
Summary 44
Review Questions 46
4. FILE SYSTEMS 72
4.1 Introduction 72
4.1.1 Disk Basics 74
4.1.2 Direct Memory Access 85
4.2 The File System 87
4.2.1 Introduction 87
4.2.2 Block and Block Numbering Scheme 87
4.2.3 File Support Levels 90
4.2.4 Writing a Record 91
4.2.5 Reading a Record 95
4.2.6 The Relationship Between the Operating System and DMS 97
4.2.7 File Directory Entry 101
4.2.8 OPEN/CLOSE Operations 102
4.2.9 Disk Space Allocation Methods 102
4.2.10 Directory Structure: User’s View 117
4.2.11 Implementation of a Directory System 121
4.2.12 File Organization and Access Management 128
4.2.13 File Organization and Access Management 129
4.2.14 File Sharing and Protection 129
4.2.15 Directory Implementation 130
4.2.16 Directory Operations 130
4.2.17 Free Space Management 131
4.2.18 Bit Vector 131
4.2.19 Log Structured File System 131
Terms and Concepts Used 132
Summary 133
Review Questions 134
8. DEADLOCKS 256
8.1 Introduction 256
8.2 Graphical Representation of a Deadlock 257
8.3 Deadlock Prerequisites 258
8.3.1 Mutual Exclusion Condition 258
8.3.2 Wait for Condition 259
8.3.3 No Preemption Condition 259
8.3.4 Circular Wait Condition 259
8.4 Deadlock Strategies 259
8.4.1 Ignore a Deadlock 259
8.4.2 Detect a Deadlock 259
8.4.3 Recover from a Deadlock 263
8.4.4 Prevent a Deadlock 264
8.4.5 Avoid a Deadlock 266
Summary 271
Review Questions 271
Terms and Concepts Used 271
Index 659
PREFACE
OVERVIEW
Almost everybody involved in the development of software comes in contact with different Operating
Systems. There are two major groups of people in this context. One group is concerned with knowing how
an operating system is designed, what data structures are used by an operating system and how various
algorithms within an operating system are organized in various layers to execute different functions. This
is a class of system programmers who are required to study the internals of an operating system and later
on participate in either designing, or installing and managing, an operating system (including performance
tuning), or enhancing it by writing device drivers to support various new devices.
When Mr Achyut Godbole came up with the first edition of this book in 1995, Operating Systems were a
topic of immense interest to technologists studying system software. UNIX and Microsoft Windows were the
leading operating systems of the time. The book was not designed with any particular syllabi or courses in
mind. It was merely an effort to explain the way operating systems function, so that someone with a
basic background in computer technology would be able to understand the subject thoroughly.
As the 1990s gave way to the new millennium, computer technologies evolved much more rapidly than
ever before. The role of the Internet changed the entire scenario dramatically. Suddenly, desktop computing
gave way to distributed computing. Web servers and subsequently, application servers, assumed significant
importance. Database servers started hosting billions of bytes of data, which has now easily run into trillions
of bytes, and more. At the same time, however, the client (or the Web browser) also became very crucial. All
this meant that operating systems catering to the needs of these diverse sets of users had to adapt to these
requirements.
We came out with a second edition of the book to reflect these changes. The main additions were the case
studies on Windows 2000 and Linux. Many other supplementary changes were also made. The third edition
of the book is now in your hands. This edition reflects the current technology trends, as well as captures some
of the other topics that we felt were necessary to make the book even more exhaustive.
CHAPTER ORGANISATION
Chapter 1 deals with the history of Operating Systems, covering the various milestones along the way.
The chapter also covers modern trends in Operating Systems.
Chapter 2 begins with an overview of programming language levels, and presents a view at each of
these levels, viz. 4GL, 3GL, 2GL and 1GL (machine language). It shows the relationships amongst
these levels, which essentially provide views of the same system at different levels of capabilities and,
therefore, abstractions.
Chapter 3 introduces the concept of the Operating System functions as provided by the various
system calls. It presents the user’s/application programmer’s view of the Operating System and also
that of the System Programmer and shows how these are related. It discusses the system calls in
three basic categories: Information Management (IM), Process Management (PM), and Memory
Management (MM). It also shows the relationship between these three modules.
Chapter 4 introduces File Systems. It explains file organization and access management, file sharing
and protection. Directory systems are then discussed based on their levels. The chapter also defines
directory operations, free-space management, bit vectors and log structured file systems.
Chapter 5 is on I/O Management and Disk Scheduling. It defines the concept of a “block” and
goes on to explain how the data for a file is organized on a disk. It explains the functioning of hard
and floppy disks in detail. It explains how the Operating System does the address translation from
logical to physical addresses to actually Read/Write any record. It goes on to show the relationship
between the Application Program (AP), Operating System (O/S), and the Data Management Software
(DMS).
Chapter 6 defines a “process” and explains the concepts of context switching as well as
multiprogramming. It defines various process states and discusses different process state transitions.
It gives the details of a data structure “Process Control Block (PCB)” and uses it to show how
different operations on a process such as “create”, “kill”, or “dispatch” are implemented, each time
showing how the PCB chains would reflect the change. It then discusses the different methods used
for scheduling various processes.
Chapter 7 describes the problems encountered in Process Synchronisation by taking an example
of Producer-Consumer algorithms. It illustrates various solutions that have been proposed so far,
for mutual exclusion. The chapter concludes with a detailed discussion on semaphores and classic
problems in Inter Process Communication.
Chapter 8 describes and defines a deadlock and also shows how the situation can be graphically
represented. It states the pre-requisites for the existence of a deadlock. It then discusses various
strategies for handling deadlocks, viz. ignore, detect, recover from, prevent, and avoid. It concludes
with a detailed discussion of Banker’s algorithm.
Chapter 9 elucidates various Contiguous and Non-Contiguous memory allocation schemes. For all
these schemes, it states the support that the Operating System expects from the hardware and then
goes on to explain in detail the way the scheme is implemented.
Security is an important aspect of any Operating System. Chapter 10 discusses the concept of security
and various threats to it and attacks on it. It then goes on to discuss how security violations can take
place due to parameter passing mechanisms. It discusses computer worms, and viruses, explaining in
detail, how they operate and grow. The chapter discusses various security design principles and also
various protection mechanisms to enforce security.
Chapter 11 introduces the concept of Parallel Processing and contrasts it with uniprocessing as well
as distributed processing, discussing the merits and demerits of all. The chapter demonstrates
programming for parallel processing and also the classification of computers.
Chapter 12 defines the term “distributed processing” and contrasts centralized versus distributed
processing. It also describes three ways in which the processing can be distributed, viz. distributed
application, distributed data and distributed control. It takes an example to clarify these concepts.
Chapter 13 offers a detailed case study of Windows NT and Windows 2000. Along with Linux, the
Windows family of Operating Systems has become one of the most important systems that a technologist
should know. The chapter provides an in-depth discussion of Windows, including its architecture, design
principles, and various Operating Systems algorithms/data structures.
Chapter 14 presents a similar detailed case study of UNIX. The chapter provides a detailed description
of UNIX, including its architecture, design principles, and various Operating Systems algorithms/
data structures.
Chapter 15 is similar to Chapter 14 except that it describes Linux and not UNIX. The chapter
provides notes on the differences between these two Operating Systems at appropriate places.
WEB SUPPLEMENTS
The Web supplements can be accessed at http://www.mhhe.com/godbole/os3 and contain the following
resources:
Instructors
Chapterwise PowerPoint slides
Students
Chapterwise solutions for True and False questions and MCQs
Frequently Asked Questions from OS
Detailed Case Study of UNIX
Chapter on Multimedia Operating System for extra reading
ACKNOWLEDGEMENTS
We are thankful to Shobha Godbole and Anita Kahate for their constant support in numerous ways. Without
them, this book would not have been possible. We are also grateful to Sapna and Umesh Aherwadikar for
their significant help in many ways.
We would also like to acknowledge the following reviewers who took time out to review the book, which
helped us give a final shape to the revised edition.
D S Kushwaha
Motilal Nehru National Institute of Technology (MNNIT), Allahabad, Uttar Pradesh
Nitin Gupta
National Institute of Technology (NIT), Hamirpur, Himachal Pradesh
Amritanjali
Birla Institute of Technology, Mesra, Jharkhand
Madhura V Phatak
Maharashtra Institute of Technology (MIT), Pune, Maharashtra
K Poulose Jacob
Cochin University of Science and Technology, Cochin, Kerala
V Shashikiran
Sri Venkateswara College of Engineering, Sriperumbudur, Tamil Nadu
Saidalavi Kalady
National Institute of Technology (NIT), Calicut, Kerala
Finally, we would like to thank the Tata McGraw-Hill Education team, especially Vibha Mahajan, Shalini
Jha, Surbhi Shukla, Surbhi Suman, Sohini Mukherjee, Suneeta Bohra and Baldev Raj for their enthusiastic
support and guidance in bringing out the revised edition of the book.
We hope that the reader likes this revised edition and finds it useful in learning the concepts of Operating
Systems.
Achyut Godbole
Atul Kahate
Constructive suggestions and criticism always go a long way in enhancing any endeavour. We request
all readers to email us their valuable comments / views / feedback for the betterment of the book at
tmh.csefeedback@gmail.com mentioning the title and author name in the subject line. Please report any
piracy spotted by you as well!
VISUAL TOUR

Comprehensive coverage of all topics, presented with lucid explanations in simple language.

Diagrams form an important part of every textbook on Science and Engineering. This book contains over
400 diagrams which lend clarity to the concepts discussed.

Terms and Concepts Used are listed at the end of each chapter. These help students have a quick overview
of the important terms discussed in the chapter, look up the definitions and memorise them as part of
self-study.
Around 1955, transistors were introduced in the USA at AT&T. The problems associated with
vacuum tubes vanished overnight. The size and the cost of the machine dramatically dwindled. The
reliability improved. For the first time, new categories of professionals called systems analysts,
designers, programmers and operators came into being as distinct entities. Until then, the functions handled
by these categories of people had been managed by a single individual.
Assembly language, as a second generation language, and FORTRAN, as a High Level Language (third
generation language), emerged, and the programmer's job was greatly simplified.
However, these were batch systems. The IBM-1401 belonged to that era. There was no question of having
multiple terminals attached to the machine, carrying out different inquiries. The operator was continuously
busy loading or unloading cards and tapes before and after the jobs. Only one job could run at a time. At the
end of one job, the operator had to dismount the tapes and take out the cards ('teardown operation'), and then load the
decks of cards and mount the tapes for the new job ('setup operation'). This consumed a lot of
computer time; valuable CPU time was therefore wasted. This was the case when the IBM-1401 was in use. An
improvement came when the IBM-7094, a faster and larger computer, was used in conjunction with the IBM-1401,
which was then used as a 'satellite computer'. The scheme used to work as follows:
(i) There used to be 'control cards' giving
information about the job, the user and so
on, sequentially stacked, as depicted in Fig.
1.1. For instance, $JOB specified the job to
be done, the user who was doing it, and perhaps
some other information. $LOAD signified
that what would follow were the cards with
executable machine instructions punched
onto them, and that they were to be loaded into
the main memory before the program could be executed.
These cards were therefore, collectively
known as an 'object deck' or an 'object
program'. When the programmer wrote his
program in an assembly language (this was
called a 'source program'), a special program
called an 'assembler' would carry out the
assembly process and convert it into an object
program before it could be executed. The assembler would also punch these machine instructions
on the cards in a predefined format. For instance, each card had a sequence number to help the deck to be
rearranged in case it fell out of order by mistake. The column in which the 'op code' of the machine instruction
started was also fixed (e.g. column 16 in the case of Autocoder), so that the loader could do its job
easily and quickly.
The $LOAD card would essentially signify that the object cards following it should then be loaded
in the memory. Obviously, the object program cards followed the $LOAD card as shown in the
figure. The $RUN control card would specify that the program just loaded should be executed
by branching to the first executable instruction specified by the programmer in the "ORG" statement.
The program might need some data cards which then followed. $END specified the end of the data
cards and $JOB specified the beginning of a new job again!
(ii) An advantage of stacking these cards together was to reduce the efforts of the operator in 'set up' and
'teardown' operations, and therefore, to save precious CPU time. Therefore, many such jobs were
stacked together one after the other as shown in Fig. 1.1.
(iii) All these cards were then read one by one and copied onto a tape using a "card to tape" utility
program. This was done on an IBM-1401 which was used as a satellite computer. This arrangement
is shown in Fig. 1.2. Control totals such as 'total number of cards read' were accumulated and printed by the
utility program at the end of the job to ensure that all the cards had been read.
(iv) The prepared tape (Tape-J shown in Fig. 1.2) was taken to the main 7094 computer and processed
as shown in Fig. 1.3. The figure shows Tape-A as an input tape and Tape-B as an output tape. The
printed reports were not actually printed on the 7094; instead, the print image was dumped onto a tape
(Tape-P) which was carried to the slower 1401 computer again, which did the final printing as shown in
Fig. 1.4. Due to this procedure, the 7094, which was a faster and more expensive machine, was
not locked up unnecessarily for a long time.
The logic of splitting the operation of printing into two stages here was simple. The CPU of a computer
was quite fast as compared to any I/O operation. This was so, because the CPU was a purely electronic
device, whereas I/O involved electromechanical operations. Secondly, of the two types of I/O operations involved,
writing on a tape was faster than printing a line on paper. Therefore, the time of the more powerful, more
expensive 7094 was saved. This is because the CPU can execute only one instruction at a time.
If the 7094 was used to print a report, it would be idle for most of the time. When a line was being printed, the
CPU could not be doing anything else. Of course, some computer had to read the print image tape (Tape-P)
and print a report. But then, that could be delegated to a relatively less expensive satellite computer, say the
1401. Writing on tape and then printing on the printer might appear to be wasteful and more expensive, but
it was not so, due to the difference in power and cost between the 7094 and the 1401.
This scheme was very efficient and improved the division of labour. The three operations required for
the three stages shown in Figs. 1.2 to 1.4 were repetitive, and the efficiency increased. The only additional
requirement was that the 7094 had to have a program which read the card images from Tape-J and interpreted them
(e.g. on hitting a $LOAD card image, it actually started loading the program from the following card image
records). This was essentially a rudimentary Operating System. The IBM-7094 had two such Operating Systems,
'IBSYS' and the 'Fortran Monitor System (FMS)'.
Similarly, the IBM-1401 had to have a program which interpreted the print images from the tape and
actually printed the report. This program was a rudimentary ‘spooler’. One scheme was to have the exact
print image on the tape. For instance, if there were 15 blank lines between two printed valid report lines, one
would actually write 15 blank lines on the print image tape. In this case, the spooler program was very simple.
All it had to do was to dump the tape records on the printer. But this scheme was clearly wasteful, because,
the IBM-7094 program had to keep writing actual blank lines; additionally, the tape utilization was poor.
A better scheme was to use special characters (which are normally not used in common reports, etc.) to
denote end-of-line, end-of-page, number of lines to be skipped and so on. In this case, the program on the
IBM-7094 which created the print-image tape became a little more complex but far more efficient. The actual
tape was used far more efficiently, but then the spooler program also became more complex. It had to actually
interpret the special characters on the tape and print the report!
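As a rough illustration of this second scheme, the following C sketch interprets such a print-image stream. The
control-byte values, the record format and the file name are assumptions made purely for illustration; they are
not the actual tape format used with the IBM-7094/1401.

    /* A minimal sketch of a spooler for the second scheme described above.
     * The control bytes and the file name below are hypothetical. */
    #include <stdio.h>

    #define CTRL_SKIP_LINES  0x01   /* next byte gives the number of blank lines */
    #define CTRL_END_OF_PAGE 0x02   /* advance to the top of the next page       */

    /* Interpret one print-image stream (standing in for Tape-P) and print it. */
    static void spool_print(FILE *tape, FILE *printer)
    {
        int ch;
        while ((ch = fgetc(tape)) != EOF) {
            if (ch == CTRL_SKIP_LINES) {
                int n = fgetc(tape);            /* the count follows the marker  */
                for (int i = 0; i < n; i++)
                    fputc('\n', printer);       /* emit the blank lines only now */
            } else if (ch == CTRL_END_OF_PAGE) {
                fputc('\f', printer);           /* form feed starts a new page   */
            } else {
                fputc(ch, printer);             /* an ordinary report character  */
            }
        }
    }

    int main(void)
    {
        FILE *tape = fopen("print_image.dat", "rb");   /* assumed file name */
        if (tape == NULL)
            return 1;
        spool_print(tape, stdout);
        fclose(tape);
        return 0;
    }

The more compact the encoding, the better the tape is utilized, but the more work both the program writing the
tape and the spooler have to do, which is exactly the trade-off described above.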
This was a single-user system. Only one program belonging to only one user could run at a time. When
the program was reading or writing a record, the CPU was idle, and this was very expensive. Due to their
electromechanical nature, the I/O operations used to be extremely time-consuming as compared to the CPU
operations (This is true even today despite great improvements in the speeds of the I/O devices!). Therefore,
during the complete execution of a job, the actual CPU utilization was very poor.
Despite these limitations, the rudimentary Operating System did serve the purpose of reducing operator
intervention in the execution of computer jobs. Setup and teardown were then applicable only for a set of jobs
stacked together instead of for each job.
During this period, the mode of file usage was almost always sequential. Database Management Systems
(DBMS) and On-line systems were unheard of at that time.
One more development of this era was the introduction of a library of standard routines. For example, the
'Input Output Control System (IOCS)' was developed in an assembly language of the IBM-1401, called
'Autocoder', and was supplied along with the hardware. This helped the programmers significantly because
they no longer had to code these tedious and error-prone routines every time, in their programs. The concept
of a 'system call', where the Operating System carried out a function on behalf of the user, was still not in
use. These routines therefore had to be included, in source form, along with the rest of the source program, for
all programs before the assembly process. As a result, these routines went through the assembly process every
time.
An improvement over this was to predetermine the memory locations where the IOCS was expected to be
loaded, and to keep the preassembled IOCS routines ready. They were then added to the assembled object
program cards to be loaded by the loader along with the other object deck. This process saved the repetitive
assembly of IOCS routines every time along with every source program. The source program had simply
to "Branch" to the subroutine residing at a predefined memory address to execute a specific I/O instruction.
In the early 60s, many companies such as National Cash Register (NCR), Control Data Corporation (CDC),
General Electric (GE), Burroughs, Honeywell, RCA and Sperry Univac started providing their computers
with Operating Systems. But these were mainly batch systems concerned primarily with throughput.
Transaction processing systems started emerging with the users feeling the need for more and more on-
line processing. In fact, Burroughs was one of the few companies which produced an Operating System
called the 'Master Control Program (MCP)', which had many features of today's Operating Systems such as
multiprogramming (execution of many simultaneous user programs), multiprocessing (many processors
controlled by one Operating System) and virtual storage (program size allowed to be more than the available
memory).
IBM announced the System/360 series of computers in 1964. IBM had designed various computers
in this series which were mutually compatible so that the conversion efforts for programs from
one machine to the other in the same family were minimal. This is how the concept of ‘family
of computers’ came into being. IBM-370, 43xx and 30xx systems belong to the same family of computers.
IBM faced the problem of converting the existing 1401 users, of whom there were many, to the System/360. IBM
provided the customers with utilities such as 'simulators' (totally software driven and therefore a little slow)
and 'emulators' (using hardware modifications to enhance the speed at extra cost) to enable the old 1401-
based software to run on the IBM-360 family of computers.
Initially, IBM had plans for delivering only one Operating System for all the computers in the family.
However, this approach proved to be practically difficult and cumbersome. The Operating System for the
larger computers in the family, meant to manage larger resources, was found to create far more burden and
overheads if used on the smaller computers. Again, the Operating System that could run efficiently on a
smaller computer would not manage the resources for a large computer effectively. At least, IBM thought
so at that time. Therefore, IBM was forced to deliver four Operating Systems within the same range of
computers. These were
The major advantages/features and problems of this computer family and its Operating Systems were as
follows:
The System/360 was based on 'Integrated Circuits (ICs)' rather than transistors. With ICs, the cost and the
size of the computer shrank substantially, and yet the performance improved.
The Operating Systems for the System/360 were written in assembly language. The routines were therefore,
complex and time-consuming to write and maintain. Many bugs persisted for a long time. As these were
written for a specific machine and in the assembly language of that machine, they were tied to the hardware.
They were not easily 'portable' to machines with a different architecture not belonging to the same family.
Despite these problems, the user found them acceptable, because, the operator intervention (for setup and
teardown) decreased. A 'Job Control Language (JCL)' was developed to allow communication between the
user/programmer and the computer and its Operating System. By using the JCL, a user/programmer could
instruct the computer and its Operating System to perform certain tasks, in a specific sequence for creating a
file, running a job or sorting a file.
The concept of ‘Simultaneously Peripheral Operations On-Line (spooling)’ was fully developed during
this period. This was the outgrowth of the same principle that was used in the scheme discussed earlier and
depicted in Figs. 1.2 to 1.4. The immediate advantage of spooling was that you no longer had to carry tapes to and
fro between the 1401 and 7094 machines. Under the new Operating System, all jobs in the form of cards could be
read onto the disk first (shown as 'a' in the figure) and later on, the Operating System would load as many jobs
into the memory, one after the other, as the available memory could accommodate (shown as 'b' in the
figure). After many programs were loaded in different partitions of the memory, the CPU was switched from
one program to another to achieve multiprogramming. We will later see different policies used to achieve this
switching. Similarly, whenever any program printed something, it was not written directly on the printer, but
the print image of the report was written onto the disk in the area reserved for spooling (shown as 'c' in the
figure). At any convenient time later, the actual printing from this disk file could be undertaken (shown as 'd'
in the figure). This is depicted in Fig. 1.6.
Spooling had two distinct advantages. One was that it allowed smooth multiprogramming operations. Imagine
the kind of hilarious report, with intermingled lines from both reports on the same page, that would be produced
if two programs, say Stores Ledger and Payslips Printing, were allowed to issue simultaneous instructions to
write directly on the printer. Instead, the print images of both the reports were written on to the disk
at two different locations of the Spool file first, and the Spooler program subsequently printed them one by
one. Therefore, while printing, the printer was allocated only to the Spooler program. In order to guide this
subsequent printing process, the print-image copy of the report on the disk also contained some predefined
special characters, such as one for skipping a page. These were interpreted by the Spooler program at the time of
producing the actual report.
Spooling had another advantage too! All the I/O of all the jobs was essentially pooled together in the
spooling method and therefore, this could be overlapped with the CPU bound computations of all the jobs at
an appropriate time chosen by the Operating System to improve the throughput.
The System/360 with its Operating Systems enhanced multiprogramming, but the Operating Systems were
not geared to meet the requirements of interactive users. They were not very suitable for query systems,
for example.
The reason was simple. In interactive systems, the Operating System needs to recognize a terminal as an
input medium. In addition, the Operating System has to give priority to the interactive processes over batch
processes. For instance, if you fire a query on the terminal, "What is the flight time of Flight SQ024?", and
the passenger has to be serviced within a brief time interval, the Operating System must give higher priority
to this process than, say, to a payroll program running in the batch mode. The classical "multiprogramming
batch" Operating Systems did not provide for this kind of scheduling of various processes.
A change was needed. IBM responded by giving its users a program called "Customer Information Control
System (CICS)" which essentially provided ‘Data Communication (DC)’ facility between the terminal and
the computer. It also scheduled various interactive users’ jobs on top of the Operating System. Therefore,
CICS functioned not only as a Transaction Processing (TP) monitor but also took over some functions
of the Operating System, such as scheduling. IBM later also provided the users with the ‘Time Sharing Option
(TSO)’ software to deal with the situation.
Many other vendors came up with ‘Time Sharing Operating Systems’ during the same period. For
instance, DEC came up with TOPS-10 on the DEC-10 machine, RSTS/E and RSX-11M for the PDP-11
family of computers, and VMS for the VAX-11 family of computers. Data General produced AOS for its 16-
bit minicomputers and AOS/VS for its 32-bit super-minicomputers.
These Operating Systems could learn from the good/bad points of the Operating System running on the
System/360. Most of these were far more user/programmer friendly. Terminal handling was built into the
Operating System. These Operating Systems provided for batch as well as on-line jobs by allowing both to
coexist and compete for the resources, but giving higher preference to servicing the on-line requests.
One of the first time sharing systems was ‘Compatible Time Sharing System (CTSS)’ developed at the
Massachusetts Institute of Technology (M.I.T.). It was used on the IBM-7094 and it supported a large number
of interactive users. Time sharing became popular at once.
‘Multiplexed Information and Computing Service (MULTICS)’ was the next one to follow. It was a
joint effort of MIT, Bell Labs and General Electric. The aim was to create a ‘computer utility’ which could
support hundreds of simultaneous time sharing users.
MULTICS was a crucible which generated and tested almost all the important ideas and algorithms which
were to be used repeatedly over several years in many Operating Systems. But the development of MULTICS
itself was very painful and expensive. Finally, Bell Labs withdrew from the project. In fact, in the process, GE
gave up its computer business altogether. Despite its relative failure, MULTICS had a tremendous influence
on the design of Operating Systems for many years to come.
One of the computer scientists, Ken Thompson, working on the MULTICS project through Bell Labs
subsequently got hold of a PDP-7 machine which was unused. Bell Labs had already withdrawn from
MULTICS. Ken Thompson hit upon the novel idea of writing a single-user, stripped-down version of
MULTICS on the PDP-7. Another computer scientist—Brian Kernighan—jokingly started calling this system
‘UNICS’. Later on, the name UNIX was adopted. None of these people were aware of the tremendous impact
this event was to have on all the future developments. The UNIX Operating System was later ported to a
larger machine, PDP-11/45.
There were, however, major problems in this porting. The problems arose because UNIX was written in
the assembly language. A more adventurous idea struck another computer scientist—Dennis Ritchie—that of
writing UNIX in a higher level language. Ritchie examined all the existing Higher Level Languages (HLLs)
and found none suitable for this task. He, in fact, designed and implemented a language called 'C' for this
purpose. Finally, UNIX was written in C. Only 10% of the kernel and hardware-dependent routines where
the architecture and the speed mattered were written in the assembly language for that machine. All the rest
(about 90%) was written in C. This made the job of 'porting' the Operating System far easier. Today, to port
UNIX to a new machine, you need to have a C compiler on that machine to compile the 90% of the source
code written in C into the machine instructions of the target computer. You also need to rewrite,
test and integrate only the remaining 10% of assembly language code on that machine. Despite this facility,
the job of porting is not a trivial one, though it is far simpler than porting the earlier Operating
Systems.
This was a great opportunity for the hardware manufacturers. With new hardware and newer architectures,
instead of writing a new Operating System each time, porting of UNIX was a far better solution. They could
announce their products far faster, because all the other products such as Database Management Systems,
Office Automation Systems, language compilers, and so on could also then be easily ported, once the System
Calls under UNIX were known and available. After this, porting of Application Programs also became a
relatively easier task.
Meanwhile, AT&T, the parent company of Bell Labs, licensed the UNIX source code to many universities
almost freely. It became very popular amongst the students who later became designers and managers of
software development processes in many organizations. This was one of the main reasons for its popularity
(By now it had a multiuser version).
When ‘Large Scale Integration (LSI)’ circuits came into existence, thousands of transistors could
be packaged on a very small area of a silicon chip. A computer is made up of many units such as a
CPU, memory, I/O interfaces, and so on. Each of these is further made up of different modules such
as Registers, Adders, Multiplexers, Decoders and a variety of other digital circuits. Each of these, in turn, is
made up of various gates (For example, one memory location storing 1 bit is made up of as many as seven
gates!). Those gates are implemented in digital electronics using transistors. As the size of a chip containing
thousands of such transistors shrank, obviously the size of the whole computer also shrank. But the process of
interconnecting these transistors to form all the logical units became more intricate and complex. It required
tremendous accuracy and reliability. Fortunately, with Computer Aided Design (CAD) techniques, one could
design these circuits easily and accurately, using other computers themselves! Mass automated production
techniques reduced the cost but increased the reliability of the produced computers. The era of microcomputers
and Personal Computers (PC) had begun.
With the hardware, you obviously need the software to make it work. Fortunately, many Operating System
designers on the microcomputers had not worked extensively on the larger systems and therefore, many of
them were not biased in any manner. They started with fresh minds and fresh ideas to design the Operating
System and other software on them.
"Control Program for Microcomputers (CP/M)" was almost the first Operating System on the microcomputer
platform. It was developed on the Intel 8080 in 1974 as a File System by Gary Kildall. Intel Corporation had
decided to use PL/M instead of the assembly language for the development of systems software and needed
a compiler for it badly. Obviously, the compiler needed some support from some kind of utility (Operating
System) to perform all the file related operations. Therefore, CP/M was born as a very simple, single-user
Operating System. It was initially only a File System to support a resident PL/M compiler. This was done at
Digital Research Inc. (DRI).
After the commercial licensing of CP/M in 1975, other utilities such as editors, debuggers, etc. were
developed, and CP/M became very popular. CP/M went through a number of versions. Finally, a 16-bit
multiuser, time sharing "MP/M" was designed with real time capabilities, and a genuine competition with the
minicomputers started. In 1980, "CP/NET" was released to provide networking capabilities with MP/M as
the server to serve the requests received from other CP/M machines.
One of the reasons for the popularity of CP/M was its 'user-friendliness'. This had a lot of impact on all
the subsequent Operating Systems on microcomputers.
After the advent of the IBM-PC based on Intel 8086 and then its subsequent models, the 'Disk Operating
System (DOS)' was written. IBM's own PC-DOS and MS-DOS by Microsoft are close cousins with very
similar features. The development of PC-DOS again was related to CP/M. A company called "Seattle
Computer" developed an Operating System called QDOS for Intel 8086. The main goal was to enable the
programs developed under CP/M on the Intel 8080 to run on the Intel 8086 without any change. The Intel 8086 was
upward compatible with the Intel 8080. QDOS, however, had to be faster than CP/M in disk operations. Microsoft
Corporation was quick to realize the potential of this product, given the projected popularity of Intel 8086. It
acquired the rights for QDOS which later became MS-DOS (The IBM version is called PC-DOS).
MS-DOS is a single user, user-friendly operating system. In quick succession, a number of other products
such as Database Systems (dBASE), Word Processing (WORDSTAR), Spreadsheet (LOTUS 1-2-3) and
many others were developed under MS-DOS, and the popularity of MS-DOS increased tremendously. The
subsequent development of compilers for various High Level Languages such as BASIC, COBOL and C
added to this popularity, and, in fact, opened the gates to a more serious software development process. This
was to play an important role after the advent of Local Area Networks (LANs). MS-DOS was later influenced
by UNIX and has been evolving towards UNIX over the years. Many features, such as a hierarchical file
system, have been introduced in MS-DOS over a period of time.
With the advent of the Intel 80286, the IBM PC/AT was announced. The hardware had the power to cater
simultaneously to multiple users, despite the name "Personal Computer". Microsoft quickly adapted UNIX
to this platform to announce "XENIX". IBM joined hands with Microsoft again to produce a new Operating
System called "OS/2". Both of these run on 286- and 386-based machines and are multi-user systems. While
XENIX is almost the same as UNIX, OS/2 is fairly different from, though influenced by, MS-DOS, which
runs on the IBM PC/AT as well as the PS/2.
With the advent of 386 and 486 computers, bit-mapped graphic displays became faster and therefore more
realistic. Thus, Graphical User Interfaces (GUIs) became possible and, in fact, necessary for every
application. With the advent of GUIs, some kind of standardization was necessary to reduce development
and training time. Microsoft again reacted by producing MS-WINDOWS. MS-WINDOWS is actually not
an Operating System. Internally, it still uses MS-DOS to execute various system calls. On the top of DOS,
however, MS-WINDOWS enables a very user friendly Graphical User Interface (as against the earlier text
based ones) and also allows windowing capability.
MS-WINDOWS did not lend a true multitasking capability to the Operating System. WINDOWS-NT,
developed a few years later, incorporated this capability in addition to being windows based. (OS/2 and UNIX
provided multitasking, but were not windows based; they had to be used along with Presentation Manager
or X-WINDOWS/MOTIF, respectively, to achieve that capability.)
With the era of smaller but powerful computers, 'Distributed Processing' started becoming a reality.
Instead of a centralized large computer, the trend towards having a number of smaller systems at different
work sites but connected through a network became stronger.
There were two responses to this development. One was Network Operating System (NOS) and the
other, Distributed Operating System (DOS). There is a fundamental difference between the two. In Network
Operating System, the users are aware that there are several computers connected to each other via a network.
They also know that there are various databases and files on one or more disks and also the addresses where
they reside. But they want to share the data on those disks. Similarly, there are one or more printers shared
by various users logged on to different computers. NOVELL's NetWare 286 and the subsequent NetWare
386 Operating Systems fall in this category. In this case, if a user wants to access a database on some other
computer, he has to explicitly state its address.
Distributed Operating System, on the other hand, represents a leap forward. It makes the whole network
transparent to the users. The databases, files, printers and other resources are shared amongst a number of
users actually working on different machines, but who are not necessarily aware of such sharing. Distributed
systems appear to be simple, but they actually are not. Quite often, distributed systems allow parallelism, i.e.
they find out whether a program can be segmented into different tasks which can then be run simultaneously
on different machines. On top of it, the Operating System must hide the hardware differences which exist
in the different computers connected to each other. Normally, distributed systems have to provide for a high level
of fault tolerance, so that if one computer is down, the Operating System can schedule the tasks on the other
computers. This is an area in which substantial research is still going on. This clearly is the future direction
in the Operating System technology.
In the last few years, new versions of the existing Operating Systems have emerged, and have actually
become quite popular. Microsoft has released Windows 2000, which is technically Windows NT Version
5.0. Microsoft had maintained two streams of its Windows family of Operating Systems – one was targeted
at the desktop users, and the other was targeted at the business users and the server market. For the desktop
users, Microsoft enhanced its popular Windows 3.11 Operating System to Windows 95, then to Windows 98,
Windows ME and Windows XP. For the business users, Windows NT was developed, and its Version 4.0 had
become extensively popular. This meant that Microsoft had to support two streams of Windows Operating
Systems – one was the stream of Windows 95/98/ME/XP, and the other was the Windows NT stream. To bring
the two streams together, Microsoft developed Windows 2000, and it appears that going forward, Windows
2000 would be targeted at both the desktop users, as well as the business users.
On the UNIX front, several attempts were made to take its movement forward. Of them all, the Linux
Operating System has emerged as the major success story. Linux is perhaps the most popular UNIX variant
at the time of going to press. The free software movement has also helped Linux become more and
more appealing.
Consequently, today, there are two major camps in the Operating System world: Microsoft Windows 2000
and Linux. It is difficult to predict which one of these would eventually emerge as the winner. However, a
more likely outcome is that both would continue to be popular, and continue to compete with each other.
In a batch system, the user enters input on punched cards. The input collected is then read onto a magnetic tape
using a computer such as the IBM 1401. These computers were good at performing tasks such as reading cards,
copying tapes and printing outputs.
In a batch system, the user does not interact with the computer directly. The user submits the job to the operator,
and the operator collects jobs from various users. Thus, the operator prepares a batch of jobs. Programmers
also leave their programs with the operator. A batch of similar jobs or similar programs would be processed
when computer time was available. After the execution of a job or program was complete, the output would be sent to the
appropriate user/programmer.
In batch system execution, CPU utilization is poor and the CPU is often idle, because most computing
jobs involve I/O operations. In the early days, I/O operations required a lot of mechanical action, and
mechanical movements are much slower than electronic devices.
Real-time systems are used when time is a critical factor. There are certain systems in the world with a
rigid requirement on time: the execution of a task must be finished within a specific time period for the entire
system to work; otherwise, the whole system will fail. Unlike batch systems, the input to real-time systems
comes directly and immediately from the users/systems, and real-time systems are capable of analyzing and
processing the data. In real-time systems, time is a critical factor and such systems are designed to execute
within a certain time period, whereas batch systems are not time dependent.
There are two types of real-time systems. Hard real-time systems are very restrictive in terms of time
constraints; critical operations always have high priority over other tasks. Soft real-time systems are less
restrictive in terms of time constraints; a critical real-time task gets priority over the other tasks, and when
the high-priority tasks are finished, the other operations start executing once again.
Computer architecture is a very vast subject and
cannot be covered in great detail in a small chapter
in a book on Operating Systems. However, no
book on Operating Systems can be complete unless it touches
upon the subject of computer architecture. This is because
Operating Systems are intimately connected to the computer
architecture. In fact, the Operating System has to be designed,
taking into account various architectural issues.
For instance, the Operating System is concerned with the
way an instruction is executed and the concept of instruction
indivisibility. The Operating System is also concerned
with interrupts: what they are and how they are handled.
The Operating System is concerned with the organization
of memory into a hierarchy, i.e. disk, main memory, cache
memory and CPU registers. Normally at the beginning
of any program, the data resides on the disk, because
the entire data is too large to be held in the main memory
permanently. During the execution of a program, a record
of interest is brought from the disk into the main memory.
If the data is going to be required quite often, it can be
moved further up to the cache memory if available. Cache
can be regarded as a faster memory. However, no arithmetic
or logical operations, such as add or compare, or even data
movement operations can be carried out unless and until the
data is finally moved from the memory to the CPU registers.
This is because the circuits to carry out these functions are
complex and expensive. They cannot be provided between
any two memory locations randomly. They are provided only
for a few locations which we call CPU registers. The circuits are actually housed in a unit called Arithmetic
and Logical Unit (ALU) to which the CPU registers are connected as we shall see.
The point is: who decides what data resides where? It is the Operating System which takes this important
decision of which data resides at what level in this hierarchy. It also controls the periodic movements between
them. The Operating System takes the help of the concept of Direct Memory Access (DMA) which forms
the very foundation of multiprogramming. Finally, the Operating System is also concerned with parallelism.
For instance, if the system has multiple CPUs (multiprocessing system), the philosophy that the Operating
System employs for scheduling various processes changes.
The Operating System, in fact, makes a number of demands on the hardware to function properly. For
instance, if the virtual memory management system has to work properly, the hardware must keep track of
which pages in a program are being referenced more often/more recently and which are not, or which pages
have been modified.
We will present an overview of computer architecture, limited to what a student of Operating Systems
needs to be aware of. As we know, the hardware and software of a computer are organized in a number of
layers. At each layer, a programmer forms a certain view of the computer. This, in fact, is what is normally
termed the level of a programming language, which implies the capabilities and limitations of the
hardware/software of the system at a given level. This structured view helps us to understand various levels
and layers comprehensively in a step-by-step fashion. For instance, a manager who issues a ‘5GL’ instruction
to ‘produce the Sales Summary report’ does not specify which files/databases are to be used to produce this
report or how it is to be produced. He just mentions his basic requirements. Therefore, it is completely non-
procedural. A non-procedural language allows the user to specify what he wants rather than how it is to be
done. A procedural language has to specify both of these aspects.
A 4GL programmer (e.g. a person programming in ORACLE or SYBASE) has to be concerned about
which databases are to be used, how the screens should be designed and the logic with which the
sales summary is to be produced. Therefore, a 4GL program is not completely non-procedural,
though almost all vendors of the so-called 4GLs claim that they are. As of today, the 4GLs are in between
completely procedural and completely non-procedural languages. Today’s 4GLs have a lot of non-procedural
elements built into them. For instance, they can have an instruction to the effect ‘Print a list of all invoices
for all customers belonging to the state XYZ where the invoice amount is > 500; the list should contain the
invoice number, the invoice amount and the invoice date’.
A 3GL program is completely procedural. COBOL, FORTRAN, C and BASIC are examples of
3GLs. In these languages, you specify in detail not only what you want, but also how it is to be
achieved. For instance, the same 4GL instruction described in Sec. 2.2 could give rise to the 3GL
program carrying out the following steps:
1. Until it is end of invoice file, do the following:
2. Read an invoice record.
3. If the invoice amount <= 500, bypass the record; go to the next one; else proceed further.
4. Extract the customer code from the invoice record.
5. Access the customer record for that customer code by making a database call.
6. Extract the state from the customer record.
7. If state is not = “XYZ”, bypass that invoice record and go to the next one; else proceed further.
8. If this is a desired record, extract invoice number, date and amount from the invoice record.
9. Print a line to the report.
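As an illustration, here is a minimal C sketch of the same procedural logic. The record layouts, the sample data and the helper find_customer() are purely hypothetical; they only serve to show the step-by-step nature of a 3GL program.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical record layouts, only to illustrate the procedural steps of a 3GL program. */
    struct Invoice  { const char *number; const char *cust_code; const char *date; double amount; };
    struct Customer { const char *code; const char *state; };

    static const struct Customer customers[] = {
        { "C001", "XYZ" },
        { "C002", "ABC" },
    };

    static const struct Invoice invoices[] = {
        { "INV-1", "C001", "2010-01-05", 750.00 },   /* selected: amount > 500, state XYZ */
        { "INV-2", "C001", "2010-01-09", 120.00 },   /* bypassed: amount <= 500           */
        { "INV-3", "C002", "2010-01-12", 900.00 },   /* bypassed: customer not in XYZ     */
    };

    /* Step 5: access the customer record for a given customer code. */
    static const struct Customer *find_customer(const char *code)
    {
        for (size_t i = 0; i < sizeof customers / sizeof customers[0]; i++)
            if (strcmp(customers[i].code, code) == 0)
                return &customers[i];
        return NULL;
    }

    int main(void)
    {
        /* Step 1: until it is end of invoice file, do the following. */
        for (size_t i = 0; i < sizeof invoices / sizeof invoices[0]; i++) {
            const struct Invoice *inv = &invoices[i];                     /* Step 2: read an invoice record. */
            if (inv->amount <= 500)                                       /* Step 3: bypass small invoices.  */
                continue;
            const struct Customer *cust = find_customer(inv->cust_code);  /* Steps 4 and 5.                  */
            if (cust == NULL || strcmp(cust->state, "XYZ") != 0)          /* Steps 6 and 7.                  */
                continue;
            /* Steps 8 and 9: extract invoice number, date, amount and print a line to the report. */
            printf("%-8s %-12s %10.2f\n", inv->number, inv->date, inv->amount);
        }
        return 0;
    }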
It can be seen that one 4GL instruction can give rise to a number of 3GL instructions. This is the reason
why a 4GL is considered to be at a ‘higher’ level vis-a-vis 3GL. Despite this, a 3GL, such as COBOL or
C, also presents a level of abstraction to the programmer. For instance, in COBOL you can move a record
of 2000 bytes from one location in the memory to another location—in one instruction. You can carry out a
number of instructions iteratively by using a ‘while…do’ or ‘repeat…until’ construct. A 3GL also deals with
symbolic names which are English-like and therefore, easy to use. Therefore, data and paragraph names can
be easier to use and remember. This helps in the development and maintenance of software. In short, the 3GL
is still closer to the programmer than to the hardware. Therefore, if the hardware has certain limitations, it is still
possible to imagine that one could use multiple such lower level instructions to simulate one 3GL instruction.
This is also the reason why, on two different machines with different capabilities, you could write suitable
compilers for the same 3GL, so that it can work on both the machines. Therefore, when one defines a 3GL,
no specific computer hardware has to be kept in mind. In this sense, it is hardware independent.
A 2GL or Assembly Language (AL), on the other hand, is very close to the hardware, and is
therefore, restricted by the capabilities of the hardware. The assembly programmer knows and has to
know that a computer has a Central Processing Unit (CPU) which has an Arithmetic and Logical Unit
(ALU), a Control Unit (CU) and a few CPU registers. The CPU registers are like any other memory location,
except that they are connected together with the ALU circuits to allow certain operations at assembly
programming level. For instance, there are normally AL instructions to add two CPU registers, producing the
results in a third one, or to carry out logical operations such as ‘AND’ or ‘OR’ upon two registers producing a
third one, or to compare two CPU registers. These operations are not directly possible between two memory
locations.
If you have to compare two values in HLL, you could say, ‘IF Quantity-A = Quantity-B’. In ALs, normally
you have to move these quantities to two CPU registers and then compare. The hardware is constructed such
that a certain bit treated as a flag or a condition code in a flags register or a condition code register is set on
or off depending upon the result of the comparison. The condition code register also is one of the special CPU
registers. If two character strings of, say 100 characters each are to be compared, it is an easy task for a HLL
(or a 3GL) programmer. In AL, however, the strings have to be moved to the CPU registers word-by-word,
compared successively, storing the flag bit somewhere each time, and in the end, deciding, based on all the
flag bits, whether both the strings were completely matching or not. All these details are hidden from a HLL
programmer.
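As a rough illustration of what the HLL hides, the following C sketch compares two 100-character strings word-by-word (2 bytes at a time) through two 'register' variables, accumulating a flag for each comparison. The word width and the flag handling are only assumptions made for this sketch.

    #include <stdio.h>
    #include <string.h>

    /* Compare two strings the way an AL programmer would have to:
       word by word (2 bytes here), remembering a flag for each comparison. */
    static int strings_equal(const char *a, const char *b, int length)
    {
        int all_matched = 1;                                /* the accumulated flag bits   */
        for (int i = 0; i < length; i += 2) {
            unsigned short wa, wb;                          /* the two 'CPU registers'     */
            memcpy(&wa, a + i, 2);                          /* load a word of each string  */
            memcpy(&wb, b + i, 2);
            all_matched = all_matched && (wa == wb);        /* store this comparison's flag */
        }
        return all_matched;
    }

    int main(void)
    {
        char s1[100], s2[100];
        memset(s1, 'A', sizeof s1);
        memset(s2, 'A', sizeof s2);
        printf("%s\n", strings_equal(s1, s2, 100) ? "strings match" : "strings differ");
        return 0;
    }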
Data movement instruction in an assembly language can normally move only a certain number of bytes
at a time such as a word. In many ALs, you cannot move data directly between two memory locations. You
have to copy the data from the source memory location to the CPU register by a load operation and then
copy that CPU register to the destination memory location by issuing a separate AL instruction such as store.
Therefore, in such cases, we will need to execute a few AL instructions a number of times repetitively, in a
loop, to simulate the HLL instruction to move a 2000 byte record from one place to another. In some more
advanced ALs, it is possible to have an instruction to move data between the memory locations directly.
In such cases, the hardware circuit itself keeps moving the data, word-by-word, from the source memory
address to the CPU register and then moves it subsequently each time to the corresponding words in the
target memory address. The AL programmer need not be aware of this, because his AL is more ‘powerful’.
However, regardless of the sophistication of the AL, in any computer, ultimately the data has to be internally
moved or routed through the CPU registers. It is only a question of the level of abstraction that the AL
provides to the programmer.
Similarly, an instruction in HLL such as “Compute A = (B*C) + (D*E)/F – G” does not have a corresponding
instruction in most of the ALs. The AL normally provides for simple arithmetic instructions such as to add,
subtract, multiply and divide two numbers, only one operation at a time. Also, normally, the operation is
allowed on either two CPU registers or one CPU register and another memory location (though internally,
the addition has to take place ultimately only between two CPU registers and therefore in such a case, the
hardware itself will have to bring the data from the memory location into another CPU register, before the
addition can actually be carried out). Therefore, depending upon the sophistication, abstraction or the ‘level’
within the AL, we will need different numbers of AL instructions to simulate the same HLL instruction
given above. As this job of translation of the 3GL instruction is done by the compiler, we will need different
compilers for machines with different architectures.
The same is true about the flow control instructions. AL does not have an instruction like “while ... do”
or “repeat ... until” or “if ... then ... else”. Therefore, it is not possible to write a structured program in AL
in a strict sense. AL provides for conditional and unconditional jump instructions which have to be used to
simulate the “if ... then ... else” and other constructs.
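To see how structured constructs map onto jumps, the following C sketch (purely illustrative) computes the same sum twice: once with a structured loop, and once with goto statements that mimic the conditional and unconditional jump instructions an AL programmer would have to use.

    #include <stdio.h>

    int main(void)
    {
        int i, sum;

        /* Structured, 3GL-style version. */
        sum = 0;
        for (i = 1; i <= 5; i++)
            sum = sum + i;
        printf("structured: %d\n", sum);

        /* The same logic written with jumps, as an AL programmer would have to. */
        sum = 0;
        i = 1;
    loop_test:
        if (i > 5)          /* conditional jump out of the loop    */
            goto loop_exit;
        sum = sum + i;
        i = i + 1;
        goto loop_test;     /* unconditional jump back to the test */
    loop_exit:
        printf("with jumps: %d\n", sum);
        return 0;
    }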
The HLL or any 3GL allows for an instruction to Read or Write the data from/to an external device. These
instructions are called external data movement instructions. AL does not provide such direct instructions.
For these instructions, requests in the form of system calls have to be made to the Operating System. All
the I/O functions have to be carried out ultimately by the Operating System for reasons of complexity and
security. Therefore, corresponding to the HLL instruction,
“OPEN FILE-A”, the assembly programmer has to code the
instructions to load the CPU registers with the parameters (in
the case of OPEN, they will be file name, mode, access rights,
etc.). He then will have to issue an appropriate system call to
the Operating System.
The Operating System executes the system call using the
parameters in the predefined CPU registers. It then loads
some CPU registers with the outcome of the system call. This
could be in the form of the results or the error indicators, if
the system call has failed (e.g. the file to be opened was non-
existent). The assembly programmer has to test these CPU
registers for results/errors after the system call, and then take
the appropriate action. This can be shown diagrammatically as
shown in Fig. 2.1.
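On a present-day UNIX/Linux system, for example, the HLL-level OPEN is realised through the open() system call. The following minimal C sketch (assuming a POSIX environment; this is not the system call of our hypothetical machine) issues the call and then tests the outcome, much as the assembly programmer tests the CPU registers after the call.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>

    int main(void)
    {
        /* Parameters of the request: file name and mode/access rights. */
        int fd = open("FILE-A", O_RDONLY);

        /* The outcome of the system call must be tested, exactly as the assembly
           programmer tests the CPU registers after the call. */
        if (fd == -1) {
            printf("OPEN failed: %s\n", strerror(errno));   /* e.g. the file was non-existent */
            return 1;
        }
        printf("OPEN succeeded, descriptor = %d\n", fd);
        close(fd);
        return 0;
    }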
A 2GL assembly language has an instruction “JSR” to jump to a subroutine and a corresponding “Return”
instruction. This is equivalent to a “Perform” in COBOL or “GOSUB” in BASIC. As soon as the “JSR”
instruction is encountered, the system “remembers” the point from where the program should continue after
returning from the subroutine. As soon as it encounters the “Return” instruction, it uses this “knowledge” to
return to that address. (Internally, it uses a stack memory/register to store these return addresses as we shall
see a little later.)
The 1GL corresponds to the machine language. There is a one-to-one correspondence between the
assembly and machine instructions. For example, take an assembly language instruction “ADD R0,
TOT” where ADD is an Operation Code (Opcode), R0 is a CPU Register, and TOT is a symbolic
name denoting, say, the total. This instruction is supposed to add the two numbers in R0 and TOT
respectively and deposit the result again in R0. There is a program called assembler which converts the
assembly language program into a machine language program. We will study this in the next section.
After the assembly language program is written, it has to be converted into a machine language program by
using another piece of software called the Assembler. The Assembler starts generating the instructions or
defining the data as the data/instructions are encountered, starting from 0 as the address, unless specified by
the “ORG” instruction.
Each machine instruction has a predefined format. In a simplified view, the format could be as shown in
Fig. 2.2 for our hypothetical computer.
In this case, the instruction consists of 16 bits divided into different parts as shown in the figure. If there
are 16 different opcodes such as “Add” that are possible, the 4 bits in the Operation Code (opcode) specify
the chosen opcode, e.g. 0000 could mean “Load”, 0001 could be “Store”, 0010 could be “Add” and 1111
may mean “Halt”.
The addressing mode specifies whether the address in the instruction is a direct address or an indirect
address. If it is direct (Mode = 0), it indicates that the memory location whose address is in the instruction
(the last field in Fig. 2.2) is to be operated upon (e.g. added to R0 in our instruction “Add R0, TOT”). If it is
indirect (Mode = 1), it means that the address in the instruction points to a memory location which does not
actually contain the data, but it, in turn, contains the address of the memory location to be operated upon. In
our example, we have assumed a direct address.
The register number specifies whether R0 (bit = 0) or R1 (bit = 1) is to be considered in the operation,
assuming that there are only two registers. (If there were four, for instance, we would need 2 bits to specify
this register number.)
The memory address is in binary. For instance, in our assembly instruction, “TOT” is a counter to accumulate
the totals. Let us assume that it is at memory location with address = 500. It will, therefore, be mentioned
as “0111110100” in the instruction, because this is the binary representation of 500 in decimal. Therefore,
our instruction “ADD R0, TOT” would be converted to “0010000111110100” in machine language. This is
because 0010 = ADD, 0 = Direct Addressing, 0 = R0 and 0111110100 = 500.
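The packing of these fields into the 16-bit instruction word can be reproduced with a few lines of C. The field widths follow Fig. 2.2 of our hypothetical machine; the function name encode() is ours.

    #include <stdio.h>

    /* Field widths of the hypothetical 16-bit instruction:
       4-bit opcode | 1-bit addressing mode | 1-bit register number | 10-bit address. */
    static unsigned encode(unsigned opcode, unsigned mode, unsigned reg, unsigned address)
    {
        return (opcode << 12) | (mode << 11) | (reg << 10) | (address & 0x3FF);
    }

    int main(void)
    {
        /* ADD R0, TOT : opcode ADD = 0010, direct mode = 0, register R0 = 0, TOT at address 500. */
        unsigned word = encode(0x2, 0, 0, 500);

        /* Print the word bit by bit; it comes out as 0010000111110100. */
        for (int bit = 15; bit >= 0; bit--)
            putchar((word >> bit) & 1 ? '1' : '0');
        putchar('\n');
        return 0;
    }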
This is how an assembly language (2GL) instruction would be converted into a machine language program
by an assembler in a predefined instruction format. The machine language program would look as shown in
Fig. 2.3. As the figure shows, one HLL instruction is equivalent to many assembly instructions. However,
there is one and only one machine instruction for each assembly instruction.
In our example, each of these instructions occupies 16 bits or 2-bytes. The assembler starts defining data
or generating instructions from 0 as the starting address, unless specified by the “ORG” statement. Knowing
the length of each data item defined in the program, it can arrive at the addresses of the “next” data item. This
is how, it arrives at the addresses of all the data items. This is also how the symbol table is built using the data
names and their addresses. For the instructions, knowing the length of each instruction based on the opcode
and the type of instructions, it can calculate the starting address of each instruction as well. If the instruction
has a label, the assembler adds the label as well as the instruction addresses into the symbol table.
The assembler can be written as a one-pass or two-pass assembler, though it normally is designed as a
two-pass one. In this scheme, the assembler goes through the assembly program and converts whatever it can
convert to the machine language. It also generates addresses for the data items as well as the instructions,
and builds a symbol table. It cannot generate addresses for any forward references during the first pass, e.g.
a PERFORM or a GOSUB statement with label yet to be encountered. In the second pass, it resolves these
remaining addresses and completes its task. This is possible because, by the end of the first pass, all these forward addresses will have been generated and entered in the symbol table.
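The following is a hedged C sketch of how such a two-pass assembler might maintain its symbol table; the names and the structure are only illustrative, not those of any real assembler. Pass one records the addresses of labels and data names; pass two resolves forward references by looking them up in the now complete table.

    #include <stdio.h>
    #include <string.h>

    struct Symbol { char name[16]; unsigned address; };

    static struct Symbol symtab[100];
    static int nsymbols = 0;

    /* Pass 1: every time a label or data name is met, record it with its address. */
    static void define_symbol(const char *name, unsigned address)
    {
        strncpy(symtab[nsymbols].name, name, sizeof symtab[nsymbols].name - 1);
        symtab[nsymbols].address = address;
        nsymbols++;
    }

    /* Pass 2: forward references are resolved by looking the name up in the completed table. */
    static int lookup_symbol(const char *name, unsigned *address)
    {
        for (int i = 0; i < nsymbols; i++)
            if (strcmp(symtab[i].name, name) == 0) { *address = symtab[i].address; return 1; }
        return 0;
    }

    int main(void)
    {
        unsigned addr;
        define_symbol("TOT",  500);   /* data item defined at word 500        */
        define_symbol("LOOP", 700);   /* label on the instruction at word 700 */
        if (lookup_symbol("LOOP", &addr))
            printf("forward reference to LOOP resolved to %u\n", addr);
        return 0;
    }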
This assembled machine language program can be stored as a file on the disk. A program written in any
HLL or a 3GL requires a compiler to generate the machine code. In this case, typically, multiple instructions
are generated for each instruction in the HLL. Therefore, there is no one-to-one correspondence between
the HLL and machine language, unlike in the case of AL. Ultimately, any program has to be in the binary
machine format before it can be executed.
At the time of execution, the machine language program has to be brought into the memory from the disk and
then loaded at the appropriate locations. This is done by a piece of software called loader. The loader consults
the Memory Management Unit to find out which memory locations are free to accommodate this program.
Therefore, the actual starting address of the program could be 1000 instead of 0. This means that each address
in every instruction of this program has to be changed (relocated) by adding 1000 to it. This process, known
as relocation, can be done either in the beginning only before the actual execution (static relocation), or
at the run time for each instruction dynamically (dynamic relocation). Therefore, the assembly/machine
language view, or the perspective of the computer architecture is as follows:
(i) The computer consists of the CPU and the main memory.
(ii) The CPU consists of the Arithmetic and Logical Unit (ALU), the Control Unit (CU) and the CPU
registers.
(iii) All arithmetic operations are performed within the ALU. Therefore, no arithmetic operations are
possible between the two memory locations directly, let alone on the disk itself. If one wants to add
two numbers in the memory, these two numbers have to be brought to the CPU registers first and then
added. One could imagine an assembly instruction to add the contents of the two memory locations
directly, but internally, only the ALU can add the data into two CPU registers.
(iv) The same is true in the case of data movement instructions. If data has to be moved from any memory
word to any other memory word, the connections between the memory locations would be very
complex.
Data is therefore, moved only between the CPU registers and the memory. Therefore, the data movement
between any two memory locations takes place in two steps:
l Load data from the source memory location to a CPU register (Load)
l Store data from the CPU register to the target memory location (Store)
If 2000 bytes in a record are to be moved from a source memory address to the target memory address, the
operation actually internally takes place as 2000 operations, if 8 bits or 1 byte is transferred at a time. This
is the case when the CPU registers and the data bus are 8 bits wide. If the data bus and the CPU registers are 16-bit wide, it will take only 1000 operations to move 2000 bytes, as 2 bytes or 16 bits can be moved in each operation. In a machine with a data bus of 32 bits, it will take only 500 operations. Therefore, the compiler of any HLL will generate the appropriate number of machine instructions for such an instruction.
This will also clarify why a machine with a data bus of 32 bits is faster than the one with the data bus of 8
bits only. However, in all these cases, each operation consists of two suboperations (Load and Store), as given
above. This picture is true, even if the assembly language has an instruction to move a block of bytes between
memory locations (Block Move Instruction). Internally, it has to be executed as a number of operations and
suboperations in a loop until all the bytes are transferred. This loop has to be managed by the hardware itself,
thereby making the hardware architecture more complex and expensive, as compared to the one where only
one word can be moved at a time and therefore, the assembly programmer has to essentially set up this loop.
There is, therefore, a trade-off in this choice. However, with the hardware becoming more and more
powerful and yet cheaper and cheaper, providing a block move instruction at the assembly language level is
easily possible.
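The following C sketch mimics, at a purely logical level, how a block move is internally realised as a loop of load and store suboperations through a single 'register' variable; the 16-bit word width and the names are assumptions made only for this sketch.

    #include <stdio.h>

    #define WORD_BYTES 2          /* assume a 16-bit data bus: 2 bytes move per operation */

    /* Move 'nbytes' bytes from src to dst, one bus-width at a time, via a CPU 'register'. */
    static void block_move(unsigned char *dst, const unsigned char *src, int nbytes)
    {
        int operations = 0;
        for (int i = 0; i < nbytes; i += WORD_BYTES) {
            unsigned short reg;                                       /* the intermediate CPU register */
            reg = (unsigned short)(src[i] | (src[i + 1] << 8));       /* Load suboperation             */
            dst[i]     = (unsigned char)(reg & 0xFF);                 /* Store suboperation            */
            dst[i + 1] = (unsigned char)(reg >> 8);
            operations++;
        }
        printf("moved %d bytes in %d load/store operations\n", nbytes, operations);
    }

    int main(void)
    {
        static unsigned char source[2000], target[2000];
        block_move(target, source, 2000);      /* prints: 1000 operations for a 16-bit bus */
        return 0;
    }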
The only deviation from this trend is the advent of Reduced Instruction Set Computer (RISC) machines, where only a limited set of simple instructions is provided. However, a detailed discussion of RISC is
beyond the scope of the current text.
This is the level at which we want to study the actual computer architecture. Certain CPU registers known
as General Purpose Registers (e.g. R0, R1, ..., Rn) are accessible to the assembly language programmer. In
addition, the CPU has a number of ‘hidden’ registers as shown in Fig. 2.4 which shows the architecture for
our hypothetical computer with the instruction format shown in Fig. 2.2.
A computer consists of three buses as listed below:
l Data Bus (External and Internal)
l Address Bus
l Control Bus
The external data bus connects the memory with a CPU register called Memory Buffer Register (MBR)
or Memory Data Register (MDR). The internal data bus connects all the CPU registers, including MBR.
This implies that if any data has to be loaded from the memory to any CPU register, say R0, it has to be
brought into MBR first through the external data bus and then moved to a CPU register such as R0 using
the internal data bus as shown in Fig. 2.9. The store operation stores a CPU register into a specified memory
location. This operation also takes place in the two substeps as given above except that it will be in the reverse
direction. All these data movements can be achieved by manipulating the switches S0, S1, etc. shown in the
figure.
In some architectures, the external and internal data buses are combined into a single data bus, in which
case, theoretically, there is no need for a register such as MBR. This is because, in this case, data from any
memory location can be moved to any CPU register directly and vice versa without the mediation of MBR.
However, we will assume the presence of MBR and two separate data buses, as shown in the figure.
The address bus carries the address of the memory location to be accessed. If the data bus is 16-bit wide,
it will carry a word of 2 bytes at a time. Therefore, the words in the memory will have addresses 0,1,..., n
where n = 1/2 of the total number of bytes in the memory. This is what we mean by addressable location in
our discussion. If there are 1024 locations (numbered as 0 to 1023 in the figure) to be accessed, the address
bus will need to have 10 wires to carry 10 bits in the address. This is because 2^10 = 1024. There is a CPU register called Memory Address Register (MAR) which contains the address of the memory location
to be accessed. Therefore, in our example, MAR will consist of 10 bits. The address bus connects the MAR
and a Memory Address Decoder.
This decoder has 10 inputs from the address bits of the MAR. These address bits act as control signals to
the decoder. This decoder has 1024 output lines, one going to each memory location separately, as shown in
the figure. This is the reason why this decoder is called a 10 × 1024 decoder. Depending upon the binary value
of the address, the corresponding memory location is activated, e.g. if the address is 0000001010, then the
location number 10 (in decimal) will be activated. There is a read/write control signal shown for the memory.
This signal decides the direction of the data movement, i.e. from the main memory to the CPU registers (as
used in the load operation) or from the CPU registers to the main memory (as used in the store operation).
Therefore, to move the data from memory location 11 (decimal) to register R1, the following steps will
need to be taken:
(i) MAR is loaded with 0000001011 (11 in decimal).
(ii) A “Read” signal is given to memory (equivalent to ‘load’).
(iii) The data is loaded into MBR by manipulating switch S13.
(iv) The data is moved from MBR to R1 by manipulating S1 and S5.
This will clarify how the system works together. Each of these small steps is called a microinstruction.
The control bus carries all the control signals.
Of these, only the general purpose registers R0 to Rn are accessible to the assembly language programmer,
and are therefore, visible. All others are not accessible and they are hidden from the programmers. They are
only internally used during the execution of instructions.
We will now study the functions of these various registers.
The Memory Address Register (MAR) contains the address of the memory location which is activated and accessed for both the LOAD and STORE instructions, as discussed earlier. The Memory Buffer Register (MBR) stores the data temporarily, before it is transferred to/from the desired memory location.
The Program Counter (PC) contains the address of the "next" instruction to be executed. Therefore, when
the program starts executing, the PC contains the address of the first executable instruction. This address, as
calculated by the compiler, is stored in the header of the executable file created by the compiler itself at the
time of compilation. When the program execution begins, this header of the executable file is consulted and
the PC is loaded with that value.
This is the reason why the computer begins at the first executable instruction in the program (e.g. the first
instruction is the JUMP to PROCEDURE DIVISION in COBOL). When any instruction is executed, the PC
is incremented by 1 so that it points to the next instruction. In a machine where the instruction length is more
than 1 (word), it is incremented by that number. This cycle is broken only by the “JUMP” instruction. The
jump instruction specifies the address where the control should be transferred. At this juncture, this address
is moved into the PC. This is how the jump instruction is executed.
The Stack Pointer (SP) is another special register. Most computers make use of stacks. A stack is a data structure which has a Last
In First Out (LIFO) property. It works as a stack of books. When you add a new book to the stack, you
add it on the top of the stack. When you remove it, you normally remove the one that was added the latest
(LIFO). A number of memory words are organized in a similar fashion which can therefore be termed as a
stack. At any point, some of these words contain some stored data, while others are empty. The stack pointer
provides an address of the first free entry where a new element can be copied from a CPU register by a
“PUSH” instruction. After this instruction, the SP is incremented to reflect the change. The SP then points
to the new free entry. Another instruction "POP" does the reverse of "PUSH". It copies into the CPU register an element from the stack whose address is given by the SP, and then decrements the SP. The stack is a very useful data structure to implement nested subroutines. When you branch to a series of subroutines one after the other in a nested fashion, the return addresses can be "PUSHed" onto the stack. On returning, they can be "POPed" in the LIFO order, which is necessary if the nested subroutines are to be executed properly.
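A minimal C sketch of such a stack and its PUSH/POP operations (growing upwards, with the SP pointing to the first free entry, as described above) could look as follows; the stack size and the values pushed are arbitrary.

    #include <stdio.h>

    #define STACK_SIZE 16

    static unsigned stack[STACK_SIZE];   /* the memory words organised as a stack          */
    static int sp = 0;                   /* Stack Pointer: index of the first free entry   */

    static void push(unsigned value)     /* copy a value onto the stack, then increment SP */
    {
        if (sp < STACK_SIZE)
            stack[sp++] = value;
    }

    static unsigned pop(void)            /* step the SP back and copy the value out again  */
    {
        return (sp > 0) ? stack[--sp] : 0;
    }

    int main(void)
    {
        /* Return addresses of nested subroutine calls are pushed in call order ...        */
        push(702);
        push(815);
        /* ... and popped in the reverse (LIFO) order on the corresponding returns.        */
        printf("%u\n", pop());           /* prints 815 */
        printf("%u\n", pop());           /* prints 702 */
        return 0;
    }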
The General Purpose Registers, R0 through R7, can be addressed in various assembly instructions. R0 to R7 are 8 such registers, as shown in Fig. 2.4. In actual practice, there may be 1, 4, 8, 16 or 32
such registers. When there is only one such register, it is popularly called an Accumulator in the literature
of microprocessors.
The Instruction Register (IR) holds an instruction before it is decoded and executed. Therefore, to execute
an instruction, the instruction whose address is given by “PC” has to be “fetched” from the memory into
the IR. This constitutes the ‘fetch cycle’. This is followed by the ‘execute cycle’, in which the instruction is
decoded and executed.
TEMP0, TEMP1 and TEMP2 are registers within the ALU which hold the data temporarily. TEMP0 and TEMP1 are input registers into the ALU, and TEMP2 is an output register from
the ALU containing the results from any operation. Therefore, when you want to add any two numbers, they
have to be brought to TEMP0 and TEMP1 and the “ADD” circuit in the ALU has to be activated. After the
addition, TEMP2 will contain the result. It now can be moved to any other CPU register or a memory location
as required, through the internal data bus. What we have described is only the architecture of a hypothetical computer. In reality, an architecture could be different, though, in principle, it has to be similar.
The ALU can be logically seen as consisting of different circuits—one for each instruction as shown in
Fig. 2.5. For instance, if a computer has 16 different instructions such as ADD, SUB, MUL, DIV, LDA, STA,
etc. then we can imagine a different circuit for each of these instructions. How does a specific circuit get
activated? The answer lies in a decoder circuit, which we will study next.
We must remember that this is only a logical view. In practice, the subtraction circuit is derived from only
a slight modification of the adder circuit instead of having separate circuit for each instruction. In fact, the
whole idea behind using 1's complement and 2's complement in binary subtraction is to be able to convert a
subtraction operation into an addition operation somehow, so that the same circuit can be used with only a
slight modification in the form of an additional signal. Depending upon the op code (ADD or SUB), the ALU
decides whether to apply the additional signal or not. At a physical level, the circuit remains essentially the
same. However, we will treat them as separate circuits at a logical level for better comprehension. The op
code decides which of these circuits is to be activated as discussed later.
Figure 2.4 also shows various switches S0, S1, etc. This again is a logical representation. In reality, you use
Tri State Buffers (TSBs) as these switches. The idea here is to separate the bus and the registers, making the data transfer from a register to the bus or vice versa possible by manipulating these switches or TSBs.
They also act as the latches, and therefore, the data once deposited in a register does not get mixed up with
the signals or the data in the bus.
We have encountered a 10 × 1024 memory address decoder circuit before. We now study the instruction
decoder circuit which is basically similar.
The decoder circuit has n control signal lines and 2^n output signal lines, and depending upon the value of
the control signal lines in binary terms, it chooses or selects one of the output signal lines.
For example, Fig. 2.5 shows the instruction decoder circuit. Let us imagine that the opcode has 4 bits in
an instruction. Therefore, 4 bits can produce maximum 16 combinations ranging from 0000 to 1111. Each
of these combinations corresponds to one opcode or instruction, e.g. 0000 = LOAD, 0001 = STORE, 0010
= ADD.
When any instruction arrives in the IR, the opcode bits from the IR are separated out and they form
the control signals to the 4 × 16 instruction decoder. This means that it has 4 input control signal lines and
16 output lines—one for each opcode, as shown in Figs 2.4 and 2.5. Depending upon the opcode, the
appropriate circuit is activated and the operation takes place using the registers TEMP0, TEMP1 and TEMP2.
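The behaviour of such a decoder can be mimicked in a few lines of C. Here, the 4 × 16 instruction decoder is modelled as a function whose output is an array of 16 lines, exactly one of which is activated; this is only a logical model, not a circuit description.

    #include <stdio.h>

    /* A 4 x 16 decoder: 4 control bits select exactly one of the 16 output lines. */
    static void decode(unsigned opcode, int output[16])
    {
        for (int line = 0; line < 16; line++)
            output[line] = (line == (int)(opcode & 0xF));
    }

    int main(void)
    {
        int lines[16];
        decode(0x2, lines);              /* opcode 0010 = ADD in our coding scheme */
        for (int i = 0; i < 16; i++)
            if (lines[i])
                printf("output line %d activated (the ADD circuit)\n", i);
        return 0;
    }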
The machine cycle means the steps in which the instruction is executed. It consists of two subcycles: Fetch
Cycle, and Execute Cycle. We will now discuss these with reference to Fig. 2.4.
In the fetch cycle, the instruction is brought into the IR from the memory. This is the
instruction whose address is in the PC. Therefore, the fetch cycle consists of the following steps:
(a) Move PC to MAR by manipulating (opening/closing) switches S2 and S0.
(b) Give the ‘Read’ signal.
(c) Manipulate switch S13 to deposit the instruction in MBR.
(d) Move MBR to IR by manipulating switches S1 and S9.
(e) Increment PC by 1.
Let us say that we have an instruction “ADD R0, TOT”, as we have seen in
Sec. 2.5.2. Let this instruction reside at location (word) number 700, which is 1010111100 in binary. (We
have assumed a 10-bit address in our example.) We have assumed that the variable TOT is defined at location
or word 500 which is 0111110100 in binary. We have also assumed the coding scheme that ADD = 0010 and
the register bit in the instruction for R0 = 0. We have used a direct addressing mode. Therefore, the addressing
mode bit in the instruction also is =0. Therefore, the binary machine instruction will be 0010000111110100.
This is what will be produced by the assembler. Let this instruction be already loaded at the address 700
(=1010111100). Let us also assume that the PC has the same address (=1010111100).
This instruction is executed as follows (refer to Fig. 2.4):
(a) The PC is moved to MAR by manipulating Switches S2 and S0. MAR also now contains 700
(=1010111100).
(b) A ‘Read’ signal is given.
(c) The word at address 700 is selected and its contents are deposited into MBR by adjusting Switch S13.
(d) MBR is copied into IR by adjusting Switches S1 and S9.
(e) The PC is incremented by 1. This can be done by moving the PC to TEMP0, the number 1 (i.e.,
00000001 in binary) to TEMP1, carrying out the addition in the ALU and moving TEMP2 back to the
PC. This will require a number of sub-steps and manipulations of several switches. Alternatively, a
binary counter also could be used. We need not go into the details of how this is achieved.
The execute cycle then proceeds as follows:
(a) The instruction is decoded field-by-field. For instance, the register bit tells the hardware that R0 is
selected. Therefore, R0 is moved to TEMP0 by adjusting Switches S4 and S8.
(b) The addressing mode is interpreted as a direct address (Mode = 0).
(c) Therefore, the 10 bits of the address portion in IR are moved to MAR by manipulating Switches S9
and S0. (Therefore, MAR now contains 500 in decimal, or 0111110100 in binary.)
(d) A “Read” signal is given.
(e) The data item at memory word location 500 is now put on the data bus.
(f) The data is deposited in MBR by adjusting Switch S13.
(g) MBR is now copied into TEMP1 by adjusting Switches S1 and S10.
(h) The ‘opcode’ bits in the IR are used as control signals to the instruction decoder. For ‘ADD’ operation,
these bits are 0010. The appropriate output signal is generated from the instruction decoder and the
“ADD” circuit is activated.
(i) The addition now takes place. TEMP2 now contains the result.
(j) TEMP2 now is copied to R0 by adjusting Switches S11 and S4.
This completes the execution of the instruction.
What would have happened if an indirect address had been used instead of a direct address, as indicated by the mode bit? Though the fetch cycle would remain the same, the execution cycle would be a little
more lengthy. It would be as follows:
(a) R0 is moved to TEMP0 by adjusting S4 and S8 as before.
(b) The addressing mode is interpreted as an indirect address (Mode = 1).
(c) 10 bits of the address in IR are moved to MAR by adjusting Switches S9 and S0. MAR now contains
500 in decimal or 0111110100 in binary.
(d) A ‘Read’ signal is given.
(e) The contents of the word 500 are put onto the data bus. Let us say the memory word 500 had the value
100 in decimal or 01100100 in binary.
(f) The contents are deposited in MBR by adjusting Switch S13.
l The system knows that this is not the actual final data but, in fact, another address at which the
actual data resides. It knows this because of the mode.
l Therefore, MBR is moved to MAR by adjusting Switches S0 and S1. MAR now contains 01100100.
l The contents of the memory at the location 100 are deposited on the data bus, and loaded into
MBR by adjusting S13.
After this step, all other steps from (g) to (j) are the same as before in the case of direct addressing.
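The fetch and execute cycles just described can be summarised in a toy C simulation of our hypothetical machine. It models only the registers and the single ADD instruction of our example, handles both the direct and the indirect mode bits, and abstracts away the switches and buses; the memory contents chosen are arbitrary.

    #include <stdio.h>

    /* Registers and memory of the hypothetical machine (word = 16 bits). */
    static unsigned short memory[1024];
    static unsigned short PC, MAR, MBR, IR, R0;

    int main(void)
    {
        memory[500] = 25;                         /* TOT                                 */
        memory[700] = 0x21F4;                     /* 0010 0 0 0111110100 = ADD R0, TOT   */
        R0 = 10;
        PC = 700;

        /* Fetch cycle. */
        MAR = PC;                                 /* (a) PC -> MAR                       */
        MBR = memory[MAR];                        /* (b), (c) Read; the word -> MBR      */
        IR  = MBR;                                /* (d) MBR -> IR                       */
        PC  = PC + 1;                             /* (e) increment the PC                */

        /* Execute cycle: decode the fields of the instruction. */
        unsigned opcode  = (IR >> 12) & 0xF;
        unsigned mode    = (IR >> 11) & 0x1;      /* 0 = direct, 1 = indirect            */
        unsigned address = IR & 0x3FF;

        MAR = address;
        MBR = memory[MAR];                        /* read the addressed word             */
        if (mode == 1) {                          /* indirect: that word is itself an    */
            MAR = MBR;                            /* address, so one more memory cycle   */
            MBR = memory[MAR];
        }
        if (opcode == 0x2)                        /* 0010 = ADD                          */
            R0 = R0 + MBR;                        /* via TEMP0, TEMP1, TEMP2 in reality  */

        printf("R0 = %u\n", R0);                  /* prints: R0 = 35                     */
        return 0;
    }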
Our architecture allows for only direct and indirect addressing. Most architectures also allow for immediate addressing. In this case, the instruction contains the actual value instead of any address. Therefore, in our example, if an immediate address were used in the instruction "ADD R0, # 500",
the actual number 500 would have been added to R0. The “#” sign would have told the assembler that it is
an immediate address. The instruction format then would have to be different from the one in our example.
There may be 2 bits for the addressing mode and “10” would mean immediate addressing. With this, the
assembler would then produce the machine code. In this case, during the execute cycle, the address part of
the instruction is separated and directly moved to TEMP1. There is no need to refer to any memory location.
Consider now the JUMP instruction. The fetch cycle for this instruction is as explained earlier. The PC will have been already incremented during this phase. But this is of no use to us, as we actually want the program to jump to some other location.
During the execute cycle, the circuit for “JUMP” is activated. This essentially moves the address portion
of the instruction into the PC by adjusting S9 and S2. After this, the instruction whose address is given by the
PC is executed as usual, thus, in effect enabling the program to branch to a specific location. This completes
the execution of this instruction. If an indirect address is used in the jump instruction, it again goes through
the required memory cycles to fetch the final address and then moves it to the PC.
The “PUSH” instruction places the data item from a specific register, say R0, into a
memory location whose address is given by the Stack Pointer (SP) and then increments the SP. This takes
place as follows:
(a) R0 is moved to MBR by adjusting S4 and S1.
(b) SP is moved to MAR by adjusting S3 and S0.
(c) ‘Write’ signal is given.
(d) SP is incremented by 1.
This instruction also checks whether the SP is within the legitimate, permissible values, i.e. that a stack overflow has not taken place.
The "POP" instruction is just the reverse of PUSH. It takes place as follows:
(a) SP is moved to MAR by adjusting S3 and S0.
(b) ‘Read’ signal is given.
(c) The data from the stack is deposited into MBR by adjusting S13.
(d) MBR is copied into R0 by adjusting S1 and S4.
(e) SP is decremented.
Most architectures support the "JSR" (jump to subroutine) instruction to allow modular programming. In this case, you need to preserve the return address so that, after executing the subroutine, the main program continues from where it had left off. In the fetch cycle of this instruction, the PC will already have been incremented
by 1. This is the address where the main program should branch to after returning from the execution of the
subroutine. (Typically denoted by a “Return” instruction in a subroutine.) Therefore, this is the address given
by the incremented PC which needs to be stored. This is generally stored in the stack as seen earlier by the
PUSH instruction. However, in our example, the PUSH presumes that the contents of R0 have to be pushed
onto the stack. Therefore, before the PUSH instruction, PC will have to be moved to R0. The execution of
this takes place as follows:
(a) Move PC into R0 by adjusting S2 and S4.
(b) PUSH (R0 into stack).
We have seen the sub-steps for PUSH. Therefore, we need not repeat those here.
(c) Move the address of the subroutine in the instruction to the PC for branching.
We will notice that after pushing (storing in a stack) the return address, the hardware will jump to the
location where the subroutine is stored.
The "Return" instruction indicates the end of a subroutine. The main program should now be resumed from
where it was left off. The execution of this instruction takes place as follows:
(a) POP (from stack into R0)
We have seen the substeps of POP and therefore, they need not be repeated. After this, obviously, the
main program will continue where it had left off.
(b) Move R0 to PC by adjusting S4 and S2
A question naturally arises in our mind. Why do we store the return address in a stack? Why do we not
store it anywhere else, in any other memory location? The answer is: any other memory location also would
have been acceptable if there was no nesting of subroutines allowed. In such a case, you could store just one
return address and then retrieve it to return to the main routine. The problem arises when a subroutine calls
another subroutine, and so on. In such a case, the return addresses have to be stored in a LIFO fashion, if the
processing has to take place properly. That is the reason a stack is used, because LIFO is the basic property
of a stack.
Real computers support a number of other instructions, such as AND, OR, NOR,
XOR, etc. or COMPARE, Conditional jumps (Jump if equal, etc.). Detailed discussions of these are beyond
the scope of the current text, though it should now be fairly easy to imagine how they are executed.
Each of these sub-steps a, b, c etc. that comprise an instruction is called a microinstruction. In order to
execute one machine instruction, these microinstructions are executed one after the other automatically by
the hardware itself. This implementation is relatively faster but it is also more complex and less flexible.
This is the latest trend, because the increased complexity due to hardwiring is greatly reduced by simplifying the instruction set itself. The RISC computers have only a limited set of simple instructions. Therefore, the name: "Reduced Instruction Set Computer (RISC)".
An alternative to this, followed in the majority of the non-RISC (microprogrammed) machines over the years, was to have a Read Only Memory (ROM) in which a small program was etched for each assembly or
machine instruction as studied earlier in steps (a), (b), etc. Each machine instruction gave rise to multiple
microinstructions. Microinstructions were like any other instruction having their own formats and lengths
consisting of several bits, where each bit typically controlled a switch. For instance, a microinstruction 01001
could mean that Switch S0 is to be closed (0), S1 is to be opened (=1), S2 and S3 are to be closed (both 0) and
finally Switch S4 is to be opened (=1). When certain switches are manipulated, we know that data transfer
between various registers/memory can take place as we have seen earlier. Therefore, a specific substep can be
executed by a specific microinstruction, and a microprogram consisting of several microinstructions would, in
effect, execute one machine instruction. This is why we said that there is a microprogram for each machine
instruction. This is a simplistic view, but one which is not grossly incorrect.
Thus, in a microprogrammed computer, one can imagine microprograms—one for each machine
instruction permanently residing in the ROM. Depending upon the opcode bits, the appropriate microprogram
is activated. Thereafter, appropriate control signals are given and appropriate switches are adjusted to execute
a series of actions in the desired sequence.
At any time, if there is an interrupt and the current program has to be suspended or put aside,
the Context of a Program has to be saved. What is this context? As we know from the hierarchy of
languages, an instruction in a 4GL is equivalent to many instructions in a 3GL. A 3GL instruction
is equivalent to many instructions in a 2GL (assembly or 1GL machine language). The context is essentially
the values of all the registers plus various pointers to memory locations and files used by that program. When
the interrupted program is to be restarted, all these register values have to be reloaded from the Register Save
Area, so that the program can continue as if nothing had happened. The registers have to be saved, because,
the new program (Interrupt Service Routine or ISR, in this case) will be using all these registers internally
for the execution of its own instructions. Imagine the effect on the interrupted program, if the PC is not saved
for instance!
The concept of multiprogramming and multiuser Operating Systems is based on the hardware support of
interrupts. What is this interrupt? Interrupt is essentially the ability of the hardware to stop the currently
running program and turn the system’s attention to something else. When you are reading a book line-by-
line, and the doorbell rings, what happens? You normally adjust the book’s marker, put the book down and
go to open the door. In this case, your current program of ‘reading a book’ is interrupted and a new program
or procedure of ‘opening a door’ starts executing. You open the door and attend to the person (may be a
newspaperman or a milkman). This is similar to the Interrupt Service Routine (ISR) in computer jargon.
After you finish your ISR, you resume reading the book.
Imagine that you have opened the door, and are discussing something with the person at the door (i.e. you
are in the midst of the ISR). At this juncture, if the phone bell rings, you would normally leave the conversation
to attend to the phone with due apologies. This is similar to having interrupts within the interrupts or nested
interrupts. In the case above, we have assumed that attending a phone call is more important, and therefore,
of higher priority than completing the conversation with the person at the door which, in turn, is of higher
priority than reading a book. In the same way, in the computer hardware, there can be multiple interrupts and
one can set priorities to them so that, should they occur at the same time, the computer knows which one to take up first.
In our lives, we can disable interrupts at certain occasions. A manager busy in a meeting tells the secretary
to take all the calls and messages so that the meeting is not interrupted. In computer systems too, you can do
this. We call this masking a certain interrupt. How does the computer hardware do all this?
We have seen how a computer program works without the scheme of the interrupts. The PC points to an
instruction which is then fetched into the IR and executed, whereupon the PC is incremented to point to the
next instruction. When you have a scheme of interrupts, the following happens:
(i) The hardware contains some lines or wires carrying different signals to denote the presence and type
of an interrupt. In the simplest case, if the system allows for only one type of interrupt, only one line
can indicate the presence of the interrupt (e.g. signal high (=1) indicates interrupts, and signal 0 means
no interrupt).
(ii) You cannot predict the exact time when the interrupt will occur. This is obvious. If you could exactly
plan their timings, you would perhaps call them by a different name and tackle them differently.
(iii) The hardware is designed in such a way that the current instruction cannot be interrupted in the
middle. Only after the current instruction at the machine level is completed, the hardware can attend
to the interrupt. This is achieved by simply checking by the hardware itself whether an interrupt
has occurred or not (i.e. whether the interrupt line is high or not), only at the end of each machine
instruction and before the start of the next one. But there is a problem here. Let us assume that the
interrupt occurs halfway during the execution of an instruction. The interrupt line will indicate a high
signal. But for how long? By the time the machine instruction is executed, the signal may go low,
thereby, causing the interrupt to be lost. Therefore, there is a need to ‘store’ the interrupt somewhere
at least for a short duration by setting some bit in some register on!
(iv) It follows from (ii) and (iii) above that at least for a short duration, the interrupt has to be stored in
some register. If there is only one type of interrupt, this register could consist of just one bit. Otherwise, it
will have to be longer. This register is called Interrupt Pending Register (IPR).
(v) The system normally has a provision to ignore an interrupt for a certain duration. This is called
‘disabling’ or ‘masking’ the interrupt. This is achieved by another register called Interrupt Mask
Register (IMR). This is of the same length as IPR. In our case, it will be of only 1 bit, because we
have assumed only one type of interrupt.
The IMR is set to 0, if we want to mask the interrupt. It is set
to 1, if we want to enable it. The IMR and IPR are connected to
the AND circuit as inputs, and the output is collected in another
register called Interrupt Register. This is shown in Fig. 2.6.
Finally, this interrupt register is checked to identify the presence of
the interrupt.
Obviously, if IMR = 0, the Interrupt Register will remain 0 even
if IPR takes on a value of either 0 or 1. This is how ‘masking’ is
achieved. If IMR = 1, then the interrupt register will be the same
as IPR. Therefore, the interrupt register will be 1 if IPR = 1. This
is the way interrupts are enabled. The instructions to set/reset the IMR are privileged instructions. They are generally used only by the Operating System. (A small C sketch of this masking logic follows this list.)
(vi) After completion of every instruction, before fetching the next one,
the hardware itself checks the interrupt register and continues with
the fetch operation, if this register is 0.
(vii) If this interrupt register = 1, the hardware jumps to a preknown address where the Interrupt Service
Routine (ISR) (which is a part of the Device Driver portion of the Operating System) is kept. This is
depicted in Fig. 2.7.
(viii) While executing the instructions from the ISR, the status of the CPU registers will change, because,
the ISR itself will have to use these registers during its executions. Therefore, the values of these CPU
registers corresponding to the currently executing user program (e.g. after executing the instruction
number 2 “ADD ..... “ in Fig. 2.7) will be lost. Hence, to begin with, the ISR itself executes the
instructions to store the contents of the CPU registers. This procedure is called “Saving the context of
the process”. In some cases, this is done by hardware itself.
(ix) After saving the context of the process, the ISR now processes the interrupt. If our hypothetical
computer has only one type of interrupt, this processing will be for that interrupt type only. (For
instance, if it is due to the completion of an I/O, the ISR will check if the I/O has been carried out
successfully by the other modules of the Operating System and if yes, move the process for which the
I/O was completed to the ‘ready’ state. We will study this later.)
(x) After all the processing is over, the ISR returns control to the next instruction, i.e. the one with
number 3 in the interrupted process, e.g. “LDA .....” in Fig. 2.7. This is done, after restoring the values
of all the CPU registers and resetting the
interrupt pending register.
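The masking logic of step (v) above, promised earlier, reduces to a bitwise AND in the following C sketch; the assignment of interrupt types to particular bits is an assumption made only for this illustration.

    #include <stdio.h>

    /* One bit per interrupt type (illustrative): bit 0 = I/O completion, bit 1 = timer. */
    typedef unsigned char reg8;

    int main(void)
    {
        reg8 IPR = 0x03;              /* I/O completion and timer interrupts are pending    */
        reg8 IMR = 0x02;              /* only the timer interrupt is enabled; I/O is masked */

        reg8 interrupt_register = IPR & IMR;     /* the AND circuit of Fig. 2.6             */

        if (interrupt_register != 0)
            printf("interrupt pending and enabled: bits %02X -> branch to the ISR\n",
                   (unsigned)interrupt_register);
        else
            printf("nothing to service; fetch the next instruction\n");
        return 0;
    }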
This scheme is all right for a single type of
interrupt. However, in reality, interrupts can occur
due to multiple reasons, some of which are listed
below:
l Completion of an I/O operation.
l Power failure.
l Operator error.
If there is an interrupt indicated by any non-zero bit in the interrupt register as shown in Figs. 2.8 and 2.9,
then the ISR routine within the Operating System corresponding to that interrupt has to be executed. One
way to achieve this is by the Operating System actually finding out which interrupt has occurred through the
software routine. (A kind of table search to know which bit within the Interrupt Register is 1.)
Having found it, the Operating System can pick up the corresponding address of the ISR and branch to it.
This is obviously very slow. Another method, called vectored interrupt avoids this ‘software table search’. It
can directly branch to the correct ISR, because the hardware itself is made not only to signal the presence of the interrupt, but also to send the actual address of the ISR over hardware lines. This method is obviously
more efficient though more expensive. A detailed discussion on this is beyond the scope of the current text
and it is also not necessary.
If there are multiple interrupts, then there will be more than 1 bit set to 1 in the Interrupt Register, and
then the Operating System will have to service them one by one in a certain sequence. The sequence can be
predetermined by setting priorities to different interrupts. In fact, during the execution of an ISR, if another
interrupt occurs, it could be serviced on a priority basis. All this is possible, but again a detailed discussion
on this is unnecessary.
Note that a particular bit in the IPR can be set in two ways. One is by hardware (Hardware Interrupts),
which we have seen up to now. Another way is by executing a software instruction (Software Interrupt) to
set or reset a particular bit in IPR. These are obviously privileged instructions and can be used only by the
Operating System. Therefore, only the Operating System can set a particular bit in IPR to 1 to generate a
Software Interrupt. Irrespective of how it is set, the hardware must check the IPR before fetching the next
instruction. As we will learn later, if a process (i.e. running program) requests the Operating System to read
a certain data record on its behalf, the Operating System has to keep the process waiting until the desired
I/O is over. The Operating System has to ‘block’ that process. To do this, the Operating System has to cause
a software interrupt first and then branch to the corresponding ISR and change the status of that process to
“blocked”.
After the process is blocked, the Operating System supplies the DMA controller with the relevant details
of the data transfer and takes up another process for execution. The DMA completes the I/O and generates a
hardware interrupt setting a specific bit in IPR to 1. Remember that, at this time, a certain other process would
be executing. On encountering this interrupt caused due to the I/O completion of the first process, the CPU is
taken away from this second process too! When this interrupt is serviced, the ISR moves the currently running
(second) process into the list of ‘ready’ processes and then finds out which device has caused this interrupt
and the reason for the same (i.e. Data read is over). It then finds out for which process this ‘Read’ operation
was done. It then moves this process from blocked to the ready state so that it can now be scheduled.
There is another type of interrupt that devices can generate. This is the “Device Ready” interrupt. This is
generated by an idle device ready to accept any data. This is generated regularly at fairly short time intervals.
The ISR for that device generally checks if there is any data ready in the memory buffer for that device and
which should be sent to that device. If there is, the device I/O is initiated. If there is no waiting data, no device I/O is initiated, and the highest priority ready process is dispatched for execution.
Most of the Operating Systems implement the time sharing facility by using a timer clock, which generates
a signal over a line after a specific time interval which is a bigger slice than the intervals at which the device
generates the interrupt for “Device Ready”. This time slice for the timer can be externally programmed,
because the timer hardware has a register, and a machine/assembly instruction is available to set this register
to a specific value. This is obviously a privileged instruction. After setting this register to a specific value,
as the time progresses, the value in this register goes on decreasing. When this value becomes zero, a timer
interrupt is automatically generated.
This timer signal can be used to set on yet another bit in IPR to indicate the interrupt. On recognizing this,
the Operating System executes the ISR corresponding to the interrupt caused by this “time up”. After saving
the context of the then running process, the Operating System switches from that process to the next, after
changing the earlier running process to the ready state.
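The countdown behaviour of the timer can be pictured with a very small C sketch; this is only a software model, since in reality the countdown happens inside the timer hardware, and the bit position chosen for the timer interrupt is arbitrary.

    #include <stdio.h>

    int main(void)
    {
        unsigned timer_register = 5;      /* set by a privileged instruction to the time slice  */
        unsigned ipr = 0;                 /* interrupt pending register; timer bit = bit 1 here */

        while (timer_register > 0)        /* the value goes on decreasing as time progresses    */
            timer_register--;             /* each iteration stands for one clock tick           */

        ipr |= 0x02;                      /* value reached zero: a timer interrupt is raised    */
        printf("IPR = %02X: time up, switch to the next ready process\n", ipr);
        return 0;
    }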
Normally, when the Operating System is executing some routines (say, a high priority ISR), it does not want to be interrupted by any other, especially lower priority, interrupts. This can be ensured by adjusting
the IMR. The Operating System sets the IMR to the desired value first to disable certain types of interrupts,
completes the routines and then enables them again before exiting. All the interrupts that come in during
that period change the IPR, but do not take effect due to masking. This more or less completes the picture
of computer architecture which is necessary for our understanding of the subject of Operating Systems. It
should be noted that what we have described is a simple but not too unrealistic view of a practical computer.
Computers have primary memory and secondary memory. Primary memory is the only storage area that the CPU can directly access. It is also called Random Access Memory (RAM).
Primary memory contains several crores of bytes. These are arranged in the form of words. The
CPU uses two instructions named as store (meaning write) and load (meaning read) to access the contents
of the primary memory. Depending on the type of operation, both store and load may need access to specific
areas of the primary memory. These instructions either read the contents of primary memory into one of the
CPU registers, or read the contents of such a CPU register and transfer it into the appropriate address in the
main memory.
The most desirable situation is that all our required data and instructions be available in the primary
memory itself. However, there are constraints with reference to this aspect. Primary memory is expensive,
and hence cannot have infinite size. As such, its size is limited. Hence, we need some additional storage space,
and that is available to us in the form of secondary memory. The other factor that makes primary memory unsuitable for all kinds of storage is that it is made up of hardware that cannot retain data once the computer power is switched off. Therefore, its contents are volatile. (Such memory is commonly implemented as Dynamic Random Access Memory, or DRAM.) Hence, we need some storage that is more permanent in nature.
Naturally, the two main requirements of secondary memory are
1. Ability to hold large amounts of data
2. Ability to hold data even when computer power is switched off
The most common form of secondary memory is a hard disk. It is also called magnetic disk. It is used
to hold both data as well as programs.
We call the computer’s main memory the Random Access Memory (RAM). This is because any memory
address in the RAM can be accessed randomly. The time required to access, say, two different memory locations 1 and 1000, would be the same. That is why this access is called random.
The logical organization of each memory location is shown in Fig. 2.10.
We have noted that each memory location can hold a 0 or a 1 (i.e., a binary digit or bit). The read/write
operations on a location work as follows:
The memory location has an input data line, on which either 0 or 1 is placed, depending
on whether the location needs to hold 0 or 1. After this, the CPU activates the write control signal on the line
indicated in the diagram as Write. This means that either 0 or 1 (as specified in the Input line shown in the
diagram) would be written to the memory location.
For the read operation, the Read line of a memory location is sent a read control signal. As a result of this, the contents of this memory location (i.e., a 0 or 1) are read and are made available on the Output line shown in the diagram.
Here, we need to make two important
assumptions:
(a) Once a bit value is stored in a RAM location, it is not lost until the computer is powered off. That is, as long as the power is on, the RAM retains its contents.
(b) If we read the contents of a memory
location, the memory location would still
continue to hold the original contents even
after the completion of the read operation. That is, the read operation is nondestructive in nature.
By connecting memory locations with each other in different ways, we can organize the memory in a very
intelligent manner. We can interconnect the read-control lines of all the memory locations together. Similarly,
we can also interconnect the write-control lines of all the memory locations together. The concept is shown
in Fig. 2.11. Here, we have shown three memory locations just as an example.
For a write operation, the bit value to be written is provided into the appropriate input line. The write-
control signal is applied to the write line of the appropriate memory location, which causes the memory
location to contain the new bit value. The previous bit value in the memory location is automatically erased
as a part of this write operation.
For a read operation, a read-control signal is applied to the common Read line. The contents of the cell
appear on the appropriate Output line. As we have mentioned, the Read operation does not cause the contents
of the memory location to be erased.
When we connect memory locations to each other in such a manner, the group of memory locations is
called a register. Typically, a register stores 8, 16, 32, or 64 bits. In our diagram, we have a 3-bit register
(because it contains three memory locations). Once we connect many such registers with each other, we get
main memory or RAM. An example of such a memory organization is shown in Fig. 2.12.
We can see that there are three rows of registers. Like before, each row (i.e., each register) contains three
bits. Therefore, the capacity of this memory is 3 x 3 = 9 bits. What we have done is actually quite simple. We
have connected the input data lines of the first bit of all the three registers. Similarly, we have connected the
input data lines of the second and third bits of all the three registers. In exactly the same manner, we have also
connected the respective output lines of all the three registers.
There are three Write lines, and three Read lines. In other words, there is one Read line and one Write line
per register. Depending on from which register the data is to be read, or to which register it is to be written,
the specific Read/Write line is activated.
For example, if we want to read data from the second register, then we activate the Read line of the second
register, and give a read-control signal. The contents of the second register are immediately made available in
the output lines. Similarly, suppose that we want to write three bits, namely, 101 to the third register. In this
case, we set the input line to the value 101, and activate the Write line of the third register. As a result, the
existing contents of the third register get erased, and they are replaced with bit values 101.
There are some important points that we must note. At any given time, we can either read from or write to
a register. Also, we need to first identify a register from which we want to read or to which we want to write
something. How do we do that? For this purpose, each register is assigned a unique identifier. This identifier
is the memory address of the register. This address is a binary number. For example, in our example of three
registers, the addresses of the registers could be 00, 01 and 10, respectively. Thus, each address here consists
of two bits. In general, with n bits, we can address 2^n memory locations.
Whenever we want to read from or write to memory, the address of the memory location is specified in a
special register called the Memory Address Register (MAR). A special signal line is connected to MAR,
which selects the read/write control line, depending on the contents of the MAR and whether read/write
operation is specified.
The actual data read from or to be written to memory is held in another register called Memory Data
Register (MDR), also called Memory Buffer Register (MBR). During read operations, the result of the read
operation (i.e., the output) is placed in the MDR. Similarly, before a write operation executes, the contents
to be written to memory are put in the MDR first. From there, they go to the appropriate memory location.
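The following is a minimal sketch in C of the MAR/MDR handshake described above, purely for illustration. The array, the register names and the tiny 8-word memory are assumptions made for the example; they do not describe any real hardware.

    #include <stdint.h>
    #include <stdio.h>

    #define MEM_WORDS 8              /* a tiny memory: 3 address bits -> 2^3 = 8 words */

    static uint8_t memory[MEM_WORDS];    /* the RAM cells                    */
    static uint8_t MAR;                  /* Memory Address Register          */
    static uint8_t MDR;                  /* Memory Data (Buffer) Register    */

    /* Read cycle: the address is placed in MAR; the result appears in MDR.  */
    static void mem_read(void)  { MDR = memory[MAR]; }

    /* Write cycle: the address is in MAR and the data is already in MDR.    */
    static void mem_write(void) { memory[MAR] = MDR; }

    int main(void) {
        MAR = 5; MDR = 0xA5; mem_write();    /* write 0xA5 to location 5     */
        MAR = 5; mem_read();                 /* read it back through MDR     */
        printf("memory[5] = 0x%02X\n", MDR);
        return 0;
    }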
If we want to store large amounts of information that cannot fit in the main memory, we store it in files,
which reside on the secondary memory. When we are interested in any piece of information stored in a file, we open
the file and read it. In other words, we transfer information from one of the lowest level devices to one of
the uppermost ones. In fact, if we are to do any operation on the file, we must bring the record(s) of interest
from the secondary memory into the main memory and perform the appropriate job. For instance, in our
payroll example, if we were to increase the basic rate for all the employees by 10%, how would we do it? We
must read the file from the secondary memory one record at a time and bring it into the main memory. Once
a record is in main memory, we can make changes to it as desired. After the changes are over, we must write
it back to the secondary memory. The most interesting point of this discussion is, although we say that we
have made changes to a file stored in secondary memory, we do not realize that the changes have happened
via the main memory!
In terms of demands on the hardware, we can think of the following instructions if we are to automate the
payroll process by using a computer. Hence, we should be able to
• Store the data in main memory
The secondary memory consists of tapes, floppy disks, hard disks, etc.
These devices vary in terms of information storing mechanisms, speed and capacity. However, regardless of
these characteristics, all high level languages provide instructions such as READ and WRITE. We shall now
study how these instructions get executed. To start with, we shall study the effect of these two instructions.
Figure 2.13 shows the effect of the READ operation.
The Operating System can be viewed as a set of software programs normally supplied along with the
hardware for the effective and easy use of the machine. The two benefits that enhance its utility are:
• elimination of duplicate efforts by hundreds of application programmers in writing common routines
such as those for I/O, and
• provision of security and confidentiality of information to users.
Let us illustrate these benefits by an example.
We will learn later how the Operating System is
responsible for allocating/deallocating the storage area
on the disk for different files belonging to various users.
This function tends to avoid conflicting situations due
to two users writing on the same sector of the disk. At a
machine language level, there is normally an instruction
to write a certain number of bytes (say 512) from pre-
specified memory locations onto a specified sector of a
disk (assuming that a sector is of 512 bytes). What would
happen if other languages also had such an instruction
and a C/COBOL or Assembler programmer were allowed
to use this instruction freely?
Imagine that we have dozens of programmers in an MIS
department writing programs in C/COBOL for different
applications such as Payroll, Inventory and so on, and that
they are using the data on the same disk. As stated above, if C/COBOL had an instruction to write 512 bytes
of data onto any specified sector, there is a high probability, or almost a certainty, of one of the programmers
specifying a wrong sector and overwriting a sector allocated to somebody else’s or even one’s own file,
thereby, destroying the old contents. This is called the problem of Data Security.
The same problem exists in a single user environment too! In a single user machine such as an IBM-PC,
there is always a possibility of overwriting by mistake, on a sector allocated to the same user, but in a different
file, or even on a sector allocated to the same file but which contains some useful data and, therefore, which
is not free. Therefore, even in single user systems, there is a need for an arbitrator to keep track of all the files
and all the sectors allocated to them. There is also need to keep track of the free sectors to be allocated to new
files. This arbitrator is the Operating System.
Therefore, it is only the Operating System which is allowed to finally instruct the hardware to write
data from memory onto a pre-specified location on the disk. We will discuss later how this is achieved. The
Operating System provides for a number of ‘System Calls’ to perform various I/O functions such as Read,
Write, Open, Close, Seek, etc. The compiled program must contain these system calls (i.e. requests for the
Operating System services) to get these I/O tasks done.
Another problem is to ensure ‘confidentiality’. Imagine a situation where an ordinary Assembler or C/
COBOL programmer is allowed to give an instruction to “Read a sector from the disk into pre-specified 512
bytes of memory”. What will happen? An Inventory programmer, either by mistake or with deliberate intent,
will be able to read the sensitive and confidential payroll data, thereby encroaching upon privacy!
This is the reason why all Input/Output (I/O) operations for all the devices are delegated to the Operating
System by the user’s Application Program. All the I/O operations are also quite tedious, as we will learn
later. Therefore, once the Operating System developers write these programs or routines, each applications
programmer does not have to code them again and again, thereby, obviating tremendous amount of duplication
which would take place otherwise. The Application Program (AP) writes Read/Write instructions as usual
(e.g. “while (!feof(FP))” in C or “READ CUST-REC ... AT END ....” in COBOL), but the compiler generates
the appropriate system calls in their places and at the time of execution, the Operating System actually carries
out these instructions on behalf of the AP using the appropriate routines already developed. These routines
also ensure that the user is actually entitled to access the requested data. The Operating System provides
these services to the AP. This way duplication of efforts is avoided, in addition to providing security and
confidentiality.
Another benefit that accrues from this scheme is that the AP can think in terms of logical entities such as
‘Customer Record’ and can leave the Operating System to sort out the physical entities such as disk sectors.
The Operating System, at any moment, keeps track of all the free sectors and also all the sectors allocated to
various files of diverse users. The Operating System does the address translation to find out on which sectors
the specific desired ‘Customer Record’ is stored on the disk, and internally issues the machine instruction to
Read/Write those sectors.
The point is: The C/COBOL or any other HLL Programmer is provided with a logical instruction, “Read
a Customer-Rec into pre-specified memory locations (defined using the file pointer in the case of C, or FD
section in the case of COBOL)”, or with a “Write” instruction which is just the opposite. The Operating
System is responsible for the address translation as well as the execution of this instruction.
The C compiler compiles all other instructions such as “a = b;” or “a = b + c;” (i.e. except instruction such
as I/O ) into the machine instructions, as we have seen earlier, but when it encounters an I/O instruction as
given above, it substitutes a call to the Operating System, known as a system call, or in IBM terminology,
‘Supervisory call’ or ‘SVC’. This is illustrated in Fig. 3.1.
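To make the idea concrete, the following is a small, hedged C sketch. The pure CPU statement is compiled into ordinary machine instructions, while the fread() call is ultimately serviced by the Operating System; on a UNIX-like system the underlying system call is read(). The file name CUSTOMER.DAT and the 25-byte record size are assumed only for the example.

    #include <stdio.h>
    #include <unistd.h>          /* read(): the UNIX system call behind fread() */

    int main(void) {
        char custrec[25];
        int a = 10, b = 20, c;

        c = a + b;               /* pure CPU work: compiled straight into
                                    ordinary machine instructions              */

        FILE *fp = fopen("CUSTOMER.DAT", "rb");   /* assumed file name        */
        if (fp == NULL)
            return 1;

        /* Logical I/O: the library code behind fread() ultimately issues the
           Operating System's own call; on UNIX this is roughly
           read(fileno(fp), custrec, 25);                                      */
        fread(custrec, sizeof(char), 25, fp);

        fclose(fp);
        printf("c = %d\n", c);
        return 0;
    }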
At the time of the execution of this program, after the instructions shown in Section A of Fig. 3.1 are
executed, when this system call is encountered, there is no point in continuing the current program, because,
without reading the customer record, the AP cannot do any other operation indicated by ‘B’ in Fig. 3.1. This
is because, these operations may be dependent upon the data in that record which is yet to be read. If the
program is allowed to continue to execute the instructions in Section B, it may compute something on the data
in the last or the previous record. This is why, this program is put aside for the time being (in the Operating
System jargon, the process is ‘blocked’, as we will learn). The Operating System takes over to read the
required record on behalf of the AP, and schedules some other program at the same time.
The Operating System, however, will have to know from which file the record is to be read and where in
memory it is. This information is supplied in the system call itself in the form of parameters, and passed on
to the Operating System call for “Read”.
Disk drives are controlled by another hardware unit called the ‘disk controller’. This controller is, in fact,
a tiny computer which has an instruction set of its own - basically to take care of the I/O operations. The
controller has its own memory which can vary from one byte to even one track, i.e. many kilobytes.
This controller (or small computer) needs an instruction specifying the address of the sector(s) which
are to be read and that of the memory locations into which the data is to be read. Once the Operating System supplies
this instruction to the controller, the controller can carry out the data transfer on its own, without the help
of the main CPU. Therefore, the CPU is free to execute any other program when the I/O is going on for
some other program. This is the reason why, this operation is called ‘Direct Memory Access (DMA)’, and
the controller is called the DMA controller. Therefore, DMA forms a very important basis for a multiuser,
multiprogramming Operating System.
With the help of this, the Operating System could delegate the actual I/O operation for a program to the
DMA and switch to a new program. When the I/O is completed for the original program, it can be rescheduled
(depending upon the scheduling algorithm). All this happens so fast that each user can get a feeling that he/
she is controlling the entire computer and its resources. This, in essence, is multiprogramming.
The exact steps in which this happens are given below:
(i) The Application Program (say AP-1) contains an instruction of the type “numread = fread (Custrec,
sizeof (char), 25, FP);” in C or “Read Customer-Rec...At end...”.
(ii) The compiler generates a system call for this instruction. An assembly level programmer can directly
issue a system call in the program.
(iii) At the time of execution, when this system call is encountered, AP-1 is blocked (i.e. kept
pending until its I/O is over). The Operating System then schedules another program, say AP-2 for
execution.
(iv) The Operating System takes up the reading of the record on behalf of AP-1 while the CPU is executing
AP-2. It first finds out the address of the physical sectors where the Customer-Rec is located. This is
called ‘address translation’.
(v) The Operating System then issues an instruction to the DMA controller, giving the disk address of
the source sectors, the address of the target memory locations of the Operating System buffer and the
number of bytes to be transferred.
(vi) The DMA then reads the data from the disk sectors into the Operating System buffer (we will later
see how this is done). The Operating System may have multiple memory buffers for holding the
data read, or to be written, for different programs that it controls. Therefore, the data is read into
the Operating System buffer reserved for that program. It is then transferred from that buffer to the
memory area in the AP, such as the FD section.
(vii) It is also possible to read the data directly into the memory of the AP, bypassing the Operating System
buffers, though this involves a little more complication. However, we will assume that the data is read into the
Operating System buffers first. All this while AP-2 is executing. The only thing is that it can execute
CPU bound instructions only at this stage, because the DMA would be using the data bus on a priority
basis to transfer the data from the disk to the memory for AP-1.
(viii) When the I/O for AP-1 is completed, the DMA generates a hardware interrupt which is stored in the
Interrupt Pending Register.
(ix) When the currently executing machine instruction of AP-2 is completed, the hardware detects the
interrupt, and the interrupt is processed by a routine called ‘Interrupt Service Routine (ISR)’.
(x) The Operating System finds out the disk drive which has caused this interrupt. It then finds out the
program for which the I/O was completed.
(xi) The Operating System then picks up the relevant bytes from its buffer to formulate the logical
Customer-Rec and transfers it to the FD area of the AP-1 (A logical record can start/end at any
random location within a physical sector).
(xii) The Operating System reactivates AP-1 by making it ready for execution.
(xiii) At this juncture, both AP-1 and AP-2 are in the list of ready processes. The Operating System
chooses one of them and dispatches it. This decision depends upon the scheduling policy followed by
the Operating System as we will study later in greater detail.
(xiv) Regardless of when it restarts, AP-1 is now ready to execute the instructions shown as Section ‘B’ in
Fig. 3.1.
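The following sketch, in C, shows the kind of command block the Operating System might hand over to the DMA controller in step (v). The structure fields, the stub function and the 512-byte buffer are illustrative assumptions and not a real controller interface; completion would be signalled by the hardware interrupt of step (viii).

    #include <stdint.h>
    #include <stdio.h>

    /* A hypothetical DMA command block, mirroring step (v) above.  The field
       names and issue_to_controller() are illustrative assumptions only.     */
    struct dma_command {
        uint32_t surface, track, sector;   /* source sectors on the disk          */
        void    *os_buffer;                /* target: an Operating System buffer  */
        uint32_t byte_count;               /* number of bytes to transfer         */
    };

    static char os_buffer[512];

    static void issue_to_controller(const struct dma_command *cmd) {
        /* In reality the controller now transfers the data on its own while
           the CPU runs AP-2; completion is signalled by a hardware interrupt
           (step (viii)), which the Interrupt Service Routine handles.        */
        printf("DMA: surface %u, track %u, sector %u -> %u bytes into OS buffer\n",
               (unsigned)cmd->surface, (unsigned)cmd->track,
               (unsigned)cmd->sector, (unsigned)cmd->byte_count);
    }

    int main(void) {
        struct dma_command cmd = { 1, 10, 5, os_buffer, 512 };
        issue_to_controller(&cmd);
        return 0;
    }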
Figure 3.2 shows the relationship among the AP, the Operating System and the hardware.
To read a sector from the disk into the memory, we will need a machine instruction anyway. However, this
instruction is a privileged instruction, because, only the Operating System can execute it. No user program
can execute it. We know that all higher level languages are provided with an instruction to read a piece of
logical data such as a record. In the place of that instruction, the compiler of that language is not allowed
to substitute this privileged machine instruction for I/O in the compiled program. This is because, only the
Operating System and not the compiler can do the address translation needed to locate the sectors where
the logical data exists. Therefore, this address translation has to take place at the time of execution and not
compilation. Hence, the compiler merely substitutes a system call for the logical I/O operation.
Even if somebody writes a program in the machine language or patches an existing program to use this
privileged I/O instruction, it is detected and prevented from execution, because, it is a user program (even
though written in machine language) which is executing it, and that is certainly not correct from the standpoint
of security and privacy. How is this achieved?
A CPU, at any moment, can execute only one instruction. If this
instruction is from the Application Program, the machine is said to be in
the ‘user mode’. If it is executing any Operating System instruction, the
machine is said to be in the ‘System mode’. At any time, there is one bit
in the machine, indicating this mode. If this is a user mode, and if the
instruction is a privileged one (as detected by some bit combinations in the
instruction in the IR), the hardware itself generates (notifies) an error. This
is how a discipline is enforced ensuring that all instructions such as I/O
are executed only through system calls and not directly. In some systems
a different mechanism is adopted, but the essence is the same. Figure 3.3
depicts the layers of hardware, the Operating System and the Application
Program.
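The mode-bit check can be pictured with the following toy C simulation. The opcode encoding, the is_privileged() test and the printed ‘trap’ are assumptions chosen only to illustrate the idea of trapping a privileged instruction issued in user mode.

    #include <stdio.h>

    /* A toy simulation of the mode-bit check.  The opcode encoding and the
       is_privileged() test are assumptions made purely for illustration.     */
    enum mode { USER_MODE, SYSTEM_MODE };

    static int is_privileged(unsigned opcode) {
        return (opcode & 0x80) != 0;       /* assume: the high bit marks a
                                              privileged instruction          */
    }

    static void execute(unsigned opcode, enum mode m) {
        if (m == USER_MODE && is_privileged(opcode)) {
            printf("trap: privileged instruction in user mode\n");
            return;                        /* the hardware raises an error    */
        }
        printf("instruction 0x%02X executed\n", opcode);
    }

    int main(void) {
        execute(0x81, USER_MODE);    /* e.g. a direct disk-I/O instruction: trapped    */
        execute(0x81, SYSTEM_MODE);  /* the same instruction issued by the OS: allowed */
        return 0;
    }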
We have seen the type of services that an Operating System provides for reading and writing records.
These services fall in the category, ‘Information Management (IM)’. From a systems programmer’s
point of view, the Operating System can be considered to be a collection of many such callable
programs or services categorized under three major heads, viz.
• Information Management (IM)
• Process Management (PM)
• Memory Management (MM)
The services provided through Process Management (PM) are very important, if the Operating System
supports multiple users as in the case of UNIX or MVS. In a modern multiuser Operating System, a number of
users located at different terminals run the same or different programs, and this scheme requires an arbitrator.
A computer has only one CPU (unless it is a multiprocessor system). The CPU has only one set of IR, PC,
MAR, MBR and other registers. At a given time, it can execute only one instruction belonging to any one of
all these programs competing for the CPU.
The function of the Operating System in this regard is to keep track of all the competing processes (we
will call a running program a process), schedule them (i.e. sequence them), and dispatch (i.e. run or execute)
them one after the other, but while doing so, try to give each user an impression that he has the full control
of the CPU.
The services provided under Memory Management (MM) are directed to keeping track of memory and
allocating/deallocating it to various processes. The Operating System keeps a list of free memory locations.
(Initially, after the booting but before any process starts, the full memory, excepting the part occupied by the
Operating System itself is free.) Before a program is loaded in the memory from the disk, this module (MM)
consults this free list, allocates the memory to the process, depending upon the program size and updates the
list of free memory.
As we have seen earlier, the compiler assigns the addresses assuming that the program is going to be
loaded from memory location 0. Such a program resides on the disk and is brought in the memory, may be,
at different locations than 0. Obviously, some scheme of Address Translation (AT) is needed. For instance,
if a program compiled with addresses 0 to 999 is actually loaded in locations from 1000 to 1999, all the
memory addresses in the program will have to be changed by adding 1000 to each of them. This is called
address translation. The Operating System is responsible for this AT - normally with the help of some
special hardware. Again, there are many ways of doing all these functions, as we shall see later.
We list some selected system calls in this category in Fig. 3.6.
It is again evident that in a single-user, uniprogramming environment, the MM module of the Operating
System is far less complicated and less significant than in a multiuser scheme.
Normally, system calls are callable from the assembly language. In fact, an assembly language
programmer writing programs under a specific Operating System has to be quite familiar with
these system calls. All of us have surely encountered phrases such as “assembly language under
DOS” or “assembly language under VMS”. We should now be clear about its meaning. The assembly
language programmer needs to know the assembly language instruction set for that machine. This depends
upon the architecture of that hardware. In addition, he normally needs to be aware of the system calls of the
Operating System running on that machine. This is the real meaning of these phrases.
Many HLLs such as PL/1 and C also allow system calls to be embedded explicitly in the midst of other
statements. This, along with other features of these languages, makes them more suitable for writing utilities
or systems software.
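As an illustration of such embedding, the small C program below mixes ordinary statements with a direct UNIX system call, write(), on the standard output. The example assumes a UNIX-like environment; other Operating Systems expose different calls for the same purpose.

    #include <stdio.h>
    #include <unistd.h>      /* UNIX system call wrappers: write(), getpid()  */

    int main(void) {
        /* Ordinary HLL statements ...                                        */
        int total = 2 + 3;

        /* ... with a UNIX system call embedded directly among them.  File
           descriptor 1 is the standard output.                               */
        write(1, "hello from a system call\n", 25);

        printf("total = %d, pid = %d\n", total, (int)getpid());
        return 0;
    }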
Some HLLs such as Java, C++ or COBOL do not often require the explicit use of system calls
embedded in other statements (though some languages allow it). In these cases, the compiler substitutes
system calls at proper places, wherever necessary, for the appropriate instructions in HLL. We have
discussed how this is done for READ or WRITE statements in C/COBOL. In C, there is a function called
“exit” which is meant to end the program. The compiler substitutes it by a system call to “kill or terminate
the process”.
The higher you go in the level of languages, such as 3GLs or 4GLs, the less you will need to explicitly
use system calls. However, the compilers of these languages almost certainly will use a number of system
calls in the compiled programs. That is the reason why you need different compilers under Windows 2000
and UNIX even if the source language is the same (e.g. C) and the machine is the same (e.g. PC/AT). The
two compilers will be substituting different sets of Operating System calls for the same source statements
such as fread, fwrite or exit. One important point is that normally there is no one-to-one correspondence
between the system calls of two Operating Systems such as Windows 2000 and UNIX. Apart from the format,
the functionality, scope, parameters passed and returned results or errors differ quite a lot between the two
Operating Systems. Some system services or calls in UNIX may not be present in MS/DOS, and vice-
versa. Similarly one system call in UNIX may be equivalent to several system calls under Windows 2000.
In general, the mapping between the system calls of two Operating Systems is never straight forward. Also,
normally, before the actual system call, the compiler generates machine instructions which are preparatory
instructions for that system call. Typically, these instructions load certain CPU registers or the stack with the
parameters of the system call.
For instance, for the “fopen” system call, the parameters could be the file name and the mode (Read, Write, etc.). The
system call, while executing, assumes that the parameters will be available at a pre-defined place. The results
of the system call or the error codes (e.g. for file non-existent) are deposited by the system call also in pre-
defined registers or the stack. The instructions following the system call generated by the compiler pick up
these results from these registers and then proceed.
The procedure and the exact registers used differ from Operating System to Operating System. This
adds to the porting efforts of the compiler substantially. Note that this is not the porting of an Application
Program written in HLL, or that of the compiled object program. We are talking about the porting of a
compiler itself. You have a compiler of C under Windows 2000 for Pentium and you want to have one for C
on Pentium but under UNIX. In that case, you must essentially change the previous one to make it work for
the other case.
A compiled program normally will look as shown in Fig. 3.7.
Figure 3.7 throws some light on portability. If the same compiled binary program is taken from
one machine to another with a different architecture and consequently different instruction set,
what will happen? The machine instructions in the program shown as (M) in Fig. 3.7 will not be
executed (at least not properly!). After the fetch cycle, the instruction will get into the IR of the new machine,
but then, the new machine will not understand the format and encoding of the instructions. For instance, the
number of bits in the instruction reserved for the op code and for the address could be different on the two
machines. Therefore, a C program compiled under UNIX on Pentium cannot run directly on a VAX machine
even under UNIX. You will need to recompile it using a C compiler under UNIX on the VAX.
Consider another scenario. If we take the same compiled binary program to the same machine but under
a different Operating System, it will still not execute properly. This time, as the machine is the same, the
machine instructions (M) will be executed properly, but then the system calls (S) will not be executed properly,
because the new Operating System will have a set of completely different system calls with their different
encoding and interfaces (i.e. the preparatory and succeeding instructions to the system calls). Therefore, a
C program compiled under Windows 2000 on a Pentium cannot run directly under UNIX on a Pentium. It
will need to be recompiled.
The discussion above will clarify that the object code portability is achieved only if the instruction set
of the machine as well as the Operating System is the same. Some companies, notably Data General, had
provided such object code portability across their product range. This was done, for instance, by ensuring that
the instruction set and the Operating System calls running on a 32 bit supermini computer (e.g. MV/7800XP,
AOS/VS) were a superset of the instruction set and Operating System calls for a 16 bit minicomputer (e.g.
C-380 and AOS). Therefore, a customer upgrading his machines from 16 to 32 bit computers did not have
to make any change, at least for the old applications. The compiled programs could run on a new machine
without any change, or recompilation.
If you recompile the source program on a 32 bit supermini and under AOS/VS, the compiler will use some
advanced features, due to which a program is likely to run more efficiently than in the previous case, where
you run the same old compiled version. But then, initially, the programs would run without even requiring
compilation. This object code compatibility or portability protects the user’s investment in software and
makes the transition to a higher machine easier by giving him some breathing time. It will also now become
quite clear that even if you have an existing C compiler under Windows 2000 for a Pentium, you will have to
write a new one under UNIX for the Pentium which will use the system calls available under UNIX.
Once you have two different compilers for different machines/Operating Systems combinations (called
environments), then ‘source code portability’ can be achieved easily by allowing the same source statements
for both versions. This is exactly why a system developed in C or COBOL can be source portable, to a large
extent, from PC to the mainframe under different Operating Systems. Both will have to be recompiled under
two environments producing different object codes by two different compilers, but then they will run without
any change in the source program.
We had stated earlier that the Operating System makes the task of a programmer as well as a
user simpler. We had an overview of the Operating System from the programmer’s point of view,
which essentially consists of various system calls. Let us now consider how the Operating System
affects the user and how these two views interact.
A user communicates with the computer typically through some commands given at a terminal. These
commands (part of Command Language or CL in short) constitute what is called a User Interface. The
Operating System has a program called the Command Interpreter (CI) which is constantly running in the
computer. It waits for a user command, and as soon as a user issues one, the CI interprets it, i.e. it executes it
immediately and waits for the next command. DCL for VAX/VMS or CLI for AOS/VS or SHELL for UNIX
are examples of such CI/CL (We will henceforth use these interchangeably).
Typical facilities provided by a CL are as listed in Fig. 3.8.
The user’s view of the Operating System is, in fact, a set of these CI commands. What we need to know is
how these commands interact with the system calls or the systems programmer’s view.
We can imagine that logically the CI consists of a set of programs, one for each of the commands in the
CL. For example, there is a program for creating a file in a directory (say CRE). There is another for deleting
a file from the directory (say DEL), and there is one for executing a program (say RUN).
In physical terms, the Operating System designer may combine
two functions in one module or may write many modules
even to implement one function. This is clearly a design issue
(identification of commonality, cohesion, coupling, etc.). We
will assume that there is one program for one function, as it is only a logical view. Therefore, to the user,
the Operating System consists of a CI which comprises a set of these programs, plus a watchdog program
constantly watching or polling on every terminal to check whether the user has given any command, and if
so, initiate the appropriate program.
Let us consider the Windows 2000 screen shown in Fig. 3.9(a). The user can simply double click on
a MS-Word document to open it. This is a feature of the GUI. However, when the user double-clicks on
the document name, internally, Windows 2000 has to figure out that this is a MS-Word document, and as
such, invoke the MS-Word editor program. It then loads the selected document in MS-Word as shown in
Fig. 3.9(b). This is transparent to the user. However, the point is that this involves certain Operating System
functions from the IM, MM and PM categories.
We still need to know how Windows 2000 knows what to do in response to the user’s mouse double-
click and then everything will be clear. For this, some of the Operating System routines are provided as
direct system calls belonging to IM, PM or MM category (e.g. CRE is a direct system call belonging to
the IM category to create a file). In subsequent sections, we will see how some of these system calls are
executed. Some of the other routines, however, are bigger programs which, in turn, use system calls. This
is where the user’s and the programmer’s viewpoints merge. Let us illustrate this by an example of RUN
command. “RUN” is equivalent to double-clicking on the file’s icon in Windows 2000 Operating Systems.
Let us imagine that the effect of double-clicking on the icon of a file called PAYROLL is the same as entering
the following command in the Windows 2000 command prompt:
> RUN PAYROLL
where “>” is displayed by the CI as a prompt for the user to type in his command, RUN is a command in
the CL and its argument PAYROLL is the name of a program already compiled, linked and stored on the disk
for future use, and which the user wants to execute now.
Figure 3.10 illustrates the picture of the memory, which is occupied by the Operating System where, a part
is kept free for any AP to be loaded and executed.
The Operating System portion is shown as further divided into a portion for the CI. This portion logically
consists of compiled versions of a number of routines within the CI. The figure shows one routine for one
command. The figure shows a “Scratch Pad for CI” where all commands input by a user are stored temporarily
before the input is examined by the watchdog program as shown in Fig. 3.9. The Operating System area also
has a portion for other Operating System routines or system calls in the areas of IM, PM and MM. The
following steps are now carried out.
(i) CI watchdog prompts “>” on the screen and waits for the response.
(ii) The user types in “RUN PAYROLL” as shown on the screen in Fig. 3.10. This command is transferred
from the terminal keyboard to a memory buffer (scratch pad) of the CI for analysis as shown in
Fig. 3.11.
(iii) The CI watchdog program as shown in Fig. 3.9 now examines the command and finds a valid
command RUN, whereupon it invokes the routine for RUN. It passes the program name PAYROLL
as a parameter to the RUN program. This is as shown in Fig. 3.12.
(iv) The RUN routine, with the help of a system call in the IM category, locates a file called PAYROLL
and finds out its size.
(v) The RUN routine, with the help of a system call in the MM category, checks whether there is free
memory available to accommodate this program, and if so, requests the routine in MM to allocate
memory to this program. If there is not sufficient memory for this purpose, it displays an error message
and waits or terminates, depending upon the policy.
(vi) If there is sufficient memory, with the help of a system call in the IM category, it actually transfers
the compiled program file for “PAYROLL” from the disk into those available memory locations
(loading). This is done after a system call in the IM category verifies that the user wanting to execute
the PAYROLL program has an “Execute” Access Right for this file. The picture of the memory now
looks as shown in Fig. 3.13.
(vii) It now issues a system call in the PM category to schedule and execute this program (whereupon we
shall start calling it a “Process”).
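As a rough illustration of steps (i) to (vii), the following is a minimal command-interpreter loop sketched in C for a UNIX-like system. Here fork(), execlp() and wait() stand in for the generic IM, MM and PM system calls described above; the prompt and the RUN syntax follow the example in the text.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        char line[128], name[64];

        for (;;) {
            printf("> ");                                  /* step (i): prompt      */
            if (fgets(line, sizeof line, stdin) == NULL)
                break;

            if (sscanf(line, "RUN %63s", name) == 1) {     /* step (iii): parse     */
                pid_t pid = fork();                        /* create a new process  */
                if (pid == 0) {
                    execlp(name, name, (char *)NULL);      /* locate, load and run  */
                    perror("load failed");                 /* e.g. file not found   */
                    exit(1);
                }
                wait(NULL);                                /* step (vii): let it run */
            } else {
                printf("unknown command\n");
            }
        }
        return 0;
    }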
The picture is a little simplistic, but two points become clear from the discussion above:
• The system calls in IM, PM and MM categories have to work in close cooperation with one another.
• Though the user's or Application Programmer's view of the Operating System is restricted to the CL,
the commands are executed with the help of a variety of system calls internally (which is essentially the
system programmer's view of the Operating System).
The latest trend today is to make the user’s life simpler by providing an attractive and friendly
Graphical User Interface (GUI) which provides him with various menus with colours,
graphics and windows, as in OS/2 on the PS/2. Therefore, the user does not have to remember the tedious
syntax of the Command Language, but can point at a chosen option by means of a mouse. However, this
should not confuse us. Internally, the position of the mouse is translated into the coordinates or the position
of the cursor on the screen.
The CI watchdog program maintains a table of routines such as RUN, DEL, etc. versus the possible
screen positions of a cursor manipulated by the mouse. When the user clicks at a certain mouse position, the
Operating System invokes the corresponding routine. The Operating System essentially refers to that table to
translate this screen position into the correct option, and then calls that specific routine (such as RUN) which,
in turn, may use a variety of system calls as discussed earlier. An example of the user-friendly interface is
shown in Fig. 3.14.
If you move the mouse, the cursor also moves. However, you have to move the mouse along a surface so
that the wheels of the mouse also turn. The revolutions of these wheels are translated into the distance which
is mapped into (x, y) coordinates of the cursor by electromechanical means. If the cursor points at a “RUN”
option, as shown on the screen, the cursor coordinates (x, y) are known to the watchdog program of CI. This
program is coded such that even if the cursor points to coordinates corresponding to any of the characters R,
U or N on the screen, it still calls the RUN routine, which as we have seen before, initiates other system calls,
in turn. Figure 3.14 shows that if you point a mouse at LIST and then you point it to FILE-B as shown in the
figure, the Operating System will convert the mouse positions into screen coordinates first, which it will then
convert into an instruction LIST FILE-B and then execute it by calling the appropriate routines.
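The table lookup described above can be pictured with the following toy C sketch. The screen coordinates, rectangle sizes and command names are invented purely for illustration; a real GUI keeps far richer information for every menu item and window.

    #include <stddef.h>
    #include <stdio.h>

    struct menu_entry {
        int x1, y1, x2, y2;          /* rectangle covering the menu text      */
        const char *command;         /* routine to invoke: RUN, LIST, DEL ... */
    };

    static const struct menu_entry table[] = {
        { 10, 10, 60, 25, "RUN"  },
        { 10, 30, 60, 45, "LIST" },
        { 10, 50, 60, 65, "DEL"  },
    };

    static const char *lookup(int x, int y) {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (x >= table[i].x1 && x <= table[i].x2 &&
                y >= table[i].y1 && y <= table[i].y2)
                return table[i].command;
        return NULL;
    }

    int main(void) {
        int x = 20, y = 35;                  /* cursor position at the click  */
        const char *cmd = lookup(x, y);
        printf("mouse at (%d,%d) -> %s\n", x, y, cmd ? cmd : "no command");
        return 0;
    }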
We need to answer one basic question: If the Operating System is responsible for any I/O
operation, who loads the Operating System itself in the memory? You cannot have another
program, say P, to load the Operating System; because in order to execute, that program will need to
be in the memory first, and then we will need to ask: Who brought that program P into the memory? How was
the I/O possible without the Operating System already executing in the memory? It appears to be a chicken and
egg problem.
In some computers, the part of the memory allocated to the Operating System is in ROM; and therefore,
once brought (or etched) there, one need not do anything more. ROM is permanent; it retains its contents
even when the power is lost. A ROM-based Operating System is always there. Therefore, in such cases, the
problem of loading the Operating System in the memory is resolved.
However, the main memory consists of RAM in most of the computers. RAM is volatile. It loses its
contents when the power is switched off. Therefore, each time the computer is switched on, the Operating
System has to be loaded. Unfortunately, we cannot give a command of the type LOAD Operating System,
because such an instruction would be a part of CI which is a part of the Operating System which is still on the
disk at that time. Unless it is loaded, it cannot execute. Therefore, it begs the question again!
The loading of the Operating System is achieved by a special program called BOOT. Generally this
program is stored in one (or two) sectors on the disk with a pre-determined address. This portion is normally
called ‘Boot Block’ as shown in Fig. 3.19. The Read Only Memory (ROM) normally contains a minimum
program. When you turn the computer on, the control is transferred to this program automatically by the
hardware itself. This program in ROM loads the Boot program in pre-determined memory locations. The
beauty is to keep the BOOT program as small as possible, so that the hardware can manage to load it easily
and in a very few instructions. This BOOT program in turn contains instructions to read the rest of the
Operating System into the memory. This is depicted in Figs. 3.16 and 3.17.
The mechanism gives an impression of pulling oneself up. Therefore, the nomenclature bootstrapping or
its short form booting.
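The bootstrapping chain can be summarized by the following C-style sketch. Real boot code is a handful of machine instructions; the sector, the memory locations and the function names assumed here are only placeholders for the steps described above.

    #include <stdio.h>

    static void load_rest_of_os(void) {
        printf("OS: loaded and initialised, Command Interpreter started\n");
    }

    static void boot_program(void) {
        /* The small BOOT program: knows just enough to read the rest of the
           Operating System from the disk into memory and jump to it.         */
        printf("BOOT: loading the rest of the Operating System\n");
        load_rest_of_os();
    }

    static void rom_startup(void) {
        /* Burned into ROM: read the Boot Block (a fixed, pre-determined
           sector) into fixed memory locations and transfer control to it.    */
        printf("ROM: loading the boot block from its pre-determined sector\n");
        boot_program();
    }

    int main(void) {
        rom_startup();     /* what the hardware does automatically at power-on */
        return 0;
    }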
What will happen if we can somehow tamper with the BOOT sector where the Boot program is stored
on the disk? Either the Operating System will not be loaded at all or loaded wrongly, producing wrong and
unpredictable results, as in the case of a computer virus.
The concept of virtual machine came about in the 1960s. The background to this is quite interesting.
IBM had developed OS/360, the operating system for its System/360 family of mainframe computers,
and it had become quite popular. However, the major concern regarding this operating system was that it was
batch-oriented in nature. There was no concept of online computing or timesharing. Users were increasingly
feeling a need of timesharing, since only batch processing was not adequate.
In order to add the timesharing features to System/360, IBM appointed a dedicated team, which started
to work on a solution to this problem. This team came up with a new operating system called as TSS/360,
which was based on the System/360, but also had timesharing features. Although technically this solution
was acceptable, as it turned out, the development of TSS/360 took a lot of time, and when it finally arrived,
people thought that it was too bulky and heavy. Therefore, a better solution was warranted.
Soon, IBM came up with another operating system, called as CP/CMS, which was later renamed as
VM/370. The VM/370 operating system is quite interesting. It contains a virtual machine monitor. The term
virtual machine indicates a machine (i.e. a computer), which does not physically exist, and yet, makes the
user believe that there is a machine. This virtual machine monitor runs on the actual hardware, and performs
the multiprogramming functions. The idea is shown in Fig. 3.18. Here, we assume that three application
programs A, B and C are executing with their own operating systems (again A, B and C, shown as Virtual
Machine A, B and C, respectively).
The virtual machine, in this case, is an exact copy of the hardware. In other words, it provides support for
the kernel/user mode, input/output instructions, interrupts, etc. What significance does this have? It means
that there can actually be more than one operating system running on the computer! The way this works is as
follows:
1. Each application program is coded for one of the available operating systems. That is, the programmer
issues system calls for a particular operating system.
2. The system call reaches its particular operating system, from all those available (depending on which
operating system the programmer wants to work with).
3. At this stage, the system call of the program’s operating system is mapped to the system call of
VM/370 (i.e. the actual system call to be executed on the real hardware).
The virtual machine now makes the actual system call, addressed to the physical hardware.
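The following toy C sketch gives a feel for how a virtual machine monitor might dispatch a trapped operation of a guest operating system onto the real hardware. The guest identifiers and operation names are invented for illustration; VM/370's actual mechanism is far more elaborate.

    #include <stdio.h>

    enum guest_op { GUEST_READ_SECTOR, GUEST_SET_TIMER };

    /* When a guest operating system issues a privileged operation, it traps
       to the monitor, which carries it out on the real hardware.             */
    static void vmm_handle_trap(int guest_id, enum guest_op op) {
        switch (op) {
        case GUEST_READ_SECTOR:
            printf("VMM: issuing the real disk I/O on behalf of guest %d\n", guest_id);
            break;
        case GUEST_SET_TIMER:
            printf("VMM: virtualising the timer for guest %d\n", guest_id);
            break;
        }
    }

    int main(void) {
        vmm_handle_trap(1, GUEST_READ_SECTOR);   /* guest OS A asked for a record */
        vmm_handle_trap(2, GUEST_SET_TIMER);     /* guest OS B set a time slice   */
        return 0;
    }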
This more or less completes an overview of an Operating System. We will now try to uncover the basic
principle of the design of any Operating System.
System calls are an interface provided to communicate with the Operating System.
An Operating System manages the entire functioning of a computer on its own, but on many
occasions explicit direct (initiated by the user through a program or commands) or indirect
(not initiated by the user directly) calls are required to perform various operations. Routines or
functions or calls that are used to perform Operating System functions are called system calls.
Most system calls of Operating Systems are available in the form of commands.
System call instructions are normally available in assembly language. High level languages such as C, C++
and Perl also provide facilities for systems programming, and today C/C++ are widely used for this purpose.
We can call UNIX or MS-DOS system routines using C or C++, and those system calls will be executed
at run-time.
Suppose we have a program in C, which can copy the contents of one file into another file, i.e. it is a
backup utility. Then this C program would require two file names and their paths: 1) the input file (which
already exists) and 2) the output file (which will be created).
When we execute this program, it will prompt for the names of the two files and while processing it will
display error messages if it encounters any problems. Otherwise it will display a ‘success’ message. This is
visible to us and we can interact with the program. But in this program, many things are happening in the
background. These are:
To ensure that the entered file names are as per the standards or nomenclature. This is normal
processing and does not involve any system calls.
To copy the contents of the input file into the output file, we need to open the input file which is present on the
disk. Hence, we use a function provided by C/C++ which accepts the file name as a parameter. This function
tries to locate the file and then open it. Now this is a system call. If the file exists, it will be opened.
If there is any error, such as the file not being present, there not being enough memory to load the file, or there
being no access to open that file, then the program aborts, which involves another system call.
After copying is done, both the files must be closed so that other processes can use them. File close is also a
system call, and if there is any problem while closing the file, another system call will be made.
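A minimal version of such a backup utility is sketched below in C. For brevity it takes the two file names as command-line arguments instead of prompting for them; each fopen/fread/fwrite/fclose is translated, at run time, into the corresponding Operating System call (open, read, write, close on UNIX-like systems).

    #include <stdio.h>

    int main(int argc, char *argv[]) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <input file> <output file>\n", argv[0]);
            return 1;
        }

        FILE *in = fopen(argv[1], "rb");            /* system call: open for reading */
        if (in == NULL) { perror(argv[1]); return 1; }

        FILE *out = fopen(argv[2], "wb");           /* system call: create/open      */
        if (out == NULL) { perror(argv[2]); fclose(in); return 1; }

        char buf[512];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)   /* read a block ...        */
            fwrite(buf, 1, n, out);                       /* ... and write it out    */

        fclose(in);                                 /* system call: close            */
        fclose(out);
        printf("success\n");
        return 0;
    }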
Following are the types of system calls:
• Process control
• File management
• Device management
• Information maintenance
• Communication
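Purely as an illustration, the short UNIX-flavoured C program below touches several of these categories in one place: process control (fork, wait, _exit), file management (read, write on a descriptor), information maintenance (getpid, time) and communication (pipe). Device management, typically done through calls such as ioctl, is omitted for brevity; other Operating Systems provide equivalent services through different interfaces.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <time.h>

    int main(void) {
        /* Information maintenance: who am I, and what time is it?            */
        printf("pid = %d, time = %ld\n", (int)getpid(), (long)time(NULL));

        /* Communication: a pipe between two related processes.               */
        int fd[2];
        if (pipe(fd) != 0)
            return 1;

        /* Process control: create a child process.                           */
        pid_t pid = fork();
        if (pid == 0) {
            write(fd[1], "hi", 2);      /* file management on the pipe's descriptor */
            _exit(0);                   /* process control: terminate the child     */
        }

        char buf[2];
        read(fd[0], buf, 2);            /* file management: read the two bytes      */
        wait(NULL);                     /* process control: wait for the child      */
        return 0;
    }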
Information Management (IM) consists of two main
modules: the File System and the Device Driver (DD).
Disk constitutes a very important I/O medium that the Operating System has to deal with very frequently.
Therefore, it is necessary to learn how the disk functions. The operating principle of a floppy disk is similar
to that of a hard disk. Simplistically, in fact, a hard disk can be considered as being made of multiple floppy
disks put one above the other. We will study the floppy disks in the subsequent sections but the discussion is
equally applicable to the hard disks as well.
Disks are like long play music records except that the recording is done in concentric circles and not
spirally. A floppy disk is made up of a round piece of plastic material, coated with a magnetized recording
material. The surface of a floppy disk is made of concentric circles called tracks. Data is recorded on these
tracks in a bit serial fashion. A track contains magnetized particles of metal, each having a north and a south
pole. The direction of this polarity decides the state of the particle. It can have only two directions. Therefore,
each such particle acts as a binary switch taking values of 0 or 1, and 8 such switches can record a character
in accordance with the coding methods (ASCII/EBCDIC). In this fashion, a logical record which consists of
several fields (data items), each consisting of several characters is stored on the floppy disk. This is depicted
in Fig. 4.1.
A disk can be considered to consist of several surfaces, each of which consists of a number of
tracks, as shown in Fig. 4.2. The tracks are normally numbered from 0 as the outermost track, with the
number increasing inwards. For the sake of convenience, each track is divided into a number of sectors of
equal size. Sector capacities vary, but typically, a sector can store up to 512 bytes. Double-sided floppies
have two sides or surfaces on which data can be recorded. Therefore, a given sector is specified by three
components of an address made up of Surface number, Track number and Sector number.
How is this sector address useful? It is useful to
locate a piece of data. The problem is that a logical
record (e.g. Customer Record) may be spread over
one or more sectors, or many logical records could
fit into one sector, depending upon the respective
sizes. These situations are depicted in Fig. 4.3.
The File System portion of Information
Management takes care of the translation from
logical to physical address. For instance, in a 3GL or
a 4GL program, when an instruction is issued to read
a customer record, the File System determines which
blocks need to be read in to satisfy this request, and
instructs the Device Driver accordingly. The Device
Driver then determines the sectors corresponding to
those blocks that need to be read.
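The Device Driver's translation from a block number to a (surface, track, sector) address can be sketched with simple arithmetic, as below. The geometry assumed here (2 surfaces and 9 sectors of 512 bytes per track, a floppy-like layout) is chosen only to make the numbers concrete.

    #include <stdio.h>

    #define SURFACES          2      /* assumed floppy-like geometry          */
    #define SECTORS_PER_TRACK 9      /* 9 sectors of 512 bytes per track      */

    int main(void) {
        int block = 100;             /* logical block number from the File System */

        int sector  = block % SECTORS_PER_TRACK;
        int surface = (block / SECTORS_PER_TRACK) % SURFACES;
        int track   = block / (SECTORS_PER_TRACK * SURFACES);

        printf("block %d -> surface %d, track %d, sector %d\n",
               block, surface, track, sector);
        return 0;
    }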
The disk drive for the hard disk and its Read/Write mechanism is shown in Fig. 4.4. You will notice that
there is a Read/Write head for each surface connected to the arm; this arm can move in or out, to position
itself on any track while the disk rotates at a constant speed.
Let us assume that the File System asks the DD to read a block which when translated into a sector address
reads Surface = 1, Track = 10, Sector = 5. Let us call this our target address. Let us assume that the R/W
heads are currently at a position which we will call the current address. Let us say it is Surface = 0, Track
= 7, Sector = 4.
We have to give electrical signals to move the R/W heads from the current address to the target address
and then read the data. This operation consists of three stages, as shown in Fig. 4.5.
These three stages are shown graphically in Fig. 4.6. The total time for this operation is given by total time
= seek time + rotational delay + transmission time.
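As a worked example of this formula, the small C program below adds up assumed figures: an average seek time of 10 ms, a rotational speed of 3600 rpm (so an average rotational delay of half a revolution), and 9 sectors per track (so the transmission time for one sector is one-ninth of a revolution). These numbers are illustrative, not the specification of any particular drive.

    #include <stdio.h>

    int main(void) {
        double seek_ms      = 10.0;                 /* assumed average seek time     */
        double rotation_ms  = 60.0 * 1000 / 3600;   /* one revolution at 3600 rpm    */
        double rot_delay_ms = rotation_ms / 2;      /* on average, half a revolution */
        double transfer_ms  = rotation_ms / 9;      /* one of 9 sectors on the track */

        printf("total = %.1f + %.1f + %.1f = %.1f ms\n",
               seek_ms, rot_delay_ms, transfer_ms,
               seek_ms + rot_delay_ms + transfer_ms);
        return 0;
    }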
The disk drive is connected to another device called ‘interface’ or ‘controller’. This device is responsible
for issuing the signals to the drive to actually position the R/W heads on the correct track, to choose the
correct sector and to activate R/W head of the appropriate surface to read the data. The controller normally
has a buffer memory of its own, varying from 1 word to 1 trackful. We will assume that the controller has a
buffer to store 1 sector (512 bytes) of data in our example.
We will now see the way these three operations, described in Fig. 4.5 can be performed. In essence, we
will see how the controller executes some of its instructions such as seek, transfer, etc.
The Seek instruction requires the target track number to which the R/W heads have to be moved. The
instruction issued by the DD to the controller contains the target address which has this target track number.
The controller stores such instructions in its memory before executing them one by one. The controller also
stores the current track number, i.e. the track number at which the R/W arm is positioned at any time. The
hardware itself senses the position of the R/W arm and stores it in the controller’s memory. At any time the
R/W arm moves in or out, this field in the controller’s memory is changed by the hardware automatically.
Therefore, at any time, the controller can subtract the
current track number from the target track number
(e.g. 10 – 7 = 3) to arrive at the number of steps that
the R/W arm has to move. The sign + or – of the
result after this subtraction also tells the controller
the direction (in or out) that the R/W arm has to move.
Figure 4.7 depicts a floppy disk after it is inserted
into a floppy drive through the slot (shown in the
figure on the left hand side).
When a floppy disk is inserted into the floppy
drive, the expandable cone seats the floppy disk on
the flywheel. The drive motor rotates at a certain
predetermined speed and therefore, the flywheel
and the floppy disk mounted on it also start rotating
along with it.
An electromagnetic R/W head is mounted on the disk arm which is connected to a stepper motor. This
stepper motor can rotate in both the directions. If the stepper motor rotates in clockwise direction, the R/W
arm moves ‘out’. If the stepper motor rotates anticlockwise, the R/W arm moves ‘in’. If you study the figure
carefully, this will become clear. This stepper motor is given a signal by the disk controller. Depending upon
the magnitude (or the number of pulses) and the direction of the signal, the direction of rotation and also the
number of steps (1 step = distance between two adjacent tracks) are determined, and therefore, the R/W arm
can be positioned properly on the desired track (Seek operation). As this is basically an electromechanical
operation, it is extremely time–consuming.
For instance, in our last example, the controller calculates
the difference between the track numbers of target track
and current track. In this case, this is 10 – 7 = 3. It means
that the R/W arm must move ‘in’ by 3 steps. Remember the
numbering scheme for the tracks? The outermost track is
Track 0, and this number increases as one goes inwards. The
controller generates appropriate signals to the disk drive to
rotate the stepper motor to achieve the Seek operation. This
completes the seek operation given in (i) of Fig. 4.5.
Figure 4.8 depicts possible connections between a
controller and a disk drive. A controller is an expensive
piece of equipment and is normally shared amongst multiple drives
to reduce the cost, but then at a reduced speed. Even if
overlapped Seek operations are possible, we will not consider
them. We will assume that a controller controls only one drive
at a time. After that drive is serviced, it turns its attention to
the next one which is to be serviced. At any moment, the drive
being serviced is chosen by the drive select signal shown in
the figure.
In this case, one could imagine some kind of a decoder circuit in action. If a controller is connected to two
disk drives, the connections shown in Fig. 4.8 will exist between the controller and each of the drives. In this
case, there will also be a control signal to the controller which can take on two values - low or high (0 or 1).
Depending upon this signal, the drive select signal between the controller and the desired drive will be made
high while that between the controller and the other drive will be kept low, thereby selecting the former drive
in effect.
The controller selects a drive and then issues signals to the selected drive to select a surface, to give
direction (IN/OUT) to the rotation of the stepper motor, to specify the number of steps and finally whether the
data is to be read or written. For all these, the signals (i.e. instructions) given by the controllers are sent to the
drive as shown in the figure. After giving these instructions to the drive, either the DATA IN or DATA OUT
wire carries the data bit serially, depending on the operation. For instance, in a read operation, data comes out
of the DATA OUT wire from the drive to the controller’s buffer.
Let us now see how the correct sector number is recognized and accessed on that track.
To understand this, we should understand the format in which data is written on the disks. In the earlier days,
IBM used to follow Count Data (CD) and Count Key Data (CKD) disk formats. IBM’s ISAM was based
on the CKD format. There was no concept of fixed sized sectors in those days. This concept was introduced
much later and was called Fixed Block Architecture (FBA). The majority of computers follow this scheme
today and therefore, we will assume fixed size sectors in our discussions to follow. Each sector follows a
specific format as shown in Fig. 4.9. According to this format, each sector is divided into address and data
portions.
Each of these is further divided into four subdivisions as shown in the figure. Let us study them one by one.
(i) Address marker (with a specific ASCII/EBCDIC code which does not occur in normal data) denotes
the beginning of the address field. As the disk rotates, each sector passes under the R/W head. The
disk controller always ‘looks’ for this address marker byte. On encountering it, it knows that what
follows it is the actual address. The address marker is also used for synchronization, so that the
address can be read properly.
(ii) The actual address, as discussed earlier, consists of three components (surface number, track
number, sector number). This address is written on each sector, at the time of formatting. As this
address passes under the R/W heads, it is read by them and sent to the controller bit by bit in a bit
serial fashion. After receiving it completely, the controller compares it with the sector number in the
target address stored in the controller’s buffer loaded by the DD. If it does not match, the controller
bypasses that sector and waits for the next sector by looking for the next address marker. If the
address now matches, it concludes that the required sector is found, and the operation is complete.
Otherwise, the procedure is repeated for all the sectors on the track. If the sector is not found at all
(wrong sector number specified), it reports an error.
A question that arises is, why maintain the address on each sector? Is it not wasteful in terms of disk
space? The controller knows the beginning of a track which is where Sector 0 starts on that track
(given by the index hole in the floppy disk). If we do not write the addresses on each sector each time,
the controller will have to wait for Sector 0 or the beginning of a track to pass under the R/W heads
and then wait for some exact time given by t = (target sector number × time to traverse a sector),
and then start reading mechanically. This is obviously impractical and highly error-prone. Another
approach is to maintain only some kind of marker to indicate a new sector. As the disk rotates and the
R/W heads pass over a new sector, the controller itself could add 1 to arrive at the next sector number
and then compare it with the sector number in the target address. But this is not done for the sake of
accuracy. Even a slight error in reading the data from the correct sector can be disastrous. Matching
of the full address is always safer.
(iii) The address CRC is kept to detect any errors while reading the address. This is an extension of the
concept of a parity bit basically to ensure the accuracy of the address field read in from the disk. As
the sector address is written by the formatting program, it also calculates this CRC by using some
algorithm based on the bits in the address field. When this address is read in the controller’s memory
for comparison, this CRC also is read in. At this time, CRC is again calculated, based on the address
read and using the same algorithm as used before. This calculated CRC is now compared with the
CRC read from the sector. If they match, the controller assumes that the address is correctly read.
If there is an error, the controller can be instructed to retry for a certain number of attempts before
reporting the error. We need not discuss the exact algorithms used for CRC at greater length here;
a small illustrative sketch appears after this list. Suffice it to say that CRC calculation can be
done by the hardware itself these days, and therefore, it is fairly fast.
(iv) The gap field is used to synchronize the Read operation. For instance, while the address and CRC
comparisons are going on, the R/W heads would be passing over the gap. At this time, the controller
can issue instructions to start reading the data portion after the R/W heads go over the gap, if the
correct sector was indeed found.
The gap allows these operations to be synchronized. If there was absolutely no gap at all, the floppy
would have already traversed some distance by the time the controller had taken the decision to read
the data.
(v) The data marker, like the address marker, indicates that what follows is data.
(vi) The data is actually the data stored by the user. This is normally 128 or 512 bytes per sector.
(vii) The data CRC is used for error detection in data transmission between the disk sector and the
controller’s buffer in the same way that the address CRC is used.
(viii) The gap field again is for synchronization before the next address marker for the next sector is
encountered.
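As a rough illustration of the CRC idea in item (iii) above, the following C sketch recomputes a 16-bit CRC over a three-byte address field and compares it with the value stored at formatting time. The CRC-16 polynomial, the three-byte address layout and the function name crc16 are illustrative assumptions; the actual on-disk algorithm and field widths differ from controller to controller.

#include <stdint.h>
#include <stdio.h>

/* Illustrative CRC-16 (CCITT polynomial 0x1021); real controllers may use a different scheme. */
static uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    uint8_t address[3] = { 2, 1, 5 };                       /* hypothetical surface, track, sector */
    uint16_t stored_crc = crc16(address, sizeof address);   /* written at formatting time          */

    address[2] ^= 0x04;                                     /* simulate a bit misread while reading */
    uint16_t computed_crc = crc16(address, sizeof address); /* recomputed on every read             */

    if (computed_crc == stored_crc)
        printf("Address read correctly\n");
    else
        printf("CRC mismatch: retry the read or report an error\n");
    return 0;
}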
Finally, a question arises as to who writes all this information such as the addresses, CRCs, markers, etc.
on the sectors of a disk? It is the formatting program. Formatting can be done by the hardware as well as
software. Given a raw disk, this program scans all the sectors, writes various markers, addresses, calculates
and writes the address CRC, leaves the desired gaps and goes over to the next sector. It follows this procedure
until all the sectors on the disk are over. Therefore, even if the actual sector capacity is high, the data portion
is only 128 or 512 bytes. This is the reason why people talk about the ‘formatted’ and ‘unformatted’
capacities of a disk. It is only the formatted capacity which is of significance to the user/programmer, as
only that can be used to write any data. Without formatting, you cannot do any Read/Write operation. The
controller and the Operating System just will not accept it.
The formatting program does one more useful thing. It checks whether a sector is good or bad. A bad
sector is the one where, due to some hardware fault, reading or writing does not take place properly. To
check this, the formatting program writes a sequence of bits in the data area of the sector from predefined
memory locations, say AREA-1. It then reads the same sector into some other memory locations, say AREA-2,
and compares these two areas in the memory, viz. AREA-1 and AREA-2. If they do not match, either the
read or write operation is not performing properly. The formatting process prepares a list of such bad sectors
or bad blocks which is passed on to the Operating System, so that the Operating System ignores them while
allocating the disk space to different files, i.e. the Operating System does not treat these as free or allocable
at all. Using the information already written on the sectors, a correct sector is located on a selected track as
discussed above.
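The bad-sector check described above boils down to a write, read-back and compare loop. The following C sketch imitates it against a small in-memory "disk"; the routines write_sector() and read_sector(), and the simulated fault on sector 7, are purely illustrative assumptions and not part of any real formatting program.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NUM_SECTORS 80

/* A tiny in-memory "disk"; sector 7 is simulated as faulty. */
static unsigned char disk[NUM_SECTORS][SECTOR_SIZE];

static void write_sector(int sn, const unsigned char *buf)
{
    memcpy(disk[sn], buf, SECTOR_SIZE);
    if (sn == 7)
        disk[sn][0] ^= 0x01;            /* simulate a stuck bit on sector 7 */
}

static void read_sector(int sn, unsigned char *buf)
{
    memcpy(buf, disk[sn], SECTOR_SIZE);
}

/* Write a known pattern (AREA-1), read it back (AREA-2) and compare the two. */
static bool sector_is_good(int sn)
{
    unsigned char area1[SECTOR_SIZE], area2[SECTOR_SIZE];
    memset(area1, 0xA5, sizeof area1);  /* any predefined bit pattern */
    write_sector(sn, area1);
    read_sector(sn, area2);
    return memcmp(area1, area2, SECTOR_SIZE) == 0;
}

int main(void)
{
    for (int sn = 0; sn < NUM_SECTORS; sn++)
        if (!sector_is_good(sn))
            printf("Sector %d is bad; exclude it from allocation\n", sn);
    return 0;
}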
With this explanation, the operation (ii) as given in Fig. 4.5 should become very clear.
Having found the correct sector, the controller activates the appropriate R/W head to actually
start the data transmission as discussed earlier. After this, data traversing in a bit by bit fashion is
collected into the bytes of the memory buffer of the controller.
This completes the entire picture of how the disk drive works. Even though some examples or figures
have been taken from hard disks and some others from the floppy disks, the concepts for both are the same.
Again, the exact formats and sizes of sectors and data storage vary from system to system, but the basic
pattern is similar. Some drives will not use stepper motors, but may use a more advanced technology. But
the basic technique still remains the same. There are some very expensive head per track or fixed head
systems wherein there is one R/W head per track, thereby obviating the need for the Seek operation entirely.
In such a case, only steps (ii) and (iii) are necessary to read a specific sector, making the operation very fast
but expensive.
Even if disks are faster than tapes and many other media, their operations involve electromechanical
movements of disk arms, actual disk rotation, and so on. Therefore, these operations are far slower than the
CPU operations which are purely electronic. This mismatch in speeds is one of the important reasons for the
development of the concept of multiprogramming as we shall see.
We will now study the controller in a little more detail. Figure 4.10 shows the schematic
of a controller.
The figure shows the following details:
(i) Buffer memory: This is the area to store the data read from the disk drive or to be written onto it. It
is a temporary buffer area. If any data is to be read from the disk to the main memory, it is first read in this
controller’s buffer memory and then it is transferred into the main memory using DMA. Even if the data is
read a sector at a time, some controllers can store a full track. Some controllers can read a full track at a time,
in which case, this buffer memory in the controller has to be at least a trackful.
(ii) Data in and data out wires: These are responsible for the data transfer between the disk drive
and the controller as shown in Fig. 4.8 and also in Fig. 4.10. As the data arrives
in a bit serial fashion, the electronics in the controller collects it and shifts it to make room for the next bit.
After shifting eight bits, it also accepts the parity bit, if maintained, depending on the scheme. The hardware
also checks the parity bit for correctness. The received bytes then are stored in the controller’s memory as
discussed in (i) above.
(iii) Current track register: This stores the track number on which the R/W heads are positioned currently.
This is done by the controller’s hardware directly, and it is updated by it if the position of the heads changes.
This information is used by the controller in deciding about the number and the direction of the steps in which
the R/W heads need to move, given any target address.
(iv) Status information: This gives information about the status (busy, etc.) or about any errors
encountered while performing read/write operations.
(v) Other information: This is any other information that a specific controller may want to store.
(vi) Instruction area: This is also a part of the memory of a controller where the I/O instructions, as the
controller would understand, are stored. Any controller is like a small computer which understands only a
few I/O instructions like “start motor”, “check status”, “seek” etc. The DD sets up an I/O program consisting
of these instructions for this controller. After receiving from the File System the instructions to read specific
blocks, the DD develops this I/O program and loads it in the controller’s memory. In many cases, the basic
skeletal program is already hardwired in the controller’s instruction area, and the Device Driver only sets a
few parameters or changes a few instructions.
After this tiny I/O program is loaded, the controller itself can carry out the I/O operation to read/write any
data from/onto the disk into/from its own memory independently. After the specific blocks are read into its
memory, the bytes pertaining to the requested logical record (e.g. customer record) can be picked up and
transferred to the main memory by DMA. For all these operations, the main CPU is not required. It is this
independence, which is the basis of multiprogramming.
The seek instruction shown in Fig. 4.10 has the target address which contains the target track number. The
controller can subtract the current track number as discussed in (iii) earlier from it to decide the number and
the direction of the steps that the R/W heads need to move as we have studied earlier.
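The subtraction mentioned above can be pictured with a small C sketch; the function name compute_seek and the convention that track 0 is the outermost track (so that a higher target track means moving IN) are assumptions made only for illustration.

#include <stdio.h>
#include <stdlib.h>

/* Decide how many steps the stepper motor must move and in which direction,
   assuming track 0 is the outermost track. */
static void compute_seek(int current_track, int target_track)
{
    int steps = abs(target_track - current_track);
    const char *direction = (target_track > current_track) ? "IN" : "OUT";

    if (steps == 0)
        printf("Heads already on track %d; no seek needed\n", target_track);
    else
        printf("Move %d step(s) %s to reach track %d\n", steps, direction, target_track);
}

int main(void)
{
    compute_seek(10, 1);   /* e.g. heads currently on track 10, target track 1 */
    return 0;
}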
(vii) Registers and execution electronics: As we know, the program in the controller’s
memory set up by the DD needs to be executed one instruction at a time. Therefore, the controller will need
some kind of registers like IR, PC, MAR, etc. as in normal computers. It will also require some electronics
to execute these instructions.
(viii) Device control electronics: This contains some electronics to control the device. Therefore, all the
device control wires shown between the controller and the device in Fig. 4.8 are connected to this portion.
The control signals are shown in Fig. 4.10 also. We have already studied them.
(ix) DMA registers and electronics: In order to transfer the data from the controller to the main memory
by DMA, the controller has to have a few registers and some electronics or logic to manage this transfer. For
instance, it requires the registers to store the amount of data to be transferred (count) or the starting memory
address where it is to be transferred. The instructions loaded by the Device Driver in the controller’s memory
shown in section (vi) pertaining to Fig. 4.10 contain the instructions to set up these DMA registers too. When
the DMA operation starts, data is transferred word by word. Each time a word is transferred, the count has to
be decremented and the memory address has to be incremented by 1. The DMA electronics continues to transfer the
data word by word, until the count becomes 0. All this and the actual data transfer are managed by the DMA
electronics shown in this section.
This completes the picture of how the disk, disk drive and the controller functions, including their
interconnections.
In order to achieve the functions (i), (ii) and (iii) mentioned in Fig. 4.5, the controller is provided with a
number of instructions that it can execute directly. One could imagine that the controller has an instruction
register whose wires will carry various signals to the actual disk drive to execute the instruction directly. For
instance, when the ‘seek’ instruction is in this register, the hardware itself will find the target track number,
current track number and accordingly generate a signal to the stepper motor of the disk drive to move the R/W
head the required number of steps in the appropriate direction.
We have assumed all along that the controller stores all the instructions and then executes them
one by one. However, some controllers have to be supplied with one instruction at a time by the DD.
In such a case, the overall control still remains with the DD in terms of supplying the next instruction
and checking whether all the instructions are executed or not.
We will now study the data transfer between the controller’s memory and the main memory. How is this
transfer achieved? Normally, this is achieved by using a mechanism called Direct Memory Access (DMA).
This transmission takes place bit parallelly, using the same data bus as the one used for other transfers
between the CPU registers and the memory.
In this scheme, the controller has two registers as shown in Fig. 4.10. These are Memory Address and Count
registers. The Memory Address register is similar to MAR in the main CPU. It gives the target address of the
memory locations where the data is to be transferred. The Count tells you the number of words to be transferred.
The MAR from the DMA is connected to the address bus and the appropriate memory word is selected by
using the memory decoder circuits. The DD uses privileged instructions to set up these initial values of the
DMA registers to carry out the appropriate data transfer. Once set up, the DMA electronics transfers
the word from the controller’s memory to the target memory location whose address is in the Memory Address
Register of the DMA. This takes place via the data bus. It then increments the Memory Address Register
within the DMA and decrements the Count Register to effect the transfer of the next word. It continues this
procedure until the count is 0.
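What the DMA electronics does can be visualized, very roughly, as the following loop written in C, though in reality it is carried out by hardware, one word per stolen bus cycle, and not by the CPU. The array names controller_buffer and main_memory and the initial register values are illustrative assumptions.

#include <stdio.h>

#define WORDS 8

/* Stand-ins for the controller's buffer memory and the main memory. */
static int controller_buffer[WORDS] = { 10, 11, 12, 13, 14, 15, 16, 17 };
static int main_memory[1024];

int main(void)
{
    /* DMA registers, set up by the Device Driver using privileged instructions. */
    int memory_address = 100;   /* target word address in main memory */
    int count          = WORDS; /* number of words to transfer        */
    int src            = 0;

    /* What the DMA electronics does, one word per stolen bus cycle. */
    while (count > 0) {
        main_memory[memory_address] = controller_buffer[src++];
        memory_address++;           /* Memory Address Register incremented */
        count--;                    /* Count Register decremented          */
    }

    printf("Transfer complete; word at address 100 = %d\n", main_memory[100]);
    return 0;
}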
When the DMA is transferring the data, the main CPU is free to do the work for some other process. This
forms the basis of multiprogramming. But there is a hitch in this scheme. Exactly at the time when the DMA
is using the data bus, the CPU cannot execute any instruction such as LOAD or STORE which has to use the
data bus. In fact, if the DMA and the CPU both request the data bus at the same time, it is the DMA which
is given the priority. In this case, the DMA ‘steals’ the cycle from the CPU, keeping it waiting. This is why
it is called cycle stealing. The CPU, however, can continue to execute the instructions which involve various
registers and the ALU while the DMA is going on.
An alternative to the scheme of DMA is called programmed I/O, where each word from the controller’s
memory buffer is transferred to the CPU register (MBR or ACC) of the main computer first and then it
is transferred from there to the target memory word by a ‘store’ instruction. The software program then
decrements the count and loops back until the count becomes 0 to ensure that the desired number of words
are transferred. This is a cheaper solution but then it has two major disadvantages. One is that it is very
slow. Another is that it ties up the CPU unnecessarily, therefore, not rendering itself as a suitable method for
multiprogramming, multiuser systems. We will presume the use of the DMA in our discussion.
Figure 4.10 (c) shows both the possible methods of I/O. From the data buffer in the controller, the data can
be transferred through the Operating System buffers or directly into the memory of the application program.
We use files in our daily lives. Normally, a file contains records of a similar type of information, e.g. an Employee
file, a Sales file or an Electricity bills file. If we want to automate various manual functions, the computer must
support a facility for a user to define and manipulate files. The Operating System does precisely that.
The user/Application programmer needs to define various files to facilitate his work at the computer.
As the number of files at any installation increases, another need arises: that of putting various files of
the same type of usage under one directory, e.g., all files containing data about finance could be put under
“Finance” directory. All files containing data about sales could be put under “Sales” directory. A directory can
be conceived as a “file of files”. The user/application programmer obviously needs various services for these
files/directories such as “Open a file”, “Create a file”, “Delete a directory” or “Set certain access controls on a
file”, etc. This is done by the File System, again, using a series of system calls or services, each one catering
to a particular need. Some of the system calls are used by the compiler to generate them at the appropriate
places within the compiled object code for the corresponding HLL source instructions such as “open a file”,
whereas others are used by the CI while executing commands such as “DELETE a file” or “Create a link”
issued by the user sitting at a terminal.
The File System in the IM allows the user to define files and directories and allocate/deallocate the
disk space to each file. It uses various data structures to achieve this, which is the subject of this section.
We have already seen how the Operating System uses a concept of a block in manipulating these data
structures.
The Operating System looks at a hard disk as a series of sectors, and numbers them serially starting from 0.
One of the possible ways of doing this is shown in Fig. 4.11 which depicts a hard disk.
If we consider all the tracks on different surfaces, which are of the same size, we can think of them as a
cylinder due to the obvious similarity in shape. In such a case, a sector address can be thought of as having
three components such as: Cylinder number, Surface number, Sector number. In this case, Cylinder number
is same as track number used in the earlier scheme where the address consisted of Surface number, Track
number and Sector number. Therefore, both these schemes are equivalent. The figure shows four platters and
therefore, eight surfaces, with 10 sectors per track. The numbering starts with 0 (may be aligned with index
hole in case of a floppy), at the outermost cylinder and topmost surface. We will assume that each sector
is numbered anticlockwise so that if the disk rotates clockwise, it will encounter sectors 0, 1, 2 etc. in that
sequence.
When all the sectors on that surface on that cylinder are numbered, we go to the next surface below on the
same platter and on the same cylinder. This surface is akin to the other side of a coin. After both the surfaces
of one platter are over, we continue with other platters for the same cylinder in the same fashion. After the
full cylinder is over, we go to the inner cylinder, and continue from the top surface again. By this scheme,
Sectors 0 to 9 will be on the topmost surface (i.e. surface number = 0) of the outermost cylinder (i.e. cylinder
number = 0). Sectors 10 to 19 will be on the next surface (at the back) below (i.e. surface number = 1), but
on the same platter and the same cylinder (i.e. cylinder number = 0). Continuing this, with 8 surfaces (i.e. 8
tracks/cylinder), we will have Sectors 0–79 on the outermost cylinder (i.e. Cylinder 0). When the full cylinder
is over, we start with the inner cylinder, but from the top surface and repeat the procedure. Therefore, the next
cylinder (Cylinder = 1) will have Sectors 80 to 159, and so on.
With this scheme, we can now view the entire disk as a series of sectors starting from 0 to n as Sector
Numbers (SN) as shown.
0   1   2   3   ...   N
This is the way one can convert a three dimensional sector address into a unique serial number. Now, in fact,
one can talk about contiguous area on the disk for a sequential file. For instance, a file can occupy Sectors 5
to 100 contiguously, even if they would be on different Cylinders, and Surfaces. A simple conversion formula
can convert this abstract one dimensional sector number (SN) back into its actual three dimensional address
consisting of surface, track, sector number (or cylinder, surface, sector number).
For example, If SN = 7 what will be its physical address? We know that a track in our example contains 10
sectors. Therefore, the first 10 sectors with SN = 0 to SN = 9 have to be on the outermost cylinder (cylinder =
0) and the uppermost surface (surface = 0). Therefore, SN = 7 has to be equivalent to cylinder = 0, surface 0
and sector = 7. By the same logic, the sectors with SN = 10 to SN = 19 will be on cylinder = 0, and surface
= 1. Therefore, if SN = 12, it has to be cylinder = 0, surface = 1 and sector = 2. Similarly, if SN is between
80 and 159, the cylinder or track number will be 1, and so on. By a similar logic, given a three dimensional
address of a sector, the Operating System can convert it into a one dimensional abstract address, viz. Sector
Number or SN.
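The conversion described above can be written down as two small C routines. The geometry constants (8 surfaces and 10 sectors per track) are those of the example; the function names are for illustration only.

#include <stdio.h>

#define SURFACES          8
#define SECTORS_PER_TRACK 10

/* Convert a one-dimensional sector number into (cylinder, surface, sector). */
static void sn_to_physical(int sn, int *cylinder, int *surface, int *sector)
{
    int sectors_per_cylinder = SURFACES * SECTORS_PER_TRACK;   /* 80 in our example */

    *cylinder = sn / sectors_per_cylinder;
    *surface  = (sn % sectors_per_cylinder) / SECTORS_PER_TRACK;
    *sector   = sn % SECTORS_PER_TRACK;
}

/* Convert (cylinder, surface, sector) back into the serial sector number. */
static int physical_to_sn(int cylinder, int surface, int sector)
{
    return cylinder * SURFACES * SECTORS_PER_TRACK
         + surface  * SECTORS_PER_TRACK
         + sector;
}

int main(void)
{
    int c, h, s;
    sn_to_physical(12, &c, &h, &s);
    printf("SN 12 -> cylinder %d, surface %d, sector %d\n", c, h, s); /* 0, 1, 2 */
    printf("Back again: SN = %d\n", physical_to_sn(c, h, s));         /* 12      */
    return 0;
}

Running the sketch reproduces the worked example above: SN = 12 maps to cylinder 0, surface 1, sector 2, and back again.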
The formatting program discussed earlier maintains a list of bad sectors which the Operating System
refers to. Therefore, these bad sectors are not taken into account for allocating/deallocating of disk space for
various files by the Operating System.
As we know, the Operating System deals with the block number for all the internal manipulation. A block
may consist of one or more contiguous sectors. If a block for the Operating System is the same as one physical
sector, the SN discussed above will be the same as Block Number or BN. If a block consists of 2 sectors or
1024 bytes, the view of the disk by the Operating System will be as shown in Fig. 4.12. In this case, if BN
= x, this block consists of sector numbers 2x and 2x + 1. For instance, Block number 2 consists of Sectors 4
and 5 as the figure depicts. Similarly, for a sector number SN, BN = integer value of SN/2 after ignoring the
fraction part if any. For instance, Sector 3 must be in Block 1 because the integer value of 3/2 is 1.
Therefore, given a block number (BN), you could calculate the one dimensional abstract sector number
(SN) and then calculate the actual three dimensional sector addresses for both the sectors quite easily and vice
versa. For example, block number 1 would mean Sectors 2 and 3 (based on our sector numbering scheme).
Given that SN = 2 and 3, it is now easy to calculate three component addresses for these sectors, as has been
discussed.
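In code form, and assuming the two-sectors-per-block scheme of Fig. 4.12, the same arithmetic looks as follows (a sketch, with illustrative names only).

#include <stdio.h>

#define SECTORS_PER_BLOCK 2   /* block = 1024 bytes, sector = 512 bytes */

int main(void)
{
    int bn = 2;
    printf("Block %d holds sectors %d and %d\n",
           bn, bn * SECTORS_PER_BLOCK, bn * SECTORS_PER_BLOCK + 1);     /* 4 and 5 */

    int sn = 3;
    printf("Sector %d lies in block %d\n", sn, sn / SECTORS_PER_BLOCK); /* block 1 */
    return 0;
}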
In our examples hereinafter, we will assume a block = 512 bytes. This is assumed to be the same as a sector
size for simplicity. Therefore, in our examples, BN will be the same as SN. The File System internally does all
the allocation/deallocation in terms of blocks only. Only when finally a Read/Write operation is to be done
are the block numbers (BNs) converted by Device Management into sector numbers (SNs), and the SNs in turn
converted into the physical addresses (cylinder, surface, sector) as discussed earlier. These are then used by
Device Management for actually reading those sectors by setting the ‘seek’ instruction in the memory of the
controller accordingly. The controller, in turn, sends the appropriate signals to the disk drives to move the
R/W arms and to read the data into the controller’s buffer and
later to transfer it to the main memory through the DMA. Here
too, though other schemes are possible, for logical clarity, we
will assume that the data is read in the Operating System buffer
first and then transferred to the AP's memory (e.g., FD area) by
DMA.
Some Operating Systems follow the technique of
interleaving. This is illustrated in Fig. 4.13. After starting from
Sector 0, you skip two sectors and then number the sector as
1, then again skip two sectors and call the next sector as 2, and
so on. We call this interleaving with factor = 3. Generally, this
factor is programmable, i.e. adjustable. This helps in reducing
the rotational delay.
The idea here is simple. While processing a file sequentially,
after reading a block, the program requesting it will take some
time to process it before wanting to read the next one. In the
non-interleaving scheme, the next block will have gone past the R/W heads due to the rotation by that time,
thereby forcing the controller to wait until the next revolution for the next block. In the interleaving scheme,
there is greater probability of saving this revolution if the timings are appropriate.
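A small C sketch of how the logical sector numbers might be laid out on the ten physical slots of a track with an interleave factor of 3; the layout array and the simple stepping rule are illustrative assumptions, not how any particular formatter works.

#include <stdio.h>

#define SECTORS_PER_TRACK 10
#define INTERLEAVE        3

int main(void)
{
    int layout[SECTORS_PER_TRACK];   /* physical slot -> logical sector number */
    int slot = 0;

    /* Number the sectors, skipping (INTERLEAVE - 1) physical slots every time.
       This simple stepping covers all slots only because 3 and 10 share no
       common factor; a real formatter handles collisions explicitly. */
    for (int logical = 0; logical < SECTORS_PER_TRACK; logical++) {
        layout[slot] = logical;
        slot = (slot + INTERLEAVE) % SECTORS_PER_TRACK;
    }

    for (int i = 0; i < SECTORS_PER_TRACK; i++)
        printf("physical slot %d holds logical sector %d\n", i, layout[i]);
    return 0;
}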
The Operating System is responsible for the translation from logical to physical level for an Application
Program. In earlier days, this was not so. In those days, an application programmer had to specify to the
Operating System the actual disk address (cylinder, surface, sector) to access a file. If he wanted to access a
specific customer record, he had to write routines to keep track of where that record resided and then he had
to specify this address. Each application programmer had a lot of work to do, and the scheme had a lot of
problems in terms of security, privacy and complexity.
Ultimately, somebody had to translate from the logical to the physical level. The point was, who should
do it? Should the Application Program do it or should the Operating System do it? The existing Operating
Systems have a great deal of differences in answering this question. Some (like UNIX) treat a file as a
sequence of bytes. This is one extreme of the spectrum where the Operating System provides the minimum
support. In this case, the Operating System does not recognize the concept of a record.
Therefore, the record length is not maintained as a file attribute in the file system of such an Operating
System like UNIX or Windows 2000. Such an Operating System does not understand an instruction such as
“fread” in C or “Read...record” in COBOL. It only understands an instruction “Read byte numbers X to Y”.
Therefore, something like the application program or the DBMS has to do the necessary conversion. At
a little higher level of support, some other Operating Systems treat files as consisting of records of fixed or variable length (like
AOS/VS). In this case, the Operating System keeps the information about record lengths etc. along with the
other information about the file.
At a still higher level of support, the Operating System allows not only the structuring of files (such as
records, fields, etc.) but also allows different file organizations at the Operating System level. For instance,
you could define the file organization as sequential, random or indexed, and then the Operating System itself
would maintain the records and their relationships (such as hashing or keys, etc.). In this case, the application
programmer could specify “get the customer record where Customer # = 1024”. This is really a part of the
Data Management System (DMS) software function that the Operating System takes upon itself. This is
true of current version of VAX/VMS which has subsumed RMS under it. It is also true of many Operating
Systems running on the IBM mainframes. Of course, the services provided are not sufficient to represent the
complex relationships existing in the actual business situations. This is why you need a separate Database
Management System (DBMS) on the top of the Operating System.
At the other extreme, the Operating System can actually provide full fledged Database functions allowing
you to represent and manipulate complex data relationships existing in business situations such as “Print
all the purchase orders for a supplier where the category of an ordered item = ‘A’”. As an example, OS/400
running on AS/400 has embedded relational database functions as a part of the Operating System itself. PICK
Operating System also belongs to the same class.
The whole spectrum from UNIX (stream of bytes) to OS/400 (integrated Database) varies in terms of how
much the Operating System provides and how much the user process has to do (ultimately, somebody has to
do the required translation.) This is shown in Fig. 4.14.
We will assume that the Operating System recognizes the file structure as consisting of various records for
our subsequent discussions.
We have said before that the File System is responsible for translating the address of a logical record into
its physical address. Let us see how this is achieved by taking a simple example of a sequential file of, say,
customer records.
Let us assume that a customer record (also referred to in our
discussion as logical record) consists of 700 bytes. The Application Program responsible for creating these
records written in HLL has instructions such as “WRITE CUST-REC” to achieve this. As we know, at the
time of execution, this results in a system call to the Operating System to write a record. Therefore, the
Operating System is presented with customer records one after the other. These records may or may not be in any
specific sequence (such as customer number). The Operating System assigns a Relative Record Number
(RRN) to each record starting with 0, as these records are written. This is because we have assumed that the
Operating System recognizes a file made up of logical records. In case of UNIX, the application program
will have to perform this.
For instance, if 10 customer records are written, 700×10 = 7000 bytes will be written onto the customer
file for RRN = 0 to 9. You can, in fact, imagine all the customer records put one after the other like carpets
as shown in Fig. 4.15. It is important to know that this is a logical view of the file as seen by the application
programmer. This is the view the Operating System would like the Application Programmer to have. It does
not however mean that in actual practice, the Operating System will put these records one after the other
in a physical sense of contiguity, as given by the sector/block numbering scheme. The Operating System
may scatter the records in variety of ways, hiding these details from the application programmer, each time
providing him the address translation facilities and making him feel that the file is written and therefore, read
contiguously.
The Operating System can calculate a Relative Byte Number (RBN) for each record. This is the starting
Byte number for a given record. RBN is calculated with respect to 0 as the starting byte number of the file
and again assuming that the logical records are put one after the other. For instance, Fig. 4.16 shows the
relationship between RRN and RBN for the records shown in Fig. 4.15.
It is clear from Fig. 4.16 that RBN = RRN×RL, where RL = record length.
This means that if a record with RRN = 10 is to be written (which actually
will be the 11th record), the Operating System concludes that it has to write
700 bytes, starting from Relative Byte Number (RBN) = 7000. Therefore, if
an Operating System recognizes a definition of a record, it can be supplied
with only the RRN and the record length. It then can calculate the RBN as
seen earlier. For an Operating System like UNIX which considers a file as
only a stream of bytes, it has to be supplied with the RBN itself along with
the number of bytes (typically equal to RL) to be read.
Let us assume that we consider our file as a stream of bytes written into various blocks which we will
consider as contiguous for convenience (that is why they were called ‘logical’). In such a case, block number
0 starts at Relative Byte Number (RBN) = 0, block number 1 starts at RBN = 512, block number 2 starts at
RBN = 1024. Therefore, it is clear that a block with a block number (BN) = N starts at Relative byte number
(RBN) = BL×N, where BL = Block Length. In our example, RBN = 512×N, because BL = 512.
Therefore, we now have two logical views. One is a logical view of records given by Fig. 4.15 and another
is the logical view of blocks given by Fig. 4.17. The point is to map the two and carry out translation between
them.
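The mapping between the two views can be sketched in C as follows: given a record's RRN, the record length and the block length of the running example, it lists which logical blocks, offsets and byte counts the record occupies. The function name map_record is an assumption made for illustration.

#include <stdio.h>

#define RL 700   /* record length */
#define BL 512   /* block length  */

/* Print the (logical block, offset, length) pieces that one record occupies. */
static void map_record(int rrn)
{
    long rbn       = (long)rrn * RL;         /* starting relative byte number */
    long remaining = RL;

    printf("RRN %d starts at RBN %ld\n", rrn, rbn);
    while (remaining > 0) {
        long lbn    = rbn / BL;              /* logical block number          */
        long offset = rbn % BL;              /* starting byte within it       */
        long chunk  = BL - offset;           /* bytes available in this block */
        if (chunk > remaining)
            chunk = remaining;

        printf("  %ld bytes in logical block %ld starting at offset %ld\n",
               chunk, lbn, offset);
        rbn       += chunk;
        remaining -= chunk;
    }
}

int main(void)
{
    map_record(4);   /* the fifth record of the running example */
    return 0;
}

For RRN = 4 it reports 272 bytes at offset 240 of logical block 5 and 428 bytes of logical block 6, i.e. physical blocks 105 and 106 once the file's starting block 100 is added, which is exactly the split discussed next.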
However, there are only (511–239) = 272 bytes left in block number 105 starting from 240. Therefore,
the Operating System has to continue writing the logical record into the next block too! It will have
to use the remaining (700–272)=428 bytes (byte numbers 0–427) of block 106. This is shown in the
Fig. 4.18.
Therefore, to write the fifth customer record (RRN=4), the Operating System will have to write
272 bytes at the end of physical block number 105, and 428 bytes at the beginning of physical block
number 106, because 272 + 428 = 700 which is the record length.
(v) The DD portion of the Operating System now translates the physical block numbers 105 and 106
into their physical addresses as discussed earlier, based on the sector/block numbering scheme. For
instance, given the disk characteristics of 8 surfaces of 80 tracks each, where there are 10 sectors per
track, we get the following equation:
Block 105: Surface = 2, Track (or Cylinder) = 1, Sector = 5
Block 106: Surface = 2, Track (or Cylinder) = 1, Sector = 6
The reader should verify this, keeping in mind our numbering scheme. The track number is synonymous
with the cylinder number. Cylinder 0 has blocks 0–79, Cylinder 1 has blocks 80–159 and so on. Again, within
cylinder 1, surface 0 has sectors 80–89, surface 1 has sectors 90–99 and surface 2 has sectors 100–109, and
so on.
Therefore, the procedure of translating the logical record address to the physical sector address can be
summed up as below.
(a) The AP issues a Read/Write request for a logical record, which results in a system call to the Operating System.
(b) The File System determines the RRN of the record involved and computes its Relative Byte Number as RBN = RRN × RL, where RL is the record length.
(c) From the RBN, it computes the logical block number as the integer part of RBN/BL and the offset within that block as the remainder, where BL is the block length. If the record does not fit in the remaining bytes of that block, the following block(s) are also involved.
(d) The logical block numbers are converted into physical block numbers using the starting block address of the file, and these in turn into the three dimensional addresses (cylinder, surface, sector) as seen earlier.
(e) While creating any file, the Operating System starts from cursor = 0 and therefore, RBN = 0. After
each record is written, the cursor is incremented by 1 by the Operating System so that the RBN is
incremented by the record length. After the Operating System knows the RBN and the number of
bytes to be written, further translation can be performed easily.
(f) After getting the physical block numbers and their offsets, the File System can request the DD to read
the required blocks. The DD then translates the physical block numbers into the three dimensional
addresses and reads the desired sectors after instructing the controller. For this, the DD normally
‘constructs’ a program for this I/O operation, and loads it into the memory of the controller. As we
know, one of the instructions in the program is the “seek” instruction in which the three dimensional
address is specified as a target address as discussed earlier. After this, the desired sectors are located
and the I/O operation is then performed as studied in detail earlier. The File System then picks up the
required bytes to form the logical record.
(g) While writing fixed length records in a file, the Operating System can keep track of the number of
records and/or the number of bytes written in that file. This information in the form of file size is
normally kept in the File Directory where there is one entry for each file for file size.
Let us now see how records are read by the Operating System on the request of the AP. An unstructured
AP for processing all the records sequentially from a file would be as shown in Fig. 4.19 (a) for C and
Fig. 4.19 (b) for COBOL. During the processing, at every time it executes the “fread” or “READ” instruction
respectively, the AP gives a call to the Operating System which then reads a logical record on behalf of the
AP as we know. The Operating System blocks the AP during this time, after which it is woken up.
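Since Fig. 4.19 is not reproduced here, the following C fragment gives a rough idea of such an unstructured sequential-processing AP; the file name CUSTOMER.DAT and the 700-byte record length are simply the assumptions of our running example.

#include <stdio.h>

#define RL 700   /* record length of one customer record */

int main(void)
{
    char record[RL];
    FILE *fp = fopen("CUSTOMER.DAT", "rb");   /* hypothetical file name */
    if (fp == NULL) {
        perror("CUSTOMER.DAT");
        return 1;
    }

    /* Each fread() below becomes a system call to the Operating System,
       which reads one logical record on behalf of the AP. */
    while (fread(record, RL, 1, fp) == 1) {
        /* process the record, e.g. print or total one of its fields */
    }

    fclose(fp);
    return 0;
}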
If there are 50 customer records, the AP requests the Operating System to read records from RRN = 0 to
49, one by one, through the fread (in C) or READ CUST-REC (in COBOL) instruction used in a loop. The
problem is: How does the Operating System resolve the problem of address translation? Let us now study
this.
(i) The Operating System maintains a cursor which gives the “next” RRN to be read at any time. This
cursor is initially 0, as the very first record to be read is with RRN=0. After each record is read, the
Operating System increments it by 1 for the next record to be read. This is done by the Operating
System itself and not by the AP. We know that the compiler generates a system call in the place of the
HLL instruction such as Read.
(ii) When the AP requests the Operating System to read a record by a system call at the “fread” (in C)
or “READ CUST-REC...” (in COBOL) instruction, the Operating System calculates RBN (Relative
Byte Number) as RBN = RL×RRN, where this RRN is given by the cursor.
Therefore, initially RBN will be 0, because, RRN = 0. For the next record, RRN will be 1 and RBN
will be 1×700 = 700 and for RRN = 2, RBN will be 2×700 = 1400. RBN tells the Operating System
the logical starting byte number from which to read 700 bytes. Needless to say, for an Operating
System like UNIX, the RBN will have to be supplied to it directly instead of the RRN, as it has no concept
of records. Let us now see how a record at this juncture with RRN = 2 and RBN = 1400 is read by the
Operating System on behalf of the AP.
(iii) The File System calculates the logical block number as the integer value of RBN/512. For instance,
for RRN = 2 and RBN = 1400, 1400/512 = 2 + (376/512). Therefore, logical block number (LBN) =
2, offset = 376.
This means that the File System has to start reading from byte number 376 of LBN = 2. But if only
this is done, the Operating System will get only (511–375) = 136 bytes out of this block. This is far
less than 700.
(iv) The File System will have to read the next block with LBN = 3 fully to get the additional 512 bytes
to achieve 136 + 512 = 648 bytes in all. This is still less than 700. The Operating System will have
to read the next block with LBN = 4 and extract the first 52 bytes to finally make it 648 + 52 = 700
bytes. Therefore, for this instruction to read
one logical record, the Operating System
has to translate it into reading a sequence of
logical blocks first, as shown in Fig. 4.20.
(v) At this stage, the File System does the
conversion from LBN to PBN by adding 100
to LBN, because, the starting block number is
100 and all allocated blocks are contiguous, as per our assumption.
(vi) Therefore, the File System decides to read 136 bytes (376 – 511) in PBN 102 + all (512) bytes in
PBN 103 and 52 bytes (0–51) from PBN 104. This is shown in Fig. 4.21. The File System issues an
instruction to the DD to read Blocks 102 to 104.
(vii) As before, the DD translates the PBNs into three dimensional physical sector addresses as given
below.
Block 102 = Surface 2, Track 1, Sector = 2
Block 103 = Surface 2, Track 1, Sector = 3
Block 104 = Surface 2, Track 1, Sector = 4
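Putting steps (i) to (vii) together, the following C sketch carries out the whole translation for the running example (RL = 700, BL = 512, the file starting at physical block 100, 8 surfaces and 10 sectors per track); the variable names are illustrative.

#include <stdio.h>

#define RL                700   /* record length                */
#define BL                512   /* block length = sector length */
#define START_BLOCK       100   /* first physical block of the file (contiguous allocation) */
#define SURFACES            8
#define SECTORS_PER_TRACK  10

int main(void)
{
    int  rrn = 2;                          /* cursor: next record to read      */
    long rbn = (long)rrn * RL;             /* step (ii): RBN = RL x RRN = 1400 */

    long first_lbn = rbn / BL;             /* step (iii): first logical block  */
    long offset    = rbn % BL;             /*             and offset within it */
    long last_lbn  = (rbn + RL - 1) / BL;  /* step (iv): last block touched    */

    printf("RRN %d -> RBN %ld, logical blocks %ld..%ld, offset %ld\n",
           rrn, rbn, first_lbn, last_lbn, offset);

    for (long lbn = first_lbn; lbn <= last_lbn; lbn++) {
        long pbn = lbn + START_BLOCK;      /* step (v): PBN = LBN + starting block */

        /* step (vii): PBN -> (cylinder, surface, sector), one block per sector */
        long per_cyl  = (long)SURFACES * SECTORS_PER_TRACK;
        long cylinder = pbn / per_cyl;
        long surface  = (pbn % per_cyl) / SECTORS_PER_TRACK;
        long sector   = pbn % SECTORS_PER_TRACK;

        printf("  PBN %ld -> surface %ld, track %ld, sector %ld\n",
               pbn, surface, cylinder, sector);
    }
    return 0;
}

Its output lists PBNs 102, 103 and 104 with the same surface, track and sector values as shown above.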
As we have seen, the Operating System can present to the AP, records in the same sequence that they have
been written, i.e. the way they had been presented by the AP to the Operating System for writing in the first
place. If another AP wanted to process the records in a different sequence - say by Customer number, it is
the duty of the AP to ensure that they are presented to the Operating System in that fashion, so that they are
also retrieved in the same fashion. If that sequence was different, what should be done? One way is to sort
the original file in the customer number sequence if all the records are to be processed in that sequence only.
Alternatively, if only some records are to be selectively
processed in that sequence as in a query (given customer
number, what is his name, address or balance?), it is advisable
to maintain some data structure like an index to indicate where
a specific record is written. One advantage of this scheme is
that you can access the file in the original sequence as well as
the customer number sequence. A sample index on customer
number as the key is shown in Fig. 4.22. Notice that RBN is
used to indicate the address of that specific record. We will
revisit this later.
Normally, there is another piece of software to maintain and
access these indexes. It is known as the Data Management
Systems (DMS). It is obvious that the index has to be in the
ascending sequence of the key, if the search time is to be
reduced. If a new record is to be added to the customer file, it will be written by the Operating System in the
next available space on the disk for that file, and therefore, will not necessarily be in the sequence of customer
number. However, that does not matter anymore, because the index is maintained in the customer number
sequence.
As soon as a record is added, the Operating System knows at what RBN the record was written. For
instance, we have studied in Sec. 4.2.4 that the fifth record, i.e. with RRN = 4 was written at RBN = 2800 if
the RL was 700. In fact, at any time the File directory maintains a field called file size. A new record has to
be written at RBN = File size. After the record is written, the file size field also is incremented by the RL.
The Operating System can pass this to DMS and the DMS can then add an entry consisting of the key and the
RBN of the newly added record to the index at an appropriate place to maintain the ascending key sequence.
Let us assume that till now, four records are written with RBNs as shown in Fig. 4.22. As is obvious from the
RBNs, they have been written in the sequence of C001, C009, C004 and C003, because that is the sequence
in which records were created and presented to the Operating System. But you will notice that the index is in
the ascending sequence of the key. Let us now assume that a new record with customer number (key) = C007
is added. Let us trace the exact steps that will take place to execute this (assuming the same scenario of RL =
700 and that the physical blocks 100 - 1099 are allocated to this file contiguously).
(i) The AP requests the DMS to write the next record from its I/O area in the memory, specifying the
position of the key in the record.
(ii) The DMS extracts the key (in this case C007) and stores it for future use.
(iii) DMS now requests the Operating System to write the record onto the customer file and return the
RBN to DMS.
(iv) The Operating System knows that till now, 4 records of 700 bytes each, i.e. totally 2800 bytes have
been written from RBN = 0 to 2799. Therefore, RBN for this new record is 2800. This can be derived
from the file size.
(v) The File System of the Operating System now does the address translation into logical blocks to be
written. The DD translates them in turn into the physical blocks required to be written, as shown in
Fig. 4.18 to discover that it has to write the last 272 bytes of block 105 (Surface = 2, Track = 1, Sector
= 5) and first 428 bytes of block 106 (Surface = 2, Track = 1, Sector = 6).
(vi) The DD requests the controller to write these sectors one by one after loading the physical target
addresses in the controller’s memory. It also transfers the data from the main memory to the
controller’s memory using DMA. One important thing that the Operating System has to take care is
that data already written is not lost. For instance, if only 272 bytes are to be written onto block 105
at the end, that particular block is read first, the last 272 bytes of it are then updated with the desired
data and then that block is written back. If this is not done, the first 240 bytes of that block will be
lost.
(vii) The controller generates the appropriate signals to the
device to seek the correct track and to write the data on
the correct sector.
(viii) The DD supervises the whole operation to ensure that
all the sectors corresponding to that logical record are
written on the disk.
(ix) Having successfully written the logical record, the
Operating System passes the RBN (which is 2800) to
the DMS as requested by the DMS.
(x) The DMS uses this RBN along with the already stored
key (C007) to modify the index. The modified index is
shown in Fig. 4.23.
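A minimal C sketch of what the DMS does in step (x): insert the (key, RBN) pair into an index kept in ascending key order. The structure index_entry, the fixed-size array and the shift-insert are illustrative assumptions; a real DMS would use far more elaborate structures.

#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 100

struct index_entry {
    char key[8];    /* customer number, e.g. "C007"             */
    long rbn;       /* relative byte number of the record       */
};

static struct index_entry idx[MAX_ENTRIES];
static int entries = 0;

/* Insert while keeping the index sorted on the key (simple shift insert). */
static void index_add(const char *key, long rbn)
{
    int pos = 0;
    while (pos < entries && strcmp(idx[pos].key, key) < 0)
        pos++;
    for (int i = entries; i > pos; i--)   /* shift larger keys to the right */
        idx[i] = idx[i - 1];
    strcpy(idx[pos].key, key);
    idx[pos].rbn = rbn;
    entries++;
}

int main(void)
{
    /* Records as created: C001, C009, C004, C003, then the new C007 at RBN 2800. */
    index_add("C001", 0);
    index_add("C009", 700);
    index_add("C004", 1400);
    index_add("C003", 2100);
    index_add("C007", 2800);

    for (int i = 0; i < entries; i++)
        printf("%s -> RBN %ld\n", idx[i].key, idx[i].rbn);
    return 0;
}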
In the earlier systems, the data management functions
such as maintenance of an index, etc. were part of the
Operating System, as in the case of IBM’s old ISAM, but as
these functions started getting more complex, separate Data
Management Systems (DMS) were written. The DMS can
either be a File Management System (FMS) or a Database
Management System (DBMS). RMS on VAX, VSAM on
IBM and INFOS on DG are examples of FMS, whereas
ORACLE, DB2, INFORMIX are examples of DBMS. DMS
is a complex piece of software and its detailed discussion is
out of the scope of the current volume. What is important
here is to understand the exact role of DMS, and where it fits
in the layered approach. DMS is in between the AP and the
Operating System, as shown in Fig. 4.24.
DMS is responsible for maintaining all these index tables based on keys. The index shown in Fig. 4.23 is
a very simple one. There can be more complicated indexes with multiple levels. Again, an index is only one
of the possible data structures used by DMS. The relational DBMS mainly uses indexes. The hierarchical
DBMS such as IMS/VS on IBM or network DBMS like IMAGE on HP, DG/DBMS on DG, IDMS on IBM,
DBMS-10 and DBMS-11 on the DEC systems normally use chains or some other techniques in addition to
indexes. However, in the chains too, RBN can serve a useful purpose as an address in the chain. For instance,
if an address of the next child record for the same parent is to be maintained, RBN again can form this address.
In any data structure such as a chain or an index, the DMS can use different methods to store the address.
Some of these methods are listed below:
l Physical Sector Addresses
l RRN
l RBN
Storing actual physical addresses is a very fast method. While reading a record, given a customer number,
if the physical address is maintained in the index, the DMS itself can access the final addresses without
having to go through various levels of address translations. But then, it has a number of disadvantages. For
one logical record, you may have to store multiple addresses as one logical record can span over multiple
sectors. A more important disadvantage is hardware dependence and the resultant lack of portability. If a
sector goes bad or if the database is moved to a larger disk with more sectors/track and more tracks/surface,
the old database/indexes will not be usable. IBM’s old ISAM had this problem, forcing the development of
VSAM which is hardware independent.
Using RRN as an address in an index or a chain is a better method, but then it requires an Operating
System which recognizes the concept of a record. This also limits the portability (UNIX will not know what
to do with the RRN for instance).
Using RBN as an address in an index or a chain is by far the most popular method as it renders portability.
We will continue to assume that RBN is used in our example.
Let us study the exact sequence of events that take place when an AP requests for a specific record (say
for customer C004) to answer a specific query like “What is the balance of customer C004?” We will assume
that the DMS is maintaining an index of customer numbers versus the RBNs as shown in Fig. 4.22. What
happens is as follows:
(i) The AP for customer inquiry prompts for the customer number in which the user is interested.
(ii) The user keys in “C004”. This is stored in the memory of the AP.
(iii) The AP supplies the key (in this case, C004) to the DMS and requests the DMS to read the desired
record.
(iv) The DMS carries out the index search to arrive at the RBN of the desired record to be read (in this case,
1400); a small sketch of such a search appears after this list. If the DMS were using chains, it would have to
have algorithms to traverse through them to search for the record with customer number = C004.
(v) DMS supplies the extracted RBN to the Operating System and requests the Operating System to read
the record by a system call (in this case, the record of 700 bytes with RBN = 1400).
(vi) The File System of the Operating System translates this RBN into the logical address(es) by the
same techniques as described in the last two sections. The DD now translates them into the physical
address(es) and then reads the required blocks in the Operating System buffer first via the controller’s
memory using DMA.
(vii) The File System then formulates and transfers the logical record for that customer to the DMS buffer
reserved for this AP. DMS is a generalized piece of software catering to many programs at a time.
Therefore, normally it allocates as many memory buffers as the number of processes using it.
(viii) The DMS transfers the required record from its buffer into the I/O area of the AP.
(ix) The AP now uses the details such as the balance, etc. in the record read in its area to display them on
the screen as desired.
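The index search of step (iv), referred to above, can be as simple as a binary search over the sorted index. The following C sketch reuses the illustrative index of Fig. 4.22; the function name index_lookup is an assumption.

#include <stdio.h>
#include <string.h>

struct index_entry { char key[8]; long rbn; };

/* Index already in ascending key order (cf. Fig. 4.22). */
static struct index_entry idx[] = {
    { "C001",    0 }, { "C003", 2100 }, { "C004", 1400 }, { "C009", 700 },
};
static const int entries = sizeof idx / sizeof idx[0];

/* Binary search: return the RBN for a key, or -1 if the customer is absent. */
static long index_lookup(const char *key)
{
    int lo = 0, hi = entries - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(key, idx[mid].key);
        if (cmp == 0) return idx[mid].rbn;
        if (cmp < 0)  hi = mid - 1;
        else          lo = mid + 1;
    }
    return -1;
}

int main(void)
{
    long rbn = index_lookup("C004");
    if (rbn >= 0)
        printf("Ask the Operating System to read %d bytes at RBN %ld\n", 700, rbn);
    else
        printf("Customer not found\n");
    return 0;
}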
While studying the address translation scheme, we have assumed contiguous allocation of blocks. In
some systems, blocks allocated to a file are not contiguous but they are scattered on the disk. The Operating
System then maintains an index or chains to access all the blocks for a file one after the other. (These indexes
or chains are different from the ones maintained by DMS for faster on-line access to specific records such as
customer number index. We will study this later.) Even if the disk space allocation to files is non-contiguous,
the address translation scheme substantially remains unchanged. The only difference arises in the way the
PBN(s) are derived from the LBN(s), as we will study.
For each file created by the Operating System, the Operating System maintains a directory entry also called
Volume Table of Contents (VTOC) in IBM jargon, as shown in Fig. 4.25.
The figure shows the possible information that the Operating System keeps at a logical level for each
file. Physically, it could be kept as one single record or multiple records, as in the case of AOS/VS, where
the access control information constitutes one small record and all the dates, etc. constitute the other. In hierarchical
file systems, normally this logical record is further broken into two logical records, to allow sharing of files,
as we will learn later. We will talk about the significance and the contents of these two logical records later.
Again, these logical records for VTOC or file directory entries are stored on the disk using some hashing
algorithm on the file name. Alternatively, an index on the file names can be maintained to allow faster access
of this information once the file name is supplied. However, in the case of UNIX, no hashing technique is
used. Entries are created sequentially, or in the first available empty slot. So, every time, the Operating System
goes through the entire directory, starting from the first entry, to access the directory entry for a given file name.
Therefore, the algorithm for this is very simple, though quite time consuming during execution.
The only thing of significance at this stage for address translation is the file address field for each file. This
signifies the address (i.e. block number) of the first block allocated to the file. If the allocation is contiguous,
finding out the addresses of the subsequent blocks is very easy. If the allocation is chained or indexed, the
Operating System has to traverse through that data structure to access the subsequent blocks of the same
file. When you request the Operating System to create a file with a name, say CUST.MAST and request the
Operating System to allocate 1000 blocks, the Operating System creates this file directory entry for this file.
As in the last example, if blocks 100 to 1099 are allocated to it, the Operating System also sets the file
address field within the file directory entry for that file to 100. This is subsequently used by the Operating System
for the I/O operations.
An AP written in C/COBOL or any other HLL has to “Open” a file for reading or writing. As we know,
“Open” is an Operating System service, and therefore, a compiler substitutes an Operating System call in the
place of the “Open” statement in the HLL program.
The system call for “Open” at the time of execution searches for a file directory entry for that file using
the file name and copies that entry from the disk in the memory. Out of a large number of files on the disk,
only a few may be opened and being referred to at a given time. The list of the directory entries for such files
is called Active File List (AFL) and this list in the memory is also arranged to allow faster access (index
on AFL using file name, etc.). After copying in the memory, the Operating System ensures that the user
is allowed to perform the desired operations on the file using the access control information. As we have
seen, for calculating the physical address for every “Read” and “Write” statement, the starting block number
(which was 100 in the previous example) in the file directory entry in the memory has to be added to the
logical block number.
If the AP adds new records to a file, the file size is altered correspondingly by the Operating System in
the file directory entry in the memory (AFL). Similarly, any time the file is referred to or modified as well
as created initially, the dates and times of this creation/modification in the file directory entry are changed
accordingly. When the file is closed by yet another system call, the updated directory entry is written back to
the disk and removed from the AFL in the memory, unless it is used by another user. For the sake of security,
the writing back on the disk can be done more frequently also.
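To fix ideas, the following C structures give a rough, purely illustrative picture of what a file directory entry and an Active File List slot might hold; the exact fields, their sizes and their names differ from one Operating System to another.

#include <stdio.h>
#include <time.h>

/* One directory (VTOC) entry per file; an illustrative layout only. */
struct dir_entry {
    char   name[32];         /* file name, e.g. "CUST.MAST"              */
    long   file_size;        /* bytes written so far                     */
    long   file_address;     /* first allocated block, e.g. 100          */
    long   blocks_allocated; /* e.g. 1000 for blocks 100..1099           */
    time_t created;          /* creation date/time                       */
    time_t modified;         /* last modification date/time              */
    unsigned short access;   /* access control bits                      */
};

/* Active File List: in-memory copies of the entries of currently open files. */
struct afl_slot {
    struct dir_entry entry;  /* copy of the on-disk directory entry      */
    int in_use;              /* 1 while at least one user has it open    */
};

#define AFL_SIZE 64
static struct afl_slot afl[AFL_SIZE];

int main(void)
{
    printf("directory entry size: %zu bytes\n", sizeof(struct dir_entry));
    printf("AFL slot 0 in use   : %d of %d slots defined\n", afl[0].in_use, AFL_SIZE);
    return 0;
}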
Basically, there are two major philosophies for the allocation of disk space (blocks) to various files, namely:
contiguous and non-contiguous.
This is what most of the earlier IBM Operating Systems such as DOS/VSE used to follow.
We have assumed contiguous allocation up to now in our examples in Sec. 4.2.4. In this scheme, the user
estimates the maximum file size that the file will grow to, considering the future expansion and requests the
Operating System to allocate those many blocks through a command at the time of creation of that file.
The main disadvantages of this scheme are space wastage and inflexibility. For instance,
until the file grows to the maximum size, many blocks allocated to the file remain unutilized because
they cannot be allocated to any other file either, thereby wasting a lot of disk space. On the other hand, by
some chance, if the file actually grows more than the predicted maximum, you have a real problem. For
instance, if all the blocks between 100 to 1099 are written and if now you want to add new customers, how
can you do it? The blocks from 1100 onwards may have been allocated by the Operating System to some
other file. One simple way is to stop selling and adding new customers. But this solution is not likely to be
particularly popular.
If you want to add a few blocks without disturbing contiguity, you have to load this file onto a tape or some
other area on the disk, delete the old directory entry, request the Operating System to allocate new contiguous
space with the new anticipated maximum file size (may be 3000 blocks), and load back the file onto the new
area. At this juncture, the earlier allocated area can be freed. Before all this is done, your program to “ADD A
CUSTOMER” cannot proceed. It will abort with a message “No disk space.” This gives rise to inflexibility.
Despite these disadvantages, it was used by some Operating Systems in the past mainly due to its ease of
implementation.
There is one advantage which can result out of the contiguous allocation, however. If the processing is
sequential and if the Operating System uses buffered I/O, the processing speed can enhance substantially.
In buffered I/O, the Operating System reads a number of consecutive blocks at a time in the memory in
anticipation. This reduces the head movement and, therefore, increases the speed of the I/O. This is because
all the blocks read in are guaranteed to be used in that sequence only. But they are read at the most appropriate
time with the least amount of head movements.
However, in an on-line query situation, an Application Program may have a query on a record residing
in one block and then the next query could be on a record somewhere else altogether, which is not in any of
the blocks held by the Operating System in the memory. The buffering concepts may not be very useful in
such a case. In fact, it may be worse. The records read in anticipation may not be used at all. The situation
while writing records may however be different. Even if records are created randomly in any sequence, the
Operating System could buffer them in the memory and write them in one shot while passing their respective
RBNs to the DMS for the index entry creation. This definitely reduces the R/W head movement and enhances
the I/O speed with contiguous allocation.
But, then, this buffering gives rise to a new complication in real time, on-line systems. For instance, if an
AP requests for some details of a customer added to the system just a while ago, and the record is only in
the main memory still and not yet written back on the disk, the DMS or Operating System should not search
for it only on the disk and abort the search with a remark “Not found”. To achieve this, some additional data
structures (e.g. the customer index holding not the RBN but the memory address, with an indicator of whether
the record is currently in the memory or on the disk) and some additional algorithms are needed. In fact, a common
method is to search through the index to find out whether the data is in the memory or not and then issue an
I/O request only if it is not in the memory but it is on the disk.
One point needs to be understood in this context. If buffering is not used, i.e. if the Operating System does
not read more consecutive blocks than necessary, the contiguity of disk space allocation does not necessarily
enhance the response time in a time sharing environment even if the processing is sequential.
The reason is that when a record for that process is read and is being processed, the CPU may be switched to another process. This process may request an I/O from an entirely different disk area, causing the disk R/W heads to move. When the original process is reactivated and the next record for that process needs to be read, the R/W heads have to move again! It is interesting to note that without buffering, even the writing speed does not increase, despite the processing being sequential and the disk space allocation being contiguous. The reason for this is the same as explained above.
How does the Operating System manage the
disk space in the case of contiguous allocation? It normally uses ei-
ther a blocks allocation list or a bit map. We will now study these.
There are two other methods, viz., Best fit and Worst fit methods to choose
an entry from the free blocks list for the allocation of free blocks. Both of these
methods would require the free blocks list to be sorted by number of free blocks.
Such a list before the allocation would be as shown in Fig. 4.29.
The best fit method would choose an entry which is smallest amongst all the
entries which are bigger than the required one. To achieve this, this sorted table is
used. In our case, where we want 7 blocks, the first entry in the sorted list is such
an entry. Therefore, blocks 41–47 will be allocated. The resulting two tables,
similar to the ones shown in Fig. 4.28 can now be easily constructed. If 10 blocks
were requested for a file, we would have to use the second entry of 16 blocks in
the sorted list and allocate blocks 5 to 14 to the new file. After this allocation, there would be only 16 – 10 = 6 free blocks left in this hole. As 6 is less than 8, which is the number of free blocks in the first entry, the list would obviously need resorting, thereby consuming more time.
However, the best fit method claims to reduce the wastage due to fragmentation, i.e. the situation where
blocks are free, but the holes are not large enough to enable any allocation. This is because, this method uses
a hole just enough for the current requirement. It does not allocate blocks from a larger hole unnecessarily.
Therefore, if subsequently, a request for a very large allocation arrives, it is more likely to be fulfilled.
The advocates of worst fit method do not agree. In fact, they argue that after allocating blocks 41 to 47,
block number 48 which is free in the example above cannot be allocated at all. This is because it is far less
likely to encounter a file requiring only one block. Therefore, they recommend that the required 7 blocks
should be taken from the largest slot, provided that it is equal to or larger than our requirement (i.e. 7).
Therefore, by this philosophy, blocks 2001 to 2007 will be allocated, thereby leaving the remaining blocks
with numbers 2008 to 6399 still allocable. This chunk is large enough to cater to other large demands. Eventually, however, very few free blocks are likely to remain, and those would most probably be unallocable even under the worst fit policy. But by then, some other blocks are likely to have been freed, thereby creating larger usable chunks after coalescing. It is fairly straightforward to arrive at the resulting two tables after the allocation using this philosophy.
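To make the two policies concrete, here is a small C sketch that picks a hole from a list of free chunks using either best fit or worst fit; the hole table below mirrors the spirit of the example (holes of 8, 16 and many blocks), but its exact contents are assumed.

#include <stdio.h>

struct hole { int start; int count; };   /* a run of contiguous free blocks */

/* Return the index of the chosen hole, or -1 if no hole is big enough.
   best = 1 selects the smallest adequate hole (best fit);
   best = 0 selects the largest adequate hole (worst fit).              */
int choose_hole(struct hole h[], int n, int need, int best)
{
    int chosen = -1;
    for (int i = 0; i < n; i++) {
        if (h[i].count < need) continue;               /* too small */
        if (chosen == -1 ||
            ( best && h[i].count < h[chosen].count) ||
            (!best && h[i].count > h[chosen].count))
            chosen = i;
    }
    return chosen;
}

int main(void)
{
    /* assumed free holes: 8 blocks at 41, 16 blocks at 5, a huge one at 2001 */
    struct hole h[] = { {41, 8}, {5, 16}, {2001, 4399} };
    printf("best fit for 7 blocks : start %d\n", h[choose_hole(h, 3, 7, 1)].start);
    printf("worst fit for 7 blocks: start %d\n", h[choose_hole(h, 3, 7, 0)].start);
    return 0;
}

With these assumed holes, best fit chooses the chunk starting at block 41 and worst fit the chunk starting at block 2001, matching the discussion above.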
In either of these philosophies, the tables have to be recreated/resorted after creation/deletion of any file.
In fact, after the deletion of a file, the Operating System has to check whether the adjacent areas are free
and if so, coalesce them and create a newly sorted list. To achieve this, the Operating System needs both the
tables shown in Fig. 4.28. For instance, let us assume that the block allocation to various files is as shown in
Fig. 4.28 at a specific time. Let us assume that CUSTOMER file is now deleted. The Operating System must
now follow the following steps:
(i) It must go through the file allocation list as given in Fig. 4.28(a) to find that 52 blocks between 49 and
100 will be freed after the deletion.
(ii) It must now go through the free blocks list as in Fig. 4.28(b) to find that blocks 41–48 and 101–200 are free and are adjacent to the chunk of blocks 49–100. It will therefore coalesce these three as shown in Figs. 4.30 and 4.31 and work out a new free blocks list. This new list is shown in Fig. 4.32.
(iii) It will sort this new free blocks list, as shown in Fig. 4.33. The new list can be used later for best or
worst fit algorithms.
Another method of maintaining the block allocation list is by using chains. Figure 4.34 shows such chains for the allocations as per Fig. 4.26.
The Operating System can reserve various slots to maintain the information about chunks of blocks. The
figure shows 16 such slots, out of which only 13, i.e. 0 to 12 are used. This is because, Fig. 4.26 contains only
13 entries. Each slot consists of 5 fields. These are listed below.
(i) Slot number (shown for our understanding - the Operating System can do without it), which is shown
in a bold typeface in the figure
(ii) (A)llocated/(F)ree status code
(iii) Starting block number of that chunk
(iv) Number of blocks in that chunk
(v) Next slot number for the same status (A or F as per the case). An asterisk ( * ) in this field denotes the
end of the chain.
At the top of the figure, we show two separate list headers. This allows us to traverse through the allocated
or free list. Therefore, this method does away with two separate tables of Fig. 4.27. If we want to go through
the free list, we would read the free list header - start address. It is 1 in this case, as shown in Fig. 4.34. We will
read slot number 1. It says that there is a free chunk starting from block 5 of 16 blocks. The next slot number
with free status is given in slot 1. It is 3 in this case. This is called a chain. We then go to slot 3, and so on.
When a file is deleted, the status indicator of that slot is changed from ‘A’ to ‘F’ and the next slot number
fields are updated in the appropriate slots to reflect this change. After this is done, the Operating System goes
through the slots sequentially without using the chains to decide about coalescing. For allocating, it goes
through free block chains. Using this method, it can perform coalescing in a better fashion, but then, the
algorithms for best or worst fit are still time-consuming because the free block chunks are not accessed in the
descending sequence of the size of free blocks in a chunk.
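A possible C representation of such slots, together with a walk along the free chain, is sketched below; the slot layout follows the five fields listed above, while the array size and the initial contents are assumptions made for illustration.

#include <stdio.h>

#define NSLOTS 16
#define END    (-1)            /* end of a chain (the '*' in the figure) */

struct slot {
    char status;               /* 'A' = allocated, 'F' = free            */
    int  start;                /* starting block number of the chunk     */
    int  count;                /* number of blocks in the chunk          */
    int  next;                 /* next slot with the same status         */
};

/* Walk the free chain starting at free_head and print every free chunk. */
void list_free_chunks(struct slot s[], int free_head)
{
    for (int i = free_head; i != END; i = s[i].next)
        printf("free chunk: blocks %d-%d\n",
               s[i].start, s[i].start + s[i].count - 1);
}

int main(void)
{
    /* a tiny, assumed example: slots 1 and 3 form the free chain */
    struct slot s[NSLOTS] = {
        [0] = { 'A',  0,  5, 2 },
        [1] = { 'F',  5, 16, 3 },
        [2] = { 'A', 21, 20, END },
        [3] = { 'F', 41,  8, END },
    };
    list_free_chunks(s, 1);    /* free list header points to slot 1 */
    return 0;
}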
The Operating System has to manipulate these slots and their chains quite frequently when a file is deleted
or created or extended. We leave it to the reader to develop these algorithms. This scheme requires a larger
memory to maintain various slots. The time taken to readjust the chains can also be considerable after the
blocks are allocated. Imagine, for instance, if 7 blocks (blocks 5 – 11) are allocated to a file NEW, the slot will
need splitting into two slots. The Operating System will have to acquire slot number 13 to manage this. After
coalescing, some slots may be unused and this scheme will have to have an algorithm to manage the free slots
also. A variation of this scheme which is far less time-consuming is through the use of bit maps. Bit maps
can be used with both contiguous and non-contiguous allocation schemes. We will now study this method.
A bit map is another method of keeping track of free/allocated blocks. A bit map maintains one bit
for every block on the disk as shown in the Fig. 4.35. Bit 0 at a specific location indicates that the correspond-
ing block is free and Bit 1 at a specific location indicates that the corresponding block is allocated to some file.
The figure shows a bit map corresponding to the original table in Fig. 4.26. The first 5 blocks are allocated,
the next 16 are free, the next 20 are again allocated and so on.
A bit map is used only to manage the free blocks and therefore, it need not show the file to which a specific
block is allocated. In contiguous allocation, the file directory entry contains the file size and the starting block
number. This information is sufficient for accessing all the blocks in the file one after the other or at random.
You do not need any help from the bit map in this regard.
When a file is deleted, the file directory entry is consulted and the corresponding blocks are freed, i.e. the
corresponding bits in the bit map are set to 0. When a file is to be created, for example, NEW with 7 blocks,
normally the first fit algorithm is chosen and the routine in the File System searches for 7 consecutive zeroes
in the bit map starting from the beginning. Having found the first such 7 zeroes, it allocates them, i.e. changes
them to 1 and creates the corresponding file directory entry with the appropriate starting block number. To
implement Best fit and Worst fit strategies using a bit map is obviously tougher unless the Operating System
also maintains the tables of free blocks in the sequence of hole size. This is fairly expensive and is the main
demerit of a bit map over a free blocks list. However, a bit map has the advantage of being very simple to
implement. At first glance, it may seem that the block list method occupies much more memory than the bit map method, but this is deceptive. The reason is that, normally, the free blocks list itself is kept in the free blocks and, therefore, does not require any extra space.
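A first-fit search over such a bit map can be sketched in C as follows. For simplicity, the map is kept as one byte per block (0 = free, 1 = allocated, as in Fig. 4.35); a real implementation would pack 8 blocks per byte, but the scanning logic stays the same.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 64

/* Find the first run of 'need' consecutive free blocks (entries set to 0),
   mark them allocated and return the starting block number, or -1.        */
int bitmap_alloc(unsigned char map[], int nblocks, int need)
{
    int run = 0;
    for (int i = 0; i < nblocks; i++) {
        run = (map[i] == 0) ? run + 1 : 0;
        if (run == need) {
            int start = i - need + 1;
            memset(&map[start], 1, need);   /* mark the run allocated */
            return start;
        }
    }
    return -1;                              /* no hole large enough   */
}

int main(void)
{
    unsigned char map[NBLOCKS] = {0};
    memset(map, 1, 5);                      /* blocks 0-4 allocated   */
    memset(map + 21, 1, 20);                /* blocks 21-40 allocated */
    printf("file NEW (7 blocks) starts at block %d\n",
           bitmap_alloc(map, NBLOCKS, 7));  /* expected: block 5      */
    return 0;
}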
to the I/O area of the AP. The reading of these blocks 3, 6 and 10 one after the other is obviously facilitated
due to the chains maintained in the FAT. You can now easily imagine how the Operating System can satisfy
the AP’s requests to read logical customer records sequentially one after the other by using an internal cursor,
and traversing through these chains in the FAT.
With chained allocation, however, on-line processing tends to be comparatively slower. A Data Management System (DMS) used for an on-line system will have to use different methods for a faster response. An index such as the one shown in Fig. 4.22 is one of the common methods used in most Relational Database
Management Systems (RDBMS). Imagine again that we have an index as shown in the figure. An Application
Program (AP) for “Inquiring about customer information” is written. A user asks for the details of a customer
with customer number = “C009”. How will the query be answered? Let us follow the exact steps:
(a) The AP will prompt for the customer number for which details are required.
(b) The user will key in “C009” as the response.
(c) This will be stored in the I/O area for the terminal of the AP.
(d) The AP will supply this key “C009” to the DMS to access the corresponding record (e.g. MS-Access
under Windows 2000).
(e) DMS will refer to the index and by doing a table search, determine the RBN as 700 (refer to
Fig. 4.22).
(f) DMS now will request the Operating System, through a system call, to read a record of 700 bytes
from RBN = 700.
(g) The Operating System will know that it has to read 700 bytes starting from Relative Byte Number = 700 (after skipping the first 700 bytes, i.e. 0 to 699). Since 700 = 1 × 512 + 188, the reading should start from byte number 188 of logical block number (LBN) = 1. Only 512 – 188 = 324 bytes will be of relevance in that block. Therefore, we need the remaining 700 – 324 = 376 bytes from the next block (i.e. LBN = 2).
(h) The Operating System will translate this as: read bytes 188 to 511 (324 bytes) of LBN 1, followed by bytes 0 to 375 (376 bytes) of LBN 2.
(i) In our example, if FILE A is the CUSTOMER file, as per the FAT, logical block 0 (given in the
directory entry) = physical block number 5. Similarly, logical block 1 will be physical block number
7 and logical block 2 will be physical block number 3 (refer to Fig. 4.37). Therefore, the DD issues
instructions to the controller to read physical blocks 7 and 3, pick up the required bytes as given in
point (h) and formulate the logical record as desired by the AP.
(j) The Operating System transfers these 700 bytes to the I/O area of the AP (perhaps through the DMS
buffers, as per the design).
(k) The AP then picks up the required details in the record read to be displayed on the screen.
This is the way the interaction between the Operating System such as Windows 2000 and any DMS such
as Access takes place.
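The arithmetic in steps (g) and (h) can be captured in a small helper routine; the following C sketch assumes 512-byte blocks, and the function name is illustrative.

#include <stdio.h>

#define BLOCK_SIZE 512

/* Translate a request for 'length' bytes starting at relative byte
   number 'rbn' into (logical block, offset, bytes in that block) pieces. */
void map_request(long rbn, long length)
{
    while (length > 0) {
        long lbn    = rbn / BLOCK_SIZE;            /* logical block number */
        long offset = rbn % BLOCK_SIZE;            /* byte within block    */
        long chunk  = BLOCK_SIZE - offset;         /* bytes left in block  */
        if (chunk > length) chunk = length;
        printf("read %ld bytes from LBN %ld starting at byte %ld\n",
               chunk, lbn, offset);
        rbn += chunk;
        length -= chunk;
    }
}

int main(void)
{
    map_request(700, 700);   /* 324 bytes of LBN 1 + 376 bytes of LBN 2 */
    return 0;
}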
An interesting point emerges. In this method, how does the Operating System find out which are the
logical blocks 1 and 2? The Operating System has to do this by consulting the file directory entry, picking up
the starting block number which is LBN 0. It then has to consult the FAT for the corresponding entry (in this
case entry number 5) and proceed along the chain to access the next entry each time adding 1 to arrive at the
LBN and checking whether this is the LBN that it wants to read. There is no way out. If logical block numbers
202 and 203 were to be accessed, the Operating System would have to go through a chain of 202 pointers in
the FAT before it could access the LBN = 202 and 203, get their corresponding physical block numbers and
then ask the controller to actually read the corresponding physical blocks.
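The chain traversal just described can be sketched in C as follows; the FAT is modelled as a simple in-memory array whose contents correspond to the assumed chain of file A (5, 7, 3, 6, 10).

#include <stdio.h>

#define FAT_FREE (-2)
#define FAT_EOF  (-1)

/* Walk the FAT chain from the file's starting block until the desired
   logical block number is reached; return the physical block number.   */
int lbn_to_pbn(const int fat[], int start_block, int lbn)
{
    int pbn = start_block;                 /* physical block of LBN 0   */
    for (int i = 0; i < lbn; i++) {
        pbn = fat[pbn];                    /* follow the chain          */
        if (pbn == FAT_EOF) return -1;     /* file is shorter than lbn  */
    }
    return pbn;
}

int main(void)
{
    /* assumed FAT for file A: chain 5 -> 7 -> 3 -> 6 -> 10 -> EOF */
    int fat[16];
    for (int i = 0; i < 16; i++) fat[i] = FAT_FREE;
    fat[5] = 7; fat[7] = 3; fat[3] = 6; fat[6] = 10; fat[10] = FAT_EOF;

    for (int lbn = 0; lbn <= 2; lbn++)
        printf("LBN %d -> physical block %d\n", lbn, lbn_to_pbn(fat, 5, lbn));
    return 0;
}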
If we had the pointers embedded in the blocks, leaving only 510 bytes for the data in each block, the chain
traversal would be extremely slow because the next pointer would be available only after actually reading the
previous block, thereby, requiring a lot of I/O operations which are essentially electromechanical in nature.
If the FAT is entirely in the memory, as in MS/DOS or OS/2, the chain traversal is not very slow, because you do not have to actually read a data block to get the address of the next block in a file. However, as the chain sizes grow, this is not the best method to follow, especially for on-line processing, as we have seen. This is the
reason why indexes are used for disk space allocations in some other Operating Systems.
An index can be viewed as an externalized list of pointers. For instance, in the previous
example for file A, if we make a list of pointers as shown below, it becomes an index. All we will need to
do is to allocate blocks for maintaining the index entries themselves and the file directory entry should point
towards this index.
5 7 3 6 10
The problem is how and where to maintain this index. There are many ways.
CP/M found an easy solution. It reserved space in the directory entry itself for 16
blocks allocated to that file as shown in Fig. 4.38.
If the file requires more than 16 blocks, the directory entry is repeated as many times as necessary.
Therefore, for a file with 38 blocks, there would be 3 directory entries. The first two entries will have all the
16 slots used (16×2=32). The last entry will use the first 6 slots (32+6=38) and will have 10 slots unused. In
each directory entry, there also is a field called ‘block count’ which, if less than 16, indicates that there are
some free slots in the directory entry. Therefore, this field will be 16, 16 and 6 in the 3 directory entry records
in our example given above. Figure 4.38 shows this field as 5 because there are only 5 blocks (5, 7, 3, 6 and
10) allocated to this file. This corresponds to file A of Fig. 4.37.
When the AP wants to write a new record on to the file, the following happens:
(a) The AP makes a request to the Operating System to write a record.
(b) If in a block already allocated to the file there is not enough free space to accommodate the new
record, the Operating System calculates the number of blocks needed to be allocated to accommodate
the new record, and then it acquires these blocks in the following manner.
l The Operating System consults the free blocks pool and chooses the required number of free blocks. Let us say, 3 blocks are needed and are acquired.
l It now writes as many block numbers as there are vacant slots in the current directory entry given by
(16 - block count). If all block numbers (in this case 3) are accommodated in the same directory entry,
nothing else is required. It only writes these block numbers in the vacant slots of the directory entry
and increments the block count field within that entry. However, if, after writing some block numbers,
the current directory entry becomes full (block count = 16), then it creates a new directory entry with
block count = 0 and repeats this step until all required block numbers have been written.
(c) Now the Operating System actually writes the data into these blocks (i.e. into the corresponding
sectors after the appropriate address translations).
Reading the file sequentially in this scenario is fairly straightforward and will not be discussed here. For
on-line processing, if you want to read 700 bytes starting from RBN=700 as given by the index as in the last
example in the section on chain allocation under non-contiguous allocation, it effectively means reading
logical block numbers 1 and 2. The File System can easily read the directory entry and pick up the second
and third slots (i.e. logical block numbers 1 and 2 - which correspond to physical blocks 7 and 3, as per
Fig. 4.38). In the same way, picking up logical blocks 35 and 36 is not as difficult now as it was in chained allocation. The File System can easily calculate that logical block numbers 35 and 36 will be available in slots 3 and 4 of the third directory entry for that file. Therefore, it can directly access these blocks, improving the response time.
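The calculation behind "slots 3 and 4 of the third directory entry" can be written down directly. The C sketch below assumes 16 block slots per directory entry, as in CP/M, and counts both entries and slots from zero (so entry 2 is the third directory entry).

#include <stdio.h>

#define SLOTS_PER_ENTRY 16

/* For a given logical block number, compute which directory entry
   (extent) and which slot within it hold the physical block number. */
void locate_slot(int lbn)
{
    int entry = lbn / SLOTS_PER_ENTRY;     /* 0 = first directory entry */
    int slot  = lbn % SLOTS_PER_ENTRY;     /* 0 = first slot            */
    printf("LBN %d -> directory entry %d, slot %d\n", lbn, entry, slot);
}

int main(void)
{
    locate_slot(35);   /* third directory entry (index 2), slot 3 */
    locate_slot(36);   /* third directory entry (index 2), slot 4 */
    return 0;
}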
With the scheme of allocating only one block at a time (as is
done in CP/M, MS/DOS or UNIX), you have a lot of flexibility and minimum disk space wastage, but then
the index size increases, thereby also increasing the search times. Do we have any via media between
allocating only one block at a time and allocating all blocks for a file, as discussed earlier in contiguous
allocation?
AOS/VS running on Data General machines and VMS running on the VAX machines provide such a via
media. AOS/VS defines an element as a unit in which the disk space allocation is made. VAX/VMS calls this
cluster. For a detailed discussion, we will follow AOS/VS methodology, though VMS follows a very similar
one. An element consists of multiple contiguous blocks, with a default of 4. The user can define a different
element size for each file. Where response time is of importance and wastage of disk space is immaterial, the
user can choose a very large element size. Whenever a file wants more space, the Operating System will find, from the bit map, as many contiguous blocks as given by the element size, and then allocate them. If
the element size is very large, we come very close to contiguous allocation as in MVS. On the other hand, if
the element size is only one block, we get closer to the other extreme as in UNIX implementation.
The element is only a unit in which the disk space is allocated essentially to reduce the index length.
Let us illustrate how the concept of the element or cluster works by taking a default element size of 4
blocks or 2048 bytes. In AOS/VS, 4 bytes (or 32 bits) are reserved for specifying the starting block number
of an element allocated to a file. Therefore, one block of 512 bytes can contain 512/4 = 128 pointers to 128
elements allocated to a file. Let us discuss how a file grows in a step by step fashion. Let us imagine that
initially, we create a file called F1 using a command given to Command Language Interpreter (CLI) on the
Data General machine. Later, we copy records from a file on the tape into this file on the disk. The file on the
disk and its corresponding indexes grow in various stages that we will trace now. The good point is that the
user/programmer need not be aware of this, as his intervention is not needed at all.
When you create a file say F1 using a command “CRE F1” for AOS/VS CLI, the Operating System
will do the following.
l Find 1 free element, i.e. 4 contiguous free, allocable blocks, using the bit map.
l Allocate them to F1 and mark them as allocated by setting the appropriate bits.
l Create a directory entry for F1, containing the block number of the first block in that element as a
pointer. This is as shown in Fig. 4.39, assuming that blocks 41, 42, 43 and 44 were allocated.
At this stage, imagine that you are running a program copying a tape file onto F1 on the disk. Let
us assume a logical record length of 256 bytes for the sake of convenience. Therefore, the first element of 2048 bytes can hold 8 data records. The pseudocode for the AP to copy the file will be as shown in
Fig. 4.40.
While copying the first 8 records, the Operating System will have no problem. After writing each record, a cursor
within the file is incremented to tell the Operating System where to write the next record. The file size field
also can serve this purpose. For the 9th record, when the AP gives a system call for “Write Disk-file-record
...”, the Operating System will realize that there is no disk space left for the file as the only element of 4 blocks
or 2048 bytes has already been written by the first 8 records. The Operating System finds this using the file
size/cursor. The Operating System now will proceed as follows:
l It will locate another free element from the free blocks pool. A point needs to be made here. Any
Operating System is normally designed in a modular way. Therefore, there will be a routine within
any Operating System to acquire n number of contiguous blocks. This routine also can be organized
as a System Service or System Call. This will be called and executed at this juncture. Let us say that
the acquired free blocks were blocks 81, 82, 83 and 84.
l It will allocate this element to F1 and mark these blocks as allocated by setting the appropriate bits in
the bit map.
Now there are two elements allocated to F1, one starting at block 41 and the other starting at block 81, but
the directory entry for F1 can point to only one of these. Therefore, we need some scheme to accommodate
both these elements. Essentially, we need an index. This index has to be kept in yet another free block.
l It will find one more free block using the bit map and allocate it to an index to contain 128 entries of
4 bytes each; and then update the bit map accordingly. Let us say that this index block was block 150.
Therefore, this will be removed from the free blocks pool.
l Now the directory entry will point to this index block, i.e. it will contain the pointer 150.
l The first two entries in the index blocks will be updated to point to the two elements as shown in
Fig. 4.41 i.e. they will contain the pointers 41 and 81 respectively. The remaining 126 slots in the
index block will still be free. The copying of another 8 tape records can now continue due to the new
element consisting of 4 blocks.
As the tape records are read, new elements are allocated to F1 and the index entries for F1 are filled
in block 150. What happens if the file requires more than 128 elements? The index block 150 can contain only
128 pointers pointing to 128 elements or 128×4=512 blocks. For file sizes larger than this, there is a problem.
In such a case, the Operating System acquires one more index block as shown in Fig. 4.42. Let us say that
block 800 in addition to block 150 is allocated as an index block. We are faced with the same problem. Which
pointer - 150 or 800 - should be maintained in the directory entry now? There are 2 index blocks, whereas
there is space for only one pointer. The Operating System uses the same trick to solve this problem. Another
block, say 51 is acquired and is assigned to a higher level index. This block also can contain 128 entries of
4 bytes each. Each entry in this case holds a pointer to the lower level index. The first two entries in block
51 are pointers to lower level indexes (150 and 800, respectively). The remaining 126 slots in block 51 are
unused at this juncture. Now, the file directory entry points to block 51. A closer study of Fig 4.42 will clarify
how this system works.
If two levels of indexes are not sufficient, a third level of index is introduced. AOS/VS supports 3
levels of indexes. It is easy to imagine this and therefore, it is neither discussed, nor shown.
Reading a file sequentially is fairly straight forward in this case. If the file is being processed from the
beginning, the Operating System does the following to achieve this:
l From the file size maintained as a field in the file directory, the Operating System determines the
index levels used. For instance, if file size is less than the element size, no index will be required.
If file size is between 1 element to 128 elements, there will be one index, and so on. This tells the
Operating System the meaning of the pointer in the file directory entry, as to the level of index to
which it points.
l The Operating System picks up the pointer in the file directory and traverses down to the required data blocks. These blocks are read and their contents transmitted to the main memory to form a logical record, which is presented to the AP.
l After block 44, the Operating System knows that one element is over, and it has to look at the next
pointer in the index (which is 81 in this case).
l By the same logic, when data block 8 which is the last data block pertaining to that index block is read
(refer to Fig. 4.42), the Operating System knows that it has to look up the next index block, whose
address is given as the second pointer in the higher level index (which is in the block number 800 in
this case).
l By repeating this procedure, the entire file is read.
For on-line processing, the AP will make a request to DMS to read a specific record, given its key. The
DMS will use its key index to extract the RBN and request the Operating System to read that record. At this
stage, AOS/VS will determine the logical blocks to be read to satisfy that request. Given the LBNs and the
file size, AOS/VS can find out the index level and the element which is to be read. For instance, from RBN,
if the Operating System knows that LBNs 3 and 4 are to be read, the Operating System can work backwards
to find out where these pointers will be. It knows that LBN 0–3 are in element 0 and LBN 4–7 in element 1.
Therefore, it wants to read the last block of element 0 and the first block of element 1 in order to read the blocks
with LBN = 3 and 4. These are physical blocks 44 and 81 respectively, as shown in Fig. 4.41. It also knows
that elements 0–127 are in index block 0. Therefore, the first two entries in this index block (given by block
150 shown in the figure), will give pointers to these two elements, i.e. element 0 and element 1. It can read
those pointers almost directly and then access the actual block numbers thereafter. This scheme therefore,
has a direct advantage over the chained allocation for on-line processing. You do not have to go through the
previous pointers to arrive at a desired one.
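This backward calculation can be sketched as follows; the block and element sizes follow the AOS/VS example (512-byte blocks, 4-block elements, 128 pointers per index block), while the function name is illustrative.

#include <stdio.h>

#define BLOCK_SIZE        512
#define BLOCKS_PER_ELEM   4
#define PTRS_PER_INDEX    128

/* Work backwards from a relative byte number to the element and the
   index-block slot through which that element is reached.            */
void locate_element(long rbn)
{
    long lbn       = rbn / BLOCK_SIZE;                   /* logical block   */
    long element   = lbn / BLOCKS_PER_ELEM;              /* element number  */
    long idx_block = element / PTRS_PER_INDEX;           /* first-level idx */
    long idx_slot  = element % PTRS_PER_INDEX;           /* slot within it  */
    printf("RBN %ld -> LBN %ld -> element %ld (index block %ld, slot %ld)\n",
           rbn, lbn, element, idx_block, idx_slot);
}

int main(void)
{
    locate_element(1536);   /* LBN 3 -> element 0, slot 0 of the first index block */
    locate_element(2048);   /* LBN 4 -> element 1, slot 1 of the first index block */
    return 0;
}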
Given the RBN and the number of bytes to be read in the record, the DMS requests the Operating System to read the record.
From the user’s viewpoint, a directory is a file of files, i.e. a file containing details about other files belonging
to that directory.
Today, almost all operating systems such as Windows 2000, UNIX, AOS/VS, VMS and OS/2 have the
hierarchical file structure. Therefore, we will not consider the single level (as in CP/M) or two level (as
in RSTS/E on PDP-11) directory structures. In a sense, these can be considered as special cases of the
hierarchical file system. In the hierarchical file structure, a directory can have other directories as well as files.
Therefore, it forms a structure like an inverted tree - as shown in Fig. 4.43.
The figure shows most of the facilities and aspects of a hierarchical file system. All the directories are
drawn as rectangles and all the files are drawn as circles. At the top is the “ROOT” directory which contains
a file, viz., VENDOR and two subdirectories, viz., PRODUCTION and COSTING. PRODUCTION and
COSTING in turn, have other directories/files under them.
Normally, when a user logs on, he is automatically positioned at a home directory. This is done by
keeping the home directory name in the record for that user, called a profile record in a profile file. In UNIX,
this record is in /etc/passwd file. Each user in the system has a profile record which contains information such
as username, password, home directory, etc. This is created by the system administrator at the time he allows
a user to use the system. For instance, while assigning a production manager as a valid user on the system,
the system administrator may decide that his home directory is /PRODUCTION. This is stored in the record
for the production manager in the profile file. When the production manager logs onto the system, the login
process, while checking the password itself consults the profile file and extracts the home directory. After this,
he is put automatically in the /PRODUCTION directory. At this juncture, after login, if he immediately gives a command to list all the files and directories without explicitly mentioning the directory from which he wants to list, the Operating System will respond by giving a list containing BUDGET, WORKS-ORDERS and PURCHASING. The list will also mention that, out of these, BUDGET is a file, whereas the other two are directories. He can then start manipulating/using them.
Similarly, a profile record for the costing manager may contain /COSTING as his home directory. In this
case, when the Costing manager logs onto the system, after checking the typed username, password with
the one in the profile record, the Operating System uses this home directory to position him at the /COSTING
directory. This information is copied at this juncture in the process control block (PCB) or (u-area in UNIX)
for the process created by him for further manipulation as we shall see.
The Operating System allows you to start from the top, i.e. the ROOT directory and traverse down to
any directory or any file underneath. At any time, right from the time you login, you are positioned in some
working or current directory (which is the same as the home directory immediately after logging in).
To reach any other directory or file from your current directory, you can specify complete path name.
For example, if you want to print the BUDGET file under the directory PRODUCTION, you can give the
following command from any current directory that you may be at:
PRINT /PRODUCTION/BUDGET
The first slash (/) denotes the root. This instruction in effect directs the Operating System to start from the
root, go into the PRODUCTION directory, search for the file called BUDGET within that directory and then
print it.
In hierarchies which are large, specification of the complete access path can become cumbersome. To
help the user out of this, many systems allow the user to specify partial path name or relative path name,
beginning with some implied point of reference such as the working or current directory. For example, assume
that a user is currently in the directory /PRODUCTION. The user can now just say PRINT BUDGET to have
the same effect as specifying the complete path name above. The absence of a slash (/) at the beginning of the path name tells the Operating System command interpreter that the path is partial with respect to the current
directory; and it does not have to start with the root. This is obviously possible because the Operating System
maintains a cursor or pointer at the current position in the hierarchy for each user in the PCB or the u-area
of the process that the user is executing (in this case, it may be the command interpreter or a shell process).
In UNIX, by convention, the directory name “.” refers to the current directory and “..” always refers to the
directory which is one level higher. (i.e. parent directory.) Let us assume that after login, you are in your home
directory /PRODUCTION. The path name “..”, therefore, would refer to the root directory. Assume that you
are in the WORKS-ORDER directory under /PRODUCTION directory. If you want to move from there to the
PURCHASING directory, under the same parent directory, you can use a command with “../PURCHASING”
as the pathname to move to that directory. This is because, the “..” in the command takes you one level up,
i.e. to /PRODUCTION directory, and therefore, “../PURCHASING” would take you to /PRODUCTION/
PURCHASING directory. This now becomes your current or working directory which is stored in the PCB
or u-area of your process. You can now issue a command to list all the files in the PURCHASING directory
under /PRODUCTION without having to specify any pathname at all.
If you are in the /PRODUCTION/WORKS-ORDER directory and if you want to open the “BUDGET” file
under the “/COSTING” directory, you will have to issue a command to open “../../COSTING/BUDGET”. After this, an Operating System such as UNIX will execute this command if you have the appropriate access rights. This is because the first “..” would take you to /PRODUCTION, the next “..” would take you to the root (/) itself, and therefore “../../COSTING/BUDGET” is the path that leads to the desired file.
If there was no hierarchy, and all the files for all the users were
put together under one big, global directory, there would be chaos. First of all, there would be a problem of
avoiding duplicate file names. How can one prevent a user from giving his file a name which some other user has already used? Of course, a user can keep suggesting names and the Operating System can keep checking each against the list of all existing files (which could run into hundreds). But this is cumbersome.
Therefore, it is much more convenient to have separate directories for different users or applications. In fact,
with the passage of time, a user or an application also may have hundreds of files underneath. It is obviously
handy to subdivide even these, in turn, in terms of different subdirectories, depending upon the subject or
interest. In such a case, however, the same filename can be used a number of times in the whole system, so
long as it does not appear more than once under one directory. Figure 4.43 shows the same name BUDGET
used for two files. This is legitimate, as these files are under different directories and therefore have different pathnames.
With a hierarchical file system, sharing of files or directories is possible, thereby obviating
the need for copying the entire file or directory. This saves disk space. Figure 4.43 shows PURCHASING
directory itself along with all the files underneath it as shared by two directories, viz. /PRODUCTION and /
COSTING. Therefore, you can reach the PURCHASING directory and/or all the files belonging to it from
either /PRODUCTION or /COSTING. This saves duplication. This is achieved by what is known as linking.
The idea is that there is only one copy of PURCHASING directory and also one copy each of all the files
underneath it. However, it is pointed to from two directories. We will study linking later in detail.
The hierarchical file system normally also allows aliases, that is, referencing the same file by two names.
For example, the same physical file depicted in Fig. 4.43 can be accessed with the name VENDOR from
the root directory and with the name SUPPLIER from the /PRODUCTION/PURCHASING directory. This
means that the access paths /VENDOR and /PRODUCTION/PURCHASING/SUPPLIER denote the same
physical file. This also is achieved by linking.
For sharing, the file or directory has to previously exist under one directory, after which you can create
links to it from another directory. For instance, if the file VENDOR already exists under the root directory,
you can create a link from the directory PURCHASING to the same file but call it by a different name such
as “SUPPLIER”. There can be two types of links, as is the case with different implementations of UNIX. One is called a soft (or symbolic) link. In this case, in the PURCHASING directory, you create a file called SUPPLIER, but that file contains only one record giving the pathname, which in this case is /VENDOR. Therefore, when the Operating System tries to access the SUPPLIER file, it will come across this pathname record. It will then separate the / and the VENDOR and actually resolve the pathname by traversing the path from the root (/) to the VENDOR file. This is how the Operating System reaches the same file. Since the link is recorded symbolically as a pathname, which is resolved afresh on every access, it is called a soft or symbolic link.
The other method of file sharing is by a hard link. In this case, there is a field maintained for each file in the directory entry known as a usage count (or link count). This will be 2 if a file is shared from 2 directories. Every time a hard link is created to an existing file, the usage count maintained in the file directory entry for that file is incremented. Similarly, when the file is deleted from any directory, its usage count is reduced by 1 first, and only if the usage count now becomes 0 is the file physically deleted, i.e. the blocks allocated to that file are added to the list of free blocks, because the Operating System can then be sure that the file will no longer be used by anybody.
What will happen if you are in the root directory and you delete the file VENDOR? If a soft link was used, the file will be physically deleted and the blocks will be freed immediately. If you then try to access it from the PURCHASING directory, you will open the SUPPLIER file, attempt to resolve the /VENDOR path, and fail, because that path no longer exists. However, if hard links are used, the physical file is not deleted and the blocks are not freed, because the usage count before executing the DELETE instruction was 2. After executing this DELETE instruction, the Operating System breaks the connection between the VENDOR file and the root directory, and reduces the usage count in the directory entry by 1. Now the usage count becomes 1. The Operating System does not delete the file physically because the usage count is 1 and not 0. The same file is still accessible from the PURCHASING directory. For all files with usage count = 1 (i.e. unshared), the DELETE instruction would make the usage count 0. At that time, the file is physically deleted, thereby freeing the blocks occupied by that file.
Files represent information which is very valuable for any organization. For each
piece of data, the management would like to assign some protection or access control codes. For instance,
a specific file should be readable only for users A and B, whereas only users B and C should be allowed to
update it, i.e. write on it, while perhaps all others should not be able to access it at all. This access control
information is a part of the directory entry and is checked before the Operating System allows any operation
on any data in that file.
With hierarchical file system, you could group various files by common usage, purpose and authority.
Therefore, you could set up different access controls at the directory levels instead of doing so at each
individual file level, making things easier and controls better. When you deny a user access to a directory, he obviously cannot access any file within it either. In our figure, you could say that the /PRODUCTION directory should be under the control of the production manager, the /COSTING directory should be under the control of the costing manager, and so on. The PURCHASING directory, and therefore all the files under it, could be read by the costing, production and purchase managers, etc. Specifying these access controls
at different levels is facilitated by having a hierarchical file system.
In the last section, we have seen different aspects and benefits of the hierarchical file system. In this section,
we will examine how this is normally implemented
internally by the Operating System.
One idea to implement this would be to treat a
directory as a file of files. For instance, the file for
root directory shown in Fig. 4.43 will have entries as
shown in Fig. 4.44. Each of the entries is essentially a
record for the file or a subdirectory within it (denoted
by the “Type” field).
The field “Type” indicates whether it is a directory or a file. If it is a file, the address field tells you the
address of the first block, element or the block number of the index at an appropriate level depending upon
the disk space allocation method used. Using this, the address translation between the logical block number
(LBN) and the physical block number (PBN) is achieved as studied earlier. If it is a DIRECTORY entry,
the address is the block number where the details of files and directories within that directory are found.
The details will be maintained in the same fashion, as shown in Fig. 4.44. This means that once the “record
layout” of the entries or records for the “directory file” is determined by the Operating System, it can easily
pick up the relevant fields such as “Type”, “Address” within a record and take the necessary action.
For instance, if you read block 50 shown against
PRODUCTION in Fig. 4.44, you would get the
entries for PRODUCTION directory as shown in
Fig. 4.45. This information tallies with Fig. 4.43.
Now, if the file BUDGET is to be read, the Operating
System can access the field for “Address” in the
entry for that file, and access all the data blocks as
per the disk space allocation method.
Similarly, if you read block number 70, you
would get the details of the COSTING directory, as
shown in Fig. 4.46.
However, there is a problem in this scheme. It
duplicates the information about PURCHASING directory which is essentially a shared directory in both the
directories to which it belongs. This is evident from Figs. 4.45 and 4.46. In an actual environment, where a
file or a directory can be shared by many programmers/users, this duplication or redundancy can be expensive
in terms of disk space. The fact that the file name PURCHASING is duplicate cannot be helped, but all other
details, apart from the name, are unnecessarily repeated, and this is not insignificant (refer to Fig. 4.25). This
redundancy is expensive from another point of view as well. When you delete a file or a directory - how can
the Operating System decide when to actually free the blocks allocated to that file? For instance, even if you
delete the PURCHASING directory from both PRODUCTION as well as COSTING directories, how can the
Operating System take a decision of actually freeing the blocks unless it goes through all the directories to
ensure that PURCHASING does not belong to a third directory as well? Essentially, how can it use the idea
of the usage count that we had talked about?
To solve these problems, normally the file directory entry is split in two parts:
Basic File Directory (BFD)
Symbolic File Directory (SFD)
The BFD gives the basic information about each physical file including its type, size, Access Rights, Address
and a lot of other things such as various dates, etc. as mentioned and shown in Fig. 4.25 (in UNIX, each BFD
entry is called i-node and the BFD itself is called the inode list). On the other hand, SFD gives you only the
file name and the pointer to the BFD (by BFD Entry Number or BEN) for that file. Therefore, if the same
physical file with two same or different names in two different directories is shared by these directories, there
will be two SFD entries with appropriate (same or different) file names but both pointing towards the same
BFD entry. i.e. both of these will have the same BEN. The usage count in the BFD entry will be 2 in this case.
The SFD entry does not have a usage count.
There is one and only one BFD entry for every physical file or directory in the system, regardless of the
sharing status. If a file or a directory is shared, the usage count for that file will be more than 1 and there will
be as many entries in different symbolic file directories (SFDs) pointing towards the same BFD entry i.e. with
the same BEN. For the directory structure shown in Fig. 4.43, we have shown the BFD and SFD entries in
Figs. 4.47 and 4.48 respectively. We will now study these in a little more in detail.
This contains one entry for every file. The fields in the BFD entry are as follows:
(i) BFD Entry Number (BEN) or File ID: This is a serial number for each entry starting from 0. Each
file has a unique BEN. As this is a serial number, it actually need not be stored in the BFD. Knowing
the length of each BFD entry, the Operating System can access any BFD entry directly, given its
BEN. This field is still shown as a part of BFD only for better comprehension.
(ii) Type: This denotes whether a file is a data file (DAT) or a directory (DIR) or Basic File Directory
(BFD) itself ( the very first entry in the BFD with BEN = 0 ).
(iii) Size: This refers to the file size in blocks.
(iv) Usage Count: This refers to the number of directories sharing this file/directory. If this is 1, it is not
shared, but if it is more than 1, it means that sharing is involved.
(v) Access Rights: This gives information about who is allowed to perform what kind of operations
(read, write, execute, etc.) on the file/directory.
(vi) Other Information: Other information that the BFD can hold is as shown in Fig. 4.25 and is discussed
below.
(a) File-date of creation, last usage, last modification: These are self-explanatory. The Operating
System updates these any time anybody touches the file in any way. This information is useful in
deciding questions such as “When was the program modified last?”, and therefore, it enhances
security.
(b) Number of processes having this file open: While on the disk, this field is 0. When a process
opens this file, the BFD entry for that file is copied in memory, and this field is set to 1. Any time
another process also opens it, the BFD entry for that file does not need to be copied from the disk.
The Operating System only increments this field by 1. This field is, therefore, used like usage
count in the BFD, except that it is used at the run time during execution. When a process closes
a file, the Operating System decrements this field by 1 and it removes the in-memory BFD entry
only if this field becomes 0.
(c) Record length: This is maintained by an Operating System only if it recognizes a logical record
as a valid entity. If maintained, this is used in calculating the relative byte number from the
request of the user process to read a logical record. In UNIX, this field does not exist.
(d) File key length, position, etc.: This is maintained by an Operating System basically for the
non-sequential access. Typically, this is done when access methods such as Indexed Sequential
Access Method (ISAM) are a part of the Operating System. This is not a very commonly
followed practice today. The user has to specify the length and position of the key within a
record to the Operating System, which is then used for building indexes by the Operating System
automatically as the records are added.
(e) File organization: Again, this is maintained, depending upon the support level of the Operating
System (refer to Sec. 4.2.3).
(vii) Address: This gives the block number of the first data block or element or the index at an appropriate
level, depending upon the disk space allocation method used and the file size. For instance, in
contiguous and chained allocations, it is the block number of the first block. In indexed allocation, it
is the block number of either the data or the index block at the appropriate level, depending upon the
file size. This has been discussed earlier.
Therefore, BFD is a directory containing information about all the files and directories in the File System.
The BFD itself is normally kept at a predefined place on the disk. The Operating System knows about it.
It may be said to be hardcoded. The first entry in the BFD is for the BFD itself. This is basically kept for the sake of
completeness. This is because, BFD itself is kept as a file. We will ignore this entry (with BEN = 0).
The second entry in the BFD with BEN = 1 is for the root directory, and this also is fixed (in UNIX, it is
BEN or inode number = 2). When you want to read any file, you have to read the root directory first. Because its place is fixed, the Operating System can easily read it and bring it into the main memory. The entry in our
example (Fig. 4.47) says that it is a directory (type = DIR), and it occupies 1 block (size = 1) at block number
2 (Address = 2). If the Operating System actually wants to read it, it will have to read block 2 with necessary
address translation. After reading this, the Operating System will get the contents of the root directory as
shown in the SFD of Fig. 4.48 (a).
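Under the scheme described above, the BFD and SFD entries could be declared roughly as follows in C; the field names and widths are illustrative and do not correspond to the exact on-disk layout of any particular Operating System.

#include <stdint.h>

/* One Basic File Directory entry: one per physical file or directory.
   (In UNIX, the corresponding structure is the i-node.)                */
struct bfd_entry {
    char     type[4];       /* "DAT", "DIR" or "BFD"                    */
    uint32_t size;          /* file size in blocks                      */
    uint16_t usage_count;   /* number of directories sharing this file  */
    uint16_t open_count;    /* processes currently having the file open */
    uint16_t access_rights; /* e.g. read/write/execute bits             */
    uint32_t address;       /* first data block, element or index block */
    /* dates of creation / last use / last modification, etc. omitted   */
};

/* One Symbolic File Directory entry: just a name and a pointer (BEN)
   into the BFD; two SFD entries sharing one file carry the same BEN.   */
struct sfd_entry {
    char     name[14];      /* file or directory name                   */
    uint32_t ben;           /* BFD Entry Number of the physical file    */
};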
After the first two entries, the BFD has one entry for each unique file/directory.
When a file is to be deleted, the Operating System does the following:
l It locates and removes the entry for that file in the SFD of its parent directory. The Operating System may use the same hashing or directory search techniques that have been used for maintaining these entries to begin with. After removing the entry, the Operating System may maintain zeroes, spaces or some special characters to signify a free entry in the SFD.
l It subtracts 1 from the usage count in the BFD entry for this file, i.e. the one with BEN=13. Now if
usage count does not become 0, it takes no action and exits the routine.
However, if it does become 0 (which it does in this case), it does the following.
l It frees the blocks allocated to this file and adds them to the free blocks pool maintained as lists,
indexes or bit maps. The Operating System uses the Address field in the BFD entry to traverse through
the chains or indexes (as per the allocation method) to access the block number allocated to that file
before it can do this job.
l It removes the BFD entry for that file.
It should be noted that all these changes in the BFD are made in the in-memory BFD images first using
various data structures described above. Periodically (for better recovery), and finally at shutdown, the updated data structures are copied onto the disk at the appropriate locations in the BFD, so that the next time the system is used, you get the updated values.
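Putting these deletion steps together, a much simplified in-memory version might look like the C sketch below; the structures are the illustrative ones given earlier, and free_file_blocks() stands in for whichever free-list, index or bit-map routine the allocation method actually requires.

#include <string.h>

#define SFD_SIZE 32

struct bfd_rec { int usage_count; int address; /* other fields omitted */ };
struct sfd_rec { char name[14]; int ben; };

/* Hypothetical helper: returns the data blocks of the file to the free
   blocks pool (list, index or bit map, as per the allocation method).  */
void free_file_blocks(int first_block_address) { (void)first_block_address; }

/* Delete 'name' from one directory: remove its SFD entry, decrement the
   usage count and free the physical file only when the count reaches 0. */
void delete_file(struct sfd_rec sfd[], struct bfd_rec bfd[], const char *name)
{
    for (int i = 0; i < SFD_SIZE; i++) {
        if (strcmp(sfd[i].name, name) != 0) continue;
        int ben = sfd[i].ben;
        memset(&sfd[i], 0, sizeof sfd[i]);         /* zeroes mark a free SFD slot */
        if (--bfd[ben].usage_count == 0) {
            free_file_blocks(bfd[ben].address);    /* blocks back to free pool    */
            memset(&bfd[ben], 0, sizeof bfd[ben]); /* remove the BFD entry        */
        }
        return;
    }
}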
An algorithm to create a file under a directory is almost reverse of this and can be easily imagined.
An algorithm for creating a hard link for a file/directory essentially parses the path name, locates the BFD entry and increments its usage count. It also inserts the file/directory name in the required SFD with the same BEN as that of the file being linked.
If the user wants to go up in a directory, the entry with “..” in the SFD can be used. For instance, if we want
to traverse from PURCHASING —> COSTING —> BUDGET, using a relative path name ../BUDGET when
we are in PURCHASING directory, the following algorithm is executed.
(i) The Operating System will parse the path name.
(ii) The Operating System will read the “..” entry in the SFD of PURCHASING as shown in Fig. 4.48
(d). It gives BEN = 5. This is the BEN of the parent directory (which in this case is the COSTING
directory).
(iii) It will access BFD entry with BEN = 5.
(iv) It will know that it is a directory starting at block 75. It will verify its access rights and then read it to
get the SFD of COSTING. The contents of the SFD for COSTING are as shown in Fig. 4.48 (c).
(v) It will now perform the search for a name BUDGET in the SFD for COSTING and will store its BEN
= 10.
(vi) Having located the desired file, it will proceed to take any further permissible actions.
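The six steps above amount to two SFD lookups. The following C sketch simulates them with a tiny in-memory table of SFDs indexed by BEN; the BEN values for COSTING (5) and BUDGET (10) follow the example, while the remaining table contents are assumptions.

#include <stdio.h>
#include <string.h>

#define MAXENT 8

struct sfd_entry { char name[14]; int ben; };

/* A tiny in-memory stand-in for the disk: the SFD of each directory,
   indexed by that directory's BEN (contents assumed for the example). */
struct sfd { int nentries; struct sfd_entry e[MAXENT]; } sfd_of[16] = {
    [5] = { 3, { { "..", 1 }, { "BUDGET", 10 }, { "PURCHASING", 7 } } }, /* COSTING    */
    [7] = { 2, { { "..", 5 }, { "SUPPLIER", 3 } } },                     /* PURCHASING */
};

/* Search one directory's SFD for a name and return its BEN, or -1. */
int lookup(const struct sfd *d, const char *name)
{
    for (int i = 0; i < d->nentries; i++)
        if (strcmp(d->e[i].name, name) == 0)
            return d->e[i].ben;
    return -1;
}

int main(void)
{
    int cur_ben = 7;                                  /* we are in PURCHASING */
    int parent  = lookup(&sfd_of[cur_ben], "..");     /* step (ii): BEN = 5   */
    int budget  = lookup(&sfd_of[parent], "BUDGET");  /* steps (iii)-(v)      */
    printf("../BUDGET resolves to BEN %d\n", budget); /* prints BEN 10        */
    return 0;
}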
The data structures that are maintained in the memory have to take care of various requirements. They
have to take into account the following possibilities:
One process may have many files open at a time, and in different modes.
One file may be opened by many processes at a time and in different modes.
For instance, a Customer file may be opened by three processes simultaneously. One may be printing the
name and address labels sequentially. Another may be printing the statements of accounts, again sequentially,
and the third may be answering an on-line query on the balances. It is necessary to maintain separate cursors
to denote the current position in the file to differentiate an instruction to read the next record in each case, so
that the correct records are accessed.
The exact description of these data structures and algorithms is beyond the scope of this text, though it is
not difficult to imagine them. This brings us to the end of our discussion about the File Systems. We need to
examine the DDs more closely to complete the picture.
A file is a collection of data which is stored on secondary storage devices like hard disks, magnetic tapes, CD-ROMs, etc. When data is being processed, it is present in the primary memory or RAM of the computer. Primary memory or RAM is volatile and cannot be used for permanent data storage. In almost all computing applications, we require data to be stored permanently for future use.
Processes require data for processing, and these processes execute in the primary memory. Primary memory has a size limitation, and large data cannot be stored in it. Even if a particular process is able to accommodate large data, other processes may not get enough space for their execution. Primary memory is for data processing and not for storage. Hence, files are used to store large data permanently on the secondary devices, even after the process has completed its execution (either successfully or unsuccessfully).
Access Management describes how a file is accessed from the storage devices for processing. Processing may sometimes require all the records, a set of records, a particular record, the first record or the last record, etc.
Early Operating Systems provided only one type of file access, called “sequential access”. Sequential access means that all the bytes of the file are read sequentially (from the beginning to the end of the file) one by one. It is not possible to skip records and jump to any specific record. In other words, we cannot select a specific record without reading the intermediate records in that file. Sequential files are widely used when the storage device is magnetic tape, since magnetic tapes provide only sequential access.
Magnetic/optical disks provide access to any record directly. They allow us to choose any bytes or records out of order. It is also possible to choose a record by using a key rather than the position of the record, or to move directly to a particular position by specifying the byte number. This access type is called “random access”. Random access is a requirement of many applications, and all DBMS and RDBMS products use the random access mechanism.
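The difference can be seen with the standard POSIX calls: read() consumes records one after the other, while lseek() moves the file offset so that an arbitrary record can be read directly. The file name and record length in the sketch below are assumptions made for the example.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define RECLEN 256               /* assumed fixed record length */

int main(void)
{
    char buf[RECLEN];
    int fd = open("customer.dat", O_RDONLY);   /* assumed data file */
    if (fd < 0) { perror("open"); return 1; }

    /* sequential access: each read() returns the next record in turn */
    while (read(fd, buf, RECLEN) == RECLEN)
        ;                                      /* process record ... */

    /* random access: jump straight to record number 9 (0-based)      */
    lseek(fd, 9L * RECLEN, SEEK_SET);
    if (read(fd, buf, RECLEN) == RECLEN)
        printf("read record 9 directly\n");

    close(fd);
    return 0;
}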
When an Operating System allows multiple users to work simultaneously, it is quite possible that more than one user demands the same file. When many users are just reading the same file, there is no problem; but if more than one user is writing to or updating the same file (here, the same file means a file with the same name at the same location or path), this would lead to problems. In such situations, the data of one user will be stored, while the data of other users may be deleted or overwritten. We cannot exactly predict what will happen to the data of that file. In a single user Operating System, this will not happen, since there will be only one user working at any given time.
A multiuser Operating System must provide appropriate ways to deal with all such incidents through a file sharing mechanism. In a multiuser environment, the Operating System has to maintain more file/directory attributes than a single user Operating System.
A multiuser Operating System maintains attributes such as the owner of the file/directory. The owner of the file is able to perform all operations on that file. There is also a provision to maintain the other users of that file/directory. These users can perform a subset of operations (e.g. read the file) but are not able to perform all operations on the file (e.g. write to the file or delete it) the way the owner can.
The owner ID and the IDs of the other users or members of a given file are stored with the other file attributes. When a user requests any operation on a file, the user ID is compared with the owner attribute of the file to determine whether the user is the owner or not, and likewise for the other users. The result of this comparison determines which permissions are applicable to that user.
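A simplified version of this comparison is sketched below in C; the attribute layout and the permission bits are illustrative, loosely modelled on the owner/other split described above.

#include <stdbool.h>

#define PERM_READ  04
#define PERM_WRITE 02

struct file_attr {
    int owner_id;        /* user ID of the file's owner            */
    int owner_perms;     /* permissions applicable to the owner    */
    int other_perms;     /* permissions applicable to other users  */
};

/* Decide whether 'uid' may perform the operation given by 'perm_bit'. */
bool access_allowed(const struct file_attr *f, int uid, int perm_bit)
{
    int perms = (uid == f->owner_id) ? f->owner_perms : f->other_perms;
    return (perms & perm_bit) != 0;
}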
Files are stored in directories or folders. A directory is a collection of files, and each file must belong to a directory. Following are some directory implementations used by Operating Systems.
Here, the users are allowed to create a directory directly inside the root directory. However, once such a directory is created by the user, the user cannot create sub-directories under that directory. This design helps users keep their files separately under their own directories. It also allows having a file with the same name more than once on the disk, but under different user directories.
In this structure, there should be a system directory to access all system utilities. Otherwise, all the users need to copy the system utilities into their own directories, which results in wastage of disk space.
This structure goes beyond the two-level system and allows users to create a directory under the root directory and also to create subdirectories under it. Here, the user can create many subdirectories and then maintain different files in different directories based on the types of files.
In this directory structure, files are identified and accessed by their locations. The location of a file is described
using a path. There are two types of paths.
An absolute path name describes the file name and its location, considering the root directory as the base
directory.
e.g. usr/david/salary.doc -> this means that the salary.doc file is present in the david directory, david is a
sub-directory of usr, and usr is present at the root of the disk.
In the relative path convention, the file name is described considering a user-specific directory as the base or
reference directory. The base directory can be the user’s current working directory.
e.g. usr/david/prg/payroll is the directory structure and prg is the base directory. Then, to access the
“bonuscal.c” file, we can use the path payroll/bonuscal.c.
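The same distinction shows up directly in program code. The following is a minimal sketch using the POSIX open() call; the file names are the hypothetical ones from the examples above (with a leading / added to make the first path truly absolute).

#include <fcntl.h>    /* open() */
#include <unistd.h>   /* close() */

int main(void)
{
    /* Absolute path: resolved from the root directory, irrespective of the
       current working directory. */
    int fd1 = open("/usr/david/salary.doc", O_RDONLY);

    /* Relative path: resolved from the current working directory. If the
       process is currently in /usr/david/prg, this names
       /usr/david/prg/payroll/bonuscal.c. */
    int fd2 = open("payroll/bonuscal.c", O_RDONLY);

    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    return 0;
}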
create new directory. The name must be unique under that particular directory. When a new
directory has been created, only the dot and dotdot entries can be seen in it.
delete empty directory. A directory which contains files and subdirectories cannot be
deleted. When a directory contains only dot and dotdot, the directory is considered to be empty.
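The POSIX calls mkdir() and rmdir() behave in exactly this way. The sketch below is only illustrative, and the directory name used is hypothetical; rmdir() fails if the directory still contains anything other than dot and dotdot.

#include <sys/stat.h>   /* mkdir() */
#include <unistd.h>     /* rmdir() */
#include <stdio.h>      /* perror() */

int main(void)
{
    /* Create a new directory; the name must be unique under the parent directory. */
    if (mkdir("reports", 0755) == -1)
        perror("mkdir");

    /* Delete it; this succeeds only if the directory is empty, i.e. it contains
       nothing but the dot and dotdot entries. */
    if (rmdir("reports") == -1)
        perror("rmdir");

    return 0;
}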
Files are stored on the disk. As the disk space is limited, we need to reuse the space from deleted files for
new files. So, the management of disk space is a major concern to file system designers. To keep track of
free disk space, the system maintains a free-space list. The free-space list records all the free disk blocks
that are not allocated to any file or directory. To create a file, we search the free-space list for the required
amount of space and allocate that space to the new file. This space is then removed from the free-space list.
Conversely, when we delete a file, its disk space is added back to the free-space list.
The free-space list is implemented as a bit-map or bit vector. Each block is represented by 1 bit. If the block
is free, the bit is 1; if the block is allocated, the bit is 0. The main advantage of this approach is its efficiency
in finding the free blocks on the disk.
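A minimal sketch of such a bit vector in C is given below; the number of blocks is an assumed figure and the helper names are invented for illustration. Following the convention above, a bit value of 1 means the block is free and 0 means it is allocated.

#include <stdio.h>
#include <string.h>

#define NUM_BLOCKS 4096                        /* assumed disk size in blocks */

static unsigned char bitmap[NUM_BLOCKS / 8];   /* one bit per disk block */

/* Return 1 if the block is free (its bit is 1), 0 if it is allocated. */
static int is_free(int block)
{
    return (bitmap[block / 8] >> (block % 8)) & 1;
}

/* Allocate a block: clear its bit, removing it from the free-space list. */
static void allocate_block(int block)
{
    bitmap[block / 8] &= (unsigned char)~(1u << (block % 8));
}

/* Free a block again, e.g. when its file is deleted: set its bit. */
static void release_block(int block)
{
    bitmap[block / 8] |= (unsigned char)(1u << (block % 8));
}

/* Scan the bit vector for the first free block; return -1 if the disk is full. */
static int find_free_block(void)
{
    for (int b = 0; b < NUM_BLOCKS; b++)
        if (is_free(b))
            return b;
    return -1;
}

int main(void)
{
    memset(bitmap, 0xFF, sizeof(bitmap));      /* initially, every block is free */
    int b = find_free_block();
    allocate_block(b);
    printf("allocated block %d\n", b);
    release_block(b);
    return 0;
}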
CPUs are getting faster day by day and memory sizes are also getting bigger and bigger. The same is true of disk
space. But one parameter that is not improving along with all these changes is the disk seek time. The log structured
file system is aimed at reducing the impact of disk seek time and improving the overall disk write operation. A log
structured file system reduces disk-memory trips for fetching data from the disk and loading it into memory.
Since we have an increased memory size, we can load all the required data into memory and use that for
processing.
When we execute a write operation, the time taken to complete that operation will not be exactly the same
as the time taken for the actual write. There are other factors such as seek time, rotational delay, etc. Moreover, one
write operation involves changes to the disk at various places such as the i-node entry, FCB, directory block, etc.
Any delay or failure in even one of these updates would leave the file system in an inconsistent state and cause
a big problem. This problem can be solved by maintaining all the write operations in a log file and then
committing all the write operations periodically.
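The following is a much-simplified sketch of this idea, not the design of any particular log structured file system: individual write operations are appended to an in-memory log, and the whole batch is later committed to the disk as one sequential operation. The sizes and the write_block_to_disk() routine are assumptions for illustration.

#include <string.h>

#define BLOCK_SIZE   512
#define LOG_CAPACITY 128

/* One buffered write operation: which block to update and the new contents. */
struct log_entry {
    int  block_no;
    char data[BLOCK_SIZE];
};

static struct log_entry log_buf[LOG_CAPACITY];
static int log_count = 0;

/* Hypothetical routine that performs the physical write. */
extern void write_block_to_disk(int block_no, const char *data);

/* Commit: flush the whole batch sequentially, then empty the log. Called
   periodically, or when the log becomes full. */
void commit_log(void)
{
    for (int i = 0; i < log_count; i++)
        write_block_to_disk(log_buf[i].block_no, log_buf[i].data);
    log_count = 0;
}

/* Record a write in the log instead of updating the disk immediately. */
void log_write(int block_no, const char *data)
{
    if (log_count == LOG_CAPACITY)
        commit_log();
    log_buf[log_count].block_no = block_no;
    memcpy(log_buf[log_count].data, data, BLOCK_SIZE);
    log_count++;
}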
Controllers require large memory buffers of their own. They also have complicated electronics and therefore,
are expensive. An idea to reduce the cost is to have multiple devices attached to only one controller as shown
in Fig. 5.6. At any time, a controller can control only one device and therefore, only one device can be active
even if some amount of parallelism is possible due to overlapped seeks. If there are I/O requests from both
the devices attached to a controller, one of them will have to wait. All such pending requests for any device
are queued in a device request queue by the DD. The DD creates a data structure and has the algorithms to
handle this queue. If the response time is very important, these I/O waits have to be reduced, and if one is
ready to spend more money, one can have one separate controller for each device. We have already studied
the connections between a controller and a device such as a disk drive in Fig. 4.8. In the scheme of a separate
controller for each device, this connection will exist between each controller/device pair. In such a case, the
drive select input shown in Fig. 4.8 will not be required. This scheme obviously is faster, but also is more
expensive.
In some mainframe computers such as IBM-370 family (i.e. IBM 370, 43XX, 30XX etc.), the functions
of a controller are very complex, and they are split into two units. One is called a Channel and the other is
called a Control Unit (CU). Channel sounds like a wire or a bus, but it is actually a very small computer with
the capability of executing only some specific I/O instructions. If you refer to Fig. 5.6, you will notice that
one channel can be connected to many controllers and one controller can be connected to many devices. A
controller normally controls devices of the same type, but a channel can handle controllers of different types.
It is through this hierarchy that finally the data transfer from/to memory/device takes place. It is obvious that
there could exist multiple paths between the memory and devices as shown in Fig. 5.6. These paths could be
symmetrical or asymmetrical as we shall see.
Figure 5.7 shows a symmetrical arrangement. In this arrangement, any device can be reached through any
controller and any channel.
We could also have an asymmetrical arrangement as shown in Fig. 5.8. In this scheme, any device cannot
be reached through any controller and channel, but only through specific preassigned paths.
Due to multiple paths, the complexity of the DD routine increases, but then the response time improves. If
one controller controls two devices, for instance, the operation will be much slower than if there were only one
controller per device. This is because, in the latter case, there is no question of path management and waiting
until a controller gets free. This is also clearly borne out by many benchmarks.
The Information Management as we know is divided into two parts: the File System and the DD. We can
consider DD, in turn, conceptually to be divided into four submodules: I/O procedure, I/O scheduler, Device
handler and Interrupt Service Routines (ISR). We will take an example to outline the interconnections
among them. Let us warn the reader at this juncture that these four are the conceptual submodules. An
Operating System designer may split the entire task into any number of submodules - maybe only two, or
three, or five. The number and names of the modules that we have used are the ones that we have found
suitable in explaining this in a step by step fashion. All these submodules are intimately connected. Broadly
speaking, I/O procedure converts the logical block numbers into their physical addresses and manages the
paths between all the devices, controllers and channels etc. It also chooses a path and creates the pending
requests if a specific piece of hardware is busy. The I/O scheduler manages the queue of these requests along
with their priorities and schedules one of them. The device handler actually talks to the controller/device and
executes the I/O instruction. On completion of the I/O, an interrupt is generated, which is handled by the
ISRs.
Let us assume that the file system has translated the request from the Application Program to read a logical
record into a request to read specific blocks. Now the DD does the following:
(a) The I/O procedure translates the block number into the physical addresses (i.e. cylinder, surface, sector
numbers) and then creates a pending request on all the elements on the path (i.e. channel, control unit
and the device). While doing this, if multiple paths are possible, the I/O procedure chooses the best
one available at that time (e.g. where there are minimum pending I/O requests on the channel/CU
connecting the device) and adds the current request to the pending queues maintained for all the units
on the path such as a device, controller or a channel.
(b) An I/O scheduler is a program which executes an infinite loop. Its whole purpose is to pick up the
next pending I/O request and schedule it. It is for this purpose that all the I/O requests for the same
device from different processes have to be properly maintained and queued. When a request is being
serviced, the I/O scheduler goes to sleep. When the device completes an I/O, an interrupt is generated
which wakes up the I/O scheduler and sets the device free to handle the next request. From that time
onwards, the device continuously generates interrupts at specific time intervals, suggesting that it is
free and ready for work.
Every time an interrupt is generated by a device, the appropriate Interrupt Service Routine (ISR)
for that device starts executing. The ISR activates the I/O scheduler which, in turn, checks whether
there are some pending I/O requests for that device. If there are, it organizes them according to
its scheduling algorithm and then picks up the next request to be serviced. On scheduling it (i.e.
instructing the device handler about it), it goes to sleep only to be woken up by the execution of the
ISR again, which is executed on the completion of the I/O request. If there are no pending requests
for that device, the scheduler goes to sleep immediately. By this mechanism of generating interrupts
continuously at regular interval for a “free” device, the checking on the pending I/O requests is
continuously done without keeping the I/O scheduler running and consuming CPU power all the time.
If some I/O operation is complete, the ISR, apart from waking up the I/O scheduler, intimates this to
the device handler for error checking on the data read in.
(c) When the I/O scheduler schedules a request, the Device Handler uses the details of the request, such
as the addresses on the disk and in memory and the number of words to be transferred, and constructs the
instructions for the disk controller. It then issues these to the controller in the form of a program which
the controller understands.
In most of the cases, the device handler can construct a full I/O program for the operation such as
“read” or “write” and load the entire program in the controller’s memory. After this, the controller
takes over and executes the I/O operation as we have studied earlier. The controller sets some
hardware bits to denote the success/failure of the operation after it is over, and generates an interrupt.
In some schemes, the device handler instructs the controller one instruction at a time and monitors the
operation more closely. We will assume the former scenario in our subsequent discussion.
(d) The controller then actually moves the R/W arms to seek the data, check the sector addresses and read
the data to its own memory. It then transfers it to the main memory using DMA as we have studied
earlier.
(e) After the data is read/written, the hardware itself generates an interrupt. The current instruction of the
executing process is completed first and then the hardware detects the interrupt, and automatically
branches to its ISR. This ISR puts the current process to sleep. The ISR also wakes up the I/O
scheduler which deletes the serviced request from all the queues and schedules the next one, before
going to sleep again.
(f) The device handler checks for any errors in the data read, and if there is none, intimates the I/O
procedure to form the logical record for moving it into the memory of the AP.
(g) The process for which the record is read/written is now woken up and inserted in the list of ready
processes, depending upon its priority. This process is eventually dispatched, whereupon it starts executing
again.
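The interplay between the I/O scheduler and the ISR described in step (b) above can be sketched in C as follows. This is only a conceptual sketch; the routines it calls are assumed placeholders, not the actual routines of any Operating System.

struct iorb;                                  /* an I/O request block (see later)        */

extern void sleep_until_interrupt(void);      /* woken up by the ISR                     */
extern struct iorb *next_pending_iorb(void);  /* chosen as per the scheduling policy     */
extern void start_io(struct iorb *r);         /* hand the request to the device handler  */

/* The I/O scheduler: an infinite loop which schedules one pending request
   at a time and then sleeps until an interrupt wakes it up again. */
void io_scheduler(void)
{
    for (;;) {
        struct iorb *r = next_pending_iorb();
        if (r != NULL)
            start_io(r);
        sleep_until_interrupt();
    }
}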
We will now study the functions of these four submodules a little more closely.
In order to perform the path management and to create the pending requests, the I/O procedure maintains the
data structures as described below.
The I/O Scheduler sequences the IORB chained to a DCB according to the scheduling policy and picks up an
IORB from that chain when a request is to be serviced. It then requests the device handler to actually carry out
the I/O operation and then deletes that IORB from all the queues for pending requests from the appropriate
DCB, CCUB and CCB after setting the appropriate flags in the DCB, CCUB and CCB to denote that they
are now busy. The I/O scheduler can use a number of policies for scheduling IORBs. In fact, the IORBs are
chained to the DCB and one another depending upon this policy only. For instance, if First Come First Served
(FCFS) method is used, the new IORB is just added at the end of the queue. Therefore, one can imagine that
the I/O procedure prepares the IORB and hands it over to the I/O scheduler. The I/O scheduler then chains
it with the DCB as per its scheduling policy. In many modern systems, scheduling is done by the controller
hardware itself thereby making the task of the I/O scheduler simpler and the whole operation much faster. In
some others, the software in the Operating System has to carry it out.
Figure 5.15 illustrates this method. The figure depicts four requests,
0, 1, 2 and 3. It also shows these requests on the figure at their respective target track numbers. If the
requests have arrived in the sequence of 0, 1, 2 and 3,
they are also serviced in that sequence, causing the head
movement as shown in the figure starting with the R/W
head position which is assumed to be between the target
tracks of requests 2 and 3.
FCFS is a ‘just’ algorithm, because the process that
makes a request first is served first, but it may not be
the best in terms of reducing the head movement, as is
clear from the figure. To implement this in the Operating
System, one will have to chain all the IORBs to a DCB
in the FIFO sequence. Therefore, the DCB at any time
points to the next IORB to be scheduled. After one IORB
is dispatched, that IORB is deleted, and the DCB now
points to the next one in time sequence, i.e. which came
later. When a new IORB arrives, it is added at the end of
the chain. In order to reduce the time to locate this ‘end
of the chain’, the DCB also can contain a field “Address
of the Last IORB” for that device. For recovery purposes,
normally IORB chains like all others are maintained as
two-way chains, i.e. each IORB has an address of the next IORB for the same device as well as the address of
the previous IORB for the same device. We can now easily construct the algorithms to maintain these IORB
queues for this method.
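A minimal sketch of such a two-way IORB chain hanging off a DCB is given below. The structure members are illustrative only, not the actual layout used by any particular Operating System; the DCB keeps the addresses of both the first and the last IORB, so that a new request can be appended in FCFS order without traversing the whole chain.

#include <stddef.h>

/* I/O Request Block: one pending request for a device (illustrative fields). */
struct iorb {
    int cylinder, surface, sector;   /* physical address of the block         */
    void *buffer;                    /* memory address for the data transfer  */
    struct iorb *next;               /* next IORB for the same device         */
    struct iorb *prev;               /* previous IORB for the same device     */
};

/* Device Control Block: holds the head and tail of the pending-request chain. */
struct dcb {
    int busy;                        /* set while a request is being serviced */
    struct iorb *first;              /* next IORB to be scheduled (FCFS)      */
    struct iorb *last;               /* the "address of the last IORB"        */
};

/* FCFS: a new IORB is simply added at the end of the chain. */
void enqueue_iorb(struct dcb *d, struct iorb *r)
{
    r->next = NULL;
    r->prev = d->last;
    if (d->last != NULL)
        d->last->next = r;
    else
        d->first = r;                /* the queue was empty */
    d->last = r;
}

/* The scheduler picks up the IORB at the head of the chain and delinks it. */
struct iorb *dequeue_iorb(struct dcb *d)
{
    struct iorb *r = d->first;
    if (r != NULL) {
        d->first = r->next;
        if (d->first != NULL)
            d->first->prev = NULL;
        else
            d->last = NULL;
    }
    return r;
}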
The device handler essentially is a piece of software which prepares an I/O program for the channel or a
controller, loads it for them and instructs the hardware to execute the actual I/O. After this happens, the device
handler goes to sleep. The hardware performs the actual I/O and on completion, it generates an interrupt. The
ISR for this interrupt wakes up the device handler again. It checks for any errors (remember, the controller
has an error/status register which is set by the hardware if there are any errors). If there is no error, the device
handler instructs the DMA to transfer the data to the memory.
Let us now take a complete example to study in a step-by-step manner, how all these pieces of software are
interconnected. When an AP wants to read a logical record, the following happens:
(i) The AP-1 has a system call to read a logical record for an Operating System which can recognize
an entity such as a logical record. For systems such as UNIX which treat a file as a stream of bytes,
a system call to read a specific number of bytes starting from a given byte is generated in its place.
Upon encountering this, AP-1 is put aside (blocked) and another ready process, say AP-2 is initiated.
(ii) The File System determines the blocks that are needed to be read for the logical record of AP-1, and
requests DD to read the same. We have seen how this is achieved.
(iii) The I/O procedure within DD prepares an IORB for this I/O request.
(iv) The I/O procedure within DD now establishes a path and chains the IORB to the device control
block and other units as discussed earlier. The I/O procedure actually constructs an IORB and hands
it over to the I/O scheduler. The I/O scheduler chains the IORB in the appropriate manner as per the
scheduling philosophy.
(v) Whenever a device is free, it keeps on generating interrupts at regular intervals to attract attention (“I
want to send some data” or “does anybody want to send anything to me ? I am free”). The ISR for that
interrupt wakes up the I/O scheduler. The I/O scheduler checks up the IORB queues and then services
them one by one, as per the scheduling philosophy. When the device is free, the controller may not be.
Or even if it is, the channel may not be free at that time. For the actual I/O, the entire path has to be
free, and this adds to the complication. The I/O scheduler ensures this before scheduling any IORB.
After this is done, it makes a request to the device handler to actually carry out the I/O.
(vi) After an IORB has finally been scheduled, the I/O scheduler now makes a request to the device
handler to carry out the actual I/O operation.
(vii) The device handler within DD now picks up the required details from the IORB (such as the source,
destination addresses, etc.) and prepares a channel program (CP) and loads it into the channel, which in
turn, instructs the controller. If there is no channel, the device handler directly instructs the controller
about the source and target addresses and number of bytes to be read. As we know, the device handler
can issue a series of instructions which can be stored in the controller’s memory.
(viii) The controller finally calculates the direction and the number of steps that the R/W heads have to
traverse for the seek operation. Depending upon this, appropriate signals are generated from the
controller to the device (refer to Figs. 4.7 and 4.8) and the R/W heads actually move onto the desired
track.
(ix) The R/W head is now on the desired track. It now accesses the correct sector on the track as the disk
rotates. For every sector, it looks for an address marker followed by the address and then matches it
electronically with the target address stored in the controller’s buffer before starting the data transfer.
On hitting the desired sector, the data is transferred bit serially into the controller’s buffer, where it
is collected as bytes and these bytes are then collected into a larger unit (512 bytes or more) in the
controller’s buffer.
(x) This buffer is transferred to the memory buffer within the Operating System using DMA under the
direction of the channel and/or the controller, but through the data bus in a bit parallel fashion.
(xi) When the I/O operation is completed for AP-1, the hardware itself generates an interrupt.
(xii) The Interrupt Service Routine (ISR) within the DD starts executing. It signals the completion of the
requested I/O for the AP-1 and informs the device handler regarding the same.
(xiii) The device handler checks for any errors and if none is found, it signals it to the I/O procedure.
(xiv) The I/O procedure now deletes the IORB and signals the File System after all the blocks for that
logical record have been read.
(xv) The File System now formulates a logical record from the read blocks and transfers it to the I/O area
of the AP-1. In some cases, the data can be directly read in the AP’s memory. In others, a common
buffer can be maintained between the Operating System and the AP. One can refer to one or the other
by just mapping logical to physical addresses appropriately.
(xvi) The process AP-1 is now moved from the blocked to the ready state, whereupon it can now be
scheduled in due course. At this juncture, assume that there are only two processes, AP-1 and
AP-2, in the system. AP-2 can be thrown out of the control of the CPU and AP-1 can be scheduled.
Alternatively, AP-2 can continue executing and AP-1 is scheduled later. This depends upon whether
the process scheduling philosophy is pre-emptive or non-pre-emptive. We will study how this happens
in the section on Process Management (PM).
(xvii) When AP-1 is next scheduled by the PM module of the Operating System, the program can assume
that the logical record is already in the I/O area of AP-1 and therefore, it can start processing it.
While all this happens, the device from which the data was needed to be read gets free and the flag in the
DCB is updated to indicate this. From this time on, at a regular time interval the device generates an interrupt.
The ISR of this interrupt wakes up the I/O scheduler to check if there are any IORBs to be scheduled. And
the cycle continues thereafter.
A terminal or visual display unit (VDU) is an extremely common I/O medium. It would be hard to find any
programmer or user who has not seen and used a terminal. Ironically, there is not much of popular literature
available explaining how terminals work and how the Operating System handles them. We want to provide
an introduction to the subject to uncover the mysteries around it.
Terminal hardware can be considered to be divided into two parts: the keyboard, which is used as an input
medium and the video screen which is used as an output medium. These days, if one uses light pens and
similar devices, the screen can be used as an input medium also.
The terminal can be a dumb terminal or an intelligent terminal. Even the dumb terminal has a
microprocessor in it, on which some rudimentary software can run. It also can have a very limited memory.
The dumb terminal is responsible for the basic input and output of characters. Even then, it is called ‘dumb’
because it does no processing on the input characters. As against this, the intelligent terminal can also carry
out some processing (e.g. validation) on the input. This requires a more powerful hardware and software for
it. We will assume a dumb terminal for our discussions.
Terminals can be classified in a number of ways, as shown in Fig. 5.20.
A detailed discussion of all these is beyond the scope of the current text. We will consider only the
memory mapped, character oriented alphanumeric terminals. These terminals have a video RAM as shown in
Fig. 5.21. This video RAM is basically the memory that the terminal hardware itself has.
The figure shows that the video RAM in our example has 2000 data bytes (0 to 1999) preceded by 2000
attribute bytes (0 to 1999). There is therefore, one attribute byte for each data byte. A typical alphanumeric
screen can display 25 lines, each consisting of 80 characters, i.e. 25×80 = 2000 characters. This is the reason
why Fig. 5.21 shows 2000 data bytes. This is typically the case with the monochrome IBM-PC.
At any time, all the 2000 characters stored in the video RAM are displayed on the screen by the video
controller using display electronics. Therefore, if you want to have a specific character appear on the screen
at a specific position, all you have to do is to move the ASCII or EBCDIC code for that character to the video
RAM at the corresponding position with appropriate coordinates. The rest is actually handled by the video
controller using display electronics.
Therefore, when one is using any data entry program where the data keyed in has to be displayed on the
screen or one is using an enquiry program where data from the desired database or file is to be displayed
on the screen, it has to be ultimately moved into the video RAM at appropriate places, after which display
electronics displays them.
What is then the attribute byte? The attribute byte tells the video controller how the character is to be
displayed. It signifies whether the corresponding data character which is stored next to it in the video RAM
is to be displayed bold, underlined, blinking or in reverse video etc. All this information is codified in the
8 bits of the attribute byte. Therefore, when you give a command to a word processor to display a specific
character in bold, the word processor instructs the terminal driver to set up the attribute byte for that character
appropriately in the video RAM after moving the actual data byte also in the video RAM. The display
electronics consults the attribute byte which, in essence, is an instruction to the display electronics to display
that character in a specific way.
For the monochrome IBM-PC display, only one attribute byte (i.e. 8 bits) is sufficient to specify how that
character is to be displayed. For bit oriented color graphics terminals, one may require as many as 24 or
32 bits for each character, or even for each pixel, if a very fine distinction in colors and intensities is needed. This
increases the video RAM capacity requirement. It also complicates the video controller as well as the display
electronics. But then you get finer color pictures.
Why is this terminal called memory mapped? It is because the video RAM is treated as part of the main
memory only. Therefore, for moving any data in or out of the video RAM, ordinary load/store instructions
are sufficient. You do not need specific I/O instructions to do this. This simplifies things but then it reduces
the memory locations available for other purposes. Figure 5.22 shows a typical arrangement of all the
components involved in the operation. It also shows the data bus connecting all these parts.
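Because the video RAM is mapped into the ordinary address space, putting a character on the screen is just a memory store. The sketch below follows the layout of Fig. 5.21 (2000 attribute bytes followed by 2000 data bytes); the base address used is only indicative and would depend on the actual machine.

#define COLUMNS 80
#define ROWS    25
#define CELLS   (COLUMNS * ROWS)               /* 2000 character positions           */

/* Assumed base address at which the video RAM is mapped into main memory. */
#define VIDEO_RAM ((volatile unsigned char *)0xB0000)

/* Display character c at (row, col) with the given attribute byte (bold,
   blinking, reverse video, etc.) using nothing but ordinary store instructions. */
void put_char(int row, int col, char c, unsigned char attribute)
{
    int pos = row * COLUMNS + col;             /* position 0 to 1999                 */
    VIDEO_RAM[pos] = attribute;                /* attribute byte for this position   */
    VIDEO_RAM[CELLS + pos] = (unsigned char)c; /* data byte: the ASCII code itself   */
}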
When a character is keyed in, the electronics in the keyboard generates an 8 bit ASCII/EBCDIC code from
the keyboard. This character is stored temporarily in the memory of the terminal itself. Every key depression
causes an interrupt to the CPU. The ISR for that terminal picks up that character and moves it into the
buffers maintained by the Operating System for that terminal. It is from this buffer that the character is sent
to the video RAM if the character is also to be displayed (i.e. echoed). We will shortly see the need for these
buffers maintained by the Operating System for various terminals. Normally, the Operating System has one
buffer for each terminal. Again, the Operating System can maintain two separate buffers for input and output
operations. However, these are purely design considerations.
When the user finishes keying in the data, i.e. he keys in the carriage return or the new line, etc., the
data stored in the Operating System buffer for that terminal is flushed out to the I/O area of the Application
Program which wants that data and to which the terminal is connected (e.g. the data entry program).
Therefore, there are multiple memory locations involved in the operation. These are:
l A very small memory within the keyboard/terminal itself
l The video RAM, into which the characters to be displayed (echoed) are moved
l The input/output buffers maintained by the Operating System for that terminal
l The I/O area in the memory of the Application Program itself
Imagine that an Application Program written in HLL wants to display something on the
terminal. The compiler of the HLL generates a system call for such a request, so that at the time of execution,
the Operating System can pick up the data from the memory of the AP, dump it into its own output buffers
first and then send it from these buffers to the terminal. The rates of data transfers between the memory of
the AP to the Operating System buffers and finally from there to the terminal are very critical if the user has
to see the output continuously, especially in cases such as scrolling or when using the Page Up/Page Down
facility etc. What is transferred from the AP to the Operating System and finally to the terminal is the
actual data as well as some instructions typically for screen handling (given normally as escape sequences).
Let us say that the AP wants to erase a screen and display some data from its working storage section.
The AP will request the Operating System to do this. The Operating System will transfer the data from the
working storage of the AP to its own buffers and then send an instruction (in the form of escape sequences)
to the terminal for erasing the screen first. The terminal’s microprocessor and the software running on it will
interpret these escape sequences, and as a result, move spaces to the video RAM, so that the screen will be
blanked out. The Operating System then will transfer the data from its buffers to the terminal along with
the control information such as where on the screen it should be displayed and how. Again, the terminal’s
microprocessor and the software will interpret this control information and will move the data into appropriate
places of the video RAM. Now the actual data and also the attribute bytes of the video RAM will be correctly
set. We know that the rest is done by the display electronics.
But there is a problem in this scheme. For instance, when the AP is displaying some matter, if a user keys
something in, where should it be displayed? Should it be mixed in the text being displayed or should it be
stored somewhere and then displayed later? This is the reason the Operating System needs some extra storage
area.
Let us take another example. Let us say a user is keying in some data required by an Application Program.
The user keys it in. As soon as a key is depressed, we know that an interrupt is generated and the interrupt
service routine (ISR) for that terminal will call a procedure which will take that character and move it to
the memory of the AP. But there is a problem in this scheme of transferring the data keyed in directly to the
memory of the AP too! What if the user types a “DEL” key? Should it be sent to the memory of the AP? What
if he types a “TAB” key? What if he types a “CTRL-C” to abort the process? It is obvious that the Operating
System needs a temporary storage area for all the characters keyed in whether they are displayable characters
or command characters such as “DEL” or “TAB”, etc. It also needs to have a routine to interpret these
command characters and carry out the processing of special command characters such as DEL, backspace,
TAB etc. This routine moves the cursor for the DEL command or inserts the necessary spaces for the “TAB”
command in this temporary storage area. Having done this processing, the Operating System will need to
transfer the data from this temporary area to the Application Program on receiving a carriage return (CR) or
a line feed (LF) character.
It is clear that the Operating System needs some temporary buffer space and various data structures to
manage that space between the video RAM and the AP’s memory. We will now study these.
The Operating System reserves a large input memory buffer to
store the data input from various terminals before it is sent to the respective APs controlling these terminals.
Similarly, it reserves an output buffer to store the data sent by the AP before it is sent to the respective
terminal screens for displaying. For large systems, there could be dozens if not hundreds of users logging
on and off various terminals throughout the day. The Operating System needs a large area to hold the data
for all these terminals for the purpose of input (the data keyed in by the users) and the output (the data to be
displayed). These are the buffers which are the most volatile in nature. They
get allocated and deallocated to various terminals by the Operating System,
a number of times throughout the day.
There are two ways in which this buffer space is allocated to various
terminals.
This scheme estimates the maximum buffer that a terminal will require,
and reserves a buffer of that size for that terminal. This is depicted in
Fig. 5.23.
The advantage of this scheme is that the algorithms for allocation/
deallocation of buffers are far simpler and faster. However, the main
disadvantage is that it can waste a lot of memory, because this scheme is not
very flexible.
For instance, it is quite possible in this scheme that one terminal requires
a larger buffer than the one allocated to it, whereas some other terminal
grossly underutilizes its allocated buffer. This scheme is rigid in the sense
that the Operating System cannot dynamically take away a part of a terminal’s buffer and allocate it to some
other.
In this scheme, the Operating System maintains a central pool of buffers and
allocates them to various terminals as and when required. This scheme obviously is more flexible and it
reduces memory wastage. But the algorithms are more complex and time consuming. We will see this trade-off
time and again in all allocation policies of the Operating System, be it for any part of memory or disk space!
In this scheme of a central pool, the merits normally outweigh the demerits and therefore, this scheme is more
widely followed. AT&T’s UNIX System V follows this scheme, for instance! We will illustrate this scheme
by continuing with an example.
A buffer is divided into a number of small physical entities called Character blocks (Cblocks). A Cblock
is fixed in length. A logical entity such as Character List (Clist) consists of one or more Cblocks. For
instance, for each terminal, there would be a Clist to hold the data input through a keyboard as it was keyed
in. This Clist would also actually store the ASCII or EBCDIC codes for even the control characters, such as
“TAB”, “DEL” etc. along with those for data characters in the same sequence that they were keyed in. If a
Cblock is, say, 10 bytes long, and if a user keys in a customer name which is 14 characters, it will be held in
a Clist requiring 2 Cblocks. In this case, only 6 bytes would be wasted because the allocation/deallocation
takes place in units of full Cblocks. If a user keys in an address which is 46 characters long, that Clist will
require 5 Cblocks, thereby wasting only 4 bytes. All Cblocks in a buffer are numbered serially. Therefore,
the terminal buffer can be viewed as consisting of a series of Cblocks 0 to n of fixed length. When the Clist
is created or when it wants to expand because its already allocated Cblock is full, the Operating System
allocates another free Cblock to that Clist. The Cblocks assigned to a Clist need not be the adjacent ones, as
they are dynamically allocated and deallocated to a Clist.
Therefore, the Operating System has to keep track of which Cblocks are free and which are allocated to
which Clist. The Operating System does that normally by chaining the Cblocks belonging to the same Clist
together with a header for each Clist. As the user keys in characters, they are stored first in the terminal’s own
memory and then in the video RAM if it is to be echoed, as we have seen before. After this, the character is
pushed into a Clist from the terminal’s memory by the ISR for that terminal. While doing this, if a Cblock
of that Clist has an empty space to accommodate this character, it is pushed there; otherwise the Operating
System acquires a new free Cblock for that Clist, delinks that Cblock from the pool of free Cblocks, adjusts
all the necessary pointers and then pushes the character at the beginning of the newly acquired Cblock.
The size of the Cblock is a design parameter. If this size is large, the allocation/deallocation of Cblocks
will be faster, because the list of free and allocated Cblocks will be shorter, therefore, enhancing the speed
of the search routines. But in this case, the memory wastage will be high. Even if one character is required, a
full Cblock has to be allocated. Therefore, the average memory wastage is (Cblock size - 1)/2 for each Clist.
If the size of the Cblock is reduced, the allocation/deallocation routines will become slower as the list of free
and allocated Cblocks will be longer, and the allocation/deallocation routines will also be called more often,
thereby reducing the speed. Therefore, there is a trade-off involved - again similar to the one involved in the
case of deciding the page size in memory management or the size of a cluster or an element used in the disk
space allocation.
Each Cblock has the format as shown in Fig. 5.24, assuming that the Cblock contains 10 bytes in our
examples.
l Next pointer is the Cblock number of the next Cblock allocated to the same Clist. “*” in this field
indicates the end of the Clist.
l Start offset gives the starting byte number of the valid data. For instance, Fig. 5.25 shows 10 bytes,
but maybe some junk remains from the past in bytes 0 to 3. Therefore, the start offset in this case
is 4 - as shown in the figure. The Cblock stores the employee number as “EMP01”. Normally, it is set
to 0 for a newly acquired Cblock of a Clist.
l Last offset gives the byte number of the last significant byte in the Cblock. For instance, Fig. 5.25
shows that byte number 8 is this byte. That is where the employee number ends. Byte number
9 contains “-”, which is again of no consequence; it could be garbage from the past. This field indicates
after which byte newly keyed-in data can be stored in that Cblock, if there is room.
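In C, such a Cblock and its Clist header could be declared roughly as follows. This is only a sketch for our running example: pointers and NULL are used in place of the Cblock numbers and the “*” end-of-chain marker of Fig. 5.24, and the 10-byte size is the one assumed in the text.

#define CBLOCK_SIZE 10

/* One Cblock: a fixed-size chunk of the terminal buffer. Cblocks belonging
   to the same Clist are chained together through the next field. */
struct cblock {
    struct cblock *next;       /* next Cblock of the same Clist; NULL at the end */
    int  start_offset;         /* byte number of the first valid data byte       */
    int  last_offset;          /* byte number of the last valid data byte        */
    char data[CBLOCK_SIZE];    /* the characters themselves                      */
};

/* A Clist is simply a chain of Cblocks with a header. */
struct clist {
    struct cblock *first;      /* head of the Cblock chain                       */
    struct cblock *last;       /* tail, where newly keyed-in characters go       */
};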
Let us illustrate the Clist and Cblock structures by taking an example. Let us say that we have
two terminals, T0 and T1. At a given moment, let us also assume that T0 has 3 Clists-CL0, CL1 and CL2. T1
is currently unused. Therefore, no Clist is allocated to T1. A question arises: why are multiple Clists required
for a terminal? We defer the answer to this question for a while. As we have described before, one or more
Cblocks are normally associated with each Clist. Let us assume the size of Cblock to be 10 characters in our
example. These Cblocks are linked together with pointer chains for the same Clist using the “next” pointers
in the Cblocks. Let us also assume that there is a list of free Cblocks called CLF which links all the free
unallocated Cblocks.
Whenever a new Cblock is to be allocated to a Clist, the Operating System goes through this CLF and
allocates a free Cblock which is at the head of the chain of CLF to the required Clist. After this, it adjusts the
pointer chains for that Clist as well as for CLF to reflect this change. In the same way, if a Cblock has served
its purpose, the Operating System delinks it from that Clist and adds it to the pointer chains of CLF at the end
of the chain. There is no reason why it should be added anywhere else. This is because there is no concept
of priority or preference in this case. Any free Cblock is as good as any other for its allocation. Our example
clarifies this in a step by step manner.
Let us now imagine that the Operating System
has allocated the buffer for all the Clists for all the
terminals put together, such that it can accommodate
25 Cblocks in all. This is obviously too small to be
realistic, but it is quite fine to clarify our concepts. Let
us assume that the Cblocks are assigned in the manner
as shown in Fig. 5.26.
To represent this, the Operating System maintains the
data structures as shown in Fig. 5.27. These are
essentially the headers of the chains.
The actual Cblocks would be as shown in Fig. 5.28.
We can easily verify that if we start traversing the chain, using the starting Cblocks from Fig. 5.27 and the
“next” pointers in Fig. 5.28, we will get the same list as shown in Fig. 5.26. Figure 5.28 shows only 5 Cblocks
allocated to terminal T0 in detail with their data contents. The others are deliberately not filled up, to avoid
cluttering and to enhance our understanding.
We can make out from the figure that the user has keyed through the terminal T0, the following text “Clist
number 0 for the terminal T0 has 5 Cblocks.” This is stored in Cblocks 0, 5, 8, 17 and 21. All Cblocks in this
example are fully used because there are exactly 50 characters in this text, and therefore, in this example, the
start offset is 0 and last offset is 9 for all those Cblocks. As we know, this need not always be the case.
Let us assume that a user logs on to the terminal T1 at this juncture, where he runs a program which
expects the user to key in an employee name. Let us assume that the name is of 16 characters (Achyut
S.Godbole), requiring 2 Cblocks to hold it.
We know that each key depression causes an interrupt and activates the Interrupt Service Routine (ISR)
which invokes a procedure to pick up the byte keyed in and deposit it from the memory of the terminal itself
into one of the Clists associated with that terminal. We have also seen that if the character is to be displayed
on the screen, i.e. echoed, it is also moved in, along with its attribute byte, to the appropriate location in the
video RAM. We will later see different routines within the Operating System required to handle the terminal,
their exact functions and the exact sequence in which they work together. For now, let us assume that the first
character is to be deposited into the Clist for T1.
At this juncture, there is no Clist and consequently no Cblocks allocated to T1.
The following procedure will now be adopted:
(i) The Operating System routine goes through the entry for free Cblocks (CLF) in the table shown in
Fig. 5.27.
(ii) It will find Cblock number 3 as the entry for the first free Cblock as shown in CLF in Fig. 5.27.
(iii) It will create a Clist CL3 for T1 and allocate Cblock 3 to it. At this juncture, the “starting” free
Cblock will be Cblock number 6 (Refer to Fig. 5.26 for the row for CLF). The terminal data structure
now will look as shown in Fig. 5.29.
(iv) Assuming that the user keys in the first character of the entire text i.e. “Achyut S.Godbole”, viz, the
character “A”, it will move this character “A” into Cblock 3. After this is moved, the Cblocks will
look as shown in Fig. 5.30. Notice that the start as well as the last offsets for this Cblock 3 are set to 0,
because only the zeroth character in that Cblock has some worthwhile data (in this case, it is “A”).
As the user keys in the first 10 characters (“Achyut S.G”), Cblock 3 will keep getting full. Let us
now imagine that the user has keyed in the 11th character “o”. Now, a new Cblock has to be acquired. The
list of free Cblocks i.e. CLF in Fig. 5.29 now points towards Cblock 6. Therefore, it will be allocated to CL3,
with both the offsets set to 0, and the 11th character will be moved into byte number 0 of Cblock 6 of CL3.
The terminal data structure will then look as shown in Fig. 5.31.
It is possible that at any moment, terminal T0 may acquire new Cblocks for its Clists or it may
relinquish them. We have, however, assumed that the Cblocks for T0 have not changed during the period of
this data entry for T1.
We now assume that all the 16 characters are keyed in. The Cblocks at this stage will look as shown in
Fig. 5.32. The user then keys in a “Carriage Return (CR)”.
At this juncture, the terminal data structure and the Cblocks will continue to look as shown in Figs. 5.31 and
5.32, respectively (and therefore, not repeated here), except that Cblock 6 would be different and would look
as shown in Fig. 5.33. Notice that the “last” offset has now been updated to 5. This means that a new character
should be entered at position 6.
As soon as “CR” is hit, the Operating System routines for the terminals understand that it is the end of the
user input. At this juncture, a new routine is invoked which moves all the 16 characters into the AP’s memory
in a field appropriately defined (e.g. char[17] in C or PIC X(16) in the ACCEPT statement or screen section
of COBOL). After this, Cblock 3 and Cblock 6 are released and they are chained in the list of free Cblocks
again. We also know that if the name is to be displayed as each character is entered at that time itself, it will
have been sent to the video RAM, from which it would have been displayed. After all the characters are keyed
in, the terminal data structure and the Cblocks again look as shown in Figs. 5.27 and 5.28, respectively. This
is where we had started from.
We will now study some of the algorithms associated with Clists and Cblocks (Figs 5.34 and 5.35).
In these, “free the Clist and Cblocks for that terminal” can be further exploded easily. We give below the
algorithm for “insert a character” which is a little complex.
The algorithm for “Acquire a free Cblock” is as shown in Fig. 5.36.
In the algorithm, routines to “Allocate FCB” and “Deallocate FCB” need further
refinements. They essentially are the routines to add an element to a linked list and to remove an element
from the linked list. These must be fairly clear from our example showing Cblocks and Clists in Figs. 5.26 to
5.33. Whenever a Cblock is added to a Clist, the fields in it are set as follows:
(i) Cblock number = the free Cblock number allocated from the head of CLF
(ii) Next pointer = “*”
(iii) Start offset (SO) = 0
(iv) Last offset (LO) = –1
(v) Other fields = blank.
Notice that LO is set to –1 because LO + 1 gives the byte number within that Cblock where the next
character is to be stored. For a newly acquired Cblock, a character has to be stored in byte number 0.
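Using the structures sketched earlier, the “insert a character” algorithm could be written in C roughly as follows. acquire_free_cblock() stands for the “Allocate FCB” routine that delinks a Cblock from the head of CLF; it is assumed here, not shown.

/* Assumed elsewhere: struct cblock, struct clist (sketched earlier) and
   acquire_free_cblock(), which delinks a free Cblock from the head of CLF. */
extern struct cblock *acquire_free_cblock(void);

/* Insert one keyed-in character at the end of a Clist. */
void clist_insert_char(struct clist *cl, char c)
{
    struct cblock *cb = cl->last;

    /* Acquire a new Cblock if the Clist is empty or its last Cblock is full. */
    if (cb == NULL || cb->last_offset == CBLOCK_SIZE - 1) {
        struct cblock *fresh = acquire_free_cblock();
        fresh->next = NULL;          /* i.e. the "*" end-of-chain marker              */
        fresh->start_offset = 0;
        fresh->last_offset = -1;     /* LO + 1 = 0: the next character goes to byte 0 */
        if (cb == NULL)
            cl->first = fresh;       /* first Cblock of this Clist                    */
        else
            cb->next = fresh;        /* chain it after the current last Cblock        */
        cl->last = fresh;
        cb = fresh;
    }

    /* Store the character at byte LO + 1 and advance the last offset. */
    cb->data[++cb->last_offset] = c;
}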
How does the Operating System remove these 16 characters from these Cblocks and send them to the user
process running the AP? While doing that, one thing has to be taken care of. The characters must be sent to
the user process in the same way that they were keyed in, as if Clists and Cblocks were not present - i.e. in
the FIFO manner. The kernel of the Operating System normally has an algorithm for removing the characters
from the Clist in this fashion. It looks as shown in Figs. 5.37 and 5.38.
The kernel can, and in fact normally does, provide an algorithm to extract or copy only one character from
the Clist as well.
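In the same spirit (and only in the spirit, not as a copy of Figs. 5.37 and 5.38), the removal could be sketched in C as follows. release_cblock(), which chains an exhausted Cblock back to the free list CLF, is assumed here.

/* Assumed elsewhere: release_cblock() returns an exhausted Cblock to CLF. */
extern void release_cblock(struct cblock *cb);

/* Copy the characters of a Clist into the user process's buffer in FIFO order,
   i.e. exactly in the sequence in which they were keyed in. Returns the number
   of characters copied. */
int clist_extract_all(struct clist *cl, char *dest, int max)
{
    int n = 0;
    while (cl->first != NULL) {
        struct cblock *cb = cl->first;
        while (cb->start_offset <= cb->last_offset && n < max)
            dest[n++] = cb->data[cb->start_offset++];
        if (cb->start_offset <= cb->last_offset)
            break;                   /* destination full; leave the rest queued */
        cl->first = cb->next;        /* this Cblock is exhausted: delink it ...  */
        release_cblock(cb);          /* ... and chain it back to the free list   */
    }
    if (cl->first == NULL)
        cl->last = NULL;
    return n;
}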
In actual practice, at any given moment, many Cblocks are being acquired and many others are being
released for different terminals in a multiuser system. Things do not necessarily happen for one terminal
after the other as far as Cblocks are concerned. Our example must have clarified different algorithms and data
structures required for Cblock and Clist maintenance.
The procedure for displaying some value by a process is exactly the reverse. The data is moved from the
AP’s memory into Cblocks for Clists (if required after acquiring them) for the target terminal; and then the
data is moved from the Cblocks to the terminal. As we know, in addition to the actual data, the commands
to manipulate the screen (e.g. user screen) are also sent to the terminal as some specific escape sequences.
The hardware and the software within the terminal interpret these as commands or data and update the video
RAM with the appropriate data and attribute bytes. The display electronics now does the rest. The screen now
shows what the user wants to display. After this is done, the Cblocks are freed again.
The Kernel normally has a number of procedures to handle a variety
of requirements. Some of them are listed below:
(i) Allocate a Cblock from a list of free Cblocks to a given Clist.
(ii) Return a Cblock to a list of free Cblocks.
(iii) Retrieve the first character from a Clist.
(iv) Insert a character at the end of Clist.
(v) Extract all the characters belonging to a Cblock within a Clist (and then free that Cblock).
(vi) Extract all characters from a Clist.
(vii) Place a new Cblock of characters at the end of a Clist.
We have already shown the algorithms for some of them. e.g. for extracting all the characters from a Clist
in Figs. 5.37 and 5.38. The algorithms for all the others by now should be fairly easy to construct from the
preceding discussions in the last two sections.
We now will see various components of these routines in the terminal driver, but before doing that, it is
necessary for us to know why normally different Clists are needed for each terminal.
In the ‘raw’ mode, whatever the user keys in is passed on to the user process faithfully by the Operating
System without any processing. In such a case, the Operating System needs only one Clist for the input from
a terminal apart from the one for the output to the screen. In this mode, the user keys in the data which is
deposited in the input Clist and on encountering a CR or an LF, it is transported to the user process. This is
exactly what we had traced in our example.
The raw mode does not do any processing on the characters keyed in such as “DEL” (delete a character
just keyed in) or “TAB” (jump to the next TAB column).
As we know, DEL, TAB keys etc. also have some ASCII and EBCDIC codes associated with them. The
raw mode extracts the 8 bit code associated with the character from the terminal, puts it into the input Clist
and passes it on to the user process wanting that character. It is then up to the user process to interpret these
special characters and take actions accordingly. This mode is especially used by many editors. A specific
control character sequence or escape sequence means a specific thing to one editor, but the same may mean
a different thing to a different editor running under the same Operating System. In this case, it is useless for
the Operating System to interpret and process these characters. It is better for the Operating System to pass
these special characters to the editor and let the editor interpret them, the way it wants. Therefore, a raw mode
is used.
However, in addition to the input Clist, the raw mode requires one output Clist from where the characters
are to be displayed on the screen. For instance, if a user keys in A, B, C and then the TAB character, the input
Clist will contain “ABC(TAB)” whereas the output Clist will contain “ABC” followed by spaces; the spaces
after “ABC” are as per the TAB. After the TAB character is faithfully sent to the user process (such as an editor),
the user process interprets it and sends some escape sequence back to the terminal driver operating in the raw
mode. The terminal driver interprets the sequence and expands the “ABC(TAB)” to “ABC” followed by the
necessary spaces for the output Clist. It is then sent from the output Clist to the video RAM to fill up the data
and the attribute bytes. The rest is
known to us.
Let us assume that the user keys in a function key “F1” as an input and a user program is written such that
if “F1” is encountered, the user should be taken back to the previous screen. How is this accomplished? The
raw mode stores the ASCII/EBCDIC code for “F1” in the input Clist after it arrives from the terminal buffer
into the input Clist. After this, it sends the character “F1” to the user process for interpretation faithfully. The
user process is written in such a way that it checks for “F1” and if encountered, it sends the instructions to
“Erase screen” back to the terminal driver. It then also actually sends the data from the previous screen to
be displayed again to the terminal driver. Both these instructions and the data are sent by the terminal driver
to the terminal. The terminal hardware and software interpret these and set up the video RAM accordingly.
For instance, the instruction to erase the screen, in the form of special characters (escape sequences), will make
the terminal move spaces to the video RAM, so that the screen is blanked out. After that, the characters from the
previous screen are moved to the video RAM appropriately. If scrolling is involved, more than a screenful of
data can be sent to and stored in the output Clist from the user process, and then scrolling is synchronized with
the rate at which the data is sent from the user process to the output Clist and from there to the video RAM.
The ‘cooked’ or ‘canonical’ mode on the other hand processes the input characters before they are passed
on to the user process.
It is for this reason that in this mode, the driver requires an additional input Clist associated with the same
terminal. In this case also, there is only one output Clist from which characters are sent to the screen for
displaying. However, this mode demands two input Clists. These are called ‘raw’ and ‘cooked’ Clists. In this
case, as the characters are keyed in, they are first input in the raw input Clist. This is the same as was the case
in the ‘raw’ mode. After this is done, each character is examined. If it is an ordinary data character, it is copied
to the second “cooked” input Clist. If it is a command or control character such as “F1” or “DEL” etc., it is
then processed according to the character and the result moved into the second cooked input Clist. It is from
this Clist that the data finally goes to the user process. The Clist/Cblock management algorithms and data
structures are as we have already discussed in the last section.
As an example, if the user keys in a TAB character, the ASCII code for TAB is moved into the raw
Clist. The terminal driver in the cooked mode then calculates the next TAB position from the current cursor
position, calculates the number of spaces to be inserted and actually moves the result along with the spaces
into the cooked Clist. If the user keys in a “DEL” character, the Operating System will store the ASCII code
for DEL in the raw Clist first. After this, the Operating System will move spaces to the last character in the
cooked Clist and decrement the Last offset position in the Cblock of that list. It will also decrement the
cursor position, send the instruction to the terminal accordingly, to actually display the cursor at a previous
position. The terminal microprocessor along with the software running on it will interpret it and actually move
the cursor character from the current position to one to the left in the video RAM. The attribute byte also is
moved so that the nature of the cursor (blinking, etc.) remains the same. All this enables the user to key in the
next character in the same place.
From the cooked Clist, data is normally sent to the user process only when it encounters CR or LF or
NL. This character such as CR is stored in the raw Clist, but it is a “command” character, and therefore, it
is not sent as it is. Thereafter, if echoing is required, the data from the input cooked Clist is moved (after the
required processing if any) to the output Clist from where it is sent to the video RAM through the hardware
and software of the terminal for display. If a character is to be displayed as soon as it keyed in, it is sent
immediately to the output Clist and thereafter to the video RAM.
Figure 5.39 shows a partial list of characters handled specially in cooked mode along with their possible
interpretations.
Therefore, in cooked mode, after each key depression, the driver has to check whether the character input
is an ordinary character or it needs special interpretation. But
if we want to actually input such a special character as an
ordinary character, what should be done? For instance, suppose we
want to key in “50 pieces @ 2 per week will take 25 weeks
for delivery”. What will happen after we input the character
“@”, given its special meaning in Fig. 5.39? For
this reason, the backslash (\) character is used to denote that
what follows is to be treated as an ordinary character. This
will be clear from Fig. 5.39. Therefore, the message given
above should be sent as “50 pieces \@ 2 per week will take
25 weeks for delivery”.
If the user actually wants to use “\” in his actual message,
then he must type “\\”. After encountering any backslash, the
terminal driver sets a flag denoting that the next character is to be treated as an ordinary character. Therefore,
the first “\” itself is not entered in the Clist.
When a user types “DEL”, the driver must interpret it and send a message/instruction back to the terminal
which must in turn take an action equivalent to three steps.
(i) Backspace (decrement cursor etc.)
(ii) Move a blank character
(iii) Backspace
The reason that another backspace is needed in step (iii) is that after moving a blank character in step (ii),
the cursor would have advanced by 1 position.
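A much-simplified sketch of this portion of a cooked-mode driver is given below. It uses the Clist routines sketched earlier; echo_to_terminal(), send_to_user_process(), clist_remove_last_char() and the flag handling are assumptions made purely for illustration, not the code of any real terminal driver.

#include <stdbool.h>

#define DEL '\177'                     /* ASCII code for the DEL character */

/* Assumed elsewhere: the Clist structures and routines sketched earlier. */
struct clist;
extern void clist_insert_char(struct clist *cl, char c);
extern void clist_remove_last_char(struct clist *cl);       /* decrements LO        */
extern int  clist_extract_all(struct clist *cl, char *dest, int max);
extern void echo_to_terminal(const char *s);                /* via the output Clist */
extern void send_to_user_process(const char *buf, int n);

static bool literal_next = false;      /* set after a backslash: next char is data */

/* Called for every character arriving from the raw input Clist. */
void cooked_input(struct clist *cooked, char c)
{
    if (!literal_next && c == '\\') {
        literal_next = true;           /* the backslash itself is not entered */
        return;
    }
    if (!literal_next && c == DEL) {
        clist_remove_last_char(cooked);
        echo_to_terminal("\b \b");     /* backspace, blank it out, backspace  */
        return;
    }
    if (!literal_next && (c == '\r' || c == '\n')) {
        char line[256];
        int n = clist_extract_all(cooked, line, sizeof(line));
        send_to_user_process(line, n); /* flush the line to the user process on CR/LF */
        return;
    }
    literal_next = false;
    clist_insert_char(cooked, c);      /* an ordinary, displayable character  */
}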
While interpreting the “DEL” command, if the previous character was TAB character, the problem
becomes complex. The terminal driver has to keep track of where the cursor was prior to the TAB. In most
systems, backspacing can erase characters on the current line of the screen only. This simplifies the driver
routine, though it is definitely possible to allow erasing of characters from the previous lines too!
This would enhance user friendliness but would make the terminal driver routines complex.
CTRL-Q and CTRL-S are normally used to control scrolling. On encountering them, the user process
will start or stop sending data to the output Clist. Many editors and other sophisticated programs need to
manipulate the screen in a variety of ways. To support this need, most of the terminal drivers provide for
various routines as listed in Fig. 5.40.
There is normally a fixed protocol in terms of certain escape sequences between the user process and the
terminal driver for each of these routines. We have seen before how these escape sequences are interpreted by
the terminal as instructions to set up the video RAM accordingly.
In most of the systems, CTRL-C aborts the current process. When the user types this, how should this be
handled? The reason for this question is that CTRL-C is neither a data character nor a command for screen
manipulation but is a command to the Operating System itself.
Therefore, there are three types of characters that can be keyed in by the user:
(i) Ordinary, displayable characters requested by the user process or sent by it for display.
(ii) Terminal control characters, such as “DEL”, “TAB” etc.
(iii) Process control characters, such as “CTRL-C”.
As soon as a character is keyed in, the driver has to analyze which category it belongs to and then depending
upon the cooked or raw mode, take action by itself or wait until the user process takes action.
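A small sketch of this three-way classification follows. The particular set of control characters tested for is only illustrative; a real driver consults the current terminal settings rather than a fixed table.

#include <stdio.h>

enum char_class { ORDINARY, TERMINAL_CONTROL, PROCESS_CONTROL };

/* Illustrative classification of a keyed-in character into the three
 * categories described in the text. */
enum char_class classify(unsigned char c)
{
    if (c == 0x03)                             /* CTRL-C: command to the OS  */
        return PROCESS_CONTROL;
    if (c == 0x7F || c == '\t' || c == '\b')   /* DEL, TAB, backspace        */
        return TERMINAL_CONTROL;
    return ORDINARY;                           /* displayable data character */
}

int main(void)
{
    printf("%d %d %d\n", classify('a'), classify('\t'), classify(0x03));
    return 0;
}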
There is also an output Clist associated with each terminal. The contents of this are sent to the video RAM
for display. Normally, whatever is keyed in is also echoed, except for special characters, passwords, etc.
Characters to be displayed are moved to the output Clist after interpretation, if necessary. In the raw mode,
this processing is not done. Therefore, if you type “lbte” instead of “late” and then realize your mistake, you
will have to type 3 “DEL” characters and then type “ate” followed by a carriage return. In the raw mode, all these
11 characters will be stored as “lbte(DEL)(DEL)(DEL)ate(CR)” in the input Clist, and then they will be sent
to the output Clist and displayed. In the cooked mode, only “late” would be moved into the cooked input Clist and
then to the output Clist. After this, it will be moved to the video RAM and displayed.
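The following sketch shows how the cooked input Clist might end up holding only “late” for the raw input above. The character codes assumed (DEL = 0x7F, CR = carriage return) and the buffer handling are illustrative only.

#include <stdio.h>

/* Sketch: build the cooked input Clist from raw keystrokes.  DEL (0x7F)
 * removes the previously stored character; CR terminates the line. */
#define DEL 0x7F

int main(void)
{
    const unsigned char raw[] = { 'l','b','t','e', DEL,DEL,DEL, 'a','t','e', '\r' };
    char cooked[64];
    int  n = 0;

    for (size_t i = 0; i < sizeof raw; i++) {
        if (raw[i] == DEL) {
            if (n > 0) n--;          /* erase the last cooked character     */
        } else if (raw[i] == '\r') {
            break;                   /* end of line: hand buffer to process */
        } else {
            cooked[n++] = (char)raw[i];
        }
    }
    cooked[n] = '\0';
    printf("cooked Clist holds: %s\n", cooked);   /* prints: late */
    return 0;
}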
One problem that arises is that if a user process wants to display some data on a terminal, the data will be
moved from the user memory to the output Clist, from where it will be moved to video RAM. But if a user is
also keying in something at that time which has to be echoed, it will go to the raw input Clist first, and then to
the output Clist. This method can intermix both these outputs resulting in garbage. Therefore, the Operating
System has a very delicate job of processing the keyed-in input and displaying it only at the appropriate time.
If the Operating System does not have this sophistication, one can observe the display intermixed with the
keyed in input.
Therefore, depending upon the mode, two or three Clists are assigned for each terminal. This mode can be
changed by changing the terminal characteristics through software.
The device I/O software within the terminal is responsible for both input and output. When any key is depressed, this device I/O is responsible for
generating the ASCII code from the row number, column number or key number. It is also
responsible for cursor management. For instance, when you depress a key, the cursor is incremented
by 1. When you depress the back arrow ‘←’, the cursor is brought back by 1 position. When you depress
the forward arrow ‘→’, the reverse takes place. All this is managed within the terminal even if it is a
dumb terminal. It is made possible due to this software.
Similarly, device I/O is responsible for receiving characters from the driver, interpreting them if necessary
(e.g. escape sequence for screen management) and moving the data into the video RAM for display.
We will now take a final concrete example to cement all our ideas.
For years, magnetic storage such as disks and tapes was popular. However, in the recent past,
optical storage has become extremely commonplace. The biggest advantage of optical devices is
their huge storage capacity as compared to disks and tapes. Also, the access time is relatively fast.
One small, inexpensive disk can replace about 25 magnetic tapes! The main disadvantage of CD-ROMs, as
the name implies, is that we cannot record data onto them more than once. This means that they can be created
only once and thereafter they are read-only. However, recent improvements in technology have resulted
in CDs that can be overwritten.
CD-ROMs (Compact Disk-Read Only Memory) are 1.2 mm thick and 120 mm across, with a
15-mm hole at the center of the disk. Recall that magnetic disks use the principle of a binary switch to
indicate 0 or 1. In the case of CDs, the surface has areas called pits and lands. A CD is prepared by using a
high-power infrared laser. Wherever the laser strikes the disk surface (made up of polycarbonate material), a
burn occurs. This is the pit; it is like a 1 on the magnetic disk. The land is like a 0: no laser beam strikes the
surface, and hence the surface remains unburned.
During playback, a low-power laser shines infrared light on the pits and lands as they pass by. A pit
reflects less light back as compared to a land; a land reflects back strong light. These reflections can then be
converted into the corresponding electric signals. This is how the drive distinguishes a 0 from a 1. Figure 5.42
shows a CD-ROM disk.
Notice that the main difference between magnetic disks and CDs is that in case of CDs, the track-like
structure is continuous. A single unbroken spiral contains all the pits and lands. Figure 5.43 shows what
happens when a CD-ROM is played back. When the laser beam strikes a land bit, as shown in the left portion
of the figure, the laser beam is reflected back and thus the sensor receives the beam back. Therefore, it is read
as a 1 bit. However, when the laser beam strikes a pit bit, the pit does not reflect back the laser beam. As a
result, the sensor does not receive anything, which leads to the conclusion that the data there is a 0 bit.
When music is being played on the CD, the lands and pits should pass by at a constant speed. For this, the
rate of rotation of the CD is continuously reduced as the read head moves from the inside of the CD to the
outer parts of the CD. At the inside, the speed of rotation is 530 revolutions per minute (RPM), and it reduces
to 200 RPM at the outside.
Sometime in 1984, Philips and Sony realized that CDs could be used for storing computer data, and
published a standard called the Yellow Book for this purpose. Until then, CDs had been used only for storing
music. CDs used for storing computer data from this time onwards were called CD-ROMs, to
distinguish them from audio CDs. To make the CD-ROM similar to audio CDs, the same specifications, such
as physical size and mechanical and optical compatibility, were retained. The Yellow Book standard defined the format
of computer data. Earlier, when only music was stored on CDs, it was all right to lose some tones. However,
when computer data was to be stored on CDs, it was very important to make sure that no data is lost. For this,
error correction mechanisms were also defined.
Every 8-bit byte from a computer is stored as a 14-bit symbol on a CD-ROM; the hardware performs the
14-to-8 conversion while reading. Thus, one symbol on a CD-ROM = 14 bits. Next, 42 such consecutive symbols make up one
frame. Each frame therefore contains 588 bits (42 symbols, each consisting of 14 bits). Out of these, only 192 bits (24
bytes) are used for data. The remaining 396 bits are used for error detection and control for a given frame. This
matches the format of audio CDs.
Going one step ahead of audio CDs, 98 frames are combined to form a CD-ROM sector. Since we are
talking about data bytes alone, we have the following equation:
24 data bytes per frame × 98 such frames = 2352 data bytes per sector.
Every sector has a 16-byte preamble. The first 12 of these 16 bytes are used to allow the player to detect
the fact that a new sector is beginning. The next three bytes give the sector number. The last byte contains
information about the mode, which is explained later. The next 2048 bytes contain the actual data. Finally,
the last 288 bytes contain error correction and control mechanism data for a given sector. First, let us have a
look at these concepts in a diagrammatic form as shown in Fig. 5.44.
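The arithmetic quoted in the last few paragraphs is easy to verify. The following sketch simply recomputes the figures stated above (14-bit symbols, 42 symbols per frame, 98 frames per sector, and the 16 + 2048 + 288 Mode-1 layout).

#include <stdio.h>

int main(void)
{
    /* Figures quoted in the text for the Yellow Book CD-ROM format. */
    int bits_per_symbol   = 14;
    int symbols_per_frame = 42;
    int data_bytes_frame  = 24;     /* 192 data bits per frame          */
    int frames_per_sector = 98;

    int bits_per_frame   = symbols_per_frame * bits_per_symbol;   /* 588  */
    int ecc_bits_frame   = bits_per_frame - data_bytes_frame * 8; /* 396  */
    int bytes_per_sector = data_bytes_frame * frames_per_sector;  /* 2352 */

    /* Mode-1 sector layout: 16-byte preamble + 2048 data + 288 ECC.   */
    int mode1 = 16 + 2048 + 288;

    printf("bits per frame      : %d\n", bits_per_frame);
    printf("ECC/control bits    : %d\n", ecc_bits_frame);
    printf("bytes per sector    : %d\n", bytes_per_sector);
    printf("Mode-1 layout total : %d\n", mode1);   /* also 2352 */
    return 0;
}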
The Yellow Book has defined two modes. Mode-1 takes the form shown in the figure: here, the data part
is 2048 bytes and the error correction part is 288 bytes. However, not all applications need such stringent error
correction mechanisms. For instance, audio and video applications would rather have more data bytes.
Mode-2 takes care of this: here, all 2336 bytes are used for data. Note that we are talking about three levels
of error correction: (a) single-bit errors are corrected at the byte level, (b) short burst errors are corrected at
the frame level, and (c) any other remaining errors are corrected at the sector level.
The definition of CD-ROM data format has since been extended with the Green Book, which added
graphics and multimedia capabilities in 1986.
The Digital Versatile Disk Read Only Memory (DVD-ROM) uses the same principle as a CD-ROM for
data recording and reading. The main difference between the two, however, is that a smaller, sharper laser beam is
used in the case of a DVD. This brings a great advantage: data can be written on both surfaces of a
DVD. It also adds to the storage capacity, because the tracks on a
DVD are closer together and hence pack more data, which is not possible in the case of a CD. The typical capacity of each
surface of a DVD is about 8.5 gigabytes (GB) – equivalent to the storage of about 13 CDs – and hence,
together, the two surfaces can accommodate 17 GB!
The main technical differences between a CD-ROM and a DVD are the following:
The Operating System is responsible for using devices properly and in an efficient manner. Normally, we
have devices such as disks, printers, scanners, etc. Disks are considered important devices since they are
capable of storing large amounts of data, and they are involved in almost all I/O operations, because data has
to be written to or read from them. The Operating System has to monitor and control these actions to achieve
the best performance.
Disk scheduling is about deciding the order in which the data requested by various processes is read from
or written to the disk. There are a few important parameters regarding disk operations:
Seek time is one type of delay associated with reading or writing data on a computer’s disk
drive. In order to read or write at a particular place on the disk, the read/write head of the disk needs to be
physically moved to the correct place. This process is known as seeking, and the time required to move the head
to the correct place is called the seek time.
Rotational delay is the time required for the addressed area of the disk to rotate into
a position where it is accessible by the read/write head.
Disk bandwidth is the capacity of the disk to transfer data from memory to disk and
from disk to memory. It is the total number of bytes transferred, divided by the total time between the first
request for service and the actual completion of the last transfer.
There are various algorithms for disk scheduling, i.e. for the disk read/write operations.
In the SCAN algorithm, the drive head sweeps across the entire surface of the disk, visiting the outermost cylinders before changing
direction and sweeping back to the innermost cylinders. It selects the next waiting request whose location
it will reach on its path backward and forward across the disk. Thus, the movement time should be less than
with FCFS, and the policy is clearly fairer than SSTF.
C-SCAN is similar to SCAN but the I/O requests are only satisfied when the drive head is travelling in
one direction across the surface of the disk. The head sweeps from the innermost cylinder to the outermost
cylinder satisfying the waiting requests in the order of their locations. When it reaches the outermost cylinder,
it sweeps back to the innermost cylinder without satisfying any requests and then starts again.
The LOOK algorithm is similar to SCAN: the drive sweeps across the surface of the disk, satisfying requests in alternating
directions. However, the drive now makes use of the information it has about the locations requested by the
waiting requests. For example, a sweep out towards the outer edge of the disk will be reversed when there are
no waiting requests for locations beyond the current cylinder.
Based on C-SCAN, C-LOOK involves the drive head sweeping across the disk, satisfying requests in one
direction only. As in LOOK, the drive makes use of the locations of waiting requests in order to determine
how far to continue a sweep, and where to commence the next sweep. Thus, it may curtail a sweep towards
the outer edge when there are no locations requested in cylinders beyond the current position, and commence
its next sweep at a cylinder which is not the innermost one, if that is the most central one for which a sector
is currently requested.
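To get a feel for these policies, the following sketch compares the total head movement of SCAN and C-SCAN for one illustrative request queue. The cylinder numbers, the starting position and the disk size are invented for the example, and the accounting of the C-SCAN return sweep is only one of several possible conventions.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative comparison of total head movement for SCAN and C-SCAN.
 * The request queue, starting cylinder and disk size are made up. */
#define NREQ   8
#define MAXCYL 199

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int req[NREQ] = { 98, 183, 37, 122, 14, 124, 65, 67 };
    int start = 53;

    qsort(req, NREQ, sizeof req[0], cmp);

    /* SCAN: sweep up to the last cylinder, then back down to the
     * lowest requested cylinder. */
    int scan = (MAXCYL - start) + (MAXCYL - req[0]);

    /* C-SCAN: sweep up to the last cylinder, return to cylinder 0,
     * then sweep up to the highest request below the starting point. */
    int below_max = 0;
    for (int i = 0; i < NREQ; i++)
        if (req[i] < start) below_max = req[i];
    int cscan = (MAXCYL - start) + MAXCYL + below_max;

    printf("SCAN   head movement: %d cylinders\n", scan);
    printf("C-SCAN head movement: %d cylinders\n", cscan);
    return 0;
}

For this particular queue, SCAN travels 331 cylinders and C-SCAN 382; with other queues and conventions the comparison can go either way.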
Selecting a Disk-Scheduling Algorithm
• SSTF is commonly used and has a natural appeal.
• SCAN and C-SCAN perform better in cases that place a heavy load on the disk.
• The disk-scheduling algorithm should be written as a replaceable module of the Operating System, so that it can easily be replaced
with a different algorithm when necessary.
Swap space is an area on a high-speed storage device (almost always a disk drive), reserved for use by the
virtual memory system for deactivation and paging processes. At least one swap device (primary swap) must
be present on the system.
During system startup, the location (disk block number) and size of each swap device is displayed in 512-
KB blocks. The swapper reserves swap space at the process creation time, but it does not allocate swap space
from the disk until pages need to go out to disk. Reserving swap at process creation protects the swapper from
running out of swap space. You can add or remove swap as needed (that is, dynamically) while the system is
running, without having to regenerate the kernel.
System memory used for swap space is called pseudo-swap space. It allows
users to execute processes in memory without allocating physical swap. Pseudo-swap is controlled by an
Operating System parameter called swapmem_on. By default, swapmem_on is set to 1, enabling pseudo-
swap.
Typically, when the system executes a process, swap space is reserved for the entire process, in case it
must be paged out. According to this model, to run one gigabyte of processes, the system would have to have
one gigabyte of configured swap space. Although this protects the system from running out of swap space,
disk space reserved for swap is under-utilized if minimal or no swapping occurs.
When using pseudo swap as the swapping mechanism, the pages are locked. As the amount of pseudo-
swap increases, the amount of lockable memory decreases.
For factory-floor systems (such as controllers), which perform best when the entire application is resident
in memory, pseudo-swap space can be used to enhance performance. We can either lock the application in
memory or make sure that the total number of processes created does not exceed three-quarters of system
memory.
Pseudo-swap space is set to a maximum of three-quarters of system memory because the system can begin
paging once three-quarters of system available memory has been used. The unused quarter of memory allows
a buffer between the system and the swapper to give the system computational flexibility.
When the number of processes created approaches the capacity, the system might exhibit thrashing and
a decrease in system response time. If necessary, we can disable pseudo-swap space by setting the tunable
parameter swapmem_on in /usr/conf/master.d/core-hpux to zero.
The regions that have pseudo-swap allocated are kept on a doubly linked, null-terminated list, the head of
which is called pswaplist.
There are two kinds of physical swap space: device swap and file-system swap.
Device swap space resides in its own reserved area (an entire disk or logical volume
of an LVM disk) and is faster than file-system swap because the system can write an entire request (256 KB)
to a device at once.
File-system swap space is located on a mounted file system and can vary in
size with the system's swapping activity. However, its throughput is slower than that of device swap, because free
file-system blocks may not always be contiguous; therefore, separate read/write requests must be made for
each file-system block.
To optimize system performance, file-system swap space is allocated and de-allocated in swchunk-sized
chunks. swchunk is a configurable Operating System parameter; its default is 2048 KB (2 MB). Once a chunk
of file system space is no longer in use by the paging system, it is released for file system use, unless it has
been preallocated with swapon.
If swapping to file-system swap space, each chunk of swap space is a file in the file system swap directory,
and has a name constructed from the system name and the swaptab index (such as becky.6 for swaptab[6] on
a system named becky).
Files are stored on disk. Disk space management is a challenge and a concern for file system designers.
There are two methods of writing files to the disk: (1) the complete file is stored sequentially, one byte after
another, occupying consecutive bytes on the disk; or (2) the file is not stored sequentially, but is
split into several blocks which are stored wherever the disk has free space.
One big concern when a file is stored in a consecutive manner on the disk is that it becomes difficult to store
the file when the file grows in size. For this reason, file systems break files into fixed-size blocks that
need not be adjacent.
After the decision to store a file in fixed-size blocks, the next challenge is deciding the appropriate size
for a block. If the block size is large and the file is small, disk space is wasted. If
we decide on a small block size, the number of blocks per file becomes high. This would
cause many blocks to be read when we read a file, and the read operation would be slow.
Choosing an appropriate block size is a decision for the file system designers, and the block size varies
from one Operating System to another.
Keeping track of free blocks is necessary for the allocation of unused/free blocks to store a file on disk. There
are two methods widely used to keep the track of free blocks:
(1) The first method consists of a linked list of disk blocks, with each block holding as many free disk
block numbers as will fit. Often, free blocks are used to hold the free list.
(2) The second technique is the bitmap. A disk with n blocks requires a bitmap with n bits. Free blocks
are represented by 1s in the map, allocated blocks by 0s (a small sketch of this scheme follows the
list). A 16 GB disk has 2^24 1-KB blocks and thus requires 2^24 bits for the map, which occupy
2048 blocks. Bitmaps require less space than the linked list method, since they use only 1 bit per block.
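The sketch below illustrates the bitmap technique with a deliberately tiny disk; the helper names are ours.

#include <stdio.h>
#include <string.h>

/* Sketch of the bitmap method: 1 bit per disk block, 1 = free,
 * 0 = allocated, following the convention used in the text.
 * Sizes are tiny here just to keep the example readable. */
#define NBLOCKS 64
static unsigned char map[NBLOCKS / 8];

static void set_free(int b)  { map[b / 8] |=  (1u << (b % 8)); }
static void set_used(int b)  { map[b / 8] &= ~(1u << (b % 8)); }
static int  is_free(int b)   { return (map[b / 8] >> (b % 8)) & 1; }

/* Find and allocate the first free block, or return -1 if none. */
static int alloc_block(void)
{
    for (int b = 0; b < NBLOCKS; b++)
        if (is_free(b)) { set_used(b); return b; }
    return -1;
}

int main(void)
{
    memset(map, 0xFF, sizeof map);      /* everything free initially     */
    set_used(0);                        /* pretend block 0 holds the map */
    int b = alloc_block();
    printf("allocated block %d\n", b);  /* prints: allocated block 1     */
    (void)set_free;                     /* freeing a block would set its bit back to 1 */
    return 0;
}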
In a multiuser system, many users and programmers sit at their respective terminals and execute
the same or different programs. One may be
working on a spreadsheet application, someone else may
be querying a database of customers, and yet another may
be compiling and testing a program. Despite many users
working on the system, each one feels as if the entire system
is being used only by him. How does this happen?
This is made possible by the Operating System which
arbitrates amongst all the users of the computer system.
The disk stores programs and data for all users and we have
seen how the Information Management (IM) module of the
Operating System keeps track of all directories and files
belonging to various users. In the next chapter, we will see
how the Memory Management (MM) module divides the
main memory into various parts for allocating to different
users.
The Operating System enables the CPU to switch
from one user to another, based on certain pre-determined
policy, so rapidly that the users normally do not become
aware of it. Each one thinks that he or she is the only user of the system.
However, we know that at any given time, the CPU can execute only one instruction and that instruction
can belong to only one of the programs residing in the memory. Therefore, the Operating System will have to
allocate the CPU time to various users based on a certain policy. This is done by the Process Management
(PM) module which we will discuss here in this chapter. We will study only the uniprocessor Operating
System in this chapter.
In order to understand Process Management, let us first understand what a process is and how
it is different from a program as far as Operating System is concerned. In simple terms, a program
does not compete for the computing resources like the CPU or the memory, whereas a process does.
A program may exist on paper or reside on the disk. It may be compiled or tested, but it still does not compete
for CPU time and other resources. Once a user wants to execute a program, it is located on the disk and
loaded into the main memory; at that point it becomes a process, because it is then an entity which competes for
CPU time. Many definitions of a process have been put forth, but we will call a process “a program under
execution, which competes for the CPU time and other resources”.
How did multiprogramming come about? Did it exist from the beginning? In the earlier days,
there were only uniprogramming systems. Only one process was in the memory which was being
executed at a given time. Let us go through a typical calculation of CPU utilization to understand
the problems involved in this scheme.
Let us say that a program is reading a customer record and printing a line to the customer report after
processing and calculations. It does this for all the customer records in a file. The program will look as shown
in Fig. 6.1 (shown unstructured).
There are two I/O statements in this program: READ and WRITE, and there are 200 processing and
calculation instructions in between. As depicted in the figure, all these instructions basically use only
the main memory and the CPU registers, and therefore the data transfers and calculations amongst them take place
electronically. Hence, these instructions are very
fast. For instance, 200 such instructions might take
only 0.0002 seconds on any modern computer.
However, READ and WRITE instructions are
different. The Operating System carries out the
I/O on behalf of the Application Program (AP),
with the help of the controller which finally issues
the signals to the device. The entire operation is
electromechanical in nature and takes anywhere
from 0.0012 to 0.0020 seconds in any modern
computer (these figures are only representative). We
will assume 0.0015 seconds as an average.
Therefore, the time taken for processing one
record completely can be calculated as given below.
Read : 0.0015
Execute 200 instructions : 0.0002
Write : 0.0015
Total : 0.0032
When the Operating System issues an instruction to the controller to carry out an I/O instruction, the CPU
is idle during the time the I/O is actually taking place. This is because the I/O can take place independently by
DMA, without involving the CPU. Hence, in a single user system, the CPU utilization will be 0.0002/0.0032
= 6.25 per cent.
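The same utilization figure can be computed directly, using the representative timings quoted above.

#include <stdio.h>

int main(void)
{
    /* Representative timings from the text (in seconds). */
    double t_read  = 0.0015;
    double t_exec  = 0.0002;   /* 200 memory/CPU instructions */
    double t_write = 0.0015;

    double total       = t_read + t_exec + t_write;    /* 0.0032   */
    double utilization = t_exec / total;                /* ~0.0625  */

    printf("time per record : %.4f s\n", total);
    printf("CPU utilization : %.2f %%\n", utilization * 100.0);
    return 0;
}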
Figure 6.2 (drawn out of proportion) depicts the progression of time from the point of view of the CPU.
In earlier days, the computer systems were very costly, and therefore, this idleness had to be reduced.
Hence, it was desirable that by some means, one could run two processes at the same time such that when
process 1 waits for an I/O, process 2 executes and vice versa. There would be some time lost in turning
attention from process 1 to process 2 called context switching. The scheme would work if the time lost in
context switch was far lower than the time gained due to the increased CPU utilization. This is generally true,
as depicted by Fig. 6.3, showing two processes and Fig. 6.4 showing three processes running simultaneously.
This is the rationale of multiprogramming where the Process Management (PM) portion of the Operating
System is responsible for keeping track of various processes, and scheduling them.
We have already seen in Chapter 4 on Information Management (IM) how multiprogramming
becomes feasible. We know that the disk controller can independently transfer the required data for one
process by DMA while the CPU is executing another process. DMA transfers the data between the disk
and the memory in bursts, without involving the CPU. When the DMA is using the data bus for one process,
the CPU can execute at least limited instructions not involving the data bus for some other process (e.g.
register-to-register transfers or ALU calculations). Between the bursts of data transfer, when there
is no traffic on the data bus, the CPU can execute any instruction for the other process. This is the basis of
multiprogramming.
The number of processes running simultaneously and competing for the CPU is known as the degree
of multiprogramming. As this increases, the CPU utilization increases, but then each process may get a
delayed attention, hence, causing a deterioration in the response time.
How is this context switching done? To answer this, we must know what is meant by the context
of a process. If we study how a program is compiled into the machine language and how
each machine language instruction is executed in terms of its fetch and execute cycles, we will
realize that at any time, the main memory contains the executable machine program in terms of 0s and 1s.
For each program, this memory can be conceptually considered as divided into certain instruction areas and
certain data areas (such as I/O and working storage areas). The data areas contain at any moment, the state of
various records read and various counters and so on.
Today, modern compilers normally produce code which is reentrant, i.e. it does not modify itself. Therefore,
the instruction area does not get modified during execution. But the Operating System cannot assume this. If
we interrupt a process at any time in order to execute another process, we must store the memory contents
of the old process somewhere. This does not mean that we necessarily have to dump all these memory areas onto
the disk; it is sufficient to keep pointers to them. Also, all CPU registers such as PC, IR, SP,
ACC and other general purpose registers give vital information about the state of the process. Therefore, these
also have to be stored; otherwise, restarting this process would be impossible. We would not know where we
had left off and, therefore, where to start from again. The context of the process tells us precisely that: it
comprises both the entities mentioned above. If we could store both of these, we would have stored the context of
the process.
Where does one store this information? If the main memory is very small, accommodating only one
program at a time, the main memory contents will have to be stored onto the disk before a new program
can be loaded in it. This will again involve a lot of I/O operations, thereby defeating the very purpose of
multiprogramming. Therefore, a large memory to hold more than one program at a time is almost a pre-
requisite of multiprogramming. It is not always true, but for the sake of the current discussion, we will
assume that the memory is sufficient to run all the processes competing for the CPU.
This means that even after the context switch, the old program will continue to be in the main memory.
What remains to be done, then, is to store the status of the CPU registers and the pointers to the memory
allocated to this process. This is done by the Operating System in a specific memory area called the Register
Save Area, which the Operating System maintains for each process. Normally, this area is part of a
Process Control Block (PCB), again maintained by the Operating System for each process, as we shall
see later.
When a process issues an I/O system call, the Operating System takes over this I/O function on behalf
of that process, keeps this process away and starts executing another process, after storing the context of
the original process in its register save area. When
the I/O is completed for the original process, that
process can be executed again. But at this juncture,
the CPU may be executing the other process and,
therefore, its registers will be showing the values
pertaining to that process. The context of that
process has now to be saved in its register save
area and the CPU registers have to be loaded with
the saved values from the register save area of the
original process to be executed next (for which I/O
is complete). At that time, the Operating System
restores the CPU registers including the PC which
gives the address of the next instruction to be
executed, but not yet executed because the CPU
was taken away from it. This in essence resumes
the execution of the process. This operation is
carried out so fast that to the user, there is seldom
a perceived break or delay during the execution of “his” process. This is depicted in Fig. 6.5.
Figure 6.5 shows that before the context switch, process A was running, denoted by the dotted lines. At
the time of the context switch, the Operating System stores the state of the CPU registers for process A (step
(i) shown in the figure), restores (loads) the already saved registers of process B onto the CPU registers (step
(ii) shown in the figure) and starts executing process B (step (iii) shown in the figure). Since processes A and
B can both be in the memory, this context switch does not require any swapping, thereby saving the time
consuming I/O operations.
After some time, when process A is scheduled again, process B's registers are stored and the registers for
process A are restored, in and from the respective register save areas. Process A then continues from where it
had left off.
The entire operation is very fast and therefore, the user thinks that he is the only one using the whole
machine. That in fact is the essence of multiprogramming.
In order to manage switching between processes, the Operating System defines three basic
process states, as given below:
A running process is the one being executed by the CPU at any given moment, and there can be only one
such process. In multiprocessor systems with multiple CPUs, however, there will be many running processes,
and the Operating System will have to keep track of all of them.
A process which is not waiting for any external event such as an I/O operation is said to be in the
ready state. It could actually have been running, but for the fact that there is only one processor, which is
busy executing instructions from some other process while this process waits for its chance to run. The
Operating System maintains a list of all such ready processes, and when the CPU becomes free, it chooses one
of them for execution as per its scheduling policy and dispatches it. When you sit at a terminal
and give a command to the Operating System to execute a certain program, the Operating System locates the
program on the disk, loads it into the memory, creates a new process for this program and enters this process
in the list of ready processes. It cannot directly make it run, because there might be another process running
at that time. The process is eventually scheduled, and only when it starts executing is its state changed to running.
When a process is waiting for an external event such as an I/O operation, the process is said
to be in a blocked state.
The major difference between a blocked and a ready process is that a blocked process cannot be
scheduled even if the CPU is free, whereas a ready process can be scheduled if the CPU is free. Imagine,
for instance, a process running a program as shown in Fig. 6.1. At the time of execution, after the READ
instruction is executed, the process will be blocked. If it were scheduled again before the desired record was read
into the main memory, it would execute an instruction on the wrong data (maybe by using the previous record!).
Therefore, there is no sense in scheduling this blocked process until its I/O is over, i.e. until it is changed to
the ready state.
Let us trace the steps that will be followed when a running process encounters an I/O instruction.
(i) Let us assume that process A was running and it issues a system call for an I/O operation.
(ii) The Operating System saves the context of process A in the register save area of process A.
(iii) The Operating System now changes the state of process A to blocked, and adds it to the list of blocked
processes.
(iv) The Operating System instructs the I/O controller to perform the I/O for process A.
(v) The I/O for process A continues by DMA in bursts, as we have seen.
(vi) The Operating System now picks up a ready process (say process B) out of the list of all the ready
processes. This is done as per the scheduling algorithm.
(vii) The Operating System restores the context of process B from the register save area of process B. We
assume that process B was an already existing process in the system. If process B was a new process,
the Operating System would locate on the disk the executable file for the program to be executed.
The header normally gives the values of the initial CPU register values such as for PC. It stores these
values in the register save area for this newly created process, loads the program in the main memory
and starts executing process B.
(viii) At this juncture, process B is executing but the I/O for process A is also going on simultaneously, as
we have seen earlier.
(ix) Eventually, the I/O requested by process A is completed. The hardware generates an interrupt at this
juncture.
(x) As a part of Interrupt Service Routine (ISR), the Operating System now moves process B from
running to the ready state. It does not put it in a blocked state. This is because, process B is not waiting
for any external event at this juncture. The CPU was taken away from it because of the interrupt. The
Operating System essentially needs to decide which process to run next (it could well be process B
again, depending upon the scheduling algorithm and process B’s priority!).
(xi) The Operating System moves process A from blocked to the ready state. This is done because process
A is not waiting for any event any more.
(xii) The Operating System now picks up a ready process from the list of ready processes for execution.
This is done as per the scheduling algorithm. It could choose process A, process B or some other
process.
(xiii) This selected process is dispatched after restoring its context from its register save area. It now starts
executing.
In addition to these, there are two more process states namely new and halted. They do not participate
very frequently in the process state transitions during the execution of a process. They participate only at the
beginning and at the end of a process and therefore, are not described in detail. When you create a process,
before getting into a queue of ready processes, it might wait as a new process if the Operating System feels
that there are already too many ready processes to schedule. Similarly after the process terminates, the
Operating System can put it in the halted state before actually removing all details about it. In UNIX, this
state is called the Zombie state.
(a) When you start executing a program, i.e. create a process, the Operating System puts it in the list of
new processes as shown by (i) in the figure. The Operating System at any time wants only a certain
number of processes to be in the ready list to reduce competition. Therefore, the Operating System
introduces a process in a new list first, and depending upon the length of the ready queue, upgrades
processes from new to the ready list. This is shown by the ‘admit (ii)’ arrow in the figure. Some
systems bypass this step and directly admit a created process to the ready list.
(b) When its turn comes, the Operating System dispatches it to the running state by loading the CPU
registers with values stored in the register save area. This is shown by the ‘dispatch’ (iii) arrow in the
figure.
(c) Each process is normally given certain time to run. This is known as time slice. This is done so that
a process does not use the CPU indefinitely. When the time slice for a process is over, it is put in the
ready state again, as it is not waiting for any external event. This is shown by (iv) arrow in the figure.
(d) While running, if the process wants to perform some I/O operation, denoted by the I/O request
(v) in the diagram, a software interrupt results because of the I/O system call. At this juncture,
the Operating System makes this process blocked, and takes up the next ready process for
dispatching.
(e) When the I/O for the original process is over, denoted by I/O completion (vi), the hardware generates
an interrupt whereupon the Operating System changes this process into a ready process. This is called
a wake up operation denoted by (vi) in the figure. Now the process can again be dispatched when its
turn arrives.
(f) The whole cycle is repeated until the process is terminated.
(g) After termination, it is possible for the Operating System to put this process into the halted state for a
while before removing all its details from the memory as shown by the (vii) arrow in the figure. The
Operating System can however bypass this step.
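The transitions listed in (a) to (g) above can be summarised as a small state machine. The following sketch is only illustrative; the event names are ours, and a real Operating System records the state in each PCB rather than in a single variable.

#include <stdio.h>

enum state { NEW, READY, RUNNING, BLOCKED, HALTED };
enum event { ADMIT, DISPATCH, TIME_UP, IO_REQUEST, IO_COMPLETE, TERMINATE };

/* Return the state that follows 's' when event 'e' occurs. */
enum state next_state(enum state s, enum event e)
{
    switch (e) {
    case ADMIT:       return (s == NEW)     ? READY   : s;
    case DISPATCH:    return (s == READY)   ? RUNNING : s;
    case TIME_UP:     return (s == RUNNING) ? READY   : s;
    case IO_REQUEST:  return (s == RUNNING) ? BLOCKED : s;
    case IO_COMPLETE: return (s == BLOCKED) ? READY   : s;  /* wake up */
    case TERMINATE:   return (s == RUNNING) ? HALTED  : s;
    }
    return s;
}

int main(void)
{
    enum state s = NEW;
    enum event history[] = { ADMIT, DISPATCH, IO_REQUEST, IO_COMPLETE,
                             DISPATCH, TIME_UP, DISPATCH, TERMINATE };
    for (size_t i = 0; i < sizeof history / sizeof history[0]; i++)
        s = next_state(s, history[i]);
    printf("final state: %d\n", s);     /* 4 == HALTED */
    return 0;
}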
The Operating System, therefore, provides for at least seven basic system calls or routines. Some of these
are callable by the programmers whereas others are used by the Operating System itself in manipulating
various things. These are summarised in Fig. 6.7.
Each of these system calls, when supplied with the process-id as a parameter, carries out the corresponding
process state transition. We will now study how these are actually done by the Operating System.
The Operating System maintains the information about each process in a record or a data
structure called Process Control Block (PCB) as shown in Fig. 6.8. Each user process has a PCB. It
is created when a user creates a process and it is removed from the system when the process is killed.
All these PCBs are kept in the memory reserved for the Operating System.
Let us now study the fields within a PCB. The fields are as follows:
Process-id is a number allocated by the
Operating System to the process on creation. This is the number
which is used subsequently for carrying out any operation on the
process as is clear from Fig. 6.7. The Operating System normally
sets a limit on the maximum number of processes that it can handle
and schedule. Let us assume that this number is n. This means that
the PID can take on values between 0 and n–1.
The Operating System starts allocating Pids from number 0. The
next process is given Pid 1, and so on. This continues till n–1. At
this juncture, if a new process is created, the Operating System wraps
around and starts with 0 again. This is done on the assumption
that by this time, the process with Pid = 0 would have terminated.
UNIX follows this scheme.
There is yet another scheme which can be used to generate the
Pid. If the Operating System allows for a maximum of n processes,
the Operating System reserves a memory area to define the PCBs
for n processes. If one PCB requires x number of bytes, it reserves
nx bytes and pre-numbers the PCBs from 0 to n–1. When a process
is created, a free PCB slot is selected, and its PCB number itself
is chosen as the Pid number. When a process terminates, the PCB
is added to a free pool. In this case, the Pids are not necessarily
allocated in the ascending sequence. The Operating System has to
maintain a chain of free PCBs in this case. If this chain is empty, no new process can be created. We will
assume this scheme in our further discussions.
We have studied different process states such as running, ready, etc. This information
is kept in a codified fashion in the PCB.
Some processes are urgently required to be completed (higher priority) than others
(lower priority). This priority can be set externally by the user/system manager, or it can be decided by the
Operating System internally, depending on various parameters. You could also have a combination of these
schemes. We will study more about these in later sections on process scheduling. Regardless of the method
of computation, the PCB contains the final, resultant value of the priority for the process.
As studied before, this is needed to save all the CPU registers at the context
switch.
This gives pointers to other data structures maintained for that process.
This is self-explanatory. This can be used by the Operating System to close all
open files not closed by a process explicitly on termination.
This gives an account of the usage of resources such as CPU time, connect time, disk I/O used, etc. by the
process. This information is used especially in a data centre or cost centre environment, where different users
are to be charged for their system usage. This obviously means an extra overhead for the Operating System,
as it has to collect all this information and update the PCBs of different processes with it.
As an example, with regard to the directory, this contains the pathname or the
BFD number of the current directory. As we know, at the time of logging in, the home directory mentioned
in the system file (e.g. user profile in AOS/VS or /etc/passwd in UNIX) also becomes the current directory.
Therefore, at the time of logging in, this home directory is moved in this field as current directory in the PCB.
Subsequently, when the user changes his directory, this field also is appropriately updated. This is done so
that all subsequent operations can be performed easily. For instance, at any time if a user gives an instruction
to list all the files from the current directory, this field in the PCB is consulted, its corresponding directory is
accessed and the files within it are listed.
Apart from the current directory, similar useful information is maintained by the Operating System in the
PCB.
This essentially gives the address of the next PCB (e.g. PCB number) within
a specific category. This category could mean the process state. For instance, the Operating System maintains
a list of ready processes. In this case, this pointer field could mean “the address of the next PCB with state =
“ready”. Similarly, the Operating System maintains a hierarchy of all processes so that a parent process could
traverse to the PCBs of all the child processes that it has created.
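Putting the fields just described together, a PCB might be declared along the following lines. The field names and sizes are our own illustrations; a real Operating System would have many more fields and machine-specific register definitions.

#include <stdio.h>

#define MAX_OPEN_FILES 20

enum pstate { READY, RUNNING, BLOCKED, FREE_SLOT };

struct register_save_area {
    unsigned long pc, sp, acc;          /* program counter, stack pointer, ... */
    unsigned long gpr[8];               /* general purpose registers           */
};

struct pcb {
    int    pid;                         /* process-id (0 .. n-1)               */
    enum pstate state;                  /* ready / running / blocked           */
    int    priority;                    /* resultant scheduling priority       */
    struct register_save_area regs;     /* saved at every context switch       */
    void  *memory_info;                 /* pointers to MM data structures      */
    int    open_files[MAX_OPEN_FILES];  /* files held open by the process      */
    long   cpu_time_used;               /* accounting information              */
    char   current_directory[256];      /* initially the home directory        */
    int    next;                        /* PCB number of next in same state    */
    int    prior;                       /* PCB number of previous in same state*/
};

int main(void)
{
    printf("one PCB occupies %zu bytes here\n", sizeof(struct pcb));
    return 0;
}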
Figure 6.9 shows the area reserved by the Operating System for all the PCBs. If an Operating System
allows for a maximum of n processes and the PCB requires x bytes of memory each, the Operating System
will have to reserve nx bytes for this purpose. Each box in the figure denotes a PCB with the PCB-id or
number in the top left corner. We now describe in Fig. 6.9, a possible simple implementation of PCBs and
its data structures. The purpose is only illustrative and a specific Operating System may follow a different
methodology, though essentially to serve the same purpose.
Any PCB will be allocated either to a running process or a ready process or a blocked process (we ignore
the new and halted processes for simplicity). If the PCB is not allocated to any of these three possible states,
then it has to be unallocated or free. In order to manage all this, we can imagine that the Operating System
also maintains four queues or lists with their corresponding headers as follows: One for a running process,
one for the ready processes, one for the blocked process and one for free PCBs. Therefore, we assume for
our current discussion, that a process is admitted to the ready queue directly after its creation. We also know
that there can be only one running process at a time. Therefore, its header shows only one slot. But all other
headers have two slots each. One slot is for the PCB number of the first PCB for a process in that state, and
the second one is for the PCB number of the last one in the same state.
Each PCB itself has two pointer slots, for the forward and backward chains. The first slot holds
the PCB number of the next process in the same state; the second holds the PCB number of the previous
process in the same state. In both cases, ‘*’ means the end of the chain. Though we could maintain pointers
in only one direction, we have assumed bidirectional pointers to aid recovery in case of data corruption. These slots are
shown at the bottom right corner of each PCB. Each PCB also shows the Pid or PCB number in the top left
corner. This is shown only for our better comprehension; as PCBs are of the same size, given the PCB number,
the kernel can directly access any PCB, and therefore this number does not actually need to be a part
of the PCB.
At the bottom, we also have shown some area (currently blank) to list all the PCB numbers of all the
processes in different states. This will enable us to follow the pointers while we give further description. This
area is only for our clarification and the Operating System does not actually maintain it. It maintains only
the PCBs.
Whenever a process terminates, the area for that PCB becomes free and is added to the list of free PCBs.
Any time a new process is created, the Operating System consults the list of free PCBs first, and then acquires
one of them. It then fills up the PCB details in that PCB and finally links up that PCB in the chain for ready
processes.
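The header-and-chain arrangement described above can be sketched as follows. Here –1 plays the role of ‘*’ (end of chain); the table, the headers and the helper functions are ours, and the example in main() mimics the free list of Fig. 6.10 (PCBs 8, 1, 6, 11, 9) and the acquisition of PCB 8 for a new process.

#include <stdio.h>

#define NPCB 15
#define NIL  (-1)

struct pcb  { int state; int next; int prior; };   /* trimmed-down PCB */
struct qhdr { int first; int last; };

static struct pcb  pcbs[NPCB];
static struct qhdr free_q, ready_q;

/* Remove the PCB at the head of a queue and return its number. */
static int remove_head(struct qhdr *q)
{
    int n = q->first;
    if (n == NIL) return NIL;
    q->first = pcbs[n].next;
    if (q->first == NIL) q->last = NIL;
    else pcbs[q->first].prior = NIL;
    pcbs[n].next = pcbs[n].prior = NIL;
    return n;
}

/* Append a PCB at the tail of a queue (FIFO). */
static void append(struct qhdr *q, int n)
{
    pcbs[n].next  = NIL;
    pcbs[n].prior = q->last;
    if (q->last == NIL) q->first = n;
    else pcbs[q->last].next = n;
    q->last = n;
}

int main(void)
{
    /* Build a free list of PCBs 8, 1, 6, 11, 9 as in Fig. 6.10. */
    int free_ids[] = { 8, 1, 6, 11, 9 };
    free_q.first = free_q.last = NIL;
    ready_q.first = ready_q.last = NIL;
    for (int i = 0; i < 5; i++) append(&free_q, free_ids[i]);

    /* Create a process: take the head of the free list (PCB 8) and
     * link it into the ready list. */
    int n = remove_head(&free_q);
    append(&ready_q, n);
    printf("new process got PCB %d; free list now starts at %d\n",
           n, free_q.first);            /* PCB 8; free list starts at 1 */
    return 0;
}

A priority-ordered ready list would replace append() with an insertion that walks the chain to find the correct position, exactly as discussed above.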
We assume that, to begin with, at a given time, process with Pid = 3 is in running state. Processes with Pid
= 13, 4, 14 and 7 are in ready state. Processes with Pid = 5, 0, 2, 10 and 12 are in the blocked state. PCB slots
with PCB number = 8, 1, 6, 11, 9 are free (we have shown only Pids 0–14). The same is shown in Fig. 6.10.
In the PCB list of blocked processes or that of the free PCBs, there is no specific order or sequence. A list
of free PCBs grows as processes are killed and PCBs are freed, and there is no specific order in which that
will necessarily happen. The Operating System can keep the blocked list in the sequence of process priorities.
But that rarely helps, because that is not the sequence in which their I/O will be necessarily completed to
move them to the ready state. On the other hand, the ready processes are normally maintained in a priority
sequence. For instance, a process with Pid = 13 is the one with the highest priority and the one with Pid = 7
is with the lowest priority in the list of ready processes shown in Fig. 6.10.
In such a case, at the time of dispatching the highest priority ready process, all that the Operating System
needs to do is to pick up the PCB at the head of the chain. This can be easily done by consulting the header of
the list (which gives the PCB with Pid = 13 as shown in Fig. 6.10) and then adjusting the header to point to
the next ready process (which is with Pid = 4 in the figure). If the process scheduling philosophy demanded
the maintenance of PCBs in the ready list in the FIFO sequence as in the Round Robin philosophy, instead
of the priority sequence, the Operating System would maintain the pointer slots accordingly. In this case, any
new PCB will be added at the end of the list necessarily.
Some Operating Systems have a scheduling policy which is a mixture of the FIFO and priority-based
philosophies. The Operating System in this case will have to chain the PCBs in the ready list accordingly. In
fact, the Operating System may have to sub-divide the ready list further into smaller lists according to sub-
groups within the ready list. In this case, each sub-group corresponds to a priority level. But all the processes
belonging to a sub-group are scheduled in the Round Robin fashion. We will study more about this when we
study Multilevel Feedback Queues in the section on process scheduling. At this juncture, let us assume that
there is only one list of ready processes maintained in the same sequence as the Operating System wants to
schedule them.
Let us trace one chain completely to see how it works. As we know, the PCB contains two pointers: next
and the prior for the PCBs in the same state. For instance, if we want to access all the PCBs in the ready state,
we can do that in the following manner:
(i) Access the ready header. Access the first slot in the header. It says 13. Hence, PCB number 13 is the
first PCB in the ready state (i.e. with Pid = 13).
(ii) We can now access PCB number 13. We confirm that the state is ready (written in the box). Actually
the process state is one of the data items in the PCB which gives us this information.
(iii) We access the next pointer in the PCB 13. It says 4. It means that PCB number 4 is the one for the
next process in the ready list.
(iv) We now access PCB 4 and again confirm that it is also a ready process.
(v) The next pointer in PCB 4 gives 14.
(vi) We can now access PCB 14 as the PCB for the next ready process, and confirm that it is for a ready
process.
(vii) The next pointer in PCB 14 is 7.
(viii) We can access PCB 7 and confirm that it is for a ready process.
(ix) The next pointer of PCB 7 is “*”. It means that this is the end of this chain.
(x) This tallies with the ready header which says that the last PCB in the ready list is PCB 7.
We thus have accessed PCBs 13, 4, 14 and 7 in that order. We know from the box at the bottom of Fig. 6.10
that these are all ready processes in the system to be scheduled in that order.
If we wanted to access them in the reverse order, i.e. 7, 14, 4 and 13, we could start with the last pointer in
the header and use the prior pointers in the PCBs. This is called a two-way chain and is normally maintained
for recovery purposes in case of data corruption.
We leave it to the reader to traverse through the blocked and free PCB chains. The above procedure will
also throw some light on the algorithms needed to access a PCB in a given state to remove it from the chain
or add to it.
In the sections that follow, we will see how the Operating System can create a process,
dispatch a process, change a process’ priority, block a process due to an I/O request, dispatch yet another
process, time up a process and wake up a process. Essentially, we are trying to simulate a realistic example.
When you sit at a terminal and give a command to the CI to execute a program or your program
gives a call to execute a sub-program, a new child process is created by executing a system call. The
Operating System follows a certain procedure to achieve this which is outlined below.
1. The Operating System saves the caller’s context. If you give a command to the CI, then the CI is the
caller. If a sub-program is being executed within your program, your program is the caller. In both
the cases, the caller process will have its PCB. Imagine that you are executing process A, as shown in
Fig. 6.15.
After the divide instruction (instruction number 7), the program calls another sub-program at
instruction 8, after the completion of which the main program must continue at instruction 9. At the
time of execution, instruction 8 gives rise to the creation of a child process. The point is that after
the child process is executed and terminated, the caller process must continue at the proper point
(instruction 9 in this case).
As we know, while executing instruction 8, the program counter (PC) will have already been
incremented by 1. Hence, it will already be pointing to the address of instruction 9. Hence, it has to be
saved so that when it is restored, the execution can continue at instruction 9. This is the reason why
the caller’s context has to be saved. All CPU registers are saved in the register save area of the caller’s
PCB, before a new child process is created and a PCB is allocated to it. After saving its context, the
caller’s process is blocked.
2. The Operating System consults the list of free PCBs and acquires a free PCB. Assuming that the states
of various processes correspond to Fig. 6.10, the Operating System will find that PCB number 8 is
free (it is at the head of the free chain).
3. It assigns Pid = 8 for the new process.
4. It updates the free PCB header to take the value 1 as the first free PCB number. The header for the free
PCBs now looks as shown below:
5. The Operating System now consults the IM for the location of the sub-program file on the disk, its
size and the address of the first executable instruction (such as the first instruction in the Procedure
Division in COBOL or the first statement in main() in C) in that program. The compiler normally keeps this
address and other information in the header of the executable compiled program file. The Operating
System also verifies the access rights to ensure that the user can execute that program.
6. The Operating System consults the MM to determine the availability of the free memory area to hold
the program and allocates those locations.
7. The Operating System again requests the IM to actually load the program in the allocated memory
locations.
8. The Operating System determines the initial priority of the process. In some cases, the priority can be
assigned externally by the user at the time of process creation. In others, it is directly inherited from
the caller. Priorities can be global (or external) or local (or internal). We will talk about priorities
later.
9. At this juncture, the PCB fields at PCB number 8 are initialised as follows (refer to Fig. 6.8).
(i) Process id = 8
(ii) Process state = ready
(iii) Process priority = as discussed above in point 8.
(iv) Register Save Area
l PC is set to the address of the first executable instruction as discussed in point 5.
We will simply assume that the currently running process gets over (e.g. process with Pid = 3 as
in the last example) and therefore, there is a need to dispatch a new process. We will assume that the
Operating System has finished the kill process procedure as outlined in Sec. 5.11. The PCBs will
be in a state as shown in Fig. 6.18.
1. The Operating System accesses the ready header and through it, it accesses the PCB at the head of the
chain. In this case, it will be the PCB with Pid = 8.
2. It removes PCB 8 from the ready list and adjusts the ready header. It changes the status of PCB 8 to
running. The PCBs will look as shown in Fig. 6.19.
3. The Operating System updates the running header to Pid = 8.
4. The Operating System loads all the CPU registers with the values stored in the register save area of
PCB 8.
5. The process with Pid = 8 now starts executing from where it had left off earlier, or from the first
executable instruction if it has just been created.
6. The master list of known processes as shown in Fig. 6.17 is also updated.
This system call is very simple and can be executed after the Operating System is supplied with
the Pid and the new priority as parameters. The Operating System now does the following:
Let us now assume that the running process with Pid = 8 issues a system call to read a record.
The process with Pid = 8 will have to be blocked by a system call. This is executed in the following
steps:
1. All CPU registers and other pointers in the context for Pid = 8 are stored in the register save area of
the PCB with Pid = 8.
2. The status field in the PCB with Pid = 8 is updated to blocked.
3. PCB 8 is now added at the end of the blocked list. We have seen why it is not necessary to link the
blocked processes in any order such as by priority.
4. The running header is updated to reflect the change. We know that the scheduler process within the
Operating System is executing at this juncture.
5. The master list of known processes, as shown in Fig 6.17 is updated accordingly. The PCBs will look
as shown in Fig. 6.20.
As the process with Pid = 8 gets blocked, there is a need to dispatch the next ready process
(dispatch is being discussed twice only to simulate a realistic example).
We will now assume that the ready process at the head of the chain, i.e. process with Pid = 13 will
be dispatched.
We have already discussed the detailed algorithm for the dispatch operation in Sec. 6.12 and hence, it
need not be discussed here again. The PCBs at the end of the operation will now look as shown in Fig. 6.21.
In order to be fair to all processes, a time-sharing Operating System normally provides a
specific time slice to each process. We will study this further under scheduling algorithms.
This time slice is changeable. There is a piece of hardware called the timer, which is programmable.
The Operating System loads the value of the time slice—e.g. 32 ms—into the register of this timer. In the
computer system, there is a system clock provided by the hardware. Each clock tick generates an interrupt,
and at the end of each clock tick, some actions may be necessary.
For instance, the Operating System may believe in lowering the priority of a process as it executes for a
longer period. This is done by the Operating System in the interrupt service routine (ISR) for the clock tick.
The clock tick is normally a very small period and the time slice of the Operating System for each process
is normally made up of multiple clock ticks. After the time slice value is loaded in the timer, the hardware
keeps on adding 1 for each clock tick until the time elapsed becomes equal to the time slice. At this juncture,
another interrupt is generated for the time-up operation.
The Operating System uses this interrupt to switch between processes so that a process is prevented from
grabbing the CPU endlessly. At this juncture, the Operating System executes a system call: “process time up”,
given the Pid. Let us assume that the time slice is up for our running process with Pid = 13. The Operating
System now proceeds in the following fashion:
1. The Operating System saves the CPU registers and other details of the context in the register save area
of the PCB with Pid = 13.
2. It now updates the status of that PCB to ready. It may be noted that the process is not waiting for any
external event, and so it is not blocked.
3. The process with Pid = 13 now is linked to the chain of ready processes. This is done as per the
scheduling philosophy as discussed before. Meanwhile, let us assume that, externally, the priorities
of all the other ready processes have been raised above that of process 13, and hence, the PCB with Pid =
13 is added at the end of the ready queue. The ready header is also changed accordingly.
4. The running header is updated to denote that the scheduler process is executing.
5. The master list of known processes, as shown in Fig. 6.17 is now updated to reflect this change.
The PCBs now look as shown in Fig. 6.22.
When the I/O for a process is completed by the hardware, the Operating System executes the wake up
system call for it: the PCB is removed from the blocked list, its status is changed to ready, and it is linked
into the chain of ready processes as per the scheduling policy; the master list of known processes is updated
accordingly.
We have seen a variety of commonly used operations on the processes and the way they are
executed. We will now study some operations which are less frequently used, but which are quite
necessary, nevertheless.
There is sometimes a need to be able to suspend a process. Imagine that you are running a payslip printing
program for a company with 6000 employees. After printing 2000 payslips, you suddenly realise that there
could possibly be a mistake in your calculations. At this juncture, you want to suspend the process for a short
while, check the results and then resume it. You do not want to abort the run, because the processing/printing
for the 2000 employees might go to waste if, on inspection, you find that you were actually right.
What will you call the state of such a process after suspension? There could be two possibilities. When you
suspend it by hitting a specific key sequence (CTRL and H, for instance), at that very moment, the original
process could be in the running, ready or blocked state. The Operating System defines two more process
states to take care of suspension while in different states. These are suspendready and suspendblocked. If
the process was in either running or ready state exactly at the time of suspension, the Operating System puts
it in the suspendready state. If the process was in the blocked state exactly at the time of suspension, it puts
the process in the suspendblocked state.
If the process is in the suspendready state, it continues to be in that state until the user externally resumes it
(by hitting another specific key sequence). After the user resumes it, the Operating System puts the process in
the ready state again, whereupon it is eventually dispatched to the running state. This is depicted in Fig. 6.24.
However, if the process is in the suspendblocked state, two things can happen subsequently.
(a) The I/O, for which the process was initially blocked before being suspended, is completed before
the user resumes the process. In this case, the process is internally moved by the Operating System
from the suspendblocked state to the suspendready state. The logic behind this is clear. The process is
suspended all right, but apart from this fact, it is not waiting for any external event such as I/O. Hence,
it is not suspendblocked any more. Therefore, it is moved to the suspendready state. After the user
resumes it, it is then moved to the ready state, whereupon it is eventually dispatched to the running
state. This is depicted in Fig. 6.24.
(b) If the I/O still remains pending, but the user resumes the process before the I/O is completed, the
Operating System moves the process from the suspendblocked to blocked state. Again, the logic is
clear. After resuming, the process continues to be blocked all right, as it is waiting for an I/O, but it
is no longer suspended. When the I/O is eventually completed for that process, the Operating System
moves it to the ready state, whereupon it is eventually dispatched to the running state. This also is
depicted in Fig. 6.24.
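The state transitions described in (a) and (b) above can be captured in a small sketch in C; the enum values and function names are assumed only for this illustration.

#include <stdio.h>

enum pstate { READY, RUNNING, BLOCKED, SUSPENDREADY, SUSPENDBLOCKED };

/* Suspension: running/ready -> suspendready, blocked -> suspendblocked */
enum pstate suspend(enum pstate s)
{
    if (s == RUNNING || s == READY) return SUSPENDREADY;
    if (s == BLOCKED)               return SUSPENDBLOCKED;
    return s;
}

/* I/O completion: blocked -> ready, suspendblocked -> suspendready (case (a)) */
enum pstate io_complete(enum pstate s)
{
    if (s == BLOCKED)        return READY;
    if (s == SUSPENDBLOCKED) return SUSPENDREADY;
    return s;
}

/* User resumes: suspendready -> ready, suspendblocked -> blocked (case (b)) */
enum pstate resume(enum pstate s)
{
    if (s == SUSPENDREADY)   return READY;
    if (s == SUSPENDBLOCKED) return BLOCKED;
    return s;
}

int main(void)
{
    enum pstate s = BLOCKED;
    s = suspend(s);          /* suspendblocked                         */
    s = io_complete(s);      /* suspendready, as in case (a) above     */
    s = resume(s);           /* ready, whereupon it can be dispatched  */
    printf("final state code = %d (READY)\n", s);
    return 0;
}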
The Operating System has to maintain two more queue headers corresponding to the suspendready and
suspendblocked states, and it has to chain all the processes belonging to the same state together. Using these
PCB chains, the Operating System has to implement the corresponding system calls for suspending and
resuming processes (see Fig. 6.21).
"
"
"
"
"
"
It should be fairly straightforward to imagine the headers that are necessary for the PCB chains and also
the algorithms for implementing all of these system calls. We leave it to the reader to construct them.
While scheduling various processes, there are many objectives for the Operating System to choose from.
Some of these objectives conflict with each other, and therefore, the Operating System designers have to
choose a set of objectives to be achieved, before designing an Operating System. Some of the objectives are
as follows:
l Fairness
l Good throughput
l Good CPU utilization
l Short turnaround time
l Short waiting time
l Good response time
Some of these objectives are conflicting. We will illustrate this by considering fairness and throughput.
Fairness refers to being fair to every user in terms of CPU time that he gets. Throughput refers to the total
productive work done by all the users put together. Let us consider traffic signals as an example (Fig. 6.26)
to understand these concepts first and then see how they
conflict.
There is a signal at the central point S which allows
traffic in the direction of AB, BA or CD and DC. We
assume the British method of driving and signals in our
examples. Imagine that there are a number of cars at point
S, standing in all the four directions. The signalling system
gives a time-slice for traffic in every direction. This is
common knowledge. We define throughput as the total
number of cars passed in all the directions put together
in a given time. Every time the signal at S changes the
direction, there is some time wasted in the context switch
for changing the lights from green to amber and then
subsequently to red. For instance, when the signal is amber, only the cars which have already started and are
half way through are supposed to continue. During this period, no new car is supposed to start (at least in
principle) and hence, the throughput during this period is very low.
If the time slice is very high, say 4 hours each, the throughput will be very high, assuming that there
are sufficient cars wanting to travel in that direction. This is true, because there will be no time lost in the
context switch procedure during these 4 hours. But then, this scheme will not be fair to the cars in all the
other directions at least during this time. If this time slice is only 1 hour, the scheme becomes fairer to others
but the throughput falls because the signals are changing direction more often. Therefore, the time wasted
in the context switch is more. Waiting for 1 to 4 hours at a signal is still not practical. If this time slice is 5
minutes, the scheme becomes still fairer, but the throughput drops still further. At the other extreme, if the
time slice is only 10 seconds, which is approximately equal to the time that is required for the context switch
itself, the scheme will be fairest, but the throughput will be almost 0. This is because, almost all the time will
be wasted in the context switch itself. Hence, fairness and throughput are conflicting objectives. Therefore, a
good policy is to increase the throughput without being unduly unfair.
The Operating System also is presented with similar choices as in the case of street signals. When the
Operating System switches from one process to the next, the CPU registers have to be saved/restored
in addition to some other processing. This is clearly the overhead of the context switch, and during this
period, totally useless work is being done from the point of view of the user processes. If the Operating
System switches from one process to the next too fast, it may be more fair to various processes, but then the
throughput may fall. Similarly, if the time slice is very large, the throughput will increase (assuming there are
a sufficient number of processes waiting and which can make use of the time slice), but then, it may not be
a very fair policy.
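The trade-off can be quantified roughly. If q denotes the time slice and s the time lost in one context switch, the fraction of time spent on useful work is approximately q / (q + s). This is a standard back-of-the-envelope approximation, not a formula from this text; the small C illustration below uses the figures from the traffic-signal analogy, treating s as 10 seconds.

#include <stdio.h>

int main(void)
{
    double s = 10.0;                                      /* context switch time */
    double slices[] = { 10.0, 300.0, 3600.0, 14400.0 };   /* candidate values of q */

    for (int i = 0; i < 4; i++) {
        double q = slices[i];
        printf("slice = %7.0f  useful fraction = %.2f\n", q, q / (q + s));
    }
    return 0;   /* small q: fair but low throughput; large q: high throughput but unfair */
}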
Let us briefly discuss the meaning of the other objectives. CPU utilization is the fraction of time, on the
average, that the CPU is busy executing either the user processes or the Operating System. If the time slice is
very small, the context switches will be more frequent. Hence, the CPU will be busy executing the Operating
System instructions more than those of the user processes. Therefore, the throughput will be low, but the CPU
utilization will be very high, as this objective does not care what is being executed, and whether it is useful.
The CPU utilization will be low only if the CPU remains idle.
Turnaround time is the elapsed time between the time a program or a job is submitted and the time when
it is completed. It is obviously related to other objectives.
Waiting time is the time a job spends waiting in the queue of the newly admitted processes for the
Operating System to allocate resources to it before commencing its execution. This waiting is necessary due
to the competition from other jobs/processes in a multiprogramming system. It should be clear by now that
the waiting time is included in the turnaround time.
The concept of response time is very useful in time-sharing or real-time systems. Its connotation in
these two systems is different and therefore, they are called terminal response time and event response
time, respectively, in these two systems. Essentially, it means the time to respond with an answer or result to
a question or an event and is dependent on the degree of multiprogramming, the efficiency of the hardware
along with the Operating System and the policy of the Operating System to allocate resources.
If these different objectives were not conflicting, a designer would have desired all of them. However, that
is not the case as we have seen. Therefore, the Operating System designers choose only certain objectives
(e.g. response time is extremely important for online or real time systems) and the design of the Operating
System is guided by this choice.
Due to many processes competing for the same available resources like CPU and memory, the concept of
priority is very useful. Like in any capacity planning or shop loading situation, the priority can be global (i.e.
external) or it can be local (i.e. internal).
An external priority is specified by the user externally at the time of initiating the process. In many
cases, the Operating System allows the user to change the priority externally even during its execution. If
the user does not specify any external priority at all, the Operating System assumes a certain priority called
the default priority. In many in-house situations, most of the processes run at the same default priority, but
when an urgent job needs to be done (say for the chairman), the system manager permits that process to be
created with a higher priority.
In data centre situations where each user pays for the time used, normally higher priority processes are
charged at a higher rate to prevent each user from firing his job at the highest priority. This is known as the
scheme of purchased priorities. It is the function of the Operating System to keep track of the time used by
each process and the priority at which it was used, so that it can then perform its accounting function.
To prevent the highest priority process from running indefinitely, the scheduler can decrease the priority
of such a process slightly at some regular time interval, depending on its CPU utilization. After some time,
if its priority drops below that of another ready process, a context switch between them takes place. This
operation is aided by the system clock, and this is also the reason why an interrupt is generated after each
clock tick, so that the scheduler can do this checking. This change in priority is not controlled externally;
the Operating System carries it out internally and intelligently, using its knowledge of the behaviour
of the various processes.
The concept of internal priority is used by some scheduling algorithms. They base their calculation on
the current state of the process. For example, each user, while firing a process, can be forced to also specify
the expected time that the process is likely to take for completion. The Operating System can then set an
internal priority which is highest for the shortest job (the Shortest Job First or SJF algorithm), so that at only
a little extra cost to large jobs, many short jobs will complete. This has two advantages. If short jobs are
finished faster, at any time, the number of processes competing for the CPU will decrease. This will result in a
smaller number of PCBs in the ready or blocked queues. The search times will be smaller, thus improving the
response time. The second advantage is that if smaller processes are finished faster, the number of satisfied
users will increase. However, this scheme has one disadvantage. If a stream of small jobs keeps on coming in,
a large job may suffer from indefinite postponement. This can be avoided by setting a higher external priority
to those important large jobs. The Operating System at any time calculates a resultant priority based on both
external and internal priorities using some algorithm chosen by the designer of the Operating System.
The internal priority can also be based on other factors such as expected remaining time to complete
which is a variation of the SJF scheme discussed above. This scheme is identical with the previous one at the
beginning of the process. This is because in the beginning, the remaining time to complete is the same as the
total expected time for the job to complete. However, this scheme is more dynamic as the process progresses.
At regular intervals, the Operating System calculates the expected remaining time to complete (i.e. the total
expected completion time minus the CPU time already consumed) for each process and uses this to determine the priority. The
overhead in this scheme is that as soon as a process uses a certain CPU time, the Operating System has to
keep track of the same, and recalculate the priority at a regular interval.
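A sketch of how such an internal priority could be recalculated is given below; the structure, the field names and the scaling factor are all hypothetical, chosen only to show that the shorter the expected remaining time, the higher the computed priority.

#include <stdio.h>

struct job {
    int  pid;
    long expected_total;     /* expected completion time supplied by the user */
    long consumed;           /* CPU time already used, tracked by the OS      */
};

/* Smaller remaining time => numerically higher internal priority. */
long internal_priority(const struct job *j)
{
    long remaining = j->expected_total - j->consumed;
    if (remaining < 1)
        remaining = 1;
    return 1000000L / remaining;     /* arbitrary scaling for this sketch */
}

int main(void)
{
    struct job a = { 5, 900, 850 };  /* almost finished  */
    struct job b = { 9, 200,  10 };  /* a long way to go */
    printf("priority(Pid 5) = %ld, priority(Pid 9) = %ld\n",
           internal_priority(&a), internal_priority(&b));
    return 0;
}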
Some Operating Systems do not operate on the concept of priority at all. They use the concept of time slice
as was described in our example of a traffic signal. Each process is given a fixed time slice, irrespective of its
importance. The process switch occurs only if:
l A process consumes the full time slice or
l A process requests an I/O before the time slice is over. In this case also, a process switch is done,
because, there is no sense in wasting the remaining time slice just waiting for the I/O to complete. The
context switch time, consisting of all CPU/memory-related instructions within an Operating System
routine, is far less time consuming than the I/O that the process is waiting for.
Some Operating Systems use a combination of the concepts of priority and time slice to schedule various
processes as we will discuss in the later sections.
These concepts can be applied to different levels of scheduling—which is the topic of our discussion in
the next section.
There are basically two scheduling philosophies: Non-Preemptive and Preemptive. Depending upon the
need, the Operating System designers have to decide upon one of them.
A non-preemptive philosophy means that a running process retains the control of the CPU and all the
allocated resources until it surrenders control to the Operating System on its own. This means that even
if a higher priority process enters the system, the running process cannot be forced to give up the control.
However, if the running process becomes blocked due to any I/O request, another process can be scheduled
because, the waiting time for the I/O completion is too high. This philosophy is better suited for getting
a higher throughput due to less overheads incurred in context switching, but it is not suited for real time
systems, where higher priority events need an immediate attention and therefore, need to interrupt the
currently running process.
A preemptive philosophy on the other hand allows a higher priority process to replace a currently running
process even if its time slice is not over or it has not requested for any I/O. This requires context switching
more frequently, thus reducing the throughput, but then it is better suited for online, real time processing,
where interactive users and high priority processes require immediate attention.
Imagine a railway reservation system or a bank, hotel, hospital or any place where there is a front office
and a back office. The front office is concerned with bookings, cancellations and many types of enquiries.
Here, the response time is very crucial; otherwise customer satisfaction will be poor. In such a case, a pre-
emptive philosophy is better. It is pointless to keep a customer waiting for long, because the currently running
process producing some annual statistics is not ready to give up the control. On the other hand, the back
office processing will do better with the non-preemptive philosophy. Business situations with workloads
large enough to warrant a separate computer for front and back office processing, in fact, can go in for
different Operating Systems with different scheduling philosophies if they are compatible in other respects.
Figure 6.27 shows three different levels at which the Operating System can schedule processes. They are as
follows:
l Long term scheduling
l Medium term scheduling
l Short term scheduling
An Operating System may use one or all of these levels, depending upon the sophistication desired.
The scheme works as follows:
(a) If the number of ready processes in the ready queue becomes very high, the overhead on the Operating
System for maintaining long lists, context switching and dispatching increases. Therefore, it is wise to
let in only a limited number of processes in the ready queue to compete for the CPU. The long term
scheduler manages this. It disallows processes beyond a certain limit, turning away batch processes first
and, only as a last resort, the interactive ones. This is shown in Fig. 6.27. As seen before, this scheduler controls
the admit function as shown in Fig. 6.6.
(b) At any time, the main memory of the computer is limited and can hold only a certain number of
processes. If the availability of the main memory becomes a great problem, and a process gets
blocked, it may be worthwhile to swap it out on the disk and put it in yet another queue for a process
state called swapped out and blocked which is different from a queue of only blocked processes,
hence, requiring a separate PCB chain (we had not discussed this as one of the process states to reduce
complications, but any Operating System has to provide for this).
The question that arises is: what happens when the I/O is completed for such a process while the
process is still swapped out? Where is the data requested by that process read in? The data required for
that process is read in the memory buffer of the Operating System first. At this juncture, the Operating
System moves the process to yet another process state called swapped out but ready state. It is made
ready because it is not waiting for any I/O any longer. This also is yet another process state which will
require a separate PCB chain.
One option is to retain the data in the memory buffer of the Operating System and transfer it to
the I/O area of the process after it gets swapped in. This requires a large memory buffer for the
Operating System because the Operating System has to define these buffers for every process as a
similar situation could arise in the case of every process. Another option is to transfer the data to the
disk in the process image at the exact location (e.g. I/O area), so that when the process is swapped in,
it does so along with the data record in the proper place. After this, it can be scheduled eventually.
This requires less memory but more I/O time.
When some memory gets freed, the Operating System looks at the list of swapped but ready processes,
decides which one is to be swapped in (depending upon priority, memory and other resources required,
etc.) and after swapping it in, links that PCB in the chain of ready processes for dispatching. This is
the function of the medium term scheduler as shown in Fig. 6.27. It is obvious that this scheduler
has to work in close conjunction with the long term scheduler. For instance, when some memory gets
freed, there could be competition for it from the processes managed by these two schedulers.
(c) The short term scheduler decides which of the ready processes is to be scheduled or dispatched next.
These three scheduling levels have to interact amongst themselves quite closely to ensure that the
computing resources are managed optimally. The exact algorithms for these and the interaction
between them are quite complex and are beyond the scope of this text. We will illustrate the scheduling
policies only for the short term scheduler in the subsequent section.
We will now discuss some of the commonly used scheduling policies—belonging to both pre-emptive
and non-preemptive philosophies and using either a concept of priority or time slice or both. It should be
fairly easy to relate these policies to the kind of PCB chains for ready processes that will be needed for
implementing them.
This is the simplest method which holds all the ready processes in one single queue
and dispatches them one by one. Each process is allocated a certain time slice. A context switch occurs only
if the process consumes the full time slice (i.e. CPU bound job doing a lot of calculations) or if it requests
for I/O during the time slice. If the process consumes the full time slice, the process state is changed from
running to ready and it is pushed at the end of the ready queue. The reason why it is changed to a ready
state is that it is not waiting for any external event such as an I/O operation. Therefore, it cannot be put in a
blocked state. It is pushed at the end of the ready queue because it is a Round Robin policy. The process will
be served in strict sequence only after serving all the other processes ahead of it in the ready queue. After
adding the PCB for this process at the end of the ready queue, the PCB pointers and the headers are changed
as discussed earlier.
If a running process requests for the I/O before the time slice is over, it is pushed into the blocked state. It
cannot be in the ready state, because even if its turn comes, it cannot be scheduled. After its I/O is complete,
it is again introduced at the end of the ready queue and eventually dispatched. This continues until the process
is complete. It is at this time that the PCB for that process is removed from the system. All the new processes
are introduced at the end of the ready queue. This is depicted in Fig. 6.28.
The policy treats all the processes equitably and therefore, it is extremely fair, but if the number of users
is very high, the response time may deteriorate for online processes that require fast attention (e.g. railway or
airline reservations, etc.).
The efficiency and throughput in this policy is dependent upon the size of the time slice as discussed
in the analogy with traffic signals. If the time slice is very large, it tends towards a single-user FIFO policy. This
policy is not fair, even though the throughput can be higher in this scheme. On the other hand, if the time slice
is reduced, the policy is fair but it produces a lower throughput due to the overhead of higher frequency of
context switch.
The implementation of this scheme can be done by maintaining a PCB chain in the FIFO sequence of all
the ready processes with the chain pointers being adjusted every time a context switch takes place.
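A minimal sketch of such a FIFO chain in C is given below; the function names rr_add and rr_next are assumptions for the illustration, and a real kernel would not, of course, allocate PCBs with malloc on every transition.

#include <stdio.h>
#include <stdlib.h>

struct pcb { int pid; struct pcb *next; };

static struct pcb *rr_head, *rr_tail;   /* FIFO chain of ready PCBs            */

void rr_add(int pid)                    /* a new or timed-out process arrives  */
{
    struct pcb *p = malloc(sizeof *p);
    p->pid = pid;
    p->next = NULL;
    if (rr_tail) rr_tail->next = p; else rr_head = p;
    rr_tail = p;                        /* pushed at the end of the ready queue */
}

int rr_next(void)                       /* dispatch from the head of the queue */
{
    struct pcb *p = rr_head;
    if (!p) return -1;
    rr_head = p->next;
    if (!rr_head) rr_tail = NULL;
    int pid = p->pid;
    free(p);
    return pid;
}

int main(void)
{
    rr_add(8); rr_add(13); rr_add(19);
    int pid = rr_next();                /* Pid 8 runs first                     */
    rr_add(pid);                        /* its slice is over: back to the end   */
    printf("next to run: %d\n", rr_next());   /* prints 13                      */
    return 0;
}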
Priority-based policy can be preemptive or non-preemptive as studied earlier.
A preemptive one gives more importance to the response time for real time processes than it gives to fairness.
A pure priority driven preemptive policy is at the other extreme with respect to pure Round Robin with
time slicing. In this case, if a process with the highest priority is introduced into the system at any moment, the
new process will grab the CPU with no regard to the currently running process or the queue of ready processes,
hence, forcing a context switch. In fact, if the kernel calculates the new priorities at every clock tick, the PCB
chain also is changed appropriately and the highest priority process then can be dispatched. Thus, a different
process can be dispatched at any clock tick even if no new process is introduced in the system. This can
happen if the internal priorities are modified by the kernel ‘intelligently’.
Next to this can be a priority based non-preemptive policy which schedules the highest priority process
bypassing the queue of ready processes, but only after the currently running process gives up its control of the
CPU due to either an I/O request or termination. After the new process starts running, it in turn does not give
up control unless it requires an I/O or it terminates. If it gets blocked due to an I/O request, the ready process
with the highest priority is dispatched. When the original high priority process becomes ready due to the I/O
completion, the ready queue is again ordered in the priority sequence and if that process happens to be still of
the highest priority, it is again dispatched; otherwise the other process with the highest priority is dispatched.
To implement these policies, the kernel needs to take actions as shown in Fig. 6.29.
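Figure 6.29 is not reproduced here, but the heart of the preemptive variant is a single comparison, sketched below with hypothetical names; the non-preemptive variant would simply defer this check until the running process blocks or terminates.

#include <stdio.h>

struct proc { int pid; int priority; };

/* Preemptive check: a newly arrived or re-prioritised ready process   */
/* replaces the running one if its resultant priority is higher.       */
int should_preempt(const struct proc *running, const struct proc *newcomer)
{
    return newcomer->priority > running->priority;
}

int main(void)
{
    struct proc running = { 13, 40 };
    struct proc arrival = { 21, 75 };

    if (should_preempt(&running, &arrival))
        printf("Pid %d preempts Pid %d\n", arrival.pid, running.pid);
    else
        printf("Pid %d waits in the ready queue\n", arrival.pid);
    return 0;
}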
In both of these schemes, the final priority is a result of external and internal priorities. Again, as we have
seen, there are a number of ways to calculate the internal priorities. We have seen one of these in the “Shortest
Job First (SJF)” method. Another one could be based on “Shortest Remaining Time First (SRTF)”.
The strictly priority-driven policies are really very good for real time events, but can lead to indefinite
postponement or unfairness to low priority processes. Imagine, for example, a process whose whole purpose
is to count from 1 to 100 and then to start all over again after initialising the counter, requiring no I/O at all. If
this process is introduced as the highest priority process, it can virtually bring the whole system to a standstill.
A limited solution to this problem is to introduce and use the concept of priority class.
A thread can be defined as an asynchronous code path within a process. Hence in Operating
Systems which support multithreading, a process can consist of multiple threads, which can run
simultaneously in the same way that a multiuser Operating System supports multiple processes at
the same time. In essence, multiple threads should be able to run concurrently within a process. This is why
a thread is sometimes referred to as a lightweight process.
Let us illustrate the concept of multiple threads using
an example without multithreading first and then using one
with it. Let us consider a utility which reads records from
a tape with a blocking factor of 5, processes them (it may,
for example, select or reformat them) and writes them onto a disk one
by one. Obviously, the speed of the input or read operation
may be quite different from the speed of output or write
operation. The logic of the program is given in Fig. 6.32.
Let us imagine that we have Round Robin scheduling
with time slice for each process = 25 ms. When the process
running our program is scheduled, 25 ms will be allocated to it, but almost immediately, the process will
be blocked due to the ‘Read’ system call, thereby utilizing only a small fraction of the time slice. When the
record is eventually read, the process becomes ready and then it is dispatched. But almost immediately, it
will be blocked again due to the ‘Write’ system call, and so on. In the Round Robin philosophy, 25 ms will
be allocated to a process regardless of its past behaviour. Hence, the I/O bound processes such as this suffer
in the bargain.
We have seen that heuristic scheduling provides one of the solutions to this problem, but providing
this type of scheduling is not inexpensive, and not many Operating Systems choose that path.
Multithreading provides yet another improvement in solving this problem. The idea is simple and it works
as follows:
(i) The programmer (in this case the one who is writing this tape to disk copy utility) defines two threads
within the same process as shown in Fig. 6.33. The advantage is that they can run concurrently within
the same process if synchronized properly. We need not bother about the exact syntax with which a
programmer can define a thread within a process. Let us just assume that it is possible.
(ii) The compiler recognises these as different threads and maintains their identity as such in the executable
code. In our example in Fig. 6.33, a thread is encapsulated between Thread-N-Begin and Thread-N-
End statements for thread N.
(iii) When the process starts executing, the Operating System creates a PCB as usual, but now in addition,
it also creates a Thread Control Block (TCB) for each of the recognised and declared threads
within that process. The TCB contains, apart from other things, a register save area to store the
registers at the context switch of a thread rather than of a process. Because the idea is to run different threads
simultaneously within a process, similar concepts, ideas and data structures are used here as the ones
used for multiple simultaneous processes. Hence, you need a queue of TCBs and a queue header in
addition to that for PCBs. A thread can now get blocked like a process can. Hence, a TCB needs to
have a register save area for each thread within a process.
(iv) The threads also could have priorities and states. A thread can be in a ready, blocked or running state,
and accordingly all the TCBs are linked together in the same way that PCBs are linked in different
queues with their separate headers.
(v) When the Operating System schedules a process with multiple threads and allocates a time slice to it,
the following happens:
(a) The Operating System selects the highest priority ready thread within that process and schedules
it.
(b) At any time, if the process’ time slice is over, the Operating System moves the process as well as the
currently running thread from the running state to the ready state.
(c) If the process time slice is not over but the current thread is either over or blocked, the Operating
System chooses the next highest priority ready thread within that process and schedules it. But
if there is no ready thread left to be scheduled within that process, only then does the Operating
System turn the process state into a blocked state. And it is in this procedure that there is an
advantage which we will see later. It is worth noting that the process itself does not get blocked
if there is at least one thread within it which can execute within the allocated time slice.
(d) Different threads need to communicate with each other like different processes do. Our example
can be treated as a producer–consumer problem. The tape-read is the producer task and disk-
write is the consumer task. Hence, in multitasking, both Inter Task Communication and Task
Synchronisation are involved and the Operating System has to solve the problems of race
conditions through mutual exclusion. We shall study this problem in more detail in the next
chapter.
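A much simplified view of the data structures mentioned in step (iii) above is sketched below; the field names are assumptions for this illustration, and thread priorities are ignored for brevity (the first ready thread is chosen rather than the highest priority one).

#include <stdio.h>
#include <stddef.h>

enum tstate { T_READY, T_RUNNING, T_BLOCKED };

struct tcb {                        /* Thread Control Block                     */
    int         tid;
    enum tstate status;
    long        regs[16];           /* per-thread register save area            */
    struct tcb *next;               /* chain of TCBs within the process         */
};

struct pcb {                        /* Process Control Block                    */
    int         pid;
    struct tcb *threads;            /* queue header for this process's TCBs     */
};

/* Return a runnable thread of the process, or NULL if every thread is blocked, */
/* in which case the whole process is blocked (as in step (v)(c) above).        */
struct tcb *pick_ready_thread(struct pcb *p)
{
    for (struct tcb *t = p->threads; t != NULL; t = t->next)
        if (t->status == T_READY)
            return t;
    return NULL;
}

int main(void)
{
    struct tcb t1 = { 1, T_BLOCKED, {0}, NULL };  /* disk-write thread, blocked  */
    struct tcb t0 = { 0, T_READY,   {0}, &t1 };   /* tape-read thread, ready     */
    struct pcb p  = { 8, &t0 };

    struct tcb *t = pick_ready_thread(&p);
    if (t != NULL)
        printf("run thread %d of process %d\n", t->tid, p.pid);
    else
        printf("no runnable thread: block process %d\n", p.pid);
    return 0;
}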
Clearly an Operating System with multithreading will be far more complex to design and implement than
the one without it. What did we gain? Is it worth it? The answer is not a clear yes or no.
A multithreading Operating System has an overhead, but it also allows the programmer flexibility and
improves CPU utilisation. For instance, in our example, if Thread 0 is blocked, instead of blocking the entire
process, the Operating System will find out whether Thread 1 can be scheduled. When both the threads are
blocked, only then will the entire process be blocked. Again, even if any thread becomes ready, the process
can be moved to a ready list from the blocked list and then scheduled. This reduces the overheads of context
switching at a process level, though adding to those at a thread level. The latter is generally far less time
consuming; and this is where the advantage stems from. The example may show some advantage, but may
not reveal the magnitude of the benefit because the threads in our example consist mainly of I/O only. If you
imagine more complex threads with more processing, the advantages will be clear.
Practically, threads can be implemented at two different levels, namely, user level and kernel level. The
threads implemented at kernel level are known as kernel threads. The kernel threads are entirely handled by
the Operating System scheduler. An application programmer has no direct control over kernel threads.
The threads implemented at the user level are known as user threads. The API for handling user threads
is provided by a thread library. The thread library maps the user threads to the kernel threads. Depending
on the way the user threads are mapped to the kernel threads, there are three multithreading models as
described below:
The many-to-one model associates many user threads with a single kernel
thread. Figure 6.34 depicts the many-to-one model. The thread library in the user space provides very ef-
ficient thread management. The user threads are not directly visible to the kernel and they require no kernel
support. As a result, only one user thread has access to the kernel at a time and if the thread blocks, then the
entire process gets blocked. The Green Threads library on Solaris Operating Systems implements this model.
This model is also implemented on the Operating Systems that do not provide kernel threads.
Threads in Linux are handled quite differently from those in most other operating systems.
Although the Linux kernel has supported user threads since version 1.x, support for kernel threads was
added only from version 2.x onwards. An important difference between Linux threads and threads in other
systems is that Linux does not distinguish between a process and a thread. A task represents the basic unit
of work for Linux.
To create a child process, Linux provides two system calls. The first is fork, which we have already studied
in Chapter 3. The second, Linux-specific, system call is clone. It creates a child process like the fork call, but
the important difference between the two is that fork creates a child process that has its own process context,
similar to the parent process, whereas the child process created by clone shares parts of its execution context
with the calling process, such as the memory space, the file descriptor table and the signal handler table. As a
result, the clone system call is used to implement kernel threads in Linux.
At the user level, various libraries that implement Pthreads are available. Some examples are LinuxThreads
and NPTL (the Native POSIX Thread Library).
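As a small illustration of the user-level API, a minimal Pthreads program is shown below. It uses only the standard pthread_create and pthread_join calls, is not tied to any particular library mentioned above, and is typically compiled with the -lpthread option.

#include <pthread.h>
#include <stdio.h>

/* Each thread runs this asynchronous code path within the process. */
static void *worker(void *arg)
{
    printf("hello from thread %ld\n", (long) arg);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, (void *) 1L);
    pthread_create(&t2, NULL, worker, (void *) 2L);

    pthread_join(t1, NULL);      /* wait for both threads to finish */
    pthread_join(t2, NULL);
    return 0;
}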
In practice, several processes need to communicate with
one another simultaneously. This requires proper
synchronization and use of shared data residing in
shared memory locations. We will illustrate this by what is
called a ‘Producer-Consumer problem’.
Let us assume that there are multiple users at different
terminals running different processes but each one running the
same program. This program prompts for a number from the
user and on receiving it, deposits it in a shared variable at some
common memory location. As these processes produce some
data, they are called ‘Producer processes’. Now let us imagine
that there is another process which picks up this number as soon
as any producer process outputs it and prints it. This process
which uses or consumes the data produced by the producer
process is called ‘Consumer process’. We can, therefore, see that all the producer processes communicate
with the consumer process through a shared variable where the shared data is deposited. This is depicted in
Fig. 7.1.
UNIX has a facility called ‘pipe’ which works in a very similar manner. Instead of just one variable, UNIX
assigns a file, typically 4096 bytes in size, which could reside entirely in memory. When Process A
wants to communicate with Process B, Process A keeps writing bytes into this shared file (i.e. the pipe) and
Process B similarly keeps reading from this shared file in the same sequence in which it was produced. This
is how UNIX can allow a facility of pipes through which the output of one process becomes the input to the
next process. This is shown in Fig. 7.2.
Let us say that the sales data is to be selected for a specific division by a program or utility P1. The output
of P1 (i.e. the selected records) is fed to the query program (P2) as input through a pipe. The query program
P2 finally displays the results of the enquiry. All that the user has to do is to give commands to the UNIX shell
to execute P1 to select the data, pipe it to P2 and execute P2 using this piped input data to produce the results.
The pipe is internally managed as a shared file. The beauty of this scheme is that the user is not aware of this
shared file. UNIX manages it for him. Thus, it becomes a vehicle for the ‘Inter Process Communication
(IPC)’. However, one thing should be remembered. A pipe connects only two processes, i.e. it is shared
only between two processes, and it has a “direction” of data flow. A shared variable is a much more general
concept. It can be shared amongst many processes and it can be written/read arbitrarily.
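The pipe facility itself is visible to programs through the pipe system call. A minimal sketch in C, in which the parent acts as the producer and the child as the consumer, is given below; the message text is, of course, only an example.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    char buf[64];

    pipe(fd);                         /* fd[0]: read end, fd[1]: write end */
    if (fork() == 0) {                /* child acts as the consumer        */
        close(fd[1]);
        int n = read(fd[0], buf, sizeof buf - 1);
        buf[n] = '\0';
        printf("consumer read: %s\n", buf);
        close(fd[0]);
        return 0;
    }
    close(fd[0]);                     /* parent acts as the producer       */
    write(fd[1], "sales record 42", 15);
    close(fd[1]);
    wait(NULL);
    return 0;
}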
Another example of a producer–consumer situation and the IPC is the spooler process within the Operating
System. The Operating System maintains a shared list of files to be printed for the spooler process to pick up
one by one and print. At any time, any process wanting to print a file adds the file name to this list. Thus, this
shared list becomes a medium of IPC. This is depicted in Fig. 7.3.
Sometimes, two or more processes need to be synchronized based on something. The common or shared
variables again provide the means for such synchronization. For instance, let us imagine that Process A has
to perform a task only after Process B has finished a certain other task. In this case, a shared variable (say, a
flag) could be used to communicate the completion of the task by Process B. After this, Process A can check
this flag and proceed, depending on its value, finally resetting the flag for future use.
In a sense, all the examples discussed above are typical of both ‘Process Synchronization’ and IPC, because
both are closely related. For example, in the first case, unless any of the producer processes outputs a number,
the consumer process should not try to print anything. Again, unless the consumer process prints it, none of
the producer processes should output the next number if overwriting is to be avoided (assuming that there is
only one shared variable). Thus, it is a problem of process synchronization again!
There is however a serious problem in implementing these schemes. Let us again go back to our first
example. Let us see how we can achieve this synchronization to avoid overwriting. Let us imagine that apart
from a shared variable to hold the number, we also have another flag variable which takes on the value of 0 or
1. The value of the flag is 1 if any of the producer processes has output a number. Hence, no producer process
should output a new number if this flag = 1, to avoid overwriting. Similarly, the consumer process will print
the number only if the flag = 1 and will set the flag to 0, thereafter. Again, the consumer process should not
print anything if this flag = 0 (i.e. nothing is ready for printing).
We illustrate this in the programs in Fig. 7.4.
In this scheme, as we know, instruction P.0 i.e. “while flag = 1 do;” is a wait loop so long as the flag
continues to be = 1. The very moment the flag becomes 0, the program goes down to step P.1 and thereafter
to P.2 whereupon the flag is set to 1. When a process reaches instruction P.2, it means that some producer
process has output the number in a shared variable at instruction P.1. At this juncture, if another producer
process tries to output a number, it should be prevented from doing so in order to avoid overwriting. That
is the reason, the flag is set to 1 in the instruction P.2. After this, the while-do wait loop precisely achieves
this prevention. This is because as long as the flag = 1, the new producer process cannot proceed. A similar
philosophy is applicable for instructions C.0, C.1 and C.2 of the consumer process.
For instance, the consumer process does not proceed beyond C.0 as long as the flag continues to be = 0,
which indicates that there is nothing to print. As soon as the flag becomes 1, indicating that something is
output and is ready for printing, the consumer process executes C.1 and C.2, whereupon the number is printed
and the flag is again set to 0, so that subsequently the consumer process does not print non-existing numbers
but keeps on looping at C.0.
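Since Fig. 7.4 is referred to here only by its instruction labels, a plain C rendering of the same (deliberately flawed) scheme may help. The variable names flag and shared_number are assumed for the illustration, and the main function merely runs the two routines one after the other.

#include <stdio.h>

static volatile int flag = 0;        /* 1 => a number is waiting to be printed   */
static volatile int shared_number;

void producer(int value)
{
    while (flag == 1)                /* P.0: busy-wait while the previous number */
        ;                            /*      has not yet been consumed           */
    shared_number = value;           /* P.1: output the number                   */
    flag = 1;                        /* P.2: mark it as ready for printing       */
}

void consumer(void)
{
    while (flag == 0)                /* C.0: busy-wait until something is ready  */
        ;
    printf("%d\n", shared_number);   /* C.1: print the number                    */
    flag = 0;                        /* C.2: mark the shared variable as empty   */
}

int main(void)                       /* single-threaded demonstration only       */
{
    producer(42);
    consumer();
    return 0;
}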
Everything looks fine; where then is the problem? The problem will become apparent if we consider the
following sequence of events.
(i) Let us assume that initially the flag = 0.
(ii) One of the producer processes (PA) executes instruction P.0. Because the flag = 0, it does not wait at
P.0, but it goes to instruction P.1.
(iii) PA outputs a number in the shared variable by executing instruction P.1.
(iv) At this moment, the time slice allocated to PA gets over and that process is moved from running to
ready state. The flag is still = 0.
(v) Another producer process PB is now scheduled. (It is not necessary that a consumer process is
scheduled always after a producer process is executed once.)
(vi) PB also executes P.0 and finds the flag as 0, and therefore, goes to P.1.
(vii) PB overwrites on the shared variable by instruction P.1 therefore, causing the data to be lost.
Hence, there is a problem. An apparent problem is that setting of the flag to 1 in the producer process is
delayed. If the flag is set to 1 as soon as a decision is made to output the number, but before actually outputting
it, what will happen? Can it solve the problem? Let us examine this further. The modified algorithms of the
producer and consumer processes will be as shown in Fig. 7.5.
However, in this scheme, the problem does not get solved. Let us consider the following sequence of
events to see why.
(i) Initially flag = 0
(ii) PA executes instruction P.0 and falls through to P.1, as the flag = 0.
(iii) PA sets flag to 1 by instruction P.1.
(iv) The time slice for PA is over and the processor is allocated to another Producer process PB.
(v) PB keeps waiting at instruction P.0 because flag is now = 1. This continues until its time slice also
is over, without doing anything useful. Hence, even if the shared data item (i.e. the number in this
case) is empty, PB cannot output the number. This is clearly wasteful, though it may not be a serious
problem. Let us proceed further.
(vi) A consumer process CA is now scheduled. It will fall through C.0 because flag = 1. (It was set by PA
in step (iii) )
(vii) CA will set flag to 0 by instruction C.1.
(viii) CA will print the number by instruction C.2 before the producer has output it (maybe the earlier
number will get printed again!). This is certainly wrong!
Therefore, just preponing the setting of the flags does not work. What then is the solution?
Before going into the solution, let us understand the problem correctly. The portion in any program which
accesses a shared resource (such as a shared variable in the memory) is called the ‘Critical Section’ or
‘Critical Region’. In our example, instructions P.1 and P.2 of producer process or instructions C.1 and C.2
of consumer process constitute the critical region. This is because both the flag and the data item where the
number is output by producer process are shared variables. The problem that we were facing was caused by
what is called a ‘race condition’. When two or more processes are reading or writing some shared data and the
final outcome depends upon precisely when each process runs, the situation is called a ‘race condition’.
We were clearly facing this problem in our example. This is obviously undesirable, because the results are
unpredictable. What we need is a highly accurate and predictable environment. How can we avoid race
conditions?
A closer look will reveal that the race conditions arose because more than one process was in the critical
region at the same time. One point must be remembered. A critical region here actually means a critical
region of any program. It does not have to be of the same program. In the first example (Fig. 7.4), the problem
arose because both PA and PB were in the critical region of the same program at the same time. However,
PA and PB were two producer processes running the same program. In the second example (Fig. 7.5), the
problem arose even if processes PA and CA were running separate programs and both were in their respective
critical regions simultaneously. This should be clear by going through our example with both alternatives as
in Figs. 7.4 and 7.5. What is the solution to this problem then?
If we could guarantee that only one process is allowed to enter any critical region (i.e. of any process) at
a given time, the problem of race condition will vanish. For instance, in any one of the two cases depicted
in Figs. 7.4 and 7.5, when PA has executed instruction P.1 and is timed out (i.e. without completing and
getting out of its critical region), and if we find some mechanism to disallow any other process (producer
or consumer) to enter its respective critical regions, the problem will be solved. This is because no other
producer process such as PB would be able to execute instructions P.1 or P.2 and no other consumer process
such as CA would be allowed to execute instructions C.1 or C.2. After PA is scheduled again, only PA would
then be allowed to complete the execution of the critical region. Until that happens, all the other processes
wanting to enter their critical regions would keep waiting. When PA gets out of its critical region, one of the
other processes can now enter its critical region; and that is just fine. Therefore, what we want is ‘mutual
exclusion’ which could turn out to be a complex design exercise. We will outline the major issues involved
in implementing this strategy in the next section.
It is important to remember that mutual exclusion is a necessary, though not a sufficient condition, for a
good Operating System. This is because the process for achieving mutual exclusion could be very expensive.
We do not want an Operating System which achieves mutual exclusion at the cost of being extremely slow,
or by making processes wait for a very long time.
In fact, we can list five conditions which can make any solution acceptable. They are:
(i) No two processes should be allowed to be inside their critical regions at the same time (mutual
exclusion).
(ii) The solution should be implemented only in the software, without assuming any special feature of the
machine such as specially designed mutual exclusion instructions. This is not strictly a precondition
but a preference, as it enhances portability to other hardware platforms, which may or may not have
this facility.
(iii) No process should be made to wait for a very long time before it enters its critical region (indefinite
postponement).
(iv) The solution should not be based on any assumption about the number of CPUs or the relative speeds
of the processes.
(v) Any process operating outside its critical region should not be able to prevent another process from
entering its critical region.
We will now proceed to seek solutions which satisfy all these conditions.
One of the alternatives is to use a lock-flag. Consider a lock-flag which takes on only two values “F” or “N”.
If any process is in its critical region, this lock-flag is set to “N” (Not free). If no process is in any critical
region, it is set to “F” (Free). Using this, the algorithm for any process could be as shown in Fig. 7.7. A
process wanting to enter the critical region checks whether the lock-flag is “N”. If it is “N”, i.e. the critical
region is not free, it keeps waiting, indicated by the instruction 0, where the “While...do” suggests a wait loop.
When the lock-flag becomes “F”, the process falls through to instruction 1, where it sets the lock-flag to “N”
and enters its critical region.
The idea is that while one process is in its critical region, no other process should be allowed to enter its
critical region. This can be achieved because, every process has a structure as shown in Fig. 7.7. Therefore,
any other process also will check the lock-flag before entering its critical region. Since the lock-flag is a
shared variable amongst all the processes, its value will be “N” and therefore, the new process will keep
waiting. It will not enter its critical region. When
the process in the critical region gets out of it, it sets
the lock-flag to “F” as shown in the instruction 3 of
Fig. 7.7. At this juncture, the other process waiting
on the lock-flag at instruction 0 can enter its critical
region. Therefore, this procedure is designed to
ensure that at any time, only one process is in the
critical region.
We should not confuse the lock-flag with the flag
in our earlier example. Earlier, the flag indicated whether the memory location contained any valid number
to be printed. In this case, the lock-flag indicates whether any critical region is entered by any process (i.e. it
is busy) or not.
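A C rendering of the lock-flag scheme of Fig. 7.7 is given below as a fragment (not a complete program); the function names are assumed for the illustration. The comment marks the exact window which, as the following trace shows, defeats the scheme.

static volatile char lock_flag = 'F';     /* 'F' = free, 'N' = not free   */

void enter_region(void)
{
    while (lock_flag == 'N')              /* instruction 0: the wait loop */
        ;
    /* a context switch at this point lets another process slip past the */
    /* test before the flag below is actually set                        */
    lock_flag = 'N';                      /* instruction 1: claim it      */
}

void leave_region(void)
{
    lock_flag = 'F';                      /* instruction 3: release it    */
}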
At first sight, the solution seems satisfactory. But then, this solution also has a major bug. In fact, it does
not solve the problem of mutual exclusion at all. Consider the following sequence of events, assuming that
initially the lock-flag = “F”.
(i) Process A executes instruction 0 and finds the lock-flag = “F”, and decides to go to instruction 1, but
loses control of the CPU before actually executing it as the time slice is up. Hence, the lock-flag still
remains to be “F”.
(ii) Process B now is scheduled. It also executes instruction 0 and still finds lock-flag = “F”.
(iii) Process B executes instruction 1 and sets the lock-flag to “N”.
(iv) Process B enters the critical region and half way through, loses the control of CPU as the time slice is
up.
(v) Process A is scheduled again. It resumes from instruction 1 and sets the lock-flag to “N” (which was
already set to “N” by Process B).
(vi) Process A also enters the critical region.
Thus, both the processes (A and B) are in the critical regions simultaneously. The objective of mutual
exclusion has not been achieved. The race condition and the resultant inconsistency may occur.
What we need are mutual exclusion primitives or instructions which will guarantee the mutual exclusion.
There have been many attempts in that direction. We need not at this juncture worry about whether a primitive
can be implemented using hardware or software. The meaning of primitives will be very clear in the subse-
quent discussions.
Let us imagine that we have two primitives: Begin-Critical-Region and End-Critical-Region. We can use
them as a boundary of a critical region in any process. They really act like security guards. The system uses
these primitives to recognize the critical region and allows only one process to be in any critical region at
any given time. Let us, for now, not bother about how these primitives are implemented, but assume that they
exist. We could now rewrite our producer–consumer programs of Fig. 7.4 as shown in Fig. 7.8.
The idea is that for any process when Begin-Critical-Region is encountered, the system checks if there is
any other process in the critical region and if yes, no other process is allowed to enter into it. This guarantees
mutual exclusion. If we retrace steps (i) to (vii) discussed in connection with Fig. 7.4, we will realize this.
For cross-reference, we have retained the numbers such as P.0, P.1 etc. We have added only P.S1, P.S2 and
C.S1 and C.S2 as mutual exclusion primitives in Fig. 7.8. The whole scheme will work in the following way:
(i) Let us assume that initially the flag = 0
(ii) A producer process PA executes P.0. Because flag = 0, it falls through to P.S1. Again, assuming that
there is no other process in the critical region, it will fall through to P.1.
(iii) PA outputs the number in a shared variable by executing P.1.
(iv) Let us assume that at this moment the time slice for PA gets over, and it is moved into the ‘Ready’
state from the ‘Running’ state. The flag is still 0.
(v) Another producer process PB now executes P.0. It finds that flag = 0 and so falls through to P.S1.
(vi) Because PA is in the critical region already, PB is not allowed to proceed further, thereby, avoiding
the problem of race conditions. This is our assumption about the mutual exclusion primitives.
We can verify that the scheme works in many different conditions. The problem that remains now is only
that of implementing these primitives.
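On present-day systems, such primitives are most commonly realised with a mutual exclusion lock. The sketch below uses the standard POSIX Pthreads mutex, which is a modern stand-in and not the primitives assumed in the text; for brevity, the busy-wait on the flag is replaced by a simple test made inside the locked region.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t region = PTHREAD_MUTEX_INITIALIZER;
static int flag = 0, shared_number;

void producer(int value)
{
    pthread_mutex_lock(&region);         /* Begin-Critical-Region (P.S1) */
    if (flag == 0) {
        shared_number = value;           /* P.1 */
        flag = 1;                        /* P.2 */
    }
    pthread_mutex_unlock(&region);       /* End-Critical-Region (P.S2)   */
}

void consumer(void)
{
    pthread_mutex_lock(&region);         /* C.S1 */
    if (flag == 1) {
        printf("%d\n", shared_number);   /* C.1 */
        flag = 0;                        /* C.2 */
    }
    pthread_mutex_unlock(&region);       /* C.S2 */
}

int main(void)                           /* sequential demonstration only */
{
    producer(42);
    consumer();
    return 0;
}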
There have been a number of attempts to implement the primitives for mutual exclusion. Many algorithms do
not solve the problem of mutual exclusion at all. Some algorithms are based on the assumption that there are
only two processes. Some assume the existence of a special hardware instruction such as ‘Test and Set Lock’
(TSL). All these solutions were refined by the Dutch mathematician Dekker into a feasible solution which was
further developed by E.W. Dijkstra. But the solution was very complicated. Finally, in 1981, G. L. Peterson
developed a simple but feasible solution.
However, all these solutions, including Peterson’s, required a phenomenon called ‘busy waiting’. This
can be explained again using our producer–consumer example. In our example, as depicted in Fig. 7.4, if flag
= 0 and the producer process enters its critical region, the consumer process keeps looping to check the flag,
until it becomes 1. This is called busy waiting. This is highly undesirable because it wastes CPU resource.
In this scheme, the consumer process is still a process which can contend for CPU resources because it is a
ready process. It is not blocked. It is not waiting for any I/O operation. It is waiting on the flag. If, somehow,
the consumer process could be blocked and therefore, kept away from competing for the CPU time, it would
be useful. Allocating a time slice to a process which is going to waste it in “busy waiting” anyway is quite
unproductive. If we avoid this, the CPU would be free to be scheduled to other ready processes. The blocked
consumer process would be made ready only after the flag status is changed. After this, the consumer process
could continue and enter its critical region.
This is exactly what we had wanted. This effectively means that we are treating this flag operation as
an I/O operation as far as process states are concerned, so that a process could be blocked not only while
waiting for an I/O but also while waiting for the change in the flag. The problem is that none of the solutions,
including Peterson’s, avoids busy waiting. In fact, if this was put down as a sixth condition, in addition to the
five conditions listed at the end of Sec. 7.1 for making a solution acceptable, none of these solutions
would be acceptable, however brilliant they may be.
A way out of all this had to be found. In 1965, Dijkstra suggested a solution using a ‘Semaphore’, which
is widely used today. We will discuss some of these earlier solutions and their shortcomings in the following
sections. At the end, we will discuss Dijkstra’s solution.
This was the first attempt to arrive at the mutual exclusion primitives. It is based on the assumption that there
are only two processes: A and B, and the CPU strictly alternates between them. It first schedules A, then B,
then A again, and so on. The algorithms for programs run by processes A and B are outlined in the succeeding
lines. We assume that the variable Process-ID contains the name of the process such as A or B. This is a
shared variable between these processes and is initially set up by the Operating System for them. Figure 7.9
depicts this alternating policy.
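Rendered in C, the alternating policy looks as follows; the shared variable process_id and the function names are assumed for the illustration, and the main function only demonstrates one sequential hand-over.

#include <stdio.h>

static volatile char process_id = 'A';    /* shared; initially set up by the OS */

void process_A_once(void)
{
    while (process_id != 'A')             /* A.0: wait until it is A's turn     */
        ;
    printf("A in its critical region\n"); /* A.1: the critical region           */
    process_id = 'B';                     /* A.2: hand over to B                */
}

void process_B_once(void)
{
    while (process_id != 'B')             /* B.0 */
        ;
    printf("B in its critical region\n"); /* B.1 */
    process_id = 'A';                     /* B.2 */
}

int main(void)                            /* sequential demonstration only      */
{
    process_A_once();
    process_B_once();
    return 0;
}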
We can verify that mutual exclusion is guaranteed. A.0 and A.2 are the instructions which encapsulate the
critical region and therefore, functionally play the role of the primitives for mutual exclusion. This is true
about instructions B.0 and B.2 also. Let us see how this works.
(i) Let us assume that initially Process-ID is set to “A” and Process A is scheduled. This is done by the
Operating System.
(ii) Process A will execute instruction A.0 and fall through to A.1 because Process-ID = “A”.
(iii) Process A will execute the critical region
and only then Process-ID is set to “B” at
instruction A.2. Hence, even if the context
switch takes place after A.0 or even after
A.1 but before A.2, and if Process B is then
scheduled (remember, we have assumed
that there are only two processes!), Process
B will continue to loop at instruction B.0
and will not enter the critical region. This
is because Process-ID is still “A”. Process
B can enter its critical region only if
Process-ID = “B”. And this can happen
only in instruction A.2 which in turn can
happen only after Process A has executed
its critical region in instruction A.1. This
is clear from the program for Process A as
given in Fig. 7.9.
Mutual exclusion can be guaranteed in this
scheme, but then, the scheme has some major problems as listed below:
(a) If there are more than two processes, the system can fail. Imagine three processes PA1, PA2, PA3
executing the algorithm shown for Process A and PB1 executing the algorithm shown for Process B. Now
consider the following sequence of events when Process-ID = “A”.
l PA1 starts executing. It executes A.0 and falls through to A.1. Before it executes A.1, process switch
takes place.
l PA2 starts executing, and because Process-ID is still “A”, it also goes through A.0 to A.1, and during
A.1 (i.e. while in the critical region), again process switch takes place.
l PA1 resumes at instruction A.1 and enters its critical region, thereby defeating the scheme of mutual
exclusion. Both PA1 and PA2 are now in the critical region simultaneously.
The algorithm for multiple simultaneous processes is fairly complex. Dijkstra proposed a solution but
it involved the possibility of indefinite postponement of some processes. Knuth suggested an improvement
to Dijkstra’s algorithm but it still involved a possibility of long delays for some processes. Many revised
algorithms have been suggested but they are very complex and far from satisfactory.
(b) This algorithm also involves busy waiting, and wastes the CPU time. If Process B is ready to be
dispatched, it may waste the full time slice, waiting at instruction B.0, if Process-ID = “A”.
(c) This algorithm forces Processes A and B to alternate in a strict sequence. If the speed of these two
processes is such that Process A wants to execute again before Process B takes over, it is not possible.
It is clear from the demerits discussed above that this solution violates many of the five conditions
discussed earlier, and therefore, it is not a good solution.
As mentioned earlier, many other solutions have been put forth. We will not discuss all of these. We will
now present Peterson’s algorithm.
This algorithm also is based on two processes only. It uses three variables. The first one, called Chosen-
Process, takes the value of “A” or “B” depending upon the process chosen. This is as in the earlier case. PA-
TO-ENTER and PB-TO-ENTER are two flags which take the value of “Yes” or “No”. For instance, if PA
wants to enter the critical region, PA-TO-ENTER is set to “YES” to let PB know about PA’s desire. Similarly,
PB-TO-ENTER is set to “YES”, if PB wants to enter its critical region so that PA can know about it, if it tests
this flag. The following algorithm (Fig. 7.10) will clarify the concepts.
Let us assume that we start with the following values and trace the sequence of events.
PA-TO-ENTER = “NO”,
PB-TO-ENTER = “NO”, and
Chosen-Process = “A”
(i) Let us say that PA is scheduled first.
(ii) After executing A.0 and A.1, PA-TO-ENTER will become “YES”, and Chosen-Process will be “B”.
(iii) At A.2 and A.3 (it is one statement only!), because PB-TO-ENTER is “NO”, it will fall through to
A.4. This is because of the “AND” condition in A.2. It will now start executing the Critical Region-A
at A.4.
(iv) Let us assume that at this time, process switch takes place and PB is scheduled. PA is still in its critical
region.
(v) PB will execute B.0 and B.1 to set PB-TO-
ENTER to “YES” and Chosen-Process to
“A”.
(vi) But at B.2, it will wait, because, both the
conditions are met, i.e. PA-TO-ENTER =
“YES” (in step (ii)) and Chosen-Process =
“A” (in step (v)). Thus, PB will be prevented
from entering its critical region.
(vii) Eventually when PA is scheduled again, it
completes instruction A.5 to set PA-TO-
ENTER to “NO”, but only after coming out of
its critical region.
(viii) Now if PB is scheduled again, it will resume at instruction B.2 and fall through B.2 and B.3 (because
PA-TO-ENTER = “NO” after step (vii)) to execute B.4 and enter the critical region of B. However, this
has happened only after PA has come out of its critical region.
Peterson’s algorithm is simple but brilliant. However, it suffers from the same shortcoming as discussed
earlier. More than two processes are not allowed by this algorithm. Again, it is based on the inefficient
‘busy waiting’ philosophy.
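For readers who would like to see the idea in code, the following is a minimal sketch of Peterson’s algorithm for two threads, written in C. It mirrors the structure of Fig. 7.10: flag[] plays the role of PA-TO-ENTER/PB-TO-ENTER and turn plays the role of Chosen-Process, but the names themselves, and the use of volatile, are only illustrative; on real multiprocessor hardware, additional memory barriers would be needed.

#include <stdbool.h>

/* A sketch of Peterson's algorithm for two threads, numbered 0 and 1. */
volatile bool flag[2] = { false, false };   /* "I want to enter my critical region" */
volatile int  turn    = 0;                  /* which thread should yield             */

void enter_region(int self)                 /* self is 0 or 1 */
{
    int other = 1 - self;
    flag[self] = true;                      /* announce the desire to enter          */
    turn = other;                           /* let the other thread go first         */
    while (flag[other] && turn == other)
        ;                                   /* busy wait: the very drawback the text
                                               has just discussed                     */
}

void leave_region(int self)
{
    flag[self] = false;                     /* the critical region is over           */
}

A thread brackets its critical region with enter_region() and leave_region(); mutual exclusion holds for two threads, but the loop still burns CPU time while waiting.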
All solutions discussed till now were software solutions which required no special help from hardware.
However, many computers have a special instruction called “Test and Set Lock (TSL)” instruction. This
instruction has the format: “TSL ACC, IND”, where ACC is the accumulator register and IND is the symbolic
name of a memory location which can hold a character to be used as an indicator or a flag. The following
actions are taken when this instruction is executed:
(i) The current value of the memory location IND is copied into the register ACC.
(ii) IND is then immediately set to indicate ‘not free’ (the value “N” in the discussion that follows).
The interesting point is that this is an indivisible instruction, which means that it cannot be interrupted
during its execution consisting of these two steps. Therefore, the process switch cannot take place during the
execution of this TSL instruction. It will be either fully executed (i.e. both the actions mentioned above) or it
will not be executed at all.
How can we use this TSL instruction to implement the mutual exclusion? Let us assume that IND can take
on the value “N” (indicating thereby that the critical region is being used currently, i.e. it is NOT free.) or “F”
(indicating thereby that the critical region is not being used currently, i.e. it is FREE). When this flag is “F”,
a process can set it to “N”, and only then can it enter the critical region.
Obviously, if IND = “N”, no process can enter its critical region because some process is already in the
critical region.
We had used this scheme earlier. The only difference in this scheme is the use of the TSL instruction.
Checking whether IND is “F” and unconditionally setting it to “N” can be done in one shot without interruption,
thereby removing the problem. For instance, we can write two common routines as shown in Fig. 7.11.
These routines, “Enter-Critical-Region” and “Exit-Critical-Region”, can now constitute the mutual exclusion
primitives. They are written in an assembly language of a hypothetical computer. They can be easily written
for any other computer as well.
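The effect of these routines can also be sketched in C by using an atomic test-and-set operation in place of the hypothetical TSL instruction. The sketch below assumes a C11 compiler with <stdatomic.h>; the function names simply mirror the routines of Fig. 7.11 and are otherwise illustrative.

#include <stdatomic.h>

/* ind plays the role of IND: a set flag corresponds to "N" (not free),
   a clear flag corresponds to "F" (free).                              */
static atomic_flag ind = ATOMIC_FLAG_INIT;

void enter_critical_region(void)
{
    /* atomic_flag_test_and_set is the analogue of TSL: it returns the
       old value and sets the flag in one indivisible step. Keep busy
       waiting until the old value was clear, i.e. the region was free. */
    while (atomic_flag_test_and_set(&ind))
        ;
}

void exit_critical_region(void)
{
    atomic_flag_clear(&ind);    /* set IND back to "F" */
}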
Semaphores have very wide uses and applications whenever shared variables are used. In
fact, the Operating System can use them to implement the scheme of blocking/waking up of processes when
they wait for any event such as the completion of an I/O. Thus, semaphores can be used to synchronize
processes through block/wakeup operations, as shown in Fig. 7.16.
Note that one of the processes has only DOWN instruction and the other has only UP instruction. We
would leave it to the reader to verify the detailed steps involved in the interaction between these two processes
and how they can perform block/wakeup operations.
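A minimal C illustration of this block/wakeup interaction, assuming POSIX semaphores and threads, is given below. One thread executes only the DOWN (sem_wait) and therefore blocks without busy waiting; the other executes only the UP (sem_post) and wakes it. The thread names and messages are illustrative.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t event;                 /* initialised to 0: the event has not yet occurred */

static void *waiter(void *arg)      /* has only the DOWN instruction */
{
    (void)arg;
    sem_wait(&event);               /* blocks (no busy waiting) until an UP is done */
    printf("waiter: woken up, proceeding\n");
    return NULL;
}

static void *signaller(void *arg)   /* has only the UP instruction */
{
    (void)arg;
    printf("signaller: event has occurred, waking the waiter\n");
    sem_post(&event);               /* wakes the blocked waiter */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    sem_init(&event, 0, 0);         /* initial value 0 */
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, signaller, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&event);
    return 0;
}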
The operating system literature has some extremely interesting problems pertaining to IPC. Three of them
are the most well known:
(a) The dining philosophers' problem
(b) The readers' and writers' problem
(c) The sleeping barber problem
We shall now examine these three problems and the three corresponding algorithms to tackle them.
Dijkstra posed the dining philosophers' problem in
1965, and also solved it himself. The problem can be described as follows:
Five philosophers are seated on five chairs across a table. Each philosopher has a plate full of spaghetti.
The spaghetti is very slippery, so each philosopher needs a pair of forks to eat it. However, there are only
five forks available altogether, arranged in such a manner that there is one fork between any two plates of
spaghetti. This arrangement is shown in Fig. 7.17.
(i) Each philosopher performs two activities continuously: thinking for some time and eating for some
time. Obviously, thinking does not require any special algorithm here.
(ii) However, eating does. In order to eat, a philosopher lifts two forks, one to his left and the other to his
right (not necessarily in that order).
(iii) If he is successful in obtaining two forks, he starts eating.
(iv) After some time, he stops eating and keeps both the forks down.
A simple algorithm to implement this solution is shown in Fig. 7.18. For simplicity, we shall assume that
a philosopher always picks up the left fork first, and the right fork second.
Let us examine this algorithm closely. On the face of it, it appears that it should work perfectly. However,
it has a major flaw. What if all the five philosophers decide to eat at the same time? All the five philosophers
would attempt to pick up two forks at the same time. But since only five forks are available, perhaps none of
them would succeed.
To improve the algorithm, let us add a condition between the two Take_fork instructions shown in the
algorithm.
(i) When a philosopher succeeds in obtaining a left fork, he checks to see if the right fork is available.
(ii) If the philosopher does not become successful in obtaining the right fork, he puts down his left fork
and waits for some time.
(iii) After this pre-defined time elapses, the philosopher again starts with the action of picking up the left
fork.
Unfortunately, this solution can also fail. What if the five philosophers pick up their left fork at the same
time, and attempt to pick up their right fork simultaneously? Obviously, they all will fail to obtain the right
fork, and therefore, abandon their attempt. Moreover, they might wait for the same amount of time, and re-
attempt lifting their left fork. This would continue without any philosopher succeeding in actually obtaining
both the forks.
This problem, wherein many programs running at the same time simply wait for some event to happen
without performing any useful task, is called starvation (a term which is quite appropriate here, as the
philosophers would really starve!).
One possible solution is to add randomness to this. The time for which the philosophers wait can be made
different for each one of them.
(i) Thus, if all the philosophers pick up their left fork at the same time, all the five can put their left forks
down.
(ii) Then, the first philosopher could wait for three seconds, the second philosopher could wait for five
seconds, the third philosopher could wait for just one second and so on.
(iii) This randomness can be further randomized, just in case any two philosophers somehow manage to
wait for the same time.
With such a scheme in place, this problem might be resolved. However, this is not a perfect solution,
which guarantees success in every situation. A still better scheme is desired. The appropriate solution to this
problem is the usage of a binary semaphore.
(i) Before a philosopher starts acquiring the left fork, he would do a DOWN on mutex (i.e. disallow
other philosophers from testing him).
(ii) After eating is over, a philosopher performs an UP on mutex (i.e. allow other philosophers to test
him).
(iii) With five forks, at the most two philosophers can eat at the same time. Therefore, for each philosopher,
we define three possible states: eating, hungry (making an attempt to acquire forks) or thinking.
(iv) A philosopher can move into the eating state only if both of his neighbors are not eating.
The algorithm shown in Fig. 7.19 presents an answer to the dining philosophers' problem. We allocate
one semaphore to each philosopher. This allows each philosopher to maintain his current state (e.g. a hungry
philosopher can wait before he can move into the eating state). The algorithm shows the steps carried out by
a single philosopher. The same logic applies for all the other philosophers.
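A C rendering of this scheme, one state per philosopher, a mutex guarding the shared state array, and one semaphore per philosopher, is sketched below under the assumption of POSIX threads and semaphores. It follows the same logic as Fig. 7.19; the constants and helper names are illustrative and the think/eat bodies are omitted.

#include <pthread.h>
#include <semaphore.h>

#define N        5
#define LEFT(i)  (((i) + N - 1) % N)
#define RIGHT(i) (((i) + 1) % N)

enum state { THINKING, HUNGRY, EATING };

static enum state      phil_state[N];
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;  /* guards phil_state */
static sem_t           self[N];          /* one semaphore per philosopher        */

void init_philosophers(void)
{
    for (int i = 0; i < N; i++) {
        phil_state[i] = THINKING;
        sem_init(&self[i], 0, 0);        /* initially no permission to eat */
    }
}

/* A philosopher may move to EATING only if neither neighbour is eating. */
static void test(int i)
{
    if (phil_state[i] == HUNGRY &&
        phil_state[LEFT(i)] != EATING && phil_state[RIGHT(i)] != EATING) {
        phil_state[i] = EATING;
        sem_post(&self[i]);              /* wake philosopher i if he was waiting */
    }
}

void take_forks(int i)
{
    pthread_mutex_lock(&mutex);          /* DOWN on mutex                         */
    phil_state[i] = HUNGRY;
    test(i);                             /* try to acquire both forks             */
    pthread_mutex_unlock(&mutex);        /* UP on mutex                           */
    sem_wait(&self[i]);                  /* block here if the forks were not free */
}

void put_forks(int i)
{
    pthread_mutex_lock(&mutex);
    phil_state[i] = THINKING;
    test(LEFT(i));                       /* a neighbour may now be able to eat    */
    test(RIGHT(i));
    pthread_mutex_unlock(&mutex);
}

Each philosopher repeatedly thinks, calls take_forks(i), eats, and calls put_forks(i); no philosopher busy waits, and a hungry philosopher is blocked on his own semaphore until a neighbour puts the forks down.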
In which practical situations would the dining philosophers problem apply? Clearly, it is useful when
the number of resources (such as I/O devices) is limited, and many processes compete for exclusive access
over those resources. This problem is unlikely in the case of a database access, for which the next problem
is applicable.
Imagine a large database containing thousands of records. Assume
that many programs (or processes) need to read from and write to the database at the same time. In such
situations, it is quite likely that two or more processes make an attempt to write to the database at the same
time. Even if we manage to take care of this, while a process is writing to the database, no other process must
be allowed to read from the database to avoid concurrency problems. But we must allow many processes to
read from the database at the same time.
(i) A proposed solution tackles these issues by assigning higher priorities to the reader processes, as
compared to the writer processes.
(ii) When the first reader process accesses the database, it performs a DOWN on the database. This
prevents any writing process from accessing the database.
(iii) While this reader is reading the database, if another reader arrives, that reader simply increments a
counter RC, which indicates how many readers are currently active.
(iv) Only when the counter RC becomes 0 (which indicates that no reader is active), a writer can write to
the database.
An algorithm to implement this functionality is shown in Fig. 7.20.
Clearly, this solution assigns higher priority to the readers, as compared to the writers.
(i) If many readers are active when a writer arrives, the writer must wait until all the readers end their
reading jobs.
(ii) Moreover, if a few more readers keep coming in, the writer has to wait until all of them finish reading.
This may not always be the best solution, but it is certainly safe.
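The scheme just described can be sketched in C with two POSIX semaphores, one guarding the reader count RC and one representing the database itself; this mirrors the logic of Fig. 7.20, though the names and the init helper are only illustrative.

#include <semaphore.h>

static sem_t mutex;     /* protects the reader count RC                   */
static sem_t db;        /* "the database": held by the writer, or by the
                           community of readers as a whole                */
static int   rc = 0;    /* RC: number of readers currently active         */

void init_readers_writers(void)
{
    sem_init(&mutex, 0, 1);
    sem_init(&db, 0, 1);
}

void reader(void)
{
    sem_wait(&mutex);
    rc = rc + 1;
    if (rc == 1)               /* the first reader performs a DOWN on the */
        sem_wait(&db);         /* database, locking out any writer        */
    sem_post(&mutex);

    /* ... read the database here ... */

    sem_wait(&mutex);
    rc = rc - 1;
    if (rc == 0)               /* the last reader releases the database   */
        sem_post(&db);
    sem_post(&mutex);
}

void writer(void)
{
    sem_wait(&db);             /* waits until RC has dropped to 0         */
    /* ... write to the database here ... */
    sem_post(&db);
}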
Semaphores offer a good solution to resolve the concurrency issues when multiple processes want to access
the same resource. However, the problem with semaphores is that the actual implementation of semaphores
requires programming at the level of system calls. That is, the application developer needs to explicitly invoke
semaphore-related system calls to ensure concurrency. This is not only tedious, but can actually lead to
erroneous code.
Consequently, better schemes are desired. Monitors offer such a solution. A monitor is a high-level
construct. It is an abstraction over semaphores. Coding monitors is similar to programming in high-level
programming languages, as compared to semaphores, which are akin to assembly language. Monitors are
easy to program. The compiler of a programming language usually implements them, thus reducing the scope
for programmatic errors.
A monitor is a group or collection of data items (variables), data structures and procedures. It is somewhat
similar to an object (as in the context of object technology). The client processes cannot access the data items
inside a monitor directly. The monitor guards them closely. The client objects can only invoke the services
of the monitor in the form of the monitor’s procedures. This provides a shield against the internal details of
a monitor.
A monitor is similar to a critical section. At any given time, only one process can be a part of a monitor. If
another process makes an attempt to enter the monitor while the first process has not finished, the attempt to
work inside the monitor will fail, and the latter process must wait until the process which is already inside
the monitor leaves.
The processes which utilize the services of the monitor need not know about its internal details.
They need not know, for instance, the way the monitor is implemented, or the sequence in which a
monitor executes its instructions, etc. In contrast, we know that a programmer who works with semaphores
actually needs to use this sort of information while coding.
Of course, one argument in favor of semaphores as against the monitors is that semaphores, by virtue of
their basic low-level interface, provide more granular control. This is analogous to assembly
language, which provides fine-grained control to the application programmer as compared to a high-level
language. However, if the programmer does not need such low-level control, a monitor is a better
choice.
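C has no monitor construct of its own, but the effect, private data plus procedures that admit only one process at a time and allow waiting on a condition, can be approximated with a POSIX mutex and condition variable, as in the hedged sketch below for a simple bounded counter. The names and the LIMIT value are illustrative.

#include <pthread.h>

/* A tiny "monitor": the data item count is touched only inside the two
   procedures, and each procedure holds the monitor lock for its whole
   duration, so at most one thread is ever inside.                       */
#define LIMIT 10

static int             count    = 0;
static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full = PTHREAD_COND_INITIALIZER;

void monitor_increment(void)
{
    pthread_mutex_lock(&lock);                    /* enter the monitor       */
    while (count == LIMIT)
        pthread_cond_wait(&not_full, &lock);      /* wait inside the monitor */
    count++;
    pthread_mutex_unlock(&lock);                  /* leave the monitor       */
}

void monitor_decrement(void)
{
    pthread_mutex_lock(&lock);
    if (count > 0)
        count--;
    pthread_cond_signal(&not_full);               /* wake a waiting thread   */
    pthread_mutex_unlock(&lock);
}

In languages such as Java, the compiler and run time generate this locking automatically for synchronized methods, which is exactly the reduction in programmer effort the text describes.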
The need for message passing came about because the techniques such as semaphores and monitors work
fine in the case of local scope. In other words, as long as the processes are local (i.e. on the same CPU), these
techniques work perfectly. However, they are not intended to serve the needs of processes, which are on
physically different machines. Such processes, which communicate over a network, need some mechanism to
be able to perform communication with each other, and yet be able to ensure concurrency. Message passing
is the solution to this problem.
Using the technique of message passing, one process (sender) can safely send a message to another process
(destination), without worrying if the message would reach the destination process. This is conceptually
similar to the technology of Remote Procedure Calls (RPC), the difference being that message passing is an
operating system concept, whereas RPC is a data communications concept.
In message passing, two primitives are generally used: send and receive. The sender uses the 'send' call to
send a message. The receiver uses the 'receive' call. These two calls take the following form:
send (destination, &message);
receive (source, &message);
Notably, the two processes can be local, i.e. on the same machine, or they can be remote, i.e. on physically
different machines.
l If the two processes are local, the message passing mechanism is quite simple.
l However, if the two processes are not on the same machine, a lot of overhead is required to ensure
that the message passing is successful. For instance, the receiver has to send an acknowledgement (either a
positive acknowledgement, i.e. ACK or a negative acknowledgement, i.e. NAK) to the sender. The sender has
to take an appropriate action accordingly. There can be other issues, as well. How long should the sender wait
for an acknowledgement before re-sending the message? What if the re-transmission fails, as well? How does
the receiver distinguish between the various parts of a message, if the sender has broken down the original
messages into multiple parts and sent them separately? We can see that to handle such situations, the message
passing mechanism has to be a bit similar to the Transmission Control Protocol (TCP), which guarantees an
error-free, only-once and guaranteed delivery of messages.
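For two local processes, the send and receive primitives can be mimicked with an ordinary pipe, as in the hedged C sketch below; for processes on different machines the same pair of calls would sit on top of sockets and a protocol such as TCP, which supplies the acknowledgements and retransmissions just mentioned. The message text is, of course, only illustrative.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    pipe(fd);                           /* fd[0] = receiving end, fd[1] = sending end */

    if (fork() == 0) {                  /* child process acts as the receiver */
        char message[64];
        close(fd[1]);
        ssize_t n = read(fd[0], message, sizeof(message) - 1);   /* receive(source, &message) */
        if (n > 0) {
            message[n] = '\0';
            printf("receiver got: %s\n", message);
        }
        return 0;
    }

    /* parent process acts as the sender */
    const char *message = "hello from the sender";
    close(fd[0]);
    write(fd[1], message, strlen(message));       /* send(destination, &message) */
    close(fd[1]);
    wait(NULL);                                   /* wait for the receiver to finish */
    return 0;
}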
A process is a program that is currently executing. A process is created as a result of a specific type of
system call. In a multi-user environment, there are many users working simultaneously, and each user
generally requests for and initiates some processing. This means that there is a need to execute many
processes simultaneously to satisfy all the users. In a single-user Operating System also, there are multiple
processes running to perform multiple tasks.
Each process requires resources such as memory, CPU time, access to files/directories, etc. In the case
of concurrent processes, memory and CPU time would be distributed among all the processes. The process
management function of the Operating System monitors concurrent process execution.
Each process goes through the following stages during its execution:
Start – the process starts
Wait – the process waits
Terminate – the process exits when its task is complete
Child process – a process can create child processes. Parent and child processes can execute concurrently,
and a parent process can wait till the execution of the child process is finished, as the sketch below illustrates.
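A minimal C sketch of a parent creating a child process and waiting for it to finish is given below; the printed messages are only illustrative.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                  /* create a child process                    */

    if (pid == 0) {                      /* child: runs concurrently with the parent  */
        printf("child: doing some work\n");
        return 0;                        /* terminates when its task is complete      */
    }

    waitpid(pid, NULL, 0);               /* parent waits till the child has finished  */
    printf("parent: child has finished\n");
    return 0;
}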
When concurrent processes are executing, there are two types of
processes: (1) independent processes and (2) cooperating processes.
Independent process – A process which does not affect the execution of other running processes and which
cannot be affected by the execution of other running processes is an independent process. When a process is
running independently, it does not share data or resources with any other process.
Cooperating process – When the execution of a process depends on other processes, the processes are called
cooperating processes. Cooperating processes share data and resources among each other.
We require a supporting environment for cooperating processes for the following important reasons:
Information sharing – When many users are interested in the same file or database table, no single process
can operate exclusively on that file/database table. Hence, concurrent access is required on such resources
for information sharing.
CPU utilization – When a process has been divided into small processes and each small process executes
concurrently, CPU utilization improves and the overall task can be finished in less time.
Concurrent processes are programs that are designed in such a way that the whole program can be divided
into small interactive executable pieces, which can run in parallel or sequentially. Each piece of the program
behaves as a separate computational process. Processes may be close to each other or they may be distributed
across a network. The main challenge in designing concurrent programs is ensuring that process execution is
properly synchronized, with no resource locking problems and no situations such as deadlocks. There are
further challenges such as process coordination, inter-process communication, coordinating access to
resources, and sharing of resources among the processes.
Some I/O media, such as disks, are easily sharable.
Multiple processes could be using the same disk
drive for reading or writing. But we cannot do the same
for certain I/O media such as a tape drive, a printer or a plotter. For
instance, it is not very easily imaginable to have a printer allocated
to two processes, and worse yet, belonging to two different
users. In such a case, a printed report may contain some lines of
payslips interspersed with the lines of sales analysis or production
figures. One can imagine the resulting chaos (or the addition to it)!
This is why some I/O media have to be allocated exclusively to
only one process. Because of its non-sharable nature, the user
process has to request for the entire device explicitly and the
Operating System has to allocate such I/O devices accordingly.
Only when the user process gives up a device explicitly, can
the Operating System take it back and add it to the free pool.
Some problems arise when an I/O medium is allocated to a process in an exclusive manner. Let us imagine
two processes, PA and PB running simultaneously. Half way through, PA requests the Operating System for
a file on the tape drive, and PB requests the Operating System for the printer. Let us assume that both the
requests are granted. After a while, PA requests for the printer without giving up the tape drive. Similarly, PB
requests for the tape drive without giving up the control of the printer. Assuming that the system has only one
tape drive and one printer, what will happen? It is clear that both the processes cannot proceed. PA will wait
until PB releases the printer. But that can happen only if PB can proceed further and finish off its processing
with the printer. And this can happen only if PB gets the tape drive that PA is holding. PB can get the tape
drive only if PA can proceed further and completes its work with the tape drive. That, too, cannot happen
unless PA gets the printer which PB is holding. This situation is called a ‘deadlock’.
It is not necessary that a deadlock can take place only in the context of the I/O media. It can, in fact, happen
for any shared resource, such as the internal tables maintained by the Operating System or even semaphores.
Due to the problem of race conditions, we want semaphores and certain shared variables to be accessed in
an exclusive manner, as seen earlier. These then can become the causes of a deadlock. For instance, let us
assume that an Operating System allows a maximum of 48 processes because it has allocated an area for only
48 PCBs and other data structures. If a process creates a child process, a new PCB has to be acquired and
allocated to it. If no new PCB is available, the parent process waits, and after a while, attempts to acquire
a PCB again. Normally, after a while, if some other process is killed, a PCB will become available and the
attempt to create a child process may succeed. So far so good. But imagine that there are 8 processes running
simultaneously, each of which needs to create 9 subprocesses or children. In this case, assuming that the
nature and speed of the processes are the same, after each process has created 5 subprocesses, the total
number of processes will be 8 parent processes + (8x5) child processes = 48. When the 8 parent processes
start creating the 6th child process each, the PCB space will be exhausted, and all the processes will go in the
endless wait loop hoping that some day, there might be some space for the new PCBs to be created. But that
day will never arrive, because it is a deadlock. No other process exists which will eventually terminate to free
a PCB to allow for the forking parent process to proceed and create further child processes. All the processes
will keep on waiting.
A similar problem exists in Database Management Systems due to locking of records. Process A has
locked record REC–0 and wants to read REC–1. Process B has locked REC–1 and issues a call to read
REC–0. What will happen?
To represent the relationship between processes and resources, a certain graphical notation is used.
Figure 8.1 shows square boxes as resources named R1 and R2. Similarly, processes shown as hexagons
are named P1 and P2. The arrows show the relationship. For instance, in part (a) of the figure, resource R1 is
allocated to process P1, or in other words, P1 holds R1. In part (b) of the figure, process P2 wants resource
R2, but it has not yet got it. It is waiting for it. (The moment it gets the resource, the direction of the arrow
will change.)
These graphs are called ‘Directed Resource Allocation Graphs (DRAG)’. They help us in understanding
the process of detection of a deadlock, as we shall see.
Now let us imagine a typical scenario for the deadlock.
l P1 holds R1 but demands R2
l P2 holds R2 but demands R1
If we draw a DRAG for this situation, it will look as shown in Fig. 8.2.
You will notice that there is a closed loop involved. Therefore,
this situation is called a ‘circular wait’ condition. We should not get
confused by the shape of the graph. For instance, the same DRAG can be
drawn as shown in Fig. 8.3.
If you start from any node and follow all the arrows, you must return
to the original node. This is what makes it a circular wait or a deadlock
situation. The shape is immaterial.
This principle is used by the Operating System to detect deadlocks. However, what we have
presented is a simplistic picture. In practice, the DRAGs can get very complicated, and
therefore, the detection of a deadlock is never so simple! At any moment, when the Operating System
realizes that the existing processes are not finishing for an unduly long time, it can find out whether
there is a deadlock situation or not. All resource allocations are made by the Operating System itself.
When any process waits for a resource, it is again the Operating System which keeps track of this
situation of waiting. Therefore, the Operating System knows which processes are holding which resources
and which resources these processes are waiting on. In order to detect a deadlock, the Operating System
can give some imaginary coordinates to the nodes, R and P. Depending upon the relationships between
resources and processes (i.e. directions of the arrows), it can keep traversing, each time checking if it has
returned to a node it has already travelled by, to detect the incidence of a deadlock.
What does the Operating System do if it finds
a deadlock? The only way out is to kill one of the
processes so that the cycle is broken. Many large
mainframe computers use this strategy. Some systems do not go through the overhead of constructing a
DRAG. They monitor the performance of all the processes. If none finishes for a very long time, the Operating
System kills one of the processes. This is a crude but quicker way to get around the problem.
What causes a deadlock? Coffman, Elphick and Shoshani in 1971 have shown that there are four
conditions all of which must be satisfied for a deadlock to take place. These conditions are given
below:
Resources must be allocated to processes at any time in an exclusive manner and not on a shared basis for a
deadlock to be possible. For instance, a disk drive can be shared by two processes simultaneously. This will
not cause a deadlock. But printers, tape drives, plotters etc. have to be allocated to a process in an exclusive
manner until the process completely finishes its work with it (which normally happens when the process
ends). This is the cause of trouble.
Even if a process holds certain resources at any moment, it should be possible for it to request for new ones.
It should not have to give up the already held resources to be able to request for new ones. If this is not true,
a deadlock can never take place.
If a process holds certain resources, no other process should be able to take them away from it forcibly. Only
the process holding them should be able to release them explicitly.
Processes (P1, P2, ...) and Resources (R1, R2, ...) should form a circular list as expressed in the form of a
graph (DRAG). In short, there must be a circular (logically, and not in terms of the shape) chain of multiple
resources and multiple processes forming a closed loop as discussed earlier.
It is necessary to understand that all these four conditions have to be satisfied simultaneously for the
existence of a deadlock. If any one of them does not exist, a deadlock can be avoided.
Various strategies have been followed by different Operating Systems to deal with the problem
of a deadlock. These are listed below:
l Ignore it.
l Detect it.
l Recover from it.
l Prevent it.
l Avoid it.
We will now discuss these strategies one by one. These are also the areas in which research is going on
because none of the approaches available today is really completely satisfactory.
There are many approaches one can take to deal with deadlocks. One of them, and of course the simplest, is to
ignore them. Pretend as if you are totally unaware of them. (This is the reason why it is interestingly called
the ‘Ostrich algorithm’.)
People who like exactitude and predictability do not like this approach, but there is a very valid reason to
ignore a deadlock. Firstly, the deadlock detection, recovery and prevention algorithms are complex to write,
test and debug. Secondly, they slow down the system considerably. As against that, if a deadlock occurs
very rarely, you may have to restart the affected jobs, but such a loss of time is infrequent and may not be
significant. UNIX follows this approach on the assumption that most users would prefer an occasional
deadlock to a very restrictive, inconvenient, complex and slow system.
We have discussed one of the techniques for the detection of a deadlock in Sec. 8.2. The graphs (DRAG)
provide good help in doing this, as we have seen. However, normally, a realistic DRAG is not as straightforward
as a DRAG between two processes (P1, P2) and two resources (R1 and R2) as depicted in Fig. 8.2. In reality,
there could be a number of resource types such as printers, plotters, tapes and so on. For instance, the system
could have two identical printers, and the Operating System must be told about it at the time of system
generation. It could well be that a specific process could do with either of the printers when requested. The
complexity arises due to the fact that allocation to a process is made of a specific resource by the Operating
System, depending upon the availability but the request is normally made by the process to the Operating
System for only a resource type (i.e. any resource belonging to that type). A very large number of processes
can make this DRAG look more complex and the deadlock detection more time-consuming.
We will denote multiple instances of the same resource type by means of multiple symbols within the
square. For example, consider the DRAG as an example as shown in Fig. 8.4.
R1 is a resource type–say, a tape drive of a certain
kind, and let us assume that there are two tape drives,
R10 and R11 of the same kind known to the system.
R2 may be a printer of a certain type and there may
be only one of that type available in the system–say,
R20. The DRAG shows the possibility of an apparent
circular wait, but it is actually not so. Therefore,
it is NOT a deadlock situation. In the figure, R10
is allocated to P1. P1 is waiting for R20. R20 is
allocated to P2. Now comes the question of the last
leg in the diagram. Let us assume that R11 is free and
P2 wants it. In this case, P2 can actually grab R11.
And if it does so, an arrow will be actually drawn
from R11 to P2 as shown in Fig. 8.5. If you traverse
from a node, following the arrows, you would not
arrive at the starting node. This violates the rules for
a circular wait.
Therefore, P2 in this case need not wait for R11.
It can go to completion. The point is that the visual
illusion of the cycle should not deceive us. It is not
a circular wait condition. If R11 however, is also not
free and is already grabbed by, say P1, it can lead to
a deadlock if P2 requests for R11.
We will follow a method to detect a deadlock where there are multiple instances for a resource type. We
will use the DRAG to achieve this. However, what is discussed here is only to clarify the concepts. By no means
is it the only way to detect deadlocks. In fact, today, there exist far more efficient and better algorithms for
this purpose.
The Operating System has to treat each resource separately, regardless of the type. The type is important
only while the Operating System is allocating resources to the processes because normally any free resource
of a given type can be allocated. For instance, if a process demands R1, the Operating System could allocate
R10 or R11 depending upon the availability. The Operating System, in this case, could do the following to
detect a deadlock:
(i) Number all processes as P0, P1, ..., PN.
(ii) Number each resource separately, using a meaningful coding scheme. For instance, the first character
could always be “R” denoting a resource. The second character could denote the resource type (0 =
tape, 1 = printer etc.) and the third character could denote the resource number or an instance within
the type, e.g. R00, R01, R02, .... could be different tape drives of the same type; R10, R11, R12 ....
could be different printers of the same type with the assumption that resources belonging to the same
type are interchangeable. The Operating System could pick up any of the available resources within a
given type and allocate it without any difference. If this is not true with certain resources, the Operating
System should treat them as different resource types such that the principle of interchangeability of
resources within the same resource type holds true.
(iii) Maintain two tables as shown in Figs. 8.6 and 8.7. One is a resourcewise table giving, for each
resource, its type, allocation status, the process to which it is allocated and the processes that are
waiting for it. In fact, we know in Device Management that the Operating System maintains the
information about the process currently holding the device in the ‘Device Control Block (DCB)’
maintained for each device. We also know that for each process waiting for the device, there is a data
structure called an “Input Output Request Block” or IORB, which is linked to the DCB. Revisiting
Fig. 5.22 will clarify that the Operating System already maintains this information in some form.
Another table is a processwise table giving, for each process, the resources held by it and the resources it
is waiting for. This is normally held along with PCB. Logically, it is a part of PCB, but an Operating System
could choose to maintain it in a separate table linked to the PCB for that process. Therefore, if DCB and PCB
data structures are properly designed, all information needed by the Operating System to allocate/deallocate
resources to various processes will be already available. The Operating System could use this information to
detect any deadlock, as we shall see later.
(iv) Whenever a process requests the Operating System for a resource, the request is obviously for a
resource belonging to a resource type. The user would not really care which one is exactly allocated
(If he did, a new resource type would have been created). The Operating System then goes through
the resourcewise table to see if there is any free resource of that type, and if there is any, allocates it
to the process. After this, it updates both these tables appropriately.
If no free resource of that type is available, the Operating
System keeps that process waiting on one of the resources
for that type. (For instance, it could add the process to the
waiting queue for a resource, where the wait list is the
shortest.) This also will necessitate updating of both tables.
When a process releases a resource, again both the
tables will be updated accordingly.
(v) At any time, the Operating System can use these
tables to detect a circular wait or a deadlock.
Typically, whenever a resource is demanded by a
process, before actually allocating it, the Operating System could use this algorithm to see whether
the allocation can potentially lead to a deadlock or not.
It should be noted that this is by no means the most efficient algorithm of deadlock detection. Modern
research has come out with a number of ingenious ideas, which are being discussed and debated. Some of
these are implemented too! What we present here is a simplified, accurate (though a little inefficient) method
to clarify the concepts. The algorithm would simulate the traversal along the DRAG to detect if the same node
is reached, i.e. the circular wait.
The working is as follows:
(a) Go through the resourcewise table entries one by one, each time storing the values processed. This is
useful in detecting a circular wait, i.e. in finding out whether we have reached the same node or not.
(b) Ignore entries for free resources (such as an entry for R00 in Fig. 8.6).
(c) For all other entries, access the process to which the resource is allocated (e.g. resource R01 is
allocated to process P1 in Fig. 8.6). In this case, store the numbers R01 and P1 in separate lists called
resource list and process list respectively.
(d) Access the entry in the processwise table (Fig. 8.7) for that process (P1 in this case).
(e) Access one by one the resources this process (P1) is waiting for. For example, P1 is waiting for
resource R20. Check if this is the same as the one already encountered. i.e. if R20 is the same as
R01 stored in step (c). In short, check if circular wait is already encountered. If yes, the deadlock is
detected. If no, store this resource (e.g. R20) in the resource list. This list will now contain R01 and
R20. The process list still contains only P1. Check from Fig. 8.7 whether there is any other resource
apart from R20, that process P1 is waiting for. If there is any, this procedure will have to be repeated.
In this example, there is no such resource. Therefore, the Operating System goes to the next step (f).
(f) Go to the entry in the resourcewise table (Fig. 8.6) for the next resource in the resource list after R01.
This is resource R20, in this case. We find that R20 is allocated to P5.
(g) Check if this process (i.e. P5) is the one already encountered in the process list (e.g. if P5 is the same
as P1). If it is the same, a deadlock is confirmed. In this case, P5 is not the same as P1. So only store
P5 after P1 in the process list and proceed. The process list now contains P1 and P5. The resource list
is still R01, R20 as in step (e). After this, the Operating System will have to choose R10 and R23 as
they are the resources process P5 is waiting for. It finds that R10 is allocated to P1. And P1 already
existed in the process list. Hence, a deadlock (P1 → R20 → P5 → R10 → P1) has been detected.
Therefore, the Operating System will have to maintain two lists: one list of resources already encountered
and a separate list of all the waiting processes already encountered. Any time the Operating System hits either
a resource or a process which already exists in the appropriate list while going through the algorithm, the
deadlock is confirmed.
(h) If a deadlock is not confirmed, continue this procedure for all the permutations and combinations e.g.
for all the resources that a process is waiting for and then for each of the resources, the processes to
which they are allocated. This procedure has to be repeated until both the lists are exhausted one by
one. If all the paths lead to resources which are free and allocable, there is no deadlock. If all the paths
make the Operating System repeatedly go through the same process or resource (check if already
encountered), it is a deadlock situation.
Having finished one row, go to the next one in Fig. 8.6 and repeat this procedure for all the rows where
the status is NOT = free.
Let us verify this algorithm for a deadlock. For instance, the two tables corresponding to Fig. 8.2 are
shown in Fig. 8.8 and Fig. 8.9.
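A hedged C sketch of the traversal in steps (a) to (h) follows. It represents the resourcewise table as holder[r] (the process holding resource r, or -1 if the resource is free) and the processwise table as waits_for[p][...] (the resources process p is waiting for, terminated by -1); starting afresh from each allocated resource, it follows allocation and wait edges and reports a deadlock as soon as it reaches a resource or a process that is already in the corresponding list. The array sizes and the encoding are purely illustrative.

#include <stdbool.h>

#define MAX_RES   16
#define MAX_PROC  16

int holder[MAX_RES];                    /* process holding each resource, -1 = free     */
int waits_for[MAX_PROC][MAX_RES];       /* resources each process waits for, -1 = end   */

static bool seen_res[MAX_RES], seen_proc[MAX_PROC];

/* Follow: resource r -> its holder -> each resource the holder waits for -> ...        */
static bool follow_resource(int r)
{
    if (seen_res[r])                    /* same resource reached again: circular wait   */
        return true;
    seen_res[r] = true;

    int p = holder[r];
    if (p < 0)                          /* free resource: this path cannot close a loop */
        return false;
    if (seen_proc[p])                   /* same process reached again: circular wait    */
        return true;
    seen_proc[p] = true;

    for (int k = 0; waits_for[p][k] != -1; k++)
        if (follow_resource(waits_for[p][k]))
            return true;
    return false;
}

bool deadlock_detected(int num_resources)
{
    for (int r = 0; r < num_resources; r++) {
        if (holder[r] < 0)
            continue;                   /* ignore entries for free resources            */
        for (int i = 0; i < MAX_RES; i++)  seen_res[i]  = false;
        for (int i = 0; i < MAX_PROC; i++) seen_proc[i] = false;
        if (follow_resource(r))
            return true;
    }
    return false;
}

Applied to tables like those of Figs. 8.6 and 8.7, this sketch would flag the circular wait P1 → R20 → P5 → R10 → P1 found above as a deadlock.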
The Operating System decides to kill a process and reclaim all its resources after ensuring
that such action will solve the deadlock. (The Operating System can use the DRAG and deadlock detection
algorithms to ensure that after killing a specific process, there will not be a deadlock.) This solution is simple,
but involves loss of at least one process.
Choosing a process to be killed, again, depends on the scheduling policy and the process priority. It is
safest to kill a lowest priority process which has just begun, so that the loss is not very heavy. However, the
matter becomes more complex when one thinks of a database recovery (the process which is killed may have
already updated some databases on-line) or Inter-Process-Communications. As yet, there is no easy solution
to this problem and it is a subject of research today.
This strategy aims at creating the circumstances so that deadlocks are prevented. A study of Coffman’s four
conditions discussed in Sec. 8.3, shows that if any of these conditions is not met, there cannot be a deadlock.
This strategy was first suggested by Havender. We will now discuss the ways to achieve this and the problems
encountered while trying to do so.
By prohibiting a process from waiting for more resources while already holding certain resources, we can
prevent a deadlock.
This can be achieved by demanding that at the very beginning, a process must declare all the resources that
it is expected to use. The Operating System should find out at the outset if all these are available and only if
available, allow the process to commence. In such a case, the Operating System obviously must update its list
of free, available resources immediately after this allocation. This is an attractive solution, but obviously, it
is inefficient and wasteful. If a process does calculations for 8 hours updating some files and at the end, uses
the tape drive for updating the control totals record only for one minute, the tape drive has to be allocated to
that process for the entire duration and it will, therefore, be idle for 8 hours. Despite this, no other process
can use it during this period.
Another variation of this approach is possible. The Operating System must make a process requesting
for some resources to give up the already held resources first and then try for the requested resources. Only
if the attempt is successful, can the relinquished resources be reallocated to that process, so that it can run.
However, if the attempt fails, the relinquished resources are regained and the process waits until those
resources are available. Every time a check is made, the existing, already held resources are relinquished so
that the deadlock can never take place.
Again, there are problems involved in this scheme. After giving up the existing resources, some other
process might grab one or more of them for a long time. In general, it is easy to imagine that this strategy can
lead to long delays, indefinite postponement and unpredictability. Also, this technique can be used for shared
resources such as tables, semaphores and so on, but not for printers and tape drives. Imagine a printer given
up by a process half way in the report and grabbed by some other process!
It is obvious that attacking the first three conditions is very difficult. Only
the last one remains. If the circular wait condition is prevented, the problem of the deadlock can be prevented
too.
One way in which this can be achieved is to force a process to hold only one resource at a time. If it
requires another resource, it must first give up the one that is held by it and then request for another. This
obviously has the same flaws as discussed above while preventing condition (iii). If a process P1 holds R1
and wants R2, it must give up R1 first, because another process P2 should be able to get it (R1). We are again
faced with a problem of assigning a tape drive to P2 after P1 has processed
only half the records. This, therefore, is also an unacceptable solution.
There is a better solution to the problem, in which all resources are
numbered as shown in Fig. 8.10.
A simple rule can tackle the circular wait condition now. Any process has
to request for all the required resources in a numerically ascending order
during its execution, assuming again that grabbing all the required resources
at the beginning is not an acceptable solution. For instance, if a process P1
requires a printer and a plotter at some time during its execution, it has to
request for a printer first and then only for a plotter, because 1 < 2.
This would prevent a deadlock. Let us see how. Let us assume that two
processes P1 and P2 each want a tape drive and a plotter. A deadlock can take place only if P1 holds
the tape drive and wants the plotter, whereas P2 holds the plotter and requests for the tape drive, i.e. if the
order in which the resources are requested by the two processes is exactly opposite. And this contradicts our
assumption. Because 0 < 2, a tape drive has to be requested for before a plotter by each process, whether it is
P1 or P2. Therefore, it is impossible to get a situation that will lead to the deadlock.
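A hedged C sketch of the ascending-order rule is shown below, using one pthread mutex per numbered resource: any process that needs two resources always locks the lower-numbered one first, so the opposite ordering that creates a circular wait can never arise. The resource numbering and the helper names are illustrative.

#include <pthread.h>

/* Resources numbered as in the text: 0 = tape drive, 1 = printer, 2 = plotter. */
static pthread_mutex_t resource_lock[3] = {
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER
};

/* Acquire two resources strictly in ascending numerical order. */
void acquire_pair(int r_low, int r_high)
{
    if (r_low > r_high) { int t = r_low; r_low = r_high; r_high = t; }
    pthread_mutex_lock(&resource_lock[r_low]);    /* lower number always first */
    pthread_mutex_lock(&resource_lock[r_high]);
}

void release_pair(int r_low, int r_high)
{
    if (r_low > r_high) { int t = r_low; r_low = r_high; r_high = t; }
    pthread_mutex_unlock(&resource_lock[r_high]); /* release in reverse order  */
    pthread_mutex_unlock(&resource_lock[r_low]);
}

/* Example: both P1 and P2 want the tape drive (0) and the plotter (2); with
   this rule each locks 0 before 2, so no circular wait can form.             */

Swapping the arguments inside the helper means the rule cannot be violated by accident, no matter in which order a caller names the two resources.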
What holds true for two processes also is true for multiple processes. However, there are some minor and
major problems with this scheme also.
Imagine that there are two tape drives, T1 and T2 and two processes, P1 and P2 in the system. If P1 holds
T1 and requests for T2 whereas P2 holds T2 and requests for T1, the deadlock can occur. What numbering
scheme should then be followed as both are tape drives? Giving both tape drives the same number (e.g. 0)
and allowing a request for a resource with a number equal to or greater than that of the previous request, a
deadlock can still occur as shown above.
This minor problem however, could be solved by following a certain coding scheme in numbering the
resources. The first digit denotes the resource type and the second digit denotes the resource number within
the resource type. Therefore, the numbers 00, 01, 02, ... would be for different tape drives and 10, 11, etc.
would be for different printers. The process requests for a resource type only (such as a tape drive). The
Operating System internally translates it into a request for a specific resource such as 00 or 01. Applying this
scheme to the situation above, we realize that our basic assumption would be violated if the situation was
allowed to exist. For instance, our situation is that P1 holds 00 and requests for 01. This is acceptable because
00 < 01. But in this situation, P2 holds 01 and requests for 00. This is impossible because 00 < 01.
Not only the external I/O media but all the resources, including the process tables and disk areas
such as spooler files, will need to be numbered. That would be a cumbersome process.
It is almost impossible to make all the processes request resources in a globally predetermined order
because the processes may not actually require them in that order. The waiting periods and the consequent
wastage could be enormous. And, this surely is the major problem.
Therefore, we conclude that there is yet no universally acceptable, satisfactory method for the prevention
of deadlocks and it is still a matter of deep research.
Deadlock prevention was concerned with imposing certain restrictions on the environment or processes so
that deadlocks can never occur. But we found out in the last section the difficulties involved in deadlock
prevention. Therefore, a compromise is sought by the Operating System. The Operating System aims at
avoiding a deadlock rather than preventing one. What is the exact difference between the two? The difference
is quite simple. Deadlock avoidance is concerned with starting with an environment where a deadlock is
theoretically possible (it is not prevented), but by some algorithm in the Operating System, it is ensured,
before allocating any resource, that after allocating it, a deadlock can be avoided. If that cannot be guaranteed,
the Operating System does not grant the request of the process for a resource in the first place.
Dijkstra was the first person to propose an algorithm in 1965 for deadlock avoidance. This is known as
‘Banker’s algorithm’, due to its similarity to the problem of a banker wanting to disburse loans to
various customers within limited resources.
Using this algorithm, the Operating System can know in advance, before a resource is allocated to a
process, whether the allocation can lead to a deadlock (‘unsafe state’) or whether a deadlock can certainly
be avoided (‘safe state’).
Banker’s algorithm maintains two matrices on a dynamic basis. Matrix A consists of the resources allocated
to different processes at a given time. Matrix B maintains the resources still needed by different processes at
the same time. These resources could be needed one after the other or simultaneously. The Operating System
has no way of knowing this. Both these matrices are shown in Fig. 8.11.
Matrix A shows that process P0 is holding 2 tape drives at a given time. At the same moment, process P1
is holding 1 printer and so on. If we add these figures vertically, we get a vector of Held Resources (H) = 432.
This is shown as the second row in the rows for vectors. This says that at a given moment, total resources
held by various processes are : 4 tape drives, 3 printers and 2 plotters. This should not be confused with the
decimal number 432. That is why it is called a vector. By the same logic, the figure shows that the vector
for the Total Resources (T) is 543. This means that in the whole system, there are physically 5 tape drives,
4 printers and 3 plotters. These resources are made known to the Operating System at the time of system
generation. By subtraction of (H) from (T) columnwise, we get a vector (F) of free resources which is 111.
This means that the resources available to the Operating System for further allocation are: 1 tape drive, 1
printer and 1 plotter at that juncture.
Matrix B gives processwise additional resources that are expected to be required in due course during the
execution of these processes. For instance, process P2 will require 2 tape drives, 1 printer and 1 plotter, in
addition to the resources already held by it. It means that process P2 requires in all 1 + 2 = 3 tape drives, 2
+ 1= 3 printers and 1 + 1= 2 plotters. If the vector of all the resources required by all the processes (vector
addition of Matrix A and Matrix B) is less than the vector T for each of the resources, there will be no
contention and therefore, no deadlock. However, if that is not so, a deadlock has to be avoided.
Having maintained these two matrices, the algorithm for the deadlock avoidance works as follows:
(i) Each process declares the total required resources to the Operating System at the beginning. The
Operating System puts these figures in Matrix B (resources required for completion) against each
process. For a newly created process, the row in Matrix A is fully zeros to begin with because no
resources are yet assigned for that process. For instance, at the beginning of process P2, the figures
for the row for P2 in Matrix A will be all 0s; and those in Matrix B will be 3, 3 and 2 respectively.
(ii) When a process requests the Operating System for a resource, the Operating System finds out whether
the resource is free and whether it can be allocated by using the vector F. If it can be allocated, the
Operating System does so, and updates Matrix A by adding 1 to the appropriate slot. It simultaneously
subtracts 1 from the corresponding slot of Matrix B. For instance, starting from the beginning, if the
Operating System allocates a tape drive to P2, the row for P2 in Matrix A will become 1, 0 and 0.
The row for P2 in Matrix B will correspondingly become 2, 3 and 2. At any time, the total vector of
these two rows, i.e. addition of the corresponding numbers in the two rows, is always constant and is
equivalent to the total resources needed by P2, which in this case will be 3, 3 and 2.
(iii) However, before making the actual allocation, whenever a process makes a request to the Operating System for
any resource, the Operating System goes through the Banker’s algorithm to ensure that after the
imaginary allocation, there need not be a deadlock, i.e. after the allocation, the system will still be
in a ‘safe state’. The Operating System actually allocates the resource only after ensuring this. If it
finds that there can be a deadlock after the imaginary allocation at some point in time, it postpones the
decision to allocate that resource. It calls this state of the system that would result after the possible
allocation as ‘unsafe state’. Remember: the unsafe state is not actually a deadlock. It is a situation of
a potential deadlock with the arithmetic comparison.
The point is: How does the Operating System conclude about the safe or unsafe state? It uses an interesting
method. It looks at vector F, and each row of Matrix B. It compares them on a vector to vector basis i.e. within
the vector, it compares each digit separately to conclude whether all the resources that a process is going to
need to complete are available at this juncture or not. For instance, the figure shows F = 111. It means that at
that juncture, the system has 1 tape drive, 1 printer and 1 plotter free and allocable. (The first row in Matrix B,
for P0, is 100.) This means that if the Operating System decides to allocate all needed resources to P0, P0 can
go to completion because 111 > 100 on a vector basis. Similarly, row for P1 in Matrix B is 110. Therefore, if
the Operating System decides to allocate resources to P1 instead of to P0, P1 can complete. The row for P2
is 211. Therefore, P2 cannot complete unless there is one more tape drive available. This is because 211
exceeds 111 in its first position (2 tape drives needed against only 1 free) on a vector basis.
The vector comparison should not be confused with the arithmetic comparison. For instance, if F were
411 and a row in Matrix B was 322, it might appear that 411 > 322 and therefore, the process can go to
completion. But that is not true. As 4 > 3, the tape drives would be allocable. But as 1 < 2, both the printers
as well as the plotter would fall short.
The Operating System now does the following to ensure the safe state:
(a) After the process requests for a resource, the Operating System allocates it on a ‘trial’ basis.
(b) After this trial allocation, it updates all the matrices and vectors, i.e. it arrives at the new values of F
and Matrix B as if the allocation was actually done. Obviously, this updation will have to be done by
the Operating System in a separate work area in the memory.
(c) It then compares F vector with each row of Matrix B on a vector to vector basis.
(d) If F is insufficient for every row in Matrix B on a vector basis, i.e. even if all of F were made available
to any one of the processes in Matrix B, none would be guaranteed to complete, the Operating System
concludes that it is an ‘unsafe state’. Again, it does not mean that a deadlock has resulted. However,
it means that it can take place.
(e) If F is greater than any row for a process in Matrix B, the Operating System proceeds as follows:
l It allocates all the needed resources for that process on a trial basis.
l It assumes that after this trial allocation, that process will eventually get completed, and, in fact,
release all the resources on completion. These resources now will be added to the free pool (F). It
now calculates all the matrices and F after this trial allocation and the imaginary completion of this
process. It removes the row for the completed process from both the matrices.
l It repeats the procedures from step (c) above. If in the process, all the rows in the matrices get
eliminated, i.e. all the processes can go to completion, it concludes that it is a ‘safe state’. If it does
not happen, it concludes that it is an ‘unsafe state’.
(f) For each request for any resource by a process, the Operating System goes through all these trial or
imaginary allocations and updations, and if it finds that after the trial allocation, the state of the system
would be ‘safe’, it actually goes ahead and makes an allocation after which it updates various matrices
and tables in a real sense. The Operating System may need to maintain two sets of matrices for this
purpose. Any time, before any allocation, it could copy the first set of matrices (the real one) into the
other, carry out all trial allocations and updations in the other, and if the safe state results, update the
former set with the allocations.
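The trial allocation and safety check in steps (a) to (f) can be captured in a few lines of code. The sketch
below is a minimal illustration in Python; it assumes the state is held as the free vector F and the two
matrices A (allocated) and B (still required) described above, and the function names are ours, not part of
any real Operating System.

    def vector_le(x, y):
        # True if every component of x is <= the corresponding component of y.
        return all(a <= b for a, b in zip(x, y))

    def is_safe(F, A, B):
        """Banker's safety check: can every process still be driven to completion?"""
        free = list(F)
        pending = set(A)                      # processes not yet (imaginarily) completed
        while pending:
            runnable = next((p for p in pending if vector_le(B[p], free)), None)
            if runnable is None:
                return False                  # nobody can finish with what is free: unsafe
            # Give it all it still needs, let it finish, and take everything back;
            # the net effect is that the free pool gains the process's allocated row of A.
            free = [f + a for f, a in zip(free, A[runnable])]
            pending.remove(runnable)
        return True                           # all rows eliminated: safe state

    def request_is_grantable(F, A, B, process, request):
        """Trial-allocate 'request' to 'process' and report whether the state stays safe."""
        if not vector_le(request, B[process]) or not vector_le(request, F):
            return False                      # asks for more than declared, or more than is free
        new_F = [f - r for f, r in zip(F, request)]
        new_A = dict(A); new_A[process] = [a + r for a, r in zip(A[process], request)]
        new_B = dict(B); new_B[process] = [b - r for b, r in zip(B[process], request)]
        return is_safe(new_F, new_A, new_B)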
Two examples to understand this algorithm clearly are presented here.
Example-1
Suppose process P1 requests for 1 tape drive when the resources allocated to various processes are given
by Fig. 8.11. The Operating System has to decide whether to grant this request or not. The Banker’s algorithm
proceeds to determine this as follows:
l If a tape drive is allocated to P1, F will become 011 and the resources still required for P1 in Matrix
B will become 010. After this, the free resources are such that only process P1 can complete because
each digit in F, i.e. 011 is equal to or more than the individual digits in the row for required resources
for P1 in Matrix B i.e. 010.
Therefore, hypothetically, if no other process demands anything in between, the free resources can
satisfy P1’s demands and lead it to completion.
l If P1 is given all the resources it needs to complete, the row for assigned resources to P1 in Matrix A
will become 120, and after this allocation, F will become 001.
l At the end of the execution of P1, all the resources used by P1 will become free and F will become
120 + 001 = 121. We can now erase the rows for P1 from both the matrices; this is how the matrices will
look if P1 is granted its first request of a tape drive and is then allowed to go to completion.
l We repeat the same steps with the other rows. For instance, now F = 121. Therefore, the Operating
System will have sufficient resources to complete either P0 or P3 but not P2. This is because P2
requires 2 tape drives to complete, but the Operating System at this imaginary juncture has only 1.
Let us say, the Operating System decides to allocate the resources to P0 (It does not matter which
one is chosen). Assuming that all the required resources are allocated to P0 one by one, the row for
assigned resources to P0 in Matrix A will become 300 and that in Matrix B will obviously become
000. F at this juncture will have become 121 – 100 = 021. If P0 is now allowed to go to completion,
all the resources held by P0 will be returned to F. Now, we can erase the rows for P0 from both the
matrices. F would now become 300 + 021 = 321.
l Now either P2 or P3 can be chosen for this ‘trial allocation’. Let us assume that P3 is allocated. Going
by the same logic and steps, we know that resources required by P3 are 111. Therefore, after the trial
allocation, F will become 321 – 111 = 210, and resources assigned to P3 in Matrix A would become
212. When P3 completes and returns the resources to F, F will become 212 + 210 = 422.
l At the end, P2 will be allocated and completed. At this juncture, resources allocated to P2 will be
332, and F would be 422 – 211 = 211. In the end, all the resources will be returned to the free pool.
At this juncture, F will become 332 + 211 = 543. This is the same as the total resources vector T that
are known to the system. This is as expected because after these imaginary allocations and process
completions, F should become equal to the total resources known to the system.
l The Operating System does all these virtual or imaginary calculations before granting the first request
of process P1 for a tape drive. All it ensures is that if this request is granted, it is possible to let
some processes complete, adding to the pool of free resources and by repeating the same logic, it
is possible to ultimately complete all the processes. Therefore, this request can be granted because
after the allocation, the state is still a ‘safe’ state. It should be noted that after this allocation, it is not
impossible to have a deadlock if subsequent allocations are not done properly. However, all it ensures
is that it is possible to avert the deadlock. The Operating System now actually allocates the tape
drive. After the actual allocation, the Operating System updates both the matrices and all the vectors.
An interesting point is: After this, the processes need not actually complete in the same sequence as
discussed earlier.
Example 2
Let us go back to Fig. 8.11 which depicts the state of the system at some stage. Imagine that process P2
instead of P1 requests 1 tape drive. Let us now apply Banker’s algorithm to this situation. If it is granted,
F will become 011, and this is not sufficient to complete any process. This is because the vector F = 011 is
less than every row in Matrix B after the allocation, since there is no tape drive free. Therefore, there can be
a deadlock. There is no certainty of a deadlock, because even if 1 tape drive is allocated to P2, P2 might
relinquish it during execution before its completion (maybe in a short while), and then the processes could
complete as in Example-1. The Operating System still does not grant this request, because it is an unsafe state
from which a deadlock may not be avertable. Therefore, the Operating System waits for some time until some other process
releases some resources during or at the end of the execution. It then ensures that it is a safe state by the same
logic as discussed above, and then only grants the request.
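If we reconstruct the state of Fig. 8.11 from the figures quoted in the two walkthroughs above (F = 111; the
rows of Matrix B for P0 to P3 being 100, 110, 211 and 111; the rows of Matrix A being 200, 010, 121 and 101,
so that A + B + F adds up to the total T = 543), both examples can be replayed with the sketch given earlier.
The Matrix A values are inferred from the text, so treat this purely as an illustration.

    F = [1, 1, 1]
    A = {"P0": [2, 0, 0], "P1": [0, 1, 0], "P2": [1, 2, 1], "P3": [1, 0, 1]}
    B = {"P0": [1, 0, 0], "P1": [1, 1, 0], "P2": [2, 1, 1], "P3": [1, 1, 1]}

    print(request_is_grantable(F, A, B, "P1", [1, 0, 0]))   # Example 1: True  (safe, so grant it)
    print(request_is_grantable(F, A, B, "P2", [1, 0, 0]))   # Example 2: False (unsafe, so hold it back)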
The algorithm is very attractive at first sight, but it is not easy for every process to declare in advance
all the resources it is going to require, especially if the resources include such intangibles as shared data,
variables, files, tables etc.
Therefore, deadlock avoidance is also a matter of research today.
A system consists of many resources, which are shared or distributed among several competing processes.
Memory, CPU, disk space, printers and tapes are examples of resources. When a system has two CPUs, we can
say that there are two instances of the CPU. Similarly, in a network, we may have ten printers, and we can say
that there are ten instances of printers.
In such situations, we are not bothered about which instance of the requested resource services the
request. When a process is executing, it requests a resource before using it and it must release the resource
after using it. A process can make as many requests as it needs to carry out the assigned task, but it cannot
request more resources than the maximum number available in the system.
A process uses a resource in the following sequence:
Request: The process requests the necessary resource(s). If the resources are not free, the process has
to wait until they become free so that it can acquire control over them.
Use: The process operates on/uses the acquired resource to carry out the assigned task.
Release: The process releases the resource when the operation on it is complete.
In the request and release steps, a process makes system calls such as disk read/write, printing, memory
allocation, etc. Therefore, it is necessary to make sure that there is no conflict, i.e. a situation where two
processes acquire the same resource at the same time.
We will now discuss the last portion of the Operating System, viz. the functions of Memory Management
(MM). As mentioned earlier, the topic discussed in this chapter assumes special importance when a number
of users share the same memory. In general, this module performs the following functions:
(a) To keep track of all memory locations, free or allocated, and if allocated, to which process and
how much.
(b) To decide the memory allocation policy i.e. which
process should get how much memory, when and
where.
(c) To use various techniques and algorithms to
allocate and deallocate memory locations.
Normally, this is achieved with the help of some special
hardware.
There is a variety of memory management systems.
Fig. 9.1 lists them.
These systems can be divided into two major parts:
'Contiguous' and 'Non-contiguous'. Contiguous Memory Management schemes expect the program to be
loaded in contiguous memory locations. Non-contiguous systems remove this restriction. They allow the
program to be divided into different chunks and loaded at different portions of the memory. It is then the
function of the Operating System to manage these different chunks in such a way that they appear to be
contiguous to the Application Programmer/User. In 'paging', these chunks are of the same size, whereas
in 'segmentation', they can be of different sizes. Again, Memory Management can be of 'Real Memory'
whereby the full process image is expected to be loaded in the memory before execution. Virtual Memory
Management systems can, however, start executing a process even with only a part of the process image
loaded in the memory.
We will now discuss these schemes one by one. In each case, the following issues are involved:
Relocation and Address Translation refers to the problem that arises because at the time of compilation, the
exact physical memory locations that a program is going to occupy at the run time are not known. Therefore,
the compiler generates the executable machine code assuming that each program is going to be loaded from
memory word 0. At the execution time, the program may need to be relocated to different locations, and all
the addresses will need to be changed before execution. This will be illustrated later with an example, and
different methods of Address Translation will also be discussed.
Protection refers to preventing one program from interfering with other programs. This is true even
when a single user process and the Operating System are both residing in the main memory. A common
question is: “If the compiler has generated proper addresses and relocation is properly done, can one program
interfere with others?” One of the answers is: “hardware malfunction.”
Imagine an instruction "JMP EOJ" in an assembly language which can get translated to 0111000011111001
[JMP = 0111 and EOJ is assumed to be at the address (249) in decimal or (000011111001) in binary]. If
due to hardware malfunction, two high order bits in the address change from 0 to 1, this will not be detected
by the parity checking mechanism due to cancelling errors. The address in the instruction will now be
(110011111001) in binary, which is (3321) in decimal. If the program actually jumps to this location, there
might be a serious problem.
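The arithmetic of this example can be checked quickly; the snippet below is only a sanity check of the
numbers quoted above, with the two flipped bits chosen exactly as in the example.

    original = 0b000011111001            # 249, the assumed address of EOJ
    corrupted = original | (0b11 << 10)  # the two high-order bits of the 12-bit address flip to 1
    print(original, corrupted)           # 249 3321
    # The number of 1-bits changes by two, so simple parity over the address stays the same
    # and a parity check alone cannot catch this corruption.
    assert bin(original).count("1") % 2 == bin(corrupted).count("1") % 2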
Sure enough, such hardware malfunction does not happen very often, and also some of these cases can
be detected by a few ingenious methods. However, our objective is to guarantee complete accuracy. That is
why protection is important. In most cases, the protection is provided by a special hardware, as we shall see
later. This is so important that until this protection was provided, many Operating Systems, especially on the
microcomputers, could not provide the multiuser facility, despite having all the other algorithms ready.
Sharing is the opposite of protection. In this case, multiple processes have to refer to the same memory
locations. This need may arise because the processes might be using the same piece of data, or all processes
might want to run the same program, e.g. a word processor. Having 10 copies of the same program in the
memory for 10 concurrent users seems obviously wasteful. Though achieving both protection and sharing
of memory are apparently contradictory goals, we will study various schemes for accomplishing this task a
little later.
Each of these memory management methods can be judged in terms of efficiency by using the following
norms:
Wasted memory is the amount of physical memory which remains unused and therefore,
wasted.
Access time is the time to access the physical memory by the Operating System as compared to the memory
access time for the bare hardware without the overheads of the Operating System, basically caused due to
Address Translation.
In the scheme of Single Contiguous Memory Management, the physical memory is divided into
two contiguous areas. One of them is permanently allocated to the resident portion of the Operating
System (monitor) as shown in Fig. 9.2. (CP/M and MS-DOS fall in this category.)
The Operating System may be loaded at the lower addresses (0 to P as
shown in Fig. 9.2) or it can be loaded at the higher addresses. This choice is
normally based on where the vectored Interrupt Service Routines are located
because these addresses are determined at the time of hardware design in
such computers.
At any time, only one user process is in the memory. This process is run to
completion and then the next process is brought in the memory. This scheme
works as follows:
All the ‘ready’ processes are held on the disk as executable images, whereas the Operating System holds
their PCBs in the memory in the order of priority.
At any time, one of them runs in the main memory.
When this process is blocked, it is ‘swapped out’ from the main memory to the disk.
The next highest priority process is ‘swapped in’ the main memory from the disk and it starts running.
Thus, there is only one process in the main memory even if conceptually it is a multi-programming
system.
Now consider the way this scheme solves various problems as stated in Section 9.1.
In this scheme, the starting physical address of the program is known at the time of compilation. Therefore,
the problem of relocation or Address Translation does not exist. The executable machine program contains
absolute addresses only. They do not need to be changed or translated at the time of execution.
Protection can be achieved by two methods: 'Protection bits' and 'Fence register'.
In 'Protection bits', a bit is associated with each memory block because a memory block could belong
either to the Operating System or the application process. Since there could be only these two possibilities,
only 1 bit is sufficient for each block. However, the size of a memory block must be known. A memory
block can be as small as a word or it could be a very large unit consisting of a number of words. Imagine a
scheme in which a computer has a word length of 32 bits and 1 bit is reserved for every word for protection.
This bit could be 0 if the word belongs to the Operating System and it could be 1 if it belongs to the user
process. At any moment, the machine is in the supervisor (or privileged) mode executing an instruction within
the Operating System, or it is in the 'user' mode executing a user process. This is indicated by a mode bit in
the hardware. If the mode changes, the hardware bit also is changed accordingly automatically. Thus, at any
moment, when the user process refers to memory locations within the Operating System area, the hardware
can prevent it from interfering with the Operating System because the protection bit associated with the
referenced memory block (in our example, a word) is 0.
However, normally the Operating System is allowed unrestricted access to all the memory locations,
regardless of whether they belong to the Operating System or a user process. (i.e. when the mode is privileged
and the Operating System makes any memory reference, this protection bit is not checked at all!) If a block
is as small as a word of say 32 bits, protection bits constitute (1/32)×100 = 3.1 % overhead on the memory.
As the block size increases, this overhead percentage decreases, but then the allocation unit increases. This
has its own demerits such as the memory wastage due to the internal fragmentation, as we shall study later.
The use of a Fence register is another method of protection. This is like any other register in the CPU. It
contains the address of the fence between the Operating System and the user process as depicted in Fig. 9.3,
where the fence register value = P.
Because it contains an address, it is as big as MAR. For
every memory reference, when the final resultant address
(after taking into account the addressing modes such as
indirect, indexed, PC-relative and so on) is in MAR, it is
compared with the fence register by the hardware itself,
and the hardware can detect any protection violations. (For
instance, in Fig. 9.3, if a user process with mode bit = 1 makes a reference to an address within the area for
the Operating System which is less than or equal to P, the hardware itself will detect it.)
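The check itself is a single comparison made by the hardware on every memory reference. A rough software
model of that logic, with names of our own choosing and the fence set to the value P = 1000 purely for
illustration, might look like this:

    FENCE = 1000          # hypothetical value of P: the Operating System occupies words 0 to FENCE

    def check_reference(address, mode_bit):
        """Model of the fence check; mode_bit = 1 means the CPU is in user mode."""
        if mode_bit == 1 and address <= FENCE:
            raise MemoryError("protection violation: user process referenced the O/S area")
        return address    # the reference is allowed to proceed to MAR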
Sharing of code and data in memory does not make much sense in this scheme and is usually not supported.
The memory left over after loading the single user process is simply wasted in this scheme (it cannot be put
to any other use, however large it may be!). This scheme has very fast access times (no Address Translation is
required) and very little time-complexity. But its use is very limited due to the lack of multiuser facility.
Many operating systems, such as OS/360 running on IBM hardware, used the Fixed Partitioned Memory
Management method. In this scheme, the main memory was divided into various sections called 'partitions'.
These partitions could be of different sizes, but once decided at the time of system generation, they could not
be changed. This method could be used with swapping and relocation or without them.
In this method, the partitions are fixed at the time of system generation.(System generation is a process
of tailoring the Operating System to specific requirements. The Operating System consists of a number of
routines for supporting a variety of hardware items and devices, all of which may not be necessary for every
user. Each user can select the necessary routines depending upon the devices to be used. This selection is
made at the time of system generation.) At this time, the system manager has to declare the partition size.
To change the partitions, the operations have to be stopped and the Operating System has to be generated again
(i.e. loaded and created) with different partition specifications. That is the reason why these partitions are also
called 'static partitions'. On declaring static partitions, the Operating System creates a Partition Description
Table (PDT) for future use. This table is shown in Fig. 9.4. Initially, all the entries are marked as "FREE".
However, as and when a process is loaded into one of the partitions, the status entry for that partition is
changed to "ALLOCATED". Fig. 9.4 shows the static partitions and their corresponding PDT at a given time.
In this case, the PCB (Process Control Block) of each process contains the ID of the partition in which
the process is running. This could be used as a "pointer to the physical memory locations" field in the PCB.
For instance, in the PCB for process A, the partition ID will be specified as 2. Using this, the Operating
System can access entry number 2 in the PDT. This is how, using the partition ID as an index into the PDT,
information such as starting address, etc. could easily be obtained. The Operating System, however, could
keep this information directly in the PCB itself to enhance the speed at the cost of redundancy. When the
process terminates, the system call "kill the process" will remove the PCB, but before removing it, it will
request the MM to set the status of that partition to "FREE".
When a partition is to be allocated to a process, the following takes place:
(i) The long term process scheduler of the PM decides which process is to be brought into the memory
next.
(ii) It then finds out the size of the program to be loaded by consulting the IM portion of the Operating
System. As seen earlier, the compiler keeps the size of the program in the header of the executable
file.
(iii) It then makes a request to the partition allocation routine of the MM to allocate a free partition
with the appropriate size. This routine can use one of the several algorithms for such allocations, as
described later. The PDT is very helpful in this procedure.
(iv) With the help of the IM module, it now loads the binary program in the allocated partition. (Note that
it could be loaded in an unpredictable partition, unlike the previous case, making Address Translation
necessary at the run time.)
(v) It then makes an entry of the partition ID in the PCB before the PCB is linked to the chain of ready
processes by using the PM module of the Operating System.
(vi) The routine in the MM now marks the status of that partition as “allocated”.
(vii) The PM eventually schedules this process.
The Operating System maintains and uses the PDT as shown in Fig. 9.4. In this case, partition 0 is occupied
by the Operating System and is thus, unallocable. The “FREE” partitions are only 1 and 4. Thus, if a new
process has to be loaded, we have to choose from these two partitions. The strategies of partition allocation
are the same as discussed in disk space allocation, viz., first fit, best fit and worst fit. For instance, if the
size of a program to be executed is 50k, both the first fit and the worst fit strategies would give partition ID =
1 in the situation depicted by Fig. 9.4. This is because the size of the partition with partition ID = 1 is 200k
which is > 50k and also it is the first free partition to accommodate this program. The best fit strategy for the
same task would yield partition ID = 4. This is because the partition size of this partition is 100k, which is the
smallest partition capable of holding this program. The best fit and the worst fit strategies would be relatively
faster, if the PDT was sorted on partition size and if the number of partitions was very high.
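For the situation of Fig. 9.4 (partitions 1 and 4 free, of sizes 200k and 100k), the three strategies can be
sketched in a few lines. Only the two free PDT entries quoted above are used; everything else here is
illustrative.

    # Free entries of the PDT as (partition ID, size in KB), taken from the example above.
    free_partitions = [(1, 200), (4, 100)]

    def first_fit(free, size):
        # The first free partition big enough for the program.
        return next((pid for pid, psize in free if psize >= size), None)

    def best_fit(free, size):
        # The smallest free partition that can still hold the program.
        candidates = [(psize, pid) for pid, psize in free if psize >= size]
        return min(candidates)[1] if candidates else None

    def worst_fit(free, size):
        # The largest free partition that can hold the program.
        candidates = [(psize, pid) for pid, psize in free if psize >= size]
        return max(candidates)[1] if candidates else None

    print(first_fit(free_partitions, 50))   # 1
    print(best_fit(free_partitions, 50))    # 4
    print(worst_fit(free_partitions, 50))   # 1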
The processes waiting to be loaded in the memory (ready for execution, but for the fact that they are on
the disk or swapped out) are held in a queue by the Operating System. There are two methods of maintaining
this queue, viz., Multiple queues and Single queue.
In multiple queues, there is one separate queue for each partition as shown in Fig. 9.5.
In essence, the linked list of PCBs in the "ready but not in memory" state is split into multiple lists, one for
each partition, each corresponding to a different size of the partition. For instance, queue 0 will hold processes
with size of 0–2k, queue 2 will be for processes with size between 2k and 5k (the exact size of 2k will be in
this queue) and queue 1 will take care of processes with size between 5k and 8k, etc. (The exact size of 5k
will be in this queue.)
When a process wants to occupy memory, it is added to a proper queue depending upon the size of the
process. If the scheduling method is round robin within each queue, the processes are added at the end of the
proper queue and they move ahead in the strict FIFO manner within each queue. If the scheduling method is
priority driven, the PCBs in each queue are chained in the sorted order of priority.
An advantage of this scheme is that a very small process is not loaded in a very large partition. It thus
avoids memory wastage. It is instead added to a queue for smaller partitions. A disadvantage is obvious. You
could have a long queue for a smaller partition whereas the queue for the bigger partition could be empty as
shown in Fig. 9.6. This is obviously not an optimal and efficient use of resources!
In the Single queue method, only one unified queue is maintained of all the ready processes.
This is shown in Fig. 9.7. Again, the order in which the PCBs of ready processes are chained, depends upon
the scheduling algorithm. For instance, in priority based scheduling, the PCBs are chained in the order of
priority. When a new process is to be loaded in the memory, the unified queue is consulted and the PCB at the
head of the queue is selected for dispatching. The PCB contains the program size which is copied from the
header of the executable file at the time a process is created. A free partition is then found based on either first,
best or worst fit algorithms. Normally, the first fit algorithm is found to be the most effective and the quickest.
However, if the PCB at the head of the queue requires memory which is not available now, but there is a
free partition available to fit the process represented by the next PCB in the queue, what procedure is to be
adopted? The non-availability of the partition of the right size may force the Operating System to change the
sequence in which the processes are selected for dispatching. For instance, in Fig. 9.7, if the partition with
size = 5k is free, the highest priority process at the head of the chain with size = 7k cannot be dispatched.
The Operating System then has to find the next process in the queue which can fit into the 5k partition. In this
case, the Operating System finds that the next process with size = 2k can fit well. However, this may not be
the best decision in terms of performance.
If the Operating System had a "lookahead intelligence"
feature, it could have possibly known that the partition with
size = 2k is likely to get free soon. In this case, choosing the
next process of 5k for loading in the partition of size = 5k
could have been a better decision. Almost immediately after
this, the 2k partition would get free to accommodate the 2k
process with higher priority than the one with size = 5k. The
highest priority process with size = 7k will have to wait until
the partition with size = 8k gets free. There is no alternative
to this. This kind of intelligence is not always possible and it
is also quite expensive.
If the Operating System chooses a simple but relatively
less intelligent solution and loads the process with size = 2k
in the partition with size = 5k, the process with size = 5k keeps waiting. After a while, even if the 2k partition
gets free, it cannot be used, thus causing memory wastage. This is called external fragmentation. Contrast
this with internal fragmentation in which there is a memory wastage within the partition itself. Imagine a
partition of 2k executing a process of 1.5k size. The 0.5k of the memory of the partition cannot be utilized.
This wastage is due to internal fragmentation.
This discussion shows that the MM and the PM modules are interdependent and that they have to cooperate
with each other.
One more way in which the partitioned memory management scheme is categorized is based on whether it
supports swapping or not. Lifting the program from the memory and placing it on the disk is called 'swapping
out'. To bring the program again from the disk into the main memory is called 'swapping in'. Normally, a
blocked process is swapped out to make room for a ready process to improve the CPU utilization. If more
than one process is blocked, the swapper chooses a process with the lowest priority, or a process waiting for a
slow I/O event for swapping out. As discussed earlier, a running process also can be swapped out (in priority
based preemptive scheduling).
Swapping algorithm has to coordinate amongst Information, Process and Memory Management systems.
If the Operating System completes the I/O on behalf of a blocked process which was swapped out, it keeps
the data read recently in its own buffer. When the process is swapped in again, the Operating System moves
the data into the I/O area of the process and then makes the process ‘ready’. In demand paging, some portion
of the memory where the record is to be read can be ‘locked’ or ‘bound’ to the main memory. The remaining
portion can be swapped out if necessary. In this case, even if the process is ‘blocked’ and ‘swapped out’, the
I/O can directly take place in the AP's memory. This is not possible in the scheme of ‘fixed partition’ because,
in this case, the entire process image has to be in the memory or swapped out on the disk.
The Operating System has to find a place on the disk for the swapped out process image. There are two
alternatives. One is to create a separate swap file for each process. This method is very flexible, but can be
very inefficient due to the increased number of files and directory entries thereby deteriorating the search
times for any I/O operation. The other alternative is to keep a common swap file on the disk and note the
location of each swapped out process image within that file. In this scheme, an estimate of the swap file size
has to be made initially. If a smaller area is reserved for this file, the Operating System may not be able to
swap out processes beyond a certain limit, thus affecting performance. The medium term scheduler has to
take this into account.
Regardless of the method, it must be remembered that the disk area reserved for swapping has to be larger
in this scheme than in demand paging because, the entire process image has to be swapped out, even if only
the "Data Division" area undergoes a change after the process is loaded in the memory.
A compiled program is brought into the memory through a single unified queue or through multiple
queues. At the time of compilation, the compiler may not know which partition the process is going to run in.
Again a process can be swapped out and later brought back to a different partition. A question is: How are the
addresses managed in such a scheme? This is what we will learn in the section to follow.
Imagine a program which is compiled with 0 as the starting word address. The addresses
that this program refers to are called ‘virtual addresses or logical addresses’. In reality, that program may
be loaded at different memory locations, which are called 'physical addresses'. In a sense, therefore, in all
Memory Management systems, the problem of relocation and Address Translation is essentially to find a way
to map the virtual addresses onto the physical addresses.
Let us imagine that there is an instruction equivalent to "LDA 500" i.e. "0100000111110100" in a simple
machine language, in a compiled COBOL program, where 0100 is the machine op. code for LDA, and 500
in decimal is 000111110100 in binary. The intention of this instruction is obviously to load a CPU register
(usually, an accumulator) with the contents of the memory word at address = 500. Obviously, this address
500 is the offset with respect to the starting physical address of the program. If this program is loaded in
a partition starting from word address 1000, then this instruction should be changed to "LDA 1500" or
"0100010111011100", because 1500 in decimal is 010111011100 in binary.
Address Translation (AT) must be done for all the addresses in all the instructions except for constants,
physical I/O port addresses and offsets which are related to a Program Counter (PC) in the PC relative
addressing mode because all these do not change depending upon where the instruction is located. There are
two ways to achieve this relocation and AT: Static and Dynamic.
Dynamic relocation is used at the run time, for each instruction. It is normally done
by a special piece of hardware. It is faster, though somewhat more expensive. This is because, it uses a special
register called 'base register'. This register contains the value of relocation. (In our example, this value is
1000 because the program was loaded in a partition starting at the address 1000.)
In this case, the compiled program is loaded at a starting memory location different than 0 (say, starting
from 1000) without any change to any instruction in the program. For instance, Fig. 9.9 shows the instruction
“LDA 500” actually loaded at some memory locations between 1000 and 1500. The address 500 in this
instruction is obviously invalid, if the instruction is executed directly. Hence, the address in the instruction
has to be changed at the time of execution from 500 to 1500.
Normally, any instruction such as “LDA 500” when executed, is fetched to Instruction register (IR) first,
where the address portion is separated and sent to Memory Address Register (MAR). In this scheme however,
before this is done, this address in the instruction is sent to the special adder hardware where the base register
value of 1000 is added to it, and only the resulting address of 1500 finally goes to MAR. As MAR contains
1500, it refers to the correct physical location. For every address needing translation, this addition is made by
the hardware. Hence, it is very fast, despite the fact that it has to be done for every instruction.
Imagine a program with a size of 1000 words. The 'virtual address space' or ‘logical address space’
for this program comprises words from 0 to 999. If it is loaded in a partition starting with the address 1000,
1000 to 1999 will be its 'physical address space' as shown in Fig. 9.9 though the virtual address space still
continues to be 0 to 999. At the time of execution of that process, the value of 1000 is loaded into the base
register. That is why, when this instruction “LDA 500” is executed, actually it executes the instruction “LDA
1500”, as shown in Fig. 9.9.
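In software terms, the hardware's job for each memory reference boils down to one addition. A minimal model
of the adder path just described, using the figures of Fig. 9.9, is shown below; the function name is ours.

    BASE_REGISTER = 1000     # the partition of this process starts at word 1000

    def relocate(virtual_address):
        """Model of the relocation adder: compile-time (virtual) address -> physical address."""
        return virtual_address + BASE_REGISTER

    # "LDA 500", compiled with 0 as the starting address of the program...
    print(relocate(500))     # 1500 -- the word actually referenced at run time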
The base register can be considered as another special purpose CPU register. When the partition allocation
algorithm (first fit, best fit, etc.) allocates a partition for a process and the PCB is created, the value of the base
register (starting address of the partition) is stored in the PCB in its Register Save Area. When the process is
made "running", this value is loaded back in the base register. Whenever the process gets blocked, the base
register value does not need to be stored again as the PCB already has it. Next time when the process is to be
dispatched, the value from the PCB can be used to load the base register if the process has not been swapped
out. After a process gets blocked and a new process is to be dispatched, the value of the base register is simply
picked up from the PCB of the new process and the base register is loaded with that value. This is again
assuming that the new process is already loaded in one of the partitions.
However, if a process is swapped out and swapped into a new partition later, the value of the base register
corresponding to the new partition will have to be written into the PCB before the PCB of this process is
chained in the queue of ready processes to be dispatched eventually.
This is the most commonly used scheme amongst the schemes using fixed partitions, due to its enhanced
speed and flexibility. A major advantage is that it supports swapping easily, i.e. a process can be swapped
out and later swapped in at different locations very easily. Only the base register value needs to be changed
before dispatching.
A process should not, by mistake or on purpose, become capable of interfering with other processes. There
are two approaches for preventing such interference and achieving protection and sharing. These approaches
involve the use of:
− Protection bits
− Limit register
Protection bits are used by the IBM 360/370 systems. The idea is the same as in single
user systems, except that 1 bit will not suffice for protection. A few bits are reserved to specify each word's
owner (e.g. 4 bits if there are 16 user processes running in 16 partitions). This scheme however is expensive.
If the word length is 32 bits, 4 bit overhead for every word would mean 4/32 = 1/8 or 12.5% increase in the
overheads. Hence, the IBM 360 series of computers divided the memory into 2 KB blocks and reserved 4
protection bits, called the 'key', for each such block, again assuming 16 users in all. The size of each partition had to be
a multiple of such blocks and could not be any arbitrary number. This resulted in memory wastage due to the
internal fragmentation. Imagine that the block size is 2 KB, and the process size is 10 KB + 1 byte. If two
of the partitions are of sizes 10 KB and 12 KB, the Operating System will have to allocate the partition of
12 KB to this process; the one of 10 KB will not do. Hence, an area of 2 KB - 1 byte will be wasted in that
partition. It can easily be seen that the maximum internal fragmentation per partition is equal to (block size - 1),
the minimum is 0, and the average is (block size - 1)/2 per process.
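The waste figures above can be verified directly; the sizes used below are exactly the ones in this example.

    BLOCK = 2 * 1024                       # 2 KB block size
    process = 10 * 1024 + 1                # 10 KB + 1 byte
    blocks_needed = -(-process // BLOCK)   # ceiling division: 6 blocks, i.e. a 12 KB partition
    waste = blocks_needed * BLOCK - process
    print(blocks_needed, waste)            # 6 2047  (i.e. 2 KB - 1 byte)
    # The worst-case waste per partition is BLOCK - 1; the average over processes is (BLOCK - 1) / 2.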
All the blocks associated with a partition allocated to a process are given the same key value in this
scheme. If the number of partitions is 16, there can be maximum 16 user processes at any moment in the main
memory. Therefore, a 4 bit key value ranging from 0000 to 1111 serves the purpose of identifying the owner
of each block in the main memory. This is shown in Fig. 9.10.
Considering a physical memory of 64 KB and assuming a block of 2 KB size, there would be 32 blocks.
If a 4 bit key is associated with a block, 32×4 = 128 bits have to be reserved for storing the key values. At the
time of system generation, the System Administrator would define a maximum of 16 partitions of different
sizes out of these 32 available blocks. One partition could be of 1 block, another of 3 blocks, and yet
another of 2 or even 5 blocks. Each partition is then assigned a protection key from 0000 to 1111. After
declaring various partitions with their different sizes, all the 128 bits reserved for the key values (4 per block)
are set. This is done on the principle that all the blocks belonging to a partition should have the same key
value. Figure 9.10 illustrates this.
When a process is assigned to a partition, the key value for that partition is stored in 'Program Status
Word (PSW)'. Whenever a process makes a memory reference in an instruction, the resulting address (after
taking into account the addressing mode and the value of the base register) and the block in which that
address falls are computed. After this, a 4 bit protection key for that block is extracted from the 128 bit long
protection keys, and it is tallied with the key stored in PSW. If it does not match, it means that the process is
trying to access an address belonging to some other partition. Thus, if due to hardware malfunction, a high
order 0 of an address becomes 1, the process still is prevented from interfering with an address in some other
partition belonging to a different process. However, if this hardware malfunction generates another address
belonging to the same partition, this protection mechanism cannot detect it!
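A software model of this key check is given below. The block-to-key assignment and the PSW key value are
invented purely for illustration; in reality the comparison is wired into the memory-reference hardware.

    BLOCK_SIZE = 2 * 1024        # 2 KB blocks, as in the IBM scheme described above
    block_keys = [0] * 32        # one 4-bit key per block of a 64 KB memory (key 0 = unassigned here)

    # Suppose blocks 10 to 14 form the partition given to a process whose PSW carries key 5.
    for b in range(10, 15):
        block_keys[b] = 5

    def check_access(physical_address, psw_key):
        """Raise if the referenced block does not carry the key held in the PSW."""
        block = physical_address // BLOCK_SIZE
        if block_keys[block] != psw_key:
            raise MemoryError("protection violation: key mismatch")
        return physical_address

    check_access(10 * BLOCK_SIZE + 100, psw_key=5)    # allowed
    # check_access(3 * BLOCK_SIZE, psw_key=5)         # would raise: block 3 belongs to someone else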
This scheme has four major drawbacks:
(i) It results in memory wastage because the partition size has to be in multiples of a block size (Internal
fragmentation).
(ii) It limits the maximum number of partitions or resident processes (due to the key length).
(iii) It does not allow sharing easily. This is because the Operating System would have to allow two
possible keys for a shared partition if that partition belongs to two processes simultaneously. Thus,
each block in that partition should have two keys, which is cumbersome. Checking the keys by
hardware itself will also be difficult to implement.
(iv) If hardware malfunction generates a different address but in the same partition, the scheme cannot
detect it because the keys would still tally.
Another method of providing protection is by using a Limit register (see Fig. 9.11), which ensures that the
virtual address present in the original instruction (i.e. the address moved into IR before any relocation/
Address Translation) is within the bounds of the process. For instance, in our example in Sec. 9.3.4.3, where
the program size was 1000, the virtual addresses would range from 0 to 999. In this case, the limit register
would be set to 999. Every logical or virtual address will be checked to ensure that it is less than or equal to
999, and only then added to the base register. If it is not within the bounds, the hardware itself will generate
an error message, and the process will be aborted.
The limit register for each process can also be stored in the corresponding PCBs and can be saved/restored
during the context switch in the same way as the base register.
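Putting the limit check together with the base-register addition from the relocation discussion gives the
complete translation path. Again, this is only a model of what the hardware does, using the figures of the
1000-word program above.

    BASE_REGISTER = 1000     # start of the partition allocated to the process
    LIMIT_REGISTER = 999     # program size of 1000 words: valid virtual addresses are 0 to 999

    def translate(virtual_address):
        """Limit check first, then relocation, as described above."""
        if virtual_address > LIMIT_REGISTER:
            raise MemoryError("protection violation: address outside the bounds of the process")
        return virtual_address + BASE_REGISTER

    print(translate(500))    # 1500
    # translate(1200)        # would raise before any physical memory is touched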
Sharing poses a serious problem in fixed partitions because it might compromise on protection.
One approach to sharing any code or data is to go through the Operating System for any such request. Because
the Operating System has access to the entire memory space, it could mediate. This scheme is possible but
it is very tedious and increases the burden on the Operating System. Therefore, it is not followed in practice.
Another approach is to keep copies of the sharable code/data in all partitions where required. Obviously,
it is wasteful, apart from giving rise to possible inconsistencies, if for instance, the same pieces of data are
updated differently in two different partitions.
Another way is to keep all the sharable code and data in one partition and, with either key modification or a
change to the base/limit registers, allow controlled access to this partition even from outside by another
process. This is fairly complex and results in high overheads. Besides, it requires specialized hardware
registers. This is the reason why it is not followed widely.
If a partition of size 100k is allocated to a process of size 60k, then the 40k space of
that partition is wasted, and cannot be allocated to any other process. This is called 'internal fragmentation'.
However, it may happen that two free partitions of sizes 20k and 40k are available and a process of 50k has
to be accommodated. In this case, the two available partitions cannot together be allocated to that process,
because that would violate the principle of allocation, viz. that only contiguous memory should be allocated
to a process. As a result, there is wastage of memory space. This is called 'external fragmentation'.
In fixed partitions, a lot of memory is thus wasted due to fragmentation of both kinds. We have seen
examples of these earlier.
Access times are not very high due to the assistance of special hardware. The translation
from virtual to physical address is done by the hardware itself enabling rapid access.
We have studied the problems associated with fixed partitions, especially in terms of fragmentation and
restriction on the number of resident processes. This puts restrictions on the degree of multiprogramming and
in turn the CPU utilization. Variable partitions came into existence to overcome these problems and became
more popular. In variable partitions, the number of partitions and their sizes are variable. They are not defined
at the time of system generation.
At any time, any partition of the memory can be either free (unallocated) or allocated to some process, in
pretty much the same way as given in the PDT in Fig. 9.4. The only difference is that with variable partitions,
the starting address of any partition is not fixed, but keeps varying, as is depicted in Fig. 9.12.
The eight states of the memory allocations in the figure correspond to the eight events given below. We will
trace these events and study Fig. 9.12 to understand how this scheme works.
(i) The Operating System is loaded in the memory. All the rest of the memory is free.
(ii) A program P1 is loaded in the memory and it starts executing. (after which it becomes a process.)
(iii) A program P2 is loaded in the memory and it starts executing. (after which it becomes a process.)
(iv) A program P3 is loaded in the memory and it starts executing. (after which it becomes a process.)
(v) The process P1 is blocked. After a while, a new high priority program P4 wants to occupy the memory.
The existing free space is less than the size of P4. Let us assume that P4 is smaller than P1 but bigger
than the free area available at the bottom. Assuming that the process scheduling is based on priorities
and swapping, P1 is swapped out. There are now two chunks of free space in the memory.
(vi) P4 is now loaded in the memory and it starts executing. (after which it becomes a process.) Note that
P4 is loaded in the same space where P1 was loaded. However, as the size of P4 is less than that of
P1, still some free space remains. Hence, there are still two separate free areas in the memory.
(vii) P2 terminates. Only P4 and P3 continue. The free area at the top and the one released by P2 can now
be joined together. There is now a large free space in the middle, in addition to a free chunk at the
bottom.
(viii) P1 is swapped in as the Operating System has completed the I/O on its behalf and the data is already
in the buffer of the Operating System. Also, the free space in the middle is sufficient to hold P1 now.
Another process P5 is also loaded in the memory. At this stage, there is only a little free space left.
The shaded area in the figure shows the free area at any time. Notice that the numbers and sizes of the
processes are not predetermined. The scheme starts with only two partitions (the Operating System and the
rest) and at stage (vi) there are 6 partitions. These partitions are created by the Operating System at run time,
and they differ in sizes.
The procedure to be followed for memory allocation is the same as described for fixed partitions in steps
(i) to (vii) of Sec. 9.3.1, except that the algorithms and data structures used may vary. We will not repeat
these steps here. An interested reader can go through that section again to refresh the memory.
The basic information needed to allocate/deallocate is the same as given in the PDT in Fig. 9.4. However,
because the number of entries is uncertain, it is rarely maintained as a table, due to the obvious difficulty
of shifting all the subsequent entries after inserting any new entry. (Try doing that for our example in the
previous section, shown in Fig. 9.12.)
Therefore, the same information is normally kept as bit maps or linked lists, much in the same way that
you keep track of free disk blocks. To do this, like a block on the disk, the Operating System defines a chunk
of memory (often called a block again). This chunk could be 1 word, or 2 KB, or 4 KB, or whatever. The point
is that for each process, allocation is made in multiples of this chunk. In the bit map method, the Operating
System maintains 1 bit for each such chunk, denoting whether it is allocated (= 1) or free (= 0). Hence, if the
chunk is a word of 32 bits, 1 bit per 32 bits means about 3.1% of memory overhead, which is pretty high. However,
the memory wastage is minimal. This is because the average wastage, as we know, is (chunk size-1)/2 per
process. If the chunk size is high, the overhead is low but the wasted memory due to internal fragmentation is
high. In a linked list, we create a record for each variable partition. Each record maintains information such
as:
Allocated/free (F = Free, A = Allocated)
Starting chunk number
Number of chunks
Pointer (i.e. the chunk number) to the next entry.
Figure 9.13 depicts a picture of the memory at a given time. Corresponding to this state, we also show, in
(b) and (c), a bit map and a linked list, where a shaded area denotes a free chunk. You will notice that the
corresponding bit in the bit map is 0. The figure shows 29 chunks of memory (0 to 28), of which 17 are
allocated to 6 processes.
As is clear, the bit map shows that the first 4 chunks are free, the next 3 are allocated, then again the
next 2 are free, and so on. We can verify that the linked list depicts the same picture. Essentially, either of
these two methods can be used by the Operating System; each has its merits and demerits.
In this scheme, when a chunk is allocated to a process or a process terminates, thereby freeing a number
of chunks, the bit map or the linked list is updated accordingly to reflect these changes.
In addition, the Operating System can link all the free chunks together by bidirectional pointers. It can also
maintain a header for these free chunks, giving the start and end of this chain. This chain is used to allocate
chunks to a new process, given the desired size. When a process terminates, this chain is appropriately
updated with the chunks freed due to the terminated process. At any time, the PCB contains the starting chunk
number of the chunks allocated to that process.
Observe the merits/demerits of both the approaches. Bit maps are very fast for deallocations. For instance,
if a process terminates, the Operating System just resets to 0 the bits for the chunks held by that process; the
starting chunk number and the number of chunks needed for this can be found out from the PCB. (The program
size may indicate the number of chunks.)
But bit maps can be very slow for allocations. For instance, let us assume that a new process wants 4
chunks. The algorithm has to start from the beginning of the bit map and search for 4 consecutive zero bits. This
certainly takes time. Linked lists, on the other hand, are time-consuming for deallocations, but they can be
faster for allocations. A study of the algorithms for these will reveal the reason for this.
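A sketch of that bit map search is shown below, assuming the map is simply held as a Python list of 0s and 1s;
the sample map mirrors the first few chunks described for Fig. 9.13.

    def find_free_run(bitmap, needed):
        """Return the index of the first run of 'needed' consecutive free (0) chunks, or None."""
        run_start, run_len = None, 0
        for i, bit in enumerate(bitmap):
            if bit == 0:
                if run_len == 0:
                    run_start = i
                run_len += 1
                if run_len == needed:
                    return run_start
            else:
                run_len = 0
        return None

    def allocate(bitmap, needed):
        start = find_free_run(bitmap, needed)
        if start is not None:
            for i in range(start, start + needed):
                bitmap[i] = 1                 # mark the chunks as allocated
        return start

    bitmap = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]   # first 4 chunks free, next 3 allocated, ...
    print(allocate(bitmap, 4))                # 0 -- first fit finds the run at the very beginning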
Again, for allocations, in both the methods you could use the 'first fit', 'best fit' and 'worst fit' algorithms
as discussed earlier. For the best fit and worst fit algorithms, linked lists are far more suitable than bit maps, as
they maintain the chunk size as an explicit data item in each record. However, you need to have these free
chunks sorted in the order of chunk size for both the best fit and worst fit methods. For the first fit method, which
is the most common, all that the Operating System needs to do is to access the queue header for the free chunks
and traverse the chain of free chunks until it hits the first chunk with size >= the size needed.
There is also an algorithm called 'quick fit' which maintains different linked lists for some of the commonly
required process sizes e.g. there is one linked list for 0 – 4k holes, and there is another list for 4k – 8k holes
and yet another one for 8k – 12k holes and so on. The hole when created (due to process termination) is added
to the appropriate list. This reduces the search times considerably.
All these techniques normally 'coalesce' the adjacent holes. For instance, in step (vii) of Fig. 9.12, when
process P2 terminated, two adjacent holes were created. The Operating System looks around to see if there are
adjacent holes, and if yes, it creates only one large hole. (To do this, a linked list in the original sequence as
shown in Fig. 9.13 is much better than multiple linked lists.) Having created a large hole, depending upon its
size, it may have to be added to the appropriate list if 'quick fit' is used.
There is yet another method of allocation/deallocation called 'Buddy System' proposed by Knowlton
and Knuth to speed up merging of adjacent holes. Unfortunately, it is very inefficient in terms of memory
utilization. Various modified buddy systems have been proposed to improve this, but the discussion of those
is beyond the scope of this text.
One problem with all these systems is external fragmentation. In states (v), (vi) and (vii) shown in
Fig. 9.12, there are two holes, but at separate locations, so that it is not possible to coalesce them. If a process
requires more memory than each hole individually, but less than both holes put together, that process
cannot run even though the total free memory available is larger than what it requires. What, then, is the
solution to this problem? The technique used to solve it is called ‘Compaction’. We will now study it.
This technique shifts the necessary process images to bring the free chunks to adjacent positions in order
to coalesce them. There could be different ways to achieve compaction. Each one results in the movement of
different chunks of memory. For instance, Fig. 9.14 (a) shows the original memory allocations and
Fig. 9.14 (b), (c) and (d) show three different ways in which compaction could be achieved. These three ways
result in the movement of chunks of sizes 1200, 800 and 400 respectively. While calculating the movements,
imagine that the live processes are actually moving rather than the free chunks.
For instance, in the method shown in Fig. 9.14 (b), the Operating System has to move P3 (size = 400) and
P4 (size = 800). Hence, the total movement is 1200. In Fig. 9.14 (c), only P4 (size = 800) is moved in the
free chunk of 800 available (1200 – 2000). Hence, the total movement is only 800. In Fig. 9.14 (d), you move
only P3 (size = 400) in the free chunk of 400 available (3800 – 4200). In this case, the total movement is only
400. The free contiguous chunk is in the middle in this case, but it does not matter. Obviously, the method
depicted in Fig. 9.14 (d) is the best.
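The choice among the three alternatives is simply the one that moves the fewest bytes of process images.
Using only the process sizes quoted above (P3 = 400, P4 = 800), the comparison can be stated in a couple of
lines; the plan labels follow the sub-figures of Fig. 9.14.

    # Which processes each compaction plan would move, with their sizes as in Fig. 9.14.
    plans = {
        "(b)": [("P3", 400), ("P4", 800)],
        "(c)": [("P4", 800)],
        "(d)": [("P3", 400)],
    }
    movement = {name: sum(size for _, size in moved) for name, moved in plans.items()}
    print(movement)                          # {'(b)': 1200, '(c)': 800, '(d)': 400}
    print(min(movement, key=movement.get))   # (d) -- the cheapest plan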
The Operating System has to evaluate these alternatives internally, and then choose. It is obvious that
regardless of the method used, during the compaction operation, normally no user process can proceed,
though in the case depicted in Fig. 9.14 (d), it is possible to imagine that P1, P2 or P4 can continue while
compaction is going on, because they are unaffected. Any process for which the image is being shuffled
around for compaction has to be blocked until the compaction is over. After the 'external event' of compaction
is over, the PCB is updated for memory pointers and the PCB is chained to the 'ready' list. Eventually, this
process is dispatched.
Obviously, whenever a process terminates, the Operating System would do the following:
(i) Free that memory.
(ii) Coalesce, if necessary.
(iii) Check if there is another free space which is not contiguous and if yes, go through the compaction
process.
(iv) Create a new bit map/linked list as per the new memory allocations.
(v) Store the starting addresses of the partitions in the PCBs of the corresponding processes. This will be
loaded from the appropriate PCB into the base register at the time the process is dispatched. The base
register will be used for Address Translation of every instruction at the run time as seen earlier.
Compaction involves a high overhead, but it increases the degree of multiprogramming. This is because,
after compaction, it can accommodate a process with a larger size which would have been impossible before
compaction. Both IBM and ICL machines have used this scheme for their operating systems.
The swapping considerations are almost identical to those discussed in Sec. 9.3.3. for fixed partitions, and
therefore, need no further discussion.
This is substantially the same as in fixed partitions. This scheme also depends upon the base register, which is
saved in and restored from the PCB at the context switch. The physical address is calculated by adding
the base register to the virtual address as before, and the resulting address goes to MAR for decoding. After
swapping or compaction operations, if the processes change their memory locations, these values also need
to be changed as discussed earlier.
Protection is achieved with the help of the limit register. Before calculating the resultant physical address,
the virtual address is checked to ensure that it is equal to or less than the limit register. This register is
loaded from the PCB when that process is dispatched for execution. As this value of limit register does not
undergo any change during the execution of a process, it does not need to be saved back in the PCB at the
context switch.
Sharing is possible only to a limited extent by using 'overlapping partitions' as shown in Fig. 9.15.
The Figure depicts that process A occupies locations with addresses 3000 to 6999, and process B occupies
locations with addresses 6000 to 8999. Thus, in this case, locations with addresses between 6000 and 6999
are overlapping, as they belong to both the partitions. This is possible only because the partitions were
variable and not fixed.
Though it may sound like a very good idea, in practice, it has a number of limitations. Firstly, it allows
sharing only for two processes. Secondly, the shared code must be either reentrant or must be executed in a
mutually exclusive way with no preemptions. While mapping the virtual addresses to physical addresses,
references made from within the shared portion to locations in that same portion must map to the same physical
locations from both the processes. Due to these difficulties, this method is not widely used for sharing.
This scheme wastes less memory than the fixed partitions because there is theoretically
no internal fragmentation if the partition size can be of any length. In practice, however, the partition size
is normally a multiple of some fixed number of bytes giving rise to a small internal fragmentation. If the
Operating System adopts the policy of compaction, external fragmentation can also be done away with, but
at some extra processing cost.
Access times are not different from those in fixed partitions due to the same scheme of Ad-
dress Translation using the base register.
Time complexity is certainly higher with the variable partition than that in the scheme
of fixed partitions, due to the various data structures and algorithms used. Consider, for instance, that the Partition Description Table (PDT) shown in Fig. 9.4 is no longer of fixed length. This is because the number of partitions is not fixed. Also consider the added complexity of bit maps/linked lists due to coalescing/compaction.
Upto now, various contiguous memory allocation schemes and the problem of fragmentation that
arises therefrom have been studied. Compaction provides a method to reduce this problem, but at the
expense of a lot of computer time in shifting many process images to and fro. Non-contiguous
allocation provides a better method to solve this problem.
Consider Fig. 9.16, for instance. Before compaction, there are holes of sizes 1k and 2k.
If a new program of size = 3k is to be run next, it could not be run without compaction in the earlier
schemes. However, compaction would force most of the existing processes also to stop running for a while.
A solution to this has to be found.
Solving the problem of fragmentation involves answering the following questions:
(a) Can the program be broken into two chunks of 1k and 2k to be able to load them into two holes at
different places? This will make the process image in the memory noncontiguous. This raises several
questions.
(b) How can such a scheme be managed?
(c) How can the addresses generated by the compiler be mapped into those of the two separate non-
contiguous chunks of physical memory by the Address Translation mechanism?
(d) How can the problems of protection and sharing be solved?
A thought would be to have two base registers for two chunks belonging to our process in the above
example. Each base register will have the value of the memory address of the beginning of that chunk. For
instance, in Fig. 9.16 the program 2 of 3k size will be loaded in the two chunks of sizes 1k and 2k starting at
the physical memory addresses 500 and 2100 respectively. In this case, the two base registers will have the
values of 500 and 2100. An address in the process will belong to either chunk-A or chunk-B. Thus, to arrive
at the final physical address by the Address Translation, the respective base register will have to be added to
the original address depending upon which chunk the address belongs to. Values of both of the registers could
be initially stored in the PCB and restored from the PCB at every context switch as before. Thus, this scheme
could be conceptually an extension of earlier ideas. As many base and limit registers as there are chunks in a program will be needed. For instance, if a program is loaded in n non-contiguous chunks, you will need n base registers and n limit registers for that program alone.
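The following small C sketch illustrates this idea with the values assumed above from Fig. 9.16: a 1k chunk of the program loaded at physical address 500 and a 2k chunk loaded at physical address 2100. The function name and the interpretation of 1k as 1024 bytes are only illustrative.

/* A minimal sketch of Address Translation with two base/limit register
   pairs, using the assumed values from Fig. 9.16: chunk A of 1 KB loaded
   at physical address 500, chunk B of 2 KB loaded at physical address 2100. */
#include <stdio.h>

#define CHUNK_A_SIZE 1024      /* 1 KB */
#define CHUNK_B_SIZE 2048      /* 2 KB */
#define CHUNK_A_BASE 500
#define CHUNK_B_BASE 2100

/* Translate a virtual address of the 3 KB program into a physical address. */
long translate(long vaddr)
{
    if (vaddr < CHUNK_A_SIZE)                    /* address lies in chunk A */
        return CHUNK_A_BASE + vaddr;
    if (vaddr < CHUNK_A_SIZE + CHUNK_B_SIZE)     /* address lies in chunk B */
        return CHUNK_B_BASE + (vaddr - CHUNK_A_SIZE);
    return -1;                                   /* limit violation */
}

int main(void)
{
    printf("%ld\n", translate(100));    /* 600  : chunk A */
    printf("%ld\n", translate(2000));   /* 3076 : chunk B, offset 976 */
    return 0;
}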
A question is: What should be the sizes of these chunks?
Approaches used for solving this are listed here:
(a) One approach is called 'paging'. In this case, the process image is divided into fixed sized pages.
(b) Another approach is called 'segmentation'. In this case, the process image is divided into logical segments of different sizes.
(c) In either approach, if the entire process image (all chunks) has to reside in the main memory before execution can commence, the system is a real memory management system.
(d) If, instead, only some of the chunks need to be present, the rest being kept on the disk from which they can be brought into the main memory as and when required, the system is called a 'virtual memory management system' (not to be confused with 'virtual address'). The term 'virtual address' can be used meaningfully even in the pure paging system, but that by itself will not make it a virtual memory management system.
(e) Virtual memory can be implemented in two ways. One is called 'demand paging'. The other is called the 'working set method'. These methods differ in the way the chunks are brought from the disk into the main memory.
(f) There is also a combined 'segmented paged method', in which each process image is divided into a number of segments of different sizes, and each segment in turn is divided into a number of fixed sized pages. Again, this scheme can be implemented using virtual memory, though it is possible to implement it using 'real' memory.
These methods are considered one by one in subsequent paragraphs.
As discussed earlier, the chunks of memory are of equal sized pages in the paging scheme. The logical or
virtual address space of a program is divided into equal sized pages, and the physical main memory also is
divided into equal sized page frames. The size of a page is the same as that of the page frame, so that a page
can exactly fit into a page frame and therefore, it can be assigned to any page frame, which is free. (Questions
of first fit, etc. do not arise.)
In order that this scheme works, the following must happen:
(i) The process address space of a program is thought of as consisting of a number of fixed sized
contiguous pages (hence, the name 'virtual or logical pages').
(ii) Any virtual address within this program consists of two parameters: a logical or virtual page number
(P) and a displacement (D) within the page.
(iii) The memory is divided into a number of fixed sized page frames. The size of a page frame is the same
as that of a logical page. The Operating System keeps track of the free page frames and allocates a
free page frame to a process when it wants it.
(iv) Any logical page can be placed in any free available page frame. After the page (P) is loaded in a page
frame (F), the Operating System marks that page as "Not free".
(v) Any logical address in the original program is two dimensional (P, D), as we know. After loading,
the address becomes a two-dimensional physical address (F, D). As the sizes of the page and the page
frame are the same, the same displacement D appears in both the addresses.
(vi) When the program starts executing, the Address Translation mechanism has to find out the physical
page number (F), given the virtual page number (P). After this, it has to append or concatenate D to
it to arrive at the final physical address (F, D). Hence, in the virtual address, it must be possible to
separate out the bits for the page (P) and the ones for D, in order to carry out this translation.
However, there is a problem in our scheme. How does the compiler generate a two-dimensional address?
We know that the compiler generates only one-dimensional single address in binary. How then is it possible
to separate out the address into two components, P and D? The secret of the solution lies in the page size. If
the page size is a power of 2 such as 32, 64, ... 1k, 2k, etc., this problem vanishes. This is because, the single
binary address can be shown to be the same as a two-dimensional address, i.e. automatically some high order
bits correspond to P and the remaining low order bits correspond to D. This can happen only if the page size is
a power of 2. This is the reason the compiler does not have to generate any separate two dimensional address
specifically for the paging system address. It generates only a single binary address, but it can be interpreted
as a two-dimensional address. This is what helps in separating P, translating it to F and then concatenating the
same D to it to arrive at the final address.
If the page size is not a power of 2, this automatic separation of P and D does not take place. We will
consider an example to illustrate this.
Let us say that page size = 100 and that the address in question is 107 in decimal. The address 107 in binary
would be 01101011. This is essentially a single dimensional address in 8 bits as the compiler would generate.
In a two-dimensional address with page size equal to 100, page 00 will have addresses 0 to 99, and page
01 will have addresses 100 to 199. Thus, address 100 would correspond to P = 1, D = 0, address 101 would
correspond to P = 1 and D = 1, and so on. Therefore, address 107 would be that of the location number = 7
in page number 1. Therefore, P = 01, D = 000111 in binary, if we reserve two bits for P and six for D. If we
concatenate the two, we get the two dimensional address as 01000111, as against a one dimensional address
of 01101011.
Notice that these two are different. The Address Translation at the time of execution will pose difficulties,
if the compiler produces a one-dimensional address which has no correlation with a two-dimensional one.
This problem can be easily solved if the page size is a power of 2, which is normally the case. Assume in
this case that the page size is 32. Thus, locations 0–31 are in page 0, 32–63 in page 1, 64–95 in page 2 and
96–127 in page 3. Therefore, location 96 means P = 3 and D = 0, location 97 means P = 3 and D = 1. We can,
therefore, easily see that the address 107 will mean P = 3 and D = 11. Hence, the two-dimensional address for
decimal 107 in binary is P = 011 and D = 01011. If we concatenate the two, we will get 01101011, which is
exactly same as the one-dimensional address in binary that the compiler produces.
An interesting point is worth noting. Even if the page size were 64 instead of 32, the two-dimensional
address would remain the same. In this case, page 0 would have addresses 0–63 and page 1 would
have 64–127. Hence, location 107 would mean location 43 in page 1. Therefore, a two-dimensional
address for 107 would be page (P) = 01 and Displacement (D) = 101011 in binary. Concatenating the two,
we still get 01101011 which is the same as the one-dimensional address in binary address that the compiler
produces.
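As a quick check of this, the following illustrative C fragment splits the same address 107 with page sizes 32, 64 and 100, and shows that the bitwise concatenation of P and D reproduces the original address only when the page size is a power of 2.

#include <stdio.h>

/* For a power-of-2 page size, P and D are simply the high and low order
   bits of the one-dimensional address, so concatenating them gives the
   address back unchanged. The values reproduce the example of address 107. */
int main(void)
{
    unsigned addr = 107;

    /* page size 32 = 2^5 : 5 bits of displacement */
    unsigned p = addr >> 5;          /* P = 3  */
    unsigned d = addr & 0x1F;        /* D = 11 */
    printf("page size 32 : P=%u D=%u concatenated=%u\n", p, d, (p << 5) | d);

    /* page size 64 = 2^6 : 6 bits of displacement */
    p = addr >> 6;                   /* P = 1  */
    d = addr & 0x3F;                 /* D = 43 */
    printf("page size 64 : P=%u D=%u concatenated=%u\n", p, d, (p << 6) | d);

    /* page size 100 (not a power of 2): P = 1, D = 7, but packing them into
       2 and 6 bits gives 01000111 = 71, which is NOT the original 107.     */
    p = addr / 100;
    d = addr % 100;
    printf("page size 100: P=%u D=%u packed=%u\n", p, d, (p << 6) | d);
    return 0;
}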
Therefore, the compiler does not have to produce different addresses, specifically because it is going to be
treated as a two-dimensional address. The compiler compiles addresses, as if they were absolute addresses
with respect to 0 as the starting address. These are the same as single-dimensional addresses. At the time of
execution, the addresses can be separated as page number (P) quite easily by considering only a few high
order bits of the address and displacement (D) by considering the remaining low order bits of the address.
This is shown in Fig. 9.17. The point is: How many bits should be reserved for P and how many for D? The answer to this depends upon the page size, which determines D, and the maximum number of pages in a process image, which determines P. Given that the total size of the process image = page size × number of pages, which is a constant, a number of possibilities can arise. For a process image of 256 bytes, for example, we can have a page size of 16 bytes with 16 pages (4 bits each for P and D), a page size of 32 bytes with 8 pages (3 bits for P, 5 for D), or a page size of 64 bytes with 4 pages (2 bits for P, 6 for D).
The decision of page size is an architectural issue, which has an effect on performance. We will study
this later. The point is: the beauty of binary system is such that, whatever the page size may be, the one-
dimensional address is same as the two-dimensional one.
Normally, in commercial systems, the page size chosen varies from 512 bytes to 4 KB. Assuming that 1
Megabyte or 1 MB (= 1024 KB) of memory is available and page size as well as the page frame size is = 2
KB, we will require 1024/2 = 512 page frames numbering from 0 to 511 or from 000000000 to 111111111
in binary. Hence, the 9 high order bits of the address can be reserved to denote the page frame number. Each
page has 2 KB (2048) locations numbering from 0 to 2047; thus, requiring 11 bits for displacement D. (512
requires 9 bits, 1024 would require 10 and 2048 would require 11 bits.) Thus, the total address would be made
up of 9 bits for page number + 11 bits for displacement = 20 bits.
Similarly, any virtual address produced by the compiler can be thought of as made up of two components-
page number (P) and displacement (D). An interesting point of this scheme is that when a page is loaded into
any available page frames, (the sizes of both are the same), the displacement for any address (D) is the same
in virtual as well as physical address. Hence, all that is needed is to load pages in available page frames and
keep some kind of index as to which page is loaded where.
This index is called a 'Page Map Table (PMT)' which is the key to the Address Translation. At the
execution time, all that is needed is to separate the high order bits in the address reserved for the page number
(P), and convert them into page frame number (F) using this PMT, concatenate F and D and arrive at the
physical address as we know that D remains same. This is the essence of Address Translation.
The Page Map Table (PMT) is shown in Fig. 9.18. There is one such PMT maintained for each process.
The PMT in the figure shows that a virtual address space of a process consists of 4 pages (0 to 3) and they are
loaded in physical page frames 5, 3, 9 and 6 respectively.
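The essence of this translation can be sketched in a few lines of C. The PMT values below are those of Fig. 9.18 (pages 0 to 3 mapped to frames 5, 3, 9 and 6); the 2 KB page size is an assumption made only for illustration.

#include <stdio.h>

#define PAGE_SIZE 2048                     /* assumed 2 KB pages              */
static int pmt[] = { 5, 3, 9, 6 };         /* PMT of Fig. 9.18: page -> frame */
#define NUM_PAGES (sizeof(pmt) / sizeof(pmt[0]))

/* Essence of Address Translation: separate P and D, map P to F through
   the PMT, and concatenate F with the unchanged displacement D. */
long translate(long vaddr)
{
    long p = vaddr / PAGE_SIZE;
    long d = vaddr % PAGE_SIZE;
    if (p < 0 || p >= (long)NUM_PAGES)
        return -1;                         /* invalid page number */
    return (long)pmt[p] * PAGE_SIZE + d;
}

int main(void)
{
    /* virtual address 4100 lies in page 2 (D = 4), which is in frame 9 */
    printf("%ld\n", translate(4100));      /* prints 18436 */
    return 0;
}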
In this scheme, in order to load a page, any page frame is as good as any other, so long as it is free; there is
nothing to choose one against the other. In other words, memory allocation algorithm implies maintaining a
list of free page frames and allocating as many page frames from it as there are pages to be loaded.
The scheme for allocating page frames of physical memory at the time of process creation has to co-
ordinate among the Information Management (IM); Process Management (PM) and the Memory Management
(MM) modules. It works as follows:
(i) MM keeps track of free page frames at any time. In the beginning, all page frames except those
occupied by the Operating System itself are free. Thus, MM maintains this list of free page frames in
addition to PMTs.
(ii) When a process is to be loaded into the memory, PM requests IM for the size of the program.
(iii) IM goes through the directory structure to resolve the path name, accesses the Basic File Directory
(BFD) information to get the size of the object file to be executed.
(iv) Having obtained this size, PM requests MM to allocate the memory of that size.
(v) MM calculates the number of page frames needed to be allocated. This is equal to (Program size/Page
frame size) rounded up to the next integer.
(vi) MM now consults the list of free page frames and if possible, allocates them to the process. You know
that they need not be contiguous. MM now updates the list of free page frames to mark these page
frames as “allocated”. It also creates a PMT for that process. If there are not enough free, allocable
page frames, MM indicates that to PM, which postpones the loading of this process (In Virtual Memory
Management System, where the execution of a process can commence with only a part of the process
image in the memory, the story would have been different!). If this process is of high priority, it is for
the PM to swap out an existing low priority process to make room for the new one.
(vii) Having allocated the required page frames, MM now signals the PM to load the process.
(viii) PM loads various pages of the process address space into the allocated physical page frames of the
memory with the help of the IM; and links the PCB for that process in the list of ready processes.
The PCB also maintains a pointer to the starting address of the PMT in the memory. This is used for
Address Translation for this process when this process is dispatched after the context switch.
Let us take an example to illustrate how the free page frames are allocated to a new process and how a
PMT is created for it. Fig. 9.19 shows three processes, A, B and C with their respective PMTs which map the
virtual or logical pages (P) onto the physical page frames (F). A list of free page frames is shown on the top
of the figure. This list is not necessarily maintained in any particular order. The order is dictated by the way
the page frames get free and are allocated. The figure also shows that a new process (Process D) has arrived
wanting to occupy two page frames. The Operating System will consult the list of free page frames, allocate
the first two page frames in that list i.e. page frames 10 and 14 to Process D and then it will create a PMT for
it. It will then remove those page frames from the free list. This is depicted in Fig. 9.20.
There is one PMT for each process and the sizes of different PMTs are different. Study the free page
frames list before and after the allocation. The page frame list need not always be in the sorted order of frame
numbers. As and when the frames are freed by the processes which are terminated or swapped out, they
are added to the list, and that order can be random because you cannot predict which page frame would
get free when. Hence, there is no specific sequence maintained in that list. It is important to know that this
order is also not of any consequence. While allocating, any page frame is as good as any other. The need for
contiguity also has vanished. Hence, the allocation algorithms of best fit, first fit etc. are of no value in this
scheme. If 4096 bytes are required, the MM will calculate this as two pages of 2 KB each and will allocate
the first two free page frames in the list of free page frames. These could be physically quite distant. But this
does not matter because the Address Translation is done separately for each page using PMT.
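A minimal C sketch of this allocation, using the free page frame list of Fig. 9.19 (the frame numbers other than 10 and 14 are assumed), might look as follows. It is purely illustrative and not the actual data structure of any particular Operating System.

#include <stdio.h>

#define PAGE_FRAME_SIZE 2048                   /* assumed 2 KB page frames */

/* Assumed free page frame list, in the (arbitrary) order of Fig. 9.19. */
static int free_frames[] = { 10, 14, 21, 7 };
static int free_count = 4;
static int next_free  = 0;

/* Allocate enough page frames for 'bytes' of program and build its PMT.
   Returns the number of pages allocated, or -1 if not enough free frames. */
int create_pmt(long bytes, int pmt[])
{
    int pages = (int)((bytes + PAGE_FRAME_SIZE - 1) / PAGE_FRAME_SIZE);
    if (pages > free_count)
        return -1;                             /* PM postpones the loading   */
    for (int p = 0; p < pages; p++) {          /* any free frame is as good  */
        pmt[p] = free_frames[next_free++];     /* as any other               */
        free_count--;
    }
    return pages;
}

int main(void)
{
    int pmt[16];
    int pages = create_pmt(4096, pmt);         /* process D: 4096 bytes = 2 pages */
    for (int p = 0; p < pages; p++)
        printf("page %d -> page frame %d\n", p, pmt[p]);   /* frames 10 and 14 */
    return 0;
}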
The considerations for swapping are similar to those discussed earlier. In paging, if a process is swapped, it
is swapped entirely. Keeping only a few pages in the main memory is useless because a process can run only
if all the pages are present in the main memory. AOS running on 16 bit Data General machine follows the
pure paging philosophy. The process cannot run unless the entire process image is in the memory even if the
process image is divided into pages, and a few of them are in the main memory. Therefore, the entire process
image is swapped out if required.
Which process is to be swapped out depends upon the priorities and states of the processes already existing
in the main memory and the size of the new process to be accommodated. Normally, a blocked process
with very low priority can be swapped out, if space is to be created in the memory for a new process. These
issues are handled by the medium level process scheduler and are already discussed in the section on process
management.
When a process is swapped out, its area in the memory which holds the PMT for that process is also
released. When it is swapped in again, it may be loaded in different page frames depending upon which
are free at that time. At that time, a new PMT is created, as the PCB for the process is chained to the ready
processes. We know that the PCB also contains the memory address of the PMT for that process itself.
Relocation has already been discussed in Sec. 9.6.1, showing that different pages
are loaded in different page frames before execution. After loading, the addresses in the program, as it resides
in the main memory, are still virtual addresses as generated by the compiler (i.e. assuming that the program
is loaded contiguously from address 0). It is only at the run time that Address Translation is done using PMT.
This is shown in Fig. 9.21.
Let us assume that we have a machine where a word length is 8 bits. Let us also assume that our machine
has main memory with the capacity of 512 words or bytes. This memory is divided into 16 pages of 32 words
each (16×32 = 512). Hence, we will require 4 bits to represent the page number P (0 to 15) and 5 bits to
represent the displacement D (0 to 31). Therefore, the total number of bits in the address will be 9 (i.e. 4 +
5). We verify that with 9 bits, the maximum address that can be generated is 511 which is quite as expected
because we have a memory size of 512 words (0 to 511). This is shown in Fig. 9.21.
It has been seen that any address generated by the compiler is automatically divided into two parts - page
number (P) and displacement (D). This is because the page size is a power of two. When the instruction is
fetched into IR, depending upon the addressing mode (direct, indirect, indexed etc.) the resultant address
ultimately resides in the CPU register. It is this address that is split into P and D. P is fed as an input to the
Address Translation mechanism. Address Translation finds out the page frame (F) corresponding to P using
the PMT, and generates the physical address F + D. This is shown in Fig. 9.21.
Let us assume that there is a COBOL program with an instruction "ADD BASIC, DA GIVING TOTAL".
The compiler would generate many machine instructions for this one instruction. Also let there be one of
those instructions as LDA 107 and that the instruction itself is at a virtual address 50 (i.e. P = 1, D = 18) in
the program with 0 as the starting address. The following discussion shows how this instruction is executed.
At the fetch cycle, when the Program Counter (PC) gets incremented from 49 to 50 (i.e. 000110010), this
address is transferred to MAR by the microinstruction PC → MAR. The bits in MAR act as the control signals
for the address decoder which activate the desired memory location. It is at this stage that we need to modify
the address so that the resulting address can finally be put on the address bus which can access the physical
memory. The PMT in Fig. 9.21 shows that page 1 is mapped onto page frame 4, and thus, the physical address
at which you will find the instruction "LDA 107" will be within page frame 4, at a displacement of 18. Page
frame 4 will contain physical addresses 128 to 159. Therefore, displacement 18 within that page frame would
mean physical address of 128 + 18 = 146 in decimal or 010010010 in binary. Hence, we need to fetch the
instruction not at location 50 but at location 146. To achieve this, the address coming out of MAR which is 50
in decimal or 000110010 in binary is split into two parts. The page number P is used to find the corresponding
page frame F using the PMT. Figure 9.21 shows that P = 0001 corresponds to F = 0100. F + D now is used as
the address which is used as the control signal to the memory decoder. Thus, actually the instruction at 146
in decimal or 010010010 in binary is fetched in the Instruction Register (IR). This is just what is required.
At the 'execute cycle', the hardware “knows” that it is an LDA instruction using direct addressing. It,
therefore, copies the address portion 107 i.e. 001101011 (P = 0011 = 3, D = 01011 = 11) to MAR for fetching
the data by giving a 'read' signal. The figure shows that page 3 is mapped onto page frame 2. Hence, the data
at virtual address decimal 107 will now be found at physical address with page frame (F) = 2 = 0010 and
Displacement (D)=11 = 01011 or binary address = 001001011 i.e. 75 instead of 107 in decimal. Again, this
Address Translation is done using PMT on the address in MAR and the resultant translated address is put
on the address bus, so that actually the correct addresses are used for address decoding. This is shown in
Fig. 9.22.
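The two translations of this example (the fetch of the instruction at virtual address 50 and the data reference to virtual address 107) can be reproduced by the following illustrative C fragment, assuming the 32-word pages and the two PMT entries given in Fig. 9.21; the remaining PMT entries are placeholders.

#include <stdio.h>

#define PAGE_SIZE 32                        /* 32 words per page, Fig. 9.21 */

/* Only the two mappings used in the example are known from Fig. 9.21:
   page 1 -> frame 4 and page 3 -> frame 2; the rest are placeholders. */
static int pmt[16] = { [1] = 4, [3] = 2 };

int translate(int vaddr)
{
    int p = vaddr / PAGE_SIZE;              /* high order 4 bits */
    int d = vaddr % PAGE_SIZE;              /* low order 5 bits  */
    return pmt[p] * PAGE_SIZE + d;          /* concatenate F and D */
}

int main(void)
{
    printf("fetch  : virtual 50  -> physical %d\n", translate(50));   /* 146 */
    printf("execute: virtual 107 -> physical %d\n", translate(107));  /*  75 */
    return 0;
}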
Thus, the total time required for any memory reference = t_ama + t_ma. As t_ama is very small, the time overhead
in this method is very low. This method is fast but expensive. This is because, the associative registers are
not very cheap. Having a large number of associative registers is very expensive. Therefore, a via media is
needed.
The hybrid method provides such a mechanism. In this method, associative memory
is present, but it consists of only 8, 16 or some other manageably small number of registers. This reduces the
cost drastically. Only the pages actually referenced frequently are kept in the associative memory with the
hope that they will be referenced more frequently.
The Address Translation in the hybrid method is carried out in following fashion:
(a) The virtual address is divided into two parts: page number (P) and displacement (D) as discussed
earlier. This is a timeless operation.
(b) The page number (P) is checked for its validity by ensuring that P <= PMTLR. This takes virtually no
time as this comparison takes place in the hardware itself.
Found in associative memory
(c) If P is valid, a check is made to see if P is in the associative registers, and if it exists there, the
corresponding page frame number (F) is extracted directly from the associative registers. This
operation takes some time, t_ama, as seen earlier. This time is required regardless of whether P exists in the associative registers or not.
(d) The original displacement (D) is concatenated to F, to get the final physical address F + D. This,
again, takes virtually, no time.
(e) Using this address, the desired item in the main memory is finally accessed. This requires the time t_ma
as discussed earlier.
Thus, if P is found in the associative memory, the total time required = t_ama + t_ma, as in the pure hardware method. If P is not found in the associative registers, the method followed is the same as the pure software method. The steps
for this are as follows:
Not found in associative memory
(f) In the software method, P is used as an index into the PMT. PMTBR is added to P (requiring negligible time) to directly find out the desired entry number of the PMT.
(g) The selected entry of the PMT is fetched into the CPU register. This operation requires one memory access time = t_ma, because the full PMT is in the main memory.
(h) The page frame number (F) is extracted from the selected PMT entry brought into the CPU register.
This, again, is almost a timeless operation.
(i) The original displacement (D) is now concatenated to F requiring negligible time to get the final
physical address F + D.
(j) Using this physical address, the desired data item in the main memory is now accessed. This requires
time t_ma as seen before.
Thus, the total time required if P is not found in the associative registers = t_ama + 2t_ma.
Assuming the hit ratio (h) = the fraction of times that P is found in the associative registers, we can find
out the Effective Access Time (EAT) as follows:
EAT = h × (t_ama + t_ma) + (1 – h) × (t_ama + 2t_ma)
Let us assume that t_ama = 40 ns, t_ma = 800 ns and h = 0.8.
We get EAT = 0.8 × (40 + 800) + 0.2 × (40 + 1600)
= 1000 ns
Therefore, we see the following degradations with respect to the basic memory access time of 800 ns. The time to reference a memory location using the pure hardware method = 40 + 800 = 840 ns (5% degradation). The time using the hybrid method with h = 0.8 = 1000 ns (25% degradation). The time using the pure software method = 800 + 800 = 1600 ns (100% degradation).
The hit ratio (h) can be increased by having more associative registers. We have to do it in a cost effective
manner by studying the cost and benefits of increasing h.
If h = 0.9, we can see that
EAT = (0.9×840) + (0.1×1640) = 920 ns.
We can see that, by increasing h from 0.8 to 0.9, the EAT has reduced from 1000 to 920 or by 8%. The cost
of increasing h from 0.8 to 0.9 may however, increase by much more than 8%. In fact, commonsense tells us
that as h increases, the cost effectiveness of increasing h still further may go on decreasing. This is the key to
designing the most optimal solution.
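The EAT arithmetic above can be expressed as a small C function; the figures of 40 ns and 800 ns are the ones assumed in the text.

#include <stdio.h>

/* Effective Access Time for the hybrid method, using the figures from the
   text: t_ama = 40 ns (associative lookup), t_ma = 800 ns (memory access). */
double eat(double h, double t_ama, double t_ma)
{
    return h * (t_ama + t_ma) + (1.0 - h) * (t_ama + 2.0 * t_ma);
}

int main(void)
{
    printf("h = 0.8 : EAT = %.0f ns\n", eat(0.8, 40, 800));   /* 1000 ns */
    printf("h = 0.9 : EAT = %.0f ns\n", eat(0.9, 40, 800));   /*  920 ns */
    return 0;
}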
Given a fixed size of associative registers, one can improve the performance by keeping the appropriate
entries from the PMT in the associative memory, so that h improves.
Normally, associative memory is kept empty initially. At this stage, the PMT in the memory will have
to be referenced for any page. Any time a new page is referenced, that entry is copied from the PMT to the
associative memory with the assumption that the page is likely to be required frequently at least in the near
future. This assumption is normally true in practice. When an additional page is referenced, the entry for that
page also is copied from the PMT into another associative memory register.
Eventually, the associative memory will get full and yet a new entry would like to come in by the same
logic. This means that some entry will have to be removed from the associative memory to make room for the
new one. The question is: Which entry in the associative memory is to be overwritten by the new entry? To
decide that, there are various algorithms such as 'First In First Out (FIFO)', 'Least Recently Used (LRU)'
etc. We will discuss these in more detail when we discuss various page replacement algorithms in virtual
memory systems. The essential idea is the same in both the cases. We apologize for a forward reference, but
these algorithms are very popular under the title, 'page replacement algorithms' in virtual memory systems,
and hence, we will discuss them at that time to keep with the tradition.
At any time, if the process gets blocked, the existing associative memory contents will have to be stored
in the memory either in the PCB for that process itself or somewhere else for which there is a pointer in the
PCB. They are restored again at the context switch before the process gets dispatched. This is necessary, if
at the context switch, we do not want to lose the wisdom gained by the recent page references in a process.
We have learnt in the previous section how the PMTs can be maintained. In
the pure software or hybrid methods, the entire PMT has to be kept in the main memory for each process,
wherein PMTBR gives the starting memory address of the PMT in the main memory. A practical problem in
this scheme is the size of the PMTs. Today, due to the larger available virtual address space, the sizes of the
PMTs are exceedingly large. For instance, if the virtual address is of 32 bits, the program size can be 2^32 bytes. If the page size is 2 KB, the Operating System will have to reserve 11 bits for the offset because 2^11 is 2K. But this would mean that each process could have 2^21 or 2097152 pages (because, 32 – 11 = 21). If one PMT entry
consumes 1 word, it would require 2 MB of main memory to hold only 1 PMT for 1 process! Even if every
process does not require all the pages, the PMT size could still be very large. A solution must be found to
this problem.
A solution is to have Multilevel Page Tables. In this scheme, the virtual address is divided into multiple
parts as shown in Fig. 9.25(a). The figure shows three such parts. P1 of 10 bits, P2 of 11 bits and offset (D)
of 11 bits.
As the offset is of 11 bits, the page size is still 2 KB. Essentially, the big PMT with 2^21 entries is broken up into 2^10 or 1024 small PMTs (0–1023), with each PMT having 2^11 or 2048 (0–2047) entries. P1, which has
10 bits is used as an index into the first level of page tables to choose the appropriate small PMT. P2, which
has 11 bits then is used as an index to select appropriate entry within that PMT. That entry will give the page
frame number (F) corresponding to the virtual page number (P1 + P2). The offset or displacement (D) can
now be concatenated to this page frame number (F) to form the final resultant physical address. This address
is then put on the address bus for accessing the desired memory locations.
The Address Translation in this scheme is carried out with the help of hardware, though it is a bit more
complicated because of the levels involved. For instance, the hardware will have to separate the bits for P1
first, use them as an index to access the correct entry of the first level page table, and so on.
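A sketch of only this address-splitting step, assuming the 10/11/11 bit division of Fig. 9.25(a), is given below in C; the sample address is, of course, arbitrary.

#include <stdio.h>

/* Splitting a 32-bit virtual address as in Fig. 9.25(a):
   P1 = 10 bits, P2 = 11 bits, offset D = 11 bits (2 KB pages). */
#define D_BITS  11
#define P2_BITS 11

void split(unsigned long vaddr)
{
    unsigned long d  = vaddr & ((1UL << D_BITS) - 1);
    unsigned long p2 = (vaddr >> D_BITS) & ((1UL << P2_BITS) - 1);
    unsigned long p1 = vaddr >> (D_BITS + P2_BITS);
    printf("P1 = %lu, P2 = %lu, D = %lu\n", p1, p2, d);
    /* P1 selects one of the 1024 second-level PMTs, P2 selects the entry
       within it (giving the frame F), and D is concatenated to F. */
}

int main(void)
{
    split(0x00C01802UL);   /* prints P1 = 3, P2 = 3, D = 2 */
    return 0;
}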
The main advantage of this scheme is that all the page tables need not be kept in the main memory.
Consider a process needing 12 MB of memory where the bottom 4 MB of memory is for the text, the next
4 MB is for data and the top 4 MB is for the stack. In this scenario, there is a large gap or hole in between
the data and the stack. Normally in such a scenario, a very large PMT would have been necessary to be kept
in memory. But, due to the multilevel page tables, only one first level page table and a few second level page
tables are needed to be kept in the main memory.
This discussion is only an example. In practice, there are different levels of paging that are possible.
The PDP-11, for instance, uses one-level paging, the VAX has two-level paging, the SUN SPARC has three-level paging and the M68030 has four-level paging. In fact, the number of paging levels in the M68030 is
programmable. The levels can be 0 to 4 and the Operating System controls these levels. The virtual address
space is divided into as many parts as there are levels. An interesting idea used in M68030 is that the number
of bits reserved in the virtual address for each level is also programmable. This complicates the hardware and
algorithms involved in Address Translation, but it provides a lot of flexibility to the whole operation.
Segmentation and paging share a lot of common principles of operations, excepting that pages are physical
in nature and hence, are of fixed size, whereas segments are logical divisions of a program and hence, are
normally of variable sizes.
For instance, each program in its executable form can be considered to be consisting of three major
segments: code, data and stack. Each of these can be divided into further segments. For example, in a program,
you normally have a main program and some subprograms. These can be treated as separate segments. A
program can use various functions such as "SQRT". In this case, a routine for "SQRT" is prewritten and
precompiled. This becomes yet another segment.
Let us say that we have a program with the segments as shown in Fig. 9.28. The figure shows various
segments along with their sizes which are obviously found out at the time of compilation, and stored in the
executable file for future use.
Each segment is compiled with respect to 0 as the starting address. The segment numbers and their sizes
are shown in the table within the figure.
An application programmer does not necessarily have to declare different segments in his program
explicitly. If the programmer defines different overlays or segments explicitly, it is fine, but otherwise the
compiler/linkage editor does it on its own. In any program, it is not very difficult to recognize the main
program, the "called" subprograms, the common routines, the data area and the stack area. The compiler/
linkage editor thus recognizes these logical units, treats each of them as a separate segment, numbers the segments, and records their sizes in the header of the executable file.
Like in paging systems, the address in this case consists of a segment number (S) and an offset (D) within
that segment. Hence, this is also a two-dimensional address. Within one segment, the address of any location
is computed with respect to the beginning of the segment, and hence, is a virtual address. Therefore, we face
the same problem as discussed in paging. Should the compiler generate a one or two-dimensional address?
And how should it be used later in Address Translation? Let us study this a little closely. When you put all
these segments conceptually (or virtually!) one after the other, the following picture emerges (Fig. 9.29).
Now, a virtual address 1100 in this program will mean segment 1, displacement = 100 i.e. S = 1, D =
100. (Address 1000 means S=1, D=0, address 1001 means S=1, D=1, and so on.) Similarly, virtual address
3002 will mean S = 3, D = 502. These are the two dimensions or components of the address. If you specify
segment number (S) and displacement (D) within it, you have specified a complete address. In segmentation,
the compiler itself has to generate a two-dimensional address. This is different from the address on paging.
In paging, a single-dimensional virtual address and a two-dimensional address would be exactly the same in
the binary form as the page size is an exact power of 2. In segmentation, this is not so. The segment size is
unpredictable. That is why you need to express an address explicitly in a two-dimensional form.
Hence, if the Operating System uses segmentation as a philosophy for Memory Management, an executable
image of a program has to have a two-dimensional address (S,D) in any program instruction. This means that
in segmentation, you need a different address format
and a different architecture to decode that address.
For instance, in PDP-11/45, an address consists of
16 bits, of which 13 bits (bits 0–12) are reserved for
displacement (D) and 3 bits (bits 13–15) are reserved
for segment number (S). This is shown in Fig. 9.30.
Therefore, a PDP-11/45 can have 8 segments
corresponding to 3 bits for S, where each segment can be of maximum 8KB, corresponding to 13 bits for D.
Burroughs B5500 uses a 16 bit address, but uses 5 bits for segment number and 11 for displacement. Thus,
it can have 32 segments, each of maximum 2 KB. A computer GE645 used for MULTICS system had 256
segments, each of maximum 64K words. When all the segments are compiled, they are compiled with 0 as
the starting address for all of them. This address essentially is D. At this juncture, only D is important, S is
not even known because segments are logical entities which can be compiled separately and they may be
used in some programs and may not be used in some others. Again, what segment numbers will be assigned
to them at the time of linking different programs is not known in advance. At the time of linking, the linker
numbers the segments used in that program. It then changes all the addresses to two-dimensional addresses
by inserting the appropriate segment numbers in all the address fields in the instruction. As discussed earlier,
the address in the binary executable program itself is a two-dimensional address in segmentation, unlike in
paging.
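For the PDP-11/45 style address format of Fig. 9.30, the separation of S and D can be sketched as follows in C; the sample address is arbitrary.

#include <stdio.h>

/* Unpacking a 16-bit PDP-11/45 style address (Fig. 9.30):
   bits 13-15 give the segment number S, bits 0-12 the displacement D. */
void split(unsigned short addr)
{
    unsigned s = addr >> 13;          /* 3 bits  : 8 segments             */
    unsigned d = addr & 0x1FFF;       /* 13 bits : up to 8 KB per segment */
    printf("S = %u, D = %u\n", s, d);
}

int main(void)
{
    split(0x6010);                    /* prints S = 3, D = 16 */
    return 0;
}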
The steps followed during the initiation of a process are traced below:
(i) At any time, there is possibly some physical memory which is free and allocable. New processes are
loaded after being allocated some of it. At the same time, some old processes are terminated, releasing
or freeing some memory. The Operating System has to keep track of the chunks of free memory at
any time. It will have to know their sizes and starting addresses. There are several algorithms to
keep track of those. These algorithms using data structures such as bit maps, linked lists or tables are
similar to the ones used in the scheme of variable partitions. There are again the same considerations
for coalescing and compaction. We have already covered these and will not repeat them here.
(ii) At any time, when a new process is to be created, the Process Management (PM) module talks to
the Information Management (IM) module and accesses the executable image for that program. The
header of this executable file gives the information about the number of segments and their sizes in
that program. This is passed on to the Memory Management (MM) module.
(iii) The Memory Management (MM) now consults the information about the free memory as given in (i)
above and allocates it for different segments. It can use a number of algorithms such as first fit, best
fit, worst fit, etc. to carry this out.
(iv) After the memory is allocated, it builds a 'Segment Map Table (SMT)' or 'Segment Descriptor Table (SDT)', as shown in Fig. 9.31. The SMT contains one entry for each segment, giving the segment number, the segment size, the base (i.e. the starting physical address allocated to that segment) and the access rights for that segment. The segment number column is shown only for readability, but actually it need not be maintained. This is because the size of each entry in the SMT is fixed. The Operating System, therefore, can access the entry for any segment directly, given its segment number. The access rights are used for protection and sharing, as we shall see later.
If some segments are rolled out on the disk from the memory, relocation and the process of bringing it back
to the main memory is fairly straightforward. You really do not need to change anything except that if a rolled
in segment occupies a different physical location, the SMT is appropriately updated.
Again, in systems without virtual memory support, all segments belonging to a process (excepting the
shared ones) are swapped out and swapped in as necessary depending upon the philosophy used by the
process scheduler and based on the process priority. When a process is swapped out, the memory is freed and
is added to the pool of free memory. The ideas of compaction and coalescing apply here too!
Let us study the process of Address Translation and relocation with an example. The SMT is now used for
Address Translation as shown in Fig. 9.32. The fields in the SMT are shown differently sequenced in the
figure, to avoid cluttering. We will assume that the virtual address to be translated is 2520 consisting of two
components: S and D. We assume the same segments as shown in Fig. 9.28 onwards. Hence, we can easily
find out that S = 3 and D = 20 for this virtual address 2520. The address as it exists in the Instruction register
(IR) itself consists of these two parts. The steps followed for Address Translation are as given below:
(i) The high order bits representing S are taken out from the IR to form the input to the SMT as an index.
For instance, the entry for S = 3 in the SMT can be directly accessed as shown in Fig. 9.32.
(ii) The Operating System now accesses the data from that entry (in this case, the one with S = 3). Hence, it knows the size, the base (B) and the access rights for that segment.
(iii) We know that the displacement (D) is a virtual address within that segment. Hence, it has to be less than the size of that segment. If it is not, the hardware itself generates an error for an illegal address. This is evidently a protection requirement.
(iv) If the displacement (i.e. D) is legal, then the Operating System checks the access rights and ensures
that these are not violated. We will discuss the access rights under "protection and sharing", later.
(v) Now, the effective address is calculated as (B+D), as shown in the figure. This will be computed as
6100 + 20 = 6120 in our example.
(vi) This address is used as the actual address and is pushed into the address bus to access the physical
memory.
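These steps can be put into a short C sketch, as shown below. Only the base of segment 3 (6100) comes from Fig. 9.32; the other sizes, bases and access rights in this SMT are assumed purely for illustration (they are chosen to be consistent with the virtual layout of Fig. 9.29).

#include <stdio.h>

struct smt_entry {
    int size;                 /* segment size (limit)          */
    int base;                 /* starting physical address (B) */
    char access[3];           /* e.g. "RO" or "RW"             */
};

/* Hypothetical SMT for the process of Fig. 9.32; only the base of
   segment 3 (6100) is taken from the figure, the rest is assumed. */
static struct smt_entry smt[] = {
    { 1000, 4000, "RW" },
    { 1000, 8000, "RW" },
    {  500, 1000, "RO" },
    {  600, 6100, "RW" },
};

long translate(int s, int d)
{
    if (s < 0 || s > 3)
        return -1;            /* S validated against SMTLR             */
    if (d >= smt[s].size)
        return -2;            /* illegal address: protection violation */
    /* access rights (smt[s].access) would also be checked here (step iv) */
    return smt[s].base + d;   /* effective address = B + D             */
}

int main(void)
{
    printf("%ld\n", translate(3, 20));   /* 6120, as in the example */
    return 0;
}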
Various problems and their solutions in this scheme are presented briefly, as they have been already discussed
in detail in earlier sections.
(a) Each process has one SMT. Thus, the Operating System will have to reserve a large memory space merely
to store all these SMTs. This also means that when a process is 'dispatched', we need to start referring to the
correct SMT. This is facilitated by the 'Segment Map Table base register (SMTBR)' similar to the PMTBR
in paging systems. Similarly, a logical segment number (S) can be validated against another 'Segment Map
Table Limit register (SMTLR)' before getting into the SMT as an additional protection. The SMTLR stores
the maximum number of segments in a process. The compiler keeps this information in the header of the
executable file for that program. At the time a process is created, this number is copied into its PCB. When
the process is dispatched, the hardware registers SMTLR and SMTBR are loaded appropriately at the context
switch. Hence, Fig. 9.32 will need a modification as shown in Fig. 9.33.
(b) These two registers SMTBR and SMTLR can be restored from the PCB at the time of the context
switch as discussed above. SMTLR tells you the maximum allowable segment number(S) in a process.
Hence, it is ensured that S <= SMTLR. SMTBR tells you the starting word address of the SMT for that process. Thus, assuming that one SMT entry occupies one word, (SMTBR+S) gives you directly the address of the relevant SMT entry for the desired segment number. Now you can access that entry and use it for Address Translation as discussed above.
(c) The SMT size depends upon the maximum number of segments allowed in a process. If this number
is very small, as in PDP-11/45, where it is 8, all the SMT can be held in the hardware registers
themselves to improve the speed. If the SMT is very large, it can be held in the main memory. This
will require two memory references, thus, reducing the 'memory bandwidth' by half. A compromise,
again as before, is to have only a few hardware registers to get an 80–90 % hit ratio, at only a little
higher cost. Again, which SMT entries are to be kept in these registers is governed by algorithms
such as FIFO, LRU etc., discussed later while dealing with 'page replacement' under virtual memory
management systems.
One approach as depicted in Fig. 9.34 is to have only three sets of registers meant for code, data and stack
respectively. For each of these, the system maintains the size, base and access rights registers. Thus, there
are 9 registers in all, containing the entries from the SMT table for the latest references. As we know, there
can be multiple segments for code, multiple segments for data and so on. The data regarding size, base and
access rights about the code segment which is referenced latest is kept in the corresponding three registers
meant for the code.
For instance, if the program jumps to a location which is in a separate code segment, the data from the
SMT entry for that segment is copied into the three registers for the code. A similar logic is used for data as
well as stack. Therefore, if a program keeps referring to the same code, data and stack segments for a while
(which most programs will do for some time), then these registers will be unchanged for that duration. Any
time a new reference is made to any new segment, the corresponding registers are updated.
For this scheme to work, the Operating System needs one additional hardware support. The CPU should
indicate whether the current reference belongs to code, data or stack as shown in Fig. 9.34. Depending on
this, the appropriate register is chosen and then the Address Translation is carried out as before. This provides
a good compromise in cost and speed. GE645 computer on which MULTICS was run followed this scheme,
where there were four registers: code, data, stack and an extra segment.
After the hardware declared whether the address to be translated belonged to the code, data or the stack
segment, it also checked whether that segment entry was present in the corresponding register. If not, the
entry was copied from the SMT. In either case, the Address Translation then continued as discussed earlier
and also depicted in Fig. 9.34.
Sharing will now be discussed, followed by protection in segmentation. Sharing of different segments is
fairly simple and is quite similar to that in the paging systems. We again illustrate this by an editor program
shown in Fig. 9.35. Assume that this editor has two code segments and one data segment.
Also assume that the code is written to be reentrant, and hence, can be shared. If this editor is to be used by
two users simultaneously, the two SMTs must map these two code segments onto the same physical locations
as shown in the figure. It will be noticed that the size and base values for code segments 0 and 1 are the same
in both SMTs.
It will also be noticed that only the data segments are differently mapped for obvious reasons. (Each
program will work with its own records and data.)
As in paging, when one of the processes terminates (say, user A finishes his editing), only the data segment
portion i.e. locations 2000 to 2999 are freed and added to the free memory pool. The code segments are
released only when all the processes using that editor have terminated. The Operating System has to keep
some kind of usage count for this in the same way as was seen earlier in the Basic File Directory (BFD) of
the File Systems.
In fact, sometimes we want only certain segments of a program to be shared (e.g. a common routine such
as SQRT - to find out the square root). This poses a problem. All addresses in segmentation have to specify
the segment number (S) and displacement (D). If in SQRT routine itself, there is a reference to a location in
the same routine (say, Jump to an address within the same routine), how should that address be generated?
What segment number should be assumed? If this precompiled routine is added to program-A as segment 4
and to program-B as segment 5 at the time of linking, how can the scheme of Address Translation work, if
only one copy of SQRT in the physical memory is to be shared by both A and B?
One solution to this problem is to use only PC-relative addressing in sharable segments. A more general
scheme is to use as many registers as there are segments, and use relative addressing to those registers.
This has repercussions on the computer architecture, the compiler and the Operating System. MULTICS
implemented on GE645 used this scheme. A detailed discussion of this is, however, beyond the scope of this
text.
Protection is achieved by defining access rights to each segment such as 'Read Only (RO)' or 'Read/
Write (RW)' etc. and by defining certain bit codes to denote those e.g. 01 = RO, 10 = RW. These are the
access rights bits in any SMT entry. When a user logs on and wants to execute a process, the Operating
System sets up these bits in the SMT entries while creating the SMT, depending upon the user privileges. As
discussed before, if there are 4 processes sharing a segment, all the 4 SMTs for those 4 processes will have entries with the same base and size values for that shared segment, but the access rights can be different in different SMTs for different sharing processes. Because of this scheme,
the shared segments also can be well protected. For instance, a shared segment can have only access rights
"RO" for process-A and can have "RW" for process-B.
Protection in terms of restricting the accesses to only the address space of that process is achieved during Address Translation itself, by validating the segment number against SMTLR and the displacement against the segment size, as discussed above.
The combined systems are systems which combine segmentation and paging. This requires a
three-dimensional address consisting of a segment number (S), a page number (P) and displacement
(D). There are two possible schemes in these systems:
In 'segmented paging', the virtual address space of a program is divided into a number of logical segments of varying sizes. Each segment, in turn, is divided into a number of pages of the same size.
In either of the above cases, two memory accesses are required for the final Address Translation, thus slowing down the process. The resulting reduction of the effective memory bandwidth by two-thirds may be
too high a price to pay. In such a case, the hardware support for Address Translation in terms of associative
memory assumes a lot of importance.
In both the cases, both the Segment Map Tables (SMT) and Page Map Tables (PMT) are required.
Of these two schemes, segmented paging is more popular. Figure 9.36 illustrates its principles of operation.
The tables required for this scheme are organized as follows:
(a) The program consists of various segments, as given by the SMT. The SMT contains different
entries - one for each segment. Each segment is divided into a number of pages of equal size, whose
information is maintained in a separate PMT. Thus, there are as many PMTs as there are entries in the
SMT. If a process has 4 segments 0 – 3, there will be 4 PMTs for that process, one for each segment.
(The figure shows only one of them to avoid cluttering.)
(b) The interpretations of various fields in the SMT in this combined scheme are different from those discussed earlier. For instance, the size field in the SMT gives the maximum page number in that segment (the pages being numbered from 0), instead of the size of the segment itself. If a segment consists of pages 0 to 5, the size field for that segment will be maintained as 5. This is shown in Fig. 9.36. Similarly, the base (B) in the
SMT now gives the starting word address of the PMT for that segment in the memory, instead of the
starting address of the segment itself in the physical memory. Therefore, assuming again that one
PMT entry can be accommodated in one word, the address of the entry in the PMT for the desired
page (P) in a given segment (S) can be obtained as (B+P), where B is obtained from the entry in the SMT for that segment (segment number = S). Using this address (B+P) as an index into the PMT, the page frame (F) can be obtained. Finally, the physical address can be obtained by concatenating
F to D.
The scheme now works as follows:
(i) The virtual or logical address coming from the instruction consists of three components or dimensions:
Segment number (S), page number (P) and displacement (D) as shown in the Fig. 9.36.
(ii) The system has a pair of registers viz. SMTLR (Limit register) and SMTBR (Base register). Their
values will be different for the SMT of each process. Therefore, their values are maintained in the
respective PCBs and are restored at the context switch.
(iii) The segment number (S) is validated against SMTLR (a protection requirement) and then added to
SMTBR to form an index into SMT.
(iv) The proper SMT entry is picked up and then the page number (P) is validated against the size field of that entry (i.e. the maximum page number in that segment).
(v) Now access rights are validated against the ones in the chosen SMT entry. For instance, if the current
instruction is for writing into that segment and the access rights for that user process as mentioned in
the SMT are "RO (i.e. Read Only)", then an error results.
(vi) If everything is OK, the base (B) of that SMT entry is added to P to directly index into the PMT for
that segment. (base in the SMT in this case means the beginning word address of the PMT for that
segment.) When you add B to P, you go into the PMT entry directly for the required page within the
desired segment.
(vii) The page frame number (F) is extracted from the selected PMT entry.
(viii) The displacement (D) is concatenated to this page frame number (F) to get the actual physical address.
This completes the Address Translation process.
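The above steps can be put together in an illustrative C sketch as follows; every table value in it is assumed, and only the structure of the translation follows the scheme described above.

#include <stdio.h>

#define PAGE_SIZE 2048

struct smt_entry {
    int max_page;     /* 'size': highest page number in the segment */
    int pmt_base;     /* 'base': where this segment's PMT begins (B) */
    char access[3];   /* access rights, e.g. "RO" or "RW"            */
};

/* Hypothetical tables in the spirit of Fig. 9.36: all values are assumed. */
static struct smt_entry smt[2] = {
    { 5, 0, "RW" },   /* segment 0: pages 0-5, PMT starts at pmt[0] */
    { 2, 6, "RO" },   /* segment 1: pages 0-2, PMT starts at pmt[6] */
};
static int pmt[9] = { 11, 4, 7, 20, 13, 2,    /* PMT for segment 0 */
                      9, 17, 5 };             /* PMT for segment 1 */

long translate(int s, int p, int d)
{
    if (s < 0 || s > 1)        return -1;   /* S validated against SMTLR  */
    if (p > smt[s].max_page)   return -2;   /* P validated against 'size' */
    /* access rights (smt[s].access) would also be checked here (step v)  */
    int f = pmt[smt[s].pmt_base + p];       /* (B + P) indexes into PMT   */
    return (long)f * PAGE_SIZE + d;         /* concatenate F and D        */
}

int main(void)
{
    printf("%ld\n", translate(1, 2, 10));   /* frame 5 -> 5*2048+10 = 10250 */
    return 0;
}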
M68000 implements this scheme in a different fashion. A detailed discussion of this is, however, beyond
the scope of the current text.
Upto now, all the systems that we have considered (contiguous or noncontiguous) were based
on the assumption that the entire process image was in the main memory at the time of execution. Due to low
priority, if a process had to be removed temporarily from the main memory, the entire process image was
swapped out. When it was reactivated, the entire process image was swapped in.
This scheme is simple to implement but it has one major drawback. If the physical memory is limited
(which is the case normally), then the number of processes it can hold at any time, and hence, the degree of
multiprogramming becomes limited. Also, the entire process image needs to be swapped, thus decreasing
efficiency.
What will happen if we can keep only a part of the process image in the memory and the other part on the
disk and still are able to execute it? If we can do this, we have a 'virtual memory system'. Virtual memory
systems also can be implemented using paging, segmentation or combined schemes. Thus, all that we learnt
upto now is still valid, with only one difference. The process can start executing with only part of the process
image in the memory. For our discussions and examples, we will consider virtual memory systems using only
paging because this is most common. In this case, a program consists of a number of logical or virtual pages
which are loaded into specific page frames. This is as discussed earlier in "paging". A question now arises : If
a page is not loaded in the memory and a location within that page is referenced, what will happen?
The idea is that when a page not currently in the memory is referenced, only that page can be brought
from the disk into the memory. This, of course, is more time consuming, but the trade off has to be considered
between this extra time or cost and the degree of multiprogramming that can result in increased throughput.
The idea is to bring in only zero, one or a few pages of a process to begin with, and continue as long as these
pages suffice. As soon as a reference is made outside these pages, the required page is located on the disk and
brought into the memory.
If there is no page frame free in the physical memory to accommodate this new page, the Operating
System may have to overwrite an existing page in the physical memory. Which page should be overwritten
? This is governed by 'page replacement policy' . However, if this page had been modified after last being
brought in from the disk, the Operating System could not just overwrite it. It would lose all the updates
otherwise. If the same page is later brought in from the disk, it will not serve the purpose. In this case, the
Operating System has to copy this 'dirty page' back on the disk first before overwriting the new one onto the
page frame holding that page.
In the case of virtual memory systems, a programmer can write a very big program. The size is restricted
only by the number of bits reserved for the address in the machine instruction (or Memory Address Register
(MAR) width), e.g. in a VAX computer, the maximum size of the program can be 4 GB (4 Gigabytes = 4
×1024 Megabytes = 4×1024×1024 Kilobytes = 4×1024×1024×1024 Bytes = 2 to the power of 32 bytes
corresponding to 32 bits in the MAR). However, many VAX computers support a maximum of 8 MB of
physical memory which is far less than 4 GB. In such a case, you can still run a program of size bigger than
8 MB, but smaller than 4 GB, because the system supports virtual memory. This reduces or almost removes
the size restrictions which the programmers used to face in earlier days.
From the discussion upto now, it is clear that the Operating System requires
the following data structures to implement the virtual memory management scheme with paging.
Page Map Table (PMT) is as discussed earlier in the paging system except that
it has one more field to indicate the presence of a page in the physical memory. This indicator can take values
: Inside (IN) or Outside (OUT). As there are only two states, the field length can be only 1 bit (IN = 1, OUT
= 0). As a new page is referenced by any program, the Operating System checks the presence bit and causes
a page fault, if the presence bit = 0 (i.e. OUT).
The page fault procedure brings a new page in (Fig. 9.37),
whereupon this bit for that page in the PMT is set to 1 (i.e.
“IN”).
One more field, called the dirty bit as discussed earlier, is maintained for each page frame. Whenever any write
instruction is executed involving that page frame, this bit is set to 1, indicating that the page frame is dirty.
Prior to that, it is 0. This is done directly by the hardware on encountering the opcode for the write instruction,
and therefore, it is very fast.
This field is used by the Operating System while overwriting a page. Due to the lack of physical memory,
if the Operating System has to remove a page, it removes a page suggested by the 'page replacement' or
'page removal' algorithm which we have yet to study. At that time, if this dirty bit = 1 for a page frame, that
page is copied onto the disk first, before overwriting a new page onto it. When a new page is overwritten onto
it, this bit is set to 0 again, indicating that the copy of that page in the memory is the same as the one on the
disk.
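The bookkeeping described above can be pictured with a small C sketch. The structure, field and function names below are our own illustrative assumptions, not taken from any particular Operating System; the sketch only mimics what the hardware and the Operating System would do with the presence and dirty bits.

    #include <stdio.h>

    /* Illustrative PMT entry: one per virtual page of a process. */
    struct pmt_entry {
        unsigned int frame;        /* page frame number, valid only if present = 1 */
        unsigned int present : 1;  /* 1 = IN (page is in physical memory), 0 = OUT */
        unsigned int dirty   : 1;  /* 1 = frame modified since it was brought in   */
    };

    /* Sketch of what happens on a memory reference to virtual page p. */
    void reference_page(struct pmt_entry *pmt, int p, int is_write)
    {
        if (!pmt[p].present) {
            /* Page fault: the Operating System would locate the page on the
               disk, load it into a free or replaced frame, and then set
               present = 1. Shown here only as a message.                    */
            printf("page fault on virtual page %d\n", p);
            pmt[p].present = 1;      /* after the page is brought in           */
            pmt[p].dirty   = 0;      /* memory copy now matches the disk copy  */
        }
        if (is_write)
            pmt[p].dirty = 1;        /* in reality set by hardware on a write  */
    }

    int main(void)
    {
        struct pmt_entry pmt[4] = {{0}};
        reference_page(pmt, 2, 0);   /* first reference: causes a page fault   */
        reference_page(pmt, 2, 1);   /* a write: marks the frame as dirty      */
        printf("page 2: present=%d dirty=%d\n", pmt[2].present, pmt[2].dirty);
        return 0;
    }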
We have seen before that a dirty page is written onto the disk. But where is it to
be written? It cannot be written at the place in the executable file from where it was originally read; otherwise
in all the subsequent executions of the same program, you will get a modified copy to begin with. Thus, the
Operating System creates a swap area for each process on the disk which is normally contiguous to achieve
higher speed.
After the process is loaded in the memory, the Operating System creates a File Map Table (FMT) for
each process as shown in Fig. 9.38. Initially, the address on the disk in this table corresponds to the address
from where the page was read. If a dirty page is copied onto the swap area on the disk, the FMT entry for that page
starts pointing to this new address on the disk. In essence, if a page is to be brought into the main memory,
the corresponding FMT entry should give the correct address of the latest copy of that page on the disk at
that time. Using this, a request is made to the Information Management module, which locates and reads the
desired page into the memory. If the page size is different from the block size used by Information Management
to read a chunk of data at a time, it adds to the complexity, involving one more level of translation.
One could create the FMT as an extension of PMT, if one column, ‘Address
on the disk’, is added to the PMT as shown in Fig. 9.37. This is because, in
both the tables shown in Figs. 9.37 and 9.38, the (virtual) Page # is the "key".
The Page # field shown in both the tables need not actually be kept in either of
the tables. This is because logical or virtual page number takes definite values
of 0, 1, 2, etc. Therefore, given the page number (p), the starting address of the
table and the length of each table entry, it is possible to position directly at the
desired entry. There is no "table search" needed. However, it is shown here for
our convenience in understanding and explaining.
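The "no table search" remark can be shown with a tiny C sketch. The combined PMT/FMT row below is an assumption made purely for illustration.

    #include <stdio.h>

    struct entry { int frame; int disk_addr; };   /* illustrative PMT+FMT row */

    int main(void)
    {
        struct entry table[8] = {{0}};
        int p = 5;                                 /* virtual page number      */
        /* No search: the entry sits at start + p * sizeof(struct entry). */
        struct entry *e = &table[p];
        printf("entry for page %d is at offset %zu bytes from the table start\n",
               p, (size_t)p * sizeof(struct entry));
        e->frame = 3;                              /* fill in, for demonstration */
        return 0;
    }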
In virtual memory systems, the full process image is rarely rolled out. Hence, the use of the word 'swapping'
is not appropriate in this context, as it implies rolling in/out of full process images.
This section discusses the principles of bringing in the pages of the process (fetch policies) and the
principles of removing pages from the main memory to the disk (page replacement policies).
Fetch policy deals with when to bring a page in the main memory. There are basically two
philosophies in this regard. These are: Demand Paging and Working Set. We will now consider these one by
one.
This policy dictates that a page is brought into the memory from the disk only when it is demanded.
This was discussed earlier, and it is the most commonly used method. In this method, theoretically you
can start the process with no page in the main memory. When the CPU tries to fetch the first instruction itself
(leave alone execute it), a page fault occurs and that page is brought in by the procedure outlined in the
previous section. Thereafter, pages go on building up in the main memory with an increasing number of page faults.
But this is expensive. Typically, reading in a page from the disk is equivalent to executing 20000 CPU
bound instructions. If all the page faults occurred only initially, and all the required pages then remained in the main
memory until the process terminated, this overhead, though large, would be understandable.
But if the full process image and/or its parts are continuously rolled in/out in a time sharing environment
(depending upon the scheduling algorithm), and every time we have to build up these pages starting from 0
again at each context switch, it certainly becomes unacceptable.
The working set method tries to plan ahead and to keep in the memory the
pages which are used more often. How does the Operating System do it? The Operating System uses the
principle of locality of reference and tries to build the working set for each process.
A working set is a set of the most recently referenced pages of a process. In the beginning, this working set
is very small. With the passage of time, the number of pages referenced often, and therefore the size of the
working set, increases steeply as shown in Fig. 9.41.
An interesting property of this curve is that beyond a certain number of page references, the size of the
working set remains more or less constant, i.e. slope of the curve becomes almost 0. At that point in time, we
have a more or less stable working set and the most often referenced pages are already in the main memory.
The page faults have dropped almost to 0. An idea could be to store somewhere the details (i.e. page numbers)
of this stable working set during the execution of the process, so that even if the process is blocked and rolled
out, while restarting it, the Operating System could bring in the entire working set unlike in demand paging.
This can improve the performance substantially, because you do not have to start building up from 0 page
again!
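The working set idea can be made concrete with a small C sketch that slides a window of the most recent references over a reference string and prints the set of distinct pages seen in that window. The reference string, the window size and the assumption that pages are numbered 0 to 15 are all our own, chosen only for illustration.

    #include <stdio.h>

    #define WINDOW 4   /* how many most recent references define the working set */

    int main(void)
    {
        /* An assumed page reference string, for illustration only. */
        int refs[] = {8, 1, 2, 3, 1, 4, 1, 5, 3, 4, 1, 4, 3, 2, 3};
        int n = sizeof(refs) / sizeof(refs[0]);

        for (int t = WINDOW - 1; t < n; t++) {
            int seen[16] = {0};               /* pages assumed numbered 0..15 */
            printf("after reference %2d, working set = {", t + 1);
            for (int k = t - WINDOW + 1; k <= t; k++) {  /* last WINDOW refs */
                if (!seen[refs[k]]) {
                    seen[refs[k]] = 1;
                    printf(" %d", refs[k]);
                }
            }
            printf(" }\n");
        }
        return 0;
    }

Notice in the output how, once the process settles into a phase, the printed set changes very little from one reference to the next, which is exactly the stable working set described above.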
Generally, the working set philosophy may appear to be far better. But this is not always true. As discussed
earlier, the working set changes drastically during different phases of a program execution. (e.g. housekeeping,
main routine, EOJ etc.) At the time of the phase changeover, the working set method is far more expensive
than demand paging. This is because, when the program phase changes, many of the old pages become
unnecessary, thus wasting memory. At that time, many of the pages required in the new phase do not exist
in the main memory, thus causing page faults. The merits/demerits of both of these methods are still being
widely debated.
As discussed earlier, if all the page frames are occupied when a new page
has to be brought in due to a page fault, the Operating System has to choose a candidate page frame to be
overwritten. If this page is necessarily chosen from the same process, it is called ‘local replacement policy’.
If it can be from any process, it is called ‘global replacement policy’. Theoretically, in the latter case, when
a page of Process A has to be brought in, the Operating System should scan through all the page frames of
all the processes and select a page frame to be overwritten. This could very well be a page frame currently
allocated to another process - say, Process D. This could be because Process D could be the one with the least
priority, and within it, that specific page frame may be the least recently used. Throwing that page frame away
is obviously much better than throwing any other page frame from any other process including Process A.
All this discussion may be theoretically correct, but to implement this global replacement policy, the
Operating System will require very complicated system-wide logic capable of carrying out a comparative
analysis of references to various pages across different processes with different priorities. For instance, it may
have to solve questions such as: is it better to throw out a less frequently used page from a high priority process
than a more frequently used page from a low priority process, and so on. That is why choosing
a page frame for replacement from only that process is simpler. We will assume this local replacement policy
for our discussions. There are several algorithms to achieve this as listed below:
The fourth page reference is for page 3. This page is not in the memory, and no page frame is free either. Thus,
a page removal is necessary. The Operating System has the choice of throwing out page 8, page 1 or page 2.
You will notice that the algorithm threw out page number 8 residing in page frame 0. Why? If you study the
page reference list, you will realize that page 8 is going to be used far later than page 1 or page 2, i.e. pages 1
and 2 are going to be required much earlier. This is exactly the idea behind Belady's OPT page replacement algorithm.
You should trace the figure through all the page references to understand this completely. There are in all
9 page faults. This is obviously the best scheme, but it is not practicable as it requires predictions, and the
Operating System is not exactly in the business of astrology. This algorithm, therefore, is used only as a yardstick
to measure the performance of other methods.
As the name suggests, in this scheme, the page that is removed from the
memory is the one that entered it first. Assuming the same page reference string as before, we present in Fig.
9.43, the states of various page frames and page faults after each page reference. As is clear from the figure,
15 page faults result.
This algorithm is easy to understand and program. The first three columns are self-explanatory. In the fourth
reference of page 3, a page fault results, and the FIFO algorithm throws out page 8 because it was the first one
to be brought in. (It is a coincidence that OPT also would decide on the same.) The fifth page reference does
not cause a page fault in either the OPT or the FIFO algorithm, as page 1 is already in the memory. The sixth page
reference is for page 4. Here, FIFO will throw out page 1, because it came in earlier than the remaining two
pages at that time, viz. page 3 and page 2; page 1 has been there the longest. Notice that OPT had chosen page
2 for throwing out. The reason is that it "knew" in advance that page 1 is going to be required sooner. The FIFO
policy does not "know" this. Hence, in just the next, i.e. the seventh, page reference, page 1 is required, causing
yet another page fault. Following this logic, the table can be completed.
This algorithm can be implemented using a ‘FIFO Queue’. The queue is nothing but a pointer chain
where the header of the chain is the page that came in first, and the end of the chain is the page that came in
last. Thus, before the eighth reference of page 5 (i.e. after the seventh one for page 1), the queue would be as
shown in Fig. 9.44. The figure shows pages 3, 4 and 1 in page frames 0, 1 and 2 with a specific logic. Out
of the pages 3, 4 and 1 in the memory at that time, page 3 had come into the memory the earliest (in the fourth
reference). Page 4 had come into the memory in the sixth page reference; and page 1 had come in the seventh
page reference. Notice that page 1 had also come into the memory in the second page reference, but it had been
evicted in the sixth page reference. Hence, that is of no consequence.
After the eighth reference of page 5, the FIFO queue would be as shown in Fig. 9.45.
In this case, page 3, which has been in the memory the longest will be thrown out. This page is given by
the first entry pointed to by the header of the chain.
Hence, page removal really means: access the first entry in the chain, get the page number and the page
frame number, and throw that page out. After this, the head of the chain points to the next entry. This is how the
second entry in Fig. 9.44 becomes the first entry in Fig. 9.45. Similarly, the third entry in Fig. 9.44 becomes
the second entry in Fig. 9.45, and so on. The last entry in Fig. 9.45 is for the page brought into the memory
last. This is how the pointers in the chain are adjusted. We have shown only a one-way chain in the figure. In
practice, for better data recovery, it can be implemented using a two-way chain.
The FIFO algorithm has one anomaly known as 'Belady's anomaly', named after the one who discovered
it first. We will now study it.
Normally, if the physical memory size and therefore, the number of page frames available increases, the
number of page faults should decrease. (This will be studied later in Parachor Curve.) This should enhance
the performance. But this is not necessarily the case if you use FIFO as the page replacement policy. This,
in essence, is Belady’s anomaly. Let us, for example, take a reference string 2, 3, 4, 5, 2, 3, 6, 2, 3, 4, 5, 6.
Figure 9.46 shows that with 3 page frames, FIFO gives 9 page faults. However, Fig. 9.47 shows us that
if you increase the number of page frames from 3 to 4, the number of page faults increases and becomes
10! This clearly is anomalous! If you go through these figures, you will notice that the page faults increase
because in the case represented by Fig. 9.47, the page is referenced as soon as it is evicted in this specific
reference string. Fortunately, this anomalous behaviour is rare and can be found only for specific types of
page reference strings. It does not always happen for all reference strings and hence, FIFO is still a fairly
good algorithm.
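Belady's anomaly can be checked mechanically. The short C sketch below simulates FIFO replacement for the reference string given above, once with 3 page frames and once with 4, and counts the page faults (9 and 10 respectively, matching Figs. 9.46 and 9.47). The function and variable names are our own.

    #include <stdio.h>

    /* Count FIFO page faults for a reference string with 'nframes' frames. */
    static int fifo_faults(const int *refs, int n, int nframes)
    {
        int frames[8];                 /* pages currently in memory           */
        int next = 0, used = 0, faults = 0;

        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (frames[j] == refs[i]) { hit = 1; break; }
            if (!hit) {
                faults++;
                if (used < nframes) {
                    frames[used++] = refs[i];      /* a free frame is available */
                } else {
                    frames[next] = refs[i];        /* evict the oldest page     */
                    next = (next + 1) % nframes;   /* FIFO pointer moves on     */
                }
            }
        }
        return faults;
    }

    int main(void)
    {
        int refs[] = {2, 3, 4, 5, 2, 3, 6, 2, 3, 4, 5, 6};
        int n = sizeof(refs) / sizeof(refs[0]);
        printf("FIFO faults with 3 frames: %d\n", fifo_faults(refs, n, 3));
        printf("FIFO faults with 4 frames: %d\n", fifo_faults(refs, n, 4));
        return 0;
    }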
The Operating System maintains a pointer which moves in a circular fashion, i.e. it traverses the page frames
(held in FIFO sequence) from the beginning to the end, and then starts again from the
beginning.
Reference bit = 0 means that the page has not been referenced and hence, it can be replaced. Reference
bit = 1 means that the corresponding page has been referenced and hence, is likely to be used soon (at least
this algorithm thinks so) and therefore, it is not replaced. It is given a second chance. At this point, this bit
is set to 0 and the pointer moves on. The idea is that in the next round, that page could be thrown out unless
referenced during this period again.
At any time, when a page is referenced, its reference bit is set to 1. Hence, if a page is referenced very
frequently, it will never get replaced.
When there is a need to replace a page, the pointer starts from its current position and checks all the reference
bits one by one, selecting the first one which is 0 (see Figs. 9.49 (a) and 9.49 (b)). As discussed earlier, it sets all
the reference bits which it encounters during this journey from 1 to 0.
When the process is loaded in the memory, initially all the reference bits for those page frames are set to 0.
The reader may arrive at the page fault sequence for this scheme, given the same reference string as shown
in Fig. 9.43.
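A minimal C sketch of this circular-pointer (second chance) scan is given below. The number of frames, the current frame contents and the reference bits are all assumed values used only for illustration.

    #include <stdio.h>

    #define NFRAMES 4

    int page_in_frame[NFRAMES] = {8, 1, 2, 3};   /* assumed current contents */
    int ref_bit[NFRAMES]       = {1, 0, 1, 1};   /* assumed reference bits   */
    int hand = 0;                                /* the circular pointer     */

    /* Select a victim frame: skip (and clear) frames whose reference bit is 1. */
    int choose_victim(void)
    {
        for (;;) {
            if (ref_bit[hand] == 0) {
                int victim = hand;
                hand = (hand + 1) % NFRAMES;     /* pointer moves past the victim */
                return victim;
            }
            ref_bit[hand] = 0;                   /* give the page a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }

    int main(void)
    {
        int v = choose_victim();
        printf("evict page %d from frame %d\n", page_in_frame[v], v);
        return 0;
    }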
This algorithm is a variation of the SC algorithm discussed earlier. The difference is that SC does not bother
about whether a page is modified (i.e. dirty) before evicting it. Not Recently Used (NRU) takes this into
account. (NRU is also classified as 'LRU approximation' by some authors. We will talk about it later.)
NRU, therefore, has 2 bits instead of 1. These are: Reference bit (R) and Modified bit (M). Thus, with these
two bits we have four possible different combinations called 'classes'. They are shown in Fig. 9.50.
Thus, for each page frame, 2 bits are maintained. The R and M bits are set by the hardware. For instance,
when a page is referenced, the hardware sets the R bit for that page frame to 1. When it is modified, the
hardware sets the M bit for that page frame to 1. It is possible for the hardware to do it, as it is involved in the
Address Translation actively for a reference/modification of any page in any case. It is easier to understand
all classes except class 1. How can a page be modified and yet not referenced? Does not modification imply
a reference? The answer is: yes, it does. But these two bits are independent thereafter. At a specific time
interval, which may comprise several clock ticks, a clock interrupt occurs; the Operating System
takes over and, in the Interrupt Service Routine, sets all the R bits to 0, to differentiate the latest
references from the earlier ones.
However, it does not touch the M bits because they indicate whether a page is required to be copied onto
the disk before being overwritten or not. Hence, it is possible to have a page which is modified and yet not
referenced, i.e. class 1, if the R bit of a page in class 3 is cleared due to the clock interrupt. At the page fault,
the Operating System reworks the classes for all the pages and maintains different lists for different classes.
It is obvious from this classification that the pages in class 0 are the first candidates for removal and
those in class 3 are the last ones, as they are the most valuable and most required in the memory and thus
least desirable to be removed. Class 1 pages are removed earlier than class 2 ones, because a
"referenced, even though not modified" page (class 2) is more likely to be required again sooner than a "not referenced
but modified" page (class 1). This is because a page belonging to class 1 has definitely not been referenced for at least one
clock interval, whereas a page belonging to class 2 has certainly been referenced during that time. Thus, even
if removal of class 1 pages takes more time than that of class 2 pages (you have to write back the modified page
on the disk), this scheme is preferred. In some systems, where there is no hardware support to set/reset the R
and M bits, the Operating System has to simulate this in software, if it desires to implement the NRU method.
This obviously becomes a very slow process.
We cannot give details of page replacements with the assumed page reference list in this case, because we
will have to make some assumptions regarding when and which page was modified.
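The class computation itself is simple, as the short C sketch below shows. The R and M bit values used here are assumptions made only for illustration; in practice the hardware sets them and the Operating System reads them at a page fault.

    #include <stdio.h>

    /* Class numbering used above: class = 2*R + M, so
       class 0 = not referenced, not modified ... class 3 = referenced, modified. */
    int nru_class(int r, int m)
    {
        return 2 * r + m;
    }

    int main(void)
    {
        /* Assumed R and M bits for four page frames, for illustration only. */
        int R[4] = {0, 1, 0, 1};
        int M[4] = {1, 0, 0, 1};
        int victim = 0;

        for (int f = 1; f < 4; f++)
            if (nru_class(R[f], M[f]) < nru_class(R[victim], M[victim]))
                victim = f;                  /* the lowest class is removed first */

        printf("NRU picks frame %d (class %d)\n",
               victim, nru_class(R[victim], M[victim]));
        return 0;
    }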
(a) Introduction Least Recently Used (LRU) algorithm is quite popular as it comes close to the optimal
algorithm. It is based on the principle that pages that have been heavily used in the last few instructions will
probably be required again in the subsequent ones. On the other hand, pages not used for a long time in
the past will probably remain unused for a long time in the future. This matches the principle of locality of
reference that we have discussed earlier. Thus, when a page fault occurs, LRU throws out a page that has been
unused for the longest time. That is why the name 'Least Recently Used'.
(b) Page Faults We present in Fig. 9.51 the same page reference string as shown earlier in Fig. 9.43 and the
corresponding page faults and page replacements using the LRU method.
In this scheme, when page 5 is referenced (the 8th reference from the beginning), there are already three
pages in the memory, viz. 3, 1 and 4. Studying the reference string, it is found that page 3 was last
referenced in the 4th reference, page 1 was last referenced in the 7th reference and page 4 was last referenced
in the 6th reference. The Operating System wants to bring in page 5, but it needs to throw out a page because
there is no free page frame available. As page 3 was used least recently, it is thrown out. Page 1 was used most
recently and thus, it is obviously retained.
With LRU, we get 12 page faults which is more than 9 in OPT but less than 15 in the FIFO method. We
recommend that the reader should verify the sequence in which the page faults occur and the way the pages
are replaced.
(c) Worst Case LRU can have some performance problems in some specific cases though they are rare. One
such case is presented below, where as soon as a page is evicted, it is referenced again, causing yet another
page fault, and this continues. Imagine the main memory consisting of four page frames. Also, imagine a
program executing a loop over five virtual pages, referencing them in order, i.e. the page reference
list will consist of 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0... The page reference list and page faults are given in Fig. 9.52.
A closer look will reveal that there is a page fault at every page reference, only because the algorithm cannot
foresee ahead. This of course is the worst case, and it is not likely to occur very frequently.
This worst case is not an anomaly like the FIFO anomaly: if the number of page frames increases, the
page faults will not increase. In fact, LRU belongs to a class of algorithms known as 'stack algorithms', and it
can be shown that Belady's anomaly does not apply to these algorithms. FIFO, however, does not belong to
this class, which is why we encounter the FIFO anomaly. What does the case depicted in Fig. 9.52
demonstrate? It demonstrates that a policy appearing to be logically the best can sometimes produce the worst
results. LRU can also be bad in another situation, though not as bad as the worst case. This can happen when
the working set undergoes a drastic change, e.g. from housekeeping to the main routine.
Despite these problems, LRU is considered to be a very good algorithm coming very close to the OPT
algorithm excepting in some situations as discussed above. The only problem with LRU is that it is very
costly to implement. We will now study some of the ways in which it can be implemented.
(d) Implementation LRU can be normally implemented using one of the three methods: Stack, Counters
and Matrix. We will discuss them in brief.
In this scheme, we need a hardware counter (c) of large length (say 32 or 64 bits) to accommodate
a large binary number. After each instruction, it is automatically incremented by the hardware. Hence, at
any time, this counter is a measure of time. In addition to this systemwide counter, this method demands that
there should be another counter of the same length as (c) associated with each page frame, as shown in Fig.
9.54. This counter is maintained to indicate the last time the page frame was referenced.
Whenever any virtual page is referenced by any process, the Operating System carries out Address
Translation with the help of the hardware as discussed earlier, wherein the corresponding page frame is found
out. At this time, the current value of the systemwide counter (c) is copied in the "counter" field for that page
frame. Hence, this gives the time at which the page frame was referenced.
Thus, when a page is brought into the memory, its counter starts with the current value of (c) and is then updated
on every reference to that page frame. At any time, the page frame with the lowest counter value amongst
the page frames allocated to that process is the least recently used one and hence, is a candidate for removal.
At a page fault, the Operating System goes through the PMT
and for each page frame, extracts the value of the counter from the
table shown in Fig. 9.54. It then picks up a page frame for which the
value of the counter is the lowest. Thus, if a process has 16 pages, the
Operating System will have to compare all the counter values to arrive
at the minimum. This might look a bit lengthy and time consuming for
the Operating System at each page fault, but in practice it is not so.
Even if this processing takes a long time, if it results in avoiding even
a single page fault, the Operating System has saved execution time
equivalent to about 20000 CPU bound instructions!
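The counter scheme can be sketched in a few lines of C. The names, the number of frames and the reference sequence are assumptions made for illustration; in a real system the incrementing and time-stamping would be done by the hardware.

    #include <stdio.h>

    #define NFRAMES 4

    unsigned long sys_counter = 0;            /* incremented after every instruction */
    unsigned long frame_time[NFRAMES] = {0};  /* last-reference time per page frame  */

    /* Called (conceptually by the hardware) whenever a page frame is referenced. */
    void touch(int frame)
    {
        sys_counter++;                        /* time moves forward                  */
        frame_time[frame] = sys_counter;      /* record when this frame was used     */
    }

    /* At a page fault, the frame with the smallest recorded time is the LRU one. */
    int lru_victim(void)
    {
        int victim = 0;
        for (int f = 1; f < NFRAMES; f++)
            if (frame_time[f] < frame_time[victim])
                victim = f;
        return victim;
    }

    int main(void)
    {
        touch(0); touch(1); touch(2); touch(3); touch(1); touch(0);
        printf("LRU victim is frame %d\n", lru_victim());  /* frame 2 here */
        return 0;
    }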
Imagine that a page frame F is shared between two processes P1
and P2. Let us also imagine that P1 is referencing F quite actively but P2 is not. Now let us say that P2
experiences a page fault and there is no physical memory available. If these counters were maintained along
with each PMT (i.e. per logical page), F would appear in P2's PMT to be a fairly underutilized and useless
page frame. F would then be evicted, which is obviously wrong. This is why these counters are maintained for
the physical page frames instead.
An interesting point pertains to the global versus the local replacement policy. In global replacement, the
Operating System would find out the lowest value of the counter amongst all the page frames belonging to all
the processes. In local replacement, it would search only through the page frames allocated to that process to
find out the lowest value of the counter.
Another method implements LRU using a matrix. This method is not a very popular one
because it requires expensive hardware support. We will not discuss this at length here.
One thing is common to all the methods discussed above. If we want to implement LRU exactly and
completely in the hardware, it becomes a very expensive solution. Additionally, it cannot be easily ported
to any other machine not having that hardware support, thereby restricting its utility. On the other hand, if
we implement it totally in software, the algorithm becomes very slow, thus degrading the performance
drastically, though the portability improves. Therefore, you need a middle path (via media): take help from
the hardware only to the extent necessary, and do everything else in software. These solutions make some
compromises. They are not exactly LRU, though they come quite close to it. That is why they are called 'LRU
approximations'. We will now examine these.
There are many LRU approximation methods suggested. One of the methods 'Not
Recently Used (NRU)' is the one that we have already studied. We leave it to the reader to find out why NRU
can be classified as LRU approximation. We will now examine another method called 'Not Frequently Used
(NFU)' which again belongs to the category of LRU approximation.
This algorithm requires a counter CTR associated with each page frame. CTRs are initially set to 0 for
all the page frames allocated to that process, excepting the shared page frames. Each page frame also has
Reference (R) bit associated with it in addition. At any time, if a page frame is referenced, the hardware itself
sets the R bit of the corresponding page frame to 1. After some clock interval, this R bit (0 or 1) is added to
the counter associated with the page frame. Therefore, if a page frame was not referenced at all during a clock
tick, its CTR will remain unchanged. Otherwise, it will be incremented. This is done for all the page frames.
After this, all the R bits are cleared. In essence, CTR for any page frame then represents a measure of how
frequently that page frame is referenced. The reasons why this counter is maintained for a page frame, instead
of a page in the PMT are obvious as discussed earlier.
When a page fault occurs the candidate page frame for replacement is the one out of the page frames
belonging to that process (for local replacement policy) or out of all the page frames (for global replacement
policy) with the lowest counter value. This is because, this page is the least frequently used.
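The per-tick NFU counter update can be pictured with a short C sketch. The number of frames and the reference pattern are assumed for illustration; in practice the R bits are set by the hardware and folded in by the Operating System at the clock interrupt.

    #include <stdio.h>

    #define NFRAMES 3

    unsigned int ctr[NFRAMES] = {0};   /* NFU counter per page frame          */
    int R[NFRAMES] = {0};              /* reference bit, set by the hardware  */

    /* Executed at every clock tick: fold the R bit into the counter, clear R. */
    void nfu_tick(void)
    {
        for (int f = 0; f < NFRAMES; f++) {
            ctr[f] += R[f];            /* counts how often the frame was used */
            R[f] = 0;
        }
    }

    int main(void)
    {
        R[0] = 1;            nfu_tick();   /* only frame 0 referenced        */
        R[1] = 1; R[2] = 1;  nfu_tick();   /* frames 1 and 2 referenced      */
        for (int f = 0; f < NFRAMES; f++)
            printf("frame %d: ctr = %u\n", f, ctr[f]);
        return 0;
    }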
However, the NFU algorithm as discussed above has some problems. For instance, it measures and
indicates how many times a page is referenced, rather than when or
how recently it was referenced. This may not be in accordance with the
principle of 'locality of reference', which assumes that a page referenced
most recently may be more in demand than one referenced more
frequently. As an example, let us imagine a COBOL program with three
portions or chunks as shown in Fig. 9.55.
Imagine that we have only four page frames available. Let us assume
that the Housekeeping routine (open files,... etc.)
occupies three page frames and those three have been
heavily used during the Housekeeping routine. When
we start the main program, the first page of the main
program will occupy the fourth page frame as shown
in Fig. 9.56. Notice the CTR values against the various pages
shown in Fig. 9.56 at a certain time, as an example.
If another page from the main program needs to
come into the memory, the NFU algorithm will throw
out the page from the main program (MAIN 0) instead of
a page from the HSKG routine, simply because the CTR for MAIN 0 is the lowest. Obviously, this is not
correct!
A small modification to the NFU algorithm sets this right. This algorithm is called 'aging'. There are two
modifications needed in NFU to achieve this:
(1) After the clock interval, before adding the R bit, the counters for all page frames are shifted right by one bit.
(2) The R bit of each page frame is then added at the leftmost bit of its counter rather than at the rightmost
one. This is exactly opposite of what we do in normal arithmetic addition.
Let us consider 3 page frames: 0, 1 and 2. Let us also assume that the counters are of 8 bits. Also assume
that before clock tick 1, only page frame 0 is referenced. Between clock ticks 1 and 2, page frames 1 and 2
are referenced; and between clock ticks 2 and 3, page frames 0 and 2 are referenced. After all these, the values
of the counters for the page frames will be as shown in Fig. 9.57.
Notice that after each clock tick, the counters are shifted right. This reduces the value of the counter.
However, the value of the counter tells you not only how many times a page frame is referenced, but also it
tells you how recently it has been referenced. This happens precisely due to this shifting. For instance, consider
these counters for page frames 0 and 2 after clock tick 3. Both of these page frames have been referenced twice
upto now, as is evident from the digit 1 appearing twice in both of these counter values. However, page
frame 2 has been referenced more recently than page frame 0, as the counter value for 2 is greater than that for 0. This
is because page frame 2 was referenced in the last two clock ticks consecutively. Page frame 0 was referenced
in the last clock tick all right, but not in the one before that. This helps the algorithm to decide which page
frame to evict. This is how this 'aging' algorithm takes care of the time when a page frame is referenced and,
within that, how many times it is referenced!
If a page fault occurs between clock ticks 2 and 3, the NFU will choose the counter with the lower value,
i.e. the one for page frame 0. Thus, even if before the page fault, all the page frames 0,1 and 2 have been
referenced once only, page frame 0 will be removed, because it is referenced earlier or less recently than the
other two, as denoted by the value of its counter.
In a sense, the exact time when the page frame is referenced is lost within a time interval between the
clock ticks. Hence, it is still an approximation.
Another limitation arises from the finite length of the counters (8 bits here). If the counters of two page
frames both have a value of 0, it could well be that one page frame has not been referenced at all, whereas the
other page frame has merely not been referenced in the last 8 ticks, even though it may have been referenced several
times before these 8 ticks. (Due to the shifting of bits, all history of what happened before 8 ticks is lost and
the counter could still be 0.)
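The aging update itself is only a shift and an OR. The C sketch below replays the three clock ticks of the example above (frame and variable names are our own); it prints the counter values 0xA0, 0x40 and 0xC0 for frames 0, 1 and 2, matching the discussion.

    #include <stdio.h>

    #define NFRAMES 3

    unsigned char ctr[NFRAMES] = {0};  /* 8-bit aging counter per page frame */
    int R[NFRAMES] = {0};              /* reference bit, set by the hardware */

    /* At each clock tick: shift every counter right, then place the R bit
       in the leftmost position (bit 7) and clear R.                         */
    void aging_tick(void)
    {
        for (int f = 0; f < NFRAMES; f++) {
            ctr[f] = (ctr[f] >> 1) | (R[f] << 7);
            R[f] = 0;
        }
    }

    int main(void)
    {
        R[0] = 1;            aging_tick();   /* before tick 1: frame 0 used   */
        R[1] = 1; R[2] = 1;  aging_tick();   /* ticks 1-2: frames 1, 2 used   */
        R[0] = 1; R[2] = 1;  aging_tick();   /* ticks 2-3: frames 0, 2 used   */

        for (int f = 0; f < NFRAMES; f++)
            printf("frame %d: counter = 0x%02X\n", f, (unsigned)ctr[f]);
        /* Frame 2's counter exceeds frame 0's, so frame 0 is the better
           eviction candidate even though both were referenced twice.       */
        return 0;
    }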
Relocation and Address Translation are similar to those in the paging systems. The only difference is that all the pages
may not be present in the main memory at any time. This forces the Operating System to keep track of
whether a page is in the main memory or not, and if not, then where to locate it on the disk. This necessitates
the modification of PMT and addition of another table called FMT. All this was seen in Sec. 9.9.2. After the
page fault, the page is located on the disk, loaded in the memory, PMT is updated, and thereafter, the process
of Address Translation is quite similar to that in paging.
Concepts of Protection and Sharing are very similar to those in paging again. Illegal addresses are trapped
by the hardware using PMTBR. Again, the operation (Read, Write) to be performed on the page is validated
against the access rights information kept in the entry of PMT for that page. Sharing also is easily achieved
by mapping two or more logical pages from different processes to the same physical page frame. The only
complication added due to virtual memory is the fact that the shared page can become a candidate for eviction
in the case of a page removal/replacement. This can give rise to unnecessary page faults in other processes
too. This forces the Operating System to keep all the counters or indicators (R, M etc.) at the level of physical
page frames instead of logical pages as seen earlier.
The considerations for internal fragmentation are the same as in paging. On the average,
this internal fragmentation is equal to (page size - 1)/2. As in paging, there is no external fragmentation.
In addition, virtual memory systems utilize the memory far more efficiently, because unutilized, unwanted
pages of a process can be evicted to make room for pages from a new process, thus increasing the degree of
multiprogramming.
Access times can be the same as in paging, if a page is found in memory, otherwise they can
be very high due to the page fault processing, as we have studied earlier.
The routines of memory allocation/deallocation are fairly complex as they have to take
into account the page fault processing, page replacement policies etc.
There are many issues which play a significant role in the virtual systems, and which have a direct bearing on
the performance. We will consider these in brief.
This was discussed earlier. To recapitulate, a larger page size results in greater wastage of
memory due to internal fragmentation (refer to Sec. 9.9.5.1), but has lower overheads due to the smaller sizes
of the PMTs and the associative memory. Conversely, a smaller page results in less memory wastage, but can have
very high overheads due to the large PMTs and associative memory needed for the different processes.
This also has been considered earlier. Demand paging brings
in a page only when it is actually demanded. Hence, until all the necessary and continuously
required pages get into the memory, it causes a lot of page faults. On the other hand, the working set model
keeps track of the pages that are likely to be required by the process (by studying its past behaviour), and
then brings them in in one shot, without waiting for individual page faults to build up the necessary pages
in the memory.
The strategy of bringing in a page in anticipation, before it is actually demanded is called 'prepaging' and
is used heavily in the working set model. If a process starts with no pages in the memory and gradually grows
to a large but steady size, both these philosophies would be the same for the very first time. But imagine that
the process is rolled out and it is to be brought in again. It is here that these two philosophies differ. Demand
paging would start all over again, going through all the page faults, whereas the working set method would
bring in all the pages belonging to the working set for the process.
'Thrashing' is related to the rate at which page faults occur. If a page fault occurs every time
after only a few instructions, the system is said to be thrashing, and this obviously degrades the performance.
This is because pages are continuously moved between the main memory and the disk.
The worst case of the LRU algorithm discussed in part (c) of the earlier section is another example of
thrashing. A page replacement algorithm should be so chosen that thrashing is avoided.
In such cases, a process may require a minimum of 6 or 8 page frames to be allocated to it! (6 in the PDP-11
and 8 in the IBM-370, due to some additional complications!)
If a machine provides a number of levels of indirection, this minimum number of page frames also
increases, thus reducing the efficiency of the virtual memory scheme. As the minimum number of page
frames that must be allocated to a process increases, the page frames available for further allocations decrease
in number, thus reducing the degree of multiprogramming. This is the reason the Data General computer
designers restricted the number of levels of indirection in the DG instruction set to 16!
As we have seen, pages can be shared between processes if the code is reentrant. This is
useful so that commonly used programs such as editors, Command Line Interpreters, Compilers, Database
Systems etc. do not generate one copy per process. But there is a problem in this scheme. Imagine that Pro-
cesses A and B both are using an editor. If the CPU scheduler decides to remove Process B, should all the
pages be removed? If it does so, Process A will continuously cause page faults for a while, thus degrading
the performance.
Another problem is to decide when to free the pages. For instance, in the earlier example, when Process
A also terminates, all the shared pages should be returned to the free pool (unless there is yet another process
running an editor). How does the Operating System know this? One idea, as discussed before, is to use a usage
count associated with each page frame. When a PMT for a process is created, the usage count field for all
the allocated frames is incremented by 1. When a process is killed, they are decremented by 1. At any
moment, the page frames with a zero usage count are not shared by any process, and can therefore be added to
the free pool when a process terminates.
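The usage-count idea can be sketched in a few lines of C. The frame numbers and the "editor" scenario below are assumptions mirroring the example above, not the mechanism of any specific Operating System.

    #include <stdio.h>

    #define NFRAMES 4

    int usage_count[NFRAMES] = {0};    /* how many processes map each frame */

    /* When a process's PMT is created, count every frame allocated to it;
       when the process terminates, count them down again.                  */
    void attach_process(const int *frames, int n)
    {
        for (int i = 0; i < n; i++)
            usage_count[frames[i]]++;
    }

    void detach_process(const int *frames, int n)
    {
        for (int i = 0; i < n; i++) {
            usage_count[frames[i]]--;
            if (usage_count[frames[i]] == 0)
                printf("frame %d returned to the free pool\n", frames[i]);
        }
    }

    int main(void)
    {
        int editor_A[] = {0, 1};       /* frames shared by processes A and B */
        int editor_B[] = {0, 1};
        attach_process(editor_A, 2);
        attach_process(editor_B, 2);
        detach_process(editor_B, 2);   /* B ends: frames still in use by A   */
        detach_process(editor_A, 2);   /* A ends: frames become free         */
        return 0;
    }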
In batch systems, organizing memory is fairly simple, since the requirement there is only to have the memory divided
into fixed partitions for the jobs. Each job occupies a fixed partition, and the partition becomes free when the
job is finished. So we can keep executing jobs in the fixed partitions and keep the CPU busy all the time.
There is no other requirement for memory management in such older systems.
With the introduction of time sharing systems, advanced client-server systems, and even functionally
rich personal computers, there is a complete change in the memory management function of the Operating
System.
In a time sharing environment, there are various processes in the running state, and each process
requires an important resource, namely memory. Similarly, on personal computers, operations such as
GUI work and CD read/write may need more memory than the available primary memory
or RAM.
When the combined size of the executable program, the data being processed, and the stack exceeds the amount of physi-
cal memory available for that process, the Operating System keeps only the required parts of the program in
the main memory and moves the rest to the disk. For example, a 32 MB program can run in 8 MB of RAM
if it is known which 8 MB part of the program is required at each instant, by swapping parts between the disk
and the memory as needed.
This whole idea creates an impression that the size of the main memory has been increased dynamically by
utilizing free space on the disk. This extended memory at run time is called virtual memory. It cannot be
counted as the real physical or main memory of the computer. Virtual memory also works in a multiprogramming
system, with many pieces of many programs in memory at once.
In the earlier discussion of virtual memory, we saw that the required part of a program is transferred from the disk to
the memory. This swapping from memory to disk and disk to memory is a continuous, on-going activity
and requires some technique to manage it. The disk-to-memory and memory-to-disk transfers always
happen in a fixed size of bytes. This fixed size unit is called a page. The whole virtual memory gets divided into
such fixed units, i.e. pages. Each page occupies a unit of physical memory; the corresponding space in physical
memory is called a page frame.
Hence, each page from virtual memory occupies a page frame in physical memory.
We require pages from the secondary storage for the execution of a process, and there is a demand for these
pages throughout its execution. There are two ways to load the demanded pages into memory. A
swapper manipulates the entire process, whereas a pager is concerned with the individual pages of a process. A pager is
a more effective technique than a swapper in connection with demand paging.
The pager guesses which pages are required and, instead of swapping in the whole process, brings only
the necessary pages into memory. Thus it avoids reading into memory pages that will not be used, decreasing
the swap time and the amount of physical memory needed.
Page replacement is the technique to remove unwanted pages from the memory
and to bring the required pages into the memory. Thus, the page replacement technique makes sure that there is
enough space for the required pages. It is very much possible that the page to be removed from the memory has
been modified while it was in memory. It must then be rewritten to the disk to save the changes made.
If the page has not been modified while it was in memory, there is no need to rewrite that page on the disk.
There would be some pages inside the memory which are being heavily used by the processes, and the
presence of such pages is almost mandatory at any given point of time. If such pages are removed to bring
in new pages, that would cause a problem, since the most wanted pages would have to be brought back again and again.
This would reduce the efficiency of the overall system. Hence, the page replacement technique must be based on
a proper algorithm, and not on a random selection.
Consider the example of a Web server which keeps heavily used Web pages in its memory cache. The
memory cache, although big, is limited in size; when it is full and a new Web page is referenced,
a decision has to be taken about which Web page is to be removed from the memory.
Some commonly used page replacement algorithms are given below:
(1) Not recently used (NRU) page replacement algorithm
(2) First in first out (FIFO) page replacement algorithm
(3) Second chance page replacement algorithm
(4) The clock page replacement algorithm
(5) Least recently used (LRU) page replacement algorithm
“In general, secure systems will control, through use of
specific security features, access to information that only
properly authorized individuals or processes operating on
their behalf will have access to read, write, create or delete.”
Open Systems Interconnection (OSI) defines the elements of security in the following terms:
Confidentiality: Ensuring that information is not accessed in an unauthorized manner (essentially controlling the read operations).
Integrity: Ensuring that information is not amended or deleted in an unauthorized manner (essentially controlling the write operations).
Availability: Ensuring that information is available to authorized users at the right time (essentially controlling the read and delete operations and ensuring fault recovery).
Generally speaking, security is concerned with the ability
of the Operating System to enforce control over the storage and transportation of data in and between the
objects that the Operating System supports.
In a multiuser environment, the concepts of security and protection are very important. User programs
should not interfere with one another or with the Operating System. As user programs execute in the computer
memory, the first step is to identify and protect the memory allocated to each user program as well as the
one that is allocated to the Operating System. We have studied in the previous chapter how this is done by
different memory management schemes.
During the execution, a user program may access different objects such as files, directories and printers.
Because many of these objects are shared, proper security mechanisms are very important. The lack of such
security mechanism can lead to a disaster. The computing history abounds in such cases.
Now that the cost of hardware is falling at a rapid rate, millions of ordinary users and programmers
have access to small or large computing equipment. With the trend towards networking, the
user/programmer has access to data and programs at different remote locations. This has increased
the threat to the security of computing environments in general and the Operating System in particular
(though the Operating System forms only a small portion of the distributed computing environment).
Sharing and protection are requirements of any modern computing, but ironically they imply contradictory
goals. More sharing gives rise to more possibility of security threats, or penetration, thus requiring higher
protection. When the Personal Computer (PC) was designed, it was intended strictly for individual use. This
is the reason why MS-DOS was not very strong in the security/protection areas. It did not have to protect
the data files of one user from the possibility of penetration or misuse by another user, as no two users were
expected to use the same machine simultaneously. Hence, in the PC environment, in the earlier days, the only
way data could be protected was by locking the room physically where the PC and the floppy disks were
kept. Today, as a number of PCs are being networked together for sharing of data and programs, a need has
arisen to have better and stricter control over the protection aspects. This was the main motivation behind the
development of a Network Operating System (NOS) such as Novell NetWare.
The major threats to security in any computing environment can be categorized as follows:
(i) Unauthorized use of service (Tapping).
(ii) Unauthorized disclosure of information (Disclosure).
(iii) Unauthorized alteration or deletion of information (Amendment).
(iv) Unauthorized fabrication of information (Fabrication).
(v) Denial of service to authorized users (Denial).
Figure 10.1 depicts these major security threats.
Of these five types of security threats, the first two, viz. tapping and disclosure, are categorized as passive
threats and the other three as active threats. It is clear from the figure that, in both cases (i) and (ii), the
information goes to the third party. However, there is a slight difference between them. In tapping, the third
party accesses it without the knowledge of the other two parties (A and B). In disclosure, the source party (A)
willingly or knowingly discloses it to the third party.
Security threats can arise from unintentional or deliberate reasons. Again, there may be casual or malicious
attempts at penetration. Regardless of its origin and motivation, the Operating System designers have to
build a security system to counter all possible penetration attempts. Disclosure of product prices or other
competitive information or military/commercial secrets could be extremely dangerous in varying ways and
degrees to people, organizations or governments.
The security system can be attacked and penetrated in a number of ways. The following sub-sections
outline some of the possible ways in which this can happen.
Authentication means verification of access to the system resources. We will talk about it in more detail in a
later section. Following could be some of the ways of penetration of the system in this regard.
(i) An intruder may guess or steal somebody else’s password and then use it.
(ii) An intruder may use the vendor-supplied password which is expected to be used only by the system
administrators for the purpose of system generation and maintenance.
(iii) An intruder may find out the password by trial and error method. It is fairly well known that names,
surnames, initials or some other common identifiers are generally used as passwords by many users.
A program also could be written to assist this trial and error method.
(iv) If a user logs on to a terminal and then goes off for a cup of coffee, an intruder can use that terminal
to access, or even modify, sensitive and confidential information.
(v) An intruder can write a dummy login program to fool the user. The intruder, in this case, writes
a program to throw up a screen prompting for the username and the password in the same way that the
Operating System would do. When a user keys in the username and password for logging in, this
dummy program collects the information for use by the intruder later on. It may then terminate
after throwing back some misleading message like "system down...", etc. This collected information
is used for future intrusion. This is a form of 'chameleon', as we will learn later.
Quite often, in some systems, there exist files with access controls which are very permissive. An intruder can
browse through the system files to get this information, after which unprotected files/databases could easily be
accessed. Confidential information could be read or even modified which is more dangerous.
Sometimes, software designers may want to be able to modify their programs after their installation and
even after they have gone in production. To assist them in this task, the programmers leave some secret entry
points which do not require authorization to access certain objects. Essentially, they bypass certain validation
checks. Only the software designers know how to make use of these shortcuts. These are called trap doors.
At times such shortcuts may be necessary for coping with emergency situations, but then these trap doors can
also be abused by some others to penetrate into the system.
Serious security violations can take place due to passing of invalid parameters or due to the failure to validate
them properly. (This needs a detailed discussion and a separate section will be devoted to discussing it.)
A special terminal can be used in this case to tap into the communications line and access, or even modify,
confidential data. The security threat could be in the form of tapping, amendment or fabrication, once the
intruder gets hold of the line. The intruder needs different hardware and techniques depending upon whether
the attack is passive or active.
A penetrator can use active or passive wire-taps, or a mechanism to pick up the screen radiation and recognize
what is displayed on the screen.
In networking (or even timesharing) environments, a line can get lost. In such a case, a sturdy Operating
System can log out the user and allow access only after re-establishing the identity of the user. Some Operating
Systems cannot do this. In that case, the process created before the line was lost just floats about, and
an intruder can gain control of this floating process and access the resources which were accessible to that
process. Though this is difficult, it is certainly a possible way of attacking security.
In some Operating Systems, the system itself does not allow the planning of a meticulous, rigorous access
control mechanism (UNIX has, at times, faced this criticism). On top of this, the system administrator may
not plan his access controls properly. This may lead to some users having far too many privileges and some
others, very few. This situation can potentially lead to unauthorized disclosure of information or denial of
service. In either case, it leads to security violation.
A penetrator may use some techniques to recover the deleted files or part of them containing different
passwords. For example, if a block is deleted from a file, it is deallocated from that file and added to the pool
of free blocks. However, the contents of the block still remain intact until it is allocated to another file again
and some other data is written onto it. A penetrator can use some mechanism to scan these free blocks to get
useful information.
A variety of software programs exist under this title. Computer virus is the most notorious of them all. This is
a deliberately written program or a part of it, intended to create mischief. Such programs vary in terms of their
complexity or damage they cause. Obviously, all of them require a deep knowledge of the Operating Systems
and the underlying hardware. They are, therefore, normally produced by very clever systems programmers.
Various types of rogue software are listed below:
This is a program which appears to be harmless, but has a piece of code which is very
harmful. Its name is derived from Greek mythology. As is well known, there was a war between the Greeks
and the Trojans. The Greek army appeared to be defeated in a bloody war due to the superior Trojan tactics.
Finally, in a bid to enter Troy clandestinely, the Greeks created a large hollow wooden horse and parked it at
the gates of Troy. The Trojans interpreted it as a symbol of peace and allowed it inside their city. At night, the
Greek soldiers hiding inside the horse got out and set Troy on fire.
The ‘Trojan Horse’, here, is meant to fool a common user. Hence, all the rogue software delivered to the
users normally is in the form of the Trojan Horse.
Chameleon is similar to the Trojan Horse. It normally mimics a useful and correct program.
It can mimic a logon program to collect all the valid usernames and passwords on a system, and then display
a message to the effect “System down”. The collected passwords can be used thereafter.
Another example of a chameleon is a program that mimicked an otherwise normal banking program and
diverted a few fractions of a cent as rounding off errors for each transaction into a separate secret account.
This resulted in the collection of thousands of dollars in a short span of time.
This is a piece of code which “explodes” as soon as it is executed, without
delay or warning.
This is like an ordinary software bomb, except that it becomes active only at a
specific time or frequency.
Again, this is like an ordinary software bomb, except that it is activated only if
a logical condition is satisfied (e.g. destroy the Employee Master data only if gross salary exceeds say 5000).
These are programs attacking the nodes on the network and spreading to other nodes. They
consume all the resources on the network and affect the response time (This will be discussed at length in a
separate section).
This is only a part of a program which gets attached to other programs with the definite intention of causing
damage. A full section will be devoted to discussing this term.
These, like worms, are full programs. However, as soon as they are executed, they replicate themselves
on the disk until its capacity is exhausted. This procedure can then be repeated on other nodes so that
the complete network comes to a standstill. Rabbits can be easily detected and hence, are not as dangerous
as worms.
Conventionally, protection mechanisms impose restrictions on various subjects to access different objects.
Despite these restrictions, unauthorized data transfer may still occur. This can happen due to the use of
objects or signalling devices not protected or covered by the protection mechanisms. This can undermine
confidentiality and is hard to detect. The problem pertains to the information flow control and is known as the
confinement problem or the covert channel problem.
Imagine that two parties X and Y are communicating with one another. Party X is a depot receiving sales
orders throughout the day. As soon as a sales order is received, it is verified and sent to the central office, Y.
All the actual data of the sales orders are sent across using a variety of techniques such as encryption. Hence,
an intruder cannot find out the actual data. However, if the intruder taps the line to measure the rate of flow of
the information at different times during the day, he can ‘covertly’ or ‘implicitly’ conclude the times at which
the orders arrive at a high rate. This conclusion is also a kind of ‘information’.
Such information would appear to be fairly harmless in a normal commercial environment, but could lead
to dangerous conclusions in military situations. This is the real reason for the concern. As the information is
not explicitly or overtly penetrated but is derived in an implied or covert manner, this method is called Covert
Channels. There are two types of covert channels: storage and timing. A detailed discussion of these channels
is beyond the scope of the current text.
When one program calls another program or procedure, it may need some parameters to be passed
from the caller routine to the called routine. After the routine is called, a different process is executed
and it could have a set of access rights over different objects (such as files and directories) in the
system, different from those of the caller routine and this can cause problems if not handled carefully. A case
in point is a call to the Operating System itself which is made by the user process. Some examples of security
violation through parameters are presented in the following sub-sections.
The parameters can be passed either directly (by value) or indirectly (by reference). In the call by value method,
a user program passes the parameter values themselves. For example, it can load certain CPU registers with
these values, which are then picked up by the called routine. If the called routine does not have access to those
registers, it may result in the denial of this service.
In call by reference, the actual values of the parameters are not passed; instead, the pointers or addresses
of the locations where the actual values are stored are passed. Denial of service
can take place in this case too. For instance, a call may pass parameters referencing objects that cannot be
accessed by the called routine (called the target domain), even though the calling routine (called the source domain) has
the access rights to those objects. Figure 10.2 depicts this scenario.
The figure shows the source and target domains: DOMAIN 1 and DOMAIN 2, respectively.
DOMAIN 1 contains a call to the procedure PROC-A which has parameters p and q which are only the
addresses. DOMAIN 2 is supposed to execute this procedure by reading the data at an address p and then
writing it at an address q. Note that what are passed are not the data items but the pointers to the data items.
The figure also shows the Access Control Matrix (ACM) for these two domains. It shows that DOMAIN
1 has both Read and Write accesses to both the addresses p and q. However, DOMAIN 2 has Read and Write
access only to p. It has only Read access to q. What will happen in this case?
A call will be made as usual, but at the time of execution, when the data read at address p has to be
written at address q, the access will be denied because DOMAIN 2 does not have Write access for q. Thus, denial of service results in this case.
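As a minimal illustration of this check (not taken from any particular Operating System), the following C sketch encodes the Access Control Matrix of Fig. 10.2 as a small table and consults it before the called routine touches an address; the names and the R/W encoding are purely illustrative.

#include <stdio.h>

/* A small Access Control Matrix indexed by domain and object, consulted
 * before the called routine reads or writes an address.                */

#define READ  0x1
#define WRITE 0x2

enum { P = 0, Q = 1 };                 /* the two objects (addresses p and q) */

static const int acm[2][2] = {
    /*            p              q            */
    [0] = { READ | WRITE, READ | WRITE },  /* DOMAIN 1 (source) */
    [1] = { READ | WRITE, READ         },  /* DOMAIN 2 (target) */
};

static int allowed(int domain, int object, int right)
{
    return (acm[domain][object] & right) != 0;
}

/* PROC-A running in DOMAIN 2: read from p, then write to q. */
static int proc_a(int domain)
{
    if (!allowed(domain, P, READ))  return -1;  /* denied */
    if (!allowed(domain, Q, WRITE)) return -1;  /* denied: DOMAIN 2 lacks W on q */
    /* ... the actual data transfer would happen here ... */
    return 0;
}

int main(void)
{
    printf("PROC-A in DOMAIN 2: %s\n", proc_a(1) == 0 ? "ok" : "access denied");
    return 0;
}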
The cause for these problems is excessive mutual trust and the failure to check the validity of the parameters at every stage. As a result, a caller may gain access to unauthorized information. Even if the parameters are validated, this can still happen. Consider the following situation: the passed pointer initially references a legitimate location within the caller's own address space, so the validation succeeds; immediately afterwards, a process switch gives the intruder a chance to substitute unauthorized values in the same place. Hence, the pointer now points to an address in the Operating System area which the user is not supposed to access. To achieve this is obviously not an easy task for the intruder; but a detailed study of process scheduling algorithms, careful programming and a lot of trial and error can result in this situation.
The suggested solution to the problem mentioned above is to validate the referenced parameters and execute the routine in an atomic way, i.e. to execute it to completion without interruption. Hence, even if a process switch occurs, it does not occur between the validation of the parameters and the execution of the routine. This prevents any penetration by the intruder. However, this may involve keeping a few interrupts pending, which may make it an expensive solution.
Such problems can be reduced by copying all referenced and indirect parameters into the target domain before validating them. This can also be done while passing parameters to the 'less privileged' domains. Strictly speaking, it may be necessary to perform this for all objects indirectly referenced from the calling or source domain. This may involve maintaining linked lists whose exact elements of interest may be difficult to anticipate in advance. Also, it is necessary to exclude other accesses to the referenced segments in order to prevent race conditions. A detailed discussion of this topic is beyond the scope of the current text.
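Nevertheless, the essence of the copy-then-validate idea can be sketched in C. This is only an illustrative fragment: validate_user_range() and kernel_buffer are hypothetical stand-ins for the primitives a real Operating System would use.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_PARAM 256

static char kernel_buffer[MAX_PARAM];      /* lives in the target domain */

/* Stub: would verify that [addr, addr+len) lies in the caller's space.  */
static int validate_user_range(const void *addr, size_t len)
{
    return addr != NULL && len <= MAX_PARAM;
}

int syscall_with_indirect_param(const void *user_ptr, size_t len)
{
    /* 1. Check that the referenced range really belongs to the caller.  */
    if (!validate_user_range(user_ptr, len))
        return -1;

    /* 2. Copy the referenced parameter into the target domain. From this
     *    point on, only the private copy is examined and used, so even if
     *    a process switch lets the caller substitute other values at the
     *    original address, the substitution has no effect.              */
    memcpy(kernel_buffer, user_ptr, len);

    /* 3. Any further (semantic) validation is done on kernel_buffer, and
     *    the routine works only on kernel_buffer from here on.          */
    return 0;
}

int main(void)
{
    char user_data[] = "order record";
    printf("%d\n", syscall_with_indirect_param(user_data, sizeof user_data));
    return 0;
}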
The invention of computer worms had, in fact, quite good motivations. It all began at Xerox PARC research
center. The research scientists at the centre wanted to carry out large computations. Having identified different
pieces of computations which can be independently carried out, they designed small programs (worms) which
could by themselves spread to other computers. This worm would execute on a machine if idle capacity was
available on that machine; otherwise it would continue to hunt for other machines ‘in search of idleness’. This
was the original purpose.
Usually, though not always, a computer worm does not harm any other program or data. It just spreads, consuming large resources such as transmission capacity or disk storage, and thereby denying services to legitimate users. A computer worm usually operates on a network. Each node on a network maintains a list of all other nodes on the network. It also maintains a 'mailing list' which contains the names and addresses of the reachable machines on the network. The worm gains access to this list and, using it, sends a copy of itself to all those addresses.
If a worm is more 'intelligent' and less harmful, it checks, on reaching a new node, whether its copy already exists there, and if it does, it does not create another one. If it is dumb as well as malevolent, it copies itself to all the nodes on the mailing list. Hence, if a node is on the mailing lists of dozens or maybe hundreds of other nodes in a large network, as many copies of the worm could exist on that node, sent by all those nodes in the network.
As a result of this large, continuous transfer over the network, a major portion of the network resources such as the line capacity, disk capacity, network buffers and process tables is used up, and the network speed can reduce substantially. Theoretically, it is possible to bring the entire network down, thus denying service to the legitimate users.
Even if a worm is normally not harmful to the existing programs and data, it can be extremely harmful to the
organizations and governments operating over a network, as a case in 1988 demonstrated.
It all began on November 2, 1988 when Robert Tappan Morris introduced a worm into the Internet. The Internet is a huge network connecting thousands of computers at homes, shops, corporations, universities, laboratories, government organizations, etc. all over the world. The worm brought down the network, causing tremendous havoc and generating major controversies.
Morris was studying at Cornell and was extremely intelligent. He noticed two bugs in Berkeley UNIX which enabled him to gain unauthorized access to machines the world over. He then wrote a self-duplicating program named 'worm'. It was not easy to detect its existence and origin until it was too late.
Within minutes, the VAX and SUN systems in the network started crawling, stunning hundreds of users. When the worm got to a new node, it checked whether a copy already existed there. If it did, the worm went on to hunt for a new machine in six out of seven cases. In one out of seven cases, it created another copy regardless, before hunting for a new machine. This was obviously done to fool the users and to hinder its detection and prevention.
A friend of Morris let out this secret to John Markoff, a New York Times reporter. The following day, the story hit the front page of the newspaper, gaining more prominence than the news of the presidential election which was three days later. Morris was finally tried, convicted and sentenced to three years of probation, 400 hours of community service and a fine of over 10,000 USD.
There are several types of computer viruses. We give below a list of types that have been encountered so far.
New varieties are likely to be added to this list with the passage of time.
This classification is based on what is affected (e.g. the boot sector) or where the virus resides (e.g. memory resident). These types are largely self-explanatory. However, this categorization is not mutually exclusive. For example, a File Specific Infector or a Command Processor Infector could also be Memory Resident.
There are five well known methods in which a virus can infect other programs. These are discussed below:
(i) Append: In this method, the viral code appends itself to the unaffected program.
(ii) Replace: In this case, the viral code replaces the original executable program completely or partially.
(iii) Insert: In this case, the viral code is inserted in the body of an executable code to carry out some
funny (!) or undesirable actions.
(iv) Delete: In this case, the viral code deletes some code from the executable program.
(v) Redirect: This is an advanced approach employed by the authors of sophisticated viruses. The normal
control flow of a program is changed to execute some other (normally viral) code which could exist
as an appended portion of an otherwise normal program. This mode is quite common and, therefore,
needs to be studied in detail.
A virus works in a number of ways. Normally, the developer of a virus has to be a very bright person who
knows the Operating System very well in order to break it. This person produces an otherwise interesting or
useful program such as a good game or a utility. However, this program has some viral code embedded in it.
Typically, it is developed under MS-DOS, as viruses are very (un)popular on PCs. This program is then published on a public bulletin board system or distributed to people free of charge (or at a throwaway price).
Tempted by its contents and the price, the user acquires it and starts using it after copying it onto his machine. At this stage, the virus can be said to be in a “nascent” state. When the user executes the game or the utility, i.e. the host program, the virus also executes, which allows it to spread to other programs on the machine and infect them. The algorithm to do this is as follows (refer to Figs. 10.4 and 10.5, which show a program before and after a viral attack):
(i) Find out another unaffected program on the disk. Say, it was the Order Entry (OE) program.
(ii) Append only the harmful viral code (i.e. without the useful host game/utility) at the end of the “OE”
program.
(iii) Change the first executable instruction in the “OE” program to jump to this appended viral code. This
is “Instruction 1” in the OE program as shown in Fig. 10.4. At the end of the viral code, add one more
instruction to execute the overwritten first instruction and then jump back to the second instruction of
the OE program.
For instance, Fig. 10.5 shows the viral code appended to the host program. It starts with the label
“VIRUS” in the program. You will notice at label “LLL” that the first executable instruction has been
changed to jump to the viral code. You will also notice that after executing the viral code, it executes
the overwritten instruction “Instruction 1” and then jumps back to “BACK” in the host program to
continue from where it had left.
Hence, when the OE program is executed next, it will execute the viral code once (which in turn will infect one more program on the disk, apart from causing some other damage such as corrupting a file, etc.) and then execute the OE program faithfully. In order to fool the users and prevent detection, the virus designers choose a slow-acting but harmful strategy, i.e. the viral code is harmful (in addition to spreading), but in a gradual manner. The mere execution of the OE program once does not normally reveal the existence of the virus, and thus the virus can spread slowly with each successive execution. At each execution, it finds and infects a new unaffected program.
(iv) The actual viral code contains the harmful instructions. The least harmful virus would just spread to
other programs as described above. A very harmful virus may affect the boot sector or the FAT or
some other Operating System tables so that the entire system is crippled. Then, there are a number
of gradations. For instance, a virus may display a blackmailing message to pay a certain amount of
money to a prespecified person or account. This may be merely for fun or actually may have bad
intentions.
(v) In order to further fool the users and to prevent its
detection, when a virus is attached to a new program,
normally it is not completely executed in one shot. For
instance, spreading to all other programs on the disk
and then executing the actual harmful viral code for
each program (e.g. encrypt or corrupt the executable
files) one by one would be quite time consuming. The
user/programmer who is running this original useful
program such as OE would be quite alarmed by the
unusually slow response. This will help its detection.
In order to catch the user unawares, the virus designer normally puts the virus through three states, viz. Nascent, Spread and Harm.
We will study these with an example in C. Let us again look at Figs. 10.4 and 10.5. (This is one of the possible ways of designing a virus.)
Our sample virus defines a flag variable called Status, which describes the status of the virus itself. Initially, at the time of creation, it is set to “Nascent”. At a subsequent execution, the viral code itself contains instructions to check the status flag and make it “Spread” if it is already “Nascent”. At this stage, the virus looks for another unaffected program and appends itself to it. The Status flag in the newly appended viral code is, at this time, set to “Nascent” only. At a still later execution of the original viral code, the viral code itself contains instructions to check the status flag and to change it to “Harm” if it is already “Spread”. Having changed its own status, it actually damages some other files or data in the system.
This scheme works as follows:
(a) Assume that the virus is already attached to the Order Entry (OE) program. At this time, the status flag
of the virus is set as “Nascent”.
(b) The first time the OE program is executed, it will jump to the viral code from LLL (shown as
“VIRUS” in the figure).
(c) At this stage, it will check the status flag of the virus and jump to PROC-NASCENT, because the
status flag is “Nascent”.
(d) At PROC-NASCENT, it will execute its waiting condition. For instance, some virus may wait for a
few hours or days before it spreads. Some may wait until a specific calendar time. Some others may
wait until the OE program itself is executed n number of times. This condition depends upon the
creativity of the designer and is basically meant to fool the users to prevent its detection. If waiting
is to be continued, the program jumps to “BACK” without changing its status flag, but only after
executing the Instruction 1 belonging to OE itself, so that the execution of OE can proceed without
any problems at least the first time after any virus infection.
(e) After the waiting is done, the status flag is changed to “Spread” in the “PROC-NASCENT” routine. Until this happens, every time the OE program is run, the program runs smoothly, without affecting anything else. It just checks the status, executes Instruction 1 and then comes out by jumping back to Instruction 2. Thereafter, the execution continues as before. The user of the OE program would not even notice the virus, or if at all he does, he would treat it as a harmless creature.
(f) After the status becomes “Spread”, the next time the OE program is executed, it jumps to the VIRUS
as before, checks the status flag, and jumps to the PROC-SPREAD routine.
(g) At PROC-SPREAD, it checks the “spread condition”, i.e. whether spreading is to be continued or
not. Creativity comes into play here too! A virus designer may choose to spread this virus to all the
unaffected programs at once or may choose to spread only to one program at a time for each execution
of the OE program. The designer may even wait for a certain time before spreading to yet another
program. The designer obviously maintains the required counters or indicators in the viral code as per
the desired functionality.
(h) After the spreading activity is over, the status code is changed to “Harm”. After this, it comes out of
this routine and goes back to execute the OE program harmlessly, this time.
(i) If spreading is not over, it finds another unaffected program and repeats its performance. It copies
itself at the end of a new victim program and modifies the first executable instruction to jump to
this viral code as discussed earlier in (iii) of Sec. 10.6.2. It inserts the proper address values for the
instructions GO TO VIRUS at LLL and GO TO BACK at RRR.
(j) When the OE program is executed the next time, it jumps to the viral code as before. It again checks
its status flag, which will be “Harm” this time, i.e. it will neither be “Nascent” nor “Spread”. It,
therefore, now executes the “PROC-HARM” routine and goes back. There are different ways in
which the programs and data of the users or the Operating System can be damaged.1 After causing
harm, it can again resume the execution of the OE program (if at all it can proceed!) unless the virus
is quite vicious, destroying the OE program completely.
This completes the overview of how a typical virus functions. This is only one of many ways in which a
virus can be designed.
As is obvious, the virus designer must know the internal working of the Operating System well enough to create a new virus, and he must know how executable files are maintained (i.e. their formats) by the compiler and handed over to the Operating System. The virus needs to know where to find the first executable instruction (Instruction 1) so that it can be changed. After the viral code is appended to the executable file, the file size
will also change and it will have to be reported to the File System of the Operating System. Accordingly,
some free disk blocks will have to be allocated for the additional code of the appended virus. The virus
designer will have to allocate those free blocks, delink them from the free list or chain, write this new viral
code into those blocks and also change the file size, etc. Therefore, the virus must actually perform some
of the Operating System functions though in a crude way (or make use of the system calls available in the
Operating System to perform these functions).
Normally, the virus detection program checks for the integrity of the binary files. The program maintains a checksum for each file or, for better accuracy, for subsections of each file as well. At regular intervals, or for better control, before each execution, the detection program recalculates this checksum and matches it with the one originally calculated and stored. A mismatch indicates some tampering with the executable file.
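A minimal sketch of such a checksum-based check is given below in C. The simple rolling sum used here is only a stand-in; a practical detector would use a stronger function and would also have to protect the stored checksums themselves.

#include <stdio.h>
#include <stdlib.h>

/* Compute a simple checksum over the whole file. */
static unsigned long file_checksum(const char *path)
{
    FILE *fp = fopen(path, "rb");
    unsigned long sum = 0;
    int ch;

    if (fp == NULL)
        return 0;
    while ((ch = fgetc(fp)) != EOF)
        sum = sum * 31 + (unsigned char)ch;    /* simple rolling sum */
    fclose(fp);
    return sum;
}

/* Compare the freshly computed checksum with the one recorded earlier. */
int file_is_untampered(const char *path, unsigned long stored_sum)
{
    return file_checksum(path) == stored_sum;
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <stored-checksum>\n", argv[0]);
        return 1;
    }
    printf("%s\n", file_is_untampered(argv[1], strtoul(argv[2], NULL, 10))
                       ? "checksum matches" : "possible tampering");
    return 0;
}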
There are also some programs available which normally reside in the memory permanently and continuously
monitor certain memory and I/O operations for guarding against any suspicious behaviour.
A generalized virus removal program is very difficult to imagine due to the multiplicity of the viruses and
the creativity with which they can be constructed. However, there are some viruses whose bit pattern in the
code can be “predicted”. In this case, the virus removal program scans the disk files for the patterns of the
known viruses. On detection, it removes them. However, if that virus has already damaged some other data, it would be almost impossible to recover the old data, which could have had any values with any bit patterns.
The best way to tackle the virus problem is to prevent it, as there is no good cure available after the infection.
One of the safest ways is to buy official, legal copies of software from reliable stores or sources. (Although
even such copies have been reported to contain viruses at times.) One should be extremely careful about
picking up free, unreliable or illegal software. Frequent backups and running of monitoring programs also
help in detection, and thus subsequent prevention of different viruses.
1 Here again, you could design a virus in a number of creative ways. You could damage a program or a data file slowly, bit by bit, or in one shot.
In 1975, Saltzer and Schroeder put forward the general design principles for protection mechanisms. These principles are outlined in the following sub-sections.
The design of the security system should not be a secret. The designer should, in fact, assume that the
penetrator will know about it. For instance, the penetrator may know the algorithms of cryptographic systems.
However, security can still be obtained because he may not know the keys.
Every process should be given the least possible privileges that are necessary for its execution. For instance, a word processor program should be given access only to the file being manipulated, which is specified at the beginning. This principle helps in countering attacks such as the Trojan Horse. It means that each protection domain is normally very small, but switching between the domains may then be needed more frequently.
No access rights should be granted to a process as a default. Each subject should have to demand the access
rights explicitly. The only thing that the designer has to keep in mind is that in this case, a legitimate user can
be denied access at times. But, this is less dangerous than granting of unauthorized access. Also, the denial of
access is reported or detected, and therefore, can be corrected, much faster than the granting of unauthorized
access rights.
The access rights should be verified at every request from the subject. Checking for the access rights only
at the beginning and not checking subsequently is a wrong policy. For instance, it is not sufficient to verify
access rights only at the time of opening a file. The verification must be made at every read/write request too.
This will take care of a possibility of somebody changing the access rights after a file is opened. However,
this continuous verification can degrade the system performance. Hence, it has to be done quite efficiently.
The design of the security system should be simple and uniform, so that it is not difficult to verify its correct implementation. Security has to be built into all the layers, including the lowest ones. It has to be built into the heart of the system as an integral part of it; it cannot be added on later as a new feature.
The design should be simple and easy to use to facilitate acceptance by the users. Users should not have to spend a lot of effort merely to learn how to protect their files.
Whenever possible, the system should be designed in such a fashion that the access depends on fulfilling
more than one condition, e.g. the system could demand two passwords, or, in a cryptographic system, it
should ask for two keys.
Authentication is a process of verifying whether a person is a legitimate user or not. There are
two types of authentication that are possible, and in fact, necessary. The first is the verification
of users logging into a centralised system. The second is the authentication of computers that are
required to cooperate in a network or distributed environment. We will consider both of these scenarios one by one.
The password is the most commonly used scheme, and it is also easy to implement. The Operating System associates a password with the username of each user, and stores it after encryption in a system file (e.g. the /etc/passwd file in UNIX). When a user wants to log onto the system, the Operating System demands that the user keys in both his username and password. The Operating System then encrypts this keyed-in password using the same encryption technique and matches it with the one stored in the system file. It allows the login only after the two tally. This technique is quite popular as it does not require any extra hardware. But then, it provides only limited protection.
The password scheme is very easy to implement, but it is just as easy to break. All you need to do is to
know somebody else’s password. In order to counter this threat, the designers of the password systems make
use of a number of techniques. Some of these are listed below:
The password is normally not echoed onto the terminal after keying in.
It is also stored in an encrypted form, so that even if somebody reads the password file, the password cannot be deciphered from it. In this case again, even if the encryption algorithm is known to the penetrator, the key is not known, thus making penetration more difficult.
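The following C fragment sketches this idea of storing and comparing only the transformed password. The toy_hash() function is merely an illustrative stand-in and is not the algorithm used by UNIX or any other system.

#include <stdio.h>
#include <string.h>

/* Illustrative one-way transformation; a real system uses a much stronger one. */
static unsigned long toy_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

struct passwd_entry {
    char          username[32];
    unsigned long stored_hash;      /* what the password file keeps */
};

/* Transform the keyed-in password and compare it with the stored form. */
int check_login(const struct passwd_entry *e, const char *typed_password)
{
    return toy_hash(typed_password) == e->stored_hash;
}

int main(void)
{
    struct passwd_entry e = { "u1", 0 };
    e.stored_hash = toy_hash("secret");            /* set at account creation */

    printf("%s\n", check_login(&e, "secret") ? "login allowed" : "login denied");
    printf("%s\n", check_login(&e, "guess")  ? "login allowed" : "login denied");
    return 0;
}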
Three methods can be used for choosing a password: the Operating System itself selects the password for the user, the system administrator decides the password for each user, or the user is allowed to select it himself. A system-selected password may not be easy for an intruder to guess, but the problem is that the user himself may not remember it. As the password is not chosen by the user, it may not have any significance for him, even remotely. MULTICS tried to improve upon this scheme by employing
a password generator that produced random combinations of pronounceable characters, e.g. “Notally” or
“Nipinzy”, which were relatively easier to remember.
Having the System Administrator choose the password is not a particularly good idea either, as more than one person would then know each password, thereby making the system more vulnerable.
If a user selects a password, the user can remember it easily. However, it can be penetrated easily
too! While selecting a password, most users make use of names, family names, names of cities or some
other important words either directly or with reverse spelling, or they may use some other simple
algorithm. Hence, knowing a user whose files are to be attacked, an intruder can guess the possible password
easily.
Morris and Thompson made a study of passwords on UNIX systems in 1979. They collected a list of likely
passwords by choosing various names, street names, city names, and so on. They encrypted them with known
encryption algorithms and then stored them in a ‘password-pool’ in an alphabetically sorted order to facilitate
the search. They then found out how many users had actually chosen their passwords from this pool, and discovered that 86% had done so.
The password length plays an important role in the effectiveness of the password scheme.
If the password is short, it is easy to remember. (Berman found that a password of length of five characters
could be remembered easily in 98% of cases.) But then, a short password is easy to break. If a password is too long, it is difficult to penetrate, but it is also difficult for legitimate users to remember. Therefore, a tradeoff is necessary. It is normally kept 6–8 characters long.
If the password is 8 characters long, assuming the ASCII coding scheme with about 95 printable characters, the total number of possible passwords would be 95⁸, i.e. roughly 6.6 × 10¹⁵. This, coupled with different encryption schemes, makes the number of possible permutations very large. If an intruder wanted to store all these possible passwords, the password file would be very large and would require a number of tapes to store it. The search would be quite time consuming. In fact, the encrypting and storing activity itself would take years, thus making penetration very, very difficult.
This scheme is used especially if the password length is very short. Along with each password, a long but meaningful message or phrase is predetermined. For instance, for a specific password, it
could be a full sentence such as:
“I AM READING A GOOD BOOK”
The Operating System would then apply some algorithm on the bits of this message to arrive at a shorter
derived bit pattern or additional shorter password. It then stores this additional shorter password (maybe after encryption) along with the original password. The user is supposed to supply the original password as
well as the long message to the system at the time of logging in. The system applies the same encryption
algorithms and then compares both the passwords with the ones stored. It allows a login only after both
match. An advantage of this scheme is that it is difficult to break and does not require a large storage space. The disadvantage of this scheme is that too many characters need to be keyed in by the user. This can be either error-prone or tedious.
Salting is a technique suggested by Morris and Thompson to make it difficult to break somebody’s
password. This breaking usually implies maintaining a large list of possible passwords and then encrypting
and storing them in a sorted order. The salting technique appends a random number n to the password before
the encryption is done.
For instance, if GODBOLE is one such password, and the random number is say 1005, then the full string
“GODBOLE1005” is encrypted and stored in a password file. The random number 1005 is selected by the
system itself and stored in the password file, so that anybody can read it. Hence, if an intruder can guess the
“GODBOLE” part of the password, he can append 1005 to it to get “GODBOLE1005” and then get an entry.
However, if he wants to maintain a list of possible passwords, it becomes a far more difficult task. It could have been 1004 as well. In fact, if n is a four-digit number, a list of possible passwords would have to contain all the values from GODBOLE0000 to GODBOLE9999. This would be a time- and space-consuming task.
The advantage of this scheme is that it makes life difficult for an intruder who maintains a file of possible passwords. If an intruder knows an exact password such as GODBOLE1005, this method would not help. In this scheme, if the password changes, a new random number is selected by the system. The user does not have to remember it; he does not even have to know it. The system itself calculates, stores and compares these random numbers each time a password is used.
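A sketch of salting in C is shown below. Here the salt is a four-digit random number chosen by the system and stored in the clear, exactly as described above; toy_hash() is again only an illustrative stand-in for the real encryption.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative one-way transformation; a stand-in for the real encryption. */
static unsigned long toy_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

struct salted_entry {
    char          username[32];
    int           salt;            /* e.g. 1005, readable by anybody        */
    unsigned long stored_hash;     /* hash of "GODBOLE1005", not of "GODBOLE" */
};

static unsigned long salted_hash(const char *password, int salt)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%s%04d", password, salt);   /* append the salt */
    return toy_hash(buf);
}

void set_password(struct salted_entry *e, const char *password)
{
    e->salt        = rand() % 10000;                /* system picks the salt */
    e->stored_hash = salted_hash(password, e->salt);
}

int check_password(const struct salted_entry *e, const char *typed)
{
    return salted_hash(typed, e->salt) == e->stored_hash;
}

int main(void)
{
    struct salted_entry e = { "u1", 0, 0 };

    srand((unsigned)time(NULL));
    set_password(&e, "GODBOLE");
    printf("%s\n", check_password(&e, "GODBOLE") ? "match" : "no match");
    return 0;
}

Note that the salt need not be kept secret; its only purpose is to blow up the size of any precomputed pool of encrypted passwords that an intruder might maintain.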
Some Operating Systems ask for multiple passwords at different levels. This makes penetration more difficult. The additional password could be demanded at the very beginning or intermittently, at a random frequency. This might irritate a legitimate user, but it provides additional security. Assume that a user has logged on to a machine and gone for a cup of coffee, and an intruder is trying to access some information at his terminal. He certainly would be quite stumped to see a password demanded at an unexpected time.
An Operating System, at random intervals, may ask predetermined questions to the
user challenging him to prove his identity. For instance, the system may display a random number and expect
the user to key in its square or cube as per the predecided convention between the legitimate user and the
Operating System. This convention will, of course, differ from user to user, and has to be stored by the Operating System in some certified way in that user's record, such as the user profile record. A variation of this is to ask the user questions on the lines given below:
Where were you born?
What was the name of your Maths teacher?
What is your exact height in inches?
The questions and the expected answers are assumed to be known only to the individual user apart from
the Operating System. Different questions asked intermittently can be used to guard against an intruder using
a terminal left by a user during the “coffee break” after logging in.
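The square/cube convention described above can be sketched in C as follows; the structures and names are illustrative, and the per-user convention would come from that user's profile record.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum convention { SQUARE, CUBE };

struct user_profile {
    char            username[32];
    enum convention conv;          /* agreed between user and system */
};

static long expected_response(long challenge, enum convention conv)
{
    return conv == SQUARE ? challenge * challenge
                          : challenge * challenge * challenge;
}

/* Returns nonzero if the response matches the prearranged convention. */
int verify_response(const struct user_profile *u, long challenge, long response)
{
    return response == expected_response(challenge, u->conv);
}

int main(void)
{
    struct user_profile u = { "u1", SQUARE };
    long challenge, response;

    srand((unsigned)time(NULL));
    challenge = rand() % 90 + 10;                 /* random two-digit challenge */

    printf("Challenge: %ld\nResponse: ", challenge);
    if (scanf("%ld", &response) != 1)
        return 1;
    printf("%s\n", verify_response(&u, challenge, response)
                       ? "identity confirmed" : "challenge failed");
    return 0;
}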
In this case, the Operating System maintains a list of all the legitimate users and their work telephone numbers. After a user keys in the username, the Operating System consults this list and automatically dials back the telephone number to ensure that it is the same user. At this juncture, a prerecorded
voice message could ask certain questions to the user who has picked up the phone. A question could be: “key
in your date of birth now.” The user has to key this in, to get an entry into the system. An alternative to this is
to have a voice recognition system along with the Operating System, and to validate the user when he speaks
on the phone. This scheme is fairly good, but it can be expensive in terms of extra equipment and telephone
charges, especially if the user is at a remote location. This scheme also does not work satisfactorily if unauthorized users start making use of the ‘Call forwarding’ facility from other authorized users. Also, in the case of a wrong number being dialled by the Operating System, the called person may not be exactly amused to hear a voice message demanding his date of birth at odd hours. To build a foolproof security system, the telephone system would have to be changed drastically, which is a tall order.
In this scheme, the Operating System forces the user to change the password at
a regular frequency. This is done so that even if an intruder has found out a password, it would not be valid
for a long time, thus reducing the window of damage.
An extreme case of changing the passwords at a regular frequency is to force the user
to use a different password each time. For each user, a list of passwords can be prepared by the Operating
System or the system administrator, and stored in the system. A user keeps one copy of the same with him.
For the first time, the first password in the list has to be keyed in for the successful login. The user is forced
to key in the ‘next’ password from the list each time he logs onto the system. The Operating System as well
as the user have to keep track of the ‘next’ password that is valid. When the list is exhausted, a new list can be
generated or one could start from the beginning (wrap-around). This scheme has only one major drawback: losing this list is not exactly a safe proposition.
Many Operating Systems allow a user to try a few guesses (typically 3). After these unsuccessful attempts, the Operating System logs the user off. Some Operating Systems go to the extent of disabling the user account altogether, which means that no more logins are allowed. If such a user keys in his username, the Operating System discards the request with an appropriate message. The user then has to contact the system administrator to reinstate the system access.
The drawback of this scheme is that it can be misused. If a person knows all the usernames, he can try out
all of them, including that of the system administrator, one by one, to knock the whole system out. It is for
this reason that some Operating Systems disable the user account only if it is not the last account belonging
to the system administrator.
Some systems make use of artifacts such as machine-readable badges with magnetic stripes or some kind of electronic smart card. The readers for these badges or cards are kept near the terminal from which the user is going to log in. Only on supplying the correct artifact that he possesses is the user allowed to use the system. In many cases, this method is coupled with the use of a password. This method is popular in Automatic Teller Machines (ATMs).
Physical characteristics such as fingerprints, patterns in the retina of the eye, hand shapes and facial characteristics can also be used. These techniques are also called “biometric” techniques for obvious reasons.
In the case of fingerprints, for instance, the computer uses scanners to capture and store a database of the
bit patterns of fingerprints for different users. When the user wants to access the machine, he inputs his ‘id’
into the computer, and gives the fingerprints again. For that user, the Operating System then accesses the
stored bit pattern of his fingerprints in the database using his ‘id’ as the key. It then uses a pattern matching
algorithm to verify the identity of the user.
Another way is to use finger length analysis. Each terminal can be fitted with a device to measure the lengths of all five fingers. At the time of creating a user (setting up his identity), the user has to insert his
fingers in the machine. The machine measures the lengths of the fingers and creates a database against that
user for future use. When the same user wants to login, he inserts his fingers in this device again after the
user ‘id’ is supplied. The Operating System then uses the user-id as the key and accesses the database of
fingerlengths for that user. It then compares these lengths with the lengths of the fingers measured afresh.
A mismatch suggests the possibility of mischief, on the assumption that the fingers do not grow in length with the passage of time. (A creative idea would be to generate a mild electric shock in case of a mismatch!)
In this section, a number of mechanisms that are employed to protect the system resources
—hardware or software—will be studied.
One of the main problems that any File System has to tackle is to protect the files from unauthorized
users. Confidential information from a very sensitive file should not be accessible to any ordinary user
for reading, let alone for changing or deleting. In some cases, it may be necessary to prevent a user, or
users belonging to a certain group, from accessing a complete directory, i.e. all the files and directories
underneath it. This function of ‘protection’ is not treated as important in single-user systems such as CP/M or MS-DOS. However, in multiuser systems, it assumes enormous importance. In fact, like
files, it may be necessary that certain devices are accessible only to certain users. The same thing may
also be true about processes, databases or semaphores. The Operating System has to have a generalized
strategy to deal with all of them. All such items are called objects, which need to be protected by giving
certain access rights to known subjects who want to access these objects. A subject, in reality, could
be and normally is a process created by either a user or the Operating System.
For various objects, the Operating System allows different Access Rights for different
subjects. For example, for a file, these access rights can be Own, Write, Append, Read, Execute (OWARE),
as in AOS/VS of Data General machines. UNIX has only Read, Write and Execute (RWX) access rights. For
example, for a printer as a device, the Access Rights can be ‘Write’ or ‘None’ only.
In general, we can list different access rights that can be granted as shown in Fig. 10.6. In Fig. 10.6 ‘Modify
Protection (M)’ means the ability to modify the Access Rights themselves. All the rest are self-explanatory.
One way is to mention all the access rights for each
object explicitly for each subject. This is a simple
method, but requires more disk space.
An interesting alternative could be to organize the
access rights in the form of a table as shown in Fig. 10.6
in such a way that the presence of any access right in the
table implies the presence of all the ones preceding it in
the table. For instance, if a process is allowed to delete a file (D), it should certainly be allowed to execute it (E), read it (R), append to it (A), update it (U) or modify
its protection (M). Similarly, if a process is allowed to
update a file (U), it should be allowed to read it (R) but
not allowed to delete it (D).
Thus, one could specify only one code against a file to specify all the access rights for it according to this
scheme. In this case, we could associate only one code for a subject (user) and object (file) combination. If
a user creates a process, the process inherits this code from the user. When it tries to access a file F1, this
access right code could be checked before granting any access to that process on F1. This scheme appears
deceptively simple, but it is not very easy to create this hierarchy of access rights in a strict sense that only
one code implies all the rest above it.
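The single-code scheme can be sketched in C as an ordered set of access levels, where holding a level implies all the weaker ones. The particular ordering used below is only an assumption for illustration (Modify Protection is omitted for brevity); the actual ordering is the one given in Fig. 10.6.

#include <stdio.h>

enum access_level {          /* higher value => stronger right (assumed order) */
    ACC_NONE = 0,
    ACC_READ,
    ACC_APPEND,
    ACC_UPDATE,
    ACC_EXECUTE,
    ACC_DELETE
};

/* One code per (user, file) pair is enough under this scheme. */
int access_permitted(enum access_level granted, enum access_level requested)
{
    return granted >= requested;     /* a right implies all weaker ones */
}

int main(void)
{
    enum access_level code_for_f1 = ACC_UPDATE;    /* stored against user/file */

    printf("read:   %s\n", access_permitted(code_for_f1, ACC_READ)   ? "yes" : "no");
    printf("delete: %s\n", access_permitted(code_for_f1, ACC_DELETE) ? "yes" : "no");
    return 0;
}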
Why does a process inherit the access rights from the user who has created it? This is because mentioning all the access rights for each process for all the files explicitly would be not only very expensive but also an infeasible solution. The reason: at any time, a very deep process hierarchy can be created. The number of processes existing at any moment in the system is, hence, not easily predictable. Implementing this
scheme would require the users/administrators to assign all these access rights for each file separately for
every process existing in the system. Any time a new file is created, it will entail a huge exercise. At the run
time, the system will have to allocate a huge amount of memory and require long search times.
An easier solution is to assign these access rights to each user for different files. When any user creates a
process (e.g. runs the Shell program) to access a file, the assigned access rights for that file are passed on to
the user process from the user. If that process in turn, creates a child process, the child process can inherit the
access rights from the parent process. Therefore, permanently, the system need not store a matrix of access
rights for different files for different processes, but it can maintain a matrix of the access rights for different
files for different users.
The Operating System defines another concept called a domain, which is a combination of objects and a set of different access rights for each of the objects. You could then associate these domains with a subject such as a user or a process created by him. This is depicted in Fig. 10.7.
A user process executing in domain 0 has an access right to Read from or Write to file 0, Read from file
1 and Write onto the printer 0. Domains 1, 2 and 3 are similarly defined. It will be noticed that domains 1
and 2 intersect. This means that an object (in this case printer 1 with access right = Write) can belong to two
domains simultaneously. It also means that a user process executing in either domain 1 or domain 2 can write
onto printer 1.
The same figure drawn in matrix form looks as shown in Fig. 10.8.
If a domain is defined as a set of access rights associated with different objects in the system, the system
will consist of such multiple domains and a user process executes at any time in one of them. This means
that at a given moment, it has certain predefined access rights on different objects. During the execution of
that process, it can change the domain. This is called ‘Domain switching’. For instance, in some Operating
Systems such as UNIX, a process can run in two modes: user mode and kernel mode. When it executes in
kernel mode, it has far higher privileges. It can access far more memory locations, disk areas, different tables
and devices than when it is running in the user mode. This scheme is normally supported by a hardware
mechanism.
A variation of this scheme
could be to organize these domains into a number of
access hierarchies. MULTICS used this concept. AOS/
VS is also based on the same idea. Pictorially, the entire protection space, comprising a number of domains,
could be divided into a number of protection rings as
shown in Fig. 10.9.
Each domain is again a set of access rights for
a set of objects, but the entire protection space
is divided into n domains, 0 to n – 1 in such a way
that domain 0 has the most access rights and domain
n – 1 has the least. A subject such as a process executing
in a specific ring can access all the objects permissible
within that ring. If it changes its domain, a domain
switch results. A domain switch to an outer domain is easily possible because the outer domain is less privileged than the inner ring, but a domain switch to an inner domain requires strict permissions. These protection barriers
are called gates. These are invoked by the Operating
System with the support from the hardware when a less
privileged outer ring subject (e.g. a process) needs to
use the services running in a more privileged inner ring. Many processors such as Intel 80286, 80386 and
80486 support this ring structure.
In terms of software, if the Operating System has to support this multiring structure, it has to treat each
domain also as an object with a limited number of operations such as Call (or enter) and return. Figure 10.10
depicts this. (This figure has no connection with Fig. 10.8.)
In this figure, domains 0, 1, 2 etc. also are mentioned as objects along with
the other objects like files and printers. The figure depicts that domain 0 can
call a routine in domain 1 but domain 1 cannot call any routine in domain 0.
Having called, the called routine also has to return to the original caller routine.
Thus, domain 1 has a right to return to a routine in domain 0 as the figure
clarifies. Anytime a routine is called, involving a domain switch, this table can
be consulted by the Operating System before permission is granted.
This matrix needs to be stored by the Operating System in order to decide which access rights to grant to which user for which files.
In practice, storing the whole matrix is wasteful, because there are so many blank entries in it. There are
three methods used for storing the ACM.
In Access Control List (ACL), the Operating System stores the data by column. For each file, it maintains
the information about which users have access rights. It, of course, skips the blank columns from the matrix,
i.e. for those files, the ACL will be null. It also skips the blank row entries in each column. It is obvious that the best place to maintain the ACL is in the directory entry for that file. It is also clear that even if a file is called by two symbolic names, the physical file is the same and hence, its ACL should remain the same. This is because the symbolic names are ultimately a matter of convenience. Hence, the Basic File Directory (BFD) is
the right place to maintain the ACL. (There is only one BFD entry for a shared file with the same or different
symbolic names.) In UNIX, where i-node performs the BFD function, ACL (the rwx codes) is kept in i-node.
In AOS/VS, each user has a username and password combination. This combination is stored in a record for that user, called the ‘user profile’ (like the /etc/passwd file in UNIX).
Each file in AOS/VS has an entry in the BFD as seen earlier. This entry contains the user name of the
owner i.e. the user who created the file. Also, the BFD for that file contains a pointer to a block containing
the ACL which specifies the access rights for that file. In AOS/VS, ACL is an explicit statement of which user
has which access rights for that file. As we know, AOS/VS allows five types of access rights: Own, Write,
Append, Read and Execute. Therefore, the BFD entry would look as shown in Fig. 10.13. For instance, for
BFD number 2 which may be for a file PAY.COB as a symbolic name, user u0 can execute it and user u1 can
append to it.
There is a design problem in this scheme. While one file could be used by dozens of users, another file may have just one user. Hence, the ACL entries are variable length records in the table shown in Fig. 10.13.
One way to get over this problem is to have multiple fixed-length records, if one ACL record does not suffice. One only needs a pointer to the next ACL record for the same file in such cases. Figure 10.14 depicts this, where file 1 has three ACL records whereas all others have only one. A ‘*’ mark in a pointer field indicates the end of the ACL for that file.
AOS/VS does not employ the concept of a user group consisting of some users identified by a group id (e.g. all programmers in a specific project). Hence, the choice of limiting the length of ACL by specifying the
Access Rights at a group level does not exist. Unlike in UNIX, one is forced to specify ACLs for each file for
each individual user. Therefore, even if non-blank or only valid entries are picked up in each column of the
Access Control Matrix to be specified in the ACL, the ACL is likely to be very long. One way to avoid this is
to use templates. For instance, for a specific file, all usernames whose first two characters are “AP” will have
the R and W access. This scheme is very useful in a project where there are several programmers who should
have identical access rights to a number of files used in the project. In such cases, the System Administrator
assigns the usernames having the same beginning characters to such people in a project group wanting to have
the same access rights. (For instance, AP01, AP02, ...) Thus, the ACL length can be reduced substantially.
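A sketch of such template matching is given below in C. The '*' wildcard notation is only illustrative and is not the actual AOS/VS syntax.

#include <stdio.h>
#include <string.h>

struct acl_entry {
    char pattern[16];     /* e.g. "AP*" (a template) or an exact username such as "u1" */
    char rights[8];       /* e.g. "RW" */
};

/* Match an exact name, or a prefix if the template ends in '*'. */
static int pattern_matches(const char *pattern, const char *username)
{
    size_t n = strlen(pattern);

    if (n > 0 && pattern[n - 1] == '*')
        return strncmp(pattern, username, n - 1) == 0;
    return strcmp(pattern, username) == 0;
}

const char *rights_for(const struct acl_entry *acl, int entries, const char *user)
{
    for (int i = 0; i < entries; i++)
        if (pattern_matches(acl[i].pattern, user))
            return acl[i].rights;
    return "";                                  /* no entry: no access */
}

int main(void)
{
    struct acl_entry acl[] = { { "AP*", "RW" }, { "u1", "E" } };

    printf("AP01 -> %s\n", rights_for(acl, 2, "AP01"));   /* RW     */
    printf("u1   -> %s\n", rights_for(acl, 2, "u1"));     /* E      */
    printf("u2   -> %s\n", rights_for(acl, 2, "u2"));     /* (none) */
    return 0;
}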
In AOS/VS, Access Control is verified as follows:
(i) When a user logs on, he keys in the username (say “u1”) and his password. This is verified by the
Operating System against an entry in his user profile record before any access to the system is granted.
The user profile record also specifies the program that is to be executed after logon, viz. ‘initial
program’. Let this initial program be the Command Line Interpreter (CLI). CLI is like Shell in UNIX.
(ii) The user now executes a CLI process, which accepts a command from the user and executes it before prompting for the next command.
(iii) After the login procedure, the Operating System places the user in the home directory. This home
directory is also picked up from the user profile record which had been specified earlier by the User/
System Administrator when this user was created in the system. The user can move to a new directory
subsequently. At any moment, the directory which he is operating from, is called current directory.
Thus, immediately after logging in, his home directory is the same as his current directory. Let us
assume that this home directory contains an executable program “PAY.COB”. Assume that the user
gives a CLI command to execute this program. As this file is in the same (home) directory, the
pathname resolution is easy. At this time, the Operating System can find the BFD number for this
“PAY.COB” file.
(iv) The Operating System now accesses the BFD entry for “PAY.COB”. Now, using the pointer(s) in the
BFD for the ACL records, it accesses the ACL record(s) for this “PAY.COB” file and searches for
the username (as keyed in and verified with the user profile, i.e. “u1” in this case) in the ACL. Having
found the username, it ensures that the user has an execute (E) access right for PAY.COB. If the user
does not have it (e.g. an inventory clerk trying to run a payroll program), the Operating System rejects
the request with an error message.
(v) If the user has the access right to execute that program, the Operating System finds the file address
after consulting the BFD, reads and consults the header of the executable file of PAY.COB to find out
the file size and then requests the Memory Manager Module to allocate the required memory. It then
requests the Information Manager to load the program in the assigned memory locations and goes on
to request the Process Manager to create a process which, in turn, notes down somewhere in the PCB,
the username of the user running it (i.e. “u1”).
(vi) Now let us assume that this COBOL program “PAY.COB” tries to open a file “EMP.DAT” for
reading. This is given to the Operating System as a system call “OPEN” substituted by the
COBOL compiler for the “OPEN” statement in COBOL. The full pathname of EMP.DAT is given as
a parameter to the system call “OPEN”.
(vii) The Operating System now finds the BFD entry number using the full pathname of “EMP.DAT” and
various SFDs as discussed earlier. The BFD has a pointer to the ACL records for that file (EMP.DAT)
as depicted in Fig. 10.13. The Operating System now reads the ACL for that file using the pointer(s).
(viii) The Operating System checks whether the user (with the username = u1 stored in the PCB) has
the access right “R” for that file. If it does not have it, it rejects the request with an error message;
otherwise it proceeds. For all the subsequent read operations, the Operating System need not again
check the ACL (though according to the strictest security standards, this checking should be done at
every operation).
(ix) Even if a file is being shared and actually being updated simultaneously by a number of processes
initiated by different users (e.g. Airlines booking), the access control verification proceeds as discussed
above. This is made feasible, because, even if different user programs may call the file by the same
or different symbolic names, it will have only one BFD entry in the system which points to a unique
ACL for that file.
(x) Let us assume that “PAY.COB” calls a subprogram called “TAX.COB”. When “TAX.COB” is
called, a separate child process is created for this. It has its own Process Control Block (PCB). Along
with other details, the username “u1” is also inherited from the parent process (in this case “PAY.
COB”) and copied in the PCB of the new child process for “TAX.COB”. If “TAX.COB” now wants
to write to any file, ACL is verified in the same way as discussed above, i.e. the ACL for that new file
is accessed using its pathname and the BFD, and then access rights are verified for that file for u1. It
should have a (W)rite access right, if TAX.COB is to be allowed to update that file.
We hope this clarifies the exact steps that the Operating System goes through for the verification of access rights. We will study how this is done for UNIX in the chapter devoted to a case study on UNIX.
In the ACL method discussed earlier, we had sliced the Access Control Matrix (ACM)
by the column. If we slice the ACM horizontally by a row, we get a ‘capability list’. This means that for each
user, the Operating System will have to maintain a list of the files or devices that he can access and the way in which they can be accessed. For example, Fig. 10.15 depicts the capability list for user 3 in the ACM shown
in Fig. 10.12.
ACL mechanisms allow rigorous enforcement of authorization state changes through the
centralization of data and mechanisms. As against that, the capability systems allow the protected data to
be distributed. In capability systems, it is however very difficult to cancel any access rights already granted.
One of the ways in which the capability mechanism can be implemented is to use a
technique of indirection and to have a central or global segment of capabilities. For instance, Fig. 10.16
depicts the objects 01...05...16 etc. which could denote various files or devices. The central capability
segment or central capability list is also shown in Fig. 10.16. This segment has various slots, each pointing
to some object as shown in Fig. 10.16. Thus, if a file has three possible access rights, viz. ‘R’, ‘W’ and ‘X’,
there will be three slots for that file in this central capability segment (e.g. Object 05 in Fig. 10.16). If there is
only a ‘W’ access right for a printer, there will be only one slot for it (e.g. Object 16 in Fig. 10.16).
Each of these slots in the central capabilities list is serially numbered as shown from 0 to n – 1 in the figure.
Above the central capability list are shown various domains. We know that a domain is a set of access rights
for various objects. Hence, domains are now formed by choosing the required slots in the central capability
list. For instance, a domain could contain slot numbers 9, 5, 1, 8, ..., etc. as shown for Domain X in the figure.
Another domain could comprise slots 7, 6, 4, 1, 5, ..., etc. as shown for Domain Y in the figure.
Thus, each domain has a sub-segment of central capability segment called local capability segment
or local capability list. Figure 10.16 depicts this. Therefore, these slots in the local capability list are just
pointers to the slots in the central capability list. However, for convenience, these slots in the local capability
list also are serially numbered as shown from 0 through 7 for Domains X and Y in the figure. It is these
numbers that are used in the actual “Read” or “Write” instruction as shown in the figure. If we now follow the
pointer chains in the figure from the instruction to the local capability list and then to the central capability
list, the whole operation will become quite clear. The figure shows an interesting feature: multiple slots in local capability lists or domains can point to the same slot in the global capability list (e.g. serial numbers 2
and 3 in Domains X and Y, respectively) and multiple slots in the global capability list can point to the same
object (e.g. object 05 and serial numbers 1, 2 and 3). Therefore, the two instructions, viz. “Read, using Slot
2”, being executed in Domain X and “Read, using Slot 3”, being executed in Domain Y are actually identical.
This scheme has one advantage. It is not too difficult to revoke a capability. If a domain’s specific capability
is to be revoked, the link (A) shown in the figure can be broken, i.e. the slot number in the local capability list
is updated with spaces, ‘*’, or null, to indicate an invalid slot in the local capability list. If an operation itself
is to be disallowed for an object, the link (B) shown in the figure can be broken, i.e. the slot number in the
global capability list is updated with spaces, ‘*’, or nulls to indicate an invalid slot in the global capability list.
Due to the numbering scheme, we can store the global or central capability segment on a disk and read it into the memory at the very beginning for faster operation. We can now store the local capability segments in the respective user profile records, one per user. (In ACLs, we had to store access rights in the
BFDs logically, as we were storing by column.) When a user creates a process, this local capability segment
can be read into the PCB of that process and subsequently transmitted to the PCB of the child process created
by it. When a process wants to carry out any operation on any file, the serial numbers in the local capability
segments can be used as pointers to access the correct slots in the global capability list to verify the access
rights.
Alternatively, domains can be created and their local capability lists can be stored separately on the disk.
These domains can also be serially numbered. The user profile record can specify only the domain number in
this case. After login, when a user creates any process, this local capability segment can be copied from the
domain into the PCB using this number. The subsequent actions will be as discussed earlier.
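The two-level indirection described above can be sketched in C as follows; the slot contents and the revocation step (breaking link (A) or (B)) are illustrative only.

#include <stdio.h>

#define INVALID (-1)

struct capability {           /* one slot of the global (central) segment */
    int  object;              /* e.g. object 05 = some file               */
    char operation;           /* 'R', 'W' or 'X'                          */
    int  valid;
};

static struct capability global_caps[] = {
    { 5,  'R', 1 },  /* slot 0 */
    { 5,  'W', 1 },  /* slot 1 */
    { 5,  'X', 1 },  /* slot 2 */
    { 16, 'W', 1 },  /* slot 3: a printer, write only */
};

/* A local capability list for one domain: indices into global_caps. */
static int domain_x[] = { 1, 0, INVALID, 3 };

/* "Read/Write, using slot s" executed in a domain: follow the indirection. */
int operation_allowed(const int *local, int slot, int object, char op)
{
    int g = local[slot];

    if (g == INVALID || !global_caps[g].valid)
        return 0;                         /* link (A) or link (B) was broken */
    return global_caps[g].object == object && global_caps[g].operation == op;
}

int main(void)
{
    printf("write object 5 via slot 0: %s\n",
           operation_allowed(domain_x, 0, 5, 'W') ? "allowed" : "denied");

    global_caps[1].valid = 0;             /* revoke 'W' on object 5 globally */
    printf("after revocation:          %s\n",
           operation_allowed(domain_x, 0, 5, 'W') ? "allowed" : "denied");
    return 0;
}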
It is possible to design protection mechanisms that employ a combination of ACL and capability list techniques.
In such cases, initially a permission is verified using the ACL technique. At this juncture, a capability is
granted, which is stored in the memory and then used for subsequent operations with the principles of the
capability lists.
Let us take an example of operations on a file to see how this works. In this case, the following sequence
of events takes place:
(i) The Basic File Directory (BFD) of a file, or the i-node in the case of UNIX, contains the ACL for the file, giving the information as to which user is allowed to carry out which operation on that file.
(ii) When a user process opens a file for a specific operation (say, read), this ACL is consulted to find out
what access rights that process has on that file. (The username is inherited and stored in the u-area or
the PCB of that process.)
(iii) If permission is granted, the Operating System generates a capability in the memory and grants it to
this user process.
(iv) Thus, during its execution, each user process possesses different capabilities for different objects
which it stores in its address space. This really then becomes its local capability segment.
It must be noted that here too, the Operating System normally generates a global capability segment
in the main memory, and keeps adding to it when a process requests a new capability. It can then
maintain local capability segments for each process pointing to the slots in the global capability list,
as seen earlier. The u-area or the PCB points to only the local capability list.
(v) While performing subsequent operations on that file, the user process consults its local capability list,
traverses to the correct slot in the global capability list to verify the specific access right required and
then carries out the operation, if permitted.
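The sequence (i) to (v) above can be sketched in a few lines. This is a minimal illustration only; the file name, user names and data structures are invented for the example, and the ACL check, capability grant and capability lookup are reduced to dictionary operations.

# Minimal sketch of the combined ACL + capability approach (hypothetical data).
acl = {"report.txt": {"mita": {"read", "write"}, "atul": {"read"}}}   # stored with the BFD / i-node

global_caps = []          # global capability segment built up by the OS at run time
process_caps = {}         # per-process local capability list: handle -> slot in global_caps

def open_file(user, filename, operation):
    # Step (ii): consult the ACL once, at open time.
    if operation not in acl.get(filename, {}).get(user, set()):
        raise PermissionError("ACL check failed")
    # Steps (iii)-(iv): grant a capability and remember it in the local list.
    global_caps.append((filename, operation))
    handle = len(process_caps)
    process_caps[handle] = len(global_caps) - 1
    return handle

def read_file(handle):
    # Step (v): subsequent operations only follow the capability pointers.
    filename, operation = global_caps[process_caps[handle]]
    assert operation == "read"
    return f"reading {filename}"

h = open_file("atul", "report.txt", "read")
print(read_file(h))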
Any communication in the language that you and I speak—that is the human language—is
called as plain text. That is, plain text can be understood by anybody knowing the language.
Suppose you say “Hi John”, it is plain text because both you and John know its meaning and
intention. However, plain text cannot always be used if you want to communicate over computer networks
for making business transactions.
For example, suppose you want to order two shirts from an online shop. Also assume that your computer
sends a message to the merchant’s computer over some network in plain text. It could be something like “2
shirts, Gordon, credit card number …” etc. (where a person whose name is Gordon is ordering two shirts).
However, remember that the Internet is a network of computer networks. Thus, this message would pass
through many intermediate networks, media and routers (all these are discussed in later chapters) to reach
its ultimate destination. Someone with malicious intentions can get hold of this message on its way and then do a
variety of things with it, as elaborated in the next section.
What are the risks involved in plain text communication over the Internet? Actually, most of these risks apply
even to private computer networks, if they are not controlled properly. However, we shall restrict our discussion
to the Internet. The dangers are classified into three main categories, based on what is compromised:
Someone other than the intended recipient accesses your message (in this case, an order for two shirts). Thus,
there is a breach of confidentiality. Remember, he now has access to your personal information and, more
dangerously, your credit card details. This is a case of interception by a third party. Figure 10.17 depicts a
situation where an intruder C has access to a message going between A and B, thus resulting in the loss of
confidentiality of the message.
In simple terms, cryptography is a technique of encoding (i.e. encrypting) and decoding (i.e. decrypting)
messages, so that they are not understood by anybody except the sender and the intended recipient. We
employ cryptography in our daily life when we do not want a third party to understand what we are saying.
For instance, you can have a convention wherein Ifmmp Kpio actually means saying hello to your boyfriend
John (that is, Hello John!). Here each alphabet of the original message (i.e. H, e, l, etc.) is changed to its next
immediate alphabet (i.e. I, f, m, etc.). Thus, Hello becomes Ifmmp, and John becomes Kpio.
Thus, when John makes a phone call to you and your husband is around, you say Ifmmp Kpio! Only you
and John know its meaning! Cryptography uses the same basic principle. The sender and recipient of the
message decide on an encoding and decoding scheme and use it for communication.
In technical terms, the process of encoding messages is called as encryption. As we know, the original text is
called as plain text. When it is encrypted, it is called as cipher text. The recipient understands the meaning
and decodes the message to extract the correct meaning out of it. This process is called as decryption.
Figure 10.20 depicts this.
Note that the sender applies the encryption algorithm and the recipient applies the decryption algorithm.
The sender and the receiver must agree on this algorithm for any meaningful communication. The algorithm
basically takes one text as input and produces another as the output. Therefore, the algorithm contains the
intelligence for transforming messages. This intelligence is called as the key. Only the persons having
intelligence about the message transformation, that is an access to the key, know how to encrypt and decrypt
messages.
In the substitution cipher technique, the characters of a plain text message are replaced by other characters,
numbers or symbols. Caesar Cipher is a special case of substitution techniques wherein each alphabet in
a message is replaced by an alphabet three places down the line. For instance, using the Caesar Cipher, the
clear text ATUL will become the cipher text DWXO (because A will be replaced by D, T will be replaced by W,
U will be replaced by X and L will be replaced by O). Other substitution ciphers are variations of this basic
scheme. We shall, therefore, discuss the Caesar cipher only.
Let us assume that we want to transform the plain text I LOVE YOU into the corresponding cipher text
using the Caesar cipher. Figure 10.21 shows the process. As we can see, the corresponding cipher text is L
ORYH BRX. This happens because we replace each plain text character with the corresponding character
three places down the line (i.e. I with L, L with O, and so on).
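As a small illustration of the shift-of-three substitution described above, the Caesar cipher can be written in a few lines; this sketch assumes the plain text is in upper case and leaves spaces and punctuation unchanged.

# Caesar cipher: replace each letter by the letter three places down the line.
def caesar_encrypt(plain_text, shift=3):
    result = []
    for ch in plain_text:
        if ch.isalpha():
            result.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
        else:
            result.append(ch)          # leave spaces and punctuation unchanged
    return "".join(result)

def caesar_decrypt(cipher_text, shift=3):
    return caesar_encrypt(cipher_text, -shift)

print(caesar_encrypt("I LOVE YOU"))    # L ORYH BRX
print(caesar_decrypt("L ORYH BRX"))    # I LOVE YOU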
As we discussed, substitution techniques focus on substituting a plain text alphabet with a cipher text alphabet.
Transposition cipher techniques differ from substitution techniques in that they do not simply replace one
alphabet with another: they perform some permutation over the plain text alphabets. Let us study some basic
transposition cipher techniques.
As Fig. 10.23 shows, the plain text message Come home tomorrow transforms into Cmhmtmrooeoeoorw
with the help of Rail Fence Technique.
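The Rail Fence Technique used above can be sketched as follows, assuming its simple two-rail form: the letters of the plain text are written alternately on two rails and then read off rail by rail (spaces are ignored in this sketch).

# Rail fence (two rails): 1st, 3rd, 5th ... letters first, then 2nd, 4th, 6th ... letters.
def rail_fence_encrypt(plain_text):
    letters = [ch for ch in plain_text if ch != ' ']
    rail1 = letters[0::2]
    rail2 = letters[1::2]
    return "".join(rail1 + rail2)

print(rail_fence_encrypt("Come home tomorrow"))   # Cmhmtmrooeoeoorw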
Let us examine the Simple Columnar Transposition Technique with an example. Consider the same plain
text message Come home tomorrow. Let us understand how it can be transformed into cipher text using this
technique. This is illustrated in Fig. 10.25.
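In the Simple Columnar Transposition Technique, the message is written row by row into a rectangle of a chosen number of columns, and the columns are then read out in some agreed random-looking order, which acts as the key. The exact arrangement of Fig. 10.25 is not reproduced here, so the sketch below simply assumes six columns read out in the hypothetical order 4, 6, 1, 2, 5, 3.

# Simple columnar transposition; the column count and read-out order are assumed for illustration.
def columnar_encrypt(plain_text, num_columns=6, column_order=(4, 6, 1, 2, 5, 3)):
    letters = [ch for ch in plain_text if ch != ' ']
    # Write the message row by row into the rectangle.
    rows = [letters[i:i + num_columns] for i in range(0, len(letters), num_columns)]
    cipher = []
    # Read the columns out in the order given by the key.
    for col in column_order:
        for row in rows:
            if col - 1 < len(row):
                cipher.append(row[col - 1])
    return "".join(cipher)

print(columnar_encrypt("Come home tomorrow"))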
Based on the number of keys used for encryption and decryption, cryptography can be classified into two
categories:
This is also called as private key encryption. In this scheme, only
one key is used and the same key is used for both encryption and decryption of messages. Obviously, both
the parties must agree upon the key before any transmission begins, and nobody else should know about it.
The example in Fig. 10.26 shows how symmetric cryptography works. Basically, at the sender’s end, the key
changes the original message to an encoded form. At the receiver’s end, the same key is used to decrypt the
encoded message, thus deriving the original message out of it. IBM’s Data Encryption Standard (DES)
uses this approach. It uses 56-bit keys for encryption.
In practical situations, secret key encryption has a number of problems. One problem is that of key
agreements and distribution. In the first place, how do two parties agree on a key? One way is for somebody
from the sender (say A) to physically visit the receiver (say B) and hand over the key. Another way is to
courier a paper on which the key is written. Both are not exactly very convenient. A third way is to send the
key over the network to B and ask for the confirmation. But then, if an intruder gets the message, he can
interpret all the subsequent ones!
The second problem is more serious. Since the same key is used for encryption and decryption, one key
per communicating parties is required. Suppose A wants to securely communicate with B and also with
C. Clearly, there must be one key for all communications between A and B; and there must be another,
distinct key for all communications between A and C. The same key as used by A and B cannot be used for
communications between A and C. Otherwise, there is a chance that C can interpret messages going between
A and B, or B can do the same for messages going between A and C! Since the Internet has thousands of
merchants selling products to hundreds of thousands of buyers, using this scheme would be impractical
because every buyer–seller combination would need a separate key!
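The scale of this problem is easy to quantify: if every pair of communicating parties must share its own secret key, then n parties need n × (n – 1) / 2 distinct keys. A tiny, illustrative calculation:

# Number of distinct secret keys needed when every pair of parties must share one.
def keys_needed(parties):
    return parties * (parties - 1) // 2

print(keys_needed(4))        # 6 keys for just 4 parties
print(keys_needed(1000))     # 499500 keys for 1000 buyers and sellers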
In public key (asymmetric) encryption, each party has a pair of keys: a public key, which can be shared freely,
and a private key, which is kept secret by its owner. For instance, suppose a bank needs to accept many requests
for transactions from its customers. Then, the bank can have a private key–public key pair. The bank can then
publish its public key to all its customers. The customers can use this public key of the bank for encrypting
messages before they send them to the bank. The bank can decrypt all these encrypted messages with its
private key, which remains with itself. This is shown in Fig. 10.28.
Let us examine a practical example of the public key encryption. In 1977, Ron Rivest,
Adi Shamir and Len Adleman at MIT developed the first major public key cryptography system. This method
is called as Rivest–Shamir–Adleman (RSA) scheme. Even today, it is the most widely accepted public key
solution. It solves the problem of key agreements and distribution. You do not need thousands of keys to be
sent across the network just to arrive at an agreement. All you need to do is publish your public key.
All these public keys can then be stored in a database that anyone can consult. However, the private key only
remains with you. It requires a very basic amount of information sharing among users.
The RSA algorithm is based on the fact that it is easy to find and multiply large prime numbers together,
but it is extremely difficult to factor their product. The following portion of the discussion about RSA is a bit
mathematical in nature, and can be safely skipped in case you are not interested in knowing the internals of
RSA. However, if you are keen to know the mathematical details, you can continue reading. Let us, therefore,
now understand how RSA works. Figure 10.29 shows an example of the RSA algorithm being employed for
exchanging messages in an encrypted fashion.
This works as follows, assuming that the sender A wants to send a single character F to the receiver B.
We have chosen such a simple case for the ease of understanding. Using the RSA algorithm, the character F
would be encoded as follows:
1. Use the alphabet–numbering scheme (i.e. 1 for A, 2 for B, 3 for C, and so on). As per this rule, for F,
we would have 6. Therefore, first, F would be encoded to 6.
2. Choose any two prime numbers say 7 and 17.
3. Subtract 1 from each prime number and multiply the two results. Therefore, we have (7 – 1) × (17 –
1) = 96.
4. Choose another prime number, say 5. This number, which is the public key, is called as Ke in the RSA
terminology. Therefore, Ke = 5.
5. Multiply the two original prime numbers of step 2. Therefore, we have 17 × 7 = 119.
6. Calculate the following:
(original encoded number of step 1)^Ke modulo (number of step 5), i.e. 6^5 modulo 119, which is 41.
7. The number thus obtained (41) is the encrypted information to be sent across the network.
At the receiver’s end, the number 41 is decrypted to get back the original letter F as follows.
1. Subtract 1 from each prime number and multiply the two results, i.e. (7 – 1) × (17 – 1) = 96.
2. Multiply the two original numbers, i.e. 17×7 = 119.
3. Find a number Kd such that when we divide (Kd × Ke) by 96, the remainder is 1. After a few
calculations, we can come up with 77 as Kd.
4. Calculate 41^Kd modulo 119, that is, 41^77 modulo 119. This gives 6.
5. Decode 6 back to F from our alphabet numbering scheme.
It might appear that anyone knowing about the public key Ke (5) and the number 119 could find the
secret key Kd (77) by trial and error. However, if the private key is a large number, and another large
number is chosen instead of 119, it is extremely difficult to crack the secret key. This is what is done in
practice.
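The worked example above (primes 7 and 17, public key Ke = 5, private key Kd = 77) can be verified with a few lines of code. This is only a toy illustration of the arithmetic; real RSA uses very large primes and a proper key-generation procedure.

# Toy RSA check using the numbers from the worked example above.
p, q = 7, 17
n = p * q                        # 119, used as the modulus
phi = (p - 1) * (q - 1)          # 96
Ke = 5                           # public exponent

# Find Kd such that (Kd * Ke) mod phi == 1 (brute force is fine for these tiny numbers).
Kd = next(d for d in range(2, phi) if (d * Ke) % phi == 1)
print(Kd)                        # 77

message = 6                      # 'F' encoded as 6
cipher = pow(message, Ke, n)     # 6^5 mod 119
print(cipher)                    # 41
plain = pow(cipher, Kd, n)       # 41^77 mod 119
print(plain)                     # 6, i.e. 'F'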
Now imagine that A is a buyer who wants to buy something from a merchant B. Using the above techniques,
they can form a secure communication mechanism that can be used only by them. If the buyer A wants to
buy something from any other merchant C, there is absolutely no problem! All A would need to know is C’s
public key. Similarly, all C would need to know is A’s public key. On the same lines, if merchant B wants to
sell something to another customer D, they simply need to know each other’s public keys.
Any encryption scheme has two aims:
1. If A is sending a message to B, only B should be able to read it and nobody else (privacy).
2. B should be able to ascertain that it is indeed A (and not someone posing as A) who has sent the message
(authentication).
In the scheme described above, the first goal is achieved. This is because, when A sends the message to B
by encrypting it using B’s public key, it is only B who can decrypt the message, using B’s private key, which
is known to himself.
However, the second goal is not achieved at all. Instead of A, even C could have sent the message to B,
as everybody—including C—can know B’s public key. In such a case, how do we ascertain that the message
was indeed sent by A only? In order to achieve this second goal (authentication), normally more careful
mechanisms are required. One such method is to use both the public and private keys of both A and B as
shown in Fig. 10.30.
This forms the basis of all secure e-commerce transactions. Different vendors use these schemes with
some variations. But the basic principles remain the same.
Using the techniques described earlier, just as we sign a check, we can sign a computer-
generated document or message. A message thus signed contains our digital signature. This also
involves the same principles of encryption and decryption. When a sender signs a message
digitally, the digital signature establishes the concept of non-repudiation. That is, the digital signature
signifies that the person had signed a message digitally, and it can be proved using the digital signature. The
sender cannot refuse or deny sending a signed message, as the technology of digital signatures establishes the
necessary proof.
Let us understand how digital signatures work, step-by-step, taking the same example of A (sender) and B
(recipient) with the help of Fig. 10.31.
The above scheme works on the principle that the two keys (public and private) can be used in any
order. Hence, the cipher text 1 at points 2 and 4 in the diagram is the same because it is encrypted by B’s
public key and decrypted by B’s private key. Therefore, conceptually, we can remove the portions between
points 2 and 4. Now applying the same principle, the plain text at points 1 and 5 is also the same because it
is encrypted using A’s private key and decrypted using A’s public key. This ensures that the original data is
restored.
The important point is that this scheme provides authentication as well as protection. Authentication is
provided because between points 4 and 5, the decryption is done by A’s public key, and only A could have
encrypted it with its private key. Hence, the message could have been sent only by A, assuming that A has not
leaked its private key. Privacy is protected because, from point 3 onwards, only B can decrypt it with its private key.
Nobody else is supposed to know B’s private key. Hence, nobody other than B can access it. As we shall see,
digital signatures form a refined method to implement this.
We shall now elaborate the steps shown in the above figure to study the process in even more detail.
1. As before, A encrypts the original plain text message (PT1) into a cipher text (CT1) using B’s public
key.
2. Rather than sending the cipher text to B, A now does something more. A runs an algorithm on the
original plain text to calculate a message digest MD1, also called as a hash. This algorithm simply
takes the original plain text in binary format and performs the hashing operation (i.e. it generates a
short, fixed-length fingerprint of the message), producing a string of binary digits which can be treated
as a small text in an unreadable format. Note that this hashing algorithm is public, which means anyone can use it.
However, it is not possible to unhash the message easily, i.e. it is virtually impossible to recreate the
original message from the hash of that message.
Next, A now encrypts the message digest itself. For this encryption, it uses its own private key. The
output of step 2 is called as A’s digital signature (DS1).
3. A now concatenates the cipher text CT1 (generated in step 1) and its digital signature DS1 (generated
in step 2). This whole thing is sent over the network to B. It is like signing a document before faxing
it! B receives the cipher text CT1 and A’s digital signature DS1, as CT2 and DS2, respectively.
4. B has to decrypt both of these. B first decrypts the cipher text CT2 back to plain text PT2. For this, it
uses its own private key. Recall that the original plain text was converted into cipher text using B’s
public key. Therefore, the cipher text can now be decrypted only using B’s private key. Thus, B gets
the message itself in a secure manner. Even if someone has received the cipher text, it is meaningless
for him, since he does not have B’s private key.
5. B now wants to ensure that the message indeed came from A and not from someone who is trying
to befool B into believing that he is A. For this purpose, B takes A’s digital signature and decrypts
it. This gives B, the message digest MD2. Recall that A had encrypted the message digest to form a
digital signature using its own private key. Therefore, B uses A’s public key for decrypting the digital
signature to form the message digest as generated by A.
6. Next, recall that the hash algorithm used by A to generate the message digest is public. Therefore,
B can also use it. B does that, and calculates its own message digest MD3. For this, it runs the hash
algorithm on the decrypted message of step 4 (PT2).
From steps 5 and 6, B has two message digests: one as received by B from A (MD2) and the other that B
himself has created (MD3). B now simply compares the two message digests. If the two match, B can be sure
that the message came indeed from A and not from someone posing as A. Why is this so? This is because,
A’s message digest was encrypted in step 2 to form its digital signature, using A’s private key. It was then
decrypted using A’s public key. Since only A has access to its private key, no one could have created its digital
signature. This makes B confident that the message has come from A. Thus, authentication is assured.
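The sign-and-verify flow can be sketched in a few lines. The hashing step below uses a standard algorithm (SHA-256, chosen here only as an example); the "encryption of the digest with A's private key" is simulated with placeholder functions, since a real implementation would use an RSA or similar signature primitive, and the encryption of the plain text itself (steps 1 and 4) is omitted.

import hashlib

# Placeholder stand-ins for 'encrypt with A's private key' / 'decrypt with A's public key'.
# In a real system these would be public key operations; here they only let the
# overall flow of steps 2 to 6 be followed.
def sign_with_private_key(digest):      # A's side
    return digest

def verify_with_public_key(signature):  # B's side
    return signature

# Sender A
plain_text = b"2 shirts, Gordon, credit card number ..."
md1 = hashlib.sha256(plain_text).hexdigest()   # step 2: message digest of the plain text
ds1 = sign_with_private_key(md1)               # step 2: A's digital signature

# (step 3: the cipher text and DS1 travel over the network to B)

# Receiver B
md2 = verify_with_public_key(ds1)              # step 5: digest recovered from the signature
md3 = hashlib.sha256(plain_text).hexdigest()   # step 6: digest computed by B from the received text
print("message authentic and intact:", md2 == md3)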
Quite interestingly, the digital signature also takes care of the issue of message integrity. Since the message
digest is a small portion of the original plain text, if the two message digests (one created by the sender and
the other by the recipient) match, the recipient can be sure that the message was not tampered with. Of course,
if the message digest is too small in comparison with the original plain text, then there are chances that the
hacker has changed some portion of the plain text that is not a part of the message digest. However, the hash
algorithms that create the message digests are usually written to ensure that a wide area of the plain text is
covered randomly to minimize chances of the loss of integrity.
Thus, cryptography techniques applied in various forms make sure that the three important issues related
to message security, namely confidentiality, authenticity and integrity of the messages are dealt with. In
practice, we use the concept of digital certificates to enable secure exchange of information over the Internet.
Also, the practical implementations of encryption on the Internet are in the forms of two protocols: Secure
Socket Layer (SSL) and Secure Electronic Transaction (SET), which are beyond the scope of the current
text.
The public key encryption scheme is actually called as Public Key Infrastructure (PKI), which deals
with the issues of confidentiality, authentication and authorization using the concepts discussed in this
chapter. PKI is based on the usage of private key–public key pairs and digital certificates/digital signatures.
Over the years, computer Operating Systems have
moved from essentially single-process or single-task
systems to single-processor, multiuser, multitasking
systems. Today, the trend is to opt for multiprocessor multitasking
systems. Two technologies are being developed to harness the
power of a multiprocessor system, viz. distributed processing
and parallel processing. Sometimes, it is a good idea to have
a proper mix of these two techniques, to get the best out of a
system.
The move to these two technologies is similar to the way in
which a business organization evolves. A one-man company is
analogous to a single-user single-tasking system. In this case,
the owner himself has to do everything. He initiates a task and
when that gets over, he takes up the next. As his business grows,
he has to hire more people and distribute his responsibilities
among them. A one-man organization has very little to do in
terms of managing people. When he hires the people and carries
out the function of managing them, the organization is similar to
a multiuser Operating System like UNIX.
At this stage, our entrepreneur keeps the overall control of the
organization, but delegates various functions such as marketing,
planning and finance to the functional managers. When that
happens, different tasks are being carried out simultaneously, but
the control and the direction is central. The next stage of growth
is when the entrepreneur opens up branch offices to run his job
more efficiently by distributing functions to different locations.
This is analogous to distributed computing, which is called
loosely coupled architecture.
In the distributed client–server computing environment, ideally there are server computers which
are huge and are used to handle large databases and/or large computational requirements. There are
client computers which handle the smaller processing requirements at several client places. These client
computers store smaller amounts of data. Essentially, in a distributed computing environment, the jobs to
be done by different processors are well defined in the beginning, and the co-ordination strategies are also
clearly defined.
A parallel processing system is more like a factory floor. On a factory floor, activities are organized in
parallel, to deliver a product in the fastest possible time. In parallel processing, the subdivided parts of a job
which can be done at the same time will be carried out by different processors. The processors carrying out
the computations are closely knit (or tightly coupled in computer parlance). There is a large amount of data
to be shared by different processors. The sharing of this data and the necessary communication facilities
complicate this technology.
One more similarity between parallel processing and a large organization (especially an organization that
develops software) stands out and deserves discussion. A one-man organization can produce a large amount
of output. Similarly, when a processor is doing one job, the relative output can be very high. Unfortunately,
business and today’s computing requirements have gone far beyond what a single-man organization or a
single processor can handle. Hence, it is necessary to work in a large group with different departments,
specializing in different skills, to complete the projects.
Similarly, different computer architectures have to be put together to solve today’s problems. A
large organization loses a lot of time and effort in planning and communication, so its output is not
commensurate with the number of employees multiplied by the output of a one-man company. In parallel
processing also, a lot of processor time has to be devoted to communication and coordination, and therefore
the output of a parallel processor will be lower than the sum total of the outputs of all the individual
processors.
The aim of parallel processing is to accept a given job, break it down into subparts, divide it
among all the available processors and complete it in the shortest possible time.
To achieve this objective, all the processors in a parallel processing environment should be able to
run the same binary machine language instructions. They should all be able to communicate very fast.
They should share common secondary storage devices like hard disks, since this is where the programs
are stored in the first place. They should also share the terminals from where the user interacts with the
system.
The user of a parallel processor need not be aware of the machine architecture. He should feel as if he is
dealing with a single-processor, and his interactions should be no different. For example, if a user is aware
that he is dealing with a 10-processor machine, he may try to divide his job so as to reserve some of these
machines and hold them up for his computing, if such user interfaces exist. Another user may try to divide his
job in such a way that the job runs only on a 10-processor engine. This will not let the application be portable.
If he moves to an 8-processor machine, his job may fail, or it may need to be rewritten completely. Hence, this
kind of interface to the user should not be provided by a parallel processing system.
The technologies of distributed processing and parallel processing have a common aim: to
achieve higher throughput by using more processors. A basic question is: why add more processors
instead of using a faster processor? The reason is simple. It is difficult to achieve higher
throughput out of hardware just by increasing its raw speed. Increasing speed this way is very costly;
faster processors imply higher costs, which rise exponentially with the increase in speed. Cooling and
power requirements also go up with speed, increasing the cost even further.
Higher throughput can be achieved by using commercially available microprocessors and interconnecting
them. This is called distributed processing, or a loosely coupled system. In a parallel processing or tightly
coupled system, there is only one computer with multiple CPUs. In this case, the Operating System is
responsible for the work distribution, communication and coordination. However, there are subtle differences
in the two technologies.
The characteristics of distributed processing are as follows:
(a) The processing may be distributed by location or geography.
(b) The processing is divided by different processors depending on the type of processing to be done, e.g.
all I/O handled by one processor, all user interaction handled by another, and so on.
(c) The processes can be executing on dissimilar processors.
(d) The Operating Systems running on each processor may be different.
The characteristics of parallel processing are as follows:
(a) All the processors are tightly coupled, communicate using shared memory and are packaged in one
housing.
(b) Any processor can execute any job and, hence, all processors are similar, e.g. Intel 486s.
(c) All processors run a common Operating System.
These processing techniques take advantage of the low cost of today’s hardware and try to put the processing
where it is needed. These technologies depend heavily on today’s fast communication networks to share data
and to meet their communication needs.
The throughput from a system can be increased. The improvement in speed will depend
on the type of application. Applications which need very little interprocess communication will show the
largest speed improvements, while applications which are highly sequential will show the least improvement.
Applications which handle arrays and matrices are good candidates for a speed increase. However,
these will need to be written using languages which support parallel processing.
One of the real attractions of parallel processing is the resulting high fault tolerance. If
one processor fails, the process can be rescheduled to run on another processor. This may reduce the through-
put of the system. However, the system will continue to be available. Critical applications like aircraft flight
control cannot tolerate any failure during flying periods. In such applications, multiple processors are kept to
achieve improved availability and good performance.
A parallel computing system can be configured according to needs. As the demands on computing increase,
the system can be upgraded by adding more processors. The upgradation cost will be minimal, since the CPU
costs are only a small part of an entire system.
Using today’s technology, parallel computers with nearly the same power as CRAY
machines can be configured at nearly 1/10th the cost of those huge machines. The maintenance costs are also
lower during the life of the system, since most of the components are available off the shelf.
Human beings think in a sequential manner, one thing after the other. Hence, it is very difficult
for us to write programs which can work strictly in parallel. In today’s languages, many programming
constructs can be handled in parallel. A good example will be a “DO Loop” given below:
DO 100 I = 1, 10
A(I) = B(I) * C(I)
100 CONTINUE
The above example is a sample FORTRAN code which is very common in most applications. This example
can be rewritten as shown below:
A(1) = B(1) * C(1)
A(2) = B(2) * C(2)
A(3) = B(3) * C(3)
A(4) = B(4) * C(4)
A(5) = B(5) * C(5)
A(6) = B(6) * C(6)
A(7) = B(7) * C(7)
A(8) = B(8) * C(8)
A(9) = B(9) * C(9)
A(10) = B(10)* C(10)
Each line of computation can then be processed by a separate processor.
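As an illustration (not from the text, and in Python purely as a stand-in for a language with parallel constructs), the ten independent multiplications of the DO loop above can be divided among a pool of worker processes.

# Each multiplication A(I) = B(I) * C(I) is independent, so the iterations can be
# handed to different processors; here a process pool does the division of work.
from multiprocessing import Pool

B = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
C = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

def one_iteration(i):
    return B[i] * C[i]

if __name__ == "__main__":
    with Pool() as pool:                        # one worker per available processor by default
        A = pool.map(one_iteration, range(10))  # iterations run in parallel
    print(A)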
Another example of parallel processing will be to execute two separate pieces of code before continuing
on a common thread. These separate pieces of code can be executed in parallel on different processors.
One way a parallel processing computer can be used effectively is in multiuser environments. Each
processor can execute the jobs submitted by different users. Since these jobs are normally independent of
each other and do not share any information, they can go on at the same time. In this fashion, the throughput
of a computer system can be increased. An advantage of this kind of system is that the applications can be
written in commercially available languages. The Operating System has to be designed to handle the selection
of a job and a processor to execute that job. The Operating System will also be responsible for keeping track
of the memory which will normally be shared.
Computer systems can be classified in various ways. The most elementary classification is
Von Neumann and non-Von Neumann computers. The single CPU with its set of I/O and
memory devices which it can access over a single bus is called Von Neumann architecture. Any
computer architecture which supports more than one CPU and can access memory from more than one bus
is called non-Von Neumann architecture. According to this definition, parallel processing systems belong
to non-Von Neumann architectures. Distributed computers can be an interconnection of Von Neumann
architectures over a computer network or can also be built around a non-Von Neumann architecture.
All multiprocessor architectures can be classified as tightly coupled or loosely coupled. In a tightly coupled
system, the communication channel between different processors is very fast and the access time to the
channel is generally low. A good example of such a tightly coupled system is shared memory to communicate
between two processors. In other words, the bandwidth of a channel for a tightly coupled system is very high.
When the communication bandwidth is low, the processors are said to be loosely coupled.
A much more appropriate classification for parallel computers is based on concepts put forward by Flynn.
He observed that all computers have two streams: an instruction stream and a data stream. Hence, in any
computer, we can have a single instruction stream or multiple instruction streams, and similarly, we can have
a single data stream or multiple data streams. Combining these possibilities, he classified computers as
discussed below:
The first and simplest classification is of Single Instruction Stream and a Single Data Stream
(SISD) computers. This is the technology of most of the computers available today which have only
one CPU and one path for data through the main memory.
The second classification is of Single Instruction Stream and Multiple Data Stream (SIMD)
computers. This type of computer is built for tasks such as matrix manipulation.
Here, an array of processors executes the same set of instructions, each processor working on different
data. This can speed up the handling of large matrices and arrays. Some parallel computers
have this architecture.
The third classification is of Multiple Instruction Stream and Single Data Stream (MISD)
computers. This class of computers is rarely used.
The fourth classification is of Multiple Instruction Stream and Multiple Data Stream (MIMD)
computers. Most parallel and distributed processors fall in this category. MIMD can also be thought
of as an interconnection of many SISD computers. This intercommunication gives more paths to
memory and I/O depending on the kind of connection, whether it is tightly coupled or loosely coupled.
To be more specific, a loosely coupled MIMD computer will be a collection of computers, each having
separate memory and I/O devices. This can also be called multicomputer system. On the other hand, a tightly
coupled MIMD system will have many processors, each with some private memory and a large shared
memory and shared I/O resources. This kind of configuration is normally termed as a multiprocessor system.
One of the easiest ways to connect multiple processors is over a single backplane or motherboard into which
the CPU, the memory and I/O cards may be plugged. The backplane typically has signals for address, data
and other control information. These backplanes are standardized by international bodies like the IEEE and
have become commonly accepted industry standards. Examples of such backplane standards include VME,
Multibus and Futurebus.
The design of today’s backplane buses allows multiprocessor systems to be built easily. In this kind of
design, each CPU card has some local memory which is used as cache memory. All memory requests are
made through this cache memory and if not found, a bus access is made to the main memory. If the bus
accesses are high, the number of CPUs which can be supported on a bus will be lower. A current thumb rule
is: a bus-based system is ideal for building parallel processors with a maximum of about 20 CPUs.
A bus-based architecture is a good candidate for distributed processing. If each processor is preprogrammed
(by using ROM or downloading the code into each processor’s private memory at startup) to execute specific
functions, the amount of bus access will be reduced, since the code is already available in the local memory.
The processor has to access the bus only for shared data. In such an environment, multiple processors can be
supported on a single bus. Figure 11.1 depicts a bus-based multiprocessor system.
The second possible interconnection for multiprocessors is a crossbar switch as shown in Fig. 11.2. In this
kind of connection, the entire memory is divided into smaller blocks. Access to each memory block is through
a switch and this access is possible to every computer. In the example shown in the figure, the four CPUs can
access four different memory blocks at any specific time. It will not be possible for any two CPUs to access the
same memory block simultaneously. This type of interconnection and access is very similar to the old telephone exchanges.
(Today, thanks to computerization of telephone exchanges, this switching technology is obsolete.) The only
difference between the telephone exchange and the computer memory connection is that every telephone will
need interconnection facility with every other telephone, making this a little more complicated.
We can illustrate the concept of the crossbar switch in a parallel computer assuming four telephones
calling four other telephones waiting for calls at random. Let us designate the calling telephones as C1, C2,
C3 and C4 and those telephones waiting for calls as M1, M2, M3 and M4.
(i) When C1 calls M1, they get a connection established and can continue the connection as long as they
need it.
(ii) If at this moment, C2 wants to connect to M1, C2 gets a busy signal.
(iii) C2 waits till C1 has finished the conversation with M1.
(iv) C1 finishes the conversation with M1 and disconnects. C2 gets connected to M1.
(v) On the other hand, when C1 is talking to M1, C2 can get a connection established with M2, C3 with
M4, and C4 with M3.
Hypercube implies a parallel computer architecture in which, if there are n processors in a system,
each is connected to log2 n (log n to the base 2) neighbouring processors. An 8-node hypercube is shown in
Fig. 11.3. In such an 8-processor system, each processor is connected to three other processors. The number of
interconnections each processor has with its neighbours is known
as the dimension of the system. The 8-processor system in the
above example is known as a 3-dimensional hypercube.
The advantage of this type of connections is that the delay
in communication from one processor to another increases
logarithmically to the total number of processors instead of
increasing proportionately. Each processor is a full computer
with CPU, memory and other I/O devices. Today, a lot of research
work is done around such systems. These computers are also
known as massively parallel computer systems. Thirty-two
and 64 computer hypercubes are commonly available. Current
research is on to create 16-dimensional hypercubes with 65,536
processors.
An interesting property of hypercubes is that an n dimensional hypercube is a proper subset of an n + 1
dimensional hypercube. These are recursive structures and are found to be highly suitable for carrying out
recursive computations.
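In an n-dimensional hypercube, the processors can be numbered 0 to 2^n – 1, and two processors are neighbours exactly when their binary numbers differ in a single bit. A small sketch of this numbering (illustrative only):

# Neighbours of a node in an n-dimensional hypercube: flip each of the n address bits in turn.
def hypercube_neighbours(node, dimension):
    return [node ^ (1 << bit) for bit in range(dimension)]

# 8-node (3-dimensional) hypercube: every processor has exactly 3 neighbours.
for node in range(8):
    print(node, "->", hypercube_neighbours(node, 3))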
The primary aim of an Operating System is to manage the available hardware resources. In a
single processor Operating System, these resources were memory, file system and I/O devices. In
a parallel processor system, there is a new resource to be managed—multiple processors. This new
resource, instead of simplifying matters, complicates the issues further.
In previous chapters we have studied in depth the working of Operating Systems. In a single processor
system, there is only one process running at any time. All other processes are ready to run, waiting for some
resources to be freed, sleeping or waiting for some process to complete before this one can proceed. In a
multiprocessor Operating System, to be efficient, the number of running processes should be equal to the
number of processors available in the system. How can this be achieved? We can achieve it in three ways:
each processor running its own separate Operating System, a master/slave organization, or a symmetric
organization in which all the processors are treated alike.
Let us consider briefly how each of these categories can be implemented and what their advantages and
disadvantages are.
Each processor has its own copy of the Operating System and manages its own file system, memory and I/O
devices. To support multiprocessing, some more primitives must be added. These primitives will support
interprocessor communication and other aspects. This type of Operating System is ideal in a hypercube
architecture. Each node in a hypercube is a standalone computer. A number of primitives have to be added
to communicate and pass messages between each node. To be highly efficient, these messages should be
handled as fast as possible.
The major advantage of this system is that systems can be built around commercially available products.
Also, existing applications need not be rewritten to execute on these systems, to take advantage of the
parallelism supported by the hardware.
The disadvantages are many. A large Operating System executes on each processor. This will take up a
lot of memory and execution time. The allocation of available processors to the processes cannot be handled
efficiently in such a system. Such allocation is normally done by another system which also downloads
various processes to other processors.
In a master/slave system, a supervisory processor handles the Executive Operating System. This system
treats all other available processors, in addition to memory and I/O devices, as resources. The master processor
allocates to each process, various resources (including the processors) necessary for its execution.
One of the major advantages of this system is that since the master processor knows the availability and
the capability of each processor, it can allocate to each processor the processes most suited to it. In addition, it is
comparatively easy to develop such a system.
In a master/slave system, if the master processor fails, the entire system can fail. All interprocessor
communication has to be done through the master processor only. The master processor can become
overloaded if it has to manage a number of slave processors, and this can lead to the low efficiency of the
system.
The ideal operating environment for a parallel processing system is to have every processor identical in every
way. Ideally, all data structures should be available in a global memory area, and each processor should be
able to access it and decide what it should do next. Any processor can do anything that any other processor
can do. All Operating System data structures are kept in shared memory.
Each processor has a very small executive. This is called a microkernel. Unlike the large monolithic
kernels available with an Operating System like UNIX which have all the system services, the micro-kernel
is a very small executive which has only a limited functionality to handle processes and basic I/O, and has
primitives for interprocess communications. All other services, including file system and networking services,
are offloaded and run as higher level services, available to the applications. This type of an architecture allows
very efficient parallel systems. The Operating System as well as the system calls can also be executed on any
processor, and in parallel.
Such an Operating System is very efficient and can utilize the processors to the maximum. Failure of
one or two processors will degrade the performances but not bring the system to a standstill. Adding more
processors in such systems is very easy.
The main disadvantage is the complex nature of the symmetric Operating System. Such a system is difficult
to build and difficult to debug.
One of the most common problems in a multiprocessing system is related to mutual exclusion. There will be
memory areas which need to be accessed by more than one process. These memory areas should not be accessed
indiscriminately. When one process is accessing a memory region, another should have no access to it. Such
memory regions are known as critical regions as seen earlier. A Multiprocessing Operating System should allow
mutual exclusion among processes, while accessing and updating the shared critical memory areas.
These problems are not unique to multiprocessor systems, but are also known to single processor systems.
In a single processor system, since there is no real parallel access to data structures, these problems are not
as complex or difficult. A single processor system handles these shared data structures through semaphores
and monitors.
In a multiprocessor system, one of the easiest ways to handle critical regions and mutual exclusions is by
keeping one processor responsible for this job. This will imply that this critical regions handler be a server
for all critical data handling. Any other process which wants to enter the critical region will have to send a
message to this server. The server will queue up the requests and will grant access to the critical region on a
first-come first-served basis, or on a priority basis. Each client will have to inform the server when its usage of the
critical region is over. The server will then hand over the critical region to the next process in the queue. The
disadvantage with this algorithm is the dependence on a single server. If the server breaks down, or is very
busy, the handling of the critical region will be affected, hence, affecting the performance.
A second solution to this problem is a distributed solution. When a process wants to enter a critical region,
it sends a time stamped message to all the processes requesting for the entry into the critical region. As a
result, the following happens:
(i) A process which does not want the critical region will reply, agreeing to the request.
(ii) A process which has the critical region does not reply and queues up the request.
(iii) If a process which receives it has also sent a request for the same critical region, it checks the time
stamp and sends OK if its own time stamp is higher than that of the other process. Otherwise it does
not send any message.
(iv) A process can enter the critical region if it gets an OK from all the other processes.
In this algorithm, a process has to send a message and get a reply from every other process. This can result
in a lot of message traffic. Also, if a process does not reply because of some failure, the sender will keep
waiting for the reply. Chances of a process crashing out or getting killed are not rare. Hence, this algorithm
can fail very often.
A third algorithm is similar to the one used in token ring Local Area Networks. All the processes can be
arranged in a ring-like fashion by ordering the processes, assigning an address to each process and recognizing
its nearest two neighbours. During start up, the lowest addressed process starts by sending a predefined
message known as a token to its neighbour. Now the following takes place:
(i) When no process wants to enter the critical region, this token passes from one process to another and
circulates among the processes.
(ii) When a process wants to enter a critical region, it waits for this token. When the token arrives, it
accesses the critical region, completes its work with the critical region and exits. On completion of its
work with the critical region, it passes the token to its neighbour.
In a good design, in order to give a fair share to all processes, a process should not enter a second critical
region when it gets the token. Ideally, it should wait for its next chance. Here again, as in everything else,
there is a trade-off between fairness and throughput. This algorithm is efficient and fair. Everybody gets a turn
to grab the critical region. One problem with this algorithm is that a process having the token can die, and
the token can be lost. There are methods available to get hold of the lost tokens and continue token passing.
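The token-passing idea can be sketched as follows. This is only a single-machine simulation in which the "processes" are objects in a list rather than real processes exchanging messages, but it shows the essential rule: only the holder of the token may enter the critical region, and the token is passed to the neighbour afterwards.

# Simulation of token-ring mutual exclusion: the token circulates around the ring,
# and a process enters its critical region only while it holds the token.
class Process:
    def __init__(self, address):
        self.address = address
        self.wants_critical_region = False

    def receive_token(self):
        if self.wants_critical_region:
            print(f"process {self.address} enters the critical region")
            # ... work inside the critical region ...
            self.wants_critical_region = False
            print(f"process {self.address} leaves the critical region")
        # in either case the caller passes the token on to the neighbour

ring = [Process(addr) for addr in range(4)]     # processes ordered in a ring
ring[2].wants_critical_region = True            # process 2 is waiting for the token

token_holder = 0                                # the lowest-addressed process starts the token
for _ in range(len(ring)):                      # one full circulation of the token
    ring[token_holder].receive_token()
    token_holder = (token_holder + 1) % len(ring)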
Another problem which is also common to single processor and multiprocessing systems is deadlocks. In a
multiprocessor system, they occur more often, are more difficult to detect and even more difficult to resolve,
once detected. Deadlocks can be handled in four ways as explained earlier. To recapitulate, these are:
(a) Ignore the problem.
(b) Detect and try to recover.
(c) Prevent, by proper software design.
(d) Avoid, by allocating resources carefully.
We will now discuss these one by one.
The most common way of handling deadlocks is to ignore them. This has been a very successful strategy,
especially in single processor systems. The idea is to act as if we never heard about the problem. Hence,
it is referred to as the Ostrich algorithm. In some cases, it may be best left to the applications to do the needful.
It is very difficult to design a system with deadlock avoidance built in. We will now study how deadlocks can
be detected and prevented.
We can use a central co-ordinator as in the case of mutual exclusion for critical regions. This co-
ordinator keeps track of all processes and resources of the entire system. In this scheme, each process should
send a message to the central co-ordinator whenever it gets new resources. Alternatively, the co-ordinator can
send messages to the processes and get this information. The idea is that the central co-ordinator keeps track
of which processes have what resources allocated to them. Using this information, this central co-ordinator
can now detect a deadlock easily. When the co-ordinator detects a deadlock, it can kill one of the processes
and try to get out of the deadlock.
A distributed solution for the deadlock detection problem is also possible. When Process A is waiting for
a resource which is in use by Process B, Process A sends a message to Process B, specifying the resource and
the process which is blocked (i.e. A) and the one to whom the message is sent (Process B). These are basically
the source and destination addresses. When B receives the message, it in turn sends a message specifying
the processes and resources for which it is waiting, to all the processes on which it is blocked. For instance,
if Process C is waiting for two resources held currently by Processes X and Y, it sends a message to both of
them. All the receiving processes, in turn, create messages and send them to the processes they are blocked
on. If the starting Process A, gets a message, a cycle is completed and this indicates that somewhere there is
a resource which is causing a deadlock. This completes the deadlock detection process. The deadlock can be
resolved by killing one of the processes involved.
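The message-chasing scheme described above can be sketched as a simple search over the wait-for relationships: each blocked process forwards the probe to the processes it is waiting for, and a deadlock is reported if the probe ever returns to the process that started it. The wait-for table below is invented for illustration, and the whole exchange is simulated on one machine.

# Distributed deadlock detection by probe forwarding (simulated).
# waits_for[P] is the set of processes that P is currently blocked on.
waits_for = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A"},          # the cycle A -> B -> C -> A is a deadlock
    "D": set(),
}

def probe(start, current, visited=None):
    visited = visited or set()
    for holder in waits_for.get(current, set()):
        if holder == start:
            return True                  # probe came back to its originator: deadlock
        if holder not in visited:
            visited.add(holder)
            if probe(start, holder, visited):
                return True
    return False

print(probe("A", "A"))    # True: a cycle exists, so one of the processes must be killed
print(probe("D", "D"))    # False: D is not deadlocked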
When a process requires a resource and finds that it is going to block on that resource, a check
can be made to see which process has been running for a longer time. An older process should not be made
to wait for a resource giving way to a newer one. This way a deadlock cycle can never occur. The newer one
can be killed to restart later.
Therefore, a process which has gone through larger computer usage has a better chance of completion.
There are many strategies to avoid a deadlock. Some of these are discussed in the chapter on Dead-
locks (Chapter 10). We will not discuss these here. Parallel processing adds to the complication, the discus-
sion of which is beyond the current text.
One of the new features added to multiprocessor systems is a facility called threads. A thread
can be described as one flow of control within a process. Each thread is associated with only one process,
but a process can have many threads. A process has associated with it an address space, a program counter
and its own stack. The address space can consist of text (i.e. the program object code) and data. Resources
are allocated to a process. All threads within a process share the address space, all resources and data of the
process. Threads are, therefore, similar to tasks (i.e. asynchronous code paths) within a process as defined
in Chapter 5. The only added complication in their scheduling is that they can be concurrently running on
different processors, sharing their address space in the common memory.
Threads, like processes, can create child threads. Each thread runs sequentially till completion. Within a
process, all threads can run in parallel in a multiprocessor system. The kernel support for threads is
normally available as system calls to create, exit, join, detach, yield and identify threads. Create allows the
creation (or, in UNIX parlance, forking) of a new thread. A thread exits or becomes defunct with the exit
call.
A parent thread can create a child and wait for its exit condition and again continue the computation. Join
is available to wait for the death of a child thread. When the parent thread wants to create a child thread and
wants both parent and child to continue, detach can be used. Now the child thread will never be waited for.
If a thread wants to stop processing for some time and wants another thread to take over, the system call yield
can be used to give up the CPU. To get the identity of the thread, the system call “identify” can be used.
Threads can be implemented in two ways. The kernel can support threads directly by allowing each thread
to be scheduled and managed. The kernel can then schedule threads irrespective of which process they belong
to. Such threads are known as heavyweight processes. Another way of implementing threads is in user space.
CPU time is allocated by the kernel for the process. The process has to manage this time among its threads
and handle the scheduling. Such threads are known as lightweight processes. Lightweight processes tend to
be faster, since they do not have to go to the kernel for every rescheduling (context switching).
Users can choose their own scheduling strategy. The disadvantages are that every user process will have
to implement a lot of functionality for scheduling, blocking and managing of threads. An Operating System
is supposed to carry out this functionality and the kernel may be the right place to have this. The kernel has
to keep track of processes and threads and their relationships when they are heavyweight processes. This will
involve a large amount of data to be handled by the kernel, making it slow.
All other functionalities commonly available in an Operating System are moved to the user space. These are
available to an application as services in a Client–Server model. The services which are available in a typical
UNIX environment include file systems, networking, shared memory access and interprocess communication
facilities like pipes, I/O interface to terminals, etc. These are linked to the microkernel through an emulation
library. Instead of UNIX, a system can provide emulation in a similar fashion to other Operating Systems
like MS-DOS or VMS. An emulation library with the necessary servers can be used to handle these services.
Mach handles process scheduling in a two-tier hierarchy and has processes and threads. In Mach, a process
is an environment in which threads can execute. It has an address space for text, data and stack. Resources are
allocated to processes. A thread is an executable and is always associated with a process. A process can have
more than one thread. A thread has a program counter and registers associated with it.
Memory is managed by this Operating System as a memory object, a data structure which can be mapped
into a process address space. When a user level process attempts to read a page which is not in memory,
a page fault occurs and the kernel catches the page fault. The handling of this page fault can, however, be
carried out by a user level process.
Mach has a very powerful interprocess communication facility built in. This is supported by using an
abstraction called a port. Messages are sent and received through these ports. The messages can be queued.
The size of the queue can be set on a per port basis. One of the important features of Mach is the built-in
security associated with these messages. Mach is one Operating System which is designed with security
features built-in and not added as an afterthought. The security for ports is built as capability lists, and
sending and receiving by different processes is possible only if the correct permissions have been granted to
them.
Compared to other Operating Systems, Mach has a very flexible memory management scheme. Memory is
managed by a user process and the kernel together. The kernel handles the hardware Memory Management
Unit, processes page faults and page replacement policies, and manages address maps. The user process
handles the management of the virtual memory-related issues. The functions allocated to the user memory
manager will typically include keeping track of virtual pages which are in use, and indicating where they are
to be—in main memory or backing store. The two—the kernel and the user process—have to co-ordinate and
communicate through a protocol. The advantage of this system is that users can implement or use memory
management schemes best suited for the application.
Mach processes, as in the case of UNIX, can access a large virtual address space—0 to 2^n – 1—on
an n-bit machine. This memory is managed by paging. In a more traditional Operating System, the
process has no control over where the memory requested should be located. The base address is always
the address of the next available location, which can accommodate the requested memory. However,
in Mach, the user process can specify the base address. If no base address is specified, the Operating
System will find a free address space and provide it to the process.
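The flavour of this can be seen in the classic Mach vm_allocate() interface; the sketch below is a minimal illustration (exact constants and prototypes vary between Mach versions), showing one allocation where the kernel chooses the base address and one where the process specifies it.

    #include <mach/mach.h>
    #include <stdio.h>

    int main(void)
    {
        vm_address_t  addr = 0;        /* 0: let the kernel choose a base address */
        vm_size_t     size = 4096;     /* one page, for illustration */
        kern_return_t kr;

        /* anywhere = TRUE: the Operating System finds a free region. */
        kr = vm_allocate(mach_task_self(), &addr, size, TRUE);
        if (kr == KERN_SUCCESS)
            printf("kernel chose base address %#lx\n", (unsigned long)addr);

        /* anywhere = FALSE: the process names the base address itself;
           the call fails if that region is not available. */
        vm_address_t fixed = addr + 10 * 4096;   /* illustrative base address */
        kr = vm_allocate(mach_task_self(), &fixed, size, FALSE);
        if (kr == KERN_SUCCESS)
            printf("allocated at requested base %#lx\n", (unsigned long)fixed);
        return 0;
    }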
Virtual memory allocation in Mach is based on the concept of a memory object. This is a data
structure that can be mapped into the process's virtual address space. A memory object can be a page, a
set of pages or files which are mapped into the memory.
Files can be mapped into virtual memory and they can be read from/written into, as can be done
with memory objects. Paging is used to handle mapped files also. All the changes to a mapped file are
automatically reflected back in the file system when the file is unmapped. When the process terminates, all
files mapped by it are automatically unmapped. Handling files using mapping can make file handling faster.
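Mach's own mapped-file calls are not shown here, but the POSIX mmap() interface conveys the same idea and can serve as a rough analogy; the file name used below is illustrative.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.txt", O_RDWR);       /* illustrative file name */
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);

        /* Map the file into the virtual address space; ordinary loads and
           stores now read and write the file through the paging mechanism. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        if (st.st_size >= 5)
            memcpy(p, "HELLO", 5);               /* reflected back in the file */

        munmap(p, st.st_size);                   /* unmapping flushes the changes */
        close(fd);
        return 0;
    }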
Mach is designed to work on multiprocessor systems. It recognizes the need for sharing of information
between different processes. All threads within a process see the same address space and share all information
across them.
Child processes are created by other parent processes. Child processes in UNIX inherit the environment
of the parent process. However, sharing an address space does not take place. Each process proceeds in its
own programmed direction and special handling is necessary to share information between the two, in UNIX.
Mach has rectified some of these deficiencies. It allows some inheritance attributes to different regions in
its address space. These attributes define if and how the memory address space can be shared between the
child and the parent processes. There are three such attributes. They are: (a) the region will not be used in
child processes, (b) region will be shared between parent and child, and (c) region will be used as separate
copy in parent and child.
Under case (a), no sharing of the region is allowed and a region allocated in the parent is unallocated in the
child. Case (b) allows sharing. Changes made in one process are visible in the other. The third case provides
separate copies to each process.
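A minimal sketch of tagging a region with one of these attributes through Mach's vm_inherit() call is given below; the region, its size and the choice of attribute are illustrative, and details may differ across Mach versions.

    #include <mach/mach.h>

    int main(void)
    {
        vm_address_t addr = 0;
        vm_size_t    size = 4096;

        /* Allocate a region wherever the kernel finds room. */
        if (vm_allocate(mach_task_self(), &addr, size, TRUE) != KERN_SUCCESS)
            return 1;

        /* Tag the region with one of the three inheritance attributes:
           VM_INHERIT_NONE  - (a) region is not used in the child,
           VM_INHERIT_SHARE - (b) region is shared between parent and child,
           VM_INHERIT_COPY  - (c) child gets its own separate copy. */
        vm_inherit(mach_task_self(), addr, size, VM_INHERIT_SHARE);
        return 0;
    }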
A good parallel processing or distributed Operating System needs a strong support for communications. In
Mach, all communication is handled by a mechanism called port. Ports are implemented as protected mail
boxes. The protection facilities ensure that only authorized processes have access to the port. Mach ports are
unidirectional. This means that one process can use one port only for writing or reading, and not for both. To
handle bidirectional communications between two processes, it needs two ports. Communication over a port
is connection-oriented and the kernel ensures reliable and ‘in sequence’ delivery of messages.
A port can belong to a port set. This allows messages to be received from more than one port within a
process, enabling demultiplexing of messages in one process.
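The sketch below gives the flavour of the receive side of this mechanism using the standard Mach calls: a receive right is allocated for a port, and mach_msg() then blocks until a message is queued on it (the sender side and error handling are omitted).

    #include <mach/mach.h>
    #include <stdio.h>

    int main(void)
    {
        mach_port_t port;

        /* Create a port and obtain the receive right for it; senders would be
           given a send right to the same port, subject to the capability checks. */
        mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);

        struct {
            mach_msg_header_t header;
            char body[64];
        } msg;

        /* Block until a message is queued on the port. */
        kern_return_t kr = mach_msg(&msg.header, MACH_RCV_MSG, 0, sizeof(msg),
                                    port, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
        if (kr == KERN_SUCCESS)
            printf("received %u bytes\n", (unsigned)msg.header.msgh_size);
        return 0;
    }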
One of the important features of Mach is that even some of the system calls are implemented as services.
These also need ports to get the specific services performed. When each process is started, it has some default
ports associated with it. These will include a process exception port as well.
Mach is designed as a microkernel and is optimized as an environment for a multiprocessor Operating System.
However, the common functionality found in an OS, such as file management and user interfaces, is handled in
Mach as separate processes. Depending on the requirements, any OS features can be emulated on top of Mach
as shown in Fig. 11.4. This functionality is outside of the Mach kernel and is implemented as a user process.
Normally, this is handled using two user processes running on top of the Mach kernel.
Let us look at the typical implementation of a UNIX interface as shown in Fig. 11.5. One of the two
processes is an emulation library, which carries out the functions of the UNIX system calls. The other one
is a UNIX server which communicates with the kernel and the application. When the UNIX server is started
(or initialized), the server provides the kernel with a list of system calls and the addresses where they are to
be redirected. When an application makes a UNIX system call, the Mach kernel receives the system call and
immediately redirects it to the emulation library. The emulation library analyzes the call and creates an RPC
call to the UNIX server. The server spawns a thread, completes the procedure and returns the control to the
Application Program. The control can go to the application from the UNIX server directly without going
through the kernel.
With the advent of micro- and
minicomputers, distributed proces-
sing is becoming more and more
popular. In this chapter, we will study the
concepts involved in distributed processing
and look at its relationship with the Operating
Systems. Merely distributing the hardware
equipment will not constitute distributed
processing. For example, a large, central computer
with a number of remote terminals connected
to it will not constitute distributed processing,
because neither the processing nor the data is
distributed in any genuine sense in this case.
There can be a number of computers at
different factories and offices of a company with
no connections among them. If a computer at
the chairman’s office has a program to produce
a summary report based on the data from all
locations, the data in the form of tapes, floppies
or paper has to travel to the central location
physically where it has to be re-entered. This
also cannot be called a ‘distributed processing
environment’, though in the strict sense,
the processing is distributed. In distributed
processing, a number of computers have to
be connected to one another by links, enabling electronic data transfer and sharing of data among various
computers (nodes). We will learn that the Operating System has to be structured in a different way to cater to
this form of distributed processing. There are two approaches to this problem: the Network Operating System
(NOS) and the Global Operating System (GOS). We will study the basic principles of both in this chapter.
Distributed processing implies that a number of computer systems are connected together to form a network.
As we will learn, there are different ways of connecting these computers forming different topologies.
The idea is simple. Instead of a centralized computer system with centralized applications, data or control,
you have a number of computers with distributed applications, data or control or combinations of these.
Figure 12.1 depicts the difference between centralized versus distributed processing.
We will now study what we mean by the following terms:
The term “Distributed Applications” really means that different applications or programs reside on different
computers as shown in Fig. 12.1. For instance, one computer may capture, store and process data required for
a payroll application. Another may capture, store and process the data for purchasing application, and so on.
This scheme gives rise to the possibility of capturing the data where it originates. However, we still need the
computers to be connected, because there could be some data which is shared. For instance, the purchasing
application requires the inventory data which may be residing on a different computer. Applications could be
distributed in two ways: vertically (i.e. hierarchically, across the different levels of an organization) and
horizontally (i.e. across peer nodes at the same level).
The applications residing at the zonal level carry out all the zonal processing, including processing of
summaries received from branches and in turn producing summaries for the higher—(divisional) level.
Therefore, one could map the distributed applications in a truly hierarchical style of MIS of any organization.
In this case, the application running on different computers may be the same, say, Sales Analysis. But
within that application, different programs with different capabilities and access rights may reside at different
levels. This is not the case in a strictly horizontal distribution where the application with all its programs are
duplicated at almost all the nodes with a few exceptions.
In a distributed processing environment, the data required by various applications could be maintained as:
Centralized data.
Replicated data.
Partitioned data.
Figure 12.3 illustrates these concepts. The figure shows three nodes networked together. It also shows
three databases normally required for any purchasing function. One is the standard Part database containing
data items such as Part Number, Part Name, Unit of Measure and Price. Another is the Part Balances database
which gives the quantity on hand for every part at that particular node. As each part could be stocked at
different nodes, it may be necessary to maintain this database at each node. A third database is the Sources
database which indicates which part is supplied by which supplier, and vice versa. If the purchasing function
is centralized, this database could be maintained only at one node centrally.
In centralized data, the data resides at only one (preferably the central) node, but it can be shared amongst
all the other nodes, e.g. the Sources database in Fig. 12.3. In such a situation, this central node must run an
Operating System which implements the functions of Information Management. It must keep track of various
users, their files and directories. It must also handle sharing, protection and disk space allocation issues. This
node must also run a piece of front end software—normally a part of the Operating System—which receives
requests for various pieces of data and then services them one by one. This piece, and therefore, the entire
node, is sometimes called a server. This scheme allows each node to have its local, private data but all the
sharable data has to reside on a central server.
In contrast to this scheme, you could distribute or spread your data on various nodes according to the
needs. In fact, out of the total database, you could have only a part as centralized, and the rest distributed.
If a database is required very often at every node, and if transmitting it from a central node to the others
is very expensive and time consuming, then that specific database could be replicated or duplicated at every
node. But then, when any part of the data in the database changes, the updates have to be made at all the desired
locations immediately. Parts database in Fig. 12.3 is shown to be replicated at two locations. It is present both
at locations 1 and 3. In an Airlines Reservations System which connects the required computers at different
locations worldwide, one can imagine that the flight timings and fares would need a lot of replication, because
this information will be required very often at every node. Keeping it only at the central node would increase
the network traffic tremendously and would be very expensive.
This replication also makes sense because it is a relatively stable data changing only at a definite,
preplanned frequency when the time tables and fares change.
A third approach to the distribution of data is partitioned data. In this case, the database is sliced into
multiple portions and different portions reside on different nodes according to the need. For example, the
Balances database in Fig. 12.3 is partitioned. A location may not be physically storing the balances of all
the parts existing in a company. The balances of only those parts that are stocked at that location can be
maintained there. At the same time, a part can exist at multiple nodes. In this case, a balances record
for the same part will exist at all those locations showing the respective balances. This is what we mean by
partitioning.
Processing differs, depending upon the kind of data distribution employed. For instance, any location
other than location 1 wanting any information from the Sources database will have to send a query to location
1. Location 1 will need to have some kind of front end software called an Agent or Server which will receive
this query, get an answer and send it to the inquiring location. Similarly, partitioning of databases means that
if one wants consolidated balances of all the parts, one will have to summarize and collate all the Balances
databases at all the locations.
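To make the last point concrete, the sketch below collates a consolidated balance for one part from the partitioned Balances databases; query_node_balance() is a hypothetical stand-in for the query sent to the Agent/Server at each node, and the figures are illustrative.

    #include <stdio.h>

    #define NUM_NODES 3

    /* Hypothetical stub: in a real system this would send a query to the
       Agent/Server at the given node and wait for its reply. */
    static long query_node_balance(int node, const char *part_no)
    {
        static const long stocked[NUM_NODES] = { 120, 0, 45 };  /* sample data */
        (void)part_no;
        return stocked[node];
    }

    int main(void)
    {
        const char *part_no = "P-1001";          /* illustrative part number */
        long total = 0;

        /* The part may be stocked at several nodes, so every partition of the
           Balances database has to be consulted and the results summed up. */
        for (int node = 0; node < NUM_NODES; node++)
            total += query_node_balance(node, part_no);

        printf("consolidated balance for %s = %ld\n", part_no, total);
        return 0;
    }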
Distributing the control means distributing the function of keeping track of the status of communication
lines and nodes. For instance, in any large network, you always require Network
Management programs which constantly monitor the status of various lines and nodes, and based on that
information, suggest diagnostic/repair actions. They also update the resources tables used in routing
algorithms, if a particular link or a node fails for instance. This control function can be centralized, or it can
be distributed with varying degrees of distribution.
We will study an example of distributed processing to reinforce our ideas about it.
Let us discuss an example illustrated by Fig. 12.4. In this figure, P1, P2, P3 and P4 are different processors
running different programs. T1 is a terminal and D1 through D7 are the disk drives and disks attached to
various computers as shown in the figure.
In this case, an interesting situation can arise. A user wanting to run an Application Program (AP) requires
some data. But this data could be residing on certain disks connected to different processors. There could be
multiple ways of running this program in such a case.
Let us assume that a user at terminal T1 connected to a processor P1 wants to run a program AP residing
in the memory of processor P4. Let us also assume that AP requires data from disks D1, D3, D4 and D5. This
can be achieved in a number of different ways as follows:
(i) The data files from the disks D1, D3, D4 and D5 are transferred to the node P4 where AP resides,
and the program is executed there.
(ii) The program AP is transferred to the node(s) where the data resides, and is executed there.
(iii) Both the program AP and the required data files are transferred to the node P1 to which the terminal
T1 is connected, and the program is executed there.
(iv) The data files continue to reside at the remote nodes. As and when the AP desires any data record from
a remote node, an appropriate request is generated for that remote node. The remote node receives it. The
‘Server software’ at the remote node interprets it, generates an appropriate system call on the Operating
System running on that machine, and on extracting the desired data, sends it to the requesting node over the
communications lines.
The “control” function will have to decide the optimal move strategies for the programs and data for
options (i) through (iii). The decision will be based on the data volumes to be transmitted, data traffic loads at
different nodes and the transmission speeds of different lines. All these options can be executed quite easily if
file transfer utilities are available on all the nodes. The first three strategies are based on transferring the entire
files. In a LAN environment, a powerful computer with larger disk capacity to store all the sharable data is
used to carry out this function. It is called file server. It receives a request for a file, locates it and sends the
entire file across the network to the destination node obviously with the help of the Operating System running
on the server machine.
The last option, viz. option (iv) is very interesting. In this case, the full file is not transferred at all. The
records are transmitted one by one and only as and when required. For instance, a query program may require
information about a specific customer instead of the entire customer file. In such cases, the node where the
sharable data is stored will have to receive the requests for specific records. The software at this server node
will have to search for the requested data, using the indexes or other data structures maintained at that node.
This software, with the help of the Operating System running on the server machine then accesses only those
specific records instead of the entire file, and transmits only those across the network to the destination node.
This software is called Database Server. In fact, the entire node with its hardware and software is normally
termed as a Database Server. This is a very important concept in Client-Server computing as we will learn.
In either of the schemes (i.e. file or database servers), a node or a client workstation needs to access some
data on a remote server node. This could be implemented using Remote Procedure Call (RPC) for READ.
This is very similar to a Read system call studied earlier, except that this Read is performed on the remote
machine. How is this done? Each machine or a node has a Local Operating System (LOS), but that clearly
is incapable of performing this remote Read. The ideal situation will be the one where the AP/user can make
a request to the LOS running on the local machine using the commands of the LOS only, as if the data were
local. In this case, this request would be made like a system call on the LOS (such as DOS or OS/2) only.
We need some kind of software layer running on top of the LOS to trap this system call for READ and
send it to the LOS if the request is for the local data, or else send it to the appropriate remote node over the
network if it is for the remote data. This software layer is provided by a Network Operating System (NOS).
Figure 12.5 depicts the architecture of a typical NOS.
We will study the salient features of any NOS by tracing the steps involved in a remote read.
(a) A separate piece of software called Redirection Software has to exist on the client workstation as
shown in the figure. This is a part of NOS.
(b) If an AP makes a system call which is not related to the Input/Output functions, it is directly handled
by the LOS running on the client workstation as shown in the figure.
(c) Normally, in a non-NOS environment, all I/O calls also are made by the AP to the LOS (Operating
System running on the machine). However, in the NOS environment, this is not true. In such a case,
the call is made to the Redirection Software which is a part of NOS. This is shown in the figure. The
AP making this I/O call has to be aware of the data location (local or remote) and accordingly has to
make the request to LOS or NOS. Hence, NOS cannot be said to implement location transparency
completely.
(d) Different NOS implementations employ different methods to differentiate the calls for the local and
remote data. For instance, in NetWare, if you use a drive number which is F: or above for any
directory, it assumes that the data is remote. Using such a technique, the Redirection Software either
sends the request to LOS if the request is local, or sends it to the remote server if the request is remote.
In NetWare, pieces of software called NET3 or NET4 carry out this redirection function.
(e) If the requested data is remote, the request for it has to travel from
the client workstation to the server. After this, the requested data on the server has to traverse back
to the client workstation. In this case, a piece of software has to ensure that both these messages (the
request for the data and the actual data) are transmitted correctly to the desired nodes. This piece of
software is called Communication Management Software. This also is a part of NOS and has to
reside on both the client workstation as well as the server. This is also shown in the figure.
This software has to implement packetizing, routing, error/flow control functions, etc.; and therefore,
it is normally organized as a set of protocol modules. These
and some other modules carry out these functions. We will call all these functions together by the
name of Communication Management; they make
sure that a message is communicated between the client and the server without any error.
(f) If the requested data is remote, the Redirection Software on the client workstation passes the
request to the Network Services Software residing on the server, as shown in the figure. This, again,
is a part of NOS. This module is responsible for services on the network for sharable resources such
as disks, files, databases and printers. This module will receive many requests from different clients.
It will then generate a task for each of these requests. It will schedule these tasks and service them.
This module will have to have multiple I/O buffers to store data for these different tasks. The NOS
will have to implement some kind of multitasking to achieve this. LOS such as DOS did not have to
do this. Network Services Software has to implement access control and protection mechanisms for
the shared resources like any other multiuser Operating System. This is another way in which NOS
differs from any LOS such as DOS.
(i) The Network Services Software on the server finally communicates with the Information Management
module of the Operating System running on the server to get the requested data. Here again, there
could be two approaches. One is to build in the capabilities of Information Management in the NOS
such as NetWare itself. Another is to run a separate Operating System, such as UNIX, on the server.
In the latter case, the Network Services Software module of NOS has to make appropriate calls to
the Operating System running on the server for getting the data finally. This is how NetWare and
UNIX can coexist on one machine.
(j) After getting the desired data, the Network Services Software on the server hands it over to the
Communication Management Software on the server, which transmits it back to the client workstation,
taking care of the error, flow and sequence control, packetizing, routing, and sessions control functions.
In practice, the file server and the printer server are often combined into one. But you could also have a modem server or a plotter server or a name server and so on.
It is also possible to have all these servers as separate computers. Again, you could have two computers with
large disks acting as two file servers. This can complicate matters substantially. The figure shows only a
simplified version of the scenario.
We had mentioned in point (d) that there is a mechanism by which the NOS running on the client
workstation determines whether the request is local or remote. Let us illustrate this by an example of NetWare
NOS. Imagine that there are a number of client workstations running DOS connected to a server running
NetWare. There can be multiple servers, but let us assume that there is only one server in our example. The
server maintains an applications directory (APPLNS), data directories, and
some other system directories for internal use by NetWare to store passwords, accounting information, etc.
for different users. The APPLNS directory contains various programs that a client workstation can execute,
provided the user has the access rights. Wordprocessor (WP) is one such program very commonly used by
the client workstation. In many cases, it can be stored by client workstations on their respective local disks
to avoid the network traffic. But in our example, we will assume that the client workstation has to get it from
the server before execution.
What will the user logged on to the DOS running on the client workstation have to do in this case? The
user will have to use a MAP command to map the directories on the server to various drives starting from F:
because we know that drives F: onward are reserved for the data on the remote (server) node. For instance, the
user can map the APPLNS directory on the server to the drive F:.
NET3 or NET4 (depending on the version) is the Redirection Software running on the client workstation
which stores these mappings in its own table for future use.
Now, when the user wants to run the WP program from a client workstation, he needs only to key in F: to
indicate that he wants to access the remote APPLNS directory on the server. (In NetWare, the drive numbers
between A: to E: are treated as local drives, by convention.) Therefore, NET3/NET4 of NetWare realizes that
it is a remote drive. It then refers to the mapping table mentioned in Fig. 12.7 to find out the directory on the
server the user desires to go into. In this case, it is the APPLNS directory on the server, denoted by F: drive.
NetWare running on the client workstation communicates this desire to its counterpart on the server in a
pre-agreed format of frames/packets. The NetWare software on the server receives these packets, disassembles
and interprets them, and checks whether that remote user has the access rights for that directory on the
server. If everything is OK, it sends a confirmatory reply—again in pre-agreed formats of frames/packets.
The NetWare software on the client workstation receives these
packets, disassembles them, interprets the message and then changes the user’s working directory.
At that moment, the user on the client workstation is in the APPLNS directory on the server and can just
type “WP” to execute the wordprocessor on his client workstation. The Redirection Software of NetWare
running on the client workstation knows that the file WP in its working directory is actually a file residing
on the remote server, and that it has to be downloaded in the client workstation’s memory first before it can
be executed. The client workstation
and the server then open a session, and the file transfer takes place chunk by chunk, after the file is broken
into frames/packets, and so on. After the WP file is loaded in the memory of the client workstation, the
control is handed over to the local DOS, and the WP program now starts executing under the local DOS.
All this is fine. But what will happen if an Application Program (AP) stored on the local disk wants to
access the remote files such as Purchase Order (shown as PO in the figure) and Returns (shown as RTN in the
figure) during its execution? In this case, when is this mapping supposed to happen? At the compilation time
or the run time? Should the programmer be aware of it? If the programmer declares that both PO and RTN are
remote files, but for reasons of efficiency, if one or both of these files were transferred to the local disk before
the execution of this program, what will happen? Obviously, there could be a problem!
For this reason, normally, there is a provision to do these mappings at the run time. This can be done by
simply parameterizing the file names. At the beginning of the AP, the AP prompts for these file names, and
then the user can mention the appropriate drive numbers where they are located. When the instruction starts
getting executed, if the drive number is less than F:, the Redirection Software of NetWare running on the
client workstation passes the instruction to the local DOS. If, however, the drive number is F: or higher, it
sends the request to the remote server over the network.
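The essence of this redirection decision can be sketched as follows; the routine names are illustrative stand-ins and not NetWare's actual interfaces.

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the two paths a request can take. */
    static void pass_to_local_os(const char *path) { printf("LOS handles %s\n", path); }
    static void send_to_server(const char *path)   { printf("NOS redirects %s\n", path); }

    /* Redirection Software: decide local versus remote from the drive letter. */
    static void redirect(const char *path)
    {
        if (path[1] == ':' && toupper((unsigned char)path[0]) >= 'F')
            send_to_server(path);      /* drive F: or above => remote data */
        else
            pass_to_local_os(path);    /* A: to E: => local data */
    }

    int main(void)
    {
        redirect("C:\\LOCAL\\REPORT.TXT");
        redirect("F:\\APPLNS\\WP.EXE");
        return 0;
    }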
When an AP running under DOS gives an instruction such as “Open .....”, the compiler generates an
I/O function call or system call which executes interrupt 21 Hex or 33 in decimal. The parameters that are
required to be passed to this function call are the three mentioned above, which the instruction of the AP
specifies. The compiler generates instructions to load the
registers with the three parameters derived from the instruction in the AP as given below, and then generates
an instruction to transfer control to the interrupt 21 Hex. At run time, the interrupt then is executed.
After these registers are loaded, the interrupt 21 Hex is executed. The interrupt 21 Hex assumes that the
registers contain the required parameters and carries out the operation. The operation may complete
successfully, or there could be some problem (the file not found or access denied, etc.). The code indicating
the success/failure and the reasons for the failure, if any, are loaded by this interrupt 21 Hex itself in a
specific register. The AP therefore needs an
instruction to test that register to ascertain the success of the operation. The compiler itself generates it just
after the interrupt (function call) instruction. The compiler then generates the instructions to take appropriate
actions in the case of an error.
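As a sketch of this sequence, assuming a 16-bit DOS compiler such as Turbo C and its dos.h interface, the fragment below loads the registers for the 'open file' function (AH = 3Dh) of interrupt 21 Hex and then tests the carry flag to ascertain success or failure; the file name is illustrative.

    #include <dos.h>
    #include <stdio.h>

    int main(void)
    {
        union REGS inregs, outregs;
        const char *name = "PO.DAT";       /* illustrative file name */

        inregs.h.ah = 0x3D;                /* INT 21h function 3Dh: open file    */
        inregs.h.al = 0x00;                /* access mode: read only             */
        inregs.x.dx = (unsigned)name;      /* DS:DX -> ASCIIZ file name          */
                                           /* (small memory model assumed, so DS
                                              already points to the data segment) */

        intdos(&inregs, &outregs);         /* execute interrupt 21 Hex */

        if (outregs.x.cflag)               /* carry flag set => failure */
            printf("open failed, error code %d\n", outregs.x.ax);
        else
            printf("open succeeded, file handle %d\n", outregs.x.ax);
        return 0;
    }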
However, the server machine may be running a different Operating System with its own system calls, expecting the
parameters to be loaded in a different set of registers before any system call on the server can be executed.
It may also have word lengths,
data representations, etc. entirely different from those of the client workstation. Hence, some conversion
is necessary, and its extent depends on what the situation
demands. (For instance, Microsoft LAN Manager runs on top of an Operating System such as OS/2,
which carries out the I/O functions; Microsoft LAN Manager is responsible for functions like multitasking
of the network requests.) Let us take some cases as examples to clarify our point. In the first case, let us assume that the
Operating Systems running on the client workstation and the server are the same.
In this case, the request for remote data along with the three parameters discussed above can be transported
to the server as it is. The server executes the call and obtains a code denoting
the success/failure of the operation. This, as a returned parameter, can be sent back to the client workstation,
where the control will
pass to the AP running on the client workstation at instruction 4, from where it will resume, as if the whole
operation were local. The AP will test the returned code to
determine the success/failure and then take the appropriate action according to the instructions in Fig.12.8.
Now take the case where the server is a machine
with the same architecture as the client workstation, but the server Operating System (NetWare) differs from
the client workstation Operating System (DOS). In this case, NetWare maintains its files and directories on
the server according to its own file system. It has separate system calls from DOS function calls. The NetWare
software, therefore, has to be
running on both the machines. On the server, a piece of interface software receives all these requests from
different client workstations, and queues and schedules them. It creates one task for each of these requests
and manages these tasks and the memory allocated to these tasks. The memory will be required to store the
code for each task as well as for the I/O buffers for each task. This is why, apart from multitasking, NetWare
has to implement Memory Management schemes as well. This is what makes it different from DOS.
When a task is scheduled, another piece of software within NetWare interprets the request and loads the
registers as expected by NetWare, executes the call on the server, and sends the resulting returned
parameter to the client workstation. Here, an interface software on the client workstation responsible for this
receives it and loads the registers as expected by DOS, and
the processing proceeds as if the whole operation were local. The control now is passed to the AP. The AP at
the local client workstation tests these registers, and then takes the appropriate action as per instruction 4 in Fig.12.8.
Note that, on the server, NetWare uses its own registers and codes to
denote the success/failure of the operation. The interface module of NetWare will then need to transport the
actual data (in case of success) as well as the returned parameter to the interface module of NetWare running
on the client workstation, which loads them in the form that DOS uses to
denote the success/failure. The AP on the client workstation can then resume as if all these operations were
local and proceed as per instruction 4 in Fig.12.8.
The interface module of NetWare therefore, carries out the function of translating a remote request
into a form understood by the LOS running on the server. Therefore, the interface module of NetWare
will be different for each different Operating System running on the server. If Novell tomorrow wants
to allow a specific new machine M running its own Operating System to act as a server, it will have to
write this interface software anew to carry out this translation before it can declare something like: “Now
Novell can support VMS on the VAX as the server.” This interface software for this translation is not quite
trivial, as the number of system calls to be translated, their formats, the way they expect the input parameters
and the way they return the output parameters might differ quite substantially from one Operating System to
the other.
A part of any NOS will have to run on both the
client workstation as well as the server. This part is needed in all the cases discussed above. It is responsible
for the communication functions corresponding to
various levels of the OSI reference model. It is concerned with delivering any message (requests for data, the actual
data or acknowledgments) to the destination without any error. We will study various protocols in later
sections.
The ordinary Operating System does not perform these functions, and therefore, it has to depend on
separate software for them. This Communication Management
Software is a part of the NOS. This software has to reside on all the nodes including the server. Without
this, no communication will be possible. The formats and the protocols of all the messages and packets have
to be agreed upon beforehand. This software is normally organized in layers
corresponding to different OSI layers which could reside on the same or different pieces of hardware, as we
shall see later.
The File and Printer Services module controls
the shared resources such as disks, files, databases and printers. This software runs only on a computer
which is also called a server, for convenience. All
requests for shared resources are queued up at the server by this software. They are then scheduled and run
as separate tasks, thereby demanding that the NOS be a multitasking Operating System. There are two phi-
losophies of the NOS running on the server depending upon who (LOS or NOS) carries out this multitasking
function.
The Network Management Software carries out the function of
monitoring the upkeep of the network and its different components. Normally, any network consists of the
hardware such as computers, modems, repeaters, adapters, lines, multiplexers, and so on. These can be tested
by software, i.e. there are specific instructions to test these. The testing is done by sending special packets
from time to time to various hardware equipment and analyzing the responses. Depending upon whether
those packets were received properly by the hardware equipment and based on the time that the equipment
took to respond and the nature of the response, conclusions are drawn about the health of that equipment and
that of the entire network.
The Network Management Software is responsible for all this. It keeps track of all the hardware in the
network. It maintains a list of equipment along with its location and its up/down status. For large networks,
this list can run into hundreds of entries. The list has to be updated when any node or any equipment is
added to the network, or any piece goes out of order. The Network Management Software polls different
hardware pieces and updates the up/down status at a regular interval. It then generates various reports for the
management to take actions in terms of repairs or replacements. It can generate different statistical reports
to point out which are the ‘weak areas’ in the network to cater to in terms of maintenance, spare parts or
redundancy. A very sophisticated Network Management Software can even trigger an alarm at the house of
the hardware maintenance engineer!
Network Management Software can provide inputs to the routing algorithms in the form of a list showing
available and functional pieces of hardware. The routing algorithm will then have to choose amongst various
paths that are possible at any time.
Many vendors who help their customers by installing and maintaining large networks for them also support these networks by writing and providing
the Network Management Software for the network. This piece is usually written on the top of the existing
Operating Systems. The only difference is that in NOS, it becomes an integral part of NOS itself.
Figure 12.9 depicts the NOS environment.
The functions that NOS carries out are restricted to what we discussed above. These include
memory, process or user interface management at a rudimentary level as far as managing various tasks
on the server are concerned. Each user’s request for any resource becomes a task for the NOS which then
has to schedule such requests and allocate memory for them before they are run. The NOS does not utilize
the memory and the processing power of all the computers in the network put together in the most optimal
fashion. For example, some workstations may have excess memory, while some others may not be able to
execute large programs. The NOS does not attempt to solve this problem.
The Global Operating System (GOS) attempts to do exactly that. It maintains a list of all the processes
running on different machines globally. It also maintains a list of resources needed by each process at a global
level. If one processor is busy but another is relatively free, it could be worthwhile to run a new process on the
second processor. The NOS simply does not have this intelligence which GOS has. It is clear that GOS will
have to manage processes in a far more
complicated fashion than any ordinary single processor multiuser system. GOS also manages the memory at
a global level. It may, for instance, have to allocate memory to a single process on
two workstations for that process and then manage the process of Address Translation; NOS does not have
to do it. Therefore, GOS can be viewed as a superset of LOS and NOS. The GOS functions are listed below:
Maintaining global knowledge of the status of processes, memory and I/O devices.
Resource allocation and migration (data, computation and process migration).
Global deadlock handling.
Communication management between the nodes.
Today, very few successful and powerful GOS implementations exist in the commercial world. It is still in
the arena of research and development, as many complicated issues are involved in the design of any GOS.
Many decisions have to be taken through negotiations or compromises amongst the peer nodes. Where should the knowledge (about the status of
the memory, I/O devices, processes, etc.) be maintained? Should it be centralized or distributed? At what
frequency should it be updated? If updated very frequently, the decisions for further allocations would be
more accurate, but then the overheads of updating them will be very high. Another solution is to update it less
frequently and use some sort of “best effort” policy.
The GOS has to use the pool of resources and allocate them to various processes/tasks at
a global level. It may decide on specific migration policies, based on various algorithms to use the available
resources optimally at any time. In general, the GOS has to take decisions about the following migrations:
Data migration.
Computation migration.
Process migration.
We have referred to data migration in an example in Sec. 12.2.5. If a program at site X wants some data
residing at site Y, either the entire file can be transferred, or only those portions that are actually necessary
can be sent (as in the Network File System or NFS). As we shall learn later, these two approaches are similar
to the approaches of a File Server and a Database Server, respectively. Both have merits and demerits. If a full
file is sent, the software to do this is far simpler to write. But then it loads the network quite heavily. Also, if
the program modifies the data, the entire file may have to be transferred back.
If only the updated portions are sent, the data traffic reduces, but the software to manage this becomes
more complex. Depending upon the frequency of these remote data requests, the GOS may decide to migrate
some portions of data from one node to another on a semi-permanent or permanent basis, or it may decide
to replicate the data to improve the performance. However, replication gives rise to further complications for
maintaining the integrity of the databases. This decision requires performance tracking and maintaining the
past history of requests. Obviously, the algorithms to manage all this are fairly complex.
The GOS may have to employ techniques for computation migration also. Imagine that there are a
number of zones within a division, and that at the zonal level, large data files reside. Assume that we want
summaries of these files to be produced at the zonal level and then sent to the divisional level where a Process
p would use them to produce the final results. If we adopt only data migration, we will need to transfer all the
files to the divisional level before another Process, say r, at the divisional level would produce the summaries
and then Process p would process them as local files to produce the results.
Alternatively, if computation migration strategy is followed, another process, say t, at a divisional level
would request for reading and summarizing the file at the remote machine. This can be done by using the
remote invocation mechanism provided by the GOS. A
Process q is created at the remote site on receiving the request from Process t. This Process q then reads
the file locally and produces the summary. While these remote processes are
executing, t gets blocked. When the different processes equivalent to q at all the sites finish producing the
summaries and sending them to t, Process t terminates and wakes p up. Process p now proceeds to produce a
desired report from the summaries. This is called computation migration.
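A rough sketch of this flow is given below; rpc_summarize() is a hypothetical stand-in for whatever remote invocation mechanism the GOS provides, and the zonal figures are illustrative. Process t asks each zone to summarize its own file and blocks until the small summaries arrive; the large files never cross the network.

    #include <stdio.h>

    #define NUM_ZONES 3

    /* Hypothetical stub: on a real system this would create a Process q at the
       remote zone, which reads the local file and returns only its summary. */
    static double rpc_summarize(int zone)
    {
        static const double zonal_totals[NUM_ZONES] = { 10.5, 22.0, 7.25 };
        return zonal_totals[zone];          /* caller blocks until this returns */
    }

    int main(void)
    {
        double summaries[NUM_ZONES];

        /* Process t: request a summary from each zone; the computation runs
           where the data lives, so only the summaries travel over the network. */
        for (int zone = 0; zone < NUM_ZONES; zone++)
            summaries[zone] = rpc_summarize(zone);

        /* Process p: produce the divisional report from the small summaries. */
        double grand_total = 0;
        for (int zone = 0; zone < NUM_ZONES; zone++)
            grand_total += summaries[zone];
        printf("divisional total = %.2f\n", grand_total);
        return 0;
    }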
The most important and difficult decision for any GOS is that of process migration. This involves
deciding which process should run where, and this decision is continuously made even while the process is in
execution. For instance, a process, say P1, might be running at a node where a lot of page faults are occurring
due to less memory being available. However, when the process was initiated, that was the only machine
available. After a while, if a process running on another machine gets over, thereby freeing a large amount of
memory on that machine, the GOS should be able to migrate Process P1 to the new machine.
In order to be able to carry out this process migration in general terms, the GOS has not only to initiate
a process at a node but also has to monitor its performance continuously. It then has to constantly take a
decision whether migration will help. This decision is based on the available resources at that time. Process
migration may also take place simply because a specific computer may be better suited to perform specific
computations or tasks.
There is a trade-off between the gain in the performance of the migrated process versus the overheads of
migration. It may well be that almost immediately after the migration, the process might terminate. Was all
this then worthwhile? How is the GOS to know all this? Another complication in the process migration is
that it is normally a global decision and not only between two machines or two processes. If migration has
to be done at all, and if there are n candidates for migration, the GOS may have to choose the best candidate.
For instance, if Process PA wants larger memory and Process PB wants a faster processor, which of
the Processes, PA or PB, should be migrated to a computer which has larger memory as well as a faster
processor? These are the complex issues involved. Another issue could be that a migrated process may run
far away from the data it uses, forcing the
system to migrate data which may not have been necessary before migration. This data migration may make
the whole scheme very inefficient. Therefore, the GOS has to keep the perspective of “overall efficiency” in
mind.
Some of the reasons why the GOS prefers to carry out the process migration are listed below:
Load balancing: To achieve a uniform utilization of resources. It should not happen that one node is
completely free while at another node, a process cannot get executed due to the lack of resources.
Special facilities: There may be special hardware/software facilities available at a specific node
where a process wanting to use these facilities has to be migrated.
Reduced network load: A process migration might avert the data migration, thereby reducing the
load on the network.
Some of the problems/decisions associated with process migration are listed below:
The first question is that of creating a new, empty process
at the target node. This is simple. But then the next question is that of the process address space migration.
This question becomes important in virtual memory systems, where one part of the address space resides in
the main memory and the other part is on the disk. Should the entire address space be migrated, or only the
portion in the main memory? The former is a cleaner, simpler but more expensive solution. The latter is a
cheaper solution but it is also a tougher one to implement. Remember that in this case, even after migration,
the source node continues to be involved in the process by maintaining the page/segment tables. The most
complicated situation is where the address space of a process resides on multiple disk and memory modules.
The address translation would be a horrendous task for any GOS, in this case!
At the time of migrating a process, the GOS has to provide for a mechanism for storing the pending
messages for that process temporarily somewhere and then direct them to the new destination after the
migration is complete. Therefore, the migrating process has to leave the forwarding address at the source
node for this redirection to be possible.
An important point in all these migrations is the level of transparency maintained by the GOS for the
user. At one extreme, the user/programmer specifies what is to be migrated where and when. At the other
extreme, the user is completely unaware of it, which means complete location transparency. The GOS aims at
achieving the latter. But then, that makes the life of the GOS designers quite complicated. However, if that is
achieved, all the users get a uniform view of the GOS as depicted in Fig. 12.11. Note that this happens despite
the different Local Operating Systems running at different nodes.
The GOS has to maintain a global list of all the resources and
allocate them to various processes. This also includes migrated processes. One of the additional problems is
to detect/avoid/prevent the deadlocks, arising out of the resource allocations.
Deadlock handling in distributed systems is a complex process due to the difficulties in maintaining the
updated list of global resources. Therefore, decisions are normally based only on the local information. This
is the reason why algorithms required in this case are different from the ones used in the non-distributed
environment.
In fact, in a distributed environment, a phantom or false deadlock may arise. This is a case where a
deadlock is detected, when actually none exists. Imagine a situation where there are two Processes P1 and
P2. Assume that P1 is holding R1, and P2 is holding R2 at any given time. We know that if P1 requests
for R2 without which it cannot proceed, and at the same time if P2 requests for R1, without which it also
cannot proceed, a deadlock results. However, in a distributed environment, P1 and P2 could be running on
two different nodes. There has to be another module which has to keep track of all the processes and their
requests for resources networkwide. Assume that this is the module running on one of the nodes which is
also responsible for detecting a deadlock. All the requests, allocations and releases of resources have to be
communicated to this deadlock detection module. The sequence in which these messages reach this module is
very important and can sometimes cause a false alarm of a deadlock. Let us assume that P1 is holding R1 and
P2 is holding R2, but no further requests or releases have been made. Now, let us assume that the following
sequence of events takes place.
(i) P2 releases R2.
(ii) P1 requests for R2.
(iii) P2 requests for R1.
(iv) The messages about all these events reach the deadlock detection module in the same sequence as
listed above.
In this case, actually no deadlock will occur. P1 will get R2 released by P2, it will go to completion
and release R1, which then will be available for P2. P2 also can now go to completion. However, if the
messages of events (ii) and (iii) reach the deadlock detection module before the message of event (i), due to
unpredictable traffic conditions on the network or different line speeds, a deadlock would be reported by this
module. But this is a false or phantom deadlock alarm, because in reality, R2 is no more held by P2. Handling
such situations adds to the overall complexity of deadlock detection in distributed environments.
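The sketch below reproduces this scenario with a toy deadlock detection module. The same three events are fed to it in two arrival orders; when the release message is delayed, the module reports a cycle between P1 and P2 even though no real deadlock exists.

    #include <stdio.h>
    #include <string.h>

    enum { P1, P2, NPROC };
    enum { R1, R2, NRES };

    typedef struct { const char *kind; int proc; int res; } Event;

    /* Toy detector state: who holds each resource, who waits for which resource. */
    static int holder[NRES];
    static int waits_for[NPROC];

    static void detect(const Event *ev, int n, const char *label)
    {
        holder[R1] = P1;  holder[R2] = P2;            /* initial allocation */
        waits_for[P1] = waits_for[P2] = -1;

        for (int i = 0; i < n; i++) {
            if (strcmp(ev[i].kind, "release") == 0)
                holder[ev[i].res] = -1;               /* resource becomes free */
            else if (holder[ev[i].res] == -1)
                holder[ev[i].res] = ev[i].proc;       /* free: grant at once */
            else
                waits_for[ev[i].proc] = ev[i].res;    /* otherwise the process waits */

            /* Two-process cycle check: P1 waits for a resource held by P2 and
               P2 waits for a resource held by P1. */
            if (waits_for[P1] >= 0 && holder[waits_for[P1]] == P2 &&
                waits_for[P2] >= 0 && holder[waits_for[P2]] == P1) {
                printf("%s: deadlock reported\n", label);
                return;
            }
        }
        printf("%s: no deadlock\n", label);
    }

    int main(void)
    {
        Event in_order[]  = { {"release", P2, R2}, {"request", P1, R2}, {"request", P2, R1} };
        Event reordered[] = { {"request", P1, R2}, {"request", P2, R1}, {"release", P2, R2} };

        detect(in_order,  3, "messages in actual order");   /* no deadlock        */
        detect(reordered, 3, "release message delayed");    /* phantom deadlock   */
        return 0;
    }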
Deadlock detection module itself can be a centralized function or a distributed one. Again, it can be a
hierarchical distribution or a horizontal distribution. In the former case, a tree structure is formed for
controlling various processes. At any node of this tree, the information of all the resource allocations of the
dependent nodes in the tree is maintained. This allows the deadlock detection to be carried out at lower levels
as well instead of at the root node itself. In horizontally distributed control, all the processes cooperate in
detecting deadlocks. This requires a lot of information about resource requests, allocations and releases
to be time-stamped and exchanged between these nodes. This would involve a lot of overheads.
Apart from the deadlocks arising out of resource allocations, deadlocks can occur in the communications
system also due to the buffers getting full in the ‘store and forward’ method. As the GOS is responsible for all
the communications as well, it has to have a means of handling these deadlocks too! This was not necessary
in the non-distributed environments.
Deadlock prevention in a distributed environment uses similar methods as used in a centralized system.
These methods normally try to prevent the circular wait condition, or the hold and wait condition. Both of
these methods require a process to determine the resource requirements in advance, and therefore, are not
very satisfactory in all practical situations.
Some of the partial solutions offered to guarantee the proper and consistent allocation of distributed
resources are listed below:
Decision theory.
Games theory/Team theory.
Divide and conquer.
A detailed discussion on these solutions is beyond the scope of the current text.
Process migration is central to process management in any Distributed Operating System (DOS). In simple terms, this is the
transfer of adequate amount of information and state of a process from one host to another, so that
the process can execute on the target host.
Process migration is a very important aspect of distributed operating systems. This aspect deals with the
decisions of transferring the state of a process from one computer to another computer (i.e. distributing the
processing load). The computer from which the process moves is called the source, and the computer to
which it moves is called the destination.
The source transfers the process to the destination, where it now executes. There could be an additional
computer, the arbitrator, which may help in the process migration activity.
What are the motivations for process migration? The chief ones are as follows:
By making sure that more than one computer is involved in the processing of a client’s re-
quest, we can guarantee continuous availability of resources. More specifically, we can set up the distributed
environment in such a way that some servers are available as backup. So, these redundant servers can take
over from a server, which is either overloaded, or is on the verge of failure because of any errors.
Some client requests may demand special software and/or hardware require-
ments. These may not be available with the server, which is attending to the client’s request. However, these
resources may be available with another server in the distributed environment. In such a case, it is prudent for
the server currently attending to the client to forward the request to the server, which possesses the necessary
resources.
The main decisions in a process migration are actually answers to the following questions:
The decision regarding who should initiate process migration rests on the purpose of the process migration
capability. For instance, if the process migration is to be done for load balancing, then the decision regarding
the process migration may be taken by a module in the operating system, which is responsible for monitoring
the load on the various hosts, and distributing it evenly in case it finds an imbalance beyond an acceptable
threshold. This module would take upon itself the job of preempting a process from the loaded host, determining
where it should be relocated and communicating it to the other hosts in the participating DOS environment.
During process migration, the process on the source host must be removed, and a new process on the target
host must be created. Note that the process is moved, not copied. As such, the process area/image, including
its control information, must be transferred to the target host, and any links
between this process and other processes (such as message passing links, signals) must be appropriately
updated. A key decision is how the address space of the process is to be transferred. The common approaches
are described below.
Eager (All): Here, the entire address space is transferred at the time of migration. This is the cleanest
approach, which leaves no trace of the process on the source host. Of course, this is the most expensive
approach, as well.
Pre-copy: In this case, as the address space is being copied to the target host, the process on the source
host continues execution. The pages that get modified on the source during this copy operation (called the
pre-copying operation) need to be copied to the destination in a second copy operation.
Eager (Dirty): This approach copies only the modified (dirty) pages from the main memory of the source
to the destination. Any additional pages needed by the destination are sent on demand only, directly from
the disk.
Copy-on-reference: Here, pages remain on the source. Only when they are requested for by the destination,
are they sent.
Flushing: This is the other extreme of the Eager (All) approach. Here, the modified (dirty) pages are
flushed on the source. That is, they are saved to the disk of the source, and removed from the main memory
of the source. No pages are sent to the destination. Whenever the destination needs a page, it must access
the disk of the source, and obtain the appropriate page from there.
To resolve the problem of handling incomplete activities (i.e.
messages and signal processing) on the source, a facility for holding the outstanding messages and signals
is provided. They are then directed to the destination appropriately, as a part of the process migration func-
tionality.
Let us quickly study a process migration scenario to get a good understanding of how the theory of process
migration works in practice.
1. Some process on computer A needs to migrate to another computer for some reason. Let us assume
that computer B is the destination.
2. The process on computer A sends a remote message to computer B, informing of this process
migration request. It also sends some of the contents of the process address space and information
regarding open files to computer B.
3. The kernel on computer B forks a new child. The kernel informs the child of the migration request
from the process on computer A.
4. The child process on computer B receives the contents of the process address space, stack, open files
etc., as required for completing its task.
5. The child process informs the kernel on computer B when it is ready to start the process from where
the original process on computer A had left off.
7. The original process on computer A terminates itself, and sends a notification to the kernel of computer
B. On receiving this notification, the kernel on computer B schedules the child process to resume execution,
as appropriate.
In any process migration mechanism, there should be a facility for the destination of the process migration to
refuse receiving an incoming process (i.e. a process that is being migrated from some other host). Moreover,
a host should be able to evict an already accepted process. For instance, a host could be just booting up, and
could already have a queue of waiting (migrated) processes. It should be allowed to accept/refuse them.
In order to facilitate this, a monitor process at each host keeps the track of the current load. It determines if
the host is in a position to accept any more processes. Accordingly, it may decide to evict some processes, in
the event of the maximum limit getting breached. In the case of eviction, the evicted process is migrated back
to the original host. From there, this process may migrate to another host. All processes marked for eviction
are immediately suspended.
The migration of a partially executed process, or one whose creation is complete, from one host to another
requires that the process be preemptive. If it is a non-preemptive process, there is no scope for such migration.
The only possibility of the migration of a non-preemptive process, on the other hand, is if such a process
has not yet begun its execution. Migration of
non-preemptive processes is a lot easier, and is generally useful for load
balancing, as it avoids the overhead of transferring the complete state and all the areas of a process.
We have already referred to the mechanism of the Remote Procedure Call (RPC). We will
now discuss it in a little more detail.
We have studied how a client workstation requests for some data on the server and how the server sends the
data back to the client workstation. There is a procedure on the server to locate and retrieve the data on the
shared disk attached to it. Essentially, this procedure is responsible for the functions in the File System and
the Device Driver on the Server which is a part of Information Management. This procedure is a part of the
Operating System running on the server. When a client workstation requests for some data on the server, this
procedure in the server Operating System is called remotely from the client workstation. Hence, it is called a
Remote Procedure Call (RPC).
A process running on the sender node uses the SEND primitive to activate the message handling module. The Message
Handling Module receives the message from the sender node and sends it across to the destination node.
The Message Handling Module is also responsible for communications management which involves
packetizing as per the data link layer used underneath (e.g. if Ethernet is used, it expects packets of a certain
size and in a specific format). It also is responsible for flow control, error control, etc. A message could be a
request for some data, or it could be the data itself. Essentially, the Message Handling Module ensures that
the entire message reaches the desired node and in correct fashion. The receiving Message Handling Module
hands the message over to the appropriate procedure. The message itself
says which procedure is to be executed on the remote node (e.g. open a file) along with other parameters
such as file name, mode, etc. in the case of file manipulations. We have seen an example of this when we had
considered how a remote read request works in the NOS environment. We have also seen how the parameters
and the results are passed back and forth. Consider a remote call of the form CALL P (A, B), where A = passed parameters
and B = returned parameters. We know that there are two methods of passing parameters to a local procedure.
One is “by value” and the other is “by reference”. When we pass parameters by value, the actual values are
passed as parameters. Therefore, in this method, A and B will be the actual values of the parameters. It is like
the immediate addressing mode used in the assembly language programming. When parameters are passed
by reference, the addresses of the actual parameters (and not their values) are passed to the procedure. It is
like the direct addressing mode used in the assembly language programming. A and B in this scheme would
be addresses of the actual parameters.
Imagine that a file name and a mode are to be passed as parameters at the time of opening a file. Assume that only the addresses of these parameters in the client node were sent to the server node. It is very difficult to let processes on different machines share a common address space, except in a multiprocessing or tightly coupled system where the Operating System has different considerations.
Hence, the server node would not be able to access the addresses where the parameters are stored in the client
machine. The server node would again have to request the client node for accessing those parameters at those
given addresses to get the actual parameter values and then for sending these actual values back to the server.
Instead, the client node could have sent the parameter values in the first place.
Hence, a call by reference across nodes becomes a very slow and time-consuming process, thereby increasing the load on the network. This is the reason why generally only the call by value method is used in an RPC. The actual parameter values are passed by the Message Handling Module from the source node to that at the destination node, where they are received
and then loaded on the appropriate registers as per the expectations of the local procedure. We have seen an
example of this when remote READ in NOS was explained earlier.
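As a rough illustration of the call by value approach, the following Python sketch (the function and field names are our own, not part of any particular RPC package) packs the actual parameter values, rather than their addresses, into a message that the message handling module could then transmit.

```python
import json

def marshal_call(procedure, **params):
    # Pack the procedure name and the actual parameter VALUES (call by value)
    # into a flat byte string that the message handling module can transmit.
    return json.dumps({"proc": procedure, "params": params}).encode("utf-8")

def unmarshal_call(message):
    # On the server side, recover the procedure name and the parameter values.
    request = json.loads(message.decode("utf-8"))
    return request["proc"], request["params"]

# Client side: values, not addresses, travel over the network.
msg = marshal_call("open_file", name="customer.dat", mode="r")

# Server side: the interface module decides which local procedure to run.
proc, params = unmarshal_call(msg)
print(proc, params)   # open_file {'name': 'customer.dat', 'mode': 'r'}
```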
At each node, an interface module coupled to the message handling module discussed earlier runs to ensure packing/unpacking of parameters in the
message in accordance with the packet/frame formats of the protocol used, and also to ensure error-free
communication.
The client process remains blocked until the server completes the call and returns the results, after which the client process becomes 'ready' again. This is exactly
how a remote read operation in any NOS takes place. The interface modules at both the nodes take care of
error-free communication between the client and the server nodes. This takes care of passing of parameters
and transmitting the actual data between the nodes. The actual functioning of various modules will be clear
from the figure.
There are some issues regarding the way the parameters are represented while passing, and the way their
binding takes place. We will now discuss these in brief.
If both the nodes run the same Operating Systems, the way the parameters are passed will be identical, at least within a given programming language. If the hardware, the Operating System or the programming language differs, problems can arise. For instance, on the two systems, numbers or even text could be represented differently. We also know that the word length can affect how the parameters are passed.
For instance, the way parameters are laid out in the registers of one machine (such as a Data General AViiON) may differ from the way another machine expects them. In this case, we need some software layer to do the required conversion. If a proper full-fledged communications software adhering to the OSI standards is used, the presentation layer is supposed to take care of all this. But then, using such software has tremendous overheads. This is the reason most implementations build the conversion into the interface module itself. Hence, this interface module has to be rewritten every time a new workstation with a new Operating System has to be supported on a network.
A recommended approach would be to devise a common standard format. Every interface module then
is designed to have routines both to convert from/to its own formats to/from the standard format. Let us call
these routines R1 (to convert from its own internal format to the standard one) and R2 (to convert from the
standard format to its own internal format). For each machine and its Operating System, both these routines
will have to be written. Both R1 and R2 must also be present on each node as per this scheme. A sending
interface then uses its own R1 routine to convert to the standard format. The receiving interface uses its
own R2 routine to convert from the standard format to its own internal format which is then used to actually
execute the remote procedure.
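A minimal sketch of the R1/R2 idea is given below, assuming (purely for illustration) that the agreed standard format for an integer is a 4-byte, big-endian representation. Every node carries the same pair of routines, whatever its own native representation may be.

```python
import struct
import sys

def r1_to_standard(value):
    # R1: convert a native integer to the agreed standard format
    # (assumed here to be a 4-byte, big-endian representation).
    return struct.pack(">i", value)

def r2_from_standard(data):
    # R2: convert from the standard format back to a native integer.
    return struct.unpack(">i", data)[0]

# A sending interface always applies its R1; a receiving interface always applies its R2.
wire_bytes = r1_to_standard(1154)
print(sys.byteorder, r2_from_standard(wire_bytes))   # e.g. little 1154
```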
If a number of services are provided by a server node, normally a port number is associated with a specific
service. For instance, a port number 1154 could be associated with the function of listing the current users.
A port can be viewed as a specific service provided by an OS, similar to a system call. This is devised so that a request message carries not only the destination node's address but also the port number (specifying the service required) and the parameters required to complete the request. The interface module on the remote node reads this port number and then executes the appropriate system service at that node.
Sometimes, two or more processes on computer A themselves communicate with each other first, before communicating with the remote process on computer B. For instance, a process on computer A could accept the user's inputs, and call another process on computer A to process them further before the request is sent to computer B. In such cases, the cooperating processes on computer A actually behave like threads (although technically they are completely distinct processes). How is this possible? Let us discuss this scheme.
When a process (say X) on computer A begins execution, it advertises (exports) its interface to the kernel
of the operating system on computer A. That is, it informs the kernel that it is ready, and that it expects certain parameters in a specific format. When process X later calls a procedure exported in the same way by another process (say Y) on computer A, it pushes the parameters on to a common stack through the kernel. Process Y executes the call and pushes the result, if any, on to the common stack. Process X pops this return value from the stack. This is shown in Fig. 12.14.
A server process would usually create a thread in response to a client request, thus saving on the overheads of creating and
destroying a process every time.
Process creation in a DOS environment is similar to that in a non-distributed OS environment. However, in the case of DOS, another parameter is also involved, which is the location of the host on which the process is to execute within the network. For keeping track of all the processes within the network, the process manager usually maintains a directory of all the processes. Alternatively, it uses a special process that walks through the kernel queue spaces of all the computers, and gathers this information dynamically. The deletion of a process similarly requires such mechanisms to be in place, so that every concerned host comes to know about the process to be deleted.
To create a distributed OS that includes such a process manager, there are two broad-level approaches:
process-based DOS and object-based DOS. This is shown in Fig. 12.15.
Traditional computing was procedure-oriented. The process-based DOS is also modelled on the same
principles. Here, the network resources are managed as a large heterogeneous collection. In the modern
object-based DOS, each type of hardware along with its software is considered as a single object, and is
accessed/manipulated accordingly. This classification has significance in the way process management is
done. So, let us discuss these two approaches now.
This is the traditional approach to DOS. Here, a client/server paradigm provides the process management
functionality. The link between the various processes is provided by using messages, ports or pipes. The
chief focus here is on the processes and the inter-process communication. This means that the features such
as process creation, scheduling, communication, etc. are crucial here.
These features can be provided in various ways. For instance, there may be a single operating system, and
all the participating hosts could run a copy of that operating system. Operating systems in a DOS environment
are generally created in the form of a distinct kernel on each host. There must be a high amount of cooperation
and data, and message sharing among these kernels. When a process needs to be run, these kernels exchange
messages to decide where it should be run. After this decision is taken, the dispatcher on the selected kernel
must initiate the process. This may involve other activities, such as moving a process out of one kernel’s
memory into another kernel's memory, reorganizing the memory allocations at these operating systems, and so on. The decision of scheduling is based on several factors, such as load balancing, minimum communications, minimum impact on memory contents, etc.
Evidently, synchronization is very vital in a DOS. For instance, when a process on host A needs some I/O
on host B, the process on host A must be moved to the WAITING stage. Then the I/O should take place on
host B, and its results need to be communicated to host A. At this juncture, the WAITING process on host A must be moved back to the READY state. The kernels on the two hosts must cooperate closely to synchronize all these activities.
The object-based DOS is a different way of looking at the underlying hosts, network and the operating system
resources. The whole computing environment is viewed as a collection of objects. An object can represent any hardware or software resource, and each object has a unique identifier, which distinguishes it from any other object in that environment. Objects can change their states in response to external events. Obviously, to achieve this, objects have some inherent properties; it is these properties that are actually changed in response to external events, and this is what we call the
change of state in the object. For instance, a memory chip has a fixed set of memory locations, each of which
can be read or written to. Once these properties are crystal clear, a memory chip can be treated as an object,
and accessed as required.
An object-based DOS is thus management of objects. Each process is treated as a distinct object, and its
properties are manipulated in the desired manner. Thus, process management, in this case, means dealing
with the policies and rules of the process objects, such as their creation, change of states, removing them, etc.
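To make this object view concrete, here is a small sketch (the class and method names are hypothetical) of how a resource such as the memory chip mentioned above could be modelled as an object: it has a unique identifier, some inherent properties, and its state changes only through well-defined operations.

```python
class MemoryChipObject:
    """A resource modelled as an object: fixed properties plus a unique identifier."""

    def __init__(self, object_id, size):
        self.object_id = object_id      # unique identifier within the environment
        self.cells = [0] * size         # the inherent property: a fixed set of locations

    def read(self, address):
        return self.cells[address]

    def write(self, address, value):
        # An external event that changes the object's state.
        self.cells[address] = value

chip = MemoryChipObject(object_id="node3:mem0", size=1024)
chip.write(10, 42)
print(chip.read(10))    # 42
```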
As we have discussed, traditionally, the remote communication between processes consisted of procedural
calls, called as Remote Procedure Calls (RPC). That is, a client application would typically call a
procedure, say, Get_data. It would not know that Get_data actually resides on a server that is accessible
not locally, but remotely over a network. However, the client calls it as if the procedure were available locally. The RPC mechanism hides the network, and manages the interaction between the client, the server and Get_data. This is shown in Fig. 12.16.
With the popularity of object-oriented systems increasing very rapidly over the last decade or so, the
procedural way of calling procedures remotely has also changed. It now has an object-oriented flavour.
Technologies such as Distributed COM (DCOM), Common Object Request Broker Architecture
(CORBA), Internet Inter Operable Protocol (IIOP) and Remote Method Invocation (RMI) are based on the idea that a logical pipe exists between two objects, rather than two procedures, which are physically apart on two different networks, and which interact via this pipe. These technologies fall in the category of Object Request Brokers (ORB). These form the backbone of a part of modern e-commerce transactions. A broad-level ORB call is shown in Fig. 12.17. Note that the calls are now between objects, rather than procedures, as was the case with RPC.
An ORB infrastructure has two main pieces: the client and the server. The client ORB communicates with
the server ORB using a protocol, such as IIOP. Usually, the client ORB is called as the stub, and the server
ORB is called as skeleton. The objects on the client and the server can be created in different programming
languages, and yet they can communicate with each other. In order to facilitate this, a special language,
called as Interface Definition Language (IDL) is used. The IDL is a uniform wrapper over the underlying
programming language, which hides the syntaxes and implementations (and therefore, the differences) of the
usual programming languages.
The network consists of a number of nodes. At each node, there may be some local files/databases. In the
NOS environment, if a user/programmer wants to access a file from another node, he has to know the location
of that file and explicitly specify it before he can make a request to transfer it to his node. In the GOS
environment, this is not necessary. This gives it location transparency.
In the NOS environment, it is often advantageous to keep multiple copies of the same file on different nodes,
so that the time to transfer the file is reduced substantially. The request for a file in this case is satisfied from
the nearest node having the file. To implement this strategy, a study has to be undertaken to identify which
node has been requesting for which file from which remote node and at what frequency. After this study,
multiple copies of the same file can be maintained at the appropriate nodes.
The advantage of this approach is the reduction in the network load and the improvement in the performance,
but the disadvantage is the increased storage requirement, and the need to maintain the consistency in all the
copies of a file. If a node updates that file, the changes have to be propagated to all the copies of a file on all
the nodes. Another disadvantage is the need to monitor the file replication strategy, as the pattern of requests
for remote files may change with the passage of time.
In a network, each node runs an LOS with its own file system. Let us call it the Local File System (LFS).
LFS is responsible for allocating disk blocks to a file and for maintaining the allocation tables such as FAT.
The LFS provides various services such as “create a file”, “delete a file”, “read from a file” and “write to a
file”. It maintains the directory structure for all the local directories and files. It allows the user to change his
working directory, list all the files in a local directory and implement the access controls on the local files
and directories.
The problem arises when a user wants to carry out all these functions on a remote file. The Distributed File
System (DFS) aims at allowing precisely that. The ultimate aim is to allow any user to see the entire structures
of files and directories on all the nodes put together as one composite hierarchy from his perspective. And
an interesting point is that the perspectives of different users may, or almost certainly will, differ. The DFS
has to cater to all these personalized views from within a global physical reality. Let us imagine a network of
three nodes. Figure 12.18 depicts different directories at various nodes in this network. In this figure, a three-
dimensional box represents a root directory at any node, a rectangle represents a directory or a subdirectory,
and a circle represents a file.
User u1 may want to see as if the entire directories of u2 and u3 are in his “own” file system. The resultant
“virtual” or “imaginary” directory structure for u1 would be as shown in Fig. 12.19.
Alternatively, user u1 may want to manipulate the directories on the other nodes in a different fashion. For
instance, he may want to see the directories on node 3 for u3(:) to be under one of the directories of u2(:), say
D21, and where u2(:) itself is under u1(:).
The resultant structure would look as shown in Fig. 12.20.
To achieve this (i.e. to allow a user to define and use his virtual file system across the network) is not a
simple task. The composite views of the file systems may be quite different for users u1, u2 and u3. u2 may,
and normally will, want all or partial directory structures currently under u1 and u3 to be seen as if they were
under u2. The DFS has to implement these different views even though ultimately, physically, there is only
one file system.
Having arrived at this individualized file system, the DFS has to implement the location transparency in
addition. For instance, the user/programmer may wish to go into any directory, copy files from one to the
other, execute programs from some other directory or write data to a file in any directory as if all this file
system were local. If a user in this case gives an instruction to copy a file to a directory which is remote, the DFS has to carry it out without the user having to know where the file or the directory physically resides. The DFS also has to take into account the access control restrictions at a global level. But then, it is difficult to
enable the network to be hidden from the users in this fashion. The Local Operating Systems and the Local
File Systems for these three nodes may be quite different. The user interfaces, system calls and internal
organizations of these file systems may also be quite different.
In this case, if a user is in the D12 directory and if he wants to change his working directory to u3(:) to
display the contents of F31, how should he do it? Ideally, he should be able to give the same command to
accomplish this task as if he were working on his local machine running the LOS with all the files and directories stored locally. This is not easy, because the Local Operating Systems and the file systems running on these nodes are quite different. This is the reason most vendors of DFS implement a
simplified version of this “dream” as of now. This version is based on certain assumptions which reduce the
flexibility but make it more practical and realistic.
If such an environment were to become a reality, it can be implemented in two ways. We will now study
these.
In this scheme, the name (i.e. the path name) of the file does not have to change
when the file is transferred from one location to another. We have already studied the need to replicate files.
Even if only one copy is to be maintained, an optimal node can be found out from the study of network demands. The pattern of these demands from various nodes might change with the passage of time, and hence,
a file may need to be transferred to another node to optimize the network load. This exercise may have to be
done at a regular frequency. The point is that when this happens, no program or command should need any
change.
In the scheme of location independence, this mapping of logical name to physical location has to be done
dynamically every time a request is made. In fact, a table can be maintained for each file, giving details of the
file and various nodes where it is kept. Any time the location of that file changes or the file is replicated, this
table is automatically updated. When a node requests for a file, the availability of the nodes where that file
is stored and transmission overheads for transmitting from each of those to the requesting node are assessed
and the location then can be chosen from where the file will be transferred. Then, referring back to the table
again, the actual path name can be dynamically constructed and used. This requires extra execution time, but
then, it becomes very flexible. DFS needs to do extra work to implement this. This scheme is more ambitious.
Therefore, in practice, most of the DFSs provide a static, location transparent mapping which is the second
scheme.
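The dynamic mapping just described can be sketched as follows (the table contents and the transfer-cost figures are purely illustrative); because every request consults the table afresh, a file can be moved or replicated without any change to the programs that use it.

```python
# Logical file name -> list of (node, path) entries; updated whenever the file
# is moved or replicated.
location_table = {
    "payroll.dat": [("node1", "/data/payroll.dat"), ("node3", "/mirror/payroll.dat")],
}

# Purely illustrative transfer-cost estimates from each node to the requesting node.
transfer_cost = {"node1": 5, "node2": 2, "node3": 9}

def resolve(logical_name):
    """Pick the cheapest available replica and build the physical path name dynamically."""
    replicas = location_table[logical_name]
    node, path = min(replicas, key=lambda entry: transfer_cost[entry[0]])
    return f"{node}:{path}"

print(resolve("payroll.dat"))   # node1:/data/payroll.dat
```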
In this scheme, a user is expected to decide in advance which files/directories
will reside on which nodes and also define all the "user's perceptions" of the composite file/directory structures. The DFS has various commands and protocols to allow the user to define "his" file system. This is stored internally in a way that is very complex due to the differences between the Local Operating Systems and the differences in the naming, storing, and manipulating facilities, and commands in the Local File Systems on various nodes.
It is obvious that in a DFS environment also, the time required for reading any data from a file will consist
of the three components: the seek time, the rotational delay and the data transmission time.
In the local Read operation, the seek time could be very high. However, for remote files, the transmission
time could also be very high. Therefore, from the user’s perspective, even if the commands and facilities are
the same across the network, the transmission from remote nodes has to be well controlled for achieving
better performance. This is achieved by file replication and also cache management. We have studied file
replication earlier. We will study cache management in a later section.
A very important implementation consideration in the design of DFS is the policy to be used to implement
the file operations, particularly write and update operations. DFS normally has to have different philosophies
in the lock management.
DFS has to have an interface software to talk to the Operating Systems running on different nodes. This
software has to be running on every node to take care of the translation to/from the system calls of the LOS on
that node. If all the nodes run the same Operating System with the same file system, the complexity of DFS is
greatly reduced. In such a case, DFS can be implemented as a software running over and above the Operating
System, or it can be implemented as an integral part of the Operating System.
AT&T’s UNIX system V, release 3.0 (SVR3) has a feature called Remote File System (RFS) which is a
distributed file system for UNIX. Therefore, if all nodes run UNIX SVR3 and RFS on top of it, a user can
access file systems on other nodes ‘seamlessly’ after constructing his view of the global file system. Sun
has its Network File System (NFS) which again is a distributed file system which is a part of the SunOS
Operating System. This is now available as System V release 4, which is a combination of the features
available on 4.2 BSD UNIX and System V release 3.0. We will study NFS as a case study in the next section.
Novell's NetWare allows client machines of different types to be connected to the same server at the same time. This is achieved by supporting different kinds of protocols simultaneously. Figure 12.21 depicts
different machines, their corresponding LOSs, DFSs and protocols under the umbrella of NetWare. SMB and
AFP are the Distributed File Systems on the PS/2 and Macintosh machines respectively.
NFS uses the Remote Procedure Call (RPC) protocol to transmit the data from a remote file. NFS makes remote files available to local programs without
requiring any program to be modified or relinked.
Figure 12.22 depicts NFS functionality. It shows a client node, a server node and various software processes
running at both the nodes.
Let us now understand various components/processes in NFS.
NFS needs interface processes at both the client as well as server nodes. When
the server machine is booted, it also initiates a number of these interface processes called ‘daemons’. These
processes are responsible for getting the requests from the remote clients for NFS and servicing them one
by one.
Figure 12.22 shows NFS File System as the interface process on the client node. This is the counterpart
of the daemon process on the server. This module is responsible for receiving a request from a user/AP on
the client workstation for remote data (NFS), and then for sending it to the daemon on the server. After the
server NFS satisfies the request, this module is also responsible for receiving the response from the daemon
and passing it on to the VFS on the client workstation.
These packets in turn are converted into Ethernet frames. While doing this, the data and the control information are organized as per the format of the Ethernet frames. The data is broken into blocks of the size expected by Ethernet, and then the control information is added to each of these data blocks to formulate the Ethernet frames in accordance with their format. This control information contains, among other things, the source and destination addresses and the error control bits. Ultimately, the Ethernet cable carries these bits to the server bit by bit. At the server end, the data bits are extracted from the received frames and ultimately converted back into the original message format. All these activities, plus the error/flow/sequence control, are carried out by the communications software and hardware at the two ends. A discussion on the exact mechanisms and interfaces between these three modules is beyond the scope of the current text.
We will now trace the exact steps that are followed when a client accesses a remote file. Assume that all the
NFS components as discussed above are already loaded on the client as well as the server and the exporting/
mounting process has already taken place. The following sequence of actions takes place: The Application Program on the client issues a call for the file. If the file is local, its storage and retrieval now take place according to the LFS algorithms running on the client. These functions are the same as the ones discussed in the chapter on Information Management earlier.
If the file is remote, the request is translated into one or more NFS calls. The VFS on the client machine then initiates these NFS calls one by one sequentially. The server machine
processes each NFS call separately and passes the control to the client. The client then initiates the
next NFS call. This is done to facilitate recovery in case of the server failure. The client knows where
exactly to restart from. Therefore, the recovery in this case becomes the client’s responsibility.
Each NFS call and its response pass through the External Data Representation (XDR) layer on the client. As we know, along with the data, data types, etc. are also sent, and XDR is responsible for the conversion, if any is required.
NFS Call Meaning
getattr() gets a file’s attributes
lookup() returns a file handle for a file
read() reads data from a remote file
write() writes data to a remote file
create() creates a file on the remote host
remove() deletes a file on the remote host
rename() renames a file on the remote host
mkdir() creates a directory on the remote host
rmdir() removes a directory on the remote host
readdir() reads contents of a directory on the remote host
The XDR on the server again performs the format conversion, if necessary, and passes the control to the NFS daemon on the server, which executes the requested call using the file system on the server; the results then travel back to the client in the reverse manner.
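The lookup()/read() sequence can be imitated with a tiny simulation (the function bodies and handle values below are our own, not Sun's implementation): lookup() first returns a file handle, and each subsequent read() call carries the handle, offset and count, so every call is complete in itself and the client can restart from any point after a server failure.

```python
# A toy 'server side': files keyed by handle. Real NFS servers derive handles
# from the underlying file system; these values are made up for illustration.
server_files = {0x2A: b"Hello from the remote file system."}
server_names = {"/export/docs/readme.txt": 0x2A}

def nfs_lookup(path):
    """lookup(): return a file handle for a path on the remote host."""
    return server_names[path]

def nfs_read(handle, offset, count):
    """read(): return 'count' bytes from 'offset'; each call is self-contained."""
    return server_files[handle][offset:offset + count]

# Client side: one NFS call at a time, each complete in itself.
handle = nfs_lookup("/export/docs/readme.txt")
data = b""
offset = 0
while True:
    chunk = nfs_read(handle, offset, 8)     # small chunks, purely for illustration
    if not chunk:
        break
    data += chunk
    offset += len(chunk)
print(data.decode())
```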
In distributed processing, the load on the network can be quite detrimental for the performance
of the system. Therefore, any good design tries to reduce the network traffic. This is the main
reason the cache memory is normally used in any distributed environment. This cache memory has
nothing to do with the associative memory implemented in the hardware.
The basic philosophy of this cache is similar to buffering in ordinary systems. Let us review this concept
first. Buffering reads more data at a time than what is absolutely necessary at that moment. This, in other
words, is an anticipatory read operation. Once the seek operation is performed, and the desired data on the
disk surface is located, more blocks can be read into the memory at only a slight increase in the time or cost.
The buffering technique is used in many non-distributed, standalone systems and the Operating System
is responsible for the buffer management. This is done because in a non-distributed environment, of the time
taken to read the data from the disk, the actual data transmission time is far less significant than the seek time
and the rotational delay. This is also done with the hope that the subsequent read requests can be satisfied by
the data already read in the memory, thereby, saving the I/O overheads.
If the read requests follow a certain pattern (as in sequential read), buffering helps a lot. If you recall,
the page replacement algorithms are based on the assumption (or hope) that a page once referred to will be
required again fairly soon. Buffering is based on a similar hope. If the read requests are extremely random
and the data records are widely dispersed on the disk space, buffering provides very little help. Buffering has its counterpart in a distributed environment in the form of caching, which extends the idea to providing multiple buffers at different locations within the network system. There are four places where the desired data can be present at any time if an Application Program running on the client wants to access it. These are: the client's memory, the client's disk, the server's memory and the server's disk.
The performance depends upon at which of these four locations the data resides. In fact, as time progresses,
different parts of the same file may reside at these four locations. The Operating System has to move data
between these locations depending upon the history and frequency of past references. The Operating System
keeps the most recently used blocks in the client’s memory and the least recently used blocks on the server’s
disk, at the other extreme. The blocks which are referenced in between these two extremes are kept on the
client’s disk or the server’s memory.
For each of these locations, the Operating System will have to maintain a list of blocks along with the
reference pattern (here too, algorithms such as LRU approximation can be used at each location). Let us refer
to these locations as L1, L2, L3 and L4, where L1 is the client’s memory, and L4 is the server’s disk. Now
if a block at L3 starts getting referenced very often, the Operating System will have to push it to L2. But if
the buffer area at L2 is full at that time, the Operating System may have to evict a block at L2 (i.e. overwrite
it) to accommodate this new entry. Which block from L2 should be overwritten? Obviously, it should be the
least recently used one in that buffer, and so on. This procedure has to be followed continuously and at all the
locations. This is what makes it complex.
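The movement of blocks between the four locations can be sketched as one LRU list per location (the capacities and block names below are arbitrary): promoting a block into a full level pushes that level's least recently used block one level down, until it finally lands back on the server's disk.

```python
from collections import OrderedDict

# One LRU list per location; L1 = client's memory ... L4 = server's disk.
capacity = {"L1": 2, "L2": 4, "L3": 8}           # L4 (server's disk) holds everything
levels = {name: OrderedDict() for name in ("L1", "L2", "L3")}
l4_disk = {}                                      # the backing store
slower = {"L1": "L2", "L2": "L3", "L3": None}

def promote(block, data, level="L1"):
    """Place a block at 'level', demoting that level's LRU block if it is full."""
    cache = levels[level]
    if block in cache:
        cache.move_to_end(block)                  # referenced again: now most recent
        return
    if len(cache) >= capacity[level]:
        victim, victim_data = cache.popitem(last=False)   # least recently used block
        nxt = slower[level]
        if nxt is not None:
            promote(victim, victim_data, nxt)     # push the victim one level down
        else:
            l4_disk[victim] = victim_data         # finally back on the server's disk
    cache[block] = data

for i in range(6):
    promote(f"block{i}", b"...")
print({lvl: list(c) for lvl, c in levels.items()})
```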
This complexity is compounded by another factor which concerns data integrity. Imagine that the same
data file or its part was requested by many clients simultaneously and then subsequently it was downloaded at
all these clients. If each client starts updating this file independently, there will be chaos. Some updates will
be lost. Ultimately, we need to keep only one copy on the server. Which one should be kept? This is referred
to as cache consistency problem. If all the clients only want to read the file, this problem does not arise. As
soon as any client issues a command to open a file for writing, the problem begins.
The simplest solution to the cache consistency problem is to use file locking to prevent the simultaneous
access itself. This guarantees consistency at the expense of performance and flexibility. (No simultaneous
reads also can take place.)
Another solution is to take the following actions as soon as a client requests a file to be opened for writing: temporarily disallow any further access to that file, invalidate (discard) all the cached copies of the file at the other locations, and carry out all the writes only on the server's copy. Once the writing is over, the DFS can again use its usual algorithms to download the often referenced blocks of the file in the client's memory or its disk, or the server's memory in that order. Now one can allow concurrent read operations again until the next write request arrives.
A client workstation desiring to print something (say, a report) issues commands as usual. However,
the data to be printed is not directly printed on the printer. Instead, it is redirected to the disk attached to
the printer server. The data cannot be directly printed because the printer may be busy printing a report for
another client workstation! Therefore, while the printer is busy, all other requests for the printer are queued.
The spooler software keeps track of all these requests and services them one by one. The “report” stored on
the disk is normally not stored exactly in the same way in which it is printed. Some coding scheme is used
to denote specific actions.
A specific control character may mean skipping a certain number of blank lines. Another could mean skipping to the top of a new page. These control characters themselves are unprintable and hence, do not occur in the actual report text to be printed. This is the
reason why these control characters are used in this coding scheme. This philosophy is the same as the one
used in a spooler system. While printing a report, the server program interprets the control characters and
generates corresponding print commands such as “skip a page”, “skip a line”, etc. to actually carry out the
printing.
In the client-based, or file server, environment, the LAN is principally used for sharing
the peripherals. All the sharable data is stored on the file server and if a client workstation
requires any data, the server sends down the entire file to that client workstation, in most cases.
The file server stores the files as only raw files or as a stream of bits and has no knowledge of chains, pointers
or indexes interconnecting various data records to reflect their relationships. Hence, if a client wants a record
for a specific customer, the entire file is transferred from the server to the client workstation first, after which
the client has to extract the required record and then process it. In fact, in this scheme, all the processing is
done on the client machine, thereby, increasing the network load.
If the client workstation’s memory is not very large, the file is broken into manageable physical chunks
after which it is sent chunk by chunk. This is the only level of “intelligence” expected to exist in this scheme.
After the chunk arrives in the memory of the client, it is the Application Program (AP) in the client which has
to find out the desired customer record.
In the database or the indexed file environment, all the processing will take place on the client workstation
only. For the index search, the index stored as a file on the server will have to be entirely downloaded to the
client workstation. The client workstation will have to carry out the index search to identify the desired chunk
of the data file and request for only that chunk from the server. The file server then will talk to the Operating
System running on the server, extract that chunk, and send it to the client workstation. In most cases, however,
the file server transmits the entire file. It is basically used to store the commonly used programs such as
wordprocessors or spreadsheets which are downloaded to the respective client workstations wanting it.
It should be noted that the maximum “intelligence” that a file server can have is to send only the required
data chunk instead of the full file. However, identification of the desired data chunk (i.e. the index search),
extraction of the specific data record desired and the subsequent data processing are functions done only at
the client node. This is why it is called client-based computing. In many cases, the file server can send only
full files. In fact, if a one-time query is very complex, this form of “dumb” file server can hurt.
Imagine a query, for instance, to “display all the invoices and orders for a given customer along with his details”. In such a case, all the three entire databases (or files) involved will have to be transferred to the client first before the AP running on the client could
produce the results. It is easy to imagine the resulting complications. However, because of the widespread
availability of such dumb file servers which can download only an entire file (or a physical chunk at best),
we will continue to use the term file server for them, unless explicitly stated otherwise.
To implement the file server, a multi-tasking Operating System is needed, as requests for files could arrive
simultaneously from various client workstations. Each such request becomes a task under the Operating
System on the server. The Operating System then has to prioritize these requests, allocate memory to each
task (remember that for each task, a number of buffers will be required to carry out the file I/O) and then
schedule these tasks. These tasks will also go through various states such as Ready, Blocked, etc.
If a buffer on the server for a particular task is full and the data from it is not yet transmitted to the client
workstation, no further data can be read from that file. Therefore, that task has to be blocked. When the buffer
is emptied and therefore, more data can be read from the file, the task can be made ready again. After carrying out some processing, when the task again waits for further I/O, it gets blocked again. When the I/O is over, it can be made ready. Figure 12.25 depicts this
scheme.
In this file server approach, the units of transfer and sharing (of data, programs, etc.) are mostly entire files only. This approach has a number of limitations. Multiple users cannot
update the same data simultaneously. Files can be used only by one client workstation at a time. Imagine two
client workstations having identical copies of the file downloaded onto them, each one trying to change the
value of a data item independently! The whole integrity will be lost. There are of course ways to maintain
the integrity by allowing only multiple read operations, but then it reduces the flexibility in a truly multiuser
environment. Another disadvantage is the tremendous increase in the network traffic. Imagine that you want
to display the customer name of 30 characters for a specific customer from a customer file of 1 MB, and in
such a case, you have to download 1 MB of data to retrieve 30 bytes from it!
Today, some database system software runs only in the client-based computing mode, i.e. you cannot get
an individual selected record from the server. The whole file—at best as a series of physical chunks one by
one—has to be downloaded from the server onto the client workstation. The DBMS software running on
the client workstation then locates the record on the client workstation out of the downloaded data and then
uses it for further computation/display. This is obviously inefficient. Due to these limitations, a different approach, called client/server computing, has evolved. In this case, a program such as a database management system is split into two portions: the workstation or client portion and the server portion. This is why this environment is called client/server computing. The client portion presents a user-friendly Graphical User Interface (GUI) built using languages such as Visual Basic or Power Builder, accepts the user's requests and sends them to the server. It is then again responsible for doing further computation/processing on the record(s) returned by the server and displaying the results on the screen. The software for the DB portion (such as VSAM, IMS/DB, IDMS/DB or DB2) would reside on a central server machine. Thus, the user interface and application processing are provided for on the client machine, and database services are provided for on the server machine. This is how distribution of tasks takes place. Each one does something that it can do best. The server portion is responsible for actually locating the desired record and sending it to the client. The processing of that record takes place on the client machine.
Therefore, the server is not only responsible for maintaining indexes on the databases but also for using
them to carry out the searches to locate and access the desired record(s) on the server itself. It is also responsible
for handling concurrent requests for the same record by using record level locks. In fact, it is also responsible
for a transaction-based recovery to guarantee the database integrity in case a true DBMS is used on the server
as a DB portion for data retrievals. For instance, if a user keys in an invoice and as a result, three different records have to be updated, then either all the three must be updated or none at all, if the database integrity and consistency are to be maintained.
In order to provide this atomicity, a database system allows a user to encapsulate all these three updates
together as one transaction and ensures a transaction-based recovery using rollback/rollforward features.
The features maintain the pre- and post-update copies until the transaction is complete to enable recovery in
case of a failure. The point is that the transaction (i.e. all the three updates, in this case) has to be executed in its entirety or not at all. A Relational Database Management System (RDBMS) accessed through SQL has become the de facto standard in this case. However, the concept of a database server could be used in non-relational databases or even in file management systems such as Btrieve also. Whether the processing is split into two parts or not, the server has to service requests from many clients concurrently, and hence needs a multi-user Operating System. If the Operating System under which the database server runs is only a single-user Operating System such as DOS, this would have been impossible.
In this multi-user Operating System environment, a database request made by a client becomes a process
under the Operating System on the server. The Operating System schedules these processes as any other
process. For instance, when a process would generate an I/O request for reading the index record from the
disk, it would be ‘blocked’, and another process whose index is already read and index search is to be made
in the memory could be dispatched because it would be ‘ready’.
After the desired database record is read, the front-end would be responsible for reaching it to the
appropriate client. Therefore, it would use the communications capability of the NOS to do this, if required,
or it could have routines of its own for this purpose.
A disadvantage of this approach is that multi-process architecture makes inefficient use of system
resources such as memory. In this scheme, the context switching is also more expensive. One of the
reasons for this is the existence of multiple processes for the server Operating System and, therefore, the need for full process context switches among them. An alternative is a single process, multi-threading architecture for the database server, as shown in Fig. 12.27.
With this approach, the scheduling of various tasks and maintaining the list of tasks in various states
(ready, blocked, etc.) are done by the database server itself. Therefore, the database server, in a sense takes
over some of the functionalities of the Operating System.
Figure 12.27 shows that even if the database server manages several tasks, it itself is only one process as
far as the Operating System running on the server is concerned. Hence, the database server could theoretically run even under a single-user Operating System. In practical cases, you could run it under a powerful machine running, say, UNIX.
The database server itself consists of various modules as shown in Fig. 12.28 (this is only one of the
possible implementations).
The kernel is responsible for creating, scheduling and executing various tasks, and for context switching
and management of various buffers and caches. The tasks could be the ones created by the user queries as
shown in Fig. 12.27, or they could be database tasks such as for locking or recovery, or the tasks which are
network-related. The kernel is also responsible for managing various task priorities.
The parser receives the query submitted by the client workstation as it is, without much pre-processing. The parser parses the query to ensure that the query
statements are syntactically and semantically correct. If they are, the parser generates an intermediate form
of the query, and hands it over to the optimizer.
The optimizer figures out the most efficient way of processing a query. In our example, it will have to decide whether and how to use the indexes on the fields involved in the query. If there is no index, it will have to suggest going through all the records sequentially one by one. If an index on
only one of them exists, it will have to use only that index. If index exists on both the fields, it will have to
use both of these, but the sequence in which both are searched affects the search time and hence, it will have
to decide the appropriate strategy.
But to achieve this, the database software has to be organized a little differently and has to maintain a “count”
field to denote the number of record occurrences for a given key value in an index. Depending upon all these,
the optimizer performs the query optimization, i.e. suggests the way the database searches should be carried
out.
The strategy suggested by the optimizer will be used by the compiler to generate the appropriate
instructions, which then would constitute a task when its execution begins. Our example shows only a simple
query. In actual practice, the query could be very complex using multiple tables and multiple conditions.
There could be many possible ways to handle such a query. The optimizer normally evaluates these possible
ways in terms of their costs or efficiencies and chooses a method which is the most cost effective. (This is
why such an optimizer is called ‘Cost based optimizer’.)
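A highly simplified sketch of the kind of comparison a cost-based optimizer makes is shown below (the cost model, which simply counts the records to be touched using the 'count' information kept in the indexes, is our own simplification).

```python
def choose_strategy(total_records, index_counts):
    """
    index_counts: {field_name: estimated matching records}, built from the
    'count' fields maintained in the indexes.  Returns the cheapest plan found.
    """
    if not index_counts:
        return ("sequential scan", total_records)        # no index: read every record
    # Search the most selective index (fewest matches) first; each candidate
    # plan's 'cost' is roughly the number of records it must touch.
    field, matches = min(index_counts.items(), key=lambda kv: kv[1])
    return (f"index on '{field}' first", matches)

print(choose_strategy(100000, {}))                            # ('sequential scan', 100000)
print(choose_strategy(100000, {"city": 8000, "balance": 150}))
# ("index on 'balance' first", 150)
```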
The compiler is responsible for generating the actual executable statements to search the indexes and
data records, and retrieve the appropriate records which match the given conditions. This is obviously done
according to the strategy suggested by the optimizer. These statements form the program which the task has
to execute. The program contains call statements for reading/writing the actual data records, locking a record,
etc. These are the actual system calls only. A database server can use the services (system calls) of the NOS
to implement these. However, a database server can have its own I/O routines, bypassing the I/O calls of the
NOS. This is done to improve upon the speed.
As discussed earlier, at the time of executing a task, it can get ‘blocked’ after requesting an I/O routine,
whereas it can be made ‘ready’ after the I/O is over, whereupon other instructions such as the ones for the index
search in the memory can be executed. All this is managed by the kernel of the database server.
A number of front-end tools are available for building the client portion; Power Builder and Visual Basic are some of the examples. They provide a user-friendly Graphical User Interface (GUI) for user interaction. They make database calls in a standard fashion as expected by the database server products. These database server products are essentially database engines which receive the database requests in a pre-specified standard format and service them one by one as discussed earlier.
We can imagine that all the issues that are found in a simple, non-distributed environment,
such as mutual exclusion, starvation and deadlock are quite possible in the case of a distributed
environment. In fact, the possibility of these happening is more here, as a number of entities are
involved, which can lead to chaos. To add to the trouble, there is no global state. That is, there is no way for
an operating system or a participating process to know about the overall state of all the processes. It can only
know about its own state (i.e. local processes). For obtaining information about remote processes, it has to
rely on the messages received from other processes. Worse yet, these states reflect the position as it was at some point in the past. Although this past can be as little as one second ago, it can have very serious consequences in
business and other applications.
For instance, suppose that Process A gets the information from a remote Process B that the balance in an
account is USD 2000, at 11.59 a.m. Therefore, Process A considers that as a sufficient balance for performing
a withdrawal transaction of USD 1500, and goes ahead. Unfortunately, at 12.00 p.m., Process B also performs a withdrawal on the same account, based on the balance that has now become stale. This is the classic problem of concurrency. In a local system, we can somehow control it, with the help of database transactions. How do
we take care of it in a remote transaction?
To solve this, there are many approaches, of which we shall discuss two.
The first approach is to record a consistent global state, or snapshot, of the system. (It assumes that the communication channels are reliable, delivering messages in the order in which each message was sent, and without any duplications.) The initiator process invokes the algorithm by recording
its state, and sending a special marker control message on all the outgoing channels (i.e. to all the other
processes) before performing any transaction or sending other messages. All the other processes do the same.
Therefore, each process has a marker (similar to a commit point) where it can go back, in case it is later realized that there was some concurrency problem.
The two phase commit protocol is used to synchronize updates on two or more
databases that run on physically different computers. This ensures that either all of them succeed or all of
them fail. The database as a whole, therefore, is not left in an inconsistent state. To achieve this, a central
coordinator (one of the database machines) is designated, who coordinates synchronization of the commit/
rollback operations. Other participants in the two phase commit protocol have the right to either say OK, I
can commit now or Nope, I cannot commit now. Accordingly, the central coordinator takes a decision about
committing the transaction or rolling it back. For committing, the central coordinator must get an OK from all
the participants. Even if one of the participants cannot commit because of whatever reasons, the transaction
must be rolled back by all the participants. This works as follows:
During this phase, the central coordinator sends a Prepare to commit message to
all the participants of the transaction. In response, each participant sends either a Ready to commit or Cannot
commit message back to the central coordinator, depending on whether they can go ahead with the committing of the transaction or not. This is shown in Fig. 12.29.
Only if the central coordinator receives a Ready to commit reply from all the
participants in Phase 1, it now sends a Commit message to all the participants. Otherwise, it sends a Rollback
message to all the participants. Assuming that a Commit message was sent by the central coordinator, all the
participants now commit the transaction, and send a Completed message back to the coordinator. If any of
the participants fails in the commit process for whatever reason, that participant sends a Refuse message back
to the central coordinator. If the central coordinator receives at least one Refuse, it asks all the participants to
roll back. Otherwise, the whole transaction is considered as successful. This is shown in Fig. 12.30.
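The coordinator's decision rule can be captured in a few lines; the participant class and the message strings below are simplified stand-ins for the actual database machines and the protocol messages.

```python
def two_phase_commit(coordinator_log, participants):
    # Phase 1: ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]           # 'ready' or 'cannot'
    if all(vote == "ready" for vote in votes):
        coordinator_log.append("commit")
        results = [p.commit() for p in participants]      # 'completed' or 'refuse'
        if any(r == "refuse" for r in results):
            coordinator_log.append("rollback")            # undo everywhere
            for p in participants:
                p.rollback()
            return "rolled back"
        return "committed"
    # At least one participant cannot commit: everybody rolls back.
    coordinator_log.append("rollback")
    for p in participants:
        p.rollback()
    return "rolled back"

class Participant:
    """A toy participant that always votes 'ready' and commits successfully."""
    def prepare(self):  return "ready"
    def commit(self):   return "completed"
    def rollback(self): pass

log = []
print(two_phase_commit(log, [Participant(), Participant()]), log)   # committed ['commit']
```

In a real system the coordinator would also record its decision on stable storage before sending it out, so that the outcome can be recovered after a crash.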
A proper mutual exclusion mechanism is quite important for successful and error-free inter-process communication and synchronization.
In the absence of such a mechanism, the interaction between processes can be chaotic, and can
lead to serious problems. We have discussed various algorithms and philosophies to deal with the issue of
inter-process communication and mutual exclusion. The problem of mutual exclusion can be solved by using
these techniques in a non-distributed environment. However, in a distributed environment, mutual exclusion
becomes a very tricky affair. Recall that there is no concept of a shared (single) memory or a global clock in a distributed environment. As such, no single process can decide how to deal with the issues of synchronization and inter-
process communication. This means that processes must depend on the exchange of messages among each
other, and not on any shared memory or global clock.
To deal with these issues related to the problem of distributed mutual exclusion, two main approaches
(centralized and distributed) have emerged, as depicted in Fig. 12.31.
In the distributed approach, all the nodes are equal. This means that all the nodes have the same amount of information about the environment and network. Each node has a partial view of the overall environment, and none has all of it (unlike in the centralized approach, where the central node has the complete picture). Each node makes its own decisions. All the nodes are equally responsible for the success of mutual exclusion. Thus,
nodes require exchanging messages, and therefore, spending time in this activity, whenever a decision is to
be made regarding a resource.
Following are the advantages of this approach:
(b) This approach makes the infrastructure more scalable, and the fear of the collapse of the entire
environment in the case of the failure of a single node is not there.
The drawbacks of this approach are as follows:
(a) There is no concept of a single clock to which all the nodes agree, and therefore, there is no time
synchronization.
(b) During the message exchanges between nodes, there can be cases of delays as the decision-making
can be quite complicated.
To deal with the distributed approach and the issues therein, several algorithms have been proposed. We shall study the Lamport algorithm, since it is the
simplest to understand and implement, and has also proven to be quite dependable.
The Lamport algorithm uses the concept of timestamps. It elegantly resolves the
problem of the lack of a global system clock. It orders the various events in a distributed system without using
a clock. The way this works is as follows.
Each time a process sends a message, it is considered as an event. Each node (called as i) maintains a local counter C. The timestamp of an event is the pair (C, i), where C is the value of the counter at that node (before a message is sent, the counter is incremented by 1), and i is called as site number (each site has a unique number, starting with 1 onwards). This is shown in Fig. 12.33.
Consider three processes P1, P2 and P3, which constitute a distributed environment. Let us assume that each of them wants to send messages a,
b and c, respectively, one after the other to be received by the other two sites. We have intentionally kept the
flow simple for the ease of understanding. The flow of operations and the resulting calculations are depicted
in Fig. 12.34.
The beauty of Lamport algorithm should be apparent now. Before any process sends a message, it first
increments its local counter. A receiving process always sets its counter to the greater of its own local counter and the counter received in the message, plus 1. This leads to an automatic global clock-like counter.
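The counter rule can be written down in a couple of lines; the process numbering and message names here are only illustrative.

```python
def tick(counter):
    """A local event (including sending a message): increment the counter first."""
    return counter + 1

def on_receive(local_counter, counter_in_message):
    """On receipt: take the greater of the two counters, plus 1."""
    return max(local_counter, counter_in_message) + 1

# P1 sends message a to P2, then P2 sends message b to P3.
c1 = c2 = c3 = 0
c1 = tick(c1)                 # P1 sends a with timestamp (1, 1)
c2 = on_receive(c2, c1)       # P2 receives a: max(0, 1) + 1 = 2
c2 = tick(c2)                 # P2 sends b with timestamp (3, 2)
c3 = on_receive(c3, c2)       # P3 receives b: max(0, 3) + 1 = 4
print(c1, c2, c3)             # 1 3 4
```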
Let us now understand how this is useful in mutual exclusion.
Whenever any process wants to enter critical section, it sends a message similar to the ones discussed
above to all the other processes. When it receives a reply from all the other processes, it can enter its critical
section. Each process must reply to every request. If a process does not wish to enter into its critical section
at that point (i.e. when it has received a message from another process), it can reply back immediately.
Otherwise, it compares its local timestamp (i.e. local counter) with the one received, and if its local timestamp
is less than the one received (which means that its message request was created earlier), it defers its reply. The
reason for this is simple. This is because the process (rightly) believes that it should get a higher priority than
the process, which has sent it a message.
Does this not lead to deadlock? What if all the processes believe that it is their turn to enter into the
critical section at the same time? Interestingly, this situation will never arise, because we know that a process
will defer its reply only if its timestamp value is smaller than that contained inside the received message.
Since timestamps are arranged in a sequentially increasing manner very strictly, at any given point, only one
process would have the smallest timestamp value. All other processes would have replied back to this process
(as their timestamp values would be higher than the timestamp value of this process). As such, that process
can enter its critical section, and there is no fear of deadlocks.
Like almost everything else in distributed systems, deadlocks offer a great design
challenge. Dealing with deadlocks even in a non-distributed environment is not easy. When
coupled with the challenge of multiple processes/nodes/Operating Systems, the task becomes
even tougher.
We shall discuss three strategies related to deadlocks in a distributed environment: Prevent a deadlock,
Avoid a deadlock, Detect a deadlock. We shall use the term transaction to mean the same thing as a message,
as the former is more commonly used in the literature for distributed systems (in the context of deadlocks).
Suppose that two transactions T1 and T2 are interested in the same resource R. Let us imagine that at a given point of time, T1 owns R, and T2 wants to own R soon. Such situations can ultimately lead to a deadlock. To solve such problems, two possible methods exist: Wait-die method and
Wound-wait method.
Let us discuss these two methods now.
As we have mentioned previously, the preamble is that T1 owns R, and T2 is making an
attempt to own R. Let the timestamp of T1 be TS1 and that of T2 be TS2. In this approach, the algorithm for
preventing a likely deadlock is as shown in Fig. 12.35.
As we can see, if the transaction T2 has started prior to the transaction T1 (in terms of their timestamps),
then T2 waits for T1 to finish. Otherwise, the distributed Operating System simply kills T2 and restarts it after
a while (with the same timestamp, TS2), hoping that T1 would have released the resource R by now.
To summarize, an older transaction gets a higher priority, as desired.
Also, we can see that a killed transaction is restarted with its original
timestamp value. That allows it to retain its older priority (which would
be higher than most other transactions, when it restarts).
This technique takes a different approach, as
compared to the earlier method. Here, we start off with a similar
comparison. We compare the timestamps of T1 and T2 (i.e. TS1
and TS2, respectively). If TS2 is less than TS1 (which means
that supposedly T2 had started prior to T1), we now kill T1.
Note that in the earlier case, we had blocked T2 in such a case. If
T1 has started earlier, however, we halt T2. This process is shown in Fig. 12.36.
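The two rules differ only in which transaction is penalized. A compact sketch, treating the timestamps as plain integers, is given below.

```python
def wait_die(ts_holder, ts_requester):
    """T1 holds R, T2 requests it: an older requester waits, a younger one dies."""
    if ts_requester < ts_holder:
        return "T2 waits for T1"
    return "T2 is killed and restarted later with the same timestamp"

def wound_wait(ts_holder, ts_requester):
    """An older requester wounds (kills) the holder; a younger requester waits."""
    if ts_requester < ts_holder:
        return "T1 is killed (wounded); T2 gets the resource"
    return "T2 waits for T1"

print(wait_die(ts_holder=10, ts_requester=5))     # T2 waits for T1
print(wound_wait(ts_holder=10, ts_requester=5))   # T1 is killed (wounded); T2 gets the resource
```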
After a lot of research, Operating Systems designers have concluded that they cannot
avoid a deadlock in a distributed environment. There are two main reasons for this:
1. Every node in the distributed system needs to be aware of the global state (i.e. information about
all the other nodes). This is clearly impossible, because the states of the nodes would keep on
changing constantly, and even if they communicate these changes to each other, there would be
inevitable delays, making the information obsolete. As such, this requirement of global state will
never be met.
2. Maintaining global state, if at all, would entail tremendous amount of overheads in terms of network
transmission/communication, processing and storage overheads.
Hence, some form of arbitration is needed to detect deadlocks in distributed Operating Systems. Three solutions are proposed to
handle this situation, as follows:
In this approach, one node is designated as the control node, and it decides how to
detect and come out of deadlocks. All other processes provide information to this node, and abide by its rules.
This approach has the advantage of simplicity, but the drawbacks of big overheads in terms of communications and storage, and the danger of the control node failing. The last point is most relevant, as in such a case, the entire environment can come to a halt.
This creates a tree-like structure. Here, the node designated as parent detects deadlocks of its subordinate nodes, and takes an appropriate decision.
This is a democratic approach, wherein all the nodes/processes are treated as equal.
All the nodes/processes need to cooperate with each other. There is a large amount of information exchange, resulting in substantial overheads.
As we can see, all the approaches have their advantages, and disadvantages. Accordingly, the decision
of choosing one of them is left to the actual problem specification, which differs from one situation to
another.
Local Area Network (LAN) is a network of computers connected within an area normally less than 1 km
in distance, i.e. no two computers are away from each other by more than 1 km. Hence, a LAN is usually
used within an office, a building or a factory. For longer distances, typically covering a city, a Metropolitan
Area Network (MAN) is used. If two computers crossing the boundaries of cities or nations are connected,
we call it a Wide Area Network (WAN). We will talk only about LANs for now. Before explaining how
the LAN works and how NOS fits in all this, we will talk about some general issues concerning error
control and formulating packets/frames in LANs. We have seen that a significant portion of a NOS comprises the communications management software, over and above the Local Operating System.
In data communications, error detection and prevention is of utmost importance. Many methods have been
proposed to achieve this. The most common of them are: Vertical Redundancy Check (VRC), Longitudinal
Redundancy Check (LRC) and Cyclic Redundancy Check (CRC). All these add certain “overhead” or check bits to the data to be transmitted. These check bits are not calculated randomly. There is some logic used based on the data bits in the chunk of data to be
transmitted. Even if a single bit in the data block to be transmitted changes during the transmission, these
check bits received at the destination would not tally with the expected check bits. The source node computes
these check bits as per the preagreed logic, and appends them to the data bits before sending them to the
destination. At the other end, the check bits are recomputed using the same logic based on the received data
bits. Then, these are compared with the check bits received from the source node. A mismatch indicates an
error. Figure 12.37 depicts this scheme.
At both the source and the destination nodes, these overhead control bits are calculated by the hardware
and hence, it is a very fast process.
If the computed error control or check bits tally with the ones received along with the data at the destination,
the data is taken to be error-free. The Longitudinal Redundancy Check (LRC) is
a variation of the parity bit scheme applied to a block of data bytes. It is computed longitudinally instead
of vertically. Therefore, there will be a parity byte consisting of 8 bits associated with each data block. For
instance, if a data block consists of 10 bytes, all the data bits with bit number = 0 from all the 10 bytes are
considered for computing the bit number = 0 of the parity byte. Parity bit calculation could well be based on
odd or even parity scheme.
If two bits in the same bit position of two different bytes get inverted during transmission, the parity byte
would still remain the same and therefore, LRC will not detect such an error.
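To make the LRC idea concrete, the following is a minimal sketch in C (not taken from any standard; the function and variable names are our own) that computes an even-parity LRC byte over a block of data bytes. Bit i of the parity byte is chosen so that the number of 1s in bit position i, taken across the whole block plus the parity byte, is even; with even parity this reduces to the XOR of all the bytes.

#include <stdio.h>
#include <stddef.h>

/* Compute an even-parity LRC byte over a data block.
 * With even parity, this is simply the XOR of all the data bytes. */
unsigned char compute_lrc(const unsigned char *data, size_t len)
{
    unsigned char lrc = 0;
    for (size_t i = 0; i < len; i++)
        lrc ^= data[i];
    return lrc;
}

int main(void)
{
    unsigned char block[10] = { 'O', 'P', 'E', 'R', 'A', 'T', 'I', 'N', 'G', 'S' };
    unsigned char lrc = compute_lrc(block, sizeof block);

    /* The sender appends 'lrc' to the block; the receiver recomputes it over
     * the received data bytes and compares. A mismatch indicates an error. */
    printf("LRC byte = 0x%02X\n", lrc);
    return 0;
}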
Imagine that an entire file is to be sent from one node to another. A question is: Should it be sent as it is, with
check bits computed over it as a single block? In that case, if an error is detected at the receiving
end, the entire file will have to be sent again. Therefore, a long message, e.g. a file in the example above, is
broken into packets, which are often called frames in the LAN environment. Each frame is given a sequence number.
In this scheme, if an error is detected, only that frame needs to be sent again. What applies to a data file
in our example equally applies to any kind of message. In fact, we will henceforth use the term message to
mean the actual data (e.g. a file or a record or a field) or a request to transmit or retransmit the data. A message
could also be sent to acknowledge the data already sent. In short, a message is anything that is sent between
two nodes. If it is big in size, it needs to be broken into packets/frames.
Figure 12.38 depicts a typical frame format. It is by no means a standard or universal format. In fact,
different protocols associated with different network types such as Ethernet, Token Ring, etc. have different
frame formats. We have depicted this format only to make the concept clear.
As the figure shows, there are several “overhead” fields which have to be added to the actual data bits. This is a necessary evil, as we shall see! It
must be remembered that the frame can be a data frame or an acknowledgment frame denoted by the “flag”.
This is valid only if the system follows a convention of acknowledging the messages.
The address has three components. A network could be constructed by having multiple LANs as depicted
in Fig. 12.39.
Figure 12.39 shows three networks, each with a number of nodes. Each node could be running a multiuser
Operating System. It could be executing various processes which are associated with different sockets—one
for each socket—shown in the diagram. Therefore, the address consists of three components, viz. network
number, node number and the socket number. Using the source and destination addresses, any process running
on one node can send a message to any process running on any other node on the same or a different network.
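As an illustration only (the field names and sizes below are assumptions, not the book's format or any standard one), a frame of the kind shown in Fig. 12.38 could be declared in C roughly as follows, with the three-part source and destination addresses, a flag distinguishing data frames from acknowledgments, a sequence number and the check bits:

#include <stdint.h>

/* Hypothetical three-part address: network, node and socket numbers. */
struct lan_address {
    uint16_t network;    /* which network/LAN                  */
    uint16_t node;       /* which machine on that network      */
    uint16_t socket;     /* which process/socket on that node  */
};

/* Hypothetical frame layout; real protocols (Ethernet, Token Ring, etc.)
 * each define their own, different formats. */
struct frame {
    uint8_t            flag;        /* data, ACK or retransmission request */
    uint16_t           seq_no;      /* frame sequence number               */
    struct lan_address source;
    struct lan_address destination;
    uint16_t           length;      /* number of valid bytes in data[]     */
    uint8_t            data[1500];  /* the actual payload                  */
    uint32_t           checksum;    /* e.g. a CRC over the whole frame     */
};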
A node sends messages to another for any of the following reasons:
(a) Open a communication session with another node.
(b) Send an actual data frame to another node.
(c) Acknowledge a frame received correctly.
(d) Request for the retransmission of a frame.
(e) Broadcast a message to all nodes.
A node may also need to inform the sending node whether an error was detected in the received data. It might, for instance, send an acknowledgment if the data is received
correctly, and a request for retransmission otherwise. The network card and its
driver broadly carry out the data link layer functionality as per the OSI standards, and hence, constitute
an important part of the communications software in a NOS, as we shall see a little later.
Let us now look at the actual communication process. The nodes communicate in terms of frames, as discussed earlier. The frame formats and acknowledgment
procedures have to be as per the agreed protocol. If Node A wants to send any message to Node B, the
following will happen.
(a) Node A will break the message into frames as per the agreed protocol, and for each frame, add the overhead bits such as addresses, frame size, etc.
(b) It will then compute the error control or check bits for the entire frame including the overhead bits, and append them to the frame.
(f) There are various methods or protocols to let Node A know that everything was received well by
Node B.
(i) One method relies a lot on the communication hardware and software and hence, does not send any
acknowledgment. The reason for this is that errors are rare. If they do occur, it relies on the higher
layers to detect them and take corrective actions. This is not a very attractive scheme where accuracy
is of great importance.
(ii) Another method believes in sending back an acknowledgment for every frame received. In this case,
Node A waits for some predetermined time for this acknowledgment frame before concluding that the
original frame was not received properly and hence, needs to be retransmitted.
(iii) A third scheme believes in Node B sending only retransmission requests back to Node A if a frame
was received in error. If the frame was received correctly, no action is taken. The “flag” field in the
frame can be used for this. Node B sends a message back where the flag field denotes the error and
the frame number specifies which frame has to be retransmitted. Node A then resends only the frame
that was not received properly. This obviously involves a larger buffer memory at the transmitting
node, but the overall transmission is faster as compared to a scheme where a frame cannot be sent unless the earlier one is acknowledged.
In this case of selective retransmission, it is the responsibility of the receiving node (Node B in this case)
to sequence and arrange the frames into the original message, before the message is handed over to the higher
layers for processing. This checking for sequence may be necessary because of one more reason. There may
be multiple paths between two nodes and different frames may take different routes with different speeds and
congestion levels. Hence, frame 4 may reach before frames 2 or 3!
In the discussion that follows, we will use the term “acknowledgment scheme” to mean the method or
protocol followed for acknowledgments. This protocol could employ any of the methods discussed above or
some other scheme. It belongs to the data link layer in the OSI reference model. As with many other protocols,
there is no universal agreement or standardization on this one either.
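The second acknowledgment method above (acknowledge every frame, and retransmit on a timeout) can be sketched in C as follows. This is only an outline of the control flow: send_frame(), wait_for_ack(), the retry count and the timeout value are assumptions, not part of any real driver interface.

#include <stdbool.h>

struct frame;                         /* full definition not needed here   */

#define MAX_RETRIES    5              /* assumed values                    */
#define ACK_TIMEOUT_MS 200

/* Assumed primitives provided by a hypothetical lower (driver) layer. */
extern void send_frame(const struct frame *f);
extern bool wait_for_ack(unsigned seq_no, unsigned timeout_ms);

/* Send one frame using the "acknowledge every frame" scheme:
 * retransmit if no acknowledgment arrives within the timeout. */
bool send_with_ack(const struct frame *f, unsigned seq_no)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        send_frame(f);
        if (wait_for_ack(seq_no, ACK_TIMEOUT_MS))
            return true;              /* Node B confirmed correct receipt  */
        /* Timed out: assume the frame (or its ACK) was lost, and resend.  */
    }
    return false;                     /* give up; report to a higher layer */
}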
Normally, twisted wire pairs, coaxial cables, and optical fibres are used heavily as the LAN media. Each of
them has its merits and demerits.
A number of topologies are possible in a network. However, generally, in basic terms, three important
ones are considered. They are star, ring and bus topologies. Many other topologies can be derived from these
basic topologies.
There are two types of LANs—baseband LANs and broadband LANs. Baseband means that a signal is
sent without any conversion (analog to digital or vice versa). Therefore, a baseband LAN sends the data
as digital pulses without converting them to analog signals. Hence, it does not require modems. It might
however require regenerative repeaters if the length of the LAN is more. This is because, with distance, a
digital pulse is likely to become weak and distorted. Regenerative repeater receives an “approximate” digital
pulse, reconstructs the original one and then sends it out. There are three common ways in which a 0 and a 1 are
encoded as digital pulses. These are NRZ-L, Manchester and Differential Manchester encoding.
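For instance, in Manchester encoding every bit is sent as two half-bit signal levels with a transition in the middle of the bit period. The small C sketch below uses the IEEE 802.3 convention (0 = high-to-low, 1 = low-to-high; the opposite convention also exists) purely as an illustration of the idea.

#include <stdio.h>

/* Expand one byte into 16 half-bit signal levels (0 = low, 1 = high),
 * using Manchester encoding, IEEE 802.3 convention:
 *   bit 0 -> high then low,  bit 1 -> low then high.
 * Bits are emitted most-significant first. */
void manchester_encode(unsigned char byte, unsigned char out[16])
{
    for (int i = 0; i < 8; i++) {
        int bit = (byte >> (7 - i)) & 1;
        out[2 * i]     = bit ? 0 : 1;   /* first half of the bit period      */
        out[2 * i + 1] = bit ? 1 : 0;   /* second half: always a transition  */
    }
}

int main(void)
{
    unsigned char levels[16];
    manchester_encode(0xA5, levels);
    for (int i = 0; i < 16; i++)
        printf("%d", levels[i]);
    printf("\n");
    return 0;
}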
Broadband LAN uses modems to convert the digital data into analog signals and then to send them
across the network, usually using Frequency Division Multiplexing (FDM). It normally divides the total
available bandwidth of the medium into three bands: for voice, for data and for video. Therefore, the total
bandwidth available for the data may be reduced. Though it is theoretically conceivable, it is quite impractical
to subdivide this reduced bandwidth allocated for data further into smaller bandwidths for different nodes.
Therefore, in both the baseband as well as the broadband LANs, normally only one node can transmit data
to some other node at a time. This scheme demands some arbitration as to who should use the medium, when
and for how long. We will revisit this issue soon.
A detailed discussion of these topics is beyond the scope of the present text and it is probably unnecessary
as well. Suffice it to say that the physical layer in the OSI model is concerned with the kind of medium that
is used, the maximum data rate possible, the data encoding method used, etc.
Protocol is nothing but a convention. We encounter this term quite often in newspapers
when they describe a meeting between the leaders of two nations. Signifying that
“everything is OK and the train can start” by waving a green flag is also a protocol. When we write a
letter, we follow a certain protocol. The place where we write the address, the place where we affix the
stamp, the place where we write the name of the recipient, the way we begin with the pleasantries, the place
where we sign—all of these follow a convention, i.e. a protocol.
Protocols can have, and normally do have, layers hidden in them, if we look at them a little carefully. A good
example of this is human conversation in general, and conversation over the telephone in particular. Figure 12.41 depicts
these layers. We will take this example and describe the exact steps to learn about these layers. An interesting
point is that we do this without knowing that we use protocols. While studying this, we will encounter a
number of terms, which are also used in the computer networks.
Imagine that two persons, X and Y, are discussing the World War over the telephone, and we will also assume that each one is taking down what the other one has to say. Thus, we will treat this discussion about the
World War as an idea. Normally, the conversation takes place in terms of several messages from either end,
hopefully one after the other. A message is a block of statements or sentences. A message could also consist
of only one word such as OK or yes, denoting a positive acknowledgment (ACK) of what has been heard
or received. A message could also mean a negative acknowledgment (NAK) or request for repeating such
as Come again or, Pardon me or, Can you repeat please, etc. Remember that this can happen both ways. For
instance, a typical conversation could be as follows.
X: In World War II, the allied countries should have…. However, they did not do so because of the
climate conditions. In addition, they did not have enough ammunition.
X: ...
Therefore, at the level of ideas, what is being exchanged is the World War itself.
However, in reality the conversation consists of a number of messages from both sides as discussed before.
Therefore, at a lower level, the view would be that a number of messages are sent at both ends. The protocol
at this level decides what denotes an acknowledgment, what denotes a negative acknowledgement, etc. for
the entire message.
A message could be too long. In this case, it may not be wise for X to speak for half an hour, only to
discover at the end that Y could not follow most of it. It is better to exchange positive
or negative acknowledgments after each sentence in a message by Yeah, OK or Come again, etc. A sentence
is like a packet in the computer parlance. In this case also, one could decide a protocol to necessarily send
a positive or negative acknowledgment after each sentence. If that is the case, the sender (the speaker) X
will not proceed to the next statement until he hears some form of acknowledgment, or otherwise, and, in
fact, repeat the statement if he receives a negative acknowledgment before proceeding. An alternative to
this would be a timeout strategy. The speaker X would speak a sentence and wait for some time to hear any
kind of acknowledgment. If he does not hear anything back, he assumes that the previous statement was
not received properly, and therefore, repeats the sentence. A form of sliding window would mean speaking
and acknowledging multiple sentences at a time, may be 3 or 4 together. This is a via media (a middle path) between
acknowledging each sentence and acknowledging only the full message. We are not aware of it, but we actually follow all these
protocols in our daily conversations.
Apart from this error control, we also take care of flow control. This refers to the speed mismatch
between the speaker and the listener. If the speaker speaks too fast, the listener says Go slow or Please
wait if he is taking down. In the world of computers, if the receiving computer is not fast enough, or if its
memory buffer is full and cannot hold any further data, it has to request the sender to wait. This is called
as flow control. Thus, the data link control layer is responsible for the error control at the sentences level,
and the flow control. This layer also decides who is going to speak, when, by a convention, or in brief, who
has control of the medium (in this case, the telephone line). This is called as media access control. This
becomes necessary because the medium is shared by both the parties, and both can and usually do speak simultaneously, causing chaos. In fact, it can so happen that after a pause,
thinking that the other party is waiting to hear from you, you may start speaking. However, exactly at the
same time, the other party also can start speaking, thinking that you want the other party to speak. This results
in a collision. The conversation gets mixed up normally, and both the parties realize about this collision and
stop talking for a while (unless it is a married couple!). Hopefully, the parties will pause for different time
intervals, thereby avoiding collision. Otherwise this process repeats. When to start speaking, how long to wait
after the collision before restarting etc. are typical conventions followed at this layer. These are the unwritten
protocols of the media access control that we follow in our everyday conversation.
In actual practice, we know that when we speak, the electrical signals in the telephone wires change. This
is a physical layer. There must be a protocol here, too! What this level signifies is things such as how the
telephone instruments are constructed, the way the telephone wires are manufactured and laid, the signal
levels to denote engaged or busy tone, the signal level to generate a ring, the signal levels required to carry
human voice, etc. This is a protocol at a physical level. Obviously, if a telephone and a refrigerator were
connected at two ends of a wire, communication would be impossible!
The same concept of protocols applies equally well to the computer communications. Let us see, how. Let us
imagine a network of computers as shown in Fig. 12.42.
Each computer is called as a node. In distributed processing, different parts of databases/files can and
normally do reside on different nodes as per the need. This necessitates transmitting files or messages from
one node to the other as and when needed. Let us assume that Node A wants to transfer a file X to Node D.
Node A is not directly connected to Node D. This is very common, because connecting every Node to every
other node would mean a huge amount of wiring.
This is the reason that the concept of store and forward is used in computer networks. First of all, a path
is chosen. Let us say that it is A-F-G-D. Using this path, the Node A sends the file to Node F. The computer
at F normally has to store this file in its memory buffer or on the disk. This storing is necessary, because the
link F-G may be busy at this juncture, or Node F may have received a number of messages/files to be sent to
other nodes (A, E or G) already and those could be waiting in a queue at Node F. When the link F-G is free
and ready for transmitting the file from F to G, Node F actually transmits it to the Node G. Thus, the Node
F stores and forwards the file from A to G. This process repeats until the file reaches the destination Node
D. This procedure demands that each node maintains a memory buffer to store the file, and some software,
which controls the queuing of different messages and then transmitting them to the next nodes. This software
also will have to take care of error and flow control functions in an error-free manner.
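A node's store-and-forward behaviour can be pictured as a queue of messages waiting to be sent onwards. The following is a simplified C sketch of only the storing and forwarding steps (the names, the fixed queue size and the transmit() primitive are assumptions; error and flow control are not shown).

#include <stdbool.h>
#include <string.h>

#define QUEUE_SIZE 64
#define MAX_MSG    2048

/* One stored message, waiting to be forwarded to its next node. */
struct pending_msg {
    int    next_hop;            /* e.g. F forwards to G on the path A-F-G-D */
    size_t len;
    char   data[MAX_MSG];
};

/* A very small FIFO of messages buffered at this node. */
static struct pending_msg queue[QUEUE_SIZE];
static int head = 0, tail = 0, count = 0;

/* "Store": keep a received message until it can be sent onwards. */
bool store(int next_hop, const char *data, size_t len)
{
    if (count == QUEUE_SIZE || len > MAX_MSG)
        return false;           /* buffer full: refuse (a crude flow control) */
    queue[tail].next_hop = next_hop;
    queue[tail].len = len;
    memcpy(queue[tail].data, data, len);
    tail = (tail + 1) % QUEUE_SIZE;
    count++;
    return true;
}

/* Assumed primitive that actually sends the bytes on an outgoing link. */
extern void transmit(int next_hop, const char *data, size_t len);

/* "Forward": called whenever the node gets a chance to transmit. */
void forward_one(void)
{
    if (count == 0)
        return;
    transmit(queue[head].next_hop, queue[head].data, queue[head].len);
    head = (head + 1) % QUEUE_SIZE;
    count--;
}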
When the file/message is transmitted, both the nodes (source and destination) as well as all the intermediate
nodes have to agree on some basic fundamentals. For example, what is a bit 1 and what is a bit 0? As we
know, ultimately, bits 0 and 1 correspond to some physical property (e.g. voltage level 0 = bit 0, voltage level 5
= bit 1, etc). If there is no understanding between the nodes, the bits could be completely misinterpreted.
This understanding or protocol at the physical level is called as physical layer. It deals with things like
what are bits 0 and 1, the communication modes (serial/parallel, simplex/half-duplex/duplex, synchronous/
asynchronous, etc.).
How does the next node find out whether the file or the message was received correctly or not? And also,
how does that node react if it finds an error? There are several methods to detect an error in transmission, as
discussed earlier (e.g. VRC, LRC and CRC).
There are many ways in which the positive or negative acknowledgment can be sent by the receiving
node to the source node. If no error is detected, the receiving node can send a positive acknowledgment
back, meaning that everything is OK. However, if an error is detected, the receiving node can either send a
negative acknowledgment or choose not to send anything. The latter is called as time out. In this method, the
source node can wait for some time for the positive acknowledgment and having not received it in a specific
time, conclude that the file has not been received OK at the destination and then send it again. This is a good
method, excepting that when the source node starts sending the file again, the positive acknowledgment
(OK message) from the receiving node could have been already half way to the source node. When this
acknowledgment is received at the source node, it will be too late for the source node! The file/message
would have been already sent twice to the destination node! There is normally a protocol to handle such a
situation (e.g. the receiving node discards the second copy of the file). A surer way is to definitely send either
OK or NOT OK message back, and not to use the time out method, i.e. wait until either a positive or negative
acknowledgement is received. However, this entails long waits because these messages themselves could
take long time to travel, due to the network traffic. The overall network efficiency in this case reduces, as the
source node has to wait until it receives some acknowledgment.
All these functions of error detection, acknowledgments and retransmissions are clubbed under a name
error control, and constitute an important part of the communications software—i.e. the data link layer in
the OSI terminology, residing at every node, i.e. the source, destination as well as all the intermediate nodes,
because the message has to reach correctly to the next node first, before it reaches the destination node
correctly. The data link layer also takes care of flow control, to take care of the speed mismatch between
any two adjacent communicating computers. If the sending computer sends data too fast, it can get lost at
the destination. The speeds, therefore, have to be continuously adjusted or monitored. This is called as flow
control.
If an error is detected, the entire file will have to be retransmitted. If the file size is large, the probability
of an error is higher, as well as the time that it will take for retransmission. The chances of an error occurring in
the retransmission itself are also higher. This is the reason that large messages (such as a file) are broken down into
smaller chunks or blocks. These are called packets or frames. Another reason why data is sent in packets is
when two pairs of computers want to use a shared transmission line. Imagine that computer A wants to send
a big file of 10 MB to computer D by a route A-F-G-D. Also, at the same time, computer F wants to send a
small file of 2 KB to computer G. Further suppose that the transmission of the big file over the link F-G starts
momentarily ahead of the smaller file transmission over F-G. Assuming that only one pair of computers can
use one transmission exclusively, the smaller transmission will have to wait for a long time before the bigger
transmission gets over. Thus, a bigger transmission simply can hold up smaller transmissions, causing great
injustice. Thus, it is better that each communicating party breaks down its transmission into packets and
takes turns in sending them. Thus, both the files are broken down into packets first. At Node F, a packet
from the big file is followed by a packet from the small file, etc. This is called as Time Division Multiplexing,
as we have seen. At the other end (G), the smaller file is reassembled and used, whereas the packets for the
bigger file are separated, stored and forwarded to the Node D.
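The interleaving at Node F can be pictured as simple round-robin (time division) multiplexing between the packet queues of the two transmissions. A minimal C sketch follows; the helper names are assumptions, and real multiplexers are considerably more sophisticated.

/* Alternate between packets of the big file (A -> D) and the small file
 * (F -> G) on the shared link F-G, instead of letting the big transfer
 * monopolize the link. */
struct packet;                                   /* defined elsewhere      */
extern struct packet *dequeue_big_file(void);    /* assumed helpers: each  */
extern struct packet *dequeue_small_file(void);  /* returns NULL when its  */
extern void send_on_link_fg(struct packet *p);   /* queue is empty         */

void multiplex_link_fg(void)
{
    for (;;) {
        struct packet *big   = dequeue_big_file();
        struct packet *small = dequeue_small_file();

        if (!big && !small)
            break;                  /* both transfers finished              */
        if (big)
            send_on_link_fg(big);   /* one packet of the 10 MB file         */
        if (small)
            send_on_link_fg(small); /* then one packet of the 2 KB file     */
    }
}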
Obviously, every packet will have to have a header containing the source address, destination address, packet
number, etc. The addresses help in forwarding or routing the packet to the next node, and
ultimately to the final destination. The packet number helps in reassembling the packets in case they reach the
destination out of sequence.
There are two ways in which the path can be chosen. One is the virtual circuit approach, and the other
is the datagram approach. As we know, in a virtual circuit, the path is chosen in the beginning and all the
packets belonging to the same message follow the same route. For instance, if a route A-F-G-D is chosen to
send the file from A to D, all the packets of that file will traverse by the same route. At D, therefore, they will
be received in the same order only, thereby avoiding the function of re-sequencing. This is because, even if
packet 2 is received erroneously by Node G from Node F, Node G will ask for its retransmission. Node F
will then retransmit packet 2, and before sending packet 3, wait until making sure that Node G has received
packet 2 without any error. It will send packet 3 only after ensuring this. All this necessitates maintaining
many buffers at different nodes for storing and forwarding the packets. As against this, in datagram, the entire
circuit is not pre-determined. A packet is sent to the next node on the route, which is the best at that time, and
will take the packet to the ultimate destination.
Choosing the route, i.e. routing, is not a simple task by any stretch of imagination. Remember, each node
is receiving many packets from different nodes to be temporarily stored and then forwarded to different
nodes. For instance, Node F in Fig. 12.42 can have packets received from A to be forwarded to E or G, or
meant for itself. It can also have packets received from E to be forwarded to A or to G, or to D via G, or the
packets meant for itself. Node F can be receiving packets from Node G meant for Nodes A, E or for itself.
In addition, Node F itself will want to send various packets to different nodes. Therefore, the buffer of Node
F will contain all these packets. The source and destination addresses come in handy in keeping track of these
packets. The forwarding algorithm picks them up one by one and sends or forwards them based on the destination node and the route
chosen.
How is the route chosen? Say a packet is to travel from A to D: should it go by the route A-E-D,
A-F-E-D, A-F-G-E-D or A-F-E-G-D? Apparently, A-E-D seems to be an obvious answer, as A-E-D appears
to be the shortest route. However, looks can be deceptive. Node E's buffer may be full at a given moment.
If Node E follows a first-come-first-served method for forwarding the messages, there will be a long wait before our message received from A will be
forwarded to D.
This is an example of network congestion. These congestion levels have to be known before the route is
chosen. Also, a path may be required to be chosen from any node to any other. Therefore, this information
about congestion or load on all the nodes and all the lines should be available at every node. Each node then
has algorithms to choose the best path at that moment. This again is an important part of communications
software, the network layer in the OSI parlance, residing at every node.
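A much simplified sketch of such a routing decision at a node is given below: a routing table lists, for each destination, the candidate next hops, and the algorithm picks the candidate whose outgoing queue is currently the shortest. Real routing algorithms (distance vector, link state, etc.) are far more elaborate; all the names below are assumptions.

#define MAX_CANDIDATES 4

/* One routing-table entry: the possible next hops towards a destination. */
struct route_entry {
    int destination;
    int candidates[MAX_CANDIDATES];   /* neighbouring nodes               */
    int n_candidates;
};

/* Assumed helper: current number of packets queued for a neighbour,
 * used here as a crude measure of congestion on that outgoing line. */
extern int queue_length(int neighbour);

/* Pick the least-congested next hop for this destination. */
int choose_next_hop(const struct route_entry *e)
{
    int best = e->candidates[0];
    for (int i = 1; i < e->n_candidates; i++)
        if (queue_length(e->candidates[i]) < queue_length(best))
            best = e->candidates[i];
    return best;
}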
Note that although we have shown the network to be consisting of only the computers called as nodes,
in real life, it is not so simple. Since these computers in a network are used for specialized purposes (such
as running an application program or serving files on request), the job of routing packets from the sending
computer to the receiving computer is handled by dedicated computers called as routers. A router is a special
computer that has the sole job of routing packets between the various computers on a network. It decides
which packet to forward to which next node, so that it can ultimately reach the final destination. The necessary
routing software runs inside the router to carry out this routing process. Therefore, although we have not
shown for the sake of simplicity, in real life, we would have a number of routers connecting the various
portions of a network to each other.
As we know, in case of the datagram approach, different packets belonging to a single message can travel
by different routes. For a packet, a decision is taken about the next node to which it should be sent. For
instance, at a given moment, the Node F as well as the line A-F could have the least congestion (as compared
to A-E and A-B). Therefore, the packet is sent via the route A-F. It takes a finite time for the packet to reach
Node F, after which Node A decides to send the next packet. However, during this time interval, a number of packets could have
arrived at Node F from Node E, to be forwarded to either A or G, or the ones meant for F itself. Therefore,
the congestion at Node F may have increased. Hence, the next packet could be sent by Node A via the route
A-E to be ultimately forwarded to D.
Therefore, different packets belonging to a message may not travel by a given predetermined route.
In this case, it is possible that packet 3 may arrive before packet 2 at Node D. This necessitates the
function of re-sequencing and making sure that the entire message has been received without error.
Normally, an acknowledgment packet is then sent to confirm the error-free receipt of the whole message. This packet consisting of the acknowledgment for the
entire message will travel from the destination node to the source node. This function of ensuring in-sequence
and error-free receipt of the entire message and its acknowledgment/retransmission is again a part
of the communication software, typically the Transport Layer in OSI parlance. It is clear that in case of the
virtual circuit approach, there is a guarantee that packets will arrive at the destination in the order that they
were sent, because, in this case, a route (also called as a Virtual Circuit Number, VCN) is chosen in the
beginning itself. It is used for all the packets belonging to that message. This is also why a packet in the
virtual circuit approach needs to carry only this VCN with it, rather than the complete destination address.
The seven layers of the OSI reference model are:
7 (Highest) Application
6 Presentation
5 Session
4 Transport
3 Network
2 Data link
1 (Lowest) Physical
The usual manner in which these seven layers are represented is shown in Fig. 12.43.
A message sent by the source node would travel via a number of intermediate nodes before reaching the destination. These intermediate nodes are concerned with the lowermost
three OSI layers, i.e. physical, data link and network. The other four layers are used only by the sender (X) and the
receiver (Y). Each layer calls upon the services of the layer below it, as the following description shows.
The application layer software running at the source node creates the data to be transmitted to the application
layer software running at a destination node (remember virtual path?). It hands it over to the presentation
layer at the source node. Each of the remaining OSI layers from this point onwards adds its own header to
the frame as it moves from this layer (presentation layer) to the bottommost layer (the physical layer) at the
source node. At the lowest physical layer, the data is transmitted as voltage pulses across the communication
medium such as coaxial cable. This means that the application layer (layer 7) hands over the entire data to the
presentation layer. Let us call this as L7 data. When the presentation layer (layer 6)
processes this data, it adds its own header to the original data and sends it to the next layer in the hierarchy
(i.e. the session layer). Therefore, from the sixth (presentation) layer to the fifth (session) layer, the data that travels is
the L7 data plus the presentation layer's header. This process repeats at each lower layer, until finally
the original data (L7) and all the headers are sent across the physical medium. Figure 12.47 illustrates this
process.
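This header-adding process can be sketched as each layer prepending its own header to whatever it receives from the layer above. The buffer layout below is an illustration only (the one-letter "headers" are invented for the example, and real protocol stacks use more efficient schemes than copying).

#include <string.h>
#include <stdio.h>

#define MAX_PDU 4096

/* Prepend a layer's header to the data unit handed down by the layer above.
 * 'buf' holds 'len' bytes; the result is header + old contents. */
size_t add_header(char *buf, size_t len, const char *header)
{
    size_t hlen = strlen(header);
    memmove(buf + hlen, buf, len);      /* push the existing bytes right    */
    memcpy(buf, header, hlen);          /* put this layer's header in front */
    return len + hlen;
}

int main(void)
{
    char pdu[MAX_PDU] = "HELLO";        /* L7 data from the application layer */
    size_t len = 5;

    /* Hypothetical one-letter "headers" for layers 6 down to 2. */
    len = add_header(pdu, len, "[P]");  /* presentation */
    len = add_header(pdu, len, "[S]");  /* session      */
    len = add_header(pdu, len, "[T]");  /* transport    */
    len = add_header(pdu, len, "[N]");  /* network      */
    len = add_header(pdu, len, "[D]");  /* data link    */

    printf("%.*s\n", (int)len, pdu);    /* prints [D][N][T][S][P]HELLO */
    return 0;
}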
The physical layer is concerned with sending raw bits between the source and destination nodes, which,
in this case, are adjacent nodes. To do this, the source and the destination nodes have to agree on a number
of factors such as what voltage constitutes a bit value 0, what voltage constitutes bit value 1, what is the bit
interval (i.e. the bit rate), whether the communication is in only one or both the directions simultaneously (i.e.
simplex, half duplex or full duplex), and so on. It also deals with the electrical and mechanical specifications
of the transmission medium and its interfaces. To summarize, the physical layer has to take into account factors such as the medium, the voltage levels and encoding of bits, the bit rate, the communication mode and the physical connectors.
The data link layer is responsible for transmitting a group of bits between the adjacent nodes. The group of
bits is generally called as frame or packet. The network layer passes a data unit to the data link layer. At this
stage, the data link layer adds the header and trailer information to this, as shown in Fig. 12.48. This now
becomes a data unit to be passed to the physical layer.
The header (and the trailer, which is not shown but is assumed to be present) contains the addresses and
other control information. The addresses at this level refer to the physical addresses of the adjacent nodes in
the network, between which the frame is being sent. Thus, these addresses change as the frame travels from
different nodes on a route from the source node to the destination node. The addresses of the end nodes, i.e.
those of the source and destination nodes are already a part of data unit transferred from the network layer
to the data link layer. Therefore, it is not a part of the header and trailer added and deleted at the data link
layer. Hence, they remain unchanged as the packet moves through different nodes from the source to the
destination.
Let us illustrate this by an example. Let us imagine that in our original discussion, Node A wants to send
a packet to Node D. Let us imagine that we use the datagram approach. In this case, the logical (i.e. IP)
addresses of Nodes A and D, say ADDL (A) and ADDL (D) are the source and destination addresses. The
data unit passed by the network layer to the data link layer will contain them. The data unit will look as it is
shown in Fig. 12.49. Let us call this as DN.
When this data unit (DN) is passed from the network layer at Node A to the data link layer at Node A, the
following happens:
(i) The routing table is consulted, which mentions the next node to which the frame should be sent for a
specific destination node, which is Node D in this case. Let us imagine that the next node is F, based
on the congestion conditions at that time, i.e. the path A-F is selected.
(ii) At this juncture, the data link layer at Node A forms a data unit, say DD, which looks as shown in
Fig. 12.50. We will notice that DD has encapsulated DN and added the physical addresses of A and F, i.e. ADDP (A) and ADDP (F), to it.
(iii) Using the physical addresses of adjacent Nodes A and F, the packet moves from Node A to Node
F after performing the flow control functions, as discussed later (i.e. checking if Node F is ready to
accept a frame from A and at what data rate, etc). Here, the packet is passed on from the data link
layer to the network layer of node F after performing the error control function (i.e. verifying that the
packet is error-free). Here, ADDP (A) and ADDP (F) are removed and DN is recovered. Now, this DN
needs to be sent to the next hop to reach node D. For this, the final destination address, i.e. ADDL (D)
is extracted from DN. The frame now has to be sent from Node F to Node D.
(iv) Again, the routing algorithm is performed at Node F using ADDL (D) as the final destination, and the
congestion conditions, etc. and a path is chosen. Let us say that the chosen path is FG.
(v) The network layer at Node F passes DN to the data link layer at Node F. Here, the physical addresses
of F and G are added to form the data unit at the data link layer at Node F, as shown in Fig. 12.51.
(vi) This continues until the data unit at data link layer DD reaches Node D. There again, the physical
addresses are removed to get the original DN, which is passed on to the network layer at Node D. The
network layer verifies ADDL (A) and ADDL (D), ensures that the packet is meant for itself, removes
these addresses, and sends the actual data to the transport layer at Node D. A small sketch of this hop-by-hop encapsulation follows.
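The sketch below (in C, with assumed names) captures the essential idea: the network-layer data unit DN carries the unchanging logical addresses end to end, while at every hop the data link layer wraps it in a DD whose physical addresses belong only to the two adjacent nodes of that hop.

/* Network-layer data unit: the logical (end-to-end) addresses never change. */
struct dn {
    int  addl_src;      /* ADDL(A): logical address of the source node      */
    int  addl_dst;      /* ADDL(D): logical address of the destination node */
    char payload[1024];
};

/* Data-link data unit: physical addresses of the two adjacent nodes only. */
struct dd {
    int       addp_src;    /* e.g. ADDP(A), then ADDP(F), then ADDP(G), ... */
    int       addp_dst;    /* e.g. ADDP(F), then ADDP(G), then ADDP(D), ... */
    struct dn network_pdu; /* DN, carried unchanged inside DD               */
};

extern int  next_hop_for(int addl_dst);            /* routing table lookup  */
extern int  my_physical_address(void);
extern void physically_transmit(const struct dd *frame);

/* Performed at every node on the path: wrap DN for the next hop and send. */
void forward(const struct dn *dn)
{
    struct dd dd;
    dd.addp_src    = my_physical_address();
    dd.addp_dst    = next_hop_for(dn->addl_dst);   /* F, then G, then D     */
    dd.network_pdu = *dn;                          /* DN travels unchanged  */
    physically_transmit(&dd);
}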
Before a frame is actually transmitted, flow control is carried out. Based on factors such as the adjacent node's readiness, its buffer size and the congestion condition, it is determined whether the frame/packet can be sent to the adjacent
node, and if so, at what speed. If it can be sent, the node is ready to send the data. However, we have to make
sure that the medium is free to carry the frame/packet.
If the connection is a multipoint type (i.e. the medium is shared), then the problem of who should send how
much data at what times, has to be solved. This problem typically arises in Local Area Networks (LANs), and
is solved by media access control methods.
The network layer is responsible for routing a packet within the subnet, i.e. from the source to the destination
nodes across multiple nodes in the same network, or across multiple networks. This layer ensures the
successful delivery of a packet to the destination node. To perform this, it has to choose a route. As discussed
before, a route could be chosen before sending all the packets belonging to the same message (virtual circuit)
or it could be chosen for each packet at each node (datagram). This layer is also responsible for tackling the
congestion problem at a node, when there are too many packets stored at a node to be forwarded to the next
node. Where there is only one small network based on broadcast philosophy (e.g. a single Ethernet LAN),
this layer is either absent or has very minimal functionality.
There are many private or public subnet operators who provide the hardware links and the software
consisting of physical, data link and network layers (e.g. X.25). They guarantee an error-free delivery of a
packet to the destination at a charge. This layer has to carry out the accounting function to facilitate this
billing based on how many packets are routed, when, etc. When packets are sent across national boundaries,
the rates may change, thus making this accounting function complex.
A router can connect two networks with different protocols, packet lengths and formats. The network layer
is responsible for the creation of a homogeneous network by helping to overcome these problems.
At this layer, a header is added to a packet, which includes the source and destination addresses (logical
addresses). These are not the same as the physical addresses between each pair of adjacent nodes at the data link
layer as seen before. If we refer to Fig. 12.52 where we want to send a packet from A to D, addresses of nodes
A and D (i.e. ADDL (A) and ADDL (D)) are these addresses, which are added to the actual data to form a data
unit at the network layer (DN). These addresses, and in fact, the whole of DN remains unchanged throughout
the journey of the packet from A to F to G to D. Only physical addresses of the adjacent nodes keep getting
added and removed, as the packet travels from A to F to G to D. Finally, at Node D, after verifying the
addresses, ADDL (A) and ADDL (D) are removed and the actual data is recovered and sent to the transport
layer at Node D, as shown in Fig. 12.53.
To summarize, the network layer performs the following functions:
Routing: As discussed earlier.
Congestion control: As discussed before.
Logical addressing: Source and destination logical addresses (e.g. IP addresses).
Address transformations: Interpreting logical addresses to get their physical equivalent (e.g. ARP
protocol). We shall discuss this in detail in the next section of the book.
Accounting and billing: As discussed before.
Source to destination error-free delivery of a packet.
Transport layer is the first end-to-end layer as shown in Fig. 12.54. All the lower layers were the protocols
between the adjacent nodes. Therefore, a header at the transport layer contains information that helps to send
the message to the corresponding layer at the destination node, although the message broken into packets
may travel through a number of intermediate nodes. As we know, each end node may be running several
processes (may be for several users through several terminals). The transport layer ensures that the complete
message arrives at the destination, and in the proper order and is passed on to the proper application. The
transport layer takes care of error control and flow control, both at the source and at the destination for the
entire message, rather than only for a packet.
As we know, these days, a computer can run many applications at the same time. All these applications
could need communication with the same or different remote computers at the same time. For example,
suppose we have two computers A and B. Let us say A hosts a file server, in which B is interested. Similarly,
suppose another messaging application on A wants to send a message to B. Since the two different applications
want to communicate with their counterparts on remote computers at the same time, it is essential that
a communication channel is established not only between the two computers, but also between the
respective applications on the two computers. This is the job of the transport layer. It enables communication
between two applications residing on different computers.
The transport layer receives data from the session layer on the source computer, which needs to be sent
across to the other computer. For this, the transport layer on the source computer breaks the data into smaller
packets and gives them to the lower layer (network layer), from which it goes to still lower layers and finally
gets transmitted to the destination computer. If the original data is to be re-created at the session layer of
the destination computer, we would need some mechanism for identifying the sequence in which the data
was fragmented into packets by the transport layer at the source computer. For this purpose, when it breaks
the session layer data into packets, the transport layer of the source computer adds sequence numbers to the
packets. Now, the transport layer at the destination can reassemble them to create the original data and present
it to the session layer.
Figure 12.54 shows the relationship between transport layer and its two immediate neighbours.
The transport layer may also establish a logical connection between the source and the destination. A
connection is a logical path that is associated with all the packets of a message, between the source and the
destination. A connection consists of three phases: establishment, data transfer and connection release. By
using connections, the transport layer can perform the sequencing, error detection and correction in a better
way. To summarize, the responsibilities of the transport layer are as follows:
Host-to-host message delivery: Ensuring that all the packets of a message sent by a source node
arrive at the intended destination.
Application-to-application communication: The transport layer enables communication between
two applications running on different computers.
Segmentation and reassembly: The transport layer breaks a message into packets, numbers them
by adding sequence numbers at the source; and uses the sequence numbers at the destination to
reassemble the original message.
Connection: The transport layer might create a logical connection between the source and the
destination for the duration of the complete message transfer for better control over the message
transfer.
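A minimal C sketch of the segmentation and reassembly responsibility is given below (the names and fixed sizes are assumptions): the sender numbers each segment, and the receiver uses those numbers to put the message back together even if segments arrive out of order.

#include <string.h>

#define SEG_SIZE 512
#define MAX_SEGS 128

struct segment {
    int    seq_no;               /* position of this piece in the message */
    size_t len;
    char   data[SEG_SIZE];
};

/* Sender side: split a message into numbered segments. Returns the count. */
int segment_message(const char *msg, size_t len, struct segment segs[MAX_SEGS])
{
    int n = 0;
    for (size_t off = 0; off < len && n < MAX_SEGS; off += SEG_SIZE, n++) {
        segs[n].seq_no = n;
        segs[n].len = (len - off < SEG_SIZE) ? (len - off) : SEG_SIZE;
        memcpy(segs[n].data, msg + off, segs[n].len);
    }
    return n;
}

/* Receiver side: place each arriving segment at the slot given by its
 * sequence number, so the original order is restored automatically. */
void reassemble(const struct segment *seg, char *msg_buf)
{
    size_t off = (size_t)seg->seq_no * SEG_SIZE;
    memcpy(msg_buf + off, seg->data, seg->len);
}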
The main functions of the session layer are to establish, maintain and synchronize the interaction between
two communicating hosts. It makes sure that a session once established is closed gracefully, and not abruptly.
For example, suppose that a user wants to send a very big document consisting of 1000 pages to another user
on a different computer. Suppose that after the first 105 pages have been sent, the connection between the two
hosts is broken for some reason. A question now is, when the connection between the two hosts is restored
after some time, must the transmission start all over again, i.e. from the first page? Or can the transmission resume from somewhere close to the point at which the connection broke?
The session layer checks and establishes connections between the hosts of two different users. For this,
the users might need to enter identification information such as login and password. Besides this, the session
layer also decides things such as whether both users can send as well as receive data at the same time, or
whether only one host can send and the other can receive, and so on (i.e. whether the communication is
simplex, half duplex or full duplex).
Let us reiterate our earlier example of the transmission of a very big document between two hosts. To
avoid a complete retransmission from the first page, the session layer between the two hosts could create
sub-sessions. After each sub-session is over, a checkpoint can be taken. For instance, the session layers at
the two hosts could decide that after every successful transmission of a set of 10 pages, they would take a
checkpoint. This means that if the connection breaks after the first 105 pages have been transmitted, after the
connection is restored, the transmission would start at the 101st page. This is because, the last checkpoint
would have been taken after the 100th page was transmitted. The session layer is shown in Fig. 12.55.
In some cases, the checkpointing may not be required at all, as the data being transmitted is trivial and
small. Regardless of whether it is required or not, when the session layer receives data from the presentation
layer, it adds a header to it, which among other things also contains information as to whether there is any
checkpointing, and if there is, at what point. To summarize, the responsibilities of the session layer are as
follows:
Sessions and sub-sessions: The session layer divides a session into sub-sessions for avoiding
retransmission of entire messages by adding the checkpointing feature.
Synchronization: The session layer decides the order in which data needs to be passed to the transport
layer.
Dialog control: The session layer also decides which user/application sends data, and at what point
of time, and whether the communication is simplex, half duplex or full duplex.
Session closure: The session layer ensures that the session between the hosts is closed gracefully.
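The page-transfer example can be reduced to a small calculation: if a checkpoint is taken after every N pages, the transfer resumes at the page following the last completed checkpoint. A tiny C sketch (the function name is ours):

/* Pages are numbered from 1. A checkpoint is taken after every
 * 'interval' successfully transmitted pages. */
int resume_page(int pages_sent_before_failure, int interval)
{
    int last_checkpoint = (pages_sent_before_failure / interval) * interval;
    return last_checkpoint + 1;
}

/* Example from the text: 105 pages sent, checkpoint taken every 10 pages.
 * resume_page(105, 10) == 101, i.e. only pages 101 onwards are re-sent. */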
When two hosts are communicating with each other, they might be using different coding standards and
data representation formats. The presentation layer is responsible for translating between such differing
formats, so that the two hosts can interpret each other's data correctly.
The history of Windows 2000 can be traced back
to the 1980s. IBM had developed one of the first
widely successful microcomputers, the IBM PC. Also called as
the desktop computer, this microcomputer was based on the
8088 microprocessor. Microsoft wrote the Operating System
for this computer. This Operating System was called as Disk
Operating System (DOS). The microcomputer quickly became
very successful. This also meant that DOS was becoming
popular.
DOS (version 1.0) was then a very basic Operating System. It
consisted of just 8K of code in the memory. This was enhanced
to 24K in DOS 2.0. When IBM developed the PC-AT (AT stands
for Advanced Technology) in 1984 to work with Intel’s 80286
microprocessor, Microsoft increased the DOS (now version 3.0)
memory size to 36K.
All this while, DOS was a traditional command-line
Operating System. This meant that users had to type commands
to issue instructions to the Operating System. For instance, to
copy a file a.dat with another name b.dat, the user would need
to issue an instruction such as copy a.dat b.dat at the DOS
command prompt. About the same time, people were becoming
more interested in Graphical User Interface (GUI). A company
called Apple Computer had developed a computer with a GUI Operating System, called the Lisa (a precursor
to the Apple Macintosh). This meant that instead of having to type commands for getting their work done,
users could now navigate through menus, select options and drag icons from one place to another, etc. This
was a striking shift in the history of Operating Systems from the days when end users had to remember and
enter cryptic commands. Now, the users’ job became far more intuitive and simpler. Microsoft woke up to
this reality, and released its Windows Operating System version 1.0 in 1985. This version, and the next
one released in 1987 (version 2.0) were not at all popular. However, in 1990, Windows 3.0 was released
for the 80386 processor. This achieved some amount of success. But the true popularity of Windows can be
attributed to its next two releases (versions 3.1 and 3.11).
Technically, all the Windows Operating Systems until the arrival of Windows 95 in August 1995 were
simple wrappers on top of the DOS Operating System. This meant that they were more or less providing a
GUI to DOS. Windows 95 was a 32-bit Operating System, unlike its predecessors, all of which were 16-bit.
However, Windows 95 was not completely new, either. It still had portions of code that used the older 16-bit
DOS calls. The next major release, Windows 98, made in June 1998 was better in the sense that it contained
less DOS code. Yet, it was not completely free of DOS.
Incidentally, Microsoft had realized long back (around 1990) that it was not prudent to build a 32-bit
strong Windows Operating System on top of the older 16-bit DOS legacy. A better approach was surely
to write a 32-bit Operating System from scratch. With this idea in mind, Microsoft released Windows NT
(New Technology) in 1993. Windows NT does not depend on any DOS code. It is an Operating System of
its own, and is not a wrapper on top of DOS, unlike all its predecessors. Windows NT brought the power of
proven Operating Systems (such as UNIX, VMS) to the desktop. To match with the other family of Windows,
Microsoft named the first release of Windows NT as NT Version 3.1. NT is a Network Operating System
(NOS), like Novell NetWare and UNIX. That is why NT is widely used as a server Operating System.
Initially, Microsoft had believed that NT would soon become the de facto standard of the desktop Operating
Systems world, because of its power. However, this impression proved incorrect, as people continued to
show interest firstly in Windows 3.1/3.11 and then Windows 95/98. The chief reasons for this were that NT
demanded far more resources (such as main memory), and there were actually very few 32-bit
applications to exploit the NT features at that point of time. Because of this, Microsoft had to keep releasing
newer versions of Windows and Windows NT families of Operating Systems.
The major upgrade of NT came in 1996 with version 4.0. The significant attribute of this release was its
similarity with the user interface of Windows 95. By this time, NT was also gaining a place on the desktops,
and not only on the high-end servers. Many users had started migrating from Windows 95/98 to Windows NT.
However, to keep the users of the (non-NT) Windows family happy, in 2000, Microsoft released Windows
Me (Millennium Edition), the new avatar of Windows 98.
Incidentally, a major goal of Windows NT was portability. Therefore, while the non-NT versions were
being written mostly in assembly, NT was almost completely written in C. Only some low-level code, such
as interrupt handling, was written in assembly.
The next release of Windows NT was to be initially named as version 5.0. However, with the thought
of coming up with a single version of Windows for both NT and non-NT users, Microsoft renamed it to
Windows 2000. Thus, there is no new NT anymore. Everything going forward will be Windows 2000. The
plan was to make everyone use a single 32-bit Operating System. From the reactions so far, it looks as though
this plan may eventually succeed. In fact, that would be good for everybody, since one need not worry about
maintaining the legacy of older applications.
Let us compare the broad-level characteristics of Windows 95/98 and Windows NT, as shown in
Table 13.1.
Characteristic Windows 95/98 Windows NT
16-bit/32-bit? 16-bit 32-bit
Processor supported Only Intel 80x86-based Intel, MIPS, Alpha, and many others
Win32 API Yes Yes
Security features Not provided Provided
Unicode support Not provided Provided
NTFS file system Not provided Provided
FAT file system Provided Optional
Support for all old MS-DOS programs Provided Not provided
Multiprocessor support Not provided Provided
Plug-and-Play Provided Not provided
Power management features Provided Not provided
With this background in mind, let us now discuss the technical side of Windows NT and Windows 2000.
When Windows NT was designed, one of the main objectives was to provide support for a variety of Operating
System environments (discussed later). Consequently, a number of factors, which differ from one Operating
System to another had to be considered. Some of these factors are as follows:
1. Handling the nomenclature for processes
2. Protecting resources of a process
3. Support for multithreading
4. Relationship between various processes
As a result, the design of Windows NT process management was kept quite simple, thus allowing a variety
of Operating Systems to implement these features in the manner that suits them the best. On top of them,
Windows NT provides a wrapper-like structure to allow for uniformity.
In Windows NT, processes are represented as objects. The term object is used here in the same sense as it
is used in Object Technology. We shall not discuss its internals or its meaning here, and instead refer the reader
to an appropriate book on Object Technology.
Windows NT was designed from the outset to support multiple threads within a process; hence, the concept
of multithreading has been supported since its inception. Multithreading is very useful in any
Operating System, and more so in the Windows family of Operating Systems. The reason for this is that
applications running under Windows are inherently suited to multithreading. This is best illustrated with an
example. Consider this book that you are reading at the moment. Suppose that the author has used Microsoft
Word as the software to create the computer version of the book. Let us assume further that currently the
author is editing page 300 in the document. At this point, suppose that the author realizes that the definition
of a term was incorrect on page 1 of the document, and was actually not required at all. Therefore, the author
goes back to page 1, deletes the entire line, and now wants to come back to page 300. Quite clearly, the
word processor has to reorganize the document to reflect the current status (i.e. considering that one line
was deleted from page 1, so the first line of page 2 could now become the last line of page 1, and so on).
If the word processor does this now, the author may have to wait for a second or two. This would be quite
annoying! Consider another example, where the author wants to save the document on the hard disk. When
the user gives an appropriate command to the word processor to do so, the word processor may start the
action to save the document, and again make the user wait for a second. This would not go well with the
author, as well! It is for this reason that the word processor can be considered as a process, within which,
we should have a number of threads, dedicated to
specific operations. For instance, one thread can be
used for document editing, another thread could
take care of re-pagination, and a third one could
be dedicated to the file saving operations, and so
on. Most importantly, note that all the threads share
the same address space. Therefore, they cannot be
different processes. They must be different threads
within the same process.
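For instance, using the Win32 API, the word processor could hand the time-consuming save operation to a separate thread so that editing can continue. This is only an outline: the SaveDocument() function and the document argument are assumptions for the example, not a real API.

#include <windows.h>

/* Assumed application function that writes the document to disk. */
extern void SaveDocument(void *document);

/* Thread entry point: runs concurrently with the editing thread.
 * Both threads share the process's address space, so they see the
 * same document data. */
static DWORD WINAPI SaveThreadProc(LPVOID param)
{
    SaveDocument(param);
    return 0;
}

/* Called when the user picks "Save": spawn a thread and return at once,
 * so the user can keep typing while the file is written in the background. */
HANDLE StartBackgroundSave(void *document)
{
    return CreateThread(NULL,            /* default security attributes */
                        0,               /* default stack size          */
                        SaveThreadProc,  /* thread function             */
                        document,        /* argument passed to it       */
                        0,               /* run immediately             */
                        NULL);           /* thread id not needed        */
}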
Moreover, Windows NT threads and processes
have built-in synchronization facilities. The main
fields of a process and thread are shown in Fig. 13.1.
Let us describe some of these key attributes, as shown in Table 13.2.
Attribute Description
Process Id/Client Id A unique number that identifies the process, or the thread within a process
Access Token Contains security information related to the logged on user
Base Priority Execution priority for the process/thread
Quota Limits The maximum size of paged and non-paged system memory, paging file space and processor time
Execution Time Total amount of time that all the threads have executed (in the case of a process) and the actual execution time (in the case of a thread)
I/O Counters Number and type of I/O operations performed by the process/thread
Exit status Reason why the process/thread was terminated.
Thread Context The set of register values and other changing data, which defines the state of execution of a thread
Dynamic Priority The execution priority of a thread at a given moment
Alert Status A flag that indicates whether a thread should execute an asynchronous procedure call
A process in Windows NT must contain at least one thread. That is, a process begins by spawning a thread.
It can later on spawn more threads, if necessary. Obviously, the threads in a process can exchange information
via the memory and resources that they share. As an aside, if it is a multiprocessor system, more than one
thread can execute at the same time.
We had remarked that one of the goals of Windows NT was to provide support for various
Operating Systems. Let us understand what this means, and how it works. Consider an older
Operating System, such as the POSIX standard or 16-bit Windows (i.e. Windows 3.1). These
Operating Systems do not have the concept of threads. Does it not contradict our earlier argument that a
process in Windows NT begins by spawning a thread? It may seem so. However, it is not true. What happens is
this: When such an Operating System client uses Windows NT, and wants it to create a process, Windows NT
creates a process (which in turn, spawns a thread, as usual) and returns the process handle back to the client.
The fact that a thread was also created remains hidden from the client. The needs of the client (i.e. a process
id) thus get satisfied, without compromising on the internal working of Windows NT (i.e. a process must
spawn a thread).
The earlier versions of Windows (e.g. Windows 3.11, Windows 95) used to support the non-preemptive
multitasking. Many times, this used to lead to serious problems, as one misbehaving application could bring
the entire computer down on its knees. In contrast, Windows NT introduced the concept of preemptive
multitasking. Erring applications could no longer bring the whole system to a halt. Moreover, each process
now got its own address space. Therefore, problems in the area of one process could not cause damage to
other processes.
Windows NT provides rich features for synchronization of processes and threads. As we have mentioned
earlier, Windows NT treats processes and threads as objects. In addition to these, Windows NT defines
several other objects, which are specific to synchronization. They are: event, event pair, semaphore, timer,
and mutant. Table 13.3 summarizes these objects and their description.
Object Description
Process Running instance of a program, which has its own address space and resources. It terminates when the last running thread within that process is terminated.
Thread This is a serially executable entity within a process. The synchronization is over as soon as it terminates.
File This is the instance of an open file or device. After the I/O operation is complete, the synchronization effect is lost.
Event This is an alert that a system event has occurred.
Event pair This is reserved only for client-server communication. For instance, a client thread could communicate that it has sent a message to a server thread. This causes the synchronization (i.e. a transaction) to start, to ensure that the server receives the message correctly.
Semaphore This is the counter, which specifies the number of threads that can simultaneously use a resource. When the semaphore counter becomes zero, the synchronization point is reached.
Timer The system counter, which records the passage of time. This is useful for CPU scheduling synchronization activities, to determine time slices, expiration times, etc.
Mutant This object is similar to the MUTEX object, which provides mutual exclusion capabilities.
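As a small illustration of how such synchronization objects are used, the Win32 sketch below protects a shared counter with a mutex (the user-level face of the kernel mutant object). The shared_counter variable is an assumption made purely for the example.

#include <windows.h>

static HANDLE mutex;            /* wraps the kernel mutant object       */
static long   shared_counter;   /* a resource shared by several threads */

void init_sync(void)
{
    mutex = CreateMutexW(NULL, FALSE, NULL);   /* initially not owned */
}

void increment_counter(void)
{
    /* Block until we own the mutex, i.e. until mutual exclusion holds. */
    WaitForSingleObject(mutex, INFINITE);
    shared_counter++;                          /* critical section      */
    ReleaseMutex(mutex);                       /* let other threads in  */
}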
Windows NT was designed to work on various processors. The basic page size in Windows NT consists of
4K. Programs running under Windows NT can choose to implement segmentation and paging in four possible
ways, as described in Table 13.4. We have chosen 80486 processor to illustrate the capabilities of Windows
NT, since it provides the most generic view.
In segmentation, each virtual address is made up of a 16-bit segment number, and a 32-bit offset. Out of
the 16 bits of the segment number, 2 are used for protection information, leaving the remaining 14 bits to
denote a specific segment.
In paging, a two-level structure is used. The page directory consists of up to 1024 entries. Thus, the
available 4GB address space is broken down into 1024 page groups. Each such entry points to a page table.
Each entry within the page table points to a single 4KB page. The address model used by Windows NT is
depicted in Fig. 13.2. This shows how a 32-bit address is composed of the various entries.
Thus, to locate a page, the page directory is first consulted using bits 22–31 of the virtual address. The selected
directory entry points to a page table, which is indexed using bits 12–21 to obtain the required page frame.
Finally, bits 0–11 specify the byte offset within that 4 KB page.
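To make the decomposition concrete, the following small C fragment (illustrative only; the macro and variable names are our own) extracts the three fields from a 32-bit virtual address using shifts and masks:

#include <stdio.h>
#include <stdint.h>

/* Illustrative decomposition of a 32-bit virtual address into the three
   fields used by the two-level paging scheme described above. */
#define PDIR_INDEX(va)   (((va) >> 22) & 0x3FF)   /* bits 22-31: page directory entry   */
#define PTABLE_INDEX(va) (((va) >> 12) & 0x3FF)   /* bits 12-21: page table entry       */
#define PAGE_OFFSET(va)  ((va) & 0xFFF)           /* bits 0-11 : offset within 4 KB page */

int main(void)
{
    uint32_t va = 0x12345678;   /* an arbitrary virtual address */

    printf("Directory index : %u\n", PDIR_INDEX(va));
    printf("Table index     : %u\n", PTABLE_INDEX(va));
    printf("Offset in page  : %u\n", PAGE_OFFSET(va));
    return 0;
}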
As we have mentioned earlier, Windows 2000 is actually Windows NT 5.0. It should not surprise
us then that Windows 2000 contains many Windows NT 4.0 features. Each Windows 2000
process is a protected one. Windows 2000 is a 32-bit multiprogramming Operating System, which,
going forward, would be changed to a 64-bit Operating System. Each process owns a 32-bit demand paged
virtual address space (which would soon become 64-bit). There is a very clear demarcation between the
Operating System itself and the user processes. The Operating System executes in the kernel mode, whereas
the user processes execute in the user mode, thus providing complete protection (i.e. greatly reducing the
chances of a user process changing the Operating System code, thereby bringing the Operating System
down). Each process can have one or more threads, each of which is a visible and schedulable entity from the
perspective of the Windows 2000 Operating System. The US Department of Defense C2 security guidelines
are followed to ensure tight security for files, processes, directories, etc. Symmetric multiprocessing with a
maximum of 32 CPUs is supported.
Interestingly, one can actually see for oneself that Windows 2000 is a successor to Windows NT. The
system directory is called winnt, and the file which contains the Operating System code is called
ntoskrnl.exe. If we study the properties of this file, there is more to see. This file indicates that the version
number is 5 (indicating that this is Windows NT 5.0). There are many other files and subdirectories with NT
embedded inside their name somewhere.
Apart from the main new features listed earlier, Windows 2000 also introduced several features that were
relatively new for a Microsoft Operating System.
Similar to the previous Windows NT versions, Windows 2000 is also available in various versions or
flavours. There are four main options when buying the Windows 2000 Operating System: Professional,
Server, Advanced Server and Datacenter Server. The binary executables for all the four versions are identical,
but at installation time, the appropriate product type is recorded in the Windows 2000 database (called the
registry, which we shall study later). During the boot operation, the Operating System examines this database
to see which version of Windows 2000 is installed on that computer, and behaves accordingly. The key
features of these four versions are shown in Table 13.5.
Version             Max CPUs    Max RAM    Cluster size    Max clients
Professional        2           4 GB       0               10
Server              4           4 GB       0               Unlimited
Advanced Server     8           8 GB       2               Unlimited
Datacenter Server   32          64 GB      4               Unlimited
As we can see, the differences between the various Windows 2000 versions are mainly related to the
capabilities of the particular binary installed on a computer. This allows Microsoft to charge different
customers differently (e.g. charge business consumers more than home users, etc). The cluster size is
something not found commonly in the Operating System literature. It actually refers to the capability of
Windows 2000 to allow multiple computers running Windows 2000 to be used as a single computer. This can
be useful in many situations, most notably, when running a computer as a Web server. Busy Web sites can
have literally thousands of users sending requests for Web pages at the same time. In such cases, it is very
useful to cluster multiple computers to form a single logical Web server computer.
Technically, the way the different versions are distinguished is surprisingly simple. Two variables,
ProductType and ProductSuite, indicate the particular version of Windows 2000 being used. Changing the
values of these variables is, of course, illegal. Moreover, Windows 2000 detects such a change and records
this attempt of tampering with these variables in a manner that cannot be repudiated later.
As we know, every Operating System has a set of functions, called as system calls. These are low-level calls
(such as disk I/O, memory management calls, etc.). Microsoft has always maintained great secrecy about the
system calls related to the Windows family of Operating Systems. However, Microsoft has released a fully
documented set of function calls, called as Win32 API (Application Programming Interface). Developers
can make calls to the Win32 API and get their work done. They need not be concerned as to whether a Win32
API function call actually makes a system call, or does something else. Existing function calls in Win32 API
never change. However, new calls can be added with each new version. The idea of using Win32 API is shown
in Fig. 13.3.
This idea is remarkably different from UNIX, where the system calls are public and can be used by anybody
directly. They are also far fewer in number. In contrast, Windows not only hides the system calls, but
also provides a large number of function calls via the Win32 API.
Win32 API provides extensive function calls for meeting all the traditional Operating System objectives,
such as process creation/management, inter-process communication, memory management, file I/O,
security, etc. Additionally, because Windows 2000 is a GUI-based Operating System, it has the necessary
API calls for screen management (i.e. creating/destroying/moving menus, text boxes, lists, etc). There are
certain details that Windows 2000 hides from the application programmer. For example, the mechanism of
memory management is hidden from the programmer to a large extent. The programmer can, however,
control which file is mapped onto which area of the process's virtual memory. Using this feature (called
memory mapping), the process under consideration can read from or write to the file as if its contents
resided in the main memory.
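As an illustration of this memory-mapping facility, the following minimal sketch uses the standard Win32 calls CreateFileA, CreateFileMappingA and MapViewOfFile to view a file as ordinary memory. The file name sample.txt is hypothetical and error handling is kept to a bare minimum:

#include <windows.h>
#include <stdio.h>

/* Map a file into the process's virtual address space and read it
   as if it were ordinary memory. "sample.txt" is a hypothetical file. */
int main(void)
{
    HANDLE hFile = CreateFileA("sample.txt", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
    if (hMap == NULL) { CloseHandle(hFile); return 1; }

    /* The file's contents now appear as a region of virtual memory. */
    const char *view = (const char *) MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
    if (view != NULL) {
        printf("First byte of the file: %c\n", view[0]);
        UnmapViewOfFile(view);
    }

    CloseHandle(hMap);
    CloseHandle(hFile);
    return 0;
}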
Windows 2000 treats files as a sequence of bytes. This is very similar to the UNIX philosophy, which also
does not recognize the concept of blocks and records. There are more than 60 API calls in Windows 2000 to
perform file operations. We shall study some of them subsequently.
The main feature that distinguishes Windows 2000 from other Operating Systems in terms of the API set
is the presence of GUI calls. Any Operating System is expected to anyway provide API calls for performing
tasks such as process/thread management, file I/O, memory management, etc. However, because Windows
2000 also provides a rich GUI functionality for the users to work with, it also has to ensure that the application
programmer can manipulate the screen the way she wants. That is also the reason why Windows 2000
provides thousands of API calls for working with the various aspects of the GUI, such as windows, toolbars,
task bars, boxes, lists, and so on.
In most cases, when a Win32 API call is made, the Operating System performs the necessary actions
(usually, this causes the creation of kernel objects, such as processes, files, threads, etc.) and returns a
handle to the caller. The handle is similar in concept to a file descriptor or a process id. The caller can use
this handle to perform more tasks on the object, as desired. Like UNIX file descriptors, only the process that
created the handle knows about it. It cannot be passed on arbitrarily to other processes. Every object created
by a process also bears a security descriptor, which signifies what actions can be performed on that object,
and by whom.
In addition to this mechanism, the Operating System itself creates a number of objects. These objects
are typically the ones, which are useful to many processes running on the top of that Operating System. For
instance, the Operating System may create a device driver object for interfacing with the printer. This object
reference can then be used by the Operating System itself, as well as by other processes, for their use. One
question may arise at this stage. If such an object can be shared, why did we mention earlier that object
handles cannot be passed from one process to another? Actually what we said was not completely true.
Object handles can indeed be passed between processes. However, this procedure requires careful sharing and
protection mechanisms, in order to ensure that there are no concurrency and security issues. This is what we
meant when we said that processes cannot arbitrarily exchange object handles.
Interestingly, many times, people ask whether Windows 2000 is an Object Oriented Operating System, or
not. The answer to that is both yes and no. On the one hand, the only way to work with the objects in Windows
2000 is by invoking the methods on their handles. This is clearly an object-oriented principle. However, on
the other hand, there is little flavour of inheritance and polymorphism in Windows 2000, if at all. Therefore,
we cannot argue one way or the other with complete conviction regarding this issue.
Every Operating System must keep a lot of information about the files on the disk, registered programs,
authorized users, and so on. There are different mechanisms for storing this information. The early versions
of Windows (e.g. Windows 3.1/3.11) used to maintain a lot of initialization files (called as .ini files) to store
this information. This was quite clumsy and chaotic.
Beginning with Windows 95, this information is now stored in a centralized database, called as the
Windows registry. Since it contains all the critical information about a given computer, the registry is perhaps
the single most important aspect of any Windows installation. If it is corrupted, the computer may become
completely or partially useless. Unfortunately, to add to the problems, working with the registry is not easy.
The nomenclature is quite tough to understand and work with.
The basic idea of the registry is actually quite similar to that of a directory of disk files found on any
computer. Just as we have the root directory, which contains sub-directories, sub-sub-directories, and so on,
the Windows registry is also organized in a similar hierarchical fashion. Each high-level directory is called
as a key. All top-level directories begin with the word HKEY (handle to a key). For instance, the information
about the currently logged-on user is stored in a top-level directory called HKEY_CURRENT_USER.
Sub-directories have more meaningful names. The lowest-level entries contain the actual
information, and are called values. Each value consists of three parts: name, type and data. For example, a
value could indicate that (a) this is the information regarding the default window position (name), (b) it is an
integer (type) and (c) it is equal to 2e003fc (data).
Unless the user is well conversant with the concepts of the registry and the implications of changing any
values therein, it is strongly recommended not to play around with it. Merely viewing its contents, though,
is safe. Generally, the command regedit or regedt32 causes the contents of the registry to be displayed on the
screen. A sample screenshot of the registry is shown in Fig. 13.4.
As we can see, HKEY_CLASSES_ROOT, HKEY_CURRENT_USER, etc. are the keys, each of which
contains sub-keys, and so on.
A Windows programmer can access the registry with the help of Win32 API calls. These calls allow for
the creation of new registry entries, changes to the existing entries, search for specific entries, etc. Some of
these calls are shown in Table 13.6.
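As an illustration of such calls, the short C program below reads one value from the registry using the standard functions RegOpenKeyExA and RegQueryValueExA (link with advapi32). The key and value chosen here are only one example of the name/type/data idea described above:

#include <windows.h>
#include <stdio.h>

/* Read one value from the registry using the Win32 registry API. */
int main(void)
{
    HKEY  hKey;
    char  buf[256];
    DWORD type, size = sizeof(buf);

    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                      "SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion",
                      0, KEY_READ, &hKey) != ERROR_SUCCESS)
        return 1;

    if (RegQueryValueExA(hKey, "ProductName", NULL, &type,
                         (LPBYTE) buf, &size) == ERROR_SUCCESS &&
        type == REG_SZ)
        printf("ProductName = %s\n", buf);   /* name, type and data of one value */

    RegCloseKey(hKey);
    return 0;
}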
Windows automatically backs up all the registry entries during the system shutdown. It also keeps backing
them up automatically after a specific period. This prevents a loss of the most critical aspect of any Windows-
based computer these days. However, if the registry still gets corrupted due to an unforeseen event and no
backups are available, all the software must be re-installed.
Let us briefly discuss the important entries at the root of a Windows 2000 registry. Each of these has sub-
roots and more levels, as and when applicable. However, we shall limit our discussion only to the top one or
two entries, essentially to understand what sort of information the registry provides.
HKEY_CLASSES_ROOT: This sub-key (i.e. sub-directory) is a link to another entry in the registry. It points
to the directory which deals with the Component Object Model (COM) objects. It also takes care of the
association between the various file extensions and the corresponding programs. For instance, when a user
double-clicks on a JPEG image, it is this directory which is consulted by Windows 2000 to see which
program to invoke (perhaps Microsoft Paint, in this case).
HKEY_CURRENT_USER: This sub-key contains information about the current user. This information is
typically concerned with the user preferences.
HKEY_LOCAL_MACHINE: This is the most significant sub-key in the registry. It holds all the valuable
information about the computer on which it resides. This directory has sub-directories, as follows:
HARDWARE: This sub-key contains information regarding the hardware, device drivers, etc.
SAM: This sub-key contains the list of users, groups, passwords, accounts and
security information. This information is useful for authentication. The policy decisions (such as the
maximum number of failed login attempts before disabling the user account, minimum password length and
its organization, etc.) are also stored here.
SOFTWARE: As the name suggests, this sub-key holds information regarding the software installed on
that computer. For instance, if the user has installed the Netscape browser, this sub-key will have another
sub-key in turn, called Netscape, which will contain information regarding the installed version of Netscape,
uninstalling information, list of drivers used, etc. Each such sub-key can hold whatever information the writer
of the software wants it to hold.
SYSTEM: This sub-key contains information that is useful at the time of the booting of the computer. For
instance, it contains the list of drivers to be loaded at the time of start-up, the list of services to be started after
a successful boot, disk partitioning information, etc.
HKEY_USERS: This sub-key contains all the users' profiles. This data is useful because different users have
different preferences for the same program on the same computer. For instance, a particular user may want
the Microsoft Office program to start executing automatically as soon as the computer boots. Another user
may not want this feature. Similarly, different users have different preferences for backgrounds, screen savers,
and so on. All such individual preferences are recorded in this sub-key.
The Windows 2000 Operating System is composed of two sections: the Operating System and the environment
subsystems. The Operating System runs in the kernel mode, and handles the traditional Operating System
functionalities, such as process management, disk I/O, memory management, etc. The environment
subsystems assist user programs in many ways. We shall examine this as we go along.
Figure 13.5 shows the overall organization of the Windows 2000 Operating System. This is a slightly
simplified view. Technically, Windows 2000 is divided into many layers. Each layer makes use of the services
of the layers below it. The Executive is sub-divided into many modules. Each module performs a well-defined
function, and allows other modules to interact with it in the form of interfaces.
Let us discuss various layers in this organization one-by-one. Before we do so, it is worth pointing out
that the HAL and the kernel—the two bottom-most software layers—are written in the C and Assembly
languages. This makes sense, as they are very close to the actual hardware, and therefore, they cannot be
isolated from it. However, the upper layers are almost completely written in C, making them hardware-
independent. The driver programs are written in C and in some cases, C++.
The kernel is located above the Hardware Abstraction Layer (HAL), and is involved in most of the
core Operating System functions, such as memory management, process management, disk I/O, etc. It uses
the executive (discussed subsequently) for this purpose. Although it is hardware-specific, most of the kernel
code is written in C. Only those portions of the kernel are written in Assembly language, where performance
is the most significant criterion.
The main objective of the kernel is to isolate the rest of the Operating System completely from the hardware
(with the help of HAL), and make it highly portable. It accesses the HAL for lower-level hardware functions,
and provides slightly higher-level abstractions of the same for the higher layers. As an example, the HAL
provides mechanisms to bring the disk arm onto the appropriate sector, load the various registers with the
needed parameter values, and actually initiate the read operation; the kernel goes one step further by offering
the higher layers a simpler facility for reading a certain number of bytes from the disk.
The kernel also handles thread switching. When it is time to stop the execution of one thread and begin the
execution of a new one, the kernel handles the tasks of saving the information about the outgoing thread (such
as its registers, its state of execution, etc.) in the process table, marking it as waiting, and loading the new
thread in the main memory for execution.
Notably, some portion of the kernel as well as the entire HAL is always in the main memory of the
computer. It is not swapped out.
Apart from providing a high-level abstraction of the hardware and handling thread switches, the kernel
also provides support for two low-level objects: control objects and dispatcher objects. These objects are
invisible to the users. The executive of Windows 2000 provides an abstraction on top of these objects.
Control objects control the operation of the computer, in the form of processes, objects, interrupts,
etc. Control objects, in turn, contain two objects, called as DPC and APC.
The DPC (Deferred Procedure Call) object splits the time interval of an Interrupt Service
Procedure (ISP) into two portions: critical and non-critical. This helps Windows 2000 in serving
the interrupts appropriately, based on their criticality. For example, when the user presses a key
on the keyboard, the key needs to be captured immediately, but it need not be processed at that
very moment if another very important activity is going on at the same time. As long as the user
is shown an appropriate response to the keystroke in, say, about 100 milliseconds, deferring the
processing of the keystroke in order to devote the CPU time to the more critical work is acceptable. In order to deal
with this kind of classification, the DPC maintains a DPC queue, which contains the interrupt
information.
The APC (Asynchronous Procedure Call) object is quite similar to a DPC object. However, the
difference between the two is that whereas DPC runs in the context of the Operating System, APC
executes in the context of a specific process. Why is this differentiation necessary? Let us examine
this. Consider a mouse button click. When this happens, it may not be so significant at that instant
to know for which process this mouse button was clicked. The interrupt for this can be straightaway
generated, and the actual process for which it is destined, can be found out subsequently. However,
now consider a disk READ operation. In this case, it is immediately relevant to know for which
process the READ operation is being performed, because, Windows 2000 has to allocate buffers
for that process, due to the issues of privacy and security, etc.
Dispatcher objects, on the other hand, are a group of objects for which a thread can wait. Examples
of such objects are semaphores, mutexes, etc.
The executive sits above the kernel in the Windows 2000 organization. The executive can be
ported to other hardware platforms quite easily, as it is quite abstract in nature, and is written in C. The executive is
made up of various components, the boundaries of which are described vaguely in the literature. The general
view is to consider the executive as a collection of 10 components, which interact with each other quite often
to perform a task.
Let us discuss these components in brief. We shall examine some of these in greater detail later in this
chapter.
Process and thread manager: This component deals with the handling of processes and threads, right from
their inception to swapping in/out, and termination. This component is the basis for multiprogramming in
Windows 2000.
Object manager: This component manages all the objects in the Operating System, such as timers, threads,
files, I/O devices, semaphores, etc. The object manager keeps track of all these objects. When a new object
is to be created, it provides a portion of the virtual memory from the kernel's address space to that object. It
returns that memory back to the free pool when the object needs to be removed from the memory.
Virtual memory manager: Windows 2000 works with demand-paged virtual memory. This component
handles that area; that is, it maps virtual pages onto physical page frames. It also enforces the protection
mechanisms associated with virtual memory.
I/O manager: As the name suggests, this component deals with I/O devices. It provides a mechanism for
handling I/O devices, and provides an abstract set of I/O services. It makes device access generic by
providing an abstract framework, and in turn calling the actual device driver, as appropriate. The I/O manager
also holds all the device drivers. Technically, the I/O manager provides support for two types of file systems,
the older FAT (a legacy of the MS-DOS days) and the newer NTFS (extended from Windows NT). We shall
discuss these file systems later.
Cache manager: This component ensures that the most recently accessed disk blocks are in the main memory.
This facilitates faster access in the likely event that they are referred to again. This component keeps
an eye on the main memory to judge which disk blocks are going to be needed again in the near future, and
accordingly caches them. As we have mentioned, Windows 2000 can support multiple file systems. However,
one cache manager can serve all of them, obviating the need for separate cache managers, one for each type
of file system. The working model of the cache manager is quite straightforward. Whenever a disk block is
needed, the cache manager examines its area to see if the block is cached. If it is, the cache manager simply
returns the contents of the block to the requesting process. However, if the block is not cached, the cache
manager invokes the services of the appropriate file system to bring it in.
Plug-and-play manager: The job of this component is to locate and load an appropriate driver when new
hardware is attached to the computer. This component is alerted when a new piece of hardware is found. At
this point, it takes over, and makes sure that the new piece of hardware works seamlessly with the rest of the
Operating System.
Security manager: Windows 2000 enforces stringent security at the Operating System level. It follows the
guidelines provided by the US Department of Defense for this purpose. The security issues range from the
time the user attempts to log on (i.e. user authentication) up to the time she logs off. In the meanwhile, a
number of security verifications are required, such as access control, zeroing out virtual pages before they are
allocated to another process, etc. The security manager is responsible for all these activities.
Configuration manager: This component takes care of the registry by adding new entries as and when
needed, and retrieving them when queried.
Power manager: The main task of this component is to save power. When a computer is not in use for some
time, this component detects this fact, and turns the monitor and disk drives off. Additionally, on laptops, the
power manager monitors the battery, and when it realizes that the battery is about to run dry, it alerts the user,
so that the user can save open files and other pieces of data and gracefully shut down the computer, rather
than suffering an abrupt shutdown.
Local Procedure Call (LPC) manager: This component provides for inter-process communication between
processes and the sub-systems. Notably, the LPC mechanism is also used for some system calls. Therefore,
it is imperative that this mechanism is efficient.
System interface: This portion facilitates communication between user application programs
and the Windows 2000 kernel. Its main job is to accept Win32 API calls and map them onto the
corresponding Windows 2000 system calls. This marks the interface between the kernel area and the user
area. It also frees the application programs from worrying about the system call details, letting them
concentrate only on the higher-level Win32 API calls.
Objects are perhaps the most important concept in Windows 2000. They
provide a standardized and reliable interface to all the Operating System resources and data structures, such
as processes, threads, semaphores, etc. Some of the most significant aspects of this are as follows:
handles to objects.
the security checks and constraints. This minimizes the chances of bypassing these checks.
fulfilment.
Windows 2000 views an object as a set of consecutive words in memory, i.e. as a data structure residing in
the main memory. Each object has an object name, the directory in which it resides, security information, and
a list of processes with open handles to the object. The overall structure of an object is shown in Fig. 13.6.
As we can see, each object has two portions: a header (which contains information common to all object
types); and object-specific data (which contains data specific to that object). The header of the object mainly
contains information such as its name, directory, access rights, processes that access it, etc. Additionally,
it also specifies the cost involved in using this object. For instance, if a queue object costs one point and a
process has an upper limit of 10 points (which it can consume), the process can use at the most 10 different
queues. This also helps in putting checks on the usage of the various Operating System resources.
An object occupies kernel virtual address space, when it is active. This means that the moment an object
ceases to be useful, it should be immediately removed. To enable this, the header of the object contains a
field, called as reference count. Whenever a process starts using an object for the first time, it increments this
counter by 1. Whenever a process is no longer interested in the object, it decrements this counter by 1. As
such, when the value of this count becomes 0, it means that no process is using the object. Accordingly, at this
juncture, Windows 2000 removes the object from the memory, thus freeing valuable resources.
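The following schematic C sketch (our own simplification, not the actual Windows 2000 data structure) conveys how such a reference count in the object header keeps an object alive only while somebody is using it:

/* A hypothetical, simplified object header with a reference count. */
typedef struct object_header {
    const char *name;        /* object name                         */
    int         ref_count;   /* number of current users of the object */
    /* ... security information, type pointer, etc. ...             */
} object_header;

static void obj_acquire(object_header *obj)
{
    obj->ref_count++;        /* a process starts using the object   */
}

static void obj_release(object_header *obj)
{
    if (--obj->ref_count == 0) {
        /* No process is using the object any more: at this point the
           kernel can reclaim the virtual address space it occupies. */
    }
}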
Objects are typed. This means that each object has some properties that are common to all the objects
belonging to that type. The type is indicated with a pointer from the object header. The type information
consists of its name, synchronization information, etc.
Table 13.7 lists some of the most prominent object types managed by the object manager.
Windows 2000 resolves this problem with an elegant solution. The approach is to use a Dynamic Link
Library (DLL). A DLL contains a group of functions and associated procedures to execute a given task. For
instance, there could be a DLL that handles all graphics-related functionalities, which are reusable across
applications. Windows 2000 provides many such DLLs. When an application program is linked, it includes
the appropriate DLLs in the link process. The linker notes the fact that some of the calls in the application
program need the services of the DLLs, and makes an entry of this in the executable file that is generated.
Note that the DLL file itself is not included as a part of the generated executable file, but only its entry is.
At run time, when the program makes a call to one of the functions in the DLL, the loader looks
for the DLL on the disk, loads it in the main memory, and maps its address into the virtual address space of
the application program. When another application program makes a call to one of the functions in the same
DLL, the loader does not load another copy of the DLL in the main memory. Recall that it has already been loaded
once before. Therefore, the loader now simply maps the same DLL into the address space of this application
program. If more application programs need the services of the same DLL, the same copy can be reused as
many times as necessary! Since this process happens at run time, and the DLL is loaded dynamically, as and
when needed, it is called a Dynamic Link Library (DLL). This idea is shown in Fig. 13.8.
Of course, although many application programs can share a single DLL, each one has its independent copy
of the data. Only the DLL code is shared. This ensures that one application program does not overwrite any
data area of the DLL in which another program is interested.
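The same idea can also be exercised explicitly by a program. The sketch below loads a DLL at run time and calls one of its functions using the standard Win32 calls LoadLibraryA and GetProcAddress; user32.dll and MessageBoxA are used purely as a convenient example:

#include <windows.h>

/* Load a DLL explicitly at run time and call one of its exported functions. */
typedef int (WINAPI *MSGBOX_FN)(HWND, LPCSTR, LPCSTR, UINT);

int main(void)
{
    HMODULE hDll = LoadLibraryA("user32.dll");   /* map the DLL into our address space */
    if (hDll == NULL) return 1;

    MSGBOX_FN pMessageBox = (MSGBOX_FN) GetProcAddress(hDll, "MessageBoxA");
    if (pMessageBox != NULL)
        pMessageBox(NULL, "Hello from a dynamically loaded DLL!", "Demo", MB_OK);

    FreeLibrary(hDll);   /* decrement the DLL's reference count */
    return 0;
}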
Thus, we can now imagine how a Windows 2000 program works. Every user process usually links with
some DLLs, which together constitute the Win32 interface. When the user process needs to get some task
done from a DLL function, it makes an appropriate call to the API function. At this stage, one of the following
things can happen: either the call is executed entirely in user mode, within the DLL itself, or the call needs
the services of the kernel, in which case it is routed through the system interface library (ntdll.dll), which
passes control to the Operating System.
Windows 2000 supports the concept of processes, like any other Operating System.
However, it also introduces a few more concepts. Let us discuss all of them.
Job: A job in Windows 2000 is a collection of processes. Windows 2000 manages a job with a view to
imposing restrictions, such as the maximum number of processes per job, the total CPU time available
for each process within a job, and the security mechanisms for the job as a whole (e.g. whether any
process within a job can acquire administrative privileges).
Process: A process is a running instance of a program, which demands and uses system
resources, such as main memory, registers, CPU time, etc. This is quite similar to processes in any
other Operating System. Every process is provided with a 4 GB address space, half of which is reserved
for the Operating System. A Win32 API call can be used to create a process. This call accepts the
executable file name associated with the process, which specifies the initial contents of the process's
address space.
Thread: A process consists of one or more threads. In fact, every process must start by creating a thread. A
thread is the basis for Windows 2000 scheduling, because the scheduler always selects a thread for execution,
never a process. A thread has one of the possible states (ready, running, blocked, etc.), which a
process does not have. Every thread has a unique thread id, which is different from the parent process
id. A thread runs in the address space of its parent process. When a thread finishes its job, it can exit.
When all the threads in a process are dead, the process itself gets terminated.
Fibre: A thread can create one or more fibres, just as one process can create many threads. The reason for
this is the overhead associated with thread switching. As we know, to switch from one thread to another,
the Operating System must save the context of the running thread, mark it as blocked, load another thread
and its parameters, mark it as running, and hand over the CPU control to this new thread. This is quite
resource-consuming. For this reason, a Windows 2000 user process can spawn one or more fibres and switch
between them without making any explicit system calls. Windows 2000 provides Win32 API calls to create
and manage fibres. A detailed discussion of how this works is beyond the scope of the current text.
A summary of these ideas is shown in Fig. 13.9. Windows does not have a concept of process hierarchy.
That is, all processes are treated as equal. There is no concept of a parent process or a child process.
Pipes are similar to the UNIX pipes. In Windows 2000, two modes of pipes are supported: byte and
message. Byte-mode pipes do not differentiate between message boundaries (just as UNIX pipes). However,
message-mode pipes preserve message boundaries (unlike UNIX pipes). For instance, if we send eight 56-
byte messages using a message-mode pipe, they would be treated as eight 56-byte messages at the receiver,
and not as a single 448-byte message.
Not present in UNIX, mailslots are actually similar to pipes in some ways. Unlike pipes, mailslots
are one-way. They do not provide an assurance of reliable delivery, but they do provide a broadcasting facility.
Quite similar in nature to pipes, sockets allow remote communication between processes on dif-
ferent machines. For instance, Process A can open a socket with another Process B on a different computer,
and send messages using this socket over the network. Process B reads messages from this socket. Of course,
sockets can also be used for local communication (i.e. communication between two processes on the same
machine), but their overhead makes them more suitable for remote communication.
Using RPC, Process A on one machine can call a procedure/function of
another Process B on a different machine, as if A and B are running on the same machine. The result of the
procedure call is also returned back by Process B to Process A over the network transparently, as if they are
local.
To ensure that inter-process communication works correctly, Windows 2000 supports many inter-process
synchronization methods, such as semaphores, mutexes, critical sections and events. As we have mentioned
before, the basic entity under consideration here is a thread, and not a process. Therefore, synchronization
mechanisms are also relevant for threads, not processes. Consequently, when a thread blocks on a semaphore,
the other threads in its process are not affected, and they can continue running. Let us quickly
examine the synchronization mechanisms used in Windows 2000.
The Win32 API provides a call CreateSemaphore, which receives an initial value, and a maxi-
mum value. A semaphore is a kernel object and hence, it is bound by the rules of the security descriptor and
a handle. Calls to up or down the semaphore are available.
Mutexes are simpler than semaphores, although both are kernel objects from a technical perspective.
Unlike semaphores, mutexes do not have any counters. They are simple binary locks, which can be applied
or released as desired.
Critical sections are similar in nature to mutexes, except that they are local to the address
space of the thread that created them. Moreover, critical sections are not kernel objects, because of which,
they neither have security descriptors, nor can they be passed between processes.
An event can take one of the two possible states: set and cleared.
In Windows 2000, there are about 100 API calls to deal with processes, threads and fibres. Most of these
calls relate to IPC in some fashion. Examples of such calls are listed in Table 13.8.
Function Description
CreateProcess Create a new process
CreateThread Create a new thread in a process
CreateFibre Create a new fibre within a thread
ExitProcess Terminate a process
ExitThread Terminate a thread
ExitFibre Terminate a fibre
CreateSemaphore Create a new semaphore
CreateMutex Create a new mutex
OpenSemaphore Open an existing semaphore
OpenMutex Open an existing mutex
ReleaseSemaphore Release an existing semaphore
ReleaseMutex Release an existing mutex
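As a small illustration of these calls, the following program creates a mutex with CreateMutexA and uses it (together with the standard wait call WaitForSingleObject) to protect a shared counter updated by two threads. The thread function and the counter are, of course, our own:

#include <windows.h>
#include <stdio.h>

static HANDLE g_mutex;
static long   g_counter = 0;

static DWORD WINAPI worker(LPVOID arg)
{
    (void) arg;
    for (int i = 0; i < 100000; i++) {
        WaitForSingleObject(g_mutex, INFINITE);   /* acquire the lock   */
        g_counter++;                              /* critical section   */
        ReleaseMutex(g_mutex);                    /* release the lock   */
    }
    return 0;
}

int main(void)
{
    g_mutex = CreateMutexA(NULL, FALSE, NULL);    /* unnamed, not owned */

    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);

    WaitForMultipleObjects(2, t, TRUE, INFINITE); /* wait for both threads */
    printf("Counter = %ld\n", g_counter);         /* always 200000         */

    CloseHandle(t[0]); CloseHandle(t[1]); CloseHandle(g_mutex);
    return 0;
}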
Let us now examine the typical steps involved in the creation of a Windows
2000 process, which will give us a good idea of what is involved here.
1. The process that wants to create another process invokes the CreateProcess function.
2. The executable file, which is a parameter to the CreateProcess call, is opened. If it is a valid file, the
registry is consulted to check if it is a special file (e.g. if it must be run in a supervisory mode). This
is done by the file kernel32.dll.
3. A system call NtCreateProcess is now made. This call creates an empty process, and enters its entry
into the name space of the object manager. This causes the creation of the kernel object as well as the
executive object.
4. The process manager creates a Process Control Block (PCB) for the object, and assigns a process id,
access parameters and various other fields to it.
5. Another system call, NtCreateThread, is now made. This call creates the initial thread. At the same
time, the user and the kernel stacks are created.
6. The file kernel32.dll now sends a message to the Win32 environment subsystem, informing it about
the new process. It also passes to it the handles for the process and the thread. Entries for this process
and thread are made in the subsystem tables, so that its list of processes and threads is up to date.
7. The thread can start executing now. It begins by making a system call to complete its initialisation
routine.
8. The system call sets the priority of the thread and performs other housekeeping. Now, the actual code
inside the thread takes over.
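From the application programmer's point of view, all of the above is triggered by a single call. The sketch below (with notepad.exe chosen purely as an example) creates a process, waits for it to terminate, and closes the returned handles:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    STARTUPINFOA        si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    char cmd[] = "notepad.exe";

    /* Step 1 of the sequence above: the application simply calls CreateProcess;
       everything else happens inside the Operating System. */
    if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        printf("CreateProcess failed (%lu)\n", GetLastError());
        return 1;
    }

    /* The returned handles refer to the new process and its initial thread. */
    WaitForSingleObject(pi.hProcess, INFINITE);   /* wait until it terminates */

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}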
Interestingly, Windows 2000 does not have a centralized scheduling philosophy. Instead, when a thread
cannot execute any further for any reason, it enters the kernel mode and runs the scheduler itself. The
scheduler determines to which thread the control should now be handed over. The reasons
for a thread to stop executing, and therefore to call the scheduler, can be one of the following:
(a) The thread needs some resources (e.g. an event, I/O), without which it cannot execute any further.
(b) The thread’s allotted time slice expires.
A thread in Windows 2000 can be in one of the six possible states, namely, ready, standby, running,
waiting, transition and terminated. Let us quickly examine what these states mean.
Ready: Such a thread can be executed if the processor is available.
Standby: This state indicates that the thread has been selected for execution on a specific processor. If the
priority of this thread is high, the thread which is currently executing on that processor is pre-empted, and
the thread in the standby state is executed.
Running: This thread is currently executing, as the resources of the processor are available to it.
Waiting: Such a thread usually waits for a resource to be made available, or for an event to occur.
Alternatively, such a thread has been suspended by the environment subsystem.
Transition: When the required resources for a waiting thread are made available, it moves into the ready state.
Otherwise, it changes into the transition state, and becomes ready when the required resources become
available.
Terminated: Such a thread has completed its execution.
The standby state and the transition state are somewhat unique to Windows 2000. Other states of a thread
(i.e. ready, running, waiting and terminated) are quite common across all the Operating Systems. The
transition state allows the Operating System to handle the cases of resource pre-emption. When a resource
(e.g. the disk) is in high demand, the Operating System attempts to allocate it to a process which is
about to execute, rather than to a process which is dormant.
The typical state transition of threads in the case of Windows 2000 is as shown in Fig. 13.10.
As we have mentioned previously, each process in Windows 2000 is allocated a 4GB virtual address space.
Windows reserves about half of this address space for its own storage. Virtual addresses are 32-bits long.
Each virtual page can take one of the three possible states: free, reserved, or committed. A free page is the
one that is not used currently. A free page becomes a committed page when code or data are mapped on to it.
Of course, a committed page can be in the main memory, or on the disk. If a reference is made to a committed
page, it is directly accessed from the main memory, if it is available. Otherwise, a page fault occurs, and the
page is brought into the main memory from the disk. A page can acquire the status of being a reserved page,
in which case it cannot be used by any process other than the one which reserved it.
The Win32 API provides many functions for memory management. These functions allow a program to
allocate, free, query and protect areas of the overall virtual address space. Some of these calls are tabulated
in Table 13.9.
Function Description
VirtualAlloc Reserve an area of memory
VirtualFree Release an area of memory
VirtualProtect Change the protection parameters (read/write/execute) on a region
VirtualQuery Obtain information about a region
VirtualLock Disable paging for a region
VirtualUnlock Enable paging for a region
Windows 2000 supports the concept of a sequential 4GB address space per process. Segmentation is not
allowed. Interestingly, the memory manager does not work with threads, but instead, it works with processes.
Thus, a process, and not a thread, owns an address space. When a process begins executing, no pages related
to that process are automatically brought into the main memory. A page is brought into the main memory
only when required (demand paging). In other words, the only way a page is brought into the main memory
is through a page fault.
When a page fault occurs, two events may be necessary: (a) The required page needs to be brought from
the disk to the main memory, and optionally, (b) A page from the main memory needs to be written to the
disk to create space for the page being brought in from the disk. This means that Windows 2000 might need
two disk operations (read and write, for (a) and (b), respectively) to deal with a single page fault. This can be
quite expensive. To minimize the impact, Windows 2000 tries to keep a lot of free pages in the main memory.
This ensures that only the missing page needs to be brought in, which means only a read operation.
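The following short program illustrates these page states using the calls of Table 13.9: it first reserves a region, then commits it, touches it (causing a demand-paged fault on first access), and finally releases it:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T size = 1 << 20;    /* 1 MB */

    /* Reserve address space only: the pages are in the 'reserved' state. */
    void *region = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    if (region == NULL) return 1;

    /* Commit the pages: they may now be touched; physical frames are
       supplied on demand, via page faults. */
    if (VirtualAlloc(region, size, MEM_COMMIT, PAGE_READWRITE) == NULL) return 1;

    ((char *) region)[0] = 'x';                   /* first touch causes a page fault */
    printf("Committed 1 MB at %p\n", region);

    VirtualFree(region, 0, MEM_RELEASE);          /* back to the 'free' state */
    return 0;
}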
Windows 2000 can work with many file systems, three of which are
significant: FAT-16, FAT-32 and NTFS (NT File System). Let us discuss them in brief.
We shall focus our attention on NTFS, since the other two file systems are actually not relevant going
forward.
In NTFS, a file name can consist of up to 255 bytes, and a full path can take up to 32,767 bytes. File names
are stored in Unicode, which means that people using languages other than English can use their native
languages (such as Greek, Hindi, Russian, etc). The Win32 API calls for file manipulation are quite similar to
those in UNIX. Like in UNIX, when a file is opened in Windows 2000, a file handle is obtained, which can
be used for further processing. Using the Win32 API calls, files can be created/opened, deleted, closed, read
from, written to, etc.
NTFS organizes files in a hierarchical fashion, similar to the way UNIX organizes them. The separator
character, however, is the backslash (\) rather than the forward slash (/) of UNIX. Each NTFS disk partition
contains files, directories and other data structures, such as bitmaps. A disk partition is made up of a series of
disk blocks. The block size ranges from 512 bytes to 64 KB.
Each disk partition contains an important data structure, called the Master File Table (MFT). The MFT
contains a series of 1 KB records, each of which describes a file or a directory. A record lists the file name,
other attributes and a list of the blocks that the file occupies on the disk. For large files, one MFT entry is not
enough. In such cases, there can be multiple MFT entries for a file, with each entry pointing to the next entry
to form an MFT chain. Interestingly, the MFT itself is considered a file!
NTFS tries to allocate consecutive blocks to a file. For this, it allocates more than one block to the file at
the time of its creation. Of course, this cannot always be guaranteed.
The main file handling API calls in Windows 2000 are listed in Table 13.10.
Function Description
CreateFile Similar to the UNIX open call, creates or opens a file
DeleteFile Similar to the UNIX unlink call, deletes a file
CloseHandle Similar to the UNIX close call, closes a file
ReadFile Similar to the UNIX read call, reads from a file
WriteFile Similar to the UNIX write call, writes to a file
SetFilePointer Similar to the UNIX lseek call, sets the file pointer to a specific position in the file
GetFileAttributes Similar to the UNIX stat call, obtains the properties of a file
LockFile Similar to the UNIX fcntl call, locks some portion of a file to provide mutual exclusion
UnlockFile Similar to the UNIX fcntl call, unlocks a previously locked portion of a file
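The following sketch exercises several of these calls together; demo.txt is a hypothetical file name and error handling is minimal:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    char  msg[] = "Hello, NTFS!";
    char  buf[64];
    DWORD n;

    HANDLE h = CreateFileA("demo.txt", GENERIC_READ | GENERIC_WRITE, 0,
                           NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    WriteFile(h, msg, sizeof(msg) - 1, &n, NULL);      /* like UNIX write */
    SetFilePointer(h, 0, NULL, FILE_BEGIN);            /* like UNIX lseek */
    ReadFile(h, buf, sizeof(buf) - 1, &n, NULL);       /* like UNIX read  */
    buf[n] = '\0';

    printf("Read back: %s\n", buf);
    CloseHandle(h);                                    /* like UNIX close */
    return 0;
}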
Table 13.11 lists some of the important Win32 API calls related to the directory handling.
Function Description
CreateDirectory Similar to the UNIX mkdir call, creates a new directory
RemoveDirectory Similar to the UNIX rmdir call, deletes a directory
FindFirstFile Similar to the UNIX opendir call, starts reading entries in a directory
FindNextFile Similar to the UNIX readdir call, reads the next entry in a directory
MoveFile Similar to the UNIX rename call, moves a file from one directory to another
SetCurrentDirectory Similar to the UNIX chdir call, changes the current working directory to the specified
one
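As an illustration, the loop below lists the entries of the current directory using FindFirstFile/FindNextFile, much like a UNIX opendir/readdir loop:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    WIN32_FIND_DATAA fd;
    HANDLE h = FindFirstFileA("*", &fd);    /* "*" matches every entry */
    if (h == INVALID_HANDLE_VALUE) return 1;

    do {
        /* Append a backslash to directory names, for readability. */
        printf("%s%s\n", fd.cFileName,
               (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) ? "\\" : "");
    } while (FindNextFileA(h, &fd));

    FindClose(h);
    return 0;
}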
As we have mentioned earlier, the MFT is the main data structure in NTFS. Each MFT record consists
of the attribute header and value fields. Each attribute starts with a header and also specifies the length of its
value (because attribute values can be of variable lengths). If the attribute value is short, it is stored in the
table. Otherwise, it is stored elsewhere on the disk, and a pointer from the MFT is created to that entry.
The organization of the MFT is shown in Table 13.12. As we can see, the first 16 records (numbered 0
to 15) in the file are reserved for NTFS metadata. The remaining records are used for recording information
about user files.
Let us now discuss the meaning of the various attributes of the metadata (i.e. the first 16 records, numbered
0 to 15) of the MFT.
Record 0: This record identifies the MFT file itself. In other words, it identifies the blocks corresponding
to the MFT file, so that the Operating System can locate the MFT. Unless the Operating System knows about
the MFT, it cannot locate the other files on the disk, and therefore cannot do anything useful.
Record 1: This is a copy (mirror image) of the MFT file. This information is very critical for the Operating
System to function properly (and even to boot!). Therefore, a redundant copy is maintained.
Record 2: This record contains a history or log of the changes to the file system. For instance, when a user
creates a new directory, or when the user changes the attributes of a file (from, say, READ-ONLY to
READ-WRITE), this fact is recorded here. This information can be useful for recovery and audit purposes.
Record 3: This record contains the details about the volume, such as the size of the volume, its label and
version.
Record 4: This record contains information about the attributes of the MFT.
Record 5: This record contains information about the root directory, which is itself a file.
Record 6: This record contains information about the free (unused) space on the disk.
Record 7: This record contains a pointer to the bootstrap loader file.
Record 8: This record logically connects all the bad blocks together to ensure that they are never used, even
accidentally.
Record 9: This record contains security information.
Record 10: Different languages have different rules for uppercase letters. This record contains information
on how to map such letters to the normal English uppercase letters.
Record 11: This record is actually a directory, which contains miscellaneous files holding information
regarding disk quotas, object identifiers, etc.
Records 12 to 15: These records are currently unused, and are reserved for future use.
After this, the entries for the user files begin.
NTFS uses the following concepts when allocating disk space to files:
Sector: This is the smallest area of physical space on the disk, from the viewpoint of NTFS. A sector
typically consists of 512 bytes (or another power of 2).
Cluster: One or more consecutive sectors in the same track make up a cluster.
Volume: This is a logical partition on the disk. It consists of one or more clusters, and is used by NTFS while
allocating space to a file. Typically, a volume not only contains files, but also the information about these files,
free-space information, etc. There can be one or more volumes per physical disk.
This concept is shown in Fig. 13.11.
NTFS has support for transparent (concealed) file compression. That is, as a file is being written to
the disk, NTFS automatically compresses the disk blocks before performing the write operation. Similarly,
when these disk blocks are read back, NTFS automatically decompresses them. The processes need not be
aware of these mechanisms. A user can request file compression at the time of the creation of the file, or
at a later date.
When NTFS writes a file that is to be compressed to the disk, it examines the first 16 logical blocks of the
file and executes a compression algorithm on them. If the resulting compressed data occupies 15 blocks or
fewer, it is written to the disk in one go, if possible. Then the next 16 blocks of the original file are examined,
and if they can be stored in 15 blocks or fewer, they too are written in compressed form, and so on.
During the reading operation, NTFS needs to know which groups of blocks are compressed, so that it can
uncompress them before reading them. Since the decompression mechanism happens in main memory, this
is not a problem at all.
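The decision logic described above can be sketched as follows. This is only a schematic illustration: the compress_blocks routine is a stand-in (it does no real compression), and the block size is an assumption, not NTFS's actual value:

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 512            /* assumed block size, for illustration */

/* Stand-in for a real compressor: it only reports how many blocks the
   "compressed" output would need (here: no space is ever saved). */
static size_t compress_blocks(const char *in, size_t nblocks, char *out)
{
    memcpy(out, in, nblocks * BLOCK_SIZE);
    return nblocks;
}

/* Take 16 logical blocks at a time; keep the compressed form only if it
   occupies 15 blocks or fewer, otherwise store those blocks uncompressed. */
static void write_file(const char *data, size_t total_blocks)
{
    char out[16 * BLOCK_SIZE];

    for (size_t i = 0; i < total_blocks; i += 16) {
        size_t chunk  = (total_blocks - i < 16) ? (total_blocks - i) : 16;
        size_t packed = compress_blocks(data + i * BLOCK_SIZE, chunk, out);

        if (chunk == 16 && packed <= 15)
            printf("blocks %zu-%zu: stored compressed (%zu blocks)\n",
                   i, i + chunk - 1, packed);
        else
            printf("blocks %zu-%zu: stored uncompressed\n", i, i + chunk - 1);
    }
}

int main(void)
{
    static char data[40 * BLOCK_SIZE];   /* a dummy 40-block "file" */
    write_file(data, 40);
    return 0;
}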
To encrypt files, NTFS uses the services of a driver, called as Encrypting File System
(EFS).
The encryption process works as follows.
1. The file is encrypted with a one-time symmetric key, using an algorithm that is a variant of
the Data Encryption Standard (DES). Clearly, secure storage of this key can be a concern. How this is
tackled is explained below.
2. The one-time symmetric key is now encrypted with the user’s public key.
3. The encrypted file and the outcome of Step 2 are stored on the disk.
This process is shown in Fig. 13.12.
As an aside, the user's private key is not stored as it is on the disk. That would defeat the whole purpose of
encryption. Instead, it is encrypted with another symmetric key, derived from the user's login password, and
the result is stored on the disk as a file.
For decryption, the following method is used:
1. EFS asks the user for her login password and derives a symmetric key from it.
2. Using this derived key, EFS decrypts the user's private key.
3. Using the user's private key, EFS recovers the one-time symmetric key that was used during the
encryption process.
4. Using the one-time symmetric key obtained in step 3, EFS decrypts the disk file.
This process is shown in Fig. 13.13.
Windows 2000 provides some interesting security features, which we shall first summarize, as shown in
Table 13.13.
Every Windows 2000 user and group is assigned a unique SID (Security ID), which is a binary number
containing a short header followed by a random number. A SID is supposed to be unique all over the world. A
process and its threads run under that user’s SID. Each process in Windows 2000 has an access token, which
contains the SID and other information. Similarly, each resource (such as a file) has a security descriptor
associated with it, which describes what actions are allowed for which SID (similar to the UNIX permissions).
Secure login with anti-spoofing measures: Secure login demands that the administrator makes it mandatory
for all users to have a password to log in. Spoofing happens when an attacker writes a program that displays
a login screen and then walks away, hoping that an unsuspecting user will enter the user id and password,
thinking that this is the genuine login screen. In reality, the attacker's aim is simply to capture the user id and
password, and then display a login failure screen to the user. Windows 2000 prevents this attack by requiring
users to press the CTRL-ALT-DEL key combination to log in. The keyboard driver captures this key sequence
and invokes a system call that displays the genuine login screen. There is no mechanism for a user program
to intercept or disable the CTRL-ALT-DEL key combination, making this scheme highly successful.
Discretionary access controls: This feature allows the owner of a resource, such as a file, to determine who
can access that resource, and in what way.
Privileged access controls: This feature allows the system administrator to override the discretionary access
controls in case of problems.
Address space protection: Windows 2000 provides each process with its own protected virtual address
space. This prevents a malicious process from attacking another genuine process.
New page zeroing: This feature ensures that any new pages brought into use because the existing memory is
exhausted always contain binary zeroes. That way, a process cannot determine what the earlier owner of those
pages was doing.
Security auditing: Using this feature, the system administrator can perform a variety of audits with the help
of system-generated logs.
Windows 2000 uses Kerberos (discussed subsequently) for user authentication. However, it also supports
the challenge/response mechanism of Windows NT, which is called NT LAN Manager (NTLM). NTLM
is based on a challenge/response mechanism, and avoids the transmission of the user's password in clear
text. The NTLM mechanism works as follows:
1. The user gets the screen for login. In response, the user enters the user id and password. The user’s
computer (i.e. the client) computes a message digest of the password, and discards the password
entered by the user. The message digest (also called as hash) is a fixed-length representation of the
password. Conceptually, it is similar to the fingerprint of a person, or the parity bits used in data
communications. Industry-standard algorithms, such as MD5 and SHA-1 can be used to compute
the message digest of any text. Given some text and its message digest, we can always recompute the
digest and verify that it corresponds to that piece of text. If we change even 1 bit of the text and calculate
its message digest, it would differ dramatically from the earlier message digest.
2. The client sends the user id in clear text to the server.
3. The server sends a 16-byte random number challenge (also called as nonce) to the client.
4. The client encrypts the random challenge with the message digest of the password, which was
computed in step 1. The client sends this encrypted random challenge (which is called as the client’s
response) to the server.
5. The server forwards the user id, the original random challenge sent to the client, and the client’s
response to a special computer, called as the domain controller. The domain controller keeps a track
of the user ids and the message digests of the passwords.
6. The domain controller accepts these values from the server, retrieves the message digest of the
password for this user from its database (called as Security Access Manager or SAM), and uses it to
encrypt the random challenge received from the server.
7. The domain controller compares the encrypted random challenge received from the server (in step
5) and the one that it has computed (in step 6). If the two match, the user authentication is treated as
successful.
This process is shown in Fig. 13.14.
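The essence of the client-side and domain-controller-side computations can be sketched as below. The digest and encryption routines here are toy stand-ins for the real cryptographic primitives used by NTLM, and serve only to show the flow of steps 1 to 7:

#include <stdio.h>
#include <stdint.h>

/* Toy stand-ins (NOT the actual NTLM primitives). */
static uint32_t digest_of(const char *text)             /* "message digest"   */
{
    uint32_t h = 2166136261u;                            /* FNV-1a, demo only  */
    while (*text) { h ^= (uint8_t) *text++; h *= 16777619u; }
    return h;
}

static uint32_t encrypt_with_key(uint32_t data, uint32_t key)   /* "encryption" */
{
    return data ^ key;                                   /* placeholder cipher */
}

int main(void)
{
    /* Step 1: client hashes the password and discards the plain text. */
    uint32_t client_hash = digest_of("alice-password");

    /* Step 3: server sends a random challenge (nonce); fixed here for the demo. */
    uint32_t challenge = 0x1A2B3C4Du;

    /* Step 4: client encrypts the challenge with the password digest. */
    uint32_t response = encrypt_with_key(challenge, client_hash);

    /* Steps 5-7: the domain controller repeats the computation using the digest
       stored in its SAM database and compares the two results. */
    uint32_t stored_hash = digest_of("alice-password");
    uint32_t expected    = encrypt_with_key(challenge, stored_hash);

    printf("Authentication %s\n", response == expected ? "successful" : "failed");
    return 0;
}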
Many real-life systems use an authentication protocol called Kerberos. The basis for Kerberos is another
protocol, called Needham-Schroeder. Kerberos was designed at MIT to allow workstations to access network
resources in a secure manner; the name Kerberos signifies the three-headed dog of Greek mythology
(apparently used to keep outsiders away). Version 4 of Kerberos is found in most practical implementations,
including Windows 2000. However, Version 5 is also in use now. We shall discuss Kerberos in detail, as it is
the primary authentication mechanism in Windows 2000.
In response, the AS performs several actions. It first creates a package of the user name (Alice) and a
randomly generated session key (KS). It encrypts this package with the symmetric key that the AS shares
with the Ticket Granting Server (TGS). The output of this step is called as the Ticket Granting Ticket (TGT).
Note that the TGT can be opened only by the TGS, since only it possesses the corresponding symmetric key
for decryption. The AS then combines the TGT with the session key (KS), and encrypts the two together
using a symmetric key derived from the password of Alice (KA). Note that the final output can, therefore, be
opened only by Alice. The conceptual view is shown in Fig. 13.16.
After this message is received, Alice’s workstation asks her for the password. When Alice enters it, the
workstation generates the symmetric key (KA) derived from the password (in the same manner as AS would
have done earlier) and uses that key to extract the session key (KS) and the Ticket Granting Ticket (TGT).
The workstation destroys the password of Alice from its memory immediately, to prevent an attacker from
stealing it. Note that Alice cannot open the TGT, as it is encrypted with the key of the TGS.
Obtaining a Service Granting Ticket (SGT) Now, let us assume that after a successful login,
Alice wants to make use of Bob—the email server, for some email communication. For this, Alice would
inform her workstation that she needs to contact Bob. Therefore, Alice needs a ticket to communicate with
Bob. At this juncture, Alice’s workstation creates a message intended for the Ticket Granting Server (TGS),
which contains the following items:
Both Windows 2000 and Windows NT have evolved from the days of the MS-DOS Operating System.
Although MS-DOS is no longer a widely used Operating System, nevertheless, there are a number of old
applications, which were developed for MS-DOS, which now need to execute on Windows 2000. This is
simply because those applications used to concentrate more on the functionality, rather than the ease of usage
or GUI features. Those days were known for character-based screens. However, just because Microsoft has
shifted its attention from MS-DOS to Windows 2000, the users of the older applications do not wish to throw
these applications away, and yet at the same time, want to use Windows 2000 as their Operating System. How
does one achieve both these objectives at the same time?
Precisely in order to solve this problem, the idea of MS-DOS emulation came into being. Windows
2000 provides a mechanism called as protected mode, inside which MS-DOS programs can execute safely.
When an MS-DOS program begins to execute, Windows 2000 creates a normal Win32 process. However, in
addition to this, it also invokes the MS-DOS emulation program, called as ntvdm (NT Virtual DOS Machine).
This special program is responsible for monitoring the behaviour of the MS-DOS program, and executing its
system calls. It is very relevant to note here that MS-DOS recognizes main memory only up to 1 MB on the
8088 PC and up to 16 MB with a few tricks on the 80286. Therefore, the trick is to load ntvdm high in the
address space of the MS-DOS process, above the memory that the MS-DOS program can address, so that the
MS-DOS program cannot clash with, or bring down, ntvdm.
When the MS-DOS program needs to perform normal instructions, it can do that very well, since Pentium
supports all the 8088 and 80286 instructions. However, when the MS-DOS program attempts to perform some
kind of I/O operation, or needs to execute a system call, ntvdm takes over, and executes the corresponding
Windows 2000 I/O operation or system call. In effect, ntvdm performs a mapping between MS-DOS and
Windows 2000 system calls. This is shown in Fig. 13.21.
It must be pointed out, however, that not all MS-DOS programs work this way. Some MS-DOS programs
attempt to perform direct hardware operations, bypassing the Operating System calls, such as trying to read
the keyboard directly, or writing to the video RAM on their own. In such cases, ntvdm attempts to figure out
what the MS-DOS program was trying to do, and performs an equivalent action.
But if it cannot do so, there is no option but to terminate the MS-DOS program abruptly. This is what the term
protected mode means. Windows 2000 makes a sincere attempt to stop programs from trying to do what they
are not supposed to, and yet, tries to provide an execution environment for them.
Any Operating System has to deal with
two categories of people, viz.
the algorithms that need to be employed to achieve the same user/system call interfaces. In fact, they will
to choose one’s own internal data structures and algorithms as long as they give the same interface to users,
implementation.
all the universities and has started dominating the commercial and industrial world in a big way.
intention of marketing it and nor did Bell Labs want to sell it for a long time.
Bell Labs.
with that proposal, they would have been shown the door! Bell Labs had burnt their fingers badly very
and hence, the operating system became massive, bulky and complex.
leave alone port it to other machines. It also took far longer to develop. It was a good crucible, however, where
the computing world would go a step backwards towards batch processing of some kind.
They could not have dared to propose the development of a new operating system on this background.
At this time, Thompson and his colleagues learnt that the patent department of Bell Labs was looking for a
produce an Office Automation System for the patent department, though Thompson called it ‘Editing system
for office tasks’ in his proposal. The proposal met with negative reaction at first, but was eventually approved
In order to do the editing, one needed an editor software. But the editor would need a lot of support from
the Operating System in terms of file management and terminal management. Whatever a user typed in had
to be displayed on the terminal with all the features such as backspace, tab, scroll, etc. The same text had to
be stored in files. The files had to be stored on a disk in some form of directory structure. When a user wanted
to edit an existing file, the file had to be retrieved in the buffers and the desired chunk had to be displayed on
(or at least Thompson grabbed this opportunity and used it to his advantage!)
through word of mouth, it spread gradually within Bell Labs first and then slowly outside. Even then,
use outside.
language. In those days, very fast and efficient computers had not come into existence and writing an Operating
System in a higher level language was considered to be very impractical by the computing community.
combined features of both high and low level languages. It could, therefore, provide a good compromise
to come, the same students would be selecting a computer and an Operating System for their organizations,
where they would be holding key managerial positions.
years.
They differed in many ways. The utilities available on both varied. In fact, even the file formats differed.
which defined the system calls, file formats, and so on. The document was meant to bring some harmony
POSIX
Interface Standards.
code could be called POSIX Compliant which essentially means that it will be truly portable.
and some others formed a consortium called Open Software Foundation (OSF). The idea was to come out
UNIX International (UI)
mature. It was not rushed to the marketplace with bugs. This was its strong point.
debug or enhance it. Because it is mainly written in C, and almost all vendors provide C compilers, the kernel
basically control the hardware will need to be rewritten at the time of porting.
Also, in the other operating systems, quite a few functions are incorporated in the kernel instead of as simple,
could mean the execution of an application program or a utility program. This execution could be based on
a condition being satisfied too! Thus, because the shell commands including scripts and the system calls are
interface instructions in the programming languages. If the databases used are also the same, then there will
be no problem.
they pertain to the system calls and their interfaces as well. In some cases, user interface also differs slightly.
This increases the porting efforts.
parameters to obtain a ‘tailor made’ device driver for a specific device at a given moment. Things, of course,
are not so simple that they can be easily generalized. But you could write generalized drivers for devices which are
that device in the /dev directory. If the device characteristics differ a lot, you may have to modify the generalized
device driver or write a new one for it corresponding to the file created in /dev directory.
For instance, the devices can be character devices like terminals, or block devices like disks. Each will
require a certain specific driver program to be executed, and a specific memory buffer from/to which data can
and the addresses of the driver program as well as the memory buffer reserved for the device. These and other
details about the device are stored in the device file for that device in the /dev directory in a predefined format
or layout.
that device in the /dev directory. The user issues a usual instruction as per the programming language syntax.
For instance, he may want to write to a printer. Corresponding to this instruction, the compiler generates a
system call to write to a device file for that printer defined in /dev directory. This system call to write to a
device file is intelligent. At the time of execution, it extracts the contents of the device file which contains
the device characteristics and the address of the device driver. It uses this information to invoke the required
device driver and so on. When a new device is added to a system, the device driver for that device is written
your need, the more will be the size of the operating system because, those many drivers will need to be
included and loaded and those many buffers will have to be created. Thus, you normally specify to the system
execute in either a user mode or kernel mode. When an interrupt occurs (e.g. due to I/O completion or a timer
clock interrupt), or a process executes a system call, the execution mode changes from user to kernel. At this
time, the kernel starts processing some routines, but the process running at that time will still be the same
process except that it will be executing in the kernel mode.
If a process with higher priority wants to execute when there is no memory available, two techniques are
possible. One is swapping, where an existing process is thrown out (except its resident part). Another is demand
paging, where only some pages are thrown out and the new process is now created. In this case, both the
processes continue. These days, demand paging is more popular.
The kernel maintains various data structures to manage the processes. For example, it maintains a data
structure called u-area which is similar to the Process Control Block (PCB) discussed earlier. There is one
u-area for every process. For each process, there is a user stack as well as a kernel stack. These are used
respectively in the user and kernel modes of a process
for storing return addresses or parameters for system calls/functions. An interesting point is that the virtual
address space of a user process consists of not only the user portion of the process, but some of the kernel
data structures associated with that process as well. If a process is in the kernel mode, it can access the entire
address space; but if it is in the user mode, it can access only the user portion of the user address space. The
user portion of a process consists of the following elements or regions.
This user portion is entirely swappable if memory is to be freed for some other process. The kernel portion,
in turn, has parts that are swappable and a part that is resident. This resident portion is never swapped. For instance, the kernel keeps a process table
which has one entry for each existing process in the system, regardless of its state such as running or ready
or blocked (sleeping). This process table is never swapped, because it contains very fundamental information
about process scheduling and swapping itself. For instance, this table is required to take decision about which
processes are to be swapped in, if some memory space becomes free. There are pointers from an entry in
Some regions can be shared among processes. For instance, the text region of a program can be reentrant
text, i.e. more than one user can use it with only one copy of it in the memory. The code is stable and does not
modify itself during the course of its execution. In order to implement the ideas of sharing of various regions,
various data structures (e.g. per process region tables) need to be shared.
subdirectories underneath it. A disk can be divided into multiple partitions. Each of the partitions has its own
file system. A file system starts with a root directory at the top of the inverted tree. The root directory contains
a number of directories, which in turn contain a number of files/subdirectories, and so on. This is shown in
further into fields, is completely left to the application program.
Ordinary files can be the regular text files or binary files. The word ‘text’ here is used in
the way we use English text. It should not be confused with the text region of a process which consists of
compiled code. Text files can contain source programs or documents prepared using a word processor.
One can use the cat command to display the
contents of a text file on the screen. Binary files, on the other hand, contain bytes which need not
have any meaning according to the ASCII convention or otherwise. Therefore, the cat command cannot be used
to display them meaningfully. The second category is that of directory files. A directory is internally stored
just like a file with a number of records or entries. There is one entry for each
file, or a subdirectory under that directory. The entry contains the symbolic name of the file/subdirectory
underneath and a pointer to another record or data structure called index node or inode in short. The ‘inode’
maintains the information about the file/directory such as its owner, access rights, various creation, usage
and amendment dates and the addresses used in locating/retrieving various blocks allocated to that file or
directory. Thus, every file, directory or a subdirectory including the root has an inode record of a fixed length. All
pointer to locate the inode which contains the information about the file/directory.
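The information kept in an inode can be seen from a user program through the stat system call. The following minimal C sketch (the file name is hypothetical) prints a few of the fields discussed above.

#include <stdio.h>
#include <time.h>
#include <sys/stat.h>

int main(void) {
    struct stat sb;
    if (stat("notes.txt", &sb) == -1) {          /* hypothetical file name */
        perror("stat");
        return 1;
    }
    printf("inode number : %lu\n", (unsigned long)sb.st_ino);
    printf("owner uid/gid: %u/%u\n", (unsigned)sb.st_uid, (unsigned)sb.st_gid);
    printf("link count   : %lu\n", (unsigned long)sb.st_nlink);
    printf("size (bytes) : %lld\n", (long long)sb.st_size);
    printf("last modified: %s", ctime(&sb.st_mtime));   /* one of the dates kept in the inode */
    return 0;
}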
comprehension of the correspondence. The symbolic names are actually stored in a file called Symbolic File
now can use this to carry out the address translation to arrive at the sectors that need to be read. It then
can instruct the controller to read those sectors and retrieve the data.
Every directory has two fixed entries—one for the entry of ‘.’ and the other for ‘..’. The ‘.’ entry is for the
directory itself and its corresponding inode number. The ‘..’ entry is for the parent directory and its corresponding inode number. For instance, the ‘..’ entry
becomes the working directory or current directory. When the user logs in, this /etc/passwd file is consulted
and the user is then placed in the home directory. At any moment, the directory in which the user operates
(cd)
pwd
If a file is to be specified by a user sitting at a terminal, there are two ways of specifying it. One is the
relative pathname, which is specified with respect to the current directory and the other is the absolute
pathname,
accessible without mentioning any pathname.
ls
The system will respond by displaying
with linking, you could have only one physical file with one inode, with two or more symbolic names for it,
and hence, those many entries in the (symbolic) directories pointing to the same inode.
Linking allows file sharing and obviates the need to copy the same file in the other directories which would
Internally, Link system call adds a symbolic file to the new directory pointing to the same inode.
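A minimal C sketch of linking is shown below (the file names are hypothetical). After the link system call succeeds, both names resolve to the same inode, and the link count in that inode goes up by one.

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void) {
    struct stat a, b;
    if (link("report.txt", "report_alias.txt") == -1) {   /* add a second name for the same file */
        perror("link");
        return 1;
    }
    stat("report.txt", &a);
    stat("report_alias.txt", &b);
    printf("same inode? %s (inode %lu, link count %lu)\n",
           a.st_ino == b.st_ino ? "yes" : "no",
           (unsigned long)a.st_ino, (unsigned long)a.st_nlink);
    return 0;
}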
A third category of files is Special files. Special files enable UNIX to treat I/O
devices as files. In general terms, one can say that for each I/O
device such as a tape, a disk, a terminal and a printer, there is
a file in the /dev directory. Such files are called special files.
in this case is a special file for that specific device. The address fields in the inodes for the files for various
devices maintain pointers to the contents of these special files. The special file itself contains, amongst other
things, the address of the memory buffer (character lists or clists in the case of terminals), and the address
of the piece of the device driver software for the terminal. The kernel accesses the inode for the required
special file, picks up this address and then executes the actual device driver. But to the user or a programmer,
bytes, but in a FIFO manner, i.e. the byte which is written to this file first is the one which is put out first by the
read operation. Let us assume that a message consisting of a number of bytes is written to this FIFO file by a process. Let us also assume that another process reads
this file. That process will receive the message exactly in the same way and sequence as it was sent. Thus,
a FIFO file can be used for passing messages between processes.
The FIFO file differs from the other files in one important way. Once data is read from a FIFO file or a
pipe, it cannot be read again, i.e. the data is transient.
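The following sketch (the FIFO path name is hypothetical) creates a FIFO special file with mkfifo, writes a few bytes into it from a child process and reads them back in the parent, in the same order in which they were written.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/wait.h>

int main(void) {
    const char *path = "/tmp/demo_fifo";        /* hypothetical FIFO name */
    mkfifo(path, 0666);                         /* create the FIFO special file */

    if (fork() == 0) {                          /* child: the writer */
        int wfd = open(path, O_WRONLY);
        write(wfd, "first ", 6);
        write(wfd, "second", 6);
        close(wfd);
        _exit(0);
    }

    char buf[32];
    int rfd = open(path, O_RDONLY);             /* parent: the reader */
    ssize_t n = read(rfd, buf, sizeof(buf) - 1);
    buf[n > 0 ? n : 0] = '\0';
    printf("read back: %s\n", buf);             /* bytes arrive in FIFO order */
    close(rfd);
    wait(NULL);
    unlink(path);                               /* the data was transient; remove the file too */
    return 0;
}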
background process (called a daemon) continuously checks the stored times and mounts the file system at
same thing. These commands issue the required system calls internally, to carry out the desired task.
their directories were to be supported all the time in one file system, it would be a difficult task. Imagine each
user or a student having his own removable disk/diskette with his own file system. In this environment, by
the need. It also allows added security since he can effectively disallow certain users from using the system
at a certain time of the day.
The concept of hierarchical file system in general terms has been studied earlier. Let us now look at some of
shell process uses the system call to locate the binary executable file for this command. It finds it in /bin
directory. The /bin directory will contain various symbolic names and their corresponding inodes. The kernel
for different directories under the root directory, and then displaying their names.
/bin/mv is a utility to change the name of a file. It can also be used to move a file to
a new directory. The old file and its name do not exist any more in the old directory.
/bin/cp is used to copy a file into a new directory. The old file is retained intact in the old directory. A new
file is then created. The target directory now shows the symbolic file name and the new inode number for this
file, while retaining both in the old (source) directory.
/usr is a very important directory, to which all users belong. By default, all the user directories
Apart from the user directories, /usr has also three main subdirectories — /usr/lib, /usr/bin and /usr
point is that all these directories are under /usr as well as directly under the root directory /. Thus, there are
two bin directories to keep binaries: /bin and /usr/bin. There is little reason for both to exist. In fact, one
could store all binaries in /usr/bin or in /bin. In SunOS, these two directories are linked together meaning that
logically, the two are treated as only one directory by the SunOS.
than working on the original file. /tmp directory is cleaned up automatically when the system starts. There is
however a problem if a user creates a temporary file with the same name as the one in /tmp directory such as
password : This is the encrypted form of the password that the user must supply during login. Password
ageing information can also be kept, using which the user can be forced to change his or her password. After login, the system checks if the age is ‘ripe’ for
it to force the user to change the password. If so, the system prompts for the new password,
accepts it and updates the password field in his record of this file.
userid : This is a numeric user id unique to each user.
groupid : This is the group id to which the user belongs.
idstring : This is the full name of the user.
directory or the listing from the working directory etc. can be easily accomplished as
studied earlier. When a user is created onto the system, his home directory is ascertained,
and is input by the system administrator along with his/her username, password, etc. The /
etc/passwd then stores it.
command : This is a command which should be executed immediately after logging in. Typically, the shell
to be used such as /bin/sh or /bin/csh is specified here. After logging in the kernel searches
the /etc/passwd file, picks this command up and invokes it. This is how most of the time, the
user directly sees the shell prompt ($) after logging in. This could however be tailored for a
specific user to run, say Purchase Order (PO) Entry program directly and immediately after
logging in. The pathname for that program file will need to be specified there. That is all. As
soon as that user logs on, the PO entry program will be invoked and started directly. This
concept can be used in airlines or banking or retailing business, if specific user is supposed to
use only a specific screen throughout the day. This saves a lot of time.
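A program need not parse /etc/passwd by hand; the standard library routine getpwnam returns the fields of a user's record. A minimal sketch follows (the login name "student" is hypothetical).

#include <stdio.h>
#include <pwd.h>

int main(void) {
    struct passwd *pw = getpwnam("student");    /* hypothetical login name */
    if (pw == NULL) {
        fprintf(stderr, "no such user\n");
        return 1;
    }
    printf("userid   : %u\n", (unsigned)pw->pw_uid);
    printf("groupid  : %u\n", (unsigned)pw->pw_gid);
    printf("home dir : %s\n", pw->pw_dir);
    printf("command  : %s\n", pw->pw_shell);    /* typically /bin/sh, /bin/csh, etc. */
    return 0;
}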
/etc/motd is a file which contains a message that is displayed on a terminal as soon as a user
logs onto the system. The process involved in the login procedure extracts this message and displays it. There
is a command which can be used by the system administrator to set this message up. This is how a user can be given important announcements as soon as he logs onto the system.
/etc/fsck contains a utility to check the consistency of the file system. This checking is done
normally during bootstrapping. The inconsistencies can be caused by power loss or malfunctioning of the
hardware. This program goes through all the inodes for directories, subdirectories and files to ensure that all
the linkages between different directories and files are correct and the linkages between various blocks
allocated to those are also correct. It also does that for all the free blocks and bad blocks. In order to do this, it
has to go through multiple passes.
/etc/mount is a utility program to allow mounting of file systems as discussed earlier.
/etc/umount is a utility program to allow unmounting of file systems as discussed earlier.
permissions that can be granted to these categories. For instance, the owner could have all ‘rwx’ permissions,
but the group could have only ‘r’ and ‘x’ permissions and others could have only read(r) permission. The /etc/
group file serves as a master file for all the groups. If any group is mentioned or created for any file, it has to
first exist in this master file of /etc/group to qualify to be a valid group.
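For instance, the permissions mentioned above (rwx for the owner, r-x for the group, r for others) can be set from a program with the chmod system call, as in the following sketch (the file name is hypothetical).

#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    /* rwx for the owner, r-x for the group, r-- for others: octal 0754 */
    if (chmod("payroll.dat", 0754) == -1) {     /* hypothetical file name */
        perror("chmod");
        return 1;
    }
    return 0;
}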
/tmp directory maintains all the temporary files created while executing different utilities such
as editors.
/dev directory contains different special device files. There is one such file for each device. This file
contains the address (directly/indirectly) of the device driver and various parameters. Thus, when some data is
to be read from or written to a device, a system call to read from or write to a file is issued, except that the file
in this case is a special device file for that device. The procedure to execute this system call internally invokes
the appropriate device driver.
As already seen, one disk can be partitioned to house multiple file systems. Conversely, one file system
can span multiple physical disks by essentially using the mount command. A layout
involved in booting, the first block contains the boot or bootstrap program. After the machine is powered
on, the hardware itself is constructed in such a way that it automatically reads the boot block containing the
bootstrap program. Typically, this bootstrap program has instructions to read in a longer bootstrap program,
which in turn loads the Operating System itself. Only the file system used for booting
needs to have this boot block, even if multiple file
systems may exist on a disk. In all the other file systems,
this block is empty.
The block that follows the boot block is the Superblock. It
acts as a file system header. It contains the information
about the file system. Any file system has to be mounted
on the root directory before it can be used. When used
by the mount command, the file system’s superblock is
read into the kernel’s memory. Thus, all the superblocks
of all the mounted file systems are readily available to the kernel in the memory. The superblock also contains a list of free inode numbers. When a new
file is created, an entry from this list can be allocated and the list then can be updated. This list can be fairly
long. The kernel normally reserves a small portion in the superblock itself to hold a few free inode entries. If
the list is small enough, it can be stored entirely in the superblock itself. If it is larger, the remaining entries
are stored outside the superblock. We will call them disk inodes.
When a file is to be created, a free inode from the superblock is allocated. When all the free inode entries in
the superblock are exhausted, the list of free inodes kept outside the superblock is consulted. The superblock
essentially acts like the ‘cache’ in this case. When a superblock does not have any free inode entry left, it is
updated with the entries in the disk inodes, whereupon the disk inodes also are appropriately updated. Consider an
example. The figure shows that a number of inodes are free or unallocated. The superblock maintains a partial
list of these free inode numbers. The
superblock maintains a pointer to the next free inode. This is now accessed.
(iv) When the pointer indicates that there is no free inode number left in the superblock, the kernel searches
the disk inodes to find the inodes that are free and updates the list of free inodes in the superblock. It
also updates the value of the pointer to point to the very beginning of the list in the superblock, and
then starts processing as earlier.
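The following C sketch shows the idea in miniature. It is purely conceptual: the "superblock" is an ordinary structure in memory, refill_from_disk stands in for the kernel's scan of the disk inodes, and the sizes and inode numbers are made up.

#include <stdio.h>

#define SB_CACHE 100                    /* free-inode slots kept inside the superblock */

struct superblock {
    int free_inode[SB_CACHE];           /* partial list of free inode numbers */
    int next;                           /* index of the next entry to hand out */
};

static void refill_from_disk(struct superblock *sb) {
    /* Stand-in for step (iv): scan the disk inodes for free ones. */
    for (int i = 0; i < SB_CACHE; i++)
        sb->free_inode[i] = 1000 + i;   /* pretend these inode numbers were found free */
    sb->next = 0;                       /* pointer goes back to the beginning of the list */
}

static int alloc_inode(struct superblock *sb) {
    if (sb->next == SB_CACHE)           /* the cache in the superblock is exhausted */
        refill_from_disk(sb);
    return sb->free_inode[sb->next++];  /* allocate the next cached free inode */
}

int main(void) {
    struct superblock sb = { .next = SB_CACHE };    /* start with an empty cache */
    printf("allocated inode %d\n", alloc_inode(&sb));
    printf("allocated inode %d\n", alloc_inode(&sb));
    return 0;
}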
cluster. Only one block is allocated to a file at a time, on demand. The blocks allocated to a file need not be
multiple levels of these indices and the addresses of the data blocks/indices are maintained in the inode for
that file. Thus, after accessing the inode for a file, one can traverse through all the data blocks for a file. The
details of this will be studied with the study of the structure of inodes.
very interesting way of maintaining a list of free data blocks. A partial list of the free data blocks is maintained
in the superblock itself. The superblock also maintains a pointer which gives the next free data block in the
partial list of free data blocks maintained in the superblock. The superblock obviously cannot contain the full
a pointer from the superblock to other blocks containing the list of remaining free data blocks. This is quite
similar to the way the free inode entries are kept.
As we have seen earlier, the superblock itself maintains a partial list of
free blocks. The superblock has a size which is more than one block of data. Assume that the superblock
number in the superblock is quite different. This last free block is really not a free block, but it actually
needs to be updated. If the superblock is getting full due to this new entry, an additional block will have to
be allocated, its number will have to be maintained as the last (the one on the extreme left) and the entry for
This process can go on at various levels.
A comparison of this method with the one using a bit map for the allocation of free blocks is of interest.
The main advantage is the better speed when blocks are to be allocated to a file. Also, as more and more disk
blocks get allocated and the disk gets full (i.e. very few blocks remain free), the difference between the two
methods diminishes.
We will now study the structure of inodes and the method of allocating free data blocks to a file.
The inode contains the following information:
its inode.
File type : This specifies whether a file is an ordinary file, a directory, a special file, or a FIFO file.
Time stamps : These specify the time at which the file was created, the time at which it was last used
and the time at which it was last modified.
Link count : This represents the number of symbolic names the file is known by. The
kernel physically deletes a file (frees all the data blocks and also the inode allocated to that file), only if this
environment. Therefore, its orientation has always been different. In such an environment, where hundreds of
students create small files and then typically forget to delete them, one normally has a large number of small
files as against the commercial environment where there are only a few but fairly large files. This scenario is
time not penalizing the users with large files very heavily. One of the ways to achieve this is to reduce the time
block numbers allocated to that file can be accommodated within the inode itself and, therefore, they can be
filled up. The next one is, therefore, the third entry.
A question arises as to what would happen if the file size
grows beyond what these direct entries can address. In that case, the kernel acquires
a data block from the available free blocks, and uses this
block as an index block. The eleventh entry in the inode
gives the block number of this index block. This index block contains addresses or a pointers, as the figure
data block numbers. The kernel now goes through the list of free blocks and allocates one for this file. It now
inserts the block number of this actual data block as the first entry in the index block. As the file expands, it
repeats this procedure of acquiring a new free data block, allocating it and writing its block number in the
second entry of the index block, and so on. The kernel knows which entry in the index block to use next by
means of simple arithmetic by using the file size as seen earlier. This index block is called a single indirect
If the file grows beyond what the single indirect index block can address, we
have to use the double indirect index block.
In double indirect, the twelfth entry in the inode points to a double indirect index block. This index block
whether to locate the block number in the inode itself (direct), or in an index at a specific level of indirection.
(ii) Convert the logical byte numbers into logical block numbers and offsets, by dividing them by the block
size. This is depicted graphically in the figure.
(iv) After doing the calculations, access an entry for the single indirect in the inode, and access the index
block at the single indirect level by tracing that pointer.
(ix) The kernel again finds the physical address (cylinder, track, sector) of the newly allocated block.
It can then instruct the disk controller to pick up the record from the memory (by specifying the memory address) and write it on the disk at those
addresses.
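The arithmetic involved in deciding whether a byte lies in a direct, single indirect or double indirect block can be sketched as follows. The block size of 1024 bytes, the ten direct entries and the 4-byte block numbers are assumptions made only for the example.

#include <stdio.h>

#define BLOCK_SIZE 1024
#define NDIRECT    10                        /* direct entries in the inode            */
#define NPTRS      (BLOCK_SIZE / 4)          /* block numbers held by one index block  */

int main(void) {
    long offset = 50000;                     /* hypothetical logical byte number       */
    long lblock = offset / BLOCK_SIZE;       /* logical block number                   */
    long within = offset % BLOCK_SIZE;       /* displacement within that block         */

    if (lblock < NDIRECT)
        printf("direct entry %ld, offset %ld in the block\n", lblock, within);
    else if (lblock < NDIRECT + NPTRS)
        printf("single indirect: slot %ld of the index block, offset %ld\n",
               lblock - NDIRECT, within);
    else
        printf("double indirect: outer slot %ld, inner slot %ld, offset %ld\n",
               (lblock - NDIRECT - NPTRS) / NPTRS,
               (lblock - NDIRECT - NPTRS) % NPTRS, within);
    return 0;
}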
directory consists of entries for the files and subdirectories under it. Each entry contains a symbolic file or a
The method of resolving the pathnames has been discussed in the sections
depicted in the figure correspond to the steps (i) to (xi) described as follows.
It can be stored as an absolute pathname or the inode number of the current directory. In either case,
the inode number of the current directory can be found by using the same algorithm as described
below starting from the root (/) directory, if not mentioned explicitly.
(ii) The kernel now accesses the inode for the current directory.
Using the addresses of the data blocks given in that inode, the kernel actually reads the current directory file in the memory.
(v) The kernel extracts the inode number for the parent directory from the current directory using this entry for ‘..’.
(vi) The kernel accesses the inode for the parent directory and reads the actual contents of that file for the
parent directory, again by going through the direct/indirect data blocks of that file.
directory as before.
belong to only one group. The combination of both (uid, gid) forms a domain. For each user, this (uid, gid) is
maintained in a record of the password file /etc/passwd. We have seen that the inode also maintains both uid
and gid for each file, where the uid is the user id of the owner of the file and gid is the group id of the owner
what actions (read/write/execute) are permissible for that user on that file.
Had an entry been kept for each individual user, the ACL would have been very long and would have consumed a lot of disk space. It also would have
file into three broad categories. The first is called the Owner—the one who creates the file. The owner’s uid,
therefore, defines this category. The second category is that of the Group—defined by the gid of the owner.
The third category is that of Others—which do not belong to the first two categories. Apart from the owner,
execute it, on the assumption that it is a file consisting of a compiled, binary, executable program. This is the
The Access Control Verification now proceeds as follows:
(i) A user is assigned a uid and gid by the System Administrator as discussed above. This is stored in the
(ii) At any time, when a user logs on to the system, he keys in his user name and password. As seen
case) validates these with the ones stored in the /etc/passwd file. Only after a match is found, access to
the system is granted; and the user can give a command to the shell. When this user logs on and starts
binary code. Also let us assume that after compiling the program and creating this file, the system
analyst decides as to who can execute (x) it, and the system administrator accordingly sets its access
control bits in the inode appropriately using certain privileged commands which only he can execute.
After receiving the command from a user to execute this program, the kernel accesses the inode for
it.
previous example.)
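The heart of this check can be sketched in C as below. The program name is hypothetical, and the sketch ignores refinements a real kernel makes (the superuser, supplementary groups, and so on); it only mirrors the owner/group/others decision described above.

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* Pick the rwx triplet by category, then test the execute bit. */
static int may_execute(const struct stat *sb, uid_t uid, gid_t gid) {
    if (uid == sb->st_uid) return sb->st_mode & S_IXUSR;    /* owner  */
    if (gid == sb->st_gid) return sb->st_mode & S_IXGRP;    /* group  */
    return sb->st_mode & S_IXOTH;                           /* others */
}

int main(void) {
    struct stat sb;
    if (stat("payroll_report", &sb) == -1) {   /* hypothetical executable file */
        perror("stat");
        return 1;
    }
    printf("execute permission %s\n",
           may_execute(&sb, getuid(), getgid()) ? "granted" : "denied");
    return 0;
}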
The kernel maintains some data structures in the memory for faster access. For instance, the kernel
always keeps the superblocks of all the file systems in the memory as was seen earlier. This is mainly
file is created and a new inode has to be assigned, or some new data blocks have to be allocated to an existing
file, the superblock in the memory can be consulted first. This is obviously far faster, as it saves the disk
accesses.
Apart from the superblocks, the kernel keeps the following data structures in the memory:
to enable the kernel to traverse to the relevant FT entry and then to Inode Table (IT) entries. This is depicted in
These data structures, therefore, allow sharing. For a file which is shared by multiple processes in multiple
to it.
The first three entries in the list of open files for a process are reserved for the standard input, standard output and standard error respectively. This is done for every process and hence, these three entries will be
structures held for that file. This is done even if the files are not explicitly closed by the program that the
process was executing.
File Table (FT) is a table where there is one entry for each file and every mode that
it is opened in by each process. For instance, take the same example of the file of which inode number is
The main purpose of the FT is to maintain the file offset for that specific mode for that specific file. For
is for two system calls—viz. ‘fork’ and ‘dup’. In fork, a process creates a child process and duplicates all the
The fact that the shell and the child process share the read/write pointers to the standard input and output
enables the read/write pointers to be positioned accurately when the shell gains control again.
The kernel reserves some memory to hold the entries in the Inode Table (IT). When
ever a process opens a file, its inode on the disk is copied into the memory with a few additional fields. At
any point there will be pointers from the FT to the IT, as shown in the figure. If there are three entries for the
same file in the FT, all three will point towards only one entry in the IT. The count in this entry of the IT gives
the number of pointers to that file at the run time, i.e. it gives the number of processes that have opened the
same file in the same or different modes at the run time.
When a process closes that file or it terminates, those pointers from the FT to the IT are removed and the
count in the IT entry is decremented accordingly. This count should not be confused with the link count stored in the inode, which refers
to the same file being called by two or more symbolic names from different directories. This field continues
to exist even if no process has opened it.
At any time, some entries in the IT will be free. These free entries are linked together in a list, so that at the
file open time of a new file, a free entry could be retrieved and allocated to the new file after being removed
from the free list. When the inode is not in use anymore, that entry is returned to the linked list of free inodes
entries in the IT.
relative number of the entry within the inode area. This number is however necessary in the IT entry as all inodes
may not be copied into the memory.
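The relationship between the per-process file descriptors, the File Table and the Inode Table can be pictured with the following skeletal C declarations. The field names and sizes are illustrative only; they are not the actual kernel definitions.

#include <stdio.h>

struct in_core_inode {                  /* one Inode Table (IT) entry                      */
    int  inode_number;                  /* which disk inode this copy represents           */
    int  count;                         /* how many FT entries currently point here        */
    int  locked;                        /* set while an operation on the file is going on  */
    /* ... a copy of the disk inode fields follows ...                                     */
};

struct file_table_entry {               /* one File Table (FT) entry                       */
    int  mode;                          /* read, write or read-write                       */
    long offset;                        /* current byte offset for this open               */
    int  count;                         /* descriptors (after fork/dup) sharing this entry */
    struct in_core_inode *inode;        /* pointer into the Inode Table                    */
};

int main(void) {
    struct in_core_inode    it = { 75, 1, 0 };
    struct file_table_entry ft = { 0, 0L, 1, &it };
    struct file_table_entry *fd_table[20] = { 0 };  /* per-process list of open files      */
    fd_table[3] = &ft;                  /* descriptor 3 -> FT entry -> IT entry            */
    printf("fd 3 refers to inode %d at offset %ld\n",
           fd_table[3]->inode->inode_number, fd_table[3]->offset);
    return 0;
}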
When a process wants to perform any operation on a file, it has to open it first. The format of this system call
is as follows:
fd = open (pathname, mode);
Here, fd is the file descriptor returned by the call, and it is used in all subsequent operations on the file. Internally, the kernel sets up the required
data structures.
Pathname, mode etc. are input parameters to the open system call. The kernel carries out the following
(a) The kernel resolves the pathname specified as one of the parameters. If the filename does not exist and
its creation has not been asked for, the kernel outputs an error message
and exits. Otherwise, it invokes the system call to create the file. This algorithm will allocate an inode and set
the access control bits (rwx) in the inode for the file which is created, before being opened. The kernel
finally retrieves the inode number on the disk for the file to be opened, regardless of whether the file
existed before or had to be created.
one (which will be the case, if the file is newly being created, or it is opened for the first time), it
copies the inode from the disk into the IT entry. This is done after grabbing a free inode entry in the
the process wanting to open that file is allowed to do so in the desired mode. (If the file is being
(e) It now grabs a free FT entry, delinks it from the list of free entries in the FT and sets up the fields
within the grabbed FT entry as follows:
If the same process opens the same file but in two modes—one for reading and the other for writing, the
IT will already contain an entry for that inode, and, therefore, it will not be copied afresh from the disk. The
algorithm takes care of this. In this case, there will be two FT entries, each with its own offset, but both will
point to the same IT entry.
It should be noted that the reading starts in a file at the offset maintained in the FT entry for that file and
that mode. As already seen, all these entries are set up at the time a file is opened in a specific mode.
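The following sketch (the file name is hypothetical) opens the same file twice from one process, once for writing and once for reading. Each open gets its own FT entry and hence its own offset, while a single in-core inode is shared underneath.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    char c;
    int fd_w = open("notes.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* hypothetical file */
    int fd_r = open("notes.txt", O_RDONLY);
    if (fd_w == -1 || fd_r == -1) {
        perror("open");
        return 1;
    }
    write(fd_w, "abc", 3);          /* advances only the write offset         */
    read(fd_r, &c, 1);              /* the read offset still starts at byte 0 */
    printf("first byte read back: %c\n", c);
    close(fd_w);
    close(fd_r);
    return 0;
}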
(c) The kernel verifies that the user for which the process has issued this system call has the ‘read’ access
to this file. The inode contains rwx access rights for all the three categories. From the user’s category,
the kernel establishes this access right for the user. If it is denied, it outputs an error message and
exits. We have studied this earlier.
has to be read for the ‘read’ system call, and the offset is picked up from the entry in the FT.
(e) The kernel now locks the inode entry in the IT. It is known that for any operation on that file (write,
Unless an error occurs during the execution of the system call or the end of file is encountered prematurely, all bytes will be read. This process takes
place block by block as follows:
by consulting the inode itself; otherwise the index blocks at different levels of indirection have to be
consulted.
The kernel transfers the desired bytes from the system buffer into the memory buffer, as long as they
The kernel now increments the file offset by the number of bytes actually read, and decrements the count of
bytes remaining to be read by the same amount.
The kernel repeats this procedure until all the bytes are read, or an error or end of file condition
results.
for error recovery in picking up the thread from where the operation was left incomplete.
A question arises: why is the inode entry in the IT locked during this operation? The reason is to ensure
consistency. A process could invoke a read system call and then go to sleep in between. If another process
were allowed to modify the file while the first process was sleeping, the results of the read system call
second block. This could result in undesirable inconsistencies. This is the reason for the inode entries in the IT being locked during such operations.
If all the allocated blocks to a file are exhausted, new blocks may have to be allocated. This necessitates
the invocation of system calls or algorithms for grabbing free blocks or putting them into the inode or indirect
indexes structure, and then delinking those blocks from the list of free blocks. This was not necessary for the
read system call.
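In practice, sequential reading as described above usually takes the shape of a simple loop, as in the following sketch (the file name is hypothetical). Every call to read picks up where the previous one left off, because the offset in the FT entry is advanced each time.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    char buf[512];
    ssize_t n;
    int fd = open("notes.txt", O_RDONLY);         /* hypothetical file */
    if (fd == -1) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)  /* 0 means end of file, -1 means error */
        write(STDOUT_FILENO, buf, n);             /* copy each block to standard output  */
    close(fd);
    return 0;
}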
The read and write system calls allow sequential reading
or writing of bytes with respect to the byte offset maintained for that
file. At times, however, one wants to read or write at a random position, for instance, to read n
number of bytes from offset m, where n is the record length
and m is computed from the desired record number.
The system call for random seek allows this. The syntax of
this call is as shown below:
lseek (fd, offset, whence);
Here, fd is the file descriptor, offset is the desired byte position and whence specifies whether the offset is
counted from the beginning of the file, the current position or the end of the file. After this
call, one can use the read or write system call to read/write the data from that offset onward. In a database
application, seeking to the position computed from the record number and the record length and then
reading a record from that position onwards would give details about the desired record. It is for this reason
that this facility is useful in database applications.
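A sketch of such record-oriented access follows. The data file name and the record length of 64 bytes are assumptions made only for the example.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define RECLEN 64                        /* fixed record length (assumed) */

int main(void) {
    char record[RECLEN];
    long recno = 5;                      /* we want the sixth record (numbering from 0) */
    int fd = open("customers.dat", O_RDONLY);    /* hypothetical data file */
    if (fd == -1) {
        perror("open");
        return 1;
    }
    lseek(fd, recno * RECLEN, SEEK_SET); /* position the offset at the start of the record */
    read(fd, record, RECLEN);            /* then read exactly one record from there        */
    close(fd);
    return 0;
}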
(b) The kernel traverses to the corresponding FT entry using the pointers.
entry and reduces the count field in that entry as well.
FT and IT entries are deleted or not. Any further reference for that file within that process (i.e. with the
Therefore, if a file is opened in multiple modes by the same process, it has to be closed explicitly for all
close all the files one by one for all the modes. The entries for all the files exclusively opened by that process
are obliterated completely in all the data structures. The other entries continue to exist, as there are other
processes which still might require them.
To execute this system call, the kernel parses the pathname to find out if the file name already exists by
carrying out the procedure of pathname resolution which we have discussed earlier. If the file does not exist,
This process essentially means that the kernel assumes the creation of a new file with the same name. The
old file therefore, has to be deleted and a new one created in the same directory. The steps followed by the
kernel are as given below:
(a) The kernel parses the pathname through various directories, until the final file name is encountered. It
stores the inode for the directory above it (i.e. the directory in which the file is to be created), as well
as the filename.
(b) The kernel accesses the inode for the directory and checks the rwx bits for the category of the user
who has created the process to ascertain whether the process has the access right to write into the
(d) The kernel then searches for the filename within this directory to check if it already exists. If it does, it
goes through the following steps:
(i) Access the inode number for that file from the directory.
(ii) Access the inode of the file.
i.e. add them to the list of free blocks.
(e) If the filename does not already exist, it goes through the following steps:
(i) Search for a free entry within the directory and store the filename in it.
(ii) Allocate a free inode from a list of free inodes as given by the superblock and delink it from the
free list.
file.
(f) The kernel initializes the fields in the inode entry for this file, before writing the inode back on the
disk. This initialization is done as follows:
the time of creation. In such a case, the block allocation algorithm has to be invoked, the direct block number
entries and the file size within the inode have to be set appropriately.
(e) Search for the file name. If it does not exist, output an error message and exit.
(f) If the file is found, extract the inode number of that file from the directory entry and access the actual
inode using that number.
because, there are other users for the same file. Therefore, one cannot physically delete it. If, however,
numbers are given by indices at different levels of indirection (single, double and triple) and sets all
the blocks free, i.e. these blocks are added to the list of free blocks.
(ii) The system now frees the inode entry itself and adds it to the list of free inodes.
(h) The system then deletes the entry for that file in the directory, regardless of whether the inode was
freed or not. This is because, that file cannot be accessed by the same pathname anymore.
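A minimal sketch of creating and then deleting a file is given below (the file name is hypothetical). The unlink call removes the directory entry; the inode and the data blocks are actually freed only when the link count drops to zero and no process still has the file open.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = creat("scratch.tmp", 0644);     /* hypothetical file; rw-r--r-- permission bits */
    if (fd == -1) {
        perror("creat");
        return 1;
    }
    write(fd, "temporary data\n", 15);
    close(fd);

    if (unlink("scratch.tmp") == -1) {       /* remove the directory entry for this name */
        perror("unlink");
        return 1;
    }
    return 0;
}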
After the login, a user is placed in his home directory. The home directory pathname is maintained in the /
etc/passwd file by the system administrator at the time of creating a user. At the login time, this is copied into
this system call is executed. There is a shell command or utility called cd (change the working directory) that
a user can execute sitting at a terminal, which again internally uses the same system call.
After the execution of this system call, the new directory, whose pathname is given as a parameter, becomes
the current or working directory. It is executed as follows:
(a) The kernel parses the pathname in the parameter and ensures that it is a valid directory name. To do this, it
uses the same algorithm as for the pathname resolution. If the pathname is invalid, it outputs an error
message and exits.
(b) If the pathname is valid, it locates the inode for that directory and checks its file type and permission
bits to ensure that the target file is a directory and that the owner of the process has access permission
to this directory. (Otherwise changing to this new directory may be quite useless!).
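A small sketch of the system call in use follows. Only the calling process's notion of the current directory changes; the shell that started the program is unaffected.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    char cwd[4096];
    if (chdir("/usr/lib") == -1) {           /* make /usr/lib the current directory */
        perror("chdir");
        return 1;
    }
    if (getcwd(cwd, sizeof(cwd)) != NULL)    /* confirm where we are now            */
        printf("working directory is now %s\n", cwd);
    return 0;
}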
synchronization of various processes. There are two kinds of pipes: named pipes and unnamed pipes. The
difference between the two will be ignored for the present discussion.
A Pipe is essentially a file in which one process is writing some data and another process is reading the
same data in the FIFO order. The differences between a
pipe and a regular file are as follows:
FT entry. The reason for this is that the system calls for pipes are very similarly treated as those for
mistake, it would change one of the offsets and the writing and reading of data in the strict FIFO
is to treat a pipe differently from a file from user’s/programmer’s perspective. This would force the
implementation to be less simple. Another alternative is to maintain the offsets in the IT entry instead
of the FT entry. This alternative is preferred.
A pipe uses only the direct data block entries in the inode itself. It does not use any indirect blocks (indices). That is the reason the access time is very
good for the pipe files.
(c) If the pipe is full, the pointer is reset to the beginning again, thus forming a circular buffer. It does
not, however, overwrite any data unless it is read out. If this happens, it will go to sleep after waking
up the processes that wanted to read from that pipe after the pipe is full. After some data is read, the
writing process is woken up and then the data can be written into that pipe.
(d) If a process tries to read an empty pipe, it goes to sleep after waking up those processes which want
to write to it. After some data is written, the original process can be woken up to read the data in the
same sequence.
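An unnamed pipe created with the pipe system call shows the same behaviour, as in the sketch below: the parent blocks in read until the child has written something, and the bytes arrive in the order in which they were written.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    char buf[64];
    if (pipe(fd) == -1) {                   /* fd[0] = read end, fd[1] = write end */
        perror("pipe");
        return 1;
    }
    if (fork() == 0) {                      /* child: the writer */
        close(fd[0]);
        write(fd[1], "hello through the pipe", 22);
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);                           /* parent: the reader */
    ssize_t n = read(fd[0], buf, sizeof(buf) - 1);   /* sleeps if the pipe is empty */
    buf[n > 0 ? n : 0] = '\0';
    printf("parent read: %s\n", buf);
    close(fd[0]);
    wait(NULL);
    return 0;
}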
normally implemented as a separate file system, with its list of available inodes and free data blocks.
and access the inode of the root in the mounted file system. The parsing algorithm can then traverse down and
access directories P or Q, as necessary.
The kernel then proceeds to implement the system call as given below:
(a) If the user is not a superuser, output an error message and exit. (Only the superuser is allowed to
execute this system call.)
use.
(c) Access that inode record.
needs to do is to issue a system call to write to a file, excepting that the file in this case is a special device file
for that device in the /dev directory.
A question arises as to how the system call of writing to this special device file is internally converted to
writing to that particular device, such as a disk or a printer or a terminal.
It is known that devices are divided into basically two categories: character devices such as terminals and
printers, and block devices such as disks and tapes. Character devices receive/transmit data one character at a
time. Block devices receive/transmit data a block at a time. This gives rise to the possibility of buffering the
data in the case of block devices. The two categories also need different kinds
of device drivers to drive them. The special file for a device maintains an indicator indicating whether it is a
character or a block device.
is a specific device driver for all the terminals, another for all the printers. The device type is identified by a major
number, whereas the specific device within that type is identified by a minor number. Thus, if
you want to write a character to a terminal, all you have to do is to execute the device driver for terminals,
passing it the minor number of that terminal. This works on the
assumption that for two terminals of the same type, the device driver should be the same. The device driver
should know only the actual device on which a specific operation is to be carried out, which is supplied to it
in the form of a minor number. It is obvious from this discussion that if the system has two types of terminals, it will need two different terminal device drivers. For block
devices such as disks, there is a routine called strategy which sits on the top of read and write routines. This
strategy routine is also a part of the device driver. This routine helps in disk arm scheduling to optimize the
arm movement. For instance, after a specific read or write operation is completed, the strategy routine will
find out the next operation to be scheduled. (This depends upon the current disk arms position as well as the
disk arm scheduling algorithm.)
If a device can operate in both the block and character modes, it will have entries in the tables for both
the categories. For block devices, the kernel normally assigns a number of buffers for each device. It can employ anticipatory fetch strategies
and read more blocks than are necessary at the moment. When a block is requested, it can search for it in
the buffer first, to avoid the overhead of an additional I/O. All this is unnecessary and also not possible with character devices.
This scheme of treating devices as files provides a uniform and familiar interface for application programmers as well as system
programmers. An added advantage of this scheme is that the access to various devices also can be controlled
by similar protection/permission (rwx) mechanism as the ones used for other files.
The interface between the kernel and the device is divided into two parts: (a) kernel to the device driver
interface, and (b) the device driver to the actual device (hardware) interface. The kernel to device driver
interface is simple. From the device type and the operation involved, the kernel calls the specific device driver
which checks for permission and carries out the initial buffer/character lists processing. After this process the
second interface takes over. The device driver starts executing instructions which are specific to the hardware
device driver calls the appropriate interrupt handling routine to take care of an interrupt after it occurs.
A special file for a device is also like any other file having an inode. The inode has a field called file
These tables are the heart of the device management. The first character of these names denotes the broad
characters SW signify the switch table. The tables maintain the addresses of the routines to be executed for
the address of the device driver for that device type. The device driver can then be invoked for that device
type with that specific device number passed as a parameter to the device driver in the form of minor number.
The figure also shows the addresses of the clists in the case of character devices and buffers in the case
of block devices maintained in the table. For each type of device in either category, different clists/buffers
are allocated and maintained by the kernel. The device driver in turn has to take care of the Interrupt Service
Illustrating this by an example, assume that an application program writes to a terminal. The following sequence of steps
then takes place.
(a) The compiled code contains a system call (same as the one used for writing to a file), except that in
this case, the file is a special file for that terminal.
(b) The system encounters this system call and the kernel starts processing it.
(c) The kernel accesses the inode for that device (or special) file to check for permissions.
(d) The kernel notes down the type of the device, i.e. character or block in a special file. It uses this
size. The kernel extracts and stores them.
understanding, actually the addresses of these routines are maintained (directly or indirectly) in this
table.
(i) The kernel now branches to this device driver, wherein it passes the minor number (i.e. the terminal
id) as a parameter to this routine.
consists of instructions for actually manipulating the devices and interrupt vectors. After a device
interrupt takes place, the system identifies the interrupting device and calls the appropriate interrupt
table also maintains the addresses of the clists/buffers for different devices from which the data will
be read/written from/to the devices. The kernel maintains a cursor to denote the position from where
the reading/writing should start.
(k) The processing steps thereafter have been detailed in previous discussions.
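From the application's point of view, all of the above is hidden behind an ordinary open and write on the special file, as the following sketch shows. /dev/tty is the special file for the process's controlling terminal; the header for the major() and minor() macros varies from system to system.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>      /* for major() and minor(); location varies across systems */

int main(void) {
    struct stat sb;
    int fd = open("/dev/tty", O_WRONLY);    /* special file for the controlling terminal */
    if (fd == -1) {
        perror("open");
        return 1;
    }
    fstat(fd, &sb);                         /* the inode of the special file             */
    printf("device major %u, minor %u\n",
           (unsigned)major(sb.st_rdev), (unsigned)minor(sb.st_rdev));
    /* The same write call used for ordinary files; the kernel routes it through the
       character switch table to the terminal device driver.                            */
    write(fd, "hello, terminal\n", 16);
    close(fd);
    return 0;
}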
After all instructions are compiled and addresses generated, the compiler may leave some space for
expansion and then define some space for the stack region. There are only two restrictions on this procedure.
Firstly, the total virtual address space (including the gaps in between) allocated to a process cannot exceed the
The compiler keeps all this information about starting virtual addresses, etc. in an executable file as shown in the figure. It records the
section size and starting virtual address for each section in the executable file.
The executable file is divided into four parts as shown in the figure.
The Primary Header contains a magic number which specifies the type of the executable file. It also
indicates the number of sections in the program. It then stores the initial register values to be loaded, once
the process begins execution. This part will also contain the value for the Program Counter (PC). This value
specifies the address from which the execution should begin. It is the virtual address of one of the instructions in the text section.
The Section Header specifies the type of the section and it also stores the size and starting virtual address
of a section as discussed earlier.
The Section Contents contain the actual code and data: the compiled machine instructions in the case of the text section, and, in the case of the data section, the
locations reserved for different tables and variables with their initial values (given by value clauses).
The Other Information includes symbol tables which are used for debugging a program at a source level.
When a program starts executing, this symbol table is also loaded. It gives various data and paragraph names
(labels) and their machine addresses. Thus, when a source level debugger wants to display the value of a
variable, say a counter, the debugger knows where to pick up the value from.
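The layout described above can be pictured with the following simplified C declarations. They are only illustrative; real formats such as a.out, COFF or ELF differ in their details and field names.

#include <stdio.h>

struct primary_header {
    unsigned int  magic_number;       /* identifies the type of executable file          */
    unsigned int  section_count;      /* how many sections the program contains          */
    unsigned long initial_pc;         /* virtual address at which execution should begin */
    /* initial values for other registers, etc.                                          */
};

struct section_header {
    int           type;               /* text, data, ...                                 */
    unsigned long size;               /* size of the section in bytes                    */
    unsigned long virtual_address;    /* starting virtual address of the section         */
};

int main(void) {
    struct primary_header ph = { 0x010B, 3, 0x1000 };   /* made-up values                */
    printf("execution will begin at virtual address 0x%lx\n", ph.initial_pc);
    return 0;
}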
When a process is created and it wants to execute a program from the program name treated as a filename,
the kernel resolves the pathname and gets its inode. From the inode, it then reads the actual blocks allocated
to the executable file. The required memory locations are then allocated and the actual sections are loaded into those locations. If paging is used, the Page
Map Tables are set up accordingly. The initial register values, including the Program Counter, are picked up
from the primary header. The information from the section headers is used to set up various data structures
such as priority.
such as execution time and kernel resources utilization. (These are used to
set the process priority).
normally three in number, one each for text, data and
stack for each user process.
We will study these data structures in the sections that
follow.
maintained separately, because this portion can be swapped with the process image at the context switch.
The entry in the process table for that process is still retained in the main memory, as it is necessary for the
permission bits supplied by the Create system call as a parameter. The resulting permission bits will
be set finally as permissions for that file in its inode. Thus, all the files created or used by this process
could have certain permission denied as per the mask.
(i) Limit fields are used to restrict the size of a process and the size of a file that a process can write.
I/O parameters describe the source/target addresses, amount of data to be transferred, etc. They
meaning of this field has been discussed while discussing the system calls.
(k) A return field stores the result of system calls.
(l) An error field records errors encountered during the execution of a system call issued by this process.
The exact significance of this field has been discussed while discussing the system calls.
another area to store the error codes (l), if any errors are encountered during the execution of the system call.
They can then be interpreted and appropriate action can then be taken.
Pregion has normally three entries for each process—one each for text, data and stack regions of that process.
entry, there are multiple pointers pointing towards the entries in the Pregion table. The entries in the
pregion table for a process need not be contiguous, though the figure shows them as contiguous for the sake of simplicity. The figure shows only one pointer instead of three from the
process table to the pregion table, to avoid cluttering.
Each entry for a process and for a specific region in this pregion table contains the following information:
It is also possible to access an entry of the process table from the pregion table. Thus, it is possible to traverse from the process table
to the pregion table and vice versa. This is not shown in the figure to avoid cluttering.
section.
The virtual address field gives the starting virtual address of that region in that process. The compiler generates this
address and stores it in the executable file. When the process is created, the pregion tables are created and this
address is copied into the corresponding pregion entry. If a region is shared by two processes (e.g. text region of a word processor or a compiler), this starting virtual
address can be different in two processes sharing that region.
The permissions field gives the information about the kind of accesses that are allowed for that process on that
region; read, write and execute permissions can be indicated for this.
These are pointers from a shared region table entry back to all the pregion table entries which share that
region. Alternatively, one can maintain only one pointer from the region table entry back to only the first
entries, the kernel can traverse through all the pregion entries for the same region. When an unshared region is
end of the chain. If a process is created using a shared region, this link could be extended with the appropriate
This pointer chain is to enable the kernel to locate the shared regions properly and speedily. At a logical
to have it that way. In this case, the bidirectional pointers between the process and pregion tables will cease
The region table allows sharing of regions. It is known that an editor program also has text, data and stack
regions. The text portion of the editor can be used by many users executing it (e.g. VI). For each user, the data
region will, however, vary depending upon the file or the data being edited. Thus, some regions are shared
and some are not. This is made possible by the region table. In the region table, there is only one entry for a
region, regardless of how many processes share it. If a non-paged
memory management scheme is used instead of paging, the region table directly contains a pointer to the main
memory address for that region.
A region in the main memory is ultimately loaded from the contents of the executable file (generated by
the compiler), which itself has an inode.
Thus, the region table also contains a pointer to this inode to trace back to the source from which the region
was loaded.
To summarize, the region table contains the following information: the region type
(text, data or stack), a pointer to the page map table for that region, and a pointer to the inode of the executable file from which the region was loaded. The position of an entry within the region table can itself serve as the
region id or region number. Thus, the region id or number need not actually form a field in the region table.
It has been shown only to facilitate comprehension.
At any time, the region table contains some entries which are used and some which are free; the free entries are available for allocation when new regions are created.
The page map tables maintain information about logical pages versus physical page frames for each region.
The figure shows such a page map table for a region. Virtual page numbers are shown only for better
comprehension.
The mechanism of address translation using the page map table has been studied earlier. We have also studied that though the compiler
generates contiguous virtual addresses for a region, the corresponding physical page frames need not be contiguous. When the program is to be loaded in the memory, the kernel consults the list of free physical page
frames and sets up the page map table entries accordingly. For every instruction, the entry corresponding to the logical page in the address to be translated is consulted; the displacement d
is not translated. Once the entry yields the page frame number, the displacement is appended to it without any change to get the final physical address.
If a region is shared (e.g. the text region of a compiler, editor or shell), then there is only one entry in the region table for all those shared
pregions. All those pregion entries sharing the region will point to the region table entry, which in turn will point to the
page map table for that region. When a process terminates, the links to the regions for that process are removed. The corresponding pregion entries are removed and, once that is
done, the process table entry for that process is also removed. Along with this, the memory occupied by any unshared regions is freed.
If demand paging is used as a memory management scheme, all the pages need not exist in the physical
memory at any time. In fact, only a few pages could be loaded from the executable file into the physical
memory. After a while, a few more could be added if ‘demanded’. If some pages are to be removed to create
room for some different pages from the same or a different process, the pages are overwritten, normally using
some page replacement policy. The kernel has to decide whether a page needs to be preserved before being overwritten; the following considerations are used to make the decision:
(i) A page in the data or stack region may have been modified (i.e. it is dirty), and hence, that page frame needs to be written back before being overwritten by the other page. The
kernel writes this page to the swap file.
(ii) The page in the text region is not likely to be dirty because most compilers today produce a reentrant
code which does not modify itself. Also, no other program is allowed to modify this program due
to memory protection. Hence, such a page can simply be overwritten and, when required again, loaded from the executable file. It was from there that it was originally loaded anyway.
The point is that a page from a process could be at any of the following three locations, as shown in
the figure: the physical memory, the swap file or the executable file. The page map table entries include, apart from the page frame number, fields such as the dirty bit (modified or not) and the age (time it has
been in memory). A dirty page has to be preserved on the disk before being overwritten so that next time the latest, updated copy of that page can be brought in.
Once the page is copied on the disk, the disk block entry for that page is updated, so that it can be located at
the next page fault for that page.
The pregion entries, in turn, point back to the process table, if bidirectional pointers are used. Similarly, the bidirectional pointers between
the pregion and the region tables are set up, and the page map table details for a region can be moved from the executable file to the pregion table entries.
(l) The priority of that process can then be computed and the process table entry for that process can be updated and linked into the list of ready processes.
A process executes either in the user mode or the kernel mode. When it is in the user mode, it uses the user
stack to store data such as parameters and return addresses of nested subroutines or procedures. When the
process issues a system call, it starts executing in the kernel mode. At this time, the kernel uses its own stack
to store the parameters used for executing that system call as well as the results from that system call.
A process normally goes back to the ready state (bubble ‘a’) after completing the processing in the kernel mode; if at that very moment its time slice is
complete, it goes into a preempted state (bubble ‘g’) instead. Thus, the Operating System knows that such a process is ready to run again.
The preempted state thus describes a process not waiting for any event and one which has given up the control of the CPU only because its time slice expired.
There is also a state for a process which is created but which is not yet in the ready state,
and a zombie state for a process which has terminated, but is not completely removed from the system.
The process table entry for a process has a field called ‘process state’ which tells you the state of that process.
The ready processes are maintained in the priority order, where the process priority is one of the fields in the process table. As the
processes change their states and the priorities quite frequently during their life cycles, the algorithms and
data structures to manage these changes need to be very efficient.
It is known that the data structures such as the process table entries of a process hierarchy are also chained
together for speedy traversal. This makes the life of the Operating System designers more complex. For instance, it is not only the ready processes that are linked together; in
fact, all of these will be maintained in a priority sequence, while all the others will be chained in the time sequence.
Also, all the processes in a process hierarchy will be linked together, regardless of the states of different
processes within it.
The very first process, ‘init’, has to be ‘hand created’. All the other processes are created by using the fork system call. The newly
created process becomes a child to the process which issues the fork system call. At the time of creation, the
child process is identical to the parent process except only the pid. This pid is different for the parent and
child processes, and it helps to know whether it is a child or a parent process. All the other data structures are essentially duplicated: the kernel makes a copy of the data and stack regions
of the parent process and lets the text region be shared between the two processes.
A question arises as to why the process address space is duplicated and, if it is, then what can the child process
do differently from the parent. The answer lies in the Exec system call, which enables the child process address space to be loaded with the desired program from the disk and then allows
the execution of the new program to start. Thus, the Exec system call does not create a new process. It only
replaces the address space of the child process by the new program to be executed.
A fork followed by an Exec is commonly employed by the shell each time a command is issued, to execute that command. The shell then issues a wait system call to
wait until the child process terminates. The death of a child process sends a signal to the parent waiting on the wait call, which then wakes up.
It was seen that the parent and the child share the text portion. But it is known that the child process has to
execute a different program eventually. How does the same code for a parent and a child achieve two different things? This is achieved by checking, in the common code, whether the process is the parent or the child.
The sequence of events that takes place is outlined below:
(a) The parent issues the fork system call and a child process is created. As seen earlier, the child process is identical to the parent process except the pid field in the process table entry. This pid
is what allows each process to know whether it is the parent or the child.
(b) The child also shares the same text region as the parent. Thus, the parent and the child both execute
the same code. Assume that the next instruction after the fork system call was at PPP. Because the context of both processes also is identical (as it is copied at the time of fork), both resume at PPP.
(c) The parent starts executing the process at PPP after the fork. It checks whether it is a parent or a child
by checking the pid. As it is not a child process, it waits until it receives a signal indicating the demise
of the child.
(d) The child process is put in a ready in memory state as soon as it is created. It is eventually scheduled
and it also starts at PPP. The reason for this is explained in (b) above. It again checks whether it is a
parent process or a child process by examining its pid. Because it is a child process, it issues a system
call for Exec, specifying the program that is to be executed. The kernel then reads the headers of the executable file, talks to memory management to load all the regions in the address
space of the child, and the old address space of the child (which was a copy of the parent) is now overwritten by the code for the program that needs to be executed.
(e) The child now starts executing the new program with a new id. Before this, the initial register values such as the PC and the SP are set up from the headers of the executable file. Assume that the parent
was the shell. In this case, the shell prompts for a new command when the child terminates. As soon as a new command is given, the whole sequence repeats; the
scheme remains the same, as the sketch below illustrates.
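The following C sketch shows, in a hedged and minimal form, the fork-check-pid-exec-wait sequence outlined above; the choice of /bin/ls as the program to execute is purely illustrative and error handling is kept to the bare minimum.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();            /* both processes resume here ("PPP") */

    if (pid < 0) {                 /* fork failed: no child was created  */
        perror("fork");
        exit(1);
    }

    if (pid == 0) {
        /* Child: replace the copied address space with a new program.   */
        execl("/bin/ls", "ls", "-l", (char *)NULL);
        perror("execl");           /* reached only if the exec fails     */
        _exit(127);
    }

    /* Parent: sleep until the "death of the child" signal arrives.      */
    int status;
    waitpid(pid, &status, 0);
    printf("child %d finished\n", (int)pid);
    return 0;
}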
(a) The kernel ensures that there are enough system resources to create a new process. This is done as
follows.
(i) It ensures that the system can handle one more process for scheduling and that the load on the
scheduler is manageable.
(ii) It ensures that this specific user is not running too many processes by monopolizing on the existing
resources.
(iii) It ensures that there is enough memory to accommodate the new process. It is known that the new
process is essentially a copy of the parent, and the memory management scheme in use determines its exact memory requirements. In the swapping system, the entire memory has to be available. In pure paging
systems, the memory for all the pages to hold the entire address space as well as the page map tables is
necessary. In demand paging, only the page map tables are necessary at the least to initiate a process.
Further pages from the address space can be accumulated with page faults in demand paging.
If space is not available in the main memory, the kernel checks if there is space on the disk, such as in
the swap area. The kernel also keeps track of the maximum number of processes the system can support and will not create any more processes beyond it. If this number becomes equal to or higher than this maximum, the kernel returns an error and the fork fails.
(d) The kernel initializes the fields in the slot of the process table for the child process (e.g. its pid, state and accounting fields).
For each file that the parent has open, the child refers to the same entry, and hence the reference counts are incremented in the FT.
(h) The kernel copies the data and stack regions (unshared portions) into another memory area for the child, while the text region is shared.
(i) After creating the static portion of the child context, the kernel creates the dynamic portion. It copies the parent's context layer, so that the register contexts and the kernel
stacks for both the child and the parent at this time are identical.
Following actions are carried out by the kernel to implement this system call:
(a) The exec system call expects a name of an executable file to be supplied as a parameter. This is stored
for future use. Along with the file name, other parameters also may be required to be supplied, and these too are stored.
(c) The kernel ascertains the user category (whether it is owner, group or others). It then accesses the
execute (x) permission for that category for that executable file from the inode. It checks whether that
process has the permission to execute that file. If it does not, the kernel outputs an error message and
quits.
(d) The regions required by the new program will generally be different from the ones existing for the child process, as they were copied from the parent process. Thus, the kernel
frees all the regions attached to the child process. This is to prepare for loading the new program
from the executable image into the regions of the child process. This freeing is done after saving
the parameters for this system call, which were stored in this memory only; this saving is done so that they are not lost when the regions are freed.
(e) The executable file located through the directory is what the kernel wants to load into the memory of the child process.
(f) The kernel then allocates new regions of the required sizes after consulting the headers of the image of
the new executable file, and the corresponding page map tables are established.
(g) The kernel attaches these regions to the child process, i.e. it creates the links between the region tables
and the pregion entries of the child. The child process is then placed in the list of ready processes at the appropriate place depending upon its priority. Eventually, it is dispatched.
(k) After the child process is dispatched, the context of the process is generated from the saved register
context as discussed in point (i) above. The PC, SP, etc. will then have the correct values. The PC points to the first
instruction in the program to be executed. This commences the execution of the new program named in the command of step
(e), and the required output is generated. The parent process waits until the child process terminates if
the child process is executed in the foreground; else it also continues.
(m) The child process terminates and goes into a zombie state. The desired program is already complete.
It now sends a signal to the parent to denote the ‘demise of the child’, so that the parent can now wake up.
The address space and other data structures of the child thus come to differ completely from those of the parent process thereafter. If the child process calls another subprogram, the process of
forking/executing is repeated. Thus, a process hierarchy of various levels of depth can be created.
A process terminates by issuing the Exit system call; a compiler may itself generate a call for Exit, for instance when it encounters a ‘stop run’ instruction in a COBOL program. After the termination of a process,
all the resources are released, the process state becomes zombie and then it sends a ‘death of the child’ signal
to the parent. The format of this system call is as follows:
The status is a value returned to the parent process for examination and further action. The kernel takes the
following steps to carry out this system call execution:
(a) The kernel disables all signals to the terminating process, because handling any of them now is meaningless.
(b) The kernel closes all the files opened by the process with the close system call. It follows the pointers to the FT and the IT and reduces the corresponding reference counts.
(c) The inodes held by the process directly (e.g. current directory inode) are also released.
(d) The kernel frees all the pregion table entries. Before this is done, it follows the pointers to the region table
entries and, for regions which are no longer shared by any other process, frees the region table entry, the corresponding page map tables and the corresponding memory. All these are added to their
respective free pools.
(e) The kernel stores the exit status and the accumulated accounting information in the process table entry of that terminating process. This process table entry is not deleted even after
the exit; the process remains a zombie until the parent collects this information through the wait system call.
When a parent process issues the wait system call and it has no child at all, the kernel returns an error message and exits. Otherwise, for each zombie child process, it follows the following steps:
(b) It has already been seen that when the child process terminates, its accounting details are stored in its own process table entry.
The kernel now adds the accumulated accounting details from the process table slot of the zombie child into those of the parent. When the parent itself terminates, these accumulated details are in turn
saved in the process table entry of its own parent, and the cycle repeats. The idea is that a process
should be responsible for all the resources used by its children, grandchildren, etc., in general, the
entire hierarchy under it.
(c) The kernel then removes the entry in the process table for the zombie child.
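A minimal, hedged sketch of the exit/wait interplay described above is given below; the status value 42 is an arbitrary illustration, and the macros used are the standard ones from sys/wait.h.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0)
        exit(42);                  /* child terminates; it stays a zombie
                                      until the parent collects its status */

    int status;
    waitpid(pid, &status, 0);      /* reaps the zombie: its process table
                                      slot can now be released by the kernel */
    if (WIFEXITED(status))
        printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}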
After studying various data structures and system calls for managing various files and processes, the procedure of bringing up the system can now be studied. The procedure by which the Operating System (i.e. the
kernel) is loaded into the memory so that
users can use the system is called booting. The following takes place at the time of booting:
(a) The first block on every file system is reserved for the boot block. The boot block contains the boot
program. Each file system need not have a boot program, but at least one file system on a disk must
have it. That disk is called bootable disk.
(b) There are one or more hardware switches on each machine which, if pressed, automatically cause the boot block to be read into the memory by
the hardware itself. The control is thereafter passed to the first instruction in the boot program, again
automatically.
(c) The boot program eventually loads the kernel and passes control to it; the kernel then creates the first process,
/etc/init, i.e. the program called ‘init’ in the /etc directory. The init program carries out a number of
initialization activities; therefore, it is called ‘init’.
(d) Init can bring the system up either in the single user mode or in the multiuser mode. In the single
user mode, only the superuser can log in with root privileges. The single user mode is often used for
testing and debugging/repairing file systems and other structures.
For each terminal, init spawns a getty process. Getty, therefore, is a process which runs at every terminal, waiting for a user to type in any input. It accepts
the input and passes it on to the login process to verify the username and password. We will study this
in more detail while studying the login procedure.
(l) Init creates a shell process to execute the commands in the file ‘/etc/rc’. The /etc/rc file contains
a shell script which contains commands to mount file systems, start background processes
(daemons), remove temporary files and commence the accounting programs. The exact details of /etc/
rc differ from implementation to implementation. It also displays the copyright messages.
At this stage, the system is up: it is waiting for any user to log in at any terminal and start using it.
Before a user goes through the login procedure, the following steps should take place: init should have spawned the getty
processes, one each for all the terminals. The process hierarchy at
this stage is shown in the figure. The username, encrypted password and other details are stored in the /etc/passwd file. The table
search to ensure the validity of a username must be an efficient one for better
response time. It must be remembered that there could be hundreds
of such users and login could take place a number of times every day
for each one of them!
If the username does not exist, an error is reported and the login prompt is displayed again; otherwise, the login process proceeds as follows:
(i) The login process prompts the user for a password. (The exact procedure is again dependent on the implementation.)
(g) If the username and password both match, the login process copies the user id, group id and home
directory into its own data structures. These ids for that user are assigned and stored in the /etc/passwd file at the time a user is created in the system.
(h) There is a field in the /etc/passwd file which contains the pathname of the initial program which is now to be
executed. Normally, this is set to the pathname of the shell the user wants to execute. There can be different shell programs that can
be chosen from, and the choice will have been defined at the time of adding a user to the system by the supervisor. The idea is that
this initial program should start executing as soon as a user logs in.
If a specific user wants to skip the shell and directly go to execute a specific
program, that pathname can be given in the /etc/passwd file instead of the
shell’s. This is useful for Point Of Sale (POS) or many other applications
where after logging in, the user should not have to take the trouble of even
executing the shell commands. This is also the case normally in data entry
applications, where an untrained operator would like to see the relevant
application screen immediately after logging in, so that the data entry could
commence fairly mechanically.
We will assume that the user is running the shell for further discussion.
(k) The login process thus, spawns the shell process by the fork/exec
procedure as seen earlier. The process hierarchy at this time looks
(l) The shell runs the appropriate system login scripts which initialize
the user's environment. It runs /etc/csh.login for the C shell and /etc/profile for the Bourne shell.
(m) The shell runs the user's local login script, if it exists in the user's
home (login) directory. This is ".login" for the C shell and ".profile" for the Bourne shell.
(n) The shell then will be ready to accept the commands from the user.
It is known that the fork process copies most of the data structures of the parent into the child. Thus, any process created
for that user/terminal will have the user's id, group id, etc. inherited
from its ancestors. The same thing continues if the shell also spawns
other processes. This is the way the access rights for any process are
transmitted and used for controlling purposes.
When the command (i.e. the child process) completes, it terminates. At this time, it sends a message to its parent (the shell in this case), and wakes it up. The shell now prompts for the next command, and the cycle repeats when
the user gives another command. This process continues as long as the user wants to execute shell commands, i.e. until the user logs out.
Most implementations make use of the ‘multilevel feedback queues’ that were studied earlier. In this method, a process that consumes a lot of CPU time is penalized (by lowering its
priority), but when it runs, it can run for a longer duration (by increasing the time slice). This method, therefore,
maintains a number of queues within the ready processes. Each queue has an associated value of priority and a
time slice. A process performing a lot of I/O consumes very little of its time slice, as it gets blocked after causing an interrupt. The Operating System monitors this performance, and depending
upon the past behaviour of a process, when a process becomes ready, the Operating System computes its new
priority and introduces the process into the appropriate queue. If a process consumes its full time slice, next time it is placed in a queue with a lower priority but a larger time slice. Within
any queue, the Operating System schedules the processes in a round robin method.
The dynamic placement and movement of ready processes in different queues is applicable for only user
processes. The kernel processes, though divided into different priority queues, are more rigid in this respect.
For instance, to improve performance, any process waiting for a disk I/O is always introduced into that queue with a fixed priority level.
When a process issues a system call for disk I/O and is about to go to sleep, a fixed predefined priority value is
attached to it along with the cause of this sleep. This is not dependent upon the priority with which it was executing earlier.
The question arises as to how the kernel manages the priorities and time slices for user processes in practice. Each process has a priority
number (P) associated with a priority level. By convention, a higher priority number denotes a lower priority; a process with a lower value of P is actually one
with a higher priority and thus, it will be scheduled earlier. The priority numbers are calculated with
this convention in mind, as will be seen later.
For each priority level, there is a queue of processes, maintained by creating links between the process table entries of different processes at that level.
The kernel keeps track of the recent CPU usage (F) of every process. As a process consumes more CPU time (measured
by the number of clock ticks consumed), the value of F becomes high. Then the kernel calculates the priority
number (P) of any process in such a way that a higher value of F results in a higher value of P, and thus a lower priority. Processes with the same
priority value are linked together at every context switch. Within the same priority value (P), the kernel selects processes in a round robin fashion. The working of this scheme is
described below:
The worked example in the text traces a few ready processes through successive time slices t1, t2, and so on. Initially, the kernel chooses the process with the highest priority, i.e. the one with the lowest value of P; all the processes start with the same values. At the end of every time slice, the value of F for the process that ran is updated, the kernel calculates the new values of P using the formula discussed earlier, and these new values decide which process is scheduled at the next context switch. The kernel preserves the value of F in the process data structures so that the calculation continues across time slices for ready processes.
In other words, if a process has used the CPU more recently, its priority number increases (i.e. its priority decreases) and hence, it is scheduled only later.
A preempted process is treated like the other ready processes; both are candidates for scheduling, as they are not waiting for any I/O. At the context switch, when
the kernel recalculates the priorities, it moves the preempted process into the ready queue at the appropriate place. If the recalculation falls due while a process is executing in a critical region of the kernel, the kernel only notes that such
a thing had happened and, as soon as the process comes out of the critical region, it recomputes the new values of F and P.
This scheme provides a reasonable response time for online processes such as text editors, as well as a reasonable turnaround for computation-bound programs run in the background.
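The exact formulas differ between implementations; the C sketch below assumes the commonly quoted textbook variant, in which the recent CPU usage F is halved periodically and P is recomputed as a base value plus F/2, so heavy CPU users end up with larger (i.e. worse) priority numbers. The base value of 60 is an assumption for illustration only.

#include <stdio.h>

#define BASE_USER_PRIORITY 60      /* assumed base for user-mode processes */

struct proc {
    int F;                         /* recent CPU usage in clock ticks      */
    int P;                         /* priority number (higher = worse)     */
};

static void recompute(struct proc *p)
{
    p->F = p->F / 2;                       /* decay past CPU usage         */
    p->P = BASE_USER_PRIORITY + p->F / 2;  /* heavy CPU users get a larger
                                              (worse) priority number      */
}

int main(void)
{
    struct proc a = { 60, 0 }, b = { 10, 0 };
    recompute(&a);
    recompute(&b);
    printf("P(a)=%d  P(b)=%d\n", a.P, b.P); /* b is scheduled before a     */
    return 0;
}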
Though the memory management scheme varies from implementation to implementation, two schemes are commonly used: swapping and demand paging. Even under swapping, a
program can still be divided into a number of pages and the physical memory can be correspondingly divided
into several page frames. The process image in the main memory can spread over a number of physical page
frames, which are not necessarily contiguous, and, therefore, requires the page map tables to map the logical
pages onto the physical frames. Under swapping, however, the entire process image is swapped out or swapped in at a time. This is what really differentiates swapping or simple paging from
demand paging. In the former, the entire process image has to be in the physical memory before execution,
whereas in demand paging, execution can start with no page in the memory to begin with. The pages are
then brought in as and when they are demanded. Swapping is an easier method to implement, whereas demand paging is more difficult. Both swapping and
demand paging will now be considered.
A swap device is a part of the disk. Only the kernel can read data from the swap device or write it back. On the swap device,
allocation is done contiguously to achieve higher speeds for the I/O while performing the functions of
swapping in or out. This scheme obviously gives rise to fragmentation and may not use the disk space
optimally, but it is still followed to achieve higher speeds.
If the allocation on the swap device were not contiguous, the kernel would have to maintain an index (like an inode) to keep track of all the blocks allocated to
the swapped process image. Such a scheme would degrade the I/O performance. When a process has to be swapped
in, the kernel would have to go through the index, find the blocks that need to be read, and then read them one by
one. Thus, non-contiguous allocation can give rise to tremendous overheads. Swapping is managed by using a simple data
structure called the swap map. We will now consider a few examples to clarify this position.
(i) At any moment, the kernel maintains a swap map of free blocks available on the swap device. Each entry gives the starting block number of a free area and the number of free blocks in it, for example:
Starting block no.    No. of free blocks
(the actual entries depend on the allocations and releases made so far)
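A hedged sketch of how such a swap map could be used is given below; it assumes a simple first-fit policy over (starting block, count) pairs, which is one common way of allocating contiguous swap space, not necessarily the exact algorithm of any particular UNIX version. The map entries in main() are illustrative values only.

#include <stdio.h>

struct swapmap { int start; int count; };     /* count == 0 means unused  */

/* Returns the starting block of a contiguous run of 'need' blocks,
 * or -1 if no single free run is large enough.                           */
static int swap_alloc(struct swapmap *map, int entries, int need)
{
    for (int i = 0; i < entries; i++) {
        if (map[i].count >= need) {
            int block = map[i].start;
            map[i].start += need;             /* shrink the free run      */
            map[i].count -= need;
            return block;
        }
    }
    return -1;                                /* not enough contiguous
                                                 space on the swap device */
}

int main(void)
{
    struct swapmap map[] = { { 100, 50 }, { 400, 200 } };
    printf("swapped image placed at block %d\n", swap_alloc(map, 2, 120));
    return 0;
}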
When the kernel decides to swap a process out in order to make room for a new higher priority process,
the kernel follows the steps given below:
(a) Trace the links from the process table to the pregion entries of that process.
(b) Traverse from the pregion entries to the region tables and, from there, to the pages that make up the process image. Note
that the virtual process image can have some gaps between
different regions to allow for growth at run time. These gaps
are generated by the compiler itself. In addition, the physical
locations where even a single region resides in the physical memory need not be contiguous.
(v) The page replacement algorithm decides which page has to be removed. (There could be a global or a local replacement policy.)
(vi) The kernel then has to decide whether the page can simply be overwritten or it has to be preserved before it is overwritten. If the page was modified (i.e. it was
dirty) after loading (typically pages from the data and stack regions), then it needs to be preserved.
(vii) If a page is to be preserved, it is written onto the swap file, so that next time the kernel can locate it
and load it back.
(viii) Thus, a page of a process can be in the physical memory or on the disk. On the disk, again, it can be
either in the executable file or in the swap file. The page map table entry also maintains some additional information about
that page, e.g. whether it was referenced, whether it was modified and how long it has been a part of
the working set. This extra information is useful for the kernel in implementing the page replacement
algorithm.
(ix) The kernel has to keep track of all the page frames in terms of whether they are free, and if not, the
process to which they are allocated. This is done by maintaining another data structure called Page
Frame Data Table (PFDT).
As seen earlier, every region table entry has a pointer to a page map table
where the details of all the pages are kept. Each entry in the page map table consists of two parts: a Page Table Entry (PTE) and a disk block descriptor.
The page map table has as many entries as there are pages in a region. At any time, a page can either be in the physical memory or on the disk. If the page is in the memory, the Page
Table Entry (PTE) gives the address of the page in the ‘page frame number’ field. If the page is on the disk, the
disk block descriptor gives the file that holds the page and the block number within that file. Again, the desired page can be at two places on the disk. At the very
beginning, it will be in the executable file on the disk. When the process is scheduled, it is brought in the
memory from the executable file. If and when, subsequently it is swapped out, it will be in the swap file. The
kernel needs to know which pages have been used least recently. One way of doing this is by maintaining and using a field in the PTE called the time of page reference. The hardware itself could pick up the system time
and move it to this field at the time of the page reference. The pages then could be maintained in the ascending order of this time, the least recently used ones being the candidates for replacement. In practice, a simpler reference bit in the PTE
is used for this purpose. Whenever a page is referenced, the hardware, at the time of Address Translation, sets this bit; the kernel clears these bits periodically.
The time duration at which these bits are cleared is again a design issue. If this duration is small, it allows the
Operating System a fine differentiation amongst the page references, but then the overheads of this method
increase. On the other hand, if these bits are cleared after a long interval, the overheads are less but the kernel
then, cannot differentiate between recently used and most recently used pages.
If there is no hardware support for this, the kernel has to take care of it in software; the software implementation is, of course, slow. The modify (or dirty) bit is handled similarly: it is set whenever a page is written to by an
instruction. Pages belonging
to the data and the stack regions can change in their contents and, therefore, the bits for those changed pages are set.
The modify bit is mainly used while swapping out the pages. If this bit is ‘OFF’, the page can be made free without writing it back to the disk. If the bit is ‘ON’, the
kernel writes back this page to the disk to store the latest copy of that page. At this time, it has to write it back
onto the swap file and not the executable file, so that the original contents are not lost. The idea is that next
time a program is executed from the beginning, it should start exactly in the same way that it did the first time.
Age bits indicate the length of time that a process has been a part of the working set.
The page frame number is the physical address of the page to be used for address translation, if the valid bit is ON. The disk block descriptor, on the other hand, tells where the page resides on the disk: it could be in one of
the direct blocks (within the inode itself) or the blocks at different levels of indirection (depending upon the size and organization of the file).
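The following sketch gathers the fields discussed in this section into one illustrative structure; the field names and bit widths are assumptions made for clarity, not the actual layout of any specific UNIX implementation.

#include <stdint.h>
#include <stdio.h>

struct pte {
    uint32_t frame  : 20;   /* physical page frame number                 */
    uint32_t valid  : 1;    /* 1 => the page is in the physical memory    */
    uint32_t ref    : 1;    /* set by hardware on every reference,
                               cleared periodically by the kernel         */
    uint32_t dirty  : 1;    /* 1 => modified since it was loaded          */
    uint32_t age    : 4;    /* how long the page has been in the
                               working set (implementation dependent)     */
};

struct disk_block_desc {    /* where to find the page when valid == 0     */
    int  device;            /* swap device or the executable file's device */
    long block;             /* block number within that device/file        */
};

int main(void)
{
    struct pte p = { 42, 1, 1, 0, 3 };   /* illustrative values only       */
    printf("page in frame %u, valid=%u dirty=%u\n",
           (unsigned)p.frame, (unsigned)p.valid, (unsigned)p.dirty);
    return 0;
}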
Each entry in the Page Frame Data Table (PFDT), mentioned earlier, records the following information:
(i) The state of the page frame (free, allocated, on the swap device or the executable file, being loaded, etc.).
(ii) The logical device and block number that contain a copy of that page.
(iii) Pointers for the free page frame list and for the hashed queue.
The free page frames are linked together in a list with a header; using the header and these pointers, the kernel can access all the free page frames.
The idea behind the hashed queues is quite simple. It helps to improve the search timings. Basically, the kernel needs to maintain a list of all the
page frames, searchable by the disk address of their contents. That is the reason the device number and block number, i.e. the address on the disk, is kept in the
PFDT entry: given a device and block number, the kernel can find out whether that block is already in the physical memory and if so, where to find it.
The page frames are hashed onto queues with different headers as shown in the figure. The hashing algorithm may vary, but the principles and philosophy
remain the same. Given the disk address of a block, the kernel can hash it onto a queue and traverse that queue starting with its header to check if that block is still in the physical memory, or it needs to be read in from the disk.
A very important use of this scheme is to avoid extra disk I/O by checking first if the page frame is actually
available in the physical memory itself before loading it from the disk. For instance, imagine that an editor is
being used by a number of users. As all of them log off or finish their work with the editor, the page frames holding the text of the editor are freed.
The page frames associated with the data and stack of each process running that editor will be freed as well.
However, even the freed page frames still hold their old contents in the main memory. They are actually not being allocated to any other process and consequently they are not
overwritten by any other process. These page frames will be linked in both the chains—the chain of free
page frames (for further allocation) and the hashed queue. When a new process wants a page frame, and
one of these page frames is allocated to it, it is removed from both the queues and then it is linked back to the
appropriate queues for its new contents. Now imagine that a freed page frame has not yet been reallocated and is still linked to the old hash queue. Also imagine that a new user now wants to execute this editor.
The kernel uses the hashing algorithm to check if the corresponding page frames are still available in the main memory. If any
are available, it only reassigns the same and delinks them from the chain of free page frames. This saves a lot
of I/O time, especially for programs which are used very often.
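A hedged sketch of the lookup that this scheme makes possible is given below; the hash function and the table size are arbitrary choices for illustration, and the data filled in main() is invented purely to exercise the search.

#include <stddef.h>
#include <stdio.h>

#define NHASH 64

struct pframe {
    int  device;             /* disk address of the copy held here        */
    long block;
    int  on_free_list;       /* still linked on the free list?            */
    struct pframe *hash_next;
};

static struct pframe *hashq[NHASH];

static struct pframe *find_frame(int device, long block)
{
    unsigned h = ((unsigned)device ^ (unsigned)block) % NHASH;
    for (struct pframe *p = hashq[h]; p != NULL; p = p->hash_next)
        if (p->device == device && p->block == block)
            return p;        /* hit: reuse the frame, no disk I/O needed  */
    return NULL;             /* miss: the block must be read from disk    */
}

int main(void)
{
    static struct pframe f = { 3, 1024, 1, NULL };   /* illustrative entry */
    hashq[(3u ^ 1024u) % NHASH] = &f;
    printf("%s\n", find_frame(3, 1024) ? "found in memory"
                                       : "must read from disk");
    return 0;
}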
Let us now take an example of a process to see how the scheme works.
(a) Assume that a user wants to execute a program.
(b) The kernel creates a process for the same and creates an entry in the process table.
(c) The kernel accesses the executable file containing the image of that process, and finds its size in terms of the number of pages.
(d) The kernel then finds out whether the pages belonging to the text regions are already available in the
main memory. This is done by using the hashing techniques. From the device number and the block
number for each block in the executable file, an appropriate hash queue is searched to check if any
page is available. If available, it can be directly assigned to this process. In this case, the page frame is simply delinked from the free list, saving a disk read for that
page frame.
(e) If any page does not exist in the main memory, it has to be loaded from the disk into a free page frame, which is obtained from the free list through its
header.
(f) This is the way in which the kernel allocates the page frames to this process. (They could amount to
less than the total size of the process address space.) After the allocation, the kernel removes the allocated pages from the free list and updates the corresponding page map table
entries. The remaining pages of the address space are brought in later, on a
page fault.
(g) When an instruction refers to a virtual address, that address consists of a page number (P) and a displacement (d). The hardware separates these two parts. The page number (P) is
used to index into the page map table. If the valid bit in the PTE is ON (i.e. the page is in the memory), the page frame number of the PTE is extracted and the displacement
(d) is appended to it to get the physical address. The instruction can be completed. The reference bit is also set by the hardware at this time.
(h) If the valid bit is OFF, a page fault results and the current instruction is abandoned; it is
again taken up only when the required page is read into the physical memory. The hardware must
have this capability to abandon and restart an instruction (after the page fault processing) to support demand paging.
(i) On a page fault, the kernel first checks whether the required block is already present in the physical memory by using the hashing algorithm. As discussed earlier, the page may have been used by a
process terminated some time ago but the page may not have been removed. If it is not in the physical memory, the kernel allocates a free page frame and consults the disk block descriptor of the page with the valid
bit OFF. It then loads the contents of that page from that address into the allocated page frame. It finally updates the PTE (setting the valid bit and the page frame number) so that the hardware can carry out the address
translation as before.
(n) The kernel monitors the number of free pages available. When they fall below a minimum level,
it invokes a process called a Page Stealer Process (PSP). The function of this process is to go
through the page frames, select those not referenced recently and free them, writing the modified data and stack pages out to the swap file. (A copy of the text pages is available on the executable file; thus, another one is not needed on the swap file.)
(p) Assume that after some time, a data page on the swap file is to be accessed. This causes a page fault
as seen earlier, since the valid bit in the PTE is OFF. The current instruction is abandoned. The kernel first checks whether the page is still held in some page
frame by using the hashing algorithm. If it is not, it grabs a free page frame and reads the contents of the page from the swap file into it before restarting the instruction.
(q) When the process terminates, the pages belonging to the data and stack regions are released
immediately. For pages in the text region, the kernel decrements the reference count. If the count now becomes zero, those page frames are also added to the free list (though, as seen earlier, they may retain their contents on the hash queue for possible reuse).
We will now discuss the Solaris thread types in brief.
Recall that a process has its own resources, such as its address space, stack, process control block, etc.
The concept of a thread in Solaris is quite similar in nature to that in many other
operating systems. That is, a thread is a part of a process, and has its own execution path. Solaris provides
for parallel execution within a process. Interestingly, these threads are logical in nature in the sense that the
operating system does not have any idea about them. The application programmer creates these threads, since
the programmer believes that this would lead to a better (or overall faster) execution of a process. Internally,
these (logical) threads are mapped to (physical) kernel threads, as discussed below.
A lightweight process is bound to one kernel thread. The kernel of the Solaris operating system schedules lightweight processes independently and these may execute in parallel. A lightweight process is visible to the application within a process.
In other words, the data structures for a lightweight process are located in the corresponding address space
of the process.
These types of threads are visible to the operating system. They are actually created,
started, swapped etc. by the operating system. These threads have logical as well as physical existence. There
must be exactly one kernel thread for each lightweight process.
the appropriate model for fulfilling the needs of a particular requirement. That is, the parallelism can be
achieved to the desired extent. Some programs may need logical, but not physical parallelism. For instance,
an application may need to have multiple windows open, only one of which may require being active at a
given point in time. In such a situation, we could model the windows as user-level threads, mapped on to a
single lightweight process. As mentioned earlier, user-level threads are logical entities. That is, they are visible to the application programmer, but not to the kernel.
Switching between them is handled by a user-level library, so the kernel does not have to worry about this. This also means that the expensive operations required for a typical
thread switch are not required here.
On the other hand, we could model steps of execution as multiple lightweight processes, each corresponding
to one kernel thread, when true parallelism is required or when a thread may get blocked (as in the case of I/O operations). In such cases, the other threads in the same process can continue executing,
regardless of the blocked thread.
Just as there is one processor state data structure per process to keep track of items such as the
process's priority, registers, signal mask, stack, etc., Solaris has one such data structure per lightweight process.
As shown in the figure, there is one data area per lightweight process in Solaris. If one process contains
more than one lightweight process, then there are as many lightweight process data structures as the number
of lightweight processes, with pointers between them.
Solaris supports three main protocols specifically related to thread synchronization; these are discussed below.
As we have mentioned, the kernel of the Solaris operating system knows only about the kernel threads.
As such, these mechanisms are available directly only for managing the concurrency of kernel threads. Once a user-level thread is mapped onto a kernel thread, it too can make use of them.
A mutual exclusion mechanism allows only one thread that has acquired the mutual
exclusion lock to proceed. All the other threads are blocked. The thread that creates the mutual exclusion lock
must also unlock it. Solaris provides the following operations related to the mutual exclusion lock:
As we can see, the thread interested in acquiring a lock invokes the mutex_enter ( ) primitive; if the lock is already held by another thread, the caller blocks. A thread that does not want to block can instead attempt to acquire the lock
using the mutex_tryenter ( ) primitive, which returns immediately in either case. Once the lock is acquired, the thread can perform the necessary
operations. Finally, when it is time to release the lock, the thread executes the mutex_exit ( ) primitive.
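The mutex_enter()/mutex_tryenter()/mutex_exit() primitives above are available only inside the kernel; as a runnable illustration of the same acquire/try/release pattern, the sketch below uses the analogous POSIX thread mutex calls instead (compile with -lpthread).

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);        /* analogous to mutex_enter()      */
    shared_counter++;                 /* critical region                 */
    pthread_mutex_unlock(&lock);      /* analogous to mutex_exit()       */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    if (pthread_mutex_trylock(&lock) == 0) {  /* analogous to
                                                 mutex_tryenter()        */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }

    pthread_join(t, NULL);
    printf("counter = %d\n", shared_counter);
    return 0;
}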
Solaris also supports the classical counting semaphore mechanisms. It provides primitives for decrementing and incrementing the semaphore count, and also to
perform the waiting operation.
The UNIX operating system has evolved over a
period of time. Moreover, unlike some of the other
operating systems, it was initially used and en-
hanced in universities, which made it quite sturdy and popular.
Students who used UNIX as a part of their graduate or post-
graduate programs in colleges and universities later went on
to become successful programmers and business heads. It was,
therefore, quite natural that they felt that UNIX should be used
as much as possible. This led to a widespread use of UNIX in
the serious computing world. However, as a result of this, an
interesting problem also emerged: that of lack of compatibil-
ity. More specifically, because the UNIX operating system was
made almost freely available, many versions of UNIX (called
“clones”) emerged. Over a period of time, two major versions of
UNIX became winners: 4.3BSD and System V Release 3. This
was the case towards the end of the 1980s. As if this was not
sufficient, many vendors added their own specific enhancements,
making things even murkier.
Technically, what was the problem with this? Quite clearly,
because there was no single UNIX standard available, no two UNIX programs could be called as 100% binary-
compatible. That is, a program developed on one of the UNIX-based computers could not be guaranteed a
successful execution on another UNIX-based computer, if the two UNIX versions were different. This was
because, although the two distinct versions would be more or less similar in nature, there would be minute
differences (especially in the areas of some of the system calls, e.g. in the number of parameters), which
would mean that the program compiled for 4.3 BSD UNIX would not run as it was on the System V Release
3 UNIX (and vice versa). The source program would need recompilation on the System V before it could be
executed. This was the meaning of lack of compatibility at the binary level.
There was an attempt from the AT&T (i.e. from the makers of the System V UNIX) to standardize UNIX.
They published a list of the system calls, file formats etc. in a document, hoping that the other party (i.e.
BSD UNIX) would also publish their own list, or even better, reconcile with them to come up with a standard
UNIX version. This document is known as SVID (System V Interface Definition). However, this attempt
failed, as the other party (BSD) simply ignored this document, and did not respond at all.
This led to the intervention of the neutral body of IEEE to try and standardize UNIX. The Standards
Board appointed by IEEE was (and is) known for publishing standards that make incompatible protocols
and mechanisms work by achieving a minimum common objective, and doing away with the special, more
specific protocols. IEEE involved hundreds of people from the academia, businesses, government and other
bodies in trying to standardize UNIX. The aim of this attempt was to come up with an operating system
standard known as POSIX (POS for Portable Operating System and IX from UNIX).
POSIX proposed a minimum base standard to which all the vendors were expected to conform. This
standard defined a basic set of system calls, which was expected to be followed by all the vendors. This would
mean that once a programmer wrote a program adhering to the POSIX standard, it would execute on any
UNIX clone, as long as that UNIX clone also conformed to the POSIX standard. This would mean that the
program would now be compatible at the binary level.
It is significant to note the approach taken by IEEE to come up with the POSIX standard. The traditional
mechanism used by standard bodies is to take all the features from all the participating companies/protocols,
and include all of them in the resulting standard. However, IEEE chose a different approach in the case
of POSIX. Here, it chose to include only those system calls and other features, which were common to
both the versions of UNIX (BSD and System V). That is, it was consciously decided to consider only the
commonalities.
Although the POSIX attempt was an important milestone in the history of UNIX, yet that was not the end
of the story. Having brought the BSD and System V camps of UNIX together, IEEE and the entire world
concerning UNIX would have felt quite good. At this juncture, another different problem emerged. Some
organizations including IBM, HP, DEC and others realized that AT&T would get a big control over UNIX
even after the birth of POSIX. Consequently, they decided to have their own UNIX! They set up another
body, called OSF (Open Software Foundation) to create a flavour of UNIX, which would adhere to the IEEE
standard, and yet contained many other features, most notably a Graphical User Interface (GUI). In response,
AT&T set up its own body, called as UI (Unix International), which did the same thing for their version of
UNIX, i.e. System V. As a result, there were again two major UNIX clones.
However, this time there was no attempt to further standardize UNIX, and it was left open to the market.
Ultimately, the System V version emerged as a “better” UNIX. OSF gradually vanished. However, other
variants of UNIX emerged later on, e.g. Solaris, which was created by Sun Microsystems as a specific product
based on System V.
The above discussion describes how UNIX emerged. Let us now turn to Linux.
Among many versions of UNIX that emerged thereafter, the most successful one is Linux. Finland’s Linus
Torvalds decided to write a UNIX clone. Its first major release was in 1991. Called as Version 0.01, it had
about 9,300 C lines of code and about 950 lines of assembly code. It ran only on the 80386 computers at that
time, because it contained 80386 assembly language code in between the C code. Now, of course, Linux is
ported on many hardware platforms.
The next major version of Linux (version 1.0) came out in 1994, and consisted of about 165,000 lines of
code. It went through many minor revisions, until the time the version 2.0 was released. This version contained
about 470,000 C lines and about 8,000 assembly language lines. It supported 64-bit microprocessors and had
a number of new features.
Apart from its roots in UNIX, which make it a very solid operating system, the main motivation behind
Linux is that it is completely free. That is, its original source code can be actually downloaded from the
Internet (from many sites, one of which is www.kernel.org). Users who download Linux can use, change, or
redistribute the source code as well as the executable binary version of the operating system freely.
Now, Linux is also getting commercialized, with many hardware companies supplying it as a pre-installed
operating system along with their computers. Linux CDs are also becoming popular, as it is cumbersome to
download Linux from the Internet.
Linux is quite similar to UNIX in all the areas that concern a typical operating system. However, in
many cases, there are subtle differences or improvements over UNIX. In the following sections, we shall
concentrate only on these differences. This also means that we assume that the reader has a fair knowledge
of how UNIX works.
Apart from the individual differences between UNIX and Linux under various categories, such
as process management, memory management, etc. there are some key areas where the two
Operating Systems differ at the conceptual level.
UNIX is generally considered as a monolithic operating system. This means that all the Operating System
functionality (i.e. kernel) is contained within a single large chunk. Thus, UNIX has a single code area; it runs
as a single process and has a single address space. As a result, all the code portions in UNIX have an access
to all the other code/data areas. Thus, any change to the UNIX kernel means that we have to re-test the entire
kernel and, if necessary, re-link, reboot and re-install the entire Operating System. Therefore, even adding a simple
device driver can be quite a daunting task in UNIX.
Linux, on the other hand, does not have this problem. It is designed as a group of relatively independent
loadable modules. This facilitates dynamic linking and stackable modules, similar to the client/server
architecture. That is, each service is fairly independent of each other, and specializes in one area.
This also means that common code can be carved out in case of Linux, and all the other modules can use
the services of that common code, rather than having to duplicate it everywhere, thus making the overall
architecture a lot simpler.
On the other hand, there are also many commonalities between UNIX and Linux. For instance, about 80%
of the UNIX system calls are directly borrowed into Linux. Other features of UNIX, such as libraries, data
structures and algorithms are also extensively used in Linux without any changes.
Apart from the usual process management in UNIX, Linux has introduced a new system call,
called as clone. The syntax of this call is as follows:
This call creates a new thread either as a part of the existing process, or as a new process, depending on
the value of the parameter sharing_flags.
The newly created thread or process begins by calling function (which is the first parameter to the clone call; refer to the syntax). The last parameter to clone is argument, which is passed as it is to
function. This concept is illustrated in Fig. 15.1.
The new thread created by the “clone” function call receives its own stack. Its stack pointer is initialized
to the value as specified in the stack_pointer parameter.
The value of the parameter sharing_flags provides better sharing properties with regard to this thread and
its parent. Table 15.1 depicts the possible values of this flag, and the meaning that it conveys.
Flag value Meaning if this flag is set Meaning if this flag is not set
CLONE_VM Create a new thread within the process when this call is executed Create a completely new process
CLONE_FS Share the properties of the parent process, Do not share these properties of the
such as the root directory, working parent process
directory, etc.
CLONE_FILES Share the files of the parent process Create a copy of the files used by the
parent process
CLONE_SIGHAND Share the signal handler table Create a copy of the signal handler table
CLONE_PID Use the same PID as the parent process Use a new PID for this process
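A minimal, hedged sketch of using clone() through the glibc wrapper is shown below; the stack size, the flag combination (CLONE_PID is omitted, since modern kernels no longer accept it) and the child function are illustrative choices, not the only possible ones.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)            /* "function" in the text       */
{
    printf("child got argument: %s\n", (const char *)arg);
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        exit(1);

    /* The stack grows downwards, so pass the top of the allocated area.  */
    int pid = clone(child_fn, stack + STACK_SIZE,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                    SIGCHLD,               /* so the parent can wait()     */
                    "hello");              /* "argument" in the text       */
    if (pid < 0) {
        perror("clone");
        exit(1);
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}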
Linux maintains the information about every task/process with the help of a table, called as task_struct.
The main entries in the task_struct table are shown in Table 15.2.
Parameter Meaning
State Represents the state of the process, such as Executing, Ready, Suspended,
Stopped, etc.
Scheduling information Specifies the information required for scheduling the process, such as its
priority, allocated time slice, etc.
Identifiers Lists the process identifiers, such as the process’ PID, UID, GID, etc.
IPC Identifies which Inter Process Communication (IPC) mechanism out of the
ones possible, to use.
Links Contains links to the parent (the process which created this process), siblings
(other processes which have the same parent as this process) and children
(processes that are created by this process).
Times/timers Specifies the process creation time, time used so far, etc.
File system Contains pointers to the files used/accessed by this process.
Virtual memory Specifies the details about the virtual memory usage and access mechanisms
for this process (such as page numbers, details of mapping to disk, etc.).
Context This parameter contains the details about the various process context areas,
such as the registers used, the stack details/pointers, etc.
Interestingly, Linux does not have separate tables for processes and threads. In other words, it does not
distinguish between a process and a thread. It treats both of them equally, or rather, as the same.
Linux maintains a list of pointers (sometimes called as Process table), which point to all the task_struct
entries. This concept is illustrated in Fig. 15.2.
Thus, to access the details about any process, Linux consults the appropriate pointer, and follows it up to
the correct process entry.
Linux considers processes as belonging to either of two main categories: real-time and others.
The term real-time is a bit of a misnomer, since it does not necessarily have anything to do with
real-time activities. Instead, this term refers to more important or more urgent processes. Based on
this idea, Linux process scheduling can be classified into three categories, as shown in Fig. 15.3.
The broad-level characteristics of the three scheduling types are as follows:
This is the category, which contains processes with the highest priority. These processes
cannot be preempted. The only way to preempt a process in this category is via a new real-time FIFO process.
These processes are quite similar to the real-time FIFO processes, except that
the CPU clock can preempt them.
These are ordinary processes, which do not have any urgency, and are scheduled
by using the default timesharing algorithms.
The mechanism used by Linux to schedule processes is quite interesting. Each process has a scheduling
priority. The default value of this field is 20. This value for a process can be altered by using the “nice”
system call. The syntax for “nice” is quite simple, as follows:
When a call to the “nice” function is made, the new scheduling priority of the thread becomes equal to:
Old scheduling priority - Value
The scheduling priority can be between 1 and 40, whereas the value can be between –20 and +19. The
higher the value of the scheduling priority, the higher is the attention provided by the operating system to the
process (i.e. more CPU time, faster response time, etc).
Each process also has another value, called quantum, associated with it. This is equal to the number of
CPU clock ticks. The default clock assumed by Linux is 100 Hz. Therefore, one tick = 10 milliseconds
(ms). Each tick is called as a jiffy in Linux terminology.
Based on all these concepts, the scheduler calculates the value of the goodness of a process, as illustrated
in Fig. 15.4.
What is the significance of the goodness? Linux always selects the process to be scheduled next, based on
the value of the goodness. The process with the highest goodness value is scheduled.
With every clock tick, Linux reduces the value of the quantum for that process by 1. A process stops
executing, if one of the following happens:
(a) The value of the quantum becomes 0, i.e. the time slice is over.
(b) The process needs some I/O, and therefore, cannot continue.
(c) A previously blocked process with a goodness value greater than the currently executing process
becomes ready.
The scheduler periodically resets the quantum of all the processes to a value based on the following
formula:
Processes that have been blocked because of I/O
would usually have some quantum left. Whereas
the processes that have exhausted their full quantum
(i.e. the ones, which are more CPU-hungry) will
have the value of quantum closer to 0. In order
to ensure that the CPU-hungry processes do not
continue grabbing the CPU more, and instead, the
processes that were blocked because of I/O but are
now ready, get a higher priority, this formula is
devised. Let us understand how it works.
As we have noted, for a CPU-hungry process,
quantum will be closer to 0. Therefore, based on
the above formula, the new quantum will be closer
to its scheduling priority. On the other hand, for
processes that were blocked due to I/O, but are now
ready, the new quantum will also consider their old
quantum.
Note that by using such a mechanism, Linux
automatically makes sure that the processes that are
using the CPU extensively would now get a lower
share of the CPU.
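Since the figures giving the goodness and quantum formulas are not reproduced here, the sketch below assumes the commonly quoted Linux 2.x rules: goodness = quantum + priority for timesharing processes (with a large constant added for real-time ones), and quantum = quantum/2 + priority at every recalculation. The numbers used are illustrative only.

#include <stdio.h>

struct task {
    int rt;          /* non-zero for a real-time process                   */
    int priority;    /* scheduling priority (default 20, altered by nice)  */
    int quantum;     /* remaining clock ticks (jiffies) in this epoch      */
};

static int goodness(const struct task *t)
{
    if (t->quantum == 0 && !t->rt)
        return 0;                         /* time slice exhausted           */
    return (t->rt ? 1000 : 0) + t->quantum + t->priority;
}

static void new_epoch(struct task *t)
{
    /* I/O-bound tasks keep half of their unused quantum and therefore
       end up with a better goodness than CPU-hungry ones.                 */
    t->quantum = t->quantum / 2 + t->priority;
}

int main(void)
{
    struct task cpu_hog = { 0, 20, 0 };   /* used its whole time slice      */
    struct task io_task = { 0, 20, 14 };  /* blocked early, quantum left    */
    new_epoch(&cpu_hog);
    new_epoch(&io_task);
    printf("goodness: cpu_hog=%d io_task=%d\n",
           goodness(&cpu_hog), goodness(&io_task));
    return 0;
}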
For inter-process synchronization, Linux imple-
ments wait queues and semaphores. Wait queues
are circular linked lists, which describe the process
descriptors. We have already discussed semaphores
in the earlier chapters. In Linux, a semaphore con-
sists of three fields: semaphore counter, number of
waiting processes and list of processes waiting for the semaphore.
In Linux, every process gets a 3 GB virtual address space. An area of 1 GB is reserved for page
tables and kernel data. The virtual address space is partitioned into contiguous areas, i.e. pages. The
page size for a particular process is pre-determined. For instance, in the case of Pentium, it is 4 KB.
A process begins execution with a fixed area of memory. It can then request for more memory dynamically.
Linux allocates this additional memory as and when needed.
The memory area for each process is described in the kernel using a data structure, called as VM_AREA_
STRUCT. This area contains details, such as the ones mentioned below.
– Protection: whether this memory area is read-only, read/write, etc.
– Growth direction: contains up if this area holds the data segment (since data tends to grow up), and down if
this area holds the stack segment (since the stack tends to grow down)
Let us now understand how a process receives pages during its lifetime. Suppose a process has been
designated a number of pages (say 64) just before it starts executing. To deal with any requests, the number
of pages equal to the nearest power of 2 is allocated, as per the Linux philosophy.
How does this happen? Let us assume that the first memory request arrives for 7 pages. Clearly, the closest
match is that of 8 pages. In order to do so, Linux divides the 64-page designated area into two halves of 32
pages each. Since even 32 pages are too big for this request, one of the 32-page half is sub-divided into two
16-page blocks. One of the 16-page blocks is further divided into two 8-page blocks, and one of these is
allocated to the process. This is shown in Fig. 15.6.
It is easy to imagine how further requests for memory will be satisfied. If the next request is for a size less
than or equal to 8 pages, the unused 8-page block will be allocated. However, if it is greater than 8 pages, the
16-page or 32-page block, as appropriate, will be allocated, and so on.
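The splitting step just described can be sketched as follows; the 64-page starting block and the 7-page request mirror the example in the text, while the code itself is only an illustration of the idea, not the kernel's actual allocator.

#include <stdio.h>

/* Round a request (in pages) up to the next power of two. */
static int round_up_pow2(int pages)
{
    int size = 1;
    while (size < pages)
        size *= 2;
    return size;
}

int main(void)
{
    int block = 64;                      /* the initially designated area   */
    int want  = round_up_pow2(7);        /* a request for 7 pages -> 8      */

    /* Keep halving until the block is just big enough; every cut leaves
       a free "buddy" of the same size behind for later requests.          */
    while (block / 2 >= want) {
        block /= 2;
        printf("split: keep %d pages, leave a free buddy of %d pages\n",
               block, block);
    }
    printf("allocated a %d-page block for the 7-page request\n", block);
    return 0;
}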
Technically, Linux uses the Buddy algorithm for such allocations. It uses a demand-paged system with no pre-paging. It also
does not use any working set. For quick page replacement, Linux keeps some pages free, which can be
allocated to the requesting processes. It also reclaims them when the pages are no longer in use. A use bit and
a modify bit respectively signify if a page is in use or has been modified.
Linux supports over 12 file systems, including NFS. When Linux (i.e. the
operating system code) is linked, the choice of the default file system needs to be specified. The
other file systems can be called dynamically, depending on the requirements. The Ext2 file
system is the most popular choice. It is similar to the Berkeley file system.
This file system considers the disk to begin with a boot block, with the rest of the disk made up
of a series of block groups. Block groups are numbered sequentially, and they contain many sub-fields. The
overall organization is shown in Fig. 15.7.
Each device (such as hard disk, printer, keyboard, etc.) has a special software program, called
device driver, associated with it. This program resides in a special area in the kernel of the
Operating System, and includes the instructions for the communication between the Operating
System and the device. For instance, the device driver for the hard disk performs the low-level interaction
with the hard disk on behalf of the Operating System. This makes the Operating System independent of the
peculiarities of the underlying devices, and allows it to use a common (generic) interface for all the devices,
regardless of the actual make of the devices.
In Linux, whenever we want to attach a new device to the computer, we must also inform Linux about
its device driver and supply it. A device driver file can be loaded as and when needed, or it can be kept in
the main memory if the device is accessed quite frequently. The actual file for a device driver can be located
anywhere on the disk, but it is better to keep it in the /dev directory for ease of reference.
Note that a device in Linux is treated in a fashion that is quite similar to UNIX. A device can belong to
either the character mode or the block mode. The appropriate information about the same is contained in its
device driver.
Linux was designed as a multi-user operating system right from the beginning. As such, the security of the
Linux operating system was considered as an important aspect right from the beginning. We shall review the
important security concepts in Linux in brief.
Being a multi-user system, Linux needs to ensure that many users are able to access the operating system
services at the same time. This calls for a high level of security and privacy. Linux assigns a unique UID
(User ID) to every user. A UID is an integer between 0 and 65,535. In fact, Linux also marks files, processes
and other resources with the UID of the user who owns them. Multiple users are grouped into a GID (Group
ID), which is a 16-bit number. The system administrator can do the assignment of a user to a group. Earlier,
one user could belong to only one group. Now, a user can belong to multiple groups. Each process in Linux
carries the UID and GID of the owner.
When a new file gets created, it is assigned the appropriate UID and GID of the process that creates it.
At this time, permissions regarding the files can also be specified (i.e. who can do which operations
with that file). The permissions are of three types: read (r), write (w) and execute (x). Moreover, each of
these permissions can be specified for the owner, group that the owner belongs to, and other users. This is
represented by using one bit (0 or 1) for each permission and user type. Table 15.3 explains the idea with a
few examples.
Bit pattern Symbolic representation Meaning
111000000 rwx------ Owner can read, write and execute
111101101 rwxr-xr-x Owner can perform all actions; group and others can read and execute
000000111 ------rwx Only others can do anything; owner and group have no access (strange, but possible!)
110100000 rw-r----- Owner can read and write, group can read
As we can see, the first three bits signify what the owner can do, the next three bits signify what the users
belonging to the owner’s group can do, and the last three bits signify what the other users can do. A hyphen
indicates no permission.
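To see these nine bits in practice, a program can read a file's mode with the standard stat() call and print the owner, group and other triplets, as in the following sketch (the default file name used is only an example).

/* Sketch: print the nine rwx permission bits of a file, grouped as
 * owner / group / others, using the standard stat() interface. */
#include <stdio.h>
#include <sys/stat.h>

static void print_triplet(mode_t m, mode_t r, mode_t w, mode_t x)
{
    putchar((m & r) ? 'r' : '-');
    putchar((m & w) ? 'w' : '-');
    putchar((m & x) ? 'x' : '-');
}

int main(int argc, char *argv[])
{
    struct stat st;
    const char *path = (argc > 1) ? argv[1] : "example.txt"; /* example name */

    if (stat(path, &st) != 0) {
        perror(path);
        return 1;
    }
    print_triplet(st.st_mode, S_IRUSR, S_IWUSR, S_IXUSR); /* owner  */
    print_triplet(st.st_mode, S_IRGRP, S_IWGRP, S_IXGRP); /* group  */
    print_triplet(st.st_mode, S_IROTH, S_IWOTH, S_IXOTH); /* others */
    putchar('\n');
    return 0;
}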
Linux stores the message digests of the user passwords in the user database. When a user wants to log on to
the system, Linux expects the user to enter the user id and password. After the user enters these details, Linux
computes a message digest of the password and compares it with the one stored in the user database against
that user. If the two match, the user is considered successfully authenticated. This scheme, by itself, is open to
dictionary-based attacks, as follows. An attacker builds a large list of likely passwords and calculates the
message digest of each one using the same algorithm that Linux uses. The attacker stores this list of passwords
and their corresponding message digests in a disk file, and then compares each of these message digests
against the entries in the Linux user database. If there is a match, the attacker can
successfully log in to the system!
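The precomputation idea can be sketched as follows. Here, message_digest() is a hypothetical stand-in (a toy, non-cryptographic function used only so that the sketch runs) for whatever one-way hash the system actually uses, and the word list is purely illustrative.

/* Sketch of the dictionary-attack idea (for explanation only).
 * message_digest() is a toy stand-in, NOT a real cryptographic digest. */
#include <stdio.h>

static unsigned long message_digest(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    /* Step 1: attacker precomputes digests of likely passwords. */
    const char *dictionary[] = { "test", "testing", "password", "admin" };
    unsigned long precomputed[4];
    size_t i;

    for (i = 0; i < 4; i++)
        precomputed[i] = message_digest(dictionary[i]);

    /* Step 2: compare a digest taken from the user database against
     * the precomputed table. */
    unsigned long stolen = message_digest("testing");
    for (i = 0; i < 4; i++)
        if (precomputed[i] == stolen)
            printf("Password found: %s\n", dictionary[i]);
    return 0;
}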
To avoid such attacks, Linux uses the concept of salt (explained in Chapter 6). In Linux, this works
as follows. For each user, there are three columns in the user database: the user id, the salt, and the
message digest of the concatenation of the user’s password and the salt together. For example, consider
a user Ana. Suppose that her password is testing, and suppose that the randomly chosen salt value for her
id is 3719. Then, Linux calculates the message digest of the password and the salt together as shown in
Fig. 15.8, and stores the result as the third column in the entry for Ana.
Linux calculates the message digest of the password and the salt together for each user id (let us call this
the derived password), and stores the result along with the user id and the salt, as shown in Table 15.4. Here,
we illustrate the user database entries for three users, as an example.
As we can see, the user id and the salt are stored in clear text. Does the clear text salt not help the attacker?
Not really! As we know, the attacker’s plan is to first create her own list of probable passwords, then calculate
their message digests, and compare that list with the digests stored in the Linux user database. However, the
salt makes the attacker’s task far more difficult. Suppose that the attacker suspects that someone’s password is test. It
is no longer adequate to calculate the message digest of the word test and add that to the list of probable
passwords. She has to calculate the message digest of every string that combines the candidate password with
a possible salt value, such as test0001, test0002, test0003, and so on. In other words, the precomputation effort
is multiplied by the number of possible salt values, which makes dictionary attacks far more expensive.
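On a real Linux system, this combination of salt and password is essentially what the library routine crypt() performs. The sketch below shows the idea; the password and salt values are the examples used above, and the header and linker details (crypt.h, -lcrypt) may vary from one C library to another.

/* Sketch: salted password hashing with the standard crypt() function.
 * The password and salt values are examples only.
 * Compile with something like: cc salt.c -lcrypt */
#include <stdio.h>
#include <crypt.h>   /* declaration of crypt(); on some systems it is in
                      * <unistd.h> with _XOPEN_SOURCE defined            */

int main(void)
{
    const char *password = "testing";  /* example password      */
    const char *salt     = "37";       /* example two-char salt */

    /* crypt() combines the salt and password and returns the derived
     * (hashed) password; the salt is stored in clear along with it. */
    char *derived = crypt(password, salt);
    if (derived == NULL) {
        perror("crypt");
        return 1;
    }
    printf("stored entry: salt=%s derived=%s\n", salt, derived);
    return 0;
}

The returned string embeds the salt, so a login attempt can later be verified by calling crypt() again with the stored salt and comparing the result with the stored derived password.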