TRANSPARENT PROCESS MIGRATION: DESIGN ALTERNATIVES AND THE SPRITE IMPLEMENTATION
SUMMARY
The Sprite operating system allows executing processes to be moved between hosts at any time. We use
this process migration mechanism to offload work onto idle machines, and also to evict migrated processes
when idle workstations are reclaimed by their owners. Sprite's migration mechanism provides a high
degree of transparency both for migrated processes and for users. Idle machines are identified, and
eviction is invoked, automatically by daemon processes. On Sprite it takes up to a few hundred
milliseconds on SPARCstation 1 workstations to perform a remote exec, whereas evictions typically
occur in a few seconds. The pmake program uses remote invocation to invoke tasks concurrently.
Compilations commonly obtain speed-up factors in the range of three to six; they are limited primarily
by contention for centralized resources such as file servers. CPU-bound tasks such as simulations can
make more effective use of idle hosts, obtaining as much as eight-fold speed-up over a period of hours.
Process migration has been in regular service for over two years.
KEY WORDS
Process migration
INTRODUCTION
In a network of personal workstations, many machines are typically idle at any given
time. These idle hosts represent a substantial pool of processing power, many times
greater than what is available on any user's personal machine in isolation. In recent
years a number of mechanisms have been proposed or implemented to harness idle
processors (e.g. References 1-4). We have implemented process migration in the
Sprite operating system for this purpose; this paper is a description of our implementation and our experiences using it.
By process migration we mean the ability to move a process's execution site at
any time from a source machine to a destination (or target) machine of the same
architecture. In practice, process migration in Sprite usually occurs at two particular
times. Most often, migration happens as part of the exec system call when a resource-intensive program is about to be initiated. Exec-time migration is particularly convenient
because the process's virtual memory is reinitialized by the exec system call and thus
need not be transferred from the source to the target machine. The second common
*Present address: Vrije Universiteit, Dept. of Mathematics and Computer Science, De Boelelaan 1081a, 1081
HV Amsterdam, The Netherlands. Internet: douglis@cs.vu.nl.
into the kernel. In such a system the solution to the transparency problem is
not as obvious; in the worst case, every kernel call might have to be specially
coded to handle remote processes differently than local ones. We consider this
issue in greater depth below.
4. Sprite already provides network support. We were able to capitalize on existing
mechanisms in Sprite to simplify the implementation of process migration. For
example, Sprite already provided remote access to files and devices, and it has
a single network-wide space of process identifiers; these features and others
made it much easier to provide transparency in the migration mechanism. In
addition, process migration was able to use the same kernel-to-kernel remote
procedure call facility that is used for the network file system and many other
purposes. On SPARCstation 1 workstations (roughly 10 MIPS) running on a
10 Mbits/s Ethernet, the minimum round-trip latency of a remote procedure
call is about 16 ms and the throughput is 480-660 Kbytes/s. Much of the
efficiency of our migration mechanism can be attributed to the efficiency of
the underlying RPC mechanism.
To summarize our environmental considerations, we wished to offload work to
machines whose users are gone, and to do it in a way that would not be noticed by
those users when they returned. We also wanted the migration mechanism to work
within the existing Sprite kernel structure, which had one potential disadvantage
(kernel calls) and several potential advantages (network-transparent facilities and a
fast RPC mechanism).
WHY MIGRATION?
Much simpler mechanisms than migration are already available for invoking operations on other machines. In order to understand why migration might be useful,
consider the rsh command, which provides an extremely simple form of remote
invocation under the BSD versions of UNIX. rsh takes as arguments the name of a
machine and a command, and causes the given command to be executed on the
given remote machine (Reference 13).
rsh has the advantages of being simple and readily available, but it lacks four
important features: transparency, eviction, performance and automatic selection.
First, a process created by rsh does not run in the same environment as the parent
process: the current directory may be different, environment variables are not
transmitted to the remote process, and in many systems the remote process will not
have access to the same files and devices as the parent process. In addition, the user
has no direct access to remote processes created by rsh: the processes do not appear
in listings of the user's processes and they cannot be manipulated unless the user
logs in to the remote machine. We felt that a mechanism with greater transparency
than rsh would be easier to use.
The second problem with rsh is that it does not permit eviction. A process started
by rsh cannot be moved once it has begun execution. If a user returns to a machine
with rsh-generated processes, then either the user must tolerate degraded response
until the foreign processes complete, or the foreign processes must be killed, which
causes work to be lost and annoyance to the user who owns the foreign processes.
Nichols' butler system terminates foreign processes after warning the user and providing the processes with the opportunity to save their state, but Nichols noted that the
ability to migrate existing processes would make butler much more pleasant to use (Reference 1).
Another option is to run foreign processes at low priority so that a returning user
receives acceptable interactive response, but this would slow down the execution of
the foreign processes. It seemed to us that several opportunities for annoyance could
be eliminated, both for the user whose jobs are offloaded and for the user whose
workstation is borrowed, by evicting foreign processes when the workstations user
returns.
The third problem with rsh is performance. rsh uses standard network protocols
with no particular kernel support; the overhead of establishing connections, checking
access permissions, and establishing an execution environment may result in delays
of several seconds. This makes rsh impractical for short-lived jobs and limits the
speed-ups that can be obtained using it.
The final problem with rsh is that it requires the user to pick a suitable destination
machine for offloading. In order to make offloading as convenient as possible for
users, we decided to provide an automatic mechanism to keep track of idle machines
and select destinations for migration.
Of course, it is unfair to make comparisons with rsh, since some of its disadvantages
could be eliminated without resorting to full-fledged process migration. For example,
Nichols' butler layers an automatic selection mechanism on top of an rsh-like remote
execution facility. Several remote execution mechanisms, including butler, preserve
the current directory and environment variables. Some UNIX systems even provide
a checkpoint/restart facility that permits a process to be terminated and later
recreated as a different process with the same address space and open files (Reference 14). A
combination of these approaches, providing remote invocation and checkpointing but
not process migration, would offer significant functionality without the complexity of
a full-fledged process migration facility.
The justification for process migration, above and beyond remote invocation, is
twofold. First, process migration provides additional flexibility that a system with
only remote invocation lacks. Checkpointing and restarting a long-running process
is not always possible, especially if the process interacts with other processes;
ultimately, the user would have to decide whether a process can be checkpointed or
not. With transparent process migration, the system need not restrict which processes
make use of load-sharing. Secondly, migration is only moderately more complicated
than transparent remote invocation. Much of the complexity in remote execution
arises even if processes can only move in conjunction with program invocation. In
particular, if remote execution is transparent it turns shared state into distributed
shared state, which is much more difficult to manage. The access position of a file
is one example of this effect, as described below in the section on transferring
open files. Many of the other issues about maintaining transparency during remote
execution would also remain. Permitting a process to migrate at other times during
its lifetime requires the system to transfer additional state, such as the process's
address space, but is not significantly more complicated.
Thus we decided to take an extreme approach and implement a migration mechanism that allows processes to be moved at any time, to make that mechanism as
transparent as possible, and to automate the selection of idle machines. We felt that
this combination of features would encourage the use of migration. We also recognized that our mechanism would probably be much more complex than rsh. As a
result, one of our key criteria in choosing among implementation alternatives was
simplicity.
for output requests to be passed back from the process to the device, and for input
data to be forwarded from the device's machine to the process. In the case of
message channels, arranging for forwarding might consist of changing sender and
receiver addresses so that messages to and from the channel can find their way from
and to the process. Ideally, forwarding should be implemented transparently, so that
it is not obvious outside the operating system whether the state was transferred or
forwarding was arranged.
The third option, sacrificing transparency, is a last resort: if neither state transfer
nor forwarding is feasible, then one can ignore the state on the source machine and
simply use the corresponding state on the target machine. The only situation in
Sprite where neither state transfer nor forwarding seemed reasonable is for memory-mapped I/O devices such as frame buffers, as alluded to above. In our current
implementation, we disallow migration for processes using these devices.
In a few rare cases, lack of transparency may be desirable. For example, a process
that requests the amount of physical memory available should obtain information
about its current host rather than its home machine. For Sprite, a few special-purpose kernel calls, such as to read instrumentation counters in the kernel, are also
intentionally non-transparent with respect to migration. In general, though, it would
be unfortunate if a process behaved differently after migration than before.
On the surface, it might appear that message-based systems such as Accent,
Charlotte or V (References 17, 9 and 16) simplify many of the state-management problems. In these systems
all of a process's interactions with the rest of the world occur in a uniform fashion
through message channels. Once the basic execution state of a process has been
migrated, it would seem that all of the remaining issues could be solved simply by
forwarding messages on the process's message channels. The message forwarding
could be done in a uniform fashion, independent of the servers being communicated
with or their state about the migrated process.
In contrast, state management might seem more difficult in a system like Sprite
that is based on kernel calls. In such a system most of a process's services must be
provided by the kernel of the machine where the process executes. This requires
that the state for each service be transferred during migration. The state for each
service will be different, so this approach would seem to be much more complicated
than the uniform message-forwarding approach.
It turns out that neither of these initial impressions is correct. For example, it
would be possible to implement forwarding in a kernel-call-based system by leaving
all of the kernel state on the home machine and using remote procedure call to
forward home every kernel call (Reference 14). This would result in something very similar to
forwarding messages, and we initially used an approach like this in Sprite.
Unfortunately, an approach based entirely on forwarding kernel calls or forwarding
messages will not work in practice, for two reasons. The first problem is that some
services must necessarily be provided on the machine where a process is executing.
If a process invokes a kernel call to allocate virtual memory (or if it sends a message
to a memory server to allocate virtual memory), the request must be processed by
the kernel or server on the machine where the process executes, since only that
kernel or server has control over the machines page tables. Forwarding is not a
viable option for such machine-specific functions: state for these operations must be
migrated with processes. The second problem with forwarding is cost. It will often
be much more expensive to forward an operation to some other machine than to
Virtual memory transfer is the aspect of migration that has been discussed the
most in the literature, perhaps because it is generally believed to be the limiting
factor in the speed of migration (Reference 17). One simple method for transferring virtual memory
is to send the process's entire memory image to the target machine at migration
time, as in Charlotte and LOCUS (References 9 and 5). This approach is simple but it has two disadvantages. First, the transfer can take many seconds, during which time the process is
frozen: it cannot execute on either the source or destination machine. For some
processes, particularly those with real-time needs, long freeze times may be unacceptable. The second disadvantage of a monolithic virtual memory transfer is that it may
result in wasted work for portions of the virtual memory that are not used by the
process after it migrates. The extra work is particularly unfortunate (and costly) if
it requires old pages to be read from secondary storage. For these reasons, several
other approaches have been used to reduce the overhead of virtual memory transfer;
the mechanisms are diagramed in Figure 1 and described in the paragraphs below.
In the V system, long freeze times could have resulted in time-outs for processes
trying to communicate with a migrating process. To address this problem, Theimer
used a method called pre-copying (References 3 and 8). Rather than freezing a process at the beginning
of migration, V allows the process to continue executing while its address space is
transferred. In the original implementation of migration in V, the entire memory
of the process was transferred directly to the target; Theimer also proposed an
implementation that would use virtual memory to write modified pages to a shared
backing storage server on the network. In either case, some pages could be modified
on the source machine after they have been copied elsewhere, so V then freezes the
process and copies the pages that have been modified. Theimer showed that pre-copying reduces freeze times substantially. However, it has the disadvantage of
copying some pages twice, which increases the total amount of work to migrate a
process. Pre-copying seems most useful in an environment like V where processes
have real-time response requirements.
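The structure of pre-copying can be sketched in a few lines of C. The program below is a minimal, self-contained simulation, not Theimer's implementation: the page array, the dirty bits, and the send_page and let_process_run routines are invented stand-ins, intended only to show the copy-in-rounds-then-freeze structure.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NPAGES 64

    /* Hypothetical page table for the migrating process: a page is "dirty"
     * if it still needs to be (re)sent to the target machine. */
    static bool dirty[NPAGES];

    /* Stand-in for transmitting one page to the target host. */
    static void send_page(int p) { dirty[p] = false; }

    /* Stand-in for the process running and dirtying a few pages. */
    static void let_process_run(void) {
        for (int i = 0; i < 5; i++)
            dirty[rand() % NPAGES] = true;
    }

    int main(void) {
        for (int p = 0; p < NPAGES; p++) dirty[p] = true;   /* nothing sent yet */

        /* Pre-copy rounds: copy dirty pages while the process keeps executing,
         * until the set of re-dirtied pages stops shrinking. */
        int remaining = NPAGES;
        for (int round = 0; round < 10; round++) {
            int copied = 0;
            for (int p = 0; p < NPAGES; p++)
                if (dirty[p]) { send_page(p); copied++; }
            let_process_run();               /* process may dirty pages again */
            int redirtied = 0;
            for (int p = 0; p < NPAGES; p++) redirtied += dirty[p];
            printf("round %d: copied %d, re-dirtied %d\n", round, copied, redirtied);
            if (redirtied >= remaining) break;   /* no longer converging */
            remaining = redirtied;
        }

        /* Finally freeze the process and copy whatever is still dirty. */
        int final = 0;
        for (int p = 0; p < NPAGES; p++)
            if (dirty[p]) { send_page(p); final++; }
        printf("froze process; final copy of %d pages\n", final);
        return 0;
    }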
Figure 1. Different techniques for transferring virtual memory. (a) shows the scheme used in LOCUS
and Charlotte, where the entire address space is copied at the time a process migrates. (b) shows the pre-copying scheme used in V, where the virtual memory is transferred during migration but the process
continues to execute during most of the transfer. (c) shows Accent's lazy-copying approach, where pages
are retrieved from the source machine as they are referenced on the target. Residual dependencies in
Accent can last for the life of the migrated process. (d) shows Sprite's approach, where dirty pages are
flushed to a file server during migration and the target retrieves pages from the file server as they are
referenced. In the case of eviction, there are no residual dependencies on the source after migration.
When a process migrates away from its home machine, it has residual dependencies on its home
throughout its lifetime

The Accent system uses a lazy copying approach to reduce the cost of process
migration (References 4 and 17). When a process migrates in Accent, its virtual memory pages are left
on the source machine until they are actually referenced on the target machine.
Pages are copied to the target when they are referenced for the first time. This
approach allows a process to begin execution on the target with minimal freeze time
but introduces many short delays later as pages are retrieved from the source
machine. Overall, lazy copying reduces the cost of migration because pages that are
not used are never copied at all. Zayas found that for typical programs only onequarter to one-half of a processs allocated memory needed to be transferred. One
disadvantage of lazy copying is that it leaves residual dependencies on the source
machine: the source must store the unreferenced pages and provide them on demand
to the target. In the worst case, a process that migrates several times could leave
virtual memory dependencies on any or all of the hosts on which it ever executed.
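For contrast with the pre-copying sketch above, the core of a lazy-copying scheme is a page-fault path that fetches each page from the source only when it is first touched on the target. The fragment below is again a self-contained toy (the residency bitmap and fetch_from_source routine are invented), not Accent's actual code.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NPAGES   16
    #define PAGESIZE 4096

    static bool resident[NPAGES];            /* pages already copied to the target */
    static char memory[NPAGES][PAGESIZE];    /* the target's copy of the address space */

    /* Stand-in for an RPC to the source machine asking for one page. */
    static void fetch_from_source(int p) {
        memset(memory[p], 0xAB, PAGESIZE);   /* pretend this came over the network */
        resident[p] = true;
    }

    /* Every access goes through this "fault handler": pages that are never
     * referenced on the target are never transferred at all. */
    static char *touch(int p) {
        if (!resident[p])
            fetch_from_source(p);
        return memory[p];
    }

    int main(void) {
        touch(0); touch(3); touch(3);        /* process references only a few pages */
        int copied = 0;
        for (int p = 0; p < NPAGES; p++) copied += resident[p];
        printf("%d of %d pages were ever copied\n", copied, NPAGES);
        return 0;
    }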
Sprite's migration facility uses a different form of lazy copying that takes advantage
of our existing network services while providing some of the advantages of lazy
file is also in use by some other process on the source machine; if the only use is
by the migrating process, then the file will be cacheable on the target machine. In
the current implementation, once caching is disabled for a file, it remains disabled
until no process has the file open (even if all processes accessing the file migrate to
the same machine); however, in practice, caching is disabled infrequently enough
that an optimization to re-enable caching of uncacheable files has not been a high
priority.
When an open file is transferred during migration, the file cache on the source
machine may contain modified blocks for the file. These blocks are flushed to the
files server machine during migration, so that after migration the target machine
can retrieve the blocks from the file server without involving the source. This
approach is similar to the mechanism for virtual memory transfer, and thus has the
same advantages and disadvantages. It is also similar to what happens in Sprite for
shared file access without migration: if a file is opened, modified, and closed on one
machine, then opened on another machine, the modified blocks are flushed from
the first machine's cache to the server at the time of the second open.
The third component of the state of an open file is an access position, which
indicates where in the file the next read or write operation will occur. Unfortunately
the access position for a file may be shared between two or more processes. This
happens, for example, when a process opens a file and then forks a child process:
the child inherits both the open file and the access position. Under normal circumstances all of the processes sharing a single access position will reside on the same
machine, but migration can move one of the processes without the others, so that
the access position becomes shared between machines. After several false starts we
eventually dealt with this problem in a fashion similar to caching: if an access position
becomes shared between machines, then neither machine stores the access position
(nor do they cache the file); instead, the file's server maintains the access position
and all operations on the file are forwarded to the server.
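The sharing that makes the access position troublesome is easy to demonstrate with ordinary UNIX calls: after a fork, parent and child share one file offset, so a read by either advances the position seen by both. The fragment below is plain POSIX code rather than Sprite code, and simply illustrates why this position becomes distributed state if one of the two processes migrates.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/etc/hosts", O_RDONLY);   /* any readable file will do */
        if (fd < 0) return 1;

        char buf[16];
        if (fork() == 0) {                 /* child: inherits fd and its offset */
            ssize_t n = read(fd, buf, sizeof buf);   /* advances the shared position */
            (void) n;
            _exit(0);
        }
        wait(NULL);

        /* The parent's next read starts where the child left off, because the
         * two processes share a single access position.  If the child had been
         * migrated, that position would be shared between two machines. */
        printf("parent now at offset %ld\n", (long) lseek(fd, 0, SEEK_CUR));
        return 0;
    }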
Another possible approach to shared file offsets is the one used in LOCUS (Reference 5). If
process migration causes a file access position to be shared between machines,
LOCUS lets the sharing machines take turns managing the access position. In order
to perform I/O on a file with a shared access position, a machine must acquire the
access position token for the file. While a machine has the access position token it
caches the access position and no other machine may access the file. The token
rotates among machines as needed to give each machine access to the file in turn.
This approach is similar to the approach LOCUS uses for managing a shared file,
where clients take turns caching the file and pass read and write tokens around to
ensure cache consistency. We chose not to use the Locus approach because the
token-passing approach is more complex than the disable-caching approach, and
because the disable-caching approach meshed better with the existing Sprite file system.
Figure 2 shows the mechanism currently used by Sprite for migrating open files.
The key part of this mechanism occurs in a late phase of migration when the target
machine requests that the server update its internal tables to reflect that the file is
now in use on the target instead of the source. The server in turn calls the source
machine to retrieve information about the file, such as the file's access position and
whether the file is in use by other processes on the source machine. This two-level
remote procedure call synchronizes the three machines (source, target and server)
and provides a convenient point for updating state about the open file.
Figure 2. Transferring open files. (1) The source passes information about all open files to the target; (2)
for each file, the target notifies the server that the open file has been moved; (3) during this call the server
communicates again with the source to release its state associated with the file and to obtain the most
recent state associated with the file
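The three-party exchange of Figure 2 can be paraphrased as the following single-process sketch, in which ordinary function calls stand in for the RPCs; the structure fields and function names are invented for illustration and are not Sprite's kernel interfaces.

    #include <stdbool.h>
    #include <stdio.h>

    /* Per-file state, as each of the three parties might see it. */
    struct file_state {
        long access_position;    /* current read/write offset            */
        bool in_use_elsewhere;   /* still open by other source processes */
    };

    /* (3) Server -> source: release the source's state for the file and
     *     return the most recent access position. */
    static struct file_state source_release(void) {
        struct file_state s = { .access_position = 8192, .in_use_elsewhere = false };
        printf("source: flushed dirty blocks, released file state\n");
        return s;
    }

    /* (2) Target -> server: the open file has moved; the server in turn
     *     calls back to the source before updating its tables. */
    static void server_move_client(void) {
        struct file_state s = source_release();
        printf("server: file now open on target, offset %ld, cacheable=%s\n",
               s.access_position, s.in_use_elsewhere ? "no" : "yes");
    }

    int main(void) {
        /* (1) Source -> target: information about each open file. */
        printf("target: received open-file descriptor from source\n");
        server_move_client();
        return 0;
    }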
Aside from virtual memory and open files, the main remaining issue is how to
deal with the process control block (PCB) for the migrating process: should it be
left on the source machine or transferred with the migrating process? For Sprite we
use a combination of both approaches. The home machine for a process (the one
where it would execute if there were no migration) must assist in some operations
on the process, so it always maintains a PCB for the process. The details of this
interaction are described in the next section. In addition, the current machine for a
process also has a PCB for it. If a process is migrated, then most of the information
about the process is kept in the PCB on its current machine; the PCB on the home
machine serves primarily to locate the process and most of its fields are unused.
The other elements of process state are much easier to transfer than virtual memory
and open files, since they are neither as bulky as virtual memory nor distributed in
the way open files are.
At present the other state consists almost entirely of fields from the process control
block. In general, all that needs to be done is to transfer these fields to the target
machine and reinstate them in the process control block on the target.
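One way to picture the split is a pair of process control blocks whose fields are populated differently on the home and current machines. The structure below is purely illustrative; the field names are ours, not Sprite's.

    #include <stdio.h>

    /* Illustrative process control block.  After migration the current
     * machine's copy holds essentially everything; the home machine's copy
     * is reduced to what is needed to find and operate on the process. */
    struct pcb {
        int  pid;            /* network-wide process identifier            */
        int  home_host;      /* where the process "lives" for transparency */
        int  current_host;   /* where it is actually executing             */
        /* Fields below are meaningful only in the current machine's PCB:  */
        long user_time;      /* resource usage statistics                  */
        int  open_files;     /* count of transferred open-file entries     */
    };

    int main(void) {
        struct pcb home    = { .pid = 42, .home_host = 1, .current_host = 7 };
        struct pcb current = { .pid = 42, .home_host = 1, .current_host = 7,
                               .user_time = 1200, .open_files = 3 };
        printf("home copy: locate pid %d on host %d\n", home.pid, home.current_host);
        printf("current copy: full state, %d open files\n", current.open_files);
        return 0;
    }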
SUPPORTING TRANSPARENCY: HOME MACHINES
As was mentioned previously, transparency was one of our most important goals in
implementing migration. By transparency we mean two things in particular. First,
a process's behaviour should not be affected by migration. Its execution environment
should appear the same, it should have the same access to system resources such as
files and devices, and it should produce exactly the same results as if it had not
migrated. Secondly, a process's appearance to the rest of the world should not be
affected by migration. To the rest of the world the process should appear as if it
never left its original machine, and any operation that is possible on an unmigrated
process (such as stopping or signaling) should be possible on a migrated process.
Sprite provides both of these forms of transparency; we know of no other implementation of process migration that provides transparency to the same degree.
In Sprite the two aspects of transparency are defined with respect to a process's
home machine, which is the machine where it would execute if there were no
migration at all. Even after migration, everything should appear as if the process
were still executing on its home machine. In order to achieve transparency, Sprite
uses four different techniques, which are described in the paragraphs below.
The most desirable approach is to make kernel calls location-independent; Sprite
has been gradually evolving in this direction. For example, in the early versions of
the system we permitted different machines to have different views of the file system
name space. This required open and several other kernel calls to be forwarded home
after migration, imposing about a 20 per cent penalty on the performance of remote
compilations. In order to simplify migration (and for several other good reasons
also), we changed the file system so that every machine in the network sees the
same name space. This made the open kernel call location-independent, so no extra
effort was necessary to make open work transparently for remote processes.
Our second technique was to transfer state from the source machine to the target
at migration time as described above, so that normal kernel calls may be used after
migration. We used the state-transfer approach for virtual memory, open files,
process and user identifiers, resource usage statistics, and a variety of other things.
Our third technique was to forward kernel calls home. This technique was originally used for a large number of kernel calls, but we have gradually replaced most
uses of forwarding with transparency or state transfer. At present there are only a
few kernel calls that cannot be implemented transparently and for which we cannot
easily transfer state. For example, clocks are not synchronized between Sprite
machines, so for remote processes Sprite forwards the gettimeofday kernel call back
to the home machine. This guarantees that time advances monotonically even for
remote processes, but incurs a performance penalty for processes that read the time
frequently. Another example is the getpgrp kernel call, which obtains state about
the process group of a process. The home machine maintains the state that groups
collections of processes together, since they may physically execute on different
machines.
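A minimal sketch of this selective forwarding is shown below: a call such as gettimeofday is executed locally for a local process but turned into an RPC to the home machine for a foreign one. The dispatch routine and the rpc_to_home stub are hypothetical, meant only to convey the structure, not the Sprite kernel's actual code.

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    struct proc { bool foreign; int home_host; };

    /* Stand-in for a kernel-to-kernel RPC that asks the home machine to
     * perform the call and return its result. */
    static long rpc_to_home(int host, const char *call) {
        printf("forwarding %s to home host %d\n", call, host);
        return 0;   /* the home machine's answer would be returned here */
    }

    /* Location-dependent call: clocks are not synchronized between hosts,
     * so a foreign process gets its time from its home machine to keep
     * time advancing monotonically across a migration. */
    static long sys_gettime(struct proc *p) {
        if (p->foreign)
            return rpc_to_home(p->home_host, "gettimeofday");
        return (long) time(NULL);           /* local case: answer directly */
    }

    int main(void) {
        struct proc local    = { .foreign = false };
        struct proc migrated = { .foreign = true, .home_host = 1 };
        printf("local: %ld\n", sys_gettime(&local));
        printf("migrated: %ld\n", sys_gettime(&migrated));
        return 0;
    }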
Forwarding also occurs from the home machine to a remote process's current
machine. For example, when a process is signalled (e.g. when some other process
specifies its identifier in the kill kernel call), the signal operation is sent initially to
the process's home machine. If the process is not executing on the home machine,
then the home machine forwards the operation on to the process's current machine.
The performance of such operations could be improved by retaining a cache on each
machine of recently-used process identifiers and their last known execution sites.
This approach is used in LOCUS and V and allows many operations to be sent
directly to a remote process without passing through another host. An incorrect
execution site is detected the next time it is used and correct information is found
by sending a message to the host on which the process was created (LOCUS) or by
multi-casting (V).
The fourth approach is really just a set of ad hoc techniques for a few kernel
calls that must update state on both a process's current execution site and its home
machine. One example of such a kernel call is fork, which creates a new process.
Process identifiers in Sprite consist of a home machine identifier and an index of a
additional complexity, and we were not convinced that the benefits would justify
the implementation difficulties. For example, most processes in a UNIX-like environment are so short-lived that migration will not produce a noticeable benefit and may
even slow things down. Eager et al. provide additional evidence that migration is
only useful under particular conditions (Reference 19). Thus, for Sprite we decided to make
migration a special case rather than the normal case.
The Sprite kernels provide no particular support for any of the migration policy
decisions, but user-level applications provide assistance in four forms: idle-host
selection, the pmake program, a mig shell command, and eviction. These are discussed
in the following subsections.
Selecting idle hosts
Each Sprite machine runs a background process called the load-average daemon,
which monitors the usage of that machine. When the workstation appears to be idle,
the load-average daemon notifies the central migration server that the machine is
ready to accept migrated processes. Programs that invoke migration, such as pmake
and mig described below, call a standard library procedure Mig_RequestIdleHosts to
obtain the identifiers for one or more idle hosts, which they then pass to the kernel
when they invoke migration. Normally only one process may be assigned to any host
at any one time, in order to avoid contention for processor time; however, processes
that request idle hosts can indicate that they will be executing long-running processes
and the central server will permit shorter tasks to execute on those hosts as well.
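From the application's point of view the interface is small; the fragment below suggests how a program might use it. Only the procedure name Mig_RequestIdleHosts comes from the text above; the argument lists, the RemoteExec and Mig_ReturnHosts stubs, and their behaviour are invented here so that the sketch is self-contained, and they are not the real Sprite library interfaces.

    #include <stdio.h>

    /* Hypothetical stubs standing in for the Sprite library and kernel. */
    static int Mig_RequestIdleHosts(int wanted, int *hostIDs) {
        int granted = wanted < 3 ? wanted : 3;        /* pretend 3 hosts are idle */
        for (int i = 0; i < granted; i++) hostIDs[i] = 100 + i;
        return granted;
    }
    static void RemoteExec(int hostID, const char *command) {
        printf("migrating '%s' to host %d at exec time\n", command, hostID);
    }
    static void Mig_ReturnHosts(int n, int *hostIDs) { (void) n; (void) hostIDs; }

    int main(void) {
        int hosts[12];
        /* Ask the central server for up to 12 idle hosts ...               */
        int granted = Mig_RequestIdleHosts(12, hosts);
        /* ... and scale the parallelism to what was actually granted.      */
        for (int i = 0; i < granted; i++)
            RemoteExec(hosts[i], "cc -c file.c");
        Mig_ReturnHosts(granted, hosts);
        return 0;
    }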
Maintaining the database of idle hosts can be a challenging problem in a distributed
system, particularly if the system is very large in size or if there are no shared
facilities available for storing load information. A number of distributed algorithms
have been proposed to solve this problem, such as disseminating load information
among hosts periodically (Reference 6), querying other hosts at random to find an idle one (Reference 20), or
multicasting and accepting a response from any host that indicates availability (Reference 8).
In Sprite we have used centralized approaches for storing the idle-host database.
Centralized techniques are generally simpler, they permit better decisions by keeping
all the information up-to-date in a single place, and they can scale to systems with
hundreds of workstations without contention problems for the centralized database.
We initially stored the database in a single file in the file system. The load-average
daemons set flags in the file when their hosts became idle, and the Mig_RequestIdleHosts library procedure selected idle hosts at random from the file, marking the
selected hosts so that no one else would select them. Standard file-locking primitives
were used to synchronize access to the file.
We later switched to a server-based approach, where a single server process
keeps the database in its virtual memory. The load-average daemons and the Mig_
RequestIdleHosts procedure communicate with the server using a message protocol.
The server approach has a number of advantages over the file-based approach. It is
more efficient, because only a single remote operation is required to select an idle
machine; the file-based approach required several remote operations to open the
file, lock it, read it, etc. The server approach makes it easy to retain state from
request to request; we use this, for example, to provide fair allocation of idle hosts
when there are more would-be users than idle machines. Although some of these
features could have been implemented with a shared file, they would incur a high
overhead from repeated communication with a file server. Lastly, the server approach
provides better protection of the database information (in the shared-file approach
the file had to be readable and writable by all users).
We initially chose a conservative set of criteria for determining whether a machine
is idle. The load-average daemon originally considered a host to be idle only if (a)
it had had no keyboard or mouse input for at least five minutes, and (b) there were
fewer runnable processes than processors, on average. In choosing these criteria we
wanted to be certain not to inconvenience active users or delay background processes
they might have left running. We assumed that there would usually be plenty of idle
machines to go around, so we were less concerned about using them efficiently.
After experience with the five-minute threshold, we reduced the threshold for input
to 30 seconds; this increased the pool of available machines without any noticeable
impact on the owners of those machines.
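The daemon's test reduces to two comparisons, as the fragment below shows; the probe functions are hypothetical stand-ins for the kernel's instrumentation, and the values they return here are fabricated for the example.

    #include <stdbool.h>
    #include <stdio.h>

    #define INPUT_IDLE_THRESHOLD 30     /* seconds without keyboard or mouse input */

    /* Hypothetical probes of the kernel's instrumentation counters. */
    static int    seconds_since_last_input(void)   { return 45; }
    static double average_runnable_processes(void) { return 0.3; }
    static int    processor_count(void)            { return 1; }

    /* A host is considered idle only if its user has been away long enough
     * AND there are, on average, fewer runnable processes than processors. */
    static bool host_is_idle(void) {
        return seconds_since_last_input() >= INPUT_IDLE_THRESHOLD
            && average_runnable_processes() < processor_count();
    }

    int main(void) {
        if (host_is_idle())
            printf("notify central server: host available for migration\n");
        else
            printf("host in use: do not accept foreign processes\n");
        return 0;
    }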
pmake and Mig
Sprite provides two convenient ways to use migration. The most common use of
process migration is by the pmake program. pmake is similar in function to the make
UNIX utility (Reference 7) and is used, for example, to detect when source files have changed
and recompile the corresponding object files. Make performs its compilations and
other actions serially; in contrast, pmake uses process migration to invoke as many
commands in parallel as there are idle hosts available. This use of process migration
is completely transparent to users and results in substantial speed-ups in many
situations, as shown below. Other systems besides Sprite have also benefitted from
parallel make facilities; see References 21 and 2 for examples.
The approach used by pmake has at least one advantage over a fully-automatic
processor pool approach where all the migration decisions are made centrally.
Because pmake makes the choice of processes to offload, and knows how many hosts
are available, it can scale its parallelism to match the number of idle hosts. If the
offloading choice were made by some other agent, pmake might overload the system
by creating more processes than could be accommodated efficiently. pmake also
provides a degree of flexibility by permitting the user to specify that certain tasks
should not be offloaded if they are poorly suited for remote execution.
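The essential control structure pmake relies on can be illustrated with ordinary fork and wait calls: a dispatch loop that never has more outstanding tasks than granted hosts. The sketch below is generic POSIX code, not pmake's real scheduler, and the "compilation" is only a printf.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        const char *tasks[] = { "a.c", "b.c", "c.c", "d.c", "e.c" };
        int ntasks = 5;
        int nhosts = 2;          /* number of idle hosts actually obtained */
        int running = 0;

        for (int t = 0; t < ntasks; t++) {
            if (running == nhosts) {     /* never exceed the hosts available */
                wait(NULL);
                running--;
            }
            if (fork() == 0) {
                /* In Sprite this child would be migrated to an idle host at
                 * exec time; here we just pretend to compile the file. */
                printf("compiling %s remotely\n", tasks[t]);
                _exit(0);
            }
            running++;
        }
        while (running-- > 0) wait(NULL);   /* final synchronization (the link step) */
        return 0;
    }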
The second easy way to use migration is with a program called mig, which takes
as argument a shell command. Mig will select an idle machine using the mechanism
described above and use process migration to execute the given command on that
machine. Mig may also be used to migrate an existing process.
Eviction
The final form of system support for migration is eviction. The load-average
daemons detect when a user returns. On the first keystroke or mouse-motion invoked
by the user, the load-average daemon will check for foreign processes and evict
them. When an eviction occurs, foreign processes are migrated back to their home
machines, and the process that obtained the host is notified that the host has been
reclaimed. That process is free to remigrate the evicted processes or to suspend
them if there is no new host available. To date, pmake is the only application
that automatically remigrates processes, but other applications (such as mig) could
remigrate processes as well.
Evictions also occur when a host is reclaimed from one process in order to allocate
it to another. If the centralized server receives a request for an idle host when no
idle hosts are available, and one process has been allocated more than its fair share
of hosts, the server reclaims one of the hosts being used by that process. It grants
that host to the process that had received less than its fair share. The process that
lost the host must reduce its parallelism until it can obtain additional hosts again.
A possible optimization for evictions would be to permit an evicted process to
migrate directly to a new idle host rather than to its home machine. In practice, half
of the evictions that occur in the system take place due to fairness considerations (Reference 22)
rather than because a user has returned to an idle workstation. Permitting direct
migration between two remote hosts would benefit the other half of the evictions
that occur, but would complicate the implementation: it would require a three-way
communication between the two remote hosts and the home machine, which always
knows where its processes execute. Thus far, this optimization has not seemed to
be warranted.
PERFORMANCE AND USAGE PATTERNS
We evaluated process migration in Sprite by taking three sets of measurements. The
next subsections discuss particular operations in isolation, such as the time to migrate
a trivial process or invoke a remote command; the performance improvement of
pmake using parallel remote execution; and empirical measurements of Sprite's
process migration facility over a period of several weeks, including the extent to
which migration is used, the cost and frequency of eviction, and the availability of
idle hosts.
Migration overhead
Table I. Costs associated with process migration. All measurements were performed on SPARCstation 1
workstations. Host selection may be amortized across several migrations if applications such as pmake
reuse idle hosts. The time to migrate a process depends on how many open files the process has and
how many modified blocks for those files are cached locally (these must be flushed to the server). If
the migration is not done at exec time, modified virtual memory pages must be flushed as well. If done
at exec time, the process's arguments and environment variables are transferred. The execs were
performed with no open files. The bandwidth of the RPC system is 480 Kbytes/s using a single channel,
and 660 Kbytes/s using multiple RPC connections in parallel for the virtual memory system

    Action                                                            Time/Rate
    Select an idle host                                               36 ms
    Migrate a trivial process                                         76 ms
    Transfer state for each open file                                 9.4 ms/file
    Flush modified file blocks                                        480 Kbytes/s
    Flush modified virtual memory pages                               660 Kbytes/s
    Transfer exec arguments and environment variables                 480 Kbytes/s
    Fork, exec null process with migration, wait for child to exit    81 ms
    Fork, exec null process locally, wait for child to exit           46 ms
computation and file I/O. Figure 3 shows the total execution time to run several
programs, listed in Table II, both entirely locally and entirely on a single remote
host. Applications that communicate frequently with the home machine suffered
considerable degradation. Two of the benchmarks, fork and gettime, are contrived
examples of the type of degradation a process might experience if it performed
many location-dependent system calls without much user-level computation. The rcp
benchmark is a more realistic example of the penalties processes can encounter: it
copies data using TCP, and TCP operations are sent to a user-level TCP server on
the home machine. Forwarding these TCP operations causes rcp to perform about
40 per cent more slowly when run remotely than locally. As may be seen in
Figure 3, however, applications such as compilations and text formatting show little
degradation due to remote execution.
Application performance
Figure 3. Comparison between local and remote execution of programs. The elapsed time to execute
CPU-intensive and file-intensive applications such as pmake and LATEX showed negligible effects from
remote execution (3 and 1 per cent degradation, respectively). Other applications suffered performance
penalties ranging from 42 per cent (rcp), to 73 per cent (fork), to 3200 per cent (gettime)
Table II. Workload for comparisons between local and remote execution

    Name      Description
    pmake     recompile pmake source sequentially using pmake
    LATEX     run LATEX on a draft of this article
    rcp       copy a 1 Mbyte file to another host using TCP
    fork      fork and wait for child, 1000 times
    gettime   get the time of day 10,000 times
The compile and link curve in Figure 4(b) shows a speed-up factor of 5 using 12
hosts. Clearly, there is a significant difference between the speed-ups obtained for
the normalized compile benchmark and the compile and link benchmark. The
difference is partly attributable to the sequential parts of running pmake: determining
file dependencies and linking object files all must be done on a single host. More
importantly, file caching affects speed-up substantially. As described above, when a
host opens a file for which another host is caching modified blocks, the host with
the modified blocks transfers them to the server that stores the file. Thus, if pmake
uses many hosts to compile different files in parallel, and then a single host links
the resulting object files together, that host must wait for each of the other hosts to
flush the object files they created. It must then obtain the object files from the
server. In this case, linking the files together when they have all been created on a
single host takes only 56 s, but the link step takes 65-69 s when multiple hosts are
used for the compilations.
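A rough back-of-the-envelope accounting, using only the figures quoted above, suggests how much of the shortfall the serial steps alone explain. Treating the 19 s of dependency analysis and roughly 67 s of linking as purely sequential, an ideal 12-host run of the 1971 s kernel build would take about

    19 s + 67 s + (1971 s - 86 s)/12 = 243 s,

a speed-up of about 8. The measured 12-host time of 453 s is nearly twice that estimate, with the remainder attributable to the file-server and client bottlenecks discussed below rather than to the serial steps themselves.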
Figure 4. Performance of recompiling the Sprite kernel using a varying number of hosts and the pmake
program. Graph (a) shows the time to compile all the input files and then link the resulting object files
into a single file. In addition, it shows a normalized curve that shows the time taken for the compilation
only, deducting as well the pmake start-up overhead of 19 s to determine dependencies; this curve
represents the parallelizable portion of the pmake benchmark. Graph (b) shows the speed-up obtained
for each point in (a), which is the ratio between the time taken on a single host and the time using
multiple hosts in parallel
In practice, we do not even obtain the fivefold speed-up indicated by this benchmark, because we compile and link each kernel module separately and link the
modules together afterwards. Each link step is an additional synchronization point
that may be performed by only one host at a time. In our development environment,
we typically see three to four times speed-up when rebuilding a kernel from scratch.
Table III presents some examples of typical pmake speed-ups. These times are
representative of the performance improvements seen in day-to-day use. Figure 5
shows the corresponding speed-up curves for each set of compilations when the
number of hosts used varies from 1 to 12. In each case, the marginal improvement
of additional hosts decreases as more hosts are added.
The speed-up curves in Figure 4(b) and Figure 5 show that the marginal improvement from using additional hosts is significantly less than the processing power of
the hosts would suggest. The poor improvement is due to bottlenecks on both the
file server and the workstation running pmake. Figure 6 shows the utilization of the
processors on the file server and client workstation over 5 s intervals during the 12-way kernel pmake. It shows that the pmake process uses nearly 100 per cent of a
SPARCstation processor while it determines dependencies and starts to migrate
processes to perform compilations. Then the Sun-4/280 file server's processor
becomes a bottleneck as the 12 hosts performing compilations open files and write
back cached object files. The network utilization, also shown in Figure 6, averaged
around 20 per cent and is thus not yet a problem. However, as the server and client
processors get faster, the network may easily become the next bottleneck.
Though migration has been used in Sprite to perform compilations for nearly two
years, it has only recently been used for more wide-ranging applications. Excluding
Figure 5. Speed-up of compilations using a variable number of hosts. This graph shows the speed-up
relative to running pmake on one host (i.e. without migration). The speed-up obtained depends on the
extent that hosts can be kept busy, the amount of parallelization available to pmake and system bottlenecks
Table III. Examples of pmake performance. Sequential execution is done on a single host; parallel
execution uses migration to execute up to 12 tasks in parallel. Each measurement gives the time to
compile the indicated number of files and link the resulting object files together in one or more steps.
When multiple steps are required, their sequentiality reduces the speed-up that may be obtained;
pmake, for example, is organized into two directories that are compiled and linked separately, and then
the two linked object files are linked together

    Program   Number of files   Number of links   Sequential time (s)   Parallel time (s)   Speed-up
    gremlin   24                1                 180                   41                  4.43
    TEX       36                1                 259                   48                  5.42
    pmake     49                3                 162                   55                  2.95
    kernel    276               1                 1971                  453                 4.35
compilations, simulations are the primary application for Sprites process migration
facility. It is now common for users to use pmake to run up to one hundred
simulations, letting pmake control the parallelism. The length and parallelism of
simulations result in more frequent evictions than occur with most compilations,
and pmake automatically remigrates or suspends processes subsequent to eviction.
In addition to having a longer average execution time, simulations also sometimes
differ from compilations in their use of the file system. Whereas some simulators
are quite I/O intensive, others are completely limited by processor time. Because
they perform minimal interaction with file servers and use little network bandwidth,
they can scale better than parallel compilations do. One set of simulations obtained
over 800 per cent effective processor utilization (eight minutes of processing time
per minute of elapsed time) over the course of an hour, using all idle hosts on the
system (up to 10-15 hosts of the same architecture).
Figure 6. Processor and network utilization during the 12-way pmake. Both the file server and the client
workstation running pmake were saturated
Usage patterns
[Table IV: only the 'Fraction remote' column is recoverable; its values are 72.77, 8.38, 87.58, 1.56, 75.42, 0.27, 35.33, 29.22, 31.86, 10.70, 1.41 and 31.08 per cent.]
Secondly, we recorded each time a host changed from idle to active, indicating
that foreign processes would be evicted if they exist, and we counted the number of
times evictions actually occurred. To date, evictions have been extremely rare. On
the average, each host changed to the active state only once every 26 min, and very
few of these transitions actually resulted in processes being evicted (0.12 processes
per hour per host in a collection of more than 25 hosts). The infrequency of evictions
has been due primarily to the policy used for allocating hosts: hosts are assigned in
decreasing order of idle time, so that the hosts that have been idle the longest are
used most often for migration. The average time that hosts had been idle prior to
being allocated for remote execution was 17 h, but the average idle time of those
hosts that later evicted processes was only 4 min. (One may therefore assume that
if hosts were allocated randomly, rather than in order of idle time, evictions would
be considerably more frequent.) Finally, when evictions did occur, the time needed
to evict varied considerably, with a mean of 30 s and a standard deviation of 31 s
to migrate an average of 33 processes. An average of 37 4-Kbyte pages were written
per process that migrated, with a standard deviation of 65 from host to host.
Thirdly, over the course of over a year, we periodically recorded the state of every
host (active, idle or hosting foreign processes) in a log file. A surprisingly large
number (66-78 per cent) of hosts are available for migration at any time, even
during the day on weekdays. This is partly due to our environment, in which several
users own both a Sun and a DECstation and use only one or the other at a time.
Some workstations are available for public use and are not used on a regular basis.
However, after discounting for extra workstations, we still find a sizable fraction of
hosts available, concurring with Theimer, Nichols, and others. Table V summarizes
the availability of hosts in Sprite over this period.
To further study the availability of idle hosts, we recorded information about
requests for idle hosts over a 25-day period. During this period, over 17,000 processes
requested one or more idle hosts, and 86 per cent of those processes obtained as
many hosts as they requested. Only 2 per cent of processes were unable to obtain
Table V. Availability of hosts, as a percentage of all hosts

                 In use    Idle    In use for migration
    Weekdays     31        66      3
    Off-hours    20        78      2
    Total        23        75      2
Based on our experience, as well as that of others with V, Charlotte and Accent,
we have observed the following:
1. The overall improvement from using idle hosts can be substantial, depending
upon the degree of parallelism in an application.
2. Remote execution currently accounts for a sizeable fraction of all processing
on Sprite. Even so, idle hosts are plentiful. Our use of idle hosts is currently
limited more by a lack of applications (other than pmake) than by a lack of
hosts.
3. The cost of exec-time migration is high by comparison to the cost of local
process creation, but it is relatively small compared to times that are noticeable
by humans. Furthermore, the overhead of providing transparent remote
execution in Sprite is negligible for most classes of processes. The system may
therefore be liberal about placing processes on other hosts at exec time, as
long as the likelihood of eviction is relatively low.
4. The cost of transferring a process's address space and flushing modified file
blocks dominates the cost of migrating long-running processes, thereby limiting
the effectiveness of a dynamic pool of processors approach. Although there
are other environments in which such an approach could have many favorable
aspects, given our assumptions above about host availability and workstation
ownership, using process migration to balance the load among all Sprite hosts
would likely be both unnecessary and undesirable.
HISTORY AND EXPERIENCE
The greatest lesson we have learned from our experience with process migration is
the old adage 'use it or lose it'. Although an experimental version of migration was
Figure 7. Distribution of host requests and satisfaction rates. For a given number of hosts, shown on the
X-axis, the line labelled requesting shows the fraction of processes that requested at least that many
hosts. The line labelled satisfied shows, out of those processes that requested at least that number of
hosts, the fraction of processes that successfully obtained that many hosts. Thus, 98 per cent of all
processes were able to obtain at least one host, and over 80 per cent of processes that requested at least
ten hosts obtained 10 hosts. Only 24 per cent of processes requested more than one host
operational in 1986 (Reference 23), it took another two years to make migration a useful utility.
Part of the problem was that a few important mechanisms were not implemented
initially (e.g. there was no automatic host selection, migration was not integrated
with pmake, and process migration did not deal gracefully with machine crashes).
But the main problem was that migration continually broke due to other changes in
the Sprite kernel. Without regular use, problems with migration were not noticed
and tended to accumulate. As a result, migration was only used for occasional
experiments. Before each experiment a major effort was required to fix the accumulated problems, and migration quickly broke again after the experiment was finished.
By the autumn of 1988 we were beginning to suspect that migration was too fragile
to be maintainable. Before abandoning it we decided to make one last push to make
process migration completely usable, integrate it with the pmake program, and use
it for long enough to understand its benefits as well as its drawbacks. This was a
fortunate decision. Within one week after migration became available in pmake,
other members of the Sprite project were happily using it and achieving speed-up
factors of two to five in compilations. Because of its complex interactions with the
rest of the kernel, migration is still more fragile than we would like and it occasionally
breaks in response to other changes in the kernel. However, it is used so frequently
that problems are detected immediately and they can usually be fixed quickly. The
maintenance load is still higher for migration than for many other parts of the kernel,
but only slightly. Today we consider migration to be an indispensable part of the
Sprite system.
We are not the only ones to have had difficulties keeping process migration
running: for example, Theimer reported similar experiences with his implementation
ACKNOWLEDGEMENTS
In addition to acting as guinea pigs for the early unstable implementations of process
migration, other members of the Sprite project have made significant contributions
to Sprites process migration facility. Mike Nelson and Brent Welch implemented
most of the mechanism for migrating open files, and Adam de Boor wrote the pmake
program. Lisa Bahler, Thorsten von Eicken, John Hartman, Darrell Long, Mendel
Rosenblum, and Ken Shirriff provided comments on early drafts of this paper, which
improved the presentation substantially. We are also grateful to the anonymous
referees of Software-Practice and Experience, who provided valuable feedback and
suggestions. Of course, any errors in this article are our responsibility alone.
This work was supported in part by the U.S. Defense Advanced Research Projects
Agency under contract N00039-85-C-0269 and in part by the U.S. National Science
Foundation under grant ECS-8351961.
REFERENCES
1. D. Nichols, Using idle workstations in a shared computing environment, Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, ACM, Austin, TX, November 1987, pp. 5-12.
2. E. Roberts and J. Ellis, parmake and dp: experience with a distributed, parallel implementation of make, Proceedings from the Second Workshop on Large-Grained Parallelism, Software Engineering Institute, Carnegie-Mellon University, November 1987, Report CMU/SEI-87-SR-5.
3. M. Theimer, K. Lantz and D. Cheriton, Preemptable remote execution facilities for the V-system, Proceedings of the 10th Symposium on Operating System Principles, December 1985, pp. 2-12.
4. E. Zayas, Attacking the process migration bottleneck, Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, Austin, TX, November 1987, pp. 13-22.
5. G. J. Popek and B. J. Walker (eds), The LOCUS Distributed System Architecture, Computer Systems Series, The MIT Press, 1985.
6. A. Barak, A. Shiloh and R. Wheeler, Flood prevention in the MOSIX load-balancing scheme, IEEE Computer Society Technical Committee on Operating Systems Newsletter, 3, (1), 23-27 (1989).
7. S. I. Feldman, Make - a program for maintaining computer programs, Software-Practice and Experience, 9, (4), 255-265 (1979).
8. M. Theimer, Preemptable remote execution facilities for loosely-coupled distributed systems, Ph.D. Thesis, Stanford University, 1986.
9. Y. Artsy and R. Finkel, Designing a process migration facility: the Charlotte experience, IEEE Computer, 22, (9), 47-56 (1989).
10. J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson and B. B. Welch, The Sprite network operating system, IEEE Computer, 21, (2), 23-36 (1988).
11. M. Nelson, B. Welch and J. Ousterhout, Caching in the Sprite network file system, ACM Transactions on Computer Systems, 6, (1), 134-154 (1988).
12. A. D. Birrell and B. J. Nelson, Implementing remote procedure calls, ACM Transactions on Computer Systems, 2, (1), 39-59 (1984).
13. Computer Science Division, University of California, Berkeley, UNIX User's Reference Manual, 4.3 Berkeley Software Distribution, Virtual VAX-11 Version, April 1986.
14. M. Litzkow, Remote UNIX, Proceedings of the USENIX 1987 Summer Conference, June 1987.
15. M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian and M. Young, Mach: a new kernel foundation for UNIX development, Proceedings of the USENIX 1986 Summer Conference, July 1986.
16. D. R. Cheriton, The V distributed system, Communications of the ACM, 31, (3), 314-333 (1988).
17. E. Zayas, The use of copy-on-reference in a process migration system, Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, April 1987. Report No. CMU-CS-87-121.
18. K. Li and P. Hudak, Memory coherence in shared virtual memory systems, Proceedings of the 5th ACM Symposium on Principles of Distributed Computing, ACM, August 1986, pp. 229-239.
19. D. L. Eager, E. D. Lazowska and J. Zahorjan, The limited performance benefits of migrating active processes for load sharing, ACM SIGMETRICS 1988, May 1988.
20. D. L. Eager, E. D. Lazowska and J. Zahorjan, Adaptive load sharing in homogeneous distributed systems, IEEE Trans. Software Engineering, SE-12, (5), 662-675 (1986).
21. E. H. Baalbergen, Parallel and distributed compilations in loosely-coupled systems: a case study, Proceedings of Workshop on Large Grain Parallelism, Providence, RI, October 1986.
22. F. Douglis, Transparent process migration in the Sprite operating system, Ph.D. Thesis, University of California, Berkeley, CA 94720, September 1990. Available as Technical Report UCB/CSD 90/598.
23. F. Douglis and J. Ousterhout, Process migration in the Sprite operating system, Proceedings of the 7th International Conference on Distributed Computing Systems, IEEE, Berlin, West Germany, September 1987, pp. 18-25.