Interactivity vs. Fairness in Networked Linux Systems
Received 18 September 2006; received in revised form 6 February 2007; accepted 3 April 2007
Available online 3 May 2007
Abstract
In general, the Linux 2.6 scheduler can ensure fairness and provide excellent interactive performance at the same time. However, our experiments and mathematical analysis have shown that the current Linux interactivity mechanism tends to incorrectly categorize non-interactive network applications as interactive, which can lead to serious fairness or starvation issues. In the extreme, a single process can unjustifiably obtain up to 95% of the CPU! The root causes are that: (1) network packets arrive at the receiver independently and discretely, and the "relatively fast" non-interactive network process might frequently sleep to wait for packet arrival; though each sleep lasts for a very short period of time, the wait-for-packet sleeps occur so frequently that they lead to interactive status for the process; and (2) the current Linux interactivity mechanism allows a non-interactive network process to receive a high CPU share and at the same time be incorrectly categorized as interactive. In this paper, we propose and test a possible solution to address these interactivity vs. fairness problems. Experimental results have demonstrated the effectiveness of the proposed solution.
© 2007 Elsevier B.V. All rights reserved.
is credited with much of the overall system performance improvements.

One design goal of Linux 2.6 is to improve interactivity [5]. Processes such as text editors and command shells interact constantly with their users, and spend a lot of time waiting for keystrokes and mouse events. When inputs are received, the process must be woken up quickly; otherwise, the user will find the system unresponsive and annoying. Typically, the delay must not exceed 150 ms [1]. Linux 2.6 provides excellent interactive performance by employing the following measures [1,2,4]: (1) Its scheduler is a typical decay usage priority scheduler. Processes are scheduled in priority order, where effective priority has two components: static priority and dynamic priority bonus. The static priority reflects the inherent relative importance of processes, expressed by the processes' nice values. The dynamic priority bonus depends on CPU usage patterns; the scheduler favors interactive processes and penalizes non-interactive processes by adjusting the dynamic priority bonus. (2) To reduce scheduling latency, expired interactive processes are reinserted back into the active array, instead of the expired array. In addition, an interactive process' timeslice is divided into smaller pieces, preventing interactive processes from blocking each other. (3) Linux 2.6 is kernel-preemptible: whenever a scheduler clock tick or interrupt occurs, if a higher-priority task has become runnable, it will preempt the running task as long as the latter holds no kernel locks. (4) Linux 2.6's clock granularity has reached the 1 ms level.

Fairness is another design goal of Linux 2.6 [5]. Fairness is the ability of all tasks not only to make forward progress, but to do so relatively evenly. The opposite of fairness is starvation, which occurs if some tasks make no forward progress at all [6,7]. The Linux 2.6 scheduler's active-expired array design is supposed to ensure fairness [1,2]. However, as described above, an expired interactive process is reinserted back into the active array instead of the expired array. This leads to the possibility of starvation for the processes in the expired array if the active array continues to hold runnable processes. To circumvent the starvation issue, when the first expired process is older than some limit, expired processes are moved to the expired array without regard to their interactive status. Usually, an interactive process does not consume much CPU time because most of the time it sleeps waiting for user inputs. In general, the Linux 2.6 scheduler can ensure fairness among processes and provide excellent interactive performance at the same time. However, our experiments and analysis have shown that the current Linux interactivity mechanism tends to incorrectly categorize non-interactive network applications as interactive, which can lead to serious fairness or starvation issues. The interactivity mechanism allows the possibility that a non-interactive network process could consume a large CPU share and at the same time be incorrectly categorized as interactive. Further, incorrectly labeled "interactive network applications" might block truly interactive applications, resulting in degraded interactive performance.

Linux-based network end systems have been widely deployed in the High-Energy Physics (HEP) community at labs like CERN, DESY, Fermilab, and SLAC, and at many universities. At Fermilab, thousands of networked systems run Linux; these include computational farms, trigger processing farms, hierarchical storage servers, and desktop workstations. From a network performance perspective, Linux represents an opportunity, since it is amenable to optimization and tuning due to its open source support and projects such as web100 and net100 that enable examination of internal states [8,9]. The performance of Linux-based network end systems is of great interest to HEP and other scientific and commercial communities. In this paper, we analyze the interactivity vs. fairness issues in networked Linux systems. Our analysis is based on Linux kernel 2.6.14. Also, it is assumed that the NIC (Network Interface Card) driver makes use of Linux's "New API", or NAPI [10,11], which reduces the interrupt load on the CPUs. The contributions of the paper are as follows: (1) We systematically study and analyze the Linux 2.6 scheduling and interactivity mechanism. (2) Our research shows that the current Linux interactivity mechanism is not effective in distinguishing non-interactive network processes from interactive network processes and might result in serious fairness/starvation problems; mathematical analysis and experimental results verify our conclusions. (3) Further, we propose and test a possible solution to address the interactivity vs. fairness problems; experimental results demonstrate the effectiveness of our proposed solution.

The remainder of the paper is organized as follows: In Section 2, related research on interactivity and fairness is presented. Section 3 analyzes the Linux scheduling and interactivity mechanisms.
In Section 4, we investigate the interactivity vs. fairness problems in networked Linux systems through mathematical analysis. In Section 5, we show experiment results that further study the problems, verifying our conclusions of Section 4. In Section 6, we propose and test a possible solution to address the interactivity vs. fairness problems in networked Linux systems. Finally, in Section 7, we conclude the paper.

2. Related work

The schedulers of Unix variants such as BSD4.3, FreeBSD, Solaris, and SVR4 [12-15] are typical decay usage priority schedulers: processes are scheduled in priority order, and higher-priority processes are scheduled to run first. The priorities of I/O-bound (interactive) processes grow with time, so that when they are awakened, they have higher priority than CPU-bound (non-interactive) processes and are therefore scheduled to run immediately. In general, those schedulers provide excellent interactive response on general-purpose time-sharing systems for traditional interactive applications that have low CPU consumption. However, those schedulers are not effective at supporting interactive multimedia applications (e.g., audio and video players) that have high CPU usage. To address this problem, Etsion et al. [16] proposed human-centered scheduling of interactive and multimedia applications on a loaded desktop. In their approach, the scheduler first estimates the "volume of user interaction" associated with each process by monitoring relevant I/O device activity, and then uses those estimates to prioritize interactive processes, without respect to their CPU usage. However, this method might not be appropriate for some network applications.

To ensure fairness, proportional-share schedulers [17-20] are usually employed to control the relative rates at which different processes can use the processor. Over the years, different proportional-share schedulers have been proposed. In [17], Waldspurger et al. proposed lottery scheduling to enable flexible control over the relative rates at which CPU-bound workloads consume processor time. In [18], Goyal et al. proposed a hierarchical CPU scheduler for multimedia operating systems, which provides protection between various classes of applications. In [21], Petrou et al. proposed a hybrid lottery scheduler, which aims to achieve responsiveness comparable to the FreeBSD scheduler while maintaining lottery scheduling's flexible control over relative execution rates and load insulation. So far, no research has been found that relates interactivity and fairness to network applications.

3. Linux scheduling and interactivity

Linux 2.6 is a preemptive multi-processing operating system. Processes (tasks) are scheduled to run in a prioritized round-robin manner [1-4], to achieve the objectives of fairness, interactivity, and efficiency. For the sake of scheduling, a Linux process has a dynamic priority and a static priority. A process' static priority is equivalent to its nice value, which is specified by the user and not changed by the kernel. The dynamic priority is used by the scheduler to rate the process with respect to the other processes in the system. An eligible process with a better (smaller-valued) dynamic priority is scheduled to run before a process with a worse (higher-valued) dynamic priority. The dynamic priority varies during a process' life; it depends on the process' scheduling history and its specified static priority, which we elaborate in the following sections. There are 140 possible priority levels for processes (both dynamic priority and static priority) in Linux. The top 100 levels are used only for real-time processes, which we do not address in this paper. The last 40 levels are used for conventional processes.
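For concreteness, the mapping from nice values onto the conventional priority range can be written as a pair of macros. The sketch below mirrors the NICE_TO_PRIO()/PRIO_TO_NICE() helpers found in the 2.6 kernel headers; it is a paraphrase for illustration, not a verbatim copy of kernel source.

    #define MAX_RT_PRIO         100                       /* levels 0..99: real-time        */
    #define MAX_PRIO            (MAX_RT_PRIO + 40)        /* 140 levels in total            */
    #define NICE_TO_PRIO(nice)  (MAX_RT_PRIO + (nice) + 20)  /* nice -20..19 -> 100..139    */
    #define PRIO_TO_NICE(prio)  ((prio) - MAX_RT_PRIO - 20)  /* e.g., static priority 120 -> nice 0 */

In particular, the default nice value of 0 corresponds to a static priority of 120, the value used repeatedly in the analysis below.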
3.1. Linux scheduler

Fig. 1. Linux process scheduling.

As shown in Fig. 1, the whole process scheduling is based on a data structure called the runqueue. Essentially, a runqueue keeps track of all runnable tasks assigned to a particular CPU. One runqueue is created and maintained for each CPU in a system. Each runqueue contains two priority arrays: the active priority array and the expired priority array. Each priority array contains a queue of runnable processes per priority level. Processes with higher dynamic priority are scheduled to run first; within a given priority, processes are scheduled round robin. All tasks on a CPU begin in the active priority array. Each process' timeslice is calculated based on its static priority; when a process in the active priority array uses up its timeslice, it is considered expired. An expired process is moved to the expired priority array if it is not interactive; an expired interactive process is reinserted into the active array if possible. In either case, a new timeslice and priority are calculated. When there are no more runnable tasks in the active priority array, it is simply swapped with the expired priority array. An unexpired process might be put into a wait queue to sleep, waiting for expected events such as the completion of I/O. When a sleeping process wakes up, its timeslice and priority are recalculated and it is moved to the active priority array. As for preemption, whenever a scheduler clock tick or interrupt occurs, if a higher-priority task has become runnable, it will preempt the running task as long as the latter holds no kernel locks.
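The data structures behind this description can be sketched as follows. This is a paraphrase (a subset of fields, with simplified bitmap sizing) of the priority-array and runqueue structures in kernel/sched.c of the 2.6 series, not a verbatim copy; struct list_head stands in for the kernel's doubly linked list type from <linux/list.h>.

    #define MAX_PRIO 140                       /* 0..99 real-time, 100..139 conventional     */

    struct list_head {                         /* as in <linux/list.h>                       */
        struct list_head *next, *prev;
    };

    struct prio_array {
        unsigned int     nr_active;            /* runnable tasks held in this array          */
        unsigned long    bitmap[(MAX_PRIO + 63) / 64];  /* one bit per priority level, set   */
                                                        /* when that level's queue is non-empty */
        struct list_head queue[MAX_PRIO];      /* one round-robin list per priority level    */
    };

    struct runqueue {
        unsigned long      nr_running;         /* the NR_running used by Rule 1 in Section 3.2 */
        struct prio_array *active;             /* tasks with timeslice remaining             */
        struct prio_array *expired;            /* tasks that used up their timeslice         */
        struct prio_array  arrays[2];          /* the two arrays; the pointers above are     */
                                               /* swapped when the active array empties      */
        unsigned long      expired_timestamp;  /* when the first task on this CPU expired    */
    };

Picking the next task is then a matter of finding the first set bit in active->bitmap and taking the head of the corresponding queue, which is what makes the scheduler O(1) [5].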
3.2. Interactive scheduling

As we have said above, an interactive process needs to be responsive. The Linux kernel must provide the capabilities of interactive scheduling. To this end, it needs to:

• Perform process classification: differentiate interactive processes from non-interactive processes.
• Try to minimize the scheduling latency [7] for interactive processes:
  - Prevent non-interactive processes from blocking interactive processes.
  - Prevent interactive processes from blocking other interactive processes.

The interactivity estimator is designed to find which processes are interactive and which are not. It is based on the premise that non-interactive processes tend to use up all the CPU time offered to them, whereas interactive processes often sleep [1]. A sleep_avg is stored for each process: a process is credited for its sleep time and penalized for its run time. A process with a high sleep_avg is considered interactive, and one with a low sleep_avg non-interactive. The interactivity estimator framework embedded into Linux operates automatically and transparently.

A process' dynamic priority varies during the process' life span. It depends on the process' interactivity status and its specified static priority. Linux assigns a dynamic priority to process P at time t as follows:

    dynamic_priority(P, t) = max{100, min{static_priority(P) + 5 − bonus(P, t), 139}}   (1)

    bonus(P, t) = P→sleep_avg(t) · MAX_BONUS / MAX_SLEEP_AVG   (2)

The constant MAX_BONUS is 10 and MAX_SLEEP_AVG is 1000 ms. P→sleep_avg(t) is the sleep_avg (in ms) for process P at time t, and it is limited to the range 0 ≤ P→sleep_avg(t) ≤ MAX_SLEEP_AVG. Therefore, bonus(P, t) ranges from 0 to 10. The quantity 5 − bonus(P, t) is also called the dynamic priority bonus. The more time a process spends sleeping, the higher the sleep_avg is, and the higher the priority boost.

From (1) and (2), it can be seen that Linux credits interactive processes and penalizes non-interactive processes by adjusting the dynamic priority bonus. In this way, Linux allows interactive processes to preempt non-interactive processes when they have the same, or nearly the same, static priorities.

When a process runs out its timeslice, the Linux kernel needs to determine its interactivity status. An expired interactive process is reinserted back into the active array, instead of the expired array. The interactivity threshold condition for process P is

    bonus(P, tE) ≥ static_priority(P)/4 − 23   (3)

where tE is the moment at which process P expires. For a process P with a default nice value of 0, the static priority is 120 [1,4] and the interactivity threshold is equivalent to P→sleep_avg(tE) ≥ 700 ms. If and only if the condition in (3) holds, P is deemed interactive.
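Restating (1)-(3) in code makes the arithmetic easy to check. The sketch below mirrors the equations exactly as given in the text; the kernel's own implementation (effective_prio() and the TASK_INTERACTIVE() test in kernel/sched.c) is organized differently but computes equivalent values, so treat this as an illustration rather than kernel source.

    #define MAX_BONUS      10
    #define MAX_SLEEP_AVG  1000    /* ms */

    /* Eq. (2): scale sleep_avg into the range 0..MAX_BONUS. */
    static int bonus(int sleep_avg_ms)
    {
        return sleep_avg_ms * MAX_BONUS / MAX_SLEEP_AVG;
    }

    /* Eq. (1): dynamic priority, clamped to the conventional range 100..139. */
    static int dynamic_priority(int static_priority, int sleep_avg_ms)
    {
        int prio = static_priority + 5 - bonus(sleep_avg_ms);

        if (prio < 100)
            prio = 100;
        if (prio > 139)
            prio = 139;
        return prio;
    }

    /* Eq. (3): the interactivity test applied when the timeslice expires. */
    static int task_interactive(int static_priority, int sleep_avg_ms)
    {
        return bonus(sleep_avg_ms) >= static_priority / 4 - 23;
    }

For a nice-0 process (static priority 120), task_interactive() reduces to bonus ≥ 7, i.e., sleep_avg ≥ 700 ms, exactly the threshold quoted above.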
Reinserting an interactive process into the active array helps to increase responsiveness. If this were not done, an interactive process in the expired array would have to wait for all the runnable processes in the active array to finish before regaining the CPU. However, keeping an expired interactive process in the active array might lead to starvation for the processes in the expired array, as long as the active array continues to hold runnable processes. To circumvent starvation, special interactivity rules have been made:

• Rule 1: If the time since the first process in the active array expired is greater than or equal to STARVATION_LIMIT · NR_running + 1, any expired processes are moved to the expired array without regard to their interactive status. Here, the constant STARVATION_LIMIT is 1000 ms, and NR_running is the number of processes in the runqueue (a sketch of this test follows the list).
• Rule 2: The interactivity is also ignored if a process in the expired array has a better static priority.
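Rule 1 can be expressed as a simple predicate. The sketch below is written in the spirit of the EXPIRED_STARVING() check in kernel/sched.c; the parameter names and the use of milliseconds (rather than jiffies) are our simplifications, not the kernel's exact code.

    #define STARVATION_LIMIT 1000   /* ms */

    /*
     * first_expired_ms: time at which the first task on this runqueue expired
     *                   (0 if nothing has expired yet);
     * nr_running:       number of processes on the runqueue.
     */
    static int expired_starving(unsigned long now_ms,
                                unsigned long first_expired_ms,
                                unsigned long nr_running)
    {
        if (!first_expired_ms)
            return 0;
        return now_ms - first_expired_ms >= STARVATION_LIMIT * nr_running + 1;
    }

When expired_starving() returns true, an expiring task is placed on the expired array even if Eq. (3) classifies it as interactive.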
Furthermore, an interactive process P's timeslice is divided into smaller pieces. Each piece has the size TIMESLICE_GRANULARITY(P), which is actually a macro that yields the product of the number of CPUs in the system and a constant proportional to bonus(P, t) [1,4]. An interactive process does not receive any less timeslice; instead, a task of equal priority may preempt the running process every TIMESLICE_GRANULARITY(P). The process is then requeued to the end of the list for its priority level. Processes at the same priority level run in round-robin fashion, so execution rotates more frequently among interactive processes of the same priority, preventing them from blocking each other.

3.3. Sleep_avg scoring

The basic idea of sleep_avg is to credit sleep time and penalize run time. However, the calculation of sleep_avg is not a simple count up and down: the current interactive status of the process is used to weight both sleep time and run time, to introduce some auto-regulation into the calculation [22]. The sleep_avg is updated at the moments when (a) a process wakes up from a sleeping or blocked state, or (b) a process yields the CPU.

In the example of Fig. 2, at t0 process P starts to run for a duration of tr. At t1, P goes to sleep and yields the CPU to process Q. Then at t2, P wakes up and preempts Q. In general, the updating of sleep_avg follows (4) and (5):

    P→sleep_avg(t1) = max{0, P→sleep_avg(t0) − tr·a}   (4)

where a is a weighting factor for run time, a = 1/max{1, bonus(P, t0)};

    P→sleep_avg(t2) = min{MAX_SLEEP_AVG, P→sleep_avg(t1) + ts·b}   (5)

where ts is the sleep duration and b is a weighting factor for sleep time, b = max{1, 10 − bonus(P, t1)}.
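The two update rules translate directly into code. The helpers below restate (4) and (5) with explicit weights; bonus() is Eq. (2), repeated from the earlier sketch so that this fragment stands on its own, and the millisecond granularity is a simplification of the kernel's internal time units.

    #define MAX_BONUS      10
    #define MAX_SLEEP_AVG  1000    /* ms */

    static int bonus(int sleep_avg_ms)               /* Eq. (2) */
    {
        return sleep_avg_ms * MAX_BONUS / MAX_SLEEP_AVG;
    }

    /* Eq. (4): charged when the process stops running, after running run_ms. */
    static int charge_run(int sleep_avg, int run_ms)
    {
        int a_div = bonus(sleep_avg);                /* bonus(P, t0)              */

        if (a_div < 1)
            a_div = 1;
        sleep_avg -= run_ms / a_div;                 /* penalty weight a = 1/a_div */
        return sleep_avg < 0 ? 0 : sleep_avg;
    }

    /* Eq. (5): credited when the process wakes up, after sleeping sleep_ms. */
    static int credit_sleep(int sleep_avg, int sleep_ms)
    {
        int b = MAX_BONUS - bonus(sleep_avg);        /* bonus(P, t1)              */

        if (b < 1)
            b = 1;
        sleep_avg += sleep_ms * b;                   /* credit weight b           */
        return sleep_avg > MAX_SLEEP_AVG ? MAX_SLEEP_AVG : sleep_avg;
    }

The weighting is what produces the auto-regulation: a task already rated highly interactive gains little from further sleep and loses little from short runs, while a task rated non-interactive gains quickly from sleep and is penalized at the full rate when it runs.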
However, when updating sleep_avg for waking processes, special measures are taken to handle the following scenarios [1,4]: (a) Processes that sleep a long time are categorized as idle and get only minimally interactive status, to prevent them from suddenly becoming CPU intensive and starving other processes. (b) Processes waking from an uninterruptible sleep are limited in their sleep_avg rise, as they are likely to have been waiting on disk I/O, which is not a strong indicator of interactivity. (Most local disk I/O is associated with uninterruptible sleep.) (c) When an awakened process is put onto a runqueue, there might be scheduling latency, which could be of non-negligible duration. In this case, the time spent on the runqueue might or might not be credited as sleep time, depending on the state of the process when it was awakened. That state is encoded within the process' activated field [1]. Let us assume that process P waits on the runqueue for a period of tw before it is scheduled to run. The credited sleep time is as shown in Table 1. For example, a process might sleep to wait for data from the network; afterwards, when the process is woken up, its wait time on the runqueue is fully credited to the sleep_avg because its P→activated code is 2.

Table 1
Credited sleep time vs. wait time on runqueue

P→activated code        −1      1        2      0
Credited sleep time      0      0.3·tw   tw     N/A

Since Linux only counts time in integral tick units, the Linux clock granularity might play a role when updating the sleep_avg: some sleep/run times are rounded up to the next whole tick, while others are rounded down. On average, these two effects tend to cancel out [23]. Furthermore, in Linux 2.6 the clock granularity is at the 1 ms level. In general, the sleep_avg is updated with reasonable accuracy.

4. Interactivity vs. fairness in networked Linux systems

In previous sections we have discussed the Linux interactive scheduling mechanism: an expired interactive process is reinserted back into the active array, instead of the expired array. Interactive scheduling makes Linux systems more responsive and interactive. However, interactive scheduling brings the possibility of unfairness if the interactivity classification is inaccurate. For example, when a non-interactive process is incorrectly classified as interactive, reinserting it back into the active array gains it extra scheduling runs, at the expense of other non-interactive processes. What is worse, when a non-interactive process incorrectly gains interactive status, its dynamic priority is correspondingly enhanced, which might block some truly interactive processes.

As remarked above, special measures have been taken to make interactivity classification accurate. Those measures are effective in preventing processes that mainly wait for disk I/O from being categorized as interactive [22]. However, our experiments and analysis have shown that the current interactivity classification mechanism is not effective in classifying network-related processes. It tends to classify applications like ftp and rcp as interactive when bandwidth is limited or the sender is slower than the receiver. Applications like ssh, telnet, and http clients are generally interactive applications, but ftp, rcp, scp, and the like are not. If they are misclassified, scheduling fairness issues arise. In the following sections, we use a simplified model to analyze the fairness vs. interactivity issues.

Assume there is bulk data flowing from a sender to a receiver (as in ftp, for example). Process P is the data receiving process in the receiver. The network is relatively stable, and incoming packets are evenly spaced at a rate of Ni packets/s (pps). There is no other traffic directed to the receiver. This assumption holds for traffic patterns like voice over IP [24] or an ideal TCP self-clocking stream such as in [25]. In reality, the incoming traffic pattern is irregular; however, NAPI or "interrupt coalescing" will mask the arrival pattern and to some extent nullify its effect on the receiver. Similar conclusions are still expected to be valid, and are borne out by experiments. Also, let the NAPI driver's hardware interrupt time be Tintr, which includes NIC interrupt dispatch and service time; let the software interrupt softnet's packet service rate be Rsn (pps); and let process P's data service rate be SP (pps). When the network bandwidth is limited, or the sender's processing power is relatively lower than the receiver's, we can assume that Ni ≪ Rsn. Let process P have the default nice value of 0.

4.1. Single process receiver

Only process P runs on the receiver; there are no other processes. At time 0, P is waiting for network data from the sender (TCP or UDP).

As shown in Fig. 3, packets start to arrive at the receiver at time 0. As an interrupt-driven operating system, Linux follows the execution sequence: hardware interrupts → software interrupts → processes [1,2]. Packet 1 is first transferred to the ring buffer; then the NIC raises a hardware interrupt to schedule the softirq, softnet. Afterwards, the software interrupt handler (softnet) starts to move packet 1 from the ring buffer to the socket's receive buffer of process P, waking up process P and putting it on the runqueue. During this period, new packets might arrive at the receiver; for example, packet 2 arrives during this period in Fig. 3. Softnet continues to process the packets within the ring buffer until it is empty. Letting Tsn be the duration that softnet spends on the ring buffer, we see that

    1 + ⌊(Tintr + Tsn)·Ni⌋ = Tsn·Rsn   (6)
Here, Tsn·Rsn is actually the number of packets that are handled together, and

    Tsn = (1 + Tintr·Ni) / (Rsn − Ni)   (7)

Then the softirq yields the CPU. Process P begins to run, moving data from the socket's receive buffer into user space. Since there are Tsn·Rsn packets in the receive buffer, process P runs for a duration of Tr = (Tsn·Rsn)/SP. Here, we are considering a relatively low incoming packet rate compared to the receiver's processing power: before the next packet (P3 in Fig. 3) arrives at the receiver, process P runs out of data and again goes to sleep, waiting for more. Either of two conditions could lead to a relatively low incoming packet rate: the network bandwidth from sender to receiver is low, or the sender's hardware is less powerful than the receiver's. If the next packet always arrived before process P went to sleep, the sender would overrun the receiver; incoming packets would accumulate in the socket's receive buffer, and for TCP traffic the flow control mechanism would take effect to slow down the sender.

When the next packet arrives at the receiver, the same scenario as described above occurs. The cycle repeats until process P stops. At time tE, process P's timeslice expires.

When incoming traffic wakes up process P, its wait time on the runqueue is fully credited to the sleep_avg; for the process being discussed, its P→activated code is 2. As shown in Fig. 3, process P runs for Tr and sleeps for Ts in each cycle. Here

    Tr = Tsn·Rsn / SP = ⌊(1 + Tintr·Ni)·Rsn / (Rsn − Ni)⌋ / SP   (8)

    Ts = ⌊(1 + Tintr·Ni)·Rsn / (Rsn − Ni)⌋ / Ni − ⌊(1 + Tintr·Ni)·Rsn / (Rsn − Ni)⌋ / SP   (9)

Following (4) and (5), it is easy to update P→sleep_avg(t) at any time t. From (8) and (9), it follows that

    Tr / Ts = Ni / (SP − Ni)   (10)

Correspondingly, process P's CPU share is

    Tr / (Tr + Ts) = Ni / SP   (11)

Given the receiver and process P, SP is fixed. Therefore, it can be derived from (3), (4), (5), and (10) that process P's interactivity status is strongly dependent on the packet arrival rate Ni, rather than on interactive activities.
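To put illustrative numbers on (10) and (11) (these values are hypothetical, not measurements from Section 5): if Ni = 10,000 pps and SP = 50,000 pps, then Tr/Ts = 10,000/40,000 = 0.25 and process P consumes Ni/SP = 20% of the CPU. If Ni rises to 45,000 pps with the same SP, then Tr/Ts = 9 and the CPU share reaches 90%. As Theorem 1 below shows, any ratio Tr/Ts < 9 eventually leads to interactive status, so a large CPU share and an interactive classification can coexist.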
As illustrated in Fig. 3, we will count cycles of run and sleep beginning when the process wakes up. Cycle 1 starts at t0 and ends at t1. Since an interval Tr of running is not more than 100 ms and decreases sleep_avg by a·Tr, with a ≤ 1, sleep_avg may fall to the next lower 100 ms bracket during the running portion of a cycle, but no further. This may increase b by 1, but no more. Referring to (4) and (5), we collect the possible changes of sleep_avg in one cycle, Δsleep_avg, in Table 2.

Table 2
Changes of sleep_avg in each cycle

P→sleep_avg(t0)                   a       b          Δsleep_avg
0 ≤ P→sleep_avg(t0) < 100         1       10         10Ts − Tr
100 ≤ P→sleep_avg(t0) < 200       1       10 or 9    10Ts − Tr or 9Ts − Tr
200 ≤ P→sleep_avg(t0) < 300       1/2     9 or 8     9Ts − Tr/2 or 8Ts − Tr/2
300 ≤ P→sleep_avg(t0) < 400       1/3     8 or 7     8Ts − Tr/3 or 7Ts − Tr/3
400 ≤ P→sleep_avg(t0) < 500       1/4     7 or 6     7Ts − Tr/4 or 6Ts − Tr/4
500 ≤ P→sleep_avg(t0) < 600       1/5     6 or 5     6Ts − Tr/5 or 5Ts − Tr/5
600 ≤ P→sleep_avg(t0) < 700       1/6     5 or 4     5Ts − Tr/6 or 4Ts − Tr/6
700 ≤ P→sleep_avg(t0) < 800       1/7     4 or 3     4Ts − Tr/7 or 3Ts − Tr/7
800 ≤ P→sleep_avg(t0) < 900       1/8     3 or 2     3Ts − Tr/8 or 2Ts − Tr/8
900 ≤ P→sleep_avg(t0) < 1000      1/9     2 or 1     2Ts − Tr/9 or Ts − Tr/9
P→sleep_avg(t0) = 1000            1/10    1          Ts − Tr/10

From Table 2, we can state the following theorem.

Theorem 1. Process P is the data receiving process in the receiver. The network is relatively stable, and incoming packets are evenly spaced at a rate of Ni (pps). Process P's data service rate is SP (pps). If Ni/SP < 0.9, P will be categorized as interactive if it runs long enough.

Proof. If Ni/SP < 0.9, it can be derived from (10) that Tr/Ts < 9. From Table 2, it is seen that when Tr/Ts < 9, Δsleep_avg > 0 for any cycle. To categorize a process as interactive, it suffices to meet the condition in (3). Let us assume process P's initial sleep_avg is sleep_avg(0) when it is initially forked, and that its nice value is 0.

• If sleep_avg(0) ≥ 700 ms: since Δsleep_avg > 0 for any cycle, process P will always be categorized as interactive.
• If sleep_avg(0) < 700 ms: when Tr/Ts < 9, Δsleep_avg is bounded below by a positive quantity in every bracket of Table 2 (in the 600-700 ms bracket, for example, Δsleep_avg ≥ 4Ts − Tr/6 > (5/2)·Ts), so process P needs to run only for some finite number of cycles n to achieve Σ(k=1..n) Δsleep_avg(k) > 700 ms − sleep_avg(0).

Therefore, process P will meet the condition in (3) and be categorized as interactive if it runs long enough. □
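The convergence claimed by Theorem 1 is easy to check numerically. The program below is a stand-alone simulation of the cycle model (not kernel code): once per cycle it applies the sleep credit of Eq. (5) and the run penalty of Eq. (4) for a chosen Tr and Ts. The initial sleep_avg and the cycle parameters are arbitrary illustrative values.

    #include <stdio.h>

    #define MAX_BONUS      10
    #define MAX_SLEEP_AVG  1000.0            /* ms */

    static int bonus(double sleep_avg)       /* Eq. (2) */
    {
        return (int)(sleep_avg * MAX_BONUS / MAX_SLEEP_AVG);
    }

    int main(void)
    {
        double sleep_avg = 0.0;              /* assumed initial value          */
        const double Tr = 8.0, Ts = 2.0;     /* ms per cycle, so Tr/Ts = 4 < 9 */
        const int cycles = 2000;
        int i;

        for (i = 0; i < cycles; i++) {
            int b, a_div;

            /* Wake up after sleeping Ts: credit with weight b, Eq. (5). */
            b = MAX_BONUS - bonus(sleep_avg);
            if (b < 1)
                b = 1;
            sleep_avg += Ts * b;
            if (sleep_avg > MAX_SLEEP_AVG)
                sleep_avg = MAX_SLEEP_AVG;

            /* Run for Tr: penalize with weight a = 1/max{1, bonus}, Eq. (4). */
            a_div = bonus(sleep_avg);
            if (a_div < 1)
                a_div = 1;
            sleep_avg -= Tr / a_div;
            if (sleep_avg < 0.0)
                sleep_avg = 0.0;
        }

        printf("sleep_avg after %d cycles: %.1f ms (bonus %d)\n",
               cycles, sleep_avg, bonus(sleep_avg));
        return 0;
    }

With Tr/Ts = 4, the printed value settles just below the 1000 ms cap, comfortably above the 700 ms threshold of Eq. (3), even though the simulated process is consuming Tr/(Tr + Ts) = 80% of the CPU.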
Theorem 1 shows that process P's interactivity status is strongly dependent on the packet arrival rate Ni, rather than on its interactive activities. Clearly, we can draw the following conclusion: network packets arrive at the receiver independently and discretely, and the "relatively fast" non-interactive network process might frequently sleep to wait for packet arrival; though each sleep lasts a very short period of time, the wait-for-packet sleeps occur so frequently that they lead to interactive status for the process.

The current Linux interactivity mechanism also carries the chance that a non-interactive network process could consume a high CPU share and at the same time be incorrectly categorized as interactive. For example, assume 700 ms ≤ P→sleep_avg(t0) < 800 ms, so that process P has gained interactive status. Based on Table 2, the change of sleep_avg in each cycle is 4Ts − Tr/7 (or 3Ts − Tr/7). To keep the interactive status, it needs to meet the condition 4Ts − Tr/7 ≥ 0 (or 3Ts − Tr/7 ≥ 0), which gives Tr/Ts ≤ 28 (or Tr/Ts ≤ 21). This condition can easily be met under normal network conditions. However, although process P keeps its interactive status, it might still be using a high CPU percentage: when process P just meets the condition Tr/Ts ≤ 28 (or Tr/Ts ≤ 21) to keep its interactive status, its CPU share can reach as high as 96.55%. Table 3 shows process P's maximal CPU share in different scenarios while keeping its sleep_avg in the indicated range.

4.2. Receiver plus other CPU load

In this case, process P runs on the receiver with M other non-interactive processes. All the processes have the same default nice value of 0.

Theorem 2. Process P runs on the receiver with M non-interactive processes. All the processes have the default nice value of 0. Assume that the network is relatively stable, and P has already gained interactive status
Table 4
Sender and receiver features for experiments on Fermilab's sub-networks

                 Fast sender                          Slow sender                          Receiver
CPU              Two Intel Xeon CPUs (3.0 GHz)        One Intel Pentium IV CPU (2.8 GHz)   One Intel Pentium III CPU (1 GHz)
System memory    3829 MB                              512 MB                               512 MB
NIC              Syskonnect, 32-bit PCI bus slot at   Intel PRO/1000, 32-bit PCI bus slot  3COM 3C996B-T, 32-bit PCI bus slot
                 33 MHz, 1 Gbps, twisted pair         at 33 MHz, 1 Gbps, twisted pair      at 33 MHz, 1 Gbps, twisted pair

Table 5
Sender and receiver features for experiments over wide area networks

                 BNL sender                            FNAL receiver
CPU              One Intel Pentium IV CPU (3.2 GHz)    One Intel Pentium III CPU (1 GHz)
System memory    1 GB                                  512 MB
NIC              Intel PRO/1000, 32-bit PCI bus slot   3COM 3C996B-T, 32-bit PCI bus slot
                 at 33 MHz, 1 Gbps, twisted pair       at 33 MHz, 1 Gbps, twisted pair

Table 6
Iperf experiment results in the receiver (slow sender)

Load   Scheduler   Throughput (Mbps)   CPU share (%)   Reinsertion count
BL0    WI          436                 78.489          780
       NI          473                 87.569          0
BL1    WI          443                 81.573          815
       NI          285                 49.923          0
BL2    WI          438                 80.613          801
       NI          185                 33.022          0
BL4    WI          430                 79.217          785
       NI          113                 20.025          0
BL8    WI          440                 81.093          811
       NI          64.7                11.117          0

Table 7
Iperf experiment results in the receiver (fast sender)

Load   Scheduler   Throughput (Mbps)   CPU share (%)   Reinsertion count
BL0    WI          464                 99.228          7
       NI          478                 99.975          0
BL1    WI          241                 49.995          7
       NI          241                 50.197          0
BL2    WI          159                 34.246          8
       NI          160                 32.826          0
BL4    WI          97.0                20.859          8
       NI          105                 20.175          0
BL8    WI          74.2                15.375          47
       NI          58.3                11.143          0

With the fast sender, the reinsertion count is only 47. As each experiment runs for 100 seconds, and the timeslice for a process with the default nice value of 0 is 100 ms, there cannot be more than 1000 expirations of iperf's timeslice. Once process sleep time and system interrupt time are accounted for (as reflected in iperf's CPU share), a reinsertion count of about 800, as seen with the slow sender in Table 6, implies that iperf is categorized as interactive almost all the time.

The experiment results in Tables 6 and 7 also verify the correctness of Theorem 2: interactive scheduling can lead to the fairness issue. Under non-interactive scheduling, when the number of background processes increases, iperf's CPU share is correspondingly reduced: basically, if M + 1 processes run in the system, each process gets a share of about 1/(M + 1). Under interactive scheduling, however, iperf's CPU share depends on the network conditions. With a slow sender, iperf's CPU share stays near 80%, no matter how many background processes there are; this is in accord with Theorem 2. With a fast sender, iperf's CPU share is similar to what it receives under non-interactive scheduling.

For better presentation, we show the results for CPU shares in Fig. 6. In the figure, "FWI" represents the fast sender with interactive scheduling in the receiver; "SWI" represents the slow sender with interactive scheduling in the receiver; "FNI" represents the fast sender with non-interactive scheduling in the receiver; and "SNI" represents the slow sender with non-interactive scheduling in the receiver.

To further probe the interactivity vs. fairness issues, we randomly choose two groups of experiment results. The experiments are run with a background load of BL8, one with the fast sender and the other with the slow sender. The experiment results are given in Figs. 7-10.

Figs. 7 and 8 give iperf's sleep_avg in the receiver for the slow and fast sender, respectively. For the slow sender (Fig. 7), it can be seen that iperf's sleep_avg is always greater than 700 ms, which means that iperf is categorized as interactive all the time. However, for the fast sender (Fig. 8), iperf is categorized as non-interactive most of the time. This is the reason that, with a fast sender, iperf's CPU share is similar to what it is under non-interactive scheduling. These experiment results agree with our analysis in the previous sections and further demonstrate that the current interactivity classification mechanism is not effective in classifying network-related processes, whose interactivity status is strongly dependent on the network conditions.
Fig. 9. Histogram of time intervals between consecutive timeslice expiration instants for iperf in the receiver (slow sender).

Figs. 9 and 10 give the histograms of time intervals between consecutive timeslice expiration instants for iperf in the receiver. These results verify the correctness of Theorem 2 from another perspective. Fig. 7 shows that, with the slow sender, iperf is always categorized as interactive. Therefore, each time iperf's timeslice expires, it is reinserted into the active array, instead of the expired array. Also, due to its interactive status, iperf gains a priority bonus, resulting in a higher dynamic priority than the other non-interactive processes; those non-interactive processes only run during the periods in which iperf sleeps.
Fig. 10. Histogram of time intervals between consecutive timeslice expiration instants for iperf in the receiver (fast sender).

Considering the facts that (1) with a nice value of 0, the timeslice is 100 ms, and (2) iperf might sleep to wait for data, most of the time intervals between consecutive timeslice expiration instants in Fig. 9 are between 100 ms and 200 ms. However, Fig. 10, the fast-sender case, shows another story. This is due to the fact that iperf is non-interactive most of the time with a fast sender (Fig. 8). Once iperf's timeslice expires, it is moved to the expired array and can only regain the CPU after all eight non-interactive background processes finish their timeslices. That is why the majority of the time intervals between consecutive timeslice expirations for iperf are greater than 900 ms.

5.2. Experiments over wide area networks from BNL to FNAL

We repeat our experiments over the wide area networks from BNL to FNAL. The experiment results also verify the claims of the previous sections. Table 8 shows the iperf experiment results in the receiver. Fig. 11 gives the comparison of CPU shares; it shows that the fairness issue also arises in wide area networking.

Table 8
Iperf experiment results in the receiver (BNL to FNAL)

Load   Scheduler   Throughput (Mbps)   CPU share (%)   Reinsertion count
BL0    WI          325                 75.877          713
       NI          304                 65.68           0
BL1    WI          277                 59.472          593
       NI          248                 47.063          0
BL2    WI          274                 58.996          588
       NI          195                 31.922          0
BL4    WI          278                 64.144          620
       NI          116                 19.645          0
BL8    WI          273                 58.788          586
       NI          79.8                9.717           0

Figs. 12 and 13 give the results of one random wide area network experiment from BNL to FNAL. The background load of the experiment is BL8. Fig. 12 gives iperf's sleep_avg in the receiver; it can be seen that iperf is again categorized as interactive all the time due to the network conditions. Fig. 13 shows the histogram of time intervals between consecutive timeslice expiration instants for iperf in the receiver. It gives similar results to Fig. 9.
Fig. 13. Histogram of time intervals between consecutive timeslice expiration instants for iperf in the receiver (WAN).

Table 9
Wait-for-packet sleep statistics for iperf data transmission experiments

Experiment          <2 ms (%)   <5 ms (%)   <10 ms (%)   <15 ms (%)   <20 ms (%)   Mean (ms)   Throughput (Mbps)
BNL -> FNAL (1)     68.32       83.82       97.79        99.84        99.88        2.2214      263
BNL -> FNAL (2)     68.72       85.08       98.85        99.92        99.95        2.0071      221
FNAL -> FNAL (1)    99.78       99.85       99.93        99.93        99.93        0.2285      383
FNAL -> FNAL (2)    99.70       99.79       99.88        99.88        99.89        0.2259      438
6. A possible solution

Our experiments and analysis described above have shown that the current interactivity classification mechanism is not effective in distinguishing non-interactive network processes from interactive processes, resulting in serious fairness/starvation problems. To summarize, the causes are: (1) network packets arrive at the receiver independently and discretely, and the "relatively fast" non-interactive network process might frequently sleep to wait for network packets; though each sleep lasts for a short period of time, the sleeps occur more than frequently enough to lead to interactive status; and (2) the current Linux interactivity mechanism allows a non-interactive network process to consume a high CPU share and at the same time be incorrectly categorized as interactive.

To resolve the interactivity vs. fairness issues there are two basic approaches. One approach is to completely overhaul the interactivity mechanism. However, the current mechanism has been proven effective for traditional non-networked applications, and major modifications would be likely to affect those applications; clearly, this approach might be complex and time-consuming. The second approach is to reduce or eliminate the sleep_avg updates triggered by short inter-packet sleeps under non-interactive conditions. We pursue the latter course.

Usually, network applications can be classified into the following categories:

(a) Interactive network applications, like ssh, telnet, and web browsing. Since those applications involve human interaction, the wait-for-packet sleeps in the receiver usually last for hundreds of milliseconds or even seconds, waiting for user inputs. For example, in [16], Etsion et al. report standard typing at a rate of about 8 characters per second; in the extreme case, if a packet were sent for each character typed, the inter-packet spacing would average around 125 ms.

(b) Non-interactive network applications. Some non-interactive network applications, like ftp, gridftp, and scp, involve bulk data transmission. (FTP implementations are usually multi-processed or multi-threaded: one process or thread handles the FTP control channel, which may be interactive, while others handle data transmission; here we mean FTP's data transmission processes/threads, and similarly for gridftp.) As explained above, due to the packet delivery nature of packet-switched networks, network packets arrive at the receiver independently and discretely, and the "relatively fast" network process in the receiver might frequently sleep to wait for network packets. Though each wait-for-packet sleep is short, they are very frequent. Iperf also belongs to this category. Table 9 gives the wait-for-packet sleep statistics for a group of data transmission experiments in Section 5; it shows that most wait-for-packet sleeps last for a few milliseconds or less.

(c) Multimedia network applications. For these applications, network packets are transmitted and received periodically; for example, VOIP packets are transmitted and received every 20 ms. These applications are categorized as "soft real-time", so other measures should be taken, regardless of the issues investigated here, to guarantee their CPU shares and responsiveness. Possibilities include: (1) in Linux 2.6, making use of chrt [27] to classify these applications as real-time (Linux 2.6 provides two real-time scheduling policies, SCHED_FIFO and SCHED_RR, which support soft real-time behavior [1,2]); (2) when developing these applications, specifically requesting real-time support (Linux 2.6 provides a family of system calls for this purpose [2], though such an approach might reduce application portability [28]); or (3) making use of a proportional-share scheduler [18,20] to provide protection between various classes of applications. This paper mainly addresses the interactivity vs. fairness issues for network applications of categories (a) and (b).

Table 9 gives us insight into how to distinguish interactive network applications from non-interactive ones: for a truly interactive application, the wait-for-packet sleeps usually last for tens or hundreds of milliseconds or more, whereas the inter-packet sleeps of bulk data transmission applications usually last for a few milliseconds or less. Accordingly, to resolve the interactivity vs. fairness issues in networked Linux systems, our strategy is to separate the two cases with a configurable threshold on the length of wait-for-packet sleeps, interactive_network_threshold.
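The text that spells out the exact form of the modification is not reproduced here, so the following is only a sketch of one plausible realization, assuming the rule is simply that a wait-for-packet sleep shorter than interactive_network_threshold is not credited to sleep_avg. The function name, the variable interactive_network_threshold_ms, and the millisecond units are our illustrative choices, not the authors' patch.

    #define MAX_BONUS      10
    #define MAX_SLEEP_AVG  1000                 /* ms */

    /* Assumed tunable; the text below discusses values of 0, 5, 10, 15 and 30 ms. */
    static unsigned int interactive_network_threshold_ms = 30;

    static int credit_network_sleep(int sleep_avg, int sleep_ms)
    {
        int b;

        /* Short inter-packet sleeps, typical of bulk transfers (Table 9),
         * are ignored and no longer inflate sleep_avg. */
        if (sleep_ms < (int)interactive_network_threshold_ms)
            return sleep_avg;

        /* Longer, human-scale sleeps are credited as before, per Eq. (5). */
        b = MAX_BONUS - sleep_avg * MAX_BONUS / MAX_SLEEP_AVG;
        if (b < 1)
            b = 1;
        sleep_avg += sleep_ms * b;
        return sleep_avg > MAX_SLEEP_AVG ? MAX_SLEEP_AVG : sleep_avg;
    }

With a default of 30 ms, sleeps of a few milliseconds (category (b) above) gain nothing, while the hundreds-of-milliseconds sleeps of ssh- or telnet-like applications (category (a)) are still fully credited.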
Over wide area networks, where the packet jitter is high, interactive_network_threshold could be configured even higher. Usually, high packet jitter implies low throughput, which would not cause serious fairness issues in the receiver; therefore, interactive_network_threshold need not be too high. In our implementation, the default interactive_network_threshold is set at 30 ms. If the system owner does not care about the interactivity vs. fairness issues at all, it can be set to 0. If processing of streaming media such as VOIP is competing with other system loads and has not been protected as suggested above, an interactive_network_threshold of 15 ms may be better.
Fig. 16. Iperf's sleep_avg in the receiver for experiments from BNL to FNAL: (a) interactive_network_threshold = 10 ms; (b) interactive_network_threshold = 30 ms.
We repeat the data transmission experiments described in Section 5 on the Linux kernel updated with the new interactivity parameter described above, and compare the new experiment data with those obtained in Section 5. The old experiments are prefixed with "O-", the new data with "N-".

Table 10 shows the iperf experiment results in the receiver for experiments over Fermilab's sub-networks. Since the fairness issue is not serious with the fast sender, the experiments are run only with the slow sender. The interactive_network_threshold is set to 5 ms. For better comparison and presentation, we show the comparisons of CPU shares in Fig. 14. It can be seen that, with the updated interactivity algorithm, iperf's CPU share decreases as the background load increases, and the reinsertion count of N-WI is much reduced compared with O-WI. Since interactive_network_threshold is set so low, it will not affect the scheduling of truly interactive network applications. The experiment results imply that our proposed solution is effective in resolving the fairness issues while maintaining the interactivity performance of truly interactive network applications.

Fig. 15 shows iperf's sleep_avg in the receiver with the updated interactivity algorithm for a randomly chosen experiment (interactive_network_threshold = 5 ms, BL8). Compared with Fig. 7, it can be seen that most of the time iperf is not categorized as interactive. When it is not, it does not gain extra runs at the expense of other non-interactive processes. This explains why iperf's CPU share is effectively decreased when the background load is increased, and it further verifies the effectiveness of our proposed solution. However, it can still be seen from Fig. 15 that iperf's sleep_avg might jump from a low value to a much higher value, leading to interactive status (as also seen in Fig. 8). This is caused by scheduling delay: when a low-dynamic-priority iperf wakes up upon packet arrival, it might wait on the runqueue for a relatively long time before it is scheduled to run, and that wait is fully credited to the sleep_avg. Since the scheduling delays of interactive network processes cannot be differentiated from those of non-interactive processes, the influence of this type of scheduling delay is hard to eliminate. This is also the reason that the CPU shares in the N-WI runs are higher than in NI.

Similar results are obtained in the experiments over the wide area networks from BNL to FNAL. Fig. 16 shows iperf's sleep_avg in the receiver for two random experiments from BNL to FNAL with the new interactivity algorithm. In Fig. 16(a), interactive_network_threshold is set to 10 ms, while it is set to 30 ms in Fig. 16(b). It can be seen that for wide area networks, since the packet jitter is higher, the interactive_network_threshold needs to be configured correspondingly higher. Setting interactive_network_threshold to 30 ms effectively improves the system's fairness, while not affecting the performance of truly interactive network applications. In Fig. 16, we also see scheduling delays causing jumps in sleep_avg.

7. Conclusions

Our research has shown that the current Linux interactivity mechanism is not effective in distinguishing non-interactive network processes from interactive network processes, and that it results in serious fairness/starvation problems. Mathematical analysis and experimental results have verified our conclusions. Further, we propose and test a simple scheduler modification to address the interactivity vs. fairness problems in networked Linux systems. Experimental results have demonstrated the effectiveness of our proposed solution. The improvements in fairness come at a cost: the network throughput for a given process may be reduced, while the CPU share, response time, or network throughput of other processes are improved. This will be a desirable trade-off in some environments, but perhaps not in all.

Acknowledgements

We thank the editor and reviewers for their comments, which helped improve the paper. Also, we would like to thank Dr. Dantong Yu and Dr. Dimitrios Katramatos of Brookhaven National Laboratory; without their sincere help, the wide area network experiments between BNL and FNAL would have been impossible.

References

[1] D.P. Bovet et al., Understanding the Linux Kernel, third ed., O'Reilly Press, 2005, ISBN 0-596-00565-2.
[2] R. Love, Linux Kernel Development, second ed., Novell Press, 2005, ISBN 0672327201.
[3] C.S. Rodriguez et al., The Linux(R) Kernel Primer: A Top-Down Approach for x86 and PowerPC Architectures, Prentice Hall PTR, 2005, ISBN 0131181637.
[4] www.kernel.org.
[5] "Goals, Design and Implementation of the new ultra-scalable O(1) scheduler", Linux Documentation, sched-design.txt.
[6] A. Silberschatz et al., Operating System Concepts, seventh ed., John Wiley & Sons, 2004, ISBN 0471694665.
[7] R. Love, Interactive kernel performance: kernel performance in desktop and real-time applications, in: Proceedings of the Linux Symposium, Ottawa, Canada, July 23-26, 2003.
[8] M. Mathis et al., Web100: extended TCP instrumentation for research, education and diagnosis, ACM Computer Communications Review 33 (3) (2003).
[9] T. Dunigan et al., A TCP tuning daemon, in: Proceedings of SuperComputing, 2002.
[10] M. Rio et al., A Map of the Networking Code in Linux Kernel 2.4.20, March 2004.
[11] J.C. Mogul et al., Eliminating receive livelock in an interrupt-driven kernel, ACM Transactions on Computer Systems 15 (3) (1997) 217-252.
[12] M.J. Bach, The Design of the UNIX Operating System, Prentice-Hall, 1986, ISBN 0132017997.
[13] U. Vahalia, UNIX Internals: The New Frontiers, Prentice Hall, 1995, ISBN 0131019082.
[14] J. Mauro et al., Solaris Internals: Core Kernel Architecture, first ed., Prentice Hall PTR, 2000, ISBN 0130224960.
[15] M.K. McKusick et al., The Design and Implementation of the FreeBSD Operating System, Addison-Wesley Professional, 2004, ISBN 0201702452.
[16] Y. Etsion et al., Process prioritization using output production: scheduling for multimedia, ACM Transactions on Multimedia Computing, Communications, and Applications 2 (4) (2006) 318-342.
[17] C.A. Waldspurger et al., Lottery scheduling: flexible proportional-share resource management, in: Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation, Monterey, CA, November 1994.
[18] P. Goyal et al., A hierarchical CPU scheduler for multimedia operating systems, in: Proceedings of the 2nd OSDI Symposium, October 1996.
[19] J. Nieh et al., Virtual-time round-robin: an O(1) proportional share scheduler, in: Proceedings of the 2001 USENIX Annual Technical Conference, USENIX, Berkeley, CA, 2001, pp. 245-259.
[20] K. Jeffay et al., Proportional share scheduling of operating system services for real-time applications, in: IEEE Real-Time Systems Symposium, Madrid, Spain, December 1998.
[21] D. Petrou et al., Implementing lottery scheduling: matching the specialisations in traditional schedulers, in: Proceedings of the 1999 USENIX Annual Technical Conference, Monterey, CA, USA, June 1999, pp. 1-14.
[22] http://kerneltrap.org/node/780.
[23] Y. Etsion et al., Effects of clock resolution on the scheduling of interactive and soft real-time processes, in: Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, June 2003, pp. 172-183.
[24] J. Davidson et al., Voice over IP Fundamentals, second ed., Cisco Press, 2006, ISBN 1587052571.
[25] V. Jacobson, Congestion avoidance and control, in: Proceedings of ACM SIGCOMM, Stanford, CA, August 1988, pp. 314-329.
[26] http://dast.nlanr.net/Projects/Iperf/.
[27] E. Siever et al., Linux in a Nutshell, fifth ed., O'Reilly Media, Sebastopol, CA, 2005, ISBN 0-596-00930-5.
[28] Y. Etsion et al., Desktop scheduling: how can we know what the user wants?, in: Proceedings of the 14th International Workshop on Network and Operating Systems Support for Digital Audio and Video, Cork, Ireland, 2004, pp. 110-115.

Wenji Wu holds a B.A. degree in Electrical Engineering (1994) from Zhejiang University (Hangzhou, PRC) and a doctorate in computer engineering (2003) from the University of Arizona (Tucson, USA). He is currently a Network Researcher at Fermi National Accelerator Laboratory. His research interests include high performance networking, optical networking, and network modeling and simulation.

Matt Crawford leads the Wide Area Systems group in Fermilab's Computing Division. He holds a bachelor's degree in Applied Mathematics and Physics from Caltech and a doctorate in Physics from the University of Chicago. He currently manages the Lambda Station project, and his professional interests lie in the areas of scalable data movement and access.