Thermal Aware Scheduling Paper
Abstract—High temperatures and fluctuating temperatures decrease component reliability and lifespan. This work proposes a proactive software-based thermal-aware scheduler to lower core temperature and its fluctuations. It proposes a Simple Time Derivative (STD) scheduler, which uses the time derivative of the core temperature as a predictor. Major heat-dissipating processes can be identified by their usage of integer arithmetic, float operations and other CPU performance counters. The “hot” processes are put to sleep for a short duration if the time derivative goes above an empirically defined threshold. This work evaluates STD using the FFT, SOR, LU, and Sparse benchmarks of the SciMark benchmark suite running on a desktop computer. We found up to a 5°C decrease in average/peak temperatures compared to the baseline approach (without any thermal scheduling). The execution penalties apply only to the hot processes and not to the whole system. For LU/Sparse the core stayed at 35°C or below for 100%/82% of time with STD vs. only 28%/19% of increase in run-time for the baseline. Furthermore, for the baseline the temperature went over 40°C for 16% of run-time vs. 0% for the STD. Holding the temperature lower has advantages in cooling energy reduction, particularly when several systems are running together in a room or in a server system. We also compared our results against the Simple Threshold approach. STD provided lower run-time penalties and energy consumption than the Simple Threshold strategy and marginally outperformed it in terms of temperature reduction. This research provides insight into the temperature reductions possible using a user-defined software approach and the corresponding penalties on the hot processes. The approach can be combined with air conditioning management techniques in server production systems to reduce energy consumption for any job mix where execution time is not paramount. The reduction in temperature and its variations also increases reliability and lifespan of the CPU chip.

Index Terms—process scheduling; thermal-aware; thermal sensors; proactive; predictor.

I. INTRODUCTION

Based on the amount and type of activity, processor power consumption produces significant heat and an increase in chip operating temperature. When the CPU temperature increases beyond a certain threshold (even at temperatures below the hardware cut-off), it decreases chip reliability and increases cooling costs for the CPU [1]. Previous work [2] shows that the failure rate of a chip doubles for every 10°C increase in temperature. A large number of thermal cycles, i.e. temporal fluctuations, can accelerate package fatigue and plastic deformations of chip materials, thereby causing permanent failure [3]. Thus, chip thermal management techniques can increase overall component reliability and lifespan.

We hypothesize that by proactively scheduling the thermally intensive processes (or “hot” processes), the rise of CPU temperature can be managed. The approach consists of two parts: (a) anticipating when the CPU is going to be at elevated temperatures and (b) identifying the hot processes. We use two approaches to anticipate when the CPU is going to be at elevated temperatures: (i) using a simple temperature threshold and (ii) using a Temperature Derivative Predictor (TDP), which is the time derivative of the core temperature, to predict the future CPU temperature. Major heat-dissipating processes can be identified by the use of integer arithmetic, float operations and other parameters as shown in [4]. If the TDP is higher than a predefined threshold, referred to as ThresholdGradient, the hot processes are put to sleep for a short duration. This approach is more relevant to a system with applications whose activity may cause a high rise in temperature over a relatively short duration, e.g. applications with many floating-point operations. To the best of our knowledge, this approach has not been used before.

The major contributions of this work are as follows:
• Proposed a software-based dynamic thermal management technique, Simple Time Derivative (STD), which uses time derivatives of core temperature to predict the future increase in temperature. Unlike the PID controller, which slows down the entire CPU, this approach slows only the major heat-dissipating process. The approach can be seamlessly integrated with PID and other hardware-based thermal techniques.
• The approach provides a finer tuning of temperature at the OS level for the purpose of higher component reliability and lifespan. Space applications, where long-term reliability and lifespan are required, are good target applications.
• Implementation of Baseline (Base) and Simple Threshold (Threshold) approaches for comparison.
• Study of the impact of sleep time, TDP threshold, polling frequency, and activation temperature on core temperatures.
• Evaluation of the proposed approaches using SciMark benchmarks.

The remainder of this paper is organized as follows. Section II discusses the proposed approach, Section III the experiments performed, Section IV the results and observations, Section V the selection of parameters, and Section VI the assumptions and known limitations of STD. Finally, we discuss related work in Section VII and conclusions and future work in Section VIII.
978-1-5386-3470-7/17/$31.00 ©2017 IEEE
Algorithm 1 The STD Algorithm
  Const ThresholdGradient
  Process P
  while true do
    T1 ← Read core temperature
    Wait δt time // Polling interval
    T2 ← Read core temperature
    TDP ← (T2 − T1)/δt
    if TDP > ThresholdGradient then
      Sleep P
    end if
  end while

II. APPROACH

We tried using a regression of T and dT/dt to predict the temperature. However, the historical bias introduced by the regression made it ineffective for predicting the rapidly changing temperature. After several trials we found a simple temperature derivative approach to work best.

We use two approaches to reduce the CPU temperature. The first is named the Simple Time Derivative scheduler and the second is called the Simple Threshold scheduler.

In the STD scheduler, outlined in Algorithm 1, we hypothesize that using the core temperature gradient (TDP) will be helpful in predicting the future temperature. It computes the rate of change of core temperature. Core temperatures are noted at two observation points δt time apart and their difference is divided by δt to get the rate of change of temperature, or the TDP. If the value of the predictor is greater than an empirically determined threshold gradient, the hot process is temporarily put to sleep. In an actual implementation, the hot processes can be identified as described in [5]. This approach reduces core temperature by proactively reducing processor activity even before the core temperature reaches the hardware threshold limit.

The Threshold approach reactively puts the process to sleep when the core temperature exceeds an empirically determined predefined threshold temperature.

The above approaches are simple software approaches over and above the hardware/kernel level approaches such as the PID controller/DVFS.

III. EXPERIMENTS

In order to evaluate the proposed approaches, we used the SciMark [6] benchmark suite. We selected the SciMark benchmarks since they perform numerical calculations, which have a significant impact on temperature [7]. The benchmarks used were Fast Fourier Transform (FFT), Jacobi Successive Over-Relaxation (SOR), Dense LU Factorization (LU), and Sparse Matrix Multiply (Sparse). These benchmark applications are floating point, memory, and integer intensive.

We conducted the experiments on a Dell OptiPlex 780 desktop with an Intel Core 2 Duo E7600 / 3.06 GHz, 4 GB RAM and a 320 GB HDD, running Ubuntu 9.10. The room temperature was steady at 68°F.

The scheduler was implemented in Java, which is also the language used by the SciMark benchmarks. For all the benchmark processes, priority was set to normal. We set the benchmark parameter Resolution Default to execute each benchmark for about ten (10) minutes for the baseline approach. The CPU was left to cool before starting the next experiment. No processes other than basic OS processes were running on the system.

We used on-chip core temperature sensors [8] to measure core temperature in Celsius at regular intervals and calculated the TDP. We used a Watts up PRO ES power meter to measure the power consumed by the CPU during execution of the benchmarks.

We compared the effect of three different strategies on core temperature and execution times: STD, Threshold and Base. The Base approach is when the benchmark process executes in the native mode.

The threshold gradient used with the STD approach and the threshold temperature used with the Threshold approach were determined empirically using several experiments with various configurations, with the aim of minimizing core temperature with minimal impact on program execution time. In order to find the scheduler configuration parameter values that result in optimal thermal benefits, we experimented with different polling intervals, sleep times and threshold gradient values. The polling interval is the period at which core temperature was polled using hardware sensors. When the STD scheduler is used, the ThresholdGradient is the empirically determined limit which, when exceeded by the observed TDP, triggers the process sleep for a short duration to help reduce the core temperature. Details of these tests and the corresponding observations are provided in Section V.

The threshold gradient values that we experimented with were 2, 4, 6, 8, and 10, and the sleep times were 250 ms, 500 ms, 750 ms, and 1 s. The polling intervals tested were 100 ms, 150 ms, 200 ms, 250 ms and 500 ms. The configuration parameters that resulted in optimum thermal benefits were a polling interval of 200 ms, a sleep time of 750 ms and a threshold gradient of 4. For the Threshold approach we used a threshold temperature of 35°C. All tests were then run with this configuration with the benchmark activity isolated to one core, using the taskset utility on Ubuntu.

To understand the behavior of the Threshold and STD strategies, we performed several sets of tests using various benchmarks. We compared the results for the three strategies using the metrics of peak core temperature, average core temperature, and execution time. The average core temperature was the average of all readings captured during benchmark execution. The peak core temperature was the maximum temperature observed during benchmark execution. The execution time of the application was the overall time taken to complete the execution of one instance of the benchmark program. The results of the experiments are shown in the graphs in the next section. It is important to emphasize that the run-time penalties are only for the hot processes and they are not system-wide penalties; indeed, other processes will speed up because of
Fig. 2. Variation of program execution time and temperature for four
benchmarks executed on a specific core with three strategies.
Fig. 1. Variation of core temperature over time with three strategies for LU
benchmark.
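
The STD loop of Algorithm 1 and the Threshold check from Section II can be sketched in Java, the language the text says the scheduler was written in. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the `sensor` and `sleepHotProcess` hooks, and the made-up temperature trace are hypothetical, and the unit of ThresholdGradient (taken here as °C per second) is not stated in the text. The constants are the best-performing settings reported in Section III.

```java
import java.util.function.DoubleSupplier;

/** Illustrative sketch only; names and units are assumptions, not the authors' code. */
final class ThermalSchedulerSketch {
    // Best-performing settings reported in Section III.
    static final double THRESHOLD_GRADIENT = 4.0; // unit assumed: degrees C per second
    static final long POLLING_INTERVAL_MS = 200;
    static final long SLEEP_TIME_MS = 750;
    static final double THRESHOLD_TEMP_C = 35.0;

    /** TDP = (T2 - T1) / dt, with dt in seconds (assumed unit). */
    static double tdp(double t1, double t2, double dtSeconds) {
        return (t2 - t1) / dtSeconds;
    }

    /** STD rule: act when the temperature rises faster than the gradient limit. */
    static boolean stdShouldSleep(double t1, double t2, double dtSeconds) {
        return tdp(t1, t2, dtSeconds) > THRESHOLD_GRADIENT;
    }

    /** Threshold rule: act when the current temperature exceeds a fixed limit. */
    static boolean thresholdShouldSleep(double tempC) {
        return tempC > THRESHOLD_TEMP_C;
    }

    /**
     * Runs a bounded number of Algorithm 1 iterations and returns how many times
     * the hot process was put to sleep. Both hooks are hypothetical stand-ins for
     * a real sensor read and a real process suspension.
     */
    static int runStd(DoubleSupplier sensor, Runnable sleepHotProcess, int iterations) {
        int sleeps = 0;
        for (int i = 0; i < iterations; i++) {
            double t1 = sensor.getAsDouble();
            try {
                Thread.sleep(POLLING_INTERVAL_MS); // wait one polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                break;
            }
            double t2 = sensor.getAsDouble();
            if (stdShouldSleep(t1, t2, POLLING_INTERVAL_MS / 1000.0)) {
                sleepHotProcess.run();
                sleeps++;
            }
        }
        return sleeps;
    }

    public static void main(String[] args) {
        double[] trace = {30.0, 31.0, 31.0, 31.0}; // made-up readings in degrees C
        int[] next = {0};
        int sleeps = runStd(
                () -> trace[next[0]++],
                () -> System.out.println("would suspend hot process for " + SLEEP_TIME_MS + " ms"),
                2);
        System.out.println("sleep decisions: " + sleeps);
    }
}
```

Passing the sensor and the sleep action as parameters keeps the decision rule separate from platform-specific code, so the same rule can be exercised against a recorded trace. Under the assumed unit, a 1°C rise within one 200 ms polling interval gives a TDP of 5, which exceeds the gradient of 4, while a flat reading does not.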