Mitigating The Post-Recovery Overhead in Fault Tolerant Systems
1 Introduction
The demand for computational power has been driving improvements in High Performance Computing (HPC), generally represented by clusters of computers running parallel applications. For these applications, finishing correctly and the time spent in execution become major issues when planning to perform their tasks on computer-based solutions. It is therefore reasonable to say that such applications commonly have two basic constraints: performance and availability (together known as performability [1]). In this area, fault tolerance plays an important role in providing high availability, isolating the application from the effects of faults. A fault tolerant solution must therefore take these two constraints into consideration when it is designed.

However, some fault tolerant solutions may change the system configuration during the recovery process. This happens when faults are handled using only the active cluster resources, i.e., the application continues executing with one less node while keeping the same number of processes, causing an unplanned process distribution. Parallel systems are designed to achieve a certain performance level. In order to satisfy this level, the distribution of processes across the nodes takes into consideration factors such as application characteristics, CPU power, and memory availability. Whenever this distribution changes, the system performance may degrade, because a node has been lost while the number of processes is kept.

In previous work, our group presented RADIC [2, 3] (Redundant Array of Distributed Independent Fault Tolerance Controllers), a flexible, decentralized, transparent and scalable architecture that provides fault tolerance to message-passing based parallel systems
using rollback-recovery techniques. RADIC acts as a layer that isolates an application from possible cluster failures. In its basic protection level, RADIC does not demand any passive resources to perform its activities; thus, after a failure the controller recovers the faulty process on some existing node of the cluster. This behavior may lead to a recovery side-effect consisting of system degradation in the post-recovery execution, which may cause performance loss depending on the application characteristics.

In this paper we present RADIC II, which incorporates a new protection level into RADIC through a dynamic redundancy functionality, allowing the recovery side-effects to be mitigated or avoided. This functionality allows a changed system configuration to be restored, and it can avoid configuration changes altogether. RADIC II allows new spare nodes to be inserted dynamically during the application execution in order to replace failed ones. Moreover, RADIC II provides transparent management of spare nodes, being able to request and use them without any administrator intervention and without maintaining any centralized information about these spare nodes.

We evaluate our solution by performing several experiments that compare the effects of recovery with and without available spare nodes. These experiments observe two measures: the overall execution time and the throughput of an application. We executed a matrix product program, using an SPMD approach implementing Cannon's algorithm, and an N-Body particle simulation using a pipeline paradigm. The results clearly show the benefits of using our solution in these scenarios.

This paper is organized as follows. The next section presents related work. The RADIC architecture is presented in section 3. Section 4 explains the recovery side-effects. Our proposal is described in section 5, and the experimental results are presented in section 6. Finally, in section 7 we present our conclusions and future work.
2 Related Work
There are several projects and research efforts providing fault tolerance to message-passing parallel systems. Some of them use spare nodes in order to recover from a failure. In the following we present some of them and their characteristics.

MPICH-V [4] is a framework that implements four rollback-recovery protocols. MPICH-V provides automatic and transparent fault tolerance for MPI applications using a runtime environment formed by several components: Dispatcher, Channel Memories, Checkpoint Servers, and Computing/Communicating nodes. During the recovery process it uses spare nodes but, unlike our proposal, it manages these spares in a centralized way and does not allow dynamic insertion of them.

FT-Pro [5] is a fault tolerance solution based on a combination of rollback-recovery and failure prediction to take some action at each decision point. Using this approach, the solution aims to keep the system performance while avoiding excessive checkpoints. It currently supports three different preventive actions: process migration, coordinated checkpointing using central checkpoint storage, and no action. Each preventive action is selected dynamically in an adaptive way, intending to reduce the overhead of fault tolerance. FT-Pro works with an initially determined and static number of spare nodes.

The SCore-D [6] checkpoint solution uses spare nodes in order to provide fault tolerance through a distributed coordinated checkpoint system. This system uses parity generation that guarantees checkpoint reconstruction in case of failure. The spare nodes are defined at application start and are consumed until their number reaches zero. This solution does not have a mechanism allowing dynamic spare node insertion or node replacement. Furthermore, it performs the recovery process relying on a central agent.

MPI/FT [7] is a non-transparent solution for fault tolerance in MPI applications. The user can choose the moment to take a checkpoint according to the application characteristics. It provides distinct fault tolerance models for each application execution model, e.g. master-worker or SPMD. In some models, the recovery procedure relies on the interaction of two elements, a central coordinator and self-checking threads (SCTs), which use spare nodes taken from a pool. As this solution does not allow dynamic insertion of new spares, the application will fail after the exhaustion of this pool.

The LAM/MPI [8, 9] implementation uses a component architecture called System Services Interface (SSI) that allows checkpointing an MPI application using a coordinated checkpoint approach. This feature is not transparent, needing a back-end checkpoint system. In case of failure, all application nodes stop, a restart command is needed, and LAM/MPI demands that the faulty node be replaced. Unlike our proposal, this procedure is neither automatic nor transparent, and cannot be done during the application execution.
3 The RADIC Architecture
Figure 2. Part of a RADIC cluster showing the relationship between observers and protectors.
The RADIC architecture is based on two kinds of elements: protectors and observers. Every node of the parallel computer has one dedicated protector, and there is one dedicated observer attached to every application process. Each active protector communicates with two active protectors assumed as neighbors: an antecessor and a successor. Therefore, all active protectors establish a protection system throughout the nodes of the parallel computer, using a heartbeat/watchdog scheme in order to detect faults. Moreover, each active protector is responsible for storing the checkpoints and message logs of the processes running on its successor node. The protectors are also in charge of starting the recovery process.

Observers are RADIC processes attached to each application process. From the RADIC operational point of view, an observer and its application process compose an inseparable pair. The group of observers implements the message-passing mechanism for the parallel application. Furthermore, each observer performs some tasks related to fault tolerance: it takes checkpoints (according to a specified period) and event logs of its application process and sends them to a protector running in another node, namely the antecessor protector, and it detects communication failures with other processes and with its protector. In the recovery phase, the observer manages the messages from the message log of its application process and establishes a new protector. In order to mask faults, each observer maintains a mapping table, called the radictable, indicating the location of all application processes and their respective protectors, and updates this table as needed.

Fig. 2 depicts the relationship between observers and protectors in a part of a cluster with three nodes (N0–N2). In this figure, we can see that each observer (O0–O2) connects to a protector in a neighbor node, to which it sends the checkpoints and event logs of its application process. Each protector (T0–T2) receives the connection of one or more observers and connects with another protector in the neighbor node. In the figure, each observer has an arrow connecting it to an antecessor protector; similarly, each protector receives a connection from its successor. A protector only communicates with its immediate neighbors. The observers consider the protector running in the same node as their local protector; similarly, the protector considers the observers running in its node as local observers.

3.2 The RADIC Recovery Process

In normal operation, the protectors monitor the cluster nodes, and the observers take care of the checkpoints and message logs of the distributed application processes. Together, protectors and observers work as a distributed fault tolerance controller. When protectors and observers detect a failure, both act to reestablish the consistent state of the distributed parallel application and to reestablish the structure of the RADIC controller.
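To make the detection mechanism concrete, the following is a minimal, self-contained sketch of the heartbeat/watchdog scheme described above. It is not RADICMPI code: the class, the callback and the timing constants (HEARTBEAT_PERIOD, WATCHDOG_TIMEOUT) are illustrative assumptions, and real protectors exchange heartbeats over the network rather than through a local method call.

import threading
import time

HEARTBEAT_PERIOD = 1.0    # seconds between heartbeats (assumed value)
WATCHDOG_TIMEOUT = 3.0    # silence tolerated before declaring a fault (assumed)

class ProtectorStub:
    """Toy protector holding only the state needed for fault detection."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.last_heartbeat = time.monotonic()
        self.successor_alive = True

    def receive_heartbeat(self):
        # Called whenever the successor protector's heartbeat arrives.
        self.last_heartbeat = time.monotonic()

    def watchdog_loop(self, on_failure):
        # Runs in the antecessor: if the successor stays silent longer than
        # WATCHDOG_TIMEOUT, start the recovery procedure for its processes.
        while self.successor_alive:
            time.sleep(HEARTBEAT_PERIOD)
            if time.monotonic() - self.last_heartbeat > WATCHDOG_TIMEOUT:
                self.successor_alive = False
                on_failure(self.node_id)

# Example: the protector on N0 watching its successor node.
if __name__ == "__main__":
    t0 = ProtectorStub(node_id=0)
    watcher = threading.Thread(
        target=t0.watchdog_loop,
        args=(lambda n: print(f"protector T{n}: successor failed, recovering"),),
        daemon=True)
    watcher.start()
    for _ in range(2):          # two heartbeats arrive, then the node "dies"
        time.sleep(1.0)
        t0.receive_heartbeat()
    time.sleep(5.0)             # silence long enough to trip the watchdog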
Figure 3. A RADIC configuration before (a) and after (b) a fault recovery.
The protectors and observers implicated in the failure take simultaneous atomic actions in order to reestablish the integrity of the RADIC controller structure. Table 1 lists the atomic activities of each element for a fault in a generic node Ni.

When using the RADIC basic protection level, the system configuration changes after the recovery, and the recovered process runs in the same node as its protector. Fig. 3a shows a RADIC configuration before the fault and Fig. 3b shows it after the recovery process. In these pictures we can see that node N2 failed and process P2 was recovered in the node of its protector (T1). As the figure shows, after the recovery the system configuration has changed and node N1 has two processes in execution. This protection level is well suited for applications with dynamic workload balancing.
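The basic recovery just described can be summarized by the hedged sketch below. The function and data-structure names (restart_from_checkpoint, replay_log, the radictable as a plain dictionary) are illustrative stand-ins rather than RADICMPI's API; the sketch only shows how the antecessor protector re-launches the failed process in its own node and how the process location table is updated afterwards.

def restart_from_checkpoint(pid, checkpoint, on_node):
    # Stand-in for re-launching the process image stored in `checkpoint`.
    print(f"P{pid}: restarted on N{on_node} from checkpoint {checkpoint!r}")

def replay_log(pid, message_log):
    # Stand-in for reprocessing the messages logged since the last checkpoint.
    for msg in message_log:
        print(f"P{pid}: replaying logged message {msg!r}")

def recover_locally(protector_node, stored_state, radictable):
    """Basic protection level: recover every process of the failed node in the
    protector's own node (no spare), so that node ends up hosting two
    application processes afterwards."""
    for pid, (checkpoint, message_log) in stored_state.items():
        restart_from_checkpoint(pid, checkpoint, on_node=protector_node)
        replay_log(pid, message_log)
        radictable[pid]["node"] = protector_node      # relocated process
    return radictable

# Example matching Fig. 3: N2 fails and T1 recovers P2 on its own node N1.
radictable = {1: {"node": 1, "protector": 0},
              2: {"node": 2, "protector": 1}}
recover_locally(protector_node=1,
                stored_state={2: ("ckpt_P2", ["m17", "m18"])},
                radictable=radictable)
print(radictable)   # P2 is now mapped to N1, which hosts both P1 and P2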
4 Recovery Side-effects
As we can see in Fig. 3, when a fault occurs, another active node executing an application process is responsible for receiving the state data of the failed process (its checkpoint and log) and re-launching the failed process in its own node. Hence, beyond the overhead imposed by the fault tolerance activity [11], the post-recovery execution may be affected by this behavior. In Fig. 3b it is easy to perceive that in node N1 both processes will suffer a slowdown in their execution, and the memory usage in this node will increase, possibly leading to disk swapping. Moreover, these processes will be accessing the same protector, competing to send their redundancy data. Supposing that a previous process distribution was made aiming to achieve a certain performance level according to the cluster characteristics, this condition becomes very undesirable if the application is not able to adapt itself to workload changes across the nodes.

In order to quantify these side-effects, we performed some experiments using an SPMD program based on Cannon's algorithm, which presents a tight coupling between the processes. Figure 4 shows a chart with the results of these executions. The program was executed in a nine-node cluster, performing a matrix product between two 1500x1500 matrices. We injected one fault at different moments of the execution.
[Chart of Fig. 4: 1500x1500 matrix product on 9 nodes, SPMD Cannon algorithm, 160 loops, checkpoint every 60 s, one injected fault. The fault-free execution takes 434.57 s; the post-recovery overheads are 72.61%, 49.37% and 27.05% for faults injected at 25%, 50% and 75% of the execution, respectively.]
Figure 4. Execution times of an SPMD program after recovering from a fault injected at distinct moments.
The fault was injected at 25% of the execution, when we believe that fault tolerance starts to be needed; at the middle of the execution; and at 75%, when we believe that fault tolerance becomes crucial. We adjusted the program to repeat the computation 160 times in order to enlarge the execution time. In each group of Fig. 4, the first bar represents the fault-free execution time and the second one the execution with the fault. As we can see, the overhead generated in the post-recovery execution is very large when the fault occurs at 25%, and still considerable when the fault occurs at 75%.
process P2. Finally, the old process P2 terminates itself, and the message forwarding continues with the updated information of the replacement node.
The previous approach is useful for restoring the original performability of a system using the RADIC protection level. However, it depends on a user intervention to insert a replacement node into the system. In order to avoid this situation, RADIC II incorporates a transparent management of spare nodes, which is able to request and use them without any administrator intervention and without maintaining any centralized information about these spare nodes. These spares may be present when the application starts or may be included during the application execution using the mechanism described as follows.

In this approach, we insert a node running a protector in spare mode. In such mode, the protector is not part of the fault detection scheme, so that it generates no overhead. When the protector starts, it searches for some active node running the application in the protectors' fault detection scheme and starts a communication protocol with it, requesting its addition to a structure included in the protectors called the spare table. The spare table contains information about the current spare nodes and their states. The working protector that receives this request checks whether the new spare data is already in its spare table. If not, this protector adds the new spare data and forwards the request to its neighbors, passing the new spare information along. Each protector performs the same tasks until it receives already known spare node data, finalizing the message forwarding process. The flow in Fig. 6a clarifies this announcement procedure.

Fig. 7a depicts a RADIC II configuration after this process, with one idle spare. We can see that the spare protector (Ts) is inactive (represented in light gray) and is not part of the fault detection scheme, thus generating no unnecessary overhead in the system. In each active protector, the spare information is represented as a small light gray triangle.
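The decentralized announcement of a new spare can be illustrated with the short sketch below, assuming the active protectors form a ring and each keeps its own spare table. The class and method names are hypothetical, and in RADIC II the forwarding happens through protector-to-protector messages rather than direct method calls.

class Protector:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbor = None          # successor protector in the ring
        self.spare_table = {}         # spare id -> number of observers hosted

    def announce_spare(self, spare_id):
        # Stop forwarding once the spare is already known: this terminates
        # the propagation without any central registry of spares.
        if spare_id in self.spare_table:
            return
        self.spare_table[spare_id] = 0          # 0 observers => idle spare
        self.neighbor.announce_spare(spare_id)

# Build a small ring T0 -> T1 -> T2 -> T0 and announce a new spare Ns.
protectors = [Protector(i) for i in range(3)]
for i, p in enumerate(protectors):
    p.neighbor = protectors[(i + 1) % 3]

protectors[1].announce_spare("Ns")    # the spare contacts any active protector
print([p.spare_table for p in protectors])   # every protector now knows Ns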
Figure 6. Flows representing: (a) the new spare announcement procedure and (b) the recovery task using spare nodes.
Figure 7. A RADIC II configuration using one spare, before (a) and after (b) a fault recovery.
5.3 Recovering Using a Spare Node

In RADIC II, the recovery process is slightly different from the process described in section 3.2, in order to contemplate the use of spare nodes. When a protector detects a fault, it first searches for spare data in its spare table. If it finds some idle spare, i.e. one whose number of observers reported in the spare table is still zero, it starts a spare use protocol. In this protocol, the active protector communicates with the spare asking for its situation, i.e. how many observers are running on its node, in order to confirm its state. The active protector repeats this procedure until it finds an idle spare, to which it then sends a request for use. If the search for an idle spare yields none, RADIC II proceeds with the regular RADIC recovery process described in section 3. From the request moment on, the spare will not accept requests from other protectors. After receiving the request confirmation, the protector commands the spare to join the protectors' fault detection scheme. Once this step finishes, the protector sends the failed process checkpoint and log to the spare and commands it to recover the failed process using the regular RADIC recovery procedure. The flow in Fig. 6b clarifies this entire process. Fig. 7b shows the RADIC II system configuration after the recovery of a fault in node N3. The spare node assumed the process P3, maintaining the original system distribution; it is also possible to see that the protector T3 has already updated the spare state information.
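The recovery flow of Fig. 6b can be sketched as follows, reusing the idea of a per-protector spare table from the announcement sketch above. The function names (ask_spare_load, recover_with_spare) and the toy data are assumptions made for illustration; the real protocol exchanges the checkpoint and message log over the network and only then commands the spare to recover the process.

def ask_spare_load(spare_id):
    # Stand-in for asking the spare protector how many observers it hosts.
    return 0          # in this toy example the spare is still idle

def recover_with_spare(spare_table, failed_pid, checkpoint, message_log):
    """spare_table maps spare id -> number of observers it is believed to host."""
    for spare_id, observers in spare_table.items():
        # Only spares reported as idle (zero observers) are candidates, and the
        # candidate's state is confirmed before it is reserved.
        if observers == 0 and ask_spare_load(spare_id) == 0:
            spare_table[spare_id] += 1            # reserved: no longer idle
            print(f"{spare_id}: joining the protectors' fault detection scheme")
            print(f"{spare_id}: recovering P{failed_pid} from {checkpoint!r} "
                  f"and replaying {len(message_log)} logged messages")
            return spare_id
    # No idle spare was found: fall back to the regular RADIC local recovery.
    print(f"no idle spare available, recovering P{failed_pid} locally")
    return None

# Example matching Fig. 7b: N3 fails and its process P3 is recovered on Ns.
spare_table = {"Ns": 0}
recover_with_spare(spare_table, failed_pid=3,
                   checkpoint="ckpt_P3", message_log=["m41", "m42"])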
5.4 Changes in the Fault-Masking Task

RADIC II inserts a small indeterminism in locating a failed process, because this process may have been recovered in any spare of the configuration, and the other processes may not be able to locate the recovered process in this spare. Hence, we had to change the original RADIC fault-masking task, which is based on a heuristic to determine where a faulty process will be after the recovery. This heuristic was very efficient in RADIC because its recovery process was deterministic, i.e. the failed process always recovers in its protector's node. The applied change consists in the following: if the observer does not find the recovered process by asking its original protector, it searches the spares, looking for the faulty process in the spare table. However, only the protectors have the spare table information, not the observers. Consequently, we increased the communication between the observers and the protector running in their node, the local protector. In order to execute the new fault-masking protocol, the observer asks the local protector for the spare table and seeks the recovered process among the spares listed in this table.
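A hedged sketch of this extended fault-masking lookup is shown below. It assumes the local protector simply hands its spare table to the observer; locate_at and the example data are illustrative and do not correspond to actual RADICMPI calls.

def locate_at(node, pid, placement):
    # Stand-in for contacting `node` and asking whether it hosts process pid;
    # `placement` plays the role of the real cluster state in this toy example.
    return placement.get(pid) == node

def find_recovered_process(pid, radictable, local_protector_spares, placement):
    # 1) Original RADIC heuristic: the process recovers on its protector's node.
    candidate = radictable[pid]["protector_node"]
    if locate_at(candidate, pid, placement):
        return candidate
    # 2) RADIC II extension: ask the local protector for its spare table and
    #    probe each known spare for the recovered process.
    for spare_node in local_protector_spares:
        if locate_at(spare_node, pid, placement):
            return spare_node
    return None     # still unknown: the observer retries later

# Example matching Fig. 7b: P3 was recovered on the spare Ns, not on T3's node.
radictable = {3: {"protector_node": "N2"}}
placement = {3: "Ns"}                    # where P3 actually runs now
print(find_recovered_process(3, radictable, ["Ns"], placement))   # -> "Ns"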
6 Experimental Results
In order to test our proposal, we modified the RADIC MPI prototype, called RADICMPI, including the dynamic redundancy functionality. Currently, RADICMPI incorporates basic MPI functions, including blocking and non-blocking point-to-point communications. The experiments were conducted on a collection of twelve 1.9 GHz Athlon XP 2600+ PC workstations running Linux Fedora Core 4 with kernel 2.6.17. Each workstation had 256 MB of memory and a 40 GB local disk. The nodes were interconnected by a 100BaseT Ethernet switch.

We performed experiments running well-known applications in different contexts, representing two distinct approaches of common parallel applications and comparatively measuring the effects of using or not using spare nodes. We applied two kinds of parallel programs: a matrix product using the SPMD paradigm and an N-Body particle simulation using non-blocking functions in a pipeline paradigm based on the example presented by Gropp [14].

Fig. 8a shows the chart with the results of the SPMD matrix product. We repeated the experiment shown in section 4, in this case also executing with the dynamic redundancy functionality and one spare in order to compare with the other results. In this chart we can see the overhead caused by a recovery without a spare (the middle column in each fault moment) versus using a spare (the right column in each fault moment), with one fault occurring at different moments. The overhead when not using spares is inversely proportional to the moment when the fault occurs, generating greater overheads (reaching 72.61% in the worst case analyzed) for premature faults, while when using a spare the overhead remains constant and low (approximately 14%) regardless of the moment of the fault.

As many current parallel applications are intended to run continuously in a 24x7 scheme, we performed an experiment intended to represent the behavior of these applications.
Figure 8. Experimental results of (a) the SPMD matrix product and (b) the N-Body simulation in a pipeline.
In this experiment, we continuously executed the N-Body particle simulation in a ten-node pipeline and injected three faults at different moments and in different machines, measuring the throughput of the program in simulation steps per minute. We analyzed four situations: (a) a failure-free execution, used as a baseline; (b) three faults recovered without spares in the same node; (c) three faults recovered without spares in different nodes; and (d) three faults recovered with spares. Fig. 8b shows the result chart of this experiment.

In this experiment we can perceive the influence of the application kind on the post-recovery execution. When the three faults are recovered in different nodes, the application's throughput suffers an initial degradation, but subsequent faults change it only a little. This behavior occurs because of the pipeline arrangement: the degradation of the node containing the second recovered process is masked by the delay caused by the node of the first recovered process. This assumption is confirmed when all faulty processes are recovered in the same node: we can then perceive a degradation of the throughput after each failure. When executing with spare nodes, we see that after a quick throughput degradation the system returns to the original simulation step rate. We also see that the penalization imposed by the recovery process using a spare is greater than that of the regular RADIC process, but this loss is quickly compensated by the restored throughput in the remaining execution. The large message log to be processed and the reuse of the original RADIC recovery implementation in RADICMPI cause the large recovery overhead seen with the spare in the third fault, since the checkpoint is unnecessarily sent twice.
7 Conclusions
In this paper we argued that, in order to tolerate a fault, some message-passing based solutions may change the system configuration at recovery time, i.e., they respawn the failed process in another active node of the application, which changes the original process distribution. Moreover, we showed that this change of configuration may lead to system degradation, generally meaning overall performance loss, as demonstrated by experiments run over the RADIC architecture. This project was undertaken to design a fault tolerant architecture that avoids the side effects of the recovery process and to evaluate its effectiveness. We used RADIC as the basis for
the development of RADIC II, an architecture that adds another protection level to RADIC by inserting a flexible dynamic redundancy feature. RADIC II is able to restore the original system configuration, re-establishing the initial performability of a system by inserting replacement nodes that assume the faulty processes. In order to avoid system configuration changes, we also implemented a transparent management of spare nodes, keeping the information about them totally decentralized. This functionality makes it possible to insert new spare nodes dynamically during the application execution or to start the application with them already present.

We conclude that the use of a flexible redundancy scheme is a good approach to mitigate or avoid the effects of system configuration changes. Our solution has shown itself to be effective even for faults occurring near the end of the application. RADIC II also shows a small overhead caused by the recovery process. The experimental results have shown execution times and throughput values very close to those of a failure-free execution.

The ideal number of spare nodes and their ideal allocation throughout the cluster are still open questions. Further research might investigate how to achieve better results by allocating spare nodes according to requirements such as the acceptable degradation level or the memory limits of a node. Moreover, a study of the possible election policies to be used in the node replacement feature will be useful to determine the ideal behavior in these situations, considering factors like load balance or time to recover.
References
1. J. F. Meyer. On evaluating the performability of degradable computing systems. IEEE Transactions on Computers, 29(8):720–731, 1980.
2. A. Duarte, D. Rexachs, and E. Luque. A distributed scheme for fault tolerance in large clusters of workstations. In Proceedings of Parallel Computing 2005 (ParCo 2005), in press, 2005.
3. A. Duarte, D. Rexachs, and E. Luque. An intelligent management of fault tolerance in cluster using RADICMPI. In Proceedings of the 13th European PVM/MPI Users' Group Meeting, volume 4192 of Lecture Notes in Computer Science, pages 150–157. Springer, Berlin/Heidelberg, 2006.
4. A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello. MPICH-V project: a multiprotocol automatic fault-tolerant MPI. International Journal of High Performance Computing Applications, 20(3):319, 2006.
5. Y. Li and Z. Lan. Exploit failure prediction for adaptive fault-tolerance in cluster computing. In Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), volume 1, pages 531–538, 16–19 May 2006.
6. M. Kondo, T. Hayashida, M. Imai, H. Nakamura, T. Nanya, and A. Hori. Evaluation of checkpointing mechanism on SCore cluster system. IEICE Transactions on Information and Systems, 86(12):2553–2562, 2003.
7. R. Batchu, Y. S. Dandass, A. Skjellum, and M. Beddhu. MPI/FT: a model-based approach to low-overhead fault tolerant message-passing middleware. Cluster Computing, 7(4):303–315, 2004.
8. G. Burns, R. Daoud, and J. Vaigl. LAM: an open cluster environment for MPI. In Proceedings of the Supercomputing Symposium, pages 379–386, 1994.
9. J. M. Squyres and A. Lumsdaine. A component architecture for LAM/MPI. In Proceedings of the 10th European PVM/MPI Users' Group Meeting, Lecture Notes in Computer Science, pages 379–387, Venice, Italy, September/October 2003. Springer-Verlag.
10. E. N. (Mootaz) Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375–408, 2002.
11. S. Rao, L. Alvisi, and H. M. Vin. The cost of recovery in message logging protocols. IEEE Transactions on Knowledge and Data Engineering, 12(2):160–173, 2000.
12. K. Nagaraja, G. Gama, R. Bianchini, R. P. Martin, W. Meira, and T. D. Nguyen. Quantifying the performability of cluster-based services. IEEE Transactions on Parallel and Distributed Systems, 16(5):456–467, 2005.
13. P. Jalote. Fault Tolerance in Distributed Systems. PTR Prentice Hall, 1994, p. 142.
14. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.