Mitigating The Post-Recovery Overhead in Fault Tolerant Systems
1 Introduction
The demand for computational power has been driving improvements in High Performance Computing (HPC), generally represented by clusters of computers running parallel applications. For these applications, finishing correctly and the time spent in execution become major issues when planning to perform their tasks on computer-based solutions. It is therefore reasonable to say that such applications commonly have two basic constraints: performance and availability (together known as performability [1]). In this area, fault tolerance plays an important role in providing high availability, isolating the application from the effects of faults. A fault tolerant solution must therefore take these two constraints into consideration when it is designed.

However, some fault tolerant solutions may change the system configuration during the recovery process. This happens when faults are handled using only the active cluster resources, i.e., the application continues executing with one less node while keeping the same number of processes, causing an unplanned process distribution. Parallel systems are designed to achieve a certain performance level. In order to satisfy this level, the distribution of processes across the nodes takes into consideration factors such as application characteristics, CPU power, and memory availability. Whenever this distribution changes, the system performance may degrade, because a node has been lost while the number of processes is kept.

In previous work, our group presented RADIC [2, 3] (Redundant Array of Distributed Independent Fault Tolerance Controllers), a flexible, decentralized, transparent and scalable architecture that provides fault tolerance to message-passing based parallel systems
using rollback-recovery techniques. RADIC acts as a layer that isolates an application from possible cluster failures. In its basic protection level, RADIC does not demand any passive resources to perform its activities; thus, after a failure the controller recovers the faulty process on some existing node of the cluster. This behavior may lead to a recovery side-effect consisting of system degradation in the post-recovery execution, which may cause performance loss depending on the application characteristics.

In this paper we present RADIC II, which incorporates a new protection level into RADIC through a dynamic redundancy functionality, allowing the recovery side-effects to be mitigated or avoided. This functionality allows a changed system configuration to be restored, and it can avoid configuration changes altogether. RADIC II allows new spare nodes to be inserted dynamically during the application execution in order to replace failed ones. Moreover, RADIC II provides transparent management of spare nodes, being able to request and use them without any administrator intervention and without maintaining any centralized information about these spare nodes.

We evaluate our solution by performing several experiments that compare the effects of recovery with and without available spare nodes. These experiments observe two measures: the overall execution time and the throughput of an application. We executed a matrix product program, using an SPMD approach implementing Cannon's algorithm, and an N-Body particle simulation using a pipeline paradigm. The results clearly show the benefits of using our solution in these scenarios.

This paper is organized as follows. The next section presents related work. The RADIC architecture is presented in section 3. Section 4 explains the recovery side-effects. Our proposal is described in section 5, and the experimental results are presented in section 6. Finally, in section 7 we present our conclusions and future work.
2 Related Work
There are several projects and research efforts providing fault tolerance to message-passing parallel systems. Some of them use spare nodes in order to recover from a failure. In the following we present some of them and their characteristics.

MPICH-V [4] is a framework that implements four rollback-recovery protocols. MPICH-V provides automatic and transparent fault tolerance for MPI applications using a runtime environment formed by several components: Dispatcher, Channel Memories, Checkpoint Servers, and Computing/Communicating nodes. During the recovery process it uses spare nodes but, unlike our proposal, it manages these spares in a centralized way and does not allow dynamic insertion of them.

FT-Pro [5] is a fault tolerance solution based on a combination of rollback-recovery and failure prediction to take some action at each decision point. Using this approach, the solution aims to keep the system performance while avoiding excessive checkpoints. It currently supports three different preventive actions: process migration, coordinated checkpointing using central checkpoint storage, and no action. Each preventive action is selected dynamically in an adaptive way, intending to reduce the overhead of fault tolerance. FT-Pro works with an initially determined and static number of spare nodes.

The SCore-D [6] checkpoint solution uses spare nodes in order to provide fault tolerance through a distributed coordinated checkpoint system. This system uses parity generation that guarantees checkpoint reconstruction in case of failure. The spare nodes are defined at application start and are consumed until their number reaches zero. This solution does not have a mechanism allowing dynamic spare node insertion or node replacement. Furthermore, it performs the recovery process relying on a central agent.

MPI/FT [7] is a non-transparent solution for fault tolerance in MPI applications. The user can choose the moment to take a checkpoint according to the application characteristics. It provides distinct fault tolerance models for each application execution model, e.g. master-worker or SPMD. In some models, the recovery procedure relies on the interaction of two elements, a central coordinator and self-checking threads (SCTs), which use spare nodes taken from a pool. As this solution does not allow dynamic insertion of new spares, the application will fail after the exhaustion of this pool.

The LAM/MPI [8, 9] implementation uses a component architecture called System Services Interface (SSI) that allows checkpointing an MPI application using a coordinated checkpoint approach. This feature is not transparent, needing a back-end checkpoint system. In case of failure, all application nodes stop, a restart command is needed, and LAM/MPI demands that the faulty node be replaced. Unlike our proposal, this procedure is neither automatic nor transparent, and cannot be done during the application execution.
3 The RADIC Architecture
Figure 2. Part of a RADIC cluster showing the relationship between observers and protectors.
The RADIC architecture is based on two kinds of elements: protectors and observers. Every node of the parallel computer has one dedicated protector, and there is one dedicated observer attached to every application process. Each active protector communicates with two active protectors assumed as neighbors: an antecessor and a successor. Therefore, all active protectors establish a protection system throughout the nodes of the parallel computer, using a heartbeat/watchdog scheme in order to detect faults. Moreover, each active protector is responsible for storing the checkpoints and message logs of the processes running on its successor node. The protectors are also in charge of starting the recovery process.

Observers are RADIC processes attached to each application process. From the RADIC operational point of view, an observer and its application process compose an inseparable pair. The group of observers implements the message-passing mechanism for the parallel application. Furthermore, each observer performs some tasks related to fault tolerance: it takes checkpoints (according to a specified period) and event logs of its application process and sends them to a protector running in another node, namely the antecessor protector, and it detects communication failures with other processes and with its protector. In the recovery phase, the observer manages the messages from the message log of its application process and establishes a new protector. In order to mask faults, each observer maintains a mapping table, called the radictable, indicating the location of all application processes and their respective protectors, and updates this table as needed.

Fig. 2 depicts the relationship between observers and protectors in a part of a cluster with three nodes (N0–N2). In this figure, we can see that each observer (O0–O2) connects to a protector in a neighbor node, to which it sends the checkpoints and event logs of its application process. Each protector (T0–T2) receives the connection of one or more observers and connects with another protector in the neighbor node. In the figure, each observer has an arrow connecting it to an antecessor protector; similarly, each protector receives a connection from its successor. A protector only communicates with its immediate neighbors. The observers consider the protector running in the same node as their local protector; similarly, the protector considers the observers running in its node as local observers.

3.2 The RADIC Recovery Process

In normal operation, the protectors monitor the cluster nodes, and the observers take care of the checkpoints and message logs of the distributed application processes. Together, protectors and observers work as a distributed fault tolerance controller. When protectors and observers detect a failure, both act to reestablish the consistent state of the distributed parallel application and to reestablish the structure of the RADIC controller.
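To make the detection mechanism concrete, the following is a minimal, self-contained sketch of the heartbeat/watchdog scheme described above. It is not RADICMPI code: the class, the callback and the timing constants (HEARTBEAT_PERIOD, WATCHDOG_TIMEOUT) are illustrative assumptions, and real protectors exchange heartbeats over the network rather than through a local method call.

import threading
import time

HEARTBEAT_PERIOD = 1.0    # seconds between heartbeats (assumed value)
WATCHDOG_TIMEOUT = 3.0    # silence tolerated before declaring a fault (assumed)

class ProtectorStub:
    """Toy protector holding only the state needed for fault detection."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.last_heartbeat = time.monotonic()
        self.successor_alive = True

    def receive_heartbeat(self):
        # Called whenever the successor protector's heartbeat arrives.
        self.last_heartbeat = time.monotonic()

    def watchdog_loop(self, on_failure):
        # Runs in the antecessor: if the successor stays silent longer than
        # WATCHDOG_TIMEOUT, start the recovery procedure for its processes.
        while self.successor_alive:
            time.sleep(HEARTBEAT_PERIOD)
            if time.monotonic() - self.last_heartbeat > WATCHDOG_TIMEOUT:
                self.successor_alive = False
                on_failure(self.node_id)

# Example: the protector on N0 watching its successor node.
if __name__ == "__main__":
    t0 = ProtectorStub(node_id=0)
    watcher = threading.Thread(
        target=t0.watchdog_loop,
        args=(lambda n: print(f"protector T{n}: successor failed, recovering"),),
        daemon=True)
    watcher.start()
    for _ in range(2):          # two heartbeats arrive, then the node "dies"
        time.sleep(1.0)
        t0.receive_heartbeat()
    time.sleep(5.0)             # silence long enough to trip the watchdog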
Figure 3. A RADIC configuration before (a) and after (b) a fault recovery.
The protectors and observers implicated in the failure take simultaneous atomic actions in order to reestablish the integrity of the RADIC controller structure. Table 1 lists the atomic activities of each element for a fault in a generic node Ni.

When using the RADIC basic protection level, the system configuration changes after the recovery, and the recovered process runs in the same node as its protector. Fig. 3a shows a RADIC configuration before the fault and Fig. 3b shows it after the recovery process. In these pictures we can see that node N2 failed and process P2 was recovered in the node of its protector (T1). As the figure shows, after the recovery the system configuration has changed and node N1 has two processes in execution. This protection level is well suited for applications with dynamic workload balancing.
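The basic recovery just described can be summarized by the hedged sketch below. The function and data-structure names (restart_from_checkpoint, replay_log, the radictable as a plain dictionary) are illustrative stand-ins rather than RADICMPI's API; the sketch only shows how the antecessor protector re-launches the failed process in its own node and how the process location table is updated afterwards.

def restart_from_checkpoint(pid, checkpoint, on_node):
    # Stand-in for re-launching the process image stored in `checkpoint`.
    print(f"P{pid}: restarted on N{on_node} from checkpoint {checkpoint!r}")

def replay_log(pid, message_log):
    # Stand-in for reprocessing the messages logged since the last checkpoint.
    for msg in message_log:
        print(f"P{pid}: replaying logged message {msg!r}")

def recover_locally(protector_node, stored_state, radictable):
    """Basic protection level: recover every process of the failed node in the
    protector's own node (no spare), so that node ends up hosting two
    application processes afterwards."""
    for pid, (checkpoint, message_log) in stored_state.items():
        restart_from_checkpoint(pid, checkpoint, on_node=protector_node)
        replay_log(pid, message_log)
        radictable[pid]["node"] = protector_node      # relocated process
    return radictable

# Example matching Fig. 3: N2 fails and T1 recovers P2 on its own node N1.
radictable = {1: {"node": 1, "protector": 0},
              2: {"node": 2, "protector": 1}}
recover_locally(protector_node=1,
                stored_state={2: ("ckpt_P2", ["m17", "m18"])},
                radictable=radictable)
print(radictable)   # P2 is now mapped to N1, which hosts both P1 and P2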
4 Recovery Side-effects
As we can see in Fig. 3, when a fault occurs, another active node executing an application process is responsible for receiving the state data of the failed process (its checkpoint and log) and re-launching the failed process in its own node. Hence, beyond the overhead imposed by the fault tolerance activity [11], the post-recovery execution may be affected by this behavior. In Fig. 3b it is easy to perceive that in node N1 both processes will suffer a slowdown in their execution, and the memory usage in this node will increase, possibly leading to disk swapping. Moreover, these processes will be accessing the same protector, competing to send their redundancy data. Supposing that a previous process distribution was made aiming to achieve a certain performance level according to the cluster characteristics, this condition becomes very undesirable if the application is not able to adapt itself to workload changes across the nodes.

In order to quantify these side-effects, we performed some experiments using an SPMD program based on Cannon's algorithm, which presents a tight coupling between the processes. Figure 4 shows a chart with the results of these executions. The program was executed in a nine-node cluster, performing a matrix product between two 1500x1500 matrices. We injected one fault at different moments of the execution.
[Chart of Fig. 4: 1500x1500 matrix product on 9 nodes, SPMD Cannon algorithm, 160 loops, checkpoint every 60 s, one injected fault. The fault-free execution takes 434.57 s; the post-recovery overheads are 72.61%, 49.37% and 27.05% for faults injected at 25%, 50% and 75% of the execution, respectively.]
Figure 4. Execution times of an SPMD program after recovering from a fault injected at distinct moments.
The fault was injected at 25% of the execution, when we believe that fault tolerance starts to be needed; at the middle of the execution; and at 75%, when we believe that fault tolerance becomes crucial. We adjusted the program to repeat the computation 160 times in order to enlarge the execution time. In each group of Fig. 4, the first bar represents the fault-free execution time and the second one the execution with the fault. As we can see, the overhead generated in the post-recovery execution is very large when the fault occurs at 25%, and still considerable when the fault occurs at 75%.
process P2. Finally, the old process P2 terminates itself, and the message forwarding continues with the updated information of the replacement node.
The previous approach is useful for restoring the original performability of a system using the RADIC protection level. However, it depends on a user intervention to insert a replacement node into the system. In order to avoid this situation, RADIC II incorporates a transparent management of spare nodes, which is able to request and use them without any administrator intervention and without maintaining any centralized information about these spare nodes. These spares may be present when the application starts or may be included during the application execution using the mechanism described as follows.

In this approach, we insert a node running a protector in spare mode. In such mode, the protector is not part of the fault detection scheme, so that it generates no overhead. When the protector starts, it searches for some active node running the application in the protectors' fault detection scheme and starts a communication protocol with it, requesting its addition to a structure included in the protectors called the spare table. The spare table contains information about the current spare nodes and their states. The working protector that receives this request checks whether the new spare data is already in its spare table. If not, this protector adds the new spare data and forwards the request to its neighbors, passing the new spare information along. Each protector performs the same tasks until it receives already known spare node data, finalizing the message forwarding process. The flow in Fig. 6a clarifies this announcement procedure.

Fig. 7a depicts a RADIC II configuration after this process, with one idle spare. We can see that the spare protector (Ts) is inactive (represented in light gray) and is not part of the fault detection scheme, thus generating no unnecessary overhead in the system. In each active protector, the spare information is represented as a small light gray triangle.
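The decentralized announcement of a new spare can be illustrated with the short sketch below, assuming the active protectors form a ring and each keeps its own spare table. The class and method names are hypothetical, and in RADIC II the forwarding happens through protector-to-protector messages rather than direct method calls.

class Protector:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbor = None          # successor protector in the ring
        self.spare_table = {}         # spare id -> number of observers hosted

    def announce_spare(self, spare_id):
        # Stop forwarding once the spare is already known: this terminates
        # the propagation without any central registry of spares.
        if spare_id in self.spare_table:
            return
        self.spare_table[spare_id] = 0          # 0 observers => idle spare
        self.neighbor.announce_spare(spare_id)

# Build a small ring T0 -> T1 -> T2 -> T0 and announce a new spare Ns.
protectors = [Protector(i) for i in range(3)]
for i, p in enumerate(protectors):
    p.neighbor = protectors[(i + 1) % 3]

protectors[1].announce_spare("Ns")    # the spare contacts any active protector
print([p.spare_table for p in protectors])   # every protector now knows Ns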
Figure 6. Flows representing: (a) the new spare announcement procedure and (b) the recovery task using spare nodes.
Figure 7. A RADIC II configuration using one spare, before (a) and after (b) a fault recovery.
5.3 Recovering Using a Spare Node

In RADIC II, the recovery process is slightly different from the process described in section 3.2, in order to contemplate the use of spare nodes. When a protector detects a fault, it first searches for spare data in its spare table. If it finds some idle spare, i.e. one whose number of observers reported in the spare table is still zero, it starts a spare use protocol. In this protocol, the active protector communicates with the spare asking for its situation, i.e. how many observers are running on its node, in order to confirm its state. The active protector repeats this procedure until it finds an idle spare, to which it then sends a request for use. If the search for an idle spare yields none, RADIC II proceeds with the regular RADIC recovery process described in section 3. From the request moment on, the spare will not accept requests from other protectors. After receiving the request confirmation, the protector commands the spare to join the protectors' fault detection scheme. Once this step finishes, the protector sends the failed process checkpoint and log to the spare and commands it to recover the failed process using the regular RADIC recovery procedure. The flow in Fig. 6b clarifies this entire process. Fig. 7b shows the RADIC II system configuration after the recovery of a fault in node N3. The spare node assumed the process P3, maintaining the original system distribution; it is also possible to see that the protector T3 has already updated the spare state information.
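The recovery flow of Fig. 6b can be sketched as follows, reusing the idea of a per-protector spare table from the announcement sketch above. The function names (ask_spare_load, recover_with_spare) and the toy data are assumptions made for illustration; the real protocol exchanges the checkpoint and message log over the network and only then commands the spare to recover the process.

def ask_spare_load(spare_id):
    # Stand-in for asking the spare protector how many observers it hosts.
    return 0          # in this toy example the spare is still idle

def recover_with_spare(spare_table, failed_pid, checkpoint, message_log):
    """spare_table maps spare id -> number of observers it is believed to host."""
    for spare_id, observers in spare_table.items():
        # Only spares reported as idle (zero observers) are candidates, and the
        # candidate's state is confirmed before it is reserved.
        if observers == 0 and ask_spare_load(spare_id) == 0:
            spare_table[spare_id] += 1            # reserved: no longer idle
            print(f"{spare_id}: joining the protectors' fault detection scheme")
            print(f"{spare_id}: recovering P{failed_pid} from {checkpoint!r} "
                  f"and replaying {len(message_log)} logged messages")
            return spare_id
    # No idle spare was found: fall back to the regular RADIC local recovery.
    print(f"no idle spare available, recovering P{failed_pid} locally")
    return None

# Example matching Fig. 7b: N3 fails and its process P3 is recovered on Ns.
spare_table = {"Ns": 0}
recover_with_spare(spare_table, failed_pid=3,
                   checkpoint="ckpt_P3", message_log=["m41", "m42"])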
5.4 Changes in the Fault-Masking Task

RADIC II inserts a small indeterminism in locating a failed process, because this process may have been recovered in any spare of the configuration, and the other processes may not be able to locate the recovered process in this spare. Hence, we had to change the original RADIC fault-masking task, which is based on a heuristic to determine where a faulty process will be after the recovery. This heuristic was very efficient in RADIC because its recovery process was deterministic, i.e. the failed process always recovers in its protector's node. The applied change consists in the following: if the observer does not find the recovered process by asking its original protector, it searches the spares, looking for the faulty process in the spare table. However, only the protectors have the spare table information, not the observers. Consequently, we increased the communication between the observers and the protector running in their node, the local protector. In order to execute the new fault-masking protocol, the observer asks the local protector for the spare table and seeks the recovered process among the spares listed in this table.
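A hedged sketch of this extended fault-masking lookup is shown below. It assumes the local protector simply hands its spare table to the observer; locate_at and the example data are illustrative and do not correspond to actual RADICMPI calls.

def locate_at(node, pid, placement):
    # Stand-in for contacting `node` and asking whether it hosts process pid;
    # `placement` plays the role of the real cluster state in this toy example.
    return placement.get(pid) == node

def find_recovered_process(pid, radictable, local_protector_spares, placement):
    # 1) Original RADIC heuristic: the process recovers on its protector's node.
    candidate = radictable[pid]["protector_node"]
    if locate_at(candidate, pid, placement):
        return candidate
    # 2) RADIC II extension: ask the local protector for its spare table and
    #    probe each known spare for the recovered process.
    for spare_node in local_protector_spares:
        if locate_at(spare_node, pid, placement):
            return spare_node
    return None     # still unknown: the observer retries later

# Example matching Fig. 7b: P3 was recovered on the spare Ns, not on T3's node.
radictable = {3: {"protector_node": "N2"}}
placement = {3: "Ns"}                    # where P3 actually runs now
print(find_recovered_process(3, radictable, ["Ns"], placement))   # -> "Ns"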
6 Experimental Results
In order to test our proposal, we modified the RADIC MPI prototype, called RADICMPI, including the dynamic redundancy functionality. Currently, RADICMPI incorporates basic MPI functions, including blocking and non-blocking point-to-point communications. The experiments were conducted on a collection of twelve 1.9 GHz Athlon XP 2600+ PC workstations running Linux Fedora Core 4 with kernel 2.6.17. Each workstation had 256 MB of memory and a 40 GB local disk. The nodes were interconnected by a 100BaseT Ethernet switch.

We performed experiments running well-known applications in different contexts, representing two distinct approaches of common parallel applications and comparatively measuring the effects of using or not using spare nodes. We applied two kinds of parallel programs: a matrix product using the SPMD paradigm and an N-Body particle simulation using non-blocking functions in a pipeline paradigm based on the example presented by Gropp [14].

Fig. 8a shows the chart with the results of the SPMD matrix product. We repeated the experiment shown in section 4, in this case also executing with the dynamic redundancy functionality and one spare in order to compare with the other results. In this chart we can see the overhead caused by a recovery without a spare (the middle column in each fault moment) versus using a spare (the right column in each fault moment), with one fault occurring at different moments. The overhead when not using spares is inversely proportional to the moment when the fault occurs, generating greater overheads (reaching 72.61% in the worst case analyzed) for premature faults, while when using a spare the overhead remains constant and low (approximately 14%) regardless of the moment of the fault.

As many current parallel applications are intended to run continuously in a 24x7 scheme, we performed an experiment intended to represent the behavior of these applications.
Figure 8. Experimental results of (a) the SPMD matrix product and (b) the N-Body simulation in a pipeline.
In this experiment, we continuously executed the N-Body particle simulation in a ten-node pipeline and injected three faults at different moments and in different machines, measuring the throughput of the program in simulation steps per minute. We analyzed four situations: (a) a failure-free execution, used as a baseline; (b) three faults recovered without spares in the same node; (c) three faults recovered without spares in different nodes; and (d) three faults recovered with spares. Fig. 8b shows the result chart of this experiment.

In this experiment we can perceive the influence of the application kind on the post-recovery execution. When the three faults are recovered in different nodes, the application's throughput suffers an initial degradation, but subsequent faults change it only a little. This behavior occurs because of the pipeline arrangement: the degradation of the node containing the second recovered process is masked by the delay caused by the node of the first recovered process. This assumption is confirmed when all faulty processes are recovered in the same node: we can then perceive a degradation of the throughput after each failure. When executing with spare nodes, we see that after a quick throughput degradation the system returns to the original simulation step rate. We also see that the penalization imposed by the recovery process using a spare is greater than that of the regular RADIC process, but this loss is quickly compensated by the restored throughput in the remaining execution. The large message log to be processed and the reuse of the original RADIC recovery implementation in RADICMPI cause the large recovery overhead seen with the spare in the third fault, since the checkpoint is unnecessarily sent twice.
7 Conclusions
In this paper we argued that, in order to tolerate a fault, some message-passing based solutions may change the system configuration at recovery time, i.e., they respawn the failed process in another active node of the application, which changes the original process distribution. Moreover, we showed that this change of configuration may lead to system degradation, generally meaning overall performance loss, as demonstrated by experiments run over the RADIC architecture. This project was undertaken to design a fault tolerant architecture that avoids the side effects of the recovery process and to evaluate its effectiveness. We used RADIC as the basis for
the development of RADIC II, an architecture that adds another protection level to RADIC by inserting a flexible dynamic redundancy feature. RADIC II is able to restore the original system configuration, re-establishing the initial performability of a system by inserting replacement nodes that assume the faulty processes. In order to avoid system configuration changes, we also implemented a transparent management of spare nodes, keeping the information about them totally decentralized. This functionality makes it possible to insert new spare nodes dynamically during the application execution or to start the application with them already present.

We conclude that the use of a flexible redundancy scheme is a good approach to mitigate or avoid the effects of system configuration changes. Our solution has shown itself to be effective even for faults occurring near the end of the application. RADIC II also shows a small overhead caused by the recovery process. The experimental results have shown execution times and throughput values very close to those of a failure-free execution.

The ideal number of spare nodes and their ideal allocation throughout the cluster are still open questions. Further research might investigate how to achieve better results by allocating spare nodes according to requirements such as the acceptable degradation level or the memory limits of a node. Moreover, a study of the possible election policies to be used in the node replacement feature will be useful to determine the ideal behavior in these situations, considering factors like load balance or time to recover.
References
1. J. F. Meyer. On evaluating the performability of degradable computing systems. IEEE Transactions on Computers, 29(8):720–731, 1980.
2. A. Duarte, D. Rexachs, and E. Luque. A distributed scheme for fault tolerance in large clusters of workstations. In Proceedings of Parallel Computing 2005 (ParCo 2005), in press, 2005.
3. A. Duarte, D. Rexachs, and E. Luque. An intelligent management of fault tolerance in cluster using RADICMPI. In Proceedings of the 13th European PVM/MPI Users' Group Meeting, volume 4192 of Lecture Notes in Computer Science, pages 150–157. Springer, Berlin/Heidelberg, 2006.
4. A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello. MPICH-V project: a multiprotocol automatic fault-tolerant MPI. International Journal of High Performance Computing Applications, 20(3):319, 2006.
5. Y. Li and Z. Lan. Exploit failure prediction for adaptive fault-tolerance in cluster computing. In Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), volume 1, pages 531–538, 16–19 May 2006.
6. M. Kondo, T. Hayashida, M. Imai, H. Nakamura, T. Nanya, and A. Hori. Evaluation of checkpointing mechanism on SCore cluster system. IEICE Transactions on Information and Systems, 86(12):2553–2562, 2003.
7. R. Batchu, Y. S. Dandass, A. Skjellum, and M. Beddhu. MPI/FT: a model-based approach to low-overhead fault tolerant message-passing middleware. Cluster Computing, 7(4):303–315, 2004.
8. G. Burns, R. Daoud, and J. Vaigl. LAM: an open cluster environment for MPI. In Proceedings of the Supercomputing Symposium, pages 379–386, 1994.
9. J. M. Squyres and A. Lumsdaine. A component architecture for LAM/MPI. In Proceedings of the 10th European PVM/MPI Users' Group Meeting, Lecture Notes in Computer Science, pages 379–387, Venice, Italy, September/October 2003. Springer-Verlag.
10. E. N. (Mootaz) Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375–408, 2002.
11. S. Rao, L. Alvisi, and H. M. Vin. The cost of recovery in message logging protocols. IEEE Transactions on Knowledge and Data Engineering, 12(2):160–173, 2000.
12. K. Nagaraja, G. Gama, R. Bianchini, R. P. Martin, W. Meira, and T. D. Nguyen. Quantifying the performability of cluster-based services. IEEE Transactions on Parallel and Distributed Systems, 16(5):456–467, 2005.
13. P. Jalote. Fault Tolerance in Distributed Systems. PTR Prentice Hall, 1994, p. 142.
14. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.