Define The Terms: Rollback Propagation.: Coordinated Checkpointing
Coordinated Checkpointing:
In this approach, all processes in the system coordinate their checkpointing actions so that the
resulting set of local checkpoints forms a consistent global state. This simplifies recovery because
the system can restart from the last global checkpoint without inconsistency.
Uncoordinated Checkpointing:
Here, processes take checkpoints independently without synchronization. While this can
reduce overhead and increase flexibility, the resulting checkpoints may be mutually inconsistent,
which complicates recovery. Handling such states requires additional mechanisms, such as message
logging, to track dependencies between processes.
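To make "inconsistent state" concrete: a set of local checkpoints is inconsistent when some message is recorded as received but not as sent (an orphan message). The sketch below is only an illustration under assumed conventions: the names Message, orphan_messages and is_consistent are hypothetical, events are numbered per process, and a checkpoint is identified by the index of the last event it covers.

```python
from typing import Dict, List, NamedTuple

class Message(NamedTuple):
    sender: str
    send_event: int    # index of the send event at the sender
    receiver: str
    recv_event: int    # index of the receive event at the receiver

def orphan_messages(checkpoints: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Messages whose receipt is inside the receiver's checkpoint
    but whose send is after the sender's checkpoint."""
    return [m for m in messages
            if m.recv_event <= checkpoints[m.receiver]
            and m.send_event > checkpoints[m.sender]]

def is_consistent(checkpoints: Dict[str, int], messages: List[Message]) -> bool:
    """A set of checkpoints is consistent if it contains no orphan message."""
    return not orphan_messages(checkpoints, messages)

# Hypothetical trace: P1 sends at its event 3, P2 receives at its event 2.
trace = [Message("P1", 3, "P2", 2)]
print(is_consistent({"P1": 3, "P2": 2}, trace))  # True: the send is covered
print(is_consistent({"P1": 2, "P2": 2}, trace))  # False: receipt recorded, send is not
```

With independent checkpoints the second situation can arise, which is why a rollback at one process may force rollbacks at others.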
This is a coordinated checkpointing and recovery technique that takes a consistent set of checkpoints
and avoids the domino effect and livelock problems during recovery.
It has two parts: the checkpointing algorithm and the recovery algorithm.
Checkpointing algorithm
Assumptions: FIFO channels, end-to-end protocols, communication failures do not partition the
network, a single process initiates the algorithm, no process fails during the execution of the algorithm
Two kinds of checkpoints: permanent and tentative
Permanent checkpoint: local checkpoint, part of a consistent global checkpoint
Tentative checkpoint: temporary checkpoint, becomes a permanent checkpoint when the algorithm
terminates successfully
First phase
An initiating process Pi takes a tentative checkpoint and requests all other processes to take tentative
checkpoints. Each process informs Pi whether it succeeded in taking a tentative checkpoint. A process
says “no” to a request if it fails to take a tentative checkpoint, which could be due to several reasons,
depending upon the underlying application. If Pi learns that all the processes have successfully taken
tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent; otherwise,
Pi decides that all the tentative checkpoints should be discarded.
Second phase
Pi informs all the processes of the decision it reached at the end of the first phase. A process, on
receiving the message from Pi, will act accordingly.
Therefore, either all or none of the processes advance the checkpoint by taking permanent checkpoints.
The algorithm requires that after a process has taken a tentative checkpoint, it cannot send messages
related to the underlying computation until it is informed of Pi’s decision.
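The two phases of the checkpointing algorithm can be sketched as follows. This is a minimal single-threaded simulation, not a definitive implementation: there is no real message passing or failure handling, and the class Process, the flag can_checkpoint (which models a "no" reply) and the function coordinated_checkpoint are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Process:
    name: str
    state: int = 0                    # stand-in for the application state
    can_checkpoint: bool = True       # hypothetical: False models a "no" reply
    permanent: Optional[int] = None   # last permanent checkpoint
    tentative: Optional[int] = None   # pending tentative checkpoint
    blocked: bool = False             # True while application sends are forbidden

    def take_tentative(self) -> bool:
        """Phase 1: try to take a tentative checkpoint and reply yes/no."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.blocked = True           # no application messages until the decision
        return True

    def on_decision(self, commit: bool) -> None:
        """Phase 2: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.blocked = False

def coordinated_checkpoint(initiator: Process, others: List[Process]) -> bool:
    everyone = [initiator] + others
    # Phase 1: the initiator and all other processes attempt tentative checkpoints.
    replies = [p.take_tentative() for p in everyone]
    commit = all(replies)
    # Phase 2: the initiator's single decision is delivered to every process.
    for p in everyone:
        p.on_decision(commit)
    return commit

p1, p2, p3 = Process("P1", state=5), Process("P2", state=7), Process("P3", state=9)
print(coordinated_checkpoint(p1, [p2, p3]))   # True: all checkpoints become permanent
p3.can_checkpoint = False
print(coordinated_checkpoint(p1, [p2, p3]))   # False: all tentative checkpoints discarded
```

The blocked flag reflects the rule that a process must not send application messages between taking its tentative checkpoint and learning the decision.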
Recovery algorithm
First phase
An initiating process Pi sends a message to all other processes to check if they all are willing to restart
from their previous checkpoints. A process may reply “no” to a restart request due to any reason (e.g., it
is already participating in a checkpoint or recovery process initiated by some other process). If Pi learns
that all processes are willing to restart from their previous checkpoints, Pi decides that all processes
should roll back to their previous checkpoints.
Otherwise, Pi aborts the rollback attempt and it may attempt a recovery at a later time.
Second phase
Pi propagates its decision to all the processes. On receiving Pi’s decision, a process acts accordingly.
During the execution of the recovery algorithm, a process cannot send messages related to the
underlying computation while it is waiting for Pi’s decision.
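The recovery algorithm follows the same two-phase pattern. The sketch below is a simplified single-threaded simulation under stated assumptions: RecoverableProcess, its busy flag (modelling a process that is already participating in another checkpoint or recovery session) and coordinated_rollback are hypothetical names, and messages and failures are not modelled.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RecoverableProcess:
    name: str
    state: int                # current application state
    checkpoint: int           # last permanent checkpoint
    busy: bool = False        # hypothetical: replies "no" to a restart request
    blocked: bool = False     # True while waiting for the initiator's decision

    def willing_to_restart(self) -> bool:
        """Phase 1: reply yes/no to the initiator's restart request."""
        if self.busy:
            return False
        self.blocked = True   # no application messages until the decision
        return True

    def on_decision(self, rollback: bool) -> None:
        """Phase 2: roll back to the previous checkpoint, or carry on."""
        if rollback:
            self.state = self.checkpoint
        self.blocked = False

def coordinated_rollback(initiator: RecoverableProcess,
                         others: List[RecoverableProcess]) -> bool:
    everyone = [initiator] + others
    replies = [p.willing_to_restart() for p in everyone]   # phase 1
    rollback = all(replies)
    for p in everyone:                                      # phase 2
        p.on_decision(rollback)
    return rollback

procs = [RecoverableProcess("P1", 10, 4),
         RecoverableProcess("P2", 20, 15),
         RecoverableProcess("P3", 30, 22, busy=True)]
print(coordinated_rollback(procs[0], procs[1:]))  # False: attempt aborted, states unchanged
procs[2].busy = False
print(coordinated_rollback(procs[0], procs[1:]))  # True: every process restarts from its checkpoint
```

An aborted attempt leaves every state untouched, so the initiator can simply retry later, as the text describes.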
7. Describe the Juang–Venkatesan algorithm for asynchronous checkpointing and recovery.
Assumptions: communication channels are reliable, deliver messages in FIFO order, and have infinite
buffers; message transmission delay is arbitrary but finite
The underlying computation/application is event-driven: process P is in state s, receives message m,
processes the message, moves to state s’, and sends messages out. So the triplet (s, m, msgs_sent)
represents the state of P
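The event-driven model can be illustrated with a small sketch in which every received message produces one log entry of the form (s, m, msgs_sent). The names EventDrivenProcess, handler and add_and_forward are hypothetical, and the in-memory list stands in for whatever log storage a real system would use.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class EventDrivenProcess:
    state: Any
    # handler(state, message) -> (new_state, list of outgoing messages)
    handler: Callable[[Any, Any], Tuple[Any, List[Any]]]
    log: List[Tuple[Any, Any, Tuple[Any, ...]]] = field(default_factory=list)

    def receive(self, message: Any) -> List[Any]:
        """Process one message: s --m--> s', and record (s, m, msgs_sent)."""
        new_state, msgs_sent = self.handler(self.state, message)
        self.log.append((self.state, message, tuple(msgs_sent)))
        self.state = new_state
        return msgs_sent

def add_and_forward(state: int, m: int) -> Tuple[int, List[Tuple[str, int]]]:
    """Hypothetical application: add m to a running total and send the total to Q."""
    total = state + m
    return total, [("Q", total)]

p = EventDrivenProcess(state=0, handler=add_and_forward)
p.receive(3)
p.receive(4)
print(p.log)    # [(0, 3, (('Q', 3),)), (3, 4, (('Q', 7),))]
print(p.state)  # 7
```

Because each entry stores the state before the event, the message, and the messages sent, a process can be restored to any logged event and can resend messages it sent earlier.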
Hence, Y need not roll back further. In the second iteration, Y sends ROLLBACK(Y, 2) to X and
ROLLBACK(Y, 1) to Z; Z sends ROLLBACK(Z, 1) to Y and ROLLBACK(Z, 0) to X; X sends
ROLLBACK(X, 0) to Z and ROLLBACK(X, 1) to Y. Note that if Y rolls back beyond ey3 and loses the
message from X that caused ey3, X can resend this message to Y because ex2 is logged at X and this
message is available in the log. The second and third iterations progress in the same manner. Note
that the set of recovery points chosen at the end of the first iteration, {ex2, ey2, ez1}, is consistent,
and no further rollback occurs.
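The iterations above can be sketched in code. The rollback rule used here is an assumption based on the usual description of the algorithm, since these notes do not state it: in each iteration every process i sends ROLLBACK(i, c) to each process j, where c is the number of messages i had sent to j at its current recovery point, and j rolls back until it has received no more than c messages from i. The function recover, the log layout and the three-process trace are hypothetical and do not reproduce the counters of the example in the text.

```python
from typing import Dict, List

# One log entry per event, holding cumulative counters after that event:
# {"sent": {peer: count}, "rcvd": {peer: count}}. Entry 0 is the initial event.
Event = Dict[str, Dict[str, int]]

def recover(logs: Dict[str, List[Event]], start: Dict[str, int]) -> Dict[str, int]:
    """Return the recovery point (event index) chosen for every process."""
    point = dict(start)
    procs = list(logs)
    for _ in range(len(procs)):                      # N iterations
        # ROLLBACK(p, c) for q: c = messages p had sent to q at p's recovery point.
        rollback = {(p, q): logs[p][point[p]]["sent"].get(q, 0)
                    for p in procs for q in procs if p != q}
        for q in procs:
            for p in procs:
                if p == q:
                    continue
                c = rollback[(p, q)]
                # q remembers more receipts from p than p remembers sending:
                # those are orphan messages, so q rolls back one event at a time.
                while logs[q][point[q]]["rcvd"].get(p, 0) > c:
                    point[q] -= 1
    return point

# Hypothetical trace: Y fails and only its initial event y0 survives in its log.
logs = {
    "X": [{"sent": {}, "rcvd": {}},
          {"sent": {"Z": 1}, "rcvd": {"Y": 1}},
          {"sent": {"Z": 2}, "rcvd": {"Y": 2}}],
    "Y": [{"sent": {"X": 1}, "rcvd": {}},
          {"sent": {"X": 2}, "rcvd": {"Z": 1}}],
    "Z": [{"sent": {}, "rcvd": {}},
          {"sent": {"Y": 1}, "rcvd": {"X": 1}},
          {"sent": {"Y": 1}, "rcvd": {"X": 2}}],
}
print(recover(logs, {"X": 2, "Y": 0, "Z": 2}))  # {'X': 1, 'Y': 0, 'Z': 1}
```

In this trace X rolls back in the first iteration and Z only in the second, which shows why a rollback may need several iterations to propagate. The message Z sent to Y at its event 1 is not orphaned, only lost, and can be resent from Z's log.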