The CPPC Project

Controller/comPiler for Portable Checkpointing

To add fault tolerance through CPPC, the code of your application must be changed so that it communicates with the runtime library: it has to tell the library which variables need to be dumped in the next checkpoint, where to create the state files, and so on. In addition, flow-control structures are inserted so that certain critical portions of code are re-executed at restart. This re-execution recovers non-portable state, such as MPI communicators or open files, that cannot simply be stored as binary data in a state file.
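
To make the idea of "telling the runtime which variables to dump" more concrete, here is a toy sketch in C. It is not the CPPC API: toy_register, toy_dump and toy_restore are hypothetical helpers invented for this illustration, and the raw byte dump makes no attempt at portability. It only shows the general shape of registering variables, writing them to a state file, and reading them back.

```c
#include <stdio.h>
#include <stddef.h>

#define TOY_MAX_VARS 16

/* One registered variable: where it lives and how many bytes it occupies. */
struct toy_var { void *addr; size_t size; };

static struct toy_var toy_vars[TOY_MAX_VARS];
static size_t toy_count = 0;

/* Hypothetical helper: remember a variable for the next checkpoint.
   Returns 0 on success, -1 if the table is full. */
int toy_register(void *addr, size_t size)
{
    if (toy_count == TOY_MAX_VARS)
        return -1;
    toy_vars[toy_count].addr = addr;
    toy_vars[toy_count].size = size;
    toy_count++;
    return 0;
}

/* Hypothetical helper: write every registered variable to a state file. */
int toy_dump(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    for (size_t i = 0; i < toy_count; i++) {
        if (fwrite(toy_vars[i].addr, 1, toy_vars[i].size, f) != toy_vars[i].size) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Hypothetical helper: read the variables back in registration order. */
int toy_restore(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    for (size_t i = 0; i < toy_count; i++) {
        if (fread(toy_vars[i].addr, 1, toy_vars[i].size, f) != toy_vars[i].size) {
            fclose(f);
            return -1;
        }
    }
    fclose(f);
    return 0;
}
```

An application would register, say, its iteration counter and its main arrays once, call the dump helper at each checkpoint, and call the restore helper at restart. State that cannot be captured as bytes, such as an open file or an MPI communicator, is exactly what this approach cannot handle, which is why such state is recovered by re-execution instead.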

Because inserting these function calls and flow-control structures by hand would require significant effort from the end user, the CPPC runtime library is distributed along with a compiler that automatically performs the necessary transformations on the original application code.

The automatic communication analysis and checkpoint insertion shipped with the compiler since v0.7.x are fairly stable and a large improvement over v0.6.x, but they might still not work well with every application. If you experience trouble with the analyses or the output code, you can deactivate both and rely on manual directive insertion.

If you want to guide the compiler manually, it provides the following directives:

  • cppc execute/end execute: These mark a block of code that needs to be re-executed upon application restart. This directive should be inserted when you want to recover state by re-execution, instead of saving/reading it to/from disk.

  • cppc checkpoint: This directive may be used to manually mark points where the state is dumped to a state file. It must be inserted at safe points in the application: locations where there are neither in-transit nor orphan messages between processes. For example, a checkpoint should not be placed between an MPI_Send() and its matching MPI_Recv(). If it were, the message would not be resent upon application restart, but the destination process would still expect to receive it.

  • cppc checkpoint loop: This directive works like the previous one, except that it marks a loop in whose body the checkpoint should be inserted. The compiler takes communications between processes into account and inserts the checkpoint at the first safe point it finds inside the loop body.
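
As a rough illustration of where these directives might go, the sketch below annotates a small C function. It assumes that the directives are written as "#pragma cppc ..." in C and that the checkpoint loop directive is placed immediately before the loop; check the compiler documentation for the exact syntax in your language. The function run_solver and its contents are made up for this example. A plain C compiler ignores pragmas it does not recognise (possibly with a warning), so the code also builds without CPPC.

```c
#include <stdio.h>

/* Hypothetical solver: sums 1..steps, standing in for real work. */
long run_solver(int steps)
{
    FILE *log = NULL;
    long total = 0;

    /* An open file cannot be stored as raw bytes in a state file,
       so the block that opens it is marked for re-execution at restart. */
#pragma cppc execute
    log = tmpfile();
#pragma cppc end execute

    /* Ask the compiler to choose the first safe point inside the loop body. */
#pragma cppc checkpoint loop
    for (int i = 1; i <= steps; i++) {
        total += i;
        if (log != NULL)
            fprintf(log, "step %d total %ld\n", i, total);
    }

    if (log != NULL)
        fclose(log);
    return total;
}
```

The execute block is used for the file handle because that state is recovered by running the code again, while ordinary variables such as total are candidates for being saved in the state file.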

Checkpoints are usually inserted inside loops, but you will rarely want a state file to be dumped at every loop iteration. Checkpoint frequency can be controlled with the CPPC/Controller/Frequency parameter (see the "Example application" section).

Note that v0.3 of the compiler has a bug that results in an incorrect analysis of the necessary variables if you place a checkpoint at the end of a loop. If you are working with v0.3 and placing this directive inside a loop, be sure to place it at the beginning of the loop body. This bug was fixed in v0.4.

If the automatic checkpoint and communication analyses work for your application, you do not need to alter anything in its code manually. Otherwise, the steps to integrate your application with the CPPC framework are:
  1. Decide where you want to dump the state, and place checkpoint or checkpoint loop directives at those spots. If the communication analysis is not used, check that those spots are safe points as defined above. Bear in mind that the code from a checkpoint up to the end of the application is the code that will be executed upon application restart.

  2. Compile the application, linking it against the appropriate CPPC dynamic library.
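
On checkpoint frequency: how the runtime interprets the CPPC/Controller/Frequency parameter is not spelled out here, so the following C sketch simply assumes it means "dump a state file on every N-th visit to the checkpoint". The ckpt_gate type and both functions are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Hypothetical gate: counts visits to a checkpoint site. */
struct ckpt_gate {
    unsigned frequency; /* dump on every frequency-th visit; 0 disables dumping */
    unsigned visits;    /* visits seen so far */
};

/* Returns true when this visit should produce a state file. */
bool ckpt_gate_should_dump(struct ckpt_gate *g)
{
    if (g->frequency == 0)
        return false;
    g->visits++;
    return g->visits % g->frequency == 0;
}

/* Counts how many dumps a loop with the given number of iterations would trigger. */
unsigned count_dumps(unsigned frequency, unsigned iterations)
{
    struct ckpt_gate gate = { frequency, 0 };
    unsigned dumps = 0;
    for (unsigned i = 0; i < iterations; i++) {
        if (ckpt_gate_should_dump(&gate))
            dumps++; /* a real runtime would write the state file here */
    }
    return dumps;
}
```

Under this assumption, a loop of 100 iterations with a frequency of 5 would write 20 state files instead of 100, which is the kind of trade-off between overhead and lost work that the parameter lets you tune.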