DMTCP Checkpointing
Please use the new documentation url https://docs.rc.uab.edu/ for all Research Computing documentation needs.
As a result of this move, we have deprecated use of this wiki for documentation. We are providing read-only access to the content to facilitate migration of bookmarks and to serve as an historical record. All content updates should be made at the new documentation site. The original wiki will not receive further updates.
Thank you,
The Research Computing Team
Introduction
DMTCP stands for Distributed MultiThreaded CheckPointing. It saves a running task's in-memory data to disk so the task can be restarted if something goes wrong. DMTCP works by creating a checkpoint file at user-defined times. Checkpointing can be set up to occur at regular intervals without user monitoring, or on demand at any time while the task is running. More information is available at the DMTCP website.
How it Works
DMTCP operates using a server-client model, which allows for dynamic and on-demand checkpointing. The client wraps your software with an application that monitors execution and memory, and waits for checkpoint instructions from the server. The server sends checkpointing instructions to the client to initiate a checkpoint.
When a checkpoint is requested by the server, the client halts execution of the wrapped software. Any data in system memory associated with the wrapped software is dumped to disk as a binary data file. The client also creates a restart shell script for restarting from the most recent checkpoint.
If execution of the job is interrupted for any reason before completion, the restart shell script can be executed to resume from the most recent checkpoint.
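The launch, checkpoint, and restart cycle described above can be sketched with DMTCP's command-line tools. This is a minimal sketch, not a Cheaha-specific recipe. `dmtcp_launch`, `dmtcp_command`, and the generated `dmtcp_restart_script.sh` are the standard DMTCP names. The helper functions (`run`, `start_job`, `request_checkpoint`, `resume_job`), the `DRY_RUN` switch, and the file name `my_script.py` are hypothetical and exist only for illustration. The sketch assumes the DMTCP tools are already on your `PATH` and that the restart script was written to the current directory.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the DMTCP launch/checkpoint/restart cycle.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a program under DMTCP with a checkpoint every $1 seconds.
# Usage: start_job INTERVAL_SECONDS COMMAND [ARGS...]
start_job() {
    interval="$1"
    shift
    run dmtcp_launch --interval "$interval" "$@"
}

# Ask the coordinator (the server) for an on-demand checkpoint.
request_checkpoint() {
    run dmtcp_command --checkpoint
}

# Resume from the most recent checkpoint via the generated restart script.
resume_job() {
    run ./dmtcp_restart_script.sh
}
```

With the default `DRY_RUN=1`, calling `start_job 3600 python my_script.py` only prints `dmtcp_launch --interval 3600 python my_script.py`. Printing first is a cheap way to check the command line before running it on a real job.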
Advantages and Considerations
There are multiple advantages to using checkpointing:
1. restart after hardware failure
2. restart after SLURM job timeout
3. allow jobs to run longer than the maximum time limit
4. allow debugging starting at a specific point in time
Considerations for checkpointing:
1. All data, application executables, and libraries must be stored in memory at the time of checkpointing.
2. Any data your program stores in temporary files on disk must be dealt with carefully.
3. The following command must be used on Cheaha before using the restart script: `export DMTCP_COORD_HOST=localhost`
4. IMPORTANT: checkpoint frequency can negatively impact performance.
The first two considerations above are already satisfied by the workflows of most Cheaha users, so DMTCP should "just work". Please isolate a small test case and test DMTCP checkpointing and recovery on your workflow before running it with your full data set. It is always best practice to familiarize yourself with new tools before relying on them for real work. Please contact us for Support if your test case or primary workflow isn't working as expected.
Jobs are run on the first available node. DMTCP stores the hostname, i.e. the node name, as the default DMTCP server and client address. There is no guarantee the next job will be located on that node, which can result in an error. The third consideration accounts for this by replacing the stored node name with localhost, which resolves to whichever node the restarted job is running on.
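The third consideration can be folded into a small wrapper, so the export is never forgotten when a job restarts on a different node. This is a sketch only. `DMTCP_COORD_HOST` and `dmtcp_restart_script.sh` come from DMTCP and the text above. The function name `restart_from_checkpoint` and its optional directory argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: resume from the restart script in directory $1
# (default: the current directory).
restart_from_checkpoint() {
    ckpt_dir="${1:-.}"
    script="$ckpt_dir/dmtcp_restart_script.sh"

    # Fail early with a clear message if no restart script exists yet.
    if [ ! -x "$script" ]; then
        echo "no executable restart script found in $ckpt_dir" >&2
        return 1
    fi

    # Point DMTCP at the node this job is running on now,
    # not the node name recorded when the checkpoint was written.
    export DMTCP_COORD_HOST=localhost
    "$script"
}
```

A restart job script would define this function and call `restart_from_checkpoint` from the directory where the original job wrote its checkpoint files.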
The last consideration is very important. How often you checkpoint should be carefully considered. DMTCP copies memory to disk, which can take seconds or minutes depending on how much information is in memory. During this time, your software is not executing. Checkpointing too frequently degrades performance, while checkpointing too infrequently risks losing a large amount of completed work in the event of failure. It is important to find a balance.
When deciding how often to checkpoint, consider how much memory usage is expected, how long the job is expected to take, how much lost time is acceptable, and the purpose of checkpointing. If the job will take at least one day to complete, a good rule of thumb is to set the checkpointing interval between 1 hour and 1 day. For jobs shorter than one day, checkpointing is unlikely to be necessary.
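The trade-off can be estimated with simple arithmetic before choosing an interval. The sketch below assumes a simplified model in which each cycle is a fixed interval of computation followed by a fixed pause while the checkpoint is written. The function names and the example figures (a 120-second checkpoint, a two-day job) are hypothetical, so measure the real checkpoint time on a test case first.

```shell
#!/bin/sh
# Percentage of wall time spent paused for checkpoints (rounded down).
# Usage: checkpoint_overhead_pct CKPT_SECONDS INTERVAL_SECONDS
checkpoint_overhead_pct() {
    ckpt_seconds="$1"
    interval_seconds="$2"
    echo $(( 100 * ckpt_seconds / (interval_seconds + ckpt_seconds) ))
}

# Total seconds spent paused over a job with the given compute time.
# Usage: total_pause_seconds JOB_SECONDS INTERVAL_SECONDS CKPT_SECONDS
total_pause_seconds() {
    job_seconds="$1"
    interval_seconds="$2"
    ckpt_seconds="$3"
    echo $(( job_seconds / interval_seconds * ckpt_seconds ))
}

checkpoint_overhead_pct 120 3600       # hourly checkpoints: prints 3
checkpoint_overhead_pct 120 600        # every 10 minutes:   prints 16
total_pause_seconds 172800 3600 120    # two-day job:        prints 5760
```

Under these assumptions, hourly checkpoints cost about 3% of wall time, while checkpointing every 10 minutes costs about 16%. The worst-case loss after a failure is roughly one interval of work.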
Use with MPI
As of 06/17/2021, the DMTCP versions on Cheaha only work for single-node (SMP) jobs. MPI jobs require a specialized version of DMTCP that is not yet officially released and has additional considerations. If this applies to you, please review the slides at the Official DMTCP page. If you need checkpointing for your MPI job, please contact us for Support.
Additional Resources and Tutorials
UAB Data Science Club has put together two tutorial videos on using DMTCP on a toy workflow in Python: Part 1 and Part 2.
The code for a toy Python example, ready to test on Cheaha, is available on our GitLab instance.
Additional examples are available from web sources: