I work in a research group in the HPC field. Our group develops many
tools that use process checkpoint/restart (CR). Basically, the people here have found
three CR mechanisms that actually work:
1- Berkeley lab’s checkpoint/restart (BLCR)
– Probably the most robust framework for CR on Linux. Is a hybrid
– You can compile the OpenMPI message-passing library with BLCR support to checkpoint distributed applications. This is very convenient in HPC.
– It looks like its development is slowing down. The last
official release is 0.8.2 (June 16, 2009), which supports kernel 2.6.30
(pretty old). There are some patches floating around the development
mailing list for compiling against newer kernels, but I think they only
extend support up to 2.6.34.
– You need root permissions to insert the blcr kernel module. One of
our tools used BLCR and we couldn’t run it on many clusters because the
sysadmins were skeptical about inserting a kernel module with a few
random patches published on a mailing list.
2- DMTCP: Distributed MultiThreaded CheckPointing
– A completely user-space solution. You don’t need to bother the
sysadmins to install kernel modules.
– Can checkpoint distributed computations (we already tried it with
OpenMPI and it also checkpoints the orte daemon).
– There is ongoing work to add DMTCP to OpenMPI as an alternative to BLCR for checkpointing parallel applications.
– Since it is implemented in user space, it needs a lot of workarounds to
maintain process state.
– Duplicates kernel-space process information.
– Only works with socket-based communications (it doesn’t work with
proprietary InfiniBand protocols, for example).
3- Linux-cr checkpoint/restart mechanism
– The checkpoint/restart mechanism is implemented in the kernel as
syscalls and some user-space tools.
– Their intention is to push the mechanism upstream for kernel inclusion.
– Since their implementation is kernel based it is very robust.
– The patch set still hasn’t been accepted for kernel inclusion, and the
whole subject is complicated. Not all kernel developers agree that
implementing CR in the kernel is a good idea.
– You need a custom kernel that has linux-cr support.
So which CR mechanism you choose will depend on many factors (whether you have
control over the machine, whether you use sockets for communication, whether you can use a custom kernel, etc).
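To make those trade-offs concrete, here is a rough Python sketch of the decision as I described it above. The `Site` fields, the function name and the kernel cut-off are my own illustration and not part of any of these tools, and the 2.6.34 limit is only my tentative guess about how far the BLCR patches go.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: newest kernel the BLCR mailing-list patches support (my guess above).
BLCR_MAX_KERNEL = (2, 6, 34)


@dataclass
class Site:
    """Hypothetical description of a cluster and the application running on it."""
    can_load_kernel_module: bool   # sysadmins will insert the blcr module
    can_run_custom_kernel: bool    # you may boot a kernel patched with linux-cr
    socket_only_comm: bool         # the application communicates only over sockets
    kernel_version: Tuple[int, int, int]


def candidate_mechanisms(site: Site) -> List[str]:
    """Return the CR mechanisms that are not ruled out by the factors above."""
    candidates = []
    # BLCR needs its kernel module and a kernel the patches still cover.
    if site.can_load_kernel_module and site.kernel_version <= BLCR_MAX_KERNEL:
        candidates.append("BLCR")
    # DMTCP is pure user space but only handles socket-based communication.
    if site.socket_only_comm:
        candidates.append("DMTCP")
    # linux-cr needs a custom kernel built with the patch set.
    if site.can_run_custom_kernel:
        candidates.append("linux-cr")
    return candidates
```

For a cluster where you have no say over the kernel and the application only uses sockets, this leaves DMTCP as the only candidate, which matches our own experience.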
I am a developer of DMTCP. We came across this blog only recently. We are internally developing support for DMTCP over InfiniBand. It will be open-source. If someone would like to be an alpha tester, please send us an e-mail.
Hello Gene,
First of all, thank you for developing DMTCP, which is a great tool. We have been using DMTCP as the default process checkpoint/restart mechanism for our project. I hope I was both fair and objective when talking about DMTCP's pros and cons.
I'm very glad to know that there is an alpha release with InfiniBand support. I did read in the project FAQ that InfiniBand support was a work in progress. I should have mentioned that in the post, sorry about that. I will contact you privately to learn more about this alpha release. We will be very glad to test it and help in any way we can to improve DMTCP.