<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body dir="auto">
DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7. Don’t recall it being any trouble to install.
<div><br>
</div>
<div><a href="http://dmtcp.sourceforge.net/">http://dmtcp.sourceforge.net/</a><br>
<div dir="ltr"><br>
On Oct 4, 2019, at 9:47 PM, Eliot Moss &lt;<a href="mailto:moss@cs.umass.edu">moss@cs.umass.edu</a>&gt; wrote:<br>
<br>
</div>
<blockquote type="cite">
<div dir="ltr"><span>Dear slurm users --</span><br>
<span></span><br>
<span>I'm new to slurm (somewhat experienced with Grid Engine, though that's</span><br>
<span>not relevant to this post). I have access to two slurm based clusters,</span><br>
<span>and have an application that (a) can be _very_ long running (more than</span><br>
<span>8 weeks for one execution, though the compute and I/O demands of one</span><br>
<span>such job are not huge by modern standards) and that (b) is not at all</span><br>
<span>practical to convert to do its own checkpoints. (I am running traces</span><br>
<span>from the valgrind program of every memory reference and branch made</span><br>
<span>when running individual SPEC benchmarks; this is then piped to 8</span><br>
<span>downstream analyzers, mostly Java programs.)</span><br>
<span></span><br>
<span>From what I have read, BLCR would meet my needs for checkpointing,</span><br>
<span>but the admins of both clusters are reluctant to pursue BLCR support.</span><br>
<span>I myself am wondering whether it is still working, etc., and what it</span><br>
<span>means that built-in support has been removed, etc. Can someone offer</span><br>
<span>a brief explanation of the status and recent history of BLCR w.r.t.</span><br>
<span>slurm?</span><br>
<span></span><br>
<span>Many thanks! Eliot Moss, UMass Amherst Computer Science</span><br>
<span></span><br>
</div>
</blockquote>
</div>
</body>
</html>