[slurm-users] Status of BLCR?
moss at cs.umass.edu
Sat Oct 5 02:46:17 UTC 2019
Dear slurm users --
I'm new to slurm (somewhat experienced with Grid Engine, though that's
not relevant to this post). I have access to two slurm-based clusters,
and have an application that (a) can be _very_ long running (more than
8 weeks for one execution, though the compute and I/O demands of one
such job are not huge by modern standards) and that (b) is not at all
practical to convert to do its own checkpoints. (I use valgrind to
trace every memory reference and branch made when running individual
SPEC benchmarks; the trace is then piped to 8 downstream analyzers,
mostly Java programs.)
From what I have read, BLCR would meet my needs for checkpointing,
but the admins of both clusters are reluctant to pursue BLCR support.
I myself am wondering whether it still works and what it
means that built-in support has been removed. Can someone offer
a brief explanation of the status and recent history of BLCR w.r.t.
slurm?
Many thanks!
Eliot Moss, UMass Amherst Computer Science