[slurm-users] [External] Re: Status of BLCR?

Kevin Buckley Kevin.Buckley at pawsey.org.au
Tue Oct 15 01:57:00 UTC 2019


On 2019/10/07 05:24, Eliot Moss wrote:
> On 10/6/2019 9:23 AM, George Wm Turner wrote:
>> I stumbled across CRIU (Checkpoint/Restore In Userspace) https://criu.org/Main_Page a couple of
>> weeks ago.  I have not utilized it yet; it's on my ToDo list. They claim that it’s packaged with
>> most distros;  I checked RHEL/CentOS and it was there. Be careful of package/kernel versions, i.e.
>> a good reason to go with the version included in your distro.  BLCR was last updated January 2013;
>> back in the day, it worked well enough for simpler apps;  complicated MPI apps, less so.
> 
> Thanks, George.  I've installed it and started looking at it.  At present
> I am applying it to a Grid Engine job, and have not figured out how to make
> it restore successfully.  (Checkpointing goes all right, but gives a minor
> warning.)  It does seem to require running as root, and of course my file
> systems are NFS mounted, which leads to issues.  (Since I am just running
> some scratch things for testing, using 777 permissions (ouch!) seems to
> allow checkpointing to proceed.)
> 

A couple of folk in this thread have mentioned DMTCP and, as I see that
you are using Grid Engine, I thought to say that, back in 2015, I had got
DMTCP working against a "Son of Grid Engine" installation.

GE checkpoint-related config back then was:

ckpt_name          dmtcp
interface          APPLICATION-LEVEL
ckpt_command       /usr/pkg/sge/3rd_party/gridengine_dmtcp-e54e202/dmtcp_checkpoint
migr_command       /usr/pkg/sge/3rd_party/gridengine_dmtcp-e54e202/dmtcp_migrate
restart_command    NONE
clean_command      NONE
ckpt_dir           /local/tmp/kevin/DMTCP
signal             NONE
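
As a rough illustration of how settings like these could be carried over to
another scheduler's tooling, here is a minimal Python sketch that reads the
listing into a dictionary. It assumes one "key value" pair per line separated
by whitespace, as in the listing above; the function name parse_ckpt_config
is made up for this example and is not part of Grid Engine or DMTCP.

```python
def parse_ckpt_config(text):
    """Parse 'key   value' lines into a dict; blank lines are skipped."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


# The checkpoint environment exactly as listed above.
SAMPLE = """\
ckpt_name          dmtcp
interface          APPLICATION-LEVEL
ckpt_command       /usr/pkg/sge/3rd_party/gridengine_dmtcp-e54e202/dmtcp_checkpoint
migr_command       /usr/pkg/sge/3rd_party/gridengine_dmtcp-e54e202/dmtcp_migrate
restart_command    NONE
clean_command      NONE
ckpt_dir           /local/tmp/kevin/DMTCP
signal             NONE
"""

config = parse_ckpt_config(SAMPLE)

# Fields the listing marks as NONE, i.e. with no command or signal configured.
unset = sorted(k for k, v in config.items() if v == "NONE")
print(unset)
```

Note that restart_command is among the unset fields, so restarting was not
driven by this checkpoint environment itself.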

I seem to recall that the issue that prevented it being taken further was the
need to have a long-running coordinator against which to restart the jobs.
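
To make the coordinator dependency concrete, the sketch below only builds and
prints the three command lines involved; it does not run anything. The host
name, port, job command and checkpoint image name are hypothetical, and the
short options (-h, -p, --daemon) are given as remembered from DMTCP's command
line tools, so they are an assumption to verify against --help on the
installed version.

```python
import shlex

# Hypothetical values, for illustration only.
COORD_HOST = "head-node"
COORD_PORT = 7779


def coordinator_cmd(port):
    # One long-lived coordinator, started once and left running.
    return ["dmtcp_coordinator", "--daemon", "-p", str(port)]


def launch_cmd(host, port, job_argv):
    # Start the job under checkpoint control, attached to that coordinator.
    return ["dmtcp_launch", "-h", host, "-p", str(port)] + list(job_argv)


def restart_cmd(host, port, images):
    # Restarting also needs a reachable coordinator, which is the sticking point.
    return ["dmtcp_restart", "-h", host, "-p", str(port)] + list(images)


commands = [
    coordinator_cmd(COORD_PORT),
    launch_cmd(COORD_HOST, COORD_PORT, ["./my_job", "--input", "data.in"]),
    restart_cmd(COORD_HOST, COORD_PORT, ["ckpt_my_job.dmtcp"]),
]

for argv in commands:
    print(" ".join(shlex.quote(a) for a in argv))
```

The point of the sketch is that the same host and port appear in both the
launch and the restart step, so something has to keep a coordinator available
between the two.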

It's not clear to me how the SoGE settings map across into Slurm config-land,
but I hope that some of that is of some use.

Kevin

-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre


