[slurm-users] [External] Re: Status of BLCR?

Eliot Moss moss at cs.umass.edu
Sun Oct 6 21:24:45 UTC 2019


On 10/6/2019 9:23 AM, George Wm Turner wrote:
> I stumbled across CRIU (Checkpoint/Restore In Userspace) https://criu.org/Main_Page a couple of 
> weeks ago.  I have not used it yet; it's on my ToDo list. They claim that it’s packaged with 
> most distros;  I checked RHEL/CentOS and it was there. Be careful of package/kernel versions, which 
> is a good reason to go with the version included in your distro.  BLCR was last updated January 2013; 
> back in the day, it worked well enough for simpler apps;  complicated MPI apps, less so.

Thanks, George.  I've installed it and started looking at it.  At present
I am applying it to a Grid Engine job, and have not figured out how to make
it restore successfully.  (Checkpointing goes all right, but gives a minor
warning.)  It does seem to require running as root, and of course my file
systems are NFS mounted, which leads to issues.  (Since I am just running
some scratch things for testing, using 777 permissions (ouch!) seems to
allow checkpointing to proceed.)

I do need to understand a bit more of how it works and what flags I need :-) ...

As noted, it seems to need root privilege to work, though maybe making the
executable setuid root is enough (I've not tried setting that).
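
For anyone following along, here is a rough sketch of the kind of dump and
restore invocations involved, written as a small Python helper that only
builds the command lines. The PID (12345) and image directory (/tmp/ckpt)
are placeholders, not values from my setup, and the helper function names
are made up for illustration. The flags shown (-t, -D, -v4, -o, --shell-job)
are standard criu options, but whether this set is sufficient for a Grid
Engine job on NFS is exactly what I have not worked out yet.

```python
import shlex
from typing import List


def criu_dump_cmd(pid: int, images_dir: str, shell_job: bool = True) -> List[str]:
    """Build a `criu dump` command for the process tree rooted at `pid`."""
    # -t: root of the process tree to checkpoint; -D: directory for image files
    # -v4 / -o: verbose logging to a file, useful when a restore fails
    cmd = ["criu", "dump", "-t", str(pid), "-D", images_dir,
           "-v4", "-o", "dump.log"]
    if shell_job:
        # Needed when the task was started from an interactive shell;
        # a batch job without a terminal may not need it (assumption).
        cmd.append("--shell-job")
    return cmd


def criu_restore_cmd(images_dir: str, shell_job: bool = True) -> List[str]:
    """Build the matching `criu restore` command for the same image directory."""
    cmd = ["criu", "restore", "-D", images_dir, "-v4", "-o", "restore.log"]
    if shell_job:
        cmd.append("--shell-job")
    return cmd


if __name__ == "__main__":
    # Print the commands rather than running them; both would normally
    # be run as root.
    print(shlex.join(criu_dump_cmd(12345, "/tmp/ckpt")))
    print(shlex.join(criu_restore_cmd("/tmp/ckpt")))
```

The helper only prints the commands, so they can be inspected and then run
by hand as root.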

Regards - EM


