[slurm-users] [External] Re: Status of BLCR?
Eliot Moss
moss at cs.umass.edu
Tue Oct 15 02:02:38 UTC 2019
On 10/14/2019 9:57 PM, Kevin Buckley wrote:
> On 2019/10/07 05:24, Eliot Moss wrote:
>> On 10/6/2019 9:23 AM, George Wm Turner wrote:
>>> I stumbled across CRIU (Checkpoint/Restore In Userspace) https://criu.org/Main_Page a couple of
>>> weeks ago. I have not utilized it yet; it's on my ToDo list. They claim that it’s packaged
>>> with most distros; I checked RHEL/CentOS and it was there. Be careful of package/kernel
>>> versions; i.e., a good reason to go with the version included in your distro. BLCR was last
>>> updated January 2013; back in the day, it worked well enough for simpler apps; with complicated
>>> MPI apps, less so.
>>
>> Thanks, George. I've installed it and started looking at it. At present
>> I am applying it to a Grid Engine job, and have not figured out how to make
>> it restore successfully. (Checkpointing goes all right, but gives a minor
>> warning.) It does seem to require running as root, and of course my file
>> systems are NFS mounted, which leads to issues. (Since I am just running
>> some scratch things for testing, using 777 permissions (ouch!) seems to
>> allow checkpointing to proceed.)
>>
>
> A couple of folk in this thread have mentioned DMTCP and, as I see that
> you are using Grid Engine, I thought to say that, back in 2015, I had got
> DMTCP working against a "Son of Grid Engine" installation.
>
> GE checkpoint-related config back then was:
>
> ckpt_name        dmtcp
> interface        APPLICATION-LEVEL
> ckpt_command     /usr/pkg/sge/3rd_party/gridengine_dmtcp-e54e202/dmtcp_checkpoint
> migr_command     /usr/pkg/sge/3rd_party/gridengine_dmtcp-e54e202/dmtcp_migrate
> restart_command  NONE
> clean_command    NONE
> ckpt_dir         /local/tmp/kevin/DMTCP
> signal           NONE
>
> Seem to recall that the issue that prevented it being taken further was the
> need to have a long running co-ordinator against which to restart the jobs.
>
> Not clear how the SoGE settings map across into Slurm config-land but hoping
> that some of that is of some use.
Thank you, Kevin!
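For anyone following the thread, the coordinator dependency Kevin describes can be sketched roughly as below. This is only an illustration, assuming the DMTCP 2.x tool names (dmtcp_coordinator, dmtcp_launch, dmtcp_command, dmtcp_restart) and their short -h/-p options; the host name, port, application, and image file name are hypothetical, and none of this is taken from the gridengine_dmtcp wrapper scripts in the config above. The helpers only build command lines, they do not run anything.

```python
import shlex

# Sketch only: assumes DMTCP 2.x command names and the -h/-p short options.
# Host, port, application and image names used below are hypothetical.

def coordinator_cmd(port):
    """Start a coordinator in the background, listening on the given port."""
    return ["dmtcp_coordinator", "--daemon", "-p", str(port)]

def launch_cmd(host, port, app_argv):
    """Run an application under checkpoint control of that coordinator."""
    return ["dmtcp_launch", "-h", host, "-p", str(port), *app_argv]

def checkpoint_cmd(host, port):
    """Ask the coordinator to checkpoint every process attached to it."""
    return ["dmtcp_command", "-h", host, "-p", str(port), "--checkpoint"]

def restart_cmd(host, port, images):
    """Restart from checkpoint images; a coordinator must be reachable."""
    return ["dmtcp_restart", "-h", host, "-p", str(port), *images]

if __name__ == "__main__":
    host, port = "headnode", 7779  # hypothetical values
    for cmd in (
        coordinator_cmd(port),
        launch_cmd(host, port, ["./my_app", "--input", "data.txt"]),
        checkpoint_cmd(host, port),
        restart_cmd(host, port, ["ckpt_my_app.dmtcp"]),
    ):
        print(shlex.join(cmd))
```

The point of the sketch is that both the launch and the restart take a coordinator host and port, which is why a restarted job needs a coordinator that is still running (or a freshly started one) at a known address.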
Still seeing if I can get CRIU going. As with many things, it's one little
roadblock after another, figuring things out. Hard to know if a roadblock
will become a show stopper!
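For reference, the basic CRIU dump/restore cycle being tried here looks roughly like the following. It is a minimal sketch, not a working Grid Engine or Slurm integration: the PID and image directory are hypothetical, --shell-job is an assumption that only applies when the process was started from a terminal session, and both commands generally need to be run as root, as noted earlier in the thread. The helpers only assemble the command lines.

```python
import shlex

# Sketch only: PID and directory values are hypothetical, and --shell-job
# is an assumption that the target was started from a shell/terminal.

def criu_dump_cmd(pid, images_dir, leave_running=True):
    """Build a `criu dump` command for the process tree rooted at pid."""
    cmd = [
        "criu", "dump",
        "--tree", str(pid),          # root of the process tree to checkpoint
        "--images-dir", images_dir,  # where the image files are written
        "--shell-job",
        "-v4", "--log-file", "dump.log",
    ]
    if leave_running:
        # Keep the job running after the dump instead of killing it.
        cmd.append("--leave-running")
    return cmd

def criu_restore_cmd(images_dir):
    """Build the matching `criu restore` command for the same image dir."""
    return [
        "criu", "restore",
        "--images-dir", images_dir,
        "--shell-job",
        "-v4", "--log-file", "restore.log",
    ]

if __name__ == "__main__":
    print(shlex.join(criu_dump_cmd(12345, "/tmp/ckpt")))
    print(shlex.join(criu_restore_cmd("/tmp/ckpt")))
```

Keeping the verbose logs (-v4) is useful when a restore fails, since the log usually names the resource that could not be recreated.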
Regards - Eliot