[slurm-users] Application level checkpointing
Oytun.Peksel at semcon.com
Wed Oct 9 12:03:24 UTC 2019
I would like to set up a queuing system for multiple users with limited resources. I have only 1 node with 48 CPUs to work with, so I am using select/cons_res as the select type.
I have to use preemption because many jobs run across 3 partitions with different priorities. The gang,suspend preemption works fine; however, it limits the number of jobs I can keep suspended to a few, because suspended jobs stay in memory and the average job memory consumption is pretty high.
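For readers who want to picture the setup, a slurm.conf fragment along these lines would match the description above. This is only a sketch, not the actual configuration: the node name, the partition names, the PriorityTier values, and the CR_Core_Memory choice are all placeholders or assumptions.

```text
# Sketch only: node01, low/mid/high and CR_Core_Memory are assumptions.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

NodeName=node01 CPUs=48
PartitionName=low  Nodes=node01 PriorityTier=1 Default=YES
PartitionName=mid  Nodes=node01 PriorityTier=2
PartitionName=high Nodes=node01 PriorityTier=3
```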
I read about BLCR and DMTCP checkpointing and got the impression that they have a huge overhead and are maybe not quite ready yet.
The jobs we run here have application-level checkpointing (Abaqus, ANSYS, etc.). I am wondering whether there is a way to incorporate application-level checkpoint/restart into Slurm's preemption features using bash scripts:
· A low-priority job would be checkpointed, canceled and requeued.
· Once resources are available, it would be restarted and left to run.
· After completion, all resulting files would be merged.
Suppose these bash scripts exist (I know how to write them).
The question is how to incorporate them into the Slurm scheduling mechanism.
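To make the idea concrete, here is a rough sketch of the kind of batch script I have in mind. It assumes the low-priority partition would use PreemptMode=REQUEUE with a GraceTime long enough to write a checkpoint, and that the SIGTERM Slurm sends at preemption reaches the batch shell. The four solver functions, CKPT_DIR and the restart.flag file are hypothetical placeholders for the application-specific scripts, not real Abaqus or ANSYS commands.

```shell
#!/bin/bash
#SBATCH --requeue              # allow Slurm to put the job back in the queue
#SBATCH --open-mode=append     # keep appending to the same log across restarts

# Hypothetical placeholders: replace the bodies with the real application commands.
run_solver()       { sleep 300; }                       # fresh start
restart_solver()   { sleep 300; }                       # resume from the last checkpoint
write_checkpoint() { touch "$CKPT_DIR/restart.flag"; }  # ask the solver to dump its state
merge_results()    { echo "merging result files"; }

# Hypothetical checkpoint location; override through the environment if needed.
CKPT_DIR="${CKPT_DIR:-$PWD/checkpoint}"

# Decide between a fresh start and a restart. SLURM_RESTART_COUNT is set by
# Slurm for requeued jobs; the flag file is this sketch's own convention.
pick_mode() {
    if [[ "${SLURM_RESTART_COUNT:-0}" -gt 0 && -e "$CKPT_DIR/restart.flag" ]]; then
        echo restart
    else
        echo fresh
    fi
}

# SIGTERM handler: checkpoint, stop the solver, exit. With PreemptMode=REQUEUE
# (assumed), Slurm itself requeues the job afterwards.
on_preempt() {
    write_checkpoint
    [[ -n "${solver_pid:-}" ]] && kill "$solver_pid" 2>/dev/null
    exit 143
}

main() {
    mkdir -p "$CKPT_DIR"
    trap on_preempt TERM
    # Run the solver in the background and wait on it, so that the trap can
    # fire while the solver is still running.
    if [[ "$(pick_mode)" == restart ]]; then
        restart_solver &
    else
        run_solver &
    fi
    solver_pid=$!
    if wait "$solver_pid"; then
        merge_results
        rm -f "$CKPT_DIR/restart.flag"
    fi
}

# Run the workflow only inside a Slurm job, so the functions can be sourced
# and tried out by hand.
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
    main "$@"
fi
```

If the partition used PreemptMode=CANCEL instead, the handler would presumably have to call scontrol requeue "$SLURM_JOB_ID" itself before exiting, but I have not verified that this works from inside a job that is being preempted.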
Simulation & Digital Twins
Semcon Sweden AB
417 80 GÖTEBORG
oytun.peksel at semcon.com