[slurm-users] Application level checkpointing
Oytun Peksel
Oytun.Peksel at semcon.com
Wed Oct 9 12:03:24 UTC 2019
Hi,
I would like to set up a queuing system for multiple users with limited resources. I have only one node and 48 CPUs to work with, so I am using select/cons_res as the select type.
I have to use preemption because many jobs run across three partitions with different priorities. Gang/suspend preemption works fine, but it limits me to only a few suspended jobs at a time because the average job's memory consumption is quite high.
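For reference, a minimal slurm.conf sketch of the setup described above; the partition and node names and tier values are assumptions:

    # slurm.conf excerpt -- gang/suspend preemption across three partitions
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory   # track CPUs and memory per job
    PreemptType=preempt/partition_prio   # higher-tier partitions preempt lower
    PreemptMode=SUSPEND,GANG             # suspended jobs stay resident in RAM

    PartitionName=high Nodes=node01 PriorityTier=3 Default=NO
    PartitionName=med  Nodes=node01 PriorityTier=2 Default=NO
    PartitionName=low  Nodes=node01 PriorityTier=1 Default=YES

The last comment is the crux of the problem: a suspended job releases its CPUs but not its memory, which is why high average memory use caps how many jobs can sit suspended.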
I read about BLCR and DMTCP checkpointing and got the impression that they have a huge overhead and are maybe not quite ready yet.
The jobs we run here have application-level checkpointing (Abaqus, Ansys, etc.). I am wondering whether there is a way to incorporate application-level checkpoint/restart into Slurm's preemption features using bash scripts.
Namely :
· A low-priority job would be checkpointed, canceled, and requeued.
· Once resources become available, it would be restarted from its checkpoint and allowed to run.
· After completion, all resulting files are merged.
Suppose these bash scripts already exist (I know how to write them).
The question is how to incorporate this into Slurm's scheduling mechanism; one possible wiring is sketched below.
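For what it's worth, a rough sketch of one way to do it: switch the preemptable partitions from SUSPEND,GANG to PreemptMode=REQUEUE and set a partition GraceTime long enough to write a checkpoint. On preemption Slurm then sends SIGTERM, waits GraceTime seconds, sends SIGKILL, and requeues the job, so the batch script only has to trap SIGTERM and trigger the application-level checkpoint. The helper script names (run_solver.sh, write_checkpoint.sh, merge_results.sh) and the checkpoint file name are placeholders for the application-specific scripts assumed above:

    #!/bin/bash
    #SBATCH --partition=low
    #SBATCH --requeue             # let Slurm put the job back in the queue
    #SBATCH --open-mode=append    # keep one output file across requeues

    # On preemption (PreemptMode=REQUEUE) Slurm sends SIGTERM, then SIGKILL
    # after the partition's GraceTime; checkpoint before the grace time ends.
    checkpoint_on_preempt() {
        ./write_checkpoint.sh     # placeholder: application-level checkpoint
        exit 0                    # Slurm requeues the job itself
    }
    trap checkpoint_on_preempt TERM

    # Restart from the last checkpoint if one exists (file name is a placeholder).
    if [ -f restart.chk ]; then
        ./run_solver.sh --restart restart.chk &
    else
        ./run_solver.sh &
    fi
    wait    # solver runs in the background so the TERM trap can fire mid-run

    ./merge_results.sh            # merge resulting files after a normal finish

One caveat: Slurm signals every process in the batch step, so the solver receives the SIGTERM as well; this only works cleanly if the solver either ignores it or treats it as its own checkpoint-and-stop request.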
Oytun Peksel
Eng
Simulation & Digital Twins
Semcon Sweden AB
Lindholmsallén 2
417 80 GÖTEBORG
Sweden
Phone: +46739205917
Mobile: +46739205917
E-mail: oytun.peksel at semcon.com
www.semcon.com