[slurm-users] Application level checkpointing

Wed Oct 9 12:03:24 UTC 2019

Hi,

I would like to set up a queueing system for multiple users with limited resources. I'll have only 1 node and 48 CPUs to work with, so I am using select/cons_res as the select type.
I have to use preemption because many jobs run across 3 partitions with different priorities. The gang,suspend preemption works fine, but it limits me to only a few suspended jobs, because suspended jobs stay in memory and the average job memory consumption is pretty high.

I read about BLCR and DMTCP checkpointing and got the impression that they have a huge overhead and may not be quite ready yet.

The jobs we run here have application-level checkpointing (Abaqus, ANSYS, etc.). I am wondering whether there is a way to incorporate application-level checkpoint/restart into Slurm's preemption features using bash scripts.
Namely:

  * A low-priority job is checkpointed, canceled and requeued.
  * Once resources are available, it is restarted and left to run.
  * After completion, all resulting files are merged.
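
To make the idea concrete, here is a rough sketch of the kind of job script I have in mind. It is only an illustration under assumptions: the four solver_*/merge_results functions are hypothetical stand-ins for our real scripts, and I am assuming that Slurm delivers SIGTERM to the batch shell when the job is preempted, which I have not verified.

```shell
#!/bin/bash
#SBATCH --requeue            # allow this job to be put back in the queue
#SBATCH --open-mode=append   # keep one log file across restarts

# Hypothetical stand-ins for the site-specific scripts (e.g. wrappers
# around the solver's own restart options); replace with the real ones.
solver_start()      { echo "solver: fresh start"; }
solver_restart()    { echo "solver: restart from last checkpoint"; }
solver_checkpoint() { echo "solver: checkpoint written"; }
merge_results()     { echo "merge: result files combined"; }

# Assumption: on preemption the batch shell receives SIGTERM before the
# job is killed. The handler saves state, stops the solver and asks
# Slurm to requeue the job.
on_preempt() {
    solver_checkpoint
    if [ -n "${solver_pid:-}" ]; then
        kill "$solver_pid" 2>/dev/null || true
    fi
    # Only meaningful inside a Slurm job, where SLURM_JOB_ID is set.
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit 143
}

main() {
    trap on_preempt TERM
    # SLURM_RESTART_COUNT is set by Slurm once a job has been requeued.
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        solver_restart &
    else
        solver_start &
    fi
    solver_pid=$!
    # The solver runs in the background and the shell waits on it, so
    # the trap handler can run as soon as the signal arrives.
    wait "$solver_pid" && merge_results
}

main
```

Submitted with sbatch, the first run would take the fresh-start branch and any later run after a requeue would take the restart branch. Whether the explicit scontrol requeue is needed at all, or whether the scheduler's own requeue preemption mode would do the same, is part of what I am unsure about.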

Suppose these bash scripts already exist (I know how to write them).

The question is how to hook them into the Slurm scheduling mechanism.

Oytun Peksel

Eng
Simulation & Digital Twins
Semcon Sweden AB
Lindholmsallén 2
417 80 GÖTEBORG
Sweden
Phone: +46739205917
Mobile: +46739205917
E-mail: oytun.peksel at semcon.com
www.semcon.com
