[slurm-users] SLURM starts new job before CG finishes

Erwin, James james.erwin at intel.com
Mon Feb 3 13:58:15 UTC 2020


Hello,
Thank you for your reply, Lyn. I found a temporary workaround: the epilog creates a file in /tmp/, and a prolog waits until the epilog finishes and removes the file.
I looked at CompleteWait before trying this workaround, but from the way it is described in the docs, I do not understand how it would help.

CompleteWait
The time, in seconds, given for a job to remain in COMPLETING state before any additional jobs are scheduled. If set to zero, pending jobs will be started as soon as possible. Since a COMPLETING job's resources are released for use by other jobs as soon as the Epilog completes on each individual node, this can result in very fragmented resource allocations.

In my case, the epilog is still executing (according to ps and the health checks), and Slurm still starts new jobs on the node.
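
For anyone wanting to try the same kind of marker-file workaround, here is a minimal sketch of the idea, not the exact scripts used on this cluster. The marker path, the timeout and poll values, and the function names epilog() and prolog() are all assumptions made for illustration; the health checks are passed in as a callable.

```python
#!/usr/bin/env python3
"""Sketch of a marker-file handshake between a Slurm Epilog and Prolog."""
import os
import time

# Hypothetical marker path; any node-local path both scripts can reach would do.
MARKER = "/tmp/epilog_running"


def epilog(run_checks, marker=MARKER):
    """Create the marker, run the health checks, and always remove the marker."""
    with open(marker, "w") as fh:
        fh.write(str(os.getpid()))
    try:
        # run_checks is whatever the site uses for node health checks.
        return run_checks()
    finally:
        # Remove the marker even if the checks raise, so prologs are not stuck.
        try:
            os.remove(marker)
        except FileNotFoundError:
            pass


def prolog(marker=MARKER, timeout=600.0, poll=5.0):
    """Wait until the marker is gone; return 0 on success, 1 on timeout."""
    deadline = time.monotonic() + timeout
    while os.path.exists(marker):
        if time.monotonic() >= deadline:
            return 1  # Slurm treats a non-zero Prolog exit as a failure.
        time.sleep(poll)
    return 0
```

The Epilog script would call epilog() with the health checks, and the Prolog script would end with sys.exit(prolog()). The timeout is there so that a crashed epilog that leaves the marker behind cannot block new jobs forever; a single fixed marker path also assumes only one epilog runs on the node at a time.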

Thanks,
James


From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Lyn Gerner
Sent: Wednesday, January 22, 2020 12:27 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Cc: slurm-users at schedmd.com
Subject: Re: [slurm-users] SLURM starts new job before CG finishes

James, you might take a look at CompleteWait and KillWait.

Regards,
Lyn

On Fri, Jan 3, 2020 at 12:27 PM Erwin, James <james.erwin at intel.com> wrote:
Hello,

I’ve recently updated a cluster to SLURM 19.05.4 and noticed that new jobs are starting on nodes still in the CG state. In an epilog I am running node health checks that last about 2-3 minutes. In the previous version (ancient 15.08), jobs would not start running on these nodes until the epilog was complete and the node was out of the CG state. Does anyone know why this overlap of R with CG might be happening?

There is a release note for version 19.05.3 that looks possibly related but I’m not exactly sure what it means:

* Changes in Slurm 19.05.3
==========================
...
-- Nodes in COMPLETING state treated as being currently available for job
    will-run test.


Thanks,
James
