[slurm-users] SLURM starts new job before CG finishes

Paddy Doyle paddy at tchpc.tcd.ie
Thu Feb 6 15:48:11 UTC 2020


Hi James,

Just for a slightly different take, 2-3 minutes seems a bit long for an
epilog script. Do you need to run all of those checks after every job?

Also, you describe it as running health checks; why not run those checks
via the HealthCheckProgram every HealthCheckInterval (e.g. 1 hour)?

Or better still, keep the more job-specific checks in the Epilog and move the
general node-level checks into HealthCheckProgram.
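
For example, something roughly along these lines in slurm.conf (the script
path, interval and node-state filter are just placeholders, not a
recommendation):

  # node-level checks run by slurmd every hour, independent of jobs
  HealthCheckProgram=/usr/local/sbin/node_health_checks
  HealthCheckInterval=3600
  HealthCheckNodeState=ANY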

But either way, as Lyn noted, you might still need to set CompleteWait to a
non-zero value to allow the epilog to finish.
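
For example (the value is just a guess; pick something at least as long as
your epilog normally takes to run):

  # wait this many seconds for COMPLETING jobs (and their epilogs) to finish
  # before scheduling additional jobs
  CompleteWait=180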

Kind regards,
Paddy

On Mon, Feb 03, 2020 at 01:58:15PM +0000, Erwin, James wrote:

> Hello,
> Thank you for your reply, Lyn. I found a temporary workaround: the epilog touches a file in /tmp/ and the prolog waits until the epilog finishes and removes the file.
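> 
> Roughly, the workaround looks something like this (a simplified sketch in
> Python just for illustration; the marker path and the 10-minute cap in the
> prolog are just what I picked):
> 
> #!/usr/bin/env python3
> # epilog sketch: mark the node, run the health checks, then clear the marker
> import pathlib
> import subprocess
> 
> LOCK = pathlib.Path("/tmp/epilog_running")   # arbitrary marker file
> 
> LOCK.touch()
> try:
>     # placeholder for the real health-check command(s)
>     subprocess.run(["/usr/local/sbin/node_health_checks"], check=False)
> finally:
>     LOCK.unlink(missing_ok=True)
> 
> #!/usr/bin/env python3
> # prolog sketch: wait (with a cap) until any still-running epilog clears the marker
> import pathlib
> import time
> 
> LOCK = pathlib.Path("/tmp/epilog_running")
> 
> deadline = time.time() + 600    # give up after 10 minutes
> while LOCK.exists() and time.time() < deadline:
>     time.sleep(5)
> 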
> I was looking at CompleteWait before I tried this workaround, but from the way it is described in the docs I do not understand how it would help.
> 
> CompleteWait
> The time, in seconds, given for a job to remain in COMPLETING state before any additional jobs are scheduled. If set to zero, pending jobs will be started as soon as possible. Since a COMPLETING job's resources are released for use by other jobs as soon as the Epilog completes on each individual node, this can result in very fragmented resource allocations.
> 
> In my case, the epilog is still executing (according to ps and the health checks), and Slurm still starts new jobs on the node.
> 
> Thanks,
> James
> 
> 
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Lyn Gerner
> Sent: Wednesday, January 22, 2020 12:27 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Cc: slurm-users at schedmd.com
> Subject: Re: [slurm-users] SLURM starts new job before CG finishes
> 
> James, you might take a look at CompleteWait and KillWait.
> 
> Regards,
> Lyn
> 
> On Fri, Jan 3, 2020 at 12:27 PM Erwin, James <james.erwin at intel.com> wrote:
> Hello,
> 
> I’ve recently updated a cluster to SLURM 19.05.4 and have noticed that new jobs are starting on nodes still in the CG (completing) state. In an epilog I am running node health checks that last about 2-3 minutes. In the previous version (ancient 15.08), jobs would not start on these nodes until the epilog was complete and the node was out of the CG state. Does anyone know why this overlap of R with CG might be happening?
> 
> There is a release note for version 19.05.3 that looks possibly related but I’m not exactly sure what it means:
> 
> * Changes in Slurm 19.05.3
> ==========================
> ...
> -- Nodes in COMPLETING state treated as being currently available for job
>     will-run test.
> 
> 
> Thanks,
> James
> 

-- 
Paddy Doyle
Research IT / Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
https://www.tchpc.tcd.ie/


