[slurm-users] Is this a bug in slurm array completion logic or expected behaviour

Kirill 'kkm' Katsnelson kkm at pobox.com
Fri Jan 31 08:54:41 UTC 2020

On Thu, Jan 30, 2020 at 7:54 AM Antony Cleave <antony.cleave at gmail.com>

> epilog jobid=513,arraytaskid=4,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=514,arraytaskid=5,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=515,arraytaskid=6,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=518,arraytaskid=9,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=512,arraytaskid=3,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=517,arraytaskid=8,SLURM_ARRAY_JOB_ID=509,JobState(509)=RUNNING
> epilog jobid=509,arraytaskid=10,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETING
> epilog jobid=516,arraytaskid=7,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETED
> epilog jobid=511,arraytaskid=2,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETED
> epilog jobid=510,arraytaskid=1,SLURM_ARRAY_JOB_ID=509,JobState(509)=COMPLETED
> Slurm seems to think that the array job is complete after the final array
> task has completed even though there are 3 more tasks running.

Are you sure you are not misreading this? Every job in an array has its own
ID, and one of those IDs doubles as the array ID. In this log the array ID is
509, and job 509 is not COMPLETED, it's COMPLETING: it is waiting for its
peers to complete. COMPLETING is essentially the window in which the epilog
script is called on the controller, and that epilog can run arbitrarily long.

As a side note, there is a catch here: the resources previously given to
COMPLETING jobs become eligible for allocation. Check man slurm.conf for
the CompleteWait parameter and the reduce_completing_frag scheduler option
(although I do not remember whether the latter flag is in 18.x; IIRC, it was
added comparatively recently). Make sure not to create a race between
creating and tearing down that filesystem if two job arrays request it
(i.e. A starts and you allocate the FS, array B wants the same FS, then A
ends and you tear it down, leaving B in a confused state--it's not clear
whether you can create multiple instances of that filesystem or not).
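For concreteness, this is roughly the kind of slurm.conf fragment I mean (a
sketch only; the value of CompleteWait is arbitrary, and you should check
whether reduce_completing_frag exists in your version):

```
# slurm.conf fragment (sketch): delay scheduling onto resources held by
# COMPLETING jobs, so a long-running epilog does not race new allocations.
CompleteWait=32
SchedulerParameters=reduce_completing_frag
```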

short of making every task in the array job create a file on completion and
> testing for the existence of all files, is there a way to reliably detect
> when all tasks in the array have completed?

If by "epilogctld" you mean EpilogSlurmctld, then your program is given a
lot of information about the array in its environment. You also have full
access to squeue and scontrol, so you can poll the array.
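Such a poll can be a one-liner around squeue. A minimal sketch (the function
name array_remaining is mine, not anything in Slurm; the teardown hook in the
comment is hypothetical):

```shell
# Report how many tasks of a job array squeue still knows about.
array_remaining() {
    # $1: the array's job ID (SLURM_ARRAY_JOB_ID in the epilog environment).
    # --array expands the array to one line per task; --noheader drops the
    # header line, so the line count equals the number of remaining tasks.
    squeue --noheader --array -j "$1" 2>/dev/null | wc -l
}

# Intended use inside EpilogSlurmctld (which runs once per finished task):
#   if [ "$(array_remaining "$SLURM_ARRAY_JOB_ID")" -eq 0 ]; then
#       ...whole array is gone from the queue: tear down the shared FS...
#   fi
```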

There are also triggers, but I have never used them on jobs--only
system-wide triggers that revive nodes that went down. The trigger
machinery is quite powerful; check whether it could solve your problem.
Polling is not a very scalable solution, so if you can do it with triggers,
that may be better. Just be aware that slurmctld batches triggers, so they
do not fire immediately (the delay is about 10-20s, IIRC). You can poll
more frequently, but you risk overloading the controller with the poll
requests.
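For reference, registering a job trigger looks roughly like this (a sketch;
the script path is hypothetical, and whether --fini on the array's master ID
covers all tasks is worth verifying on your Slurm version):

```
# Run a program when job 509 (the array's master ID) finishes.
strigger --set --jobid=509 --fini --program=/usr/local/sbin/array_done.sh
```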

Flag files are a horrible solution in practice; don't do that. I have seen
scripts that started out simple, then added exponential delays, then tried
to enforce consistency and sync points on the shared FS after a
timeout... It seems simple but ends up untenable.
