[slurm-users] Job array start time and SchedNodes

Loris Bennett loris.bennett at fu-berlin.de
Tue Dec 7 15:17:01 UTC 2021


Dear Thekla,

Thekla Loizou <t.loizou at cyi.ac.cy> writes:

> Dear Loris,
>
> There is no specific node required for this array. I can verify that from
> "scontrol show job 124841" since the requested node list is empty:
> ReqNodeList=(null)
>
> Also, all 17 nodes of the cluster are identical, so all nodes fulfill the job
> requirements, not only node cn06.
>
> By "saving" the other nodes I mean that the scheduler estimates that the array
> jobs will start on 2021-12-11T03:58:00. No other jobs are scheduled to run
> during that time on the other nodes. So it seems that somehow the scheduler
> schedules the array jobs on more than one nodes but this is not showing in the
> squeue or scontrol output.

My guess is that there is something wrong with either the job
configuration or the node configuration, if Slurm thinks that 9 jobs,
each of which requires a whole node, can all be started simultaneously
on the same node.
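
If it helps to rule things out, these are the commands I would check
first (just a sketch, using the job ID and node name from your output;
the exact fields can differ slightly between Slurm versions):

    # How the node's resources are defined (CPUs, sockets, memory)
    scontrol show node cn06

    # What the array job actually requests (NumNodes, NumCPUs, TRES)
    scontrol show job 124841

    # Expected start times and planned nodes for the pending array tasks
    squeue --start -j 124841

If a task requests fewer resources than you expect, or the node is
defined with more CPUs than it really has, that might explain why the
scheduler thinks all nine tasks fit on cn06 at the same time.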

Cheers,

Loris

> Regards,
>
> Thekla
>
>
> On 7/12/21 12:16 μ.μ., Loris Bennett wrote:
>> Hi Thekla,
>>
>> Thekla Loizou <t.loizou at cyi.ac.cy> writes:
>>
>>> Dear all,
>>>
>>> I have noticed that SLURM schedules several jobs from a job array on the same
>>> node with the same start time and end time.
>>>
>>> Each of these jobs requires the full node. You can see the squeue output below:
>>>
>>>    JOBID     PARTITION  ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
>>>    124841_1  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_2  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_3  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_4  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_5  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_6  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_7  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_8  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>    124841_9  cpu        PD  2021-12-11T03:58:00  1      cn06        (Priority)
>>>
>>> Is this a bug or am I missing something? Is it because the jobs share the same
>>> JOBID and are still in the pending state? I am aware that the jobs will not
>>> actually all run on the same node at the same time, and that the scheduler
>>> somehow takes into account that this job array has 9 jobs which will need 9
>>> nodes. I am building a timeline of the start times of all jobs, and when the
>>> array jobs are due to start, no other jobs are set to run on the remaining
>>> nodes (so the scheduler "saves" the other nodes for the array jobs, even
>>> though squeue and scontrol show them all scheduled on the same node).
>> In general, jobs from an array will be scheduled on whatever nodes
>> fulfil their requirements.  The fact that all the jobs have
>>
>>    cn06
>>
>> as SCHEDNODES, however, seems to suggest that you have either specified cn06
>> as the node the jobs should run on, or that cn06 is the only node which
>> fulfils the job requirements.
>>
>> I'm not sure what you mean by '"saving" the other nodes'.
>>
>> Cheers,
>>
>> Loris
>>
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de

