[slurm-users] Job array start time and SchedNodes

Loris Bennett loris.bennett at fu-berlin.de
Thu Dec 9 11:04:16 UTC 2021


Dear Thekla,

Yes, I think you are right.  I have found a similar job on my system and
this does seem to be the normal, slightly confusing behaviour.  It looks
as if the pending elements of the array get assigned a single node,
but then start on other nodes:

  $ squeue -j 8536946 -O jobid,jobarrayid,reason,schednodes,nodelist,state | head
  JOBID               JOBID               REASON              SCHEDNODES          NODELIST            STATE
  8536946             8536946_[401-899]   Resources           g002                                    PENDING
  8658719             8536946_400         None                (null)              g006                RUNNING
  8658685             8536946_399         None                (null)              g012                RUNNING
  8658625             8536946_398         None                (null)              g001                RUNNING
  8658491             8536946_397         None                (null)              g006                RUNNING
  8658428             8536946_396         None                (null)              g003                RUNNING
  8658427             8536946_395         None                (null)              g003                RUNNING
  8658426             8536946_394         None                (null)              g007                RUNNING
  8658425             8536946_393         None                (null)              g002                RUNNING

This strikes me as a bit odd.
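
For what it's worth, if anyone wants to cross-check this on their own
system, something along the following lines should show both views of a
pending array (the job ID is just the one from the output above):

  # one row per array element, including the node(s) the scheduler has
  # pencilled in
  squeue -r -j 8536946 -O jobid,jobarrayid,schednodes,starttime,state

  # planned start time and scheduled node list for the whole array record
  scontrol show job 8536946 | grep -E 'StartTime|SchedNodeList'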

Cheers,

Loris

Thekla Loizou <t.loizou at cyi.ac.cy> writes:

> Dear Loris,
>
> Thank you for your reply.  To be honest, I don't believe there is
> anything wrong with either the job configuration or the node
> configuration.
>
> I have just submitted a simple sleep script:
>
> #!/bin/bash
>
> sleep 10
>
> as below:
>
> sbatch --array=1-10 --ntasks-per-node=40 --time=09:00:00 test.sh
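>
> (For reference, the same submission written as a batch script with
> #SBATCH directives would look roughly like the following, submitted
> with a plain "sbatch test.sh".)
>
> #!/bin/bash
> #SBATCH --array=1-10
> #SBATCH --ntasks-per-node=40
> #SBATCH --time=09:00:00
>
> sleep 10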
>
> and squeue shows:
>
>           131799_1       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_2       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_3       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_4       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_5       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_6       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_7       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_8       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>           131799_9       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>          131799_10       cpu  test.sh   thekla PD N/A      1 cn04                 (Priority)
>
> All of the jobs seem to be scheduled on node cn04.
>
> When they start running, they run on separate nodes:
>
>           131799_1       cpu  test.sh   thekla  R       0:02 1 cn01
>           131799_2       cpu  test.sh   thekla  R       0:02 1 cn02
>           131799_3       cpu  test.sh   thekla  R       0:02 1 cn03
>           131799_4       cpu  test.sh   thekla  R       0:02 1 cn04
>
> Regards,
>
> Thekla
>
> On 7/12/21 5:17 p.m., Loris Bennett wrote:
>> Dear Thekla,
>>
>> Thekla Loizou <t.loizou at cyi.ac.cy> writes:
>>
>>> Dear Loris,
>>>
>>> There is no specific node required for this array. I can verify that from
>>> "scontrol show job 124841" since the requested node list is empty:
>>> ReqNodeList=(null)
>>>
>>> Also, all 17 nodes of the cluster are identical so all nodes fulfill the job
>>> requirements, not only node cn06.
>>>
>>> By "saving" the other nodes I mean that the scheduler estimates that the array
>>> jobs will start at 2021-12-11T03:58:00, and no other jobs are scheduled to run
>>> on the other nodes during that time. So it seems that the scheduler somehow
>>> schedules the array jobs across more than one node, but this is not visible in
>>> the squeue or scontrol output.
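>>>
>>> (A quick way to see these estimated start times is something like
>>>
>>>   squeue --start --state=PENDING --user=$USER
>>>
>>> which also prints a SCHEDNODES column for each pending job.)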
>> My guess is that there is something wrong with either the job
>> configuration or the node configuration, if Slurm thinks 9 jobs which
>> each require a whole node can all be started simultaneously on the same
>> node.
>>
>> Cheers,
>>
>> Loris
>>
>>> Regards,
>>>
>>> Thekla
>>>
>>>
>>> On 7/12/21 12:16 p.m., Loris Bennett wrote:
>>>> Hi Thekla,
>>>>
>>>> Thekla Loizou <t.loizou at cyi.ac.cy> writes:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I have noticed that SLURM schedules several jobs from a job array on the same
>>>>> node with the same start time and end time.
>>>>>
>>>>> Each of these jobs requires the full node. You can see the squeue output below:
>>>>>
>>>>>             JOBID     PARTITION  ST   START_TIME          NODES SCHEDNODES NODELIST(REASON)
>>>>>
>>>>>             124841_1       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_2       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_3       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_4       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_5       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_6       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_7       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_8       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>             124841_9       cpu     PD 2021-12-11T03:58:00      1 cn06                 (Priority)
>>>>>
>>>>> Is this a bug or am I missing something? Is it because the jobs have the same
>>>>> JOBID and are still in the pending state? I am aware that the jobs will not
>>>>> actually all run on the same node at the same time, and that the scheduler
>>>>> somehow takes into account that this job array has 9 jobs that will need 9
>>>>> nodes. I am creating a timeline with the start times of all jobs, and when the
>>>>> array jobs are due to start, no other jobs are set to run on the remaining
>>>>> nodes (so the scheduler "saves" the other nodes for the jobs of the array, even
>>>>> though squeue and scontrol show them all scheduled on the same node).
>>>> In general jobs from an array will be scheduled on whatever nodes
>>>> fulfil their requirements.  The fact that all the jobs have
>>>>
>>>>     cn06
>>>>
>>>> as NODELIST, however, seems to suggest that either you have specified cn06
>>>> as the node the jobs should run on, or cn06 is the only node which
>>>> fulfils the job requirements.
>>>>
>>>> I'm not sure what you mean about '"saving" the other nodes'.
>>>>
>>>> Cheers,
>>>>
>>>> Loris
>>>>
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


