[slurm-users] Limit on number of nodes user able to request

Brian Andrus toomuchit at gmail.com
Thu Apr 1 20:05:47 UTC 2021


How are you taking them offline? I would expect a SuspendProgram script
that runs the command to shut them down. Also, one of your
SlurmctldParameters should be "idle_on_node_suspend".
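
(For reference, a minimal slurm.conf sketch of the setup Brian describes;
the script paths and the SuspendTime value are placeholders, not the
poster's actual configuration:)

    SuspendProgram=/usr/local/sbin/node_suspend.sh   # shuts a cloud node down when idle
    ResumeProgram=/usr/local/sbin/node_resume.sh     # powers it back up for new jobs
    SuspendTime=600                                  # seconds a node must sit idle first
    SlurmctldParameters=idle_on_node_suspend         # mark nodes idle when suspended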

Brian Andrus

On 4/1/2021 12:25 PM, Sajesh Singh wrote:
>
> Brian,
>
>   The job is targeting the correct partition, and there are no QOS limits
> imposed that would cause this issue. The only way I have found to remedy
> it is to completely remove the cloud nodes from Slurm, restart slurmctld,
> re-add the nodes to Slurm, and restart slurmctld again.
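>
> (Roughly, that workaround amounts to something like the following; the
> config path is an assumption:)
>
>     # remove the affected NodeName= lines from /etc/slurm/slurm.conf
>     systemctl restart slurmctld
>     # re-add the NodeName= lines
>     systemctl restart slurmctld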
>
> I believe the issue is caused when the nodes in the cloud go offline and
> slurmctld is no longer able to reach them. I am not able to change the
> node state manually so that slurmctld will allow the node to be used the
> next time a job requires it. I cannot set the state to CLOUD.
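>
> (The usual manual resets, e.g. the following, do not help here; the node
> name is a placeholder:)
>
>     scontrol update NodeName=node-11 State=RESUME
>     scontrol update NodeName=node-11 State=IDLE
>     scontrol update NodeName=node-11 State=CLOUD   # not accepted as a settable state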
>
> The other option may be to bring up all of the nodes that are in this
> unknown state so that slurmctld can go through the motions with them and
> then run the job again.
>
> -Sajesh-
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of *Brian Andrus
> *Sent:* Thursday, April 1, 2021 2:51 PM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] Limit on number of nodes user able to request
>
> For this one, you want to look closely at the job. Is it targeting a 
> specific partition/nodelist?
>
> See what resources it is looking for (scontrol show job <jobid>).
> Also look at the partition limits, as well as any QOS limits (if you are
> using them).
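>
> (A minimal sketch of those checks; the job ID, partition name, and user
> are placeholders:)
>
>     scontrol show job <jobid>             # requested node count, partition, features
>     scontrol show partition <partition>   # MaxNodes and other partition limits
>     sacctmgr show qos                     # QOS limits, if any are defined
>     sacctmgr show assoc where user=<user> # association limits for the user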
>
> Brian Andrus
>
> On 4/1/2021 10:00 AM, Sajesh Singh wrote:
>
>     Some additional information after enabling debug3 on the slurmctld daemon:
>
>     Logs show that there are enough usable nodes for the job:
>
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-11
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-12
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-13
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-14
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-15
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-16
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-17
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-18
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-19
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-20
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-21
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-22
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-23
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-24
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-25
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-26
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-27
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-28
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-29
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-30
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-31
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-32
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-33
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-34
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-35
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-36
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-37
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-38
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-39
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-40
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-41
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-42
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-43
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-44
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-45
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-46
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-47
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-48
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-49
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-50
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-51
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-52
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-53
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-54
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-55
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-56
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-57
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-58
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-59
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-60
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-61
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-62
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-63
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-64
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-65
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-66
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-67
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-68
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-69
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-70
>     [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-71
>
>     But then the following line is in the log as well:
>
>     debug3: select_nodes: JobId=67171529 required nodes not avail
>
>     --
>
>     -Sajesh-
>
>     *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of *Sajesh Singh
>     *Sent:* Thursday, March 25, 2021 9:02 AM
>     *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>     *Subject:* Re: [slurm-users] Limit on number of nodes user able to request
>
>     No nodes are in a down or drained state. These are nodes that are
>     dynamically brought up and down via the powersave plugin. When they
>     are taken offline due to being idle, I believe the state is set to
>     FUTURE by the powersave plugin.
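>
>     (One way to see how the controller currently views such a node; the
>     node name and partition are placeholders:)
>
>         scontrol show node node-11 | grep -i state
>         sinfo -p <cloud_partition> -N -o "%N %T %E"   # per-node state and reason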
>
>     -Sajesh-
>
>     *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of *Brian Andrus
>     *Sent:* Wednesday, March 24, 2021 11:02 PM
>     *To:* slurm-users at lists.schedmd.com
>     *Subject:* Re: [slurm-users] Limit on number of nodes user able to request
>
>     Do 'sinfo -R' and see if you have any down or drained nodes.
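>
>     (For example; the partition name is a placeholder:)
>
>         sinfo -R                                  # down/drained nodes with reasons
>         sinfo -p <partition> -o "%P %a %D %T"     # node counts grouped by state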
>
>     Brian Andrus
>
>     On 3/24/2021 6:31 PM, Sajesh Singh wrote:
>
>         Slurm 20.02
>
>         CentOS 8
>
>         I just recently noticed a strange behavior when using the
>         powersave plugin for bursting to AWS. I have a queue
>         configured with 60 nodes, but if I submit a job to use all of
>         the nodes I get the error:
>
>         (Nodes required for job are DOWN, DRAINED or reserved for jobs
>         in higher priority partitions)
>
>         If I lower the job to request 50 nodes, it gets submitted and
>         runs with no problems. I do not have any associations or QOS
>         limits in place that would limit the user. Any ideas as to
>         what could be causing this limit of 50 nodes to be imposed?
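>
>         (A minimal way to reproduce what is described; the partition name
>         is an assumption:)
>
>             sbatch -p cloud -N 60 --wrap "hostname"   # fails with the DOWN/DRAINED message
>             sbatch -p cloud -N 50 --wrap "hostname"   # submits and runs fine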
>
>         -Sajesh-
>