[slurm-users] Limit on number of nodes user able to request
Brian Andrus
toomuchit at gmail.com
Thu Apr 1 20:05:47 UTC 2021
How are you taking them offline? I would expect a SuspendProgram script
that runs the command that shuts them down. Also, one of your
SlurmctldParameters should be "idle_on_node_suspend".
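For reference, a minimal sketch of the slurm.conf settings being described (the script paths here are hypothetical; adjust for your site):

```
# slurm.conf -- power-saving hooks for cloud nodes
SuspendProgram=/usr/local/sbin/slurm_suspend.sh   # hypothetical path: shuts idle cloud nodes down
ResumeProgram=/usr/local/sbin/slurm_resume.sh     # hypothetical path: boots them back up on demand
SuspendTime=300                                   # seconds a node must sit idle before suspend
SlurmctldParameters=idle_on_node_suspend          # mark nodes IDLE when suspended, per the advice above
```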
Brian Andrus
On 4/1/2021 12:25 PM, Sajesh Singh wrote:
>
> Brian,
>
> The job is targeting the correct partition, and there are no QOS limits
> imposed that would cause this issue. The only way I have found to remedy
> it is to completely remove the cloud nodes from Slurm, restart slurmctld,
> re-add the nodes to Slurm, and restart slurmctld again.
>
> I believe the issue occurs when the nodes in the cloud go
> offline and slurmctld is no longer able to reach them. I am not able
> to change the node state manually so that slurmctld will allow them to
> be used the next time a job requires them. I cannot set the state to CLOUD.
>
> The other option may be to bring up all of the nodes that are in this
> unknown state so that slurmctld can go through the motions with them
> and then run the job again.
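For anyone following along, the usual way to force a node's state by hand is via scontrol (node names taken from the logs below; whether slurmctld accepts the change for an unreachable cloud node is exactly the open question here):

```shell
# Tell slurmctld the node is powered down and eligible for a later resume
scontrol update NodeName=node-11 State=POWER_DOWN

# Or clear a DOWN/DRAIN flag so the node becomes schedulable again
scontrol update NodeName=node-11 State=RESUME
```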
>
> -Sajesh-
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
> Of *Brian Andrus
> *Sent:* Thursday, April 1, 2021 2:51 PM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] Limit on number of nodes user able to request
>
> *EXTERNAL SENDER*
>
> For this one, you want to look closely at the job. Is it targeting a
> specific partition/nodelist?
>
> See what resources it is looking for (scontrol show job <jobid>)
> Also look at the partition limits as well as any QOS items (if you are
> using them).
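The checks suggested above can be run as (exact sacctmgr format fields may vary slightly by Slurm version):

```shell
scontrol show job <jobid>        # requested node count, partition, held/pending reason
scontrol show partition <name>   # partition limits such as MaxNodes and MaxTime
sacctmgr show qos format=Name,MaxTRESPerJob,MaxTRESPerUser   # QOS limits, if accounting is enabled
```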
>
> Brian Andrus
>
> On 4/1/2021 10:00 AM, Sajesh Singh wrote:
>
> Some additional information after enabling debug3 on slurmctld daemon:
>
> Logs show that there are enough usable nodes for the job:
>
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-11
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-12
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-13
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-14
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-15
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-16
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-17
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-18
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-19
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-20
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-21
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-22
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-23
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-24
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-25
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-26
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-27
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-28
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-29
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-30
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-31
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-32
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-33
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-34
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-35
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-36
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-37
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-38
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-39
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-40
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-41
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-42
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-43
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-44
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-45
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-46
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-47
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-48
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-49
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-50
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-51
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-52
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-53
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-54
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-55
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-56
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-57
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-58
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-59
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-60
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-61
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-62
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-63
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-64
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-65
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-66
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-67
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-68
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-69
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-70
> [2021-04-01T10:39:14.400] debug2: found 1 usable nodes from config containing node-71
>
> But then the following line is in the log as well:
>
> debug3: select_nodes: JobId=67171529 required nodes not avail
>
> --
>
> -Sajesh-
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com>
> <mailto:slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Sajesh Singh
> *Sent:* Thursday, March 25, 2021 9:02 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> <mailto:slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Limit on number of nodes user able to
> request
>
>
> No nodes are in a downed or drained state. These are nodes that are
> dynamically brought up and down via the powersave plugin. When they
> are taken offline due to being idle, I believe the state is set to
> FUTURE by the powersave plugin.
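For comparison, cloud-bursting nodes are normally declared with State=CLOUD in slurm.conf rather than FUTURE; a minimal sketch (hardware values and partition name are hypothetical):

```
# slurm.conf -- cloud node definition sketch (CPU/memory values hypothetical)
NodeName=node-[11-71] State=CLOUD CPUs=4 RealMemory=16000
PartitionName=cloud Nodes=node-[11-71] Default=NO MaxTime=INFINITE State=UP
```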
>
> -Sajesh-
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com
> <mailto:slurm-users-bounces at lists.schedmd.com>> *On Behalf Of
> *Brian Andrus
> *Sent:* Wednesday, March 24, 2021 11:02 PM
> *To:* slurm-users at lists.schedmd.com
> <mailto:slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Limit on number of nodes user able to
> request
>
>
> Do 'sinfo -R' and see if you have any down or drained nodes.
>
> Brian Andrus
>
> On 3/24/2021 6:31 PM, Sajesh Singh wrote:
>
> Slurm 20.02
>
> CentOS 8
>
> I just recently noticed a strange behavior when using the
> powersave plugin for bursting to AWS. I have a queue
> configured with 60 nodes, but if I submit a job to use all of
> the nodes I get the error:
>
> (Nodes required for job are DOWN, DRAINED or reserved for jobs
> in higher priority partitions)
>
> If I lower the job to request 50 nodes, it gets submitted and
> runs with no problems. I do not have any associations or QOS
> limits in place that would limit the user. Any ideas as to
> what could be causing this limit of 50 nodes to be imposed?
>
> -Sajesh-
>