[slurm-users] Nodes stuck in drain state and sending Invalid Argument every second

Dean Schulze dean.w.schulze at gmail.com
Thu Feb 6 21:06:04 UTC 2020


I moved two nodes to another controller, and now they will not come out of
the drain state.  I've rebooted the hosts, but they are still stuck in the
drain state.  There is nothing in the location configured for saving state,
so I can't understand why a reboot doesn't clear this.

Here's the node state:

$ scontrol show node slurmnode1
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUTot=16 CPULoad=0.58
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:gp100:4
   NodeAddr=slurmnode1 NodeHostName=slurmnode1 Version=19.05.4
   OS=Linux 5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020
   RealMemory=47671 AllocMem=0 FreeMem=46385 Sockets=1 Boards=1
   State=DOWN*+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-02-06T13:48:25 SlurmdStartTime=2020-02-06T13:48:31
   CfgTRES=cpu=16,mem=47671M,billing=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=none [dean@2020-02-06T13:38:13]
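
For reading the State field in the output above, here is a minimal Python
sketch that splits a value such as DOWN*+DRAIN into its parts.  It assumes
the usual Slurm convention that a "*" suffix marks a node that is not
responding and that "+" separates the base state from extra flags.  The
helper names parse_node_fields and decode_state are hypothetical, not part
of Slurm.

```python
# Sketch only: decode the State= field from `scontrol show node` output.
# Helper names are hypothetical; the sample text is taken from the post.


def parse_node_fields(text):
    """Collect KEY=VALUE tokens from scontrol output into a dict.

    Values containing spaces (for example OS=...) keep only their first word.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def decode_state(state):
    """Split a state such as 'DOWN*+DRAIN' into base state, flags, responsiveness."""
    # Assumption: '*' means the controller considers the node unresponsive.
    responding = "*" not in state
    base, *flags = state.replace("*", "").split("+")
    return {"base": base, "flags": flags, "responding": responding}


sample = (
    "RealMemory=47671 AllocMem=0 FreeMem=46385 Sockets=1 Boards=1\n"
    "State=DOWN*+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A\n"
)

node = parse_node_fields(sample)
print(decode_state(node["State"]))
```

Read this way, the node is both DOWN and flagged DRAIN, and the controller
marks it as not responding.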


The controller also logs this error for the nodes nearly every second
while the slurmds are running:

error: _slurm_rpc_node_registration node=slurmnode2: Invalid argument

I did have to open up the Slurm ports on the network after moving these two
nodes to the new controller, since the nodes are wired while the controller
is wireless, but there seems to be two-way communication.
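
One way to confirm the two-way communication is a plain TCP connect test in
each direction.  The sketch below assumes the default Slurm ports (6817 for
slurmctld, 6818 for slurmd); a site's slurm.conf may set SlurmctldPort and
SlurmdPort to other values.  The function name port_open is hypothetical,
and a successful connect shows only that the port is reachable, not that
node registration will succeed.

```python
import socket

# Assumed defaults; check SlurmctldPort and SlurmdPort in slurm.conf.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

On a compute node, port_open("<controller host>", SLURMCTLD_PORT) tests the
path to the controller; on the controller, port_open("slurmnode1",
SLURMD_PORT) tests the path back to the node.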

Any ideas what the problem is?