[slurm-users] Nodes stuck in drain state

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu May 25 13:50:07 UTC 2023


On 5/25/23 15:23, Roger Mason wrote:
> NodeName=node012 CoresPerSocket=2
>     CPUAlloc=0 CPUTot=4 CPULoad=N/A
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=(null)
>     NodeAddr=node012 NodeHostName=node012
>     RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
>     State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=macpro
>     BootTime=None SlurmdStartTime=None
>     CfgTRES=cpu=4,mem=10193M,billing=4
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 AveWatts=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>     Reason=Low RealMemory [slurm at 2023-05-25T09:26:59]
> 
> But the 'Low RealMemory' is incorrect.  The entry in slurm.conf for
> node012 is:
> 
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN

Thanks for the info.  Some questions arise:

1. Is slurmd running on the node?

2. What's the output of "slurmd -C" on the node?

3. Define State=UP in slurm.conf instead of UNKNOWN.

4. Why have you configured TmpDisk=0?  It should be the size of the /tmp 
filesystem.
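For example, points 1 and 2 could be checked like this on the node itself (a sketch, assuming a Linux node; node012 and the 10193 MB figure are taken from your mail). The key point is that Slurm sets Reason=Low RealMemory when the memory slurmd detects is below the RealMemory= configured for the node in slurm.conf:

```shell
#!/bin/sh
# Sketch of the checks above, run on node012 itself.

# 1. Print the configuration line slurmd itself would report; compare it
#    field by field with the NodeName=node012 entry in slurm.conf.
if command -v slurmd >/dev/null; then
    slurmd -C
fi

# 2. Detected physical memory in MB (roughly what slurmd reports as
#    RealMemory on Linux); slurm.conf must not claim more than this.
detected_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
echo "detected memory: ${detected_mb} MB (slurm.conf says RealMemory=10193)"

# 3. Once slurm.conf's RealMemory is <= the detected value and slurmd has
#    been restarted, clear the drain flag from the controller:
if command -v scontrol >/dev/null; then
    scontrol update NodeName=node012 State=RESUME
fi
```

The scontrol step is needed because a node does not leave the DRAIN state by itself once the underlying problem is fixed.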

Since you run Slurm 20.02, there are some suggestions on my Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration 
that might be useful:

> Note for Slurm 20.02: The Boards=1 SocketsPerBoard=2 configuration gives error messages, see bug_9241 and bug_9233. Use Sockets= instead:
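Applied to node012, the slurm.conf entry from your mail would then look something like this (a sketch; set RealMemory no higher than the value slurmd -C actually reports, since the drain reason suggests the detected memory is below 10193 MB):

```
NodeName=node012 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=10193 State=UP
```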

I hope changing these slurm.conf parameters will help.

Best regards,
Ole
