[slurm-users] Nodes stuck in drain state

Groner, Rob rug262 at psu.edu
Thu May 25 14:56:51 UTC 2023


A quick test to see whether it's a configuration mismatch is to set config_overrides (SlurmdParameters=config_overrides) in your slurm.conf and see if the node then responds to scontrol update.
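For reference, that would look like the following in slurm.conf (a sketch, assuming a Slurm version recent enough to support SlurmdParameters=config_overrides, which replaced the old FastSchedule=2 behavior). It tells slurmctld to trust the configured node definition rather than what slurmd reports at registration:

```
# slurm.conf -- trust the configured node spec over slurmd's report:
SlurmdParameters=config_overrides
```

Note this only masks the mismatch for testing; the underlying memory discrepancy still needs fixing.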

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Brian Andrus <toomuchit at gmail.com>
Sent: Thursday, May 25, 2023 10:54 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Nodes stuck in drain state

That output of slurmd -C is your answer.

Slurmd only sees about 6 GB of memory (RealMemory=6097), but your slurm.conf claims it has about 10 GB (RealMemory=10193).

I would run some memtests, look at meminfo on the node, etc.

Maybe even check that the type/size of memory in there is what you think
it is.
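Concretely, those checks might look like this on the node (node012 as configured later in the thread; all standard Linux/Slurm commands):

```
free -m                      # total MiB the kernel sees; compare to RealMemory
grep MemTotal /proc/meminfo  # same figure in kiB
slurmd -C                    # what slurmd will report to slurmctld
dmidecode --type memory      # physical DIMM sizes/types (needs root)
```

If those numbers agree with slurmd -C rather than with slurm.conf, the hardware (or the BIOS/kernel mapping of it) really does have less memory than configured.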

Brian Andrus

On 5/25/2023 7:30 AM, Roger Mason wrote:
> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> writes:
>
>> 1. Is slurmd running on the node?
> Yes.
>
>> 2. What's the output of "slurmd -C" on the node?
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=6097
>
>> 3. Define State=UP in slurm.conf instead of UNKNOWN
> Will do.
>
>> 4. Why have you configured TmpDisk=0?  It should be the size of the
>> /tmp filesystem.
> I have not configured TmpDisk.  This is the entry in slurm.conf for that
> node:
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN
>
> But I do notice that slurmd -C now says there is less memory than
> configured.
>
> Thanks again.
>
> Roger
>
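If the node really does have only ~6 GB usable, the quickest fix is to make slurm.conf agree with slurmd -C and then return the node to service (a sketch only; node012 and the values are taken from this thread):

```
# slurm.conf -- RealMemory lowered to match slurmd -C output:
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097 State=UP

# on the controller, after editing slurm.conf:
scontrol reconfigure
scontrol update NodeName=node012 State=RESUME
```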

