<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Patrick,</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
You may want to review the release notes for 19.05 and any intermediate versions:<br>
<br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1">
<a href="https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES" id="LPlnk601306" class="OWAAutoLink">https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES</a><br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1">
<br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted2 ContentPasted3">
<a href="https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES" id="LPlnk877497" class="OWAAutoLink">https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES</a><br>
<br>
I'd also check <code>slurmd.log</code> on the compute nodes. It's usually at
<code>/var/log/slurm/slurmd.log</code>, or wherever <code>SlurmdLogFile</code> in <code>slurm.conf</code> points.</div>
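<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
If the log is too quiet, raising the <code>slurmd</code> log level may help. A minimal sketch of the relevant <code>slurm.conf</code> lines, assuming the log path shown (I haven't checked where the Debian package puts it, so treat the path as a placeholder):<br>
<br>
<pre><code># Assumed values for illustration; adjust the path to match your install
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurm/slurmd.log</code></pre>
Restart <code>slurmd</code> on one node after changing these and watch the log while it registers with the controller.</div>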
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
<br>
I'm not 100% sure your <code>gres.conf</code> is correct. We use one <code>gres.conf</code> for all our nodes, and it looks something like this:<br>
<br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
<code>NodeName=gpu-[1,2] Name=gpu Type=teslaM40 File=/dev/nvidia[0-3]</code></div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
<div class="ContentPasted4 elementToProof"><code>NodeName=gpu-[3,6] Name=gpu Type=teslaK80 File=/dev/nvidia[0-7]</code></div>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
<code>NodeName=gpu-[7-9] Name=gpu Type=teslaV100 File=/dev/nvidia[0-3]</code><br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
<br>
The SchedMD docs example is a little different, since it assumes a separate <code>gres.conf</code> on each node (so there is no <code>NodeName</code>):<br>
<br>
<a href="https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5" id="LPlnk871426">https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/doc/man/man5/gres.conf.5</a><br>
<br>
<code>Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1</code><br>
<br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
I don't see <code>Name</code> in your <code>gres.conf</code>?</div>
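<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
For comparison, here is a rough sketch of a single shared <code>gres.conf</code> that would line up with the <code>Gres=</code> counts in the <code>slurm.conf</code> you posted. The <code>/dev/nvidia</code> numbering is my assumption (contiguous from 0). I'd confirm it with <code>ls /dev/nvidia*</code> on each node, especially on dgx-2 with its missing GPU:<br>
<br>
<pre><code># Hypothetical gres.conf; device file ranges are assumed, not taken from your nodes
NodeName=titan-[3-15] Name=gpu Type=titanv File=/dev/nvidia[0-7]
NodeName=dgx-2 Name=gpu Type=tesla-v100 File=/dev/nvidia[0-6]
NodeName=dgx-[3-6] Name=gpu Type=tesla-v100 File=/dev/nvidia[0-7]</code></pre>
As far as I know, a node reporting fewer GPUs than its <code>Gres=</code> entry in <code>slurm.conf</code> configures is what produces the "gres/gpu count" drain reason you're seeing, so the two files need to agree.</div>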
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
<br>
</div>
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0 ContentPasted1 ContentPasted4 ContentPasted5 ContentPasted6">
Kind regards</div>
<div class="elementToProof">
<div style="font-family: Arial, Helvetica, sans-serif; font-size: 10pt; color: rgb(0, 0, 0);">
<br>
</div>
<div id="Signature">
<div>
<div></div>
<div></div>
<div id="divtagdefaultwrapper" dir="ltr" style="font-size: 12pt; font-family: Calibri, Arial, Helvetica, sans-serif; color: rgb(0, 0, 0);">
<div class="BodyFragment"><font size="2">
<div class="PlainText"><span style="font-family: arial, sans-serif; font-size: small; color: rgb(34, 34, 34);">-- </span><br style="font-family: arial, sans-serif; font-size: small; color: rgb(34, 34, 34);">
<div class="gmail_signature" style="font-family: arial, sans-serif; font-size: small; color: rgb(34, 34, 34);">
<div dir="ltr"><span style="font-family:Helvetica">Mick Timony</span></div>
<div dir="ltr"><span style="font-family:Helvetica"><span style="font-family: Helvetica; font-size: small; display: inline !important; background-color: rgb(255, 255, 255);">Senior DevOps Engineer</span><br>
</span><span style="font-family:Helvetica">Harvard Medical School</span></div>
<div dir="ltr"><span style="font-family:Helvetica">--</span></div>
<div dir="ltr"><span style="font-family:Helvetica"><br>
</span></div>
</div>
</div>
</font></div>
</div>
</div>
</div>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of Patrick Goetz &lt;pgoetz@math.utexas.edu&gt;<br>
<b>Sent:</b> Thursday, August 24, 2023 11:27 AM<br>
<b>To:</b> Slurm User Community List &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> [slurm-users] Nodes stay drained no matter what I do</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText"><br>
Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian)<br>
<br>
This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I <br>
re-used the original slurm.conf (fearing this might cause issues). The <br>
hardware is the same. The Master and nodes all use the same slurm.conf, <br>
gres.conf, and cgroup.conf files which are soft linked into <br>
/etc/slurm-llnl from an NFS mounted filesystem.<br>
<br>
As per the subject, the nodes refuse to revert to idle:<br>
<br>
-----------------------------------------------------------<br>
root@hypnotoad:~# sinfo -N -l<br>
Thu Aug 24 10:01:20 2023<br>
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON<br>
dgx-2 1 dgx drained 80 80:1:1 500000 0 1 (null) gres/gpu count repor<br>
dgx-3 1 dgx drained 80 80:1:1 500000 0 1 (null) gres/gpu count repor<br>
dgx-4 1 dgx drained 80 80:1:1 500000 0 1 (null) gres/gpu count<br>
...<br>
titan-3 1 titans* drained 40 40:1:1 250000 0 1 (null) gres/gpu count report<br>
...<br>
-----------------------------------------------------------<br>
<br>
Neither of these commands has any effect:<br>
<br>
scontrol update NodeName=dgx-[2-6] State=RESUME<br>
scontrol update state=idle nodename=dgx-[2-6]<br>
<br>
<br>
When I check the slurmctld log I find this helpful information:<br>
<br>
-----------------------------------------------------------<br>
...<br>
[2023-08-24T00:00:00.033] error: _slurm_rpc_node_registration node=dgx-4: Invalid argument<br>
[2023-08-24T00:00:00.037] error: _slurm_rpc_node_registration node=dgx-2: Invalid argument<br>
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-12: Invalid argument<br>
[2023-08-24T00:00:00.216] error: _slurm_rpc_node_registration node=titan-11: Invalid argument<br>
[2023-08-24T00:00:00.266] error: _slurm_rpc_node_registration node=dgx-6: Invalid argument<br>
...<br>
-----------------------------------------------------------<br>
<br>
Googling, this appears to indicate that there is a resource mismatch <br>
between the actual hardware and what is specified in slurm.conf. Note <br>
that the existing configuration worked for Slurm 17, but I checked, and <br>
it looks fine to me:<br>
<br>
Relevant parts of slurm.conf:<br>
<br>
-----------------------------------------------------------<br>
SchedulerType=sched/backfill<br>
SelectType=select/cons_res<br>
SelectTypeParameters=CR_Core_Memory<br>
<br>
PartitionName=titans Default=YES Nodes=titan-[3-15] State=UP MaxTime=UNLIMITED<br>
PartitionName=dgx Nodes=dgx-[2-6] State=UP MaxTime=UNLIMITED<br>
<br>
GresTypes=gpu<br>
NodeName=titan-[3-15] Gres=gpu:titanv:8 RealMemory=250000 CPUs=40<br>
NodeName=dgx-2 Gres=gpu:tesla-v100:7 RealMemory=500000 CPUs=80<br>
NodeName=dgx-[3-6] Gres=gpu:tesla-v100:8 RealMemory=500000 CPUs=80<br>
-----------------------------------------------------------<br>
<br>
All the nodes in the titan partition are identical hardware, as are the <br>
nodes in the dgx partition save for dgx-2, which lost a GPU and is no <br>
longer under warranty. So, using a couple of representative nodes:<br>
<br>
root@dgx-4:~# slurmd -C<br>
NodeName=dgx-4 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515846<br>
<br>
root@titan-8:~# slurmd -C<br>
NodeName=titan-8 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=257811<br>
<br>
<br>
I'm at a loss for how to debug this and am looking for suggestions. Since <br>
the resources on these machines are strictly dedicated to Slurm jobs, <br>
would it be best to use the output of `slurmd -C` directly for the right <br>
hand side of NodeName, reducing the memory a bit for OS overhead? Is <br>
there any way to get better debugging output? "Invalid argument" doesn't <br>
tell me much.<br>
<br>
Thanks.<br>
<br>
</div>
</span></font></div>
</body>
</html>