<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Okay,</p>
<p>thanks to S. Zhang I was able to figure out why nothing changed.
While I did restart slurmctld at the beginning of my tests, I
didn't do so later because I thought it was unnecessary. However, the
fourth line of the log states that a restart is needed.
Somehow I misread it and thought slurmctld was restarted
automatically. <br>
</p>
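<p>For anyone hitting the same thing, here is a minimal sketch of how
the restart hint could be detected in the controller log. The helper
name <code>needs_ctld_restart</code> is hypothetical (it is not part of
Slurm), and the log path in the comment is only an assumption; the
actual path is whatever SlurmctldLogFile is set to in slurm.conf.</p>
<pre>
```shell
# Hypothetical helper (not part of Slurm): exits 0 if the given slurmctld
# log contains the hint that a full daemon restart is required.
needs_ctld_restart() {
  grep -q "Restart the slurmctld daemon" "$1"
}

# Example call with an assumed log path (adjust to your SlurmctldLogFile):
# if needs_ctld_restart /var/log/slurm/slurmctld.log; then
#   echo "restart slurmctld; 'scontrol reconfigure' is not enough"
# fi
```
</pre>
<p>If the helper matches, restarting the daemon (for example with
'sudo systemctl restart slurmctld', assuming a systemd-managed
installation) should be what the log is asking for.</p>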
<p>Given the setup:</p>
<p>slurm.conf<br>
...<br>
GresTypes=gpu<br>
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
GRES=gpu:1 State=UNKNOWN<br>
...</p>
<p>gres.conf<br>
NodeName=NName Name=gpu File=/dev/tty0<br>
</p>
<p>When restarting, I get the following error:</p>
<p>error: Setting node NName state to INVAL with reason:gres/gpu
count reported lower than configured (0 &lt; 1)</p>
<p>So it is still not working, but at least I get a more helpful log
message. Because I know that this /dev/tty trick works, I am still
unsure where the current error lies, but I will try to investigate
it further. I am thankful for any ideas in that regard.</p>
<p>Best regards,<br>
Xaver<br>
</p>
<div class="moz-cite-prefix">On 19.07.23 10:23, Xaver Stiensmeier
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:f8a4bbc6-b3b9-0ff3-7ab5-a7b340571dee@gmx.de">
<p>Alright,</p>
<p>I tried a few more things, but I still wasn't able to get past:
srun: error: Unable to allocate resources: Invalid generic
resource (gres) specification.</p>
<p>I should mention that the node I am testing GPU scheduling on
doesn't actually have a GPU, but Rob kindly pointed out that
you do not need one as long as you point to a file in
/dev/ in gres.conf. As mentioned, this is just for testing
purposes. In the end we will run this on a node with a GPU, but
it is not available at the moment.</p>
<p><b>The error isn't changing</b></p>
<p>If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the
same error.</p>
<p><b>Debug Info</b></p>
<p>I added the gpu debug flag and logged the following:</p>
<p>[2023-07-18T14:59:45.026] restoring original state of nodes<br>
[2023-07-18T14:59:45.026] select/cons_tres:
part_data_create_array: select/cons_tres: preparing for 2
partitions<br>
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null)
to gpu ignored<br>
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins<br>
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not
specified<br>
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null)
to gpu ignored<br>
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins<br>
[2023-07-18T14:59:45.026] select/cons_tres:
select_p_reconfigure: select/cons_tres: reconfigure<br>
[2023-07-18T14:59:45.027] select/cons_tres:
part_data_create_array: select/cons_tres: preparing for 2
partitions<br>
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default
values set<br>
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand
set.<br>
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller:
completed usec=5898<br>
[2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2</p>
<p>I am a bit unsure what to do next to further investigate this
issue.</p>
<p>Best regards,<br>
Xaver<br>
</p>
<div class="moz-cite-prefix">On 17.07.23 15:57, Groner, Rob wrote:<br>
</div>
<blockquote type="cite"
cite="mid:BL0PR02MB4499D0148BD114169F5D89AC803BA@BL0PR02MB4499.namprd02.prod.outlook.com">
<div class="elementToProof"> That would certainly do it. If you
look at the slurmctld log when it comes up, it will say that
it's marking that node as invalid because it has fewer (0) gres
resources than you say it should have. That's because slurmd
on that node will come up and say "What gres resources??"</div>
<div class="elementToProof"> <br>
</div>
<div class="elementToProof"> For testing purposes, you can just
create a dummy file on the node, then in gres.conf, point to
that file as the "graphics file" interface. As long as you
don't try to actually use it as a graphics file, that should
be enough for that node to think it has gres/gpu resources.
That's what I do in my vagrant slurm cluster.</div>
<div class="elementToProof"> <br>
</div>
<div class="elementToProof"> Rob</div>
<div class="elementToProof"> <br>
</div>
<hr tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><b>From:</b> slurm-users <a
class="moz-txt-link-rfc2396E"
href="mailto:slurm-users-bounces@lists.schedmd.com"
moz-do-not-send="true">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a>
on behalf of Xaver Stiensmeier <a
class="moz-txt-link-rfc2396E"
href="mailto:xaverstiensmeier@gmx.de" moz-do-not-send="true">&lt;xaverstiensmeier@gmx.de&gt;</a><br>
<b>Sent:</b> Monday, July 17, 2023 9:43 AM<br>
<b>To:</b> <a class="moz-txt-link-abbreviated
moz-txt-link-freetext"
href="mailto:slurm-users@lists.schedmd.com"
moz-do-not-send="true">slurm-users@lists.schedmd.com</a> <a
class="moz-txt-link-rfc2396E"
href="mailto:slurm-users@lists.schedmd.com"
moz-do-not-send="true">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
<b>Subject:</b> Re: [slurm-users] GRES and GPUs
<div> </div>
</div>
<div class="BodyFragment"><span>
<div class="PlainText">Hi Hermann,<br>
<br>
Good idea, but we are already using
`SelectType=select/cons_tres`. After<br>
setting everything up again (in case I made an unnoticed
mistake), I saw<br>
that the node got marked STATE=inval.<br>
<br>
To be honest, I thought I could just claim that a node has a
GPU even if<br>
it doesn't have one - just for testing purposes. Could
this be the issue?<br>
<br>
Best regards,<br>
Xaver Stiensmeier<br>
<br>
On 17.07.23 14:11, Hermann Schwärzler wrote:<br>
> Hi Xaver,<br>
><br>
> what kind of SelectType are you using in your
slurm.conf?<br>
><br>
> Per <a href="https://slurm.schedmd.com/gres.html"
moz-do-not-send="true">https://slurm.schedmd.com/gres.html</a>
you have to consider:<br>
> "As for the --gpu* option, these options are only
supported by Slurm's<br>
> select/cons_tres plugin."<br>
><br>
> So you can use "--gpus ..." only when you state<br>
> SelectType = select/cons_tres<br>
> in your slurm.conf.<br>
><br>
> But "--gres=gpu:1" should always work.<br>
><br>
> Regards<br>
> Hermann<br>
><br>
><br>
> On 7/17/23 13:43, Xaver Stiensmeier wrote:<br>
>> Hey,<br>
>><br>
>> I am currently trying to understand how I can
schedule a job that<br>
>> needs a GPU.<br>
>><br>
>> I read about GRES <a
href="https://slurm.schedmd.com/gres.html"
moz-do-not-send="true">https://slurm.schedmd.com/gres.html</a>
and tried to use:<br>
>><br>
>> GresTypes=gpu<br>
>> NodeName=test Gres=gpu:1<br>
>><br>
>> But calling - after a 'sudo scontrol
reconfigure':<br>
>><br>
>> srun --gpus 1 hostname<br>
>><br>
>> didn't work:<br>
>><br>
>> srun: error: Unable to allocate resources:
Invalid generic resource<br>
>> (gres) specification<br>
>><br>
>> so I read more <a
href="https://slurm.schedmd.com/gres.conf.html"
moz-do-not-send="true">https://slurm.schedmd.com/gres.conf.html</a>
but that<br>
>> didn't really help me.<br>
>><br>
>><br>
>> I am rather confused. GRES claims to be generic
resources but then it<br>
>> comes with three defined resources (GPU, MPS,
MIG) and using one of<br>
>> those didn't work in my case.<br>
>><br>
>> Obviously, I am misunderstanding something, but I
am unsure where to<br>
>> look.<br>
>><br>
>><br>
>> Best regards,<br>
>> Xaver Stiensmeier<br>
>><br>
><br>
<br>
</div>
</span></div>
</blockquote>
</blockquote>
</body>
</html>