[slurm-users] GRES and GPUs
Xaver Stiensmeier
xaverstiensmeier at gmx.de
Wed Jul 19 12:19:32 UTC 2023
Okay,
thanks to S. Zhang I was able to figure out why nothing changed. While I
did restart slurmctld at the beginning of my tests, I didn't do so
later because I felt it was unnecessary. However, the fourth line of the
log states that a restart is needed. Somehow I misread it and
thought slurmctld had been restarted automatically.
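A minimal sketch of how that check could be scripted, assuming a systemd-managed installation. The function name and the log path are hypothetical (the real path is whatever SlurmctldLogFile points to in slurm.conf). It only looks for the "Restart the slurmctld daemon" message that appears in the log quoted further down:

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the given slurmctld log
# contains a "Restart the slurmctld daemon" message.
# Caveat: it matches any occurrence, including ones written before the
# last restart, so it is only a rough hint.
needs_slurmctld_restart() {
    grep -q 'Restart the slurmctld daemon' "$1"
}

# Usage (log path is an assumption, adjust it to your setup):
#   if needs_slurmctld_restart /var/log/slurm/slurmctld.log; then
#       sudo systemctl restart slurmctld   # a full restart, not 'scontrol reconfigure'
#   fi
```

The point is that 'scontrol reconfigure' alone did not pick up the GresTypes change in this thread; the daemon itself had to be restarted.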
Given the setup:
slurm.conf
...
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 GRES=gpu:1 State=UNKNOWN
...
gres.conf
NodeName=NName Name=gpu File=/dev/tty0
When restarting, I get the following error:
error: Setting node NName state to INVAL with reason:gres/gpu count
reported lower than configured (0 < 1)
So it is still not working, but at least I get a more helpful log
message. Because I know that this /dev/tty trick works, I am still
unsure where the current error lies, but I will try to investigate it
further. I am thankful for any ideas in that regard.
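One thing that can be ruled out on the compute node itself is that the device file named in gres.conf is missing there. This is a rough sketch, not a diagnosis of the error above: the function name is made up, and it assumes plain File= paths without bracket ranges such as /dev/nvidia[0-3].

```shell
#!/bin/sh
# Hypothetical helper: print every File= path from a gres.conf that does
# not exist on the machine where this runs. Comment lines are skipped.
# Assumption: paths are literal (no bracket ranges, no spaces).
missing_gres_files() {
    grep -v '^[[:space:]]*#' "$1" |
        tr ' \t' '\n\n' |
        sed -n 's/^[Ff][Ii][Ll][Ee]=//p' |
        while IFS= read -r path; do
            [ -e "$path" ] || printf '%s\n' "$path"
        done
}

# Usage on the compute node (config path is an assumption):
#   missing_gres_files /etc/slurm/gres.conf
# 'slurmd -G' on the same node prints the GRES configuration slurmd sees,
# which can be compared with what slurm.conf claims.
```

Empty output only means the listed files exist; slurmd on that node still has to be able to read gres.conf and be restarted after changes.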
Best regards,
Xaver
On 19.07.23 10:23, Xaver Stiensmeier wrote:
>
> Alright,
>
> I tried a few more things, but I still wasn't able to get past: srun:
> error: Unable to allocate resources: Invalid generic resource (gres)
> specification.
>
> I should mention that the node I am trying to test GPUs with doesn't
> really have a GPU, but Rob was kind enough to point out that you do not
> need one as long as you point to a file in /dev/ in
> gres.conf. As mentioned: This is just for testing purposes - in the
> end we will run this on a node with a GPU, but it is not available at
> the moment.
>
> *The error isn't changing*
>
> If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.
>
> *Debug Info*
>
> I added the gpu debug flag and logged the following:
>
> [2023-07-18T14:59:45.026] restoring original state of nodes
> [2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 2 partitions
> [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
> gpu ignored
> [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
> change GresPlugins
> [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
> [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
> gpu ignored
> [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
> change GresPlugins
> [2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
> select/cons_tres: reconfigure
> [2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 2 partitions
> [2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
> [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
> [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed
> usec=5898
> [2023-07-18T14:59:45.952]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>
> I am a bit unsure what to do next to further investigate this issue.
>
> Best regards,
> Xaver
>
> On 17.07.23 15:57, Groner, Rob wrote:
>> That would certainly do it. If you look at the slurmctld log when it
>> comes up, it will say that it's marking that node as invalid because
>> it has fewer (0) gres resources than you say it should have. That's
>> because slurmd on that node will come up and say "What gres resources??"
>>
>> For testing purposes, you can just create a dummy file on the node,
>> then in gres.conf, point to that file as the "graphics file"
>> interface. As long as you don't try to actually use it as a graphics
>> file, that should be enough for that node to think it has gres/gpu
>> resources. That's what I do in my vagrant slurm cluster.
>>
>> Rob
>>
>> ------------------------------------------------------------------------
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
>> of Xaver Stiensmeier <xaverstiensmeier at gmx.de>
>> *Sent:* Monday, July 17, 2023 9:43 AM
>> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
>> *Subject:* Re: [slurm-users] GRES and GPUs
>> Hi Hermann,
>>
>> Good idea, but we are already using `SelectType=select/cons_tres`. After
>> setting everything up again (in case I made an unnoticed mistake), I saw
>> that the node got marked STATE=inval.
>>
>> To be honest, I thought I could just claim that a node has a gpu even if
>> it doesn't have one - just for testing purposes. Could this be the issue?
>>
>> Best regards,
>> Xaver Stiensmeier
>>
>> On 17.07.23 14:11, Hermann Schwärzler wrote:
>> > Hi Xaver,
>> >
>> > what kind of SelectType are you using in your slurm.conf?
>> >
>> > Per https://slurm.schedmd.com/gres.html you have to consider:
>> > "As for the --gpu* option, these options are only supported by Slurm's
>> > select/cons_tres plugin."
>> >
>> > So you can use "--gpus ..." only when you state
>> > SelectType = select/cons_tres
>> > in your slurm.conf.
>> >
>> > But "--gres=gpu:1" should always work.
>> >
>> > Regards
>> > Hermann
>> >
>> >
>> > On 7/17/23 13:43, Xaver Stiensmeier wrote:
>> >> Hey,
>> >>
>> >> I am currently trying to understand how I can schedule a job that
>> >> needs a GPU.
>> >>
>> >> I read about GRES <https://slurm.schedmd.com/gres.html> and tried to use:
>> >>
>> >> GresTypes=gpu
>> >> NodeName=test Gres=gpu:1
>> >>
>> >> But calling - after a 'sudo scontrol reconfigure':
>> >>
>> >> srun --gpus 1 hostname
>> >>
>> >> didn't work:
>> >>
>> >> srun: error: Unable to allocate resources: Invalid generic resource
>> >> (gres) specification
>> >>
>> >> so I read more <https://slurm.schedmd.com/gres.conf.html> but that
>> >> didn't really help me.
>> >>
>> >>
>> >> I am rather confused. GRES claims to be generic resources but then it
>> >> comes with three defined resources (GPU, MPS, MIG) and using one of
>> >> those didn't work in my case.
>> >>
>> >> Obviously, I am misunderstanding something, but I am unsure where to
>> >> look.
>> >>
>> >>
>> >> Best regards,
>> >> Xaver Stiensmeier
>> >>
>> >
>>