[slurm-users] GRES and GPUs
Xaver Stiensmeier
xaverstiensmeier at gmx.de
Wed Jul 19 08:23:17 UTC 2023
Alright,
I tried a few more things, but I still wasn't able to get past:

srun: error: Unable to allocate resources: Invalid generic resource
(gres) specification.
I should mention that the node I am trying to test the GPU with doesn't
actually have a GPU, but Rob was kind enough to point out that you don't
need one as long as you point to a file in /dev/ in gres.conf. As
mentioned: this is just for testing purposes - in the end we will run
this on a node with a GPU, but that node is not available at the moment.
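For reference, the gres.conf entry for this trick looks roughly like
the following (a sketch rather than my exact file: the node name "test"
matches the slurm.conf snippet quoted further down, and /dev/null is
just a stand-in for any existing file under /dev/):

    # gres.conf - fake a single GPU on node "test", for testing only
    NodeName=test Name=gpu File=/dev/null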
*The error isn't changing*
If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.
*Debug Info*
I added the GPU debug flag and the slurmctld log shows the following:
[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu
ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change
GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu
ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change
GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed
usec=5898
[2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
I am a bit unsure what to do next to further investigate this issue.
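The only concrete hint I can see in the log is the "Restart the
slurmctld daemon to change GresPlugins" line - apparently 'scontrol
reconfigure' is not enough when GresPlugins changes. Assuming Slurm
runs under systemd, the restart would be something like:

    # restart the controller so the new GresPlugins value is loaded
    sudo systemctl restart slurmctld
    # and slurmd on each compute node, so gres.conf is re-read
    sudo systemctl restart slurmd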
Best regards,
Xaver
On 17.07.23 15:57, Groner, Rob wrote:
> That would certainly do it. If you look at the slurmctld log when it
> comes up, it will say that it's marking that node as invalid because
> it has fewer (0) gres resources than you say it should have. That's
> because slurmd on that node will come up and say "What gres resources??"
>
> For testing purposes, you can just create a dummy file on the node,
> then in gres.conf, point to that file as the "graphics file"
> interface. As long as you don't try to actually use it as a graphics
> file, that should be enough for that node to think it has gres/gpu
> resources. That's what I do in my vagrant slurm cluster.
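>
> A quick check afterwards (illustrative; <nodename> stands in for the
> node in question) is
>
>     scontrol show node <nodename> | grep -i gres
>
> which should show Gres=gpu:1 instead of the node sitting in an
> invalid state.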
>
> Rob
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> of Xaver Stiensmeier <xaverstiensmeier at gmx.de>
> *Sent:* Monday, July 17, 2023 9:43 AM
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] GRES and GPUs
> Hi Hermann,
>
> Good idea, but we are already using `SelectType=select/cons_tres`. After
> setting everything up again (in case I made an unnoticed mistake), I saw
> that the node got marked STATE=inval.
>
> To be honest, I thought I could just claim that a node has a GPU even
> if it doesn't have one - just for testing purposes. Could this be the
> issue?
>
> Best regards,
> Xaver Stiensmeier
>
> On 17.07.23 14:11, Hermann Schwärzler wrote:
> > Hi Xaver,
> >
> > what kind of SelectType are you using in your slurm.conf?
> >
> > Per https://slurm.schedmd.com/gres.html you have to consider:
> > "As for the --gpu* option, these options are only supported by Slurm's
> > select/cons_tres plugin."
> >
> > So you can use "--gpus ..." only when you state
> > SelectType = select/cons_tres
> > in your slurm.conf.
> >
> > But "--gres=gpu:1" should work always.
> >
> > Regards
> > Hermann
> >
> >
> > On 7/17/23 13:43, Xaver Stiensmeier wrote:
> >> Hey,
> >>
> >> I am currently trying to understand how I can schedule a job that
> >> needs a GPU.
> >>
> >> I read about GRES (https://slurm.schedmd.com/gres.html) and tried
> >> to use:
> >>
> >> GresTypes=gpu
> >> NodeName=test Gres=gpu:1
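> >>
> >> (Both lines belong in slurm.conf; in a complete node definition the
> >> Gres part sits next to the usual node attributes, e.g., with purely
> >> illustrative values:
> >>
> >>     NodeName=test CPUs=4 RealMemory=8000 Gres=gpu:1 State=UNKNOWN )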
> >>
> >> But calling - after a 'sudo scontrol reconfigure':
> >>
> >> srun --gpus 1 hostname
> >>
> >> didn't work:
> >>
> >> srun: error: Unable to allocate resources: Invalid generic resource
> >> (gres) specification
> >>
> >> so I read more (https://slurm.schedmd.com/gres.conf.html), but
> >> that didn't really help me.
> >>
> >>
> >> I am rather confused. GRES claims to be about generic resources,
> >> but then it comes with three predefined resource types (GPU, MPS,
> >> MIG), and using one of those didn't work in my case.
> >>
> >> Obviously, I am misunderstanding something, but I am unsure where to
> >> look.
> >>
> >>
> >> Best regards,
> >> Xaver Stiensmeier
> >>
> >
>