[slurm-users] GRES and GPUs

Hermann Schwärzler hermann.schwaerzler at uibk.ac.at
Wed Jul 19 13:04:41 UTC 2023


Hi Xaver,

I think you are missing the "Count=..." part in gres.conf.

It should read

NodeName=NName Name=gpu File=/dev/tty0 Count=1

in your case.
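
As a rough sketch only, the two files would then pair up as shown below. The slurm.conf lines are copied from the setup quoted further down; the comments and the placement of gres.conf on the node itself are assumptions for illustration, not something confirmed in this thread.

```
# slurm.conf (excerpt, other settings omitted)
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 Gres=gpu:1 State=UNKNOWN

# gres.conf (assumed to be read by slurmd on node NName)
# /dev/tty0 is only a stand-in device file for testing, not a real GPU
NodeName=NName Name=gpu File=/dev/tty0 Count=1
```

The count declared here (1) is meant to match the "gpu:1" in slurm.conf, which is the pair of numbers compared in the "(0 < 1)" message. As the quoted log says, slurmctld has to be restarted for a GresPlugins change to take effect; restarting slurmd on the node as well is probably needed so that it re-reads gres.conf. Running "slurmd -G" on the node should print the GRES configuration that slurmd detects, which may help narrow down where the count of 0 comes from.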

Regards,
Hermann

On 7/19/23 14:19, Xaver Stiensmeier wrote:
> Okay,
> 
> thanks to S. Zhang I was able to figure out why nothing changed. While I 
> did restart slurmctld at the beginning of my tests, I didn't do so 
> later, because I felt like it was unnecessary, but it is right there in 
> the fourth line of the log that this is needed. Somehow I misread it and 
> thought it automatically restarted slurmctld.
> 
> Given the setup:
> 
> slurm.conf
> ...
> GresTypes=gpu
> NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 
> GRES=gpu:1 State=UNKNOWN
> ...
> 
> gres.conf
> NodeName=NName Name=gpu File=/dev/tty0
> 
> When restarting, I get the following error:
> 
> error: Setting node NName state to INVAL with reason:gres/gpu count 
> reported lower than configured (0 < 1)
> 
> So it is still not working, but at least I get a more helpful log 
> message. Because I know that this /dev/tty trick works, I am still 
> unsure where the current error lies, but I will try to investigate it 
> further. I am thankful for any ideas in that regard.
> 
> Best regards,
> Xaver
> 
> On 19.07.23 10:23, Xaver Stiensmeier wrote:
>>
>> Alright,
>>
>> I tried a few more things, but I still wasn't able to get past: srun: 
>> error: Unable to allocate resources: Invalid generic resource (gres) 
>> specification.
>>
>> I should mention that the node I am testing GPU scheduling on doesn't 
>> really have a gpu, but Rob was kind enough to find out that you do not 
>> need a gpu as long as you just link to a file in /dev/ in the 
>> gres.conf. As mentioned: This is just for testing purposes - in the 
>> end we will run this on a node with a gpu, but it is not available at 
>> the moment.
>>
>> *The error isn't changing*
>>
>> If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.
>>
>> *Debug Info*
>>
>> I added the gpu debug flag and logged the following:
>>
>> [2023-07-18T14:59:45.026] restoring original state of nodes
>> [2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array: 
>> select/cons_tres: preparing for 2 partitions
>> [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to 
>> gpu ignored
>> [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to 
>> change GresPlugins
>> [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
>> [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to 
>> gpu ignored
>> [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to 
>> change GresPlugins
>> [2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure: 
>> select/cons_tres: reconfigure
>> [2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array: 
>> select/cons_tres: preparing for 2 partitions
>> [2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
>> [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
>> [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed 
>> usec=5898
>> [2023-07-18T14:59:45.952] 
>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>>
>> I am a bit unsure what to do next to further investigate this issue.
>>
>> Best regards,
>> Xaver
>>
>> On 17.07.23 15:57, Groner, Rob wrote:
>>> That would certainly do it.  If you look at the slurmctld log when it 
>>> comes up, it will say that it's marking that node as invalid because 
>>> it has fewer (0) gres resources than you say it should have.  That's 
>>> because slurmd on that node will come up and say "What gres resources??"
>>>
>>> For testing purposes,  you can just create a dummy file on the node, 
>>> then in gres.conf, point to that file as the "graphics file" 
>>> interface.  As long as you don't try to actually use it as a graphics 
>>> file, that should be enough for that node to think it has gres/gpu 
>>> resources. That's what I do in my vagrant slurm cluster.
>>>
>>> Rob
>>>
>>> ------------------------------------------------------------------------
>>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
>>> of Xaver Stiensmeier <xaverstiensmeier at gmx.de>
>>> *Sent:* Monday, July 17, 2023 9:43 AM
>>> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
>>> *Subject:* Re: [slurm-users] GRES and GPUs
>>> Hi Hermann,
>>>
>>> Good idea, but we are already using `SelectType=select/cons_tres`. After
>>> setting everything up again (in case I made an unnoticed mistake), I saw
>>> that the node got marked STATE=inval.
>>>
>>> To be honest, I thought I could just claim that a node has a gpu even if
>>> it doesn't have one - just for testing purposes. Could this be the issue?
>>>
>>> Best regards,
>>> Xaver Stiensmeier
>>>
>>> On 17.07.23 14:11, Hermann Schwärzler wrote:
>>> > Hi Xaver,
>>> >
>>> > what kind of SelectType are you using in your slurm.conf?
>>> >
>>> > Per https://slurm.schedmd.com/gres.html you have to consider:
>>> > "As for the --gpu* option, these options are only supported by Slurm's
>>> > select/cons_tres plugin."
>>> >
>>> > So you can use "--gpus ..." only when you state
>>> > SelectType              = select/cons_tres
>>> > in your slurm.conf.
>>> >
>>> > But "--gres=gpu:1" should always work.
>>> >
>>> > Regards
>>> > Hermann
>>> >
>>> >
>>> > On 7/17/23 13:43, Xaver Stiensmeier wrote:
>>> >> Hey,
>>> >>
>>> >> I am currently trying to understand how I can schedule a job that
>>> >> needs a GPU.
>>> >>
>>> >> I read about GRES (https://slurm.schedmd.com/gres.html) and tried to use:
>>> >>
>>> >> GresTypes=gpu
>>> >> NodeName=test Gres=gpu:1
>>> >>
>>> >> But calling - after a 'sudo scontrol reconfigure':
>>> >>
>>> >> srun --gpus 1 hostname
>>> >>
>>> >> didn't work:
>>> >>
>>> >> srun: error: Unable to allocate resources: Invalid generic resource
>>> >> (gres) specification
>>> >>
>>> >> so I read more (https://slurm.schedmd.com/gres.conf.html), but that
>>> >> didn't really help me.
>>> >>
>>> >>
>>> >> I am rather confused. GRES claims to be generic resources but then it
>>> >> comes with three defined resources (GPU, MPS, MIG) and using one of
>>> >> those didn't work in my case.
>>> >>
>>> >> Obviously, I am misunderstanding something, but I am unsure where to
>>> >> look.
>>> >>
>>> >>
>>> >> Best regards,
>>> >> Xaver Stiensmeier
>>> >>
>>> >
>>>


