[slurm-users] gres names

Erik Bryer ebryer at isi.edu
Wed Dec 16 15:51:54 UTC 2020


Hi Loris,

That actually makes some sense. There is one thing that troubles me though. If, on a VM with no GPUs, I define...

NodeName=saga-test01 CPUS=2 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=1800 State=UNKNOWN Gres=gpu:gtx1080ti:4

...and try to run the following I get an error...

$ sbatch -w saga-test02 --gpus=gtx1080ti:1 --partition scavenge --wrap "ls -l" --qos scavenge
sbatch: error: Batch job submission failed: Requested node configuration is not available

This also fouls up the whole cluster. Directly after issuing the sbatch, the following appears in the slurmctld log:

Dec 16 07:39:03 saga-test03 slurmctld[1169]: error: Setting node saga-test01 state to DRAIN

In past tests I've been unable to get both nodes back online without removing the spurious gres from the node definition. All this still makes me wonder whether there is a direct link between the hardware and the gres names; I suspect there is. Someone mentioned that the gres names get spit out by NVML (though apparently you can also make up your own?), but I can't find any record of ours. Any thoughts?
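
For what it's worth, the drain looks consistent with slurmd on saga-test01 reporting fewer GPUs than the Gres= line in slurm.conf promises. On a node that really had the cards, I'd expect the node definition to be backed by a gres.conf along these lines (the device paths are just my assumption; this VM obviously has none):

  # gres.conf on saga-test01 -- either list the devices explicitly...
  NodeName=saga-test01 Name=gpu Type=gtx1080ti File=/dev/nvidia[0-3]
  # ...or let slurmd discover the type names and counts via NVML:
  AutoDetect=nvml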

Thanks,
Erik
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Loris Bennett <loris.bennett at fu-berlin.de>
Sent: Wednesday, December 16, 2020 12:07 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] gres names

Hi Erik,

Erik Bryer <ebryer at isi.edu> writes:

> Thanks for your reply. I can't find NVML in the logs going back to
> 11/22. dmesg goes back to the last boot, but has no mention of
> NVML. Regarding making one up on my own, how does Slurm know that the
> string "xyzzy" corresponds to, e.g., a Tesla GPU?

As I understand it, Slurm doesn't need to know the correspondence, since
all it is doing is counting.  If you define a GRES, say,

  magic:wand

you can configure your nodes to have, say, 2 of these.  Then if a job
requests

 --gres=magic:wand:1

and starts, a subsequent job which requests

 --gres=magic:wand:2

will have to wait until the first magic wand becomes free again.
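
Concretely, and purely as a sketch with made-up node names, the
definition could look something like this:

  # slurm.conf
  GresTypes=magic
  NodeName=node[01-02] Gres=magic:wand:2   # other node parameters omitted

  # gres.conf on each node -- no device files, just a count
  Name=magic Type=wand Count=2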
However, Slurm doesn't need to know whether your nodes really do have
magic wands; your users just need to request them if their jobs require
them.  To prevent a job from using a magic wand without requesting one,
you have to check the job parameters at submission time, which you can
do via the job submit plugin.
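
As a rough illustration, a job_submit.lua along these lines could reject
jobs that land on a GPU partition without requesting the GRES. The
partition name is made up, and the job_desc field holding the request
differs between Slurm versions (older releases expose job_desc.gres,
newer ones job_desc.tres_per_node), so treat this only as a sketch:

  -- job_submit.lua (lives next to slurm.conf)
  function slurm_job_submit(job_desc, part_list, submit_uid)
     -- "gpu" here is a placeholder partition name
     if job_desc.partition == "gpu" then
        local gres = job_desc.tres_per_node or job_desc.gres
        if gres == nil or not string.find(gres, "magic") then
           slurm.log_user("jobs in this partition must request --gres=magic:wand:N")
           return slurm.ERROR
        end
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end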

Regards

Loris

> Thanks,
> Erik
> ________________________________
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Michael Di Domenico <mdidomenico4 at gmail.com>
> Sent: Tuesday, December 15, 2020 1:24 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] gres names
>
> you can either make them up on your own or they get spit out by NVML
> in the slurmd.log file
>
> On Tue, Dec 15, 2020 at 12:55 PM Erik Bryer <ebryer at isi.edu> wrote:
>>
>> Hi,
>>
>> Where do I get the gres names, e.g. "rtx2080ti", to use for my gpus in my node definitions in slurm.conf?
>>
>> Thanks,
>> Erik
>
--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
