[slurm-users] Configuring SLURM on single node GPU cluster
Kamil Wilczek
kmwil at mimuw.edu.pl
Wed Apr 6 14:15:12 UTC 2022
Hello,
try to comment out the line:
AutoDetect=nvml
And then restart "slurmd" and "slurmctld".
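For example, gres.conf could look like this afterwards, and the restart
could be done with systemctl (just a sketch; the bracketed File= range is
equivalent to your eight separate lines, and the service names assume a
systemd-based install such as the slurm-llnl packages):

# gres.conf: rely on the explicit device list instead of autodetection
#AutoDetect=nvml
NodeName=localhost Name=gpu File=/dev/nvidia[0-7]

$ sudo systemctl restart slurmd slurmctld
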
Job allocations to the same GPU might be an effect of automatic MPS
configuration, though I'm not 100% sure:
https://slurm.schedmd.com/gres.html#MPS_Management
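Once the GPUs are visible to Slurm as GRES, each job should request one
explicitly, e.g. with "--gres=gpu:1". A minimal batch script could look
like this (the CPU count and "./my_gpu_app" are just placeholders):

#!/bin/bash
#SBATCH --partition=LocalQ
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
# Slurm sets CUDA_VISIBLE_DEVICES to the device it allocated,
# so the application sees only its own GPU.
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"
./my_gpu_app   # placeholder for the real calculation

Assuming the node advertises all eight GPUs (Gres=gpu:8 on the NodeName
line in slurm.conf), independent jobs submitted this way should land on
different devices, and the rest wait in the queue until a GPU is free.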
Kind Regards
--
Kamil Wilczek
On 06.04.2022 at 15:53, Sushil Mishra wrote:
> Dear SLURM users,
>
> I am very new to Slurm and need some help configuring it on a
> single-node machine. The machine has 8x Nvidia GPUs and a 96-core CPU.
> The vendor has set up a "LocalQ", but it somehow runs all
> the calculations on GPU 0. If I submit 4 independent jobs at a time, it
> starts running all four calculations on GPU 0. I want Slurm to assign a
> specific GPU (setting the CUDA_VISIBLE_DEVICES variable) to each job
> before it starts running, and to hold the rest of the jobs in the queue
> until a GPU becomes available.
>
> slurm.conf looks like:
> $ cat /etc/slurm-llnl/slurm.conf
> ClusterName=localcluster
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> GresTypes=gpu
> #SlurmdDebug=debug2
>
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> #
> # COMPUTE NODES
> NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu
> #NodeName=mannose NodeAddr=130.74.2.86 CPUs=1 State=UNKNOWN
>
> # Partitions list
> PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=7-00:00:00 State=UP
> #PartitionName=gpu_short MaxCPUsPerNode=32 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=01-00:00:00 State=UP Nodes=localhost Default=YES
>
> and :
> $ cat /etc/slurm-llnl/gres.conf
> #detect GPUs
> AutoDetect=nvlm
> # GPU gres
> NodeName=localhost Name=gpu File=/dev/nvidia0
> NodeName=localhost Name=gpu File=/dev/nvidia1
> NodeName=localhost Name=gpu File=/dev/nvidia2
> NodeName=localhost Name=gpu File=/dev/nvidia3
> NodeName=localhost Name=gpu File=/dev/nvidia4
> NodeName=localhost Name=gpu File=/dev/nvidia5
> NodeName=localhost Name=gpu File=/dev/nvidia6
> NodeName=localhost Name=gpu File=/dev/nvidia7
>
> Best,
> Sushil
>
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski
ul. Banacha 2
02-097 Warszawa
Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/