[slurm-users] Configuring SLURM on single node GPU cluster

Kamil Wilczek kmwil at mimuw.edu.pl
Wed Apr 6 14:15:12 UTC 2022


Hello,

try to comment out the line:

     AutoDetect=nvml

And then restart "slurmd" and "slurmctld".
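
If your install uses systemd (the Debian/Ubuntu slurm-llnl packages ship
service units), that would be roughly:

     sudo systemctl restart slurmd
     sudo systemctl restart slurmctld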

Job allocations landing on the same GPU might be an effect of automatic MPS
configuration, though I'm not 100% sure:
https://slurm.schedmd.com/gres.html#MPS_Management
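
Afterwards you can check whether the GPUs registered, for example with:

     sudo slurmd -G
     scontrol show node localhost

If the GRES lines are picked up, the node should report something
like "Gres=gpu:8".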

Kind Regards
-- 
Kamil Wilczek

On 06.04.2022 at 15:53, Sushil Mishra wrote:
> Dear SLURM users,
> 
> I am very new to Slurm and need some help configuring it on a
> single-node machine. This machine has 8x Nvidia GPUs and a 96-core CPU.
> The vendor has set up a "LocalQ", but it somehow runs all
> the calculations on GPU 0. If I submit 4 independent jobs at a time, it
> starts running all four calculations on GPU 0. I want Slurm to assign a
> specific GPU (setting the CUDA_VISIBLE_DEVICES variable) to each job
> before it starts running, and to hold the rest of the jobs in the queue
> until a GPU becomes available.
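
For that to work, each job also has to request a GPU explicitly; with a
working GRES setup Slurm then binds one device per job and exports
CUDA_VISIBLE_DEVICES for it. A minimal example, assuming a submit script
called "run.sh":

     sbatch --gres=gpu:1 run.sh

Once all eight GPUs are busy, further jobs stay pending until one frees up.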
> 
> slurm.conf looks like:
> $ cat /etc/slurm-llnl/slurm.conf
> ClusterName=localcluster
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> GresTypes=gpu
> #SlurmdDebug=debug2
> 
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> #
> # COMPUTE NODES
> NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu
> #NodeName=mannose NodeAddr=130.74.2.86 CPUs=1 State=UNKNOWN
> 
> # Partitions list
> PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=7-00:00:00 State=UP
> #PartitionName=gpu_short MaxCPUsPerNode=32 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=01-00:00:00 State=UP Nodes=localhost Default=YES
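
One thing I noticed: the NodeName line above only says "Gres=gpu", without
a device count. As far as I know slurmctld wants the count there, so with
all 8 GPUs schedulable it would look roughly like:

     NodeName=localhost CPUs=96 Gres=gpu:8 State=UNKNOWN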
> 
> and:
> $ cat /etc/slurm-llnl/gres.conf
> #detect GPUs
> AutoDetect=nvlm
> # GPU gres
> NodeName=localhost Name=gpu File=/dev/nvidia0
> NodeName=localhost Name=gpu File=/dev/nvidia1
> NodeName=localhost Name=gpu File=/dev/nvidia2
> NodeName=localhost Name=gpu File=/dev/nvidia3
> NodeName=localhost Name=gpu File=/dev/nvidia4
> NodeName=localhost Name=gpu File=/dev/nvidia5
> NodeName=localhost Name=gpu File=/dev/nvidia6
> NodeName=localhost Name=gpu File=/dev/nvidia7
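
Note also the typo in that file: the autodetection value is "nvml", not
"nvlm". If you comment that line out as suggested above, the explicit File=
entries below it are what slurmd will use; they could also be collapsed
into a range, if I remember the syntax right:

     #AutoDetect=nvml
     NodeName=localhost Name=gpu File=/dev/nvidia[0-7]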
> 
> Best,
> Sushil
> 

-- 
Kamil Wilczek  [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/