[slurm-users] Configuring SLURM on single node GPU cluster

Stephen Cousins steve.cousins at maine.edu
Wed Apr 6 14:17:53 UTC 2022


Hi Sushil,

Try changing NodeName specification to:

NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu:8
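
With Gres=gpu:8 on the node (and your gres.conf below), each job then has to
request a GPU explicitly; a minimal sketch of a submission script (the program
name is just a placeholder):

#!/bin/bash
#SBATCH --partition=LocalQ
#SBATCH --gres=gpu:1          # one GPU per job; Slurm sets CUDA_VISIBLE_DEVICES for it
#SBATCH --cpus-per-task=12    # example CPU share, adjust to your workload

./my_gpu_program              # placeholder for the actual GPU application

Jobs submitted like this will wait in the queue once all 8 GPUs are in use.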


Also:

TaskPlugin=task/cgroup
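
If you switch to task/cgroup, you will probably also want device constraining
in cgroup.conf so each job only sees the GPU(s) it was allocated; a minimal
sketch, assuming the slurm-llnl paths from your install:

$ cat /etc/slurm-llnl/cgroup.conf
ConstrainCores=yes        # bind tasks to their allocated cores
ConstrainDevices=yes      # hide GPUs not allocated to the job
ConstrainRAMSpace=yes     # enforce memory limits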


Best,

Steve

On Wed, Apr 6, 2022 at 9:56 AM Sushil Mishra <sushilbioinfo at gmail.com>
wrote:

> Dear SLURM users,
>
> I am very new to Slurm and need some help configuring it on a single-node
> machine. This machine has 8x Nvidia GPUs and a 96-core CPU. The vendor has
> set up a "LocalQ" partition, but somehow it runs all the calculations on
> GPU 0. If I submit 4 independent jobs at a time, it starts running all four
> calculations on GPU 0. I want Slurm to assign a specific GPU (by setting the
> CUDA_VISIBLE_DEVICES variable) to each job before it starts running, and to
> hold the rest of the jobs in the queue until a GPU becomes available.
>
> slurm.conf looks like:
>
> $ cat /etc/slurm-llnl/slurm.conf
> ClusterName=localcluster
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> GresTypes=gpu
> #SlurmdDebug=debug2
>
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> #
> # COMPUTE NODES
> NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu
> #NodeName=mannose NodeAddr=130.74.2.86 CPUs=1 State=UNKNOWN
>
> # Partitions list
> PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=7-00:00:00 State=UP
> #PartitionName=gpu_short MaxCPUsPerNode=32 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=01-00:00:00 State=UP Nodes=localhost Default=YES
>
> and :
> $ cat /etc/slurm-llnl/gres.conf
> #detect GPUs
> AutoDetect=nvlm
> # GPU gres
> NodeName=localhost Name=gpu File=/dev/nvidia0
> NodeName=localhost Name=gpu File=/dev/nvidia1
> NodeName=localhost Name=gpu File=/dev/nvidia2
> NodeName=localhost Name=gpu File=/dev/nvidia3
> NodeName=localhost Name=gpu File=/dev/nvidia4
> NodeName=localhost Name=gpu File=/dev/nvidia5
> NodeName=localhost Name=gpu File=/dev/nvidia6
> NodeName=localhost Name=gpu File=/dev/nvidia7
>
> Best,
> Sushil
>
>

-- 
________________________________________________________________
 Steve Cousins          Interim Director/Supercomputer Engineer
 Advanced Computing Group            University of Maine System
 244 Neville Hall (UMS Data Center)              (207) 581-3574
 Orono ME 04469                      steve.cousins at maine.edu