[slurm-users] GPU jobs not running correctly
Andrey Malyutin
malyutinag at gmail.com
Fri Aug 20 22:48:04 UTC 2021
Thank you for your help, Sam! The rest of the slurm.conf, excluding the
node and partition configuration from the earlier email, is below. I've also
included scontrol output for a 1 GPU job that runs successfully on node01.
Best,
Andrey
*Slurm.conf*
#
# See the slurm.conf man page for more information.
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
# Scheduler
SchedulerType=sched/backfill
# Statesave
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave/slurm
# Generic resources types
GresTypes=gpu
# Epilog/Prolog section
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Power saving section (disabled)
# GPU related plugins
#SelectType=select/cons_tres
#SelectTypeParameters=CR_Core
#AccountingStorageTRES=gres/gpu
# END AUTOGENERATED SECTION -- DO NOT REMOVE
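(For reference, given the missing GPU TRES that Sam points out below, a
sketch of the GPU-related lines above with their comments removed, values
exactly as already written there and not verified on this cluster, would be:

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageTRES=gres/gpu

A change to SelectType generally requires restarting slurmctld and the
slurmd daemons rather than just reconfiguring.)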
*Scontrol output for a working 1 GPU job on node01*
JobId=285 JobName=cryosparc_P2_J232
UserId=cryosparc(1003) GroupId=cryosparc(1003) MCS_label=N/A
Priority=4294901570 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:51 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2021-08-21T00:05:30 EligibleTime=2021-08-21T00:05:30
AccrueTime=2021-08-21T00:05:30
StartTime=2021-08-21T00:05:30 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-21T00:05:30
Partition=CSLive AllocNode:Sid=headnode:108964
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node01
BatchHost=node01
NumNodes=1 NumCPUs=64 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,node=1,billing=64
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=24000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/data/backups/takeda2/data/cryosparc_projects/P8/J232/queue_sub_script.sh
WorkDir=/ssd/CryoSparc/cryosparc_master
StdErr=/data/backups/takeda2/data/cryosparc_projects/P8/J232/job.log
StdIn=/dev/null
StdOut=/data/backups/takeda2/data/cryosparc_projects/P8/J232/job.log
Power=
TresPerNode=gpu:1
MailUser=cryosparc MailType=NONE
*Cgroup*
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
TaskAffinity=no
ConstrainCores=no
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=no
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=0.00
MinKmemSpace=30
MaxKmemPercent=100.00
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=30
On Fri, Aug 20, 2021 at 3:12 PM Fulcomer, Samuel <samuel_fulcomer at brown.edu>
wrote:
> ...and I'm not sure what "AutoDetect=NVML" is supposed to do in the
> gres.conf file. We've always used "nvidia-smi topo -m" to confirm whether
> we've got a single-root or dual-root node, and have entered the correct
> info in gres.conf to map connections to the CPU sockets, e.g.:
>
> # 8-gpu A6000 nodes - dual-root
> NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[0-3] CPUs=0-23
> NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[4-7]
> CPUs=24-47
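>
> (Mapped onto the node definitions further down in this thread, four GPUs
> per node on node[01-04], a sketch without the CPU mapping would look much
> like the commented-out line already sitting in your gres.conf, e.g.
>
> NodeName=node[01-04] Name=gpu File=/dev/nvidia[0-3]
>
> with CPUs= ranges added per device group once "nvidia-smi topo -m" has been
> run on each node; the device paths here are an assumption, not something
> checked on your hardware.)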
>
>
>
>
>
> On Fri, Aug 20, 2021 at 6:01 PM Fulcomer, Samuel <
> samuel_fulcomer at brown.edu> wrote:
>
>> Well... you've got lots of weirdness, as the scontrol show job command
>> isn't listing any GPU TRES requests, and the scontrol show node command
>> isn't listing any configured GPU TRES resources.
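>>
>> (For comparison, with GPU GRES actually being scheduled, one would expect
>> the job's TRES line and the node's CfgTRES line to carry a gres/gpu term,
>> roughly like the following, purely as an illustration:
>>
>> TRES=cpu=4,mem=24000M,node=1,billing=4,gres/gpu=1
>> CfgTRES=cpu=64,mem=251G,billing=64,gres/gpu=4
>>
>> and neither appears anywhere in the output you sent.)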
>>
>> If you send me your entire slurm.conf, I'll have a quick look-over.
>>
>> You should also be using cgroup.conf to fence off the GPU devices so that
>> a job only sees the GPUs it has been allocated; the lines in the batch
>> script that work this out by hand then aren't necessary. I forgot to ask
>> you about cgroup.conf.
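>>
>> (A minimal cgroup.conf sketch for that fencing, with illustrative values
>> only and assuming cgroup v1 as on your EL7 nodes, would be:
>>
>> ConstrainDevices=yes
>>
>> together with a cgroup_allowed_devices_file.conf listing the non-GPU
>> devices jobs may always use; with TaskPlugin=task/cgroup, which the
>> slurm.conf earlier in this thread already sets, each job then only sees
>> its allocated /dev/nvidia* devices.)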
>>
>> regards,
>> Sam
>>
>> On Fri, Aug 20, 2021 at 5:46 PM Andrey Malyutin <malyutinag at gmail.com>
>> wrote:
>>
>>> Thank you Samuel,
>>>
>>> The Slurm version is 20.02.6. I'm not entirely sure about the platform;
>>> the RTX6000 nodes are about two years old, and the 3090 node is very
>>> recent. Technically we have 4 nodes (hence the references to node04 in
>>> the info below), but one of them is down and out of the system at the
>>> moment. As you can see, the job really wants to run on the downed node
>>> instead of going to node02 or node03.
>>>
>>> Thank you again,
>>> Andrey
>>>
>>>
>>>
>>> *scontrol info:*
>>>
>>> JobId=283 JobName=cryosparc_P2_J214
>>>
>>> UserId=cryosparc(1003) GroupId=cryosparc(1003) MCS_label=N/A
>>>
>>> Priority=4294901572 Nice=0 Account=(null) QOS=normal
>>>
>>> JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:node04
>>> Dependency=(null)
>>>
>>> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>>
>>> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>>>
>>> SubmitTime=2021-08-20T20:55:00 EligibleTime=2021-08-20T20:55:00
>>>
>>> AccrueTime=2021-08-20T20:55:00
>>>
>>> StartTime=Unknown EndTime=Unknown Deadline=N/A
>>>
>>> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-20T23:36:14
>>>
>>> Partition=CSCluster AllocNode:Sid=headnode:108964
>>>
>>> ReqNodeList=(null) ExcNodeList=(null)
>>>
>>> NodeList=(null)
>>>
>>> NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>>
>>> TRES=cpu=4,mem=24000M,node=1,billing=4
>>>
>>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>>
>>> MinCPUsNode=1 MinMemoryNode=24000M MinTmpDiskNode=0
>>>
>>> Features=(null) DelayBoot=00:00:00
>>>
>>> OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>>>
>>>
>>> Command=/data/backups/takeda2/data/cryosparc_projects/P8/J214/queue_sub_script.sh
>>>
>>> WorkDir=/ssd/CryoSparc/cryosparc_master
>>>
>>> StdErr=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>
>>> StdIn=/dev/null
>>>
>>> StdOut=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>
>>> Power=
>>>
>>> TresPerNode=gpu:1
>>>
>>> MailUser=cryosparc MailType=NONE
>>>
>>>
>>> *Script:*
>>>
>>> #SBATCH --job-name cryosparc_P2_J214
>>>
>>> #SBATCH -n 4
>>>
>>> #SBATCH --gres=gpu:1
>>>
>>> #SBATCH -p CSCluster
>>>
>>> #SBATCH --mem=24000MB
>>>
>>> #SBATCH
>>> --output=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>
>>> #SBATCH
>>> --error=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>
>>>
>>>
>>> available_devs=""
>>>
>>> for devidx in $(seq 0 15);
>>>
>>> do
>>>
>>> if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid
>>> --format=csv,noheader) ]] ; then
>>>
>>> if [[ -z "$available_devs" ]] ; then
>>>
>>> available_devs=$devidx
>>>
>>> else
>>>
>>> available_devs=$available_devs,$devidx
>>>
>>> fi
>>>
>>> fi
>>>
>>> done
>>>
>>> export CUDA_VISIBLE_DEVICES=$available_devs
>>>
>>>
>>>
>>> /ssd/CryoSparc/cryosparc_worker/bin/cryosparcw run --project P2 --job
>>> J214 --master_hostname headnode.cm.cluster --master_command_core_port 39002
>>> > /data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log 2>&1
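>>>
>>> Given Sam's note above that this device-probing loop shouldn't be needed
>>> once GPUs are fenced off per job, here is a sketch of the same submission
>>> relying on Slurm to export CUDA_VISIBLE_DEVICES for the allocated GPU
>>> (same paths, ports and options as above; illustrative only):
>>>
>>> #!/bin/bash
>>> #SBATCH --job-name cryosparc_P2_J214
>>> #SBATCH -n 4
>>> #SBATCH --gres=gpu:1
>>> #SBATCH -p CSCluster
>>> #SBATCH --mem=24000MB
>>> #SBATCH --output=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>> #SBATCH --error=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>
>>> # No manual GPU probing: once GPU GRES is scheduled and devices are
>>> # constrained, Slurm sets CUDA_VISIBLE_DEVICES for the GPUs allocated
>>> # to this job.
>>> /ssd/CryoSparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J214 \
>>>     --master_hostname headnode.cm.cluster --master_command_core_port 39002 \
>>>     > /data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log 2>&1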
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *Slurm.conf*
>>>
>>> # This section of this file was automatically generated by cmd. Do not
>>> edit manually!
>>>
>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>>
>>> # Server nodes
>>>
>>> SlurmctldHost=headnode
>>>
>>> AccountingStorageHost=master
>>>
>>>
>>> #############################################################################################
>>>
>>> #GPU Nodes
>>>
>>>
>>> #############################################################################################
>>>
>>> NodeName=node[02-04] Procs=64 CoresPerSocket=16 RealMemory=257024
>>> Sockets=2 ThreadsPerCore=2 Feature=RTX6000 Gres=gpu:4
>>>
>>> NodeName=node01 Procs=64 CoresPerSocket=16 RealMemory=386048 Sockets=2
>>> ThreadsPerCore=2 Feature=RTX3090 Gres=gpu:4
>>>
>>> #NodeName=node[05-08] Procs=8 Gres=gpu:4
>>>
>>> #
>>>
>>>
>>> #############################################################################################
>>>
>>> # Partitions
>>>
>>>
>>> #############################################################################################
>>>
>>> PartitionName=defq Default=YES MinNodes=1 DefaultTime=UNLIMITED
>>> MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1
>>> OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL
>>> Nodes=node[01-04]
>>>
>>> PartitionName=CSLive MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED
>>> AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO
>>> PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node01
>>>
>>> PartitionName=CSCluster MinNodes=1 DefaultTime=UNLIMITED
>>> MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1
>>> OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL
>>> Nodes=node[02-04]
>>>
>>> ClusterName=slurm
>>>
>>>
>>>
>>> *Gres.conf*
>>>
>>> # This section of this file was automatically generated by cmd. Do not
>>> edit manually!
>>>
>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>>
>>> AutoDetect=NVML
>>>
>>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>>>
>>> #Name=gpu File=/dev/nvidia[0-3] Count=4
>>>
>>> #Name=mic Count=0
>>>
>>>
>>>
>>> *Sinfo:*
>>>
>>> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>>>
>>> defq* up infinite 1 down* node04
>>>
>>> defq* up infinite 3 idle node[01-03]
>>>
>>> CSLive up infinite 1 idle node01
>>>
>>> CSCluster up infinite 1 down* node04
>>>
>>> CSCluster up infinite 2 idle node[02-03]
>>>
>>>
>>>
>>> *Node1:*
>>>
>>> NodeName=node01 Arch=x86_64 CoresPerSocket=16
>>>
>>> CPUAlloc=0 CPUTot=64 CPULoad=0.04
>>>
>>> AvailableFeatures=RTX3090
>>>
>>> ActiveFeatures=RTX3090
>>>
>>> Gres=gpu:4
>>>
>>> NodeAddr=node01 NodeHostName=node01 Version=20.02.6
>>>
>>> OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC
>>> 2020
>>>
>>> RealMemory=386048 AllocMem=0 FreeMem=16665 Sockets=2 Boards=1
>>>
>>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>
>>> Partitions=defq,CSLive
>>>
>>> BootTime=2021-08-04T13:59:08 SlurmdStartTime=2021-08-10T09:32:43
>>>
>>> CfgTRES=cpu=64,mem=377G,billing=64
>>>
>>> AllocTRES=
>>>
>>> CapWatts=n/a
>>>
>>> CurrentWatts=0 AveWatts=0
>>>
>>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>
>>>
>>>
>>> *Node2-3*
>>>
>>> NodeName=node02 Arch=x86_64 CoresPerSocket=16
>>>
>>> CPUAlloc=0 CPUTot=64 CPULoad=0.48
>>>
>>> AvailableFeatures=RTX6000
>>>
>>> ActiveFeatures=RTX6000
>>>
>>> Gres=gpu:4(S:0-1)
>>>
>>> NodeAddr=node02 NodeHostName=node02 Version=20.02.6
>>>
>>> OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC
>>> 2020
>>>
>>> RealMemory=257024 AllocMem=0 FreeMem=2259 Sockets=2 Boards=1
>>>
>>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>
>>> Partitions=defq,CSCluster
>>>
>>> BootTime=2021-07-29T20:47:32 SlurmdStartTime=2021-08-10T09:32:55
>>>
>>> CfgTRES=cpu=64,mem=251G,billing=64
>>>
>>> AllocTRES=
>>>
>>> CapWatts=n/a
>>>
>>> CurrentWatts=0 AveWatts=0
>>>
>>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>
>>> On Thu, Aug 19, 2021, 6:07 PM Fulcomer, Samuel <
>>> samuel_fulcomer at brown.edu> wrote:
>>>
>>>> What SLURM version are you running?
>>>>
>>>> What are the #SLURM directives in the batch script? (or the sbatch
>>>> arguments)
>>>>
>>>> When the single GPU jobs are pending, what's the output of 'scontrol
>>>> show job JOBID'?
>>>>
>>>> What are the node definitions in slurm.conf, and the lines in gres.conf?
>>>>
>>>> Are the nodes all the same host platform (motherboard)?
>>>>
>>>> We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX
>>>> 1s, A6000s, and A40s, with a mix of single and dual-root platforms, and
>>>> haven't seen this problem with SLURM 20.02.6 or earlier versions.
>>>>
>>>> On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin <malyutinag at gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are in the process of finishing up the setup of a cluster with 3
>>>>> nodes, 4 GPUs each. One node has RTX3090s and the other two have
>>>>> RTX6000s. Any job asking for 1 GPU in the submission script will wait
>>>>> to run on the 3090 node, regardless of resource availability. The same
>>>>> job requesting 2 or more GPUs will run on any node. I don't even know
>>>>> where to begin troubleshooting this issue; the entries for the 3 nodes
>>>>> are effectively identical in slurm.conf. Any help would be appreciated.
>>>>> (If helpful - this cluster is used for structural biology, with the
>>>>> cryosparc and relion packages.)
>>>>>
>>>>> Thank you,
>>>>> Andrey
>>>>>
>>>>