[slurm-users] GPU jobs not running correctly

Fulcomer, Samuel samuel_fulcomer at brown.edu
Fri Aug 20 22:09:50 UTC 2021


...and I'm not sure what "AutoDetect=NVML" is supposed to do in the
gres.conf file. We've always used "nvidia-smi topo -m" to confirm that
we've got a single-root or dual-root node and have entered the correct info
in gres.conf to map the GPU connections to the CPU sockets, e.g.:

# 8-gpu A6000 nodes - dual-root
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[0-3] CPUs=0-23
NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[4-7] CPUs=24-47
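
(A single-root node would collapse to one line, with every GPU tied to the
socket that owns the root complex; the hostnames below are hypothetical, just
to show the shape:

NodeName=gpu[1601-1603] Name=gpu Type=a6000 File=/dev/nvidia[0-7] CPUs=0-23 )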





On Fri, Aug 20, 2021 at 6:01 PM Fulcomer, Samuel <samuel_fulcomer at brown.edu>
wrote:

> Well... you've got lots of weirdness, as the scontrol show job command
> isn't listing any GPU TRES requests, and the scontrol show node command
> isn't listing any configured GPU TRES resources.
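>
> For gres/gpu to show up in those TRES strings at all, slurm.conf normally
> needs the GRES type declared and GPUs added to the accounted TRES; a sketch,
> in case these aren't already buried somewhere in the autogenerated section:
>
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu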
>
> If you send me your entire slurm.conf I'll have a quick look-over.
>
> You also should be using cgroup.conf to fence off the GPU devices so that
> a job only sees the GPUs that it's been allocated. The lines in the batch
> file to figure it out aren't necessary. I forgot to ask you about
> cgroup.conf.
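>
> A minimal cgroup.conf for that is roughly the following (a sketch; it assumes
> slurm.conf also has ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup
> so the device cgroup actually gets applied):
>
> CgroupAutomount=yes
> ConstrainDevices=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
>
> With ConstrainDevices=yes a 1-GPU job only ever sees the GPU it was given,
> and something like "srun --gres=gpu:1 nvidia-smi -L" is a quick way to check.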
>
> regards,
> Sam
>
> On Fri, Aug 20, 2021 at 5:46 PM Andrey Malyutin <malyutinag at gmail.com>
> wrote:
>
>> Thank you Samuel,
>>
>> The Slurm version is 20.02.6. I'm not entirely sure about the platform;
>> the RTX6000 nodes are about 2 years old, and the 3090 node is very recent.
>> Technically we have 4 nodes (hence the references to node04 in the info
>> below), but one of them is down and out of the system at the moment. As you
>> can see, the job really wants to run on the downed node instead of going to
>> node02 or node03.
>>
>> Thank you again,
>> Andrey
>>
>>
>>
>> *scontrol show job output:*
>>
>> JobId=283 JobName=cryosparc_P2_J214
>>
>>    UserId=cryosparc(1003) GroupId=cryosparc(1003) MCS_label=N/A
>>
>>    Priority=4294901572 Nice=0 Account=(null) QOS=normal
>>
>>    JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:node04
>> Dependency=(null)
>>
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>
>>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>>
>>    SubmitTime=2021-08-20T20:55:00 EligibleTime=2021-08-20T20:55:00
>>
>>    AccrueTime=2021-08-20T20:55:00
>>
>>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>>
>>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-20T23:36:14
>>
>>    Partition=CSCluster AllocNode:Sid=headnode:108964
>>
>>    ReqNodeList=(null) ExcNodeList=(null)
>>
>>    NodeList=(null)
>>
>>    NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>
>>    TRES=cpu=4,mem=24000M,node=1,billing=4
>>
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>
>>    MinCPUsNode=1 MinMemoryNode=24000M MinTmpDiskNode=0
>>
>>    Features=(null) DelayBoot=00:00:00
>>
>>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>>
>>
>> Command=/data/backups/takeda2/data/cryosparc_projects/P8/J214/queue_sub_script.sh
>>
>>    WorkDir=/ssd/CryoSparc/cryosparc_master
>>
>>    StdErr=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>>    StdIn=/dev/null
>>
>>    StdOut=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>>    Power=
>>
>>    TresPerNode=gpu:1
>>
>>    MailUser=cryosparc MailType=NONE
>>
>>
>> *Script:*
>>
>> #SBATCH --job-name cryosparc_P2_J214
>>
>> #SBATCH -n 4
>>
>> #SBATCH --gres=gpu:1
>>
>> #SBATCH -p CSCluster
>>
>> #SBATCH --mem=24000MB
>>
>> #SBATCH --output=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>> #SBATCH --error=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>
>>
>>
>> available_devs=""
>>
>> for devidx in $(seq 0 15);
>>
>> do
>>
>>     if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
>>
>>         if [[ -z "$available_devs" ]] ; then
>>
>>             available_devs=$devidx
>>
>>         else
>>
>>             available_devs=$available_devs,$devidx
>>
>>         fi
>>
>>     fi
>>
>> done
>>
>> export CUDA_VISIBLE_DEVICES=$available_devs
>>
>>
>>
>> /ssd/CryoSparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J214 --master_hostname headnode.cm.cluster --master_command_core_port 39002 > /data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log 2>&1
>>
>>
>>
>>
>>
>>
>>
>> *Slurm.conf:*
>>
>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>
>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>
>> # Server nodes
>>
>> SlurmctldHost=headnode
>>
>> AccountingStorageHost=master
>>
>>
>> #############################################################################################
>>
>> #GPU  Nodes
>>
>>
>> #############################################################################################
>>
>> NodeName=node[02-04] Procs=64 CoresPerSocket=16 RealMemory=257024 Sockets=2 ThreadsPerCore=2 Feature=RTX6000 Gres=gpu:4
>>
>> NodeName=node01 Procs=64 CoresPerSocket=16 RealMemory=386048 Sockets=2 ThreadsPerCore=2 Feature=RTX3090 Gres=gpu:4
>>
>> #NodeName=node[05-08] Procs=8 Gres=gpu:4
>>
>> #
>>
>>
>> #############################################################################################
>>
>> # Partitions
>>
>>
>> #############################################################################################
>>
>> PartitionName=defq Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node[01-04]
>>
>> PartitionName=CSLive MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node01
>>
>> PartitionName=CSCluster MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node[02-04]
>>
>> ClusterName=slurm
>>
>>
>>
>> *Gres.conf:*
>>
>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>
>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>
>> AutoDetect=NVML
>>
>> # END AUTOGENERATED SECTION   -- DO NOT REMOVE
>>
>> #Name=gpu File=/dev/nvidia[0-3] Count=4
>>
>> #Name=mic Count=0
>>
>>
>>
>> *Sinfo:*
>>
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>
>> defq*        up   infinite      1  down* node04
>>
>> defq*        up   infinite      3   idle node[01-03]
>>
>> CSLive       up   infinite      1   idle node01
>>
>> CSCluster    up   infinite      1  down* node04
>>
>> CSCluster    up   infinite      2   idle node[02-03]
>>
>>
>>
>> *Node1:*
>>
>> NodeName=node01 Arch=x86_64 CoresPerSocket=16
>>
>>    CPUAlloc=0 CPUTot=64 CPULoad=0.04
>>
>>    AvailableFeatures=RTX3090
>>
>>    ActiveFeatures=RTX3090
>>
>>    Gres=gpu:4
>>
>>    NodeAddr=node01 NodeHostName=node01 Version=20.02.6
>>
>>    OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>>
>>    RealMemory=386048 AllocMem=0 FreeMem=16665 Sockets=2 Boards=1
>>
>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>
>>    Partitions=defq,CSLive
>>
>>    BootTime=2021-08-04T13:59:08 SlurmdStartTime=2021-08-10T09:32:43
>>
>>    CfgTRES=cpu=64,mem=377G,billing=64
>>
>>    AllocTRES=
>>
>>    CapWatts=n/a
>>
>>    CurrentWatts=0 AveWatts=0
>>
>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>>
>> *Node2-3:*
>>
>> NodeName=node02 Arch=x86_64 CoresPerSocket=16
>>
>>    CPUAlloc=0 CPUTot=64 CPULoad=0.48
>>
>>    AvailableFeatures=RTX6000
>>
>>    ActiveFeatures=RTX6000
>>
>>    Gres=gpu:4(S:0-1)
>>
>>    NodeAddr=node02 NodeHostName=node02 Version=20.02.6
>>
>>    OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>>
>>    RealMemory=257024 AllocMem=0 FreeMem=2259 Sockets=2 Boards=1
>>
>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>
>>    Partitions=defq,CSCluster
>>
>>    BootTime=2021-07-29T20:47:32 SlurmdStartTime=2021-08-10T09:32:55
>>
>>    CfgTRES=cpu=64,mem=251G,billing=64
>>
>>    AllocTRES=
>>
>>    CapWatts=n/a
>>
>>    CurrentWatts=0 AveWatts=0
>>
>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>> On Thu, Aug 19, 2021, 6:07 PM Fulcomer, Samuel <samuel_fulcomer at brown.edu>
>> wrote:
>>
>>> What SLURM version are you running?
>>>
>>> What are the #SBATCH directives in the batch script? (or the sbatch
>>> arguments)
>>>
>>> When the single GPU jobs are pending, what's the output of 'scontrol
>>> show job JOBID'?
>>>
>>> What are the node definitions in slurm.conf, and the lines in gres.conf?
>>>
>>> Are the nodes all the same host platform (motherboard)?
>>>
>>> We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX
>>> 1s, A6000s, and A40s, with a mix of single and dual-root platforms, and
>>> haven't seen this problem with SLURM 20.02.6 or earlier versions.
>>>
>>> On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin <malyutinag at gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are in the process of finishing up the setup of a cluster with 3
>>>> nodes, 4 GPUs each. One node has RTX3090s and the other two have
>>>> RTX6000s. Any job asking for 1 GPU in the submission script will wait to
>>>> run on the 3090 node, regardless of resource availability. The same job
>>>> requesting 2 or more GPUs will run on any node. I don't even know where
>>>> to begin troubleshooting this issue; the entries for the 3 nodes are
>>>> effectively identical in slurm.conf. Any help would be appreciated. (If
>>>> helpful - this cluster is used for structural biology, with the cryosparc
>>>> and relion packages.)
>>>>
>>>> Thank you,
>>>> Andrey
>>>>
>>>