[slurm-users] Running gpu and cpu jobs on the same node
Renfro, Michael
Renfro at tntech.edu
Wed Sep 30 21:42:46 UTC 2020
I could have missed a detail in my description, but we definitely don't enable OverSubscribe, Shared, or ExclusiveUser. All three of those are set to "NO" on all active partitions.
Current subset of slurm.conf and squeue output:
=====
# egrep '^PartitionName=(gpu|any-interactive) ' /etc/slurm/slurm.conf
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=16 ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP TRESBillingWeights=CPU=3.00,Mem=1.024G,GRES/gpu=30.00 Nodes=gpunode[001-004]
PartitionName=any-interactive Default=NO MinNodes=1 MaxNodes=4 DefaultTime=02:00:00 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=12 ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP TRESBillingWeights=CPU=3.00,Mem=1.024G,GRES/gpu=30.00 Nodes=node[001-040],gpunode[001-004]
# squeue -o "%6i %.15P %.10j %.5u %4C %5D %16R %6b" | grep gpunode002
778462 gpu CNN_GRU.sh miibr 1 1 gpunode002 gpu:1
778632 any-interactive bash rnour 1 1 gpunode002 N/A
=====
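For what it's worth, "scontrol show job -d" on those two jobs shows exactly which CPU IDs (and GPU index) each one holds, so you can verify the allocations don't overlap. Sketch below; the CPU_IDs, memory, and GPU index values are illustrative rather than copied from our nodes:

=====
# scontrol show job -d 778462 | grep CPU_IDs
     Nodes=gpunode002 CPU_IDs=0 Mem=2000 GRES=gpu:1(IDX:0)
# scontrol show job -d 778632 | grep CPU_IDs
     Nodes=gpunode002 CPU_IDs=1 Mem=2000
=====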
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Relu Patrascu <relu at cs.toronto.edu>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Wednesday, September 30, 2020 at 4:02 PM
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Running gpu and cpu jobs on the same node
If you don't use OverSubscribe, then resources are not shared. Whatever resources a job is allocated are not available to other jobs, regardless of partition.
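As a sketch of what that means in practice (this assumes the gpunodes above have 28 cores, matching the sum of the two MaxCPUsPerNode limits; the job sizes and target node are made up):

=====
# job 1: 16 cores plus 1 GPU through the gpu partition
sbatch -p gpu --gres=gpu:1 -n 16 -w gpunode002 --wrap 'sleep 600'
# job 2: the other 12 cores through any-interactive; it starts alongside
# job 1 because those cores are free, not because any core is shared
sbatch -p any-interactive -n 12 -w gpunode002 --wrap 'sleep 600'
# job 3: asking for even one more core on that node pends with reason
# (Resources) until one of the first two finishes
sbatch -p any-interactive -n 1 -w gpunode002 --wrap 'sleep 600'
=====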
Relu
On 2020-09-30 16:12, Ahmad Khalifa wrote:
I have a machine with 4 RTX 2080 Ti cards and a Core i9. I submit jobs to it through MPI with PMI2 (from Relion).
If I use 5 MPI processes and 4 threads each, then basically I'm using all 4 GPUs and 20 threads of my CPU.
My question: my current configuration allows submitting jobs to the same node under a different partition, but I'm not sure, if I use #SBATCH --partition=cpu, whether the submitted jobs will only use the remaining 2 cores (4 threads) or whether they will share resources with my GPU job.
Thanks.
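For context, my submission script looks roughly like this (a sketch; the partition name and the trailing Relion arguments are placeholders):

=====
#!/bin/bash
# 5 MPI ranks x 4 threads = 20 CPUs, plus 4 GPUs. Rank 0 is the Relion
# master and does not use a GPU, so 4 worker ranks drive the 4 GPUs.
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4
srun --mpi=pmi2 relion_refine_mpi --j "$SLURM_CPUS_PER_TASK" --gpu ...
=====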