[slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

Robert Kudyba rkudyba at fordham.edu
Thu Feb 27 17:14:52 UTC 2020


We figured out the issue.

All of our jobs request 1 GPU, and each node has only 1 GPU. Thus the
pending jobs show the reason "Resources", meaning no resources are
available for them: each job wants a GPU, but no GPU is free until a
running job on one of the nodes finishes.
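
For anyone who hits the same symptom, the per-job GRES requests can be
confirmed with something like:

  squeue -o "%.7i %.9P %.8u %.2t %.10M %b"

where %b prints the generic resources (in our case gpu:1) that each job
asked for.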

So looking at the new cons_tres option described at
https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf, would we
be able to use, e.g., --mem-per-gpu= (memory per allocated GPU)? If a user
allocated --mem-per-gpu=8 and the V100 we have is 32 GB, would subsequent
jobs be able to use the remaining 24 GB?
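
A minimal sketch of what that might look like, assuming an upgrade to at
least Slurm 19.05 (where cons_tres first appears); the 8G value is just the
example above and train.sh is a placeholder job script:

  # slurm.conf (cons_tres replaces cons_res)
  SelectType=select/cons_tres
  # include Memory so per-job memory requests are tracked and enforced
  SelectTypeParameters=CR_Core_Memory

  # job submission
  sbatch --gres=gpu:1 --mem-per-gpu=8G train.sh

One caveat as I read the docs: --mem-per-gpu constrains host RAM per
allocated GPU, not the V100's 32 GB of device memory, so on its own it
would not let a second job use the card's remaining 24 GB; actually
sharing the GPU is where MPS comes in.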

Would Slurm be able to use the NVIDIA Multi-Process Service (MPS),
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf,
if we had it enabled? I'm also trying to determine whether MPS works with
TensorFlow and am finding mixed results.
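
For the archives: newer Slurm releases (19.05 and later, not 17.11) have a
dedicated mps GRES that carves a GPU into percentage shares. A rough
sketch, with the 50/50 split purely illustrative and job_a.sh/job_b.sh as
placeholder scripts:

  # slurm.conf
  GresTypes=gpu,mps
  NodeName=node[001-003] CoresPerSocket=12 Sockets=2 RealMemory=191800 Gres=gpu:1,mps:100

  # gres.conf on each node
  Name=gpu File=/dev/nvidia0
  Name=mps Count=100

  # two jobs can then share the single V100
  sbatch --gres=mps:50 job_a.sh
  sbatch --gres=mps:50 job_b.sh

As I understand it a job requests either gpu or mps GRES but not both, so
anything that needs exclusive access to the card would keep using
--gres=gpu:1.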

Thanks for your reply, Ahmet.

We'd consider SchedMD paid support, but their minimum is $10K and 250
nodes...a bit more than our 4 nodes.



On Thu, Feb 27, 2020 at 3:53 AM mercan <ahmet.mercan at uhem.itu.edu.tr> wrote:

> Hi;
>
> In your partition definition there is "Shared=NO", which means "do not
> share nodes between jobs". This parameter conflicts with the
> "OverSubscribe=FORCE:12" parameter. According to the Slurm
> documentation, the Shared parameter has been replaced by the
> OverSubscribe parameter, but I suppose it still works.
>
> Regards,
>
> Ahmet M.
>
>
> On 26.02.2020 22:56, Robert Kudyba wrote:
> > We run Bright 8.1 and Slurm 17.11. We are trying to allow for multiple
> > concurrent jobs to run on our small 4 node cluster.
> >
> > Based on
> > https://community.brightcomputing.com/question/5d6614ba08e8e81e885f1991?action=artikel&cat=14&id=410&artlang=en
> > and
> > https://slurm.schedmd.com/cons_res_share.html
> >
> > Here are some settings in /etc/slurm/slurm.conf:
> >
> > SchedulerType=sched/backfill
> > # Nodes
> > NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2
> > Gres=gpu:1
> > # Partitions
> > PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> > PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> > Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO
> > AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO
> > OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP Nodes=node[001-003]
> > PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> > PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> > Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO
> > AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO
> > OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
> > # Generic resources types
> > GresTypes=gpu,mic
> > # Epilog/Prolog parameters
> > PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
> > Prolog=/cm/local/apps/cmd/scripts/prolog
> > Epilog=/cm/local/apps/cmd/scripts/epilog
> > # Fast Schedule option
> > FastSchedule=1
> > # Power Saving
> > SuspendTime=-1 # this disables power saving
> > SuspendTimeout=30
> > ResumeTimeout=60
> > SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
> > ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
> > # END AUTOGENERATED SECTION -- DO NOT REMOVE
> > # http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=14&id=410&artlang=en
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_CPU
> > SchedulerTimeSlice=60
> > EnforcePartLimits=YES
> >
> > But it appears each job takes one of the 3 nodes and all other jobs
> > are held in the queue. Do we have an incorrect option set?
> >
> > squeue -a
> >  JOBID PARTITION     NAME   USER ST       TIME NODES NODELIST(REASON)
> >   1937      defq   PaNet5  user1 PD       0:00     1 (Resources)
> >   1938      defq   PoNet5  user1 PD       0:00     1 (Priority)
> >   1964      defq   SENet5  user1 PD       0:00     1 (Priority)
> >   1979      defq   IcNet5  user1 PD       0:00     1 (Priority)
> >   1980      defq runtrain  user2 PD       0:00     1 (Priority)
> >   1981      defq   InRes5  user1 PD       0:00     1 (Priority)
> >   1983      defq run_LSTM  user3 PD       0:00     1 (Priority)
> >   1984      defq run_hui.  user4 PD       0:00     1 (Priority)
> >   1936      defq   SeRes5  user1  R   10:02:39     1 node003
> >   1950      defq sequenti  user5  R 1-02:03:00     1 node001
> >   1978      defq run_hui. user16  R   13:48:21     1 node002
> >
> > Am I misunderstanding some of the settings?
> >
> >
>