<div dir="ltr">We figured out the issue. <div><br></div><div>All of our jobs are requesting 1 GPU. Each node only has 1 GPU. Thus, the pending jobs show the reason (Resources), meaning "no resources are available for these jobs", i.e. "I want a GPU, but there are no GPUs that I can use until a job on a node finishes".<br></div><div><br></div><div>So looking at the new cons_tres option at <a href="https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf">https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf</a>, would we be able to use, e.g., --mem-per-gpu (memory per allocated GPU)? If a user requested --mem-per-gpu=8 and the V100 we have is 32 GB, would subsequent jobs be able to use the remaining 24 GB?</div><div><br></div><div>Would Slurm be able to use the CUDA Multi-Process Service (MPS): <a href="https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf">https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf</a> if we had it enabled? I'm also trying to see whether MPS would work with TensorFlow and am finding mixed results.</div><div><br></div><div>Thanks for your reply, Ahmet. </div><div><br></div><div>We'd consider SchedMD paid support, but their minimum is $10K and 250 nodes, a bit higher than our 4 nodes. </div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Feb 27, 2020 at 3:53 AM mercan <<a href="mailto:ahmet.mercan@uhem.itu.edu.tr">ahmet.mercan@uhem.itu.edu.tr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi;<br>
<br>
In your partition definition, there is "Shared=NO". This means "do <br>
not share nodes between jobs". This parameter conflicts with the <br>
"OverSubscribe=FORCE:12" parameter. According to the Slurm <br>
documentation, the Shared parameter has been replaced by the <br>
OverSubscribe parameter, but I suppose it still works.<br>
<br>
Regards,<br>
<br>
Ahmet M.<br>
<br>
<br>
On 26.02.2020 22:56, Robert Kudyba wrote:<br>
> We run Bright 8.1 and Slurm 17.11. We are trying to allow for multiple <br>
> concurrent jobs to run on our small 4 node cluster.<br>
><br>
> Based on <br>
> <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__community.brightcomputing.com_question_5d6614ba08e8e81e885f1991-3Faction-3Dartikel-26cat-3D14-26id-3D410-26artlang-3Den-26highlight-3Dslurm-2B-252526-25252334-25253Bgang-2Bscheduling-252526-25252334-25253B&d=DwIFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yigW9AeWH0d5Z6d0fJEJ-SLrHDh1b1WfnHjIur1Cywk&s=JXCldpkgwkDQTsj6kERPbX4hIO1G9jBTaGe4WHHWtKE&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint.com/v2/url?u=https-3A__community.brightcomputing.com_question_5d6614ba08e8e81e885f1991-3Faction-3Dartikel-26cat-3D14-26id-3D410-26artlang-3Den-26highlight-3Dslurm-2B-252526-25252334-25253Bgang-2Bscheduling-252526-25252334-25253B&d=DwIFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yigW9AeWH0d5Z6d0fJEJ-SLrHDh1b1WfnHjIur1Cywk&s=JXCldpkgwkDQTsj6kERPbX4hIO1G9jBTaGe4WHHWtKE&e=</a> <br>
> and<br>
> <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_cons-5Fres-5Fshare.html&d=DwIFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yigW9AeWH0d5Z6d0fJEJ-SLrHDh1b1WfnHjIur1Cywk&s=0xnOemAfvqAmLn7PbzlzspC3ZTvkBqVMxpOyJ6iQOaU&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_cons-5Fres-5Fshare.html&d=DwIFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yigW9AeWH0d5Z6d0fJEJ-SLrHDh1b1WfnHjIur1Cywk&s=0xnOemAfvqAmLn7PbzlzspC3ZTvkBqVMxpOyJ6iQOaU&e=</a> <br>
><br>
> Here are some settings in /etc/slurm/slurm.conf:<br>
><br>
> SchedulerType=sched/backfill<br>
> # Nodes<br>
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2 <br>
> Gres=gpu:1<br>
> # Partitions<br>
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL <br>
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO <br>
> Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO <br>
> AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO <br>
> OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP Nodes=node[001-003]<br>
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL <br>
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO <br>
> Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO <br>
> AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO <br>
> OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP<br>
> # Generic resources types<br>
> GresTypes=gpu,mic<br>
> # Epilog/Prolog parameters<br>
> PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob<br>
> Prolog=/cm/local/apps/cmd/scripts/prolog<br>
> Epilog=/cm/local/apps/cmd/scripts/epilog<br>
> # Fast Schedule option<br>
> FastSchedule=1<br>
> # Power Saving<br>
> SuspendTime=-1 # this disables power saving<br>
> SuspendTimeout=30<br>
> ResumeTimeout=60<br>
> SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff<br>
> ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron<br>
> # END AUTOGENERATED SECTION -- DO NOT REMOVE<br>
> # <br>
> <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__kb.brightcomputing.com_faq_index.php-3Faction-3Dartikel-26cat-3D14-26id-3D410-26artlang-3Den-26highlight-3Dslurm-2B-2526-252334-253Bgang-2Bscheduling-2526-252334-253B&d=DwIFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yigW9AeWH0d5Z6d0fJEJ-SLrHDh1b1WfnHjIur1Cywk&s=Yf8fh3avSWaIjsjRyFUW3mJgOlvaTfqZ5xYcsA8pMmo&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint.com/v2/url?u=http-3A__kb.brightcomputing.com_faq_index.php-3Faction-3Dartikel-26cat-3D14-26id-3D410-26artlang-3Den-26highlight-3Dslurm-2B-2526-252334-253Bgang-2Bscheduling-2526-252334-253B&d=DwIFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yigW9AeWH0d5Z6d0fJEJ-SLrHDh1b1WfnHjIur1Cywk&s=Yf8fh3avSWaIjsjRyFUW3mJgOlvaTfqZ5xYcsA8pMmo&e=</a> <br>
> SelectType=select/cons_res<br>
> SelectTypeParameters=CR_CPU<br>
> SchedulerTimeSlice=60<br>
> EnforcePartLimits=YES<br>
><br>
> But it appears each job takes 1 of the 3 nodes and all other jobs are <br>
> left pending in the queue. Do we have an incorrect option set?<br>
><br>
> squeue -a<br>
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br>
> 1937 defq PaNet5 user1 PD 0:00 1 (Resources)<br>
> 1938 defq PoNet5 user1 PD 0:00 1 (Priority)<br>
> 1964 defq SENet5 user1 PD 0:00 1 (Priority)<br>
> 1979 defq IcNet5 user1 PD 0:00 1 (Priority)<br>
> 1980 defq runtrain user2 PD 0:00 1 (Priority)<br>
> 1981 defq InRes5 user1 PD 0:00 1 (Priority)<br>
> 1983 defq run_LSTM user3 PD 0:00 1 (Priority)<br>
> 1984 defq run_hui. user4 PD 0:00 1 (Priority)<br>
> 1936 defq SeRes5 user1 R 10:02:39 1 node003<br>
> 1950 defq sequenti user5 R 1-02:03:00 1 node001<br>
> 1978 defq run_hui. user16 R 13:48:21 1 node002<br>
><br>
> Am I misunderstanding some of the settings?<br>
><br>
><br>
</blockquote></div>