[slurm-users] Single Node cluster. How to manage oversubscribing

Analabha Roy hariseldon99 at gmail.com
Mon Feb 27 05:23:19 UTC 2023


Hey,


Thanks for sticking with this.

On Sun, 26 Feb 2023 at 23:43, Doug Meyer <dameyer99 at gmail.com> wrote:

> Hi,
>
> Suggest removing "Boards=1". The docs say to include it, but in previous
> discussions with SchedMD we were advised to remove it.
>
>
I just did. Then ran scontrol reconfigure.



> When you are running, execute "scontrol show node <nodename>" and look at
> the lines CfgTRES and AllocTRES. The former is what the maître d'
> believes is available, the latter what has been allocated.
>
> Then "scontrol show job <jobid>", looking at the "NumNodes" line, which
> will show you what the job requested.
>
> I suspect there is a syntax error in the submit.
>
>
Okay. Now this is strange.

First, I launched this job twice: <https://pastebin.com/s21yXFH2>
This should take up 20 + 20 = 40 cores, because of the directives

   #SBATCH -n 20                  # Number of tasks
   #SBATCH --cpus-per-task=1
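
For context, the rest of that submission script is essentially of this shape;
the module loads and the actual binary are in the pastebin link above, and the
names here are just stand-ins:

   #!/bin/bash
   #SBATCH -n 20                  # Number of tasks
   #SBATCH --cpus-per-task=1
   # one MPI rank per Slurm task; ./mpi_count stands in for the real binary
   mpirun -np $SLURM_NTASKS ./mpi_count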



Running "scontrol show job" on both job IDs yields:

   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Then, running scontrol on the node yields:


   $ scontrol show node $HOSTNAME
   CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
   AllocTRES=cpu=40


So far so good. Both show 40 cores allocated.



However, if I now add another job with 60 cores
<https://pastebin.com/C0uW0Aut>, this happens:

scontrol on the node:

   CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
   AllocTRES=cpu=60


squeue:

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     413       CPU   normal    admin  R      21:22      1 shavak-DIT400TR-55L
     414       CPU   normal    admin  R      19:53      1 shavak-DIT400TR-55L
     417       CPU elevated    admin  R       1:31      1 shavak-DIT400TR-55L

scontrol on the jobids:

admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 413|grep NumCPUs
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 414|grep NumCPUs
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 417|grep NumCPUs
   NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

So, according to the per-job output, 100 CPUs are allocated, but scontrol on
the node says only 60?

The submission scripts are on pastebin:

https://pastebin.com/s21yXFH2
https://pastebin.com/C0uW0Aut


AR






> Doug
>
>
> On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy <hariseldon99 at gmail.com>
> wrote:
>
>> Hi Doug,
>>
>> Again, many thanks for your detailed response.
>> Based on my understanding of your previous note, I did the following:
>>
>> I set the NodeName with CPUs=64 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=16 ThreadsPerCore=2,
>>
>> and the partitions with OverSubscribe=FORCE:2.
>>
>> Then I put further restrictions in place via the default QOS:
>> MaxTRESPerNode:cpu=32 and MaxJobsPU=MaxSubmit=2.
>>
>> That way, no single user can legally request more than 2 x 32 cores.
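>>
>> In sacctmgr terms, the limits were set roughly like this ("normal" stands in
>> for whatever the default QOS is called):
>>
>>    sacctmgr modify qos normal set MaxTRESPerNode=cpu=32
>>    sacctmgr modify qos normal set MaxJobsPerUser=2 MaxSubmitJobsPerUser=2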
>>
>> I launched two jobs, sbatch -n 32 each as one user. They started running
>> immediately, taking up all 64 cores.
>>
>> Then I logged in as another user and launched the same job with sbatch -n
>> 2. To my dismay, it started to run!
>>
>> Shouldn't slurm have figured out that all 64 cores were occupied and
>> queued the -n 2 job to pending?
>>
>> AR
>>
>>
>> On Sun, 26 Feb 2023 at 02:18, Doug Meyer <dameyer99 at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> You got me; I didn't know that "oversubscribe=FORCE:2" is an option.
>>> I'll need to explore that.
>>>
>>> I missed the question about srun. srun is the preferred launcher, I believe. I
>>> am not involved in drafting the submit scripts, but I can ask my peer.
>>> You do need to stipulate the number of cores you want. Your "sbatch -n 1"
>>> should be changed to the number of MPI ranks you desire.
>>>
>>> As good as Slurm is, many come to assume it does far more than it does.
>>> I explain Slurm as a maître d' in a very exclusive restaurant, aware of
>>> every table and the resources they afford. When a reservation is placed (a
>>> job submitted), the request is reviewed against the resources, matching the
>>> pending guest/job against what is free and against when the other diners/jobs
>>> are expected to finish. If a guest requests resources the restaurant does not
>>> have at all, the reservation is denied. If a guest arrives and does
>>> not need all the resources, the place settings requested but unused are
>>> held by the reservation until the job finishes. Slurm manages requests against
>>> an inventory. Without enforcement, a job that requests 1 core but uses 12
>>> will still run. If your 64-core system accepts 64 single-core reservations,
>>> Slurm believing 64 cores are needed, 64 jobs will start, and then the wait
>>> staff (the OS) is left to deal with 768 tasks running on 64 cores. It
>>> becomes a sad comedy, as the system will probably run out of RAM, triggering
>>> the OOM killer, or just run horribly slowly. Never assume Slurm is going to
>>> prevent bad actors once they begin running unless you have configured it to
>>> do so.
>>>
>>> We run a very lax environment. We set a standard of 6 GB per job unless
>>> the sbatch declares otherwise, and a default maximum runtime. Without an
>>> estimated runtime to work with, the backfill scheduler is crippled. In an
>>> environment mixing single-thread and MPI jobs of various sizes, it is
>>> critical that jobs are honest about their requirements, giving Slurm the
>>> information needed to correctly assign resources.
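>>>
>>> In slurm.conf that kind of default is set with something along these lines;
>>> the node range and numbers here are only illustrative:
>>>
>>>    # cluster-wide fallback memory if a job does not request any (~6 GB)
>>>    DefMemPerNode=6144
>>>    # per-partition default and maximum walltime
>>>    PartitionName=a Nodes=hpc[301-308] DefaultTime=04:00:00 MaxTime=7-00:00:00 State=UP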
>>>
>>> Doug
>>>
>>> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy <hariseldon99 at gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for your considered response. Couple of questions linger...
>>>>
>>>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameyer99 at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Declaring CPUs=64 will absolutely work, but if you start running MPI
>>>>> you'll want a more detailed config description. The easy way to read it is
>>>>> "128 = 2 sockets * 32 cores per socket * 2 threads per core".
>>>>>
>>>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
>>>>> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>>>
>>>>> But if you just want to work with logical cores the "cpus=128" will
>>>>> work.
>>>>>
>>>>> If you go with the more detailed description then you need to declare
>>>>> oversubscription (hyperthreading) in the partition declaration.
>>>>>
>>>>
>>>>
>>>> Yeah, I'll try that.
>>>>
>>>>
>>>>> By default slurm will not let two different jobs share the logical
>>>>> cores comprising a physical core.  For example if Sue has an Array of
>>>>> 1-1000 her array tasks could each take a logical core on a physical core.
>>>>> But if Jamal is also running they would not be able to share the physical
>>>>> core. (as I understand it).
>>>>>
>>>>> PartitionName=a Nodes=hpc[301-308] Default=No OverSubscribe=YES:2
>>>>> MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>>>
>>>>>
>>>>> In the sbatch/srun the user needs to add the "--oversubscribe" flag,
>>>>> telling Slurm the job can run on both of the logical cores
>>>>> available.
>>>>>
>>>>
>>>> How about setting OverSubscribe=FORCE:2 on the partition? That way, users
>>>> need not add a setting in their scripts.
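>>>>
>>>> Something like this on the partition line, if I am reading the docs right
>>>> (using my current partition and node names):
>>>>
>>>>    PartitionName=CPU Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE:2 State=UP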
>>>>
>>>>
>>>>
>>>>
>>>>> In the days of Knights Landing each physical core could handle four logical
>>>>> cores, but I don't believe there are any current AMD or Intel processors
>>>>> supporting more than two logical cores (hyperthreads) per core. The
>>>>> conversation about hyperthreads is difficult, as the Intel terminology is
>>>>> logical cores for hyperthreading and cores for physical cores, but the
>>>>> tendency is to call the logical cores threads or hyperthreaded cores. This
>>>>> can be very confusing for consumers of the resources.
>>>>>
>>>>>
>>>>> In any case, if you create an array job of 1-100 sleep jobs (my
>>>>> simplest logical test job), you can use "scontrol show node <nodename>"
>>>>> to see the node's resource configuration as well as its consumption. "squeue -w
>>>>> <nodename> -i 10" will iterate every ten seconds to show you the node
>>>>> chomping through the job.
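>>>>>
>>>>> A throwaway version of that test script looks something like this (array
>>>>> size and sleep time are arbitrary):
>>>>>
>>>>>    #!/bin/bash
>>>>>    #SBATCH --array=1-100        # 100 independent array tasks
>>>>>    #SBATCH -n 1                 # each array task asks for a single core
>>>>>    sleep 100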
>>>>>
>>>>>
>>>>> Hope this helps.  Once you are comfortable I would urge you to use the
>>>>> NodeName/Partition descriptor format above and encourage your users to
>>>>> declare oversubscription in their jobs.  It is a little more work up front
>>>>> but far easier than correcting scripts later.
>>>>>
>>>>>
>>>>> Doug
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldon99 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Howdy, and thanks for the warm welcome,
>>>>>>
>>>>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameyer99 at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Did you configure your node definition with the outputs of slurmd
>>>>>>> -C?  Ignore boards.  Don't know if it is still true but several years ago
>>>>>>> declaring boards made things difficult.
>>>>>>>
>>>>>>>
>>>>>> $ slurmd -C
>>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
>>>>>> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
>>>>>> UpTime=0-00:47:51
>>>>>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>>>
>>>>>> There is a difference. I, too, discarded the Boards and Sockets in
>>>>>> slurm.conf. Is that the problem?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Also, if you have hyperthreaded AMD or Intel processors, your
>>>>>>> partition declaration should include OverSubscribe=YES:2.
>>>>>>>
>>>>>>>
>>>>>> Yes, I do. It's actually 2 sockets x 16 cores with hyperthreading, but the
>>>>>> BIOS is set to show them as 64 logical cores.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Start with a very simple job with a script containing sleep 100 or
>>>>>>> something else without any runtime issues.
>>>>>>>
>>>>>>>
>>>>>> I ran this MPI hello world thing
>>>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
>>>>>> with this sbatch script:
>>>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>
>>>>>> It should basically be the same thing as your suggestion.
>>>>>> Should I switch to 'srun' in the batch file?
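>>>>>>
>>>>>> That is, something roughly like this instead of the mpirun line (only a
>>>>>> sketch; the binary name is a placeholder):
>>>>>>
>>>>>>    #!/bin/bash
>>>>>>    #SBATCH -n 20
>>>>>>    #SBATCH --cpus-per-task=1
>>>>>>    # let Slurm launch one rank per task directly instead of calling mpirun
>>>>>>    srun ./mpi_count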
>>>>>>
>>>>>> AR
>>>>>>
>>>>>>
>>>>>>> When I started with Slurm I built the sbatch one small step at a
>>>>>>> time: nodes, cores, memory, partition, mail, etc.
>>>>>>>
>>>>>>> It sounds like your config is very close but your problem may be in
>>>>>>> the submit script.
>>>>>>>
>>>>>>> Best of luck and welcome to slurm. It is very powerful with a huge
>>>>>>> community.
>>>>>>>
>>>>>>> Doug
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldon99 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>>>>>>>> distribution packages for Slurm (slurm-wlm 19.05.5).
>>>>>>>> With the default configuration, Slurm only ran one job on the node at a
>>>>>>>> time, leaving all other jobs pending.
>>>>>>>> This happened even if that one job only requested a few cores
>>>>>>>> (the node has 64 cores, and slurm.conf is configured accordingly).
>>>>>>>>
>>>>>>>> In slurm.conf, SelectType is set to select/cons_res and
>>>>>>>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path
>>>>>>>> to the file is referenced below.
>>>>>>>>
>>>>>>>> So I set OverSubscribe=FORCE in the partition config and restarted
>>>>>>>> the daemons.
>>>>>>>>
>>>>>>>> Multiple jobs now run concurrently, but when Slurm is
>>>>>>>> oversubscribed, it is *truly* oversubscribed: it
>>>>>>>> runs so many jobs that there are more processes running than cores/threads.
>>>>>>>> How should I configure Slurm so that it runs multiple jobs at once per
>>>>>>>> node, but ensures that it doesn't run more processes than there are cores?
>>>>>>>> Is there some TRES magic for this that I can't seem to figure out?
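>>>>>>>>
>>>>>>>> The relevant bits of the config are roughly the following (the full file
>>>>>>>> is linked below):
>>>>>>>>
>>>>>>>>    SelectType=select/cons_res
>>>>>>>>    SelectTypeParameters=CR_Core
>>>>>>>>    NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>>>>>    PartitionName=CPU Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE State=UP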
>>>>>>>>
>>>>>>>> My slurm.conf is here on github:
>>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>>>>>> The only gres I've set is for the GPU:
>>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>>>>>
>>>>>>>> Thanks for your attention,
>>>>>>>> Regards,
>>>>>>>> AR
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>

-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in, hariseldon99 at gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/

