[slurm-users] Single Node cluster. How to manage oversubscribing

Wed Mar 1 02:20:43 UTC 2023

Hi,

I forgot to mention one thing.  When you change the node
descriptors and partitions you also have to restart slurmctld.  scontrol
reconfigure works for the nodes, but the main daemon has to be told to
reread the config.  Until you restart the daemon it will keep referencing the
config from the last time it started.
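
A minimal sketch of that sequence, assuming a systemd-managed install where
the controller unit is named slurmctld (an assumption about the packaging).
The helper names run and apply_conf_change and the DRY_RUN variable are
hypothetical, used only for illustration. By default the script only prints
the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default) print each command
# instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# After editing node or partition definitions in slurm.conf: restart the
# controller so it rereads the file, then ask the node daemons to reconfigure.
apply_conf_change() {
    run sudo systemctl restart slurmctld   # assumes a systemd unit named slurmctld
    run scontrol reconfigure
    run scontrol show node "$1"            # check that CfgTRES reflects the new config
}

apply_conf_change "$(hostname)"
```

Setting DRY_RUN=0 would execute the commands instead of printing them.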

Doug

On Sun, Feb 26, 2023 at 10:25 PM Analabha Roy <hariseldon99 at gmail.com>
wrote:

> Hey,
>
>
> Thanks for sticking with this.
>
> On Sun, 26 Feb 2023 at 23:43, Doug Meyer <dameyer99 at gmail.com> wrote:
>
>> Hi,
>>
>> Suggest removing "boards=1",  The docs say to include it but in previous
>> discussions with schedmd we were advised to remove it.
>>
>>
> I just did. Then ran scontrol reconfigure.
>
>
>
>> When you are running, execute "scontrol show node <nodename>" and look at
>> the CfgTRES and AllocTRES lines.  The former is what the maître d'
>> believes is available, the latter what has been allocated.
>>
>> Then run "scontrol show job <jobid>" and look down at the "NumNodes" line,
>> which will show you what the job requested.
>>
>> I suspect there is a syntax error in the submit.
>>
>>
> Okay. Now this is strange.
>
> First, I launched this job twice <https://pastebin.com/s21yXFH2>
> This should take up 20 + 20 = 40 cores, because of the
>
>
>    1. #SBATCH -n 20                  # Number of tasks
>    2. #SBATCH --cpus-per-task=1
>
>
>
> running scontrol show job on both jobids yields
>
>    -    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    -    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>
> Then, running scontrol on the node yields:
>
>
>    - scontrol show node $HOSTNAME
>    - CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
>    - AllocTRES=cpu=40
>
>
> So far so good. Both show 40 cores allocated.
>
>
>
> However, if I now add another job with 60 cores
> <https://pastebin.com/C0uW0Aut>, this happens:
>
> scontrol on the node:
>
> CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
>    AllocTRES=cpu=60
>
>
> squeue
>  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>    413       CPU   normal    admin  R      21:22      1 shavak-DIT400TR-55L
>    414       CPU   normal    admin  R      19:53      1 shavak-DIT400TR-55L
>    417       CPU elevated    admin  R       1:31      1 shavak-DIT400TR-55L
>
> scontrol on the jobids:
>
> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 413|grep NumCPUs
>    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 414|grep NumCPUs
>    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 417|grep NumCPUs
>    NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>
> So there are 100 CPUs running, according to this, but 60 according to
> scontrol on the node??????
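
The comparison being made here can be scripted. This is a small sketch that
sums the NumCPUs fields of the job records and sets the total beside the
node's AllocTRES value. The function names sum_job_cpus and alloc_cpus are
hypothetical, and the sample input is copied from the output quoted above.

```shell
#!/bin/sh
# Sum every NumCPUs=<n> field found on stdin (lines in the format printed
# by `scontrol show job <jobid>`).
sum_job_cpus() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^NumCPUs=/) { split($i, kv, "="); total += kv[2] }
    } END { print total + 0 }'
}

# Pull the cpu count out of an AllocTRES= line (from `scontrol show node`).
alloc_cpus() {
    sed -n 's/.*AllocTRES=cpu=\([0-9][0-9]*\).*/\1/p'
}

# Sample input taken from the thread: jobs 413, 414 and 417.
jobs_total=$(sum_job_cpus <<'EOF'
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
EOF
)
node_total=$(echo "   AllocTRES=cpu=60" | alloc_cpus)

echo "jobs=$jobs_total node=$node_total"   # prints: jobs=100 node=60
```

On a live system the same functions could be fed directly, for example
`scontrol show job 413 | sum_job_cpus`. The gap between the two numbers
would be consistent with jobs sharing cores under OverSubscribe=FORCE:2,
though the thread does not confirm that reading.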
>
> The submission scripts are on pastebin:
>
> https://pastebin.com/s21yXFH2
> https://pastebin.com/C0uW0Aut
>
>
> AR
>
>
>
>
>
>
>> Doug
>>
>>
>> On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy <hariseldon99 at gmail.com>
>> wrote:
>>
>>> Hi Doug,
>>>
>>> Again, many thanks for your detailed response.
>>> Based on my understanding of your previous note, I did the following:
>>>
>>> I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2
>>> CoresPerSocket=16 ThreadsPerCore=2
>>>
>>> and the partitions with oversubscribe=force:2
>>>
>>> then I put further restrictions with the default qos
>>> to MaxTRESPerNode:cpu=32, MaxJobsPU=MaxSubmit=2
>>>
>>> That way, no single user can request more than 2 X 32 cores legally.
>>>
>>> I launched two jobs, sbatch -n 32 each as one user. They started running
>>> immediately, taking up all 64 cores.
>>>
>>> Then I logged in as another user and launched the same job with sbatch
>>> -n 2. To my dismay, it started to run!
>>>
>>> Shouldn't slurm have figured out that all 64 cores were occupied and
>>> queued the -n 2 job to pending?
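
One way to read this result, offered as a rough sketch and not as a
description of how the scheduler is implemented: with OverSubscribe=FORCE:2
each CPU may be handed to up to two jobs, so the node is not necessarily
"full" once 64 CPUs are allocated. The function name schedulable_cpu_slots
and the plain multiplication are illustrative assumptions.

```shell
#!/bin/sh
# Rough model (assumption): with OverSubscribe=FORCE:<n>, each CPU can be
# allocated to up to <n> jobs, so the schedulable total is cpus * n.
schedulable_cpu_slots() {
    cpus=$1
    share=$2
    echo $((cpus * share))
}

slots=$(schedulable_cpu_slots 64 2)
used=$((32 + 32))                  # the two "sbatch -n 32" jobs
echo "slots=$slots used=$used free=$((slots - used))"
```

Under this model the third job with -n 2 still fits, which matches what was
observed.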
>>>
>>> AR
>>>
>>>
>>> On Sun, 26 Feb 2023 at 02:18, Doug Meyer <dameyer99 at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> You got me, I didn't know that " oversubscribe=FORCE:2" is an option.
>>>> I'll need to explore that.
>>>>
>>>> I missed the question about srun.  srun is preferred, I believe.  I
>>>> am not involved in drafting the submit scripts but can ask my peer.
>>>> You do need to stipulate the number of cores you want.  Your "sbatch -n 1"
>>>> should be changed to the number of MPI ranks you desire.
>>>>
>>>> As good as slurm is, many come to assume it does far more than it
>>>> does.  I explain slurm as a maître d' in a very exclusive restaurant, aware
>>>> of every table and the resources they afford.  When a reservation is
>>>> placed (a job submitted), the request is reviewed against the resources
>>>> and against when the other diners/jobs are expected to finish.  If a guest
>>>> requests resources that are not available in the restaurant, the
>>>> reservation is denied.  If a guest arrives and does not need all the
>>>> resources, the place settings requested but unused stay reserved until
>>>> the job finishes.  Slurm manages requests against an inventory.  Without
>>>> enforcement, a job that requests 1 core but uses 12 will run.  If your
>>>> 64 core system accepts 64 single core reservations, slurm believes 64
>>>> cores are needed and 64 jobs will start.  Then the wait staff (the OS) is
>>>> left to deal with 768 tasks running on 64 cores.  It becomes a sad comedy
>>>> as the system will probably run out of RAM, triggering the OOM killer, or
>>>> just run horribly slowly.  Never assume slurm is going to prevent bad
>>>> actors once they begin running unless you have configured it to do so.
>>>>
>>>> We run a very lax environment.  We set a standard of 6 GB per job
>>>> unless the sbatch declares otherwise, and a default max runtime.  Without an
>>>> estimated runtime to work with, the backfill scheduler is crippled.  In an
>>>> environment mixing single thread and MPI jobs of various sizes it is
>>>> critical that jobs are honest about their requirements, giving slurm the
>>>> information needed to correctly assign resources.
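
As an illustration of a job that states its requirements up front, here is a
sketch that prints a batch script with explicit task count, memory and time
limit. The function name print_job_script, the executable name
my_mpi_program and the example values are hypothetical; the #SBATCH options
themselves are standard sbatch flags.

```shell
#!/bin/sh
# Print a batch script that declares its needs explicitly.
# Arguments: <ntasks> <memory> <time limit>; the values are illustrative.
print_job_script() {
    cat <<EOF
#!/bin/bash
#SBATCH --ntasks=$1          # number of MPI ranks
#SBATCH --cpus-per-task=1
#SBATCH --mem=$2             # memory per node
#SBATCH --time=$3            # estimated runtime, used by backfill
srun ./my_mpi_program        # hypothetical executable name
EOF
}

print_job_script 20 6G 01:00:00
```

The output could be saved to a file and submitted with sbatch, or piped to
sbatch directly.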
>>>>
>>>> Doug
>>>>
>>>> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy <hariseldon99 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for your considered response. Couple of questions linger...
>>>>>
>>>>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameyer99 at gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Declaring CPUs=64 will absolutely work, but if you start running MPI
>>>>>> you'll want a more detailed config description.  The easy way to read it is
>>>>>> "128 = 2 sockets * 32 cores per socket * 2 threads per core".
>>>>>>
>>>>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
>>>>>> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>>>>
>>>>>> But if you just want to work with logical cores the "cpus=128" will
>>>>>> work.
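
The arithmetic behind the detailed description can be checked with a short
sketch: the CPU count is the product of sockets, cores per socket and
threads per core. The function name node_line is hypothetical; the two calls
use the hpc[306-308] values given above and the values reported by
slurmd -C for the single node in this thread.

```shell
#!/bin/sh
# CPUs = Sockets * CoresPerSocket * ThreadsPerCore
# Arguments: <node name> <sockets> <cores per socket> <threads per core>
node_line() {
    name=$1; sockets=$2; cores=$3; threads=$4
    cpus=$((sockets * cores * threads))
    echo "NodeName=$name CPUs=$cpus Sockets=$sockets CoresPerSocket=$cores ThreadsPerCore=$threads"
}

node_line "hpc[306-308]" 2 32 2          # CPUs=128
node_line shavak-DIT400TR-55L 2 16 2     # CPUs=64
```

The printed lines omit RealMemory, Gres and other fields, which would still
need to be added by hand.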
>>>>>>
>>>>>> If you go with the more detailed description then you need to declare
>>>>>> oversubscription (hyperthreading) in the partition declaration.
>>>>>>
>>>>>
>>>>>
>>>>> Yeah, I'll try that.
>>>>>
>>>>>
>>>>>> By default slurm will not let two different jobs share the logical
>>>>>> cores comprising a physical core.  For example if Sue has an Array of
>>>>>> 1-1000 her array tasks could each take a logical core on a physical core.
>>>>>> But if Jamal is also running they would not be able to share the physical
>>>>>> core. (as I understand it).
>>>>>>
>>>>>> PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2
>>>>>> MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>>>>
>>>>>>
>>>>>> In the sbatch/srun the user needs to add the "--oversubscribe" option,
>>>>>> telling slurm the job can run on both logical cores
>>>>>> available.
>>>>>>
>>>>>
>>>>> How about setting oversubscribe=FORCE:2? That way, users need not add
>>>>> a setting in their scripts.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> In the days of Knights Landing each core could handle four logical
>>>>>> cores but I don't believe there are any current AMD or Intel processors
>>>>>> supporting more than two logical cores (hyperthreads) per core.  The
>>>>>> conversation about hyperthreads is difficult as the Intel terminology is
>>>>>> logical cores for hyperthreading and cores for physical cores but the
>>>>>> tendency is to call the logical cores threads or hyperthreaded cores.  This
>>>>>> can be very confusing for consumers of the resources.
>>>>>>
>>>>>>
>>>>>> In any case, if you create an array job of 1-100 sleep jobs, my
>>>>>> simplest logical test job, then you can use scontrol show node <nodename>
>>>>>> to see the node's resource configuration as well as consumption.  squeue -w
>>>>>> <nodename> -i 10 will iterate every ten seconds to show you the node
>>>>>> chomping through the jobs.
>>>>>>
>>>>>>
>>>>>> Hope this helps.  Once you are comfortable I would urge you to use
>>>>>> the NodeName/Partition descriptor format above and encourage your users to
>>>>>> declare oversubscription in their jobs.  It is a little more work up front
>>>>>> but far easier than correcting scripts later.
>>>>>>
>>>>>>
>>>>>> Doug
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldon99 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Howdy, and thanks for the warm welcome,
>>>>>>>
>>>>>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameyer99 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Did you configure your node definition with the outputs of slurmd
>>>>>>>> -C?  Ignore boards.  Don't know if it is still true but several years ago
>>>>>>>> declaring boards made things difficult.
>>>>>>>>
>>>>>>>>
>>>>>>> $ slurmd -C
>>>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
>>>>>>> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
>>>>>>> UpTime=0-00:47:51
>>>>>>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>>>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>>>>
>>>>>>> There is a difference. I, too, discarded the Boards and sockets in
>>>>>>> slurm.conf. Is that the problem?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Also, if you have hyperthreaded AMD or Intel processors your
>>>>>>>> partition declaration should be oversubscribe:2
>>>>>>>>
>>>>>>>>
>>>>>>> Yes, I do. It's actually 16 X 2 cores with hyperthreading, but the
>>>>>>> BIOS is set to show them as 64 cores.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Start with a very simple job with a script containing sleep 100 or
>>>>>>>> something else without any runtime issues.
>>>>>>>>
>>>>>>>>
>>>>>>> I ran this MPI hello world thing
>>>>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
>>>>>>> with this sbatch script
>>>>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>.
>>>>>>> Should be the same thing as your suggestion, basically.
>>>>>>> Should I switch to 'srun' in the batch file?
>>>>>>>
>>>>>>> AR
>>>>>>>
>>>>>>>
>>>>>>>> When I started with slurm I built the sbatch one small step at a
>>>>>>>> time.  Nodes, cores, memory, partition, mail, etc.
>>>>>>>>
>>>>>>>> It sounds like your config is very close but your problem may be in
>>>>>>>> the submit script.
>>>>>>>>
>>>>>>>> Best of luck and welcome to slurm. It is very powerful with a huge
>>>>>>>> community.
>>>>>>>>
>>>>>>>> Doug
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <
>>>>>>>> hariseldon99 at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi folks,
>>>>>>>>>
>>>>>>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>>>>>>>>> distribution packages for slurm (slurm-wlm 19.05.5)
>>>>>>>>> Slurm only ran one job on the node at a time with the default
>>>>>>>>> configuration, leaving all other jobs pending.
>>>>>>>>> This happened even if that one job requested only a few cores
>>>>>>>>> (the node has 64 cores, and slurm.conf is configured accordingly).
>>>>>>>>>
>>>>>>>>> in slurm conf, SelectType is set to select/cons_res, and
>>>>>>>>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. Path to file
>>>>>>>>> is referenced below.
>>>>>>>>>
>>>>>>>>> So I set OverSubscribe=FORCE in the partition config and restarted
>>>>>>>>> the daemons.
>>>>>>>>>
>>>>>>>>> Multiple jobs are now run concurrently, but when Slurm is
>>>>>>>>> oversubscribed, it is *truly* *oversubscribed*. That is to say,
>>>>>>>>> it runs so many jobs that there are more processes running than
>>>>>>>>> cores/threads.
>>>>>>>>> How should I config slurm so that it runs multiple jobs at once
>>>>>>>>> per node, but ensures that it doesn't run more processes than there are
>>>>>>>>> cores? Is there some TRES magic for this that I can't seem to figure out?
>>>>>>>>>
>>>>>>>>> My slurm.conf is here on github:
>>>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>>>>>>> The only gres I've set is for the GPU:
>>>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>>>>>>
>>>>>>>>> Thanks for your attention,
>>>>>>>>> Regards,
>>>>>>>>> AR
>>>>>>>>> --
>>>>>>>>> Analabha Roy
>>>>>>>>> Assistant Professor
>>>>>>>>> Department of Physics
>>>>>>>>> <http://www.buruniv.ac.in/academics/department/physics>
>>>>>>>>> The University of Burdwan <http://www.buruniv.ac.in/>
>>>>>>>>> Golapbag Campus, Barddhaman 713104
>>>>>>>>> West Bengal, India
>>>>>>>>> Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in,
>>>>>>>>> hariseldon99 at gmail.com
>>>>>>>>> Webpage: http://www.ph.utexas.edu/~daneel/
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>