[slurm-users] Single Node cluster. How to manage oversubscribing

Analabha Roy hariseldon99 at gmail.com
Sun Feb 26 09:41:21 UTC 2023


Hi Doug,

Again, many thanks for your detailed response.
Based on my understanding of your previous note, I did the following:

I set the NodeName with CPUs=64 Boards=1 SocketsPerBoard=2
CoresPerSocket=16 ThreadsPerCore=2

and the partitions with OverSubscribe=FORCE:2,

then added further restrictions via the default QOS:
MaxTRESPerNode=cpu=32 and MaxJobsPU = MaxSubmitPU = 2.

That way, no single user can legitimately request more than 2 x 32 cores.
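
For reference, the relevant pieces now look roughly like this (a sketch; the
partition and QOS names below are placeholders for my actual ones):

# node and partition lines in slurm.conf (partition name is a placeholder)
NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1
PartitionName=CPU Nodes=shavak-DIT400TR-55L Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:2

# QOS limits (QOS name "normal" is a placeholder for my default QOS)
sacctmgr modify qos normal set MaxTRESPerNode=cpu=32 MaxJobsPerUser=2 MaxSubmitJobsPerUser=2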

I launched two jobs as one user, each with sbatch -n 32. They started
running immediately, taking up all 64 cores.

Then I logged in as another user and launched the same job with sbatch -n
2. To my dismay, it started to run!

Shouldn't Slurm have figured out that all 64 cores were occupied and left
the -n 2 job pending?

AR


On Sun, 26 Feb 2023 at 02:18, Doug Meyer <dameyer99 at gmail.com> wrote:

> Hi,
>
> You got me; I didn't know that "OverSubscribe=FORCE:2" was an option.
> I'll need to explore that.
>
> I missed the question about srun. srun is preferred, I believe. I am not
> associated with drafting the submit scripts, but I can ask my peer. You do
> need to stipulate the number of cores you want: your "sbatch -n 1" should
> be changed to the number of MPI ranks you desire.
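>
> For example, a sketch of an MPI submission (the binary name is just a
> placeholder):
>
> #SBATCH -n 32
> srun ./my_mpi_program   # placeholder binary name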
>
> As good as Slurm is, many come to assume it does far more than it does. I
> explain Slurm as a maître d' in a very exclusive restaurant, aware of every
> table and the resources it affords. When a reservation is placed (a job
> submitted), the request is checked against the available resources and
> against when the other diners/jobs are expected to finish. If a guest
> requests resources that are not available in the restaurant, the
> reservation is denied. If a guest arrives and does not need all the
> resources, the place settings requested but unused are held in the
> reservation until the job finishes. Slurm manages requests against an
> inventory. Without enforcement, a job that requests 1 core but uses 12 will
> still run. If your 64-core system accepts 64 single-core reservations,
> Slurm believes 64 cores are needed and 64 jobs will start, and then the
> wait staff (the OS) is left to deal with 768 tasks running on 64 cores. It
> becomes a sad comedy, as the system will probably run out of RAM,
> triggering the OOM killer, or just run horribly slowly. Never assume Slurm
> is going to prevent bad actors once they begin running unless you have
> configured it to do so.
>
> We run a very lax environment. We set a standard of 6 GB per job unless
> the sbatch declares otherwise, plus a default maximum runtime. Without an
> estimated runtime to work with, the backfill scheduler is crippled. In an
> environment mixing single-thread and MPI jobs of various sizes, it is
> critical that jobs are honest about their requirements, giving Slurm the
> information it needs to correctly assign resources.
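>
> For example, something along these lines in slurm.conf (a sketch; the
> partition name and time values are illustrative, and whether the memory
> default is per CPU or per node depends on how memory is tracked):
>
> DefMemPerCPU=6144   # illustrative 6 GB default
> PartitionName=batch Nodes=hpc[301-308] DefaultTime=08:00:00 MaxTime=7-00:00:00 State=UP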
>
> Doug
>
> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy <hariseldon99 at gmail.com>
> wrote:
>
>> Hi,
>>
>> Thanks for your considered response. A couple of questions linger...
>>
>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameyer99 at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Declaring cores=64 will absolutely work, but if you start running MPI
>>> you'll want a more detailed config description. The easy way to read it
>>> is: 128 = 2 sockets * 32 cores per socket * 2 threads per core.
>>>
>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
>>> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>
>>> But if you just want to work with logical cores, "CPUs=128" will work.
>>>
>>> If you go with the more detailed description, you need to declare
>>> oversubscription (hyperthreading) in the partition declaration.
>>>
>>
>>
>> Yeah, I'll try that.
>>
>>
>>> By default, Slurm will not let two different jobs share the logical cores
>>> comprising a physical core. For example, if Sue has an array of 1-1000,
>>> her array tasks could each take a logical core on a physical core, but if
>>> Jamal is also running, his jobs would not be able to share those physical
>>> cores (as I understand it).
>>>
>>> PartitionName=a Nodes=hpc[301-308] Default=No OverSubscribe=YES:2
>>> MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>
>>>
>>> In the sbatch/srun, the user needs to add an oversubscribe declaration
>>> telling Slurm the job can run on both logical cores of a physical core.
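>>>
>>> For example, in the batch script:
>>>
>>> #SBATCH --oversubscribe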
>>>
>>
>> How about setting oversubscribe=FORCE:2? That way, users need not add a
>> setting in their scripts.
>>
>>
>>
>>
>>> In the days of Knights Landing each core could handle four logical
>>> threads, but I don't believe there are any current AMD or Intel processors
>>> supporting more than two logical cores (hyperthreads) per core. The
>>> conversation about hyperthreads is difficult because Intel's terminology
>>> is "logical cores" for hyperthreading and "cores" for physical cores, but
>>> the tendency is to call the logical cores threads or hyperthreaded cores.
>>> This can be very confusing for consumers of the resources.
>>>
>>>
>>> In any case, if you create an array job of 1-100 sleep jobs (my simplest
>>> logical test job), you can use scontrol show node <nodename> to see the
>>> node's resource configuration as well as its consumption. squeue -w
>>> <nodename> -i 10 will iterate every ten seconds to show you the node
>>> chomping through the job.
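>>>
>>> A minimal sketch of such a test script:
>>>
>>> #!/bin/bash
>>> #SBATCH --array=1-100
>>> #SBATCH -n 1
>>> sleep 100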
>>>
>>>
>>> Hope this helps. Once you are comfortable, I would urge you to use the
>>> NodeName/Partition descriptor format above and to encourage your users to
>>> declare oversubscription in their jobs. It is a little more work up front
>>> but far easier than correcting scripts later.
>>>
>>>
>>> Doug
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldon99 at gmail.com>
>>> wrote:
>>>
>>>> Howdy, and thanks for the warm welcome,
>>>>
>>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameyer99 at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Did you configure your node definition with the output of slurmd -C?
>>>>> Ignore Boards. I don't know if it is still true, but several years ago
>>>>> declaring Boards made things difficult.
>>>>>
>>>>>
>>>> $ slurmd -C
>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
>>>> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
>>>> UpTime=0-00:47:51
>>>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>>>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>
>>>> There is a difference. I, too, discarded the Boards and Sockets in
>>>> slurm.conf. Is that the problem?
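>>>>
>>>> That is, should the NodeName line in slurm.conf match the slurmd -C
>>>> output (plus the Gres), something like:
>>>>
>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1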
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Also, if you have hyperthreaded AMD or Intel processors, your partition
>>>>> declaration should include OverSubscribe:2.
>>>>>
>>>>>
>>>> Yes, I do. It's actually 2 x 16 cores with hyperthreading, but the BIOS
>>>> is set to present them as 64 logical cores.
>>>>
>>>>
>>>>
>>>>
>>>>> Start with a very simple job with a script containing sleep 100 or
>>>>> something else without any runtime issues.
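>>>>>
>>>>> For instance, something as simple as:
>>>>>
>>>>> #!/bin/bash
>>>>> #SBATCH -n 1
>>>>> sleep 100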
>>>>>
>>>>>
>>>> I ran this MPI hello world program
>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
>>>> with this sbatch script:
>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>
>>>> It should basically be the same as your suggestion.
>>>> Should I switch to 'srun' in the batch file?
>>>>
>>>> AR
>>>>
>>>>
>>>>> When I started with Slurm I built the sbatch one small step at a
>>>>> time: nodes, cores, memory, partition, mail, etc.
>>>>>
>>>>> It sounds like your config is very close but your problem may be in
>>>>> the submit script.
>>>>>
>>>>> Best of luck and welcome to slurm. It is very powerful with a huge
>>>>> community.
>>>>>
>>>>> Doug
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldon99 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>>>>>> distribution packages for Slurm (slurm-wlm 19.05.5).
>>>>>> With the default configuration, Slurm only ran one job on the node at a
>>>>>> time, leaving all other jobs pending.
>>>>>> This happened even if that one job only requested a few cores
>>>>>> (the node has 64 cores, and slurm.conf is configured accordingly).
>>>>>>
>>>>>> In slurm.conf, SelectType is set to select/cons_res and
>>>>>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path
>>>>>> to the file is referenced below.
>>>>>>
>>>>>> So I set OverSubscribe=FORCE in the partition config and restarted
>>>>>> the daemons.
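>>>>>>
>>>>>> The partition line now reads roughly like this (the partition name here
>>>>>> is a placeholder for my actual one):
>>>>>>
>>>>>> PartitionName=CPU Nodes=shavak-DIT400TR-55L Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE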
>>>>>>
>>>>>> Multiple jobs now run concurrently, but when Slurm is oversubscribed,
>>>>>> it is *truly* oversubscribed: it runs so many jobs that there are more
>>>>>> processes running than cores/threads.
>>>>>> How should I configure Slurm so that it runs multiple jobs at once per
>>>>>> node but ensures that it doesn't run more processes than there are cores?
>>>>>> Is there some TRES magic for this that I can't seem to figure out?
>>>>>>
>>>>>> My slurm.conf is here on github:
>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>>>> The only gres I've set is for the GPU:
>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>>>
>>>>>> Thanks for your attention,
>>>>>> Regards,
>>>>>> AR
>>>>>
>>>>
>>>
>>
>

-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in, hariseldon99 at gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/