[slurm-users] Single Node cluster. How to manage oversubscribing

Analabha Roy hariseldon99 at gmail.com
Sat Feb 25 19:02:05 UTC 2023


Hi,

Thanks for your considered response. A couple of questions linger...

On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameyer99 at gmail.com> wrote:

> Hi,
>
> Declaring CPUs=64 will absolutely work, but if you start running MPI
> you'll want a more detailed config description.  The easy way to read it
> is "128 = 2 sockets * 32 cores per socket * 2 threads per core".
>
> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>
> But if you just want to work with logical cores, "CPUs=128" will work.
>
> If you go with the more detailed description, then you need to declare
> oversubscription (hyperthreading) in the partition declaration.
>


Yeah, I'll try that.
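
Going by my slurmd -C output quoted below (and dropping Boards, as you
suggested earlier), I'd guess the detailed node line should read
something like:

NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16
ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1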


> By default slurm will not let two different jobs share the logical cores
> comprising a physical core.  For example, if Sue has an array of 1-1000,
> her array tasks could each take a logical core on a physical core.  But
> if Jamal is also running, his jobs would not be able to share those
> physical cores (as I understand it).
>
> PartitionName=a Nodes=hpc[301-308] Default=No OverSubscribe=YES:2
> MaxTime=Infinite State=Up AllowAccounts=cowboys
>
>
> In the sbatch/srun the user needs to add the "--oversubscribe" flag,
> telling slurm the job can run on both logical cores of a physical core.
>

How about setting OverSubscribe=FORCE:2 instead? That way, users need not
add a flag to their scripts.
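Something like this, if I'm reading the slurm.conf docs right (the
partition name here is just a placeholder):

PartitionName=main Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE:2
MaxTime=INFINITE State=UP

whereas with YES:2, every job script would need a

#SBATCH --oversubscribe

line before slurm would let jobs share a physical core.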

> In the days of Knights Landing, each physical core could handle four
> logical cores, but I don't believe any current AMD or Intel processors
> support more than two logical cores (hyperthreads) per core.  The
> conversation about hyperthreads is difficult, as the Intel terminology
> is "logical cores" for hyperthreading and "cores" for physical cores,
> but the tendency is to call the logical cores threads or hyperthreaded
> cores.  This can be very confusing for consumers of the resources.
>
>
> In any case, if you create an array job of 1-100 sleep jobs (my simplest
> logical test job), you can use "scontrol show node <nodename>" to see
> the node's resource configuration as well as consumption.  "squeue -w
> <nodename> -i 10" will iterate every ten seconds to show you the node
> chomping through the job.
>
>
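
That's a handy test. Something like this should do, I suppose (using
sbatch's --wrap flag to skip writing a script file):

$ sbatch --array=1-100 --wrap="sleep 100"
$ scontrol show node shavak-DIT400TR-55L
$ squeue -w shavak-DIT400TR-55L -i 10
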
> Hope this helps.  Once you are comfortable I would urge you to use the
> NodeName/Partition descriptor format above and encourage your users to
> declare oversubscription in their jobs.  It is a little more work up front
> but far easier than correcting scripts later.
>
>
> Doug
>
>
> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldon99 at gmail.com>
> wrote:
>
>> Howdy, and thanks for the warm welcome,
>>
>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameyer99 at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Did you configure your node definition with the output of slurmd -C?
>>> Ignore Boards; I don't know if it is still true, but several years ago
>>> declaring Boards made things difficult.
>>>
>>>
>> $ slurmd -C
>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
>> UpTime=0-00:47:51
>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>
>> There is a difference. I, too, discarded the Boards and Sockets in
>> slurm.conf. Is that the problem?
>>
>>
>>> Also, if you have hyperthreaded AMD or Intel processors, your
>>> partition declaration should include OverSubscribe=YES:2.
>>>
>>>
>> Yes, I do. It's actually 2 sockets x 16 cores with hyperthreading, but
>> the BIOS is set to show them as 64 logical cores.
>>
>>
>>
>>
>>> Start with a very simple job with a script containing sleep 100 or
>>> something else without any runtime issues.
>>>
>>>
>> I ran this MPI hello world program
>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
>> with this sbatch script
>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>.
>> It should be basically the same thing as your suggestion.
>> Should I switch to 'srun' in the batch file?
>>
>> AR
>>
>>
>>> When I started with slurm, I built the sbatch script one small step
>>> at a time: nodes, cores, memory, partition, mail, etc.
>>>
>>> It sounds like your config is very close, but your problem may be in
>>> the submit script.
>>>
>>> Best of luck and welcome to slurm. It is very powerful with a huge
>>> community.
>>>
>>> Doug
>>>
>>>
>>>
>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldon99 at gmail.com>
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>>>> distribution packages for slurm (slurm-wlm 19.05.5).
>>>> With the default configuration, Slurm only ran one job on the node
>>>> at a time, leaving all other jobs pending.
>>>> This happened even if that one job requested only a few cores (the
>>>> node has 64 cores, and slurm.conf is configured accordingly).
>>>>
>>>> In slurm.conf, SelectType is set to select/cons_res, and
>>>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The
>>>> path to the file is referenced below.
>>>>
>>>> So I set OverSubscribe=FORCE in the partition config and restarted the
>>>> daemons.
>>>>
>>>> Multiple jobs now run concurrently, but when Slurm is
>>>> oversubscribed, it is *truly* *oversubscribed*: it runs so many jobs
>>>> that there are more processes running than cores/threads.
>>>> How should I configure Slurm so that it runs multiple jobs at once
>>>> per node but ensures that it doesn't run more processes than there
>>>> are cores? Is there some TRES magic for this that I can't seem to
>>>> figure out?
>>>>
>>>> My slurm.conf is here on github:
>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>> The only gres I've set is for the GPU:
>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>
>>>> Thanks for your attention,
>>>> Regards,
>>>> AR
>>>>
>>>
>>
>>
>

-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in, hariseldon99 at gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/