[slurm-users] Single Node cluster. How to manage oversubscribing

Analabha Roy hariseldon99 at gmail.com
Fri Feb 24 04:39:22 UTC 2023


Howdy, and thanks for the warm welcome,

On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameyer99 at gmail.com> wrote:

> Hi,
>
> Did you configure your node definition with the outputs of slurmd -C?
> Ignore boards.  Don't know if it is still true but several years ago
> declaring boards made things difficult.
>
>
$ slurmd -C
NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
UpTime=0-00:47:51
$ grep NodeName /etc/slurm-llnl/slurm.conf
NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1

There is a difference. I, too, dropped the Boards and Sockets declarations
from the NodeName line in slurm.conf. Is that the problem?
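
For concreteness, I guess a NodeName line mirroring the slurmd -C output
above (with Boards dropped, as you suggest) would be something like this
sketch:

NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1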

> Also, if you have hyperthreaded AMD or Intel processors, your partition
> declaration should be OverSubscribe:2.
>
>
Yes, I do. It's actually 2 sockets of 16 cores each with hyperthreading,
and the BIOS is set to present them as 64 logical CPUs.
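
If I follow, the partition line would then become something like the
sketch below (the partition name here is just a placeholder, and FORCE:2
versus YES:2 is my guess at what you mean):

PartitionName=debug Nodes=shavak-DIT400TR-55L Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:2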

> Start with a very simple job with a script containing sleep 100 or
> something else without any runtime issues.
>
>
I ran this MPI hello-world program
<https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
with this sbatch script:
<https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>
That should basically be equivalent to your suggestion.
Should I switch to 'srun' in the batch file?
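
In case it's useful, here is the sort of minimal test script I could try
instead (a sketch; the job name and task count are arbitrary):

#!/bin/bash
# Minimal test job: a few tasks that just sleep, so nothing can fail at runtime
#SBATCH --job-name=sleeptest
#SBATCH --ntasks=4
#SBATCH --time=00:05:00

# srun launches the tasks under Slurm's own process tracking
srun sleep 100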

AR


> When I started with Slurm I built the sbatch script one small step at a
> time: nodes, cores, memory, partition, mail, etc.
>
> It sounds like your config is very close but your problem may be in the
> submit script.
>
> Best of luck and welcome to slurm. It is very powerful with a huge
> community.
>
> Doug
>
>
>
> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldon99 at gmail.com>
> wrote:
>
>> Hi folks,
>>
>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>> distribution packages for Slurm (slurm-wlm 19.05.5).
>> With the default configuration, Slurm only ran one job on the node at a
>> time, leaving all other jobs pending.
>> This happened even if that one job only requested a few cores (the
>> node has 64 cores, and slurm.conf is configured accordingly).
>>
>> In slurm.conf, SelectType is set to select/cons_res and
>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path
>> to the file is referenced below.
>>
>> So I set OverSubscribe=FORCE in the partition config and restarted the
>> daemons.
>>
>> Multiple jobs are now run concurrently, but when Slurm oversubscribes,
>> it *truly* oversubscribes: it runs so many jobs that there are more
>> processes running than there are cores/threads.
>> How should I configure Slurm so that it runs multiple jobs at once per
>> node but ensures that it doesn't run more processes than there are
>> cores? Is there some TRES magic for this that I can't seem to figure out?
>>
>> My slurm.conf is here on github:
>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>> The only gres I've set is for the GPU:
>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>
>> Thanks for your attention,
>> Regards,
>> AR
>> --
>> Analabha Roy
>> Assistant Professor
>> Department of Physics
>> <http://www.buruniv.ac.in/academics/department/physics>
>> The University of Burdwan <http://www.buruniv.ac.in/>
>> Golapbag Campus, Barddhaman 713104
>> West Bengal, India
>> Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in,
>> hariseldon99 at gmail.com
>> Webpage: http://www.ph.utexas.edu/~daneel/
>>
>

-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in, hariseldon99 at gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/