[slurm-users] GPUs not available after making use of all threads?
Hermann Schwärzler
hermann.schwaerzler at uibk.ac.at
Sat Feb 11 10:13:54 UTC 2023
Hi Sebastian,
we did a similar thing just recently.
We changed our node settings from
NodeName=DEFAULT CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2
to
NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2
in order to make the use of individual hyper-threads possible (we use
this in combination with SelectTypeParameters=CR_Core_Memory).
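In slurm.conf that boils down to something like the following two lines
(sketch only; SelectType=select/cons_tres is an assumption here, use
whatever select plugin you actually have):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory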
This works as expected: after this change, when asking for e.g.
--cpus-per-task=4 you get 4 hyper-threads (2 cores) per task
(unless you also specify e.g. "--hint=nomultithread").
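For illustration (./my_program is just a placeholder):
srun --cpus-per-task=4 ./my_program                        # 4 hyper-threads, i.e. 2 physical cores
srun --cpus-per-task=4 --hint=nomultithread ./my_program   # only one thread per core is used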
So you might try to remove the "CPUs=256" part of your node
specification (and keep the Sockets/CoresPerSocket/ThreadsPerCore
settings) to let Slurm do the calculation of the number of CPUs itself.
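In your case that might look something like this (untested sketch,
based on the hardware values slurmd reports and the GRES shown by
sinfo; adapt memory/GRES details to your actual setup):
NodeName=nodename Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 Gres=gpu:A100:8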
BTW, on a side note: as most of our users do not bother to use
hyper-threads, or do not even want to because their programs might
suffer from it, we made "--hint=nomultithread" the default in our
installation by adding
CliFilterPlugins=cli_filter/lua
to our slurm.conf and by creating a cli_filter.lua file in the same
directory as slurm.conf that contains this:
-- make --hint=nomultithread the default for all CLI commands;
-- users can still override it explicitly
function slurm_cli_setup_defaults(options, early_pass)
    options['hint'] = 'nomultithread'
    return slurm.SUCCESS
end
(see also
https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example).
So if users really want to use hyper-threads, they have to add
"--hint=multithread" to their job/allocation options.
Regards,
Hermann
On 2/10/23 00:31, Sebastian Schmutzhard-Höfler wrote:
> Dear all,
>
> we have a node with 2 x 64 CPUs (with two threads each) and 8 GPUs,
> running slurm 22.05.5
>
> In order to make use of individual threads, we changed
>
> SelectTypeParameters=CR_Core
> NodeName=nodename CPUs=256 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
>
> to
>
> SelectTypeParameters=CR_CPU
> NodeName=nodename CPUs=256
>
> We are now able to allocate individual threads to jobs, despite the
> following error in slurmd.log:
>
> error: Node configuration differs from hardware: CPUs=256:256(hw) Boards=1:1(hw) SocketsPerBoard=256:2(hw) CoresPerSocket=1:64(hw) ThreadsPerCore=1:2(hw)
>
>
> However, it appears that since this change, we can only make use of 4
> out of the 8 GPUs.
> The output of "sinfo -o %G" might be relevant.
>
> In the first situation it was
>
> $ sinfo -o %G
> GRES
> gpu:A100:8(S:0,1)
>
> Now it is:
>
> $ sinfo -o %G
> GRES
> gpu:A100:8(S:0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126)
>
> Has anyone faced this or a similar issue and can give me some directions?
> Best wishes
>
> Sebastian
>
>