[slurm-users] GPUs not available after making use of all threads?
Sebastian Schmutzhard-Höfler
sebastian.schmutzhard-hoefler at univie.ac.at
Sun Feb 12 21:04:55 UTC 2023
Hi Hermann,
Using your suggested settings did not work for us.
When trying to allocate a single thread with --cpus-per-task=1, it still
reserved a whole core (two threads). On the other hand, when requesting
an even number of threads, it does what it should.
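That whole-core rounding can be sketched like this (a toy model of
CR_Core allocation with ThreadsPerCore=2, not actual Slurm code):

```python
import math

# Toy model: under CR_Core, a request for n CPUs (threads) is satisfied
# in whole cores, i.e. rounded up to the next multiple of ThreadsPerCore.
def cpus_allocated(requested: int, threads_per_core: int = 2) -> int:
    cores = math.ceil(requested / threads_per_core)
    return cores * threads_per_core

print(cpus_allocated(1))  # 2 -- a whole core, as we observed
print(cpus_allocated(4))  # 4 -- even requests come out unchanged
```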
However, I could make it work by using
SelectTypeParameters=CR_Core
NodeName=nodename Sockets=2 CoresPerSocket=128 ThreadsPerCore=1
instead of
SelectTypeParameters=CR_Core
NodeName=nodename Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
So your suggestion pointed me in the right direction. Thanks!
If anyone thinks this is complete nonsense, please let me know!
Best wishes,
Sebastian
On 11.02.23 11:13, Hermann Schwärzler wrote:
> Hi Sebastian,
>
> we did a similar thing just recently.
>
> We changed our node settings from
>
> NodeName=DEFAULT CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32
> ThreadsPerCore=2
>
> to
>
> NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=32
> ThreadsPerCore=2
>
> in order to make the use of individual hyper-threads possible (we use this
> in combination with
> SelectTypeParameters=CR_Core_Memory).
>
> This works as expected: after this, when e.g. asking for
> --cpus-per-task=4 you will get 4 hyper-threads (2 cores) per task
> (unless you also specify e.g. "--hint=nomultithread").
>
> So you might try removing the "CPUs=256" part of your
> node specification and let Slurm calculate the number of
> CPUs itself.
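> That is, a node line roughly like this one (a sketch based on your
> hardware as described; adjust names and counts as needed):
>
> ```
> # slurm.conf -- sketch: omit CPUs= and let Slurm derive it as
> # Sockets x CoresPerSocket x ThreadsPerCore = 2 x 64 x 2 = 256
> NodeName=nodename Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2
> ```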
>
>
> BTW, on a side note: as most of our users do not bother with
> hyper-threads, or actively avoid them because their programs might
> suffer from using them, we made "--hint=nomultithread" the default in
> our installation by adding
>
> CliFilterPlugins=cli_filter/lua
>
> to our slurm.conf and creating a cli_filter.lua file in the same
> directory as slurm.conf, that contains this
>
> function slurm_cli_setup_defaults(options, early_pass)
>     options['hint'] = 'nomultithread'
>     return slurm.SUCCESS
> end
>
> (see also
> https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example).
> So if users really want to use hyper-threads, they have to add
> "--hint=multithread" to their job/allocation options.
>
> Regards,
> Hermann
>
> On 2/10/23 00:31, Sebastian Schmutzhard-Höfler wrote:
>> Dear all,
>>
>> we have a node with 2 x 64 CPUs (with two threads each) and 8 GPUs,
>> running slurm 22.05.5
>>
>> In order to make use of individual threads, we changed
>>
>> SelectTypeParameters=CR_Core
>> NodeName=nodename CPUs=256 Sockets=2 CoresPerSocket=64
>> ThreadsPerCore=2
>>
>> to
>>
>> SelectTypeParameters=CR_CPU
>> NodeName=nodename CPUs=256
>>
>> We are now able to allocate individual threads to jobs, despite the
>> following error in slurmd.log:
>>
>> error: Node configuration differs from hardware: CPUs=256:256(hw)
>> Boards=1:1(hw) SocketsPerBoard=256:2(hw) CoresPerSocket=1:64(hw)
>> ThreadsPerCore=1:2(hw)
>>
>>
>> However, it appears that since this change, we can only make use of 4
>> out of the 8 GPUs.
>> The output of "sinfo -o %G" might be relevant.
>>
>> In the first situation it was
>>
>> $ sinfo -o %G
>> GRES
>> gpu:A100:8(S:0,1)
>>
>> Now it is:
>>
>> $ sinfo -o %G
>> GRES
>> gpu:A100:8(S:0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126)
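>> (That socket list is exactly the even numbers 0 through 126, i.e. 64
>> entries instead of 2 -- which would fit slurmd now seeing many more
>> sockets than the hardware has, per the SocketsPerBoard=256:2(hw)
>> error above. A small throwaway check of that reading, using a
>> hypothetical helper rather than anything from Slurm:)

```python
# Hypothetical helper: extract the socket list from a sinfo GRES string
# such as "gpu:A100:8(S:0,1)".
def gres_sockets(gres):
    inner = gres[gres.index("(S:") + 3 : gres.index(")")]
    return [int(s) for s in inner.split(",")]

before = "gpu:A100:8(S:0,1)"  # the original output: 2 sockets
after = "gpu:A100:8(S:" + ",".join(str(i) for i in range(0, 128, 2)) + ")"

print(gres_sockets(before))      # [0, 1]
print(len(gres_sockets(after)))  # 64 "sockets": the even IDs 0..126
```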
>>
>>
>> Has anyone faced this or a similar issue and can give me some
>> directions?
>>
>> Best wishes,
>>
>> Sebastian
>>
>