[slurm-users] [External] incorrect number of CPUs being reported in srun job
Sid Young
sid.young at gmail.com
Wed Jun 23 01:48:18 UTC 2021
Thanks for the reply... I will look into how to configure it.
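
For anyone who finds this thread later, my understanding from the Slurm docs is that the confinement comes from the cgroup task plugin; a minimal sketch would be something like the lines below (the exact option set for our cluster is still an assumption on my part):

    # slurm.conf - track processes and confine tasks with cgroups (sketch)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/affinity,task/cgroup

    # cgroup.conf on the compute nodes
    ConstrainCores=yes
    ConstrainRAMSpace=yes

Once the daemons are restarted with that in place, a quick check such as

    srun -N 1 --cpus-per-task 8 --mem 2g python3 -c "import os; print(len(os.sched_getaffinity(0)))"

should report 8 instead of 256.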
Sid Young
Translational Research Institute
On Wed, Jun 23, 2021 at 7:06 AM Prentice Bisbal <pbisbal at pppl.gov> wrote:
> Yes,
>
> You need to use the cgroups plugin.
>
>
> On Fri, Jun 18, 2021, 12:29 AM Sid Young <sid.young at gmail.com> wrote:
>
>> G'Day all,
>>
>> I've had a question from a user of our new HPC; the following should
>> explain it:
>>
>> ➜ srun -N 1 --cpus-per-task 8 --time 01:00:00 --mem 2g --pty python3
>> Python 3.6.8 (default, Nov 16 2020, 16:55:22)
>> [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import os
>> >>> os.cpu_count()
>> 256
>> >>> len(os.sched_getaffinity(0))
>> 256
>> >>>
>>
>> The output of os.cpu_count() is correct: there are 256 CPUs on the
>> server. However, len(os.sched_getaffinity(0)) also returns 256, when I
>> was expecting 8 - the number of CPUs this process is restricted to.
>> Is my Slurm command incorrect? When I run a similar test on XXXXXX I
>> get the expected behaviour:
>>
>> ➜ qsub -I -l select=1:ncpus=4:mem=1gb
>> qsub: waiting for job 9616042.pbs to start
>> qsub: job 9616042.pbs ready
>> ➜ python3
>> Python 3.4.10 (default, Dec 13 2019, 16:20:47) [GCC] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import os
>> >>> os.cpu_count()
>> 72
>> >>> len(os.sched_getaffinity(0))
>> 4
>> >>>
>>
>> This seems to be a problem for me as I have a program provided by a
>> third-party company that keeps trying to run with 256 threads and crashes.
>> The program is a compiled binary so I don't know if they're just grabbing
>> the number of CPUs or correctly getting the scheduler affinity, but it
>> seems as though TRI's HPC will return the total number of CPUs in any case.
>> There aren't any options with the program to set the number of threads
>> manually.
>>
>> My question to the group is: what's causing this? Do I need a cgroups
>> plugin?
>>
>> I think these are the relevant lines from the slurm.conf file:
>>
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> ReturnToService=1
>> CpuFreqGovernors=OnDemand,Performance,UserSpace
>> CpuFreqDef=Performance
>>
>> Sid Young
>> Translational Research Institute
>>
>