I wonder whether there might be a core-pinning/NUMA-topology/hyperthreading sort of thing going on here? If the code runs faster outside Slurm than under Slurm, on the same hardware, it might be because some of the cores Slurm has confined the cgroup to are hyperthreads on a single physical core. Or perhaps they're not allocated across the physical sockets in an optimal way… that sort of thing?
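One way to check this hypothesis from inside the allocation (a sketch; the srun flags mirror the sbatch ones discussed below and may need adjusting for your site):

```shell
# Inside the allocation: show which logical CPUs the cgroup permits this
# task, and how those logical CPUs map onto physical cores and sockets.
srun --nodes=1 --ntasks=1 --cpus-per-task=32 bash -c '
    grep Cpus_allowed_list /proc/self/status   # CPUs visible to this task
    lscpu -e=CPU,CORE,SOCKET,NODE              # logical-to-physical mapping
'
```

If the 32 allowed CPUs collapse onto only 16 distinct CORE values (hyperthread pairs), or all land on one socket, that would support the topology explanation.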
Tim
-- Tim Cutts Senior Director, R&D IT - Data, Analytics & AI, Scientific Computing Platform AstraZeneca
From: Michael DiDomenico via slurm-users slurm-users@lists.schedmd.com
Date: Wednesday, 23 April 2025 at 7:53 pm
Cc: Slurm User Community List slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Job running slower when using Slurm

the program probably says 32 threads because it's just looking at the box, not at what the slurm cgroups allow (assuming you're using them) for cpu
i think for an openmp program (not openmpi) you definitely want the first command with --cpus-per-task=32
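To make sure the OpenMP runtime actually uses what Slurm grants, the batch script could export the thread count itself. A sketch (the `/home/.../` path is elided as in the original post; the binding variables are standard OpenMP environment variables, not something from the original script):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# Tell the OpenMP runtime how many threads Slurm granted this task;
# fall back to 32 when run outside Slurm (variable unset).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-32}
export OMP_PROC_BIND=close   # pin threads near their master thread
export OMP_PLACES=cores      # one thread per physical core

cd /home/.../NPB3.4-OMP/bin
./bt.C.x
```

Without `OMP_NUM_THREADS` set, the runtime typically sizes its thread pool from what it sees on the box, which can mismatch the cgroup's cpuset.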
are you measuring the runtime inside the program or outside it? if the latter, the extra ~10 seconds could be slurm setup/node-allocation overhead
On Wed, Apr 23, 2025 at 2:41 PM Jeffrey Layton laytonjb@gmail.com wrote:
I tried using ntasks and cpus-per-task to get all 32 cores. So I added --ntasks=# --cpus-per-task=N to the sbatch command so that it now looks like:
sbatch --nodes=1 --ntasks=1 --cpus-per-task=32 <script>
It now takes 28 seconds (I ran it a few times).
If I change the command to
sbatch --nodes=1 --ntasks=32 --cpus-per-task=1 <script>
It now takes about 30 seconds.
Outside of Slurm it was only taking about 19.6 seconds. So either way it takes longer.
Interesting, in the output from bt, it gives the Total Threads and Avail Threads. In all cases the answer is 32. If the code was only using 1 thread I'm wondering why it would say Avail Threads is 32.
I'm still not sure why it takes longer when Slurm is being used, but I'm reading as much as I can.
Thanks!
Jeff
On Wed, Apr 23, 2025 at 2:15 PM Jeffrey Layton laytonjb@gmail.com wrote:
Roger. I didn't configure Slurm so let me look at slurm.conf and gres.conf to see if they restrict a job to a single CPU.
Thanks
On Wed, Apr 23, 2025 at 1:48 PM Michael DiDomenico via slurm-users slurm-users@lists.schedmd.com wrote:
without knowing anything about your environment, it's reasonable to suspect that your openmp program is multi-threaded, but slurm is constraining your job to a single core. evidence of this should show up when running top on the node and watching the cpu% used by the program
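Concretely, something like the following on the compute node while the job runs (a sketch; the cgroup path shown is the cgroup v1 layout and varies by site and cgroup version):

```shell
# ~3200 %CPU means 32 busy threads; ~100 %CPU means the program is
# effectively confined to (or only using) one core.
top -b -n 1 -p "$(pgrep -n -f bt.C.x)" | tail -n 2

# Cross-check which cpuset Slurm's cgroup actually granted the job
# (cgroup v2 exposes cpuset.cpus.effective under a different hierarchy).
cat /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus
```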
On Wed, Apr 23, 2025 at 1:28 PM Jeffrey Layton via slurm-users slurm-users@lists.schedmd.com wrote:
Good morning,
I'm running an NPB test, bt.C that is OpenMP and built using NV HPC SDK (version 25.1). I run it on a compute node by ssh-ing to the node. It runs in about 19.6 seconds.
Then I run the code using a simple job:
Command to submit job: sbatch --nodes=1 run-npb-omp
The script run-npb-omp is the following:
#!/bin/bash
cd /home/.../NPB3.4-OMP/bin
./bt.C.x
When I use Slurm, the job takes 482 seconds.
Nothing really appears in the logs. It doesn't do any IO. No data is copied anywhere. I'm kind of at a loss to figure out why. Any suggestions of where to look?
Thanks!
Jeff
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com