Hi,
At our site we have recently upgraded to Slurm 23.11.5 and are having trouble with MPI jobs that run srun inside an sbatch'ed script.
The cgroup does not appear to be set up correctly for the srun step (step_0).
As an example:
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000..../job..../cpuset.cpus
0,2-3,68-69,96,98-99,164-165
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000..../job..../step_0/cpuset.cpus
0,2,68,96,98,164
The sbatch job is allocated a range of CPUs in its cgroup; however, when step_0 is run, only some of those CPUs are present in the step's cgroup.
I have noticed that it is always the remainder of each range that is missing, e.g. for 2-5 only 2 is included and 3, 4 and 5 are missing.
This also only happens if there are multiple ranges of CPUs in the allocation, e.g. 1-12 on its own is fine, but 1-12,15-20 results in only 1,15.
The batch job itself seems fine, with step_batch and step_extern both being allocated correctly.
This causes numerous issues for MPI jobs, as their tasks end up overloading the few CPUs left in the step's cgroup.
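In case it helps with reproducing this, a minimal check from inside a job would be something along these lines (the task counts here are arbitrary and the cgroup paths follow the v1 layout shown above):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2

# CPUs the whole job was given
cat /sys/fs/cgroup/cpuset/slurm/uid_${UID}/job_${SLURM_JOB_ID}/cpuset.cpus

# CPUs the step actually ends up with, plus the affinity each task sees
srun bash -c 'cat /sys/fs/cgroup/cpuset/slurm/uid_${UID}/job_${SLURM_JOB_ID}/step_${SLURM_STEP_ID}/cpuset.cpus; grep Cpus_allowed_list /proc/self/status'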
We run our nodes with threading (SMT) enabled on the CPUs, and with the cgroup and affinity task plugins.
I have attached our slurm.conf to show our settings.
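For reference, the task-binding related lines are of this general form (the values here are only illustrative; the attachment has our exact settings):

# Task binding via both plugins, as mentioned above
TaskPlugin=task/affinity,task/cgroup
# Process tracking via cgroups (illustrative)
ProctrackType=proctrack/cgroup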
Our /etc/slurm/cgroup.conf is
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
We have turned on logging at the debug2 level but haven't yet found anything useful; I'm happy to take suggestions on what to look for.
Is anyone able to provide any advice on where to go next to try to identify the issue?
Regards,
Ashley Wright