- slurm-users - lists.schedmd.com

Slurm version 23.11.4 is now available
by Tim McMullan 22 Feb '24

22 Feb '24

We are pleased to announce the availability of Slurm version 23.11.4. The 23.11.4 release includes a number of fixes to stability and various bug fixes. Some notable changes include that VSZ is no longer being reported when using cgroup/v2 (this is not provided by the kernel), a warning has been added if using select/linear and tolology/tree together as this will not be supported in the next major release, and a backwards compatibility issue that caused jobs using --gpus to be rejected when submitted from 23.02 or 22.05. Slurm can be downloaded from https://www.schedmd.com/downloads.php . -Tim > * Changes in Slurm 23.11.4 > ========================== > -- Fix a memory leak when updating partition nodes. > -- Don't leave a partition around if it fails to create with scontrol. > -- Fix segfault when creating partition with bad node list from scontrol. > -- Fix preserving partition nodes on bad node list update from scontrol. > -- Fix assertion in developer mode on a failed message unpack. > -- Fix repeat POWER_DOWN requests making the nodes available for ping. > -- Fix rebuilding job alias_list on restart when nodes are still powering up. > -- Fix INVALID nodes running health check. > -- Fix cloud/future nodes not setting addresses on invalid registration. > -- scrun - Remove the requirement to set the SCRUN_WORKING_DIR environment > variable. This was a regression in 23.11. > -- Add warning for using select/linear with topology/tree. > This combination will not be supported in the next major version. > -- Fix health check program not being run after first pass of all nodes when > using MaxNodeCount. > -- sacct - Set process exit code to one for all errors. > -- Add SlurmctldParameters=disable_triggers option. > -- Fix issue running steps when the allocation requested an exclusive > allocation shards along with shards. > -- Fix cleaning up the sleep process and the cgroup of the extern step if > slurm_spank_task_post_fork returns an error. > -- slurm_completion - Add missing --gres-flags= options > multiple-tasks-per-sharing and one-task-per-sharing. > -- scrun - Avoid race condition that could cause outbound network > communications to incorrectly rejected with an incomplete packet error. > -- scrun - Gracefully handle kernel giving invalid expected number of incoming > bytes for a connection causing incoming packet corruption resulting in > connection getting closed. > -- srun - return 1 when a step lauch fails > -- scrun - Avoid race condition that could cause deadlock during shutdown. > -- Fix scontrol listpids to work under dynamic node scenarios. > -- Add --tres-bind to --help and --usage output. > -- Add --gres-flags=allow-task-sharing to allow GPUs to still be accessible > among all tasks when binding GPUs to specific tasks. > -- Fix issue with CUDA_VISIBLE_DEVICES showing the same MIG device for all > tasks when using MIGs with --tres-per-task or --gpus-per-task. > -- slurmctld - Prevent a potential hang during shutdown/reconfigure if the > association cache thread was previously shut down. > -- scrun - Avoid race condition that could cause scrun to hang during > shutdown when connections have pending events. > -- scrun - Avoid excessive polling of connections during shutdown that could > needlessly cause 100% CPU usage on a thread. > -- sbcast - Use user identity from broadcast credential instead of looking it > up locally on the node. > -- scontrol - Remove "abort" option handling. > -- Fix an error message referring to the wrong RPC. > -- Fix memory leak on error when creating dynamic nodes. > -- Fix a slurmctld segfault when a cloud/dynamic node changes hostname on > registration. > -- Prevent a slurmctld deadlock if the gpu plugin fails to load when > creating a node. > -- Change a slurmctld fatal() to an error() when attempting to create a > dynamic node with a global autodetect set in gres.conf. > -- Fix leaving node records on error when creating nodes with scontrol. > -- scrun/sackd - Avoid race condition where shutdown could deadlock. > -- Fix a regression in 23.02.5 that caused pam_slurm_adopt to fail when > the user has multiple jobs on a node. > -- Add GLOB_SILENCE flag that silences the error message which will display if > an include directive attempts to use the "*" wildcard. > -- Fix jobs getting rejected when submitting with --gpus option from older > versions of job submission commands (23.02 and older). > -- cgroup/v2 - Return 0 for VSZ. Kernel cgroups do not provide this metric. > -- scrun - Avoid race condition where outbound RPCs could be corrupted. > -- scrun - Avoid race condition that could cause a crash while compiled in > debug mode. > -- gpu/rsmi - Disable gpu usage statistics when not using ROCM 6.0.0+ > -- Fix stuck processes and incorrect environment when using --get-user-env. > -- Avoid segfault in the slurmdbd when TrackWCKey=no but you are still using > use WCKeys. > -- Fix ctld segfault with TopologyParam=RoutePart and no partition defined. > -- slurmctld - Fix missing --deadline handling for jobs not evaluated by the > schedulers (i.e. non-runnable, skipped for other reasons, etc.). > -- Demote some eio related logs from error to verbose in user commands. These > are not generally actionable by the user and are easilly generated by port > scanning a machine running srun. > -- Make sprio correctly print array tasks that have not yet been split out. > -- topology/block - Restrict the number of last-level blocks in any allocation. -- Tim McMullan Release Management, Support, and Development SchedMD LLC - Commercial Slurm Development and Support

1 0

RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-23.10-1.1.9.0+ OpenMPI-5.0.2
by S L 20 Feb '24

20 Feb '24

Hi All, We're currently in the process of setting up SLURM on a RHEL 8.9 based cluster. Here's a summary of the steps we've taken so far: Installed MLNX OFED ConnectX-5.2. Compiled and installed PMiX and UCX. Compiled and installed Slurm with PMiX_v4 and UCX support. Compiled OpenMPI with SLURM, PMIx, libevent, and hwloc support. All compute nodes are reachable via the IB network. *Problem:* While hello world MPI jobs are working fine on multiple nodes, the jobs are not utilizing Infiniband. srun --mpi=pmix -N2 -n2 --ntasks-per-node=2 ./hello > log.out 2>&1 Output from srun --mpi=list: MPI plugin types are... none cray_shasta pmi2 pmix specific pmix plugin versions available: pmix_v4 Could someone please point me in the right direction on how to troubleshoot this issue? Thank you for your assistance. Sudhakar

2 1

URL for how to do for SLURM accounting setup
by John Joseph 20 Feb '24

20 Feb '24

Dear ALL, Good morning we were able to setup a test SLURM based system, with 4 nodes , Ubuntu 22.04 LTS and we were able to run COMSOL using "comsol batch" command Now we plan to have accounting https://slurm.schedmd.com/accounting.html Like to reach out and get guidance on any tutorial or how to do documentation on setting up accounting Appreciate your support Thanks Joseph John

2 2

CPU utilisation using two commands scontrol association and sreport makes a huge difference
by prachikakade.lit＠gmail.com 20 Feb '24

20 Feb '24

Dear Team, I created a small cluster of 3 nodes on my VM ware to work on the CPU utilization concept. I created a user name= hpcuser01, and allocated GrpTresMin=cpu=59000040 -> CPU minutes and gpu=0 Now, when I checked his utilization using scontrol association cmd # scontrol show ass user=hpcuser01 | grep -n15 hpcuser01 "" Output: QOS=hpcuser01_test(11) UsageRaw=5789263754.153907 GrpJobs=N(0) GrpJobsAccrue=N(5) GrpSubmitJobs=N(5) GrpWall=N(1188290.27) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0) GrpTRESMins=cpu=59000040(59001152),mem=N(293429427523),energy=N(0),node=N(1993778),billing=N(96073808),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=1(0) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=46080(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0)"" Here we can see that GrpTRESMins=cpu=59000040(59001152) after converting in Hours ( value /60) it will be: 983334(983352) CPU hours This means 983352 CPU hours have been used by hpcuser01 but, when I checked his utilization using sreport cmd #sreport -t hour --tres=cpu,gres/gpu,billing cluster AccountUtilizationByUser user=hpcuser01 start=2021-01-01T00:00:00 end=2024-02-20T23:59:59 -P | grep -iw -e billing -e cpu -e gres/gpu Output: cpu|979403 billing|1594902 gres/gpu|0 here I found that his utilisation is : 979403 I could like to understand why there is so much of a difference using two cmds to find out utilisation. Almost the difference is 979403 - 983352 = 3949 Request you to help me out with this. Regards, Prachi

1 0

Slurm Power Saving Guide: Why doesnt slurm mark as failed when resumeProgram returns =/= 0
by Xaver Stiensmeier 19 Feb '24

19 Feb '24

Dear slurm-user list, I had cases where our resumeProgram failed due to temporary cloud timeouts. In that case the resumeProgram returns a value =/= 0. Why does Slurm still wait until resumeTimeout instead of just accepting the startup as failed which then should lead to a rescheduling of the job. Is there some way to achieve the described effect i.e. tell Slurm: "You can stop waiting, the node won't come alive." or am I missing the correct way how this should be handled in Slurm? Best regards, Xaver

1 0

slurmdbd 17.02: "cluster not registered" (but things work)
by Matthias Leopold 19 Feb '24

19 Feb '24

Hi, I need to take care of a 17.02 Slurm cluster (I'm preparing it for upgrades). I see that slurmdbd logs various "cluster not registered" messages at startup (DBD_CLUSTER_TRES,DBD_JOB_START,DBD_STEP_START), but I don't see a real problem. Accounting works. Do I have to worry? Can this be related to upper/lower case issues with ClusterName? For some reason ClusterName is all upper case in slurm.conf. According to docs (and my experience with Slurm 21.08) though, this should be OK. Thx for help Matthias

1 0

Recover Batch Script Error
by Jason Simms 16 Feb '24

16 Feb '24

Hello all, I've used the "scontrol write batch_script" command to output the job submission script from completed jobs in the past, but for some reason, no matter which job I specify, it tells me it is invalid. Any way to troubleshoot this? Alternatively, is there another way - even if a manual database query - to recover the job script, assuming it exists in the database? sacct --jobs=38960 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 38960 amr_run_v+ tsmith2lab tsmith2lab 72 COMPLETED 0:0 38960.batch batch tsmith2lab 40 COMPLETED 0:0 38960.extern extern tsmith2lab 72 COMPLETED 0:0 38960.0 hydra_pmi+ tsmith2lab 72 COMPLETED 0:0 scontrol write batch_script 38960 job script retrieval failed: Invalid job id specified Warmest regards, Jason -- *Jason L. Simms, Ph.D., M.P.H.* Manager of Research Computing Swarthmore College Information Technology Services (610) 328-8102 Schedule a meeting: https://calendly.com/jlsimms

4 3

Need help managing licence
by Sylvain MARET 16 Feb '24

16 Feb '24

Hello everyone ! Recently our users bought a cplex dynamic license and want to use it on our slurm cluster. I've installed the paid version of cplex within modules so authorized user can load it with a simple module load cplex/2111 command but I don't know how to manage and ensure slurm doesn't launch a job if 20 people are already running code with this license. How do you guys manage paid licenses on your cluster ? Any advice would be appreciated ! Regards, Sylvain Maret

2 1

MPI/PMIx Issues after 23.11 Update
by Oliver Smith 16 Feb '24

16 Feb '24

Hi all, We’re running a small slurm dev cluster on Ubuntu and are facing issues with MPI/PMIx after upgrading slurm from 23.02.5 to 23.11.3. The first job step to use MPI within a job fails roughly 80% of the time but following attempts to use MPI within the same job work fine. For the failing job step we see this error after hitting the MPI timeout: slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:741: 0x55a03f8d7a90: collective timeout seq=0 slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: hpc-d-msh-01a02 [1]: pmixp_coll.c:286: Dumping collective state slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:759: 0x55a03f8d7a90: COLL_FENCE_RING state seq=0 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:762: my peerid: 1:hpc-d-msh-01a02 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:769: neighbor id: next 0:hpc-d-msh-01a01, prev 0:hpc-d-msh-01a01 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b08, #0, in-use=0 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b40, #1, in-use=0 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b78, #2, in-use=1 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:787: seq=0 contribs: loc=1/prev=1/fwd=0 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:791: neighbor contribs [2]: slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:824: done contrib: hpc-d-msh-01a01 slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:826: wait contrib: - slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:828: status=PMIXP_COLL_RING_FINILIZE slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:831: buf (offset/size): 0/33362 [hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13) [hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 31 [hpc-d-msh-01a01:47652] *** An error occurred in MPI_Send [hpc-d-msh-01a01:47652] *** reported by process [683360612,0] [hpc-d-msh-01a01:47652] *** on communicator MPI_COMM_WORLD [hpc-d-msh-01a01:47652] *** MPI_ERR_OTHER: known error not in list [hpc-d-msh-01a01:47652] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [hpc-d-msh-01a01:47652] *** and potentially your MPI job) slurmstepd: error: *** STEP 1890.0 ON hpc-d-msh-01a01 CANCELLED AT 2024-02-14T16:14:52 *** OpenMPI/PMIx versions have not changed and downgrading slurm to 23.02.5 seems to resolve the issue. We’d appreciate any pointers anyone might have. Thanks Oli This email comprises confidential information of Mercedes-Benz Grand Prix Limited ("MGP") unless it contains an explicit statement to the contrary made by an authorised representative of MGP. Contracts may only be concluded on behalf of MGP by its authorised signatories and not solely by email communication. No employee, agent, contractor, consultant and/or other representative of MGP is authorised to conclude any legally binding agreement on behalf of MGP by email alone without the express prior written confirmation of two authorised signatories of MGP. Mercedes-Benz Grand Prix Limited. Registered in England No. 787446. Registered Office at Mercedes-Benz Grand Prix Limited, Operations Centre, Brackley, Northants NN13 7BD. Note: The MGP Legal Department also acts on behalf of Mercedes-Benz Motorsport Limited ("MBM") and the above notice applies mutatis mutandis in respect of all email communications of MBM. MBM: Mercedes-Benz Motorsport Limited. Registered in England No. 13057973. Registered office at Mercedes-Benz Motorsport Limited, Lauda Drive, Brackley, Northants NN13 7BD. Please consider the environment before printing this email.

1 0

slurmctld: slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer
by Rike-Benjamin Schuppner 16 Feb '24

16 Feb '24

Hi, I am getting the following error in the logs whenever I run a few srun jobs in a batch. Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: _send_timeout: Socket POLLERR: Connection reset by peer Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: error: slurm_send_node_msg: [socket:[921897]] slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: laying out the 1 tasks on 1 hosts compute2 dist 1 The slurm version is 23.11.3 and an example sbatch file is: #!/bin/bash #SBATCH --job-name=slurm_test #SBATCH --mem=1gb #SBATCH --time=00:05:00 #SBATCH --output=slurm_test_%j.log pwd; hostname; date srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" & srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" & wait The more sruns I have in the script (&-backgrounded or not), the more the error shows up. Is there anything I could do to fix this error? Best /rike

2 1

2025

2024