Dear Slurm-user list,
when node startup fails (cloud scheduling), the job scheduled on that node
should be requeued, according to the documentation for ResumeTimeout:
Maximum time permitted (in seconds) between when a node resume
request is issued and when the node is actually available for use.
Nodes which fail to respond in this time frame will be marked DOWN
and the jobs scheduled on the node requeued.
However, instead of being requeued, the job is killed:
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not
resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node
bibigrid-worker-wubqboa1z2kkgx0-0
Our ResumeProgram does not change the state of the underlying workers. I
think we should set the nodes to DOWN explicitly if the startup fails,
given the documentation's guidance that if the
*ResumeProgram* is unable to restore a node to service with a
responding slurmd and an updated BootTime, it should set the node
state to DOWN, which will result in a requeue of any job associated
with the node - this will happen automatically if the node doesn't
register within ResumeTimeout
But in any case, as the log shows, the job should already be requeued
simply because ResumeTimeout was reached, and I am unsure why that is not
happening. The power down request is sent by our ResumeFailProgram.
We have SlurmctldParameters=idle_on_node_suspend enabled, but I assume
that should not affect resume.
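Coming back to the idea of setting the node DOWN explicitly: a minimal
sketch of what that could look like in a ResumeFailProgram, assuming the
script receives the failed node names as its first argument (Slurm passes
them as a hostlist expression); the script name and Reason string are only
illustrative:

#!/bin/bash
# Hypothetical ResumeFailProgram sketch: explicitly mark nodes that failed
# to start as DOWN so slurmctld requeues any jobs allocated to them.
# $1 holds the hostlist expression of the nodes that hit ResumeTimeout.
FAILED_NODES="$1"

# scontrol accepts a hostlist expression for NodeName.
scontrol update NodeName="${FAILED_NODES}" State=DOWN Reason="FailedStartup"

# A power down request could also be issued here, as our current
# ResumeFailProgram already does.
scontrol update NodeName="${FAILED_NODES}" State=POWER_DOWN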
My Slurm version is 23.11.5.
Best regards,
Xaver
# More context
## slurmctld log, from job submission to failure
[2024-11-18T10:21:45.490] sched: _slurm_rpc_allocate_resources JobId=1
NodeList=bibigrid-worker-wubqboa1z2kkgx0-0 usec=1221
[2024-11-18T10:21:45.499] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:22:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:23:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:23:20.009] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:23:23.003] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:24:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:24:21.484] slurmscriptd: error: _run_script: JobId=0
resumeprog exit status 1:0
[2024-11-18T10:25:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:26:02.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:26:20.007] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:27:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:28:20.003] debug: Updating partition uid access list
[2024-11-18T10:28:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:28:20.008] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:29:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:29:22.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:30:20.007] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:31:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:32:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:32:42.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:33:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:33:20.010] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:34:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:35:20.007] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:36:01.004] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:36:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:37:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:38:20.008] debug: Updating partition uid access list
[2024-11-18T10:38:20.008] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:38:20.017] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:39:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:39:21.003] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:40:21.000] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:41:20.003] debug: sched: Running job scheduler for full
queue.
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not
resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node
bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.047] debug: sackd_mgr_dump_state: saved state of 0
nodes
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill:
beginning
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: no
jobs to backfill
[2024-11-18T10:41:52.736] _slurm_rpc_complete_job_allocation: JobId=1
error Job/step already completing or completed
[2024-11-18T10:41:53.000] debug: Spawning ping agent for
bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:41:53.000] debug: sched: Running job scheduler for
default depth.
[2024-11-18T10:41:53.014] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:53.014] update_node: node
bibigrid-worker-wubqboa1z2kkgx0-0 state set to IDLE
I am doing a new install of Slurm 24.05.3. I have all the packages built
and installed on the head node and compute node with the same munge.key,
slurm.conf, and gres.conf files. I was able to run the munge and unmunge
commands to test munge successfully. Time is synced with chronyd. I can't
seem to find any useful errors in the logs. For some reason, when I run
sinfo, no nodes are listed; I just see the headers for each column. Has
anyone seen this, or does anyone know what a next troubleshooting step
would be? I'm new to this and not sure where to go from here. Thanks for
any and all help!
The odd output I am seeing
[username@headnode ~] sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
(Nothing is output showing status of partition or nodes)
slurm.conf:
ClusterName=slurmkvasir
SlurmctldHost=kadmin2
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/cgroup
MinJobAge=600
SchedulerType=sched/backfill
SelectType=select/cons_tres
PriorityType=priority/multifactor
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu,cpu,node
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmLogFile=/var/log/slurm/slurmd.log
nodeName=k[001-448]
PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=up
slurmctld.log:
Error: Configured MailProg is invalid
Slurmctld version 24.05.3 started on cluster slurmkvasir
Accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 8617
Error: read_slurm_conf: default partition not set.
Recovered state of 448 nodes
Down nodes: k[002-448]
Recovered information about 0 jobs
Recovered state of 0 reservations
Read_slurm_conf: backup_controller not specified
Select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Running as primary controller
slurmd.log:
Error: Node configuration differs from hardware: CPUs=1:40(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore=1:1(hw)
CPU frequency setting not configured for this node
Slurmd version 24.05.3 started
Slurmd started on Wed, 27 Nov 2024 06:51:03 -0700
CPUS=1 Boards=1 Cores=1 Threads=1 Memory=192030 TmpDisk=95201 uptime 166740 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection timed out
(Above line repeated 20 or so times for different nodes.)
Thanks,
Kent Hanson
We are pleased to announce the availability of the Slurm 24.11 release.
To highlight some new features in 24.11:
- New gpu/nvidia plugin. This does not rely on any NVIDIA libraries, and
will
build by default on all systems. It supports basic GPU detection and
management, but cannot currently identify GPU-to-GPU links, or provide
usage data as these are not exposed by the kernel driver.
- Add autodetected GPUs to the output from "slurmd -C".
- Added new QOS-based reports to "sreport".
- Revamped network I/O with the "conmgr" thread-pool model.
- Added new "hostlist function" syntax for management commands and
configuration files.
- switch/hpe_slingshot - Added support for hardware collectives setup
through
the fabric manager. (Requires SlurmctldParameters=enable_stepmgr)
- Added SchedulerParameters=bf_allow_magnetic_slot configuration option to
allow backfill planning for magnetic reservations.
- Added new "scontrol listjobs" and "liststeps" commands to complement
"listpids", and provide --json/--yaml output for all three subcommands.
- Allow jobs to be submitted against multiple QOSes.
- Added new experimental "oracle" backfill scheduling support, which permits
jobs to be delayed if the oracle function determines the reduced
fragmentation of the network topology is sufficiently advantageous.
- Improved responsiveness of the controller when jobs are requeued by
replacing the "db_index" identifier with a slurmctld-generated unique
identifier. ("SLUID")
- New options to job_container/tmpfs to permit site-specific scripts to
modify the namespace before user steps are launched, and to ensure all
steps are completely captured within that namespace.
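As a short illustration of the new "scontrol listjobs"/"liststeps"/
"listpids" subcommands mentioned above (option placement follows the
general scontrol synopsis; the man page has the authoritative 24.11
usage):

# List jobs and steps tracked by slurmctld, with structured output:
scontrol --json listjobs
scontrol --yaml liststeps

# The existing listpids subcommand gains the same output options:
scontrol --json listpids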
The Slurm documentation has also been updated to the 24.11 release.
(Older versions can be found in the archive, linked from the main
documentation page.)
Slurm can be downloaded from https://www.schedmd.com/download-slurm/ .
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
I don't know how many times I've read the docs; I keep thinking I understand it, but something is really wrong with prioritisation on our cluster, and we're struggling to understand why.
The setup:
1. We have a group who submit two types of work: production jobs and research jobs.
2. We have two sacctmgr accounts for this; let's call those 'prod' and 'research'.
3. We also have some dedicated hardware that they paid for which can be used only by users associated with the prod account.
Desired behaviour:
1. Usage of their dedicated hardware by production jobs should not hugely decrease the fairshare priority for research jobs in other partitions.
2. Usage of shared hardware should decrease their fairshare priority (whether by production or research jobs)
3. Memory should make a relatively small contribution to TRES usage (it's not normally the constrained resource)
Our approach:
Set TRESBillingWeights for cpu, memory and gres/GPU usage on shared partitions. Typically these are set to: CPU=1.0,Mem=0.25G,GRES/gpu=1.0
Set TRESBillingWeights to something small on the dedicated hardware partition, such as: CPU=0.25
Set PriorityWeightFairshare and PriorityWeightAge to values such that Fairshare dominates when jobs are young, and Age takes over if they've been pending a long time
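A sketch of how the approach above might be expressed in slurm.conf
(partition names, node lists, and the exact values are placeholders, not
our real configuration):

# Shared partition: bill CPU and GPU equally, memory at a reduced rate.
PartitionName=shared Nodes=gpu[001-010] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=1.0"
# Dedicated prod-only partition: keep billing small so usage here barely
# moves fairshare for jobs elsewhere.
PartitionName=prod_only Nodes=prod[001-004] AllowAccounts=prod TRESBillingWeights="CPU=0.25"
# Priority weights: fairshare dominates for young jobs; age takes over
# for long-pending jobs.
PriorityWeightFairshare=100000
PriorityWeightAge=10000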
The observed behaviour:
1. production association jobs have a high priority; this is working well
2. research jobs are still getting heavily penalised in fairshare, and we don't understand why; they seem to have enormous RawUsage, largely coming from memory:
Here's what I see from sshare (sensitive details removed, obviously):
sshare -l -A prod,research -a -o Account,RawUsage,EffectvUsage,FairShare,LevelFS,TRESRunMins%80 | grep -v cpu=0
Account RawUsage EffectvUsage FairShare LevelFS TRESRunMins
-------------------- ----------- ------------- ---------- ---------- --------------------------------------------------------------------------------
prod 1587283 0.884373 0.226149 cpu=81371,mem=669457237,energy=0,node=20610,billing=100833,fs/disk=0,vmem=0,pag+
prod 1082008 0.681681 0.963786 0.366740 cpu=81281,mem=669273429,energy=0,node=20520,billing=100833,fs/disk=0,vmem=0,pag+
prod 505090 0.318202 0.964027 0.785664 cpu=90,mem=184320,energy=0,node=90,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=+
research 1043560787 0.380577 0.121648 cpu=17181098808,mem=35196566339054,energy=0,node=4295361360,billing=25773481938+
research 146841 0.000141 0.005311 124.679238 cpu=824,mem=3375923,energy=0,node=824,billing=824,fs/disk=0,vmem=0,pages=0,gres+
research 17530141 0.016798 0.001449 1.044377 cpu=254484,mem=3379938816,energy=0,node=161907,billing=893592,fs/disk=0,vmem=0,+
research 167597 0.000161 0.005070 109.238498 cpu=7275,mem=223516160,energy=0,node=7275,billing=50931,fs/disk=0,vmem=0,pages=+
research 12712481 0.012182 0.001931 1.440166 cpu=186327,mem=95399526,energy=0,node=23290,billing=232909,fs/disk=0,vmem=0,pag+
research 11521011 0.011040 0.002173 1.589104 cpu=8167,mem=267626086,energy=0,node=8167,billing=65338,fs/disk=0,vmem=0,pages=+
research 9719735 0.009314 0.002414 1.883599 cpu=15020,mem=69214617,energy=0,node=1877,billing=3755,fs/disk=0,vmem=0,pages=0+
research 25004766 0.023961 0.001207 0.732184 cpu=590778,mem=6464600473,energy=0,node=98910,billing=2266887,fs/disk=0,vmem=0,+
research 68938740 0.066061 0.000724 0.265570 cpu=159332,mem=963064985,energy=0,node=89957,billing=192706,fs/disk=0,vmem=0,pa+
research 7359413 0.007052 0.002656 2.487710 cpu=81401,mem=583487624,energy=0,node=20350,billing=20350,fs/disk=0,vmem=0,page+
research 718714430 0.688714 0.000241 0.025473 cpu=20616,mem=337774728,energy=0,node=5154,billing=92772,fs/disk=0,vmem=0,pages+
research 1016606 0.000974 0.003863 18.009010 cpu=17179774580,mem=35184178340113,energy=0,node=4294943645,billing=25769661870+
Firstly, why are the mem TRES numbers so enormous?
Secondly, what's going on with the last user, where the rawusage is tiny, but the TRESRunMins is ridiculously big? That could be messing up the whole thing.
Thanks in advance for any advice (either explaining what I've misunderstood, or suggesting "there's a better way to achieve what you want").
Tim
--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca
We have an 8-GPU server in which one GPU has gone into an error state that
will require a reboot to clear. I have jobs on the server running on good
GPUs that will take another 3 days to complete. In the meantime, I would
like short jobs to run on the good free GPUs till I reboot.
I set a reservation on the whole node for the time window in which I plan
to reboot, with
scontrol create reservation reservationName=rtx-01_reboot users=root
starttime=2024-11-25T06:00:00 duration=720 Nodes=rtx-01 flags=maint,ignore_jobs
But I would like to set a reservation on just the bad GPU (gpu_id=7) from
now till 2024-11-25T06:00:00 so no job runs that will use it.
Is that possible?
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA
Hi,
I am using an old Slurm version (20.11.8) and we had to reboot our cluster
today for maintenance. I suspended all the jobs on it with the command
scontrol suspend list_job_ids, and all the jobs paused and were suspended.
However, when I tried to resume them after the reboot, scontrol resume did
not work (the reason column showed "(JobHeldAdmin)"). I was able to
release them with scontrol release and the jobs started running again.
However, the time Slurm recorded for them was reset (the Time column shows
0:00 for all the jobs), though the jobs seem to have resumed from the
point they had reached before being suspended.
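For reference, a sketch of one way to suspend everything before a reboot
and resume it afterwards (the job selection below is illustrative, not the
exact commands I ran):

# Before the reboot: suspend every currently running job.
squeue -h -t RUNNING -o %i | xargs -r -n1 scontrol suspend

# After the reboot: try to resume the jobs that are still suspended.
squeue -h -t SUSPENDED -o %i | xargs -r -n1 scontrol resume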
1- Did I follow the right procedure to suspend, reboot and resume/release?
2- In this case, is the wall time for all the jobs reset, so that anyone
with Slurm admin rights could make their jobs run longer than the wall
time limit by suspending and resuming them?
Best,
*Fritz Ratnasamy*
Data Scientist
Information Technology
Hi,
I compiled and installed Slurm 24.05 on Ubuntu 22.04 following this
tutorial: https://www.schedmd.com/slurm/installation-tutorial/
The systemd service files are from the deb packages that result from this build.
Do I have to worry that slurmctld and slurmd don't write PID files
although SlurmctldPidFile and SlurmdPidFile are defined in slurm.conf?
The paths for the PID files exist and are writable, and the logs don't show any errors.
slurmdbd does write a PID file as defined in slurmdbd.conf.
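A quick way to cross-check the PID file paths the running daemons were
configured with, and whether anything actually gets written there (the
example paths below are placeholders; use whatever the first command
reports):

# Show the PID file settings the daemons are using:
scontrol show config | grep -i pidfile

# Then check whether files exist at those locations, e.g.:
ls -l /run/slurmctld.pid /run/slurmd.pid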
thx
Matthias