- slurm-users - lists.schedmd.com

Job Step State
by Emyr James 01 Oct '24

01 Oct '24

Dear all, I am working on a script to take completed job accounting data from the slurm accounting database and insert the equivalent data into a clickhouse table for fast reporting I can see that all the information is included in the cluster_job_table and cluster_job_step_table which seem to be joined on job_db_inx To get the cpu usage and peak memory usage etc. I can see that I need to parse the tres columns in the job steps. I couldn't find any column called MaxRSS in the database even … [View More]

2 2

Hardcoded CGroups v2 Slice
by Khalid Al-Hawaj 01 Oct '24

01 Oct '24

Hello, I am in the process of setting up SLURM to be used in a profiling cluster. The purpose of SLURM is to allow users to submit jobs to be profiled. The latency is a very important aspect of profiling the applications correctly. I was able to leverage cgroupsv2.0 to isolate user.slice from the cores that would be used by SLURM jobs. The issue is that slurmstepd shares the resources with system.slice; I was digging through the code, and I saw that the creation of the scope is here: https://… [View More]

1 0

Updated one compute node to Ubuntu 24.04 LTS, now it does not receive jobs
by Cristóbal Navarro 29 Sep '24

29 Sep '24

Dear community, I am having a strange issue I have been unable to find the cause. Last week I did a full update on the cluster, which is composed of the master node, and two compute nodes (nodeGPU01 -> DGXA100 and nodeGPU02 -> custom GPU server). After the update, I got - master node ended up with Ubuntu 24.04, - nodeGPU01 with latest DGX OS (still Ubuntu 22.04) - nodeGPU02 with Ubuntu 24.04 LTS. - Launching jobs from master choosing the partitions of nodeGPU01 works … [View More]perfectly. - Launching jobs from master choosing the partition of nodeGPU02 stopped working (hangs). The nodeGPU02 (Ubuntu 24) is no longer processing jobs successfully, while the other nodeGPU01 works perfectly even when the master has Ubuntu 24. Any help is welcome, I have tried many things and had no success in finding the cause of this. Please let me know if you need more information. Many thanks in advance. This is the initial `slurmd` log of the problematic node (nodeGPU02), notice the messages in yellow ➜ ~ sudo systemctl status slurmd.service ● slurmd.service - Slurm node daemon Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; preset: enabled) Active: active (running) since Sat 2024-09-28 14:00:22 -03; 4s ago Main PID: 4821 (slurmd) Tasks: 1 Memory: 17.0M (peak: 29.7M) CPU: 174ms CGroup: /system.slice/slurmd.service └─4821 /usr/sbin/slurmd -D -s Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: MPI: Loading all types Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: mpi/pmix_v5: init: PMIx plugin loaded Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: No mpi.conf file (/etc/slurm/mpi.conf) Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: slurmd started on Sat, 28 Sep 2024 14:00:25 -0300 Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug2: health_check success rc:0 output: Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: CPUs=128 Boards=1 Sockets=2 Cores=64 Threads=1 Memory=773744 TmpDisk=899181 Uptime=2829 CPUSpecList=(null) FeaturesAvail=(nu> Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused Sep 28 14:00:25 nodeGPU02 slurmd[4821]: slurmd: debug: _handle_node_reg_resp: slurmctld sent back 11 TRES This is the verbose output of the srun command (notice yellow messages). ➜ ~ srun -vvvp rtx hostname srun: defined options srun: -------------------- -------------------- srun: partition : rtx srun: verbose : 3 srun: -------------------- -------------------- srun: end of defined options srun: debug: propagating RLIMIT_CPU=18446744073709551615 srun: debug: propagating RLIMIT_FSIZE=18446744073709551615 srun: debug: propagating RLIMIT_DATA=18446744073709551615 srun: debug: propagating RLIMIT_STACK=8388608 srun: debug: propagating RLIMIT_CORE=0 srun: debug: propagating RLIMIT_RSS=18446744073709551615 srun: debug: propagating RLIMIT_NPROC=3090276 srun: debug: propagating RLIMIT_NOFILE=1024 srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615 srun: debug: propagating RLIMIT_AS=18446744073709551615 srun: debug: propagating SLURM_PRIO_PROCESS=0 srun: debug: propagating UMASK=0002 srun: debug: Entering slurm_allocation_msg_thr_create() srun: debug: port from net_stream_listen is 34081 srun: debug: Entering _msg_thr_internal srun: Waiting for resource configuration srun: Nodes nodeGPU02 are ready for job srun: jobid 57463: nodes(1):`nodeGPU02', cpu counts: 1(x1) srun: debug2: creating job with 1 tasks srun: debug2: cpu:1 is not a gres: srun: debug: requesting job 57463, user 99, nodes 1 including ((null)) srun: debug: cpus 1, tasks 1, name hostname, relative 65534 srun: CpuBindType=(null type) srun: debug: Entering slurm_step_launch srun: debug: mpi/pmix_v4: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:382: Abort agent port: 41393 srun: debug: mpi/pmix_v4: mpi_p_client_prelaunch: (null) [0]: mpi_pmix.c:285: setup process mapping in srun srun: debug: Entering _msg_thr_create() srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:353: Start abort thread srun: debug: initialized stdio listening socket, port 33223 srun: debug: Started IO server thread (140079189182144) srun: debug: Entering _launch_tasks srun: launching StepId=57463.0 on host nodeGPU02, 1 tasks: 0 srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: route/default: init: route default plugin loaded srun: debug2: Called _file_writable srun: topology/none: init: topology NONE plugin loaded srun: debug2: Tree head got back 0 looking for 1 srun: debug: slurm_recv_timeout at 0 of 4, timeout srun: error: slurm_receive_msgs: [[nodeGPU02]:6818] failed: Socket timed out on send/recv operation srun: debug2: Tree head got back 1 srun: debug: launch returned msg_rc=1001 err=5004 type=9001 srun: debug2: marking task 0 done on failed node 0 srun: error: Task launch for StepId=57463.0 failed on node nodeGPU02: Socket timed out on send/recv operation srun: error: Application launch failed: Socket timed out on send/recv operation srun: Job step aborted srun: debug2: false, shutdown srun: debug2: false, shutdown srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: false, shutdown srun: debug: IO thread exiting srun: debug: mpi/pmix_v4: _conn_readable: (null) [0]: pmixp_agent.c:105: false, shutdown srun: debug: mpi/pmix_v4: _pmix_abort_thread: (null) [0]: pmixp_agent.c:355: Abort thread exit srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread srun: debug2: false, shutdown srun: debug: Leaving _msg_thr_internal srun: debug2: spank: spank_pyxis.so: exit = 0 This is the `tail -f` log of slurmctld when launching a simple `srun hostname` [2024-09-28T14:08:10.264] ==================== [2024-09-28T14:08:10.264] JobId=57463 nhosts:1 ncpus:1 node_req:1 nodes=nodeGPU02 [2024-09-28T14:08:10.264] Node[0]: [2024-09-28T14:08:10.264] Mem(MB):65536:0 Sockets:2 Cores:64 CPUs:1:0 [2024-09-28T14:08:10.264] Socket[0] Core[0] is allocated [2024-09-28T14:08:10.264] -------------------- [2024-09-28T14:08:10.264] cpu_array_value[0]:1 reps:1 [2024-09-28T14:08:10.264] ==================== [2024-09-28T14:08:10.264] gres/gpu: state for nodeGPU02 [2024-09-28T14:08:10.264] gres_cnt found:3 configured:3 avail:3 alloc:0 [2024-09-28T14:08:10.264] gres_bit_alloc: of 3 [2024-09-28T14:08:10.264] gres_used:(null) [2024-09-28T14:08:10.264] topo[0]:(null)(0) [2024-09-28T14:08:10.264] topo_core_bitmap[0]:0-63 of 128 [2024-09-28T14:08:10.264] topo_gres_bitmap[0]:0 of 3 [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[0]:0 [2024-09-28T14:08:10.264] topo_gres_cnt_avail[0]:1 [2024-09-28T14:08:10.264] topo[1]:(null)(0) [2024-09-28T14:08:10.264] topo_core_bitmap[1]:0-63 of 128 [2024-09-28T14:08:10.264] topo_gres_bitmap[1]:1 of 3 [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[1]:0 [2024-09-28T14:08:10.264] topo_gres_cnt_avail[1]:1 [2024-09-28T14:08:10.264] topo[2]:(null)(0) [2024-09-28T14:08:10.264] topo_core_bitmap[2]:0-63 of 128 [2024-09-28T14:08:10.264] topo_gres_bitmap[2]:2 of 3 [2024-09-28T14:08:10.264] topo_gres_cnt_alloc[2]:0 [2024-09-28T14:08:10.264] topo_gres_cnt_avail[2]:1 [2024-09-28T14:08:10.265] sched: _slurm_rpc_allocate_resources JobId=57463 NodeList=nodeGPU02 usec=1339 [2024-09-28T14:08:10.368] ==================== [2024-09-28T14:08:10.368] JobId=57463 StepId=0 [2024-09-28T14:08:10.368] JobNode[0] Socket[0] Core[0] is allocated [2024-09-28T14:08:10.368] ==================== [2024-09-28T14:08:30.409] _job_complete: JobId=57463 WTERMSIG 12 [2024-09-28T14:08:30.410] gres/gpu: state for nodeGPU02 [2024-09-28T14:08:30.410] gres_cnt found:3 configured:3 avail:3 alloc:0 [2024-09-28T14:08:30.410] gres_bit_alloc: of 3 [2024-09-28T14:08:30.410] gres_used:(null) [2024-09-28T14:08:30.410] topo[0]:(null)(0) [2024-09-28T14:08:30.410] topo_core_bitmap[0]:0-63 of 128 [2024-09-28T14:08:30.410] topo_gres_bitmap[0]:0 of 3 [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[0]:0 [2024-09-28T14:08:30.410] topo_gres_cnt_avail[0]:1 [2024-09-28T14:08:30.410] topo[1]:(null)(0) [2024-09-28T14:08:30.410] topo_core_bitmap[1]:0-63 of 128 [2024-09-28T14:08:30.410] topo_gres_bitmap[1]:1 of 3 [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[1]:0 [2024-09-28T14:08:30.410] topo_gres_cnt_avail[1]:1 [2024-09-28T14:08:30.410] topo[2]:(null)(0) [2024-09-28T14:08:30.410] topo_core_bitmap[2]:0-63 of 128 [2024-09-28T14:08:30.410] topo_gres_bitmap[2]:2 of 3 [2024-09-28T14:08:30.410] topo_gres_cnt_alloc[2]:0 [2024-09-28T14:08:30.410] topo_gres_cnt_avail[2]:1 [2024-09-28T14:08:30.410] _job_complete: JobId=57463 done [2024-09-28T14:08:58.687] gres/gpu: state for nodeGPU01 [2024-09-28T14:08:58.687] gres_cnt found:8 configured:8 avail:8 alloc:0 [2024-09-28T14:08:58.687] gres_bit_alloc: of 8 [2024-09-28T14:08:58.687] gres_used:(null) [2024-09-28T14:08:58.687] topo[0]:A100(808464705) [2024-09-28T14:08:58.687] topo_core_bitmap[0]:48-63 of 128 [2024-09-28T14:08:58.687] topo_gres_bitmap[0]:0 of 8 [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[0]:0 [2024-09-28T14:08:58.687] topo_gres_cnt_avail[0]:1 [2024-09-28T14:08:58.687] topo[1]:A100(808464705) [2024-09-28T14:08:58.687] topo_core_bitmap[1]:48-63 of 128 [2024-09-28T14:08:58.687] topo_gres_bitmap[1]:1 of 8 [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[1]:0 [2024-09-28T14:08:58.687] topo_gres_cnt_avail[1]:1 [2024-09-28T14:08:58.687] topo[2]:A100(808464705) [2024-09-28T14:08:58.687] topo_core_bitmap[2]:16-31 of 128 [2024-09-28T14:08:58.687] topo_gres_bitmap[2]:2 of 8 [2024-09-28T14:08:58.687] topo_gres_cnt_alloc[2]:0 [2024-09-28T14:08:58.687] topo_gres_cnt_avail[2]:1 [2024-09-28T14:08:58.687] topo[3]:A100(808464705) [2024-09-28T14:08:58.687] topo_core_bitmap[3]:16-31 of 128 [2024-09-28T14:08:58.688] topo_gres_bitmap[3]:3 of 8 [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[3]:0 [2024-09-28T14:08:58.688] topo_gres_cnt_avail[3]:1 [2024-09-28T14:08:58.688] topo[4]:A100(808464705) [2024-09-28T14:08:58.688] topo_core_bitmap[4]:112-127 of 128 [2024-09-28T14:08:58.688] topo_gres_bitmap[4]:4 of 8 [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[4]:0 [2024-09-28T14:08:58.688] topo_gres_cnt_avail[4]:1 [2024-09-28T14:08:58.688] topo[5]:A100(808464705) [2024-09-28T14:08:58.688] topo_core_bitmap[5]:112-127 of 128 [2024-09-28T14:08:58.688] topo_gres_bitmap[5]:5 of 8 [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[5]:0 [2024-09-28T14:08:58.688] topo_gres_cnt_avail[5]:1 [2024-09-28T14:08:58.688] topo[6]:A100(808464705) [2024-09-28T14:08:58.688] topo_core_bitmap[6]:80-95 of 128 [2024-09-28T14:08:58.688] topo_gres_bitmap[6]:6 of 8 [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[6]:0 [2024-09-28T14:08:58.688] topo_gres_cnt_avail[6]:1 [2024-09-28T14:08:58.688] topo[7]:A100(808464705) [2024-09-28T14:08:58.688] topo_core_bitmap[7]:80-95 of 128 [2024-09-28T14:08:58.688] topo_gres_bitmap[7]:7 of 8 [2024-09-28T14:08:58.688] topo_gres_cnt_alloc[7]:0 [2024-09-28T14:08:58.688] topo_gres_cnt_avail[7]:1 [2024-09-28T14:08:58.688] type[0]:A100(808464705) [2024-09-28T14:08:58.688] type_cnt_alloc[0]:0 [2024-09-28T14:08:58.688] type_cnt_avail[0]:8 [2024-09-28T14:08:58.690] gres/gpu: state for nodeGPU02 [2024-09-28T14:08:58.690] gres_cnt found:3 configured:3 avail:3 alloc:0 [2024-09-28T14:08:58.690] gres_bit_alloc: of 3 [2024-09-28T14:08:58.690] gres_used:(null) [2024-09-28T14:08:58.690] topo[0]:(null)(0) [2024-09-28T14:08:58.690] topo_core_bitmap[0]:0-63 of 128 [2024-09-28T14:08:58.690] topo_gres_bitmap[0]:0 of 3 [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[0]:0 [2024-09-28T14:08:58.690] topo_gres_cnt_avail[0]:1 [2024-09-28T14:08:58.690] topo[1]:(null)(0) [2024-09-28T14:08:58.690] topo_core_bitmap[1]:0-63 of 128 [2024-09-28T14:08:58.690] topo_gres_bitmap[1]:1 of 3 [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[1]:0 [2024-09-28T14:08:58.690] topo_gres_cnt_avail[1]:1 [2024-09-28T14:08:58.690] topo[2]:(null)(0) [2024-09-28T14:08:58.690] topo_core_bitmap[2]:0-63 of 128 [2024-09-28T14:08:58.690] topo_gres_bitmap[2]:2 of 3 [2024-09-28T14:08:58.690] topo_gres_cnt_alloc[2]:0 [2024-09-28T14:08:58.690] topo_gres_cnt_avail[2]:1 [2024-09-28T14:09:49.763] Resending TERMINATE_JOB request JobId=57463 Nodelist=nodeGPU02 This is the `tail -f` log of slurmd when launching the job from master, notice the messages in yellow [2024-09-28T14:08:10.270] debug2: Processing RPC: REQUEST_LAUNCH_PROLOG [2024-09-28T14:08:10.321] debug2: prep/script: _run_subpath_command: prolog success rc:0 output: [2024-09-28T14:08:10.323] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG [2024-09-28T14:08:10.377] debug: Checking credential with 720 bytes of sig data [2024-09-28T14:08:10.377] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS [2024-09-28T14:08:10.377] debug2: Processing RPC: REQUEST_LAUNCH_TASKS [2024-09-28T14:08:10.377] launch task StepId=57463.0 request from UID:10082 GID:10088 HOST:10.10.0.1 PORT:36478 [2024-09-28T14:08:10.377] CPU_BIND: JobNode[0] CPU[0] Step alloc [2024-09-28T14:08:10.377] CPU_BIND: ==================== [2024-09-28T14:08:10.377] CPU_BIND: Memory extracted from credential for StepId=57463.0 job_mem_limit=65536 step_mem_limit=65536 [2024-09-28T14:08:10.377] debug: Waiting for job 57463's prolog to complete [2024-09-28T14:08:10.377] debug: Finished wait for job 57463's prolog to complete [2024-09-28T14:08:10.378] error: _send_slurmstepd_init failed [2024-09-28T14:08:10.384] debug2: debug level read from slurmd is 'debug2'. [2024-09-28T14:08:10.385] debug2: _read_slurmd_conf_lite: slurmd sent 11 TRES. [2024-09-28T14:08:10.385] debug2: Received CPU frequency information for 128 CPUs [2024-09-28T14:08:10.385] select/cons_tres: common_init: select/cons_tres loaded [2024-09-28T14:08:10.385] debug: switch/none: init: switch NONE plugin loaded [2024-09-28T14:08:10.385] [57463.0] debug: auth/munge: init: loaded [2024-09-28T14:08:10.385] [57463.0] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf [2024-09-28T14:08:10.395] [57463.0] debug: cgroup/v2: init: Cgroup v2 plugin loaded [2024-09-28T14:08:10.396] [57463.0] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded [2024-09-28T14:08:10.396] [57463.0] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded [2024-09-28T14:08:10.396] [57463.0] debug2: Reading acct_gather.conf file /etc/slurm/acct_gather.conf [2024-09-28T14:08:10.396] [57463.0] debug2: hwloc_topology_init [2024-09-28T14:08:10.399] [57463.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found [2024-09-28T14:08:10.400] [57463.0] debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1 [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: core enforcement enabled [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: TotCfgRealMem:773744M allowed:100%(enforced), swap:0%(enforced), max:100%(773744M) max+swap:0%(773744M) min:30M kmem:100%(773744M permissive) min:30M [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: memory enforcement enabled [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: device enforcement enabled [2024-09-28T14:08:10.401] [57463.0] debug: task/cgroup: init: Tasks containment cgroup plugin loaded [2024-09-28T14:08:10.401] [57463.0] debug: jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded [2024-09-28T14:08:10.401] [57463.0] cred/munge: init: Munge credential signature plugin loaded [2024-09-28T14:08:10.401] [57463.0] debug: job_container/none: init: job_container none plugin loaded [2024-09-28T14:08:10.401] [57463.0] debug: gres/gpu: init: loaded [2024-09-28T14:08:10.401] [57463.0] debug: gpu/generic: init: init: GPU Generic plugin loaded [2024-09-28T14:08:30.415] debug2: Start processing RPC: REQUEST_TERMINATE_JOB [2024-09-28T14:08:30.415] debug2: Processing RPC: REQUEST_TERMINATE_JOB [2024-09-28T14:08:30.415] debug: _rpc_terminate_job: uid = 777 JobId=57463 [2024-09-28T14:08:30.415] debug: credential for job 57463 revoked [2024-09-28T14:08:30.415] debug: sent SUCCESS, waiting for step to start [2024-09-28T14:08:30.415] debug: Blocked waiting for JobId=57463, all steps [2024-09-28T14:08:58.688] debug2: Start processing RPC: REQUEST_NODE_REGISTRATION_STATUS [2024-09-28T14:08:58.689] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS [2024-09-28T14:08:58.689] debug: _step_connect: connect() failed for /var/spool/slurmd/slurmd/nodeGPU02_57436.0: Connection refused [2024-09-28T14:08:58.692] debug: _handle_node_reg_resp: slurmctld sent back 11 TRES. [2024-09-28T14:08:58.692] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS -- Cristóbal A. Navarro [View Less]

1 1

errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes
by Robert Kudyba 27 Sep '24

27 Sep '24

We're in the process of upgrading but first we're moving to RHEL 9. My attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-18.08.9.tar.bz2 (H/T to Brian for this flag <https://groups.google.com/g/slurm-users/c/W8YfGIn1rDI/m/4bsSAoqZAAAJ>). I've stumped Google and the Slurm mailing list with the scancel error so hoping someone here knows of a work around. /bin/ld: opt.o:/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel/../../src/scancel/scancel.h:78: multiple … [View More]definition of `opt'; scancel.o:/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel/../../src/scancel/scancel.h:78: first defined here collect2: error: ld returned 1 exit status make[3]: *** [Makefile:577: scancel] Error 1 make[3]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9/src/scancel' make[2]: *** [Makefile:563: all-recursive] Error 1 make[2]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9/src' make[1]: *** [Makefile:690: all-recursive] Error 1 make[1]: Leaving directory '/root/rpmbuild/BUILD/slurm-18.08.9' make: *** [Makefile:589: all] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.jhiGyR (%build) RPM build errors: Macro expanded in comment on line 22: %_prefix path install path for commands, libraries, etc. line 70: It's not recommended to have unversioned Obsoletes: Obsoletes: slurm-lua slurm-munge slurm-plugins Macro expanded in comment on line 158: %define _unpackaged_files_terminate_build 0 line 224: It's not recommended to have unversioned Obsoletes: Obsoletes: slurm-sql line 256: It's not recommended to have unversioned Obsoletes: Obsoletes: slurm-sjobexit slurm-sjstat slurm-seff line 275: It's not recommended to have unversioned Obsoletes: Obsoletes: pam_slurm Bad exit status from /var/tmp/rpm-tmp.jhiGyR (%build) #!/bin/sh RPM_SOURCE_DIR="/root" RPM_BUILD_DIR="/root/rpmbuild/BUILD" RPM_OPT_FLAGS="-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection" RPM_LD_FLAGS="-Wl,-z,relro -Wl,--as-needed "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 " RPM_ARCH="x86_64" RPM_OS="linux" RPM_BUILD_NCPUS="48" export RPM_SOURCE_DIR RPM_BUILD_DIR RPM_OPT_FLAGS RPM_LD_FLAGS RPM_ARCH RPM_OS RPM_BUILD_NCPUS RPM_LD_FLAGS RPM_DOC_DIR="/usr/share/doc" export RPM_DOC_DIR RPM_PACKAGE_NAME="slurm" RPM_PACKAGE_VERSION="18.08.9" RPM_PACKAGE_RELEASE="1.el9" export RPM_PACKAGE_NAME RPM_PACKAGE_VERSION RPM_PACKAGE_RELEASE LANG=C export LANG unset CDPATH DISPLAY ||: RPM_BUILD_ROOT="/root/rpmbuild/BUILDROOT/slurm-18.08.9-1.el9.x86_64" export RPM_BUILD_ROOT PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:/usr/lib64/pkgconfig:/usr/share/pkgconfig" export PKG_CONFIG_PATH CONFIG_SITE=${CONFIG_SITE:-NONE} export CONFIG_SITE set -x umask 022 cd "/root/rpmbuild/BUILD" cd 'slurm-18.08.9' CFLAGS="${CFLAGS:--O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection}" ; export CFLAGS ; CXXFLAGS="${CXXFLAGS:--O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection}" ; export CXXFLAGS ; FFLAGS="${FFLAGS:--O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -I/usr/lib64/gfortran/modules}" ; export FFLAGS ; FCFLAGS="${FCFLAGS:--O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -I/usr/lib64/gfortran/modules}" ; export FCFLAGS ; LDFLAGS="${LDFLAGS:--Wl,-z,relro -Wl,--as-needed "-Wl,-z,lazy" -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 }" ; export LDFLAGS ; LT_SYS_LIBRARY_PATH="${LT_SYS_LIBRARY_PATH:-/usr/lib64:}" ; export LT_SYS_LIBRARY_PATH ; CC="${CC:-gcc}" ; export CC ; CXX="${CXX:-g++}" ; export CXX; [ ""x != x ] && for file in $(find . -type f -name configure -print); do /usr/bin/sed -r --in-place=.backup 's/^char $\*f$  = /__attribute__ ((used)) char (*f) () = /g' $file; diff -u $file.backup $file && mv $file.backup $file /usr/bin/sed -r --in-place=.backup 's/^char $\*f$ ;/__attribute__ ((used)) char (*f) ();/g' $file; diff -u $file.backup $file && mv $file.backup $file /usr/bin/sed -r --in-place=.backup 's/^char \$2 ;/__attribute__ ((used)) char \$2 ();/g' $file; diff -u $file.backup $file && mv $file.backup $file /usr/bin/sed --in-place=.backup '1{$!N;$!N};$!N;s/int x = 1;\nint y = 0;\nint z;\nint nan;/volatile int x = 1; volatile int y = 0; volatile int z, nan;/;P;D' $file; diff -u $file.backup $file && mv $file.backup $file /usr/bin/sed --in-place=.backup 's#^lt_cv_sys_global_symbol_to_cdecl=.*#lt_cv_sys_global_symbol_to_cdecl="sed -n -e '"'"'s/^T .* \$.*\$$/extern int \\1();/p'"'"' -e '"'"'s/^$symcode* .* \$.*\$$/extern char \\1;/p'"'"'"#' $file; diff -u $file.backup $file && mv $file.backup $file done; [ "1" = 1 ] && for i in $(find $(dirname ./configure) -name config.guess -o -name config.sub) ; do [ -f /usr/lib/rpm/redhat/$(basename $i) ] && /usr/bin/rm -f $i && /usr/bin/cp -fv /usr/lib/rpm/redhat/$(basename $i) $i ; done ; [ "1" = 1 ] && [ x != "x"-Wl,-z,lazy"" ] && for i in $(find . -name ltmain.sh) ; do /usr/bin/sed -i.backup -e 's~compiler_flags=$~compiler_flags=""-Wl,-z,lazy""~' $i done ; ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu \ --program-prefix= \ --disable-dependency-tracking \ \ --prefix=/usr \ --exec-prefix=/usr \ --bindir=/usr/bin \ --sbindir=/usr/sbin \ --sysconfdir=/etc/slurm \ --datadir=/usr/share \ --includedir=/usr/include \ --libdir=/usr/lib64 \ --libexecdir=/usr/libexec \ --localstatedir=/var \ --sharedstatedir=/var/lib \ --mandir=/usr/share/man \ --infodir=/usr/share/info \ \ make -j48 RPM_EC=$? for pid in $(jobs -p); do kill -9 ${pid} || continue; done exit ${RPM_EC} [View Less]

2 2

Re: SLUG'24 presentation slides?
by Kilian Cavalotti 27 Sep '24

27 Sep '24

Awesome, thanks Victoria! Cheers, -- Kilian On Thu, Sep 26, 2024 at 11:17 AM Victoria Hobson <victoria(a)schedmd.com> wrote: > Hi Kilian, > > We're getting these posted now and an email will go out when they are > available! > > Thanks, > > > Victoria Hobson > > *Vice President of Marketing * > > 909.609.8889 > > www.schedmd.com > > > On Mon, Sep 23, 2024 at 10:49 AM Kilian Cavalotti via slurm-users < > slurm-users(a)lists.… [View More]

1 0

A note on updating Slurm from 23.02 to 24.05 & multi-cluster
by Ward Poelmans 26 Sep '24

26 Sep '24

Hi all, We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After updating the slurmdbd, our multi cluster setup was broken until everything was updated to 24.05. We had not anticipated this. SchedMD says that fixing it would be a very complex operation. Hence, this warning to everybody on planning to update: make sure to quickly updating everything once you've updated the slurmdbd daemon. Reference: https://support.schedmd.com/show_bug.cgi?id=20931 Ward

3 3

Jobs pending with reason "priority" but nodes are idle
by Long, Daniel S. 25 Sep '24

25 Sep '24

Hi, On our cluster we have some jobs that are queued even though there are available nodes to run on. The listed reason is "priority" but that doesn't really make sense to me. Slurm isn't picking another job to run on those nodes; it's just not running anything at all. We do have a quite heterogeneous cluster, but as far as I can tell the queued jobs aren't requesting anything that would preclude them from running on the idle nodes. They are array jobs, if that makes a difference. Thanks for … [View More]

3 6

Max TRES per user and node
by Guillaume COCHARD 25 Sep '24

25 Sep '24

Hello, We are looking for a method to limit the TRES used by each user on a per-node basis. For example, we would like to limit the total memory allocation of jobs from a user to 200G per node. There is MaxTRESperNode (https://slurm.schedmd.com/sacctmgr.html#OPT_MaxTRESPerNode), but unfortunately, this is a per-job limit, not per user. Ideally, we would like to apply this limit on partitions and/or QoS. Does anyone know if this is possible and how to achieve it? Thank you,

4 9

SLURM Telegraf Plugin
by Pablo Collado Soto 25 Sep '24

25 Sep '24

Hi all, I recently wrote an SLURM input plugin [0] for Telegraf [1]. I just wanted to let the community know so that you can use it if you'd find that useful. Maybe its existence can also be included in the documentation somewhere? Anyway, thanks a ton for your time, Pablo Collado Soto References: 0: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/slurm 1: https://www.influxdata.com/time-series-platform/telegraf/ + -------------------------------------- + | … [View More]

2 1

Setting up fairshare accounting
by tluchko 24 Sep '24

24 Sep '24

Hello, We have a new cluster and I'm trying to setup fairshare accounting. I'm trying to track CPU, MEM and GPU. It seems that billing for individual jobs is correct, but billing isn't being accumulated (TRESRunMin is always 0). In my slurm.conf, I think the relevant lines are AccountingStorageType=accounting_storage/slurmdbd AccountingStorageTRES=gres/gpu PriorityFlags=MAX_TRES PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.… [View More]125G,GRES/gpu=9.6" PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6" I currently have one recently finished job and one running job. sacct gives $ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50 JobID JobName ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax ------------ ---------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- 154 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1 154.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ 155 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1155.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 billing=9 seems correct to me, since I have 1 GPU allocated, which has the largest score of 9.6. However, sshare doesn't show anything in TRESRunMins sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110 Account User RawShares FairShare RawUsage EffectvUsage TRESRunMins -------------------- ---------- ---------- ---------- ----------- ------------- -------------------------------------------------------------------------------------------------------------- root 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0 abrol_group 2000 0 0.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0 luchko_group 2000 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0 luchko_group tluchko 1 0.333333 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0 Why is TRESRunMin all 0 but RawUsage is not for tluchko? I have checked and slurmdbd is running. Thank you, Tyler Sent with [Proton Mail](https://proton.me/) secure email. [View Less]

1 2

SLUG'24 presentation slides?
by Kilian Cavalotti 23 Sep '24

23 Sep '24

Hi SchedMD, I'm sure they will eventually, but do you know when the slides of the SLUG'24 presentation will be available online at https://slurm.schedmd.com/publications.html, like previous editions'? Thanks! -- Kilian

1 0

what updates NODEADDR
by Jakub Szarlat 21 Sep '24

21 Sep '24

Hi I'm using dynamic nodes with "slurmd -Z" with slurm 23.11.1. Firstly I find that when you do "scontrol show node" it shows the NODEADDR as ip rather than the NODENAME. Because I'm playing around with running this in containers on docker swarm I find this ip can be wrong. I can force it with scontrol update however after a while something updates it to something else again. Does anybody know if this is done by slurmd or slurmctld or something else? How can I stop this from happening? How can … [View More]

2 1

SLURM GRES reservation not working properly on 24.05.1
by Minulakshmi S 20 Sep '24

20 Sep '24

Hello, *Issue 1:* I am using slurm version 24.05.1 , my slurmd has a single node where I connect multiple gres by enabling the overscribe feature. I am able to use the advance reservation of gres only using *gres** name* (tres=gres/gpu:*SYSTEM12*). i.e while in reservation period , if other users submits job with SYSTEM12 , then slurm places this job in queue *user1@host$ srun --gres=gpu:SYSTEM12:1 hostname* *srun: job 333 queued and waiting for resources * but when other users just submit … [View More]

1 0

Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING
by Xaver Stiensmeier 20 Sep '24

20 Sep '24

Dear slurm-user list, I have a cloud node that is powered up and down on demand. Rarely it can happen that slurm's resumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked down, because the instance behind that node is created on demand and therefore after a failure nothing stops the system to start the node again as it is a different instance. I thought this would be enough, but apparently the node is still marked … [View More]

2 3

SlurmDBD errors
by Sajesh Singh 19 Sep '24

19 Sep '24

OS: CentOS 8.5 Slurm: 22.05 Recently upgraded to 22.05. Upgrade was successful, but after a while I started to see the following messages in the slurmdbd.log file: error: We have more time than is possible (9344745+7524000+0)(16868745) > 12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13:00:00 - 2024-09-18T14:00:00 tres 1 (this may happen if oversubscription of resources is allowed without Gang) We do have partitions with overlapping nodes, but do not have "Suspend,Gang" set as the … [View More]

2 2

Change a job from --exclusive to --exclusive=user
by Gerhard Strangar 18 Sep '24

18 Sep '24

Hello, is it possible to change a pending job from --exclusive to --exclusive=user? I tried scontrol update jobid=... oversubscribe=user, but it seems to only accept yes or no. Gerhard

1 0

Feature request: Max Jobs Per Minute
by Ransom, Geoffrey M. 16 Sep '24

16 Sep '24

Hello We have another batch of new users and some more batches of large array jobs with very short runtimes due to errors in the jobs or just by design. Trying to deal with these issues, Setting ArrayTaskThrottle and user education, I had a thought that it would be very nice to have a limit on how many jobs can start in a given minute for users, so if they posted a 200000 array job with 15 second tasks then the scheduler wouldn't launch more than a 100 or 200 per minute and be less likely to … [View More]

2 2

Detailed locations for SLUG'24
by Bjørn-Helge Mevik 10 Sep '24

10 Sep '24

Dear all SLUG attendees! The information about which buildings/addresses the SLUG reception and presentations are to be held is not very visible on the https://slug24.splashthat.com. There is a map there with all locations (https://www.google.com/maps/d/u/0/edit?mid=1bcGaTiW0TNB5noQsjQ3ulctzKuqlGrQ…), but I've gotten questions about it, so: The reception on Wednesday will be held in the top floor of Oslo Science Park (Forskningsparken). Address: Gaustadalléen 21. There will be someone in … [View More]

1 1

Issue with cgroup v2 when IgnoreSystemd=yes
by Honoré Bergeron 09 Sep '24

09 Sep '24

Hi, This is a follow-up from https://groups.google.com/g/slurm-users/c/JI3UkbCtj3U, but as I could not find any progress, I am opening a new thread. When setting IgnoreSystemd=yes in cgroup.conf, I have the error: error: common_file_write_content: unable to open '/sys/fs/cgroup/system.slice/cgroup.subtree_control' for writing: No such file or directory error: Cannot enable cpuset in /sys/fs/cgroup/system.slice/cgroup.subtree_control: No such file or directory error: common_file_write_content:… [View More]

1 0

Configuration for nodes with different TmpFs locations and TmpDisk sizes
by Jake Longo 06 Sep '24

06 Sep '24

Hi, We have a number of machines in our compute cluster that have larger disks available for local data. I would like to add them to the same partition as the rest of the nodes but assign them a larger TmpDisk value which would allow users to request a larger tmp and land on those machines. The main hurdle is that (for reasons beyond my control) the larger local disks are on a special mount point /largertmp whereas the rest of the compute cluster uses the vanilla /tmp. I can't see an obvious … [View More]

2 1

salloc not starting shell despite LaunchParameters=use_interactive_step
by Loris Bennett 06 Sep '24

06 Sep '24

Hi, With $ salloc --version slurm 23.11.10 and $ grep LaunchParameters /etc/slurm/slurm.conf LaunchParameters=use_interactive_step the following $ salloc --partition=interactive --ntasks=1 --time=00:03:00 --mem=1000 --qos=standard salloc: Granted job allocation 18928869 salloc: Nodes c001 are ready for job creates a job $ squeue --me JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 18928779 interacti interact loris … [View More]

5 8

Nodelist syntax and semantics
by Jackson, Gary L. 05 Sep '24

05 Sep '24

Is there a description of the “nodelist” syntax and semantics somewhere other than the source code? By “nodelist” I mean expressions like “name[000,099-100]” and how this one, for example, expands to “name000, name099, name100”. -- Gary

2 1

Configuration for nodes with different TmpFs locations and TmpDisk sizes
by Jake Longo 05 Sep '24

05 Sep '24

Hi all, We have a number of machines in our compute cluster that have larger disks available for local data. I would like to add them to the same partition as the rest of the nodes but assign them a larger TmpDisk value which would allow users to request a larger tmp and land on those machines. The main hurdle is that (for reasons beyond my control) the larger local disks are on a special mount point /largertmp whereas the rest of the compute cluster uses the vanilla /tmp. I can't see an … [View More]

2 1

Make a job pending in the plugin
by benjamin.jin＠bos-semi.com 05 Sep '24

05 Sep '24

Hello all, I am tyring to build a custom plugin to force some jobs to be pended. In the official document, `ESLURM*` errors are only valid for `job_submit_lua`. I tried to send `ESLURM_JOB_PENDING`, but it only rejects the job submission. Does anyone know how to pend a job in job_submit plugin? Thanks.

2 1

Bug? sbatch not respecting MaxMemPerNode setting
by Angel de Vicente 05 Sep '24

05 Sep '24

Hello, we found an issue with Slurm 24.05.1 and the MaxMemPerNode setting. Slurm is installed in a single workstation, and thus, the number of nodes is just 1. The relevant sections in slurm.conf read: ,---- | EnforcePartLimits=ALL | PartitionName=short Nodes=..... State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1 `---- Now, if I submit a job requesting 76 CPUs and each one needing 4000M (for a total of 304000M), Slurm does indeed … [View More]

2 3

2025

2024

slurm-users ----- 2025 ----- June 2025 May 2025 April 2025 March 2025 February 2025 January 2025 ----- 2024 ----- December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024

slurm-users