November 2024 - slurm-users - lists.schedmd.com

slurmctld keeps segfaulting, possibly during or just after backfill
by Marcus Lauer 12 Nov '24

12 Nov '24

We are running into a problem where slurmctld is segfaulting a few times a day. We had this problem with SLURM 23.11.8 and now with 23.11.10 as well, though the problem only appears on one of the several SLURM clusters we have, and all of them use one of those versions of SLURM. I was wondering if anyone has encountered a similar issue and has any thoughts on how to prevent this. Obviously we use "SchedulerType=sched/backfill" but strangely when I switched to sched/builtin for a while there were still slurmctld segfaults. We also set "SchedulerParameters=enable_user_top,bf_max_job_test=2000". I have tried turning those off but it did not help. I have also tried tweaking several other settings to no avail. Most of the cluster runs Rocky Linux 8.10 (including the slurmctld system) though we still have some Scientific Linux 7.9 compute nodes (we compile SLURM separately for those). Here is the crash-time error from journalctl: Oct 02 06:31:20 our.host.name kernel: sched_agent[2048355]: segfault at 8 ip 00007fec755d7ea8 sp 00007fec6bffe7e8 error 4 in libslurmfull.so[7fec7555a000+1f4000] Oct 02 06:31:20 our.host.name kernel: Code: 48 39 c1 7e 19 48 c1 f8 06 ba 01 00 00 00 48 d3 e2 48 f7 da 48 0b 54 c6 10 48 21 54 c7 10 c3 b8 00 00 00 00 eb da 48 8b 4f 08 <48> 39 4e 08 48 0f 4e 4e 08 49 89 c9 48 83 f9 3f 76 4e ba 40 00 00 Oct 02 06:31:20 our.host.name systemd[1]: Started Process Core Dump (PID 2169426/UID 0). Oct 02 06:31:20 our.host.name systemd-coredump[2169427]: Process 2048344 ( slurmctld) of user 991 dumped core. This is followed by a list of each of the dozen or so related threads. The one which is dumping core is first and looks like this: Stack trace of thread 2048355: #0 0x00007fec755d7ea8 bit_and_not (libslurmfull.so) #1 0x000000000044531f _job_alloc (slurmctld) #2 0x000000000044576b _job_alloc_whole_node_internal (slurmctld) #3 0x0000000000446e6d gres_ctld_job_alloc_whole_node (slurmctld) #4 0x00007fec722e29b8 job_res_add_job (select_cons_tres.so) #5 0x00007fec722f7c32 select_p_select_nodeinfo_set (select_cons_tres.so) #6 0x00007fec756e7dc7 select_g_select_nodeinfo_set (libslurmfull.so) #7 0x0000000000496eb3 select_nodes (slurmctld) #8 0x0000000000480826 _schedule (slurmctld) #9 0x00007fec753421ca start_thread (libpthread.so.0) #10 0x00007fec745f78d3 __clone (libc.so.6) I have run slurmctld with "debug5" level logging and it appears that the error occurs right after backfill considers a large number of jobs. Slurmctld could be failing at the end of backfill or when doing something which happens just after backfill runs. Usually this is the last message before the crash: [2024-09-25T18:39:42.076] slurmscriptd: debug: _slurmscriptd_mainloop: finished If anyone has any thoughts or advice on this that would be appreciated. Thank you. -- Marcus Lauer Systems Administrator CETS Group, Research Support

2 2

The hostname resolution case sensitive
by Bill 11 Nov '24

11 Nov '24

Hi, I want to confirm that the hostname resolution is case sensitive in SLURM ? Many thanks, Bill

3 5

Discrepancy in GPU Hour Reporting Between sacct and sreport in SLURM
by Manisha Yadav 11 Nov '24

11 Nov '24

I am trying to find the GPU hour utilization for a user during a specific time period using the sacct and sreport commands. However, I am noticing a significant difference between the outputs of these two commands. Could you explain the reasons for this discrepancy? Are there specific factors or configurations in SLURM that could lead to variations in the reported GPU hours? There is a significant discrepancy in the results produced for GPU hour utilization by the sacct and sreport commands in SLURM Thanks, Manisha ------------------------------------------------------------------------------------------------------------ [ C-DAC is on Social-Media too. Kindly follow us at: Facebook: https://www.facebook.com/CDACINDIA & Twitter: @cdacindia ] This e-mail is for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies and the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email is strictly prohibited and appropriate legal action will be taken. ------------------------------------------------------------------------------------------------------------

1 0

Access denied by pam_slurm_adopt
by John Hearns 10 Nov '24

10 Nov '24

I have cluster which uses Slurm 23.11.6 When I submit a multi-node job and run something like clush -b -w $SLURM_JOB_NODELIST "date" very often the ssh command fails with: Access denied by pam_slurm_adopt: you have no active jobs on this node This will happen maybe on 50% of the nodes There is the same behaviour of I salloc a number of nodes then try to ssh to a node. I have traced this to slurmstepd spawning a long sleep, which I believe allows proctrackd to 'see' if a job is active. On nodes that I can ssh into: root 3211 1 0 Nov08 ? 00:00:00 /usr/sbin/slurmd --systemd root 3227 1 0 Nov08 ? 00:00:00 /usr/sbin/slurmstepd infinity root 24322 1 0 15:40 ? 00:00:00 slurmstepd: [15709.extern] root 24326 24322 0 15:40 ? 00:00:00 \_ sleep 100000000 On nodes where I cannot ssh: root 3226 1 0 Nov08 ? 00:00:00 /usr/sbin/slurmd --systemd root 3258 1 0 Nov08 ? 00:00:00 /usr/sbin/slurmstepd infinity Maybe I am not understanding something here? ps. I ahve tried to run the pam_slurm_adopt module with options to debug, and have not found anything useful John H

2 2

Change primary alloc node
by Bhaskar Chakraborty 04 Nov '24

04 Nov '24

Hi, Is there a way to change/control the primary node (i.e. where the initial task starts) as part of a job's allocation. For eg, if a job requires 6 CPUs & its allocation is distributed over 3 hosts h1, h2 & h3 I find that it always starts the task in 1 particularnode (say h1) irrespective of how many slots were available in the hosts. Can we somehow let slurm have the primary node as h2? Is there any C-API inside select plugin which can do this trick if we were to control it through the configured select plugin? Thanks.-Bhaskar.

2 4

Slurm Job Sched Priority
by Bhaskar Chakraborty 03 Nov '24

03 Nov '24

Hello, Is there any DS in slurmctld which portrays the dynamic relative priority of pending jobs? We are trying to use slurm for developing a scheduling solution and 1 of the problems we face at the outset is how to determinethe order of scheduling for pending jobs. One option is to find scheduling iteration window begin & close pointers & cache the job ids as seen in order & then make them the priority order at that point of time. ( This means for 500 pending jobs, say, if we can find which are the slurmctld calls which mark the beginning & end of a sched iteration then we can use the scheduling orderof jobs as the relative priority order for that period of time, of course it may change depending on fairshare, user initiated priority modification etc. ) A concrete existing data structure showing the dynamic priority itself from slurmctld would be handy. Help appreciated. Thanks! Bhaskar.

3 4

2025

2024

slurm-users November 2024

2025

2024

slurm-users November 2024 ----- 2025 ----- July 2025 June 2025 May 2025 April 2025 March 2025 February 2025 January 2025 ----- 2024 ----- December 2024 November 2024 October 2024 September 2024 August 2024 July 2024 June 2024 May 2024 April 2024 March 2024 February 2024 January 2024

slurm-users November 2024