We are running into a problem where slurmctld segfaults a few times a day. We saw this with SLURM 23.11.8 and now with 23.11.10 as well, though the problem appears on only one of our several SLURM clusters, all of which run one of those two versions. I was wondering if anyone has encountered a similar issue and has any thoughts on how to prevent it.
We use "SchedulerType=sched/backfill", but oddly the slurmctld segfaults continued even when I switched to sched/builtin for a while. We also set "SchedulerParameters=enable_user_top,bf_max_job_test=2000"; turning those off did not help, and neither did tweaking several other settings. Most of the cluster runs Rocky Linux 8.10 (including the slurmctld host), though we still have some Scientific Linux 7.9 compute nodes (we compile SLURM separately for those).
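For reference, the scheduling-related lines in our slurm.conf are essentially the following (a trimmed excerpt, not the whole file; we run the cons_tres select plugin, as the backtrace below also shows):

--------- slurm.conf (excerpt) ---------
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,bf_max_job_test=2000
SelectType=select/cons_tres
----------------------------------------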
Here is the crash-time error from journalctl:
Oct 02 06:31:20 our.host.name kernel: sched_agent[2048355]: segfault at 8 ip 00007fec755d7ea8 sp 00007fec6bffe7e8 error 4 in libslurmfull.so[7fec7555a000+1f4000]
Oct 02 06:31:20 our.host.name kernel: Code: 48 39 c1 7e 19 48 c1 f8 06 ba 01 00 00 00 48 d3 e2 48 f7 da 48 0b 54 c6 10 48 21 54 c7 10 c3 b8 00 00 00 00 eb da 48 8b 4f 08 <48> 39 4e 08 48 0f 4e 4e 08 49 89 c9 48 83 f9 3f 76 4e ba 40 00 00
Oct 02 06:31:20 our.host.name systemd[1]: Started Process Core Dump (PID 2169426/UID 0).
Oct 02 06:31:20 our.host.name systemd-coredump[2169427]: Process 2048344 (slurmctld) of user 991 dumped core.
This is followed by a stack trace for each of the dozen or so related threads. The thread which is dumping core comes first and looks like this:
Stack trace of thread 2048355:
#0  0x00007fec755d7ea8 bit_and_not (libslurmfull.so)
#1  0x000000000044531f _job_alloc (slurmctld)
#2  0x000000000044576b _job_alloc_whole_node_internal (slurmctld)
#3  0x0000000000446e6d gres_ctld_job_alloc_whole_node (slurmctld)
#4  0x00007fec722e29b8 job_res_add_job (select_cons_tres.so)
#5  0x00007fec722f7c32 select_p_select_nodeinfo_set (select_cons_tres.so)
#6  0x00007fec756e7dc7 select_g_select_nodeinfo_set (libslurmfull.so)
#7  0x0000000000496eb3 select_nodes (slurmctld)
#8  0x0000000000480826 _schedule (slurmctld)
#9  0x00007fec753421ca start_thread (libpthread.so.0)
#10 0x00007fec745f78d3 __clone (libc.so.6)
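In case it is useful to anyone digging into their own dumps, these per-thread traces come straight out of systemd-coredump; something like the following pulls them back up with symbols (assuming coredumpctl and the SLURM debug symbols are available on the controller):

--------- pulling the backtrace ---------
# show the most recent slurmctld dump recorded by systemd-coredump
coredumpctl info slurmctld

# open it in gdb and print every thread's stack
coredumpctl gdb slurmctld
(gdb) thread apply all bt
-----------------------------------------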
I have run slurmctld with "debug5" level logging and it appears that the error occurs right after backfill considers a large number of jobs. Slurmctld could be failing at the end of backfill or when doing something which happens just after backfill runs. Usually this is the last message before the crash:
[2024-09-25T18:39:42.076] slurmscriptd: debug: _slurmscriptd_mainloop: finished
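(In case anyone wants to reproduce this level of logging, "scontrol setdebug debug5" raises it on a running slurmctld without a restart, and "scontrol setdebug info" puts it back.)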
If anyone has any thoughts or advice on this that would be appreciated. Thank you.
Following up on this, it looks like slurmctld crashes reliably just after a job which was submitted to multiple partitions completes. Has anyone encountered this sort of thing before?
Here is a simplified version of our cluster's partitions:
Nodes        Partition  Priority
node[01-10]  facultyA   8
node[11-20]  facultyB   8
node[21-30]  facultyC   8
node[01-30]  standby    1
node31       debug      4
The "facultyX" partitions can only be used by people in a certain SLURM account which is tied to who their faculty sponsor (PI) is. Standby can be used by member of any group, as can debug.
I can reliably get slurmctld to crash about 30 seconds after a job like this finishes:
--------- job script ---------
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=facultyA,standby
hostname
------------------------------
Slurmctld will also crash if a person submits to "--partition=facultyA,facultyB". Some researchers work with more than one faculty member, so this is definitely a legitimate scenario. Basically any combination of partitions seems to cause the problem.
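(The equivalent one-liner, for anyone who wants to try to reproduce it, would be something like: sbatch --nodes=1 --partition=facultyA,facultyB --wrap="hostname".)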
We have a Lua job_submit script which removes the "debug" partition from jobs that were submitted to more than one partition. Even with that in place, jobs submitted to "--partition=facultyA,debug" still trigger the problem: the job will only ever run in the facultyA partition, and "scontrol show job JOBID" lists only facultyA as the partition the job was submitted to, yet slurmctld still crashes about 30 seconds after the job finishes. Incidentally, disabling the script by commenting out JobSubmitPlugins and running "scontrol reconfigure" does not help either.
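For context, the filter is only a few lines; a simplified sketch of what it does (not our exact script) looks like this:

--------- job_submit.lua (simplified sketch) ---------
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- If the job asked for more than one partition, drop "debug" from the list.
    if job_desc.partition ~= nil and string.find(job_desc.partition, ",", 1, true) then
        local keep = {}
        for part in string.gmatch(job_desc.partition, "[^,]+") do
            if part ~= "debug" then
                table.insert(keep, part)
            end
        end
        job_desc.partition = table.concat(keep, ",")
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
------------------------------------------------------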
Perhaps I have missed something which needs to be changed or enabled when people are allowed to submit to multiple partitions? Yet this used to work: the cluster in question is several years old, and the problem has only appeared in the last two months. The really annoying thing is that our other clusters also have multiple partitions, and submitting jobs to a list of partitions there works as expected. The problem seems to be unique to this one cluster.
On Wed, Oct 2, 2024 at 1:24 PM Marcus Lauer melauer@seas.upenn.edu wrote:
> We are running into a problem where slurmctld is segfaulting a few
> times a day. [...]
We have just in the last few days developed a similar condition here under Slurm 23.11.6. It is reliably reproducible with this sequence: we are adding new nodes and a new partition, and if I add partition NewPart with nodes NewGpuNode[1-4] and accounts OwnerAccount,AdminAccount, then very soon after adding the second account we get into a state where sched_agent segfaults every few minutes.
I have tried varying the name of the partition, but that does not help. The new nodes can be defined and running Slurm without the problem occurring, and the accounts are already defined and used in other contexts in Slurm without issue.
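Spelled out, the addition amounts to something like the following slurm.conf lines (the Gres and node details here are illustrative, not our exact definitions), plus the usual restart/reconfigure:

--------- slurm.conf additions (sketch) ---------
NodeName=NewGpuNode[1-4] Gres=gpu:4 State=UNKNOWN
PartitionName=NewPart Nodes=NewGpuNode[1-4] AllowAccounts=OwnerAccount,AdminAccount State=UP
-------------------------------------------------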
[Tue Nov 12 05:35:52 2024] sched_agent[3161256]: segfault at 52 ip 000014fd6b1ad0c8 sp 000014fd5bce9fa8 error 4 in libslurmfull.so[14fd6b126000+20b000]
To alleviate the condition I back out the partition, then the nodes. In this last sequence the core dumps started immediately after adding the second account. Once the condition starts, removing the partition AND some of the nodes seems to be the only thing that clears it. The behavior has the feeling of hitting some capacity threshold of N elements.
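(For the record, "backing out" here just means removing the PartitionName= and NodeName= lines from slurm.conf again and restarting slurmctld; "scontrol delete PartitionName=NewPart" will also drop the partition from the running controller.)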
Any advice or assistance would be appreciated -- we are expecting more than 10 new GPU nodes in the next week or so, and at the moment I am unable to deliver these nodes to the customer for their use.
Nov 12 06:14:31 OurScheduler kernel: sched_agent[3202994]: segfault at 52 ip 00007fe04d2740c8 sp 00007fe041ea0fa8 error 4 in libslurmfull.so[7fe04d1ed000+20b000]
Nov 12 06:14:31 OurScheduler kernel: Code: fa 55 48 89 e5 48 8b 07 48 c7 00 00 00 00 00 e8 3e 9d fc ff 5d c3 f3 0f 1e fa 48 8b 47 08 c3 f3 0f 1e fa 48 89 f0 48 c1 f8 06 <48> 8b 44 c7 10 89 f1 48 d3 f8 83 e0 01 c3 f3 0f 1e fa 48 89 f2 48
Nov 12 06:14:31 OurScheduler systemd[1]: Started Process Core Dump (PID 3203565/UID 0).
Nov 12 06:14:32 OurScheduler systemd-coredump[3203566]: Process 3202983 (slurmctld) of user 47 dumped core.

Stack trace of thread 3202994:
#0  0x00007fe04d2740c8 bit_test (libslurmfull.so)
#1  0x00007fe04b5eb804 _can_use_gres_exc_topo (select_cons_tres.so)
#2  0x00007fe04b60049a _can_job_run_on_node (select_cons_tres.so)
#3  0x00007fe04b602bf9 _job_test (select_cons_tres.so)
#4  0x00007fe04b6050fc _run_now (select_cons_tres.so)
#5  0x00007fe04b60734d select_p_job_test (select_cons_tres.so)
#6  0x00007fe04d3916fd select_g_job_test (libslurmfull.so)
#7  0x00000000004957b5 _pick_best_nodes (slurmctld)
#8  0x0000000000497c01 _get_req_features (slurmctld)
#9  0x000000000049b11c select_nodes (slurmctld)
#10 0x000000000048404f _schedule (slurmctld)
#11 0x00007fe04cfd51ca start_thread (libpthread.so.0)
#12 0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202987:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000000000425524 _agent_nodes_update (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202983:
#0  0x00007fe04cfda485 __pthread_rwlock_wrlock (libpthread.so.0)
#1  0x00000000004885b4 lock_slurmctld (slurmctld)
#2  0x000000000042fd84 _slurmctld_background (slurmctld)
#3  0x00000000004340ff main (slurmctld)
#4  0x00007fe04c28b7e5 __libc_start_main (libc.so.6)
#5  0x000000000041c29e _start (slurmctld)

Stack trace of thread 3202990:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fe0485648e0 _agent (accounting_storage_slurmdbd.so)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203009:
#0  0x00007fe04cfdb48c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000000000042d028 _purge_files_thread (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202986:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000000000428be8 _agent_init (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202989:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fe04855c353 _set_db_inx_thread (accounting_storage_slurmdbd.so)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202988:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000000000042537d _agent_srun_update (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202985:
#0  0x00007fe04c383ac1 __poll (libc.so.6)
#1  0x00007fe04d28cbb5 poll (libslurmfull.so)
#2  0x00000000004d1c07 _slurmctld_listener_thread (slurmctld)
#3  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#4  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203004:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000000000442408 _test_dep_job_thread (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202996:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fe0422bffa2 _decay_thread (priority_multifactor.so)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203011:
#0  0x00007fe04cfdb48c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000000000042cf19 _acct_update_thread (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203008:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00000000004d7117 slurmctld_state_save (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203003:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000000000043926c _remote_dep_recv_thread (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203007:
#0  0x00007fe04c2a030c __sigtimedwait (libc.so.6)
#1  0x00007fe04cfdf8ac sigwait (libpthread.so.0)
#2  0x0000000000431e9f _slurmctld_signal_hand (slurmctld)
#3  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#4  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203001:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000000000043b22a _agent_thread (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3202993:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007fe041ea8248 _my_sleep (sched_backfill.so)
#2  0x00007fe041eb154e backfill_agent (sched_backfill.so)
#3  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#4  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203005:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000000000435fd0 _origin_dep_update_thread (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203002:
#0  0x00007fe04cfdb7da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000000000043fb1d _fed_job_update_thread (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Stack trace of thread 3203006:
#0  0x00007fe04c383ac1 __poll (libc.so.6)
#1  0x000000000042f1fd poll (slurmctld)
#2  0x00007fe04cfd51ca start_thread (libpthread.so.0)
#3  0x00007fe04c28a8d3 __clone (libc.so.6)

Nov 12 06:14:32 OurScheduler systemd[1]: slurmctld.service: Main process exited, code=killed, status=11/SEGV
Nov 12 06:14:32 OurScheduler systemd[1]: slurmctld.service: Failed with result 'signal'.
Nov 12 06:14:32 OurScheduler systemd[1]: systemd-coredump@42-3203565-0.service: Succeeded.