Following up on this, it looks like slurmctld crashes reliably just after a job which was submitted to multiple partitions completes. Has anyone encountered this sort of thing before?
Here is a simplified version of our cluster's partitions:
Nodes        Partition  Priority
node[01-10]  facultyA   8
node[11-20]  facultyB   8
node[21-30]  facultyC   8
node[01-30]  standby    1
node31       debug      4
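In slurm.conf terms the layout is roughly the following. This is a simplified sketch rather than a copy of our config: the AllowAccounts values are placeholder names, and I am writing the priority column as PriorityTier.

--------- slurm.conf (partition sketch) ---------
PartitionName=facultyA Nodes=node[01-10] PriorityTier=8 AllowAccounts=facultyA_acct State=UP
PartitionName=facultyB Nodes=node[11-20] PriorityTier=8 AllowAccounts=facultyB_acct State=UP
PartitionName=facultyC Nodes=node[21-30] PriorityTier=8 AllowAccounts=facultyC_acct State=UP
PartitionName=standby  Nodes=node[01-30] PriorityTier=1 State=UP
PartitionName=debug    Nodes=node31      PriorityTier=4 State=UP
-------------------------------------------------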
The "facultyX" partitions can only be used by people in a certain SLURM account which is tied to who their faculty sponsor (PI) is. Standby can be used by member of any group, as can debug.
I can reliably get slurmctld to crash about 30 seconds after a job like this finishes:
--------- job script ---------
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=facultyA,standby
hostname
------------------------------
Slurmctld will also crash if a person submits to "--partition=facultyA,facultyB". Some researchers do work with more than one faculty member, so this is definitely a legitimate scenario. Basically any combination of partitions seems to cause the problem.
We have a Lua job_submit script which removes the "debug" partition from jobs submitted to more than one partition, but that does not prevent the crash. Even when a job is submitted with "--partition=facultyA,debug", so that it can only run in the facultyA partition and "scontrol show job JOBID" shows facultyA as the only partition, slurmctld still crashes about 30 seconds after the job finishes. Incidentally, disabling the script by commenting out JobSubmitPlugins and running "scontrol reconfigure" does not help either.
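For context, our job_submit.lua is longer than this, but the relevant logic is roughly the sketch below. It is a reconstruction rather than the production script; it assumes the standard job_submit/lua entry points and only touches job_desc.partition.

--------- job_submit.lua (sketch) ---------
-- Drop "debug" from any submission that lists more than one partition.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition ~= nil and string.find(job_desc.partition, ",") then
        local parts = {}
        for p in string.gmatch(job_desc.partition, "[^,]+") do
            if p ~= "debug" then
                table.insert(parts, p)
            end
        end
        if #parts > 0 then
            -- Rewrite the partition list without "debug".
            job_desc.partition = table.concat(parts, ",")
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
-------------------------------------------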
Perhaps I have missed something which needs to be changed or enabled when users are allowed to submit to multiple partitions? Yet this used to work: the cluster in question is several years old, and the problem has only appeared in the last two months. The really annoying thing is that our other clusters also have multiple partitions, and submitting jobs to a list of partitions on those clusters works as expected. The problem seems to be unique to this one cluster.
On Wed, Oct 2, 2024 at 1:24 PM Marcus Lauer <melauer@seas.upenn.edu> wrote:
We are running into a problem where slurmctld is segfaulting a few times a day. We had this problem with SLURM 23.11.8 and now with 23.11.10 as well. The problem only appears on one of our several SLURM clusters, even though all of them run one of those two versions. I was wondering if anyone has encountered a similar issue and has any thoughts on how to prevent this.
We use "SchedulerType=sched/backfill", but strangely, when I switched to sched/builtin for a while the slurmctld segfaults continued. We also set "SchedulerParameters=enable_user_top,bf_max_job_test=2000"; I have tried turning those options off, but it did not help, and tweaking several other settings has made no difference either. Most of the cluster runs Rocky Linux 8.10 (including the slurmctld host), though we still have some Scientific Linux 7.9 compute nodes (we compile SLURM separately for those).
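For reference, the scheduler-related lines in slurm.conf look like this:

--------- slurm.conf (scheduler) ---------
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,bf_max_job_test=2000
------------------------------------------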
Here is the crash-time error from journalctl:
Oct 02 06:31:20 our.host.name kernel: sched_agent[2048355]: segfault at 8 ip 00007fec755d7ea8 sp 00007fec6bffe7e8 error 4 in libslurmfull.so[7fec7555a000+1f4000]
Oct 02 06:31:20 our.host.name kernel: Code: 48 39 c1 7e 19 48 c1 f8 06 ba 01 00 00 00 48 d3 e2 48 f7 da 48 0b 54 c6 10 48 21 54 c7 10 c3 b8 00 00 00 00 eb da 48 8b 4f 08 <48> 39 4e 08 48 0f 4e 4e 08 49 89 c9 48 83 f9 3f 76 4e ba 40 00 00
Oct 02 06:31:20 our.host.name systemd[1]: Started Process Core Dump (PID 2169426/UID 0).
Oct 02 06:31:20 our.host.name systemd-coredump[2169427]: Process 2048344 (slurmctld) of user 991 dumped core.
This is followed by a list of each of the dozen or so related threads. The one which is dumping core is first and looks like this:
Stack trace of thread 2048355:
#0  0x00007fec755d7ea8 bit_and_not (libslurmfull.so)
#1  0x000000000044531f _job_alloc (slurmctld)
#2  0x000000000044576b _job_alloc_whole_node_internal (slurmctld)
#3  0x0000000000446e6d gres_ctld_job_alloc_whole_node (slurmctld)
#4  0x00007fec722e29b8 job_res_add_job (select_cons_tres.so)
#5  0x00007fec722f7c32 select_p_select_nodeinfo_set (select_cons_tres.so)
#6  0x00007fec756e7dc7 select_g_select_nodeinfo_set (libslurmfull.so)
#7  0x0000000000496eb3 select_nodes (slurmctld)
#8  0x0000000000480826 _schedule (slurmctld)
#9  0x00007fec753421ca start_thread (libpthread.so.0)
#10 0x00007fec745f78d3 __clone (libc.so.6)
I have run slurmctld with "debug5" level logging, and the error appears to occur right after backfill considers a large number of jobs. Slurmctld could be failing at the end of the backfill run, or in something which happens just after backfill runs. Usually this is the last message before the crash:
[2024-09-25T18:39:42.076] slurmscriptd: debug: _slurmscriptd_mainloop: finished
If anyone has any thoughts or advice on this, it would be appreciated. Thank you.
--
Marcus Lauer
Systems Administrator
CETS Group, Research Support