[slurm-users] Re: slurmctld keeps segfaulting, possibly during or just after backfill

21 Jun 2026

Hi Matthias,

It appears that your problem is related to your job_submit.lua script. 
Can you please confirm?  Testing such Lua scripts is your own 
responsibility, and you should look in slurmctld.log for error messages 
related to job_submit.lua.

This Wiki page discusses job submit plugins and may be useful:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-pl...
In particular, errors in the Lua script may cause problems like what you 
have experienced, see this comment:
...
If slurmctld gets an error when executing /etc/slurm/job_submit.lua, it will use any previously cached script and ignore the file on disk henceforth (see comment 15 in ticket_14472).
WARNING: If slurmctld does not have a cached script (because it was just restarted, for example) it may crash!
Therefore I'm always extra careful when changing my job_submit.lua.
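
One way to be careful, sketched here as an assumption rather than as the procedure the Wiki page describes, is to parse-check an edited script before copying it into place. The helper name check_lua_syntax is hypothetical; it relies on the standalone Lua compiler luac, whose -p option parses a file without executing it. This only catches syntax errors, not runtime errors, and it assumes the installed luac matches the Lua version Slurm was built against.

```shell
#!/bin/bash
# Hypothetical helper: parse-check a Lua file before installing it.
# Returns 0 if the file parses, 1 on a syntax error, 2 if luac is missing.
check_lua_syntax() {
    local script="$1"
    if ! command -v luac >/dev/null 2>&1; then
        echo "luac not found; cannot check ${script}" >&2
        return 2
    fi
    # "luac -p" parses only: nothing is executed and no output file is written.
    if luac -p "${script}"; then
        echo "syntax OK: ${script}"
    else
        echo "syntax error: ${script}" >&2
        return 1
    fi
}

# Example (path is illustrative):
#   check_lua_syntax ./job_submit.lua.new && cp ./job_submit.lua.new /etc/slurm/job_submit.lua
```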

IHTH,
Ole

On 6/19/2026 10:42 AM, Matthias Loose via slurm-users wrote:
...
We just ran into the same problem.
We just upgraded to Slurm 24.11.7 and I woke up today to two crashed 
controllers. They would immediately crash on restart.
Troubleshooting revealed multi-partition jobs to be the problem.
-- Temporary safety guard for Slurm 24.11.7 crash investigation.
-- Reject multi-partition submissions

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- A comma in the partition string means more than one partition was requested.
    if job_desc.partition ~= nil and string.find(job_desc.partition, ",") then
        slurm.log_user("Multi-partition jobs are temporarily disabled. Please submit to exactly one partition.")
        slurm.log_info("Rejected multi-partition job from uid=%s partition=%s",
                       tostring(submit_uid), tostring(job_desc.partition))
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    -- Apply the same check when an existing job is modified.
    if job_desc.partition ~= nil and string.find(job_desc.partition, ",") then
        slurm.log_user("Changing jobs to multiple partitions is temporarily disabled. Please choose exactly one partition.")
        slurm.log_info("Rejected multi-partition job modification from uid=%s partition=%s",
                       tostring(modify_uid), tostring(job_desc.partition))
        return slurm.ERROR
    end
    return slurm.SUCCESS
end
This has now stabilized our cluster and luckily we can operate without 
multi-partition jobs, but this was a really nasty surprise.
What did you end up doing about this problem? Is this a Slurm 24.11.7 
problem, and do I just need to upgrade again?
Kind regards, Matze
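
Since a job_submit script is only consulted when a job is submitted or modified, a guard like the one above presumably does not affect multi-partition jobs that were already queued. A minimal sketch for spotting those follows. The function name list_multi_partition_jobs is hypothetical, and it assumes the input is "jobid partition" pairs such as squeue prints with -h -o "%i %P", with pending jobs showing their full comma-separated partition list.

```shell
#!/bin/bash
# Hypothetical helper: print the IDs of jobs whose partition field
# contains a comma, i.e. jobs that requested more than one partition.
# Input on stdin: one "<jobid> <partition-list>" pair per line.
list_multi_partition_jobs() {
    awk 'index($2, ",") > 0 { print $1 }'
}

# Example (assumes squeue is on the PATH of a Slurm client host):
#   squeue -h -t PENDING -o "%i %P" | list_multi_partition_jobs
```

What to do with the jobs it lists (hold, modify, or cancel them) is a site decision that this sketch deliberately leaves open.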
On 01/11/2024 17:16, Marcus Lauer via slurm-users wrote:
...
        Following up on this, it looks like slurmctld crashes reliably 
just after a job which was submitted to multiple partitions completes. 
Has anyone encountered this sort of thing before?
        Here is a simplified version of our cluster's partitions:
Nodes        Partition  Priority
node[01-10]  facultyA   8
node[11-20]  facultyB   8
node[21-30]  facultyC   8
node[01-30]  standby    1
node31       debug      4
        The "facultyX" partitions can only be used by people in a 
certain SLURM account, which is tied to their faculty sponsor (PI). 
Standby can be used by members of any group, as can debug.
        I can reliably get slurmctld to crash about 30 seconds after a 
job like this finishes:
--------- job script ---------
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=facultyA,standby
hostname
------------------------------
Slurmctld will also crash if a person submits to 
"--partition=facultyA,facultyB". Some researchers do work with more than 
one faculty member so this is definitely a legitimate scenario. 
Basically any combination of partitions seems to cause the problem.
         We have a Lua job_submit script which removes the "debug" 
partition from jobs which are submitted to more than one partition. 
Even when jobs are submitted to "--partition=facultyA,debug" that 
doesn't fix the problem. Even though the job will only run in the 
facultyA partition and "scontrol show job JOBID" will only show 
facultyA as a partition to which the job has been submitted, slurmctld 
will still crash 30 seconds after the job finishes. Incidentally, turning 
off that script by commenting out JobSubmitPlugins and running "scontrol 
reconfigure" does not help.
         Perhaps I have missed something which needs to be changed or 
enabled when people are allowed to submit to multiple partitions? Yet 
this used to work in the past. The cluster in question is several 
years old but this problem has only appeared in the last two months. 
The really annoying thing is that we have other clusters with multiple 
partitions and submitting jobs to a list of partitions on those 
clusters works as expected. The problem seems to be unique to one cluster.
On Wed, Oct 2, 2024 at 1:24 PM Marcus Lauer <melauer@seas.upenn.edu> wrote:
        We are running into a problem where slurmctld is
    segfaulting a few times a day. We had this problem with SLURM
    23.11.8 and now with 23.11.10 as well, though the problem only
    appears on one of the several SLURM clusters we have, and all of
    them use one of those versions of SLURM. I was wondering if anyone
    has encountered a similar issue and has any thoughts on how to
    prevent this.
        Obviously we use "SchedulerType=sched/backfill" but
    strangely when I switched to sched/builtin for a while there were
    still slurmctld segfaults. We also set
    "SchedulerParameters=enable_user_top,bf_max_job_test=2000". I have
    tried turning those off but it did not help. I have also tried
    tweaking several other settings to no avail. Most of the cluster
    runs Rocky Linux 8.10 (including the slurmctld system) though we
    still have some Scientific Linux 7.9 compute nodes (we compile
    SLURM separately for those).
        Here is the crash-time error from journalctl:
    Oct 02 06:31:20 our.host.name kernel: sched_agent[2048355]: segfault at 8 ip 00007fec755d7ea8 sp 00007fec6bffe7e8 error 4 in libslurmfull.so[7fec7555a000+1f4000]
    Oct 02 06:31:20 our.host.name kernel: Code: 48 39 c1 7e 19 48 c1 f8 06 ba 01 00 00 00 48 d3 e2 48 f7 da 48 0b 54 c6 10 48 21 54 c7 10 c3 b8 00 00 00 00 eb da 48 8b 4f 08 <48> 39 4e 08 48 0f 4e 4e 08 49 89 c9 48 83 f9 3f 76 4e ba 40 00 00
    Oct 02 06:31:20 our.host.name systemd[1]: Started Process Core Dump (PID 2169426/UID 0).
    Oct 02 06:31:20 our.host.name systemd-coredump[2169427]: Process 2048344 (slurmctld) of user 991 dumped core.
This is followed by a list of each of the dozen or so related
    threads. The one which is dumping core is first and looks like this:
    Stack trace of thread 2048355:
    #0  0x00007fec755d7ea8 bit_and_not (libslurmfull.so)
    #1  0x000000000044531f _job_alloc (slurmctld)
    #2  0x000000000044576b _job_alloc_whole_node_internal (slurmctld)
    #3  0x0000000000446e6d gres_ctld_job_alloc_whole_node (slurmctld)
    #4  0x00007fec722e29b8 job_res_add_job (select_cons_tres.so)
    #5  0x00007fec722f7c32 select_p_select_nodeinfo_set (select_cons_tres.so)
    #6  0x00007fec756e7dc7 select_g_select_nodeinfo_set (libslurmfull.so)
    #7  0x0000000000496eb3 select_nodes (slurmctld)
    #8  0x0000000000480826 _schedule (slurmctld)
    #9  0x00007fec753421ca start_thread (libpthread.so.0)
    #10 0x00007fec745f78d3 __clone (libc.so.6)
       I have run slurmctld with "debug5" level logging and it
    appears that the error occurs right after backfill considers a
    large number of jobs. Slurmctld could be failing at the end of
    backfill or when doing something which happens just after backfill
    runs. Usually this is the last message before the crash:
    [2024-09-25T18:39:42.076] slurmscriptd: debug: _slurmscriptd_mainloop: finished
       If anyone has any thoughts or advice on this that would be
    appreciated. Thank you.
-- 
    Marcus Lauer
    Systems Administrator
    CETS Group, Research Support

Ole Holm Nielsen