[slurm-users] Slurmctld 18.08.1 and 18.08.3 segfault
Bill Broadley
bill at cse.ucdavis.edu
Tue Nov 13 16:32:48 MST 2018
After being up since roughly the second week of October, our Slurm controller started segfaulting yesterday. It was compiled and run on Ubuntu 16.04.1.
Nov 12 14:31:48 nas-11-1 kernel: [2838306.311552] srvcn[9111]: segfault at 58 ip 00000000004b51fa sp 00007fbe270efb70 error 4 in slurmctld[400000+eb000]
Nov 12 14:32:48 nas-11-1 kernel: [2838366.586784] srvcn[11217]: segfault at 58 ip 00000000004b51fa sp 00007f8f7cc41b70 error 4 in slurmctld[400000+eb000]
Nov 12 14:33:48 nas-11-1 kernel: [2838426.761784] srvcn[13231]: segfault at 58 ip 00000000004b51fa sp 00007fb78a7e6b70 error 4 in slurmctld[400000+eb000]
Nov 12 14:34:48 nas-11-1 kernel: [2838486.976987] srvcn[15228]: segfault at 58 ip 00000000004b51fa sp 00007ffb8e9e8b70 error 4 in slurmctld[400000+eb000]
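Two things stand out in those entries: the instruction pointer is identical every time (00000000004b51fa), so it is the same faulting instruction, and the fault address is 58, which smells like a read through a NULL struct pointer (reading a member at offset 0x58 from a NULL base faults at address 0x58). A standalone illustration of that pattern, using a made-up struct rather than Slurm's actual layout:

    /* Illustration only: a hypothetical struct, not Slurm's layout.
     * Reading a member at offset 0x58 through a NULL base pointer
     * faults at address 0x58, matching the kernel log above. */
    #include <stddef.h>
    #include <stdio.h>

    struct example {
        char  pad[0x58];   /* filler so the next member lands at 0x58 */
        void *member;      /* offsetof(struct example, member) == 0x58 */
    };

    int main(void)
    {
        struct example *base = NULL;
        printf("offset = 0x%zx\n", offsetof(struct example, member));
        return base->member != NULL;   /* kernel logs "segfault at 58" */
    }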
I compiled 18.08.3 on Ubuntu 18.04 and it hits the same problem.
Now slurmctld segfaults shortly after starting:
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
Segmentation fault (core dumped)
If I look at the core dump:
# gdb ./slurmctld
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Reading symbols from ./slurmctld...done.
(gdb) core ./core
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./slurmctld -D -v -v -v'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092
2092 i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
[Current thread is 1 (Thread 0x7f06a93d3700 (LWP 25825))]
(gdb) bt
#0 _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092
#1 post_job_step (step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:4720
#2 0x000055578571d1d8 in _post_job_step (step_ptr=0x555787af0f70) at step_mgr.c:270
#3 _internal_step_complete (job_ptr=job_ptr@entry=0x555787af04a0, step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:311
#4 0x000055578571d35c in job_step_complete (job_id=7035546, step_id=4294967295, uid=uid@entry=0, requeue=requeue@entry=false, job_return_code=<optimized out>) at step_mgr.c:878
#5 0x00005557856f0522 in _slurm_rpc_step_complete (msg=0x7f06a93d2e20, running_composite=<optimized out>) at proc_req.c:3863
#6 0x00005557856fde0b in slurmctld_req (msg=0x7f06a93d2e20, arg=0x7f067c001370) at proc_req.c:512
#7 0x00005557856897e2 in _service_connection (arg=<optimized out>) at controller.c:1274
#8 0x00007f06be41a6db in start_thread (arg=0x7f06a93d3700) at pthread_create.c:463
#9 0x00007f06be14388f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
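Given that fault address and the dereference at step_mgr.c:2092, my guess is that job_resrcs_ptr (i.e. step_ptr->job_ptr->job_resrcs) is NULL for this job. Printing it from the core should confirm or kill that theory (commands from memory, and some locals may show as <optimized out> with this build):

    (gdb) frame 0
    (gdb) print job_resrcs_ptr
    (gdb) print step_ptr->job_ptr->job_resrcs

If it does come back NULL, a guard along these lines at the top of _step_dealloc_lps would at least turn the crash into a logged error. This is a sketch only, not a vetted patch: it assumes job_ptr->job_resrcs really can be NULL here, and it papers over whatever cleared the record rather than fixing the cause.

    /* Sketch, not a vetted patch: assumes job_ptr->job_resrcs can be
     * NULL here, which is what the fault address suggests. */
    job_resources_t *job_resrcs_ptr = job_ptr->job_resrcs;

    if (!job_resrcs_ptr || !job_resrcs_ptr->node_bitmap) {
        error("%s: JobId=%u has no job_resrcs/node_bitmap, "
              "skipping step dealloc", __func__, job_ptr->job_id);
        return;
    }
    i_first = bit_ffs(job_resrcs_ptr->node_bitmap);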
Has anyone seen anything like this before?