[slurm-users] slurmctld crashing on release of job submitted with a hold
Avalon Johnson
avalonjo at usc.edu
Fri Jun 8 17:27:30 MDT 2018
Has anyone else encountered this problem, where slurmctld crashes on the release
of a job's hold? I am wondering if there is something unique to our configuration that is leading to this crash.
Here is what I have found so far:
There appears to be a bug in slurmctld: placing a hold on a job and then releasing the hold causes slurmctld
to core dump with an arithmetic exception.
Version of slurm:
hpc-sched2# rpm -q --info slurm
Name : slurm
Version : 17.11.6
Release : 1usc.el7.centos
Architecture: x86_64
To produce this error:
$ sbatch --hold printenv.BATCH
Submitted batch job 934654
The specs for the job show:
$ scontrol show job 934654
JobId=934654 JobName=printenv.BATCH
UserId=avalonjo(...) GroupId=... MCS_label=N/A
Priority=0 Nice=0 Account=lc_hpcc QOS=lc_hpcc_maxcpumins
JobState=PENDING Reason=JobHeldUser Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=01:20:00 TimeMin=N/A
SubmitTime=2018-06-07T17:21:21 EligibleTime=Unknown
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-06-07T17:21:21
Partition=main AllocNode:Sid=...:61228
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=1G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=...../printenv.BATCH
WorkDir=..../Infiniband
StdErr=..../Infiniband/./OutputDir/%x.934654
StdIn=/dev/null
StdOut=.../Infiniband/./OutputDir/%x.934654
Power=
Now release the job:
$ scontrol release job 931432
Invalid job id specified for job job
slurm_suspend error: Invalid job id specified
Unexpected message received for job 931432
slurm_suspend error: Unexpected message received
At which point slurmctld core dumps:
Using gdb to analyze the core file:
# gdb /usr/sbin/slurmctld ./core.28720
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 8, Arithmetic exception.
#0 0x00000000004173ff in _validate_time_limit
(time_limit_in=time_limit_in@entry=0x7f3338196568,
part_max_time=part_max_time@entry=60, tres_req_cnt=0,
max_limit=2000000000,
out_max_limit=out_max_limit@entry=0x7f33380864e0,
limit_set_time=limit_set_time@entry=0x7f3384ccb472,
strict_checking=strict_checking@entry=true, is64=is64@entry=true) at
acct_policy.c:1120
#1 0x00000000004174b9 in _validate_tres_time_limits
(tres_pos=tres_pos@entry=0x7f3384ccad14,
time_limit_in=time_limit_in@entry=0x7f3338196568, part_max_time=60,
job_tres_array=0x7f3384ccb3a8,
max_tres_array=0xeeead0, out_max_tres_array=0x7f33380864e0,
limit_set_time=limit_set_time@entry=0x7f3384ccb472,
strict_checking=strict_checking@entry=true) at acct_policy.c:1174
#2 0x0000000000418635 in _qos_policy_validate
(job_desc=job_desc@entry=0x7f3338196370,
assoc_ptr=assoc_ptr@entry=0x1287450, part_ptr=part_ptr@entry=0x1965410,
qos_ptr=qos_ptr@entry=0xf06f80,
qos_out_ptr=qos_out_ptr@entry=0x7f3384ccadf0, reason=reason@entry=0x0,
acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470,
update_call=update_call@entry=true,
user_name=user_name@entry=0x1287610 "avalonjo", job_cnt=job_cnt@entry=1,
strict_checking=strict_checking@entry=true)
at acct_policy.c:1522
#3 0x0000000000418da9 in _acct_policy_validate
(job_desc=job_desc@entry=0x7f3338196370,
part_ptr=part_ptr@entry=0x1965410, assoc_in=assoc_in@entry=0x1287450,
qos_ptr_1=0xf065c0, qos_ptr_2=0xf06f80,
reason=reason@entry=0x0,
acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470,
update_call=update_call@entry=true) at acct_policy.c:2660
#4 0x000000000041b3fa in acct_policy_validate
(job_desc=job_desc@entry=0x7f3338196370, part_ptr=0x1965410,
assoc_in=0x1287450, qos_ptr=0xf06f80, reason=reason@entry=0x0,
acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470,
update_call=update_call@entry=true)
at acct_policy.c:2976
#5 0x0000000000457333 in _update_job (job_ptr=job_ptr@entry=0x2428270,
job_specs=job_specs@entry=0x7f3338196370,
uid=uid@entry=203387) at job_mgr.c:11717
#6 0x000000000045ad71 in update_job_str (msg=msg@entry=0x7f3384ccbe50,
uid=uid@entry=203387) at job_mgr.c:13447
#7 0x000000000048da6c in _slurm_rpc_update_job (msg=0x7f3384ccbe50) at
proc_req.c:4366
#8 slurmctld_req (msg=msg@entry=0x7f3384ccbe50,
arg=arg@entry=0x7f33b4029480) at proc_req.c:447
#9 0x0000000000424f28 in _service_connection (arg=0x7f33b4029480) at
controller.c:1125
#10 0x00007f33d26c8e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f33d23f634d in clone () from /lib64/libc.so.6
This shows that it died in _validate_time_limit in the file acct_policy.c.
_validate_time_limit contains the following line:
max_time_limit = (uint32_t)(max_limit / tres_req_cnt);
Using gdb to print max_limit and tres_req_cnt we get:
(gdb) p max_limit
$8 = 2000000000
(gdb) p tres_req_count
No symbol "tres_req_count" in current context.
(gdb) p tres_req_cnt
$9 = 0
With tres_req_cnt equal to 0, this line is the suspected divide by zero.
Tracing back, the zero appears to originate in the 'msg' argument passed down from:
src/slurmctld/controller.c
As shown by gdb:
(gdb) frame 8
#8 slurmctld_req (msg=msg@entry=0x7f3384ccbe50,
arg=arg@entry=0x7f33b4029480) at proc_req.c:447
(gdb) p ((job_desc_msg_t *) msg->data)->tres_req_cnt[0]
$27 = 0
confirming that tres_req_cnt[0] in the incoming job description message was set to 0.
Perhaps, since no one else has encountered this, it is somehow related to how we
have Slurm configured, but nonetheless slurmctld probably shouldn't be dividing by
zero.
Avalon Johnson
Systems Programmer
Information Technology Services
CAL 365-104B, University of Southern California
Los Angeles, California 90089-2812
e-mail: avalonjo at usc.edu
It takes a village ...