[slurm-users] slurmctld crashing on release of job submitted with a hold

Avalon Johnson avalonjo at usc.edu
Fri Jun 8 17:27:30 MDT 2018



Has anyone else encountered a problem where slurmctld crashes when a held job 
is released?

I am wondering if there is something unique to our configurations that is leading to this crash.

Here is what I have found so far:

There appears to be a bug in slurmctld: placing a hold on a job and then releasing the hold causes slurmctld
to core dump due to an Arithmetic exception.

Version of slurm:

     hpc-sched2# rpm -q --info slurm
     Name        : slurm
     Version     : 17.11.6
     Release     : 1usc.el7.centos
     Architecture: x86_64


To reproduce this error:

   $ sbatch --hold printenv.BATCH
   Submitted batch job 934654


The specs for the job show:

   $ scontrol show job 934654
   JobId=934654 JobName=printenv.BATCH
      UserId=avalonjo(...) GroupId=... MCS_label=N/A
      Priority=0 Nice=0 Account=lc_hpcc QOS=lc_hpcc_maxcpumins
      JobState=PENDING Reason=JobHeldUser Dependency=(null)
      Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
      RunTime=00:00:00 TimeLimit=01:20:00 TimeMin=N/A
      SubmitTime=2018-06-07T17:21:21 EligibleTime=Unknown
      StartTime=Unknown EndTime=Unknown Deadline=N/A
      PreemptTime=None SuspendTime=None SecsPreSuspend=0
      LastSchedEval=2018-06-07T17:21:21
      Partition=main AllocNode:Sid=...:61228
      ReqNodeList=(null) ExcNodeList=(null)
      NodeList=(null)
      NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
      TRES=cpu=1,mem=1G,node=1
      Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
      MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
      Features=(null) DelayBoot=00:00:00
      Gres=(null) Reservation=(null)
      OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
      Command=...../printenv.BATCH
      WorkDir=..../Infiniband
      StdErr=..../Infiniband/./OutputDir/%x.934654
      StdIn=/dev/null
      StdOut=.../Infiniband/./OutputDir/%x.934654
      Power=

Now release the job:

   $ scontrol release job 931432
   Invalid job id specified for job job
   slurm_suspend error: Invalid job id specified
   Unexpected message received for job 931432
   slurm_suspend error: Unexpected message received


At this point slurmctld dumps core.


Using gdb to analyze the core file:


   # gdb /usr/sbin/slurmctld ./core.28720
   GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
   Copyright (C) 2013 Free Software Foundation, Inc.

   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/lib64/libthread_db.so.1".
   Core was generated by `/usr/sbin/slurmctld'.
   Program terminated with signal 8, Arithmetic exception.
   #0  0x00000000004173ff in _validate_time_limit (time_limit_in=time_limit_in@entry=0x7f3338196568,
       part_max_time=part_max_time@entry=60, tres_req_cnt=0, max_limit=2000000000,
       out_max_limit=out_max_limit@entry=0x7f33380864e0, limit_set_time=limit_set_time@entry=0x7f3384ccb472,
       strict_checking=strict_checking@entry=true, is64=is64@entry=true) at acct_policy.c:1120
   #1  0x00000000004174b9 in _validate_tres_time_limits (tres_pos=tres_pos@entry=0x7f3384ccad14,
       time_limit_in=time_limit_in@entry=0x7f3338196568, part_max_time=60, job_tres_array=0x7f3384ccb3a8,
       max_tres_array=0xeeead0, out_max_tres_array=0x7f33380864e0, limit_set_time=limit_set_time@entry=0x7f3384ccb472,
       strict_checking=strict_checking@entry=true) at acct_policy.c:1174
   #2  0x0000000000418635 in _qos_policy_validate (job_desc=job_desc@entry=0x7f3338196370,
       assoc_ptr=assoc_ptr@entry=0x1287450, part_ptr=part_ptr@entry=0x1965410, qos_ptr=qos_ptr@entry=0xf06f80,
       qos_out_ptr=qos_out_ptr@entry=0x7f3384ccadf0, reason=reason@entry=0x0,
       acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470, update_call=update_call@entry=true,
       user_name=user_name@entry=0x1287610 "avalonjo", job_cnt=job_cnt@entry=1, strict_checking=strict_checking@entry=true)
       at acct_policy.c:1522
   #3  0x0000000000418da9 in _acct_policy_validate (job_desc=job_desc@entry=0x7f3338196370,
       part_ptr=part_ptr@entry=0x1965410, assoc_in=assoc_in@entry=0x1287450, qos_ptr_1=0xf065c0, qos_ptr_2=0xf06f80,
       reason=reason@entry=0x0, acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470,
       update_call=update_call@entry=true) at acct_policy.c:2660
   #4  0x000000000041b3fa in acct_policy_validate (job_desc=job_desc@entry=0x7f3338196370, part_ptr=0x1965410,
       assoc_in=0x1287450, qos_ptr=0xf06f80, reason=reason@entry=0x0,
       acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470, update_call=update_call@entry=true)
       at acct_policy.c:2976
   #5  0x0000000000457333 in _update_job (job_ptr=job_ptr@entry=0x2428270, job_specs=job_specs@entry=0x7f3338196370,
       uid=uid@entry=203387) at job_mgr.c:11717
   #6  0x000000000045ad71 in update_job_str (msg=msg@entry=0x7f3384ccbe50, uid=uid@entry=203387) at job_mgr.c:13447
   #7  0x000000000048da6c in _slurm_rpc_update_job (msg=0x7f3384ccbe50) at proc_req.c:4366
   ---Type <return> to continue, or q <return> to quit---
   #8  slurmctld_req (msg=msg@entry=0x7f3384ccbe50, arg=arg@entry=0x7f33b4029480) at proc_req.c:447
   #9  0x0000000000424f28 in _service_connection (arg=0x7f33b4029480) at controller.c:1125
   #10 0x00007f33d26c8e25 in start_thread () from /lib64/libpthread.so.0
   #11 0x00007f33d23f634d in clone () from /lib64/libc.so.6


This shows that it died in _validate_time_limit in acct_policy.c, at line 1120.

_validate_time_limit has the following line:

                 max_time_limit = (uint32_t)(max_limit / tres_req_cnt);


And using gdb to print tres_req_cnt we get:

   (gdb) p max_limit
   $8 = 2000000000
   (gdb) p tres_req_count
   No symbol "tres_req_count" in current context.
   (gdb) p tres_req_cnt
   $9 = 0

With tres_req_cnt equal to 0, that division is an integer divide by zero, which
matches the Arithmetic exception (signal 8) reported by gdb.


After tracing back up the stack, it appears that the value originated in the 'msg' variable in:

      src/slurmctld/controller.c

As shown by gdb:

   (gdb) frame 8
   #8  slurmctld_req (msg=msg@entry=0x7f3384ccbe50, arg=arg@entry=0x7f33b4029480) at proc_req.c:447


   (gdb) p ((job_desc_msg_t *) msg->data)->tres_req_cnt[0]
   $27 = 0


So tres_req_cnt[0] was already 0 in the incoming request message.

Perhaps, since no one else has encountered this, it is somehow related to how we 
have slurm configured, but nonetheless it probably shouldn't be dividing by 
zero.


Avalon Johnson

Systems Programmer
Information Technology Services
CAL 365-104B, University of Southern California
Los Angeles, California 90089-2812

e-mail: avalonjo at usc.edu
         It takes a village ..."


