[slurm-users] additional jobs killed by scancel.

Alastair Neil ajneil.tech at gmail.com
Mon May 11 16:52:59 UTC 2020


Hi there,

We are using Slurm 18.08 and had a weird occurrence over the weekend.  A
user canceled one of his jobs with scancel, and two additional jobs of his
running on the same node were killed at the same time.  The jobs had no
dependencies, but each was allocated one GPU. I am curious to know why
this happened, and if this is a known bug, is there a workaround to prevent
it from happening again?  Any suggestions gratefully received.
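
(For reference, each of the three jobs was a single-node batch job asking for
four CPUs, 10G of memory and one GPU on the gpuq partition, so roughly the
equivalent of:

  sbatch --partition=gpuq --gres=gpu:1 --cpus-per-task=4 --mem=10G <job-script>

reconstructed from the scontrol output further down, not taken from the user's
actual submission scripts.)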

-Alastair

FYI:
The cancelled job (533898) has this at the end of its .err file:

> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***

Both of the killed jobs (533900 and 533902) have this:

slurmstepd: error: get_exit_code task 0 died by signal
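
(The slurmd log further down shows the cancelled job returning status 15 and
the two killed jobs returning status 9; if I am reading that correctly, the
cancelled job got the usual SIGTERM while the other two were killed with
SIGKILL, which can be double-checked with:

  kill -l 15   # -> TERM
  kill -l 9    # -> KILL
)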


Here is the slurmd log from the node and the show-job output for each job:

> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 ran for 0 seconds
> [2020-05-09T19:49:46.754] ====================
> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
> [2020-05-09T19:49:46.758] ====================
> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 ran for 0 seconds
> [2020-05-09T19:53:14.080] ====================
> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
> [2020-05-09T19:53:14.084] ====================
> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 ran for 0 seconds
> [2020-05-09T19:55:26.304] ====================
> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
> [2020-05-09T19:55:26.307] ====================
> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
> [2020-05-10T00:26:03.127] [533898.extern] done with job
> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> [2020-05-10T00:26:04.425] [533898.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> [2020-05-10T00:26:04.428] [533898.batch] done with job
> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 died by signal
> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 died by signal
> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> [2020-05-10T00:26:05.202] [533902.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> [2020-05-10T00:26:05.211] [533902.batch] done with job
> [2020-05-10T00:26:05.216] [533900.batch] done with job
> [2020-05-10T00:26:05.234] [533902.extern] done with job
> [2020-05-10T00:26:05.235] [533900.extern] done with job
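
(In case the process-tracking setup matters here, the relevant settings can be
pulled with something along these lines; I have not pasted our actual values:

  scontrol show config | grep -Ei 'ProctrackType|TaskPlugin|JobAcctGatherType|KillWait'
)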


> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
> JobId=533898 JobName=r18-relu-ent
>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>  JobState=CANCELLED Reason=None Dependency=(null)
>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
>  RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
>  SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
>  AccrueTime=2020-05-09T19:49:45
>  StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>  LastSchedEval=2020-05-09T19:49:46
>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>  ReqNodeList=(null) ExcNodeList=(null)
>  NodeList=NODE056
>  BatchHost=NODE056
>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>  Features=(null) DelayBoot=00:00:00
>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>
>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>
>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
>  StdIn=/dev/null
>
>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
>  Power=
>  TresPerNode=gpu:1
>
> JobId=533900 JobName=r18-soft
>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>  RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
>  SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
>  AccrueTime=2020-05-09T19:53:13
>  StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>  LastSchedEval=2020-05-09T19:53:14
>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>  ReqNodeList=(null) ExcNodeList=(null)
>  NodeList=NODE056
>  BatchHost=NODE056
>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>  Features=(null) DelayBoot=00:00:00
>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>
>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>
>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
>  StdIn=/dev/null
>
>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
>  Power=
>  TresPerNode=gpu:1
>
> JobId=533902 JobName=r18-soft-ent
>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>  RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
>  SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
>  AccrueTime=2020-05-09T19:55:26
>  StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>  LastSchedEval=2020-05-09T19:55:26
>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>  ReqNodeList=(null) ExcNodeList=(null)
>  NodeList=NODE056
>  BatchHost=NODE056
>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>  Features=(null) DelayBoot=00:00:00
>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>
>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>
>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
>  StdIn=/dev/null
>
>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
>  Power=
>  TresPerNode=gpu:1
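
(If the accounting view helps, the three jobs can also be pulled from sacct
with something like the following; output omitted here:

  sacct -j 533898,533900,533902 --format=JobID,JobName,State,ExitCode,Start,End
)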