[slurm-users] additional jobs killed by scancel.

Steven Dick kg4ydw@gmail.com
Tue May 12 14:07:41 UTC 2020


I see one job cancelled and two jobs failed.
Your slurmd log is incomplete -- it doesn't show why the two failed jobs
exited, so the real error is not here.

It might also be helpful to look through slurmctld's log, starting from
when the first job was canceled, for any messages mentioning the node or
the two failed jobs.
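
For example (a quick sketch -- the log paths below are common defaults and
only an assumption about your site, so substitute your own):

    # On the controller: messages about the node or the two failed jobs
    # around the time the first job was cancelled.
    grep -iE 'node056|533900|533902' /var/log/slurm/slurmctld.log

    # On the node: the slurmd log for the same window, in case the excerpt
    # below cut off the interesting part.
    grep -E '533898|533900|533902' /var/log/slurm/slurmd.log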

I've had nodes do strange things on job cancel.  The last one I tracked
down to the job epilog failing: it was on an NFS mount, NFS was slower
than Slurm liked, and slurmd took the node offline and killed everything
running on it.
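
If something like that is going on here, it's worth checking which epilog
the node runs and how long it actually takes (a rough sketch;
/etc/slurm/epilog.sh is just a stand-in for whatever your slurm.conf
points at):

    # Which prolog/epilog scripts and related timeouts are configured.
    scontrol show config | grep -Ei 'prolog|epilog'

    # Time the epilog by hand on the node (it may expect SLURM_* environment
    # variables, so this is only a rough timing check).  If it stalls on NFS
    # longer than slurmd tolerates, the node can get drained and its jobs killed.
    time /etc/slurm/epilog.sh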

On Mon, May 11, 2020 at 12:55 PM Alastair Neil <ajneil.tech@gmail.com> wrote:
>
> Hi there,
>
> We are using Slurm 18.08 and had a weird occurrence over the weekend.  A user canceled one of his jobs using scancel, and two additional jobs of his running on the same node were killed at the same time.  The jobs had no dependencies, but each was allocated 1 GPU.  I am curious to know why this happened, and if this is a known bug, is there a workaround to prevent it from happening again?  Any suggestions gratefully received.
>
> -Alastair
>
> FYI
> The cancelled job (533898) has this at the end of the .err file:
>
>> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>
>
> both of the killed jobs (533900 and 533902)  have this:
>
>> slurmstepd: error: get_exit_code task 0 died by signal
>
>
> here is the slurmd log from the node and the show-job output for each job:
>
>> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
>> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 ran for 0 seconds
>> [2020-05-09T19:49:46.754] ====================
>> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
>> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
>> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
>> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
>> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
>> [2020-05-09T19:49:46.758] ====================
>> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
>> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
>> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 ran for 0 seconds
>> [2020-05-09T19:53:14.080] ====================
>> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
>> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
>> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
>> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
>> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
>> [2020-05-09T19:53:14.084] ====================
>> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
>> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
>> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 ran for 0 seconds
>> [2020-05-09T19:55:26.304] ====================
>> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
>> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
>> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
>> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
>> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
>> [2020-05-09T19:55:26.307] ====================
>> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
>> [2020-05-10T00:26:03.127] [533898.extern] done with job
>> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>> [2020-05-10T00:26:04.425] [533898.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
>> [2020-05-10T00:26:04.428] [533898.batch] done with job
>> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 died by signal
>> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 died by signal
>> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> [2020-05-10T00:26:05.202] [533902.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> [2020-05-10T00:26:05.211] [533902.batch] done with job
>> [2020-05-10T00:26:05.216] [533900.batch] done with job
>> [2020-05-10T00:26:05.234] [533902.extern] done with job
>> [2020-05-10T00:26:05.235] [533900.extern] done with job
>
>
>> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
>> JobId=533898 JobName=r18-relu-ent
>>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>  JobState=CANCELLED Reason=None Dependency=(null)
>>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
>>  RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
>>  SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
>>  AccrueTime=2020-05-09T19:49:45
>>  StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
>>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>  LastSchedEval=2020-05-09T19:49:46
>>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>  ReqNodeList=(null) ExcNodeList=(null)
>>  NodeList=NODE056
>>  BatchHost=NODE056
>>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>  Features=(null) DelayBoot=00:00:00
>>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
>>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
>>  StdIn=/dev/null
>>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
>>  Power=
>>  TresPerNode=gpu:1
>>
>> JobId=533900 JobName=r18-soft
>>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>>  RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
>>  SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
>>  AccrueTime=2020-05-09T19:53:13
>>  StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
>>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>  LastSchedEval=2020-05-09T19:53:14
>>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>  ReqNodeList=(null) ExcNodeList=(null)
>>  NodeList=NODE056
>>  BatchHost=NODE056
>>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>  Features=(null) DelayBoot=00:00:00
>>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
>>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
>>  StdIn=/dev/null
>>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
>>  Power=
>>  TresPerNode=gpu:1
>>
>> JobId=533902 JobName=r18-soft-ent
>>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>>  RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
>>  SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
>>  AccrueTime=2020-05-09T19:55:26
>>  StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
>>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>  LastSchedEval=2020-05-09T19:55:26
>>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>>  ReqNodeList=(null) ExcNodeList=(null)
>>  NodeList=NODE056
>>  BatchHost=NODE056
>>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>>  Features=(null) DelayBoot=00:00:00
>>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
>>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
>>  StdIn=/dev/null
>>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
>>  Power=
>>  TresPerNode=gpu:1
>
>
>


