Overzealous node cleanup epilog script? (Sketch of what I mean below the quoted message.)

> On 11 May 2020, at 17:56, Alastair Neil <ajneil.tech@gmail.com> wrote:
>
> Hi there,
>
> We are running Slurm 18.08 and had a strange occurrence over the weekend. A
> user cancelled one of his jobs with scancel, and two other jobs belonging to
> the same user and running on the same node were killed at the same time. The
> jobs had no dependencies, but each was allocated one GPU. I am curious to
> know why this happened and, if it is a known bug, whether there is a
> workaround to prevent it. Any suggestions gratefully received.
>
> -Alastair
>
> FYI
> The cancelled job (533898) has this at the end of its .err file:
>
>     slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>
> Both of the killed jobs (533900 and 533902) have this:
>
>     slurmstepd: error: get_exit_code task 0 died by signal
>
> Here is the slurmd log from the node and the show-job output for each job:
>
>     [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
>     [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 ran for 0 seconds
>     [2020-05-09T19:49:46.754] ====================
>     [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
>     [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
>     [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
>     [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
>     [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
>     [2020-05-09T19:49:46.758] ====================
>     [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
>     [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
>     [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 ran for 0 seconds
>     [2020-05-09T19:53:14.080] ====================
>     [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
>     [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
>     [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
>     [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
>     [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
>     [2020-05-09T19:53:14.084] ====================
>     [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
>     [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
>     [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 ran for 0 seconds
>     [2020-05-09T19:55:26.304] ====================
>     [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
>     [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
>     [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
>     [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
>     [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
>     [2020-05-09T19:55:26.307] ====================
>     [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
>     [2020-05-10T00:26:03.127] [533898.extern] done with job
>     [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>     [2020-05-10T00:26:04.425] [533898.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
>     [2020-05-10T00:26:04.428] [533898.batch] done with job
>     [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 died by signal
>     [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 died by signal
>     [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>     [2020-05-10T00:26:05.202] [533902.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>     [2020-05-10T00:26:05.211] [533902.batch] done with job
>     [2020-05-10T00:26:05.216] [533900.batch] done with job
>     [2020-05-10T00:26:05.234] [533902.extern] done with job
>     [2020-05-10T00:26:05.235] [533900.extern] done with job
>
>     [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
>     JobId=533898 JobName=r18-relu-ent
>        UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>        Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>        JobState=CANCELLED Reason=None Dependency=(null)
>        Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
>        RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
>        SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
>        AccrueTime=2020-05-09T19:49:45
>        StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
>        PreemptTime=None SuspendTime=None SecsPreSuspend=0
>        LastSchedEval=2020-05-09T19:49:46
>        Partition=gpuq AllocNode:Sid=ARGO-2:7221
>        ReqNodeList=(null) ExcNodeList=(null)
>        NodeList=NODE056
>        BatchHost=NODE056
>        NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>        TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>        Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>        MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>        Features=(null) DelayBoot=00:00:00
>        OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>        Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
>        WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>        StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
>        StdIn=/dev/null
>        StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
>        Power=
>        TresPerNode=gpu:1
>
>     JobId=533900 JobName=r18-soft
>        UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>        Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>        JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>        Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>        RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
>        SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
>        AccrueTime=2020-05-09T19:53:13
>        StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
>        PreemptTime=None SuspendTime=None SecsPreSuspend=0
>        LastSchedEval=2020-05-09T19:53:14
>        Partition=gpuq AllocNode:Sid=ARGO-2:7221
>        ReqNodeList=(null) ExcNodeList=(null)
>        NodeList=NODE056
>        BatchHost=NODE056
>        NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>        TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>        Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>        MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>        Features=(null) DelayBoot=00:00:00
>        OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>        Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
>        WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>        StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
>        StdIn=/dev/null
>        StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
>        Power=
>        TresPerNode=gpu:1
>
>     JobId=533902 JobName=r18-soft-ent
>        UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>        Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>        JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>        Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>        RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
>        SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
>        AccrueTime=2020-05-09T19:55:26
>        StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
>        PreemptTime=None SuspendTime=None SecsPreSuspend=0
>        LastSchedEval=2020-05-09T19:55:26
>        Partition=gpuq AllocNode:Sid=ARGO-2:7221
>        ReqNodeList=(null) ExcNodeList=(null)
>        NodeList=NODE056
>        BatchHost=NODE056
>        NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>        TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>        Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>        MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>        Features=(null) DelayBoot=00:00:00
>        OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>        Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
>        WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>        StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
>        StdIn=/dev/null
>        StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
>        Power=
>        TresPerNode=gpu:1
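To expand on the one-liner at the top: the slurmd log shows task 0 of both surviving jobs dying by signal (ExitCode=0:9, i.e. SIGKILL) about one second after 533898's steps finish, which is the pattern you would expect if a node-cleanup Epilog kills every process owned by the departing job's user. Purely as an illustration of the failure mode (this is a hypothetical shell Epilog, not your site's actual script), something like the following would do exactly that, since SLURM_JOB_USER is set in the Epilog environment:

    #!/bin/bash
    # Hypothetical "overzealous" cleanup epilog: on every job completion,
    # SIGKILL all processes owned by the job's owner. This also kills the
    # tasks of any other jobs the same user still has running on the node.
    pkill -9 -u "$SLURM_JOB_USER"

If that is what your epilog does, the usual guard is to clean up only when the user has no other running jobs allocated to the node, e.g. (again just a sketch; I'm assuming squeue's -u/-w/-t filters and `hostname -s` line up with your node names):

    #!/bin/bash
    # Safer sketch: skip the per-user cleanup if this user still has other
    # RUNNING jobs on this node.
    user="$SLURM_JOB_USER"
    node="$(hostname -s)"
    if [ "$(squeue -h -u "$user" -w "$node" -t RUNNING | wc -l)" -eq 0 ]; then
        pkill -9 -u "$user"
    fi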
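To confirm or rule this out, check which scripts slurmd runs at job end and read whatever they point at on node056:

    # On node056 (or anywhere sharing the same slurm.conf):
    scontrol show config | grep -iE 'prolog|epilog'
    # ...then inspect the script(s) named by Epilog= / TaskEpilog= for any
    # broad pkill/kill of the job owner's processes.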