[slurm-users] additional jobs killed by scancel.

Steven Dick kg4ydw at gmail.com
Wed May 13 22:23:10 UTC 2020


Hmm, works for me.  Maybe they added that field in more recent versions of Slurm.
I'm using version 18+.
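
For what it's worth, sacct --helpformat prints the full list of fields your
sacct accepts, so you can confirm whether "reason" exists in your build:

sacct --helpformat | grep -i reason

If it isn't there on 18.08, the state and exitcode fields together give much
of the same information.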

On Wed, May 13, 2020 at 5:12 PM Alastair Neil <ajneil.tech at gmail.com> wrote:
>
> invalid field requested: "reason"
>
> On Tue, 12 May 2020 at 16:47, Steven Dick <kg4ydw at gmail.com> wrote:
>>
>> What do you get from
>>
>> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>>
>> On Tue, May 12, 2020 at 4:12 PM Alastair Neil <ajneil.tech at gmail.com> wrote:
>> >
>> > The log is continuous and has all the messages logged by slurmd on the node for all the jobs mentioned. Below are the entries from the slurmctld log:
>> >
>> >> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=533898 uid 1224431221
>> >>
>> >> [2020-05-10T00:26:03.098] email msg to sshres2 at masonlive.gmu.edu: Slurm Job_id=533898 Name=r18-relu-ent Ended, Run time 04:36:17, CANCELLED, ExitCode 0
>> >> [2020-05-10T00:26:03.098] job_signal: 9 of running JobId=533898 successful 0x8004
>> >> [2020-05-10T00:26:05.204] _job_complete: JobId=533902 WTERMSIG 9
>> >> [2020-05-10T00:26:05.204] email msg to sshres2 at masonlive.gmu.edu: Slurm Job_id=533902 Name=r18-soft-ent Failed, Run time 04:30:39, FAILED
>> >> [2020-05-10T00:26:05.205] _job_complete: JobId=533902 done
>> >> [2020-05-10T00:26:05.210] _job_complete: JobId=533900 WTERMSIG 9
>> >> [2020-05-10T00:26:05.210] email msg to sshres2 at masonlive.gmu.edu: Slurm Job_id=533900 Name=r18-soft Failed, Run time 04:32:51, FAILED
>> >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
>> >
>> >
>> > It is curious that all the jobs were running on the same processor; perhaps this is a cgroup-related failure?
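>> >
>> > One way to check the cgroup theory next time, while the jobs are still
>> > running (assuming the task/cgroup plugin and the usual cgroup v1 layout,
>> > so the exact path is a guess for our setup), would be to see whether each
>> > job gets its own job cgroup on the node:
>> >
>> > find /sys/fs/cgroup/cpuset/slurm -maxdepth 3 -name 'job_*'
>> >
>> > I would expect a separate job_<jobid> directory under uid_1224431221 for
>> > each of the three jobs.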
>> >
>> > On Tue, 12 May 2020 at 10:10, Steven Dick <kg4ydw at gmail.com> wrote:
>> >>
>> >> I see one job cancelled and two jobs failed.
>> >> Your slurmd log is incomplete -- it doesn't show the two failed jobs
>> >> exiting/failing, so the real error is not here.
>> >>
>> >> It might also be helpful to look through slurmctld's log starting from
>> >> when the first job was canceled, looking at any messages mentioning
>> >> the node or the two failed jobs.
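>> >>
>> >> Something along these lines usually narrows it down (the log path is a
>> >> guess, adjust it for your site):
>> >>
>> >> grep -E '533898|533900|533902|NODE056' /var/log/slurm/slurmctld.log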
>> >>
>> >> I've had nodes do strange things on job cancel.  The last one I tracked
>> >> down to the job epilog failing because it was NFS-mounted and NFS was
>> >> slower than Slurm liked, so Slurm took the node offline and killed
>> >> everything on it.
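>> >>
>> >> If something like that happened here, the node would normally have been
>> >> drained with a reason recorded, which you can check with:
>> >>
>> >> sinfo -R -n NODE056
>> >> scontrol show node NODE056 | grep -i Reason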
>> >>
>> >> On Mon, May 11, 2020 at 12:55 PM Alastair Neil <ajneil.tech at gmail.com> wrote:
>> >> >
>> >> > Hi there,
>> >> >
>> >> > We are using Slurm 18.08 and had a weird occurrence over the weekend.  A user canceled one of his jobs using scancel, and two additional jobs of his running on the same node were killed concurrently.  The jobs had no dependencies, but they were each allocated 1 GPU. I am curious to know why this happened, and if this is a known bug, is there a workaround to prevent it from happening?  Any suggestions gratefully received.
>> >> >
>> >> > -Alastair
>> >> >
>> >> > FYI
>> >> > The cancelled job (533898) has this at the end of the .err file:
>> >> >
>> >> >> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>> >> >
>> >> >
>> >> > Both of the killed jobs (533900 and 533902) have this:
>> >> >
>> >> >> slurmstepd: error: get_exit_code task 0 died by signal
>> >> >
>> >> >
>> >> > Here is the slurmd log from the node and the show-job output for each job:
>> >> >
>> >> >> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
>> >> >> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job 533898 ran for 0 seconds
>> >> >> [2020-05-09T19:49:46.754] ====================
>> >> >> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
>> >> >> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
>> >> >> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
>> >> >> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
>> >> >> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
>> >> >> [2020-05-09T19:49:46.758] ====================
>> >> >> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID 1224431221
>> >> >> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
>> >> >> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job 533900 ran for 0 seconds
>> >> >> [2020-05-09T19:53:14.080] ====================
>> >> >> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
>> >> >> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
>> >> >> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
>> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
>> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
>> >> >> [2020-05-09T19:53:14.084] ====================
>> >> >> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID 1224431221
>> >> >> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
>> >> >> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job 533902 ran for 0 seconds
>> >> >> [2020-05-09T19:55:26.304] ====================
>> >> >> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
>> >> >> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
>> >> >> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
>> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
>> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
>> >> >> [2020-05-09T19:55:26.307] ====================
>> >> >> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID 1224431221
>> >> >> [2020-05-10T00:26:03.127] [533898.extern] done with job
>> >> >> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
>> >> >> [2020-05-10T00:26:04.425] [533898.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
>> >> >> [2020-05-10T00:26:04.428] [533898.batch] done with job
>> >> >> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0 died by signal
>> >> >> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0 died by signal
>> >> >> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> >> >> [2020-05-10T00:26:05.202] [533902.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
>> >> >> [2020-05-10T00:26:05.211] [533902.batch] done with job
>> >> >> [2020-05-10T00:26:05.216] [533900.batch] done with job
>> >> >> [2020-05-10T00:26:05.234] [533902.extern] done with job
>> >> >> [2020-05-10T00:26:05.235] [533900.extern] done with job
>> >> >
>> >> >
>> >> >> [root at node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
>> >> >> JobId=533898 JobName=r18-relu-ent
>> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>> >> >>  JobState=CANCELLED Reason=None Dependency=(null)
>> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
>> >> >>  RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
>> >> >>  SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
>> >> >>  AccrueTime=2020-05-09T19:49:45
>> >> >>  StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03 Deadline=N/A
>> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> >> >>  LastSchedEval=2020-05-09T19:49:46
>> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>> >> >>  ReqNodeList=(null) ExcNodeList=(null)
>> >> >>  NodeList=NODE056
>> >> >>  BatchHost=NODE056
>> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>> >> >>  Features=(null) DelayBoot=00:00:00
>> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>> >> >>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
>> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>> >> >>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
>> >> >>  StdIn=/dev/null
>> >> >>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
>> >> >>  Power=
>> >> >>  TresPerNode=gpu:1
>> >> >>
>> >> >> JobId=533900 JobName=r18-soft
>> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>> >> >>  RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
>> >> >>  SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
>> >> >>  AccrueTime=2020-05-09T19:53:13
>> >> >>  StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05 Deadline=N/A
>> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> >> >>  LastSchedEval=2020-05-09T19:53:14
>> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>> >> >>  ReqNodeList=(null) ExcNodeList=(null)
>> >> >>  NodeList=NODE056
>> >> >>  BatchHost=NODE056
>> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>> >> >>  Features=(null) DelayBoot=00:00:00
>> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>> >> >>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
>> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>> >> >>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
>> >> >>  StdIn=/dev/null
>> >> >>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
>> >> >>  Power=
>> >> >>  TresPerNode=gpu:1
>> >> >>
>> >> >> JobId=533902 JobName=r18-soft-ent
>> >> >>  UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
>> >> >>  Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
>> >> >>  JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
>> >> >>  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
>> >> >>  RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
>> >> >>  SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
>> >> >>  AccrueTime=2020-05-09T19:55:26
>> >> >>  StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05 Deadline=N/A
>> >> >>  PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> >> >>  LastSchedEval=2020-05-09T19:55:26
>> >> >>  Partition=gpuq AllocNode:Sid=ARGO-2:7221
>> >> >>  ReqNodeList=(null) ExcNodeList=(null)
>> >> >>  NodeList=NODE056
>> >> >>  BatchHost=NODE056
>> >> >>  NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
>> >> >>  TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
>> >> >>  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> >> >>  MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
>> >> >>  Features=(null) DelayBoot=00:00:00
>> >> >>  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>> >> >>  Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
>> >> >>  WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
>> >> >>  StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
>> >> >>  StdIn=/dev/null
>> >> >>  StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
>> >> >>  Power=
>> >> >>  TresPerNode=gpu:1
>> >> >
>> >> >
>> >> >
>> >>
>>


