[slurm-users] Extract job information after completion

Sean McGrath smcgrat at tchpc.tcd.ie
Tue Apr 27 16:19:01 UTC 2021


Hi,

On Tue, Apr 27, 2021 at 03:14:04PM +0000, O'Grady, Paul Christopher wrote:

> Sometimes when a Slurm job fails I want to see what a user did, getting the command/workdir/stdout/stderr information.  I can see that with "scontrol show job <jobid>".  However, after the job is done that command doesn't seem to work anymore, saying "invalid job id".  I tried using sacct, which seems to save history, but I could only find the "workdir" parameter there, not stdout/stderr/cmd.  I also tried using the "jobname" field of sacct, but when I use the "wrap" option of sbatch, jobname only shows the string "wrap", which isn't useful.
> 
> My question:  is there an easy way for me to get command/workdir/stdout/stderr information after a job has completed?  Thanks!

Not sure if this is what you need. We do the following:

In slurm.conf set: 

EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld

That script does a number of things, including the following:

root at pople01:/etc/slurm # tail -6 slurm.epilogslurmctld                                                                                                                 
# 20150210 - Sean
# Save the details of a job by doing an 'scontrol show job'
# so it can be referenced for troubleshooting in future if needed
# should be run by the slurm epilog

/usr/bin/scontrol show job="$SLURM_JOB_ID" > "$recordsdir/$SLURM_JOB_ID.record"

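That's just the tail of the script; the real one does more. A stripped-down but self-contained version would look something like the following (the cluster name, the per-year directory layout and the mkdir are illustrative, reconstructed from the example output further down):

#!/bin/bash
# Minimal sketch of a slurm.epilogslurmctld -- not the full production
# script. The records path below is an assumption based on the example
# record file shown later in this mail.

cluster="pople"
recordsdir="/home/support/root/slurm_job_records/$cluster/$(date +%Y)"

# EpilogSlurmctld runs on the slurmctld host as SlurmUser, so this
# directory must exist and be writable by that user on that host.
mkdir -p "$recordsdir"

# Save the job details so they can be referenced for troubleshooting later
/usr/bin/scontrol show job="$SLURM_JOB_ID" > "$recordsdir/$SLURM_JOB_ID.record"
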
Either way, the scontrol line writes a record like this to the file system:

root at pople01:/etc/slurm # cat /home/support/root/slurm_job_records/pople/2021/6.record
JobId=6 JobName=sbatch.sh
   UserId=smcgrat(5446) GroupId=smcgrat(9249) MCS_label=N/A
   Priority=1104631 Nice=0 Account=tchpc QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-04-27T15:56:12 EligibleTime=2021-04-27T15:56:12
   AccrueTime=2021-04-27T15:56:12
   StartTime=2021-04-27T15:56:13 EndTime=2021-04-27T15:56:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-27T15:56:13
   Partition=compute AllocNode:Sid=pople01:14314
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   BatchHost=pople-n001
   NumNodes=2 NumCPUs=32 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=126000M,node=2,billing=32
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=63000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/smcgrat/sbatch.sh
   WorkDir=/home/users/smcgrat
   StdErr=/home/users/smcgrat/slurm-6.out
   StdIn=/dev/null
   StdOut=/home/users/smcgrat/slurm-6.out
   Power=

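Once the records are there, the fields you're after can be grepped straight back out, e.g.:

grep -E 'Command=|WorkDir=|StdOut=|StdErr=' /home/support/root/slurm_job_records/pople/2021/6.record

For jobs that pre-date a setup like this, sacct can at least recover the working directory (though, as you found, not the command or the stdout/stderr paths), for example:

sacct -j 6 --format=JobID,JobName,State,WorkDir%60
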
Hope that helps.

Sean


> chris

-- 
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.mcgrath at tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725
