[slurm-users] Extract job information after completion
Sean McGrath
smcgrat at tchpc.tcd.ie
Tue Apr 27 16:19:01 UTC 2021
Hi,
On Tue, Apr 27, 2021 at 03:14:04PM +0000, O'Grady, Paul Christopher wrote:
> Sometimes when a slurm job fails I want to see what a user did, getting the command/workdir/stdout/stderr information. I can see that with "scontrol show job <jobid>". However, after the job is done that command doesn't seem to work anymore, saying "invalid job id". I try to use sacct, which seems to save history, but I can only find the "workdir" parameter there, not stdout/stderr/cmd. I tried using the "jobname" field of sacct, but when I use the "wrap" option of sbatch, then jobname only shows the string "wrap" which isn't useful.
>
> My question: is there an easy way for me to get command/workdir/stdout/stderr information after a job has completed? Thanks!
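[For context, the sacct query the question alludes to looks something like the following. WorkDir is a real sacct format field; Command/StdOut/StdErr are not stored in accounting, which is exactly the gap the question describes. The job id is hypothetical.]

```shell
# Hypothetical job id; sacct keeps WorkDir (and JobName) in accounting,
# but not the command, stdout or stderr paths.
jobid=12345
cmd="sacct -j $jobid --format=JobID,JobName%30,WorkDir%60"
echo "$cmd"
# Only run it where sacct is actually installed:
if command -v sacct >/dev/null; then eval "$cmd"; fi
```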
Not sure if this is what you need. We do the following:
In slurm.conf set:
EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld
That script does a number of things, including the following:
root at pople01:/etc/slurm # tail -6 slurm.epilogslurmctld
# 20150210 - Sean
# Save the details of a job by doing an scontrol show job=job
# So it can be referenced for troubleshooting in future if needed
# should be run by the slurm epilog
/usr/bin/scontrol show job="$SLURM_JOB_ID" > "$recordsdir/$SLURM_JOBID.record"
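The snippet above relies on a $recordsdir variable defined in the unshown part of the script. A minimal, self-contained sketch of the same idea might look like this (the <base>/<cluster>/<year> layout is guessed from the example record path, and the SCONTROL override exists only so the function can be exercised without a live cluster):

```shell
#!/bin/bash
# Sketch of an EpilogSlurmctld-style record saver. Assumptions: the
# records directory layout is inferred from the example record path;
# the real site script is only partially shown in the post.

# Allow the scontrol binary to be overridden, e.g. for testing.
SCONTROL="${SCONTROL:-/usr/bin/scontrol}"

save_job_record() {
    # $1 = job id, $2 = directory to write the record into
    local jobid="$1" dir="$2"
    mkdir -p "$dir"
    # Capture the full scontrol view before slurmctld purges the job.
    "$SCONTROL" show job="$jobid" > "$dir/$jobid.record"
}

# In the real epilog, slurmctld exports SLURM_JOB_ID, so the call
# would be something like:
#   recordsdir=/home/support/root/slurm_job_records/$(hostname -s)/$(date +%Y)
#   save_job_record "$SLURM_JOB_ID" "$recordsdir"
```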
So it writes the following to the file system:
root at pople01:/etc/slurm # cat /home/support/root/slurm_job_records/pople/2021/6.record
JobId=6 JobName=sbatch.sh
UserId=smcgrat(5446) GroupId=smcgrat(9249) MCS_label=N/A
Priority=1104631 Nice=0 Account=tchpc QOS=normal
JobState=COMPLETING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-04-27T15:56:12 EligibleTime=2021-04-27T15:56:12
AccrueTime=2021-04-27T15:56:12
StartTime=2021-04-27T15:56:13 EndTime=2021-04-27T15:56:13 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-27T15:56:13
Partition=compute AllocNode:Sid=pople01:14314
ReqNodeList=(null) ExcNodeList=(null)
NodeList=
BatchHost=pople-n001
NumNodes=2 NumCPUs=32 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=126000M,node=2,billing=32
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=63000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/users/smcgrat/sbatch.sh
WorkDir=/home/users/smcgrat
StdErr=/home/users/smcgrat/slurm-6.out
StdIn=/dev/null
StdOut=/home/users/smcgrat/slurm-6.out
Power=
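Once such a record exists, the command/workdir/stdout/stderr fields the original question asked about can be pulled back out with a plain grep. A small self-contained example (it recreates a few lines of the record above in a temp file, since the real record path is site-specific):

```shell
# Recreate a few lines of the example record in a temp file,
# then extract just the fields the question asked about.
record="$(mktemp)"
cat > "$record" <<'EOF'
   Command=/home/users/smcgrat/sbatch.sh
   WorkDir=/home/users/smcgrat
   StdErr=/home/users/smcgrat/slurm-6.out
   StdIn=/dev/null
   StdOut=/home/users/smcgrat/slurm-6.out
EOF

grep -E '(Command|WorkDir|StdErr|StdOut)=' "$record"
```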
Hope that helps.
Sean
> chris
--
Sean McGrath M.Sc
Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin
sean.mcgrath at tchpc.tcd.ie
https://www.tcd.ie/
https://www.tchpc.tcd.ie/
+353 (0) 1 896 3725