[slurm-users] Providing users with info on wait time vs. run time

Sebastian Potthoff s.potthoff at uni-muenster.de
Fri Sep 16 13:41:38 UTC 2022


Hi Hermann,

>> So you both are happily(?) ignoring this warning in the "Prolog and Epilog Guide",
>> right? :-)
>> 
>> "Prolog and Epilog scripts [...] should not call Slurm commands (e.g. squeue,
>> scontrol, sacctmgr, etc)."
> 
> We have probably been doing this since before the warning was added to
> the documentation.  So we are "ignorantly ignoring" the advice :-/

Same here :) But if $SLURM_JOB_STDOUT is not defined as documented … what can you do.
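
In practice the workaround is simply to ask scontrol instead; roughly like this (a sketch only, not copied verbatim from our epilog - the fallback to the env var is just for illustration):

# Use SLURM_JOB_STDOUT if it is ever set as documented, otherwise fall back to scontrol
StdOut="${SLURM_JOB_STDOUT:-$(/usr/bin/scontrol show job="$SLURM_JOB_ID" | /usr/bin/awk -F= '/StdOut=/{print $2}')}"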

>> May I ask how big your clusters are (number of nodes) and how heavily they are
>> used (submitted jobs per hour)?


We have around 500 nodes (mostly 2x18 cores). The number of jobs ending (i.e. calling the epilog script) varies quite a lot, between 1,000 and 15,000 a day, i.e. somewhere between 40 and 625 jobs/hour. During those peaks Slurm can become noticeably slower, but usually it runs fine.

Sebastian 

> On 16.09.2022 at 15:15, Loris Bennett <loris.bennett at fu-berlin.de> wrote:
> 
> Hi Hermann,
> 
> Hermann Schwärzler <hermann.schwaerzler at uibk.ac.at> writes:
> 
>> Hi Loris,
>> hi Sebastian,
>> 
>> thanks for the information on how you are doing this.
>> So you both are happily(?) ignoring this warning in the "Prolog and Epilog Guide",
>> right? :-)
>> 
>> "Prolog and Epilog scripts [...] should not call Slurm commands (e.g. squeue,
>> scontrol, sacctmgr, etc)."
> 
> We have probably been doing this since before the warning was added to
> the documentation.  So we are "ignorantly ignoring" the advice :-/
> 
>> May I ask how big your clusters are (number of nodes) and how heavily they are
>> used (submitted jobs per hour)?
> 
> We have around 190 32-core nodes.  I don't know how I would easily find
> out the average number of jobs per hour.  The only problems we have had
> with submission have been when people have written their own mechanisms
> for submitting thousands of jobs.  Once we get them to use job arrays,
> such problems generally disappear.
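> 
> Just for reference, a job array replaces those thousands of individual
> submissions with a single one, along these lines (script name and the
> limits are made up):
> 
>   # 1000 tasks, at most 50 running at once; each task picks its input via $SLURM_ARRAY_TASK_ID
>   sbatch --array=1-1000%50 my_task.sh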
> 
> Cheers,
> 
> Loris
> 
>> Regards,
>> Hermann
>> 
>> On 9/16/22 9:09 AM, Loris Bennett wrote:
>>> Hi Hermann,
>>> Sebastian Potthoff <s.potthoff at uni-muenster.de> writes:
>>> 
>>>> Hi Hermann,
>>>> 
>>>> I happened to be following this conversation and was just solving this issue today. I added this part to the epilog script to make it work:
>>>> 
>>>> # Add job report to stdout
>>>> StdOut=$(/usr/bin/scontrol show job=$SLURM_JOB_ID | /usr/bin/grep StdOut | /usr/bin/xargs | /usr/bin/awk 'BEGIN { FS = "=" } ; { print $2 }')
>>>> 
>>>> NODELIST=($(/usr/bin/scontrol show hostnames))
>>>> 
>>>> # Only add to StdOut file if it exists and if we are the first node
>>>> if [ "$(/usr/bin/hostname -s)" = "${NODELIST[0]}" -a ! -z "${StdOut}" ]
>>>> then
>>>>   echo "################################# JOB REPORT ##################################" >> $StdOut
>>>>   /usr/bin/seff $SLURM_JOB_ID >> $StdOut
>>>>   echo "###############################################################################" >> $StdOut
>>>> fi
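>>>> 
>>>> (For completeness: this lives in the script that slurm.conf's Epilog= points to, for example - the path is only an illustration -
>>>> 
>>>>   Epilog=/etc/slurm/epilog.sh
>>>> 
>>>> and since that script runs on every node of the job, the check against ${NODELIST[0]} above is what keeps the report from being appended several times.)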
>>> We do something similar.  At the end of our script pointed to by
>>> EpilogSlurmctld we have
>>>   OUT=`scontrol show jobid ${job_id} | awk -F= '/ StdOut/{print $2}'`
>>>   if [ ! -f "$OUT" ]; then
>>>     exit
>>>   fi
>>>   printf "\n== Epilog Slurmctld ==================================================\n\n" >> ${OUT}
>>>   seff ${SLURM_JOB_ID} >> ${OUT}
>>>   printf "\n======================================================================\n" >> ${OUT}
>>>   chown ${user} ${OUT}
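>>> 
>>> (The slurm.conf side of that is simply, with an example path:
>>> 
>>>   EpilogSlurmctld=/etc/slurm/epilog_slurmctld.sh
>>> 
>>> and unlike the per-node epilog it runs once per job, on the slurmctld host.)
>>> 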
>>> Cheers,
>>> Loris
>>> 
>>>>   Contrary to what it says in the Slurm docs (https://slurm.schedmd.com/prolog_epilog.html) I was not able to use the env var SLURM_JOB_STDOUT, so I had to fetch it via scontrol. In addition I had to
>>>> make sure it is only called on the "leading" node, as the epilog script runs on ALL nodes of a multi-node job and they would all call seff and clutter up the output. The last thing was to check that StdOut is
>>>> not of length zero (i.e. that it exists); interactive jobs would otherwise cause the node to drain.
>>>> 
>>>> Maybe this helps.
>>>> 
>>>> Kind regards
>>>> Sebastian
>>>> 
>>>> PS: goslmailer looks quite nice with its recommendations! Will definitely look into it.
>>>> 
>>>> --
>>>> Westfälische Wilhelms-Universität (WWU) Münster
>>>> WWU IT
>>>> Sebastian Potthoff (eScience / HPC)
>>>> 
>>>> On 15.09.2022 at 18:07, Hermann Schwärzler <hermann.schwaerzler at uibk.ac.at> wrote:
>>>> 
>>>>> Hi Ole,
>>>>> 
>>>>> On 9/15/22 5:21 PM, Ole Holm Nielsen wrote:
>>>>>> On 15-09-2022 16:08, Hermann Schwärzler wrote:
>>>>>>> Just out of curiosity: how do you insert the output of seff into the out-file of a job?
>>>>>> 
>>>>>> Use the "smail" tool from the slurm-contribs RPM and set this in slurm.conf:
>>>>>> MailProg=/usr/bin/smail
>>>>> 
>>>>> Maybe I am missing something but from what I can tell smail sends an email and does *not* change or append to the .out file of a job...
>>>>> 
>>>>> Regards,
>>>>> Hermann
>>> 
>> 
> -- 
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de