[slurm-users] Job Step Output Delay

Maria Semple maria at rstudio.com
Tue Feb 9 23:47:12 UTC 2021


Hello all,

I've noticed an odd behaviour with job steps in some Slurm environments.
When a script is launched directly as a job, its output is written to the
file immediately. When the same script is launched as a step in a job, the
output is written in ~30 second chunks. This doesn't happen in every Slurm
environment, but where it does happen, it happens consistently. For
example, I don't see it on my local development cluster, which is a single
node running Ubuntu 18, but I do see it on a large CentOS 7-based cluster.

Below is a simple reproducible example:

loop.sh:
#!/bin/bash
for i in {1..100}
do
   echo "$i"
   sleep 1
done

withsteps.sh:
#!/bin/bash
srun ./loop.sh

From the command line, running sbatch loop.sh followed by tail -f
slurm-<job #>.out prints the job output in small chunks. That appears to
be related to file system buffering or to the time it takes the tail
process to notice that the file has changed, because running cat on the
file every second shows that each line is in the file immediately after the
script emits it.
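
For anyone trying to reproduce this, the once-a-second check can be
sketched as a small shell helper. It prints a timestamp and the current
line count of the output file, so chunked writes show up as jumps in the
count. The function name watch_lines and its arguments are made up for
illustration and are not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): sample the line count of FILE.
# Usage: watch_lines FILE [SAMPLES] [INTERVAL_SECONDS]
watch_lines() {
    local file=$1 samples=${2:-10} interval=${3:-1}
    local n count
    for ((n = 0; n < samples; n++)); do
        if [[ -r $file ]]; then
            # Arithmetic expansion strips any padding that wc adds.
            count=$(( $(wc -l < "$file") ))
        else
            # The output file may not exist until the job starts.
            count=0
        fi
        printf '%s %s\n' "$(date +%T)" "$count"
        sleep "$interval"
    done
}
```

For example, watch_lines slurm-<job #>.out 100 would sample the file once a
second for 100 seconds. With steady output the count should grow by about
one per sample. With chunked output it should stay flat and then jump.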

If you run sbatch withsteps.sh instead, tailing or repeatedly cat-ing the
output file shows that the job output is written in chunks of 30-35
lines.

I'm hoping there is a way to work around this. It may be related to an OS
setting, the way Slurm was compiled, or a Slurm setting.

-- 
Thanks,
Maria