<div dir="ltr">The larger cluster is using NFS. I can see how that could be related to the difference of behaviours between the clusters.<div><br></div><div> The buffering behaviour is the same if I tail the file from the node running the job. The only thing that seems to change the behaviour is whether I use <font face="monospace">srun</font> to create a job step or not.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Feb 10, 2021 at 4:09 PM Aaron Jackson <<a href="mailto:Aaron.Jackson@nottingham.ac.uk">Aaron.Jackson@nottingham.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Is it being written to NFS? You say on your local dev cluster it's a<br>
single node. Is it also the login node as well as the compute node? In that
case I guess there is no NFS. The larger cluster will be using some sort of
shared storage, so whichever shared file system you are using likely has
caching.
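
If it is NFS, the attribute cache is the usual suspect: the client only
re-fetches file attributes, including size, every few seconds. As a rough
and untested check, the mount options will show the cache timeouts:

# Sketch, assumes a Linux NFS client; nfsstat comes with nfs-utils.
# actimeo / acregmin / acregmax control how long attributes are cached.
nfsstat -m
# or, without nfs-utils, list the NFS mounts and their options:
mount -t nfs,nfs4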

If you are able to connect directly to the node which is running the
job, you can try tailing from there. It'll likely update immediately if
what I said above is the case.
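
Something along these lines should work, assuming your site allows ssh to
compute nodes (the job id 12345 and the output path are placeholders):

# Expand the job's node list, then tail the output file from the first node.
NODE=$(scontrol show hostnames "$(squeue -h -j 12345 -o %N)" | head -n 1)
ssh "$NODE" tail -f /path/to/slurm-12345.out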

Cheers,
Aaron


On 9 February 2021 at 23:47 GMT, Maria Semple wrote:

> Hello all,
>
> I've noticed an odd behaviour with job steps in some Slurm environments.
> When a script is launched directly as a job, the output is written to file
> immediately. When the script is launched as a step in a job, output is
> written in ~30 second chunks. This doesn't happen in all Slurm
> environments, but if it happens in one, it seems to always happen. For
> example, on my local development cluster, which is a single node running
> Ubuntu 18, I don't experience this. On a large CentOS 7 based cluster, I do.
>
> Below is a simple reproducible example:
>
> loop.sh:
> #!/bin/bash
> for i in {1..100}
> do
>     echo $i
>     sleep 1
> done
>
> withsteps.sh:
> #!/bin/bash
> srun ./loop.sh
>
> Then, from the command line, running sbatch loop.sh followed by tail -f
> slurm-<job #>.out prints the job output in smaller chunks, which appears to
> be related to file system buffering or the time it takes for the tail
> process to notice that the file has updated. Running cat on the file every
> second shows that the output is in the file immediately after it is emitted
> by the script.
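>
> (What I ran for that check, for concreteness -- watch is standard procps,
> and <job #> is a placeholder:)
>
> watch -n 1 cat slurm-<job #>.out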
>
> If you run sbatch withsteps.sh instead, tail-ing or repeatedly cat-ing the
> output file shows that the job output is written in chunks of 30-35 lines.
>
> I'm hoping this is something that can be worked around, potentially via an
> OS setting, the way Slurm was compiled, or a Slurm setting.
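>
> (One lead I've found but haven't verified on the affected cluster: srun
> has an --unbuffered flag that is meant to stop it buffering the task's
> stdout, e.g. a modified withsteps.sh:)
>
> #!/bin/bash
> srun --unbuffered ./loop.sh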

--
Research Fellow
School of Computer Science
University of Nottingham

--
Thanks,
Maria