<div dir="ltr">I am not a Slurm expert by any stretch of the imagination, so my answer is not authoritative.<div><br><div>That said, I am not aware of any functional equivalent for Slurm, and I would love to learn that I am mistaken!</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Dec 12, 2023 at 1:39 AM Pacey, Mike &lt;<a href="mailto:m.pacey@lancaster.ac.uk">m.pacey@lancaster.ac.uk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-4928595830331357702">
<div lang="EN-GB" style="overflow-wrap: break-word;">
<div class="m_-4928595830331357702WordSection1">
<p class="MsoNormal"><span>Hi Davide,<u></u><u></u></span></p>
<p class="MsoNormal"><span><u></u> <u></u></span></p>
<p class="MsoNormal"><span>The jobs do eventually run, but can take several minutes or sometimes several hours to switch to a running state even when there’s plenty of resources free immediately.<u></u><u></u></span></p>
<p class="MsoNormal"><span><u></u> <u></u></span></p>
<p class="MsoNormal"><span>With Grid Engine it was possible to turn on scheduling diagnostics and get a summary of the scheduler’s decisions on a pending job by running “qstat -j jobid”. But there doesn’t seem to be any functional
equivalent with SLURM?<u></u><u></u></span></p>
<p class="MsoNormal"><span><u></u> <u></u></span></p>
<p class="MsoNormal"><span>Regards,<u></u><u></u></span></p>
<p class="MsoNormal"><span>Mike<u></u><u></u></span></p>
<p class="MsoNormal"><span><u></u> <u></u></span></p>
<p class="MsoNormal"><span><u></u> <u></u></span></p>
<div>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(225,225,225);padding:3pt 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US">From:</span></b><span lang="EN-US"> slurm-users &lt;<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>&gt;
<b>On Behalf Of </b>Davide DelVento<br>
<b>Sent:</b> Monday, December 11, 2023 4:23 PM<br>
<b>To:</b> Slurm User Community List &lt;<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>&gt;<br>
<b>Subject:</b> [External] Re: [slurm-users] Troubleshooting job stuck in Pending state</span></p>
</div>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">By getting "stuck", do you mean the job stays PENDING forever, or that it does eventually run? I've seen the latter (and I agree with you: I wish Slurm would log things like "I looked at this job and I am not starting it yet because..."), but
not the former.</p>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike &lt;<a href="mailto:m.pacey@lancaster.ac.uk" target="_blank">m.pacey@lancaster.ac.uk</a>&gt; wrote:</p>
</div>
<blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0cm 0cm 0cm 6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<div>
<div>
<p class="MsoNormal">Hi folks,<u></u><u></u></p>
<p class="MsoNormal"> <u></u><u></u></p>
<p class="MsoNormal">I’m looking for some advice on how to troubleshoot jobs we occasionally see on our cluster that are stuck in a pending state despite sufficient matching resources being free. In
the case I’m trying to troubleshoot, the Reason field lists (Priority), but I have yet to find any way to get the scheduler to tell me exactly which higher-priority job is doing the blocking.</p>
<p class="MsoNormal"> <u></u><u></u></p>
<ul type="disc">
<li class="m_-4928595830331357702m-6175276809849702723msolistparagraph">
I tried setting the scheduler log level to debug3 for 5 minutes at one point, but my logfile ballooned from 0.5G to 1.5G and didn’t offer any useful info for this case.<u></u><u></u></li><li class="m_-4928595830331357702m-6175276809849702723msolistparagraph">
I’ve tried ‘scontrol schedloglevel 1’ but it returns the error: ‘slurm_set_schedlog_level error: Requested operation is presently disabled’<u></u><u></u></li></ul>
<p class="MsoNormal"> <u></u><u></u></p>
<p class="MsoNormal">I’m aware that the backfill scheduler will occasionally hold on to free resources in order to schedule a larger job with higher priority, but in this case I can’t find any pending
job that might fit the bill.<u></u><u></u></p>
<p class="MsoNormal"> <u></u><u></u></p>
<p class="MsoNormal">And to possibly complicate matters, this is on a large partition that has no maximum time limit and most pending jobs have no time limits either. (We use backfill/fairshare as we
have smaller partitions of rarer resources that benefit from it, plus we’re aiming to use fairshare even on the no-time-limits partitions to help balance out usage).<u></u><u></u></p>
<p class="MsoNormal"> <u></u><u></u></p>
<p class="MsoNormal">Hoping someone can provide pointers.<u></u><u></u></p>
<p class="MsoNormal"> <u></u><u></u></p>
<p class="MsoNormal">Regards,<u></u><u></u></p>
<p class="MsoNormal">Mike<u></u><u></u></p>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div></blockquote></div>