[slurm-users] [External] Re: Troubleshooting job stuck in Pending state

Davide DelVento davide.quantum at gmail.com
Tue Dec 12 15:04:15 UTC 2023


I am not a Slurm expert by any stretch of the imagination, so my answer is
not authoritative.

That said, I am not aware of any functional equivalent for Slurm, and I
would love to learn that I am mistaken!
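
The closest things I'm aware of give only partial insight rather than a
qstat -j style explanation of the scheduler's reasoning. A rough sketch of
what I'd look at, where <jobid> is a placeholder:

    # full job record, including the Reason field, priority and requested resources
    scontrol show job <jobid>

    # per-factor breakdown of the job's priority
    sprio -j <jobid> -l

    # the backfill scheduler's estimated start time, if it has computed one
    squeue -j <jobid> --start

None of these say which specific job the scheduler is holding resources for,
though, which I think is what you are really after.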

On Tue, Dec 12, 2023 at 1:39 AM Pacey, Mike <m.pacey at lancaster.ac.uk> wrote:

> Hi Davide,
>
>
>
> The jobs do eventually run, but they can take several minutes or sometimes
> several hours to switch to a running state, even when there are plenty of
> matching resources free immediately.
>
>
>
> With Grid Engine it was possible to turn on scheduling diagnostics and get
> a summary of the scheduler’s decisions on a pending job by running “qstat
> -j jobid”. But there doesn’t seem to be any functional equivalent with
> SLURM?
>
>
>
> Regards,
>
> Mike
>
>
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Davide DelVento
> *Sent:* Monday, December 11, 2023 4:23 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [External] Re: [slurm-users] Troubleshooting job stuck in
> Pending state
>
>
>
> By getting "stuck" do you mean the job stays PENDING forever, or does it
> eventually run? I've seen the latter (and I agree with you that I wish
> Slurm would log things like "I looked at this job and I am not starting it
> yet because....") but not the former.
>
>
>
> On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike <m.pacey at lancaster.ac.uk>
> wrote:
>
> Hi folks,
>
>
>
> I’m looking for some advice on how to troubleshoot jobs we occasionally
> see on our cluster that are stuck in a pending state despite sufficient
> matching resources being free. In the case I’m trying to troubleshoot, the
> Reason field lists (Priority), but I can’t find any way to get the scheduler
> to tell me exactly which higher-priority job is blocking it.
>
>
>
>    - I tried setting the scheduler log level to debug3 for 5 minutes at
>    one point, but my logfile ballooned from 0.5G to 1.5G and didn’t offer any
>    useful info for this case.
>    - I’ve tried ‘scontrol schedloglevel 1’ but it returns the error:
>    ‘slurm_set_schedlog_level error: Requested operation is presently disabled’
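>
> (One thing I haven't tried yet, if I've read the DebugFlags documentation
> correctly, is toggling just the backfill scheduler's logging rather than
> the global debug level, along these lines:
>
>     # enable backfill-specific output in the slurmctld log
>     scontrol setdebugflags +Backfill
>     # ...wait for a backfill cycle or two, then switch it back off
>     scontrol setdebugflags -Backfill
>
> I suspect the schedloglevel error above simply means SlurmSchedLogFile
> isn't set in our slurm.conf, so that route may not be available here.)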
>
>
>
> I’m aware that the backfill scheduler will occasionally hold on to free
> resources in order to schedule a larger job with higher priority, but in
> this case I can’t find any pending job that might fit the bill.
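>
> (For reference, my checking so far has mostly been listing pending jobs on
> the partition in descending priority order and comparing them against the
> stuck job, something like the following, where <partition> is a placeholder
> for the partition name:
>
>     squeue -t PD -p <partition> -S "-p" -o "%.10i %.10Q %.10u %.12l %.12r"
>
> without spotting an obvious culprit.)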
>
>
>
> And to possibly complicate matters, this is on a large partition that has
> no maximum time limit and most pending jobs have no time limits either. (We
> use backfill/fairshare as we have smaller partitions of rarer resources
> that benefit from it, plus we’re aiming to use fairshare even on the
> no-time-limits partitions to help balance out usage).
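>
> (In case the partition configuration is relevant, its limits can be
> confirmed with something like the following, where <partition> is again a
> placeholder:
>
>     scontrol show partition <partition> | grep -E "DefaultTime|MaxTime"
>
> which on this partition reflects the no-maximum-time-limit setup described
> above.)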
>
>
>
> Hoping someone can provide pointers.
>
>
>
> Regards,
>
> Mike
>
>