[slurm-users] slurmctld hanging

byron lbgpublic at gmail.com
Thu Jul 28 14:22:00 UTC 2022


Hi Loris, each job runs for about a second.

On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.bennett at fu-berlin.de>
wrote:

> Hi Byron,
>
> byron <lbgpublic at gmail.com> writes:
>
> > Hi
> >
> > We recently upgraded Slurm from 19.05.7 to 20.11.9, and slurmctld now
> > occasionally hangs (3 times in 2 months). When it does, running sinfo
> > returns the following message:
> >
> > “slurm_load_jobs error: Socket timed out on send/recv operation”
> >
> > It only seems to happen when one of our users runs a job that submits a
> > short-lived job every second for 5 days (up to 90,000 in a day), although
> > that could be a red herring.
>
> What's your definition of a 'short lived job'?
>
> > There is nothing to be found in the slurmctld log.
> >
> > Can anyone suggest how to even start troubleshooting this?  Without
> > anything in the logs I don't know where to start.
> >
> > Thanks
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
>
>
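
With nothing in the slurmctld log, one possible starting point is to record exactly when the controller stops answering, so the hangs can be lined up against the job submissions. The following is only a rough sketch, not a tested monitoring tool: the function names probe and watch, the 30-second interval, and the 5-second "slow" threshold are all made up for illustration. The only thing taken from the thread is that sinfo is the command that times out.

```python
import subprocess
import sys
import time


def probe(cmd, timeout=10.0):
    """Run cmd once and return (ok, seconds_elapsed).

    ok is False if the command exits non-zero, exceeds the timeout,
    or cannot be started at all.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - start


def watch(cmd=("sinfo",), interval=30.0, slow=5.0, rounds=None):
    """Probe cmd repeatedly and print a timestamped line for every
    failed or slow reply. rounds=None means run until interrupted.
    All default values here are arbitrary assumptions.
    """
    n = 0
    while rounds is None or n < rounds:
        ok, elapsed = probe(list(cmd))
        if not ok or elapsed > slow:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} ok={ok} elapsed={elapsed:.2f}s", flush=True)
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval)
```

Calling watch() on a login node and redirecting the output to a file would give a list of timestamps at which sinfo failed or was slow, which could then be compared with the submission times of the short-lived jobs.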