[slurm-users] slurmctld hanging

Loris Bennett loris.bennett at fu-berlin.de
Thu Jul 28 13:45:24 UTC 2022


Hi Byron,

byron <lbgpublic at gmail.com> writes:

> Hi 
>
> We recently upgraded Slurm from 19.05.7 to 20.11.9, and now slurmctld occasionally hangs (3 times in 2 months), so we get the following message when running sinfo:
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job that submits a short-lived job every second for 5 days (up to 90,000 in a day), although that could be a red herring.

What's your definition of a 'short-lived job'?

> There is nothing to be found in the slurmctld log.
>
> Can anyone suggest how to even start troubleshooting this?  Without anything in the logs I don't know where to start.
>
> Thanks

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


