[slurm-users] slurmctld hanging
Loris Bennett
loris.bennett at fu-berlin.de
Fri Jul 29 05:57:41 UTC 2022
Hi Byron,
byron <lbgpublic at gmail.com> writes:
> Hi Loris - about a second
What is the use-case for that? Are these individual jobs or is it a job
array? Either way it sounds to me like a very bad idea. On our system,
jobs which can start immediately because resources are available still
take a few seconds to start running (I'm looking at the values for
'submit' and 'start' from 'sacct'). If a one-second job has to wait for
just a minute, the ratio of wait-time to run-time is already
disproportionately large.
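
To put a number on that ratio, here is a minimal Python sketch. It assumes
timestamps in the YYYY-MM-DDTHH:MM:SS form that 'sacct' prints by default
for its Submit, Start and End fields. The function name and the example
timestamps are made up for illustration and are not taken from any real job.

```python
from datetime import datetime

# Assumed timestamp layout: YYYY-MM-DDTHH:MM:SS (sacct's default format).
SACCT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def wait_to_run_ratio(submit: str, start: str, end: str) -> float:
    """Return wait-time divided by run-time for a single job.

    wait-time = start - submit, run-time = end - start, both in seconds.
    """
    t_submit = datetime.strptime(submit, SACCT_TIME_FORMAT)
    t_start = datetime.strptime(start, SACCT_TIME_FORMAT)
    t_end = datetime.strptime(end, SACCT_TIME_FORMAT)

    wait = (t_start - t_submit).total_seconds()
    run = (t_end - t_start).total_seconds()
    if run <= 0:
        # Timestamps have one-second resolution, so a very short job
        # can show up with a run-time of zero.
        raise ValueError("run-time must be positive to form a ratio")
    return wait / run


# Hypothetical one-second job that waited one minute in the queue.
print(wait_to_run_ratio("2022-07-28T12:00:00",
                        "2022-07-28T12:01:00",
                        "2022-07-28T12:01:01"))  # prints 60.0
```

In this made-up case the job spends sixty times longer waiting than
running.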
Why doesn't the user bundle these individual jobs together? Depending
on your maximum run-time and to what degree jobs can make use of
backfill, I would tell the user to submit something between a single job
and maybe 100 jobs. I certainly wouldn't allow one-second jobs in any
significant numbers on our system.
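
As a rough illustration of such bundling, the following Python sketch
splits a list of task IDs into a fixed number of near-equal bundles, each
of which could then be run as one job that loops over its tasks. The
function name 'bundle' and the integer task IDs are hypothetical. The
figures of 90,000 tasks and 100 jobs are the ones mentioned in this thread,
used only as an example.

```python
def bundle(tasks, n_jobs):
    """Split tasks into at most n_jobs contiguous bundles of near-equal size."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    tasks = list(tasks)
    if not tasks:
        return []

    # Never create more bundles than there are tasks.
    n_jobs = min(n_jobs, len(tasks))
    size, extra = divmod(len(tasks), n_jobs)

    bundles = []
    pos = 0
    for i in range(n_jobs):
        # The first 'extra' bundles take one additional task each.
        step = size + (1 if i < extra else 0)
        bundles.append(tasks[pos:pos + step])
        pos += step
    return bundles


# One day's worth of one-second tasks, packed into 100 jobs.
bundles = bundle(range(90000), 100)
print(len(bundles), len(bundles[0]))  # prints 100 900
```

With these numbers each job would work through 900 tasks, i.e. about 15
minutes of run-time if each task really takes around a second, so the
scheduler handles 100 submissions a day instead of 90,000.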
I think having a job start every second is causing your slurmdbd to
time out, and that is the error you are seeing.
Regards
Loris
> On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>
> Hi Byron,
>
> byron <lbgpublic at gmail.com> writes:
>
> > Hi
> >
> > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging so we get the following message when running sinfo
> >
> > “slurm_load_jobs error: Socket timed out on send/recv operation”
> >
> > It only seems to happen when one of our users runs a job that submits a short-lived job every second for 5 days (up to 90,000 in a day). Although that could be a red herring.
>
> What's your definition of a 'short lived job'?
>
> > There is nothing to be found in the slurmctld log.
> >
> > Can anyone suggest how to even start troubleshooting this? Without anything in the logs I don't know where to start.
> >
> > Thanks
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de