[slurm-users] slurmctld hanging
byron
lbgpublic at gmail.com
Fri Jul 29 09:31:20 UTC 2022
Yep, the question of how he has the job set up is an ongoing conversation,
but for now it is staying like this and I have to make do.
Even with all the traffic he is generating, though (at worst one job a
second over the course of a day), I would still have thought that Slurm was
capable of managing that. And it was, until I did the upgrade.
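
For anyone wanting to put numbers on the wait-time versus run-time point
Loris raises in the quoted message, here is a minimal sketch of how the
'submit' and 'start' values from 'sacct' could be compared. It assumes the
pipe-separated output of something like
"sacct -X -n -P --format=JobID,Submit,Start,End". The function name and the
sample records are made up for illustration and are not taken from our
cluster.

```python
from datetime import datetime

# Timestamp layout sacct uses by default, e.g. 2022-07-29T09:00:00
SACCT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def wait_and_run_seconds(sacct_output):
    """Parse 'JobID|Submit|Start|End' lines into (job_id, wait_s, run_s) tuples.

    Expects text as produced by (assumed invocation):
        sacct -X -n -P --format=JobID,Submit,Start,End
    """
    results = []
    for line in sacct_output.strip().splitlines():
        fields = line.split("|")
        if len(fields) != 4:
            continue  # skip anything that is not a four-field record
        job_id, submit, start, end = fields
        try:
            t_submit = datetime.strptime(submit, SACCT_TIME_FORMAT)
            t_start = datetime.strptime(start, SACCT_TIME_FORMAT)
            t_end = datetime.strptime(end, SACCT_TIME_FORMAT)
        except ValueError:
            # Jobs that have not started or finished have no parseable timestamp
            continue
        wait_s = (t_start - t_submit).total_seconds()
        run_s = (t_end - t_start).total_seconds()
        results.append((job_id, wait_s, run_s))
    return results


# Hypothetical sample records, for illustration only
SAMPLE = """\
1001|2022-07-29T09:00:00|2022-07-29T09:00:03|2022-07-29T09:00:04
1002|2022-07-29T09:00:01|Unknown|Unknown
"""

if __name__ == "__main__":
    for job_id, wait_s, run_s in wait_and_run_seconds(SAMPLE):
        print(f"job {job_id}: waited {wait_s:.0f}s, ran {run_s:.0f}s")
```

Feeding real sacct output for the user's jobs into this would show how much
of each job's lifetime is spent waiting rather than running.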
On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.bennett at fu-berlin.de>
wrote:
> Hi Byron,
>
> byron <lbgpublic at gmail.com> writes:
>
> > Hi Loris - about a second
>
> What is the use-case for that? Are these individual jobs or is it a job
> array? Either way it sounds to me like a very bad idea. On our system,
> jobs which can start immediately because resources are available, still
> take a few seconds to start running (I'm looking at the values for
> 'submit' and 'start' from 'sacct'). If a one-second job has to wait for
> just a minute, the ratio of wait-time to run-time is already
> disproportionately large.
>
> Why doesn't the user bundle these individual jobs together? Depending
> on your maximum run-time and to what degree jobs can make use of
> backfill, I would tell the user something between a single job and
> maybe 100 jobs. I certainly wouldn't allow one-second jobs in any
> significant numbers on our system.
>
> I think having a job starting every second is causing your slurmdbd to
> time out and that is the error you are seeing.
>
> Regards
>
> Loris
>
> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.bennett at fu-berlin.de> wrote:
> >
> > Hi Byron,
> >
> > byron <lbgpublic at gmail.com> writes:
> >
> > > Hi
> > >
> > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> > > occasionally (3 times in 2 months) have slurmctld hanging so we get the
> > > following message when running sinfo
> > >
> > > “slurm_load_jobs error: Socket timed out on send/recv operation”
> > >
> > > It only seems to happen when one of our users runs a job that submits
> > > a short-lived job every second for 5 days (up to 90,000 in a day).
> > > Although that could be a red herring.
> >
> > What's your definition of a 'short lived job'?
> >
> > > There is nothing to be found in the slurmctld log.
> > >
> > > Can anyone suggest how to even start troubleshooting this? Without
> > > anything in the logs I don't know where to start.
> > >
> > > Thanks
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Herr/Mr)
> > ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de
>
>