[slurm-users] slurmctld hanging

byron lbgpublic at gmail.com
Fri Jul 29 09:31:20 UTC 2022


Yep, the question of how he has the job set up is an ongoing conversation,
but for now it is staying like this and I have to make do.

Even with all the traffic he is generating, though (at worst one job a second
over the course of a day), I would still have thought that Slurm was capable
of managing that.  And it was, until I did the upgrade.


On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.bennett at fu-berlin.de>
wrote:

> Hi Byron,
>
> byron <lbgpublic at gmail.com> writes:
>
> > Hi Loris - about a second
>
> What is the use-case for that?  Are these individual jobs or is it a job
> array?  Either way it sounds to me like a very bad idea.  On our system,
> jobs which can start immediately because resources are available, still
> take a few seconds to start running (I'm looking at the values for
> 'submit' and 'start' from 'sacct').  If a one-second job has to wait for
> just a minute, the ratio of wait-time to run-time is already
> disproportionately large.
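To put a number on that ratio, here is a minimal sketch in Python. It assumes you have already pulled the 'submit', 'start' and 'end' values for a job out of sacct as ISO-style timestamps; the function name and the example timestamps are made up for illustration and are not taken from the cluster in question.

```python
from datetime import datetime


def wait_to_run_ratio(submit: str, start: str, end: str) -> float:
    """Return wait-time divided by run-time for a single job.

    Arguments are timestamps such as '2022-07-29T09:00:00'
    (assumed format, as printed for the submit/start/end fields).
    """
    t_submit = datetime.fromisoformat(submit)
    t_start = datetime.fromisoformat(start)
    t_end = datetime.fromisoformat(end)
    wait = (t_start - t_submit).total_seconds()
    run = (t_end - t_start).total_seconds()
    if run <= 0:
        raise ValueError("run time must be positive")
    return wait / run


# Hypothetical one-second job that waited one minute in the queue.
ratio = wait_to_run_ratio(
    "2022-07-29T09:00:00",  # submit
    "2022-07-29T09:01:00",  # start
    "2022-07-29T09:01:01",  # end
)
print(ratio)  # 60.0
```

In this made-up case the job spends sixty times longer waiting than running.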
>
> Why doesn't the user bundle these individual jobs together?  Depending
> on your maximum run-time and to what degree jobs can make use of
> backfill, I would tell the user something between a single job and
> maybe 100 jobs.  I certainly wouldn't allow one-second jobs in any
> significant numbers on our system.
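A rough sketch of the bundling idea, in Python. It only shows the splitting step: the task names are placeholders, the figure of 90,000 tasks is the daily maximum mentioned earlier in the thread, and how each batch is then wrapped in a single job script is left to the user. Nothing here is specific to Slurm.

```python
def bundle(tasks, n_jobs):
    """Split tasks into at most n_jobs contiguous, near-equal batches."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    if not tasks:
        return []
    n_jobs = min(n_jobs, len(tasks))
    base, extra = divmod(len(tasks), n_jobs)
    batches = []
    pos = 0
    for i in range(n_jobs):
        # The first 'extra' batches take one additional task each.
        size = base + (1 if i < extra else 0)
        batches.append(tasks[pos:pos + size])
        pos += size
    return batches


# Placeholder task names; one day's worth at the quoted maximum.
tasks = [f"task-{i}" for i in range(90_000)]
batches = bundle(tasks, 100)
print(len(batches), len(batches[0]))  # 100 900
```

With tasks of about a second each, 900 tasks per batch comes to roughly 15 minutes of run-time per job, and the scheduler sees 100 submissions a day instead of 90,000.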
>
> I think having a job starting every second is causing your slurmdbd to
> time out and that is the error you are seeing.
>
> Regards
>
> Loris
>
> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.bennett at fu-berlin.de> wrote:
> >
> >  Hi Byron,
> >
> >  byron <lbgpublic at gmail.com> writes:
> >
> >  > Hi
> >  >
> >  > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> >  > occasionally (3 times in 2 months) have slurmctld hanging so we get the
> >  > following message when running sinfo
> >  >
> >  > “slurm_load_jobs error: Socket timed out on send/recv operation”
> >  >
> >  > It only seems to happen when one of our users runs a job that submits
> >  > a short lived job every second for 5 days (up to 90,000 in a day).
> >  > Although that could be a red herring.
> >
> >  What's your definition of a 'short lived job'?
> >
> >  > There is nothing to be found in the slurmctld log.
> >  >
> >  > Can anyone suggest how to even start troubleshooting this?  Without
> >  > anything in the logs I don't know where to start.
> >  >
> >  > Thanks
> >
> >  Cheers,
> >
> >  Loris
> >
> >  --
> >  Dr. Loris Bennett (Herr/Mr)
> >  ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
>
>