[slurm-users] slurmctld hanging
maciej.pawlik.xml at gmail.com
Fri Jul 29 09:49:32 UTC 2022
Does slurmctld recover by itself, or does it require a manual restart of
the service? We had some deadlock issues related to MCS handling just after
doing the 19->20->21 upgrades. I don't recall what fixed the issue, but
disabling MCS might be a good place to start if you are using it.
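If MCS is in use, it is enabled via the MCSPlugin line in slurm.conf; a minimal sketch of disabling it might look like the fragment below (the commented-out values are only illustrative — slurmctld needs a restart after the change):

```
# slurm.conf: disable MCS (restart slurmctld after editing)
# MCSPlugin=mcs/group        <- comment out or remove the active plugin
MCSPlugin=mcs/none           # explicit default, i.e. MCS disabled
```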
On Fri, 29 Jul 2022 at 11:34, byron <lbgpublic at gmail.com> wrote:
> Yep, the question of how he has the job set up is an ongoing conversation,
> but for now it is staying like this and I have to make do.
> Even with all the traffic he is generating, though (at worst one job a
> second over the course of a day), I would still have thought that Slurm was
> capable of managing that. And it was, until I did the upgrade.
> On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>> Hi Byron,
>> byron <lbgpublic at gmail.com> writes:
>> > Hi Loris - about a second
>> What is the use-case for that? Are these individual jobs, or is it a job
>> array? Either way, it sounds to me like a very bad idea. On our system,
>> jobs which can start immediately because resources are available still
>> take a few seconds to start running (I'm looking at the values for
>> 'submit' and 'start' from 'sacct'). If a one-second job has to wait for
>> just a minute, the ratio of wait-time to run-time is already
>> disproportionately large.
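The wait-to-run ratio Loris describes can be computed directly from submit/start/elapsed times; a minimal sketch, piping sample values (in epoch seconds, invented for illustration) through awk:

```shell
# Sample job record: submit time, start time (epoch seconds), elapsed seconds.
# A 1-second job that waited 60 s has a wait/run ratio of 60.
printf '1659081600 1659081660 1\n' |
awk '{ wait = $2 - $1; run = $3
       printf "wait=%ds run=%ds ratio=%.0f\n", wait, run, wait/run }'
```

On a real cluster the same numbers come from the Submit, Start, and Elapsed fields of `sacct --format=...`.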
>> Why doesn't the user bundle these individual jobs together? Depending
>> on your maximum run-time and the degree to which jobs can make use of
>> backfill, I would tell the user to submit something between a single job
>> and maybe 100 jobs. I certainly wouldn't allow one-second jobs in any
>> significant numbers on our system.
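Loris's bundling suggestion could look something like the batch script below — a minimal sketch in which the job name, time limit, chunk count, and TASK_CMD are all placeholders for the user's real workload (TASK_CMD defaults to 'echo' here so the loop is self-contained):

```shell
#!/bin/bash
#SBATCH --job-name=bundled-short-tasks   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --time=02:00:00                  # adjust to cover all chunks

# Instead of submitting thousands of one-second jobs, run the short
# tasks sequentially inside a single allocation.
for i in $(seq 1 100); do
    "${TASK_CMD:-echo}" "running chunk $i"
done
```

One such script replaces 100 separate submissions, so the scheduler and slurmdbd see one job record instead of 100.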
>> I think having a job start every second is causing your slurmdbd to
>> time out, and that is the error you are seeing.
>> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <
>> loris.bennett at fu-berlin.de> wrote:
>> > Hi Byron,
>> > byron <lbgpublic at gmail.com> writes:
>> > > Hi
>> > >
>> > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
>> occasionally (3 times in 2 months) have slurmctld hanging so we get the
>> following message when running sinfo
>> > >
>> > > “slurm_load_jobs error: Socket timed out on send/recv operation”
>> > >
>> > > It only seems to happen when one of our users runs a job that
>> submits a short-lived job every second for 5 days (up to 90,000 in a day),
>> although that could be a red herring.
>> > What's your definition of a 'short-lived job'?
>> > > There is nothing to be found in the slurmctld log.
>> > >
>> > > Can anyone suggest how to even start troubleshooting this? Without
>> anything in the logs, I don't know where to start.
>> > >
>> > > Thanks
>> > Cheers,
>> > Loris
>> > --
>> > Dr. Loris Bennett (Herr/Mr)
>> > ZEDAT, Freie Universität Berlin Email
>> loris.bennett at fu-berlin.de