[slurm-users] slurmctld hanging
maciej.pawlik.xml at gmail.com
Fri Jul 29 09:49:32 UTC 2022
Does slurmctld recover by itself, or does it require a manual restart of
the service? We had some deadlock issues related to MCS handling just after
doing the 19->20->21 upgrades. I don't recall what fixed the issue, but
disabling MCS might be a good place to start if you are using it.
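If MCS is in use, it is enabled via the MCSPlugin line in slurm.conf; a minimal sketch of disabling it might look like the fragment below (the commented-out values are only illustrative — slurmctld needs a restart after the change):

```
# slurm.conf: disable MCS (restart slurmctld after editing)
# MCSPlugin=mcs/group        <- comment out or remove the active plugin
MCSPlugin=mcs/none           # explicit default, i.e. MCS disabled
```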
On Fri, 29 Jul 2022 at 11:34, byron <lbgpublic at gmail.com> wrote:
> Yep, the question of how he has the job set up is an ongoing conversation,
> but for now it is staying like this and I have to make do.
> Even with all the traffic he is generating, though (at worst one job a
> second over the course of a day), I would still have thought that Slurm was
> capable of managing that. And it was, until I did the upgrade.
> On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>> Hi Byron,
>> byron <lbgpublic at gmail.com> writes:
>> > Hi Loris - about a second
>> What is the use-case for that? Are these individual jobs, or is it a job
>> array? Either way, it sounds to me like a very bad idea. On our system,
>> jobs which can start immediately because resources are available still
>> take a few seconds to start running (I'm looking at the values for
>> 'submit' and 'start' from 'sacct'). If a one-second job has to wait for
>> just a minute, the ratio of wait-time to run-time is already
>> disproportionately large.
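The wait-to-run ratio Loris describes can be computed directly from submit/start/elapsed times; a minimal sketch, piping sample values (in epoch seconds, invented for illustration) through awk:

```shell
# Sample job record: submit time, start time (epoch seconds), elapsed seconds.
# A 1-second job that waited 60 s has a wait/run ratio of 60.
printf '1659081600 1659081660 1\n' |
awk '{ wait = $2 - $1; run = $3
       printf "wait=%ds run=%ds ratio=%.0f\n", wait, run, wait/run }'
```

On a real cluster the same numbers come from the Submit, Start, and Elapsed fields of `sacct --format=...`.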
>> Why doesn't the user bundle these individual jobs together? Depending
>> on your maximum run-time and the degree to which jobs can make use of
>> backfill, I would tell the user to submit something between a single job
>> and maybe 100 jobs. I certainly wouldn't allow one-second jobs in any
>> significant numbers on our system.
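Loris's bundling suggestion could look something like the batch script below — a minimal sketch in which the job name, time limit, chunk count, and TASK_CMD are all placeholders for the user's real workload (TASK_CMD defaults to 'echo' here so the loop is self-contained):

```shell
#!/bin/bash
#SBATCH --job-name=bundled-short-tasks   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --time=02:00:00                  # adjust to cover all chunks

# Instead of submitting thousands of one-second jobs, run the short
# tasks sequentially inside a single allocation.
for i in $(seq 1 100); do
    "${TASK_CMD:-echo}" "running chunk $i"
done
```

One such script replaces 100 separate submissions, so the scheduler and slurmdbd see one job record instead of 100.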
>> I think having a job start every second is causing your slurmdbd to
>> time out, and that is the error you are seeing.
>> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <
>> loris.bennett at fu-berlin.de> wrote:
>> > Hi Byron,
>> > byron <lbgpublic at gmail.com> writes:
>> > > Hi
>> > >
>> > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
>> occasionally (3 times in 2 months) have slurmctld hanging so we get the
>> following message when running sinfo
>> > >
>> > > “slurm_load_jobs error: Socket timed out on send/recv operation”
>> > >
>> > > It only seems to happen when one of our users runs a job that
>> submits a short-lived job every second for 5 days (up to 90,000 in a day),
>> although that could be a red herring.
>> > What's your definition of a 'short-lived job'?
>> > > There is nothing to be found in the slurmctld log.
>> > >
>> > > Can anyone suggest how to even start troubleshooting this? Without
>> anything in the logs, I don't know where to start.
>> > >
>> > > Thanks
>> > Cheers,
>> > Loris
>> > --
>> > Dr. Loris Bennett (Herr/Mr)
>> > ZEDAT, Freie Universität Berlin Email
>> loris.bennett at fu-berlin.de