<div dir="ltr">Hi Byron,<div><br></div><div>Does the slurmctld recover by itself, or does it require a manual restart of the service? We had some deadlock issues related to MCS handling just after doing the 19->20->21 upgrades. I don't recall what fixed the issue, but disabling MCS might be a good place to start if you are using it.</div><div><br></div><div>best regards</div><div>Maciej Pawlik </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Fri, 29 Jul 2022 at 11:34 byron <<a href="mailto:lbgpublic@gmail.com">lbgpublic@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Yep, the question of how he has the job set up is an ongoing conversation, but for now it is staying like this and I have to make do.</div><div><br></div><div>Even with all the traffic he is generating though (at worst one job a second over the course of a day) I would still have thought that Slurm was capable of managing that.  And it was, until I did the upgrade.<br></div></div><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <<a href="mailto:loris.bennett@fu-berlin.de" target="_blank">loris.bennett@fu-berlin.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Byron,<br>

<br>

byron <<a href="mailto:lbgpublic@gmail.com" target="_blank">lbgpublic@gmail.com</a>> writes:<br>

<br>

> Hi Loris - about a second<br>

<br>

What is the use-case for that?  Are these individual jobs or is it a job<br>

array?  Either way it sounds to me like a very bad idea.  On our system,<br>

jobs which can start immediately because resources are available, still<br>

take a few seconds to start running (I'm looking at the values for<br>

'submit' and 'start' from 'sacct').  If a one-second job has to wait for<br>

just a minute, the ratio of wait-time to run-time is already<br>

disproportionately large. <br>
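<br>
As a rough illustration of that comparison, here is a minimal Python sketch that<br>
computes wait-time and run-time per job from text in the shape produced by<br>
'sacct -X -n -P --format=JobID,Submit,Start,End'.  The sample rows and the<br>
function name wait_and_run are made up for illustration, not taken from a real system.<br>

```python
from datetime import datetime

# Made-up sample in the shape of
#   sacct -X -n -P --format=JobID,Submit,Start,End
# (pipe-separated, one allocation per line); not real data.
SAMPLE = """\
1001|2022-07-29T07:00:00|2022-07-29T07:00:03|2022-07-29T07:00:04
1002|2022-07-29T07:00:01|2022-07-29T07:01:01|2022-07-29T07:01:02
"""

FMT = "%Y-%m-%dT%H:%M:%S"


def wait_and_run(sacct_text):
    """Return {job_id: (wait_seconds, run_seconds)} for jobs with full timestamps."""
    result = {}
    for line in sacct_text.splitlines():
        fields = line.split("|")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        job_id, submit, start, end = fields
        try:
            t_submit = datetime.strptime(submit, FMT)
            t_start = datetime.strptime(start, FMT)
            t_end = datetime.strptime(end, FMT)
        except ValueError:
            continue  # timestamp is not a date, e.g. the job has not finished
        wait = (t_start - t_submit).total_seconds()
        run = (t_end - t_start).total_seconds()
        result[job_id] = (wait, run)
    return result


for job_id, (wait, run) in wait_and_run(SAMPLE).items():
    print(job_id, "waited", wait, "s and ran", run, "s")
```

In practice the text would come from running sacct for the user and time window in question rather than from a hard-coded sample.<br>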

<br>

Why doesn't the user bundle these individual jobs together?  Depending<br>

on your maximum run-time and to what degree jobs can make use of<br>

backfill, I would tell the user something between a single job and<br>

maybe 100 jobs.  I certainly wouldn't allow one-second jobs in any<br>

significant numbers on our system.<br>
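<br>
One way to picture the bundling is the minimal Python sketch below, which splits a<br>
list of short tasks into at most 100 roughly equal bundles, each of which would then<br>
run as a loop inside a single batch job.  The function name bundle and the limit of<br>
100 are assumptions for illustration, and max_jobs is assumed to be positive.<br>

```python
def bundle(tasks, max_jobs=100):
    """Split tasks into at most max_jobs contiguous, roughly equal bundles."""
    tasks = list(tasks)
    if not tasks:
        return []
    n_jobs = min(max_jobs, len(tasks))
    # Every bundle gets `base` tasks; the first `extra` bundles get one more.
    base, extra = divmod(len(tasks), n_jobs)
    bundles, start = [], 0
    for _ in range(n_jobs):
        size = base + 1 if extra else base
        if extra:
            extra -= 1
        bundles.append(tasks[start:start + size])
        start += size
    return bundles


# A day's worth of one-per-second submissions (90,000 task indices)
# collapsed into 100 bundles.
bundles = bundle(range(90000))
print(len(bundles), "jobs with", len(bundles[0]), "tasks in the first one")
```

How many tasks fit in one bundle depends on the maximum run-time and on how long each task takes, so the limit would need tuning per site.<br>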

<br>

I think having a job starting every second is causing your slurmdbd to<br>

time out, and that is the error you are seeing.<br>

<br>

Regards<br>

<br>

Loris<br>

<br>

> On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <<a href="mailto:loris.bennett@fu-berlin.de" target="_blank">loris.bennett@fu-berlin.de</a>> wrote:<br>

><br>

>  Hi Byron,<br>

><br>

>  byron <<a href="mailto:lbgpublic@gmail.com" target="_blank">lbgpublic@gmail.com</a>> writes:<br>

><br>

>  > Hi <br>

>  ><br>

>  > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging so we get the following message when running sinfo<br>

>  ><br>

>  > “slurm_load_jobs error: Socket timed out on send/recv operation”<br>

>  ><br>

>  > It only seems to happen when one of our users runs a job that submits a short-lived job every second for 5 days (up to 90,000 in a day).  Although that could be a red herring.  <br>

><br>

>  What's your definition of a 'short lived job'?<br>

><br>

>  > There is nothing to be found in the slurmctld log.<br>

>  ><br>

>  > Can anyone suggest how to even start troubleshooting this?  Without anything in the logs I don't know where to start.<br>

>  ><br>

>  > Thanks<br>

><br>

>  Cheers,<br>

><br>

>  Loris<br>

><br>

>  -- <br>

>  Dr. Loris Bennett (Herr/Mr)<br>

>  ZEDAT, Freie Universität Berlin         Email <a href="mailto:loris.bennett@fu-berlin.de" target="_blank">loris.bennett@fu-berlin.de</a><br>

<br>

</blockquote></div>

</blockquote></div>