[slurm-users] slurmctld hanging

Loris Bennett loris.bennett at fu-berlin.de
Fri Jul 29 11:03:28 UTC 2022


byron <lbgpublic at gmail.com> writes:

> Yep, the question of how he has the job set up is an ongoing conversation, but for now it is staying like this and I have to make do.

Wow, your user must have friends in high places if he gets to do
something as goofy as starting a one-second job every second.

> Even with all the traffic he is generating though (at worst one per second over the course of a day) I would still have thought that Slurm was capable of managing that.  And it was, until I did the upgrade.

Maybe you were just lucky.  Aren't blocks of jobs going to start
simultaneously if, say, a large MPI job ends and multiple nodes become
available at the same time?

And if there is a delay of more than a second in those jobs starting,
isn't the number of pending jobs just going to increase until the user
hits MaxSubmitJobs?  What happens then?  Or do the friends in high
places ensure that the priorities of this user's jobs are always higher
than everyone else's?

Cheers,

Loris
 

> On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>
>  Hi Byron,
>
>  byron <lbgpublic at gmail.com> writes:
>
>  > Hi Loris - about a second
>
>  What is the use-case for that?  Are these individual jobs or is it a
>  job array?  Either way it sounds to me like a very bad idea.  On our
>  system, jobs which can start immediately because resources are
>  available still take a few seconds to start running (I'm looking at
>  the values for 'submit' and 'start' from 'sacct').  If a one-second
>  job has to wait for just a minute, the ratio of wait-time to run-time
>  is already disproportionately large.
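>
>  For reference, this is roughly the sort of query I mean (the job ID
>  12345 here is just a placeholder):
>
>      sacct -j 12345 --format=JobID,Submit,Start,Elapsed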
>
>  Why doesn't the user bundle these individual jobs together?  Depending
>  on your maximum run-time and to what degree jobs can make use of
>  backfill, I would tell the user something between a single job and
>  maybe 100 jobs.  I certainly wouldn't allow one-second jobs in any
>  significant numbers on our system.
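>
>  As a minimal sketch of what I mean by bundling (the script name
>  'short_task.sh' and the batch size of 100 are made up), something
>  like:
>
>      #!/bin/bash
>      #SBATCH --job-name=bundle
>      #SBATCH --time=00:10:00
>
>      # Run 100 of the short tasks back-to-back in a single allocation
>      for i in $(seq 1 100); do
>          ./short_task.sh "$i"
>      done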
>
>  I think having a job start every second is causing your slurmdbd to
>  time out, and that is the error you are seeing.
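>
>  One way to check whether slurmctld is being swamped is 'sdiag': the
>  top of its output shows the server thread count and the agent queue
>  size, and if those keep climbing while this user's jobs are flowing
>  in, that would support the theory.  For example:
>
>      sdiag | head -n 20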
>
>  Regards
>
>  Loris
>
>  > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>  >
>  >  Hi Byron,
>  >
>  >  byron <lbgpublic at gmail.com> writes:
>  >
>  >  > Hi 
>  >  >
>  >  > We recently upgraded Slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging, so we get the following message when running sinfo
>  >  >
>  >  > “slurm_load_jobs error: Socket timed out on send/recv operation”
>  >  >
>  >  > It only seems to happen when one of our users runs a job that submits a short-lived job every second for 5 days (up to 90,000 in a day), although that could be a red herring.
>  >
>  >  What's your definition of a 'short-lived job'?
>  >
>  >  > There is nothing to be found in the slurmctld log.
>  >  >
>  >  > Can anyone suggest how to even start troubleshooting this?  Without anything in the logs I don't know where to start.
>  >  >
>  >  > Thanks
>  >
>  >  Cheers,
>  >
>  >  Loris
>  >
>  >  -- 
>  >  Dr. Loris Bennett (Herr/Mr)
>  >  ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


