[slurm-users] slurmctld hanging
byron
lbgpublic at gmail.com
Thu Jul 28 13:25:35 UTC 2022
Hi
We recently upgraded Slurm from 19.05.7 to 20.11.9, and now slurmctld
occasionally hangs (3 times in 2 months). When it does, running sinfo
gives the following message:
“slurm_load_jobs error: Socket timed out on send/recv operation”
It only seems to happen when one of our users runs a job that submits a
short-lived job every second for 5 days (up to 90,000 in a day), although
that could be a red herring.
There is nothing to be found in the slurmctld log.
Can anyone suggest how to even start troubleshooting this? Without
anything in the logs I don't know where to start.
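One thing I could imagine trying is a small probe that times a client command and logs when it stops responding, so there are at least timestamps to compare against the submission bursts. This is only a minimal sketch, assuming Python 3 on a login node. The `probe` and `watch` helpers, the 10 second timeout, and the 60 second interval are my own illustrative choices, not anything from the Slurm documentation:

```python
import subprocess
import time
from datetime import datetime, timezone


def probe(cmd, timeout_s=10.0):
    """Run cmd once and return (ok, elapsed_seconds).

    ok is False if the command timed out, could not be started,
    or exited with a non-zero status.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed once this expires
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - start


def watch(cmd, interval_s=60.0, timeout_s=10.0):
    """Probe cmd in an endless loop, printing one timestamped line per attempt."""
    while True:
        ok, elapsed = probe(cmd, timeout_s)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        print(f"{stamp} ok={ok} elapsed={elapsed:.2f}s", flush=True)
        time.sleep(interval_s)
```

Calling `watch(["sinfo"])` and redirecting the output to a file would record when the controller stops answering and how long responses took beforehand. It would not explain the cause of the hang.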
Thanks