[slurm-users] error: persistent connection experienced an error
Christopher Benjamin Coffey
Chris.Coffey at nau.edu
Fri Dec 13 20:19:43 UTC 2019
Hi All,
Have any of you seen this error in slurmdbd.log?
error: persistent connection experienced an error
When these errors appear, we also see job errors that seem related to accounting in Slurm, such as:
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
srun: fatal: slurm_allocation_msg_thr_create: pthread_create error Resource temporarily unavailable
I haven't been able to figure out what puts slurmdbd into this condition. The Slurm controller and slurmdbd are on the same box, so it's all the more odd that slurmdbd can't communicate with slurmctld. While we figure this out, we have begun restarting slurmctld and slurmdbd every day to try to keep them "in sync".
Has anyone seen this? Any thoughts? Maybe the single controller port shown by:
sacctmgr list cluster
becomes overwhelmed at times? We have a range of ports on which the controller can be contacted. Maybe slurmdbd should try another port if that's the issue?
SlurmctldPort=6900-6950
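
To make the port question concrete, here is a minimal sketch for checking whether the controller port registered in the accounting database falls inside the configured SlurmctldPort range. The helper name port_in_range is hypothetical, and the commented usage assumes that sacctmgr accepts the ControlPort format field together with the -n (no header) and -P (parsable) options. It only compares numbers and does not show whether the port is actually overloaded.

```shell
# Hypothetical helper (not part of Slurm): succeed if PORT lies within RANGE.
# RANGE is either "LOW-HIGH" (e.g. 6900-6950) or a single port number.
port_in_range() {
    port=$1
    range=$2
    low=${range%-*}    # text before the dash (the whole string if no dash)
    high=${range#*-}   # text after the dash (the whole string if no dash)
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}

# Usage on a live cluster, commented out because it needs Slurm installed.
# Assumption: the first output line holds the port for the cluster of interest.
# port=$(sacctmgr -n -P list cluster format=ControlPort | head -n 1)
# port_in_range "$port" 6900-6950 && echo "registered port $port is in range"
```

If the registered port is in range, the remaining question is whether slurmdbd only ever uses that one port from the range when contacting slurmctld.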
Best,
Chris
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167