[slurm-users] error: persistent connection experienced an error

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Fri Dec 13 20:19:43 UTC 2019


Hi All,

I wonder if any of you have seen these errors in slurmdbd.log:

error: persistent connection experienced an error

When we see these errors, we also see job errors that seem to be related to accounting in Slurm, like:

slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
srun: fatal: slurm_allocation_msg_thr_create: pthread_create error Resource temporarily unavailable

I haven't been able to figure out what puts slurmdbd into this state. The Slurm controller and slurmdbd are on the same box, so it's all the more odd that slurmdbd can't communicate with slurmctld. While we figure this out, we have begun restarting slurmctld and slurmdbd every day to try to keep them "in sync".
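To see whether the slurmdbd errors line up in time with the job failures, something like the rough sketch below could tally the error lines per hour. It assumes each log line starts with a bracketed timestamp such as [2019-12-13T20:19:43.123]; the function name and the log path passed on the command line are just placeholders, not anything Slurm provides.

```python
import re
import sys
from collections import Counter

# The message text as it appears in slurmdbd.log.
ERROR_TEXT = "persistent connection experienced an error"

# Assumption: lines begin with "[YYYY-MM-DDTHH:MM:SS.mmm]"; capture up to the hour.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2})")


def count_errors_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH' to the number of matching error lines."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = STAMP.match(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        key = match.group(1) if match else "unknown"
        counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_errors.py /path/to/slurmdbd.log  (path is site-specific)
    with open(sys.argv[1], errors="replace") as handle:
        for hour, total in sorted(count_errors_per_hour(handle).items()):
            print(hour, total)
```

Comparing the busiest hours against the times of the failed jobs might show whether the two are actually connected.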

Has anyone seen this? Any thoughts? Maybe the single port shown by:

sacctmgr list cluster

becomes overwhelmed at times? We have a range of ports on which the controller can be contacted. Maybe slurmdbd should try another port in that range if that's the issue?

SlurmctldPort=6900-6950
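
As a quick sanity check on that range, a sketch like the one below could expand the SlurmctldPort value and try a plain TCP connection to each port. The helper names and the host argument are made up for illustration, and a successful connect only shows that something is listening on the port, not whether slurmctld is overloaded there.

```python
import socket
import sys


def parse_port_range(spec):
    """Expand a SlurmctldPort value such as '6900-6950' (or a single port) into a list."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    if end < start:
        raise ValueError("port range end is below its start: " + spec)
    return list(range(start, end + 1))


def port_accepts_connection(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_ports.py <controller-host>  (host is site-specific)
    for port in parse_port_range("6900-6950"):
        state = "open" if port_accepts_connection(sys.argv[1], port) else "closed"
        print(port, state)
```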

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 


