[slurm-users] sacct thinks slurmctld is not up

Riebs, Andy andy.riebs at hpe.com
Thu Jul 18 15:09:31 UTC 2019


Brian, FWIW, we just restart slurmctld when this happens. I’ll be interested to hear if there’s a proper fix.

Andy

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Brian Andrus
Sent: Thursday, July 18, 2019 11:01 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [slurm-users] sacct thinks slurmctld is not up

All,

I have slurmdbd running and everything is (mostly) happy. It's been working well for months, but fairly regularly, when I do 'sacctmgr show runaway jobs', I get:

sacctmgr: error: Slurmctld running on cluster orion is not up, can't check running jobs

If I run 'sacctmgr show cluster', it lists the cluster but shows no IP in the ControlHost field.

slurmctld is most definitely running (on the same system, even), but the only fix I have found is to restart it. After the restart, 'sacctmgr show cluster' shows an IP in the ControlHost field and I can check for runaway jobs again.
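
For anyone who wants to detect this state without eyeballing the table, here is a minimal sketch of the check described above. It assumes the parsable output of 'sacctmgr -n -P show cluster format=Cluster,ControlHost' is one "Cluster|ControlHost" pair per line; the function name clusters_without_controlhost is made up for illustration, and the restart line assumes a systemd-managed slurmctld, so it is left commented out.

```shell
#!/bin/sh
# Print the names of clusters whose ControlHost field is empty.
# Reads "Cluster|ControlHost" lines on stdin, the layout assumed for:
#   sacctmgr -n -P show cluster format=Cluster,ControlHost
# (clusters_without_controlhost is a hypothetical helper name.)
clusters_without_controlhost() {
    awk -F'|' '$1 != "" && $2 == "" { print $1 }'
}

# Only query the accounting database when sacctmgr is installed.
if command -v sacctmgr >/dev/null 2>&1; then
    missing=$(sacctmgr -n -P show cluster format=Cluster,ControlHost \
        | clusters_without_controlhost)
    if [ -n "$missing" ]; then
        echo "No ControlHost registered for: $missing"
        # Workaround from this thread; assumes slurmctld runs under systemd.
        # systemctl restart slurmctld
    fi
fi
```

This only automates the symptom check and the restart workaround; it does not explain why the ControlHost entry goes missing in the first place.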

Is this a known issue? Is there a better fix than restarting slurmctld?

Brian Andrus