[slurm-users] bug 2119 with slurm 18.08.2

Magnus Jonsson magnus at hpc2n.umu.se
Mon Nov 12 00:51:16 MST 2018


We had the same problem on our clusters. It was caused by our MySQL backup
script locking the tables (and taking too long).

Look at the 'mod_time' and 'control_host' columns of 'cluster_table' in
the database:

select mod_time,control_host from cluster_table;

We found that 'mod_time' matched the backup time exactly and that the
'control_host' column was empty.
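To make that comparison less error-prone, here is a minimal sketch in Python. It assumes 'mod_time' holds Unix epoch seconds and that you have already fetched the rows (for example with "select name, mod_time, control_host from cluster_table;") through whatever client you use. The helper name find_suspect_clusters and the backup-window parameters are hypothetical, made up for illustration:

```python
from datetime import datetime, timezone


def find_suspect_clusters(rows, backup_start, backup_end):
    """Hypothetical helper: flag clusters that look hit by a locking backup.

    rows: iterable of (name, mod_time, control_host) tuples from cluster_table.
          mod_time is ASSUMED to be Unix epoch seconds.
    backup_start, backup_end: timezone-aware datetimes bounding the backup run.

    Returns (name, ISO-8601 mod_time) for every row whose mod_time falls inside
    the backup window AND whose control_host is empty or NULL.
    """
    suspects = []
    for name, mod_time, control_host in rows:
        # Convert the epoch value to an aware UTC datetime for comparison.
        modified = datetime.fromtimestamp(mod_time, tz=timezone.utc)
        in_window = backup_start <= modified <= backup_end
        # An empty string or None (SQL NULL) both count as "no control_host".
        if in_window and not control_host:
            suspects.append((name, modified.isoformat()))
    return suspects


if __name__ == "__main__":
    # Made-up sample rows, not real data.
    sample = [("devops", 1541984400, ""), ("prod", 1541900000, "10.0.0.5")]
    start = datetime(2018, 11, 12, 0, 30, tzinfo=timezone.utc)
    end = datetime(2018, 11, 12, 1, 30, tzinfo=timezone.utc)
    print(find_suspect_clusters(sample, start, end))
```

Run against the made-up rows above, only "devops" is flagged: its mod_time (2018-11-12 01:00 UTC) falls inside the window and its control_host is empty.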

I hope this helps you move forward with your problem.

Best regards,
Magnus

On 2018-11-08 19:44, Brian Andrus wrote:
> All,
> I am seeing what looks like the same issue as 
> https://bugs.schedmd.com/show_bug.cgi?id=2119
> 
> That is, slurmctld does not pick up new accounts unless it is restarted.
> 
> I have 4 clusters (non-federated), all using the same slurmdbd.
> When I added an association for user name=me cluster=DevOps
> account=Project1 and then tried to start a job, I kept getting an error:
> *srun: error: Unable to allocate resources: Invalid account or 
> account/partition combination specified*
> 
> Then I restarted slurmctld on the DevOps master and my job ran fine.
> 
> Is slurmctld caching data from slurmdbd somehow?
> 
> This is an issue in a production environment. We don't want to have to
> restart all the slurmctld daemons any time there is a change to any
> associations. That could get painful.
> 
> Brian Andrus

-- 
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
