[slurm-users] [EXT] slurmctld error
ibotsis at isc.tuc.gr
Tue Apr 6 10:17:41 UTC 2021
sacctmgr list cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
tuc 0 0 1 normal
scontrol show config | grep ClusterName
ClusterName = tuc
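The empty ControlHost and ControlPort columns in the sacctmgr output above suggest slurmctld has never successfully registered with slurmdbd. A quick way to check whether the per-cluster accounting tables were ever created is a sketch like the following, assuming the slurm_acct_db database name from the slurmdbd log further down and a slurm MySQL user:
mysql -u slurm -p slurm_acct_db -e "SHOW TABLES LIKE 'tuc%';"
A healthy registration would list tables such as tuc_job_table and tuc_assoc_table.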
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 12:47 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error
It looks like your attachment of sinfo -R didn't come through
It also looks like your dbd isn't set up correctly
Can you also show the output of
sacctmgr list cluster
and
scontrol show config | grep ClusterName
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 19:18, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
Hi Sean,
I am trying to submit a simple job, but it freezes:
srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15
The daemons are active and running on the server and on all nodes.
The node definitions in slurm.conf are:
DefMemPerNode=3934
NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 State=UNKNOWN
PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP
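Since srun reported the nodes as down, drained or reserved, it is also worth checking the node states directly and, once whatever drained them is fixed, clearing the state. A sketch (State=RESUME only helps if the nodes are otherwise healthy):
sinfo -N -l
scontrol update NodeName=wn0[01-44] State=RESUME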
tail -10 /var/log/slurmdbd.log
[2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.482] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
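The repeated "Table 'slurm_acct_db.tuc_job_table' doesn't exist" errors mean slurmdbd failed to create the per-cluster tables. One thing worth checking is the InnoDB configuration: the Slurm accounting documentation recommends minimum settings like the my.cnf fragment below, without which table creation on MySQL can fail (restart mysql and slurmdbd after changing them):
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900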
tail -10 /var/log/slurmctld.log
[2021-04-06T12:09:35.701] debug: backfill: no jobs to backfill
[2021-04-06T12:09:42.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:00.042] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:05.701] debug: backfill: beginning
[2021-04-06T12:10:05.701] debug: backfill: no jobs to backfill
[2021-04-06T12:10:05.989] debug: sched: Running job scheduler
[2021-04-06T12:10:19.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:35.702] debug: backfill: beginning
[2021-04-06T12:10:35.702] debug: backfill: no jobs to backfill
[2021-04-06T12:10:37.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
Attached sinfo -R
Any hint?
jb
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 7:54 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error
The other thing I notice for my slurmdbd.conf is that I have
DbdAddr=localhost
DbdHost=localhost
You can try changing your slurmdbd.conf to set those 2 values as well to see if that gets slurmdbd to listen on port 6819
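For reference, a minimal slurmdbd.conf sketch consistent with that advice; the Storage* values are illustrative and have to match your actual MySQL setup:
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=changeme
StorageLoc=slurm_acct_db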
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 14:31, Sean Crosby <scrosby at unimelb.edu.au> wrote:
Interesting. It looks like slurmdbd is not opening the 6819 port
What does
ss -lntp | grep 6819
show? Is something else using that port?
You can also stop the slurmdbd service and run it in debug mode using
slurmdbd -D -vvv
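With systemd in the picture, that would look something like the sketch below; run the daemon as whatever user the service normally uses so it can read slurmdbd.conf:
systemctl stop slurmdbd
slurmdbd -D -vvv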
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 14:02, <ibotsis at isc.tuc.gr> wrote:
Hi Sean
ss -lntp | grep $(pidof slurmdbd) returns nothing.
systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
Docs: man:slurmdbd(8)
Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 1453375 (slurmdbd)
Tasks: 1
Memory: 5.0M
CGroup: /system.slice/slurmdbd.service
└─1453375 /usr/sbin/slurmdbd
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.
The file /run/slurmdbd.pid exists and contains the PID of slurmdbd.
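That systemd warning is often just a PID file path mismatch between slurmdbd.conf and the unit file. A quick way to compare the two (the /etc/slurm-llnl path is the Ubuntu packaging default; adjust if yours differs):
grep -i pidfile /etc/slurm-llnl/slurmdbd.conf
grep -i pidfile /lib/systemd/system/slurmdbd.service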
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 12:49 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error
What's the output of
ss -lntp | grep $(pidof slurmdbd)
on your dbd host?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 05:00, <ibotsis at isc.tuc.gr> wrote:
Hi Sean,
10.0.0.100 is the dbd and ctld host, named se01. The firewall is inactive.
nc -nz 10.0.0.100 6819 || echo Connection not working
gives back: Connection not working
jb
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Monday, April 5, 2021 2:52 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error
The error shows
slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused
Is 10.0.0.100 the IP address of the host running slurmdbd?
If so, check the iptables firewall running on that host, and make sure the ctld server can access port 6819 on the dbd host.
You can check this by running the following from the ctld host (requires the package nmap-ncat installed)
nc -nz 10.0.0.100 6819 || echo Connection not working
This will try connecting to port 6819 on the host 10.0.0.100; it outputs nothing if the connection works, and prints "Connection not working" otherwise.
I would also test this on the DBD server itself
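A loopback variant of the same test, run on the DBD server itself, separates a listening problem from a network or firewall problem:
nc -nz 127.0.0.1 6819 || echo Connection not working
If this fails locally too, slurmdbd is simply not listening on 6819.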
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
Hi Sean,
Thank you for your prompt response. I made the changes you suggested, but slurmctld still refuses to run. Find attached the new slurmctld -Dvvvv output.
jb
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Monday, April 5, 2021 11:46 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error
Hi Jb,
You have set AccountingStoragePort to 3306 in slurm.conf, which is the MySQL port running on the DBD host.
AccountingStoragePort is the port for the Slurmdbd service, and not for MySQL.
Change AccountingStoragePort to 6819 and it should fix your issues.
I also think you should comment out the lines
AccountingStorageUser=slurm
AccountingStoragePass=/run/munge/munge.socket.2
You shouldn't need those lines
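Putting both suggestions together, the accounting section of slurm.conf would look something like this sketch, with se01 as the dbd host per the rest of the thread:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=se01
AccountingStoragePort=6819
#AccountingStorageUser=slurm
#AccountingStoragePass=/run/munge/munge.socket.2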
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
Hello everyone,
I installed Slurm 19.05.5 from the Ubuntu repo, for the first time, on a cluster with 44 identical nodes, but I have a problem with slurmctld.service.
When I try to start slurmctld I get the following message:
fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files
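Since slurmctld exits fatally here, scontrol is unavailable, so the quickest check of which accounting endpoint it is configured to talk to is the config file itself (path per the Ubuntu packaging; adjust as needed):
grep -E '^AccountingStorage' /etc/slurm-llnl/slurm.conf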
* Ubuntu 20.04.2 runs on the server and on all nodes, in exactly the same version.
* munge 0.5.13, installed from the Ubuntu repo, is running on the server and the nodes.
* mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)), installed from the Ubuntu repo, is running on the server.
slurm.conf is the same on all nodes and on the server.
slurmd.service is active and running on all nodes without problem.
mysql.service is active and running on server.
slurmdbd.service is active and running on server (slurm_acct_db created).
Find attached slurm.conf, slurmdbd.conf, and the detailed output of the slurmctld -Dvvvv command.
Any hint?
Thanks in advance
jb