[slurm-users] [EXT] slurmctld error

Sean Crosby scrosby at unimelb.edu.au
Tue Apr 6 09:46:36 UTC 2021


It looks like your attachment of sinfo -R didn't come through

It also looks like your dbd isn't set up correctly

Can you also show the output of

sacctmgr list cluster

and

scontrol show config | grep ClusterName
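
On a working setup both commands should report the same cluster name. A rough
sketch of what that might look like, assuming the cluster name "tuc" from your
slurmdbd.log (exact columns and spacing depend on your Slurm version):

sacctmgr list cluster format=Cluster,ControlHost,ControlPort
   Cluster     ControlHost  ControlPort
---------- --------------- ------------
       tuc      10.0.0.100         6817

scontrol show config | grep ClusterName
ClusterName             = tuc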

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Tue, 6 Apr 2021 at 19:18, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:

>
> Hi Sean,
>
>
>
> I am trying to submit a simple job but freeze
>
>
>
> srun -n44 -l /bin/hostname
>
> srun: Required node not available (down, drained or reserved)
>
> srun: job 15 queued and waiting for resources
>
> ^Csrun: Job allocation 15 has been revoked
>
> srun: Force Terminated job 15
>
>
>
>
>
> Daemons are active and running on the server and on all nodes.
>
>
>
> The node and partition definitions in slurm.conf are:
>
>
>
> DefMemPerNode=3934
>
> NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2
> State=UNKNOWN
>
> PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
>
>
> tail -10 /var/log/slurmdbd.log
>
>
>
> [2021-04-06T12:09:16.481] error: We should have gotten a new id: Table
> 'slurm_acct_db.tuc_job_table' doesn't exist
>
> [2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to
> register a cluster (tuc) with no remote port
>
> [2021-04-06T12:09:16.482] error: We should have gotten a new id: Table
> 'slurm_acct_db.tuc_job_table' doesn't exist
>
> [2021-04-06T12:09:16.482] error: It looks like the storage has gone away
> trying to reconnect
>
> [2021-04-06T12:09:16.483] error: We should have gotten a new id: Table
> 'slurm_acct_db.tuc_job_table' doesn't exist
>
> [2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to
> register a cluster (tuc) with no remote port
>
> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table
> 'slurm_acct_db.tuc_job_table' doesn't exist
>
> [2021-04-06T12:09:16.484] error: It looks like the storage has gone away
> trying to reconnect
>
> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table
> 'slurm_acct_db.tuc_job_table' doesn't exist
>
> [2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to
> register a cluster (tuc) with no remote port
>
>
>
> tail -10 /var/log/slurmctld.log
>
>
>
> [2021-04-06T12:09:35.701] debug:  backfill: no jobs to backfill
>
> [2021-04-06T12:09:42.001] debug:  slurmdbd: PERSIST_RC is -1 from
> DBD_FLUSH_JOBS(1408): (null)
>
> [2021-04-06T12:10:00.042] debug:  slurmdbd: PERSIST_RC is -1 from
> DBD_FLUSH_JOBS(1408): (null)
>
> [2021-04-06T12:10:05.701] debug:  backfill: beginning
>
> [2021-04-06T12:10:05.701] debug:  backfill: no jobs to backfill
>
> [2021-04-06T12:10:05.989] debug:  sched: Running job scheduler
>
> [2021-04-06T12:10:19.001] debug:  slurmdbd: PERSIST_RC is -1 from
> DBD_FLUSH_JOBS(1408): (null)
>
> [2021-04-06T12:10:35.702] debug:  backfill: beginning
>
> [2021-04-06T12:10:35.702] debug:  backfill: no jobs to backfill
>
> [2021-04-06T12:10:37.001] debug:  slurmdbd: PERSIST_RC is -1 from
> DBD_FLUSH_JOBS(1408): (null)
>
>
>
> Attached sinfo -R
>
>
>
> Any hint?
>
>
>
> jb
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Sean Crosby
> *Sent:* Tuesday, April 6, 2021 7:54 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] [EXT] slurmctld error
>
>
>
> The other thing I notice in my slurmdbd.conf is that I have
>
>
>
> DbdAddr=localhost
> DbdHost=localhost
>
>
>
> You can try changing your slurmdbd.conf to set those 2 values as well to
> see if that gets slurmdbd to listen on port 6819
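>
> For reference, a minimal sketch of the relevant slurmdbd.conf lines (the
> Storage* values here are assumptions based on MySQL running on the same host;
> keep your existing values for anything not shown):
>
> DbdAddr=localhost
> DbdHost=localhost
> DbdPort=6819
> SlurmUser=slurm
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> StorageLoc=slurm_acct_db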
>
>
>
> Sean
>
>
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
>
>
> On Tue, 6 Apr 2021 at 14:31, Sean Crosby <scrosby at unimelb.edu.au> wrote:
>
> Interesting. It looks like slurmdbd is not opening port 6819
>
>
>
> What does
>
>
>
> ss -lntp | grep 6819
>
>
>
> show? Is something else using that port?
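>
> If slurmdbd were listening, you would expect a line roughly like the one below
> (a sketch of the ss output format only; the PID and addresses will differ):
>
> LISTEN 0  128  0.0.0.0:6819  0.0.0.0:*  users:(("slurmdbd",pid=1234,fd=3))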
>
>
>
> You can also stop the slurmdbd service and run it in debug mode using
>
>
>
> slurmdbd -D -vvv
>
>
>
> Sean
>
>
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
>
>
> On Tue, 6 Apr 2021 at 14:02, <ibotsis at isc.tuc.gr> wrote:
>
>
> Hi Sean
>
>
>
> ss -lntp | grep $(pidof slurmdbd)   returns nothing…
>
>
>
> systemctl status slurmdbd.service
>
>
>
> ● slurmdbd.service - Slurm DBD accounting daemon
>
>      Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor
> preset: enabled)
>
>      Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
>
>        Docs: man:slurmdbd(8)
>
>     Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS
> (code=exited, status=0/SUCCESS)
>
>    Main PID: 1453375 (slurmdbd)
>
>       Tasks: 1
>
>      Memory: 5.0M
>
>      CGroup: /system.slice/slurmdbd.service
>
>              └─1453375 /usr/sbin/slurmdbd
>
>
>
> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD
> accounting daemon...
>
> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open
> PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
>
> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting
> daemon.
>
>
>
> The file /run/slurmdbd.pid exists and contains the pidof slurmdbd value.
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Sean Crosby
> *Sent:* Tuesday, April 6, 2021 12:49 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] [EXT] slurmctld error
>
>
>
> What's the output of
>
>
>
> ss -lntp | grep $(pidof slurmdbd)
>
>
>
> on your dbd host?
>
>
>
> Sean
>
>
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
>
>
> On Tue, 6 Apr 2021 at 05:00, <ibotsis at isc.tuc.gr> wrote:
>
>
> Hi Sean,
>
>
>
> 10.0.0.100 is the dbd and ctld host, with hostname se01. The firewall is inactive.
>
>
>
> nc -nz 10.0.0.100 6819 || echo Connection not working
>
>
>
> gives back: Connection not working
>
>
>
> jb
>
>
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Sean Crosby
> *Sent:* Monday, April 5, 2021 2:52 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] [EXT] slurmctld error
>
>
>
> The error shows
>
>
> slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819:
> Connection refused
>
> slurmctld: error: slurm_persist_conn_open_without_init: failed to open
> persistent connection to se01:6819: Connection refused
>
>
>
> Is 10.0.0.100 the IP address of the host running slurmdbd?
>
> If so, check the iptables firewall running on that host, and make sure the
> ctld server can access port 6819 on the dbd host.
>
> You can check this by running the following from the ctld host (requires
> the package nmap-ncat installed)
>
> nc -nz 10.0.0.100 6819 || echo Connection not working
>
> This will try to connect to port 6819 on host 10.0.0.100; it outputs nothing
> if the connection works, and prints Connection not working otherwise.
>
> I would also test this on the DBD server itself
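>
> For example, run the same check from a shell on the DBD host itself (assuming
> nmap-ncat is installed there as well):
>
> nc -nz 10.0.0.100 6819 || echo Connection not working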
>
>  --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
>
>
> On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
>
>
> Hi Sean,
>
>
>
> Thank you for your prompt response. I made the changes you suggested, but
> slurmctld refuses to run. Find attached the new slurmctld -Dvvvv output.
>
>
>
> jb
>
>
>
>
>
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of
> *Sean Crosby
> *Sent:* Monday, April 5, 2021 11:46 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] [EXT] slurmctld error
>
>
>
> Hi Jb,
>
>
>
> You have set AccountingStoragePort to 3306 in slurm.conf, which is the
> MySQL port running on the DBD host.
>
>
>
> AccountingStoragePort is the port for the Slurmdbd service, and not for
> MySQL.
>
>
>
> Change AccountingStoragePort to 6819 and it should fix your issues.
>
>
>
> I also think you should comment out the lines
>
>
>
> AccountingStorageUser=slurm
> AccountingStoragePass=/run/munge/munge.socket.2
>
>
>
> You shouldn't need those lines
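>
> The accounting section of slurm.conf would then look roughly like this (a
> sketch only; se01 is assumed to be your slurmdbd host, substitute your own):
>
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=se01
> AccountingStoragePort=6819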
>
>
>
> Sean
>
>
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
>
>
> On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
>
>
> Hello everyone,
>
>
>
> I installed Slurm 19.05.5 from the Ubuntu repo, for the first time, on a
> cluster with 44 identical nodes, but I have a problem with slurmctld.service.
>
>
>
> When I try to activate slurmctld I get the following message:
>
>
>
> fatal: You are running with a database but for some reason we have no TRES
> from it.  This should only happen if the database is down and you don't
> have any state files
>
>
>
>    - Ubuntu 20.04.2 runs on the server and on all nodes, in exactly the same
>    version.
>    - munge 0.5.13, installed from the Ubuntu repo, is running on the server
>    and all nodes.
>    - mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)),
>    installed from the Ubuntu repo, is running on the server.
>
>
>
> slurm.conf is the same on all nodes and on the server.
>
>
>
> slurmd.service is active and running on all nodes without problem.
>
>
>
> mysql.service is active and running on server.
>
> slurmdbd.service is active and running on server (slurm_acct_db created).
>
>
>
> Find attached slurm.conf, slurmdbd.conf, and the detailed output of the
> slurmctld -Dvvvv command.
>
>
>
> Any hint?
>
>
>
> Thanks in advance
>
>
>
> jb
>
>
>
>
>
>
>
>

