[slurm-users] [EXT] slurmctld error

Sean Crosby scrosby at unimelb.edu.au
Tue Apr 6 04:54:14 UTC 2021


The other thing I notice is that my slurmdbd.conf has

DbdAddr=localhost
DbdHost=localhost

You can try setting those two values in your slurmdbd.conf as well, to
see if that gets slurmdbd to listen on port 6819.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Tue, 6 Apr 2021 at 14:31, Sean Crosby <scrosby at unimelb.edu.au> wrote:

> Interesting. It looks like slurmdbd is not opening port 6819.
>
> What does
>
> ss -lntp | grep 6819
>
> show? Is something else using that port?
>
> You can also stop the slurmdbd service and run it in debug mode using
>
> slurmdbd -D -vvv
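A minimal sketch of the listening check above, for anyone who wants a yes/no answer rather than reading the ss output by eye. The helper name port_is_listening is hypothetical (it is not part of Slurm or iproute2); it only inspects `ss -lnt` style text on stdin and assumes the usual column layout, where the first field is the state and the fourth is the local address and port.

```shell
#!/bin/sh
# Succeeds (exit 0) if the `ss -lnt` style text on stdin shows a
# listening socket on the given TCP port, and fails (exit 1) otherwise.
# Assumes field 1 is the socket state and field 4 is "address:port".
port_is_listening() {
    awk -v suffix=":$1" '
        $1 == "LISTEN" && $4 ~ (suffix "$") { found = 1 }
        END { exit !found }
    '
}

# Example use on the dbd host (6819 is the port discussed in this thread):
#   ss -lnt | port_is_listening 6819 || echo "nothing is listening on 6819"
```

If this reports that nothing is listening while systemd still shows slurmdbd as running, the foreground run with -D -vvv is the next place to look.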
>
> Sean
>
>
> On Tue, 6 Apr 2021 at 14:02, <ibotsis at isc.tuc.gr> wrote:
>
>> * UoM notice: External email. Be cautious of links, attachments, or
>> impersonation attempts *
>> ------------------------------
>>
>> Hi Sean
>>
>>
>> ss -lntp | grep $(pidof slurmdbd)
>>
>> returns nothing.
>>
>> systemctl status slurmdbd.service
>>
>> ● slurmdbd.service - Slurm DBD accounting daemon
>>      Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
>>      Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
>>        Docs: man:slurmdbd(8)
>>     Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
>>    Main PID: 1453375 (slurmdbd)
>>       Tasks: 1
>>      Memory: 5.0M
>>      CGroup: /system.slice/slurmdbd.service
>>              └─1453375 /usr/sbin/slurmdbd
>>
>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.
>>
>>
>>
>> The file /run/slurmdbd.pid exists and contains the same PID that
>> pidof slurmdbd reports.
>>
>>
>>
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
>> Of *Sean Crosby
>> *Sent:* Tuesday, April 6, 2021 12:49 AM
>> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Subject:* Re: [slurm-users] [EXT] slurmctld error
>>
>>
>>
>> What's the output of
>>
>>
>>
>> ss -lntp | grep $(pidof slurmdbd)
>>
>>
>>
>> on your dbd host?
>>
>>
>>
>> Sean
>>
>>
>>
>>
>> On Tue, 6 Apr 2021 at 05:00, <ibotsis at isc.tuc.gr> wrote:
>>
>> Hi Sean,
>>
>>
>>
>> 10.0.0.100 is the host, named se01, that runs both slurmdbd and
>> slurmctld. The firewall is inactive.
>>
>>
>>
>> nc -nz 10.0.0.100 6819 || echo Connection not working
>>
>>
>>
>> gives me back: Connection not working
>>
>>
>>
>> jb
>>
>>
>>
>>
>>
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
>> Of *Sean Crosby
>> *Sent:* Monday, April 5, 2021 2:52 PM
>> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Subject:* Re: [slurm-users] [EXT] slurmctld error
>>
>>
>>
>> The error shows
>>
>>
>> slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
>>
>> slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused
>>
>>
>>
>> Is 10.0.0.100 the IP address of the host running slurmdbd?
>>
>> If so, check the iptables firewall running on that host, and make sure
>> the ctld server can access port 6819 on the dbd host.
>>
>> You can check this by running the following from the ctld host (requires
>> the package nmap-ncat installed)
>>
>> nc -nz 10.0.0.100 6819 || echo Connection not working
>>
>> This tries to connect to port 6819 on host 10.0.0.100. It prints
>> nothing if the connection works and prints "Connection not working"
>> otherwise.
>>
>> I would also run this test on the DBD server itself.
>>
>>
>> On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
>>
>> Hi Sean,
>>
>>
>>
>> Thank you for your prompt response. I made the changes you suggested,
>> but slurmctld still refuses to run. Please find attached the new
>> slurmctld -Dvvvv output.
>>
>>
>>
>> jb
>>
>>
>>
>>
>>
>>
>>
>> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
>> Of *Sean Crosby
>> *Sent:* Monday, April 5, 2021 11:46 AM
>> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Subject:* Re: [slurm-users] [EXT] slurmctld error
>>
>>
>>
>> Hi Jb,
>>
>>
>>
>> You have set AccountingStoragePort to 3306 in slurm.conf, which is the
>> port that MySQL uses on the DBD host.
>>
>>
>>
>> AccountingStoragePort is the port for the slurmdbd service, not for
>> MySQL.
>>
>>
>>
>> Change AccountingStoragePort to 6819 and it should fix your issues.
>>
>>
>>
>> I also think you should comment out the lines
>>
>>
>>
>> AccountingStorageUser=slurm
>> AccountingStoragePass=/run/munge/munge.socket.2
>>
>>
>>
>> You shouldn't need those lines
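A rough sketch of how the port mismatch described above could be caught from the two config files. The helper names conf_get and check_dbd_port are hypothetical, and the file paths in the usage comment are an assumption about where the Ubuntu packages put the files. The sketch relies on 6819 being the documented default for both AccountingStoragePort and DbdPort when the key is not set, and it matches key names case-sensitively, which is stricter than Slurm itself.

```shell
#!/bin/sh
# Print the value of KEY from a Slurm-style "Key=Value" file, ignoring
# commented-out lines. Prints nothing if the key is absent. If the key
# appears more than once, the last occurrence wins.
conf_get() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$2" | tail -n 1
}

# Compare the port slurmctld will connect to (slurm.conf) with the port
# slurmdbd listens on (slurmdbd.conf). Falls back to 6819 for unset keys.
check_dbd_port() {
    ctld_port=$(conf_get AccountingStoragePort "$1")
    dbd_port=$(conf_get DbdPort "$2")
    ctld_port=${ctld_port:-6819}
    dbd_port=${dbd_port:-6819}
    if [ "$ctld_port" = "$dbd_port" ]; then
        echo "ok: both sides use port $dbd_port"
    else
        echo "mismatch: slurm.conf points at $ctld_port, slurmdbd uses $dbd_port"
        return 1
    fi
}

# Example use (paths are an assumption, adjust to your install):
#   check_dbd_port /etc/slurm-llnl/slurm.conf /etc/slurm-llnl/slurmdbd.conf
```

With AccountingStoragePort=3306 in slurm.conf and no DbdPort in slurmdbd.conf, this reports a mismatch, which is the situation described in this message.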
>>
>>
>>
>> Sean
>>
>>
>>
>>
>> On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
>>
>> Hello everyone,
>>
>>
>>
>> I installed Slurm 19.05.5 from the Ubuntu repo for the first time, on a
>> cluster with 44 identical nodes, but I have a problem with
>> slurmctld.service.
>>
>>
>>
>> When I try to start slurmctld I get the following message:
>>
>>
>>
>> fatal: You are running with a database but for some reason we have no
>> TRES from it.  This should only happen if the database is down and you
>> don't have any state files
>>
>>
>>
>>    - Ubuntu 20.04.2 runs on the server and nodes in the exact same
>>    version.
>>    - munge 0.5.13 installed from Ubuntu repo running on server and nodes.
>>    - mysql  Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))
>>    installed from ubuntu repo running on server.
>>
>>
>>
>> slurm.conf is the same on all nodes and on server.
>>
>>
>>
>> slurmd.service is active and running on all nodes without problem.
>>
>>
>>
>> mysql.service is active and running on server.
>>
>> slurmdbd.service is active and running on server (slurm_acct_db created).
>>
>>
>>
>> Find attached slurm.conf, slurmdbd.conf, and the detailed output of the
>> slurmctld -Dvvvv command.
>>
>>
>>
>> Any hint?
>>
>>
>>
>> Thanks in advance
>>
>>
>>
>> jb
>>
>>
>>
>>
>>
>>
>>
>>