[slurm-users] [EXT] slurmctld error

Sean Crosby scrosby at unimelb.edu.au
Tue Apr 6 11:11:04 UTC 2021


I just checked my cluster and my spool dir is

SlurmdSpoolDir=/var/spool/slurm

(i.e. without the d at the end)

The exact path doesn't really matter, as long as the directory exists and
has the correct permissions on all nodes.
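
For example, a minimal sketch for creating the directory on every node
(assuming pdsh is available and the nodes are named wn001-wn044;
substitute your own parallel shell or config management):

pdsh -w wn[001-044] 'mkdir -p /var/spool/slurmd && chown slurm:slurm /var/spool/slurmd && chmod 755 /var/spool/slurmd'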
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Tue, 6 Apr 2021 at 20:52, Sean Crosby <scrosby at unimelb.edu.au> wrote:

> I think I've worked out the problem.
>
> I see in your slurm.conf you have this
>
> SlurmdSpoolDir=/var/spool/slurm/d
>
> It should be
>
> SlurmdSpoolDir=/var/spool/slurmd
>
> You'll need to restart slurmd on all the nodes after you make that change.
>
> I would also double-check the permissions on that directory on all your
> nodes. It needs to be owned by the user slurm:
>
> ls -lad /var/spool/slurmd
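>
> A sketch for doing the ownership fix and the restart across all 44 nodes
> at once (assuming pdsh; any parallel shell will do):
>
> pdsh -w wn[001-044] 'chown slurm:slurm /var/spool/slurmd && systemctl restart slurmd'
> pdsh -w wn[001-044] 'ls -lad /var/spool/slurmd'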
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
> On Tue, 6 Apr 2021 at 20:37, Sean Crosby <scrosby at unimelb.edu.au> wrote:
>
>> It looks like your ctld isn't contacting the slurmdbd properly. The
>> control host, control port, etc. are all blank.
>>
>> The first thing I would do is change the ClusterName in your slurm.conf
>> from upper-case TUC to lower-case tuc. You'll then need to restart your
>> ctld, then recheck sacctmgr show cluster.
>>
>> If that doesn't work, try changing AccountingStorageHost in slurm.conf to
>> localhost as well.
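>>
>> Putting it together, the sequence on the head node would look something
>> like this (a sketch, assuming systemd-managed services):
>>
>> # in slurm.conf
>> ClusterName=tuc
>> AccountingStorageHost=localhost   # only if the first change alone doesn't help
>>
>> systemctl restart slurmctld
>> sacctmgr show cluster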
>>
>> For your worker nodes, your nodes are all in drain state.
>>
>> Show the output of
>>
>> scontrol show node wn001
>>
>> It will give you the reason why the node is drained.
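>>
>> Once whatever drained them is fixed, the nodes can be returned to service
>> with something like (run on the head node as root or the slurm user):
>>
>> scontrol update NodeName=wn[001-044] State=RESUME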
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>>
>>
>> On Tue, 6 Apr 2021 at 20:19, <ibotsis at isc.tuc.gr> wrote:
>>
>>> sinfo -N -o "%N %T %C %m %P %a"
>>>
>>> NODELIST STATE CPUS(A/I/O/T) MEMORY PARTITION AVAIL
>>> wn001 drained 0/0/2/2 3934 TUC* up
>>> wn002 drained 0/0/2/2 3934 TUC* up
>>> wn003 drained 0/0/2/2 3934 TUC* up
>>> wn004 drained 0/0/2/2 3934 TUC* up
>>> wn005 drained 0/0/2/2 3934 TUC* up
>>> wn006 drained 0/0/2/2 3934 TUC* up
>>> wn007 drained 0/0/2/2 3934 TUC* up
>>> wn008 drained 0/0/2/2 3934 TUC* up
>>> wn009 drained 0/0/2/2 3934 TUC* up
>>> wn010 drained 0/0/2/2 3934 TUC* up
>>> wn011 drained 0/0/2/2 3934 TUC* up
>>> wn012 drained 0/0/2/2 3934 TUC* up
>>> wn013 drained 0/0/2/2 3934 TUC* up
>>> wn014 drained 0/0/2/2 3934 TUC* up
>>> wn015 drained 0/0/2/2 3934 TUC* up
>>> wn016 drained 0/0/2/2 3934 TUC* up
>>> wn017 drained 0/0/2/2 3934 TUC* up
>>> wn018 drained 0/0/2/2 3934 TUC* up
>>> wn019 drained 0/0/2/2 3934 TUC* up
>>> wn020 drained 0/0/2/2 3934 TUC* up
>>> wn021 drained 0/0/2/2 3934 TUC* up
>>> wn022 drained 0/0/2/2 3934 TUC* up
>>> wn023 drained 0/0/2/2 3934 TUC* up
>>> wn024 drained 0/0/2/2 3934 TUC* up
>>> wn025 drained 0/0/2/2 3934 TUC* up
>>> wn026 drained 0/0/2/2 3934 TUC* up
>>> wn027 drained 0/0/2/2 3934 TUC* up
>>> wn028 drained 0/0/2/2 3934 TUC* up
>>> wn029 drained 0/0/2/2 3934 TUC* up
>>> wn030 drained 0/0/2/2 3934 TUC* up
>>> wn031 drained 0/0/2/2 3934 TUC* up
>>> wn032 drained 0/0/2/2 3934 TUC* up
>>> wn033 drained 0/0/2/2 3934 TUC* up
>>> wn034 drained 0/0/2/2 3934 TUC* up
>>> wn035 drained 0/0/2/2 3934 TUC* up
>>> wn036 drained 0/0/2/2 3934 TUC* up
>>> wn037 drained 0/0/2/2 3934 TUC* up
>>> wn038 drained 0/0/2/2 3934 TUC* up
>>> wn039 drained 0/0/2/2 3934 TUC* up
>>> wn040 drained 0/0/2/2 3934 TUC* up
>>> wn041 drained 0/0/2/2 3934 TUC* up
>>> wn042 drained 0/0/2/2 3934 TUC* up
>>> wn043 drained 0/0/2/2 3934 TUC* up
>>> wn044 drained 0/0/2/2 3934 TUC* up
>>>
>>>
>>>
>>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
>>> Sent: Tuesday, April 6, 2021 12:47 PM
>>> To: Slurm User Community List <slurm-users at lists.schedmd.com>
>>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>>
>>>
>>>
>>> It looks like your attachment of sinfo -R didn't come through
>>>
>>>
>>>
>>> It also looks like your dbd isn't set up correctly
>>>
>>>
>>>
>>> Can you also show the output of
>>>
>>>
>>>
>>> sacctmgr list cluster
>>>
>>>
>>>
>>> and
>>>
>>>
>>>
>>> scontrol show config | grep ClusterName
>>>
>>>
>>>
>>> Sean
>>>
>>>
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 6 Apr 2021 at 19:18, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
>>>
>>> Hi Sean,
>>>
>>>
>>>
>>> I am trying to submit a simple job but it freezes
>>>
>>>
>>>
>>> srun -n44 -l /bin/hostname
>>> srun: Required node not available (down, drained or reserved)
>>> srun: job 15 queued and waiting for resources
>>> ^Csrun: Job allocation 15 has been revoked
>>> srun: Force Terminated job 15
>>>
>>>
>>>
>>>
>>>
>>> The daemons are active and running on the server and on all nodes
>>>
>>>
>>>
>>> The node definition in slurm.conf is:
>>>
>>>
>>>
>>> DefMemPerNode=3934
>>> NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 State=UNKNOWN
>>> PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>
>>>
>>>
>>> tail -10 /var/log/slurmdbd.log
>>>
>>>
>>>
>>> [2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>>> [2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
>>> [2021-04-06T12:09:16.482] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>>> [2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
>>> [2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>>> [2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
>>> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>>> [2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
>>> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>>> [2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
>>>
>>>
>>>
>>> tail -10 /var/log/slurmctld.log
>>>
>>>
>>>
>>> [2021-04-06T12:09:35.701] debug:  backfill: no jobs to backfill
>>> [2021-04-06T12:09:42.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>>> [2021-04-06T12:10:00.042] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>>> [2021-04-06T12:10:05.701] debug:  backfill: beginning
>>> [2021-04-06T12:10:05.701] debug:  backfill: no jobs to backfill
>>> [2021-04-06T12:10:05.989] debug:  sched: Running job scheduler
>>> [2021-04-06T12:10:19.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>>> [2021-04-06T12:10:35.702] debug:  backfill: beginning
>>> [2021-04-06T12:10:35.702] debug:  backfill: no jobs to backfill
>>> [2021-04-06T12:10:37.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>>>
>>>
>>>
>>> Attached sinfo -R
>>>
>>>
>>>
>>> Any hint?
>>>
>>>
>>>
>>> jb
>>>
>>>
>>>
>>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
>>> Sent: Tuesday, April 6, 2021 7:54 AM
>>> To: Slurm User Community List <slurm-users at lists.schedmd.com>
>>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>>
>>>
>>>
>>> The other thing I notice for my slurmdbd.conf is that I have
>>>
>>>
>>>
>>> DbdAddr=localhost
>>> DbdHost=localhost
>>>
>>>
>>>
>>> You can try changing your slurmdbd.conf to set those two values as well,
>>> to see if that gets slurmdbd to listen on port 6819.
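>>>
>>> The relevant slurmdbd.conf lines would then look something like this
>>> (a sketch; DbdPort defaults to 6819 and is shown only for clarity):
>>>
>>> DbdAddr=localhost
>>> DbdHost=localhost
>>> DbdPort=6819
>>>
>>> followed by systemctl restart slurmdbd and a recheck with
>>> ss -lntp | grep 6819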
>>>
>>>
>>>
>>> Sean
>>>
>>>
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 6 Apr 2021 at 14:31, Sean Crosby <scrosby at unimelb.edu.au> wrote:
>>>
>>> Interesting. It looks like slurmdbd is not opening the 6819 port
>>>
>>>
>>>
>>> What does
>>>
>>>
>>>
>>> ss -lntp | grep 6819
>>>
>>>
>>>
>>> show? Is something else using that port?
>>>
>>>
>>>
>>> You can also stop the slurmdbd service and run it in debug mode using
>>>
>>>
>>>
>>> slurmdbd -D -vvv
>>>
>>>
>>>
>>> Sean
>>>
>>>
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 6 Apr 2021 at 14:02, <ibotsis at isc.tuc.gr> wrote:
>>>
>>> Hi Sean
>>>
>>>
>>>
>>> ss -lntp | grep $(pidof slurmdbd)    returns nothing.
>>>
>>>
>>>
>>> systemctl status slurmdbd.service
>>>
>>>
>>>
>>> ● slurmdbd.service - Slurm DBD accounting daemon
>>>      Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
>>>      Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
>>>        Docs: man:slurmdbd(8)
>>>     Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
>>>    Main PID: 1453375 (slurmdbd)
>>>       Tasks: 1
>>>      Memory: 5.0M
>>>      CGroup: /system.slice/slurmdbd.service
>>>              └─1453375 /usr/sbin/slurmdbd
>>>
>>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
>>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
>>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.
>>>
>>>
>>>
>>> The file /run/slurmdbd.pid exists and contains the pidof slurmdbd value.
>>>
>>>
>>>
>>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
>>> Sent: Tuesday, April 6, 2021 12:49 AM
>>> To: Slurm User Community List <slurm-users at lists.schedmd.com>
>>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>>
>>>
>>>
>>> What's the output of
>>>
>>>
>>>
>>> ss -lntp | grep $(pidof slurmdbd)
>>>
>>>
>>>
>>> on your dbd host?
>>>
>>>
>>>
>>> Sean
>>>
>>>
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 6 Apr 2021 at 05:00, <ibotsis at isc.tuc.gr> wrote:
>>>
>>> Hi Sean,
>>>
>>>
>>>
>>> 10.0.0.100 is the dbd and ctld host, named se01. The firewall is
>>> inactive.
>>>
>>>
>>>
>>> nc -nz 10.0.0.100 6819 || echo Connection not working
>>>
>>>
>>>
>>> gives me back: Connection not working
>>>
>>>
>>>
>>> jb
>>>
>>>
>>>
>>>
>>>
>>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
>>> Sent: Monday, April 5, 2021 2:52 PM
>>> To: Slurm User Community List <slurm-users at lists.schedmd.com>
>>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>>
>>>
>>>
>>> The error shows
>>>
>>>
>>> slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
>>> slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused
>>>
>>>
>>>
>>> Is 10.0.0.100 the IP address of the host running slurmdbd?
>>>
>>> If so, check the iptables firewall running on that host, and make sure
>>> the ctld server can access port 6819 on the dbd host.
>>>
>>> You can check this by running the following from the ctld host (requires
>>> the nmap-ncat package installed):
>>>
>>> nc -nz 10.0.0.100 6819 || echo Connection not working
>>>
>>> This will try connecting to port 6819 on host 10.0.0.100; it outputs
>>> nothing if the connection works, and prints "Connection not working"
>>> otherwise.
>>>
>>> I would also test this on the DBD server itself.
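>>>
>>> A quick way to inspect the firewall on the dbd host (a sketch; adjust if
>>> you manage rules with ufw or firewalld rather than raw iptables):
>>>
>>> iptables -L -n | grep 6819
>>> ss -lntp | grep 6819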
>>>
>>>  --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
>>>
>>> Hi Sean,
>>>
>>>
>>>
>>> Thank you for your prompt response. I made the changes you suggested, but
>>> slurmctld refuses to run. Find attached the new slurmctld -Dvvvv output.
>>>
>>>
>>>
>>> jb
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
>>> Sent: Monday, April 5, 2021 11:46 AM
>>> To: Slurm User Community List <slurm-users at lists.schedmd.com>
>>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>>
>>>
>>>
>>> Hi Jb,
>>>
>>>
>>>
>>> You have set AccountingStoragePort to 3306 in slurm.conf, which is the
>>> port of the MySQL server running on the DBD host.
>>>
>>>
>>>
>>> AccountingStoragePort is the port for the Slurmdbd service, and not for
>>> MySQL.
>>>
>>>
>>>
>>> Change AccountingStoragePort to 6819 and it should fix your issues.
>>>
>>>
>>>
>>> I also think you should comment out the lines
>>>
>>>
>>>
>>> AccountingStorageUser=slurm
>>> AccountingStoragePass=/run/munge/munge.socket.2
>>>
>>>
>>>
>>> You shouldn't need those lines.
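>>>
>>> After those edits, the accounting block of slurm.conf would look
>>> something like this (a sketch; keep whatever AccountingStorageHost you
>>> already have):
>>>
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=se01
>>> AccountingStoragePort=6819
>>> #AccountingStorageUser=slurm
>>> #AccountingStoragePass=/run/munge/munge.socket.2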
>>>
>>>
>>>
>>> Sean
>>>
>>>
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibotsis at isc.tuc.gr> wrote:
>>>
>>> Hello everyone,
>>>
>>>
>>>
>>> I installed Slurm 19.05.5 from the Ubuntu repo, for the first time, on a
>>> cluster with 44 identical nodes, but I have a problem with slurmctld.service.
>>>
>>>
>>>
>>> When I try to activate slurmctld I get the following message:
>>>
>>>
>>>
>>> fatal: You are running with a database but for some reason we have no
>>> TRES from it.  This should only happen if the database is down and you
>>> don't have any state files
>>>
>>>
>>>
>>>    - Ubuntu 20.04.2 runs on the server and all nodes, in exactly the
>>>    same version.
>>>    - munge 0.5.13, installed from the Ubuntu repo, runs on the server
>>>    and all nodes.
>>>    - mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)),
>>>    installed from the Ubuntu repo, runs on the server.
>>>
>>>
>>>
>>> slurm.conf is the same on all nodes and on the server.
>>>
>>>
>>>
>>> slurmd.service is active and running on all nodes without problems.
>>>
>>>
>>>
>>> mysql.service is active and running on the server.
>>>
>>> slurmdbd.service is active and running on the server (slurm_acct_db created).
>>>
>>>
>>>
>>> Find attached slurm.conf, slurmdbd.conf, and the detailed output of the
>>> slurmctld -Dvvvv command.
>>>
>>>
>>>
>>> Any hint?
>>>
>>>
>>>
>>> Thanks in advance
>>>
>>>
>>>
>>> jb
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>