[slurm-users] [EXT] Slurmd problem on client
larsa at kth.se
Mon Aug 24 11:11:56 UTC 2020
Yes, the regular slurm commands work from the client.
The firewalld daemon has been stopped and disabled, and iptables is set to let everything through, on both the master and the client. I should have mentioned that in the list of prerequisites in my initial e-mail.
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Sean Crosby
Sent: 24 August 2020 12:45
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Slurmd problem on client
Do the regular slurm commands work from the client?
scontrol show part
If they don't, it would be a sign of communication problems.
Is there a software firewall running on the master/client?
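The two checks suggested here can be run together on the client. The following is a minimal sketch, not a definitive procedure: the function names `have_cmd` and `check_client` are made up for illustration, and it assumes a CentOS 7 style system with systemd and iptables, as described in the original message.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical wrapper around the checks suggested above; run it on the client.
check_client() {
    if have_cmd scontrol; then
        scontrol show part   # lists partitions if slurmctld can be reached
        scontrol ping        # reports whether the controller responds
    else
        echo "scontrol not found on PATH" >&2
    fi
    if have_cmd systemctl; then
        systemctl is-active firewalld   # prints the unit state, e.g. active or inactive
    fi
    if have_cmd iptables; then
        iptables -S          # dumps the active rules; needs root
    fi
}
```

Calling `check_client` as root on the client runs all four commands in order; running the firewall half on the master as well covers both ends of the connection.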
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Mon, 24 Aug 2020 at 20:02, Lars Kloo <larsa at kth.se> wrote:
I have a problem with slurmd on a client that I cannot figure out how to solve. I would be grateful for any suggestions on how to move forward.
The master computer of a small local compute cluster is getting quite old, and I am therefore in the process of replacing it. I also use one compute node for the basic master-client set-up of all programs, including slurm. Some basic data: CentOS 7.7, slurm 20.02.4.
Setting up slurmctld on the master node is (seemingly) straightforward. Getting slurmd to work on the client appears more complicated. I get the following error message (journalctl -xe) when starting slurmd on the client:
Aug 24 11:01:34 cpu3.calc.cluster slurmd: error: _fetch_child: failed to fetch remote configs
No useful error messages are obtained from 'systemctl -l status slurmd.service' on the client, from slurmd.log on the client, or from slurmctld.log on the master.
In this context, the following should be noted:
- root and test user exist on the master and client; same uid and gid on both machines
- ping works in both directions (master <-> client)
- passphrase-free ssh login works in both directions, for both root and a test user
- munged is running with the same key on both machines
- the same slurm.conf is read on the master and on the client
- named (bind) has been set up on the master, and nslookup and dig work properly on the client
- the ‘forward’ zone file of named on the master (DNS) contains the recommended SRV record directing slurmctld requests to port 6817 on the master (syntax seems ok, i.e. no error messages)
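The DNS prerequisite in the last item can be checked from the client with dig, which the list says already works there. This is a sketch under assumptions: the helper names `srv_name` and `srv_port` are invented for illustration, and `calc.cluster` is assumed to be the DNS domain only because it appears in the hostname in the log line above.

```shell
#!/bin/sh
# Hypothetical helper: build the SRV query name for a given DNS domain.
srv_name() {
    printf '_slurmctld._tcp.%s\n' "$1"
}

# Hypothetical helper: read "priority weight port target" lines, as printed
# by `dig +short -t SRV`, and print the port field of each.
srv_port() {
    awk 'NF >= 4 { print $3 }'
}

# Usage on the client (domain assumed from the hostname in the log line):
#   dig +short -t SRV "$(srv_name calc.cluster)" | srv_port
# An empty result would mean the SRV record is not visible to the client.
```

If the record is set up as described, the pipeline in the usage comment would be expected to print the slurmctld port, 6817. A clean zone-file syntax check does not by itself show that the client resolves the record, which is what this query tests.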
I have also tried to start slurmd in configless mode (with slurm.conf edited accordingly on the master) and with the suggested environment variable set for slurmd on the client. In that case slurmd starts without error messages, but slurmctld on the master cannot communicate with slurmd on the client.
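For reference, configless mode can also be tested without DNS by naming the controller explicitly on the slurmd command line. A minimal sketch, with assumptions labeled: `conf_server_opt` is a made-up helper, `master` is a placeholder hostname, and `SLURMD_OPTIONS` in /etc/sysconfig/slurmd is assumed to be the environment variable meant above, which the message does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: format the slurmd option that names the config server.
conf_server_opt() {
    host=$1
    port=${2:-6817}      # 6817 is the slurmctld port given in the message
    printf '%s %s:%s\n' '--conf-server' "$host" "$port"
}

# On the master, slurm.conf needs: SlurmctldParameters=enable_configless
# On the client, run slurmd in the foreground with verbose logging
# ("master" is a placeholder for the real controller hostname):
#   slurmd -D -vvv $(conf_server_opt master)
# Or, assuming the systemd unit reads /etc/sysconfig/slurmd, put this line there:
#   SLURMD_OPTIONS="--conf-server master:6817"
```

Running slurmd in the foreground this way prints its log to the terminal, which may show more detail than the single `_fetch_child` line in the journal.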
Has anyone encountered a similar problem, and if so, how did you solve it? Or do you have any suggestions on where to start looking?
Many thanks for input, and best regards,
Lars Kloo, Prof.
Applied Physical Chemistry
Dept. of Chemistry
Royal Inst. of Technology (KTH)
SE-100 44 Stockholm
Tel: +46-8-790 9343
Fax: +46-8-790 9349
E-mail: lakloo at kth.se
WWW: http://www.kth.se/che/divisions/tfk