[slurm-users] enable_configless, srun and DNS vs. hosts file
Mark Dixon
mark.c.dixon at durham.ac.uk
Tue Nov 16 14:00:08 UTC 2021
Hi Paul,
Thanks for the thought but no, we'd restarted all the slurmctld, slurmdbd
and slurmd daemons after the last change to any of the Slurm config files.
I have a very cut-down slurm.conf on the non-slurmctld nodes, which seems
to be consulted when running srun (regardless of whether slurmd is running
or not).
Removing the simplified NodeName lines from the cut-down slurm.conf causes
srun to immediately return to the "can't find address for host" behaviour
I outlined at the start. I've seen this both on clients running slurmd and
on those that aren't.
The cut-down slurm.conf is slowly growing: I've found that I also need to
add GresTypes, otherwise srun/sbatch don't know what users can put in
their "--gres" flag and so reject it. I guess at least that makes sense -
the tools need to get that information from somewhere.
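
A minimal sketch of what such a cut-down slurm.conf might contain, assuming
only the two settings mentioned in this thread. The NodeName range is the one
from the PS quoted below; the "gpu" value is an assumed example, not a
setting taken from the real cluster, and any other lines the file holds are
not stated here:

```
# Cut-down slurm.conf for submit/compute hosts (illustrative sketch only)

# Simplified node definition with no per-node detail; without it, srun
# reports "can't find address for host"
NodeName=cn[001-999]

# GRES types that srun/sbatch will accept in "--gres"; "gpu" is an
# assumed example value
GresTypes=gpu
```

The file is deliberately incomplete: the idea is that it carries only what
the client tools appear to need locally, with everything else coming from
slurmctld in configless mode.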
Interesting!
Best,
Mark
On Fri, 12 Nov 2021, Paul Brunk wrote:
> Hi:
>
> We run configless. If we add a node to slurm.conf and don't restart
> slurmd on our submit nodes, then attempts to submit to that new node
> will get the error you saw. Restarting slurmd on the submit node fixes
> it. This is the documented behavior (adding nodes needs slurmd
> restarted everywhere). Could this be what you're seeing (as opposed to
> /etc/hosts vs DNS)?
>
> --
> Wishing that I'd just listened this time,
> Paul Brunk, system administrator, Workstation Support Group
> GACRC (formerly RCC)
> UGA EITS (formerly UCNS)
>
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Mark Dixon
> Sent: Wednesday, November 10, 2021 10:14
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] enable_configless, srun and DNS vs. hosts file
>
> Hi,
>
> I'm using the "enable_configless" mode to avoid the need for a shared slurm.conf file, and am having similar trouble to others when running "srun", e.g.
>
> srun: error: fwd_tree_thread: can't find address for host cn120, check slurm.conf
> srun: error: Task launch for StepId=113.0 failed on node cn120: Can't find an address, check slurm.conf
> srun: error: Application launch failed: Can't find an address, check slurm.conf
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>
> I understand that the accepted solution is to add the nodenames to DNS. Is that really correct?
>
> I ask because it would be a great help if Slurm instead used the more usual mechanism and consulted the sources listed in /etc/nsswitch.conf. We use a large /etc/hosts file instead of DNS for our cluster and would rather not start running named if we can help it.
>
> Thanks,
>
> Mark
>
> PS Adding a line like "NodeName=cn[001-999]" to the submit/compute host
> slurm.conf file makes this go away (I hope that skipping the node detail,
> or adding nodes that don't exist [yet], won't cause other problems).
>
>
>