[slurm-users] enable_configless, srun and DNS vs. hosts file

Mark Dixon mark.c.dixon at durham.ac.uk
Tue Nov 16 14:00:08 UTC 2021

Hi Paul,

Thanks for the thought but no, we'd restarted all slurmctld, slurmdbd and 
slurmd daemons after the most recent change to any of the slurm config files.

I have a very cut-down slurm.conf on the non-slurmctld nodes, which seems 
to be consulted when running srun (regardless of whether slurmd is running 
or not).

Removing the simplified NodeName lines from the cut-down slurm.conf causes 
srun to immediately return to the "can't find address for host" behaviour 
I outlined at the start. I've seen this both on clients running slurmd and 
on those that don't.

The cut-down slurm.conf is slowly growing: I've found that I also need to 
add GresTypes, otherwise srun/sbatch don't know what users can put in 
their "--gres" flag and so reject it. I guess at least that makes sense - 
the tools need to get that information from somewhere.
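
As a rough sketch, the relevant lines of such a cut-down slurm.conf might 
look like the following. The NodeName range is the one from my earlier PS; 
the GresTypes value (gpu) is only an assumed example, and any other 
settings the file needs are left out.

  # Cut-down slurm.conf for submit/compute hosts (illustrative sketch only)
  # Simplified node list, no per-node detail, so srun can map node names
  NodeName=cn[001-999]
  # Assumed example value; lets srun/sbatch validate what is passed to --gres
  GresTypes=gpu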




On Fri, 12 Nov 2021, Paul Brunk wrote:

> Hi:
> We run configless.  If we add a node to slurm.conf and don't restart 
> slurmd on our submit nodes, then attempts to submit to that new node 
> will get the error you saw.  Restarting slurmd on the submit node fixes 
> it.  This is the documented behavior (adding nodes needs slurmd 
> restarted everywhere).  Could this be what you're seeing (as opposed to 
> /etc/hosts vs DNS)?
> --
> Wishing that I'd just listened this time,
> Paul Brunk, system administrator, Workstation Support Group
> GACRC (formerly RCC)
> UGA EITS  (formerly UCNS)
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Mark Dixon
> Sent: Wednesday, November 10, 2021 10:14
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] enable_configless, srun and DNS vs. hosts file
> Hi,
> I'm using the "enable_configless" mode to avoid the need for a shared slurm.conf file, and am having similar trouble to others when running "srun", e.g.
>   srun: error: fwd_tree_thread: can't find address for host cn120, check slurm.conf
>   srun: error: Task launch for StepId=113.0 failed on node cn120: Can't find an address, check slurm.conf
>   srun: error: Application launch failed: Can't find an address, check slurm.conf
>   srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> I understand that the accepted solution is to add the nodenames to DNS. Is that really correct?
> I ask because it would be a great help if slurm instead used the more usual mechanism and consulted the sources listed in /etc/nsswitch.conf. We use a large /etc/hosts file instead of DNS for our cluster and would rather not start running named if we can help it.
> Thanks,
> Mark
> PS Adding a line like "NodeName=cn[001-999]" to the submit/compute host
>    slurm.conf file makes this go away (I hope skipping the node detail, or
>    adding nodes that don't exist [yet] won't cause other problems).
