[slurm-users] [External] srun at front-end nodes with --enable_configless fails with "Can't find an address, check slurm.conf"
Matthew BETTINGER
matthew.bettinger at external.total.com
Mon Mar 22 21:54:51 UTC 2021
Also check the NodeAddr settings in your slurm.conf.
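For example, a node definition with an explicit address might look roughly like this (the 10.0.0.26 address below is purely hypothetical):
NodeName=n26 NodeAddr=10.0.0.26 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN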
On 3/22/21, 2:48 PM, "slurm-users on behalf of Michael Robbert" <slurm-users-bounces at lists.schedmd.com on behalf of mrobbert at mines.edu> wrote:
I haven't tried a configless setup yet, but the problem you're hitting looks like it could be a DNS issue. Can you do a DNS lookup of n26 from the login node? The way that non-interactive batch jobs are started may not require that, but I believe it is required for interactive jobs.
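A quick check from the login node could be something like this (getent goes through the same resolver path as ordinary library calls, and scontrol shows which address the controller has recorded for the node):
$ getent hosts n26
$ scontrol show node n26 | grep -i addr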
Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobbert at mines.edu
Our values: Trust | Integrity | Respect | Responsibility
On 3/22/21, 11:24, "slurm-users on behalf of Josef Dvoracek" <slurm-users-bounces at lists.schedmd.com on behalf of jose at fzu.cz> wrote:
Hi @list;
I was able to configure a "configless" slurm cluster with a quite
minimalistic slurm.conf everywhere, of course except on the slurmctld
server. All nodes are running slurmd, including the front-end/login
nodes, to pull the config.
Submitting jobs using sbatch scripts works fine, but interactive jobs
using srun are failing with
$ srun --verbose -w n26 --pty /bin/bash
...
srun: error: fwd_tree_thread: can't find address for host n26, check slurm.conf
srun: error: Task launch for 200137.0 failed on node n26: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
...
Does this mean that one has to manually specify all relevant NodeNames
on the submit hosts?
I thought that running slurmd there would pull the configuration from
the slurm server. (I can see the file actually is successfully pulled
into /run/slurm/conf/slurm.conf.)
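The pulled file does appear to contain the node definitions, as workaround2 below suggests; something like this shows them:
$ grep NodeName /run/slurm/conf/slurm.conf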
So far I found two workarounds:
workaround1:
specify the NodeNames in slurm.conf on the login/front-end nodes:
NodeName=n[(...)n26(...)] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2
State=UNKNOWN
then, srun works as expected.
workaround2:
directing environment variable SLURM_CONF to the slurm.conf pulled by
slurmd:
export SLURM_CONF=/run/slurm/conf/slurm.conf
then again, srun works as expected.
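If one settles on this workaround, the export could live in a profile snippet on the login nodes; a minimal sketch, with a hypothetical path:
# /etc/profile.d/slurm-conf.sh -- hypothetical location, adjust as needed
# Point Slurm client commands (srun, squeue, ...) at the config
# that the local slurmd pulled from the controller.
export SLURM_CONF=/run/slurm/conf/slurm.conf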
Is this expected behavior? I actually expected that srun on a configless
login/front-end node with a running slurmd would recognize the pulled
configuration, but apparently that's not the case.
cheers
josef
setup at front-end and compute nodes:
[root at FRONTEND ~]# slurmd --version
slurm 20.02.5
[root at FRONTEND ~]#
[root at FRONTEND ~]# cat /etc/sysconfig/slurmd
SLURMD_OPTIONS="--conf-server slurmserver2.DOMAIN"
[root at FRONTEND ~]#
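As an aside: besides --conf-server, configless mode can also locate the controller via a DNS SRV record; a sketch of such a record (6817 being the default SlurmctldPort) might look like:
_slurmctld._tcp 3600 IN SRV 10 0 6817 slurmserver2.DOMAIN.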
[root at FRONTEND ~]# cat /etc/slurm/slurm.conf
ClusterName=CLUSTERNAME
ControlMachine=slurmserver2.DOMAIN
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmserver2.DOMAIN
AccountingStoragePort=7031
SlurmctldParameters=enable_configless
[root at FRONTEND ~]#
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | https://telegram.me/jose_d | FZU phone nr. : 2669