[slurm-users] [External] srun at front-end nodes with --enable_configless fails with "Can't find an address, check slurm.conf"

Michael Robbert mrobbert at mines.edu
Mon Mar 22 19:47:49 UTC 2021


I haven't tried a configless setup yet, but the problem you're hitting looks like it could be a DNS issue. Can you do a DNS lookup of n26 from the login node? The way non-interactive batch jobs are launched may not require name resolution on the submit host, but I believe interactive jobs do.
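For example, something along these lines run on the login node should show whether the name resolves (getent also consults /etc/hosts, while host queries DNS directly):

getent hosts n26
host n26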

Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobbert at mines.edu  

Our values: Trust | Integrity | Respect | Responsibility

On 3/22/21, 11:24, "slurm-users on behalf of Josef Dvoracek" <slurm-users-bounces at lists.schedmd.com on behalf of jose at fzu.cz> wrote:

    Hi @list;

    I was able to configure a "configless" Slurm cluster with a quite 
    minimalistic slurm.conf everywhere, except of course on the slurmctld 
    server. All nodes are running slurmd, including the front-end/login 
    nodes, so that they pull the config.

    Submitting jobs using sbatch scripts works fine, but interactive jobs 
    using srun are failing with

    $ srun --verbose -w n26 --pty /bin/bash
    ...
    srun: error: fwd_tree_thread: can't find address for host n26, check slurm.conf
    srun: error: Task launch for 200137.0 failed on node n26: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    ...


    Does this mean that on submit hosts one has to specify all relevant 
    NodeNames manually?
    I thought that running slurmd there would pull the configuration from 
    the slurm server. (I can see the file actually is successfully pulled 
    into /run/slurm/conf/slurm.conf.)
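
    For illustration, one way to compare what slurmd pulled with what the 
    client tools read by default (the grep is just an example check):

    grep NodeName /run/slurm/conf/slurm.conf
    grep NodeName /etc/slurm/slurm.conf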


    So far I found two workarounds:

    Workaround 1:

    Specify the NodeNames in slurm.conf on the login/front-end nodes:

    NodeName=n[(...)n26(...)] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN

    Then srun works as expected.


    Workaround 2:

    Point the SLURM_CONF environment variable at the slurm.conf pulled by 
    slurmd:

    export SLURM_CONF=/run/slurm/conf/slurm.conf

    Then again, srun works as expected.
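
    To make this persistent for all users, the export could go into a 
    profile script, e.g. something like this (the filename is just a 
    suggestion):

    # /etc/profile.d/slurm_conf.sh - point client tools at the config pulled by slurmd
    export SLURM_CONF=/run/slurm/conf/slurm.conf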


    Is this expected behavior? I expected that srun on a configless 
    login/front-end node with slurmd running would pick up the pulled 
    configuration, but apparently that's not the case.

    cheers

    josef


    Setup on the front-end and compute nodes:

    [root at FRONTEND ~]# slurmd --version
    slurm 20.02.5
    [root at FRONTEND ~]#

    [root at FRONTEND ~]# cat /etc/sysconfig/slurmd
    SLURMD_OPTIONS="--conf-server slurmserver2.DOMAIN"
    [root at FRONTEND ~]#

    [root at FRONTEND ~]# cat /etc/slurm/slurm.conf
    ClusterName=CLUSTERNAME
    ControlMachine=slurmserver2.DOMAIN
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurmserver2.DOMAIN
    AccountingStoragePort=7031
    SlurmctldParameters=enable_configless
    [root at FRONTEND ~]#

    Compute node slurm.conf:

    ClusterName=XXXXX
    ControlMachine=slurmserver2.DOMAIN
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurmserver2.DOMAIN
    AccountingStoragePort=7031
    SlurmctldParameters=enable_configless



    -- 
    Josef Dvoracek
    Institute of Physics | Czech Academy of Sciences
    cell: +420 608 563 558 | https://telegram.me/jose_d | FZU phone nr. : 2669

