[slurm-users] [External] srun at front-end nodes with --enable_configless fails with "Can't find an address, check slurm.conf"

Matthew BETTINGER matthew.bettinger at external.total.com
Mon Mar 22 21:54:51 UTC 2021


Also check the NodeAddr settings in your slurm.conf.

On 3/22/21, 2:48 PM, "slurm-users on behalf of Michael Robbert" <slurm-users-bounces at lists.schedmd.com on behalf of mrobbert at mines.edu> wrote:

    I haven't tried a configless setup yet, but the problem you're hitting looks like it could be a DNS issue. Can you do a DNS lookup of n26 from the login node? The way non-interactive batch jobs are started may not require that, but I believe it is required for interactive jobs.
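
    A quick way to run that check from the login node is getent, which goes
    through the same resolver path (hosts file plus DNS) the system itself
    uses; "n26" is the node from the error message, substitute your own:

    ```shell
    # Check whether the login node can resolve the compute node's hostname.
    # A successful lookup prints the address and name; a failure prints the
    # fallback message instead.
    getent hosts n26 || echo "lookup of n26 failed"
    ```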

    Mike Robbert
    Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
    Information and Technology Solutions (ITS)
    303-273-3786 | mrobbert at mines.edu  

    Our values: Trust | Integrity | Respect | Responsibility

    On 3/22/21, 11:24, "slurm-users on behalf of Josef Dvoracek" <slurm-users-bounces at lists.schedmd.com on behalf of jose at fzu.cz> wrote:

        Hi @list;

        I was able to configure a "configless" Slurm cluster with a quite 
        minimalistic slurm.conf everywhere, of course except on the slurmctld 
        server. All nodes are running slurmd, including the front-end/login 
        nodes, to pull the config.

        Submitting jobs using sbatch scripts works fine, but interactive jobs 
        using srun fail with:

        $ srun --verbose -w n26 --pty /bin/bash
        ...
        srun: error: fwd_tree_thread: can't find address for host n26, check 
        slurm.conf
        srun: error: Task launch for 200137.0 failed on node n26: Can't find an 
        address, check slurm.conf
        srun: error: Application launch failed: Can't find an address, check 
        slurm.conf
        ...


        Does this mean that on submit hosts one has to manually specify all 
        relevant NodeNames?
        I thought that running slurmd there would pull the configuration from 
        the slurm server. (I can see the file is actually successfully pulled 
        into /run/slurm/conf/slurm.conf.)


        So far I found two workarounds:

        workaround1:

        specify the NodeName entries in slurm.conf on the login/front-end nodes:

        NodeName=n[(...)n26(...)] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 
        State=UNKNOWN

        then, srun works as expected.


        workaround2:

        point the environment variable SLURM_CONF at the slurm.conf pulled by 
        slurmd:

        export SLURM_CONF=/run/slurm/conf/slurm.conf

        then again, srun works as expected.
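
        To make workaround2 stick for all shells, one could drop the export 
        into a profile snippet (the profile.d path is just a common 
        convention, nothing Slurm-specific):

        ```shell
        # /etc/profile.d/slurm_conf.sh  (hypothetical location)
        # Point Slurm client commands (srun, squeue, ...) at the config
        # that slurmd cached from the controller.
        export SLURM_CONF=/run/slurm/conf/slurm.conf
        ```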


        Is this the expected behavior? I expected that srun on a configless 
        login/front-end node with a running slurmd would recognize the pulled 
        configuration, but apparently that's not the case.

        cheers

        josef


        setup at front-end and compute nodes:

        [root at FRONTEND ~]# slurmd --version
        slurm 20.02.5
        [root at FRONTEND ~]#

        [root at FRONTEND ~]# cat /etc/sysconfig/slurmd
        SLURMD_OPTIONS="--conf-server slurmserver2.DOMAIN"
        [root at FRONTEND ~]#

        [root at FRONTEND ~]# cat /etc/slurm/slurm.conf
        ClusterName=CLUSTERNAME
        ControlMachine=slurmserver2.DOMAIN
        AccountingStorageType=accounting_storage/slurmdbd
        AccountingStorageHost=slurmserver2.DOMAIN
        AccountingStoragePort=7031
        SlurmctldParameters=enable_configless
        [root at FRONTEND ~]#


        -- 
        Josef Dvoracek
        Institute of Physics | Czech Academy of Sciences
        cell: +420 608 563 558 | https://telegram.me/jose_d | FZU phone nr. : 2669




