[slurm-users] Frontend node mode issues identified in v22.05.2

Jordi Blasco jbllistes at gmail.com
Thu Aug 4 01:35:53 UTC 2022


Hi,

I have been maintaining a Slurm simulator
<https://hub.docker.com/repository/registry-1.docker.io/hpcnow/slurm_simulator/general>
for ages. I have everything automated in order to try new features and keep
my configuration up to date, version after version. Unfortunately, starting
with version 21.08, the front-end mode makes the slurmd daemon crash with
the following error message:

slurmd: error: _find_node_record: lookup failure for node "slurm-simulator"
slurmd: error: _find_node_record: lookup failure for node "slurm-simulator", alias "slurm-simulator"
slurmd: error: slurmd initialization failed

The exact same container, with the same configuration but using version
20.11.9, works just fine. I reproduced the same steps manually in a VM to
rule out any noise introduced by the container, but the result is the same.

The configuration below is the one available in the container.

[root at slurm-simulator /]# cat /etc/slurm/slurm.conf
ClusterName=simulator
SlurmctldHost=slurm-simulator
FrontendName=slurm-simulator
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
SlurmdParameters=config_overrides
include /etc/slurm/nodes.conf
include /etc/slurm/partitions.conf
[root at slurm-simulator /]# cat /etc/slurm/nodes.conf
NodeName=node[001-10] RealMemory=248000 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 State=UNKNOWN NodeAddr=slurm-simulator NodeHostName=slurm-simulator
[root at slurm-simulator /]# cat /etc/slurm/partitions.conf
PartitionName=long Nodes=node[001-10] Default=YES State=UP OverSubscribe=NO MaxTime=14-00:00:00
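
Since the error is a lookup failure for the host name, one quick sanity check is to confirm that the name slurmd runs under really appears as FrontendName or NodeHostName in the configuration. The following is only a minimal sketch: the function names conf_host_names and host_in_conf are made up for illustration, and the matching assumes the keys are capitalised exactly as in the files above (Slurm itself is more lenient).

```shell
# Print each distinct FrontendName / NodeHostName value found in Slurm
# configuration text read from stdin. Assumes the keys are spelled with
# exactly this capitalisation, as in the files above.
conf_host_names() {
    tr ' \t' '\n\n' |
        sed -n -e 's/^FrontendName=//p' -e 's/^NodeHostName=//p' |
        sort -u
}

# Succeed if the given host name (default: this machine's short host name)
# is one of the names printed by conf_host_names.
host_in_conf() {
    conf_host_names | grep -qxF -- "${1:-$(hostname -s)}"
}
```

For example, `cat /etc/slurm/slurm.conf /etc/slurm/nodes.conf | host_in_conf && echo configured`. A match only rules out a naming mismatch in the files; it does not explain why slurmd fails the lookup.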

The error can be reproduced by running the following commands:

docker run --rm --detach \
           --name "${USER}_simulator" \
           -h "slurm-simulator" \
           --security-opt seccomp:unconfined \
           --privileged -e container=docker \
           -v /run -v /sys/fs/cgroup:/sys/fs/cgroup \
           --cgroupns=host \
           hpcnow/slurm_simulator:21.08.8-2 /usr/sbin/init
docker exec -ti ${USER}_simulator /bin/bash
slurmd -D -vvvvv

The same commands work with v20.11.9. I have also tried the new
SlurmdParameters=config_overrides option, but I still get the same
problem.
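
To make the comparison between versions repeatable, the reproduction steps can be wrapped in a small helper. This is a rough sketch under a few assumptions: only the 21.08.8-2 tag is named in this message, so other tags are assumed to follow the same naming scheme; timeout(1) is assumed to be present in the image; and the function names simulator_image and slurmd_errors are hypothetical.

```shell
# Build the image reference for a Slurm version. Only the 21.08.8-2 tag
# is named in this thread; other tags are assumed to follow the same scheme.
simulator_image() {
    printf 'hpcnow/slurm_simulator:%s\n' "$1"
}

# Start the simulator for one version, run slurmd briefly in the foreground
# and print any error lines it logs. Requires Docker.
slurmd_errors() {
    version="$1"
    name="${USER:-user}_simulator_${version}"
    docker run --rm --detach \
        --name "$name" \
        -h "slurm-simulator" \
        --security-opt seccomp:unconfined \
        --privileged -e container=docker \
        -v /run -v /sys/fs/cgroup:/sys/fs/cgroup \
        --cgroupns=host \
        "$(simulator_image "$version")" /usr/sbin/init >/dev/null || return 1
    # timeout(1) is assumed to exist in the image; it stops a healthy
    # slurmd, which would otherwise stay in the foreground indefinitely.
    docker exec "$name" timeout 15 slurmd -D -vvvvv 2>&1 | grep 'error:'
    docker stop "$name" >/dev/null
}
```

Running `for v in 20.11.9 21.08.8-2; do echo "== $v"; slurmd_errors "$v"; done` should then show the error lines, if any, for each version side by side.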


Any ideas or suggestions?


Thanks!

On Mon, 11 Jul 2022 at 23:21, Jordi Blasco <jbllistes at gmail.com> wrote:

> Thanks Ole,
>
> I checked /etc/nsswitch.conf and I have even set up a dnsmasq service,
> just in case.
>
> [root at slurm-simulator /]# cat /etc/nsswitch.conf | grep hosts
> # Valid databases are: aliases, ethers, group, gshadow, hosts,
> hosts:      files dns myhostname
>
> [root at slurm-simulator /]# ping slurm-simulator -c 1
> PING slurm-simulator (172.17.0.4) 56(84) bytes of data.
> 64 bytes from slurm-simulator (172.17.0.4): icmp_seq=1 ttl=64 time=0.022 ms
>
> --- slurm-simulator ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.022/0.022/0.022/0.000 ms
>
> [root at slurm-simulator /]# cat /etc/resolv.conf | grep -v "^#"
> nameserver 172.17.0.4
> nameserver 172.31.0.2
> search eu-west-3.compute.internal
> [root at slurm-simulator /]# host slurm-simulator
> slurm-simulator has address 172.17.0.4
> [root at slurm-simulator /]# host 172.17.0.4
> 4.0.17.172.in-addr.arpa domain name pointer slurm-simulator.
>
>
> Regards,
>
> Jordi
>
>
>
> On Mon, 11 Jul 2022 at 23:09, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> wrote:
>
>> On 7/11/22 12:54, Jordi Blasco wrote:
>> > I use the front-end node mode
>> > <https://slurm.schedmd.com/faq.html#multi_slurmd> to emulate a real
>> > cluster in order to validate the Slurm configuration in a Docker
>> > container and develop custom plugins. With versions 21.08.8-2 and
>> > 22.05.2, slurmd is complaining about not being able to find the
>> > frontend node.
>> >
>> > slurmd -D -vvv
>> > ...
>> > slurmd: error: _find_node_record: lookup failure for node "slurm-simulator"
>> > slurmd: error: _find_node_record: lookup failure for node "slurm-simulator", alias "slurm-simulator"
>> > slurmd: error: slurmd initialization failed
>>
>> This could be a DNS lookup issue.  Can you ping the node named
>> "slurm-simulator"?
>>
>> /Ole
>>
>>