[slurm-users] error: _find_node_record: lookup failure in Slurm 20.02.0

Giovanni Torres giovanni.torres at gmail.com
Sat Mar 28 14:27:19 UTC 2020


Hello Everyone,

In 19.05 and earlier versions, I was able to run multiple slurmd nodes on the
same virtual machine or container. After upgrading to 20.02.0, when I run
sbatch to kick off a job, the job gets stuck in the CF (Configuring) state.

[root@slurmcluster log]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 6    normal     wrap     root CF      13:10      1 c1

The slurmctld.log file shows the following errors, and it keeps looping
thereafter with the same messages:

==> slurmctld.log <==
[2020-03-22T13:53:28.917] debug2: Tree head got back 1
[2020-03-22T13:53:28.921] debug2: node_did_resp slurmcluster
[2020-03-22T13:53:28.922] debug3: create_mmap_buf: loaded file `/var/spool/slurm/ctld/job_state` as Buf
[2020-03-22T13:53:28.922] debug3: Writing job id 6 to header record of job_state file
[2020-03-22T13:53:58.983] debug2: Testing job time limits and checkpoints
[2020-03-22T13:53:58.983] error: _find_node_record(766): lookup failure for slurmcluster
[2020-03-22T13:53:58.983] error: _find_node_record(778): lookup failure for slurmcluster alias slurmcluster
[2020-03-22T13:54:28.071] debug2: Testing job time limits and checkpoints
[2020-03-22T13:54:28.071] error: _find_node_record(766): lookup failure for slurmcluster
[2020-03-22T13:54:28.071] error: _find_node_record(778): lookup failure for slurmcluster alias slurmcluster
[2020-03-22T13:54:28.071] debug2: Performing purge of old job records
[2020-03-22T13:54:28.071] debug:  sched: Running job scheduler
[2020-03-22T13:54:58.119] debug2: Testing job time limits and checkpoints
[2020-03-22T13:54:58.119] error: _find_node_record(766): lookup failure for slurmcluster
[2020-03-22T13:54:58.119] error: _find_node_record(778): lookup failure for slurmcluster alias slurmcluster

I've tried manipulating the local /etc/hosts to rule out a DNS problem of
some kind, since the error message hints at a hostname lookup failure.
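For example, I added entries along these lines (illustrative; the exact
addresses differ inside the container):

# /etc/hosts (illustrative): map the container hostname and the virtual
# node names to loopback so hostname lookups always resolve locally
127.0.0.1   localhost slurmcluster c1 c2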

Here is a link to my slurm.conf:
https://github.com/giovtorres/docker-centos7-slurm/blob/master/files/slurm/slurm.conf
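
For quick reference, the node definitions in that file follow roughly this
pattern (an illustrative excerpt, not a verbatim copy; the linked file is
authoritative):

# slurm.conf excerpt (illustrative): two slurmd "nodes" on one host,
# sharing the same NodeHostname/NodeAddr but listening on different ports
NodeName=c1 NodeHostname=slurmcluster NodeAddr=127.0.0.1 Port=6001 CPUs=1 State=UNKNOWN
NodeName=c2 NodeHostname=slurmcluster NodeAddr=127.0.0.1 Port=6002 CPUs=1 State=UNKNOWN
PartitionName=normal Default=YES Nodes=c[1-2] State=UP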

I saw that FastSchedule=2 was called out in the Release Notes as deprecated.
I am using FastSchedule=1. Is that value deprecated as well? Has its
behaviour changed? Sadly, I can't find the change in FastSchedule behaviour
documented anywhere, and I'm not even sure it is the crux of the problem here.
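
In case it's useful, these are the sanity checks I can run against the live
controller (standard scontrol queries; output omitted here):

# What does the controller actually have on record for the node?
scontrol show node c1

# Which FastSchedule value is the running slurmctld using?
scontrol show config | grep -i fastschedule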

Any pointers would be greatly appreciated!

Thanks,
Giovanni