[slurm-users] Questions about scontrol reconfigure / reconfig
Nicolas Greneche
nicolas.greneche at univ-paris13.fr
Mon Jan 17 03:41:43 UTC 2022
Hello,
I have some questions about adding nodes in configless mode. My Slurm version is 21.08.5. I have included logs at the end of the message to make it easier to follow.
First, is "scontrol reconfigure" equivalent to "scontrol reconfig"?
Second, I see some strange behaviour when adding a node.
I have a healthy cluster with two nodes:
nico@control-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
COMPUTE* up infinite 2 idle hpc-node-[0-1]
My config file is:
nico@control-node-0:~$ cat /etc/slurm/slurm.conf
ClusterName=slurmtest
ControlMachine=control-node-0
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/slurm/
SlurmdSpoolDir=/var/slurm/
SlurmctldPidFile=/var/slurm/slurmctld.pid
SlurmdPidFile=/var/slurm/slurmd.pid
SlurmdLogFile=/var/slurm/slurmd.log
SlurmctldLogFile=/var/slurm/slurmctld.log
#SlurmctldParameters=enable_configless,cloud_dns
SlurmctldParameters=enable_configless
CommunicationParameters=NoAddrCache
ProctrackType=proctrack/linuxproc
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
PartitionName=COMPUTE Nodes=hpc-node-[0-1] Default=YES MaxTime=INFINITE State=UP
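For completeness: the compute nodes run without a local slurm.conf; slurmd fetches the configuration from the controller at startup and caches it under /var/slurm/conf-cache/. Roughly, each node is pointed at the controller like this (a sketch assuming the --conf-server form of configless and the default slurmctld port, rather than DNS SRV records):
slurmd --conf-server control-node-0:6817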
If I submit a job, it works:
nico@control-node-0:~$ srun -N 2 hostname
hpc-node-0
hpc-node-1
Now I try to submit a job that is too large for my cluster:
nico@control-node-0:~$ srun -N 3 hostname
srun: Requested partition configuration not available now
srun: job 10 queued and waiting for resources
The job stays pending.
I add a new compute node to the config file, so the NodeName line becomes:
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN
At this point DNS is fine: every host resolves every other host.
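For instance, a lookup along these lines (just one way to check resolution) returns the expected address from every host:
getent hosts hpc-node-2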
I send a TERM signal to slurmctld (log 1 below):
root@control-node-0:/# kill -TERM `cat /var/slurm/slurmctld.pid`
Then I restart it:
slurmctld -D -f /etc/slurm/slurm.conf
A few seconds later, my job fails (log 2) with:
srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2,
check slurm.conf
Only the hostnames of hpc-node-0 and hpc-node-1 are printed.
I guess this is because slurm.conf has not been updated on the compute nodes, so they do not know about hpc-node-2 even though it is resolvable. Here is the NodeName line from the cached configuration on a compute node:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
But the node is visible to slurmctld (because it has been restarted after the SIGTERM, I think):
nico@control-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
COMPUTE* up infinite 3 idle hpc-node-[0-2]
And if I submit the job again, it works:
nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-2
hpc-node-1
This is despite the fact that the configuration cached on the compute nodes has not been updated:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
Is this normal behaviour?
Note: if I run "scontrol reconfig", the compute node configuration is updated:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN
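As a quick sanity check, assuming ssh access from the controller, the cached copy can be compared against the controller's file, e.g.:
diff <(ssh hpc-node-0 cat /var/slurm/conf-cache/slurm.conf) /etc/slurm/slurm.conf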
I try a second scenario: I go back to the two-node configuration and submit a too-large job as above, then add the third node to slurm.conf again. This time, instead of sending a TERM signal to slurmctld, I run "scontrol reconfig":
root@control-node-0:/# scontrol reconfig
slurm_reconfigure error: Zero Bytes were transmitted or received
Nothing is updated.
I run the command a second time, and the job immediately fails with the same error as in log 2.
But this time the configuration on the compute nodes is updated:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN
If I submit the job again, it works:
nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-1
hpc-node-2
I think I am missing something about adding new compute nodes. Could you tell me the best practice for adding compute nodes with a configless configuration?
Thank you!
Kind regards,
==== LOGS =====
log 1
-----
[2022-01-17T02:45:06.809] Terminate signal (SIGINT or SIGTERM) received
[2022-01-17T02:45:06.854] Saving all slurm state
[2022-01-17T02:45:06.922] error: Configured MailProg is invalid
[2022-01-17T02:45:06.922] slurmctld version 21.08.5 started on cluster
slurmtest
[2022-01-17T02:45:06.924] No memory enforcing mechanism configured.
[2022-01-17T02:45:06.931] Recovered state of 2 nodes
[2022-01-17T02:45:06.931] Recovered JobId=10 Assoc=0
[2022-01-17T02:45:06.931] Recovered information about 1 jobs
[2022-01-17T02:45:06.931] Recovered state of 0 reservations
[2022-01-17T02:45:06.931] read_slurm_conf: backup_controller not specified
[2022-01-17T02:45:06.931] Running as primary controller
[2022-01-17T02:45:06.931] No parameter for mcs plugin, default values set
[2022-01-17T02:45:06.931] mcs: MCSParameters = (null). ondemand set.
[2022-01-17T02:45:06.933] error: slurm_receive_msg
[192.168.104.56:49698]: Zero Bytes were transmitted or received
[2022-01-17T02:45:10.004]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-01-17T02:45:10.010] error: Node hpc-node-1 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make sure
they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:10.010] error: Node hpc-node-0 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make sure
they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:36.927] sched/backfill: _start_job: Started JobId=10
in COMPUTE on hpc-node-[0-2]
log 2
------
srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2,
check slurm.conf
srun: error: Task launch for StepId=10.0 failed on node hpc-node-2:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
hpc-node-0
hpc-node-1
srun: error: Timed out waiting for job step to complete
--
Nicolas Greneche
USPN
Support à la recherche / RSSI
https://www-magi.univ-paris13.fr