[slurm-users] Questions about scontrol reconfigure / reconfig
Nicolas Greneche
nicolas.greneche at univ-paris13.fr
Mon Jan 17 03:41:43 UTC 2022
Hello,
I have some questions about adding nodes in configless mode. My Slurm version is 21.08.5. I have included logs at the end of the message to make it easier to follow.
First, is "scontrol reconfigure" equivalent to "scontrol reconfig"?
Second, I see some strange behaviour when adding a node.
I have a healthy cluster with two nodes:
nico@control-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
COMPUTE* up infinite 2 idle hpc-node-[0-1]
My config file is:
nico@control-node-0:~$ cat /etc/slurm/slurm.conf
ClusterName=slurmtest
ControlMachine=control-node-0
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/slurm/
SlurmdSpoolDir=/var/slurm/
SlurmctldPidFile=/var/slurm/slurmctld.pid
SlurmdPidFile=/var/slurm/slurmd.pid
SlurmdLogFile=/var/slurm/slurmd.log
SlurmctldLogFile=/var/slurm/slurmctld.log
#SlurmctldParameters=enable_configless,cloud_dns
SlurmctldParameters=enable_configless
CommunicationParameters=NoAddrCache
ProctrackType=proctrack/linuxproc
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
PartitionName=COMPUTE Nodes=hpc-node-[0-1] Default=YES MaxTime=INFINITE State=UP
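For completeness: the compute nodes run without a local slurm.conf; slurmd fetches the configuration from the controller at startup and caches it under /var/slurm/conf-cache/. Roughly, each node is pointed at the controller like this (a sketch assuming the --conf-server form of configless and the default slurmctld port, rather than DNS SRV records):
slurmd --conf-server control-node-0:6817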
If I submit a job, it works:
nico@control-node-0:~$ srun -N 2 hostname
hpc-node-0
hpc-node-1
Now I try to submit a job that is too large for my cluster:
nico@control-node-0:~$ srun -N 3 hostname
srun: Requested partition configuration not available now
srun: job 10 queued and waiting for resources
The job stays pending.
I add a new compute node to the config file, so the NodeName line becomes:
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN
At this point DNS is fine: every host resolves every other host.
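For instance, a lookup along these lines (just one way to check resolution) returns the expected address from every host:
getent hosts hpc-node-2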
I send a TERM signal to slurmctld (log 1 below):
root@control-node-0:/# kill -TERM `cat /var/slurm/slurmctld.pid`
Then I restart it:
slurmctld -D -f /etc/slurm/slurm.conf
A few seconds later, my job fails (log 2) with:
srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2,
check slurm.conf
Only the hostnames of hpc-node-0 and hpc-node-1 are printed.
I guess this is because slurm.conf has not been updated on the compute nodes, so they do not know about hpc-node-2 even though it is resolvable. Here is the NodeName line from the cached configuration on a compute node:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
But the node is visible to slurmctld (because it has been restarted after the SIGTERM, I think):
nico@control-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
COMPUTE* up infinite 3 idle hpc-node-[0-2]
And if I submit the job again, it works:
nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-2
hpc-node-1
This is despite the fact that the configuration cached on the compute nodes has not been updated:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-1] Procs=2 State=UNKNOWN
Is this normal behaviour?
Note: if I run "scontrol reconfig", the compute node configuration is updated:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN
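As a quick sanity check, assuming ssh access from the controller, the cached copy can be compared against the controller's file, e.g.:
diff <(ssh hpc-node-0 cat /var/slurm/conf-cache/slurm.conf) /etc/slurm/slurm.conf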
I try a second scenario: I go back to the two-node configuration and submit a too-large job as above, then add the third node to slurm.conf again. This time, instead of sending a TERM signal to slurmctld, I run "scontrol reconfig":
root@control-node-0:/# scontrol reconfig
slurm_reconfigure error: Zero Bytes were transmitted or received
Nothing is updated.
I run the command a second time, and the job immediately fails with the same error as in log 2.
But this time the configuration on the compute nodes is updated:
root@hpc-node-0:/# cat /var/slurm/conf-cache/slurm.conf
NodeName=hpc-node-[0-2] Procs=2 State=UNKNOWN
If I submit the job again, it works:
nico@control-node-0:~$ srun -N 3 hostname
hpc-node-0
hpc-node-1
hpc-node-2
I think I am missing something about adding new compute nodes. Could you tell me the best practice for adding compute nodes with a configless configuration?
Thank you!
Kind regards,
==== LOGS =====
log 1
-----
[2022-01-17T02:45:06.809] Terminate signal (SIGINT or SIGTERM) received
[2022-01-17T02:45:06.854] Saving all slurm state
[2022-01-17T02:45:06.922] error: Configured MailProg is invalid
[2022-01-17T02:45:06.922] slurmctld version 21.08.5 started on cluster
slurmtest
[2022-01-17T02:45:06.924] No memory enforcing mechanism configured.
[2022-01-17T02:45:06.931] Recovered state of 2 nodes
[2022-01-17T02:45:06.931] Recovered JobId=10 Assoc=0
[2022-01-17T02:45:06.931] Recovered information about 1 jobs
[2022-01-17T02:45:06.931] Recovered state of 0 reservations
[2022-01-17T02:45:06.931] read_slurm_conf: backup_controller not specified
[2022-01-17T02:45:06.931] Running as primary controller
[2022-01-17T02:45:06.931] No parameter for mcs plugin, default values set
[2022-01-17T02:45:06.931] mcs: MCSParameters = (null). ondemand set.
[2022-01-17T02:45:06.933] error: slurm_receive_msg
[192.168.104.56:49698]: Zero Bytes were transmitted or received
[2022-01-17T02:45:10.004]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-01-17T02:45:10.010] error: Node hpc-node-1 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make sure
they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:10.010] error: Node hpc-node-0 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make sure
they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2022-01-17T02:45:36.927] sched/backfill: _start_job: Started JobId=10
in COMPUTE on hpc-node-[0-2]
log 2
------
srun: job 10 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host hpc-node-2,
check slurm.conf
srun: error: Task launch for StepId=10.0 failed on node hpc-node-2:
Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check
slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
hpc-node-0
hpc-node-1
srun: error: Timed out waiting for job step to complete
--
Nicolas Greneche
USPN
Support à la recherche / RSSI
https://www-magi.univ-paris13.fr