[slurm-users] Help debugging Slurm configuration
Jeffrey Layton
laytonjb at gmail.com
Thu Dec 8 18:37:48 UTC 2022
Good afternoon,
I have a very simple two-node cluster using Warewulf 4.3. I was following
some instructions on how to install the OpenHPC Slurm binaries (server and
client). I booted the compute node, and the Slurm server reports that the
node is in an unknown state. This hasn't happened to me before, and I would
like to debug the problem.
I checked the services on the Slurm server (head node):
$ systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
     Docs: man:munged(8)
  Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 1182 (munged)
    Tasks: 4 (limit: 48440)
   Memory: 1.2M
   CGroup: /system.slice/munge.service
           └─1182 /usr/sbin/munged

Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.
$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
 Main PID: 1518 (slurmctld)
    Tasks: 10
   Memory: 23.0M
   CGroup: /system.slice/slurmctld.service
           ├─1518 /usr/sbin/slurmctld -D -s
           └─1555 slurmctld: slurmscriptd

Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>
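(Side note: systemctl truncates those last journal lines at the terminal width; if the full slurmctld messages are useful, I believe plain journalctl will show them untruncated, nothing Slurm-specific:

# journalctl -u slurmctld --no-pager --full
)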
I then booted the compute node and checked the services there:
systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
     Docs: man:munged(8)
  Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 804 (munged)
    Tasks: 4 (limit: 26213)
   Memory: 940.0K
   CGroup: /system.slice/munge.service
           └─804 /usr/sbin/munged

Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
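(As a MUNGE sanity check, my understanding is that a credential generated on the head node should decode on the compute node if both have the same munge.key; assuming I can ssh to n0001, something like

$ munge -n | ssh n0001 unmunge

should report STATUS: Success if the keys and clocks agree.)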
systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
  Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 897 (code=exited, status=1/FAILURE)

Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
I restarted slurmd on the compute node and it came up, at least briefly:
# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
 Main PID: 996 (slurmd)
    Tasks: 2
   Memory: 1012.0K
   CGroup: /system.slice/slurmd.service
           ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
           └─997 /usr/sbin/slurmd -D -s --conf-server localhost

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
On the Slurm server I checked the queue and ran "sinfo -a", and found the following:
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 1-00:00:00      1   unk* n0001
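(To see what reason Slurm has recorded for the node state, I believe either of these on the head node would show it:

$ sinfo -R
$ scontrol show node n0001

though since the node has never checked in, the reason may simply be that it is not responding.)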
After a few moments (less than a minute, maybe 20-30 seconds), slurmd on
the compute node fails. When I checked the service I saw this:
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
  Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 996 (code=exited, status=1/FAILURE)

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.
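(The journal entry doesn't say why slurmd exits, so one way I know of to surface the actual error is to run it by hand in the foreground with extra verbosity, using the same options the service was started with:

# /usr/sbin/slurmd -D -vvv --conf-server localhost

-D keeps it in the foreground and each -v raises the logging level, so the fatal message should print straight to the terminal.)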
Below are the logs from the Slurm server for today (I rebooted the compute
node twice):
[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
[2022-12-08T13:12:17.343] error: Configured MailProg is invalid
[2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
[2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
[2022-12-08T13:12:17.374] Recovered state of 1 nodes
[2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
[2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
[2022-12-08T13:12:17.374] Recovered information about 2 jobs
[2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.375] Recovered state of 0 reservations
[2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
[2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.376] Running as primary controller
[2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
[2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
[2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-12-08T13:17:17.940] error: Nodes n0001 not responding
[2022-12-08T13:22:17.533] error: Nodes n0001 not responding
[2022-12-08T13:27:17.048] error: Nodes n0001 not responding
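(Those "not responding" errors suggest slurmctld can't reach slurmd on n0001, or that slurmd just isn't running there since it keeps exiting. If it helps, a quick check from the head node that the compute node's slurmd port is reachable might look like this, assuming the default SlurmdPort of 6818 and bash's /dev/tcp trick:

$ timeout 3 bash -c 'echo > /dev/tcp/n0001/6818' && echo "slurmd port reachable"

and on the compute node, "scontrol ping" should say whether it can reach the controller at all.)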
There are no logs on the compute node.
Any suggestions where to start looking? I think I'm seeing the trees and
not the forest :)
Thanks!
Jeff
P.S. Here are some relevant parts of the slurm.conf on the server:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...
Here are some relevant parts of slurm.conf on the client node:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...
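(For what it's worth, my understanding is that SlurmctldHost in slurm.conf, and the --conf-server value slurmd is started with on the compute node, are both supposed to name the host that actually runs slurmctld rather than localhost. A rough sketch of what I think that would look like, where "head" is just a placeholder hostname:

ClusterName=cluster
SlurmctldHost=head        # placeholder: hostname of the node running slurmctld
# and on the compute node: /usr/sbin/slurmd -D -s --conf-server head

I may be misreading how the OpenHPC recipe intends this to be set up, though.)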