[slurm-users] Help debugging Slurm configuration

Thu Dec 8 18:58:52 UTC 2022

What does running this on the compute node show? (It searches the journal
log for the past 12 hours for Slurm messages.)
journalctl -S -12h -o verbose | grep slurm

You may want to increase your debug verbosity to debug5 while tracking
down this issue.
For reference, see https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug
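
As a rough sketch, the verbosity change could look like this in slurm.conf.
SlurmdDebug is the parameter linked above; raising SlurmctldDebug as well is
an assumption on my part that controller-side detail would also help. Both
should go back to their previous values once the problem is found, since
debug5 is very noisy:

```
# DEBUGGING (temporary, revert after troubleshooting)
SlurmdDebug=debug5
# Assumption: more detail from the controller is wanted too
SlurmctldDebug=debug5
```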

You should also address this error to fix logging:
[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied

by making a directory /var/log/slurm and making the slurm user the owner on
both the controller and compute node. Then update your slurm.conf file like
this:
# LOGGING
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

and then run 'scontrol reconfigure'.
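
The directory step above can be sketched as a small shell helper. This is
only an illustration: the function name make_slurm_log_dir is hypothetical,
and it assumes the Slurm account is literally named "slurm" (check SlurmUser
in your slurm.conf if unsure). It needs to be run as root on both the
controller and the compute node:

```shell
# Hypothetical helper: create a log directory and give it to one user.
#   $1 = directory to create, $2 = user who should own it
make_slurm_log_dir() {
    dir="$1"
    owner="$2"
    # -p creates missing parents and does not fail if the directory exists
    mkdir -p "$dir" || return 1
    # Hand the directory to the given user so the daemon can write its log
    chown "$owner" "$dir" || return 1
}

# Typical use, as root (assumes the account is named "slurm"):
#   make_slurm_log_dir /var/log/slurm slurm
```

Afterwards 'ls -ld /var/log/slurm' should list slurm as the owner.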

Kind Regards,
Glen

==========================================
Glen MacLachlan, PhD
Lead High Performance Computing Engineer

Research Technology Services
The George Washington University
44983 Knoll Square
Enterprise Hall, 328L
Ashburn, VA 20147

==========================================

On Thu, Dec 8, 2022 at 1:41 PM Jeffrey Layton <laytonjb at gmail.com> wrote:

> Good afternoon,
>
> I have a very simple two node cluster using Warewulf 4.3. I was following
> some instructions on how to install the OpenHPC Slurm binaries (server and
> client). I booted the compute node and the Slurm Server says it's in an
> unknown state. This hasn't happened to me before but I would like to debug
> the problem.
>
> I checked the services on the Slurm server (head node):
>
> $ systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor
> preset: disabled)
>    Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
>      Docs: man:munged(8)
>   Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 1182 (munged)
>     Tasks: 4 (limit: 48440)
>    Memory: 1.2M
>    CGroup: /system.slice/munge.service
>            └─1182 /usr/sbin/munged
>
> Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE
> authentication service...
> Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE
> authentication service.
>
> $ systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled;
> vendor preset: disabled)
>    Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
>  Main PID: 1518 (slurmctld)
>     Tasks: 10
>    Memory: 23.0M
>    CGroup: /system.slice/slurmctld.service
>            ├─1518 /usr/sbin/slurmctld -D -s
>            └─1555 slurmctld: slurmscriptd
>
> Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller
> daemon.
> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No
> parameter for mcs plugin, de>
> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs:
> MCSParameters = (null). on>
> Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld:
> SchedulerParameters=default_que>
>
>
>
> I then booted the compute node and checked the services there:
>
> systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor
> preset: disabled)
>    Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
>      Docs: man:munged(8)
>   Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 804 (munged)
>     Tasks: 4 (limit: 26213)
>    Memory: 940.0K
>    CGroup: /system.slice/munge.service
>            └─804 /usr/sbin/munged
>
> Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
> Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
>
> systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor
> preset: disabled)
>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC;
> 2min 40s ago
>   Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
> (code=exited, status=1/FAILURE)
>  Main PID: 897 (code=exited, status=1/FAILURE)
>
> Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited,
> code=exited, status=1/FAIL>
> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result
> 'exit-code'.
>
> # systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor
> preset: disabled)
>    Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
>  Main PID: 996 (slurmd)
>     Tasks: 2
>    Memory: 1012.0K
>    CGroup: /system.slice/slurmd.service
>            ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
>            └─997 /usr/sbin/slurmd -D -s --conf-server localhost
>
> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>
>
>
>
> On the Slurm server I checked the queue and "sinfo -a" and found the
> following:
>
> $ squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
> $ sinfo -a
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> normal*      up 1-00:00:00      1   unk* n0001
>
>
> After a few moments (less than a minute, maybe 20-30 seconds), slurmd on
> the compute node fails. When I checked the service I saw this:
>
> $ systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor
> preset: disabled)
>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC;
> 10min ago
>   Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
> (code=exited, status=1/FAILURE)
>  Main PID: 996 (code=exited, status=1/FAILURE)
>
> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited,
> code=exited, status=1/FAIL>
> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result
> 'exit-code'.
>
>
> Below are the logs for the Slurm server for today (I rebooted the compute
> node twice):
>
> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
> [2022-12-08T13:12:17.343] error: Configured MailProg is invalid
> [2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster
> cluster
> [2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
> [2022-12-08T13:12:17.374] Recovered state of 1 nodes
> [2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
> [2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
> [2022-12-08T13:12:17.374] Recovered information about 2 jobs
> [2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 1 partitions
> [2022-12-08T13:12:17.375] Recovered state of 0 reservations
> [2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
> [2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure:
> select/cons_tres: reconfigure
> [2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array:
> select/cons_tres: preparing for 1 partitions
> [2022-12-08T13:12:17.376] Running as primary controller
> [2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
> [2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
> [2022-12-08T13:13:17.471]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2022-12-08T13:17:17.940] error: Nodes n0001 not responding
> [2022-12-08T13:22:17.533] error: Nodes n0001 not responding
> [2022-12-08T13:27:17.048] error: Nodes n0001 not responding
>
> There are no logs on the compute node.
>
> Any suggestions where to start looking? I think I'm seeing the trees and
> not the forest :)
>
> Thanks!
>
> Jeff
>
> P.S. Here are some relevant parts of the server slurm.conf:
>
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ClusterName=cluster
> SlurmctldHost=localhost
> #SlurmctldHost=
> ...
>
>
>
>
> Here are some relevant parts of slurm.conf on the client node:
>
>
>
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ClusterName=cluster
> SlurmctldHost=localhost
> #SlurmctldHost=
> ...
>