[slurm-users] Help debugging Slurm configuration

Glen MacLachlan maclach at gwu.edu
Thu Dec 8 20:12:24 UTC 2022


Then try using the IP of the controller node as explained here
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldAddr or here
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost.
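
For example, slurm.conf lets you attach an explicit address in parentheses
after the hostname; a sketch, with placeholder values standing in for your
controller's real hostname and IP, would look like this:

# hostname and IP below are placeholders for your controller
SlurmctldHost=head(10.0.0.1)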

Also, if you look at the first few lines of /etc/hosts (just above the line
that reads ### ALL ENTRIES BELOW THIS LINE WILL BE OVERWRITTEN BY WAREWULF
###) you should see a hostname for the head node and its IP address. If
you don't set this correctly, the slurmd daemon won't know how to reach
the slurmctld daemon.
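
For example, you would expect an /etc/hosts entry along these lines, where
the hostname and IP are placeholders for your actual head node:

10.0.0.1    head head.localdomain   # placeholder head-node entry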

Kind Regards,
Glen

==========================================
Glen MacLachlan, PhD
Lead High Performance Computing Engineer

Research Technology Services
The George Washington University
44983 Knoll Square
Enterprise Hall, 328L
Ashburn, VA 20147

==========================================







On Thu, Dec 8, 2022 at 3:06 PM Jeffrey Layton <laytonjb at gmail.com> wrote:

> localhost is the ctrl name :)
>
> I can change it though if needed (I was lazy when I did the initial
> installation).
>
> Thanks!
>
> Jeff
>
>
> On Thu, Dec 8, 2022 at 2:30 PM Glen MacLachlan <maclach at gwu.edu> wrote:
>
>> One other thing to address is that SlurmctldHost should point to the
>> controller node where slurmctld is running, the name of which I would
>> expect Warewulf would put into /etc/hosts.
>> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost
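>>
>> For example, since the slurm.conf you posted has SlurmctldHost=localhost,
>> that would mean changing it to the controller's real hostname (shown here
>> only as a placeholder):
>> # "head" is a placeholder for your controller's hostname
>> SlurmctldHost=head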
>>
>> Kind Regards,
>> Glen
>>
>> ==========================================
>> Glen MacLachlan, PhD
>> Lead High Performance Computing Engineer
>>
>> Research Technology Services
>> The George Washington University
>> 44983 Knoll Square
>> Enterprise Hall, 328L
>> Ashburn, VA 20147
>>
>> ==========================================
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Dec 8, 2022 at 1:58 PM Glen MacLachlan <maclach at gwu.edu> wrote:
>>
>>>
>>> What does running this on the compute node show? (It looks at the
>>> journal log for the past 12 hours.)
>>> journalctl -S -12h -o verbose | grep slurm
>>>
>>>
>>> You may want to increase your debug verbosity to debug5 while tracking
>>> down this issue. For reference, see
>>> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug
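>>>
>>> For instance, a line like this in slurm.conf (and, if you want the same
>>> verbosity on the controller side, the analogous SlurmctldDebug); debug5
>>> is the most verbose level, so turn it back down once the problem is found:
>>> SlurmdDebug=debug5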
>>>
>>> You should also address this error to fix logging:
>>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>>>
>>> by creating the directory /var/log/slurm and making the slurm user its
>>> owner on both the controller and the compute node. Then update your
>>> slurm.conf file like this:
>>> # LOGGING
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>
>>> Then run 'scontrol reconfigure'.
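>>>
>>> For example, the directory setup could be done with something like this
>>> on both nodes (assuming the daemons run under a 'slurm' user and group;
>>> adjust the names if yours differ):
>>> mkdir -p /var/log/slurm
>>> chown slurm:slurm /var/log/slurm   # assumes a 'slurm' user and group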
>>>
>>> Kind Regards,
>>> Glen
>>>
>>> ==========================================
>>> Glen MacLachlan, PhD
>>> Lead High Performance Computing Engineer
>>>
>>> Research Technology Services
>>> The George Washington University
>>> 44983 Knoll Square
>>> Enterprise Hall, 328L
>>> Ashburn, VA 20147
>>>
>>> ==========================================
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Dec 8, 2022 at 1:41 PM Jeffrey Layton <laytonjb at gmail.com>
>>> wrote:
>>>
>>>> Good afternoon,
>>>>
>>>> I have a very simple two-node cluster using Warewulf 4.3. I was
>>>> following some instructions on how to install the OpenHPC Slurm binaries
>>>> (server and client). I booted the compute node and the Slurm server says
>>>> it's in an unknown state. This hasn't happened to me before, but I would
>>>> like to debug the problem.
>>>>
>>>> I checked the services on the Slurm server (head node):
>>>>
>>>> $ systemctl status munge
>>>> ● munge.service - MUNGE authentication service
>>>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled;
>>>> vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s
>>>> ago
>>>>      Docs: man:munged(8)
>>>>   Process: 1140 ExecStart=/usr/sbin/munged (code=exited,
>>>> status=0/SUCCESS)
>>>>  Main PID: 1182 (munged)
>>>>     Tasks: 4 (limit: 48440)
>>>>    Memory: 1.2M
>>>>    CGroup: /system.slice/munge.service
>>>>            └─1182 /usr/sbin/munged
>>>>
>>>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE
>>>> authentication service...
>>>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE
>>>> authentication service.
>>>>
>>>> $ systemctl status slurmctld
>>>> ● slurmctld.service - Slurm controller daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled;
>>>> vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s
>>>> ago
>>>>  Main PID: 1518 (slurmctld)
>>>>     Tasks: 10
>>>>    Memory: 23.0M
>>>>    CGroup: /system.slice/slurmctld.service
>>>>            ├─1518 /usr/sbin/slurmctld -D -s
>>>>            └─1555 slurmctld: slurmscriptd
>>>>
>>>> Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm
>>>> controller daemon.
>>>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No
>>>> parameter for mcs plugin, de>
>>>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs:
>>>> MCSParameters = (null). on>
>>>> Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld:
>>>> SchedulerParameters=default_que>
>>>>
>>>>
>>>>
>>>> I then booted the compute node and checked the services there:
>>>>
>>>> systemctl status munge
>>>> ● munge.service - MUNGE authentication service
>>>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled;
>>>> vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s
>>>> ago
>>>>      Docs: man:munged(8)
>>>>   Process: 786 ExecStart=/usr/sbin/munged (code=exited,
>>>> status=0/SUCCESS)
>>>>  Main PID: 804 (munged)
>>>>     Tasks: 4 (limit: 26213)
>>>>    Memory: 940.0K
>>>>    CGroup: /system.slice/munge.service
>>>>            └─804 /usr/sbin/munged
>>>>
>>>> Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication
>>>> service...
>>>> Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
>>>>
>>>> systemctl status slurmd
>>>> ● slurmd.service - Slurm node daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled;
>>>> vendor preset: disabled)
>>>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53
>>>> UTC; 2min 40s ago
>>>>   Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
>>>> (code=exited, status=1/FAILURE)
>>>>  Main PID: 897 (code=exited, status=1/FAILURE)
>>>>
>>>> Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
>>>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited,
>>>> code=exited, status=1/FAIL>
>>>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result
>>>> 'exit-code'.
>>>>
>>>> # systemctl status slurmd
>>>> ● slurmd.service - Slurm node daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled;
>>>> vendor preset: disabled)
>>>>    Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
>>>>  Main PID: 996 (slurmd)
>>>>     Tasks: 2
>>>>    Memory: 1012.0K
>>>>    CGroup: /system.slice/slurmd.service
>>>>            ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
>>>>            └─997 /usr/sbin/slurmd -D -s --conf-server localhost
>>>>
>>>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>>>
>>>>
>>>>
>>>>
>>>> On the Slurm server I checked the queue and ran "sinfo -a" and found the
>>>> following:
>>>>
>>>> $ squeue
>>>>              JOBID PARTITION     NAME     USER ST       TIME  NODES
>>>> NODELIST(REASON)
>>>> $ sinfo -a
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> normal*      up 1-00:00:00      1   unk* n0001
>>>>
>>>>
>>>> After a few moments (less than a minute, maybe 20-30 seconds), slurmd
>>>> on the compute node fails. When I checked the service I saw this:
>>>>
>>>> $ systemctl status slurmd
>>>> ● slurmd.service - Slurm node daemon
>>>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled;
>>>> vendor preset: disabled)
>>>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13
>>>> UTC; 10min ago
>>>>   Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
>>>> (code=exited, status=1/FAILURE)
>>>>  Main PID: 996 (code=exited, status=1/FAILURE)
>>>>
>>>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited,
>>>> code=exited, status=1/FAIL>
>>>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result
>>>> 'exit-code'.
>>>>
>>>>
>>>> Below are the logs for the Slurm server for today (I rebooted the
>>>> compute node twice):
>>>>
>>>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>>>> [2022-12-08T13:12:17.343] error: Configured MailProg is invalid
>>>> [2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster
>>>> cluster
>>>> [2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
>>>> [2022-12-08T13:12:17.374] Recovered state of 1 nodes
>>>> [2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
>>>> [2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
>>>> [2022-12-08T13:12:17.374] Recovered information about 2 jobs
>>>> [2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array:
>>>> select/cons_tres: preparing for 1 partitions
>>>> [2022-12-08T13:12:17.375] Recovered state of 0 reservations
>>>> [2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not
>>>> specified
>>>> [2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure:
>>>> select/cons_tres: reconfigure
>>>> [2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array:
>>>> select/cons_tres: preparing for 1 partitions
>>>> [2022-12-08T13:12:17.376] Running as primary controller
>>>> [2022-12-08T13:12:17.376] No parameter for mcs plugin, default values
>>>> set
>>>> [2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
>>>> [2022-12-08T13:13:17.471]
>>>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>>>> [2022-12-08T13:17:17.940] error: Nodes n0001 not responding
>>>> [2022-12-08T13:22:17.533] error: Nodes n0001 not responding
>>>> [2022-12-08T13:27:17.048] error: Nodes n0001 not responding
>>>>
>>>> There are no logs on the compute node.
>>>>
>>>> Any suggestions where to start looking? I think I'm seeing the trees
>>>> and not the forest :)
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
>>>>
>>>> P.S. Here are some relevant parts of the server slurm.conf:
>>>>
>>>>
>>>> # slurm.conf file generated by configurator.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ClusterName=cluster
>>>> SlurmctldHost=localhost
>>>> #SlurmctldHost=
>>>>
>>>>
>>>>
>>>>
>>>> Here are some relevant parts of slurm.conf on the client node:
>>>>
>>>>
>>>>
>>>>
>>>> # slurm.conf file generated by configurator.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ClusterName=cluster
>>>> SlurmctldHost=localhost
>>>> #SlurmctldHost=
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>