[slurm-users] Help debugging Slurm configuration

Jeffrey Layton laytonjb at gmail.com
Thu Dec 8 20:03:45 UTC 2022


Thanks, Glen!

I changed the slurm.conf logging to "debug5" on both the server and the
client.

I also created /var/log/slurm on both the client and server and chown-ed it
to slurm:slurm.

On the server I did "scontrol reconfigure".
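
In other words, the directory setup and reconfigure were something like this
(a minimal sketch, assuming the slurm:slurm user/group that the OpenHPC
packages create):

mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm
scontrol reconfigure    # on the server only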

Then I rebooted the compute node. When I logged in, slurmd was not up. I ran
systemctl start slurmd. It stayed up for about 5 seconds and then stopped.


# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor
preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 19:51:58 UTC;
2min 33s ago
  Process: 1299 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
(code=exited, status=1/FAILURE)
 Main PID: 1299 (code=exited, status=1/FAILURE)

Dec 08 19:51:49 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Main process exited,
code=exited, status=1/FAIL>
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Failed with result
'exit-code'.


Here is the output from grepping through journalctl:


    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited,
status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    MESSAGE=Operator of unix-process:911:7771 successfully authenticated as
unix-user:root to gain
 ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units
for system-bus-name::1.24
 [systemctl start slurmd] (owned by unix-user:laytonjb)
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited,
status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited,
status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    MESSAGE=Operator of unix-process:1254:240421 successfully authenticated
as unix-user:root to g
ain ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units
for system-bus-name::1
.47 [systemctl start slurmd] (owned by unix-user:laytonjb)
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited,
status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited,
status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.


These don't look too useful even with debug5 on.
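
For reference, the same init error should also land in the new slurmd log
file, or show up when slurmd is run in the foreground with extra verbosity;
a sketch, assuming the log path from the updated slurm.conf:

tail -n 50 /var/log/slurm/slurmd.log
slurmd -D -vvvvv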

Any thoughts?

Thanks!

Jeff


On Thu, Dec 8, 2022 at 2:01 PM Glen MacLachlan <maclach at gwu.edu> wrote:

>
> What does running this on the compute node show? (This looks at the journal
> log for the past 12 hours.)
> journalctl -S -12h -o verbose | grep slurm
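>
> A narrower variant that only looks at the slurmd unit over the same window
> (a sketch):
> journalctl -u slurmd -S -12h --no-pager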
>
>
> You may want to increase your debug verbosity to debug5 while tracking down
> this issue. For reference, see
> https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug
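>
> In slurm.conf that would be something like (a sketch; the controller-side
> setting is optional):
> SlurmctldDebug=debug5
> SlurmdDebug=debug5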
>
> You should also address this error to fix logging:
> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>
> by making a directory /var/log/slurm and making the slurm user the owner
> on both the controller and compute node. Then update your slurm.conf file
> like this:
> # LOGGING
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdLogFile=/var/log/slurm/slurmd.log
>
> and then run 'scontrol reconfigure'.
>
> Kind Regards,
> Glen
>
> ==========================================
> Glen MacLachlan, PhD
> *Lead High Performance Computing Engineer*
>
> Research Technology Services
> The George Washington University
> 44983 Knoll Square
> Enterprise Hall, 328L
> Ashburn, VA 20147
>
> ==========================================
>
>
>
>
>
>
>
> On Thu, Dec 8, 2022 at 1:41 PM Jeffrey Layton <laytonjb at gmail.com> wrote:
>
>> Good afternoon,
>>
>> I have a very simple two-node cluster using Warewulf 4.3. I was following
>> some instructions on how to install the OpenHPC Slurm binaries (server and
>> client). I booted the compute node, and the Slurm server says it is in an
>> unknown state. This hasn't happened to me before, but I would like to debug
>> the problem.
>>
>> I checked the services on the Slurm server (head node):
>>
>> $ systemctl status munge
>> ● munge.service - MUNGE authentication service
>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor
>> preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s
>> ago
>>      Docs: man:munged(8)
>>   Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>  Main PID: 1182 (munged)
>>     Tasks: 4 (limit: 48440)
>>    Memory: 1.2M
>>    CGroup: /system.slice/munge.service
>>            └─1182 /usr/sbin/munged
>>
>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE
>> authentication service...
>> Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE
>> authentication service.
>>
>> $ systemctl status slurmctld
>> ● slurmctld.service - Slurm controller daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled;
>> vendor preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s
>> ago
>>  Main PID: 1518 (slurmctld)
>>     Tasks: 10
>>    Memory: 23.0M
>>    CGroup: /system.slice/slurmctld.service
>>            ├─1518 /usr/sbin/slurmctld -D -s
>>            └─1555 slurmctld: slurmscriptd
>>
>> Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm
>> controller daemon.
>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No
>> parameter for mcs plugin, de>
>> Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs:
>> MCSParameters = (null). on>
>> Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld:
>> SchedulerParameters=default_que>
>>
>>
>>
>> I then booted the compute node and checked the services there:
>>
>> systemctl status munge
>> ● munge.service - MUNGE authentication service
>>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor
>> preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s
>> ago
>>      Docs: man:munged(8)
>>   Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>>  Main PID: 804 (munged)
>>     Tasks: 4 (limit: 26213)
>>    Memory: 940.0K
>>    CGroup: /system.slice/munge.service
>>            └─804 /usr/sbin/munged
>>
>> Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
>> Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.
>>
>> systemctl status slurmd
>> ● slurmd.service - Slurm node daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled;
>> vendor preset: disabled)
>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC;
>> 2min 40s ago
>>   Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
>> (code=exited, status=1/FAILURE)
>>  Main PID: 897 (code=exited, status=1/FAILURE)
>>
>> Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited,
>> code=exited, status=1/FAIL>
>> Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result
>> 'exit-code'.
>>
>> # systemctl status slurmd
>> ● slurmd.service - Slurm node daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled;
>> vendor preset: disabled)
>>    Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
>>  Main PID: 996 (slurmd)
>>     Tasks: 2
>>    Memory: 1012.0K
>>    CGroup: /system.slice/slurmd.service
>>            ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
>>            └─997 /usr/sbin/slurmd -D -s --conf-server localhost
>>
>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>>
>>
>>
>>
>> On the Slurm server I checked the queue and "sinfo -a" and found the
>> following:
>>
>> $ squeue
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES
>> NODELIST(REASON)
>> $ sinfo -a
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> normal*      up 1-00:00:00      1   unk* n0001
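>>
>> For reference, the state and reason the controller has recorded for the
>> node can be dumped with something like this (a sketch):
>> scontrol show node n0001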
>>
>>
>> After a few moments (less than a minute, maybe 20-30 seconds), slurmd on
>> the compute node failed. When I checked the service, I saw this:
>>
>> $ systemctl status slurmd
>> ● slurmd.service - Slurm node daemon
>>    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled;
>> vendor preset: disabled)
>>    Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC;
>> 10min ago
>>   Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
>> (code=exited, status=1/FAILURE)
>>  Main PID: 996 (code=exited, status=1/FAILURE)
>>
>> Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited,
>> code=exited, status=1/FAIL>
>> Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result
>> 'exit-code'.
>>
>>
>> Below are the logs from the Slurm server for today (I rebooted the compute
>> node twice):
>>
>> [2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
>> [2022-12-08T13:12:17.343] error: Configured MailProg is invalid
>> [2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster
>> cluster
>> [2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
>> [2022-12-08T13:12:17.374] Recovered state of 1 nodes
>> [2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
>> [2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
>> [2022-12-08T13:12:17.374] Recovered information about 2 jobs
>> [2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array:
>> select/cons_tres: preparing for 1 partitions
>> [2022-12-08T13:12:17.375] Recovered state of 0 reservations
>> [2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
>> [2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure:
>> select/cons_tres: reconfigure
>> [2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array:
>> select/cons_tres: preparing for 1 partitions
>> [2022-12-08T13:12:17.376] Running as primary controller
>> [2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
>> [2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
>> [2022-12-08T13:13:17.471]
>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>> [2022-12-08T13:17:17.940] error: Nodes n0001 not responding
>> [2022-12-08T13:22:17.533] error: Nodes n0001 not responding
>> [2022-12-08T13:27:17.048] error: Nodes n0001 not responding
>>
>> There are no logs on the compute node.
>>
>> Any suggestions on where to start looking? I think I'm seeing the trees and
>> not the forest :)
>>
>> Thanks!
>>
>> Jeff
>>
>> P.S. Here are some relevant parts of the server slurm.conf:
>>
>>
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ClusterName=cluster
>> SlurmctldHost=localhost
>> #SlurmctldHost=
>> ...
>>
>>
>>
>>
>> Here are some relevant parts of slurm.conf on the client node:
>>
>>
>>
>>
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ClusterName=cluster
>> SlurmctldHost=localhost
>> #SlurmctldHost=
>> ...
>>
>>
>>
>>
>>
>>
>>
>>