[slurm-users] Unable to submit job (ReqNodeNotAvail, UnavailableNodes)

JP Ebejer jean.p.ebejer at um.edu.mt
Tue Nov 7 16:43:16 UTC 2023


On Tue, 7 Nov 2023 at 11:34, Diego Zuccato <diego.zuccato at unibo.it> wrote:

> On 07/11/2023 11:15, JP Ebejer wrote:
> > but on running sinfo
> > right after, the node is still "drained".
>
> That's not normal :(
> Look at the slurmd log on the node for a reason. Probably the node
> detects an error and sets itself to drained. Another possibility is that
> slurmctld detects a mismatch between the node and its config: in this
> case you'll find the reason in slurmctld.log .
>

Ok great. So I cleared slurmd.log on the compute-0 node and restarted the
slurmd service (after changing the log level from debug3 to verbose).
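(Roughly what I ran — assuming the standard Debian systemd unit and the log
path set by SlurmdLogFile, so adjust as needed:)

  # slurm.conf on compute-0: lowered the slurmd log level
  #   SlurmdDebug=verbose        (previously debug3)

  # clear the old log and restart slurmd
  sudo truncate -s 0 /var/log/slurm/slurmd.log   # path = SlurmdLogFile (assumed)
  sudo systemctl restart slurmd

The fresh slurmd log then shows: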

[2023-11-07T16:34:17.575] topology/none: init: topology NONE plugin loaded
[2023-11-07T16:34:17.575] route/default: init: route default plugin loaded
[2023-11-07T16:34:17.577] task/affinity: init: task affinity plugin loaded
with CPU mask 0xffffffff
[2023-11-07T16:34:17.578] cred/munge: init: Munge credential signature
plugin loaded
[2023-11-07T16:34:17.578] slurmd version 22.05.8 started
[2023-11-07T16:34:17.579] error:  mpi/pmix_v4: init: (null) [0]:
mpi_pmix.c:195: pmi/pmix: can not load PMIx library
[2023-11-07T16:34:17.579] error: Couldn't load specified plugin name for
mpi/pmix: Plugin init() callback failed
[2023-11-07T16:34:17.579] error: MPI: Cannot create context for mpi/pmix
[2023-11-07T16:34:17.580] error:  mpi/pmix_v4: init: (null) [0]:
mpi_pmix.c:195: pmi/pmix: can not load PMIx library
[2023-11-07T16:34:17.580] error: Couldn't load specified plugin name for
mpi/pmix_v4: Plugin init() callback failed
[2023-11-07T16:34:17.580] error: MPI: Cannot create context for mpi/pmix_v4
[2023-11-07T16:34:17.580] slurmd started on Tue, 07 Nov 2023 16:34:17 +0000
[2023-11-07T16:34:17.580] CPUs=32 Boards=1 Sockets=2 Cores=8 Threads=2
Memory=64171 TmpDisk=1031475 Uptime=87818 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)

I am not sure I understand this: my MPI setting is none (MpiDefault=none in
slurm.conf), and the jobs I intend to run do not use MPI.

Could this be the cause, and how do I fix this (on Debian 12)?
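(For reference, the relevant line from my slurm.conf, plus the checks I would
run to confirm the MPI setting and whether any PMIx library is present — the
Debian 12 package names here are only a guess:)

  # slurm.conf
  MpiDefault=none

  # what the running controller reports
  scontrol show config | grep -i MpiDefault

  # is a PMIx library installed at all?
  # (libpmix2 / libpmix-dev are guesses at the Debian 12 package names)
  dpkg -l | grep -i pmix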

Also, if I stop the slurmctld service, truncate its log file, and start it
again, I see similar errors.
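(For completeness, the stop/truncate/start was along these lines — again
assuming systemd and the log path set by SlurmctldLogFile:)

  sudo systemctl stop slurmctld
  sudo truncate -s 0 /var/log/slurm/slurmctld.log   # path = SlurmctldLogFile (assumed)
  sudo systemctl start slurmctld

The freshly written slurmctld log: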

[2023-11-07T16:40:22.888] error: Configured MailProg is invalid
[2023-11-07T16:40:22.889] slurmctld version 22.05.8 started on cluster
mycluster
[2023-11-07T16:40:22.890] cred/munge: init: Munge credential signature
plugin loaded
[2023-11-07T16:40:22.892] select/cons_res: common_init: select/cons_res
loaded
[2023-11-07T16:40:22.892] select/cons_tres: common_init: select/cons_tres
loaded
[2023-11-07T16:40:22.892] select/cray_aries: init: Cray/Aries node
selection plugin loaded
[2023-11-07T16:40:22.893] preempt/none: init: preempt/none loaded
[2023-11-07T16:40:22.894] ext_sensors/none: init: ExtSensors NONE plugin
loaded
[2023-11-07T16:40:22.895] error:  mpi/pmix_v4: init: (null) [0]:
mpi_pmix.c:195: pmi/pmix: can not load PMIx library
[2023-11-07T16:40:22.895] error: Couldn't load specified plugin name for
mpi/pmix_v4: Plugin init() callback failed
[2023-11-07T16:40:22.895] error: MPI: Cannot create context for mpi/pmix_v4
[2023-11-07T16:40:22.899] accounting_storage/none: init: Accounting storage
NOT INVOKED plugin loaded
[2023-11-07T16:40:22.901] No memory enforcing mechanism configured.
[2023-11-07T16:40:22.902] topology/none: init: topology NONE plugin loaded
[2023-11-07T16:40:22.904] sched: Backfill scheduler plugin loaded
[2023-11-07T16:40:22.904] route/default: init: route default plugin loaded
[2023-11-07T16:40:22.905] Recovered state of 1 nodes
[2023-11-07T16:40:22.905] Recovered JobId=8 Assoc=0
[2023-11-07T16:40:22.905] Recovered JobId=9 Assoc=0
[2023-11-07T16:40:22.905] Recovered JobId=10 Assoc=0
[2023-11-07T16:40:22.905] Recovered JobId=11 Assoc=0
[2023-11-07T16:40:22.905] Recovered information about 4 jobs
[2023-11-07T16:40:22.906] select/cons_tres: select_p_node_init:
select/cons_tres SelectTypeParameters not specified, using default value:
CR_Core_Memory
[2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 1 partitions
[2023-11-07T16:40:22.906] Recovered state of 0 reservations
[2023-11-07T16:40:22.906] State of 0 triggers recovered
[2023-11-07T16:40:22.906] read_slurm_conf: backup_controller not specified
[2023-11-07T16:40:22.906] select/cons_tres: select_p_reconfigure:
select/cons_tres: reconfigure
[2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 1 partitions
[2023-11-07T16:40:22.906] Running as primary controller
[2023-11-07T16:40:22.907] No parameter for mcs plugin, default values set
[2023-11-07T16:40:22.907] mcs: MCSParameters = (null). ondemand set.


Is this a step closer to resolution?

-- 
The contents of this email are subject to these terms:
<https://www.um.edu.mt/disclaimer/email/>.