[slurm-users] Unable to submit job (ReqNodeNotAvail, UnavailableNodes)

JP Ebejer jean.p.ebejer at um.edu.mt
Sun Nov 12 10:11:43 UTC 2023


OK, so I am a step further (I hope), but I am still stuck with a
non-working cluster.

I managed to solve both problems reported below (the invalid MailProg and
the PMIx library that could not be loaded) by installing two Debian
packages (sudo apt install mailutils libpmix-dev) on both the head and
compute nodes.

There are now no errors in either log file (slurmd.log and slurmctld.log),
but the node is still drained.

How do I get around this, please?
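
A minimal sketch of the checks that apply at this point, under stated assumptions: the node name compute-0 is taken from the quoted message below, the Slurm client tools are assumed to be on the PATH, and the resume step assumes admin rights. It only prints the recorded drain reason; the resume command is left commented out because it should only be run once the underlying cause is confirmed fixed.

```shell
#!/bin/sh
# Assumption: the compute node is called compute-0 (from the quoted message).
NODE="${NODE:-compute-0}"

if command -v scontrol >/dev/null 2>&1; then
    # List the reason recorded for every down or drained node.
    sinfo -R
    # Show the full node record, including its State= and Reason= fields.
    scontrol show node "$NODE"
    # Clear the drain flag by hand once the cause is fixed (needs admin rights):
    # scontrol update NodeName="$NODE" State=RESUME
else
    echo "scontrol not found; run this on a host with the Slurm client tools"
fi
```

The reason this matters is that a drain flag is normally not cleared just because the errors have disappeared from the logs, so a clean log and a drained node can coexist.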

On Tue, 7 Nov 2023 at 17:43, JP Ebejer <jean.p.ebejer at um.edu.mt> wrote:

>
>
> On Tue, 7 Nov 2023 at 11:34, Diego Zuccato <diego.zuccato at unibo.it> wrote:
>
>> On 07/11/2023 11:15, JP Ebejer wrote:
>> > but on running sinfo
>> > right after, the node is still "drained".
>>
>> That's not normal :(
>> Look at the slurmd log on the node for a reason. Probably the node
>> detects an error and sets itself to drained. Another possibility is that
>> slurmctld detects a mismatch between the node and its config: in this
>> case you'll find the reason in slurmctld.log .
>>
>
> OK, great. So I cleared slurmd.log on the compute-0 node and restarted the
> service (after changing the log level from debug3 to verbose).
>
> [2023-11-07T16:34:17.575] topology/none: init: topology NONE plugin loaded
> [2023-11-07T16:34:17.575] route/default: init: route default plugin loaded
> [2023-11-07T16:34:17.577] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffff
> [2023-11-07T16:34:17.578] cred/munge: init: Munge credential signature plugin loaded
> [2023-11-07T16:34:17.578] slurmd version 22.05.8 started
> [2023-11-07T16:34:17.579] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
> [2023-11-07T16:34:17.579] error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
> [2023-11-07T16:34:17.579] error: MPI: Cannot create context for mpi/pmix
> [2023-11-07T16:34:17.580] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
> [2023-11-07T16:34:17.580] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
> [2023-11-07T16:34:17.580] error: MPI: Cannot create context for mpi/pmix_v4
> [2023-11-07T16:34:17.580] slurmd started on Tue, 07 Nov 2023 16:34:17 +0000
> [2023-11-07T16:34:17.580] CPUs=32 Boards=1 Sockets=2 Cores=8 Threads=2 Memory=64171 TmpDisk=1031475 Uptime=87818 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> I am not sure I understand this: my MPI setting is none (MpiDefault=none),
> and the jobs I intend to run do not use MPI.
>
> Could this be the cause of the drained node, and how do I fix it (on
> Debian 12)?
>
> Also, if I stop the slurmctld service, truncate its log file, and start
> the service again, I see similar errors:
>
> [2023-11-07T16:40:22.888] error: Configured MailProg is invalid
> [2023-11-07T16:40:22.889] slurmctld version 22.05.8 started on cluster mycluster
> [2023-11-07T16:40:22.890] cred/munge: init: Munge credential signature plugin loaded
> [2023-11-07T16:40:22.892] select/cons_res: common_init: select/cons_res loaded
> [2023-11-07T16:40:22.892] select/cons_tres: common_init: select/cons_tres loaded
> [2023-11-07T16:40:22.892] select/cray_aries: init: Cray/Aries node selection plugin loaded
> [2023-11-07T16:40:22.893] preempt/none: init: preempt/none loaded
> [2023-11-07T16:40:22.894] ext_sensors/none: init: ExtSensors NONE plugin loaded
> [2023-11-07T16:40:22.895] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can not load PMIx library
> [2023-11-07T16:40:22.895] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
> [2023-11-07T16:40:22.895] error: MPI: Cannot create context for mpi/pmix_v4
> [2023-11-07T16:40:22.899] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
> [2023-11-07T16:40:22.901] No memory enforcing mechanism configured.
> [2023-11-07T16:40:22.902] topology/none: init: topology NONE plugin loaded
> [2023-11-07T16:40:22.904] sched: Backfill scheduler plugin loaded
> [2023-11-07T16:40:22.904] route/default: init: route default plugin loaded
> [2023-11-07T16:40:22.905] Recovered state of 1 nodes
> [2023-11-07T16:40:22.905] Recovered JobId=8 Assoc=0
> [2023-11-07T16:40:22.905] Recovered JobId=9 Assoc=0
> [2023-11-07T16:40:22.905] Recovered JobId=10 Assoc=0
> [2023-11-07T16:40:22.905] Recovered JobId=11 Assoc=0
> [2023-11-07T16:40:22.905] Recovered information about 4 jobs
> [2023-11-07T16:40:22.906] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
> [2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> [2023-11-07T16:40:22.906] Recovered state of 0 reservations
> [2023-11-07T16:40:22.906] State of 0 triggers recovered
> [2023-11-07T16:40:22.906] read_slurm_conf: backup_controller not specified
> [2023-11-07T16:40:22.906] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
> [2023-11-07T16:40:22.906] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> [2023-11-07T16:40:22.906] Running as primary controller
> [2023-11-07T16:40:22.907] No parameter for mcs plugin, default values set
> [2023-11-07T16:40:22.907] mcs: MCSParameters = (null). ondemand set.
>
>
> Is this a step closer to resolution?

-- 
Prof. Jean-Paul Ebejer | Associate Professor
BSc (Hons) (Melita), MSc (Imperial), DPhil (Oxon.)

Centre for Molecular Medicine and Biobanking
Office 320, Biomedical Sciences Building,
University of Malta, Msida, MSD 2080.  MALTA.
T: (00356) 2340 3263

Department of Artificial Intelligence
Associate Member