[slurm-users] Question about PMIX ERROR messages being emitted by some child of srun process

Pritchard Jr., Howard howardp at lanl.gov
Fri May 19 19:09:15 UTC 2023


Hi,

I’m testing a pre-release of Open MPI 5.0.0 with the Slurm/PMIx setup currently on the NERSC Perlmutter system.
First off, if I use the PRRTE launch system, I don’t see the issue I’m raising here.

But many NERSC users prefer to use the srun “native” launch method with applications compiled against Open MPI, hence this email.

The Slurm version on Perlmutter is currently 23.02.2.

The PMIx version that the admins built Slurm against is pmix-4.2.3. I’ve attached the output of pmix_info.

I’ve tested Open MPI 5.0.0rc11 (or HEAD of 5.0.x) both with the PMIx embedded in Open MPI and with the external PMIx 4.2.3 install.
I get the same results, shown below, whether my app is linked against the system PMIx or the embedded one.

My test application “works” but if I use srun, I get these types of messages:

srun -n 2 -N 2 --mpi=pmix ./ring_c

[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[cn315:1037721] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417
[cn316:2770176] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file base/gds_base_fns.c at line 268
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file dstore_base.c at line 2624
[cn316:2770176] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file server/pmix_server.c at line 3417

After a lot of stracing and adding debug statements to the PMIx I have control over (the one embedded in the Open MPI tarball), I realized that these
messages are not coming from the app, but from some transient process between the srun/slurmd processes and the application processes.
The pids in these error messages are those of the parents of the MPI processes.
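
For anyone who wants to repeat that check, here is a minimal sketch of how the reporting pids can be pulled out of captured srun output and compared with the parent pid of an MPI rank. The function names pmix_error_sources and parent_pid are made up for illustration, and the sketch assumes a Linux node with /proc mounted.

```shell
#!/bin/sh
# Hypothetical helper: list the unique host:pid pairs that emitted
# "PMIX ERROR" lines. Reads captured srun output on stdin.
pmix_error_sources() {
    sed -n 's/^\[\([^]:]*:[0-9][0-9]*\)] PMIX ERROR: .*/\1/p' | sort -u
}

# Hypothetical helper: print the parent pid of the given pid on the
# local node, read from /proc/<pid>/status (Linux only).
parent_pid() {
    awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}
```

On a compute node, running parent_pid with the pid of one ring_c rank and comparing the result with the pids printed by pmix_error_sources should show whether the messages come from the rank's parent.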

I’ve tried various things, like turning off the PMIx GDS shmem component, but that doesn’t help. I’ve also toggled the various SLURM_PMIX environment variables, to no effect.
This problem does not appear to be related to a recent Slurm/PMIx patch (https://bugs.schedmd.com/show_bug.cgi?id=16306#a0), and in any case that patch looks like it should already be in 23.02.2.

Another bit of info:

scontrol show config | grep -i pmix
PMIxCliTmpDirBase       = (null)
PMIxCollFence           = (null)
PMIxDebug               = 0
PMIxDirectConn          = yes
PMIxDirectConnEarly     = no
PMIxDirectConnUCX       = no
PMIxDirectSameArch      = no
PMIxEnv                 = (null)
PMIxFenceBarrier        = no
PMIxNetDevicesUCX       = (null)
PMIxTimeout             = 300
PMIxTlsUCX              = (null)
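
When comparing these settings across systems or Slurm builds, it may be handy to pull out single values in a script. The following is only a sketch: the function name pmix_conf_value is hypothetical, and it assumes the "Key = value" layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from
# "scontrol show config" style output read on stdin.
# The key must match exactly, so PMIxDirectConn does not
# also match PMIxDirectConnEarly.
pmix_conf_value() {
    awk -v key="$1" '
    {
        i = index($0, "=")              # split on the first "=" only
        if (i == 0) next
        k = substr($0, 1, i - 1)
        v = substr($0, i + 1)
        gsub(/[ \t]/, "", k)            # strip padding around the key
        sub(/^[ \t]+/, "", v)           # trim leading blanks of the value
        sub(/[ \t]+$/, "", v)           # trim trailing blanks of the value
        if (k == key) print v
    }'
}
```

For example: scontrol show config | pmix_conf_value PMIxDirectConn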

I myself don’t care too much about these messages.
But users may find them disconcerting, and they may cause automated regression testing frameworks to report lots of errors.
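
Until the cause is understood, a test harness could separate these lines from the application's own output instead of counting them as failures. A minimal sketch, assuming the messages always have the "[host:pid] PMIX ERROR:" shape shown above; the function name strip_pmix_errors is hypothetical, and this only hides the messages, it does not fix anything.

```shell
#!/bin/sh
# Hypothetical helper: drop "[host:pid] PMIX ERROR: ..." lines from
# the stream on stdin and pass everything else through unchanged.
# "|| true" keeps the exit status at 0 when every line is filtered out.
strip_pmix_errors() {
    grep -v -E '^\[[^]:]+:[0-9]+] PMIX ERROR: ' || true
}
```

For example: srun -n 2 -N 2 --mpi=pmix ./ring_c 2>&1 | strip_pmix_errors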

Should I ask NERSC to file a ticket with SchedMD? Or does someone know how to turn these messages off, if in fact they are not important, or better yet, know why a Slurm process may be emitting these errors and how to fix it?

Thanks for any ideas,

Howard


—

Howard Pritchard
Research Scientist
HPC-ENV

Los Alamos National Laboratory
howardp at lanl.gov




-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pmix_info.pmutter.txt
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20230519/1ac56c8f/attachment-0001.txt>
