[slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3

Barbara Krašovec barbara.krasovec at ijs.si
Tue Jun 2 07:12:22 UTC 2020


AFAIK, certain versions of UCX had a problem where UCX expected the OPAL
memory hooks from Open MPI, but they were disabled, so the physical pages
got out of sync. I don't know whether that is what is happening here.

Maybe you could run dynamic debug to see if there is something useful in
dmesg:

echo "module mlx5_core +p"| tee /sys/kernel/debug/dynamic_debug/control

And you could also try to run ucx_info in debug mode.
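
A rough sketch of both steps, in case it helps. The function names are
made up for illustration, and I am assuming "debug mode" means raising
UCX's log level through the UCX_LOG_LEVEL environment variable; how much
gets printed depends on how UCX was built. Writing to the dynamic_debug
control file needs root and a mounted debugfs.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not from any tool.

# Path from the command above; overridable so the helper can be dry-run.
DYNDBG="${DYNDBG:-/sys/kernel/debug/dynamic_debug/control}"

# Turn mlx5_core dynamic debug messages on (+p) or off (-p).
# Returns non-zero without changing anything if the file is not writable.
mlx5_dyndbg() {
    flag="$1"    # "+p" to enable, "-p" to disable
    if [ -w "$DYNDBG" ]; then
        echo "module mlx5_core $flag" > "$DYNDBG"
    else
        echo "cannot write $DYNDBG (not root, or debugfs not mounted)" >&2
        return 1
    fi
}

# Run ucx_info with UCX logging raised to debug level.
# -d lists the devices and transports that UCX detects.
ucx_info_debug() {
    if command -v ucx_info >/dev/null 2>&1; then
        UCX_LOG_LEVEL=debug ucx_info -d
    else
        echo "ucx_info not found in PATH" >&2
        return 127
    fi
}
```

For example, on one of the affected nodes: run "mlx5_dyndbg +p", reproduce
the hang while watching dmesg, then run "mlx5_dyndbg -p" to switch the extra
kernel messages off again.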

Cheers,

Barbara


On 6/1/20 8:37 PM, Alberto Morillas, Angelines wrote:
> Yes, I tried it, but with the same result:
> openmpi@4.0.3 -cuda +cxx_exceptions fabrics=ucx -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple +vt
>
> WRF compiles, and when you sbatch the job it shows as running, but it doesn't do anything and we get the same result, with WCHAN=hrtime:
>             0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?       00:05:25 real.exe
>
>     ------------------------------
>
>     Message: 2
>     Date: Mon, 1 Jun 2020 16:56:05 +0000
>     From: "Pritchard Jr., Howard" <howardp at lanl.gov>
>     To: Slurm User Community List <slurm-users at lists.schedmd.com>
>     Subject: Re: [slurm-users] [EXTERNAL]  problems with OpenMPI 4.0.3
>     Message-ID: <20DC51AE-9F58-4B1C-B619-1A2077D5CA84 at lanl.gov>
>     Content-Type: text/plain; charset="utf-8"
>
>     Hi Angelines,
>
>     Could you try reinstalling with fabrics=ucx and rerunning?
>     UCX is the preferred way to use InfiniBand in the Open MPI 4.0.x release stream.
>
>     Howard
>
>     On 6/1/20, 10:29 AM, "slurm-users on behalf of Alberto Morillas, Angelines" <slurm-users-bounces at lists.schedmd.com on behalf of angelines.alberto at ciemat.es> wrote:
>
>         Hello Howard,
>
>         I installed it with spack:
>         openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple +vt
>         where - means disabled
>               + means enabled
>
>         Thanks in advance.
>         ________________________________________________
>
>         Angelines Alberto Morillas
>
>         Unidad de Arquitectura Informática
>         Despacho: 22.1.32
>         Telf.: +34 91 346 6119
>         Fax:   +34 91 346 6537
>
>         skype: angelines.alberto
>
>         CIEMAT
>         Avenida Complutense, 40
>         28040 MADRID
>         ________________________________________________ 
>
>
>
>
>             ------------------------------
>
>             Message: 2
>             Date: Mon, 1 Jun 2020 16:13:11 +0000
>             From: "Pritchard Jr., Howard" <howardp at lanl.gov>
>             To: Slurm User Community List <slurm-users at lists.schedmd.com>
>             Subject: Re: [slurm-users] [EXTERNAL]  problems with OpenMPI 4.0.3
>             Message-ID: <CA7FE91C-8104-476F-B9A2-528D23ED3F9D at lanl.gov>
>             Content-Type: text/plain; charset="utf-8"
>
>             Hello Angelines,
>
>             Do you know how the Open MPI 4.0.3 package was configured and built?   That information would be useful to help diagnose the problem.
>
>             Thanks,
>
>             Howard
>
>
>             From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of "Alberto Morillas, Angelines" <angelines.alberto at ciemat.es>
>             Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
>             Date: Friday, May 29, 2020 at 4:25 AM
>             To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
>             Subject: [EXTERNAL] [slurm-users] problems with OpenMPI 4.0.3
>
>             Good morning,
>
>             We have a cluster with two kinds of InfiniBand cards, ConnectX-4 and ConnectX-6.
>             Openmpi-3.1.3 works fine, but when we started using the ConnectX-6 cards we moved to openmpi-4.0.3 (which supports ConnectX-6). Since then, programs that run in several stages, first a call to a sequential program and inside it a call to a parallel program (in our case WRF, but we have others like it with the same problem), suddenly stop:
>
>             ...
>             0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?       00:05:25 real.exe
>             0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?       00:05:33 real.exe
>             0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?       00:05:28 real.exe
>             ...
>             WCHAN is hrtime, and the processes look as if they are running, but they are not actually doing anything.
>
>             We don't know whether this could be a problem with Slurm and this version of Open MPI. Any ideas?
>
>
>             ________________________________________________
>
>             Angelines Alberto Morillas
>
>             Unidad de Arquitectura Informática
>             Despacho: 22.1.32
>             Telf.: +34 91 346 6119
>             Fax:   +34 91 346 6537
>
>             skype: angelines.alberto
>
>             CIEMAT
>             Avenida Complutense, 40
>             28040 MADRID
>             ________________________________________________
>
>
>
>             ------------------------------
>
>             Message: 3
>             Date: Mon, 1 Jun 2020 16:16:00 +0000
>             From: Songpon Srisawai <songpons_pro at vistec.ac.th>
>             To: Slurm User Community List <slurm-users at lists.schedmd.com>
>             Subject: Re: [slurm-users] Slurm Job Count Credit system
>             Message-ID: <9666f3be-d648-4ee9-9ad2-80df973f87cc at Spark>
>             Content-Type: text/plain; charset="utf-8"
>
>             Your help is greatly appreciated. I will try to implement your suggestion.
>             On 1 Jun 2020 22:23 +0700, Renfro, Michael <Renfro at tntech.edu>, wrote:
>             Even without the slurm-bank system, you can enforce a limit on resources with a QOS applied to those users. Something like:
>
>             =====
>
>             sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
>             sacctmgr modify qos bank1 set grptresmins=cpu=1000
>
>             sacctmgr add account bank1
>             sacctmgr modify account name=bank1 set qos+=bank1
>
>             sacctmgr add user someuser account=bank1
>             sacctmgr modify user someuser set qos+=bank1
>
>             =====
>
>             You can do lots with a QOS, including limiting the number of simultaneous running jobs, simultaneous running/queued jobs, etc. Unfortunately, the NoDecay flag is only documented to work on GrpTRESMins, GrpWall, and UsageRaw, not on the job count.
>
>             So if you can live with limiting the number of simultaneous jobs instead of a total number of jobs per time period, that's possible with QOS. Otherwise, maybe someone else will have an idea.
>
>             --
>             Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
>             931 372-3601 / Tennessee Tech University
>
>             On May 31, 2020, at 11:35 AM, Songpon Srisawai <songpons_pro at vistec.ac.th> wrote:
>
>             Hello all,
>
>             I'm a Slurm beginner trying to set up our cluster. I would like to know whether there is any Slurm credit/token plugin based on, for example, job count.
>
>             I found Slurm-bank, which deposits hours into an account, but I would like to deposit job tokens instead of hours.
>
>             Thanks for any recommendation
>             Songpon
>
>
>             End of slurm-users Digest, Vol 32, Issue 2
>             ******************************************
>
>
>
>
>     End of slurm-users Digest, Vol 32, Issue 3
>     ******************************************
>



