[slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3
Barbara Krašovec
barbara.krasovec at ijs.si
Tue Jun 2 07:12:22 UTC 2020
Afaik, there were some problems with certain versions of UCX, where UCX
expected OPAL memory hooks from OMPI, but they were disabled and the
physical pages became out of sync. I don't know whether that is the case here, though.
Maybe you could run dynamic debug to see if there is something useful in
dmesg:
echo "module mlx5_core +p"| tee /sys/kernel/debug/dynamic_debug/control
And you could also try to run ucx_info in debug mode.
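A rough sketch of both steps, in case it helps. The CONTROL override and the two helper function names are mine, purely for illustration. I am also assuming that "debug mode" means setting UCX_LOG_LEVEL=debug, and that debugfs is mounted at /sys/kernel/debug:

```shell
#!/bin/sh
# Sketch only: toggle mlx5_core dynamic debug and collect UCX debug output.
# CONTROL can be overridden so the helper can be tried against a scratch file.
CONTROL="${CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

# Write a dynamic-debug query for mlx5_core.
# "+p" enables the debug messages, "-p" disables them again.
mlx5_dyndbg() {
    printf 'module mlx5_core %s\n' "$1" > "$CONTROL"
}

# Run ucx_info with debug-level logging, if ucx_info is installed.
# "-d" lists the devices and transports that UCX detects.
ucx_debug_info() {
    if command -v ucx_info >/dev/null 2>&1; then
        UCX_LOG_LEVEL=debug ucx_info -d
    else
        echo "ucx_info not found in PATH" >&2
        return 1
    fi
}
```

As root, something like "mlx5_dyndbg +p", then reproduce the hang and look at "dmesg | tail -n 50", then "mlx5_dyndbg -p" to switch the messages off again. ucx_debug_info can be run as a normal user on the compute node.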
Cheers,
Barbara
On 6/1/20 8:37 PM, Alberto Morillas, Angelines wrote:
> Yes, I tried it, but with the same result:
> openmpi@4.0.3 -cuda +cxx_exceptions fabrics=ucx -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple +vt
>
> WRF compiles, and when you sbatch the job it appears to be running but doesn't do anything. We get the same result, with WCHAN=hrtime:
> 0 S 4556 87383 87361 0 80 0 - 126676 hrtime ? 00:05:25 real.exe
>
> ------------------------------
>
> Message: 2
> Date: Mon, 1 Jun 2020 16:56:05 +0000
> From: "Pritchard Jr., Howard" <howardp at lanl.gov>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3
> Message-ID: <20DC51AE-9F58-4B1C-B619-1A2077D5CA84 at lanl.gov>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Angelines,
>
> Could you try reinstalling with fabrics=ucx and rerunning?
> UCX is the preferred way to use InfiniBand in the Open MPI 4.0.x release stream.
>
> Howard
>
> On 6/1/20, 10:29 AM, "slurm-users on behalf of Alberto Morillas, Angelines" <slurm-users-bounces at lists.schedmd.com on behalf of angelines.alberto at ciemat.es> wrote:
>
> Hello Howard
>
> I installed it with spack:
> openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple +vt
> where - --> not enabled
> + --> enabled
>
> Thanks in advance.
> ________________________________________________
>
> Angelines Alberto Morillas
>
> Unidad de Arquitectura Informática
> Despacho: 22.1.32
> Telf.: +34 91 346 6119
> Fax: +34 91 346 6537
>
> skype: angelines.alberto
>
> CIEMAT
> Avenida Complutense, 40
> 28040 MADRID
> ________________________________________________
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 1 Jun 2020 16:13:11 +0000
> From: "Pritchard Jr., Howard" <howardp at lanl.gov>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3
> Message-ID: <CA7FE91C-8104-476F-B9A2-528D23ED3F9D at lanl.gov>
> Content-Type: text/plain; charset="utf-8"
>
> Hello Angelines,
>
> Do you know how the Open MPI 4.0.3 package was configured and built? That information would be useful to help diagnose the problem.
>
> Thanks,
>
> Howard
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of "Alberto Morillas, Angelines" <angelines.alberto at ciemat.es>
> Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Date: Friday, May 29, 2020 at 4:25 AM
> To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
> Subject: [EXTERNAL] [slurm-users] problems with OpenMPI 4.0.3
>
> Good morning,
>
> We have a cluster with two kinds of InfiniBand cards, some ConnectX-4 and the others ConnectX-6.
> Openmpi-3.1.3 works fine, but when we started with the ConnectX-6 cards we began using openmpi-4.0.3 (which supports ConnectX-6). Programs that run in several stages, first a call to a sequential program and, inside it, a call to a parallel program, suddenly stop. In our case the program is WRF, but we have others like it with the same problem:
>
> ...
> 0 S 4556 87383 87361 0 80 0 - 126676 hrtime ? 00:05:25 real.exe
> 0 S 4556 87384 87361 0 80 0 - 126677 hrtime ? 00:05:33 real.exe
> 0 S 4556 87385 87361 0 80 0 - 126675 hrtime ? 00:05:28 real.exe
> ...
> WCHAN shows hrtime, and the processes look like they are running, but they are not actually doing any work.
>
> We don't know whether this could be a problem with Slurm and this version of Open MPI. Any ideas?
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 1 Jun 2020 16:16:00 +0000
> From: Songpon Srisawai <songpons_pro at vistec.ac.th>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] Slurm Job Count Credit system
> Message-ID: <9666f3be-d648-4ee9-9ad2-80df973f87cc at Spark>
> Content-Type: text/plain; charset="utf-8"
>
> Thank you very much for your help. I will try to implement your suggestion.
> On 1 Jun 2020 22:23 +0700, Renfro, Michael <Renfro at tntech.edu>, wrote:
> Even without the slurm-bank system, you can enforce a limit on resources with a QOS applied to those users. Something like:
>
> =====
>
> sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
> sacctmgr modify qos bank1 set grptresmins=cpu=1000
>
> sacctmgr add account bank1
> sacctmgr modify account name=bank1 set qos+=bank1
>
> sacctmgr add user someuser account=bank1
> sacctmgr modify user someuser set qos+=bank1
>
> =====
>
> You can do lots with a QOS, including limiting the number of simultaneous running jobs, simultaneous running/queued jobs, etc. Unfortunately, the NoDecay flag is only documented to work on GrpTRESMins, GrpWall, and UsageRaw, not on the job count.
>
> So if you can live with limiting the number of simultaneous jobs instead of a total number of jobs per time period, that's possible with QOS. Otherwise, maybe someone else will have an idea.
>
> --
> Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
> 931 372-3601 / Tennessee Tech University
>
> On May 31, 2020, at 11:35 AM, Songpon Srisawai <songpons_pro at vistec.ac.th> wrote:
>
> Hello all,
>
> I'm a Slurm beginner trying to set up our cluster. I would like to know whether there is any Slurm credit/token system plugin based on, for example, the number of jobs.
>
> I found Slurm-bank, which deposits hours to an account. But I would like to deposit job tokens instead of hours.
>
> Thanks for any recommendations.
> Songpon
>
> End of slurm-users Digest, Vol 32, Issue 2
> ******************************************