[slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3

Pritchard Jr., Howard howardp at lanl.gov
Mon Jun 1 16:56:05 UTC 2020


Hi Angelines,

Could you try reinstalling with fabrics=ucx and rerunning?
UCX is the preferred way to use InfiniBand in the Open MPI 4.0.x release stream.
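
As a rough sketch of what that reinstall might look like with Spack, assuming the other variants stay as in the spec quoted further down in this thread and only fabrics=verbs is swapped for fabrics=ucx. The SPEC variable is just an illustration, and `~variant` is used here as the equivalent of `-variant` in Spack spec syntax:

```shell
# Sketch only: the spec quoted in this thread with fabrics=verbs replaced by fabrics=ucx.
# "~" disables a variant (same meaning as "-" in the quoted spec), "+" enables it.
SPEC="openmpi@4.0.3 ~cuda +cxx_exceptions fabrics=ucx ~java ~legacylaunchers ~memchecker +pmi schedulers=slurm ~sqlite3 ~thread_multiple +vt"

if command -v spack >/dev/null 2>&1; then
    # $SPEC is intentionally unquoted so each variant is passed as its own word.
    spack spec $SPEC       # preview how the spec concretizes before building
    spack install $SPEC
else
    echo "spack not found; would run: spack install $SPEC"
fi
```

After the rebuild, running `ompi_info | grep -i ucx` from the new installation is one way to check whether UCX components were actually built in.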

Howard

On 6/1/20, 10:29 AM, "slurm-users on behalf of Alberto Morillas, Angelines" <slurm-users-bounces at lists.schedmd.com on behalf of angelines.alberto at ciemat.es> wrote:

    Hello Howard,
    
    I installed it with Spack:
    openmpi@4.0.3 -cuda +cxx_exceptions fabrics=verbs -java -legacylaunchers -memchecker +pmi schedulers=slurm -sqlite3 -thread_multiple +vt
    where - means the variant is disabled
          + means the variant is enabled
    
    Thanks in advance.
    ________________________________________________
     
    Angelines Alberto Morillas
     
    Unidad de Arquitectura Informática
    Despacho: 22.1.32
    Telf.: +34 91 346 6119
    Fax:   +34 91 346 6537
     
    skype: angelines.alberto
     
    CIEMAT
    Avenida Complutense, 40
    28040 MADRID
    ________________________________________________ 
     
     
    
    On 1/6/20 18:16, "slurm-users on behalf of slurm-users-request at lists.schedmd.com" <slurm-users-bounces at lists.schedmd.com on behalf of slurm-users-request at lists.schedmd.com> wrote:
    
        Today's Topics:
    
           1. Re: Slurm Job Count Credit system (Renfro, Michael)
           2. Re: [EXTERNAL]  problems with OpenMPI 4.0.3
              (Pritchard Jr., Howard)
           3. Re: Slurm Job Count Credit system (Songpon Srisawai)
    
    
        ----------------------------------------------------------------------
    
        Message: 1
        Date: Mon, 1 Jun 2020 15:15:29 +0000
        From: "Renfro, Michael" <Renfro at tntech.edu>
        To: Slurm User Community List <slurm-users at lists.schedmd.com>
        Subject: Re: [slurm-users] Slurm Job Count Credit system
        Message-ID: <BD65CEB0-ACF7-4236-9206-44D0C93D57FA at tntech.edu>
        Content-Type: text/plain; charset="utf-8"
    
        Even without the slurm-bank system, you can enforce a limit on resources with a QOS applied to those users. Something like:
    
        =====
    
        sacctmgr add qos bank1 flags=NoDecay,DenyOnLimit
        sacctmgr modify qos bank1 set grptresmins=cpu=1000
    
        sacctmgr add account bank1
        sacctmgr modify account name=bank1 set qos+=bank1
    
        sacctmgr add user someuser account=bank1
        sacctmgr modify user someuser set qos+=bank1
    
        =====
    
        You can do lots with a QOS, including limiting the number of simultaneous running jobs, simultaneous running/queued jobs, etc. Unfortunately, the NoDecay flag is only documented to work on GrpTRESMins, GrpWall, and UsageRaw, not on the job count.
    
        So if you can live with limiting the number of simultaneous jobs instead of a total number of jobs per time period, that's possible with QOS. Otherwise, maybe someone else will have an idea.
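
A minimal sketch of the simultaneous-job limit described above, reusing the bank1 QOS from the example. The numeric limits are hypothetical placeholders, and MaxJobsPerUser / MaxSubmitJobsPerUser are the per-user QOS limits for running jobs and for running plus queued jobs:

```shell
# Sketch only: cap simultaneous jobs per user on an existing QOS.
QOS=bank1           # QOS name taken from the example above
MAX_RUNNING=4       # hypothetical cap on simultaneously running jobs per user
MAX_SUBMITTED=10    # hypothetical cap on running + queued jobs per user

CMD="sacctmgr -i modify qos $QOS set MaxJobsPerUser=$MAX_RUNNING MaxSubmitJobsPerUser=$MAX_SUBMITTED"

if command -v sacctmgr >/dev/null 2>&1; then
    $CMD                          # -i applies the change without the confirmation prompt
    sacctmgr show qos "$QOS"      # inspect the resulting limits
else
    echo "sacctmgr not found; would run: $CMD"
fi
```

These limits apply to jobs that are active at the same time, not to a total number of jobs per period, which matches the caveat above.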
    
        -- 
        Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
        931 372-3601     / Tennessee Tech University
    
        > On May 31, 2020, at 11:35 AM, Songpon Srisawai <songpons_pro at vistec.ac.th> wrote:
        > 
        > Hello all,
        > 
        > I'm a Slurm beginner trying to set up our cluster. I would like to know whether there is any Slurm credit/token system plugin, for example one based on job count.
        > 
        > I found Slurm-bank, which deposits hours into an account, but I would like to deposit job tokens instead of hours.
        > 
        > Thanks for any recommendation
        > Songpon 
    
    
        ------------------------------
    
        Message: 2
        Date: Mon, 1 Jun 2020 16:13:11 +0000
        From: "Pritchard Jr., Howard" <howardp at lanl.gov>
        To: Slurm User Community List <slurm-users at lists.schedmd.com>
        Subject: Re: [slurm-users] [EXTERNAL]  problems with OpenMPI 4.0.3
        Message-ID: <CA7FE91C-8104-476F-B9A2-528D23ED3F9D at lanl.gov>
        Content-Type: text/plain; charset="utf-8"
    
        Hello Angelines,
    
        Do you know how the Open MPI 4.0.3 package was configured and built?   That information would be useful to help diagnose the problem.
    
        Thanks,
    
        Howard
    
    
        From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of "Alberto Morillas, Angelines" <angelines.alberto at ciemat.es>
        Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
        Date: Friday, May 29, 2020 at 4:25 AM
        To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
        Subject: [EXTERNAL] [slurm-users] problems with OpenMPI 4.0.3
    
        Good morning,
    
        We have a cluster with two kinds of InfiniBand cards, one ConnectX-4 and the other ConnectX-6.
        Open MPI 3.1.3 works fine, but when we started using the ConnectX-6 cards we moved to Open MPI 4.0.3 (which supports ConnectX-6). Since then, programs that run in several stages, first a call to a sequential program and, inside it, a call to a parallel program (in our case the program is WRF, but we have others like it with the same problem), suddenly stop making progress:

        ...
        0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?       00:05:25 real.exe
        0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?       00:05:33 real.exe
        0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?       00:05:28 real.exe
        ...
        The WCHAN is hrtime, and the processes look as if they are running, but they are not actually doing any work.

        We don't know whether this could be a problem between Slurm and this version of Open MPI. Any ideas?
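
The listing above appears to be in `ps -l` long format. A small sketch of how the same state and WCHAN columns could be pulled for just the processes of interest, assuming a Linux system with the procps `ps`; PROC_NAME is an illustrative variable, set here to the executable name from the listing:

```shell
# Sketch only: show process state (STAT) and kernel wait channel (WCHAN) by command name.
PROC_NAME=real.exe    # executable name from the listing above

# -C selects by command name; wchan:20 widens the column so the name is not truncated.
ps -o pid,ppid,stat,wchan:20,time,comm -C "$PROC_NAME" || echo "no $PROC_NAME processes found"

# Same columns for the current shell, as a quick check that the format string works.
SELF=$(ps -o pid,stat,wchan:20,time,comm -p $$)
echo "$SELF"
```

A sleeping state with a timer-related wait channel does not by itself show where in the application or MPI library a process is waiting; a stack trace from a debugger would be needed for that.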
    
    
        ________________________________________________
    
        Angelines Alberto Morillas
    
        Unidad de Arquitectura Informática
        Despacho: 22.1.32
        Telf.: +34 91 346 6119
        Fax:   +34 91 346 6537
    
        skype: angelines.alberto
    
        CIEMAT
        Avenida Complutense, 40
        28040 MADRID
        ________________________________________________
    
    
        ------------------------------
    
        Message: 3
        Date: Mon, 1 Jun 2020 16:16:00 +0000
        From: Songpon Srisawai <songpons_pro at vistec.ac.th>
        To: Slurm User Community List <slurm-users at lists.schedmd.com>
        Subject: Re: [slurm-users] Slurm Job Count Credit system
        Message-ID: <9666f3be-d648-4ee9-9ad2-80df973f87cc at Spark>
        Content-Type: text/plain; charset="utf-8"
    
        Thank you very much for your help. I will try to implement your suggestion.
    
        End of slurm-users Digest, Vol 32, Issue 2
        ******************************************
    
    


