[slurm-users] lmod and slurm

Bill Barth bbarth at tacc.utexas.edu
Tue Dec 19 13:14:10 MST 2017


Yair,

You may want to look at using “module reset” rather than a plain purge. The environment variable LMOD_SYSTEM_DEFAULT_MODULES takes a colon-separated list of “default” modules, and “module reset” does a purge followed by an automatic load of that list. At TACC we set that variable in our base environment to point at a single module file, called TACC, which loads everything we want loaded by default. You could also list the individual default modules directly in the variable. We set some other environment variables in our TACC module besides loading our default modules, but the choice of mechanism is up to you.
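Roughly, the two pieces fit together like this (a minimal sketch, not TACC's actual configuration; the file path and the module names other than TACC are illustrative):

```shell
# Site-wide shell startup, e.g. /etc/profile.d/z01-lmod-defaults.sh.
# Either point at a single site module that loads everything...
export LMOD_SYSTEM_DEFAULT_MODULES="TACC"
# ...or list the default modules directly, colon-separated:
# export LMOD_SYSTEM_DEFAULT_MODULES="StdEnv:gcc:openmpi"

# Then, in a job script or interactive shell, "module reset" does a
# purge followed by a load of everything named in that variable:
module reset
```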

I can’t recommend giving the login nodes substantially different software from the compute nodes, so we don’t do that. We wouldn’t allow a user to run the CPU TensorFlow on a login node (since that’s a shared resource), so we’d only make the GPU version available everywhere. If a user tries to run the GPU version on a CPU-only compute node and it fails, that’s on them (and maybe a little consulting time to point out their mistake in a trouble ticket).

You can also feel free to ask these kinds of questions on the Lmod mailing list (https://sourceforge.net/p/lmod/mailman/), which is very active and is monitored by the author and a knowledgeable community.

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bbarth at tacc.utexas.edu        |   Phone: (512) 232-7069
Office: ROC 1.435            |   Fax:   (512) 475-9445
 
 

On 12/19/17, 8:43 AM, "slurm-users on behalf of Yair Yarom" <slurm-users-bounces at lists.schedmd.com on behalf of irush at cs.huji.ac.il> wrote:

    
    There are two issues:
    
    1. For modules loaded manually by users, we can instruct them (and
       do) to load the modules within their sbatch scripts. The problem
       is that not all users read the documentation properly, so in the
       tensorflow example they use the CPU version of tensorflow
       (available on the submission node) instead of the GPU version
       (available on the execution node). Their program works, but
       slowly, and some of them simply accept that without knowing
       there's a problem.
    
    2. We have modules which we want loaded by default, without telling
       users to load them. These are mostly programs used by all users,
       plus some settings we want applied by default (which may differ
       per host). Letting users call 'module purge' or "--export=NONE"
       will unload these default modules as well.
    
    So basically I want to force modules to be unloaded for all jobs (to
    solve issue 1), while still allowing modules to be loaded
    "automatically" by the system or the user (to address issue 2).
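    A TaskProlog along these lines might look roughly like the following
    (an untested sketch; slurmd applies lines printed as "export NAME=value"
    or "unset NAME" to the task's environment, and the paths and module
    names here are illustrative, not an actual site's setup):

    ```shell
    #!/bin/bash
    # Make the "module" shell function available inside the prolog.
    source /etc/profile.d/lmod.sh

    # Drop whatever environment came along from the submission node,
    # then load the per-host defaults (hypothetical module name).
    module purge
    module load site-defaults

    # Re-emit the resulting environment; slurmd reads "export ..." lines
    # from TaskProlog stdout and applies them to the task.
    for v in PATH LD_LIBRARY_PATH PYTHONPATH MODULEPATH; do
      if [ -n "${!v+x}" ]; then
        echo "export $v=${!v}"
      fi
    done
    ```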
    
    Thanks,
        Yair.
    
    
    On Tue, Dec 19 2017, Jeffrey Frey <frey at udel.edu> wrote:
    
    > Don't propagate the submission environment:
    >
    > srun --export=NONE myprogram
    >
    >
    >
    >> On Dec 19, 2017, at 8:37 AM, Yair Yarom <irush at cs.huji.ac.il> wrote:
    >> 
    >> 
    >> Thanks for your reply,
    >> 
    >> The problem is that users are running on the submission node e.g.
    >> 
    >> module load tensorflow
    >> srun myprogram
    >> 
    >> So they get the submission node's version of tensorflow (and its
    >> PATH/PYTHONPATH), along with any additional default modules.
    >> 
    >> There is never a chance to run the "module add ${SLURM_CONSTRAINT}" or
    >> remove the unwanted modules that were loaded (maybe automatically) on
    >> the submission node and aren't working on the execution node.
    >> 
    >> Thanks,
    >>    Yair.
    >> 
    >> On Tue, Dec 19 2017, "Loris Bennett" <loris.bennett at fu-berlin.de> wrote:
    >> 
    >>> Hi Yair,
    >>> 
    >>> Yair Yarom <irush at cs.huji.ac.il> writes:
    >>> 
    >>>> Hi list,
    >>>> 
    >>>> We use here lmod[1] for some software/version management. There are two
    >>>> issues encountered (so far):
    >>>> 
    >>>> 1. The submission node can have different software than the execution
    >>>>   nodes - different cpu, different gpu (if any), infiniband, etc. When
    >>>>   a user runs 'module load something' on the submission node, it will
    >>>>   pass the wrong environment to the task on the execution
    >>>>   node, e.g. "module load tensorflow" can load a different version
    >>>>   depending on the nodes.
    >>>> 
    >>>> 2. There are some modules we want to load by default, and again this can
    >>>>   be different between nodes (we do this by source'ing /etc/lmod/lmodrc
    >>>>   and ~/.lmodrc).
    >>>> 
    >>>> For issue 1, we instruct users to run the "module load" in their batch
    >>>> script and not before running sbatch, but issue 2 is more problematic.
    >>>> 
    >>>> My current solution is to write a TaskProlog script that runs "module
    >>>> purge" and "module load" and exports/unsets the changed environment
    >>>> variables. I was wondering if anyone has encountered this issue and
    >>>> has a less cumbersome solution.
    >>>> 
    >>>> Thanks in advance,
    >>>>    Yair.
    >>>> 
    >>>> [1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
    >>> 
    >>> I don't fully understand your use-case, but, assuming you can divide
    >>> your nodes up by some feature, could you define a module per feature
    >>> which just loads the specific modules needed for that category, e.g. in
    >>> the batch file you would have
    >>> 
    >>>   #SBATCH --constraint=shiny_and_new
    >>> 
    >>>   module add ${SLURM_CONSTRAINT}
    >>> 
    >>> and would have a module file 'shiny_and_new', with contents like, say,
    >>> 
    >>>  module add tensorflow/2.0
    >>>  module add cuda/9.0
    >>> 
    >>> whereas the module 'rusty_and_old' would contain
    >>> 
    >>>  module add tensorflow/0.1
    >>>  module add cuda/0.2
    >>> 
    >>> Would that help?
    >>> 
    >>> Cheers,
    >>> 
    >>> Loris
    >> 
    >
    >
    > ::::::::::::::::::::::::::::::::::::::::::::::::::::::
    > Jeffrey T. Frey, Ph.D.
    > Systems Programmer V / HPC Management
    > Network & Systems Services / College of Engineering
    > University of Delaware, Newark DE  19716
    > Office: (302) 831-6034  Mobile: (302) 419-4976
    > ::::::::::::::::::::::::::::::::::::::::::::::::::::::
    
    


