[slurm-users] lmod and slurm

Loris Bennett loris.bennett at fu-berlin.de
Tue Dec 19 07:08:00 MST 2017


Hi Yair,

Can't the users just use sbatch and put

  #SBATCH --constraint=shiny_and_new

  module purge
  module add ${SLURM_CONSTRAINT}

  srun myprogram

in their batch scripts?
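
Whether SLURM_CONSTRAINT is actually set inside the job may depend on the
installation, so a guard with a fallback might be safer. The following is
only a minimal sketch: the helper name feature_module and its fallback
argument are hypothetical, and it assumes the constraint is a single
feature name (a constraint combining features with & or | would not map
onto one module name).

```shell
#!/bin/sh
# Sketch only: decide which per-feature module a batch script should load.
# Assumption: SLURM_CONSTRAINT may be unset or empty inside the job, in
# which case the caller-supplied fallback name is used instead.
feature_module() {
    fallback="$1"
    if [ -n "${SLURM_CONSTRAINT:-}" ]; then
        # Use the requested feature name as the module name.
        printf '%s\n' "$SLURM_CONSTRAINT"
    else
        # No constraint visible in the environment: use the fallback.
        printf '%s\n' "$fallback"
    fi
}
```

In the batch script this could then be used as

  #SBATCH --constraint=shiny_and_new

  module purge
  module add "$(feature_module rusty_and_old)"

  srun myprogram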

Loris

Yair Yarom <irush at cs.huji.ac.il> writes:

> Thanks for your reply,
>
> The problem is that users are running on the submission node e.g.
>
> module load tensorflow
> srun myprogram
>
> So they get the submission node's version of tensorflow (and its
> PATH/PYTHONPATH), along with any additional default modules loaded
> there.
>
> There is never a chance to run the "module add ${SLURM_CONSTRAINT}" or
> remove the unwanted modules that were loaded (maybe automatically) on
> the submission node and aren't working on the execution node.
>
> Thanks,
>     Yair.
>
> On Tue, Dec 19 2017, "Loris Bennett" <loris.bennett at fu-berlin.de> wrote:
>
>> Hi Yair,
>>
>> Yair Yarom <irush at cs.huji.ac.il> writes:
>>
>>> Hi list,
>>>
>>> We use lmod[1] here for software/version management. We have
>>> encountered two issues (so far):
>>>
>>> 1. The submission node can have different software than the execution
>>>    nodes - different cpu, different gpu (if any), infiniband, etc. When
>>>    a user runs 'module load something' on the submission node, it will
>>>    pass the wrong environment to the task on the execution
>>>    node, e.g. "module load tensorflow" can load a different version
>>>    depending on the node.
>>>
>>> 2. There are some modules we want to load by default, and again this can
>>>    be different between nodes (we do this by source'ing /etc/lmod/lmodrc
>>>    and ~/.lmodrc).
>>>
>>> For issue 1, we instruct users to run the "module load" in their batch
>>> script and not before running sbatch, but issue 2 is more problematic.
>>>
>>> My current solution is to write a TaskProlog script that runs "module
>>> purge" and "module load" and exports/unsets the changed environment
>>> variables. I was wondering if anyone has encountered this issue and has
>>> a less cumbersome solution.
>>>
>>> Thanks in advance,
>>>     Yair.
>>>
>>> [1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
>>
>> I don't fully understand your use-case, but, assuming you can divide
>> your nodes up by some feature, could you define a module per feature
>> which just loads the specific modules needed for that category, e.g. in
>> the batch file you would have
>>
>>    #SBATCH --constraint=shiny_and_new
>>
>>    module add ${SLURM_CONSTRAINT}
>>
>> and would have a module file 'shiny_and_new', with contents like, say,
>>
>>   module add tensorflow/2.0
>>   module add cuda/9.0
>>
>> whereas the module 'rusty_and_old' would contain
>>
>>   module add tensorflow/0.1
>>   module add cuda/0.2
>>
>> Would that help?
>>
>> Cheers,
>>
>> Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


