[slurm-users] lmod and slurm
Yair Yarom
irush at cs.huji.ac.il
Tue Dec 19 06:37:36 MST 2017
Thanks for your reply,
The problem is that users run something like the following on the
submission node:
    module load tensorflow
    srun myprogram
So they get the submission node's version of tensorflow (and its
PATH/PYTHONPATH), along with any additional default modules loaded
there.
There is never a chance to run "module add ${SLURM_CONSTRAINT}" or to
remove the unwanted modules that were loaded (maybe automatically) on
the submission node and don't work on the execution node.
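
The TaskProlog approach from the original message (quoted below) could
be sketched roughly as follows. This is only a sketch under stated
assumptions: the `emit_env_diff` helper and the `LMOD_INIT` path are
hypothetical names, not taken from any real configuration, and it relies
on Slurm reading `export NAME=value` and `unset NAME` lines from the
TaskProlog's standard output. It snapshots the environment inherited
from the submission node, purges and reloads the execution node's
default modules, and prints only the differences.

```bash
#!/bin/bash
# Hypothetical TaskProlog sketch: reset the Lmod environment on the
# execution node and report the changes to Slurm on stdout.

# emit_env_diff BEFORE AFTER
# BEFORE and AFTER are files holding `env -0` output (NUL-separated
# NAME=value records). Prints "export NAME=value" for new or changed
# variables and "unset NAME" for variables that disappeared.
emit_env_diff() {
    local before_file=$1 after_file=$2
    local -A before=() after=()
    local rec name
    local valid='^[A-Za-z_][A-Za-z0-9_]*$'

    while IFS= read -r -d '' rec; do
        name=${rec%%=*}
        # Skip exported shell functions and other non-identifier names.
        [[ $name =~ $valid ]] || continue
        before[$name]=${rec#*=}
    done < "$before_file"

    while IFS= read -r -d '' rec; do
        name=${rec%%=*}
        [[ $name =~ $valid ]] || continue
        after[$name]=${rec#*=}
    done < "$after_file"

    for name in "${!after[@]}"; do
        # The output is line-based, so multi-line values are skipped.
        [[ ${after[$name]} == *$'\n'* ]] && continue
        if [[ -z ${before[$name]+set} || ${before[$name]} != "${after[$name]}" ]]; then
            printf 'export %s=%s\n' "$name" "${after[$name]}"
        fi
    done

    for name in "${!before[@]}"; do
        [[ -n ${after[$name]+set} ]] || printf 'unset %s\n' "$name"
    done
}

# ASSUMPTION: site-specific script that defines the `module` function.
LMOD_INIT=${LMOD_INIT:-/etc/profile.d/lmod.sh}

# Only act when started by Slurm and Lmod's init script is readable.
if [[ -n ${SLURM_JOB_ID:-} && -r $LMOD_INIT ]]; then
    tmpdir=$(mktemp -d) || exit 0
    env -0 > "$tmpdir/before"

    # Keep stdout clean: only the diff lines below should reach Slurm.
    source "$LMOD_INIT" >/dev/null 2>&1
    module purge >/dev/null 2>&1
    # Default-module files named in the original message.
    if [[ -r /etc/lmod/lmodrc ]]; then source /etc/lmod/lmodrc >/dev/null 2>&1; fi
    if [[ -r ~/.lmodrc ]]; then source ~/.lmodrc >/dev/null 2>&1; fi

    env -0 > "$tmpdir/after"
    emit_env_diff "$tmpdir/before" "$tmpdir/after"
    rm -rf "$tmpdir"
fi
```

Such a script would be referenced through the TaskProlog setting in
slurm.conf. Note that this resets to the node's defaults only; modules
the user explicitly asked for on the submission node would still have
to be reloaded by some other means, which is part of what makes the
approach cumbersome.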
Thanks,
Yair.
On Tue, Dec 19 2017, "Loris Bennett" <loris.bennett at fu-berlin.de> wrote:
> Hi Yair,
>
> Yair Yarom <irush at cs.huji.ac.il> writes:
>
>> Hi list,
>>
>> We use lmod[1] here for some software/version management. We have
>> encountered two issues (so far):
>>
>> 1. The submission node can have different software than the execution
>> nodes - different cpu, different gpu (if any), infiniband, etc. When
>> a user runs 'module load something' on the submission node, it will
>> pass the wrong environment to the task on the execution
>> node, e.g. "module load tensorflow" can load a different version
>> depending on the node.
>>
>> 2. There are some modules we want to load by default, and again this can
>> be different between nodes (we do this by source'ing /etc/lmod/lmodrc
>> and ~/.lmodrc).
>>
>> For issue 1, we instruct users to run the "module load" in their batch
>> script and not before running sbatch, but issue 2 is more problematic.
>>
>> My current solution is to write a TaskProlog script that runs "module
>> purge" and "module load" and exports/unsets the changed environment
>> variables. I was wondering whether anyone has encountered this issue
>> and has a less cumbersome solution.
>>
>> Thanks in advance,
>> Yair.
>>
>> [1] https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
>
> I don't fully understand your use-case, but, assuming you can divide
> your nodes up by some feature, could you define a module per feature
> which just loads the specific modules needed for that category, e.g. in
> the batch file you would have
>
> #SBATCH --constraint=shiny_and_new
>
> module add ${SLURM_CONSTRAINT}
>
> and would have a module file 'shiny_and_new', with contents like, say,
>
> module add tensorflow/2.0
> module add cuda/9.0
>
> whereas the module 'rusty_and_old' would contain
>
> module add tensorflow/0.1
> module add cuda/0.2
>
> Would that help?
>
> Cheers,
>
> Loris