<div dir="ltr">I have to echo Loris' comments. My users tend to experiment, and a fair portion of my time is spent helping them correct errors they've inflicted upon themselves. I tend to provide guides for configuring and running our more usual applications, and then when they fail, I review the guidance with them in my office. <div><br></div><div>Some of my bigger nightmares begin with one of my truly talented users trying something because the procedure he's trying is "just like" what he did on another, very different system. This is followed closely by "Well, it SHOULD work this way". We then spend some quality time going over how things really work, and he goes away a bit happier and wiser.</div><div><br></div><div>Plan to work with your users and be prepared to train them on nuance. </div><div><br></div><div>Gerry</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 19, 2017 at 9:33 AM, Loris Bennett <span dir="ltr"><<a href="mailto:loris.bennett@fu-berlin.de" target="_blank">loris.bennett@fu-berlin.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Yair Yarom <<a href="mailto:irush@cs.huji.ac.il">irush@cs.huji.ac.il</a>> writes:<br>
<br>
> There are two issues:<br>
><br>
> 1. For modules loaded manually by users, we can instruct them (and already<br>
> do) to load the modules within their sbatch scripts. The<br>
> problem is that not all users read the documentation properly, so in<br>
> the tensorflow example, they use the cpu version of tensorflow<br>
> (available on the submission node) instead of the gpu version<br>
> (available on the execution node). Their program works, but slowly,<br>
> and some of them simply accept it without knowing there's a problem.<br>
<br>
</span>To me, this is just what users do. They make mistakes, and not just when<br>
loading modules. Their programs run badly, so I have to tell them what<br>
they are doing wrong and point them to the documentation. You obviously<br>
need some sort of monitoring to help you spot the poorly configured jobs.<br>
<span class=""><br>
> 2. We have modules which we want to be loaded by default, without<br>
> telling users to load them. These are mostly for programs used by all<br>
> users and for some settings we want to be set by default (and may be<br>
> different per host). Letting users call 'module purge' or<br>
> "--export=NONE" will unload the default modules as well.<br>
<br>
</span>I'm not sure how you want to prevent users from doing 'module purge' at<br>
a point which will upset the environment you are trying to set up for them.<br>
<span class=""><br>
> So I basically want to force modules to be unloaded for all jobs - to<br>
> solve issue 1, while allowing modules to be loaded "automatically" by<br>
> the system or user - for issue 2.<br>
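<br>
For reference, a minimal sketch of the TaskProlog approach Yair describes in his original message (quoted further down). It rests on several assumptions: that Slurm applies the "export NAME=value" and "unset NAME" lines a TaskProlog prints to the task's environment, that SLURM_JOB_ID is set when the prolog runs, and that the Lmod init path /etc/profile.d/lmod.sh and the module name site-defaults (both hypothetical placeholders) match the local setup. Exported shell functions are skipped and multi-line values are not handled.<br>
<br>
```shell
#!/bin/bash
# TaskProlog sketch: rebuild the module environment on the execution node
# and tell Slurm what changed. Paths and module names are assumptions.

# Print "export NAME=value" / "unset NAME" lines that turn the environment
# dump in file $1 into the one in file $2. Each dump is `env` output with
# one NAME=value per line. Lines that do not look like a plain variable
# assignment (e.g. exported shell functions) are skipped; multi-line
# values are not handled.
emit_env_diff() {
    local before=$1 after=$2 line name
    # Variables that are new or have a different value afterwards.
    while IFS= read -r line; do
        [[ $line =~ ^[A-Za-z_][A-Za-z0-9_]*= ]] || continue
        grep -qxF -- "$line" "$before" || printf 'export %s\n' "$line"
    done < "$after"
    # Variables that existed before but are gone afterwards.
    while IFS= read -r line; do
        [[ $line =~ ^[A-Za-z_][A-Za-z0-9_]*= ]] || continue
        name=${line%%=*}
        grep -q -- "^${name}=" "$after" || printf 'unset %s\n' "$name"
    done < "$before"
}

task_prolog_main() {
    local before after
    before=$(mktemp) && after=$(mktemp) || return 1
    env > "$before"
    # Hypothetical Lmod init script; adjust to the local installation.
    source /etc/profile.d/lmod.sh
    # Keep module chatter off stdout, which Slurm parses.
    module purge >&2
    module load site-defaults >&2   # hypothetical per-host default module
    env > "$after"
    emit_env_diff "$before" "$after"
    rm -f "$before" "$after"
}

# Only act when run by Slurm for a job (assumes SLURM_JOB_ID is set then).
if [[ -n ${SLURM_JOB_ID:-} ]]; then
    task_prolog_main
fi
```
<br>
The script would be referenced by TaskProlog= in slurm.conf. Diffing two environment dumps keeps the prolog independent of which variables a given module touches, at the cost of running 'env' twice per task.<br>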
<br>
</span>There may well be a technical solution to your problem such that<br>
everything works as it should without the users having to know what is<br>
going on. However, my approach would be to use a submit plugin to<br>
reject some badly configured jobs and/or set defaults such that badly<br>
configured jobs fail quickly. In my experience, if users' jobs fail<br>
straight away, they mainly learn to do the right thing fairly fast and<br>
without getting frustrated, provided they get enough support. However,<br>
your users may be different, so YMMV.<br>
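<br>
One way to make such jobs fail quickly, sketched here as a guard at the top of a batch script rather than as a submit plugin (so this is a different mechanism from the plugin mentioned above). It assumes the module system exports the colon-separated LOADEDMODULES variable, as Lmod and Environment Modules do; the function name check_inherited_modules and the allowlist entry site-defaults are hypothetical.<br>
<br>
```shell
#!/bin/bash
# Fail-fast guard (sketch) for the top of a batch script or site template.

# Return 0 if every module in the colon-separated list $1 appears in the
# space-separated allowlist $2. Otherwise name the offenders on stderr
# and return 1.
check_inherited_modules() {
    local loaded=$1 allowed=" $2 " mod bad=""
    local IFS=:
    for mod in $loaded; do
        [[ -z $mod ]] && continue
        [[ $allowed == *" $mod "* ]] || bad+=" $mod"
    done
    if [[ -n $bad ]]; then
        echo "Modules inherited from the submission node:$bad" >&2
        echo "Load modules inside the batch script instead." >&2
        return 1
    fi
    return 0
}

# Intended use, before any 'module load' in the batch script:
#   check_inherited_modules "${LOADEDMODULES:-}" "site-defaults" || exit 1
```
<br>
With this in place, a job submitted after 'module load tensorflow' on the submission node stops within seconds and says why, instead of running slowly on the wrong build.<br>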
<br>
Cheers,<br>
<br>
Loris<br>
<div class="HOEnZb"><div class="h5"><br>
<br>
> Thanks,<br>
> Yair.<br>
><br>
><br>
> On Tue, Dec 19 2017, Jeffrey Frey <<a href="mailto:frey@udel.edu">frey@udel.edu</a>> wrote:<br>
><br>
>> Don't propagate the submission environment:<br>
>><br>
>> srun --export=NONE myprogram<br>
>><br>
>><br>
>><br>
>>> On Dec 19, 2017, at 8:37 AM, Yair Yarom <<a href="mailto:irush@cs.huji.ac.il">irush@cs.huji.ac.il</a>> wrote:<br>
>>><br>
>>><br>
>>> Thanks for your reply,<br>
>>><br>
>>> The problem is that users run something like this on the submission node:<br>
>>><br>
>>> module load tensorflow<br>
>>> srun myprogram<br>
>>><br>
>>> So they get the tensorflow version (and PATH/PYTHONPATH) of the<br>
>>> submission node's version of tensorflow (and any additional default<br>
>>> modules).<br>
>>><br>
>>> There is never a chance to run the "module add ${SLURM_CONSTRAINT}" or<br>
>>> remove the unwanted modules that were loaded (maybe automatically) on<br>
>>> the submission node and aren't working on the execution node.<br>
>>><br>
>>> Thanks,<br>
>>> Yair.<br>
>>><br>
>>> On Tue, Dec 19 2017, "Loris Bennett" <<a href="mailto:loris.bennett@fu-berlin.de">loris.bennett@fu-berlin.de</a>> wrote:<br>
>>><br>
>>>> Hi Yair,<br>
>>>><br>
>>>> Yair Yarom <<a href="mailto:irush@cs.huji.ac.il">irush@cs.huji.ac.il</a>> writes:<br>
>>>><br>
>>>>> Hi list,<br>
>>>>><br>
>>>>> We use lmod[1] here for some software/version management. We have<br>
>>>>> encountered two issues (so far):<br>
>>>>><br>
>>>>> 1. The submission node can have different software than the execution<br>
>>>>> nodes - different cpu, different gpu (if any), infiniband, etc. When<br>
>>>>> a user runs 'module load something' on the submission node, it will<br>
>>>>> pass the wrong environment to the task in the execution<br>
>>>>> node. e.g. "module load tensorflow" can load a different version<br>
>>>>> depending on the nodes.<br>
>>>>><br>
>>>>> 2. There are some modules we want to load by default, and again this can<br>
>>>>> be different between nodes (we do this by source'ing /etc/lmod/lmodrc<br>
>>>>> and ~/.lmodrc).<br>
>>>>><br>
>>>>> For issue 1, we instruct users to run the "module load" in their batch<br>
>>>>> script and not before running sbatch, but issue 2 is more problematic.<br>
>>>>><br>
>>>>> My current solution is to write a TaskProlog script that runs "module<br>
>>>>> purge" and "module load" and exports/unsets the changed environment<br>
>>>>> variables. I was wondering if anyone has encountered this issue and has a<br>
>>>>> less cumbersome solution.<br>
>>>>><br>
>>>>> Thanks in advance,<br>
>>>>> Yair.<br>
>>>>><br>
>>>>> [1] <a href="https://www.tacc.utexas.edu/research-development/tacc-projects/lmod" rel="noreferrer" target="_blank">https://www.tacc.utexas.edu/<wbr>research-development/tacc-<wbr>projects/lmod</a><br>
>>>><br>
>>>> I don't fully understand your use-case, but, assuming you can divide<br>
>>>> your nodes up by some feature, could you define a module per feature<br>
>>>> which just loads the specific modules needed for that category, e.g. in<br>
>>>> the batch file you would have<br>
>>>><br>
>>>> #SBATCH --constraint=shiny_and_new<br>
>>>><br>
>>>> module add ${SLURM_CONSTRAINT}<br>
>>>><br>
>>>> and would have a module file 'shiny_and_new', with contents like, say,<br>
>>>><br>
>>>> module add tensorflow/2.0<br>
>>>> module add cuda/9.0<br>
>>>><br>
>>>> whereas the module 'rusty_and_old' would contain<br>
>>>><br>
>>>> module add tensorflow/0.1<br>
>>>> module add cuda/0.2<br>
>>>><br>
>>>> Would that help?<br>
>>>><br>
>>>> Cheers,<br>
>>>><br>
>>>> Loris<br>
>>><br>
>><br>
<br>
</div></div><div class="HOEnZb"><div class="h5">--<br>
Dr. Loris Bennett (Mr.)<br>
ZEDAT, Freie Universität Berlin Email <a href="mailto:loris.bennett@fu-berlin.de">loris.bennett@fu-berlin.de</a><br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Gerry Creager<div>NSSL/CIMMS</div><div>405.325.6371</div><div>++++++++++++++++++++++</div><div><div>“Big whorls have little whorls,</div><div>That feed on their velocity; </div><div>And little whorls have lesser whorls, </div><div>And so on to viscosity.” </div><div>Lewis Fry Richardson (1881-1953)</div></div></div></div>
</div>