[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app
pbisbal at pppl.gov
Mon Apr 1 20:20:09 UTC 2019
On 3/28/19 1:25 PM, Reuti wrote:
>> On 22.03.2019 at 16:20, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>> On 3/21/19 6:56 PM, Reuti wrote:
>>> On 21.03.2019 at 23:43, Prentice Bisbal wrote:
>>>> My users here have developed a GUI application which serves as an interface to the various physics codes they use. From this GUI, they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and a user has reported a problem when submitting Slurm jobs through this GUI app that does not occur when the same sbatch script is submitted with sbatch on the command line.
>>>> When I replaced the mpirun command with an equivalent srun command, everything works as desired, so the user can get back to work and be productive.
>>>> While srun is a suitable workaround, and is arguably the correct way to run an MPI job, I'd like to understand what is going on here. Any idea what is going wrong, or additional steps I can take to get more debug information?
>>> Was an alias to `mpirun` introduced? It could shadow the real application: even `which mpirun` would return the correct path, yet the real binary would never be executed.
>>> $ type mpirun
>>> $ alias mpirun
>>> run inside the job script may tell.
>> Unfortunately, the script is in tcsh, so the 'type' command doesn't work, since it's a bash built-in. I did use the 'alias' command to see all the defined aliases, and mpirun and mpiexec are not aliased. Any other ideas?
> What was the outcome of this issue – could it be solved?
The user added "--export=none" to his submission script to prevent any
environment variables in the
GUI's environment from being applied to his job. After making that
change, his job worked as expected, so this confirmed it was an
environment issue. We compared the 'env' output from GUI-submitted and
manually submitted jobs and found a handful of
variables that were set in the GUI environment that were not present in
the manual-submission environment. If memory serves me correctly, they
were all Open MPI parameters.
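For anyone who hits something similar, here is a rough sketch of what the
workaround looks like (the job name and program below are placeholders, not
the user's actual script):

    #!/bin/tcsh
    #SBATCH --job-name=gui_submitted_job
    #SBATCH --export=NONE

    # With --export=NONE, sbatch does not propagate the submitting process's
    # environment (in this case, the GUI's) into the job, so the job starts
    # with a clean environment.
    mpirun ./my_physics_code

Comparing the two environments was nothing fancier than capturing 'env | sort'
from a job submitted each way and diffing the two outputs.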
The user was happy using "--export=none" to fix this problem, so we
didn't bother going through the tedious task of removing the environment
variables one by one until we found the offending one. While still
testing/debugging, I did do one run where I thought I removed all the
offending variables by unsetting them all in the sbatch script, but the
error still occurred, so I must have missed the one that was actually causing
the problem. Since the user was happy with the --export=none fix, and I had other
issues to fix in my queue, that's where we left it.
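If I ever do need to pin down the exact variable, the brute-force approach I
was attempting is just to unset the suspect variables at the top of the job
script, before mpirun runs, and then re-add them one at a time until the
failure comes back. Something along these lines (the variable names are only
examples of the kind of OMPI_* parameters we saw, not the actual list):

    #!/bin/tcsh
    #SBATCH --job-name=unset_test

    # Unset the Open MPI variables that only show up in the GUI's environment.
    unsetenv OMPI_MCA_plm
    unsetenv OMPI_MCA_btl

    mpirun ./my_physics_code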