[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

Prentice Bisbal pbisbal at pppl.gov
Mon Apr 1 20:20:09 UTC 2019


On 3/28/19 1:25 PM, Reuti wrote:

> Hi,
>
>> Am 22.03.2019 um 16:20 schrieb Prentice Bisbal <pbisbal at pppl.gov>:
>>
>> On 3/21/19 6:56 PM, Reuti wrote:
>>> Am 21.03.2019 um 23:43 schrieb Prentice Bisbal:
>>>
>>>> Slurm-users,
>>>>
>>>> My users here have developed a GUI application which serves as a front end to various physics codes they use. From this GUI, they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and a user has reported a problem when submitting Slurm jobs through this GUI app that does not occur when the same sbatch script is submitted with sbatch on the command line.
>>>>
>>>> […]
>>>> When I replaced the mpirun command with an equivalent srun command, everything works as desired, so the user can get back to work and be productive.
>>>>
>>>> While srun is a suitable workaround, and is arguably the correct way to run an MPI job, I'd like to understand what is going on here. Any idea what is going wrong, or additional steps I can take to get more debug information?
>>> Was an alias to `mpirun` introduced? It may cover the real application and even the `which mpirun` will return the correct value, but never be executed.
>>>
>>> $ type mpirun
>>> $ alias mpirun
>>>
>>> may tell in the jobscript.
>>>
>> Unfortunately, the script is in tcsh, so the 'type' command doesn't work, since it's a bash built-in. I did use the 'alias' command to list all the defined aliases, and mpirun and mpiexec are not aliased. Any other ideas?
> What was the outcome of this issue – could it be solved?
>
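
For anyone finding this thread in the archives, the mpirun-to-srun swap
mentioned above looked roughly like the following. This is only a
sketch: the executable name is a placeholder, and the right --mpi
plugin (pmix vs. pmi2) depends on how Slurm and Open MPI were built.

# original launcher line in the sbatch script
mpirun -np $SLURM_NTASKS ./my_physics_code

# equivalent srun invocation
srun --mpi=pmix ./my_physics_code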

The user added

#SBATCH --export=none

to his submission script to prevent any environment variables in the 
GUI's environment from being applied to his job. After making that 
change, his job worked as expected, so this confirmed it was an 
environment issue. We compared the output of 'env' from a
GUI-submitted job and a manually submitted one, and found a handful of
variables that were set in the GUI environment but not present in the
manual-submission environment. If memory serves me correctly, they
were all Open MPI parameters.
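
For anyone who wants to reproduce the fix, a minimal sketch of the
pattern is below. The job parameters and executable name are
placeholders, not the user's actual script:

#!/bin/tcsh
#SBATCH --job-name=gui_submit_test
#SBATCH --ntasks=16
#SBATCH --export=none

# dump the job's environment so it can be diffed against a run of the
# same script submitted by hand on the command line
env | sort > env.$SLURM_JOB_ID

mpirun -np $SLURM_NTASKS ./my_physics_code

Diffing the env.<jobid> files from a GUI-submitted run and a
command-line run shows exactly which variables the GUI adds.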

The user was happy using "--export=none" to fix this problem, so we 
didn't bother going through the tedious task of removing the environment 
variables one by one until we found the offending one. While still 
testing/debugging, I did one run where I thought I had unset all of the
offending variables in the sbatch script, but the error still occurred,
so I must have missed the one that was causing the problem.
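
Since the job script is tcsh, that unsetting can be done with a small
loop like the one below. This is just a sketch of the approach; I no
longer recall the exact variable names, so it matches on the Open MPI
prefix rather than naming them:

# clear any Open MPI settings inherited from the submitting environment
foreach v (`env | grep '^OMPI_' | sed 's/=.*//'`)
    unsetenv $v
end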

Since the user was happy with the --export=none fix, and I had other 
issues to fix in my queue, that's where we left it.

--
Prentice



