[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

Prentice Bisbal pbisbal at pppl.gov
Thu Mar 21 22:43:14 UTC 2019


Slurm-users,

My users here have developed a GUI application that serves as a front end 
to various physics codes they use. From this GUI, they can 
submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 
18.08.6-2, and a user has reported a problem when submitting Slurm jobs 
through this GUI app that does not occur when the same sbatch script is 
submitted with sbatch on the command line.

The GUI application generates the following sbatch script (non-essential 
information redacted or omitted):

#!/bin/tcsh

#SBATCH --job-name=XXXXXXXX

#SBATCH --ntasks=32
#SBATCH --mem=2G
#SBATCH --time=00-2:00:00
#SBATCH --partition=YYYYYYYY
#SBATCH --export=ALL
#SBATCH --output=ZZZZZZZ
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=XXXXXX at example.com

echo "The job's id is: $SLURM_JOBID"
echo "The master node of this job is: $SLURM_SUBMIT_HOST"
echo -n 'Started job at : ' ; date
echo " "

cd .
echo "Working directory : "$PWD

module purge
module use /p/focus/modules
module load mystellopt
module list

mpirun --verbose /path/to/command /path/to/input input.stellopt

echo " "
echo -n 'Ended job at  : ' ; date
echo " "
exit

When this job is submitted from the GUI application, every line is 
executed except the mpirun line, which can easily be verified by looking 
at the output file.  When this job is submitted with sbatch on the 
command-line, everything works as desired.

There is no error in the output file like "mpirun: command not found:", 
so it appears that the mpirun command is in the PATH in the job's 
environment. When I added the line "which mpirun" to the sbatch script, 
it found the correct mpirun command to use.
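
Since no "command not found" error appears, the next thing worth recording 
is whether mpirun actually starts and what exit status it returns. The 
following is a minimal sketch of such a check, written in POSIX sh rather 
than the tcsh used by the generated script. The helper name run_logged is 
hypothetical and is not part of Slurm or any MPI distribution.

```shell
#!/bin/sh
# Hypothetical helper: report where a command resolves in PATH,
# run it with its arguments, and print its exit status.
run_logged() {
    # command -v fails if the name cannot be resolved at all.
    rl_path=$(command -v "$1") || {
        echo "not found in PATH: $1"
        return 127
    }
    echo "resolved $1 -> $rl_path"

    # Run the command; keep its exit status without aborting the caller.
    rl_rc=0
    "$@" || rl_rc=$?
    echo "exit status of $1: $rl_rc"
    return "$rl_rc"
}

# Example use inside an sh batch script (paths are the redacted ones above):
#   run_logged mpirun --verbose /path/to/command /path/to/input input.stellopt
```

In the tcsh script itself, a comparable check would be a line such as 
echo "mpirun exit status: $status" placed directly after the mpirun line.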

When I replaced the mpirun command with an equivalent srun command, 
everything worked as desired, so the user can get back to work and be 
productive.

While srun is a suitable workaround, and is arguably the correct way to 
run an MPI job, I'd like to understand what is going on here. Any idea 
what is going wrong, or additional steps I can take to get more debug 
information?

The user does acknowledge that the GUI app itself could have been 
updated and that this could be the cause, but since Slurm accepts the 
job, the output of squeue and scontrol looks normal, and the job is 
submitted and runs, it looks to me like the interactions between Slurm 
and the GUI app are fine.
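
Because the script sets --export=ALL, the job inherits the environment of 
whatever process calls sbatch, so the GUI and an interactive shell may hand 
the job different variables even when Slurm itself behaves normally. One 
way to check this is sketched below, under the assumption that each job has 
written its environment to a file with "env | LC_ALL=C sort". The file 
names and the helper env_only_in are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show lines present in only one of two sorted
# environment dumps. Each dump is assumed to come from
#   env | LC_ALL=C sort > FILE
# run inside the job. Values containing newlines will not compare cleanly.
env_only_in() {
    # $1: "first" or "second"; $2 and $3: the two sorted dump files
    case "$1" in
        first)  LC_ALL=C comm -23 "$2" "$3" ;;   # only in the first file
        second) LC_ALL=C comm -13 "$2" "$3" ;;   # only in the second file
        *)
            echo "usage: env_only_in first|second FILE1 FILE2" >&2
            return 2
            ;;
    esac
}

# Example use after one GUI submission and one command-line submission:
#   env_only_in first  gui.env cli.env
#   env_only_in second gui.env cli.env
```

Variables that appear in only one of the two listings, or that appear in 
both with different values, are candidates for explaining why mpirun 
behaves differently in the two cases.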

Thanks in advance for your help.

-- 
Prentice



