[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app
Prentice Bisbal
pbisbal at pppl.gov
Thu Mar 21 22:43:14 UTC 2019
Slurm-users,
My users here have developed a GUI application that serves as a front
end to various physics codes they use. From this GUI, they can submit
jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to
18.08.6-2, and a user has reported a problem when submitting Slurm jobs
through this GUI app that does not occur when the same sbatch script is
submitted with sbatch on the command line.
The GUI application generates the following sbatch script (non-essential
information redacted or omitted):
#!/bin/tcsh
#SBATCH --job-name=XXXXXXXX
#SBATCH --ntasks=32
#SBATCH --mem=2G
#SBATCH --time=00-2:00:00
#SBATCH --partition=YYYYYYYY
#SBATCH --export=ALL
#SBATCH --output=ZZZZZZZ
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=XXXXXX at example.com
echo "The job's id is: $SLURM_JOBID"
echo "The master node of this job is: $SLURM_SUBMIT_HOST"
echo -n 'Started job at : ' ; date
echo " "
cd .
echo "Working directory : "$PWD
module purge
module use /p/focus/modules
module load mystellopt
module list
mpirun --verbose /path/to/command /path/to/input input.stellopt
echo " "
echo -n 'Ended job at : ' ; date
echo " "
exit
When this job is submitted from the GUI application, every line is
executed except the mpirun line, which can easily be verified by looking
at the output file. When this job is submitted with sbatch on the
command-line, everything works as desired.
There is no error in the output file like "mpirun: command not found:",
so it appears that the mpirun command is in the PATH in the job's
environment. When I added the line "which mpirun" to the sbatch script,
it found the correct mpirun command to use.
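The "which mpirun" check can be extended to capture the whole job
environment, so the GUI-submitted and command-line-submitted runs can be
compared. What follows is only a sketch of diagnostic lines that could
go just before the mpirun line. The file name env.$SLURM_JOBID.txt is an
arbitrary choice for illustration, not something the GUI app or Slurm
produces:

```tcsh
# Hypothetical diagnostic lines, placed just before the mpirun line.
# Report which mpirun the job's PATH resolves to (or note its absence).
which mpirun || echo "mpirun not found in PATH"
# Save the sorted job environment; the file name is an illustrative choice.
env | sort > env.$SLURM_JOBID.txt
echo "Environment saved to env.$SLURM_JOBID.txt"
```

Submitting the script once from the GUI and once with sbatch, then
running diff on the two resulting files, should show any environment
variable that differs between the two submission paths.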
When I replaced the mpirun command with an equivalent srun command,
everything works as desired, so the user can get back to work and be
productive.
While srun is a suitable workaround, and is arguably the correct way to
run an MPI job, I'd like to understand what is going on here. Any idea
what is going wrong, or additional steps I can take to get more debug
information?
The user does acknowledge that the GUI app itself could have been
updated, and that such an update could have caused this. However, Slurm
accepts the job, the output of squeue and scontrol seems normal, and the
job is submitted and runs, so it looks to me like the interactions
between Slurm and the GUI app are fine.
Thanks in advance for your help.
--
Prentice