[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app
bbarth at tacc.utexas.edu
Fri Mar 22 18:17:58 UTC 2019
Slurm is almost certainly calling execve() eventually with the path to a copy of this script as an argument, so yes, the Linux kernel will notice tcsh on the first line of the script and invoke it to handle the contents. Slurm doesn’t have to honor it since the kernel will. Slurm usually makes a pass through the script to replace any %j instances in the #SBATCH lines with the jobid, etc., in a copy of the script before it runs it, but that's neither here nor there for you. The recommendations to pass -x and -v to tcsh are probably your best debugging options at this point. My vote is with the others who think that the environment inside the script is likely screwed up. Throwing in a printenv and saving the output can't hurt.
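As a rough sketch of how the saved printenv output could be compared, assuming one dump was saved from a job submitted on the command line and one from a job submitted through the GUI: the helper name env_diff and the file names are hypothetical, nothing here is provided by Slurm, and values that contain newlines are not handled.

```shell
# Hypothetical helper: list environment lines that appear in only one of
# two dumps. Each dump is assumed to come from `printenv > FILE` run
# inside the job script.
env_diff() {
    a=$(mktemp)
    b=$(mktemp)
    # comm(1) needs sorted input; force the C locale so sort and comm agree.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    # -23 keeps lines unique to the first file, -13 lines unique to the second.
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/cli only: /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/gui only: /'
    rm -f "$a" "$b"
}
```

Calling `env_diff cli.env gui.env` would then print only the variables that differ, which is usually where a changed PATH or a missing module setting shows up.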
Bill Barth, Ph.D., Director, HPC
bbarth at tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
On 3/22/19, 11:41 AM, "slurm-users on behalf of Reuti" <slurm-users-bounces at lists.schedmd.com on behalf of reuti at staff.uni-marburg.de> wrote:
> On 22.03.2019 at 16:20, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> On 3/21/19 6:56 PM, Reuti wrote:
>> On 21.03.2019 at 23:43, Prentice Bisbal wrote:
>>> My users here have developed a GUI application which serves as an interface to various physics codes they use. From this GUI, they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and a user has reported a problem when submitting Slurm jobs through this GUI app that does not occur when the same sbatch script is submitted with sbatch on the command line.
>>> When I replaced the mpirun command with an equivalent srun command, everything works as desired, so the user can get back to work and be productive.
>>> While srun is a suitable workaround, and is arguably the correct way to run an MPI job, I'd like to understand what is going on here. Any idea what is going wrong, or additional steps I can take to get more debug information?
>> Was an alias for `mpirun` introduced? An alias may shadow the real executable: `which mpirun` would still return the correct path, but that binary would never be executed.
>> $ type mpirun
>> $ alias mpirun
>> run inside the job script may tell.
> Unfortunately, the script is in tcsh,
Oh, I didn't notice this – correct.
> so the 'type' command doesn't work since
Is it really running in `tcsh`? The commands look generic and are available in various shells. Does SLURM honor the first line of a script, and/or does it use a default shell? In Bash, a function would shadow `mpirun` too.
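To illustrate the point about functions, here is a minimal sketch for Bourne-style shells (it does not apply to tcsh as written). The helper name is_shadowed is hypothetical; it relies only on the POSIX `command -v` builtin, which prints an absolute path when a name resolves to an executable on PATH and something else when an alias, function, or builtin takes precedence.

```shell
# Hypothetical helper: report whether something other than an executable
# on PATH would run when NAME is typed.
# Exit status: 0 = shadowed (alias, function, or builtin),
#              1 = resolves to an executable file,
#              2 = not found at all.
is_shadowed() {
    case "$(command -v "$1" 2>/dev/null)" in
        /*) return 1 ;;
        "") return 2 ;;
        *)  return 0 ;;
    esac
}
```

In a job script, `is_shadowed mpirun && echo "mpirun is not the binary on PATH"` would flag the case described above.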
(I'm more used to GridEngine, where this can be configured in both ways how to start the scripts.)
In tcsh, I see that a defined `jobcmd` alias could have some effect.
> it's a bash built-in function. I did use the 'alias' command to see all the defined aliases, and mpirun and mpiexec are not aliased. Any other ideas?