[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app
pbisbal at pppl.gov
Fri Mar 22 17:21:57 UTC 2019
The GUI app writes the script to the file slurm_script.sh in the cwd. I
did exactly what you suggested as my first step in debugging: I checked
the Command= value in the output of 'scontrol show job' to see what
script was actually submitted, and it was the slurm_script.sh in the cwd.
The user did provide me with some very useful information this
afternoon: the GUI app uses Python to launch the job. Here's what the
user wrote to me (OMFIT is the name of the GUI application):
> New clue for the mpirun issue: The following information might be
> helpful:
>
>   * I modified the script to use |subprocess| to submit the job
>     directly. The job was submitted, but somehow it returned a
>     non-zero error and the |mpiexec| line was skipped:
>
>         executable='echo %s',  # submit_command,
>         p = subprocess.Popen('sbatch ' + unique_remotedir + 'slurm.script',
>         print(std_out[-1], p.stderr.read())
>   * As I mentioned above, my standalone python script can normally
>     submit jobs likewise using |subprocess.Popen| or
>     |subprocess.call|. I created the following script in the working
>     directory and executed it with the same python version as OMFIT.
>     It works without skipping anything:
>
>         import sys
>         import os.path
>         import subprocess
>
>         print(sys.version, sys.path, subprocess.__file__)
>         p = subprocess.Popen('sbatch slurm.script', shell=True,
>                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>         print(p.stdout.read(), p.stderr.read())
> The question is why the same |subprocess.Popen| command works
> differently in OMFIT and in the terminal, even when both are called
> with the same version of |python2.7|.
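
My next step is to have him instrument that call so we can see sbatch's
actual exit status and stderr, instead of reading the two pipes
separately. Something like this rough sketch (unique_remotedir is his
variable; the placeholder value here is mine):

    import subprocess

    # Placeholder standing in for the directory OMFIT builds; the real
    # value comes from the user's script.
    unique_remotedir = './'

    p = subprocess.Popen('sbatch ' + unique_remotedir + 'slurm.script',
                         shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()   # reads both pipes to EOF, avoids blocking
    print("sbatch exit status: %s" % p.returncode)
    print("stdout: %s" % out)
    print("stderr: %s" % err)

If sbatch itself exits non-zero there, its stderr should tell us why.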
So now it's unclear whether this is a bug in Python or in Slurm
18.08.6-2. Since the user can write a standalone Python script that does
work, I think this is something specific to the application's
environment, rather than an issue with the Python-Slurm interaction. The
main piece of evidence that this might be a bug in Slurm is that the
issue started after the upgrade from 18.08.5-2 to 18.08.6-2, but
correlation doesn't necessarily imply causation.
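
If it is the environment, one way to pin it down is to dump the
environment from inside OMFIT and from an interactive shell and then
diff the two dumps. A rough sketch of what I have in mind (the output
file name is arbitrary):

    import json
    import os
    import subprocess
    import sys

    # Write one dump from inside OMFIT and one from a plain terminal,
    # then diff the two files.
    with open('env_dump_%d.json' % os.getpid(), 'w') as f:
        json.dump({'executable': sys.executable,
                   'version': sys.version,
                   'sys_path': sys.path,
                   'subprocess_module': subprocess.__file__,
                   'environ': dict(os.environ)},
                  f, indent=2, sort_keys=True)

Tom's --export=NONE suggestion (quoted below) is also worth driving
through the same Python path the GUI uses; I've put a sketch of that at
the bottom of this message.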
On 3/22/19 12:48 PM, Thomas M. Payerle wrote:
> Assuming the GUI-produced script is as you indicated (I am not sure
> where you got the script you showed, but if it is not the actual
> script used by a job, it might be worthwhile to examine the Command=
> file from scontrol show job to verify), then the only thing that
> should differ between a GUI submission and a manual submission is
> the submission environment. Does the manual submission work if you
> add --export=NONE to the sbatch command to prevent the exporting of
> environment variables? And maybe add a printenv to the script to see
> what the environment is in both cases. Though I confess I am unable
> to think of any reasonable environmental setting that might cause the
> observed symptoms.
> On Fri, Mar 22, 2019 at 11:23 AM Prentice Bisbal <pbisbal at pppl.gov> wrote:
> On 3/21/19 6:56 PM, Reuti wrote:
> > On 21.03.2019 at 23:43, Prentice Bisbal wrote:
> >> Slurm-users,
> >> My users here have developed a GUI application which serves as
> >> a GUI interface to various physics codes they use. From this GUI,
> >> they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from
> >> 18.08.5-2 to 18.08.6-2, and a user has reported a problem when
> >> submitting Slurm jobs through this GUI app that does not occur when
> >> the same sbatch script is submitted with sbatch on the command line.
> >> […]
> >> When I replaced the mpirun command with an equivalent srun
> >> command, everything works as desired, so the user can get back to
> >> work and be productive.
> >>
> >> While srun is a suitable workaround, and is arguably the
> >> correct way to run an MPI job, I'd like to understand what is
> >> going on here. Any idea what is going wrong, or additional steps I
> >> can take to get more debug information?
> > Was an alias to `mpirun` introduced? It may shadow the real
> > application: even `which mpirun` will return the correct value,
> > but the real binary is never executed. Running
> >
> > $ type mpirun
> > $ alias mpirun
> >
> > inside the job script may tell.
> Unfortunately, the script is in tcsh, so the 'type' command doesn't
> work, since it's a bash built-in. I did use the 'alias' command to
> see all the defined aliases, and mpirun and mpiexec are not aliased.
> Any other ideas?
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads payerle at umd.edu
> 5825 University Research Park (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
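
Following up on Tom's --export=NONE suggestion above: that test can
also be driven through the same Python path the GUI uses, so a clean
environment is requested by exactly the code that misbehaves. A minimal
sketch (slurm_script.sh stands in for whatever script the GUI writes):

    import subprocess

    # Sketch only: ask sbatch not to export the submission environment
    # into the job (--export=NONE), then report the exit status and
    # both output streams.
    p = subprocess.Popen('sbatch --export=NONE slurm_script.sh',
                         shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    print("sbatch exit status: %s" % p.returncode)
    print("stdout: %s" % out)
    print("stderr: %s" % err)

If the job then runs its mpirun line correctly, that points at something
in OMFIT's environment leaking into the job; keep in mind the batch
script would then have to set up its own modules and paths, since the
submission environment is no longer propagated.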