[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

Fri Mar 22 17:21:57 UTC 2019

Thomas,

The GUI app writes the script to the file slurm_script.sh in the cwd. As 
my first debugging step, I did exactly what you suggested: I checked the 
Command= value in the output of 'scontrol show job' to see which script 
was actually submitted, and it was the slurm_script.sh in the cwd.

The user did provide me with some very useful information this 
afternoon: the GUI app uses Python to launch the job. Here's what the 
user wrote to me (OMFIT is the name of the GUI application):

> New clue for the mpirun issue:  The following information might be 
> helpful.
>
>   * I modified the script to use |subprocess| to submit the job
>     directly. The job was submitted, but somehow it returned
>     NoneZeroError and the |mpiexec| line was skipped.
>
> OMFITx.executable(root,
>                   inputs=inputs,
>                   outputs=outputs,
>                   executable='echo %s',  # submit_command,
>                   script=(bashscript, 'slurm.script'),
>                   clean=True,
>                   std_out=std_out,
>                   remotedir=unique_remotedir,
>                   ignoreReturnCode=True)
>
> p = subprocess.Popen('sbatch ' + unique_remotedir + 'slurm.script',
>                      shell=True,
>                      stdout=subprocess.PIPE,
>                      stderr=subprocess.PIPE)
> std_out.append(p.stdout.read())
> print(std_out[-1], p.stderr.read())
>
>   * As I mentioned above, my standalone python script submits jobs
>     normally in the same way, using |subprocess.Popen| or
>     |subprocess.call|. I created the following script in the working
>     directory and executed it with the same python version as OMFIT.
>     It works without skipping the |mpiexec| line.
>
> import sys
> import os.path
> import subprocess
> print(sys.version, sys.path, subprocess.__file__)
> p = subprocess.Popen('sbatch slurm.script', shell=True,
>                      stdout=subprocess.PIPE,
>                      stderr=subprocess.PIPE)
> print(p.stdout.read(), p.stderr.read())
>
> The question is why the same |subprocess.Popen| command works 
> differently in OMFIT and in the terminal, even though both are run by 
> the same version, |python2.7|.
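
To narrow down the user's question, a small helper along the following 
lines could be run once inside OMFIT and once from a terminal. This is 
only a sketch: the names submit_and_capture and minimal_env are 
hypothetical, and the list of variables to keep is an assumption that 
would need adjusting for the site. It reports the exit code of the 
submit command alongside stdout and stderr, and it can pass an explicit, 
stripped-down environment through Popen's env argument, so the submission 
does not inherit whatever the calling application has set.

```python
import os
import subprocess


def submit_and_capture(cmd, env=None):
    """Run cmd (a list of arguments) and return (returncode, stdout, stderr).

    env=None inherits the caller's environment; pass a dict to control it.
    """
    p = subprocess.Popen(cmd,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         env=env,
                         universal_newlines=True)
    # communicate() waits for the process to exit and drains both pipes,
    # so a full stderr pipe cannot block a pending stdout read.
    out, err = p.communicate()
    return p.returncode, out, err


def minimal_env(keep=("PATH", "HOME", "USER")):
    """Copy only the named variables from the current environment.

    The default list is an assumption, not a recommendation.
    """
    return {k: os.environ[k] for k in keep if k in os.environ}


# Hypothetical usage (requires sbatch on PATH):
#   rc, out, err = submit_and_capture(["sbatch", "slurm.script"],
#                                     env=minimal_env())
#   print(rc, out, err)
```

If the job behaves correctly when submitted from inside OMFIT with the 
reduced environment, that would point at an inherited variable rather 
than at subprocess itself.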

So now it's unclear whether this is a bug in Python or in Slurm 
18.08.6-2. Since the user can write a Python script that does work, I 
think this is something specific to the application's environment, 
rather than an issue with the Python-Slurm interaction. The main piece 
of evidence that this might be a bug in Slurm is that the issue started 
after the upgrade from 18.08.5-2 to 18.08.6-2, but correlation doesn't 
necessarily mean causation.
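
If a printenv line is added to the job script and the job is submitted 
once through the GUI and once by hand, the two dumps can be compared 
mechanically. The following is a rough sketch under assumptions: the 
function names parse_env and diff_env are made up for illustration, and 
it assumes one NAME=value pair per line, so multi-line values (exported 
shell functions, for example) would be mis-parsed.

```python
def parse_env(text):
    """Parse printenv-style 'NAME=value' lines into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            env[name] = value
    return env


def diff_env(gui, manual):
    """Return {name: (gui_value, manual_value)} for variables that differ.

    A value of None means the variable is missing on that side.
    """
    names = set(gui) | set(manual)
    return {n: (gui.get(n), manual.get(n))
            for n in sorted(names) if gui.get(n) != manual.get(n)}


# Hypothetical usage, with file names chosen only for illustration:
#   gui = parse_env(open("env_gui.txt").read())
#   manual = parse_env(open("env_manual.txt").read())
#   for name, (g, m) in sorted(diff_env(gui, manual).items()):
#       print(name, "GUI:", g, "manual:", m)
```

Variables that appear only in the GUI dump, or whose values differ, would 
be the first candidates to look at.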

Prentice

On 3/22/19 12:48 PM, Thomas M. Payerle wrote:
> Assuming the GUI-produced script is as you indicated (I am not sure 
> where you got the script you showed, but if it is not the actual 
> script used by a job it might be worthwhile to examine the Command= 
> file from scontrol show job to verify), then the only thing that 
> should differ between a GUI submission and a manual submission is 
> the submission environment.  Does the manual submission work if you 
> add --export=NONE to the sbatch command to prevent the exporting of 
> environment variables?  And maybe add a printenv to the script to see 
> what the environment is in both cases.  Though I confess I am unable 
> to think of any reasonable environmental setting that might cause the 
> observed symptoms.
>
> On Fri, Mar 22, 2019 at 11:23 AM Prentice Bisbal <pbisbal at pppl.gov>
> wrote:
>
>     On 3/21/19 6:56 PM, Reuti wrote:
>     > Am 21.03.2019 um 23:43 schrieb Prentice Bisbal:
>     >
>     >> Slurm-users,
>     >>
>     >> My users here have developed a GUI application which serves as
>     >> an interface to various physics codes they use. From this GUI,
>     >> they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from
>     >> 18.08.5-2 to 18.08.6-2, and a user has reported a problem when
>     >> submitting Slurm jobs through this GUI app that does not occur when
>     >> the same sbatch script is submitted with sbatch on the command line.
>     >>
>     >> […]
>     >> When I replaced the mpirun command with an equivalent srun
>     >> command, everything works as desired, so the user can get back to
>     >> work and be productive.
>     >>
>     >> While srun is a suitable workaround, and is arguably the
>     >> correct way to run an MPI job, I'd like to understand what is
>     >> going on here. Any idea what is going wrong, or additional steps I
>     >> can take to get more debug information?
>     > Was an alias to `mpirun` introduced? It may cover the real
>     > application and even the `which mpirun` will return the correct
>     > value, but never be executed.
>     >
>     > $ type mpirun
>     > $ alias mpirun
>     >
>     > may tell in the jobscript.
>     >
>     Unfortunately, the script is in tcsh, so the 'type' command doesn't
>     work, since it's a bash built-in. I did use the 'alias' command to
>     see all the defined aliases, and mpirun and mpiexec are not aliased.
>     Any other ideas?
>
>     Prentice
>
> -- 
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads        payerle at umd.edu
> 5825 University Research Park               (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831