[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

Thomas M. Payerle payerle at umd.edu
Fri Mar 22 17:55:59 UTC 2019


Does the GUI run as the user (e.g. the user starts the GUI, so the
submitting process is owned by the user), or is the GUI running as a daemon
(in which case, is it submitting jobs as the user, and if so, how)?  And is
the default shell of the user submitting the job tcsh (like the shebang in
the script), or bash/something else?

Basically, the only things I can think of that might differ between a
GUI-submitted job and a manually submitted job are:
1) the environment of the process running the sbatch command (one way to
compare them is sketched just below this list);
2) possibly the shell initialization scripts (if the GUI runs as a
different user, or if the job script uses a different shell than the user's
default shell).
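
For the environment comparison, one quick-and-dirty way to do it is sketched
below (the file names are just placeholders): have the job script dump the
environment it received under each submission path, then diff the two
captures.

    # in the job script; by default sbatch exports the submitting
    # process's environment to the job, so this reflects it
    printenv | sort > ~/env_${SLURM_JOB_ID}.txt

    # after submitting once from the GUI and once from the terminal
    diff ~/env_<gui_jobid>.txt ~/env_<cli_jobid>.txt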

As Christopher suggested, adding -x (and maybe -v also) to the submission
script's shebang might be helpful; at the least it might tell us if
something funky is being missed in the script.
It might also help to add a -V to the mpirun (did you ever specify which
MPI library is being used?).  For OpenMPI, at least, that should print the
version, and I assume others have something similar.  In any case, if for
some reason the mpirun is being executed but is exiting quickly and
silently, this will at least let us know that it ran.  A printenv in the
script might be useful too.
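
To make that concrete, here is roughly what the instrumented job script
could look like (the #SBATCH line, the executable name, and the path to
tcsh are placeholders; -V is the flag OpenMPI's mpirun uses to print its
version, other MPI stacks may differ):

    #!/bin/tcsh -xv
    #SBATCH -n 16

    which mpirun          # which launcher is actually being picked up
    mpirun -V             # print the MPI version; proves mpirun ran at all

    mpirun ./my_mpi_app   # the real launch line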

Also, if you could try the manual run with '--export=NONE' added to the
sbatch command, that should at least help us see whether it is the
environment (or at least whether there is something in the manual-run
environment, missing from the GUI environment, that makes things work).
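
Concretely, the manual test would be something like the line below.  Note
that with --export=NONE the job inherits essentially nothing from the
submitting shell, so the script may need to set up its own environment
(e.g. load any modules) itself.

    sbatch --export=NONE slurm_script.sh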

I doubt it is a Python issue; all Python is doing is generating the job
submission script and invoking sbatch.  You verified that the job was
submitted and that the submitted job script is as expected, so I think this
lets Python off the hook.  That points to either the MPI library, Slurm, or
how Slurm is integrated with the MPI library.  Is it possible to try with
another MPI library (either changing the code being run to something simple
or using an incompatible MPI lib --- at this point a crash and burn from
MPI would be an improvement :)
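
If your cluster uses environment modules, one crude way to try that would
be something along these lines; the module and file names are only
placeholders for whatever is actually installed on your system:

    module swap openmpi mpich        # swap to a different MPI stack (names are site-specific)
    mpicc -o hello_mpi hello_mpi.c   # trivial MPI hello-world as the test case
    mpirun -np 2 ./hello_mpi         # launch it from the same GUI-generated job script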



On Fri, Mar 22, 2019 at 1:24 PM Prentice Bisbal <pbisbal at pppl.gov> wrote:

> Thomas,
>
> The GUI app writes the script to the file slurm_script.sh in the cwd.  I
> did exactly what you suggested as my first step in debugging: I checked
> the Command= value from the output of 'scontrol show job' to see what
> script was actually submitted, and it was the slurm_script.sh in the cwd.
>
> The user did provide me with some very useful information this afternoon:
> The GUI app uses Python to launch the job.  Here's what the user wrote to
> me (OMFIT is the name of the GUI application):
>
>
> New clue for the mpirun issue:  The following information might be
> helpful.
>
>    - I modified the script to use subprocess to submit the job directly.
>    The job was submitted, but somehow it returned a NoneZeroError and the
>    mpiexec line was skipped.
>
> OMFITx.executable(root,
>                       inputs=inputs,
>                       outputs=outputs,
>                       executable='echo %s',#submit_command,
>                       script=(bashscript, 'slurm.script'),
>                       clean=True,
>                       std_out=std_out,
>                       remotedir=unique_remotedir,
>                       ignoreReturnCode=True)
>
>     p=subprocess.Popen('sbatch '+unique_remotedir+'slurm.script',
>                        shell=True,
>                        stdout=subprocess.PIPE,
>                        stderr=subprocess.PIPE)
>     std_out.append(p.stdout.read())
>     print(std_out[-1], p.stderr.read())
>
>
>    - As I mentioned above, my standalone Python script can normally
>    submit jobs in the same way, using subprocess.Popen or subprocess.call.
>    I created the following script in the working directory and executed it
>    with the same Python version as OMFIT.  It works without skipping the
>    mpiexec line.
>
> import sys
> import os.path
> import subprocess
>
> print(sys.version, sys.path, subprocess.__file__)
>
> p = subprocess.Popen('sbatch slurm.script', shell=True,
>                    stdout=subprocess.PIPE,
>                    stderr=subprocess.PIPE)
> print(p.stdout.read(), p.stderr.read())
>
> The question is why the same subprocess.Popen command works differently in
> OMFIT and in the terminal, even though they are called by the same version
> of Python 2.7.
>
> So now it's unclear whether this is a bug in Python or in Slurm 18.08.6-2.
> Since the user can write a Python script that does work, I think this is
> something specific to the application's environment, rather than an issue
> with the Python-Slurm interaction. The main piece of evidence that this
> might be a bug in Slurm is that this issue started after the upgrade from
> 18.08.5-2 to 18.08.6-2, but correlation doesn't necessarily mean causation.
>
> Prentice
>
>
> On 3/22/19 12:48 PM, Thomas M. Payerle wrote:
>
> Assuming the GUI-produced script is as you indicated (I am not sure where
> you got the script you showed, but if it is not the actual script used by a
> job, it might be worthwhile to examine the Command= file from scontrol show
> job to verify), then the only thing that should be different between a GUI
> submission and a manual submission is the submission environment.  Does the
> manual submission work if you add --export=NONE to the sbatch command to
> prevent the exporting of environment variables?  And maybe add a printenv
> to the script to see what the environment is in both cases.  Though I
> confess I am unable to think of any reasonable environment setting that
> might cause the observed symptoms.
>
> On Fri, Mar 22, 2019 at 11:23 AM Prentice Bisbal <pbisbal at pppl.gov> wrote:
>
>> On 3/21/19 6:56 PM, Reuti wrote:
>> > On 21.03.2019 at 23:43, Prentice Bisbal wrote:
>> >
>> >> Slurm-users,
>> >>
>> >> My users here have developed a GUI application which serves as an
>> >> interface to various physics codes they use.  From this GUI, they can
>> >> submit jobs to Slurm.  On Tuesday, we upgraded Slurm from 18.08.5-2 to
>> >> 18.08.6-2, and a user has reported a problem when submitting Slurm jobs
>> >> through this GUI app that does not occur when the same sbatch script is
>> >> submitted with sbatch on the command line.
>> >>
>> >> […]
>> >> When I replaced the mpirun command with an equivalent srun command,
>> >> everything works as desired, so the user can get back to work and be
>> >> productive.
>> >>
>> >> While srun is a suitable workaround, and is arguably the correct way
>> >> to run an MPI job, I'd like to understand what is going on here.  Any
>> >> idea what is going wrong, or additional steps I can take to get more
>> >> debug information?
>> > Was an alias to `mpirun` introduced?  It may cover the real
>> > application, and even though `which mpirun` returns the correct value,
>> > that binary is never the one executed.
>> >
>> > $ type mpirun
>> > $ alias mpirun
>> >
>> > may tell, when run in the job script.
>> >
>> Unfortunately, the script is in tcsh, so the 'type' command doesn't work,
>> since it's a bash built-in.  I did use the 'alias' command to see all the
>> defined aliases, and mpirun and mpiexec are not aliased.  Any other ideas?
>>
>> Prentice
>>
>>
>>
>>
>>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads        payerle at umd.edu
> 5825 University Research Park               (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
>
>

-- 
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads        payerle at umd.edu
5825 University Research Park               (301) 405-6135
University of Maryland
College Park, MD 20740-3831