[slurm-users] salloc --no-shell question

Douglas Jacobsen dmjacobsen at lbl.gov
Thu Jan 24 18:01:16 UTC 2019


Hmmm, I can't quite replicate that:


dmj at cori11:~> salloc -C knl -q interactive -N 2 --no-shell
salloc: Granted job allocation 18219715
salloc: Waiting for resource configuration
salloc: Nodes nid0[2318-2319] are ready for job
dmj at cori11:~> srun --jobid=18219715 /bin/false
srun: error: nid02318: task 0: Exited with exit code 1
srun: Terminating job step 18219715.0
srun: error: nid02319: task 1: Exited with exit code 1
dmj at cori11:~> echo $?
1
dmj at cori11:~> squeue -u dmj
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18219715 interacti   (null)      dmj  R       0:57      2 nid0[2318-2319]
dmj at cori11:~> srun --jobid=18219715 /bin/false
srun: error: nid02319: task 1: Exited with exit code 1
srun: Terminating job step 18219715.1
srun: error: nid02318: task 0: Exited with exit code 1
dmj at cori11:~> squeue -u dmj
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18219715 interacti   (null)      dmj  R       1:17      2 nid0[2318-2319]
dmj at cori11:~>



Is it possible that your failing sruns are not properly terminating when
the first rank crashes, and are actually consuming all of the requested time?
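
One quick way to check is to look at the steps themselves, e.g. with the
job id from my example above (illustration only, standard squeue/sacct
options):

squeue -s
sacct -j 18219715 --format=JobID,JobName,State,Elapsed,ExitCode

The first lists any job steps still running; the second shows the state,
elapsed time, and exit code of each step after the fact.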

-Doug
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen at lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________



On Thu, Jan 24, 2019 at 9:24 AM Pritchard Jr., Howard <howardp at lanl.gov>
wrote:

> Hello Slurm experts,
>
> We have a workflow where a script invokes salloc --no-shell and then
> launches a series of MPI jobs using srun with the --jobid= option, to
> make use of the allocation obtained from the salloc invocation.  We
> need to do things this way because the script itself has to report the
> test results back to an external server running at AWS, and the compute
> nodes within the allocated partition have no connectivity to the
> internet, hence our use of the --no-shell option.
>
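> In outline, the driver script does something like the following (the
> node count, time limit, test paths, and reporting URL below are just
> placeholders):
>
> #!/bin/bash
> # Grab an allocation but do not start a shell on it.
> JOBID=$(salloc -N 1 -t 3:00:00 --no-shell 2>&1 |
>         grep -o 'Granted job allocation [0-9]*' | awk '{print $4}')
>
> # Run each MPI test inside the existing allocation.
> for test in ./tests/*; do
>     srun --jobid=$JOBID -n 16 -c 4 --mpi=pmix "$test"
>     rc=$?
>     # Report the result back to the external server (placeholder URL).
>     curl -s -d "test=$test&status=$rc" https://example.com/report
> done
>
> # Release the allocation when all tests have finished.
> scancel $JOBID
>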
> This all works fine except for one annoying behavior of Slurm.  If we
> have no test failures, i.e. all srun'ed tests exit successfully,
> everything is fine.  However, once we start having failed tests, and
> hence non-zero exit statuses from srun, we get maybe one or two more
> tests to run and then Slurm cancels the allocation.
>
> Here is example output from the script as it runs some MPI tests, then
> some fail, and then Slurm drops our allocation:
>
> ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/leakcatch
> stdout: seed value: -219475876
> stdout: 0
> stdout: 1
> stdout: 2
> stdout: 3
> stdout: 4
> stdout: 5
> stdout: 6
> stdout: 7
> stdout: 8
> stdout: 9
> stdout: 10
> stdout: 11
> stdout: 12
> stdout: 13
> stdout: 14
> stdout: 15
> stdout: 16
> stdout: 17
> stdout: 18
> stdout: 19
> stdout: 20
> stdout: ERROR: buf 778 element 749856 is 103 should be 42
> stderr: --------------------------------------------------------------------------
> stderr: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> stderr: with errorcode 16.
> stderr:
> stderr: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> stderr: You may or may not see output from other processes, depending on
> stderr: exactly when Open MPI kills them.
> stderr: --------------------------------------------------------------------------
> stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> stderr: slurmstepd: error: *** STEP 2974.490 ON st03 CANCELLED AT 2019-01-22T20:02:22 ***
> stderr: srun: error: st03: task 0: Exited with exit code 16
> stderr: srun: error: st03: tasks 1-15: Killed
> ExecuteCmd done
>
> ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/maxsoak
> stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> stderr: slurmstepd: error: *** STEP 2974.491 ON st03 CANCELLED AT 2019-01-22T23:06:08 DUE TO TIME LIMIT ***
> stderr: srun: error: st03: tasks 0-15: Terminated
> ExecuteCmd done
>
> ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/op_commutative
> stderr: srun: error: Unable to allocate resources: Invalid job id specified
> ExecuteCmd done
>
> This is not because the allocation actually hit its time limit, even
> though the message says so: the job had been running for only about 30
> minutes of a 3-hour allocation.  We have double-checked this; on one
> cluster that we can configure ourselves, we set the default job time
> limit to infinite and still observe the issue.  Still, the fact that
> Slurm reports the cancellation as a TIME LIMIT may be a hint as to why
> Slurm revokes the allocation.
>
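> For reference, the time limit actually applied to the allocation can be
> checked directly against the job id from the output above, e.g.:
>
> scontrol show job 2974 | grep -i timelimit
> sacct -j 2974 -X --format=JobID,State,Timelimit,Elapsed
>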
> We see this on every cluster we’ve tried so far, so it doesn’t appear to
> be a site-specific configuration issue.
>
> Any insights into how to work around or fix this problem would be appreciated.
>
> Thanks,
>
> Howard
>
>
> --
> Howard Pritchard
> B Schedule
> HPC-ENV
>
> Office 9, 2nd floor Research Park
>
> TA-03, Building 4200, Room 203
> Los Alamos National Laboratory
>
>