[slurm-users] salloc --no-shell question

Pritchard Jr., Howard howardp at lanl.gov
Thu Jan 24 17:15:15 UTC 2019


Hello Slurm experts,

We have a workflow in which a script invokes salloc --no-shell and then launches a series of MPI
jobs using srun with the --jobid= option to make use of the allocation we got from the salloc invocation.
We need to do things this way because the script itself has to report the results of the
tests back to an external server running on AWS.  The compute nodes within the allocated partition have no connectivity
to the internet, hence our use of the --no-shell option.
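For concreteness, a stripped-down sketch of what the script does (node counts, paths, the job-id
scraping, and the reporting step are placeholders, not our exact code):

  # get an allocation without starting a shell on the nodes; salloc exits
  # right away and the allocation stays up
  JOBID=$(salloc -N 4 -t 3:00:00 --no-shell 2>&1 | awk '/Granted job allocation/ {print $NF}')

  # launch each MPI test into that allocation from the front-end node
  srun --jobid=$JOBID -n 16 -c 4 --mpi=pmix ./some_mpi_test

  # report results from the front end, which does have outside connectivity
  curl -X POST https://results.example.com/report -d @results.json

  # release the allocation when the test series is done
  scancel $JOBID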

This all works fine except for an annoying Slurm behavior.  If we have no test failures, i.e. all srun'ed tests
exit successfully, everything works.  However, once we start having failed tests, and hence non-zero
exit statuses from srun, we get maybe one or two more tests to run, and then Slurm cancels the allocation.
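Reduced to its essentials, the pattern is something like this (the test binaries are replaced with
placeholders; the error text is quoted from the real log below):

  srun --jobid=$JOBID -n 16 /bin/false       # a test fails, so srun returns non-zero
  srun --jobid=$JOBID -n 16 ./next_test      # one or two more of these may still run...
  srun --jobid=$JOBID -n 16 ./another_test   # ...then: srun: error: Unable to allocate resources:
                                             #          Invalid job id specified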

Here is example output from the script as it runs some MPI tests; some of them fail, and then Slurm drops
our allocation:


ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/leakcatch
stdout: seed value: -219475876
stdout: 0
stdout: 1
stdout: 2
stdout: 3
stdout: 4
stdout: 5
stdout: 6
stdout: 7
stdout: 8
stdout: 9
stdout: 10
stdout: 11
stdout: 12
stdout: 13
stdout: 14
stdout: 15
stdout: 16
stdout: 17
stdout: 18
stdout: 19
stdout: 20
stdout: ERROR: buf 778 element 749856 is 103 should be 42
stderr: --------------------------------------------------------------------------
stderr: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
stderr: with errorcode 16.
stderr:
stderr: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
stderr: You may or may not see output from other processes, depending on
stderr: exactly when Open MPI kills them.
stderr: --------------------------------------------------------------------------
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.490 ON st03 CANCELLED AT 2019-01-22T20:02:22 ***
stderr: srun: error: st03: task 0: Exited with exit code 16
stderr: srun: error: st03: tasks 1-15: Killed
ExecuteCmd done

ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/maxsoak
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.491 ON st03 CANCELLED AT 2019-01-22T23:06:08 DUE TO TIME LIMIT ***
stderr: srun: error: st03: tasks 0-15: Terminated
ExecuteCmd done

ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/op_commutative
stderr: srun: error: Unable to allocate resources: Invalid job id specified
ExecuteCmd done

This is not a case of the allocation being revoked because of a time limit, even though the message says so.  The job had been running
for only about 30 minutes of a 3-hour allocation.  We have double-checked that: on one cluster that we can configure, we set the default
job time limit to infinite and still observe the issue.  But the fact that Slurm reports it as a TIME LIMIT cancellation may be a hint
about why Slurm revokes the allocation.
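For reference, the allocation's time accounting can be checked with something along these lines
(job id taken from the example above; the sacct field list is just the obvious one, nothing special):

  scontrol show job 2974 | grep -i -e RunTime -e TimeLimit    # while the allocation still exists
  sacct -j 2974 -o JobID,State,ExitCode,Elapsed,Timelimit     # after it has been revoked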

We see this on every cluster we’ve tried so far, so it doesn’t appear to be a site-specific configuration issue.

Any insights into how to work around or fix this problem would be appreciated.

Thanks,

Howard


--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory
