[slurm-users] salloc --no-shell question
Pritchard Jr., Howard
howardp at lanl.gov
Thu Jan 24 17:15:15 UTC 2019
Hello Slurm experts,
We have a workflow where a script invokes salloc --no-shell and then launches a series of MPI
jobs using srun with the --jobid= option to make use of the allocation we got from the salloc invocation.
We need to do things this way because the script itself has to report the results of the
tests back to an external server running in AWS. The compute nodes within the allocated partition have no
connectivity to the internet, hence our use of the --no-shell option.
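In outline, the script does something like the following sketch (the node count, time limit, test names,
and the way the job id is parsed are illustrative, and report_result_to_aws stands in for our actual reporting step):

    #!/bin/bash
    # Get an allocation without spawning a shell; salloc prints
    # "Granted job allocation NNNN" on stderr, which we parse here.
    jobid=$(salloc -N 1 -t 3:00:00 --no-shell 2>&1 \
            | grep -o 'Granted job allocation [0-9]*' | awk '{print $NF}')

    # Run each MPI test inside that allocation and report the result externally.
    for test in leakcatch maxsoak op_commutative; do
        srun -n 16 -c 4 --mpi=pmix --jobid=$jobid ./$test
        report_result_to_aws "$test" "$?"   # hypothetical reporting helper
    done

    # Release the allocation once the test sweep is done.
    scancel $jobid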
This is all fine except for an annoying Slurm behavior. If we have no test failures, i.e. all srun'ed tests
exit successfully, everything works fine. However, once we start having failed tests, and hence non-zero
exit statuses from srun, we get maybe one or two more tests to run, and then Slurm cancels the allocation.
Here's example output from the script as it's running some MPI tests; some fail, and then Slurm drops
our allocation:
ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/leakcatch
stdout: seed value: -219475876
stdout: 0
stdout: 1
stdout: 2
stdout: 3
stdout: 4
stdout: 5
stdout: 6
stdout: 7
stdout: 8
stdout: 9
stdout: 10
stdout: 11
stdout: 12
stdout: 13
stdout: 14
stdout: 15
stdout: 16
stdout: 17
stdout: 18
stdout: 19
stdout: 20
stdout: ERROR: buf 778 element 749856 is 103 should be 42
stderr: --------------------------------------------------------------------------
stderr: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
stderr: with errorcode 16.
stderr:
stderr: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
stderr: You may or may not see output from other processes, depending on
stderr: exactly when Open MPI kills them.
stderr: --------------------------------------------------------------------------
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.490 ON st03 CANCELLED AT 2019-01-22T20:02:22 ***
stderr: srun: error: st03: task 0: Exited with exit code 16
stderr: srun: error: st03: tasks 1-15: Killed
ExecuteCmd done
ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/maxsoak
stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
stderr: slurmstepd: error: *** STEP 2974.491 ON st03 CANCELLED AT 2019-01-22T23:06:08 DUE TO TIME LIMIT ***
stderr: srun: error: st03: tasks 0-15: Terminated
ExecuteCmd done
ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/op_commutative
stderr: srun: error: Unable to allocate resources: Invalid job id specified
ExecuteCmd done
This is not due to the allocation being revoked because of a time limit, even though the message says so. The job had been
running only about 30 minutes into a 3 hour allocation. We've double-checked that, and on one cluster we can configure
ourselves we set the default job time limit to infinite and still observe the issue. Still, the fact that Slurm reports a
TIMELIMIT cancellation may be a hint as to why it revokes the allocation.
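(For reference, the run time and limit of the allocation can be checked with something along these lines,
using the jobid from the example output above:)

    scontrol show job 2974 | grep -E 'RunTime|TimeLimit'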
We see this on every cluster we’ve tried so far, so it doesn’t appear to be a site-specific configuration issue.
Any insights into how to work around or fix this problem would be appreciated.
Thanks,
Howard
--
Howard Pritchard
B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory