[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Robert Peck rp1060 at york.ac.uk
Fri Apr 16 19:39:17 UTC 2021


Excuse me, I am trying to run some software on a cluster which uses the
SLURM scheduler. IT support at my institution have exhausted their
knowledge of SLURM in trying to debug a rather nasty problem with a specific
feature of the scheduler, and they suggested I try here for tips.

I am using jobs of the form:

#!/bin/bash
#SBATCH --job-name=name              # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=my_email at thing.thing   # Where to send mail
#SBATCH --mem=2gb                    # Job memory request, not hugely intensive
#SBATCH --time=47:00:00              # Time limit hrs:min:sec; the sim software being run
                                     # from within the bash script is quite slow, extra
                                     # memory can't speed it up and it can't run
                                     # multi-core, hence long runs on weak nodes
#SBATCH --nodes=100
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1
#SBATCH --output=file_output_%j.log  # Standard output and error log
#SBATCH --account=code               # Project account
#SBATCH --ntasks-per-core=1          # only 1 task per core, must not be more
#SBATCH --ntasks-per-node=1          # only 1 task per node, must not be more
#SBATCH --ntasks-per-socket=1        # guessing here but fairly sure I don't want
                                     # multiple instances trying to use same socket
#SBATCH --no-kill                    # supposedly prevents restart of other jobs on
                                     # other nodes if one of the 100 gets a NODE_FAIL

echo My working directory is `pwd`
echo Running job on host:
echo -e '\t'`hostname` at `date`
echo

module load toolchain/foss/2018b

cd scratch
cd further_folder
chmod +x my_bash_script.sh

srun --no-kill -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" ./my_bash_script.sh

wait
echo
echo Job completed at `date`

I use a bash script to launch my special software, and it is this script
that actually handles each job. The software is a bit weird: two copies of
it WILL NOT EVER play nicely if made to share a node. Hence this job
launches 100 copies on 100 nodes, each of which does its own work and
writes out to a separate results file, roughly as sketched below. I later
process the results files.
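In case it helps, a heavily simplified sketch of what each copy does is
below (the simulator name, its arguments and the parameter file are
placeholders here, not my real ones):

#!/bin/bash
# Simplified stand-in for my_bash_script.sh: each task runs one
# independent simulation and writes its own results file.

# SLURM_PROCID is set by srun and is unique per task (0..99 here),
# so each copy picks its own starting parameters and output file.
TASK_ID="${SLURM_PROCID}"

# (placeholder) take this task's starting parameters from a prepared
# list, one parameter set per line
PARAMS=$(sed -n "$((TASK_ID + 1))p" parameter_list.txt)

# (placeholder) run the single-core simulator; it must be the only copy
# on its node, and it writes one results file per task
./my_simulator ${PARAMS} > "results_${TASK_ID}.txt"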

In my scenario I want 100 jobs to run, but if one or two failed and I only
got 99 or 95 back, I could carry on with further processing using just 99
or 95 result files. Getting back a few fewer jobs than I want is no
tragedy for my type of work.

But the problem is that when any one node has a failure (not that rare when
you're calling for 100 nodes simultaneously), SLURM will by default murder
the WHOLE LOT of jobs, and, even more confusingly, then restart a bunch of
them, which ends up with a very confusing pile of results files. I thought
the --no-kill flag should prevent this fault, but instead of preventing the
killing of all jobs due to a single failure it only prevents the restart.
Now I get a misleading message from the cluster telling me of a good exit
code when such slaughter occurs, but when I log in to the cluster I
discover a massacre of my jobs, all because just one of them failed.

I understand that for interacting jobs spread across many nodes, killing
all of them because of one failure can be necessary, but my jobs are
strictly parallel, with no cross-interaction between them at all. Each is
an utterly separate simulation with different starting parameters. I need
to ensure that if one job fails and must be killed, the rest are not
affected.
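To illustrate the independence I'm after (this is not something I currently
run, and I haven't verified the step options are right), I would be just as
happy launching the 100 copies as separate single-node job steps inside the
allocation, where a step that dies should simply leave one results file
missing:

for i in $(seq 0 99); do
    # one single-task step per node, backgrounded so all 100 run at once;
    # --exclusive is the step-level option older Slurm documentation uses
    # for concurrent steps (newer releases have --exact for the same
    # purpose); the task number would have to be passed explicitly here
    # rather than via SLURM_PROCID
    srun --nodes=1 --ntasks=1 --exclusive --no-kill ./my_bash_script.sh "$i" &
done
# wait for every background step; if one fails it should not take the
# other 99 down with it
wait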

I have been advised that, because the simulation software refuses to run
more than one copy properly on any given node at once, I am NOT able to
use "array jobs" and must stick to this sort of job which requests 100
nodes in this way.

Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs
because ONE (or a few) node(s) fails?

All my research is being put on hold by this behaviour, which is making it
almost impossible to get large runs out of the cluster: a very large
fraction of the jobs I submit have a failure on 1 of the 100 nodes, and so
that very large fraction of my jobs gets killed on all nodes even though
only one is faulty. I don't get jobs lasting long enough to give me many
useful sets of the 100(ish) result files I need.

P.S. Just to warn you, I'm not an HPC expert or a Linux power user. I'm
comfortable with Linux, command lines and technical details, but will
probably need a bit more explanation around answers than someone
specialised in high performance computing would.

--
Thank You
Rob

P.S. I'm unsure whether this is how one is supposed to add posts to this
google group; I sent this twice as I wasn't sure the earlier one got
through, since I might not have been correctly subscribed at that time.
Thanks.

