[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Renfro, Michael Renfro at tntech.edu
Fri Apr 16 21:42:26 UTC 2021

I can't speak to what happens on node failure, but I can at least get you a greatly simplified pair of scripts that will run only one copy on each node allocated:

#!/bin/bash
# notarray.sh
#SBATCH --nodes=28
#SBATCH --ntasks-per-node=1
#SBATCH --no-kill
echo "notarray.sh is running on $(hostname)"
srun --no-kill somescript.sh


#!/bin/bash
# somescript.sh
echo "somescript.sh is running on $(hostname)"

I can verify that after submitting the job with "sbatch notarray.sh":

  *   notarray.sh ran on only one allocated node, and
  *   somescript.sh ran once on each of the 28 nodes allocated, including the one that notarray.sh ran on.

No need to pass srun a set of parameters for how many tasks to run, since it can figure that out from the sbatch context.
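
As a rough illustration of what "the sbatch context" means: sbatch exports details of the allocation to the batch script as environment variables, for example SLURM_JOB_NUM_NODES and SLURM_NTASKS (the same two the original job script below passes to srun by hand). The sketch below only prints them. The helper name show_alloc is made up for this example, and the variables may be unset outside a job or when the matching option was not given.

```shell
#!/bin/bash
# show_alloc is a hypothetical helper, not a Slurm command.
# It prints what a batch script inherits from sbatch; outside a job
# (or if an option was not given) the variables may be unset.
show_alloc() {
    echo "nodes=${SLURM_JOB_NUM_NODES:-unset} tasks=${SLURM_NTASKS:-unset}"
}

show_alloc
```

Calling show_alloc just before the srun line in notarray.sh would be one way to check what srun has to work with.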

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Robert Peck <rp1060 at york.ac.uk>
Date: Friday, April 16, 2021 at 2:40 PM
To: slurm-users at schedmd.com <slurm-users at schedmd.com>
Subject: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Excuse me, I am trying to run some software on a cluster which uses the SLURM grid engine. IT support at my institution have exhausted their knowledge of SLURM in trying to debug this rather nasty bug with a specific feature of the grid engine and suggested I try here for tips.

I am using jobs of the form:

#SBATCH --job-name=name       # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=my_email at thing.thing     # Where to send mail

#SBATCH --mem=2gb                        # Job memory request, not hugely intensive
#SBATCH --time=47:00:00                  # Time limit hrs:min:sec, the sim software being run from within the bash script is quite slow, extra memory can't speed it up and it can't run multi-core, hence long runs on weak nodes

#SBATCH --nodes=100
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1

#SBATCH --output=file_output_%j.log        # Standard output and error log
#SBATCH --account=code       # Project account
#SBATCH --ntasks-per-core=1 #only 1 task per core, must not be more
#SBATCH --ntasks-per-node=1 #only 1 task per node, must not be more
#SBATCH --ntasks-per-socket=1 #guessing here but fairly sure I don't want multiple instances trying to use same socket

#SBATCH --no-kill # supposedly prevents restart of other jobs on other nodes if one of the 100 gets a NODE_FAIL

echo "My working directory is $(pwd)"
echo "Running job on host:"
echo -e "\t$(hostname) at $(date)"

module load toolchain/foss/2018b
cd scratch
cd further_folder
chmod +x my_bash_script.sh

srun --no-kill -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" ./my_bash_script.sh

echo "Job completed at $(date)"

I use a bash script to launch the software that actually handles each job. This software is a bit weird, and two copies of it WILL NOT EVER play nicely if made to share a node. Hence this job launches 100 copies on 100 nodes, each of which does its own work and writes to a separate results file. I later process the results files.
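
One way the "separate results file per copy" idea could look, as a sketch only: srun sets SLURM_PROCID to each task's rank within the step, so a script can derive a unique file name from it. The helper name result_file_for_task and the results_<rank>.txt naming are assumptions for illustration, not the scheme actually used in my_bash_script.sh.

```shell
#!/bin/bash
# result_file_for_task is a hypothetical helper; the "results_<rank>.txt"
# naming is an assumption, not the poster's actual scheme.
result_file_for_task() {
    # srun sets SLURM_PROCID to the task's rank within the step (0, 1, 2, ...).
    # Default to 0 so the sketch also runs outside Slurm.
    local rank="${SLURM_PROCID:-0}"
    echo "results_${rank}.txt"
}

# Example: report which file this copy would write to.
echo "this copy writes to $(result_file_for_task)"
```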

In my scenario I want 100 jobs to run, but if one or two failed and I only got 99 or 95 back, I could still do further processing with just 99 or 95 result files. Getting back a few fewer jobs than I want is no tragedy for my type of work.

But the problem is that when any one node has a failure, which is not that rare when you're calling for 100 nodes simultaneously, SLURM by default murders the WHOLE LOT of jobs and, even more confusingly, then restarts a bunch of them, which leaves a very confusing pile of results files. I thought the --no-kill flag should prevent this, but instead of preventing the killing of all jobs due to a single failure, it only prevents the restart. Now I get a misleading message from the cluster reporting a good exit code when such slaughter occurs, but when I log in to the cluster I discover a grid engine massacre of my jobs, all because just one of them failed.

I understand that for interacting jobs on many nodes, killing all of them because of one failure can be necessary, but my jobs are strictly parallel, with no cross-interaction between them at all. Each is an utterly separate simulation with different starting parameters. I need to ensure that if one job fails and must be killed, the rest are not affected.
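
A minimal sketch of isolating one copy's failure at the script level, assuming the failure is the simulation exiting non-zero rather than the node itself going down. The wrapper name run_isolated is made up for this example. It runs a command, logs its exit status, and always returns 0, so a failed copy shows up as a log line rather than as a failed task.

```shell
#!/bin/bash
# run_isolated is a hypothetical wrapper, not part of Slurm.
# It runs one command, records its exit status, and always returns 0.
run_isolated() {
    local status=0
    "$@" || status=$?
    echo "rank ${SLURM_PROCID:-0}: '$*' exited with status ${status}" >&2
    return 0
}

run_isolated false   # a failing command; the wrapper still returns 0
echo "wrapper returned $?"
```

Saved as its own script, something like this could be what srun launches in place of the real script directly. It does nothing about a node failing outright, which is the case --no-kill is aimed at. srun also documents a --kill-on-bad-exit option; whether that setting plays any part here cannot be told from this message.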

I have been advised that because the simulation software refuses to run more than one copy properly on any given node at once, I am NOT able to use "array jobs" and must stick to this sort of job, which requests 100 nodes this way.

Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs because ONE (or a few) node(s) fails?

All my research is being put on hold by this bug, which makes getting large runs out of the cluster almost impossible. A very large fraction of the jobs I submit have a failure on 1 of the 100 nodes, and hence get killed on all nodes even though only one is faulty. I don't get many jobs lasting long enough to give me useful sets of the 100(ish) result files I need.

P.S. just to warn you, I'm not an HPC expert or a linux power user. I'm comfortable with linux and with command lines and technical details but will probably need a bit more explanation around answers than someone specialised in high performance computing would.

Thank You

P.S. unsure of whether this is how one is supposed to add forum posts to this google group, sent twice as I wasn't sure if the earlier one got through as I might not have been correctly subscribed at that time, thanks
