<div dir="ltr"><div class="gmail-adn gmail-ads" style="border-left:none;padding:0px;display:flex;font-family:Roboto,RobotoDraft,Helvetica,Arial,sans-serif;font-size:medium"><div class="gmail-gs" style="margin:0px;padding:0px 0px 20px;width:1064px"><div class="gmail-"><div id="gmail-:ha" class="gmail-ii gmail-gt" style="font-size:0.875rem;direction:ltr;margin:8px 0px 0px;padding:0px"><div id="gmail-:h9" class="gmail-a3s gmail-aiL" style="overflow:hidden;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:small;line-height:1.5;font-family:Arial,Helvetica,sans-serif"><div dir="ltr">Excuse me, I am trying to run some software on a cluster which uses the SLURM grid engine. IT support at my institution have exhausted their knowledge of SLURM in trying to debug this rather nasty bug with a specific feature of the grid engine and suggested I try here for tips.<div><br></div><div>I am using jobs of the form:</div><div><br></div><div><b style="font-style:italic">#!/bin/bash<br>#SBATCH --job-name=name       # Job name<br>#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)<br>#SBATCH --mail-user=my_email@thing.thing     # Where to send mail  <br><br>#SBATCH --mem=2gb                        # Job memory request, not hugely intensive<br>#SBATCH --time=47:00:00                  # Time limit hrs:min:sec, the sim software being run from within the bash script is quite slow, extra memory can't speed it up and it can't run multi-core, hence long runs on weak nodes<br><br>#SBATCH --nodes=100<br>#SBATCH --ntasks=100<br>#SBATCH --cpus-per-task=1<br><br>#SBATCH --output=file_output_%j.log        # Standard output and error log<br>#SBATCH --account=code       # Project account<br>#SBATCH --ntasks-per-core=1 #only 1 task per core, must not be more<br>#SBATCH --ntasks-per-node=1 #only 1 task per node, must not be more<br>#SBATCH --ntasks-per-socket=1 #guessing here but fairly sure I don't want multiple instances trying to 
use the same socket<br><br>#SBATCH --no-kill # </b>supposedly<b style="font-style:italic"> prevents restart of other jobs on other nodes if one of the 100 gets a NODE_FAIL<br><br><br>echo My working directory is `pwd`<br>echo Running job on host:<br>echo -e '\t'`hostname` at `date`<br>echo<br> <br>module load toolchain/foss/2018b<br>cd scratch<br>cd further_folder<br>chmod +x my_bash_script.sh<br><br>srun --no-kill -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" ./my_bash_script.sh<br><br>wait<br>echo<br>echo Job completed at `date`</b></div><div><b><i><br></i></b></div><div>I use a bash script to launch the software that actually handles each job. This software is a bit weird, and two copies of it WILL NOT EVER play nicely if made to share a node. Hence this job launches 100 copies on 100 nodes, each of which does its own work and writes to a separate results file. I later process the results files. </div><div><br></div><div>In my scenario I want 100 jobs to run, but if a few failed and I only got 99 or 95 back, I could carry on with further processing using just those 99 or 95 result files. Getting back a few fewer results than I want is no tragedy for my type of work.</div><div><br></div><div>But the problem is that when any one node has a failure, which is not that rare when you're calling for 100 nodes simultaneously, SLURM would by default murder the WHOLE LOT of jobs and, even more confusingly, then restart a bunch of them, which leaves a very confusing pile of results files. I thought the --no-kill flag would prevent this, but instead of preventing the killing of all jobs due to a single failure, it only prevents the restart. Now I get a misleading message from the cluster reporting a good exit code when such slaughter occurs, but when I log in to the cluster I discover a grid engine massacre of my jobs, all because just one of them failed. 
</div><div><br></div><div>I understand that for interacting jobs spread over many nodes, killing all of them because of one failure can be necessary, but my jobs are strictly parallel, with no cross-interaction between them at all. Each is an entirely separate simulation with different starting parameters. I need to ensure that if one job fails and must be killed, the rest are not affected.</div><div><br></div><div>I have been advised that, because the simulation software will not run properly with more than one copy on any given node at once, I am NOT able to use "array jobs" and must stick to this sort of job, which requests 100 nodes in this way.</div><div><br></div><div>Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs because ONE (or a few) node(s) fail?</div><div><br></div><div>All my research is being put on hold by this bug, which is making it almost impossible to get large runs out of the cluster. A very large fraction of the jobs I submit have a failure on 1 of the 100 nodes, and hence that same fraction get killed on all nodes even though only one node is faulty. Too few jobs survive long enough to give me the sets of 100(ish) result files I need.</div><div><br></div><div>P.S. Just to warn you, I'm not an HPC expert or a Linux power user. I'm comfortable with Linux, command lines and technical details, but will probably need a bit more explanation around answers than someone specialised in high performance computing would.<font color="#888888"><br clear="all"><div><br></div>--<br></font><div dir="ltr"><div dir="ltr"><font color="#888888"><div><font size="2">Thank You</font></div><div>Rob</div></font><div><br></div><div>P.S. 
I'm unsure whether this is how one is supposed to post to this Google group. I sent this twice because I wasn't sure whether the earlier message got through, as I might not have been correctly subscribed at the time. Thanks.</div></div></div></div></div><div class="gmail-yj6qo"></div><div class="gmail-adL"></div></div></div><div class="gmail-hi" style="border-bottom-left-radius:1px;border-bottom-right-radius:1px;padding:0px;width:auto;background:rgb(242,242,242);margin:0px"></div></div></div><div class="gmail-ajx" style="clear:both"><br></div></div></div>