[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Tue Apr 20 05:44:50 UTC 2021

Hi, Robert,

Robert Peck <rp1060 at york.ac.uk> writes:

> Michael, thanks for the tip. I can give that a go but don't know if it will solve my issue.
>
> Jess, sorry I have no knowledge of how the university handles the cluster system in that sense.
>
> Has anyone else been reporting bugs with the --no-kill flag recently on your forum?
>
> --no-kill isn't all that heavily documented, as far as I can tell from searching online. Are there any more specific parameters which can be used to create the desired behaviour of "if one node dies just ignore it and let the others keep going"?
>
> Would posting some cluster logs here help? Although I have found that the verbosity level doesn't seem to be able to be set higher than the default?
>
> I did mention that I'd heard array jobs aren't ok due to my "separate job on separate node" requirement, but wondered if anyone could advise me of ways to build that need into an array job setup? If that would avoid the problem I'm having where all jobs get slaughtered when just one of them fails?

I think that, rather than trying to bend Slurm to do what you think you
want it to do, you should really rethink your "separate job on separate
node" requirement.  It is just not normal behaviour for jobs to be
failing in the way you describe.  You may want to look at issues like
how many processes a single job is actually starting compared with the
number of cores you are requesting, or what sort of files the program
generates.  In the latter case, if a file name is not sufficiently
unique, files written to the same local file system by different jobs on
the same node may clobber each other.
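
To illustrate the file-name point, here is a minimal sketch, assuming the
results are written from a bash wrapper.  It builds a name from the job
ID, the task rank and the host name.  The "result_" prefix, the ".dat"
suffix and the function name unique_result_name are made up for
illustration; SLURM_JOB_ID and SLURM_PROCID are variables Slurm sets for
tasks started by srun.

```shell
#!/bin/bash
# Sketch: build a per-job, per-task result file name so that two runs
# landing on the same node cannot overwrite each other's output.
# The fallbacks ("nojob", 0) only exist so the snippet can be tried
# outside of Slurm; the prefix and suffix are hypothetical.

unique_result_name() {
    local job_id="${SLURM_JOB_ID:-nojob}"
    local task_id="${SLURM_PROCID:-0}"
    local host
    host="$(uname -n)"    # node name, same information as `hostname`
    printf 'result_%s_%s_%s.dat\n' "$job_id" "$task_id" "$host"
}

unique_result_name
```

If your program chooses its own output names, the same idea can often be
applied by giving each run its own working directory instead.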

Maybe people on the list could help you if you explained what your program is and a
bit about how it works.
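
On the array-job question quoted above: if the one-copy-per-node
restriction really does turn out to be necessary, an array job can
probably still be used, since each array task is scheduled as an
independent job and a failure of one does not affect the others.  The
following is only a sketch.  It assumes that your site allows
--exclusive and that my_bash_script.sh can take the task index as an
argument, neither of which I know to be true.

```shell
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --array=1-100                    # 100 independent array tasks
#SBATCH --nodes=1                        # each task runs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --exclusive                      # ask that no other job shares the node
#SBATCH --mem=2gb
#SBATCH --time=47:00:00
#SBATCH --output=file_output_%A_%a.log   # %A = array job ID, %a = task index

# Outside of Slurm the variable is unset, so fall back to 1 for a dry run.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
echo "Array task ${task_id} on host $(uname -n)"

# my_bash_script.sh is the launcher from the original post; passing the
# task index as an argument is an assumption about how it picks parameters.
if [ -x ./my_bash_script.sh ]; then
    srun ./my_bash_script.sh "${task_id}"
else
    echo "my_bash_script.sh not found; nothing to run" >&2
fi
```

Bear in mind that --exclusive reserves a whole node for a single core's
worth of work, so it is worth checking first whether the restriction is
real.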

Cheers,

Loris

PS: It is slightly confusing to use the phrase "grid engine" in
connection with Slurm, as Oracle's (previously Sun's) Grid Engine is a  
different piece of software, albeit one with a similar purpose, namely
that of a resource manager.

> Thank you
>
> On Sat, 17 Apr 2021 at 02:16, Jess Arrington <jess at schedmd.com> wrote:
>
>  Hi Robert,
>
>  I hope your day is treating you well.
>
>  Thank you for your posts on the Slurm user list.  
>
>  Would there be interest on your side to see a Slurm support contract for your systems at University of York?
>
>  Sites running Slurm with support give us feedback that support is invaluable and a great return back to the organization with much better system utilization with optimized configs by our experts (which pays for the support contract in and of itself), guaranteed resolutions to their issues and their sites not having to rely on in-house best-effort support hacks that get very expensive and turn into complicated chaos and potential down systems.
>
>  Additionally, support keeps the Slurm project alive and going strong.
>
>  Please let me know your thoughts or if you would like me to reach out to another contact at University of York to chat about this further.
>
>  Take care,
>
>     Jess Arrington  
>       
>      Executive Director:Global Sales & Alliances    
>      jess at schedmd.com |  801-616-7823    
>      240 N 1200 E #203 Lehi, UT 84043    
>
>  On Fri, Apr 16, 2021 at 1:42 PM Robert Peck <rp1060 at york.ac.uk> wrote:
>
>  Excuse me, I am trying to run some software on a cluster which uses the SLURM grid engine. IT support at my institution have exhausted their knowledge of SLURM in trying to debug this rather nasty bug with a specific feature of the
>  grid engine and suggested I try here for tips.
>
>  I am using jobs of the form:
>
>  #!/bin/bash
>  #SBATCH --job-name=name       # Job name
>  #SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
>  #SBATCH --mail-user=my_email at thing.thing     # Where to send mail  
>
>  #SBATCH --mem=2gb                        # Job memory request, not hugely intensive
>  #SBATCH --time=47:00:00                  # Time limit hrs:min:sec, the sim software being run from within the bash script is quite slow, extra memory can't speed it up and it can't run multi-core, hence long runs on weak nodes
>
>  #SBATCH --nodes=100
>  #SBATCH --ntasks=100
>  #SBATCH --cpus-per-task=1
>
>  #SBATCH --output=file_output_%j.log        # Standard output and error log
>  #SBATCH --account=code       # Project account
>  #SBATCH --ntasks-per-core=1 #only 1 task per core, must not be more
>  #SBATCH --ntasks-per-node=1 #only 1 task per node, must not be more
>  #SBATCH --ntasks-per-socket=1 #guessing here but fairly sure I don't want multiple instances trying to use same socket
>
>  #SBATCH --no-kill # supposedly prevents restart of other jobs on other nodes if one of the 100 gets a NODE_FAIL
>
>  echo My working directory is `pwd`
>  echo Running job on host:
>  echo -e '\t'`hostname` at `date`
>  echo
>   
>  module load toolchain/foss/2018b
>  cd scratch
>  cd further_folder
>  chmod +x my_bash_script.sh
>
>  srun --no-kill -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" ./my_bash_script.sh
>
>  wait
>  echo
>  echo Job completed at `date`
>
>  I use a bash script to launch my special software and stuff which actually handles each job, this software is a bit weird and two copies of it WILL NOT EVER play nicely if made to share a node. Hence this job acts to launch 100 copies on
>  100 nodes, each of which does its own stuff and writes out to a separate results file. I later process the results files.
>
>  In my scenario I want 100 jobs to run, but if one or two failed and I only got 99 or 95 back then I could work fine for further processing with just 99 or 95 result files. Getting back a few fewer jobs than I want is no tragedy for my type of
>  work.
>
>  But the problem is that when any one node has a failure, not that rare when you're calling for 100 nodes simultaneously, SLURM would by default murder the WHOLE LOT of jobs, and even more confusingly then restart a bunch of them
>  which ends up with a very confusing pile of results files. I thought the --no-kill flag should prevent this fault, but instead of preventing the killing of all jobs due to a single failure it only prevents the restart. Now I get a misleading
>  message from the cluster telling me of a good exit code when such slaughter occurs, but when I log in to the cluster I discover a grid engine massacre of my jobs, all because just one of them failed. 
>
>  I understand that for interacting jobs on many nodes, killing all of them because of one failure can be necessary, but my jobs are strictly parallel, no cross-interaction between them at all. Each is an utterly separate simulation with
>  different starting parameters. I need to ensure that if one job fails and must be killed then the rest are not affected.
>
>  I have been advised that due to the simulation software being such as to refuse to run >1 copy properly on any given node at once I am NOT able to use "array jobs" and must stick to this sort of job which requests 100 nodes this way.
>
>  Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs because ONE (or a few) node(s) fails?
>
>  All my research is being put on hold by this bug which is making getting large runs out of the cluster almost impossible, a very large fraction of the jobs I submit has a failure on 1 of the 100 nodes and hence that very large fraction of
>  my jobs get killed on all nodes even though only one is faulty. I don't get jobs lasting to give me many useful sets of the 100(ish) result files I need.
>
>  P.S. just to warn you, I'm not an HPC expert or a linux power user. I'm comfortable with linux and with command lines and technical details but will probably need a bit more explanation around answers than someone specialised in high
>  performance computing would.
>
>  --
>  Thank You
>  Rob
>
>  P.S. unsure of whether this is how one is supposed to add forum posts to this google group, sent twice as I wasn't sure if the earlier one got through as I might not have been correctly subscribed at that time, thanks