[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Robert Peck rp1060 at york.ac.uk
Mon Apr 19 16:24:08 UTC 2021


Michael, thanks for the tip. I can give that a go but don't know if it will
solve my issue.

Jess, sorry I have no knowledge of how the university handles the cluster
system in that sense.

Has anyone else been reporting bugs with the --no-kill flag recently on
your forum?

--no-kill isn't all that heavily documented, at least from what I can find
online. Are there any more specific parameters which can be used to get the
behaviour I'm after, namely "if one node dies, just ignore it and let the
others keep going"?

Would posting some cluster logs here help? I should note that I don't seem
to be able to set the verbosity level any higher than the default.

I did mention that I'd heard array jobs aren't an option because of my
"separate job on separate node" requirement, but could anyone advise me on
ways to build that requirement into an array job setup, if that would avoid
the problem I'm having where all the jobs get slaughtered when just one of
them fails?
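
In case it helps anyone point me in the right direction, this is roughly how
I imagine an array-job version would look; the --exclusive flag is only my
guess at how to keep two copies off the same node, so please treat the whole
thing as an untested sketch:

    #!/bin/bash
    #SBATCH --job-name=name_array
    #SBATCH --array=1-100            # 100 independent jobs; one failing shouldn't touch the rest
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --exclusive              # my guess at keeping each copy alone on its node
    #SBATCH --mem=2gb
    #SBATCH --time=47:00:00
    #SBATCH --output=file_output_%A_%a.log   # %A = array job id, %a = array task index

    module load toolchain/foss/2018b
    cd scratch/further_folder
    ./my_bash_script.sh "${SLURM_ARRAY_TASK_ID}"   # pass the index so each run writes its own results file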

Thank you

On Sat, 17 Apr 2021 at 02:16, Jess Arrington <jess at schedmd.com> wrote:

> Hi Robert,
>
>
> I hope your day is treating you well.
>
>
> Thank you for your posts on the Slurm user list.
>
>
> Would there be interest on your side in a Slurm support contract for
> your systems at the University of York?
>
> Sites running Slurm with support tell us that support is invaluable and a
> great return to the organization: much better system utilization thanks to
> configurations optimized by our experts (which pays for the support
> contract in and of itself), guaranteed resolutions to their issues, and no
> need to rely on in-house best-effort workarounds that get very expensive
> and can turn into complicated chaos and potentially downed systems.
>
>
> Additionally, support keeps the Slurm project alive and going strong.
>
>
> Please let me know your thoughts or if you would like me to reach out to
> another contact at University of York to chat about this further.
>
>
> Take care,
>
>
> Jess Arrington
> Executive Director: Global Sales & Alliances
> jess at schedmd.com | 801-616-7823
> 240 N 1200 E #203, Lehi, UT 84043
> https://www.schedmd.com/
>
>
> On Fri, Apr 16, 2021 at 1:42 PM Robert Peck <rp1060 at york.ac.uk> wrote:
>
>> Excuse me, I am trying to run some software on a cluster which uses the
>> SLURM scheduler. IT support at my institution have exhausted their
>> knowledge of SLURM in trying to debug this rather nasty bug with a
>> specific feature of the scheduler and suggested I try here for tips.
>>
>> I am using jobs of the form:
>>
>> #!/bin/bash
>> #SBATCH --job-name=name                  # Job name
>> #SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
>> #SBATCH --mail-user=my_email@thing.thing # Where to send mail
>> #SBATCH --mem=2gb                        # Job memory request, not hugely intensive
>> #SBATCH --time=47:00:00                  # Time limit hrs:min:sec; the sim software run from within the bash script is quite slow, extra memory can't speed it up and it can't run multi-core, hence long runs on weak nodes
>> #SBATCH --nodes=100
>> #SBATCH --ntasks=100
>> #SBATCH --cpus-per-task=1
>> #SBATCH --output=file_output_%j.log      # Standard output and error log
>> #SBATCH --account=code                   # Project account
>> #SBATCH --ntasks-per-core=1              # only 1 task per core, must not be more
>> #SBATCH --ntasks-per-node=1              # only 1 task per node, must not be more
>> #SBATCH --ntasks-per-socket=1            # guessing here but fairly sure I don't want multiple instances trying to use the same socket
>> #SBATCH --no-kill                        # supposedly prevents restart of other jobs on other nodes if one of the 100 gets a NODE_FAIL
>>
>> echo My working directory is `pwd`
>> echo Running job on host:
>> echo -e '\t'`hostname` at `date`
>> echo
>>
>> module load toolchain/foss/2018b
>> cd scratch
>> cd further_folder
>> chmod +x my_bash_script.sh
>>
>> srun --no-kill -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" ./my_bash_script.sh
>> wait
>>
>> echo
>> echo Job completed at `date`
>>
>> I use a bash script to launch my special software, which actually handles
>> each job. This software is a bit weird, and two copies of it WILL NOT EVER
>> play nicely if made to share a node. Hence this job launches 100 copies on
>> 100 nodes, each of which does its own work and writes out to a separate
>> results file. I later process the results files.
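>>
>> For context, my_bash_script.sh is roughly along these lines (heavily
>> simplified; the binary name and file names here are just placeholders for
>> illustration):
>>
>>     #!/bin/bash
>>     # Simplified sketch: each task picks its parameters from its task
>>     # index and writes to its own results file, so the 100 copies never
>>     # touch each other's output. my_sim_binary and the file names are
>>     # placeholders, not the real ones.
>>     TASK_ID="${SLURM_PROCID:-0}"           # 0..99, one per node
>>     PARAM_FILE="params_${TASK_ID}.txt"     # per-task input (placeholder)
>>     RESULT_FILE="results_${TASK_ID}.dat"   # per-task output (placeholder)
>>
>>     ./my_sim_binary --input "$PARAM_FILE" --output "$RESULT_FILE"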
>>
>> In my scenario I want 100 jobs to run, but if one or two failed and I
>> only got 99 or 95 back, then I could work fine for further processing
>> with just 99 or 95 result files. Getting back a few fewer jobs than I
>> wanted is no tragedy for my type of work.
>>
>> But the problem is that when any one node has a failure, which is not
>> that rare when you're asking for 100 nodes simultaneously, SLURM by
>> default kills the WHOLE LOT of jobs, and even more confusingly then
>> restarts a bunch of them, which ends up producing a very confusing pile
>> of results files. I thought the --no-kill flag should prevent this fault,
>> but instead of preventing the killing of all the jobs due to a single
>> failure it only prevents the restart. Now I get a misleading message from
>> the cluster telling me of a good exit code when such slaughter occurs,
>> but when I log in to the cluster I discover a massacre of my jobs, all
>> because just one of them failed.
>>
>> I understand that for interacting jobs spread over many nodes, killing
>> all of them because of one failure can be necessary, but my jobs are
>> strictly parallel, with no cross-interaction between them at all. Each is
>> an utterly separate simulation with different starting parameters. I need
>> to ensure that if one job fails and must be killed, the rest are not
>> affected.
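>>
>> One workaround I've been wondering about, though I haven't tested it and
>> may well be misreading the man pages, is to launch each copy as its own
>> backgrounded job step pinned to one node, so that a failure on one node
>> would only take down that one step:
>>
>>     # Untested sketch: one independent job step per allocated node
>>     for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
>>         srun --nodes=1 --ntasks=1 -w "$host" --no-kill \
>>              ./my_bash_script.sh &
>>     done
>>     wait   # collect whichever result files the surviving steps produced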
>>
>> I have been advised that, because the simulation software refuses to run
>> more than one copy properly on any given node at once, I am NOT able to
>> use "array jobs" and must stick to this sort of job which requests 100
>> nodes this way.
>>
>> Please can anyone suggest how to instruct SLURM not to massacre ALL my
>> jobs because ONE (or a few) node(s) fails?
>>
>> All my research is being put on hold by this bug, which is making it
>> almost impossible to get large runs out of the cluster. A very large
>> fraction of the jobs I submit hits a failure on 1 of the 100 nodes, and
>> hence that very large fraction of my jobs gets killed on all nodes even
>> though only one is faulty. Jobs don't survive long enough to give me many
>> useful sets of the 100(ish) result files I need.
>>
>> P.S. Just to warn you, I'm not an HPC expert or a Linux power user. I'm
>> comfortable with Linux, command lines and technical details, but will
>> probably need a bit more explanation around answers than someone
>> specialised in high performance computing would.
>>
>> --
>> Thank You
>> Rob
>>
>> P.S. I'm unsure whether this is how one is supposed to add forum posts to
>> this Google group. I sent this twice because I wasn't sure the earlier one
>> got through, as I might not have been correctly subscribed at that time.
>> Thanks.
>>

