[slurm-users] [ext] Re: srun jobfarming hassle question
Hagdorn, Magnus Karl Moritz
magnus.hagdorn at charite.de
Wed Jan 18 13:51:47 UTC 2023
Hi Martin,
I faced a similar problem where I had to deal with a huge taskfarm
(1000s of tasks processing 1TB of satellite data) with varying run
times and memory requirements. I ended up writing a REST server that
hands out tasks to clients. I then simply fired up an array job where
each job would request new tasks from the task server until either all
tasks were processed or it was killed when it exceeded run time or
memory. The system keeps track of completed tasks and running tasks so
that you can reschedule tasks that didn't complete. The code is
available on GitHub, and a paper describing the service is here:
https://openresearchsoftware.metajnl.com/articles/10.5334/jors.393/
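(For illustration, the client side boils down to a loop like the
minimal bash sketch below; the server address, endpoint paths and the
process_one_task helper are invented placeholders, not the actual
taskfarm API:)

    #!/bin/bash
    #SBATCH --array=1-100
    #SBATCH --time=04:00:00
    #SBATCH --mem=8G

    # Hypothetical task-server address; adjust for your setup.
    SERVER="http://taskserver.example.org:8080"

    # Placeholder for the real per-task work.
    process_one_task() {
        echo "processing task $1"
    }

    while true; do
        # Ask the server for the next unprocessed task id.
        TASK=$(curl -sf "$SERVER/task/next") || break  # server empty or unreachable
        [ -z "$TASK" ] && break                        # nothing left to do

        if process_one_task "$TASK"; then
            # Report success so the task is not rescheduled.
            curl -sf -X POST "$SERVER/task/$TASK/done"
        fi
    done

Each array element keeps pulling tasks until the server runs dry or
Slurm kills the job; unfinished tasks then go back into the pool.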
Cheers
magnus
-----Original Message-----
From: "Ohlerich, Martin" <Martin.Ohlerich at lrz.de>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
To: slurm-users at schedmd.com <slurm-users at schedmd.com>, Slurm User
Community List <slurm-users at lists.schedmd.com>
Subject: [ext] Re: [slurm-users] srun jobfarming hassle question
Date: Wed, 18 Jan 2023 13:39:30 +0000
Hello Björn-Helge.
Sigh ...
First of all, of course, many thanks! This indeed helped a lot!
Two comments:
a) Why do the interfaces of Slurm tools change? I once learned that
interfaces should be designed to be as stable as possible; otherwise,
users get frustrated and walk away.
b) This only works if I specify --mem for each task. Although
manageable, I wonder why one needs to be that restrictive. In
principle, in the use case outlined, one task could use a bit less
memory, and the other may require a bit more than half of the node's
available memory. (So clearly this isn't always predictable.) I only
hope that in such cases the second task does not die from OOM ... (I
will know soon, I guess.)
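For illustration, an asymmetric split of one node between two steps
looks roughly like this (the memory figures are invented and must of
course fit within the job's overall allocation):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=3
    #SBATCH --mem=80G

    # Two parallel steps with different per-step memory limits.
    srun --exact -n 2 --mem=20G prog1 &> log.1 &
    srun --exact -n 1 --mem=60G prog2 &> log.2 &
    wait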
Really, thank you! That was a very helpful hint!
Cheers, Martin
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
Bjørn-Helge Mevik <b.h.mevik at usit.uio.no>
Sent: Wednesday, 18 January 2023 13:49
To: slurm-users at schedmd.com
Subject: Re: [slurm-users] srun jobfarming hassle question
"Ohlerich, Martin" <Martin.Ohlerich at lrz.de> writes:
> Dear Colleagues,
>
>
> For quite some years now we have been facing issues on our clusters,
> again and again, with so-called job-farming (or task-farming)
> concepts in Slurm jobs using srun. And it bothers me that we can
> hardly help users with requests in this regard.
>
>
> From the documentation
> (https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES), it reads like
> this.
>
> ------------------------------------------->
>
> ...
>
> #SBATCH --nodes=??
>
> ...
>
> srun -N 1 -n 2 ... prog1 &> log.1 &
>
> srun -N 1 -n 1 ... prog2 &> log.2 &
Unfortunately, that part of the documentation is not quite up to date.
The semantics of srun have changed a little over the last couple of
years/Slurm versions, so today you have to use "srun --exact ...".
From "man srun" (version 21.08):
       --exact
              Allow a step access to only the resources requested for
              the step. By default, all non-GRES resources on each node
              in the step allocation will be used. This option only
              applies to step allocations.
              NOTE: Parallel steps will either be blocked or rejected
              until requested step resources are available unless
              --overlap is specified. Job resources can be held after
              the completion of an srun command while Slurm does job
              cleanup. Step epilogs and/or SPANK plugins can further
              delay the release of step resources.
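Applied to the documentation example quoted above, a minimal working
sketch of the job script would then be (prog1/prog2 stand in for the
real programs):

    #!/bin/bash
    #SBATCH --nodes=2

    # --exact restricts each step to the resources it requests, so
    # the two steps can run side by side within the allocation.
    srun --exact -N 1 -n 2 prog1 &> log.1 &
    srun --exact -N 1 -n 1 prog2 &> log.2 &
    wait  # wait for all background steps to finish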
--
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
magnus.hagdorn at charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpdesk at charite.de