[slurm-users] [ext] Re: srun jobfarming hassle question
Hagdorn, Magnus Karl Moritz
magnus.hagdorn at charite.de
Wed Jan 18 13:51:47 UTC 2023
Hi Martin,
I faced a similar problem where I had to deal with a huge taskfarm
(1000s of tasks processing 1TB of satellite data) with varying run
times and memory requirements. I ended up writing a REST server that
hands out tasks to clients. I then simply fired up an array job where
each job would request new tasks from the task server until either all
tasks were processed or it was killed when it exceeded run time or
memory. The system keeps track of completed tasks and running tasks so
that you can reschedule tasks that didn't complete. The code is
available on GitHub, and a paper describing the service is here:
https://openresearchsoftware.metajnl.com/articles/10.5334/jors.393/
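(For illustration, the client side boils down to a loop like the
minimal bash sketch below; the server address, endpoint paths and the
process_one_task helper are invented placeholders, not the actual
taskfarm API:)

    #!/bin/bash
    #SBATCH --array=1-100
    #SBATCH --time=04:00:00
    #SBATCH --mem=8G

    # Hypothetical task-server address; adjust for your setup.
    SERVER="http://taskserver.example.org:8080"

    # Placeholder for the real per-task work.
    process_one_task() {
        echo "processing task $1"
    }

    while true; do
        # Ask the server for the next unprocessed task id.
        TASK=$(curl -sf "$SERVER/task/next") || break  # server empty or unreachable
        [ -z "$TASK" ] && break                        # nothing left to do

        if process_one_task "$TASK"; then
            # Report success so the task is not rescheduled.
            curl -sf -X POST "$SERVER/task/$TASK/done"
        fi
    done

Each array element keeps pulling tasks until the server runs dry or
Slurm kills the job; unfinished tasks then go back into the pool.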
Cheers
magnus
-----Original Message-----
From: "Ohlerich, Martin" <Martin.Ohlerich at lrz.de>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
To: slurm-users at schedmd.com <slurm-users at schedmd.com>, Slurm User
Community List <slurm-users at lists.schedmd.com>
Subject: [ext] Re: [slurm-users] srun jobfarming hassle question
Date: Wed, 18 Jan 2023 13:39:30 +0000
Hello Björn-Helge.
Sigh ...
First of all, of course, many thanks! This indeed helped a lot!
Two comments:
a) Why do the interfaces of Slurm tools change? I once learned that
interfaces should be designed to be as stable as possible; otherwise,
users get frustrated and walk away.
b) This only works if I specify --mem for each task. Although
manageable, I wonder why one needs to be that restrictive. In
principle, in the use case outlined, one task could use a bit less
memory, and the other may require a bit more than half of the node's
available memory. (So clearly this isn't always predictable.) I only
hope that in such cases the second task does not die from OOM ... (I
will know soon, I guess.)
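For illustration, an asymmetric split of one node between two steps
looks roughly like this (the memory figures are invented and must of
course fit within the job's overall allocation):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=3
    #SBATCH --mem=80G

    # Two parallel steps with different per-step memory limits.
    srun --exact -n 2 --mem=20G prog1 &> log.1 &
    srun --exact -n 1 --mem=60G prog2 &> log.2 &
    wait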
Really, thank you! That was a very helpful hint!
Cheers, Martin
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
Bjørn-Helge Mevik <b.h.mevik at usit.uio.no>
Sent: Wednesday, 18 January 2023 13:49
To: slurm-users at schedmd.com
Subject: Re: [slurm-users] srun jobfarming hassle question
"Ohlerich, Martin" <Martin.Ohlerich at lrz.de> writes:
> Dear Colleagues,
>
>
> For quite some years now we have been facing issues on our clusters,
> again and again, with so-called job-farming (or task-farming)
> concepts in Slurm jobs using srun. And it bothers me that we can
> hardly help users with requests in this regard.
>
>
> From the documentation
> (https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES), it reads like
> this.
>
> ------------------------------------------->
>
> ...
>
> #SBATCH --nodes=??
>
> ...
>
> srun -N 1 -n 2 ... prog1 &> log.1 &
>
> srun -N 1 -n 1 ... prog2 &> log.2 &
Unfortunately, that part of the documentation is not quite up to date.
The semantics of srun have changed a little over the last couple of
years/Slurm versions, so today you have to use "srun --exact ...".
From "man srun" (version 21.08):
       --exact
              Allow a step access to only the resources requested for
              the step. By default, all non-GRES resources on each node
              in the step allocation will be used. This option only
              applies to step allocations.
              NOTE: Parallel steps will either be blocked or rejected
              until requested step resources are available unless
              --overlap is specified. Job resources can be held after
              the completion of an srun command while Slurm does job
              cleanup. Step epilogs and/or SPANK plugins can further
              delay the release of step resources.
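Applied to the documentation example quoted above, a minimal working
sketch of the job script would then be (prog1/prog2 stand in for the
real programs):

    #!/bin/bash
    #SBATCH --nodes=2

    # --exact restricts each step to the resources it requests, so
    # the two steps can run side by side within the allocation.
    srun --exact -N 1 -n 2 prog1 &> log.1 &
    srun --exact -N 1 -n 1 prog2 &> log.2 &
    wait  # wait for all background steps to finish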
--
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
magnus.hagdorn at charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpdesk at charite.de