[slurm-users] srun jobfarming hassle question

Ohlerich, Martin Martin.Ohlerich at lrz.de
Thu Jan 19 08:28:44 UTC 2023


Hello Björn-Helge.


Thanks for reminding me about /sys/fs for checking OOM issues. I had already lost sight of that again.

In this case, there are more steps involved (one for each srun call). I'm not sure whether cgroups handles each step separately, or just on a per-node basis. If the latter ... why do I have to specify --mem in each single srun step call at all? That seems somewhat illogical, imho. It would semantically mean: "Please tell me the resources you need so I can find reasonable slots to run your task. But don't worry! Within a node, I don't care much anyway. Do as you like, as long as the node's total memory consumption stays below the threshold ...!" ;)
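To make the per-step --mem bookkeeping above concrete, here is a minimal sketch of how a job-farm batch script might split a node's allocation across concurrent steps. All values (the 10G allocation, two steps, task names, and the --exact flag usage) are assumptions for illustration, not from the thread:

```shell
#!/bin/bash
# Sketch only: assumed per-node job allocation and step count.
TOTAL_MEM_MB=10240   # e.g. a job submitted with --mem=10G
NSTEPS=2             # number of concurrent srun steps
PER_STEP_MB=$(( TOTAL_MEM_MB / NSTEPS ))
echo "each step would request --mem=${PER_STEP_MB}M"
# In the actual batch script, each step would then be launched roughly as:
#   srun --exact -n1 --mem=${PER_STEP_MB}M ./task_a &
#   srun --exact -n1 --mem=${PER_STEP_MB}M ./task_b &
#   wait
```

The even split is just one policy; as discussed in the thread, the steps' real peak usage is rarely predictable in advance.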


Anyway, I will see soon. The user I'm currently supporting with this runs quite memory-consuming stuff (bioinformatics) 😁

Thank you again!
Cheers, Martin


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Bjørn-Helge Mevik <b.h.mevik at usit.uio.no>
Sent: Thursday, 19 January 2023 08:23
To: slurm-users at schedmd.com
Subject: Re: [slurm-users] srun jobfarming hassle question

"Ohlerich, Martin" <Martin.Ohlerich at lrz.de> writes:

> Hello Björn-Helge.
>
>
> Sigh ...
>
> First of all, of course, many thanks! This indeed helped a lot!

Good!

> b) This only works if I have to specify --mem for a task. Although
> manageable, I wonder why one needs to be that restrictive. In
> principle, in the use case outlined, one task could use a bit less
> memory, and the other may require a bit more the half of the node's
> available memory. (So clearly this isn't always predictable.) I only
> hope that in such cases the second task does not die from OOM ... (I
> will know soon, I guess.)

As I understand it, Slurm (at least cgroups) will only kill a step if it
uses more memory *in total* on a node than the job got allocated to the
node.  So if a job has 10 GiB allocated on a node, and a step runs two
tasks there, one task could use 9 GiB and the other 1 GiB without the
step being killed.

You can inspect the memory limits that are in effect in cgroups (v1) in
/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid> (usual location, at
least).
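As a quick sketch of how one might poke at that path on a compute node: the uid and jobid below are hypothetical placeholders, and the location is only the usual one, so it may differ per site:

```shell
# Hypothetical uid/jobid, used only to build the usual cgroup v1 path;
# the actual hierarchy location can vary between sites.
SLURM_UID=1000
SLURM_JOBID=123456
CGDIR="/sys/fs/cgroup/memory/slurm/uid_${SLURM_UID}/job_${SLURM_JOBID}"
echo "$CGDIR"
# On a node where the cgroup exists, these files hold the job's
# memory limit and its peak usage so far:
if [ -d "$CGDIR" ]; then
    cat "$CGDIR/memory.limit_in_bytes"
    cat "$CGDIR/memory.max_usage_in_bytes"
fi
```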

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


