[slurm-users] srun --mem issue

Loris Bennett loris.bennett at fu-berlin.de
Fri Dec 9 07:12:33 UTC 2022


Ryan Novosielski <novosirj at rutgers.edu> writes:

>  On Dec 8, 2022, at 03:57, Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>
>  Loris Bennett <loris.bennett at fu-berlin.de> writes:
>
>  Moshe Mergy <moshe.mergy at weizmann.ac.il> writes:
>
>  Hi Sandor
>
>  I personally block "--mem=0" requests in job_submit.lua (Slurm 20.02):
>
>   if (job_desc.min_mem_per_node == 0  or  job_desc.min_mem_per_cpu == 0) then
>         slurm.log_info("%s: ERROR: unlimited memory requested", log_prefix) 
>         slurm.log_info("%s: ERROR: job %s from user %s rejected because of an invalid (unlimited) memory request.", log_prefix, job_desc.name, job_desc.user_name) 
>         slurm.log_user("Job rejected because of an invalid memory request.") 
>         return slurm.ERROR
>    end
>
>  What happens if somebody explicitly requests all the memory, so in
>  Sandor's case --mem=500G?
>
>  Maybe there is a better or nicer solution...
>
>  Can't you just use account and QOS limits:
>
>   https://slurm.schedmd.com/resource_limits.html
>
>  ?
>
>  And anyway, what is the use case for preventing someone from using all
>  the memory? In our case, if someone really needs all the memory, they
>  should be able to have it.
>
>  However, I do have a chronic problem with users requesting too much
>  memory. My approach has been to try to get people to use 'seff' to see
>  what resources their jobs in fact need.  In addition each month we
>  generate a graphical summary of 'seff' data for each user, like the one
>  shown here
>
>   https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik
>
>  and automatically send an email to those with a large percentage of
>  resource-inefficient jobs telling them to look at their graphs and
>  correct their resource requirements for future jobs.
>
>  Cheers,
>
>  Loris
>
> I may be wrong about this, but aren’t people penalized in their fair-share score for the memory they request, so that they effectively “pay” for memory even if
> they don’t use it? They’re also penalized by likely having to wait longer to have their request satisfied if they specify more than they need. That’s generally what I
> used to tell people.

You are right and I tell my users exactly the same things.  However, on
our system, memory is normally the limiting factor, so if memory is
requested but not used, that reduces the throughput for everyone.
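For what it's worth, the logic behind the monthly mails I mentioned is quite simple.  A minimal Python sketch (the job-record format, the 25% memory-efficiency cut-off and the 50% flagging threshold are all made-up illustrations, not our actual values):

```python
# Sketch: flag users whose share of memory-inefficient jobs is large.
# Record format and both thresholds are illustrative assumptions.

INEFFICIENT_BELOW = 0.25  # job counts as inefficient if mem used/requested < 25%
FLAG_ABOVE = 0.50         # mail a user if > 50% of their jobs are inefficient

def memory_efficiency(job):
    """Fraction of requested memory actually used (as 'seff' would report)."""
    return job["max_rss_mb"] / job["req_mem_mb"]

def users_to_mail(jobs):
    """Return users whose fraction of inefficient jobs exceeds FLAG_ABOVE."""
    per_user = {}
    for job in jobs:
        total, bad = per_user.get(job["user"], (0, 0))
        if memory_efficiency(job) < INEFFICIENT_BELOW:
            bad += 1
        per_user[job["user"]] = (total + 1, bad)
    return sorted(user for user, (total, bad) in per_user.items()
                  if bad / total > FLAG_ABOVE)

jobs = [
    {"user": "alice", "req_mem_mb": 16000, "max_rss_mb": 1500},  # ~9% used
    {"user": "alice", "req_mem_mb": 8000,  "max_rss_mb": 6000},  # 75% used
    {"user": "alice", "req_mem_mb": 32000, "max_rss_mb": 2000},  # ~6% used
    {"user": "bob",   "req_mem_mb": 4000,  "max_rss_mb": 3500},  # ~88% used
]
print(users_to_mail(jobs))  # alice has 2/3 inefficient jobs -> ['alice']
```

In practice the job data come from the accounting database rather than hand-built dictionaries, of course.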

> I also make quite a bit of use of Ole Holm Nielsen’s pestat, to catch jobs that are not running efficiently, but that’s not automated, just a way to review.
>
> https://github.com/OleHolmNielsen/Slurm_tools/blob/master/pestat/pestat

I also use pestat, but with only 170 nodes in our main partition, even
using the -f option to show only flagged problem nodes, I still get 146
nodes today, which can be a bit overwhelming.  Maybe I need to tweak the
thresholds a bit.
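Tightening things on my side might look something like this.  A small Python sketch that filters node records in the same spirit as pestat's flagging (the field names and both cut-offs are assumptions for illustration, not pestat's actual internals or defaults):

```python
# Sketch: flag only nodes whose CPU load deviates badly from the number of
# allocated cores, or whose free memory is nearly exhausted.  Field names
# and thresholds are illustrative assumptions, not pestat's defaults.

LOAD_SLACK = 0.5        # allow |load - allocated cores| up to 50% of cores
MIN_FREE_MEM_MB = 1024  # flag nodes with less than 1 GiB free

def is_problem_node(node):
    alloc = node["alloc_cpus"]
    load_off = abs(node["load"] - alloc) > max(1.0, LOAD_SLACK * alloc)
    low_mem = node["free_mem_mb"] < MIN_FREE_MEM_MB
    return load_off or low_mem

nodes = [
    {"name": "node001", "alloc_cpus": 32, "load": 31.8, "free_mem_mb": 120000},  # healthy
    {"name": "node002", "alloc_cpus": 32, "load": 2.1,  "free_mem_mb": 90000},   # allocated but idle
    {"name": "node003", "alloc_cpus": 16, "load": 15.9, "free_mem_mb": 300},     # memory nearly gone
]
flagged = [n["name"] for n in nodes if is_problem_node(n)]
print(flagged)  # -> ['node002', 'node003']
```

With looser slack values a node like node002 would still be caught, while merely jittery load readings would not be.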

> --
> #BlackLivesMatter
> ____
> || \\UTGERS,   |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin


