Hi,

I have a very simple LD_PRELOAD library that can do this. Maybe I should see whether I can force slurmstepd to run with that LD_PRELOAD and check whether that does it.

Ultimately I am trying to get all the useful accounting metrics into a ClickHouse database. If the LD_PRELOAD on slurmstepd seems to work, I can expand it to insert the relevant row into the ClickHouse DB from the C code of the preload library.
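For anyone following along, here is a minimal sketch of the kind of preload library I mean. The file name, function names and output format are made up for illustration, and whether preloading it into slurmstepd actually captures the job's usage is still an untested assumption. It reports ru_maxrss from getrusage in a destructor that runs when the process exits normally.

```c
/* peakmem_preload.c (hypothetical file name) */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the peak RSS for `who` (RUSAGE_SELF or RUSAGE_CHILDREN),
 * or -1 on error. On Linux, ru_maxrss is reported in kilobytes. */
long peak_rss_kib(int who)
{
    struct rusage ru;

    if (getrusage(who, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Runs when the process exits normally (return from main or exit()).
 * It does not run if the process is killed by a signal or calls _exit(). */
__attribute__((destructor))
static void report_peak_rss(void)
{
    fprintf(stderr, "pid=%ld self_maxrss_kib=%ld children_maxrss_kib=%ld\n",
            (long)getpid(),
            peak_rss_kib(RUSAGE_SELF),
            peak_rss_kib(RUSAGE_CHILDREN));
}
```

Built with something like "gcc -shared -fPIC -o libpeakmem.so peakmem_preload.c" and used as "LD_PRELOAD=./libpeakmem.so ./some_program". The fprintf is where an insert into the database would go instead.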

But still... this seems like a very basic thing to do, and I am very surprised that it is so difficult to do with the standard accounting out of the box.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

From: Davide DelVento <davide.quantum@gmail.com>
Sent: 17 May 2024 01:02
To: Emyr James <emyr.james@crg.eu>
Cc: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] memory high water mark reporting

Not exactly the answer to your question (which I don't know), but if you can prefix whatever is executed with https://github.com/NCAR/peak_memusage (which also uses getrusage) or a variant, you will be able to do that.

On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users <slurm-users@lists.schedmd.com> wrote:

Hi,

We are trying out Slurm, having run Grid Engine for a long while.

In Grid Engine, the cgroup peak memory and max_rss are collected at the end of a job and recorded. It logs the information from the cgroup hierarchy and also makes a getrusage call right at the end on the parent PID of the whole job "container" before cleaning up.
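To illustrate the "measure once at the end" idea, here is a rough sketch in C. It is not Grid Engine's actual code, and the function name is hypothetical. A parent starts the job command, waits for it, and takes the resource usage that the kernel hands back at reap time via wait4, so no polling is involved.

```c
#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv as a child process, wait for it to finish, and return the
 * peak RSS reported for that child (kilobytes on Linux), or -1 on error.
 * The name run_and_get_peak_rss_kib is made up for this example. */
long run_and_get_peak_rss_kib(char *const argv[])
{
    struct rusage ru;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace ourselves with the requested command. */
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    /* Parent: reap the child and collect its usage in a single call. */
    if (wait4(pid, &status, 0, &ru) < 0)
        return -1;

    return ru.ru_maxrss;
}
```

The value is only available once the child has exited, which is exactly the point at which the accounting record is written.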

With Slurm it seems that the only way memory is recorded is by the acct gather polling. I am trying to add something to an epilog script to read memory.peak, but it looks like the cgroup hierarchy has already been destroyed by the time the epilog runs.
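For reference, reading the value itself is trivial as long as the cgroup directory still exists. Below is a small sketch in C, assuming cgroup v2, where memory.peak holds a single integer number of bytes. The function name is hypothetical, and the path to the job's cgroup depends on the site configuration, so it is left to the caller.

```c
#include <stdio.h>

/* Read a single integer value (for example memory.peak, in bytes)
 * from a cgroup v2 control file. Returns -1 if the file cannot be
 * opened or does not start with a number. */
long long read_cgroup_value(const char *path)
{
    long long value = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;

    if (fscanf(f, "%lld", &value) != 1)
        value = -1;

    fclose(f);
    return value;
}
```

The difficulty is purely one of timing: this has to run after the last task has exited but before the job's cgroup directory is removed.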

Where in the code is the cgroup hierarchy cleaned up? Is there no way to hook in so that the accounting is updated during job cleanup and peak memory usage can be logged accurately?

I can reduce the polling interval from 30s to 5s, but I don't know whether this causes a lot of overhead. In any case, polling does not seem a sensible way to get values that should simply be read once at the end of the job, triggered by an event.

Many thanks,

Emyr

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com