Hi,
We are trying out Slurm, having been running Grid Engine for a long while. In Grid Engine, the cgroup peak memory and max_rss are generated at the end of a job and recorded: it logs the information from the cgroup hierarchy and also does a getrusage call right at the end on the parent pid of the whole job "container" before cleaning up.

With Slurm, it seems that the only way memory is recorded is by the acct_gather polling. I am trying to add something in an epilog script to read memory.peak, but it looks like the cgroup hierarchy has been destroyed by the time the epilog runs.

Where in the code is the cgroup hierarchy cleaned up? Is there no way to add something so that the accounting is updated during the job cleanup process, so that peak memory usage can be accurately logged?

I can reduce the polling interval from 30s to 5s, but I don't know whether this causes a lot of overhead, and in any case it does not seem a sensible way to get values that should just be determined right at the end by an event rather than by polling.
Many thanks,
Emyr
Not exactly the answer to your question (which I don't know), but if you can prefix whatever is executed with this https://github.com/NCAR/peak_memusage (which also uses getrusage), or a variant of it, you will be able to do that.
Hi,
I have a very simple LD_PRELOAD library that can do this. Maybe I should see if I can force slurmstepd to be run with that LD_PRELOAD and then check whether that does it.

Ultimately I am trying to get all the useful accounting metrics into a ClickHouse database. If the LD_PRELOAD on slurmstepd works, I can then expand it to insert the relevant row into the ClickHouse DB from the C code of the preload library.

But still... this seems like a very basic thing to do, and I am very surprised that it appears so difficult to achieve with the standard accounting recording out of the box.
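For anyone interested, the core of such a preload library is only a few lines. The sketch below is just the general idea rather than the exact code I am using, and assumes Linux, where getrusage() reports ru_maxrss in kilobytes:

/* peak_rss_preload.c - minimal LD_PRELOAD sketch: print the peak RSS of a
 * process when it exits, via getrusage().
 * Build: gcc -shared -fPIC -o peak_rss_preload.so peak_rss_preload.c
 * Use:   LD_PRELOAD=./peak_rss_preload.so <command> */
#include <stdio.h>
#include <sys/resource.h>

__attribute__((destructor))
static void report_peak_rss(void)
{
    struct rusage ru;

    /* RUSAGE_SELF covers this process; RUSAGE_CHILDREN only covers
     * children that have already been waited for. */
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        fprintf(stderr, "peak RSS (self): %ld kB\n", ru.ru_maxrss);
    if (getrusage(RUSAGE_CHILDREN, &ru) == 0)
        fprintf(stderr, "peak RSS (children): %ld kB\n", ru.ru_maxrss);
}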
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
Looking here:
https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS
It looks like it's possible to hook something in at the right place using the slurm_spank_task_exit or slurm_spank_exit callbacks. Does anyone have any experience or examples of doing this? Is there any more documentation available on this functionality?
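In case it helps the discussion, below is a minimal sketch of what such a plugin could look like. It is untested and the cgroup path is an assumption (it depends on how slurmstepd lays out the cgroup v2 hierarchy), but it shows where the hook would sit:

/* spank_peakmem.c - hypothetical SPANK plugin sketch: on task exit, read
 * memory.peak from the step's cgroup (v2) and log it. The cgroup path
 * below is an assumption and will differ between sites and layouts. */
#include <stdint.h>
#include <stdio.h>
#include <slurm/spank.h>

SPANK_PLUGIN(peakmem, 1);

int slurm_spank_task_exit(spank_t sp, int ac, char **av)
{
    uint32_t jobid = 0, stepid = 0;
    char path[512], buf[64];
    FILE *fp;

    if (!spank_remote(sp))      /* only act inside slurmstepd */
        return ESPANK_SUCCESS;

    spank_get_item(sp, S_JOB_ID, &jobid);
    spank_get_item(sp, S_JOB_STEPID, &stepid);

    /* assumed path; adjust to match your cgroup/v2 hierarchy */
    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/system.slice/slurmstepd.scope/job_%u/step_%u/memory.peak",
             jobid, stepid);

    if ((fp = fopen(path, "r")) == NULL)
        return ESPANK_SUCCESS;  /* hierarchy may already be gone */
    if (fgets(buf, sizeof(buf), fp))
        slurm_info("job %u step %u memory.peak: %s", jobid, stepid, buf);
    fclose(fp);
    return ESPANK_SUCCESS;
}

It would be loaded via plugstack.conf; whether slurm_spank_task_exit runs before the step cgroup is torn down is exactly the kind of thing that would need verifying.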
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
Hi,
We have had similar questions from users about how best to find out the peak memory of a job: they may run a job and get a not very useful value for sacct fields such as MaxRSS, because Slurm did not happen to poll at the moment of maximum memory usage.

With cgroup v1, from what I can find online, memory.max_usage_in_bytes takes caches into account, so it can vary with how much I/O is done, whilst total_rss in memory.stat looks more useful. Maybe memory.peak is clearer?
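For what it's worth, total_rss has to be parsed out of memory.stat, whereas max_usage_in_bytes is a single value. A rough sketch of reading total_rss, with the cgroup v1 directory passed in (the mount layout below is an assumption), could look like:

/* Sketch: read total_rss from a cgroup v1 memory.stat file. Unlike
 * memory.max_usage_in_bytes, this excludes page cache. */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

static int read_total_rss(const char *cgroup_dir, uint64_t *rss)
{
    char path[512], key[64];
    uint64_t val;
    FILE *fp;
    int rc = -1;

    snprintf(path, sizeof(path), "%s/memory.stat", cgroup_dir);
    if ((fp = fopen(path, "r")) == NULL)
        return -1;
    while (fscanf(fp, "%63s %" SCNu64, key, &val) == 2) {
        if (strcmp(key, "total_rss") == 0) {
            *rss = val;
            rc = 0;
            break;
        }
    }
    fclose(fp);
    return rc;
}

e.g. read_total_rss("/sys/fs/cgroup/memory/slurm/uid_1000/job_12345", &rss) against a hypothetical job cgroup directory.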
It's not clear in the documentation how a user should use the sacct values to infer the actual usage of their jobs so they can correct their behaviour in future submissions.

I would be keen to see improvements in high water mark reporting. I noticed that the jobacct_gather plugin developer documentation was deleted back in Slurm 21.08 – a SPANK plugin does possibly look like the way to go. It also seems to be a common problem across technologies, e.g. https://github.com/google/cadvisor/issues/3286
Tom
Siwmae Thomas,
I grepped for memory.peak in the source and it's not there. memory.current is there and is used in src/plugins/cgroup/v2/cgroup_v2.c.

Adding the ability to read memory.peak in this source file seems like something that should be done?

Should extern cgroup_acct_t *cgroup_p_task_get_acct_data(uint32_t task_id) be modified to include looking at memory.peak?

This may mean needing to modify the cgroup_acct_t struct in interfaces/cgroup.h to include it?
typedef struct {
    uint64_t usec;
    uint64_t ssec;
    uint64_t total_rss;
    uint64_t max_rss;
    uint64_t total_pgmajfault;
    uint64_t total_vmem;
} cgroup_acct_t;
Presumably, with the polling method, it keeps looking at the current value and then keeps track of the maximum of those values. But the actual maximum may occur in between two polls, so it would never see the true peak. At least by also reading memory.peak there is a chance to get closer to the real value with the polling method, even if this is not optimal. Ideally it should do this during task cleanup as well as at the poll interval.
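To make that concrete, the extra read in cgroup_p_task_get_acct_data() might look roughly like the fragment below, using the proposed max_rss field from the struct above. This is only a sketch, not a tested patch, and the surrounding variable names are assumptions:

/* Hypothetical fragment for src/plugins/cgroup/v2/cgroup_v2.c, sitting
 * alongside the existing memory.current read; names are assumptions. */
char *memory_peak = NULL;
size_t peak_sz = 0;

if (common_cgroup_get_param(&task_cg_info->task_cg, "memory.peak",
                            &memory_peak, &peak_sz) != SLURM_SUCCESS) {
    log_flag(CGROUP, "Cannot read task %d memory.peak file", task_id);
} else {
    /* record the kernel-tracked high water mark for this task */
    stats->max_rss = strtoull(memory_peak, NULL, 10);
    xfree(memory_peak);
}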
As an aside, I also did a grep for getrusage and it doesn't seem to be used at all. I see that it is looking at /proc/%d/stat, so maybe this is where it's getting the maxrss for non-cgroup accounting. Still, getrusage would seem to be the more obvious choice for this?
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
A bit more digging....
The cgroup code seems to communicate back the values it finds in src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c:
prec->tres_data[TRES_ARRAY_MEM].size_read = cgroup_acct_data->total_rss;
I can't find anywhere in the code where it keeps track of the maximum value of total_rss seen, so I can only conclude that it must be done in the database when slurmdbd inserts the values, rather than in the Slurm binaries themselves.

So this does seem to suggest that the peak value accounted at the end is just the maximum of the memory.current values seen across all the polls, even though much higher transient values may have occurred in between polls; those would be captured by memory.peak, but Slurm never sees them.

Can anyone more familiar with the code than me corroborate this?

Presumably non-cgroup accounting has a similar issue? I.e. it polls RSS and the accounting DB reports the highest value seen, even though calling getrusage and checking ru_maxrss should be done too?
Many thanks,
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
I changed the following in src/plugins/cgroup/v2/cgroup_v2.c
if (common_cgroup_get_param(&task_cg_info->task_cg, "memory.current",
                            &memory_current, &tmp_sz) != SLURM_SUCCESS) {
    if (task_id == task_special_id)
        log_flag(CGROUP, "Cannot read task_special memory.peak file");
    else
        log_flag(CGROUP, "Cannot read task %d memory.peak file", task_id);
}
to
if (common_cgroup_get_param(&task_cg_info->task_cg, "memory.peak",
                            &memory_current, &tmp_sz) != SLURM_SUCCESS) {
    if (task_id == task_special_id)
        log_flag(CGROUP, "Cannot read task_special memory.peak file");
    else
        log_flag(CGROUP, "Cannot read task %d memory.peak file", task_id);
}
and am using a polling interval of 5s. The values I get when adding this to the end of a batch script:
dir=$(awk -F: '{print $NF}' /proc/self/cgroup)
echo [$(date +"%Y-%m-%d %H:%M:%S")] peak memory is `cat /sys/fs/cgroup$dir/memory.peak`
echo [$(date +"%Y-%m-%d %H:%M:%S")] finished on $(hostname)
compared to what is in MaxRSS from sacct seem to be spot on, for my test jobs at least. I guess this will do for now, but it still feels very unsatisfactory to be using polling for this instead of having the code trigger the relevant reads on job cleanup.

The downside of this "quick fix" is that during a job run, sstat will now report the maximum memory seen so far rather than the current usage. Personally I think that is not particularly useful anyway, and if you really need to track memory usage while a job is running, the LD_PRELOAD methods mentioned previously are better.
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
We have a pretty ugly patch that calls out to a script from common_cgroup_delete() in src/plugins/cgroup/common/cgroup_common.c. It checks that it's the job cgroup being deleted ("/job_*" as the path). The script collects the data and stores it elsewhere.
It's a really ugly way of doing it and I wish there was something better. It seems like this could be a good spot for a SPANK hook.
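For illustration, the shape of such a hook is roughly as follows. This is a sketch of the idea rather than our actual patch, and the path test, helper name and script location are all assumptions:

/* Sketch of a call-out placed in common_cgroup_delete() in
 * src/plugins/cgroup/common/cgroup_common.c, run just before the cgroup
 * directory is removed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void maybe_harvest_cgroup(const char *cg_path)
{
    char cmd[1024];

    /* only act on the job-level cgroup, e.g. ".../job_12345" */
    if (strstr(cg_path, "/job_") == NULL)
        return;

    /* hypothetical site script that reads memory.peak and friends from
     * cg_path and stores the values elsewhere */
    snprintf(cmd, sizeof(cmd), "/etc/slurm/harvest_cgroup.sh %s", cg_path);
    if (system(cmd) != 0)
        fprintf(stderr, "cgroup harvest script failed for %s\n", cg_path);
}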
Ryan
Hi,
I came to the same conclusion and spotted similar bits of the code that could be changed to get what is required. Without a new variable it will be tricky to implement properly, due to the way the existing variables are used and defined. Maybe a PeakMem field in the Slurm accounting database to capture this is what is required, if there is enough interest in the feature.

N.B. I got confused with the memory counters – total_rss is already used; max_usage_in_bytes in cgroup v1 is the only high water mark counter (similar to memory.peak in cgroup v2).

Maybe the only proper way is to monitor this sort of thing outside of Slurm, with tools such as XDMOD.
Tom
From: Emyr James via slurm-users slurm-users@lists.schedmd.com Date: Monday, 20 May 2024 at 16:46 To: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com Subject: [slurm-users] Re: memory high water mark reporting
I changed the following in src/plugins/cgroup/v2/cgroup_v2.c
if (common_cgroup_get_param(&task_cg_info->task_cg, "memory.current", &memory_current, &tmp_sz) != SLURM_SUCCESS) { if (task_id == task_special_id) log_flag(CGROUP, "Cannot read task_special memory.peak file"); else log_flag(CGROUP, "Cannot read task %d memory.peak file", task_id); }
to
if (common_cgroup_get_param(&task_cg_info->task_cg, "memory.peak", &memory_current, &tmp_sz) != SLURM_SUCCESS) { if (task_id == task_special_id) log_flag(CGROUP, "Cannot read task_special memory.peak file"); else log_flag(CGROUP, "Cannot read task %d memory.peak file", task_id); }
and am using a polling interval of 5s. the values I get when adding this to the end of a batch script :
dir=$(awk -F: '{print $NF}' /proc/self/cgroup) echo [$(date +"%Y-%m-%d %H:%M:%S")] peak memory is `cat /sys/fs/cgroup$dir/memory.peak` echo [$(date +"%Y-%m-%d %H:%M:%S")] finished on $(hostname)
compared to what is in maxrss from sacct seem to be spot on for my test jobs at least. I guess this will do for now but it still feels very unsatisfactory to be using polling for this instead of having the code trigger the relevant stuff on job cleanup.
The downside of this "quick fix" is that now during a job run, sstat will report the max memory seen so far rather than the current usage. Personally I think this is not particularly useful anyway and if you really need to track memory usage as a job is running the LD_PRELOAD methods mentioned previously are better.
Emyr James Head of Scientific IT CRG - Centre for Genomic Regulation
________________________________ From: Emyr James emyr.james@crg.eu Sent: 20 May 2024 14:30 To: Thomas Green - Staff in University IT, Research Technologies / Staff Technoleg Gwybodaeth, Technolegau Ymchwil GreenT10@cardiff.ac.uk; Davide DelVento davide.quantum@gmail.com; Emyr James emyr.james@crg.eu Cc: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Re: memory high water mark reporting
A bit more digging....
the cgroups stuff seems to be communicating back the values it finds in src/plugins/jobacct_gather/cgroup/jobacct_gather_cgroup.c
prec->tres_data[TRES_ARRAY_MEM].size_read = cgroup_acct_data->total_rss;
I can't find anywhere in the code where it seems to be keeping track of the max value of total_rss seen so I can only conclude that it must be done in the database when slurmdbd puts in the values rather than being done in the slurm binaries themselves.
So this does seem to suggest that the peak value that is accounted at the end is just the maximum of the memory.current values that it sees over all the polls, even though there may be much higher transient values that may have occured in between the polls which would be taken into account by memory.peak but slurm never sees these values.
Can anyone more familiar with the code than me corrobarate this ?
Presumably non-cgroup accounting has a similar issue ? I.e. it polls rss and then the accounting db reports the highest seen even though using getrusage and checking ru_maxrss should be done too ?
Many thanks,
Emyr James Head of Scientific IT CRG - Centre for Genomic Regulation
________________________________ From: Emyr James via slurm-users slurm-users@lists.schedmd.com Sent: 20 May 2024 13:56 To: Thomas Green - Staff in University IT, Research Technologies / Staff Technoleg Gwybodaeth, Technolegau Ymchwil GreenT10@cardiff.ac.uk; Davide DelVento davide.quantum@gmail.com Cc: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com Subject: [slurm-users] Re: memory high water mark reporting
Siwmae Thomas,
I grepped for memory.peak in the source and it's not there. memory.current is there and is used in src/plugins/cgroup/v2/cgroup_v2.c
Adding the ability to get memory.peak in this source file seems to be something that should be done?
Should extern cgroup_acct_t *cgroup_p_task_get_acct_data(uint32_t task_id) be modified to include looking at memory.peak ?
This may mean needing to modify the acct_stat struct in interfaces/cgroup.h to include it ?
typedef struct { uint64_t usec; uint64_t ssec; uint64_t total_rss; uint64_t mas_rss; uint64_t total_pgmajfault; uint64_t total_vmem; } cgroup_acct_t;
Presumably, with the polling method, it keeps looking at the current value and then keeps track of the max of these values. But the actual max may occur in between 2 polls so it would never see the true max value. At least by also reading memory.peak there is a chance to get closer to the real value with the polling method even if this not optimal. Ideally it should run this during cleanup of tasks as well as at the poll interval.
As an aside, I also did a grep for getrusage and it doesn't seem to be used at all. I see that it is looking at /proc/%d/stat so maybe this is where its getting the maxrss for non cgroup accounting. Still, getrusage would seem to be the more obvious choice for this ?
Emyr James Head of Scientific IT CRG - Centre for Genomic Regulation
________________________________ From: Thomas Green - Staff in University IT, Research Technologies / Staff Technoleg Gwybodaeth, Technolegau Ymchwil GreenT10@cardiff.ac.uk Sent: 20 May 2024 13:08 To: Emyr James emyr.james@crg.eu; Davide DelVento davide.quantum@gmail.com Cc: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Re: memory high water mark reporting
Hi,
We have had similar questions from users regarding how best to find out the high memory peak of a job since they may run a job and get a not very useful value for variables in sacct such as the MaxRSS since Slurm didn’t poll during the use of its maximum memory usage.
With cgroup v1, from what I can see online, memory.max_usage_in_bytes takes caches into account and so can vary with how much I/O the job does, whilst total_rss in memory.stat looks more useful, maybe. Perhaps cgroup v2's memory.peak is clearer?
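For illustration only (the file layout is the standard v1 memory controller and the parsing is my own, not anything Slurm does): memory.max_usage_in_bytes is a single number, whereas total_rss has to be picked out of the key/value lines of memory.stat, so the two can be compared directly for a job's cgroup directory:

/* Sketch: compare cgroup v1's high-water mark (which includes page cache)
 * with the total_rss line from memory.stat for the same cgroup directory. */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

static int v1_memory_counters(const char *cg_dir,
                              uint64_t *max_usage, uint64_t *total_rss)
{
    char path[4096], key[64];
    uint64_t val;
    FILE *fp;

    snprintf(path, sizeof(path), "%s/memory.max_usage_in_bytes", cg_dir);
    if (!(fp = fopen(path, "r")))
        return -1;
    if (fscanf(fp, "%" SCNu64, max_usage) != 1)
        *max_usage = 0;
    fclose(fp);

    snprintf(path, sizeof(path), "%s/memory.stat", cg_dir);
    if (!(fp = fopen(path, "r")))
        return -1;
    *total_rss = 0;
    while (fscanf(fp, "%63s %" SCNu64, key, &val) == 2) {
        if (!strcmp(key, "total_rss")) {
            *total_rss = val;
            break;
        }
    }
    fclose(fp);
    return 0;
}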
It's not clear in the documentation how a user should interpret the sacct values to infer the actual usage of jobs and correct their behaviour in future submissions.
I would be keen to see improvements in high-water-mark reporting. I noticed that the jobacct_gather plugin documentation was deleted back in Slurm 21.08; a SPANK plugin possibly does look like the way to go. It also seems to be a common problem across technologies, e.g. https://github.com/google/cadvisor/issues/3286
Tom
From: Emyr James via slurm-users slurm-users@lists.schedmd.com Date: Monday, 20 May 2024 at 10:50 To: Davide DelVento davide.quantum@gmail.com, Emyr James emyr.james@crg.eu Cc: slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com Subject: [slurm-users] Re: memory high water mark reporting
Looking here:
https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS
It looks like it's possible to hook something in at the right place using the slurm_spank_task_exit or slurm_spank_exit callbacks. Does anyone have any experience or examples of doing this? Is there any more documentation available on this functionality?
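To make the question concrete, this is the bare-bones shape I have in mind. It is completely untested, the cgroup path is a guess that would need to match the node's actual v2 layout, and whether the step's cgroup still exists when slurm_spank_exit runs is exactly the open question:

/* peakmem_spank.c - build with: gcc -shared -fPIC -o peakmem.so peakmem_spank.c
 * Logs memory.peak for the step as slurmstepd tears it down. */
#include <inttypes.h>
#include <stdio.h>
#include <slurm/spank.h>

SPANK_PLUGIN(peakmem, 1);

int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    uint32_t job_id = 0, step_id = 0;
    char path[4096];
    uint64_t peak = 0;
    FILE *fp;

    if (!spank_remote(sp))            /* only act in slurmstepd context */
        return ESPANK_SUCCESS;

    spank_get_item(sp, S_JOB_ID, &job_id);
    spank_get_item(sp, S_JOB_STEPID, &step_id);

    /* Illustrative path: adjust to however the local v2 hierarchy is laid out. */
    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/system.slice/slurmstepd.scope/job_%u/step_%u/memory.peak",
             job_id, step_id);

    if ((fp = fopen(path, "r"))) {
        if (fscanf(fp, "%" SCNu64, &peak) == 1)
            slurm_info("peakmem: job %u step %u memory.peak=%" PRIu64,
                       job_id, step_id, peak);
        fclose(fp);
    }
    return ESPANK_SUCCESS;
}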
Emyr James Head of Scientific IT CRG - Centre for Genomic Regulation
Users can, of course, always just wrap the job itself in time (GNU time's verbose mode reports the maximum resident set size) to record the maximum memory usage. It's a bit of a naïve approach, but it does work. I agree the polling of current usage is not very satisfactory.
Tim
-- Tim Cutts Scientific Computing Platform Lead AstraZeneca