I am confused by the amounts of Down and PLND Down time reported by sreport. According to it, our cluster had a significant amount of downtime, which I know didn't happen (or, as the documentation puts it, "time that slurmctld was not responding"; see https://slurm.schedmd.com/sreport.html).
Could my purge settings be causing this problem? How can I check (maybe in some logs, maybe going forward) whether slurmctld was actually not responding? I would expect the long-term numbers to be lower than the ones reported for last month, when we had an issue with a few nodes.
Thanks!
[davide@login ~]$ grep Purge /opt/slurm/slurmdbd.conf
#JobPurge=12
#StepPurge=1
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month

[davide@login ~]$ sreport -t percent -T cpu,mem cluster utilization start=2/1/22
--------------------------------------------------------------------------------
Cluster Utilization 2022-02-01T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated     Down PLND Down     Idle  Planned  Reported
--------- -------------- ---------- -------- --------- -------- -------- ---------
  cluster            cpu     19.50%   12.07%     3.92%   64.36%    0.15%   100.03%
  cluster            mem     16.13%   13.17%     4.56%   66.13%    0.00%    99.99%

[davide@login ~]$ sreport -t percent -T cpu,mem cluster utilization start=2/1/23
--------------------------------------------------------------------------------
Cluster Utilization 2023-02-01T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated     Down PLND Down     Idle  Planned  Reported
--------- -------------- ---------- -------- --------- -------- -------- ---------
  cluster            cpu     28.74%   18.80%     6.44%   45.77%    0.24%   100.02%
  cluster            mem     22.52%   20.54%     7.38%   49.55%    0.00%    99.98%

[davide@login ~]$ sreport -t percent -T cpu,mem cluster utilization start=2/1/24
--------------------------------------------------------------------------------
Cluster Utilization 2024-02-01T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated     Down PLND Down     Idle  Planned  Reported
--------- -------------- ---------- -------- --------- -------- -------- ---------
  cluster            cpu     29.92%   24.88%    17.73%   27.45%    0.02%   100.00%
  cluster            mem     20.07%   28.60%    19.57%   31.76%    0.00%   100.00%

[davide@login ~]$ sreport -t percent -T cpu,mem cluster utilization start=8/8/24
--------------------------------------------------------------------------------
Cluster Utilization 2024-08-08T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated     Down PLND Dow     Idle  Planned  Reported
--------- -------------- ---------- -------- -------- -------- -------- ---------
  cluster            cpu     15.96%    2.53%    0.00%   81.51%    0.00%   100.00%
  cluster            mem      9.18%    2.22%    0.00%   88.60%    0.00%   100.00%

[davide@login ~]$ sreport -t percent -T cpu,mem cluster utilization start=7/7/24
--------------------------------------------------------------------------------
Cluster Utilization 2024-07-07T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated     Down PLND Dow     Idle  Planned  Reported
--------- -------------- ---------- -------- -------- -------- -------- ---------
  cluster            cpu     27.07%    2.57%    0.00%   70.34%    0.02%   100.00%
  cluster            mem     17.35%    2.26%    0.00%   80.40%    0.00%   100.00%
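For keeping an eye on these numbers going forward, one possible approach is to save sreport output periodically and flag any report whose Down plus PLND Down share looks implausible. The sketch below is only an illustration: it assumes the report was produced with sreport's parsable output (pipe-delimited fields, as with -P) in the same column order as the tables above, and the names parse_utilization, flag_downtime and the 5% threshold are hypothetical choices, not part of Slurm.

```python
from typing import Dict, List

# Column order assumed to match the tables above (after Cluster and TRES Name).
FIELDS = ["Allocated", "Down", "PLND Down", "Idle", "Planned", "Reported"]


def parse_utilization(text: str) -> Dict[str, Dict[str, float]]:
    """Parse pipe-delimited utilization rows into {tres_name: {field: percent}}."""
    rows: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        # Tolerate a trailing delimiter and surrounding whitespace.
        parts = [p.strip() for p in line.strip().strip("|").split("|")]
        if len(parts) != 2 + len(FIELDS):
            continue  # blank line, separator, or unexpected layout
        try:
            values = [float(p.rstrip("%")) for p in parts[2:]]
        except ValueError:
            continue  # header row or other non-numeric line
        rows[parts[1]] = dict(zip(FIELDS, values))
    return rows


def flag_downtime(rows: Dict[str, Dict[str, float]], threshold: float = 5.0) -> List[str]:
    """Return the TRES names whose Down + PLND Down exceeds the (hypothetical) threshold."""
    return [name for name, r in rows.items() if r["Down"] + r["PLND Down"] > threshold]


# Sample rows built from the start=2/1/24 and start=8/8/24 reports above.
since_feb = (
    "Cluster|TRES Name|Allocated|Down|PLND Down|Idle|Planned|Reported\n"
    "cluster|cpu|29.92%|24.88%|17.73%|27.45%|0.02%|100.00%\n"
    "cluster|mem|20.07%|28.60%|19.57%|31.76%|0.00%|100.00%\n"
)
since_aug = (
    "cluster|cpu|15.96%|2.53%|0.00%|81.51%|0.00%|100.00%\n"
    "cluster|mem|9.18%|2.22%|0.00%|88.60%|0.00%|100.00%\n"
)

print(flag_downtime(parse_utilization(since_feb)))
print(flag_downtime(parse_utilization(since_aug)))
```

With the figures above, both cpu and mem are flagged for the report starting 2/1/24, while nothing is flagged for the report starting 8/8/24.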
Hi Davide,
On 8/22/24 21:30, Davide DelVento via slurm-users wrote:
I am confused by the amounts of Down and PLND Down time reported by sreport. According to it, our cluster had a significant amount of downtime, which I know didn't happen (or, as the documentation puts it, "time that slurmctld was not responding"; see https://slurm.schedmd.com/sreport.html).
Could my purge settings be causing this problem? How can I check (maybe in some logs, maybe going forward) whether slurmctld was actually not responding? I would expect the long-term numbers to be lower than the ones reported for last month, when we had an issue with a few nodes.
Which version of Slurm are you using? There was an sreport bug that should be fixed in 23.11: https://support.schedmd.com/show_bug.cgi?id=17689
/Ole
Thanks Ole, this is very helpful. I was unaware of that issue. From the bug report it's not clear to me whether it was just an sreport (display) issue or whether the problem was in the way the data was stored.
In fact, I am running 23.11.5, which I installed in April. The numbers I see for the last few months (including April) are fine. The earlier numbers (from when I was running an earlier version) are the ones affected by this problem. So if the issue was in the way the data was stored, that explains it, and I can live with it (even if I can't provide an accurate report for my management now), knowing that the problem won't happen again in the future.
Thanks and have a great weekend
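As a rough cross-check of where the suspicious Down time sits, the cumulative reports above can be subtracted from each other. This is only a sketch under stated assumptions: it treats the total capacity (CPU count) as constant over both windows, takes each report as ending at the close of 2024-08-21, and uses the cpu Down percentages from the start=2/1/24 and start=7/7/24 reports. The helper names are illustrative, not part of any Slurm tool.

```python
from datetime import date


def days_inclusive(start: date, end: date) -> int:
    """Number of whole days in a report running from start 00:00 to end 23:59:59."""
    return (end - start).days + 1


def isolate_earlier_window(pct_long: float, days_long: int,
                           pct_short: float, days_short: int) -> float:
    """Percentage for the part of the long window that precedes the short window.

    Assumes constant capacity, so percentages can be weighted by window length.
    """
    return (pct_long * days_long - pct_short * days_short) / (days_long - days_short)


report_end = date(2024, 8, 21)
d_long = days_inclusive(date(2024, 2, 1), report_end)   # report with start=2/1/24
d_short = days_inclusive(date(2024, 7, 7), report_end)  # report with start=7/7/24

# cpu "Down" percentages taken from the two reports quoted earlier in the thread.
earlier_down = isolate_earlier_window(24.88, d_long, 2.57, d_short)
print(f"Down share for 2024-02-01 through 2024-07-06: {earlier_down:.2f}%")
```

Under these assumptions the cpu Down share for 2024-02-01 through 2024-07-06 works out to roughly 31.4%, against 2.57% for the window starting 7/7/24, so the reported Down time lies almost entirely in the earlier period.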
On Fri, Aug 23, 2024 at 8:00 AM Ole Holm Nielsen via slurm-users < slurm-users@lists.schedmd.com> wrote:
Hi Davide,
Which version of Slurm are you using? There was an sreport bug that should be fixed in 23.11: https://support.schedmd.com/show_bug.cgi?id=17689
/Ole
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com