[slurm-users] changing JobAcctGatherType w/running jobs
Paul Brunk
pbrunk at uga.edu
Tue Sep 7 23:26:16 UTC 2021
Hi all:
Running Slurm 20.11.8. I missed a chance at a recent outage to change our JobAcctGatherType from 'linux' to 'cgroup'. Our ProctrackType has been 'cgroup' for a long time. In short, I'm thinking it would be harmless for me to do this now, with jobs running, and below I discuss the caveats I know of. Have any of you made this change with jobs running, or do you see a reason why, in my case, I should not?
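For concreteness, the edit I have in mind is the one JobAcctGatherType line in slurm.conf (plugin names written out in full; everything else stays as it is):

  # slurm.conf (excerpt) -- current
  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/linux

  # slurm.conf (excerpt) -- proposed
  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/cgroup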
More info:
I see the warnings in the docs about not changing JobAcctGatherType while jobs are running. Some of you have asked SchedMD about this before:
- In slurm-dev.schedmd.narkive.com/EbK7qgSg/adding-jobacctgather-plugin-causing-rpc-errors#post1 from 2013, Moe says "don't change this while jobs are running; I'll doc that." (Hence it being doc'd now.)
- https://bugs.schedmd.com/show_bug.cgi?id=861 in 2014 mentioned that doing so would break 'sstat' for the already-running jobs.
- in https://bugs.schedmd.com/show_bug.cgi?id=2781 in 2016 SchedMD repeated the doc'd warning. In that case, the user reported job tasks completing while Slurm considered the jobs still running.
On a dev cluster, I started a job, then changed JobAcctGatherType from 'linux' to 'cgroup', then restarted slurmctld, then the slurmds. That job continued to run and was eventually terminated at its time limit. This was replicable.
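For reference, the switch itself was nothing exotic. As far as I know a plugin change isn't picked up by 'scontrol reconfigure', so after pushing the edited slurm.conf everywhere it was just:

  # on the controller
  sudo systemctl restart slurmctld

  # then on each compute node, however you fan it out (pdsh, clush, etc.)
  sudo systemctl restart slurmd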
I submitted a job with a known RAM-vs-time profile to each of several otherwise idle nodes. One node I left alone; on the other four I switched from 'linux' to 'cgroup' at varied times during the jobs' lives. We have a Prometheus exporter which feeds a Grafana instance to graph the cgroup data. Looking at the 'memory' data across the nodes, one of them reported falsely high memory for the test job. Running the same job again without touching slurmd mid-job yielded identically correct graphs across all the nodes.
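The test job itself was trivial; something like the sketch below (a reconstruction, not our exact script) gives a memory ramp that's easy to recognize in Grafana:

  #!/bin/bash
  #SBATCH --job-name=memramp
  #SBATCH --nodes=1 --ntasks=1
  #SBATCH --mem=8G
  #SBATCH --time=00:30:00
  # Grow RSS by ~1 GiB per minute for 6 minutes, then hold,
  # so the RAM-vs-time curve is unmistakable on the graphs.
  python3 -c '
  import time
  held = []
  for i in range(6):
      held.append(b"\xff" * (1024**3))  # ~1 GiB, actually written, so it shows up in RSS
      time.sleep(60)
  time.sleep(600)
  '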
Suppose I switch my cluster (slurmctld, all slurmds) at time T0. In principle a user sizing her jobs might look at the memory-related metrics for a job that was running at T0 and get inaccurate info. Modulo that, we can afford to write off the historical memory-usage info for the jobs running at T0 (we could tolerate any seeming inaccuracies in fairshare arising from that info being wrong, and we don't yet have e.g. a MaxTresPerX with some RAM value). With our 'cgroup' ProctrackType, and a required mem spec on all jobs, I think we don't need to worry about a given slurmd sending slurmctld wrong or incomprehensible information about a given job's resource usage.
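If we do end up wanting to eyeball what got recorded for the jobs that straddled the switch, something like this (the T0 timestamp and window are illustrative) would show the accounted memory high-water marks:

  # jobs active in a window around T0
  sacct -a -S 2021-09-07T09:00 -E 2021-09-07T11:00 \
        -o JobID,Start,End,State,ReqMem,MaxRSS,AveRSS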
Does anyone know of a reason to think otherwise? Thanks for reading this far :)
--
Grinning like an idiot,
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (GACRC)
Enterprise IT Svcs, the University of Georgia