[slurm-users] changing JobAcctGatherType w/running jobs
Paul Brunk
pbrunk at uga.edu
Tue Sep 7 23:26:16 UTC 2021
Hi all:
Running Slurm 20.11.8. I missed a chance at a recent outage to change our JobAcctGatherType from 'linux' to 'cgroup'. Our ProctrackType has been 'cgroup' for a long time. In short, I'm thinking it would be harmless for me to do this now, with jobs running, and below I discuss the caveats I know of. Have any of you made this change with jobs running, or do you see a reason why, in my case, I should not?
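For concreteness, the edit I have in mind is the one JobAcctGatherType line in slurm.conf (plugin names written out in full; everything else stays as it is):

  # slurm.conf (excerpt) -- current
  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/linux

  # slurm.conf (excerpt) -- proposed
  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/cgroup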
More info:
I see the warnings in the docs about not changing JobAcctGatherType while jobs are running. Some of you have asked SchedMD about this before:
- In slurm-dev.schedmd.narkive.com/EbK7qgSg/adding-jobacctgather-plugin-causing-rpc-errors#post1 from 2013, Moe says "don't change this while jobs are running; I'll doc that." (Hence it being doc'd now.)
- https://bugs.schedmd.com/show_bug.cgi?id=861 in 2014 mentioned that doing so would break 'sstat' for the already-running jobs.
- in https://bugs.schedmd.com/show_bug.cgi?id=2781 in 2016 SchedMD repeated the doc'd warning. In that case, the user reported job tasks completing while Slurm considered the jobs still running.
On a dev cluster, I started a job, then changed JobAcctGatherType from 'linux' to 'cgroup', then restarted slurmctld, then the slurmds. That job continued to run and was eventually terminated at its time limit. This was replicable.
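For reference, the switch itself was nothing exotic. As far as I know a plugin change isn't picked up by 'scontrol reconfigure', so after pushing the edited slurm.conf everywhere it was just:

  # on the controller
  sudo systemctl restart slurmctld

  # then on each compute node, however you fan it out (pdsh, clush, etc.)
  sudo systemctl restart slurmd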
I submitted a job with a known RAM-vs-time profile to each of several otherwise idle nodes. One node I left alone; on the other four I switched from 'linux' to 'cgroup' at varied times during the jobs' lives. We have a Prometheus exporter which feeds a Grafana instance to graph the cgroup data. Looking at the 'memory' data across the nodes, one of them reported falsely high memory for the test job. Running the same job again without touching slurmd mid-job yielded identically correct graphs across all the nodes.
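The test job itself was trivial; something like the sketch below (a reconstruction, not our exact script) gives a memory ramp that's easy to recognize in Grafana:

  #!/bin/bash
  #SBATCH --job-name=memramp
  #SBATCH --nodes=1 --ntasks=1
  #SBATCH --mem=8G
  #SBATCH --time=00:30:00
  # Grow RSS by ~1 GiB per minute for 6 minutes, then hold,
  # so the RAM-vs-time curve is unmistakable on the graphs.
  python3 -c '
  import time
  held = []
  for i in range(6):
      held.append(b"\xff" * (1024**3))  # ~1 GiB, actually written, so it shows up in RSS
      time.sleep(60)
  time.sleep(600)
  '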
Suppose I switch my cluster (slurmctld, all slurmds) at time T0. In principle a user sizing her jobs might look at the memory-related metrics for a job that was running at T0 and get inaccurate info. Modulo that, we can afford to write off the historical memory-usage info for the jobs running at T0 (we could tolerate any seeming inaccuracies in fairshare arising from that info being wrong, and we don't yet have e.g. a MaxTresPerX with some RAM value). With our 'cgroup' ProctrackType, and a required mem spec on all jobs, I think we don't need to worry about a given slurmd sending slurmctld wrong or incomprehensible information about a given job's resource usage.
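If we do end up wanting to eyeball what got recorded for the jobs that straddled the switch, something like this (the T0 timestamp and window are illustrative) would show the accounted memory high-water marks:

  # jobs active in a window around T0
  sacct -a -S 2021-09-07T09:00 -E 2021-09-07T11:00 \
        -o JobID,Start,End,State,ReqMem,MaxRSS,AveRSS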
Does anyone know of a reason to think otherwise? Thanks for reading this far :)
--
Grinning like an idiot,
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (GACRC)
Enterprise IT Svcs, the University of Georgia