[slurm-users] Access to slurm job cgroups in prolog/epilog script

Mon Mar 8 19:25:21 UTC 2021

Hi,

I am wondering about the exact execution order of prolog scripts and
plugins in Slurm. My goal is to access the freshly created cgroups
(from the task/cgroup plugin) in our prolog/epilog scripts, which
run with PrologFlags=Alloc to ensure the traditional batch system
behaviour.

We want some information, namely, the prepared cpuset for the job in
the prolog and the statistics/counter differences in the epilog. I am
aware of accounting and profiling options using slurm and plugins, but
there are reasons I want to handle cgroup information myself; maybe
even to experiment with things that might go into a slurm plugin at
some point.

The job cgroups seem to be created after the prolog scripts have run and
destroyed before the epilog scripts run (is that correct? It looks that
way). The design seems to focus on individual job steps, with things
running closely coupled to the possibly multiple components (steps,
tasks) of batch jobs, of which I only have a hazy concept.
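
To check the observed ordering, something like the following could be
called from both the prolog and the epilog. It is only a minimal sketch:
it assumes the cgroup v1 layout <root>/<controller>/slurm/uid_<uid>/job_<jobid>
under /sys/fs/cgroup and that SLURM_JOB_ID and SLURM_JOB_UID are set in
the script environment. The helper names are made up for illustration.

```python
import os
from pathlib import Path


def job_cgroup_dir(controller, uid, job_id, root="/sys/fs/cgroup"):
    """Build the assumed cgroup v1 directory of a job for one controller."""
    return Path(root) / controller / "slurm" / f"uid_{uid}" / f"job_{job_id}"


def report_job_cgroups(uid, job_id, controllers=("cpuset", "memory"),
                       root="/sys/fs/cgroup"):
    """Map each controller name to whether the job cgroup directory exists."""
    return {
        c: job_cgroup_dir(c, uid, job_id, root).is_dir() for c in controllers
    }


if __name__ == "__main__":
    # Assumption: both variables are exported to the prolog/epilog.
    uid = os.environ.get("SLURM_JOB_UID")
    job_id = os.environ.get("SLURM_JOB_ID")
    if uid and job_id:
        for controller, present in report_job_cgroups(uid, job_id).items():
            print(f"{controller}: {'present' if present else 'missing'}")
```

If both scripts report "missing", that would match my impression that
the cgroups only live between prolog and epilog.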

Is there a standard way to get the cgroup hierarchy for the job created
early, before the per-node prolog script that runs as root (slurmd
user), and to have the final cleanup happen later, after the epilog has
run? If configuration cannot do it, I thought about modifying
task/cgroup, but I suspect that the whole scope of the plugin lies
between the prolog and the epilog. Can someone confirm that? I welcome
pointers to documentation that explains in detail when which parts of a
plugin are run in relation to the slot the prolog scripts get.

With https://bugs.schedmd.com/show_bug.cgi?id=9429, there seems to be a
way to keep the cgroup around longer: just sabotage the cleanup phase
and do it later in the epilog (as I do now on an Ubuntu 20.04 cluster
with the distro-provided slurmd that suffers from this bug). But will
e.g. moving the code from task_p_pre_setuid() to
task_p_slurmd_reserve_resources() give me early access in the prolog? I
might just try it and break something, but I have not yet found
documentation on these details of the plugin API, and for once I thought
asking around first might be a good idea.

I want cpuset information and at least things like per-node memory
high-water marks. The desired granularity is at the job level and it
would be nice to get rid of inefficient timeseries to approximate that.
The cpuset is needed before user programs start, as I hook a
listener to the taskstats interface to cheaply and accurately account
for user processes (kernel tasks) with command names. My profiling is
somewhere between the hdf5 timeseries and the rough values you get out
of sacct, with an orthogonal bit about kernel tasks (to tell the user
how many python processes wasted how much memory each).
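
To illustrate what I would read from the job cgroup, here is a minimal
sketch. It assumes cgroup v1 file names (cpuset.cpus in the cpuset
controller, memory.max_usage_in_bytes in the memory controller) and
takes the job's cgroup directory as an argument. The function names are
only for illustration.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into a list of integers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpuset yields an empty list
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_job_cpuset(cpuset_job_dir):
    """Return the CPUs assigned to the job (assumed cgroup v1 file name)."""
    return parse_cpu_list((Path(cpuset_job_dir) / "cpuset.cpus").read_text())


def read_memory_peak(memory_job_dir):
    """Return the memory high-water mark in bytes (assumed cgroup v1 file)."""
    path = Path(memory_job_dir) / "memory.max_usage_in_bytes"
    return int(path.read_text())
```

The prolog would call read_job_cpuset() to set up the taskstats
listener, and the epilog would call read_memory_peak() before the cgroup
is removed. Both only work if the cgroup exists at that point, which is
exactly my question.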

Alrighty then,

Thomas

PS: I guess lots is possible by writing a custom plugin that ties in
with what my prolog/epilog scripts do, but I'd prefer a light touch
first. Hacking the scripts during development is far more convenient.

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg