[slurm-users] Disable --no-allocate support for a node/SlurmD
alexander.grund at tu-dresden.de
Thu Jun 15 08:12:23 UTC 2023
> Ah okay, so your requirements include completely insulating (some)
> jobs from outside access, including root?
> I've seen this kind of requirements on e.g. working non-defaced
> medical data - generally a tough problem imo because this level of
> data security seems more or less incompatible with the idea of a
> multi-user HPC system.
> I remember that this year's ZKI-AK Supercomputing spring meeting had
> Sebastian Krey from GWDG presenting the KISSKI ("KI-Servicezentrum für
> Sensible und Kritische Infrastrukturen", https://kisski.gwdg.de/ )
> project, which works in this problem domain, are you involved in that?
> The setup with containerization and 'node hardening' sounds very
> similar to me.
Indeed. We (ZIH TU Dresden) are working together with Hendrik Nolte from
GWDG to implement their concept of a "secure Workflow on HPC" on our system.
In short, the idea here is to have nodes with additional (cryptographic)
authentication of jobs.
I'm just double-checking alternatives for some details which may result
in an easier implementation of the concept.
> Re "preventing the scripts from running": I'd say it's about as easy
> as to otherwise manipulate any job submission that goes through
> slurmctld (e.g. by editing slurm.conf), so without knowing your exact
> use case and requirements, I can't think of a simple solution.
The resource manager, i.e. slurmctld, and slurmd run on different machines.
The slurmctld and the node(s), i.e. slurmd, each have their own local
copy of slurm.conf and use only the parts relevant to them. So the slurmd
doesn't care
about the submit plugins and slurmctld doesn't (need to) know about the
The idea in the workflow is that only the node itself needs to be
considered secure and access to the node is only possible via the slurmd
running on the node.
So that slurmd can be configured to always execute the Prolog (a local
script) prior to each job and to deny the job's execution if the
authentication fails.
Circumventing this authentication now requires modifying the slurm.conf
on that node, which has to be considered impossible as an attacker with
that capability could also modify anything else (e.g. the Prolog to
remove the checks).
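
A minimal sketch of what such a Prolog check could look like, assuming
(purely for illustration) that each job presents an HMAC-SHA256 tag over
its job ID and user name, and that the key is stored only on the node.
The function names and the token scheme are hypothetical and are not the
GWDG concept itself; only the SLURM_JOB_ID and SLURM_JOB_USER variables
are taken from what Slurm provides to the Prolog.

```python
import hashlib
import hmac


def expected_tag(key: bytes, job_id: str, user: str) -> str:
    """Return the HMAC-SHA256 tag that binds a job ID to its user.

    Hypothetical scheme: the tag format is an assumption for this sketch.
    """
    message = f"{job_id}:{user}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def prolog_exit_code(env: dict, presented_tag: str, key: bytes) -> int:
    """Return 0 to let the job start, 1 to make the Prolog fail.

    `env` stands in for os.environ as seen by the Prolog.
    """
    job_id = env.get("SLURM_JOB_ID")
    user = env.get("SLURM_JOB_USER")
    if not job_id or not user:
        return 1  # fail closed when the job cannot be identified
    good = expected_tag(key, job_id, user).encode()
    # Constant-time comparison to avoid leaking how many characters match.
    return 0 if hmac.compare_digest(good, presented_tag.encode()) else 1
```

A real Prolog script would pass os.environ, read the tag and key from
wherever the site stores them, and hand the result to sys.exit(), so that
a failed check yields a non-zero exit status.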
But the possibility of slurmd handling a `--no-alloc` job introduces a
new way to circumvent the authentication.
The slurm.conf of the slurmctld effectively only prevents requests to
skip the Prolog (i.e. the -Z flag) from being sent to the slurmd; if the
slurmd somehow receives such a request, it would still honor it. So the
security now additionally relies on the security of the resource manager.
It would be more secure if the slurmd on the node(s) could be configured
to never skip the Prolog, even if the user seems to be privileged.
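
The difference between the two trust models can be illustrated with a toy
decision function. This is not slurmd's actual code, and the node-local
switch it models does not exist as far as described in this thread; all
names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchRequest:
    """Toy model of a launch request arriving at a slurmd (hypothetical)."""
    job_id: int
    no_alloc: bool           # request asks to skip allocation, like -Z
    sender_privileged: bool  # sender appears to be a privileged user


def runs_prolog(request: LaunchRequest, node_always_runs_prolog: bool) -> bool:
    """Decide whether the Prolog runs for this request.

    With the node-local switch set, the content of the request is
    irrelevant. Without it, the node trusts whatever the sender claims.
    """
    if node_always_runs_prolog:
        return True
    return not (request.no_alloc and request.sender_privileged)
```

In this model, a forged privileged no-alloc request skips the Prolog
unless the node-local switch is set, which is exactly the dependency on
the resource manager described above.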
As the node could be rebooted prior to each job using a read-only image,
the security of each job can be ensured without any influence on the
rest of the cluster.
So in summary: We don't want to trust the slurmctld (running somewhere
else) but only the slurmd (running on the node) to always execute the
Prolog.
I hope that explains it well enough.