[slurm-users] Disable --no-allocate support for a node/SlurmD

Thu Jun 15 08:12:23 UTC 2023

Hi,
> Ah okay,  so your requirements include completely insulating (some) 
> jobs from outside access, including root?
Correct.
> I've seen this kind of requirements on e.g. working non-defaced 
> medical data - generally a tough problem imo because this level of 
> data security seems more or less incompatible with the idea of a 
> multi-user HPC system.
>
> I remember that this year's ZKI-AK Supercomputing spring meeting had 
> Sebastian Krey from GWDG presenting the KISSKI ("KI-Servicezentrum für 
> Sensible und Kritische Infrastrukturen", https://kisski.gwdg.de/ ) 
> project, which works in this problem domain, are you involved in that? 
> The setup with containerization and 'node hardening' sounds very 
> similar to me.
Indeed. We (ZIH TU Dresden) are working together with Hendrik Nolte from 
GWDG to implement their concept of a "secure Workflow on HPC" on our system.
In short, the idea here is to have nodes that require additional 
(cryptographic) authentication of jobs.
I'm just double-checking alternatives for some details which may result 
in easier implementation of the concept.
> Re "preventing the scripts from running": I'd say it's about as easy 
> as to otherwise manipulate any job submission that goes through 
> slurmctld (e.g. by editing slurm.conf), so without knowing your exact 
> use case and requirements, I can't think of a simple solution.
The resource manager, i.e. slurmctld, and slurmd run on different machines.
Each of them has its own local copy of slurm.conf and uses only the parts 
relevant to it. So slurmd doesn't care about the submit plugins and 
slurmctld doesn't (need to) know about the Prolog, correct?
The idea in the workflow is that only the node itself needs to be 
considered secure and access to the node is only possible via the slurmd 
running on the node.
That slurmd can then be configured to always execute the Prolog (a local 
script) prior to each job and to deny the job's execution if 
authentication fails.
Circumventing this authentication now requires modifying the slurm.conf 
on that node, which has to be considered impossible as an attacker with 
that capability could also modify anything else (e.g. the Prolog to 
remove the checks).
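
To make the Prolog step concrete, here is a minimal sketch in Python of a 
Prolog that only lets a job start if a token for it verifies. It is an 
illustration under assumptions, not the GWDG concept or anything Slurm 
ships: the key path, the token directory, the token format (HMAC-SHA256 
over "jobid:user") and all function names are hypothetical. The only Slurm 
parts it relies on are the SLURM_JOB_ID and SLURM_JOB_USER variables that 
slurmd sets for the Prolog, and that a non-zero exit status counts as a 
Prolog failure.

```python
#!/usr/bin/env python3
"""Hypothetical Prolog sketch: reject a job unless its token verifies."""
import hashlib
import hmac
import os

# Assumed locations, chosen only for this illustration.
KEY_PATH = "/etc/slurm/prolog_auth.key"
TOKEN_DIR = "/run/job-tokens"


def expected_token(key: bytes, job_id: str, user: str) -> str:
    """Return the hex HMAC-SHA256 of "job_id:user" under the node-local key."""
    message = f"{job_id}:{user}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_authenticated(key: bytes, job_id: str, user: str, token: bytes) -> bool:
    """Compare the presented token with the expected one in constant time."""
    expected = expected_token(key, job_id, user).encode()
    return hmac.compare_digest(expected, token.strip())


def prolog_exit_code(env, key_path=KEY_PATH, token_dir=TOKEN_DIR) -> int:
    """Return 0 to let the job start, 1 to make the Prolog fail."""
    job_id = env.get("SLURM_JOB_ID", "")
    user = env.get("SLURM_JOB_USER", "")
    if not job_id or not user:
        return 1
    try:
        with open(key_path, "rb") as f:
            key = f.read()
        with open(os.path.join(token_dir, f"{job_id}.token"), "rb") as f:
            token = f.read()
    except OSError:
        # A missing key or token is treated as failed authentication.
        return 1
    return 0 if is_authenticated(key, job_id, user, token) else 1
```

To use this as an actual Prolog, the file would end with 
`sys.exit(prolog_exit_code(os.environ))` (plus `import sys`). Every error 
path returns non-zero, so the check fails closed.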

But the possibility of slurmd handling a `--no-allocate` job introduces a 
new way to circumvent the authentication.
Configuring this in the slurm.conf of the slurmctld effectively only 
prevents requests that tell the slurmd not to run the Prolog (i.e. the -Z 
flag) from being sent; if the slurmd somehow receives such a request, it 
would still handle it. So now the security additionally relies on the 
security of the resource manager.
It would be more secure if the slurmd on those nodes could be configured 
to never skip the Prolog, even if the user appears to be privileged.
As the node could be rebooted prior to each job using a read-only image, 
the security of each job can be ensured without any influence on the 
rest of the cluster.

So in summary: We don't want to trust the slurmctld (running somewhere 
else) but only the slurmd (running on the node) to always execute the 
Prolog.

I hope that explains it well enough.
Kind regards,
Alex
