[slurm-users] slurmd-used libs in an NFS share?
pbrunk at uga.edu
Thu Mar 23 13:51:16 UTC 2023
In short, I'm thinking about housing some slurmd-used libraries in an
NFS share, and am curious about the risk such sharing poses to
job-running slurmds (I'm not concerned about the jobs themselves here).
For our next Slurm deployment (not a rolling upgrade), our Rocky8
nodes will be 'statelite' (xCAT), as they are already in our CentOS7
cluster. They NFS mount a shared root image, which includes Slurm in
/opt. A separate NFS server provides user home dirs and the
"/usr/local"-like dir we call "/apps". /scratch lives on Lustre.
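Those mounts are exactly what a node health checker needs to watch. A
sketch of what that could look like as LBNL NHC nhc.conf lines, using
its check_fs_mount_rw check (the mountpoints are our layout; the fstype
and option spellings should be verified against your NHC version, and
the NFS sources filled in with the real servers):

```
# Mark the node unhealthy if any of these filesystems is missing or
# not writable.  -t is the fstype, -f the mountpoint; -s (the NFS
# source, server:/export) could be added for a stricter check.
* || check_fs_mount_rw -t "nfs" -f /apps
* || check_fs_mount_rw -t "nfs" -f /home
* || check_fs_mount_rw -t "lustre" -f /scratch
```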
We're going to use CUDA, and non-distro versions of PMIx and hwloc.
To keep the RAM-dwelling OS image on the nodes small, I'd like these
to live in the /apps NFS share, while Slurm stays in the OS image's
"/opt". The downside: CUDA, PMIx, and hwloc would become unavailable
to the node slurmds if the /apps mount fails while the OS "/" mount
does not. I won't mind if slurmd can't start a job at such times,
since the user apps would be unavailable anyway (and our NHC checks
for that). But is there some risk to the slurmd parents of
already-running jobs, if those slurmds need to (re-)access those
libraries while they're unavailable?
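For context on why this matters: the dynamic linker mmaps shared
objects at process start, and pages from them can be faulted in later,
so a vanished NFS mount can in principle stall a process that still
has objects mapped from it. One way to see what a running slurmd
actually holds mapped from /apps is /proc/<pid>/maps. A sketch (the
pidof/prefix choices are assumptions; the demo below inspects the
current shell against "/" so it runs on any Linux box):

```shell
# Sketch: list the shared objects a process has mapped from a given
# path prefix, by reading /proc/<pid>/maps (field 6 is the pathname).
# For the real check, set pid=$(pidof slurmd) and prefix=/apps; here
# we inspect this shell itself against "/".
pid=$$
prefix=/
awk -v p="$prefix" '$6 ~ "^"p && $6 ~ /\.so/ { print $6 }' \
    "/proc/${pid}/maps" | sort -u
```

If nothing under /apps shows up there once a job is running, the
exposure would seem limited to launch time (starting new steps) rather
than to the long-lived slurmd itself; if /apps paths do appear, the
daemon keeps a live dependency on the mount.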
I've looked at, e.g., Nvidia's DeepOps (puts CUDA in an unshared
/usr/local, replicated on each node), Dell's Omnia (puts CUDA in an
NFS share), Nathan Rini's Docker-scale-out cluster (puts CUDA etc. in
an unshared /usr/local, replicated on each node), and OpenHPC (Slurm
in /usr, hwloc in NFS-shared /opt). I've also started deploying a dev
Omnivector cluster (thanks, Mike Hanby!) using LXD, to see what they
do, but haven't finished that.
Thanks. I've seen a few "I'm starting a Slurm cluster" walkthrough
threads online lately, but haven't seen this particular issue
addressed. I'm aware it might be a non-issue.
Paul Brunk, system administrator
Advanced Computing Resource Center
Enterprise IT Svcs, the University of Georgia