[slurm-users] Ideal NFS exported StateSaveLocation size.

Brian Andrus toomuchit at gmail.com
Mon Oct 24 14:20:55 UTC 2022

FWIW, I have used NFS/Gluster/Lustre for a StateSaveLocation at various 
times on various clusters.

I have never had an issue with any of them, and have run clusters of up 
to 1000+ nodes. I have even used the same share to symlink all the 
nodes' slurm.conf with no issue.

Of course, YMMV, but if you aren't having excessive traffic to the 
share, you should be good. I have yet to discover what would be 
excessive enough to impact things.

The only use I have had for the HA is being able to keep the cluster 
running/happy during maintenance.
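For reference, placing the state directory on a shared export boils down 
to a single slurm.conf line; the mount point below is just a made-up 
example, not a recommendation:

```
# Hypothetical example: state directory on a share (NFS/Gluster/Lustre)
# mounted at /slurm-shared on the controller(s).
StateSaveLocation=/slurm-shared/statesave
```

The same share can then also hold the slurm.conf that the nodes symlink to.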

Brian Andrus

On 10/24/2022 1:14 AM, Ole Holm Nielsen wrote:
> On 10/24/22 09:57, Diego Zuccato wrote:
>> Il 24/10/2022 09:32, Ole Holm Nielsen ha scritto:
>>  > It is definitely a BAD idea to store the Slurm StateSaveLocation 
>>  > on a slow NFS directory!  SchedMD recommends using local NVMe or 
>>  > SSD disks because there will be many IOPS to this file system!
>> IIUC it does have to be shared between controllers, right?
>> Possibly use an NVMe-backed (or even better NVDIMM-backed) NFS 
>> share. Or a replica-3 Gluster volume with NVDIMMs for the bricks, 
>> for the paranoid :)
> IOPS is the key parameter!  Local NVMe or SSD should beat any 
> networked storage.  The original question refers to having 
> StateSaveLocation on a standard (slow) NFS drive, AFAICT.
> I wonder how many people prefer using 2 slurmctld hosts (primary 
> and backup)?  I certainly don't.  Slurm does have a configurable 
> SlurmctldTimeout parameter so that you can reboot the server 
> quickly when needed.
> It would be nice if people with experience in HA storage for slurmctld 
> could comment.
> /Ole
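For context, the two-controller setup and timeout that Ole mentions look 
roughly like this in slurm.conf; the hostnames and path are hypothetical:

```
# Hypothetical two-controller setup: the first SlurmctldHost is the
# primary, the second the backup. Both controllers must be able to
# read/write StateSaveLocation for failover to work, which is why
# the directory must be on shared storage in this configuration.
SlurmctldHost=ctld1
SlurmctldHost=ctld2
StateSaveLocation=/slurm-shared/statesave

# With a single controller instead, a generous timeout (seconds)
# lets the nodes ride out a controller reboot without declaring
# the cluster down.
SlurmctldTimeout=300
```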
