[slurm-users] Kinda Off-Topic: data management for Slurm clusters

Fri Feb 22 23:54:08 UTC 2019

Hi Will,

I look after our GPU cluster in our vision lab. We have a similar setup
- we are working from a single ZFS file server. We have two pools:

/db, which is about 40TB of spinning SAS built out of two raidz vdevs, with
16TB of L2ARC (across 4 SSDs). Indexing that much L2ARC reduces the RAM
left for ARC quite significantly, but our data would never fit in RAM
anyway (at least on our budget :D).

/home which is also 40TB spinning SAS but does not have L2ARC.

The file server has two 10GbE NICs hooked up to a Ubiquiti 16xg switch,
configured as a LAG. The hashing is based on layer 2 and layer 3
information. We have 9 GPU nodes, with either 4 or 8 cards. Each node
has a single 10GbE card, and the two ZFS pools are mounted via NFS.
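
A rough sketch of what layer 2 + layer 3 hashing means for per-node
throughput. The hash below is loosely modelled on the layer2+3 policy
documented for the Linux bonding driver (packet type omitted); whether
the switch uses the same function is an assumption, and every MAC and IP
address here is made up. The point is only that all traffic between one
node and the file server lands on the same LAG member, so a single node
cannot exceed one 10GbE link, while different nodes may end up on
different links.

```python
import ipaddress

def last_octet(mac):
    """Return the final byte of a colon-separated MAC address as an int."""
    return int(mac.split(":")[-1], 16)

def layer23_link(src_mac, dst_mac, src_ip, dst_ip, n_links=2):
    """Pick a LAG member from MAC and IP addresses (assumed hash, see above)."""
    h = last_octet(src_mac) ^ last_octet(dst_mac)
    h ^= int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % n_links

# Hypothetical addresses, for illustration only.
SERVER = ("02:00:00:00:00:01", "10.0.0.1")
NODES = {
    "gpu01": ("02:00:00:00:00:3a", "10.0.0.11"),
    "gpu02": ("02:00:00:00:00:7e", "10.0.0.12"),
    "gpu03": ("02:00:00:00:00:c4", "10.0.0.13"),
}

for name, (mac, ip) in NODES.items():
    link = layer23_link(mac, SERVER[0], ip, SERVER[1])
    print(f"{name} -> LAG member {link}")
```

Because the hash depends only on the address pair, opening more NFS
connections from the same node would not spread its traffic across both
links under this policy.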

So far I've not had any complaints about the read performance. On
average we are pulling around 400MB/s from the db pool. The highest
we've seen is around 800MB/s. At the moment the L2ARC is taking
around 4-5k IOPS, which is pretty light.
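
To put those figures in context, here is a back-of-envelope calculation
of how much of the network they use. It assumes decimal megabytes and
the raw 10 Gbit/s line rate with protocol overhead ignored, so the real
headroom is a little smaller.

```python
LINK_GBIT = 10  # one 10GbE link, raw line rate

def link_utilisation(mb_per_s, links=1):
    """Fraction of `links` x 10GbE used by a transfer rate given in MB/s."""
    capacity_mb_per_s = links * LINK_GBIT * 1000 / 8  # 1250 MB/s per link
    return mb_per_s / capacity_mb_per_s

for label, rate in [("average", 400), ("peak", 800)]:
    one = link_utilisation(rate, links=1)
    lag = link_utilisation(rate, links=2)
    print(f"{label}: {rate} MB/s = {one:.0%} of one link, {lag:.0%} of the LAG")
```

On these assumptions even the peak stays well inside a single link.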

I'm also a user of the cluster, working on large volumetric datasets of
around 1TB (each sample is about 8-9MB), and I've not had any issues
either. The SSD cache takes several epochs to warm up to a new dataset,
but our average L2ARC hit rate is around 90%.

Our NFS read and write block size is set to 131072 and the NFS server
has been configured to use 16 threads. We haven't messed about with the
MTU but I suspect there is the possibility of a slight performance
improvement by fiddling with this.
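
A quick estimate of how slight that improvement is likely to be. This
assumes NFS over TCP/IPv4 with no IP or TCP options, and counts the
usual per-frame Ethernet overhead on the wire (preamble, header, FCS and
inter-frame gap, 38 bytes in total). It only models framing overhead;
any saving in per-packet CPU work is not captured.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12  # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20        # IPv4 + TCP, no options (assumption)

def payload_efficiency(mtu):
    """Fraction of on-the-wire bytes that carry TCP payload at a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)

standard = payload_efficiency(1500)
jumbo = payload_efficiency(9000)
print(f"MTU 1500: {standard:.1%} payload")
print(f"MTU 9000: {jumbo:.1%} payload")
print(f"relative gain from jumbo frames: {jumbo / standard - 1:.1%}")
```

That works out to roughly 95% versus 99% payload efficiency, so jumbo
frames would buy a few percent at best, and only if every device on the
path is configured for the larger MTU.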

As I think I saw another member of this list suggest, you might want to
look into local NFS caching. We were originally going to do this but
have not bothered since everything has been fine so far.

From a coding perspective, ensure that your users are using a sensible
number of threads for dataloading. Also check that the CPU isn't
overloaded if online augmentation is being used.
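
One way to find a sensible thread count is simply to measure it. The
sketch below uses only the Python standard library and is not tied to
any particular deep-learning framework; `read_sample`,
`measure_throughput` and `sweep` are hypothetical helper names, and the
list of sample paths is something you would supply from your own
dataset on the NFS mount.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_sample(path):
    """Read one sample file completely and return its size in bytes."""
    with open(path, "rb") as f:
        return len(f.read())

def measure_throughput(paths, workers):
    """Read all paths with `workers` threads; return (bytes_read, MB/s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(read_sample, paths))
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / 1e6 / elapsed

def sweep(paths, worker_counts=(1, 2, 4, 8, 16)):
    """Return {worker_count: MB/s} so the point of diminishing returns shows."""
    return {w: measure_throughput(paths, w)[1] for w in worker_counts}
```

Run `sweep` on a few hundred sample paths and pick the smallest worker
count after which throughput stops improving. Repeated reads of the same
files will be served from the client's page cache, so use a different
subset of files for each worker count. This measures I/O only; if
augmentation runs in the loader, watch CPU usage as well.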

A colleague and I actually did a Computerphile video about it.
https://www.youtube.com/watch?v=RG2Z7Xgthb4

Happy to answer any questions about our setup.

best,
Aaron.

Will Dennis writes:

> Hi folks,
>
> Not directly Slurm-related, but... We have a couple of research groups that have large data sets they are processing via Slurm jobs (deep-learning applications) and are presently consuming the data via NFS mounts (both groups have 10G ethernet interconnects between the Slurm nodes and the NFS servers.) They are both now complaining of "too-long loading times" for the data, and are casting about for a way to bring the needed data onto the processing node, onto fast SSD single drives (or even SSD arrays.) These local drives would be considered "scratch space", not for long-term data storage, but for use over the lifetime of a job, or maybe perhaps a few sequential jobs (given the nature of the work.) "Permanent" storage would remain the existing NFS servers. We don't really have the funding for 25-100G networks and/or all-flash commercial data storage appliances (NetApp, Pure, etc.)
>
> Any good patterns that I might be able to learn about implementing here? We have a few ideas floating about, but I figured this already may be a solved problem in this community...
>
> Thanks!
> Will

--
Aaron Jackson - M6PIU
http://aaronsplace.co.uk/