[slurm-users] Kinda Off-Topic: data management for Slurm clusters
Matthew BETTINGER
matthew.bettinger at external.total.com
Fri Feb 22 20:44:41 UTC 2019
We stuck Avere between Isilon and a cluster to get us over the hump until the next budget cycle ... then we replaced it with Spectrum Scale for mid-level storage. Still use Lustre as scratch, of course.
On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis" <slurm-users-bounces at lists.schedmd.com on behalf of wdennis at nec-labs.com> wrote:
(replies inline)
On Friday, February 22, 2019 1:03 PM, Alex Chekholko said:
>Hi Will,
>
>If your bottleneck is now your network, you may want to upgrade the network. Then the disks will become your bottleneck :)
>
Via network bandwidth analysis, it's not really the network that's the problem... it's the NFS/disk I/O...
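(For anyone chasing the same question, this is the kind of comparison I mean -- client-side NFS op latency vs. server-side disk utilization vs. raw link usage. Exact tools and output vary by distro; these are just the usual suspects:)

    # On an NFS client: per-mount op counts and RTTs (nfsiostat ships with nfs-utils)
    nfsiostat 5

    # On the NFS server: per-device utilization and service times (sysstat)
    iostat -x 5

    # Quick check that the NICs themselves aren't the thing that's saturated
    sar -n DEV 5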
>For GPU training-type jobs that load the same set of data over and over again, local node SSD is a good solution. Especially with the dropping SSD prices.
>
Good to hear :)
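We'll probably end up with something like this in the job scripts to stage data onto the node-local SSD -- just a sketch, the paths and the training command are placeholders, and whether $TMPDIR actually lands on local flash depends on how the nodes are set up:

    #!/bin/bash
    #SBATCH --job-name=train-local-ssd
    #SBATCH --gres=gpu:1
    #SBATCH --time=04:00:00

    # Placeholder paths -- adjust to your NFS mount and local SSD mount point
    DATASET_NFS=/nfs/projects/mylab/dataset
    LOCAL_SCRATCH=${TMPDIR:-/tmp}/dataset-$SLURM_JOB_ID

    # Copy the dataset to the local SSD once per job, so repeated epochs
    # read from local flash instead of hammering the NFS server
    mkdir -p "$LOCAL_SCRATCH"
    rsync -a "$DATASET_NFS/" "$LOCAL_SCRATCH/"

    # Train against the local copy (train.py is a placeholder)
    srun python train.py --data-dir "$LOCAL_SCRATCH"

    # Clean up so the next job gets a fresh SSD
    rm -rf "$LOCAL_SCRATCH"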
>For an example architecture, take a look at the DDN "AI" or IBM "AI" solutions. I think they generally take a storage box with lots of flash storage and connect it via 2 or 4 100Gb links to something like an NVIDIA DGX (a compute node with 8 GPUs). Presumably they are doing mostly small file reads.
>
>In my case, I have whitebox compute nodes with GPUs and SSDs and whitebox ZFS servers connected at 40GbE. A fraction of the performance at a fraction of the price.
>
Same here, but connected at only 10G... Again, no budget (as of yet, anyhow) to do 25/40/50/100G network or all-flash storage :(
>Regards,
>Alex