[slurm-users] Kinda Off-Topic: data management for Slurm clusters

Alex Chekholko alex at calicolabs.com
Fri Feb 22 18:02:32 UTC 2019


Hi Will,

You have bumped into the old adage: "HPC is just about moving the
bottlenecks around".

If your bottleneck is now your network, you may want to upgrade the
network.  Then the disks will become your bottleneck :)

For GPU training-type jobs that load the same set of data over and over
again, node-local SSD is a good solution, especially with SSD prices
dropping.
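A minimal sbatch sketch of that staging pattern (the paths, sizes, and
training command are placeholders, not anything site-specific):

    #!/bin/bash
    #SBATCH --gres=gpu:1
    #SBATCH --tmp=200G        # only land on nodes reporting enough TmpDisk

    # Copy the dataset to node-local SSD once, then read it repeatedly
    # from fast local storage instead of over NFS.
    SCRATCH=/mnt/local/$USER/$SLURM_JOB_ID
    mkdir -p "$SCRATCH"
    rsync -a /nfs/projects/dataset/ "$SCRATCH/dataset/"

    python train.py --data "$SCRATCH/dataset"

    rm -rf "$SCRATCH"         # free the space for the next job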

For an example architecture, take a look at the DDN "AI" or IBM "AI"
solutions. I think they generally take a storage box with lots of flash
and connect it via 2 or 4 100Gb links to something like an NVIDIA DGX
(a compute node with 8 GPUs).  Presumably they are doing mostly
small-file reads.

In my case, I have whitebox compute nodes with GPUs and SSDs and whitebox
ZFS servers connected at 40GbE.  A fraction of the performance at a
fraction of the price.

Regards,
Alex


On Fri, Feb 22, 2019 at 9:52 AM Will Dennis <wdennis at nec-labs.com> wrote:

> Thanks for the reply, Ray.
>
> For one of my groups, on the GPU servers in their cluster, I have provided
> a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path
> ("/mnt/local" for historical reasons) that they can use for local scratch
> space. Their other servers in the cluster have a single multi-TB spinning
> disk mounted at that same path. We do not manage the data on this path at
> all; it's currently up to the researchers to put needed data there and to
> remove it when it is no longer needed. (They wanted us to auto-manage the
> removal, but we aren't in a position to know whether they still need the
> data, and "delete data if atime/mtime is older than [...]" via cron is a
> bit too simplistic.) They can use that local-disk path in any way they
> want, with the caveat that it's not to be used as permanent storage:
> there are no backups, and if we suffer a disk failure, we just replace
> the disk and the old data is gone.
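>
> For illustration, the sort of age-based sweep we decided against would be
> roughly this (the threshold and schedule are made up):
>
>     # hypothetical /etc/cron.daily/scratch-clean
>     find /mnt/local -xdev -mindepth 2 -type f -atime +30 -delete
>     find /mnt/local -xdev -mindepth 1 -type d -empty -delete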
>
> The other group has (at this moment) no local disk at all on their worker
> nodes. They actually work with even bigger data sets than the first group,
> and they are the ones that really need a solution. I figured that if I
> solve the one group's problem, I can also apply it to the other (and
> perhaps even to future Slurm clusters we spin up).
>
> A few other questions I have:
> - is it possible in Slurm to define more than one filesystem path (i.e.,
> other than "/tmp") as "TmpDisk"?
> - any way to allocate storage on a node via GRES or another method?
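>
> For the second one, what I had in mind is something like a generic,
> countable GRES per node (the names and numbers below are invented, and as
> far as I understand Slurm would only count the resource, not actually
> create or police the space):
>
>     # slurm.conf (sketch; gpu details omitted)
>     GresTypes=gpu,localtmp
>     NodeName=gpu[01-04] Gres=gpu:4,localtmp:1800 ...
>
>     # gres.conf on each node ("localtmp" counted in GB)
>     Name=localtmp Count=1800
>
>     # job submission
>     sbatch --gres=localtmp:200 job.sh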
>
>
> On Friday, February 22, 2019 12:06 PM, Raymond Wan wrote:
>
> >Hi Will,
> >
> >I'm not a system administrator, but on the cluster that I have access
> >to, that is indeed what we are given. Everything, including our home
> >directories, is NFS-mounted. Each node has a very large scratch space
> >(i.e., /tmp), which is periodically wiped; I think the sysadmins have a
> >cron job that does it occasionally.
> >
> >We also only have a 10G network, and sure, people will complain that
> >everything should be faster, but our sysadmins are doing the best they
> >can with the budget allocated to them. If people want 100G speed, then
> >they need to give the sysadmins the money to play with.  :-)
> >
> >Each research group is given a disk array (or more, depending on their
> >budget), so disk quota isn't managed by the sysadmins. If disk space is
> >exhausted, it's up to the head of the research group to either buy more
> >disk space or get their team members to share.
> >
> >I suppose if some of this data is needed across jobs, you could allocate
> >a fixed amount of quota on each node's scratch space to each lab. Then
> >you would have to teach them to write Slurm scripts that check whether
> >the file is there and, if not, make a copy of it. Of course, you'd want
> >to make sure they are careful not to have concurrent jobs...  The type
> >of data analysis I do involves an index. If jobs #1 and #2 run on the
> >same node, both will see (in, let's say, /tmp/rwan/myindex/) that the
> >index is absent and do a copy. I guess this is the tricky bit...  But
> >this kind of management is left for us users to worry about; the
> >sysadmins just give us the scratch space and it's up to us to find a way
> >to make good use of it.
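> >
> >A rough sketch of that check-and-copy step, using a lock file so two
> >jobs landing on the same node don't both start the copy (all paths here
> >are made up for illustration):
> >
> >    # inside the job script; index lives on NFS, local copy under /tmp
> >    LOCAL=/tmp/$USER/myindex
> >    LOCK=/tmp/$USER/myindex.lock
> >    mkdir -p "/tmp/$USER"
> >    (
> >      flock 9            # a second job waits here until the copy is done
> >      if [ ! -d "$LOCAL" ]; then
> >        rsync -a /nfs/lab/myindex/ "$LOCAL/"
> >      fi
> >    ) 9>"$LOCK"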
> >
> >I hope this helps.  I'm not sure if this is the kind of information you
> >were looking for?
> >
> >Ray
>
>

