[slurm-users] Kinda Off-Topic: data management for Slurm clusters

Fri Feb 22 18:01:36 UTC 2019

At least in our case we use a Lustre filesystem for scratch access, 
mounted over IB.  That said, some of our nodes only access it over 
1GbE and I have never heard any complaints about performance.  In 
general, for large-scale production work Lustre tends to be more 
resilient and performant than NFS.  You don't need SSDs; regular 
HDDs work fine.

-Paul Edmon-

On 2/22/19 12:50 PM, Will Dennis wrote:
> Thanks for the reply, Ray.
>
> For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for local scratch space. Their other servers in the cluster have a single multi-TB spinning disk mounted at that same path. We do not manage the data at all on this path; it's currently up to the researchers to put needed data there, and remove the data when it is no longer needed. (They wanted us to auto-manage the removal, but we aren't in a position to know what data they still need or not, and "delete data if atime/mtime is older than [...]" via cron is a bit too simplistic.) They can use that local-disk path in any way they want, with the caveat that it's not to be used as "permanent storage": there are no backups, and if we suffer a disk failure, etc., we just replace the disk with a new one and the old data is gone.
>
> The other group has (at this moment) no local disk at all on their worker nodes. They actually work with even bigger data sets than the first group, and they are the ones that really need a solution. I figured that if I solve the one group's problem, I can also implement the solution for the other (and perhaps even on future Slurm clusters we spin up.)
>
> A few other questions I have:
> - is it possible in Slurm to define more than one filesystem path (i.e., other than "/tmp") as "TmpDisk"?
> - any way to allocate storage on a node via GRES or another method?
>
>
> On Friday, February 22, 2019 12:06 PM, Raymond Wan wrote:
>
>> Hi Will,
>>
>> I'm not a system administrator, but on the cluster that I have access to, indeed that is what we are given.
>> Everything, including our home directories, is NFS-mounted.
>> Each node has a very large scratch space (i.e., /tmp), which is periodically deleted.  I think the sysadmins have a cron job that wipes it occasionally.
>>
>> We also only have a 10G network and sure...people will complain about how everything should be faster, but our sysadmins are doing the best they can with the budget allocated for them.  If they want 100G speed, then they need to give the money for the sysadmins to play with.  :-)
>>
>> Each research group is given a disk array (or more, depending on their budget).  And thus disk quota isn't managed by the sysadmins.  If disk space is exhausted, it's up to the head of the research group to either buy more disk space or get their team members to share.
>>
>> I suppose if some of this data is needed across jobs, you can maybe allocate a fixed amount of quota on each node's scratch space to each lab.  Then, you would have to teach them to write SLURM scripts that check if the file is there and if not, to make a copy of it.  Of course, you'd want to make sure they are careful not to have concurrent jobs...  The type of data analysis I do involves an index.  If jobs #1 and #2 run on the same node, both will see (in let's say /tmp/rwan/myindex/) that the index is absent and do a copy.  I guess this is the tricky bit...  But this kind of management is left for us users to worry about; the sysadmins just give us the scratch space and it's up to us to find a way to make good use of it.
>>
>> I hope this helps.  I'm not sure if this is the kind of information you were looking for?
>>
>> Ray
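
The "tricky bit" Ray describes, where two jobs on the same node both see the index as absent and both start copying it, can be avoided by taking an exclusive lock before the existence check. What follows is only a minimal sketch of that idea, assuming Python is available on the compute nodes and that the scratch directory is a node-local filesystem (advisory locks are not dependable on every network filesystem). The function name stage_to_scratch and the paths are hypothetical, not something from this thread, and the sketch copies a single file rather than a whole directory such as /tmp/rwan/myindex/.

```python
import fcntl
import os
import shutil
import tempfile


def stage_to_scratch(source, scratch_dir):
    """Copy `source` into `scratch_dir` once, even if several jobs race.

    Hypothetical helper: the name and arguments are illustrative only.
    Returns the path of the staged copy.
    """
    os.makedirs(scratch_dir, exist_ok=True)
    target = os.path.join(scratch_dir, os.path.basename(source))
    lock_path = target + ".lock"

    # Every job on the node opens the same lock file. flock() blocks until
    # the job currently holding the exclusive lock releases it.
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # The check happens while holding the lock, so only the first
            # job sees the file as missing; later jobs find it in place.
            if not os.path.exists(target):
                # Copy under a temporary name, then rename, so a job killed
                # mid-copy never leaves a partial file at `target`.
                fd, partial = tempfile.mkstemp(dir=scratch_dir,
                                               suffix=".partial")
                os.close(fd)
                shutil.copyfile(source, partial)
                os.replace(partial, target)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

A job would call this at the start of its script, before the analysis step, and then read from the returned path. Note that the sketch never refreshes a copy that already exists, so a changed source file would need either a new name or a manual removal of the staged copy.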