[slurm-users] Kinda Off-Topic: data management for Slurm clusters

Goetz, Patrick G pgoetz at math.utexas.edu
Tue Feb 26 20:25:28 UTC 2019


But rsync -a will only help you if people are using identical or at 
least overlapping data sets, won't it?  And you don't need rsync to 
prune out old files.


On 2/26/19 1:53 AM, Janne Blomqvist wrote:
> On 22/02/2019 18.50, Will Dennis wrote:
>> Hi folks,
>>
>> Not directly Slurm-related, but... We have a couple of research groups 
>> that have large data sets they are processing via Slurm jobs 
>> (deep-learning applications) and are presently consuming the data via 
>> NFS mounts (both groups have 10G ethernet interconnects between the 
>> Slurm nodes and the NFS servers.) They are both now complaining of 
>> “too-long loading times” for the data, and are casting about for a way 
>> to bring the needed data onto the processing node, onto fast SSD 
>> single drives (or even SSD arrays.) These local drives would be 
>> considered “scratch space”, not for long-term data storage, but for 
>> use over the lifetime of a job, or perhaps a few sequential jobs 
>> (given the nature of the work.) “Permanent” storage would remain the 
>> existing NFS servers. We don’t really have the funding for 25-100G 
>> networks and/or all-flash commercial data storage appliances (NetApp, 
>> Pure, etc.)
>>
>> Any good patterns that I might be able to learn about implementing 
>> here? We have a few ideas floating about, but I figured this already 
>> may be a solved problem in this community...
> 
> We have a similar problem, many ML users with big datasets. We use 
> Lustre over IB, but the problem isn't IO bandwidth per se but rather 
> that the datasets tend to be very suboptimal for any kind of network fs 
> (lots and lots of small files). We do have node local disks which we 
> currently have configured so that a per-job /tmp is mounted on the local 
> disk, and then cleaned up at job exit. But this isn't really good for ML 
> type workflows: even if they use the local disk, a large fraction of 
> the job runtime is then spent copying the data from Lustre to the local 
> disk, only for the data to be blown away when the job ends.
> 
> Adding quotas to node local disks doesn't really work either, as we have 
> lots of users/groups sharing our resources, and thus if we'd allocate 
> the disk space using quotas each one would be getting a uselessly small 
> amount.
> 
> One idea I've been toying with is to write some duct tape around rsync; 
> here are my notes about it:
> 
> ## datasync tool
> 
> Essentially a small wrapper around 'rsync -a'. The difference is that it
> creates SRC/.datasync and DEST/.datasync directories containing special
> metadata:
> 
> - .datasync/TIMESTAMP: The mtime of this empty file is used to check
> whether the SRC dataset is newer than the DEST dataset; if so,
> 'rsync -a' is run, otherwise the rsync can be skipped.
> 
> - DEST/.datasync/LAST_SYNCED: The mtime of this empty file records the
> last time this dataset was synced, whether or not an rsync was actually run.
> 
> - DEST/.datasync/SLURM_JOB_IDS: Contains the Slurm job IDs (if
> applicable) of the jobs that ran datasync with this DEST directory.
> 
> 
> So the idea would be that a user in the job script could do something like
> 
> #SBATCH blahblah
> srun datasync /scratch/my_group/dataset_big /l/my_group/dataset_big
> srun --gres=gpu:1 my_ML_job.py /l/my_group/dataset_big
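> 
> A very rough sketch of what the wrapper itself might look like, in
> Python (nothing of this exists yet; the function names and the exact
> comparison logic below are just placeholders to illustrate the idea):
> 
> #!/usr/bin/env python3
> # datasync (sketch, untested): thin wrapper around 'rsync -a' that
> # maintains the .datasync metadata described above.
> import os, subprocess, sys
> 
> def mtime(path):
>     try:
>         return os.path.getmtime(path)
>     except FileNotFoundError:
>         return None
> 
> def main():
>     if len(sys.argv) != 3:
>         sys.exit("usage: datasync SRC DEST")
>     src, dest = sys.argv[1], sys.argv[2]
>     src_ts = mtime(os.path.join(src, ".datasync", "TIMESTAMP"))
>     dest_meta = os.path.join(dest, ".datasync")
>     dest_ts = mtime(os.path.join(dest_meta, "TIMESTAMP"))
>     os.makedirs(dest_meta, exist_ok=True)
> 
>     # Only copy if SRC is newer than what DEST already has
>     if dest_ts is None or src_ts is None or src_ts > dest_ts:
>         subprocess.run(["rsync", "-a", src + "/", dest + "/"], check=True)
> 
>     # LAST_SYNCED is touched on every run, whether rsync ran or not
>     with open(os.path.join(dest_meta, "LAST_SYNCED"), "w"):
>         pass
> 
>     # Record the job id so the reaper can skip datasets still in use
>     jobid = os.environ.get("SLURM_JOB_ID")
>     if jobid:
>         with open(os.path.join(dest_meta, "SLURM_JOB_IDS"), "a") as f:
>             f.write(jobid + "\n")
> 
> if __name__ == "__main__":
>     main()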
> 
> 
> ## datasync-reaper
> 
> Admin tool that can be run from cron on every compute node to reap
> unused datasets based on policy, e.g. the /l partition must have at
> least 50 GB free (or be at most 70% full, or whatever).
> 
> When reaping, it searches for these special .datasync directories (up to
> a configurable recursion depth, say 2 by default), and based on the
> LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
> LAST_SYNCED, until the policy goal has been met. Directory trees without
> .datasync directories are deleted first. .datasync/SLURM_JOB_IDS is used
> as an extra safety check to not delete a dataset used by a running job.
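> 
> And a sketch of the reaper (also untested; the policy constants, the
> squeue parsing and the helper names are made up for illustration, and
> the "delete trees without .datasync first" step is left out):
> 
> #!/usr/bin/env python3
> # datasync-reaper (sketch, untested): run from cron on each node,
> # delete least-recently-synced datasets until the policy is satisfied.
> import os, shutil, subprocess
> 
> SCRATCH = "/l"            # local scratch partition to police
> MIN_FREE = 50 * 2**30     # policy: keep at least 50 GB free
> MAX_DEPTH = 2             # how deep to look for .datasync dirs
> 
> def free_bytes(path):
>     st = os.statvfs(path)
>     return st.f_bavail * st.f_frsize
> 
> def running_jobs():
>     # Job ids currently running on this node, from squeue
>     out = subprocess.run(["squeue", "-h", "-o", "%A", "-w",
>                           os.uname().nodename],
>                          capture_output=True, text=True)
>     return set(out.stdout.split())
> 
> def find_datasets(root, depth):
>     # Yield (LAST_SYNCED mtime, path) for every dir holding .datasync
>     for dirpath, dirnames, _ in os.walk(root):
>         if ".datasync" in dirnames:
>             ts = os.path.join(dirpath, ".datasync", "LAST_SYNCED")
>             try:
>                 yield (os.path.getmtime(ts), dirpath)
>             except FileNotFoundError:
>                 yield (0.0, dirpath)   # no timestamp: treat as oldest
>             dirnames[:] = []           # don't descend into a dataset
>         elif dirpath[len(root):].count(os.sep) >= depth:
>             dirnames[:] = []           # stop below the recursion depth
> 
> def in_use(dataset, active):
>     ids = os.path.join(dataset, ".datasync", "SLURM_JOB_IDS")
>     try:
>         with open(ids) as f:
>             return bool(active & set(f.read().split()))
>     except FileNotFoundError:
>         return False
> 
> def main():
>     active = running_jobs()
>     for _, dataset in sorted(find_datasets(SCRATCH, MAX_DEPTH)):
>         if free_bytes(SCRATCH) >= MIN_FREE:
>             break                      # policy goal met
>         if in_use(dataset, active):
>             continue                   # safety check: job still running
>         shutil.rmtree(dataset)
> 
> if __name__ == "__main__":
>     main()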
> 
> 
> 
> But nothing concrete has been done yet. Anyway, I'm open to suggestions 
> for better ideas, or existing tools that already solve this problem.
> 
> 


