[slurm-users] Kinda Off-Topic: data management for Slurm clusters
Goetz, Patrick G
pgoetz at math.utexas.edu
Tue Feb 26 20:25:28 UTC 2019
But rsync -a will only help you if people are using identical, or at
least overlapping, data sets, right? And you don't need rsync to prune
out old files.
On 2/26/19 1:53 AM, Janne Blomqvist wrote:
> On 22/02/2019 18.50, Will Dennis wrote:
>> Hi folks,
>>
>> Not directly Slurm-related, but... We have a couple of research groups
>> that have large data sets they are processing via Slurm jobs
>> (deep-learning applications) and are presently consuming the data via
>> NFS mounts (both groups have 10G ethernet interconnects between the
>> Slurm nodes and the NFS servers.) They are both now complaining of
>> “too-long loading times” for the data, and are casting about for a way
>> to bring the needed data onto the processing node, onto fast SSD
>> single drives (or even SSD arrays.) These local drives would be
>> considered “scratch space”, not for long-term data storage, but for
>> use over the lifetime of a job, or maybe perhaps a few sequential jobs
>> (given the nature of the work.) “Permanent” storage would remain the
>> existing NFS servers. We don’t really have the funding for 25-100G
>> networks and/or all-flash commercial data storage appliances (NetApp,
>> Pure, etc.)
>>
>> Any good patterns that I might be able to learn about implementing
>> here? We have a few ideas floating about, but I figured this already
>> may be a solved problem in this community...
>
> We have a similar problem, many ML users with big datasets. We use
> Lustre over IB, but the problem isn't IO bandwidth per se but rather
> that the datasets tend to be very suboptimal for any kind of network fs
> (lots and lots of small files). We do have node local disks which we
> currently have configured so that a per-job /tmp is mounted on the local
> disk, and then cleaned up at job exit. But this isn't really good for
> ML-type workflows: even if they use the local disk, a large fraction of
> the job runtime is spent copying the data from Lustre to the local
> disk, only for the data to be blown away when the job ends.
>
> Adding quotas to node-local disks doesn't really work either: we have
> lots of users/groups sharing our resources, so if we allocated the disk
> space using quotas, each one would get a uselessly small amount.
>
> One idea I've been toying with is to write some duct tape around rsync;
> here are my notes about it:
>
> ## datasync tool
>
> Essentially a small wrapper around 'rsync -a'. The difference is that it
> creates SRC/.datasync and DEST/.datasync directories containing special
> metadata:
>
> - .datasync/TIMESTAMP: The mtime of this empty file is used to check
> whether the SRC dataset is newer than the DEST dataset; if it is, run
> 'rsync -a', otherwise the rsync can be skipped.
>
> - DEST/.datasync/LAST_SYNCED: The mtime of this empty file records the
> last time datasync was run against this dataset, whether or not an
> actual rsync was needed.
>
> - DEST/.datasync/SLURM_JOB_IDS: Contains the Slurm job IDs (if
> applicable) of the jobs that ran datasync with this DEST directory.
>
>
> So the idea would be that a user in the job script could do something like
>
> #SBATCH blahblah
> srun datasync /scratch/my_group/dataset_big /l/my_group/dataset_big
> srun --gres=gpu:1 my_ML_job.py /l/my_group/dataset_big
>
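> A very rough bash sketch of what the datasync wrapper itself might look
> like (nothing implemented yet; excluding .datasync from the rsync and
> having whoever maintains the dataset bump SRC/.datasync/TIMESTAMP are
> just assumptions on my part):
>
> #!/bin/bash
> # datasync SRC DEST -- sketch of an 'rsync -a' wrapper with .datasync metadata
> set -eu
>
> SRC=$1
> DEST=$2
>
> mkdir -p "$SRC/.datasync" "$DEST/.datasync"
> # SRC/.datasync/TIMESTAMP should be bumped by whoever updates the dataset;
> # create it if missing so the first sync always runs.
> [ -e "$SRC/.datasync/TIMESTAMP" ] || touch "$SRC/.datasync/TIMESTAMP"
>
> # Sync only if SRC is newer than what DEST has already seen.
> if [ ! -e "$DEST/.datasync/TIMESTAMP" ] || \
>    [ "$SRC/.datasync/TIMESTAMP" -nt "$DEST/.datasync/TIMESTAMP" ]; then
>     rsync -a --exclude=.datasync "$SRC/" "$DEST/"
>     touch -r "$SRC/.datasync/TIMESTAMP" "$DEST/.datasync/TIMESTAMP"
> fi
>
> # Record when this dataset was last requested, and by which job.
> touch "$DEST/.datasync/LAST_SYNCED"
> if [ -n "${SLURM_JOB_ID:-}" ]; then
>     echo "$SLURM_JOB_ID" >> "$DEST/.datasync/SLURM_JOB_IDS"
> fi
>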
>
> ## datasync-reaper
>
> Admin tool that can be run from cron on every compute node to reap
> unused datasets based on policy, e.g. the /l partition must have at
> least 50GB free (or be at most 70% full, or whatever).
>
> When reaping, it searches for these special .datasync directories (up to
> a configurable recursion depth, say 2 by default), and based on the
> LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
> LAST_SYNCED, until the policy goal has been met. Directory trees without
> .datasync directories are deleted first. .datasync/SLURM_JOB_IDS is used
> as an extra safety check to not delete a dataset used by a running job.
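>
> Roughly, the reaper loop might look something like this (again only a
> sketch; the /l path, the 50GB target, the depth, and using squeue to
> check SLURM_JOB_IDS against running jobs are placeholder choices, and
> the "delete trees without .datasync first" rule is left out here):
>
> #!/bin/bash
> # datasync-reaper -- sketch: delete least-recently-used datasets until
> # the local scratch partition has enough free space again.
> set -eu
>
> SCRATCH=/l
> MIN_FREE_GB=50
> MAX_DEPTH=2     # how deep below $SCRATCH dataset directories may live
>
> free_gb() { df -BG --output=avail "$SCRATCH" | tail -1 | tr -dc '0-9'; }
>
> # True if any job recorded in SLURM_JOB_IDS is still running on this node.
> in_use() {
>     local ids="$1/.datasync/SLURM_JOB_IDS"
>     [ -f "$ids" ] || return 1
>     squeue -h -o '%A' -w "$(hostname -s)" | grep -qxFf "$ids"
> }
>
> # Walk candidate datasets, oldest LAST_SYNCED first, deleting until the
> # free-space goal is met.
> find "$SCRATCH" -mindepth 3 -maxdepth "$((MAX_DEPTH + 2))" -type f \
>      -path '*/.datasync/LAST_SYNCED' -printf '%T@ %p\n' |
> sort -n |
> while read -r _ stampfile; do
>     if [ "$(free_gb)" -ge "$MIN_FREE_GB" ]; then
>         break
>     fi
>     dataset=$(dirname "$(dirname "$stampfile")")
>     if in_use "$dataset"; then
>         continue    # a running job registered this dataset; skip it
>     fi
>     echo "reaping $dataset"
>     rm -rf -- "$dataset"
> done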
>
>
>
> But nothing concrete has been done yet. Anyway, I'm open to suggestions
> about better ideas, or existing tools that already solve this problem.
>
>