[slurm-users] Kinda Off-Topic: data management for Slurm clusters

Wed Feb 27 07:32:26 UTC 2019

Hi,

I was perhaps a bit imprecise, sorry about that. The point of the "datasync" tool and the "datasync-reaper" cron job would be to replace or augment the per-job /tmp that is cleaned at the end of each job. Datasets would then be left on the node-local disks until they are deleted by datasync-reaper (and indeed, datasync-reaper wouldn't need to use rsync; a plain rm -rf suffices). Multiple jobs that need the same dataset and happen to be allocated to the same node could thus reuse the cached copy instead of rsync'ing it every time. Depending on the workflow, datasets could be shared among multiple users or used by a single user, but that would be up to each user or research group, not something dictated by the tool itself.
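
As a rough illustration of the reaper policy, a minimal Python sketch might look like the following. Nothing here is an existing tool: the function names find_datasets and reap, the free_bytes hook, and all parameters are hypothetical. The SLURM_JOB_IDS safety check and the rule of deleting unmarked directory trees first are left out for brevity.

```python
import shutil
from pathlib import Path


def find_datasets(root, max_depth=2):
    """Yield directories below root that contain .datasync/LAST_SYNCED.

    Hypothetical helper: the search stops at max_depth levels below root
    and does not descend into a directory already identified as a dataset.
    """
    stack = [(Path(root), 0)]
    while stack:
        path, depth = stack.pop()
        # Never treat the root itself as a dataset, even if it has a marker.
        if depth > 0 and (path / ".datasync" / "LAST_SYNCED").is_file():
            yield path
            continue
        if depth < max_depth:
            for child in path.iterdir():
                if child.is_dir() and not child.is_symlink():
                    stack.append((child, depth + 1))


def reap(root, min_free_bytes, free_bytes=None, max_depth=2):
    """Delete datasets, oldest LAST_SYNCED first, until enough space is free.

    free_bytes is an assumed hook (a zero-argument callable) so that the
    policy can be tested without filling a real disk; by default it asks
    the filesystem holding root. Returns the list of deleted paths.
    """
    if free_bytes is None:
        def free_bytes():
            return shutil.disk_usage(root).free

    def last_synced(dataset):
        return (dataset / ".datasync" / "LAST_SYNCED").stat().st_mtime

    deleted = []
    for dataset in sorted(find_datasets(root, max_depth), key=last_synced):
        if free_bytes() >= min_free_bytes:
            break  # policy goal met, keep the remaining (newer) datasets
        shutil.rmtree(dataset)  # the "plain rm -rf" step
        deleted.append(dataset)
    return deleted
```

Under these assumptions, a cron job on each node could call something like reap("/l", 50 * 1024**3) to aim for at least 50 GiB free on /l.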

--
Janne Blomqvist

________________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Goetz, Patrick G <pgoetz at math.utexas.edu>
Sent: Tuesday, February 26, 2019 22:25
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

But rsync -a will only help you if people are using identical or at
least overlapping data sets?  And you don't need rsync to prune out old
files.

On 2/26/19 1:53 AM, Janne Blomqvist wrote:
> On 22/02/2019 18.50, Will Dennis wrote:
>> Hi folks,
>>
>> Not directly Slurm-related, but... We have a couple of research groups
>> that have large data sets they are processing via Slurm jobs
>> (deep-learning applications) and are presently consuming the data via
>> NFS mounts (both groups have 10G ethernet interconnects between the
>> Slurm nodes and the NFS servers.) They are both now complaining of
>> “too-long loading times” for the data, and are casting about for a way
>> to bring the needed data onto the processing node, onto fast SSD
>> single drives (or even SSD arrays.) These local drives would be
>> considered “scratch space”, not for long-term data storage, but for
>> use over the lifetime of a job, or perhaps a few sequential jobs
>> (given the nature of the work.) “Permanent” storage would remain the
>> existing NFS servers. We don’t really have the funding for 25-100G
>> networks and/or all-flash commercial data storage appliances (NetApp,
>> Pure, etc.)
>>
>> Any good patterns that I might be able to learn about implementing
>> here? We have a few ideas floating about, but I figured this already
>> may be a solved problem in this community...
>
> We have a similar problem: many ML users with big datasets. We use
> Lustre over IB, but the problem isn't IO bandwidth per se but rather
> that the datasets tend to be very suboptimal for any kind of network fs
> (lots and lots of small files). We do have node local disks which we
> currently have configured so that a per-job /tmp is mounted on the local
> disk, and then cleaned up at job exit. But this isn't really good for ML
> type workflows, as even if they use the local disk a large fraction of
> the job runtime is then spent copying the data from Lustre to the local
> disk, only for the data to be blown away when the job ends.
>
> Adding quotas to node local disks doesn't really work either, as we have
> lots of users/groups sharing our resources, and thus if we'd allocate
> the disk space using quotas each one would be getting a uselessly small
> amount.
>
> One idea I've been toying with is to write some duct tape around rsync;
> here are my notes about it:
>
> ## datasync tool
>
> Essentially a small wrapper around 'rsync -a'. The difference is that it
> creates SRC/.datasync and DEST/.datasync directories containing special
> metadata:
>
> - .datasync/TIMESTAMP: The mtime of this empty file is used to check
> whether the SRC dataset is newer than the DEST dataset; if so, run
> 'rsync -a', otherwise rsync can be skipped.
>
> - DEST/.datasync/LAST_SYNCED: mtime of this empty file tells the last
> time this dataset was synced, whether any rsync was run or not.
>
> - DEST/.datasync/SLURM_JOB_IDS: Contains the Slurm job IDs (if
> applicable) of the jobs that ran datasync with this DEST directory.
>
>
> So the idea would be that a user in the job script could do something like
>
> #SBATCH blahblah
> srun datasync /scratch/my_group/dataset_big /l/my_group/dataset_big
> srun --gres=gpu:1 my_ML_job.py /l/my_group/dataset_big
>
>
> ## datasync-reaper
>
> Admin tool that can be run from cron on every compute node to reap
> unused datasets based on policy, e.g. /l partition must have at least
> 50GB free (or max 70% full, or whatever).
>
> When reaping, it searches for these special .datasync directories (up to
> a configurable recursion depth, say 2 by default), and based on the
> LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
> LAST_SYNCED, until the policy goal has been met. Directory trees without
> .datasync directories are deleted first. .datasync/SLURM_JOB_IDS is used
> as an extra safety check to not delete a dataset used by a running job.
>
>
>
> But nothing concrete done yet. Anyway, I'm open to suggestions about
> better ideas, or existing tools that already solve this problem.
>
>