[slurm-users] Kinda Off-Topic: data management for Slurm clusters

Tue Feb 26 07:53:06 UTC 2019

On 22/02/2019 18.50, Will Dennis wrote:
> Hi folks,
> 
> Not directly Slurm-related, but... We have a couple of research groups 
> that have large data sets they are processing via Slurm jobs 
> (deep-learning applications) and are presently consuming the data via 
> NFS mounts (both groups have 10G ethernet interconnects between the 
> Slurm nodes and the NFS servers.) They are both now complaining of 
> “too-long loading times” for the data, and are casting about for a way 
> to bring the needed data onto the processing node, onto fast SSD single 
> drives (or even SSD arrays.) These local drives would be considered 
> “scratch space”, not for long-term data storage, but for use over the 
> lifetime of a job, or maybe perhaps a few sequential jobs (given the 
> nature of the work.) “Permanent” storage would remain the existing NFS 
> servers. We don’t really have the funding for 25-100G networks and/or 
> all-flash commercial data storage appliances (NetApp, Pure, etc.)
> 
> Any good patterns that I might be able to learn about implementing here? 
> We have a few ideas floating about, but I figured this already may be a 
> solved problem in this community...

We have a similar problem: many ML users with big datasets. We use 
Lustre over IB, but the problem isn't IO bandwidth per se but rather 
that the datasets tend to be very suboptimal for any kind of network fs 
(lots and lots of small files). We do have node local disks which we 
currently have configured so that a per-job /tmp is mounted on the local 
disk, and then cleaned up at job exit. But this isn't really good for ML 
type workflows, as even if they use the local disk a large fraction of 
the job runtime is then spent copying the data from Lustre to the local 
disk, only for the data to be blown away when the job ends.

Adding quotas to node local disks doesn't really work either, as we have 
lots of users/groups sharing our resources, and thus if we'd allocate 
the disk space using quotas each one would be getting a uselessly small 
amount.

One idea I've been toying with is to write some duct tape around rsync. 
Here are my notes about it:

## datasync tool

Essentially a small wrapper around 'rsync -a'. The difference is that it
creates SRC/.datasync and DEST/.datasync directories containing special
metadata:

- .datasync/TIMESTAMP: The mtime of this empty file is used to check
whether the SRC dataset is newer than the DEST dataset. If it is, run
'rsync -a'; otherwise rsync can be skipped.

- DEST/.datasync/LAST_SYNCED: mtime of this empty file tells the last
time this dataset was synced, whether any rsync was run or not.

- DEST/.datasync/SLURM_JOB_IDS: Contains the Slurm job IDs (if
applicable) of the jobs that ran datasync with this DEST directory.

So the idea would be that a user in the job script could do something like

#SBATCH blahblah
srun datasync /scratch/my_group/dataset_big /l/my_group/dataset_big
srun --gres=gpu:1 my_ML_job.py /l/my_group/dataset_big
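
A minimal Python sketch of the datasync wrapper described above, under some assumptions the notes leave open: SRC's TIMESTAMP is created on first use and is touched by whoever changes the dataset, DEST's TIMESTAMP is set to the same mtime after a successful copy, and the job ID is taken from the SLURM_JOB_ID environment variable. The function names and the injectable `copy` parameter are hypothetical, and concurrent runs against the same DEST are not handled.

```python
import os
import subprocess
from pathlib import Path

META = ".datasync"


def run_rsync(src: Path, dest: Path) -> None:
    # Trailing slashes copy the contents of src into dest. The metadata
    # directory is excluded because this script manages it itself.
    subprocess.run(
        ["rsync", "-a", f"--exclude=/{META}/", f"{src}/", f"{dest}/"],
        check=True,
    )


def datasync(src, dest, copy=run_rsync) -> bool:
    """Sync src to dest if needed; return True if a copy was performed."""
    src, dest = Path(src), Path(dest)
    src_stamp = src / META / "TIMESTAMP"
    dest_stamp = dest / META / "TIMESTAMP"
    last_synced = dest / META / "LAST_SYNCED"

    # Assumption: SRC's TIMESTAMP is created on first use, and whoever
    # changes the dataset touches it afterwards.
    src_stamp.parent.mkdir(exist_ok=True)
    if not src_stamp.exists():
        src_stamp.touch()
    dest_stamp.parent.mkdir(parents=True, exist_ok=True)

    # Record usage before copying, so a reaper running at the same time
    # can see that the dataset belongs to a job.
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id:
        with open(dest / META / "SLURM_JOB_IDS", "a") as f:
            f.write(job_id + "\n")
    last_synced.touch()

    src_ns = src_stamp.stat().st_mtime_ns
    stale = not dest_stamp.exists() or src_ns > dest_stamp.stat().st_mtime_ns
    if stale:
        copy(src, dest)
        # Mark DEST as being as new as SRC was when the copy started.
        dest_stamp.touch()
        os.utime(dest_stamp, ns=(src_ns, src_ns))

    # LAST_SYNCED is updated whether or not rsync actually ran.
    last_synced.touch()
    return stale
```

With this layout, a second job that lands on the same node shortly afterwards only pays for two stat calls instead of a full copy from the network file system.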

## datasync-reaper

Admin tool that can be run from cron on every compute node to reap
unused datasets based on policy, e.g. /l partition must have at least
50GB free (or max 70% full, or whatever).

When reaping, it searches for these special .datasync directories (up to
a configurable recursion depth, say 2 by default), and based on the
LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
LAST_SYNCED, until the policy goal has been met. Directory trees without
.datasync directories are deleted first. .datasync/SLURM_JOB_IDS is used
as an extra safety check to not delete a dataset used by a running job.
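
The reaping order above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: a directory at the maximum depth without a .datasync directory counts as an unmanaged tree, the policy is "at least `min_free_bytes` free", and the set of running job IDs comes from `squeue` on the assumption that the Slurm node name equals the short hostname. All function names here are hypothetical.

```python
import shutil
import socket
import subprocess
from pathlib import Path

META = ".datasync"


def running_job_ids() -> set:
    # Assumption: the Slurm node name equals the short hostname.
    host = socket.gethostname().split(".")[0]
    out = subprocess.run(
        ["squeue", "--noheader", "--format=%A", f"--nodelist={host}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def scan(root: Path, max_depth: int = 2):
    """Return (unmanaged trees, managed datasets ordered oldest first)."""
    unmanaged, managed = [], []

    def walk(directory: Path, depth: int) -> None:
        for child in sorted(directory.iterdir()):
            if child.is_symlink() or not child.is_dir() or child.name == META:
                continue
            if (child / META).is_dir():
                stamp = child / META / "LAST_SYNCED"
                mtime = stamp.stat().st_mtime if stamp.exists() else 0.0
                managed.append((mtime, child))
            elif depth < max_depth:
                walk(child, depth + 1)
            else:
                unmanaged.append(child)

    walk(root, 1)
    managed.sort(key=lambda item: item[0])
    return unmanaged, [path for _, path in managed]


def in_use(tree: Path, running: set) -> bool:
    # Safety check: skip datasets recorded by a job that is still running.
    ids_file = tree / META / "SLURM_JOB_IDS"
    if not ids_file.exists():
        return False
    return bool(running & set(ids_file.read_text().split()))


def reap(root, min_free_bytes, running, max_depth=2,
         free_bytes=None, delete=shutil.rmtree):
    """Delete trees until the free-space goal is met; return what was removed."""
    root = Path(root)
    if free_bytes is None:
        free_bytes = lambda: shutil.disk_usage(root).free
    unmanaged, managed = scan(root, max_depth)
    removed = []
    for tree in unmanaged + managed:  # unmanaged trees go first
        if free_bytes() >= min_free_bytes:
            break
        if in_use(tree, running):
            continue
        delete(tree)
        removed.append(tree)
    return removed
```

A cron job might call `reap("/l", 50 * 10**9, running_job_ids())`. There is still a window between the job check and the delete in which a new job could start using a dataset, so a real tool would probably want locking on top of this.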

But nothing concrete done yet. Anyway, I'm open to suggestions about 
better ideas, or existing tools that already solve this problem.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqvist at aalto.fi