[slurm-users] Kinda Off-Topic: data management for Slurm clusters
Raymond Wan
rwan.work at gmail.com
Fri Feb 22 17:05:49 UTC 2019
Hi Will,
On 23/2/2019 12:50 AM, Will Dennis wrote:
...
> would be considered “scratch space”, not for long-term data
> storage, but for use over the lifetime of a job, or maybe
> perhaps a few sequential jobs (given the nature of the
> work.) “Permanent” storage would remain the existing NFS
> servers. We don’t really have the funding for 25-100G
> networks and/or all-flash commercial data storage appliances
> (NetApp, Pure, etc.)
>
> Any good patterns that I might be able to learn about
> implementing here? We have a few ideas floating about, but I
> figured this already may be a solved problem in this
> community...
I'm not a system administrator, but on the cluster I have
access to, that is indeed what we are given.
Everything, including our home directories, is NFS-mounted.
Each node has a very large scratch space (i.e., /tmp),
which is periodically wiped; I think the sysadmins have a
cron job that clears it out occasionally.
We also only have a 10G network, and sure, people will
complain that everything should be faster, but our
sysadmins are doing the best they can with the budget
allocated to them. If users want 100G speeds, then they
need to provide the money for the sysadmins to play with. :-)
Each research group is given a disk array (or more,
depending on its budget), so disk quotas aren't managed by
the sysadmins. If disk space is exhausted, it's up to the
head of the research group either to buy more disk space or
to get their team members to share what they have.
I suppose that if some of this data is needed across jobs,
you could allocate a fixed amount of quota on each node's
scratch space to each lab. Then you would have to teach the
users to write SLURM scripts that check whether the file is
already there and, if not, make a copy of it (something
like the sketch below). Of course, you'd want to make sure
they are careful about concurrent jobs...
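As a rough illustration, a minimal batch script for that
"copy if absent" pattern might look like the following. All
paths and the my_analysis command are made up; substitute
your own NFS and scratch layout. Note this naive version
still has the concurrency problem I describe next.

  #!/bin/bash
  #SBATCH --job-name=stage-index
  #SBATCH --ntasks=1

  # Hypothetical locations: permanent copy on NFS, working copy in
  # node-local scratch.
  NFS_INDEX=/nfs/mylab/reference/myindex
  SCRATCH=/tmp/$USER/myindex

  # Stage the index to node-local scratch only if it isn't already there.
  if [ ! -d "$SCRATCH" ]; then
      mkdir -p "$SCRATCH"
      cp -r "$NFS_INDEX"/. "$SCRATCH"/
  fi

  # Run the analysis against the local copy (my_analysis is a placeholder).
  # my_analysis --index "$SCRATCH" --input "$SLURM_SUBMIT_DIR"/sample.dat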
The type of data analysis I do involves an index. If jobs
#1 and #2 run on the same node, both will see that the
index is absent (in, let's say, /tmp/rwan/myindex/) and
each will make its own copy. I guess this is the tricky
bit... But this kind of management is left for us users to
worry about; the sysadmins just give us the scratch space,
and it's up to us to find a way to make good use of it.
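For what it's worth, one way to avoid the double-copy
(this is just my suggestion, not something our sysadmins
provide) is to serialize the stage-in with a lock file,
assuming flock(1) from util-linux is available on the
compute nodes. Again, every path here is hypothetical:

  #!/bin/bash
  #SBATCH --job-name=stage-index-locked
  #SBATCH --ntasks=1

  NFS_INDEX=/nfs/mylab/reference/myindex
  SCRATCH_BASE=/tmp/$USER
  SCRATCH=$SCRATCH_BASE/myindex

  mkdir -p "$SCRATCH_BASE"

  # Take an exclusive lock on a per-index lock file so that if two jobs
  # land on the same node at once, only the first performs the copy;
  # the second blocks here until the copy is done and then reuses it.
  (
      flock -x 9
      if [ ! -e "$SCRATCH/.staged" ]; then
          mkdir -p "$SCRATCH"
          cp -r "$NFS_INDEX"/. "$SCRATCH"/
          touch "$SCRATCH/.staged"   # marker: copy completed without interruption
      fi
  ) 9>"$SCRATCH_BASE/myindex.lock"

  # Both jobs can now safely read the same local copy.
  # my_analysis --index "$SCRATCH" ...

The .staged marker is there so that a job which finds a
half-finished copy (say, after a job was killed mid-copy)
doesn't trust it and re-stages instead.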
I hope this helps; I'm not sure whether this is the kind of
information you were looking for.
Ray