[slurm-users] Kinda Off-Topic: data management for Slurm clusters

Tue Feb 26 08:24:31 UTC 2019

Hi Janne,

On Tue, Feb 26, 2019 at 3:56 PM Janne Blomqvist
<janne.blomqvist at aalto.fi> wrote:
> When reaping, it searches for these special .datasync directories (up to
> a configurable recursion depth, say 2 by default), and based on the
> LAST_SYNCED timestamps, deletes entire datasets starting with the oldest
> LAST_SYNCED, until the policy goal has been met. Directory trees without
> .datasync directories are deleted first. .datasync/SLURM_JOB_IDS is used
> as an extra safety check to not delete a dataset used by a running job.
>
> But nothing concrete done yet. Anyway, I'm open to suggestions about
> better ideas, or existing tools that already solve this problem.

Interesting idea!  As I mentioned earlier, I copy data sets manually,
since in our case the system administrators aren't responsible for
this.  It would be nice if they did something like this for us
users.
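
To check that I understand the reaping policy quoted above, here is a
rough Python sketch of how I read it.  Only the .datasync directory,
the LAST_SYNCED and SLURM_JOB_IDS names, and the depth of 2 come from
your description.  Everything else is my assumption: that LAST_SYNCED
is a file holding an epoch timestamp, that SLURM_JOB_IDS is a file of
whitespace-separated job IDs, that the policy goal is a number of
bytes to free, and all the function names.  It only plans the
deletions; it does not delete anything.

```python
import os
from pathlib import Path

MARKER = ".datasync"  # marker directory name from the description


def subdirs(path):
    """Real (non-symlink) subdirectories of path, in sorted order."""
    return sorted(p for p in Path(path).iterdir()
                  if p.is_dir() and not p.is_symlink())


def find_datasets(root, max_depth=2):
    """Return (marked, unmarked) directories under root.

    marked: directories containing MARKER, searched up to max_depth.
    unmarked: top-level trees with no MARKER found within max_depth.
    """
    marked, unmarked = [], []

    def walk(path, depth):
        if (path / MARKER).is_dir():
            marked.append(path)
            return True
        if depth >= max_depth:
            return False
        # Build a list first so every child is visited (no short-circuit).
        return any([walk(child, depth + 1) for child in subdirs(path)])

    for top in subdirs(root):
        if not walk(top, 1):
            unmarked.append(top)
    return marked, unmarked


def last_synced(dataset):
    # Assumption: LAST_SYNCED is a file holding seconds since the epoch.
    return float((dataset / MARKER / "LAST_SYNCED").read_text().strip())


def job_ids(dataset):
    # Assumption: SLURM_JOB_IDS is a file of whitespace-separated job IDs.
    f = dataset / MARKER / "SLURM_JOB_IDS"
    return set(f.read_text().split()) if f.is_file() else set()


def tree_size(path):
    """Total size in bytes of all files below path (symlinks not followed)."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total


def plan_reap(root, bytes_to_free, running_jobs, max_depth=2):
    """Return the directories to delete, in order, to free bytes_to_free.

    Unmarked trees come first, then datasets by oldest LAST_SYNCED.
    Datasets listing a job in running_jobs are skipped.
    """
    marked, unmarked = find_datasets(root, max_depth)
    plan, freed = [], 0
    for path in unmarked + sorted(marked, key=last_synced):
        if freed >= bytes_to_free:
            break
        if job_ids(path) & set(running_jobs):
            continue  # safety check: dataset is used by a running job
        plan.append(path)
        freed += tree_size(path)
    return plan
```

In practice running_jobs would have to come from the scheduler (for
example by parsing squeue output), which I have left out here.  The
sketch also ignores an unmarked directory that sits next to a marked
one inside the same top-level tree.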

I was wondering whether SLURM could be configured to help this
along.  For example, if there are 12 nodes and 3 research groups,
can it be configured so that a job from research group A is allocated
to a node that already holds that group's data?  I guess the local
data would act like a "resource" that each node either has or does
not have, with that state changing dynamically.  As I only have
limited knowledge of system administration (I do co-administer a much
smaller cluster that doesn't have this problem), I wonder whether
something like this is possible.  If so, some profiling with a real
set of users as guinea pigs :-) would be interesting, to see whether
it actually gives noticeable benefits to users.
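
One way this might be approximated, as far as I can tell, is with
Slurm's node features: tag a node with a feature once a data set has
been copied to it, and have jobs ask for that feature with
--constraint.  The Python sketch below only builds the command lines
and does not run them.  The "data_" feature naming, the node and
script names, and the helper functions are all made up for
illustration, and I have not tried this on a real cluster.

```python
def feature_name(dataset):
    # Hypothetical naming convention for "this node holds <dataset>".
    return "data_" + dataset


def tag_node_command(node, datasets):
    """Build the scontrol command that marks which data sets a node holds.

    Note: as far as I understand, this replaces the node's existing
    feature list, and running it needs administrator privileges.
    """
    features = ",".join(feature_name(d) for d in sorted(datasets))
    return ["scontrol", "update", f"NodeName={node}",
            f"AvailableFeatures={features}",
            f"ActiveFeatures={features}"]


def submit_command(dataset, script):
    """Build an sbatch command restricted to nodes holding dataset."""
    return ["sbatch", f"--constraint={feature_name(dataset)}", script]
```

For example, tag_node_command("node01", ["groupA"]) gives the
arguments for "scontrol update NodeName=node01
AvailableFeatures=data_groupA ActiveFeatures=data_groupA", and
submit_command("groupA", "job.sh") gives "sbatch
--constraint=data_groupA job.sh".  Whatever copies or reaps the data
would have to update the features each time, which is where the
"dynamically changing" part comes in.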

Ray