[slurm-users] Staging data on the nodes one will be processing on via sbatch

Fulcomer, Samuel samuel_fulcomer at brown.edu
Sat Apr 3 19:59:19 UTC 2021


Unfortunately this is not a good workflow.

You would submit a staging job and make the compute job depend on it;
however, in the meantime the scheduler might launch higher-priority jobs
that want the scratch space, causing your staged data to be scrubbed.

In a rational process, the scratch space would be scrubbed for the
higher-priority jobs. I'm now thinking about how a scheduler could
account for data turds left behind by previous jobs, but that's not
currently a feature of Slurm's multifactor priority plugin, or of any
other scheduler I know of.

The best current workflow is to stage data into fast local persistent
storage and then schedule jobs, or to schedule a single job that stages
synchronously (TimeLimit = staging + compute). The latter is pretty
unsocial and wastes cycles.
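A minimal sketch of that synchronous stage-compute-scrub pattern in one
batch script. All paths and names are placeholders (not from this thread),
and the compute step is reduced to reading the staged file so the sketch is
self-contained:

```shell
#!/bin/bash
# Sketch: stage, compute, and scrub inside a single allocation.
# STAGE_SRC / LOCAL_SCRATCH are hypothetical; substitute your NFS
# dataset path and your node-local fast scratch mount.
#SBATCH --time=04:00:00   # TimeLimit must cover staging + compute

SRC="${STAGE_SRC:-$(mktemp -d)}"        # defaults to a temp dir for the demo
echo demo > "$SRC/input.dat"            # stand-in data so the sketch runs
WORK="${LOCAL_SCRATCH:-/tmp}/stage.$$"  # per-job dir on local scratch

trap 'rm -rf "$WORK"' EXIT              # scrub the staged copy even on failure
mkdir -p "$WORK"
cp -a "$SRC"/. "$WORK"/                 # stage: NFS -> local disk
cat "$WORK/input.dat"                   # real job: compute against "$WORK"
```

Because the copy, the computation, and the cleanup all run inside one
allocation, they are guaranteed to use the same node's scratch disk, at the
cost of holding the (possibly GPU) allocation during the transfer.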

On Sat, Apr 3, 2021 at 3:45 PM Will Dennis <wdennis at nec-labs.com> wrote:

> Hi all,
>
>
>
> We have various NFS servers that contain the data that our researchers
> want to process. These are mounted on our Slurm clusters on well-known
> paths. Also, the nodes have local fast scratch disk on another well-known
> path. We do not have any distributed file systems in use (Our Slurm
> clusters are basically just collections of hetero nodes of differing types,
> not a traditional HPC setup by any means.)
>
>
>
> In most cases, the researchers can process the data directly off the NFS
> mounts without it causing any issues, but in some cases, this slows down
> the computation unacceptably. They could manually copy the data to the
> local drive using an allocation and srun commands, but I am wondering if
> there is a way to do this in sbatch?
>
>
>
> I tried this method:
>
>
>
> wdennis at submit01 ~> sbatch transfer.sbatch
>
> Submitted batch job 329572
>
> wdennis at submit01 ~> sbatch --dependency=afterok:329572 test_job.sbatch
>
> Submitted batch job 329573
>
> wdennis at submit01 ~>  sbatch --dependency=afterok:329573 rm_data.sbatch
>
> Submitted batch job 329574
>
> wdennis at submit01 ~>
>
> wdennis at submit01 ~> squeue
>
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>
>             329573       gpu wdennis_  wdennis PD       0:00      1 (Dependency)
>
>             329574       gpu wdennis_  wdennis PD       0:00      1 (Dependency)
>
>             329572       gpu wdennis_  wdennis  R       0:23      1 compute-gpu02
>
>
>
> But it seems the --dependency jobs do not preserve the node allocated to
> the first job:
>
>
>
>
> JobID|JobName|User|Partition|NodeList|AllocCPUS|ReqMem|CPUTime|QOS|State|ExitCode|AllocTRES|
>
>
> 329572|wdennis_data_transfer|wdennis|gpu|compute-gpu02|1|2Gc|00:02:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|
>
>
> 329573|wdennis_compute_job|wdennis|gpu|compute-gpu05|1|128Gn|00:03:00|normal|COMPLETED|0:0|cpu=1,mem=128G,node=1,gres/gpu=1|
>
>
> 329574|wdennis_data_removal|wdennis|gpu|compute-gpu02|1|2Gc|00:00:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|
>
>
>
> What is the best way to do something like “stage the data on a local path
> / run computation using the local copy / remove the locally staged data
> when complete”?
>
>
>
> Thanks!
>
> Will
>
