[slurm-users] [External] Re: Staging data on the nodes one will be processing on via sbatch

Prentice Bisbal pbisbal at pppl.gov
Mon Apr 5 19:21:34 UTC 2021

I think this is exactly the type of use case heterogeneous job support 
is for; it has been available since Slurm 17.11:

> Slurm version 17.11 and later supports the ability to submit and 
> manage heterogeneous jobs, in which each component has virtually all 
> job options available including partition, account and QOS (Quality Of 
> Service). For example, part of a job might require four cores and 4 GB 
> for each of 128 tasks while another part of the job would require 16 
> GB of memory and one CPU.


Using this, you should be able to use a single core for the transfer 
from NFS, all the cores/GPUs you need for the computation, and then a 
single core to transfer the results back to NFS:

Disclaimer: I've never used this feature myself.
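A minimal sketch of what such a heterogeneous sbatch script might look like (untested; all paths, resource sizes, and script names are placeholders, and the `hetjob` component separator is the syntax of recent Slurm releases -- 17.11 itself used `packjob`):

```shell
#!/bin/bash
# Sketch only -- paths and resource requests are assumptions.
# Component 0: one core for the NFS -> local stage-in copy
#SBATCH --ntasks=1 --cpus-per-task=1 --mem=2G
#SBATCH hetjob
# Component 1: the actual computation, with GPUs
#SBATCH --ntasks=1 --cpus-per-task=16 --gres=gpu:2 --mem=64G
#SBATCH hetjob
# Component 2: one core to copy results back out (stage-out)
#SBATCH --ntasks=1 --cpus-per-task=1 --mem=2G

# Run the steps in order; '&&' stops the chain if a copy fails,
# so the compute step never runs against missing data.
srun --het-group=0 cp -r /nfs/project/input /local/scratch/ && \
srun --het-group=1 ./process_data /local/scratch/input && \
srun --het-group=2 cp -r /local/scratch/output /nfs/project/
```

Note that all components of a heterogeneous job are allocated together for the life of the job, so this keeps everything on the same node but does not by itself release the GPUs during the copies.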


On 4/3/21 5:31 PM, Fulcomer, Samuel wrote:
> inline below...
> On Sat, Apr 3, 2021 at 4:50 PM Will Dennis <wdennis at nec-labs.com 
> <mailto:wdennis at nec-labs.com>> wrote:
>     Sorry, obvs wasn’t ready to send that last message yet…
>     Our issue is the shared storage is via NFS, and the “fast storage
>     in limited supply” is only local on each node. Hence the need to
>     copy it over from NFS (and then remove it when finished with it.)
>     I also wanted the copy & remove to be different jobs, because the
>     main processing job usually requires GPU gres, which is a
>     time-limited resource on the partition. I don’t want to tie up the
>     allocation of GPUs while the data is staged (and removed), and if
>     the data copy fails, don’t want to even progress to the job where
>     the compute happens (so like, copy_data_locally && process_data)
> ...yup... this is the problem. We've invested in GPFS and an NVMe 
> Excelero pool (for initial placement); however, we still have the 
> problem of having users pull down data from community repositories 
> before running useful computation.
> Your question has gotten me thinking about this more. In our case, all 
> of our nodes are diskless (though we do have fast GPFS), so this 
> wouldn't really work for us. But if your fast storage is only local to 
> your nodes, the subsequent compute jobs will need to request those 
> specific nodes, so you'll need a mechanism to increase the SLURM 
> scheduling "weight" of the staged nodes, so the scheduler won't select 
> them over nodes with a lower weight. That could be done in a job epilog.
>         If you've got other fast storage in limited supply that can be
>         used for data that can be staged, then by all means use it,
>         but consider whether you want batch cpu cores tied up with the
>         wall time of transferring the data. This could easily be done
>         on a time-shared frontend login node from which the users
>         could then submit (via script) jobs after the data was staged.
>         Most of the transfer wallclock is in network wait, so don't
>         waste dedicated cores for it.
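An alternative that matches Will's `copy_data_locally && process_data` requirement, without holding GPUs during the copies, is a chain of separate jobs linked by dependencies. A sketch (untested; the node name and script names are hypothetical):

```shell
#!/bin/bash
# Sketch only -- node042 and the *.sh scripts are placeholders.
# Stage-in job: one core, no GPUs, pinned to a specific node so
# later jobs can find the data on that node's local storage.
copy_id=$(sbatch --parsable -w node042 -n1 stage_in.sh)

# Compute job: GPUs are only requested once the copy has
# succeeded (afterok), on the node where the data now lives.
comp_id=$(sbatch --parsable -w node042 --gres=gpu:2 \
          --dependency=afterok:${copy_id} process_data.sh)

# Cleanup job: removes the staged data whether or not the
# compute job succeeded (afterany), again without GPUs.
sbatch -w node042 -n1 --dependency=afterany:${comp_id} stage_out.sh
```

If the copy fails, the dependent compute job is never started, which gives the fail-fast behavior of `copy_data_locally && process_data` while tying up only a single core during each transfer.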