[slurm-users] [External] Re: Staging data on the nodes one will be processing on via sbatch
Prentice Bisbal
pbisbal at pppl.gov
Mon Apr 5 19:21:34 UTC 2021
I think this is exactly the type of use case heterogeneous job support
is for; it has been available since Slurm 17.11:
> Slurm version 17.11 and later supports the ability to submit and
> manage heterogeneous jobs, in which each component has virtually all
> job options available including partition, account and QOS (Quality Of
> Service). For example, part of a job might require four cores and 4 GB
> for each of 128 tasks while another part of the job would require 16
> GB of memory and one CPU.
https://slurm.schedmd.com/heterogeneous_jobs.html
Using this, you should be able to use a single core for the transfer from
NFS, all the cores/GPUs you need for the computation, and then a single
core again to transfer the results back to NFS.
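A rough, untested sketch of such a batch script (the paths, CPU/GPU counts,
and script name are placeholders; on releases older than 20.02 the separator
and srun flag were "packjob"/"--pack-group" rather than "hetjob"/"--het-group",
and node-local scratch additionally requires that both components land on the
same node):

  #!/bin/bash
  # Component 0: a single core for staging data to/from NFS
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH hetjob
  # Component 1: the GPU compute portion (resource counts are placeholders)
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=16
  #SBATCH --gres=gpu:2

  # Stage in on the single-core component; stop here if the copy fails
  srun --het-group=0 cp -r /nfs/project/dataset /local/scratch/dataset || exit 1

  # Run the computation on the GPU component
  srun --het-group=1 ./process_data /local/scratch/dataset

  # Stage results back to NFS and clean up, again on the single core
  srun --het-group=0 bash -c \
      'cp -r /local/scratch/dataset/results /nfs/project/ && rm -rf /local/scratch/dataset'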
Disclaimer: I've never used this feature myself.
Prentice
On 4/3/21 5:31 PM, Fulcomer, Samuel wrote:
> inline below...
>
> On Sat, Apr 3, 2021 at 4:50 PM Will Dennis <wdennis at nec-labs.com> wrote:
>
> Sorry, obvs wasn’t ready to send that last message yet…
>
> Our issue is the shared storage is via NFS, and the “fast storage
> in limited supply” is only local on each node. Hence the need to
> copy it over from NFS (and then remove it when finished with it.)
>
> I also wanted the copy & remove to be different jobs, because the
> main processing job usually requires GPU gres, which is a
> time-limited resource on the partition. I don’t want to tie up the
> allocation of GPUs while the data is staged (and removed), and if
> the data copy fails, don’t want to even progress to the job where
> the compute happens (so like, copy_data_locally && process_data)
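
One common way to express that "copy_data_locally && process_data" chain as
separate jobs, so the GPUs are only allocated for the compute step, is an
sbatch dependency. A hedged sketch with placeholder script names (the compute
job would still need to target the node holding the staged data, e.g. with -w):

  # staging job: CPU only, no GPU gres
  stage_id=$(sbatch --parsable copy_data_locally.sh)

  # compute job: runs only if staging exited 0; request GPUs only here
  sbatch --dependency=afterok:${stage_id} --kill-on-invalid-dep=yes \
      --gres=gpu:2 process_data.sh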
>
>
> ...yup... this is the problem. We've invested in GPFS and an NVMe
> Excelero pool (for initial placement); however, we still have the
> problem of having users pull down data from community repositories
> before running useful computation.
>
> Your question has gotten me thinking about this more. In our case all of
> our nodes are diskless (though we do have fast GPFS), so this wouldn't
> really work for us. But if your fast storage is only local to your nodes,
> the subsequent compute jobs will need to request those specific nodes, so
> you'll need a mechanism to increase the SLURM scheduling "weight" of those
> nodes after staging, so that the scheduler won't pick them ahead of
> lower-weight nodes for other jobs. That could be done in a job epilog.
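
A minimal sketch of what such an epilog snippet might do (the weight value is
an arbitrary placeholder, the hostname is assumed to match the Slurm NodeName,
and scontrol update needs root/SlurmUser privileges):

  #!/bin/bash
  # Epilog sketch: raise this node's scheduling weight after staging so the
  # scheduler prefers other, lower-weight nodes for jobs that don't request
  # this node explicitly. The value 100 is an arbitrary placeholder.
  scontrol update NodeName="$(hostname -s)" Weight=100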
>
>
>
> If you've got other fast storage in limited supply that can be
> used for data that can be staged, then by all means use it,
> but consider whether you want batch cpu cores tied up with the
> wall time of transferring the data. This could easily be done
> on a time-shared frontend login node from which the users
> could then submit (via script) jobs after the data was staged.
> Most of the transfer wallclock is in network wait, so don't
> waste dedicated cores for it.
>
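
For the login-node approach, a minimal sketch (paths and script name are
placeholders) is just a stage-then-submit wrapper run on the frontend:

  # run on a login node: stage first, submit the compute job only if the copy succeeded
  cp -r /remote/repo/dataset /fast/scratch/dataset && sbatch process_data.sh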