[slurm-users] Job flexibility with cons_tres

Yair Yarom irush at cs.huji.ac.il
Tue Feb 9 14:53:41 UTC 2021


Hi,

We have a similar setup: a very heterogeneous cluster running cons_tres.
Users must specify CPU, memory, GPU and time requirements, and Slurm
schedules the job on any node that satisfies them. Indeed, there's currently
no guarantee that you won't be left with a node whose GPUs are unusable
because all of its CPUs or memory are already allocated.

We use one partition with 100% of the nodes and a time limit of 2 days, and
a second partition with ~90% of the nodes and a limit of 7 days. Since the
remaining ~10% of the nodes never run jobs longer than 2 days, shorter jobs
get a chance to run without having to wait behind long jobs.
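
A minimal slurm.conf sketch of such a two-partition layout follows. The
partition names, node names and node counts are hypothetical and chosen only
for illustration; this is not the actual configuration described above.

```
# slurm.conf fragment (sketch only; partition and node names are hypothetical)

# "short": every node, jobs limited to 2 days
PartitionName=short Nodes=node[001-100] MaxTime=2-00:00:00 Default=YES State=UP

# "long": roughly 90% of the nodes, jobs limited to 7 days
PartitionName=long Nodes=node[001-090] MaxTime=7-00:00:00 State=UP
```

In this sketch node[091-100] only ever accept jobs of up to 2 days, so they
free up regularly for short jobs.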

We also use weights for the nodes, such that smaller nodes (resource-wise)
are selected first. This keeps small jobs from filling up the larger nodes,
unless the smaller nodes are already occupied.
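
As a rough illustration of the weighting, the fragment below gives smaller
nodes a lower Weight, since Slurm prefers the lowest-weight nodes among those
that satisfy a request. The node names, sizes and weight values are
assumptions made up for this sketch, loosely based on the node types
mentioned in the quoted message.

```
# slurm.conf fragment (sketch only; names, sizes and weights are hypothetical)
GresTypes=gpu
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Lower Weight is preferred when several nodes satisfy a request,
# so small nodes fill up before large ones.
NodeName=small[01-10] CPUs=8 RealMemory=64000 Gres=gpu:1 Weight=10
NodeName=mid[01-10] CPUs=20 RealMemory=128000 Gres=gpu:2 Weight=50
NodeName=big[01-05] CPUs=64 RealMemory=512000 Gres=gpu:4 Weight=100
```

With values like these, a 1-GPU job lands on a small node while one is free
and falls through to the mid and big nodes only when the small ones are full.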

HTH,
    Yair.



On Mon, Feb 8, 2021 at 1:41 PM Ansgar Esztermann-Kirchner <
aeszter at mpibpc.mpg.de> wrote:

> Hello List,
>
> we're running a heterogeneous cluster (just x86_64, but a lot of
> different node types from 8 to 64 HW threads, 1 to 4 GPUs).
> Our processing power (for our main application, at least) is
> exclusively provided by the GPUs, so cons_tres looks quite promising:
> depending on the size of the job, request an appropriate number of
> GPUs. Of course, you have to request some CPUs as well -- ideally,
> evenly distributed among the GPUs (e.g. 10 per GPU on a 20-core, 2-GPU
> node; 16 on a 64-core, 4-GPU node).
> Of course, one could use different partitions for different nodes, and
> then submit individual jobs with CPU requests tailored to one such
> partition, but I'd prefer a more flexible approach where a given job
> could run on any large enough node.
>
> Is there anyone with a similar setup? Any config options I've missed,
> or do you have a work-around?
>
> Thanks,
>
> A.
>
> --
> Ansgar Esztermann
> Sysadmin Dep. Theoretical and Computational Biophysics
> http://www.mpibpc.mpg.de/grubmueller/esztermann
>


-- 

  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | irush at cs.huji.ac.il
 //        |