[slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

Thu Mar 21 15:46:38 UTC 2019

Dear Slurm users,

my team manages an HPC cluster (running Slurm) for a research
centre. We are planning to expand the cluster in the next couple of
years and we are facing a problem. We would like to put a figure on
how many resources will be needed on average for each user (in terms
of CPU cores, RAM, GPUs) but we have almost one hundred researchers
using the cluster for all sorts of different use cases so there isn't
a typical workload that we could take as a model. Most of the work is,
however, in the field of machine learning and deep learning. Users
range from first-year PhD students with limited skills to
researchers and professors with many years of experience.
In principle we could combine several inputs: an analysis of current
usage patterns, user surveys, etc.
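
To make the "current usage patterns" idea concrete, here is a rough
sketch of how an average per-user CPU footprint could be derived from
accounting records. It assumes pipe-delimited records of the form
user|allocated_cpus|elapsed_seconds, such as something like
"sacct -a -X -P -n -S <start> -o User,AllocCPUS,ElapsedRaw" might
produce. The function names and the sample records are made up for
illustration, and RAM and GPUs would need the same treatment.

```python
from collections import defaultdict


def per_user_core_hours(lines):
    """Sum core-hours per user from 'user|alloc_cpus|elapsed_seconds' records.

    The record layout is an assumption about how the accounting data
    was exported; adjust the split if your export differs.
    """
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, cpus, elapsed = line.split("|")
        totals[user] += int(cpus) * int(elapsed) / 3600.0
    return dict(totals)


def average_cores(totals, period_hours):
    """Convert core-hours into the average number of cores held over the period."""
    return {user: hours / period_hours for user, hours in totals.items()}


# Hypothetical sample records, not real cluster data.
sample = [
    "alice|4|3600",
    "alice|8|1800",
    "bob|16|7200",
]

totals = per_user_core_hours(sample)
averages = average_cores(totals, period_hours=24)

for user in sorted(totals):
    print(f"{user}: {totals[user]:.1f} core-hours, "
          f"{averages[user]:.2f} cores on average")
```

Averages like these hide bursts, so peak or percentile figures would
probably have to be looked at alongside them.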

I was wondering whether anyone here, working in a similar
setting, has guidelines that they use for budgeting hardware
purchases and would be willing to share them.

Many thanks and regards

--
Graziano D'Innocenzo (PGP key: 9213BE46)
Systems Administrator - ADAPT Centre