Hello Slurm users,
I'm trying to write a check in our job_submit.lua script that enforces
relative resource limits, such as disallowing more than 4 CPUs or 48GB of
memory per GPU. The QOS itself has a MaxTRESPerJob of
cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to
prevent jobs from "stranding" GPUs, e.g., a job that requests 32 CPUs and
384GB of memory but only 1 GPU.
I might be missing something obvious, but the rabbit hole I'm going down at
the moment is trying to account for all of the different ways the resource
request can be expressed in the job descriptor. For example, all of the
following should be disallowed:
srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the descriptor)
srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)
srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks, ntasks_per_tres)
srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks, mem_per_cpu)
...
Essentially, what I'm looking for is a way to access something like the job
record's ReqTRES string before the record exists, and then run some logic
against it, i.e., if (CPU count / GPU count) > 4 or (memory / GPU count) >
48G, reject the job.
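To make that concrete, below is roughly the shape of the check I'm picturing,
written against the job_desc fields as I understand them from the
job_submit.lua documentation. It's only an untested sketch: the field names,
the "gres/gpu:N" TRES string format, the NO_VAL sentinels for unset numeric
fields, and memory being in MB are all assumptions on my part, and it
deliberately ignores --mem-per-cpu, multi-node requests, and several of the
combinations above.

-- Sketch only: assumed field names/formats, not verified against every case.

-- Pull a GPU count out of a TRES string such as "gres/gpu:2" or "gpu:2".
local function gpus_in(tres)
    if tres == nil or tres == "" then
        return 0
    end
    local n = string.match(tres, "gpu[^,]*:(%d+)")
    if n ~= nil then
        return tonumber(n)
    end
    -- A bare "gpu" with no count presumably means 1.
    if string.find(tres, "gpu", 1, true) ~= nil then
        return 1
    end
    return 0
end

-- Unset numeric fields seem to come through as NO_VAL sentinels, not nil.
local function value_or(v, default)
    if v == nil or v == slurm.NO_VAL or v == slurm.NO_VAL16 or
       v == slurm.NO_VAL64 then
        return default
    end
    return v
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- GPUs can be requested per job (--gpus) or per node (--gres=gpu:N);
    -- for the per-node form I'm pretending the job is single-node.
    local gpus = gpus_in(job_desc.tres_per_job)
    if gpus == 0 then
        gpus = gpus_in(job_desc.tres_per_node)
    end
    if gpus == 0 then
        return slurm.SUCCESS -- no GPUs requested, nothing to enforce here
    end

    -- CPU side: cpus_per_task * num_tasks as a first approximation.
    local cpus = value_or(job_desc.cpus_per_task, 1) *
                 value_or(job_desc.num_tasks, 1)
    if cpus / gpus > 4 then
        slurm.log_user(string.format(
            "Job requests more than 4 CPUs per GPU (%d CPUs, %d GPUs)",
            cpus, gpus))
        return slurm.ERROR
    end

    -- Memory side: --mem (per node, MB) or --mem-per-gpu (mem_per_tres, MB).
    -- --mem-per-cpu (min_mem_per_cpu) is one of the cases not handled here.
    local mem_per_gpu_mb
    local mem_node = value_or(job_desc.min_mem_per_node, nil)
    if mem_node ~= nil then
        mem_per_gpu_mb = mem_node / gpus
    else
        local m = string.match(job_desc.mem_per_tres or "", "gpu[^,]*:(%d+)")
        if m ~= nil then
            mem_per_gpu_mb = tonumber(m)
        end
    end
    if mem_per_gpu_mb ~= nil and mem_per_gpu_mb > 48 * 1024 then
        slurm.log_user("Job requests more than 48G of memory per GPU")
        return slurm.ERROR
    end

    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

Even this much already feels like reimplementing the TRES bookkeeping in Lua,
which is why I'm hoping there's a more direct way to get at the aggregated
request.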
Is something like this possible?
Thanks,
Matthew
--
Matthew Baney
Assistant Director of Computational Systems
mbaney@umd.edu | (301) 405-6756
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742