Hello Slurm users,
I'm trying to write a check in our job_submit.lua script that enforces relative resource requirements such as disallowing more than 4 CPUs or 48GB of memory per GPU. The QOS itself has a MaxTRESPerJob of cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with only 1 GPU.
I might be missing something obvious, but the rabbit hole I'm going down at the moment is checking all of the different ways a resource request can end up expressed in the job descriptor.
e.g., all of the following should be disallowed (relevant job descriptor fields in parentheses):
srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the descriptor)
srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)
srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks, ntasks_per_tres)
srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks, mem_per_cpu)
...
Essentially, what I'm looking for is a way to build the equivalent of the job record's ReqTRES string at submit time, before the record exists, and then run some logic against it, i.e., if (CPU count / GPU count) > 4 or (memory / GPU count) > 48G, reject the job.
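For reference, the direction I've started in looks roughly like the sketch below. This is not a working implementation: the field names (tres_per_node, tres_per_job, min_cpus, min_mem_per_node, ...) are my assumptions from reading the job_submit docs and may differ between Slurm versions, and the --mem-per-cpu, --mem-per-gpu, and --ntasks-per-gpu paths from my examples above are still TODO.

```lua
-- Sketch for job_submit.lua: cap CPUs and memory per requested GPU.
-- ASSUMPTIONS: job_desc field names below match my Slurm version;
-- unset numeric fields show up as NO_VAL; memory fields are in MB.

local MAX_CPUS_PER_GPU   = 4
local MAX_MEM_PER_GPU_MB = 48 * 1024
local NO_VAL             = 0xFFFFFFFE  -- assumed sentinel for unset fields

-- Count GPUs in a TRES string such as "gres/gpu:2", "gpu:a100:4", or "gpu".
local function gpus_in_tres(tres)
  if tres == nil or tres == "" then return 0 end
  local total = 0
  for item in string.gmatch(tres, "[^,]+") do
    local spec = item:gsub("^gres/", "")
    local parts = {}
    for p in string.gmatch(spec, "[^:]+") do parts[#parts + 1] = p end
    if parts[1] == "gpu" then
      total = total + (tonumber(parts[#parts]) or 1)  -- bare "gpu" means 1
    end
  end
  return total
end

local function is_set(v) return v ~= nil and v ~= NO_VAL end

function slurm_job_submit(job_desc, part_list, submit_uid)
  -- --gpus sets tres_per_job (whole job); --gres=gpu:N sets tres_per_node,
  -- so the per-node count is multiplied by the node count when known.
  local nodes = is_set(job_desc.min_nodes) and job_desc.min_nodes or 1
  local gpus = gpus_in_tres(job_desc.tres_per_job)
  if gpus == 0 then
    gpus = gpus_in_tres(job_desc.tres_per_node) * nodes
  end
  -- No GPUs requested: leave it to the QOS MaxTRESPerJob limits.
  if gpus == 0 then return slurm.SUCCESS end

  -- CPUs: only the min_cpus path so far; --ntasks-per-gpu etc. still TODO.
  local cpus = is_set(job_desc.min_cpus) and job_desc.min_cpus or 1
  if cpus > MAX_CPUS_PER_GPU * gpus then
    slurm.log_user(string.format(
      "job rejected: more than %d CPUs per GPU", MAX_CPUS_PER_GPU))
    return slurm.ERROR
  end

  -- Memory: only the --mem (per-node) path; mem_per_tres/mem_per_cpu TODO.
  if is_set(job_desc.min_mem_per_node) and
     job_desc.min_mem_per_node * nodes > MAX_MEM_PER_GPU_MB * gpus then
    slurm.log_user(string.format(
      "job rejected: more than %dG memory per GPU", MAX_MEM_PER_GPU_MB / 1024))
    return slurm.ERROR
  end

  return slurm.SUCCESS
end
```

Even with this, I'd still be chasing each field combination by hand, which is why I'm hoping there's a way to get at a pre-computed TRES view instead.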
Is something like this possible?
Thanks,
Matthew