Hello Slurm users,
I'm trying to write a check in our job_submit.lua script that enforces relative resource requirements such as disallowing more than 4 CPUs or 48GB of memory per GPU. The QOS itself has a MaxTRESPerJob of cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with only 1 GPU.
I might be missing something obvious, but the rabbit hole I'm going down at the moment is trying to check all of the different ways job arguments could be set in the job descriptor.
For example, all of the following should be disallowed:
srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the descriptor)
srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)
srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks, ntasks_per_tres)
srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks, mem_per_cpu)
...
Essentially, what I'm looking for is a way to access the ReqTRES string from the job record before the record exists, and then run some logic against it, e.g., error out if (CPU count / GPU count) > 4 or (memory / GPU count) > 48G.
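In rough pseudocode (normalize_request() here is a made-up helper; reducing the many descriptor fields to per-job totals is exactly the part I'm stuck on):

    -- normalize_request() is hypothetical: it would collapse the various
    -- job_desc fields into per-job totals, which is the hard part
    local cpus, mem_mb, gpus = normalize_request(job_desc)
    if gpus > 0 and (cpus / gpus > 4 or mem_mb / gpus > 48 * 1024) then
       slurm.log_user("Jobs may request at most 4 CPUs and 48GB of memory per GPU")
       return slurm.ERROR
    end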
Is something like this possible?
Thanks, Matthew
Hello Matthew,
You may already be aware of this, but most sites implement these kinds of checks and validations in job_submit.lua. I'm not an expert in it - though plenty of others on this list are - but I'm positive this type of validation logic can be implemented there. I'd like to say I've come across a good job_submit.lua tutorial, but I haven't really found one. This is a decent introduction:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-pl...
You can also find some sample scripts, such as:
https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua
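I'm not the right person to write it for you, but here is a very rough, untested sketch of the shape it could take. The constants and the GPU parsing are my assumptions: it only handles the plain --gres=gpu:N and --gpus=N forms (not typed GRES like gpu:a100:2), and it skips the memory side entirely since, as you noted, memory can arrive through several different fields:

    -- Untested sketch for job_submit.lua; the limit is an example only
    local MAX_CPUS_PER_GPU = 4

    local function gpu_count(job_desc)
       -- --gpus=N sets tres_per_job; --gres=gpu:N sets tres_per_node
       local s = job_desc.tres_per_job or job_desc.tres_per_node
       if s == nil then return 0 end
       return tonumber(string.match(s, "gpu:(%d+)") or "0")
    end

    function slurm_job_submit(job_desc, part_list, submit_uid)
       local gpus = gpu_count(job_desc)
       if gpus == 0 then
          -- no GPUs requested (or an unparsed GRES string); QOS limits still apply
          return slurm.SUCCESS
       end
       local cpus = job_desc.min_cpus or 1
       if cpus / gpus > MAX_CPUS_PER_GPU then
          slurm.log_user("Jobs in this QOS may request at most 4 CPUs per GPU")
          return slurm.ERROR
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end

Someone more experienced with the Lua plugin can probably point out the cases this misses - the mem_per_tres / mem-per-cpu combinations you listed are the tricky ones.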
Warmest regards, Jason