[slurm-users] job_submit.lua example (routes to alternative partitions based off GPU reservations and core requirements)
Renfro at tntech.edu
Mon Oct 15 15:37:28 MDT 2018
Hey, folks. Been working on a job submit filter to let us use otherwise idle cores in our GPU nodes.
We’ve got 40 non-GPU nodes and 4 GPU nodes deployed, each with 28 cores. We’ve had a set of partitions for the non-GPU nodes (batch, interactive, and debug), and another set of partitions for the GPU nodes (gpu, gpu-interactive, and gpu-debug).
As the GPU nodes ended up having idle cores a fair amount of the time, I decided to make some of those available to non-GPU jobs, without impacting GPU usage:
1) Restricting GPUs to only use a subset of CPU cores:
On the GPU nodes, I made the /etc/slurm/gres.conf look like:
Name=gpu File=/dev/nvidia0 CPUs=0-7
Name=gpu File=/dev/nvidia1 CPUs=8-15
which left another 12 cores available for non-GPU work. It also restricted GPU jobs to 8 cores or fewer per GPU, but additional cores tended not to make much difference for our jobs.
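As a quick sanity check on the core split, here is a minimal Python sketch that reads the two gres.conf lines above and counts the cores left over. The helper names (parse_cpu_range, gpu_cores) are made up for illustration and are not part of Slurm; the 28 cores per node comes from the hardware described above.

```python
CORES_PER_NODE = 28  # from the node description above

# The two gres.conf lines shown above, copied verbatim.
GRES_LINES = [
    "Name=gpu File=/dev/nvidia0 CPUs=0-7",
    "Name=gpu File=/dev/nvidia1 CPUs=8-15",
]


def parse_cpu_range(spec):
    """Expand a spec like '0-7' or '0-3,8' into a set of core indices."""
    cores = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cores.update(range(int(lo), int(hi or lo) + 1))
    return cores


def gpu_cores(lines):
    """Collect every core index bound to a GPU in the given gres.conf lines."""
    reserved = set()
    for line in lines:
        fields = dict(field.split("=", 1) for field in line.split())
        reserved |= parse_cpu_range(fields["CPUs"])
    return reserved


reserved = gpu_cores(GRES_LINES)
free = CORES_PER_NODE - len(reserved)
print(len(reserved), free)  # prints: 16 12
```

The 16 GPU-bound cores match the MaxCPUsPerNode=16 on the GPU partitions, and the 12 remaining cores match the MaxCPUsPerNode=12 on the new partitions described next.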
2) Making new partitions covering all nodes:
In /etc/slurm/slurm.conf on all the machines, I added three new partitions that include all nodes and set MaxCPUsPerNode=12. By contrast, the original non-GPU partitions have no MaxCPUsPerNode and use only the non-GPU nodes, and the original GPU partitions have MaxCPUsPerNode=16 and use only the GPU nodes:
PartitionName=batch … Nodes=node[001-040]
PartitionName=interactive … Nodes=node[001-040]
PartitionName=debug … Nodes=node[001-040]
PartitionName=gpu … MaxCPUsPerNode=16 … Nodes=gpunode[001-004]
PartitionName=gpu-interactive … MaxCPUsPerNode=16 … Nodes=gpunode[001-004]
PartitionName=gpu-debug … MaxCPUsPerNode=16 … Nodes=gpunode[001-004]
PartitionName=any … MaxCPUsPerNode=12 … Nodes=node[001-040],gpunode[001-004]
PartitionName=any-interactive … MaxCPUsPerNode=12 … Nodes=node[001-040],gpunode[001-004]
PartitionName=any-debug … MaxCPUsPerNode=12 … Nodes=node[001-040],gpunode[001-004]
3) Making a job_submit.lua to route jobs automatically:
My job_submit.lua script is linked in a Gist below. For anyone outside my test population, the plugin returns slurm.SUCCESS.
For users inside my test population, it detects if the job reserved a GPU, and changes the partition to the correct GPU-enabled partition (either gpu, gpu-debug, or gpu-interactive).
Otherwise, if the job requests 12 CPUs per node or fewer (by whatever combination of ntasks, ntasks-per-node, and cpus-per-task are specified), it routes the job to the correct “anywhere” partition (any, any-debug, or any-interactive).
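To make the routing rules above concrete, here is a rough Python model of the decision logic. It is not the Lua plugin itself, and it doesn't use the Slurm job_submit API: the Job class, its field names, the "gpu" substring check on the GRES request, and the way CPUs per node are estimated (ntasks-per-node times cpus-per-task if given, otherwise all tasks assumed to land on one node) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ANY_CPUS_PER_NODE = 12  # MaxCPUsPerNode on the "any" partitions

# Original partition -> rerouted partition, per the description above.
GPU_PARTITION = {"batch": "gpu", "interactive": "gpu-interactive", "debug": "gpu-debug"}
ANY_PARTITION = {"batch": "any", "interactive": "any-interactive", "debug": "any-debug"}


@dataclass
class Job:
    """Hypothetical stand-in for the fields a submit filter would inspect."""
    user: str
    partition: str = "batch"
    gres: Optional[str] = None          # e.g. "gpu:1"
    ntasks: Optional[int] = None
    ntasks_per_node: Optional[int] = None
    cpus_per_task: Optional[int] = None


def cpus_per_node(job):
    """Estimate CPUs per node (assumption: without ntasks-per-node, all tasks share a node)."""
    per_task = job.cpus_per_task or 1
    if job.ntasks_per_node is not None:
        return job.ntasks_per_node * per_task
    return (job.ntasks or 1) * per_task


def route(job, test_users):
    """Return the partition the job should end up in."""
    if job.user not in test_users:
        return job.partition  # outside the test population: leave untouched
    if job.gres and "gpu" in job.gres:
        return GPU_PARTITION.get(job.partition, job.partition)
    if cpus_per_node(job) <= MAX_ANY_CPUS_PER_NODE:
        return ANY_PARTITION.get(job.partition, job.partition)
    return job.partition  # too large for the "any" partitions


test_users = {"alice"}
print(route(Job("alice", "debug", gres="gpu:1"), test_users))  # prints: gpu-debug
print(route(Job("alice", "interactive", ntasks_per_node=4, cpus_per_task=3), test_users))  # prints: any-interactive
```

Jobs that request more than 12 CPUs per node and no GPU fall through unchanged, so they stay in the original non-GPU partitions.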
It’s worked for a very limited test population so far, and I’m gradually rolling it out to others. If it all goes correctly, we’ll increase our capacity for small jobs by about 4%.
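One way to arrive at the roughly 4% figure, assuming it compares the 12 cores opened up on each GPU node against the total cores in the non-GPU nodes (the variable names below are just for illustration):

```python
NON_GPU_NODES = 40
GPU_NODES = 4
CORES_PER_NODE = 28
ANY_MAX_CPUS_PER_NODE = 12  # MaxCPUsPerNode on the "any" partitions

before = NON_GPU_NODES * CORES_PER_NODE    # cores small jobs could use originally
added = GPU_NODES * ANY_MAX_CPUS_PER_NODE  # extra cores opened up on the GPU nodes
increase_pct = 100 * added / before

print(f"{added} extra cores on top of {before}: {increase_pct:.1f}% increase")
# prints: 48 extra cores on top of 1120: 4.3% increase
```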
Mike Renfro / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University