[slurm-users] job_submit.lua example (routes to alternative partitions based off GPU reservations and core requirements)

Mon Oct 15 15:37:28 MDT 2018

Hey, folks. Been working on a job submit filter to let us use otherwise idle cores in our GPU nodes.

We’ve got 40 non-GPU nodes and 4 GPU nodes deployed, each with 28 cores. We’ve had a set of partitions for the non-GPU nodes (batch, interactive, and debug), and another set of partitions for the GPU nodes (gpu, gpu-interactive, and gpu-debug).

As the GPU nodes ended up having idle cores a fair amount of the time, I decided to make some of those available to non-GPU jobs, without impacting GPU usage:

1) Restricting GPUs to only use a subset of CPU cores:

On the GPU nodes, I made the /etc/slurm/gres.conf look like:

=====

Name=gpu File=/dev/nvidia0 CPUs=0-7
Name=gpu File=/dev/nvidia1 CPUs=8-15

=====

which left the other 12 of each GPU node’s 28 cores available for non-GPU work. It also restricted GPU jobs to 8 cores or fewer per GPU, but additional cores tended not to make much difference for our jobs.
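
As a sanity check on those core counts, here is a minimal Python sketch (not part of the Slurm configuration). It reads the two gres.conf lines above and counts how many of the 28 cores are left over. The names gpu_bound_cores, GRES_LINES, and CORES_PER_NODE are made up for illustration, and the parsing only handles the simple CPUs=lo-hi form shown here.

```python
import re

# The two gres.conf lines from the post, and the per-node core count.
GRES_LINES = [
    "Name=gpu File=/dev/nvidia0 CPUs=0-7",
    "Name=gpu File=/dev/nvidia1 CPUs=8-15",
]
CORES_PER_NODE = 28


def gpu_bound_cores(lines):
    """Return the set of core IDs named in simple CPUs=lo-hi ranges."""
    cores = set()
    for line in lines:
        match = re.search(r"CPUs=(\d+)-(\d+)", line)
        if match:
            lo, hi = int(match.group(1)), int(match.group(2))
            cores.update(range(lo, hi + 1))
    return cores


bound = gpu_bound_cores(GRES_LINES)
free_cores = CORES_PER_NODE - len(bound)
print(len(bound), free_cores)  # prints "16 12"
```

The 16 bound cores line up with MaxCPUsPerNode=16 on the GPU partitions, and the 12 free cores with MaxCPUsPerNode=12 on the "any" partitions.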

2) Making new partitions covering all nodes:

In /etc/slurm/slurm.conf on all the machines, I added three new partitions with MaxCPUsPerNode=12 that include all nodes. By contrast, the original non-GPU partitions have no MaxCPUsPerNode and use only the non-GPU nodes, and the original GPU partitions have MaxCPUsPerNode=16 and use only the GPU nodes.

=====

PartitionName=batch … Nodes=node[001-040]
PartitionName=interactive … Nodes=node[001-040]
PartitionName=debug … Nodes=node[001-040]

PartitionName=gpu … MaxCPUsPerNode=16 … Nodes=gpunode[001-004]
PartitionName=gpu-interactive … MaxCPUsPerNode=16 … Nodes=gpunode[001-004]
PartitionName=gpu-debug … MaxCPUsPerNode=16 … Nodes=gpunode[001-004]

PartitionName=any … MaxCPUsPerNode=12 … Nodes=node[001-040],gpunode[001-004]
PartitionName=any-interactive … MaxCPUsPerNode=12 … Nodes=node[001-040],gpunode[001-004]
PartitionName=any-debug … MaxCPUsPerNode=12 … Nodes=node[001-040],gpunode[001-004]

=====

3) Making a job_submit.lua to route jobs automatically:

My job_submit.lua script is linked in a Gist below. For anyone outside my test population, the plugin returns slurm.SUCCESS.

For users inside my test population, it detects whether the job reserved a GPU and, if so, changes the partition to the correct GPU-enabled partition (either gpu, gpu-debug, or gpu-interactive).

Otherwise, if the job requests 12 CPUs per node or fewer (by whatever combination of ntasks, ntasks-per-node, and cpus-per-task is specified), it routes the job to the correct “anywhere” partition (any, any-debug, or any-interactive).

https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5
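
For readers who want the routing rules at a glance, here is a rough Python model of the decision described above. It is not the contents of the Gist: the real plugin is Lua and works on Slurm's job descriptor. The ROUTES table, the function names route and cpus_per_node, keying the choice on the originally requested partition, and the assumption that tasks spread evenly across nodes are all illustrative guesses.

```python
import math

MAX_ANY_CPUS_PER_NODE = 12  # MaxCPUsPerNode on the "any" partitions

# Assumed mapping: requested partition -> (GPU partition, "anywhere" partition)
ROUTES = {
    "batch": ("gpu", "any"),
    "interactive": ("gpu-interactive", "any-interactive"),
    "debug": ("gpu-debug", "any-debug"),
}


def cpus_per_node(ntasks=None, ntasks_per_node=None, cpus_per_task=None, nodes=1):
    """Estimate the CPUs a job needs on its busiest node."""
    cpt = cpus_per_task or 1
    if ntasks_per_node:
        return ntasks_per_node * cpt
    tasks = ntasks or 1
    # Assumption: tasks are spread as evenly as possible over the nodes.
    return math.ceil(tasks / nodes) * cpt


def route(partition, wants_gpu, in_test_population=True, **cpu_request):
    """Return the partition a job would end up in under the rules above."""
    if not in_test_population or partition not in ROUTES:
        return partition  # leave the job alone
    gpu_part, any_part = ROUTES[partition]
    if wants_gpu:
        return gpu_part
    if cpus_per_node(**cpu_request) <= MAX_ANY_CPUS_PER_NODE:
        return any_part
    return partition  # too big for the "any" partitions


print(route("batch", wants_gpu=True))                       # gpu
print(route("debug", False, ntasks=4, cpus_per_task=2))     # any-debug
print(route("batch", False, ntasks_per_node=28))            # batch
```

A 28-core-per-node job stays in batch because it would not fit within the 12-core limit on the "any" partitions.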

It’s worked for a very limited test population so far, and I’m gradually rolling it out to others. If it all goes correctly, we’ll increase our capacity for small jobs by about 4%.
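
One way to arrive at roughly that figure, sketched in Python: compare the 12 extra cores on each GPU node against the cores already available on the non-GPU nodes. Treating all non-GPU cores as the baseline "capacity for small jobs" is an assumption made here, not something stated above.

```python
non_gpu_nodes, gpu_nodes, cores_per_node = 40, 4, 28
any_cores_per_gpu_node = 12  # MaxCPUsPerNode on the "any" partitions

existing = non_gpu_nodes * cores_per_node   # cores on the non-GPU nodes
added = gpu_nodes * any_cores_per_gpu_node  # newly usable GPU-node cores
gain = added / existing

print(f"{added} extra cores on top of {existing}: {gain:.1%}")
# prints "48 extra cores on top of 1120: 4.3%"
```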

-- 
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University