[slurm-users] Suspend/Resume, CGROUP and SIGTSTP
Michael Smith
msmith at tenstorrent.com
Fri Jan 29 15:58:44 UTC 2021
I’ve set up SLURM with pre-emption enabled so that high-priority jobs can take over resources from lower-priority jobs. We use a lot of expensive EDA software and want to get the best use of its licenses. The software all uses the FlexLM license manager: when a job is suspended with SIGTSTP it releases its license, allowing another job to use it, and when it is later resumed with SIGCONT it checks the license out again.
I wrote a simple BASH script to test this behavior with SLURM:
#!/bin/bash
function suspendJob () {
    echo "INFO: Job Suspended"
}
function resumeJob () {
    echo "INFO: Job Resumed"
}
function terminateJob () {
    echo "INFO: Job Terminating..."
}
trap suspendJob SIGTSTP
trap resumeJob SIGCONT
trap terminateJob SIGTERM
echo "Burning some compute now...."
yes > /dev/null
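The same traps can be exercised without SLURM by sending the signals directly with kill. The following is only a sketch of such a local check, not part of the original test: the names worker and run_signal_demo and the READY marker are hypothetical. It also differs from the script above in one deliberate way: Bash defers a trap until the current foreground command finishes, so this worker waits on a short background sleep with the wait builtin, which lets each handler run as soon as its signal arrives.

```shell
#!/bin/bash
# Hypothetical local harness: no SLURM involved, signals are sent with kill.

worker() {
    trap 'echo "INFO: Job Suspended"' SIGTSTP
    trap 'echo "INFO: Job Resumed"' SIGCONT
    trap 'echo "INFO: Job Terminating..."; exit 0' SIGTERM
    echo "READY"                 # marker: traps are installed
    while true; do
        sleep 0.2 &              # background child, so that...
        wait $!                  # ...the wait builtin can be interrupted by a trap
    done
}

run_signal_demo() {
    local log pid sig
    log=$(mktemp)
    worker > "$log" &            # the worker writes its messages to a log file
    pid=$!
    # Do not send anything until the worker reports its traps are in place.
    until grep -q READY "$log"; do sleep 0.1; done
    for sig in TSTP CONT TERM; do
        kill -s "$sig" "$pid"    # signal only the worker shell itself
        sleep 0.5                # give the handler time to run
    done
    wait "$pid"                  # the SIGTERM handler exits the worker
    cat "$log"
    rm -f "$log"
}

run_signal_demo
```

If the traps fire, the log contains the three INFO lines after READY. This only confirms that the handlers themselves work; it says nothing about which signals SLURM delivers under a given ProctrackType.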
When I configure SLURM to use:
ProctrackType=proctrack/pgid
This works as expected when I manually SUSPEND/RESUME/CANCEL a job: each of the corresponding messages appears in the SLURM StdOut file.
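For reference, the manual cycle described above can be scripted. The post does not show the exact commands used, so this is a sketch under assumptions: it assumes the standard scontrol suspend, scontrol resume and scancel commands, and the helper names run and cycle_job, the DRY_RUN switch and the job ID 12345 are all hypothetical. By default it only prints the commands, so it can be read and tried without a SLURM installation.

```shell
#!/bin/bash
# Hypothetical helper: walk one job through suspend, resume and cancel.
# DRY_RUN=1 (the default here) prints each command instead of running it.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cycle_job() {
    local jobid=$1
    local pause=${2:-10}         # seconds to wait between steps
    run scontrol suspend "$jobid"
    run sleep "$pause"
    run scontrol resume "$jobid"
    run sleep "$pause"
    run scancel "$jobid"
}

cycle_job 12345                  # 12345 is a placeholder job ID
```

Setting DRY_RUN=0 runs the commands for real; suspending and resuming jobs normally requires administrator privileges. Running the same cycle under each ProctrackType setting and comparing the job's StdOut file gives a like-for-like comparison.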
When I change SLURM to use CGROUPS:
ProctrackType=proctrack/cgroup
No messages appear at all in the SLURM StdOut file, indicating that the cgroup was thrown into the freezer without any signals being sent. Is this expected behavior, and are there ways to “fix” this so that it behaves the same way as with process groups?
Maybe this is a moot point, since SLURM still shows the license as used under “scontrol show license” even when a job is suspended, but I figure that problem might be solvable…
Thanks,
Michael