Hi,
We use PreemptMode and PriorityTier within Slurm to suspend low priority jobs when more urgent work needs to be done. This generally works well, but on occasion resumed jobs fail to restart - which is to say Slurm sets the job status to running but the actual code doesn't recover from being suspended.
Technically everything is working as expected, but I wondered whether there is any best practice to pass on to users about how to cope with this state. This is obviously not a direct Slurm question, but I wondered whether others have experience with this and any advice on how best to limit the impact.
Thanks, Paul
--
I don't really have an answer for you, just responding to make your message pop out in the "flood" of other topics we've got since you posted.
On our cluster we configure preemption to cancel jobs instead, because that makes more sense for our situation, so I have no experience with jobs resuming after being suspended. I can think of two possible reasons for the failures:
- one is memory (have you checked your memory logs to see whether there is a correlation between node memory occupation and jobs not resuming correctly?)
- the second is some resource disappearing (temp files? maybe in some circumstances slurm totally wipes out /tmp for the second job -- if so, that would be a slurm bug, obviously)
Assuming that you're stuck without finding a root cause which you can address, I guess it depends on what "doesn't recover" means. It's one thing if it crashes immediately. It's another if it just stalls without even starting but slurm still thinks it's running and the users are charged their allocation -- even worse if your cluster does not enforce a wallclock limit (or has a very long one). Depending on frequency of the issue, size of your cluster and other conditions, you may want to consider writing a watchdog script which would search for these jobs and cancel them?
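To make the watchdog idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under several assumptions: job accounting is enabled so that sstat can report AveCPU, the script runs as a user allowed to query other users' jobs, and "stalled" is taken to mean "CPU time did not advance between two samples". The function names, the 10-minute interval and the 1-second threshold are all made up for the example, and the time-string parser assumes the usual [DD-][HH:]MM:SS[.fff] layout.

```python
import subprocess
import time


def parse_cpu_time(text):
    """Convert a Slurm-style time string ([DD-][HH:]MM:SS[.fff]) to seconds."""
    text = text.strip()
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    seconds = 0.0
    for field in text.split(":"):
        seconds = seconds * 60 + float(field)
    return days * 86400 + seconds


def find_stalled(previous, current, min_delta=1.0):
    """Return sorted job IDs seen in both samples whose CPU time grew by less than min_delta seconds."""
    return sorted(
        job for job, cpu in current.items()
        if job in previous and cpu - previous[job] < min_delta
    )


def sample_cpu_times():
    """Map each RUNNING job ID to the summed AveCPU of its steps (assumes accounting is on)."""
    jobs = subprocess.run(
        ["squeue", "-h", "-t", "RUNNING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    totals = {}
    for job in jobs:
        out = subprocess.run(
            ["sstat", "-a", "-n", "-P", "-j", job, "-o", "AveCPU"],
            capture_output=True, text=True,
        ).stdout
        values = [line for line in out.splitlines() if line.strip()]
        if values:  # jobs with no step data are skipped rather than flagged
            totals[job] = sum(parse_cpu_time(v) for v in values)
    return totals


def main(interval=600):
    """Take two samples `interval` seconds apart and report jobs that made no progress."""
    before = sample_cpu_times()
    time.sleep(interval)
    after = sample_cpu_times()
    for job in find_stalled(before, after):
        # Only report; an admin can decide whether to run `scancel` on the job.
        print(f"job {job} looks stalled")
```

Calling main() from cron or a systemd timer would be one way to run it. A job that was suspended for most of the sampling window and only just resumed would also look stalled here, so in practice you would probably want several consecutive stalled samples before cancelling anything.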
As I said, not really an answer, just my $0.02 (or even less).
(In reply to Paul Jones's message to slurm-users@lists.schedmd.com of Wed, May 15, 2024, 1:54 AM.)