Hi,

We use PreemptMode and PriorityTier in Slurm to suspend low-priority jobs when more urgent work needs to be done. This generally works well, but occasionally a resumed job fails to restart: Slurm sets the job state to running, but the application itself does not recover from being suspended.

Technically everything is working as expected, but is there any best practice we can pass on to users for coping with this situation? This is not strictly a Slurm question, but I wondered whether others have experience with it and any advice on how best to limit the impact.

Thanks,
Paul
