[slurm-users] SlurmdTimeout and keeping jobs running

Fri Aug 7 19:10:40 UTC 2020

Dear Slurm Community,

We recognize that SlurmdTimeout has a default value of 300 seconds, and
that if the controller is unable to communicate with a node for that long,
it will mark the node down. We have two questions regarding this:

1. Won't individual compute nodes also kill their own jobs if they are
unable to communicate with the controller for some number of minutes? If
so, is that controlled by the same SlurmdTimeout, or by a different
timeout parameter?

2. Are there any major scheduling or performance implications to increasing
these values, aside from the obvious potential to schedule a job on a node
that is down?
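
For reference, here is a minimal sketch of the setting we are asking about,
as it would appear in slurm.conf. The value shown is the documented default,
used purely as an illustration and not copied from any real cluster's
configuration:

```
# slurm.conf (illustrative fragment only)
# Seconds the controller waits for a node's slurmd to respond
# before marking that node DOWN. The documented default is 300.
SlurmdTimeout=300
```

The value currently in effect can be checked with
"scontrol show config | grep SlurmdTimeout".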

Thanks so much,
__________________________________________________
*Jacob D. Chappell, CSM*
Research Computing | Research Computing Infrastructure
Information Technology Services | University of Kentucky
jacob.chappell at uky.edu