[slurm-users] Change ExcNodeList on a running job

Ransom, Geoffrey M. Geoffrey.Ransom at jhuapl.edu
Wed Jun 10 21:48:05 UTC 2020



     I'm just curious as to what causes a user to decide that a given node has an issue? 
     If a node is healthy in all respects, why would a user decide not to use the node?

A few examples: not enough free TMPDIR space, a GPU that starts having memory errors, or a machine with a temporary issue that the Slurm health checks are not tracking at the time, so the node can blackhole jobs.

But honestly, this is less about dealing with actual technical problems and more about keeping users happy as we help port their existing Univa jobs to Slurm. We have a user whose run script adds the local node to the exclude list and requeues itself, up to 5 times, if it thinks the program it launched is not running correctly because of a machine issue. I could emulate this behavior easily if a running job could update its own ExcNodeList and requeue itself. I can have a job requeue itself (just sleep after the scontrol command, as the requeue is not instant), but Slurm does not seem to let me update ExcNodeList on a running job.
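For reference, the requeue half of this can be sketched roughly as below. This is only an illustration and rests on a few assumptions: the job is requeueable (for example, submitted with --requeue), the names requeue_self, MAX_REQUEUES and REQUEUE_WAIT are made up for this example, and the limit of 5 comes from the user's script rather than from Slurm. It does not touch ExcNodeList, which is the part that does not work for me.

```shell
#!/bin/bash
# Sketch only: requeue the current job, at most MAX_REQUEUES times.
# requeue_self, MAX_REQUEUES and REQUEUE_WAIT are hypothetical names
# chosen for this example; they are not Slurm features.
MAX_REQUEUES=5

requeue_self() {
    # Slurm sets SLURM_RESTART_COUNT once a job has been requeued;
    # treat it as 0 on the first run, when it is unset.
    local restarts="${SLURM_RESTART_COUNT:-0}"
    if [ "$restarts" -ge "$MAX_REQUEUES" ]; then
        echo "giving up after $restarts requeues" >&2
        return 1
    fi
    scontrol requeue "$SLURM_JOB_ID" || return 1
    # The requeue is not instant, so wait here until Slurm ends this run.
    sleep "${REQUEUE_WAIT:-60}"
}
```

In the job script this would be called along the lines of "my_program || requeue_self", where my_program stands in for whatever the user actually launches.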

Thanks for your suggestions.

