You were right, I found that the slurm.conf file was different between the controller node and the computes, so I've synchronized it now. I was also considering setting up an epilogue script to help debug what happens after the job finishes. Do you happen to have any examples of what an epilogue script might look like?
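For context, I was imagining something along these lines, just a rough sketch: the script path, the log location, and the `EPILOG_LOG` override are all my own placeholders, not anything from my current config:

```shell
#!/bin/bash
# Minimal Slurm Epilog sketch -- the paths below are assumptions.
#
# Enabled in slurm.conf with a line like:
#   Epilog=/etc/slurm/epilog.sh
# slurmd runs it as root on each allocated node once the job ends.

# Assumed log path for this sketch; EPILOG_LOG is just a local override knob.
LOG="${EPILOG_LOG:-/tmp/slurm_epilog.log}"

{
    echo "=== epilog for job ${SLURM_JOB_ID:-unknown} on $(hostname) at $(date) ==="
    echo "user: ${SLURM_JOB_USER:-?} (uid ${SLURM_JOB_UID:-?})"
    # List anything still running under the job's uid; leftover processes
    # that ignore signals are the usual suspects for "Kill task failed".
    if [ -n "${SLURM_JOB_UID:-}" ]; then
        ps -u "${SLURM_JOB_UID}" -o pid,stat,wchan:20,comm --no-headers
    fi
    echo "=== epilog done ==="
} >> "$LOG" 2>&1

# Note: if the epilog exits non-zero, slurmd drains the node, so the script
# should always end with a successful command.
```

My understanding is that slurmd exports `SLURM_JOB_ID`, `SLURM_JOB_USER`, and `SLURM_JOB_UID` into the epilog environment, so the script can log which job it is cleaning up after. Does that look roughly right?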
However, I'm now encountering a different issue:
REASON            USER  TIMESTAMP            NODELIST
Kill task failed  root  2024-10-21T09:27:05  nodemm04
Kill task failed  root  2024-10-21T09:27:40  nodemm06
I also checked the logs and found the following entries:
On nodemm04:
[2024-10-21T09:27:06.000] [223608.extern] error: *** EXTERN STEP FOR 223608 STEPD TERMINATED ON nodemm04 AT 2024-10-21T09:27:05 DUE TO JOB NOT ENDING WITH SIGNALS ***
On nodemm06:
[2024-10-21T09:27:40.000] [223828.extern] error: *** EXTERN STEP FOR 223828 STEPD TERMINATED ON nodemm06 AT 2024-10-21T09:27:39 DUE TO JOB NOT ENDING WITH SIGNALS ***
It looks like slurmstepd could not kill the extern step processes with signals on these nodes, and they were then drained with "Kill task failed". Any thoughts on what could be causing this?
Thanks for your help!