You were right, I found that the slurm.conf file was different between the controller node and the computes, so I've synchronized it now. I was also considering setting up an epilogue script to help debug what happens after the job finishes. Do you happen to have any examples of what an epilogue script might look like?
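For context, I was imagining something along these lines, just a rough sketch: the script path, the log location, and the `EPILOG_LOG` override are all my own placeholders, not anything from my current config:

```shell
#!/bin/bash
# Minimal Slurm Epilog sketch -- the paths below are assumptions.
#
# Enabled in slurm.conf with a line like:
#   Epilog=/etc/slurm/epilog.sh
# slurmd runs it as root on each allocated node once the job ends.

# Assumed log path for this sketch; EPILOG_LOG is just a local override knob.
LOG="${EPILOG_LOG:-/tmp/slurm_epilog.log}"

{
    echo "=== epilog for job ${SLURM_JOB_ID:-unknown} on $(hostname) at $(date) ==="
    echo "user: ${SLURM_JOB_USER:-?} (uid ${SLURM_JOB_UID:-?})"
    # List anything still running under the job's uid; leftover processes
    # that ignore signals are the usual suspects for "Kill task failed".
    if [ -n "${SLURM_JOB_UID:-}" ]; then
        ps -u "${SLURM_JOB_UID}" -o pid,stat,wchan:20,comm --no-headers
    fi
    echo "=== epilog done ==="
} >> "$LOG" 2>&1

# Note: if the epilog exits non-zero, slurmd drains the node, so the script
# should always end with a successful command.
```

My understanding is that slurmd exports `SLURM_JOB_ID`, `SLURM_JOB_USER`, and `SLURM_JOB_UID` into the epilog environment, so the script can log which job it is cleaning up after. Does that look roughly right?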
However, I'm now encountering a different issue:
REASON            USER  TIMESTAMP            NODELIST
Kill task failed  root  2024-10-21T09:27:05  nodemm04
Kill task failed  root  2024-10-21T09:27:40  nodemm06
I also checked the logs and found the following entries:
On nodemm04:
[2024-10-21T09:27:06.000] [223608.extern] error: *** EXTERN STEP FOR 223608 STEPD TERMINATED ON nodemm04 AT 2024-10-21T09:27:05 DUE TO JOB NOT ENDING WITH SIGNALS ***
On nodemm06:
[2024-10-21T09:27:40.000] [223828.extern] error: *** EXTERN STEP FOR 223828 STEPD TERMINATED ON nodemm06 AT 2024-10-21T09:27:39 DUE TO JOB NOT ENDING WITH SIGNALS ***
It looks like slurmstepd could not kill the extern step processes with signals on these nodes, and they were then drained with "Kill task failed". Any thoughts on what could be causing this?
Thanks for your help!