Hello everyone,
I’ve recently run into an issue where some nodes in our cluster are intermittently put into a drained state, typically right after long-running jobs finish. Below is the output of sinfo -R showing the reason “Prolog error”:
root@controller-node:~# sinfo -R
REASON               USER      TIMESTAMP            NODELIST
Prolog error         slurm     2024-09-24T21:18:05  node[24,31]
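As a temporary workaround I can return the nodes to service by hand (this only clears the drain flag, it obviously does not fix the underlying prolog failure); something like the following works, with the node list taken from the sinfo output above:
root@controller-node:~# scontrol update NodeName=node[24,31] State=RESUME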
When checking the slurmd.log files on the affected nodes, I noticed the following errors:
[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step.
**(repeated 90 times)**
[2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.
...
[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514
[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
[2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory
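In case the configuration is relevant, I intend to double-check the prolog and job accounting settings reported by the controller (Prolog, PrologFlags, PrologEpilogTimeout, JobAcctGatherType); a quick way to list them, assuming standard Slurm tooling, is:
root@controller-node:~# scontrol show config | grep -iE 'prolog|jobacct'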
If you have seen these errors before or know how to resolve them, please let me know; I would greatly appreciate any guidance or suggestions for further troubleshooting. Thank you in advance for your help.
Best regards,