Hello everyone,

I have recently encountered an issue where some nodes in our cluster randomly enter the drain state, typically after completing long-running jobs. Below is the output of the sinfo -R command showing the reason “Prolog error”:

root@controller-node:~# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Prolog error         slurm     2024-09-24T21:18:05 node[24,31]
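
As far as I understand, the drain can be cleared manually with scontrol once the node itself is healthy (node names taken from the sinfo output above), but this only works around the symptom and does not address the root cause:

# return the two drained nodes to service (workaround only, assuming nothing else is wrong with them)
root@controller-node:~# scontrol update nodename=node[24,31] state=resume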

When checking the slurmd.log files on the affected nodes, I noticed the following errors:

[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step.
(repeated 90 times)
[2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.

...

[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514                                     
[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
[2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory     
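
In case the prolog configuration itself is relevant, below is how I would list the prolog- and timeout-related settings on the controller (the grep pattern is just an example; I can post the actual values from our slurm.conf if that helps):

# show prolog and timeout settings as seen by slurmctld
root@controller-node:~# scontrol show config | grep -Ei 'prolog|timeout'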

I would greatly appreciate any guidance on how to resolve these errors, or any suggestions for further troubleshooting.

Thank you in advance for your assistance.

Best regards,

--
Télécom Paris
Nacereddine LADDAOUI
Research and Development Engineer

19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex
A school of IMT