Randomly draining nodes - slurm-users

7 Oct 2024


      Hello everyone,
I’ve recently encountered an issue where some nodes in our cluster enter 
a drain state randomly, typically after completing long-running jobs. 
Below is the output from the |sinfo| command showing the reason *“Prolog 
error”* :
|root@controller-node:~# sinfo -R REASON USER TIMESTAMP NODELIST Prolog 
error slurm 2024-09-24T21:18:05 node[24,31] |
When checking the |slurmd.log| files on the nodes, I noticed the 
following errors:
|[2024-09-24T17:18:22.386] [217703.extern] error: 
_handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to 
jobacct_gather plugin in the extern_step. **(repeated 90 times)** 
[2024-09-24T17:18:22.917] [217703.extern] error: 
_handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to 
jobacct_gather plugin in the extern_step. ... [2024-09-24T21:17:45.162] 
launch task StepId=217703.0 request from UID:54059 GID:1600 
HOST:<SLURMCTLD_IP> PORT:53514 [2024-09-24T21:18:05.166] error: Waiting 
for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up 
after 20 sec [2024-09-24T21:18:05.166] error: slurm_send_node_msg: 
[(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: 
Unexpected missing socket error [2024-09-24T21:18:05.166] error: 
_rpc_launch_tasks: unable to send return code to 
address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory |
If you know how to solve these errors, please let me know. I would 
greatly appreciate any guidance or suggestions for further troubleshooting.
Thank you in advance for your assistance.
Best regards,

-- 
Télécom Paris https://www.telecom-paris.fr 	
*Nacereddine LADDAOUI*
Ingénieur de Recherche et de Développement

19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex
Site web Télécom Paris https://www.telecom-paris.frX Télécom Paris 
https://twitter.com/TelecomParis_Facebook Télécom Paris 
https://www.facebook.com/TelecomParisLinkedIn Télécom Paris 
https://www.linkedin.com/school/telecom-paris/Instagram Télécom Paris 
https://www.instagram.com/telecom_paris/Blog Télécom Paris 
https://imtech.wp.imt.fr
Une école de l'IMT https://www.imt.fr