Hello everyone,
I've recently encountered an issue where some nodes in our cluster enter a drain state randomly, typically after completing long-running jobs. Below is the output of the `sinfo -R` command showing the reason "Prolog error":

root@controller-node:~# sinfo -R
REASON               USER      TIMESTAMP            NODELIST
Prolog error         slurm     2024-09-24T21:18:05  node[24,31]
When checking the `slurmd.log` files on the nodes, I noticed the following errors:

[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step. (repeated 90 times)
[2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.
...
[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514
[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
[2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory
If you know how to solve these errors, please let me know. I would greatly appreciate any guidance or suggestions for further troubleshooting.
Thank you in advance for your assistance.
Best regards,
Apologies if I'm missing this in your post, but do you in fact have a Prolog configured in your slurm.conf?
Hi Laura,
Thank you for your reply.
Indeed, Prolog is not configured on my machine:

$ scontrol show config | grep -i prolog
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain
ResvProlog              = (null)
SrunProlog              = (null)
TaskProlog              = (null)
Does it have to be set on all machines?
Your slurm.conf should be the same on all machines (is it? you don't have Prolog configured on some but not others?), but no, it is not mandatory to use a prolog. I am simply surprised that you could get a "Prolog error" without having a prolog configured, since an error in the prolog program itself is how I always get that error. Yours must be some kind of communication problem, or a difference in expectation between daemons about what requests ought to be exchanged.
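A quick way to check, assuming ClusterShell (clush) is installed and your config lives at the standard /etc/slurm/slurm.conf path, is to compare checksums across the nodes (the node range below is just a placeholder):

# -b collates identical output, so any node with a differing checksum stands out
clush -b -w node[01-99] md5sum /etc/slurm/slurm.conf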
You were right, I found that the slurm.conf file was different between the controller node and the computes, so I've synchronized it now. I was also considering setting up an epilogue script to help debug what happens after the job finishes. Do you happen to have any examples of what an epilogue script might look like?
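Something minimal along these lines is what I had in mind, pointed to by Epilog= in slurm.conf (just a sketch; the log path is a placeholder):

#!/bin/bash
# Epilog: slurmd runs this as root on each node when a job finishes there.
# Keep it fast and always exit 0 -- a non-zero exit status drains the node.
LOG=/var/log/slurm/epilog_debug.log                       # placeholder path
{
    echo "=== $(date -Is) node=$(hostname -s) job=${SLURM_JOB_ID:-?} user=${SLURM_JOB_USER:-?} ==="
    # Anything the job's user left running on this node?
    pgrep -u "$SLURM_JOB_USER" -l 2>/dev/null
} >> "$LOG" 2>&1
exit 0

Would that be a reasonable starting point?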
However, I'm now encountering a different issue:
REASON               USER   TIMESTAMP            NODELIST
Kill task failed     root   2024-10-21T09:27:05  nodemm04
Kill task failed     root   2024-10-21T09:27:40  nodemm06
I also checked the logs and found the following entries:
On nodemm04:
[2024-10-21T09:27:06.000] [223608.extern] error: *** EXTERN STEP FOR 223608 STEPD TERMINATED ON nodemm04 AT 2024-10-21T09:27:05 DUE TO JOB NOT ENDING WITH SIGNALS ***
On nodemm06:
[2024-10-21T09:27:40.000] [223828.extern] error: *** EXTERN STEP FOR 223828 STEPD TERMINATED ON nodemm06 AT 2024-10-21T09:27:39 DUE TO JOB NOT ENDING WITH SIGNALS ***
It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this?
Thanks for your help!
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:
It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this?
That usually means processes wedged in the kernel for some reason, in an uninterruptible sleep state. You can define an "UnkillableStepProgram" to be run on the node when that happens to capture useful state info. You can do that by, for example, iterating through the processes in the job's cgroup and dumping their `/proc/$PID/stack` somewhere useful, getting the `ps` info for all of those same processes, and/or doing an `echo w > /proc/sysrq-trigger` to make the kernel dump all blocked tasks.
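Something like the sketch below is what I mean (untested, the paths are placeholders, and it assumes slurmd exports SLURM_JOB_ID/SLURM_STEP_ID to the program -- check the slurm.conf man page for your version):

# slurm.conf (example values; the script path is a placeholder)
UnkillableStepProgram=/usr/local/sbin/unkillable_debug.sh
UnkillableStepTimeout=180

#!/bin/bash
# /usr/local/sbin/unkillable_debug.sh
# Runs on the node when a step's processes outlive UnkillableStepTimeout.
LOG=/var/log/slurm/unkillable_$(hostname -s).log          # placeholder path
{
    echo "=== $(date -Is) job=${SLURM_JOB_ID:-unknown} step=${SLURM_STEP_ID:-unknown} ==="
    # Walk the job's cgroup directories (named job_<jobid> in both v1 and v2
    # layouts) and record where each remaining process is stuck.
    for d in $(find /sys/fs/cgroup -type d -name "job_${SLURM_JOB_ID}" 2>/dev/null); do
        for pid in $(find "$d" -name cgroup.procs -exec cat {} + 2>/dev/null | sort -un); do
            echo "--- pid $pid ---"
            ps -o pid,stat,wchan:32,etime,args -p "$pid"
            cat "/proc/$pid/stack" 2>/dev/null            # kernel stack trace
        done
    done
    # Ask the kernel to log all blocked (D-state) tasks as well
    # (needs kernel.sysrq to permit the 'w' command).
    echo w > /proc/sysrq-trigger
} >> "$LOG" 2>&1
exit 0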
All the best, Chris
I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason)
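It is really just a few lines of shell run from cron, roughly like this (the mail address is a placeholder):

#!/bin/bash
# Mail a summary of any drained/draining nodes and their reasons.
MAILTO=hpc-admins@example.org                             # placeholder address
drained=$(sinfo -R -h)                                    # empty when nothing is drained
if [ -n "$drained" ]; then
    printf '%s\n' "$drained" | mail -s "Slurm drained nodes" "$MAILTO"
fi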
We get drains with the "Kill task failed" reason probably about 5 times a day. This despite having UnkillableStepTimeout=180
Right now we are still handling them manually by SSHing to the node and running a script we wrote called clean_cgroup_jobs, which looks for the unkilled processes using the cgroup info for the job.
If it finds none, it deletes the cgroups for the job and we resume the node. This is true about 95% of the time.
In the case of a truly unkillable process, it lists the process and then we manually investigate. Often in this case it is a hung NFS mount causing the problem, and we have various ways of dealing with that, which can involve faking the IP of the offline NFS server on another server to make the node's NFS client kernel process finally exit.
In the rare case where we cannot find a way to kill the unkillable process, we arrange to reboot the node.
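For reference, the core of clean_cgroup_jobs is roughly like this (a heavily simplified sketch; the real cgroup paths depend on your cgroup version and plugin layout):

#!/bin/bash
# Usage: clean_cgroup_jobs <jobid>
# List any PIDs still charged to the job's cgroups; if there are none,
# remove the leftover cgroup directories so the node can be resumed.
jobid=$1
[ -n "$jobid" ] || { echo "usage: $0 <jobid>" >&2; exit 1; }

cgdirs=$(find /sys/fs/cgroup -type d -name "job_${jobid}" 2>/dev/null)
[ -n "$cgdirs" ] || { echo "no cgroups found for job $jobid"; exit 0; }

pids=$(for d in $cgdirs; do find "$d" -name cgroup.procs -exec cat {} +; done 2>/dev/null | sort -un)

if [ -n "$pids" ]; then
    echo "job $jobid still has live processes:"
    ps -o pid,stat,wchan:32,etime,args -p "$(echo $pids | tr ' ' ',')"
    exit 1
fi

# Nothing left: remove the now-empty cgroup directories, deepest first,
# then resume the node with: scontrol update nodename=<node> state=resume
for d in $cgdirs; do
    find "$d" -depth -type d -exec rmdir {} + 2>/dev/null
done
echo "removed empty cgroups for job $jobid"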
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On 22-10-2024 16:46, Paul Raines via slurm-users wrote:
I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason)
Instead of cron you can also use Slurm triggers; see for example our scripts at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers. You can tailor the triggers to do whatever tasks you need.
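A minimal version is something like this (the script path and mail address are placeholders, and if I remember correctly node triggers have to be registered by SlurmUser):

#!/bin/bash
# /usr/local/sbin/notify_drained.sh (placeholder name): Slurm runs this when
# the trigger fires and passes the affected node name(s) as the first argument.
nodes="$1"
scontrol show node "$nodes" | grep -i reason \
    | mail -s "Slurm drained node(s): $nodes" hpc-admins@example.org

# Register it once as a permanent trigger (PERM re-arms it after each firing):
#   strigger --set --drained --flags=PERM --program=/usr/local/sbin/notify_drained.sh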
We get drains with the "Kill task failed" reason probably about 5 times a day. This despite having UnkillableStepTimeout=180
Some time ago it was recommended that UnkillableStepTimeout values above 127 (or 256?) should not be used, see https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this restriction is still valid with recent versions of Slurm?
Best regards, Ole
Hi Ole,
On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote:
Some time ago it was recommended that UnkillableStepTimeout values above 127 (or 256?) should not be used, see https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this restriction is still valid with recent versions of Slurm?
As I read it that last comment includes a commit message for the fix to that problem, and we happily use a much longer timeout than that without apparent issue.
https://support.schedmd.com/show_bug.cgi?id=11103#c30
All the best, Chris
Hi Chris,
Thanks for confirming that UnkillableStepTimeout can have larger values without issues. Do you have some suggestions for values that would safely cover network filesystem delays?
Best regards, Ole