230 is a strange exit status. Are you sure there's nothing in the prolog scripts and nothing called by the prolog scripts that could be returning that?
Do you know why systemd is stopping slurmd about twelve hours later?
Is there anything in the general host log (e.g. /var/log/messages) or in dmesg during either of those times that might indicate why the prolog is failing or slurmd is stopping?
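If nothing turns up there, one low-impact way to pin down which script is failing is to have each prolog script record its own exit status. A minimal sketch, assuming bash prologs; the log path is just a placeholder, not anything from your setup:

    # Lines that could be added near the top of each /slurm_stuff/bin/prolog.d/prolog* script.
    # Hypothetical log location; pick somewhere writable by root on every node.
    LOG=/var/log/slurm/prolog-debug.log

    # On any exit, record the timestamp, host, job ID, script name, and exit status.
    trap 'rc=$?; echo "$(date "+%F %T") $(hostname -s) job=${SLURM_JOB_ID:-?} $0 rc=$rc" >> "$LOG"' EXIT

    # ... existing prolog logic continues unchanged ...

That should at least tell you which of the two scripts is producing the 230 and for which jobs, which you can then correlate with the slurmd log.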
________________________________________
From: Ratnasamy, Fritz via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, 07 October 2025 22:53
To: Slurm User Community List
Subject: [slurm-users] Prolog error causing node to drain
Hi,
We have a running Slurm cluster and users have been submitting jobs for the past 3 months without any issues. Recently, some nodes are being drained at random with the reason "prolog error". Our slurm.conf has these two lines regarding the prolog:

PrologFlags=Contain,Alloc,X11
Prolog=/slurm_stuff/bin/prolog.d/prolog*
Inside the prolog.d folder there are two scripts, which run with no errors as far as I can see. Is there a way to debug why the nodes occasionally go into draining mode because of "prolog error"? It seems to happen at random times and on random nodes.
From the log file, I can see only this:
Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: prolog failed: rc:230 output:Successfully started proces>
Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: [job 20398] prolog failed status=230:0
Oct 06 00:57:43 pgpu008 slurmd[3709622]: slurmd: Job 20398 already killed, do not launch batch job
Oct 06 13:06:23 pgpu008 systemd[1]: Stopping Slurm node daemon...
Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Caught SIGTERM. Shutting down.
Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Slurmd shutdown completing
Currently, job 20398 (the one being killed in the log above) is in the state "Launch failed requeue held" after I resumed the node.
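(Side note: my understanding is that a job held after a prolog failure can be released once the node is healthy again, for example with the job ID from the log above:

    scontrol release 20398

though that only clears the hold and will not help until whatever is returning 230 is fixed.)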