Hi,

We have a running Slurm cluster, and users have been submitting jobs for the past 3 months without any issues. Recently, some nodes are getting drained at random with the reason "prolog error".
Our slurm.conf has these 2 lines regarding prolog:
PrologFlags=Contain,Alloc,X11
Prolog=/slurm_stuff/bin/prolog.d/prolog*

Inside the prolog.d folder there are two scripts, and they run without errors as far as I can tell. Is there a way to debug why the nodes occasionally go into a draining state because of a "prolog error"? It seems to happen at random times and on random nodes.
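
In case it is useful, I was thinking of temporarily pointing Prolog at a single wrapper that runs the two scripts and logs each one's exit code and output. A rough sketch (the wrapper and log paths below are just placeholders for our setup):

  #!/bin/bash
  # Temporary debug wrapper: run each real prolog script and record its exit code and output.
  LOG=/var/log/slurm/prolog_debug.log    # placeholder log location
  for script in /slurm_stuff/bin/prolog.d/prolog*; do
      out=$("$script" 2>&1)
      rc=$?
      echo "$(date '+%F %T') job=${SLURM_JOB_ID:-?} node=$(hostname -s) script=$script rc=$rc output=$out" >> "$LOG"
      # pass a non-zero exit code through so slurmd still treats it as a prolog failure
      [ "$rc" -ne 0 ] && exit "$rc"
  done
  exit 0

Would that be a sensible way to catch which script is returning 230, or is there a built-in way to get the full prolog output logged?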

From the log file, I can see only this:

Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: prolog failed: rc:230 output:Successfully started proces>
Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: [job 20398] prolog failed status=230:0
Oct 06 00:57:43 pgpu008 slurmd[3709622]: slurmd: Job 20398 already killed, do not launch batch job
Oct 06 13:06:23 pgpu008 systemd[1]: Stopping Slurm node daemon...
Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Caught SIGTERM. Shutting down.
Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Slurmd shutdown completing
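
Is there a recommended way to see more than the truncated "output:Successfully started proces>" text? I suspect the trailing ">" is just the pager cutting the line off, so my plan was to pull the full slurmd messages and the drain reason recorded for the node, roughly like this (node and unit names taken from the log above):

  # full, untruncated slurmd messages around the failure
  journalctl -u slurmd --no-pager | grep -B2 -A2 "prolog failed"
  # reason slurmctld recorded when it drained the node
  scontrol show node pgpu008 | grep -i reason
  sinfo -R -n pgpu008

I could also temporarily raise SlurmdDebug in slurm.conf and restart slurmd on that node if that would capture more detail about the prolog run.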


Currently, job 20398, the one killed in the log above, is in the state "Launch failed requeue held" after I resumed the node.
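
I assume releasing it with the command below will let it run again once the node issue is sorted out, but please correct me if there is a better way to recover jobs stuck in that state:

  scontrol release 20398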

Fritz Ratnasamy
Data Scientist
Information Technology