[slurm-users] Analyzing a stuck job
Christopher Samuel
chris at csamuel.org
Thu Feb 14 16:30:42 UTC 2019
On 2/14/19 8:02 AM, Mahmood Naderan wrote:
> One job is in RH state which means JobHoldMaxRequeue.
> The output file, specified by --output shows nothing suspicious.
> Is there any way to analyze the stuck job?
This happens when a job has failed to start MAX_BATCH_REQUEUE times
(currently 5), after which Slurm holds it instead of requeueing it again.
Check your controller (slurmctld) log and the slurmd logs on the nodes
the job was sent to, to see what goes wrong when Slurm tries to start it.
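As a rough aid for that log check, here is a minimal Python sketch that
pulls out the lines mentioning one job ID from whatever log files you pass
it. The helper name lines_mentioning_job and the script name are made up
for illustration, and the log paths are whatever SlurmctldLogFile and
SlurmdLogFile point to in your slurm.conf. It matches the bare number
only, so it may also pick up unrelated lines that happen to contain the
same number.

```python
import re
import sys


def lines_mentioning_job(lines, job_id):
    """Return the lines that contain job_id as a standalone number."""
    # Match the ID only when it is not part of a longer digit run, so that
    # job 123 does not also match 1234 or 5123.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(job_id)) + r"(?!\d)")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage:
    #   python find_job_lines.py <job_id> <log_file> [<log_file> ...]
    # Log locations are site-specific; see SlurmctldLogFile and
    # SlurmdLogFile in slurm.conf.
    job_id = sys.argv[1]
    for path in sys.argv[2:]:
        with open(path, errors="replace") as handle:
            for line in lines_mentioning_job(handle, job_id):
                print(f"{path}: {line}")
```

Run it against the controller log first, then against the slurmd log on
each node the job was dispatched to, and read the matching lines around
each failed start attempt.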
All the best,
Chris