slurm 24.11 - squeue displays reason "launch failed requeued held"
First find the node the job started on, then look at the slurmd log on that node. It will often show why the launch failed.
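As a rough sketch of those two steps, something like the following may help. The function names are made up for illustration, the log path is an assumption (check SlurmdLogFile in `scontrol show config` for your site), and the sacct call only works if accounting via slurmdbd is set up.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the default log path are assumptions.

# Show what the controller and the accounting database know about a job.
# A requeued job may no longer list the node it first started on, so we
# also ask sacct; --duplicates includes the earlier (requeued) record.
job_nodes() {
    local jobid="$1"
    scontrol show job "$jobid"
    sacct -j "$jobid" --duplicates --format=JobID,State,NodeList,Start,End
}

# Run this on the compute node itself: print slurmd log lines that
# mention the job ID. The second argument overrides the assumed log path.
job_log_lines() {
    local jobid="$1"
    local logfile="${2:-/var/log/slurm/slurmd.log}"
    grep -- "$jobid" "$logfile"
}
```

For example, `job_nodes 12345` on a login node, then `job_log_lines 12345` on the node it reports. Once the underlying cause is fixed, `scontrol release 12345` should take the job out of the held state.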
On Tue, 7 Jan 2025 at 11:30, sportlecon sportlecon via slurm-users < slurm-users@lists.schedmd.com> wrote:
slurm 24.11 - squeue displays reason "launch failed requeued held"
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
In general, the troubleshooting steps to take for Slurm are:
squeue to look at the list of running/queued or held jobs
sinfo to show which nodes are idle, busy or down
scontrol show node to get more detailed information on a node
For problem nodes, log in and check the daemon and its log (it is also worth logging into any healthy node to see what normal output looks like):
    systemctl status slurmd
    cat /var/log/slurm/slurmd.log
On your Slurm controller, look at the slurmctld and slurmdbd logs.
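A small sketch of the node and controller checks, in case it is useful. The function names are hypothetical. Note that `scontrol show config` reports the slurmctld and slurmd log locations, while the slurmdbd log location is set separately by LogFile in slurmdbd.conf.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are made up for illustration.

# List nodes that are down, drained or failing, with the recorded reason.
down_node_reasons() {
    sinfo -R
}

# Print the log file settings from the running configuration
# (for example SlurmctldLogFile and SlurmdLogFile).
slurm_log_paths() {
    scontrol show config | grep -i 'LogFile'
}
```

Running `down_node_reasons` first usually narrows down which nodes to log into, and `slurm_log_paths` tells you where to look if the logs are not under /var/log/slurm on your system.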