[slurm-users] restarting slurmctld restarts jobs???

mercan ahmet.mercan at uhem.itu.edu.tr
Mon Sep 20 10:33:11 UTC 2021


Hi;

Please check the StateSaveLocation directory. It should be readable and 
writable by both slurmctld nodes, and it should be a single shared 
directory, not two separate local directories.

The explanation below is taken from the Slurm web site:

"The backup controller recovers state information from the 
StateSaveLocation directory, which must be readable and writable from 
both the primary and backup controllers."
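
As a rough way to check this, the sketch below pulls StateSaveLocation 
and the SlurmctldHost entries out of slurm.conf text and tests whether 
the directory is readable and writable for the user running it. The 
host names (ctl1, ctl2), the path /shared/slurm/state and the helper 
names are only assumptions for illustration, not taken from your setup. 
It cannot tell whether the directory is really shared; for that you 
would run it on both controllers as the Slurm user and compare what 
each one sees.

```python
import os
import re

# Hypothetical slurm.conf excerpt; host names and path are made up.
SAMPLE_CONF = """
# two controllers, one state directory
SlurmctldHost=ctl1
SlurmctldHost=ctl2
StateSaveLocation=/shared/slurm/state
"""


def parse_slurm_conf(text):
    """Collect Key=Value pairs; keys are lower-cased, values kept in lists."""
    settings = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # A line may hold several whitespace-separated Key=Value pairs.
        for key, value in re.findall(r"(\S+?)=(\S+)", line):
            settings.setdefault(key.lower(), []).append(value)
    return settings


def check_state_dir(path):
    """True if path is a directory the current user can read and write."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK)


conf = parse_slurm_conf(SAMPLE_CONF)
state_dir = conf["statesavelocation"][-1]
print("controllers:", conf.get("slurmctldhost", []))
print("state dir:", state_dir, "usable:", check_state_dir(state_dir))
```

If one controller reports the directory as usable and the other does 
not, or the two see different contents, the backup cannot recover the 
state written by the primary.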

Regards;

Ahmet M.



On 20.09.2021 at 12:08, Diego Zuccato wrote:
> Hello all.
>
> After summer break, I noticed that rebooting one of the two slurmctld 
> nodes kills & requeues all running jobs. Before the break it did not 
> impact running jobs and nobody changed config during the break... Duh?
>
> Today I just restarted slurmctld and slurmd: same kill&requeue.
>
> I'm currently in the process of adding some nodes, but I already did 
> it other times w/ no issues (actually the second slurmctld node has 
> been installed to catch the race of a job terminating while the main 
> slurmctld was shut down).
>
> Anything I should double-check?
>
> Tks.
>


