[slurm-users] restarting slurmctld restarts jobs???

Diego Zuccato diego.zuccato at unibo.it
Mon Sep 20 12:06:59 UTC 2021


Uhm... Writing it down triggered an alarm bell.
What if, at boot, slurmctld gets started before /home is mounted? It 
wouldn't find the file.
That would explain the jobs being killed at reboot, but not the ones 
killed when just restarting slurmctld (with slurmdbd running). But it's 
probably worth more testing...
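
If that turns out to be the cause, one possible fix (just a sketch, 
assuming slurmctld is managed by systemd and that home is an NFS mount 
at /home) would be a drop-in that orders the service after the mount:

  # /etc/systemd/system/slurmctld.service.d/wait-for-home.conf
  [Unit]
  # RequiresMountsFor adds Requires=/After= on the mount unit for /home,
  # so slurmctld only starts once the NFS share is actually mounted.
  RequiresMountsFor=/home

followed by a "systemctl daemon-reload" before the next restart.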

On 20/09/2021 13:49, Diego Zuccato wrote:
> Tks. Checked it: it's on the home filesystem, NFS-shared between the 
> nodes. Well, actually it's a bit more involved than that: JobCompLoc 
> points to /var/spool/slurm/jobscompleted.txt, but /var/spool/slurm is 
> actually a symlink to /home/conf/slurm_spool.
> 
> root@str957-cluster:/# grep spool /etc/slurm.conf
> JobCompLoc=/var/spool/slurm/jobscompleted.txt
> root@str957-cluster:/# ls -l /var/spool/
> [...]
> lrwxrwxrwx 1 root root 22 apr 16 08:12 slurm -> /home/conf/slurm_spool
> 
> The symlinks are in place on both nodes and the home filesystem is mounted.
> 
> When can the jobscompleted.txt file be removed? Maybe some weird 
> character slipped in and it messes up the parsing? Can I test it?
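
One quick way to test that (assuming GNU grep and the file path shown 
in the grep output above) could be:

  # print any line containing bytes outside printable ASCII and tab
  grep -nP '[^\x09\x20-\x7e]' /var/spool/slurm/jobscompleted.txt

If it prints nothing, the file contains only plain printable ASCII.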
> 
> On 20/09/2021 12:33, mercan wrote:
>> Hi;
>>
>> Please check the StateSaveLocation directory, which should be 
>> readable and writable by both slurmctld nodes, and it should be a 
>> shared directory, not two local directories.
>>
>> The explanation below is taken from the Slurm web site:
>>
>> "The backup controller recovers state information from the 
>> StateSaveLocation directory, which must be readable and writable from 
>> both the primary and backup controllers."
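
For concreteness, a minimal sketch of the relevant slurm.conf lines for 
two controllers sharing a state directory (the hostnames and path below 
are placeholders, not taken from this cluster):

  # the first SlurmctldHost is the primary, the second is the backup
  SlurmctldHost=ctl-primary
  SlurmctldHost=ctl-backup
  # must live on storage readable/writable by both controllers
  StateSaveLocation=/shared/slurm_state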
>>
>> Regards;
>>
>> Ahmet M.
>>
>>
>>
>> On 20.09.2021 12:08, Diego Zuccato wrote:
>>> Hello all.
>>>
>>> After the summer break, I noticed that rebooting one of the two 
>>> slurmctld nodes kills & requeues all running jobs. Before the break 
>>> it did not impact running jobs, and nobody changed the config during 
>>> the break... Duh?
>>>
>>> Today I just restarted slurmctld and slurmd: same kill & requeue.
>>>
>>> I'm currently in the process of adding some nodes, but I've already 
>>> done that other times with no issues (actually, the second slurmctld 
>>> node was installed to catch the race of a job terminating while the 
>>> main slurmctld was shut down).
>>>
>>> Anything I should double-check?
>>>
>>> Tks.
>>>
> 

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


