[slurm-users] restarting slurmctld restarts jobs???

Diego Zuccato diego.zuccato at unibo.it
Mon Sep 20 11:49:36 UTC 2021


Tks. Checked it: it's on the home filesystem, NFS-shared between the
nodes. Well, actually a bit more involved than that: JobCompLoc points
to /var/spool/slurm/jobscompleted.txt, but /var/spool/slurm is actually
a symlink to /home/conf/slurm_spool.

root@str957-cluster:/# grep spool /etc/slurm.conf
JobCompLoc=/var/spool/slurm/jobscompleted.txt
root@str957-cluster:/# ls -l /var/spool/
[...]
lrwxrwxrwx 1 root        root          22 apr 16 08:12 slurm -> /home/conf/slurm_spool

The symlinks are on both nodes and the home is mounted.
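
To compare what each controller actually sees, something like the sketch below could be run on both nodes. It is only an illustration: resolve_conf_path is a made-up helper name, it assumes the key is written in slurm.conf with exactly the case given and without a trailing comment, and it relies on GNU readlink -f. The same call can be made with StateSaveLocation, the setting named in the quoted reply.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the symlink-resolved
# path that a slurm.conf key points to.
# Usage: resolve_conf_path KEY [CONF_FILE]
resolve_conf_path() {
    key=$1
    conf=${2:-/etc/slurm.conf}
    # Strip "KEY=" from uncommented lines; keep the last match found.
    value=$(sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$conf" | tail -n 1)
    if [ -z "$value" ]; then
        echo "$key not set in $conf" >&2
        return 1
    fi
    # readlink -f follows every symlink along the path.
    readlink -f -- "$value"
}

# Example (run on each slurmctld node and compare the output):
#   resolve_conf_path JobCompLoc
#   resolve_conf_path StateSaveLocation
```

If the two nodes print different resolved paths, or one of them fails, the directory is not really shared in the way the configuration suggests.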

When can the jobscompleted.txt file be removed? Maybe some weird
character slipped in and is messing up the parsing? Can I test it?
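
One rough way to test the "weird character" idea is sketched below. show_odd_lines is a made-up helper name; it assumes a grep that supports -a (GNU and BSD grep do) and treats anything outside printable ASCII and whitespace as suspicious, so legitimate UTF-8 text such as accented names would be flagged too. It says nothing about whether slurmctld reads this file at startup.

```shell
#!/bin/sh
# Hypothetical helper: list lines that contain bytes outside printable
# ASCII and whitespace, prefixed with their line numbers.
# Usage: show_odd_lines FILE
show_odd_lines() {
    # -a forces text mode so grep does not just say "Binary file matches";
    # cat -v renders control bytes visibly (e.g. ^A).
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' -- "$1" | cat -v
}

# Example:
#   show_odd_lines /var/spool/slurm/jobscompleted.txt
```

No output means no such bytes were found in the file.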

On 20/09/2021 12:33, mercan wrote:
> Hi;
> 
> Please check the StateSaveLocation directory, which should be readable
> and writable by both slurmctld nodes. It should be a shared directory,
> not two local directories.
> 
> The explanation below is taken from the Slurm web site:
> 
> "The backup controller recovers state information from the 
> StateSaveLocation directory, which must be readable and writable from 
> both the primary and backup controllers."
> 
> Regards;
> 
> Ahmet M.
> 
> 
> 
> On 20.09.2021 12:08, Diego Zuccato wrote:
>> Hello all.
>>
>> After summer break, I noticed that rebooting one of the two slurmctld 
>> nodes kills & requeues all running jobs. Before the break it did not 
>> impact running jobs and nobody changed config during the break... Duh?
>>
>> Today I just restarted slurmctld and slurmd: same kill&requeue.
>>
>> I'm currently in the process of adding some nodes, but I have already
>> done that other times with no issues (actually, the second slurmctld
>> node was installed to catch the race of a job terminating while the
>> main slurmctld was shut down).
>>
>> Anything I should double-check?
>>
>> Tks.
>>

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


