[slurm-users] slurm_state

Loris Bennett loris.bennett at fu-berlin.de
Fri Mar 12 09:49:34 UTC 2021


Hello Sebastian,

sblock <s.block at tu-berlin.de> writes:

> Hello,
>
> we had an outage of the cluster file system, which also held the
> Slurm StateSaveLocation. Slurm reported all jobs as orphans and then
> set the nodes DOWN because they were not responding.
> After the file system was back, users started to submit jobs, but the old
> queue was gone.
> Should Slurm not use the old slurm_state when the file system is back?
> What can we do to prevent losing the queue again in such a situation?
> The version is 17.11.5.

Although one might think that the Slurm controller could gather most of
the lost information from the slurmds which are still running, this does
not seem to happen, and my understanding is that all the information
about the queue is stored in the state save directory.  Therefore, if you
want to guard against such incidents in the future, you'll need to make
fairly frequent backups of that directory.

I don't know whether that is a common thing to do.  We don't do it
regularly, just when we update Slurm.
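
If you did want to automate it, a periodic backup could look roughly
like the sketch below.  This is only an illustration, not something we
run: the function name, the paths in the usage comment and the retention
count are all made up, and the real StateSaveLocation has to be taken
from your own slurm.conf.  An archive taken while slurmctld is writing
its state files may also be inconsistent, so treat it as a starting
point.

```python
import tarfile
import time
from pathlib import Path


def backup_state_dir(state_dir, backup_dir, keep=48, stamp=None):
    """Archive state_dir into backup_dir, keeping only the newest `keep` archives.

    Hypothetical helper: the name, the archive naming scheme and the
    default retention count are assumptions made for this sketch.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    state_dir = Path(state_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # A sortable timestamp in the file name lets us prune by name order.
    if stamp is None:
        stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"slurm_state-{stamp}.tar.gz"

    # Store the directory under its own name inside the archive.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(state_dir, arcname=state_dir.name)

    # Delete everything except the newest `keep` archives.
    archives = sorted(backup_dir.glob("slurm_state-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()

    return archive


# Example call (both paths are placeholders, not defaults of any kind):
# backup_state_dir("/var/spool/slurmctld", "/local/backup/slurm_state")
```

One would call this from cron or a systemd timer.  The backup directory
should live on a different file system from the StateSaveLocation,
otherwise the same outage takes out the copies as well.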

Cheers,

Loris

-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


