[slurm-users] SlurmdSpoolDir full
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Dec 8 15:25:24 UTC 2023
Hi Xaver,
On 12/8/23 16:00, Xaver Stiensmeier wrote:
> during a larger cluster run (the same I mentioned earlier 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information on that directory. We
> compute all data on another volume so SlurmdSpoolDir has roughly 38 GB
> of free space where nothing is intentionally put during the run. This
> error only occurred on very few nodes.
>
> I would like to understand what Slurmd is placing in this dir that fills
> up the space. Do you have any ideas? Due to the workflow used, we have a
> hard time reconstructing the exact scenario that caused this error. I
> guess, the "fix" is to just pick a bit larger disk, but I am unsure
> whether Slurm behaves normally here.
With Slurm RPM installation this directory is configured:
$ scontrol show config | grep SlurmdSpoolDir
SlurmdSpoolDir = /var/spool/slurmd
In SlurmdSpoolDir we find job scripts and various cached data. In our
cluster it's usually a few megabytes on each node. We have never had any
issues with the size of SlurmdSpoolDir.
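To see what is actually consuming the space, you could list the largest
entries under the spool directory on an affected node. A rough sketch
(assuming the default path /var/spool/slurmd; adjust if your site
configures it differently):

```shell
#!/bin/sh
# Show the largest entries in SlurmdSpoolDir (path is an assumption;
# check "scontrol show config | grep SlurmdSpoolDir" for your site).
SPOOL_DIR="${SPOOL_DIR:-/var/spool/slurmd}"
if [ -d "$SPOOL_DIR" ]; then
    # Sort entries by size, largest first, and show the top 20.
    du -sh "$SPOOL_DIR"/* 2>/dev/null | sort -rh | head -20
else
    echo "No such directory: $SPOOL_DIR" >&2
fi
```

Running this on a node that reported "SlurmdSpoolDir full" should reveal
whether job scripts, cached job credentials, or something else is taking
the space.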
Do you store SlurmdSpoolDir on a shared network storage, or what?
Can your job scripts contain large amounts of data?
/Ole