[slurm-users] SlurmdSpoolDir full

Sun Dec 10 17:07:15 UTC 2023

We keep /tmp on a separate partition on all nodes to mitigate this exact scenario, though it doesn’t necessarily need to be part of the primary system RAID. There is no need for /tmp to be resilient.
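
For anyone who wants to check whether a node is exposed to this, here is a minimal Python sketch that compares the device IDs of /tmp and the spool directory and prints the free space on each. The spool path /var/spool/slurmd is an assumption (it is the documented default; the real value is whatever SlurmdSpoolDir is set to in your slurm.conf), and the helper names same_filesystem and free_gib are made up for illustration.

```python
import os
import shutil


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both paths are on the same mounted filesystem.

    Two paths share a filesystem when stat() reports the same device ID.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def free_gib(path: str) -> float:
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Assumption: default spool location; adjust to match SlurmdSpoolDir in slurm.conf.
    spool_dir = "/var/spool/slurmd"
    tmp_dir = "/tmp"

    if not (os.path.isdir(spool_dir) and os.path.isdir(tmp_dir)):
        print("Spool or tmp directory not found; adjust the paths for this node.")
    else:
        print(f"{tmp_dir}: {free_gib(tmp_dir):.1f} GiB free")
        print(f"{spool_dir}: {free_gib(spool_dir):.1f} GiB free")
        if same_filesystem(spool_dir, tmp_dir):
            print("Same filesystem: a full /tmp will also fill the spool directory.")
        else:
            print("Separate filesystems: /tmp cannot fill the spool directory.")
```

If the two paths report the same filesystem, a job that fills /tmp will also exhaust the space available to slurmd, which is the situation a separate /tmp partition avoids.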

Regards,
Peter

Peter Goode
Research Computing Systems Administrator
Lafayette College

> On Dec 10, 2023, at 11:33, Ryan Novosielski <novosirj at rutgers.edu> wrote:
> 
> This is basically always somebody filling up /tmp, with /tmp residing on the same filesystem as the actual SlurmdSpoolDir.
> 
> /tmp, without modifications, is almost certainly the wrong place for temporary HPC files; they are too large.
> 
>> On Dec 8, 2023, at 10:02, Xaver Stiensmeier <xaverstiensmeier at gmx.de> wrote:
>> 
>> Dear slurm-user list,
>> 
>> During a larger cluster run (the same one I mentioned earlier, with 242 nodes), I
>> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
>> directory on the workers that is used for job state information
>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
>> I was unable to find more precise information on that directory. We
>> compute all data on another volume so SlurmdSpoolDir has roughly 38 GB
>> of free space where nothing is intentionally put during the run. This
>> error only occurred on very few nodes.
>> 
>> I would like to understand what Slurmd is placing in this dir that fills
>> up the space. Do you have any ideas? Due to the workflow used, we have a
>> hard time reconstructing the exact scenario that caused this error. I
>> guess the "fix" is to just pick a slightly larger disk, but I am unsure
>> whether Slurm is behaving normally here.
>> 
>> Best regards
>> Xaver Stiensmeier
>> 
>>