[slurm-users] SlurmdSpoolDir full
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Sun Dec 10 18:21:33 UTC 2023
On 10-12-2023 17:29, Ryan Novosielski wrote:
> This is basically always somebody filling up /tmp, with /tmp residing on the same filesystem as the actual SlurmdSpoolDir.
>
> /tmp, without modifications, is almost certainly the wrong place for temporary HPC files. Too large.
Agreed! That's why temporary job directories may be configured in
Slurm; see the Wiki page for a summary:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories
/Ole
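
To check whether a node is affected by the situation described above, one can compare the device IDs of /tmp and the spool directory. The following is only a minimal sketch: the function names are made up for illustration, and /var/spool/slurmd is assumed as the spool path (Slurm's documented default; your slurm.conf may set SlurmdSpoolDir to something else).

```python
import os
import shutil
import sys


def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def describe(spool_dir, tmp_dir="/tmp"):
    """Summarise free space on spool_dir and whether it shares a device with tmp_dir."""
    shared = same_filesystem(spool_dir, tmp_dir)
    free_gib = shutil.disk_usage(spool_dir).free / 2**30
    return (f"{spool_dir}: {free_gib:.1f} GiB free, "
            f"same filesystem as {tmp_dir}: {shared}")


if __name__ == "__main__":
    # Assumption: default spool path; pass your own SlurmdSpoolDir as an argument.
    spool = sys.argv[1] if len(sys.argv) > 1 else "/var/spool/slurmd"
    if os.path.isdir(spool):
        print(describe(spool))
    else:
        print(f"{spool} not found; pass the SlurmdSpoolDir path as an argument")
```

If the two paths turn out to share a filesystem, anything a job writes to /tmp counts against the space available to slurmd.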
>> On Dec 8, 2023, at 10:02, Xaver Stiensmeier <xaverstiensmeier at gmx.de> wrote:
>>
>> Dear slurm-user list,
>>
>> during a larger cluster run (the same one I mentioned earlier, with 242 nodes), I
>> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
>> directory on the workers that is used for job state information
>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
>> I was unable to find more precise information on that directory. We
>> compute all data on another volume so SlurmdSpoolDir has roughly 38 GB
>> of free space where nothing is intentionally put during the run. This
>> error only occurred on very few nodes.
>>
>> I would like to understand what Slurmd is placing in this dir that fills
>> up the space. Do you have any ideas? Due to the workflow used, we have a
>> hard time reconstructing the exact scenario that caused this error. I
>> guess the "fix" is to just pick a slightly larger disk, but I am unsure
>> whether Slurm is behaving normally here.
>>
>> Best regards
>> Xaver Stiensmeier
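
For the question of what is actually filling the directory, one option is to list the largest entries on an affected node while the run is still active. The sketch below is generic and assumes nothing about what slurmd stores there; the function names are illustrative, and the sizes are apparent file sizes (st_size), not allocated disk blocks.

```python
import os


def tree_size(path):
    """Sum the apparent size in bytes of all files below path (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


def largest_entries(directory, top=10):
    """Return up to `top` (size_in_bytes, name) pairs for the biggest entries in directory."""
    sizes = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = tree_size(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Assumption: /tmp is used here only as a directory that exists everywhere;
    # point this at the filesystem that reported the error.
    for size, name in largest_entries("/tmp"):
        print(f"{size / 2**20:10.1f} MiB  {name}")
```

Running this against both the spool directory and /tmp should show whether the space is being consumed by Slurm itself or by job files on the same filesystem.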