[slurm-users] Slurm Crashing - File has zero size

Pedro Luiz de Castro pedro.castro at medicina.ulisboa.pt
Thu Oct 28 18:57:45 UTC 2021


Hello all

Since yesterday we’ve been having trouble with Slurm: slurmctld crashes and isn’t able to recover.
I’ve managed to track the fault to a zero-sized file by launching slurmctld -Dvvvv:

slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero size

That path is under our StateSaveLocation, so the environment file for this particular job is not being created correctly.
I don’t believe it’s a space issue, as there’s about 2TB of free space on this mountpoint.
It shouldn’t be permissions either, as other jobs run fine and complete.
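
To check whether other jobs are affected before slurmctld hits them, the state directory can be scanned for empty per-job files. The following is only a minimal sketch, not a Slurm tool: the function name find_empty_job_files is hypothetical, the hash.N/job.ID/environment layout is taken from the log line above, and the "script" file name is an assumption about what else sits next to it.

```python
from pathlib import Path


def find_empty_job_files(state_dir, names=("environment", "script")):
    """Return sorted paths of zero-size per-job files under state_dir.

    Assumes the layout seen in the log line: <state_dir>/hash.N/job.ID/<name>.
    The default file names are an assumption, adjust them as needed.
    """
    empty = []
    for job_dir in Path(state_dir).glob("hash.*/job.*"):
        for name in names:
            path = job_dir / name
            # Only report regular files that exist and have zero length.
            if path.is_file() and path.stat().st_size == 0:
                empty.append(str(path))
    return sorted(empty)
```

Calling find_empty_job_files("/mnt/nfs/lobo/IMM-NFS/slurm") would list every empty environment file under that directory, which shows whether job 2044004 is a one-off or part of a pattern (for example, all jobs submitted in a particular time window).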

For now I’ve been launching slurmctld -i to work around this issue, killing the job in question.
This way Slurm can keep running for our users.

Any ideas where I should look next to try and troubleshoot this issue?

Thanks for all the help in advance.

Best regards,
Pedro Luiz de Castro
IT Support & System Administrator
Information Systems
Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt
