[slurm-users] Slurm Crashing - File has zero size
Pedro Luiz de Castro
pedro.castro at medicina.ulisboa.pt
Thu Oct 28 18:57:45 UTC 2021
Hello all
Since yesterday we’ve been having trouble with Slurm: slurmctld crashes and isn’t able to recover.
Launching slurmctld -Dvvvv, I’ve managed to track the fault to a zero-sized file:
slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero size
That path is under our StateSaveLocation, so the environment file for this particular job is not being created correctly.
I don’t believe it’s a space issue, as there’s about 2 TB of free space on this mountpoint.
It shouldn’t be a permissions issue either, as other jobs run fine and complete.
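To see whether this job is the only one affected, it may help to scan the StateSaveLocation for other empty files. Below is a minimal Python sketch, assuming the hash.N/job.ID directory layout seen in the log line above; the function name find_empty_job_files is hypothetical, not part of Slurm.

```python
import os


def find_empty_job_files(state_dir):
    """Return sorted paths of zero-byte files inside job.* directories.

    Assumes a layout like <state_dir>/hash.4/job.2044004/environment,
    as in the slurmctld log line; adjust if your layout differs.
    """
    empty = []
    for root, _dirs, files in os.walk(state_dir):
        # Only inspect per-job directories such as hash.4/job.2044004
        if not os.path.basename(root).startswith("job."):
            continue
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # The file vanished between listing and stat; skip it
                continue
    return sorted(empty)
```

Calling find_empty_job_files("/mnt/nfs/lobo/IMM-NFS/slurm") and printing the result would list every empty file under a job directory, which shows whether the problem is isolated to job 2044004 or more widespread on the NFS mount.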
For now I’ve been working around the issue by launching slurmctld -i, at the cost of killing the job in question.
This way Slurm stays available for our users.
Any ideas on where I should look next to troubleshoot this issue?
Thanks in advance for all the help.
Best regards,
Pedro Luiz de Castro
IT Support & System Administrator
Information Systems
Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt