[slurm-users] Slurm Crashing - File has zero size
Pedro Luiz de Castro
pedro.castro at medicina.ulisboa.pt
Thu Nov 25 12:52:05 UTC 2021
Hi Brian,
Sorry for the very late reply; we had some trouble with our email over the past few weeks, so I was unable to reply sooner.
I also hope I'm replying properly, as I'm quite new to mailing lists and replying through digests.
Your suggestion might have been right on the money.
I tried to check inode usage with 'df -i', but it kept reporting an IUse% of 0%, which was particularly odd.
In any case, I ended up zipping some big folders, and that seems to have solved the problem.
Thank you so much for your help.
Best,
Pedro Luiz de Castro
IT Support & System Administrator
Information Systems
Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt
-----Original Message-----
From: Brian Andrus <toomuchit at gmail.com>
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Slurm Crashing - File has zero size
Message-ID: <71bf6c73-0d17-9654-e07d-1305efb9f98f at gmail.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
You may have space, but do you have enough inodes?
Those are two different things to look at when trying to see why you cannot write to a disk.
Also verify that it is writeable by SlurmUser.
If something happened and it automatically remounted itself as read-only, that can do it too.
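
The three checks above (inodes, writability, read-only remount) can be sketched in Python as follows. This is only a minimal illustration: the function name check_filesystem is hypothetical, and the writability test applies to whichever user runs the script, so it would need to be run as the SlurmUser account to answer that question.

```python
import os

def check_filesystem(path):
    """Collect inode, read-only, and writability information for `path`."""
    st = os.statvfs(path)
    total_inodes = st.f_files   # total inodes reported by the filesystem
    free_inodes = st.f_ffree    # free inodes reported by the filesystem

    # Some filesystems report no inode totals at all; avoid dividing by zero.
    if total_inodes > 0:
        inode_use_pct = 100.0 * (total_inodes - free_inodes) / total_inodes
    else:
        inode_use_pct = None

    return {
        "total_inodes": total_inodes,
        "free_inodes": free_inodes,
        "inode_use_pct": inode_use_pct,
        # ST_RDONLY is set in f_flag when the filesystem is mounted read-only.
        "read_only": bool(st.f_flag & os.ST_RDONLY),
        # Whether the user running this script may write to `path`.
        "writable_by_me": os.access(path, os.W_OK),
    }

if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for key, value in check_filesystem(target).items():
        print(f"{key}: {value}")
```

Pointing it at the StateSaveLocation directory gives the same numbers 'df -i' works from, plus the mount's read-only flag.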
Brian Andrus
On 10/28/2021 11:57 AM, Pedro Luiz de Castro wrote:
>
> Hello all
>
> Since yesterday we've been having some trouble with slurm where it
> crashes and isn't able to recover.
> I've managed to track the fault to a zero sized file, launching
> slurmctld -Dvvvv
>
> slurmctld: File
> /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero
> size
>
> That's the StateSaveLocation, so the environment file for this
> particular job is not getting correctly created.
> I don't believe it's a space issue as there's about 2TB of free space
> on this mountpoint.
>
> Shouldn't be permissions either, as other jobs run fine and get completed.
>
> For now I've been launching slurmctld -i to work around this issue,
> killing the job in question.
>
> This way slurm can still be running for our users.
>
> Any ideas where I should look next to try and troubleshoot this issue?
>
> Thanks for all the help in advance.
>
> Best regards,
>
> *Pedro Luiz de Castro*
>
> IT Support & System Administrator
> Information Systems
>
> Faculdade de Medicina, Universidade de Lisboa
> Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
> iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
>
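
The quoted report traces the crash to a single zero-size environment file under the StateSaveLocation. A minimal Python sketch for listing any other empty files in that tree is shown below. The function name find_zero_size_files is hypothetical, and the script only reports files; it does not delete or repair anything.

```python
import os

def find_zero_size_files(root):
    """Return a sorted list of regular files under `root` whose size is 0 bytes."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.path.isfile(full) and os.path.getsize(full) == 0:
                    empty.append(full)
            except OSError:
                # The file may vanish while slurmctld is running; skip it.
                continue
    return sorted(empty)

if __name__ == "__main__":
    import sys
    # Pass the StateSaveLocation directory as the first argument.
    for path in find_zero_size_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```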