[slurm-users] Slurm Crashing - File has zero size

Pedro Luiz de Castro pedro.castro at medicina.ulisboa.pt
Thu Nov 25 12:52:05 UTC 2021


Hi Brian,

Sorry for the very late reply, but we had some trouble with our email for the past few weeks, so I was unable to get back to you sooner.
I also hope I'm replying properly, as I'm quite new to mailing lists and replying through digests.

Your suggestion might have been right on the money.
I tried getting a reading of the used inodes with 'df -i', but it kept returning an IUse of 0%, which was particularly odd.
In any case, I ended up zipping some of the bigger folders, and that seems to have solved the problem.
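In case it helps anyone searching the archives later, this is roughly the kind of check involved (the path is our StateSaveLocation mount, and the exact commands are only a sketch of what I ran, so adjust them to your own setup):

    df -h /mnt/nfs/lobo/IMM-NFS                       # free space on the StateSaveLocation filesystem
    df -i /mnt/nfs/lobo/IMM-NFS                       # inode usage; NFS mounts may report 0 or '-' here
    find /mnt/nfs/lobo/IMM-NFS/slurm -xdev | wc -l    # rough file count when 'df -i' is not informative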

Thank you so much for your help.

Best,
Pedro Luiz de Castro
IT Support & System Administrator
Information Systems

Faculdade de Medicina, Universidade de Lisboa 
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt


-----Original Message-----
From: Brian Andrus <toomuchit at gmail.com>
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Slurm Crashing - File has zero size
Message-ID: <71bf6c73-0d17-9654-e07d-1305efb9f98f at gmail.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

You may have space, but do you have enough inodes?

Two different things to look at when trying to see why you cannot write to a disk.

Also verify that it is writeable by SlurmUser.

If something happened and it automatically remounted itself as read-only, that can do it too.
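A rough sketch of what those checks could look like (the mount point and the 'slurm' user name below are just placeholders, not taken from your setup):

    df -i /path/to/statesave                             # inode usage on the StateSaveLocation filesystem
    mount | grep statesave                               # look for 'ro' in the mount options (read-only remount)
    sudo -u slurm touch /path/to/statesave/.write_test   # confirm SlurmUser can actually create a file
    rm /path/to/statesave/.write_test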

Brian Andrus

On 10/28/2021 11:57 AM, Pedro Luiz de Castro wrote:
>
> Hello all
>
> Since yesterday we've been having some trouble with slurm where it 
> crashes and isn't able to recover.
> I've managed to track the fault to a zero sized file, launching 
> slurmctld -Dvvvv
>
> slurmctld: File
> /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero 
> size
>
> That's the StateSaveLocation, so the environment file for this 
> particular job is not getting correctly created.
> I don't believe it's a space issue as there's about 2TB of free space 
> on this mountpoint.
>
> Shouldn't be permissions either, as other jobs run fine and get completed.
>
> For now I've been launching slurmctld -i to work around this issue, 
> killing the job in question.
>
> This way slurm can still be running for our users.
>
> Any ideas where I should look next to try and troubleshoot this issue?
>
> Thanks for all the help in advance.
>
> Best regards,
>
> *Pedro Luiz de Castro*
>
> IT Support & System Administrator
> Information Systems
>
> Faculdade de Medicina, Universidade de Lisboa
> Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
> iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
>

