[slurm-users] Nodes not returning from DRAINING

Diego Zuccato diego.zuccato at unibo.it
Wed Oct 28 13:27:31 UTC 2020


On 28/10/20 13:38, Diego Zuccato wrote:

>> What I'm looking for is some way to avoid having to do that.
> Now trying UnkillableStepTimeout=300 ... Fingers crossed...
Ok. Seems it's working.
The problem was that writing a big (2.2GB) core file via NFS took
too long, and the default UnkillableStepTimeout of 60s was not enough.
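
For reference, a minimal sketch of the change in slurm.conf. The value
300 is just what worked here for a 2.2GB core over NFS; it is not a
general recommendation, and the daemons need to re-read the
configuration before it takes effect:
-8<--
# slurm.conf
# Seconds to wait for a step's processes to exit after SIGKILL
# before the node is marked as having an unkillable step (default: 60)
UnkillableStepTimeout=300
-8<--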

Strangely, the core file seems corrupted (maybe because it's from a
4-node job and all the nodes try to write to the same file?):
-8<--
# file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), too many
program headers (2533)
-8<--
And gdb can't backtrace :( So the core file takes a long time to be
created and is useless. Perfect :)
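
If the "same file" guess is right, one way to check it would be to give
each process its own core file name. This is only a sketch and untested
here: it assumes the kernel writes cores directly (no systemd-coredump
or abrt handler in the way), and the file name under /etc/sysctl.d is
made up:
-8<--
# /etc/sysctl.d/90-core-pattern.conf (hypothetical file name)
# %h = hostname, %e = executable name, %p = PID of the dumped process
kernel.core_pattern = core.%h.%e.%p
-8<--
Each node would then write a separate file instead of all of them
writing to a single "core" on the shared NFS directory.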

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


