[slurm-users] Job canceled after reaching QOS limits for CPU time.

Fri Oct 30 14:19:19 UTC 2020

On 30/10/20 14:38, Zacarias Benta wrote:

> I know it sounds kind of silly giving a limit and at the same time
> allowing for exceptions, but we are trying to prevent the waste of
> valuable cpu time.
Then convince your users to use checkpointing, and enforce shorter run
times (we allow 24h for the 'normal' QoS and 72h for the 'long' QoS,
which has very low priority).

If the program writes its current state when it receives a SIGTERM (it
has to do so quickly, but you can tune the timeout), the next run can
load that saved state and nothing is lost.

If you allow a job to run for a month, and on the 29th day the node
crashes, you've lost a lot. If the job works in chunks of 24h, in the
worst case you lose 23h59' ...

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786