[slurm-users] How to deal with jobs that need to be restarted several times
Renfro, Michael
Renfro at tntech.edu
Tue Mar 12 14:32:38 UTC 2019
If the failures happen right after the job starts (or close enough), I’d use an interactive session with srun (or some other wrapper that calls srun, such as fisbatch).
Our hpcshell wrapper for srun is just a bash function:
=====
hpcshell ()
{
    srun --partition=interactive "$@" --pty bash -i
}
=====
The interactive partition argument is optional, but we use it as a time- and resource-limited partition with a higher priority. I always recommend that our users develop and debug with interactive jobs, and only submit the full production job with sbatch after all the easy bugs have been identified.
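As a rough sketch of how such a session might be used: the wrapper below is the same function with "$@" quoted, so that arguments containing spaces are passed through intact. The resource values and the script name in the comments are placeholders I made up for illustration, not settings from our cluster.

```shell
# Wrapper around srun: extra arguments are forwarded before --pty.
# Quoting "$@" keeps each argument as one word even if it contains spaces.
hpcshell() {
    srun --partition=interactive "$@" --pty bash -i
}

# Hypothetical usage (resource values are placeholders):
#   hpcshell --nodes=1 --ntasks=4 --time=02:00:00
#
# Inside the resulting shell, edit and rerun the failing step as often
# as needed, for example:
#   ./run_simulation.sh    # hypothetical script name
#
# The allocation is held until the shell exits or the time limit is hit.
```

Since the resources stay allocated for the lifetime of the interactive shell, each retry starts immediately instead of going back through the queue.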
--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University
> On Mar 12, 2019, at 9:26 AM, Selch, Brigitte (FIDF) <Brigitte.Selch at man.eu> wrote:
>
> Hello,
>
> Some jobs have to be restarted several times until they run.
> Users start the job, it fails, they have to make some changes,
> they start the job again, it fails again … and so on.
>
> So they want to keep the resources until the job is running properly.
>
> Is there a possibility to ‘inherit’ allocated resources
> from one job to the next?
>
> Or something else to do the job?
>
> All our jobs are submitted with sbatch
>
> Thank you,
> Brigitte Selch
>
> MAN Truck & Bus AG
> IT Produktentwicklung Simulation (FIDF)
> Vogelweiher Str. 33
> 90441 Nürnberg
>
> Telefon +49 911 420 6056
> Brigitte.Selch at man.eu