[slurm-users] How to automatically release jobs that failed with "launch failed requeued held"

Doug Meyer dameyer99 at gmail.com
Wed Jan 23 01:39:17 UTC 2019


scontrol release nnnnn    (where nnnnn is the job ID)

I am not sure whether Slurm can be set to release held jobs automatically,
but I would not want it to: on a faulty system the jobs would go into an
endless loop of start, fail, requeue, start.
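
If a lot of jobs were held, releasing them by hand once the prolog is fixed
can be scripted. The following is only a minimal sketch: the function names
held_job_ids and release_held are made up for illustration, and the reason
text is copied from the report quoted below, so check it against your own
squeue output before relying on it. It is meant to be run manually after the
fault is fixed, not from cron.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and the reason string is an
# assumption taken from the original report; verify it with `squeue -t PD`.

# Read "jobid|reason" lines on stdin and print the job IDs whose reason
# exactly matches the first argument.
held_job_ids() {
    awk -F'|' -v reason="$1" '$2 == reason { print $1 }'
}

# List pending jobs as "jobid|reason" and release those held after a
# failed launch. Run this by hand, after the underlying fault is fixed.
release_held() {
    squeue -h -t PD -o '%i|%r' |
        held_job_ids "launch failed requeued held" |
        while read -r jobid; do
            scontrol release "$jobid"
        done
}
```

Sourcing the file and calling release_held releases the matching jobs in one
pass. Keeping the filter separate from the release step makes it easy to
check which job IDs would be affected first.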

Doug

On Tue, Jan 22, 2019 at 10:45 AM Roger Moye <rmoye at quantlab.com> wrote:

> This morning we had several jobs fail with the “launch failed requeued
> held” state. We traced this to a failed prolog. We fixed the problem, but
> the jobs remained in this state.
>
> Is there a way to configure Slurm so that it will automatically release
> the job from the held state so that it can run? There were plenty of
> healthy nodes for this job, so I’d prefer that the job not remain held
> indefinitely.
>
>
>
> Thanks!
>
> -Roger