[slurm-users] Job cancelled into the future
Reed Dier
reed.dier at focusvq.com
Thu Jan 19 17:33:27 UTC 2023
Just to hopefully close this out, I believe I was actually able to resolve this in “user-land” rather than mucking with the database.
I was able to requeue the bad job IDs, and they went pending.
Then I updated the jobs to a time limit of 60.
Then I scancelled the jobs, and they returned to a cancelled state, before they rolled off within about 10 minutes.
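
The three steps above can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the exact commands that were run: the job IDs are placeholders, the helper names build_cleanup_commands and cleanup_jobs are hypothetical, and it assumes scontrol and scancel are on the PATH and that a bare TimeLimit number is read as minutes. By default it only prints the commands (dry run) so they can be reviewed first.

```python
import subprocess


def build_cleanup_commands(job_id, time_limit_minutes=60):
    """Return the three commands for one stuck job, in the order described above."""
    jid = str(job_id)
    return [
        # 1. Requeue the job so it goes back to pending.
        ["scontrol", "requeue", jid],
        # 2. Give it a sane time limit (assumed to be minutes).
        ["scontrol", "update", f"JobId={jid}", f"TimeLimit={time_limit_minutes}"],
        # 3. Cancel it so it ends up in a cancelled state.
        ["scancel", jid],
    ]


def cleanup_jobs(job_ids, dry_run=True):
    """Print (dry run) or execute the commands for each job; return what was planned."""
    planned = []
    for job_id in job_ids:
        for cmd in build_cleanup_commands(job_id):
            planned.append(cmd)
            if dry_run:
                print(" ".join(cmd))
            else:
                # check=True stops at the first failing command.
                subprocess.run(cmd, check=True)
    return planned


if __name__ == "__main__":
    # Placeholder job IDs, for illustration only.
    cleanup_jobs([1234567, 1234568])
```

Calling cleanup_jobs(ids, dry_run=False) would actually run the commands, so it is worth checking the printed output against the affected jobs before doing that.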
Surprised I didn’t think to try requeueing earlier, but here’s to hoping that this did the trick, and I will have more accurate reporting and fewer “more time than is possible” log errors.
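
For context on the two time limits quoted further down (4294967295 and 525600), here is a quick arithmetic check, assuming both values are in minutes. The first is the largest unsigned 32-bit value, which appears to correspond to Slurm's "infinite" sentinel (an assumption, not confirmed in this thread); the second is exactly 365 days. The constant and function names are only illustrative.

```python
# Values taken from the quoted message below; units assumed to be minutes.
BAD_LIMIT_MINUTES = 4294967295
SANE_LIMIT_MINUTES = 525600

UINT32_MAX = 2**32 - 1
MINUTES_PER_DAY = 24 * 60


def minutes_to_days(minutes):
    """Convert a time limit in minutes to days."""
    return minutes / MINUTES_PER_DAY


# The bad limit is exactly the maximum unsigned 32-bit integer.
print(BAD_LIMIT_MINUTES == UINT32_MAX)

# The sane limit works out to a whole number of days.
print(minutes_to_days(SANE_LIMIT_MINUTES))

# The bad limit expressed in 365-day years, to show how far off it is.
print(minutes_to_days(BAD_LIMIT_MINUTES) / 365)
```

A limit of thousands of years is consistent with accounting treating those jobs as able to consume more time than the cluster could possibly provide.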
Thanks,
Reed
> On Jan 17, 2023, at 11:29 AM, Reed Dier <reed.dier at focusvq.com> wrote:
>
> So I was going to take a stab at trying to rectify this after taking care of post-holiday matters.
>
> Paste of the $CLUSTER_job_table table where I think I see the issue, and now I just want to sanity check my steps to remediate.
> https://rentry.co/qhw6mg (a pastebin alternative, because markdown is paywalled on pastebin).
>
> There are a number of job steps with a timelimit of 4294967295, whereas the others of the same job array are 525600.
> Obviously I want to edit those time limits to sane limits (match them to the others).
> I don’t see anything in the $CLUSTER_step_table that looks like it would need to be modified to match, though I could be wrong.
>
> But then the part of getting slurm to pick it up is where I’m wanting to make sure I’m on the right page.
> Should I manually update the mod_time timestamp and slurm will catch that at its next rollup?
> Or will slurm catch the change in the time limit and update the mod_time when it sees it upon rollup?
>
> I also don’t see any documentation stating how to manually trigger a rollup, either via slurmdbd.conf or command line flag.
> Will it automagically perform a rollup at some predefined, non-configurable interval, or when restarting the daemon?
>
> Apologies if this is all trivial information, just trying to measure twice and cut once.
>
> Appreciate everyone’s help so far.
>
> Thanks,
> Reed
>
>> On Dec 23, 2022, at 7:18 PM, Chris Samuel <chris at csamuel.org> wrote:
>>
>> On 20/12/22 6:01 pm, Brian Andrus wrote:
>>
>>> You may want to dump the database, find what table/records need updated and try updating them. If anything went south, you could restore from the dump.
>>
>> +lots to making sure you've got good backups first. Stop slurmdbd before you start on the backups, and don't restart it until you've made the changes, including setting the rollup times to be before the jobs started, to make sure that the rollups include these changes!
>>
>> When you start slurmdbd after making the changes it should see that it needs to do rollups and kick those off.
>>
>> All the best,
>> Chris
>> --
>> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA