[slurm-users] srun jobfarming hassle question

Ohlerich, Martin Martin.Ohlerich at lrz.de
Wed Jan 18 15:03:23 UTC 2023


Alright. I hadn't seen that option for GNU parallel. Retrying a task that failed for a good reason (e.g. due to OOM) probably doesn't make much sense, though. And if the farming job itself timed out, GNU parallel does not pick up from its former state when that job is restarted, does it? I guess book-keeping is a separate issue, which is probably why Magnus also used a server with some kind of database.

But OK. GNU parallel's documentation is indeed quite vast. I'll try to work through its other/new features (it is also still being developed on ... ).


Concerning Dask ... I have heard of it, but never tried it ('cos Intel advertised it ... 😏 ).

Maybe I should reconsider that.


Thank you for this input!

KR, Martin


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Ward Poelmans <ward.poelmans at vub.be>
Sent: Wednesday, 18 January 2023 15:35
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] srun jobfarming hassle question

On 18/01/2023 15:22, Ohlerich, Martin wrote:
> But Magnus (thanks for the link!) is right. This is still far from a feature-rich job- or task-farming concept, where at least some overview of the passed/failed/missing task statistics is available, etc.

GNU parallel has log output and options to retry failed jobs.
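As a rough illustration of the passed/failed/missing overview asked about above, the log that GNU parallel writes with `--joblog` can be summarised with a few lines of Python. This is only a sketch: the function name `summarize_joblog` and the `expected_total` parameter are made up for this example, and it assumes the usual tab-separated joblog header (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command). It also assumes that a rerun of a task appends a new line with the same Seq, so only the last line per Seq is counted.

```python
import csv
import io


def summarize_joblog(text, expected_total=None):
    """Summarise a GNU parallel joblog given as a string (hypothetical helper).

    Assumes a tab-separated log with a header line that contains at least
    the columns Seq, Exitval and Signal.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    # Assumption: reruns append a new line with the same Seq,
    # so the last line seen for a Seq is the one that counts.
    last = {}
    for row in reader:
        last[row["Seq"]] = row

    # A task counts as failed if it exited non-zero or was killed by a signal.
    failed = sorted(
        int(seq) for seq, row in last.items()
        if row["Exitval"] != "0" or row["Signal"] != "0"
    )
    passed = len(last) - len(failed)

    # "Missing" tasks can only be counted if the caller knows how many
    # tasks were supposed to run in total.
    missing = None
    if expected_total is not None:
        missing = max(expected_total - len(last), 0)

    return {"passed": passed, "failed": len(failed),
            "missing": missing, "failed_seqs": failed}


if __name__ == "__main__":
    header = "\t".join(["Seq", "Host", "Starttime", "JobRuntime", "Send",
                        "Receive", "Exitval", "Signal", "Command"])
    rows = [
        "1\t:\t1674054203.0\t0.5\t0\t0\t0\t0\t./task 1",
        "2\t:\t1674054203.1\t0.4\t0\t0\t1\t0\t./task 2",
        "2\t:\t1674054300.0\t0.4\t0\t0\t0\t0\t./task 2",  # rerun of task 2
        "3\t:\t1674054203.2\t0.1\t0\t0\t137\t0\t./task 3",
    ]
    print(summarize_joblog("\n".join([header] + rows), expected_total=5))
```

Tasks that never started do not show up in the log at all, which is why the total has to come from outside (for instance the length of the input list).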

If you want really fancy stuff, maybe look at Dask combined with Slurm plugins? It has dashboards for Jupyter, I believe.

Ward