<div dir="auto"><div>Hi,</div><div dir="auto"><br></div><div dir="auto">Thanks for the advice. I already tried out MANA, but at present it only works with MPICH, not OpenMPI, which is what I've set up via Ubuntu. </div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto">AR</div><div dir="auto"><br></div><div dir="auto"><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Sun, 19 Feb 2023, 02:10 Christopher Samuel, <<a href="mailto:chris@csamuel.org">chris@csamuel.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 2/10/23 11:06 am, Analabha Roy wrote:<br>
<br>
> I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in <br>
> my cluster.<br>
<br>
If you're looking to try checkpointing MPI applications you may want to <br>
experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin <br>
for DMTCP here: <a href="https://github.com/mpickpt/mana" rel="noreferrer noreferrer" target="_blank">https://github.com/mpickpt/mana</a><br>
<br>
We (NERSC) are collaborating with the developers and it is installed on <br>
Cori (our older Cray system) for people to experiment with. The <br>
documentation for it may be useful to others who'd like to try it out - <br>
it's got a nice description of how it works too which even I as a <br>
non-programmer can understand. <br>
<a href="https://docs.nersc.gov/development/checkpoint-restart/mana/" rel="noreferrer noreferrer" target="_blank">https://docs.nersc.gov/development/checkpoint-restart/mana/</a><br>
<br>
Pay special attention to the caveats in our docs though!<br>
<br>
I've not used it myself, though I'm peripherally involved to give advice <br>
on system related issues.<br>
<br>
All the best,<br>
Chris<br>
-- <br>
Chris Samuel : <a href="http://www.csamuel.org/" rel="noreferrer noreferrer" target="_blank">http://www.csamuel.org/</a> : Berkeley, CA, USA<br>
<br>
<br>
</blockquote></div></div></div>