<div dir="ltr"><div><div>John/Chris,<br><br></div>Thanks for your advice. I'll need to do some reading on cgroups; I've never been exposed to that concept.  I don't even know whether the SLURM setup I have access to has the cgroups or PAM plugins/modules enabled.  Unfortunately I'm not involved in the administration of SLURM. I'm simply a user of a much larger, already established system whose other users run compute tasks completely separate from my use case.  Therefore, I'm most interested in solutions that I can implement without sysadmin support on the SLURM side, which is why I started looking at the <span style="font-family:monospace,monospace">--epilog</span> route.<br><br></div><div>I have neither administrator access to SLURM nor the time to consider more complex approaches that might hack the Trick architecture.  <i>Literally the only thing that isn't working for me right now is the cleanup mechanism</i>; everything else is working just fine.  It's not as simple as killing all the processes spawned by the simulation. The processes themselves create message queues for internal communication that live in <span style="font-family:monospace,monospace">/dev/mqueue/</span> on each node, and when the sim is killed with <span style="font-family:monospace,monospace">kill -9</span> there's no internal cleanup, so those files linger on the filesystem indefinitely and cause issues in subsequent runs on those machines. <br><br></div><div>From my understanding, our system already has a "master" epilog script that kills all of a user's processes after that user's job completes.  Our SLURM nodes are set up to be "reserved" for the user requesting them, so the greedy cleanup script isn't a problem for other compute processes: each node is reserved for a single person.  
I might just ping the administrators and ask them to also add a <span style="font-family:monospace,monospace">'rm /dev/mqueue/*' </span>to that script. To me that seems like the fastest solution given what I know.  I would prefer to keep that part in "user space" since it's very specific to my use case, but <span style="font-family:monospace,monospace">srun --epilog</span> is not behaving as I would expect.  Can y'all confirm that what I'm seeing is indeed the expected behavior?<br><span style="font-family:monospace,monospace"><br>  ssh:    ssh machine001</span><br><font face="monospace, monospace">  srun:   srun --nodes 3 --epilog <b>cleanup.sh myProgram.exe<br></b></font></div><div><font face="monospace, monospace">  squeue: shows job 123 running on machine200, machine201, machine202<b><br></b></font></div><div><font face="monospace, monospace">  Kill:   scancel 123<b><br></b></font></div><div><font face="monospace, monospace">  Result: myProgram.exe is terminated, cleanup.sh runs on machine001<br><span style="font-family:arial,helvetica,sans-serif"><br></span></font></div><div><font face="monospace, monospace"><span style="font-family:arial,helvetica,sans-serif">I was expecting <span style="font-family:monospace,monospace">cleanup.sh</span> to run on one (or all) of the compute nodes (200-202), not on the machine I launched the srun command from (001).</span><br></font></div><div><br></div><div>John -- Yes, we are heavily invested in the Trick framework and use its Monte-Carlo feature quite extensively. In the past we've used PBS to manage our compute nodes, but this is our first attempt to integrate Trick Monte-Carlo with SLURM.  
We do spacecraft simulation and analysis for various projects.<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 5, 2018 at 12:36 AM, John Hearns <span dir="ltr"><<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Dan, completely off topic here. May I ask what type of simulations are you running?</div><div>Clearly you probably have a large investment in time in Trick.</div><div>However as a fan of Julia language let me leave this link here:</div><div><a href="https://juliaobserver.com/packages/RigidBodyDynamics" target="_blank">https://juliaobserver.com/<wbr>packages/RigidBodyDynamics</a></div><div><br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On 5 March 2018 at 07:31, John Hearns <span dir="ltr"><<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>I completely agree with what Chris says regarding cgroups.  Implement them, and you will not regret it.</div><div><br></div><div>I have worked with other simulation frameworks, which work in a similar fashion to Trick, ie a master process which spawns </div><div>off independent worker processes on compute nodes. 
I am thinking of an internal application we have and, if I dare say it, Matlab.</div><div><br></div><div>In the Trick documentation:</div><h3 style="text-align:left;color:rgb(36,41,46);text-transform:none;line-height:25px;text-indent:0px;letter-spacing:normal;font-style:normal;font-variant:normal;font-weight:600;text-decoration:none;margin-top:24px;margin-bottom:16px;word-spacing:0px;white-space:normal;box-sizing:border-box;background-color:transparent"><a class="m_-3087711124455602849m_-3673808875872880672gmail-anchor" id="m_-3087711124455602849m_-3673808875872880672gmail-user-content-notes" style="color:rgb(3,102,214);line-height:20px;padding-right:4px;text-decoration:none;box-sizing:border-box;float:left;background-color:transparent" href="https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo#notes" target="_blank"></a><font size="2">Notes</font></h3><div><span style="text-align:left;color:rgb(36,41,46);text-transform:none;line-height:24px;text-indent:0px;letter-spacing:normal;font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-style:normal;font-variant:normal;font-weight:400;text-decoration:none;word-spacing:0px;display:inline;white-space:normal;word-wrap:break-word;font-size-adjust:none;font-stretch:normal;float:none;background-color:transparent">

</span></div><ol style="text-align:left;color:rgb(36,41,46);text-transform:none;text-indent:0px;letter-spacing:normal;padding-left:32px;font-style:normal;font-variant:normal;font-weight:400;text-decoration:none;margin-top:0px;margin-bottom:16px;word-spacing:0px;white-space:normal;box-sizing:border-box;background-color:transparent">

<li style="box-sizing:border-box">

<a style="color:rgb(3,102,214);text-decoration:none;box-sizing:border-box;background-color:transparent" href="https://en.wikipedia.org/wiki/Secure_Shell" rel="nofollow" target="_blank">SSH</a> is used to launch slaves across the network</li><li style="box-sizing:border-box">Each slave machine will work in parallel with other slaves, greatly reducing the computation time of a simulation</li></ol><div style="box-sizing:border-box">However I must say that there must be plenty of folks at NASA who use this simulation framework on HPC clusters with batch systems.</div><div style="box-sizing:border-box">It would surprise me if there were no 'adaptation layers' available for Slurm, SGE, PBS etc.</div><div style="box-sizing:border-box">So in Slurm, you would do an sbatch which would reserve your worker nodes, then run a series of sruns which run the worker processes.</div><div style="box-sizing:border-box"><br></div><div style="box-sizing:border-box">(I hope I have that round the right way - I seem to recall doing srun then a series of sbatches in the past)</div><div style="box-sizing:border-box"><br></div><div style="box-sizing:border-box">But looking at the Trick Wiki quickly, I am wrong. It does seem to work on the model of "get a list of hosts allocated by your batch system",</div><div style="box-sizing:border-box"> i.e. the SLURM_JOB_NODELIST, then Trick will set up simulation queues which spawn off models using ssh.</div><div style="box-sizing:border-box">Looking at the Advanced Topics guide this does seem to be so:</div><div style="box-sizing:border-box"><a href="https://github.com/nasa/trick/blob/master/share/doc/trick/Trick_Advanced_Topics.pdf" target="_blank">https://github.com/nasa/trick/<wbr>blob/master/share/doc/trick/Tr<wbr>ick_Advanced_Topics.pdf</a></div><div style="box-sizing:border-box">The model is that you allocate up to 16 remote worker hosts for a long time. 
Then various modelling tasks are started on those hosts via ssh.</div><div style="box-sizing:border-box">Trick expects those hosts to be available for more tasks during your simulation session.</div><div style="box-sizing:border-box">There is also discussion there about turning off irqbalance and cpuspeed, and disabling unnecessary system services.</div><div style="box-sizing:border-box"><br></div><div style="box-sizing:border-box">As someone who has spent endless oodles of hours either killing orphaned processes on nodes, or seeing rogue process alarms,</div><div style="box-sizing:border-box">or running ps --forest to trace connections into batch job nodes which bypass the pbs/slurm daemons, I despair slightly...</div><div style="box-sizing:border-box">I am probably very wrong, and NASA have excellent Slurm integration.</div><div style="box-sizing:border-box"><br></div><div style="box-sizing:border-box">So I agree with Chris - implement cgroups, and try to make sure your ssh 'lands' in a cgroup.</div><div style="box-sizing:border-box">'lscgroup' is a nice command to see what cgroups are active on a compute node.</div><div style="box-sizing:border-box">Also run an interactive job, ssh into one of your allocated worker nodes, then  cat /proc/self/cgroup   shows which cgroups you have landed in.</div><div style="box-sizing:border-box"><br></div></div><div class="m_-3087711124455602849HOEnZb"><div class="m_-3087711124455602849h5"><div class="gmail_extra"><br><div class="gmail_quote">On 5 March 2018 at 
02:20, Christopher Samuel <span dir="ltr"><<a href="mailto:chris@csamuel.org" target="_blank">chris@csamuel.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 05/03/18 12:12, Dan Jordan wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

What is the /correct /way to clean up processes across the nodes<span><br>

given to my program by SLURM_JOB_NODELIST?<br>

</span></blockquote>

<br>

I'd strongly suggest using cgroups in your Slurm config to ensure that<br>

processes are corralled and tracked correctly.<br>

<br>

You can use pam_slurm_adopt from the contrib directory to capture<br>

inbound SSH sessions into a running job on the node (and deny access to<br>

people who don't).<br>

<br>

Then Slurm should take care of everything for you without needing an<br>

epilog.<br>

<br>

Hope this helps!<br>

Chris<br>

<br>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Dan Jordan<br></div>

</div>
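<div>P.S. For reference, here is a minimal sketch of the kind of <span style="font-family:monospace,monospace">cleanup.sh</span> I have in mind. It is only an illustration: the function name <span style="font-family:monospace,monospace">clean_mqueues</span> and the <span style="font-family:monospace,monospace">MQUEUE_DIR</span> variable are names I made up, and it assumes GNU find is available on the nodes. Unlike a blanket <span style="font-family:monospace,monospace">rm /dev/mqueue/*</span>, it only removes queues owned by the user running it.</div>

```shell
#!/bin/bash
# cleanup.sh (sketch): remove leftover POSIX message queues owned by the
# calling user. The function name and MQUEUE_DIR are illustrative only.

clean_mqueues() {
    local dir="${1:-/dev/mqueue}"
    # Nothing to do if the mqueue filesystem is not mounted on this node.
    [ -d "$dir" ] || return 0
    # Only touch regular files owned by the current user, so queues that
    # belong to other users or to system services are left alone.
    # -maxdepth and -delete are GNU find extensions (an assumption here).
    find "$dir" -maxdepth 1 -type f -user "$(id -u)" -print -delete
}

# Allow the directory to be overridden for testing; default to /dev/mqueue.
clean_mqueues "${MQUEUE_DIR:-/dev/mqueue}"
```

<div>One thing I have not tried yet is passing this script via <span style="font-family:monospace,monospace">srun --task-epilog=./cleanup.sh</span>, which the srun man page describes as being run by slurmstepd just after each task terminates. Whether it actually fires on the compute nodes after an <span style="font-family:monospace,monospace">scancel</span> on our system is something I would still need to verify.</div>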