[slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts

Sun Mar 4 23:36:52 MST 2018

Dan, completely off topic here: may I ask what type of simulations you are
running?
You probably have a large investment of time in Trick.
However, as a fan of the Julia language, let me leave this link here:
https://juliaobserver.com/packages/RigidBodyDynamics

On 5 March 2018 at 07:31, John Hearns <hearnsj at googlemail.com> wrote:

> I completely agree with what Chris says regarding cgroups.  Implement
> them, and you will not regret it.
>
> I have worked with other simulation frameworks which work in a similar
> fashion to Trick, i.e. a master process which spawns
> off independent worker processes on compute nodes. I am thinking of an
> internal application we have and, dare I say it, Matlab.
>
> In the Trick documentation:
> <https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo#notes>Notes
>
>    1. SSH <https://en.wikipedia.org/wiki/Secure_Shell> is used to launch
>    slaves across the network
>    2. Each slave machine will work in parallel with other slaves, greatly
>    reducing the computation time of a simulation
>
> However, I must say that there must be plenty of folks at NASA who use this
> simulation framework on HPC clusters with batch systems.
> It would surprise me if there were no 'adaptation layers' available for
> Slurm, SGE, PBS etc.
> So in Slurm, you would do an sbatch which would reserve your worker nodes,
> then run a series of sruns which run the worker processes.
>
> (I hope I have that round the right way - I seem to recall doing srun then
> a series of sbatches in the past)
>
> But looking at the Trick Wiki quickly, I am wrong. It does seem to work on
> the model of "get a list of hosts allocated by your batch system",
> i.e. the SLURM_JOB_NODELIST, then Trick will set up simulation queues which
> spawn off models using ssh.
> Looking at the Advanced Topics guide, this does seem to be so:
> https://github.com/nasa/trick/blob/master/share/doc/trick/Trick_Advanced_Topics.pdf
> The model is that you allocate up to 16 remote worker hosts for a long
> time. Then various modelling tasks are started on those hosts via ssh.
> Trick expects those hosts to be available for more tasks during your
> simulation session.
> Also there is discussion there about turning off irqbalance and cpuspeed,
> and disabling non-essential system services.
>
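
As a rough illustration of the "get a list of hosts" step: SLURM_JOB_NODELIST
holds a compact expression such as node[01-03,07] rather than one hostname per
line, so it has to be expanded before anything can ssh to the hosts. On a real
cluster, scontrol show hostnames "$SLURM_JOB_NODELIST" does this for you. The
Python below is only a simplified sketch of the same idea. The name
expand_nodelist is hypothetical, and the sketch assumes at most one bracket
group per entry; I have not checked how Trick itself consumes the list.

```python
import re


def _split_top_level(nodelist):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_nodelist(nodelist):
    """Expand e.g. 'node[01-03,07],login1' into individual hostnames.

    Simplified: handles a single bracket group per entry only.
    """
    hosts = []
    for part in _split_top_level(nodelist):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if not match:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            if "-" in item:
                lo, hi = item.split("-", 1)
                width = len(lo)  # keep zero padding, e.g. 01, 02, 03
                for n in range(int(lo), int(hi) + 1):
                    hosts.append(f"{prefix}{n:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts
```

Inside a job you would call expand_nodelist(os.environ["SLURM_JOB_NODELIST"])
and hand the result to whatever starts the workers.
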
> As someone who has spent endless oodles of hours either killing orphaned
> processes on nodes, or seeing rogue process alarms,
> or running ps --forest to trace connections into batch job nodes which
> bypass the pbs/slurm daemons, I despair slightly...
> I am probably very wrong, and NASA have excellent Slurm integration.
>
> So I agree with Chris - implement cgroups, and try to make sure your ssh
> session 'lands' in a cgroup.
> 'lscgroup' is a nice command to see what cgroups are active on a compute
> node.
> Also run an interactive job, ssh into one of your allocated worker nodes,
> then  cat /proc/self/cgroup   shows which cgroups you have landed in.
>
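
The same check can be scripted, which is handy if you want to run it from
inside the ssh-launched workers themselves. Each line of /proc/self/cgroup has
the form hierarchy-ID:controller-list:cgroup-path. The sketch below parses that
and picks out the paths that look Slurm-managed. The helper names are
hypothetical, and the test for 'slurm' in the path is an assumption about how
your site's cgroup hierarchy is laid out, so check it against what lscgroup
shows on your nodes.

```python
from pathlib import Path


def parse_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The path is the last field and may itself contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            (hierarchy_id, controllers.split(",") if controllers else [], path)
        )
    return entries


def slurm_cgroups(text):
    """Return the cgroup paths that look Slurm-managed.

    Assumption: such paths contain the string 'slurm'.
    """
    return [path for _, _, path in parse_cgroup(text) if "slurm" in path]


def current_cgroups():
    """Read the cgroup membership of the calling process (Linux only)."""
    return Path("/proc/self/cgroup").read_text()
```

If slurm_cgroups(current_cgroups()) comes back empty for a process started
over ssh, that process has most likely escaped Slurm's tracking.
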
> On 5 March 2018 at 02:20, Christopher Samuel <chris at csamuel.org> wrote:
>
>> On 05/03/18 12:12, Dan Jordan wrote:
>>
>>> What is the /correct/ way to clean up processes across the nodes
>>> given to my program by SLURM_JOB_NODELIST?
>>>
>>
>> I'd strongly suggest using cgroups in your Slurm config to ensure that
>> processes are corralled and tracked correctly.
>>
>> You can use pam_slurm_adopt from the contrib directory to capture
>> inbound SSH sessions into a running job on the node (and deny access to
>> people who don't have one).
>>
>> Then Slurm should take care of everything for you without needing an
>> epilog.
>>
>> Hope this helps!
>> Chris
>>
>>
>