[slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts
hearnsj at googlemail.com
Sun Mar 4 23:36:52 MST 2018
Dan, completely off topic here. May I ask what type of simulations you are running?
Clearly you probably have a large investment of time in Trick.
However, as a fan of the Julia language, let me leave this link here:
On 5 March 2018 at 07:31, John Hearns <hearnsj at googlemail.com> wrote:
> I completely agree with what Chris says regarding cgroups. Implement
> them, and you will not regret it.
> I have worked with other simulation frameworks which work in a similar
> fashion to Trick, i.e. a master process which spawns off independent
> worker processes on compute nodes. I am thinking of an internal
> application we have, and, dare I say it, Matlab.
> In the Trick documentation:
> 1. SSH <https://en.wikipedia.org/wiki/Secure_Shell> is used to launch
> slaves across the network
> 2. Each slave machine will work in parallel with other slaves, greatly
> reducing the computation time of a simulation
> However I must say that there must be plenty of folks at NASA who use this
> simulation framework on HPC clusters with batch systems.
> It would surprise me if there were not 'adaptation layers' available for
> Slurm, SGE, PBS etc.
> So in Slurm, you would do an sbatch which reserves your worker nodes,
> then run a series of srun commands which launch the worker processes.
> (I hope I have that the right way round - I seem to recall doing an srun
> and then a series of sbatches in the past.)
> But looking at the Trick Wiki quickly, I am wrong. It does seem to work on
> the model of "get a list of hosts allocated by your batch system",
> i.e. SLURM_JOB_NODELIST, and then Trick will set up simulation queues which
> spawn off models using ssh.
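Note that SLURM_JOB_NODELIST comes in Slurm's compact form (e.g. node[01-04]); inside an allocation, scontrol can expand it into the one-hostname-per-line list an ssh-based launcher typically wants:

```shell
# Expand the compact nodelist into one hostname per line.
# Only works inside a job, where SLURM_JOB_NODELIST is set.
scontrol show hostnames "$SLURM_JOB_NODELIST" > hostfile
cat hostfile
```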
> Looking at the Advanced Topics guide this does seem to be so:
> The model is that you allocate up to 16 remote worker hosts for a long
> time. Then various modelling tasks are started on those hosts via ssh.
> Trick expects those hosts to be available for more tasks during your
> simulation session.
> Also there is discussion there about turning off irqbalance and cpuspeed,
> and disabling unnecessary system services.
> As someone who has spent endless oodles of hours either killing orphaned
> processes on nodes, or seeing rogue-process alarms,
> or running ps --forest to trace connections into batch job nodes which
> bypass the pbs/slurm daemons, I despair slightly...
> I am probably very wrong, and NASA has excellent Slurm integration.
> So I agree with Chris - implement cgroups, and try to make sure your ssh
> session 'lands' in a cgroup.
> 'lscgroup' is a nice command to see what cgroups are active on a compute node.
> Also run an interactive job, ssh into one of your allocated worker nodes,
> then cat /proc/self/cgroup to see which cgroups you have landed in.
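The interactive check above can be scripted; /proc/self/cgroup is readable on any modern Linux, though grepping for 'slurm' only means something on a node where cgroup confinement is actually configured:

```shell
# List the cgroups the current process belongs to.
cat /proc/self/cgroup

# After ssh-ing into an allocated node, an adopted session should sit
# in a slurm-owned cgroup; report whether one is present.
grep slurm /proc/self/cgroup || echo "no slurm cgroup for this session"
```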
> On 5 March 2018 at 02:20, Christopher Samuel <chris at csamuel.org> wrote:
>> On 05/03/18 12:12, Dan Jordan wrote:
>>> What is the /correct/ way to clean up processes across the nodes
>>> given to my program by SLURM_JOB_NODELIST?
>> I'd strongly suggest using cgroups in your Slurm config to ensure that
>> processes are corralled and tracked correctly.
>> You can use pam_slurm_adopt from the contribs directory to capture
>> inbound SSH sessions into a running job on the node (and deny access to
>> people who don't have one).
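As a rough sketch of the configuration being described - plugin and option names as documented for Slurm and pam_slurm_adopt, but check them against your installed version:

```
# slurm.conf - track and confine job processes with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf - what the cgroups are allowed to constrain
ConstrainCores=yes
ConstrainRAMSpace=yes

# /etc/pam.d/sshd - adopt inbound ssh sessions into the owning job's
# cgroup, and deny users who have no job running on the node
account    required    pam_slurm_adopt.so
```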
>> Then Slurm should take care of everything for you without needing an
>> epilog script to hunt down stray processes.
>> Hope this helps!