[slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts

John Hearns hearnsj at googlemail.com
Mon Mar 5 08:35:47 MST 2018


Dan, thank you very much for a comprehensive and understandable reply.

On 5 March 2018 at 16:28, Dan Jordan <ddj116 at gmail.com> wrote:

> John/Chris,
>
> Thanks for your advice. I'll need to do some reading on cgroups; I've
> never even been exposed to that concept.  I don't even know whether the SLURM
> setup I have access to has the cgroups or PAM plugins/modules
> enabled or available.  Unfortunately I'm not involved in the administration of
> SLURM; I'm simply a user of a much larger system that's already established,
> with other users doing compute tasks completely separate from my use case.
> Therefore, I'm most interested in solutions that I can implement without
> sysadmin support on the SLURM side, which is why I started looking at the
> --epilog route.
>
> I neither have administrator access to SLURM nor the time to consider
> more complex approaches that might hack the Trick architecture.  *Literally
> the only thing that isn't working for me right now is the cleanup mechanism*;
> everything else is working just fine.  It's not as simple as killing all
> the simulation-spawned processes: the processes themselves create message
> queues for internal communication that live in /dev/mqueue/ on each node,
> and when the sim gets a kill -9 signal there's no internal cleanup, so
> those files linger on the filesystem indefinitely, causing issues in
> subsequent runs on those machines.
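>
> For what it's worth, the cleanup itself is trivial; a minimal sketch of the
> cleanup.sh I've been passing to --epilog (assuming the leftover queues are
> the only thing that needs removing, and that wiping the whole directory is
> safe since the nodes are reserved for me) is just:
>
>   #!/bin/bash
>   # cleanup.sh - remove the POSIX message queues the sim leaves behind
>   rm -f /dev/mqueue/* 2>/dev/null
>   # exit 0 so an already-empty directory doesn't make the epilog look failed
>   exit 0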
>
> From my understanding, our system already has a "master" epilog script that
> kills all user processes after a user's job completes.  They have set up our
> SLURM nodes to be "reserved" for the user requesting them, so their greedy
> cleanup script isn't a problem for other compute processes; the nodes are
> reserved for that single person.  I might just ping the administrators and
> ask them to also add an 'rm /dev/mqueue/*' to that script; to me that seems
> like the fastest solution given what I know.  I would prefer to keep that
> part in the "user space" since it's very specific to my use case, but srun
> --epilog is not behaving as I would expect.  Can y'all confirm that what I'm
> seeing is indeed what is expected to happen?
>
>   ssh:    ssh machine001
>   srun:   srun --nodes 3 --epilog cleanup.sh myProgram.exe
>   squeue: shows job 123 running on machine200, machine201, machine202
>   kill:   scancel 123
>   result: myProgram.exe is terminated, cleanup.sh runs on machine001
>
> I was expecting cleanup.sh to run on one (or all) of the compute nodes
> (200-202), not on the machine I launched the srun command from (001).
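>
> (Re-reading the srun man page, --task-epilog looks like it is run by the
> slurmstepd on the compute nodes themselves, rather than by srun on the
> launch node the way --epilog is.  If I've read that right, something closer
> to the line below may be what I actually want -- I haven't tried it yet, so
> treat it as a sketch:)
>
>   srun --nodes 3 --task-epilog=./cleanup.sh myProgram.exe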
>
> John -- Yes, we are heavily invested in the Trick framework and use its
> Monte-Carlo feature quite extensively.  In the past we've used PBS to manage
> our compute nodes, but this is the first attempt to integrate Trick
> Monte-Carlo with SLURM.  We do spacecraft simulation and analysis for
> various projects.
>
> On Mon, Mar 5, 2018 at 12:36 AM, John Hearns <hearnsj at googlemail.com>
> wrote:
>
>> Dan, completely off topic here, but may I ask what type of simulations you
>> are running?
>> Clearly you probably have a large investment in time in Trick.
>> However, as a fan of the Julia language, let me leave this link here:
>> https://juliaobserver.com/packages/RigidBodyDynamics
>>
>>
>> On 5 March 2018 at 07:31, John Hearns <hearnsj at googlemail.com> wrote:
>>
>>> I completely agree with what Chris says regarding cgroups.  Implement
>>> them, and you will not regret it.
>>>
>>> I have worked with other simulation frameworks which work in a similar
>>> fashion to Trick, i.e. a master process which spawns off independent
>>> worker processes on compute nodes. I am thinking of an internal
>>> application we have and, dare I say it, Matlab.
>>>
>>> From the Notes in the Trick documentation
>>> (https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo#notes):
>>>
>>>    1. SSH (https://en.wikipedia.org/wiki/Secure_Shell) is used to
>>>       launch slaves across the network
>>>    2. Each slave machine will work in parallel with other slaves,
>>>       greatly reducing the computation time of a simulation
>>>
>>> However, I must say that there must be plenty of folks at NASA who use
>>> this simulation framework on HPC clusters with batch systems.
>>> It would surprise me if there were not 'adaptation layers' available for
>>> Slurm, SGE, PBS etc.
>>> So in Slurm, you would do an sbatch which reserves your worker
>>> nodes, then run a series of srun commands which run the worker processes.
>>>
>>> (I hope I have that the right way round - I seem to recall doing srun
>>> and then a series of sbatches in the past.)
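>>>
>>> That is, something shaped roughly like this - a sketch only, with the
>>> resource numbers and the worker.exe name made up rather than taken from
>>> Trick:
>>>
>>>   #!/bin/bash
>>>   #SBATCH --nodes=3
>>>   #SBATCH --time=04:00:00
>>>   # launch one worker process on each node of the allocation
>>>   srun --ntasks-per-node=1 ./worker.exe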
>>>
>>> But looking at the Trick Wiki quickly, I am wrong. It does seem to work
>>> on the model of "get a list of hosts allocated by your batch system",
>>> i.e. the SLURM_JOB_NODELIST, and then Trick will set up simulation queues
>>> which spawn off models using ssh.
>>> Looking at the Advanced Topics guide, this does seem to be so:
>>> https://github.com/nasa/trick/blob/master/share/doc/trick/Trick_Advanced_Topics.pdf
>>> The model is that you allocate up to 16 remote worker hosts for a long
>>> time. Then various modelling tasks are started on those hosts via ssh, and
>>> Trick expects those hosts to be available for more tasks during your
>>> simulation session.
>>> Also there is discussion there about turning off irqbalance and
>>> cpuspeed, and disabling unnecessary system services.
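>>>
>>> If Trick just needs a flat list of hostnames to ssh into, something like
>>> the following inside the job should produce it (a sketch - I have not
>>> tried feeding the result to Trick myself):
>>>
>>>   # expand Slurm's compressed nodelist (e.g. machine[200-202]) into one
>>>   # hostname per line, for whatever host-list file Trick expects
>>>   scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.txt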
>>>
>>> As someone who has spent endless oodles of hours either killing orphaned
>>> processes on nodes, or seeing rogue-process alarms,
>>> or running ps --forest to trace connections into batch job nodes that
>>> bypass the PBS/Slurm daemons, I despair slightly...
>>> I am probably very wrong, and NASA have excellent Slurm integration.
>>>
>>> So I agree with Chris - implement cgroups, and try to make sure your
>>> ssh 'lands' in a cgroup.
>>> 'lscgroup' is a nice command to see what cgroups are active on a compute
>>> node.
>>> Also run an interactive job, ssh into one of your allocated worker nodes,
>>> and then 'cat /proc/self/cgroup' shows which cgroups you have landed in.
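>>>
>>> For example, from inside an interactive job (the output below is purely
>>> illustrative - the exact hierarchy names depend on how your site has
>>> configured cgroups):
>>>
>>>   $ lscgroup | grep slurm        # cgroups slurmd has created on this node
>>>   cpuset:/slurm/uid_1000/job_123/step_0
>>>   memory:/slurm/uid_1000/job_123/step_0
>>>   $ cat /proc/self/cgroup        # which of them this shell was placed in
>>>   11:cpuset:/slurm/uid_1000/job_123/step_0
>>>   4:memory:/slurm/uid_1000/job_123/step_0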
>>>
>>> On 5 March 2018 at 02:20, Christopher Samuel <chris at csamuel.org> wrote:
>>>
>>>> On 05/03/18 12:12, Dan Jordan wrote:
>>>>
>>>>> What is the *correct* way to clean up processes across the nodes
>>>>> given to my program by SLURM_JOB_NODELIST?
>>>>>
>>>>
>>>> I'd strongly suggest using cgroups in your Slurm config to ensure that
>>>> processes are corralled and tracked correctly.
>>>>
>>>> You can use pam_slurm_adopt from the contribs directory to capture
>>>> inbound SSH sessions into a running job on the node (and deny access to
>>>> people who don't have a job running there).
>>>>
>>>> Then Slurm should take care of everything for you without needing an
>>>> epilog.
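>>>>
>>>> For reference, the admin-side pieces are roughly these - a sketch of the
>>>> commonly used settings, not your site's exact config:
>>>>
>>>>   # slurm.conf: track and contain job processes with cgroups
>>>>   ProctrackType=proctrack/cgroup
>>>>   TaskPlugin=task/cgroup
>>>>   PrologFlags=contain   # creates the "extern" step that adopted SSH sessions join
>>>>
>>>>   # /etc/pam.d/sshd: adopt incoming SSH sessions into the user's job
>>>>   # (see the pam_slurm_adopt docs for exact placement and options)
>>>>   account    required    pam_slurm_adopt.so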
>>>>
>>>> Hope this helps!
>>>> Chris
>>>>
>>>>
>>>
>>
>
>
> --
> Dan Jordan
>

