[slurm-users] SLURM_JOB_NODELIST not available in prolog / epilog scripts

Mon Mar 5 08:28:42 MST 2018

John/Chris,

Thanks for your advice. I'll need to do some reading on cgroups; I've never
even been exposed to that concept.  I don't even know if the SLURM setup I
have access to has the cgroups or PAM plugins/modules enabled or available.
Unfortunately I'm not involved in the administration of SLURM, I'm simply a
user of a much larger system that's already established with other users
doing compute tasks completely separate from my use case.  Therefore, I'm
most interested in solutions that I can implement without sys admin support
on the SLURM side, which is why I started looking at the --epilog route.
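
On the question of whether cgroups are even enabled, a rough check that
needs no admin rights might look like the sketch below.  It assumes that
'scontrol show config' is readable by ordinary users on the login node and
that a cgroup-based setup shows up in the ProctrackType or TaskPlugin lines;
uses_cgroups is just a name made up here, and this says nothing about
whether pam_slurm_adopt is installed.

```shell
#!/bin/bash
# uses_cgroups (hypothetical helper): read `scontrol show config` output on
# stdin and succeed if ProctrackType or TaskPlugin mentions a cgroup plugin.
uses_cgroups() {
    grep -Eiq '^(ProctrackType|TaskPlugin)[[:space:]]*=.*cgroup'
}

# Example, run on a node where scontrol is available:
#   scontrol show config | uses_cgroups && echo "cgroup plugins configured"
```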

I have neither administrator access to SLURM nor the time to consider
more complex approaches that might hack the Trick architecture.  *Literally
the only thing that isn't working for me right now is the cleanup mechanism*;
everything else is working just fine.  It's not as simple as killing all
the simulation-spawned processes.  The processes themselves create message
queues for internal communication that live in /dev/mqueue/ on each node,
and when the sim gets a kill -9 signal there's no internal cleanup, so
those files linger on the filesystem indefinitely, causing issues in
subsequent runs on those machines.
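
To make the cleanup itself concrete, here is a minimal sketch of what I
mean, assuming the queues show up as regular files directly under
/dev/mqueue and that removing only the ones owned by the current user is
enough.  clean_mqueue is a made-up name, and the directory argument exists
only so the function can be tried against a scratch directory first.

```shell
#!/bin/bash
# clean_mqueue (hypothetical helper): remove message queue files owned by
# the current user from the given directory (default: /dev/mqueue).
clean_mqueue() {
    local dir="${1:-/dev/mqueue}"
    # -maxdepth 1 : only look at entries directly inside the directory
    # -type f     : queues are assumed to appear as regular files
    # -user       : leave queues belonging to other users untouched
    find "$dir" -maxdepth 1 -type f -user "$(id -u)" -exec rm -f -- {} +
}

# Example: clean_mqueue            # cleans /dev/mqueue on this node
#          clean_mqueue /tmp/test  # dry run against a scratch directory
```

This still has to be executed on each compute node, which is exactly the
part I'm struggling with.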

From my understanding, there's already a "master" epilog script that kills
all user processes implemented in our system after a user's job completes.
They have set up our SLURM nodes to be "reserved" for the user requesting
them, so their greedy cleanup script isn't a problem for other compute
processes; the nodes are reserved for that single person.  I might just ping
the administrators and ask them to also add an 'rm /dev/mqueue/*' to that
script; to me that seems like the fastest solution given what I know.  I
would prefer to keep that part in "user space" since it's very specific
to my use case, but srun --epilog is not behaving as I would expect.  Can
y'all confirm what I'm seeing is indeed what is expected to happen?

  ssh:    ssh machine001
  srun:   srun --nodes 3 --epilog cleanup.sh myProgram.exe
  squeue: shows job 123 running on machine200, machine201, machine202
  Kill:   scancel 123
  Result: myProgram.exe is terminated, cleanup.sh runs on machine001

I was expecting cleanup.sh to run on one (or all) of the compute nodes
(200-202), not on the machine I launched the srun command from (001).
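
One user-space workaround that would not depend on where the --epilog
script runs is to wrap the program so that each task cleans up on its own
node.  A minimal sketch, assuming bash is available on the compute nodes
and that the step is ended with a catchable signal (e.g. SIGTERM) before
any SIGKILL arrives; wrapper.sh, MQ_DIR, cleanup and run_with_cleanup are
all names made up for illustration:

```shell
#!/bin/bash
# wrapper.sh (hypothetical): run the given program, then remove this user's
# message queue files on whichever node the wrapper itself is running on.
MQ_DIR="${MQ_DIR:-/dev/mqueue}"   # overridable so the script can be tested

cleanup() {
    # Only touch regular files owned by the current user in MQ_DIR.
    find "$MQ_DIR" -maxdepth 1 -type f -user "$(id -u)" -exec rm -f -- {} +
}

run_with_cleanup() {
    trap cleanup EXIT          # fires whenever this shell exits
    trap 'exit 143' TERM INT   # turn a catchable signal into a normal exit
    "$@"                       # run the wrapped program in the foreground
}

# Only act when a program was actually passed on the command line.
if [ "$#" -gt 0 ]; then
    run_with_cleanup "$@"
fi
```

It would be launched as 'srun --nodes 3 wrapper.sh myProgram.exe'.  SIGKILL
cannot be trapped, so this does nothing if the wrapper itself is killed
with -9.  The srun man page also documents a --task-epilog option, run by
slurmstepd on the compute node after each task terminates, which may be
worth testing as an alternative.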

John -- Yes, we are heavily invested in the Trick framework and use their
Monte-Carlo feature quite extensively.  In the past we've used PBS to manage
our compute nodes, but this is our first attempt to integrate Trick
Monte-Carlo with SLURM.  We do spacecraft simulation and analysis for
various projects.

On Mon, Mar 5, 2018 at 12:36 AM, John Hearns <hearnsj at googlemail.com> wrote:

> Dan, completely off topic here. May I ask what type of simulations you are
> running?
> Clearly you have a large investment of time in Trick.
> However, as a fan of the Julia language, let me leave this link here:
> https://juliaobserver.com/packages/RigidBodyDynamics
>
>
> On 5 March 2018 at 07:31, John Hearns <hearnsj at googlemail.com> wrote:
>
>> I completely agree with what Chris says regarding cgroups.  Implement
>> them, and you will not regret it.
>>
>> I have worked with other simulation frameworks which work in a similar
>> fashion to Trick, i.e. a master process which spawns off independent
>> worker processes on compute nodes. I am thinking of an internal
>> application we have and, if I may say it, Matlab.
>>
>> In the Trick documentation:
>> Notes <https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo#notes>
>>
>>    1. SSH <https://en.wikipedia.org/wiki/Secure_Shell> is used to launch
>>    slaves across the network
>>    2. Each slave machine will work in parallel with other slaves,
>>    greatly reducing the computation time of a simulation
>>
>> However I must say that there must be plenty of folks at NASA who use
>> this simulation framework on HPC clusters with batch systems.
>> It would surprise me if there were no 'adaptation layers' available for
>> Slurm, SGE, PBS etc.
>> So in Slurm, you would do an sbatch which would reserve your worker nodes,
>> then run a series of sruns which run the worker processes.
>>
>> (I hope I have that round the right way - I seem to recall doing srun
>> then a series of sbatches in the past)
>>
>> But looking at the Trick Wiki quickly, I am wrong. It does seem to work
>> on the model of "get a list of hosts allocated by your batch system",
>> i.e. the SLURM_JOB_NODELIST, then Trick will set up simulation queues which
>> spawn off models using ssh.
>> Looking at the Advanced Topics guide this does seem to be so:
>> https://github.com/nasa/trick/blob/master/share/doc/trick/Trick_Advanced_Topics.pdf
>> The model is that you allocate up to 16 remote worker hosts for a long
>> time. Then various modelling tasks are started on those hosts via ssh.
>> Trick expects those hosts to be available for more tasks during your
>> simulation session.
>> There is also discussion there about turning off irqbalance and cpuspeed,
>> and disabling unnecessary system services.
>>
>> As someone who has spent endless oodles of hours either killing orphaned
>> processes on nodes, or seeing rogue process alarms,
>> or running ps --forest to trace connections into batch job nodes which
>> bypass the pbs/slurm daemons, I despair slightly...
>> I am probably very wrong, and NASA have excellent Slurm integration.
>>
>> So I agree with Chris - implement cgroups, and try to make sure your ssh
>> 'lands' in a cgroup.
>> 'lscgroup' is a nice command to see what cgroups are active on a compute
>> node.
>> Also run an interactive job, ssh into one of your allocated worker nodes,
>> then  cat /proc/self/cgroup   shows which cgroups you have landed in.
>>
>> On 5 March 2018 at 02:20, Christopher Samuel <chris at csamuel.org> wrote:
>>
>>> On 05/03/18 12:12, Dan Jordan wrote:
>>>
>>>> What is the /correct/ way to clean up processes across the nodes
>>>> given to my program by SLURM_JOB_NODELIST?
>>>>
>>>
>>> I'd strongly suggest using cgroups in your Slurm config to ensure that
>>> processes are corralled and tracked correctly.
>>>
>>> You can use pam_slurm_adopt from the contrib directory to capture
>>> inbound SSH sessions into a running job on the node (and deny access to
>>> people who don't have one).
>>>
>>> Then Slurm should take care of everything for you without needing an
>>> epilog.
>>>
>>> Hope this helps!
>>> Chris
>>>
>>>
>>
>

-- 
Dan Jordan