<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>I would recommend putting a cleanup step in your epilog
script. We have a check here that sees whether the job has completed
and, if so, terminates all of the user's processes with kill -9 to
clean up any residuals. If that fails, it closes off the node so we
can reboot it.</p>
<p>-Paul Edmon-<br>
</p>
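<p>As a rough illustration of that kind of epilog cleanup, here is a
minimal Python sketch. It is not the actual script described above.
The function names user_pids and kill_leftovers are made up for this
example, and it assumes a Linux node where the owner of each /proc/PID
directory matches the process owner, and an epilog that runs as root
with SLURM_JOB_UID set in its environment. A real epilog would also
need to confirm that the user has no other job still running on the
node before killing anything.</p>

```python
import os
import signal


def user_pids(uid, proc_root="/proc"):
    """Return the sorted PIDs whose /proc entry is owned by uid (Linux only)."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            owner = os.stat(os.path.join(proc_root, name)).st_uid
        except OSError:
            continue  # the process exited while we were scanning
        if owner == uid:
            pids.append(int(name))
    return sorted(pids)


def kill_leftovers(uid, proc_root="/proc", kill=os.kill):
    """Send SIGKILL to every process owned by uid.

    Returns the PIDs that could not be signalled, so the caller can
    decide to close off the node. The kill argument exists only so the
    function can be exercised without killing anything.
    """
    failed = []
    for pid in user_pids(uid, proc_root):
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # already gone, nothing to clean up
        except OSError:
            failed.append(pid)
    return failed
```

<p>An epilog could call kill_leftovers(int(os.environ["SLURM_JOB_UID"]))
and, if the returned list is not empty, drain the node (for example
with scontrol) so that it can be rebooted.</p>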
<br>
<div class="moz-cite-prefix">On 04/23/2018 08:10 AM, John Hearns
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAPqNE2XfYy8U1mTV0ZknKu1JybfouFkMB_MXZas1hfvjs1PdvA@mail.gmail.com">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>Nicolo, I
cannot say
what your
problem is.<br>
</div>
However, in the
past, with
problems like
this I would: <br>
<br>
</div>
a) look at
ps -eaf
--forest<br>
</div>
Try to see what
the parent
processes of
these job
processes are<br>
</div>
Clearly, if the
parent PID is 1
then --forest is
not much help.
But the --forest
option is my
'go-to' option.<br>
<br>
</div>
b) look closely at
the slurm logs. Do
not fool yourself -
force yourself to
read the logs line
by line, around the
timestamp when the
job ends.<br>
<br>
<br>
</div>
Being a bit more
helpful, in my last
job we had endless
problems with Matlab
jobs leaving orphaned
processes.<br>
</div>
To be fair to Matlab,
they have a utility
which 'properly' starts
parallel jobs under the
control of the batch
system (OK, it was
PBSpro)<br>
</div>
But users can easily start
a job and 'fire off'
processes in Matlab which
are not under the direct
control of the batch
daemon, leaving orphaned
processes<br>
</div>
when the job ends.<br>
<br>
</div>
Actually, if you think about
it, this is how a batch
system works. The batch system
daemon starts running
processes on your behalf.<br>
</div>
When the job is killed, all the
daughter processes of that
daemon should die.<br>
</div>
It is instructive to run ps -eaf
--forest sometimes on a compute
node during a normal job run. Get
to know how things are being
created, and what their parents
are<br>
</div>
(two dashes in front of the forest
argument)<br>
<br>
</div>
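<p>For anyone who wants the same parent/child view in a form that can
be filtered or logged, the idea behind ps --forest can be sketched in
Python. This is only an illustration: the function names parse_ps,
forest_lines and live_forest are invented here, and the sketch assumes
a procps-style ps that accepts -e -o pid=,ppid=,user=,comm= and prints
one process per line.</p>

```python
import subprocess
from collections import defaultdict


def parse_ps(text):
    """Parse 'pid ppid user comm' lines into {pid: (ppid, user, comm)}."""
    procs = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        pid, ppid, user, comm = parts
        procs[int(pid)] = (int(ppid), user, comm)
    return procs


def forest_lines(procs):
    """Return indented lines, one per process, children under their parent."""
    children = defaultdict(list)
    for pid, (ppid, _user, _comm) in procs.items():
        children[ppid].append(pid)

    lines = []

    def walk(pid, depth):
        _ppid, user, comm = procs[pid]
        lines.append("%s%d %s %s" % ("  " * depth, pid, user, comm))
        for child in sorted(children.get(pid, [])):
            walk(child, depth + 1)

    # A root is any process whose parent is not in the table (e.g. PID 1).
    roots = sorted(p for p, (pp, _u, _c) in procs.items() if pp not in procs)
    for root in roots:
        walk(root, 0)
    return lines


def live_forest():
    """Build the forest from the processes currently running on this node."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=,ppid=,user=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return forest_lines(parse_ps(out))
```

<p>Printing "\n".join(live_forest()) on a compute node during a normal
job run shows which daemon each job process hangs under.</p>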
Now think of users who start a batch
job and get a list of compute hosts.<br>
</div>
They MAY use a mechanism such as ssh or
indeed pbsdsh to start running job
processes on those nodes.<br>
</div>
You will then have trouble with orphaned
processes when the job ends.<br>
</div>
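<p>One way to spot such processes is to walk each process's chain of
parents and flag anything owned by the job's user that does not sit
under the batch daemon. The sketch below is illustrative only: the
names under_daemon and stray_pids are invented, the process table is
hand-written sample data, and the daemon name "slurmstepd" is an
assumption that would differ on other batch systems.</p>

```python
def under_daemon(procs, pid, daemon):
    """True if pid, or any of its ancestors, is a process named daemon.

    procs maps pid to a (ppid, user, comm) tuple.
    """
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)  # guard against a malformed table with a loop
        ppid, _user, comm = procs[pid]
        if comm == daemon:
            return True
        pid = ppid
    return False


def stray_pids(procs, user, daemon="slurmstepd"):
    """PIDs owned by user that are not descended from the batch daemon."""
    return sorted(
        pid
        for pid, (_ppid, owner, _comm) in procs.items()
        if owner == user and not under_daemon(procs, pid, daemon)
    )


# Hypothetical snapshot of a node: one job started properly under the
# batch daemon, one process started over ssh, one reparented to PID 1.
example = {
    1: (0, "root", "systemd"),
    100: (1, "root", "slurmstepd"),
    101: (100, "alice", "bash"),
    102: (101, "alice", "MATLAB"),
    300: (1, "root", "sshd"),
    301: (300, "alice", "abaqus"),
    400: (1, "alice", "python"),
}
strays = stray_pids(example, "alice")
```

<p>In this sample, the processes started over ssh or reparented to PID 1
are reported, while those under the batch daemon are left alone.</p>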
Techniques for dealing with this:<br>
</div>
a) use the PAM module which stops ssh login
(actually, this probably allows ssh login
during a job, when the user has a node
allocated)<br>
</div>
b) my favourite - CPU sets - actually this won't
stop ssh logins either.<br>
</div>
c) Shouting, much shouting. Screaming.<br>
<br>
</div>
Regarding users behaving like this, I have seen
several cases of it for
understandable reasons.<br>
</div>
On a system which I did not manage, but was asked for
advice on, the vendor had provided a sample script for
running Ansys.<br>
</div>
The user wanted to run Abaqus on the compute nodes (or
some such - a different application anyway)<br>
</div>
So he started an empty Ansys job, which sat doing
nothing. Then took the list of hosts provided by the batch
system<br>
</div>
and fired up an interactive Abaqus session on his terminal.<br>
</div>
I honestly hesitate to label this behaviour 'wrong'<br>
<br>
</div>
I have also seen similar when running a CFD job.
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 23 April 2018 at 11:50, Nicolò
Parmiggiani <span dir="ltr">&lt;<a
href="mailto:nicolo.parmiggiani@gmail.com" target="_blank"
moz-do-not-send="true">nicolo.parmiggiani@gmail.com</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Hi,
<div><br>
</div>
<div>I have a job that keeps running even though the
internal process is finished.</div>
<div><br>
</div>
<div>What could be the problem?</div>
<div><br>
</div>
<div>Thank you.</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>