[slurm-users] Job still running after process completed
John Hearns
hearnsj at googlemail.com
Mon Apr 23 06:10:31 MDT 2018
Nicolo, I cannot say what your problem is.
However in the past with problems like this I would
a) look at ps -eaf --forest
Try to see what the parent processes of these job processes are
Clearly if the parent PID is 1 then --forest is not much help. But the
--forest option is my 'go-to' option
b) look closely at the slurm logs. Do not fool yourself - force yourself to
read the logs line by line, around the timestamp when the job ends.
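
For point b), a small filter can narrow a large log down to the minutes
around the job end before you read it line by line. This is only a sketch:
it assumes each log line starts with a bracketed ISO-style timestamp such as
[2018-04-23T05:50:00.000], so check what your own slurmd/slurmctld logs look
like first. The function name and the sample messages are made up for
illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: "[YYYY-MM-DDTHH:MM:SS" (fractional seconds ignored).
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def lines_around(lines, when, minutes=2):
    """Return the log lines stamped within +/- `minutes` of `when`."""
    window = timedelta(minutes=minutes)
    hits = []
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if abs(stamp - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Placeholder log text, not real Slurm output.
    sample = [
        "[2018-04-23T05:40:00.000] placeholder message, too early\n",
        "[2018-04-23T05:49:30.120] placeholder message, in window\n",
        "[2018-04-23T06:30:00.000] placeholder message, too late\n",
    ]
    for hit in lines_around(sample, datetime(2018, 4, 23, 5, 50, 0)):
        print(hit)
```

In practice you would pass an open log file as `lines` and the job end time
as `when`, then still read what comes out properly.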
Being a bit more helpful, in my last job we had endless problems with
Matlab jobs leaving orphaned processes.
To be fair to Matlab, they have a utility which 'properly' starts parallel
jobs under the control of the batch system (OK, it was PBSpro)
But users can easily start a job and 'fire off' processes in Matlab which
are not under the direct control of the batch daemon, leaving orphaned
processes when the job ends.
Actually, if you think about it, this is how a batch system works. The
batch system daemon starts running processes on your behalf.
When the job is killed, all the daughter processes of that daemon should
die.
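
To make the parent/daughter idea concrete, here is a toy sketch in Python.
It takes a table of PID -> parent PID (the PIDs below are invented) and
collects everything that descends from one process, which is roughly the set
a batch daemon can reach by following the process tree. A process that has
been re-parented to PID 1 drops out of that set, and that is the orphan
problem in a nutshell.

```python
from collections import defaultdict


def descendants(parent_of, root):
    """Return the sorted PIDs of all direct and indirect children of `root`.

    `parent_of` maps each PID to its parent PID.
    """
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)

    found, stack = [], [root]
    while stack:
        for child in children[stack.pop()]:
            found.append(child)
            stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    # Invented table: 10 stands in for the batch daemon's job step,
    # 30 for a process that has been re-parented to PID 1.
    table = {10: 1, 11: 10, 12: 11, 13: 10, 30: 1}
    print(descendants(table, 10))
```

Here PID 30 is not listed under 10, so killing the tree under 10 would
leave it running.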
It is instructive to run ps -eaf --forest sometimes on a compute node
during a normal job run. Get to know how things are being created, and what
their parents are.
(two dashes in front of the forest argument)
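
For the case where the parent PID is 1 and --forest tells you little, a
short script can list a user's processes whose parent is PID 1. This is a
sketch, not a finished tool: the function names are mine, it assumes a
procps-style ps that accepts repeated -o options with empty headers, and on
some systems orphans are adopted by a process other than PID 1, so treat the
result as a hint only.

```python
import subprocess


def parse_ps(text):
    """Parse headerless 'pid ppid user args' lines into a list of dicts."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank or malformed line
        procs.append({
            "pid": int(parts[0]),
            "ppid": int(parts[1]),
            "user": parts[2],
            "args": parts[3] if len(parts) > 3 else "",
        })
    return procs


def reparented_to_init(procs, user):
    """Return the processes of `user` whose parent PID is 1."""
    return [p for p in procs if p["ppid"] == 1 and p["user"] == user]


def list_processes():
    """Run ps on this host; separate -o options keep the columns unambiguous."""
    cmd = ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "user=", "-o", "args="]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)


if __name__ == "__main__":
    import getpass
    for proc in reparented_to_init(list_processes(), getpass.getuser()):
        print(proc["pid"], proc["args"])
```

Run it on the compute node as the job owner (or pass another user name) once
the job has ended, and compare with what ps -eaf --forest showed while the
job was running.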
Now think of users who start a batch job and get a list of compute hosts.
They MAY use a mechanism such as ssh or indeed pbsdsh to start running job
processes on those nodes.
You will then have trouble with orphaned processes when the job ends.
Techniques for dealing with this:
a use the PAM module which stops ssh login (actually - this probably
allows ssh login during a job time when the user has a node allocated)
b my favourite - CPU sets - actually this won't stop ssh logins either.
c Shouting, much shouting. Screaming.
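
None of these finds leftovers that are already there, so here is a sketch of
a periodic check: given the process list on a node and the set of users who
currently have a job there, flag everything else. The function name and the
system_users default are assumptions of mine, and a real node needs more
system accounts in that list than just root. The users-with-jobs set could
come from something like "squeue -h -w <nodename> -o %u", but check that
against your own Slurm version.

```python
def stray_processes(procs, users_with_jobs, system_users=("root",)):
    """Return processes owned by users who have no job on this node.

    `procs` is a list of dicts with at least a "user" key;
    `system_users` lists accounts that are always allowed (assumed default).
    """
    allowed = set(users_with_jobs) | set(system_users)
    return [p for p in procs if p["user"] not in allowed]


if __name__ == "__main__":
    # Invented process list for illustration.
    procs = [
        {"pid": 1, "user": "root"},
        {"pid": 100, "user": "alice"},
        {"pid": 200, "user": "bob"},
    ]
    # Only alice has a job on the node, so bob's process is flagged.
    for proc in stray_processes(procs, ["alice"]):
        print(proc["pid"], proc["user"])
```

I would only report what this finds, not kill it automatically, at least
until you trust the list of allowed users.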
I have seen several cases of users behaving like this for understandable
reasons.
On a system which I did not manage, but was asked for advice on, the vendor
had provided a sample script for running Ansys.
The user wanted to run Abaqus on the compute nodes (or some such - a
different application anyway)
So he started an empty Ansys job, which sat doing nothing. Then took the
list of hosts provided by the batch system
and fired up an interactive Abaqus session on his terminal.
I honestly hesitate to label this behaviour 'wrong'.
I have also seen similar when running a CFD job.
On 23 April 2018 at 11:50, Nicolò Parmiggiani <nicolo.parmiggiani at gmail.com>
wrote:
> Hi,
>
> I have a job that keeps running even though the internal process is
> finished.
>
> What could be the problem?
>
> Thank you.
>