[slurm-users] Job still running after process completed
Paul Edmon
pedmon at cfa.harvard.edu
Mon Apr 23 07:58:56 MDT 2018
I would recommend putting a clean-up step in your epilog script. We
have a check here that sees whether the job has completed, and if so it
terminates all the user's remaining processes with kill -9 to clean up
any residuals. If that fails, it closes off the node so we can reboot it.
-Paul Edmon-
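[Editorial note: the clean-up Paul describes might look roughly like the
minimal sketch below. This is hypothetical, not his actual script; the
1000 UID cutoff and the unconditional pkill are assumptions to adapt to
your site. slurmd does export SLURM_JOB_UID to the epilog.]

```shell
#!/bin/bash
# Hypothetical epilog clean-up sketch: kill any processes the job's
# user still owns on this node once the job has ended.
cleanup_user() {
    uid="$1"
    # Never touch system accounts (the 1000 cutoff is an assumption)
    if [ "$uid" -lt 1000 ]; then
        echo "skipping system uid $uid"
        return 0
    fi
    # kill -9 everything the user still owns; ignore "no processes found"
    pkill -9 -u "$uid" || true
    echo "cleaned up uid $uid"
}

# slurmd sets SLURM_JOB_UID for the epilog; default to 0 for a dry demo
cleanup_user "${SLURM_JOB_UID:-0}"
```

A real epilog would also check whether the user still has other jobs on
the node (e.g. via squeue) before killing anything.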
On 04/23/2018 08:10 AM, John Hearns wrote:
> Nicolo, I cannot say what your problem is.
> However in the past with problems like this I would
>
> a) look at ps -eaf --forest
> Try to see what the parent processes of these job processes are.
> Clearly if the parent PID is 1 then --forest is not much help. But the
> --forest option is my 'goto' option.
>
> b) look closely at the slurm logs. Do not fool yourself - force
> yourself to read the logs line by line, around the timestamp when the
> job ends.
>
>
> Being a bit more helpful, in my last job we had endless problems with
> Matlab jobs leaving orphaned processes.
> To be fair to Matlab, they have a utility which 'properly' starts
> parallel jobs under the control of the batch system (OK, it was PBS Pro).
> But users can easily start a job and 'fire off' processes in Matlab
> which are not under the direct control of the batch daemon, leaving
> orphaned processes
> when the job ends.
>
> Actually, if you think about it, this is how a batch system works.
> The batch system daemon starts running processes on your behalf.
> When the job is killed, all the daughter processes of that daemon
> should die.
> It is instructive to run ps -eaf --forest sometimes on a compute node
> during a normal job run. Get to know how things are being created, and
> what their parents are
> (two dashes in front of the forest argument)
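[Editorial note: a quick illustration of the inspection John suggests.
The tree shown in the comment is only the typical shape on a Slurm node;
the job ID and paths are made up.]

```shell
# Full process tree; on a Slurm compute node the job's processes should
# hang off slurmstepd, roughly:
#   slurmd
#    \_ slurmstepd: [12345.batch]
#        \_ /bin/bash .../slurm_script
#            \_ your_application
ps -eaf --forest | head -n 20

# Orphans re-parent to PID 1 (or a subreaper), so a user process with
# PPID 1 after the job has ended is the tell-tale sign
ps -eo pid,ppid,user,comm --sort=ppid | head -n 20
```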
>
> Now think of users who start a batch job and get a list of compute hosts.
> They MAY use a mechanism such as ssh, or indeed pbsdsh, to start running
> job processes on those nodes.
> You will then have trouble with orphaned processes when the job ends.
> Techniques for dealing with this:
> a) use the PAM module which stops ssh logins (actually, this probably
> still allows ssh login during the job, while the user has a node allocated)
> b) my favourite - CPU sets - actually this won't stop ssh logins either.
> c) Shouting, much shouting. Screaming.
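[Editorial note: for Slurm, option (a) is typically pam_slurm_adopt
combined with cgroup containment. A sketch of the usual configuration
follows; file paths and exact lines may differ on your distribution.]

```
# /etc/pam.d/sshd -- deny ssh login unless the user has a job on this
# node, and adopt the ssh session into that job's cgroup
account    required     pam_slurm_adopt.so

# slurm.conf -- track and contain job processes with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
```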
>
> Regarding users behaving like this, I have seen several cases of
> such behaviour, for understandable reasons.
> On a system which I did not manage, but was asked for advice on, the
> vendor had provided a sample script for running Ansys.
> The user wanted to run Abaqus on the compute nodes (or some such - a
> different application anyway).
> So he started an empty Ansys job, which sat doing nothing, then took
> the list of hosts provided by the batch system
> and fired up an interactive Abaqus session on his terminal.
> I honestly hesitate to label this behaviour 'wrong'.
>
> I have also seen similar when running a CFD job.
>
> On 23 April 2018 at 11:50, Nicolò Parmiggiani
> <nicolo.parmiggiani at gmail.com <mailto:nicolo.parmiggiani at gmail.com>>
> wrote:
>
> Hi,
>
> I have a job that keeps running even though the internal process
> is finished.
>
> What could be the problem?
>
> Thank you.
>
>