[slurm-users] Job still running after process completed
Paul Edmon
pedmon at cfa.harvard.edu
Mon Apr 23 07:58:56 MDT 2018
I would recommend putting a clean-up step in your epilog script. We
have a check here that sees whether the job has completed, and if so it
terminates all the user's remaining processes with kill -9 to clean up
any residuals. If that fails, it closes off the node so we can reboot it.
-Paul Edmon-
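[Editorial note: the clean-up Paul describes might look roughly like the
minimal sketch below. This is hypothetical, not his actual script; the
1000 UID cutoff and the unconditional pkill are assumptions to adapt to
your site. slurmd does export SLURM_JOB_UID to the epilog.]

```shell
#!/bin/bash
# Hypothetical epilog clean-up sketch: kill any processes the job's
# user still owns on this node once the job has ended.
cleanup_user() {
    uid="$1"
    # Never touch system accounts (the 1000 cutoff is an assumption)
    if [ "$uid" -lt 1000 ]; then
        echo "skipping system uid $uid"
        return 0
    fi
    # kill -9 everything the user still owns; ignore "no processes found"
    pkill -9 -u "$uid" || true
    echo "cleaned up uid $uid"
}

# slurmd sets SLURM_JOB_UID for the epilog; default to 0 for a dry demo
cleanup_user "${SLURM_JOB_UID:-0}"
```

A real epilog would also check whether the user still has other jobs on
the node (e.g. via squeue) before killing anything.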
On 04/23/2018 08:10 AM, John Hearns wrote:
> Nicolo, I cannot say what your problem is.
> However in the past with problems like this I would
>
> a) look at ps -eaf --forest
> Try to see what the parent processes of these job processes are.
> Clearly if the parent PID is 1 then --forest is not much help. But the
> --forest option is my 'goto' option.
>
> b) look closely at the slurm logs. Do not fool yourself - force
> yourself to read the logs line by line, around the timestamp when the
> job ends.
>
>
> Being a bit more helpful, in my last job we had endless problems with
> Matlab jobs leaving orphaned processes.
> To be fair to Matlab, they have a utility which 'properly' starts
> parallel jobs under the control of the batch system (OK, it was PBS Pro).
> But users can easily start a job and 'fire off' processes in Matlab
> which are not under the direct control of the batch daemon, leaving
> orphaned processes
> when the job ends.
>
> Actually, if you think about it, this is how a batch system works.
> The batch system daemon starts running processes on your behalf.
> When the job is killed, all the daughter processes of that daemon
> should die.
> It is instructive to run ps -eaf --forest sometimes on a compute node
> during a normal job run. Get to know how things are being created, and
> what their parents are
> (two dashes in front of the forest argument)
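[Editorial note: a quick illustration of the inspection John suggests.
The tree shown in the comment is only the typical shape on a Slurm node;
the job ID and paths are made up.]

```shell
# Full process tree; on a Slurm compute node the job's processes should
# hang off slurmstepd, roughly:
#   slurmd
#    \_ slurmstepd: [12345.batch]
#        \_ /bin/bash .../slurm_script
#            \_ your_application
ps -eaf --forest | head -n 20

# Orphans re-parent to PID 1 (or a subreaper), so a user process with
# PPID 1 after the job has ended is the tell-tale sign
ps -eo pid,ppid,user,comm --sort=ppid | head -n 20
```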
>
> Now think of users who start a batch job and get a list of compute hosts.
> They MAY use a mechanism such as ssh, or indeed pbsdsh, to start running
> job processes on those nodes.
> You will then have trouble with orphaned processes when the job ends.
> Techniques for dealing with this:
> a) use the PAM module which stops ssh logins (actually, this probably
> still allows ssh login during the job, while the user has a node allocated)
> b) my favourite - CPU sets - actually this won't stop ssh logins either.
> c) Shouting, much shouting. Screaming.
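[Editorial note: for Slurm, option (a) is typically pam_slurm_adopt
combined with cgroup containment. A sketch of the usual configuration
follows; file paths and exact lines may differ on your distribution.]

```
# /etc/pam.d/sshd -- deny ssh login unless the user has a job on this
# node, and adopt the ssh session into that job's cgroup
account    required     pam_slurm_adopt.so

# slurm.conf -- track and contain job processes with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
```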
>
> Regarding users behaving like this, I have seen several cases of
> such behaviour, for understandable reasons.
> On a system which I did not manage, but was asked for advice on, the
> vendor had provided a sample script for running Ansys.
> The user wanted to run Abaqus on the compute nodes (or some such - a
> different application anyway).
> So he started an empty Ansys job, which sat doing nothing, then took
> the list of hosts provided by the batch system
> and fired up an interactive Abaqus session on his terminal.
> I honestly hesitate to label this behaviour 'wrong'.
>
> I have also seen similar when running a CFD job.
>
> On 23 April 2018 at 11:50, Nicolò Parmiggiani
> <nicolo.parmiggiani at gmail.com <mailto:nicolo.parmiggiani at gmail.com>>
> wrote:
>
> Hi,
>
> I have a job that keeps running even though the internal process
> is finished.
>
> What could be the problem?
>
> Thank you.
>
>