<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div><div>Nicolo, I cannot say what your problem is.<br></div>However, in the past with problems like this I would:<br><br></div>a) look at ps -eaf --forest<br></div>Try to see what the parent processes of these job processes are.<br></div>Clearly if the parent PID is 1 then --forest is not much help, but the --forest option is my 'go-to' option.<br><br></div>b) look closely at the Slurm logs. Do not fool yourself - force yourself to read the logs line by line, around the timestamp when the job ends.<br><br><br></div>Being a bit more helpful, in my last job we had endless problems with Matlab jobs leaving orphaned processes.<br></div>To be fair to Matlab, they have a utility which 'properly' starts parallel jobs under the control of the batch system (OK, it was PBSpro).<br></div>But users can easily start a job and 'fire off' processes in Matlab which are not under the direct control of the batch daemon, leaving orphaned processes<br></div>when the job ends.<br><br></div>Actually, if you think about it, this is how a batch system works. The batch system daemon starts running processes on your behalf.<br></div>When the job is killed, all the daughter processes of that daemon should die.<br></div>It is instructive to run ps -eaf --forest sometimes on a compute node during a normal job run. 
Get to know how things are being created, and what their parents are<br></div>(two dashes in front of the forest argument)<br><br></div>Now think of users who start a batch job and get a list of compute hosts.<br></div>They MAY use a mechanism such as ssh or indeed pbsdsh to start running job processes on those nodes.<br></div>You will then have trouble with orphaned processes when the job ends.<br></div>Techniques for dealing with this:<br></div>a) use the PAM module which stops ssh login (actually - this probably allows ssh login during a job, when the user has a node allocated)<br></div>b) my favourite - CPU sets - actually this won't stop ssh logins either.<br></div>c) Shouting, much shouting. Screaming.<br><br></div>Regarding users behaving like this, I have seen several cases of it for understandable reasons.<br></div>On a system which I did not manage, but was asked for advice on, the vendor had provided a sample script for running Ansys.<br></div>The user wanted to run Abaqus on the compute nodes (or some such - a different application anyway).<br></div>So he started an empty Ansys job, which sat doing nothing. 
He then took the list of hosts provided by the batch system<br></div>and fired up an interactive Abaqus session on his terminal.<br></div>I honestly hesitate to label this behaviour 'wrong'.<br><br></div>I have also seen similar when running a CFD job.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On 23 April 2018 at 11:50, Nicolò Parmiggiani <span dir="ltr">&lt;<a href="mailto:nicolo.parmiggiani@gmail.com" target="_blank">nicolo.parmiggiani@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>I have a job that keeps running even though the internal process is finished.</div><div><br></div><div>What could be the problem?</div><div><br></div><div>Thank you.</div></div>
</blockquote></div><br></div>
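<div dir="ltr"><br>PS. For the case where the parent PID is 1 and --forest is not much help, here is a rough sketch of how I would list the candidates instead. It is only a sketch: the function name orphans_for_user and the user name alice are made up for illustration, and it assumes a ps which accepts repeated -o options with empty headers (procps on Linux does). Bear in mind that ordinary daemons also have parent PID 1, and long user names may be shortened in the ps output, so treat the result as a list of things to look at, not a kill list.<br></div>
<pre>
```shell
#!/bin/sh
# Hypothetical helper: print the processes owned by one user whose parent is PID 1.
# It reads lines of the form "pid ppid user etime args..." on stdin,
# so it can be fed from ps or from a saved listing.
orphans_for_user() {
    awk -v user="$1" '$2 == 1 { if ($3 == user) print }'
}

# Typical use on a compute node (alice is a placeholder user name):
# ps -e -o pid= -o ppid= -o user= -o etime= -o args= | orphans_for_user alice
```
</pre>
<div dir="ltr">Run it on the node after the job has ended. Anything listed for a user who no longer has a job there is a candidate orphan.<br></div>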