[slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state
Robbert Eggermont
R.Eggermont at tudelft.nl
Sun Mar 25 19:43:51 MDT 2018
Dear all,
We just upgraded from 17.02.10 to 17.11.5 (using auks and cgroups) and
we are hitting a nasty problem: finished jobs are hanging (indefinitely)
in the completing state.
On the node I see only two processes remaining: 'slurmstepd' and its
child 'auks'. Looking at the slurmstepd with strace I couldn't identify
any attempts to close/kill auks (but I could very well have missed
them). Slurmstepd is regularly checking the cgroups. In the cgroups
tasks list I see (only) the slurmstepd and auks threads.
Killing (-9) auks makes the slurmstepd complete successfully.
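For anyone who wants to check the same thing on their own nodes, here is a
minimal sketch of listing what is still left in a step's cgroup. It assumes
a cgroup v1 hierarchy with a 'tasks' file; the function name cgroup_tasks
and the proc_root parameter are made up for this illustration and are not
part of Slurm or auks. The path to the tasks file is passed in by hand,
since it depends on the local cgroup mount point and Slurm configuration.

```python
import os
import sys


def cgroup_tasks(tasks_file, proc_root="/proc"):
    """Return (task_id, command_name) pairs for each ID in a cgroup v1 tasks file.

    Hypothetical helper for illustration only. The IDs in a v1 'tasks' file
    are thread IDs; /proc/<tid>/comm is readable for those as well.
    """
    with open(tasks_file) as f:
        task_ids = [int(line) for line in f if line.strip()]

    result = []
    for tid in task_ids:
        comm_path = os.path.join(proc_root, str(tid), "comm")
        try:
            with open(comm_path) as f:
                name = f.read().strip()
        except OSError:
            # The task may have exited between reading the list and /proc.
            name = "?"
        result.append((tid, name))
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_cgroup_tasks.py <path to the step's tasks file>
    for tid, name in cgroup_tasks(sys.argv[1]):
        print(tid, name)
```

If only slurmstepd and auks entries show up for a job stuck in the
completing state, that would match what I describe above.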
Does this sound familiar to anyone?
Or is there anyone out there who is successfully running 17.11.5 in
combination with auks and cgroups?
I'm wondering if there may be some kind of deadlock between not killing
auks and waiting for the cgroups to become empty?
Regards,
Robbert