[slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

Robbert Eggermont R.Eggermont at tudelft.nl
Sun Mar 25 19:43:51 MDT 2018


Dear all,

We just upgraded from 17.02.10 to 17.11.5 (using auks and cgroups) and 
we are hitting a nasty problem: finished jobs are hanging (indefinitely) 
in the completing state.

On the node I see only two processes remaining: 'slurmstepd' and its 
child 'auks'. Looking at the slurmstepd with strace I couldn't identify 
any attempts to close/kill auks (but I could very well have missed 
them). Slurmstepd is regularly checking the cgroups. In the cgroups 
tasks list I see (only) the slurmstepd and auks threads.
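
For anyone who wants to check the same thing on their own nodes, here is a minimal sketch of how the tasks list can be inspected. It assumes a cgroup v1 style 'tasks' file (one ID per line) and a standard /proc layout. The function name and the path passed on the command line are only illustrative; the actual location of the step's cgroup depends on the local cgroup configuration.

```python
import os
import sys


def read_cgroup_tasks(tasks_file, proc_root="/proc"):
    """Return (id, command name) pairs for every entry in a cgroup tasks file."""
    entries = []
    with open(tasks_file) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            task_id = int(line)
            comm_path = os.path.join(proc_root, str(task_id), "comm")
            try:
                with open(comm_path) as comm_fh:
                    comm = comm_fh.read().strip()
            except OSError:
                # The task exited between reading the list and reading its name.
                comm = "?"
            entries.append((task_id, comm))
    return entries


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the step's tasks file as the only argument.
    for task_id, comm in read_cgroup_tasks(sys.argv[1]):
        print(task_id, comm)
```

In the situation described above, a listing like this would be expected to show only slurmstepd and auks entries.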

Killing (-9) auks makes the slurmstepd complete successfully.

Does this sound familiar to anyone?

Or is there anyone out there who is successfully running 17.11.5 in 
combination with auks and cgroups?

I'm wondering if there may be some kind of deadlock between not killing 
auks and waiting for the cgroups to become empty?

Regards,

Robbert


