[slurm-users] bug 4333, "srun: fatal: step_launch.c:1036 step_launch_state_destroy"

Andy Riebs andy.riebs at hpe.com
Wed Jan 24 09:38:49 MST 2018


Occasionally, when Open MPI SHMEM jobs exit, srun fails with the 
following message:

srun: fatal: ../../../src/api/step_launch.c:1037 
step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy

Our environment:

  * 100+ node KNL cluster
  * CentOS 7.4
  * Open MPI 3.x (an interim kit between 3.0 and 3.1)
  * Slurm 17.11.0

This was reported at <https://bugs.schedmd.com/show_bug.cgi?id=4333>  
against a 17.11.0 RC kit, but we are seeing it now in the 17.11.0 
released kit (I confirmed that Moe's fix appears in our sources). Has 
anyone else seen this? Or better yet, has anyone found a way to fix it?

Andy

-- 
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!
