[slurm-users] bug 4333, "srun: fatal: step_launch.c:1036 step_launch_state_destroy"
Andy Riebs
andy.riebs at hpe.com
Wed Jan 24 09:38:49 MST 2018
Occasionally during program exit with Open MPI SHMEM jobs, we are seeing
the following message:
    srun: fatal: ../../../src/api/step_launch.c:1037 step_launch_state_destroy: pthread_mutex_destroy(): Device or resource busy
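For context on the error text: "Device or resource busy" is the glibc string for EBUSY, which glibc's pthread_mutex_destroy() returns when the mutex is still locked. The sketch below only illustrates that condition in isolation. It is not Slurm code, the helper names are hypothetical, and it assumes Linux with glibc (POSIX leaves destroying a locked mutex undefined, so other platforms may behave differently). It says nothing about which thread in srun is still holding the mutex.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not from Slurm): destroy a mutex while it is
 * still locked and return what pthread_mutex_destroy() reports.
 * Assumption: glibc on Linux, where this yields EBUSY. */
int destroy_locked_mutex(void)
{
    pthread_mutex_t m;
    int rc;

    pthread_mutex_init(&m, NULL);
    pthread_mutex_lock(&m);

    rc = pthread_mutex_destroy(&m);   /* fails: mutex is still held */

    /* Clean up properly: release the lock, then destroy. */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}

/* Hypothetical helper: the normal case, destroying an unlocked mutex. */
int destroy_unlocked_mutex(void)
{
    pthread_mutex_t m;

    pthread_mutex_init(&m, NULL);
    return pthread_mutex_destroy(&m); /* succeeds and returns 0 */
}
```

Built with `cc -pthread`, destroy_locked_mutex() should return EBUSY and destroy_unlocked_mutex() should return 0, which suggests the fatal message appears when some thread still holds (or is in the middle of using) the step launch state mutex at the moment srun tears it down.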
Our environment:
* 100+ node KNL cluster
* CentOS 7.4
* Open MPI 3.x (an interim kit between 3.0 and 3.1)
* Slurm 17.11.0
This was reported at <https://bugs.schedmd.com/show_bug.cgi?id=4333>
against a 17.11.0 RC kit, but we are seeing it now in the 17.11.0
released kit (I confirmed that Moe's fix appears in our sources). Has
anyone else seen this? Or better yet, has anyone found a way to fix it?
Andy
--
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!