[slurm-users] Help with failing job execution
Jeffrey R. Lang
JRLang at uwyo.edu
Thu Mar 24 20:50:17 UTC 2022
My site recently updated to Slurm 21.08.6, and for the most part everything went fine. Two Ubuntu nodes, however, are having issues: slurmd cannot execve jobs on them. As an example:
[jrlang@tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell --nodelist=mdgx01 --partition=dgx /bin/bash
salloc: Granted job allocation 2328489
[jrlang@tmgt1 ~]$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
srun: error: task 1 launch failed: Slurmd could not execve job
srun: error: task 2 launch failed: Slurmd could not execve job
srun: error: task 3 launch failed: Slurmd could not execve job
srun: error: task 4 launch failed: Slurmd could not execve job
srun: error: task 5 launch failed: Slurmd could not execve job
srun: error: task 6 launch failed: Slurmd could not execve job
srun: error: task 7 launch failed: Slurmd could not execve job
srun: error: task 8 launch failed: Slurmd could not execve job
srun: error: task 9 launch failed: Slurmd could not execve job
srun: error: task 10 launch failed: Slurmd could not execve job
srun: error: task 11 launch failed: Slurmd could not execve job
srun: error: task 12 launch failed: Slurmd could not execve job
srun: error: task 13 launch failed: Slurmd could not execve job
srun: error: task 14 launch failed: Slurmd could not execve job
srun: error: task 15 launch failed: Slurmd could not execve job
srun: error: task 16 launch failed: Slurmd could not execve job
srun: error: task 17 launch failed: Slurmd could not execve job
srun: error: task 18 launch failed: Slurmd could not execve job
srun: error: task 19 launch failed: Slurmd could not execve job
Looking in slurmd-mdgx01.log, we only see:
[2022-03-24T14:44:02.408] [2328501.interactive] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2022-03-24T14:44:02.409] [2328501.interactive] error: job_manager: exiting abnormally: Slurmd could not execve job
[2022-03-24T14:44:02.411] [2328501.interactive] done with job
Note that this issue didn't occur with Slurm 20.11.8.
Any ideas what could be causing it? I'm stumped.
Jeff