[slurm-users] Help with failing job execution

Jeffrey R. Lang JRLang at uwyo.edu
Thu Mar 24 20:50:17 UTC 2022


My site recently updated to Slurm 21.08.6, and for the most part everything went fine. Two Ubuntu nodes, however, are having issues: slurmd cannot execve jobs on them. As an example:

[jrlang@tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell --nodelist=mdgx01 --partition=dgx /bin/bash
salloc: Granted job allocation 2328489
[jrlang@tmgt1 ~]$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
srun: error: task 1 launch failed: Slurmd could not execve job
srun: error: task 2 launch failed: Slurmd could not execve job
srun: error: task 3 launch failed: Slurmd could not execve job
srun: error: task 4 launch failed: Slurmd could not execve job
srun: error: task 5 launch failed: Slurmd could not execve job
srun: error: task 6 launch failed: Slurmd could not execve job
srun: error: task 7 launch failed: Slurmd could not execve job
srun: error: task 8 launch failed: Slurmd could not execve job
srun: error: task 9 launch failed: Slurmd could not execve job
srun: error: task 10 launch failed: Slurmd could not execve job
srun: error: task 11 launch failed: Slurmd could not execve job
srun: error: task 12 launch failed: Slurmd could not execve job
srun: error: task 13 launch failed: Slurmd could not execve job
srun: error: task 14 launch failed: Slurmd could not execve job
srun: error: task 15 launch failed: Slurmd could not execve job
srun: error: task 16 launch failed: Slurmd could not execve job
srun: error: task 17 launch failed: Slurmd could not execve job
srun: error: task 18 launch failed: Slurmd could not execve job
srun: error: task 19 launch failed: Slurmd could not execve job

Looking in slurmd-mdgx01.log, we see only:

[2022-03-24T14:44:02.408] [2328501.interactive] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2022-03-24T14:44:02.409] [2328501.interactive] error: job_manager: exiting abnormally: Slurmd could not execve job
[2022-03-24T14:44:02.411] [2328501.interactive] done with job
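
For triage, the error lines can be separated from the rest of a slurmd log with a short script. What follows is a minimal Python sketch, not part of Slurm: the function name slurmd_errors and the line pattern are assumptions based only on the three log lines quoted above, so other log formats may need a different pattern.

```python
import re

# Assumed line shape, taken from the excerpt above:
# [timestamp] [jobid.step] message
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<step>[^\]]+)\] (?P<msg>.*)$")


def slurmd_errors(log_text):
    """Return (timestamp, step, message) for each line whose message starts with 'error:'."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("msg").startswith("error:"):
            # Drop the leading 'error:' tag and keep the rest of the message.
            message = match.group("msg")[len("error:"):].strip()
            found.append((match.group("ts"), match.group("step"), message))
    return found


# The three lines quoted from slurmd-mdgx01.log.
sample = """\
[2022-03-24T14:44:02.408] [2328501.interactive] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2022-03-24T14:44:02.409] [2328501.interactive] error: job_manager: exiting abnormally: Slurmd could not execve job
[2022-03-24T14:44:02.411] [2328501.interactive] done with job
"""

for ts, step, msg in slurmd_errors(sample):
    print(ts, step, msg)
```

On the excerpt above this keeps the two error lines and drops the "done with job" line, which makes it clear that the task plugin failure is logged first, immediately before the execve error.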


Note that this issue did not occur with Slurm 20.11.8.

Any ideas what could be causing this? I'm stumped.

Jeff