[slurm-users] Help with failing job execution
f.otto at ucl.ac.uk
Thu Jun 2 18:43:42 UTC 2022
Hi Jeff & list,
We've encountered the same problem after upgrading to 21.08.8-2: all jobs failed with "Slurmd could not execve job".
I traced this down to the slurmstepd process failing to set the cgroup parameter "memory.memsw.limit_in_bytes",
which it attempts because we have "ConstrainSwapSpace=yes" in Slurm's cgroup.conf.
The error shows up on Debian/Ubuntu systems because they don't enable cgroup swap accounting by default.
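To check whether swap accounting is active, you can look for the memsw file in the cgroup v1 memory controller. Here's a small sketch (the path /sys/fs/cgroup/memory is the usual cgroup v1 mount point on Debian/Ubuntu; the function takes the directory as an argument so you can point it elsewhere):

```shell
# Sketch: report whether cgroup v1 swap accounting appears enabled.
# Pass the memory controller directory; on a live node that is
# typically /sys/fs/cgroup/memory.
check_swapaccount() {
  if [ -f "$1/memory.memsw.limit_in_bytes" ]; then
    echo "swap accounting enabled"
  else
    echo "swap accounting disabled (boot with swapaccount=1)"
  fi
}
```

On an affected node, check_swapaccount /sys/fs/cgroup/memory should report it as disabled.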
The fix is to boot with the kernel option "swapaccount=1" (i.e. add it to the grub/pxelinux/... boot config),
or to set "ConstrainSwapSpace=no" in cgroup.conf if swap accounting isn't needed.
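Concretely, the two options look roughly like this (the GRUB file path and update-grub command are the stock Debian/Ubuntu ones, and the cgroup.conf path may differ on your install; verify against your distribution and Slurm layout):

```shell
# Option 1: /etc/default/grub -- enable cgroup v1 swap accounting at boot.
# Append swapaccount=1 to the existing kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet swapaccount=1"
# then regenerate the config and reboot the node:
#   update-grub && reboot

# Option 2: /etc/slurm/cgroup.conf -- stop Slurm from setting memsw limits:
#   ConstrainSwapSpace=no
# then restart slurmd on the node.
```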
To my understanding, this error should also have shown up with our config in previous
versions of Slurm. Perhaps it did happen, but wasn't reported properly.
BTW, what's the correct way to see debug messages from slurmstepd? Even when running slurmd
with "-D -vvvvv", they didn't show up. I resorted to running slurmd under strace to see where the
error happens, which revealed that slurmstepd was printing some messages, but strace adds a lot of overhead.
(apologies for breaking threading, I wasn't subscribed to slurm-users at the time and can't reply properly)
> My site recently updated to Slurm 21.08.6 and for the most part everything went fine. Two Ubuntu nodes however are having issues. Slurmd cannot execve the jobs on the nodes. As an example:
> [jrlang at tmgt1 ~]$ salloc -A ARCC --nodes=1 --ntasks=20 -t 1:00:00 --bell --nodelist=mdgx01 --partition=dgx /bin/bash
> salloc: Granted job allocation 2328489
> [jrlang at tmgt1 ~]$ srun hostname
> srun: error: task 0 launch failed: Slurmd could not execve job
> srun: error: task 1 launch failed: Slurmd could not execve job
> srun: error: task 19 launch failed: Slurmd could not execve job
> Looking in slurmd-mdgx01.log we only see
> [2022-03-24T14:44:02.408] [2328501.interactive] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
> [2022-03-24T14:44:02.409] [2328501.interactive] error: job_manager: exiting abnormally: Slurmd could not execve job
> [2022-03-24T14:44:02.411] [2328501.interactive] done with job
> Note that this issue didn't occur with Slurm 20.11.8.
> Any ideas what could be causing the issue? I'm stumped.