I'm running on Ubuntu 20.04. I've got a clean configuration of slurmctld and slurmd on one node.
1) I've configured oci.conf to the defaults defined by "OCI.CONF EXAMPLE FOR RUNC USING RUN (RECOMMENDED OVER USING CREATE/START):".
I have a container that I can run by hand:
runc --rootless true run -b /opt/pilot_results/results.20241108-184831/step1 test
sh-4.2# exit
and it returns.
However, when I run
srun --container=/opt/pilot_results/results.20241108-184831/step1 ls
it hangs after completing the ls, and I have to press Ctrl-C twice to get out of it.
2) I tried using the configuration for RUNC with Create/Start and it hangs on start.
3) I tried using the configuration for CRUN using RUN. I can run the container by hand with crun, but srun fails with:
srun --container=/opt/pilot_results/results.20241108-194914/step1 bash
bind socket to `/run/user/1008//pd-builds-bench-1.jrp.34.0.0/notify`: Address already in use
sync socket closed
srun: error: pd-builds-bench-1: task 0: Exited with exit code 1
4) I tried using the configuration for CRUN with Create/Start and it errors repeatedly with:
slurmstepd: error: _get_container_state: RunTimeQuery failed rc:256 output:error opening file `/run/user/1008//pd-builds-bench-1.jrp.51.0.0/status`: No such file or directory
I went through the support tickets (open and closed) and couldn't find anything that matches any of these errors, and I'm pretty stuck at this point.
Any help would be welcome.