Hi,
I'm running on Ubuntu 20.04. I've got a clean configuration of slurmctld and slurmd on one node.
1) I've set up oci.conf using the example given in the documentation under "OCI.CONF EXAMPLE FOR RUNC USING RUN (RECOMMENDED OVER USING CREATE/START):".
I have a container that I can run by hand:
runc --rootless true run -b /opt/pilot_results/results.20241108-184831/step1 test
sh-4.2# exit
exit
and it returns.
However, when I run it through srun:
srun --container=/opt/pilot_results/results.20241108-184831/step1 ls
it hangs after ls completes, and I have to press Ctrl-C twice to get out of it.
2) I tried using the configuration for RUNC with Create/Start and it hangs on start.
3) I tried using the configuration for CRUN using RUN. I can run the container by hand with crun, but srun fails with:
srun --container=/opt/pilot_results/results.20241108-194914/step1 bash
bind socket to `/run/user/1008//pd-builds-bench-1.jrp.34.0.0/notify`: Address already in use
sync socket closed
srun: error: pd-builds-bench-1: task 0: Exited with exit code 1
4) I tried using the configuration for CRUN with Create/Start and it errors repeatedly with:
slurmstepd: error: _get_container_state: RunTimeQuery failed rc:256 output:error opening file `/run/user/1008//pd-builds-bench-1.jrp.51.0.0/status`: No such file or directory
I went through the support tickets (both open and closed) and couldn't find anything that matches any of these errors, so I'm pretty stuck at this point.
Any help would be welcome.
Thanks,
JRP