Hi Sean,
I appear to be having the same issue you are with OCI container jobs running forever / appearing to hang. I haven't figured it out yet, but perhaps we can compare notes and work out what our configurations have in common.
Like you, I was following the examples in https://slurm.schedmd.com/containers.html and originally hit the issue with an alpine container image running the `uptime` command, but I have since confirmed it with other images, including ubuntu, and with other processes. I always get the same result: the container process runs to completion and exits, but the Slurm job then continues to run until it is cancelled or killed.
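For reference, my bundles follow the layout from the docs; something along these lines should reproduce the alpine bundle (a sketch using skopeo and umoci as in the containers.html examples -- the layout name, tag, and target path are illustrative):

```
# Sketch: build an OCI runtime bundle for alpine with skopeo + umoci,
# the approach shown on containers.html; paths/names here are illustrative.
cd /shared_fs/shared/oci_images/
skopeo copy docker://alpine:latest oci:alpine-layout:latest   # OCI image layout
umoci unpack --rootless --image alpine-layout:latest ./alpine # OCI runtime bundle
```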
I am running Slurm v23.11.6 with nvidia-container-runtime; what Slurm version and runtime are you using?
My oci.conf is:

```
$ cat /etc/slurm/oci.conf
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="nvidia-container-runtime --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```
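For anyone reading along who hasn't used oci.conf before: as I understand the pattern replacements documented in oci.conf(5), %n/%u/%j/%s/%t make up the container ID (nodename.username.jobid.stepid.taskid), %U is the numeric user ID, and %b is the bundle path, so for the alpine example the RunTimeRun line would expand to roughly the following (node name, user, and IDs below are invented for illustration):

```
# Hypothetical expansion of RunTimeRun for the alpine example above;
# node01, sean, uid 1000, and job/step/task IDs are made-up values.
nvidia-container-runtime --rootless=true --root=/run/user/1000/ \
    run node01.sean.1234.0.0 -b /shared_fs/shared/oci_images/alpine
```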
Hope that we can get to the bottom of this and resolve our issues with OCI containers!
Josh.
---

Hello. I am new to this list and Slurm overall. I have a lot of experience in computer operations, including Kubernetes, but I am currently exploring Slurm in some depth.
I have set up a small cluster and have generally gotten things working, but when I run a container job, the command executes and then the job appears to hang, as if the container were still running.
So, running the following works, but it never returns to the prompt unless I use [Control-C].
```
$ srun --container /shared_fs/shared/oci_images/alpine uptime
 19:21:47 up 20:43, 0 users, load average: 0.03, 0.25, 0.15
```
I'm unsure if something is misconfigured or if I'm misunderstanding how this should work, but any help and/or pointers would be greatly appreciated.
Thanks! Sean
--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrandall@altoslabs.com
Just an update to say that, for me, this issue appears to be specific to the `runc` runtime (or `nvidia-container-runtime` when it uses `runc` internally). I switched to `crun` and the problem went away -- containers run with `srun --container` now exit as soon as the inner process finishes.
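For anyone hitting the same thing, the change in oci.conf amounts to swapping the runtime binary; a minimal sketch, assuming crun accepts the same state/kill/delete/run invocations used in my config above (the containers.html page has a full crun example to check against for your version):

```
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="crun --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```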
--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrandall@altoslabs.com