Hi Sean,
I appear to be having the same issue you are with OCI container jobs running forever / appearing to hang. I haven't figured it out yet, but perhaps we can compare notes and work out what our configurations have in common.
Like you, I was following the examples in https://slurm.schedmd.com/containers.html and originally hit the issue with an alpine container image running the `uptime` command, but I have since confirmed it with other images, including ubuntu, and with other processes. I always get the same result: the container process runs to completion and exits, but the Slurm job then continues to run until it is cancelled or killed.
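For reference, my bundles follow the layout from the docs; something along these lines should reproduce the alpine bundle (a sketch using skopeo and umoci as in the containers.html examples -- the layout name, tag, and target path are illustrative):

```
# Sketch: build an OCI runtime bundle for alpine with skopeo + umoci,
# the approach shown on containers.html; paths/names here are illustrative.
cd /shared_fs/shared/oci_images/
skopeo copy docker://alpine:latest oci:alpine-layout:latest   # OCI image layout
umoci unpack --rootless --image alpine-layout:latest ./alpine # OCI runtime bundle
```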
I am running Slurm v23.11.6 with nvidia-container-runtime; what Slurm version and runtime are you using?
My oci.conf is:

```
$ cat /etc/slurm/oci.conf
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="nvidia-container-runtime --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```
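For anyone reading along who hasn't used oci.conf before: as I understand the pattern replacements documented in oci.conf(5), %n/%u/%j/%s/%t make up the container ID (nodename.username.jobid.stepid.taskid), %U is the numeric user ID, and %b is the bundle path, so for the alpine example the RunTimeRun line would expand to roughly the following (node name, user, and IDs below are invented for illustration):

```
# Hypothetical expansion of RunTimeRun for the alpine example above;
# node01, sean, uid 1000, and job/step/task IDs are made-up values.
nvidia-container-runtime --rootless=true --root=/run/user/1000/ \
    run node01.sean.1234.0.0 -b /shared_fs/shared/oci_images/alpine
```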
Hope that we can get to the bottom of this and resolve our issues with OCI containers!
Josh.
---

Hello. I am new to this list and Slurm overall. I have a lot of experience in computer operations, including Kubernetes, but I am currently exploring Slurm in some depth.
I have set up a small cluster and have generally gotten things working, but when I run a container job, the command executes and then the job appears to hang, as if the container were still running.
So, running the following works, but it never returns to the prompt unless I use [Control-C].
```
$ srun --container /shared_fs/shared/oci_images/alpine uptime
 19:21:47 up 20:43, 0 users, load average: 0.03, 0.25, 0.15
```
I'm unsure if something is misconfigured or if I'm misunderstanding how this should work, but any help and/or pointers would be greatly appreciated.
Thanks! Sean
--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrandall@altoslabs.com
Just an update to say that, for me, this issue appears to be specific to the `runc` runtime (or `nvidia-container-runtime` when it uses `runc` internally). I switched to `crun` and the problem went away -- containers run with `srun --container` now exit as soon as the inner process finishes.
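For anyone hitting the same thing, the change in oci.conf amounts to swapping the runtime binary; a minimal sketch, assuming crun accepts the same state/kill/delete/run invocations used in my config above (the containers.html page has a full crun example to check against for your version):

```
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="crun --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```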
--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrandall@altoslabs.com