I am getting an unusual error when trying to run Podman containers using scrun on SLURM 23.11.3 (and 23.11.1 previously). In short, Podman works when not configured to use scrun, but when configured to use scrun it fails.
Podman gives this error:
scrun: fatal: Unable to request job allocation: Job cannot be submitted without the current working directory specified.
The slurmctld logs show this error:
_slurm_rpc_allocate_resources: Job cannot be submitted without the current working directory specified.
The scrun command does not appear to have a --chdir option, so the working directory must be detected automatically, and this detection appears to be failing.
Has anyone encountered this or a similar error before? If so, I would love to hear about your experience.
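One way a process can end up without a usable working directory is when that directory cannot be resolved at all, for example because it was removed underneath the process. Whether this is what happens when Podman launches scrun is only an assumption, not something confirmed from the scrun source. The minimal Python sketch below just illustrates that failure mode; `describe_cwd` and `demo_deleted_cwd` are hypothetical helper names, and the behaviour shown is that of Linux.

```python
import os
import tempfile


def describe_cwd():
    """Return the current working directory, or None if it cannot be resolved."""
    try:
        return os.getcwd()
    except OSError:
        # On Linux, os.getcwd() raises FileNotFoundError (a subclass of OSError)
        # when the working directory has been deleted.
        return None


def demo_deleted_cwd():
    """Enter a scratch directory, remove it, and report what describe_cwd() sees."""
    original = os.getcwd()
    scratch = tempfile.mkdtemp()
    os.chdir(scratch)
    os.rmdir(scratch)  # the process is now sitting in a directory that no longer exists
    try:
        return describe_cwd()
    finally:
        os.chdir(original)  # always restore the original working directory


if __name__ == "__main__":
    print("cwd as seen by this process:", describe_cwd())
    print("PWD environment variable:", os.environ.get("PWD"))
    print("cwd after its directory is removed:", demo_deleted_cwd())
```

Running a check like this from the same environment in which the container runtime is invoked would at least show whether the working directory and the PWD variable are visible there.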
A bit more detail in case anyone is interested. This is a two-node test system with one head node and one compute node, both running Rocky Linux 8.9. I have created /etc/containers/storage.conf and /etc/containers/containers.conf files which are identical to the ones in the SLURM Containers Guide, except that I set rootless_storage_path to the same value as graphroot in storage.conf. When I remove the /etc/containers/containers.conf file, I can run containers on the head node. The backing store is ext4, though I have also tried NFS (which had other problems). I have tried both the vfs and overlay storage drivers and get the same result with each.
On the SLURM side, I created an oci.conf which is the same as the "oci.conf example for crun using run (suggested)" from the SLURM Containers Guide. I'm using crun because I have forced cgroup v2 on both systems (systemd.unified_cgroup_hierarchy=1 on the kernel command line) and crun seems to support cgroup v2 better. I am using an scrun.lua which is similar to the one in the scrun manual page, with some paths modified for my setup.
I should also mention that sbatch jobs run just fine and srun works as well. The test cluster seems to be working fine in general.