I'm helping with a workflow manager that needs to submit Slurm jobs. For logging and management reasons, the job (e.g. srun python) needs to be run as though it were a regular subprocess (python):
- stdin, stdout and stderr for the command should be connected to the process inside the job
- signals sent to the command should be sent to the job process
- We don't want to use the existing job allocation, if this is run from a Slurm job
- The command should only terminate when the job is finished, to avoid us needing to poll Slurm
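In Python-subprocess terms, the behaviour we're after is roughly the following (just a sketch with placeholder srun arguments and script name, not our actual code):

import signal
import subprocess
import sys

# Placeholder command: the resources and script are made up for illustration.
cmd = ["srun", "--mem", "5G", "python", "task.py"]

# Inherit stdin/stdout/stderr from the parent so the job's I/O flows through us.
proc = subprocess.Popen(cmd)

# Forward termination signals to srun so the Slurm job gets cancelled too.
def forward(signum, frame):
    proc.send_signal(signum)

signal.signal(signal.SIGTERM, forward)
signal.signal(signal.SIGINT, forward)

# Block until the job itself has finished, so we never have to poll Slurm.
sys.exit(proc.wait())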
We've tried:
- sbatch --wait, but then SIGTERM'ing the process doesn't kill the job
- salloc, but that requires a TTY process to control it (?)
- salloc srun, which seems to mess with the terminal when it's killed, likely because salloc is "designed to be executed in the foreground"
- Plain srun, which re-uses the existing Slurm allocation, and specifying resources like --mem will just request them from the current job rather than submitting a new one
What is the best solution here?
On 4/4/25 5:23 am, Michael Milton via slurm-users wrote:
Plain srun re-uses the existing Slurm allocation, and specifying resources like --mem will just request them from the current job rather than submitting a new one
srun does that as it sees all the various SLURM_* environment variables in the environment of the running job. My bet would be that if you eliminated them from the environment of the srun then you would get a new allocation.
I've done similar things in the past to do an sbatch for a job that wants to run on very different hardware with:
env $(env | awk -F= '/^(SLURM|SBATCH)/ {print "-u",$1}' | paste -s -d\ ) sbatch [...]
So it could be worth substituting srun for sbatch there to see if that helps.
Best of luck! Chris
Thanks Chris,
I can verify that unsetting all these environment variables does allow you to `srun --mem 5G` within an `srun --mem 3G` (etc). I will see if this solves my problem.
Interestingly, just running `unset SLURM_CPU_BIND SLURM_JOB_ID` gets it working. SLURM_JOB_ID seems to be the variable that controls whether srun runs inside the same job or not, and unsetting SLURM_CPU_BIND is needed to avoid the "CPU binding outside of job step allocation" error.
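For reference, stripping just those two variables when launching srun from the workflow manager looks something like this (a sketch along the same lines as above; the srun arguments are placeholders):

import os
import subprocess

# Copy the current environment and drop the two variables that make srun
# attach to the surrounding allocation (assuming, per the test above,
# that these two are sufficient).
env = dict(os.environ)
for var in ("SLURM_JOB_ID", "SLURM_CPU_BIND"):
    env.pop(var, None)

# Placeholder command and resources.
subprocess.run(["srun", "--mem", "5G", "python", "task.py"], env=env, check=True)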
Cheers