Plain srun is probably the best bet. If you really need the thing to be started from another Slurm job (rather than from the login node), you will need to exploit the fact that
> If necessary, srun will first create a resource allocation in which to run the parallel job.
AFAIK, there is no option to force the "create a resource allocation" step when it is not necessary. But you may try to request something that is "above and beyond" what the current allocation provides, and that might solve your problem.
Looking at the srun man page, I could speculate that --clusters or --cluster-constraint might help in that regard (but I am not sure).
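For what it's worth, here is a rough Python sketch of what I have in mind: wrap plain srun in a subprocess that inherits stdin/stdout/stderr, forward signals to srun (which in turn propagates them to the tasks it launched), and block until it exits. The --clusters value and the script name are made-up placeholders, and whether those extra options really force a new allocation is exactly the part I am not sure about:

    import signal
    import subprocess
    import sys

    def run_in_fresh_allocation(cmd, mem="4G", cluster=None):
        srun_cmd = ["srun", "--mem=" + mem]
        if cluster is not None:
            # Speculative: pointing srun at another cluster might make it
            # create a fresh allocation instead of reusing the current job's.
            srun_cmd.append("--clusters=" + cluster)
        srun_cmd += cmd

        # No stdin/stdout/stderr arguments, so the child inherits this
        # process's streams and the job's output ends up in the same place.
        proc = subprocess.Popen(srun_cmd)

        def forward(signum, _frame):
            # Forward the signal to srun; srun passes it on to the tasks.
            proc.send_signal(signum)

        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, forward)

        # Blocks until srun, and therefore the job step, has finished.
        return proc.wait()

    if __name__ == "__main__":
        # "other" and my_script.py are placeholders.
        sys.exit(run_in_fresh_allocation(["python", "my_script.py"],
                                         cluster="other"))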
Have a nice weekend
I'm helping with a workflow manager that needs to submit Slurm jobs. For logging and management reasons, the job (e.g. srun python) needs to be run as though it were a regular subprocess (python); see the sketch after this list:
- stdin, stdout and stderr for the command should be connected to the process inside the job
- signals sent to the command should be sent to the job process
- If this is run from inside a Slurm job, we don't want to reuse the existing job allocation
- The command should only terminate when the job has finished, so that we don't need to poll Slurm
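To make that concrete, this is roughly what the workflow manager does for a local command today, and these are the semantics we would like to keep once the command line becomes an srun invocation (task.py is just a stand-in):

    import subprocess

    # Local case today: stdio is inherited, and wait() only returns
    # once the work is actually done.
    proc = subprocess.Popen(["python", "task.py"])
    returncode = proc.wait()

    # We want the same behaviour when the command line becomes
    # ["srun", "python", "task.py"], except that it should run in a
    # new Slurm allocation rather than the one we may already be in.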
We've tried:
- sbatch --wait, but then SIGTERM'ing the sbatch process doesn't kill the job (see the sketch after this list)
- salloc, but that seems to require a controlling TTY (?)
- salloc srun seems to mess with the terminal when it's killed, likely because salloc is "designed to be executed in the foreground"
- Plain srun reuses the existing Slurm allocation, and specifying resources like --mem just requests them from the current job's allocation rather than submitting a new job
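For reference, the sbatch --wait attempt mentioned above looked roughly like this (batch.sh is a placeholder):

    import signal
    import subprocess

    # sbatch --wait stays in the foreground until the job finishes,
    # so wait() gives us the "block until done" behaviour...
    proc = subprocess.Popen(["sbatch", "--wait", "batch.sh"])

    def forward(signum, _frame):
        proc.send_signal(signum)

    signal.signal(signal.SIGTERM, forward)

    # ...but a SIGTERM here only terminates sbatch; the submitted batch
    # job keeps running on the cluster (and its output goes to a file
    # rather than to our stdout/stderr).
    returncode = proc.wait()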
What is the best solution here?