Hi list,
In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun ..." from the login node, which has always worked well for us. But in a recent Slurm training, the SchedMD folks advised us to use "salloc" and then "srun" for interactive jobs. I tried this today: "salloc" gave me a shell on a server, the same as srun does, but when I then tried to "srun [programname]" it hung there with no output. Of course, when I tried "srun [programname] &" it spawned the background job and gave me back a prompt. Both times I had to Ctrl-C the running srun job, and got no output other than the srun/slurmstepd termination messages.
I think I read somewhere that directly invoking srun creates an allocation; why then would I want to do an initial salloc, and then srun? (in the case that I want a foreground program, such as a bash shell)
I have surveyed some other institutions' Slurm interactive-job documentation for users, and I see examples of both kinds of advice: run srun directly, or run salloc and then srun.
Please help me to understand how this is intended to work, and if we are "doing it wrong" :)
Thanks, Will
salloc is the currently recommended way for interactive sessions. srun is now intended for launching steps or MPI applications. So properly you would salloc and then srun inside the salloc.
As you've noticed, with srun you tend to lose control of your shell as it takes over, so you have to background the process unless it is the main process. We've hit this before when people use srun to sub-schedule inside a salloc.
You can also just launch the salloc and then operate via the normal command line reserving srun for things like launching MPI.
The reason they changed from srun to salloc is that you can't srun inside a srun. So if you were a user who started a srun interactive session and then you tried to invoke MPI it would get weird as you would be invoking another srun. By using salloc you avoid this issue.
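Roughly, the pattern looks like this (the partition name, task counts, and the MPI binary below are just placeholders):

    salloc -p general -N 1 -n 4 --time=01:00:00   # request an allocation; you get a shell once it's granted
    srun -n 4 ./my_mpi_app                        # launch an MPI job step inside the allocation
    exit                                          # release the allocation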
We used to use srun for interactive sessions as well but swapped to salloc a few years back and haven't had any issues.
-Paul Edmon-
What do you mean "operate via the normal command line"? When you salloc, you are still on the login node.
$ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G --time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash
salloc: Pending job allocation 3798364
salloc: job 3798364 queued and waiting for resources
salloc: job 3798364 has been allocated resources
salloc: Granted job allocation 3798364
salloc: Waiting for resource configuration
salloc: Nodes rtx-02 are ready for job
mesg: cannot open /dev/pts/91: Permission denied
mlsc-login[0]:~$ hostname
mlsc-login.nmr.mgh.harvard.edu
mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST
SLURM_JOB_NODELIST=rtx-02
Seems you MUST use srun
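i.e. to actually get a shell on the allocated node (rtx-02 here) it seems I would still have to run something like

    srun --pty /bin/bash

from inside the salloc.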
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
He's talking about recent versions of Slurm which now have this option: https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step
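i.e. in slurm.conf:

    LaunchParameters=use_interactive_step

With that set, salloc launches an interactive step, so you land in a shell on the allocated compute node instead of staying on the login node.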
-Paul Edmon-
Thanks for the logical explanation, Paul. So when I rewrite my user documentation, I'll mention using `salloc` instead of `srun`.
Yes, we do have `LaunchParameters=use_interactive_step` set on our cluster, so salloc gives a shell on the allocated host.
Best, Will