[slurm-users] salloc with bash scripts problem
Brian Johanson
bjohanso at psc.edu
Wed Jan 2 11:38:49 MST 2019
SallocDefaultCommand, if specified in slurm.conf, changes the default
behavior when salloc is executed without a trailing command, which may
explain the conflicting behavior between installations. From the
slurm.conf(5) man page:
SallocDefaultCommand
Normally, salloc(1) will run the user's default shell
when a command to execute is not specified on the salloc command line.
If SallocDefaultCommand is specified, salloc will instead
run the configured command. The command is passed to
'/bin/sh -c', so shell metacharacters are allowed, and commands with
multiple arguments should be quoted. For instance:
SallocDefaultCommand = "$SHELL"
would run the shell in the user's $SHELL environment
variable, and
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0
--pty --preserve-env --mpi=none $SHELL"
would spawn the user's default shell on the
allocated resources, but not consume any of the CPU or memory resources,
configure it as a pseudo-terminal, and preserve all of the job's
environment variables (i.e., not overwrite them with
the job step's allocation information).
For systems with generic resources (GRES) defined, the
SallocDefaultCommand value should explicitly specify a zero count for
the configured GRES. Failure to do so will result in the
launched shell consuming those GRES and preventing
subsequent srun commands from using them. For example, on Cray systems
add "--gres=craynetwork:0" as shown below:
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0
--gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"
For systems with TaskPlugin set, adding an option of
"--cpu-bind=no" is recommended if the default shell should have access
to all of the CPUs allocated to the job on that node; otherwise the
shell may be limited to a single CPU or core.
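Putting those recommendations together, a combined setting might look
like the following (a sketch only: the craynetwork GRES applies to Cray
systems, and --cpu-bind=no only matters when a TaskPlugin is set):
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0
--cpu-bind=no --pty --preserve-env --mpi=none $SHELL"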
On 1/2/2019 12:38 PM, Ryan Novosielski wrote:
> I don’t think that’s true (and others have shared documentation regarding interactive jobs and the S commands). There was documentation shared for how this works, and it seems as if it has been ignored.
>
> [novosirj at amarel2 ~]$ salloc -n1
> salloc: Pending job allocation 83053985
> salloc: job 83053985 queued and waiting for resources
> salloc: job 83053985 has been allocated resources
> salloc: Granted job allocation 83053985
> salloc: Waiting for resource configuration
> salloc: Nodes slepner012 are ready for job
>
> This is the behavior I’ve always seen. If I include a command at the end of the line, it appears to simply run it in the “new” shell that is created by salloc (which you’ll notice you can exit via CTRL-D or exit).
>
> [novosirj at amarel2 ~]$ salloc -n1 hostname
> salloc: Pending job allocation 83054458
> salloc: job 83054458 queued and waiting for resources
> salloc: job 83054458 has been allocated resources
> salloc: Granted job allocation 83054458
> salloc: Waiting for resource configuration
> salloc: Nodes slepner012 are ready for job
> amarel2.amarel.rutgers.edu
> salloc: Relinquishing job allocation 83054458
>
> You can, however, tell it to srun something in that shell instead:
>
> [novosirj at amarel2 ~]$ salloc -n1 srun hostname
> salloc: Pending job allocation 83054462
> salloc: job 83054462 queued and waiting for resources
> salloc: job 83054462 has been allocated resources
> salloc: Granted job allocation 83054462
> salloc: Waiting for resource configuration
> salloc: Nodes node073 are ready for job
> node073.perceval.rutgers.edu
> salloc: Relinquishing job allocation 83054462
>
> When you use salloc, it starts an allocation and sets up the environment:
>
> [novosirj at amarel2 ~]$ env | grep SLURM
> SLURM_NODELIST=slepner012
> SLURM_JOB_NAME=bash
> SLURM_NODE_ALIASES=(null)
> SLURM_MEM_PER_CPU=4096
> SLURM_NNODES=1
> SLURM_JOBID=83053985
> SLURM_NTASKS=1
> SLURM_TASKS_PER_NODE=1
> SLURM_JOB_ID=83053985
> SLURM_SUBMIT_DIR=/cache/home/novosirj
> SLURM_NPROCS=1
> SLURM_JOB_NODELIST=slepner012
> SLURM_CLUSTER_NAME=amarel
> SLURM_JOB_CPUS_PER_NODE=1
> SLURM_SUBMIT_HOST=amarel2.amarel.rutgers.edu
> SLURM_JOB_PARTITION=main
> SLURM_JOB_NUM_NODES=1
>
> If you run “srun” subsequently, it will run on the compute node, but a regular command will run right where you are:
>
> [novosirj at amarel2 ~]$ srun hostname
> slepner012.amarel.rutgers.edu
>
> [novosirj at amarel2 ~]$ hostname
> amarel2.amarel.rutgers.edu
>
> Again, I’d advise Mahmood to read the documentation that was already provided. It really doesn’t matter what behavior is requested — that’s not what this command does. If one wants to run a script on a compute node, the correct command is sbatch. I’m not sure what advantage salloc with srun has. I assume it’s so you can open an allocation and then occasionally send srun commands over to it.
>
> --
> ____
> || \\UTGERS, |---------------------------*O*---------------------------
> ||_// the State | Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
> `'
>
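To illustrate the sbatch suggestion above for the qemu.sh case discussed
further down, a minimal batch script might look like the following (a
sketch only: the file name run_qemu.sh and the one-task/one-node request
are placeholders, and whether X11 forwarding works under sbatch depends
on the site's X11 plugin setup):

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
./qemu.sh

Submitting it from the login node with "sbatch run_qemu.sh" lets Slurm
dispatch the script to a compute node, much as srun dispatches the job
steps in the examples above.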
>> On Jan 2, 2019, at 12:20 PM, Terry Jones <terry at jon.es> wrote:
>>
>> I know very little about how SLURM works, but this sounds like it's a configuration issue - that it hasn't been configured in a way that indicates the login nodes cannot also be used as compute nodes. When I run salloc on the cluster I use, I *always* get a shell on a compute node, never on the login node that I ran salloc on.
>>
>> Terry
>>
>>
>> On Wed, Jan 2, 2019 at 4:56 PM Mahmood Naderan <mahmood.nt at gmail.com> wrote:
>> Currently, users run "salloc --spankx11 ./qemu.sh", where qemu.sh is a script that runs a qemu-system-x86_64 command.
>> When user (1) runs that command, QEMU runs on the login node, since that is where the user is logged in. When user (2) runs the same command, their qemu process also runs on the login node, and so on.
>>
>> That is not what I want!
>> I expected slurm to dispatch the jobs on compute nodes.
>>
>>
>> Regards,
>> Mahmood
>>
>>
>>
>>
>> On Wed, Jan 2, 2019 at 7:39 PM Renfro, Michael <Renfro at tntech.edu> wrote:
>> Not sure what the reasons are behind "have to manually ssh to a node", but salloc and srun can be used to allocate resources and run commands on the allocated resources:
>>
>> Before allocation, regular commands run locally, and no Slurm-related variables are present:
>>
>> =====
>>
>> [renfro at login ~]$ hostname
>> login
>> [renfro at login ~]$ echo $SLURM_TASKS_PER_NODE
>>
>>
>> =====
>>
>> After allocation, regular commands still run locally, Slurm-related variables are present, and srun runs commands on the allocated node (my prompt change inside a job is a local thing, not done by default):
>>
>> =====
>>
>> [renfro at login ~]$ salloc
>> salloc: Granted job allocation 147867
>> [renfro at login(job 147867) ~]$ hostname
>> login
>> [renfro at login(job 147867) ~]$ echo $SLURM_TASKS_PER_NODE
>> 1
>> [renfro at login(job 147867) ~]$ srun hostname
>> node004
>> [renfro at login(job 147867) ~]$ exit
>> exit
>> salloc: Relinquishing job allocation 147867
>> [renfro at login ~]$
>>
>> =====
>>
>> Lots of people get interactive shells on a reserved node with some variant of ‘srun --pty $SHELL -I’, which doesn’t require explicitly running salloc or ssh, so what are you trying to accomplish in the end?
>>
>> --
>> Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
>> 931 372-3601 / Tennessee Tech University
>>
>>> On Jan 2, 2019, at 9:24 AM, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
>>>
>>> I want to know if there is any way to push the node selection onto Slurm, rather than it being a manual step done by the user.
>>> Currently, I have to manually ssh to a node and try to "allocate resources" using salloc.
>>>
>>>
>>> Regards,
>>> Mahmood