[slurm-users] salloc with bash scripts problem

Mahmood Naderan mahmood.nt at gmail.com
Wed Jan 2 13:28:00 MST 2019


I have included my login node in the list of nodes. Not all of its cores are
included, though. Please see the output of "scontrol show nodes" below:


[mahmood@rocks7 ~]$ scontrol show nodes
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=32 CPULoad=31.96
   AvailableFeatures=rack-0,32CPUs
   ActiveFeatures=rack-0,32CPUs
   Gres=(null)
   NodeAddr=10.1.1.254 NodeHostName=compute-0-0 Version=18.08
   OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
   RealMemory=64261 AllocMem=0 FreeMem=5187 Sockets=32 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL
   BootTime=2018-12-24T18:16:49 SlurmdStartTime=2019-01-02T23:53:20
   CfgTRES=cpu=32,mem=64261M,billing=47
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=6 CPUTot=32 CPULoad=25.90
   AvailableFeatures=rack-0,32CPUs
   ActiveFeatures=rack-0,32CPUs
   Gres=(null)
   NodeAddr=10.1.1.253 NodeHostName=compute-0-1 Version=18.08
   OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
   RealMemory=64261 AllocMem=4096 FreeMem=509 Sockets=32 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511899 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,RUBY,EMERALD,QEMU
   BootTime=2018-12-24T18:07:22 SlurmdStartTime=2019-01-02T23:53:20
   CfgTRES=cpu=32,mem=64261M,billing=47
   AllocTRES=cpu=6,mem=4G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute-0-2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=6 CPUTot=32 CPULoad=5.95
   AvailableFeatures=rack-0,32CPUs
   ActiveFeatures=rack-0,32CPUs
   Gres=(null)
   NodeAddr=10.1.1.252 NodeHostName=compute-0-2 Version=18.08
   OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
   RealMemory=64261 AllocMem=4096 FreeMem=7285 Sockets=32 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511898 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,RUBY,EMERALD,QEMU
   BootTime=2018-12-24T18:10:56 SlurmdStartTime=2019-01-02T23:53:20
   CfgTRES=cpu=32,mem=64261M,billing=47
   AllocTRES=cpu=6,mem=4G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute-0-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=6 CPUTot=56 CPULoad=36.87
   AvailableFeatures=rack-0,56CPUs
   ActiveFeatures=rack-0,56CPUs
   Gres=(null)
   NodeAddr=10.1.1.251 NodeHostName=compute-0-3 Version=18.08
   OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
   RealMemory=64147 AllocMem=4096 FreeMem=15274 Sockets=56 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=913567 Weight=20535897 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,RUBY,EMERALD,QEMU
   BootTime=2018-12-24T18:02:51 SlurmdStartTime=2019-01-02T23:53:20
   CfgTRES=cpu=56,mem=64147M,billing=71
   AllocTRES=cpu=6,mem=4G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute-0-4 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=6 CPUTot=56 CPULoad=37.47
   AvailableFeatures=rack-0,56CPUs
   ActiveFeatures=rack-0,56CPUs
   Gres=(null)
   NodeAddr=10.1.1.250 NodeHostName=compute-0-4 Version=18.08
   OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
   RealMemory=64147 AllocMem=4096 FreeMem=15233 Sockets=56 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=50268 Weight=20535896 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,RUBY,EMERALD,QEMU
   BootTime=2018-12-24T18:05:38 SlurmdStartTime=2019-01-02T23:53:20
   CfgTRES=cpu=56,mem=64147M,billing=71
   AllocTRES=cpu=6,mem=4G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=rocks7 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=8 CPULoad=23.40
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=10.1.1.1 NodeHostName=rocks7 Version=18.08
   OS=Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017
   RealMemory=64261 AllocMem=0 FreeMem=322 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=272013 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=WHEEL,QEMU
   BootTime=2018-12-24T17:47:14 SlurmdStartTime=2019-01-02T23:53:20
   CfgTRES=cpu=8,mem=64261M,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s




And here are some salloc examples:

[mahmood@rocks7 ~]$ salloc
salloc: Granted job allocation 275
[mahmood@rocks7 ~]$ exit
exit
salloc: Relinquishing job allocation 275
[mahmood@rocks7 ~]$ salloc -n1
salloc: Granted job allocation 276
[mahmood@rocks7 ~]$ exit
exit
salloc: Relinquishing job allocation 276
[mahmood@rocks7 ~]$ salloc --nodelist=compute-0-2
salloc: Granted job allocation 277
[mahmood@rocks7 ~]$ exit
exit
salloc: Relinquishing job allocation 277
[mahmood@rocks7 ~]$ salloc -n1 hostname
salloc: Granted job allocation 278
rocks7.jupiterclusterscu.com
salloc: Relinquishing job allocation 278
salloc: Job allocation 278 has been revoked.
[mahmood@rocks7 ~]$


As you can see, whenever I run salloc I still end up at the rocks7 prompt,
which is the login node, rather than on one of the compute nodes.
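
If I understand the replies quoted below correctly, the way to actually get a
shell on a compute node is to launch the shell through srun inside the
allocation, e.g. something like this (not yet verified on this cluster, so the
exact flags may need adjusting):

    salloc -n1 srun --pty bash

so that bash is spawned by srun on the allocated node instead of being started
locally by salloc.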



Regards,
Mahmood




On Wed, Jan 2, 2019 at 10:13 PM Brian Johanson <bjohanso at psc.edu> wrote:

> SallocDefaultCommand, when specified in slurm.conf, changes the default
> behavior of salloc when it is executed without a command appended, and it
> may also explain the conflicting behavior seen between installations.
>
>
>         SallocDefaultCommand
>                Normally, salloc(1) will run the user's default shell
> when a command to execute is not specified on the salloc command line.
> If SallocDefaultCommand is specified,  salloc will  instead
>                run the configured command. The command is passed to
> '/bin/sh -c', so shell metacharacters are allowed, and commands with
> multiple arguments should be quoted. For instance:
>
>                    SallocDefaultCommand = "$SHELL"
>
>                would run the shell specified in the user's $SHELL environment
> variable, and
>
>                    SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0
> --pty --preserve-env --mpi=none $SHELL"
>
>                would spawn the user's default shell on the
> allocated resources, but not consume any of the CPU or memory resources,
> configure it as a pseudo-terminal, and preserve all of the job's
>                environment variables (i.e., not overwrite them with
> the job step's allocation information).
>
>                For systems with generic resources (GRES) defined, the
> SallocDefaultCommand value should explicitly specify a zero count for
> the configured GRES.  Failure to do so  will  result in  the
>                launched shell consuming those GRES and preventing
> subsequent srun commands from using them.  For example, on Cray systems
> add "--gres=craynetwork:0" as shown below:
>                    SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0
> --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"
>
>                For systems with TaskPlugin set, adding an option of
> "--cpu-bind=no" is recommended if the default shell should have access to
> all of the CPUs allocated to the job on that node; otherwise the shell may
> be limited to a single CPU or core.
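>
>                Putting those pieces together, one possible slurm.conf line
>                (just a sketch; GRES counts and other site-specific options
>                may still need to be added) would be:
>
>                    SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --cpu-bind=no --pty --preserve-env --mpi=none $SHELL"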
>
> On 1/2/2019 12:38 PM, Ryan Novosielski wrote:
> > I don’t think that’s true (and others have shared documentation
> regarding interactive jobs and the S commands). There was documentation
> shared for how this works, and it seems as if it has been ignored.
> >
> > [novosirj@amarel2 ~]$ salloc -n1
> > salloc: Pending job allocation 83053985
> > salloc: job 83053985 queued and waiting for resources
> > salloc: job 83053985 has been allocated resources
> > salloc: Granted job allocation 83053985
> > salloc: Waiting for resource configuration
> > salloc: Nodes slepner012 are ready for job
> >
> > This is the behavior I’ve always seen. If I include a command at the end
> of the line, it appears to simply run it in the “new” shell that is created
> by salloc (which you’ll notice you can exit via CTRL-D or exit).
> >
> > [novosirj@amarel2 ~]$ salloc -n1 hostname
> > salloc: Pending job allocation 83054458
> > salloc: job 83054458 queued and waiting for resources
> > salloc: job 83054458 has been allocated resources
> > salloc: Granted job allocation 83054458
> > salloc: Waiting for resource configuration
> > salloc: Nodes slepner012 are ready for job
> > amarel2.amarel.rutgers.edu
> > salloc: Relinquishing job allocation 83054458
> >
> > You can, however, tell it to srun something in that shell instead:
> >
> > [novosirj@amarel2 ~]$ salloc -n1 srun hostname
> > salloc: Pending job allocation 83054462
> > salloc: job 83054462 queued and waiting for resources
> > salloc: job 83054462 has been allocated resources
> > salloc: Granted job allocation 83054462
> > salloc: Waiting for resource configuration
> > salloc: Nodes node073 are ready for job
> > node073.perceval.rutgers.edu
> > salloc: Relinquishing job allocation 83054462
> >
> > When you use salloc, it starts an allocation and sets up the environment:
> >
> > [novosirj@amarel2 ~]$ env | grep SLURM
> > SLURM_NODELIST=slepner012
> > SLURM_JOB_NAME=bash
> > SLURM_NODE_ALIASES=(null)
> > SLURM_MEM_PER_CPU=4096
> > SLURM_NNODES=1
> > SLURM_JOBID=83053985
> > SLURM_NTASKS=1
> > SLURM_TASKS_PER_NODE=1
> > SLURM_JOB_ID=83053985
> > SLURM_SUBMIT_DIR=/cache/home/novosirj
> > SLURM_NPROCS=1
> > SLURM_JOB_NODELIST=slepner012
> > SLURM_CLUSTER_NAME=amarel
> > SLURM_JOB_CPUS_PER_NODE=1
> > SLURM_SUBMIT_HOST=amarel2.amarel.rutgers.edu
> > SLURM_JOB_PARTITION=main
> > SLURM_JOB_NUM_NODES=1
> >
> > If you run “srun” subsequently, it will run on the compute node, but a
> regular command will run right where you are:
> >
> > [novosirj@amarel2 ~]$ srun hostname
> > slepner012.amarel.rutgers.edu
> >
> > [novosirj@amarel2 ~]$ hostname
> > amarel2.amarel.rutgers.edu
> >
> > Again, I’d advise Mahmood to read the documentation that was already
> provided. It really doesn’t matter what behavior is requested — that’s not
> what this command does. If one wants to run a script on a compute node, the
> correct command is sbatch. I’m not sure what advantage salloc with srun
> has. I assume it’s so you can open an allocation and then occasionally send
> srun commands over to it.
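> >
> > To illustrate the sbatch route, a minimal sketch (the script name and its
> > contents are just placeholders):
> >
> >     $ cat hello.sh
> >     #!/bin/bash
> >     #SBATCH -n1
> >     # hostname runs on the allocated compute node; output lands in slurm-<jobid>.out
> >     hostname
> >     $ sbatch hello.sh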
> >
> > --
> >  ____
> > || \\UTGERS,     |---------------------------*O*---------------------------
> > ||_// the State  |     Ryan Novosielski - novosirj at rutgers.edu
> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> > ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
> >      `'
> >
>
>