[slurm-users] Multinode MPI job
Mahmood Naderan
mahmood.nt at gmail.com
Wed Mar 27 15:39:09 UTC 2019
> If your SLURM version is at least 18.08 then you should be able to do it
> with a heterogeneous job. See
> https://slurm.schedmd.com/heterogeneous_jobs.html
From the example on that page, I have written this:
#!/bin/bash
#SBATCH --job-name=myQE
#SBATCH --output=big-job
#SBATCH --mem-per-cpu=10g --ntasks=8
#SBATCH packjob
#SBATCH --mem-per-cpu=10g --ntasks=2
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
mpirun pw.x -i mos2.rlx.in
So, I expect the first node to run with 8 cores and 80 GB of memory and the
second node to run with 2 cores and 20 GB of memory.
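To double-check what each pack component is actually granted, I think I can
query the running job, roughly like this (just a sketch; 723 is the job id
that squeue reports below):

$ scontrol show job 723

which should print one record per component (723+0 and 723+1) with its own
NumCPUs and memory values.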
This is what I see in the output:
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  723+0   CLUSTER     myQE  mahmood  R       2:37      1 compute-0-1
  723+1    QUARTZ     myQE  mahmood  R       2:37      1 compute-0-2
$ rocks run host compute-0-1 "ps aux | grep pw.x"
mahmood  25667  0.0  0.0  289544   13224 ?  Sl  11:32  0:00 mpirun pw.x -i mos2.rlx.in
mahmood  25672 99.4  2.0 3784972 2316076 ?  Rl  11:32  2:49 pw.x -i mos2.rlx.in
mahmood  25673 99.4  2.0 3783544 2314008 ?  Rl  11:32  2:50 pw.x -i mos2.rlx.in
mahmood  25674 99.4  2.0 3785532 2314196 ?  Rl  11:32  2:50 pw.x -i mos2.rlx.in
mahmood  25675 99.2  2.0 3787648 2316048 ?  Rl  11:32  2:49 pw.x -i mos2.rlx.in
mahmood  25676 99.4  2.0 3786600 2313916 ?  Rl  11:32  2:50 pw.x -i mos2.rlx.in
mahmood  25677 99.4  2.0 3786344 2314056 ?  Rl  11:32  2:50 pw.x -i mos2.rlx.in
mahmood  25678 99.4  2.0 3782632 2313892 ?  Rl  11:32  2:50 pw.x -i mos2.rlx.in
mahmood  25679 99.4  2.0 3784112 2313856 ?  Rl  11:32  2:50 pw.x -i mos2.rlx.in
mahmood  25889  1.0  0.0  113132    1588 ?  Ss  11:35  0:00 bash -c ps aux | grep pw.x
mahmood  25925  0.0  0.0  112664     960 ?  S   11:35  0:00 grep pw.x
$ rocks run host compute-0-2 "ps aux | grep pw.x"
mahmood  28296  0.0  0.0  113132    1588 ?  Ss  11:35  0:00 bash -c ps aux | grep pw.x
mahmood  28325  0.0  0.0  112664     960 ?  S   11:35  0:00 grep pw.x
So, compute-0-2 has no pw.x process.
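For reference, the only launcher in the script is the plain mpirun line above.
The heterogeneous jobs page also describes launching a step across components
with srun and explicit pack groups; a sketch of what I understand that would
look like for this job (untested on my side, and it assumes the MPI that pw.x
was built with can be launched through srun/PMI):

# instead of the mpirun line in the batch script, something like:
srun --pack-group=0,1 pw.x -i mos2.rlx.in

which should start tasks in both components (723+0 and 723+1).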
Also, the partition names reported by squeue look weird: the first component
shows CLUSTER, which does not appear in our partitions file. We have these
entries:
$ cat /etc/slurm/parts
PartitionName=WHEEL RootOnly=yes Priority=1000 Nodes=ALL
PartitionName=RUBY AllowAccounts=y4,y8 Nodes=compute-0-[1-4]
PartitionName=EMERALD AllowAccounts=z2,z33,z7 Nodes=compute-0-[0-4],rocks7
PartitionName=QEMU AllowAccounts=q20_8 Nodes=compute-0-[1-4],rocks7
PartitionName=QUARTZ AllowAccounts=z5 Nodes=compute-0-[1-2],compute-0-4
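In case that file is not the whole picture, I can also dump what the running
slurmctld actually knows (a sketch):

$ scontrol show partition

which should list every configured partition and hopefully explain where the
CLUSTER name comes from.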
Any thoughts?
Regards,
Mahmood