[slurm-users] srun using infiniband
Anne Hammond
hammond at txcorp.com
Wed Aug 31 23:01:53 UTC 2022
We have:
a CentOS 8.5 cluster
Slurm 20.11
Mellanox ConnectX-6 HDR IB and a Mellanox 32-port switch
Our application is not scaling. I discovered that the process communications
are going over ethernet, not IB: I recorded the ifconfig counters for the eno2
(ethernet) and ib0 (InfiniBand) interfaces at the end of a job and subtracted
the counts at the beginning. We are submitting with sbatch and launching with
srun {application}
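For reference, our batch scripts are essentially of this form (the job name
and the node/task counts are placeholders, not our exact script):

#!/bin/bash
#SBATCH --job-name=app
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=neon-noSMT

# srun launches the MPI ranks; no interface is specified here
srun ./application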
If I interactively log in to a node and use the command
mpiexec -iface ib0 -n 32 -machinefile machinefile {application}
where machinefile contains 32 lines with the IB hostnames:
ne08-ib
ne08-ib
...
ne09-ib
ne09-ib
the application runs over IB and scales.
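One workaround I have considered is building that machinefile inside the batch
script from the Slurm allocation, appending -ib to each hostname. A sketch,
assuming the -ib names resolve on every node and 16 tasks per node (that count
is an assumption):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=neon-noSMT

# Expand the allocated node list and write one line per task,
# using the -ib hostname for each rank (16 per node assumed).
scontrol show hostnames "$SLURM_JOB_NODELIST" | while read -r host; do
    for _ in $(seq 1 16); do
        echo "${host}-ib"
    done
done > machinefile

mpiexec -iface ib0 -n "$SLURM_NTASKS" -machinefile machinefile ./application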
/etc/slurm/slurm.conf uses the ethernet interface for administrative
communications and allocation:
NodeName=ne[01-09] CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
PartitionName=neon-noSMT Nodes=ne[01-09] Default=NO MaxTime=3-00:00:00 DefaultTime=4:00:00 State=UP OverSubscribe=YES
I've read this is the recommended configuration.
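For context, each node is reachable by two hostnames, one per interface.
Conceptually the name resolution looks like this (addresses invented purely
for illustration; whether it lives in /etc/hosts or DNS doesn't matter here):

# eno2 addresses -- the names slurm.conf and sbatch use
10.1.0.8    ne08
10.1.0.9    ne09
# ib0 addresses -- the names used in the machinefile above
10.2.0.8    ne08-ib
10.2.0.9    ne09-ib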
I looked for srun parameters that would instruct srun to run the job's
communication over the IB interface when the job goes through the Slurm queue.
I found the --network parameter:
srun --network=DEVNAME=mlx5_ib,DEVTYPE=IB
but there is not much documentation on it and I haven't been able to get a job
to run with it yet.
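What I have been trying looks roughly like this (node/task counts are
placeholders, and I am not certain --network even applies to our MPI and
fabric combination):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --partition=neon-noSMT

# Attempt to point the job step at the IB device.
# DEVNAME=mlx5_ib is a guess on my part; it may need to match the
# device name reported by ibstat/ibv_devinfo (e.g. mlx5_0).
srun --network=DEVNAME=mlx5_ib,DEVTYPE=IB ./application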
Is this the way we should be directing srun to run the executable over
InfiniBand?
Thanks in advance,
Anne Hammond