[slurm-users] srun using infiniband

Anne Hammond hammond at txcorp.com
Wed Aug 31 23:01:53 UTC 2022


We have a
  CentOS 8.5 cluster
  Slurm 20.11
  Mellanox ConnectX-6 HDR IB and a 32-port Mellanox switch

Our application is not scaling.  I discovered that the inter-process
communication is going over ethernet, not IB.  I recorded the ifconfig
byte counts for the eno2 (ethernet) and ib0 (InfiniBand) interfaces at
the end of a job and subtracted the counts from the beginning.  We are
submitting with sbatch and launching with
srun {application}
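
Roughly, the check on each node was to record the counters before and
after the run and take the difference.  The interface names here are
ours; the sysfs counters are just another way to read the same numbers:

  ifconfig eno2 | grep bytes
  ifconfig ib0 | grep bytes
  # equivalently:
  cat /sys/class/net/eno2/statistics/rx_bytes /sys/class/net/eno2/statistics/tx_bytes
  cat /sys/class/net/ib0/statistics/rx_bytes /sys/class/net/ib0/statistics/tx_bytes

The eno2 counters grew over the run while the ib0 counters barely changed.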

If I log in to a node interactively and use the command
mpiexec -iface ib0 -n 32 -machinefile machinefile {application}

where machinefile contains 32 lines of IB hostnames:
ne08-ib
ne08-ib
...
ne09-ib
ne09-ib

the application runs over IB and scales.
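
For completeness: I could presumably build the same kind of machinefile
inside a batch script from the allocation, along the lines of the
untested sketch below (assuming --ntasks-per-node is set so that
SLURM_NTASKS_PER_NODE is defined), but I would rather have srun launch
over IB directly.

  # expand the allocated nodes and append our -ib hostname suffix,
  # one line per task slot on each node
  for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
      for i in $(seq 1 "$SLURM_NTASKS_PER_NODE"); do
          echo "${host}-ib"
      done
  done > machinefile

  mpiexec -iface ib0 -n "$SLURM_NTASKS" -machinefile machinefile {application}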

/etc/slurm/slurm.conf uses the ethernet interface for administrative
communications and allocation:

NodeName=ne[01-09] CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN


PartitionName=neon-noSMT Nodes=ne[01-09] Default=NO MaxTime=3-00:00:00 DefaultTime=4:00:00 State=UP OverSubscribe=YES


I've read that this is the recommended configuration.

I looked for srun parameters that would instruct srun to run the job over
the IB interface when it is submitted through the Slurm queue.

I found the --network parameter:

srun --network=DEVNAME=mlx5_ib,DEVTYPE=IB


but there is not much documentation on it, and I haven't been able to get
a job running with it yet.
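
For reference, the kind of batch script I have in mind looks like this;
the --network syntax is just my reading of the srun man page, and the
node/task counts simply mirror the two-node interactive test above:

  #!/bin/bash
  #SBATCH --partition=neon-noSMT
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=16

  # attempt to direct the MPI traffic over the IB device (syntax unverified)
  srun --network=DEVNAME=mlx5_ib,DEVTYPE=IB {application}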


Is this the right way to direct srun to run the executable over
InfiniBand?


Thanks in advance,

Anne Hammond