I concur with what folks have written so far: it really depends on your use case. For instance, if you are looking at a cluster with GPUs and intend to do some serious computing there, you are going to need RDMA of some sort. But it all depends on what you end up needing for your workflows.
For us, we put most of our network traffic over the IB using IPoIB, combined with aliasing all the nodes to their IB addresses. Thus all the internode network traffic goes over the IB fabric rather than the ethernet. We then have 1GbE for our ethernet backend, which we mainly use for management purposes. So we haven't invested heavily in a high-speed ethernet backbone, but in IB instead.
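A minimal sketch of what such aliasing might look like, assuming it is done with hosts-file style entries that resolve each node's hostname to its IPoIB address. The node names, the 10.10.0.0/16 subnet, and the `hosts_entries` helper are all hypothetical placeholders for illustration, not the configuration described above.

```python
import ipaddress


def hosts_entries(node_names, ib_subnet="10.10.0.0/16", first_host=1):
    """Return hosts-file lines mapping each node name to an IPoIB address.

    ib_subnet and first_host are hypothetical placeholders; substitute the
    addressing plan actually used on the IB fabric.
    """
    network = ipaddress.ip_network(ib_subnet)
    lines = []
    for offset, name in enumerate(node_names):
        # Assign consecutive addresses starting at network base + first_host.
        address = network.network_address + first_host + offset
        if address not in network:
            raise ValueError(f"subnet {ib_subnet} is too small for {name}")
        lines.append(f"{address}\t{name}")
    return lines


if __name__ == "__main__":
    # Hypothetical node names; print lines suitable for a hosts file.
    for line in hosts_entries(["node001", "node002", "node003"]):
        print(line)
```

With entries like these in place, anything that connects to a node by hostname would use the IPoIB interface, while the 1GbE management network would need its own separate names.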
Investing in both seems to me to be overkill. You should focus on one or the other unless you have the cash to spend and a good use case.
-Paul Edmon-
I’m very appreciative of everyone who has provided feedback, especially the lengthy replies.
Sounds like a RoCE-capable Ethernet backbone may be the default way to go, unless the end users have specific requirements that call for IB. At this point, we wouldn’t be interested in anything slower than 200Gbps. So perhaps Ethernet and IB are equivalent in terms of latency and RDMA capabilities, except that one is an open standard.
Thanks,
Daniel Healy
On Mon, Feb 26, 2024 at 3:40 AM Cutts, Tim <tim.cutts@astrazeneca.com> wrote:
My view is that it depends entirely on the workload, and the systems with which your compute needs to interact. A few things I’ve experienced before:
- Modern ethernet networks have pretty good latency these days, and so MPI codes can run over them. Whether IB is worth the money is a cost/benefit calculation for the codes you want to run. When we measured the ethernet network we put in at Sanger in 2016 or so, it had similar latency, in practice, to FDR InfiniBand, if I remember correctly. So it wasn’t as good as state-of-the-art IB at the time, but not bad. Certainly good enough for our purposes, and we gained a lot of flexibility through software-defined networking, which is important if you have workloads that require better security boundaries than just a big shared network.
- If your workload is predominantly single node, embarrassingly parallel, you might do better to go with ethernet and invest the saved money in more compute nodes.
- If you only have ethernet, your cluster will be simpler, and require less specialised expertise to run.
- If your parallel filesystem is Lustre, IB seems to be the more well-worn path than ethernet. We encountered a few Lustre bugs early on because of that.
- On the other hand, if you need to talk to Weka, ethernet is the well-worn path. Weka’s IB implementation requires the dedication of some cores on every client node, so you lose some compute capacity, which you don’t need to do if you’re using ethernet.
So, as any lawyer would say, “it depends”. Most of my career has been in genomics, where IB definitely wasn’t necessary. Now that I’m in pharma, there’s more MPI code, so there’s more of a case for it.
Ultimately, I think you need to run the real benchmarks with real code, and as Jason says, work out whether the additional complexity and cost of the IB network is worth it for your particular workload. I don’t think the mantra “It’s HPC so it has to be InfiniBand” is a given.
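The cost side of that trade-off can be sketched as simple arithmetic. The following is only an illustration under stated assumptions: a fixed budget, a flat per-node cost for each fabric, and made-up prices. None of the numbers come from a real quote, and the function names are hypothetical.

```python
def nodes_affordable(budget, node_cost, fabric_cost_per_node):
    """Whole nodes a fixed budget buys when each node also carries its
    share of the network fabric cost (NIC, cable, switch ports)."""
    return int(budget // (node_cost + fabric_cost_per_node))


def breakeven_speedup(budget, node_cost, eth_cost_per_node, ib_cost_per_node):
    """Per-node speedup IB must deliver on your real codes to match the
    aggregate throughput of the larger ethernet-only cluster.

    Assumes throughput scales linearly with node count, which is only
    reasonable for embarrassingly parallel work.
    """
    eth_nodes = nodes_affordable(budget, node_cost, eth_cost_per_node)
    ib_nodes = nodes_affordable(budget, node_cost, ib_cost_per_node)
    return eth_nodes / ib_nodes


if __name__ == "__main__":
    # All figures below are invented placeholders, not real prices.
    budget, node_cost = 1_000_000, 10_000
    eth_per_node, ib_per_node = 500, 2_500
    print(nodes_affordable(budget, node_cost, eth_per_node))   # ethernet nodes
    print(nodes_affordable(budget, node_cost, ib_per_node))    # IB nodes
    print(breakeven_speedup(budget, node_cost, eth_per_node, ib_per_node))
```

The speedup to compare against the break-even figure has to come from benchmarks of the real codes. For tightly coupled MPI jobs the linear-scaling assumption does not hold, so this only frames the question rather than answering it.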
Tim
--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca