Tom Herbert, SiPanda CTO, August 26, 2024
Recently, we achieved a notable milestone at the Netdev0x18 conference. For the first time, we had presentations and discussions that covered the extremes of the Internet and everything in between. By “extremes” I mean distances in communications range from one meter, for instance a datacenter rack for AI/ML training, all the way up to 228 million kilometers for communicating with Mars rovers via the Mars Relay Network. Latency is commensurate with distance, so we’re talking delays from less than one microsecond to several minutes. It’s obvious that the characteristics at these extremes are going to be pretty different, and so that raises the question: are these extremes so different that we need completely different protocols, completely different implementations, and do we need a completely different mindset?
The distance-latency spectrum of the Internet
We can characterize the Internet by a spectrum of distance-latency. The picture below illustrates this. On the left we have data centers for AI/ML, where intra-rack communications between GPUs are achieving upwards of 400Gbps with latencies measured in nanoseconds. On the far right we have an ever-increasing number of spacecraft creating interplanetary networks where communication latencies are measured in minutes. In the middle is the “open Internet”, which is what people see in their everyday lives. Between the spectrum's bookends lie twelve orders of magnitude of difference in distance and latency. That's a lot!
Distance-latency spectrum of the Internet. This diagram shows the range of distances and latencies across the Internet, as well as the typical protocols for different parts of the spectrum.
Goin’ teeny tiny
AI/ML applications want, and basically require, low-latency, lossless networks. For instance, some learning jobs with very large data sets can take days or even weeks to run. High latency or packet loss can force a learning job to take much longer to complete, and in some cases might even cause the job to fail. Network administrators are highly motivated to deploy the right hardware and software to optimize for AI and machine learning. Networks built for this purpose tend to be compact, homogeneous, and tightly managed to minimize variance.
As shown in the picture below, a data center may implement a hierarchical topology with many paths between any two nodes to provide high bisection bandwidth. Local connections, like those between hosts in the same rack, will typically provide higher throughput and lower latency than connections between systems in different racks or different zones. So placement of the processes for a job is important; for instance, we want to localize communications to minimize the number of hops packets take and hence minimize latency (a toy model of this is sketched below). The topology is so critical to AI/ML that there are some solutions where the physical topology can be dynamically reconfigured for specific jobs (e.g., a robotic arm that can reconnect optical cables).
A spine-leaf topology for an AI/ML network. This example topology is for a RoCEv2 network at Meta.
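As a toy illustration of why placement matters, here is a small C sketch of hop counts in a two-tier leaf-spine fabric. The model is hypothetical and deliberately simple; the point is just that co-locating communicating processes cuts the number of switch traversals, and hence queueing delay, on every packet:

    #include <stdio.h>

    /* Toy two-tier leaf-spine model: hosts on the same leaf (same rack)
     * traverse one switch; hosts on different leaves traverse
     * leaf -> spine -> leaf, i.e. three switches. */
    static int switch_hops(int rack_a, int rack_b)
    {
        return (rack_a == rack_b) ? 1 : 3;
    }

    int main(void)
    {
        /* A scheduler weighing two placements for a pair of GPUs:
         * both in rack 0, versus split across racks 0 and 7. */
        printf("same rack:  %d switch hop(s)\n", switch_hops(0, 0));
        printf("cross rack: %d switch hop(s)\n", switch_hops(0, 7));
        return 0;
    }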
The predominant transport in AI/ML networks is RoCEv2, which carries RDMA. Other emerging protocols include Falcon and Ultra Ethernet. All of these protocols employ UDP/IP over optical Ethernet. There are two common protocol mechanisms for promoting lossless communications and low latency: Explicit Congestion Notification (ECN) and Priority Flow Control (PFC). Both of these employ signaling between hosts and network nodes. In the case of ECN, a switch in the path can signal an end host when it is experiencing congestion; when the host receives the signal, it can reduce its transmit rate to avoid packet loss at the switch. PFC is a signal from switches to other switches or hosts to flow control a transmitter when network queues are full. The signal is xON/xOFF to flow control transmitters for some priority class. ECN and PFC can work together: ECN first tries to minimize congestion, and PFC is a fail-safe to prevent packet loss.
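To make the ECN side concrete, here is a minimal sketch in C of a DCQCN-style sender reaction point (DCQCN is the congestion control commonly used with RoCEv2). The structure, function names, and constants are illustrative, not any particular NIC's implementation:

    #include <stdint.h>

    /* Hypothetical DCQCN-style reaction point: the sender tracks a
     * current rate and a target rate. Each CNP (Congestion Notification
     * Packet, generated when the receiver sees ECN-marked packets) cuts
     * the rate; a periodic timer recovers when the CNPs stop. */

    struct ecn_state {
        double current_rate;  /* current transmit rate, bits/sec */
        double target_rate;   /* rate to recover toward */
        double alpha;         /* EWMA estimate of congestion extent */
    };

    #define DCQCN_G (1.0 / 256.0)  /* alpha gain, illustrative value */

    /* A CNP arrived: remember where we were, then cut the rate in
     * proportion to how congested the path looks. */
    void on_cnp(struct ecn_state *s)
    {
        s->alpha = (1.0 - DCQCN_G) * s->alpha + DCQCN_G;
        s->target_rate = s->current_rate;
        s->current_rate *= 1.0 - s->alpha / 2.0;
    }

    /* Periodic timer with no recent CNPs: decay alpha and recover
     * halfway toward the target rate (additive increase omitted). */
    void on_recovery_timer(struct ecn_state *s)
    {
        s->alpha *= 1.0 - DCQCN_G;
        s->current_rate = (s->current_rate + s->target_rate) / 2.0;
    }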
E.T. phone home
At the far right of the distance-latency spectrum are deep space and interplanetary networks, or, more pedantically: Delay Tolerant Networks (DTNs). The characteristics of DTNs include the three D's: Disconnection, Disruption, and Delay. Unlike terrestrial links on Earth, DTN links are not always established; for instance, a spacecraft may only have certain windows when its antenna is properly oriented to communicate with an Earth-based receiver. Even when links are established, various factors including communication failures and solar activity can cause communications disruptions. Delay, or latency, isn't just bounded by the speed of light; there can also be considerable delays due to signal processing. Effectively, the Bandwidth Delay Product can be gigantic in DTNs, so techniques for reliable communications include store and forward of large blocks of data, data compression, and Forward Error Correction.
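To get a feel for the numbers, here is a back-of-the-envelope Bandwidth Delay Product calculation in C. The link rate and delay are assumptions for illustration only: a 2 Mbps deep-space link with a twenty minute one-way delay, roughly what Earth to Mars looks like near the far end of its range:

    #include <stdio.h>

    int main(void)
    {
        double rate_bps = 2e6;            /* assumed link rate: 2 Mbps */
        double one_way_delay_s = 20 * 60; /* assumed one-way delay: 20 min */

        /* Bandwidth Delay Product: bits "in flight" on the link. */
        double bdp_bits = rate_bps * one_way_delay_s;

        printf("BDP = %.0f bits (~%.0f MB)\n",
               bdp_bits, bdp_bits / 8 / 1e6);
        /* Prints: BDP = 2400000000 bits (~300 MB) */
        return 0;
    }

With hundreds of megabytes in flight and a forty minute round trip, a protocol that waits for an end to end acknowledgment before retransmitting is hopeless. That is exactly why store and forward and Forward Error Correction dominate in DTNs.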
The picture below illustrates a possible topology for establishing a terrestrial mobile network on Mars. The topology takes the form of a dumbbell where the ends are terrestrial networks on Earth and Mars, connected by long haul links (and we really do mean long haul!) through space. As the picture suggests, a DTN network may consist of several heterogeneous links with different properties.
Example interplanetary network topology between Earth and Mars.
A common protocol for DTN is the Licklider Transmission Protocol (LTP). The protocol operates on large units of data called bundles. Bundles are further divided into blocks and then segments for transmission. LTP is a UDP based protocol. Reliable transport of bundles from a source to a destination employs a series of relays. Reliable communications and retransmissions take place locally between these relays, not end to end like TCP or the protocols for AI/ML. End to end reliability is achieved by the data traversing a series of links that are each individually reliable. This optimizes for the heterogeneous nature of DTN networks, where different protocols and algorithms can be used for communications over each link.
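Here is a minimal C sketch of that hop-by-hop idea: a relay takes custody of a bundle, acknowledges it to the previous hop over one link, and retransmits toward the next hop on its own schedule. The structures and function names are illustrative, not LTP's actual wire format:

    #include <stdbool.h>
    #include <stdio.h>

    struct bundle {
        unsigned id;
        char payload[64];
        bool acked_by_next_hop;
    };

    #define STORE_SIZE 16
    static struct bundle store[STORE_SIZE]; /* persistent custody store */
    static unsigned stored;

    /* The previous hop handed us a bundle: store it, then ack locally.
     * The ack crosses one link, not the whole path back to Earth. */
    void on_bundle_from_prev_hop(const struct bundle *b)
    {
        if (stored < STORE_SIZE) {
            store[stored] = *b;
            store[stored].acked_by_next_hop = false;
            stored++;
            printf("relay: custody of bundle %u, acking prev hop\n", b->id);
        }
    }

    /* The next link's contact window opened: (re)send anything unacked.
     * If the window closes before acks arrive, bundles simply wait in
     * the store for the next contact; disconnection is the normal case. */
    void on_contact_window_open(void)
    {
        for (unsigned i = 0; i < stored; i++)
            if (!store[i].acked_by_next_hop)
                printf("relay: forwarding bundle %u\n", store[i].id);
    }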
Commonality in implementations?
Okay, so we’ve described at a high level the bookends of the Internet distance-latency spectrum. As one might expect, their protocols and architectures are different. But what do they have in common that we might leverage?
The first thing to notice is that the protocols used in either environment are UDP/IP based. Running over IP means that we have a common network layer that abstracts out the link layer, which varies greatly (at least in DTN). That also means we can leverage implementations and extensions in IP, like IPv6 extension headers, for signaling between hosts and network nodes. The use of UDP in these cases is a matter of practicality: it's just easier to wrap new transport protocols in UDP, and by doing so we increase the chances that packets are delivered across existing infrastructure.
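As a sketch of what "wrapping a new transport in UDP" looks like, here is a hypothetical encapsulation in C. The "mytp" header is invented for illustration; it is not RoCEv2, Falcon, or LTP:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* A new transport just defines its own header and carries it as UDP
     * payload, so existing IP/UDP infrastructure (NIC offloads,
     * middleboxes, socket APIs) keeps working unchanged. */
    struct mytp_hdr {
        uint32_t seq;          /* transport sequence number */
        uint16_t flags;        /* protocol-specific flags */
        uint16_t payload_len;  /* bytes of data after the header */
    } __attribute__((packed));

    /* Build a UDP payload: custom header followed by application data.
     * Returns bytes written to buf (caller sizes buf appropriately). */
    size_t mytp_encap(uint8_t *buf, uint32_t seq,
                      const void *data, uint16_t len)
    {
        struct mytp_hdr h = {
            .seq = htonl(seq),           /* network byte order */
            .flags = 0,
            .payload_len = htons(len),
        };

        memcpy(buf, &h, sizeof(h));
        memcpy(buf + sizeof(h), data, len);

        /* The result goes out through an ordinary UDP socket via
         * sendto(); the kernel supplies the UDP/IP/Ethernet layers. */
        return sizeof(h) + len;
    }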
Another thing in common is that low latency AI/ML networks and DTNs can both benefit from a programmable datapath, albeit for different reasons. While we can commit protocols to hardware, and that might yield the highest possible performance, doing so eliminates adaptability and flexibility. This is especially problematic if a bug or security issue is found in the hardware. In the data center it's a big cost to swap out hardware to fix a problem or security issue, and in the case of space networks swapping out hardware isn't feasible at all (once a Mars spacecraft is launched it's not exactly field serviceable :-) ). So there's motivation at the extremes of the spectrum for programmable datapaths, and that's also true for the open Internet. Programmability also benefits Software Defined Networking (SDN).
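One way to picture a programmable datapath is as table-driven protocol handling, where supporting a new or patched protocol means registering a new parse function rather than respinning silicon. The API below is a hypothetical sketch, not SiPanda's actual PANDA API:

    #include <stddef.h>
    #include <stdint.h>

    typedef int (*parse_fn)(const uint8_t *pkt, size_t len);

    struct proto_entry {
        uint16_t proto_id;  /* e.g., an IP protocol number or UDP port */
        parse_fn parse;     /* handler, replaceable at runtime */
    };

    #define MAX_PROTOS 64
    static struct proto_entry table[MAX_PROTOS];
    static size_t nprotos;

    /* Register (or hot-patch) a protocol handler. A bug fix or a brand
     * new transport becomes a software update -- the same property we
     * want whether the box sits in a rack or orbits Mars. */
    int register_proto(uint16_t id, parse_fn fn)
    {
        for (size_t i = 0; i < nprotos; i++) {
            if (table[i].proto_id == id) {
                table[i].parse = fn;  /* replace existing handler */
                return 0;
            }
        }
        if (nprotos >= MAX_PROTOS)
            return -1;
        table[nprotos] = (struct proto_entry){ .proto_id = id, .parse = fn };
        nprotos++;
        return 0;
    }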
Given these commonalities, we can start to see the opportunity for at least part of the solution at each end of the spectrum to be common. We could consider a hybrid approach: start with a base platform that includes a management plane and a core programmable datapath model, and then integrate domain-specific hardware accelerators. This is the sort of model that is the vision for SiPanda.
SiPanda
SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions from cloud service providers to edge compute (5G) that don't require the compromises inherent in today's network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. IP described here is covered by patent USPTO 12,026,546 and other patents pending.