
Horizontal Parallelism and the Networking Datapath


Tom Herbert, SiPanda CTO, July 8, 2024

I’m super excited to announce that SiPanda has been granted its first patent: Parallelism in Serial Pipeline Processing (USPTO 12,026,546)! As the title suggests, the patent introduces new techniques for parallelism in the serial datapath (networking is the canonical example of a serial datapath). Parallelism is, of course, a pillar of computer performance, and it’s not just a nice-to-have: it is a major part of the answer to the end of Moore’s Law and Dennard scaling. However, “parallelism in serial data processing” sounds like an oxymoron; we usually think of a program as having either parallel or serial execution. Normally I’d agree with that, but here we really are trying to find parallelism in serial data processing. That is the gist of the patent: how to squeeze a bunch of parallelism out of serial data processing such as the networking datapath.


We define two basic types of parallelism for serial data processing: horizontal parallelism and vertical parallelism. In this blog we tackle horizontal parallelism.


The basic idea of horizontal parallelism is that we process different data objects in parallel, as shown in the figure below. Multi-queue is a rudimentary variant in use today, where a NIC can steer packets to different queues (also called Receive Side Scaling, or RSS). Multi-queue steers packets of the same flow to the same queue by hashing the identifying transport layer tuple; steering by flow ensures cache locality for flow-related data structures. Multi-queue works great when there are many different flows, but things fall apart when there are few. In the worst case, all packets belong to the same flow and end up in the same queue, so those packets are processed by only one CPU and throughput falls off a cliff. These problems are known to happen when network tunnels or crypto are used that don’t expose enough flow entropy.

Horizontal parallelism. The diagram illustrates an example of processing eight packets with horizontal parallelism. Each packet is composed of some number of protocol layers (represented by the colored blocks). In this example, packets are input at constant intervals, as indicated by the dotted vertical lines. The packets are processed in parallel across four threads; each marked row in the diagram represents a thread.
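To make the steering step concrete, here is a minimal C sketch of flow-hash steering. It uses a simple FNV-1a hash in place of the keyed Toeplitz hash NICs typically use, and the names (flow_tuple, steer_multiqueue) are illustrative rather than from any real driver. The point is that one flow always maps to one queue, which is exactly what goes wrong when there are few flows.

```c
#include <stdint.h>

struct flow_tuple {
	uint32_t saddr, daddr;	/* IPv4 source/destination addresses */
	uint16_t sport, dport;	/* transport source/destination ports */
	uint8_t  protocol;	/* e.g. IPPROTO_TCP */
};

/* Fold one 32-bit value into an FNV-1a hash, byte by byte. */
static uint32_t fnv1a_step(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++) {
		h = (h ^ (v & 0xffu)) * 16777619u;
		v >>= 8;
	}
	return h;
}

/* Hash the identifying transport layer tuple. Real NICs typically
 * use a keyed Toeplitz hash, but the steering property is the same:
 * every packet of a flow produces the same hash value. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
	uint32_t h = 2166136261u;

	h = fnv1a_step(h, t->saddr);
	h = fnv1a_step(h, t->daddr);
	h = fnv1a_step(h, ((uint32_t)t->sport << 16) | t->dport);
	h = fnv1a_step(h, t->protocol);
	return h;
}

/* Multi-queue steering: the flow hash selects the receive queue,
 * so all packets of one flow share a queue (and thus one CPU). */
static unsigned int steer_multiqueue(const struct flow_tuple *t,
				     unsigned int nqueues)
{
	return flow_hash(t) % nqueues;
}
```

With one elephant flow, steer_multiqueue returns the same queue index for every packet, no matter how many queues exist; that is the cliff described above.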


Horizontal parallelism to the rescue!

Horizontal parallelism steers packets similarly to multi-queue, except that packets are randomly sprayed across queues irrespective of their flow, so horizontal parallelism guarantees uniform distribution across queues. The downside is that we no longer have the advantages of cache locality for flow-related data structures. To address that, we took a closer look at exactly what TCP’s critical region processing requires (i.e., the part of the code that MUST run serialized). While this sounds like it should be super complex, it actually isn’t; we just needed to wind the clock back to 1993 and revisit Van Jacobson's famous email about processing TCP packets in thirty instructions (modern stacks may have a few additional bells and whistles, but things haven’t fundamentally changed). It turns out that the core of that processing is a critical region that modifies the Protocol Control Block (PCB) in just a few instructions. So instead of having each CPU acquire the PCB lock, load the PCB, update the PCB, and unlock, we propose moving that core processing to a “PCB accelerator” that does nothing but manipulate the PCB. The accelerator is simple enough to implement in hardware, but it could also be a CPU running in a tight loop. Communicating with the accelerator can be achieved via an IPC with a simple request/response paradigm.
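As a rough sketch of the split (with hypothetical names throughout, not the actual SiPanda interface), the worker does everything that can run in parallel, and only the few-instruction PCB update is shipped to the accelerator over a request/response hop:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct pcb {			/* toy Protocol Control Block */
	uint32_t rcv_nxt;	/* next sequence number expected */
	uint32_t snd_una;	/* oldest unacknowledged sequence */
};

struct pcb_req {
	uint32_t seq;		/* sequence number of arriving segment */
	uint32_t len;		/* payload length */
	uint32_t ack;		/* ACK number carried by the segment */
	_Atomic bool done;	/* accelerator sets this when finished */
	bool in_order;		/* response: was the segment in order? */
};

/* The accelerator: a hardware block or a CPU in a tight loop. It is
 * the sole writer of the PCB, so no lock is needed; the serialized
 * section shrinks to these few instructions per request. */
void pcb_accel_handle(struct pcb *pcb, struct pcb_req *req)
{
	/* Van Jacobson-style fast path. */
	req->in_order = (req->seq == pcb->rcv_nxt);
	if (req->in_order)
		pcb->rcv_nxt += req->len;
	if ((int32_t)(req->ack - pcb->snd_una) > 0)
		pcb->snd_una = req->ack;
	atomic_store_explicit(&req->done, true, memory_order_release);
}

/* A worker thread on any queue: submit the request, then wait for
 * the response (a real design would batch requests or poll). */
bool worker_process_segment(struct pcb *pcb, uint32_t seq,
			    uint32_t len, uint32_t ack)
{
	struct pcb_req req = { .seq = seq, .len = len, .ack = ack };

	pcb_accel_handle(pcb, &req);	/* stands in for the IPC hop */
	while (!atomic_load_explicit(&req.done, memory_order_acquire))
		;			/* spin for the response */
	return req.in_order;
}
```

Since the accelerator serializes all PCB updates itself, the workers never contend on a lock; any thread on any queue can process any packet of the flow.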


Performance improvement

So with horizontal parallelism we can achieve high throughput regardless of the composition of received packets. The graph below shows this: it compares Linux multi-queue to horizontal parallelism. For fun, we also compare against “optimized” multi-queue with steering by a hash algorithm or round robin (i.e., the first packet goes to the first queue, the second goes to the second, and so on). In this example, we assume there are 128 queues and each queue can process 7.8 million packets per second.

Maximum achievable throughput (pps) of TCP for different numbers of connections. Four methods are compared: Linux with multi-queue, our solution with critical region accelerators, optimized multi-queue with hash steering, and optimized multi-queue with round-robin steering. We assume 128 threads (or queues) for each method, where a thread can process 7.8 million packets per second, for a maximum throughput of one billion pps.
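As a rough sanity check on the hash-steering curve, here is a small balls-in-bins model in C using the same assumptions (128 queues at 7.8 Mpps each): n flows hashed uniformly occupy an expected q·(1 − (1 − 1/q)^n) distinct queues, which caps hash-steered throughput, while random spraying always uses all 128 queues. This is an illustrative model, not the data behind the actual graph.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double q = 128.0;		/* receive queues */
	const double rate = 7.8e6;	/* packets per second per queue */

	for (int n = 1; n <= 1024; n *= 2) {
		/* Expected number of queues hit by n hashed flows. */
		double occupied = q * (1.0 - pow(1.0 - 1.0 / q, n));

		printf("%4d flows: hash <= %6.1f Mpps, spray = %6.1f Mpps\n",
		       n, occupied * rate / 1e6, q * rate / 1e6);
	}
	return 0;
}
```

At one flow the hash-steered bound is a single queue’s 7.8 Mpps; spraying delivers the full billion pps at any flow count, which is the gap the graph illustrates.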


While processing packets or data objects in parallel is good, it would also be interesting if we could process the layers of a single packet in parallel. For a TCP/IP packet, for instance, we’d like to process the Ethernet, IP, TCP, and TCP options layers in parallel. That’s called vertical parallelism, and it’s the subject of our next blog.


SiPanda

SiPanda was created to rethink the network datapath and bring both flexibility and wire-speed performance at scale to networking infrastructure. The SiPanda architecture enables data center infrastructure operators and application architects to build solutions, from cloud service providers to edge compute (5G), that don’t require the compromises inherent in today’s network solutions. For more information, please visit www.sipanda.io. If you want to find out more about PANDA, you can email us at panda@sipanda.io. IP described here is covered by patent USPTO 12,026,546 and other patents pending.


