IOctopus: Outsmarting Nonuniform DMA

Session: Smart peripherals--Outside the box.

Authors: Igor Smolyar (Technion--Israel Institute of Technology and VMware Research); Alex Markuze (Technion--Israel Institute of Technology); Boris Pismenny (Technion--Israel Institute of Technology and Mellanox); Haggai Eran (Technion--Israel Institute of Technology and Mellanox); Gerd Zellweger (VMware Research); Austin Bolen (Dell); Liran Liss (Mellanox Technologies); Adam Morrison (Tel Aviv University); Dan Tsafrir (Technion--Israel Institute of Technology and VMware Research)

In a multi-CPU server, memory modules are local to the CPU to which they are connected, forming a nonuniform memory access (NUMA) architecture. Because non-local accesses are slower than local accesses, the NUMA architecture might degrade application performance. Similar slowdowns occur when an I/O device issues nonuniform DMA (NUDMA) operations, as the device is connected to memory via a single CPU. NUDMA effects therefore degrade application performance similarly to NUMA effects. We observe that the similarity is not inherent but rather a product of disregarding the intrinsic differences between I/O and CPU memory accesses. Whereas NUMA effects are inevitable, we show that NUDMA effects can and should be eliminated. We present IOctopus, a device architecture that makes NUDMA impossible by unifying multiple physical PCIe functions--one per CPU--in a manner that makes them appear as one, both to the system software and externally to the server. IOctopus requires only a modest change to the device driver and firmware. We implement it on existing hardware and demonstrate that it improves throughput and latency by as much as $2.7\times$ and $1.28\times$, respectively, while ridding developers of the need to combat (what appeared to be) an unavoidable type of overhead.