PCIe and DMA — Brief ☧
*"And king Solomon made a navy of ships in Ezion-geber... and Hiram
sent in the navy his servants, shipmen that had knowledge of the sea,
with the servants of Solomon. And they came to Ophir, and fetched
from thence gold, four hundred and twenty talents, and brought it
to king Solomon."*
— 1 Kings 9:26-28
Q: The FPGA is a separate chip sitting on a board. The host CPU
is on the motherboard. How do they actually talk to each other?
A: Through PCIe — Peripheral Component Interconnect Express.
Think of it as a multi-lane highway connecting two cities (the CPU and
the FPGA). More lanes means more bandwidth, just like adding lanes to
a freeway reduces congestion. Our connection uses multiple PCIe lanes
to carry both small control messages and bulk data transfers.
Q: Solomon had two kinds of trade: messengers carrying orders
(fast, small) and ships carrying gold in bulk (slower, massive). Does
PCIe have something similar?
A: Exactly. The CPU sees the FPGA through **BARs (Base Address
Registers)** — windows into the FPGA's memory and registers that
appear as regular addresses in the CPU's address space. Different
BARs serve different purposes:

| BAR  | Maps To         | Protocol | Use in Our Design                                   |
|------|-----------------|----------|-----------------------------------------------------|
| BAR0 | OCL registers   | AXI-Lite | Configuration, status, commands (the "messengers")  |
| BAR2 | Shell registers | AXI-Lite | AWS shell management                                |
| BAR4 | HBM memory      | AXI4     | Domain data upload/download (the "gold ships")      |

BAR0 is for quick, small register reads and writes — like checking a
status flag or issuing a command. BAR4 is for moving large blocks of
domain data into HBM (see Memory).
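For a concrete picture of the "messenger" path, here is a minimal C sketch,
assuming the BAR is exposed through Linux's sysfs resource files; the PCI
slot, window size, and register offsets are illustrative placeholders, not
values from our design.

```c
/*
 * Sketch only: map BAR0 through the sysfs resource file and poke a
 * register. Slot, window size, and offsets are hypothetical.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_PATH   "/sys/bus/pci/devices/0000:00:1d.0/resource0"  /* hypothetical slot */
#define BAR0_SIZE   (64 * 1024)                                    /* assumed window size */
#define REG_STATUS  0x00                                           /* hypothetical offset */
#define REG_COMMAND 0x04                                           /* hypothetical offset */

int main(void) {
    int fd = open(BAR0_PATH, O_RDWR);
    if (fd < 0) { perror("open BAR0"); return 1; }

    /* Map the BAR so register accesses become ordinary loads and stores. */
    void *map = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap BAR0"); close(fd); return 1; }
    volatile uint32_t *regs = (volatile uint32_t *)map;

    uint32_t status = regs[REG_STATUS / 4];   /* small read: a "messenger" */
    printf("status = 0x%08x\n", status);
    regs[REG_COMMAND / 4] = 0x1;              /* small write: issue a command */

    munmap(map, BAR0_SIZE);
    close(fd);
    return 0;
}
```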
Q: What is DMA, and why does it matter for bulk transfers?
A: DMA (Direct Memory Access) means moving data in bulk
without the CPU shepherding every single byte. In our case, the CPU
writes to BAR4 (mapped to HBM) using `mmap()` — the operating system
maps the FPGA's memory into the CPU's address space so it looks like
a regular array in memory. The CPU just writes to it and the data
flows over PCIe to HBM.
After writing, the CPU uses `SFENCE` (a memory fence instruction) to
ensure its write buffer is flushed to PCIe. Without the fence, writes
might sit in the CPU's buffer and never reach the FPGA in time.
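A minimal C sketch of the "gold ship" path under the same sysfs
assumption: map BAR4, copy a block of domain data into it, then fence.
The path, slot, and helper name are hypothetical.

```c
/*
 * Sketch only: map BAR4 (the HBM window), stream a block of domain data
 * into it, then fence so the CPU's write buffers drain onto PCIe.
 * Path, slot, and function name are hypothetical.
 */
#include <fcntl.h>
#include <immintrin.h>   /* _mm_sfence */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR4_PATH "/sys/bus/pci/devices/0000:00:1d.0/resource4"  /* hypothetical slot */

int upload_block(const uint8_t *src, size_t len, off_t hbm_offset) {
    int fd = open(BAR4_PATH, O_RDWR);          /* note: no O_SYNC, see next question */
    if (fd < 0) { perror("open BAR4"); return -1; }

    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, hbm_offset);
    if (map == MAP_FAILED) { perror("mmap BAR4"); close(fd); return -1; }

    memcpy(map, src, len);   /* posted writes stream over PCIe into HBM */
    _mm_sfence();            /* flush the write buffer before signalling the FPGA */

    munmap(map, len);
    close(fd);
    return 0;
}
```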
Q: The codebase mentions that `O_SYNC` caused a 90x slowdown.
What happened there?
A: `O_SYNC` forces every single write to make a full PCIe
round-trip before the CPU can issue the next one — like a messenger
who waits for a signed receipt before sending the next letter. This
yielded only 8,390 operations/sec at 119 microseconds each.
Without `O_SYNC`, writes are posted — fire-and-forget, like
dropping letters in a mailbox without waiting for confirmation. This
achieved 776K operations/sec at 1.29 microseconds each. Solomon's
ships did not wait in port for a reply before the next voyage.
This is a classic throughput-vs-latency tradeoff: each individual write
still takes about the same time to cross PCIe, but because posted writes
can overlap in flight (like filling a queue), the CPU never stands idle
and total throughput skyrockets.
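As an illustration only, the flag in question amounts to a one-line
difference when opening the device. The helper names and path are
hypothetical; the numbers in the comments are the measurements quoted
above.

```c
/*
 * Sketch only: the single flag behind the 90x difference described above.
 */
#include <fcntl.h>

/* Every write waits for its PCIe round-trip: ~8,390 ops/sec, ~119 us each. */
int open_bar4_synchronous(const char *path) {
    return open(path, O_RDWR | O_SYNC);
}

/* Writes are posted (fire-and-forget): ~776K ops/sec, ~1.29 us each. */
int open_bar4_posted(const char *path) {
    return open(path, O_RDWR);
}
```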
Our PCI Device ID
Vendor ID: 0x1D0F (Amazon)
Device ID: 0xF016 (F0xx valid range + John 3:16)
The Device ID must be in range 0xF000-0xF0FF for Amazon's vendor ID.
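A small C sketch, assuming the standard Linux sysfs layout, for checking
that the card enumerated with these IDs; the PCI slot below is a
placeholder.

```c
/*
 * Sketch only: read the sysfs vendor/device files and confirm the card
 * shows up as 0x1d0f (Amazon) / 0xf016. The PCI slot is hypothetical.
 */
#include <stdio.h>

int main(void) {
    const char *dev = "/sys/bus/pci/devices/0000:00:1d.0";  /* hypothetical slot */
    const char *files[] = { "vendor", "device" };
    char path[128], line[32];

    for (int i = 0; i < 2; i++) {
        snprintf(path, sizeof path, "%s/%s", dev, files[i]);
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        if (fgets(line, sizeof line, f))
            printf("%s: %s", files[i], line);  /* sysfs prints e.g. "0x1d0f\n" */
        fclose(f);
    }
    return 0;
}
```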
Soli Deo Gloria