4.10 PCIe and DMA — Brief ☧

Host-to-FPGA transfers, DMA engines, bursting.

*"And king Solomon made a navy of ships in Ezion-geber... and Hiram

sent in the navy his servants, shipmen that had knowledge of the sea,

with the servants of Solomon. And they came to Ophir, and fetched

from thence gold, four hundred and twenty talents, and brought it

to king Solomon."*

— 1 Kings 9:26-28



Q: The FPGA is a separate chip sitting on a board. The host CPU is on the motherboard. How do they actually talk to each other?

A: Through PCIe — Peripheral Component Interconnect Express. Think of it as a multi-lane highway connecting two cities (the CPU and the FPGA). More lanes means more bandwidth, just like adding lanes to a freeway reduces congestion. Our connection uses multiple PCIe lanes to carry both small control messages and bulk data transfers.
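(As a generic point of reference, not a spec of our particular link: a PCIe Gen3 x16 connection runs at 8 GT/s per lane with 128b/130b encoding, so 8 GT/s × 16 lanes × 128/130 ≈ 126 Gbit/s ≈ 15.8 GB/s of raw bandwidth in each direction; real transfers see somewhat less after packet overhead.)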

Q: Solomon had two kinds of trade: messengers carrying orders (fast, small) and ships carrying gold in bulk (slower, massive). Does PCIe have something similar?

A: Exactly. The CPU sees the FPGA through **BARs (Base Address Registers)** — windows into the FPGA's memory and registers that appear as regular addresses in the CPU's address space. Different BARs serve different purposes:

| BAR  | Maps To         | Protocol | Use in Our Design                                   |
|------|-----------------|----------|-----------------------------------------------------|
| BAR0 | OCL registers   | AXI-Lite | Configuration, status, commands (the "messengers")  |
| BAR2 | Shell registers | AXI-Lite | AWS shell management                                |
| BAR4 | HBM memory      | AXI4     | Domain data upload/download (the "gold ships")      |

BAR0 is for quick, small register reads and writes — like checking a status flag or issuing a command. BAR4 is for moving large blocks of domain data into HBM (see Memory).

Q: What is DMA, and why does it matter for bulk transfers?

A: DMA (Direct Memory Access) means moving data in bulk without the CPU shepherding every single byte. In our case, the CPU writes to BAR4 (mapped to HBM) using mmap() — the operating system maps the FPGA's memory into the CPU's address space so it looks like a regular array in memory. The CPU just writes to it and the data flows over PCIe to HBM.
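
As a rough illustration (not our actual host code; the sysfs path, mapping size, and buffer below are placeholders), mapping a BAR with mmap() and storing into it can look like this:

```c
/* Sketch only: map BAR4 of a hypothetical PCIe function into user space
 * and copy a buffer into it. Path and sizes are placeholders. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR4_MAP_SIZE (64UL * 1024 * 1024)   /* placeholder window size */

int main(void)
{
    /* Hypothetical sysfs resource file for BAR4. */
    int fd = open("/sys/bus/pci/devices/0000:00:1d.0/resource4", O_RDWR);
    if (fd < 0)
        return 1;

    /* After mmap(), stores to 'hbm' look like ordinary array writes to
     * the CPU, but travel over PCIe into HBM. */
    volatile uint8_t *hbm = mmap(NULL, BAR4_MAP_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (hbm == MAP_FAILED) {
        close(fd);
        return 1;
    }

    uint8_t domain_data[4096] = {0};         /* stand-in for real payload */
    for (size_t i = 0; i < sizeof domain_data; i++)
        hbm[i] = domain_data[i];             /* posted writes toward HBM  */

    munmap((void *)hbm, BAR4_MAP_SIZE);
    close(fd);
    return 0;
}
```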

After writing, the CPU uses SFENCE (a memory fence instruction) to ensure its write buffer is flushed to PCIe. Without the fence, writes might sit in the CPU's buffer and never reach the FPGA in time.
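
A minimal sketch of that fence (x86-specific; the helper name and arguments here are made up for illustration, while _mm_sfence() is the standard GCC/Clang/MSVC intrinsic that emits SFENCE):

```c
/* Sketch: push a block through a mapped BAR, then fence so the stores
 * are not left sitting in the CPU's write/combining buffers. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void push_block(volatile uint8_t *bar4, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        bar4[i] = src[i];    /* posted writes, possibly buffered */

    _mm_sfence();            /* SFENCE: order/flush stores toward PCIe */
}
```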

Q: The codebase mentions that O_SYNC caused a 90x slowdown. What happened there?

A: O_SYNC forces every single write to make a full PCIe round-trip before the CPU can issue the next one — like a messenger who waits for a signed receipt before sending the next letter. This yielded only 8,390 operations/sec at 119 microseconds each.

Without O_SYNC, writes are posted — fire-and-forget, like dropping letters in a mailbox without waiting for confirmation. This achieved 776K operations/sec at 1.29 microseconds each. Solomon's ships did not wait in port for a reply before the next voyage.

This is a classic throughput-vs-latency tradeoff: the individual write latency is the same, but by overlapping many writes (like filling a queue), total throughput skyrockets.
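
A sketch of how one might reproduce that comparison (the device path, mapping size, and iteration count are placeholders, and exactly how O_SYNC is honored depends on the driver backing the mapping):

```c
/* Sketch: time many small writes through a BAR mapping opened with and
 * without O_SYNC, and report ops/sec. All paths/sizes are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void bench(const char *label, int flags)
{
    /* Hypothetical sysfs resource file for a register BAR. */
    int fd = open("/sys/bus/pci/devices/0000:00:1d.0/resource0", flags);
    if (fd < 0)
        return;

    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) {
        close(fd);
        return;
    }

    enum { N = 100000 };
    double t0 = now_sec();
    for (int i = 0; i < N; i++)
        regs[0] = (uint32_t)i;             /* one 32-bit PCIe write each */
    double dt = now_sec() - t0;

    printf("%-8s %10.0f ops/sec  (%.2f us/write)\n",
           label, N / dt, dt / N * 1e6);

    munmap((void *)regs, 4096);
    close(fd);
}

int main(void)
{
    bench("O_SYNC", O_RDWR | O_SYNC);      /* wait for each write */
    bench("posted", O_RDWR);               /* fire-and-forget     */
    return 0;
}
```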

Our PCI Device ID

Vendor ID:    0x1D0F  (Amazon)
Device ID:    0xF016  (F0xx valid range + John 3:16)

The Device ID must be in range 0xF000-0xF0FF for Amazon's vendor ID.
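
As an illustrative check (the device path below is a placeholder), the host can read those IDs back from sysfs and sanity-check them against the expected values:

```c
/* Sketch: read the PCIe vendor/device IDs from sysfs and check them
 * against the expected values. The device path is a placeholder. */
#include <stdio.h>

static unsigned read_hex(const char *path)
{
    unsigned v = 0;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%x", &v) != 1)
            v = 0;
        fclose(f);
    }
    return v;
}

int main(void)
{
    unsigned vendor = read_hex("/sys/bus/pci/devices/0000:00:1d.0/vendor");
    unsigned device = read_hex("/sys/bus/pci/devices/0000:00:1d.0/device");

    int ok = (vendor == 0x1D0F) &&                   /* Amazon            */
             (device >= 0xF000 && device <= 0xF0FF); /* valid F0xx range  */

    printf("vendor=0x%04X device=0x%04X -> %s\n",
           vendor, device, ok ? "as expected" : "unexpected");
    return ok ? 0 : 1;
}
```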

Learn more in the deep version

Related: AXI Protocol | Memory | Our Architecture


Soli Deo Gloria

Self-Check 1/1

DMA allows data transfer without: