4.11 Our Architecture (V9.A/V9.D) — Brief ☧

8 parallel lanes, bitmask AND units, hierarchical domain storage, HBM access.

*"And the LORD spake unto Moses, saying, Take the sum of all the

congregation of the children of Israel, by their armies."*

— Numbers 1:1-2



Q: We have learned about LUTs, clocks, timing, memory, AXI, pipelining, and resources. How do all these pieces come together into one working system? What does our actual FPGA design look like?

A: Think of it like an army. When Israel went to war against the Midianites (Numbers 31), the army was organized into divisions of a thousand, each with its captain. The captains received orders from Moses, dispatched soldiers to fight, and reported results back.

Our design follows the same pattern:

  • Moses = the host CPU, which issues intersection tasks
  • The captain = the M6B scheduler, which receives commands and assigns work
  • The soldiers = 8 compute lanes, each an independent circuit performing bitmask intersections in parallel
  • The messenger returning to Moses = the RDQ (Result Delivery Queue), which collects results in order for the host

Q: Can you show me the overall layout?

A: Here is the block diagram. Follow it from top (host CPU) down to bottom (HBM memory):

Host CPU (x86)
   | PCIe BAR0 (registers) + BAR4 (HBM data)
   v
+----------------------------------------------+
|  AWS Shell (PCIe, clocks, HBM controller)    |
|----------------------------------------------|
|  Our Custom Logic (CL)                       |
|  +----------+  +----------+                  |
|  | OCL Regs |  | M6B      |                  |
|  | (AXI-L)  |--| Scheduler|                  |
|  +----------+  +----+-----+                  |
|                +----|----+                   |
|              Lane0 Lane1 ... Lane7           |
|                +----|----+                   |
|                     v                        |
|              +----------+                    |
|              | RDQ (Ret |                    |
|              |  Data Q) |                    |
|              +----------+                    |
|                     v                        |
|              +----------+                    |
|              | HBM AXI4 |                    |
|              +----------+                    |
+----------------------------------------------+

The OCL Regs block receives commands from the CPU via AXI-Lite (small register reads/writes through BAR0). The M6B scheduler dispatches intersection tasks to the lanes. Each lane fetches its domain data from HBM through AXI4. Results flow back through the RDQ to the host.
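
Here is a rough host-side sketch of what "small register reads/writes through BAR0" look like in practice. It is illustrative only: the device path, window size, and the REG_CMD/REG_STATUS offsets are hypothetical stand-ins for the real OCL register map (on AWS F1 you would normally go through the fpga_pci library rather than raw sysfs).

```c
/* Hypothetical host-side access to the OCL registers over PCIe BAR0.
 * Offsets and paths are illustrative, not our actual register map. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_SIZE  0x1000  /* assumed size of the AXI-Lite window   */
#define REG_CMD    0x00    /* hypothetical "start intersection" reg */
#define REG_STATUS 0x04    /* hypothetical status/done register     */

int main(void) {
    int fd = open("/sys/bus/pci/devices/0000:00:1d.0/resource0", O_RDWR);
    if (fd < 0) { perror("open BAR0"); return 1; }

    volatile uint32_t *bar0 = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    bar0[REG_CMD / 4] = 1;                  /* kick off one task        */
    while ((bar0[REG_STATUS / 4] & 1) == 0) /* poll the done bit; each  */
        ;                                   /* read costs ~1.29 us PCIe */

    munmap((void *)bar0, BAR0_SIZE);
    close(fd);
    return 0;
}
```

Every register poll pays the 1.29 us PCIe round-trip listed in the table further down, which is exactly why the scheduler batches many tasks per command.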

Q: What does "M6B" stand for?

A: "M6B" is revision nomenclature — the sixth major scheduler

rewrite, variant B. It is our multi-lane batch engine: it issues

intersection commands to up to 8 lanes simultaneously, manages the

RDQ for in-order result retirement, and coordinates HBM data

fetching so lanes do not stall waiting for data.
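
To make "in-order result retirement" concrete, here is a small software model of the idea (a behavioral sketch, not the RTL; names like rdq_push and rdq_drain, and the queue depth, are invented for illustration). Lanes may finish out of order, but results only leave in issue order:

```c
/* Behavioral model of in-order retirement through a reorder buffer,
 * mimicking the RDQ: lanes complete in any order, results retire in
 * sequence order. Names and depth are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RDQ_DEPTH 16u                 /* assumed queue depth */

static uint32_t rdq_data[RDQ_DEPTH];
static bool     rdq_valid[RDQ_DEPTH];
static uint32_t retire_seq;           /* next result owed to the host */

/* A lane finished the task tagged 'seq' (possibly out of order). */
static void rdq_push(uint32_t seq, uint32_t result) {
    rdq_data[seq % RDQ_DEPTH]  = result;
    rdq_valid[seq % RDQ_DEPTH] = true;
}

/* Release every result that is now contiguous in sequence order. */
static void rdq_drain(void) {
    while (rdq_valid[retire_seq % RDQ_DEPTH]) {
        printf("retire seq %u -> %u\n", retire_seq,
               rdq_data[retire_seq % RDQ_DEPTH]);
        rdq_valid[retire_seq % RDQ_DEPTH] = false;
        retire_seq++;
    }
}

int main(void) {
    rdq_push(1, 111); rdq_drain();  /* seq 0 still pending: nothing retires */
    rdq_push(0, 100); rdq_drain();  /* seq 0 then seq 1 retire, in order    */
    return 0;
}
```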

Q: What is V9.D vs V9.A?

A: V9.A is the current deployed architecture. V9.D ("target arch") introduces improvements like banked indirect caches, tile-based scheduling, and improved address-lane helpers for the depth-3 hierarchical pipeline. Think of it like software versions — V9.A is the stable release, V9.D is the next version with performance optimizations.

Key Performance Numbers

These are the numbers that define what our system can do. Each one reflects design decisions we have studied throughout this module.

Metric            | Value
------------------|------------------------------------------
Clock             | 250 MHz (4 ns/cycle)
Lanes             | 8 (M6B_MAX_LANES)
Batch throughput  | 13,255 solves/sec (depth-3, 8-lane)
SemMedDB sweep    | 307K intersections/sec (950M pairs)
Register latency  | 1.29 us (PCIe round-trip)
Domain size       | up to 262,144 values (depth-3 hierarchy)
Compression       | 4,000:1 (vs. flat representation)
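
As a back-of-envelope check on how these figures fit together (plain arithmetic on the table above, not measured code):

```c
/* Derive a few sanity-check figures from the table above.
 * These are estimates computed from the published numbers. */
#include <stdio.h>

int main(void) {
    double clk_hz       = 250e6;    /* 250 MHz -> 4 ns per cycle         */
    double solves_per_s = 13255.0;  /* depth-3 batch throughput, 8 lanes */
    double reg_rtt_s    = 1.29e-6;  /* PCIe register round-trip          */

    /* Cycles between solve completions (all 8 lanes together). */
    printf("cycles per solve: %.0f\n", clk_hz / solves_per_s); /* ~18861 */

    /* One PCIe round-trip measured in FPGA clock cycles. */
    printf("PCIe RTT cycles:  %.0f\n", reg_rtt_s * clk_hz);    /* ~322 */
    return 0;
}
```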

A few highlights to note. The 250 MHz clock means every computation step completes in 4 nanoseconds -- this is the budget that timing closure must meet. The 8 lanes provide parallelism, but scaling to 16 lanes failed timing (as we saw in resources). The 1.29 us register latency is the PCIe round-trip cost -- 300x slower than the actual FPGA computation, which is why batch mode (amortizing PCIe overhead across many operations) is essential for high throughput. And the 4,000:1 compression comes from our hierarchical bitmask representation: a domain with 262,144 possible values can be stored in roughly 33 KB instead of 128 MB because the hierarchy skips over empty regions, as the sketch below illustrates.
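
Here is a sketch of that hierarchical layout and the AND-with-skipping idea. The 64-way fan-out is an illustrative choice that happens to reproduce the numbers above: 64^3 = 262,144 leaf bits, and 8 B + 512 B + 32 KB of words comes to roughly 33 KB. The struct layout and function names are invented for this sketch; the real lanes operate on HBM-resident data, not a C struct.

```c
/* Depth-3 hierarchical bitmask over up to 64^3 = 262,144 values.
 * A set bit at level k means "the subtree below is non-empty", so
 * intersection can skip whole empty regions without reading them.
 * Layout and names are illustrative, not the production format. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t l0;          /* 1 word: which of 64 L1 blocks are non-empty    */
    uint64_t l1[64];      /* 64 words: which of 64 leaf words are non-empty */
    uint64_t l2[64][64];  /* 4096 words: one bit per value (~33 KB total)   */
} domain_t;

/* out = a AND b, descending only where both sides are non-empty.
 * Returns 1 if the resulting domain is non-empty. */
int intersect(const domain_t *a, const domain_t *b, domain_t *out) {
    memset(out, 0, sizeof *out);
    uint64_t top = a->l0 & b->l0;            /* shared L1 blocks only   */
    for (uint64_t t = top; t; t &= t - 1) {
        int i = __builtin_ctzll(t);          /* next shared block index */
        uint64_t mid = a->l1[i] & b->l1[i];  /* shared leaf words only  */
        for (uint64_t m = mid; m; m &= m - 1) {
            int j = __builtin_ctzll(m);
            uint64_t leaf = a->l2[i][j] & b->l2[i][j];
            out->l2[i][j] = leaf;
            if (leaf) out->l1[i] |= 1ULL << j;  /* keep summaries honest */
        }
        if (out->l1[i]) out->l0 |= 1ULL << i;
    }
    return out->l0 != 0;
}
```

If the two domains share only a few populated regions, the loops touch only those regions' words; a flat representation would have to scan everything.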

To understand how these numbers relate to timing closure, resource usage, and the Vivado build flow, follow the links to those topics. Each number is the result of a chain of design decisions that ripple through the entire hardware stack.

Learn more in the deep version



Soli Deo Gloria

Self-Check 1/1

Our FPGA architecture uses _____ parallel lanes for domain propagation.