Our Architecture (V9.A/V9.D) — Brief ☧
*"And the LORD spake unto Moses, saying, Take the sum of all the
congregation of the children of Israel, by their armies."*
— Numbers 1:1-2
Q: We have learned about LUTs, clocks, timing, memory, AXI,
pipelining, and resources. How do all these pieces come together into
one working system? What does our actual FPGA design look like?
A: Think of it like an army. When Israel went to war against the
Midianites (Numbers 31), the army was organized into divisions of a
thousand, each with its captain. The captains received orders from
Moses, dispatched soldiers to fight, and reported results back.
Our design follows the same pattern:
- Moses = the host CPU, which issues intersection tasks
- The captain = the M6B scheduler, which receives commands and assigns work
- The soldiers = 8 compute lanes, each an independent circuit performing bitmask intersections in parallel
- The messenger returning to Moses = the RDQ (Result Delivery Queue), which collects results in order for the host
Q: Can you show me the overall layout?
A: Here is the block diagram. Follow it from top (host CPU) down
to bottom (HBM memory):
```
            Host CPU (x86)
                  |
   PCIe BAR0 (registers) + BAR4 (HBM data)
                  v
+----------------------------------------------+
| AWS Shell (PCIe, clocks, HBM controller)     |
|----------------------------------------------|
| Our Custom Logic (CL)                        |
|   +----------+      +----------+             |
|   | OCL Regs |      |   M6B    |             |
|   | (AXI-L)  |------| Scheduler|             |
|   +----------+      +----+-----+             |
|                    +-----|-----+             |
|              Lane0 Lane1 ... Lane7           |
|                    +-----|-----+             |
|                          v                   |
|                    +----------+              |
|                    | RDQ (Ret |              |
|                    | Data Q)  |              |
|                    +----------+              |
|                          v                   |
|                    +----------+              |
|                    | HBM AXI4 |              |
|                    +----------+              |
+----------------------------------------------+
```

The OCL Regs block receives commands from the CPU via
AXI-Lite (small register reads/writes through BAR0).
The M6B scheduler dispatches intersection tasks to the lanes.
Each lane fetches its domain data from HBM
through AXI4. Results flow back through the RDQ to the host.
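To make the BAR0 control path concrete, here is a minimal host-side sketch in Python. The register offsets (`REG_CMD`, `REG_STATUS`) and the sysfs resource path are hypothetical placeholders, not our actual register map; on AWS F1 the production flow uses the `fpga_pci` library, but mmap-ing a PCIe BAR resource file illustrates the same AXI-Lite small-register mechanism.

```python
import mmap
import os
import struct

# Hypothetical register offsets -- placeholders, NOT the real OCL map.
REG_CMD = 0x00      # write an intersection command word here
REG_STATUS = 0x04   # poll until the scheduler reports done

def open_bar0(resource_path):
    """mmap a PCIe BAR exposed by Linux sysfs (e.g. .../resource0)."""
    fd = os.open(resource_path, os.O_RDWR)
    size = os.fstat(fd).st_size
    return mmap.mmap(fd, size)

def write_reg(bar, offset, value):
    """32-bit little-endian register write through the mapped BAR."""
    bar[offset:offset + 4] = struct.pack("<I", value)

def read_reg(bar, offset):
    """32-bit little-endian register read through the mapped BAR."""
    return struct.unpack("<I", bar[offset:offset + 4])[0]
```

The same read/write helpers work against any mutable byte buffer, which makes the register protocol easy to test without hardware.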
Q: What does "M6B" stand for?
A: "M6B" is revision nomenclature — the sixth major scheduler
rewrite, variant B. It is our multi-lane batch engine: it issues
intersection commands to up to 8 lanes simultaneously, manages the
RDQ for in-order result retirement, and coordinates HBM data
fetching so lanes do not stall waiting for data.
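The scheduler's contract (dispatch across up to 8 lanes, retire results strictly in order) can be sketched as a behavioral model. This is a sketch for intuition only, not the RTL: lane compute is stubbed out as a bitmask AND, completion order is randomized to stand in for lanes finishing at different times, and the RDQ is modeled as a reorder buffer keyed by task ID.

```python
import random

M6B_MAX_LANES = 8

def m6b_run(tasks, seed=0):
    """Behavioral model of M6B dispatch plus in-order RDQ retirement.
    `tasks` is a list of (bitmask_a, bitmask_b) pairs."""
    rng = random.Random(seed)
    # Dispatch: tag each task with an ID and a lane (round-robin).
    inflight = [(tid, tid % M6B_MAX_LANES, a & b)
                for tid, (a, b) in enumerate(tasks)]
    rng.shuffle(inflight)              # lanes complete out of order
    rdq, retired, head = {}, [], 0
    for tid, lane, result in inflight: # completions arrive shuffled
        rdq[tid] = result              # park in the reorder buffer
        while head in rdq:             # retire strictly in task order
            retired.append(rdq.pop(head))
            head += 1
    return retired
```

However the completions are shuffled, the returned list is always in task order, which is exactly the in-order retirement guarantee the RDQ provides to the host.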
Q: What is V9.D vs V9.A?
A: V9.A is the current deployed architecture. V9.D ("target arch")
introduces improvements like banked indirect caches, tile-based
scheduling, and improved address-lane helpers for the depth-3
hierarchical pipeline. Think of it like
software versions — V9.A is the stable release, V9.D is the next
version with performance optimizations.
Key Performance Numbers
These are the numbers that define what our system can do. Each one reflects
design decisions we have studied throughout this module.
| Metric | Value |
|---|---|
| Clock | 250 MHz (4 ns/cycle) |
| Lanes | 8 (M6B_MAX_LANES) |
| Batch throughput | 13,255 solves/sec (depth-3, 8-lane) |
| SemMedDB sweep | 307K intersections/sec (950M pairs) |
| Register latency | 1.29 us (PCIe round-trip) |
| Domain size | up to 262,144 values (depth-3 hierarchy) |
| Compression | 4,000:1 (vs. flat representation) |
A few highlights to note. The 250 MHz clock means every computation
step completes in 4 nanoseconds -- this is the budget that
timing closure must meet. The 8 lanes
provide parallelism, but scaling to 16 lanes failed timing (as we saw in
the resources module). The 1.29 us register latency
is the PCIe round-trip cost -- 300x slower than the actual FPGA computation,
which is why batch mode (amortizing PCIe overhead across many operations)
is essential for high throughput. And the 4,000:1 compression comes
from our hierarchical bitmask representation: a domain with 262,144 possible
values can be stored in roughly 33 KB instead of 128 MB because the
hierarchy skips over empty regions.
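The 33 KB figure is consistent with one plausible layout, assumed here for illustration rather than taken from the spec: a 64-ary, depth-3 hierarchy of 64-bit bitmask words, where each set bit at one level marks a non-empty child word at the next. Fully populated, that is one root word, 64 level-1 words, and 64 x 64 leaf words covering 64^3 = 262,144 values:

```python
WORD_BITS = 64
WORD_BYTES = WORD_BITS // 8

# Depth-3, 64-ary hierarchy: 1 root word, 64 level-1 words,
# 64*64 leaf words (assumed layout, for illustration).
levels = [1, 64, 64 * 64]
domain = WORD_BITS ** 3            # addressable values
worst_case_bytes = sum(levels) * WORD_BYTES

print(domain)                      # 262144
print(worst_case_bytes)            # 33288 bytes, i.e. roughly 33 KB

# A sparse domain skips empty subtrees entirely: if only one leaf
# word is non-empty, storage drops to 3 words = 24 bytes.
sparse_bytes = 3 * WORD_BYTES
```

The worst case is about 33 KB no matter how full the domain is, and sparsity only shrinks it from there, which is where the 4,000:1 ratio against a flat 128 MB representation comes from.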
To understand how these numbers relate to the
Vivado build flow, follow the links to those
topics. Each number is the result of a chain of design decisions that
ripple through the entire hardware stack.
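The case for batch mode can be checked with back-of-the-envelope arithmetic. The per-operation compute time below (about 4.3 ns) is derived from the "300x slower" claim above, not measured independently, and the batch sizes are illustrative:

```python
PCIE_RT_US = 1.29          # measured PCIe register round-trip
COMPUTE_NS = 4.3           # ~1.29 us / 300, per the 300x claim

def throughput_ops_per_sec(batch_size):
    """Ops/sec when a single PCIe round-trip covers a whole batch."""
    total_us = PCIE_RT_US + batch_size * COMPUTE_NS / 1000.0
    return batch_size / (total_us * 1e-6)

# One op per round-trip: PCIe overhead dominates completely.
single = throughput_ops_per_sec(1)        # ~773K ops/sec
# 10,000 ops per round-trip: overhead amortized to near zero.
batched = throughput_ops_per_sec(10_000)  # ~226M ops/sec
```

Amortizing one round-trip over ten thousand operations buys nearly three orders of magnitude of throughput, which is why the batch path, not register-at-a-time access, is the workhorse interface.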
Soli Deo Gloria