Memory Architecture — Brief ☧
Q: "Lay not up for yourselves treasures upon earth, where moth and
rust doth corrupt... but lay up for yourselves treasures in heaven"
(Matthew 6:19-20). In FPGA design, where do we lay up our data?
A: It depends on two things: how fast we need the data and how
much of it we have. Think about your own life — you keep your
phone in your pocket (tiny, instant access), important papers in a
desk drawer (more room, a few seconds to fetch), and old files in a
storage unit across town (huge capacity, takes a trip to retrieve).
Memory in an FPGA works the same way: it forms a hierarchy from
tiny-and-fast at the top to huge-and-slow at the bottom.
Q: What do the layers look like concretely?
A: Like the Temple with its nested courts — the Holy of Holies at
the center (most precious, smallest) out to the surrounding camps:
| Memory     | Capacity        | Latency            | Analogy                          |
|------------|-----------------|--------------------|----------------------------------|
| Flip-flops | Individual bits | 0 cycles (instant) | Your pocket — always right there |
| BRAM       | 36 Kbit blocks  | 1-2 cycles         | Your desk drawer                 |
| URAM       | 288 Kbit blocks | 1-2 cycles         | A filing cabinet nearby          |
| HBM        | 8 GB            | 80-100 cycles      | A warehouse in your building     |
| DDR        | 16-64 GB        | 100-200 cycles     | The storage unit across town     |

Notice the pattern: as capacity grows, so does latency. This is just
like the time-complexity tradeoffs you see in data structures — a tiny
array is instant to scan, while a hash table scales to far more data
at the cost of hashing and indirection on every access.
Q: Which layers does our design actually use?
A: Our domain data (the bitmask hierarchies) lives in
HBM — 8 GB of High Bandwidth Memory with ~460 GB/s bandwidth.
Each domain is a depth-3 hierarchy of increasingly detailed bitmasks:
L0 (8 bytes) + L1 (512 bytes) + L2 (32,768 bytes) = roughly 33 KB
per domain. HBM stores thousands of domains; BRAM caches the ones
currently being processed by each lane — like moving a file from the
warehouse to your desk while you work on it.
Our HBM Layout
HBM Address Space (8 GB)
├── Slot 0: Domain A (stride = 2,105,376 bytes)
├── Slot 1: Domain B
├── ...
└── Slot 255: Domain 255
Each slot reserves a fixed stride so that address calculation is simple
— the hardware computes base + slot * stride without needing a lookup
table, which keeps the algorithm running in
constant time per domain access.
Soli Deo Gloria