Resource Budgeting — Brief ☧
"So Joshua took the whole land, according to all that the LORD said unto Moses; and Joshua gave it for an inheritance unto Israel according to their divisions by their tribes."
— Joshua 11:23
Q: When Joshua divided the Promised Land among the twelve tribes,
each tribe received territory according to its size and needs. Judah
received the largest region; Simeon was embedded within Judah's
territory. If Joshua gave too much to one tribe, another would be left
without enough. How is this like FPGA design?
A: An FPGA has a fixed budget of resources — LUTs, flip-flops,
BRAM blocks, DSP slices — spread across its silicon. You must divide
them among the parts of your design just as Joshua divided territory
among the tribes. Think of it like a monthly household budget: you
have a set amount of income, and you allocate portions to rent, food,
savings, and so on. Overspend in one category and another suffers.
Q: What resources are we budgeting exactly?
A: Five main types, each serving a different role:
| Resource | What It Does               | Total on Chip | Per Lane (~) | 8 Lanes (~) | Budget (%) |
|----------|----------------------------|---------------|--------------|-------------|------------|
| LUTs     | Logic computation          | ~1.1M         | ~50K         | ~400K       | ~36%       |
| FFs      | 1-bit storage (flip-flops) | ~2.2M         | ~30K         | ~240K       | ~11%       |
| BRAM     | 36 Kbit on-chip memory     | ~2,000        | ~40          | ~320        | ~16%       |
| URAM     | 288 Kbit dense memory      | ~960          | ~4           | ~32         | ~3%        |
| DSP      | Multiply-accumulate units  | ~9,000        | ~2           | ~16         | <1%        |

This is just like counting how much space an array or hash table
consumes in software: you would not allocate a million-entry table when
you only need a thousand. On an FPGA, wasted resources mean wasted
silicon, and overused resources mean a design that cannot be built.
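To make the table's arithmetic concrete, here is a minimal Python
sketch; the chip totals and per-lane costs are the approximate figures
quoted above, not exact device counts.

```python
# Budget arithmetic from the table above (approximate figures).
CHIP_TOTALS = {"LUT": 1_100_000, "FF": 2_200_000, "BRAM": 2_000, "URAM": 960, "DSP": 9_000}
PER_LANE    = {"LUT": 50_000,    "FF": 30_000,    "BRAM": 40,     "URAM": 4,   "DSP": 2}

def utilization(lanes: int) -> dict[str, float]:
    """Fraction of each chip resource consumed by `lanes` compute lanes."""
    return {r: lanes * PER_LANE[r] / CHIP_TOTALS[r] for r in CHIP_TOTALS}

for resource, frac in utilization(8).items():
    print(f"{resource:>4}: {frac:6.1%}")
# LUT: 36.4%, FF: 10.9%, BRAM: 16.0%, URAM: 3.3%, DSP: 0.2%
```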
Q: What is the danger zone? When do things break?
A: Above roughly 80% utilization, timing closure
becomes extremely difficult. The place-and-route tool needs room to
maneuver — to find short wiring paths through the chip. When the chip
is congested, wires must take long detours, delays increase, and paths
miss their timing budget. Our 8-lane design stays well under 80%
total utilization, which is why it closes timing with a worst negative
slack (WNS) of +0.043 ns. Think
of it like city traffic: a road network at 50% capacity flows
smoothly, but at 90% everything gridlocks.
Q: Can we just add more lanes to do more work in parallel?
A: Only up to the resource budget. Going from 8 to 16 lanes
roughly doubles LUT usage from ~36% to ~72%, dangerously close to
the 80% danger zone. Our 16-lane experiment failed timing
(WNS = -0.589 ns) partly for this reason. Scaling must respect the
budget.
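A back-of-envelope check, continuing the sketch above (LUTs are the
binding resource in the table, so a first-order lane ceiling follows
from the 80% rule alone; the real limit arrives sooner because of
routing congestion):

```python
# First-order lane ceiling from raw LUT counts alone (approximate figures).
TOTAL_LUT, LUT_PER_LANE, CEILING = 1_100_000, 50_000, 0.80

max_lanes = int(TOTAL_LUT * CEILING // LUT_PER_LANE)
print(max_lanes)                               # 17 lanes by raw LUT count
print(f"{16 * LUT_PER_LANE / TOTAL_LUT:.1%}")  # 72.7%: under the ceiling on
# paper, yet 16 lanes failed timing; congestion bites before the raw
# ceiling is reached.
```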
SLR Crossing
There is one more budgeting challenge that deserves its own discussion,
because it catches many designers off guard.
Large FPGAs are built from multiple silicon dies bonded together into
Super Logic Regions (SLRs). Our Virtex UltraScale+ HBM chip has three
SLRs, each containing roughly one-third of the total resources. Signals
that travel within a single SLR are fast (normal wire delays). But signals
that must cross from one SLR to another incur an additional ~0.5 ns of
delay: roughly 12.5% of our entire 4 ns (250 MHz) clock budget consumed
just by crossing the boundary.
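The arithmetic is worth spelling out, since every crossing on a path
eats the same fixed slice of the clock period. A small sketch using the
numbers above (4 ns period, ~0.5 ns per die-boundary hop):

```python
# Timing budget per SLR crossing (numbers quoted in the text above).
CLOCK_PERIOD_NS = 4.0   # 250 MHz clock
SLR_HOP_NS      = 0.5   # extra delay per SLR boundary crossed

for hops in range(3):   # a 3-SLR device allows up to 2 hops on one path
    spent = hops * SLR_HOP_NS
    print(f"{hops} crossing(s): {spent / CLOCK_PERIOD_NS:5.1%} of budget, "
          f"{CLOCK_PERIOD_NS - spent:.2f} ns left for logic and routing")
# 0 crossing(s):  0.0% of budget, 4.00 ns left for logic and routing
# 1 crossing(s): 12.5% of budget, 3.50 ns left for logic and routing
# 2 crossing(s): 25.0% of budget, 3.00 ns left for logic and routing
```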
This is like the Jordan River dividing the Promised Land: the tribes of
Reuben, Gad, and half of Manasseh on the east bank had to cross the river
to coordinate with the western tribes, and that crossing took time and
effort. Joshua placed related tribes near each other to minimize such
crossings; we use floorplanning (Pblocks) to assign logic to specific
SLRs and minimize expensive crossings on the critical path.
In practice, this means keeping each compute lane and its associated
caches within the same SLR, placing the HBM interface in SLR 0 (where
the HBM ports are physically located), and adding pipeline registers at
any boundary that must be crossed. This is directly analogous to data
locality in software: just as you keep related data in the same
array for cache friendliness, you keep
related logic on the same SLR for timing friendliness.
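A toy model of that placement policy, in Python rather than actual
Pblock constraints; the block names and the lane-to-SLR mapping are
made up for illustration, not taken from the real netlist:

```python
# Assign blocks to SLRs, then count cross-SLR nets: each one needs a
# pipeline register at the die boundary and pays the ~0.5 ns hop.
ASSIGNMENT = {
    "hbm_if": 0,                               # HBM ports live in SLR 0
    **{f"lane{i}": i % 3 for i in range(8)},   # lanes spread over 3 SLRs
    **{f"cache{i}": i % 3 for i in range(8)},  # each cache beside its lane
}

CONNECTIONS = [(f"lane{i}", f"cache{i}") for i in range(8)] + \
              [(f"lane{i}", "hbm_if") for i in range(8)]

crossings = [(a, b) for a, b in CONNECTIONS if ASSIGNMENT[a] != ASSIGNMENT[b]]
print(f"{len(crossings)} connections cross an SLR boundary")
# Lane-to-cache links never cross (same SLR by construction); only the
# lanes placed outside SLR 0 pay a pipelined hop to reach the HBM interface.
```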
Soli Deo Gloria