Resource Budgeting — Brief ☧
"So Joshua took the whole land, according to all that the LORD said unto Moses; and Joshua gave it for an inheritance unto Israel according to their divisions by their tribes."
— Joshua 11:23
Q: When Joshua divided the Promised Land among the twelve tribes,
each tribe received territory according to its size and needs. Judah
received the largest region; Simeon was embedded within Judah's
territory. If Joshua gave too much to one tribe, another would be left
without enough. How is this like FPGA design?
A: An FPGA has a fixed budget of resources — LUTs, flip-flops,
BRAM blocks, DSP slices — spread across its silicon. You must divide
them among the parts of your design just as Joshua divided territory
among the tribes. Think of it like a monthly household budget: you
have a set amount of income, and you allocate portions to rent, food,
savings, and so on. Overspend in one category and another suffers.
Q: What resources are we budgeting exactly?
A: Five main types, each serving a different role:
| Resource | What It Does               | Total on Chip | Per Lane (~) | 8 Lanes (~) | Budget (%) |
|----------|----------------------------|---------------|--------------|-------------|------------|
| LUTs     | Logic computation          | ~1.1M         | ~50K         | ~400K       | ~36%       |
| FFs      | 1-bit storage (flip-flops) | ~2.2M         | ~30K         | ~240K       | ~11%       |
| BRAM     | 36 Kbit on-chip memory     | ~2,000        | ~40          | ~320        | ~16%       |
| URAM     | 288 Kbit dense memory      | ~960          | ~4           | ~32         | ~3%        |
| DSP      | Multiply-accumulate units  | ~9,000        | ~2           | ~16         | <1%        |

This is just like counting how much space an array or hash table
consumes in software: you would not allocate a million-entry table when
you only need a thousand. On an FPGA, wasted resources mean wasted
silicon, and overused resources mean a design that cannot be built.
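To make the table's arithmetic concrete, here is a minimal Python
sketch; the chip totals and per-lane costs are the approximate figures
quoted above, not exact device counts.

```python
# Budget arithmetic from the table above (approximate figures).
CHIP_TOTALS = {"LUT": 1_100_000, "FF": 2_200_000, "BRAM": 2_000, "URAM": 960, "DSP": 9_000}
PER_LANE    = {"LUT": 50_000,    "FF": 30_000,    "BRAM": 40,     "URAM": 4,   "DSP": 2}

def utilization(lanes: int) -> dict[str, float]:
    """Fraction of each chip resource consumed by `lanes` compute lanes."""
    return {r: lanes * PER_LANE[r] / CHIP_TOTALS[r] for r in CHIP_TOTALS}

for resource, frac in utilization(8).items():
    print(f"{resource:>4}: {frac:6.1%}")
# LUT: 36.4%, FF: 10.9%, BRAM: 16.0%, URAM: 3.3%, DSP: 0.2%
```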
Q: What is the danger zone? When do things break?
A: Above roughly 80% utilization, timing closure
becomes extremely difficult. The place-and-route tool needs room to
maneuver — to find short wiring paths through the chip. When the chip
is congested, wires must take long detours, delays increase, and paths
miss their timing budget. Our 8-lane design stays well under 80%
total utilization, which is why it closes timing with a worst negative
slack (WNS) of +0.043 ns. Think
of it like city traffic: a road network at 50% capacity flows
smoothly, but at 90% everything gridlocks.
Q: Can we just add more lanes to do more work in parallel?
A: Only up to the resource budget. Going from 8 to 16 lanes
roughly doubles LUT usage from ~36% to ~72%, dangerously close to
the 80% danger zone. Our 16-lane experiment failed timing
(WNS = -0.589 ns) partly for this reason. Scaling must respect the
budget.
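A back-of-envelope check, continuing the sketch above (LUTs are the
binding resource in the table, so a first-order lane ceiling follows
from the 80% rule alone; the real limit arrives sooner because of
routing congestion):

```python
# First-order lane ceiling from raw LUT counts alone (approximate figures).
TOTAL_LUT, LUT_PER_LANE, CEILING = 1_100_000, 50_000, 0.80

max_lanes = int(TOTAL_LUT * CEILING // LUT_PER_LANE)
print(max_lanes)                               # 17 lanes by raw LUT count
print(f"{16 * LUT_PER_LANE / TOTAL_LUT:.1%}")  # 72.7%: under the ceiling on
# paper, yet 16 lanes failed timing; congestion bites before the raw
# ceiling is reached.
```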
SLR Crossing
There is one more budgeting challenge that deserves its own discussion,
because it catches many designers off guard.
Large FPGAs are built from multiple silicon dies bonded together into
Super Logic Regions (SLRs). Our Virtex UltraScale+ HBM chip has three
SLRs, each containing roughly one-third of the total resources. Signals
that travel within a single SLR are fast (normal wire delays). But signals
that must cross from one SLR to another incur an additional ~0.5 ns of
delay: roughly 12.5% of our entire 4 ns (250 MHz) clock budget consumed
just by crossing the boundary.
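The arithmetic is worth spelling out, since every crossing on a path
eats the same fixed slice of the clock period. A small sketch using the
numbers above (4 ns period, ~0.5 ns per die-boundary hop):

```python
# Timing budget per SLR crossing (numbers quoted in the text above).
CLOCK_PERIOD_NS = 4.0   # 250 MHz clock
SLR_HOP_NS      = 0.5   # extra delay per SLR boundary crossed

for hops in range(3):   # a 3-SLR device allows up to 2 hops on one path
    spent = hops * SLR_HOP_NS
    print(f"{hops} crossing(s): {spent / CLOCK_PERIOD_NS:5.1%} of budget, "
          f"{CLOCK_PERIOD_NS - spent:.2f} ns left for logic and routing")
# 0 crossing(s):  0.0% of budget, 4.00 ns left for logic and routing
# 1 crossing(s): 12.5% of budget, 3.50 ns left for logic and routing
# 2 crossing(s): 25.0% of budget, 3.00 ns left for logic and routing
```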
This is like the Jordan River dividing the Promised Land: the tribes of
Reuben, Gad, and half of Manasseh on the east bank had to cross the river
to coordinate with the western tribes, and that crossing took time and
effort. Joshua placed related tribes near each other to minimize such
crossings; we use floorplanning (Pblocks) to assign logic to specific
SLRs and minimize expensive crossings on the critical path.
In practice, this means keeping each compute lane and its associated
caches within the same SLR, placing the HBM interface in SLR 0 (where
the HBM ports are physically located), and adding pipeline registers at
any boundary that must be crossed. This is directly analogous to data
locality in software: just as you keep related data in the same
array for cache friendliness, you keep
related logic on the same SLR for timing friendliness.
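A toy model of that placement policy, in Python rather than actual
Pblock constraints; the block names and the lane-to-SLR mapping are
made up for illustration, not taken from the real netlist:

```python
# Assign blocks to SLRs, then count cross-SLR nets: each one needs a
# pipeline register at the die boundary and pays the ~0.5 ns hop.
ASSIGNMENT = {
    "hbm_if": 0,                               # HBM ports live in SLR 0
    **{f"lane{i}": i % 3 for i in range(8)},   # lanes spread over 3 SLRs
    **{f"cache{i}": i % 3 for i in range(8)},  # each cache beside its lane
}

CONNECTIONS = [(f"lane{i}", f"cache{i}") for i in range(8)] + \
              [(f"lane{i}", "hbm_if") for i in range(8)]

crossings = [(a, b) for a, b in CONNECTIONS if ASSIGNMENT[a] != ASSIGNMENT[b]]
print(f"{len(crossings)} connections cross an SLR boundary")
# Lane-to-cache links never cross (same SLR by construction); only the
# lanes placed outside SLR 0 pay a pipelined hop to reach the HBM interface.
```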
Soli Deo Gloria