Pipelining — Brief ☧
Deep version → | Next: Resource Budgeting → | Back: CDC →
Q: Imagine a laundromat with one washer, one dryer, and one
folding table. You could wash a load, then dry it, then fold it — and
only after folding start the next load. But what if you started
washing the second load while the first is drying, and started drying
the second while you fold the first? Each individual load still takes
the same total time, but loads finish much more frequently. What is the
hardware term for this?
A: A pipeline. We break a long computation into stages, each
separated by registers (the "handoff points" from
Registers & Clocks). Any single result
takes multiple cycles to emerge — that is the latency (like one
load still taking wash + dry + fold time). But a new result can finish
every single cycle — that is the throughput (like a folded load
appearing every wash-cycle, once the pipeline is full).
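To make the latency/throughput split concrete, here is a toy cycle-accurate model of the laundromat pipeline in Python. It is a sketch of the concept only; the stage count and item names come from the analogy, not from any real design.

```python
# Toy cycle-accurate model of the laundromat as a 3-stage pipeline
# (wash -> dry -> fold). Illustrative only.

def run_pipeline(items, num_stages=3):
    """Shift one item per stage per clock; return (item, finish_cycle) pairs."""
    stages = [None] * num_stages        # stages[i] = item currently in stage i
    pending = list(items)
    finished, cycle = [], 0
    while pending or any(s is not None for s in stages):
        cycle += 1
        # Clock edge: every stage hands its item to the next register.
        stages = [pending.pop(0) if pending else None] + stages[:-1]
        if stages[-1] is not None:
            finished.append((stages[-1], cycle))
    return finished

print(run_pipeline(["load1", "load2", "load3"]))
# [('load1', 3), ('load2', 4), ('load3', 5)]
# Latency: 3 cycles per load. Throughput: one finished load per cycle.
```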
Q: Why not just do the whole computation in one cycle and avoid
the complexity?
A: Because of timing closure. Remember, all
combinational logic between two registers must complete within one
clock period (4 ns at 250 MHz). If we try to compute a full 262K-bit
domain intersection in one shot, the logic chain is too long — it
exceeds that 4 ns budget. By splitting into stages, each stage's logic
fits within the clock period, and we can run at a high frequency.
It is similar to time-complexity tradeoffs in
software: sometimes you restructure an algorithm
into many small steps instead of one big step, trading a little
overhead for better overall performance.
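Some back-of-the-envelope arithmetic shows why staging helps. The 4 ns budget at 250 MHz comes from the text above; the 16 ns single-cycle path delay is an assumed figure for illustration.

```python
import math

target_period_ns = 1e3 / 250              # 250 MHz -> 4.0 ns per cycle
full_path_ns = 16.0                       # hypothetical unpipelined logic delay

fmax_unpipelined_mhz = 1e3 / full_path_ns
stages_needed = math.ceil(full_path_ns / target_period_ns)

print(f"Unpipelined fmax: {fmax_unpipelined_mhz:.1f} MHz")        # 62.5 MHz
print(f"Stages to close timing at 250 MHz: >= {stages_needed}")   # 4
```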
Q: The aqueducts of Jerusalem carried water from Solomon's Pools
to the Temple Mount in stages. Is that the same idea?
A: A perfect analogy. The water did not arrive instantly — it
flowed through successive segments. But once the aqueduct was full,
fresh water arrived continuously at the Temple end. High latency to
fill; steady delivery once flowing.
Q: What happens when a downstream stage is not ready?
A: That is called backpressure. When a downstream stage signals
READY = 0, all upstream stages must pause — like a dam in the
aqueduct. Our AXI interfaces (see AXI Protocol) use
VALID/READY handshaking to implement this naturally. The pipeline
stalls gracefully and resumes when the blockage clears, without
losing or duplicating any data.
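Here is a minimal software sketch of one registered stage obeying that stall rule. The class and signal names are illustrative (the real interfaces would be RTL), but the advance condition is the VALID/READY protocol just described: a stage captures new input only when its output register is empty or is being accepted downstream on the same edge.

```python
# Minimal model of one registered pipeline stage with VALID/READY
# backpressure. Names are illustrative, not the project's actual RTL.

class Stage:
    def __init__(self, work):
        self.work = work           # this stage's combinational function
        self.valid = False         # does the output register hold real data?
        self.data = None

    def ready(self, downstream_ready):
        # READY to upstream: accept input if we are empty, or if downstream
        # drains our register on this same clock edge.
        return (not self.valid) or downstream_ready

    def tick(self, in_valid, in_data, downstream_ready):
        """One clock edge; returns (valid, data) as presented downstream."""
        out = (self.valid, self.data)
        fired_in = in_valid and self.ready(downstream_ready)
        fired_out = self.valid and downstream_ready
        if fired_out and not fired_in:
            self.valid = False     # register drained, nothing new captured
        if fired_in:
            self.valid, self.data = True, self.work(in_data)
        return out
```

When downstream_ready stays 0 and the register is full, nothing fires: the stage freezes and keeps presenting the same data until the blockage clears, which is exactly the graceful stall described above. Note that READY here depends combinationally on the downstream READY; designs that need to break that timing chain often insert a skid buffer at the cost of an extra register.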
Pipeline Tradeoffs
This table captures the fundamental tradeoff. Read each row carefully;
each one shows what you gain and what you pay for pipelining.
| Property | Without pipelining | With 4-stage pipeline |
|---|---|---|
| Latency | 1 cycle (but the cycle is very long) | 4 short cycles |
| Throughput | 1 result per long cycle | 1 result per short cycle |
| Max frequency | Low (critical path = full computation) | High (critical path = 1 stage) |
| Area cost | Fewer registers | More registers (one set per stage boundary) |
The key insight: pipelining does not make any individual result arrive
faster. In fact, it makes each result take more clock cycles to
complete. What it does is allow you to run at a much higher clock
frequency (because each stage has less work to do), which means more
results finish per second. This is the classic latency-vs-throughput
tradeoff that appears everywhere in computer engineering, from CPU
instruction pipelines to network packet processing.
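Worked numbers, using the same assumed figures as the timing sketch earlier (a 16 ns unpipelined path split into four 4 ns stages), make the tradeoff concrete:

```python
# Assumed figures: 16 ns unpipelined cycle vs. four 4 ns pipeline stages.
n = 1_000_000                                 # results to produce
t_long_ns, t_short_ns, stages = 16.0, 4.0, 4

unpipelined_ns = n * t_long_ns                # one result per long cycle
pipelined_ns = (stages + n - 1) * t_short_ns  # fill once, then one per cycle

print(f"Unpipelined: {unpipelined_ns / 1e6:.2f} ms")   # 16.00 ms
print(f"Pipelined:   {pipelined_ns / 1e6:.2f} ms")     # ~4.00 ms
# Per-result latency is 16 ns either way (4 stages x 4 ns); only the
# completion rate improves.
```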
In practice, you add just enough pipeline stages to achieve
timing closure at your target clock
frequency and no more. Extra stages waste registers and add latency
without improving throughput. Our depth-3 intersection pipeline uses
2-6 stages depending on the path, with a fast-path shortcut that skips
the deeper stages when the L0 summary AND is zero (meaning no possible
overlap between the two domains).
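As a sketch of how such a fast path works, here is the summary-AND shortcut in Python. The block layout (16 blocks of 64 bits), the summarize helper, and all names are assumptions for illustration; the real design operates on the 262K-bit domains in hardware.

```python
# Hypothetical layout: a 1024-bit bitmap split into 16 blocks of 64 bits,
# with an L0 summary word whose bit i is set iff block i has any set bit.
BLOCK, NBLOCKS = 64, 16

def summarize(bits: int) -> int:
    """Build the L0 summary: one bit per block of the wide bitmap."""
    mask = (1 << BLOCK) - 1
    s = 0
    for i in range(NBLOCKS):
        if (bits >> (i * BLOCK)) & mask:
            s |= 1 << i
    return s

def intersect(a: int, b: int) -> int:
    # Fast path: if the summaries share no set bit, no block can overlap,
    # so the full-width AND is provably zero and the deeper stages are
    # skipped. Otherwise fall through to the full intersection.
    if summarize(a) & summarize(b) == 0:
        return 0
    return a & b
```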
Soli Deo Gloria