Pipelining — Brief ☧
Deep version → | Next: Resource Budgeting → | Back: CDC →
Q: Imagine a laundromat with one washer, one dryer, and one
folding table. You could wash a load, then dry it, then fold it — and
only after folding start the next load. But what if you started
washing the second load while the first is drying, and started drying
the second while you fold the first? Each individual load still takes
the same total time, but loads finish much more frequently. What is the
hardware term for this?
A: A pipeline. We break a long computation into stages, each
separated by registers (the "handoff points" from
Registers & Clocks). Any single result
takes multiple cycles to emerge — that is the latency (like one
load still taking wash + dry + fold time). But a new result can finish
every single cycle — that is the throughput (like a folded load
appearing every wash-cycle, once the pipeline is full).
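To make the latency/throughput split concrete, here is a toy cycle-accurate model of the laundromat pipeline in Python. It is a sketch of the concept only; the stage count and item names come from the analogy, not from any real design.

```python
# Toy cycle-accurate model of the laundromat as a 3-stage pipeline
# (wash -> dry -> fold). Illustrative only.

def run_pipeline(items, num_stages=3):
    """Shift one item per stage per clock; return (item, finish_cycle) pairs."""
    stages = [None] * num_stages        # stages[i] = item currently in stage i
    pending = list(items)
    finished, cycle = [], 0
    while pending or any(s is not None for s in stages):
        cycle += 1
        # Clock edge: every stage hands its item to the next register.
        stages = [pending.pop(0) if pending else None] + stages[:-1]
        if stages[-1] is not None:
            finished.append((stages[-1], cycle))
    return finished

print(run_pipeline(["load1", "load2", "load3"]))
# [('load1', 3), ('load2', 4), ('load3', 5)]
# Latency: 3 cycles per load. Throughput: one finished load per cycle.
```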
Q: Why not just do the whole computation in one cycle and avoid
the complexity?
A: Because of timing closure. Remember, all
combinational logic between two registers must complete within one
clock period (4 ns at 250 MHz). If we try to compute a full 262K-bit
domain intersection in one shot, the logic chain is too long — it
exceeds that 4 ns budget. By splitting into stages, each stage's logic
fits within the clock period, and we can run at a high frequency.
It is similar to time-complexity tradeoffs in
software: sometimes you restructure an algorithm
into many small steps instead of one big step, trading a little
overhead for better overall performance.
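Some back-of-the-envelope arithmetic shows why staging helps. The 4 ns budget at 250 MHz comes from the text above; the 16 ns single-cycle path delay is an assumed figure for illustration.

```python
import math

target_period_ns = 1e3 / 250              # 250 MHz -> 4.0 ns per cycle
full_path_ns = 16.0                       # hypothetical unpipelined logic delay

fmax_unpipelined_mhz = 1e3 / full_path_ns
stages_needed = math.ceil(full_path_ns / target_period_ns)

print(f"Unpipelined fmax: {fmax_unpipelined_mhz:.1f} MHz")        # 62.5 MHz
print(f"Stages to close timing at 250 MHz: >= {stages_needed}")   # 4
```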
Q: The aqueducts of Jerusalem carried water from Solomon's Pools
to the Temple Mount in stages. Is that the same idea?
A: A perfect analogy. The water did not arrive instantly — it
flowed through successive segments. But once the aqueduct was full,
fresh water arrived continuously at the Temple end. High latency to
fill; steady delivery once flowing.
Q: What happens when a downstream stage is not ready?
A: That is called backpressure. When a downstream stage signals
READY = 0, all upstream stages must pause — like a dam in the
aqueduct. Our AXI interfaces (see AXI Protocol) use
VALID/READY handshaking to implement this naturally. The pipeline
stalls gracefully and resumes when the blockage clears, without
losing or duplicating any data.
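Here is a minimal software sketch of one registered stage obeying that stall rule. The class and signal names are illustrative (the real interfaces would be RTL), but the advance condition is the VALID/READY protocol just described: a stage captures new input only when its output register is empty or is being accepted downstream on the same edge.

```python
# Minimal model of one registered pipeline stage with VALID/READY
# backpressure. Names are illustrative, not the project's actual RTL.

class Stage:
    def __init__(self, work):
        self.work = work           # this stage's combinational function
        self.valid = False         # does the output register hold real data?
        self.data = None

    def ready(self, downstream_ready):
        # READY to upstream: accept input if we are empty, or if downstream
        # drains our register on this same clock edge.
        return (not self.valid) or downstream_ready

    def tick(self, in_valid, in_data, downstream_ready):
        """One clock edge; returns (valid, data) as presented downstream."""
        out = (self.valid, self.data)
        fired_in = in_valid and self.ready(downstream_ready)
        fired_out = self.valid and downstream_ready
        if fired_out and not fired_in:
            self.valid = False     # register drained, nothing new captured
        if fired_in:
            self.valid, self.data = True, self.work(in_data)
        return out
```

When downstream_ready stays 0 and the register is full, nothing fires: the stage freezes and keeps presenting the same data until the blockage clears, which is exactly the graceful stall described above. Note that READY here depends combinationally on the downstream READY; designs that need to break that timing chain often insert a skid buffer at the cost of an extra register.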
Pipeline Tradeoffs
This table captures the fundamental tradeoff. Read each row carefully;
each one shows what you gain and what you pay for pipelining.
| Property | Without pipelining | With 4-stage pipeline |
|---|---|---|
| Latency | 1 cycle (but the cycle is very long) | 4 short cycles |
| Throughput | 1 result per long cycle | 1 result per short cycle |
| Max frequency | Low (critical path = full computation) | High (critical path = 1 stage) |
| Area cost | Fewer registers | More registers (one set per stage boundary) |
The key insight: pipelining does not make any individual result arrive
faster. In fact, it makes each result take more clock cycles to
complete. What it does is allow you to run at a much higher clock
frequency (because each stage has less work to do), which means more
results finish per second. This is the classic latency-vs-throughput
tradeoff that appears everywhere in computer engineering, from CPU
instruction pipelines to network packet processing.
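Worked numbers, using the same assumed figures as the timing sketch earlier (a 16 ns unpipelined path split into four 4 ns stages), make the tradeoff concrete:

```python
# Assumed figures: 16 ns unpipelined cycle vs. four 4 ns pipeline stages.
n = 1_000_000                                 # results to produce
t_long_ns, t_short_ns, stages = 16.0, 4.0, 4

unpipelined_ns = n * t_long_ns                # one result per long cycle
pipelined_ns = (stages + n - 1) * t_short_ns  # fill once, then one per cycle

print(f"Unpipelined: {unpipelined_ns / 1e6:.2f} ms")   # 16.00 ms
print(f"Pipelined:   {pipelined_ns / 1e6:.2f} ms")     # ~4.00 ms
# Per-result latency is 16 ns either way (4 stages x 4 ns); only the
# completion rate improves.
```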
In practice, you add just enough pipeline stages to achieve
timing closure at your target clock
frequency and no more. Extra stages waste registers and add latency
without improving throughput. Our depth-3 intersection pipeline uses
2-6 stages depending on the path, with a fast-path shortcut that skips
the deeper stages when the L0 summary AND is zero (meaning no possible
overlap between the two domains).
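As a sketch of how such a fast path works, here is the summary-AND shortcut in Python. The block layout (16 blocks of 64 bits), the summarize helper, and all names are assumptions for illustration; the real design operates on the 262K-bit domains in hardware.

```python
# Hypothetical layout: a 1024-bit bitmap split into 16 blocks of 64 bits,
# with an L0 summary word whose bit i is set iff block i has any set bit.
BLOCK, NBLOCKS = 64, 16

def summarize(bits: int) -> int:
    """Build the L0 summary: one bit per block of the wide bitmap."""
    mask = (1 << BLOCK) - 1
    s = 0
    for i in range(NBLOCKS):
        if (bits >> (i * BLOCK)) & mask:
            s |= 1 << i
    return s

def intersect(a: int, b: int) -> int:
    # Fast path: if the summaries share no set bit, no block can overlap,
    # so the full-width AND is provably zero and the deeper stages are
    # skipped. Otherwise fall through to the full intersection.
    if summarize(a) & summarize(b) == 0:
        return 0
    return a & b
```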
Soli Deo Gloria