Streaming Intersection — Brief ☧
"In whom are hid all the treasures of wisdom and knowledge."
— Colossians 2:3 (KJV)
Q: You and a friend each have a sorted list of numbers. You want to
find all the numbers that appear on both lists. The naive approach —
for every number on your list, scan through your friend's entire list —
works but is slow. For two lists of sizes n and m, that is O(n · m)
comparisons — quadratic when the lists are similar in size. Can we do better?
A: Much better. Since both lists are sorted, you can use a trick
called merge intersection. Put a finger on the first item of each
list. Compare them:
- If they match, record the number and advance both fingers.
- If yours is smaller, advance your finger (everything remaining on your friend's list is at least as large as their current number, so yours cannot appear there).
- If your friend's is smaller, advance their finger.
You walk through both lists together, and you are done in O(n + m) —
linear time. Each finger only moves forward,
never backward. This is the same idea behind the merge step in merge sort.
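The two-finger walk above can be sketched in a few lines of Python (the function name `merge_intersect` is ours, not from the project):

```python
def merge_intersect(a, b):
    """Intersect two sorted lists in O(n + m) with two advancing pointers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # match: record and advance both fingers
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # a[i] cannot appear later in b
        else:
            j += 1             # b[j] cannot appear later in a
    return out
```

Each pointer only moves forward, so the total work is bounded by the combined length of the two lists.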
Q: That is great for simple sorted lists. But what if the lists are
enormous and mostly empty — like bitmasks where
only a few bits out of thousands are set?
A: Then we can do even better with a trie (prefix
tree) intersection. Instead of storing the numbers
in a flat sorted list, store them in a binary trie — a tree where each
branch represents a 0 or a 1 in the binary representation. To intersect
two tries, walk them simultaneously from the root. Whenever one side has
an entirely empty subtree, you can skip the whole branch — no need to
check any of its children. This is Edward Kmett's insight: in sparse
data, you can prune vast regions of empty space.
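A minimal sketch of the pruning idea, assuming a toy tuple-based trie over fixed-width keys (this is illustrative only, not the project's actual data structure — nodes are `(is_member, zero_child, one_child)` and `None` marks an empty subtree):

```python
W = 8  # key width in bits (an assumption for this example)

def bits_of(x):
    """Most-significant-first bits of x."""
    return [(x >> (W - 1 - i)) & 1 for i in range(W)]

def insert(node, bits):
    """Insert a bit path; members live at the leaves (depth W)."""
    if not bits:
        return (True, None, None)
    _, c0, c1 = node or (False, None, None)
    if bits[0] == 0:
        return (False, insert(c0, bits[1:]), c1)
    return (False, c0, insert(c1, bits[1:]))

def intersect(a, b):
    """Walk both tries together; an empty subtree on either side
    prunes the whole branch without visiting any children."""
    if a is None or b is None:
        return None            # skip the entire region in O(1)
    c0 = intersect(a[1], b[1])
    c1 = intersect(a[2], b[2])
    member = a[0] and b[0]
    if not member and c0 is None and c1 is None:
        return None            # nothing survived below: collapse to empty
    return (member, c0, c1)

def values(node, prefix=0, depth=0):
    """Enumerate members in ascending order (for inspection)."""
    if node is None:
        return []
    if depth == W:
        return [prefix] if node[0] else []
    return (values(node[1], prefix << 1, depth + 1)
            + values(node[2], (prefix << 1) | 1, depth + 1))
```

The payoff is in `intersect`: when either argument is `None`, an entire subtree of up to 2^depth potential values is dismissed with a single comparison.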
Q: How does "galloping" fit in?
A: Galloping search (also called exponential search) helps when
one list is much smaller than the other. Instead of advancing one
element at a time, you jump ahead by doubling steps — 1, 2, 4, 8, 16...
— until you overshoot, then binary search
back. This gets you O(m log n) when m is much smaller than n, which
beats the linear merge. In our hierarchical domains, the L0 summary
bits serve the same purpose: a zero summary bit lets us skip 64 or
even 4096 elements at once.
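Galloping can be sketched with the standard library's `bisect` (the function names here are ours): double the jump until you overshoot the target, then binary-search the last overshot range.

```python
import bisect

def gallop(arr, lo, target):
    """Leftmost index i >= lo with arr[i] >= target, found by
    doubling jumps (1, 2, 4, 8, ...) then a bounded binary search."""
    bound = 1
    while lo + bound < len(arr) and arr[lo + bound] < target:
        bound *= 2
    return bisect.bisect_left(arr, target,
                              lo + bound // 2,
                              min(lo + bound, len(arr)))

def gallop_intersect(small, big):
    """Intersect two sorted lists when len(small) << len(big):
    O(m log n) lookups instead of an O(n + m) merge."""
    out, j = [], 0
    for x in small:
        j = gallop(big, j, x)   # never restarts from the beginning
        if j < len(big) and big[j] == x:
            out.append(x)
    return out
```

Because `j` only moves forward, the search in `big` resumes where the previous element left off, just like the merge's finger.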
Key Concepts
| Concept | Meaning | Our Project |
|---|---|---|
| Merge intersection | Two sorted streams, O(n+m) | Sorted domain iteration |
| Binary trie | Prefix tree on bits | Hierarchical domain structure |
| W=2 trie | Two-child branching | Our L0/L1/L2 hierarchy |
| Galloping search | Skip ahead in sparse regions | Summary-bit skipping |
The key insight from this table is that our domain hierarchy is not just a storage format -- it is an algorithmic strategy. The L0/L1/L2 structure is conceptually a W=2 binary trie with six bit-levels collapsed into each stored level, giving a branching factor of 64 per level, and the summary bits at each level serve the same purpose as galloping search: they let us skip large regions of empty space in constant time. When domains are sparse (most values have been eliminated by constraints), this structured skipping turns what would be a linear scan into something dramatically faster.
Connection to our system: Domain intersection is the core operation of our
constraint solver -- it runs 770,000 times per second on the FPGA -- so optimizing it has enormous impact. Two variable domains, each represented as hierarchical bitmasks,
are intersected by walking the trie structure simultaneously. At the top level (L0), we AND the two 64-bit summary words. Each resulting 1-bit means "both domains have values in this region -- examine further." Each 0-bit means "at least one domain is completely empty here -- skip the entire region." For a domain of 262,144 values, a single zero bit in the L0 summary skips 4,096 values at once. This is the hardware analogue of galloping search, and it is what makes the SemMedDB benchmark possible: most of the 950 million concept pairs share few or no values, so the L0 summary short-circuits the vast majority of intersections after examining just two 64-bit words.
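A software model of the summary-and-skip step might look like the following. This is a sketch, not the FPGA logic: it models only two levels (one 64-bit summary over 64 leaf words, i.e. 4,096 values) rather than the full three-level L0/L1/L2 hierarchy, and the function name is ours.

```python
def intersect_hierarchical(l0_a, leaves_a, l0_b, leaves_b):
    """Two-level model: each domain is a 64-bit summary word plus
    64 leaf words of 64 bits each (4,096 values total).
    A summary bit is set iff the corresponding leaf word is non-zero."""
    candidates = l0_a & l0_b        # one AND decides all 64 regions at once
    out_l0 = 0
    out_leaves = [0] * 64
    s = candidates
    while s:
        i = (s & -s).bit_length() - 1       # index of lowest set bit
        w = leaves_a[i] & leaves_b[i]       # only regions both claim non-empty
        if w:
            out_leaves[i] = w
            out_l0 |= 1 << i                # rebuild the result's summary
        s &= s - 1                          # clear the bit we just handled
    return out_l0, out_leaves
```

Leaf words whose summary bit came out 0 are never touched: in sparse domains the loop body runs only a handful of times, which is the skipping behavior the paragraph above describes.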
Soli Deo Gloria