Streaming Intersection — Brief ☧
"In whom are hid all the treasures of wisdom and knowledge."
— Colossians 2:3 (KJV)
Q: You and a friend each have a sorted list of numbers. You want to
find all the numbers that appear on both lists. The naive approach —
for every number on your list, scan through your friend's entire list —
works but is slow. For two lists of sizes n and m, that is O(n · m)
comparisons — quadratic when the lists are similar in size. Can we do better?
A: Much better. Since both lists are sorted, you can use a trick
called merge intersection. Put a finger on the first item of each
list. Compare them:
- If they match, record the number and advance both fingers.
- If yours is smaller, advance your finger (everything remaining on your friend's list is at least as large as their current number, so yours cannot appear there).
- If your friend's is smaller, advance their finger.
You walk through both lists together, and you are done in O(n + m) —
linear time. Each finger only moves forward,
never backward. This is the same idea behind the merge step in merge sort.
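The two-finger walk above can be sketched in a few lines of Python (the function name `merge_intersect` is ours, not from the project):

```python
def merge_intersect(a, b):
    """Intersect two sorted lists in O(n + m) with two advancing pointers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # match: record and advance both fingers
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # a[i] cannot appear later in b
        else:
            j += 1             # b[j] cannot appear later in a
    return out
```

Each pointer only moves forward, so the total work is bounded by the combined length of the two lists.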
Q: That is great for simple sorted lists. But what if the lists are
enormous and mostly empty — like bitmasks where
only a few bits out of thousands are set?
A: Then we can do even better with a trie (prefix
tree) intersection. Instead of storing the numbers
in a flat sorted list, store them in a binary trie — a tree where each
branch represents a 0 or a 1 in the binary representation. To intersect
two tries, walk them simultaneously from the root. Whenever one side has
an entirely empty subtree, you can skip the whole branch — no need to
check any of its children. This is Edward Kmett's insight: in sparse
data, you can prune vast regions of empty space.
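A minimal sketch of the pruning idea, assuming a toy tuple-based trie over fixed-width keys (this is illustrative only, not the project's actual data structure — nodes are `(is_member, zero_child, one_child)` and `None` marks an empty subtree):

```python
W = 8  # key width in bits (an assumption for this example)

def bits_of(x):
    """Most-significant-first bits of x."""
    return [(x >> (W - 1 - i)) & 1 for i in range(W)]

def insert(node, bits):
    """Insert a bit path; members live at the leaves (depth W)."""
    if not bits:
        return (True, None, None)
    _, c0, c1 = node or (False, None, None)
    if bits[0] == 0:
        return (False, insert(c0, bits[1:]), c1)
    return (False, c0, insert(c1, bits[1:]))

def intersect(a, b):
    """Walk both tries together; an empty subtree on either side
    prunes the whole branch without visiting any children."""
    if a is None or b is None:
        return None            # skip the entire region in O(1)
    c0 = intersect(a[1], b[1])
    c1 = intersect(a[2], b[2])
    member = a[0] and b[0]
    if not member and c0 is None and c1 is None:
        return None            # nothing survived below: collapse to empty
    return (member, c0, c1)

def values(node, prefix=0, depth=0):
    """Enumerate members in ascending order (for inspection)."""
    if node is None:
        return []
    if depth == W:
        return [prefix] if node[0] else []
    return (values(node[1], prefix << 1, depth + 1)
            + values(node[2], (prefix << 1) | 1, depth + 1))
```

The payoff is in `intersect`: when either argument is `None`, an entire subtree of up to 2^depth potential values is dismissed with a single comparison.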
Q: How does "galloping" fit in?
A: Galloping search (also called exponential search) helps when
one list is much smaller than the other. Instead of advancing one
element at a time, you jump ahead by doubling steps — 1, 2, 4, 8, 16...
— until you overshoot, then binary search
back. This gets you O(m log n) when m is much smaller than n, which
beats the linear merge. In our hierarchical domains, the L0 summary
bits serve the same purpose: a zero summary bit lets us skip 64 or
even 4096 elements at once.
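Galloping can be sketched with the standard library's `bisect` (the function names here are ours): double the jump until you overshoot the target, then binary-search the last overshot range.

```python
import bisect

def gallop(arr, lo, target):
    """Leftmost index i >= lo with arr[i] >= target, found by
    doubling jumps (1, 2, 4, 8, ...) then a bounded binary search."""
    bound = 1
    while lo + bound < len(arr) and arr[lo + bound] < target:
        bound *= 2
    return bisect.bisect_left(arr, target,
                              lo + bound // 2,
                              min(lo + bound, len(arr)))

def gallop_intersect(small, big):
    """Intersect two sorted lists when len(small) << len(big):
    O(m log n) lookups instead of an O(n + m) merge."""
    out, j = [], 0
    for x in small:
        j = gallop(big, j, x)   # never restarts from the beginning
        if j < len(big) and big[j] == x:
            out.append(x)
    return out
```

Because `j` only moves forward, the search in `big` resumes where the previous element left off, just like the merge's finger.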
Key Concepts
| Concept | Meaning | Our Project |
|---|---|---|
| Merge intersection | Two sorted streams, O(n+m) | Sorted domain iteration |
| Binary trie | Prefix tree on bits | Hierarchical domain structure |
| W=2 trie | Two-child branching | Our L0/L1/L2 hierarchy |
| Galloping search | Skip ahead in sparse regions | Summary-bit skipping |
The key insight from this table is that our domain hierarchy is not just a storage format -- it is an algorithmic strategy. The L0/L1/L2 structure is conceptually a W=2 binary trie with six bit-levels collapsed into each stored level, giving a branching factor of 64 per level, and the summary bits at each level serve the same purpose as galloping search: they let us skip large regions of empty space in constant time. When domains are sparse (most values have been eliminated by constraints), this structured skipping turns what would be a linear scan into something dramatically faster.
Connection to our system: Domain intersection is the core operation of our
constraint solver -- it runs 770,000 times per second on the FPGA -- so optimizing it has enormous impact. Two variable domains, each represented as hierarchical bitmasks,
are intersected by walking the trie structure simultaneously. At the top level (L0), we AND the two 64-bit summary words. Each resulting 1-bit means "both domains have values in this region -- examine further." Each 0-bit means "at least one domain is completely empty here -- skip the entire region." For a domain of 262,144 values, a single zero bit in the L0 summary skips 4,096 values at once. This is the hardware analogue of galloping search, and it is what makes the SemMedDB benchmark possible: most of the 950 million concept pairs share few or no values, so the L0 summary short-circuits the vast majority of intersections after examining just two 64-bit words.
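A software model of the summary-and-skip step might look like the following. This is a sketch, not the FPGA logic: it models only two levels (one 64-bit summary over 64 leaf words, i.e. 4,096 values) rather than the full three-level L0/L1/L2 hierarchy, and the function name is ours.

```python
def intersect_hierarchical(l0_a, leaves_a, l0_b, leaves_b):
    """Two-level model: each domain is a 64-bit summary word plus
    64 leaf words of 64 bits each (4,096 values total).
    A summary bit is set iff the corresponding leaf word is non-zero."""
    candidates = l0_a & l0_b        # one AND decides all 64 regions at once
    out_l0 = 0
    out_leaves = [0] * 64
    s = candidates
    while s:
        i = (s & -s).bit_length() - 1       # index of lowest set bit
        w = leaves_a[i] & leaves_b[i]       # only regions both claim non-empty
        if w:
            out_leaves[i] = w
            out_l0 |= 1 << i                # rebuild the result's summary
        s &= s - 1                          # clear the bit we just handled
    return out_l0, out_leaves
```

Leaf words whose summary bit came out 0 are never touched: in sparse domains the loop body runs only a handful of times, which is the skipping behavior the paragraph above describes.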
Soli Deo Gloria