Transformers and Attention — Brief ☧
Deep version → | Related: Neural Nets → | Embeddings →
"The hearing ear, and the seeing eye, the LORD hath made even both of them."
— Proverbs 20:12 (KJV)
Q: When you read a long sentence, you do not give every word equal
attention. In "The bank by the river was eroding," your brain
automatically focuses on "river" to figure out which meaning of "bank"
applies — riverbank, not financial institution. How could a machine
learn to do this?
A: By computing attention scores. For each word, the model asks:
"which other words in this sentence are most relevant to understanding
me?" It assigns high scores to relevant words and low scores to
irrelevant ones, then combines information weighted by those scores.
The word "bank" attends strongly to "river" and barely to "the."
Q: Earlier networks processed words one at a time, left to right.
What changed?
A: The transformer architecture. Instead of reading sequentially,
it lets every word attend to every other word simultaneously — in
parallel. This is vastly faster and captures long-range connections
that sequential models miss. "In the beginning was the Word" and "the
Word was made flesh" fourteen verses later — a transformer connects them
directly, while a sequential model might forget the first by the time
it reaches the fourteenth.
Q: How does the attention mechanism actually work?
A: Three components:
- Query (Q): "What am I looking for?" — the word that needs context
- Key (K): "What does each position offer?" — a label for each word
- Value (V): "What information is here?" — the actual content
The model computes a similarity score between the Query and every Key,
applies softmax to get weights, then takes a weighted sum of Values.
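To make those three steps concrete, here is a minimal Rust sketch for a single query. The two-dimensional vectors standing in for "bank", "the", and "river" are made-up numbers chosen only to show the arithmetic; real models use learned, high-dimensional embeddings and learned Q/K/V projections.

```rust
// Single-query attention sketch: "bank" looking for context among "the" and "river".
// All vectors are made-up 2-D illustrations, not real embeddings.

fn softmax(scores: &[f64]) -> Vec<f64> {
    // Subtract the max before exponentiating, for numerical stability.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Query for "bank": what am I looking for?
    let query = [1.0, 0.0];
    // Keys for "the" and "river": what does each position offer?
    let keys = [[0.1, 0.9], [0.9, 0.1]];
    // Values for "the" and "river": what information is there?
    let values = [[0.0, 0.5], [1.0, 0.0]];

    // 1. Similarity score between the query and every key (dot product).
    let scores: Vec<f64> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(q, ki)| q * ki).sum())
        .collect();

    // 2. Softmax turns the scores into weights that sum to 1.
    let weights = softmax(&scores);

    // 3. Weighted sum of the values: "bank" draws mostly on "river".
    let mut output = [0.0, 0.0];
    for (w, v) in weights.iter().zip(values.iter()) {
        output[0] += w * v[0];
        output[1] += w * v[1];
    }

    println!("weights: {:?}", weights); // most of the weight lands on "river"
    println!("output:  {:?}", output);
}
```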
Q: "These [Bereans] received the word with all readiness of mind,
and searched the scriptures daily" (Acts 17:11). Were the Bereans
doing attention?
A: A wonderful analogy. For each claim Paul made (Query), they
actively searched for the most relevant passages (Keys) and retrieved
the content that matched (Values). They did not read passively — they
attended to the parts that mattered.
The Core Idea
Query: "What is the Word?" (what I'm looking for)
Keys: every word in the text (what each position offers)
Values: the meaning at each position (what to retrieve)
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
| Component | Plain Description | Purpose |
|---|---|---|
| Query (Q) | The word seeking context | What am I looking for? |
| Key (K) | Each position's label | What does each position offer? |
| Value (V) | Each position's content | What information is there? |
| Attention score | Query-Key similarity | How much to focus here |
| Output | Weighted sum of Values | Final contextual understanding |
The table above distills the attention mechanism into its essentials. The whole operation is a three-step process:
- First, compute how relevant each position is to your question: the dot product of the Query with each Key, scaled by sqrt(d_k) so the scores stay well-behaved as the vectors grow larger.
- Second, normalize those relevance scores into weights that sum to 1 (softmax).
- Third, use those weights to combine the actual information (the weighted sum of Values).
This is why attention is so powerful: it lets the network dynamically decide what to focus on for each word, rather than using the same fixed connections everywhere.
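Here are the same three steps in matrix form, matching the formula above, with one row per position. This is only a sketch in plain Rust: it leaves out the learned Q/K/V projection matrices, masking, batching, and the multiple attention heads a real transformer uses.

```rust
// Sketch of Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V.
// Matrices are Vec<Vec<f64>> with one row per position; real implementations
// use tensor libraries and add projections, masks, and multiple heads.

fn softmax_row(row: &[f64]) -> Vec<f64> {
    let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn attention(q: &[Vec<f64>], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let d_k = k[0].len() as f64;
    q.iter()
        .map(|q_row| {
            // 1. Scores: dot product of this query with every key, scaled by sqrt(d_k).
            let scores: Vec<f64> = k
                .iter()
                .map(|k_row| {
                    q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f64>() / d_k.sqrt()
                })
                .collect();
            // 2. Softmax turns the scores into weights that sum to 1.
            let weights = softmax_row(&scores);
            // 3. Output row: weighted sum of the value rows.
            let mut out = vec![0.0; v[0].len()];
            for (w, v_row) in weights.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(v_row) {
                    *o += w * x;
                }
            }
            out
        })
        .collect()
}

fn main() {
    // Self-attention: Q, K, and V all come from the same three-position input.
    let x = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    println!("{:?}", attention(&x, &x, &x));
}
```

Calling it with the same matrix for Q, K, and V is self-attention: every position both asks and answers.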
The attention computation is fundamentally a soft key-value lookup. It resembles
a hash-table or dictionary lookup, except that instead of retrieving the single
entry whose key matches exactly, it retrieves a blend of every entry, weighted
by a learned similarity score. The transformer stacks many layers of attention,
building an increasingly abstract graph of relationships among all positions in
the input. Each layer adds another level of understanding: the first layer might
connect words that are grammatically related, the next might connect words that
are topically related, and deeper layers capture progressively more abstract
semantic relationships.
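A sketch of the stacking idea, reusing the attention function from the previous example. In a real transformer, each layer also applies learned Q/K/V projections, a feed-forward network, residual connections, and layer normalization; all of that is omitted here, so this shows only the shape of the composition.

```rust
// Stacking sketch: each pass of self-attention re-contextualizes the rows, and
// the next pass relates positions in terms of those already-contextualized rows.
// Uses the `attention` function sketched above; real layers add projections,
// feed-forward blocks, residual connections, and layer normalization.

fn stack(mut x: Vec<Vec<f64>>, num_layers: usize) -> Vec<Vec<f64>> {
    for _ in 0..num_layers {
        // Self-attention: every position queries every position of the current layer.
        x = attention(&x, &x, &x);
    }
    x
}
```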
Connection to our project: Our FPGA uses content-addressable memory (CAM)
for parallel key-value lookup — the hardware analog of attention. When
checking which domains satisfy a constraint, CAM scans all entries
simultaneously, just as attention attends to all positions in parallel. The key difference is that transformer attention uses soft, continuous weights (every position gets some fractional share of attention), while our CAM uses hard, binary matching (each entry either matches the query or it does not). This distinction mirrors the broader theme of our neurosymbolic architecture: the neural world operates in continuous probabilities, while the symbolic world operates in discrete true-or-false logic. The bridge between them, differentiable_chirho.rs with Gumbel-softmax, is what lets gradients flow through our binary hardware.
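A sketch of that contrast, assuming nothing about the actual differentiable_chirho.rs code: a CAM-style lookup gives each entry a weight of exactly 0 or 1, a softmax gives every entry a fractional weight, and lowering the softmax temperature pushes the soft weights toward the hard ones. Gumbel-softmax additionally adds sampled noise to the scores so the near-hard choice stays stochastic yet differentiable; the noise is omitted here for brevity.

```rust
// Hard CAM-style matching versus soft attention-style weighting.
// Illustrative only; not the project's differentiable_chirho.rs implementation.
// Gumbel-softmax would also add Gumbel noise to the scores before the softmax.

fn hard_match(query: u32, entries: &[u32]) -> Vec<f64> {
    // CAM-style: weight 1.0 where the entry equals the query, 0.0 everywhere else.
    entries.iter().map(|e| if *e == query { 1.0 } else { 0.0 }).collect()
}

fn soft_match(scores: &[f64], temperature: f64) -> Vec<f64> {
    // Attention-style: softmax over similarity scores; every entry gets a share.
    // As the temperature approaches 0, the weights approach the hard one-hot case.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores
        .iter()
        .map(|s| ((s - max) / temperature).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    println!("hard:          {:?}", hard_match(7, &[3, 7, 9]));
    println!("soft (T=1.0):  {:?}", soft_match(&[0.1, 2.0, 0.3], 1.0));
    println!("soft (T=0.05): {:?}", soft_match(&[0.1, 2.0, 0.3], 0.05)); // nearly one-hot
}
```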
Soli Deo Gloria