Transformers and Attention — Brief ☧
Deep version → | Related: Neural Nets → | Embeddings →
"The hearing ear, and the seeing eye, the LORD hath made even both of them."
— Proverbs 20:12 (KJV)
Q: When you read a long sentence, you do not give every word equal
attention. In "The bank by the river was eroding," your brain
automatically focuses on "river" to figure out which meaning of "bank"
applies — riverbank, not financial institution. How could a machine
learn to do this?
A: By computing attention scores. For each word, the model asks:
"which other words in this sentence are most relevant to understanding
me?" It assigns high scores to relevant words and low scores to
irrelevant ones, then combines information weighted by those scores.
The word "bank" attends strongly to "river" and barely to "the."
Q: Earlier networks processed words one at a time, left to right.
What changed?
A: The transformer architecture. Instead of reading sequentially,
it lets every word attend to every other word simultaneously — in
parallel. This is vastly faster and captures long-range connections
that sequential models miss. "In the beginning was the Word" and "the
Word was made flesh" fourteen verses later — a transformer connects them
directly, while a sequential model might forget the first by the time
it reaches the fourteenth.
Q: How does the attention mechanism actually work?
A: Three components:
- Query (Q): "What am I looking for?" — the word that needs context
- Key (K): "What does each position offer?" — a label for each word
- Value (V): "What information is here?" — the actual content
The model computes a similarity score between the Query and every Key,
applies softmax to get weights, then takes a weighted sum of Values.
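To make those three steps concrete, here is a minimal Rust sketch for a single query. The two-dimensional vectors standing in for "bank", "the", and "river" are made-up numbers chosen only to show the arithmetic; real models use learned, high-dimensional embeddings and learned Q/K/V projections.

```rust
// Single-query attention sketch: "bank" looking for context among "the" and "river".
// All vectors are made-up 2-D illustrations, not real embeddings.

fn softmax(scores: &[f64]) -> Vec<f64> {
    // Subtract the max before exponentiating, for numerical stability.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Query for "bank": what am I looking for?
    let query = [1.0, 0.0];
    // Keys for "the" and "river": what does each position offer?
    let keys = [[0.1, 0.9], [0.9, 0.1]];
    // Values for "the" and "river": what information is there?
    let values = [[0.0, 0.5], [1.0, 0.0]];

    // 1. Similarity score between the query and every key (dot product).
    let scores: Vec<f64> = keys
        .iter()
        .map(|k| query.iter().zip(k).map(|(q, ki)| q * ki).sum())
        .collect();

    // 2. Softmax turns the scores into weights that sum to 1.
    let weights = softmax(&scores);

    // 3. Weighted sum of the values: "bank" draws mostly on "river".
    let mut output = [0.0, 0.0];
    for (w, v) in weights.iter().zip(values.iter()) {
        output[0] += w * v[0];
        output[1] += w * v[1];
    }

    println!("weights: {:?}", weights); // most of the weight lands on "river"
    println!("output:  {:?}", output);
}
```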
Q: "These [Bereans] received the word with all readiness of mind,
and searched the scriptures daily" (Acts 17:11). Were the Bereans
doing attention?
A: A wonderful analogy. For each claim Paul made (Query), they
actively searched for the most relevant passages (Keys) and retrieved
the content that matched (Values). They did not read passively — they
attended to the parts that mattered.
The Core Idea
Query: "What is the Word?" (what I'm looking for)
Keys: every word in the text (what each position offers)
Values: the meaning at each position (what to retrieve)
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
| Component | Plain Description | Purpose |
|---|---|---|
| Query (Q) | The word seeking context | What am I looking for? |
| Key (K) | Each position's label | What does each position offer? |
| Value (V) | Each position's content | What information is there? |
| Attention score | Query-Key similarity | How much to focus here |
| Output | Weighted sum of Values | Final contextual understanding |
The table above distills the attention mechanism into its essentials. The whole operation is a three-step process:
- First, compute how relevant each position is to your question: the dot product of the Query with each Key, scaled by sqrt(d_k) so the scores stay well-behaved as the vectors grow larger.
- Second, normalize those relevance scores into weights that sum to 1 (softmax).
- Third, use those weights to combine the actual information (the weighted sum of Values).
This is why attention is so powerful: it lets the network dynamically decide what to focus on for each word, rather than using the same fixed connections everywhere.
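Here are the same three steps in matrix form, matching the formula above, with one row per position. This is only a sketch in plain Rust: it leaves out the learned Q/K/V projection matrices, masking, batching, and the multiple attention heads a real transformer uses.

```rust
// Sketch of Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V.
// Matrices are Vec<Vec<f64>> with one row per position; real implementations
// use tensor libraries and add projections, masks, and multiple heads.

fn softmax_row(row: &[f64]) -> Vec<f64> {
    let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn attention(q: &[Vec<f64>], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let d_k = k[0].len() as f64;
    q.iter()
        .map(|q_row| {
            // 1. Scores: dot product of this query with every key, scaled by sqrt(d_k).
            let scores: Vec<f64> = k
                .iter()
                .map(|k_row| {
                    q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f64>() / d_k.sqrt()
                })
                .collect();
            // 2. Softmax turns the scores into weights that sum to 1.
            let weights = softmax_row(&scores);
            // 3. Output row: weighted sum of the value rows.
            let mut out = vec![0.0; v[0].len()];
            for (w, v_row) in weights.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(v_row) {
                    *o += w * x;
                }
            }
            out
        })
        .collect()
}

fn main() {
    // Self-attention: Q, K, and V all come from the same three-position input.
    let x = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    println!("{:?}", attention(&x, &x, &x));
}
```

Calling it with the same matrix for Q, K, and V is self-attention: every position both asks and answers.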
The attention computation is fundamentally a soft key-value lookup. It resembles
a hash-table or dictionary lookup, except that instead of retrieving the single
entry whose key matches exactly, it retrieves a blend of every entry, weighted
by a learned similarity score. The transformer stacks many layers of attention,
building an increasingly abstract graph of relationships among all positions in
the input. Each layer adds another level of understanding: the first layer might
connect words that are grammatically related, the next might connect words that
are topically related, and deeper layers capture progressively more abstract
semantic relationships.
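A sketch of the stacking idea, reusing the attention function from the previous example. In a real transformer, each layer also applies learned Q/K/V projections, a feed-forward network, residual connections, and layer normalization; all of that is omitted here, so this shows only the shape of the composition.

```rust
// Stacking sketch: each pass of self-attention re-contextualizes the rows, and
// the next pass relates positions in terms of those already-contextualized rows.
// Uses the `attention` function sketched above; real layers add projections,
// feed-forward blocks, residual connections, and layer normalization.

fn stack(mut x: Vec<Vec<f64>>, num_layers: usize) -> Vec<Vec<f64>> {
    for _ in 0..num_layers {
        // Self-attention: every position queries every position of the current layer.
        x = attention(&x, &x, &x);
    }
    x
}
```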
Connection to our project: Our FPGA uses content-addressable memory (CAM)
for parallel key-value lookup — the hardware analog of attention. When
checking which domains satisfy a constraint, CAM scans all entries
simultaneously, just as attention attends to all positions in parallel. The key difference is that transformer attention uses soft, continuous weights (every position gets some fractional share of attention), while our CAM uses hard, binary matching (each entry either matches the query or it does not). This distinction mirrors the broader theme of our neurosymbolic architecture: the neural world operates in continuous probabilities, while the symbolic world operates in discrete true-or-false logic. The bridge between them, differentiable_chirho.rs with Gumbel-softmax, is what lets gradients flow through our binary hardware.
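A sketch of that contrast, assuming nothing about the actual differentiable_chirho.rs code: a CAM-style lookup gives each entry a weight of exactly 0 or 1, a softmax gives every entry a fractional weight, and lowering the softmax temperature pushes the soft weights toward the hard ones. Gumbel-softmax additionally adds sampled noise to the scores so the near-hard choice stays stochastic yet differentiable; the noise is omitted here for brevity.

```rust
// Hard CAM-style matching versus soft attention-style weighting.
// Illustrative only; not the project's differentiable_chirho.rs implementation.
// Gumbel-softmax would also add Gumbel noise to the scores before the softmax.

fn hard_match(query: u32, entries: &[u32]) -> Vec<f64> {
    // CAM-style: weight 1.0 where the entry equals the query, 0.0 everywhere else.
    entries.iter().map(|e| if *e == query { 1.0 } else { 0.0 }).collect()
}

fn soft_match(scores: &[f64], temperature: f64) -> Vec<f64> {
    // Attention-style: softmax over similarity scores; every entry gets a share.
    // As the temperature approaches 0, the weights approach the hard one-hot case.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores
        .iter()
        .map(|s| ((s - max) / temperature).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    println!("hard:          {:?}", hard_match(7, &[3, 7, 9]));
    println!("soft (T=1.0):  {:?}", soft_match(&[0.1, 2.0, 0.3], 1.0));
    println!("soft (T=0.05): {:?}", soft_match(&[0.1, 2.0, 0.3], 0.05)); // nearly one-hot
}
```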
Soli Deo Gloria