5.8

Transformers

Self-attention, multi-head attention, positional encoding, GPT/BERT.

Transformers and Attention — Brief ☧



"The hearing ear, and the seeing eye, the LORD hath made even both of them."

— Proverbs 20:12 (KJV)

Q: When you read a long sentence, you do not give every word equal attention. In "The bank by the river was eroding," your brain automatically focuses on "river" to figure out which meaning of "bank" applies — riverbank, not financial institution. How could a machine learn to do this?

A: By computing attention scores. For each word, the model asks: "which other words in this sentence are most relevant to understanding me?" It assigns high scores to relevant words and low scores to irrelevant ones, then combines information weighted by those scores. The word "bank" attends strongly to "river" and barely to "the."

Q: Earlier networks processed words one at a time, left to right. What changed?

A: The transformer architecture. Instead of reading sequentially, it lets every word attend to every other word simultaneously — in parallel. This is vastly faster and captures long-range connections that sequential models miss. "In the beginning was the Word" and "the Word was made flesh" fourteen verses later — a transformer connects them directly, while a sequential model might forget the first by the time it reaches the fourteenth.

Q: How does the attention mechanism actually work?

A: Three components:

  • Query (Q): "What am I looking for?" — the word that needs context
  • Key (K): "What does each position offer?" — a label for each word
  • Value (V): "What information is here?" — the actual content

The model computes a similarity score between the Query and every Key, applies softmax to get weights, then takes a weighted sum of Values.
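The Query-Key-Value recipe can be sketched in a few lines of plain Python. This is a toy illustration: the 2-d vectors below are made up for the "bank"/"river" example, not learned embeddings.

```python
# Toy scaled dot-product attention in pure Python (no libraries).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d_k = len(query)
    # Similarity of the Query to every Key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)          # attention weights, sum to 1
    # Weighted sum of Value vectors
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# "bank" as the Query; Keys/Values for "the", "river", "bank"
query  = [1.0, 0.2]
keys   = [[0.1, 0.0], [0.9, 0.3], [1.0, 0.2]]
values = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
weights, context = attention(query, keys, values)
# "river" and "bank" receive far more weight than "the"
```

The output vector `context` is a blend of all Values, dominated by the positions whose Keys resemble the Query.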

Q: "These [Bereans] received the word with all readiness of mind, and searched the scriptures daily" (Acts 17:11). Were the Bereans doing attention?

A: A wonderful analogy. For each claim Paul made (Query), they actively searched for the most relevant passages (Keys) and retrieved the content that matched (Values). They did not read passively — they attended to the parts that mattered.

The Core Idea

Query:   "What is the Word?"    (what I'm looking for)
Keys:    every word in the text  (what each position offers)
Values:  the meaning at each position  (what to retrieve)

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Component         Plain Description            Purpose
---------------   --------------------------   ------------------------------
Query (Q)         The word seeking context     What am I looking for?
Key (K)           Each position's label        What does each position offer?
Value (V)         Each position's content      What information is there?
Attention score   Query-Key similarity         How much to focus here
Output            Weighted sum of Values       Final contextual understanding

The table above distills the attention mechanism into its essentials. Notice that the whole operation is a three-step process: first, compute how relevant each position is to your question (Query times Key); second, normalize those relevance scores into weights that sum to 1 (softmax); third, use those weights to combine the actual information (weighted sum of Values). This is why attention is so powerful -- it lets the network dynamically decide what to focus on for each word, rather than using the same fixed connections everywhere.

The attention computation is fundamentally a lookup — like searching a hash table where the "hash" is a learned similarity score rather than a fixed hash function. The transformer stacks many layers of attention, building an increasingly abstract graph of relationships among all positions in the input. Each layer of attention adds another level of understanding: the first layer might connect words that are grammatically related, the second might connect words that are topically related, and deeper layers capture increasingly abstract semantic relationships.
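The soft-lookup analogy can be made concrete. A hash table returns exactly one value per key; attention returns a similarity-weighted blend of all values. The scores below are made up for illustration, standing in for learned similarities.

```python
# Hard lookup vs. soft lookup.
import math

table = {"river": 1.0, "money": -1.0, "the": 0.0}  # word -> toy "meaning" score

# Hard lookup: one key, one value
hard = table["river"]

# Soft lookup: the query's similarity to each key decides that key's share
similarity = {"river": 2.0, "money": 0.1, "the": -1.0}  # illustrative scores
exps = {k: math.exp(s) for k, s in similarity.items()}
z = sum(exps.values())
soft = sum((exps[k] / z) * table[k] for k in table)
# soft is dominated by "river" but still mixes in the other entries
```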

Connection to our project: Our FPGA uses content-addressable memory (CAM) for parallel key-value lookup — the hardware analog of attention. When checking which domains satisfy a constraint, CAM scans all entries simultaneously, just as attention attends to all positions in parallel.

The key difference is that transformer attention uses soft, continuous weights (every position gets some fractional share of attention), while our CAM uses hard, binary matching (each entry either matches the query or it does not). This distinction mirrors the broader theme of our neurosymbolic architecture: the neural world operates in continuous probabilities, while the symbolic world operates in discrete true-or-false logic. The bridge between them -- differentiable_chirho.rs with Gumbel-softmax -- is what lets gradients flow through our binary hardware.
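The general Gumbel-softmax idea can be sketched in plain Python (this is not the code in differentiable_chirho.rs, just an illustration of the technique; the entry names and logits are invented):

```python
# Hard CAM-style matching vs. a Gumbel-softmax relaxation.
import math, random

random.seed(0)

def gumbel_softmax(logits, tau):
    # Add Gumbel noise (-log(-log U)), then softmax with temperature tau
    noisy = [l - math.log(-math.log(random.random())) for l in logits]
    scaled = [n / tau for n in noisy]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

entries = ["dom_a", "dom_b", "dom_c"]       # hypothetical CAM entries
logits = [3.0, 0.5, -1.0]                    # match quality per entry (illustrative)

# Hard match: each entry is either selected or not -- no gradient
hard = [1.0 if l == max(logits) else 0.0 for l in logits]

# Soft match: continuous weights gradients can flow through;
# as tau -> 0 the soft weights approach the hard one-hot match
soft = gumbel_softmax(logits, tau=0.5)
```

Lowering `tau` sharpens the soft weights toward the binary selection the hardware actually performs, which is what makes the relaxation a usable training-time stand-in for hard matching.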

Learn more in the deep version

Related: Embeddings | Generative Models


Soli Deo Gloria
