5.9 Embeddings

Word2vec, GloVe, contextual embeddings, relation to tensor space.

Embeddings — Brief ☧



"The hearing ear, and the seeing eye, the LORD hath made even both of them."

— Proverbs 20:12 (KJV)

Q: Computers work with numbers, but words are not numbers. How does a machine understand that "happy" and "joyful" are similar while "happy" and "volcano" are not?

A: By representing each word as an array of numbers — a vector. The trick is that the position of the word in this number space encodes its meaning. Words with similar meanings end up as nearby vectors; unrelated words end up far apart. This mapping from a word (or image, or sentence) to a numeric vector is called an embedding.

Q: Can you give a concrete example?

A: Imagine each word gets 300 numbers. After training on millions of sentences, the model might produce:

  • "mercy" -> [0.8, 0.2, 0.9, -0.1, ...]
  • "grace" -> [0.7, 0.3, 0.8, -0.2, ...] (nearby — similar meaning)
  • "stone" -> [-0.3, 0.9, -0.1, 0.5, ...] (far away — different meaning)

You measure similarity by the angle between vectors: cosine similarity. "mercy" and "grace" might score 0.95 (very similar); "mercy" and "stone" might score 0.12 (very different).

Q: Does this only work for single words?

A: Not at all. You can embed entire sentences, paragraphs, images, even audio clips. A sentence embedding of "The kingdom of heaven is like a mustard seed" would be close to "Small beginnings lead to great things" and far from "Tax collection procedures in Rome."
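A minimal sketch of how this looks in code, assuming the third-party sentence-transformers package and its public all-MiniLM-L6-v2 checkpoint (both are assumptions for illustration, not part of this project):

from sentence_transformers import SentenceTransformer

# Load a small public sentence-embedding model (384-dimensional vectors).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The kingdom of heaven is like a mustard seed",
    "Small beginnings lead to great things",
    "Tax collection procedures in Rome",
]
# normalize_embeddings=True returns unit vectors, so a dot product IS cosine similarity.
vectors = model.encode(sentences, normalize_embeddings=True)   # shape: (3, 384)

print(vectors[0] @ vectors[1])   # relatively high: related themes
print(vectors[0] @ vectors[2])   # relatively low: unrelated topic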

Q: Jesus taught in parables — compact stories carrying deep meaning. Is an embedding like a parable?

A: In a sense. Both encode complex meaning into a compact form. A parable compresses spiritual truth into a simple story; an embedding compresses semantic meaning into a fixed-size vector. And just as you can compare two parables by their themes, you can compare two embeddings by their cosine similarity.

The Core Idea

"mercy"   -> [0.8, 0.2, 0.9, -0.1, ...]   (300 dimensions)
"grace"   -> [0.7, 0.3, 0.8, -0.2, ...]   (nearby — similar meaning)
"stone"   -> [-0.3, 0.9, -0.1, 0.5, ...]  (far away — different meaning)

cosine_similarity("mercy", "grace") = 0.95  (very similar)
cosine_similarity("mercy", "stone") = 0.12  (very different)
Embedding Type   Input             Output               Key Model
Word             Single word       300-D vector         Word2Vec, GloVe
Contextual       Word in context   768-D vector         BERT, GPT
Sentence         Full sentence     384-768-D vector     Sentence-BERT
Image            Pixel grid        512-2048-D vector    CLIP, ResNet

Notice the progression in this table. Early embedding models like Word2Vec assign each word a single fixed vector regardless of context. But the word "bank" in "river bank" means something completely different from "bank" in "bank account." Contextual models like BERT solved this by producing different vectors for the same word depending on what surrounds it. And sentence-level models take this even further, embedding an entire paragraph or document into a single vector that captures its overall meaning. This evolution -- from word-level to context-aware to sentence-level -- is one of the most important trends in modern NLP.
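A minimal sketch of the contextual case, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (both are assumptions for illustration, not part of this project). The same word "bank" comes out as two different vectors in two different sentences:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the 768-D contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

river = bank_vector("He sat on the river bank and watched the water.")
money = bank_vector("She deposited the check at the bank on Friday.")
print(torch.cosine_similarity(river, money, dim=0).item())    # noticeably below 1.0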

Under the hood, building an embedding is an algorithm that iterates over a massive corpus of text (or images), adjusting the vector for each symbol until similar items cluster together. You can think of the resulting embedding space as a high-dimensional graph where nearby nodes share meaning and distance encodes difference. The beauty of embeddings is that once you have them, many tasks become simple geometry: finding similar documents is a nearest-neighbor search, detecting topics is clustering, and answering "is A related to B?" is measuring the angle between two vectors.
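A minimal sketch of that geometry in plain NumPy, using random stand-ins for real document embeddings: finding the most similar document is just asking which row points in nearly the same direction as the query.

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 300))            # stand-in for 1,000 document embeddings
query = rng.normal(size=300)                   # stand-in for the query's embedding

# Normalize so that a dot product equals cosine similarity.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

scores = docs_unit @ query_unit                # cosine similarity to every document
best = int(np.argmax(scores))                  # nearest neighbor = most similar document
print(best, scores[best])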

Connection to our project: Our system uses several forms of embedding, each suited to a different purpose. Hash-consing maps terms to integer IDs -- a discrete, lossless embedding that preserves exact identity. Our godel_latent_chirho.py maps terms to bit vectors over (position, value) pairs -- a sparse binary embedding that captures structural information about where each value appears in a pattern. And the neurosymbolic bridge (differentiable_chirho.rs) relaxes these discrete representations to continuous embeddings that gradients can flow through. You can think of our hierarchical domain bitmasks as a kind of binary embedding: a 64-bit Level 2 domain summarizes 262,144 possible values, and computing the similarity between two domains is as simple as performing a bitwise AND and counting the result (popcount). This is our version of cosine similarity -- fast, exact, and hardware-friendly.
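As a sketch of that bitwise similarity (the mask values below are made up for illustration; the real Level 2 domain layout lives in the project's own code):

def domain_overlap(a: int, b: int) -> int:
    """Count the bits set in both domain masks: bitwise AND followed by popcount."""
    return (a & b).bit_count()       # Python 3.10+; use bin(a & b).count("1") on older versions

# Hypothetical Level 2 domain masks: each set bit marks a block of values still possible.
domain_x = 0b1011_0110_0000_1111
domain_y = 0b1001_0100_1100_0111

print(domain_overlap(domain_x, domain_y))   # number of blocks the two domains share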

Learn more in the deep version

Related: Transformers | Generative Models


Soli Deo Gloria

Self-Check 1/1

king - man + woman ≈ _____ in embedding space.