Unsupervised Learning — Brief ☧
"The hearing ear, and the seeing eye, the LORD hath made even both of them."
— Proverbs 20:12 (KJV)
Deep version → | Related: Supervised → | Neural Networks →
Q: Imagine you dump a thousand unlabeled photos on a table. Nobody
tells you what is in them. But you start naturally sorting: these have
blue skies, those are indoors, these have faces. You found groups
without anyone telling you the categories. How?
A: By noticing patterns — similarities and differences — in the data
itself. This is unsupervised learning: discovering structure in data
that has no labels. The algorithm
must find groupings, patterns, and compressed representations on its own.
Q: What kinds of structure can an algorithm discover?
A: Three main kinds:
- Clusters (K-Means, DBSCAN): groups of similar items — "these photos all show beaches, those all show mountains"
- Dimensions (PCA, t-SNE): the few traits that explain most of the variation — "brightness and color temperature explain 80% of the differences"
- Representations (Autoencoders): compressed summaries that preserve the essential information — like shrinking a photo to a thumbnail that still captures the scene
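To make the "dimensions" item concrete, here is a minimal PCA sketch in plain NumPy: center the data, take the eigenvectors of the covariance matrix, and read off how much variation each direction explains. The toy dataset is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points that mostly vary along one diagonal direction.
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                  # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # largest variance first
explained = eigvals[order] / eigvals.sum()

# The first principal axis should explain most of the variation.
print(f"variance explained by PC1: {explained[0]:.0%}")
```

Projecting onto the top eigenvectors (`Xc @ eigvecs[:, order[:k]]`) gives the reduced data; here one axis already captures nearly all of the spread.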
Q: How is this different from supervised learning?
A: Supervised learning has a teacher saying "this is a beach, this is
a mountain" for every example. Unsupervised learning has no teacher — it
finds the groupings by measuring similarity among the data points
themselves.
Q: In the Parable of the Wheat and Tares (Matthew 13:24-30), the
reapers separate wheat from tares at harvest without anyone labeling
each plant. Is that unsupervised learning?
A: Yes. The reapers observe natural differences — wheat has heavy,
golden heads; tares are lighter and darker. They find clusters in
the data without prior labels, just as K-Means groups data points by
proximity.
Key Methods
| Method | What It Finds | How It Works |
|---|---|---|
| K-Means | K clusters by proximity | Assign each point to the nearest center, then update centers; repeat until the centers stop moving |
| DBSCAN | Dense clusters of any shape | Find points with many neighbors; expand outward |
| PCA | Principal axes of variation | Find the directions along which data varies most |
| t-SNE | 2D visualization of high-dimensional data | Map each high-dimensional point to two dimensions, preserving local neighborhoods |
| Autoencoder | Compressed representation | Encode input to a small latent space, then decode — like compressing a file and reconstructing it |
The key insight from this table is that unsupervised learning is not one technique but a family of approaches, each answering a different question about your data. Clustering asks "what groups naturally exist?" Dimensionality reduction asks "what are the few key factors that explain most of the variation?" And representation learning asks "how can I compress this data while preserving its essence?" In practice, these often work together: you might use PCA to reduce your data to manageable dimensions, then run K-Means on the reduced data to find clusters.
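The assign-then-update loop from the table can be sketched in a few lines of NumPy. The two-blob dataset, `k=2`, and the fixed iteration count are assumptions for illustration, not part of the original.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's algorithm: assign points to nearest center, recompute centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: index of the nearest center for each point.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = X[labels == j]
            # Keep the old center if a cluster temporarily goes empty.
            new_centers.append(members.mean(axis=0) if len(members) else centers[j])
        centers = np.array(new_centers)
    return labels, centers

# Two well-separated blobs ("beaches" and "mountains").
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
```

On data this cleanly separated, each blob ends up with a single label; real data is messier, which is where DBSCAN's density-based approach from the table earns its place.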
Think of clustering as building a tree of groups: broad categories at the top, finer sub-groups below. Dimensionality reduction helps you visualize relationships among thousands of data points that would otherwise be impossible to see. And autoencoders learn to distill the data to its essence: the smallest representation from which the original can be approximately recovered.
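The "learn to distill" idea can be sketched as a tiny linear autoencoder trained by gradient descent: encode 2-D points to a 1-D code, decode back, and watch the reconstruction error fall. The data, layer sizes, and learning rate are all invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: essentially one underlying factor plus a little noise.
t = rng.normal(size=(200, 1))
X = np.hstack([t, 0.8 * t]) + 0.05 * rng.normal(size=(200, 2))

# Tiny linear autoencoder: encode 2 -> 1, decode 1 -> 2.
W_enc = rng.normal(scale=0.1, size=(2, 1))
W_dec = rng.normal(scale=0.1, size=(1, 2))

def loss(X, W_enc, W_dec):
    """Mean squared reconstruction error."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

lr = 0.05
initial = loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc             # latent code (the "thumbnail")
    R = Z @ W_dec             # reconstruction from the code
    G = 2 * (R - X) / len(X)  # gradient of the error w.r.t. R
    W_dec -= lr * (Z.T @ G)
    W_enc -= lr * (X.T @ (G @ W_dec.T))
final = loss(X, W_enc, W_dec)
```

A purely linear autoencoder like this one ends up spanning the same subspace PCA finds; the deep, nonlinear versions used in practice can capture far richer structure.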
Connection to our project: Our domain hierarchy is unsupervised compression.
262,144 domain values are compressed to 4,096 summary bits, then to 64 bits.
This 4096:1 compression is like an autoencoder, but exact (lossless via bitmasks) rather than approximate. Where an autoencoder must learn its compression through training on examples, our hierarchical OR-based compression is derived directly from the data structure itself; no training is needed. A bit at Level 1 is set to 1 if and only if the corresponding 64-bit word at Level 0 contains at least one set bit. The summary is therefore perfectly faithful: you never lose information about which values are possible, you just gain a faster way to check whether a region of the domain is worth examining in detail.
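The Level-0 to Level-1 step can be sketched at a reduced scale (64 words rather than 4,096; names are illustrative, and the real structure adds a Level 2 over Level 1 in the same way):

```python
def summarize(words):
    """Build a summary bitmask: bit i is 1 iff word i is nonzero (an OR over its bits)."""
    summary = 0
    for i, w in enumerate(words):
        if w != 0:
            summary |= 1 << i
    return summary

# Level 0: 64 words of 64 bits each = 4,096 domain bits.
level0 = [0] * 64
level0[3] = 1 << 17               # one possible value somewhere in word 3
level0[40] = (1 << 5) | (1 << 6)  # two possible values in word 40

level1 = summarize(level0)        # one 64-bit summary of 4,096 bits

# Fast check: skip word 10 entirely, descend into word 3 for detail.
assert level1 >> 3 & 1 == 1
assert level1 >> 10 & 1 == 0
```

The summary bit answers "is anything possible in this word?" in one test, so a search only touches Level-0 words whose Level-1 bit is set.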
Soli Deo Gloria