MCMC and Hamiltonian Monte Carlo — Brief ☧
"In whom are hid all the treasures of wisdom and knowledge."
— Colossians 2:3 (KJV)
Deep version → | Back to Gumbel-Softmax →
Q: Imagine you want to estimate the average height of people in a
huge city, but you cannot measure everyone. Instead, you wander through
different neighborhoods, measuring people as you go. If you wander
long enough and visit neighborhoods in proportion to their population,
your measurements will give you a good estimate. But how do you ensure
you visit the right neighborhoods in the right proportions?
A: That is the idea behind Markov Chain Monte Carlo (MCMC).
You construct a specific random walk — a Markov chain — that is
mathematically guaranteed to visit each "neighborhood" (each region
of the probability space) in proportion to its probability. Walk
long enough, and your collected samples will faithfully represent the
distribution you care about, even if that distribution is incredibly
complicated and you cannot compute it directly.
The most common MCMC algorithm is
Metropolis-Hastings: at each step, propose a random move. If the
new location has higher probability, accept it. If lower, accept it
only sometimes, with probability equal to the ratio of the new
probability to the old. This simple rule ensures the walk converges
to the target distribution.
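The accept/reject rule can be sketched in a few lines of Python. This is a minimal 1-D illustration, not our project code; the target here is a standard normal, chosen only for demonstration:

```python
import math
import random

random.seed(0)  # reproducible demo

def metropolis_hastings(log_prob, x0, n_steps, step_size=1.0):
    """Random-walk Metropolis: propose a move, accept with prob min(1, p_new / p_old)."""
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + random.gauss(0.0, step_size)  # symmetric random proposal
        # Compare log-probabilities to avoid overflow:
        # log(u) < log p(proposal) - log p(x)  iff  u < p(proposal) / p(x).
        if math.log(random.random()) < log_prob(proposal) - log_prob(x):
            x = proposal
        samples.append(x)
    return samples

# Target: standard normal, known only up to a constant: log p(x) = -x^2/2 + C
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=50_000)
```

Notice that only the ratio of probabilities is ever needed, so the (often intractable) normalizing constant cancels out. That cancellation is what makes MCMC usable when the distribution "cannot be computed directly."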
Q: Random wandering sounds inefficient. If the space is large, you
could spend a lot of time stumbling around low-probability areas.
A: Exactly right — and that is why Hamiltonian Monte Carlo (HMC)
was invented. Instead of random stumbling, HMC gives the "walker"
momentum, like a ball rolling on a landscape. The ball naturally follows
the contours of the terrain — rolling quickly through flat regions and
spending more time in the valleys (high-probability areas). It uses
the gradient (the slope of the probability surface) to guide its
path, computed automatically via autodiff (automatic differentiation).
Think of it this way: standard MCMC is like exploring a dark cave by
randomly bumping into walls. HMC is like exploring the same cave with
a flashlight that shows the slope of the floor — you can make much
smarter moves.
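The "ball with momentum" picture corresponds to the leapfrog integrator below. Again a minimal 1-D sketch with a hand-coded gradient; in practice the gradient comes from autodiff, and real implementations tune the step size and trajectory length:

```python
import math
import random

random.seed(1)  # reproducible demo

def hmc_step(log_prob, grad_log_prob, x, step_size=0.2, n_leapfrog=10):
    """One HMC transition: give the walker momentum, roll it along the
    gradient with the leapfrog integrator, then Metropolis-correct."""
    p = random.gauss(0.0, 1.0)                        # fresh random momentum
    x_new, p_new = x, p + 0.5 * step_size * grad_log_prob(x)
    for i in range(n_leapfrog):                       # follow the trajectory
        x_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new += step_size * grad_log_prob(x_new)
    p_new += 0.5 * step_size * grad_log_prob(x_new)
    # Accept/reject on the total energy H = -log p(x) + p^2 / 2
    h_old = -log_prob(x) + 0.5 * p * p
    h_new = -log_prob(x_new) + 0.5 * p_new * p_new
    return x_new if math.log(random.random()) < h_old - h_new else x

# Target: standard normal; the gradient of log p(x) = -x^2/2 is simply -x.
x, samples = 0.0, []
for _ in range(5_000):
    x = hmc_step(lambda v: -0.5 * v * v, lambda v: -v, x)
    samples.append(x)
```

Each transition travels a long distance through the space (ten leapfrog steps here) yet is still accepted with high probability, because the trajectory approximately conserves energy. That is the source of HMC's efficiency over random-walk proposals.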
Q: How does this relate to searching through possible solutions in
our constraint system?
A: When we relax our discrete bitmask domains
into continuous probabilities (using Gumbel-Softmax), HMC can efficiently
explore the space of possible solutions by following the gradient of the
constraint-satisfaction objective. The gradient tells the sampler "which
direction makes the constraints more satisfied."
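To make that concrete, here is a toy sketch. It is illustrative only: the names `collision` and `grad` are invented for this example, a finite-difference gradient stands in for the autodiff our real code uses, and the Gumbel noise is omitted so the result is deterministic. Two relaxed variables should take different values; the gradient of the soft violation tells each variable which direction "makes the constraints more satisfied":

```python
import math

def softmax(logits, tau=0.5):
    """Temperature-scaled softmax: the deterministic core of the
    Gumbel-Softmax relaxation (Gumbel noise omitted for clarity)."""
    m = max(l / tau for l in logits)
    exps = [math.exp(l / tau - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def collision(logits1, logits2):
    """Soft violation of an all-different constraint:
    the probability that both relaxed variables pick the same value."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return sum(a * b for a, b in zip(p1, p2))

def grad(f, logits, eps=1e-5):
    """Finite-difference gradient, a stand-in for autodiff."""
    out = []
    for i in range(len(logits)):
        bumped = list(logits)
        bumped[i] += eps
        out.append((f(bumped) - f(logits)) / eps)
    return out

logits1 = [0.0, 0.0, 0.0]   # variable 1: undecided among three values
logits2 = [2.0, 0.0, 0.0]   # variable 2: strongly prefers value 0
g = grad(lambda l: collision(l, logits2), logits1)
# g[0] > 0: raising variable 1's preference for value 0 increases the
# violation, so a gradient step pushes it toward values 1 and 2 instead.
```

Because the relaxed violation is a smooth function of the logits, a gradient-based sampler can descend it, which is impossible on the original discrete bitmask domains.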
Key Concepts
| Concept | Meaning | Our Project |
|---|---|---|
| MCMC | Random walk sampling | Explore solution space |
| Metropolis-Hastings | Accept/reject proposals | Filter candidate domains |
| HMC | Gradient-guided sampling | Differentiable domain relaxation |
| Autodiff | Automatic gradient computation | differentiable_chirho.py |
The progression in this table, from MCMC (random exploration) to HMC (gradient-guided exploration), mirrors the evolution of our own system from brute-force search to differentiable constraint solving. MCMC says "try random moves and keep the good ones." HMC says "use the gradient to make informed moves." The practical difference is enormous: in high-dimensional spaces (which is where real constraint problems live), random walks mix slowly and get stuck in corners, while gradient-guided moves follow the natural contours of the solution landscape and converge much faster.
Connection to our system: When discrete constraint propagation gets stuck, meaning the search tree has been pruned to a difficult region where no obvious next step exists, we can relax the problem into a continuous landscape and use HMC to explore it.
Here is how it works in practice: the FPGA computes constraint violations for a candidate assignment (this is the "potential energy" in HMC's physics analogy), the Gumbel-Softmax bridge relaxes the discrete domains into continuous probabilities, and HMC uses the gradient of the constraint-violation landscape to propose moves that tend toward satisfying assignments. The FPGA provides fast constraint evaluation at each step (770,000 evaluations per second), while HMC provides smart navigation of the space of possible solutions. This is exactly the approach Edward Kmett suggested: use the symbolic solver's energy landscape as the potential energy in an HMC sampler, with a neural network parameterizing the kinetic energy to learn efficient proposal distributions.
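The energy mapping itself is the essential step: low violation means high probability. The sketch below illustrates it with a toy quadratic `violation` standing in for the FPGA's constraint evaluator (an assumption made for this example), driven by unadjusted Langevin dynamics, a stripped-down single-step cousin of HMC that keeps the gradient guidance but drops the momentum and accept step. None of this is our production code:

```python
import math
import random

random.seed(2)  # reproducible demo

def violation(x):
    """Toy stand-in for the FPGA's constraint evaluator: the potential
    energy U(x), which is zero where the constraints are satisfied."""
    return (x - 3.0) ** 2       # pretend x = 3 satisfies everything

def grad_violation(x):
    return 2.0 * (x - 3.0)

# Target distribution p(x) proportional to exp(-U(x)): probability mass
# concentrates on low-violation assignments.
def langevin_step(x, step_size=0.05):
    noise = random.gauss(0.0, math.sqrt(2.0 * step_size))
    return x - step_size * grad_violation(x) + noise

x, samples = 0.0, []
for _ in range(5_000):
    x = langevin_step(x)
    samples.append(x)
# The chain drifts toward x = 3 (zero violation) and hovers around it,
# still exploring nearby states rather than collapsing to a single point.
```

The same mapping carries over to the full HMC version: `violation` becomes the potential energy, its gradient steers the leapfrog trajectory, and the sampler spends most of its time among near-satisfying assignments while retaining the ability to escape local traps.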
Soli Deo Gloria