MCMC and Hamiltonian Monte Carlo — Brief ☧
"In whom are hid all the treasures of wisdom and knowledge."
— Colossians 2:3 (KJV)
Deep version → | Back to Gumbel-Softmax →
Q: Imagine you want to estimate the average height of people in a
huge city, but you cannot measure everyone. Instead, you wander through
different neighborhoods, measuring people as you go. If you wander
long enough and visit neighborhoods in proportion to their population,
your measurements will give you a good estimate. But how do you ensure
you visit the right neighborhoods in the right proportions?
A: That is the idea behind Markov Chain Monte Carlo (MCMC).
You construct a specific random walk — a Markov chain — that is
mathematically guaranteed to visit each "neighborhood" (each region
of the probability space) in proportion to its probability. Walk
long enough, and your collected samples will faithfully represent the
distribution you care about, even if that distribution is incredibly
complicated and you cannot compute it directly.
The most common MCMC algorithm is
Metropolis-Hastings: at each step, propose a random move. If the
new location has higher probability, accept it. If lower, accept it
only sometimes, with probability equal to the ratio of the new
probability to the old. This simple rule ensures the walk converges
to the target distribution.
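The accept/reject rule can be sketched in a few lines of Python. This is a minimal 1-D illustration, not our project code; the target here is a standard normal, chosen only for demonstration:

```python
import math
import random

random.seed(0)  # reproducible demo

def metropolis_hastings(log_prob, x0, n_steps, step_size=1.0):
    """Random-walk Metropolis: propose a move, accept with prob min(1, p_new / p_old)."""
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + random.gauss(0.0, step_size)  # symmetric random proposal
        # Compare log-probabilities to avoid overflow:
        # log(u) < log p(proposal) - log p(x)  iff  u < p(proposal) / p(x).
        if math.log(random.random()) < log_prob(proposal) - log_prob(x):
            x = proposal
        samples.append(x)
    return samples

# Target: standard normal, known only up to a constant: log p(x) = -x^2/2 + C
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=50_000)
```

Notice that only the ratio of probabilities is ever needed, so the (often intractable) normalizing constant cancels out. That cancellation is what makes MCMC usable when the distribution "cannot be computed directly."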
Q: Random wandering sounds inefficient. If the space is large, you
could spend a lot of time stumbling around low-probability areas.
A: Exactly right — and that is why Hamiltonian Monte Carlo (HMC)
was invented. Instead of random stumbling, HMC gives the "walker"
momentum, like a ball rolling on a landscape. The ball naturally follows
the contours of the terrain — rolling quickly through flat regions and
spending more time in the valleys (high-probability areas). It uses
the gradient (the slope of the probability surface) to guide its
path, computed automatically via autodiff (automatic differentiation).
Think of it this way: standard MCMC is like exploring a dark cave by
randomly bumping into walls. HMC is like exploring the same cave with
a flashlight that shows the slope of the floor — you can make much
smarter moves.
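The "ball with momentum" picture corresponds to the leapfrog integrator below. Again a minimal 1-D sketch with a hand-coded gradient; in practice the gradient comes from autodiff, and real implementations tune the step size and trajectory length:

```python
import math
import random

random.seed(1)  # reproducible demo

def hmc_step(log_prob, grad_log_prob, x, step_size=0.2, n_leapfrog=10):
    """One HMC transition: give the walker momentum, roll it along the
    gradient with the leapfrog integrator, then Metropolis-correct."""
    p = random.gauss(0.0, 1.0)                        # fresh random momentum
    x_new, p_new = x, p + 0.5 * step_size * grad_log_prob(x)
    for i in range(n_leapfrog):                       # follow the trajectory
        x_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new += step_size * grad_log_prob(x_new)
    p_new += 0.5 * step_size * grad_log_prob(x_new)
    # Accept/reject on the total energy H = -log p(x) + p^2 / 2
    h_old = -log_prob(x) + 0.5 * p * p
    h_new = -log_prob(x_new) + 0.5 * p_new * p_new
    return x_new if math.log(random.random()) < h_old - h_new else x

# Target: standard normal; the gradient of log p(x) = -x^2/2 is simply -x.
x, samples = 0.0, []
for _ in range(5_000):
    x = hmc_step(lambda v: -0.5 * v * v, lambda v: -v, x)
    samples.append(x)
```

Each transition travels a long distance through the space (ten leapfrog steps here) yet is still accepted with high probability, because the trajectory approximately conserves energy. That is the source of HMC's efficiency over random-walk proposals.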
Q: How does this relate to searching through possible solutions in
our constraint system?
A: When we relax our discrete bitmask domains
into continuous probabilities (using Gumbel-Softmax), HMC can efficiently
explore the space of possible solutions by following the gradient of the
constraint-satisfaction objective. The gradient tells the sampler "which
direction makes the constraints more satisfied."
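To make that concrete, here is a toy sketch. It is illustrative only: the names `collision` and `grad` are invented for this example, a finite-difference gradient stands in for the autodiff our real code uses, and the Gumbel noise is omitted so the result is deterministic. Two relaxed variables should take different values; the gradient of the soft violation tells each variable which direction "makes the constraints more satisfied":

```python
import math

def softmax(logits, tau=0.5):
    """Temperature-scaled softmax: the deterministic core of the
    Gumbel-Softmax relaxation (Gumbel noise omitted for clarity)."""
    m = max(l / tau for l in logits)
    exps = [math.exp(l / tau - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def collision(logits1, logits2):
    """Soft violation of an all-different constraint:
    the probability that both relaxed variables pick the same value."""
    p1, p2 = softmax(logits1), softmax(logits2)
    return sum(a * b for a, b in zip(p1, p2))

def grad(f, logits, eps=1e-5):
    """Finite-difference gradient, a stand-in for autodiff."""
    out = []
    for i in range(len(logits)):
        bumped = list(logits)
        bumped[i] += eps
        out.append((f(bumped) - f(logits)) / eps)
    return out

logits1 = [0.0, 0.0, 0.0]   # variable 1: undecided among three values
logits2 = [2.0, 0.0, 0.0]   # variable 2: strongly prefers value 0
g = grad(lambda l: collision(l, logits2), logits1)
# g[0] > 0: raising variable 1's preference for value 0 increases the
# violation, so a gradient step pushes it toward values 1 and 2 instead.
```

Because the relaxed violation is a smooth function of the logits, a gradient-based sampler can descend it, which is impossible on the original discrete bitmask domains.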
Key Concepts
| Concept | Meaning | Our Project |
|---|---|---|
| MCMC | Random walk sampling | Explore solution space |
| Metropolis-Hastings | Accept/reject proposals | Filter candidate domains |
| HMC | Gradient-guided sampling | Differentiable domain relaxation |
| Autodiff | Automatic gradient computation | differentiable_chirho.py |
The progression in this table, from MCMC (random exploration) to HMC (gradient-guided exploration), mirrors the evolution of our own system from brute-force search to differentiable constraint solving. MCMC says "try random moves and keep the good ones." HMC says "use the gradient to make informed moves." The practical difference is enormous: in high-dimensional spaces (which is where real constraint problems live), random walks mix slowly and get stuck in corners, while gradient-guided moves follow the natural contours of the solution landscape and converge much faster.
Connection to our system: When discrete constraint propagation gets stuck, meaning the search tree has been pruned to a difficult region where no obvious next step exists, we can relax the problem into a continuous landscape and use HMC to explore it.
Here is how it works in practice: the FPGA computes constraint violations for a candidate assignment (this is the "potential energy" in HMC's physics analogy), the Gumbel-Softmax bridge relaxes the discrete domains into continuous probabilities, and HMC uses the gradient of the constraint-violation landscape to propose moves that tend toward satisfying assignments. The FPGA provides fast constraint evaluation at each step (770,000 evaluations per second), while HMC provides smart navigation of the space of possible solutions. This is exactly the approach Edward Kmett suggested: use the symbolic solver's energy landscape as the potential energy in an HMC sampler, with a neural network parameterizing the kinetic energy to learn efficient proposal distributions.
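The energy mapping itself is the essential step: low violation means high probability. The sketch below illustrates it with a toy quadratic `violation` standing in for the FPGA's constraint evaluator (an assumption made for this example), driven by unadjusted Langevin dynamics, a stripped-down single-step cousin of HMC that keeps the gradient guidance but drops the momentum and accept step. None of this is our production code:

```python
import math
import random

random.seed(2)  # reproducible demo

def violation(x):
    """Toy stand-in for the FPGA's constraint evaluator: the potential
    energy U(x), which is zero where the constraints are satisfied."""
    return (x - 3.0) ** 2       # pretend x = 3 satisfies everything

def grad_violation(x):
    return 2.0 * (x - 3.0)

# Target distribution p(x) proportional to exp(-U(x)): probability mass
# concentrates on low-violation assignments.
def langevin_step(x, step_size=0.05):
    noise = random.gauss(0.0, math.sqrt(2.0 * step_size))
    return x - step_size * grad_violation(x) + noise

x, samples = 0.0, []
for _ in range(5_000):
    x = langevin_step(x)
    samples.append(x)
# The chain drifts toward x = 3 (zero violation) and hovers around it,
# still exploring nearby states rather than collapsing to a single point.
```

The same mapping carries over to the full HMC version: `violation` becomes the potential energy, its gradient steers the leapfrog trajectory, and the sampler spends most of its time among near-satisfying assignments while retaining the ability to escape local traps.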
Soli Deo Gloria