5.7

Reinforcement Learning

MDPs, Q-learning, policy gradients, actor-critic.

Reinforcement Learning — Brief ☧



"The hearing ear, and the seeing eye, the LORD hath made even both of them."

— Proverbs 20:12 (KJV)

Q: Think about how you learned to ride a bicycle. Nobody gave you a labeled dataset of "correct pedal positions." Instead, you tried things, fell over, adjusted, and tried again. Eventually, through hundreds of attempts and scraped knees, you learned balance. Is there a kind of machine learning that works this way?

A: Yes — reinforcement learning (RL). An agent takes actions in an environment, receives rewards (staying upright) or penalties (falling), and over many trials learns a policy: a strategy that maximizes cumulative reward over time.

Q: How is this different from supervised learning, where you have labeled examples?

A: In supervised learning, a teacher provides the correct answer for every example — "this photo shows a cat." In RL, there is no teacher — only consequences. You never get told the "correct" action; you only discover which actions lead to higher rewards through trial and error. The feedback is delayed and sparse: you might make fifty moves in chess before finding out whether you won or lost.
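
To make "cumulative reward" and delayed feedback concrete, here is a minimal Python sketch of the discounted return an agent tries to maximize. The reward numbers and the discount factor gamma below are made-up values for illustration, not from any real task.

    # Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    # The rewards are invented: small penalties while wobbling on the
    # bicycle, then a single delayed reward for staying upright.

    def discounted_return(rewards, gamma=0.9):
        """Sum future rewards, each weighted by gamma raised to its delay."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    rewards = [-1, -1, -1, +10]          # three wobbles, then success
    print(discounted_return(rewards))    # about 4.58: the late +10 still dominates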

Q: The Prodigal Son (Luke 15:11-32) left home, suffered, and learned to return. Is that reinforcement learning?

A: A vivid example. The son (agent) took actions (squandering, working with pigs), received penalties (hunger, shame) and eventually found the rewarding action (returning home). He had no teacher in the far country — only the consequences of his choices shaped his policy.

The RL Loop

      ┌──────────────────────────────┐
      │         Environment          │
      │      (the world / game)      │
      └──────┬────────────────▲──────┘
             │ state, reward  │
             ▼                │ action
      ┌───────────────────────┴──────┐
      │            Agent             │
      │        (the learner)         │
      └──────────────────────────────┘
            ...repeat at every step...
Concept        Bicycle Example                        Chess Example
Agent          The rider                              The player
Environment    The road and physics                   The board and rules
State          Current balance and speed              Current board position
Action         Lean left, pedal harder                Move knight to e5
Reward         Staying upright (+1) / falling (-1)    Win (+1) / lose (-1)
Policy         Learned balancing strategy             Learned opening and tactics

Look at this table and you will see that every RL problem has the same structure, regardless of whether the agent is riding a bicycle or playing chess. There is always a state (what the world looks like right now), an action (what the agent does next), and a reward (how well that worked out). The challenge that makes RL fundamentally different from supervised learning is the credit assignment problem: when you win a chess game on move 40, which of the 40 moves was responsible? The reward is delayed and sparse, which makes learning much harder than when a teacher provides immediate feedback on every example.
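
To make the loop in the diagram and table concrete, here is a minimal Python sketch of one episode. The toy environment (a walk along a number line with a sparse reward at position 5) and the purely random placeholder policy are invented for illustration; a real agent would replace the random choice with a learned policy.

    import random

    # Toy environment invented for illustration: the state is a position on a
    # number line, the action is a step left or right, and the reward is +1
    # only on reaching position 5 (sparse, delayed feedback).
    def step(state, action):
        next_state = state + action
        reward = 1 if next_state == 5 else 0
        done = next_state == 5
        return next_state, reward, done

    # One episode of the RL loop: observe state, pick action, receive reward.
    state, total_reward = 0, 0
    for _ in range(100):                  # at most 100 steps per episode
        action = random.choice([-1, +1])  # placeholder policy: act at random
        state, reward, done = step(state, action)
        total_reward += reward
        if done:
            break
    print("cumulative reward this episode:", total_reward)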

The agent explores by trying actions and observing rewards, iterating over many episodes. Every RL algorithm must balance exploration (trying new actions) with exploitation (repeating actions known to give good rewards). This balance is one of the deepest challenges in RL. Explore too much and you waste time on random actions; exploit too much and you might miss a much better strategy. This trade-off is related to graph search: the agent navigates a graph of possible states, choosing which path to explore next.
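
One common way to strike this balance is an epsilon-greedy rule: with probability epsilon the agent explores (acts at random), and otherwise it exploits (picks the action with the highest estimated value so far). The sketch below is illustrative only; the action names and their value estimates are made up.

    import random

    # Hypothetical running estimates of how well each action has worked so far.
    value_estimates = {"lean_left": 0.2, "lean_right": 0.7, "pedal_harder": 0.5}

    def epsilon_greedy(values, epsilon=0.1):
        """Explore with probability epsilon, otherwise exploit the best estimate."""
        if random.random() < epsilon:
            return random.choice(list(values))   # explore: try any action
        return max(values, key=values.get)       # exploit: best-known action

    action = epsilon_greedy(value_estimates)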

Connection to our project: Our FPGA constraint solver explores a search tree — each branch choice is an "action," each pruned domain is a "state," and finding a valid solution is the "reward." RL can learn which branches to explore first (search heuristics). Today, our solver uses hand-crafted heuristics like "branch on the variable with the smallest domain" (the minimum remaining values heuristic). But these heuristics are not always optimal for every problem structure. An RL agent could learn, from experience with many constraint problems, which branching strategy tends to find solutions fastest. The FPGA would still do the heavy lifting of constraint propagation at 770,000 operations per second; the RL agent would simply choose smarter starting points for that propagation.

Learn more in the deep version

Related: Neurosymbolic | Transformers


Soli Deo Gloria

Self-Check 1/1

In RL, the agent learns to maximize cumulative _____.