5.3 Training — Brief ☧

Gradient descent, backpropagation, learning rates, convergence.

"The hearing ear, and the seeing eye, the LORD hath made even both of them."

— Proverbs 20:12 (KJV)



Q: Imagine you are learning to throw darts. You throw, see where the dart lands, and adjust your aim. After hundreds of throws, you get closer and closer to the bullseye. How does a neural network do the same thing?

A: The network makes a prediction (throws the dart), then a **loss function** measures how far off the prediction was (how far from the bullseye). The network adjusts its weights to reduce that distance, then predicts again. This loop — predict, measure, adjust, repeat — is training.
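
To make "distance from the bullseye" concrete, here is a minimal sketch of one common loss function, mean squared error, in plain NumPy (the function name is ours, for illustration):

```python
import numpy as np

# A minimal sketch: mean squared error as "distance from the bullseye".
# y_true is where the bullseye is, y_pred is where the dart landed.
def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

print(mse_loss(np.array([2.9]), np.array([3.0])))  # close throw, small loss
print(mse_loss(np.array([7.0]), np.array([3.0])))  # wild throw, large loss
```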

Q: After a bad throw, you know you went too far left. But in a network with millions of weights, how does it figure out which weights caused the error?

A: Backpropagation — the chain rule of calculus applied layer by layer. It traces backward from the output, telling each weight exactly how much it contributed to the error. Think of it as replaying a video of the throw in reverse: the dart veered left because your wrist flicked, because your elbow was stiff, because your stance was off. Each cause gets its share of blame.
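
Here is that blame assignment written out by hand for a tiny two-step "network". This is a minimal sketch, not any library's API; it shows the chain-rule bookkeeping that backpropagation automates at scale:

```python
# Hand-rolled chain rule for a tiny "network": y = w2 * max(0, w1 * x).
# Backprop is exactly this bookkeeping, repeated layer by layer.
def forward(x, w1, w2):
    h = w1 * x          # first "layer"
    a = max(0.0, h)     # ReLU activation
    y = w2 * a          # second "layer"
    return h, a, y

def backward(x, w1, w2, target):
    h, a, y = forward(x, w1, w2)
    loss = (y - target) ** 2
    dy = 2 * (y - target)                # dL/dy
    dw2 = dy * a                         # dL/dw2 = dL/dy * dy/dw2
    da = dy * w2                         # blame flowing into the activation
    dh = da * (1.0 if h > 0 else 0.0)    # ReLU lets blame through only if it fired
    dw1 = dh * x                         # dL/dw1, one more chain-rule step back
    return loss, dw1, dw2

print(backward(x=1.5, w1=0.8, w2=-0.4, target=1.0))
```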

Q: Once you know the direction to adjust, how big should the adjustment be?

A: That is the learning rate. Too large and you overcorrect — your next dart flies off the other side. Too small and you barely improve between throws. Optimizers like Adam are like a smart coach: they track your pattern of misses and adapt the adjustment size for each weight individually.
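
As a sketch of the difference, here is a plain SGD step next to a single-weight version of the standard Adam update (the helper names are ours; real libraries vectorize this across all weights):

```python
import math

# One SGD step: move against the gradient, scaled by the learning rate.
def sgd_step(w, grad, lr=0.1):
    return w - lr * grad

# Adam keeps running averages of the gradient (m) and its square (v),
# so the effective step size adapts per weight -- the "smart coach".
def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction for the first few steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

print(sgd_step(w=1.0, grad=0.5))
print(adam_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1))
```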

Q: In the Parable of the Talents (Matthew 25:14-30), the master measures each servant's return. Is the loss function like the master?

A: Yes. The loss function is the master's expectation — the gap between what the servant returned and what was possible. Training is the process of a faithful servant adjusting strategy to close that gap, one iteration at a time.

The Training Loop

1. Forward pass     → compute prediction (throw the dart)
2. Compute loss     → measure how wrong it is (distance from bullseye)
3. Backward pass    → compute gradients (which direction to adjust)
4. Update weights   → nudge each weight to reduce loss (adjust aim)
5. Repeat           → until loss is small enough (bullseye!)
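
Here is the whole loop in runnable form: a minimal sketch that fits a noisy line with plain gradient descent. Each comment marks the matching step above; the data and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0 + rng.normal(0, 0.05, size=100)  # noisy line y ≈ 2x + 1

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_pred = w * X + b                       # 1. forward pass
    loss = np.mean((y_pred - y) ** 2)        # 2. compute loss
    grad_w = np.mean(2 * (y_pred - y) * X)   # 3. backward pass
    grad_b = np.mean(2 * (y_pred - y))
    w -= lr * grad_w                         # 4. update weights
    b -= lr * grad_b                         # 5. repeat
print(w, b, loss)  # converges toward w ≈ 2, b ≈ 1, small loss
```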
| Concept        | What It Does                         | Dart Analogy                                              |
|----------------|--------------------------------------|-----------------------------------------------------------|
| Loss function  | Measures prediction error            | Distance from the bullseye                                |
| Gradient       | Direction to reduce loss             | "Aim more to the right, less up"                          |
| Learning rate  | Size of each adjustment              | How much you change your stance                           |
| SGD            | Simplest optimizer                   | Adjust after each throw                                   |
| Adam           | Adaptive optimizer                   | A coach who tracks your throwing patterns                 |
| Epoch          | One pass through all data            | One complete practice session                             |
| Overfitting    | Memorizing data instead of learning  | Perfecting one throw angle that only works from one spot  |
| Regularization | Preventing overfitting               | Practicing from different positions                       |

If you step back and look at this table, you will notice that training is fundamentally a feedback loop -- the same kind of cycle you see everywhere from thermostats to learning to cook. You try something, measure how it went, figure out what to adjust, make the adjustment, and repeat. The concepts listed above are just the precise names for each step in that loop. The loss function is your measuring stick. The gradient tells you which direction to adjust. The learning rate controls how big each adjustment is. And the optimizer is the strategy that decides exactly how to combine all of that information.

Each pass through the entire dataset is one epoch. With thousands of training examples, the algorithm may need dozens of epochs — each one a full iteration over every example — before the loss converges. Understanding when to stop is itself an art: stop too early and the model has not learned enough (underfitting); train too long and it memorizes the training data instead of learning general patterns (overfitting). This tension runs through every machine learning project.
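
A common way to manage that tension is early stopping: track the loss on held-out data and keep the weights from the best epoch. Below is a minimal sketch reusing the toy line-fitting setup; the split sizes, patience value, and improvement threshold are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=200)
X_train, y_train = X[:150], y[:150]           # training set
X_val, y_val = X[150:], y[150:]               # held-out validation set

w, b, lr = 0.0, 0.0, 0.05
best_val, best_wb = float("inf"), (w, b)
patience, bad_epochs = 10, 0

for epoch in range(1000):                     # one epoch = one full pass
    grad_w = np.mean(2 * (w * X_train + b - y_train) * X_train)
    grad_b = np.mean(2 * (w * X_train + b - y_train))
    w, b = w - lr * grad_w, b - lr * grad_b
    val_loss = np.mean((w * X_val + b - y_val) ** 2)
    if val_loss < best_val - 1e-6:            # improved: remember these weights
        best_val, best_wb, bad_epochs = val_loss, (w, b), 0
    else:                                     # no improvement this epoch
        bad_epochs += 1
        if bad_epochs >= patience:            # stop before overfitting sets in
            break
w, b = best_wb                                # restore the best-epoch weights
```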

Connection to our project: Our differentiable_chirho.py uses the same gradient flow through soft AND/OR operations — the loss measures how many unifications failed, and gradients flow backward to adjust relation weights.

In our system, the "dart" is a constraint propagation pass: the engine tries to narrow down which values are valid for each variable. The "bullseye" is a state where every constraint is satisfied. When a unification fails, that failure generates a gradient signal that flows backward through the soft logic, telling each relation weight how much it contributed to the problem. Over many iterations, the relation weights converge on values that make the constraint solver succeed most often -- just as a dart thrower converges on the bullseye through practice.
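
As a toy illustration of the idea (emphatically not the actual differentiable_chirho.py code; the weights, formulas, and constraint here are invented for this sketch), gradients can flow through a soft AND, with a "failed unification" loss pushing two relation weights toward values that satisfy the constraint:

```python
import numpy as np

# Toy illustration only -- not the project's real implementation.
# Soft logic keeps truth values in [0, 1], so AND/OR become smooth
# functions and gradients can flow through them.
def soft_and(a, b):   # product t-norm
    return a * b

def soft_or(a, b):    # probabilistic sum
    return a + b - a * b

# Invented example: two relation weights (as truth degrees) must jointly
# satisfy a constraint; the loss is how far the soft formula is from true.
w = np.array([0.2, 0.9])
lr = 0.5
for _ in range(100):
    satisfied = soft_and(w[0], w[1])       # forward: evaluate the formula
    loss = (1.0 - satisfied) ** 2          # "failed unification" signal
    d_sat = -2.0 * (1.0 - satisfied)       # backward: chain rule, step 1
    grad = d_sat * np.array([w[1], w[0]])  # d(a*b)/da = b, d(a*b)/db = a
    w = np.clip(w - lr * grad, 0.0, 1.0)   # update, staying in [0, 1]
print(w, loss)  # weights driven toward 1, loss toward 0
```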

Learn more in the deep version

Related: Neural Networks | Supervised Learning


Soli Deo Gloria

Self-Check 1/1

Gradient _____ updates weights to minimize the loss function.