5.3 Training — Brief ☧

Gradient descent, backpropagation, learning rates, convergence.

"The hearing ear, and the seeing eye, the LORD hath made even both of them."

— Proverbs 20:12 (KJV)



Q: Imagine you are learning to throw darts. You throw, see where the dart lands, and adjust your aim. After hundreds of throws, you get closer and closer to the bullseye. How does a neural network do the same thing?

A: The network makes a prediction (throws the dart), then a **loss function** measures how far off the prediction was (how far from the bullseye). The network adjusts its weights to reduce that distance, then predicts again. This loop — predict, measure, adjust, repeat — is training.
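
To make "distance from the bullseye" concrete, here is a minimal sketch of one common loss function, mean squared error, in plain NumPy (the function name is ours, for illustration):

```python
import numpy as np

# A minimal sketch: mean squared error as "distance from the bullseye".
# y_true is where the bullseye is, y_pred is where the dart landed.
def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

print(mse_loss(np.array([2.9]), np.array([3.0])))  # close throw, small loss
print(mse_loss(np.array([7.0]), np.array([3.0])))  # wild throw, large loss
```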

Q: After a bad throw, you know you went too far left. But in a network with millions of weights, how does it figure out which weights caused the error?

A: Backpropagation — the chain rule of calculus applied layer by layer. It traces backward from the output, telling each weight exactly how much it contributed to the error. Think of it as replaying a video of the throw in reverse: the dart veered left because your wrist flicked, because your elbow was stiff, because your stance was off. Each cause gets its share of blame.
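
Here is that blame assignment written out by hand for a tiny two-step "network". This is a minimal sketch, not any library's API; it shows the chain-rule bookkeeping that backpropagation automates at scale:

```python
# Hand-rolled chain rule for a tiny "network": y = w2 * max(0, w1 * x).
# Backprop is exactly this bookkeeping, repeated layer by layer.
def forward(x, w1, w2):
    h = w1 * x          # first "layer"
    a = max(0.0, h)     # ReLU activation
    y = w2 * a          # second "layer"
    return h, a, y

def backward(x, w1, w2, target):
    h, a, y = forward(x, w1, w2)
    loss = (y - target) ** 2
    dy = 2 * (y - target)                # dL/dy
    dw2 = dy * a                         # dL/dw2 = dL/dy * dy/dw2
    da = dy * w2                         # blame flowing into the activation
    dh = da * (1.0 if h > 0 else 0.0)    # ReLU lets blame through only if it fired
    dw1 = dh * x                         # dL/dw1, one more chain-rule step back
    return loss, dw1, dw2

print(backward(x=1.5, w1=0.8, w2=-0.4, target=1.0))
```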

Q: Once you know the direction to adjust, how big should the adjustment be?

A: That is the learning rate. Too large and you overcorrect — your next dart flies off the other side. Too small and you barely improve between throws. Optimizers like Adam are like a smart coach: they track your pattern of misses and adapt the adjustment size for each weight individually.
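
As a sketch of the difference, here is a plain SGD step next to a single-weight version of the standard Adam update (the helper names are ours; real libraries vectorize this across all weights):

```python
import math

# One SGD step: move against the gradient, scaled by the learning rate.
def sgd_step(w, grad, lr=0.1):
    return w - lr * grad

# Adam keeps running averages of the gradient (m) and its square (v),
# so the effective step size adapts per weight -- the "smart coach".
def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction for the first few steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

print(sgd_step(w=1.0, grad=0.5))
print(adam_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1))
```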

Q: In the Parable of the Talents (Matthew 25:14-30), the master measures each servant's return. Is the loss function like the master?

A: Yes. The loss function is the master's expectation — the gap between what the servant returned and what was possible. Training is the process of a faithful servant adjusting strategy to close that gap, one iteration at a time.

The Training Loop

1. Forward pass     → compute prediction (throw the dart)
2. Compute loss     → measure how wrong it is (distance from bullseye)
3. Backward pass    → compute gradients (which direction to adjust)
4. Update weights   → nudge each weight to reduce loss (adjust aim)
5. Repeat           → until loss is small enough (bullseye!)
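
Here is the whole loop in runnable form: a minimal sketch that fits a noisy line with plain gradient descent. Each comment marks the matching step above; the data and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0 + rng.normal(0, 0.05, size=100)  # noisy line y ≈ 2x + 1

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    y_pred = w * X + b                       # 1. forward pass
    loss = np.mean((y_pred - y) ** 2)        # 2. compute loss
    grad_w = np.mean(2 * (y_pred - y) * X)   # 3. backward pass
    grad_b = np.mean(2 * (y_pred - y))
    w -= lr * grad_w                         # 4. update weights
    b -= lr * grad_b                         # 5. repeat
print(w, b, loss)  # converges toward w ≈ 2, b ≈ 1, small loss
```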
| Concept        | What It Does                         | Dart Analogy                                              |
|----------------|--------------------------------------|-----------------------------------------------------------|
| Loss function  | Measures prediction error            | Distance from the bullseye                                |
| Gradient       | Direction to reduce loss             | "Aim more to the right, less up"                          |
| Learning rate  | Size of each adjustment              | How much you change your stance                           |
| SGD            | Simplest optimizer                   | Adjust after each throw                                   |
| Adam           | Adaptive optimizer                   | A coach who tracks your throwing patterns                 |
| Epoch          | One pass through all data            | One complete practice session                             |
| Overfitting    | Memorizing data instead of learning  | Perfecting one throw angle that only works from one spot  |
| Regularization | Preventing overfitting               | Practicing from different positions                       |

If you step back and look at this table, you will notice that training is fundamentally a feedback loop -- the same kind of cycle you see everywhere from thermostats to learning to cook. You try something, measure how it went, figure out what to adjust, make the adjustment, and repeat. The concepts listed above are just the precise names for each step in that loop. The loss function is your measuring stick. The gradient tells you which direction to adjust. The learning rate controls how big each adjustment is. And the optimizer is the strategy that decides exactly how to combine all of that information.

Each pass through the entire dataset is one epoch. With thousands of training examples, the algorithm may need dozens of epochs — each one a full iteration over every example — before the loss converges. Understanding when to stop is itself an art: stop too early and the model has not learned enough (underfitting); train too long and it memorizes the training data instead of learning general patterns (overfitting). This tension runs through every machine learning project.
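
A common way to manage that tension is early stopping: track the loss on held-out data and keep the weights from the best epoch. Below is a minimal sketch reusing the toy line-fitting setup; the split sizes, patience value, and improvement threshold are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=200)
X_train, y_train = X[:150], y[:150]           # training set
X_val, y_val = X[150:], y[150:]               # held-out validation set

w, b, lr = 0.0, 0.0, 0.05
best_val, best_wb = float("inf"), (w, b)
patience, bad_epochs = 10, 0

for epoch in range(1000):                     # one epoch = one full pass
    grad_w = np.mean(2 * (w * X_train + b - y_train) * X_train)
    grad_b = np.mean(2 * (w * X_train + b - y_train))
    w, b = w - lr * grad_w, b - lr * grad_b
    val_loss = np.mean((w * X_val + b - y_val) ** 2)
    if val_loss < best_val - 1e-6:            # improved: remember these weights
        best_val, best_wb, bad_epochs = val_loss, (w, b), 0
    else:                                     # no improvement this epoch
        bad_epochs += 1
        if bad_epochs >= patience:            # stop before overfitting sets in
            break
w, b = best_wb                                # restore the best-epoch weights
```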

Connection to our project: Our differentiable_chirho.py uses the same gradient flow through soft AND/OR operations — the loss measures how many unifications failed, and gradients flow backward to adjust relation weights.

In our system, the "dart" is a constraint propagation pass: the engine tries to narrow down which values are valid for each variable. The "bullseye" is a state where every constraint is satisfied. When a unification fails, that failure generates a gradient signal that flows backward through the soft logic, telling each relation weight how much it contributed to the problem. Over many iterations, the relation weights converge on values that make the constraint solver succeed most often -- just as a dart thrower converges on the bullseye through practice.
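
As a toy illustration of the idea (emphatically not the actual differentiable_chirho.py code; the weights, formulas, and constraint here are invented for this sketch), gradients can flow through a soft AND, with a "failed unification" loss pushing two relation weights toward values that satisfy the constraint:

```python
import numpy as np

# Toy illustration only -- not the project's real implementation.
# Soft logic keeps truth values in [0, 1], so AND/OR become smooth
# functions and gradients can flow through them.
def soft_and(a, b):   # product t-norm
    return a * b

def soft_or(a, b):    # probabilistic sum
    return a + b - a * b

# Invented example: two relation weights (as truth degrees) must jointly
# satisfy a constraint; the loss is how far the soft formula is from true.
w = np.array([0.2, 0.9])
lr = 0.5
for _ in range(100):
    satisfied = soft_and(w[0], w[1])       # forward: evaluate the formula
    loss = (1.0 - satisfied) ** 2          # "failed unification" signal
    d_sat = -2.0 * (1.0 - satisfied)       # backward: chain rule, step 1
    grad = d_sat * np.array([w[1], w[0]])  # d(a*b)/da = b, d(a*b)/db = a
    w = np.clip(w - lr * grad, 0.0, 1.0)   # update, staying in [0, 1]
print(w, loss)  # weights driven toward 1, loss toward 0
```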

Learn more in the deep version

Related: Neural Networks | Supervised Learning


Soli Deo Gloria

Self-Check 1/1

Gradient _____ updates weights to minimize the loss function.