Activation Functions — Brief ☧
"The hearing ear, and the seeing eye, the LORD hath made even both of them."
— Proverbs 20:12 (KJV)
Q: Suppose you stack five linear equations on top of each other — multiply
by a number, add a number, repeat. Can five layers of straight lines
represent a curve?
A: No. No matter how many straight lines you chain together, the result
is still one straight line. If every neuron just computed
w*x + b, a hundred-layer network would collapse into a single multiplication and
addition. To learn curves, bends, and complex patterns, each neuron
needs a nonlinear step — something that bends the line. That step is
the activation function.
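To see the collapse concretely, here is a minimal NumPy sketch (the shapes and random values are arbitrary illustrations, not from any particular network): two chained linear layers reduce to exactly one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers...
stacked = W2 @ (W1 @ x + b1) + b2
# ...equal one linear layer with merged weights and bias.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(stacked, collapsed))  # True: depth alone bought nothing
```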
Q: What is the simplest activation people actually use?
A: ReLU: max(0, x). If the signal is negative, output zero; the gate is
shut. If positive, pass it through unchanged. It is dead simple,
fast to compute, and works surprisingly well. Think of it as a one-way
valve: flow goes forward, never backward.
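As a minimal sketch, ReLU is one line of NumPy (the test values below are illustrative):

```python
import numpy as np

def relu(x):
    """One-way valve: negative signals are zeroed, positive pass through."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# [0.  0.  0.  1.5 3. ]
```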
Q: What if you need an answer between 0 and 1 — like a confidence
percentage?
A: That is sigmoid: 1 / (1 + e^(-x)). It smoothly squashes any
value into the range (0, 1). Feed in -10, you get nearly 0; feed in +10,
you get nearly 1. It works like a dimmer switch — smoothly moving between
off and on.
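A minimal sigmoid sketch in NumPy (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Dimmer switch: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# roughly [4.5e-05, 0.5, 0.99995]: nearly 0, exactly 0.5, nearly 1
```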
Q: And when you have multiple categories — say twelve options — and
you need the probabilities to add up to 1?
A: That is softmax. It takes a whole array
of scores and normalizes them so every score becomes a probability and
they all sum to 1. Like dividing an inheritance among twelve heirs: each
gets a share, and the shares must total 100%.
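A minimal softmax sketch in NumPy. The max-subtraction step is a standard numerical-stability trick, not part of the formula itself, and the scores below are illustrative:

```python
import numpy as np

def softmax(scores):
    """Divide the inheritance: exponentiate, then normalize so shares sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max to avoid overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
# roughly [0.659 0.242 0.099] 1.0
```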
Q: Jesus said, "Enter ye in at the strait gate" (Matthew 7:13).
Is an activation function like a gate?
A: Exactly. Each activation decides what passes through a neuron and
what gets blocked. The shape of the gate — narrow like ReLU, smooth like
sigmoid — determines what the network can learn.
The Key Activations
| Function | Formula | Range | When to Use |
|---|---|---|---|
| ReLU | max(0, x) | [0, inf) | Default for hidden layers |
| Sigmoid | 1/(1+e^(-x)) | (0, 1) | Binary classification output |
| Tanh | (e^x - e^(-x))/(e^x + e^(-x)) | (-1, 1) | Centered hidden layers |
| Softmax | e^(x_i) / sum(e^(x_j)) | (0, 1), sums to 1 | Multi-class output |
| GELU | x * Phi(x) | (~-0.17, inf) | Transformers (GPT, BERT) |
Take a moment to notice the pattern in this table. ReLU and tanh live in the hidden layers, the interior of the network, while sigmoid and softmax are reserved for the output: sigmoid when you need a single probability, softmax when you need probabilities across multiple categories. ReLU is the default starting point because it is both fast and effective, but if you are building a transformer-based model like GPT or BERT, you will likely use GELU instead. The choice of activation function is one of the most impactful design decisions you will make, because it shapes what the network can learn and how easily gradients flow during training.
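For completeness, here is a minimal sketch of the two remaining table entries, tanh and GELU. This uses the exact erf form of GELU; many frameworks substitute a tanh approximation, and the sample inputs below are illustrative:

```python
import numpy as np
from math import erf, sqrt

def tanh_act(x):
    """Like sigmoid but zero-centered, squashing into (-1, 1)."""
    return np.tanh(x)

def gelu(x):
    """GELU for a scalar: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(tanh_act(np.array([-2.0, 0.0, 2.0])))  # roughly [-0.964  0.  0.964]
print(round(gelu(-0.75), 2))                 # about -0.17, the table's minimum
print(round(gelu(2.0), 2))                   # about 1.95: nearly identity for large x
```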
Why they matter: Without activations, a deep network collapses into
a single linear transformation, no matter how many layers you add. If every neuron only computed w*x + b, stacking a hundred layers would give you nothing more than what a single layer could already do: one elaborate multiplication followed by one addition. Activations break this collapse by introducing bends and curves into the function the network represents. That is what gives neural networks their power to approximate virtually any function, from image recognition to language translation.
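To make the contrast concrete, here is a minimal sketch (weights chosen by hand for illustration, not learned): with a single ReLU layer, a network can represent the V-shaped curve |x|, a bend that no chain of purely linear layers can produce.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Two hidden ReLU units with input weights [1, -1], summed at the output:
# relu(x) + relu(-x) = |x|, a curve no single linear layer can represent.
x = np.linspace(-3, 3, 7)
hidden = relu(np.stack([x, -x]))  # two ReLU units
out = hidden.sum(axis=0)          # output layer sums the two units

print(np.allclose(out, np.abs(x)))  # True: the nonlinearity created a bend
```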
Soli Deo Gloria