5.2 Activations

ReLU, sigmoid, tanh, softmax, Gumbel-softmax, smooth approximations.

Activation Functions — Brief ☧

"The hearing ear, and the seeing eye, the LORD hath made even both of them."

— Proverbs 20:12 (KJV)



Q: Suppose you stack five linear equations on top of each other — multiply by a number, add a number, repeat. Can five layers of straight lines represent a curve?

A: No. No matter how many straight lines you chain together, the result is still one straight line. If every neuron just computed w*x + b, a hundred-layer network would collapse into a single multiplication and addition. To learn curves, bends, and complex patterns, each neuron needs a nonlinear step — something that bends the line. That step is the activation function.
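A minimal sketch of that collapse, using arbitrary illustrative weights: composing two purely linear layers is algebraically identical to a single linear layer.

import numpy as np

# Two "layers" that are purely linear: y = w*x + b (weights chosen arbitrarily)
w1, b1 = 3.0, 1.0
w2, b2 = -2.0, 5.0

def layer1(x):
    return w1 * x + b1

def layer2(x):
    return w2 * x + b2

# layer2(layer1(x)) = w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_combined = w2 * w1          # -6.0
b_combined = w2 * b1 + b2     #  3.0

x = np.linspace(-5, 5, 11)
stacked = layer2(layer1(x))
collapsed = w_combined * x + b_combined
print(np.allclose(stacked, collapsed))  # True: two linear layers equal one linear layer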

Q: What is the simplest activation people actually use?

A: ReLU: max(0, x). If the signal is negative, output zero — the gate is shut. If positive, pass it through unchanged. It is dead simple, fast to compute, and works surprisingly well. Think of it as a one-way valve: flow goes forward, never backward.
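In code, ReLU is a one-liner; the function name and sample values below are only for illustration.

import numpy as np

def relu(x):
    # max(0, x), applied element-wise: negatives become 0, positives pass through unchanged
    return np.maximum(0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [0. 0. 0. 2.]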

Q: What if you need an answer between 0 and 1 — like a confidence percentage?

A: That is sigmoid: 1 / (1 + e^(-x)). It smoothly squashes any value into the range (0, 1). Feed in -10, you get nearly 0; feed in +10, you get nearly 1. It works like a dimmer switch — smoothly moving between off and on.
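A small sketch of sigmoid in the same style, with illustrative inputs:

import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)): squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # roughly [0.000045, 0.5, 0.999955]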

Q: And when you have multiple categories — say twelve options — and you need the probabilities to add up to 1?

A: That is softmax. It takes a whole array of scores and normalizes them so every score becomes a probability and they all sum to 1. Like dividing an inheritance among twelve heirs: each gets a share, and the shares must total 100%.
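A sketch of softmax; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result:

import numpy as np

def softmax(scores):
    # e^(x_i) / sum(e^(x_j)); shifting by the max avoids overflow for large scores
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0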

Q: Jesus said, "Enter ye in at the strait gate" (Matthew 7:13). Is an activation function like a gate?

A: Exactly. Each activation decides what passes through a neuron and what gets blocked. The shape of the gate — narrow like ReLU, smooth like sigmoid — determines what the network can learn.

The Key Activations

Function | Formula                          | Range              | When to Use
ReLU     | max(0, x)                        | [0, inf)           | Default for hidden layers
Sigmoid  | 1 / (1 + e^(-x))                 | (0, 1)             | Binary classification output
Tanh     | (e^x - e^(-x)) / (e^x + e^(-x))  | (-1, 1)            | Centered hidden layers
Softmax  | e^(x_i) / sum(e^(x_j))           | (0, 1), sums to 1  | Multi-class output
GELU     | x * Phi(x)                       | (~-0.17, inf)      | Transformers (GPT, BERT)

Take a moment to notice the pattern in this table. ReLU and tanh live in the hidden layers (the interior of the network), while sigmoid and softmax typically sit at the output: sigmoid for a single yes/no probability, softmax when you need probabilities across multiple categories. ReLU is the default starting point because it is both fast and effective, but if you are building a transformer-based model like GPT or BERT, you will likely use GELU instead. The choice of activation function is one of the most impactful design decisions you will make, because it shapes what the network can learn and how easily gradients flow during training.
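For completeness, here is a small sketch of tanh and GELU following the formulas in the table; the erf-based form of GELU is used below, though some libraries substitute a tanh approximation:

import math

def tanh(x):
    # (e^x - e^(-x)) / (e^x + e^(-x)); same as math.tanh, written out to match the table
    ex, enx = math.exp(x), math.exp(-x)
    return (ex - enx) / (ex + enx)

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF (exact, erf-based form)
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Note how GELU suppresses negative inputs softly instead of hard-zeroing them like ReLU
for v in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={v:+.1f}  tanh={tanh(v):+.4f}  gelu={gelu(v):+.4f}")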

Why they matter: Without activations, a deep network collapses into a single linear transformation — no matter how many layers you add. Think of it this way: if every neuron only computed w*x + b, then stacking a hundred layers would give you nothing more than what a single layer could already do. The whole structure would be one elaborate multiplication followed by one addition. Activations break this collapse by introducing bends and curves into the function the network represents. That is what gives neural networks their power to approximate such an enormous range of functions, from image recognition to language translation.

Learn more in the deep version

Related: Neural Networks | Training


Soli Deo Gloria

Self-Check 1/1

ReLU(x) returns: