Activation Functions
First Principles 7 cards
-
01 What is an activation?
An activation function is the non-linear rule applied after a layer’s linear computation.
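For a layer:
$$z = Wx + b, \qquad a = f(z)$$
Where:
- `W`, `x`, and `b` are the layer's weights, input, and bias.
- `z` is the pre-activation or logit-like score.
- `f` is the activation function.
- `a` is the signal passed to the next layer.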
The linear layer computes a score. The activation decides how that score should become signal.
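As a minimal sketch of that split, assume the layer computes `z = Wx + b` and then `a = f(z)`. The helper name `dense_forward` and the weights are illustrative assumptions, not part of any library.

```python
def dense_forward(W, x, b, f):
    """Compute the score z = Wx + b, then the signal a = f(z) elementwise."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    a = [f(z_i) for z_i in z]
    return z, a

# Illustrative values only.
W = [[1.0, -2.0], [0.5, 0.5]]
x = [3.0, 1.0]
b = [0.0, -3.0]

# ReLU as the activation: the positive score passes, the negative one is zeroed.
z, a = dense_forward(W, x, b, lambda v: max(0.0, v))
print("pre-activation z:", z)
print("activation a:   ", a)
```

The linear part and the activation are kept as separate steps so that swapping `f` changes only how the score becomes signal.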
-
02 Why do we need activations?
Without activation functions, deep networks collapse into one linear transformation.
Depth would add parameters, but not real expressive power.
Activation functions introduce non-linearity, which lets networks learn curves, thresholds, interactions, and hierarchical features.
This is the core point: linear layers mix information; activations bend the function.
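The collapse can be checked numerically. The sketch below uses two small, made-up weight matrices without biases (an assumption to keep it short) and shows that stacking them equals a single matrix, while inserting a ReLU between them does not.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product A @ B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    return [max(0.0, x) for x in v]

# Illustrative weights and input.
W1 = [[1.0, -1.0], [2.0, 0.0]]
W2 = [[1.0, 1.0], [0.0, -1.0]]
x = [1.0, 3.0]

stacked = matvec(W2, matvec(W1, x))              # two linear layers, no activation
collapsed = matvec(matmul(W2, W1), x)            # one equivalent linear layer
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # activation in between

print(stacked, collapsed, with_relu)
```

With the activation in place, no single matrix reproduces the mapping for all inputs, which is what gives depth its extra expressive power.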
-
03 How should I think about activations?
Think of activations as gates.
Different gates pass information differently:
- Sigmoid: soft yes/no probability gate.
- Tanh: centered squashing gate.
- ReLU: hard positive-only gate.
- GELU / SiLU: smooth probabilistic gates.
- Softmax: competition gate across classes.
The activation shapes both the forward signal and the backward gradient.
-
04 What are activations doing?
Hidden-layer activations mainly affect optimization and representation.
They answer:
- Can gradients flow?
- Does the layer keep useful signal?
- Does training stay stable?
- Does the architecture expect this activation?
Output activations mainly affect prediction meaning.
They answer:
- Is this a probability?
- Are classes mutually exclusive?
- Can multiple labels be true?
- Is the output bounded or unconstrained?
-
05 How do activations affect gradients?
During backpropagation, each activation contributes a local derivative.
If:
$$a = f(z)$$
then gradients flowing backward are multiplied by:
$$\frac{\partial a}{\partial z} = f'(z)$$
This is why activation shape matters. If `f'(z)` is often near zero, gradients shrink. If it is stable, gradients flow more easily.
In practice, activation functions are not just forward transformations; they are also gradient filters.
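To make the "gradient filter" idea concrete, the sketch below multiplies the local derivatives `f'(z)` along a chain of layers. It deliberately ignores the weight matrices, which also scale gradients in a real network, and the helper name `chained_gradient` is an illustrative assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, pre_activations):
    """Product of the local derivatives f'(z) along a chain of layers."""
    g = 1.0
    for z in pre_activations:
        g *= local_grad(z)
    return g

# Ten layers, each with the same illustrative pre-activation.
zs = [0.5] * 10
sig_chain = chained_gradient(sigmoid_grad, zs)
relu_chain = chained_gradient(relu_grad, zs)
print("sigmoid chain:", sig_chain)
print("relu chain:   ", relu_chain)
```

Because each sigmoid derivative is at most 0.25, ten of them in a row shrink the signal by several orders of magnitude, while ReLU on positive inputs passes it through unchanged.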
-
06 Why is saturation bad?
Saturation happens when large input ranges produce almost constant outputs.
Examples:
- Sigmoid saturates near `0` and `1`.
- Tanh saturates near `-1` and `1`.
In saturated regions, derivatives are close to zero. During backpropagation, multiplying by many tiny derivatives can make gradients vanish.
That is why saturating activations are usually poor hidden-layer defaults in very deep networks.
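A quick way to see saturation is to evaluate the derivatives at a few points. This sketch uses the standard derivative identities for sigmoid and tanh; the sample points are arbitrary.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

# Derivatives at the center and progressively deeper into the saturated region.
points = (0.0, 2.0, 5.0, 10.0)
table = {z: (sigmoid_grad(z), tanh_grad(z)) for z in points}
for z, (ds, dt) in table.items():
    print(f"z={z:5.1f}  sigmoid'={ds:.2e}  tanh'={dt:.2e}")
```

Both derivatives peak at `z = 0` and fall toward zero as `|z|` grows, which is the region where learning slows down.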
-
07 What makes an activation good?
A useful activation usually balances:
- Non-linearity
- Stable gradients
- Cheap computation
- Good behavior near zero
- Good behavior for large positive and negative inputs
- Suitable output range
- Compatibility with the architecture
There is no universally best activation. The right question is: what behavior do I need here?
Core Activations 10 cards
-
01 What are the key activations?
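| Activation | Full form | Formula | Range |
| --- | --- | --- | --- |
| Sigmoid | Logistic sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | `(0, 1)` |
| Tanh | Hyperbolic tangent | $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | `(-1, 1)` |
| ReLU | Rectified Linear Unit | $\max(0, x)$ | `[0, infinity)` |
| Leaky ReLU | Leaky Rectified Linear Unit | $x$ if $x > 0$, else $\alpha x$ | `(-infinity, infinity)` |
| Softplus | Smooth ReLU-like function | $\log(1 + e^{x})$ | `(0, infinity)` |
| SiLU | Sigmoid Linear Unit | $x \, \sigma(x)$ | roughly `(-0.28, infinity)` |
| GELU | Gaussian Error Linear Unit | $x \, \Phi(x)$ | roughly `(-0.17, infinity)` |
| Softmax | Vector probability map | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | probabilities sum to `1` |

This table is the activation zoo in one card. Most practical decisions come from understanding the behavior, not memorizing every variant.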
-
02 Sigmoid: what matters?
Sigmoid maps any real number to `(0, 1)`:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Its derivative is:
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
Its maximum derivative is only `0.25`, and it saturates for large positive or negative inputs.
Use sigmoid for binary or independent probabilities. Avoid it as a default hidden-layer activation.
-
03 Tanh: what matters?
Tanh maps values to `(-1, 1)` and is zero-centered.
Its derivative is:
$$\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)$$
Tanh is often better centered than sigmoid, but it still saturates.
Use it when bounded, symmetric output is useful, such as some recurrent states or action outputs scaled to `[-1, 1]`.
-
04 ReLU: what matters?
ReLU stands for Rectified Linear Unit.
$$\mathrm{ReLU}(x) = \max(0, x)$$
For positive inputs, its derivative is `1`. For negative inputs, its derivative is `0`.
That makes it cheap and good for gradient flow on the positive side.
Its weakness is that negative signal is killed completely, which can cause dead neurons.
-
05 How does ReLU die?
A ReLU unit can die when it outputs zero for almost all inputs.
If the pre-activation is always negative:
$$\mathrm{ReLU}(z) = \max(0, z) = 0$$
and the gradient is also zero.
The unit may stop learning.
This is why variants like Leaky ReLU, PReLU, ELU, GELU, and SiLU preserve some negative-side signal.
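The sketch below illustrates a dead unit with a made-up batch of pre-activations that are all negative. The negative slope of `0.01` for Leaky ReLU is a commonly used value, chosen here as an assumption.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, slope=0.01):
    # The small negative-side slope keeps a nonzero gradient for z <= 0.
    return 1.0 if z > 0 else slope

# A unit whose pre-activation is negative for every input in the batch.
pre_activations = [-3.0, -0.5, -1.2, -7.0]

relu_outputs = [max(0.0, z) for z in pre_activations]
relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

print("ReLU outputs:    ", relu_outputs)
print("ReLU gradients:  ", relu_grads)
print("Leaky gradients: ", leaky_grads)
```

With plain ReLU no gradient reaches the unit's weights, so nothing pushes the pre-activation back into the positive range. The leaky variant still delivers a small update signal.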
-
06 What fixes ReLU’s harsh cutoff?
| Activation | Full form | Core idea |
| --- | --- | --- |
| Leaky ReLU | Leaky Rectified Linear Unit | Keep a small fixed negative slope |
| PReLU | Parametric Rectified Linear Unit | Learn the negative slope |
| ELU | Exponential Linear Unit | Smoothly map negatives toward $-\alpha$ |
| SELU | Scaled Exponential Linear Unit | ELU-like activation for self-normalizing networks |

Use these when plain ReLU is too harsh, especially if dead neurons or negative-signal loss seem problematic.
SELU is special: it expects conditions like LeCun normal initialization and alpha dropout.
-
07 Softplus: when is it useful?
Softplus is a smooth approximation to ReLU.
$$\mathrm{softplus}(x) = \log(1 + e^{x})$$
For large positive inputs, it behaves like `x`.
For large negative inputs, it approaches `0`.
It is useful when you need a smooth positive output, such as variance, rate, scale, or another positive parameter.
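Assuming the usual definition `softplus(x) = log(1 + exp(x))`, a direct implementation overflows for large `x`. The sketch below uses a common algebraic rewrite so that `exp` only ever sees a non-positive argument; it is one reasonable way to write it, not the only one.

```python
import math

def softplus(x):
    """log(1 + exp(x)), rewritten as max(x, 0) + log1p(exp(-|x|)) for stability."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for x in (-50.0, -2.0, 0.0, 2.0, 1000.0):
    print(f"softplus({x}) = {softplus(x)}")
```

The rewrite matches the two limits described above: the `max(x, 0)` term dominates for large positive inputs, and the `log1p` term shrinks toward zero on both sides.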
-
08 SiLU / Swish: what matters?
SiLU stands for Sigmoid Linear Unit.
Swish is:
$$\mathrm{Swish}_{\beta}(x) = x \, \sigma(\beta x)$$
When `beta = 1`, Swish is SiLU.
The intuition: SiLU softly gates the input by its own sigmoid. It keeps large positive values, smoothly downweights negatives, and preserves some small negative signal.
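A small sketch of this relationship, assuming Swish is defined as `x * sigmoid(beta * x)`. The function names are illustrative; the sigmoid is written in a branch form to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    # Branch so that exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """The input gated by its own sigmoid, with a sharpness parameter beta."""
    return x * sigmoid(beta * x)

def silu(x):
    """SiLU is Swish with beta fixed to 1."""
    return swish(x, beta=1.0)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"silu({x}) = {silu(x):.4f}")
```

Negative inputs produce small negative outputs rather than exact zeros, which is the "preserved negative signal" mentioned above.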
-
09 GELU: what matters?
GELU stands for Gaussian Error Linear Unit.
$$\mathrm{GELU}(x) = x \, \Phi(x)$$
Where `Phi(x)` is the standard normal CDF.
GELU is like a smooth, probabilistic ReLU: instead of hard-clipping negatives, it softly gates values based on magnitude.
It is common in transformer MLP blocks.
-
10 GELU approximation: why know it?
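A common approximation is:
$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)$$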
This avoids computing the exact normal CDF.
You do not usually need to implement this by hand, but recognizing it helps when reading model code.
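For reference, here is a sketch comparing the exact form, written with the error function, against the widely used tanh-based approximation. It assumes that approximation is the one meant here; the function names are illustrative.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the standard normal CDF expressed through erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation that avoids evaluating the normal CDF."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest gap between the two on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print("max |exact - tanh approx| on [-5, 5]:", max_gap)
```

The two curves stay very close over this range, which is why model code often uses either one interchangeably.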
Softmax and Outputs 6 cards
-
01 Softmax: what is it?
Softmax turns a vector of logits into a probability distribution.
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Each output is positive, and all outputs sum to `1`.
Unlike sigmoid or ReLU, softmax is not elementwise. Each output depends on every logit.
-
02 Softmax: what’s the intuition?
Softmax makes classes compete.
Increasing one logit raises that class's probability and lowers the probability of every other class.
Use softmax when exactly one class should be correct.
Examples:
- Digit classification
- Single-label image classification
- Next-token prediction
- Mutually exclusive classes
-
03 Why max-shift softmax?
Exponentials can overflow for large logits.
Use:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}$$
Where:
$$m = \max_j x_j$$
Subtracting the same constant from every logit does not change the probabilities, but it prevents very large exponentials.
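A minimal sketch of the max-shift trick in plain Python. The logits are made up to be large enough that a direct `math.exp(1000.0)` would raise `OverflowError`.

```python
import math

def softmax(logits):
    """Softmax with the maximum logit subtracted before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

large = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])
print(large)
print(small)
```

Both calls see the same shifted logits `[-2, -1, 0]`, so they return the same distribution, which is the shift invariance described above.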
-
04 Sigmoid or softmax?
Use sigmoid when outputs are independent.
Use softmax when outputs compete.
| Situation | Activation |
| --- | --- |
| Is this email spam? | Sigmoid |
| Which digit is this? | Softmax |
| Which tags apply to this document? | Sigmoid per tag |
| What is the next token? | Softmax |
| Can this image be both indoor and blurry? | Sigmoid per label |
-
05 What about regression outputs?
For unconstrained regression, use a linear output.
For constrained regression:
| Desired output | Activation |
| --- | --- |
| Any real number | Linear |
| Positive number | Softplus |
| Probability in `(0, 1)` | Sigmoid |
| Value in `(-1, 1)` | Tanh |
| Distribution over classes | Softmax |

Hidden layers can still use ReLU, GELU, SiLU, or another hidden activation.
-
06 Output logits or probabilities?
Many libraries combine the output activation and loss for numerical stability.
Examples:
- Binary cross-entropy with logits combines sigmoid and binary cross-entropy.
- Cross-entropy loss for multi-class classification often combines softmax and negative log likelihood.
This avoids unstable computations like taking `log` of probabilities that are extremely close to `0`.
Practical rule: check whether the loss expects raw logits or already-activated probabilities. Applying the activation yourself and then using a loss that applies it again can silently hurt training.
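The stability argument can be sketched for the binary case. The helper names `bce_with_logits` and `naive_bce` below are hypothetical, written for illustration rather than taken from any library, and the rewrite used is the standard identity `BCE = softplus(z) - y * z`.

```python
import math

def bce_with_logits(z, y):
    """Binary cross-entropy computed directly from the logit z, target y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def naive_bce(z, y):
    """Apply sigmoid first, then take logs of the resulting probability."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

# For moderate logits the two agree.
moderate_gap = abs(bce_with_logits(2.0, 1.0) - naive_bce(2.0, 1.0))

# For a confident wrong prediction the naive form would need log(0):
# sigmoid(40.0) rounds to exactly 1.0 in float64, so naive_bce(40.0, 0.0) fails.
confident_wrong = bce_with_logits(40.0, 0.0)
print(moderate_gap, confident_wrong)
```

The fused form never materializes the probability, so it stays finite where the two-step version breaks down.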
Modern Architecture Choices 6 cards
-
01 GLU: what is the idea?
GLU stands for Gated Linear Unit.
A GLU splits a projection into two parts: values and gates.
One common form is:
$$\mathrm{GLU}(a, b) = a \otimes \sigma(b)$$
Where `a` and `b` are learned projections, and $\otimes$ means elementwise multiplication.
The model learns what information to pass through.
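A minimal sketch of the gating step, assuming the form `a * sigmoid(b)` applied elementwise. The values and gate scores are made up, and in a real layer both would come from learned projections of the same input.

```python
import math

def sigmoid(x):
    # Branch so that exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def glu(a, b):
    """Elementwise a * sigmoid(b): b acts as a soft gate on the values a."""
    return [a_i * sigmoid(b_i) for a_i, b_i in zip(a, b)]

values = [2.0, -1.0, 4.0]
gates = [10.0, 0.0, -10.0]  # open, half-open, nearly closed
out = glu(values, gates)
print(out)
```

A large positive gate score lets the value through almost unchanged, a score of zero halves it, and a large negative score suppresses it.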
-
02 SwiGLU: why care?
SwiGLU is a gated activation that uses a Swish or SiLU-style gate.
A simplified form is:
$$\mathrm{SwiGLU}(a, b) = a \otimes \mathrm{SiLU}(b)$$
It is common in modern transformer feedforward blocks.
For basics, learn ReLU, sigmoid, tanh, softmax, GELU, and SiLU first. Then treat GLU and SwiGLU as architecture refinements.
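As a rough sketch of how this appears in a feedforward block, the code below gates one projection of the input with the SiLU of another, then projects back. The names `swiglu_ffn`, `W_gate`, `W_value`, and `W_out` are hypothetical, biases are omitted, and the tiny weights are for illustration only.

```python
import math

def silu(x):
    # x * sigmoid(x); the direct form is fine for the small values used here.
    return x / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def swiglu_ffn(x, W_gate, W_value, W_out):
    """Feedforward block with a SiLU gate: W_out @ (SiLU(W_gate @ x) * (W_value @ x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    value = matvec(W_value, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(W_out, hidden)

x = [1.0, 2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_value = [[1.0, 1.0], [1.0, -1.0]]
W_out = [[1.0, 1.0]]
y = swiglu_ffn(x, W_gate, W_value, W_out)
print(y)
```

Compared with a plain activation between two linear layers, this block uses a third projection so the gate and the values are learned separately.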
-
04 ReLU, GELU, or SiLU?
Use ReLU when you want a fast, simple baseline.
Use GELU when you are working with transformer-style architectures or want smooth probabilistic gating.
Use SiLU when you want a smooth ReLU-like activation that preserves small negative signal.
| Activation | Main benefit | Main drawback |
| --- | --- | --- |
| ReLU | Fast and simple | Can kill negative signal |
| GELU | Smooth and transformer-friendly | More expensive than ReLU |
| SiLU | Smooth and preserves small negative signal | More expensive than ReLU |
-
05 What’s the common beginner trap?
The biggest beginner mistake is choosing activations only by habit.
Ask:
- What should the output mean?
- Are labels mutually exclusive or independent?
- Does the loss expect logits or probabilities?
- Does the activation saturate?
- Does it preserve useful gradients?
- Does the architecture expect a specific activation?
Output activations define prediction meaning. Hidden activations shape optimization.
-
06 What’s the takeaway?
Remember:
- Activations add non-linearity.
- Hidden activations shape learning dynamics.
- Output activations shape prediction meaning.
- Saturating activations can cause vanishing gradients.
- ReLU is simple and strong.
- GELU and SiLU are smooth modern defaults.
- Sigmoid is for probabilities and gates.
- Softmax is for competing classes.
- There is no universally best activation.