Activation Functions in Neural Networks

Activation functions are mathematical operations applied to neurons in a neural network that introduce non-linearity, enabling the network to learn complex patterns and relationships in data. Without activation functions, neural networks would be limited to learning only linear transformations, regardless of depth.

Formal Definition

An activation function is a mathematical function $f: \mathbb{R} \to \mathbb{R}$ (or $f: \mathbb{R} \to (a, b)$) that transforms the weighted sum of inputs to a neuron, introducing non-linearity into the network and determining the neuron's output signal.

Given the weighted sum $z = \sum_{i=1}^{n} w_i x_i + b$, the neuron's output is: $a = f(z)$
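To make the definition concrete, here is a minimal sketch of a single neuron in plain Python. The function names (`neuron_output`, `sigmoid`) and the example weights are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function, used here only as an example activation f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, bias, activation):
    """Compute a = f(z), where z = sum_i w_i * x_i + b."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # pre-activation
    return activation(z)

# Hypothetical example values, chosen so that z = 0.5*2 - 0.25*4 + 0 = 0.
w = [0.5, -0.25]
x = [2.0, 4.0]
b = 0.0
print(neuron_output(w, x, b, sigmoid))  # sigmoid(0) = 0.5
```

Passing the activation as an argument keeps the linear step and the non-linear step separate, mirroring the two-step view of a neuron.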

I. Fundamental Concepts

1. What are the main purposes of Activation Functions?

Activation functions are mathematical transformations applied at each neuron that determine:

  1. Whether a neuron should activate (fire a signal)
  2. The strength of the activation (output magnitude)
  3. What information propagates to subsequent layers

Think of activation functions as decision gates that control information flow through the network.

2. The Neuron's Computation Pipeline

Every neuron performs a two-step process:
[Figure: ML_AI/images/af-1.png]

Step 1: Linear Aggregation (Pre-activation)

$$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b$$

Where:

Step 2: Non-Linear Transformation (Activation)

a=f(z)

Where:

3. Information Flow Through Layers

In Hidden Layers:

In Output Layer:

II. Mathematical Foundation

1. Why Do We Need Activation Functions?

The Linearity Trap:

Without activation functions, a neural network—regardless of its depth—collapses into a single linear transformation.

Mathematical Proof:
Consider a 3-layer network with only linear operations:

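$$\begin{aligned}
\text{Layer 1: } h_1 &= W_1 x + b_1 \\
\text{Layer 2: } h_2 &= W_2 h_1 + b_2 = W_2(W_1 x + b_1) + b_2 \\
\text{Layer 3: } y &= W_3 h_2 + b_3 = W_3[W_2(W_1 x + b_1) + b_2] + b_3
\end{aligned}$$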

Expanding:

$$y = W_3 W_2 W_1 x + W_3 W_2 b_1 + W_3 b_2 + b_3$$

This simplifies to:

$$y = Wx + b$$

Where $W = W_3 W_2 W_1$ and $b = W_3 W_2 b_1 + W_3 b_2 + b_3$

Conclusion: A deep linear network = a single linear layer (no benefit from depth!)
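The collapse can also be checked numerically. The sketch below uses small, hypothetical 2x2 weight matrices and plain Python lists; the helper names (`matvec`, `matmul`, `add`) are assumptions made for this example.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights and biases for three purely linear layers.
W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [1.0, -1.0]
W2, b2 = [[0.5, 0.0], [1.0, 1.0]], [0.0, 2.0]
W3, b3 = [[2.0, -1.0], [1.0, 1.0]], [3.0, 0.0]
x = [1.0, 2.0]

# Forward pass layer by layer.
h1 = add(matvec(W1, x), b1)
h2 = add(matvec(W2, h1), b2)
y_deep = add(matvec(W3, h2), b3)

# Equivalent single layer: W = W3 W2 W1, b = W3 W2 b1 + W3 b2 + b3.
W = matmul(W3, matmul(W2, W1))
b = add(add(matvec(matmul(W3, W2), b1), matvec(W3, b2)), b3)
y_single = add(matvec(W, x), b)

print(y_deep, y_single)  # both are [0.0, 12.0]
```

Inserting any non-linear function between the layers breaks this equality, which is the point of the proof.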

2. Non-Linearity: The Game Changer

Activation functions introduce non-linearity through:

Impact on Decision Boundaries:

3. The Role of Derivatives in Learning (Backpropagation)

Neural networks learn through gradient descent, which requires computing gradients of the loss function with respect to all parameters.

The Chain Rule in Action:
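To update weight $w_{ij}^{[l]}$ connecting neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$:

$$\frac{\partial L}{\partial w_{ij}^{[l]}} = \frac{\partial L}{\partial a_j^{[l]}} \times \frac{\partial a_j^{[l]}}{\partial z_j^{[l]}} \times \frac{\partial z_j^{[l]}}{\partial w_{ij}^{[l]}}$$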

Where:

Key Insight: The activation function's derivative $f'(z)$ directly controls gradient flow!
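As a sanity check on the chain rule, the following sketch compares the three-factor analytic gradient with a finite-difference estimate for a single Sigmoid neuron. The squared-error loss, the helper names, and the example values are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pre_activation(w, x, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss(w, x, b, y):
    """Assumed loss for this sketch: L = 0.5 * (a - y)^2 with a = sigmoid(z)."""
    return 0.5 * (sigmoid(pre_activation(w, x, b)) - y) ** 2

def analytic_grad(w, x, b, y):
    """dL/dw_i = dL/da * da/dz * dz/dw_i."""
    a = sigmoid(pre_activation(w, x, b))
    dL_da = a - y             # derivative of the loss w.r.t. the activation
    da_dz = a * (1.0 - a)     # derivative of the activation function, f'(z)
    return [dL_da * da_dz * xi for xi in x]  # dz/dw_i = x_i

def numeric_grad(w, x, b, y, eps=1e-6):
    """Central finite differences, used only to verify the analytic result."""
    grads = []
    for i in range(len(w)):
        w_plus = w[:i] + [w[i] + eps] + w[i + 1:]
        w_minus = w[:i] + [w[i] - eps] + w[i + 1:]
        grads.append((loss(w_plus, x, b, y) - loss(w_minus, x, b, y)) / (2 * eps))
    return grads

# Hypothetical example values.
w, x, b, y = [0.3, -0.7], [1.5, 2.0], 0.1, 1.0
print(analytic_grad(w, x, b, y))
print(numeric_grad(w, x, b, y))
```

The middle factor `da_dz` is where the activation function enters: if it is close to zero, the whole gradient is close to zero regardless of the other two factors.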

4. Gradient Pathologies

Vanishing Gradients:

Exploding Gradients:

Dead Neurons (Dying ReLU):

III. Activation Functions Catalog

1. Linear Activation

Function:
$$f(z) = mz$$
where $m$ is a constant (usually 1)

Derivative:
$$f'(z) = m$$

Mathematical Properties:

Advantages:

Disadvantages:

Use Cases:

Key Insight

Always remember why linear activations defeat the purpose of deep networks in hidden layers. The key is that composition of linear functions is still linear.

2. Sigmoid (Logistic Function)

Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Derivative:
$$\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$$

Mathematical Properties:

Advantages:

Disadvantages:

Gradient Analysis:

Use Cases:

Key Insight to remember

  1. Why sigmoid causes vanishing gradients mathematically
  2. The "not zero-centered" problem and its impact on learning
  3. Why it's still perfect for binary classification outputs
  4. The relationship between sigmoid and binary cross-entropy loss
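The Sigmoid formulas above can be sketched in a few lines. The two-branch form is one common way to avoid overflow for large negative inputs; the function names are assumptions for this example.

```python
import math

def sigmoid(z):
    """Sigmoid written so that math.exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so e is in (0, 1)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```

Printing the gradient for large positive or negative `z` shows the saturation that leads to vanishing gradients: the derivative shrinks toward zero on both sides.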

3. Softmax (Multi-Class Output)

Formula:
For output vector $z = [z_1, z_2, \ldots, z_K]$:
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Derivative:
$$\frac{\partial \text{Softmax}(z_i)}{\partial z_j} = \text{Softmax}(z_i)(\delta_{ij} - \text{Softmax}(z_j))$$
where $\delta_{ij}$ is the Kronecker delta

Mathematical Properties:

Advantages:

Disadvantages:

Numerical Stabilization:
To prevent overflow, subtract the maximum value before computing:

$$\text{Softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_{j=1}^{K} e^{z_j - \max(z)}}$$
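A minimal sketch of the stabilized form in plain Python follows; the function name `softmax` and the sample logits are assumptions for illustration. Subtracting the maximum leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import math

def softmax(z):
    """Softmax with the max-subtraction trick to avoid overflow in exp."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))
# A naive math.exp(1000.0) would raise OverflowError; the shifted version does not.
print(softmax([1000.0, 1001.0, 1002.0]))
```

Both calls print the same probabilities, since the two inputs differ only by a constant shift.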

Use Cases:

Key Insight to remember

  1. The difference between Softmax (mutually exclusive) and Sigmoid (independent)
  2. Why Softmax is used with categorical cross-entropy loss
  3. The numerical stabilization trick
  4. When to use Softmax vs multiple Sigmoid outputs

4. Tanh (Hyperbolic Tangent)

Function:
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{2}{1 + e^{-2z}} - 1$$

Derivative:
$$\tanh'(z) = 1 - \tanh^2(z)$$

Mathematical Properties:

Advantages:

Disadvantages:

Use Cases:

Key Insights to remember

  1. Why Tanh is better than Sigmoid for hidden layers (zero-centered, stronger gradients)
  2. The relationship: $\tanh(z) = 2\sigma(2z) - 1$
  3. Why it's preferred in RNNs over Sigmoid
  4. When vanishing gradients still occur despite improvements over Sigmoid
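The relationship between Tanh and Sigmoid is easy to verify numerically. This sketch compares a Tanh built from Sigmoid against `math.tanh`; the helper names are assumptions for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    """tanh(z) = 2 * sigmoid(2z) - 1: a shifted and rescaled Sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

def tanh_grad(z):
    """tanh'(z) = 1 - tanh(z)^2; its maximum is 1 at z = 0."""
    return 1.0 - math.tanh(z) ** 2

for z in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(z, math.tanh(z), tanh_via_sigmoid(z), tanh_grad(z))
```

The gradient column shows the same saturation pattern as Sigmoid, only with a peak of 1 instead of 0.25.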

5. ReLU (Rectified Linear Unit)

Function:
$$\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$

Derivative:
$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$

Mathematical Properties:

Advantages:

Disadvantages:

The Dying ReLU Problem:

Use Cases:

Initialization Recommendation:

Key Insights to remember

  1. Why ReLU doesn't suffer from vanishing gradients (for positive values)
  2. The dying ReLU problem and how to detect/prevent it
  3. Why it's computationally efficient compared to Sigmoid/Tanh
  4. The concept of sparse activation and why it's beneficial
  5. Proper initialization strategies (He initialization)
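One way to detect dying ReLU is to count neurons whose pre-activation is never positive over a batch. The sketch below assumes a hypothetical layout where `pre_activations[n][k]` holds the value of $z$ for neuron `k` on sample `n`; the function names are also assumptions.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative convention used in this document: 0 at z = 0."""
    return 1.0 if z > 0 else 0.0

def dead_fraction(pre_activations):
    """Fraction of neurons with z <= 0 on every sample of the batch."""
    num_neurons = len(pre_activations[0])
    dead = sum(
        1
        for k in range(num_neurons)
        if all(sample[k] <= 0 for sample in pre_activations)
    )
    return dead / num_neurons

# Hypothetical batch: 3 samples, 3 neurons. Neuron 1 is never positive.
batch = [
    [0.5, -1.0, -0.2],
    [1.5, -0.3, 0.4],
    [-0.1, -2.0, 0.9],
]
print(dead_fraction(batch))  # one of three neurons is dead on this batch
```

A neuron that is inactive on one batch is not necessarily dead for good, so in practice this statistic would be tracked over many batches.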

6. Leaky ReLU

Function:
$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$
where $\alpha$ is a small constant (typically 0.01)

Derivative:
$$\text{LeakyReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \leq 0 \end{cases}$$

Mathematical Properties:

Advantages:

Disadvantages:

Use Cases:

Variants:

  1. Parametric ReLU (PReLU):

    • α is learned during training
    • Formula: PReLU(z)=max(αz,z)
    • One α per channel/neuron
  2. Randomized Leaky ReLU (RReLU):

    • α is random during training, fixed during testing
    • Formula: $\alpha \sim U(l, u)$ where $l < u$ (e.g., $l = 1/8$, $u = 1/3$)
    • Acts as regularization

Key Insights to remember

  1. How Leaky ReLU solves the dying ReLU problem
  2. The trade-off: slight increase in computation for better gradient flow
  3. When to use standard ReLU vs Leaky ReLU
  4. The difference between Leaky ReLU, PReLU, and RReLU
  5. Why it's popular in GANs
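The variants differ only in how the negative slope $\alpha$ is chosen, which a short sketch makes visible. The function names and the `training` flag are assumptions for this example; using the midpoint $(l + u)/2$ as the fixed test-time slope is a common convention and is also an assumption here.

```python
import random

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU with a fixed slope. PReLU has the same form, but alpha is a learned parameter."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def rrelu(z, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """Randomized Leaky ReLU: random slope while training, fixed slope at test time."""
    if z > 0:
        return z
    if training:
        alpha = rng.uniform(lower, upper)
    else:
        alpha = (lower + upper) / 2  # assumed test-time slope: midpoint of the range
    return alpha * z

print(leaky_relu(-2.0), leaky_relu_grad(-2.0))
print(rrelu(-1.0, training=True), rrelu(-1.0, training=False))
```

Because the gradient for negative inputs is `alpha` rather than 0, a neuron that receives only negative inputs can still update its weights.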

7. ELU (Exponential Linear Unit)

Function:
$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$
where $\alpha > 0$ (typically 1.0)

Derivative:
$$\text{ELU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha e^z = \text{ELU}(z) + \alpha & \text{if } z \leq 0 \end{cases}$$

Mathematical Properties:

Advantages:

Disadvantages:

Use Cases:

Key Insight to remember

  1. Why ELU is smoother than ReLU and why that matters
  2. The trade-off between performance and computation time
  3. How negative saturation helps with normalization
  4. When to choose ELU over ReLU or Leaky ReLU
  5. Connection to SELU (Scaled ELU) for self-normalizing networks
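The ELU formulas can be sketched directly, including the identity $\text{ELU}'(z) = \text{ELU}(z) + \alpha$ for $z \leq 0$, which lets the backward pass reuse the forward output. The function names are assumptions for this example.

```python
import math

def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """For z <= 0 the derivative alpha * e^z equals elu(z) + alpha."""
    return 1.0 if z > 0 else elu(z, alpha) + alpha

for z in (-50.0, -1.0, 0.0, 1.0):
    print(z, elu(z), elu_grad(z))
```

For very negative inputs the output saturates at $-\alpha$ and the gradient approaches zero, which is the negative saturation referred to above.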

8. Advanced Activations (Modern Architectures)

GELU (Gaussian Error Linear Unit)

Formula:

$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot P(Z \leq z), \quad Z \sim \mathcal{N}(0, 1)$$

Approximation:

$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left[\sqrt{2/\pi}\left(z + 0.044715 z^3\right)\right]\right)$$
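Both forms can be compared with the standard library, using the identity $\Phi(z) = \tfrac{1}{2}\left(1 + \text{erf}(z/\sqrt{2})\right)$ for the exact version. The function names are assumptions for this sketch.

```python
import math

def gelu_exact(z):
    """GELU(z) = z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, gelu_exact(z), gelu_tanh(z))
```

The two columns agree closely across this range, which is why the approximation is often used where an `erf` implementation is unavailable or slow.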

Properties:

Use Cases:

Swish (SiLU - Sigmoid Linear Unit)

Formula:

$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$
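A minimal sketch of Swish in plain Python, with assumed function names, shows its shape: close to zero for very negative inputs, a small dip below zero, and nearly linear for large positive inputs.

```python
import math

def sigmoid(z):
    # Two-branch form so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z * sigmoid(z)

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, swish(z))
```

Unlike ReLU, the function is smooth and non-monotonic: it decreases slightly before rising, so small negative inputs still produce a non-zero output and gradient.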

Properties:

Use Cases:

Mish

Formula:

$$\text{Mish}(z) = z \cdot \tanh(\ln(1 + e^z)) = z \cdot \tanh(\text{softplus}(z))$$
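Mish can be sketched the same way. The softplus helper below uses the rewrite $\ln(1 + e^z) = \max(z, 0) + \ln(1 + e^{-|z|})$ to avoid overflow; the function names are assumptions for this example.

```python
import math

def softplus(z):
    """ln(1 + e^z), written so that exp never receives a positive argument."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def mish(z):
    """Mish(z) = z * tanh(softplus(z))."""
    return z * math.tanh(softplus(z))

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, softplus(z), mish(z))
```

For large positive `z`, softplus approaches `z` and the tanh factor approaches 1, so Mish behaves like the identity there.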

Properties:

Use Cases:

IV. Selection Guide & Best Practices

Decision Tree for Activation Functions

graph LR
    Start([Choose<br/>Activation<br/>Function]) --> Task{Which<br/>Layer?}
    Task -->|Hidden Layers| Task1{What's<br/>your<br/>goal?}
    Task -->|Output Layer| Task2{Task<br/>Type?}
    Task1 -->|Default/Fast| ReLU[✓ ReLU<br/>Fast & Effective]
    Task1 -->|Deep Network<br/>40+ layers| Modern[✓ Swish/GELU<br/>Best Performance]
    Task1 -->|Dying ReLU<br/>Problem| Leaky[✓ Leaky ReLU<br/>or ELU]
    Task1 -->|Transformer/NLP| GELU_Choice[✓ GELU<br/>BERT, GPT standard]
    Task1 -->|RNN/LSTM| Tanh_Choice[✓ Tanh<br/>Zero-centered]
    Task2 -->|Binary<br/>Classification| Sigmoid_Out[✓ Sigmoid<br/>Probability 0-1]
    Task2 -->|Multi-Class<br/>1 label| Softmax_Out[✓ Softmax<br/>Probabilities sum to 1]
    Task2 -->|Multi-Label<br/>Multiple tags| Sigmoid_Multi[✓ Sigmoid<br/>Independent outputs]
    Task2 -->|Regression| Linear_Out[✓ Linear<br/>No activation]
    style Start fill:#FFE5E5
    style ReLU fill:#E5F3FF
    style Modern fill:#E5FFE5
    style Leaky fill:#FFF5E5
    style GELU_Choice fill:#FFE5F3
    style Tanh_Choice fill:#F3E5FF
    style Sigmoid_Out fill:#E5F3FF
    style Softmax_Out fill:#E5FFE5
    style Sigmoid_Multi fill:#FFF5E5
    style Linear_Out fill:#FFE5F3

Comprehensive Comparison Table

| Activation | Formula | Range | Key Advantages | Main Disadvantages | Best Use Case | Computational Cost |
|---|---|---|---|---|---|---|
| Linear | $f(z) = z$ | $(-\infty, \infty)$ | Simple, fast | No non-linearity | Regression output | Very Low |
| Sigmoid | $\frac{1}{1 + e^{-z}}$ | $(0, 1)$ | Probability interpretation | Vanishing gradient, not zero-centered | Binary classification output | High |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $(-1, 1)$ | Zero-centered, stronger gradients than Sigmoid | Vanishing gradient | RNNs, shallow networks | High |
| ReLU | $\max(0, z)$ | $[0, \infty)$ | Fast, no vanishing gradient for positive inputs, sparse | Dying ReLU, not zero-centered | Default for hidden layers | Very Low |
| Leaky ReLU | $\max(\alpha z, z)$ | $(-\infty, \infty)$ | Fixes dying ReLU, still fast | Inconsistent results | GANs, when ReLU dies | Low |
| ELU | $z$ or $\alpha(e^z - 1)$ | $(-\alpha, \infty)$ | Smooth, robust to noise | Computationally expensive | Deep networks, noisy data | Medium |
| GELU | $z \cdot \Phi(z)$ | $[\approx -0.17, \infty)$ | SOTA in transformers | Expensive, complex | NLP, transformers | High |
| Swish | $z \cdot \sigma(z)$ | $[\approx -0.28, \infty)$ | Outperforms ReLU in deep nets | More expensive than ReLU | Deep CNNs (40+ layers) | Medium |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $(0, 1)$, sum = 1 | Probability distribution | Output layer only | Multi-class classification | Medium |

Best Practices by Domain

Computer Vision (CNNs)

  1. Hidden Layers: ReLU (default) or Swish (if maximizing performance)
  2. Output Layer:
    • Classification: Softmax (multi-class) or Sigmoid (binary/multi-label)
    • Segmentation: Sigmoid (per-pixel probabilities)
    • Object Detection: Sigmoid (box confidence)

Natural Language Processing (Transformers)

  1. Feed-Forward Layers: GELU (BERT/GPT standard)
  2. Output Layer:

Recurrent Networks (RNNs, LSTMs, GRUs)

  1. Hidden States: Tanh (standard, zero-centered)
  2. Gate Activations: Sigmoid (in LSTMs/GRUs)
  3. Output Layer: Task-dependent (Softmax for sequence classification)

Generative Models

  1. GANs - Discriminator: Leaky ReLU
  2. GANs - Generator: ReLU or Tanh (output layer)
  3. VAEs: ReLU (encoder/decoder), Sigmoid (reconstruction output)
  4. Diffusion Models: GELU or Swish

V. Common Mistakes to Avoid

| Mistake | Why It Is a Problem | Solution / Alternatives |
|---|---|---|
| Using Sigmoid/Tanh in deep hidden layers | Vanishing gradients cripple learning | Use ReLU or variants |
| Forgetting activation in output layer | Wrong output range for task | Always specify based on task requirements |
| Using Softmax for multi-label problems | Classes aren't mutually exclusive | Use Sigmoid per class |
| Not initializing weights properly | Activations saturate immediately, neurons die | Use He initialization (ReLU), Xavier initialization (Tanh/Sigmoid) |
| Blindly copying architectures | Activation choice depends on specific problem | Understand why each activation was chosen |
| Ignoring dying ReLU problem | 30-40% of neurons may be dead | Monitor activation statistics, use Leaky ReLU if needed |
| Using Linear activation in hidden layers | Network collapses to single layer | Always use non-linear activations |
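The initialization advice can be sketched as follows, assuming the commonly used normal-distribution variants: He initialization with standard deviation $\sqrt{2 / \text{fan\_in}}$ and Xavier (Glorot) initialization with $\sqrt{2 / (\text{fan\_in} + \text{fan\_out})}$. The function names and the seed parameter are assumptions for this example.

```python
import math
import random

def he_std(fan_in):
    """Standard deviation for He (normal) initialization, typically paired with ReLU."""
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot (normal) initialization, typically paired with Tanh/Sigmoid."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, std, seed=0):
    """Return a fan_out x fan_in weight matrix drawn from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

relu_layer = init_weights(512, 256, he_std(512))
tanh_layer = init_weights(512, 256, xavier_std(512, 256))
print(he_std(512), xavier_std(512, 256))
```

Both schemes scale the weights with the layer width so that activations neither saturate nor shrink as the signal moves through the network.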

VI. Debugging Activation Issues

| Symptom | What to Check | Solution |
|---|---|---|
| Training loss not decreasing | Dead ReLU neurons (high % of zero activations); vanishing gradients (gradient magnitudes < 1e-5) | Change activation or adjust learning rate |
| Loss exploding (NaN values) | Unbounded activations (ReLU with no normalization); poor initialization | Add batch normalization, gradient clipping, or use ELU |
| Slow convergence | Sigmoid/Tanh used in a deep network; activations that are not zero-centered, causing zig-zagging | Switch to ReLU (deep networks) or Tanh (zero-centered replacement for Sigmoid) |
| Overfitting | Network capacity too high | Add dropout, use fewer neurons, or try activations with implicit regularization (GELU, Mish) |

VII. Best Practices & Recommendations

1. Start Simple, Then Optimize

2. Match Activation to Task

3. Watch for These Problems

4. Initialization Matters

VIII. Questions and Answers

Q1: Why can't we use linear activations in hidden layers?

Answer:
Because composition of linear functions is linear. Mathematically:

$$f(x) = W_3(W_2(W_1 x + b_1) + b_2) + b_3 = Wx + b$$

No matter how many layers, it reduces to a single linear transformation, providing no benefit from depth.

Q2: Explain the vanishing gradient problem.

Answer:
In backpropagation, gradients are computed via chain rule:

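$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_L} \prod_{l=2}^{L} f'(z_l) \frac{\partial z_l}{\partial a_{l-1}}$$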

For Sigmoid, $f'(z) \leq 0.25$. In a 10-layer network: $(0.25)^{10} \approx 10^{-6}$
The gradient becomes negligibly small, and early layers don't learn.

Solution: Use ReLU ($f'(z) = 1$ for $z > 0$) or batch normalization.
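The arithmetic in this answer is easy to reproduce. The sketch below only multiplies the largest possible activation derivative across layers; it ignores the weight terms in the chain rule, which is a simplifying assumption, and the function name is illustrative.

```python
def gradient_scale(max_derivative, num_layers):
    """Upper bound on the product of activation derivatives across num_layers layers."""
    return max_derivative ** num_layers

for depth in (2, 5, 10, 20):
    sigmoid_bound = gradient_scale(0.25, depth)  # Sigmoid: f'(z) <= 0.25
    relu_active = gradient_scale(1.0, depth)     # ReLU: f'(z) = 1 for z > 0
    print(depth, sigmoid_bound, relu_active)
```

The Sigmoid column shrinks geometrically with depth, while the ReLU column stays at 1 for active neurons.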

Q3: What is the dying ReLU problem and how do you fix it?

Answer:
Problem: If z<0 for all inputs to a ReLU neuron, then:

Causes:

Solutions:

  1. Use Leaky ReLU: f(z)=max(0.01z,z) (small gradient for negatives)
  2. Proper initialization: He initialization
  3. Lower learning rate
  4. Monitor activation statistics during training

Q4: When should you use Sigmoid vs Softmax in the output layer?

Answer:

| Aspect | Sigmoid | Softmax |
|---|---|---|
| Use Case | Binary classification OR multi-label | Multi-class (single label) |
| Output | Independent probabilities per class | Probability distribution (sum = 1) |
| Example | Image tagging (multiple tags possible) | Digit classification (exactly one digit) |
| Loss | Binary cross-entropy per output | Categorical cross-entropy |
| Formula | $\sigma(z_i)$ for each output independently | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ |

Key Distinction: Are outputs mutually exclusive? If yes → Softmax. If no → Sigmoid.
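The distinction shows up when both functions are applied to the same logits. In this sketch the logits and function names are hypothetical; Sigmoid scores each class on its own, while Softmax forces the scores to compete.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]  # hypothetical scores for three classes

independent = [sigmoid(z) for z in logits]  # multi-label: each class judged separately
exclusive = softmax(logits)                 # multi-class: one distribution over classes

print(independent, sum(independent))  # the sum is not constrained to 1
print(exclusive, sum(exclusive))      # the sum is 1 up to rounding
```

With Sigmoid, two classes can both score above 0.5, which is what a multi-label task needs; with Softmax, raising one probability necessarily lowers the others.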

Q5: Why is Tanh preferred over Sigmoid in hidden layers?

Answer:

  1. Zero-centered: Tanh output is in $(-1, 1)$, mean ≈ 0

    • Sigmoid output is in $(0, 1)$, always positive
    • Zero-centered activations lead to more direct gradient descent (no zig-zagging)
  2. Stronger gradients:

    • Tanh: $\max f'(z) = 1$ at $z = 0$
    • Sigmoid: $\max f'(z) = 0.25$ at $z = 0$
    • 4x stronger gradient flow
  3. Mathematical relationship: $\tanh(z) = 2\sigma(2z) - 1$ (shifted and scaled Sigmoid)

However: Both still suffer from vanishing gradients, so ReLU is often better for deep networks.

Q6: Explain why ReLU is computationally efficient.

Answer:

ReLU: f(z)=max(0,z)

Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$

Benchmark: ReLU is roughly 6x faster than Sigmoid in practice, though the exact factor depends on hardware and implementation.
Impact: In a network with millions of activations, this speedup is significant.

Q7: When would you use ELU over ReLU?

Answer:

Choose ELU when:

  1. Noisy data: ELU's smooth negative saturation is more robust
  2. Need better performance: ELU often achieves lower test error
  3. Computation isn't critical: Can afford 2-3x slower activation
  4. Deep network: Smooth gradients help in very deep architectures

Choose ReLU when:

  1. Speed matters: Production systems, real-time applications
  2. ReLU works well: No need to optimize further
  3. Large scale: Millions of parameters, speed is critical
  4. First try: Always start with ReLU as baseline

Trade-off: ELU offers roughly a 1-2% improvement in accuracy for about 2-3x the activation compute cost.

Q8: What activation function would you use for these tasks?

| Task | Activation | Reason |
|---|---|---|
| Predicting house prices | Linear (output layer) | Unbounded continuous values |
| Email spam detection | Sigmoid (output layer) | Binary probability |
| Handwritten digit recognition | Softmax (output layer) | 10 mutually exclusive classes |
| Image tagging (multiple objects) | Sigmoid (per class) | Independent labels |
| Hidden layers in CNN | ReLU | Fast, prevents vanishing gradients |
| LSTM hidden states | Tanh | Zero-centered, bounded |
| Transformer feed-forward | GELU | State-of-the-art for NLP |

Q9: When would you choose ReLU over Leaky ReLU?

  1. When the neural network has a shallow architecture: ReLU is computationally efficient and simpler than Leaky ReLU, which makes it more suitable for shallow architectures.
  2. When the data is relatively clean and has few outliers: ReLU is less likely to introduce noise into the network since it only activates on positive input values. Therefore, it is suitable for datasets that have a limited amount of noise or outliers.
  3. When speed is a critical factor: Since ReLU has a simpler structure and requires fewer computations than Leaky ReLU, it can be faster to train and deploy. Therefore, it is preferred in scenarios where speed is critical, such as real-time applications.
  4. When the neural network is used for feature learning: ReLU can be more effective at learning features than Leaky ReLU, especially when used in the context of deep learning architectures. This is because ReLU encourages sparse representations, which can help to capture more informative features in the data.

Q10: When would you choose Leaky ReLU over ReLU?

  1. When the neural network has a deep architecture: Leaky ReLU can help to prevent the “Dying ReLU” problem, where some neurons may stop activating because they always receive negative input values, which is more likely to occur in deeper networks.
  2. When the data has a lot of noise or outliers: Leaky ReLU can provide a non-zero output for negative input values, which can help to avoid discarding potentially important information, and thus perform better than ReLU in scenarios where the data has a lot of noise or outliers.
  3. When generalization performance is a priority: Leaky ReLU keeps a small signal flowing for negative inputs, and its randomized variant (RReLU) additionally injects noise during training, which acts as regularization and can help to reduce overfitting.
  4. When the neural network is used for regression tasks: Leaky ReLU can be more effective than ReLU for regression tasks, especially when the output range is not restricted to positive values, since it can provide both positive and negative outputs.

Q11: What is the Vanishing Gradient Problem?

The vanishing gradient problem is a major obstacle in training deep artificial neural networks, where gradients used to update model weights become exponentially small as they travel backwards through the network layers. This prevents early layers from learning, effectively halting the model's ability to train properly.

Why It Happens
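During training, a neural network calculates its error and uses a process called backpropagation to adjust its weights. This process relies heavily on the chain rule, which requires multiplying multiple derivatives (slopes) together across the network. When each of those derivatives is smaller than 1 (the Sigmoid derivative, for example, never exceeds 0.25), the product shrinks exponentially with the number of layers.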

Q12: What does zero-centered mean, and how is it useful?

The Tanh (hyperbolic tangent) function is an S-shaped curve defined by $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. It squashes input values into the output range $(-1, 1)$. Zero-centered means that its output values are distributed symmetrically around 0.
Because it is zero-centered, it outputs a mix of positive and negative values, which keeps neuron outputs balanced around zero and prevents gradients from getting stuck pushing in one direction during model training.

Why the Zero-Centered Mean is Helpful?

IX. Summary

Activation functions are the non-linear transformations that give neural networks their power to learn complex patterns. Here are the key takeaways:

For Hidden Layers:

  1. Start with ReLU (default, fast, effective)
  2. If ReLU neurons die → Leaky ReLU
  3. If you need SOTA → GELU (NLP) or Swish (Vision)
  4. If shallow network → Tanh (zero-centered)

For Output Layers:

  1. Binary classification → Sigmoid
  2. Multi-class (one label) → Softmax
  3. Multi-label (multiple tags) → Sigmoid
  4. Regression → Linear (no activation)

X. Additional Resources

Visualizations: