The Output Generators: Sigmoid and Softmax

Before we can measure an error, our model has to make a prediction. Neural networks output raw numbers (often called logits). We need to transform these raw numbers into usable probabilities between 0 and 1.

This is where activation functions like Sigmoid and Softmax come in. They "squash" logits into probability distributions that we can interpret and use for classification tasks.

1. What is a Logit?

A logit is the natural logarithm of the odds ratio. It's a transformation that maps a probability value from [0, 1] to the entire real number line (-∞, +∞).

Mathematical Definition
$$\text{Logit}(p) = \log\left(\frac{p}{1-p}\right) = z$$

Where:

- $p$ is a probability in $(0, 1)$
- $z$ is the resulting logit, a real number in $(-\infty, +\infty)$

Analogy: Horse Betting

To understand why this transformation is useful, think about horse betting.

In horse betting, there's a commonly used term called odds. When we say the odds of horse number 5 winning are 3/8, we mean that out of every 11 races, we expect the horse to win 3 of them and lose the other 8.

Mathematically, odds are expressed as:

$$\text{odds} = \frac{p(x)}{1 - p(x)}$$

The odds can take any positive value: $[0, +\infty)$. However, if we take the log of the odds, the range changes to $(-\infty, +\infty)$. This is called the logit function.

Why is this useful?
Linear models (like neural networks before the final activation) produce outputs on the entire real number line $(-\infty, +\infty)$. By predicting logits instead of probabilities directly, the model doesn't have to worry about constraining its output to be between 0 and 1. We can then convert the logit back to a probability using the Sigmoid function.

Deriving the Sigmoid Function from Logit

If we set the logit to a variable z (the raw output of a model), we can solve for the probability p:

$$
\begin{aligned}
z &= \log\left(\frac{p}{1-p}\right) \\
e^z &= \frac{p}{1-p} && \text{(exponentiate both sides)} \\
e^z(1-p) &= p && \text{(multiply both sides by } 1-p\text{)} \\
e^z - e^z p &= p \\
e^z &= p + e^z p \\
e^z &= p(1 + e^z) \\
p &= \frac{e^z}{1 + e^z}
\end{aligned}
$$

By dividing the numerator and denominator by ez, we get the familiar sigmoid formula:

$$p = \frac{e^z / e^z}{(1 + e^z)/e^z} = \frac{1}{1/e^z + 1} = \frac{1}{1 + e^{-z}}$$

This shows that the Sigmoid function is the inverse of the Logit function. It converts a logit back into a probability.
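This inverse relationship is easy to verify numerically. The sketch below (using only Python's standard `math` module; the helper names are illustrative) round-trips a probability through the logit and back through the sigmoid:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Map a real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Sigmoid undoes Logit: round-tripping recovers the original probability
for p in [0.1, 0.5, 0.9]:
    assert abs(sigmoid(logit(p)) - p) < 1e-12
```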

Context in Machine Learning

In machine learning, "logit" is used more loosely: it refers to the raw, unnormalized score produced by a model's final layer, before an activation such as Sigmoid or Softmax turns it into a probability.

2. Sigmoid Function

Purpose

The sigmoid function is a continuous, monotonically increasing function used to map predicted values to probabilities for Binary Classification (two choices: Yes/No, 0/1, True/False).

Formula
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:

- $z$ is the input logit (any real number)
- $e$ is Euler's number ($\approx 2.71828$)

Visual Representation

![Sigmoid curve](ML_AI/images/sigmoid-1.jpg)

Derivative of Sigmoid Function

The derivative is essential for backpropagation and gradient descent:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$

Key Properties of the Derivative:

- Its maximum value is $0.25$, reached at $z = 0$
- It is symmetric: $\sigma'(z) = \sigma'(-z)$
- It approaches $0$ as $|z|$ grows, which is the source of vanishing gradients
- It can be computed from $\sigma(z)$ alone, with no need to re-evaluate the exponential

Practice Problems

Calculate the sigmoid function and its derivative at various points:

| $z$ | $\sigma(z)$ | $\sigma'(z)$ | Calculation |
|---|---|---|---|
| $0$ | 0.5000 | 0.2500 | $\frac{1}{1+e^{0}} = 0.5$; $0.5 \times 0.5 = 0.25$ |
| $0.5$ | 0.6225 | 0.2350 | $\frac{1}{1+e^{-0.5}} \approx 0.6225$; $0.6225 \times 0.3775 \approx 0.235$ |
| $1$ | 0.7311 | 0.1966 | $\frac{1}{1+e^{-1}} \approx 0.7311$; $0.7311 \times 0.2689 \approx 0.197$ |
| $-1$ | 0.2689 | 0.1966 | $\frac{1}{1+e^{1}} \approx 0.2689$; $0.2689 \times 0.7311 \approx 0.197$ |
| $-2$ | 0.1192 | 0.1050 | $\frac{1}{1+e^{2}} \approx 0.1192$; $0.1192 \times 0.8808 \approx 0.105$ |
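These values can be reproduced with a few lines of Python (the helper names are illustrative, not from any particular library):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)  # σ'(z) = σ(z)(1 − σ(z))

for z in [0, 0.5, 1, -1, -2]:
    print(f"z = {z:5.1f}   σ(z) = {sigmoid(z):.4f}   σ'(z) = {sigmoid_derivative(z):.4f}")
```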
Properties of the Sigmoid Function

| Property | Description |
|---|---|
| Range | Output values always fall between 0 and 1, ideal for probabilities |
| Asymptotes | Approaches 0 as $z \to -\infty$ and 1 as $z \to +\infty$; $\sigma(0) = 0.5$ |
| Monotonicity | Monotonically increasing — as input increases, output increases |
| Differentiability | Fully differentiable, enabling gradient-based optimization |
| Shape | S-shaped (sigmoidal) curve with smooth, gradual transitions |
| Non-linearity | Introduces non-linearity, allowing models to learn complex patterns |
Advantages and Disadvantages

✅ Advantages

- Outputs interpretable probabilities in $(0, 1)$
- Smooth and fully differentiable, with a simple derivative
- Well understood and easy to implement

🚫 Disadvantages

- Vanishing gradients for large $|z|$
- Output is not zero-centered, which can slow convergence
- The exponential is relatively expensive to compute

Use Cases

- Output layer for binary classification
- Logistic regression
- Gating units in LSTMs and GRUs

3. Softmax Function

Purpose

Softmax is the generalization of Sigmoid for Multi-class Classification (three or more mutually exclusive choices). It converts a vector of raw scores (logits) into a probability distribution.

Formula
$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where:

- $z_i$ is the logit for class $i$
- $K$ is the number of classes
- The denominator sums the exponentials of all $K$ logits

Visual Representation

![Softmax diagram](ML_AI/images/softmax-1.jpg)

Recipe of Softmax

The Softmax function operates in three steps:

  1. Input: Takes a vector z of real numbers (logits) from the final layer

    $\mathbf{z} = [z_1, z_2, \ldots, z_K]$
  2. Exponentiation: Each element is exponentiated using e (Euler's number)

    $[e^{z_1}, e^{z_2}, \ldots, e^{z_K}]$

    This ensures all values become positive and amplifies differences

  3. Normalization: Divide each exponentiated value by the sum of all exponentiated values

    $$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

    This guarantees the outputs sum to 1 (a valid probability distribution)
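The three steps map directly to code. This is a minimal sketch for clarity, not a production implementation — it skips the numerical-stability shift covered later in the Q&A:

```python
import math

def softmax(z):
    # Step 2: exponentiate every logit (all values become positive)
    exps = [math.exp(zi) for zi in z]
    # Step 3: normalize by the sum so the outputs form a distribution
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-12   # a valid probability distribution
assert probs[0] == max(probs)          # the largest logit gets the largest probability
```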

Derivative of Softmax

The derivative of Softmax is more complex due to its dependency on all inputs. For class i with respect to logit zj:

$$\frac{\partial \sigma(z_i)}{\partial z_j} = \begin{cases} \sigma(z_i)\,(1 - \sigma(z_i)) & \text{if } i = j \\ -\sigma(z_i)\,\sigma(z_j) & \text{if } i \neq j \end{cases}$$

When combined with Cross-Entropy Loss, the gradient simplifies elegantly to:

$$\frac{\partial L}{\partial z_i} = \sigma(z_i) - y_i$$

Where yi is the true label (1 for correct class, 0 otherwise). This simple form makes training very efficient.
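This simplification can be checked numerically. The sketch below (illustrative helper names, standard library only) compares the analytic gradient $\sigma(z_i) - y_i$ against a central-difference estimate of the Cross-Entropy gradient:

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, true_idx):
    # L = -log(softmax(z)[true_idx]) for a one-hot true label
    return -math.log(softmax(z)[true_idx])

z, true_idx, eps = [1.0, 2.0, 0.5], 0, 1e-6
probs = softmax(z)
for i in range(len(z)):
    # Numerical gradient via central differences
    z_plus = z[:];  z_plus[i] += eps
    z_minus = z[:]; z_minus[i] -= eps
    numeric = (cross_entropy(z_plus, true_idx) - cross_entropy(z_minus, true_idx)) / (2 * eps)
    analytic = probs[i] - (1.0 if i == true_idx else 0.0)  # σ(z_i) − y_i
    assert abs(numeric - analytic) < 1e-5
```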

Properties of Softmax

| Property | Description |
|---|---|
| Normalization | Converts logits into a probability distribution where $\sum_{i=1}^{K} \sigma(z_i) = 1$ |
| Exponentiation | Amplifies larger values, making the model's confidence more pronounced |
| Differentiability | Fully differentiable, enabling gradient-based optimization |
| Output Range | All outputs lie between 0 and 1 |
| Mutual Exclusivity | Designed for problems where each sample belongs to exactly one class |
| Interpretability | Transforms raw outputs into probabilities that are easy to understand |
Advantages and Disadvantages

✅ Advantages

- Produces a valid probability distribution over all classes
- Amplifies differences between logits, yielding confident, interpretable outputs
- Pairs with Cross-Entropy Loss for a simple, efficient gradient

❌ Disadvantages

- Not suitable for multi-label problems
- Sensitive to outliers and very large logits (numerical overflow without stabilization)
- Assumes classes are mutually exclusive

Use Cases

- Output layer for multi-class classification (e.g., image classification, language modeling)
- Attention weight normalization in Transformer models

Financial Example: Predicting which ETF will perform best this year: VOO, QQQM, or SCHD. Softmax might output [0.50,0.30,0.20], indicating 50% confidence in VOO, 30% in QQQM, and 20% in SCHD.

4. Sigmoid vs. Softmax: Key Differences

![Sigmoid vs. Softmax](ML_AI/images/sigmoid_vs_softmax_1.jpg)

Sigmoid receives a single input and outputs a single probability representing class 1. The probability of class 0 is simply $1 - P(\text{class } 1)$.

Softmax is vectorized — it takes a vector with K entries (one for each class) and outputs another vector where each component represents the probability of belonging to that class. All probabilities sum to 1.
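In fact, for two classes Softmax reduces to Sigmoid: fixing the second logit to 0, $\text{softmax}([z, 0])$ yields $[\sigma(z),\, 1 - \sigma(z)]$, since $e^z/(e^z + e^0) = 1/(1 + e^{-z})$. A quick numerical check of this equivalence:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Softmax over [z, 0] equals [σ(z), 1 − σ(z)]
for z in [-2.0, 0.0, 3.0]:
    p = softmax([z, 0.0])
    assert abs(p[0] - sigmoid(z)) < 1e-12
    assert abs(p[1] - (1 - sigmoid(z))) < 1e-12
```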

| Feature | Sigmoid | Softmax |
|---|---|---|
| Use Case | Binary Classification (2 classes) | Multi-class Classification ($K > 2$ classes) |
| Input | Single scalar $z$ | Vector $\mathbf{z} = [z_1, z_2, \ldots, z_K]$ |
| Output | Single probability $p \in (0, 1)$ | Probability distribution $[\sigma(z_1), \ldots, \sigma(z_K)]$ where $\sum \sigma(z_i) = 1$ |
| Formula | $\frac{1}{1+e^{-z}}$ | $\frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$ |
| Derivative | $\sigma(z)(1-\sigma(z))$ | $\sigma(z_i)(1-\sigma(z_i))$ (diagonal); $-\sigma(z_i)\sigma(z_j)$ (off-diagonal) |
| Interpretation | Probability of positive class | Probability distribution over all classes |
| Classes | Two mutually exclusive classes | Multiple mutually exclusive classes |
| Multi-label | Can be used independently for each label | Not suitable; use multiple Sigmoids instead |

5. Questions and Answers

1. How is the sigmoid function used in neural networks?

In neural networks, the sigmoid function is used as an activation function. It takes the weighted sum of inputs and transforms it into an output between 0 and 1. This introduces non-linearity, allowing the network to learn complex patterns. However, it's mostly used in the output layer for binary classification; ReLU is preferred for hidden layers.

2. What are the mathematical properties of the sigmoid function?

The sigmoid function:

- Has range $(0, 1)$ with horizontal asymptotes at 0 and 1
- Is monotonically increasing and fully differentiable
- Satisfies $\sigma(0) = 0.5$ and the symmetry $\sigma(-z) = 1 - \sigma(z)$
- Has derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is maximized at $z = 0$

3. Why is the sigmoid function important in logistic regression?

The sigmoid function is crucial in logistic regression because it converts the linear combination of input features ($w^T x + b$) into a probability between 0 and 1. This allows the model to predict binary outcomes (e.g., yes/no) and interpret results as probabilities, making it ideal for classification tasks.
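In code, logistic regression's forward pass is just that linear combination followed by a sigmoid. A minimal sketch — the weights and features here are made up purely for illustration:

```python
import math

def predict_proba(w, x, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear combination w·x + b (the logit)
    return 1 / (1 + math.exp(-z))                 # squash to a probability

# Hypothetical weights, bias, and feature vector
w, b = [0.8, -0.4], -0.1
x = [1.5, 2.0]
p = predict_proba(w, x, b)        # probability of the positive class
label = 1 if p >= 0.5 else 0      # threshold at 0.5 for a hard decision
```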

4. How does the sigmoid function compare to other activation functions?

| Function | Pros | Cons | Best Use |
|---|---|---|---|
| Sigmoid | Probabilistic output, smooth gradient | Vanishing gradients, not zero-centered, slow | Output layer (binary) |
| ReLU | Fast, no vanishing gradient (for $z > 0$) | Dead neurons for $z < 0$ | Hidden layers |
| Tanh | Zero-centered, stronger gradients than Sigmoid | Still suffers from vanishing gradients | Hidden layers (less common now) |
| Softmax | Multi-class probabilities, interpretable | Not for multi-label, sensitive to outliers | Output layer (multi-class) |

5. What is the vanishing gradient problem?

The vanishing gradient problem occurs when gradients become extremely small during backpropagation, causing weights to update very slowly or not at all. This happens in Sigmoid when:

- $|z|$ is large, so the curve saturates and $\sigma'(z) \approx 0$
- Many Sigmoid layers are stacked: since $\sigma'(z) \leq 0.25$, gradients shrink multiplicatively as they propagate back through each layer

6. Why use Softmax in the last layer?

The Softmax activation function is typically used in the final layer of a classification neural network because:

- It converts raw logits into a valid probability distribution (non-negative values summing to 1)
- Exponentiation emphasizes the largest logit, producing a clear "winner" among the classes
- Combined with Categorical Cross-Entropy Loss, it yields the simple, efficient gradient $\sigma(z_i) - y_i$

7. Can I use multiple Sigmoid functions instead of Softmax?

Yes, but only for multi-label classification where an input can belong to multiple classes simultaneously (e.g., an image can contain both "cat" and "dog"). Each Sigmoid acts as an independent binary classifier for each class. For multi-class classification where classes are mutually exclusive, use Softmax.
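A minimal sketch of the multi-label setup — one independent Sigmoid per tag, with hypothetical tags and logits chosen for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Multi-label: one independent Sigmoid per tag; probabilities need NOT sum to 1
logits = {"cat": 2.0, "dog": 0.5, "car": -3.0}          # hypothetical per-tag logits
probs = {tag: sigmoid(z) for tag, z in logits.items()}
present = [tag for tag, p in probs.items() if p > 0.5]  # threshold each tag independently
```

Each tag gets its own yes/no decision, so an image can be labeled both "cat" and "dog" at the same time — exactly what Softmax's sum-to-1 constraint would forbid.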

8. How do you handle numerical instability in Softmax?

The exponential function in Softmax can cause overflow for large logits. The solution is to subtract the maximum logit before computing:

$$\sigma(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_{j=1}^{K} e^{z_j - \max(\mathbf{z})}}$$

This is mathematically equivalent but numerically stable. Modern deep learning frameworks handle this automatically.
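The effect of the max-shift is easy to demonstrate in plain Python, where `math.exp` raises `OverflowError` for large inputs (illustrative function names):

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]   # overflows for large logits
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)                        # shift so the largest logit becomes 0
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(logits)             # math.exp(1000.0) raises OverflowError
except OverflowError:
    pass

probs = softmax_stable(logits)        # same result as softmax([-2, -1, 0])
assert abs(sum(probs) - 1.0) < 1e-12
```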


6. Summary

| Concept | Sigmoid | Softmax |
|---|---|---|
| Type | Binary activation | Multi-class activation |
| Classes | 2 (mutually exclusive) | $K$ (mutually exclusive) |
| Output | Single probability | Probability distribution |
| Formula | $\frac{1}{1+e^{-z}}$ | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ |
| Range | $(0, 1)$ | Each output in $(0, 1)$, sum $= 1$ |
| Use in ML | Binary classification output | Multi-class classification output |
| Paired with | Binary Cross-Entropy Loss | Categorical Cross-Entropy Loss |

Key Takeaway: Use Sigmoid for binary problems (two classes). Use Softmax for multi-class problems where each input belongs to exactly one class. For multi-label problems, use multiple independent Sigmoid functions.