Complete Forward & Backward Propagation Example

Overview: The Big Picture

| Phase | Purpose | Key Output |
|---|---|---|
| Forward Propagation | Pass input through the network to get a prediction | Prediction $\hat{y}$ and loss $L$ |
| Backward Propagation | Compute gradients: how much each weight contributed to the error | Gradients $\frac{\partial L}{\partial w}$ for all weights |
| Weight Update | Adjust weights to reduce the error | New weights $w_{\text{new}}$ |

One complete cycle (Forward → Backward → Update) = one training iteration.
Repeat for many iterations, over many epochs (full passes through the data), until the loss stops decreasing and the predictions are accurate.

Network Architecture

Input Layer (2 neurons) → Hidden Layer (2 neurons) → Output Layer (1 neuron)

Notation Reference

| Symbol | Meaning |
|---|---|
| $x_i$ | Input feature $i$ |
| $w_{ij}$ | Weight from input $i$ to hidden neuron $j$ |
| $w_{3j}$ | Weight from hidden neuron $j$ to output |
| $b_j$ | Bias for neuron $j$ |
| $z_j^{(l)}$ | Pre-activation (weighted sum) at layer $l$, neuron $j$ |
| $a_j^{(l)}$ | Activation (output after sigmoid) at layer $l$, neuron $j$ |
| $\hat{y}$ | Network's prediction (output layer activation) |
| $y$ | True target label |
| $L$ | Loss (error) |
| $\delta_j^{(l)}$ | Error signal at layer $l$, neuron $j$ |
| $\eta$ | Learning rate |

Sample Data (5 rows, 2 columns)

| Sample | $x_1$ | $x_2$ | $y$ (Target) |
|---|---|---|---|
| 1 | 0.1 | 0.2 | 0 |
| 2 | 0.4 | 0.5 | 1 |
| 3 | 0.3 | 0.6 | 1 |
| 4 | 0.7 | 0.1 | 0 |
| 5 | 0.9 | 0.8 | 1 |

Initialized Weights and Biases

Input → Hidden Layer:

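$$W^{(1)} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} = \begin{bmatrix} 0.15 & 0.20 \\ 0.25 & 0.30 \end{bmatrix}, \qquad b^{(1)} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 0.35 \\ 0.35 \end{bmatrix}$$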

Hidden → Output Layer:

$$W^{(2)} = \begin{bmatrix} w_{31} & w_{32} \end{bmatrix} = \begin{bmatrix} 0.40 & 0.45 \end{bmatrix}, \qquad b^{(2)} = 0.60$$

Diagrammatic Representation

ML_AI/images/neu_net_eg-1.png

👉 Both forward and backward propagation are worked through in detail for Sample 1: $x_1 = 0.1$, $x_2 = 0.2$, $y = 0$

Forward Propagation

Forward Propagation in 5 Steps

Step 1: Input to Hidden Layer

Compute weighted sum (z) for each hidden neuron:

Linear equation for each hidden neuron:

$$z_1^{(1)} = w_{11}x_1 + w_{21}x_2 + b_1 = 0.15 \times 0.1 + 0.25 \times 0.2 + 0.35 = 0.015 + 0.05 + 0.35 = 0.415$$

$$z_2^{(1)} = w_{12}x_1 + w_{22}x_2 + b_2 = 0.20 \times 0.1 + 0.30 \times 0.2 + 0.35 = 0.02 + 0.06 + 0.35 = 0.43$$

Step 2: Apply sigmoid activation:

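Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

$$a_1^{(1)} = \sigma(z_1^{(1)}) = \sigma(0.415) = \frac{1}{1 + e^{-0.415}} = 0.6023$$

$$a_2^{(1)} = \sigma(z_2^{(1)}) = \sigma(0.43) = \frac{1}{1 + e^{-0.43}} = 0.6058$$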

Step 3: Hidden to Output Layer

Compute weighted sum for output:

$$z^{(2)} = w_{31}a_1^{(1)} + w_{32}a_2^{(1)} + b^{(2)} = 0.40 \times 0.6023 + 0.45 \times 0.6058 + 0.60 = 0.2409 + 0.2726 + 0.60 = 1.1135$$

Step 4: Apply sigmoid activation (final prediction)

$$\hat{y} = a^{(2)} = \sigma(z^{(2)}) = \sigma(1.1135) = \frac{1}{1 + e^{-1.1135}} = \frac{1}{1.3285} = 0.7528$$
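Steps 1 to 4 can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`z_out`, `b_out`, and so on) are illustrative choices, not part of the original notation. Because the code keeps full floating-point precision, its results can differ from the hand-rounded values in the fourth decimal place.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Sample 1 and the initial parameters listed earlier.
x1, x2 = 0.1, 0.2
w11, w12, w21, w22 = 0.15, 0.20, 0.25, 0.30
b1, b2 = 0.35, 0.35
w31, w32, b_out = 0.40, 0.45, 0.60

# Steps 1-2: weighted sums and activations of the hidden layer.
z1 = w11 * x1 + w21 * x2 + b1
z2 = w12 * x1 + w22 * x2 + b2
a1, a2 = sigmoid(z1), sigmoid(z2)

# Steps 3-4: weighted sum and activation of the output neuron.
z_out = w31 * a1 + w32 * a2 + b_out
y_hat = sigmoid(z_out)

print(f"z1={z1:.4f} z2={z2:.4f} a1={a1:.4f} a2={a2:.4f}")
print(f"z_out={z_out:.4f} y_hat={y_hat:.4f}")
```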

Step 5: Calculate Loss (Binary Cross-Entropy)

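Binary Cross-Entropy (using the natural logarithm):

$$L = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$$

For Sample 1, where $y = 0$:

$$L = -\left[\, 0 \cdot \log(0.7528) + (1 - 0)\log(1 - 0.7528) \,\right] = -\log(0.2472) = 1.3976$$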
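The loss formula translates directly into code. The sketch below is one possible implementation; the `eps` clipping is an added safeguard against `log(0)` and is an assumption of this example, not something the walkthrough uses.

```python
import math

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for a single sample (natural log)."""
    # Clip the prediction away from 0 and 1 so log() never receives 0.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

loss = bce(0, 0.7528)   # Sample 1: y = 0, prediction 0.7528
print(f"{loss:.4f}")
```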

Repeat the same five steps for every row of the sample data.

Forward Propagation Results for All 5 Samples

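| Sample | $x_1$ | $x_2$ | $z_1^{(1)}$ | $z_2^{(1)}$ | $a_1^{(1)}$ | $a_2^{(1)}$ | $z^{(2)}$ | $\hat{y}$ | $y$ | BCE Loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1 | 0.2 | 0.415 | 0.430 | 0.6023 | 0.6058 | 1.1135 | 0.7528 | 0 | 1.3976 |
| 2 | 0.4 | 0.5 | 0.535 | 0.580 | 0.6307 | 0.6411 | 1.1407 | 0.7578 | 1 | 0.2773 |
| 3 | 0.3 | 0.6 | 0.545 | 0.590 | 0.6330 | 0.6434 | 1.1428 | 0.7582 | 1 | 0.2768 |
| 4 | 0.7 | 0.1 | 0.480 | 0.520 | 0.6177 | 0.6272 | 1.1293 | 0.7557 | 0 | 1.4094 |
| 5 | 0.9 | 0.8 | 0.685 | 0.770 | 0.6649 | 0.6836 | 1.1734 | 0.7638 | 1 | 0.2695 |

Total BCE Loss (mean over the 5 samples):

$$L_{\text{total}} = \frac{1}{N}\sum_{i=1}^{N} L_i = \frac{1}{5}\,(1.3976 + 0.2773 + 0.2768 + 1.4094 + 0.2695) = \frac{3.6306}{5} = 0.7261$$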
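The per-sample losses and their mean can be recomputed in one loop. This sketch hard-codes the initial weights inside a hypothetical helper `forward`; with full floating-point precision it should give a mean loss of about 0.7261.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2):
    """Forward pass with the initial weights; returns the prediction y_hat."""
    a1 = sigmoid(0.15 * x1 + 0.25 * x2 + 0.35)
    a2 = sigmoid(0.20 * x1 + 0.30 * x2 + 0.35)
    return sigmoid(0.40 * a1 + 0.45 * a2 + 0.60)

def bce(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# (x1, x2, y) for the five samples.
samples = [(0.1, 0.2, 0), (0.4, 0.5, 1), (0.3, 0.6, 1), (0.7, 0.1, 0), (0.9, 0.8, 1)]

losses = [bce(y, forward(x1, x2)) for x1, x2, y in samples]
mean_loss = sum(losses) / len(losses)

for i, loss in enumerate(losses, start=1):
    print(f"sample {i}: loss = {loss:.4f}")
print(f"mean loss = {mean_loss:.4f}")
```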

Backward Propagation

We compute gradients using the chain rule and propagate errors backward.

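$$\frac{\partial L}{\partial w^{(1)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}$$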
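A gradient is simply the rate at which the loss changes when one weight is nudged, so any chain-rule result can be sanity-checked numerically. The sketch below estimates the gradient of the Sample 1 loss with respect to $w_{11}$ by a central finite difference; the step size `h` and the helper name `loss_for_w11` are assumptions of this example. The estimate should land close to the value of about 0.00721 that the chain rule gives for this weight in Step 4.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for_w11(w11, x1=0.1, x2=0.2, y=0):
    """BCE loss for Sample 1 as a function of w11, all other parameters fixed."""
    a1 = sigmoid(w11 * x1 + 0.25 * x2 + 0.35)
    a2 = sigmoid(0.20 * x1 + 0.30 * x2 + 0.35)
    y_hat = sigmoid(0.40 * a1 + 0.45 * a2 + 0.60)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

h = 1e-6  # small perturbation of the weight
numeric_grad = (loss_for_w11(0.15 + h) - loss_for_w11(0.15 - h)) / (2 * h)
print(f"numerical dL/dw11 = {numeric_grad:.5f}")
```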

Visual Representation of Backpropagation

ML_AI/images/backpropagation-4.png

Backward Propagation in 5 Steps

Step 1: Compute the Output Layer Error

Goal

  • The error signal $\delta^{(2)}$ is the "blame" assigned to the output neuron. We use it to update the output-layer weights and to propagate the error backward.
  • How: multiply (1) the derivative of the loss with respect to the prediction by (2) the derivative of the output neuron's activation function.

Step 1.1 — Binary Cross-Entropy Derivative:

👉 Question: How much does the loss change when the prediction $\hat{y}$ changes?
This measures the sensitivity of our loss function to prediction errors. If the prediction is far from the truth, this value is large.

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = -\frac{0}{0.7528} + \frac{1}{0.2472} = 4.0453$$
Step 1.2 — Sigmoid Derivative:

Question: How much does $\hat{y}$ change when the pre-activation $z^{(2)}$ changes?
This captures how "steep" the sigmoid curve is at the current output. The slope is largest near $\hat{y} = 0.5$ and flattest near $\hat{y} = 0$ or $\hat{y} = 1$.

$$\frac{\partial \hat{y}}{\partial z^{(2)}} = \sigma'(z^{(2)}) = \hat{y}(1 - \hat{y}) = 0.7528 \times 0.2472 = 0.1861$$
Step 1.3 — The Output Neuron Error (δ(2)):

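Question: How much does the loss change when the pre-activation $z^{(2)}$ changes?
➢ It acts as the primary scaling factor, dictating the overall direction and magnitude of the necessary correction at the end of the network.

Calculation: The Multiplication (Chain Rule)

$$\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}}$$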

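We can carry out this calculation in two ways.

Method 1: Numerical Substitution

Substitute the values from Steps 1.1 and 1.2:

$$\delta^{(2)} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} = 4.0453 \times 0.1861 = 0.7528$$

Method 2: Equation Simplification

Substitute the two derivative formulas and simplify; a remarkably clean result emerges:

$$\delta^{(2)} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right) \cdot \hat{y}(1 - \hat{y})$$

$$\delta^{(2)} = -y(1 - \hat{y}) + (1 - y)\hat{y} = -y + y\hat{y} + \hat{y} - y\hat{y}$$

$$\delta^{(2)} = \hat{y} - y$$

With $y = 0$:

$$\delta^{(2)} = 0.7528 - 0 = 0.7528$$

This compact result makes implementation much simpler.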

This isn't a coincidence: binary cross-entropy pairs naturally with the sigmoid. The awkward fractions in the BCE derivative cancel exactly against the sigmoid derivative, leaving the clean $\hat{y} - y$ result.
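Both methods can be compared side by side in code. In this small sketch (variable names are illustrative), the chain-rule product from Method 1 and the shortcut from Method 2 agree up to floating-point error.

```python
y, y_hat = 0, 0.7528   # Sample 1 target and prediction

dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)   # Step 1.1: BCE derivative
dyhat_dz = y_hat * (1 - y_hat)                   # Step 1.2: sigmoid derivative

delta_out_chain = dL_dyhat * dyhat_dz            # Method 1: multiply the two factors
delta_out_short = y_hat - y                      # Method 2: simplified form

print(f"{dL_dyhat:.4f} x {dyhat_dz:.4f} = {delta_out_chain:.4f}")
print(f"shortcut: {delta_out_short:.4f}")
```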

Step 2: Gradients for Output Layer Weights

Goal

  • Find $\frac{\partial L}{\partial w_{31}}$, $\frac{\partial L}{\partial w_{32}}$, $\frac{\partial L}{\partial b^{(2)}}$ ➛ How much does each output-layer weight, and the output bias, contribute to the loss?
  • These gradients tell us in which direction, and by how much, each weight connecting the hidden layer to the output layer should change to reduce the error.

Recap

Recall that $z^{(2)} = w_{31}a_1^{(1)} + w_{32}a_2^{(1)} + b^{(2)}$.

  • The hidden activations $a_1^{(1)}$ and $a_2^{(1)}$ are the inputs flowing into weights $w_{31}$ and $w_{32}$ respectively.
  • A larger input means that weight had more influence on the output — so it deserves more "blame" for the error.

The Chain Rule for Each Parameter
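Each gradient is found by extending the chain rule one more step, from $z^{(2)}$ back to the weight:

$$\frac{\partial L}{\partial w_{3j}} = \underbrace{\frac{\partial L}{\partial z^{(2)}}}_{\delta^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w_{3j}}$$

Local derivatives are simply:

$$\frac{\partial z^{(2)}}{\partial w_{31}} = a_1^{(1)}, \qquad \frac{\partial z^{(2)}}{\partial w_{32}} = a_2^{(1)}, \qquad \frac{\partial z^{(2)}}{\partial b^{(2)}} = 1$$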
The Three Gradients
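Gradient for weight $w_{31}$

Question: How much did weight $w_{31}$ contribute to the error?

$$\frac{\partial L}{\partial w_{31}} = \delta^{(2)} \cdot a_1^{(1)} = 0.7528 \times 0.6023 \approx 0.4534$$

Gradient for weight $w_{32}$

Question: How much did weight $w_{32}$ contribute to the error?

$$\frac{\partial L}{\partial w_{32}} = \delta^{(2)} \cdot a_2^{(1)} = 0.7528 \times 0.6058 \approx 0.4561$$

Gradient for bias $b^{(2)}$

Question: How much did bias $b^{(2)}$ contribute to the error?
The bias has no input multiplier, so its gradient is just the error.

$$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} = 0.7528$$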
Key insight — Gradient = Error × Input
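The "error × input" pattern is a one-liner per parameter. This sketch plugs in the rounded values from the walkthrough; the names `grad_w31`, `grad_w32`, and `grad_b_out` are illustrative.

```python
delta_out = 0.7528          # output error from Step 1
a1, a2 = 0.6023, 0.6058     # hidden activations from the forward pass

grad_w31 = delta_out * a1   # error x input flowing through w31
grad_w32 = delta_out * a2   # error x input flowing through w32
grad_b_out = delta_out      # the bias "input" is the constant 1

print(f"{grad_w31:.4f} {grad_w32:.4f} {grad_b_out:.4f}")
```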

Step 3: Hidden Layer Errors

Goal

  • Find $\frac{\partial L}{\partial z_1^{(1)}}$ and $\frac{\partial L}{\partial z_2^{(1)}}$ ➛ How much does each hidden neuron's pre-activation contribute to the loss?
  • These error signals tell us how much "blame" each hidden neuron deserves, which we'll use to update the input-to-hidden weights.

Recap

Recall that $z^{(2)} = w_{31}a_1^{(1)} + w_{32}a_2^{(1)} + b^{(2)}$ and $a_j^{(1)} = \sigma(z_j^{(1)})$.

  • The weight $w_{3j}$ determines how much hidden neuron $j$ influences the output.
  • A larger weight means that hidden neuron had more influence on the output — so it deserves more "blame" for the error.

The Chain Rule for Each Parameter
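Each hidden error is found by extending the chain rule two more steps, from $z^{(2)}$ back through the activation $a_j^{(1)}$ to the pre-activation $z_j^{(1)}$:

$$\frac{\partial L}{\partial z_j^{(1)}} = \underbrace{\frac{\partial L}{\partial z^{(2)}}}_{\delta^{(2)}} \cdot \underbrace{\frac{\partial z^{(2)}}{\partial a_j^{(1)}}}_{w_{3j}} \cdot \underbrace{\frac{\partial a_j^{(1)}}{\partial z_j^{(1)}}}_{\sigma'(z_j^{(1)})}$$

Since $z^{(2)} = w_{31}a_1^{(1)} + w_{32}a_2^{(1)} + b^{(2)}$ and $a_j^{(1)} = \sigma(z_j^{(1)})$, the local derivatives are:

$$\frac{\partial z^{(2)}}{\partial a_1^{(1)}} = w_{31}, \qquad \frac{\partial z^{(2)}}{\partial a_2^{(1)}} = w_{32}, \qquad \frac{\partial a_j^{(1)}}{\partial z_j^{(1)}} = a_j^{(1)}\left(1 - a_j^{(1)}\right)$$

This simplifies to the general formula:

$$\delta_j^{(1)} = \left(\delta^{(2)} \cdot w_{3j}\right) \cdot \sigma'(z_j^{(1)})$$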
The Two Hidden Error Signals
| Term | What it represents | Value |
|---|---|---|
| $\delta^{(2)}$ | Output error | 0.7528 (from Step 1) |
| $w_{31}$ | Connection strength from hidden neuron 1 to output | 0.40 (initial weight) |
| $w_{32}$ | Connection strength from hidden neuron 2 to output | 0.45 (initial weight) |
| $\sigma'(z_1^{(1)})$ | Sigmoid derivative: how "responsive" neuron 1 was | $a_1^{(1)}(1 - a_1^{(1)}) = 0.6023 \times 0.3977 = 0.2395$ |
| $\sigma'(z_2^{(1)})$ | Sigmoid derivative: how "responsive" neuron 2 was | $a_2^{(1)}(1 - a_2^{(1)}) = 0.6058 \times 0.3942 = 0.2388$ |

Error for hidden neuron 1 ($\delta_1^{(1)}$)

Question: How much did hidden neuron 1's pre-activation contribute to the error?

$$\delta_1^{(1)} = \delta^{(2)} \cdot w_{31} \cdot \sigma'(z_1^{(1)}) = 0.7528 \times 0.40 \times 0.2395 = 0.0721$$

Error for hidden neuron 2 ($\delta_2^{(1)}$)

Question: How much did hidden neuron 2's pre-activation contribute to the error?

$$\delta_2^{(1)} = \delta^{(2)} \cdot w_{32} \cdot \sigma'(z_2^{(1)}) = 0.7528 \times 0.45 \times 0.2388 = 0.0809$$
Key Insight — Hidden Error = Output Error × Weight × Local Slope
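The hidden-layer rule can be written as a short list comprehension. This sketch uses the rounded values from the walkthrough; the list names are illustrative.

```python
delta_out = 0.7528        # output error from Step 1
w3 = [0.40, 0.45]         # w31, w32: hidden-to-output weights
a = [0.6023, 0.6058]      # a1, a2: hidden activations

# Hidden error = output error x connecting weight x local sigmoid slope.
delta_hidden = [delta_out * w * aj * (1 - aj) for w, aj in zip(w3, a)]

print([round(d, 4) for d in delta_hidden])
```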

Step 4: Gradients for Hidden Layer Weights

Goal

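  • Find $\frac{\partial L}{\partial w_{11}}$, $\frac{\partial L}{\partial w_{21}}$, $\frac{\partial L}{\partial w_{12}}$, $\frac{\partial L}{\partial w_{22}}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial b_2}$ ➛ How much does each input-to-hidden weight and bias contribute to the loss?
  • These gradients tell us how to adjust the connections between the input layer and the hidden layer to reduce the error.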

Recap

Recall that $z_j^{(1)} = w_{1j}x_1 + w_{2j}x_2 + b_j$.

  • The input $x_i$ is the value flowing into weight $w_{ij}$.
  • A larger input means that weight had more influence on the hidden neuron's output — so it deserves more "blame" for the error.

The Chain Rule for Each Parameter
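Since we already computed $\delta_j^{(1)} = \frac{\partial L}{\partial z_j^{(1)}}$ in Step 3, we just need one more link in the chain:

$$\frac{\partial L}{\partial w_{ij}} = \underbrace{\frac{\partial L}{\partial z_j^{(1)}}}_{\delta_j^{(1)}} \cdot \frac{\partial z_j^{(1)}}{\partial w_{ij}}$$

Since $z_j^{(1)} = w_{1j}x_1 + w_{2j}x_2 + b_j$, the local derivatives are simply:

$$\frac{\partial z_j^{(1)}}{\partial w_{1j}} = x_1, \qquad \frac{\partial z_j^{(1)}}{\partial w_{2j}} = x_2, \qquad \frac{\partial z_j^{(1)}}{\partial b_j} = 1$$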
The Six Gradients
| Term | What it represents | Value |
|---|---|---|
| $\delta_1^{(1)}$ | Hidden neuron 1 error | 0.0721 (from Step 3) |
| $\delta_2^{(1)}$ | Hidden neuron 2 error | 0.0809 (from Step 3) |
| $x_1$ | Input feature 1 | 0.1 (Sample 1) |
| $x_2$ | Input feature 2 | 0.2 (Sample 1) |
Gradient for weight $w_{11}$

Question: How much did weight $w_{11}$ (connecting $x_1$ to hidden neuron 1) contribute to the error?

$$\frac{\partial L}{\partial w_{11}} = \delta_1^{(1)} \cdot x_1 = 0.0721 \times 0.1 = 0.00721$$

Gradient for weight $w_{21}$

Question: How much did weight $w_{21}$ (connecting $x_2$ to hidden neuron 1) contribute to the error?

$$\frac{\partial L}{\partial w_{21}} = \delta_1^{(1)} \cdot x_2 = 0.0721 \times 0.2 = 0.01442$$

Gradient for weight $w_{12}$

Question: How much did weight $w_{12}$ (connecting $x_1$ to hidden neuron 2) contribute to the error?

$$\frac{\partial L}{\partial w_{12}} = \delta_2^{(1)} \cdot x_1 = 0.0809 \times 0.1 = 0.00809$$

Gradient for weight $w_{22}$

Question: How much did weight $w_{22}$ (connecting $x_2$ to hidden neuron 2) contribute to the error?

$$\frac{\partial L}{\partial w_{22}} = \delta_2^{(1)} \cdot x_2 = 0.0809 \times 0.2 = 0.01618$$

Gradient for bias $b_1$

Question: How much did bias $b_1$ contribute to the error?
The bias has no input multiplier, so its gradient is just the error.

$$\frac{\partial L}{\partial b_1} = \delta_1^{(1)} = 0.0721$$

Gradient for bias $b_2$

Question: How much did bias $b_2$ contribute to the error?
The bias has no input multiplier, so its gradient is just the error.

$$\frac{\partial L}{\partial b_2} = \delta_2^{(1)} = 0.0809$$
Key Insight — Gradient = Error × Input (Same Pattern as Step 2!)
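All four weight gradients are the products of one input and one hidden error, so they can be built in a single nested comprehension. In this sketch the layout `grad_W1[i][j]` (input index first, hidden-neuron index second, both zero-based) is a choice made for the example so that it mirrors the $w_{ij}$ naming.

```python
x = [0.1, 0.2]                    # x1, x2 for Sample 1
delta_hidden = [0.0721, 0.0809]   # hidden errors from Step 3

# grad_W1[i][j] holds dL/dw_(i+1)(j+1) = delta_hidden[j] * x[i]
grad_W1 = [[d * xi for d in delta_hidden] for xi in x]
grad_b1 = list(delta_hidden)      # bias gradients equal the hidden errors

for i, row in enumerate(grad_W1, start=1):
    for j, g in enumerate(row, start=1):
        print(f"dL/dw{i}{j} = {g:.5f}")
```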

Step 5: Weight Updates

Goal

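Calculate new values of $w_{11}$, $w_{21}$, $w_{12}$, $w_{22}$, $w_{31}$, $w_{32}$, $b_1$, $b_2$, and $b^{(2)}$ using gradient descent:

  • $w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial L}{\partial w}$
  • $b_{\text{new}} = b_{\text{old}} - \eta \, \frac{\partial L}{\partial b}$

Assumption
➢ Learning Rate $\eta = 0.5$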

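| Parameter | Old Value | Gradient | New Value |
|---|---|---|---|
| $w_{11}$ | 0.15 | 0.00721 | 0.1464 |
| $w_{21}$ | 0.25 | 0.01442 | 0.2428 |
| $w_{12}$ | 0.20 | 0.00809 | 0.1960 |
| $w_{22}$ | 0.30 | 0.01618 | 0.2919 |
| $b_1$ | 0.35 | 0.0721 | 0.3139 |
| $b_2$ | 0.35 | 0.0809 | 0.3096 |
| $w_{31}$ | 0.40 | 0.4534 | 0.1733 |
| $w_{32}$ | 0.45 | 0.4561 | 0.2219 |
| $b^{(2)}$ | 0.60 | 0.7528 | 0.2236 |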
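The update rule is the same subtraction for every parameter, which a dictionary comprehension makes explicit. This is a sketch with illustrative key names (`b_out` stands for $b^{(2)}$); the output-layer gradients are recomputed as error × input, so results can differ from the table in the last printed digit.

```python
eta = 0.5  # learning rate

params = {"w11": 0.15, "w21": 0.25, "w12": 0.20, "w22": 0.30,
          "b1": 0.35, "b2": 0.35, "w31": 0.40, "w32": 0.45, "b_out": 0.60}

grads = {"w11": 0.00721, "w21": 0.01442, "w12": 0.00809, "w22": 0.01618,
         "b1": 0.0721, "b2": 0.0809,
         "w31": 0.7528 * 0.6023,   # delta_out * a1
         "w32": 0.7528 * 0.6058,   # delta_out * a2
         "b_out": 0.7528}

# Gradient descent step: new = old - eta * gradient.
new_params = {name: params[name] - eta * grads[name] for name in params}

for name, value in new_params.items():
    print(f"{name}: {params[name]:.2f} -> {value:.4f}")
```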

What Happens Next?

  1. Repeat for all samples — In practice, you'd compute gradients for all 5 samples and average them (batch gradient descent), or update after each sample (stochastic gradient descent).

  2. Multiple Epochs — One pass through all samples = 1 epoch. Training typically requires hundreds or thousands of epochs until loss converges.

  3. Monitor Loss — After each epoch, check if Ltotal is decreasing. If it plateaus or increases, adjust learning rate or check for issues.

  4. Evaluate — Once trained, test on unseen data to measure generalization.
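Putting the pieces together, the loop below sketches batch gradient descent on the five samples: gradients are averaged over all samples, the parameters are updated once per epoch, and the mean loss is recorded. The function names, the dictionary layout, and the choice of 200 epochs are assumptions made for this example, not settings from the walkthrough.

```python
import math

DATA = [(0.1, 0.2, 0), (0.4, 0.5, 1), (0.3, 0.6, 1), (0.7, 0.1, 0), (0.9, 0.8, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params():
    return {"w11": 0.15, "w12": 0.20, "w21": 0.25, "w22": 0.30,
            "b1": 0.35, "b2": 0.35, "w31": 0.40, "w32": 0.45, "b_out": 0.60}

def forward(p, x1, x2):
    a1 = sigmoid(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
    a2 = sigmoid(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
    y_hat = sigmoid(p["w31"] * a1 + p["w32"] * a2 + p["b_out"])
    return a1, a2, y_hat

def gradients(p, x1, x2, y):
    """Backward pass for one sample, following Steps 1-4."""
    a1, a2, y_hat = forward(p, x1, x2)
    d_out = y_hat - y                          # Step 1
    d1 = d_out * p["w31"] * a1 * (1 - a1)      # Step 3
    d2 = d_out * p["w32"] * a2 * (1 - a2)
    return {"w31": d_out * a1, "w32": d_out * a2, "b_out": d_out,   # Step 2
            "w11": d1 * x1, "w21": d1 * x2, "b1": d1,               # Step 4
            "w12": d2 * x1, "w22": d2 * x2, "b2": d2}

def mean_loss(p):
    total = 0.0
    for x1, x2, y in DATA:
        y_hat = forward(p, x1, x2)[2]
        total -= y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return total / len(DATA)

def train(epochs=200, eta=0.5):
    p = init_params()
    history = [mean_loss(p)]
    for _ in range(epochs):
        avg = {name: 0.0 for name in p}
        for x1, x2, y in DATA:                 # average gradients over the batch
            for name, g in gradients(p, x1, x2, y).items():
                avg[name] += g / len(DATA)
        for name in p:                         # Step 5: one update per epoch
            p[name] -= eta * avg[name]
        history.append(mean_loss(p))
    return p, history

params, history = train()
print(f"loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Switching to stochastic gradient descent would mean applying the update inside the inner loop, once per sample, instead of once per epoch.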

Key Insight: The beauty of backpropagation is that the same pattern — Gradient = Error × Input — repeats at every layer, making it computationally efficient and scalable to deep networks.