Lesson 3 of 8

Backprop & Gradients

How does a neural network learn? Through two interlocking mechanisms: Backpropagation and Gradient Descent.

First, the network makes a prediction (the forward pass). We measure how wrong it is using a Loss Function — Mean Squared Error for regression, Cross-Entropy for classification. The loss is a single number: lower is better.
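Both losses can be computed in a few lines. This is a minimal sketch with made-up toy values; `mse` averages squared errors over predictions, and `cross_entropy` takes the negative log-probability the model assigned to the true class:

```python
import math

# Mean Squared Error: average squared gap between predictions and targets.
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Cross-Entropy for one example: negative log of the probability
# the model assigned to the correct class. Perfect confidence -> loss 0.
def cross_entropy(probs, true_class):
    return -math.log(probs[true_class])

print(mse([2.5, 0.0], [3.0, -0.5]))       # 0.25
print(cross_entropy([0.7, 0.2, 0.1], 0))  # ≈ 0.357
```

Note how both collapse to a single number: a model that predicts the true class with probability 0.7 incurs a cross-entropy of about 0.357, while probability 0.99 would incur only about 0.01.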

Backpropagation then computes the gradient of the loss with respect to every parameter in the network. It does this by applying the Chain Rule recursively from the output layer back to the input: ∂L/∂w = (∂L/∂y)(∂y/∂w). This tells us — for every weight — which direction increases the loss.
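The chain rule can be made concrete on the smallest possible network: one weight, one input. Below is a sketch (toy values are illustrative) of y = w·x with loss L = (y − t)², where the analytic gradient from the chain rule is checked against a finite-difference estimate:

```python
# Tiny "network": y = w * x, loss L = (y - t)^2. Toy values for illustration.
w, x, t = 0.5, 2.0, 3.0

y = w * x                # forward pass: y = 1.0
dL_dy = 2 * (y - t)      # ∂L/∂y = 2(y - t)
dy_dw = x                # ∂y/∂w = x
dL_dw = dL_dy * dy_dw    # chain rule: ∂L/∂w = (∂L/∂y)(∂y/∂w)

# Sanity check: numerical gradient via finite differences.
eps = 1e-6
num = (((w + eps) * x - t) ** 2 - ((w - eps) * x - t) ** 2) / (2 * eps)
print(dL_dw, num)  # both ≈ -8.0
```

The gradient is negative here, meaning increasing w decreases the loss. In a real network the same multiplication of local derivatives repeats layer by layer from the output back toward the input.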

Gradient Descent updates each weight by stepping in the opposite direction of its gradient: w ← w − α · ∂L/∂w. Here α is the learning rate — a crucial hyperparameter. Too small and training is painfully slow. Too large and the updates overshoot, causing divergence. Most modern training uses Adam, an adaptive optimizer that adjusts the learning rate per parameter automatically.
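The update rule above can be run as a loop. This sketch reuses the one-weight toy model (values are made up) and applies w ← w − α · ∂L/∂w until w settles near the value that zeroes the loss:

```python
# Gradient descent on L(w) = (w*x - t)^2. Toy values; alpha kept small.
x, t = 2.0, 3.0
w, alpha = 0.0, 0.1

for step in range(20):
    y = w * x
    grad = 2 * (y - t) * x   # ∂L/∂w from the chain rule
    w -= alpha * grad        # w ← w − α · ∂L/∂w

print(round(w, 4))  # → 1.5, the w that makes y = t exactly
```

Try alpha = 0.6 in this toy problem and w oscillates and diverges instead of converging, which is exactly the overshoot failure mode described above. Adam addresses this sensitivity by maintaining running averages of each parameter's gradients and scaling its step size accordingly.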