Neural network training iteratively adjusts edge weights (and biases) to reduce the difference between the network’s predictions and the true labels in the training data.
```
Initialise all weights randomly (small values)
Repeat for many epochs:
    For each training example (x, y):
        1. Forward propagation: compute output y_hat
        2. Compute loss: L(y_hat, y)
        3. Backpropagation: compute gradient of L w.r.t. each weight
        4. Update weights: w <- w - learning_rate * gradient
Stop when loss is sufficiently small or validation error stops improving
```
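This loop maps directly onto code. Below is a minimal sketch in Python/NumPy for a single neuron (logistic regression); the toy dataset, learning rate, and epoch count are illustrative choices, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # toy inputs: 100 examples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels

w = rng.normal(scale=0.01, size=2)         # initialise weights randomly (small values)
b = 0.0                                    # bias
learning_rate = 0.1

for epoch in range(50):                    # repeat for many epochs
    for xi, yi in zip(X, y):               # for each training example (x, y)
        y_hat = 1 / (1 + np.exp(-(xi @ w + b)))                       # 1. forward propagation
        loss = -(yi * np.log(y_hat) + (1 - yi) * np.log(1 - y_hat))   # 2. compute loss
        grad = y_hat - yi                  # 3. gradient of loss w.r.t. the pre-activation
        w -= learning_rate * grad * xi     # 4. update weights (step opposite to gradient)
        b -= learning_rate * grad
```

For a sigmoid output trained with binary cross-entropy, the gradient of the loss with respect to the pre-activation simplifies to $\hat{y} - y$, which is why no explicit sigmoid derivative appears in the update.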
The loss function measures prediction error. Training minimises the average loss over all training examples.
| Task | Loss | Formula |
|---|---|---|
| Binary classification | Binary cross-entropy | $-\frac{1}{n}\sum_i\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$ |
| Regression | MSE | $\frac{1}{n}\sum_i(y_i - \hat{y}_i)^2$ |
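Both losses are one-liners in code. A small NumPy sketch (the array values are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Mean over examples; y_hat must lie strictly between 0 and 1
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])
print(binary_cross_entropy(y, np.array([0.9, 0.2, 0.7])))  # ~0.23 (good predictions)
print(mse(np.array([3.0, -0.5]), np.array([2.5, 0.0])))    # 0.25
```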
Weights are updated by a step opposite to the gradient (downhill on the loss surface):
$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$
Where:
- $\eta$: learning rate (hyperparameter controlling step size)
- $\frac{\partial \mathcal{L}}{\partial w}$: gradient of loss with respect to weight $w$
KEY TAKEAWAY: Gradient descent moves weights in the direction that decreases the loss. The learning rate controls how large each step is. Too large: unstable training. Too small: very slow convergence.
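A toy example makes the takeaway concrete: minimise $\mathcal{L}(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose minimum sits at $w = 3$. The learning rates below are illustrative.

```python
def descend(learning_rate, steps=20, w=0.0):
    for _ in range(steps):
        gradient = 2 * (w - 3)             # dL/dw for L(w) = (w - 3)^2
        w = w - learning_rate * gradient   # w <- w - eta * gradient
    return w

print(descend(0.1))    # ~2.97: converges towards the minimum at 3
print(descend(0.001))  # ~0.12: too small, barely moved in 20 steps
print(descend(1.1))    # ~ -112: too large, oscillates with growing magnitude
```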
Backpropagation efficiently computes the gradient of the loss with respect to every weight using the chain rule of calculus. Error signals are propagated backwards through the network from output to input.
Conceptually:
1. Compute loss at output
2. Propagate error backward layer by layer
3. Compute how much each weight contributed to the error
4. Update each weight accordingly
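To make steps 2 and 3 concrete, take a tiny network with one hidden unit, $h = \sigma(w_1 x)$ and $\hat{y} = \sigma(w_2 h)$ (the symbols here are illustrative, not VCAA notation). The chain rule factorises the gradient for the first-layer weight into local derivatives:

$$\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'(w_2 h)\,w_2 \cdot \sigma'(w_1 x)\,x$$

Each factor is computed once during the backward pass and reused across weights, which is what makes backpropagation efficient.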
EXAM TIP: For VCAA, you do not need to derive or implement backpropagation mathematically. Know the concept: forward pass computes predictions; backward pass computes how to update weights to reduce error.
| Hyperparameter | Effect |
|---|---|
| Learning rate $\eta$ | Controls step size per update |
| Epochs | Number of full passes through training data |
| Batch size | Examples used per weight update |
| Architecture | Depth and width determine model capacity |
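All four hyperparameters appear explicitly when training with a framework. A sketch assuming TensorFlow/Keras is installed; the data shapes and all values are illustrative:

```python
import numpy as np
from tensorflow import keras

X = np.random.normal(size=(100, 4))
y = (X.sum(axis=1) > 0).astype("float32")

model = keras.Sequential([                   # architecture: depth and width
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),  # learning rate (eta)
    loss="binary_crossentropy",
)
model.fit(X, y, epochs=50, batch_size=32)    # epochs and batch size
```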
| Method | Updates per step | Pros | Cons |
|---|---|---|---|
| Batch GD | All examples | Stable | Slow, memory-heavy |
| Stochastic GD | One example | Fast | High variance |
| Mini-batch GD | Small batch | Balance | Most common in practice |
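The three methods differ only in how many examples feed each weight update. A sketch of one epoch under each schedule, where `gradient_fn` (an assumed helper) returns the gradient of the average loss over whatever examples it is given:

```python
import numpy as np

def one_epoch(w, X, y, gradient_fn, learning_rate, batch_size=None):
    n = len(X)
    if batch_size is None:                 # batch GD: one update from all n examples
        return w - learning_rate * gradient_fn(X, y, w)
    order = np.random.permutation(n)       # shuffle so batches vary between epochs
    for start in range(0, n, batch_size):  # batch_size=1 gives stochastic GD;
        idx = order[start:start + batch_size]  # a small batch_size (e.g. 32) gives mini-batch GD
        w = w - learning_rate * gradient_fn(X[idx], y[idx], w)
    return w

# Example gradient_fn for a linear model y_hat = X @ w with MSE loss
def gradient_fn(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

X = np.random.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(100):
    w = one_epoch(w, X, y, gradient_fn, learning_rate=0.05, batch_size=32)
print(w)  # approaches [1.0, -2.0, 0.5]
```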
As training proceeds, training loss decreases. If validation loss begins to increase while training loss continues to fall, the model is overfitting.
Early stopping: halt training when validation loss starts increasing, preserving the model at its point of best generalisation.
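A common implementation uses a "patience" counter: stop after validation loss has failed to improve for a set number of epochs. A minimal sketch with a simulated validation-loss curve; in real training these values would come from evaluating the model on the validation set after each epoch:

```python
# Simulated validation loss: falls, then rises (overfitting sets in)
val_losses = [0.9, 0.7, 0.55, 0.48, 0.45, 0.44, 0.46, 0.49, 0.53]

patience = 2                  # epochs to wait for an improvement before stopping
best_loss = float("inf")
best_epoch = None
waited = 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch   # checkpoint the best model here
        waited = 0
    else:
        waited += 1
        if waited >= patience:                    # no improvement for `patience` epochs
            print(f"Stopping at epoch {epoch}; best was epoch {best_epoch}")
            break
```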
COMMON MISTAKE: Do not confuse the learning rate with the number of epochs. Learning rate controls how much weights change per update; epochs control how many times the dataset is processed.
VCAA FOCUS: Understand the iterative nature of training (predict, compute error, update weights). Know the role of the loss function, learning rate, and gradient descent. Understand the connection between training and overfitting, and the role of early stopping.