Neural network training iteratively adjusts edge weights (and biases) to reduce the difference between the network’s predictions and the true labels in the training data.
```
Initialise all weights randomly (small values)
Repeat for many epochs:
    For each training example (x, y):
        1. Forward propagation: compute output y_hat
        2. Compute loss: L(y_hat, y)
        3. Backpropagation: compute gradient of L w.r.t. each weight
        4. Update weights: w <- w - learning_rate * gradient
Stop when loss is sufficiently small or validation error stops improving
```
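This loop maps directly onto code. Below is a minimal sketch in Python/NumPy for a single neuron (logistic regression); the toy dataset, learning rate, and epoch count are illustrative choices, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # toy inputs: 100 examples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels

w = rng.normal(scale=0.01, size=2)         # initialise weights randomly (small values)
b = 0.0                                    # bias
learning_rate = 0.1

for epoch in range(50):                    # repeat for many epochs
    for xi, yi in zip(X, y):               # for each training example (x, y)
        y_hat = 1 / (1 + np.exp(-(xi @ w + b)))                       # 1. forward propagation
        loss = -(yi * np.log(y_hat) + (1 - yi) * np.log(1 - y_hat))   # 2. compute loss
        grad = y_hat - yi                  # 3. gradient of loss w.r.t. the pre-activation
        w -= learning_rate * grad * xi     # 4. update weights (step opposite to gradient)
        b -= learning_rate * grad
```

For a sigmoid output trained with binary cross-entropy, the gradient of the loss with respect to the pre-activation simplifies to $\hat{y} - y$, which is why no explicit sigmoid derivative appears in the update.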
The loss function measures prediction error. Training minimises the average loss over all training examples.
| Task | Loss | Formula |
|---|---|---|
| Binary classification | Binary cross-entropy | $-\frac{1}{n}\sum_i\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$ |
| Regression | MSE | $\frac{1}{n}\sum_i(y_i - \hat{y}_i)^2$ |
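Both losses are one-liners in code. A small NumPy sketch (the array values are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Mean over examples; y_hat must lie strictly between 0 and 1
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])
print(binary_cross_entropy(y, np.array([0.9, 0.2, 0.7])))  # ~0.23 (good predictions)
print(mse(np.array([3.0, -0.5]), np.array([2.5, 0.0])))    # 0.25
```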
Weights are updated by a step opposite to the gradient (downhill on the loss surface):
$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$
Where:
- $\eta$: learning rate (hyperparameter controlling step size)
- $\frac{\partial \mathcal{L}}{\partial w}$: gradient of loss with respect to weight $w$
KEY TAKEAWAY: Gradient descent moves weights in the direction that decreases the loss. The learning rate controls how large each step is. Too large: unstable training. Too small: very slow convergence.
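A toy example makes the takeaway concrete: minimise $\mathcal{L}(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose minimum sits at $w = 3$. The learning rates below are illustrative.

```python
def descend(learning_rate, steps=20, w=0.0):
    for _ in range(steps):
        gradient = 2 * (w - 3)             # dL/dw for L(w) = (w - 3)^2
        w = w - learning_rate * gradient   # w <- w - eta * gradient
    return w

print(descend(0.1))    # ~2.97: converges towards the minimum at 3
print(descend(0.001))  # ~0.12: too small, barely moved in 20 steps
print(descend(1.1))    # ~ -112: too large, oscillates with growing magnitude
```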
Backpropagation efficiently computes the gradient of the loss with respect to every weight using the chain rule of calculus. Error signals are propagated backwards through the network from output to input.
Conceptually:
1. Compute loss at output
2. Propagate error backward layer by layer
3. Compute how much each weight contributed to the error
4. Update each weight accordingly
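To make steps 2 and 3 concrete, take a tiny network with one hidden unit, $h = \sigma(w_1 x)$ and $\hat{y} = \sigma(w_2 h)$ (the symbols here are illustrative, not VCAA notation). The chain rule factorises the gradient for the first-layer weight into local derivatives:

$$\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'(w_2 h)\,w_2 \cdot \sigma'(w_1 x)\,x$$

Each factor is computed once during the backward pass and reused across weights, which is what makes backpropagation efficient.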
EXAM TIP: For VCAA, you do not need to derive or implement backpropagation mathematically. Know the concept: forward pass computes predictions; backward pass computes how to update weights to reduce error.
| Hyperparameter | Effect |
|---|---|
| Learning rate $\eta$ | Controls step size per update |
| Epochs | Number of full passes through training data |
| Batch size | Examples used per weight update |
| Architecture | Depth and width determine model capacity |
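All four hyperparameters appear explicitly when training with a framework. A sketch assuming TensorFlow/Keras is installed; the data shapes and all values are illustrative:

```python
import numpy as np
from tensorflow import keras

X = np.random.normal(size=(100, 4))
y = (X.sum(axis=1) > 0).astype("float32")

model = keras.Sequential([                   # architecture: depth and width
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),  # learning rate (eta)
    loss="binary_crossentropy",
)
model.fit(X, y, epochs=50, batch_size=32)    # epochs and batch size
```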
| Method | Updates per step | Pros | Cons |
|---|---|---|---|
| Batch GD | All examples | Stable | Slow, memory-heavy |
| Stochastic GD | One example | Fast | High variance |
| Mini-batch GD | Small batch | Balance | Most common in practice |
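The three methods differ only in how many examples feed each weight update. A sketch of one epoch under each schedule, where `gradient_fn` (an assumed helper) returns the gradient of the average loss over whatever examples it is given:

```python
import numpy as np

def one_epoch(w, X, y, gradient_fn, learning_rate, batch_size=None):
    n = len(X)
    if batch_size is None:                 # batch GD: one update from all n examples
        return w - learning_rate * gradient_fn(X, y, w)
    order = np.random.permutation(n)       # shuffle so batches vary between epochs
    for start in range(0, n, batch_size):  # batch_size=1 gives stochastic GD;
        idx = order[start:start + batch_size]  # a small batch_size (e.g. 32) gives mini-batch GD
        w = w - learning_rate * gradient_fn(X[idx], y[idx], w)
    return w

# Example gradient_fn for a linear model y_hat = X @ w with MSE loss
def gradient_fn(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

X = np.random.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(100):
    w = one_epoch(w, X, y, gradient_fn, learning_rate=0.05, batch_size=32)
print(w)  # approaches [1.0, -2.0, 0.5]
```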
As training proceeds, training loss decreases. If validation loss begins to increase while training loss continues to fall, the model is overfitting.
Early stopping: halt training when validation loss starts increasing, preserving the model at its point of best generalisation.
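A common implementation uses a "patience" counter: stop after validation loss has failed to improve for a set number of epochs. A minimal sketch with a simulated validation-loss curve; in real training these values would come from evaluating the model on the validation set after each epoch:

```python
# Simulated validation loss: falls, then rises (overfitting sets in)
val_losses = [0.9, 0.7, 0.55, 0.48, 0.45, 0.44, 0.46, 0.49, 0.53]

patience = 2                  # epochs to wait for an improvement before stopping
best_loss = float("inf")
best_epoch = None
waited = 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch   # checkpoint the best model here
        waited = 0
    else:
        waited += 1
        if waited >= patience:                    # no improvement for `patience` epochs
            print(f"Stopping at epoch {epoch}; best was epoch {best_epoch}")
            break
```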
COMMON MISTAKE: Do not confuse the learning rate with the number of epochs. Learning rate controls how much weights change per update; epochs control how many times the dataset is processed.
VCAA FOCUS: Understand the iterative nature of training (predict, compute error, update weights). Know the role of the loss function, learning rate, and gradient descent. Understand the connection between training and overfitting, and the role of early stopping.