How Neural Networks Learn: A Step-by-Step Guide for Beginners

Audience: Complete beginners

Before We Start — A Story About Learning Darts

Imagine you have never thrown a dart before. You walk up to the board, throw, and miss — badly. What do you do next?

You do not give up and throw randomly. You look at where the dart landed, figure out how far off you were, and adjust your throw slightly. You throw again. Still off, but less so. You keep adjusting, little by little, until your throws start hitting the target.

This is exactly how a neural network learns. It makes a prediction, measures how wrong it was, and adjusts its internal settings slightly. It does this thousands of times until it gets good at the task.

Every concept in this article maps back to this story. Keep it in mind as we go.


Table of Contents

  1. What is a Neural Network?
  2. The Model We Will Use
  3. The Data — What We Feed the Network
  4. Step 1 — Flatten the Input
  5. Step 2 — The Forward Pass (Throwing the Dart)
  6. Step 3 — Computing the Loss (How Far Did We Miss?)
  7. Step 4 — Backpropagation (Figuring Out What Went Wrong)
  8. Step 5 — Updating the Weights (Adjusting the Throw)
  9. The Full Training Loop
  10. Summary

1. What is a Neural Network?

A neural network is a program that learns from examples rather than from hand-written rules.

Think about how you learned to recognise a cat as a child. Nobody gave you a rulebook that said "four legs + whiskers + pointy ears = cat." You just saw hundreds of cats, and your brain gradually built up a sense of what makes something a cat.

A neural network does something similar. We show it thousands of labelled images — "this is a T-shirt", "this is a shoe" — and it gradually adjusts its internal numbers until it can tell them apart on its own.

Those internal numbers are called weights. At the start of training, they are random. By the end, they encode everything the network has learned.


2. The Model We Will Use

Here is the PyTorch model we will be working with throughout this article:

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

This model is designed for FashionMNIST — a dataset of 28×28 grayscale images of clothing items across 10 categories (T-shirt, Trouser, Pullover, and so on).

Do not worry if the code looks unfamiliar. By the end of this article, every line will make sense.

To keep numbers manageable, we will walk through all steps using a simplified version: a 2×2 image, 4 neurons per hidden layer, and 3 output classes. The ideas are identical to the full model — just smaller numbers.
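Under those assumptions, the toy version could be sketched like this (the class name ToyNetwork is ours; only the layer sizes differ from the full model):

```python
import torch
from torch import nn

class ToyNetwork(nn.Module):
    # Same structure as the full model, but sized for a 2x2 image:
    # 4 inputs, 4 neurons per hidden layer, 3 output classes.
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(2*2, 4),
            nn.ReLU(),
            nn.Linear(4, 4),
            nn.ReLU(),
            nn.Linear(4, 3),
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.linear_relu_stack(x)

toy = ToyNetwork()
image = torch.tensor([[[0.5, 0.2], [0.8, 0.1]]])  # a batch of one 2x2 image
logits = toy(image)
print(logits.shape)  # torch.Size([1, 3]) -- one score per class
```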


3. The Data — What We Feed the Network

Our toy example is a 2×2 grayscale image. Each cell is a pixel, with a value between 0.0 (black) and 1.0 (white):

|       | Col 1 | Col 2 |
|-------|-------|-------|
| Row 1 | 0.5   | 0.2   |
| Row 2 | 0.8   | 0.1   |

The correct label for this image is class 0 (say, a T-shirt). The network does not know this yet. Its job is to figure it out from the pixel values alone.


Step 1 — Flatten the Input

A neural network processes a flat list of numbers, not a 2D grid. So the very first thing we do is unroll the image into a single row.

self.flatten = nn.Flatten()

Our 2×2 image becomes:

\[x = [0.5,\ 0.2,\ 0.8,\ 0.1]\]

For the real FashionMNIST model, a 28×28 image becomes a list of 784 values. Nothing is lost — the same numbers, just arranged differently.

Why does this matter? The layers that come next (nn.Linear) are designed to work with 1D vectors. Flatten is simply converting our image into a format the network can process.
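A quick way to see this, using the toy image from above:

```python
import torch
from torch import nn

image = torch.tensor([[[0.5, 0.2],
                       [0.8, 0.1]]])  # shape [1, 2, 2]: a batch of one 2x2 image

flat = nn.Flatten()(image)
print(flat)        # tensor([[0.5000, 0.2000, 0.8000, 0.1000]])
print(flat.shape)  # torch.Size([1, 4])
```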


Step 2 — The Forward Pass

In our dart analogy: this is the throw itself.

The forward pass is the journey of our input data through every layer of the network, from start to finish, producing a prediction at the end.

Let us walk through each layer one by one.


Layer 1 — First Linear Layer

nn.Linear(4, 4)   # 4 inputs → 4 neurons

This layer has 4 neurons. Each neuron looks at all 4 pixel values, multiplies each one by a weight, adds them all up, and adds a bias term. In math:

\[z = \sum_{i=1}^{n} w_i x_i + b\]

Where:

  - \(x_i\) is each pixel value
  - \(w_i\) is the weight for that pixel (learned during training)
  - \(b\) is a bias (a small offset, also learned)
  - \(z\) is the result — called a pre-activation

Using assumed weights, neuron 1 computes:

\[z_1 = (0.1)(0.5) + (0.4)(0.2) + (0.2)(0.8) + (0.3)(0.1) + 0.1 = 0.42\]

We do this for all 4 neurons and get:

\[z = [0.42,\ 0.69,\ 0.50,\ 0.84]\]

Why does this matter? Each neuron is essentially asking "given these pixel values and my current weights, how strongly should I activate?" The weights determine what the neuron pays attention to. Early in training, these are random guesses. By the end of training, they become meaningful — some neurons may respond strongly to edges, others to colour intensity, and so on.
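We can check that arithmetic directly (the weights 0.1, 0.4, 0.2, 0.3 and the bias 0.1 are the assumed example values from above, not learned ones):

```python
import torch

x = torch.tensor([0.5, 0.2, 0.8, 0.1])   # the flattened image
w = torch.tensor([0.1, 0.4, 0.2, 0.3])   # assumed weights for neuron 1
b = 0.1                                   # assumed bias

z1 = (w * x).sum() + b   # weighted sum plus bias
print(z1.item())         # 0.42 (up to float rounding)
```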


Layer 2 — ReLU Activation

nn.ReLU()

ReLU stands for Rectified Linear Unit. It is an activation function applied after the linear layer. The formula is:

\[\text{ReLU}(z) = \max(0,\ z)\]

Any negative value becomes 0. Any positive value stays as-is.

All our values are already positive, so nothing changes here:

\[a = [0.42,\ 0.69,\ 0.50,\ 0.84]\]

Why do we need this? Without an activation function, stacking multiple linear layers is mathematically the same as having just one. No matter how many layers you add, the network can only learn straight-line relationships in the data. ReLU introduces non-linearity — the ability to learn curved, complex patterns. Removing ReLU would severely limit what the network can learn.
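A quick demonstration (these input values are made up so that some are negative):

```python
import torch
from torch import nn

z = torch.tensor([-1.5, 0.42, -0.3, 0.84])
a = nn.ReLU()(z)
print(a)  # tensor([0.0000, 0.4200, 0.0000, 0.8400])
```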


Layer 3 — Second Linear Layer

nn.Linear(4, 4)

Same computation as Layer 1, but now taking the ReLU outputs as inputs and using a different set of weights:

\[z = [0.55,\ 0.38,\ 0.71,\ 0.29]\]

Layer 4 — ReLU Activation

All values are positive, so the output passes through unchanged:

\[a = [0.55,\ 0.38,\ 0.71,\ 0.29]\]

Layer 5 — Output Layer

nn.Linear(4, 3)   # 4 inputs → 3 class scores

The final layer maps to 3 scores — one per class. These raw scores are called logits:

\[\text{logits} = [1.2,\ 0.4,\ -0.3]\]

Notice there is no activation function here: the output layer deliberately skips ReLU. We want raw scores that can be positive, negative, or any size; the loss function we use next will handle converting them into probabilities.

The highest logit is class 0 (score 1.2), which is the correct answer. But at this stage of training, with random weights, this is just luck.


Step 3 — Computing the Loss

In our dart analogy: this is measuring how far off your throw was.

Now we need a single number that captures how wrong the network's prediction was. This is called the loss.

PyTorch uses CrossEntropyLoss for classification. It works in two steps internally.

First, it converts the raw logits into probabilities using Softmax — a formula that squashes any set of numbers into values between 0 and 1 that sum to 1:

\[P(\text{class } j) = \frac{e^{z_j}}{\sum_{k} e^{z_k}}\]

Applied to our logits:

\[\text{probs} = \text{softmax}([1.2,\ 0.4,\ -0.3]) = [0.60,\ 0.27,\ 0.13]\]

The network is 60% confident the image is class 0, 27% confident it is class 1, and 13% confident it is class 2.

Second, it computes the loss using the true label. Since the correct label is class 0 and we assigned it 60% probability:

\[\mathcal{L} = -\log(0.60) = 0.51\]

Why negative log? A perfect prediction would give 100% probability to the correct class, and \(-\log(1.0) = 0\) — zero loss. A terrible prediction (say 1% confidence) gives \(-\log(0.01) = 4.6\) — a very high loss. So the loss naturally grows larger the more wrong we are. Our goal is to drive this number towards zero.

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, y)
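Both internal steps can be reproduced by hand to confirm they match what CrossEntropyLoss computes:

```python
import torch
from torch import nn

logits = torch.tensor([[1.2, 0.4, -0.3]])  # raw scores from the example
y = torch.tensor([0])                      # true label: class 0

# Step 1: softmax turns logits into probabilities
probs = torch.softmax(logits, dim=1)

# Step 2: negative log of the probability given to the correct class
manual_loss = -torch.log(probs[0, y[0]])

# CrossEntropyLoss does both steps in one call
loss = nn.CrossEntropyLoss()(logits, y)
print(loss.item(), manual_loss.item())  # the two values match
```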

Step 4 — Backpropagation

In our dart analogy: this is the moment you think "I threw too far left — I need to adjust my wrist."

We now know the network was wrong (loss = 0.51). But we have thousands of weights. Which ones caused the error? And by how much does each one need to change?

Backpropagation answers this. It works backwards from the loss, through every layer, computing the gradient of the loss with respect to each weight.

The gradient \(\frac{\partial \mathcal{L}}{\partial w}\) tells us two things for every weight \(w\):

  - Sign — should we increase or decrease this weight to reduce the loss?
  - Magnitude — how sensitive is the loss to a change in this weight?

A large gradient means "this weight has a big effect on the loss — adjust it more carefully." A small gradient means "this weight barely affects the loss."

PyTorch handles all of this automatically with one line:

loss.backward()

After this line runs, every single weight in the network has a .grad value attached — its gradient, ready to be used.

Why does this matter? Without backpropagation, we would have no way to know which of our thousands of weights to change. We would be adjusting them blindly. Backpropagation is what makes learning efficient — it tells us precisely how to improve.
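Here is the idea on a single made-up weight, small enough that the gradient can be checked by hand:

```python
import torch

# A single "weight" with gradient tracking turned on
w = torch.tensor(3.0, requires_grad=True)

loss = w ** 2      # a toy loss: L = w^2
loss.backward()    # backpropagation: compute dL/dw

print(w.grad)      # tensor(6.) -- dL/dw = 2w = 6
```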


Step 5 — Updating the Weights

In our dart analogy: this is actually adjusting your throw based on what you figured out.

With gradients computed, we now update every weight by a small amount in the direction that reduces the loss. This is done by the optimizer.

The simplest optimizer is SGD (Stochastic Gradient Descent):

\[w \leftarrow w - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}\]

Where \(\eta\) (eta) is the learning rate — a small number like 0.01. It controls how big each update step is.

Think of the learning rate as the size of your dart adjustment. Too large, and you overcorrect and miss in the other direction. Too small, and you improve so slowly it takes forever. Getting it right is one of the key skills in training neural networks.
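The update rule can be applied by hand on a one-weight example (the toy loss function and the learning rate 0.1 are arbitrary choices for illustration):

```python
import torch

lr = 0.1
w = torch.tensor(1.0, requires_grad=True)

loss = (w - 3.0) ** 2   # minimised when w = 3
loss.backward()         # dL/dw = 2(w - 3) = -4

# The SGD update rule by hand: w <- w - lr * gradient
with torch.no_grad():
    w -= lr * w.grad

print(w.item())         # 1.0 - 0.1 * (-4.0) = 1.4, up to float rounding
```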

In practice, most people use Adam instead of plain SGD. Adam is like SGD but smarter — it keeps track of past gradients to build momentum, and gives each weight its own adaptive learning rate:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
optimizer.step()    # apply the weight updates

After optimizer.step(), every weight in the network has been slightly adjusted. The network is now — very slightly — better at classifying our image.

Finally, we reset the gradients to zero so they do not carry over into the next step:

optimizer.zero_grad()

Why do we need zero_grad()? By default, PyTorch accumulates gradients — it adds new gradients on top of old ones. If we forget to clear them, the gradients from step 1 get mixed into step 2, step 3, and so on. The weight updates become incorrect. Always call zero_grad() before the next forward pass.
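A small demonstration of that accumulation behaviour:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()
print(x.grad)        # tensor(6.)

# Without zeroing, the next backward() ADDS to the old gradient:
(x ** 2).backward()
print(x.grad)        # tensor(12.) -- 6 + 6, not 6

# Clearing first restores the correct value:
x.grad.zero_()
(x ** 2).backward()
print(x.grad)        # tensor(6.)
```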


The Full Training Loop

Here is everything combined into the loop that actually trains the network. This repeats for every batch of images across many epochs (full passes through the dataset):

# Setup
model = NeuralNetwork()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):               # 10 full passes over the dataset
    for X, y in dataloader:           # X = images, y = true labels

        # 1. Forward pass — make a prediction
        logits = model(X)

        # 2. Compute loss — measure how wrong we were
        loss = loss_fn(logits, y)

        # 3. Backpropagation — figure out what caused the error
        loss.backward()

        # 4. Update weights — adjust to reduce the loss
        optimizer.step()

        # 5. Clear gradients — reset for the next batch
        optimizer.zero_grad()

    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

Each time through the inner loop is one training step — one dart throw. Each full pass through the dataset is one epoch. After enough epochs, the loss converges to a low value and the network becomes accurate.
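Putting it all together, here is a runnable sketch that overfits the single toy image from this article (the seed, learning rate, and step count are arbitrary choices, and the model is the toy version written inline):

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

# The toy model: 2x2 image -> 4 neurons -> 4 neurons -> 3 classes
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(4, 4), nn.ReLU(),
    nn.Linear(4, 4), nn.ReLU(),
    nn.Linear(4, 3),
)
X = torch.tensor([[[0.5, 0.2], [0.8, 0.1]]])  # our single 2x2 image
y = torch.tensor([0])                          # true label: class 0

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for step in range(100):
    logits = model(X)            # 1. forward pass
    loss = loss_fn(logits, y)    # 2. compute loss
    loss.backward()              # 3. backpropagation
    optimizer.step()             # 4. update weights
    optimizer.zero_grad()        # 5. clear gradients
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With only one training example the network memorises it quickly, so the loss drops towards zero and the predicted class becomes class 0.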


Summary

| Step | Dart analogy | What happens | PyTorch code |
|------|--------------|--------------|--------------|
| Flatten | Gripping the dart | Image → 1D vector | nn.Flatten() |
| Forward pass | Throwing the dart | Input flows through all layers → logits | model(X) |
| Loss | Measuring how far you missed | Single number capturing how wrong the prediction was | loss_fn(logits, y) |
| Backprop | Figuring out what went wrong | Gradients computed for every weight | loss.backward() |
| Update | Adjusting your throw | Weights nudged to reduce the loss | optimizer.step() |
| Zero grad | Resetting before the next throw | Gradients cleared for the next step | optimizer.zero_grad() |

Training is just this loop — repeated thousands of times. Each repetition makes the network a little less wrong, until it becomes reliably right.