Everything in CNNs, RNNs, and Transformers is just this chapter applied with different connectivity. Master it.
1.0 Vocabulary you'll need first (plain definitions)
Before the math, here are the words used everywhere below. Read once, refer back as needed.
| Term | Plain meaning |
|---|---|
| Model / network | The thing that makes predictions; a big math function with adjustable knobs. |
| Parameter (weight/bias) | A knob the model learns by itself during training (a network has thousands→billions of them). |
| Hyperparameter | A knob you set before training (learning rate, number of layers, batch size). The model doesn't learn these. |
| Feature | One input number/attribute (e.g. a pixel, a word, "income"). |
| Label / target | The correct answer we want the model to output (used during training). |
| Training | Repeatedly showing examples and nudging the knobs to reduce mistakes. |
| Inference | Using the trained model to predict on new, unseen data. |
| Forward pass | Feeding an input through the network to get a prediction. |
| Backward pass (backprop) | Computing how to adjust every knob to reduce the error. |
| Loss | A single number scoring how wrong a prediction was (lower = better). |
| Gradient | The direction + amount to change a knob to reduce the loss. |
| Batch | A small group of examples processed together in one step (e.g. 32). |
| Iteration / step | One update of the knobs (one batch processed forward + backward). |
| Epoch | One full pass over the entire training dataset (many iterations). |
| Train / validation / test split | Data is divided three ways: train (learn from), validation (tune hyperparameters / check progress), test (final honest grade on data never seen). |
| Overfitting | Memorizing the training data instead of learning the general pattern → great on train, bad on new data. |
| Tensor | Just a multi-dimensional array of numbers (a scalar, vector, matrix, or higher). The basic data unit in PyTorch/TensorFlow. |
🧠 The 30-second mental model of all of deep learning: start with a function full of random knobs → show it an example → measure how wrong it is (loss) → figure out which way to turn each knob to be less wrong (gradient via backprop) → turn the knobs a tiny bit (optimizer step) → repeat millions of times. That's it. Everything else is variations on how the knobs are wired together.
1.1 The artificial neuron
Intuition
A neuron takes several numbers, weighs each by importance, adds them up, adds a bias (a baseline), and squashes the result through a nonlinear function. Stacking these gives a network that can approximate any continuous function on a closed, bounded domain to arbitrary accuracy, given enough neurons: the Universal Approximation Theorem.
Math
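For a single neuron with input vector $\mathbf{x} \in \mathbb{R}^n$, weights $\mathbf{w} \in \mathbb{R}^n$, bias $b \in \mathbb{R}$:
$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b, \qquad a = \sigma(z)$$
- $z$ = pre-activation (a.k.a. logit/net input).
- $\sigma$ = activation function (nonlinearity).
- $a$ = activation (the neuron's output).
For a layer of $m$ neurons, stack the weights into a matrix $W \in \mathbb{R}^{m \times n}$ (row $j$ = weights of neuron $j$) and bias $\mathbf{b} \in \mathbb{R}^m$:
$$\mathbf{z} = W\mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = \sigma(\mathbf{z})$$
A network with $L$ layers chains these:
$$\mathbf{a}^{(0)} = \mathbf{x}, \qquad \mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}), \qquad l = 1, \dots, L$$
This is the forward pass. The final $\mathbf{a}^{(L)} = \hat{\mathbf{y}}$ is the prediction.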
Tiny numeric example
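Input $\mathbf{x}$, one neuron with weights $\mathbf{w}$, bias $b$, sigmoid activation:
$$z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = \sigma(z) = \frac{1}{1 + e^{-z}}$$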
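The single-neuron computation can be checked in a few lines. This is a minimal sketch: the input, weights, and bias below are illustrative values assumed for this example, not taken from the text above.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Return (pre-activation z, activation a) for one sigmoid neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)

# Illustrative values (assumed for this sketch)
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1

z, a = neuron(x, w, b)
print(f"z = {z:.3f}, a = {a:.3f}")  # z = 0.100, a = 0.525
```

Here the weighted sum is $0.5 \cdot 1 - 0.25 \cdot 2 + 0.1 = 0.1$, and the sigmoid maps it to a value slightly above 0.5.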
1.2 Activation functions (and why nonlinearity matters)
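Without a nonlinearity, stacking layers collapses: $W^{(2)}\big(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\big) + \mathbf{b}^{(2)} = \big(W^{(2)}W^{(1)}\big)\mathbf{x} + \big(W^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}\big) = W'\mathbf{x} + \mathbf{b}'$, which is still linear. Nonlinearity is what gives depth its power.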
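The collapse of stacked linear layers can be verified numerically. A minimal sketch with small hand-picked (assumed) weights: two linear layers applied in sequence give the same output as one merged layer.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

# Illustrative weights (assumed for this sketch)
W1 = [[1.0, 2.0], [0.0, -1.0]]
b1 = [0.5, 1.0]
W2 = [[2.0, 1.0]]
b2 = [-1.0]
x = [3.0, 4.0]

# Two linear layers with no activation in between
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent layer: W' = W2 W1, b' = W2 b1 + b2
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_merged, x), b_merged)

print(two_layers, one_layer)  # [19.0] [19.0]
```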
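| Name | Formula | Derivative | Range | Notes |
|---|---|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1+e^{-z}}$ | $\sigma(z)(1-\sigma(z))$ | $(0, 1)$ | Saturates → vanishing grads; used for binary output |
| Tanh | $\tanh(z)$ | $1 - \tanh^2(z)$ | $(-1, 1)$ | Zero-centered; still saturates |
| ReLU | $\max(0, z)$ | $1$ if $z > 0$ else $0$ | $[0, \infty)$ | Default for hidden layers; cheap; "dead" neurons possible |
| Leaky ReLU | $\max(\alpha z, z)$ | $1$ if $z > 0$ else $\alpha$ | $(-\infty, \infty)$ | Fixes dead neurons ($\alpha \approx 0.01$) |
| GELU | $z\,\Phi(z)$ | (see below) | $\approx [-0.17, \infty)$ | Smooth; used in Transformers (BERT/GPT) |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | Jacobian (below) | $(0, 1)$, sums to 1 | Output layer for multi-class |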
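GELU (Gaussian Error Linear Unit): $\mathrm{GELU}(z) = z\,\Phi(z)$ where $\Phi$ is the standard normal CDF. A common approximation:
$$\mathrm{GELU}(z) \approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\big(z + 0.044715\,z^3\big)\right]\right)$$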
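To see how close the tanh approximation is to the exact definition, the sketch below evaluates both on a grid. It uses the identity $\Phi(z) = \tfrac12\big(1 + \mathrm{erf}(z/\sqrt{2})\big)$; the grid range $[-5, 5]$ is an arbitrary choice for illustration.

```python
import math

def gelu_exact(z):
    """GELU via the standard normal CDF, written with erf."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

# Largest absolute gap on a grid over [-5, 5] (grid chosen for illustration)
grid = [k / 100 for k in range(-500, 501)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(f"max |exact - approx| on grid: {max_gap:.2e}")
```

On this grid the two curves stay well within 0.01 of each other, which is why the approximation is commonly treated as interchangeable with the exact form.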
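Softmax: derivation of its Jacobian (needed for backprop). Let $s_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}}$. Then:
$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)$$
where $\delta_{ij}$ is the Kronecker delta. Derivation: for $i = j$, use the quotient rule on $e^{z_i} / \sum_k e^{z_k}$; the numerator derivative is $e^{z_i}$ (contributing $s_i$) and the denominator derivative contributes $-s_i^2$, giving $s_i(1 - s_i)$. For $i \neq j$ only the denominator depends on $z_j$, giving $-s_i s_j$.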
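A finite-difference check is a quick way to confirm the softmax Jacobian $\partial s_i / \partial z_j = s_i(\delta_{ij} - s_j)$. The sketch below compares the closed form against central differences; the logits are arbitrary values assumed for illustration.

```python
import math

def softmax(z):
    """Softmax with the max subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Closed form: J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    n = len(z)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

def numeric_jacobian(z, h=1e-6):
    """Central finite differences, one input coordinate at a time."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += h
        z_minus[j] -= h
        s_plus, s_minus = softmax(z_plus), softmax(z_minus)
        for i in range(n):
            J[i][j] = (s_plus[i] - s_minus[i]) / (2 * h)
    return J

z = [1.0, 2.0, 0.5]  # illustrative logits (assumed)
J_formula = softmax_jacobian(z)
J_numeric = numeric_jacobian(z)
max_err = max(abs(J_formula[i][j] - J_numeric[i][j])
              for i in range(3) for j in range(3))
print(f"max abs difference: {max_err:.1e}")
```

Each row of the Jacobian sums to zero, which reflects the constraint that the softmax outputs always sum to 1.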
1.3 Loss functions
The loss measures how wrong a prediction is. Training = minimize average loss over the dataset.
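Mean Squared Error (regression)
$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
Binary Cross-Entropy (binary classification)
With $\hat{y} = \sigma(z)$ and label $y \in \{0, 1\}$:
$$\mathcal{L}_{\text{BCE}} = -\big[\,y \log \hat{y} + (1 - y)\log(1 - \hat{y})\,\big]$$
A beautiful simplification: the gradient w.r.t. the logit collapses to
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$$
Why: $\dfrac{\partial \mathcal{L}}{\partial \hat{y}} = -\dfrac{y}{\hat{y}} + \dfrac{1 - y}{1 - \hat{y}} = \dfrac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$, and $\dfrac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$. Multiply: the $\hat{y}(1 - \hat{y})$ terms cancel to $\hat{y} - y$. This is why pairing sigmoid + BCE is numerically clean.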
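The sigmoid + BCE identity $\partial \mathcal{L} / \partial z = \hat{y} - y$ can be confirmed numerically. A minimal sketch, with a few (logit, label) pairs assumed for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy as a function of the logit z."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def numeric_grad(z, y, h=1e-6):
    """Central finite difference of the loss w.r.t. the logit."""
    return (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)

# Illustrative (logit, label) pairs, assumed for this sketch
cases = [(0.3, 1.0), (-1.2, 0.0), (2.0, 1.0)]
checks = []
for z, y in cases:
    analytic = sigmoid(z) - y   # the claimed closed form
    numeric = numeric_grad(z, y)
    checks.append((analytic, numeric))
    print(f"z={z:+.1f} y={y:.0f}  analytic={analytic:+.6f}  numeric={numeric:+.6f}")
```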
Categorical Cross-Entropy (multi-class)
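With one-hot label $\mathbf{y}$ and softmax output $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z})$:
$$\mathcal{L}_{\text{CE}} = -\sum_i y_i \log \hat{y}_i$$
Combined softmax+CE gradient w.r.t. logits is again the clean form:
$$\frac{\partial \mathcal{L}}{\partial z_j} = \hat{y}_j - y_j$$
This single identity powers almost all classification training. Derivation: $\dfrac{\partial \mathcal{L}}{\partial z_j} = -\sum_i \dfrac{y_i}{\hat{y}_i}\,\dfrac{\partial \hat{y}_i}{\partial z_j} = -\sum_i y_i\,(\delta_{ij} - \hat{y}_j)$. Then $-\sum_i y_i \delta_{ij} + \hat{y}_j \sum_i y_i = -y_j + \hat{y}_j = \hat{y}_j - y_j$ (since labels sum to 1).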
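In practice the softmax + cross-entropy pair is usually computed through a log-sum-exp so that large logits do not overflow. The sketch below is one such arrangement, with logits, target class, and step size assumed for illustration; it uses the gradient $\hat{y}_j - y_j$ to take one descent step directly on the logits.

```python
import math

def log_softmax(z):
    """log(softmax(z)) computed via a max-shifted log-sum-exp."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy(z, target):
    """CE loss for an integer class index (equivalent to a one-hot label)."""
    return -log_softmax(z)[target]

def grad_logits(z, target):
    """Gradient of CE w.r.t. logits: softmax(z) minus the one-hot label."""
    probs = [math.exp(v) for v in log_softmax(z)]
    return [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]

z = [2.0, 1.0, 0.1]   # illustrative logits (assumed)
target = 1            # illustrative correct class (assumed)

g = grad_logits(z, target)
loss_before = cross_entropy(z, target)
z_new = [v - 0.5 * gv for v, gv in zip(z, g)]   # one step, step size 0.5 (assumed)
loss_after = cross_entropy(z_new, target)
print(f"loss before: {loss_before:.3f}, after one step: {loss_after:.3f}")
```

The gradient is negative only at the target class, so the step raises the correct logit and lowers the others.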
1.4 Gradient descent
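We want $\theta^* = \arg\min_\theta \mathcal{L}(\theta)$. The gradient $\nabla_\theta \mathcal{L}$ points in the direction of steepest increase, so we step opposite to it:
$$\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)$$
$\eta$ = learning rate. Too large → diverge/oscillate; too small → crawl.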
Three flavors:
- Batch GD: gradient over the whole dataset per step. Stable, expensive.
- Stochastic GD (SGD): one example per step. Noisy, fast, can escape shallow minima.
- Mini-batch GD: a batch of $B$ examples per step (typically tens to hundreds, e.g. $B = 32$). The standard. Balances noise and hardware efficiency.
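The relationship between batch, iteration, and epoch is easiest to see by generating the index batches for one epoch. A minimal sketch; the helper name `minibatches` and the dataset and batch sizes are assumptions for illustration.

```python
import random

def minibatches(n_examples, batch_size, rng):
    """Yield lists of example indices that cover the dataset once (one epoch)."""
    order = list(range(n_examples))
    rng.shuffle(order)  # a fresh random order each epoch
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(0)
batches = list(minibatches(n_examples=10, batch_size=4, rng=rng))

# One epoch = 3 iterations here; the last batch is smaller (10 is not a multiple of 4)
print([len(b) for b in batches])  # [4, 4, 2]
```

Setting `batch_size=1` gives stochastic GD and `batch_size=n_examples` gives batch GD, so the three flavors differ only in this one number.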
1D worked example
Minimize $f(x) = x^2$. Gradient $f'(x) = 2x$. Start $x_0 = 5$, $\eta = 0.1$.
| step | $x$ | $f'(x) = 2x$ | update $x - \eta f'(x)$ |
|---|---|---|---|
| 0 | 5.000 | 10.0 | 4.000 |
| 1 | 4.000 | 8.0 | 3.200 |
| 2 | 3.200 | 6.4 | 2.560 |
| 3 | 2.560 | 5.12 | 2.048 |
Each step multiplies $x$ by $(1 - 2\eta) = 0.8$, converging geometrically to the minimum at $x = 0$. (If $\eta > 1$, the multiplier becomes $1 - 2\eta < -1$ → diverges. This shows learning-rate sensitivity concretely.)
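The table can be reproduced with a short loop. The sketch below runs gradient descent on $f(x) = x^2$ with the table's settings, then again with a learning rate above 1 (the value 1.1 is an assumption chosen for illustration) to show divergence.

```python
def gradient_descent(x0, eta, steps):
    """Minimize f(x) = x**2 (gradient 2x); return the list of iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = 2 * xs[-1]
        xs.append(xs[-1] - eta * grad)
    return xs

converging = gradient_descent(x0=5.0, eta=0.1, steps=4)
diverging = gradient_descent(x0=5.0, eta=1.1, steps=4)  # eta=1.1 assumed for illustration

print([round(x, 3) for x in converging])  # [5.0, 4.0, 3.2, 2.56, 2.048]
print([round(x, 3) for x in diverging])   # [5.0, -6.0, 7.2, -8.64, 10.368]
```

With $\eta = 1.1$ the multiplier is $1 - 2.2 = -1.2$, so the iterate flips sign and grows by 20% every step.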
1.5 Backpropagation — the full derivation
Backprop = the chain rule, applied layer by layer from output to input, reusing intermediate results. This is the single most important mechanism to understand.
Setup
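A network with layers $l = 1, \dots, L$. For each layer:
$$\mathbf{z}^{(l)} = W^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}), \qquad \mathbf{a}^{(0)} = \mathbf{x}$$
Define the error signal of layer $l$ as the gradient of loss w.r.t. that layer's pre-activation:
$$\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$$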
The four backprop equations
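(BP1) Output layer error.
$$\delta^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})$$
For softmax+CE this is just $\hat{\mathbf{y}} - \mathbf{y}$.
(BP2) Backpropagate the error to earlier layers.
$$\delta^{(l)} = \big(W^{(l+1)\top}\,\delta^{(l+1)}\big) \odot \sigma'(\mathbf{z}^{(l)})$$
Why: $\mathcal{L}$ depends on $z_j^{(l)}$ only through the next layer's pre-activations $z_k^{(l+1)}$. Chain rule: $\delta_j^{(l)} = \sum_k \delta_k^{(l+1)}\,\dfrac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}$. The inner derivative is $W_{kj}^{(l+1)}\,\sigma'(z_j^{(l)})$. Collecting over $j$ gives the matrix-vector form above.
(BP3) Gradient w.r.t. weights.
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)}\,\mathbf{a}^{(l-1)\top}$$
(BP4) Gradient w.r.t. biases.
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}$$
Then update every parameter: $W^{(l)} \leftarrow W^{(l)} - \eta\,\dfrac{\partial \mathcal{L}}{\partial W^{(l)}}$, $\ \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta\,\dfrac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}$.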
Fully worked numeric backprop (do this by hand once!)
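A 2-2-1 network. Inputs $\mathbf{x} = (x_1, x_2)$, target $y$, sigmoid everywhere, MSE loss $\mathcal{L} = (\hat{y} - y)^2$.
Weights: $W^{(1)} \in \mathbb{R}^{2 \times 2}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{2}$ (hidden layer); $W^{(2)} \in \mathbb{R}^{1 \times 2}$, $b^{(2)} \in \mathbb{R}$ (output layer).
Forward:
$$\mathbf{z}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}), \quad z^{(2)} = W^{(2)}\mathbf{a}^{(1)} + b^{(2)}, \quad \hat{y} = \sigma(z^{(2)})$$
Loss $\mathcal{L} = (\hat{y} - y)^2$.
Backward:
$$\delta^{(2)} = \frac{\partial \mathcal{L}}{\partial \hat{y}}\,\sigma'(z^{(2)}) = 2(\hat{y} - y)\,\hat{y}(1 - \hat{y})$$
Gradients for output layer:
$$\frac{\partial \mathcal{L}}{\partial W^{(2)}} = \delta^{(2)}\,\mathbf{a}^{(1)\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(2)}} = \delta^{(2)}$$
Propagate back:
$$\delta^{(1)} = \big(W^{(2)\top}\delta^{(2)}\big) \odot \mathbf{a}^{(1)} \odot (1 - \mathbf{a}^{(1)}), \qquad \frac{\partial \mathcal{L}}{\partial W^{(1)}} = \delta^{(1)}\,\mathbf{x}^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} = \delta^{(1)}$$
(using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so $\sigma'(\mathbf{z}^{(1)}) = \mathbf{a}^{(1)} \odot (1 - \mathbf{a}^{(1)})$).
With learning rate $\eta$, every weight nudges by $-\eta \cdot$ grad (e.g. $W^{(2)} \leftarrow W^{(2)} - \eta\,\delta^{(2)}\,\mathbf{a}^{(1)\top}$). That's one training step. Repeat over many batches.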
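A hand computation like this is easy to get wrong, so it helps to check one gradient entry against a finite difference. The sketch below does that for a 2-2-1 sigmoid network. All numeric values (inputs, target, weights) are assumptions chosen for illustration, and the loss is taken as $(\hat{y} - y)^2$ without a $\tfrac12$ factor, which is also an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return (hidden activations a1, prediction y_hat) for a 2-2-1 sigmoid net."""
    W1, b1, W2, b2 = params
    z1 = [sum(W1[j][i] * x[i] for i in range(2)) + b1[j] for j in range(2)]
    a1 = [sigmoid(v) for v in z1]
    z2 = sum(W2[j] * a1[j] for j in range(2)) + b2
    return a1, sigmoid(z2)

def loss_fn(params, x, y):
    return (forward(params, x)[1] - y) ** 2

def backprop(params, x, y):
    """Gradients via BP1-BP4 for the loss (y_hat - y)**2."""
    W1, b1, W2, b2 = params
    a1, y_hat = forward(params, x)
    delta2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)                      # BP1
    dW2 = [delta2 * a1[j] for j in range(2)]                            # BP3
    db2 = delta2                                                        # BP4
    delta1 = [W2[j] * delta2 * a1[j] * (1 - a1[j]) for j in range(2)]   # BP2
    dW1 = [[delta1[j] * x[i] for i in range(2)] for j in range(2)]      # BP3
    db1 = delta1                                                        # BP4
    return dW1, db1, dW2, db2

# Illustrative values (assumed for this sketch)
x, y = [0.5, -1.0], 1.0
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.7, -0.5], 0.2)

dW1, db1, dW2, db2 = backprop(params, x, y)

def with_perturbed_w(delta):
    """Copy of params with W1[0][1] shifted by delta."""
    W1 = [row[:] for row in params[0]]
    W1[0][1] += delta
    return (W1, params[1], params[2], params[3])

h = 1e-6
numeric = (loss_fn(with_perturbed_w(h), x, y)
           - loss_fn(with_perturbed_w(-h), x, y)) / (2 * h)
print(f"backprop: {dW1[0][1]:.8f}  finite difference: {numeric:.8f}")
```

If the two numbers disagree beyond rounding, the usual culprits are a missing $\sigma'$ factor or a transposed weight index in BP2.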
1.6 Optimizers — beyond vanilla SGD
Plain SGD struggles in ravines (steep in one direction, flat in another) and with noisy gradients. These improve it.
Momentum
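Accumulate a velocity $v$ that smooths the path (with $g = \nabla_\theta \mathcal{L}$):
$$v \leftarrow \beta v + g, \qquad \theta \leftarrow \theta - \eta\,v$$
Typically $\beta \approx 0.9$. Like a heavy ball rolling downhill: dampens oscillation, accelerates consistent directions.
RMSProp
Scale each parameter's step by a running average of squared gradients (adaptive per-parameter learning rate):
$$s \leftarrow \rho\,s + (1 - \rho)\,g^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{s} + \epsilon}\,g$$
Adam (the default)
Combines momentum (1st moment) + RMSProp (2nd moment), with bias correction:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Bias correction matters because $m_0 = v_0 = 0$ makes early estimates biased toward zero; dividing by $1 - \beta^t$ (which is small early) inflates them to the right scale. AdamW decouples weight decay from the gradient step and is the modern default for Transformers.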
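The effect of bias correction shows up on the very first step. The sketch below is a minimal scalar Adam, run on $f(x) = x^2$ with an assumed start of 5 and an assumed learning rate of 0.1: with correction, the first update has size close to $\eta$ regardless of the gradient's magnitude.

```python
import math

def adam_minimize(grad_fn, x0, steps, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam; returns the list of iterates (eta=0.1 is an assumed setting)."""
    x, m, v = x0, 0.0, 0.0
    history = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # 1st moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # 2nd moment (RMSProp-style)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= eta * m_hat / (math.sqrt(v_hat) + eps)
        history.append(x)
    return history

history = adam_minimize(lambda x: 2 * x, x0=5.0, steps=200)
first_step = history[0] - history[1]
print(f"first step size: {first_step:.6f}")
print(f"x after 200 steps: {history[-1]:.4f}")
```

At $t = 1$ the corrected moments are $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the update reduces to $\eta \cdot g_1 / |g_1|$. Without the correction the same step would be $\eta \cdot 0.1\,g_1 / \sqrt{0.001\,g_1^2} \approx 3.2\,\eta$.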
1.7 Weight initialization
Bad init → vanishing/exploding activations before training even starts. Keep variance stable across layers.
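- Xavier/Glorot (for tanh/sigmoid): $\mathrm{Var}(W) = \dfrac{2}{n_{\text{in}} + n_{\text{out}}}$.
- He/Kaiming (for ReLU): $\mathrm{Var}(W) = \dfrac{2}{n_{\text{in}}}$ (accounts for ReLU zeroing half the activations).
- Biases usually start at 0.
Why $2/n_{\text{in}}$ for ReLU: a linear layer's output variance is $n_{\text{in}}\,\mathrm{Var}(W)\,\mathrm{Var}(x)$ (for independent, zero-mean weights and inputs); ReLU halves the effective variance, so we want $\tfrac{1}{2}\,n_{\text{in}}\,\mathrm{Var}(W) = 1$, i.e. $\mathrm{Var}(W) = 2/n_{\text{in}}$, keeping signal magnitude roughly constant through depth.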
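The factor of 2 can be checked empirically. The sketch below pushes one random input through a wide ReLU layer and compares the mean squared activation going in and coming out, once with He variance $2/n_{\text{in}}$ and once with $1/n_{\text{in}}$. Layer sizes and the seed are assumptions for illustration, and the result is a noisy Monte Carlo estimate rather than an exact value.

```python
import math
import random

def relu_layer_moments(n_in, n_out, weight_var, rng):
    """Return (mean of x^2, mean of relu(Wx)^2) for one random input x."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    std = math.sqrt(weight_var)
    total = 0.0
    for _ in range(n_out):
        # One output neuron with freshly sampled Gaussian weights, zero bias
        z = sum(rng.gauss(0.0, std) * xi for xi in x)
        a = max(0.0, z)
        total += a * a
    return sum(xi * xi for xi in x) / n_in, total / n_out

rng = random.Random(0)
n_in, n_out = 100, 2000  # assumed sizes

in_he, out_he = relu_layer_moments(n_in, n_out, 2.0 / n_in, rng)
in_half, out_half = relu_layer_moments(n_in, n_out, 1.0 / n_in, rng)

print(f"He init (2/n_in): out/in = {out_he / in_he:.2f}")      # expected near 1
print(f"1/n_in init:      out/in = {out_half / in_half:.2f}")  # expected near 0.5
```

With the smaller variance the signal would shrink by roughly half per layer, which compounds quickly in a deep stack.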
1.8 Regularization (fighting overfitting)
Overfitting = low training loss, high test loss (memorizing noise). Remedies:
- L2 (weight decay): add $\tfrac{\lambda}{2}\lVert \mathbf{w} \rVert_2^2$ to the loss → gradient gains $\lambda \mathbf{w}$ → shrinks weights toward 0. Encourages small, smooth weights.
- L1: add $\lambda \lVert \mathbf{w} \rVert_1$ → drives some weights exactly to 0 (sparsity / feature selection).
- Dropout: during training, zero each activation independently with probability $p$, then scale survivors by $1/(1-p)$ (inverted dropout) so expectations match at test time. Acts like training an ensemble of subnetworks; prevents co-adaptation.
- Early stopping: halt when validation loss stops improving.
- Data augmentation: synthetically expand data (flips, crops, noise) — strong regularizer especially for vision.
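The scaling in inverted dropout is what lets inference run with no change to the network. A minimal sketch, with drop probability $p = 0.5$ and a constant activation vector assumed for illustration: survivors are scaled by $1/(1-p)$, so the average activation is preserved.

```python
import random

def inverted_dropout(activations, p, rng):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 100_000                      # constant activations (assumed)
dropped = inverted_dropout(acts, p=0.5, rng=rng)

mean_after = sum(dropped) / len(dropped)
print(sorted(set(dropped)))                 # [0.0, 2.0]
print(f"mean after dropout: {mean_after:.3f}")  # close to the original mean of 1.0
```

At test time the layer is simply skipped: because training already matched the expected activation, no rescaling is needed.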
1.9 Normalization layers
Normalizing intermediate activations stabilizes and speeds training.
Batch Normalization
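For a feature $x$ over a mini-batch of size $m$:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$
$\gamma, \beta$ are learnable scale/shift, letting the network undo normalization if needed. At inference, use running averages of $\mu_B, \sigma_B^2$ collected during training. BatchNorm depends on batch statistics → awkward for sequences/small batches.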
Layer Normalization
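Normalizes across the $d$ features within one example (not across the batch):
$$\mu = \frac{1}{d}\sum_{j=1}^{d} x_j, \qquad \sigma^2 = \frac{1}{d}\sum_{j=1}^{d} (x_j - \mu)^2, \qquad y_j = \gamma_j\,\frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_j$$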
Batch-independent → the choice for RNNs and Transformers. ([[04_transformers]] uses LayerNorm in every block.)
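The only difference between the two normalizations is the axis the statistics are taken over. The sketch below applies both to the same small matrix (rows = examples, columns = features). The data is assumed for illustration, and the learnable scale and shift are left at their identity values ($\gamma = 1$, $\beta = 0$).

```python
import math

def normalize(values, eps=1e-5):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def batch_norm(X, eps=1e-5):
    """Normalize each feature (column) across the batch; training-mode statistics."""
    cols = [normalize(list(col), eps) for col in zip(*X)]
    return [list(row) for row in zip(*cols)]

def layer_norm(X, eps=1e-5):
    """Normalize each example (row) across its own features."""
    return [normalize(row, eps) for row in X]

# 4 examples x 3 features with very different scales (assumed data)
X = [[1.0, 10.0, 100.0],
     [2.0, 20.0, 200.0],
     [3.0, 30.0, 300.0],
     [4.0, 40.0, 400.0]]

bn = batch_norm(X)
ln = layer_norm(X)
print("BatchNorm, first row:", [round(v, 3) for v in bn[0]])
print("LayerNorm, first row:", [round(v, 3) for v in ln[0]])
```

`layer_norm` gives the same answer for a row no matter what else is in the batch, which is the batch-independence property noted above.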
1.10 Putting it together — a NumPy MLP (no frameworks)
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dsigmoid(a):
    return a * (1 - a)  # derivative in terms of activation a = σ(z)

# Tiny 2-2-1 net trained on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # (4,2)
Y = np.array([[0], [1], [1], [0]], dtype=float)              # (4,1)

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros((1, 2))  # He-style init: Var = 2/n_in = 1 for n_in = 2
W2 = rng.normal(0, 1, (2, 1)); b2 = np.zeros((1, 1))
eta = 0.5

for epoch in range(10000):
    # ---- forward ----
    Z1 = X @ W1 + b1; A1 = sigmoid(Z1)   # (4,2)
    Z2 = A1 @ W2 + b2; A2 = sigmoid(Z2)  # (4,1) = predictions
    loss = np.mean((A2 - Y)**2)

    # ---- backward (BP1-BP4) ----
    dA2 = 2*(A2 - Y)/len(X)              # dL/dA2
    dZ2 = dA2 * dsigmoid(A2)             # δ2 (BP1)
    dW2 = A1.T @ dZ2                     # BP3
    db2 = dZ2.sum(0, keepdims=True)      # BP4
    dA1 = dZ2 @ W2.T                     # propagate
    dZ1 = dA1 * dsigmoid(A1)             # δ1 (BP2)
    dW1 = X.T @ dZ1                      # BP3
    db1 = dZ1.sum(0, keepdims=True)      # BP4

    # ---- update ----
    W2 -= eta*dW2; b2 -= eta*db2
    W1 -= eta*dW1; b1 -= eta*db1

# Should approach [0, 1, 1, 0]; a 2-2-1 net can stall on XOR for some seeds
print("Predictions:", A2.ravel().round(3))
```
The same sequence of steps (forward → δ at output → δ propagated → grads → update) reappears in every architecture in these notes. Frameworks (PyTorch/TF) just automate the chain rule via autograd (a computation graph that records operations and replays their derivatives in reverse).
The PyTorch equivalent (autograd does backprop for you)
```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), lr=0.05)

X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
Y = torch.tensor([[0.], [1.], [1.], [0.]])

for _ in range(5000):
    pred = net(X)
    loss = ((pred - Y)**2).mean()
    opt.zero_grad(); loss.backward(); opt.step()  # backward() = autograd backprop
```
1.11 Common pitfalls
- Forgetting to zero gradients in PyTorch (they accumulate) → use `opt.zero_grad()`.
- Learning rate too high → loss NaN/explodes; too low → no progress. Start ~$10^{-3}$ for Adam (its PyTorch default).
- Vanishing gradients with deep sigmoid/tanh stacks → use ReLU/GELU + residual connections + normalization.
- Data not normalized → unstable training. Standardize inputs to ~zero mean, unit variance.
- Mismatched loss/output: use softmax+CE for multiclass, sigmoid+BCE for binary, linear+MSE for regression.
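For the input-normalization pitfall, the usual pattern is to compute the mean and standard deviation on the training split only and reuse them everywhere else. A minimal sketch: the helper names and the "income" values are assumptions for illustration.

```python
import statistics

def fit_standardizer(values):
    """Return (mean, std) computed from training values only."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        sigma = 1.0  # constant feature: avoid dividing by zero
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply a previously fitted (mean, std) to any split."""
    return [(v - mu) / sigma for v in values]

train_income = [30_000.0, 45_000.0, 60_000.0, 75_000.0]  # assumed training feature
test_income = [52_500.0]                                  # assumed unseen example

mu, sigma = fit_standardizer(train_income)
train_scaled = standardize(train_income, mu, sigma)
test_scaled = standardize(test_income, mu, sigma)   # reuses the training statistics

print([round(v, 3) for v in train_scaled])
print(test_scaled)  # [0.0]: this value equals the training mean
```

Fitting the statistics on validation or test data would leak information about those splits into training, which undermines the "final honest grade" role of the test set.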
Next: [[02_cnns]] applies these exact mechanics with weight-sharing for images.