
Deep Learning Foundations

Neurons, activations, the forward pass, loss functions, backprop (full chain-rule derivation), optimizers SGD→Adam, init, regularization, batch/layer norm.

Everything in CNNs, RNNs, and Transformers is just this chapter applied with different connectivity. Master it.


1.0 Vocabulary you'll need first (plain definitions)

Before the math, here are the words used everywhere below. Read once, refer back as needed.

| Term | Plain meaning |
|---|---|
| Model / network | The thing that makes predictions; a big math function with adjustable knobs. |
| Parameter (weight/bias) | A knob the model learns by itself during training (a network has thousands→billions of them). |
| Hyperparameter | A knob you set before training (learning rate, number of layers, batch size). The model doesn't learn these. |
| Feature | One input number/attribute (e.g. a pixel, a word, "income"). |
| Label / target | The correct answer we want the model to output (used during training). |
| Training | Repeatedly showing examples and nudging the knobs to reduce mistakes. |
| Inference | Using the trained model to predict on new, unseen data. |
| Forward pass | Feeding an input through the network to get a prediction. |
| Backward pass (backprop) | Computing how to adjust every knob to reduce the error. |
| Loss | A single number scoring how wrong a prediction was (lower = better). |
| Gradient | The direction + amount to change a knob to reduce the loss. |
| Batch | A small group of examples processed together in one step (e.g. 32). |
| Iteration / step | One update of the knobs (one batch processed forward + backward). |
| Epoch | One full pass over the entire training dataset (many iterations). |
| Train / validation / test split | Data is divided three ways: train (learn from), validation (tune hyperparameters / check progress), test (final honest grade on data never seen). |
| Overfitting | Memorizing the training data instead of learning the general pattern → great on train, bad on new data. |
| Tensor | Just a multi-dimensional array of numbers (a scalar, vector, matrix, or higher). The basic data unit in PyTorch/TensorFlow. |

🧠 The 30-second mental model of all of deep learning: start with a function full of random knobs → show it an example → measure how wrong it is (loss) → figure out which way to turn each knob to be less wrong (gradient via backprop) → turn the knobs a tiny bit (optimizer step) → repeat millions of times. That's it. Everything else is variations on how the knobs are wired together.


1.1 The artificial neuron

Intuition

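A neuron takes several numbers, weights each by importance, adds them up, adds a bias (a baseline), and squashes the result through a nonlinear function. Stacking these gives a network that can approximate any continuous function on a compact (closed, bounded) input region to arbitrary accuracy, given enough hidden units. This is the Universal Approximation Theorem; it says such a network exists, not that training will find it.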

Math

For a single neuron with input vector $\mathbf{x} \in \mathbb{R}^{n}$, weights $\mathbf{w} \in \mathbb{R}^{n}$, bias $b \in \mathbb{R}$:

$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b, \qquad a = \sigma(z)$$

  • $z$ = pre-activation (a.k.a. logit/net input).
  • $\sigma$ = activation function (nonlinearity).
  • $a$ = activation (the neuron's output).

For a layer of $m$ neurons, stack the weights into a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ (row $j$ = weights of neuron $j$) and bias $\mathbf{b} \in \mathbb{R}^{m}$:

$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = \sigma(\mathbf{z})$$

A network with layers $\ell = 1 \dots L$ chains these:

$$\mathbf{a}^{(\ell)} = \sigma^{(\ell)}\!\left(\mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right), \qquad \mathbf{a}^{(0)} = \mathbf{x}$$

This is the forward pass. The final $\mathbf{a}^{(L)} = \hat{\mathbf{y}}$ is the prediction.

Tiny numeric example

Input $\mathbf{x} = [1, 2]$, one neuron with $\mathbf{w} = [0.5, -1]$, $b = 0.5$, sigmoid activation.

$$z = 0.5(1) + (-1)(2) + 0.5 = -1.0, \qquad a = \frac{1}{1+e^{-(-1)}} = \frac{1}{1+e^{1}} = 0.2689$$
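The numeric example can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `sigmoid` and `neuron` are illustrative choices, not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Return (pre-activation z, activation a) for one sigmoid neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)

# Same numbers as the example: x = [1, 2], w = [0.5, -1], b = 0.5
z, a = neuron([1.0, 2.0], [0.5, -1.0], 0.5)
print(f"z = {z:.4f}, a = {a:.4f}")  # z = -1.0000, a = 0.2689
```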

1.2 Activation functions (and why nonlinearity matters)

Without a nonlinearity, stacking layers collapses: $\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x}$, which is still a single linear map (adding biases only makes it affine). Nonlinearity is what gives depth its power.
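The collapse is easy to see numerically. In the sketch below the matrices and input are arbitrary illustrative values, and `matvec`, `matmul`, and `relu` are small hand-written helpers: two stacked linear layers give exactly the same output as the single merged matrix, while inserting a ReLU between them does not.

```python
def matvec(W, x):
    """Matrix-vector product for W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B for lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 3x2 (illustrative values)
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 2x3
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))        # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)         # (W2 W1) x
with_relu = matvec(W2, relu(matvec(W1, x)))   # nonlinearity in between

print(two_layers, one_layer, with_relu)  # [-4.5, 5.5] [-4.5, 5.5] [0.0, 2.0]
```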

| Name | $\sigma(z)$ | Derivative $\sigma'(z)$ | Range | Notes |
|---|---|---|---|---|
| Sigmoid | $\dfrac{1}{1+e^{-z}}$ | $\sigma(z)\,(1-\sigma(z))$ | $(0,1)$ | Saturates → vanishing grads; used for binary output |
| Tanh | $\dfrac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$ | $1-\tanh^2(z)$ | $(-1,1)$ | Zero-centered; still saturates |
| ReLU | $\max(0,z)$ | $1$ if $z>0$, $0$ if $z<0$ (undefined at $0$; frameworks pick a value) | $[0,\infty)$ | Default for hidden layers; cheap; "dead" neurons possible |
| Leaky ReLU | $\max(\alpha z, z)$ | $1$ if $z>0$ else $\alpha$ | $\mathbb{R}$ | Fixes dead neurons ($\alpha\approx0.01$) |
| GELU | $z\,\Phi(z)$ | $\Phi(z) + z\,\phi(z)$, with $\phi$ the standard normal PDF | $[\approx -0.17, \infty)$ | Smooth; used in Transformers (BERT/GPT) |
| Softmax | $\dfrac{e^{z_i}}{\sum_j e^{z_j}}$ | Jacobian (below) | $(0,1)$, sums to 1 | Output layer for multi-class |

GELU (Gaussian Error Linear Unit): $\text{GELU}(z) = z \cdot \Phi(z)$ where $\Phi$ is the standard normal CDF. A common approximation:

$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\!\left[\sqrt{2/\pi}\,(z + 0.044715 z^3)\right]\right)$$
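To get a feel for how close the tanh form is, the sketch below compares it with the exact definition, using the identity $\Phi(z) = \tfrac12\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$. The function names and the grid over $[-5, 5]$ are illustrative choices.

```python
import math

def gelu_exact(z: float) -> float:
    """z * Phi(z), with Phi written via the error function."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z: float) -> float:
    """The tanh approximation quoted above."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

# Scan a grid and report the largest disagreement between the two forms.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(f"max |exact - approx| on [-5, 5]: {max_gap:.2e}")
```

The gap stays small across the grid, which is why the cheaper tanh form is a usable stand-in.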

Softmax: derivation of its Jacobian (needed for backprop). Let $s_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}}$. Then:

$$\frac{\partial s_i}{\partial z_j} = \begin{cases} s_i(1 - s_i) & i = j \\ - s_i s_j & i \ne j \end{cases} \;=\; s_i(\delta_{ij} - s_j)$$

where $\delta_{ij}$ is the Kronecker delta. Derivation: for $i=j$, use the quotient rule on $e^{z_i}/\sum_k e^{z_k}$; the numerator derivative contributes $e^{z_i}/\sum_k e^{z_k} = s_i$ and the denominator derivative contributes $-e^{z_i}e^{z_i}/(\sum_k e^{z_k})^2 = -s_i^2$, giving $s_i(1-s_i)$. For $i\ne j$ only the denominator depends on $z_j$, giving $-e^{z_i}e^{z_j}/(\sum_k e^{z_k})^2 = -s_i s_j$.
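A finite-difference check is a quick way to confirm the Jacobian formula. The sketch below is an illustration in plain Python: `softmax_jacobian` implements $s_i(\delta_{ij} - s_j)$ and `numeric_jacobian` estimates the same matrix by central differences. The max-subtraction inside `softmax` is a standard stability trick that does not change the result; the test point `z` is arbitrary.

```python
import math

def softmax(z):
    m = max(z)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

def numeric_jacobian(z, h=1e-6):
    """Central-difference estimate of d s_i / d z_j."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        sp, sm = softmax(zp), softmax(zm)
        for i in range(n):
            J[i][j] = (sp[i] - sm[i]) / (2 * h)
    return J

z = [1.0, 2.0, 0.5]
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(f"max abs difference: {max_err:.1e}")
```

Each row of the Jacobian sums to zero, which reflects the fact that softmax outputs always sum to 1.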


1.3 Loss functions

The loss $L(\hat{y}, y)$ measures how wrong a prediction is. Training = minimize average loss over the dataset.

Mean Squared Error (regression)

$$L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \qquad \frac{\partial L}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$$

Binary Cross-Entropy (binary classification)

With $\hat{y} = \sigma(z) \in (0,1)$ and label $y \in \{0,1\}$:

$$L_{\text{BCE}} = -\big[y \log \hat{y} + (1-y)\log(1-\hat{y})\big]$$

A beautiful simplification: the gradient w.r.t. the logit $z$ collapses to

$$\frac{\partial L_{\text{BCE}}}{\partial z} = \hat{y} - y$$

Why: $\frac{\partial L}{\partial \hat y} = -\frac{y}{\hat y} + \frac{1-y}{1-\hat y}$, and $\frac{\partial \hat y}{\partial z} = \hat y(1-\hat y)$. Multiplying gives $-y(1-\hat y) + (1-y)\hat y = \hat y - y$. This is why pairing sigmoid + BCE is numerically clean.
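The identity can be confirmed numerically. The following sketch (function names and test points are illustrative) compares $\hat y - y$ against a central-difference derivative of the BCE loss taken with respect to the logit.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z: float, y: int) -> float:
    """Binary cross-entropy where the prediction is sigmoid(z)."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_grad_logit(z: float, y: int) -> float:
    """Closed-form gradient dL/dz = y_hat - y."""
    return sigmoid(z) - y

h = 1e-6
checks = []
for z, y in [(-1.0, 1), (0.3, 0), (2.0, 1)]:
    numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
    analytic = bce_grad_logit(z, y)
    checks.append((z, y, analytic, numeric))
    print(f"z={z:+.1f} y={y}  analytic={analytic:+.6f}  numeric={numeric:+.6f}")
```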

Categorical Cross-Entropy (multi-class)

With one-hot label $\mathbf{y}$ and softmax output $\hat{\mathbf{y}}$:

$$L_{\text{CE}} = -\sum_{k} y_k \log \hat{y}_k = -\log \hat{y}_{\text{correct}}$$

Combined softmax+CE gradient w.r.t. logits is again the clean form:

$$\boxed{\frac{\partial L_{\text{CE}}}{\partial z_k} = \hat{y}_k - y_k}$$

This single identity powers almost all classification training. Derivation: $L = -\sum_i y_i \log s_i$. Then

$$\frac{\partial L}{\partial z_k} = -\sum_i y_i \frac{1}{s_i}\frac{\partial s_i}{\partial z_k} = -\sum_i y_i \frac{1}{s_i} s_i(\delta_{ik}-s_k) = -\sum_i y_i(\delta_{ik}-s_k) = -y_k + s_k\sum_i y_i = s_k - y_k$$

(since labels sum to 1).
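As a sanity check on the boxed identity, the sketch below differentiates the softmax cross-entropy numerically and compares it with $\hat{y}_k - y_k$. The logits and one-hot label are arbitrary illustrative values, and the helper names are not from any library.

```python
import math

def softmax(z):
    m = max(z)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """CE loss of softmax(z) against a one-hot label vector y."""
    return -sum(yk * math.log(sk) for yk, sk in zip(y, softmax(z)))

def ce_grad_logits(z, y):
    """Closed-form gradient: softmax(z) - y."""
    return [sk - yk for sk, yk in zip(softmax(z), y)]

def numeric_grad(z, y, h=1e-6):
    grad = []
    for k in range(len(z)):
        zp, zm = list(z), list(z)
        zp[k] += h
        zm[k] -= h
        grad.append((cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * h))
    return grad

z = [2.0, 1.0, 0.1]          # logits
y = [1.0, 0.0, 0.0]          # correct class is index 0
analytic = ce_grad_logits(z, y)
numeric = numeric_grad(z, y)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print([round(g, 4) for g in analytic], f"max err {max_err:.1e}")
```

The gradient is negative for the correct class (push its logit up) and positive for the others (push them down), and its entries sum to zero.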


1.4 Gradient descent

We want $\theta^* = \arg\min_\theta L(\theta)$. The gradient $\nabla_\theta L$ points in the direction of steepest increase, so we step opposite to it:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$

$\eta$ = learning rate. Too large → diverge/oscillate; too small → crawl.

Three flavors:

  • Batch GD: gradient over the whole dataset per step. Stable, expensive.
  • Stochastic GD (SGD): one example per step. Noisy, fast, can escape shallow minima.
  • Mini-batch GD: a batch of $B$ examples (typically $B = 32$ to $512$). The standard. Balances noise and hardware efficiency.

1D worked example

Minimize $L(\theta) = \theta^2$. Gradient $L'(\theta) = 2\theta$. Start $\theta_0 = 5$, $\eta = 0.1$.

| step | $\theta$ | $L'(\theta)=2\theta$ | update $\theta - 0.1\cdot 2\theta = 0.8\theta$ |
|---|---|---|---|
| 0 | 5.000 | 10.0 | 4.000 |
| 1 | 4.000 | 8.0 | 3.200 |
| 2 | 3.200 | 6.4 | 2.560 |
| 3 | 2.560 | 5.12 | 2.048 |

Each step multiplies $\theta$ by $0.8$, converging geometrically to the minimum at $0$. (If $\eta = 1.1$, the multiplier becomes $1 - 2(1.1) = -1.2$ → diverges. This shows learning-rate sensitivity concretely.)
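The table and the divergence remark can be reproduced with a short loop. This is a minimal sketch specific to $L(\theta)=\theta^2$; the function name `gradient_descent` is an illustrative choice.

```python
def gradient_descent(theta: float, eta: float, steps: int):
    """Run plain gradient descent on L(theta) = theta^2 and record every iterate."""
    history = [theta]
    for _ in range(steps):
        grad = 2.0 * theta            # L'(theta)
        theta = theta - eta * grad    # the update rule
        history.append(theta)
    return history

converging = gradient_descent(5.0, eta=0.1, steps=4)   # multiplier 0.8 per step
diverging = gradient_descent(5.0, eta=1.1, steps=4)    # multiplier -1.2 per step

print([round(t, 3) for t in converging])  # [5.0, 4.0, 3.2, 2.56, 2.048]
print([round(t, 3) for t in diverging])   # [5.0, -6.0, 7.2, -8.64, 10.368]
```

With $\eta = 1.1$ the iterate flips sign and grows by 20% every step, which is what an oscillating, exploding loss curve looks like in one dimension.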


1.5 Backpropagation — the full derivation

Backprop = the chain rule, applied layer by layer from output to input, reusing intermediate results. This is the single most important mechanism to understand.

Setup

A network with layers $\ell=1\dots L$. For each layer:

$$\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)})$$

Define the error signal of layer $\ell$ as the gradient of loss w.r.t. that layer's pre-activation:

$$\boldsymbol{\delta}^{(\ell)} \equiv \frac{\partial L}{\partial \mathbf{z}^{(\ell)}}$$

The four backprop equations

(BP1) Output layer error.

$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} L \;\odot\; \sigma'(\mathbf{z}^{(L)})$$

This elementwise form holds for elementwise activations. For softmax+CE, where the activation's Jacobian is not diagonal, the combined result is just $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$.

(BP2) Backpropagate the error to earlier layers.

$$\boldsymbol{\delta}^{(\ell)} = \left( \mathbf{W}^{(\ell+1)\top}\,\boldsymbol{\delta}^{(\ell+1)} \right) \odot \sigma'(\mathbf{z}^{(\ell)})$$

Why: $L$ depends on $\mathbf{z}^{(\ell)}$ only through $\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)}\sigma(\mathbf{z}^{(\ell)}) + \mathbf{b}^{(\ell+1)}$. Chain rule: $\frac{\partial L}{\partial z_j^{(\ell)}} = \sum_k \frac{\partial L}{\partial z_k^{(\ell+1)}} \frac{\partial z_k^{(\ell+1)}}{\partial z_j^{(\ell)}}$. The inner derivative is $W_{kj}^{(\ell+1)} \sigma'(z_j^{(\ell)})$. Collecting over $k$ gives the matrix-vector form above.

(BP3) Gradient w.r.t. weights.

$$\frac{\partial L}{\partial \mathbf{W}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \,\mathbf{a}^{(\ell-1)\top} \qquad (\text{outer product, shape } m_\ell \times m_{\ell-1})$$

(BP4) Gradient w.r.t. biases.

$$\frac{\partial L}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}$$

Then update every parameter: $\mathbf{W}^{(\ell)} \leftarrow \mathbf{W}^{(\ell)} - \eta\,\frac{\partial L}{\partial \mathbf{W}^{(\ell)}}$ and $\mathbf{b}^{(\ell)} \leftarrow \mathbf{b}^{(\ell)} - \eta\,\frac{\partial L}{\partial \mathbf{b}^{(\ell)}}$.

Fully worked numeric backprop (do this by hand once!)

A 2-2-1 network. Inputs $\mathbf{x}=[0.5, 0.1]$, target $y=1$, sigmoid everywhere, MSE loss $L=\tfrac12(\hat y - y)^2$.

Weights:

$$\mathbf{W}^{(1)} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix},\; \mathbf{b}^{(1)}=\begin{bmatrix}0\\0\end{bmatrix},\quad \mathbf{W}^{(2)} = \begin{bmatrix} 0.5 & 0.6 \end{bmatrix},\; b^{(2)}=0$$

Forward:

$$\mathbf{z}^{(1)} = \begin{bmatrix}0.1(0.5)+0.2(0.1)\\ 0.3(0.5)+0.4(0.1)\end{bmatrix} = \begin{bmatrix}0.07\\0.19\end{bmatrix},\quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) = \begin{bmatrix}0.5175\\0.5474\end{bmatrix}$$

$$z^{(2)} = 0.5(0.5175)+0.6(0.5474) = 0.5872, \quad \hat y = a^{(2)} = \sigma(0.5872) = 0.6427$$

Loss $L = \tfrac12(0.6427-1)^2 = 0.0638$.

Backward:

$$\delta^{(2)} = (\hat y - y)\cdot \sigma'(z^{(2)}) = (-0.3573)\cdot(0.6427)(1-0.6427) = -0.3573 \cdot 0.2296 = -0.0820$$

Gradients for output layer:

$$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)}\mathbf{a}^{(1)\top} = -0.0820\,[0.5175,\,0.5474] = [-0.0425,\,-0.0449]$$

Propagate back:

$$\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)\top}\delta^{(2)}) \odot \sigma'(\mathbf{z}^{(1)}) = \begin{bmatrix}0.5\\0.6\end{bmatrix}(-0.0820) \odot \begin{bmatrix}0.2497\\0.2478\end{bmatrix} = \begin{bmatrix}-0.0102\\-0.0122\end{bmatrix}$$

(using $\sigma'(0.07)=0.5175\cdot0.4825=0.2497$, $\sigma'(0.19)=0.5474\cdot0.4526=0.2478$).

$$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)}\mathbf{x}^\top = \begin{bmatrix}-0.0102\\-0.0122\end{bmatrix}[0.5,\,0.1] = \begin{bmatrix}-0.00512 & -0.00102\\ -0.00610 & -0.00122\end{bmatrix}$$

Values are rounded from the unrounded computation, so multiplying the displayed four-digit numbers by hand can differ in the last digit.

With $\eta=0.1$, every weight nudges by $-\eta\cdot$grad (e.g. $W^{(2)}_1: 0.5 - 0.1(-0.0425)=0.50425$). That's one training step. Repeat over many batches.
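The whole hand calculation fits in a short script, which is a convenient way to check your own arithmetic. This is a sketch in plain Python using the same inputs and weights; the variable names (`delta2`, `dW1`, and so on) are illustrative. Because the script keeps full precision, its results can differ from hand-rounded intermediate values in the last printed digit.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Network and data from the worked example
x, y = [0.5, 0.1], 1.0
W1, b1 = [[0.1, 0.2], [0.3, 0.4]], [0.0, 0.0]
W2, b2 = [0.5, 0.6], 0.0

# ---- forward ----
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [sigmoid(z) for z in z1]
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = sigmoid(z2)
loss = 0.5 * (y_hat - y) ** 2

# ---- backward ----
delta2 = (y_hat - y) * y_hat * (1 - y_hat)                          # BP1
dW2 = [delta2 * a for a in a1]                                      # BP3
db2 = delta2                                                        # BP4
delta1 = [W2[j] * delta2 * a1[j] * (1 - a1[j]) for j in range(2)]   # BP2
dW1 = [[d * xi for xi in x] for d in delta1]                        # BP3
db1 = delta1                                                        # BP4

# ---- one SGD step on the output weights ----
eta = 0.1
W2_new = [w - eta * g for w, g in zip(W2, dW2)]

print(f"y_hat = {y_hat:.4f}, loss = {loss:.4f}")   # y_hat = 0.6427, loss = 0.0638
print("delta2 =", round(delta2, 4), " dW2 =", [round(g, 4) for g in dW2])
print("delta1 =", [round(d, 4) for d in delta1])
print("dW1 =", [[round(g, 5) for g in row] for row in dW1])
```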


1.6 Optimizers — beyond vanilla SGD

Plain SGD struggles in ravines (steep in one direction, flat in another) and with noisy gradients. These improve it.

Momentum

Accumulate a velocity that smooths the path:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\nabla L, \qquad \theta \leftarrow \theta - \eta \mathbf{v}_t$$

$\beta\approx0.9$. Like a heavy ball rolling downhill: it dampens oscillation and accelerates along consistent directions.

RMSProp

Scale each parameter's step by a running average of squared gradients (adaptive per-parameter learning rate):

$$\mathbf{s}_t = \beta \mathbf{s}_{t-1} + (1-\beta)(\nabla L)^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{\mathbf{s}_t}+\epsilon}\nabla L$$

Adam (the default)

Combines momentum (1st moment) + RMSProp (2nd moment), with bias correction:

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla L$$

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla L)^2$$

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$$

$$\theta \leftarrow \theta - \eta\,\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t}+\epsilon}$$

Defaults: $\beta_1=0.9,\ \beta_2=0.999,\ \epsilon=10^{-8}$. Bias correction matters because $\mathbf{m}_0=\mathbf{v}_0=0$ makes early estimates biased toward zero; dividing by $1-\beta^t$ (which is small early) inflates them to the right scale. AdamW decouples weight decay from the gradient step and is the modern default for Transformers.
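The update equations translate almost line for line into code. The sketch below is a scalar illustration, not a production optimizer: `adam_step` is a hypothetical helper, the objective is the toy $L(\theta)=\theta^2$ from section 1.4, and the learning rate of 0.1 is chosen for this toy problem rather than taken from any default. The first step also shows what bias correction does: the raw moments are only a fraction of the gradient, while the corrected ones recover its full scale.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # 2nd moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v, m_hat, v_hat

# First step from theta = 5 on L = theta^2 (gradient 2 * theta = 10)
theta1, m1, v1, m_hat1, v_hat1 = adam_step(5.0, 10.0, 0.0, 0.0, t=1)
print(f"raw m = {m1:.3f}, corrected m_hat = {m_hat1:.3f}")
print(f"raw v = {v1:.3f}, corrected v_hat = {v_hat1:.3f}")
print(f"theta after one step = {theta1:.4f}")

# Keep going for a few hundred steps
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 301):
    theta, m, v, _, _ = adam_step(theta, 2.0 * theta, m, v, t)
print(f"theta after 300 steps = {theta:.4f}")
```

On the first step the corrected ratio $\hat m/\sqrt{\hat v}$ is essentially the sign of the gradient, so the parameter moves by about one learning rate regardless of how large the gradient is.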


1.7 Weight initialization

Bad init → vanishing/exploding activations before training even starts. Keep variance stable across layers.

  • Xavier/Glorot (for tanh/sigmoid): $\text{Var}(W) = \dfrac{2}{n_{\text{in}}+n_{\text{out}}}$.
  • He/Kaiming (for ReLU): $\text{Var}(W) = \dfrac{2}{n_{\text{in}}}$, which accounts for ReLU zeroing half the activations.
  • Biases usually start at 0.

Why $2/n_{in}$ for ReLU: a linear layer's output variance is $n_{in}\,\text{Var}(W)\,\text{Var}(x)$; ReLU halves the effective variance, so we want $n_{in}\,\text{Var}(W)/2 = 1 \Rightarrow \text{Var}(W)=2/n_{in}$, keeping signal magnitude roughly constant through depth.
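A small Monte Carlo experiment illustrates the factor of 2. The sketch below is an assumption-laden toy: inputs are standard normal, a single random ReLU layer of width 64 is used, and "signal magnitude" is measured as the mean squared activation. The function name and all sizes are illustrative, and the estimates are noisy by a few percent.

```python
import random

def relu_layer_second_moment(n_in, n_out, weight_var, n_samples, rng):
    """Mean of relu(W x)^2 over units and samples, for x ~ N(0, 1) entries."""
    std = weight_var ** 0.5
    W = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        for row in W:
            z = sum(w * xi for w, xi in zip(row, x))
            total += max(0.0, z) ** 2
    return total / (n_samples * n_out)

rng = random.Random(0)      # fixed seed so the run is repeatable
n = 64
he = relu_layer_second_moment(n, n, 2.0 / n, 200, rng)      # Var(W) = 2 / n_in
naive = relu_layer_second_moment(n, n, 1.0 / n, 200, rng)   # Var(W) = 1 / n_in

print(f"input mean square: 1.0 | He init: {he:.2f} | 1/n init: {naive:.2f}")
```

With $\text{Var}(W)=2/n_{in}$ the output mean square stays near the input's value of 1, while $1/n_{in}$ roughly halves it, a loss that would compound layer after layer.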


1.8 Regularization (fighting overfitting)

Overfitting = low training loss, high test loss (memorizing noise). Remedies:

  • L2 (weight decay): add $\frac{\lambda}{2}\|\theta\|^2$ to the loss → gradient gains $+\lambda\theta$ → shrinks weights toward 0. Encourages small, smooth weights.
  • L1: add $\lambda\|\theta\|_1$ → drives some weights exactly to 0 (sparsity / feature selection).
  • Dropout: during training, zero each activation independently with probability $p$, then scale survivors by $1/(1-p)$ (inverted dropout) so expectations match at test time. Acts like training an ensemble of subnetworks; prevents co-adaptation.
  • Early stopping: halt when validation loss stops improving.
  • Data augmentation: synthetically expand data (flips, crops, noise) — strong regularizer especially for vision.
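Of the remedies above, inverted dropout is the one whose mechanics are easiest to get wrong, so here is a minimal sketch. The function name `inverted_dropout` and its `training` flag are illustrative choices; real frameworks implement the same idea on tensors.

```python
import random

def inverted_dropout(activations, p, rng, training=True):
    """Zero each activation with probability p and rescale survivors by 1/(1-p).

    At inference (training=False) the input passes through unchanged.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 10000

dropped = inverted_dropout(x, p=0.5, rng=rng)
mean_train = sum(dropped) / len(dropped)
at_inference = inverted_dropout(x, p=0.5, rng=rng, training=False)

print(sorted(set(dropped)))          # [0.0, 2.0]: dropped or scaled by 1/(1-p)
print(f"training-time mean = {mean_train:.2f} (input mean is 1.0)")
```

Because survivors are scaled up during training, the average activation matches the inference-time value and no rescaling is needed when the model is deployed.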

1.9 Normalization layers

Normalizing intermediate activations stabilizes and speeds training.

Batch Normalization

For a feature over a mini-batch of size $B$:

$$\mu = \frac{1}{B}\sum_{i} x_i, \quad \sigma^2 = \frac{1}{B}\sum_i (x_i-\mu)^2, \quad \hat x_i = \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}}, \quad y_i = \gamma \hat x_i + \beta$$

$\gamma,\beta$ are learnable scale/shift, letting the network undo normalization if needed. At inference, use running averages of $\mu,\sigma^2$ collected during training. BatchNorm depends on batch statistics → awkward for sequences/small batches.
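The training/inference split is the part that trips people up, so the sketch below shows it for a single scalar feature. Everything here is illustrative: the class name is hypothetical, $\gamma$ and $\beta$ are held fixed at 1 and 0 instead of being learned, and the running-average rule `new = (1 - momentum) * old + momentum * batch` with `momentum=0.1` is one common convention assumed for the example.

```python
import math

class BatchNormScalar:
    """Batch norm for one scalar feature (illustrative; gamma/beta not trained)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def forward(self, xs, training):
        if training:
            # Use this batch's statistics and fold them into the running averages.
            mu = sum(xs) / len(xs)
            var = sum((x - mu) ** 2 for x in xs) / len(xs)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: rely on the statistics collected during training.
            mu, var = self.running_mean, self.running_var
        return [self.gamma * (x - mu) / math.sqrt(var + self.eps) + self.beta for x in xs]

bn = BatchNormScalar()
train_out = bn.forward([1.0, 2.0, 3.0, 4.0], training=True)
eval_out = bn.forward([2.5], training=False)   # works even with a batch of one

print([round(v, 3) for v in train_out])
print(bn.running_mean, bn.running_var)
```

In training mode the output batch has zero mean and (almost exactly) unit variance; in inference mode a single example can be normalized because no batch statistics are needed.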

Layer Normalization

Normalizes across features within one example (not across the batch):

$$\mu = \frac{1}{H}\sum_{j=1}^{H} x_j, \quad \sigma^2 = \frac{1}{H}\sum_j (x_j-\mu)^2, \quad y_j = \gamma_j \frac{x_j-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta_j$$

Batch-independent → the choice for RNNs and Transformers. ([[04_transformers]] uses LayerNorm in every block.)
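A minimal sketch of the formula, applied row by row, makes the batch independence concrete. The function name `layer_norm`, the value `eps=1e-5`, and the two example rows are illustrative; $\gamma$ and $\beta$ are set to ones and zeros rather than learned.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one example across its H features, then scale and shift."""
    H = len(x)
    mu = sum(x) / H
    var = sum((v - mu) ** 2 for v in x) / H
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

batch = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0]]     # same pattern, 10x the scale
gamma, beta = [1.0] * 4, [0.0] * 4

# Each row is processed on its own; no statistic is shared across the batch.
normed = [layer_norm(row, gamma, beta) for row in batch]
for row in normed:
    print([round(v, 3) for v in row])
```

Both rows come out nearly identical, since each example is rescaled by its own mean and variance, and the result for one row would be the same whatever else is in the batch.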


1.10 Putting it together — a NumPy MLP (no frameworks)

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dsigmoid(a):
    # Sigmoid derivative written in terms of the activation a = σ(z)
    return a * (1 - a)

# Tiny 2-2-1 net trained on XOR.
# Examples are rows, so weights are stored as (n_in, n_out): the transpose of W in the math above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # (4,2)
Y = np.array([[0], [1], [1], [0]], dtype=float)               # (4,1)

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros((1, 2))   # standard-normal init (not He/Xavier; fine at this size)
W2 = rng.normal(0, 1, (2, 1)); b2 = np.zeros((1, 1))
eta = 0.5

for epoch in range(10000):
    # ---- forward ----
    Z1 = X @ W1 + b1;  A1 = sigmoid(Z1)                 # (4,2)
    Z2 = A1 @ W2 + b2; A2 = sigmoid(Z2)                 # (4,1) = predictions
    loss = np.mean((A2 - Y)**2)

    # ---- backward (BP1-BP4) ----
    dA2   = 2*(A2 - Y)/len(X)                           # dL/dA2 for the mean-squared loss
    dZ2   = dA2 * dsigmoid(A2)                          # δ2 (BP1)
    dW2   = A1.T @ dZ2                                  # BP3, summed over the batch
    db2   = dZ2.sum(0, keepdims=True)                   # BP4, summed over the batch
    dA1   = dZ2 @ W2.T                                  # propagate
    dZ1   = dA1 * dsigmoid(A1)                          # δ1 (BP2)
    dW1   = X.T @ dZ1                                   # BP3
    db1   = dZ1.sum(0, keepdims=True)                   # BP4

    # ---- update ----
    W2 -= eta*dW2; b2 -= eta*db2
    W1 -= eta*dW1; b1 -= eta*db1

# Target is [0, 1, 1, 0]. A 2-2-1 sigmoid net can stall in a poor solution on XOR;
# if the predictions are far off, try a different seed or more hidden units.
print("Predictions:", A2.ravel().round(3))
```

The same five-stage pattern (forward → δ at output → δ propagated → grads → update) reappears in every architecture in these notes. Frameworks (PyTorch/TF) just automate the chain rule via autograd (a computation graph that records operations and replays their derivatives in reverse).

The PyTorch equivalent (autograd does backprop for you)

```python
import torch
import torch.nn as nn

# Same 2-2-1 sigmoid architecture as the NumPy version
net = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), lr=0.05)
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
Y = torch.tensor([[0.], [1.], [1.], [0.]])

for _ in range(5000):
    pred = net(X)
    loss = ((pred - Y) ** 2).mean()
    opt.zero_grad()    # clear gradients left over from the previous step
    loss.backward()    # backward() = autograd backprop
    opt.step()         # Adam parameter update
```

1.11 Common pitfalls

  • Forgetting to zero gradients in PyTorch (they accumulate) → use opt.zero_grad().
  • Learning rate too high → loss NaN/explodes; too low → no progress. Start ~$10^{-3}$ for Adam.
  • Vanishing gradients with deep sigmoid/tanh stacks → use ReLU/GELU + residual connections + normalization.
  • Data not normalized → unstable training. Standardize inputs to ~zero mean, unit variance.
  • Mismatched loss/output: use softmax+CE for multiclass, sigmoid+BCE for binary, linear+MSE for regression.

Next: [[02_cnns]] applies these exact mechanics with weight-sharing for images.