Deep Learning Foundations — Knowledge

Everything in CNNs, RNNs, and Transformers is just this chapter applied with different connectivity. Master it.

1.0 Vocabulary you'll need first (plain definitions)

Before the math, here are the words used everywhere below. Read once, refer back as needed.

Term	Plain meaning
Model / network	The thing that makes predictions; a big math function with adjustable knobs.
Parameter (weight/bias)	A knob the model learns by itself during training (a network has thousands to billions of them).
Hyperparameter	A knob you set before training (learning rate, number of layers, batch size). The model doesn't learn these.
Feature	One input number/attribute (e.g. a pixel, a word, "income").
Label / target	The correct answer we want the model to output (used during training).
Training	Repeatedly showing examples and nudging the knobs to reduce mistakes.
Inference	Using the trained model to predict on new, unseen data.
Forward pass	Feeding an input through the network to get a prediction.
Backward pass (backprop)	Computing how to adjust every knob to reduce the error.
Loss	A single number scoring how wrong a prediction was (lower = better).
Gradient	For each knob, the direction and rate of steepest increase in the loss; training steps the opposite way to reduce it.
Batch	A small group of examples processed together in one step (e.g. 32).
Iteration / step	One update of the knobs (one batch processed forward + backward).
Epoch	One full pass over the entire training dataset (many iterations).
Train / validation / test split	Data is divided three ways: train (learn from), validation (tune hyperparameters / check progress), test (final honest grade on data never seen).
Overfitting	Memorizing the training data instead of learning the general pattern → great on train, bad on new data.
Tensor	Just a multi-dimensional array of numbers (a scalar, vector, matrix, or higher). The basic data unit in PyTorch/TensorFlow.

🧠 The 30-second mental model of all of deep learning: start with a function full of random knobs → show it an example → measure how wrong it is (loss) → figure out which way to turn each knob to be less wrong (gradient via backprop) → turn the knobs a tiny bit (optimizer step) → repeat millions of times. That's it. Everything else is variations on how the knobs are wired together.

1.1 The artificial neuron

Intuition

A neuron takes several numbers, weighs each by importance, adds them up, adds a bias (a baseline), and squashes the result through a nonlinear function. Stacking these gives a network that, given enough hidden units, can approximate any continuous function on a bounded input domain to arbitrary accuracy (the Universal Approximation Theorem).

Math

For a single neuron with input vector $\mathbf{x} \in \mathbb{R}^{n}$ , weights $\mathbf{w} \in \mathbb{R}^{n}$ , bias $b \in \mathbb{R}$ :

z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b, \qquad a = \sigma(z)

$z$ = pre-activation (a.k.a. logit/net input).
$\sigma$ = activation function (nonlinearity).
$a$ = activation (the neuron's output).

For a layer of $m$ neurons, stack the weights into a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ (row $j$ = weights of neuron $j$ ) and bias $\mathbf{b} \in \mathbb{R}^{m}$ :

\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = \sigma(\mathbf{z})

A network with layers $\ell = 1 \dots L$ chains these:

\mathbf{a}^{(\ell)} = \sigma^{(\ell)}\!\left(\mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right), \qquad \mathbf{a}^{(0)} = \mathbf{x}

This is the forward pass. The final $\mathbf{a}^{(L)} = \hat{\mathbf{y}}$ is the prediction.

Tiny numeric example

Input $\mathbf{x} = [1, 2]$ , one neuron with $\mathbf{w} = [0.5, -1]$ , $b = 0.5$ , sigmoid activation.

z = 0.5(1) + (-1)(2) + 0.5 = -1.0, \qquad a = \frac{1}{1+e^{-(-1)}} = \frac{1}{1+e^{1}} = 0.2689
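The arithmetic above can be checked in a few lines. This is a minimal sketch; the helper name `neuron_forward` is illustrative, not something defined elsewhere in these notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    """Return (z, a) for one neuron: z = w.x + b, a = sigmoid(z)."""
    z = float(np.dot(w, x) + b)
    return z, float(sigmoid(z))

x = np.array([1.0, 2.0])
w = np.array([0.5, -1.0])
z, a = neuron_forward(x, w, b=0.5)
print(f"z = {z:.4f}, a = {a:.4f}")  # z = -1.0000, a = 0.2689
```

For a whole layer, the same computation becomes `sigmoid(W @ x + b)` with `W` of shape `(m, n)`.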

1.2 Activation functions (and why nonlinearity matters)

Without a nonlinearity, stacking layers collapses: $\mathbf{W}_2(\mathbf{W}_1\mathbf{x}) = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x}$ — still linear. Nonlinearity is what gives depth its power.
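A small numeric check of the collapse, using hand-picked matrices (illustrative values, chosen so the ReLU visibly changes the answer):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # first layer: 2 inputs -> 2 units
W2 = np.array([[1.0, 1.0]])    # second layer: 2 units -> 1 output
x = np.array([1.0, 0.0])

stacked_linear = W2 @ (W1 @ x)        # two linear layers applied in sequence
merged_linear = (W2 @ W1) @ x         # one equivalent linear layer
stacked_relu = W2 @ relu(W1 @ x)      # same weights with a ReLU in between

print(stacked_linear, merged_linear, stacked_relu)  # [0.] [0.] [1.]
```

Without the activation, the two layers are exactly one matrix; with it, no single matrix reproduces the output for all inputs.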

Name	$\sigma(z)$	Derivative $\sigma'(z)$	Range	Notes
Sigmoid	$\dfrac{1}{1+e^{-z}}$	$\sigma(z)\,(1-\sigma(z))$	$(0,1)$	Saturates → vanishing grads; used for binary output
Tanh	$\dfrac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$	$1-\tanh^2(z)$	$(-1,1)$	Zero-centered; still saturates
ReLU	$\max(0,z)$	$\begin{cases}1 & z>0\\0 & z<0\end{cases}$	$[0,\infty)$	Default for hidden layers; cheap; "dead" neurons possible
Leaky ReLU	$\max(\alpha z, z)$	$1$ if $z>0$ else $\alpha$	$\mathbb{R}$	Fixes dead neurons ( $\alpha\approx0.01$ )
GELU	$z\,\Phi(z)$	(see below)	$\mathbb{R}$	Smooth; used in Transformers (BERT/GPT)
Softmax	$\dfrac{e^{z_i}}{\sum_j e^{z_j}}$	Jacobian (below)	$(0,1)$ , sums to 1	Output layer for multi-class

GELU (Gaussian Error Linear Unit): $\text{GELU}(z) = z \cdot \Phi(z)$ where $\Phi$ is the standard normal CDF. A common approximation:

\text{GELU}(z) \approx 0.5z\left(1 + \tanh\!\left[\sqrt{2/\pi}\,(z + 0.044715 z^3)\right]\right)
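To get a feel for how close the approximation is, the sketch below compares it with the exact form, writing $\Phi$ via the error function as $\Phi(z) = \tfrac12\left(1 + \operatorname{erf}(z/\sqrt2)\right)$. The grid range $[-5, 5]$ is an arbitrary choice for illustration.

```python
import math

def gelu_exact(z):
    """z * Phi(z), with the normal CDF Phi expressed through erf."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh approximation quoted above."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)
    return 0.5 * z * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-500, 501)]
max_abs_err = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(f"max |exact - approx| on [-5, 5]: {max_abs_err:.2e}")
```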

Deriving the softmax Jacobian (needed for backprop). Let $s_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}}$. Then:

\frac{\partial s_i}{\partial z_j} = \begin{cases} s_i(1 - s_i) & i = j \\ - s_i s_j & i \ne j \end{cases} \;=\; s_i(\delta_{ij} - s_j)

where $\delta_{ij}$ is the Kronecker delta. Derivation: for $i=j$, apply the quotient rule to $e^{z_i}/\sum_k e^{z_k}$. The numerator term gives $e^{z_i}/\sum = s_i$ and the denominator term gives $-e^{z_i}e^{z_i}/(\sum)^2 = -s_i^2$, so the total is $s_i(1-s_i)$. For $i\ne j$ only the denominator depends on $z_j$, giving $-e^{z_i}e^{z_j}/(\sum)^2 = -s_i s_j$.
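In matrix form the Jacobian is $\operatorname{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top$. The sketch below builds it and compares it with central finite differences; the function names and the test logits are illustrative.

```python
import numpy as np

def softmax(z):
    """Softmax with the max subtracted for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = s_i * (delta_ij - s_j) = diag(s) - s s^T."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def numeric_jacobian(f, z, h=1e-6):
    """Central finite differences, one input coordinate at a time."""
    J = np.zeros((z.size, z.size))
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = h
        J[:, j] = (f(z + step) - f(z - step)) / (2 * h)
    return J

z = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian(z)
jacobian_gap = np.max(np.abs(J - numeric_jacobian(softmax, z)))
print(f"max difference from finite differences: {jacobian_gap:.1e}")
```

Each column of `J` sums to zero, which reflects the constraint that the softmax outputs always sum to 1.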

1.3 Loss functions

The loss $L(\hat{y}, y)$ measures how wrong a prediction is. Training = minimize average loss over the dataset.

Mean Squared Error (regression)

L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \qquad \frac{\partial L}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)

Binary Cross-Entropy (binary classification)

With $\hat{y} = \sigma(z) \in (0,1)$ and label $y \in \{0,1\}$ :

L_{\text{BCE}} = -\big[y \log \hat{y} + (1-y)\log(1-\hat{y})\big]

A beautiful simplification: the gradient w.r.t. the logit $z$ collapses to

\frac{\partial L_{\text{BCE}}}{\partial z} = \hat{y} - y

Why: $\frac{\partial L}{\partial \hat y} = -\frac{y}{\hat y} + \frac{1-y}{1-\hat y}$, and $\frac{\partial \hat y}{\partial z} = \hat y(1-\hat y)$. Multiplying gives $-y(1-\hat y) + (1-y)\hat y = \hat y - y$. This cancellation is why sigmoid and BCE are paired: the gradient stays bounded and informative even when $\hat y$ saturates near 0 or 1.
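A quick finite-difference check of the identity $\partial L_{\text{BCE}}/\partial z = \hat y - y$. This is only a sanity-check sketch; the logit values tested are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy with y_hat = sigmoid(z)."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_grad_wrt_logit(z, y):
    """Closed form from the text: dL/dz = y_hat - y."""
    return sigmoid(z) - y

h = 1e-6
gaps = []
for z in (-2.0, 0.3, 1.5):
    for y in (0, 1):
        numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
        gaps.append(abs(numeric - bce_grad_wrt_logit(z, y)))
max_bce_gap = max(gaps)
print(f"largest mismatch: {max_bce_gap:.1e}")
```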

Categorical Cross-Entropy (multi-class)

With one-hot label $\mathbf{y}$ and softmax output $\hat{\mathbf{y}}$ :

L_{\text{CE}} = -\sum_{k} y_k \log \hat{y}_k = -\log \hat{y}_{\text{correct}}

Combined softmax+CE gradient w.r.t. logits is again the clean form:

\boxed{\frac{\partial L_{\text{CE}}}{\partial z_k} = \hat{y}_k - y_k}

This single identity powers almost all classification training. Derivation: $L = -\sum_i y_i \log s_i$ . Then $\frac{\partial L}{\partial z_k} = -\sum_i y_i \frac{1}{s_i}\frac{\partial s_i}{\partial z_k} = -\sum_i y_i \frac{1}{s_i} s_i(\delta_{ik}-s_k) = -\sum_i y_i(\delta_{ik}-s_k) = -y_k + s_k\sum_i y_i = s_k - y_k$ (since labels sum to 1).
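The boxed identity can be verified numerically in the same way. The sketch below computes the loss through a stable log-softmax (subtracting the max logit first, a standard trick that is an addition here rather than something stated above) and compares the closed-form gradient with finite differences.

```python
import numpy as np

def log_softmax(z):
    """Stable log-softmax: shift by max(z), then subtract log-sum-exp."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy(z, y_onehot):
    return -np.sum(y_onehot * log_softmax(z))

def ce_grad_wrt_logits(z, y_onehot):
    """Closed form from the boxed identity: softmax(z) - y."""
    return np.exp(log_softmax(z)) - y_onehot

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])
analytic = ce_grad_wrt_logits(z, y)

h = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = h
    numeric[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * h)

ce_gap = np.max(np.abs(analytic - numeric))
print(f"largest mismatch: {ce_gap:.1e}")
```

The gradient entries sum to zero because both the softmax output and the one-hot label sum to 1.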

1.4 Gradient descent

We want $\theta^* = \arg\min_\theta L(\theta)$ . The gradient $\nabla_\theta L$ points in the direction of steepest increase, so we step opposite to it:

\theta \leftarrow \theta - \eta \, \nabla_\theta L

$\eta$ = learning rate. Too large → diverge/oscillate; too small → crawl.

Three flavors:

Batch GD: gradient over the whole dataset per step. Stable, expensive.
Stochastic GD (SGD): one example per step. Noisy, fast, can escape shallow minima.
Mini-batch GD: a batch of $B$ examples (typical $B=32\text{–}512$ ). The standard. Balances noise and hardware efficiency.
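The three flavors differ only in how many examples feed each gradient. Below is a minimal mini-batch loop on synthetic linear-regression data; the dataset, the true slope and intercept, and all hyperparameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, epochs, eta = 512, 32, 50, 0.1           # illustrative values
x = rng.normal(size=N)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=N)   # synthetic data: slope 3, intercept 2

w, b = 0.0, 0.0
steps = 0
for epoch in range(epochs):
    order = rng.permutation(N)                 # reshuffle once per epoch
    for start in range(0, N, B):
        idx = order[start:start + B]           # indices of one mini-batch
        err = (w * x[idx] + b) - y[idx]
        grad_w = 2.0 * np.mean(err * x[idx])   # d(MSE)/dw on this batch
        grad_b = 2.0 * np.mean(err)            # d(MSE)/db on this batch
        w -= eta * grad_w
        b -= eta * grad_b
        steps += 1                             # one iteration = one batch

print(f"w = {w:.2f}, b = {b:.2f}, iterations = {steps}")
```

Setting `B = N` turns this into batch GD (one step per epoch), and `B = 1` turns it into pure SGD.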

1D worked example

Minimize $L(\theta) = \theta^2$ . Gradient $L'(\theta) = 2\theta$ . Start $\theta_0 = 5$ , $\eta = 0.1$ .

step	$\theta$	$L'(\theta)=2\theta$	update $\theta - 0.1\cdot 2\theta = 0.8\theta$
0	5.000	10.0	4.000
1	4.000	8.0	3.200
2	3.200	6.4	2.560
3	2.560	5.12	2.048

Each step multiplies $\theta$ by $0.8$ , converging geometrically to the minimum at $0$ . (If $\eta = 1.1$ , the multiplier becomes $-1.2$ → diverges. This shows learning-rate sensitivity concretely.)
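The table and the divergent case can be reproduced with a short loop (the function name is illustrative):

```python
def gradient_descent_1d(theta, eta, steps):
    """Minimize L(theta) = theta**2, whose gradient is 2*theta."""
    path = [theta]
    for _ in range(steps):
        theta = theta - eta * 2.0 * theta
        path.append(theta)
    return path

converging = gradient_descent_1d(5.0, eta=0.1, steps=4)
diverging = gradient_descent_1d(5.0, eta=1.1, steps=4)
print([round(t, 3) for t in converging])  # [5.0, 4.0, 3.2, 2.56, 2.048]
print([round(t, 3) for t in diverging])   # [5.0, -6.0, 7.2, -8.64, 10.368]
```

With $\eta = 1.1$ the iterate flips sign and grows by a factor of 1.2 in magnitude every step.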

1.5 Backpropagation — the full derivation

Backprop = the chain rule, applied layer by layer from output to input, reusing intermediate results. This is the single most important mechanism to understand.

Setup

A network with layers $\ell=1\dots L$ . For each layer:

\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)})

Define the error signal of layer $\ell$ as the gradient of loss w.r.t. that layer's pre-activation:

\boldsymbol{\delta}^{(\ell)} \equiv \frac{\partial L}{\partial \mathbf{z}^{(\ell)}}

The four backprop equations

(BP1) Output layer error.

\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} L \;\odot\; \sigma'(\mathbf{z}^{(L)})

For softmax+CE this is just $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{y}} - \mathbf{y}$ .

(BP2) Backpropagate the error to earlier layers.

\boldsymbol{\delta}^{(\ell)} = \left( \mathbf{W}^{(\ell+1)\top}\,\boldsymbol{\delta}^{(\ell+1)} \right) \odot \sigma'(\mathbf{z}^{(\ell)})

Why: $L$ depends on $\mathbf{z}^{(\ell)}$ only through $\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)}\sigma(\mathbf{z}^{(\ell)}) + \mathbf{b}^{(\ell+1)}$ . Chain rule: $\frac{\partial L}{\partial z_j^{(\ell)}} = \sum_k \frac{\partial L}{\partial z_k^{(\ell+1)}} \frac{\partial z_k^{(\ell+1)}}{\partial z_j^{(\ell)}}$ . The inner derivative is $W_{kj}^{(\ell+1)} \sigma'(z_j^{(\ell)})$ . Collecting over $k$ gives the matrix-vector form above.

(BP3) Gradient w.r.t. weights.

\frac{\partial L}{\partial \mathbf{W}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \,\mathbf{a}^{(\ell-1)\top} \qquad (\text{outer product, shape } m_\ell \times m_{\ell-1})

(BP4) Gradient w.r.t. biases.

\frac{\partial L}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}

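Then update every parameter: $\mathbf{W}^{(\ell)} \leftarrow \mathbf{W}^{(\ell)} - \eta\,\frac{\partial L}{\partial \mathbf{W}^{(\ell)}}$ and $\mathbf{b}^{(\ell)} \leftarrow \mathbf{b}^{(\ell)} - \eta\,\frac{\partial L}{\partial \mathbf{b}^{(\ell)}}$.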

Fully worked numeric backprop (do this by hand once!)

A 2-2-1 network. Inputs $\mathbf{x}=[0.5, 0.1]$ , target $y=1$ , sigmoid everywhere, MSE loss $L=\tfrac12(\hat y - y)^2$ .

Weights:

\mathbf{W}^{(1)} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix},\; \mathbf{b}^{(1)}=\begin{bmatrix}0\\0\end{bmatrix},\quad \mathbf{W}^{(2)} = \begin{bmatrix} 0.5 & 0.6 \end{bmatrix},\; b^{(2)}=0

Forward:

\mathbf{z}^{(1)} = \begin{bmatrix}0.1(0.5)+0.2(0.1)\\ 0.3(0.5)+0.4(0.1)\end{bmatrix} = \begin{bmatrix}0.07\\0.19\end{bmatrix},\quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) = \begin{bmatrix}0.5175\\0.5474\end{bmatrix}

z^{(2)} = 0.5(0.5175)+0.6(0.5474) = 0.5872, \quad \hat y = a^{(2)} = \sigma(0.5872) = 0.6427

Loss $L = \tfrac12(0.6427-1)^2 = 0.0638$ .

Backward:

\delta^{(2)} = (\hat y - y)\cdot \sigma'(z^{(2)}) = (-0.3573)\cdot(0.6427)(1-0.6427) = -0.3573 \cdot 0.2296 = -0.08204

Gradients for output layer:

\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)}\mathbf{a}^{(1)\top} = -0.08204\,[0.5175,\,0.5474] = [-0.0425,\,-0.0449]

Propagate back:

\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)\top}\delta^{(2)}) \odot \sigma'(\mathbf{z}^{(1)}) = \begin{bmatrix}0.5\\0.6\end{bmatrix}(-0.08204) \odot \begin{bmatrix}0.2497\\0.2478\end{bmatrix} = \begin{bmatrix}-0.01024\\-0.01220\end{bmatrix}

(using $\sigma'(0.07)=0.5175\cdot0.4825=0.2497$, $\sigma'(0.19)=0.5474\cdot0.4526=0.2478$).

\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)}\mathbf{x}^\top = \begin{bmatrix}-0.01024\\-0.01220\end{bmatrix}[0.5,\,0.1] = \begin{bmatrix}-0.00512 & -0.00102\\ -0.00610 & -0.00122\end{bmatrix}

With $\eta=0.1$ , every weight nudges by $-\eta\cdot$ grad (e.g. $W^{(2)}_1: 0.5 - 0.1(-0.0425)=0.50425$ ). That's one training step. Repeat over many batches.
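After doing the example by hand, it helps to confirm it in code. The sketch below applies BP1 to BP4 to the same 2-2-1 network; its printed values should agree with the hand-computed ones up to rounding in the last digit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])
y = 1.0
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.zeros(2)
W2 = np.array([[0.5, 0.6]])
b2 = np.zeros(1)

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)
loss = 0.5 * float((y_hat[0] - y) ** 2)

# Backward pass
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # BP1, shape (1,)
dW2 = np.outer(delta2, a1)                   # BP3, shape (1, 2)
db2 = delta2                                 # BP4
delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # BP2, shape (2,)
dW1 = np.outer(delta1, x)                    # BP3, shape (2, 2)
db1 = delta1                                 # BP4

np.set_printoptions(precision=5, suppress=True)
print("y_hat:", y_hat, "loss:", round(loss, 4))
print("delta2:", delta2, "dW2:", dW2)
print("delta1:", delta1)
print("dW1:\n", dW1)
```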

1.6 Optimizers — beyond vanilla SGD

Plain SGD struggles in ravines (steep in one direction, flat in another) and with noisy gradients. These improve it.

Momentum

Accumulate a velocity that smooths the path:

\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\nabla L, \qquad \theta \leftarrow \theta - \eta \mathbf{v}_t

$\beta\approx0.9$ . Like a heavy ball rolling downhill — dampens oscillation, accelerates consistent directions.
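The dampening and accelerating effects can be seen by feeding the velocity update two synthetic gradient streams: one that always points the same way and one that flips sign every step. Both streams are invented for illustration.

```python
def momentum_velocity(grads, beta=0.9):
    """Velocity v_t = beta*v_{t-1} + (1-beta)*g_t, starting from v_0 = 0."""
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1.0 - beta) * g
        history.append(v)
    return history

steady = momentum_velocity([1.0] * 50)         # consistent direction
zigzag = momentum_velocity([1.0, -1.0] * 25)   # oscillating direction

print(f"steady gradient -> final velocity {steady[-1]:.3f}")
print(f"zigzag gradient -> final velocity {zigzag[-1]:.3f}")
```

For a constant gradient $g$ the velocity approaches $g$ itself (it equals $(1-\beta^t)\,g$ after $t$ steps), while the alternating stream mostly cancels out.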

RMSProp

Scale each parameter's step by a running average of squared gradients (adaptive per-parameter learning rate):

\mathbf{s}_t = \beta \mathbf{s}_{t-1} + (1-\beta)(\nabla L)^2, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{\mathbf{s}_t}+\epsilon}\nabla L
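A one-step sketch of the update shows the per-parameter scaling. The two gradient values are deliberately four orders of magnitude apart; the function name and hyperparameter values are illustrative.

```python
import numpy as np

def rmsprop_step(theta, grad, s, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new (theta, s)."""
    s = beta * s + (1.0 - beta) * grad**2
    theta = theta - eta / (np.sqrt(s) + eps) * grad
    return theta, s

theta = np.zeros(2)
grad = np.array([100.0, 0.01])   # very different gradient scales
new_theta, s = rmsprop_step(theta, grad, s=np.zeros(2))
step_sizes = np.abs(new_theta - theta)
print(step_sizes)
```

Both parameters move by nearly the same amount, $\eta/\sqrt{1-\beta}$ on this first step, even though their raw gradients differ by a factor of $10^4$.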

Adam (the default)

Combines momentum (1st moment) + RMSProp (2nd moment), with bias correction:

\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla L

\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla L)^2

\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}

\theta \leftarrow \theta - \eta\,\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t}+\epsilon}

Defaults: $\beta_1=0.9,\ \beta_2=0.999,\ \epsilon=10^{-8}$ . Bias correction matters because $\mathbf{m}_0=\mathbf{v}_0=0$ makes early estimates biased toward zero; dividing by $1-\beta^t$ (which is small early) inflates them to the right scale. AdamW decouples weight decay from the gradient step and is the modern default for Transformers.
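The four equations translate directly into code. The sketch below is a plain NumPy version for illustration, not a substitute for a framework optimizer; the toy objective $L(\theta)=\theta^2$ and the learning rate are arbitrary choices.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2**t)   # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize L(theta) = theta^2 starting from theta = 5.
theta = np.array([5.0])
m = np.zeros(1)
v = np.zeros(1)
first_step = None
for t in range(1, 2001):
    grad = 2.0 * theta
    new_theta, m, v = adam_step(theta, grad, m, v, t, eta=0.01)
    if t == 1:
        first_step = float(theta[0] - new_theta[0])
    theta = new_theta

print(f"first step size: {first_step:.4f}, final theta: {theta[0]:.4f}")
```

Thanks to bias correction, the very first step has magnitude close to $\eta$ regardless of the gradient's scale, since $\hat{\mathbf{m}}_1 = \nabla L$ and $\sqrt{\hat{\mathbf{v}}_1} = |\nabla L|$.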

1.7 Weight initialization

Bad init → vanishing/exploding activations before training even starts. Keep variance stable across layers.

Xavier/Glorot (for tanh/sigmoid): $\text{Var}(W) = \dfrac{2}{n_{\text{in}}+n_{\text{out}}}$ .
He/Kaiming (for ReLU): $\text{Var}(W) = \dfrac{2}{n_{\text{in}}}$ — accounts for ReLU zeroing half the activations.
Biases usually start at 0.

Why $2/n_{in}$ for ReLU: with zero-mean weights, a linear layer's pre-activation variance is $n_{in}\,\text{Var}(W)\,\mathbb{E}[x^2]$. ReLU zeroes the negative half of a symmetric pre-activation, so the second moment of its output is half that variance. Requiring $n_{in}\,\text{Var}(W)/2 = 1$ gives $\text{Var}(W)=2/n_{in}$, which keeps signal magnitude roughly constant through depth.
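A small simulation makes the variance argument tangible: push random inputs through a deep stack of ReLU layers and look at the mean squared activation at the end. Width, depth, batch size, and the "too small" standard deviation of 0.01 are all arbitrary choices for this sketch.

```python
import numpy as np

def final_mean_square(weight_std_fn, width=512, depth=20, batch=256, seed=0):
    """Push N(0,1) inputs through `depth` ReLU layers; return mean(a**2) at the end."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std_fn(width), size=(width, width))
        a = np.maximum(0.0, a @ W)   # linear layer (no bias) followed by ReLU
    return float(np.mean(a**2))

he_scale = final_mean_square(lambda n_in: np.sqrt(2.0 / n_in))   # He/Kaiming
tiny_scale = final_mean_square(lambda n_in: 0.01)                # fixed small std

print(f"He init: {he_scale:.3g}   fixed std 0.01: {tiny_scale:.3g}")
```

With He scaling the signal stays on the order of 1 after 20 layers, while the fixed small standard deviation shrinks it by many orders of magnitude.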

1.8 Regularization (fighting overfitting)

Overfitting = low training loss, high test loss (memorizing noise). Remedies:

L2 (weight decay): add $\frac{\lambda}{2}\|\theta\|^2$ to the loss → gradient gains $+\lambda\theta$ → shrinks weights toward 0. Encourages small, smooth weights.
L1: add $\lambda\|\theta\|_1$ → drives some weights exactly to 0 (sparsity / feature selection).
Dropout: during training, zero each activation independently with probability $p$ , then scale survivors by $1/(1-p)$ (inverted dropout) so expectations match at test time. Acts like training an ensemble of subnetworks; prevents co-adaptation.
Early stopping: halt when validation loss stops improving.
Data augmentation: synthetically expand data (flips, crops, noise) — strong regularizer especially for vision.
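Of the remedies above, inverted dropout is the one most easily gotten wrong in code, so here is a minimal sketch. The function signature is illustrative; frameworks expose the same behavior through their own dropout layers.

```python
import numpy as np

def inverted_dropout(a, p, rng, training=True):
    """Zero each entry with probability p and rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a                          # inference: identity, no rescaling
    keep_mask = rng.random(a.shape) >= p  # True with probability 1 - p
    return a * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 1000))
dropped = inverted_dropout(a, p=0.3, rng=rng)
zero_fraction = float(np.mean(dropped == 0.0))
mean_after = float(dropped.mean())
print(f"zeroed: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

About 30% of the entries are zeroed, yet the mean stays close to 1, which is why no extra scaling is needed at inference.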

1.9 Normalization layers

Normalizing intermediate activations stabilizes and speeds training.

Batch Normalization

For a feature over a mini-batch of size $B$ :

\mu = \frac{1}{B}\sum_{i} x_i, \quad \sigma^2 = \frac{1}{B}\sum_i (x_i-\mu)^2, \quad \hat x_i = \frac{x_i - \mu}{\sqrt{\sigma^2+\epsilon}}, \quad y_i = \gamma \hat x_i + \beta

$\gamma,\beta$ are learnable scale/shift, letting the network undo normalization if needed. At inference, use running averages of $\mu,\sigma^2$ collected during training. BatchNorm depends on batch statistics → awkward for sequences/small batches.
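A forward-only sketch of the training/inference split follows. The class name, the exponential running-average rule, and its `momentum` value are assumptions made for illustration; real framework layers differ in details such as how the running variance is estimated.

```python
import numpy as np

class BatchNorm1D:
    """Illustrative batch norm over the batch axis of a (B, features) array."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)   # batch statistics
            self.running_mean += self.momentum * (mu - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mu, var = self.running_mean, self.running_var   # frozen statistics
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
train_out = bn(batch, training=True)
eval_out = bn(batch[:1], training=False)   # a single example works at inference
print(train_out.mean(axis=0).round(6), train_out.var(axis=0).round(3))
```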

Layer Normalization

Normalizes across features within one example (not across the batch):

\mu = \frac{1}{H}\sum_{j=1}^{H} x_j, \quad \sigma^2 = \frac{1}{H}\sum_j (x_j-\mu)^2, \quad y_j = \gamma_j \frac{x_j-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta_j

Batch-independent → the choice for RNNs and Transformers. ([[04_transformers]] uses LayerNorm in every block.)
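The batch independence is easy to demonstrate: normalizing an example on its own gives the same result as normalizing it inside a larger batch. A minimal sketch, with an arbitrary feature size:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (one example) over its H features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
H = 8
gamma, beta = np.ones(H), np.zeros(H)
batch = rng.normal(size=(4, H))

full_batch_out = layer_norm(batch, gamma, beta)
single_out = layer_norm(batch[:1], gamma, beta)   # same example, batch of one

print(np.allclose(full_batch_out[0], single_out[0]))  # True
```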

1.10 Putting it together — a NumPy MLP (no frameworks)

```python
import numpy as np

def sigmoid(z): return 1/(1+np.exp(-z))
def dsigmoid(a): return a*(1-a)          # derivative in terms of activation a=σ(z)

# Tiny 2-2-1 net trained on XOR
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)   # (4,2)
Y = np.array([[0],[1],[1],[0]], dtype=float)            # (4,1)

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1, (2,2)); b1 = np.zeros((1,2))      # plain N(0,1) init; fine for a net this small
W2 = rng.normal(0, 1, (2,1)); b2 = np.zeros((1,1))
eta = 0.5

for epoch in range(10000):
    # ---- forward ----
    Z1 = X @ W1 + b1;  A1 = sigmoid(Z1)                 # (4,2)
    Z2 = A1 @ W2 + b2; A2 = sigmoid(Z2)                 # (4,1) = predictions
    loss = np.mean((A2 - Y)**2)

    # ---- backward (BP1-BP4) ----
    dA2   = 2*(A2 - Y)/len(X)                           # dL/dA2
    dZ2   = dA2 * dsigmoid(A2)                           # δ2 (BP1)
    dW2   = A1.T @ dZ2                                   # BP3
    db2   = dZ2.sum(0, keepdims=True)                    # BP4
    dA1   = dZ2 @ W2.T                                   # propagate
    dZ1   = dA1 * dsigmoid(A1)                           # δ1 (BP2)
    dW1   = X.T @ dZ1                                    # BP3
    db1   = dZ1.sum(0, keepdims=True)                    # BP4

    # ---- update ----
    W2 -= eta*dW2; b2 -= eta*db2
    W1 -= eta*dW1; b1 -= eta*db1

# Target is [0, 1, 1, 0]; a 2-2-1 net can stall in a local minimum for some seeds.
print("Predictions:", A2.ravel().round(3))
```

The same pattern (forward → δ at output → δ propagated → grads → update) reappears in every architecture in these notes. Frameworks (PyTorch/TF) just automate the chain rule via autograd (a computation graph that records operations and replays their derivatives in reverse).

The PyTorch equivalent (autograd does backprop for you)

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2,2), nn.Sigmoid(), nn.Linear(2,1), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), lr=0.05)
X = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
Y = torch.tensor([[0.],[1.],[1.],[0.]])
for _ in range(5000):
    pred = net(X)                      # forward pass
    loss = ((pred - Y)**2).mean()      # MSE, as in the NumPy version
    opt.zero_grad()                    # clear gradients left from the previous step
    loss.backward()                    # backward() = autograd backprop
    opt.step()                         # Adam update
```

1.11 Common pitfalls

Forgetting to zero gradients in PyTorch (they accumulate) → use opt.zero_grad().
Learning rate too high → loss NaN/explodes; too low → no progress. Start ~ $10^{-3}$ for Adam.
Vanishing gradients with deep sigmoid/tanh stacks → use ReLU/GELU + residual connections + normalization.
Data not normalized → unstable training. Standardize inputs to ~zero mean, unit variance.
Mismatched loss/output: use softmax+CE for multiclass, sigmoid+BCE for binary, linear+MSE for regression.
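For the normalization pitfall, a minimal standardization sketch follows. The two synthetic features (an income-like column and a small ratio) are invented for illustration. Note that the statistics are computed on the training split and then reused for validation data.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Compute per-feature mean and std on the training split only."""
    return X_train.mean(axis=0), X_train.std(axis=0) + eps

def standardize(X, mean, std):
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50_000.0, 0.5], scale=[15_000.0, 0.1], size=(1000, 2))
X_val = rng.normal(loc=[50_000.0, 0.5], scale=[15_000.0, 0.1], size=(200, 2))

mean, std = fit_standardizer(X_train)
X_train_std = standardize(X_train, mean, std)
X_val_std = standardize(X_val, mean, std)   # reuse training statistics

print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```

Reusing the training statistics keeps the validation and test sets honest: nothing about them leaks into preprocessing.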

Next: [[02_cnns]] applies these exact mechanics with weight-sharing for images.