
Math Appendix

Linear algebra, calculus, and probability refreshers used everywhere above.

The minimum linear algebra, calculus, and probability used across these notes — each with the concrete role it plays in deep learning.


10.1 Linear algebra

Vectors & dot product

$\mathbf{a},\mathbf{b}\in\mathbb{R}^n$. Dot product:

$$\mathbf{a}\cdot\mathbf{b} = \sum_i a_i b_i = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta$$

Used in: every neuron ($\mathbf{w}^\top\mathbf{x}$), attention scores ($\mathbf{q}\cdot\mathbf{k}$, [[04_transformers]]), cosine similarity ([[06_rag]]). Geometric meaning: projection / similarity.
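As a small illustration of the identity above, the following sketch computes a dot product, the cosine similarity derived from it, and a scalar projection. The vectors are arbitrary example values, not taken from these notes.

```python
import numpy as np

# Arbitrary example vectors (illustrative values only).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

# Dot product as a sum of elementwise products: 1*4 + 2*(-5) + 3*6.
dot = float(np.dot(a, b))

# Cosine similarity: the dot product divided by the two L2 norms.
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Scalar projection of a onto the direction of b.
proj_a_on_b = dot / np.linalg.norm(b)
```

Cosine similarity depends only on the angle between the vectors, which is why it is a common choice for comparing embeddings of different magnitudes.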

Matrix multiplication

$(\mathbf{A}\mathbf{B})_{ij} = \sum_k A_{ik}B_{kj}$. Shapes must align: $(m\times n)(n\times p)=(m\times p)$. Used in: layer forward pass $\mathbf{W}\mathbf{x}$, $\mathbf{Q}\mathbf{K}^\top$. Not commutative ($\mathbf{AB}\ne\mathbf{BA}$).
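A minimal NumPy sketch of the shape rule and of non-commutativity. The matrices are made-up examples chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)    # shape (2, 3)
B = np.arange(12.0).reshape(3, 4)   # shape (3, 4)
C = A @ B                           # inner dims match (3 and 3), result is (2, 4)

# Entry (0, 1) written out as the sum over k of A[0, k] * B[k, 1].
c01 = sum(A[0, k] * B[k, 1] for k in range(3))

# Two square matrices whose products differ depending on the order.
P = np.array([[1.0, 2.0], [0.0, 1.0]])
Q = np.array([[1.0, 0.0], [3.0, 1.0]])
PQ = P @ Q
QP = Q @ P
```

Trying `B @ A` here would raise an error, because the inner dimensions (4 and 2) do not match.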

Transpose, identity, inverse

$(\mathbf{A}^\top)_{ij}=A_{ji}$; $\mathbf{I}$ has 1s on the diagonal and 0s elsewhere; $\mathbf{A}\mathbf{A}^{-1}=\mathbf{I}$ when $\mathbf{A}$ is square and invertible. Backprop's BP3 uses outer products and BP2 uses $\mathbf{W}^\top$ ([[01_deep_learning_foundations]]).

Norms

$$\|\mathbf{x}\|_2 = \sqrt{\textstyle\sum_i x_i^2}, \qquad \|\mathbf{x}\|_1 = \sum_i |x_i|$$

Used in: L2/L1 regularization, gradient clipping, normalizing embeddings.
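The two norms and the embedding-normalization use case can be sketched as follows; the vector is an arbitrary example.

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = float(np.linalg.norm(x))           # sqrt(3^2 + 4^2)
l1 = float(np.linalg.norm(x, ord=1))    # |3| + |-4|

# Normalizing an embedding: rescale to unit L2 length.
unit = x / l2
```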

Outer product

For $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n$, $\mathbf{a}\mathbf{b}^\top$ gives an $m\times n$ matrix with entries $a_i b_j$. Appears in the weight-gradient $\boldsymbol{\delta}\mathbf{a}^\top$ (BP3).
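A short sketch of the outer product in its weight-gradient role. It assumes a layer with 3 inputs and 2 outputs, so the weight matrix would have shape (2, 3); the numbers for the error signal and the activations are invented for illustration.

```python
import numpy as np

delta = np.array([0.5, -1.0])        # hypothetical error signal, one entry per output unit
a_prev = np.array([1.0, 2.0, 3.0])   # hypothetical activations feeding the layer

# Outer product: entry (i, j) is delta[i] * a_prev[j].
grad_W = np.outer(delta, a_prev)     # shape (2, 3), same as the assumed weight matrix
```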

Eigenvalues / eigenvectors

$\mathbf{A}\mathbf{v}=\lambda\mathbf{v}$. The largest $|\lambda|$ (the spectral radius) controls how repeated multiplication grows/shrinks vectors → explains vanishing/exploding gradients in RNNs ([[03_rnn_lstm]]): the recurrent Jacobian's spectral radius decides stability.
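To make the growth/shrinkage claim concrete, the sketch below multiplies a vector by a matrix 50 times for two made-up matrices, one with spectral radius below 1 and one above. The helper name `spectral_radius` is introduced here for illustration.

```python
import numpy as np

def spectral_radius(A):
    """Largest absolute eigenvalue of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

# Triangular matrices, so the eigenvalues are the diagonal entries.
A_shrink = np.array([[0.5, 0.1], [0.0, 0.8]])   # eigenvalues 0.5 and 0.8
A_grow = np.array([[1.1, 0.0], [0.2, 0.9]])     # eigenvalues 1.1 and 0.9

v = np.array([1.0, 1.0])

# Apply each matrix 50 times, as a recurrence would over 50 time steps.
norm_shrink = float(np.linalg.norm(np.linalg.matrix_power(A_shrink, 50) @ v))
norm_grow = float(np.linalg.norm(np.linalg.matrix_power(A_grow, 50) @ v))
```

With radius 0.8 the vector all but vanishes after 50 steps, while with radius 1.1 its norm grows past 100, which mirrors the vanishing and exploding regimes.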

Broadcasting

Operating on mismatched shapes by virtually expanding singleton dims (e.g. adding a bias vector to every row of a matrix). Ubiquitous in NumPy/PyTorch code.
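The bias example can be sketched in NumPy as follows; the array contents are arbitrary.

```python
import numpy as np

X = np.arange(12.0).reshape(4, 3)   # 4 rows, 3 features
b = np.array([1.0, 2.0, 3.0])       # shape (3,): one bias per feature

# b is treated as shape (1, 3) and virtually repeated over the 4 rows.
Y = X + b

# A column of shape (4, 1) is virtually repeated over the 3 columns instead.
col = np.array([[10.0], [20.0], [30.0], [40.0]])
Z = X + col
```

No copies of `b` or `col` are materialized; the expansion is virtual, which is what makes broadcasting cheap.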


10.2 Calculus

Derivative

Rate of change: $f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$. The gradient is the multivariate generalization.

Gradient

For $f:\mathbb{R}^n\to\mathbb{R}$:

$$\nabla f = \left[\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right]^\top$$

Points in the direction of steepest ascent → gradient descent steps opposite to it ([[01_deep_learning_foundations]] §1.4).
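A minimal sketch tying the limit definition to the gradient: it approximates each partial derivative with a central difference and then takes one gradient descent step. The function $f(\mathbf{x})=\sum_i x_i^2$, the helper `numerical_grad`, and the learning rate are illustrative choices, not part of these notes.

```python
import numpy as np

def f(x):
    """Example function: sum of squares, whose exact gradient is 2x."""
    return float(np.sum(x ** 2))

def numerical_grad(func, x, h=1e-5):
    """Approximate each partial derivative with a central difference."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 0.5])
g = numerical_grad(f, x)

# One gradient descent step: move against the gradient.
eta = 0.1                 # assumed learning rate
x_new = x - eta * g
```

Stepping against the gradient lowers `f`; stepping along it would raise `f`, which is the steepest-ascent property in action.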

Chain rule (the engine of backprop)

Scalar: $\frac{d}{dx}f(g(x))=f'(g(x))\,g'(x)$. Vector/multivariate:

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial u_j}\frac{\partial u_j}{\partial x_i}$$

This is literally backpropagation ([[01_deep_learning_foundations]] §1.5): compose local derivatives from output back to input.
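The multivariate formula can be checked on a toy case. The sketch below assumes $\mathbf{u}=\mathbf{W}\mathbf{x}$ and $L=\tfrac12\sum_j u_j^2$ (both chosen only for illustration), so that $\partial L/\partial u_j=u_j$ and $\partial u_j/\partial x_i=W_{ji}$, and the sum over $j$ becomes $\mathbf{W}^\top\mathbf{u}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # u = W x maps R^4 to R^3
x = rng.normal(size=4)

def loss(x_in):
    """Toy loss: L = 0.5 * sum_j u_j^2 with u = W x."""
    u_val = W @ x_in
    return 0.5 * float(np.sum(u_val ** 2))

# Backward pass via the chain rule.
u = W @ x
dL_du = u                     # local derivative dL/du_j = u_j
grad_x = W.T @ dL_du          # sum over j of dL/du_j * W[j, i]

# Independent check with central finite differences.
h = 1e-6
grad_num = np.array([
    (loss(x + h * e) - loss(x - h * e)) / (2 * h)
    for e in np.eye(4)
])
```

The transpose appears because the sum runs over the output index $j$, which is the same reason $\mathbf{W}^\top$ shows up when errors are propagated backward through a layer.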

Jacobian & Hessian

  • Jacobian $\mathbf{J}\in\mathbb{R}^{m\times n}$, $J_{ij}=\partial f_i/\partial x_j$: the derivative of a vector function $f:\mathbb{R}^n\to\mathbb{R}^m$. Softmax's Jacobian and the RNN recurrence Jacobian are key examples.
  • Hessian $H_{ij}=\partial^2 f/\partial x_i\partial x_j$: curvature; underlies second-order methods (rarely used directly at scale). Adam's per-parameter step scaling is sometimes loosely motivated as a cheap curvature proxy, although it is built from squared gradients rather than second derivatives.
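As a worked instance of a Jacobian, the sketch below uses the standard closed form for softmax, $\partial s_i/\partial z_j = s_i(\delta_{ij}-s_j)$, i.e. $\mathbf{J}=\operatorname{diag}(\mathbf{s})-\mathbf{s}\mathbf{s}^\top$, and compares it with finite differences. The logits and helper names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical safety
    return e / e.sum()

def softmax_jacobian(z):
    """Closed form: J[i, j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# Column j of the Jacobian, approximated by perturbing z_j.
h = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    step = np.zeros(3)
    step[j] = h
    J_num[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * h)
```

Each column of `J` sums to zero, reflecting that the softmax outputs always sum to 1 no matter how a logit changes.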

Key derivatives used in these notes

| $f(x)$ | $f'(x)$ |
| --- | --- |
| $\sigma(x)=\frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ |
| $\tanh x$ | $1-\tanh^2 x$ |
| $\text{ReLU}(x)$ | $\mathbb{1}[x>0]$ |
| $x^2$ | $2x$ |
| $\log x$ | $1/x$ |
| $e^x$ | $e^x$ |
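The entries in this table can be sanity-checked numerically. The sketch below compares each closed-form derivative with a central difference at one arbitrary test point ($x_0=0.7$, chosen away from the ReLU kink at 0); the dictionary layout and helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# name -> (function, claimed derivative)
table = {
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    "relu": (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
    "square": (lambda x: x ** 2, lambda x: 2.0 * x),
    "log": (math.log, lambda x: 1.0 / x),
    "exp": (math.exp, math.exp),
}

def central_diff(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 0.7
errors = {
    name: abs(central_diff(func, x0) - dfunc(x0))
    for name, (func, dfunc) in table.items()
}
```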

10.3 Probability & statistics

Random variables, expectation, variance

$$\mathbb{E}[X]=\sum_x x\,P(x), \qquad \text{Var}(X)=\mathbb{E}[(X-\mathbb{E}X)^2]=\mathbb{E}[X^2]-(\mathbb{E}X)^2$$

Used in: weight init (keep variance stable, [[01_deep_learning_foundations]] §1.7), BatchNorm (normalize mean/variance), the $1/\sqrt{d_k}$ scaling in attention (controls score variance, [[04_transformers]]).
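The attention-scaling point can be illustrated with a small simulation. It assumes query and key entries are independent standard normals (an idealization), in which case the dot product of two $d_k$-dimensional vectors has variance $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 20_000                      # number of sampled (q, k) pairs

q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

scores = np.sum(q * k, axis=1)  # n raw dot products
scaled = scores / np.sqrt(d_k)  # attention-style scaling

var_raw = float(np.var(scores))      # expected to be close to d_k
var_scaled = float(np.var(scaled))   # expected to be close to 1
```

Without the scaling, larger head dimensions would push scores to larger magnitudes and the softmax toward saturation.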

Probability distributions

  • Bernoulli (binary outcome) → sigmoid output + BCE.
  • Categorical (one of $K$) → softmax output + cross-entropy.
  • Gaussian $\mathcal{N}(\mu,\sigma^2)$ → weight init, noise, GELU (uses normal CDF).

Softmax = a probability distribution

$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ maps any real vector to a distribution (positive, sums to 1). Foundation of classification and attention weights.

Maximum Likelihood Estimation (MLE)

Training a classifier by minimizing cross-entropy is maximizing the likelihood of the data:

$$\arg\max_\theta \prod_i P_\theta(y_i\mid x_i) = \arg\min_\theta -\sum_i \log P_\theta(y_i\mid x_i)$$

The right side is exactly the cross-entropy loss. This is why cross-entropy is the principled loss for classification.
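A tiny numerical illustration of this equivalence. The two lists stand for the probabilities that two hypothetical parameter settings assign to the correct labels of three examples; the numbers are invented.

```python
import math

# Probability assigned to the true label of each example (made-up values).
p_correct_a = [0.9, 0.8, 0.7]   # hypothetical parameter setting A
p_correct_b = [0.6, 0.5, 0.9]   # hypothetical parameter setting B

def likelihood(ps):
    """Product of per-example probabilities."""
    return math.prod(ps)

def nll(ps):
    """Negative log-likelihood: the summed cross-entropy for one-hot labels."""
    return -sum(math.log(p) for p in ps)

lik_a, lik_b = likelihood(p_correct_a), likelihood(p_correct_b)
nll_a, nll_b = nll(p_correct_a), nll(p_correct_b)
```

Because the logarithm is monotonic, the setting with the higher likelihood is always the one with the lower negative log-likelihood. In practice the sum of logs is preferred because a product of many small probabilities underflows.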

Entropy, cross-entropy, KL divergence

$$H(p)=-\sum_x p(x)\log p(x), \quad H(p,q)=-\sum_x p(x)\log q(x), \quad \text{KL}(p\,\|\,q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$$
  • Cross-entropy $H(p,q)$ = the loss we minimize ($p$ = true labels, $q$ = predictions).
  • KL divergence = an asymmetric "distance" of $q$ from $p$ (in general $\text{KL}(p\,\|\,q)\ne\text{KL}(q\,\|\,p)$); note $H(p,q)=H(p)+\text{KL}(p\,\|\,q)$, so for fixed labels $p$, minimizing cross-entropy is the same as minimizing the KL term. KL appears in RLHF's penalty keeping the policy near the reference model ([[05_architectures]] §5.5) and in contrastive/variational objectives.
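The three quantities and the identity linking them can be checked directly. This sketch assumes strictly positive probabilities, so it does not handle the $0\log 0$ case; the distributions are arbitrary examples.

```python
import numpy as np

# Example distributions over three outcomes (all entries strictly positive).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

def entropy(p_dist):
    return float(-np.sum(p_dist * np.log(p_dist)))

def cross_entropy(p_dist, q_dist):
    return float(-np.sum(p_dist * np.log(q_dist)))

def kl(p_dist, q_dist):
    return float(np.sum(p_dist * np.log(p_dist / q_dist)))

H_p = entropy(p)
H_pq = cross_entropy(p, q)
KL_pq = kl(p, q)
KL_qp = kl(q, p)   # differs from KL_pq: KL is not symmetric
```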

Information & temperature

Lower temperature → lower-entropy (sharper) softmax; higher → higher-entropy (more uniform). Controls exploration in sampling ([[04_transformers]] §4.8).
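A brief sketch of the temperature effect, assuming the common convention of dividing the logits by $T$ before the softmax. The logits and temperatures are arbitrary.

```python
import numpy as np

def softmax_with_temperature(z, T):
    """Softmax of z / T (assumed convention), with a max shift for stability."""
    scaled = z / T
    e = np.exp(scaled - np.max(scaled))
    return e / e.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p)))

z = np.array([2.0, 1.0, 0.1])

H_low = entropy(softmax_with_temperature(z, 0.5))    # sharper distribution
H_mid = entropy(softmax_with_temperature(z, 1.0))
H_high = entropy(softmax_with_temperature(z, 5.0))   # closer to uniform
```

As $T$ grows the distribution approaches uniform, whose entropy $\log 3$ is the maximum for three outcomes.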


10.4 Numerical stability tricks (used everywhere in code)

  • Max-subtraction (the log-sum-exp trick) for softmax: subtract the max logit before exponentiating to avoid overflow; the result is unchanged because the shift cancels between numerator and denominator:
$$\text{softmax}(\mathbf{z})_i=\frac{e^{z_i - \max_j z_j}}{\sum_k e^{z_k-\max_j z_j}}$$
  • $\log(x+\varepsilon)$ to avoid $\log 0$; dividing by $(\sigma+\varepsilon)$ in normalization to avoid divide-by-zero.
  • Fused sigmoid+BCE / softmax+CE (BCEWithLogitsLoss, CrossEntropyLoss) are more stable than computing the pieces separately (and give the clean $\hat y - y$ gradient with respect to the logits, [[01_deep_learning_foundations]]).
  • Gradient clipping caps the gradient norm $\|\nabla_\theta L\|$ to prevent explosions ([[03_rnn_lstm]]).
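The tricks in this list can be sketched in plain NumPy. The function names below are illustrative and are not the PyTorch implementations mentioned above; `bce_with_logits` uses the standard rearrangement $\max(x,0)-xy+\log(1+e^{-|x|})$ for a single logit $x$ and label $y$.

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                # overflows for large logits
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - np.max(z))    # largest exponent is 0, so no overflow
    return e / e.sum()

def logsumexp(z):
    m = np.max(z)
    return float(m + np.log(np.sum(np.exp(z - m))))

def bce_with_logits(x, y):
    """Binary cross-entropy computed from the logit, never forming log(0)."""
    return float(max(x, 0.0) - x * y + np.log1p(np.exp(-abs(x))))

def clip_by_norm(g, max_norm, eps=1e-6):
    """Rescale g so its L2 norm is at most max_norm; small g is unchanged."""
    norm = np.linalg.norm(g)
    scale = min(1.0, max_norm / (norm + eps))
    return g * scale

z = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(z)     # exp gives inf, and inf / inf gives nan
stable = stable_softmax(z)

big_grad = np.array([30.0, 40.0])          # L2 norm 50
clipped = clip_by_norm(big_grad, max_norm=5.0)
```

The stable version gives the same probabilities as softmax of `[0, 1, 2]`, since only the differences between logits matter.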

10.5 Notation cheat-sheet

| Symbol | Meaning |
| --- | --- |
| $x$, $\mathbf{x}$, $\mathbf{W}$ | scalar, vector, matrix |
| $\hat y$ / $y$ | prediction / ground truth |
| $\sigma$ | activation (often sigmoid) |
| $\eta$ | learning rate |
| $L$, $\nabla_\theta L$ | loss, its gradient w.r.t. params |
| $\odot$ | elementwise product |
| $\boldsymbol{\delta}^{(\ell)}$ | error signal at layer $\ell$ |
| $d$, $d_k$, $h$ | model dim, head dim, #heads |
| $n$ / $T$ | sequence length / time steps |
| $\mathbb{E}$, $\text{Var}$ | expectation, variance |

Return to the index to jump to any module.