Math Appendix — Knowledge — Devaraj Kudumula

The minimum linear algebra, calculus, and probability used across these notes — each with the concrete role it plays in deep learning.

10.1 Linear algebra

Vectors & dot product

For $\mathbf{a},\mathbf{b}\in\mathbb{R}^n$, the dot product is:

$$\mathbf{a}\cdot\mathbf{b} = \sum_i a_i b_i = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta$$

where $\theta$ is the angle between the two vectors.

Used in: every neuron ($\mathbf{w}^\top\mathbf{x}$), attention scores ($\mathbf{q}\cdot\mathbf{k}$, [[04_transformers]]), cosine similarity ([[06_rag]]). Geometric meaning: projection / similarity.
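As a small illustration of the formula above, here is a minimal pure-Python sketch of the dot product and the cosine similarity derived from it. The helper names (`dot`, `norm`, `cosine_similarity`) are chosen for this example only; real code would normally use NumPy or PyTorch.

```python
import math

def dot(a, b):
    """Sum of elementwise products; a and b must have equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """cos(theta) = a.b / (|a| |b|); assumes neither vector is all zeros."""
    return dot(a, b) / (norm(a) * norm(b))

a = [1.0, 0.0]
b = [1.0, 1.0]
print(dot(a, b))                # 1.0
print(cosine_similarity(a, b))  # about 0.7071, i.e. a 45 degree angle
```

Orthogonal vectors give a dot product of 0, and parallel vectors give a cosine similarity of 1, which is why the dot product works as a similarity score.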

Matrix multiplication

$(\mathbf{A}\mathbf{B})_{ij} = \sum_k A_{ik}B_{kj}$. Shapes must align: $(m\times n)(n\times p)=(m\times p)$. Used in: layer forward pass $\mathbf{W}\mathbf{x}$, attention scores $\mathbf{Q}\mathbf{K}^\top$. Not commutative in general ($\mathbf{AB}\ne\mathbf{BA}$).
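The index formula can be written out directly as loops. The sketch below uses nested lists and a hypothetical helper `matmul` purely to make the summation over $k$ and the shape rule visible; it is not how one would multiply matrices in practice.

```python
def matmul(A, B):
    """(AB)_ij = sum_k A_ik * B_kj for nested-list matrices."""
    m, n = len(A), len(A[0])
    if len(B) != n:
        raise ValueError(
            f"shape mismatch: ({m}x{n}) times ({len(B)}x{len(B[0])})"
        )
    p = len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
        for i in range(m)
    ]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(matmul(A, B))  # [[2, 1], [4, 3]]  (columns of A swapped)
print(matmul(B, A))  # [[3, 4], [1, 2]]  (rows of A swapped)
```

The two products differ, which is the non-commutativity noted above.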

Transpose, identity, inverse

$(\mathbf{A}^\top)_{ij}=A_{ji}$; $\mathbf{I}$ has 1s on the diagonal and 0s elsewhere; for a square, invertible $\mathbf{A}$, $\mathbf{A}\mathbf{A}^{-1}=\mathbf{A}^{-1}\mathbf{A}=\mathbf{I}$. Backprop's BP3 uses outer products and BP2 uses $\mathbf{W}^\top$ ([[01_deep_learning_foundations]]).

Norms

$$\|\mathbf{x}\|_2 = \sqrt{\textstyle\sum_i x_i^2}, \qquad \|\mathbf{x}\|_1 = \sum_i |x_i|$$

Used in: L2/L1 regularization, gradient clipping, normalizing embeddings.

Outer product

For $\mathbf{a}\in\mathbb{R}^m$ and $\mathbf{b}\in\mathbb{R}^n$, $\mathbf{a}\mathbf{b}^\top$ is an $m\times n$ matrix with entries $a_i b_j$. Appears in the weight gradient $\boldsymbol{\delta}\mathbf{a}^\top$ (BP3).
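A short NumPy sketch of the weight-gradient shape, assuming a linear layer $\mathbf{z}=\mathbf{W}\mathbf{a}$ with error signal $\boldsymbol{\delta}=\partial L/\partial\mathbf{z}$. The numbers are made up for illustration.

```python
import numpy as np

delta = np.array([0.5, -1.0])      # error signal at the layer output, shape (2,)
a = np.array([1.0, 2.0, 3.0])      # input activations, shape (3,)

# Outer product: entry (i, j) is delta[i] * a[j], same shape as W (2 x 3).
grad_W = np.outer(delta, a)
print(grad_W.shape)  # (2, 3)
print(grad_W)
```

The gradient has exactly the shape of the weight matrix, which is what makes the update $\mathbf{W}\leftarrow\mathbf{W}-\eta\,\boldsymbol{\delta}\mathbf{a}^\top$ well defined.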

Eigenvalues / eigenvectors

$\mathbf{A}\mathbf{v}=\lambda\mathbf{v}$. The largest $|\lambda|$ (the spectral radius) controls how repeated multiplication by $\mathbf{A}$ grows or shrinks vectors: above 1 they tend to blow up, below 1 they decay. This explains vanishing/exploding gradients in RNNs ([[03_rnn_lstm]]): the recurrent Jacobian's spectral radius decides stability.
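To see the effect numerically, the sketch below multiplies a vector by the same matrix 50 times, once with spectral radius below 1 and once above. The matrix `base` and the helper names are illustrative assumptions, standing in for a recurrent Jacobian.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def norm_after_steps(A, v, steps):
    """Norm of A^steps @ v, computed by repeated multiplication."""
    for _ in range(steps):
        v = A @ v
    return float(np.linalg.norm(v))

base = np.array([[0.5, 0.2],
                 [0.1, 0.4]])   # eigenvalues 0.6 and 0.3
v0 = np.array([1.0, 1.0])

for scale in (1.0, 3.0):
    A = scale * base
    print(spectral_radius(A), norm_after_steps(A, v0, 50))
```

With `scale = 1.0` the spectral radius is 0.6 and the vector shrinks toward zero; with `scale = 3.0` it is 1.8 and the norm grows by many orders of magnitude, mirroring vanishing and exploding gradients.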

Broadcasting

Operating on mismatched shapes by virtually expanding singleton dims (e.g. adding a bias vector to every row of a matrix). Ubiquitous in NumPy/PyTorch code.
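A minimal NumPy sketch of the bias example, plus the common case where a singleton axis has to be added by hand. Array values are arbitrary.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # shape (2, 3): 2 rows, 3 features
b = np.array([10.0, 20.0, 30.0])    # shape (3,): one bias per feature

# b is treated as shape (1, 3) and virtually repeated over both rows.
Y = X + b                           # rows become [11, 22, 33] and [14, 25, 36]

# Per-row scaling needs an explicit singleton axis: (2,) -> (2, 1).
row_scale = np.array([1.0, 0.5])
Z = X * row_scale[:, None]          # rows become [1, 2, 3] and [2, 2.5, 3]

print(Y)
print(Z)
```

Writing `X * row_scale` without `[:, None]` would raise a shape error here, because NumPy aligns shapes from the trailing dimension and `(2,)` does not match `(2, 3)`.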

10.2 Calculus

Derivative

Rate of change: $f'(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}$. The gradient is the multivariate generalization.

Gradient

For $f:\mathbb{R}^n\to\mathbb{R}$:

$$\nabla f = \left[\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right]^\top$$

Points in the direction of steepest ascent → gradient descent steps opposite to it ([[01_deep_learning_foundations]] §1.4).
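A minimal sketch of gradient descent on a toy function, with a central-difference check of the analytic gradient. The function $f(x_1,x_2)=x_1^2+3x_2^2$, the step size, and the helper names are assumptions made for this example.

```python
def f(x):
    """Simple bowl: f(x1, x2) = x1^2 + 3*x2^2, minimized at (0, 0)."""
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    """Analytic gradient: [2*x1, 6*x2]."""
    return [2.0 * x[0], 6.0 * x[1]]

def numerical_grad(func, x, h=1e-6):
    """Central-difference estimate of each partial derivative."""
    g = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        g.append((func(up) - func(dn)) / (2.0 * h))
    return g

start = [1.0, -2.0]
print(grad_f(start), numerical_grad(f, start))  # both close to [2, -12]

x = list(start)
eta = 0.1                       # learning rate
for _ in range(100):
    g = grad_f(x)
    x = [xi - eta * gi for xi, gi in zip(x, g)]  # step opposite the gradient
print(x, f(x))                  # both coordinates end up very close to 0
```

Each step shrinks $x_1$ by a factor of 0.8 and $x_2$ by 0.4, so the iterate converges to the minimum at the origin.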

Chain rule (the engine of backprop)

Scalar: $\frac{d}{dx}f(g(x))=f'(g(x))\,g'(x)$. Vector/multivariate, when $L$ depends on $\mathbf{x}$ through intermediate values $\mathbf{u}(\mathbf{x})$:

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial u_j}\frac{\partial u_j}{\partial x_i}$$

This is literally backpropagation ([[01_deep_learning_foundations]] §1.5): compose local derivatives from output back to input.
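A worked sketch for a single neuron, assuming the toy loss $L=(\sigma(wx+b)-y)^2$ (chosen for this example, not taken from the notes). The backward pass multiplies local derivatives from the output back to the parameters, and a finite-difference estimate confirms the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

def backward(w, b, x, y):
    """Return (dL/dw, dL/db) by chaining local derivatives."""
    # Forward pass, keeping intermediates.
    z = w * x + b
    a = sigmoid(z)
    # Backward pass: output to input.
    dL_da = 2.0 * (a - y)        # derivative of (a - y)^2
    da_dz = a * (1.0 - a)        # sigmoid'(z)
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz      # dz/dw = x, dz/db = 1

w, b, x, y = 0.5, -0.3, 2.0, 1.0
dw, db = backward(w, b, x, y)

h = 1e-6
dw_num = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2.0 * h)
db_num = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2.0 * h)
print(dw, dw_num)
print(db, db_num)
```

The analytic and numerical values agree to several decimal places, which is the standard "gradient check" used to debug hand-written backward passes.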

Jacobian & Hessian

Jacobian $\mathbf{J}\in\mathbb{R}^{m\times n}$, $J_{ij}=\partial f_i/\partial x_j$: the derivative of a vector function $f:\mathbb{R}^n\to\mathbb{R}^m$. Softmax's Jacobian and the RNN recurrence Jacobian are key examples.
Hessian $H_{ij}=\partial^2 f/\partial x_i\partial x_j$: curvature of a scalar function; underlies second-order methods, which are rarely used directly at scale. Adam's per-parameter step scaling is loosely related in spirit, but it uses running averages of squared gradients rather than the Hessian itself.
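As a concrete Jacobian, softmax has the closed form $J_{ij}=s_i(\mathbb{1}[i=j]-s_j)$ with $\mathbf{s}=\text{softmax}(\mathbf{z})$. The sketch below builds it in pure Python and checks one column against central differences; the helper names and test logits are illustrative.

```python
import math

def softmax(z):
    m = max(z)                                   # shift for numerical safety
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = d softmax_i / d z_j = s_i * (1[i == j] - s_j)."""
    s = softmax(z)
    n = len(s)
    return [
        [s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
        for i in range(n)
    ]

z = [1.0, 2.0, 0.5]
J = softmax_jacobian(z)

# Numerical check of column j: perturb z_j and watch every output move.
h = 1e-6
j = 1
z_up = list(z)
z_dn = list(z)
z_up[j] += h
z_dn[j] -= h
col_numeric = [(u - d) / (2.0 * h) for u, d in zip(softmax(z_up), softmax(z_dn))]
print([row[j] for row in J])
print(col_numeric)
```

Every row of this Jacobian sums to zero, reflecting that the outputs must keep summing to 1 however the logits move.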

Key derivatives used in these notes

$f(x)$	$f'(x)$
$\sigma(x)=\frac1{1+e^{-x}}$	$\sigma(x)(1-\sigma(x))$
$\tanh x$	$1-\tanh^2 x$
$\text{ReLU}(x)$	$\mathbb{1}[x>0]$ (undefined at $x=0$; usually set to 0)
$x^2$	$2x$
$\log x$	$1/x$
$e^x$	$e^x$

10.3 Probability & statistics

Random variables, expectation, variance

$$\mathbb{E}[X]=\sum_x x\,P(x), \qquad \text{Var}(X)=\mathbb{E}[(X-\mathbb{E}X)^2]=\mathbb{E}[X^2]-(\mathbb{E}X)^2$$

(Discrete case shown; for continuous $X$ the sum becomes an integral.)

Used in: weight init (keep variance stable, [[01_deep_learning_foundations]] §1.7), BatchNorm (normalize mean/variance), the $\sqrt{d_k}$ scaling in attention (controls score variance, [[04_transformers]]).
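The $\sqrt{d_k}$ scaling can be checked by simulation. The sketch below assumes query and key entries are independent standard normal values (an idealization, not a property of trained models): the raw dot product then has variance about $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def sample_dot(d):
    """Dot product of two random d-dimensional vectors with N(0, 1) entries."""
    q = [random.gauss(0.0, 1.0) for _ in range(d)]
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    return sum(qi * ki for qi, ki in zip(q, k))

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

d_k = 64
raw = [sample_dot(d_k) for _ in range(4000)]
scaled = [s / d_k ** 0.5 for s in raw]

print(variance(raw))     # close to d_k
print(variance(scaled))  # close to 1
```

Without the scaling, scores grow with the head dimension and push softmax into a saturated, low-gradient regime.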

Probability distributions

Bernoulli (binary outcome) → sigmoid output + BCE.
Categorical (one of $K$) → softmax output + cross-entropy.
Gaussian $\mathcal{N}(\mu,\sigma^2)$ → weight init, noise, GELU (uses normal CDF).

Softmax = a probability distribution

$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ maps any real vector to a distribution (positive, sums to 1). Foundation of classification and attention weights.

Maximum Likelihood Estimation (MLE)

Training a classifier by minimizing cross-entropy is maximizing the likelihood of the data:

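$$\arg\max_\theta \prod_i P_\theta(y_i\mid x_i) = \arg\min_\theta -\sum_i \log P_\theta(y_i\mid x_i)$$

The log is monotonic, so taking it does not move the optimum. The right side is exactly the cross-entropy loss with one-hot targets (summed over examples; in practice it is usually averaged). This is why cross-entropy is the principled loss for classification.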
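A tiny sketch of the equivalence, using the simplest possible "classifier": a single Bernoulli parameter $\theta$ that ignores the input. The data and the grid search are made up for illustration; the point is only that the likelihood maximizer and the negative-log-likelihood minimizer coincide.

```python
import math

ys = [1, 1, 0, 1, 0, 1, 1, 0]   # 5 ones out of 8 observations

def likelihood(theta):
    """Product of per-example probabilities under Bernoulli(theta)."""
    p = 1.0
    for y in ys:
        p *= theta if y == 1 else 1.0 - theta
    return p

def nll(theta):
    """Negative log-likelihood, i.e. summed binary cross-entropy."""
    return -sum(math.log(theta if y == 1 else 1.0 - theta) for y in ys)

grid = [i / 1000 for i in range(1, 1000)]   # candidate thetas in (0, 1)
best_lik = max(grid, key=likelihood)
best_nll = min(grid, key=nll)
print(best_lik, best_nll, sum(ys) / len(ys))  # 0.625 0.625 0.625
```

Both searches land on the empirical frequency of ones, which is the known closed-form MLE for a Bernoulli parameter.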

Entropy, cross-entropy, KL divergence

$$H(p)=-\sum_x p(x)\log p(x), \quad H(p,q)=-\sum_x p(x)\log q(x), \quad \text{KL}(p\,\|\,q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$$

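Cross-entropy $H(p,q)$ = the loss we minimize ($p$ = true labels, $q$ = predictions).
KL divergence = a "distance" of $q$ from $p$, though not a true metric: it is asymmetric, so $\text{KL}(p\,\|\,q)\ne\text{KL}(q\,\|\,p)$ in general. Note $H(p,q)=H(p)+\text{KL}(p\,\|\,q)$, so for fixed labels minimizing cross-entropy also minimizes the KL term. KL appears in RLHF's penalty keeping the policy near the reference model ([[05_architectures]] §5.5) and in contrastive/variational objectives.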
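The three quantities and the identity linking them are easy to verify numerically. The sketch below uses natural logs and two arbitrary distributions chosen for illustration; terms with $p(x)=0$ are skipped, following the usual convention $0\log 0=0$.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model prediction

print(cross_entropy(p, q), entropy(p) + kl(p, q))  # equal up to rounding
print(kl(p, q), kl(q, p))                          # different: KL is asymmetric

# With a one-hot label the entropy term is 0, so the loss is just
# minus the log-probability assigned to the correct class.
one_hot = [0.0, 1.0, 0.0]
print(cross_entropy(one_hot, q), -math.log(q[1]))
```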

Information & temperature

Temperature $T$ divides the logits before the softmax: $\text{softmax}(\mathbf{z}/T)$. Lower temperature → lower-entropy (sharper) softmax; higher → higher-entropy (more uniform). Controls exploration in sampling ([[04_transformers]] §4.8).
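A minimal sketch of the effect, assuming temperature is applied by dividing the logits by $T$ before the softmax. The logits and helper names are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(z / T), with max-subtraction for numerical safety."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

logits = [2.0, 1.0, 0.0]
entropies = {}
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    entropies[T] = entropy(probs)
    print(T, [round(p, 3) for p in probs], round(entropies[T], 3))
```

As $T$ increases the probabilities flatten and the entropy rises toward $\log 3$, the entropy of a uniform distribution over three outcomes.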

10.4 Numerical stability tricks (used everywhere in code)

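Log-sum-exp trick for softmax: subtract the max logit before exponentiating to avoid overflow. The result is unchanged because the shift cancels between numerator and denominator:

$$\text{softmax}(\mathbf{z})_i=\frac{e^{z_i - \max_j z_j}}{\sum_k e^{z_k-\max_j z_j}}$$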

$\log(x+\varepsilon)$ to avoid $\log 0$; dividing by $\sigma+\varepsilon$ (or $\sqrt{\sigma^2+\varepsilon}$) in normalization to avoid divide-by-zero.
Fused sigmoid+BCE / softmax+CE (BCEWithLogitsLoss, CrossEntropyLoss) are more stable than computing the pieces separately (and give the clean $\hat y - y$ gradient, [[01_deep_learning_foundations]]).
Gradient clipping caps $\|\nabla\|$ to prevent explosions ([[03_rnn_lstm]]).
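The first trick in the list can be demonstrated in a few lines of pure Python. The functions below are illustrative sketches, not the implementation used by any particular library; `log_softmax` shows the related log-sum-exp form that fused cross-entropy losses rely on.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]          # overflows for large logits
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]      # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(z):
    """log softmax via log-sum-exp: z_i - (m + log sum_k exp(z_k - m))."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

big = [1000.0, 1001.0, 1002.0]
try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(naive_overflowed)       # True: exp(1000) exceeds the float range
print(softmax_stable(big))    # same values as softmax_stable([0, 1, 2])
print(log_softmax(big))
```

Shifting all logits by a constant leaves the softmax unchanged, so the stable version returns the same distribution for `[1000, 1001, 1002]` as for `[0, 1, 2]`.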

10.5 Notation cheat-sheet

Symbol	Meaning
$x$, $\mathbf{x}$, $\mathbf{W}$	scalar, vector, matrix
$\hat y$ / $y$	prediction / ground truth
$\sigma$	activation (often sigmoid); also the standard deviation in $\mathcal{N}(\mu,\sigma^2)$ and in normalization
$\eta$	learning rate
$L$, $\nabla_\theta L$	loss, its gradient w.r.t. params
$\odot$	elementwise product
$\boldsymbol{\delta}^{(\ell)}$	error signal at layer $\ell$
$d$, $d_k$, $h$	model dim, head dim, #heads
$n$ / $T$	sequence length / time steps ($T$ also denotes sampling temperature)
$\mathbb{E}$, $\text{Var}$	expectation, variance
