The minimum linear algebra, calculus, and probability used across these notes — each with the concrete role it plays in deep learning.
10.1 Linear algebra
Vectors & dot product
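$\mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^n$. Dot product: $\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$.
Used in: every neuron ($\mathbf{w}^\top \mathbf{x} + b$), attention scores ($\mathbf{q}^\top \mathbf{k}$, [[04_transformers]]), cosine similarity ([[06_rag]]). Geometric meaning: projection / similarity.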
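A minimal NumPy sketch of the three uses listed above. The vectors, weights, and bias are made-up illustrative values, not taken from these notes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Dot product: sum of elementwise products, 1*4 + 2*5 + 3*6 = 32.
dot = float(x @ y)

# Cosine similarity: the dot product after scaling both vectors to unit length.
cosine = dot / (np.linalg.norm(x) * np.linalg.norm(y))

# A single neuron's pre-activation: weighted sum of inputs plus a bias.
w = np.array([0.5, -1.0, 0.25])   # illustrative weights
b = 0.1                           # illustrative bias
pre_activation = float(w @ x + b)  # 0.5 - 2.0 + 0.75 + 0.1 = -0.65
```

Cosine similarity ignores vector length, which is why it is a common choice for comparing embeddings.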
Matrix multiplication
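$(AB)_{ij} = \sum_k A_{ik} B_{kj}$. Shapes must align: $(m \times n)(n \times p) \to (m \times p)$. Used in: layer forward pass $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, attention scores $QK^\top$. Not commutative ($AB \neq BA$ in general).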
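The shape rule and non-commutativity can be checked directly. This sketch assumes the batch-as-rows convention (`X @ W`), which is common in NumPy/PyTorch code; the array values are arbitrary.

```python
import numpy as np

# A batch of 4 inputs with 3 features each, and a layer mapping 3 -> 2 features.
X = np.arange(12, dtype=float).reshape(4, 3)   # shape (4, 3)
W = np.ones((3, 2))                            # shape (3, 2)
b = np.zeros(2)

Z = X @ W + b   # (4, 3) @ (3, 2) -> (4, 2)

# Swapping the operands breaks the shape rule: (3, 2) @ (4, 3) has inner dims 2 != 4.
try:
    W @ X
    swapped_ok = True
except ValueError:
    swapped_ok = False

# Even for square matrices, where both orders are defined, the order matters.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
commutes = bool(np.allclose(A @ B, B @ A))
```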
Transpose, identity, inverse
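$(A^\top)_{ij} = A_{ji}$; $I$ has 1s on the diagonal and 0s elsewhere; $A A^{-1} = A^{-1} A = I$ (when the inverse exists). Backprop's BP3 uses outer products and BP2 uses the transposed weights $W^\top$ ([[01_deep_learning_foundations]]).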
Norms
$\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$ (L2), $\|\mathbf{x}\|_1 = \sum_i |x_i|$ (L1).
Used in: L2/L1 regularization, gradient clipping, normalizing embeddings.
Outer product
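$\mathbf{u}\mathbf{v}^\top$ with $\mathbf{u} \in \mathbb{R}^m$, $\mathbf{v} \in \mathbb{R}^n$ gives an $m \times n$ matrix with entries $u_i v_j$. Appears in the weight-gradient (BP3).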
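As a sketch of why the outer product shows up in the weight gradient: assuming the usual form in which the gradient is the layer's error signal times the incoming activations, the result has the same shape as the weight matrix. The names `delta` and `a_prev` and their values are illustrative.

```python
import numpy as np

delta = np.array([0.1, -0.2])        # error signal for a layer with 2 units (made-up values)
a_prev = np.array([1.0, 2.0, 3.0])   # activations feeding that layer (3 units)

# Outer product: entry (i, j) is delta[i] * a_prev[j], so the result is 2 x 3,
# the same shape as a weight matrix mapping 3 inputs to 2 outputs.
grad_W = np.outer(delta, a_prev)
```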
Eigenvalues / eigenvectors
$A\mathbf{v} = \lambda\mathbf{v}$. The largest $|\lambda|$ controls how repeated multiplication grows/shrinks vectors → explains vanishing/exploding gradients in RNNs ([[03_rnn_lstm]]): the recurrent Jacobian's spectral radius decides stability.
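A small experiment makes the stability claim concrete. The two diagonal matrices below are toy stand-ins for a recurrent Jacobian, chosen so that the spectral radius is easy to read off; the helper names are hypothetical.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue magnitude of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def norm_after_repeated_multiplication(A, v, steps):
    """Return the norm of A^steps @ v, applying A one step at a time."""
    for _ in range(steps):
        v = A @ v
    return float(np.linalg.norm(v))

v0 = np.array([1.0, 1.0])
shrinking = np.array([[0.5, 0.0], [0.0, 0.9]])   # spectral radius 0.9
growing = np.array([[1.1, 0.0], [0.0, 0.5]])     # spectral radius 1.1

small = norm_after_repeated_multiplication(shrinking, v0, 100)  # decays toward 0
large = norm_after_repeated_multiplication(growing, v0, 100)    # blows up
```

With a spectral radius below 1 the signal vanishes over 100 steps; above 1 it explodes.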
Broadcasting
Operating on mismatched shapes by virtually expanding singleton dims (e.g. adding a bias vector to every row of a matrix). Ubiquitous in NumPy/PyTorch code.
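A short NumPy sketch of the bias example, plus a second case where the singleton dimension has to be kept explicitly. Shapes and values are arbitrary.

```python
import numpy as np

Z = np.zeros((4, 3))                # e.g. 4 examples, 3 units
bias = np.array([1.0, 2.0, 3.0])    # shape (3,), treated as (1, 3)

out = Z + bias                      # bias is virtually repeated for each of the 4 rows

# Subtracting a per-row statistic needs an explicit singleton dim: (4, 3) - (4, 1).
X = np.arange(12, dtype=float).reshape(4, 3)
row_mean = X.mean(axis=1, keepdims=True)   # shape (4, 1)
centered = X - row_mean
```

Without `keepdims=True` the mean has shape `(4,)`, which does not broadcast against `(4, 3)` and raises an error.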
10.2 Calculus
Derivative
Rate of change: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$. The gradient is the multivariate generalization.
Gradient
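For $f: \mathbb{R}^n \to \mathbb{R}$:
$$\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right]^\top$$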
Points in the direction of steepest ascent → gradient descent steps opposite to it ([[01_deep_learning_foundations]] §1.4).
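Stepping against the gradient can be sketched on a toy quadratic. The function, starting point, and learning rate are illustrative choices, not values from these notes.

```python
import numpy as np

def f(x):
    """A simple bowl: f(x) = x0^2 + 3 * x1^2, minimized at the origin."""
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    """Analytic gradient: the vector of partial derivatives."""
    return np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([2.0, -1.0])
lr = 0.1                       # learning rate (illustrative value)
start_value = f(x)             # 4 + 3 = 7
for _ in range(100):
    x = x - lr * grad_f(x)     # step opposite to the gradient
end_value = f(x)
```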
Chain rule (the engine of backprop)
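Scalar: $\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$. Vector/multivariate:
$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$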
This is literally backpropagation ([[01_deep_learning_foundations]] §1.5): compose local derivatives from output back to input.
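To see the composition of local derivatives in action, here is a one-neuron sketch checked against a finite difference. The squared-error loss and all numeric values are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x=1.5, b=0.2, y=1.0):
    """Forward pass: z = w*x + b, a = sigmoid(z), L = 0.5 * (a - y)^2."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def dloss_dw(w, x=1.5, b=0.2, y=1.0):
    """Backward pass: multiply local derivatives from the loss back to w."""
    a = sigmoid(w * x + b)
    dL_da = a - y            # derivative of 0.5 * (a - y)^2 w.r.t. a
    da_dz = a * (1.0 - a)    # derivative of sigmoid w.r.t. z
    dz_dw = x                # derivative of w*x + b w.r.t. w
    return dL_da * da_dz * dz_dw

w = 0.7
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)   # central finite difference
analytic = dloss_dw(w)
```

Comparing an analytic gradient to a finite difference like this is the standard "gradient check" for hand-written backward passes.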
Jacobian & Hessian
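- Jacobian $J \in \mathbb{R}^{m \times n}$, $J_{ij} = \partial f_i / \partial x_j$: the derivative of a vector function $f: \mathbb{R}^n \to \mathbb{R}^m$. Softmax's Jacobian and the RNN recurrence Jacobian are key examples.
- Hessian $H_{ij} = \partial^2 f / \partial x_i \partial x_j$: curvature; underlies second-order methods (rarely used directly at scale; Adam's per-parameter step sizes are only loosely related, since they rescale by running gradient magnitudes rather than by true curvature).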
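The softmax Jacobian mentioned above has the closed form $J_{ij} = p_i(\delta_{ij} - p_j)$. The sketch below builds it and compares it with a finite-difference Jacobian; the helper names and test vector are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = d softmax_i / d z_j = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def numerical_jacobian(f, z, eps=1e-6):
    """Central finite differences, one input coordinate (column) at a time."""
    cols = []
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = eps
        cols.append((f(z + step) - f(z - step)) / (2 * eps))
    return np.stack(cols, axis=1)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)
J_num = numerical_jacobian(softmax, z)
```

Each column of `J` sums to zero because the outputs always sum to 1, so any change in one logit must be redistributed among the probabilities.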
Key derivatives used in these notes
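As a sketch, and assuming the derivatives in question include sigmoid, tanh, and the fused softmax plus cross-entropy gradient referred to elsewhere in these notes, each closed form can be checked numerically:

```python
import numpy as np

def numeric_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x) for scalar x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

z = 0.3
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
sigmoid_ok = bool(np.isclose(numeric_derivative(sigmoid, z), sigmoid(z) * (1 - sigmoid(z))))
# tanh'(z) = 1 - tanh(z)^2
tanh_ok = bool(np.isclose(numeric_derivative(np.tanh, z), 1 - np.tanh(z) ** 2))

# Softmax + cross-entropy: dL/dlogits = p - y for a one-hot target y.
logits = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 0.0, 1.0])

def ce_loss(v):
    return -np.sum(y * np.log(softmax(v)))

analytic = softmax(logits) - y
numeric = np.array([
    numeric_derivative(lambda t, i=i: ce_loss(logits + t * np.eye(3)[i]), 0.0)
    for i in range(3)
])
```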
10.3 Probability & statistics
Random variables, expectation, variance
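$\mathbb{E}[X] = \sum_x x\,p(x)$ (discrete case); $\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$.
Used in: weight init (keep variance stable, [[01_deep_learning_foundations]] §1.7), BatchNorm (normalize mean/variance), the $1/\sqrt{d_k}$ scaling in attention (controls score variance, [[04_transformers]]).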
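The variance argument for attention scaling can be checked empirically. This sketch assumes the scaling meant here is the usual division by $\sqrt{d_k}$, and idealizes query/key entries as independent with mean 0 and variance 1; the dimension and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_samples = 20000

# Query/key entries drawn i.i.d. with mean 0 and variance 1 (an idealized assumption).
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw_scores = np.sum(q * k, axis=1)          # one dot product per row
scaled_scores = raw_scores / np.sqrt(d_k)

raw_var = float(raw_scores.var())        # close to d_k
scaled_var = float(scaled_scores.var())  # close to 1
```

A sum of $d_k$ independent unit-variance terms has variance $d_k$, so unscaled scores grow with the head dimension and would push softmax into its saturated regime.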
Probability distributions
- Bernoulli (binary outcome) → sigmoid output + BCE.
- Categorical (one of $K$ classes) → softmax output + cross-entropy.
- Gaussian → weight init, noise, GELU (uses normal CDF).
Softmax = a probability distribution
$\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ maps any real vector to a distribution (positive, sums to 1). Foundation of classification and attention weights.
Maximum Likelihood Estimation (MLE)
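Training a classifier by minimizing cross-entropy is maximizing the likelihood of the data:
$$\hat{\theta} = \arg\max_\theta \prod_i p_\theta(y_i \mid x_i) = \arg\min_\theta \left(-\sum_i \log p_\theta(y_i \mid x_i)\right)$$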
The right side is exactly the cross-entropy loss. This is why cross-entropy is the principled loss for classification.
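The equivalence is easy to confirm numerically: the negative log of the data likelihood equals the cross-entropy against one-hot targets, summed over examples. The probabilities and labels below are made up.

```python
import numpy as np

# Predicted class probabilities for 3 examples over 3 classes (made-up numbers).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
labels = np.array([0, 1, 2])   # true class index per example

# Likelihood of the labels: product of the probabilities assigned to the true classes.
likelihood = float(np.prod(probs[np.arange(3), labels]))   # 0.7 * 0.8 * 0.4
neg_log_likelihood = -np.log(likelihood)

# Cross-entropy against one-hot targets, summed over examples.
one_hot = np.eye(3)[labels]
cross_entropy = float(-np.sum(one_hot * np.log(probs)))
```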
Entropy, cross-entropy, KL divergence
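- Entropy $H(p) = -\sum_x p(x)\log p(x)$ = the uncertainty in $p$.
- Cross-entropy $H(p, q) = -\sum_x p(x)\log q(x)$ = the loss we minimize ($p$ = true labels, $q$ = predictions).
- KL divergence $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = H(p, q) - H(p)$ = "distance" from $p$ to $q$; note $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$ in general, so it is not a true metric. KL appears in RLHF's penalty keeping the policy near the reference model ([[05_architectures]] §5.5) and in contrastive/variational objectives.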
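A small sketch of the three quantities on discrete distributions, showing the asymmetry of KL and the identity cross-entropy = entropy + KL. The distributions are illustrative and assumed strictly positive, so no zero-guarding is done.

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl_pq = kl_divergence(p, q)   # divergence of q from p
kl_qp = kl_divergence(q, p)   # not the same number
```

Because the entropy of fixed true labels does not depend on the model, minimizing cross-entropy and minimizing KL to the labels lead to the same parameters.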
Information & temperature
With temperature $T > 0$, $p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$. Lower temperature → lower-entropy (sharper) softmax; higher → higher-entropy (more uniform). Controls exploration in sampling ([[04_transformers]] §4.8).
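The effect of temperature on entropy can be sketched as follows, assuming temperature is applied by dividing the logits before the softmax. The logits and temperature values are illustrative.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (temperature must be > 0)."""
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p)))

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)     # concentrates on the top logit
neutral = softmax_with_temperature(logits, 1.0)   # plain softmax
flat = softmax_with_temperature(logits, 5.0)      # closer to uniform
```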
10.4 Numerical stability tricks (used everywhere in code)
- Log-sum-exp for softmax: subtract the max logit before exponentiating to avoid overflow: $\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}}$ with $m = \max_j z_j$.
- `log(x+ε)` to avoid $\log 0$; `/(σ+ε)` in normalization to avoid divide-by-zero.
- Fused sigmoid+BCE / softmax+CE (`BCEWithLogitsLoss`, `CrossEntropyLoss`) are more stable than computing the pieces separately (and give the clean gradient $\hat{y} - y$, [[01_deep_learning_foundations]]).
- Gradient clipping caps $\|\nabla\|$ to prevent explosions ([[03_rnn_lstm]]).
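Two of these tricks are short enough to sketch in NumPy: max-subtraction in softmax, and clipping a gradient by its L2 norm. The function names are hypothetical helpers, not library APIs, and the inputs are chosen to trigger the failure case.

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                 # overflows to inf for large logits
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - np.max(z))     # largest exponent is exp(0) = 1
    return e / e.sum()

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

big_logits = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(big_logits)    # inf / inf -> nan
stable = stable_softmax(big_logits)      # same result as softmax([0, 1, 2])

clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)   # norm 50 -> norm 5
```

Subtracting the max does not change the result mathematically, because the common factor $e^{-m}$ cancels between numerator and denominator.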
10.5 Notation cheat-sheet
| Symbol | Meaning |
|---|---|
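| $x$, $\mathbf{x}$, $X$ | scalar, vector, matrix |
| $\hat{y}$ / $y$ | prediction / ground truth |
| $\sigma$ | activation (often sigmoid) |
| $\eta$ | learning rate |
| $L$, $\nabla_\theta L$ | loss, its gradient w.r.t. params |
| $\odot$ | elementwise product |
| $\delta^{l}$ | error signal at layer $l$ |
| $d_{\text{model}}$, $d_k$, $h$ | model dim, head dim, #heads |
| $n$ / $T$ | sequence length / time steps |
| $\mathbb{E}[\cdot]$, $\mathrm{Var}(\cdot)$ | expectation, variance |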
Return to the index to jump to any module.