CNNs are MLPs with two inductive biases baked in: locality (pixels near each other matter together) and weight sharing (the same feature detector slides everywhere). This makes them parameter-efficient and translation-equivariant. In layers, a CNN climbs from simple clues to whole objects — edges and color blobs first, then corners and textures, then parts (an eye, an ear), then objects ("this is a cat").
2.1 Why not just use a dense network on images?
A $224\times224\times3$ image flattened = 150,528 inputs. One hidden layer of 1000 units → about 150 million weights in layer 1 alone. CNNs instead learn small kernels (e.g. $3\times3 = 9$ weights) reused across the whole image. Fewer parameters, better generalization, and a structure that respects the 2D layout of images.
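The arithmetic above can be checked in a few lines. This is a minimal sketch, assuming a 224×224 RGB input (the size that yields 150,528 values) and a single 3×3 kernel spanning 3 channels; the variable names are illustrative only.

```python
# Dense layer on a flattened image vs. one small shared kernel.
# Assumption: a 224x224 RGB image, which is what 150,528 inputs corresponds to.
height, width, channels = 224, 224, 3
hidden_units = 1000

n_inputs = height * width * channels       # one input per pixel per channel
dense_weights = n_inputs * hidden_units    # one weight per (input, hidden unit) pair

# Assumption: a single 3x3 kernel spanning all 3 channels, reused at every position.
kernel_weights = 3 * 3 * channels

print(f"inputs:         {n_inputs:,}")
print(f"dense weights:  {dense_weights:,}")
print(f"kernel weights: {kernel_weights:,}")
```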
2.2 The convolution operation
Intuition
Slide a small kernel (filter) — think of it as a stencil — over the image, one position at a time. At each position you compute a dot product between the kernel and the patch it covers: a high response means "this feature is here." The kernel is tuned to light up on a particular little pattern (a vertical edge, a patch of fur texture), and because the same kernel is reused at every position, the network learns that pattern once and finds it anywhere in the image. The map of responses it produces is a feature map. Different kernels detect different things — edges, textures, and so on.
Math (2D, single channel)
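Input $X \in \mathbb{R}^{H\times W}$, kernel $K \in \mathbb{R}^{k\times k}$, bias $b$. The output (a cross-correlation, which is what DL libraries call "convolution") is:
$$Y[i,j] = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} X[i+m,\,j+n]\,K[m,n] + b$$
Step-by-step: unrolling that double sum
The $\sum\sum$ looks scary only because it's compressed. Let's expand it for a concrete $2\times2$ kernel ($k=2$), so $m$ runs over $\{0,1\}$ and $n$ runs over $\{0,1\}$.
Step 1: write out the inner sum (fix $m$, let $n = 0, 1$):
$$\sum_{n=0}^{1} X[i+m,\,j+n]\,K[m,n] = X[i+m,\,j]\,K[m,0] + X[i+m,\,j+1]\,K[m,1]$$
Step 2: now wrap the outer sum (let $m = 0, 1$, so we get the $m=0$ row plus the $m=1$ row):
$$Y[i,j] = X[i,j]\,K[0,0] + X[i,j+1]\,K[0,1] + X[i+1,j]\,K[1,0] + X[i+1,j+1]\,K[1,1] + b$$
That's the entire formula with no sigmas left: just 4 products added together (a $2\times2$ kernel has 4 weights), plus the bias $b$. A $3\times3$ kernel would expand the same way into 9 products.
Step 3: pin it to a real position. To get the top-left output $Y[0,0]$, substitute $i=0,\ j=0$:
$$Y[0,0] = X[0,0]\,K[0,0] + X[0,1]\,K[0,1] + X[1,0]\,K[1,0] + X[1,1]\,K[1,1] + b$$
Notice the image indices $X[0,0], X[0,1], X[1,0], X[1,1]$ are exactly the top-left $2\times2$ patch of the image, the patch the kernel currently sits on.
Step 4: slide by one and repeat. For the next output to the right, $Y[0,1]$, substitute $i=0,\ j=1$:
$$Y[0,1] = X[0,1]\,K[0,0] + X[0,2]\,K[0,1] + X[1,1]\,K[1,0] + X[1,2]\,K[1,1] + b$$
Same four kernel weights; only the image indices shifted right by one column ($j \to j+1$). That shift is the kernel sliding, and the weights staying identical is weight-sharing, made literal.
Note: true mathematical convolution flips the kernel (it pairs $K[m,n]$ with $X[i-m,\,j-n]$); deep-learning frameworks implement cross-correlation (no flip). Since kernels are learned, the flip is irrelevant: the network just learns the flipped kernel. We use the term "convolution" loosely.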
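The sliding dot product and the flip can be sketched in plain Python. This is an illustrative from-scratch version, not how any framework implements it; the helper names `cross_correlate2d` and `flip180` and the input values are assumptions made for the example.

```python
def cross_correlate2d(X, K, b=0.0):
    """Valid (no padding), stride-1 cross-correlation of 2D lists X and K."""
    H, W = len(X), len(X[0])
    kh, kw = len(K), len(K[0])
    out = []
    for i in range(H - kh + 1):          # every valid top edge
        row = []
        for j in range(W - kw + 1):      # every valid left edge
            total = b
            for m in range(kh):
                for n in range(kw):
                    total += X[i + m][j + n] * K[m][n]
            row.append(total)
        out.append(row)
    return out


def flip180(K):
    """Reverse rows and columns (a 180-degree rotation)."""
    return [row[::-1] for row in K[::-1]]


# Hypothetical values, chosen only for illustration.
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 2],
     [3, 4]]

Y_corr = cross_correlate2d(X, K)            # what DL libraries call "convolution"
Y_conv = cross_correlate2d(X, flip180(K))   # true convolution: correlate with the flipped kernel
print(Y_corr)
print(Y_conv)
```

The two results differ only because this kernel is not symmetric under a 180° rotation; a learned kernel can absorb the flip, which is why frameworks skip it.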
Output size formula
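For input size $n$, kernel $k$, padding $p$, stride $s$:
$$n_{\text{out}} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1$$
Step-by-step: where this formula comes from
We build it up one effect at a time, so each symbol earns its place.
Step 1: no padding, stride 1. A $k$-wide kernel needs $k$ pixels under it. On a row of $n$ pixels, its left edge can start at column $0$ up until the kernel's right edge hits the wall, i.e. the last valid start is column $n-k$. Counting starts $0$ through $n-k$ inclusive:
$$n_{\text{out}} = (n - k) + 1$$
(Example: $n=5,\ k=3$ starts at columns $0, 1, 2$, that's $5-3+1 = 3$ outputs.)
Step 2: add padding $p$. Gluing $p$ zero-pixels onto each side makes the row $n+2p$ wide (the $2$ is "both sides"). Substitute $n \to n+2p$:
$$n_{\text{out}} = (n + 2p - k) + 1$$
Step 3: add stride $s$. With stride 1 the start columns are $0, 1, 2, \dots$. With stride $s$ we only keep every $s$-th start: $0, s, 2s, \dots$. Over a span of $n+2p-k$ pixels, taking steps of size $s$ fits $\frac{n+2p-k}{s}$ steps, then $+1$ for the starting position itself:
$$n_{\text{out}} = \frac{n + 2p - k}{s} + 1$$
Step 4: round down. If $s$ doesn't divide $n+2p-k$ evenly, the kernel can't land on a fractional pixel, so we discard the leftover with the floor $\lfloor\cdot\rfloor$, giving the boxed result $\boxed{n_{\text{out}} = \left\lfloor \frac{n+2p-k}{s} \right\rfloor + 1}$.
Worked check: $k=3,\ p=1,\ s=1$ on an input of any width $n$:
$$\left\lfloor \frac{n + 2\cdot1 - 3}{1} \right\rfloor + 1 = (n - 1) + 1 = n$$
Output width equals input width, which is precisely what same padding ($p=1$ for $k=3$) is engineered to do.
- Padding $p$: add zeros around the border. `same` padding keeps $n_{\text{out}} = n$ (needs $p = (k-1)/2$ with odd $k$ for stride 1). `valid` = no padding.
- Stride $s$: step size of the slide. $s=2$ halves spatial size.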
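The output-size rule is easy to turn into a helper for sanity-checking layer shapes. The function name `conv_output_size` and the sample sizes below are assumptions for illustration; the arithmetic is the floor formula derived above.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output width for input width n, kernel k, padding p, stride s."""
    if n + 2 * p < k:
        raise ValueError("kernel is wider than the padded input")
    return (n + 2 * p - k) // s + 1   # // is floor division


same = conv_output_size(32, 3, p=1, s=1)      # same padding keeps the width
valid = conv_output_size(32, 3, p=0, s=1)     # valid padding loses k - 1 pixels
strided = conv_output_size(32, 3, p=1, s=2)   # stride 2 halves the width
uneven = conv_output_size(8, 3, p=0, s=2)     # 5 / 2 is not whole, so the floor drops the leftover
print(same, valid, strided, uneven)
```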
Multi-channel & multi-filter
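Real conv layers have input depth $C_{in}$ and produce $C_{out}$ feature maps. A single filter is $k\times k\times C_{in}$ (it spans all input channels), and there are $C_{out}$ of them:
$$Y[f,i,j] = \sum_{c=0}^{C_{in}-1}\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} X[c,\,i+m,\,j+n]\,K[f,c,m,n] + b_f$$
Step-by-step: expanding the channel sum
The only new symbol is the extra $\sum_c$, where $c$ indexes the input channels (e.g. $c = 0, 1, 2$ for red, green, blue). Take a $2\times2$ kernel over $C_{in}=3$ channels and unroll it for one output position $(i,j)$ of one filter $f$:
Step 1: inner spatial sum, done per channel (this is the same 4-term expansion as before, but now there's one copy per channel $c$):
$$S_c = X[c,i,j]\,K[f,c,0,0] + X[c,i,j+1]\,K[f,c,0,1] + X[c,i+1,j]\,K[f,c,1,0] + X[c,i+1,j+1]\,K[f,c,1,1]$$
Step 2: sum those per-channel results across all channels, then add the bias:
$$Y[f,i,j] = S_0 + S_1 + S_2 + b_f$$
So a 3-channel $2\times2$ filter does $2\times2\times3 = 12$ multiplications, sums all 12 into one output number, adds the bias. The 3 input grids collapse into 1 output grid. Run $C_{out}$ different filters and you get $C_{out}$ output grids.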
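The channel sum can be sketched as one extra loop around the single-channel case. This is a plain-Python illustration with hypothetical helper names (`conv_one_filter`, `conv_layer`) and made-up values, using a channels-first layout as an assumption.

```python
def conv_one_filter(X, K, b=0.0):
    """X: [C][H][W] input, K: [C][kh][kw] one filter. Returns one [H'][W'] map."""
    C, H, W = len(X), len(X[0]), len(X[0][0])
    kh, kw = len(K[0]), len(K[0][0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            total = b
            for c in range(C):                # sum across channels
                for m in range(kh):           # spatial sum inside each channel
                    for n in range(kw):
                        total += X[c][i + m][j + n] * K[c][m][n]
            row.append(total)
        out.append(row)
    return out


def conv_layer(X, filters, biases):
    """Apply every filter in turn; the result has one grid per filter."""
    return [conv_one_filter(X, K, b) for K, b in zip(filters, biases)]


# Hypothetical 3-channel 2x2 input and two 2x2 filters, for illustration only.
X = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]],
     [[9, 10], [11, 12]]]
ones = [[[1, 1], [1, 1]] for _ in range(3)]                 # adds up all 12 inputs
first_only = [[[1, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]]]  # picks X[0][0][0]

Y = conv_layer(X, [ones, first_only], biases=[0.0, 0.5])
print(Y)
```

Three input grids go in and two output grids come out, one per filter, regardless of how many input channels each filter had to sum over.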
Step-by-step: the parameter count
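Build piece by piece:
$$\underbrace{k \cdot k \cdot C_{in}}_{\text{weights in one filter}} \times \underbrace{C_{out}}_{\text{number of filters}} + \underbrace{C_{out}}_{\text{one bias per filter}} = (k^2\,C_{in} + 1)\,C_{out}$$
Plug in a real layer ($3\times3$ kernel, $C_{in}=3$, $C_{out}=64$):
$$(3\cdot3\cdot3 + 1)\cdot 64 = 28\cdot 64 = 1{,}792$$
Compare to the dense layer from §2.1 (≈150 million). And notice: $H$ and $W$ (the image's height and width) appear nowhere in the count. The same 1,792 numbers process a $32\times32$ image or a $224\times224$ image, because the one small filter is reused at every position. That size-independence is the central efficiency win of convolution.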
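A small helper makes the size-independence visible: the image dimensions are simply not among its arguments. The function names are hypothetical, and the 224×224 RGB input used for the dense comparison is an assumption matching the 150,528-input figure.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Learnable parameters in a k x k conv layer; image size is not an argument."""
    weights = k * k * c_in * c_out
    biases = c_out if bias else 0
    return weights + biases


def dense_params(n_in, n_out, bias=True):
    """Learnable parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


conv_count = conv_params(3, 3, 64)                  # the layer from the text
dense_count = dense_params(224 * 224 * 3, 1000)     # assumed 224x224 RGB input, 1000 units
print(conv_count, dense_count, dense_count // conv_count)
```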
Tiny numeric example — every step shown
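Input $X$ ($3\times3$), kernel $K$ ($2\times2$), stride 1, no padding, $b=0$:
$$K = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$
First, the output size. Plug into the formula with $n=3,\ k=2,\ p=0,\ s=1$:
$$\left\lfloor \frac{3 + 0 - 2}{1} \right\rfloor + 1 = 2$$
So the output is $2\times2$: the kernel will sit in 4 positions (top-left, top-right, bottom-left, bottom-right) and produce 4 numbers.
Position $(0,0)$: kernel on the top-left patch. The kernel covers rows $\{0,1\}$, cols $\{0,1\}$:
$$\text{patch} = \begin{bmatrix} X[0,0] & X[0,1] \\ X[1,0] & X[1,1] \end{bmatrix}$$
Multiply the two grids cell-by-cell (element-wise), keeping each product separate:
$$\text{patch} \odot K = \begin{bmatrix} X[0,0]\cdot 1 & X[0,1]\cdot 0 \\ X[1,0]\cdot 0 & X[1,1]\cdot(-1) \end{bmatrix}$$
Now add up all four products:
$$Y[0,0] = X[0,0] + 0 + 0 - X[1,1] = X[0,0] - X[1,1]$$
Position $(0,1)$: slide one column right. Now covering rows $\{0,1\}$, cols $\{1,2\}$:
$$Y[0,1] = X[0,1] - X[1,2]$$
Position $(1,0)$: back to the left, slide one row down. Rows $\{1,2\}$, cols $\{0,1\}$:
$$Y[1,0] = X[1,0] - X[2,1]$$
Position $(1,1)$: bottom-right patch. Rows $\{1,2\}$, cols $\{1,2\}$:
$$Y[1,1] = X[1,1] - X[2,2]$$
Assemble the four numbers into the $2\times2$ output grid:
$$Y = \begin{bmatrix} X[0,0]-X[1,1] & X[0,1]-X[1,2] \\ X[1,0]-X[2,1] & X[1,1]-X[2,2] \end{bmatrix}$$
($\odot$ means element-wise multiply.) This kernel computes (top-left pixel) − (bottom-right pixel), so it outputs $0$ on a smooth patch and a non-zero value where the image changes along the main diagonal: a diagonal edge detector.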
2.3 Pooling
Downsamples feature maps → reduces compute, adds small translation invariance. Think of it as shrinking the image while keeping the gist, like making a thumbnail: max-pooling looks at each little 2×2 block and keeps only the strongest signal ("was the feature present anywhere in this neighborhood? keep the loudest evidence"). That (a) makes everything smaller and faster, and (b) gives a bit of "don't-care-exactly-where" tolerance — a cat shifted by 1 pixel still pools to nearly the same result. It has no weights to learn; it's a fixed shrinking rule.
- Max pooling $2\times2$, stride 2: take the max in each $2\times2$ window.
- Average pooling: mean instead of max.
- Global average pooling (GAP): average each entire feature map to one number — common before the final classifier (replaces huge dense layers).
Pooling has no learnable parameters. Backprop through max-pool routes the gradient only to the position that was the max (others get 0); through avg-pool it spreads gradient equally.
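The forward rule and both backward rules can be sketched in plain Python. This is an illustrative version for 2×2 windows with stride 2 only; the function names and the 4×4 feature map are assumptions, and ties in the max go to the first position found.

```python
def max_pool_2x2(X):
    """2x2 max pooling, stride 2. Returns the pooled map and the winning positions."""
    out, argmax = [], []
    for i in range(0, len(X) - 1, 2):          # a trailing odd row/column is dropped
        out_row, arg_row = [], []
        for j in range(0, len(X[0]) - 1, 2):
            window = [(X[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, pos = max(window, key=lambda t: t[0])
            out_row.append(value)
            arg_row.append(pos)
        out.append(out_row)
        argmax.append(arg_row)
    return out, argmax


def max_pool_backward(grad_out, argmax, in_shape):
    """Route each output gradient to the input position that won the max."""
    H, W = in_shape
    grad_in = [[0.0] * W for _ in range(H)]
    for i, row in enumerate(argmax):
        for j, (r, c) in enumerate(row):
            grad_in[r][c] += grad_out[i][j]
    return grad_in


def avg_pool_backward(grad_out, in_shape):
    """Spread each output gradient equally over its 2x2 window."""
    H, W = in_shape
    grad_in = [[0.0] * W for _ in range(H)]
    for i, row in enumerate(grad_out):
        for j, g in enumerate(row):
            for di in range(2):
                for dj in range(2):
                    grad_in[2 * i + di][2 * j + dj] += g / 4
    return grad_in


# Hypothetical 4x4 feature map and output gradient, for illustration only.
X = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [6, 2, 3, 1]]
pooled, argmax = max_pool_2x2(X)
grad_out = [[1.0, 2.0], [3.0, 4.0]]
grad_max = max_pool_backward(grad_out, argmax, (4, 4))
grad_avg = avg_pool_backward(grad_out, (4, 4))
print(pooled)
print(grad_max)
print(grad_avg)
```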
2.4 Receptive field
The receptive field of a neuron = the region of the input that influences it. Early layers only see tiny patches (a few pixels), so they can only spot tiny things: edges, dots. A neuron deep in the network is built from neurons below it, which were built from neurons below them, so it indirectly "sees" a much larger chunk of the original image and can recognize big things (a whole face, a car). That is how a CNN climbs from edges → textures → parts → objects. The receptive field grows with depth as:
$$r_\ell = r_{\ell-1} + (k_\ell - 1)\prod_{i=1}^{\ell-1} s_i, \qquad r_0 = 1$$
Step-by-step: stacking three $3\times3$ layers (all stride 1)
Read the formula as: new receptive field = previous receptive field + (this kernel's reach minus 1) × (how far each step now travels, which is the product of all earlier strides). With every stride $s_i = 1$, that product is always $1$, so each layer just adds $k - 1 = 2$.
$$r_1 = 1 + 2 = 3, \qquad r_2 = 3 + 2 = 5, \qquad r_3 = 5 + 2 = 7$$
So three small $3\times3$ layers "see" the same $7\times7$ region as one big $7\times7$ kernel, but compare the parameter counts (per channel):
$$3 \times (3\cdot3) = 27 \quad \text{vs} \quad 7\cdot7 = 49$$
Same field of view, 27 vs 49 params, and two extra ReLU nonlinearities in between → deeper + smaller kernels = more expressive, fewer params. This is the core VGG insight.
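The recurrence is short enough to compute directly. A minimal sketch, assuming each layer is described by a hypothetical `(kernel_size, stride)` pair listed from input to output:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    r = 1        # a single input pixel sees only itself
    jump = 1     # product of the strides of all earlier layers
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r


rf_three_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])
rf_one_7x7 = receptive_field([(7, 1)])
print(rf_three_3x3, rf_one_7x7)

# Per-channel weight counts for the two options.
params_three_3x3 = 3 * (3 * 3)
params_one_7x7 = 7 * 7
print(params_three_3x3, params_one_7x7)

# With a stride-2 pool in the middle, the last 3x3 layer reaches twice as far per step.
rf_with_pool = receptive_field([(3, 1), (2, 2), (3, 1)])
print(rf_with_pool)
```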
2.5 A full CNN layer stack (anatomy)
```
INPUT (32x32x3)
 → CONV 3x3, 32 filters, pad=same → (32x32x32) → ReLU
 → CONV 3x3, 32 filters, pad=same → (32x32x32) → ReLU
 → MAXPOOL 2x2 → (16x16x32)
 → CONV 3x3, 64 filters, pad=same → (16x16x64) → ReLU
 → MAXPOOL 2x2 → (8x8x64)
 → FLATTEN → (4096,)
 → DENSE 128 → ReLU → DROPOUT 0.5
 → DENSE 10 → SOFTMAX → class probabilities
```
Pattern: [Conv → activation] × n → Pool, repeated, with channels increasing as spatial size shrinks; then a classifier head.
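The shapes in the stack can be traced mechanically with the output-size formula from §2.2. This sketch uses a hypothetical tuple encoding for layers and assumes the third conv also uses same padding, since its output stays 16×16.

```python
def conv_out(n, k, p=0, s=1):
    """Output width from the floor formula."""
    return (n + 2 * p - k) // s + 1


def trace_shapes(size, channels, layers):
    """layers: ("conv", filters, k, p) or ("pool", k). Returns (H, W, C) after each layer."""
    shapes = [(size, size, channels)]
    for layer in layers:
        if layer[0] == "conv":
            _, filters, k, p = layer
            size, channels = conv_out(size, k, p), filters
        elif layer[0] == "pool":
            _, k = layer
            size = conv_out(size, k, p=0, s=k)   # non-overlapping windows
        shapes.append((size, size, channels))
    return shapes


# padding=1 is "same" padding for a 3x3 kernel at stride 1.
stack = [("conv", 32, 3, 1), ("conv", 32, 3, 1), ("pool", 2),
         ("conv", 64, 3, 1), ("pool", 2)]
shapes = trace_shapes(32, 3, stack)
for shape in shapes:
    print(shape)

h, w, c = shapes[-1]
flattened = h * w * c
print("flattened:", flattened)
```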
2.6 Backprop through convolution (the key result)
First, the big picture — how any network learns
A freshly-built CNN is useless: every kernel is filled with random numbers, so it "sees" nonsense and guesses wildly. Training fixes this with a loop of two halves:
- Forward pass — push an image through the network and see what it guesses (say it says "70% dog" when the answer is "cat").
- Backward pass (backprop) — measure how wrong that guess was (the loss), then walk backwards through the network asking, at every single knob, one question: "if I nudged you a tiny bit, would the final answer have been less wrong?" The answer is the gradient — a direction and a strength of blame. Knobs that pushed hard toward the wrong answer get a big "turn this way" signal; knobs that didn't matter get nearly zero.
The optimizer then nudges every knob a hair in the direction that reduces the error, and you repeat on the next image. Millions of tiny nudges later, the once-random kernels have organized themselves into edge-detectors, texture-detectors, eye-detectors — nobody told them to; those just happen to be the settings that minimize the error. Backprop is the bookkeeping that figures out, fairly, how much each knob is to blame. For the general mechanics see [[01_deep_learning_foundations]]; here we only need the part that's special to convolution.
What's special about a conv layer
In a normal dense layer every knob is used once per prediction. In a conv layer the same kernel is reused at every position (that was the whole point — weight sharing), so a single kernel weight is responsible for the output at thousands of positions at once. Each of those positions comes back with its own complaint ("you made me 0.3 too high here, 0.1 too low there"), and the weight's total blame is simply the sum of all those complaints. That summing-up is the only twist convolution adds to ordinary backprop.
A conv layer has to compute three things during the backward pass. Here's what each one is for, in plain terms, before the formulas:
- (A) How should I change the kernel? — the gradient w.r.t. the kernel weights. This is what actually gets learned; the optimizer uses it to update the filter. (Computed by summing each weight's complaints, as above.)
- (B) What blame do I pass to the layer below me? — the gradient w.r.t. the layer's input. The conv layer isn't the first layer; the layers before it also have knobs to train, and they can only learn if this layer tells them how much their output contributed to the error. So the layer must "translate" its own output-blame back into input-blame and hand it down the chain.
- (C) The bias — just the sum of all the output complaints (the bias touched every output equally).
The two formulas, and why they look the way they do
Because convolution is a linear operation, both of these gradients turn out to be convolutions too — which is beautiful, because it means the same fast machinery runs forwards and backwards.
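(A) Gradient w.r.t. the kernel = slide the output-gradient over the input and correlate them (every position's complaint, weighted by what the input was there, summed up). Writing $G[i,j] = \partial L/\partial Y[i,j]$ for the blame at output position $(i,j)$:
$$\frac{\partial L}{\partial K[m,n]} = \sum_{i}\sum_{j} G[i,j]\,X[i+m,\,j+n]$$
Read it as: "how much should weight $K[m,n]$ change? = for every output position $(i,j)$, take its blame $G[i,j]$, multiply by the input pixel $X[i+m,\,j+n]$ that this weight was multiplied against there, and add them all up." That's the "sum all the complaints" idea written as math.
(B) Gradient w.r.t. the input = a full convolution of the output-gradient with the flipped kernel (also called a "transposed convolution"):
$$\frac{\partial L}{\partial X[i,j]} = \sum_{m}\sum_{n} G[i-m,\,j-n]\,K[m,n] \qquad \text{(any index outside } G \text{ counts as } 0\text{)}$$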
Why the flip? In the forward pass one input pixel got "smeared" into several output positions (it sat under the kernel several times as the kernel slid past). To collect that input pixel's total blame you have to gather complaints back from all the output positions it influenced — the forward slide run in reverse — and running a slide in reverse is exactly what flipping the kernel does. (It's the same operation used to upsample images in segmentation and GANs.)
This is why conv nets are trainable with the exact same SGD machinery from [[01_deep_learning_foundations]] — nothing about the optimizer changes, only the layer's local recipe for turning output-blame into (A) kernel-updates and (B) input-blame.
Fully worked numeric example — every gradient, step by step
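We reuse the exact $X$ ($3\times3$) and $K = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$ from §2.2, and assume the backward pass has handed us the gradient on the output, $G[i,j] = \partial L/\partial Y[i,j]$ (the "complaints," one per output position, so $G$ is $2\times2$).
(A) Gradient w.r.t. the kernel: compute all 4 weights
Formula: $\frac{\partial L}{\partial K[m,n]} = \sum_{i,j} G[i,j]\,X[i+m,\,j+n]$. For each kernel weight, the offset $(m,n)$ just shifts which $2\times2$ block of the input we read.
$\partial L/\partial K[0,0]$: offset $(0,0)$, read $X[i,j]$, i.e. the top-left block $X[0{:}2,\,0{:}2]$:
$$\frac{\partial L}{\partial K[0,0]} = G[0,0]X[0,0] + G[0,1]X[0,1] + G[1,0]X[1,0] + G[1,1]X[1,1]$$
$\partial L/\partial K[0,1]$: offset $(0,1)$, read $X[i,j+1]$, the block $X[0{:}2,\,1{:}3]$:
$$\frac{\partial L}{\partial K[0,1]} = G[0,0]X[0,1] + G[0,1]X[0,2] + G[1,0]X[1,1] + G[1,1]X[1,2]$$
$\partial L/\partial K[1,0]$: offset $(1,0)$, read $X[i+1,j]$, the block $X[1{:}3,\,0{:}2]$:
$$\frac{\partial L}{\partial K[1,0]} = G[0,0]X[1,0] + G[0,1]X[1,1] + G[1,0]X[2,0] + G[1,1]X[2,1]$$
$\partial L/\partial K[1,1]$: offset $(1,1)$, read $X[i+1,j+1]$, the block $X[1{:}3,\,1{:}3]$:
$$\frac{\partial L}{\partial K[1,1]} = G[0,0]X[1,1] + G[0,1]X[1,2] + G[1,0]X[2,1] + G[1,1]X[2,2]$$
Assemble: the four values form the $2\times2$ matrix $\partial L/\partial K$, the same shape as $K$. The optimizer will nudge $K$ in the opposite direction (gradient descent): $K \leftarrow K - \eta\,\frac{\partial L}{\partial K}$.
(C) Gradient w.r.t. the bias: just sum the complaints
The bias was added to every output equally, so its gradient is the plain sum:
$$\frac{\partial L}{\partial b} = G[0,0] + G[0,1] + G[1,0] + G[1,1]$$
(B) Gradient w.r.t. the input: the "flip", made concrete
Step 1: flip the kernel 180° (this is what "convolve with the flipped kernel" means). Reverse rows and columns:
$$K = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \;\longrightarrow\; K_{\text{flip}} = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}$$
Step 2: apply the formula $\frac{\partial L}{\partial X[i,j]} = \sum_{m,n} G[i-m,\,j-n]\,K[m,n]$, where any index outside $G$ counts as $0$. Plugging in our $K$ (only $K[0,0]=1$ and $K[1,1]=-1$ are non-zero) collapses it to:
$$\frac{\partial L}{\partial X[i,j]} = G[i,j] - G[i-1,\,j-1]$$
Step 3: evaluate at all 9 input positions (remember $G$ is only defined on the $2\times2$ grid; outside it is $0$):
$$\frac{\partial L}{\partial X} = \begin{bmatrix} G[0,0] & G[0,1] & 0 \\ G[1,0] & G[1,1]-G[0,0] & -G[0,1] \\ 0 & -G[1,0] & -G[1,1] \end{bmatrix}$$
Step 4: sanity-check one value the slow way. $X[1,1]$ sat in all four patches during the forward pass. Look back at §2.2: it was multiplied by $K[1,1]=-1$ in $Y[0,0]$ and by $K[0,0]=+1$ in $Y[1,1]$ (and by $0$ in the other two). So its total blame is
$$\frac{\partial L}{\partial X[1,1]} = G[0,0]\cdot(-1) + G[1,1]\cdot(+1) = G[1,1] - G[0,0]$$
This $3\times3$ matrix is the "complaint" that gets handed to the layer below, which then runs its own (A)/(B)/(C). That hand-off, repeated layer by layer, is the entire backward pass.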
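The three gradients can be computed and checked numerically in plain Python. This is a sketch under stated assumptions: the diagonal kernel from §2.2, made-up values for the input and for the output gradient `G`, and a toy loss $L = \sum G \odot Y$ chosen so that $\partial L/\partial Y = G$ exactly. The input gradient is written as a scatter (each output hands blame back to the pixels it read), which gives the same result as the flipped-kernel formula.

```python
def forward(X, K, b):
    """Valid, stride-1 cross-correlation."""
    kh, kw = len(K), len(K[0])
    return [[b + sum(X[i + m][j + n] * K[m][n]
                     for m in range(kh) for n in range(kw))
             for j in range(len(X[0]) - kw + 1)]
            for i in range(len(X) - kh + 1)]


def backward(X, K, G):
    """Return (dK, dX, db) given G = dL/dY."""
    kh, kw = len(K), len(K[0])
    oh, ow = len(G), len(G[0])
    # (A) each weight sums blame times the input pixel it touched
    dK = [[sum(G[i][j] * X[i + m][j + n] for i in range(oh) for j in range(ow))
           for n in range(kw)] for m in range(kh)]
    # (B) each output position hands blame back to every input pixel it read
    dX = [[0.0] * len(X[0]) for _ in X]
    for i in range(oh):
        for j in range(ow):
            for m in range(kh):
                for n in range(kw):
                    dX[i + m][j + n] += G[i][j] * K[m][n]
    # (C) the bias touched every output equally
    db = sum(sum(row) for row in G)
    return dK, dX, db


def loss(X, K, b, G):
    """Toy loss whose gradient w.r.t. Y is exactly G."""
    Y = forward(X, K, b)
    return sum(G[i][j] * Y[i][j] for i in range(len(G)) for j in range(len(G[0])))


# Hypothetical values, for illustration only.
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
K = [[1, 0], [0, -1]]
G = [[1, 2], [3, 4]]

dK, dX, db = backward(X, K, G)
print(dK, dX, db)

# Finite-difference check on one kernel weight.
eps = 1e-6
K_plus = [row[:] for row in K]
K_plus[0][1] += eps
numeric = (loss(X, K_plus, 0.0, G) - loss(X, K, 0.0, G)) / eps
print(numeric, dK[0][1])
```

The centre entry of `dX` is `G[1][1] - G[0][0]`, matching the slow-way sanity check in Step 4.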
2.7 Classic architectures (evolution & the idea each added)
| Year | Model | Key idea introduced |
|---|---|---|
| 1998 | LeNet-5 | First successful CNN (digits); conv→pool→FC template |
| 2012 | AlexNet | ReLU + dropout + GPUs; won ImageNet, started the deep-learning boom |
| 2014 | VGG | Stacks of small $3\times3$ convs; depth matters; very uniform |
| 2014 | GoogLeNet/Inception | Parallel multi-scale "inception" blocks; $1\times1$ convs to reduce channels |
| 2015 | ResNet | Residual/skip connections → trains 100+ layers |
| 2017 | DenseNet | Connect each layer to all later layers |
| 2019 | EfficientNet | Compound scaling of depth/width/resolution |
The $1\times1$ convolution
A $1\times1$ kernel mixes information across channels at each pixel (a per-pixel dense layer). Used to cheaply change channel count (dimensionality reduction), which makes it central to Inception and bottleneck blocks.
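Because the kernel covers a single pixel, the spatial loops disappear and what remains is a small dense layer applied at every position. A plain-Python sketch, with a hypothetical function name, made-up values, and channel counts (256 → 64) chosen only to illustrate the saving:

```python
def conv1x1(X, weights, bias):
    """X: [C_in][H][W]; weights: [C_out][C_in]; bias: [C_out]. Returns [C_out][H][W]."""
    c_in, H, W = len(X), len(X[0]), len(X[0][0])
    return [[[bias[o] + sum(weights[o][c] * X[c][i][j] for c in range(c_in))
              for j in range(W)]
             for i in range(H)]
            for o in range(len(weights))]


# Hypothetical 3-channel 2x2 input mixed down to 1 channel.
X = [[[1, 2], [3, 4]],
     [[10, 20], [30, 40]],
     [[100, 200], [300, 400]]]
Y = conv1x1(X, weights=[[1, 1, 1]], bias=[0])
print(Y)

# Weight counts for reducing channels (sizes chosen for illustration).
c_in, c_out = 256, 64
weights_1x1 = c_in * c_out
weights_3x3 = 3 * 3 * c_in * c_out
print(weights_1x1, weights_3x3)
```

Each output pixel depends only on the same pixel across input channels, so the spatial size is unchanged while the channel count drops.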
Residual connection (ResNet) — why it works
A residual block computes:
$$y = F(x) + x$$
The layer learns the residual $F(x) = H(x) - x$ instead of the full mapping $H(x)$.
Very deep plain networks paradoxically got worse: the blame signal faded to nothing before reaching the early layers, like a whisper passed down a line of 100 people. The fix is a shortcut wire (output = block output + untouched input): because the input is added straight through, the blame signal always has a clear highway back to earlier layers and can't fade away. Each block also only has to learn the small change it wants to make rather than reinvent the whole signal, which is far easier. Concretely, the gradient gets an identity path:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(\frac{\partial F}{\partial x} + 1\right)$$
The "$+1$" guarantees gradients flow even when $\partial F/\partial x$ is tiny → mitigates vanishing gradients in very deep nets. This one idea (skip connections) unlocked 100+ layer networks and reappears inside every Transformer in [[04_transformers]].
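The effect of the identity path can be illustrated with a scalar toy model, where each block's local derivative is a single number rather than a Jacobian. That simplification, the function name, and the values (100 blocks, local derivative 0.01) are all assumptions made for the sketch.

```python
def gradient_through_stack(local_grads, residual):
    """Multiply per-block derivatives along the backward path (scalar toy model).

    local_grads holds dF/dx for each block. With a skip connection each block
    contributes dF/dx + 1 instead of dF/dx.
    """
    total = 1.0
    for g in local_grads:
        total *= (g + 1.0) if residual else g
    return total


# Hypothetical setting: 100 blocks whose own derivative is a tiny 0.01.
local = [0.01] * 100
plain = gradient_through_stack(local, residual=False)
skip = gradient_through_stack(local, residual=True)
print(plain)   # shrinks toward zero
print(skip)    # stays on the order of 1
```

In the plain stack the signal reaching the first layer is vanishingly small, while the residual stack keeps it at a usable magnitude because every factor is at least the identity term.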
2.8 Code: a CNN in PyTorch (CIFAR-style)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # (B,32,32,32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # (B,64,16,16): input is already pooled
        self.pool = nn.MaxPool2d(2, 2)                            # halves H,W
        self.bn1 = nn.BatchNorm2d(32)
        self.bn2 = nn.BatchNorm2d(64)
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # (B,32,16,16)
        x = self.pool(F.relu(self.bn2(self.conv2(x))))  # (B,64,8,8)
        x = x.flatten(1)                                # (B, 4096)
        x = self.drop(F.relu(self.fc1(x)))
        return self.fc2(x)                              # logits → use CrossEntropyLoss


model = SmallCNN()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # softmax+CE built in

# training step:
# logits = model(images); loss = loss_fn(logits, labels)
# opt.zero_grad(); loss.backward(); opt.step()
```
Residual block in code
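```python
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        # Two 3x3 convs that keep both channel count and spatial size,
        # so the input can be added to the output unchanged.
        self.c1 = nn.Conv2d(c, c, 3, padding=1)
        self.b1 = nn.BatchNorm2d(c)
        self.c2 = nn.Conv2d(c, c, 3, padding=1)
        self.b2 = nn.BatchNorm2d(c)

    def forward(self, x):
        out = F.relu(self.b1(self.c1(x)))
        out = self.b2(self.c2(out))
        return F.relu(out + x)  # the skip connection
```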
2.9 Pitfalls & practical notes
- Channel order: PyTorch uses NCHW (batch, channels, H, W); TensorFlow defaults to NHWC.
- Don't flatten too early: keep spatial structure through conv/pool, flatten only before the dense head.
- BatchNorm before or after ReLU? Original ResNet: Conv→BN→ReLU. Both work; be consistent.
- Overfitting on small data → augment (random crop/flip), dropout, weight decay, or transfer-learn from a pretrained backbone.
- For modern vision, Vision Transformers (ViT) treat image patches as tokens and feed them to the Transformer of [[04_transformers]] — CNNs and Transformers now coexist.
Next: [[03_rnn_lstm]] — the same gradient machinery, but unrolled through time for sequences.