back to knowledge base
module 0221 min read

Convolutional Neural Networks

Convolution math, kernels, stride/padding, pooling, receptive fields, backprop through conv, classic architectures (LeNet→ResNet).

CNNs are MLPs with two inductive biases baked in: locality (pixels near each other matter together) and weight sharing (the same feature detector slides everywhere). This makes them parameter-efficient and translation-equivariant. In layers, a CNN climbs from simple clues to whole objects — edges and color blobs first, then corners and textures, then parts (an eye, an ear), then objects ("this is a cat").


2.1 Why not just use a dense network on images?

A 224×224×3224\times224\times3 image flattened = 150,528 inputs. One hidden layer of 1000 units → 150 million weights in layer 1 alone. CNNs instead learn small kernels (e.g. 3×3×3=273\times3\times3 = 27 weights) reused across the whole image. Fewer parameters, better generalization, and the structure respects images.


2.2 The convolution operation

Intuition

Slide a small kernel (filter) — think of it as a stencil — over the image, one position at a time. At each position you compute a dot product between the kernel and the patch it covers: a high response means "this feature is here." The kernel is tuned to light up on a particular little pattern (a vertical edge, a patch of fur texture), and because the same kernel is reused at every position, the network learns that pattern once and finds it anywhere in the image. The map of responses it produces is a feature map. Different kernels detect different things — edges, textures, and so on.

Math (2D, single channel)

Input XRH×W\mathbf{X}\in\mathbb{R}^{H\times W}, kernel KRkh×kw\mathbf{K}\in\mathbb{R}^{k_h\times k_w}. The (cross-correlation, which is what DL libraries call "convolution") output:

S(i,j)=(XK)(i,j)=m=0kh1n=0kw1X(i+m,j+n)K(m,n)+bS(i,j) = (\mathbf{X} * \mathbf{K})(i,j) = \sum_{m=0}^{k_h-1}\sum_{n=0}^{k_w-1} X(i+m,\, j+n)\,K(m,n) + b

Step-by-step: unrolling that double sum

The \sum\sum looks scary only because it's compressed. Let's expand it for a concrete 2×22\times2 kernel (kh=kw=2k_h=k_w=2), so mm runs over {0,1}\{0,1\} and nn runs over {0,1}\{0,1\}.

Step 1 — write out the inner sum (fix mm, let n=0,1n=0,1):

n=01X(i+m,j+n)K(m,n)=X(i+m,j+0)K(m,0)+X(i+m,j+1)K(m,1)\sum_{n=0}^{1} X(i+m,\,j+n)\,K(m,n) = X(i+m,\,j+0)\,K(m,0) + X(i+m,\,j+1)\,K(m,1)

Step 2 — now wrap the outer sum (let m=0,1m=0,1, so we get the m=0m=0 row plus the m=1m=1 row):

S(i,j)=X(i,j)K(0,0)+X(i,j+1)K(0,1)m=0 row  +  X(i+1,j)K(1,0)+X(i+1,j+1)K(1,1)m=1 row  +  bS(i,j) = \underbrace{X(i,\,j)K(0,0) + X(i,\,j{+}1)K(0,1)}_{m=0\text{ row}} \;+\; \underbrace{X(i{+}1,\,j)K(1,0) + X(i{+}1,\,j{+}1)K(1,1)}_{m=1\text{ row}} \;+\; b

That's the entire formula with no sigmas left — just 4 products added together (a 2×22\times2 kernel has 4 weights), plus the bias bb. A 3×33\times3 kernel would expand the same way into 9 products.

Step 3 — pin it to a real position. To get the top-left output S(0,0)S(0,0), substitute i=0, j=0i=0,\ j=0:

S(0,0)=X(0,0)K(0,0)+X(0,1)K(0,1)+X(1,0)K(1,0)+X(1,1)K(1,1)+bS(0,0) = X(0,0)K(0,0) + X(0,1)K(0,1) + X(1,0)K(1,0) + X(1,1)K(1,1) + b

Notice the image indices X(0,0),X(0,1),X(1,0),X(1,1)X(0,0),X(0,1),X(1,0),X(1,1) are exactly the top-left 2×22\times2 patch of the image — the patch the kernel currently sits on.

Step 4 — slide by one and repeat. For the next output to the right, S(0,1)S(0,1), substitute i=0, j=1i=0,\ j=1:

S(0,1)=X(0,1)K(0,0)+X(0,2)K(0,1)+X(1,1)K(1,0)+X(1,2)K(1,1)+bS(0,1) = X(0,1)K(0,0) + X(0,2)K(0,1) + X(1,1)K(1,0) + X(1,2)K(1,1) + b

Same four kernel weights K(,)K(\cdot,\cdot) — only the image indices shifted right by one column (j:01j{:}\,0\to1). That shift is the kernel sliding, and the weights staying identical is weight-sharing, made literal.

Note: true mathematical convolution flips the kernel (X(im,jn)X(i-m,j-n)); deep-learning frameworks implement cross-correlation (no flip). Since kernels are learned, the flip is irrelevant — the network just learns the flipped kernel. We use the term "convolution" loosely.

Output size formula

For input size WW, kernel kk, padding pp, stride ss:

O=Wk+2ps+1\boxed{O = \left\lfloor \frac{W - k + 2p}{s} \right\rfloor + 1}

Step-by-step: where this formula comes from

We build it up one effect at a time, so each symbol earns its place.

Step 1 — no padding, stride 1. A kk-wide kernel needs kk pixels under it. On a row of WW pixels, its left edge can start at column 0,1,2,0,1,2,\dots up until the kernel's right edge hits the wall — i.e. the last valid start is column WkW-k. Counting starts 00 through WkW-k inclusive:

O=(Wk)+1O = (W - k) + 1

(Example: W=5, k=3W=5,\ k=3 \Rightarrow starts at columns 0,1,20,1,2, that's 53+1=35-3+1 = 3 outputs.)

Step 2 — add padding pp. Gluing pp zero-pixels onto each side makes the row W+2pW+2p wide (the 22 is "both sides"). Substitute WW+2pW \to W+2p:

O=(W+2pk)+1=(Wk+2p)+1O = (W + 2p - k) + 1 = (W - k + 2p) + 1

Step 3 — add stride ss. With stride 1 the start columns are 0,1,2,0,1,2,\dots. With stride ss we only keep every ss-th start: 0,s,2s,0, s, 2s, \dots. Over a span of (Wk+2p)(W-k+2p) pixels, taking steps of size ss fits Wk+2ps\tfrac{W-k+2p}{s} steps, then +1+1 for the starting position itself:

O=Wk+2ps+1O = \frac{W - k + 2p}{s} + 1

Step 4 — round down. If ss doesn't divide evenly, the kernel can't land on a fractional pixel, so we discard the leftover with the floor \lfloor\cdot\rfloor, giving the boxed result   O=Wk+2ps+1\;O = \big\lfloor \tfrac{W-k+2p}{s}\big\rfloor + 1.

Worked check: W=32, k=3, p=1, s=1W=32,\ k=3,\ p=1,\ s=1:

O=323+2(1)1+1=31+1=32.O = \left\lfloor \frac{32 - 3 + 2(1)}{1}\right\rfloor + 1 = \lfloor 31 \rfloor + 1 = 32.

Output width equals input width — that is precisely what same padding (p=(k1)/2=1p=(k-1)/2 = 1 for k=3k=3) is engineered to do.

  • Padding pp: add zeros around the border. same padding keeps O=WO=W (needs p=(k1)/2p=(k-1)/2 for stride 1). valid = no padding.
  • Stride ss: step size of the slide. s=2s=2 halves spatial size.

Multi-channel & multi-filter

Real conv layers have input depth CinC_{in} and produce CoutC_{out} feature maps. A single filter is KRkh×kw×Cin\mathbf{K}\in\mathbb{R}^{k_h\times k_w\times C_{in}} (it spans all input channels), and there are CoutC_{out} of them:

Sc(i,j)=d=0Cin1mnXd(i+m,j+n)Kc,d(m,n)+bc,c=1CoutS_{c}(i,j) = \sum_{d=0}^{C_{in}-1}\sum_{m}\sum_{n} X_d(i+m,j+n)\,K_{c,d}(m,n) + b_c, \quad c = 1\dots C_{out}

Step-by-step: expanding the channel sum

The only new symbol is the extra d\sum_{d}, where dd indexes the input channels (e.g. d=0,1,2d=0,1,2 for red, green, blue). Take a 2×22\times2 kernel over Cin=3C_{in}=3 channels and unroll it for one output position of one filter cc:

Step 1 — inner spatial sum, done per channel (this is the same 4-term expansion as before, but now there's one copy per channel dd):

Td(i,j)=Xd(i,j)Kc,d(0,0)+Xd(i,j+1)Kc,d(0,1)+Xd(i+1,j)Kc,d(1,0)+Xd(i+1,j+1)Kc,d(1,1)T_d(i,j) = X_d(i,j)K_{c,d}(0,0) + X_d(i,j{+}1)K_{c,d}(0,1) + X_d(i{+}1,j)K_{c,d}(1,0) + X_d(i{+}1,j{+}1)K_{c,d}(1,1)

Step 2 — sum those per-channel results across all channels, then add the bias:

Sc(i,j)=T0(i,j)red+T1(i,j)green+T2(i,j)blue+bcS_c(i,j) = \underbrace{T_0(i,j)}_{\text{red}} + \underbrace{T_1(i,j)}_{\text{green}} + \underbrace{T_2(i,j)}_{\text{blue}} + b_c

So a 3-channel 2×22\times2 filter does 4×3=124\times3 = 12 multiplications, sums all 12 into one output number, adds the bias. The 3 input grids collapse into 1 output grid. Run CoutC_{out} different filters and you get CoutC_{out} output grids.

Step-by-step: the parameter count

Build (khkwCin+1)Cout(k_h \cdot k_w \cdot C_{in} + 1)\cdot C_{out} piece by piece:

weights in one filter-cube=kh×kw×Cin+ its single bias=khkwCin+1× number of filters=(khkwCin+1)Cout\begin{aligned} \text{weights in one filter-cube} &= k_h \times k_w \times C_{in} \\ \text{+ its single bias} &= k_h \cdot k_w \cdot C_{in} + 1 \\ \times \text{ number of filters} &= (k_h \cdot k_w \cdot C_{in} + 1)\cdot C_{out} \end{aligned}

Plug in a real layer (3×33\times3 kernel, Cin=3C_{in}=3, Cout=64C_{out}=64):

(333+1)64=(27+1)64=2864=1,792 parameters.(3 \cdot 3 \cdot 3 + 1)\cdot 64 = (27 + 1)\cdot 64 = 28 \cdot 64 = 1{,}792 \text{ parameters.}

Compare to the dense layer from §2.1 (≈150 million). And notice: HH and WW — the image's height and width — appear nowhere in the count. The same 1,792 numbers process a 32×3232\times32 image or a 4000×40004000\times4000 image, because the one small filter is reused at every position. That size-independence is the central efficiency win of convolution.

Tiny numeric example — every step shown

Input 3×33\times3, kernel 2×22\times2, stride 1, no padding, b=0b=0:

X=[120013211],K=[1001]\mathbf{X} = \begin{bmatrix}1 & 2 & 0\\ 0 & 1 & 3\\ 2 & 1 & 1\end{bmatrix}, \quad \mathbf{K} = \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix}

First, the output size. Plug into the formula O=(Wk+2p)/s+1O = \lfloor (W-k+2p)/s\rfloor + 1 with W=3, k=2, p=0, s=1W=3,\ k=2,\ p=0,\ s=1:

O=32+2(0)1+1=1+1=2output is 2×2.O = \left\lfloor \frac{3 - 2 + 2(0)}{1} \right\rfloor + 1 = \lfloor 1 \rfloor + 1 = 2 \quad\Rightarrow\quad \text{output is } 2\times2.

So the kernel will sit in 4 positions (top-left, top-right, bottom-left, bottom-right) and produce 4 numbers.


Position S(0,0)S(0,0) — kernel on the top-left patch. The kernel covers rows {0,1}\{0,1\}, cols {0,1}\{0,1\}:

patch=[1201],K=[1001]\text{patch} = \begin{bmatrix}\mathbf{1} & \mathbf{2}\\ \mathbf{0} & \mathbf{1}\end{bmatrix}, \qquad \mathbf{K} = \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix}

Multiply the two grids cell-by-cell (element-wise), keeping each product separate:

[1×12×00×01×(1)]=[1001]\begin{bmatrix}1\times1 & 2\times0\\ 0\times0 & 1\times(-1)\end{bmatrix} = \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix}

Now add up all four products:

S(0,0)=1+0+0+(1)=0S(0,0) = 1 + 0 + 0 + (-1) = \mathbf{0}

Position S(0,1)S(0,1) — slide one column right. Now covering rows {0,1}\{0,1\}, cols {1,2}\{1,2\}:

patch=[2013]    [1001]=[2003]    S(0,1)=2+0+03=1\text{patch} = \begin{bmatrix}2 & 0\\ 1 & 3\end{bmatrix} \;\odot\; \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix} = \begin{bmatrix}2 & 0\\ 0 & -3\end{bmatrix} \;\Rightarrow\; S(0,1) = 2 + 0 + 0 - 3 = \mathbf{-1}

Position S(1,0)S(1,0) — back to the left, slide one row down. Rows {1,2}\{1,2\}, cols {0,1}\{0,1\}:

patch=[0121]    [1001]=[0001]    S(1,0)=0+0+01=1\text{patch} = \begin{bmatrix}0 & 1\\ 2 & 1\end{bmatrix} \;\odot\; \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix} = \begin{bmatrix}0 & 0\\ 0 & -1\end{bmatrix} \;\Rightarrow\; S(1,0) = 0 + 0 + 0 - 1 = \mathbf{-1}

Position S(1,1)S(1,1) — bottom-right patch. Rows {1,2}\{1,2\}, cols {1,2}\{1,2\}:

patch=[1311]    [1001]=[1001]    S(1,1)=1+0+01=0\text{patch} = \begin{bmatrix}1 & 3\\ 1 & 1\end{bmatrix} \;\odot\; \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix} = \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix} \;\Rightarrow\; S(1,1) = 1 + 0 + 0 - 1 = \mathbf{0}

Assemble the four numbers into the output grid:

S=[S(0,0)S(0,1)S(1,0)S(1,1)]=[0110]\mathbf{S} = \begin{bmatrix}S(0,0) & S(0,1)\\ S(1,0) & S(1,1)\end{bmatrix} = \begin{bmatrix}0 & -1\\ -1 & 0\end{bmatrix}

(\odot means element-wise multiply.) This kernel computes (top-left pixel) − (bottom-right pixel), so it outputs 00 on a smooth patch and a non-zero value where the image changes along the main diagonal — a diagonal edge detector.


2.3 Pooling

Downsamples feature maps → reduces compute, adds small translation invariance. Think of it as shrinking the image while keeping the gist, like making a thumbnail: max-pooling looks at each little 2×2 block and keeps only the strongest signal ("was the feature present anywhere in this neighborhood? keep the loudest evidence"). That (a) makes everything smaller and faster, and (b) gives a bit of "don't-care-exactly-where" tolerance — a cat shifted by 1 pixel still pools to nearly the same result. It has no weights to learn; it's a fixed shrinking rule.

  • Max pooling 2×22\times2, stride 2: take the max in each window.
[1324561212300114]2x2 max[6424]\begin{bmatrix}1 & 3 & 2 & 4\\ 5 & 6 & 1 & 2\\ 1 & 2 & 3 & 0\\ 0 & 1 & 1 & 4\end{bmatrix} \xrightarrow{\text{2x2 max}} \begin{bmatrix}6 & 4\\ 2 & 4\end{bmatrix}
  • Average pooling: mean instead of max.
  • Global average pooling (GAP): average each entire feature map to one number — common before the final classifier (replaces huge dense layers).

Pooling has no learnable parameters. Backprop through max-pool routes the gradient only to the position that was the max (others get 0); through avg-pool it spreads gradient equally.


2.4 Receptive field

The receptive field of a neuron = the region of the input that influences it. Early layers only see tiny patches (a few pixels), so they can only spot tiny things — edges, dots. A neuron deep in the network is built from neurons below it, which were built from neurons below them, so it indirectly "sees" a much larger chunk of the original image and can recognize big things (a whole face, a car). That is how a CNN climbs from edges → textures → parts → objects. The receptive field grows with depth as:

RF=RF1+(k1)i<siRF_{\ell} = RF_{\ell-1} + (k_\ell - 1)\prod_{i<\ell} s_i

Step-by-step: stacking three 3×33\times3 layers (all stride 1)

Read the formula as: new receptive field = previous receptive field + (this kernel's reach minus 1) × (how far each step now travels, which is the product of all earlier strides). With every stride =1=1, that product is always 11, so each layer just adds (k1)=(31)=2(k-1) = (3-1) = 2.

RF0=1(before any layer, a neuron "sees" 1 input pixel)RF1=RF0+(31)1=1+2=3(one 3×3 layer sees a 3×3 patch)RF2=RF1+(31)1=3+2=5(stacking a second layer)RF3=RF2+(31)1=5+2=7(third layer7×7 effective view)\begin{aligned} RF_0 &= 1 &&\text{(before any layer, a neuron "sees" 1 input pixel)}\\ RF_1 &= RF_0 + (3-1)\cdot 1 = 1 + 2 = \mathbf{3} &&\text{(one }3\times3\text{ layer sees a }3\times3\text{ patch)}\\ RF_2 &= RF_1 + (3-1)\cdot 1 = 3 + 2 = \mathbf{5} &&\text{(stacking a second layer)}\\ RF_3 &= RF_2 + (3-1)\cdot 1 = 5 + 2 = \mathbf{7} &&\text{(third layer} \Rightarrow 7\times7\text{ effective view)} \end{aligned}

So three small 3×33\times3 layers "see" the same 7×77\times7 region as one big 7×77\times7 kernel — but compare the parameter counts (per channel):

three 3×3:  3×(32)=3×9=27vs.one 7×7:  72=49\text{three } 3\times3:\; 3\times(3^2) = 3 \times 9 = \mathbf{27} \qquad\text{vs.}\qquad \text{one } 7\times7:\; 7^2 = \mathbf{49}

Same field of view, 27 vs 49 params, and two extra ReLU nonlinearities in between → deeper + smaller kernels = more expressive, fewer params. This is the core VGG insight.


2.5 A full CNN layer stack (anatomy)

code
INPUT (32x32x3)
  → CONV 3x3, 32 filters, pad=same  → (32x32x32)  → ReLU
  → CONV 3x3, 32 filters, pad=same  → (32x32x32)  → ReLU
  → MAXPOOL 2x2                       → (16x16x32)
  → CONV 3x3, 64 filters             → (16x16x64)  → ReLU
  → MAXPOOL 2x2                       → (8x8x64)
  → FLATTEN                           → (4096,)
  → DENSE 128 → ReLU → DROPOUT 0.5
  → DENSE 10  → SOFTMAX               → class probabilities

Pattern: [Conv → activation] × n → Pool, repeated, with channels increasing as spatial size shrinks; then a classifier head.


2.6 Backprop through convolution (the key result)

First, the big picture — how any network learns

A freshly-built CNN is useless: every kernel is filled with random numbers, so it "sees" nonsense and guesses wildly. Training fixes this with a loop of two halves:

  1. Forward pass — push an image through the network and see what it guesses (say it says "70% dog" when the answer is "cat").
  2. Backward pass (backprop) — measure how wrong that guess was (the loss), then walk backwards through the network asking, at every single knob, one question: "if I nudged you a tiny bit, would the final answer have been less wrong?" The answer is the gradient — a direction and a strength of blame. Knobs that pushed hard toward the wrong answer get a big "turn this way" signal; knobs that didn't matter get nearly zero.

The optimizer then nudges every knob a hair in the direction that reduces the error, and you repeat on the next image. Millions of tiny nudges later, the once-random kernels have organized themselves into edge-detectors, texture-detectors, eye-detectors — nobody told them to; those just happen to be the settings that minimize the error. Backprop is the bookkeeping that figures out, fairly, how much each knob is to blame. For the general mechanics see [[01_deep_learning_foundations]]; here we only need the part that's special to convolution.

What's special about a conv layer

In a normal dense layer every knob is used once per prediction. In a conv layer the same kernel is reused at every position (that was the whole point — weight sharing), so a single kernel weight is responsible for the output at thousands of positions at once. Each of those positions comes back with its own complaint ("you made me 0.3 too high here, 0.1 too low there"), and the weight's total blame is simply the sum of all those complaints. That summing-up is the only twist convolution adds to ordinary backprop.

A conv layer has to compute three things during the backward pass. Here's what each one is for, in plain terms, before the formulas:

  • (A) How should I change the kernel? — the gradient w.r.t. the kernel weights. This is what actually gets learned; the optimizer uses it to update the filter. (Computed by summing each weight's complaints, as above.)
  • (B) What blame do I pass to the layer below me? — the gradient w.r.t. the layer's input. The conv layer isn't the first layer; the layers before it also have knobs to train, and they can only learn if this layer tells them how much their output contributed to the error. So the layer must "translate" its own output-blame back into input-blame and hand it down the chain.
  • (C) The bias — just the sum of all the output complaints (the bias touched every output equally).

The two formulas, and why they look the way they do

Because convolution is a linear operation, both of these gradients turn out to be convolutions too — which is beautiful, because it means the same fast machinery runs forwards and backwards.

(A) Gradient w.r.t. the kernel = slide the output-gradient over the input and correlate them (every position's complaint, weighted by what the input was there, summed up):

LK(m,n)=ijLS(i,j)X(i+m,j+n)\frac{\partial L}{\partial K(m,n)} = \sum_{i}\sum_{j} \frac{\partial L}{\partial S(i,j)}\, X(i+m, j+n)

Read it as: "how much should weight K(m,n)K(m,n) change? = for every output position (i,j)(i,j), take its blame L/S(i,j)\partial L/\partial S(i,j), multiply by the input pixel that this weight was multiplied against there, and add them all up." That's the "sum the 5,000 complaints" idea written as math.

(B) Gradient w.r.t. the input = a full convolution of the output-gradient with the flipped kernel (also called a "transposed convolution"):

LX(i,j)=mnLS(im,jn)K(m,n)\frac{\partial L}{\partial X(i,j)} = \sum_{m}\sum_{n} \frac{\partial L}{\partial S(i-m,\,j-n)}\, K(m,n)

Why the flip? In the forward pass one input pixel got "smeared" into several output positions (it sat under the kernel several times as the kernel slid past). To collect that input pixel's total blame you have to gather complaints back from all the output positions it influenced — the forward slide run in reverse — and running a slide in reverse is exactly what flipping the kernel does. (It's the same operation used to upsample images in segmentation and GANs.)

This is why conv nets are trainable with the exact same SGD machinery from [[01_deep_learning_foundations]] — nothing about the optimizer changes, only the layer's local recipe for turning output-blame into (A) kernel-updates and (B) input-blame.

Fully worked numeric example — every gradient, step by step

We reuse the exact X\mathbf{X} and K\mathbf{K} from §2.2, and assume the backward pass has handed us this gradient on the output (the "complaints," one per output position):

X=[120013211],K=[1001],LS=[1201]    (call these g00,g01,g10,g11)\mathbf{X} = \begin{bmatrix}1 & 2 & 0\\ 0 & 1 & 3\\ 2 & 1 & 1\end{bmatrix},\quad \mathbf{K} = \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix},\quad \frac{\partial L}{\partial S} = \begin{bmatrix} 1 & 2 \\ 0 & -1 \end{bmatrix} \;\;\Big(\text{call these } g_{00},g_{01},g_{10},g_{11}\Big)

(A) Gradient w.r.t. the kernel — compute all 4 weights

Formula: LK(m,n)=ijgijX(i+m,j+n)\displaystyle \frac{\partial L}{\partial K(m,n)} = \sum_{i}\sum_{j} g_{ij}\, X(i+m,\, j+n). For each kernel weight, the offset (m,n)(m,n) just shifts which 2×22\times2 block of the input we read.

K(0,0)K(0,0) — offset (0,0)(0,0), read X(i,j)X(i,j), i.e. the top-left 2×22\times2 block [1201]\begin{bmatrix}1&2\\0&1\end{bmatrix}:

LK(0,0)=g00(1)+g01(2)+g10(0)+g11(1)=(1)(1)+(2)(2)+(0)(0)+(1)(1)=1+4+01=4\frac{\partial L}{\partial K(0,0)} = g_{00}(1) + g_{01}(2) + g_{10}(0) + g_{11}(1) = (1)(1)+(2)(2)+(0)(0)+(-1)(1) = 1+4+0-1 = \mathbf{4}

K(0,1)K(0,1) — offset (0,1)(0,1), read X(i,j+1)X(i,j{+}1), the block [2013]\begin{bmatrix}2&0\\1&3\end{bmatrix}:

=g00(2)+g01(0)+g10(1)+g11(3)=(1)(2)+(2)(0)+(0)(1)+(1)(3)=2+0+03=1= g_{00}(2) + g_{01}(0) + g_{10}(1) + g_{11}(3) = (1)(2)+(2)(0)+(0)(1)+(-1)(3) = 2+0+0-3 = \mathbf{-1}

K(1,0)K(1,0) — offset (1,0)(1,0), read X(i+1,j)X(i{+}1,j), the block [0121]\begin{bmatrix}0&1\\2&1\end{bmatrix}:

=g00(0)+g01(1)+g10(2)+g11(1)=0+2+01=1= g_{00}(0) + g_{01}(1) + g_{10}(2) + g_{11}(1) = 0+2+0-1 = \mathbf{1}

K(1,1)K(1,1) — offset (1,1)(1,1), read X(i+1,j+1)X(i{+}1,j{+}1), the block [1311]\begin{bmatrix}1&3\\1&1\end{bmatrix}:

=g00(1)+g01(3)+g10(1)+g11(1)=1+6+01=6= g_{00}(1) + g_{01}(3) + g_{10}(1) + g_{11}(1) = 1+6+0-1 = \mathbf{6}

Assemble: LK=[4116]\displaystyle \frac{\partial L}{\partial \mathbf{K}} = \begin{bmatrix} 4 & -1 \\ 1 & 6 \end{bmatrix}. The optimizer will nudge K\mathbf{K} in the opposite direction (gradient descent): KKηLK\mathbf{K} \leftarrow \mathbf{K} - \eta\frac{\partial L}{\partial \mathbf{K}}.

(C) Gradient w.r.t. the bias — just sum the complaints

The bias was added to every output equally, so its gradient is the plain sum:

Lb=g00+g01+g10+g11=1+2+0+(1)=2\frac{\partial L}{\partial b} = g_{00}+g_{01}+g_{10}+g_{11} = 1+2+0+(-1) = \mathbf{2}

(B) Gradient w.r.t. the input — the "flip", made concrete

Step 1 — flip the kernel 180° (this is what "convolve with the flipped kernel" means). Reverse rows and columns:

K=[1001]  rotate 180  Kflip=[1001]\mathbf{K} = \begin{bmatrix}1 & 0\\ 0 & -1\end{bmatrix} \;\xrightarrow{\text{rotate }180^\circ}\; \mathbf{K}^{\text{flip}} = \begin{bmatrix}-1 & 0\\ 0 & 1\end{bmatrix}

Step 2 — apply the formula LX(i,j)=m,ng(im),(jn)K(m,n)\displaystyle \frac{\partial L}{\partial X(i,j)} = \sum_{m,n} g_{(i-m),(j-n)}\,K(m,n), where any gg index outside {0,1}\{0,1\} counts as 00. Plugging in our KK (only K(0,0)=1K(0,0)=1 and K(1,1)=1K(1,1)=-1 are non-zero) collapses it to:

LX(i,j)=gi,j1+g(i1),(j1)(1)=gi,jg(i1),(j1)\frac{\partial L}{\partial X(i,j)} = g_{i,j}\cdot 1 + g_{(i-1),(j-1)}\cdot(-1) = g_{i,j} - g_{(i-1),(j-1)}

Step 3 — evaluate at all 9 input positions (remember gg is only defined on the 2×22\times2 grid; outside it is 00):

X(0,0):    g00g(1)(1)=10=1X(0,1):    g01g(1)0=20=2X(0,2):    00=0X(1,0):    g10g0(1)=00=0X(1,1):    g11g00=11=2X(1,2):    0g01=02=2X(2,0):    00=0X(2,1):    0g10=00=0X(2,2):    0g11=0(1)=1\begin{aligned} X(0,0):&\;\; g_{00} - g_{(-1)(-1)} = 1 - 0 = \mathbf{1} &\quad X(0,1):&\;\; g_{01} - g_{(-1)0} = 2 - 0 = \mathbf{2} &\quad X(0,2):&\;\; 0 - 0 = \mathbf{0} \\ X(1,0):&\;\; g_{10} - g_{0(-1)} = 0 - 0 = \mathbf{0} &\quad X(1,1):&\;\; g_{11} - g_{00} = -1 - 1 = \mathbf{-2} &\quad X(1,2):&\;\; 0 - g_{01} = 0 - 2 = \mathbf{-2} \\ X(2,0):&\;\; 0 - 0 = \mathbf{0} &\quad X(2,1):&\;\; 0 - g_{10} = 0 - 0 = \mathbf{0} &\quad X(2,2):&\;\; 0 - g_{11} = 0 -(-1) = \mathbf{1} \end{aligned} LX=[120022001]\frac{\partial L}{\partial \mathbf{X}} = \begin{bmatrix} 1 & 2 & 0 \\ 0 & -2 & -2 \\ 0 & 0 & 1 \end{bmatrix}

Step 4 — sanity-check one value the slow way. X(1,1)X(1,1) sat in all four patches during the forward pass. Look back at §2.2: it was multiplied by K(1,1)=1K(1,1)=-1 in S(0,0)S(0,0) and by K(0,0)=1K(0,0)=1 in S(1,1)S(1,1) (and by 00 in the other two). So its total blame is

LX(1,1)=g00K(1,1)+g11K(0,0)=(1)(1)+(1)(1)=2.\frac{\partial L}{\partial X(1,1)} = g_{00}\cdot K(1,1) + g_{11}\cdot K(0,0) = (1)(-1) + (-1)(1) = -2. \checkmark

This matrix is the "complaint" that gets handed to the layer below, which then runs its own (A)/(B)/(C) — and that hand-off, repeated layer by layer, is the entire backward pass.


2.7 Classic architectures (evolution & the idea each added)

YearModelKey idea introduced
1998LeNet-5First successful CNN (digits); conv→pool→FC template
2012AlexNetReLU + dropout + GPUs; won ImageNet, started the deep-learning boom
2014VGGStacks of small 3×33\times3 convs; depth matters; very uniform
2014GoogLeNet/InceptionParallel multi-scale "inception" blocks; 1×11\times1 convs to reduce channels
2015ResNetResidual/skip connections → trains 100+ layers
2017DenseNetConnect each layer to all later layers
2019EfficientNetCompound scaling of depth/width/resolution

The 1×11\times1 convolution

A 1×11\times1 kernel mixes information across channels at each pixel (a per-pixel dense layer). Used to cheaply change channel count (dimensionality reduction) — central to Inception and bottleneck blocks.

Residual connection (ResNet) — why it works

A residual block computes:

y=F(x,{W})+x\mathbf{y} = \mathcal{F}(\mathbf{x}, \{\mathbf{W}\}) + \mathbf{x}

The layer learns the residual F=yx\mathcal{F}=\mathbf{y}-\mathbf{x} instead of the full mapping.

Very deep plain networks paradoxically got worse — the blame signal faded to nothing before reaching the early layers, like a whisper passed down a line of 100 people. The fix is a shortcut wire (output = block output + untouched input): because the input is added straight through, the blame signal always has a clear highway back to earlier layers and can't fade away. Each block also only has to learn the small change it wants to make rather than reinvent the whole signal, which is far easier. Concretely, the gradient gets an identity path:

Lx=Ly(1+Fx)\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}}\left(1 + \frac{\partial \mathcal{F}}{\partial \mathbf{x}}\right)

The "1+1+" guarantees gradients flow even when F/x\partial\mathcal{F}/\partial\mathbf{x} is tiny → solves vanishing gradients in very deep nets. This one idea (skip connections) unlocked 100+ layer networks and reappears inside every Transformer in [[04_transformers]].


2.8 Code: a CNN in PyTorch (CIFAR-style)

python
import torch, torch.nn as nn, torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # (B,32,32,32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # (B,64,32,32)
        self.pool  = nn.MaxPool2d(2, 2)                            # halves H,W
        self.bn1   = nn.BatchNorm2d(32)
        self.bn2   = nn.BatchNorm2d(64)
        self.fc1   = nn.Linear(64*8*8, 128)
        self.drop  = nn.Dropout(0.5)
        self.fc2   = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.bn1(self.conv1(x))))   # (B,32,16,16)
        x = self.pool(F.relu(self.bn2(self.conv2(x))))   # (B,64,8,8)
        x = x.flatten(1)                                  # (B, 4096)
        x = self.drop(F.relu(self.fc1(x)))
        return self.fc2(x)                                # logits → use CrossEntropyLoss

model = SmallCNN()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                           # softmax+CE built in
# training step:
# logits = model(images); loss = loss_fn(logits, labels)
# opt.zero_grad(); loss.backward(); opt.step()

Residual block in code

python
class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.c1 = nn.Conv2d(c, c, 3, padding=1); self.b1 = nn.BatchNorm2d(c)
        self.c2 = nn.Conv2d(c, c, 3, padding=1); self.b2 = nn.BatchNorm2d(c)
    def forward(self, x):
        out = F.relu(self.b1(self.c1(x)))
        out = self.b2(self.c2(out))
        return F.relu(out + x)          # the skip connection

2.9 Pitfalls & practical notes

  • Channel order: PyTorch uses NCHW (batch, channels, H, W); TensorFlow defaults to NHWC.
  • Don't flatten too early: keep spatial structure through conv/pool, flatten only before the dense head.
  • BatchNorm before or after ReLU? Original ResNet: Conv→BN→ReLU. Both work; be consistent.
  • Overfitting on small data → augment (random crop/flip), dropout, weight decay, or transfer-learn from a pretrained backbone.
  • For modern vision, Vision Transformers (ViT) treat image patches as tokens and feed them to the Transformer of [[04_transformers]] — CNNs and Transformers now coexist.

Next: [[03_rnn_lstm]] — the same gradient machinery, but unrolled through time for sequences.