Sequences (text, audio, time series) have order and variable length. RNNs process them one step at a time, carrying a memory (hidden state) forward. This chapter derives the recurrence, the famous vanishing-gradient problem, and how gates (LSTM/GRU) fix it.
3.1 The vanilla RNN
Intuition
Read tokens one at a time. Keep a running summary ("what I've seen so far"). Update it at each step from the previous summary and the new input.
Math
At time step , input , hidden state :
Crucially, the same weights are reused at every time step (weight sharing across time, analogous to CNN weight sharing across space). is usually zeros.
Unrolling
An RNN over steps is equivalent to a -layer feedforward net where every layer shares weights:
codex1 → [RNN] → h1 → [RNN] → h2 → ... → [RNN] → hT ↑h0 ↑h1 ↑h_{T-1}
Tiny numeric example
, , , inputs , .
The state accumulates history: depends on both (through ) and .
3.2 Backpropagation Through Time (BPTT)
To train, unroll and apply backprop across all time steps. The total loss is the sum over steps:
The gradient w.r.t. the shared recurrent weight accumulates contributions from every time step:
The dangerous term is the product of Jacobians:
Vanishing & exploding gradients
That product has factors. Roughly its magnitude scales like :
- If the dominant eigenvalue of (times ) is < 1 → product → vanishing gradient: the network can't learn long-range dependencies (early inputs have no effect on late losses).
- If > 1 → product blows up → exploding gradient → NaNs.
Mitigations:
- Gradient clipping for explosion: if , rescale .
- Truncated BPTT: only backprop steps back (e.g. 35) — cheaper, limits the product length.
- Gated architectures (LSTM/GRU) for vanishing — the real fix, below.
3.3 LSTM (Long Short-Term Memory)
Intuition
Add a separate cell state that acts like a conveyor belt of memory, modified only by gates that decide what to forget, what to add, and what to output. Because the cell state is updated additively (not by repeated matrix multiply), gradients flow through it without vanishing.
The gate equations (memorize the structure, not the symbols)
At each step, with input and previous :
= concatenation. Each gate is a sigmoid in acting as a soft switch (0 = block, 1 = pass).
Why gradients survive
The cell-state recurrence is , so
When the forget gate , this Jacobian is identity → the product over time stays → no vanishing. The network learns when to remember (keep ) vs forget (drive ). This is the "constant error carousel."
Worked single-step (1-dim, conceptual numbers)
Say . Gates compute (mostly remember), , , .
The cell kept 90% of old memory and wrote half of the new candidate.
Parameter count
4 gates, each a dense layer over of size :
3.4 GRU (Gated Recurrent Unit) — a lighter LSTM
Merges cell and hidden state, uses 2 gates instead of 3. Fewer parameters, often comparable performance.
The update gate interpolates between keeping the old state and adopting the new candidate — same additive-memory benefit as LSTM.
3.5 Architectural patterns for sequences
| Pattern | Shape | Example task |
|---|---|---|
| one-to-one | 1 in → 1 out | image classification (not really RNN) |
| one-to-many | 1 in → seq out | image captioning |
| many-to-one | seq in → 1 out | sentiment classification |
| many-to-many (aligned) | seq in → seq out, same length | POS tagging |
| many-to-many (seq2seq) | seq in → seq out, different length | translation |
- Bidirectional RNN: run one RNN forward and one backward, concatenate hidden states → each position sees both past and future context. Great for tagging; can't be used for autoregressive generation.
- Stacked/deep RNN: feed one RNN's outputs as inputs to another.
- Encoder–decoder (seq2seq): an encoder RNN compresses the input into a context vector; a decoder RNN generates the output from it. This is the direct ancestor of the Transformer encoder-decoder in [[05_architectures]]. Its bottleneck (one fixed context vector) motivated attention, which [[04_transformers]] takes to its full conclusion.
3.6 Code: LSTM for sentiment in PyTorch
pythonimport torch, torch.nn as nn class SentimentLSTM(nn.Module): def __init__(self, vocab_size, emb=128, hidden=256, num_layers=2): super().__init__() self.embed = nn.Embedding(vocab_size, emb, padding_idx=0) self.lstm = nn.LSTM(emb, hidden, num_layers, batch_first=True, bidirectional=True, dropout=0.3) self.fc = nn.Linear(hidden*2, 1) # *2 for bidirectional def forward(self, x): # x: (B, T) token ids e = self.embed(x) # (B, T, emb) out, (h_n, c_n) = self.lstm(e) # out: (B, T, 2*hidden) # concat final forward (h_n[-2]) and backward (h_n[-1]) hidden states final = torch.cat([h_n[-2], h_n[-1]], dim=1) # (B, 2*hidden) return self.fc(final).squeeze(1) # (B,) logits → BCEWithLogitsLoss model = SentimentLSTM(vocab_size=20000) loss_fn = nn.BCEWithLogitsLoss() # sigmoid+BCE fused (see ch.1) opt = torch.optim.Adam(model.parameters(), 1e-3) # gradient clipping — essential for RNNs: # loss.backward() # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0) # opt.step()
A character-level RNN forward, from scratch (NumPy)
pythonimport numpy as np def rnn_step(x, h, Wxh, Whh, bh): return np.tanh(Wxh @ x + Whh @ h + bh) # one time step H, D = 4, 3 Wxh = np.random.randn(H, D)*0.1 Whh = np.random.randn(H, H)*0.1 bh = np.zeros(H) h = np.zeros(H) sequence = [np.random.randn(D) for _ in range(5)] for x in sequence: h = rnn_step(x, h, Wxh, Whh, bh) # state carries forward print("final hidden state:", h.round(3))
3.7 Limitations that motivated Transformers
- Sequential computation — step needs step , so you can't parallelize across time → slow training on long sequences.
- Long-range dependencies — even LSTMs struggle past a few hundred tokens; the path between distant tokens is .
- Fixed-size context vector in seq2seq bottlenecks information.
The fix: attention lets every token directly look at every other token in path length and fully in parallel. That is the subject of [[04_transformers]].
Next: [[04_transformers]] — attention is all you need.