The Transformer (Vaswani et al., 2017, "Attention Is All You Need") replaced recurrence with attention, enabling full parallelism and direct long-range connections. It is the foundation of BERT, GPT, T5, and nearly every modern LLM. We build it from the single most important idea: self-attention.
4.1 Tokenization & embeddings (the input)
Before any attention, text becomes vectors:
- Tokenize: split text into tokens (subwords) via Byte-Pair Encoding (BPE) / WordPiece / SentencePiece. E.g. "unhappiness" → ["un", "happiness"] or ["un", "happi", "ness"]. Each token maps to an integer id.
- Embed: an embedding matrix $E \in \mathbb{R}^{V \times d}$ ($V$ = vocab size, $d$ = model dim) maps each id to a $d$-dim vector via lookup. These are learned.
- Add positional information (next section) because attention itself is order-agnostic.
Result: input sequence of $n$ tokens → matrix $X \in \mathbb{R}^{n \times d}$.
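The token → id → vector pipeline above can be sketched in a few lines. This is a toy illustration only: the four-entry vocabulary, the greedy longest-match splitter `toy_tokenize` (a stand-in for a trained BPE/WordPiece tokenizer), and the random embedding values are all assumptions made for the example.

```python
import random

# Hypothetical toy vocabulary; a real subword vocab has tens of thousands of entries.
VOCAB = {"un": 0, "happi": 1, "ness": 2, "happiness": 3}

def toy_tokenize(word, vocab):
    """Greedy longest-match split (a stand-in for a trained subword tokenizer)."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # try the longest piece first
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {word[start:]!r}")
    return tokens

def make_embedding(vocab_size, d, seed=0):
    """Embedding matrix E of shape (vocab_size, d); random here, learned in practice."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(vocab_size)]

tokens = toy_tokenize("unhappiness", VOCAB)   # ['un', 'happiness']
ids = [VOCAB[t] for t in tokens]              # [0, 3]
E = make_embedding(len(VOCAB), d=8)
X = [E[i] for i in ids]                       # n x d matrix: one row per token
```

The lookup is just row selection: `X` has one row per token and `d` columns, which is the matrix the rest of the model consumes.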
4.2 Positional encoding
Attention treats the input as a set — it has no inherent notion of order. We must inject position.
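Sinusoidal (original Transformer)
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
- $pos$ = position index, $i$ = dimension index.
- Different dimensions oscillate at different frequencies (wavelengths from $2\pi$ to $10000 \cdot 2\pi$).
- Key property: $PE_{pos+k}$ is a linear function of $PE_{pos}$ (a rotation), so the model can learn to attend by relative position. The original paper hypothesized this may also let the model extrapolate to sequence lengths longer than those seen in training.
We add it: $X = \text{Embed}(\text{tokens}) + PE$.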
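A minimal sketch of the sinusoidal table, plus a check of the "linear function of position" property. The function names are illustrative; `shift_pair` assumes the standard angle-addition identities and only demonstrates the rotation, it is not part of any library.

```python
import math

def sinusoidal_pe(n, d, base=10000.0):
    """Return an n x d table: even columns use sin, odd columns use cos."""
    pe = [[0.0] * d for _ in range(n)]
    for pos in range(n):
        for j in range(0, d, 2):                  # j is the even column index (2i in the formula)
            angle = pos / (base ** (j / d))
            pe[pos][j] = math.sin(angle)
            if j + 1 < d:
                pe[pos][j + 1] = math.cos(angle)
    return pe

def shift_pair(sin_val, cos_val, k, omega):
    """Advance one (sin, cos) pair by k positions using a fixed 2x2 rotation."""
    c, s = math.cos(k * omega), math.sin(k * omega)
    return sin_val * c + cos_val * s, cos_val * c - sin_val * s

pe = sinusoidal_pe(n=16, d=8)
# Column pair (0, 1) has frequency omega = 1; rotating position 3 by k = 4 gives position 7.
shifted = shift_pair(pe[3][0], pe[3][1], k=4, omega=1.0)
```

The rotation depends only on the offset `k`, not on the starting position, which is what lets a linear layer pick up relative offsets.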
Modern variants
- Learned absolute positions (BERT, GPT-2): a trainable position embedding table.
- RoPE (Rotary Position Embedding): rotates query/key vectors by an angle proportional to position → encodes relative position directly in the dot product. Used in LLaMA, GPT-NeoX.
- ALiBi: adds a distance-based linear bias to attention scores; great length extrapolation.
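To make the RoPE claim concrete, here is a small sketch showing that the rotated query-key dot product depends only on the distance between positions. It assumes the adjacent-pair convention (dimensions `2j`, `2j+1` rotated together) and an even vector length; real implementations differ in how they pair dimensions, and the vectors below are arbitrary.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each pair (vec[j], vec[j+1]), j even, by the angle pos * theta_j."""
    d = len(vec)
    out = list(vec)
    for j in range(0, d - 1, 2):
        theta = base ** (-j / d)                  # lower pairs rotate faster
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[j], vec[j + 1]
        out[j], out[j + 1] = x * c - y * s, x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two different absolute locations:
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
# A different offset (1 position apart) gives a different score:
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 4))
```

`score_near` and `score_far` agree up to floating-point error, while `score_other` differs, so the attention score carries relative rather than absolute position.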
4.3 Self-attention — the heart of everything
Intuition
For each token, ask: "which other tokens are relevant to me, and how much?" Then build that token's new representation as a weighted blend of all tokens' values. "Attention" = these learned relevance weights.
Analogy — a soft dictionary lookup (the mechanics of how a word "asks around"):
- Query (Q): what I'm looking for. ("I'm the word 'it', I'm looking for a noun I might refer to.")
- Key (K): what each token advertises about itself. ("I'm 'animal', a noun, a subject.")
- Value (V): the actual content a token hands over if matched.
You match your query against all keys to get relevance weights (a good Query–Key match → high weight), then take a weighted blend of the values. It's like searching a library: your question (query) is compared to each book's title/index (key), and you walk away with a mix of the contents (values) of the best-matching books: not just one book, but a blend weighted by relevance.
Math — Scaled Dot-Product Attention
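From input $X \in \mathbb{R}^{n \times d}$, project into queries, keys, values with learned matrices $W^Q, W^K \in \mathbb{R}^{d \times d_k}$, $W^V \in \mathbb{R}^{d \times d_v}$:
$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$
Then:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$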
Step by step:
- Scores $S = QK^\top \in \mathbb{R}^{n \times n}$. Entry $S_{ij}$ = how much token $i$ attends to token $j$ (dot product = similarity).
- Scale by $1/\sqrt{d_k}$. Why: if $q, k$ have zero-mean, unit-variance independent entries, $q \cdot k$ has variance $d_k$. Large $d_k$ → large scores → softmax saturates → tiny gradients. Dividing by $\sqrt{d_k}$ restores unit variance.
- Softmax over each row → attention weights $A$, each row sums to 1.
- Weighted sum of values: output $Z = AV$. Row $i$ = token $i$'s new, context-aware representation.
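The variance argument behind the scaling step is easy to check empirically. The sketch below draws random standard-normal queries and keys (an assumption matching the "unit-variance independent entries" premise) and estimates the variance of their dot product with and without the division by the square root of `d_k`.

```python
import math
import random

def dot_variance(d_k, trials=5000, scaled=False, seed=0):
    """Monte Carlo estimate of Var(q . k) for standard-normal q, k of length d_k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        samples.append(s / math.sqrt(d_k) if scaled else s)
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

raw_var = dot_variance(64)                    # close to d_k = 64
scaled_var = dot_variance(64, scaled=True)    # close to 1
```

With `d_k = 64` the raw scores have a standard deviation around 8, which is already enough to push softmax toward a near one-hot output; the scaled scores stay near unit variance.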
Fully worked numeric example (do this once by hand!)
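Two tokens, $d_k = 2$. Suppose after projection (values chosen for easy arithmetic):
$$Q = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad K = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad V = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
Scores $QK^\top$:
$$QK^\top = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}$$
Scale by $\sqrt{2} \approx 1.414$: $\begin{bmatrix} 0.707 & 0.707 \\ 0.707 & 1.414 \end{bmatrix}$. Softmax rows:
- Row 1: equal scores → $[0.5,\ 0.5]$.
- Row 2: $e^{0.707} \approx 2.03$, $e^{1.414} \approx 4.11$, sum $\approx 6.14$ → $[0.33,\ 0.67]$.
Output $Z = AV$ (attention weights times values):
- $z_1 = 0.5\,[1, 2] + 0.5\,[3, 4] = [2.0,\ 3.0]$.
- $z_2 = 0.33\,[1, 2] + 0.67\,[3, 4] \approx [2.34,\ 3.34]$.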
Token 1 blended both values equally; token 2 leaned toward value 2. This is attention doing its job: mixing information across positions based on learned similarity.
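A hand calculation like this is worth verifying in code. The sketch below implements scaled dot-product attention with plain lists; the `Q`, `K`, `V` matrices are illustrative assumptions (not taken from any trained model), picked so that token 1 weights both values equally and token 2 leans toward value 2.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Return (output Z, attention weights A) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, V), weights

Q = [[1.0, 0.0], [1.0, 1.0]]
K = [[1.0, 0.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Z, A = attention(Q, K, V)
# A[0] == [0.5, 0.5];  A[1] is approximately [0.33, 0.67]
# Z[0] == [2.0, 3.0];  Z[1] is approximately [2.34, 3.34]
```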
4.4 Multi-Head Attention (MHA)
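One attention "head" learns one kind of relationship. We want many in parallel (syntax, coreference, etc.). Split into $h$ heads of dim $d_k = d/h$:
$$\text{head}_i = \text{Attention}(XW_i^Q,\ XW_i^K,\ XW_i^V), \qquad \text{MHA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$
where $W^O \in \mathbb{R}^{d \times d}$ recombines them. Each head attends in a different learned subspace; concatenation + projection fuses their findings. Cost is roughly the same as one full-dim attention because each head is $1/h$ as wide.
Example: $d = 512$, $h = 8$ → $d_k = 64$ per head.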
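The "same cost" claim can be checked with simple counting. This sketch tallies projection parameters and score multiplications for a given head count (biases ignored, sequence length `n = 100` chosen arbitrarily), and shows the split/concat bookkeeping on a single token vector; all helper names are illustrative.

```python
def mha_cost(d, h, n):
    """Return (d_k, projection parameter count, multiplications in all Q K^T products)."""
    assert d % h == 0, "model dim must divide evenly across heads"
    d_k = d // h
    proj_params = h * 3 * d * d_k + d * d     # per-head W^Q, W^K, W^V plus the output W^O
    score_mults = h * n * n * d_k             # each head forms an n x n score matrix
    return d_k, proj_params, score_mults

def split_heads(vec, h):
    """Split one token's d-dim vector into h contiguous chunks of size d/h."""
    d_k = len(vec) // h
    return [vec[i * d_k:(i + 1) * d_k] for i in range(h)]

def concat_heads(chunks):
    return [v for chunk in chunks for v in chunk]

multi = mha_cost(d=512, h=8, n=100)    # d_k = 64
single = mha_cost(d=512, h=1, n=100)   # one full-width head, d_k = 512
```

Both configurations end up with identical parameter and multiplication counts; only `d_k` changes, so the extra heads come essentially for free.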
4.5 The complete Transformer block
A block stacks attention + a feed-forward network, each wrapped with residual connections (from [[02_cnns]]) and LayerNorm (from [[01_deep_learning_foundations]]).
Position-wise Feed-Forward Network (FFN)
Applied independently to each position:
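$$\text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$$
with $W_1 \in \mathbb{R}^{d_{ff} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{ff}}$, typically $d_{ff} = 4d$, $\sigma$ = GELU/ReLU. This is where much of the model's "knowledge" and per-token nonlinear processing lives. Attention mixes across tokens; FFN processes each token on its own.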
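A minimal sketch of the position-wise FFN, assuming the exact (erf-based) GELU and small random weights in place of trained ones. The point it illustrates is that the same two linear maps are applied to every position separately, so reordering the tokens only reorders the outputs.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """y = W x + b for one vector x; W has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def ffn(x, W1, b1, W2, b2):
    hidden = [gelu(v) for v in linear(x, W1, b1)]   # expand d -> d_ff, then nonlinearity
    return linear(hidden, W2, b2)                   # project d_ff -> d

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, d_ff = 4, 16                                     # d_ff = 4d, as in the text
W1, b1 = random_matrix(d_ff, d, rng), [0.0] * d_ff
W2, b2 = random_matrix(d, d_ff, rng), [0.0] * d

X = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.2, 0.2, 0.2]]   # two positions
Y = [ffn(x, W1, b1, W2, b2) for x in X]             # same weights at every position
```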
Post-LN (original) vs Pre-LN (modern)
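Post-LN (original paper):
$$x \leftarrow \text{LayerNorm}(x + \text{MHA}(x)), \qquad x \leftarrow \text{LayerNorm}(x + \text{FFN}(x))$$
Pre-LN (GPT-2 onward; more stable for deep nets):
$$x \leftarrow x + \text{MHA}(\text{LayerNorm}(x)), \qquad x \leftarrow x + \text{FFN}(\text{LayerNorm}(x))$$
The residual $+x$ gives the gradient highway; LayerNorm stabilizes scale. Stack $N$ such blocks (e.g. 12 in BERT-base, 96 in GPT-3).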
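The difference between the two orderings is only where the normalization sits relative to the residual add. The sketch below uses a LayerNorm without learned gain/bias and a hypothetical `toy_sublayer` standing in for attention or the FFN, so the structure is visible without any real weights.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_ln_step(x, sublayer):
    return layer_norm(add(x, sublayer(x)))      # normalize after the residual add

def pre_ln_step(x, sublayer):
    return add(x, sublayer(layer_norm(x)))      # normalize only the sublayer's input

def toy_sublayer(x):
    return [0.5 * v for v in x]                 # hypothetical stand-in for MHA or FFN

x = [10.0, -4.0, 2.0, 0.0]
post = post_ln_step(x, toy_sublayer)
pre = pre_ln_step(x, toy_sublayer)
```

In the Pre-LN step, `x` passes through the addition untouched, so the identity path from input to output is never rescaled; in Post-LN every block renormalizes the running sum.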
Block diagram
```
        ┌─────────────── + ◄──────────────┐   (residual)
x ──►LayerNorm──► Multi-Head Attention ────┘
        ┌─────────────── + ◄──────────────┐   (residual)
  └───►LayerNorm──► Feed-Forward (4d) ────┘ ──► output
```
4.6 Masking
Padding mask
Batched sequences are padded to equal length. We set attention scores for pad positions to $-\infty$ before softmax so they get weight 0.
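A minimal sketch of the padding mask applied to one row of scores. The `keep` flags and the score values are made up for illustration; the last position plays the role of a pad token and deliberately has the largest raw score.

```python
import math

def masked_softmax(scores, keep):
    """Softmax over one row; positions with keep[j] == False receive weight 0."""
    assert any(keep), "at least one position must be unmasked"
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]    # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]
keep = [True, True, True, False]                # last position is padding
weights = masked_softmax(scores, keep)
```

Even though the pad position had the highest score, it ends up with exactly zero weight, and the remaining weights still sum to 1.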
Causal (look-ahead) mask — for generation
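In a decoder, token $i$ must not see future tokens $j > i$ (that would be cheating during next-token prediction). Apply a mask $M$ with $M_{ij} = -\infty$ for $j > i$, else $0$:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$
Concretely the mask is upper-triangular $-\infty$ (shown for $n = 3$):
$$M = \begin{bmatrix} 0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0 \end{bmatrix}$$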
After softmax, each token attends only to itself and earlier tokens. This single trick is what makes GPT autoregressive. ([[05_architectures]] details decoder-only models.)
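The effect of the causal mask is easiest to see with uniform scores, where the mask alone determines the weights. This is a plain-Python sketch using the additive form of the mask (0 on and below the diagonal, negative infinity above); the function names are illustrative.

```python
import math

def causal_mask(n):
    """Additive mask: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)                                 # the diagonal is never masked, so m is finite
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(scores):
    mask = causal_mask(len(scores))
    return [softmax([s + m for s, m in zip(s_row, m_row)])
            for s_row, m_row in zip(scores, mask)]

scores = [[0.0] * 3 for _ in range(3)]           # uniform scores
W = causal_weights(scores)
# W[0] == [1.0, 0.0, 0.0]; W[1] == [0.5, 0.5, 0.0]; W[2] is one third each
```

Token 1 can only attend to itself, token 2 splits its weight over the first two positions, and only the last token sees the whole sequence.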
4.7 Why $O(n^2)$, complexity, and properties
- Complexity: self-attention is $O(n^2 d)$ in time and $O(n^2)$ in memory (the $n \times n$ score matrix). This quadratic cost in sequence length is the main scaling limitation → motivates FlashAttention (IO-aware exact attention), sparse/linear attention, sliding windows.
- Path length: any two tokens interact in one layer ($O(1)$ path) vs $O(n)$ for RNNs → far better long-range modeling.
- Parallelism: all positions computed simultaneously (no time recurrence) → GPU-friendly, the reason Transformers scaled.
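A back-of-the-envelope calculation shows how quickly the quadratic term bites. The sketch assumes the score matrix is fully materialized in fp16 (2 bytes per element) for 12 heads and a batch of 1; those figures are illustrative choices, and IO-aware kernels such as FlashAttention avoid storing the full matrix at all.

```python
def score_matrix_bytes(n, n_heads=1, batch=1, bytes_per_el=2):
    """Bytes needed to materialize the n x n attention scores for one layer."""
    return batch * n_heads * n * n * bytes_per_el

for n in (1024, 8192, 65536):
    gib = score_matrix_bytes(n, n_heads=12) / 2**30
    print(f"n={n:>6}: {gib:8.3f} GiB per layer")
# n=  1024:    0.023 GiB per layer
# n=  8192:    1.500 GiB per layer
# n= 65536:   96.000 GiB per layer
```

Multiplying the context length by 8 multiplies the score memory by 64, which is why long-context work focuses on never holding the whole matrix in memory.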
4.8 Training a Transformer language model
Objective
Causal/autoregressive LM (GPT): predict the next token. With sequence $x_1, \dots, x_n$:
$$\mathcal{L} = -\sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})$$
Each position's output goes through a linear "LM head" (often weight-tied to the embedding matrix $E$) → softmax over vocab → cross-entropy against the actual next token (the clean gradient from [[01_deep_learning_foundations]]). With causal masking, all $n$ next-token predictions are computed in one parallel forward pass ("teacher forcing").
Perplexity
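A common metric: $\text{PPL} = \exp\!\left(-\frac{1}{n}\sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})\right)$, the exponentiated average per-token cross-entropy. It can be read as the effective branching factor. Lower is better.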
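The objective and the metric fit together in a few lines. The sketch below shifts a token sequence by one to form (input, target) pairs, then computes the mean negative log-likelihood and its exponential. The per-step probabilities are hypothetical model outputs invented for the example, and only the probability of the correct token is listed for the "confident" model.

```python
import math

def mean_next_token_nll(step_probs, targets):
    """Average of -log p(target) over positions.

    step_probs[t] maps candidate tokens to predicted probabilities at step t;
    targets[t] is the token that actually came next."""
    assert len(step_probs) == len(targets) and targets
    return -sum(math.log(p[tok]) for p, tok in zip(step_probs, targets)) / len(targets)

def perplexity(mean_nll):
    return math.exp(mean_nll)

tokens = ["the", "cat", "sat", "down"]
inputs, targets = tokens[:-1], tokens[1:]        # predict token t+1 from tokens up to t

confident = [{"cat": 0.7}, {"sat": 0.6}, {"down": 0.9}]
uniform = [{tok: 0.25 for tok in tokens}] * 3    # guessing among 4 equally likely words

ppl_confident = perplexity(mean_next_token_nll(confident, targets))   # about 1.38
ppl_uniform = perplexity(mean_next_token_nll(uniform, targets))       # 4.0
```

The uniform guesser's perplexity equals the number of choices it is spreading its probability over, which is the "effective branching factor" reading.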
Learning-rate schedule
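The original used warmup then inverse-sqrt decay:
$$lr = d^{-0.5} \cdot \min\!\left(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup}^{-1.5}\right)$$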
Warmup avoids early instability when Adam's variance estimates are noisy; decay refines later. Modern LLMs use linear warmup + cosine decay with AdamW.
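Both schedules are short enough to write out. In this sketch, `noam_lr` follows the original paper's formula with its defaults (`d_model = 512`, `warmup = 4000`), while the peak rate, step counts, and floor in `warmup_cosine_lr` are placeholder values chosen for illustration, not recommendations.

```python
import math

def noam_lr(step, d_model=512, warmup=4000):
    """Linear warmup, then decay proportional to 1 / sqrt(step)."""
    step = max(step, 1)                          # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def warmup_cosine_lr(step, peak_lr=3e-4, warmup=2000, total=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr at step `total`."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

noam_peak = noam_lr(4000)                        # the two branches of min() meet at step == warmup
cosine_mid = warmup_cosine_lr(51_000)            # halfway through the decay phase
```

In the Noam schedule the peak is set implicitly by `d_model` and `warmup`; the warmup + cosine form exposes the peak and final learning rates as explicit knobs.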
Inference / decoding strategies
Given the next-token distribution, pick a token:
- Greedy: argmax (deterministic, can be repetitive).
- Beam search: keep the top-$B$ partial sequences at each step (good for translation).
- Temperature $T$: divide logits by $T$ before softmax; $T < 1$ sharpens, $T > 1$ flattens.
- Top-k: sample among the $k$ most likely tokens.
- Top-p (nucleus): sample from the smallest set whose cumulative prob $\ge p$.
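The sampling strategies above (beam search aside) compose naturally: temperature reshapes the distribution, top-k or top-p truncates it, and a single draw picks the token. A minimal sketch over a made-up four-token logit vector, with hypothetical helper names:

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:                              # smallest prefix with cumulative prob >= p
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
greedy = max(range(len(logits)), key=lambda i: logits[i])    # index 0
probs = softmax(logits, temperature=0.7)
token = sample(top_p_filter(probs, p=0.9), random.Random(0))
```

Filtering before sampling keeps the randomness but removes the long tail of unlikely tokens, which is the usual reason to prefer top-k or top-p over pure temperature sampling.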
4.9 Code: a Transformer block from scratch (PyTorch)
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q,K,V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):  # x: (B, n, d)
        B, n, d = x.shape
        qkv = self.qkv(x).reshape(B, n, 3, self.h, self.dk).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # each (B, h, n, dk)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.dk)  # (B,h,n,n)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = scores.softmax(-1)  # attention weights
        z = attn @ v  # (B,h,n,dk)
        z = z.transpose(1, 2).reshape(B, n, d)  # concat heads
        return self.out(z)


class TransformerBlock(nn.Module):  # Pre-LN variant
    def __init__(self, d_model, n_heads, d_ff, p=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(p)

    def forward(self, x, mask=None):
        x = x + self.drop(self.attn(self.ln1(x), mask))  # residual + attention
        x = x + self.drop(self.ff(self.ln2(x)))  # residual + FFN
        return x


def causal_mask(n):
    return torch.tril(torch.ones(n, n)).bool()  # lower-triangular True
```
A minimal GPT
```python
# Reuses the imports, TransformerBlock, and causal_mask defined above.
class MiniGPT(nn.Module):
    def __init__(self, vocab, d=256, n_heads=8, n_layers=6, max_len=512, d_ff=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        self.blocks = nn.ModuleList([TransformerBlock(d, n_heads, d_ff)
                                     for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d)
        self.head = nn.Linear(d, vocab, bias=False)
        self.head.weight = self.tok.weight  # weight tying

    def forward(self, idx):  # idx: (B, n)
        B, n = idx.shape
        pos = torch.arange(n, device=idx.device)
        x = self.tok(idx) + self.pos(pos)  # embed + positional
        mask = causal_mask(n).to(idx.device)
        for blk in self.blocks:
            x = blk(x, mask)
        return self.head(self.ln_f(x))  # logits (B, n, vocab)


# training:
#   logits = model(idx)
#   V = logits.size(-1)  # vocab size
#   loss = F.cross_entropy(logits[:, :-1].reshape(-1, V), idx[:, 1:].reshape(-1))
```
4.10 Pitfalls & key intuitions
- In self-attention, Q, K, and V are all projections of the same input; in cross-attention (decoder attending to encoder) Q comes from the decoder, K/V from the encoder ([[05_architectures]]).
- Forgetting the mask in a decoder leaks future info → the model "cheats" and fails at generation.
- Forgetting positional encoding → the model can't tell word order ("dog bites man" = "man bites dog").
- Quadratic memory limits context length; that's an active research/engineering frontier.
- Attention weights are somewhat interpretable but not a faithful explanation of the model's reasoning — treat attention maps cautiously.
Next: [[05_architectures]] — how encoder-only, decoder-only, and encoder-decoder models reuse these blocks for different jobs.