The same Transformer block from [[04_transformers]] is assembled three ways. The difference is entirely about what each token is allowed to attend to (masking) and what objective the model is trained on. Understanding this trio explains BERT vs GPT vs T5.
5.1 The encoder
What it is
A stack of Transformer blocks with bidirectional (unmasked) self-attention: every token attends to all tokens, left and right. Output: a contextualized vector per input token. It understands; it does not generate.
Math
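For input $X \in \mathbb{R}^{n \times d}$ ($n$ tokens, model width $d$), set $H^{(0)} = X$; each layer $\ell$ then computes (post-LN ordering, matching the decoder layer structure in §5.2):

$$\tilde{H}^{(\ell)} = \mathrm{LN}\big(H^{(\ell-1)} + \mathrm{SelfAttn}(H^{(\ell-1)})\big), \qquad H^{(\ell)} = \mathrm{LN}\big(\tilde{H}^{(\ell)} + \mathrm{FFN}(\tilde{H}^{(\ell)})\big)$$

Final output = deep contextual embeddings $H^{(L)}$. The representation of token $i$ "sees" the whole sentence.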
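The claim that every token's representation sees the whole sentence can be checked numerically. The following is a minimal single-head sketch in NumPy, assuming random stand-in embeddings and projection matrices (the sizes, `Wq`/`Wk`/`Wv`, and the function name are illustrative, not taken from any library): perturbing the last token changes the output at the first position because no mask is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                # toy sizes: 5 tokens, width 8
X = rng.normal(size=(n, d))                # stand-in for token embeddings
# Hypothetical projection matrices, scaled so attention scores stay moderate.
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def bidirectional_attention(X):
    """Single-head self-attention with no mask (encoder style)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                          # (n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

out, weights = bidirectional_attention(X)

# Perturb only the LAST token and look at the FIRST token's output.
X_changed = X.copy()
X_changed[-1] += 1.0
out_changed, _ = bidirectional_attention(X_changed)
first_token_moved = not np.allclose(out[0], out_changed[0])
print(first_token_moved)
```

Every row of `weights` puts non-zero mass on every position, which is what "bidirectional" means in practice.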
Use it when
You need understanding/representation: classification, named-entity recognition, retrieval embeddings, sentence similarity. Not for free-form generation.
5.2 The decoder
What it is
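A stack of blocks with causal (masked) self-attention: token $i$ attends only to tokens $j \le i$. This makes it autoregressive: it predicts the next token from the past. Optionally includes a cross-attention sublayer that attends to an encoder's output (used in encoder-decoder models).
Math (decoder self-attention)

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$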
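A small NumPy sketch of causal masking, under the same kind of toy assumptions (random embeddings and projections, single head, illustrative names): scores for future positions are set to $-\infty$ before the softmax, so changing a later token cannot affect earlier outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                # stand-in for token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def causal_attention(X):
    """Single-head self-attention where token i sees only tokens j <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    allowed = np.tril(np.ones((len(X), len(X)), dtype=bool))  # True where j <= i
    scores = np.where(allowed, scores, -np.inf)               # block the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                                  # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

out, weights = causal_attention(X)

# Perturb the last token: outputs at earlier positions must not move.
X_changed = X.copy()
X_changed[-1] += 1.0
out_changed, _ = causal_attention(X_changed)
print(np.allclose(out[:-1], out_changed[:-1]))
```

The attention matrix is lower-triangular, and the first token can only attend to itself.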
Cross-attention (decoder ↔ encoder)
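In an encoder-decoder, the decoder has a second attention sublayer where queries come from the decoder states $H_{\text{dec}}$ and keys/values come from the encoder output $H_{\text{enc}}$:

$$\mathrm{CrossAttn}(H_{\text{dec}}, H_{\text{enc}}) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V, \qquad Q = H_{\text{dec}} W_Q,\; K = H_{\text{enc}} W_K,\; V = H_{\text{enc}} W_V$$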
This lets each generated token look back at the source sentence — exactly the attention that fixed the seq2seq bottleneck from [[03_rnn_lstm]].
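The shapes make the asymmetry concrete. Here is a hedged NumPy sketch with made-up sizes and random matrices (all names are illustrative): queries are built from decoder states, keys and values from the encoder output, and no causal mask is needed because the source is fully known.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_tgt, d = 6, 3, 8                  # 6 source tokens, 3 target tokens so far
H_enc = rng.normal(size=(n_src, d))        # stand-in for the encoder output
H_dec = rng.normal(size=(n_tgt, d))        # stand-in for decoder states
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def cross_attention(H_dec, H_enc):
    Q = H_dec @ Wq                         # queries from the decoder
    K = H_enc @ Wk                         # keys from the encoder
    V = H_enc @ Wv                         # values from the encoder
    scores = Q @ K.T / np.sqrt(d)          # (n_tgt, n_src)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

context, weights = cross_attention(H_dec, H_enc)
print(weights.shape, context.shape)        # one row of source weights per target token
```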
Decoder layer structure (encoder-decoder case)
```
masked self-attention  → +residual, LN
cross-attention(enc)   → +residual, LN
feed-forward           → +residual, LN
```
5.3 The three architectures side by side
| | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal (masked) | Enc: bidir; Dec: causal + cross |
| Sees future? | Yes | No | Enc yes, Dec no |
| Trained on | Masked LM (MLM) | Next-token LM | Seq2seq (denoising / translation) |
| Output | One vector per token | Next-token distribution | Target sequence |
| Examples | BERT, RoBERTa, ELECTRA, DeBERTa | GPT-1/2/3/4, LLaMA, Mistral | T5, BART, original Transformer, mT5 |
| Best for | Classification, NER, embeddings, retrieval | Text generation, chat, code, few-shot | Translation, summarization, structured transduction |
5.4 Encoder-only: BERT in detail
BERT = Bidirectional Encoder Representations from Transformers. Pretrained on two self-supervised tasks:
1. Masked Language Modeling (MLM)
Randomly mask 15% of tokens; predict them from bidirectional context.
- Of the chosen 15%: 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged (this reduces the pretraining/fine-tuning mismatch, since `[MASK]` never appears in downstream inputs).
Example: "The [MASK] sat on the mat" → predict "cat". The model uses both "The" (left) and "sat on the mat" (right) — bidirectionality is the whole point.
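The 15% / 80-10-10 rule can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: it assumes each token is selected independently with probability 0.15, ignores special tokens, and uses a hypothetical helper name `mlm_corrupt` and a tiny made-up vocabulary.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, select_prob=0.15, rng=None):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:     # token chosen for prediction
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:                 # 80%: replace with [MASK]
                inputs.append(MASK)
            elif roll < 0.9:               # 10%: replace with a random token
                inputs.append(rng.choice(vocab))
            else:                          # 10%: keep the original token
                inputs.append(tok)
        else:                              # not chosen: unchanged, no loss here
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

tokens = "the cat sat on the mat".split()
vocab = ["dog", "ran", "under", "a", "rug"]   # made-up replacement vocabulary
inputs, labels = mlm_corrupt(tokens, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, including the positions that were left unchanged.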
2. Next Sentence Prediction (NSP)
Given sentences A and B, predict whether B actually follows A (50% real, 50% random). (Later work — RoBERTa — found NSP unnecessary and dropped it.)
Special tokens & input format
```
[CLS] sentence A tokens [SEP] sentence B tokens [SEP]
```
- `[CLS]`: its final hidden state is used as the aggregate sequence representation for classification.
- `[SEP]`: separates segments.

Segment embeddings (A vs B) are added alongside the token and position embeddings.
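As a rough illustration of this layout, the sketch below assembles a sentence pair with its segment and position ids. The helper name `build_pair_input` is hypothetical and the tokens are whitespace-split words rather than real WordPiece tokens; in practice a tokenizer produces all of this.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out [CLS] A [SEP] B [SEP] with matching segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(
    "the cat sat".split(), "it purred".split()
)
for row in zip(position_ids, segment_ids, tokens):
    print(row)
```

Each input vector is then the sum of three embeddings looked up from these three id sequences.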
Fine-tuning
Add a small head on top and train end-to-end on labeled data:
- Classification: linear layer on the `[CLS]` vector → softmax.
- NER / token tagging: linear layer on every token's vector.
- Question answering (SQuAD): predict start/end span positions over the passage tokens.
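The three heads differ only in which hidden states they read and which axis they normalize over. A shape-level NumPy sketch, assuming random stand-ins for the encoder's final hidden states and for the head weights (biases omitted; the sizes and label counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 7, 16
H = rng.normal(size=(seq_len, hidden))     # stand-in for final hidden states; H[0] is [CLS]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Sequence classification: one linear layer on the [CLS] vector.
W_cls = rng.normal(size=(hidden, 2))
class_probs = softmax(H[0] @ W_cls)                      # shape (2,)

# Token tagging (NER): the same kind of layer applied to every token.
W_tag = rng.normal(size=(hidden, 5))
tag_probs = softmax(H @ W_tag)                           # shape (7, 5)

# Extractive QA: a start score and an end score per token,
# each normalized over positions rather than over classes.
W_span = rng.normal(size=(hidden, 2))
start_probs, end_probs = softmax(H @ W_span, axis=0).T   # each shape (7,)

print(class_probs.shape, tag_probs.shape, start_probs.shape)
```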
Code (Hugging Face)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F  # used in the fine-tuning step sketched below

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# The 2-way classification head is newly initialized until fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tok(["I loved this movie", "Terrible and boring"],
            padding=True, truncation=True, return_tensors="pt")
logits = model(**batch).logits   # shape (2, 2): (batch, num_labels)
probs = logits.softmax(-1)
# fine-tune: loss = F.cross_entropy(logits, labels); loss.backward(); opt.step()
```
5.5 Decoder-only: GPT in detail
GPT = Generative Pretrained Transformer. A stack of causal-masked decoder blocks (no cross-attention, since there's no separate encoder). Trained on plain next-token prediction (§4.8).
Why decoder-only dominates LLMs
- Simplicity & scale: one objective, train on raw internet text, scales smoothly to hundreds of billions of params.
- In-context learning: large GPTs learn tasks from examples in the prompt with no weight updates ("few-shot"). E.g. give 3 translation examples, then a 4th to complete.
- Unified interface: classification, QA, summarization, code — all framed as "continue this text."
Autoregressive generation loop
```
prompt → model → next-token distribution → sample token → append → repeat
```
Generation is inherently sequential (each token depends on all previous tokens), but training is parallel thanks to causal masking and teacher forcing: a single forward pass scores every position of a training sequence at once.
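The loop can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: the "model" is a random bigram table that looks only at the last token (a real decoder conditions on the whole prefix), token id 0 plays the role of end-of-sequence, and only temperature sampling is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, EOS = 10, 0
bigram_logits = rng.normal(size=(vocab_size, vocab_size))  # toy "model" parameters

def next_token_logits(ids):
    # Toy stand-in: a real decoder would run the full prefix through the network.
    return bigram_logits[ids[-1]]

def generate(prompt_ids, max_new_tokens=20, temperature=0.8):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids) / temperature   # sharpen or flatten
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()                     # next-token distribution
        token = int(rng.choice(vocab_size, p=probs))    # sample one token
        ids.append(token)                               # append, then repeat
        if token == EOS:
            break
    return ids

out = generate([3, 7], max_new_tokens=5)
print(out)
```

Each iteration needs the token produced by the previous one, which is why decoding cannot be parallelized across positions the way training can.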
The modern LLM stack (post-pretraining)
- Pretraining: next-token prediction on trillions of tokens → a "base model."
- Supervised fine-tuning (SFT): train on curated instruction→response pairs → follows instructions.
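- Alignment via RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization): train on human preference comparisons so outputs are helpful and harmless.
  - RLHF: train a reward model $r_\phi$ on human-ranked pairs, then optimize the LLM policy $\pi_\theta$ with PPO to maximize reward minus a KL penalty keeping it close to the SFT model $\pi_{\text{SFT}}$:

    $$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{SFT}}(\cdot \mid x)\big)$$
  - DPO skips the separate reward model and optimizes a closed-form classification loss directly on preference pairs; it is simpler and now widely used.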
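For a single preference pair, the DPO loss is a logistic loss on how much more the policy prefers the chosen response than the reference model does. The sketch below assumes the four inputs are sequence log-probabilities (summed over response tokens) under the policy and a frozen reference model; the function name, argument names, and numbers are made up for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin) for one (chosen, rejected) pair."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # log(1 + exp(-z)) equals -log(sigmoid(z))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet, loss = log 2.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(neutral, improved)
```

Here `beta` plays the same role as the KL coefficient in RLHF: larger values penalize drifting from the reference model more strongly.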
Code (generation)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

enc = tok("The future of AI is", return_tensors="pt")
out = model.generate(
    **enc,                           # input_ids and attention_mask
    max_new_tokens=40,
    do_sample=True, temperature=0.8, top_p=0.9,
    pad_token_id=tok.eos_token_id,   # GPT-2 has no pad token; reuse EOS
)
print(tok.decode(out[0], skip_special_tokens=True))
```
5.6 Encoder-Decoder: T5 / BART in detail
Best when input and output are different sequences and you need to fully encode the input before generating: translation, summarization.
T5 ("Text-to-Text Transfer Transformer")
Frames every NLP task as text→text:
- Translation: `"translate English to German: That is good."` → `"Das ist gut."`
- Summarization: `"summarize: <article>"` → `"<summary>"`
- Classification: `"cola sentence: ..."` → `"acceptable"`
Pretraining (span corruption): mask contiguous spans, replace each with a sentinel token, and have the decoder generate the missing spans.
```
input:  "Thank you <X> me to your party <Y> week."
target: "<X> for inviting <Y> last"
```
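The input/target pair above can be reproduced with a short sketch. It assumes the spans to drop are given explicitly (real pretraining samples them at random), works on whitespace-split words rather than subword tokens, and uses the `<X>`, `<Y>` placeholder names from the example; `span_corrupt` is a hypothetical helper name.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """spans: sorted, non-overlapping (start, end) index pairs, end exclusive."""
    if len(spans) > len(sentinels):
        raise ValueError("not enough sentinel tokens for the requested spans")
    inputs, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs += tokens[cursor:start] + [sentinel]   # span replaced by its sentinel
        target += [sentinel] + tokens[start:end]      # sentinel followed by the dropped span
        cursor = end
    inputs += tokens[cursor:]                         # text after the last span
    return " ".join(inputs), " ".join(target)

tokens = "Thank you for inviting me to your party last week.".split()
corrupted, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(corrupted)
print(target)
```

The encoder reads the corrupted input, and the decoder is trained to emit the target, so one example exercises both halves of the model.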
Data flow
```
source ─► ENCODER (bidirectional) ─► H_enc
                                       │ (keys/values)
target ─► DECODER (causal self-attn) ──┴─► cross-attention ─► next-token logits
```
The encoder reads the whole source once; the decoder generates the target token-by-token, cross-attending to the encoded source at every layer.
Training objective
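Teacher-forced cross-entropy on the target sequence $y_1, \dots, y_T$ (the same next-token loss, but conditioned on the encoder output $H_{\text{enc}}$):

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid y_{<t},\, H_{\text{enc}}\big)$$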
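Teacher forcing means the decoder is fed the gold target shifted right by one position and is scored on predicting the unshifted target. A minimal NumPy sketch, assuming token id 0 as the decoder start token and hand-built logits in place of a real decoder; the loss is averaged over target positions here, which is the summed loss divided by the target length.

```python
import numpy as np

def teacher_forced_inputs(target_ids, start_id=0):
    """Decoder input is the gold target shifted right; labels are the target itself."""
    return [start_id] + target_ids[:-1], target_ids

def sequence_cross_entropy(logits, labels):
    """Mean of -log p(y_t) over target positions; logits has shape (T, vocab)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

target_ids = [5, 6, 7, 1]                  # made-up token ids
decoder_in, labels = teacher_forced_inputs(target_ids)

vocab = 10
uniform_loss = sequence_cross_entropy(np.zeros((len(labels), vocab)), labels)

confident = np.zeros((len(labels), vocab))
confident[np.arange(len(labels)), labels] = 20.0   # put nearly all mass on the gold token
confident_loss = sequence_cross_entropy(confident, labels)

print(decoder_in, labels, uniform_loss, confident_loss)
```

A model that guesses uniformly pays $\log(\text{vocab size})$ per token, and the loss approaches zero as probability concentrates on the gold tokens.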
Code
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

long_article = "..."  # placeholder: the text to summarize goes here
# T5 selects the task through the text prefix ("summarize: ").
ids = tok("summarize: " + long_article, return_tensors="pt", truncation=True).input_ids
summary = model.generate(ids, max_new_tokens=80, num_beams=4)  # beam search
print(tok.decode(summary[0], skip_special_tokens=True))
```
5.7 Choosing an architecture (decision guide)
```
Need to GENERATE free-form text / chat / code?
   └─ yes → DECODER-ONLY (GPT, LLaMA, Mistral)
Need to UNDERSTAND/classify/embed a fixed input?
   └─ yes → ENCODER-ONLY (BERT, DeBERTa), or just use an LLM if convenient
Input → different output sequence (translate/summarize) with strong source encoding?
   └─ yes → ENCODER-DECODER (T5, BART)
```
In practice, large decoder-only LLMs now handle most tasks (including classification and summarization) via prompting, so encoder-only/encoder-decoder are chosen mainly for efficiency, embeddings ([[06_rag]] retrievers use encoder models!), or specialized transduction.
5.8 Pitfalls
- Using BERT to "generate text" — it can't naturally; it's not autoregressive.
- Using GPT for sentence embeddings — works but encoder models (or specialized embedding models) are usually better/cheaper for retrieval.
- Confusing the two attentions in a decoder: masked self-attention (over generated tokens) vs cross-attention (over encoder output) are different sublayers.
- Retrieval/RAG embeddings almost always come from encoder-style models → bridge to the next chapter.
Next: [[06_rag]] — give models external, up-to-date knowledge using encoder embeddings + vector search.