Module 05 · 7 min read

Encoder / Decoder Architectures

Encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5), causal masking, and why each shape exists.

The same Transformer block from [[04_transformers]] is assembled three ways. The difference is entirely about what each token is allowed to attend to (masking) and what objective the model is trained on. Understanding this trio explains BERT vs GPT vs T5.


5.1 The encoder

What it is

A stack of Transformer blocks with bidirectional (unmasked) self-attention: every token attends to all tokens, left and right. Output: a contextualized vector per input token. It understands; it does not generate.

Math

For input $\mathbf{X}\in\mathbb{R}^{n\times d}$, each layer:

$$\mathbf{H} = \text{EncoderLayer}(\mathbf{X}), \quad \text{no causal mask} \Rightarrow A_{ij} \text{ defined for all } i,j$$

Final output $\mathbf{H}^{(N)}\in\mathbb{R}^{n\times d}$ = deep contextual embeddings. The representation of token $i$ "sees" the whole sentence.

Use it when

You need understanding/representation: classification, named-entity recognition, retrieval embeddings, sentence similarity. Not for free-form generation.


5.2 The decoder

What it is

A stack of blocks with causal (masked) self-attention: token $i$ attends only to tokens $j \le i$. This makes it autoregressive: it predicts the next token from the past. Optionally includes a cross-attention sublayer to attend to an encoder's output (used in encoder-decoder models).

Math (decoder self-attention)

$$\mathbf{A} = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}_{\text{causal}}\right)\mathbf{V}, \qquad M_{ij} = \begin{cases}0 & j\le i\\ -\infty & j > i\end{cases}$$
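The effect of the mask can be checked numerically. The following is a minimal NumPy sketch, not a full attention layer: it assumes a single head and, for brevity, uses the raw inputs as queries, keys, and values (the learned projections are omitted). The function name `attention` and the toy sizes are illustrative choices.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; returns (output, attention weights)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) similarity scores
    if causal:
        i = np.arange(n)[:, None]                        # query (row) index
        j = np.arange(n)[None, :]                        # key (column) index
        scores = scores + np.where(j > i, -np.inf, 0.0)  # M_ij = -inf for j > i
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)                             # exp(-inf) = 0 for masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                              # 4 tokens, d = 8

_, w_enc = attention(X, X, X, causal=False)              # encoder-style: full matrix
_, w_dec = attention(X, X, X, causal=True)               # decoder-style: lower-triangular
print(np.round(w_dec, 2))
```

With the mask, every weight above the diagonal is exactly zero and each row still sums to one, so token $i$ distributes all of its attention over positions $j \le i$.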

Cross-attention (decoder ↔ encoder)

In an encoder-decoder, the decoder has a second attention sublayer where queries come from the decoder, keys/values from the encoder output $\mathbf{H}_{\text{enc}}$:

$$\mathbf{Q} = \mathbf{X}_{\text{dec}}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{H}_{\text{enc}}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{H}_{\text{enc}}\mathbf{W}^V$$

This lets each generated token look back at the source sentence — exactly the attention that fixed the seq2seq bottleneck from [[03_rnn_lstm]].
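A shape-level sketch of cross-attention may help here. It assumes a single head, random weights in place of trained ones, and illustrative sizes (5 source tokens, 3 target tokens); `cross_attention` is a hypothetical helper name, not a library function.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(X_dec, H_enc, W_q, W_k, W_v):
    Q = X_dec @ W_q                              # queries from the decoder: (m, d_k)
    K = H_enc @ W_k                              # keys from the encoder:    (n, d_k)
    V = H_enc @ W_v                              # values from the encoder:  (n, d_k)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (m, n); no causal mask needed
    return weights @ V, weights

rng = np.random.default_rng(0)
d, d_k, n_src, m_tgt = 16, 8, 5, 3               # toy sizes (assumed)
H_enc = rng.normal(size=(n_src, d))              # encoder output, one row per source token
X_dec = rng.normal(size=(m_tgt, d))              # decoder states, one row per target token
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))

out, weights = cross_attention(X_dec, H_enc, W_q, W_k, W_v)
print(out.shape, weights.shape)                  # (3, 8) (3, 5)
```

The weight matrix has one row per target position and one column per source position, which is why no causal mask is applied: the whole source is already available when decoding starts.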

Decoder layer structure (encoder-decoder case)

code
masked self-attention  → +residual, LN
cross-attention(enc)    → +residual, LN
feed-forward            → +residual, LN

5.3 The three architectures side by side

| | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal (masked) | Enc: bidir; Dec: causal + cross |
| Sees future? | Yes | No | Enc yes, Dec no |
| Trained on | Masked LM (MLM) | Next-token LM | Seq2seq (denoising / translation) |
| Output | One vector per token | Next-token distribution | Target sequence |
| Examples | BERT, RoBERTa, ELECTRA, DeBERTa | GPT-1/2/3/4, LLaMA, Mistral | T5, BART, original Transformer, mT5 |
| Best for | Classification, NER, embeddings, retrieval | Text generation, chat, code, few-shot | Translation, summarization, structured transduction |

5.4 Encoder-only: BERT in detail

BERT = Bidirectional Encoder Representations from Transformers. Pretrained on two self-supervised tasks:

1. Masked Language Modeling (MLM)

Randomly mask 15% of tokens; predict them from bidirectional context.

  • Of the chosen 15%: 80% replaced with [MASK], 10% with a random token, 10% left unchanged (this reduces the mismatch between pretraining and downstream use, since [MASK] never appears in fine-tuning or inference inputs).

$$L_{\text{MLM}} = -\sum_{t\in \text{masked}} \log P(w_t \mid \mathbf{w}_{\setminus \text{masked}})$$

Example: "The [MASK] sat on the mat" → predict "cat". The model uses both "The" (left) and "sat on the mat" (right) — bidirectionality is the whole point.
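The 15% / 80-10-10 corruption rule can be sketched in plain Python. This is a simplified illustration, not BERT's actual preprocessing: it assumes whole-word tokens, selects each position independently with probability 0.15 rather than sampling an exact 15%, and uses a tiny made-up vocabulary. `mlm_corrupt` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels[i] is None where no loss is taken."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:       # ~85% of positions: untouched, not scored
            continue
        labels[i] = token                     # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary (assumed)
sentence = "the cat sat on the mat".split()
corrupted, labels = mlm_corrupt(sentence * 50, vocab)
selected = sum(label is not None for label in labels)
print(f"{selected} of {len(labels)} positions selected for prediction")
```

The loss $L_{\text{MLM}}$ is then taken only over positions where `labels[i]` is not `None`, including the ones whose input token was left unchanged.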

2. Next Sentence Prediction (NSP)

Given sentences A and B, predict whether B actually follows A (50% real, 50% random). (Later work — RoBERTa — found NSP unnecessary and dropped it.)

Special tokens & input format

code
[CLS] sentence A tokens [SEP] sentence B tokens [SEP]

  • [CLS]: its final hidden state is used as the aggregate sequence representation for classification.
  • [SEP]: separates segments. Segment embeddings (A vs B) are added alongside the token and position embeddings.
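As a rough illustration of this layout, the sketch below assembles a sentence pair and the matching segment and position ids. It works on pre-split word tokens and skips WordPiece tokenization and padding; `build_pair_input` is a hypothetical helper, not part of any library.

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment ids (0 for A, 1 for B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS] and the first [SEP] belong to segment A; the final [SEP] to segment B.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(
    "the cat sat".split(), "it purred".split()
)
print(tokens)        # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 1, 1, 1]
```

Each position's input vector is the sum of three lookups: the token embedding, the segment embedding for its segment id, and the position embedding for its position id.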

Fine-tuning

Add a small head on top and train end-to-end on labeled data:

  • Classification: linear layer on [CLS] vector → softmax.
  • NER / token tagging: linear layer on every token's vector.
  • Question answering (SQuAD): predict start/end span positions over the passage tokens.

Code (Hugging Face)

python
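import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# The 2-way classification head is newly initialized, so predictions are
# meaningless until the model is fine-tuned on labeled data.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
batch = tok(["I loved this movie", "Terrible and boring"],
            padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():                       # inference only: skip gradient tracking
    logits = model(**batch).logits          # shape (2, 2): (batch, num_labels)
probs = logits.softmax(-1)                  # per-example class probabilities
# Fine-tuning step (sketch): compute logits without no_grad, then
# loss = torch.nn.functional.cross_entropy(logits, labels); loss.backward(); opt.step()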

5.5 Decoder-only: GPT in detail

GPT = Generative Pretrained Transformer. A stack of causal-masked decoder blocks (no cross-attention, since there's no separate encoder). Trained on plain next-token prediction (§4.8).

Why decoder-only dominates LLMs

  • Simplicity & scale: one objective, train on raw internet text, scales smoothly to hundreds of billions of params.
  • In-context learning: large GPTs learn tasks from examples in the prompt with no weight updates ("few-shot"). E.g. give 3 translation examples, then a 4th to complete.
  • Unified interface: classification, QA, summarization, code — all framed as "continue this text."

Autoregressive generation loop

code
prompt → model → next-token distribution → sample token → append → repeat

$$P(w_1\dots w_n) = \prod_{t=1}^{n} P(w_t \mid w_{<t})$$

Generation is inherently sequential (each token depends on all previously generated tokens), but training is parallel: with causal masking and teacher forcing, the next-token loss at every position is computed in a single forward pass.
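The loop and the chain-rule factorization can be sketched with a toy stand-in for the model. Everything here is made up for illustration: the `BIGRAM` table plays the role of the network and looks only at the last token, whereas a real decoder conditions on the full prefix; decoding is greedy rather than sampled.

```python
import math

# Hypothetical toy "model": P(next token | previous token) as a lookup table.
BIGRAM = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def next_token_distribution(context):
    """Stand-in for a forward pass: returns P(. | context)."""
    return BIGRAM[context[-1]]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)   # prompt -> model -> distribution
        token = max(dist, key=dist.get)          # greedy choice (sampling also works)
        tokens.append(token)                     # append, then repeat
        if token == "</s>":
            break
    return tokens

def sequence_log_prob(tokens):
    """Chain rule: log P(w_1..w_n) = sum over t of log P(w_t | w_<t)."""
    return sum(math.log(next_token_distribution(tokens[:t])[tokens[t]])
               for t in range(1, len(tokens)))

seq = generate(["<s>"])
print(seq)                                       # ['<s>', 'the', 'cat', 'sat', '</s>']
print(math.exp(sequence_log_prob(seq)))          # 0.9 * 0.7 * 0.8 * 1.0, about 0.504
```

`generate` needs one model call per new token, while `sequence_log_prob` scores a known sequence; during training the per-position terms of that sum are what a real model computes in parallel.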

The modern LLM stack (post-pretraining)

  1. Pretraining: next-token prediction on trillions of tokens → a "base model."
  2. Supervised fine-tuning (SFT): train on curated instruction→response pairs → follows instructions.
  3. Alignment: RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization). Train on human preference comparisons so outputs are helpful and harmless.
    • RLHF: train a reward model $r_\phi$ on human-ranked pairs, then optimize the LLM policy with PPO to maximize reward minus a KL penalty keeping it close to the SFT model:
    $$\max_\theta\ \mathbb{E}\big[r_\phi(x,y)\big] - \beta\,\text{KL}\!\big(\pi_\theta(y|x)\,\|\,\pi_{\text{SFT}}(y|x)\big)$$
    • DPO skips the separate reward model and optimizes a closed-form classification loss directly on preference pairs; it is simpler and now widely used.
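For a single preference pair, the DPO classification loss can be sketched as follows. It assumes the standard form $-\log\sigma\big(\beta[(\log\pi_\theta(y_w|x) - \log\pi_{\text{ref}}(y_w|x)) - (\log\pi_\theta(y_l|x) - \log\pi_{\text{ref}}(y_l|x))]\big)$, where $y_w$ is the preferred response, $y_l$ the rejected one, and the reference model is the frozen SFT model. The log-probabilities below are made-up numbers, and `dpo_loss` is an illustrative name.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    chosen_gain = policy_logp_chosen - ref_logp_chosen        # how much the policy upweights y_w
    rejected_gain = policy_logp_rejected - ref_logp_rejected  # how much it upweights y_l
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)), written as a numerically stable softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Made-up log-probabilities for illustration only.
same_as_ref = dpo_loss(-13.0, -14.0, -13.0, -14.0)    # policy identical to the reference
prefers_chosen = dpo_loss(-12.0, -15.0, -13.0, -14.0) # policy shifted toward y_w
print(same_as_ref, prefers_chosen)
```

When the policy equals the reference the loss is $\log 2$; it falls as the policy raises the preferred response relative to the rejected one, with $\beta$ playing the same role as the KL weight in the RLHF objective.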

Code (generation)

python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The future of AI is", return_tensors="pt")   # input_ids + attention_mask
out = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                     temperature=0.8, top_p=0.9,           # nucleus sampling
                     pad_token_id=tok.eos_token_id)        # GPT-2 has no pad token
print(tok.decode(out[0], skip_special_tokens=True))

5.6 Encoder-Decoder: T5 / BART in detail

Best when input and output are different sequences and you need to fully encode the input before generating: translation, summarization.

T5 ("Text-to-Text Transfer Transformer")

Frames every NLP task as text→text:

  • Translation: "translate English to German: That is good." → "Das ist gut."
  • Summarization: "summarize: <article>" → "<summary>"
  • Classification: "sst2 sentence: ..." → "positive"

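Pretraining (span corruption): mask contiguous spans, replace each with a sentinel token, and have the decoder generate the missing spans, each introduced by its sentinel. In the T5 paper's example the target ends with one extra sentinel that marks the end of the output.

code
input:  "Thank you <X> me to your party <Y> week."
target: "<X> for inviting <Y> last <Z>"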
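The construction of such an input/target pair can be sketched in a few lines. This is an illustration under simplifying assumptions: the spans are given by hand instead of being sampled, tokens are whitespace-split words, and the sentinels are written `<X>`, `<Y>`, `<Z>` as in the example. It also appends a closing sentinel to the target, following the T5 paper's example. `span_corrupt` is a hypothetical helper name.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Build (input, target) strings for span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Requires at least len(spans) + 1 sentinels (one extra to close the target).
    """
    inp, tgt, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inp += tokens[cursor:start] + [sentinel]   # keep text, drop the span
        tgt += [sentinel] + tokens[start:end]      # the span moves to the target
        cursor = end
    inp += tokens[cursor:]                         # text after the last span
    tgt.append(sentinels[len(spans)])              # closing sentinel
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week.".split()
model_input, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(model_input)   # Thank you <X> me to your party <Y> week.
print(target)        # <X> for inviting <Y> last <Z>
```

Because only the dropped spans appear in the target, the decoder has far fewer tokens to produce than the encoder reads.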

Data flow

code
source ─► ENCODER (bidirectional) ─► H_enc
                                       │ (keys/values)
target ─► DECODER (causal self-attn) ─┴─► cross-attention ─► next-token logits

The encoder reads the whole source once; the decoder generates the target token-by-token, cross-attending to the encoded source at every layer.

Training objective

Teacher-forced cross-entropy on the target sequence (same next-token loss, but conditioned on encoder output):

$$L = -\sum_{t} \log P(y_t \mid y_{<t},\ \mathbf{H}_{\text{enc}})$$
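A small worked example of this loss, under stated assumptions: the per-step distributions below are made-up numbers standing in for the decoder's softmax outputs (already conditioned on $\mathbf{H}_{\text{enc}}$), and `"<s>"` is a hypothetical start token. Teacher forcing means the decoder input at step $t$ is the gold token from step $t-1$, not the model's own prediction.

```python
import math

def teacher_forced_inputs(target, bos="<s>"):
    """Shift the gold target right by one: the decoder sees y_<t when predicting y_t."""
    return [bos] + target[:-1]

def sequence_nll(step_probs, target):
    """L = -sum_t log P(y_t | y_<t, H_enc); step_probs[t] maps token -> probability."""
    return -sum(math.log(step_probs[t][y]) for t, y in enumerate(target))

target = ["Das", "ist", "gut", "</s>"]
decoder_inputs = teacher_forced_inputs(target)
print(decoder_inputs)                  # ['<s>', 'Das', 'ist', 'gut']

# Hypothetical model outputs, one distribution per target position.
step_probs = [
    {"Das": 0.5, "Es": 0.5},
    {"ist": 0.8, "war": 0.2},
    {"gut": 0.5, "schlecht": 0.5},
    {"</s>": 1.0},
]
loss = sequence_nll(step_probs, target)
print(loss)                            # -log(0.5 * 0.8 * 0.5 * 1.0) = -log(0.2)
```

Since every decoder input comes from the gold target, all four terms can be computed in one forward pass, exactly as in decoder-only training.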

Code

python
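from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# Placeholder input; substitute the article you want to summarize.
long_article = ("The encoder reads the whole source once; the decoder generates the "
                "target token-by-token, cross-attending to the encoded source at every layer.")
inputs = tok("summarize: " + long_article, return_tensors="pt", truncation=True)
summary = model.generate(**inputs, max_new_tokens=80, num_beams=4)   # beam search
print(tok.decode(summary[0], skip_special_tokens=True))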

5.7 Choosing an architecture (decision guide)

code
Need to GENERATE free-form text / chat / code?
   └─ yes → DECODER-ONLY (GPT, LLaMA, Mistral)
Need to UNDERSTAND/classify/embed a fixed input?
   └─ yes → ENCODER-ONLY (BERT, DeBERTa)  — or just use an LLM if convenient
Input → different output sequence (translate/summarize) with strong source encoding?
   └─ yes → ENCODER-DECODER (T5, BART)

In practice, large decoder-only LLMs now handle most tasks (including classification and summarization) via prompting, so encoder-only/encoder-decoder are chosen mainly for efficiency, embeddings ([[06_rag]] retrievers use encoder models!), or specialized transduction.


5.8 Pitfalls

  • Using BERT to "generate text" — it can't naturally; it's not autoregressive.
  • Using GPT for sentence embeddings — works but encoder models (or specialized embedding models) are usually better/cheaper for retrieval.
  • Confusing the two attentions in a decoder: masked self-attention (over generated tokens) vs cross-attention (over encoder output) are different sublayers.
  • Retrieval/RAG embeddings almost always come from encoder-style models → bridge to the next chapter.

Next: [[06_rag]] — give models external, up-to-date knowledge using encoder embeddings + vector search.