Welcome to your AI journey! In this guide, we'll demystify the Transformer architecture - the revolutionary technology powering ChatGPT, Claude, and other Large Language Models (LLMs).
Don't worry - we're keeping the math minimal and focusing on building solid intuition.
Coming from Software Engineering? Think of self-attention like a JOIN the model runs over the words you gave it: every word looks at every other word in the SAME input and scores how relevant they are, then pulls in info from the ones that matter most. If you have built search or written a self-join where rows score their relevance to other rows, this will feel familiar.
What is a Transformer?
Think of a Transformer as a super-smart reading machine. When you give it text, it doesn't just read word by word like we do. Instead, it looks at ALL words simultaneously and figures out how they relate to each other.
Before Transformers: The Old Way
Before Transformers (introduced in 2017), we used models called RNNs (Recurrent Neural Networks) that processed text one word at a time, with each step waiting on the one before it. This was:
- Slow: Had to wait for each word to be processed
- Forgetful: By the time it reached word 100, it might forget word 1
- Sequential: Couldn't use modern parallel computing effectively
The Transformer Revolution
Transformers changed everything by introducing parallel processing - reading the entire text at once, like seeing a whole page in one glance.
The Secret Sauce: Self-Attention
Here's where the magic happens. Self-attention is how the model figures out which words are important for understanding other words.
A Simple Example
Consider the sentence: "The cat sat on the mat because it was tired."
What does "it" refer to? You instantly know it's "the cat" - but how?
Your brain automatically connected "it" with "cat" because:
- "it" is a pronoun that needs a reference
- "cat" is the most logical subject
- "mat" doesn't get tired
Self-attention does something very similar: it calculates a "relevance score" between every pair of words.
How Self-Attention Works (Simplified)
Imagine each word answering three questions:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information can I give?"
Think of a Python dict lookup. The Query is what a word is searching for ("I am 'it', I need my reference"). Every word in the input advertises a Key (what it can be matched on) and a Value (the info it hands back if matched). Unlike a real dict, the match is fuzzy: every word matches a little, and what comes back is a blend of all the Values, each weighted by its relevance score.
The model computes attention scores for every word pair. Words that are relevant to each other get high scores.
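To make the fuzzy lookup concrete, here is a minimal sketch in plain Python. The 2-number vectors are hand-picked for illustration (a real model learns them), and the helper names `softmax` and `attend` are hypothetical, not part of any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Fuzzy lookup: weight every Value by how well its Key matches the Query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Assumption: toy vectors chosen so that "it" points toward "cat".
words = ["cat", "mat", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query_it = [4.0, 0.0]  # "it" is looking for something cat-like

weights, blended = attend(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these toy numbers, "cat" gets the largest weight, so the blended result for "it" is made up mostly of the Value from "cat".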
Context Windows: The Model's Memory Limit
Every LLM has a context window - the maximum amount of text it can "see" at once.
What is a Context Window?
Think of it as the model's working memory or desk space. Just like you can only fit so many papers on your desk, an LLM can only process a certain number of tokens at once.
Common Context Window Sizes
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens |
| GPT-4o mini | 128,000 tokens |
| Claude Opus 4.x / Sonnet 4.6 | 1,000,000 tokens |
| Claude Haiku 4.5 | 200,000 tokens |
| Llama 3.1 (8B/70B) | 128,000 tokens |
| Gemini 2.0 | 1,000,000 tokens |
(Context windows and model lineups change often — treat these as ballpark figures and check the provider's docs for current limits.)
Why Context Windows Matter
# script_id: day_002_transformer_intuition/context_window_example
# Example: When your conversation exceeds the context window
conversation_so_far = """
[5000 tokens of previous chat]
User: What was the first thing I told you?
"""
# If context window is 4096 tokens, the model literally
# CANNOT see the beginning of your conversation anymore!
# Solution: Summarization or chunking strategies (we'll cover later)
When the conversation outgrows the window
When a conversation grows past the context window, something has to go. Typically the application drops the oldest messages so the request still fits, and the model never sees them. This is why a long chatbot "forgets" what you said early on, and why you summarize or trim old turns (covered in later days). The trimming is something your application does to stay under the limit, not something the model does on its own.
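Here is a rough sketch of what that trimming could look like on the application side. It assumes a whitespace word count as a stand-in for a real tokenizer, and `count_tokens` and `trim_history` are hypothetical helper names, so treat it as an illustration rather than production code.

```python
def count_tokens(text):
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def trim_history(messages, max_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

history = [
    "User: My name is Ada.",
    "Assistant: Nice to meet you, Ada.",
    "User: What is a context window?",
    "Assistant: It is the most text the model can see at once.",
    "User: What was the first thing I told you?",
]
visible = trim_history(history, max_tokens=30)
print(visible)
```

With a budget of 30 "tokens", the two oldest messages no longer fit, so the model would have no way to answer the final question from what it can see.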
The Architecture: Putting It All Together
A Transformer processes your text through the stages listed under Key Components.
Key Components
- Input Embedding: Converts text to numbers the model can process. A vector is just a fixed-length list of numbers (e.g. [0.12, -0.04, ...]) that acts like coordinates for the word's meaning, so similar words land near each other.
- Positional Encoding: Because the model looks at all words at once (not left-to-right), it would otherwise have no sense of order: "dog bites man" and "man bites dog" would look identical. Positional encoding stamps each word with its position so order is preserved.
- Multi-Head Attention: Multiple attention mechanisms working in parallel (like having multiple readers analyzing the text)
- Feed-Forward Network: After attention mixes in context from other words, this step processes each word's information on its own. Think of a .map() applied independently to every word after the join gathered its context.
- Output Layer: Predicts the next token
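The "dog bites man" point can be checked with a toy sketch. The 2-number embeddings and the "add 0.1 per position" stamp below are assumptions made up for illustration; real models use learned embeddings and more elaborate position schemes.

```python
# Assumption: hand-picked 2-d embeddings; a real model learns much longer vectors.
EMBEDDINGS = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}

def embed(sentence):
    """Look up one vector per word."""
    return [EMBEDDINGS[word] for word in sentence.split()]

def add_positions(vectors):
    """Stamp each vector with its position (a toy stand-in for positional encoding)."""
    return [[x + 0.1 * position for x in vector] for position, vector in enumerate(vectors)]

def as_unordered(vectors):
    """View the vectors as an order-free collection of inputs."""
    return sorted(tuple(vector) for vector in vectors)

a = embed("dog bites man")
b = embed("man bites dog")

same_without_positions = as_unordered(a) == as_unordered(b)
same_with_positions = as_unordered(add_positions(a)) == as_unordered(add_positions(b))
print(same_without_positions, same_with_positions)
```

Without the position stamp, the two sentences are the same bag of vectors. Once positions are added, the vectors for "dog" and "man" differ between the two sentences, so the order is recoverable.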
Multi-Head Attention: Multiple Perspectives
Instead of one attention mechanism, Transformers use multiple "heads", each looking for different patterns (for example, one tracking meaning and another tracking grammar).
Think of it like having a team of editors, each specializing in different aspects of writing!
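As a rough sketch of the idea, the code below slices each word's vector into two chunks and lets each "head" compute its own attention weights on its chunk. The 4-number vectors are invented for illustration, and using the raw vectors as both Query and Key is a simplification: real models apply learned projections for each head.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(vectors):
    """For each vector, compute weights over all vectors from scaled dot products."""
    scale = math.sqrt(len(vectors[0]))
    return [
        softmax([sum(a * b for a, b in zip(query, key)) / scale for key in vectors])
        for query in vectors
    ]

def split_heads(vectors, num_heads):
    """Slice each vector into num_heads equal chunks, one list of chunks per head."""
    size = len(vectors[0]) // num_heads
    return [
        [vector[h * size:(h + 1) * size] for vector in vectors]
        for h in range(num_heads)
    ]

# Assumption: first two numbers loosely stand for "meaning", last two for "grammar".
words = ["river", "bank", "the"]
vectors = [
    [2.0, 0.0, 0.0, 1.0],
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]

per_head = [attention_weights(chunks) for chunks in split_heads(vectors, num_heads=2)]
bank = words.index("bank")
for head, weights in enumerate(per_head):
    print(f"head {head}:", [round(w, 2) for w in weights[bank]])
```

With these toy numbers, head 0 weights "river" far above "the" when looking from "bank", while head 1 weights "the" highest: two editors reading the same sentence and noticing different things.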
Why This Matters for You as a Developer
Understanding Transformers helps you:
- Write better prompts: Knowing the model processes everything at once helps you structure prompts effectively
- Manage context wisely: Understanding context windows helps you design systems that don't lose important information
- Debug weird behavior: When the model "forgets" something, you'll know it might be a context window issue
- Choose the right model: Different context windows and capabilities for different use cases
Checkpoint
You should be able to (a) explain in one sentence why the model can't recall text that fell outside the context window, and (b) for a sentence with a pronoun, name which earlier word self-attention would score highest against it.
Summary
Quick Reference
| Concept | One-liner | SWE analogy |
|---|---|---|
| Self-attention | Each token weighs every other token | A join where rows score their relevance to each other |
| Attention score | Strength of relation between two tokens | A similarity weight |
| Context window | Max tokens the model can see at once | A fixed-size buffer / RAM limit |
| Multi-head attention | Several attention computations in parallel | Running multiple indexes over the same data |
| Parallelism | All positions processed together | SIMD / vectorized ops vs. a serial loop |
Exercises
- Disambiguate by attention. Take the sentence "The trophy didn't fit in the suitcase because it was too big." Write down which word "it" refers to, and list the 2–3 words you'd expect to have the highest attention with "it".
- Estimate a context budget. A model has a 128K-token window. If a chat keeps ~750 tokens per turn, roughly how many turns fit before you must trim history? What strategy would you use when it fills?
- Spot the heads. For "The river bank was muddy after the storm," describe two different relationships separate attention heads might capture (e.g. one for meaning, one for grammar).
- Extend the mental model. Explain in two sentences why parallel processing makes transformers faster to train than older left-to-right RNNs.
- Disambiguate "bank." In "The bank by the river was overgrown with grass," which words would have high attention with "bank", and what does that tell the model "bank" means?
Solutions (approaches)
- "it" = the trophy (the thing that didn't fit). Expect high attention between "it" and "trophy", plus "big"/"fit".
- 128000 / 750 ≈ 170 turns. Past that, summarize older turns or drop the oldest (sliding window); both are covered in later days.
- One head might link "bank" ↔ "river"/"muddy" (meaning → riverbank), another might link the article "The" ↔ "bank" (grammatical structure).
- RNNs process token t only after token t-1, so work is serial; transformers compute all positions' attention simultaneously, which maps cleanly onto GPU parallelism.
- "river", "overgrown", and "grass" would all have high attention with "bank", telling the model it's a riverbank, not a financial bank.
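For the context-budget exercise, the arithmetic can be double-checked with a couple of lines of Python. The constants come straight from the exercise; the variable names are just for illustration.

```python
CONTEXT_WINDOW = 128_000  # tokens, from the exercise
TOKENS_PER_TURN = 750     # rough average per turn, from the exercise

# Whole turns that fit, and the tokens left over after them.
full_turns = CONTEXT_WINDOW // TOKENS_PER_TURN
leftover = CONTEXT_WINDOW - full_turns * TOKENS_PER_TURN
print(full_turns, leftover)
```

In practice you would likely trim earlier than this, since any system prompt and the model's reply also have to fit in the same window.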
What's Next?
Next up, Day 3: Tokenization — the subword chunks (tokens) LLMs actually read text in, and why one word is not always one token.