Ever wondered what happens to your text before an LLM processes it? Spoiler: the model doesn't see words like you do. It sees tokens - and understanding this will save you money and help you write better prompts!
Coming from Software Engineering? Tokenization is like character encoding (UTF-8, ASCII) but for meaning. Just as you've dealt with encoding bugs where one byte doesn't equal one character, in LLMs one word doesn't equal one token. Understanding this mapping is crucial for cost control and debugging — it's the impedance mismatch between human text and model input.
What is Tokenization?
Tokenization is the process of breaking text into smaller pieces called tokens. These tokens are what the LLM actually processes.
The Key Insight
Tokens are NOT always words!
A token can be:
- A whole word: "hello"
- Part of a word: "un" + "believ" + "able"
- A single character: "!"
- A space + word: " the"
- Multiple characters: "ing"
Why Not Just Use Words?
Great question! Here's why tokenization is smarter than simple word splitting:
Benefits of Tokenization
- Smaller vocabulary: Instead of millions of distinct words, models get by with roughly 50,000-200,000 tokens
- Handles new words: Unseen words break down into known subwords (or, in the worst case, raw bytes)
- Efficient encoding: Common patterns get shorter representations
- Language agnostic: The same byte-level machinery works across languages
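To see why unseen words aren't a problem, here's a toy sketch. The tiny vocabulary and the greedy longest-match strategy are illustrative inventions (real tokenizers apply learned BPE merge rules, covered next), but the effect is the same:

```python
# A toy subword tokenizer: greedy longest-match against a tiny,
# made-up vocabulary. Unseen words decompose into known pieces.
VOCAB = {"hello", "un", "believ", "able", "ing", "!"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character (real tokenizers fall
            # back to raw bytes, so nothing is ever "unknown")
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

The tokenizer has never seen "unbelievable", yet it still produces a valid encoding from pieces it does know.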
How Tokenization Actually Works
Most modern LLMs use Byte-Pair Encoding (BPE) or similar algorithms.
The BPE Intuition
Imagine you're building a texting shorthand: you abbreviate whatever you type most often, then start abbreviating combinations of those abbreviations, and so on. BPE does exactly this, automatically, over a huge corpus of training text.
The algorithm:
- Starts with individual characters
- Finds the most common pair of adjacent tokens
- Merges them into a new token
- Repeats until vocabulary size is reached
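The four steps above fit in a few lines of Python. This is a toy trainer, not a production tokenizer (real implementations work on bytes and train on gigabytes of text), but the loop is the same:

```python
# Toy BPE trainer: start from characters, repeatedly merge the most
# frequent adjacent pair until we've done num_merges merges.
from collections import Counter

def bpe_train(text, num_merges):
    tokens = list(text)          # step 1: individual characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        a, b = max(pair_counts, key=pair_counts.get)  # step 2: top pair
        merges.append(a + b)                          # step 3: new token
        # Rewrite the sequence with every (a, b) pair fused into one token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", 2)
print(merges)  # ['lo', 'low'] -- "l"+"o" wins first, then "lo"+"w"
print(tokens)
```

Two merges are enough for the common stem "low" to become a single token, which is exactly how frequent words end up as one token in real vocabularies.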
Let's See Real Tokenization in Python!
Here's how to actually tokenize text using the tiktoken library (OpenAI's tokenizer):
# script_id: day_003_tokenization/basic_tokenization
# Install: pip install tiktoken
import tiktoken
# Get the tokenizer for GPT-4 (the cl100k_base encoding)
encoder = tiktoken.encoding_for_model("gpt-4")
# Let's tokenize some text!
text = "Hello, how are you doing today?"
# Encode text to tokens
tokens = encoder.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")
# Output:
# Text: Hello, how are you doing today?
# Tokens: [9906, 11, 1268, 527, 499, 3815, 3432, 30]
# Number of tokens: 8
Decoding Tokens Back to Text
# script_id: day_003_tokenization/decode_tokens
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4")
text = "Hello, how are you doing today?"
tokens = encoder.encode(text)
# Decode each token individually to see what it represents
for token in tokens:
    decoded = encoder.decode([token])
    print(f"Token {token} = '{decoded}'")
# Output:
# Token 9906 = 'Hello'
# Token 11 = ','
# Token 1268 = ' how'
# Token 527 = ' are'
# Token 499 = ' you'
# Token 3815 = ' doing'
# Token 3432 = ' today'
# Token 30 = '?'
Notice how most tokens include the leading space? That's a design choice in GPT tokenizers!
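One payoff of baking the space into the token: decoding is just concatenation. Using the token strings printed above:

```python
# The decoded token strings from the example, spaces included. Because
# each token carries its own leading space, plain concatenation
# reconstructs the input exactly -- no bookkeeping about where spaces go.
pieces = ['Hello', ',', ' how', ' are', ' you', ' doing', ' today', '?']
print("".join(pieces))  # Hello, how are you doing today?
```

This is why `decode(encode(text))` always round-trips losslessly.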
Tokenization Surprises
Let's explore some interesting tokenization behaviors:
Example 1: Numbers
# script_id: day_003_tokenization/tokenize_numbers
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4")
# Numbers can be tricky!
numbers = ["42", "1000", "123456789"]
for num in numbers:
    tokens = encoder.encode(num)
    print(f"'{num}' = {len(tokens)} tokens: {tokens}")
# Example output (exact IDs and counts can vary between tokenizer versions):
# '42' = 1 tokens: [2983]
# '1000' = 1 tokens: [1041]
# '123456789' = 3 tokens: [4513, 10961, 19608]
Large numbers get split into multiple tokens! Recent OpenAI tokenizers chunk digit runs into groups of at most three digits.
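The grouping can be sketched with a regex. This assumes the three-digit pre-split rule used by recent OpenAI tokenizers; it reproduces the grouping only, not the actual token IDs:

```python
# Sketch of the digit-grouping rule: a run of digits is pre-split into
# chunks of at most three before BPE runs, so each chunk becomes
# (at most) one token.
import re

def digit_groups(number_string):
    return re.findall(r"\d{1,3}", number_string)

print(digit_groups("42"))         # ['42']
print(digit_groups("123456789"))  # ['123', '456', '789']
```

This is one reason LLMs find arithmetic on long numbers awkward: the model never sees "123456789" as one unit.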
Example 2: Different Languages
# script_id: day_003_tokenization/tokenize_languages
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4")
texts = [
    "Hello world",     # English
    "Bonjour monde",   # French
    "こんにちは世界",    # Japanese
    "مرحبا بالعالم",   # Arabic
]
for text in texts:
    tokens = encoder.encode(text)
    print(f"'{text}' = {len(tokens)} tokens")
# Example output (counts vary by tokenizer):
# 'Hello world' = 2 tokens
# 'Bonjour monde' = 3 tokens
# 'こんにちは世界' = 5 tokens
# 'مرحبا بالعالم' = 8 tokens
Non-English text often uses more tokens because the tokenizer was trained primarily on English!
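A quick way to quantify the gap is characters per token. Plugging in the counts printed above:

```python
# Rough "characters per token" -- a quick way to compare how
# efficiently a tokenizer packs each language.
samples = {"Hello world": 2, "こんにちは世界": 5}
for text, n_tokens in samples.items():
    print(f"{text!r}: {len(text) / n_tokens:.1f} chars/token")
# 'Hello world': 5.5 chars/token
# 'こんにちは世界': 1.4 chars/token
```

At roughly four times more tokens per character, the same message in Japanese costs noticeably more to process than in English.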
Example 3: Code vs Text
# script_id: day_003_tokenization/tokenize_code_vs_text
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4")
# Prose
prose = "The function calculates the sum of two numbers."
prose_tokens = encoder.encode(prose)
# Code
code = "def calculate_sum(a, b): return a + b"
code_tokens = encoder.encode(code)
print(f"Prose ({len(prose)} chars): {len(prose_tokens)} tokens")
print(f"Code ({len(code)} chars): {len(code_tokens)} tokens")
# Example output (exact counts depend on the tokenizer):
# Prose (47 chars): 9 tokens
# Code (37 chars): 14 tokens
Code often requires more tokens per character than natural language! Operators, punctuation, and snake_case identifiers don't compress as well as common English words.
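And this is where token counts hit your wallet: API pricing is per token, not per word. A back-of-the-envelope sketch (the price constant below is a made-up placeholder; check your provider's current rates):

```python
# Back-of-the-envelope API cost from a token count.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # hypothetical USD rate

def estimate_cost(n_tokens, rate=PRICE_PER_MILLION_INPUT_TOKENS):
    return n_tokens * rate / 1_000_000

# A 500-token prompt sent 10,000 times:
print(f"${estimate_cost(500 * 10_000):.2f}")  # $12.50
```

Trimming even 50 tokens from a prompt that runs at scale adds up, which is why counting tokens before you ship a prompt is worth the thirty seconds.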