Welcome to Month 2! We're about to unlock one of the most powerful concepts in AI: embeddings. These arrays of numbers let computers compare meaning, and once you see how, there's nothing magical about it. They make it possible to find similar content and to build search engines that match on what you mean rather than on the exact words you typed.
Coming from Software Engineering? Embeddings are like hash functions for meaning. Just as a hash maps data to a fixed-size number, an embedding maps text to a fixed-size vector, but one that preserves semantic similarity. If you've built search with TF-IDF or done feature engineering for ML models, embeddings are the modern, learned version of those features. One twist: a cryptographic hash scatters similar inputs to wildly different outputs, whereas an embedding does the reverse and lands similar meanings near each other.
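To see the cryptographic-hash half of that contrast concretely, here is a small sketch using Python's standard hashlib. It hashes two nearly identical strings with SHA-256 and counts how many hex characters of the digests agree position by position. The example strings and the helper name `matching_positions` are illustrative choices, not part of the lesson's code.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a 64-character hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def matching_positions(a: str, b: str) -> int:
    """Count positions where two equal-length strings share a character."""
    return sum(1 for x, y in zip(a, b) if x == y)

digest_dog = sha256_hex("dog")
digest_dogs = sha256_hex("dogs")

print(digest_dog)
print(digest_dogs)
print(f"Matching hex positions: {matching_positions(digest_dog, digest_dogs)} of 64")
```

The inputs differ by a single letter, yet the digests should agree at only a few positions, roughly what chance alone would give. An embedding model is trained to do the opposite.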
The Problem: Computers Don't Understand Text
When a computer sees text, it only sees character codes. It has no idea that:
- "dog" and "puppy" are related
- "happy" and "sad" are opposites
- "king" and "queen" share a relationship
Embeddings solve this!
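To see the problem in concrete terms, the sketch below prints the Unicode code points a computer actually stores for a few words, using only built-in Python. The word list and the helper name `char_codes` are just illustrations.

```python
def char_codes(word: str) -> list[int]:
    """Return the Unicode code point of each character in word."""
    return [ord(ch) for ch in word]

for word in ["dog", "puppy", "dot"]:
    print(f"{word!r:>8} -> {char_codes(word)}")

# "dog" and "dot" share two of three codes, yet mean unrelated things;
# "dog" and "puppy" share no codes at all, yet mean nearly the same thing.
shared = set(char_codes("dog")) & set(char_codes("puppy"))
print(f"Codes shared by 'dog' and 'puppy': {shared}")
```

Character codes track spelling, not meaning, which is exactly the gap embeddings are meant to close.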
What is an Embedding?
An embedding is a way to represent text (or images, or any data) as a list of numbers that captures its meaning.
The Key Insight
Similar meanings = Similar numbers!
# script_id: day_019_what_are_embeddings/conceptual_similarity
# fragment
# Conceptual example
embedding_dog = [0.8, 0.3, -0.5, 0.2, ...]
embedding_puppy = [0.79, 0.31, -0.48, 0.19, ...] # Very similar!
embedding_pizza = [0.1, -0.7, 0.9, -0.4, ...] # Very different!
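To make "similar numbers" concrete, the sketch below measures the straight-line distance between the toy vectors above, using only the four values that are shown. The real vectors are much longer, and these numbers are made up for illustration. `math.dist` comes from the Python standard library (3.8+).

```python
import math

# Only the first four illustrative values from the conceptual example.
embedding_dog = [0.8, 0.3, -0.5, 0.2]
embedding_puppy = [0.79, 0.31, -0.48, 0.19]
embedding_pizza = [0.1, -0.7, 0.9, -0.4]

dog_to_puppy = math.dist(embedding_dog, embedding_puppy)
dog_to_pizza = math.dist(embedding_dog, embedding_pizza)

print(f"dog <-> puppy: {dog_to_puppy:.3f}")  # small distance: similar meaning
print(f"dog <-> pizza: {dog_to_pizza:.3f}")  # large distance: different meaning
```

The dog-to-puppy distance comes out at roughly 0.03, while dog-to-pizza is close to 2: nearby vectors stand for nearby meanings.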
Why Embeddings Are Powerful
Real-World Applications
| Use Case | How Embeddings Help |
|---|---|
| Search | Find documents about "cars" when user searches "automobiles" |
| Recommendations | Show similar products/articles |
| Deduplication | Find near-duplicate content |
| Clustering | Group similar support tickets |
| Anomaly Detection | Find unusual patterns |
Generating Embeddings with OpenAI
# script_id: day_019_what_are_embeddings/generate_embedding
from openai import OpenAI
client = OpenAI()
def get_embedding(text: str) -> list[float]:
    """Convert text to an embedding vector."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
# Let's try it!
text = "I love programming in Python"
embedding = get_embedding(text)
print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
Output:
Text: I love programming in Python
Embedding dimensions: 1536
First 10 values: [0.0023, -0.0156, 0.0089, -0.0234, 0.0178, ...]
Understanding the Output
Each of the 1536 numbers represents something about the text's meaning. We don't know exactly what each dimension represents, but together they form a "fingerprint" of the text's meaning. These numbers aren't hand-written by a programmer. The embedding model learned them by being shown enormous amounts of text and adjusting itself until words used in similar contexts ended up with similar numbers — the same way a spam filter "learns" from labeled examples rather than from hard-coded if-statements. You don't configure the dimensions; you just call the API and get the result.
Comparing Embeddings: Where It Pays Off
The real power comes from comparing embeddings:
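For now, treat cosine_similarity as a function that returns a score from -1 to 1: 1 means "these mean almost the same thing," 0 means "unrelated," and a negative score means the two vectors point in opposite directions. In practice, scores for real text embeddings mostly fall between 0 and 1, and even antonyms such as "happy" and "sad" often score fairly high because they show up in similar contexts. We will derive the formula in Day 20; today, just trust the score.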
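If you would like to see those three landmark scores before touching real embeddings, here is a minimal pure-Python sketch on hand-picked 2D vectors. The helper name `cosine` and the vectors are illustrative; real embeddings have far more dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

east = [1.0, 0.0]
print(cosine(east, [2.0, 0.0]))   # same direction     -> 1.0
print(cosine(east, [0.0, 3.0]))   # perpendicular      -> 0.0
print(cosine(east, [-1.0, 0.0]))  # opposite direction -> -1.0
```

Note that the first pair scores 1.0 even though one vector is twice as long as the other: the score depends on direction, not on length.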
# script_id: day_019_what_are_embeddings/compare_embeddings
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
def cosine_similarity(a: list, b: list) -> float:
    """Calculate cosine similarity between two vectors."""
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Compare some texts
texts = [
    "I love dogs",
    "I adore puppies",
    "I enjoy pizza",
    "Machine learning is fascinating"
]
# Get all embeddings
embeddings = {text: get_embedding(text) for text in texts}
# Compare first text with all others
base_text = texts[0]
print(f"Comparing with: '{base_text}'\n")
for text in texts[1:]:
    similarity = cosine_similarity(embeddings[base_text], embeddings[text])
    print(f"'{text}': {similarity:.4f}")
Output (illustrative values; yours will differ):
Comparing with: 'I love dogs'
'I adore puppies': ~0.92 # highest: closest meaning (dogs ≈ puppies)
'I enjoy pizza': ~0.45 # lower: different topic
'Machine learning is fascinating': ~0.30 # lowest: unrelated
Don't read these as absolute grades — what matters is the RANKING: the closest meaning gets the highest score. Exact values vary by model and version.
Visualizing Embeddings
Embeddings live in a high-dimensional space, but we can project them down to 2D to visualize them.
Our embeddings have 1536 numbers each, which we can't draw. t-SNE is an off-the-shelf tool (from scikit-learn) that squashes those 1536 numbers down to just 2 — an x and a y — while keeping things that were close in the original space close on the chart. Think of it like flattening a 3D map to a 2D printout: some distortion, but the neighborhoods survive. You don't need its internals to use it.
# script_id: day_019_what_are_embeddings/visualize_embeddings
# Visualization with matplotlib (requires: pip install matplotlib scikit-learn)
from openai import OpenAI
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
client = OpenAI()
def visualize_embeddings(texts: list[str]):
    """Visualize embeddings in 2D space."""
    # Get embeddings
    embeddings = []
    for text in texts:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        embeddings.append(response.data[0].embedding)
    embeddings = np.array(embeddings)
    # Reduce to 2D using t-SNE
    tsne = TSNE(n_components=2, random_state=42, perplexity=min(5, len(texts)-1))
    reduced = tsne.fit_transform(embeddings)
    # Plot
    plt.figure(figsize=(12, 8))
    for i, text in enumerate(texts):
        plt.scatter(reduced[i, 0], reduced[i, 1], s=100)
        plt.annotate(text, (reduced[i, 0], reduced[i, 1]), fontsize=9)
    plt.title("Embedding Space Visualization")
    plt.savefig("embeddings_viz.png", dpi=150, bbox_inches='tight')
    print("Saved visualization to embeddings_viz.png")
# Example usage
texts = [
    "dog", "puppy", "cat", "kitten",
    "pizza", "pasta", "burger",
    "car", "truck", "bicycle",
    "happy", "joyful", "sad"
]
visualize_embeddings(texts)
Embedding Models Comparison
| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | General use, cost-effective | $ |
| text-embedding-3-large | 3072 | Higher accuracy needs | $$ |
| text-embedding-ada-002 | 1536 | Legacy — do not use for new projects | $ |
Choosing the Right Model
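One practical input to this choice is storage. As a rough back-of-the-envelope sketch, assuming each value is stored as a 4-byte float32 (an assumption about your storage layer, not something the API dictates) and ignoring any index overhead, the footprint scales linearly with the dimension count. The helper name `storage_megabytes` and the figure of 100,000 documents are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

# Dimension counts taken from the comparison table above.
MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def storage_megabytes(num_vectors: int, dimensions: int) -> float:
    """Raw vector storage in megabytes (10**6 bytes), excluding index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1_000_000

for model, dims in MODEL_DIMENSIONS.items():
    mb = storage_megabytes(100_000, dims)
    print(f"{model}: {mb:,.1f} MB for 100,000 documents")
```

Doubling the dimensions doubles the storage (about 614 MB versus about 1,229 MB here), so the larger model is worth it only when the accuracy gain matters for your use case.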
Batch Processing Embeddings
For efficiency, process multiple texts at once:
# script_id: day_019_what_are_embeddings/batch_embeddings
from openai import OpenAI
client = OpenAI()
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
    """Get embeddings for multiple texts at once."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    # Sort by index to maintain order
    sorted_data = sorted(response.data, key=lambda x: x.index)
    return [item.embedding for item in sorted_data]
# Much more efficient than calling one at a time!
texts = [
    "First document about AI",
    "Second document about cooking",
    "Third document about sports",
    "Fourth document about music"
]
embeddings = get_embeddings_batch(texts)
print(f"Got {len(embeddings)} embeddings")
print(f"Each with {len(embeddings[0])} dimensions")
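For larger corpora you will usually need to split the list across several requests, because providers cap how many inputs a single call may carry. A minimal sketch of the chunking step follows. The helper name `chunked` and the batch size of 100 are hypothetical choices, so check your provider's current limits before picking a value.

```python
from typing import Iterator

def chunked(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

documents = [f"Document {i}" for i in range(250)]
batches = list(chunked(documents, batch_size=100))

print(f"{len(documents)} documents -> {len(batches)} batches")
print(f"Batch sizes: {[len(batch) for batch in batches]}")
```

Each batch can then be passed to get_embeddings_batch, with the results concatenated in the same order.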
Common Embedding Operations
1. Find Most Similar
# script_id: day_019_what_are_embeddings/find_most_similar
from openai import OpenAI
import numpy as np
client = OpenAI()
def find_most_similar(query: str, documents: list[str], top_k: int = 3) -> list[tuple]:
    """Find the most similar documents to a query."""
    # Get all embeddings
    all_texts = [query] + documents
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=all_texts
    )
    embeddings = [d.embedding for d in sorted(response.data, key=lambda x: x.index)]
    query_embedding = embeddings[0]
    doc_embeddings = embeddings[1:]
    # Calculate similarities
    similarities = []
    for i, doc_emb in enumerate(doc_embeddings):
        sim = np.dot(query_embedding, doc_emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
        )
        similarities.append((documents[i], sim))
    # Sort and return top k
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]
# Example
documents = [
    "Python is a great programming language for beginners",
    "Machine learning requires lots of data",
    "The best pizza is made in Italy",
    "Learning to code can change your career",
    "Artificial intelligence is transforming industries"
]
query = "How do I start programming?"
results = find_most_similar(query, documents, top_k=3)
print(f"Query: {query}\n")
for doc, score in results:
    print(f" {score:.4f}: {doc}")
2. Cluster Similar Content
Clustering = automatically sorting items into N groups so that items in the same group are similar. K-Means (from scikit-learn) does this: you tell it how many groups you want (n_clusters), and it assigns each document to the nearest group. It's like GROUP BY, except you group by "meaning" instead of by an exact column value.
# script_id: day_019_what_are_embeddings/cluster_documents
from openai import OpenAI
from sklearn.cluster import KMeans
import numpy as np
client = OpenAI()
def cluster_documents(documents: list[str], n_clusters: int = 3) -> dict:
    """Cluster documents by semantic similarity."""
    # Get embeddings
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=documents
    )
    embeddings = np.array([d.embedding for d in sorted(response.data, key=lambda x: x.index)])
    # Cluster
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(embeddings)
    # Group documents by cluster
    clusters = {}
    for doc, label in zip(documents, labels):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append(doc)
    return clusters
# Example
docs = [
    "Python programming tutorial",
    "JavaScript for web development",
    "Best pasta recipes",
    "How to make pizza at home",
    "Machine learning basics",
    "Deep learning with TensorFlow",
    "Italian cooking guide",
    "React framework tutorial"
]
clusters = cluster_documents(docs, n_clusters=3)
# Use a separate name so the loop does not overwrite the docs list above
for cluster_id, cluster_docs in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for doc in cluster_docs:
        print(f" - {doc}")
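To demystify the "assigns each document to the nearest group" step, here is a sketch of just that assignment step in plain Python, on made-up 2D points with fixed group centers. This is not scikit-learn's implementation: real K-Means also moves the centers and repeats until the assignments settle, and it runs on the full embedding vectors. The name `assign_to_nearest` and all the numbers are illustrative.

```python
import math

def assign_to_nearest(points: list[list[float]], centers: list[list[float]]) -> list[int]:
    """For each point, return the index of the closest center (Euclidean distance)."""
    labels = []
    for point in points:
        distances = [math.dist(point, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels

# Made-up 2D stand-ins for embeddings: two "code" docs and two "food" docs.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
centers = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical starting centers

print(assign_to_nearest(points, centers))  # [0, 0, 1, 1]
```

The first two points land in group 0 and the last two in group 1, which is the same "group by nearest meaning" behavior the full script relies on.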
Checkpoint
Run find_most_similar with a query against your document list and confirm: the top hits are the ones that mean the same thing as your query, even when they share no exact keywords (e.g. "feline" surfaces "cat" sentences). If results look random, check that every text was embedded with the same model and that you're sorting by descending cosine similarity, not ascending.
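If you want to test the ranking logic on its own, without any API calls, the sketch below runs the same sort-descending-then-slice pattern on precomputed scores. The documents and scores are invented for illustration, and `top_k_by_score` is a hypothetical helper, not part of the lesson's code.

```python
def top_k_by_score(scored: list[tuple[str, float]], top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k (document, score) pairs, highest score first."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Invented scores standing in for cosine similarities against a "feline" query.
scored_docs = [
    ("Stock markets fell on Tuesday", 0.12),
    ("My cat sleeps all day", 0.71),
    ("Kittens need frequent meals", 0.64),
    ("How to repot a cactus", 0.18),
]

for doc, score in top_k_by_score(scored_docs, top_k=2):
    print(f"{score:.2f}: {doc}")

# Forgetting reverse=True would surface the LEAST similar documents instead.
```

If this prints the two cat-related sentences, your sorting direction is right, and any remaining oddities are more likely to come from the embeddings themselves.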
Summary
Quick Reference
# script_id: day_019_what_are_embeddings/quick_reference
from openai import OpenAI
client = OpenAI()
# Single embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Your text here"
)
embedding = response.data[0].embedding
# Batch embeddings
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Text 1", "Text 2", "Text 3"]
)
embeddings = [d.embedding for d in sorted(response.data, key=lambda x: x.index)] # sort by index to keep input order
Exercises
- Similarity Explorer: Create a tool that takes a word and finds the 5 most similar words from a predefined list
- Document Deduplicator: Build a system that identifies near-duplicate documents using embedding similarity
- Topic Detector: Create clusters of your own content and see what topics emerge
Solutions (approaches)
- Embed the input word and the candidate list in one batch call, cosine-rank candidates against the word, take the top 5.
- Embed all documents in a batch; flag any pair with cosine similarity above ~0.95 as a near-duplicate.
- Reuse cluster_documents, print the docs per cluster, and eyeball the emergent topic for each group.
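As a starting point for the deduplicator, here is a hedged sketch of the pairwise check on toy vectors. The 0.95 threshold mirrors the rough figure in the solution above and should be tuned on your own data. The vectors and the names `cosine` and `find_near_duplicates` are illustrative; in a real run the vectors would come from a batch embedding call.

```python
import math
from itertools import combinations

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def find_near_duplicates(
    vectors: dict[str, list[float]], threshold: float = 0.95
) -> list[tuple[str, str]]:
    """Return every pair of keys whose vectors score at or above threshold."""
    return [
        (name_a, name_b)
        for (name_a, vec_a), (name_b, vec_b) in combinations(vectors.items(), 2)
        if cosine(vec_a, vec_b) >= threshold
    ]

# Toy 3D vectors standing in for real embeddings.
toy_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_a_copy": [0.88, 0.12, 0.01],
    "doc_b": [0.0, 0.2, 0.95],
}

print(find_near_duplicates(toy_vectors))  # [('doc_a', 'doc_a_copy')]
```

Comparing every pair grows quadratically with the number of documents, so this brute-force approach suits small collections; it is meant as a sketch of the idea rather than a production design.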
What's Next?
Now that you understand what embeddings are, let's learn the math behind comparing them: Cosine Similarity and Euclidean Distance!