Phase 2: RAG and Tool Calling · 13 min read

Cosine Similarity and Euclidean Distance


You've learned that embeddings are lists of numbers representing meaning. But how do we actually measure "similarity" between two embeddings? In this guide, you'll master the two most important similarity metrics: Cosine Similarity and Euclidean Distance.

Coming from Software Engineering? Comparing two embeddings is like writing a custom comparator or distance function — instead of ==, you return a score for how close two records are. Euclidean is ruler-distance (think find-restaurants-near-me); cosine compares direction, ignoring how far out a point sits. (If you've used PostGIS geo queries, it's the same idea in many more dimensions.)


Why Measure Similarity?

Every time you:

  • Search for similar items
  • Find related documents
  • Recommend content
  • Detect duplicates

You need a way to measure "how similar" two embeddings are.


Cosine Similarity: The Angle Approach

Cosine similarity measures the angle between two vectors, ignoring their length.

A vector here is just the embedding list from Day 19. We draw examples in 2D so you can see the angle, but the exact same math runs on the full 1536-number list — you just can't draw it. Don't worry that you can't picture 1536 dimensions; the formula works the same no matter how many entries the list has.

The Intuition

Think of it like compass directions:

  • Same direction (both pointing north) → Very similar
  • Perpendicular (north vs east) → Unrelated
  • Opposite (north vs south) → Very different

The Formula

cosine_similarity = (A · B) / (||A|| × ||B||)

Where:
- A · B = dot product of A and B
- ||A|| = magnitude (length) of A
- ||B|| = magnitude (length) of B

The dot product is just: multiply the two lists position-by-position, then add up the results. For [1,2,3] and [4,5,6] that is 1*4 + 2*5 + 3*6 = 32. np.dot does this for you.

The magnitude (or norm) of a vector is how long the arrow is: the square root of the sum of its squared values. For [3,4] that is sqrt(3*3 + 4*4) = 5. np.linalg.norm computes this. (Note: this is a different kind of length than Python's len(), which just counts entries.)
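As a quick check of the two definitions above, here is a minimal sketch that computes the dot product and the magnitude by hand and compares them with NumPy. The variable names are illustrative only.

```python
import math
import numpy as np

a = [1, 2, 3]
b = [4, 5, 6]

# Dot product by hand: multiply position-by-position, then sum
dot_by_hand = sum(x * y for x, y in zip(a, b))   # 1*4 + 2*5 + 3*6
dot_numpy = np.dot(a, b)

# Magnitude by hand: square root of the sum of squares
v = [3, 4]
norm_by_hand = math.sqrt(sum(x * x for x in v))  # sqrt(3*3 + 4*4)
norm_numpy = np.linalg.norm(v)

print(dot_by_hand, dot_numpy)    # 32 32
print(norm_by_hand, norm_numpy)  # 5.0 5.0
print(len(v))                    # 2: len() counts entries, it is not the magnitude
```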

Python Implementation

# script_id: day_020_cosine_euclidean_similarity/cosine_similarity_impl
import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    """
    Calculate cosine similarity between two vectors.

    Returns a value between -1 and 1:
    - 1: Identical direction (most similar)
    - 0: Perpendicular (unrelated)
    - -1: Opposite direction (least similar)
    """
    a = np.array(a)
    b = np.array(b)

    # Dot product
    dot_product = np.dot(a, b)

    # Magnitudes
    magnitude_a = np.linalg.norm(a)
    magnitude_b = np.linalg.norm(b)

    # Avoid division by zero
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0

    return dot_product / (magnitude_a * magnitude_b)

# Example with simple 2D vectors
v1 = [1, 0]  # Pointing right
v2 = [1, 0]  # Same direction
v3 = [0, 1]  # Pointing up (perpendicular)
v4 = [-1, 0]  # Pointing left (opposite)

print(f"Same direction: {cosine_similarity(v1, v2):.4f}")      # 1.0
print(f"Perpendicular: {cosine_similarity(v1, v3):.4f}")       # 0.0
print(f"Opposite: {cosine_similarity(v1, v4):.4f}")            # -1.0
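If the compass picture helps, a cosine score can be turned back into an angle with arccos. The sketch below assumes non-zero input vectors; angle_degrees is a hypothetical helper, not part of the lesson's scripts.

```python
import numpy as np

def angle_degrees(a, b) -> float:
    """Angle between two non-zero vectors, in degrees (hypothetical helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against tiny floating-point overshoot outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

north, east, south = [0, 1], [1, 0], [0, -1]

print(f"north vs north: {angle_degrees(north, north):.1f} degrees")  # same direction
print(f"north vs east:  {angle_degrees(north, east):.1f} degrees")   # perpendicular
print(f"north vs south: {angle_degrees(north, south):.1f} degrees")  # opposite
```

Cosine 1 corresponds to 0 degrees, cosine 0 to 90 degrees, and cosine -1 to 180 degrees.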

Real Embedding Example

# script_id: day_020_cosine_euclidean_similarity/cosine_real_embeddings
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare meanings
texts = {
    "base": "I love programming",
    "similar": "I enjoy coding",
    "related": "Software development is fun",
    "different": "The weather is nice today"
}

embeddings = {key: get_embedding(text) for key, text in texts.items()}

base_emb = embeddings["base"]
print(f"Base: '{texts['base']}'\n")

for key in ["similar", "related", "different"]:
    sim = cosine_similarity(base_emb, embeddings[key])
    print(f"  vs '{texts[key]}': {sim:.4f}")

# Approximate — your exact values will differ, but the ordering should hold:
# Base: 'I love programming'
#   vs 'I enjoy coding': 0.8923 (high - similar meaning)
#   vs 'Software development is fun': 0.7845 (medium - related topic)
#   vs 'The weather is nice today': 0.4532 (low - different topic)

Euclidean Distance: The Straight Line

Euclidean distance measures the straight-line distance between two points in space.

The Intuition

It's literally the distance you'd measure with a ruler!

The Formula

euclidean_distance = √(Σ(ai - bi)²)

Or in simpler terms:
distance = √((a1-b1)² + (a2-b2)² + ... + (an-bn)²)
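To see the formula term by term, here is a small sketch that spells it out in plain Python and checks it against NumPy. The helper name euclidean_by_hand and the sample vectors are illustrative assumptions.

```python
import math
import numpy as np

def euclidean_by_hand(a, b) -> float:
    """Straight-line distance written out term by term (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of entries")
    squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(squared_diffs))

a = [1, 2, 3]
b = [4, 6, 3]

print(euclidean_by_hand(a, b))                    # sqrt(9 + 16 + 0) = 5.0
print(np.linalg.norm(np.array(a) - np.array(b)))  # same value via NumPy
```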

Python Implementation

# script_id: day_020_cosine_euclidean_similarity/euclidean_distance_impl
import numpy as np

def euclidean_distance(a: list, b: list) -> float:
    """
    Calculate Euclidean distance between two vectors.

    Returns a value >= 0:
    - 0: Identical points
    - Small: Similar
    - Large: Different
    """
    a = np.array(a)
    b = np.array(b)
    return np.linalg.norm(a - b)

# Simple 2D example
point1 = [0, 0]
point2 = [3, 4]
point3 = [0, 0]

print(f"Distance (0,0) to (3,4): {euclidean_distance(point1, point2):.4f}")  # 5.0
print(f"Distance (0,0) to (0,0): {euclidean_distance(point1, point3):.4f}")  # 0.0

With Real Embeddings

# script_id: day_020_cosine_euclidean_similarity/euclidean_real_embeddings
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def euclidean_distance(a: list, b: list) -> float:
    return np.linalg.norm(np.array(a) - np.array(b))

# Compare texts
texts = ["cat", "kitten", "dog", "pizza"]
embeddings = [get_embedding(t) for t in texts]

print("Euclidean distances from 'cat':\n")
for i, text in enumerate(texts[1:], 1):
    dist = euclidean_distance(embeddings[0], embeddings[i])
    print(f"  '{text}': {dist:.4f}")

# Approximate — your exact values will differ, but the ordering should hold:
# 'kitten': 0.45 (small - very similar)
# 'dog': 0.62 (medium - related animal)
# 'pizza': 1.23 (large - unrelated)

Cosine vs Euclidean: When to Use Which?

Comparison Table

Aspect                  | Cosine Similarity      | Euclidean Distance
Measures                | Angle between vectors  | Straight-line distance
Range                   | -1 to 1                | 0 to ∞
Best similarity         | 1 (same direction)     | 0 (same point)
Affected by magnitude?  | No                     | Yes
Common use              | Text similarity        | Clustering (grouping nearby points)
Interpretation          | Direction match        | Spatial closeness

For text search and RAG you will almost always use cosine.

Visual Comparison

# script_id: day_020_cosine_euclidean_similarity/compare_metrics
import numpy as np

def compare_metrics():
    """Show how cosine and euclidean can give different results."""

    # Two vectors with same direction but different magnitudes
    a = [1, 1]
    b = [2, 2]  # Same direction as a, but twice as long

    # Another vector at different angle
    c = [1, 0]

    cosine_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    cosine_ac = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c))

    euclidean_ab = np.linalg.norm(np.array(a) - np.array(b))
    euclidean_ac = np.linalg.norm(np.array(a) - np.array(c))

    print("Vectors: a=[1,1], b=[2,2], c=[1,0]")
    print("\nCosine Similarity:")
    print(f"  a vs b: {cosine_ab:.4f}  (same direction = identical!)")
    print(f"  a vs c: {cosine_ac:.4f}  (45° angle)")

    print("\nEuclidean Distance:")
    print(f"  a vs b: {euclidean_ab:.4f}  (different point)")
    print(f"  a vs c: {euclidean_ac:.4f}  (closer point)")

compare_metrics()

Output:

Vectors: a=[1,1], b=[2,2], c=[1,0]

Cosine Similarity:
  a vs b: 1.0000  (same direction = identical!)
  a vs c: 0.7071  (45° angle)

Euclidean Distance:
  a vs b: 1.4142  (different point)
  a vs c: 1.0000  (closer point)

Notice how cosine sees a and b as identical (same direction), while euclidean sees them as different (different positions)!


Normalized Embeddings: Best of Both Worlds

When embeddings are normalized (length = 1), cosine similarity and Euclidean distance become mathematically linked: euclidean² = 2(1 - cosine).

Normalizing means shrinking or stretching the arrow to exactly length 1 while keeping its direction — like keeping a compass heading but ignoring how far you walked. Only direction is left, which is exactly what cosine cares about (and it can make search faster).

# script_id: day_020_cosine_euclidean_similarity/normalized_embeddings
import numpy as np

def normalize(v: list) -> np.ndarray:
    """Normalize a vector to unit length."""
    v = np.array(v)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(np.array(a) - np.array(b))

# Original vectors
a = [3, 4]
b = [1, 2]

# Normalized versions
a_norm = normalize(a)
b_norm = normalize(b)

print("Normalized vectors always have length 1:")
print(f"  |a_norm| = {np.linalg.norm(a_norm):.4f}")
print(f"  |b_norm| = {np.linalg.norm(b_norm):.4f}")

cosine = cosine_similarity(a_norm, b_norm)
euclidean = euclidean_distance(a_norm, b_norm)

print(f"\nCosine similarity: {cosine:.4f}")
print(f"Euclidean distance: {euclidean:.4f}")

# Mathematical relationship: euclidean² = 2(1 - cosine) for normalized vectors
print(f"\nVerify: euclidean² = {euclidean**2:.4f}")
print(f"        2(1 - cosine) = {2*(1-cosine):.4f}")

Many embedding models return already normalized vectors, making either metric work well.

You do not need to derive this. The takeaway: once vectors are normalized, ranking by highest cosine and ranking by smallest euclidean distance give the SAME order — so pick whichever your vector database supports.
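The claim about identical ordering can be checked empirically. The sketch below uses random vectors as stand-ins for real embeddings (an assumption made so that it runs without an API key), normalizes them, and compares the two rankings.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # random stand-ins for real embeddings

docs = rng.normal(size=(6, 8))
query = rng.normal(size=8)

# Normalize every row (and the query) to unit length
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

cosine_scores = docs_unit @ query_unit  # dot product equals cosine for unit vectors
distances = np.linalg.norm(docs_unit - query_unit, axis=1)

rank_by_cosine = np.argsort(-cosine_scores)  # highest similarity first
rank_by_distance = np.argsort(distances)     # smallest distance first

print("by cosine:  ", rank_by_cosine)
print("by distance:", rank_by_distance)
print("same order: ", np.array_equal(rank_by_cosine, rank_by_distance))
```

The orders match because, for unit vectors, distance is a strictly decreasing function of cosine. Without the normalization step the two rankings can disagree.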


For large-scale search, normalize the document embeddings once and score every document with a single matrix operation instead of a Python loop:

# script_id: day_020_cosine_euclidean_similarity/efficient_search
import numpy as np
from openai import OpenAI

client = OpenAI()

class EfficientSearch:
    """Efficient similarity search using matrix operations."""

    def __init__(self, documents: list[str]):
        self.documents = documents
        self._embed_documents()

    def _embed_documents(self):
        """Embed all documents and normalize."""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=self.documents
        )
        embeddings = [d.embedding for d in sorted(response.data, key=lambda x: x.index)]
        self.embeddings = np.array(embeddings)

        # Normalize for efficient cosine similarity
        norms = np.linalg.norm(self.embeddings, axis=1, keepdims=True)
        self.normalized = self.embeddings / norms

    def search(self, query: str, top_k: int = 5) -> list[tuple]:
        """Search for most similar documents."""
        # Embed query
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=query
        )
        query_emb = np.array(response.data[0].embedding)
        query_norm = query_emb / np.linalg.norm(query_emb)

        # Compute all similarities at once (matrix multiplication!)
        # self.normalized has one row per document. np.dot against the query
        # computes the cosine for every document in a single call instead of a
        # Python loop — same math as before, just batched.
        similarities = np.dot(self.normalized, query_norm)

        # Get top k
        # argsort sorts ascending; [::-1] flips to highest-first, [:top_k] keeps the best k
        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [(self.documents[i], similarities[i]) for i in top_indices]

# Usage
documents = [
    "Python is great for data science",
    "JavaScript powers the web",
    "Machine learning requires math",
    "Deep learning uses neural networks",
    "Web development is creative work",
    "Data analysis reveals insights",
    "Cooking is a form of art",
    "Music brings people together"
]

searcher = EfficientSearch(documents)

# Search!
results = searcher.search("How do I analyze data?", top_k=3)
print("Query: 'How do I analyze data?'\n")
for doc, score in results:
    print(f"  {score:.4f}: {doc}")
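To exercise the ranking logic without network calls, here is a minimal offline sketch. The three-dimensional vectors, the document labels, and the search_offline function are all made up for illustration; real embeddings would come from the API as in the class above.

```python
import numpy as np

# Hand-made 3D stand-ins for embeddings (assumption: real ones come from the API)
documents = ["doc about data", "doc about web", "doc about cooking"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
])
normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search_offline(query_vec, top_k: int = 2):
    """Rank documents by cosine similarity to query_vec, no network needed."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    similarities = normalized @ q                 # one score per document
    top = np.argsort(similarities)[::-1][:top_k]  # highest first
    return [(documents[i], float(similarities[i])) for i in top]

results = search_offline([1.0, 0.2, 0.0])
for doc, score in results:
    print(f"{score:.4f}: {doc}")
```

The query vector points mostly along the first axis, so the "data" document ranks first and the "web" document second.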

Converting Between Metrics

Some vector databases only expose Euclidean distance. For normalized vectors, the identity euclidean² = 2(1 - cosine) lets you translate a cosine score or threshold into the equivalent distance, and back:

# script_id: day_020_cosine_euclidean_similarity/metric_conversion
import numpy as np

def cosine_to_euclidean(cosine_sim: float) -> float:
    """
    Convert cosine similarity to euclidean distance.
    Only valid for normalized vectors!
    """
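    # Clamp at 0: rounding can push a cosine slightly above 1, which would give sqrt of a negative number
    return float(np.sqrt(max(0.0, 2 * (1 - cosine_sim))))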

def euclidean_to_cosine(euclidean_dist: float) -> float:
    """
    Convert euclidean distance to cosine similarity.
    Only valid for normalized vectors!
    """
    return 1 - (euclidean_dist ** 2) / 2

# Test conversions
cosine_sim = 0.9
euclidean_dist = cosine_to_euclidean(cosine_sim)
print(f"Cosine {cosine_sim} → Euclidean {euclidean_dist:.4f}")

cosine_back = euclidean_to_cosine(euclidean_dist)
print(f"Euclidean {euclidean_dist:.4f} → Cosine {cosine_back:.4f}")
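One practical use of the conversion is turning a minimum-cosine cutoff into a maximum-distance cutoff. The sketch below assumes unit-length vectors; the function name cosine_threshold_to_max_distance and the 0.8 cutoff are illustrative choices.

```python
import numpy as np

def cosine_threshold_to_max_distance(min_cosine: float) -> float:
    """For unit-length vectors: cosine >= t is equivalent to distance <= sqrt(2 * (1 - t))."""
    return float(np.sqrt(2 * (1 - min_cosine)))

# Unit vectors: b is 30 degrees away from a, c is 90 degrees away
a = np.array([1.0, 0.0])
b = np.array([np.cos(np.radians(30)), np.sin(np.radians(30))])
c = np.array([0.0, 1.0])

max_dist = cosine_threshold_to_max_distance(0.8)
print(f"cosine >= 0.8 corresponds to distance <= {max_dist:.4f}")

for name, v in [("b", b), ("c", c)]:
    dist = float(np.linalg.norm(a - v))
    cos = float(np.dot(a, v))
    print(f"{name}: cosine={cos:.4f}, distance={dist:.4f}, keep={dist <= max_dist}")
```

Both tests give the same keep/drop decision, so a database that only filters by Euclidean distance can still enforce a cosine cutoff.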

Practical Similarity Thresholds

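What counts as "similar"? Here are rough guidelines. Treat them as starting points: score ranges vary by embedding model and dataset, so calibrate the cutoffs on your own data.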

# script_id: day_020_cosine_euclidean_similarity/categorize_similarity
def categorize_similarity(cosine_sim: float) -> str:
    """Categorize similarity score."""
    if cosine_sim >= 0.95:
        return "Near duplicate"
    elif cosine_sim >= 0.85:
        return "Very similar"
    elif cosine_sim >= 0.70:
        return "Related"
    elif cosine_sim >= 0.50:
        return "Somewhat related"
    else:
        return "Different topics"

# Example
similarities = [0.98, 0.91, 0.75, 0.55, 0.32]
for sim in similarities:
    print(f"{sim:.2f}: {categorize_similarity(sim)}")
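Because cutoffs like these depend on your data, it can help to sweep a few candidates against pairs you have labeled yourself. The scores and labels below are invented purely for illustration, and accuracy_at is a hypothetical helper.

```python
# Hypothetical labeled scores: (cosine similarity, is_actually_similar)
labeled = [
    (0.91, True), (0.84, True), (0.78, True),
    (0.66, False), (0.52, False), (0.73, False),
]

def accuracy_at(threshold: float) -> float:
    """Fraction of pairs classified correctly when 'similar' means score >= threshold."""
    correct = sum((score >= threshold) == label for score, label in labeled)
    return correct / len(labeled)

candidates = [0.50, 0.60, 0.70, 0.75, 0.80, 0.90]
best = max(candidates, key=accuracy_at)

for t in candidates:
    print(f"threshold {t:.2f}: accuracy {accuracy_at(t):.2f}")
print(f"best candidate: {best:.2f}")
```

On this toy data the 0.75 cutoff separates the two groups cleanly. Real data rarely separates perfectly, so expect to trade off false positives against false negatives.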

Checkpoint

Run the cosine_similarity sanity checks and confirm the printed values: same direction ≈ 1.0, perpendicular ≈ 0.0, opposite ≈ -1.0. Then run compare_metrics and confirm cosine treats [1,1] and [2,2] as identical while Euclidean does not; that's the magnitude-vs-direction distinction. If your cosine values blow past 1.0, check that you're dividing the dot product by the product of the two norms rather than returning the raw dot product.


Summary


Quick Reference

# script_id: day_020_cosine_euclidean_similarity/quick_reference
import numpy as np

# Cosine Similarity (-1 to 1, higher = more similar)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean Distance (0 to ∞, lower = more similar)
def euclidean_distance(a, b):
    return np.linalg.norm(np.array(a) - np.array(b))

# Normalize vectors (for efficient comparison)
def normalize(v):
    return np.array(v) / np.linalg.norm(v)

Exercises

  1. Threshold Finder: Experiment with different similarity thresholds for your data to find the optimal cutoff

  2. Metric Comparison: Compare cosine vs euclidean results on the same dataset - when do they differ?

  3. Speed Test: Benchmark naive pairwise comparison vs matrix multiplication for 1000 documents

Solutions (approaches)
  1. Threshold Finder: Sweep candidate thresholds against a few labeled pairs (some you know are similar, some you know are not), and pick the cutoff that cleanly separates them.

  2. Metric Comparison: Take same-direction vectors of different magnitude (e.g. [1,1] and [2,2]) — cosine is identical while euclidean differs. They diverge whenever magnitudes differ and you have not normalized. Reuse compare_metrics to see it.

  3. Speed Test: Wrap time.perf_counter() around a Python loop of dot products versus a single np.dot(matrix, query). The matrix form is typically 10-100x faster.
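A possible starting point for the speed test is sketched below. It uses random vectors in place of real embeddings (an assumption, so that no API calls are needed) and makes no promise about the exact speedup, which depends on your machine and NumPy build.

```python
import time
import numpy as np

rng = np.random.default_rng(seed=0)
matrix = rng.normal(size=(1000, 1536))  # random stand-ins for 1000 embeddings
query = rng.normal(size=1536)

# Naive: one Python-level dot product per document
start = time.perf_counter()
loop_scores = np.array([np.dot(row, query) for row in matrix])
loop_seconds = time.perf_counter() - start

# Batched: a single matrix-vector product
start = time.perf_counter()
batch_scores = np.dot(matrix, query)
batch_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds * 1000:.3f} ms")
print(f"batch: {batch_seconds * 1000:.3f} ms")
print("same scores:", np.allclose(loop_scores, batch_scores))
```

A single timing is noisy, so for a fair comparison repeat each measurement several times and compare the best or median run.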


What's Next?

You now understand the math of similarity! Next, let's learn how to generate embeddings via API — calling real embedding endpoints, with batching, caching, and cost management — before we store and search them in a vector database.