Manual prompt engineering is fragile, tedious, and doesn't scale. Every time you tweak a prompt, you're guessing. DSPy replaces this guesswork with a programmatic framework: you declare what you want (signatures), compose how it flows (modules), and let a compiler optimize the actual prompts automatically.
Coming from Software Engineering? DSPy is like a compiler for prompts -- you write the spec, it generates the optimized implementation. Think of it as the difference between writing assembly by hand vs. writing C and letting GCC optimize. You declare intent with type signatures, compose modules like functions, and the compiler (optimizer) searches for the best few-shot examples and instructions. If you've used SQLAlchemy (declare schema, engine generates SQL) or TensorFlow (define graph, compiler optimizes execution), DSPy follows the same philosophy.
Why DSPy?
The Core Problem with Manual Prompting
| Aspect | Manual Prompting | DSPy |
|---|---|---|
| Optimization | Trial and error | Systematic search |
| Portability | Tied to one model | Recompile for new model |
| Reproducibility | Copy-paste prompts | Version-controlled code |
| Few-shot examples | Hand-picked | Automatically selected |
| Maintenance | Rewrite on model change | Recompile |
Signatures: Declaring Input/Output Specs
A signature is a concise declaration of what a module should do -- its inputs and outputs:
```python
# script_id: day_017_dspy/inline_signature
import dspy

# Simple inline signature: "input_field -> output_field"
# This tells DSPy: take a sentence, produce a sentiment
classify = dspy.Predict("sentence -> sentiment")

result = classify(sentence="DSPy makes prompt engineering obsolete!")
print(result.sentiment)  # "positive"
```
Class-Based Signatures for More Control
```python
# script_id: day_017_dspy/class_based_signature
import dspy

class FactCheck(dspy.Signature):
    """Verify whether a claim is supported by the provided context."""

    context: str = dspy.InputField(desc="Reference text with known facts")
    claim: str = dspy.InputField(desc="The claim to verify")
    verdict: str = dspy.OutputField(desc="'supported', 'refuted', or 'not enough info'")
    evidence: str = dspy.OutputField(desc="Quote from context supporting the verdict")

# DSPy uses the docstring, field names, and descriptions
# to construct the prompt automatically
```
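To make "DSPy builds the prompt from the signature" concrete, here is a rough pure-Python sketch of how a docstring and field descriptions could be assembled into prompt text. This is an illustration of the idea only, not DSPy's actual prompt format (which is adapter-specific and can be rewritten by the optimizer):

```python
# Illustrative only: how signature metadata (instructions, field names,
# descriptions) might be turned into prompt text. NOT DSPy's real format.

def build_prompt(instructions, inputs, outputs, values):
    lines = [instructions, ""]
    # Render each input field with its description and concrete value
    for name, desc in inputs.items():
        lines.append(f"{name} ({desc}): {values[name]}")
    lines.append("")
    lines.append("Produce the following fields:")
    # Describe each expected output field
    for name, desc in outputs.items():
        lines.append(f"- {name}: {desc}")
    return "\n".join(lines)

prompt = build_prompt(
    instructions="Verify whether a claim is supported by the provided context.",
    inputs={
        "context": "Reference text with known facts",
        "claim": "The claim to verify",
    },
    outputs={
        "verdict": "'supported', 'refuted', or 'not enough info'",
        "evidence": "Quote from context supporting the verdict",
    },
    values={
        "context": "Water boils at 100 C at sea level.",
        "claim": "Water boils at 50 C.",
    },
)
print(prompt)
```

The point is that everything in the rendered prompt traces back to the signature's declarations, which is why changing the docstring or a field description changes model behavior without any manual prompt editing.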
Modules: Composing Behavior
Modules wrap signatures with specific prompting strategies:
```python
# script_id: day_017_dspy/modules_predict_cot
import dspy

# Configure the language model
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Predict: basic prompt, no reasoning
basic_qa = dspy.Predict("question -> answer")

# ChainOfThought: adds step-by-step reasoning automatically
cot_qa = dspy.ChainOfThought("question -> answer")

# Both have the same signature, but ChainOfThought
# adds a "reasoning" field internally before producing the answer
result = cot_qa(question="What is 15% of 240?")
print(result.reasoning)  # e.g., "15% means 15/100. 15/100 * 240 = 36"
print(result.answer)     # "36"
```
Building a Multi-Step Program
```python
# script_id: day_017_dspy/multi_hop_qa
import dspy

class MultiHopQA(dspy.Module):
    """Answer questions that require multiple reasoning steps."""

    def __init__(self):
        super().__init__()
        # Each sub-module handles one step
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str) -> dspy.Prediction:
        # Step 1: Generate a search query from the question
        query_result = self.generate_query(question=question)
        # Step 2: Retrieve context (using a DSPy retriever or custom function)
        context = self.retrieve(query_result.search_query)
        # Step 3: Generate answer from retrieved context
        answer_result = self.generate_answer(context=context, question=question)
        return answer_result

    def retrieve(self, query: str) -> str:
        # Placeholder -- plug in your vector DB here
        return f"Retrieved context for: {query}"

# Usage
program = MultiHopQA()
result = program(question="Who founded the company that created GPT-4?")
print(result.answer)
```
Optimizers (Compilers): Automatic Prompt Tuning
Optimizers search for the best prompts, few-shot examples, and instructions:
```python
# script_id: day_017_dspy/optimizer_and_evaluator
import dspy

# 1. Define your program
qa = dspy.ChainOfThought("question -> answer")

# 2. Prepare training examples
trainset = [
    dspy.Example(
        question="What is the capital of France?",
        answer="Paris",
    ).with_inputs("question"),
    dspy.Example(
        question="Who wrote Romeo and Juliet?",
        answer="William Shakespeare",
    ).with_inputs("question"),
    # ... more examples (10-200 is typical)
]

# 3. Define a metric: how to score predictions
def exact_match(example, prediction, trace=None):
    return example.answer.lower() == prediction.answer.lower()

# 4. Compile with BootstrapFewShot
optimizer = dspy.BootstrapFewShot(
    metric=exact_match,
    max_bootstrapped_demos=4,  # Max few-shot examples to include
    max_labeled_demos=4,
)

# This runs your program on training data, finds the best
# few-shot examples, and bakes them into the prompt
compiled_qa = optimizer.compile(qa, trainset=trainset)

# 5. Use the compiled (optimized) program
result = compiled_qa(question="What is the capital of Japan?")
print(result.answer)  # "Tokyo"
```
Optimizer Comparison
| Optimizer | Strategy | Best For | Cost |
|---|---|---|---|
| `BootstrapFewShot` | Selects best few-shot demos | Quick optimization, small datasets | Low |
| `BootstrapFewShotWithRandomSearch` | Random search over demo sets | Better quality, more exploration | Medium |
| `MIPRO` | Optimizes instructions + demos jointly | Maximum quality, larger datasets | High |
| `BootstrapFinetune` | Generates data, then finetunes model | When you need a smaller/cheaper model | Very High |
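As a mental model (not DSPy's real implementation), bootstrapping few-shot demos roughly amounts to: run the program over the training examples, keep the traces whose outputs pass the metric, and reuse those traces as demos in the prompt. A stubbed sketch, with the program replaced by a lookup table so it runs without an LM:

```python
# Conceptual sketch of bootstrapped demo selection -- NOT DSPy internals.
# "program" is stubbed; in DSPy it would be an LM-backed module.

def bootstrap_demos(program, trainset, metric, max_demos=4):
    demos = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):  # keep only traces that pass the metric
            demos.append((example["question"], prediction))
        if len(demos) >= max_demos:
            break
    return demos

# Stub "program": answers from a lookup table, fails otherwise
KNOWN = {"What is the capital of France?": "Paris"}
program = lambda q: KNOWN.get(q, "I don't know")
metric = lambda ex, pred: ex["answer"].lower() == pred.lower()

trainset = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]
demos = bootstrap_demos(program, trainset, metric)
print(demos)  # only the passing trace survives as a demo
```

The random-search and MIPRO variants extend this same loop by trying many candidate demo sets (and, for MIPRO, candidate instructions) and keeping whichever scores best on a validation split.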
Evaluators: Measuring Quality
```python
# script_id: day_017_dspy/optimizer_and_evaluator
import dspy
from dspy.evaluate import Evaluate

# Your compiled program
compiled_qa = ...  # from optimizer step above

# Held-out test set (never seen during compilation)
testset = [
    dspy.Example(
        question="What is the largest planet?",
        answer="Jupiter",
    ).with_inputs("question"),
    dspy.Example(
        question="Who painted the Mona Lisa?",
        answer="Leonardo da Vinci",
    ).with_inputs("question"),
]

# Define evaluation metric
def fuzzy_match(example, prediction, trace=None):
    """Check if the expected answer appears in the prediction."""
    return example.answer.lower() in prediction.answer.lower()

# Run evaluation
evaluator = Evaluate(
    devset=testset,
    metric=fuzzy_match,
    num_threads=4,
    display_progress=True,
)
score = evaluator(compiled_qa)
print(f"Accuracy: {score}%")
```
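Conceptually, the evaluator just maps the metric over the devset (optionally in parallel) and reports the percentage that pass. A minimal standard-library sketch of that idea, with a stubbed program so it runs without an LM:

```python
# Conceptual sketch of what an evaluator does -- not DSPy's Evaluate.
from concurrent.futures import ThreadPoolExecutor

def evaluate(program, devset, metric, num_threads=4):
    # Score each example in a worker thread, then aggregate
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(
            pool.map(lambda ex: metric(ex, program(ex["question"])), devset)
        )
    return 100.0 * sum(results) / len(results)

# Stub program: knows one answer, fails the other on purpose
ANSWERS = {"What is the largest planet?": "Jupiter is the largest planet"}
program = lambda q: ANSWERS.get(q, "unknown")
metric = lambda ex, pred: ex["answer"].lower() in pred.lower()

devset = [
    {"question": "What is the largest planet?", "answer": "Jupiter"},
    {"question": "Who painted the Mona Lisa?", "answer": "Leonardo da Vinci"},
]
score = evaluate(program, devset, metric)
print(f"Accuracy: {score}%")  # 50.0% with this stub: one of two passes
```

Threading matters here because each real prediction is a network call to an LM; running the devset in parallel is what makes evaluation over hundreds of examples practical.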
Manual Prompting vs DSPy: Side by Side
```python
# script_id: day_017_dspy/manual_vs_dspy_comparison
import dspy
# Assumes an OpenAI client created earlier:
# from openai import OpenAI; client = OpenAI()

# ---- MANUAL APPROACH ----
# Fragile, model-specific, hard to maintain
def classify_manual(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """You are a sentiment classifier.
Classify the sentiment as positive, negative, or neutral.
Here are some examples:
- "I love this!" -> positive
- "This is terrible" -> negative
- "It's okay I guess" -> neutral
Return ONLY the sentiment label."""},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

# ---- DSPy APPROACH ----
# Declarative, portable, auto-optimized
class SentimentClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.Predict(
            "text -> sentiment: str"  # That's it. No prompt needed.
        )

    def forward(self, text: str):
        return self.classify(text=text)

# Compile with examples -- DSPy finds good few-shot demos
optimizer = dspy.BootstrapFewShot(metric=exact_match)
classifier = optimizer.compile(SentimentClassifier(), trainset=train_data)

# Switch models? Just recompile.
dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-20250514"))
classifier_claude = optimizer.compile(SentimentClassifier(), trainset=train_data)
```
Summary
Quick Reference
```python
# script_id: day_017_dspy/quick_reference
import dspy
from dspy.evaluate import Evaluate

# Configure LM
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Inline signature
predict = dspy.Predict("question -> answer")

# Chain of thought
cot = dspy.ChainOfThought("question -> answer")

# Compile with optimizer
optimizer = dspy.BootstrapFewShot(metric=my_metric)
compiled = optimizer.compile(my_module, trainset=examples)

# Evaluate
score = Evaluate(devset=testset, metric=my_metric)(compiled)
```
Exercises
- Signature Explorer: Define three different DSPy signatures (summarization, translation, entity extraction) and compare `Predict` vs `ChainOfThought` outputs for each
- Optimizer Showdown: Take a classification task with 20+ examples, compile with both `BootstrapFewShot` and `BootstrapFewShotWithRandomSearch`, and compare accuracy on a held-out test set
- Model Portability: Build a DSPy program, compile it for GPT-4o-mini, then recompile for Claude Sonnet -- measure how accuracy changes without touching any prompt text
What's Next?
You've mastered LLM foundations! Next, we build a Capstone Data Extraction Pipeline — putting everything from Phase 1 into a complete project. Then on to Phase 2: Embeddings — learning how to represent text as vectors that capture meaning!