LLMs are great at generating text, but applications need structured data - dictionaries, objects, typed fields. Enter Pydantic: the Python library that turns messy LLM output into clean, validated data structures.
Coming from Software Engineering? Pydantic is the TypeScript of Python — it adds type safety and validation to a dynamically typed language. If you've used TypeScript interfaces, JSON Schema, or Protocol Buffers to define data contracts, Pydantic models serve the exact same purpose for LLM outputs. Using Pydantic with LLMs is like building a data transformation layer between an unreliable external API and your clean internal types. The same defensive programming patterns — validate, coerce, reject — that you use at API boundaries apply when parsing LLM output.
The Problem: LLMs Return Strings
# script_id: day_014_pydantic_schemas/the_problem_strings
# What we want
user_data = {
"name": "John Doe",
"age": 30,
"email": "john@example.com"
}
# What LLM gives us
llm_response = """
The user's name is John Doe, they are 30 years old,
and their email is john@example.com.
"""
# Now what? Parse this mess manually?
What is Pydantic?
Pydantic is a data validation library that:
- Defines data structures with type hints
- Automatically validates and converts data
- Provides clear error messages
- Generates JSON schemas
Getting Started with Pydantic
Installation
pip install pydantic
Basic Model
# script_id: day_014_pydantic_schemas/basic_model
from pydantic import BaseModel
from typing import Optional
class User(BaseModel):
name: str
age: int
email: str
is_active: Optional[bool] = True
# Creating instances
user1 = User(
name="John Doe",
age=30,
email="john@example.com"
)
print(user1)
# name='John Doe' age=30 email='john@example.com' is_active=True
print(user1.name) # John Doe
print(user1.model_dump()) # {'name': 'John Doe', 'age': 30, ...}
print(user1.model_dump_json()) # JSON string
Type Coercion
Pydantic automatically converts compatible types:
# script_id: day_014_pydantic_schemas/type_coercion
from pydantic import BaseModel
class User(BaseModel):
name: str
age: int
# String "30" is converted to int 30
user = User(name="John", age="30") # Works!
print(user.age) # 30 (as int)
print(type(user.age)) # <class 'int'>
Validation Errors
# script_id: day_014_pydantic_schemas/validation_errors
from pydantic import BaseModel, ValidationError
class User(BaseModel):
name: str
age: int
try:
user = User(name="John", age="not a number")
except ValidationError as e:
print("Validation failed!")
print(e.json())
# Output (abridged):
# [
# {
# "type": "int_parsing",
# "loc": ["age"],
# "msg": "Input should be a valid integer",
# "input": "not a number"
# }
# ]
Building Complex Schemas
Nested Models
# script_id: day_014_pydantic_schemas/nested_models
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
class Address(BaseModel):
street: str
city: str
country: str
zip_code: Optional[str] = None
class Company(BaseModel):
name: str
address: Address
founded: int
class Person(BaseModel):
name: str
email: str
companies: List[Company]
# Use default_factory, NOT `datetime.now()`. A bare `datetime.now()` is
# evaluated once at class-definition time, so every Person would share the
# same timestamp. default_factory runs the callable per instance.
created_at: datetime = Field(default_factory=datetime.now)
# Usage
data = {
"name": "Jane Smith",
"email": "jane@example.com",
"companies": [
{
"name": "TechCorp",
"address": {
"street": "123 Main St",
"city": "San Francisco",
"country": "USA"
},
"founded": 2020
}
]
}
person = Person(**data)
print(person.companies[0].address.city) # San Francisco
Enums and Literals
# script_id: day_014_pydantic_schemas/enums_and_literals
from pydantic import BaseModel
from enum import Enum
from typing import Literal
class Priority(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
class Task(BaseModel):
title: str
priority: Priority
status: Literal["pending", "in_progress", "completed"]
# Usage
task = Task(
title="Fix bug",
priority="high", # Converted to Priority.HIGH
status="pending"
)
print(task.priority) # Priority.HIGH
print(task.priority.value) # "high"
Field Validators
# script_id: day_014_pydantic_schemas/field_validators
from pydantic import BaseModel, field_validator, Field
from typing import List
class Product(BaseModel):
name: str
price: float = Field(gt=0) # Must be greater than 0
tags: List[str] = []
# @field_validator runs when the field is set; v is the incoming value, and
# whatever you return becomes the stored value (returning v.title() actually
# capitalizes it). @classmethod is required by Pydantic v2.
@field_validator('name')
@classmethod
def name_must_not_be_empty(cls, v):
if not v.strip():
raise ValueError('Name cannot be empty')
return v.title() # Capitalize
@field_validator('tags')
@classmethod
def tags_lowercase(cls, v):
return [tag.lower() for tag in v]
# Usage
product = Product(
name="laptop",
price=999.99,
tags=["Electronics", "COMPUTERS"]
)
print(product.name) # "Laptop" (capitalized)
print(product.tags) # ["electronics", "computers"] (lowercased)
Generating JSON Schemas
Pydantic can generate JSON schemas that LLMs understand:
JSON Schema is just a machine-readable description of a data shape — like an OpenAPI/Swagger spec for a single object. We generate it from the model and paste it into the prompt so the LLM knows exactly which fields and types to return.
# script_id: day_014_pydantic_schemas/json_schema_generation
from pydantic import BaseModel
from typing import List, Optional
import json
class ExtractedEntity(BaseModel):
"""An entity extracted from text."""
name: str
entity_type: str
confidence: float
class ExtractionResult(BaseModel):
"""Result of entity extraction."""
entities: List[ExtractedEntity]
source_text: str
language: Optional[str] = "en"
# Generate JSON Schema
schema = ExtractionResult.model_json_schema()
print(json.dumps(schema, indent=2))
Output:
{
"title": "ExtractionResult",
"description": "Result of entity extraction.",
"type": "object",
"properties": {
"entities": {
"title": "Entities",
"type": "array",
"items": {
"$ref": "#/$defs/ExtractedEntity"
}
},
"source_text": {
"title": "Source Text",
"type": "string"
},
"language": {
"title": "Language",
"default": "en",
"type": "string"
}
},
"required": ["entities", "source_text"],
"$defs": {
"ExtractedEntity": {
"title": "ExtractedEntity",
"description": "An entity extracted from text.",
"type": "object",
"properties": {
"name": {"title": "Name", "type": "string"},
"entity_type": {"title": "Entity Type", "type": "string"},
"confidence": {"title": "Confidence", "type": "number"}
},
"required": ["name", "entity_type", "confidence"]
}
}
}
Using Pydantic with LLMs
The Pattern
Basic Implementation
# script_id: day_014_pydantic_schemas/llm_basic_implementation
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import Literal
import json
client = OpenAI()
class MovieReview(BaseModel):
title: str
rating: int = Field(ge=0, le=10)
sentiment: Literal["positive", "neutral", "negative"] # overall tone of the review
key_points: list[str]
def extract_review(text: str) -> MovieReview:
"""Extract structured review from text."""
# Get the JSON schema
schema = MovieReview.model_json_schema()
prompt = f"""Extract movie review information from the following text.
Return a JSON object matching this schema:
{json.dumps(schema, indent=2)}
Text to analyze:
{text}
Return only valid JSON, no other text."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0
)
# Parse the JSON response
json_str = response.choices[0].message.content or ""
data = json.loads(json_str)
# Validate with Pydantic
return MovieReview(**data)
# Usage
review_text = """
I just watched "Inception" and WOW! This movie is a masterpiece.
The visual effects are stunning, the plot keeps you guessing,
and the acting is superb. I'd give it a solid 9 out of 10.
My only complaint is it's a bit confusing at times.
"""
review = extract_review(review_text)
print(f"Title: {review.title}")
print(f"Rating: {review.rating}/10")
print(f"Sentiment: {review.sentiment}")
print(f"Key Points: {review.key_points}")
Advanced Pydantic Features for LLMs
Field Descriptions
Add descriptions that become part of the JSON schema:
# script_id: day_014_pydantic_schemas/field_descriptions
from pydantic import BaseModel, Field
from typing import Optional
class CustomerTicket(BaseModel):
"""A customer support ticket extracted from email."""
subject: str = Field(
description="Brief summary of the issue"
)
priority: str = Field(
description="Priority level: low, medium, high, or urgent"
)
category: str = Field(
description="Category: billing, technical, shipping, or other"
)
customer_sentiment: str = Field(
description="Customer's emotional state: happy, neutral, frustrated, or angry"
)
action_required: Optional[str] = Field(
default=None,
description="Immediate action needed, if any"
)
# The schema now includes these descriptions!
schema = CustomerTicket.model_json_schema()
print(schema["properties"]["priority"])
# {'description': 'Priority level: low, medium, high, or urgent', 'title': 'Priority', 'type': 'string'}
Literal[...] enforces the value set inside Pydantic; a described str leans on the LLM to comply — choose Literal when you want a hard guarantee, a described str when you want flexibility.
Examples in Schema
# script_id: day_014_pydantic_schemas/examples_in_schema
from pydantic import BaseModel, Field
from typing import List
class ProductInfo(BaseModel):
name: str = Field(examples=["iPhone 15", "MacBook Pro"])
price: float = Field(examples=[999.99, 1299.00])
features: List[str] = Field(
examples=[["5G capable", "A16 chip", "48MP camera"]]
)
Complete LLM + Pydantic Workflow
# script_id: day_014_pydantic_schemas/complete_workflow
from pydantic import BaseModel, Field, ValidationError
from openai import OpenAI
from typing import List, Optional
import json
import re
client = OpenAI()
# Step 1: Define your schema
class ContactInfo(BaseModel):
name: str = Field(description="Full name of the person")
email: Optional[str] = Field(default=None, description="Email address if mentioned")
phone: Optional[str] = Field(default=None, description="Phone number if mentioned")
company: Optional[str] = Field(default=None, description="Company/organization name")
class ExtractedContacts(BaseModel):
contacts: List[ContactInfo]
extraction_notes: str = Field(description="Any relevant notes about the extraction")
# Step 2: Create extraction function
def extract_contacts(text: str) -> ExtractedContacts:
"""Extract contact information from text."""
schema = ExtractedContacts.model_json_schema()
system_prompt = """You are a contact information extractor.
Extract all contact information from the provided text.
Return valid JSON matching the provided schema exactly."""
user_prompt = f"""Schema:
{json.dumps(schema, indent=2)}
Text to analyze:
{text}
Return only the JSON object, nothing else."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
response_format={"type": "json_object"},
temperature=0
)
# Parse and validate
json_str = response.choices[0].message.content or ""
# Clean up potential markdown formatting (rarely needed with json_object mode)
json_str = re.sub(r'^```(?:json)?\s*|\s*```$', '', json_str.strip())
data = json.loads(json_str)
return ExtractedContacts(**data)
# Step 3: Use it!
email_text = """
Hi team,
Please reach out to the following people about the project:
- John Smith from Acme Corp (john.smith@acme.com, 555-123-4567)
- Sarah Johnson, our consultant (sarah@consulting.io)
- Mike at TechStart - his number is 555-987-6543
Best,
Alex
"""
try:
result = extract_contacts(email_text)
print("Extracted Contacts:")
for contact in result.contacts:
print(f" - {contact.name}")
if contact.email:
print(f" Email: {contact.email}")
if contact.phone:
print(f" Phone: {contact.phone}")
if contact.company:
print(f" Company: {contact.company}")
print(f"\nNotes: {result.extraction_notes}")
except ValidationError as e:
print(f"Validation failed: {e}")
except json.JSONDecodeError as e:
print(f"JSON parsing failed: {e}")
Checkpoint
Run extract_review and confirm: a valid blurb returns a typed MovieReview object (access review.rating as an int, not a string). Then pass a deliberately malformed value (e.g. MovieReview(title="x", rating="great", sentiment="positive", key_points=[])) and confirm it raises a ValidationError instead of silently passing bad data downstream. If everything parses even when it shouldn't, check that your fields use real types/constraints (e.g. rating: int = Field(ge=0, le=10)) rather than bare str.
Summary
Quick Reference
# script_id: day_014_pydantic_schemas/quick_reference
from pydantic import BaseModel, Field
from typing import List, Optional
class MySchema(BaseModel):
"""Description becomes part of schema."""
required_field: str
optional_field: Optional[str] = None
with_default: str = "default"
with_description: str = Field(description="Explain the field")
constrained: int = Field(gt=0, lt=100)
list_field: List[str] = []
# Generate schema
schema = MySchema.model_json_schema()
# Parse data
obj = MySchema(**data_dict)
# Export
obj.model_dump() # To dict
obj.model_dump_json() # To JSON string
Exercises
- Invoice Extractor: Create a Pydantic model for invoices (vendor, items, totals) and extract from sample invoice text
Solution
# script_id: day_014_pydantic_schemas/exercise_1_solution
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List
import json
client = OpenAI()
class LineItem(BaseModel):
description: str
quantity: int = Field(ge=1)
unit_price: float = Field(ge=0)
class Invoice(BaseModel):
vendor: str
items: List[LineItem]
total: float = Field(ge=0)
def extract_invoice(text: str) -> Invoice:
schema = Invoice.model_json_schema()
prompt = f"""Extract invoice information from the text.
Return a JSON object matching this schema:
{json.dumps(schema, indent=2)}
Text:
{text}
Return only valid JSON, no other text."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0
)
data = json.loads(response.choices[0].message.content or "")
return Invoice(**data)
- Resume Parser: Build a schema for resumes and parse job application emails
Solution
# script_id: day_014_pydantic_schemas/exercise_2_solution
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List, Optional
import json
client = OpenAI()
class Experience(BaseModel):
company: str
title: str
years: Optional[float] = None
class Resume(BaseModel):
name: str
email: Optional[str] = None
skills: List[str] = []
experience: List[Experience] = []
def parse_resume(text: str) -> Resume:
schema = Resume.model_json_schema()
prompt = f"""Extract resume information from the email below.
Return a JSON object matching this schema:
{json.dumps(schema, indent=2)}
Email:
{text}
Return only valid JSON, no other text."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0
)
data = json.loads(response.choices[0].message.content or "")
return Resume(**data)
- Sentiment Analyzer: Create a model that extracts entities AND sentiment from product reviews
Solution
# script_id: day_014_pydantic_schemas/exercise_3_solution
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List, Literal
import json
client = OpenAI()
class Entity(BaseModel):
name: str
entity_type: str # e.g. "product", "feature", "brand"
class ReviewAnalysis(BaseModel):
entities: List[Entity]
sentiment: Literal["positive", "neutral", "negative"] # overall tone
def analyze_review(text: str) -> ReviewAnalysis:
schema = ReviewAnalysis.model_json_schema()
prompt = f"""Extract the entities mentioned and the overall sentiment
from the product review. Return a JSON object matching this schema:
{json.dumps(schema, indent=2)}
Review:
{text}
Return only valid JSON, no other text."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0
)
data = json.loads(response.choices[0].message.content or "")
return ReviewAnalysis(**data)
What's Next?
Now that you can extract structured data from LLMs, let's learn how to Test LLM Applications — making sure your extractions are reliable and correct!