So far, every agent we have built processes text. The user types a question, the LLM reads it, and text comes back. But the real world is not text-only. Users want to upload a photo of a receipt and ask "how much did I spend on food?", or dictate a voice memo and have an agent summarize it. Multimodal inputs unlock these workflows by letting your agent see images and hear audio alongside text.
Coming from Software Engineering? Multimodal inputs are like adding file upload support to your API — same endpoint, richer payload types. If you have built a REST endpoint that accepts `multipart/form-data` with both JSON fields and file attachments, you already understand the pattern. The LLM API works the same way: instead of a plain text message, you send a content array containing text blocks, image blocks, and audio blocks. The model processes them all in a single request.
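To make the parallel concrete, here is the shape of a mixed message as a plain Python dict (field names follow OpenAI's Chat Completions format; the base64 payload is a placeholder, not a real image):

```python
# A multimodal user message: one "content" array, several typed blocks.
# Each block is a typed part of the same request, much like one part
# of a multipart/form-data body.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "How much did I spend on food?"},
        {
            "type": "image_url",  # the image rides alongside the text
            "image_url": {"url": "data:image/png;base64,<BASE64-BYTES>"},
        },
    ],
}

# Text-only is the degenerate case: "content" can be a plain string.
text_only = {"role": "user", "content": "How much did I spend on food?"}
```

The rest of this lesson builds these blocks programmatically; this is just the wire shape to keep in mind.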
Why Multimodal Matters for Agents
Text-only agents force users to describe what they see. Multimodal agents see it for themselves. This eliminates a lossy translation step and opens up use cases that are impossible with text alone:
| Use Case | Input Type | What the Agent Does |
|---|---|---|
| Receipt analysis | Photo | Extracts line items, totals, tax |
| Chart interpretation | Screenshot | Reads axes, trends, data points |
| Document processing | Scanned PDF page | Extracts text, tables, signatures |
| UI bug reports | Screenshot | Identifies layout issues, broken elements |
| Voice commands | Audio recording | Transcribes and executes instructions |
| Meeting notes | Audio file | Transcribes and summarizes discussion |
Vision APIs: Sending Images to LLMs
Both OpenAI and Anthropic support vision through their chat APIs. You send images as part of the message content array — no separate endpoint needed.
Method 1: Base64-Encoded Images
Encode the image bytes as a base64 string and embed them directly in the API request. This is the most reliable method — no public URL required.
```python
# script_id: day_032_multimodal_inputs/vision_api_examples
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def encode_image(image_path: str) -> str:
    """Read an image file and return its base64 encoding."""
    image_bytes = Path(image_path).read_bytes()
    return base64.standard_b64encode(image_bytes).decode("utf-8")


def analyze_image(image_path: str, question: str) -> str:
    """Send an image to GPT-4o and ask a question about it."""
    b64_image = encode_image(image_path)

    # Determine MIME type from extension
    suffix = Path(image_path).suffix.lower()
    mime_types = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".gif": "image/gif",
        ".webp": "image/webp",
    }
    mime_type = mime_types.get(suffix, "image/png")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{b64_image}",
                            "detail": "high",  # "low", "high", or "auto"
                        },
                    },
                ],
            }
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content


# Usage
result = analyze_image("receipt.png", "List every line item and the total.")
print(result)
```
Method 2: URL-Based Image References
If your image is already hosted at a public URL, skip the encoding step entirely:
```python
# script_id: day_032_multimodal_inputs/vision_api_examples
def analyze_image_url(image_url: str, question: str) -> str:
    """Analyze an image from a URL."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url, "detail": "auto"},
                    },
                ],
            }
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content


# Usage — image already hosted somewhere
result = analyze_image_url(
    "https://example.com/charts/q4-revenue.png",
    "What was the revenue trend in Q4?",
)
```
When to use which? Use base64 for local files, user uploads, and anything not publicly accessible. Use URLs for images already hosted on S3, CDN, or the public web. URLs are faster (no encoding overhead) but require the image to be reachable by the API server.
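These two paths can be folded into one helper. A sketch (the `image_block` name and the deliberately naive URL check are mine, not part of any SDK):

```python
import base64
from pathlib import Path


def image_block(source: str, detail: str = "auto") -> dict:
    """Build an OpenAI image content block from either a URL or a local path.

    Public http(s) URLs pass through untouched; anything else is treated
    as a local file and inlined as a base64 data URL.
    """
    if source.startswith(("http://", "https://")):
        url = source  # already hosted: let the API server fetch it
    else:
        suffix = Path(source).suffix.lower()
        mime = {".png": "image/png", ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg", ".webp": "image/webp"}.get(suffix, "image/png")
        b64 = base64.standard_b64encode(Path(source).read_bytes()).decode()
        url = f"data:{mime};base64,{b64}"
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```

The snippets in this lesson build these blocks inline; a helper like this simply centralizes the MIME lookup and encoding so callers never care where the image lives.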
Vision with Anthropic (Claude)
Claude uses a slightly different content block format — an `image` block with a `source` field instead of `image_url`:
```python
# script_id: day_032_multimodal_inputs/vision_claude
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()


def analyze_image_claude(image_path: str, question: str) -> str:
    """Send an image to Claude and ask a question about it."""
    b64_image = encode_image(image_path)  # Same helper from above
    suffix = Path(image_path).suffix.lower()
    media_type = {
        ".png": "image/png",
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".gif": "image/gif",
        ".webp": "image/webp",
    }.get(suffix, "image/png")

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": b64_image,
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    )
    return response.content[0].text
```
Image Token Costs
Vision is not free. The API tiles your image into chunks and each tile costs tokens. This adds up fast.
How Tiling Works (OpenAI)
| Detail Level | How It Works | Tokens |
|---|---|---|
| `low` | Image resized to fit 512x512, single tile | 85 tokens fixed |
| `high` | Image resized so its shortest side is at most 768px, then split into 512x512 tiles + thumbnail | 170 tokens per tile + 85 base |
| `auto` | API chooses based on image size | Varies |
A 2048x1024 image on high detail is scaled down to 1536x768 and split into 6 tiles: (6 * 170) + 85 = 1,105 tokens. At GPT-4o input pricing ($2.50 per 1M tokens), that is about $0.0028 per image. Sounds cheap until you process 10,000 receipts.
```python
# script_id: day_032_multimodal_inputs/estimate_image_tokens
def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate token cost for an image based on dimensions and detail level."""
    if detail == "low":
        return 85

    # High detail: fit within 2048x2048, resize so the shortest side is
    # at most 768px, then tile into 512x512 chunks
    scale = min(768 / min(width, height), 2048 / max(width, height), 1.0)
    w, h = int(width * scale), int(height * scale)

    # Count 512x512 tiles (ceiling division)
    tiles_x = -(-w // 512)  # Ceiling division trick
    tiles_y = -(-h // 512)
    total_tiles = tiles_x * tiles_y

    return (total_tiles * 170) + 85  # 170 per tile + 85 base


# Examples
print(estimate_image_tokens(1024, 768, "high"))   # 765 tokens (4 tiles)
print(estimate_image_tokens(2048, 1024, "high"))  # 1105 tokens (6 tiles)
print(estimate_image_tokens(512, 512, "low"))     # 85 tokens
```
Cost tip: Always use `detail: "low"` unless you genuinely need to read fine text or small UI elements. For chart trend analysis, `low` is usually sufficient. For reading a receipt line by line, you need `high`.
Audio: Transcription with OpenAI Whisper
For audio inputs, the standard approach is a two-step pipeline: transcribe the audio to text with Whisper, then pass the text to your LLM.
```python
# script_id: day_032_multimodal_inputs/whisper_transcription
from openai import OpenAI

client = OpenAI()


def transcribe_audio(audio_path: str, language: str | None = None) -> str:
    """Transcribe an audio file using OpenAI Whisper."""
    # Only pass the language hint when the caller sets one ("en", "es", "fr", ...)
    kwargs = {"language": language} if language else {}
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
            **kwargs,
        )
    return transcript


def transcribe_with_timestamps(audio_path: str):
    """Transcribe with word-level timestamps for precise alignment."""
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    return transcript


# Usage
text = transcribe_audio("meeting_recording.mp3")
print(text)

# With timestamps
detailed = transcribe_with_timestamps("meeting_recording.mp3")
for word_info in detailed.words[:10]:
    print(f"  [{word_info.start:.1f}s] {word_info.word}")
```
Whisper supports mp3, mp4, mpeg, mpga, m4a, wav, and webm files up to 25 MB. For longer recordings, split them into chunks first.
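The actual splitting needs an audio library (for example pydub plus ffmpeg), but the chunk boundaries are plain arithmetic. A sketch (the 10-minute window and 5-second overlap are arbitrary choices of mine, not Whisper requirements) that plans overlapping windows so a word cut at one boundary survives intact in the next chunk:

```python
def plan_chunks(duration_s: float, chunk_s: float = 600.0,
                overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Plan (start, end) windows in seconds that cover a full recording.

    Consecutive windows overlap by `overlap_s` so speech straddling a cut
    point is fully transcribed in at least one chunk.
    """
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]  # short enough: one chunk, no splitting
    windows = []
    start = 0.0
    step = chunk_s - overlap_s  # each window advances by chunk minus overlap
    while start < duration_s:
        windows.append((start, min(start + chunk_s, duration_s)))
        start += step
    return windows


# A 25-minute recording becomes three overlapping 10-minute windows
print(plan_chunks(1500.0))  # [(0.0, 600.0), (595.0, 1195.0), (1190.0, 1500.0)]
```

Each window is then exported as its own file under 25 MB and sent to `transcribe_audio`; de-duplicating the overlap region in the joined transcripts is left as an exercise.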
Building a Multimodal Agent
Now let us combine everything into an agent that handles text, images, and audio in a single conversation:
```python
# script_id: day_032_multimodal_inputs/multimodal_agent
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()


class MultimodalAgent:
    """Agent that processes text, images, and audio inputs."""

    def __init__(self, system_prompt: str = "You are a helpful multimodal assistant."):
        self.system_prompt = system_prompt
        self.messages = [{"role": "system", "content": system_prompt}]

    def _encode_image(self, path: str) -> tuple[str, str]:
        """Encode image and determine MIME type."""
        data = base64.standard_b64encode(Path(path).read_bytes()).decode()
        suffix = Path(path).suffix.lower()
        mime = {".png": "image/png", ".jpg": "image/jpeg",
                ".jpeg": "image/jpeg", ".webp": "image/webp"}.get(suffix, "image/png")
        return data, mime

    def _transcribe(self, audio_path: str) -> str:
        """Transcribe audio to text via Whisper."""
        with open(audio_path, "rb") as f:
            return client.audio.transcriptions.create(
                model="whisper-1", file=f, response_format="text"
            )

    def send(
        self,
        text: str | None = None,
        image_paths: list[str] | None = None,
        audio_paths: list[str] | None = None,
    ) -> str:
        """
        Send a multimodal message. Any combination of text, images,
        and audio can be provided.
        """
        content = []

        # Transcribe any audio files and prepend as text
        if audio_paths:
            for path in audio_paths:
                transcript = self._transcribe(path)
                content.append({
                    "type": "text",
                    "text": f"[Transcribed audio from {Path(path).name}]:\n{transcript}",
                })

        # Add images
        if image_paths:
            for path in image_paths:
                data, mime = self._encode_image(path)
                content.append({
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime};base64,{data}",
                        "detail": "auto",
                    },
                })

        # Add text prompt
        if text:
            content.append({"type": "text", "text": text})

        # Build message — use plain string if text-only
        if len(content) == 1 and content[0]["type"] == "text":
            self.messages.append({"role": "user", "content": content[0]["text"]})
        else:
            self.messages.append({"role": "user", "content": content})

        # Call the model
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=self.messages,
            max_tokens=2048,
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Usage examples
agent = MultimodalAgent(
    system_prompt="You are an expense tracking assistant. "
    "Extract amounts, categories, and dates from receipts."
)

# Text only
print(agent.send(text="Hi, I need help tracking my expenses this month."))

# Image + text
print(agent.send(
    text="What did I buy and how much did it cost?",
    image_paths=["receipt_lunch.jpg"],
))

# Audio + image + text
print(agent.send(
    text="Here is a voice note about this receipt. Summarize everything.",
    image_paths=["receipt_dinner.png"],
    audio_paths=["voice_note.m4a"],
))
```
Practical Use Case: Screenshot Analyzer
Here is a focused agent that analyzes UI screenshots for a QA workflow:
```python
# script_id: day_032_multimodal_inputs/screenshot_analyzer
import base64
from pathlib import Path

from openai import OpenAI


class ScreenshotAnalyzer:
    """Analyze UI screenshots for bugs, layout issues, and accessibility."""

    def __init__(self):
        self.client = OpenAI()
        self.system_prompt = (
            "You are a QA engineer analyzing UI screenshots. "
            "For each screenshot, report:\n"
            "1. Visual bugs (misalignment, overflow, clipping)\n"
            "2. Accessibility issues (contrast, missing labels)\n"
            "3. Content issues (typos, placeholder text)\n"
            "Format as a structured list with severity: HIGH, MEDIUM, LOW."
        )

    def analyze(self, screenshot_path: str, context: str = "") -> str:
        """Analyze a single screenshot."""
        b64 = base64.standard_b64encode(
            Path(screenshot_path).read_bytes()
        ).decode()

        prompt = "Analyze this UI screenshot for issues."
        if context:
            prompt += f"\nContext: {context}"

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{b64}",
                                "detail": "high",  # Need high for UI details
                            },
                        },
                    ],
                },
            ],
            max_tokens=1024,
        )
        return response.choices[0].message.content


# Usage
qa = ScreenshotAnalyzer()
issues = qa.analyze(
    "checkout_page.png",
    context="This is the checkout page on mobile Safari, iOS 17.",
)
print(issues)
```
Cost Comparison: Text vs. Multimodal
Understanding cost differences helps you decide when vision is worth it:
| Input Type | Tokens (approx) | Cost per request (GPT-4o) |
|---|---|---|
| 500-word text prompt | ~700 tokens | $0.0018 |
| Single image (low detail) | 85 tokens | $0.0002 |
| Single image (high detail, 1024x1024) | 765 tokens | $0.0019 |
| Single image (high detail, 2048x1024) | 1,105 tokens | $0.0028 |
| 1-minute audio (Whisper) | N/A (flat rate) | $0.006 |
| Image + text combo | ~1,500 tokens | $0.0038 |
Key insight: A single high-resolution image costs roughly the same as a 500-word text prompt. The real cost danger is processing many images per request or using `high` detail when `low` would suffice.
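The table above lends itself to a quick budgeting check before you ship a pipeline. A rough input-side estimator (a sketch; the $2.50-per-1M-token and $0.006-per-minute rates are the ones assumed in the table and will drift as pricing changes):

```python
def estimate_request_cost(text_tokens: int = 0, image_tokens: int = 0,
                          audio_minutes: float = 0.0) -> float:
    """Rough input-side cost in dollars for one multimodal request.

    Assumes GPT-4o input at $2.50 per 1M tokens and Whisper at $0.006
    per minute, the rates used in the comparison table.
    """
    llm_cost = (text_tokens + image_tokens) * 2.50 / 1_000_000
    whisper_cost = audio_minutes * 0.006
    return llm_cost + whisper_cost


# 500-word prompt + one high-detail 1024x1024 image + a 1-minute voice memo
print(round(estimate_request_cost(700, 765, 1.0), 4))  # → 0.0097
```

Multiply by your expected request volume: at 10,000 such requests this single workflow is already close to $100 in input costs, before output tokens.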
Summary
Quick Reference
```python
# Base64 encode an image
import base64; b64 = base64.standard_b64encode(open("img.png", "rb").read()).decode()

# OpenAI vision message
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "auto"}}

# Claude vision message
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}}

# OpenAI URL-based image
{"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}

# Whisper transcription
client.audio.transcriptions.create(model="whisper-1", file=open("audio.mp3", "rb"))

# Estimate image tokens (high detail, after resizing so the shortest side is <= 768px)
from math import ceil
tokens = (ceil(w / 512) * ceil(h / 512) * 170) + 85
```
Practice Exercises
- Receipt Scanner: Build an agent that accepts a photo of a receipt, extracts all line items into a structured JSON object with fields for `item`, `quantity`, `price`, and `total`, and calculates whether the listed total matches the sum of line items.
- Voice Memo Summarizer: Create a pipeline that takes an audio file, transcribes it with Whisper, then passes the transcript to an LLM with instructions to produce a bullet-point summary and a list of action items. Add word-level timestamps so each action item references when it was mentioned.
- Multimodal QA Bot: Build an agent that accepts a screenshot of a web page plus a text question (e.g., "Is the navigation bar aligned correctly?"). The agent should analyze the image, answer the question, and output a severity-rated list of any other issues it notices. Compare costs between `low` and `high` detail modes for the same image.
What's Next?
With vision and audio in our toolkit, let's understand the cost engineering behind all these API calls! In the next lesson, we will learn how to estimate, track, and optimize LLM spending — because multimodal inputs make cost awareness even more critical when every image eats hundreds of tokens.