Your agent ran. It made 47 tool calls. It spent $4.20. It did not answer the question. The loop exited with "I was unable to complete the task." You have no idea what happened.
Coming from Software Engineering? Debugging agents is like debugging distributed systems — you can't step through with a breakpoint because the "logic" lives across multiple LLM calls with non-deterministic outputs. Your best tools are structured logging (trace every decision), replay (save full conversation state and re-run), and trajectory analysis (reviewing the sequence of actions like you'd review a distributed trace in Jaeger or Datadog). If you've debugged race conditions or eventual consistency bugs, you have the right patience for this.
Welcome to agent debugging — one of the most frustrating and important skills in AI engineering. Traditional debuggers do not help you here. You need different tools, different thinking, and a systematic approach.
The Failure Modes You Will Encounter
1. Infinite loops — The agent keeps calling tools but never commits to a final answer. Often caused by: unclear stopping conditions in the system prompt, tool results that never satisfy the model, or a model that second-guesses itself.
2. Wrong tool selection — The agent picks a tool that does not match the subtask. Usually a prompt problem: tool descriptions are ambiguous or overlap.
3. Hallucinated actions — The model tries to call a tool that does not exist, or passes arguments that do not match the schema. Often happens when the tool list changes but the system prompt is stale. Hallucinated action just means the model confidently invents a tool name or argument that was never in its tool list — like calling a function you never defined; the runtime then throws an unknown-tool error (exactly what the HALLUCINATED TOOL check below catches).
4. Context overflow: After many iterations, the conversation history approaches or exceeds the context window. The model gets confused, repetitive, or starts ignoring earlier instructions. The context window is the model's fixed-size input buffer, like a function with a hard cap on total argument size. Every past message and tool result is re-sent on every call, so a long run eventually hits that cap. What happens next depends on your stack: many APIs reject the request with a context-length error, while some frameworks quietly truncate the oldest messages. Answer quality often degrades well before the hard limit is reached.
5. Error retry loops — A tool returns an error. The agent retries. Same error. Retries again. 20 times. You owe the API $3.
The First Thing to Do: Add Structured Logging
You cannot debug what you cannot see. Add logging to your agent loop before you do anything else.
# script_id: day_047_debugging_ai_agents/traceable_agent_debug
import json
import logging
import time
from dataclasses import dataclass, field
from typing import Any

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s %(message)s'
)
logger = logging.getLogger("agent")

# For production observability, consider `structlog`, which supports structured
# JSON logging that is easier to parse and query than plain text logs.


@dataclass
class AgentStep:
    """A single step in the agent execution."""
    iteration: int
    thought: str | None
    tool_name: str | None
    tool_input: dict | None
    tool_output: Any | None
    error: str | None
    timestamp: float = field(default_factory=time.time)
    token_count: int = 0


@dataclass
class AgentTrace:
    """Complete execution trace for an agent run."""
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: str | None = None
    total_tokens: int = 0
    start_time: float = field(default_factory=time.time)

    def add_step(self, step: AgentStep):
        self.steps.append(step)
        logger.info(
            "agent_step | iteration=%d tool=%s error=%s",
            step.iteration,
            step.tool_name,
            step.error is not None,
        )

    def duration_seconds(self) -> float:
        return time.time() - self.start_time

    def to_dict(self) -> dict:
        return {
            "task": self.task,
            "total_steps": len(self.steps),
            "total_tokens": self.total_tokens,
            "duration_seconds": self.duration_seconds(),
            "final_answer": self.final_answer,
            "steps": [
                {
                    "iteration": s.iteration,
                    "tool": s.tool_name,
                    "input": s.tool_input,
                    # `is not None` keeps falsy outputs such as 0 or "" visible
                    "output": str(s.tool_output)[:200] if s.tool_output is not None else None,
                    "error": s.error,
                }
                for s in self.steps
            ],
        }
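A trace is most useful if it outlives the process that produced it. The sketch below is an assumption, not part of the lesson's code: the helper names `save_trace_dict`, `load_trace_dicts`, and `failed_traces` are hypothetical, and the input is any dict shaped like the output of `AgentTrace.to_dict()`. It appends each run to a JSON Lines file so failed runs can be compared later.

```python
import json
from pathlib import Path


def save_trace_dict(trace_dict: dict, path: str) -> None:
    """Append one trace (a plain dict) as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as f:
        # default=str keeps the write from failing on non-JSON values
        f.write(json.dumps(trace_dict, default=str) + "\n")


def load_trace_dicts(path: str) -> list[dict]:
    """Read every stored trace back, oldest first."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def failed_traces(traces: list[dict]) -> list[dict]:
    """Return the runs that ended without a final answer."""
    return [t for t in traces if t.get("final_answer") is None]
```

One line per run keeps the file append-only, so a crashed agent cannot corrupt earlier records.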
Building a Traceable Agent
Here is a ReAct agent — the same think → call tool → observe loop from the flowchart above (introduced on Day 35) — now wrapped so every step is recorded:
# script_id: day_047_debugging_ai_agents/traceable_agent_debug
import json
from openai import OpenAI

client = OpenAI()


class TraceableAgent:
    """ReAct agent with full execution tracing."""

    def __init__(self, tools: dict, max_iterations: int = 10):
        self.tools = tools
        self.max_iterations = max_iterations

    def run(self, task: str) -> AgentTrace:
        trace = AgentTrace(task=task)
        messages = self._build_initial_messages(task)
        logger.info("Agent starting: %s", task[:100])
        for i in range(self.max_iterations):
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=self._get_tool_schemas(),
            )
            message = response.choices[0].message
            trace.total_tokens += response.usage.total_tokens
            # store as a dict so it composes with trim_messages() and the rest of the message history;
            # exclude_none drops null fields that the API may reject when the message is sent back
            messages.append(message.model_dump(exclude_none=True))
            # No tool calls = final answer
            if not message.tool_calls:
                trace.final_answer = message.content
                logger.info("Agent finished in %d iterations", i + 1)
                break
            for tool_call in message.tool_calls:
                tool_name = tool_call.function.name
                step = AgentStep(
                    iteration=i,
                    thought=message.content,
                    tool_name=tool_name,
                    tool_input=None,
                    tool_output=None,
                    error=None,
                )
                try:
                    # Parse inside the try so malformed JSON arguments are
                    # recorded as a tool error instead of crashing the loop
                    tool_input = json.loads(tool_call.function.arguments)
                    step.tool_input = tool_input
                    if tool_name not in self.tools:
                        raise ValueError(
                            f"Unknown tool: {tool_name}. "
                            f"Available: {list(self.tools.keys())}"
                        )
                    result = self.tools[tool_name](**tool_input)
                    step.tool_output = result
                    logger.info(
                        "Tool: %s(%s) → %s",
                        tool_name,
                        json.dumps(tool_input)[:80],
                        str(result)[:80],
                    )
                except Exception as e:
                    step.error = str(e)
                    result = f"Error: {e}"
                    logger.error("Tool error %s: %s", tool_name, e)
                trace.add_step(step)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(result),
                })
        else:
            # for-else: runs only when the loop ended without hitting `break`
            logger.warning("Agent hit max iterations (%d)", self.max_iterations)
        logger.info(
            "agent_complete | steps=%d tokens=%d duration=%.1fs success=%s",
            len(trace.steps),
            trace.total_tokens,
            trace.duration_seconds(),
            trace.final_answer is not None,
        )
        return trace

    def _build_initial_messages(self, task: str) -> list[dict]:
        return [
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant with access to tools. "
                    "Use tools to gather information, then provide a final answer. "
                    "When you have enough information, stop calling tools and give your answer. "
                    "Do NOT keep searching after you have a good answer."
                ),
            },
            {"role": "user", "content": task},
        ]

    def _get_tool_schemas(self) -> list[dict]:
        # Placeholder schema: empty properties means tools take no arguments. Real tools must
        # declare their parameters here (name, type, required) or the model will call them with no args.
        return [
            {
                "type": "function",
                "function": {
                    "name": name,
                    "description": func.__doc__ or "",
                    "parameters": {"type": "object", "properties": {}},
                },
            }
            for name, func in self.tools.items()
        ]
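The placeholder schema declares no parameters. One possible way to fill that gap is to derive the parameters object from each function's signature. This is a sketch under assumptions: `build_tool_schema` and the `_JSON_TYPES` mapping are hypothetical helpers, only simple annotations (`str`, `int`, `float`, `bool`) are recognized, and anything else falls back to `"string"`. The `web_search` function is a stand-in used only to show the output shape.

```python
import inspect

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_tool_schema(name: str, func) -> dict:
    """Build an OpenAI-style function tool schema from a function signature."""
    properties = {}
    required = []
    for param_name, param in inspect.signature(func).parameters.items():
        json_type = _JSON_TYPES.get(param.annotation, "string")
        properties[param_name] = {"type": json_type}
        # Parameters without a default must be supplied by the model.
        if param.default is inspect.Parameter.empty:
            required.append(param_name)
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def web_search(query: str, max_results: int = 5) -> str:
    """Search the web and return result snippets."""
    return f"results for {query}"


schema = build_tool_schema("web_search", web_search)
```

For tools with nested or optional structured arguments, a hand-written schema is likely to be more accurate than anything inferred from annotations.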
Debugging LangGraph State Machines
LangGraph gives you visibility into state at each node — but you have to ask for it.
Recall from Day 41 that LangGraph models the agent as a state machine — nodes are functions and the state is a dict passed between them. A checkpointer is just an autosave: it snapshots that dict after every node, like git commits you can check out or Redux time-travel. Annotated[list, operator.add] tells LangGraph to append updates to that field instead of overwriting it, and invoke(None, config) means re-run from a saved snapshot rather than starting fresh. You built the full rewind/replay debugger on Day 45 (Time-Travel Debugging); here it becomes a targeted debugging move — rewind to the checkpoint just before the agent went wrong and replay from there.
# script_id: day_047_debugging_ai_agents/langgraph_debugging
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, Annotated
import operator


class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    tool_calls_made: int
    errors: list[str]


def build_debuggable_graph():
    """Build a LangGraph graph with checkpointing enabled."""
    checkpointer = MemorySaver()
    graph = StateGraph(AgentState)
    # ... add your nodes and edges here ...
    # Compile WITH checkpointer: this is what enables step replay
    app = graph.compile(checkpointer=checkpointer)
    return app


def inspect_execution(app, thread_id: str):
    """Inspect execution history step by step."""
    config = {"configurable": {"thread_id": thread_id}}
    # get_state_history yields checkpoints newest first
    history = list(app.get_state_history(config))
    print(f"Total checkpoints: {len(history)}")
    # Reverse so the printout reads oldest to newest
    for i, checkpoint in enumerate(reversed(history)):
        state = checkpoint.values
        messages = state.get("messages", [])
        tool_calls = state.get("tool_calls_made", 0)
        print(f"\nStep {i}: {len(messages)} messages, {tool_calls} tool calls")
        if messages:
            last = messages[-1]
            role = getattr(last, "type", "unknown")
            content = str(getattr(last, "content", ""))[:100]
            print(f"  Last ({role}): {content}")


# Time travel: rewind to a specific checkpoint and re-run from there
def rewind_and_replay(app, thread_id: str, steps_back: int = 2):
    """Rewind execution by N steps and replay from that point."""
    config = {"configurable": {"thread_id": thread_id}}
    history = list(app.get_state_history(config))
    if steps_back >= len(history):
        print("Not enough history to rewind that far")
        return None
    # history[0] is the latest checkpoint, so history[N] is N steps back
    target_checkpoint = history[steps_back]
    target_config = target_checkpoint.config
    print(f"Rewinding {steps_back} steps...")
    # Resume from the saved checkpoint. Do not write the full values back with
    # update_state() first: `messages` would pass through the operator.add
    # reducer again and the list would be duplicated.
    result = app.invoke(None, target_config)
    return result
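The `Annotated[list, operator.add]` reducer is easy to misread, so it can help to model the merge rule by hand. The function below is a plain-Python imitation for illustration only; it is not LangGraph's implementation, and `apply_update` and `REDUCERS` are hypothetical names. It also shows what happens when a full state snapshot is written back through the reducer.

```python
import operator

# Fields with a reducer are merged; fields without one are overwritten.
REDUCERS = {"messages": operator.add}


def apply_update(state: dict, update: dict) -> dict:
    """Return a new state with `update` merged in, reducer-style."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "tool_calls_made": 0}
# A node returns only its delta: one new message and a new counter value.
state = apply_update(state, {"messages": ["searching"], "tool_calls_made": 1})

# Writing the whole state back through the reducer doubles the list.
doubled = apply_update(state, state)
```

The practical consequence is that updates to a reducer field should carry only the new items, never the full list.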
The Debugging Checklist
When your agent does something weird, go through this checklist:
# script_id: day_047_debugging_ai_agents/traceable_agent_debug
from collections import Counter


def agent_debugging_checklist(trace: AgentTrace) -> list[str]:
    """Automated checks on an agent trace."""
    issues = []
    # 1. Did it hit max iterations?
    if trace.final_answer is None:
        issues.append("CRITICAL: Agent hit max iterations without final answer")
    # 2. Were there errors?
    errors = [s for s in trace.steps if s.error]
    if errors:
        issues.append(f"ERRORS: {len(errors)} tool calls failed")
        for e in errors[:3]:
            issues.append(f"  - {e.tool_name}: {e.error}")
    # 3. Same tool called repeatedly with same input?
    tool_inputs = [
        (s.tool_name, json.dumps(s.tool_input or {}, sort_keys=True))
        for s in trace.steps
        if s.tool_name
    ]
    duplicates = {k: v for k, v in Counter(tool_inputs).items() if v > 2}
    if duplicates:
        issues.append(f"LOOP DETECTED: Repeated tool calls: {list(duplicates.keys())[:3]}")
    # 4. Did token count explode?
    # tokens = the word-pieces the model reads and bills you for (~3/4 of a word each).
    # total_tokens sums every call, and each call re-sends the history, so the total grows fast.
    # 50k is a starting threshold that often signals a runaway loop; tune it for your workload.
    if trace.total_tokens > 50_000:
        issues.append(f"TOKEN EXPLOSION: {trace.total_tokens:,} tokens used")
    # 5. Too many steps?
    if len(trace.steps) > 15:
        issues.append(f"TOO MANY STEPS: {len(trace.steps)} steps")
    # 6. Hallucinated tool names?
    hallucination_signals = ["unknown tool", "not found", "does not exist"]
    for s in trace.steps:
        if s.error and any(k in s.error.lower() for k in hallucination_signals):
            issues.append(f"HALLUCINATED TOOL: '{s.tool_name}'")
    return issues if issues else ["No obvious issues detected"]
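The checklist finds repeated calls after the run has finished. The same (tool, input) key can also be checked while the agent is running, so that a repeat is answered with a nudge instead of another paid call. This is a minimal sketch under assumptions: `DuplicateCallGuard`, its `max_repeats` parameter, and the nudge wording are all hypothetical.

```python
import json
from collections import Counter
from typing import Optional


class DuplicateCallGuard:
    """Refuse to re-run an identical tool call more than `max_repeats` times."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def check(self, tool_name: str, tool_input: dict) -> Optional[str]:
        """Return a nudge message if the call should be blocked, else None."""
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as one call
        key = (tool_name, json.dumps(tool_input, sort_keys=True))
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            return (
                f"You already called {tool_name} with these arguments "
                f"{self.max_repeats} times. Use the earlier result or "
                "change the arguments."
            )
        return None


guard = DuplicateCallGuard(max_repeats=2)
results = [guard.check("web_search", {"q": "python"}) for _ in range(3)]
```

In an agent loop, a non-None return value would be sent back as the tool message content in place of running the tool.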
Common Mistakes and Fixes
Mistake 1: Infinite Loop — Missing Stop Condition
# script_id: day_047_debugging_ai_agents/stop_condition_fix
# PROBLEM: no stopping condition
bad_system_prompt = "You have access to web_search. Use it to research topics."
# FIX: explicit stopping condition
good_system_prompt = """You have access to web_search. Use it to research topics.
IMPORTANT: After 2-3 searches you have enough information. Stop searching and
give your final answer. Do NOT keep looking for more information once you have
a reasonable answer."""
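A prompt is only a request, so a code-level backstop can help. One option, sketched here under assumptions, is to disable tool use on the last allowed iteration so the model has to answer in text. `tool_choice` is a real Chat Completions parameter that accepts `"auto"` and `"none"`; the helper name `tool_choice_for_iteration` and its `reserve_final` parameter are hypothetical.

```python
def tool_choice_for_iteration(i: int, max_iterations: int, reserve_final: int = 1) -> str:
    """Return "none" for the last `reserve_final` iterations, else "auto".

    `i` is the zero-based loop index.
    """
    remaining = max_iterations - i
    return "none" if remaining <= reserve_final else "auto"


# With a cap of 4 iterations, only the final one forbids tool calls.
choices = [tool_choice_for_iteration(i, max_iterations=4) for i in range(4)]
```

The returned value would be passed as `tool_choice=...` next to `tools=...` in the completion call, which turns "hit max iterations with no answer" into "answered with whatever was gathered".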
Mistake 2: Error Retry Loop — No Circuit Breaker
# script_id: day_047_debugging_ai_agents/circuit_breaker
# Both versions assume a module-level `tools` dict mapping tool names to callables.

# PROBLEM: agent retries failed tool indefinitely
def bad_execute(tool_name, tool_input):
    try:
        return tools[tool_name](**tool_input)
    except Exception as e:
        return f"Error: {e}"  # LLM will just try again


# FIX: track per-tool error count
class CircuitBreakerAgent:
    def __init__(self, max_tool_errors: int = 3):
        self.max_tool_errors = max_tool_errors
        self._error_count: dict[str, int] = {}

    def execute_tool(self, tool_name: str, tool_input: dict) -> str:
        count = self._error_count.get(tool_name, 0)
        # Breaker is open: skip the call and tell the model to move on
        if count >= self.max_tool_errors:
            return (
                f"Tool '{tool_name}' has failed {self.max_tool_errors} times. "
                "Please proceed without it or try a different approach."
            )
        try:
            return str(tools[tool_name](**tool_input))
        except Exception as e:
            self._error_count[tool_name] = count + 1
            return f"Error ({count + 1}/{self.max_tool_errors}): {e}"
Mistake 3: Context Overflow — Unbounded History
We budget by tokens — the unit the window is actually measured in — not by message count, since one message can be a sentence or a 5,000-token document.
# script_id: day_047_debugging_ai_agents/traceable_agent_debug
import tiktoken


def trim_messages(
    messages: list[dict],
    max_tokens: int = 100_000,
    model: str = "gpt-4o",
) -> list[dict]:
    """Keep system messages + most recent messages within token budget."""
    enc = tiktoken.encoding_for_model(model)
    system_msgs = [m for m in messages if m.get("role") == "system"]
    other_msgs = [m for m in messages if m.get("role") != "system"]
    system_tokens = sum(
        len(enc.encode(m.get("content", "") or "")) for m in system_msgs
    )
    budget = max_tokens - system_tokens - 1000  # Reserve for next response
    selected = []
    used = 0
    # Walk backwards so the newest messages are kept first
    for message in reversed(other_msgs):
        content = message.get("content", "") or ""
        msg_tokens = len(enc.encode(content))
        if used + msg_tokens > budget:
            break
        selected.insert(0, message)
        used += msg_tokens
    # A "tool" message is only valid right after the assistant message that
    # requested it, so drop any tool results orphaned by the cut
    while selected and selected[0].get("role") == "tool":
        selected.pop(0)
    trimmed = len(other_msgs) - len(selected)
    if trimmed > 0:
        logger.warning("Trimmed %d messages from context (budget: %d tokens)", trimmed, max_tokens)
    return system_msgs + selected
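Trimming whole messages helps with long histories, but a single oversized tool result can consume most of the budget on its own. A complementary step, sketched here as an assumption rather than part of the lesson's code, is to clip each tool result before it is appended. The helper name `clip_tool_output`, the default limits, and the marker text are hypothetical; characters are used instead of tokens to keep the sketch free of dependencies.

```python
def clip_tool_output(text: str, max_chars: int = 4000, tail_chars: int = 500) -> str:
    """Keep the head and tail of a long tool result and mark the cut."""
    if len(text) <= max_chars:
        return text
    tail = text[-tail_chars:] if tail_chars > 0 else ""
    head_chars = max(max_chars - tail_chars, 0)
    omitted = len(text) - head_chars - len(tail)
    return (
        text[:head_chars]
        + f"\n[... {omitted} characters omitted ...]\n"
        + tail
    )


short = clip_tool_output("ok")
long_result = clip_tool_output("a" * 6000 + "END", max_chars=1000, tail_chars=100)
```

Keeping the tail matters because many tools put the most useful line, such as an exception message or a final status, at the end of their output.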
Pretty-Printing Traces
# script_id: day_047_debugging_ai_agents/traceable_agent_debug
def print_trace(trace: AgentTrace):
    """Pretty-print an agent execution trace for debugging."""
    print(f"\n{'='*60}")
    print(f"TASK: {trace.task}")
    print(f"Duration: {trace.duration_seconds():.1f}s | Tokens: {trace.total_tokens:,}")
    print(f"{'='*60}")
    for i, step in enumerate(trace.steps):
        status = "❌" if step.error else "✅"
        print(f"\nStep {i+1} {status}")
        if step.thought:
            print(f"  Thought: {step.thought[:150]}")
        if step.tool_name:
            print(f"  Tool:    {step.tool_name}")
        if step.tool_input:
            print(f"  Input:   {json.dumps(step.tool_input)[:100]}")
        if step.error:
            print(f"  ERROR:   {step.error}")
        elif step.tool_output is not None:
            # `is not None` so falsy results such as 0 or "" are still shown
            print(f"  Output:  {str(step.tool_output)[:100]}")
    print(f"\n{'='*60}")
    if trace.final_answer:
        print(f"FINAL ANSWER: {trace.final_answer[:300]}")
    else:
        print("FAILED: No final answer produced")
    print(f"{'='*60}\n")
SWE to AI Engineering Bridge
| Software Debugging | Agent Debugging |
|---|---|
| Stack trace | Agent execution trace |
| Breakpoints | LangGraph checkpoints / step inspection |
| Log statements | Structured AgentStep logging |
| Unit test for a function | Test a single tool call in isolation |
| Infinite loop detection | Max iterations + repeated-call detection |
| Memory leak | Context overflow / unbounded history |
| Exception handling | Tool error capture + graceful degradation |
| Profiler | Token count and cost per step |
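The profiler row can be made concrete with a small cost estimate. The prices below are placeholders, not real rates, so look up current pricing for your model. The function name is hypothetical, and splitting tokens into prompt and completion counts is an assumption: the lesson's trace stores only a combined total, while the usage object on a Chat Completions response reports `prompt_tokens` and `completion_tokens` separately.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1m_prompt: float,
    usd_per_1m_completion: float,
) -> float:
    """Estimate the cost of one call from token counts and per-million prices."""
    return (
        prompt_tokens * usd_per_1m_prompt
        + completion_tokens * usd_per_1m_completion
    ) / 1_000_000


# Placeholder prices in USD per million tokens, for illustration only.
PROMPT_PRICE = 2.0
COMPLETION_PRICE = 8.0

cost = estimate_cost_usd(40_000, 5_000, PROMPT_PRICE, COMPLETION_PRICE)
```

Recording this per step rather than per run shows which iteration made the bill jump, in the same way a profiler points at the hot function.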
Key Takeaways
- Add tracing before you need it — retrofitting observability is painful
- Check max iterations first — hitting the iteration cap without a final answer is the most common failure
- Context overflow is silent — the model just gets confused; trim proactively rather than waiting for a visible error
Summary
Quick Reference
| Symptom | Detection | Fix |
|---|---|---|
| Never finishes | final_answer is None after loop | Explicit stop condition in system prompt |
| Repeats same call | Counter of (tool, input) > 2 | Block duplicate; nudge the model |
| Calls missing tool | error contains "unknown tool" | Keep tool list and prompt in sync |
| Context overflow | message tokens near window | trim_messages(...) to a budget |
| Error retry loop | per-tool error count rising | Circuit breaker after N failures |
| "What happened?" | app.get_state_history(config) | Inspect / rewind each checkpoint |
Exercises
- Add AgentTrace and TraceableAgent to your Day 48 capstone and print a trace for each run.
- Write a test that triggers the TOKEN EXPLOSION (or TOO MANY STEPS) branch of agent_debugging_checklist: build an AgentTrace with total_tokens over 50,000 (or more than 15 steps) and assert the matching warning appears.
- Implement trim_messages and verify it stays under 100K tokens after 200 simulated iterations.
- Use LangGraph's state history to replay an execution step-by-step and print state at each checkpoint.
Solutions (approaches)
- Wrap each call in an AgentStep, append to trace, and call print_trace(trace) at the end. Both helpers are defined in this lesson.
- Build an AgentTrace with total_tokens=60_000 (or append 16+ AgentSteps), run it through agent_debugging_checklist, and assert the "TOKEN EXPLOSION" (or "TOO MANY STEPS") line appears.
- Build 200 fake messages, run trim_messages(messages, max_tokens=100_000), and assert the re-encoded total is under budget.
- for cp in app.get_state_history(config): print(cp.values) iterates newest first; collect the checkpoints into a list and reverse it to read forward.
Checkpoint
Feed agent_debugging_checklist a hand-built AgentTrace whose steps contain the same (tool_name, tool_input) three times. The returned list should include a "LOOP DETECTED" line — the Counter-based check fires when an identical call repeats more than twice. Then run a clean trace through print_trace(...) and confirm each step shows ✅/❌ with its tool and output. If "LOOP DETECTED" never appears, make sure your duplicate steps serialize to identical json.dumps(..., sort_keys=True) strings (same keys, same values).
What's Next?
Capstone — Autonomous Research Agent, where you'll build a full multi-step agent with the tracing and debugging infrastructure from today.