
Fine-tuning for Agentic Tasks

Phase 6 of 8

Your agent calls tools, parses JSON, and chains multi-step actions. But frontier models sometimes hallucinate function names, produce malformed arguments, or ignore your schema. Fine-tuning a smaller model on your exact tool-calling patterns can produce a specialist that outperforms a generalist on your workflow -- at a fraction of the cost and latency.

Coming from Software Engineering? Fine-tuning for tool use is like writing a specialized API client -- you train the model to speak your API's protocol fluently. Instead of hoping a generic HTTP library guesses the right headers, you build a typed client that knows every endpoint, every parameter, and every error code by heart.


Why Fine-tune for Agents?

Three core reasons to fine-tune for agentic tasks:

  • Reliable tool calling: The model learns your exact function signatures, not generic patterns
  • JSON schema adherence: No more missing required fields or wrong types
  • Domain-specific execution: Multi-step chains that follow your business logic, not generic reasoning

Training Data Format for Tool Use

The key to fine-tuning for tool calling is structured training examples. Each example shows the model a conversation with tool definitions, a user request, and the correct tool call response.

# script_id: day_080_finetuning_agentic/training_data_format
# Training example format for tool-use fine-tuning
training_example = {
    "messages": [
        {
            "role": "system",
            "content": "You are an assistant with access to the following tools."
        },
        {
            "role": "user",
            "content": "What's the weather in San Francisco?"
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_001",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": '{"location": "San Francisco", "unit": "fahrenheit"}'
                    }
                }
            ]
        },
        {
            "role": "tool",
            "tool_call_id": "call_001",
            "content": '{"temperature": 62, "condition": "foggy"}'
        },
        {
            "role": "assistant",
            "content": "It's 62F and foggy in San Francisco."
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["location"]
                }
            }
        }
    ]
}
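One detail that trips people up when generating examples like this programmatically: the arguments field is a JSON-encoded string, not a nested object. A quick sketch of the round-trip (values taken from the example above):

```python
import json

# In the OpenAI chat format, "arguments" is a JSON-encoded string,
# not a nested dict -- embedding a dict directly produces invalid data.
call_arguments = '{"location": "San Francisco", "unit": "fahrenheit"}'

parsed = json.loads(call_arguments)   # decode to a dict for your own validation
print(parsed["location"])             # San Francisco

# When assembling training examples in code, encode back to a string
encoded = json.dumps({"location": "San Francisco", "unit": "fahrenheit"})
print(encoded == call_arguments)      # True
```

This is why the validation code later in this article always calls json.loads on arguments before checking the schema.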

Building Tool-Use Training Sets

A strong training set covers three categories: successful calls, error recovery, and multi-step chains.

# script_id: day_080_finetuning_agentic/build_tool_use_dataset

def build_tool_use_dataset(tools: list, scenarios: list) -> list:
    """Build a training dataset for tool-use fine-tuning."""
    dataset = []

    for scenario in scenarios:
        # Category 1: Successful single-step calls
        if scenario["type"] == "single_call":
            dataset.append({
                "messages": [
                    {"role": "user", "content": scenario["query"]},
                    {"role": "assistant", "tool_calls": [scenario["expected_call"]]},
                    {"role": "tool", "tool_call_id": "call_1", "content": scenario["tool_response"]},
                    {"role": "assistant", "content": scenario["final_answer"]}
                ],
                "tools": tools
            })

        # Category 2: Error recovery (model retries after a failed call)
        elif scenario["type"] == "error_recovery":
            dataset.append({
                "messages": [
                    {"role": "user", "content": scenario["query"]},
                    {"role": "assistant", "tool_calls": [scenario["first_call"]]},
                    {"role": "tool", "tool_call_id": "call_1", "content": scenario["error_response"]},
                    {"role": "assistant", "tool_calls": [scenario["retry_call"]]},
                    {"role": "tool", "tool_call_id": "call_2", "content": scenario["success_response"]},
                    {"role": "assistant", "content": scenario["final_answer"]}
                ],
                "tools": tools
            })

        # Category 3: Multi-step chains
        elif scenario["type"] == "multi_step":
            messages = [{"role": "user", "content": scenario["query"]}]
            for i, step in enumerate(scenario["steps"]):
                messages.append({"role": "assistant", "tool_calls": [step["call"]]})
                messages.append({"role": "tool", "tool_call_id": f"call_{i+1}", "content": step["response"]})
            messages.append({"role": "assistant", "content": scenario["final_answer"]})
            dataset.append({"messages": messages, "tools": tools})

    return dataset

# Example: build a dataset with 3 categories
tools = [
    {"type": "function", "function": {"name": "search_db", "parameters": {"type": "object", "properties": {"query": {"type": "string"}}}}},
    {"type": "function", "function": {"name": "send_email", "parameters": {"type": "object", "properties": {"to": {"type": "string"}, "body": {"type": "string"}}}}},
]

print(f"Tools defined: {len(tools)}")
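To make the scenario shape concrete, here is a hypothetical single_call scenario (tool name and values are illustrative) and the training example build_tool_use_dataset would assemble from it:

```python
# Hypothetical scenario dict matching the keys build_tool_use_dataset
# expects for the "single_call" category (all values illustrative)
scenario = {
    "type": "single_call",
    "query": "Find invoices from March",
    "expected_call": {
        "id": "call_1",
        "type": "function",
        "function": {"name": "search_db", "arguments": '{"query": "invoices March"}'},
    },
    "tool_response": '{"results": ["INV-1042", "INV-1055"]}',
    "final_answer": "I found two invoices from March: INV-1042 and INV-1055.",
}

# The training example produced for this category: user -> call -> result -> answer
example = {
    "messages": [
        {"role": "user", "content": scenario["query"]},
        {"role": "assistant", "tool_calls": [scenario["expected_call"]]},
        {"role": "tool", "tool_call_id": "call_1", "content": scenario["tool_response"]},
        {"role": "assistant", "content": scenario["final_answer"]},
    ],
}
print(len(example["messages"]))  # 4
```

Note that the scenario's expected_call id must match the tool_call_id in the following tool message, or the conversation will not be a valid training example.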

Fine-tuning for JSON Schema Compliance

One of the biggest wins from fine-tuning is eliminating malformed JSON. You train the model to always produce valid arguments that match your schema.

# script_id: day_080_finetuning_agentic/validate_schema_compliance
from pydantic import BaseModel, ValidationError
from typing import Optional
import json

# Define the schema your tool expects
class SearchArgs(BaseModel):
    query: str
    max_results: int = 10
    filter_category: Optional[str] = None

# Validate training data before fine-tuning
def validate_training_examples(dataset: list, schemas: dict) -> dict:
    """Ensure all training examples have valid tool call arguments."""
    stats = {"valid": 0, "invalid": 0, "errors": []}

    for i, example in enumerate(dataset):
        for msg in example["messages"]:
            if msg["role"] == "assistant" and "tool_calls" in msg:
                for call in msg["tool_calls"]:
                    fn_name = call["function"]["name"]
                    args_str = call["function"]["arguments"]

                    try:
                        args = json.loads(args_str)
                        if fn_name in schemas:
                            schemas[fn_name](**args)  # Validate with Pydantic
                        stats["valid"] += 1
                    except (json.JSONDecodeError, ValidationError) as e:
                        stats["invalid"] += 1
                        stats["errors"].append({"example": i, "error": str(e)})

    return stats

# Run validation
schemas = {"search_db": SearchArgs}
# stats = validate_training_examples(dataset, schemas)
# print(f"Valid: {stats['valid']}, Invalid: {stats['invalid']}")
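If you want a dependency-free first pass before the Pydantic check, a minimal stdlib validator (a sketch; the required-field list is whatever your schema demands) can catch the two most common failures -- unparseable JSON and missing required fields:

```python
import json

def check_arguments(args_str: str, required: list) -> tuple:
    """Return (ok, reason) for a tool-call arguments string."""
    try:
        args = json.loads(args_str)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    missing = [k for k in required if k not in args]
    if missing:
        return False, f"missing required: {missing}"
    return True, "ok"

print(check_arguments('{"query": "invoices"}', ["query"]))
print(check_arguments('{"max_results": 5}', ["query"]))
print(check_arguments('{"query": ', ["query"]))
```

The Pydantic version above goes further (type checking, enums, defaults), but this is often enough to triage a raw dataset quickly.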

Launching a Fine-tuning Job

# script_id: day_080_finetuning_agentic/launch_finetuning_job
from openai import OpenAI
import json

client = OpenAI()

def prepare_and_finetune(dataset: list, model: str = "gpt-4o-mini-2024-07-18"):
    """Prepare a JSONL file and launch a fine-tuning job."""

    # Step 1: Write JSONL training file
    output_path = "tool_calling_train.jsonl"
    with open(output_path, "w") as f:
        for example in dataset:
            f.write(json.dumps(example) + "\n")

    # Step 2: Upload to OpenAI
    with open(output_path, "rb") as f:
        file_obj = client.files.create(file=f, purpose="fine-tune")

    print(f"Uploaded file: {file_obj.id}")

    # Step 3: Create fine-tuning job
    job = client.fine_tuning.jobs.create(
        training_file=file_obj.id,
        model=model,
        hyperparameters={
            "n_epochs": 3,
            "batch_size": "auto",
            "learning_rate_multiplier": "auto"
        },
        suffix="tool-calling-agent"
    )

    print(f"Fine-tuning job created: {job.id}")
    print(f"Status: {job.status}")
    return job

# Monitor the job
def check_job_status(job_id: str):
    """Check fine-tuning job progress."""
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(f"Status: {job.status}")
    if job.fine_tuned_model:
        print(f"Model ready: {job.fine_tuned_model}")
    return job
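Before uploading, it is worth a local preflight pass: the API rejects files it cannot parse, and (at the time of writing) requires a minimum number of training examples -- 10 for chat models. A minimal sketch:

```python
import json

def preflight_check(jsonl_path: str, min_examples: int = 10) -> int:
    """Verify every JSONL line parses and contains a messages list."""
    count = 0
    with open(jsonl_path) as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)  # raises on malformed JSON
            if "messages" not in record:
                raise ValueError(f"line {line_no}: missing 'messages' key")
            count += 1
    if count < min_examples:
        raise ValueError(f"need >= {min_examples} examples, got {count}")
    return count

# Demo with a tiny file (threshold lowered for illustration)
with open("demo_train.jsonl", "w") as f:
    for i in range(3):
        f.write(json.dumps({"messages": [{"role": "user", "content": f"q{i}"}]}) + "\n")

print(preflight_check("demo_train.jsonl", min_examples=3))  # 3
```

Catching a malformed line locally is much faster than waiting for a job to fail server-side.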

Evaluating Tool-Call Accuracy

# script_id: day_080_finetuning_agentic/evaluate_tool_call_accuracy
import json
from difflib import SequenceMatcher

def evaluate_tool_calls(predictions: list, ground_truth: list) -> dict:
    """Evaluate tool-call accuracy across three metrics."""
    exact_match = 0
    fuzzy_match = 0
    functional_match = 0

    for pred, truth in zip(predictions, ground_truth):
        pred_name = pred["function"]["name"]
        truth_name = truth["function"]["name"]
        pred_args = json.loads(pred["function"]["arguments"])
        truth_args = json.loads(truth["function"]["arguments"])

        # Exact match: name and args identical
        if pred_name == truth_name and pred_args == truth_args:
            exact_match += 1
            fuzzy_match += 1
            functional_match += 1
            continue

        # Fuzzy match: name correct, args similar
        if pred_name == truth_name:
            similarity = SequenceMatcher(
                None,
                json.dumps(pred_args, sort_keys=True),
                json.dumps(truth_args, sort_keys=True)
            ).ratio()
            if similarity > 0.8:
                fuzzy_match += 1
                functional_match += 1  # Close enough to work
            elif similarity > 0.5:
                functional_match += 1  # Might still produce correct result

    total = len(predictions)
    return {
        "exact_match": exact_match / total,
        "fuzzy_match": fuzzy_match / total,
        "functional_match": functional_match / total,
        "total_examples": total
    }

# Example usage
# results = evaluate_tool_calls(model_predictions, expected_calls)
# print(f"Exact: {results['exact_match']:.1%}")
# print(f"Fuzzy: {results['fuzzy_match']:.1%}")
# print(f"Functional: {results['functional_match']:.1%}")
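To build intuition for the 0.8 fuzzy threshold, compare serialized arguments that differ in a single value (illustrative strings):

```python
import json
from difflib import SequenceMatcher

truth = {"query": "invoices March 2024", "max_results": 10}
pred = {"query": "invoices March", "max_results": 10}

# Same canonical serialization the evaluator uses (sorted keys)
ratio = SequenceMatcher(
    None,
    json.dumps(pred, sort_keys=True),
    json.dumps(truth, sort_keys=True),
).ratio()

print(f"similarity: {ratio:.2f}")  # well above the 0.8 fuzzy threshold
```

A missing year still scores high because most of the serialized string is identical; whether that counts as "functionally correct" depends on your tool, which is why the functional_match metric is tracked separately.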

Case Study: 8B Model vs GPT-4o on Tool Calling

A common production pattern: fine-tune a small open-source model to match or beat GPT-4o on your specific tool-calling task, then serve it locally for 10x cost savings.

# script_id: day_080_finetuning_agentic/benchmark_comparison
# Benchmark: compare fine-tuned 7B vs GPT-4o on your tool-calling eval set

benchmark_results = {
    "model": ["GPT-4o (baseline)", "Llama 3 8B (base)", "Llama 3 8B (fine-tuned)"],
    "exact_match":      [0.87, 0.41, 0.91],
    "fuzzy_match":      [0.93, 0.58, 0.95],
    "functional_match": [0.96, 0.65, 0.97],
    "avg_latency_ms":   [1200, 180, 190],
    "cost_per_1k":      [3.50, 0.00, 0.00],  # No per-token API cost for self-hosted inference
}

# Print comparison table
print(f"{'Model':<30} {'Exact':>8} {'Fuzzy':>8} {'Func':>8} {'Latency':>10} {'Cost/1k':>10}")
print("-" * 80)
for i in range(len(benchmark_results["model"])):
    print(
        f"{benchmark_results['model'][i]:<30} "
        f"{benchmark_results['exact_match'][i]:>8.0%} "
        f"{benchmark_results['fuzzy_match'][i]:>8.0%} "
        f"{benchmark_results['functional_match'][i]:>8.0%} "
        f"{benchmark_results['avg_latency_ms'][i]:>8}ms "
        f"${benchmark_results['cost_per_1k'][i]:>9.2f}"
    )

# Key insight: the fine-tuned 8B beats GPT-4o on exact match
# because it learned YOUR specific tool schemas, not generic ones

Summary

Fine-tuning a small model on your exact tool-calling patterns turns a generalist into a specialist: reliable function names, schema-valid JSON arguments, and multi-step chains that follow your business logic. The workflow: build a training set covering successful calls, error recovery, and multi-step chains; validate every example's arguments against your schemas before training; launch the job; then evaluate with exact, fuzzy, and functional match on a held-out set.

Quick Reference

# script_id: day_080_finetuning_agentic/quick_reference
# Training data format: messages + tools
{"messages": [...], "tools": [...]}

# Three training categories
# 1. Successful single calls
# 2. Error recovery (retry after failure)
# 3. Multi-step chains (sequential tool use)

# Evaluation metrics
# Exact match:      name + args identical
# Fuzzy match:      name correct, args >80% similar
# Functional match: call would produce correct result

# Launch fine-tuning (OpenAI)
client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-mini-2024-07-18",
    suffix="tool-calling-agent"
)

Exercises

  1. Build a Training Set: Create 50 training examples covering all three categories (successful calls, error recovery, multi-step chains) for a 3-tool agent (search, calculate, send_email)

  2. Schema Validator: Write a validation pipeline that checks every training example for valid JSON arguments against Pydantic schemas, and reports statistics on coverage per tool

  3. Eval Harness: Build an evaluation harness that tests a model on 20 tool-calling scenarios and reports exact match, fuzzy match, and functional correctness scores


What's Next?

We have a fine-tuned model. How do we know it's actually better? Let's learn evaluation techniques for fine-tuned models!