Phase 5: Evaluation and Security · 10 min read

LLM-as-Judge — Part 1: Automated Evaluation

Phase 5 of 8

How do you know if your agent is actually good? Manual testing doesn't scale. In this guide, you'll learn to use LLMs to automatically evaluate agent responses and RAG systems.

Coming from Software Engineering? LLM-as-judge is like using a linter or static analysis tool for natural language output. Just as ESLint checks code quality against rules, an LLM judge checks response quality against criteria. You've used automated quality gates in CI — this is the same concept applied to AI output. The tradeoff is similar too: automated checks are fast and consistent but imperfect; human review is accurate but doesn't scale.


Why Automated Evaluation?

Benefits of automated evaluation:

  • Scale: Evaluate thousands of responses
  • Consistency: Same criteria every time
  • Speed: Seconds vs. hours
  • Iteration: Test changes quickly

LLM-as-Judge: The Basics

Use one LLM to evaluate another's output:

# script_id: day_058_llm_as_judge_part1/evaluate_response
from openai import OpenAI

client = OpenAI()

def evaluate_response(question: str, response: str, criteria: list[str]) -> str:
    """
    Use an LLM to evaluate a response.

    Args:
        question: The original question
        response: The response to evaluate
        criteria: List of criteria to judge

    Returns:
        The judge's evaluation (scores and reasoning) as formatted text
    """

    criteria_text = "\n".join([f"- {c}" for c in criteria])

    eval_prompt = f"""You are an expert evaluator. Evaluate the following response.

QUESTION: {question}

RESPONSE: {response}

CRITERIA:
{criteria_text}

For each criterion, provide:
1. A score from 1-5 (1=poor, 5=excellent)
2. Brief reasoning

Format your response as:
CRITERION: [name]
SCORE: [1-5]
REASONING: [your reasoning]

Then provide an OVERALL score (1-5) and summary."""

    evaluation = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0  # Consistent evaluations
    )

    return evaluation.choices[0].message.content

# Example usage
question = "What is machine learning?"
response = "Machine learning is a subset of AI where computers learn from data without being explicitly programmed."

criteria = [
    "Accuracy: Is the information correct?",
    "Completeness: Does it cover the key points?",
    "Clarity: Is it easy to understand?"
]

evaluation = evaluate_response(question, response, criteria)
print(evaluation)
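
The judge's reply above is free text, so comparing runs means reading it by hand. Because the prompt asks for a fixed CRITERION / SCORE / REASONING layout, a small parser can recover the numbers. The sketch below is one way to do that; parse_evaluation and BLOCK_RE are hypothetical names, not part of the original script, and the regex assumes the judge follows the requested layout exactly. Real replies may add markdown or reorder lines, in which case this parser simply finds fewer blocks.

```python
import re

# Hypothetical helper (not in the original script): matches one
# CRITERION / SCORE / REASONING block as requested in eval_prompt.
BLOCK_RE = re.compile(
    r"CRITERION:\s*(?P<name>.+?)\s*\n"
    r"SCORE:\s*(?P<score>[1-5])\b.*\n"
    r"REASONING:\s*(?P<reasoning>.+)"
)

def parse_evaluation(text: str) -> dict:
    """Turn the judge's free-text reply into {criterion: {score, reasoning}}."""
    parsed = {}
    for match in BLOCK_RE.finditer(text):
        parsed[match["name"]] = {
            "score": int(match["score"]),
            "reasoning": match["reasoning"].strip(),
        }
    return parsed

# Made-up judge reply, used only to exercise the parser offline.
sample = """CRITERION: Accuracy
SCORE: 5
REASONING: The definition matches the standard one.

CRITERION: Completeness
SCORE: 3
REASONING: It omits training, models, and examples."""

parsed = parse_evaluation(sample)
for name, result in parsed.items():
    print(f"{name}: {result['score']}/5")
```

The OVERALL line is deliberately left unparsed here. If you need machine-readable scores reliably, asking the judge for JSON is usually less fragile than parsing text.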

Structured Evaluation with Scores

Get numerical scores for easy comparison:

# script_id: day_058_llm_as_judge_part1/evaluate_with_scores
from openai import OpenAI
import json

client = OpenAI()

def evaluate_with_scores(question: str, response: str) -> dict:
    """Get structured evaluation scores."""

    eval_prompt = f"""Evaluate this response on a scale of 1-5 for each criterion.

Question: {question}
Response: {response}

Return a JSON object with these scores:
- accuracy: How factually correct is the response?
- completeness: How thoroughly does it answer the question?
- clarity: How clear and understandable is it?
- relevance: How relevant is it to the question?
- conciseness: Is it appropriately concise (not too verbose)?

Also include:
- overall: Overall quality score (1-5)
- feedback: Brief constructive feedback

Use whole numbers from 1 to 5 for each score.
Return ONLY valid JSON."""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )

    return json.loads(result.choices[0].message.content)

# Example
scores = evaluate_with_scores(
    "Explain photosynthesis",
    "Photosynthesis is when plants make food from sunlight."
)

print(f"Accuracy: {scores['accuracy']}/5")
print(f"Completeness: {scores['completeness']}/5")
print(f"Clarity: {scores['clarity']}/5")
print(f"Overall: {scores['overall']}/5")
print(f"Feedback: {scores['feedback']}")
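
JSON mode guarantees syntactically valid JSON, not that the keys and value ranges match what the prompt asked for. A minimal validation sketch, assuming the six score keys named in the prompt above, can catch a missing key or an out-of-range score before it reaches a dashboard. SCORE_KEYS and validate_scores are hypothetical names introduced for illustration.

```python
# Hypothetical helper: checks the dict returned by evaluate_with_scores.
SCORE_KEYS = ["accuracy", "completeness", "clarity", "relevance", "conciseness", "overall"]

def validate_scores(scores: dict) -> dict:
    """Raise ValueError unless every expected score is a whole number from 1 to 5."""
    problems = []
    for key in SCORE_KEYS:
        value = scores.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            problems.append(f"{key}={value!r}")
    if problems:
        raise ValueError("Invalid judge scores: " + ", ".join(problems))
    return scores

# Offline example with a hand-written dict standing in for a judge reply.
good = {"accuracy": 4, "completeness": 2, "clarity": 5,
        "relevance": 5, "conciseness": 5, "overall": 3, "feedback": "Too brief."}
print(validate_scores(good)["overall"])

try:
    validate_scores({"accuracy": 7, "clarity": "high"})
except ValueError as err:
    print(err)
```

In a pipeline you might call validate_scores(evaluate_with_scores(...)) and retry or log the sample when it raises.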

Pairwise Comparison

Compare two responses to find the better one:

# script_id: day_058_llm_as_judge_part1/compare_responses
from openai import OpenAI
import json

client = OpenAI()

def compare_responses(question: str, response_a: str, response_b: str) -> dict:
    """
    Compare two responses and determine which is better.

    This is often more reliable than absolute scoring!
    """

    comparison_prompt = f"""Compare these two responses to the same question.

QUESTION: {question}

RESPONSE A:
{response_a}

RESPONSE B:
{response_b}

Evaluate which response is better and why. Consider:
- Accuracy
- Completeness
- Clarity
- Helpfulness

Return JSON with:
- winner: "A" or "B" or "tie"
- confidence: "high", "medium", or "low"
- reasoning: Brief explanation of your choice
- a_strengths: List of response A's strengths
- b_strengths: List of response B's strengths"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": comparison_prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )

    return json.loads(result.choices[0].message.content)

# Example
result = compare_responses(
    "What is a REST API?",
    "A REST API lets programs talk over HTTP.",
    "A REST API exposes resources addressed by URLs, which clients act on using standard HTTP verbs (GET, POST, PUT, DELETE). Each request is stateless — it carries everything the server needs — and the server replies with a status code (e.g. 200 OK, 404 Not Found) plus a body, usually JSON."
)

print(f"Winner: Response {result['winner']}")
print(f"Confidence: {result['confidence']}")
print(f"Reasoning: {result['reasoning']}")
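
A pairwise verdict is only trustworthy if it survives swapping the two slots, which is what exercise 2 asks you to try by hand. The sketch below automates that check under stated assumptions: compare_both_orders and longer_wins are hypothetical names, and the judge is passed in as a plain function with the same signature as compare_responses so the logic can be tried without API calls.

```python
from typing import Callable

def compare_both_orders(
    question: str,
    first: str,
    second: str,
    judge: Callable[[str, str, str], dict],
) -> dict:
    """Run a pairwise judge in both orders and report whether it agrees with itself.

    `judge` must return a dict with a "winner" key of "A", "B", or "tie".
    """
    forward = judge(question, first, second)["winner"]   # `first` shown as A
    backward = judge(question, second, first)["winner"]  # `first` shown as B

    # Map the second run's slot label back onto the original ordering.
    unswapped = {"A": "B", "B": "A"}.get(backward, backward)

    consistent = forward == unswapped
    return {
        "winner": forward if consistent else "inconclusive",
        "consistent": consistent,
        "forward": forward,
        "backward": backward,
    }

# Stand-in judge for a dry run: it simply prefers the longer response.
def longer_wins(question: str, response_a: str, response_b: str) -> dict:
    if len(response_a) == len(response_b):
        return {"winner": "tie"}
    return {"winner": "A" if len(response_a) > len(response_b) else "B"}

check = compare_both_orders("What is a REST API?", "short", "a much longer answer", longer_wins)
print(check)
```

With the real judge you would pass compare_responses as the judge argument. This doubles the number of API calls, so it may be worth reserving for close or high-stakes comparisons.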

RAG Evaluation with Ragas

A RAG system has an extra failure mode beyond a plain LLM answer — the retrieval step can fetch the wrong documents — so a RAG eval scores two things: did we fetch the right context, and did the answer use it correctly. The open-source Ragas framework packages these as ready-made metrics (faithfulness, answer relevancy, context precision/recall); we cover it in depth on Day 60. Below we build the judge from scratch to show the mechanics.

In plain terms, those four metrics mean:

  • Faithfulness — the answer only states things the retrieved docs actually support (no made-up facts, i.e. no hallucination).
  • Answer relevancy — does the answer actually address the question.
  • Context precision — of the chunks retrieval returned, how many were actually relevant (low = noisy retrieval).
  • Context recall — did retrieval pull in everything needed to answer (low = a needed doc was missed).
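
To make the definitions above concrete, here is the arithmetic behind three of them as a simplified sketch. It assumes you already know which chunks are relevant and how many claims are supported; Ragas derives those judgments with an LLM and its formulas differ in detail, so treat the function names and the doc_* identifiers below as illustrative only.

```python
def faithfulness_ratio(supported_claims: int, total_claims: int) -> float:
    """Share of the answer's claims that the retrieved context supports."""
    return supported_claims / total_claims if total_claims else 0.0

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant (simplified: ignores ranking)."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the needed chunks that retrieval actually returned."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)

# Made-up example: retrieval returned four chunks, two of them off-topic,
# and missed one chunk (doc_muscle) that the answer needed.
retrieved = ["doc_heart", "doc_endorphins", "doc_recipes", "doc_weather"]
relevant = {"doc_heart", "doc_endorphins", "doc_muscle"}

print(f"faithfulness:      {faithfulness_ratio(2, 3):.2f}")                # 0.67
print(f"context precision: {context_precision(retrieved, relevant):.2f}")  # 0.50
print(f"context recall:    {context_recall(retrieved, relevant):.2f}")     # 0.67
```

Low precision with decent recall points at noisy retrieval; the reverse points at a missed document.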

Custom RAG Evaluator

# script_id: day_058_llm_as_judge_part1/custom_rag_evaluator
from openai import OpenAI
import json

client = OpenAI()

class RAGEvaluator:
    """Evaluate RAG system responses."""

    def __init__(self):
        self.metrics = {}

    def evaluate_faithfulness(self, answer: str, contexts: list[str]) -> float:
        """Check if answer is grounded in the provided contexts."""

        context_text = "\n".join(contexts)

        prompt = f"""Evaluate if this answer is faithful to (grounded in) the given contexts.

CONTEXTS:
{context_text}

ANSWER:
{answer}

For each claim in the answer, check if it's supported by the contexts.
Return JSON:
{{
    "claims": ["claim1", "claim2", ...],
    "supported_claims": ["claim1", ...],
    "unsupported_claims": ["claim2", ...],
    "faithfulness_score": 0.0-1.0
}}"""

        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0  # Same setting as the other judges, for repeatable scores
        )

        data = json.loads(result.choices[0].message.content)
        return data["faithfulness_score"]

    def evaluate_relevancy(self, question: str, answer: str) -> float:
        """Check if answer is relevant to the question."""

        prompt = f"""Evaluate if this answer is relevant to the question.

QUESTION: {question}

ANSWER: {answer}

Consider:
- Does it address what was asked?
- Is it on-topic?
- Does it provide useful information?

Return JSON:
{{
    "addresses_question": true/false,
    "on_topic": true/false,
    "provides_value": true/false,
    "relevancy_score": 0.0-1.0,
    "reasoning": "..."
}}"""

        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0  # Same setting as the other judges, for repeatable scores
        )

        data = json.loads(result.choices[0].message.content)
        return data["relevancy_score"]

    def evaluate_context_quality(self, question: str, contexts: list[str]) -> float:
        """Check if retrieved contexts are relevant to the question."""

        context_text = "\n---\n".join([f"Context {i+1}: {c}" for i, c in enumerate(contexts)])

        prompt = f"""Evaluate the quality of these retrieved contexts for answering the question.

QUESTION: {question}

CONTEXTS:
{context_text}

For each context, rate its relevance to the question.
Return JSON:
{{
    "context_ratings": [
        {{"context_num": 1, "relevance": "high/medium/low", "useful_for_answer": true/false}},
        ...
    ],
    "overall_context_quality": 0.0-1.0,
    "reasoning": "..."
}}"""

        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0  # Same setting as the other judges, for repeatable scores
        )

        data = json.loads(result.choices[0].message.content)
        return data["overall_context_quality"]

    def full_evaluation(self, question: str, answer: str, contexts: list[str]) -> dict:
        """Run full RAG evaluation."""

        return {
            "faithfulness": self.evaluate_faithfulness(answer, contexts),
            "relevancy": self.evaluate_relevancy(question, answer),
            "context_quality": self.evaluate_context_quality(question, contexts)
        }

# Usage
evaluator = RAGEvaluator()

result = evaluator.full_evaluation(
    question="What are the benefits of exercise?",
    answer="Exercise improves cardiovascular health, builds muscle, and boosts mood through endorphin release.",
    contexts=[
        "Regular exercise strengthens the heart and improves blood circulation.",
        "Physical activity releases endorphins, which are natural mood elevators.",
        "Strength training helps build and maintain muscle mass."
    ]
)

print(f"Faithfulness: {result['faithfulness']:.2%}")
print(f"Relevancy: {result['relevancy']:.2%}")
print(f"Context Quality: {result['context_quality']:.2%}")
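
One weakness of the evaluator above is that it trusts the judge's self-reported faithfulness_score, which may not match the claim lists in the same reply. A possible refinement, sketched here under the assumption that the judge returns the claims and supported_claims lists requested in the prompt, is to recompute the ratio yourself. faithfulness_from_claims is a hypothetical helper, not part of the original class.

```python
def faithfulness_from_claims(data: dict) -> float:
    """Recompute faithfulness as supported claims / all claims.

    `data` is the parsed JSON from the evaluate_faithfulness prompt. Falls back
    to the judge's own "faithfulness_score" when no claims were listed.
    """
    claims = data.get("claims") or []
    supported = data.get("supported_claims") or []
    if not claims:
        return float(data.get("faithfulness_score", 0.0))
    # Count only supported claims that also appear in the full claim list.
    matched = sum(1 for claim in supported if claim in claims)
    return min(matched / len(claims), 1.0)

# Hand-written judge output whose self-reported score disagrees with its lists.
judge_reply = {
    "claims": ["improves heart health", "builds muscle", "cures insomnia"],
    "supported_claims": ["improves heart health", "builds muscle"],
    "unsupported_claims": ["cures insomnia"],
    "faithfulness_score": 0.9,
}
print(f"{faithfulness_from_claims(judge_reply):.2f}")  # 0.67
```

To use it, evaluate_faithfulness could return faithfulness_from_claims(data) instead of data["faithfulness_score"]. Logging both values is a cheap way to see how often the judge's arithmetic drifts.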

Checkpoint

Run the evaluate_with_scores(...) example and confirm you get back a dict with whole-number scores from 1 to 5 for accuracy, completeness, clarity, and overall, and that the deliberately weak photosynthesis answer scores below 5 on at least one axis. If json.loads throws, the judge returned something other than a complete JSON object despite "Return ONLY valid JSON". Keeping response_format={"type": "json_object"} is what constrains the reply to parseable JSON, and temperature=0 keeps the scores as repeatable as the API allows (identical inputs can still occasionally score differently).


Summary


Quick Reference

Pattern              | When to use                | Key detail
Criteria scoring     | Free-form quality review   | temperature=0; ask for score + reasoning
Structured scores    | Comparing many responses   | response_format={"type": "json_object"}
Pairwise comparison  | "Which is better, A or B?" | More reliable than absolute scores
Custom RAG evaluator | Bespoke RAG metrics        | One prompt per dimension, parse JSON
Ragas framework      | Off-the-shelf RAG metrics  | Covered in depth on Day 60

Tips:

  • Always set temperature=0 for the judge so scores are as repeatable as possible.
  • For an off-the-shelf RAG metric suite, reach for the Ragas framework (Day 60) instead of hand-rolling every dimension.
  • Prefer pairwise comparison when absolute scores feel arbitrary; it's easier for a judge to say "B is better" than "B is a 4.2".

Exercises

  1. Extend evaluate_with_scores to also return a boolean pass field that is True only when every individual criterion scores at least 4. Use it to filter a batch of responses.
  2. Run compare_responses twice on the same pair but swap response_a and response_b. Does the winner stay consistent? Note what you observe (this is a preview of position bias, covered tomorrow).
  3. Add another Ragas-style dimension to RAGEvaluator, answer_completeness (does the answer cover everything the context supports?), following the same prompt-and-parse pattern.
  4. Build the smallest possible Ragas run: two samples, only the faithfulness and answer_relevancy metrics, and print each score.

Solutions (approaches)

  1. Parse the JSON, then result["pass"] = all(result[c] >= 4 for c in ["accuracy", "completeness", "clarity", "relevance", "conciseness"]).
  2. Swap the arguments; an unbiased judge should still prefer the same underlying response, so the reported label flips (A becomes B, or B becomes A). If it always picks the same slot, that's position bias.
  3. Copy evaluate_relevancy, add a contexts parameter so the prompt can see the retrieved text, change the prompt to ask "is anything in the context missing from the answer?", return a completeness_score, and add it to the full_evaluation dict.
  4. A minimal Ragas run:

     from ragas import evaluate, EvaluationDataset
     from ragas.metrics import faithfulness, answer_relevancy

     ds = EvaluationDataset.from_list([
         {"user_input": "...", "response": "...", "retrieved_contexts": ["..."], "reference": "..."},
         {"user_input": "...", "response": "...", "retrieved_contexts": ["..."], "reference": "..."},
     ])
     r = evaluate(dataset=ds, metrics=[faithfulness, answer_relevancy])
     print(r)

What's Next?

You can now score and compare responses, but naive judges have systematic biases (position, verbosity, self-enhancement). Next up: LLM-as-Judge Part 2 — detecting and correcting those biases, calibrating with anchors, measuring reliability with Cohen's Kappa, and running multi-judge consensus cost-efficiently.