How do you know if your agent is actually good? Manual testing doesn't scale. In this guide, you'll learn to use LLMs to automatically evaluate agent responses and RAG systems.
Coming from Software Engineering? LLM-as-judge is like using a linter or static analysis tool for natural language output. Just as ESLint checks code quality against rules, an LLM judge checks response quality against criteria. You've used automated quality gates in CI — this is the same concept applied to AI output. The tradeoff is similar too: automated checks are fast and consistent but imperfect; human review is accurate but doesn't scale.
Why Automated Evaluation?
Benefits of automated evaluation:
- Scale: Evaluate thousands of responses
- Consistency: Same criteria every time
- Speed: Seconds vs. hours
- Iteration: Test changes quickly
LLM-as-Judge: The Basics
Use one LLM to evaluate another's output:
# script_id: day_058_llm_as_judge_part1/evaluate_response
from openai import OpenAI

client = OpenAI()


def evaluate_response(question: str, response: str, criteria: list[str]) -> str:
    """
    Use an LLM to evaluate a response.

    Args:
        question: The original question
        response: The response to evaluate
        criteria: List of criteria to judge

    Returns:
        The judge's reply as text: per-criterion scores and reasoning
    """
    criteria_text = "\n".join([f"- {c}" for c in criteria])

    eval_prompt = f"""You are an expert evaluator. Evaluate the following response.

QUESTION: {question}

RESPONSE: {response}

CRITERIA:
{criteria_text}

For each criterion, provide:
1. A score from 1-5 (1=poor, 5=excellent)
2. Brief reasoning

Format your response as:
CRITERION: [name]
SCORE: [1-5]
REASONING: [your reasoning]

Then provide an OVERALL score (1-5) and summary."""

    evaluation = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0,  # Consistent evaluations
    )
    # The judge replies in free text, so this returns a string, not a dict
    return evaluation.choices[0].message.content


# Example usage
question = "What is machine learning?"
response = "Machine learning is a subset of AI where computers learn from data without being explicitly programmed."
criteria = [
    "Accuracy: Is the information correct?",
    "Completeness: Does it cover the key points?",
    "Clarity: Is it easy to understand?",
]

evaluation = evaluate_response(question, response, criteria)
print(evaluation)
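Because the judge above replies in free text, comparing scores across many responses means parsing that text first. The sketch below shows one way to do it, assuming the judge follows the CRITERION / SCORE / REASONING layout requested in the prompt. `parse_evaluation` and `BLOCK_PATTERN` are hypothetical names, and the sample reply is hand-written for illustration rather than real model output.

```python
import re

# Hypothetical helper: matches one CRITERION / SCORE / REASONING block.
# Assumes each field sits on its own line, as the prompt requests.
BLOCK_PATTERN = re.compile(
    r"CRITERION:\s*(?P<name>.+?)\s*\n"
    r"SCORE:\s*(?P<score>[1-5])\b.*\n"
    r"REASONING:\s*(?P<reasoning>.+)"
)


def parse_evaluation(text: str) -> list[dict]:
    """Extract per-criterion scores from the judge's free-text reply."""
    results = []
    for match in BLOCK_PATTERN.finditer(text):
        results.append({
            "criterion": match.group("name"),
            "score": int(match.group("score")),
            "reasoning": match.group("reasoning").strip(),
        })
    return results


# Hand-written sample reply (an assumption, not real judge output)
sample = """CRITERION: Accuracy
SCORE: 4
REASONING: Correct but simplified.

CRITERION: Clarity
SCORE: 5
REASONING: Easy to follow.

OVERALL: 4 - solid short answer."""

parsed = parse_evaluation(sample)
for item in parsed:
    print(item["criterion"], item["score"])
```

If the judge drifts from the requested layout, the parser silently returns fewer items, which is one reason the next section asks for JSON instead.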
Structured Evaluation with Scores
Get numerical scores for easy comparison:
# script_id: day_058_llm_as_judge_part1/evaluate_with_scores
from openai import OpenAI
import json

client = OpenAI()


def evaluate_with_scores(question: str, response: str) -> dict:
    """Get structured evaluation scores."""
    eval_prompt = f"""Evaluate this response on a scale of 1-5 for each criterion.

Question: {question}

Response: {response}

Return a JSON object with these scores:
- accuracy: How factually correct is the response?
- completeness: How thoroughly does it answer the question?
- clarity: How clear and understandable is it?
- relevance: How relevant is it to the question?
- conciseness: Is it appropriately concise (not too verbose)?

Also include:
- overall: Overall quality score (1-5)
- feedback: Brief constructive feedback

Use whole numbers from 1 to 5 for each score.
Return ONLY valid JSON."""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)


# Example
scores = evaluate_with_scores(
    "Explain photosynthesis",
    "Photosynthesis is when plants make food from sunlight.",
)

print(f"Accuracy: {scores['accuracy']}/5")
print(f"Completeness: {scores['completeness']}/5")
print(f"Clarity: {scores['clarity']}/5")
print(f"Overall: {scores['overall']}/5")
print(f"Feedback: {scores['feedback']}")
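JSON mode makes the reply parseable, but it does not by itself guarantee that every requested key is present or that each score is a whole number from 1 to 5. A small check before using the scores can catch that. This is a minimal sketch: `validate_scores` and `SCORE_KEYS` are hypothetical names, and the sample dicts are made up for illustration.

```python
# Keys the prompt asks the judge to return as 1-5 scores
SCORE_KEYS = ["accuracy", "completeness", "clarity", "relevance", "conciseness", "overall"]


def validate_scores(scores: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict looks usable."""
    problems = []
    for key in SCORE_KEYS:
        value = scores.get(key)
        # bool is a subclass of int in Python, so reject it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{key}: expected a whole number, got {value!r}")
        elif not 1 <= value <= 5:
            problems.append(f"{key}: {value} is outside 1-5")
    if not isinstance(scores.get("feedback"), str):
        problems.append("feedback: expected a string")
    return problems


# Made-up examples: one well-formed reply, one malformed reply
good = {
    "accuracy": 4, "completeness": 2, "clarity": 5,
    "relevance": 5, "conciseness": 4, "overall": 3,
    "feedback": "Correct but too brief.",
}
bad = {"accuracy": 7, "completeness": "3", "clarity": 5}

print(validate_scores(good))
for problem in validate_scores(bad):
    print(problem)
```

Running the check right after `json.loads` lets a batch job skip or retry a malformed evaluation instead of failing later with a `KeyError`.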
Pairwise Comparison
Compare two responses to find the better one:
# script_id: day_058_llm_as_judge_part1/compare_responses
from openai import OpenAI
import json

client = OpenAI()


def compare_responses(question: str, response_a: str, response_b: str) -> dict:
    """
    Compare two responses and determine which is better.

    This is often more reliable than absolute scoring!
    """
    comparison_prompt = f"""Compare these two responses to the same question.

QUESTION: {question}

RESPONSE A:
{response_a}

RESPONSE B:
{response_b}

Evaluate which response is better and why. Consider:
- Accuracy
- Completeness
- Clarity
- Helpfulness

Return JSON with:
- winner: "A" or "B" or "tie"
- confidence: "high", "medium", or "low"
- reasoning: Brief explanation of your choice
- a_strengths: List of response A's strengths
- b_strengths: List of response B's strengths"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": comparison_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)


# Example
result = compare_responses(
    "What is a REST API?",
    "A REST API lets programs talk over HTTP.",
    "A REST API exposes resources addressed by URLs, which clients act on using standard HTTP verbs (GET, POST, PUT, DELETE). Each request is stateless (it carries everything the server needs), and the server replies with a status code (e.g. 200 OK, 404 Not Found) plus a body, usually JSON.",
)

print(f"Winner: Response {result['winner']}")
print(f"Confidence: {result['confidence']}")
print(f"Reasoning: {result['reasoning']}")
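A single comparison says little about two systems; in practice the same comparison is usually run over many questions and the verdicts are tallied. The sketch below shows one possible tally over dicts shaped like the `compare_responses` output. `tally_wins` is a hypothetical helper, and the verdicts are made up for illustration.

```python
def tally_wins(results: list[dict]) -> dict:
    """Count A/B/tie verdicts and compute B's win rate among decided comparisons."""
    counts = {"A": 0, "B": 0, "tie": 0, "invalid": 0}
    for r in results:
        winner = str(r.get("winner", "")).strip().lower()
        if winner == "a":
            counts["A"] += 1
        elif winner == "b":
            counts["B"] += 1
        elif winner == "tie":
            counts["tie"] += 1
        else:
            # Anything other than the three allowed values is counted, not guessed at
            counts["invalid"] += 1
    decided = counts["A"] + counts["B"]
    # None when there is nothing to divide by (only ties or invalid verdicts)
    counts["b_win_rate"] = counts["B"] / decided if decided else None
    return counts


# Made-up verdicts standing in for real compare_responses() outputs
verdicts = [
    {"winner": "B", "confidence": "high"},
    {"winner": "B", "confidence": "medium"},
    {"winner": "A", "confidence": "low"},
    {"winner": "tie", "confidence": "low"},
    {"winner": "Response B", "confidence": "high"},  # judge ignored the format
]

summary = tally_wins(verdicts)
print(summary)
```

Counting malformed verdicts separately, rather than coercing them, keeps the win rate honest when the judge occasionally strays from the requested values.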
RAG Evaluation with Ragas
A RAG system has an extra failure mode beyond a plain LLM answer — the retrieval step can fetch the wrong documents — so a RAG eval scores two things: did we fetch the right context, and did the answer use it correctly. The open-source Ragas framework packages these as ready-made metrics (faithfulness, answer relevancy, context precision/recall); we cover it in depth on Day 60. Below we build the judge from scratch to show the mechanics.
Understanding Ragas Metrics
In plain terms, those four metrics mean:
- Faithfulness: the answer only states things the retrieved docs actually support (no made-up facts, i.e. no hallucination).
- Answer relevancy: does the answer actually address the question.
- Context precision: of the chunks retrieval returned, how many were actually relevant (low = noisy retrieval).
- Context recall: did retrieval pull in everything needed to answer (low = a needed doc was missed).
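The two retrieval metrics are easiest to see as simple ratios. The sketch below computes them from hand-labeled data, following the plain-terms definitions only. Ragas's own implementations are more involved and rely on an LLM judge, so treat this as an illustration of the idea, not of the library. The function names and document IDs are hypothetical.

```python
def context_precision(relevant_flags: list[bool]) -> float:
    """Share of retrieved chunks that were actually relevant."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


def context_recall(needed: set[str], retrieved: set[str]) -> float:
    """Share of the needed documents that retrieval actually returned."""
    if not needed:
        return 1.0  # nothing was needed, so nothing was missed
    return len(needed & retrieved) / len(needed)


# Made-up example: 4 chunks retrieved, 2 of them relevant
retrieved_ids = {"doc1", "doc2", "doc3", "doc4"}
relevant_flags = [True, True, False, False]

# Assumption: answering the question fully needs doc1, doc2, and doc5
needed_ids = {"doc1", "doc2", "doc5"}

precision = context_precision(relevant_flags)
recall = context_recall(needed_ids, retrieved_ids)

print(f"Context precision: {precision:.2f}")  # 0.50, half the chunks were noise
print(f"Context recall: {recall:.2f}")        # 0.67, doc5 was never retrieved
```

Low precision points at noisy retrieval, while low recall points at a missing document; the two can move independently.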
Custom RAG Evaluator
# script_id: day_058_llm_as_judge_part1/custom_rag_evaluator
from openai import OpenAI
import json

client = OpenAI()


class RAGEvaluator:
    """Evaluate RAG system responses."""

    def __init__(self):
        self.metrics = {}

    def evaluate_faithfulness(self, answer: str, contexts: list[str]) -> float:
        """Check if answer is grounded in the provided contexts."""
        context_text = "\n".join(contexts)

        prompt = f"""Evaluate if this answer is faithful to (grounded in) the given contexts.

CONTEXTS:
{context_text}

ANSWER:
{answer}

For each claim in the answer, check if it's supported by the contexts.

Return JSON:
{{
    "claims": ["claim1", "claim2", ...],
    "supported_claims": ["claim1", ...],
    "unsupported_claims": ["claim2", ...],
    "faithfulness_score": 0.0-1.0
}}"""

        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0,  # Same setting as the earlier judges, for repeatable scores
        )
        data = json.loads(result.choices[0].message.content)
        return data["faithfulness_score"]

    def evaluate_relevancy(self, question: str, answer: str) -> float:
        """Check if answer is relevant to the question."""
        prompt = f"""Evaluate if this answer is relevant to the question.

QUESTION: {question}

ANSWER: {answer}

Consider:
- Does it address what was asked?
- Is it on-topic?
- Does it provide useful information?

Return JSON:
{{
    "addresses_question": true/false,
    "on_topic": true/false,
    "provides_value": true/false,
    "relevancy_score": 0.0-1.0,
    "reasoning": "..."
}}"""

        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0,
        )
        data = json.loads(result.choices[0].message.content)
        return data["relevancy_score"]

    def evaluate_context_quality(self, question: str, contexts: list[str]) -> float:
        """Check if retrieved contexts are relevant to the question."""
        context_text = "\n---\n".join([f"Context {i+1}: {c}" for i, c in enumerate(contexts)])

        prompt = f"""Evaluate the quality of these retrieved contexts for answering the question.

QUESTION: {question}

CONTEXTS:
{context_text}

For each context, rate its relevance to the question.

Return JSON:
{{
    "context_ratings": [
        {{"context_num": 1, "relevance": "high/medium/low", "useful_for_answer": true/false}},
        ...
    ],
    "overall_context_quality": 0.0-1.0,
    "reasoning": "..."
}}"""

        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0,
        )
        data = json.loads(result.choices[0].message.content)
        return data["overall_context_quality"]

    def full_evaluation(self, question: str, answer: str, contexts: list[str]) -> dict:
        """Run full RAG evaluation."""
        return {
            "faithfulness": self.evaluate_faithfulness(answer, contexts),
            "relevancy": self.evaluate_relevancy(question, answer),
            "context_quality": self.evaluate_context_quality(question, contexts),
        }


# Usage
evaluator = RAGEvaluator()

result = evaluator.full_evaluation(
    question="What are the benefits of exercise?",
    answer="Exercise improves cardiovascular health, builds muscle, and boosts mood through endorphin release.",
    contexts=[
        "Regular exercise strengthens the heart and improves blood circulation.",
        "Physical activity releases endorphins, which are natural mood elevators.",
        "Strength training helps build and maintain muscle mass.",
    ],
)

print(f"Faithfulness: {result['faithfulness']:.2%}")
print(f"Relevancy: {result['relevancy']:.2%}")
print(f"Context Quality: {result['context_quality']:.2%}")
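The faithfulness prompt asks the judge for both the claim lists and a final score, and nothing forces the two to agree. One option is to recompute the score from the lists and treat the judge's own number as a cross-check. This is a sketch of that idea, not something the evaluator above does: `faithfulness_from_claims` is a hypothetical helper, and the sample reply is made up.

```python
from typing import Optional


def faithfulness_from_claims(data: dict) -> Optional[float]:
    """Recompute faithfulness as supported claims / all judged claims."""
    supported = data.get("supported_claims", [])
    unsupported = data.get("unsupported_claims", [])
    total = len(supported) + len(unsupported)
    if total == 0:
        # Assumption: with no claims extracted there is nothing to score
        return None
    return len(supported) / total


# Made-up judge reply in the shape the faithfulness prompt requests
judge_reply = {
    "claims": ["improves heart health", "builds muscle", "cures insomnia"],
    "supported_claims": ["improves heart health", "builds muscle"],
    "unsupported_claims": ["cures insomnia"],
    "faithfulness_score": 0.9,  # the judge's self-reported number
}

recomputed = faithfulness_from_claims(judge_reply)
reported = judge_reply["faithfulness_score"]

print(f"Reported:   {reported:.2f}")
print(f"Recomputed: {recomputed:.2f}")
if abs(reported - recomputed) > 0.1:
    print("Judge's score disagrees with its own claim lists; inspect this sample.")
```

A large gap between the two numbers is a cheap signal that a particular evaluation deserves a manual look.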
Checkpoint
Run the evaluate_with_scores(...) example and confirm you get back a dict with whole-number scores from 1 to 5 for accuracy, completeness, clarity, and overall, and that the deliberately weak photosynthesis answer scores below 5 on at least one axis. If json.loads throws, the judge returned something that is not valid JSON. Keeping response_format={"type": "json_object"} is what makes the output parseable, and temperature=0 makes the scores as repeatable as the API allows, although small run-to-run differences can still occur.
Summary
Quick Reference
| Pattern | When to use | Key detail |
|---|---|---|
| Criteria scoring | Free-form quality review | temperature=0; ask for score + reasoning |
| Structured scores | Comparing many responses | response_format={"type": "json_object"} |
| Pairwise comparison | "Which is better, A or B?" | More reliable than absolute scores |
| Custom RAG evaluator | Bespoke RAG metrics | One prompt per dimension, parse JSON |
| Ragas framework | Off-the-shelf RAG metrics | Covered in depth on Day 60 |
Tips:
- Always set temperature=0 for the judge so scores are reproducible.
- For an off-the-shelf RAG metric suite, reach for the Ragas framework (Day 60) instead of hand-rolling every dimension.
- Prefer pairwise comparison when absolute scores feel arbitrary; it's easier for a judge to say "B is better" than "B is a 4.2".
Exercises
- Extend evaluate_with_scores to also return a boolean pass field that is True only when every individual criterion scores at least 4. Use it to filter a batch of responses.
- Run compare_responses twice on the same pair but swap response_a and response_b. Does the winner stay consistent? Note what you observe (this is a preview of position bias, covered tomorrow).
- Add a fourth dimension to RAGEvaluator, answer_completeness (does the answer cover everything the context supports?), following the same prompt-and-parse pattern.
- Build the smallest possible Ragas run: two samples, only the faithfulness and answer_relevancy metrics, and print each score.
Solutions (approaches)
- Parse the JSON, then result["pass"] = all(result[c] >= 4 for c in ["accuracy", "completeness", "clarity", "relevance", "conciseness"]).
- Swap the arguments; an unbiased judge should flip the winning label, because the same response should win from either slot. If it always picks the same slot, that's position bias.
- Copy evaluate_relevancy, change the prompt to ask "is anything in the context missing from the answer?", return a completeness_score, and add it to the full_evaluation dict.
- Build the dataset from a list of dicts and pass only the two metrics:

    from ragas import evaluate, EvaluationDataset
    from ragas.metrics import faithfulness, answer_relevancy

    ds = EvaluationDataset.from_list([
        {"user_input": "...", "response": "...", "retrieved_contexts": ["..."], "reference": "..."},
        {"user_input": "...", "response": "...", "retrieved_contexts": ["..."], "reference": "..."},
    ])

    r = evaluate(dataset=ds, metrics=[faithfulness, answer_relevancy])
    print(r)
What's Next?
You can now score and compare responses, but naive judges have systematic biases (position, verbosity, self-enhancement). Next up: LLM-as-Judge Part 2 — detecting and correcting those biases, calibrating with anchors, measuring reliability with Cohen's Kappa, and running multi-judge consensus cost-efficiently.