Yesterday you built basic LLM-as-judge scoring and pairwise comparison. Today we go deeper: the failure modes that make naive judges unreliable, calibration techniques that fix them, multi-judge consensus for high-stakes evaluation, and cost-efficient strategies that don't blow your budget.
Coming from Software Engineering? This is like building a reliable test suite — a single flaky test is worse than no test because it gives false confidence. LLM judges have systematic biases (like a linter that always flags a certain pattern even when it's fine). Today you learn to detect and compensate for those biases, use consensus across multiple judges (like requiring 2 of 3 reviewers to approve a PR), and optimize for cost the same way you'd optimize CI pipeline runtime.
The Problem with Naive LLM Judges
Your Day 58 judge works — until it doesn't. LLM judges have well-documented biases that produce unreliable scores if you don't account for them.
Position Bias
LLMs tend to prefer whichever response appears first (or last, depending on the model). This means pairwise comparisons can flip just by swapping the order.
# script_id: day_059_llm_as_judge_part2/judge_techniques
from openai import OpenAI
import json

client = OpenAI()

def demonstrate_position_bias(question: str, response_a: str, response_b: str) -> dict:
    """Show how position affects judge scoring."""
    prompt_template = """Compare these two responses to the question: "{question}"
Response {label_1}: {first}
Response {label_2}: {second}
Which response is better? Return JSON: {{"winner": "A" or "B", "reasoning": "..."}}"""

    # Order 1: A first
    result_1 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(
            question=question, label_1="A", label_2="B",
            first=response_a, second=response_b
        )}],
        response_format={"type": "json_object"},
        temperature=0
    )
    order_1 = json.loads(result_1.choices[0].message.content)

    # Order 2: B first (swap positions, keep labels)
    result_2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(
            question=question, label_1="A", label_2="B",
            first=response_b, second=response_a
        )}],
        response_format={"type": "json_object"},
        temperature=0
    )
    order_2 = json.loads(result_2.choices[0].message.content)

    # We keep the labels "A"/"B" but swap which response sits behind each. An
    # UNBIASED judge therefore reports a *different* winning label across the two
    # runs (it followed the response, not the slot). The label staying the same
    # means the judge favored a position -> position bias.
    unbiased = order_1["winner"] != order_2["winner"]  # True == winning label flipped
    return {
        "a_first_winner": order_1["winner"],
        "b_first_winner": order_2["winner"],
        "unbiased": unbiased,
        "position_biased": not unbiased
    }
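Detection is only half of the fix. One common way to act on the swap is to accept a verdict only when the judge picks the same underlying response in both orders, and to call it a tie otherwise. The sketch below assumes the label convention used above (label "A" is always the first slot); the helper name resolve_swapped_verdicts is hypothetical and makes no API call.

```python
def resolve_swapped_verdicts(a_first_winner: str, b_first_winner: str) -> str:
    """Combine the two verdicts from a position-swapped comparison.

    a_first_winner: winning label when response_a sat behind label "A".
    b_first_winner: winning label when response_b sat behind label "A".
    Returns "response_a", "response_b", or "tie" when the judge picked
    the same slot both times (a position-driven verdict).
    """
    # Map each run's winning label back to the underlying response.
    run_1 = "response_a" if a_first_winner == "A" else "response_b"
    run_2 = "response_b" if b_first_winner == "A" else "response_a"
    return run_1 if run_1 == run_2 else "tie"


print(resolve_swapped_verdicts("A", "B"))  # response_a
print(resolve_swapped_verdicts("A", "A"))  # tie
```

The two arguments correspond to the a_first_winner and b_first_winner keys returned by demonstrate_position_bias.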
Verbosity Bias
LLM judges tend to rate longer, more detailed responses higher — even when the shorter response is more accurate or more appropriate.
# script_id: day_059_llm_as_judge_part2/judge_techniques
def detect_verbosity_bias(question: str, concise: str, verbose: str) -> dict:
    """Check if the judge prefers verbose responses regardless of quality."""
    prompt = f"""Rate both responses on a 1-5 scale for ACCURACY ONLY.
Ignore length, style, and detail. Focus purely on factual correctness.
Question: {question}
Response A (concise): {concise}
Response B (detailed): {verbose}
Return JSON: {{"a_score": 1-5, "b_score": 1-5, "reasoning": "..."}}"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )
    scores = json.loads(result.choices[0].message.content)
    # If verbose always scores higher despite equal accuracy, that's bias
    return scores

# Example: Both are equally correct, but one is verbose
result = detect_verbosity_bias(
    question="What is the capital of France?",
    concise="Paris.",
    verbose="The capital of France is Paris, a city known as the City of Light, "
            "located in the north-central part of the country along the Seine River. "
            "Paris has been the capital since the late 10th century."
)
# A biased judge will score the verbose response higher despite equal accuracy
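A single pair cannot show a systematic preference. Across a batch, one rough signal is the correlation between response length and judge score. The sketch below is pure Python; the helper name length_score_correlation is hypothetical, and length is measured in whitespace-separated words as a simplifying assumption. A high correlation is only suggestive, since longer answers can be genuinely better, so it is most informative on a set where human labels say quality is roughly equal.

```python
def length_score_correlation(responses: list[str], scores: list[float]) -> float:
    """Pearson correlation between response length (in words) and judge score."""
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    lengths = [len(r.split()) for r in responses]
    n = len(lengths)
    mean_len = sum(lengths) / n
    mean_score = sum(scores) / n
    cov = sum((l - mean_len) * (s - mean_score) for l, s in zip(lengths, scores))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_score = sum((s - mean_score) ** 2 for s in scores)
    if var_len == 0 or var_score == 0:
        return 0.0  # no variation: correlation is undefined, report 0 by convention
    return cov / (var_len * var_score) ** 0.5


# Scores that rise in lockstep with length give a correlation of 1.0
print(length_score_correlation(["a", "a b", "a b c"], [1, 2, 3]))
```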
Self-Enhancement Bias
Models tend to rate their own outputs higher than outputs from other models. If you use GPT-4o to judge GPT-4o vs. Claude outputs, expect bias toward GPT-4o.
Mitigation: Use a different model as judge than the one that generated the responses, or use multi-judge consensus.
Calibration: Making Judges Reliable
Calibration means ensuring your judge's scores are meaningful and consistent. Without calibration, a "4 out of 5" from your judge might mean different things for different types of questions.
Anchor-Based Calibration
Provide the judge with reference examples at known quality levels:
# script_id: day_059_llm_as_judge_part2/judge_techniques
CALIBRATION_ANCHORS = {
    "excellent": {
        "question": "Explain recursion in programming.",
        "response": "Recursion is when a function calls itself to solve a smaller "
                    "version of the same problem. Every recursive function needs a "
                    "base case (when to stop) and a recursive case (how to break the "
                    "problem down). Example: calculating factorial, where factorial(5) = "
                    "5 * factorial(4), and factorial(1) = 1 is the base case.",
        "score": 5,
        "reasoning": "Accurate, clear, includes example, mentions base case."
    },
    "mediocre": {
        "question": "Explain recursion in programming.",
        "response": "Recursion is when a function calls itself. It's used in "
                    "programming for various tasks.",
        "score": 2,
        "reasoning": "Technically correct but lacks depth, no example, no base case."
    },
    "poor": {
        "question": "Explain recursion in programming.",
        "response": "Recursion is a loop that repeats until a condition is met.",
        "score": 1,
        "reasoning": "Confuses recursion with iteration. Factually incorrect."
    }
}

def calibrated_judge(question: str, response: str) -> dict:
    """Judge with calibration anchors for consistent scoring."""
    anchor_text = ""
    for level, anchor in CALIBRATION_ANCHORS.items():
        anchor_text += f"\n--- {level.upper()} example (score: {anchor['score']}) ---\n"
        anchor_text += f"Q: {anchor['question']}\n"
        anchor_text += f"A: {anchor['response']}\n"
        anchor_text += f"Why this score: {anchor['reasoning']}\n"

    prompt = f"""You are an expert evaluator. Score the following response on a 1-5 scale.
Use these calibration examples to anchor your scoring:
{anchor_text}
Now evaluate:
Question: {question}
Response: {response}
Return JSON: {{"score": 1-5, "reasoning": "...", "closest_anchor": "excellent/mediocre/poor"}}"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
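A quick sanity check on any anchored judge is to score examples whose quality is already known and measure how far the judge lands from the expected score. The sketch below is an assumption about how such a check could look, not part of the lesson's code: anchor_self_check is a hypothetical helper that accepts any function with the same (question, response) -> {"score": ...} shape as calibrated_judge, and stub_judge is a toy stand-in so the example runs without an API call. In practice the check is more meaningful on held-out examples than on the anchors already shown in the prompt.

```python
from typing import Callable


def anchor_self_check(anchors: dict, judge_fn: Callable[[str, str], dict]) -> dict:
    """Score each known-quality example and compare against its expected score."""
    errors = {}
    for level, anchor in anchors.items():
        predicted = judge_fn(anchor["question"], anchor["response"])["score"]
        errors[level] = abs(predicted - anchor["score"])
    return {
        "per_anchor_error": errors,
        "mean_abs_error": sum(errors.values()) / len(errors),
    }


# Toy judge for illustration only: scores by word count, no API call.
def stub_judge(question: str, response: str) -> dict:
    return {"score": min(5, max(1, len(response.split()) // 3))}


demo_anchors = {
    "good": {"question": "q", "response": "one two three four five six seven eight nine", "score": 5},
    "poor": {"question": "q", "response": "one two", "score": 1},
}
report = anchor_self_check(demo_anchors, stub_judge)
print(report["per_anchor_error"])  # {'good': 2, 'poor': 0}
print(report["mean_abs_error"])    # 1.0
```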
Measuring Judge Reliability: Cohen's Kappa
Compare your LLM judge against human evaluators (or against itself across runs) to measure agreement.
If two PR reviewers each approve 90% of PRs out of habit, they'll agree about 82% of the time purely by luck (0.9 × 0.9 for both approving, plus 0.1 × 0.1 for both rejecting), so raw agreement overstates how much they actually think alike. Cohen's Kappa subtracts that luck: it measures how much they agree beyond random chance. p_observed = how often they actually matched; p_expected = how often they'd match by coincidence given how each rates overall; kappa = (p_observed - p_expected) / (1 - p_expected), which rescales the leftover agreement so that 1 means perfect agreement and 0 means no better than chance.
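The PR-reviewer arithmetic can be checked in a few lines. This is a minimal sketch for the two-category (approve/reject) case only; chance_agreement is a hypothetical helper name, and the 85% observed agreement is an illustrative number, not data from the lesson.

```python
def chance_agreement(p_yes_1: float, p_yes_2: float) -> float:
    """Chance agreement for two raters making independent yes/no decisions."""
    return p_yes_1 * p_yes_2 + (1 - p_yes_1) * (1 - p_yes_2)


p_expected = chance_agreement(0.9, 0.9)
p_observed = 0.85  # illustrative: the reviewers matched on 85% of PRs
kappa = (p_observed - p_expected) / (1 - p_expected)

print(round(p_expected, 2))  # 0.82
print(round(kappa, 3))       # 0.167
```

An 85% match rate looks strong, but once the 82% chance level is subtracted, very little real agreement is left.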
# script_id: day_059_llm_as_judge_part2/cohens_kappa
def cohens_kappa(judge_1_scores: list[int], judge_2_scores: list[int]) -> float:
    """Calculate Cohen's Kappa for inter-rater agreement.

    Returns a value from -1 to 1. Interpretation bands (Landis & Koch, 1977):
        < 0 = worse than chance, 0.00-0.20 = slight, 0.21-0.40 = fair,
        0.41-0.60 = moderate, 0.61-0.80 = substantial, > 0.80 = almost perfect
    """
    if len(judge_1_scores) != len(judge_2_scores) or not judge_1_scores:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(judge_1_scores)
    categories = sorted(set(judge_1_scores + judge_2_scores))

    # Observed agreement: fraction of items where both judges gave the same score
    agreements = sum(1 for a, b in zip(judge_1_scores, judge_2_scores) if a == b)
    p_observed = agreements / n

    # Expected agreement by chance, from each judge's overall score distribution
    p_expected = sum(
        (judge_1_scores.count(c) / n) * (judge_2_scores.count(c) / n)
        for c in categories
    )
    if p_expected == 1:
        # Both judges gave one identical score to everything: treat as full agreement
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Usage: compare LLM judge vs human labels
human_scores = [5, 3, 4, 2, 5, 1, 4, 3, 5, 2]
llm_scores = [5, 4, 4, 2, 5, 1, 3, 3, 5, 3]
kappa = cohens_kappa(human_scores, llm_scores)
print(f"Cohen's Kappa: {kappa:.3f}")
# > 0.6 = substantial agreement → judge is reliable enough for automated use
# 0.21-0.40 = fair agreement → judge needs better calibration or different prompt
# <= 0.20 = slight/poor agreement → fundamentally rethink your evaluation approach
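Plain kappa treats every mismatch the same, so a 4-versus-5 disagreement counts as heavily as a 1-versus-5 disagreement. For ordinal 1-5 scores, a commonly used variant is quadratic-weighted kappa, which penalizes a mismatch by the squared distance between the two scores. The lesson does not use this variant; the sketch below is offered as an optional extension, with weighted_kappa as a hypothetical helper name and the 1-5 range as a default assumption.

```python
def weighted_kappa(scores_1: list[int], scores_2: list[int],
                   min_score: int = 1, max_score: int = 5) -> float:
    """Quadratic-weighted Cohen's kappa for ordinal scores in [min_score, max_score]."""
    if len(scores_1) != len(scores_2) or not scores_1:
        raise ValueError("score lists must be non-empty and the same length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    if any(not (min_score <= s <= max_score) for s in scores_1 + scores_2):
        raise ValueError("all scores must lie within [min_score, max_score]")

    n = len(scores_1)
    span = max_score - min_score
    categories = range(min_score, max_score + 1)

    # How often each judge uses each score
    freq_1 = {c: scores_1.count(c) / n for c in categories}
    freq_2 = {c: scores_2.count(c) / n for c in categories}

    # Observed disagreement: mean squared (normalized) distance between paired scores
    observed = sum(((a - b) / span) ** 2 for a, b in zip(scores_1, scores_2)) / n

    # Expected disagreement if the two judges scored independently
    expected = sum(
        freq_1[i] * freq_2[j] * ((i - j) / span) ** 2
        for i in categories for j in categories
    )
    if expected == 0:
        return 1.0  # both judges gave one identical score to everything
    return 1 - observed / expected


human = [5, 3, 4, 2, 5, 1, 4, 3, 5, 2]
llm = [5, 4, 4, 2, 5, 1, 3, 3, 5, 3]
print(round(weighted_kappa(human, llm), 3))  # 0.914
```

On the same sample data the weighted value is much higher than the unweighted one because all three mismatches are off by a single point.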
Multi-Judge Consensus
For high-stakes evaluations, use multiple judges and aggregate their scores. This reduces the impact of any single judge's bias.
# script_id: day_059_llm_as_judge_part2/judge_techniques
def multi_judge_evaluate(
    question: str,
    response: str,
    models: tuple[str, ...] = ("gpt-4o", "gpt-4o-mini"),
    threshold: float = 0.7
) -> dict:
    """Evaluate using multiple LLM judges with consensus scoring."""
    all_scores = []
    all_reasoning = []
    for model in models:
        result = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"""Rate this response 1-5.
Question: {question}
Response: {response}
Return JSON: {{"score": 1-5, "reasoning": "..."}}"""}],
            response_format={"type": "json_object"},
            temperature=0
        )
        parsed = json.loads(result.choices[0].message.content)
        all_scores.append(parsed["score"])
        all_reasoning.append({"model": model, **parsed})

    avg_score = sum(all_scores) / len(all_scores)
    # variance = how spread-out the judges' scores are (0 = identical scores).
    # On a 1-5 scale, < 0.5 means scores sit within ~1 point of each other;
    # treat that as the judges agreeing.
    score_variance = sum((s - avg_score) ** 2 for s in all_scores) / len(all_scores)
    return {
        "average_score": round(avg_score, 2),
        "scores": all_scores,
        "variance": round(score_variance, 2),
        "high_agreement": score_variance < 0.5,
        "passed": avg_score >= (threshold * 5),  # threshold is a fraction of the 5-point maximum
        "details": all_reasoning
    }

# 💰 Cost note: 2 judges on the same model = 2x cost per evaluation.
# At GPT-4o ($2.50/1M in + $10/1M out), assuming ~500 input and ~500 output tokens per eval:
#   Single judge: ~$0.006/eval → $6/1000 evals
#   Dual judge:   ~$0.012/eval → $12/1000 evals
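The per-eval figures in the cost note can be reproduced with a small calculator. This sketch assumes roughly 500 input and 500 output tokens per evaluation (an assumption that makes the $0.006 figure work out) and uses the GPT-4o prices quoted in the note, which may change over time. cost_per_eval is a hypothetical helper name.

```python
def cost_per_eval(input_tokens: int, output_tokens: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one judge call, given per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


single = cost_per_eval(500, 500, price_in_per_m=2.50, price_out_per_m=10.00)
print(f"single judge: ${single:.5f}/eval")          # single judge: $0.00625/eval
print(f"dual judge:   ${2 * single:.5f}/eval")      # dual judge:   $0.01250/eval
print(f"per 1000 evals (single): ${1000 * single:.2f}")  # per 1000 evals (single): $6.25
```

Output tokens dominate the bill at these prices, so asking the judge for shorter reasoning is often the cheapest optimization available.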
Handling Disagreement
When judges disagree significantly, you need a tiebreaker strategy:
# script_id: day_059_llm_as_judge_part2/judge_techniques
def evaluate_with_tiebreaker(question: str, response: str) -> dict:
    """Two cheap judges + expensive tiebreaker when they disagree."""
    # Round 1: Two cheap judges
    cheap_results = multi_judge_evaluate(
        question, response,
        models=("gpt-4o-mini", "gpt-4o-mini")
    )
    if cheap_results["high_agreement"]:
        return {**cheap_results, "tiebreaker_used": False}

    # Round 2: Expensive judge as tiebreaker (only when needed)
    tiebreaker = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Two evaluators disagreed on this response.
Scores were: {cheap_results['scores']}
Question: {question}
Response: {response}
Provide the definitive score. Return JSON: {{"score": 1-5, "reasoning": "..."}}"""}],
        response_format={"type": "json_object"},
        temperature=0
    )
    final = json.loads(tiebreaker.choices[0].message.content)
    return {
        "average_score": final["score"],  # the tiebreaker's score replaces the average
        "scores": cheap_results["scores"] + [final["score"]],
        "tiebreaker_used": True,
        "tiebreaker_reasoning": final["reasoning"]
    }

# 💰 Cost optimization: tiebreaker only fires ~20-30% of the time
# Effective cost: 2 x ~$0.0004 (gpt-4o-mini) + ~0.25 x ~$0.006 (gpt-4o) ≈ $0.002-0.003/eval,
# instead of $0.012/eval for always using 2 expensive judges
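The tiebreaker above lets the expensive judge's score override the two cheap ones. An alternative, sketched here as an assumption rather than as what the lesson's code does, is to keep all three scores and take the median, so a single outlier (including the tiebreaker itself) cannot decide the result alone. resolve_disagreement is a hypothetical helper name.

```python
from statistics import median


def resolve_disagreement(cheap_scores: list[int], tiebreaker_score: int) -> dict:
    """Combine two cheap-judge scores with a tiebreaker score via the median."""
    all_scores = cheap_scores + [tiebreaker_score]
    return {
        "final_score": median(all_scores),
        "scores": all_scores,
        "spread": max(all_scores) - min(all_scores),  # large spread = worth a human look
    }


outcome = resolve_disagreement([2, 5], 4)
print(outcome["final_score"])  # 4
print(outcome["spread"])       # 3
```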
Cost-Efficient Evaluation at Scale
Running LLM-as-judge on every response is expensive. Here are strategies to keep costs manageable.
Strategy 1: Sampling
Don't evaluate everything. Evaluate a random sample and extrapolate:
# script_id: day_059_llm_as_judge_part2/sampled_evaluation
import random

def sampled_evaluation(
    test_cases: list[dict],
    eval_fn,
    sample_rate: float = 0.1,  # Evaluate 10% of cases
    min_samples: int = 30      # Below ~30 samples the estimate gets noisy;
                               # 30 is the common rule-of-thumb floor for a usable average
) -> dict:
    """Evaluate a sample and estimate population quality."""
    n_samples = max(min_samples, int(len(test_cases) * sample_rate))
    sample = random.sample(test_cases, min(n_samples, len(test_cases)))
    scores = [eval_fn(tc["question"], tc["response"])["score"] for tc in sample]

    n = len(scores)
    mean = sum(scores) / n
    # Sample standard deviation (n - 1 denominator), since we are estimating
    # the spread of the full population from a sample.
    std = (sum((s - mean) ** 2 for s in scores) / (n - 1)) ** 0.5 if n > 1 else 0.0
    margin = 1.96 * std / n ** 0.5
    return {
        "estimated_mean": round(mean, 2),
        "std_dev": round(std, 2),
        # 95% confidence interval = the range your *true* average is ~95% likely to
        # fall in. 1.96 is the standard 95% multiplier; std / sqrt(n) is the standard
        # error. A wide range means you sampled too few cases to trust the estimate.
        "confidence_interval": (round(mean - margin, 2), round(mean + margin, 2)),
        "samples_evaluated": n,
        "total_cases": len(test_cases)
    }
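The confidence-interval formula can also be turned around to plan a sample: pick the margin of error you can tolerate and solve for n. This is a minimal sketch using the same normal approximation as above; required_sample_size is a hypothetical helper name, and the standard deviation has to come from a pilot run or an educated guess.

```python
import math


def required_sample_size(std_dev: float, margin: float, z: float = 1.96) -> int:
    """Smallest n such that z * std_dev / sqrt(n) <= margin."""
    if std_dev < 0 or margin <= 0:
        raise ValueError("std_dev must be >= 0 and margin must be > 0")
    return math.ceil((z * std_dev / margin) ** 2)


# Scores with a standard deviation of about 1 point on the 1-5 scale:
print(required_sample_size(std_dev=1.0, margin=0.2))  # 97
print(required_sample_size(std_dev=1.0, margin=0.1))  # 385
```

Halving the margin roughly quadruples the required sample, which is why very tight intervals get expensive quickly.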
Strategy 2: Tiered Evaluation
Use cheap models for screening, expensive models for borderline cases:
# script_id: day_059_llm_as_judge_part2/judge_techniques
def tiered_evaluation(question: str, response: str) -> dict:
    """Cheap screening → expensive evaluation only for borderline cases."""
    # Tier 1: Fast, cheap screening with gpt-4o-mini
    screen = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Quick quality check.
Question: {question}
Response: {response}
Return JSON: {{"score": 1-5, "confidence": "high" or "low"}}"""}],
        response_format={"type": "json_object"},
        temperature=0
    )
    screening = json.loads(screen.choices[0].message.content)

    # Clear pass (4-5) or clear fail (1-2) with high confidence → done
    if screening.get("confidence") == "high" and screening["score"] != 3:
        return {"score": screening["score"], "tier": "screening", "model": "gpt-4o-mini"}

    # Tier 2: Borderline cases get full evaluation with gpt-4o
    full = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Carefully evaluate this response.
Question: {question}
Response: {response}
Return JSON: {{"score": 1-5, "reasoning": "..."}}"""}],
        response_format={"type": "json_object"},
        temperature=0
    )
    result = json.loads(full.choices[0].message.content)
    return {"score": result["score"], "tier": "full", "model": "gpt-4o",
            "reasoning": result["reasoning"]}

# 💰 Cost: ~60-70% of cases resolved at Tier 1 (gpt-4o-mini: ~$0.0004/eval)
# Only 30-40% escalated to Tier 2 (gpt-4o: ~$0.006/eval), on top of the Tier 1 call
# Blended cost: ~$0.0004 + 0.3-0.4 x $0.006 ≈ $0.002-0.003/eval
# vs $0.006/eval for always using gpt-4o (roughly 2-3x cheaper)
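The blended figure follows from one line of expected-value arithmetic: every case pays for the cheap screen, and only the escalated fraction also pays for the expensive judge. The sketch below uses the per-eval estimates quoted in the cost note; blended_cost is a hypothetical helper name, and the escalation rate is something to measure on your own data rather than assume.

```python
def blended_cost(first_pass_cost: float, escalation_cost: float,
                 escalation_rate: float) -> float:
    """Expected cost per eval when a fraction of cases escalates to a second tier."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return first_pass_cost + escalation_rate * escalation_cost


for rate in (0.3, 0.4):
    cost = blended_cost(first_pass_cost=0.0004, escalation_cost=0.006,
                        escalation_rate=rate)
    print(f"escalation {rate:.0%}: ${cost:.4f}/eval")
# escalation 30%: $0.0022/eval
# escalation 40%: $0.0028/eval
```

The same function covers the cheap-judges-plus-tiebreaker pattern if first_pass_cost is set to the combined cost of the two cheap judges.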
Checkpoint
Run the cohens_kappa(...) example — it's pure Python with no API call, so it's fully deterministic. With the sample human_scores/llm_scores (7 of 10 ratings match) you should see a printed kappa of about 0.615 — "substantial" agreement — not 0.7. If you got exactly 0.7 you returned raw agreement instead of chance-corrected agreement; the whole point of kappa is subtracting p_expected, so double-check that term is in your formula.
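For reference, the checkpoint value can be derived by hand from the sample lists. The human scores use the values 1 through 5 with counts 1, 2, 2, 2, 3, and the LLM scores use them with counts 1, 1, 3, 2, 3; the notation below mirrors the p_observed and p_expected names used in the code.

```latex
\[
p_{\text{observed}} = \frac{7}{10} = 0.70
\]
\[
p_{\text{expected}} = \frac{1\cdot 1 + 2\cdot 1 + 2\cdot 3 + 2\cdot 2 + 3\cdot 3}{10^{2}}
                    = \frac{22}{100} = 0.22
\]
\[
\kappa = \frac{p_{\text{observed}} - p_{\text{expected}}}{1 - p_{\text{expected}}}
       = \frac{0.70 - 0.22}{1 - 0.22} = \frac{0.48}{0.78} \approx 0.615
\]
```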
Summary
| Technique | What It Solves | When to Use |
|---|---|---|
| Position swapping | Position bias in pairwise comparison | Always for A/B comparisons |
| Calibration anchors | Inconsistent scoring across different questions | When score reliability matters |
| Cohen's Kappa | Unknown judge reliability | Before trusting automated eval in production |
| Multi-judge consensus | Single-judge bias/noise | High-stakes evaluations (content publishing, safety) |
| Tiebreaker pattern | Cost of always using multiple judges | Balance cost vs. reliability |
| Sampled evaluation | Evaluating everything is too expensive | Large test sets (1000+) |
| Tiered evaluation | Expensive models for every eval | Production evaluation at scale |
Quick Reference
| Concern | Technique | One-liner |
|---|---|---|
| Position bias | Swap order, keep labels | Unbiased judge flips its winning label |
| Verbosity bias | Score one dimension only | "Rate ACCURACY ONLY, ignore length" |
| Self-enhancement | Use a different judge model | Don't let GPT-4o grade GPT-4o |
| Score consistency | Calibration anchors | Show excellent/mediocre/poor exemplars |
| Judge reliability | cohens_kappa(human, llm) | > 0.6 = trust it; < 0.2 = rethink |
| Single-judge noise | multi_judge_evaluate(...) | Average across models, check variance |
| Cost at scale | Tiered / sampled eval | Cheap screen first, escalate only the gray zone |
Tips:
- Reach for the cheap-judges-plus-tiebreaker pattern before always running an expensive judge — the tiebreaker usually fires only 20-30% of the time.
- Kappa under 0.4 means the prompt or anchors need work; don't ship an automated gate on a judge you haven't measured against human labels.
Exercises
- Wrap the position-bias check into a position_bias_rate(pairs) helper that runs demonstrate_position_bias over a list of pairs and returns the fraction that were position-biased.
- Add a fourth calibration anchor at score 3 ("fair") to CALIBRATION_ANCHORS and confirm calibrated_judge can map a borderline response to it via closest_anchor.
- Modify multi_judge_evaluate to also return min_score and max_score, then flag any evaluation where the spread (max - min) is 2 or more as needs_review.
- Using cohens_kappa, write a tiny experiment: score 10 responses with gpt-4o-mini twice (temperature 0) and compute self-agreement. Is it close to 1.0? Explain any gap. (Note: even at temperature 0, LLM outputs aren't perfectly repeatable, so don't expect exactly 1.0.)
Solutions (approaches)
- Loop over (question, response_a, response_b) tuples and average the position_biased flag:

      def position_bias_rate(pairs):
          results = [demonstrate_position_bias(q, a, b) for q, a, b in pairs]
          return sum(r["position_biased"] for r in results) / len(results)

- Add "fair": {"question": "...", "response": "...", "score": 3, "reasoning": "..."} to CALIBRATION_ANCHORS; the loop that builds anchor_text already iterates the dict, so the new anchor is picked up automatically. Also add "fair" to the closest_anchor options in the prompt's Return JSON line, which currently hardcodes excellent/mediocre/poor.
- After collecting all_scores, add three keys to the returned dict: "min_score": min(all_scores), "max_score": max(all_scores), and "needs_review": (max(all_scores) - min(all_scores)) >= 2.
- Collect two score lists from the same model/prompt, then call cohens_kappa(run1, run2). Gaps come from sampling nondeterminism even at temperature 0; near-1.0 means the judge is internally stable.
What's Next
You now have reliable, calibrated, cost-efficient evaluation. Next up: RAGAS — a specialized framework for evaluating RAG systems with metrics like faithfulness, answer relevancy, and context precision.