akashnotes — Structured Learning for Engineers

Yesterday you built basic LLM-as-judge scoring and pairwise comparison. Today we go deeper: the failure modes that make naive judges unreliable, calibration techniques that fix them, multi-judge consensus for high-stakes evaluation, and cost-efficient strategies that don't blow your budget.

Coming from Software Engineering? This is like building a reliable test suite — a single flaky test is worse than no test because it gives false confidence. LLM judges have systematic biases (like a linter that always flags a certain pattern even when it's fine). Today you learn to detect and compensate for those biases, use consensus across multiple judges (like requiring 2 of 3 reviewers to approve a PR), and optimize for cost the same way you'd optimize CI pipeline runtime.

The Problem with Naive LLM Judges

Your Day 58 judge works — until it doesn't. LLM judges have well-documented biases that produce unreliable scores if you don't account for them.

Position Bias

LLMs tend to prefer whichever response appears first (or last, depending on the model). This means pairwise comparisons can flip just by swapping the order.

# script_id: day_059_llm_as_judge_part2/judge_techniques
from openai import OpenAI
import json

client = OpenAI()

def demonstrate_position_bias(question: str, response_a: str, response_b: str) -> dict:
    """Show how position affects judge scoring."""

    prompt_template = """Compare these two responses to the question: "{question}"

Response {label_1}: {first}

Response {label_2}: {second}

Which response is better? Return JSON: {{"winner": "A" or "B", "reasoning": "..."}}"""

    # Order 1: A first
    result_1 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(
            question=question, label_1="A", label_2="B",
            first=response_a, second=response_b
        )}],
        response_format={"type": "json_object"},
        temperature=0
    )
    order_1 = json.loads(result_1.choices[0].message.content)

    # Order 2: B first (swap positions, keep labels)
    result_2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(
            question=question, label_1="A", label_2="B",
            first=response_b, second=response_a
        )}],
        response_format={"type": "json_object"},
        temperature=0
    )
    order_2 = json.loads(result_2.choices[0].message.content)

    # We keep the labels "A"/"B" but swap which response sits behind each. An
    # UNBIASED judge therefore reports a *different* winning label across the two
    # runs (it followed the response, not the slot). The label staying the same
    # means the judge favored a position -> position bias.
    unbiased = order_1["winner"] != order_2["winner"]  # True == winning label flipped

    return {
        "a_first_winner": order_1["winner"],
        "b_first_winner": order_2["winner"],
        "unbiased": unbiased,
        "position_biased": not unbiased
    }
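
One way to act on that check, sketched under assumptions: judge both orders and accept a winner only when both runs pick the same underlying response. `judge_fn` is a hypothetical callable (not part of the script above) that takes `(question, first, second)` and returns "A" for the first slot or "B" for the second; in practice it would wrap the API call. The two stub judges exist only so the sketch runs without an API key.

```python
from typing import Callable

# Hypothetical judge signature: (question, first, second) -> "A" (first slot) or "B" (second slot)
JudgeFn = Callable[[str, str, str], str]

def debiased_winner(judge_fn: JudgeFn, question: str, response_a: str, response_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    # Run 1: response_a sits in slot A, response_b in slot B.
    first_run = judge_fn(question, response_a, response_b)
    # Run 2: positions swapped, labels kept.
    second_run = judge_fn(question, response_b, response_a)

    # Map slot labels back to the underlying responses.
    pick_1 = "a" if first_run == "A" else "b"
    pick_2 = "b" if second_run == "A" else "a"

    # Same response picked in both orders -> trust it; otherwise call it a tie.
    return pick_1 if pick_1 == pick_2 else "tie"

# Stub judges for illustration only (no API calls).
def always_first(question: str, first: str, second: str) -> str:
    return "A"  # purely position-driven

def prefers_shorter(question: str, first: str, second: str) -> str:
    return "A" if len(first) <= len(second) else "B"  # content-driven

print(debiased_winner(always_first, "q", "short", "a much longer answer"))
print(debiased_winner(prefers_shorter, "q", "short", "a much longer answer"))
```

The position-driven stub ends in a tie because its pick changes with the order, while the content-driven stub picks the same response both times. This doubles the judge calls per comparison.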

Verbosity Bias

LLM judges tend to rate longer, more detailed responses higher — even when the shorter response is more accurate or more appropriate.

# script_id: day_059_llm_as_judge_part2/judge_techniques
def detect_verbosity_bias(question: str, concise: str, verbose: str) -> dict:
    """Check if the judge prefers verbose responses regardless of quality."""

    prompt = f"""Rate both responses on a 1-5 scale for ACCURACY ONLY.
Ignore length, style, and detail. Focus purely on factual correctness.

Question: {question}

Response A: {concise}

Response B: {verbose}

Return JSON: {{"a_score": 1-5, "b_score": 1-5, "reasoning": "..."}}"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )
    scores = json.loads(result.choices[0].message.content)

    # If verbose always scores higher despite equal accuracy, that's bias
    return scores

# Example: Both are equally correct, but one is verbose
result = detect_verbosity_bias(
    question="What is the capital of France?",
    concise="Paris.",
    verbose="The capital of France is Paris, a city known as the City of Light, "
            "located in the north-central part of the country along the Seine River. "
            "Paris has been the capital since the late 10th century."
)
# A biased judge will score the verbose response higher despite equal accuracy
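
A single pair proves little; the bias shows up as a pattern across many equally accurate pairs. The sketch below assumes you have collected the dicts returned by `detect_verbosity_bias` for such pairs. `verbosity_bias_summary` is a hypothetical helper, and the sample scores are made up for illustration.

```python
def verbosity_bias_summary(results: list[dict]) -> dict:
    """Summarize how often the verbose response (b_score) outscored the concise one (a_score)."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    verbose_wins = sum(1 for r in results if r["b_score"] > r["a_score"])
    concise_wins = sum(1 for r in results if r["a_score"] > r["b_score"])
    # Positive mean gap = verbose responses score higher on average.
    mean_gap = sum(r["b_score"] - r["a_score"] for r in results) / n
    return {
        "verbose_win_rate": verbose_wins / n,
        "concise_win_rate": concise_wins / n,
        "mean_gap": mean_gap,
    }

# Illustrative scores only (not real judge output)
sample = [
    {"a_score": 5, "b_score": 5},
    {"a_score": 4, "b_score": 5},
    {"a_score": 4, "b_score": 5},
    {"a_score": 5, "b_score": 4},
]
summary = verbosity_bias_summary(sample)
print(summary)
```

When the pairs really are equally accurate, a verbose win rate well above the concise win rate suggests the judge is rewarding length.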

Self-Enhancement Bias

Models tend to rate their own outputs higher than outputs from other models. If you use GPT-4o to judge GPT-4o vs. Claude outputs, expect bias toward GPT-4o.

Mitigation: Use a different model as judge than the one that generated the responses, or use multi-judge consensus.
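
The mitigation can be sketched as a simple filter over candidate judges. Everything here is an assumption for illustration: the `MODEL_FAMILY` table, the `pick_judges` helper, the placeholder name "claude-judge" (not a real model ID), and the idea of treating a whole provider family as sharing the bias.

```python
# Hypothetical mapping from model name to provider family; extend for your own model list.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-judge": "anthropic",  # placeholder name, not a real model ID
}

def pick_judges(generator_models: list[str], candidates: list[str]) -> list[str]:
    """Return candidate judges whose family produced none of the responses being compared."""
    generator_families = {MODEL_FAMILY[m] for m in generator_models}
    return [c for c in candidates if MODEL_FAMILY[c] not in generator_families]

print(pick_judges(["gpt-4o"], ["gpt-4o-mini", "claude-judge"]))
print(pick_judges(["gpt-4o", "claude-judge"], ["gpt-4o-mini", "claude-judge"]))
```

The second call returns an empty list: every candidate shares a family with one of the generators, which is the case where multi-judge consensus is the fallback.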

Calibration: Making Judges Reliable

Calibration means ensuring your judge's scores are meaningful and consistent. Without calibration, a "4 out of 5" from your judge might mean different things for different types of questions.

Anchor-Based Calibration

Provide the judge with reference examples at known quality levels:

# script_id: day_059_llm_as_judge_part2/judge_techniques
CALIBRATION_ANCHORS = {
    "excellent": {
        "question": "Explain recursion in programming.",
        "response": "Recursion is when a function calls itself to solve a smaller "
                    "version of the same problem. Every recursive function needs a "
                    "base case (when to stop) and a recursive case (how to break the "
                    "problem down). Example: calculating factorial — factorial(5) = "
                    "5 * factorial(4), and factorial(1) = 1 is the base case.",
        "score": 5,
        "reasoning": "Accurate, clear, includes example, mentions base case."
    },
    "mediocre": {
        "question": "Explain recursion in programming.",
        "response": "Recursion is when a function calls itself. It's used in "
                    "programming for various tasks.",
        "score": 2,
        "reasoning": "Technically correct but lacks depth, no example, no base case."
    },
    "poor": {
        "question": "Explain recursion in programming.",
        "response": "Recursion is a loop that repeats until a condition is met.",
        "score": 1,
        "reasoning": "Confuses recursion with iteration. Factually incorrect."
    }
}

def calibrated_judge(question: str, response: str) -> dict:
    """Judge with calibration anchors for consistent scoring."""

    anchor_text = ""
    for level, anchor in CALIBRATION_ANCHORS.items():
        anchor_text += f"\n--- {level.upper()} example (score: {anchor['score']}) ---\n"
        anchor_text += f"Q: {anchor['question']}\n"
        anchor_text += f"A: {anchor['response']}\n"
        anchor_text += f"Why this score: {anchor['reasoning']}\n"

    prompt = f"""You are an expert evaluator. Score the following response on a 1-5 scale.

Use these calibration examples to anchor your scoring:
{anchor_text}

Now evaluate:
Question: {question}
Response: {response}

Return JSON: {{"score": 1-5, "reasoning": "...", "closest_anchor": "excellent/mediocre/poor"}}"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
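
Before relying on the anchors, it can help to check that the judge reproduces the anchor scores when the anchor responses themselves are submitted. A minimal sketch, assuming a hypothetical `judge_fn(question, response)` that returns a dict with a `score` key (for instance `calibrated_judge`). The stub judge and the two-entry anchor dict are illustrative only.

```python
from typing import Callable

def anchor_drift(judge_fn: Callable[[str, str], dict], anchors: dict) -> dict:
    """Re-score each anchor response and report how far the judge strays from the expected score."""
    per_anchor = {}
    for level, anchor in anchors.items():
        got = judge_fn(anchor["question"], anchor["response"])["score"]
        per_anchor[level] = got - anchor["score"]  # signed error: positive = judge too generous
    mean_abs_error = sum(abs(e) for e in per_anchor.values()) / len(per_anchor)
    return {"per_anchor": per_anchor, "mean_abs_error": mean_abs_error}

# Illustrative data and stub judge (no API call)
demo_anchors = {
    "excellent": {"question": "q", "response": "long detailed answer", "score": 5},
    "poor": {"question": "q", "response": "wrong", "score": 1},
}

def stub_judge(question: str, response: str) -> dict:
    # Pretend judge: 5 for answers longer than 10 characters, 2 otherwise.
    return {"score": 5 if len(response) > 10 else 2}

drift = anchor_drift(stub_judge, demo_anchors)
print(drift)
```

Because the anchors appear in the judge's own prompt, any nonzero drift here is a strong warning sign that the prompt is not steering the scores as intended.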

Measuring Judge Reliability: Cohen's Kappa

Compare your LLM judge against human evaluators (or against itself across runs) to measure agreement.

If two PR reviewers both approve 90% of PRs out of habit, they'll agree about 82% of the time purely by luck (0.9 × 0.9 both approve, plus 0.1 × 0.1 both reject), so raw agreement overstates how much they actually think alike. Cohen's Kappa subtracts that luck: it measures how much they agree beyond random chance. p_observed = how often they actually matched; p_expected = how often they'd match by coincidence given how each rates overall; kappa = (p_observed - p_expected) / (1 - p_expected), which rescales the leftover agreement so that 1 means perfect agreement and 0 means no better than chance.
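
The reviewer example can be worked through in a few lines. This is a sketch for the two-category case (approve / reject) only; `chance_agreement` is a hypothetical helper and the observed agreement of 0.85 is a made-up figure.

```python
def chance_agreement(approve_rate_1: float, approve_rate_2: float) -> float:
    """Expected agreement for two independent binary raters (approve / reject)."""
    both_approve = approve_rate_1 * approve_rate_2
    both_reject = (1 - approve_rate_1) * (1 - approve_rate_2)
    return both_approve + both_reject

p_expected = chance_agreement(0.9, 0.9)  # 0.81 (both approve) + 0.01 (both reject)
p_observed = 0.85                        # hypothetical measured agreement
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"p_expected={p_expected:.2f}, kappa={kappa:.2f}")
```

An 85% raw agreement looks strong, yet once the 82% chance agreement is removed the kappa comes out near 0.17, which is only slight agreement.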

# script_id: day_059_llm_as_judge_part2/cohens_kappa
def cohens_kappa(judge_1_scores: list[int], judge_2_scores: list[int]) -> float:
    """Calculate Cohen's Kappa for inter-rater agreement.

    Interpretation bands follow Landis & Koch (1977).

    Returns:
        -1 to 1: <0 = worse than random, 0 = random, 0.00-0.20 = slight,
        0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial,
        >0.80 = almost perfect
    """
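    # Validate explicitly: asserts can be stripped with -O, and an empty list
    # would otherwise cause a ZeroDivisionError below.
    if len(judge_1_scores) != len(judge_2_scores) or not judge_1_scores:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(judge_1_scores)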
    categories = sorted(set(judge_1_scores + judge_2_scores))

    # Observed agreement
    agreements = sum(1 for a, b in zip(judge_1_scores, judge_2_scores) if a == b)
    p_observed = agreements / n

    # Expected agreement by chance
    p_expected = sum(
        (judge_1_scores.count(c) / n) * (judge_2_scores.count(c) / n)
        for c in categories
    )

    if p_expected == 1:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Usage: compare LLM judge vs human labels
human_scores = [5, 3, 4, 2, 5, 1, 4, 3, 5, 2]
llm_scores =   [5, 4, 4, 2, 5, 1, 3, 3, 5, 3]

kappa = cohens_kappa(human_scores, llm_scores)
print(f"Cohen's Kappa: {kappa:.3f}")
# > 0.60 = substantial agreement → judge is reliable enough for automated use
# 0.41-0.60 = moderate agreement → borderline; spot-check against human labels before relying on it
# 0.21-0.40 = fair agreement → judge needs better calibration or different prompt
# <= 0.20 = slight/poor agreement → fundamentally rethink your evaluation approach
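
To avoid eyeballing the bands each time, the Landis & Koch labels from the docstring can be turned into a small lookup. `interpret_kappa` is a hypothetical helper name, and the sample values are arbitrary.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) label."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

for value in (0.15, 0.35, 0.55, 0.615, 0.9):
    print(value, interpret_kappa(value))
```

These labels are conventions rather than hard thresholds; a kappa just above 0.60 is not meaningfully different from one just below it.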

Multi-Judge Consensus

For high-stakes evaluations, use multiple judges and aggregate their scores. This reduces the impact of any single judge's bias.

# script_id: day_059_llm_as_judge_part2/judge_techniques

def multi_judge_evaluate(
    question: str,
    response: str,
    models: tuple[str, ...] = ("gpt-4o", "gpt-4o-mini"),
    threshold: float = 0.7
) -> dict:
    """Evaluate using multiple LLM judges with consensus scoring."""

    all_scores = []
    all_reasoning = []

    for model in models:
        result = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"""Rate this response 1-5.
Question: {question}
Response: {response}
Return JSON: {{"score": 1-5, "reasoning": "..."}}"""}],
            response_format={"type": "json_object"},
            temperature=0
        )
        parsed = json.loads(result.choices[0].message.content)
        all_scores.append(parsed["score"])
        all_reasoning.append({"model": model, **parsed})

    avg_score = sum(all_scores) / len(all_scores)
    # variance = how spread-out the judges' scores are (0 = identical scores).
    # On a 1-5 scale, < 0.5 means scores sit within ~1 point of each other —
    # treat that as the judges agreeing.
    score_variance = sum((s - avg_score) ** 2 for s in all_scores) / len(all_scores)

    return {
        "average_score": round(avg_score, 2),
        "scores": all_scores,
        "variance": round(score_variance, 2),
        "high_agreement": score_variance < 0.5,
        "passed": avg_score >= (threshold * 5),
        "details": all_reasoning
    }

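# 💰 Cost note: 2 judges of the same model = 2x cost per evaluation.
# At GPT-4o ($2.50/1M in + $10/1M out), assuming ~500 input and ~500 output tokens per eval:
# Single judge: ~$0.006/eval → $6/1000 evals
# Dual judge (two GPT-4o calls): ~$0.012/eval → $12/1000 evals
# The default pair above (gpt-4o + gpt-4o-mini) costs less, since gpt-4o-mini is far cheaper.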
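
The arithmetic behind the cost note can be reproduced with a small helper. The prices are the ones quoted in the note; the split of 500 input and 500 output tokens is an assumption, chosen because it reproduces the ~$0.006 figure. `eval_cost` is a hypothetical name.

```python
def eval_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one judge call given per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# Prices from the cost note; token counts are an assumption.
single = eval_cost(500, 500, price_in_per_m=2.50, price_out_per_m=10.00)
print(f"single judge: ${single:.5f}/eval, ${single * 1000:.2f}/1000 evals")
print(f"dual judge:   ${2 * single:.5f}/eval, ${2 * single * 1000:.2f}/1000 evals")
```

Output tokens dominate the bill at these prices, so capping the length of the judge's reasoning is often the cheapest lever to pull.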

Handling Disagreement

When judges disagree significantly, you need a tiebreaker strategy:

# script_id: day_059_llm_as_judge_part2/judge_techniques
def evaluate_with_tiebreaker(question: str, response: str) -> dict:
    """Two cheap judges + expensive tiebreaker when they disagree."""

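    # Round 1: Two cheap judges.
    # Note: two calls to the same model with the same prompt at temperature 0 will
    # usually return the same score, so they rarely disagree. Varying the prompt
    # wording or using two different cheap models gives the check more signal.
    cheap_results = multi_judge_evaluate(
        question, response,
        models=("gpt-4o-mini", "gpt-4o-mini")
    )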

    if cheap_results["high_agreement"]:
        return {**cheap_results, "tiebreaker_used": False}

    # Round 2: Expensive judge as tiebreaker (only when needed)
    tiebreaker = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Two evaluators disagreed on this response.
Scores were: {cheap_results['scores']}

Question: {question}
Response: {response}

Provide the definitive score. Return JSON: {{"score": 1-5, "reasoning": "..."}}"""}],
        response_format={"type": "json_object"},
        temperature=0
    )
    final = json.loads(tiebreaker.choices[0].message.content)

    return {
        "average_score": final["score"],
        "scores": cheap_results["scores"] + [final["score"]],
        "tiebreaker_used": True,
        "tiebreaker_reasoning": final["reasoning"]
    }

# 💰 Cost optimization: tiebreaker only fires ~20-30% of the time
# Effective cost: ~$0.002-0.003/eval (2 × ~$0.0004 for the cheap judges, plus
# ~$0.006 × 0.2-0.3 for the tiebreaker) instead of $0.012/eval for always
# using 2 expensive judges

Cost-Efficient Evaluation at Scale

Running LLM-as-judge on every response is expensive. Here are strategies to keep costs manageable.

Strategy 1: Sampling

Don't evaluate everything. Evaluate a random sample and extrapolate:

# script_id: day_059_llm_as_judge_part2/sampled_evaluation
import random

def sampled_evaluation(
    test_cases: list[dict],
    eval_fn,
    sample_rate: float = 0.1,  # Evaluate 10% of cases
    min_samples: int = 30       # Below ~30 samples the estimate gets noisy;
                                # 30 is the common rule-of-thumb floor for a usable average
) -> dict:
    """Evaluate a sample and estimate population quality."""

    n_samples = max(min_samples, int(len(test_cases) * sample_rate))
    sample = random.sample(test_cases, min(n_samples, len(test_cases)))

    scores = [eval_fn(tc["question"], tc["response"])["score"] for tc in sample]

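    mean = sum(scores) / len(scores)
    # Sample standard deviation (n - 1 in the denominator): the scores are a
    # sample drawn from the population, not the whole population.
    std = (sum((s - mean) ** 2 for s in scores) / max(len(scores) - 1, 1)) ** 0.5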

    return {
        "estimated_mean": round(mean, 2),
        "std_dev": round(std, 2),
        # 95% confidence interval = the range your *true* average is ~95% likely to
        # fall in. 1.96 is the standard 95% multiplier; std / sqrt(n) is the standard
        # error. A wide range means you sampled too few cases to trust the estimate.
        "confidence_interval": (round(mean - 1.96 * std / len(scores)**0.5, 2),
                                 round(mean + 1.96 * std / len(scores)**0.5, 2)),
        "samples_evaluated": len(scores),
        "total_cases": len(test_cases)
    }
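
The confidence-interval formula can also be run backwards to choose a sample size: if the half-width is 1.96 * std / sqrt(n), then n = (1.96 * std / margin)^2 gives a half-width of at most `margin`. A minimal sketch, assuming the standard deviation is estimated from a small pilot run and that scores are roughly independent. `required_sample_size` is a hypothetical helper.

```python
import math

def required_sample_size(std_dev: float, margin: float, z: float = 1.96) -> int:
    """Samples needed so the confidence interval half-width is at most `margin`."""
    if std_dev < 0 or margin <= 0:
        raise ValueError("std_dev must be >= 0 and margin must be > 0")
    return max(1, math.ceil((z * std_dev / margin) ** 2))

# With a pilot std of 1.0 on the 1-5 scale:
print(required_sample_size(std_dev=1.0, margin=0.2))  # 97
print(required_sample_size(std_dev=1.0, margin=0.5))  # 16
```

Halving the margin roughly quadruples the required samples, which is why a fixed percentage like `sample_rate` can be either wasteful or too small depending on the test set size.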

Strategy 2: Tiered Evaluation

Use cheap models for screening, expensive models for borderline cases:

# script_id: day_059_llm_as_judge_part2/judge_techniques
def tiered_evaluation(question: str, response: str) -> dict:
    """Cheap screening → expensive evaluation only for borderline cases."""

    # Tier 1: Fast, cheap screening with gpt-4o-mini
    screen = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Quick quality check.
Question: {question}
Response: {response}
Return JSON: {{"score": 1-5, "confidence": "high" or "low"}}"""}],
        response_format={"type": "json_object"},
        temperature=0
    )
    screening = json.loads(screen.choices[0].message.content)

    # Clear pass (4-5) or clear fail (1-2) with high confidence → done
    if screening.get("confidence") == "high" and screening["score"] != 3:
        return {"score": screening["score"], "tier": "screening", "model": "gpt-4o-mini"}

    # Tier 2: Borderline cases get full evaluation with gpt-4o
    full = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Carefully evaluate this response.
Question: {question}
Response: {response}
Return JSON: {{"score": 1-5, "reasoning": "..."}}"""}],
        response_format={"type": "json_object"},
        temperature=0
    )
    result = json.loads(full.choices[0].message.content)

    return {"score": result["score"], "tier": "full", "model": "gpt-4o",
            "reasoning": result["reasoning"]}

# 💰 Cost: ~60-70% of cases resolved at Tier 1 (gpt-4o-mini: ~$0.0004/eval)
# Only 30-40% escalated to Tier 2 (gpt-4o: ~$0.006/eval), and those pay for both tiers
# Blended cost: ~$0.0022-0.0028/eval vs $0.006/eval for always using gpt-4o (roughly 2-3x cheaper)
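
The blended figure follows from one line of arithmetic: every case pays for the Tier 1 screen, and escalated cases pay for Tier 2 on top. A sketch using the per-eval costs quoted above; `blended_cost` is a hypothetical helper, and the escalation rate is something you would measure on your own data.

```python
def blended_cost(tier1_cost: float, tier2_cost: float, escalation_rate: float) -> float:
    """Expected cost per eval: every case pays Tier 1, escalated cases also pay Tier 2."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return tier1_cost + escalation_rate * tier2_cost

for rate in (0.30, 0.35, 0.40):
    cost = blended_cost(tier1_cost=0.0004, tier2_cost=0.006, escalation_rate=rate)
    print(f"escalation {rate:.0%}: ${cost:.4f}/eval "
          f"({0.006 / cost:.1f}x cheaper than always using gpt-4o)")
```

The savings shrink quickly as the escalation rate climbs, so it is worth logging how often the screen reports low confidence or a score of 3.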

Checkpoint

Run the cohens_kappa(...) example — it's pure Python with no API call, so it's fully deterministic. With the sample human_scores/llm_scores (7 of 10 ratings match) you should see a printed kappa of about 0.615 — "substantial" agreement — not 0.7. If you got exactly 0.7 you returned raw agreement instead of chance-corrected agreement; the whole point of kappa is subtracting p_expected, so double-check that term is in your formula.

Summary

Technique	What It Solves	When to Use
Position swapping	Position bias in pairwise comparison	Always for A/B comparisons
Calibration anchors	Inconsistent scoring across different questions	When score reliability matters
Cohen's Kappa	Unknown judge reliability	Before trusting automated eval in production
Multi-judge consensus	Single-judge bias/noise	High-stakes evaluations (content publishing, safety)
Tiebreaker pattern	Cost of always using multiple judges	Balance cost vs. reliability
Sampled evaluation	Evaluating everything is too expensive	Large test sets (1000+)
Tiered evaluation	Expensive models for every eval	Production evaluation at scale

Quick Reference

Concern	Technique	One-liner
Position bias	Swap order, keep labels	Unbiased judge flips its winning label
Verbosity bias	Score one dimension only	"Rate ACCURACY ONLY, ignore length"
Self-enhancement	Use a different judge model	Don't let GPT-4o grade GPT-4o
Score consistency	Calibration anchors	Show excellent/mediocre/poor exemplars
Judge reliability	`cohens_kappa(human, llm)`	> 0.6 = trust it; < 0.2 = rethink
Single-judge noise	`multi_judge_evaluate(...)`	Average across models, check variance
Cost at scale	Tiered / sampled eval	Cheap screen first, escalate only the gray zone

Tips:

Reach for the cheap-judges-plus-tiebreaker pattern before always running an expensive judge — the tiebreaker usually fires only 20-30% of the time.
Kappa under 0.4 means the prompt or anchors need work; don't ship an automated gate on a judge you haven't measured against human labels.

Exercises

Wrap the position-bias check into a position_bias_rate(pairs) helper that runs demonstrate_position_bias over a list of pairs and returns the fraction that were position-biased.
Add a fourth calibration anchor at score 3 ("fair") to CALIBRATION_ANCHORS and confirm calibrated_judge can map a borderline response to it via closest_anchor.
Modify multi_judge_evaluate to also return min_score and max_score, then flag any evaluation where the spread (max - min) is 2 or more as needs_review.
Using cohens_kappa, write a tiny experiment: score 10 responses with gpt-4o-mini twice (temperature 0) and compute self-agreement. Is it close to 1.0? Explain any gap. (Note: even at temperature 0, LLM outputs aren't perfectly repeatable — so don't expect exactly 1.0.)

Solutions (approaches)

def position_bias_rate(pairs):
    results = [demonstrate_position_bias(q, a, b) for q, a, b in pairs]
    return sum(r["position_biased"] for r in results) / len(results)

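Add "fair": {"question": "...", "response": "...", "score": 3, "reasoning": "..."}. The loop that builds anchor_text already iterates the dict, so the anchor text needs no other change. The closest_anchor options in the prompt are hard-coded as "excellent/mediocre/poor", though, so add "fair" there (or build that list from the CALIBRATION_ANCHORS keys); otherwise the judge has no reason to return it.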
After collecting all_scores, add three keys to the dict that multi_judge_evaluate returns: "min_score": min(all_scores), "max_score": max(all_scores), and "needs_review": (max(all_scores) - min(all_scores)) >= 2.
Collect two score lists from the same model/prompt, then cohens_kappa(run1, run2). Gaps come from sampling nondeterminism even at temperature 0; near-1.0 means the judge is internally stable.

What's Next

You now have reliable, calibrated, cost-efficient evaluation. Next up: RAGAS — a specialized framework for evaluating RAG systems with metrics like faithfulness, answer relevancy, and context precision.

LLM-as-Judge — Part 2: Advanced Evaluation Techniques