Phase 5 · Evaluation and Security · 8 min read

LLM Guardrails

Phase 5 of 8

Guardrails keep AI systems safe, on-topic, and predictable. They validate inputs before they reach the LLM and sanitize outputs before they reach the user.

Coming from Software Engineering? Guardrails are middleware — just like Express middleware validates requests before they hit your route handler, guardrails validate inputs and outputs before they reach the LLM or the user. If you've written request validation middleware (checking auth tokens, sanitizing input, rate limiting), you've built a simpler version of this. Guardrails AI adds a declarative validator layer on top, similar to how WAF rules or API gateway policies work.
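To make the middleware analogy concrete, here is a framework-free sketch. Every name in it (including the `fake_llm` placeholder) is a hypothetical stand-in, not a real API: an input check runs before the "route handler" (the LLM call), and an output check runs after.

```python
import re

def input_guard(text: str) -> str:
    """Pre-handler middleware: reject inputs containing blocked words."""
    blocked_words = {"password", "ssn"}
    if any(word in text.lower() for word in blocked_words):
        raise ValueError("Input rejected by guardrail")
    return text

def output_guard(text: str) -> str:
    """Post-handler middleware: redact email addresses before they reach the user."""
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return f"Echo: {prompt} (contact: support@example.com)"

def handle(user_input: str) -> str:
    clean = input_guard(user_input)   # validate before the handler
    raw = fake_llm(clean)             # the "route handler"
    return output_guard(raw)          # sanitize after the handler

print(handle("How do I sort a list?"))
# → Echo: How do I sort a list? (contact: [EMAIL])
```

Frameworks like Guardrails AI formalize exactly this before/after pattern with composable, reusable validators.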


What are Guardrails?

Guardrails provide:

  • Topic control: Keep conversations on-topic
  • Safety filters: Block harmful or toxic content
  • PII protection: Detect and redact personal information
  • Format validation: Ensure structured outputs
  • Action control: Limit what agents can do
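As a taste of what a validator does under the hood, here is a simplified, regex-only sketch of the PII-protection bullet. Real detectors use ML models and cover far more entity types; the patterns below are illustrative assumptions, not production rules.

```python
import re

# Illustrative regex patterns -- a stand-in for ML-backed PII detectors.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched entity with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Call me at 555-123-4567 or email jo@example.com"))
# → Call me at [PHONE] or email [EMAIL]
```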

Guardrails AI

Guardrails AI is a widely used open-source framework for adding input/output validation to LLM applications. It provides a hub of pre-built validators and integrates directly into your LLM calls.

Installation

pip install guardrails-ai
guardrails hub install hub://guardrails/toxic_language
guardrails hub install hub://guardrails/detect_pii
guardrails hub install hub://guardrails/restrict_to_topic

Core Concept: Guards and Validators

A Guard wraps your LLM call and runs validators on input and output. Validators come from the Guardrails Hub — a registry of community and official validators.

Basic Input/Output Validation

# script_id: day_064_llm_guardrails/basic_input_output_validation
import openai
from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII, RestrictToTopic

# Input validation guard
input_guard = Guard().use_many(
    ToxicLanguage(on_fail="exception"),
    DetectPII(["EMAIL", "PHONE"], on_fail="fix"),
)

# Use with LLM — the guard wraps the API call
user_input = "Summarize the key points of clean architecture."  # example input
result = input_guard(
    llm_api=openai.chat.completions.create,
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": user_input}],
)

print(result.validated_output)

on_fail strategies:

| Strategy | Behavior |
| --- | --- |
| exception | Raise an error and block the request |
| fix | Attempt to fix the issue automatically |
| reask | Ask the LLM to regenerate its response |
| noop | Log the failure but allow the value through |
| filter | Remove the failing content |
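The strategies can be summarized with a toy dispatcher. This is not Guardrails AI's internal implementation, just a sketch of what each strategy means for a value flowing through a guard:

```python
def apply_on_fail(value, failed: bool, strategy: str, fix=lambda v: v):
    """Sketch of on_fail semantics: decide what happens to a failing value."""
    if not failed:
        return value
    if strategy == "exception":
        raise ValueError("Validation failed")
    if strategy == "fix":
        return fix(value)   # return an auto-corrected value
    if strategy == "reask":
        return None         # signal the caller to re-prompt the LLM
    if strategy == "noop":
        return value        # log-and-allow: pass through unchanged
    if strategy == "filter":
        return ""           # drop the failing content entirely
    raise ValueError(f"Unknown strategy: {strategy}")

print(apply_on_fail("toxic text", True, "noop"))    # → toxic text
print(apply_on_fail("toxic text", True, "filter"))  # → (empty string)
```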

Structured Output Validation

Guardrails AI works with Pydantic models to enforce structured outputs:

# script_id: day_064_llm_guardrails/structured_output_validation
import openai
from guardrails import Guard
from pydantic import BaseModel, Field
from typing import List

class ProductReview(BaseModel):
    product_name: str = Field(description="Name of the product")
    rating: float = Field(ge=1, le=5, description="Rating from 1 to 5")
    pros: List[str] = Field(min_length=1, description="List of pros")
    cons: List[str] = Field(description="List of cons")
    summary: str = Field(max_length=200, description="Brief summary")

guard = Guard.from_pydantic(ProductReview)

result = guard(
    llm_api=openai.chat.completions.create,
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Review the Sony WH-1000XM5 headphones."
    }],
)

# result.validated_output is a validated dict matching ProductReview
print(result.validated_output["rating"])
print(result.validated_output["pros"])

Topic Restriction

# script_id: day_064_llm_guardrails/topic_restriction
from guardrails import Guard
from guardrails.hub import RestrictToTopic

guard = Guard().use(
    RestrictToTopic(
        valid_topics=["programming", "technology", "software engineering"],
        invalid_topics=["politics", "religion", "medical advice"],
        on_fail="exception",
    )
)

# This passes
result = guard.validate("How do I implement a binary search tree?")

# This raises an exception
result = guard.validate("What's your opinion on the election?")

Custom Validators

# script_id: day_064_llm_guardrails/custom_validator
from typing import Any, Dict, List

from guardrails.validators import (
    FailResult,
    PassResult,
    ValidationResult,
    Validator,
    register_validator,
)

@register_validator(name="no-competitor-mention", data_type="string")
class NoCompetitorMention(Validator):
    """Block or redact mentions of competitor names."""

    def __init__(self, competitors: List[str], on_fail: str = "fix"):
        super().__init__(on_fail=on_fail)
        self.competitors = [c.lower() for c in competitors]

    def validate(self, value: Any, metadata: Dict) -> ValidationResult:
        lower_value = value.lower()
        mentioned = [c for c in self.competitors if c in lower_value]
        if not mentioned:
            return PassResult()

        # fix_value is applied when the validator runs with on_fail="fix"
        fixed = value
        for competitor in mentioned:
            fixed = fixed.replace(competitor, "[competitor]")
            fixed = fixed.replace(competitor.title(), "[Competitor]")
        return FailResult(
            error_message=f"Competitor(s) mentioned: {', '.join(mentioned)}",
            fix_value=fixed,
        )

# Usage
guard = Guard().use(
    NoCompetitorMention(
        competitors=["microsoft", "google", "amazon"],
        on_fail="fix"
    )
)

result = guard.validate("Our product is better than Microsoft's solution")
print(result.validated_output)
# "Our product is better than [Competitor]'s solution"

Built-in Model Safety APIs

Before reaching for a framework, consider the safety features already built into major LLM providers. These are lightweight, require no extra dependencies, and handle common moderation needs.

OpenAI Moderation API

OpenAI provides a free moderation endpoint that classifies text across safety categories:

# script_id: day_064_llm_guardrails/openai_moderation
from openai import OpenAI

client = OpenAI()

def check_moderation(text: str) -> dict:
    """Check text against OpenAI's moderation categories."""
    response = client.moderations.create(input=text)
    result = response.results[0]

    if result.flagged:
        # Identify which categories were triggered
        triggered = [
            category for category, flagged
            in result.categories.model_dump().items()
            if flagged
        ]
        return {"safe": False, "categories": triggered}

    return {"safe": True, "categories": []}

# Use as a pre-check before LLM calls
user_input = "How do I write a Python function?"
moderation = check_moderation(user_input)

if not moderation["safe"]:
    print(f"Blocked: {moderation['categories']}")
else:
    # Proceed with LLM call
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}]
    )

Categories detected include hate, harassment, self-harm, sexual, and violence, plus finer-grained sub-categories such as harassment/threatening, sexual/minors, and violence/graphic.

Claude System Prompt Guardrails

Anthropic's Claude has strong built-in safety, and you can reinforce it with system prompt instructions:

# script_id: day_064_llm_guardrails/claude_system_guardrails
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="""You are a helpful programming assistant.

GUARDRAILS:
- Only answer questions about programming, software engineering, and technology.
- If asked about topics outside your scope, politely decline.
- Never include personal information (emails, phone numbers, addresses) in responses.
- Do not generate code that could be used for hacking or exploitation.
- Keep responses concise and under 500 words.""",
    messages=[{"role": "user", "content": user_input}]
)

When to Use What

| Approach | Best For | Tradeoffs |
| --- | --- | --- |
| System prompt rules | Simple topic/behavior constraints | No enforcement guarantee; the LLM can ignore them |
| OpenAI Moderation API | Content safety screening | Limited to safety categories; no custom rules |
| Guardrails AI | Structured validation, PII, custom rules | Extra dependency; adds latency per validator |

Combining Guardrails with Agents

# script_id: day_064_llm_guardrails/guarded_agent
import openai
from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII

class GuardedAgent:
    """Agent with layered guardrails on input and output."""

    def __init__(self):
        self.client = openai.OpenAI()

        # Input: block toxic content, redact PII
        self.input_guard = Guard().use_many(
            ToxicLanguage(on_fail="exception"),
            DetectPII(["EMAIL", "PHONE", "SSN"], on_fail="fix"),
        )

        # Output: clean PII leaks, enforce length
        self.output_guard = Guard().use_many(
            DetectPII(["EMAIL", "PHONE", "SSN"], on_fail="fix"),
            ToxicLanguage(on_fail="fix"),
        )

    def generate(self, user_input: str) -> str:
        """Generate a guarded response."""

        # Validate input
        try:
            input_result = self.input_guard.validate(user_input)
            clean_input = input_result.validated_output
        except Exception as e:
            return f"I can't process that request: {e}"

        # Call LLM
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": clean_input}]
        )
        raw_output = response.choices[0].message.content

        # Validate output
        output_result = self.output_guard.validate(raw_output)
        return output_result.validated_output

# Usage
agent = GuardedAgent()
print(agent.generate("How do I write a Python function?"))

Historical Context: NeMo Guardrails

NVIDIA developed NeMo Guardrails as an early framework for adding programmable guardrails to LLM applications. It used a custom language called Colang to define conversational flows and rules declaratively -- you would write dialog patterns specifying how the bot should respond to certain user intents, and the framework would enforce those flows at runtime.

NeMo Guardrails was notable for its dialog-flow approach to safety: rather than validating individual strings, it modeled entire conversation trajectories. This made it powerful for complex multi-turn guardrail scenarios. However, the Colang language had a steep learning curve and the framework required significant configuration overhead compared to simpler validation-based approaches.

As of 2025, NeMo Guardrails and Colang are in maintenance mode and are not recommended for new projects. The ecosystem has moved toward validator-based frameworks like Guardrails AI, which offer a more composable and Pythonic approach. If you encounter NeMo Guardrails in existing codebases, consider migrating to Guardrails AI or built-in provider safety APIs.


Summary

Key takeaways:

  • Guardrails AI is the go-to framework -- use Hub validators for common checks (toxicity, PII, topic) and Pydantic models for structured output
  • Built-in model safety (OpenAI Moderation API, Claude system prompts) covers basic content moderation with zero extra dependencies
  • Layer your guardrails: system prompt constraints + moderation API + framework validators for defense in depth
  • Always validate both input (what the user sends) and output (what the LLM returns)

Quick Reference

# script_id: day_064_llm_guardrails/quick_reference
# Guardrails AI — input/output validation
from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII

guard = Guard().use_many(ToxicLanguage(on_fail="exception"), DetectPII(["EMAIL", "PHONE"], on_fail="fix"))
result = guard(llm_api=openai.chat.completions.create, model="gpt-4o-mini", messages=[...])

# Guardrails AI — structured output
guard = Guard.from_pydantic(MyModel)
result = guard(llm_api=..., messages=[...])

# OpenAI Moderation API
response = client.moderations.create(input=text)
flagged = response.results[0].flagged

# Claude system prompt guardrails
response = client.messages.create(model="claude-sonnet-4-20250514", system="GUARDRAILS: ...", ...)

What's Next?

Now let's learn about Safe Sandboxing - containerizing agent code execution for security!