Evaluation & Safety

Guardrails

Programmatic constraints placed around AI model inputs and outputs to prevent harmful, off-topic, or policy-violating behavior.

Why it matters

Guardrails are how you ship AI to production without anxiety. They are the safety net between a capable-but-unpredictable model and real users.

Input vs. output guardrails

Input guardrails screen user messages before they reach the model — blocking prompt injection attempts, filtering PII, and enforcing topic boundaries. Output guardrails validate model responses before they reach the user — checking for toxicity, factual consistency, format compliance, and policy adherence.
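The split can be sketched as a simple wrapper around the model call. This is a minimal, hypothetical example — the function names (`check_input`, `check_output`, `run_with_guardrails`), the PII patterns, and the blocklist are illustrative, not from any specific framework:

```python
import re

# Input-side screens: reject messages containing PII-like patterns (illustrative).
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

# Output-side screen: a toy keyword blocklist (illustrative).
BLOCKED_OUTPUT_TERMS = {"internal-only", "confidential"}

def check_input(message: str) -> tuple[bool, str]:
    """Screen a user message before it reaches the model."""
    for pattern in PII_PATTERNS:
        if pattern.search(message):
            return False, "input rejected: possible PII detected"
    return True, "ok"

def check_output(response: str) -> tuple[bool, str]:
    """Validate a model response before it reaches the user."""
    lowered = response.lower()
    for term in BLOCKED_OUTPUT_TERMS:
        if term in lowered:
            return False, "output rejected: blocked term"
    return True, "ok"

def run_with_guardrails(message: str, model_call) -> str:
    """Wrap a model call with input and output checks."""
    ok, reason = check_input(message)
    if not ok:
        return f"[guardrail] {reason}"
    response = model_call(message)
    ok, reason = check_output(response)
    if not ok:
        return "[guardrail] response withheld by output check"
    return response

# Usage with a stand-in for the real model call:
print(run_with_guardrails("My SSN is 123-45-6789", lambda m: "hello"))
# → [guardrail] input rejected: possible PII detected
```

In production the regexes would be replaced or supplemented by the classifier- or LLM-based checks described below, but the wrapper shape stays the same.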

Implementation patterns

  • Rule-based — regex, keyword blocklists, format validators. Fast and predictable.
  • Classifier-based — lightweight ML models that detect toxicity, PII, or off-topic content.
  • LLM-as-judge — use a second LLM call to evaluate whether the output meets quality criteria.
  • Constitutional AI — self-critique loops where the model checks its own output against principles.
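The LLM-as-judge pattern from the list above can be sketched as follows. This is a hedged illustration: `call_llm` is a stand-in for whatever client your stack uses, and the judge prompt and JSON schema are assumptions, not a standard:

```python
import json

# Hypothetical judge prompt: asks a second model for a pass/fail verdict as JSON.
JUDGE_PROMPT = """You are a strict evaluator. Return JSON of the form
{{"pass": true, "reason": "..."}} judging whether the response below is
on-topic, non-toxic, and free of unsupported claims.

Response to evaluate:
{response}"""

def judge(response: str, call_llm) -> bool:
    """Return True if the judge model approves the response.

    `call_llm` is any callable taking a prompt string and returning the
    model's text — a stand-in for a real LLM client.
    """
    verdict_text = call_llm(JUDGE_PROMPT.format(response=response))
    try:
        verdict = json.loads(verdict_text)
    except json.JSONDecodeError:
        return False  # fail closed if the judge output is malformed
    return bool(verdict.get("pass", False))

# Usage with a stub standing in for a real judge model:
stub = lambda prompt: '{"pass": true, "reason": "on-topic and safe"}'
assert judge("Paris is the capital of France.", stub) is True
```

Failing closed on malformed judge output is a deliberate choice here: a guardrail that cannot parse its own verdict should block rather than pass the response through.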

Frameworks

Guardrails AI and NeMo Guardrails (NVIDIA) are the main open-source frameworks; model providers such as Anthropic also ship built-in safety layers on their side of the API. Many teams additionally build custom guardrail pipelines tailored to their domain.