LLM Guardrails and Safety Patterns
Notes on AI/LLM guardrails and safety patterns from a book on "Agentic Design Patterns" by one of Google's Distinguished Engineers, Antonio Gulli.
Guardrails and Safety Patterns
Provide a protective layer to guide agent behavior in order to prevent "harmful, biased, irrelevant or otherwise undesirable responses".
Other reasons for guardrails:
- Legal
- Compliance
Guardrails can be implemented using "lower power" but faster LLM models. They should be teamed up with observability to understand when they're being triggered, false positives and general user behavior.
Where to Apply Guardrails
-
Input validation - filter malicious content and prevent jailbreak attacks.
-
Prompt or behavioural constraints - directly instructing the LLM, explicitly preventing or allowing tool use.
-
External moderation APIs, "human in the loop" or other LLMs options for output validation.
Guardrail Prompts
- A general purpose safety prompt - company policy
- Permissible input prompt
- A structured output definition prompt
- Policy determination by a prompt (input or output validation, what policy does it break?)
- Technical guard rail prompt to verify the output of other prompts
- Jailbreak prompt
Links
https://docs.google.com/document/d/1rsaK53T3Lg5KoGwvf8ukOUvbELRtH-V0LnOIFDxBryE/edit?tab=t.0#heading=h.pxcur8v2qagu - "Agentic Design Patterns: A Hands-On Guide to Building Intelligent Systems" by Antonio Gulli