
Prompt Evaluation

· 3 min read
Sanjeev Sarda
High Performance Developer

Notes on prompt evaluation from Anthropic's course on Claude.


Prompt Engineering and Evaluation

Prompt Evaluation


Measuring the effectiveness of prompts through automated testing:

  • Test against expected answers (a sketch follows this list)
  • Compare model versions against the same prompt
  • Review the output for errors, mistakes, and policy violations
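
A minimal sketch of what such an automated check can look like, assuming the `anthropic` Python SDK; the model id, prompt template, and test cases are illustrative placeholders of my own:

```python
# Minimal sketch: run a prompt template over test cases and compare against expected answers.
# Assumes the anthropic Python SDK and an ANTHROPIC_API_KEY in the environment;
# the model id, prompt template, and test cases are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

PROMPT_TEMPLATE = "Answer in one word. What is the capital of {country}?"

test_cases = [
    {"input": {"country": "France"}, "expected": "Paris"},
    {"input": {"country": "Japan"}, "expected": "Tokyo"},
]

def run_case(case):
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id
        max_tokens=50,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(**case["input"])}],
    )
    output = response.content[0].text.strip()
    passed = case["expected"].lower() in output.lower()
    return output, passed

for case in test_cases:
    output, passed = run_case(case)
    print(f"{case['input']} -> {output!r} (pass={passed})")
```

Substring matching keeps the check simple here; stricter graders are covered below.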

Evaluation Pipelines

Generate objective metrics about performance:

  • Identify weaknesses pre-prod
  • A/B test prompts (see the comparison sketch below)
  • Iterate based on measurement
  • Increase reliability of your AI app
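
Building on the snippet above, a small pipeline can run the same eval inputs through two prompt variants and compare their pass rates; the variant wording and the substring scoring rule are my own illustrative choices:

```python
# Illustrative A/B comparison: same eval inputs, two prompt variants, compare pass rates.
# Reuses `client` and `test_cases` from the previous snippet; variant wording is my own.
def pass_rate(prompt_template, cases):
    passes = 0
    for case in cases:
        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # placeholder model id
            max_tokens=50,
            messages=[{"role": "user", "content": prompt_template.format(**case["input"])}],
        )
        output = response.content[0].text.strip()
        passes += case["expected"].lower() in output.lower()
    return passes / len(cases)

variant_a = "Answer in one word. What is the capital of {country}?"
variant_b = "You are a geography expert. Reply with only the capital city of {country}."

print("variant A pass rate:", pass_rate(variant_a, test_cases))
print("variant B pass rate:", pass_rate(variant_b, test_cases))
```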


Eval dataset - a set of inputs used for evaluation; building this set can be done with smaller, faster LLMs (sketched below).

Grader - grades the quality of the LLM's response. Numerical grading allows for a more objective measurement of prompt performance.
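
As an illustration of the first point, a smaller, faster model can be asked to produce the eval inputs themselves; the model id and the JSON output convention below are assumptions, and the generated data should be reviewed before use:

```python
# Sketch: use a smaller, faster model to generate an eval dataset.
# The model id and JSON format are assumptions; review the generated cases before use.
import json
import anthropic

client = anthropic.Anthropic()

generation_prompt = (
    "Generate 5 test questions about European capital cities for evaluating a geography "
    "assistant. Return only a JSON array of objects with 'question' and 'expected_answer' keys."
)

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder: a smaller, faster model
    max_tokens=1000,
    messages=[{"role": "user", "content": generation_prompt}],
)

eval_dataset = json.loads(response.content[0].text)
print(eval_dataset[0])
```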


Graders

Graders take the model output that has been generated from the evaluation data.

The grader provides measurable feedback on the model output.
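
One common grader shape is a model-based grader that returns a numerical score. The sketch below is my own, with an assumed 1-10 rubric, a placeholder model id, and a simple regex to pull the number out of the reply:

```python
# Sketch of a model-based grader returning a numerical score for one model output.
# The rubric wording, 1-10 scale, and model id are assumptions.
import re
import anthropic

client = anthropic.Anthropic()

def grade(task, model_output):
    grading_prompt = (
        "You are grading an AI assistant's answer.\n"
        f"Task: {task}\n"
        f"Answer: {model_output}\n"
        "Rate the answer from 1 (poor) to 10 (excellent) for correctness and clarity. "
        "Respond with only the number."
    )
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id
        max_tokens=10,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    match = re.search(r"\d+", response.content[0].text)
    return int(match.group()) if match else None

print(grade("Explain what an eval dataset is", "An eval dataset is a set of test inputs."))
```

Averaging these scores across the eval dataset gives the kind of objective, comparable metric described above.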


Evaluation Criteria


Prompt Engineering

Developing high-quality prompts is an iterative process.


Being Clear and Direct


The key takeaway is that Claude responds best when you treat it like a capable assistant who needs clear direction rather than someone who has to guess what you want. Start strong with a direct action verb, be specific about the task, and you'll see better results right away. In particular (a short illustration follows this list):

  • Be specific about what you want
  • Provide guidelines or steps as part of the prompt
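
As a short illustration of the contrast (the wording is mine, not from the course):

```python
# Illustrative contrast: a vague request vs. a prompt that starts with a direct
# action verb and states exactly what is wanted. Wording is my own.
vague_prompt = "Can you do something with this meeting transcript?"

clear_prompt = (
    "Summarize the key decisions and action items from the meeting transcript below. "
    "List each action item with its owner.\n\n"
    "{transcript}"
)
```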

When providing guidelines, we can specify qualities like word count, or we can provide explicit process steps.


Output guidelines should be used in every prompt to ensure consistent and useful results.

Process steps should be reserved for more complex problems, e.g. troubleshooting, decision making, critical thinking, or considering multiple angles.
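
Putting both together, a prompt for a more complex problem might pair output guidelines with explicit process steps; the example below is my own wording:

```python
# Illustrative prompt (my own wording) combining output guidelines with process steps
# for a troubleshooting task.
troubleshooting_prompt = """Diagnose why the user's web service is returning 502 errors.

Follow these steps:
1. List the most likely causes based on the logs below.
2. For each cause, describe how to confirm or rule it out.
3. Recommend the single most likely fix.

Output guidelines:
- Keep the answer under 200 words.
- Present the causes as a numbered list.

Logs:
{log_excerpt}
"""
```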

Structure with XML Tags

Separate distinct portions of a prompt using XML tags.


Adding XML tags is useful when (an example follows this list):

  • Including large amounts of context or data
  • Mixing different types of content (code, documentation, data)
  • You want to be extra clear about content boundaries
  • Working with complex prompts that interpolate multiple variables
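
For instance, a prompt that mixes instructions, reference documentation, and data could be tagged like this (tag names and wording are my own):

```python
# Illustrative prompt (my own wording) using XML tags to separate instructions,
# reference material, and the data to analyse.
review_prompt = """You are reviewing a pull request.

<instructions>
Check the diff against the style guide and list any violations.
</instructions>

<style_guide>
{style_guide_text}
</style_guide>

<diff>
{diff_text}
</diff>
"""
```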

Using Examples

Provide input and output examples as part of the prompt; examples are especially useful for:


  • Capturing corner cases or edge scenarios
  • Defining complex output formats (like specific JSON structures)
  • Showing the exact style or tone you want
  • Demonstrating how to handle ambiguous inputs

You can also explain why an example output is good, to give the model more context.
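
A small few-shot illustration (my own wording): the examples pin down the JSON output format, cover an ambiguous input, and include a short note on why each output is good:

```python
# Illustrative few-shot prompt (my own wording). The examples define the JSON output
# format, handle an ambiguous review, and explain why each sample output is good.
few_shot_prompt = """Extract the product and sentiment from each review as JSON.

<example>
Review: "The headphones broke after two days."
Output: {"product": "headphones", "sentiment": "negative"}
Why this is good: the sentiment reflects the product failure, not the neutral wording.
</example>

<example>
Review: "Arrived on time."
Output: {"product": null, "sentiment": "neutral"}
Why this is good: no product is named, so product is null rather than guessed.
</example>

Review: "The laptop screen is stunning."
Output:"""
```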

https://www.anthropic.com/learn