Prompt Evaluation
Notes on prompt evaluation and prompt engineering from Anthropic's course on Claude.
Prompt Engineering and Evaluation
Prompt Evaluation
Measuring the effectiveness of prompts through automated testing:
- Test against expected answers
- Compare model versions against the same prompt
- Review the output for errors and policy violations
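For instance, here is a minimal sketch of testing against an expected answer using the Anthropic Python SDK (the model name is a placeholder, and `run_prompt` is a hypothetical helper reused in the later sketches):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_prompt(prompt: str) -> str:
    """Send a single user prompt to the model and return the reply text."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder; substitute a current model
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

# Test against an expected answer (exact match is the simplest check)
output = run_prompt("What is the capital of France? Answer with one word.")
assert output.strip() == "Paris", f"Unexpected answer: {output!r}"
```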
Evaluation Pipelines
Generate objective metrics about performance:
- Identify weaknesses pre-prod
- A/B test prompts
- Iterate based on measurement
- Increase reliability of your AI app
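As a rough sketch of an A/B test over a shared eval dataset (reusing `run_prompt` from the sketch above; the dataset and both prompt templates are made up for illustration):

```python
# A tiny hand-written eval dataset: inputs paired with expected answers.
eval_dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "10 / 4", "expected": "2.5"},
]

# Two candidate prompt templates to A/B test against the same dataset.
prompt_a = "Calculate: {input}. Reply with only the number."
prompt_b = "You are a calculator. Evaluate {input} and output just the result."

def pass_rate(template: str) -> float:
    """Fraction of eval cases where the model output matches exactly."""
    hits = sum(
        run_prompt(template.format(input=case["input"])).strip() == case["expected"]
        for case in eval_dataset
    )
    return hits / len(eval_dataset)

print(f"prompt A: {pass_rate(prompt_a):.0%}, prompt B: {pass_rate(prompt_b):.0%}")
```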
Eval data set - a set of inputs used for evaluation; generating one can be offloaded to smaller, faster LLMs (see the sketch below).
Grader - grades the quality of the LLM's response. Numerical grading allows for a more objective measurement of prompt performance.
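A sketch of generating eval inputs with a smaller, faster model (reusing `client` from the first sketch; the Haiku model name is a placeholder, and the generation prompt is made up):

```python
# Ask a smaller, faster model to draft eval inputs for us.
message = client.messages.create(
    model="claude-3-5-haiku-20241022",  # placeholder; substitute a current small model
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": "Write 5 short customer-support questions about password resets, one per line.",
    }],
)
eval_inputs = [line.strip() for line in message.content[0].text.splitlines() if line.strip()]
```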
Graders
A grader takes the model output generated from the evaluation data and provides measurable feedback on it.
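One way to get numerical grades is to have a model score the output against a rubric. A sketch (reusing `client` from the first sketch; the model name is a placeholder and the rubric wording is hypothetical):

```python
import re

def grade(task: str, model_output: str) -> int:
    """Ask a model to score an output from 1-10 and parse out the number."""
    rubric = (
        f"Task: {task}\n"
        f"Response: {model_output}\n"
        "Rate the response from 1 (poor) to 10 (excellent). "
        "Reply with only the number."
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model name
        max_tokens=10,
        messages=[{"role": "user", "content": rubric}],
    )
    match = re.search(r"\d+", message.content[0].text)
    return int(match.group()) if match else 0  # default to 0 if no number is found
```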
Evaluation Criteria
Prompt Engineering
Developing high-quality prompts is an iterative process.
Being Clear and Direct
The key takeaway is that Claude responds best when you treat it like a capable assistant who needs clear direction rather than someone who has to guess what you want. Start strong with a direct action verb, be specific about the task, and you'll see better results right away.
- Be specific about what you want
- Provide guidelines or steps as part of the prompt
Guidelines can specify output qualities, such as word count, or they can lay out explicit process steps:
Output guidelines should be used in every prompt to ensure consistent and useful results.
Process steps suit more complex problems, e.g. troubleshooting, decision making, critical thinking, or weighing multiple angles.
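A made-up before/after illustrating both ideas, a direct action verb plus explicit steps and output guidelines:

```python
# Vague: the model has to guess the task, format, and depth.
vague_prompt = "Tell me about this error."

# Clear and direct: action verb, explicit steps, and output guidelines.
# {traceback} is a template slot to be filled in with str.format().
clear_prompt = """Diagnose the Python traceback below.

Steps:
1. Identify the exception type and the line that raised it.
2. Explain the most likely cause.
3. Suggest one fix.

Output guidelines: respond in under 150 words, as a numbered list.

Traceback:
{traceback}"""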
Structure with XML Tags
Separate distinct portions of a prompt using XML tags. Tags are especially useful when:
- Including large amounts of context or data
- Mixing different types of content (code, documentation, data)
- You want to be extra clear about content boundaries
- Working with complex prompts that interpolate multiple variables
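A sketch of a prompt that mixes documentation with user data, using XML tags to mark the boundaries (both variables are placeholders):

```python
product_docs = "..."  # placeholder: reference documentation
ticket_text = "..."   # placeholder: user-submitted data

prompt = f"""Summarize the support ticket using the product docs as context.

<docs>
{product_docs}
</docs>

<ticket>
{ticket_text}
</ticket>

Write the summary inside <summary> tags."""
```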
Using Examples
Provide input and output examples as part of the prompt. Examples help most when:
- Capturing corner cases or edge scenarios
- Defining complex output formats (like specific JSON structures)
- Showing the exact style or tone you want
- Demonstrating how to handle ambiguous inputs
You can also explain why the output is good to add more context for the model.
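A sketch of a few-shot prompt that pins down a JSON output format, covers an ambiguous input, and explains why each example output is good (the reviews and field names are made up):

```python
prompt = """Extract the reviewer's name and sentiment from each review as JSON.

Example input: "Maria loved the fast shipping!"
Example output: {"name": "Maria", "sentiment": "positive"}
Why this is good: valid JSON, lowercase sentiment label.

Example input: "Package arrived."
Example output: {"name": null, "sentiment": "neutral"}
Why this is good: the ambiguous input is handled with null and a neutral label.

Input: "Tom said the battery died after a day."
Output:"""
```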