Retrieval Augmented Generation
Notes on retrieval augmented generation (RAG) from Anthropic's course on Claude.
RAG breaks down large documents that need to be passed to the LLM into chunks, so we only send the most relevant pieces. This reduces traffic, latency and costs. We augment the generation of some text with a (pre-generation) retrieval step.
The LLM can then focus on the most relevant content and handle large documents at scale, as well as multiple docs.
Challenges with RAG:
- Requires preprocessing to chunk docs
- Need a search mechanism to find relevant chunks
- You may miss relevant chunks
- Choosing a chunking strategy is a design decision in itself
Chunking Strategies
Size Based
Downsides:
- Cuts off text mid-sentence
- Lack of context in each chunk
Chunks can overlap to mitigate this, as in the sketch below:
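A minimal sketch of size-based chunking with overlap; the chunk size and overlap values are arbitrary choices for illustration:

```python
def chunk_by_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlapping edges."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by less than the chunk size so consecutive chunks
        # share `overlap` characters of context.
        start += chunk_size - overlap
    return chunks
```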
Structure Based
Uses document structure like headers, paragraphs and sections. Good for well-formatted documents like markdown (maybe not mine).
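A rough sketch of splitting markdown on headers; this regex-based approach is my own illustration, not the course's code:

```python
import re

def chunk_by_headers(markdown: str) -> list[str]:
    """Split a markdown document into chunks, starting a new chunk at each header."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A line starting with 1-6 '#' characters marks a new section.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```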
Semantic Chunking
Divide the text into sentences and use NLP to determine how related the sentences are to one another. Chunks are then built up from groups of related sentences.
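A sketch of the idea, assuming an `embed()` function that returns a vector per sentence (e.g. from a sentence-embedding model) and a similarity threshold I've picked arbitrarily:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Group consecutive sentences whose embeddings are similar enough."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        # Cosine similarity between this sentence and the previous one.
        similarity = np.dot(prev_vec, vec) / (np.linalg.norm(prev_vec) * np.linalg.norm(vec))
        if similarity < threshold:
            # Low similarity suggests a topic shift, so start a new chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```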
Sentence Chunking
Split the text into sentences and group them into chunks, adding overlap if needed.
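A naive sketch (the regex sentence splitter and group sizes are my own simplification):

```python
import re

def chunk_by_sentences(text: str, per_chunk: int = 5, overlap: int = 1) -> list[str]:
    """Split on sentence boundaries, then group sentences with a small overlap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    step = per_chunk - overlap
    return [
        " ".join(sentences[i:i + per_chunk])
        for i in range(0, len(sentences), step)
    ]
```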
Selection Strategy
- Structure based - when you control the document formatting
- Sentence based - middle ground for text docs
- Size based - most reliable, simple for prod use
Text Embeddings
We need to find the most relevant chunks for the user's question. This is typically done using semantic search: each chunk is passed through an embedding model, which turns it into a vector that captures its meaning.
Once we generate these embeddings for each chunk, they get stored in a vector database.
When we get a user query, we generate its embedding using the same model as we used for the chunks. We then use cosine similarity (or cosine distance) to compare the query embedding with each chunk embedding in the vector database, in order to find the most similar chunks.
We now have the chunks effectively ordered by relevancy and can combine them into a prompt.
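A sketch of the retrieval step, assuming the same hypothetical `embed()` function and the chunk embeddings stacked into a NumPy matrix rather than a real vector database:

```python
import numpy as np

def top_k_chunks(query: str, chunks: list[str], chunk_vectors: np.ndarray, embed, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    query_vec = embed(query)
    # Cosine similarity between the query vector and every chunk vector.
    similarities = (chunk_vectors @ query_vec) / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(similarities)[::-1][:k]
    return [chunks[i] for i in best]
```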
BM25 Lexical Search
We can combine semantic search with lexical (keyword) search using BM25. We may want to do this when semantic search fails to return chunks containing the exact term used in the query, which is exactly where lexical search excels (see the sketch after the list below).
Reminds me a bit of inverse doc frequency:
- Gives higher weight to rare, specific terms
- Ignores common words that don't add search value
- Focuses on term frequency rather than semantic meaning
- Works especially well for technical terms, IDs, and specific phrases
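A minimal lexical search sketch using the third-party rank_bm25 package (my own choice of library, not necessarily what the course uses):

```python
from rank_bm25 import BM25Okapi

chunks = [
    "Project Phoenix stability issues were resolved in Q3.",
    "The annual review covers interdisciplinary research efforts.",
    "Cosine similarity compares embedding vectors.",
]

# Whitespace tokenisation keeps the example simple; a real pipeline
# would normalise punctuation and casing more carefully.
tokenized_chunks = [chunk.lower().split() for chunk in chunks]
bm25 = BM25Okapi(tokenized_chunks)

query_tokens = "project phoenix stability".split()
scores = bm25.get_scores(query_tokens)  # one BM25 score per chunk
ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
print([chunks[i] for i in ranked[:2]])
```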
However, semantic and lexical search score results on different scales, so we need a way to combine them. The approach below (reciprocal rank fusion) also lets us integrate and normalise other, arbitrary search and ranking methods.
Reciprocal Rank Fusion
Each chunk's fused score is the sum, across ranking methods, of the reciprocal of its rank in each method.
The score is inversely proportional to the rank position: a low position number (near the top of a list) contributes a lot, a high position number (far down a list) contributes little.
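A sketch of reciprocal rank fusion; note that the standard formulation adds a smoothing constant k (commonly 60) to each rank before taking the reciprocal:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ordering.

    Each chunk's score is the sum over rankings of 1 / (k + rank),
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse the semantic and BM25 orderings:
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```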
Re-ranking
Once we have a list of relevant chunks, we use an LLM to rank them again according to relevance.
We trade increased accuracy for added latency and extra API calls.
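A rough sketch of re-ranking with Claude via the Anthropic Python SDK; the prompt wording, the JSON-indices output format and the model name are my own assumptions:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Ask the model to order the chunks by relevance and return the top few."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder; use whichever Claude model you have
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {query}\n\nChunks:\n{numbered}\n\n"
                f"Return a JSON array of the {top_n} chunk indices most "
                "relevant to the question, most relevant first. JSON only."
            ),
        }],
    )
    indices = json.loads(response.content[0].text)
    return [chunks[i] for i in indices]
```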
Contextual Retrieval
When we chunk a document, each piece now lacks the context of its place in the document and how it relates to the rest of it. Retrieval accuracy can go down as context is removed.
We use an LLM to write a snippet which "situates" the chunk into the overall document. This generated context is combined with the original chunk to create a "contextualized chunk", which is used in the vector and BM25 indices.
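A sketch of generating the situating snippet; the prompt is my paraphrase of the idea rather than the course's exact wording:

```python
import anthropic

client = anthropic.Anthropic()

def contextualize_chunk(document: str, chunk: str) -> str:
    """Prepend an LLM-written snippet situating the chunk within its document."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model name
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": (
                f"<document>\n{document}\n</document>\n\n"
                f"<chunk>\n{chunk}\n</chunk>\n\n"
                "Write a short snippet that situates this chunk within the "
                "overall document, to improve search retrieval. Respond with "
                "only the snippet."
            ),
        }],
    )
    context = response.content[0].text.strip()
    # The contextualized chunk is what goes into the vector and BM25 indices.
    return f"{context}\n\n{chunk}"
```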
Large Documents
If the document is too large to include in full when generating the situating snippet, provide a reduced set of context (there's a sketch of this selection at the end of these notes):
- Some from the start
- Chunks before the chunk you're contextualizing
- Skip chunks in the middle
For example, a contextualized chunk might start with: "This chunk is Section 2 of an Annual Interdisciplinary Research Review, detailing software engineering efforts to resolve stability issues in Project Phoenix..." followed by the original chunk text.
This technique is especially valuable for complex documents where individual sections have many interconnections and references to other parts of the document. The added context helps ensure that relevant chunks are retrieved even when the search query doesn't exactly match the chunk's original text.
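A sketch of the reduced-context selection described under "Large Documents" above; the budget sizes are arbitrary:

```python
def reduced_context(chunks: list[str], index: int, lead_chunks: int = 2, prior_chunks: int = 3) -> str:
    """Build a shortened document context for situating chunk `index`.

    Takes a few chunks from the start of the document plus the chunks
    immediately before the target, skipping everything in between.
    """
    start = chunks[:lead_chunks]
    before = chunks[max(lead_chunks, index - prior_chunks):index]
    skipped = index - prior_chunks > lead_chunks
    selected = start + (["[...]"] if skipped else []) + before
    return "\n\n".join(selected)
```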