Introduction
In this article, we show how to get started with building a naive RAG with Llama-index, how to customize the retrieval process to obtain more relevant results, and why an automated evaluation should be in place to choose which version to deploy in production.
What is RAG?
Retrieval-Augmented Generation (RAG) is a powerful framework used to enhance the contextual understanding of Large Language Models (LLMs) by incorporating additional knowledge. By extending the context provided to LLMs, RAG enables them to generate more accurate and relevant responses, making it particularly valuable for companies seeking personalized interactions with their data.
Numerous companies aim to utilize chatbots to interact with their clients. Consequently, the chatbot serves as the face of the company, embodying its capacity to offer assistance, convey warmth, and demonstrate dedication. Therefore, to ensure optimal customer experience and maintain strong client relationships, it is crucial to develop advanced chatbots that deliver exceptional performance.
The RAG process can be broken down into three main phases:
- Indexing: External data is indexed by chunking the text from documents, computing embeddings, and storing them in a specialized database known as a VectorStore.
- Retrieval: During retrieval, the embedding of the query is computed, and the chunks most relevant to it are retrieved to serve as context for generating a response.
- Generation: Leveraging the retrieved context, the LLM formulates its response.
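To make these three phases concrete before introducing Llama-index, here is a deliberately minimal, self-contained sketch. The bag-of-words "embedding" and the printed prompt are toy stand-ins for a real embedding model and LLM call, not what a production RAG would use:

from collections import Counter
import math

documents = [
    "Paul Graham went to Cornell intending to study philosophy, then turned to AI.",
    "RAG retrieves relevant chunks and feeds them to an LLM as context.",
]

def embed(text: str) -> Counter:
    # toy stand-in for a real embedding model: a bag-of-words vector
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing: chunk documents (here, one chunk per document) and store their embeddings
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieval: embed the query and pick the most similar chunk
query = "What did Paul Graham study?"
query_emb = embed(query)
context = max(index, key=lambda item: cosine(query_emb, item[1]))[0]

# 3. Generation: a real system would now call an LLM with the retrieved context
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)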
Naive RAG implementation with Llama-index
Prerequisites
To run the provided code, ensure you have Python 3.11.7 installed along with the following dependencies:

jupyter
llama-index==0.10.34
llama-index-embeddings-huggingface==0.2.0
openai==1.25.1
python-dotenv          # reads the .env file
trulens-eval==0.28.2   # used to evaluate RAGs
Define the following variables in a .env file:

OPENAI_API_KEY=<insert your OpenAI API key>
HUGGINGFACE_API_KEY=<insert your Hugging Face API key>
Getting data
Download a sample text file of Paul Graham's essay for demonstration:

!mkdir -p '../data'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O '../data/paul_graham_essay.txt'
Naive RAG implementation
For this experiment, we will use OpenAI for the LLM and BGE as the embedding model. ServiceContext has been deprecated since version v0.10.0 of llama-index and replaced by Settings. The Settings object allows you to set global parameters, which are lazily instantiated: only the elements that are actually required get loaded into memory.
embedding model
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
With Llama-index, implementing a Naive RAG is incredibly simple:
implementing naive RAG
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader(
    input_files=["../data/paul_graham_essay.txt"]
).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What did Paul Graham study?")
str(response)
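Beyond the answer text, the response object also exposes the retrieved chunks, which is useful for checking what context the answer was grounded in. A small, purely illustrative inspection using the query_engine built above:

for source in response.source_nodes:
    # each retrieved chunk comes back with its similarity score
    print(source.score, source.node.get_content()[:100], "...")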
Limitations and causes
While Naive RAG implementations offer a quick way to interact with documents, the quality of their responses is only as good as the context they retrieve. Tuning parameters like chunk_size can significantly impact performance. However, other approaches, such as Sentence Window Retrieval, have proven more effective at retrieving the most relevant context.
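As an illustration of such tuning (the values below are arbitrary examples, not recommendations), the global chunking parameters can be adjusted through the Settings object before rebuilding the index:

from llama_index.core import Settings, VectorStoreIndex

# illustrative values; the optimal chunk size depends on the documents and queries
Settings.chunk_size = 256
Settings.chunk_overlap = 32

small_chunk_index = VectorStoreIndex.from_documents(documents)  # reuses the documents loaded earlier
small_chunk_engine = small_chunk_index.as_query_engine()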
Sentence window retrieval
By default, the VectorStoreIndex splits documents into chunks before computing their embeddings. However, you have the flexibility to control how your documents are parsed by selecting a different type of node_parser. In this case, we'll utilize the SentenceWindowNodeParser, which parses documents into single-sentence nodes while retaining the preceding and following sentences in the metadata field "window".
Setting a window size of 1 means that the "window" field will contain one sentence on either side of the original sentence. This approach significantly improves context relevance during retrieval, as it provides a more comprehensive context for each sentence.
sentence window retrieval
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
sentence_index = VectorStoreIndex.from_documents(
    documents,
    transformations=[node_parser],  # apply the sentence-window parser when building the index
)
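To see what this parser actually produces, you can inspect a node directly; the snippet below is purely illustrative and reuses the node_parser and documents defined above:

nodes = node_parser.get_nodes_from_documents(documents)
print(nodes[5].metadata["original_text"])  # the single sentence that gets embedded
print(nodes[5].metadata["window"])         # that sentence plus one neighbour on each side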
The MetadataReplacementPostProcessor works by swapping out the node's content with a specific field from the node metadata, referred to as "window" in this instance. In practical terms, while the retrieval step relies solely on the node content to locate nodes similar to the query, it is the "window" metadata, comprising the original sentence and its adjacent sentences, that is fed into the re-ranker. This allows the re-ranker to evaluate the surrounding context and identify the nodes most valuable for addressing the query.

The re-ranking is executed with the SentenceTransformerRerank class and leverages the BGE reranker model. This step selects the most informative nodes to answer the query, preventing the context window from being cluttered with irrelevant data. As a result, re-ranking often reduces the number of tokens used and decreases latency.
from llama_index.core.postprocessor import MetadataReplacementPostProcessor, SentenceTransformerRerank

similarity_top_k = 6  # illustrative: number of nodes retrieved before re-ranking
top_n = 2             # illustrative: number of nodes kept after re-ranking

postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
rerank = SentenceTransformerRerank(
    top_n=top_n, model="BAAI/bge-reranker-base"
)
sentence_window_engine = sentence_index.as_query_engine(
    similarity_top_k=similarity_top_k,
    node_postprocessors=[postproc, rerank],
)
In this example, the similarity_top_k nodes most similar to the query are initially retrieved. These nodes are then re-ranked based on their relevance to answering the query, and only the top top_n nodes are retained and added to the context for generating the answer.
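A quick sanity check, reusing the example question from the naive RAG:

window_response = sentence_window_engine.query("What did Paul Graham study?")
print(str(window_response))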
Automerging retrieval
Another commonly used parser is the HierarchicalNodeParser. This parser recursively splits a document into chunks of decreasing size, building a hierarchy of nodes in which each node is linked to its parent.
from llama_index.core import StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes

node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# the docstore must hold every node so that leaves can later be merged into their parents
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
automerging_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
The AutoMergingRetriever recursively merges retrieved leaf nodes that share the same parent, replacing them with that parent node. This helps retrieve as much informative context as possible.
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever

rerank_top_n = 2  # illustrative value

base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
retriever = AutoMergingRetriever(base_retriever, automerging_index.storage_context)
rerank = SentenceTransformerRerank(
    top_n=rerank_top_n, model="BAAI/bge-reranker-base"
)
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)
Similar to the Sentence Window Retrieval approach, we implement a re-ranking step for the retrieved nodes to provide only the most valuable information to the LLM for answer generation. Both methods result in lower token usage and reduced response latency.
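A similar sanity check for the auto-merging engine; as a side note, AutoMergingRetriever accepts a verbose flag that logs when leaf nodes are merged into their parent:

auto_merging_response = auto_merging_engine.query("What did Paul Graham study?")
print(str(auto_merging_response))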
Evaluation
We conducted an experiment with three different RAGs and compared their results using the trulens-eval library. We devised 10 questions for each chatbot and evaluated the quality of their answers based on three criteria:
- Answer Relevance
- Context Relevance
- Groundedness
The entire experiment code is available here.
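As an illustration of how such an evaluation can be wired up, here is a sketch of the three feedback functions following the pattern from the deeplearning.ai RAG course [1]. The app_id and question are arbitrary, and the exact feedback method names may differ between trulens-eval releases, so treat this as a starting point rather than a definitive recipe:

import numpy as np
from trulens_eval import Feedback, Tru, TruLlama
from trulens_eval import OpenAI as fOpenAI
from trulens_eval.feedback import Groundedness

provider = fOpenAI()
context_selection = TruLlama.select_source_nodes().node.text

# Answer Relevance: is the answer relevant to the question?
f_answer_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()

# Context Relevance: are the retrieved chunks relevant to the question?
f_context_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

# Groundedness: is the answer supported by the retrieved context?
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="sentence-window-rag",  # arbitrary label
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)
with tru_recorder:
    sentence_window_engine.query("What did Paul Graham study?")

Tru().get_leaderboard(app_ids=[])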
In this specific case, a very simple demo example, the primary improvement we observed was a significant reduction in token usage, meaning we achieved similar performance at a fraction of the initial cost. On larger corpora, however, we can also anticipate an improvement in Context Relevance.
Next step
In our upcoming episode, we'll delve further into our assessment procedure. We'll explore methods for automating it and determining which metrics are best suited for diagnosing our pipeline's issues, helping us identify areas that require more focus.
Sources
[1] deeplearning.ai RAG course
[2] Llama-index Node PostProcessors documentation
About the author
Charles Camp is a seasoned ML engineer with nine years of experience, specializing in NLP. He's also an AWS Solution Architect Associate, skilled in deploying ML solutions.
After writing his Master's thesis at Carnegie Mellon University, where he used machine learning to build predictive models for the medical sector, Charles took on the challenge of combating money laundering at Credit Suisse in Zurich (Switzerland), where he first began using NLP to identify negative news.
For the past four years, Charles has been freelancing, collaborating with startups and industry leaders to realize their NLP projects. During the last two years, he's been diving into the latest LLM tech, extracting valuable insights from company documents, building chatbots, and developing powerful recommender systems.