Philipp Buschhaus · 5 min read
Building RAG the Right Way: A Comprehensive Guide
Discover how to build an optimized RAG system using methods like semantic and contextual search with re-ranking to improve data retrieval accuracy and reduce failed retrievals in organizational environments.
Building a powerful Retrieval-Augmented Generation (RAG) component is an essential part of most GenAI applications. The key to building RAG successfully is realizing that "it depends": on the type of data, the specific use case, and how you approach it. This is why most publicly available RAG applications fall short on accuracy when used in organizational environments. Whether you're dealing with large amounts of unstructured data, like PDFs, or integrating first-party data sources, doing RAG the right way requires a methodical approach. This blog post outlines the best practices we applied and established for building a highly effective RAG system that outperforms existing solutions in terms of accuracy. Our solution incorporates some of Anthropic's and NVIDIA's latest research and outperforms all available and built-in RAG applications we tested (including OpenAI's), reducing failed retrievals to under 2%.
1. Rewriting Queries for RAG Optimization
A significant first step in building an effective RAG system is ensuring that the initial user query is rewritten into a RAG-optimized search query. Natural language queries like "How much revenue did we make this year?" often need refinement to match the structure of the available data. For example, breaking it down into "revenue company A 2024" can yield much more relevant results.
This process may involve multi-shot testing—where different versions of a query are tested against the system to see what yields the best results. By refining queries upfront, you ensure that your RAG system retrieves more accurate data and provides a better user experience.
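As a minimal sketch of this step (the model name, prompt, and function are illustrative, not our production setup), query rewriting can be delegated to an LLM before retrieval runs:

```python
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user question into a short, keyword-style search query "
    "optimized for document retrieval. Include concrete entities and years "
    "where they can be inferred. Return only the rewritten query."
)

def rewrite_query(user_query: str) -> str:
    """Turn a natural-language question into a RAG-optimized search query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; use whichever LLM you have available
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": user_query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# "How much revenue did we make this year?" -> e.g. "revenue company A 2024"
```

For multi-shot testing, the same prompt can be sampled several times (or varied) and each candidate query run against the retriever to see which version returns the strongest results.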
2. Hybrid Search: Combining Semantic and Keyword Search
The best RAG systems use a hybrid search approach, leveraging both semantic and keyword searches. This means that while your system understands the meaning behind a query (semantic search), it also looks for exact keyword matches. This combination ensures that you’re retrieving both contextually relevant and precise results, striking a balance between accuracy and relevance.
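The sketch below shows one way to blend the two scores; the sample documents, embedding model, and 50/50 weighting are illustrative assumptions, not a prescription:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Q3 2024 revenue report for company A",
    "Employee onboarding handbook",
    "Customer X support escalation contacts",
]

# Semantic index: dense embeddings, normalized so the dot product is cosine similarity
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# Keyword index: TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

def hybrid_search(query: str, alpha: float = 0.5, top_k: int = 3):
    """Blend semantic and keyword relevance with weight alpha."""
    query_emb = embedder.encode([query], normalize_embeddings=True)
    semantic_scores = (doc_embeddings @ query_emb.T).ravel()
    keyword_scores = cosine_similarity(vectorizer.transform([query]), tfidf_matrix).ravel()
    combined = alpha * semantic_scores + (1 - alpha) * keyword_scores
    ranked = np.argsort(combined)[::-1][:top_k]
    return [(documents[i], float(combined[i])) for i in ranked]

print(hybrid_search("revenue company A 2024"))
```

The weight alpha controls the trade-off: higher values favor semantic matches, lower values favor exact keyword hits.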
3. Prepending Context for Accurate Embeddings
When creating embeddings for the vector store (used in semantic search) or the TF-IDF index (used in keyword search), it's essential to prepend context to the text. For example, adding the section heading or document title to each chunk ensures the system understands the broader context. This simple step improves retrieval accuracy, especially in long documents or complex datasets.
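A minimal sketch of this step (the field names and sample text are illustrative): each chunk is expanded with its document title and section heading before it is embedded or added to the TF-IDF index.

```python
def contextualize_chunk(chunk_text: str, doc_title: str, section_heading: str) -> str:
    """Prepend document title and section heading so the chunk carries its context."""
    return f"Document: {doc_title}\nSection: {section_heading}\n\n{chunk_text}"

chunk = "Revenue grew 12% year over year, driven by the new subscription tier."
contextualized = contextualize_chunk(
    chunk,
    doc_title="Annual Report 2024 - Company A",
    section_heading="Financial Performance",
)
# Embed and index `contextualized` instead of the raw chunk text.
```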
4. Tagging with Structured Metadata for Better Retrieval
Tagging documents with structured metadata—such as department, permission level, or customer—can significantly enhance the RAG system’s ability to filter and retrieve relevant data. When dealing with large datasets, especially in complex organizations, this metadata allows you to pre-filter documents, ensuring that the system only retrieves information relevant to the query at hand.
For example, in a large organization, metadata tagging such as “finance department” or “Q3 reports” can filter out irrelevant results before the search even begins, improving retrieval speed and accuracy.
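A simple sketch of such a pre-filter, assuming chunks carry a metadata dictionary (the fields and values shown are illustrative):

```python
chunks = [
    {"text": "Q3 revenue grew 12%...",
     "metadata": {"department": "finance", "doc_type": "Q3 report", "permission": "internal"}},
    {"text": "New vacation policy takes effect in January...",
     "metadata": {"department": "hr", "doc_type": "policy", "permission": "public"}},
]

def prefilter(chunks: list[dict], **filters) -> list[dict]:
    """Keep only chunks whose metadata matches every given filter."""
    return [c for c in chunks
            if all(c["metadata"].get(key) == value for key, value in filters.items())]

finance_chunks = prefilter(chunks, department="finance")
# Run the semantic and keyword search only over `finance_chunks`.
```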
5. Finding the Right Chunk Size
There’s a sweet spot for the number of text chunks you feed back to your large language model (LLM) for generating responses. Our testing shows that around 20 chunks is a good starting point. However, the exact number depends on your use case, so it’s worth experimenting with different chunk counts and sizes to find what works best for your data and model.
6. Re-ranking for Relevance
After retrieving the top chunks of data, it’s crucial to perform re-ranking to ensure only the most relevant information is considered. This process involves taking the top N chunks and re-ranking them using a re-ranking model. The goal is to refine the selection further, feeding back only the most relevant X chunks to the LLM for final output.
This step significantly enhances the quality of responses, especially when dealing with complex or vague queries.
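As one possible implementation (the cross-encoder model and the N-to-X split are illustrative assumptions), an off-the-shelf re-ranker can score each query-chunk pair and keep only the best ones:

```python
from sentence_transformers import CrossEncoder

# Off-the-shelf cross-encoder re-ranker; the model choice is illustrative
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, retrieved_chunks: list[str], keep: int = 5) -> list[str]:
    """Score each (query, chunk) pair and keep the `keep` most relevant chunks."""
    scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
    ranked = sorted(zip(retrieved_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]

# top_chunks = hybrid_search(...)            # N candidate chunks
# final_chunks = rerank(query, top_chunks)   # X most relevant chunks for the LLM
```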
7. Maintaining Chunk Order for Consistency
Once you’ve identified the relevant chunks, it’s important to preserve their original document order when feeding them back to the LLM. Keeping this order maintains the coherence and flow of information, ensuring that the generated responses are not only accurate but also logically structured.
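A small sketch of this step, assuming each chunk records its source document and position (field names are illustrative):

```python
def restore_document_order(selected_chunks: list[dict]) -> list[dict]:
    """Sort the re-ranked chunks back into their original document positions."""
    return sorted(selected_chunks, key=lambda c: (c["doc_id"], c["position"]))

selected = [
    {"doc_id": "annual_report", "position": 7, "text": "Operating costs remained flat..."},
    {"doc_id": "annual_report", "position": 2, "text": "Revenue grew 12% year over year..."},
]

# Feed the chunks to the LLM in this restored order, not in relevance order.
ordered = restore_document_order(selected)
```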
Grounding LLM Responses in Internal Data
Building a RAG system over PDFs is a common approach, but for most businesses, internal data doesn’t live in isolated documents. Instead, companies need a source of truth—a centralized system that contains verified information about the business. This could be stored in a graph database (in our case Neo4j) and integrated with third-party tools like CRMs or HR systems to ensure that it remains up to date.
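As an illustrative sketch only (the node labels, properties, and connection details below are assumptions, not our actual schema), a source-of-truth lookup against Neo4j might look like this:

```python
from neo4j import GraphDatabase

# Connection details and the Company/Revenue schema are illustrative assumptions.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def get_revenue(company_name: str, year: int):
    """Look up a verified revenue figure in the graph instead of inferring it from documents."""
    query = (
        "MATCH (c:Company {name: $name})-[:REPORTED]->(r:Revenue {year: $year}) "
        "RETURN r.amount AS amount"
    )
    with driver.session() as session:
        record = session.run(query, name=company_name, year=year).single()
        return record["amount"] if record else None

print(get_revenue("Company A", 2024))
```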
Additionally, the system needs to have fine-grained permissions. This can be achieved by tagging each document with metadata such as permission levels and mapping these to user IDs. Before any retrieval process, the RAG system should retrieve the user’s access level and ensure that only allowed data is accessed. For added security, implementing a final “firewall check” before returning data ensures no sensitive information slips through.
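A minimal sketch of this permission flow, assuming the chunk metadata from the tagging step and an illustrative user-to-access mapping (both hypothetical):

```python
# Illustrative permission model: each user maps to a set of allowed permission levels.
USER_ACCESS = {
    "user_123": {"public", "internal"},
    "user_456": {"public"},
}

def permission_filter(chunks: list[dict], user_id: str) -> list[dict]:
    """Drop every chunk the user is not allowed to see before retrieval runs."""
    allowed = USER_ACCESS.get(user_id, {"public"})
    return [c for c in chunks if c["metadata"].get("permission") in allowed]

def firewall_check(chunks: list[dict], user_id: str) -> list[dict]:
    """Final check before returning data: re-verify permissions on the selected chunks."""
    allowed = USER_ACCESS.get(user_id, {"public"})
    blocked = [c for c in chunks if c["metadata"].get("permission") not in allowed]
    if blocked:
        raise PermissionError(f"{len(blocked)} chunk(s) blocked for {user_id}")
    return chunks
```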
Resolving “Simple” yet Tricky Queries
Here are some common low-context user queries that most RAG solutions in the market struggle with:
- “How much revenue did we make?”
- “Give me a summary of Q3?”
- “Who do I contact if I have a problem with customer X?”
Using the above methods—optimized queries, hybrid search, prepending context, metadata tagging, and re-ranking—these kinds of questions can now be answered accurately and efficiently in our RAG implementation, without the user needing to become a prompt engineer. The goal is to allow users to communicate naturally with the system, receiving precise, relevant answers without extensive input modification.