Evaluating RAG pipelines with Ragas and Openlayer

How to use synthetic data to evaluate RAG systems


Building the basic structure of a RAG pipeline is usually straightforward. The real challenge is tuning it for production and ensuring the quality of its outputs: with so many tools and parameters available, choosing the right ones is hard. In this article, I will share a workflow that combines Ragas and Openlayer to help you make the best choices for your RAG pipeline and keep its quality high.

For this article, we’ll use data from arXiv papers about prompt engineering to build our RAG pipeline.


To run the code samples in this post, make sure to pip install the dependencies, clone the git repo, and export your OpenAI API key as an environment variable.

pip install ragas llama-index openlayer pypdf

git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-papers


Synthetic test data generation

Compiling a golden test dataset for evaluation is often a burdensome and expensive task, more so at the outset of a project or with shifting data sources. Synthetic generation of high-quality test data presents a viable solution, cutting down curation efforts by 90%. The ideal dataset should feature high-quality, diverse data points, mirroring real production scenarios. Ragas adopts a unique, evolution-based synthetic data generation technique, ensuring the creation of diverse and high-quality questions. While it defaults to OpenAI models, Ragas allows the use of any preferred model. Let's see how we can generate 1,000 data points using Ragas.

from llama_index import SimpleDirectoryReader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# load documents 
dir_path = "./prompt-engineering-papers"
reader = SimpleDirectoryReader(dir_path,num_files_limit=2)
documents = reader.load_data()

# generator with openai models
generator = TestsetGenerator.with_openai()

# set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# generate testset with 1,000 questions
testset = generator.generate_with_llamaindex_docs(
    documents, test_size=1000, distributions=distribution
)
test_df = testset.to_pandas()

With the flexibility to adjust the distribution of question types, you can tailor the dataset to better fit your specific requirements. Now that our test dataset is prepared, let’s build a basic RAG pipeline using llama-index.
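As a rough illustration of what the distribution means for a 1,000-question test set, the sketch below splits a target size across question types in proportion to their weights. The `allocate_questions` helper is hypothetical and written only for this example; it is not part of Ragas, and Ragas may allocate questions differently internally.

```python
def allocate_questions(distribution, test_size):
    """Split test_size across question types proportionally to their weights."""
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1, got {total}")
    # floor each share first, then hand leftover questions to the largest remainders
    raw = {name: weight * test_size for name, weight in distribution.items()}
    counts = {name: int(share) for name, share in raw.items()}
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


counts = allocate_questions(
    {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}, test_size=1000
)
print(counts)  # {'simple': 500, 'reasoning': 250, 'multi_context': 250}
```

Shifting weight toward `reasoning` or `multi_context` would make the test set harder, which can be useful once the pipeline handles simple questions well.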

Building RAG

In this step, I’m using llama-index to build a simple RAG pipeline that I will evaluate later. To evaluate any RAG pipeline with Ragas, we need more than the question and expected answer generated in the previous step. We also need the outputs of the pipeline itself: the retrieved contexts and the generated answer for each query. Let’s prepare that now.

import nest_asyncio
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from llama_index.schema import Document
from llama_index.core.base_query_engine import BaseQueryEngine

from typing import List

# allow nested event loops (needed when running inside a notebook)
nest_asyncio.apply()


def build_query_engine(documents: List[Document]):
    # chunk size and embedding model match the metadata logged to Openlayer later
    service_context = ServiceContext.from_defaults(
        chunk_size=512, embed_model=OpenAIEmbedding()
    )
    vector_index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )

    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine

def generate_single_response(query_engine: BaseQueryEngine, question: str):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }

query_engine = build_query_engine(documents)

Let’s test our RAG pipeline by posing a simple question.

question = "What are some strategies proposed to enhance the in-context learning capability of language models?"
generate_single_response(query_engine, question)
{'answer': 'Strategies proposed to enhance the in-context learning capability of language models include instruction tuning, generating instruction tuning datasets, connecting language models with powerful vision foundational models, and using proper data formatting and architecture designs. Additionally, in the speech area, treating text-to-speech synthesis as a language modeling task and using intermediate representations such as audio codec codes have been proposed to enhance in-context learning capability.',
 'contexts': ['with instruction tuning, and the idea is also ex-\nplored in the multi-modal scenarios as well. Re-\ncent explorations first generate instruction tuning\ndatasets transforming existing vision-language task\ndataset (Xu et al., 2022; Li et al., 2023a) or with\npower LLMs such as GPT-4 (Liu et al., 2023; Zhu\net al., 2023a) , and connect LLMs with powerful vi-\nsion foundational models such as BLIP-2 (Li et al.,\n2023c) on these multi-modal datasets (Zhu et al.,\n2023a; Dai et al., 2023).\n9.3 Speech In-Context Learning\nIn the speech area, Wang et al. (2023a) treated text-\nto-speech synthesis as a language modeling task.\nThey use audio codec codes as an intermediate rep-\nresentation and propose the first TTS framework\nwith strong in-context learning capability. Subse-\nquently, V ALLE-X (Zhang et al., 2023b) extend the\nidea to multi-lingual scenarios, demonstrating su-\nperior performance in zero-shot cross-lingual text-\nto-speech synthesis and zero-shot speech-to-speech\ntranslation tasks.\n3Takeaway :(1) Recent studies have explored\nin-context learning beyond natural language with\npromising results. Properly formatted data (e.g.,\ninterleaved image-text datasets for vision-language\ntasks) and architecture designs are key factors\nfor activating the potential of in-context learning.\nExploring it in a more complex structured space\nsuch as for graph data is challenging and promis-\ning (Huang et al., 2023a). (2) Findings in textual\nin-context learning demonstration design and selec-\ntion cannot be trivially transferred to other modal-\nities. Domain-specific investigation is required to\nfully leverage the potential of in-context learning\nin various modalities.',
  'Language models are few-shot learners. In Ad-\nvances in Neural Information Processing Sys-\ntems 33: Annual Conference on Neural Infor-\nmation Processing Systems 2020, NeurIPS 2020,\nDecember 6-12, 2020, virtual .\nStephanie C. Y . Chan, Adam Santoro, Andrew K.\nLampinen, Jane X. Wang, Aaditya Singh,\nPierre H. Richemond, Jay McClelland, and Fe-\nlix Hill. 2022. Data distributional properties\ndrive emergent in-context learning in transform-\ners.CoRR , abs/2205.05055.\nMingda Chen, Jingfei Du, Ramakanth Pasunuru,\nTodor Mihaylov, Srini Iyer, Veselin Stoyanov,\nand Zornitsa Kozareva. 2022a. Improving in-\ncontext few-shot learning via self-supervised\ntraining. In Proceedings of the 2022 Conference\nof the North American Chapter of the Associa-\ntion for Computational Linguistics: Human Lan-\nguage Technologies , pages 3558–3573, Seattle,\nUnited States. Association for Computational\nLinguistics.\nYanda Chen, Chen Zhao, Zhou Yu, Kathleen McKe-\nown, and He He. 2022b. On the relation between\nsensitivity and accuracy in in-context learning.\nArXiv preprint , abs/2209.07661.\nYanda Chen, Chen Zhao, Zhou Yu, Kathleen R.\nMcKeown, and He He. 2022c. On the relation\nbetween sensitivity and accuracy in in-context\nlearning. CoRR , abs/2209.07661.\nYanda Chen, Ruiqi Zhong, Sheng Zha, George\nKarypis, and He He. 2022d. Meta-learning via\nlanguage model in-context tuning. In Proc. of\nACL, pages 719–730, Dublin, Ireland. Associa-\ntion for Computational Linguistics.']}

Now that we have seen the pipeline works, let’s form the evaluation dataset by feeding each question from the test data to our pipeline.

from datasets import Dataset
import pandas as pd

def generate_ragas_dataset(query_engine: BaseQueryEngine, test_df: pd.DataFrame):
    # run every test question through the RAG pipeline
    test_questions = test_df["question"].tolist()
    responses = [generate_single_response(query_engine, q) for q in test_questions]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        # expected answers from the Ragas test set (column name assumed to be "ground_truth")
        "ground_truth": test_df["ground_truth"].tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds

ragas_dataset = generate_ragas_dataset(query_engine, test_df)
ragas_df = ragas_dataset.to_pandas()
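
Before uploading anything, a quick sanity check can catch rows that cannot be evaluated, such as rows with an empty answer or no retrieved contexts. The following is a minimal sketch, not part of Ragas or Openlayer: `find_incomplete_rows` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def _is_blank(value):
    """True for None or for values that are empty once converted to text."""
    return value is None or not str(value).strip()


def find_incomplete_rows(records):
    """Return (row index, problem) pairs for rows that cannot be evaluated."""
    problems = []
    for i, row in enumerate(records):
        if _is_blank(row.get("question")):
            problems.append((i, "empty question"))
        if _is_blank(row.get("answer")):
            problems.append((i, "empty answer"))
        contexts = row.get("contexts")
        if contexts is None or len(contexts) == 0:
            problems.append((i, "no retrieved contexts"))
    return problems


# toy rows standing in for the real evaluation data
sample_rows = [
    {"question": "What is few-shot prompting?", "answer": "Providing examples.", "contexts": ["..."]},
    {"question": "What is chain-of-thought?", "answer": "", "contexts": []},
]
print(find_incomplete_rows(sample_rows))
# [(1, 'empty answer'), (1, 'no retrieved contexts')]
```

With the real data, the records could be obtained with `ragas_df.to_dict("records")`.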

Now that we have all the data points required for evaluation, let’s move on and add this data to Openlayer.

Commit to Openlayer

We can start to gauge how well (or poorly) our RAG system is performing by inspecting our dataset rows. The problem is that this evaluation approach is error-prone and lacks comprehensiveness. To ship a production-grade system, we must adopt a more systematic evaluation strategy.

Openlayer is the tool we will use to thoroughly evaluate our RAG system. We will dive into how it solves our problem shortly. First, we need to onboard our artifacts to the Openlayer platform.

Let’s start by creating a project on Openlayer:

import openlayer
from openlayer.tasks import TaskType

client = openlayer.OpenlayerClient("YOUR_OPENLAYER_KEY_HERE")
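project = client.create_project(
    name="YOUR_PROJECT_NAME_HERE",  # placeholder, choose your own project name
    task_type=TaskType.LLM,
    description="Evaluating an LLM used for product development."
)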

After we create the project, the next step is to add our dataset and model to the project:

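validation_dataset_config = {
    "contextColumnName": "contexts",
    "inputVariableNames": ["question"],
    "label": "validation",
    "outputColumnName": "answer",
}
# add the evaluation dataframe built earlier to the project
project.add_dataframe(
    dataset_df=ragas_df,
    dataset_config=validation_dataset_config,
)

model_config = {
    "inputVariableNames": ["question"],
    "modelType": "shell",
    "metadata": {  # Some optional metadata we want to log
        "top_k": 2,
        "chunk_size": 512,
        "embeddings": "OpenAI",
    },
}
# register the model configuration
project.add_model(model_config=model_config)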

Finally, we can commit and push these artifacts to the Openlayer platform.

project.commit("Initial commit!")
project.push()

Evaluation and Testing in Openlayer

Once we have our first commit on Openlayer, it’s time to start evaluating our system by creating tests.

Tests materialize our expectations around our system. For our RAG system, we can start with performance tests using any Ragas metric, for example a test that requires context recall to stay at or above 0.8.

The test we just created uses context recall, which measures how good our context retriever is at retrieving all the relevant information required to answer a question. If the context recall is below our threshold of 0.8, the test fails.
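The logic of such a threshold test can be sketched in a few lines. This is an illustration only, not Openlayer's implementation: `run_threshold_test` is a hypothetical helper, the scores are made up, and aggregating per-row scores with a plain mean is an assumption.

```python
from statistics import mean


def run_threshold_test(scores, threshold=0.8):
    """Pass when the mean score reaches the threshold; also list the weakest rows."""
    if not scores:
        raise ValueError("need at least one score")
    average = mean(scores)
    failing_rows = [i for i, s in enumerate(scores) if s < threshold]
    return {"mean": average, "passed": average >= threshold, "failing_rows": failing_rows}


# hypothetical per-row context recall scores
example_scores = [1.0, 0.5, 0.75, 1.0, 0.25]
result = run_threshold_test(example_scores)
print(result["passed"], result["failing_rows"])  # False [1, 2, 4]
```

Returning the indices of the low-scoring rows alongside the verdict makes it easier to go from a failed test to the specific questions that caused it.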

Note how creating tests such as the one we just did enables component-wise evaluation and metric-driven development, two pillars of LLM application development.

We created our first test, but we should not stop there. We can set up other Openlayer tests to evaluate our system, ranging from data quality to model performance tests.

After we have tests in place, every commit to Openlayer gets evaluated against this set of well-defined criteria. By doing so, we ensure that we are systematically making progress and avoiding regressions as we work to improve our RAG system.
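To make the idea of avoiding regressions concrete, the sketch below compares pass/fail results between two commits and reports tests that used to pass but now fail. The `find_regressions` helper and the test names are hypothetical; Openlayer tracks this for you on the platform.

```python
def find_regressions(previous, current):
    """Return names of tests that passed in the previous commit but fail in the current one."""
    return sorted(
        name
        for name, passed in previous.items()
        if passed and current.get(name) is False
    )


# hypothetical test results for two consecutive commits
previous_commit = {"context_recall": False, "faithfulness": True, "no_empty_answers": True}
current_commit = {"context_recall": True, "faithfulness": False, "no_empty_answers": True}
regressions = find_regressions(previous_commit, current_commit)
print(regressions)  # ['faithfulness']
```

In this made-up example the change that fixed context recall broke another test, which is exactly the kind of trade-off a fixed test suite is meant to surface.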

Analysis of evaluation results

On our project home page, we can see which tests are passing and which are failing. For example, we can see that the context recall test is currently failing. If we click on the test card, we notice that the test fails because our context recall is below our threshold.

We can inspect the scores for individual rows and discuss with other team members what the root cause of the issue might be. One plausible hypothesis is that our context retriever is retrieving too few contexts, leaving the context too incomplete to answer the question.

One possible way to mitigate the issue is to modify the similarity_top_k parameter inside our build_query_engine function. We can increase it so that our context retriever returns more than just two contexts. After re-computing the results, we can push a new commit to Openlayer. This lets us check whether the issue was fixed, and if the other tests keep passing, we can rest assured that we avoided regressions elsewhere.
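A toy example shows why retrieving more contexts can raise recall. It is a simplified stand-in, not the Ragas context recall metric (which relies on an LLM to judge attribution): `recall_at_k`, the chunk ids, and the ranking are all hypothetical.

```python
def recall_at_k(ranked_chunk_ids, relevant_chunk_ids, k):
    """Fraction of the relevant chunks that appear among the top-k retrieved chunks."""
    if not relevant_chunk_ids:
        raise ValueError("need at least one relevant chunk")
    retrieved = set(ranked_chunk_ids[:k])
    return len(retrieved & set(relevant_chunk_ids)) / len(relevant_chunk_ids)


# hypothetical ranking from a retriever and the chunks needed to answer the question
ranking = ["c7", "c2", "c9", "c4", "c1"]
needed = {"c2", "c4"}
for k in (2, 4):
    print(k, recall_at_k(ranking, needed, k))
# 2 0.5
# 4 1.0
```

Note that the larger k also pulls in chunks that are not needed (`c7` and `c9` here), so a higher similarity_top_k can add noise to the prompt. That is one reason to keep the other tests in place when changing it.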

Using Openlayer & Ragas in Production

All our efforts so far concentrated on evaluating our system while it was still under development, i.e., before shipping it to production. This is why we started by creating a validation set with Ragas, pushing artifacts to Openlayer, and keeping track of their versions.

However, Openlayer and Ragas are helpful beyond development. Both can be used in a production environment, to monitor our RAG system and help maintain its quality and performance while it serves user requests.

The main difference between development and monitoring is that while in development, we use a validation set to run tests, in monitoring, tests run on production data at a regular cadence.

For example, let’s take a look at the context recall test. We can create a test on Openlayer that checks whether our context recall remains above 0.8 and runs every hour. Openlayer will then use the latest production data published to compute the results. If the test suddenly fails, you get notified immediately, so you can take corrective action in time.
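The idea behind such an hourly check can be sketched as follows. This is a simplified illustration, not how Openlayer computes or schedules tests: `failing_hours` is a hypothetical helper, the production records are made up, and averaging scores per hour is an assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def failing_hours(records, threshold=0.8):
    """Group (timestamp, score) pairs by hour and return the hours whose mean is below threshold."""
    buckets = defaultdict(list)
    for timestamp, score in records:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(score)
    return sorted(hour for hour, scores in buckets.items() if mean(scores) < threshold)


# hypothetical production records: (request time, context recall score)
production = [
    (datetime(2024, 1, 1, 9, 5), 1.0),
    (datetime(2024, 1, 1, 9, 40), 0.75),
    (datetime(2024, 1, 1, 10, 10), 0.5),
    (datetime(2024, 1, 1, 10, 55), 0.75),
]
alerts = failing_hours(production)
print(alerts)  # [datetime.datetime(2024, 1, 1, 10, 0)]
```

Here the 9:00 window averages 0.875 and passes, while the 10:00 window averages 0.625 and would trigger a notification.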

