Evaluate and improve your Retrieval-Augmented Generation (RAG) pipelines with open-rag-eval, an open-source Python evaluation toolkit.
Evaluating RAG quality can be complex. open-rag-eval provides a flexible and extensible framework to measure the performance of your RAG system, helping you identify areas for improvement. Its modular design allows easy integration of custom metrics and connectors for various RAG implementations.
Importantly, open-rag-eval's metrics do not require golden chunks or golden answers, making RAG evaluation easy and scalable. This is achieved by utilizing UMBRELA and AutoNuggetizer, techniques originating from research in Jimmy Lin's lab at the University of Waterloo.
Out-of-the-box, the toolkit includes:
- An implementation of the evaluation metrics used in the TREC-RAG benchmark.
- A connector for the Vectara RAG platform.
- Connectors for LlamaIndex and LangChain (more coming soon...)
- Standard Metrics: Provides TREC-RAG evaluation metrics ready to use.
- Modular Architecture: Easily add custom evaluation metrics or integrate with any RAG pipeline.
- Detailed Reporting: Generates per-query scores and intermediate outputs for debugging and analysis.
- Visualization: Compare results across different configurations or runs with plotting utilities.
This guide walks you through an end-to-end evaluation using the toolkit. We'll use Vectara as the example RAG platform and the TREC-RAG evaluator.
- Python: Version 3.9 or higher.
- OpenAI API Key: Required for the default LLM judge model used in some metrics. Set this as an environment variable:
export OPENAI_API_KEY='your-api-key'
- Vectara Account: To enable the Vectara connector, you need:
- A Vectara account.
- A corpus containing your indexed data.
- An API key with querying permissions.
- Your Customer ID and Corpus key.
To build the library from source (the recommended method if you want to follow the sample instructions below), run:
$ git clone https://github.com/vectara/open-rag-eval.git
$ cd open-rag-eval
$ pip install -e .
If you want to install directly from pip (the common method if you want to use the library in your own pipeline rather than running the samples), run:
pip install open-rag-eval
After installing the library, follow the instructions below to run a sample evaluation and test the library end to end.
Create a CSV file that contains the queries (for example, `queries.csv`). It should contain a single column named `query`, with each row representing a query you want to test against your RAG system.
Example queries file:
query
What is a blackhole?
How big is the sun?
How many moons does jupiter have?
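If you prefer to generate the queries file programmatically, a minimal sketch with Python's standard `csv` module (the queries below are the same examples as above):

```python
import csv

# Write a queries.csv file in the single-column format the toolkit expects.
queries = [
    "What is a blackhole?",
    "How big is the sun?",
    "How many moons does jupiter have?",
]

with open("queries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])           # header: a single column named "query"
    writer.writerows([q] for q in queries)
```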
Edit the eval_config_vectara.yaml file. This file controls the evaluation process, including connector options, evaluator choices, and metric settings.
- Ensure your queries file is listed under `input_queries`, and fill in the correct values for `generated_answers` and `eval_results_file`.
- Choose an output folder (where all artifacts will be stored) and put it under `results_folder`.
- Update the `connector` section (under `options`/`query_config`) with your Vectara `corpus_key`.
- Customize any Vectara query parameter to tailor this evaluation to a query configuration set.
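Putting those settings together, an illustrative sketch of the config file might look like the following. This is only a sketch assembled from the keys named in this guide; the exact nesting may differ, so consult the `config_examples/eval_config_vectara.yaml` shipped with the repo for the authoritative layout and full set of options.

```yaml
# Illustrative sketch only; file/folder names here are placeholders.
input_queries: queries.csv         # your queries file
results_folder: results/           # where all artifacts are stored
generated_answers: answers.csv     # hypothetical filename
eval_results_file: results.csv
connector:
  options:
    query_config:
      corpus_key: your-corpus-key  # your Vectara corpus key
```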
In addition, make sure you have `VECTARA_API_KEY` and `OPENAI_API_KEY` available in your environment. For example:
- export VECTARA_API_KEY='your-vectara-api-key'
- export OPENAI_API_KEY='your-openai-api-key'
With everything configured, it's time to run the evaluation. Run the following command from the root folder of open-rag-eval:
python open_rag_eval/run_eval.py --config config_examples/eval_config_vectara.yaml
You should see the evaluation progress on your command line. Once it's done, detailed results will be saved to a local CSV file (the file listed under `eval_results_file`), where you can see the score assigned to each sample along with intermediate output useful for debugging and explainability.
Note that a local plot for each evaluation is also stored in the output folder, under the filename listed as `metrics_file`.
If you are using RAG outputs from your own pipeline, make sure to put your RAG output in a format that is readable by the toolkit (see `data/test_csv_connector.csv` as an example).
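A minimal sketch of writing your own pipeline's outputs in that row-per-passage shape, using the column names described in the connector section later in this README (check `data/test_csv_connector.csv` in the repo for the authoritative header; the filename and sample content below are placeholders):

```python
import csv

# One row per retrieved passage; the generated answer is repeated on each
# row for the same query run. Column names follow the connector field list
# in this README.
rows = [
    {
        "query_id": "q1",
        "query": "What is a blackhole?",
        "query_run": "0",
        "passage_id": "1",
        "passage": "A black hole is a region of spacetime where gravity is extreme.",
        "generated_answer": "A black hole is a region of spacetime [1].",
    },
]

fieldnames = ["query_id", "query", "query_run",
              "passage_id", "passage", "generated_answer"]
with open("my_rag_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```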
Copy `eval_config_vectara.yaml` to `xxx_eval_config.yaml` (where `xxx` is the name of your RAG pipeline), then:
- Comment out or delete the connector section.
- Ensure `input_queries`, `results_folder`, `generated_answers`, and `eval_results_file` are properly configured. Specifically, the generated answers file must exist in the results folder.
With everything configured, run the evaluation with the following command:
python open_rag_eval/run_eval.py --config xxx_eval_config.yaml
You should see the evaluation progress on your command line. Once it's done, detailed results will be saved to a local CSV file where you can see the score assigned to each sample along with intermediate output useful for debugging and explainability.
Once your evaluation run is complete, you can visualize and explore the results in several convenient ways:
We highly recommend using the Open Evaluation Viewer for an intuitive and powerful visual analysis experience. You can drag and drop multiple reports to view them side by side for comparison.
Visit https://openevaluation.ai, upload your `results.csv` file, and enjoy:
* Dashboards of evaluation results.
* Query-by-query breakdowns.
* Easy comparison between different runs (upload multiple files).
* No setup required—fully web-based.
This is the easiest and most user-friendly way to explore detailed RAG evaluation metrics. To see an example of how the visualization works, go to the website and click on "Try our demo evaluation reports".
For those who prefer local or scriptable visualization, you can use the CLI plotting utility. Multiple different runs can be plotted on the same plot allowing for easy comparison of different configurations or RAG providers:
To plot a single result:
python open_rag_eval/plot_results.py --evaluator trec results.csv
Or to plot multiple results:
python open_rag_eval/plot_results.py --evaluator trec results_1.csv results_2.csv results_3.csv
The `--evaluator` argument must be specified to indicate which evaluator (`trec` or `consistency`) the plots should be generated for.
✅ Optional: `--metrics-to-plot`: a comma-separated list of metrics to include in the plot (e.g., `bert_score,rouge_score`).
By default, the `run_eval.py` script plots metrics and saves them to the results folder.
For an advanced local viewing experience, you can use the included Streamlit-based visualization app:
cd open_rag_eval/viz/
streamlit run visualize.py
Note that you will need to have streamlit installed in your environment (which should be the case if you've installed open-rag-eval).
The open-rag-eval framework follows these general steps during an evaluation:
- (Optional) Data Retrieval: If configured with a connector (like the Vectara connector), call the specified RAG provider with a set of input queries to generate answers and retrieve relevant document passages/contexts. If using pre-existing results (`input_results`), load them from the specified file.
- Evaluation: Use a configured Evaluator to assess the quality of the RAG results (query, answer, contexts). The Evaluator applies one or more Metrics.
- Scoring: Metrics calculate scores based on different quality dimensions (e.g., faithfulness, relevance, context utilization). Some metrics may employ judge Models (like LLMs) for their assessment.
- Reporting: Reporting is handled in two parts:
  - Evaluator-specific Outputs: Each evaluator implements a `to_csv()` method to generate a detailed CSV file containing scores and intermediate results for every query. Each evaluator also implements a `plot_metrics()` function, which generates visualizations specific to that evaluator's metrics. The `plot_metrics()` function can optionally accept a list of metrics to plot. This list may be provided by the evaluator's `get_metrics_to_plot()` function, allowing flexible and evaluator-defined plotting behavior.
  - Consolidated CSV Report: In addition to evaluator-specific outputs, a consolidated CSV is generated by merging selected columns from all evaluators. To support this, each evaluator must implement `get_consolidated_columns()`, which returns a list of column names from its results to include in the merged report. All rows are merged using `query_id` as the join key, so evaluators must ensure this column is present in their output.
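The consolidated merge on `query_id` can be sketched in plain Python. This is only an illustration of the join logic (the column names and scores are made up; the real implementation lives in the toolkit):

```python
# Each evaluator contributes rows keyed by "query_id"; the consolidated
# report combines the selected columns from all of them.
trec_rows = [{"query_id": "q1", "umbrela_score": 2.0},
             {"query_id": "q2", "umbrela_score": 3.0}]
consistency_rows = [{"query_id": "q1", "bert_score": 0.71},
                    {"query_id": "q2", "bert_score": 0.64}]

def merge_on_query_id(*tables):
    merged = {}
    for table in tables:
        for row in table:
            # Rows with the same query_id are merged into one record.
            merged.setdefault(row["query_id"], {}).update(row)
    return list(merged.values())

consolidated = merge_on_query_id(trec_rows, consistency_rows)
```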
- Metrics: Metrics are the core of the evaluation. Each metric has a different focus and evaluates a different aspect of the RAG system: the quality of the retrieval, the quality of the (augmented) generation, or the quality of the RAG system as a whole.
- Models: Models are the underlying judgment models used by some of the metrics to judge the quality of the RAG system. Models can be diverse: they may be LLMs, classifiers, rule-based systems, etc.
- RAGResult: Represents the output of a single run of a RAG pipeline — including the input query, retrieved contexts, and generated answer.
- MultiRAGResult: The main input to evaluators. It holds multiple RAGResult instances for the same query (e.g., different generations or retrievals) and allows comparison across these runs to compute metrics like consistency.
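The shapes of these two containers can be sketched with dataclasses. This is only an illustration mirroring the descriptions above, not the library's actual class definitions (field names here are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RAGResult:
    # One run of a RAG pipeline: the query, what was retrieved, and the answer.
    query: str
    retrieved_passages: List[str]
    generated_answer: str

@dataclass
class MultiRAGResult:
    # Several RAGResult instances for the same query, e.g. repeated generations,
    # so cross-run metrics like consistency can be computed.
    query: str
    results: List[RAGResult] = field(default_factory=list)

multi = MultiRAGResult(
    query="How big is the sun?",
    results=[
        RAGResult("How big is the sun?",
                  ["The sun's radius is about 696,000 km."],
                  "About 696,000 km in radius [1]."),
        RAGResult("How big is the sun?",
                  ["The sun's radius is about 696,000 km."],
                  "Roughly 1.39 million km in diameter [1]."),
    ],
)
```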
- Evaluators: Evaluators compute quality metrics for RAG systems. The framework currently supports two built-in evaluators:
  - TRECEvaluator: Evaluates each query independently using retrieval and generation metrics such as UMBRELA, HHEM Score, and others. Returns a `MultiScoredRAGResult`, which holds a list of `ScoredRAGResult` objects, each containing the original `RAGResult` along with the scores assigned by the evaluator and its metrics.
  - ConsistencyEvaluator: Evaluates the consistency of a model's responses across multiple generations for the same query. It currently uses two default metrics:
    - BERTScore: This metric evaluates the semantic similarity between generations using the multilingual `xlm-roberta-large` model, which supports over 100 languages. In this evaluator, BERTScore is computed with baseline rescaling enabled (`rescale_with_baseline=True` by default), which normalizes the similarity scores by subtracting language-specific baselines. This adjustment helps produce more interpretable and comparable scores across languages, reducing the inherent bias that transformer models often exhibit toward unrelated sentence pairs. If a language-specific baseline is not available, the evaluator logs a warning and automatically falls back to raw BERTScore values, ensuring robustness.
    - ROUGE-L: This metric measures the longest common subsequence (LCS) between two sequences of text, capturing fluency and in-sequence overlap without requiring exact n-gram matches. In this evaluator, ROUGE-L is computed without stemming or tokenization, making it most reliable for English-only evaluations. Its accuracy may degrade for other languages due to the lack of language-specific segmentation and preprocessing. As such, it complements BERTScore by providing a syntactic alignment signal in English-language scenarios.
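The LCS computation at the heart of ROUGE-L can be sketched in a few lines. This is a simplified illustration with naive whitespace tokenization, not the toolkit's own implementation:

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    # Naive whitespace tokenization, no stemming (as described above).
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the sun is very large", "the sun is large"))  # ≈ 0.889
```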
For programmatic integration, the framework provides a Flask-based web server.
Endpoints:
- `/api/v1/evaluate`: Evaluate a single RAG output provided in the request body.
- `/api/v1/evaluate_batch`: Evaluate multiple RAG outputs in a single request.
Run the Server:
python open_rag_eval/run_server.py
See the API README for detailed API documentation.
Open-RAG-Eval uses a plug-in connector architecture to enable testing various RAG platforms. Out of the box it includes connectors for Vectara, LlamaIndex, and LangChain.
Here's how connectors work:
- All connectors are derived from the `Connector` class and need to define the `fetch_data` method.
- The `Connector` class has a utility method called `read_queries`, which is helpful for reading the input queries.
- When implementing `fetch_data`, you simply go through the queries one by one and call the RAG system with each query (repeating each query as specified by the `repeat_query` setting in the connector configuration).
- The output is stored in the `results` file, with N rows per query, where N is the number of passages (or chunks). Each row includes these fields:
  - `query_id`: a unique ID for the query
  - `query`: the actual query text string
  - `query_run`: an identifier for the specific run of the query (useful when you execute the same query multiple times based on the `repeat_query` setting in the connector)
  - `passage`: the passage (aka chunk)
  - `passage_id`: a unique ID for this passage (you can use just the passage number as a string)
  - `generated_answer`: the text of the generated response or answer from your RAG pipeline, including citations in [N] format
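The steps above can be sketched as a custom connector. To keep the example self-contained, a minimal stand-in for the library's `Connector` base class is defined locally (the real one lives in open-rag-eval and has a richer interface), and `rag_fn` is a hypothetical callable standing in for your RAG system:

```python
import csv
from typing import Callable, List, Tuple

class Connector:
    # Minimal stand-in for the library's base class, for illustration only.
    def read_queries(self, path: str) -> List[str]:
        with open(path, newline="", encoding="utf-8") as f:
            return [row["query"] for row in csv.DictReader(f)]

class MyRAGConnector(Connector):
    def __init__(self, rag_fn: Callable[[str], Tuple[List[str], str]],
                 repeat_query: int = 1):
        self.rag_fn = rag_fn            # returns (passages, generated_answer)
        self.repeat_query = repeat_query

    def fetch_data(self, queries_file: str, results_file: str) -> None:
        fields = ["query_id", "query", "query_run",
                  "passage_id", "passage", "generated_answer"]
        with open(results_file, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            for qid, query in enumerate(self.read_queries(queries_file)):
                for run in range(self.repeat_query):   # honor repeat_query
                    passages, answer = self.rag_fn(query)
                    for pid, passage in enumerate(passages):
                        # One row per retrieved passage.
                        writer.writerow({
                            "query_id": f"q{qid}",
                            "query": query,
                            "query_run": str(run),
                            "passage_id": str(pid),
                            "passage": passage,
                            "generated_answer": answer,
                        })
```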
See the example results file included in the repository for reference.
All three existing connectors (Vectara, LangChain, and LlamaIndex) provide a good reference for how to implement a connector.
👤 Vectara
- Website: vectara.com
- Twitter: @vectara
- GitHub: @vectara
- LinkedIn: @vectara
- Discord: @vectara
Contributions, issues and feature requests are welcome and appreciated!
Feel free to check the issues page. You can also take a look at the contributing guide.
Give a ⭐️ if this project helped you!
Copyright © 2025 Vectara.
This project is Apache 2.0 licensed.