RankLLM: A Python Package for Reranking with Large Language Models

In the realm of information retrieval, the ability to accurately and efficiently identify the most relevant documents to a user’s query from a vast corpus is of paramount importance. Over the years, significant advancements have been made in this field, with the emergence of large language models (LLMs) bringing about a paradigm shift. These powerful models have shown remarkable potential in enhancing the effectiveness of document reranking. Today, I am excited to introduce RankLLM, an open-source Python package developed by researchers at the University of Waterloo. RankLLM serves as a bridge between various advanced LLMs and document reranking tasks, empowering developers to leverage these models and boost the performance of retrieval systems with ease.

What is RankLLM?

RankLLM is an open-source Python package designed to facilitate reranking using LLMs in multi-stage retrieval systems. It offers a highly modular and configurable architecture, supporting a wide range of proprietary and open-source LLMs. This means that developers can seamlessly integrate different LLMs into customized reranking workflows, regardless of whether they are proprietary models like OpenAI’s GPT series or open-source alternatives such as LLaMA, Vicuna, Zephyr, and Mistral.

The package is not only compatible with common inference frameworks but also integrates with Pyserini for retrieval, providing users with a convenient tool for information retrieval. Additionally, RankLLM includes built-in evaluation tools for multi-stage pipelines and a module for detailed analysis of input prompts and LLM responses. These features address reliability concerns related to LLM APIs and the non-deterministic behavior of Mixture-of-Experts (MoE) models. With RankLLM, users can reproduce results from recent models like RankGPT, LRL, RankVicuna, RankZephyr, and more. Its compatibility with a wide array of LLMs enables quick reproduction of reported results, accelerating both research and real-world applications.

Why Do We Need RankLLM?

Traditional document retrieval processes typically involve a two-step procedure. First, a fast and scalable retriever, such as BM25 or dense representations like DPR, proposes a candidate set of documents. Then, in the reranking stage, more computationally expensive models, particularly transformer-based architectures, reassess and reorder the candidate list to enhance effectiveness. However, research in the space of rerankers has largely relied on ad hoc and disparate implementations. This has introduced complexity when comparing different approaches and hindered rapid exploration of the design space.

RankLLM addresses these challenges by providing a unified framework that integrates various reranking models. It simplifies the process of experimenting with and comparing different reranking techniques, thereby accelerating the development of this field. By leveraging the power of LLMs, RankLLM enables more nuanced and context-aware reranking, leading to more accurate and relevant retrieval results.

Core Features of RankLLM

Now, let’s delve into the core features of RankLLM that make it a game-changer in the document reranking landscape.

Diverse Model Support

One of RankLLM’s standout features is its extensive support for a variety of LLMs. Whether you prefer proprietary models like OpenAI’s GPT-4 or open-source alternatives such as Vicuna or Zephyr, RankLLM has you covered. This diverse model support allows users to choose the most suitable model based on their specific requirements and resource constraints. For instance, open-source models like LLaMA and its variants offer a cost-effective solution, while proprietary models like GPT-4 provide cutting-edge performance.

Modular and Configurable Architecture

RankLLM’s modular design is a testament to its flexibility. The package is composed of multiple interconnected modules, each serving a specific purpose. This allows users to customize the reranking process by selecting different ranking methods, LLMs, inference frameworks, and prompt templates. The configurability of RankLLM extends to every aspect of the reranking workflow, from the initial retrieval stage to the final evaluation and analysis.
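
To make this concrete, the short snippet below swaps the LLM behind the reranker without changing any downstream code. It uses the same classes that appear in the Quick Start example later in this post, so only the commented-out alternative requires an OpenAI API key.

from rank_llm.rerank import Reranker, get_openai_api_key
from rank_llm.rerank.listwise import SafeOpenai, ZephyrReranker

# Option 1: an open-source listwise reranker that runs locally.
reranker = ZephyrReranker()

# Option 2: a proprietary model wrapped behind the same Reranker interface.
# model_coordinator = SafeOpenai("gpt-4o-mini", 4096, keys=get_openai_api_key())
# reranker = Reranker(model_coordinator)

# Downstream code stays the same either way:
# rerank_results = reranker.rerank_batch(requests=retrieved_results)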

End-to-End Workflow Components

To streamline the entire reranking process, RankLLM offers several auxiliary components:

  • Retrieval Component: Integrated with Pyserini, this component enables users to retrieve relevant documents for a given query from a specified corpus using various retrieval methods. It supports datasets like the TREC 2019–2023 Deep Learning Tracks, BEIR, and Mr. TyDi, among others.
  • Evaluation and Analysis Component: This component allows users to evaluate the effectiveness of reranking results using standard metrics such as nDCG and mAP. It also provides tools for analyzing LLM responses, including identifying malformed responses and counting errors.
  • Training Component: For users who wish to fine-tune LLMs for custom reranking needs, RankLLM includes a training module. It builds on the Hugging Face Transformers library and supports distributed fine-tuning with Hugging Face Accelerate and DeepSpeed ZeRO-3 for memory optimization.

Reproducibility Features

Reproducibility is a cornerstone of scientific research, and RankLLM places great emphasis on this aspect. It offers predefined configurations, comprehensive logging, demo snippets, detailed docstrings, and README instructions. The two-click reproducibility (2CR) feature ensures that users can easily reproduce experimental results with minimal effort. This transparency and reproducibility not only facilitate knowledge sharing within the research community but also enable users to build upon existing work and advance the field.

Installation and Usage of RankLLM

Getting started with RankLLM is straightforward. Below are the steps to install and use this powerful package.

Installing RankLLM

RankLLM can be installed in two ways:

  • Via PyPI: RankLLM is available on the Python Package Index and can be installed using the following pip command:
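
For example (assuming the PyPI distribution is named rank-llm, mirroring the rank_llm import used throughout this post):

# Package name assumed to be rank-llm; see rankllm.ai if it differs.
pip install rank-llm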

  • From Source: RankLLM can also be installed from its source repository, which is publicly available at rankllm.ai. This option is recommended for users who are interested in contributing to RankLLM. The installation steps are as follows:
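
A typical from-source setup looks like the sketch below; the repository URL and editable install are assumptions based on the project's public GitHub organization, so defer to the instructions on rankllm.ai:

# Clone the repository (URL assumed) and install it in editable mode.
git clone https://github.com/castorini/rank_llm.git
cd rank_llm
pip install -e .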

Before installing RankLLM, ensure that you have a CUDA-enabled PyTorch installed that is compatible with your specific GPU configuration. Once RankLLM is installed, you can follow the retrieve and rerank pipeline as described in the subsequent sections.

Environment Setup

To fully utilize RankLLM’s features, you need to set up the following environment:

  • Install JDK 21: Since RankLLM relies on Anserini, JDK 21 must be installed on your system. JDK 11 is not supported and may lead to errors.

  • Create a Conda Environment: It is recommended to create a Conda environment for RankLLM to avoid dependency conflicts with other projects. You can create and activate a Conda environment using the following commands:
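
For example (the environment name and Python version below are only illustrative choices):

# Create and activate an isolated environment for RankLLM.
conda create -n rankllm python=3.10
conda activate rankllm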

  • Install PyTorch with CUDA: Depending on your operating system (Windows or Linux), install PyTorch with CUDA support using pip. For example, for CUDA 12.1, you can use the following command:
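
A typical command for the CUDA 12.1 wheels looks like this (adjust the index URL to match your CUDA version and platform):

# Install PyTorch built against CUDA 12.1 from the official wheel index.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121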

  • Install OpenJDK and Maven: If you plan to use RankLLM’s retrieval functionality, install OpenJDK 21 and Maven via Conda using the following command:
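
For example, from the conda-forge channel:

# OpenJDK 21 and Maven are needed for Anserini/Pyserini-based retrieval.
conda install -c conda-forge openjdk=21 maven -y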

If you wish to use RankLLM’s retriever component, you also need to install the retriever dependencies:

  • Install Retriever Dependencies: Run the following command to install the dependencies required for the retriever component:
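
A sketch of this step, assuming the retriever dependencies are exposed as a pyserini extras group (check the project README for the exact name):

# Extras group name assumed; consult the RankLLM README for the exact spelling.
pip install "rank-llm[pyserini]"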

  • Install SGLang or TensorRT-LLM (Optional): If you want to use the SGLang or TensorRT-LLM inference backends, you can install them using the following commands:
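
For example, assuming these backends are exposed as optional extras (the extras names below are assumptions; both backends can also be installed by following their own documentation):

# Extras names assumed; SGLang and TensorRT-LLM have their own install guides.
pip install "rank-llm[sglang]"
pip install "rank-llm[tensorrt-llm]"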

Quick Start Example

Here’s a quick example to help you get started with RankLLM. This example demonstrates how to perform retrieval, reranking, evaluation, and invocation analysis for queries from the DL19 dataset, retrieving the top 100 documents for each query using BM25 as the retriever and RankZephyr as the reranker.

from pathlib import Path

from rank_llm.analysis.response_analysis import ResponseAnalyzer
from rank_llm.data import DataWriter
from rank_llm.evaluation.trec_eval import EvalFunction
from rank_llm.rerank import Reranker, get_openai_api_key
from rank_llm.rerank.listwise import (
    SafeOpenai,
    VicunaReranker,
    ZephyrReranker,
)
from rank_llm.retrieve.retriever import Retriever
from rank_llm.retrieve.topics_dict import TOPICS

# -------- Retrieval --------

# By default, BM25 is used to retrieve the top 100 candidates.
dataset_name = "dl19"
retrieved_results = Retriever.from_dataset_with_prebuilt_index(dataset_name)

# Users can specify other retrieval methods and the number of retrieved candidates.
# retrieved_results = Retriever.from_dataset_with_prebuilt_index(
#     dataset_name, RetrievalMethod.SPLADE_P_P_ENSEMBLE_DISTIL, k=50
# )
# ---------------------------

# --------- Rerank ----------

# Rank Zephyr model
reranker = ZephyrReranker()

# Rank Vicuna model
# reranker = VicunaReranker()

# RankGPT
# model_coordinator = SafeOpenai("gpt-4o-mini", 4096, keys=get_openai_api_key())
# reranker = Reranker(model_coordinator)

rerank_results = reranker.rerank_batch(requests=retrieved_results)
# ---------------------------

# ------- Evaluation --------

# Evaluate retrieved results.
ndcg_10_retrieved = EvalFunction.from_results(retrieved_results, TOPICS[dataset_name])
print(ndcg_10_retrieved)

# Evaluate rerank results.
ndcg_10_rerank = EvalFunction.from_results(rerank_results, TOPICS[dataset_name])
print(ndcg_10_rerank)

# By default, ndcg@10 is the evaluation metric; other metrics can be specified.
# eval_args = ["-c", "-m", "map_cut.100", "-l2"]
# map_100_rerank = EvalFunction.from_results(rerank_results, TOPICS[dataset_name], eval_args)
# print(map_100_rerank)

# eval_args = ["-c", "-m", "recall.20"]
# recall_20_rerank = EvalFunction.from_results(rerank_results, TOPICS[dataset_name], eval_args)
# print(recall_20_rerank)

# ---------------------------

# --- Analyze invocations ---
analyzer = ResponseAnalyzer.from_inline_results(rerank_results)
error_counts = analyzer.count_errors(verbose=True)
print(error_counts)
# ---------------------------

# ------ Save results -------
writer = DataWriter(rerank_results)
Path(f"demo_outputs/").mkdir(parents=True, exist_ok=True)
writer.write_in_jsonl_format(f"demo_outputs/rerank_results.jsonl")
writer.write_in_trec_eval_format(f"demo_outputs/rerank_results.txt")
writer.write_inference_invocations_history(
    f"demo_outputs/inference_invocations_history.json"
)
# ---------------------------

In this example, we first use BM25 to retrieve the top 100 candidate documents for each query in the DL19 dataset. Then, we apply the RankZephyr model to rerank these candidates. The evaluation results show the improvement in nDCG@10 after reranking. Additionally, we analyze the model’s invocation responses to identify any errors and save the reranking results in different formats for further analysis.

Model Zoo in RankLLM

RankLLM supports a wide range of reranking models, including pointwise, pairwise, and listwise models. Below is a list of some of the models available in RankLLM:

Pointwise Reranking Models: MonoT5

MonoT5 is a pointwise reranking model based on the T5 architecture. It scores each query–document pair independently, deriving a relevance score from the probability that the model generates a "true" (relevant) rather than "false" (not relevant) token. In RankLLM, the MonoT5 model family is used for pointwise reranking tasks; its simplicity and efficiency make it suitable for large-scale document collections.
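
To make the pointwise recipe concrete, here is a minimal, self-contained sketch of MonoT5-style scoring written directly against Hugging Face Transformers rather than RankLLM's API; the prompt format and true/false scoring follow the published MonoT5 approach, and the checkpoint name refers to the public castorini release.

# Illustrative sketch of MonoT5-style pointwise scoring (not RankLLM's API).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
model.eval()

query = "what does a reranker do?"
doc = "Reranking reorders the candidate list produced by a first-stage retriever."

# MonoT5 is prompted with "Query: ... Document: ... Relevant:" and scored by the
# probability of generating "true" versus "false" as the first output token.
inputs = tokenizer(f"Query: {query} Document: {doc} Relevant:", return_tensors="pt")
decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id)
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]

true_id = tokenizer.encode("true", add_special_tokens=False)[0]
false_id = tokenizer.encode("false", add_special_tokens=False)[0]
score = torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()
print(f"relevance score: {score:.4f}")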

Pairwise Reranking Models: DuoT5

DuoT5 is designed for pairwise reranking. It compares pairs of documents with respect to a query and produces a relative relevance judgment for each pair. By exploiting this relative information, DuoT5 can achieve higher reranking accuracy than pointwise models; however, because every pair of candidates must be scored, its cost grows quadratically with the size of the candidate list.
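
For intuition, the sketch below shows how pairwise preferences can be aggregated into a ranking. It is not RankLLM's implementation: pair_score is a hypothetical stand-in for the model's probability that the first document is more relevant than the second.

def duo_aggregate(query, docs, pair_score):
    # Accumulate symmetric evidence from every ordered pair of candidates.
    scores = {doc: 0.0 for doc in docs}
    for i, d_i in enumerate(docs):
        for j, d_j in enumerate(docs):
            if i == j:
                continue
            p = pair_score(query, d_i, d_j)  # hypothetical: P(d_i beats d_j)
            scores[d_i] += p
            scores[d_j] += 1.0 - p
    # Documents that win more comparisons rise to the top.
    return sorted(docs, key=lambda doc: scores[doc], reverse=True)

# Toy usage with a stub scorer that simply prefers longer documents.
docs = ["short text", "a somewhat longer passage", "the longest passage of the three candidates"]
print(duo_aggregate("toy query", docs, lambda q, a, b: float(len(a) > len(b))))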

Listwise Reranking Models

Listwise reranking models have gained significant attention in recent years. They process the entire candidate document list at once, capturing the interdependencies between documents to produce a more optimized ranking. RankLLM includes several listwise reranking models, such as LiT5, RankVicuna, and RankZephyr; a minimal sketch of the underlying prompting pattern follows the list below.

  • LiT5: LiT5 is a listwise reranking model based on the T5 architecture. With specialized prompts and training methodologies, it can effectively rerank document lists. LiT5 offers various variants to meet different needs, including models of different sizes and training configurations.
  • RankVicuna: RankVicuna is a listwise reranking model built on the open-source Vicuna model. It leverages Vicuna’s powerful language understanding and generation capabilities, combined with carefully designed prompts, to achieve high-quality document list reranking. The open-source nature of RankVicuna allows researchers and developers to easily customize and extend it.
  • RankZephyr: RankZephyr is another open-source listwise reranking model that has demonstrated excellent performance and robustness across multiple tasks. Through effective model architecture and training strategies, it achieves a balance between reranking quality and operational efficiency.
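
As noted above, these models share a common listwise prompting pattern: the candidates are numbered inside a single prompt, the model emits a permutation, and the permutation is parsed back into a reordered list. The sketch below illustrates that pattern with a placeholder response; the wording is illustrative and not RankLLM's actual prompt template.

query = "what does a reranker do?"
docs = [
    "BM25 is a bag-of-words ranking function.",
    "Zephyr is an open-source chat model.",
    "Reranking reorders the candidates from a first-stage retriever.",
]

# Number the candidates inside one prompt and ask for a permutation.
numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
prompt = (
    f"Rank the following {len(docs)} passages by relevance to the query.\n"
    f"Query: {query}\n{numbered}\n"
    "Answer with identifiers in decreasing order of relevance, e.g. [2] > [1] > [3]."
)

# Placeholder for an actual LLM call; a listwise reranker returns a permutation.
response = "[3] > [1] > [2]"
order = [int(tok.strip().strip("[]")) - 1 for tok in response.split(">")]
reranked = [docs[i] for i in order]
print(reranked)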

In addition to these models, RankLLM also supports other proprietary and open-source listwise reranking models, such as Gemini, providing users with a wealth of choices.

Experimental Results of RankLLM

To validate the effectiveness of RankLLM, extensive experiments were conducted on the MS MARCO V1 and V2 passage corpora across the DL19–DL23 datasets. The results showed that using various reranking models in RankLLM could significantly improve the relevance of retrieval results compared to traditional BM25 retrieval methods.

In the experiments, BM25 was used as the first-stage retriever to obtain the top 100 candidate documents for each query. Subsequently, different reranking models in RankLLM were employed to rerank these candidates. The experimental results revealed that in multiple datasets, the reranked results demonstrated substantial improvements in the nDCG@10 metric.

For instance, on the DL19 dataset, reranking with the MonoT5 model improved the nDCG@10 to 0.7174, while reranking with the RankZephyr model achieved an even higher nDCG@10 of 0.7412. This indicates that listwise reranking models in RankLLM can better capture the relationships between documents, leading to superior ranking outcomes.

However, the experiments also uncovered some issues. Some out-of-the-box prompt-decoder models produced malformed responses, such as incorrect formats, repetitions, or missing documents. For example, between 28% and 75% of the GPT-4o-mini model's responses were missing candidate document IDs. Nevertheless, RankLLM's error handling and response post-processing mechanisms mitigated these issues to a certain extent, enabling these models to still deliver competitive results.

Advantages and Applications of RankLLM

The advantages of RankLLM are multi-fold:

  • Unified Framework: By integrating various reranking models into a single framework, RankLLM simplifies experimentation and comparison, saving researchers and developers significant time and effort.
  • Flexible Configuration: RankLLM offers extensive configuration options, allowing users to tailor models, prompts, and inference frameworks to their specific needs, thereby achieving optimal reranking performance.
  • Strong Extensibility: The modular design of RankLLM facilitates the addition of new models, prompts, and inference backends, enabling it to keep pace with advancements in the LLM field and incorporate new technological breakthroughs.
  • Comprehensive Analysis and Evaluation Tools: The invocation analysis and evaluation components in RankLLM provide users with deep insights into model behavior and reranking results, offering a basis for further model optimization.

RankLLM can be applied in various information retrieval scenarios, such as optimizing search engine results, document retrieval and ranking in question-answering systems, and content recommendation in recommender systems. In these applications, RankLLM leverages the power of LLMs to enhance the relevance of retrieval results and improve user experience.

Future Development Directions of RankLLM

RankLLM has already achieved notable success, but its development is far from complete. Looking ahead, the RankLLM development team plans to expand the range of supported models by integrating more advanced LLMs. They also aim to further optimize the package’s performance to enhance its efficiency in large-scale data processing and real-time applications.

Moreover, RankLLM will strengthen its integration with other information retrieval and RAG frameworks to provide users with more comprehensive end-to-end solutions. Through close collaboration with the community, RankLLM is expected to continuously improve and evolve, becoming an indispensable tool in the field of information retrieval.

Conclusion

RankLLM is a powerful, flexible, and user-friendly Python package that offers a comprehensive solution for document reranking using LLMs. It empowers researchers and developers to effortlessly explore and apply various reranking models, thereby improving the performance of information retrieval systems. By lowering the barriers to entry in this field, RankLLM drives technological advancement and innovation in information retrieval.

If you are passionate about information retrieval and LLMs, I highly recommend giving RankLLM a try. You can visit the official RankLLM website at rankllm.ai to access more information and resources, and embark on your reranking journey. Whether you are a seasoned researcher or a developer looking to enhance your retrieval system’s performance, RankLLM is poised to deliver significant value.
