DeepEval: Your Ultimate Open-Source Framework for Large Language Model Evaluation

In the rapidly evolving landscape of artificial intelligence, large language models (LLMs) are becoming increasingly powerful and versatile. However, with this advancement comes the critical need for robust evaluation frameworks to ensure these models meet the desired standards of accuracy, relevance, and safety. DeepEval emerges as a simple-to-use, open-source evaluation framework specifically designed for LLMs, offering a comprehensive suite of metrics and features to thoroughly assess LLM systems.

DeepEval is akin to Pytest but specialized for unit testing LLM outputs. It incorporates the latest research to evaluate LLM outputs on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, and more. What sets DeepEval apart is that evaluations run locally on your machine, using any LLM along with a range of other NLP models. Whether your LLM application is a RAG pipeline, a chatbot, or an AI agent built with LangChain or LlamaIndex, DeepEval gives you the coverage to determine the optimal models, prompts, and architecture: to improve your RAG pipeline, prevent prompt drift, or move from OpenAI to hosting your own DeepSeek R1 with confidence.

[!IMPORTANT]
Need a place for your DeepEval testing data to live? Sign up for the DeepEval platform to compare iterations of your LLM app, generate & share testing reports, and more.

Want to talk LLM evaluation, need help picking metrics, or just to say hi? Come join our Discord.

🔥 Metrics and Features

DeepEval offers a rich set of metrics and features to meet diverse evaluation needs:

Comprehensive Evaluation Metrics

DeepEval supports both end-to-end and component-level LLM evaluation. It provides a wide variety of ready-to-use LLM evaluation metrics, all powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine (a short usage sketch follows the list below):

  • G-Eval: A research-backed metric for evaluating LLM outputs against custom criteria with human-like accuracy.
  • DAG (Deep Acyclic Graph): Builds deterministic, decision-tree-style evaluations of custom criteria on top of LLM-as-a-judge decisions.
  • RAG Metrics:

    • Answer Relevancy: Measures how relevant the answer is to the input query.
    • Faithfulness: Assesses whether the answer factually aligns with the retrieved context.
    • Contextual Recall: Evaluates whether the retrieved context contains all the information needed to produce the expected output.
    • Contextual Precision: Checks whether relevant pieces of the retrieved context are ranked above irrelevant ones.
    • Contextual Relevancy: Assesses how relevant the retrieved context is to the input.
    • RAGAS: A comprehensive metric for Retrieval-Augmented Generation systems.
  • Agentic Metrics:

    • Task Completion: Evaluates whether the AI agent successfully completes the assigned task.
    • Tool Correctness: Assesses the correct usage of tools by the AI agent.
  • Other Metrics:

    • Hallucination: Detects whether the output contains information that contradicts or is unsupported by the provided context.
    • Summarization: Evaluates the quality of summarization tasks.
    • Bias: Assesses the presence of bias in model outputs.
    • Toxicity: Detects harmful or inappropriate content in responses.
  • Conversational Metrics:

    • Knowledge Retention: Measures how well the model retains knowledge throughout a conversation.
    • Conversation Completeness: Evaluates whether the conversation covers all relevant aspects.
    • Conversation Relevancy: Assesses the relevance of each response in the conversation.
    • Role Adherence: Ensures the model stays true to its designated role in the conversation.
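
As a quick illustration of how these ready-to-use metrics are applied, the following sketch scores a single test case with two RAG metrics; class names follow the deepeval.metrics API as we understand it, so double-check them against the docs for your installed version:

from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case can be scored by any combination of metrics.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

faithfulness = FaithfulnessMetric(threshold=0.7)                  # answer vs. retrieved context
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)   # retrieved context vs. input

for metric in (faithfulness, contextual_relevancy):
    metric.measure(test_case)
    print(metric.score, metric.reason)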

Custom Metrics and Synthetic Datasets

DeepEval allows you to build your own custom metrics that seamlessly integrate with its ecosystem. You can also generate synthetic datasets for evaluation, providing flexibility to address specific evaluation scenarios that may not be covered by existing metrics.
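
To give a sense of the rough shape of a custom metric, here is a minimal sketch that assumes the BaseMetric interface (a measure method, an async a_measure counterpart, and an is_successful check). The LengthMetric below is purely illustrative, and the exact attribute names should be verified against the docs for your version:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    """Toy custom metric: passes if the output is reasonably concise."""

    def __init__(self, threshold: float = 0.5, max_chars: int = 280):
        self.threshold = threshold
        self.max_chars = max_chars

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 for outputs within the character limit, scaled down as they grow longer.
        length = len(test_case.actual_output or "")
        self.score = min(1.0, self.max_chars / max(length, 1))
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"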

CI/CD Integration and Red Teaming

DeepEval integrates seamlessly with ANY CI/CD environment, making it easy to incorporate evaluation into your development workflow. Additionally, it supports red teaming your LLM application to test for 40+ safety vulnerabilities in just a few lines of code, including toxicity, bias, SQL injection, etc., using advanced attack enhancement strategies such as prompt injections.

Benchmarking and Platform Integration

DeepEval enables you to easily benchmark ANY LLM against popular LLM benchmarks in under 10 lines of code, including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, GSM8K, and more. It also offers 100% integration with Confident AI, providing a full evaluation lifecycle solution.
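
A benchmark run might look like the sketch below. It assumes the MMLU benchmark class and MMLUTask enum are available in your installed version, and my_custom_llm is a hypothetical placeholder for a model you have wrapped to follow deepeval's custom-model (DeepEvalBaseLLM) interface:

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# Evaluate only a subset of MMLU tasks with 3-shot prompting.
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3
)

# my_custom_llm is a placeholder for your own wrapped model.
benchmark.evaluate(model=my_custom_llm)
print(benchmark.overall_score)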

🔌 Integrations

DeepEval integrates with various frameworks to enhance its evaluation capabilities:

  • LlamaIndex: Allows unit testing of RAG applications in CI/CD.
  • Hugging Face: Enables real-time evaluations during LLM fine-tuning.

🚀 QuickStart

Let’s explore how to get started with DeepEval using a practical example. Suppose you have developed a customer support chatbot based on RAG, and you want to evaluate its performance using DeepEval.

Installation

First, install DeepEval using pip:

pip install -U deepeval

Create an Account (Recommended)

Using the DeepEval platform is highly recommended as it allows you to generate sharable testing reports on the cloud. It’s free and requires no additional code setup. To log in, run:

deepeval login

Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged. You can find more information on data privacy here.

Writing Your First Test Case

Create a test file named test_chatbot.py:

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])

Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model; for more details, visit this part of our docs):

export OPENAI_API_KEY="..."

Run the test file in the CLI:

deepeval test run test_chatbot.py

Congratulations! Your test case should have passed ✅ Let’s break down what happened:

  • The input variable mimics a user input, and actual_output is a placeholder for what your application’s supposed to output based on this input.
  • The expected_output variable represents the ideal answer for a given input, and GEval is a research-backed metric provided by deepeval for evaluating your LLM output on any custom criteria with human-like accuracy.
  • In this example, the metric criterion is the correctness of the actual_output relative to the provided expected_output.
  • All metric scores range from 0 – 1, and the threshold=0.5 determines if your test has passed or not.

You can find more information on additional options for end-to-end evaluation, using additional metrics, creating custom metrics, and integrating with other tools like LangChain and LlamaIndex in our documentation.

Evaluating Nested Components

To evaluate individual components within your LLM app, you can run component-level evaluations. This powerful feature allows you to evaluate any component within an LLM system by tracing components such as LLM calls, retrievers, tool calls, and agents using the @observe decorator:

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import Golden
from deepeval.metrics import GEval
from deepeval import evaluate

correctness = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]
)

@observe(metrics=[correctness])
def inner_component():
    update_current_span(test_case=LLMTestCase(input="...", actual_output="..."))
    return

@observe
def llm_app(input: str):
    inner_component()
    return

evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])

You can learn more about component-level evaluations here.

Evaluating Without Pytest Integration

If you prefer not to use Pytest integration, you can evaluate directly in a notebook environment:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])

Using Standalone Metrics

DeepEval’s modular design makes it easy to use any of its metrics independently. Continuing from the previous example:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)

Note that some metrics are designed for RAG pipelines, while others are for fine-tuning. Be sure to consult our documentation to select the appropriate metrics for your use case.

Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here’s how you can evaluate them in bulk:

import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

Run the tests in the CLI:

deepeval test run test_<filename>.py

You can also add the -n flag to run tests in parallel.
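
For example, to spread test cases across four parallel processes (the worker count here is arbitrary):

deepeval test run test_<filename>.py -n 4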

Alternatively, you can evaluate a dataset/test cases without using our Pytest integration:

from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])

LLM Evaluation With Confident AI

The DeepEval platform, also known as Confident AI, provides a comprehensive solution for the entire LLM evaluation lifecycle:

  1. Curate/Annotate Evaluation Datasets on the Cloud: Easily create and annotate evaluation datasets on the Confident AI platform.
  2. Benchmark LLM App: Use your dataset to benchmark your LLM app and compare it with previous iterations to determine which models and prompts work best.
  3. Fine-Tune Metrics: Customize metrics to achieve the desired evaluation results.
  4. Debug Evaluation Results: Utilize LLM traces to debug evaluation results and gain deeper insights into model performance.
  5. Monitor & Evaluate LLM Responses: Monitor and evaluate LLM responses in production to improve datasets with real-world data.
  6. Repeat Until Perfection: Continuously refine your models and datasets until you achieve the desired level of performance.

To get started with Confident AI, log in from the CLI:

deepeval login

Follow the instructions to create an account and paste your API key into the CLI. Then, run your test file again:

deepeval test run test_chatbot.py

Once the test is complete, a link will be displayed in the CLI. Paste it into your browser to view the results.

Contributing

We welcome contributions to DeepEval. Please read our CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

Roadmap

Here are some of the features we have planned for DeepEval:

  • [x] Integration with Confident AI
  • [x] Implement G-Eval
  • [x] Implement RAG metrics
  • [x] Implement Conversational metrics
  • [x] Evaluation Dataset Creation
  • [x] Red-Teaming
  • [ ] DAG custom metrics
  • [ ] Guardrails

Authors

DeepEval is built by the founders of Confident AI. For inquiries, please contact jeffreyip@confident-ai.com.

License

DeepEval is licensed under Apache 2.0. You can find more details in the LICENSE.md file.

In the ever-growing field of artificial intelligence, DeepEval stands out as a powerful and flexible tool for evaluating LLMs. By providing a comprehensive set of metrics, features, and integrations, it empowers developers to build better, safer, and more reliable LLM applications. Whether you’re working on a chatbot, a RAG pipeline, or any other LLM-based system, DeepEval is your go-to framework for ensuring your models perform at their best.