DeepEyesV2: Building an Agentic Multimodal Model

DeepEyesV2 Logo

Enabling AI to Not Just “See” but Integrate Visual Information into Reasoning


Paper Link | SFT Dataset | RL Dataset | Model Checkpoint | Project Homepage

Logo inspired by the oracle bone character for “eye”.

What is DeepEyesV2?

As OpenAI noted in a related article: “They don’t just see an image, they can integrate visual information directly into the reasoning chain.” DeepEyesV2 embodies this concept—it is an agentic multimodal model that unifies code execution and web search within a single reasoning loop, enabling reliable and complex problem-solving.

In simple terms, DeepEyesV2 functions like an intelligent assistant with visual capabilities. It can understand both text and images, and solve problems by writing code, performing calculations, and searching the web. This versatility makes it excel at tasks requiring comprehensive analysis, logical reasoning, and access to external information.

What’s New with DeepEyesV2?

Compared to traditional multimodal models, DeepEyesV2 brings several key breakthroughs:

  1. Unified Reasoning Loop
    It pioneers the integration of code execution and web search within a single reasoning workflow. The model can independently select tools based on task requirements, forming a closed-loop problem-solving capability (a conceptual sketch of this loop follows the list below).

  2. Carefully Curated Training Data
    Through rigorous data filtering and cleaning, we’ve built cold-start Supervised Fine-Tuning (SFT) data and Reinforcement Learning (RL) data. These two datasets complement each other, laying a solid foundation for the model’s performance.

  3. Strong Cross-Domain Capabilities
    Extensive experiments across real-world understanding, mathematical reasoning, and search-intensive benchmarks demonstrate DeepEyesV2’s exceptional reasoning skills and tool-usage abilities.

  4. Adaptive Tool-Usage Behavior
    Analysis of DeepEyesV2’s tool-usage patterns reveals task-adaptive behaviors. Research also shows that reinforcement learning enables the model to use more complex tool combinations and perform adaptive, context-aware tool invocation.
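As a mental model for point 1, the loop can be pictured as a generate-act-observe cycle. The sketch below is purely conceptual; the generate, run_code, and run_search hooks are placeholders, not the actual DeepEyesV2 interfaces.

# Conceptual sketch only: the model, sandbox, and search hooks are placeholders,
# not the actual DeepEyesV2 implementation.

def run_agent(generate, run_code, run_search, messages, max_turns=8):
    """Generate-act-observe loop: the model may emit a code cell or a search
    query, observe the tool output, and keep reasoning until it answers."""
    for _ in range(max_turns):
        step = generate(messages)                      # one reasoning turn
        if step["tool"] == "code":
            observation = run_code(step["content"])    # Jupyter-style cell
        elif step["tool"] == "search":
            observation = run_search(step["content"])  # image or text query
        else:
            return step["content"]                     # final answer
        # Feed the tool output back so the next turn can reason over it
        messages.append({"role": "tool", "content": observation})
    return None  # no answer within the tool-call budget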

How to Get Started with DeepEyesV2?

Data Preparation

DeepEyesV2’s training data is divided into two categories: cold-start SFT data and RL data, both available via the dataset links at the top of this page.

For detailed data preparation instructions, refer to the documentation in the data directory.
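
As an example, the datasets can be fetched with the huggingface_hub library. The repo IDs below are placeholders; substitute the actual dataset names from the links above.

# Hypothetical example: replace the repo IDs with the actual dataset names linked above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/DeepEyesV2-SFT",  # cold-start SFT data (placeholder ID)
    repo_type="dataset",
    local_dir="data/sft",
)
snapshot_download(
    repo_id="your-org/DeepEyesV2-RL",   # RL data (placeholder ID)
    repo_type="dataset",
    local_dir="data/rl",
)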

Cold-Start Training

The cold-start phase uses LLaMA-Factory. Follow these steps:

Environment Setup

Refer to the official LLaMA-Factory documentation for detailed installation guidelines.

Base Model Selection

We use Qwen-2.5-VL-7B-Instruct as the base model. Qwen-2.5-VL-32B-Instruct is also supported.

Launch Training

Run the following command to start cold-start training:

bash ./cold_start/run_cold_start.sh

Reinforcement Learning Training

Reinforcement learning uses the same codebase as DeepEyes. Follow these steps:

Environment Setup

cd reinforcement_learning
# Follow the official VeRL installation process
pip install -e .

# Install additional dependencies required by DeepEyes
bash scripts/install_deepeyes.sh

Code Sandbox Configuration

DeepEyesV2 can write code, execute it in a sandbox, and continue its reasoning based on the results. Similar to o3, the code it writes follows Jupyter-style syntax: each generated cell is sent to a code server, which returns the execution output.

Deploy the code server using this repository. We use Redis to store Jupyter Notebook states, so you’ll need to install Redis separately.
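
As a rough illustration of this interaction (the endpoint path and payload/response fields below are generic assumptions, not the actual server API):

# Illustrative only: the endpoint and payload/response fields are assumptions
# about a generic Jupyter-style sandbox, not the actual server API.
import requests

def run_cell(code: str, session_id: str, server="http://localhost:8080"):
    """Send one Jupyter-style cell to the sandbox and return its output."""
    resp = requests.post(
        f"{server}/execute",
        json={"code": code, "session_id": session_id},  # session keeps notebook state (e.g., in Redis)
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

print(run_cell("import math; print(math.sqrt(2))", session_id="demo"))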

During RL training, transmitting large volumes of high-resolution images to a single node in a short time may saturate bandwidth, causing retransmissions and timeouts. We recommend:

  • Deploying multiple code servers to distribute network load
  • Using high-performance machines for code servers to ensure fast responses

In practice, we deployed multiple code-server instances on each GPU node; refer to this repository for deployment. This approach eliminated timeouts caused by network, CPU, or other hardware issues during our training.

Online Search API Configuration

DeepEyesV2 can access external knowledge through search. We use online search instead of Retrieval-Augmented Generation (RAG):

  • For image search: We use the cache from MMSearch-R1. Download the corresponding cache data from here.
  • For text search: We use an online search service. You can use your own search API by modifying the content in reinforcement_learning/verl/workers/agent/envs/deepeyesv2/search_utils.py.
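
If you swap in your own text-search backend, the change amounts to a small wrapper along these lines. The endpoint, authentication, and response format here are illustrative assumptions; match whatever interface search_utils.py actually expects.

# Illustrative sketch of a custom text-search hook. The endpoint, API key
# handling, and return format are assumptions; adapt them to search_utils.py.
import os
import requests

def text_search(query: str, top_k: int = 5) -> list[dict]:
    """Query an external search API and return a list of {title, snippet, url}."""
    resp = requests.get(
        "https://api.example-search.com/search",  # placeholder endpoint
        params={"q": query, "num": top_k},
        headers={"Authorization": f"Bearer {os.environ['SEARCH_API_KEY']}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"title": r["title"], "snippet": r["snippet"], "url": r["url"]}
        for r in resp.json().get("results", [])
    ]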

Judge Server Deployment

We use a Qwen model for “LLM-as-a-judge” verification. Deploy the judge server following the instructions from DeepEyes:

# Download the Qwen-2.5-72B-Instruct model
# Other models like Qwen-3-8B are also compatible
huggingface-cli download --resume-download Qwen/Qwen2.5-72B-Instruct --local-dir /path/to/your/local/filedir --local-dir-use-symlinks False

# Start vllm serving
vllm serve /path/to/your/local/filedir \
    --port 18901 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --served-model-name "judge" \
    --trust-remote-code \
    --disable-log-requests
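
Once the server is running, the judge can be queried through vLLM's OpenAI-compatible API. A minimal smoke test (the prompt is only an example) might look like:

# Minimal smoke test for the judge endpoint (the prompt is only an example).
from openai import OpenAI

client = OpenAI(base_url="http://your.vllm.machine.ip:18901/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="judge",  # must match --served-model-name
    messages=[{
        "role": "user",
        "content": "Question: 2+2? Prediction: 4. Reference: 4. Is the prediction correct? Answer yes or no.",
    }],
    max_tokens=8,
)
print(resp.choices[0].message.content)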

Reinforcement Learning Execution

Hardware Requirements:

  • 7B Model Training: At least 32 GPUs (4 nodes × 8 GPUs)
  • 32B Model Training: At least 64 GPUs (8 nodes × 8 GPUs)

Each node should have a minimum of 1200GB CPU RAM, as high-resolution images consume significant memory.

Training Steps:

  1. Build a Ray cluster across all training nodes (see the sanity-check sketch after the launch commands below)
  2. Prepare data before starting training
  3. Launch training with the following scripts:
cd reinforcement_learning
# Enter your wandb API key here
wandb login

# Set the IP and port for your Qwen-2.5-72B-Instruct vllm service
export LLM_AS_A_JUDGE_BASE="http://your.vllm.machine.ip:18901/v1"

# Configuration for 7B model
bash examples/deepeyesv2/run_qwen2_5_vl-7b_final_allin.sh
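
For step 1, the cluster is typically built by running ray start --head on the head node and ray start --address=<head_ip>:<port> on each worker node; a quick sanity check before launching training might look like:

# Quick check that every node has joined the Ray cluster (run on the head node).
import ray

ray.init(address="auto")                      # attach to the running cluster
print(len(ray.nodes()), "nodes joined")       # expect 4 nodes for the 7B setup
print(ray.cluster_resources().get("GPU", 0))  # expect 32 GPUs for the 7B setup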

The training scripts use wandb and RL Logging Board (an excellent tool) to visualize training dynamics.

Model Evaluation

For detailed evaluation methods, see the documentation in the evaluation directory.

Frequently Asked Questions (FAQs)

What sets DeepEyesV2 apart from other multimodal models?

DeepEyesV2’s core difference lies in unifying code execution and web search into a single reasoning loop, creating an agentic system. Unlike traditional models that only understand text and images, it can proactively use tools to solve problems without human intervention.

What hardware do I need to train DeepEyesV2?

  • For the 7B model: At least 32 GPUs (4 nodes × 8 GPUs)
  • For the 32B model: At least 64 GPUs (8 nodes × 8 GPUs)

Each node also requires a minimum of 1200GB CPU RAM to handle high-resolution images.

How can I access DeepEyesV2’s training data?

Training data is available on Hugging Face in two categories: cold-start SFT data and RL data (see the dataset links at the top of this page).

Which base models are supported by DeepEyesV2?

Currently, it supports Qwen-2.5-VL-7B-Instruct and Qwen-2.5-VL-32B-Instruct as base models.

What is the purpose of the code sandbox?

The code sandbox allows DeepEyesV2 to safely write and execute code, especially for tasks requiring precise calculations like mathematical computations and data analysis. After executing code in the sandbox, the model can use the results for further reasoning, enhancing problem-solving capabilities.
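
For instance, when a question hinges on a small region of a high-resolution image, the model might emit a Jupyter-style cell along these lines (purely illustrative, not an actual DeepEyesV2 trace):

# Purely illustrative example of a Jupyter-style cell the model might emit,
# not an actual DeepEyesV2 trace.
from PIL import Image

img = Image.open("input.png")                         # the image under discussion
crop = img.crop((820, 140, 1020, 340))                # zoom into a region of interest
crop = crop.resize((crop.width * 3, crop.height * 3)) # enlarge for closer inspection
crop.save("zoom.png")                                 # returned to the model as a new observation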

Why do I need multiple code servers?

During reinforcement learning training, transmitting large volumes of high-resolution images can saturate bandwidth. Deploying multiple code servers distributes network pressure, preventing timeouts and transmission errors to ensure stable training.

How do I evaluate DeepEyesV2’s performance?

Detailed evaluation methods are available in the evaluation documentation, including workflows and metrics for benchmark testing.

What license does DeepEyesV2 use?

The project is released under the Apache License, permitting commercial use subject to the license terms.

Project Impact

Since its release, DeepEyesV2 has gained widespread attention. Below is the project’s star growth trajectory:

Star History Chart

Citation

If you use DeepEyesV2 in your research or projects, please cite the following paper:

@article{hong2025deepeyesv2,
  title={DeepEyesV2: Toward Agentic Multimodal Model},
  author={Jack Hong and Chenxiao Zhao and ChengLin Zhu and Weiheng Lu and Guohai Xu and Xing Yu},
  journal={arXiv preprint arXiv:2511.05271},
  year={2025}
}

With the information above, you should have a comprehensive understanding of DeepEyesV2. If you encounter any issues during use, refer to the project documentation or visit the homepage for further assistance.