AgentCPM: Open-Source Agents That Bring Deep Research to Your Device

Can powerful AI assistants that handle complex, multi-step tasks only exist in the cloud, tethered to massive models and internet connections? What happens when a job requires over a hundred tool calls, but the data involved is too sensitive to leave a private server? The recent open-source release of AgentCPM-Explore and AgentCPM-Report by Tsinghua University, Renmin University of China, and ModelBest offers a compelling new answer. They demonstrate that long-horizon, deep-research capabilities can thrive on local devices with remarkably compact models.

Overview & Core Breakthrough: Redefining On-Device Intelligence

The Core Question This Article Answers: How does one open-source project, through two distinct models, tackle the dual challenges of “deep exploration” and “deep reporting,” all while being deployable on local hardware?

Traditionally, AI agents designed for complex tasks requiring multi-step planning, real-time information retrieval, and sophisticated reasoning have relied on massive, cloud-based models with billions or even trillions of parameters. This creates challenges: high latency, significant cost, and, most critically, a barrier for privacy-sensitive industries like finance, healthcare, and legal services. The AgentCPM project was born to break this bottleneck.

AgentCPM is not a single model but a series focused on solving the “deep research” capability for AI agents. Its two newest open-source members have distinct, complementary roles:

  • AgentCPM-Explore (4B): A 4-billion parameter “Deep Search Specialist.” Its core mission is to understand complex user intent, autonomously plan and execute over 100 rounds of tool calls (searching, querying, calculating), cross-verify multi-source information, and persist in a dynamic environment until it finds a definitive answer.
  • AgentCPM-Report (8B): An 8-billion parameter “Deep Writing Specialist” based on MiniCPM4.1. It excels at taking an open-ended instruction, performing dozens of rounds of deep retrieval and nearly a hundred steps of chain-of-thought reasoning to integrate, analyze, and reconstruct information from vast sources, ultimately producing a logically sound, insightful long-form report.

Their shared hallmark is “doing more with less” and “edge-first” design. With parameter counts a fraction of their larger counterparts, they achieve performance on international benchmarks that rivals—and in some cases surpasses—both larger open-source models (30B+ level) and certain closed-source commercial systems. This makes high-performance agents a realistic prospect for private, on-device deployment on phones, edge servers, and local machines.

Author’s Reflection: In an era obsessed with scaling model size, AgentCPM has chosen a path of “depth optimization” over “blind enlargement.” This underscores a crucial insight: for agentic applications emphasizing planning and execution, the co-design of model architecture, training methodology, and tooling ecosystem can be as important as raw parameter count.

Deep Dive into AgentCPM-Explore: How 4B Parameters Drive 100+ Rounds of Autonomous Exploration

The Core Question This Section Answers: How can a model with only 4 billion parameters competently handle complex tasks that require over a hundred interactive steps and multiple tools?

Consider this task: “Find the latest technical progress and market competitive landscape regarding energy density, cost, and safety for Tesla’s 4680 battery, BYD’s Blade Battery, and CATL’s Kirin Battery in the EV sector.” This isn’t answered by a single web search. It requires decomposing the question, executing multiple searches, comparing information, verifying data, and potentially tracking recent news.

This is precisely the challenge AgentCPM-Explore is built for. It is not a general-purpose chatbot but an agent specialized for “exploration.”

Core Technical Highlights & Implementation

  1. Long-Horizon Task Handling: The model is specially trained to maintain exceptional coherence and goal consistency over extended contexts, supporting over 100 consecutive rounds of interaction. This means it won’t lose sight of the original objective or get lost in information fragments during a complex exploration.
  2. Dynamic Strategy Planning & Verification: It doesn’t just call tools; it dynamically adjusts its search strategy based on previous results. For instance, upon encountering contradictory information from one source, it will automatically initiate new queries for cross-verification. If it finds dated information, it will seek more recent data. This “think-act-observe-replan” loop is the core of its deep research capability (a minimal sketch of the loop appears after this list).
  3. Full-Stack Open-Source Framework: Its capability is powered by a complete, open-source infrastructure:

    • AgentRL: A fully asynchronous reinforcement learning framework for training agents. Developers can use it with their own custom task environments and reward mechanisms to train or fine-tune specialized exploration agents.
    • AgentDock: A unified tool sandbox management platform. Based on the Model Context Protocol (MCP), it containerizes various tool services (web search, document parsing, code execution, etc.), providing agents with a stable, extensible tool-calling environment.
    • AgentToLeaP: A one-click agent capability evaluation framework. It integrates 8 classic agent benchmarks, including GAIA, HLE, and BrowseComp, allowing developers to run standardized assessments of their model’s performance with minimal setup.
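
To make the “think-act-observe-replan” loop from point 2 concrete, here is a minimal, framework-agnostic sketch in Python. The message schema, the call_model and call_tool helpers, and the stopping rule are illustrative assumptions, not AgentCPM’s actual internals.

    # Minimal sketch of a think-act-observe-replan loop. Helper names and the
    # message schema are assumptions for illustration, not AgentCPM internals.
    import json

    MAX_ROUNDS = 100  # Explore is trained to stay coherent across 100+ rounds

    def run_exploration(task: str, call_model, call_tool) -> str:
        """Drive the agent until it emits a final answer or exhausts its budget."""
        history = [{"role": "user", "content": task}]
        for _ in range(MAX_ROUNDS):
            # Think: the model plans its next step from everything observed so far.
            reply = call_model(history)
            history.append({"role": "assistant", "content": reply["content"]})
            if reply.get("final_answer"):        # the model decides it is done
                return reply["final_answer"]
            # Act: execute the requested tool call (search, read, calculate...).
            call = reply["tool_call"]            # e.g. {"name": "search", "arguments": {...}}
            observation = call_tool(call["name"], call["arguments"])
            # Observe & replan: feed the raw result back so the model can, for
            # example, launch a cross-verification query when sources disagree.
            history.append({"role": "tool", "content": json.dumps(observation)})
        return "No definitive answer within the round budget."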

Scenario-Based Example: A Quick-Start Experience

Scenario: A market analyst wants to automatically fetch and summarize the latest Computer Science papers from arXiv daily to track tech trends.

Operational Steps (Based on the project’s QuickStart guide):

  1. Environment Preparation: Use the pre-built Docker image to get an evaluation environment with all dependencies.

    docker pull yuyangfu/agenttoleap-eval:v1.0
    docker run -dit --name agenttoleap --gpus all --network host -v $(pwd):/workspace yuyangfu/agenttoleap-eval:v1.0
    docker exec -it agenttoleap /bin/bash
    
  2. Launch Tool Platform: In another terminal, start the AgentDock tool sandbox, which will provide the search, reading, and other tools needed for the exploration.

    cd AgentDock
    docker compose up -d
    
  3. Configure & Run: Modify the quickstart.py script in the project root. Set the QUERY instruction to “Fetch and summarize today’s arXiv papers in the CS category,” and configure the model API and tool server address (illustrative values are sketched after this list). Run the script.

    python quickstart.py
    
  4. Examine Results: In the outputs/quickstart_results/ directory, you’ll find the complete dialog.json file. This file records the agent’s full “chain-of-thought”: how it planned steps (first access arXiv, then filter by category), which tools it called, what intermediate results it obtained, and how it finally organized the answer.
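
For orientation, the edits in step 3 typically look something like the following. The variable names here (QUERY, MODEL_BASE_URL, TOOL_SERVER_URL) are placeholders for illustration; check the script shipped with the repository for the actual fields.

    # Illustrative quickstart.py settings -- the variable names are placeholders,
    # not necessarily the ones used in the actual script.
    QUERY = "Fetch and summarize today's arXiv papers in the CS category"
    MODEL_BASE_URL = "http://localhost:8000/v1"  # your model-serving endpoint
    MODEL_NAME = "AgentCPM-Explore"              # served model identifier
    TOOL_SERVER_URL = "http://localhost:8080"    # AgentDock tool sandbox address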
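
If you prefer to inspect the trace from step 4 programmatically rather than reading the raw JSON, a few lines of Python suffice. The field names below (role, content) assume a standard chat-style message list; open the file once to confirm the actual schema.

    # Print one line per recorded step of the agent's trajectory. The keys used
    # here assume a chat-style schema and may need adjusting to the real file.
    import json

    with open("outputs/quickstart_results/dialog.json", encoding="utf-8") as f:
        dialog = json.load(f)

    for i, msg in enumerate(dialog):
        print(f"[{i:03d}] {msg['role']}: {str(msg['content'])[:120]}")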

Performance Evidence: Results on Classic Benchmarks

Theoretical capability needs objective measurement. AgentCPM-Explore was tested across 8 classic agent evaluation sets covering complex reasoning, web browsing, and long-horizon decision-making. Its scores reveal the “big potential of small models”:

On challenging benchmarks like GAIA (complex QA), BrowseComp (web browsing comprehension), and HLE (Humanity’s Last Exam), this 4B model not only achieved State-of-the-Art (SOTA) performance among similarly sized models but also surpassed many models with twice its parameter count (8B-level). Particularly on WebWalkerQA (web navigation QA) and Seal-0 (search-engine-augmented QA), its scores of 68.1% and 40.5% are competitive with some open-source models over 30B in size, clearly demonstrating its efficiency in information retrieval and multi-step reasoning tasks.

Author’s Insight: Explore’s success is not accidental. It builds upon the Qwen3-4B-Thinking base model, which already has enhanced reasoning capabilities, and then undergoes specialized reinforcement learning training for tool use and long-horizon planning. This is akin to taking a trainee with solid logical foundations and putting them through intensive “field drills” to become a domain expert. It provides a valuable blueprint for the community: you don’t need to wait for trillion-parameter models; with carefully designed tasks and training, mid-sized models can excel at specific agentic workloads.

Deep Dive into AgentCPM-Report: Generating Deep Reports Locally That Rival Top-Tier Closed-Source Systems

The Core Question This Section Answers: How can an 8B model running entirely locally write multi-thousand-word reports comparable in quality to those from Gemini DeepResearch?

If Explore is an “Information Explorer,” then Report is an “Information Architect.” Its task isn’t to find a specific answer but to conduct a sweeping information retrieval, deep synthesis, and structured reconstruction around an open topic, ultimately building a well-argued, insightful long-form report.

Core Value & Unique Advantages

  1. Extreme Efficiency, Big Results from Small Scale: On deep research tasks, it achieves performance comparable to top-tier closed-source commercial systems (like Gemini 2.5 Pro DeepResearch) with only 8B parameters. This means users can obtain report quality previously dependent on high-end cloud services, even on devices with limited computational power.
  2. Physical Isolation, Absolute Security: This is its most critical competitive advantage. The entire system supports fully offline, local deployment. A user’s private knowledge base (internal documents, patent libraries, customer data) is vectorized and indexed locally. The model’s entire process—retrieval, reasoning, writing—happens locally, completely eliminating the risk of data leakage from cloud uploads. This perfectly meets the compliance requirements of industries like finance, law, government, and healthcare.

Practical Demonstration: From Zero to Generated Report

Scenario: A research analyst at an investment firm needs to write a deep analysis report on the “Competitive Landscape of the AI Drug Discovery Industry” based on the company’s internal project database and public industry news.

Operational Steps:

  1. One-Command Deployment: Thanks to deep integration with the low-code UltraRAG framework, deployment is exceptionally simple.

    git clone git@github.com:OpenBMB/UltraRAG.git
    cd UltraRAG
    git checkout agentcpm-report-demo
    cd agentcpm-report-demo
    cp env.example .env
    docker-compose -f docker-compose.yml up -d --build
    

    (The first run pulls images and downloads the model, taking roughly 30 minutes.)

  2. Build the Knowledge Base: Open the web management interface at http://localhost:5050.

    • Upload internal project database files (PDF, Word, TXT, etc.).
    • (Optional) Import the public Wiki2024 dataset as general knowledge.
    • The system automatically performs text chunking, vectorization, and index building in the backend Milvus vector database (a minimal sketch of this step appears after this list).
  3. Generate the Report: In the Chat interface, select the “AgentCPM-Report” pipeline and input the instruction: “Based on our uploaded internal database and public information, please write an in-depth analysis report (minimum 8000 words) on the technological roadmaps, key players, market risks, and future opportunities in the AI drug discovery industry.”
  4. Observe & Retrieve: The system will initiate a lengthy automated pipeline. Backend logs will show the model performing multiple retrieval rounds (locating relevant info from the knowledge base), organizing outlines, expanding arguments, and verifying data. Finally, a structurally complete report, citing both internal data and external facts, will be presented.
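
UltraRAG automates the chunk-embed-index step in step 2, but it helps to see what it amounts to. Below is a minimal sketch assuming pymilvus’s MilvusClient and a sentence-transformers embedding model; the embedder, chunk size, and collection layout are illustrative choices, not UltraRAG’s actual configuration.

    # Sketch of chunking, vectorization, and indexing into a local Milvus
    # instance. Model choice and chunking strategy are illustrative only.
    from pymilvus import MilvusClient
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # 768-dim embeddings
    client = MilvusClient(uri="http://localhost:19530")     # local Milvus server
    client.create_collection(collection_name="kb", dimension=768)

    def index_document(text: str, chunk_size: int = 512) -> None:
        """Split a document into fixed-size chunks, embed them, and index locally."""
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        vectors = encoder.encode(chunks)
        client.insert(
            collection_name="kb",
            data=[{"id": i, "vector": vec.tolist(), "text": chunk}
                  for i, (vec, chunk) in enumerate(zip(vectors, chunks))],
        )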

Authoritative Benchmark Validation: Competing with Top Commercial Systems

Demonstrations need rigorous quantification. AgentCPM-Report proves its mettle across several specialized deep research evaluations:

  • On DeepResearch Bench, its overall score (50.11) edges past the current acknowledged reference system, Gemini-2.5-Pro-DeepResearch (49.71). More strikingly, on the “Insight” sub-metric (which measures whether a report offers unique perspectives), it achieved a higher score of 52.64. This suggests it’s not just collating information but can perform valuable analysis and synthesis.
  • On DeepResearch Gym, its overall performance (98.48) even surpassed all compared systems, including Gemini and Claude-based WebWeaver. On key dimensions like “Depth,” “Breadth,” and “Insightfulness,” it received perfect or near-perfect ratings, thoroughly validating its ability for comprehensive, deep information mining and reorganization.

Author’s Reflection: The value of the Report model extends far beyond a mere “writing tool.” It effectively builds a “privatized deep research assistant.” In an age of information overload, it provides organizations and individuals with a secure, autonomous channel to transform dormant private data assets into high-quality decision-making material. Its success also highlights the immense potential of combining RAG (Retrieval-Augmented Generation) technology with agentic workflows—the model doesn’t need to know everything, but it must know how to efficiently and accurately find and use everything.

Open-Source Ecosystem & Community Building: More Than Models, It’s Infrastructure

The Core Question This Section Answers: What extensible, customizable infrastructure does the AgentCPM project provide for the developer community?

The vitality of a model lies in its ecosystem. The AgentCPM series is notable not only for model performance but also for open-sourcing the entire “infrastructure” for training, deployment, and evaluation, significantly lowering the barrier to research and application.

  • Custom Tool Integration: If you have an internal API or specialized data processing tool, you can easily package it as an MCP-compliant service and add it to the AgentDock platform. The agent model can then immediately learn to call this new tool. This paves the way for building vertical-domain agents (e.g., for e-commerce inventory queries, IoT device control); a minimal sketch follows this list.
  • Custom Model Integration: The framework supports integrating other models compatible with tool-calling formats. Developers only need to implement a lightweight “tool call parser” for the new model to let it leverage the rich tool ecosystem on AgentDock (a parser sketch appears at the end of this section).
  • Custom Evaluation Sets: Researchers wanting to evaluate agents on newly proposed tasks can simply prepare data in the specified format under the AgentToLeaP framework, seamlessly integrating into the evaluation pipeline with results comparable to existing benchmarks.
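
As a concrete example of the tool-integration point above, the official MCP Python SDK’s FastMCP helper lets you expose an internal function as a tool service in a few lines. The inventory tool below is invented for illustration; registering the service with AgentDock is then a matter of its own configuration (see the config.toml note in the FAQ).

    # Wrapping an internal API as an MCP-compliant tool service using the
    # official MCP Python SDK (pip install mcp). The tool itself is a made-up
    # example; hook it to your real internal API.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("inventory-tools")

    @mcp.tool()
    def check_inventory(sku: str) -> dict:
        """Return current stock for a SKU from the internal inventory system."""
        return {"sku": sku, "in_stock": 42, "warehouse": "SH-01"}  # stub response

    if __name__ == "__main__":
        mcp.run(transport="stdio")  # use the transport your AgentDock setup expects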

This openness and modular design encourage community innovation and adaptation, collectively advancing the field of on-device agent technology.
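
To make the “tool call parser” idea concrete: many open models, including the Qwen3 family that Explore builds on, emit tool calls as JSON wrapped in <tool_call> tags. A parser for that convention might look like the sketch below; adapt the pattern to whatever format your model actually produces.

    # Lightweight parser for Qwen-style tool calls, where the model wraps a
    # JSON object in <tool_call>...</tool_call> tags. Adjust to your model.
    import json
    import re

    TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

    def parse_tool_calls(model_output: str) -> list[dict]:
        """Extract {"name": ..., "arguments": ...} dicts from raw model text."""
        return [json.loads(match) for match in TOOL_CALL_RE.findall(model_output)]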

Scenario-Based Implementation Guide: How Your Business Can Use AgentCPM

The Core Question This Section Answers: How can users in different industries apply AgentCPM-Explore and AgentCPM-Report to real-world business problems?

Technology ultimately serves scenarios. Here are some implementation ideas based on their core capabilities:

  • Financial Research
    • Explore: Real-time monitoring of announcements, news, and sentiment for multiple listed companies, automatically cross-verifying information to extract key financial events and risk signals.
    • Report: Automatically generating quarterly industry analysis reports or deep competitor comparison reports based on internal research libraries and public data.
  • Legal & Compliance
    • Explore: Automatically retrieving relevant legal statutes, historical case law, and academic opinions based on the points of a case for multi-dimensional comparative analysis.
    • Report: Consolidating vast case files and evidence materials to automatically generate case summary reports or compliance review memoranda.
  • Healthcare & Biotech
    • Explore: Tracking the latest global clinical trial dynamics or research publications for a specific target or drug.
    • Report: Integrating patient records, lab reports, and the latest medical literature to assist in generating personalized treatment plan analysis reports.
  • Technology & R&D
    • Explore: Monitoring global patent activity and top conference papers for a specified technical field (e.g., solid-state batteries).
    • Report: Researching the origin, development, schools of thought, and future trends of a technical roadmap to generate technical landscape analysis reports.
  • Education & Academia
    • Explore: Helping students or researchers conduct systematic literature reviews on complex academic questions and map out the research landscape.
    • Report: Assisting researchers in writing literature reviews, project rationales, and other content requiring extensive citation and deep analysis.
  • Enterprise Internal Use
    • Explore: Serving as an advanced corporate assistant, querying order, logistics, and customer feedback information across systems to answer complex business queries.
    • Report: Automatically aggregating departmental weekly/monthly report data to generate company performance analysis reports; analyzing customer feedback to generate product improvement suggestions.

Implementation Checklist:

  1. Define Need: Is your task about finding a specific answer or generating a comprehensive narrative? Choose Explore for the former, Report for the latter.
  2. Assess Environment: Are there strict data sovereignty requirements? If yes, Report’s local deployment is essential.
  3. Prepare Knowledge: For Report, start organizing your internal documents, databases, and knowledge bases. This is the foundation of its value.
  4. Try Deployment: Follow the “Quick Start” guide in this article or the official README to complete a full deployment and run-through in a test environment.
  5. Customize Development: Based on business needs, integrate your proprietary tools via AgentDock, or use the AgentRL framework to fine-tune the Explore model on specific task data.

Conclusion & Forward Look

The joint open-source release of AgentCPM-Explore and AgentCPM-Report marks a new phase in the development of LLM-based agents: a shift from “brute-force scaling” towards “elegant, powerful, and controlled” efficiency. They prove that through precise model architecture design, targeted training methods, and robust infrastructure, mid-sized models can fully shoulder the responsibility for long-horizon, complex agentic tasks.

This is more than a technology release; it’s a clear signal about the future of agents: cloud and edge will collaborate, and general-purpose and specialized models will coexist. For the vast majority of enterprise and personal scenarios that need to process private data, require fast responses, or must control costs, a specialized agent that runs efficiently and securely on local devices holds far more immediate and tangible value than a distant, expensive, general-purpose giant model.


One-Page Summary

  • What it is: Open-source twin agent models specializing in “deep research” capabilities.
  • Explore (4B): The Deep Search Specialist. Excels at multi-step planning and tool calling to explore for answers in complex, dynamic environments. Ideal for fact-checking, information monitoring, complex Q&A.
  • Report (8B): The Deep Writing Specialist. Excels at synthesizing multi-source information to produce long-form, in-depth reports. Core advantage is fully local deployment for absolute data security. Ideal for industry analysis, literature reviews, compliance reporting.
  • Core Achievement: Performance that rivals or surpasses some larger open-source and closed-source models with a fraction of the parameters.
  • Open-Source Ecosystem: Provides full-stack training framework (AgentRL), tool platform (AgentDock), and evaluation framework (AgentToLeaP) for deep customization.
  • Who Should Use It: Industries and developers with high demands for data privacy, report quality, and automated deep research, such as finance, law, healthcare, and scientific research.

Frequently Asked Questions (FAQ)

  1. Should I use AgentCPM-Explore or AgentCPM-Report?
    Use Explore if your core need is to answer a specific, complex question requiring web access or multiple tool calls. Use Report if your core need is to generate a structured, evidence-rich long-form report based primarily on your local knowledge base.

  2. What hardware is needed to deploy AgentCPM-Report?
    Using a GPU is recommended for acceptable inference speed. The project offers both vLLM (GPU) and llama.cpp (CPU) deployment options. CPU is slower but has wider compatibility. Refer to the model card for specific memory/VRAM requirements (an 8B model typically works well with 16GB+ VRAM).
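
    As a rough illustration of the GPU path, vLLM’s offline Python API can run the model in a few lines. The model identifier below is a placeholder; use the actual repository name from the model card.

      # Rough sketch of local GPU inference via vLLM's offline API. The model
      # ID is a placeholder -- substitute the real name from the model card.
      from vllm import LLM, SamplingParams

      llm = LLM(model="openbmb/AgentCPM-Report")  # placeholder model ID
      params = SamplingParams(temperature=0.7, max_tokens=4096)
      outputs = llm.generate(["Outline a report on AI drug discovery."], params)
      print(outputs[0].outputs[0].text)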

  3. Can I train a model like AgentCPM-Explore with my own data?
    Yes. The open-source AgentRL framework is a fully asynchronous RL framework for training agents. You would need to define your own task environment, toolset, and reward function, then use the framework to train/fine-tune a base model (like Qwen3-4B) on your task data.
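
    For orientation, the pieces you would define look roughly like the gym-style sketch below. This is a generic illustration of “environment, toolset, reward function,” not AgentRL’s actual API.

      # Generic sketch of a task environment for agent RL training. This mirrors
      # a gym-style interface for illustration; AgentRL's real API may differ.
      class MyTaskEnv:
          def reset(self) -> str:
              """Return the task instruction that starts an episode."""
              return "Track the latest clinical trials for target X."

          def step(self, tool_name: str, arguments: dict) -> tuple[str, float, bool]:
              """Execute one tool call; return (observation, reward, done)."""
              observation = self.call_tool(tool_name, arguments)
              done = self.task_solved(observation)
              reward = 1.0 if done else -0.01  # small step cost rewards efficiency
              return observation, reward, done

          def call_tool(self, name: str, args: dict) -> str:
              ...  # dispatch into your toolset (search, database query, etc.)

          def task_solved(self, observation: str) -> bool:
              ...  # your task-specific success check, i.e. the reward signal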

  4. Can I integrate my company’s internal APIs into the AgentDock tool sandbox?
    Absolutely. You need to package your internal API as an MCP-compliant tool service and add its configuration to AgentDock’s config.toml file, then restart the relevant service node. The agent model will automatically recognize and be able to call this new tool.

  5. How exactly is the “local security” of AgentCPM-Report implemented?
    The entire system (including the Milvus vector database, vLLM inference framework, and web UI) can be deployed via Docker Compose on a single offline server or PC. All data—your documents, generated vector indexes, model weights, and report content—never leaves that device’s network, ensuring physical isolation.

  6. Do these models support Chinese?
    Fully supported. Both models are trained and evaluated with strong bilingual (English/Chinese) capabilities. They perform excellently on Chinese evaluation sets like BrowseComp-ZH and DeepConsult, making them particularly suitable for deep research and reporting in Chinese contexts.

  7. Is there a ready-to-use graphical interface, or is coding required?
    AgentCPM-Report, integrated with the UltraRAG framework, provides an out-of-the-box web GUI (localhost:5050). Users can operate by uploading files and clicking buttons, no coding required. AgentCPM-Explore is currently configured and launched via Python scripts, suitable for developers integrating it into automated pipelines.

  8. Does their performance really match Gemini DeepResearch?
    Based on the project’s published results on benchmarks like DeepResearch Bench, AgentCPM-Report’s overall score is very close to Gemini 2.5 Pro DeepResearch and even surpasses it on the “Insight” sub-metric. This demonstrates top-tier competitiveness for the specific task of deep report generation. Actual results may vary slightly depending on the task domain and knowledge base quality.