
Executive Memory for LLM: Revolutionizing Long-Horizon Reasoning in AI Agents

MemoBrain: The Executive Memory Brain for LLM Reasoning

In complex reasoning scenarios for tool-augmented agents, long-horizon reasoning trajectories and temporary tool-interaction results accumulate continuously, steadily consuming the limited working context of large language models (LLMs). Without a dedicated memory mechanism, this undifferentiated accumulation disrupts the logical continuity of reasoning and causes the agent to drift from its task objectives, turning memory management from a mere efficiency optimization into a core capability for long-horizon, goal-directed reasoning.

MemoBrain is an executive memory model designed to address exactly this problem. It builds a dependency-aware reasoning memory system for tool-augmented agents that captures key intermediate reasoning states and makes their logical connections explicit. Acting as a “cognitive co-pilot,” it proactively manages the working context, making long-horizon LLM reasoning more coherent and efficient.

I. Why Do LLMs Need “Executive Memory”?

If you work on AI agent development, you may have run into this problem: when an agent handles complex tasks requiring multi-step, multi-tool calls (such as conducting in-depth research or answering a multi-dimensional factual question), every thinking step, tool-call record, and tool result gets stuffed into the LLM’s context window indiscriminately as the number of reasoning steps grows.

LLMs have limited context windows—for example, common models may have upper limits of 8K, 32K, or even 128K tokens. But even if the window is large enough, unorganized information accumulation leads to two core issues:

  1. Logical Disruption: Irrelevant or redundant information dilutes key reasoning clues, causing the agent to forget previous reasoning conclusions or even make contradictory judgments;
  2. Low Efficiency: Repetitive information occupies a lot of context space, which not only increases reasoning costs but also reduces the effectiveness of each round of thinking.

Traditional context management methods simply “pile up information passively,” while MemoBrain’s core idea is to “manage information actively”—it not only stores reasoning trajectories but also identifies which steps are effective, which are completed, and which can be compressed, always maintaining a compact, high-value reasoning core for the LLM.

II. The Core Working Principle of MemoBrain

Essentially, MemoBrain is an executive memory system designed for reasoning agents. Unlike traditional memory modules that store passively, it proactively manages reasoning trajectories through four core mechanisms:

1. Memory Construction: Drawing a “Logical Relationship Graph” for Reasoning Steps

MemoBrain converts each reasoning step, tool call, and piece of tool feedback into a node in a memory graph, while recording the dependencies between nodes. For example, the step “search for Paris population data” links to the goal node “answer questions related to Paris population,” and “obtained population data” links to “search for Paris population data.” The resulting graph reflects the logical structure of the reasoning rather than a simple linear record.
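The graph construction described above can be sketched as a small dependency structure. This is an illustrative sketch only, not MemoBrain’s actual internals: the MemoryNode and MemoryGraph names, their fields, and the ancestors helper are all invented here for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    """One reasoning step, tool call, or tool result (hypothetical structure)."""
    node_id: str
    content: str
    depends_on: list = field(default_factory=list)  # ids of parent nodes

class MemoryGraph:
    """A minimal dependency-aware memory graph."""
    def __init__(self):
        self.nodes = {}

    def add(self, node_id, content, depends_on=()):
        self.nodes[node_id] = MemoryNode(node_id, content, list(depends_on))

    def ancestors(self, node_id):
        """All nodes a given step logically depends on, transitively."""
        seen, stack = set(), list(self.nodes[node_id].depends_on)
        while stack:
            nid = stack.pop()
            if nid not in seen:
                seen.add(nid)
                stack.extend(self.nodes[nid].depends_on)
        return seen

# The Paris example from the text:
g = MemoryGraph()
g.add("goal", "Answer questions related to Paris population")
g.add("search", "Search for Paris population data", depends_on=["goal"])
g.add("result", "Obtained population data: ~2.2 million", depends_on=["search"])

print(g.ancestors("result"))  # -> {'search', 'goal'} (a set; order may vary)
```

Because dependencies are explicit, operations like flush and fold can reason about which nodes are safe to remove or compress instead of treating the trajectory as an undifferentiated transcript.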

2. Flush: Removing Invalid Reasoning Nodes

During the reasoning process, the agent may attempt incorrect paths (such as searching for irrelevant information) or repeatedly execute the same tool calls. These invalid or redundant nodes are identified and removed by MemoBrain, reducing “noise” in the context.

3. Fold: Compressing Completed Sub-Reasoning Trajectories

When a subtask is completed (such as successfully obtaining Paris population data), MemoBrain compresses all reasoning steps and tool interaction records corresponding to this subtask into a concise summary node. For example, a process of “search → obtain results → verify results” that originally required 1000 tokens to record may be compressed into the sentence “Confirmed that Paris has a permanent population of approximately 2.2 million,” which not only retains key information but also significantly saves context space.
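As a rough sketch of what folding buys you, the function below replaces a completed sub-trajectory with a single summary message and reports the space saved. In MemoBrain itself the summary is produced by the memory model; here it is passed in by hand, and fold_subtask is a hypothetical helper, not part of the library.

```python
def fold_subtask(episode_messages, summary):
    """Replace a completed sub-trajectory with one summary message.

    `episode_messages` is the full search -> result -> verify record;
    the returned message is the only thing that stays in context.
    """
    original_chars = sum(len(m["content"]) for m in episode_messages)
    folded = {"role": "assistant", "content": summary}
    saved = original_chars - len(summary)
    return folded, saved

# A toy "search -> obtain results -> verify results" sub-trajectory:
episode = [
    {"role": "assistant", "content": "I need to search for Paris population data..."},
    {"role": "user", "content": "Search results: Paris is the capital of France, population about 2.2 million..."},
    {"role": "assistant", "content": "Let me verify this against the Wikipedia page..."},
    {"role": "user", "content": "Page content confirms: ~2.2 million inhabitants."},
]
folded, saved = fold_subtask(
    episode, "Confirmed that Paris has a permanent population of approximately 2.2 million."
)
print(folded["content"], f"(saved {saved} characters)")
```

The real operation is semantic rather than purely textual, but the effect is the same: one compact node carries the conclusion forward while the raw interaction record leaves the working context.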

4. Context Management: Maintaining a Fixed-Size, High-Value Reasoning Core

MemoBrain sets a fixed context budget (e.g., 32K tokens) to ensure that only the core information needed for current reasoning is retained in memory—invalid content is flushed, completed content is folded, and the LLM’s working context always focuses on the unfinished core tasks.

Figure 1: MemoBrain’s Overall Architecture and Workflow

III. Getting Started with MemoBrain Quickly

⚠️ Note: This project is currently in an intensive development phase, with continuous updates to features and user experience coming soon!

3.1 Installation Steps (Completed in 3 Simple Steps)

Installing MemoBrain is straightforward—simply clone the repository via git and complete the local installation by navigating to the directory:

git clone https://github.com/qhjqhj00/MemoBrain.git
cd MemoBrain
pip install -e .

3.2 Model Selection: Choose the Right Base Model for Optimal Results

MemoBrain can use any LLM that exposes an OpenAI-compatible API as its base model. That said, we recommend two options, which you can choose based on your resources and needs:

  1. High-Performance Commercial Models: Such as DeepSeek V3.2 and GPT-5, which deliver the best reasoning performance;
  2. Officially Fine-Tuned MemoBrain-Specific Models: Optimized specifically for memory operations, offering higher cost-effectiveness, including:
    • MemoBrain-4B (based on Qwen3-4B-Instruct-2507): Low resource consumption, suitable for low-configuration environments;
    • MemoBrain-8B (based on Qwen3-8B): Balances efficiency and performance, ideal for most scenarios;
    • MemoBrain-14B (based on Qwen3-14B): Optimal performance, suitable for scenarios requiring high reasoning accuracy.

3.3 Deploying MemoBrain Models with vLLM

If you choose an officially fine-tuned MemoBrain model, deployment with vLLM is recommended for higher efficiency. First install vLLM, then select the model for deployment based on your needs:

# Step 1: Install vLLM
pip install vllm

# Deploy MemoBrain-8B (balance of efficiency and performance)
vllm serve TommyChien/MemoBrain-8B --port 8002

# Or deploy MemoBrain-4B (low resource usage)
vllm serve TommyChien/MemoBrain-4B --port 8002

# Or deploy MemoBrain-14B (best performance)
vllm serve TommyChien/MemoBrain-14B --port 8002

3.4 Basic Python Usage Example

Below is a complete workflow for basic MemoBrain usage, covering initialization, memory recording, and memory optimization:

import asyncio
from memobrain import MemoBrain

async def main():
    # Step 1: Initialize MemoBrain instance
    # Option A: Use officially fine-tuned model (recommended)
    memory = MemoBrain(
        api_key="EMPTY",  # No API key required for vLLM-deployed models
        base_url="http://localhost:8002/v1",
        model_name="TommyChien/MemoBrain-8B"
    )
    
    # Option B: Use commercial model (replace with your API credentials)
    # memory = MemoBrain(
    #     api_key="your-api-key",
    #     base_url="https://api.deepseek.com/v1",
    #     model_name="deepseek-chat"
    # )
    
    # Step 2: Initialize memory graph with task description
    memory.init_memory("Solve a complex research problem")
    
    # Step 3: Record conversation interactions (reasoning trajectories)
    await memory.memorize([
        {"role": "assistant", "content": "I need to search for information about Paris..."},
        {"role": "user", "content": "Search results: Paris is the capital of France..."}
    ])
    
    await memory.memorize([
        {"role": "assistant", "content": "Let me get the population data..."},
        {"role": "user", "content": "Paris has approximately 2.2 million inhabitants."}
    ])
    
    # Step 4: Optimize memory (flush invalid steps + fold completed sub-trajectories)
    optimized_messages = await memory.recall()
    print(f"Memory optimized: {len(optimized_messages)} messages")

asyncio.run(main())

3.5 Key Usage Tip: Choosing the Right “Memory Unit”

Many people new to MemoBrain wonder, “What granularity should I use to record reasoning trajectories?” Here’s a clear recommendation:
The optimal unit for calling the memorize() method is a “reasoning episode”—a complete reasoning cycle that typically includes:

  • Thinking: The agent’s reasoning process (e.g., “I need to search for Paris population data”);
  • Tool Call: The action performed by the agent (e.g., searching, browsing web pages, running code);
  • Tool Feedback: The result returned by the tool (e.g., search results, web page content).

For example, in a tool-augmented agent workflow, the correct calling method is as follows:

# First episode: Thinking → Tool Call → Tool Feedback
await memory.memorize([
    {"role": "assistant", "content": "I need to search for information about Paris..."},
    {"role": "user", "content": "Search results: Paris is the capital of France..."}
])

# Second episode: Thinking → Tool Call → Tool Feedback  
await memory.memorize([
    {"role": "assistant", "content": "Let me visit the Wikipedia page for more details..."},
    {"role": "user", "content": "Page content: Paris has a population of 2.2 million..."}
])

This episode-level granularity allows MemoBrain to better identify the logical structure and dependency relationships in reasoning trajectories, resulting in far better optimization effects than fragmented recording.

3.6 MemoBrain Core API Quick Reference

For quick reference, we’ve compiled the functions and usage of core APIs:

| Method | Description |
| --- | --- |
| MemoBrain(api_key, base_url, model_name) | Creates a MemoBrain instance. api_key is the model API key (use EMPTY for vLLM-deployed models), base_url is the model API address, and model_name is the model name. |
| init_memory(task: str) | Initializes the memory graph with a task description to help MemoBrain clarify reasoning goals. |
| memorize(messages: List[Dict]) | Asynchronously records new conversation turns (reasoning trajectories); messages is a list of dictionaries with role and content keys. |
| recall() | Asynchronously performs memory optimization (flush + fold) and returns the optimized context messages. |

Message Format Specification:

[
    {"role": "user", "content": "Your question"},
    {"role": "assistant", "content": "Assistant's response"}
]

IV. Advanced MemoBrain Usage Examples

Beyond basic usage, MemoBrain supports advanced features such as memory snapshot loading and automatic management based on token budgets. Below are detailed examples:

4.1 Loading Memory Snapshots and Visualizing Memory Structure

If you need to reuse previous reasoning memories or view the structure of the memory graph, you can do so as follows:

import json
from memobrain import MemoBrain

# Initialize MemoBrain
memory = MemoBrain(
    api_key="EMPTY",
    base_url="http://localhost:8002/v1",
    model_name="TommyChien/MemoBrain-14B"
)

# Load saved memory snapshot
with open("memory_snapshot.json") as f:
    data = json.load(f)
memory.load_dict_memory(data["memory"])

# Visualize memory graph structure
print(memory.graph.pretty_print())

For a complete example, refer to examples/example_usage.py in the project.

4.2 Automatic Memory Management Based on Token Budget

In long-horizon reasoning tasks, manually monitoring context size is cumbersome. MemoBrain supports automatic management based on token budgets—when the context exceeds the set token limit, memory optimization is triggered automatically:

import asyncio
from memobrain import MemoBrain
from utils import num_tokens_from_messages

async def token_budget_example(conversations):
    memory = MemoBrain(
        api_key="EMPTY",
        base_url="http://localhost:8002/v1",
        model_name="TommyChien/MemoBrain-14B"
    )
    memory.init_memory("Long-running research task")
    
    # Set token budget limit (e.g., 32K tokens)
    max_memory_size = 32 * 1024
    current_messages = []
    
    for conv in conversations:
        await memory.memorize(conv)
        current_messages.extend(conv)
        
        # Calculate current token count
        token_count = num_tokens_from_messages(current_messages)
        
        # Trigger automatic optimization when budget is exceeded
        if token_count > max_memory_size:
            optimized = await memory.recall()
            current_messages = optimized
            print(f"Memory optimized: {token_count} → {num_tokens_from_messages(optimized)} tokens")

asyncio.run(token_budget_example(your_conversations))

The core advantages of this approach are:

  • No need to manually track context size; memory management happens automatically;
  • Budgets can be flexibly adjusted according to the model’s context window (e.g., adapting to 8K/32K/128K models);
  • On-demand memory optimization ensures reasoning continuity while maximizing context usage efficiency;
  • Seamless adaptation to long-horizon reasoning trajectories avoids reasoning interruptions caused by context overflow.

V. Integrating MemoBrain into ReAct Agents

ReAct is currently a mainstream tool-augmented reasoning framework. MemoBrain provides a complete integration example, enabling ReAct agents to possess efficient long-horizon memory management capabilities.

5.1 Core Features After Integration

  • Tool-Augmented Reasoning: Supports common tools such as web search, page browsing, and code execution;
  • Automatic Memory Management: MemoBrain handles context optimization transparently without modifying the agent’s core logic;
  • Token Budget Control: Configurable memory size limits (32K tokens by default);
  • Flexible Comparison: Supports enabling/disabling MemoBrain for easy performance comparison.

5.2 Quick Start Example

First navigate to the examples directory, deploy the MemoBrain model, then run the evaluation task:

cd examples

# Deploy MemoBrain-14B model (port 8002)
vllm serve TommyChien/MemoBrain-14B --port 8002

# Run GAIA task (with MemoBrain enabled)
python run_task.py --eval_task gaia

# Run GAIA task (without MemoBrain, for comparison)
python run_task.py --eval_task gaia --no_memory

⚠️ Note: To run this example completely, additional configuration is required. For details, refer to examples/README.md, including: deploying 3 models (reasoning model, auxiliary model, memory model), configuring API keys (Google Search, Jina), and installing dependencies.

5.3 Programmatic Integration Example

If you need to integrate MemoBrain + ReAct into your own code, refer to the following example:

import asyncio
from memobrain import MemoBrain
from react_with_memory import run_react_agent
from config import Configuration

async def main():
    # Configure models and API keys
    config = Configuration(
        # Reasoning model
        reasoning_model="Alibaba-NLP/Tongyi-DeepResearch-30B-A3B",
        reasoning_model_base_url="http://localhost:8000/v1",
        reasoning_model_api_key="empty",
        
        # Auxiliary model (for web page content summarization)
        auxiliary_model="Qwen/Qwen3-30B-A3B-Instruct-2507",
        auxiliary_model_base_url="http://localhost:8001/v1",
        auxiliary_model_api_key="empty",
        
        # Memory model (MemoBrain)
        memory_model="TommyChien/MemoBrain-14B",
        memory_model_base_url="http://localhost:8002/v1",
        memory_model_api_key="empty",
        
        # Tool API keys
        google_api_key="YOUR_GOOGLE_API_KEY",
        google_cx="YOUR_GOOGLE_CX",
        jina_api_key="YOUR_JINA_API_KEY",
        
        # Memory configuration
        max_memory_size=32*1024,  # 32K tokens
        max_llm_call_per_run=200,
        use_memory=True
    )
    
    # Run ReAct agent with MemoBrain
    result = await run_react_agent(
        question="What is the population of Paris?",
        config=config,
        use_memory=True  # Enable MemoBrain
    )
    
    print(f"Prediction: {result['prediction']}")
    print(f"Token Count: {result['token_count']}")
    print(f"Memorize Time: {result['total_memorize_time']:.2f}s")
    print(f"Recall Time: {result['total_recall_time']:.2f}s")

asyncio.run(main())

5.4 Workflow After Integration

  1. Agent Reasoning: The ReAct agent performs multi-step reasoning and calls tools to complete tasks;
  2. Memory Recording: After each tool execution, MemoBrain records the reasoning episode;
  3. Context Monitoring: Continuously tracks the token count of the current context and compares it with the set budget limit;
  4. Automatic Optimization: When the context exceeds the budget, MemoBrain automatically executes:
    • Flush: Removes invalid/redundant reasoning steps;
    • Fold: Compresses completed sub-trajectories into summaries;
    • Returns the optimized context for the agent to continue reasoning.

For complete documentation, refer to examples/README.md, including detailed deployment guides, configuration options, evaluation tasks (GAIA, WebWalker, BrowseComp), performance metrics, and debugging tips.

VI. MemoBrain’s Experimental Results: Let Data Speak

We validated MemoBrain’s performance on challenging long-horizon reasoning benchmarks. The results show that integrating MemoBrain-8B with different base agents consistently improves performance.

6.1 Core Experimental Results

The table below summarizes MemoBrain’s performance on GAIA (General AI Assistant benchmark) and WebWalkerQA (web reasoning benchmark). The first four score columns are GAIA (L1/L2/L3/Avg.), the last four are WebWalkerQA (Easy/Med./Hard/Avg.); best results are in bold, and “—” marks scores not reported:

| Method | L1 | L2 | L3 | Avg. | Easy | Med. | Hard | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Direct Reasoning (w/o Retrieval)* | | | | | | | | |
| QwQ-32B | 25.6 | 9.6 | 16.7 | 16.5 | 7.5 | 2.1 | 3.8 | 4.0 |
| GPT-4o | 23.1 | 15.4 | 8.3 | 17.5 | 6.7 | 6.0 | 4.2 | 5.5 |
| DeepSeek-R1-671B | 43.6 | 26.9 | 8.3 | 31.1 | 5.0 | 11.8 | 11.3 | 10.0 |
| *Retrieval-Augmented Generation* | | | | | | | | |
| Vanilla RAG (QwQ-32B) | 33.3 | 36.5 | 8.3 | 32.0 | 36.9 | 26.1 | 33.5 | 31.2 |
| Query Planning (QwQ-32B) | 48.7 | 25.0 | 8.3 | 32.0 | 28.8 | 35.7 | 30.8 | 32.5 |
| Iterative RAG (QwQ-32B) | 51.3 | 28.8 | 8.3 | 35.0 | 29.4 | 32.9 | 31.3 | 31.5 |
| *Tool-Integrated Reasoning* | | | | | | | | |
| ReAct (QwQ-32B) | 48.7 | 34.6 | 16.7 | 37.8 | 35.6 | 29.1 | 13.2 | 24.1 |
| ReAct (GPT-4o) | 51.2 | 34.6 | 8.3 | 34.6 | 34.6 | 42.0 | 23.9 | 33.8 |
| ReAct (Qwen3-30B-A3B) | 48.7 | 26.9 | 8.3 | 33.0 | 26.3 | 27.5 | 21.7 | 25.2 |
| WebThinker-32B† | 56.4 | 50.0 | 16.7 | 48.5 | 58.8 | 44.6 | 40.4 | 46.5 |
| WebDancer (QwQ-32B)† | 56.4 | 48.1 | 25.0 | 46.6 | 49.4 | 55.0 | 29.6 | 43.2 |
| ReSum-GRPO† | — | — | — | 48.5 | — | — | — | — |
| DeepAgent-RL† | 66.7 | 59.6 | 25.0 | 58.3 | — | — | — | — |
| AgentFold-30B-A3B† | — | — | — | 67.0 | — | — | — | — |
| GLM-4.6 | 76.9 | 59.6 | 33.3 | 63.1 | 64.4 | 62.9 | 48.8 | 58.2 |
| DeepResearch-30B-A3B | 79.5 | 67.3 | 41.7 | 68.9 | 72.5 | 71.8 | 61.3 | 68.2 |
| MemoBrain-8B w/ GLM-4.6 | 79.5 | **71.2** | 50.0 | 71.8 | 68.8 | 69.6 | 61.3 | 66.5 |
| MemoBrain-8B w/ DeepResearch-30B-A3B | **82.1** | 69.2 | **58.3** | **74.5** | **73.1** | **72.1** | **64.2** | **69.6** |

6.2 Key Findings

  1. Cross-Difficulty Improvement: MemoBrain-8B achieves improvements across all difficulty levels (L1/L2/L3) of GAIA. Especially in the most challenging L3 level, it increases by 16.6 percentage points compared to the base model;
  2. Breakthrough in Hard Tasks: In the Hard difficulty tasks of WebWalkerQA, the combination of MemoBrain-8B + DeepResearch-30B-A3B increases by 2.9 percentage points compared to the base model, demonstrating the value of memory management for long-horizon hard reasoning;
  3. Strong Versatility: MemoBrain stably improves performance whether integrated with GLM-4.6 or DeepResearch-30B-A3B, proving that it does not rely on specific base models but the inherent value of its memory management mechanism;
  4. Optimal Performance: The combination of MemoBrain-8B + DeepResearch-30B-A3B achieves the best results on both benchmarks, with an average score of 74.5 on GAIA and 69.6 on WebWalkerQA, verifying the core role of executive memory in long-horizon reasoning.

VII. Frequently Asked Questions (FAQ)

To address potential questions during usage, we’ve compiled common queries and their answers:

Q1: What is the difference between MemoBrain and traditional context compression?

A: Traditional context compression is usually undifferentiated text summarization, focusing only on “reducing character count” without considering the logical dependencies between reasoning steps. In contrast, MemoBrain is “semantic-level” memory management. It first constructs a dependency graph of reasoning steps, then cleans invalid nodes and folds completed sub-trajectories based on the graph structure. This not only reduces context volume but also retains the logical core of reasoning, avoiding the loss of key dependencies after compression.

Q2: Which LLMs does MemoBrain support?

A: Theoretically, it supports all LLMs compatible with the OpenAI API, including commercial models (such as DeepSeek V3.2 and GPT-5) and open-source models (such as Qwen series and Llama series). Officially, we recommend using the fine-tuned MemoBrain-4B/8B/14B, as these models are optimized for memory operations and deliver better performance.

Q3: How much additional reasoning cost does MemoBrain incur?

A: MemoBrain’s memory optimization operation (the recall method) requires an additional model call. However, since it can significantly reduce the context length for subsequent reasoning, it actually lowers the total reasoning cost overall—especially in long-horizon tasks, the token savings far exceed the cost of the optimization operation.

Q4: How should I set MemoBrain’s token budget?

A: It is recommended to set it based on the context window of the base model. For example, if the base model has a 32K token context window, MemoBrain’s max_memory_size can be set to 28K~30K (reserving some space for the agent’s real-time thinking); for models with an 8K context window, it can be set to 6K~7K. The core principle is to “reserve sufficient space for real-time reasoning while maximizing memory utilization.”
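If you want to codify this rule of thumb, a tiny helper can derive max_memory_size from the base model’s context window. The memory_budget function and its default 12.5% reserve ratio are illustrative choices made here, not MemoBrain settings:

```python
def memory_budget(context_window_tokens, reserve_ratio=0.125):
    """Suggest max_memory_size: the context window minus headroom
    reserved for the agent's real-time thinking."""
    return int(context_window_tokens * (1 - reserve_ratio))

# With a 12.5% reserve, 8K/32K/128K windows map into the ranges above:
for window in (8 * 1024, 32 * 1024, 128 * 1024):
    print(window, "->", memory_budget(window))  # 8192 -> 7168, 32768 -> 28672, 131072 -> 114688
```

Tune the reserve ratio upward for agents whose per-step thinking and tool feedback are unusually verbose.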

Q5: Which scenarios is MemoBrain suitable for?

A: It is most suitable for long-horizon reasoning scenarios that require multi-step, multi-tool calls, such as: in-depth academic research, complex factual question answering, multi-step code generation and debugging, cross-webpage information integration, enterprise-level intelligent assistants, etc. In short-horizon, single-step tasks (such as simple question answering), MemoBrain’s advantages are not obvious.

Q6: How to save and reuse MemoBrain’s memory data?

A: You can convert the memory graph into a dictionary with memory.to_dict() and save it as a JSON file; later, load it with memory.load_dict_memory() to reuse the previous memory data without re-recording reasoning trajectories.
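The round-trip looks roughly like the sketch below. To keep it runnable without a deployed model, MemoryStub stands in for a real MemoBrain instance; only the to_dict() / load_dict_memory() method names come from the answer above, and everything else is illustrative.

```python
import json
import os
import tempfile

class MemoryStub:
    """Stand-in for a MemoBrain instance, just to show the snapshot pattern."""
    def __init__(self, data=None):
        self._memory = data or {}

    def to_dict(self):
        # The real method returns the memory graph as a JSON-serializable dict.
        return self._memory

    def load_dict_memory(self, data):
        # The real method rebuilds the memory graph from such a dict.
        self._memory = data

# Save: serialize the memory graph to a JSON snapshot
memory = MemoryStub({"task": "Paris research", "nodes": ["goal", "search"]})
path = os.path.join(tempfile.gettempdir(), "memory_snapshot.json")
with open(path, "w") as f:
    json.dump({"memory": memory.to_dict()}, f)

# Reuse later: load the snapshot into a fresh instance
restored = MemoryStub()
with open(path) as f:
    restored.load_dict_memory(json.load(f)["memory"])

print(restored.to_dict()["task"])  # Paris research
```

The {"memory": ...} wrapper matches the snapshot format used in the loading example in section 4.1.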

Q7: What hardware configuration is required for MemoBrain deployment?

A: Requirements vary by model:

  • MemoBrain-4B: Can be deployed on a single GPU with 16GB VRAM;
  • MemoBrain-8B: A single GPU with 24GB VRAM is recommended;
  • MemoBrain-14B: A single GPU with 32GB VRAM or two GPUs with 16GB VRAM each is recommended.

If you deploy with vLLM, VRAM is used more efficiently, so the hardware requirements above can be somewhat relaxed.

VIII. Conclusion and Future Plans

MemoBrain’s core value lies in transforming “passive context accumulation” into “active executive memory management.” By constructing a dependency-aware memory graph, cleaning invalid information, and folding completed sub-trajectories, it ensures that LLMs maintain logical coherence during long-horizon reasoning while maximizing context usage efficiency.

Currently, MemoBrain’s core components are publicly available: the paper, the dedicated models (4B/8B/14B), the code implementation, and the ReAct agent integration examples; future iterations will continue to optimize efficiency and adaptability.

If your work involves tool-augmented agents, long-horizon reasoning, or LLM context management, MemoBrain is a tool worth trying—it does not require you to restructure existing agent frameworks, and with simple integration, it can significantly improve the stability and effectiveness of long-horizon reasoning.

Citation

If you use MemoBrain in your research or projects, please cite the relevant paper:

@article{memobrain2026,
  title={MemoBrain: Executive Memory as an Agentic Brain for Reasoning},
  author={Hongjin Qian and Zhao Cao and Zheng Liu},
  journal={arXiv preprint arXiv:2601.08079},
  year={2026}
}

Contribution and Open-Source Notes

MemoBrain is open-sourced under the MIT License, and contributions are welcome:

  • Submit bug reports and feature suggestions;
  • Optimize documentation and usage examples;
  • Submit code PRs to improve core functions.

We hope MemoBrain can provide valuable references for the research and application of LLM long-horizon reasoning, making agent reasoning more coherent and efficient.
