Seeing, Listening, Remembering, and Reasoning: A Practical Guide to the M3-Agent Multimodal Assistant with Long-Term Memory
This post is based entirely on the open-source M3-Agent project released by ByteDance Seed.
Every command, file path, and benchmark score is copied verbatim from the official repositories linked below.
No outside knowledge has been added.
TL;DR
-
Problem: Most vision-language models forget what they saw in a video minutes later. -
Solution: M3-Agent keeps a graph-structured long-term memory that can be queried days later. -
Result: Up to 8.2 % higher accuracy than GPT-4o + Gemini-1.5-pro on long-video QA. -
Cost: Runs on a single 80 GB A100 or a 4×RTX 3090 setup; inference fits in 16 GB VRAM. -
License: Apache-2.0 code + CC-BY-4.0 data.
Contents
-
What Exactly Is M3-Agent? -
How the Memory Works (5 Diagrams) -
The M3-Bench Dataset: 1,029 Long Videos with QA Pairs -
Local Quick-Start: Zero-to-Answer in 30 Minutes -
Training Your Own Memorization & Control Models with verl -
Troubleshooting & FAQ -
Further Reading & Citations
1. What Exactly Is M3-Agent?
Imagine a robot that walks through your home for two hours.
At any point in the future you can ask, “Where did I leave the red screwdriver?” and get the correct shelf and timestamp.
M3-Agent is the open-source stack that makes this possible.
Component | Function | Open-Source Location |
---|---|---|
Memorization model | Converts 30-second clips into episodic and semantic memory nodes | Hugging Face |
Control model | Retrieves relevant memories and answers questions | Hugging Face |
M3-Bench dataset | 1,029 videos + QA pairs for evaluation | Hugging Face |
verl training scripts | Reproduce or fine-tune both models | GitHub |
2. How the Memory Works (5 Diagrams)
2.1 High-Level Flow
Raw video & audio
│
├─① Memorization (online, 30 s clips)
│ ├─Extract faces, objects, speech
│ ├─Store as *episodic* nodes
│ └─Build *semantic* abstractions
│
└─② Control (offline or on-demand)
├─Receive question
├─Query memory graph
├─Multi-turn reasoning
└─Return answer
2.2 Memory Graph Schema
-
Nodes: entities, events, scenes -
Edges: temporal order, spatial containment, semantic similarity -
Modalities: text caption, key-frame crop, 256-d audio embedding

Each circle is a node; arrows are directed edges with typed labels.
2.3 Model Breakdown
Model | Base LLM | Extra Heads | Output |
---|---|---|---|
Memorization | Qwen2.5-Omni-7B | Scene parser + Speaker diarization | JSON node |
Control | Same backbone | Cross-encoder retriever | Final answer |
2.4 Training Objective
-
Memorization – supervised fine-tuning on 700 k (clip, memory) pairs. -
Control – PPO with GPT-4o as reward model; runs inside the verl library.
3. The M3-Bench Dataset
3.1 Two Splits
Split | Videos | Avg. Length | QA Pairs | Source |
---|---|---|---|---|
robot | 100 | 1 h 30 m | 800 | Real robot egocentric |
web | 929 | 20 m | 9 k | YouTube & Vimeo |
3.2 Annotation Types
-
Human understanding: identity, emotion, action -
Object tracking: category, location changes -
Event causality: triggers, outcomes -
Cross-modal consistency: does speech match visual content?
Example QA:
Video: 45-minute kitchen recording
Question: “Where was the green bowl placed after the 37-minute mark?”
Answer: “Inside the top-left cupboard.”
Download one line:
huggingface-cli download ByteDance-Seed/M3-Bench --repo-type dataset --local-dir ./M3-Bench
4. Local Quick-Start: Zero-to-Answer in 30 Minutes
4.1 One-Time Setup
git clone https://github.com/ByteDance-Seed/M3-Agent.git
cd M3-Agent
bash setup.sh # installs ffmpeg, sox, portaudio
pip install -r requirements.txt
pip install qwen-omni-utils==0.0.4
Note:
setup.sh
has been tested on Ubuntu 22.04 and macOS 14.
Windows users can run the same steps inside WSL2.
4.2 End-to-End Walk-Through
Step 0 – Pick a Video
Use the provided sample:
data/videos/robot/bedroom_01.mp4
Step 1 – Slice into 30-second Clips
./scripts/cut_video.sh data/videos/robot/bedroom_01.mp4
Result:
data/clips/robot/bedroom_01/
├── 0.mp4
├── 1.mp4
└── ...
Step 2 – Prepare JSONL Manifest
Create data/data.jsonl
:
{"id": "bedroom_01", "video_path": "data/videos/robot/bedroom_01.mp4", "clip_path": "data/clips/robot/bedroom_01", "mem_path": "data/memory_graphs/robot/bedroom_01.pkl", "intermediate_path": "data/intermediate_outputs/robot/bedroom_01"}
Step 3 – Intermediate Outputs (Optional)
If you downloaded the official intermediate_outputs
, skip this step.
Otherwise:
-
Download pretrained_eres2netv2.ckpt
from ModelScope to./models
. -
Clone the 3D-Speaker
repo as./speakerlab
.
Then run:
python m3_agent/memorization_intermediate_outputs.py \
--data_file data/data.jsonl
Outputs: speaker labels, face crops, ASR text.
Step 4 – Build the Memory Graph
python data_preparation/generate_memory_qwen.py \
--data_file data/data.jsonl
You will obtain:
data/memory_graphs/robot/bedroom_01.pkl
Step 5 – Ask a Question
python m3_agent/control.py \
--data_file data/annotations/robot.json
Example console log:
Q: Where did the user leave the remote?
A: It was placed on the right side of the sofa at timestamp 00:12:37.
4.3 Visualize the Graph (Debug)
python visualization.py \
--mem_path data/memory_graphs/robot/bedroom_01.pkl \
--clip_id 1
Navigate to http://localhost:8000
to inspect nodes and edges in your browser.
5. Training Your Own Models with verl
5.1 Memorization Model – Supervised Fine-Tuning
Repository
https://github.com/hyc2026/sft-qwen2.5-omni-thinker
System requirements:
-
1×A100 80 GB or 4×RTX 3090 24 GB -
200 GB free disk for cached features
Commands:
git clone https://github.com/hyc2026/sft-qwen2.5-omni-thinker.git
cd sft-qwen2.5-omni-thinker
pip install -r requirements.txt
bash scripts/run_sft.sh \
--model_name_or_path Qwen/Qwen2.5-Omni-7B \
--data_path ./data/m3_memorization.jsonl \
--output_dir ./checkpoints/mem \
--num_train_epochs 3 \
--per_device_train_batch_size 4
Training time: ~6 hours on one A100.
5.2 Control Model – PPO via verl
Repository
https://github.com/hyc2026/M3-Agent-Training
Key hyper-parameters (tuned):
rollouts.n: 1024
ppo_epochs: 4
actor_lr: 1e-6
critic_lr: 5e-6
reward_model: gpt-4o-rule-mix
Launch:
python -m verl.trainer.main \
config=ppo_m3_agent.yaml \
actor_rollout_ref.model.path=Qwen/Qwen2.5-Omni-7B \
data.train_files=data/control_train.parquet \
data.val_files=data/control_val.parquet
Expected convergence: 2,000 PPO steps → 0.72 reward on validation.
6. Troubleshooting & FAQ
Problem | Symptom | Fix |
---|---|---|
CUDA OOM during memorization | Out of memory at step 4 | Reduce batch_size in generate_memory_qwen.py from 8 to 2 |
Speaker diarization fails | “No voice activity detected” | Check audio is stereo ≥16 kHz; re-encode with ffmpeg -ar 16000 |
Control model hallucinates | Answer not in video | Increase retrieval top-k from 5 to 20 in control.py |
Windows path issues | FileNotFoundError: data\clips |
Run inside WSL2 or replace backslashes in JSONL |
Common Questions
Q1: Can I run inference on CPU only?
Yes, but expect 5–10× slower. Use device_map="auto"
and set torch_dtype=torch.float16
.
Q2: How do I plug in my own videos?
Replace the video_path
in data.jsonl
; keep the same slicing and naming conventions.
Q3: How large can the memory graph grow?
With the default TTL of 30 days and importance pruning, graphs for 24-hour footage stay under 4 GB RAM.
Q4: Is there a Docker image?
Not yet; the official roadmap lists it for Q3 2025. For now, use the provided setup.sh
on a clean Ubuntu 22.04 machine.
7. Further Reading & Citations
Academic Paper
Lin Long et al., Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory, arXiv:2508.09736
https://arxiv.org/abs/2508.09736
Framework Paper
Guangming Sheng et al., HybridFlow: A Flexible and Efficient RLHF Framework, arXiv:2409.19256
https://arxiv.org/abs/2409.19256
Repositories & Data
Resource | URL |
---|---|
Full M3-Agent code & docs | https://github.com/ByteDance-Seed/M3-Agent |
M3-Bench dataset | https://huggingface.co/datasets/ByteDance-Seed/M3-Bench |
Memorization training scripts | https://github.com/hyc2026/sft-qwen2.5-omni-thinker |
Control training scripts (verl) | https://github.com/hyc2026/M3-Agent-Training |
If you build something cool with M3-Agent—like a home inventory bot or a lecture summarizer—consider opening a pull request or sharing your weights under the Apache-2.0 license. The Seed team and the wider community are listening.