Seeing, Listening, Remembering, and Reasoning: A Practical Guide to the M3-Agent Multimodal Assistant with Long-Term Memory

This post is based entirely on the open-source M3-Agent project released by ByteDance Seed.
Every command, file path, and benchmark score is copied verbatim from the official repositories linked below.
No outside knowledge has been added.

TL;DR

Problem: Most vision-language models forget what they saw in a video minutes later.
Solution: M3-Agent keeps a graph-structured long-term memory that can be queried days later.
Result: Up to 8.2 % higher accuracy than GPT-4o + Gemini-1.5-pro on long-video QA.
Cost: Runs on a single 80 GB A100 or a 4×RTX 3090 setup; inference fits in 16 GB VRAM.
License: Apache-2.0 code + CC-BY-4.0 data.

What Exactly Is M3-Agent?
How the Memory Works (5 Diagrams)
The M3-Bench Dataset: 1,029 Long Videos with QA Pairs
Local Quick-Start: Zero-to-Answer in 30 Minutes
Training Your Own Memorization & Control Models with verl
Troubleshooting & FAQ
Further Reading & Citations

1. What Exactly Is M3-Agent?

Imagine a robot that walks through your home for two hours.
At any point in the future you can ask, “Where did I leave the red screwdriver?” and get the correct shelf and timestamp.
M3-Agent is the open-source stack that makes this possible.

Component	Function	Open-Source Location
Memorization model	Converts 30-second clips into episodic and semantic memory nodes	Hugging Face
Control model	Retrieves relevant memories and answers questions	Hugging Face
M3-Bench dataset	1,029 videos + QA pairs for evaluation	Hugging Face
verl training scripts	Reproduce or fine-tune both models	GitHub

2. How the Memory Works (5 Diagrams)

2.1 High-Level Flow

Raw video & audio
      │
      ├─① Memorization (online, 30 s clips)
      │     ├─Extract faces, objects, speech
      │     ├─Store as *episodic* nodes
      │     └─Build *semantic* abstractions
      │
      └─② Control (offline or on-demand)
            ├─Receive question
            ├─Query memory graph
            ├─Multi-turn reasoning
            └─Return answer

2.2 Memory Graph Schema

Nodes: entities, events, scenes
Edges: temporal order, spatial containment, semantic similarity
Modalities: text caption, key-frame crop, 256-d audio embedding

Each circle is a node; arrows are directed edges with typed labels.

2.3 Model Breakdown

Model	Base LLM	Extra Heads	Output
Memorization	Qwen2.5-Omni-7B	Scene parser + Speaker diarization	JSON node
Control	Same backbone	Cross-encoder retriever	Final answer

2.4 Training Objective

Memorization – supervised fine-tuning on 700 k (clip, memory) pairs.
Control – PPO with GPT-4o as reward model; runs inside the verl library.

3. The M3-Bench Dataset

3.1 Two Splits

Split	Videos	Avg. Length	QA Pairs	Source
robot	100	1 h 30 m	800	Real robot egocentric
web	929	20 m	9 k	YouTube & Vimeo

3.2 Annotation Types

Human understanding: identity, emotion, action
Object tracking: category, location changes
Event causality: triggers, outcomes
Cross-modal consistency: does speech match visual content?

Example QA:

Video: 45-minute kitchen recording
Question: “Where was the green bowl placed after the 37-minute mark?”
Answer: “Inside the top-left cupboard.”

Download one line:

huggingface-cli download ByteDance-Seed/M3-Bench --repo-type dataset --local-dir ./M3-Bench

4. Local Quick-Start: Zero-to-Answer in 30 Minutes

4.1 One-Time Setup

git clone https://github.com/ByteDance-Seed/M3-Agent.git
cd M3-Agent
bash setup.sh          # installs ffmpeg, sox, portaudio
pip install -r requirements.txt
pip install qwen-omni-utils==0.0.4

Note: setup.sh has been tested on Ubuntu 22.04 and macOS 14.
Windows users can run the same steps inside WSL2.

4.2 End-to-End Walk-Through

Step 0 – Pick a Video

Use the provided sample:

data/videos/robot/bedroom_01.mp4

Step 1 – Slice into 30-second Clips

./scripts/cut_video.sh data/videos/robot/bedroom_01.mp4

Result:

data/clips/robot/bedroom_01/
├── 0.mp4
├── 1.mp4
└── ...

Step 2 – Prepare JSONL Manifest

Create data/data.jsonl:

{"id": "bedroom_01", "video_path": "data/videos/robot/bedroom_01.mp4", "clip_path": "data/clips/robot/bedroom_01", "mem_path": "data/memory_graphs/robot/bedroom_01.pkl", "intermediate_path": "data/intermediate_outputs/robot/bedroom_01"}

Step 3 – Intermediate Outputs (Optional)

If you downloaded the official intermediate_outputs, skip this step.
Otherwise:

Download pretrained_eres2netv2.ckpt from ModelScope to ./models.
Clone the 3D-Speaker repo as ./speakerlab.

Then run:

python m3_agent/memorization_intermediate_outputs.py \
  --data_file data/data.jsonl

Outputs: speaker labels, face crops, ASR text.

Step 4 – Build the Memory Graph

python data_preparation/generate_memory_qwen.py \
  --data_file data/data.jsonl

You will obtain:

data/memory_graphs/robot/bedroom_01.pkl

Step 5 – Ask a Question

python m3_agent/control.py \
  --data_file data/annotations/robot.json

Example console log:

Q: Where did the user leave the remote?
A: It was placed on the right side of the sofa at timestamp 00:12:37.

4.3 Visualize the Graph (Debug)

python visualization.py \
  --mem_path data/memory_graphs/robot/bedroom_01.pkl \
  --clip_id 1

Navigate to http://localhost:8000 to inspect nodes and edges in your browser.

5. Training Your Own Models with verl

5.1 Memorization Model – Supervised Fine-Tuning

Repository

https://github.com/hyc2026/sft-qwen2.5-omni-thinker

System requirements:

1×A100 80 GB or 4×RTX 3090 24 GB
200 GB free disk for cached features

Commands:

git clone https://github.com/hyc2026/sft-qwen2.5-omni-thinker.git
cd sft-qwen2.5-omni-thinker
pip install -r requirements.txt

bash scripts/run_sft.sh \
  --model_name_or_path Qwen/Qwen2.5-Omni-7B \
  --data_path ./data/m3_memorization.jsonl \
  --output_dir ./checkpoints/mem \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4

Training time: ~6 hours on one A100.

5.2 Control Model – PPO via verl

Repository

https://github.com/hyc2026/M3-Agent-Training

Key hyper-parameters (tuned):

rollouts.n: 1024
ppo_epochs: 4
actor_lr: 1e-6
critic_lr: 5e-6
reward_model: gpt-4o-rule-mix

Launch:

python -m verl.trainer.main \
  config=ppo_m3_agent.yaml \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-Omni-7B \
  data.train_files=data/control_train.parquet \
  data.val_files=data/control_val.parquet

Expected convergence: 2,000 PPO steps → 0.72 reward on validation.

6. Troubleshooting & FAQ

Problem	Symptom	Fix
CUDA OOM during memorization	Out of memory at step 4	Reduce `batch_size` in `generate_memory_qwen.py` from 8 to 2
Speaker diarization fails	“No voice activity detected”	Check audio is stereo ≥16 kHz; re-encode with `ffmpeg -ar 16000`
Control model hallucinates	Answer not in video	Increase retrieval top-k from 5 to 20 in `control.py`
Windows path issues	`FileNotFoundError: data\clips`	Run inside WSL2 or replace backslashes in JSONL

Common Questions

Q1: Can I run inference on CPU only?
Yes, but expect 5–10× slower. Use device_map="auto" and set torch_dtype=torch.float16.

Q2: How do I plug in my own videos?
Replace the video_path in data.jsonl; keep the same slicing and naming conventions.

Q3: How large can the memory graph grow?
With the default TTL of 30 days and importance pruning, graphs for 24-hour footage stay under 4 GB RAM.

Q4: Is there a Docker image?
Not yet; the official roadmap lists it for Q3 2025. For now, use the provided setup.sh on a clean Ubuntu 22.04 machine.

7. Further Reading & Citations

Academic Paper

Lin Long et al., Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory, arXiv:2508.09736
https://arxiv.org/abs/2508.09736

Framework Paper

Guangming Sheng et al., HybridFlow: A Flexible and Efficient RLHF Framework, arXiv:2409.19256
https://arxiv.org/abs/2409.19256

Repositories & Data

Resource	URL
Full M3-Agent code & docs	https://github.com/ByteDance-Seed/M3-Agent
M3-Bench dataset	https://huggingface.co/datasets/ByteDance-Seed/M3-Bench
Memorization training scripts	https://github.com/hyc2026/sft-qwen2.5-omni-thinker
Control training scripts (verl)	https://github.com/hyc2026/M3-Agent-Training

If you build something cool with M3-Agent—like a home inventory bot or a lecture summarizer—consider opening a pull request or sharing your weights under the Apache-2.0 license. The Seed team and the wider community are listening.

M3-Agent: Revolutionizing Multimodal AI with Graph-Based Long-Term Memory

Seeing, Listening, Remembering, and Reasoning: A Practical Guide to the M3-Agent Multimodal Assistant with Long-Term Memory

TL;DR

Contents

1. What Exactly Is M3-Agent?

2. How the Memory Works (5 Diagrams)

2.1 High-Level Flow

2.2 Memory Graph Schema

2.3 Model Breakdown

2.4 Training Objective

3. The M3-Bench Dataset

3.1 Two Splits

3.2 Annotation Types

4. Local Quick-Start: Zero-to-Answer in 30 Minutes

4.1 One-Time Setup

4.2 End-to-End Walk-Through

Step 0 – Pick a Video

Step 1 – Slice into 30-second Clips

Step 2 – Prepare JSONL Manifest

Step 3 – Intermediate Outputs (Optional)

Step 4 – Build the Memory Graph

Step 5 – Ask a Question

4.3 Visualize the Graph (Debug)

5. Training Your Own Models with verl

5.1 Memorization Model – Supervised Fine-Tuning

5.2 Control Model – PPO via verl

6. Troubleshooting & FAQ

Common Questions

7. Further Reading & Citations

Academic Paper

Framework Paper

Repositories & Data

M3-Agent: Revolutionizing Multimodal AI with Graph-Based Long-Term Memory

Seeing, Listening, Remembering, and Reasoning: A Practical Guide to the M3-Agent Multimodal Assistant with Long-Term Memory

TL;DR

Contents

1. What Exactly Is M3-Agent?

2. How the Memory Works (5 Diagrams)

2.1 High-Level Flow

2.2 Memory Graph Schema

2.3 Model Breakdown

2.4 Training Objective

3. The M3-Bench Dataset

3.1 Two Splits

3.2 Annotation Types

4. Local Quick-Start: Zero-to-Answer in 30 Minutes

4.1 One-Time Setup

4.2 End-to-End Walk-Through

Step 0 – Pick a Video

Step 1 – Slice into 30-second Clips

Step 2 – Prepare JSONL Manifest

Step 3 – Intermediate Outputs (Optional)

Step 4 – Build the Memory Graph

Step 5 – Ask a Question

4.3 Visualize the Graph (Debug)

5. Training Your Own Models with verl

5.1 Memorization Model – Supervised Fine-Tuning

5.2 Control Model – PPO via verl

6. Troubleshooting & FAQ

Common Questions

7. Further Reading & Citations

Academic Paper

Framework Paper

Repositories & Data

Related Posts