Seeing, Listening, Remembering, and Reasoning: A Practical Guide to the M3-Agent Multimodal Assistant with Long-Term Memory

This post is based entirely on the open-source M3-Agent project released by ByteDance Seed.
Every command, file path, and benchmark score is copied verbatim from the official repositories linked below.
No outside knowledge has been added.


TL;DR

  • Problem: Most vision-language models forget what they saw in a video minutes later.
  • Solution: M3-Agent keeps a graph-structured long-term memory that can be queried days later.
  • Result: Up to 8.2 % higher accuracy than GPT-4o + Gemini-1.5-pro on long-video QA.
  • Cost: Runs on a single 80 GB A100 or a 4×RTX 3090 setup; inference fits in 16 GB VRAM.
  • License: Apache-2.0 code + CC-BY-4.0 data.

Contents

  1. What Exactly Is M3-Agent?
  2. How the Memory Works (5 Diagrams)
  3. The M3-Bench Dataset: 1,029 Long Videos with QA Pairs
  4. Local Quick-Start: Zero-to-Answer in 30 Minutes
  5. Training Your Own Memorization & Control Models with verl
  6. Troubleshooting & FAQ
  7. Further Reading & Citations

1. What Exactly Is M3-Agent?

Imagine a robot that walks through your home for two hours.
At any point in the future you can ask, “Where did I leave the red screwdriver?” and get the correct shelf and timestamp.
M3-Agent is the open-source stack that makes this possible.

Component Function Open-Source Location
Memorization model Converts 30-second clips into episodic and semantic memory nodes Hugging Face
Control model Retrieves relevant memories and answers questions Hugging Face
M3-Bench dataset 1,029 videos + QA pairs for evaluation Hugging Face
verl training scripts Reproduce or fine-tune both models GitHub

2. How the Memory Works (5 Diagrams)

2.1 High-Level Flow

Raw video & audio
      │
      ├─① Memorization (online, 30 s clips)
      │     ├─Extract faces, objects, speech
      │     ├─Store as *episodic* nodes
      │     └─Build *semantic* abstractions
      │
      └─② Control (offline or on-demand)
            ├─Receive question
            ├─Query memory graph
            ├─Multi-turn reasoning
            └─Return answer

2.2 Memory Graph Schema

  • Nodes: entities, events, scenes
  • Edges: temporal order, spatial containment, semantic similarity
  • Modalities: text caption, key-frame crop, 256-d audio embedding
memory_graph

Each circle is a node; arrows are directed edges with typed labels.

2.3 Model Breakdown

Model Base LLM Extra Heads Output
Memorization Qwen2.5-Omni-7B Scene parser + Speaker diarization JSON node
Control Same backbone Cross-encoder retriever Final answer

2.4 Training Objective

  1. Memorization – supervised fine-tuning on 700 k (clip, memory) pairs.
  2. Control – PPO with GPT-4o as reward model; runs inside the verl library.

3. The M3-Bench Dataset

3.1 Two Splits

Split Videos Avg. Length QA Pairs Source
robot 100 1 h 30 m 800 Real robot egocentric
web 929 20 m 9 k YouTube & Vimeo

3.2 Annotation Types

  • Human understanding: identity, emotion, action
  • Object tracking: category, location changes
  • Event causality: triggers, outcomes
  • Cross-modal consistency: does speech match visual content?

Example QA:

Video: 45-minute kitchen recording
Question: “Where was the green bowl placed after the 37-minute mark?”
Answer: “Inside the top-left cupboard.”

Download one line:

huggingface-cli download ByteDance-Seed/M3-Bench --repo-type dataset --local-dir ./M3-Bench

4. Local Quick-Start: Zero-to-Answer in 30 Minutes

4.1 One-Time Setup

git clone https://github.com/ByteDance-Seed/M3-Agent.git
cd M3-Agent
bash setup.sh          # installs ffmpeg, sox, portaudio
pip install -r requirements.txt
pip install qwen-omni-utils==0.0.4

Note: setup.sh has been tested on Ubuntu 22.04 and macOS 14.
Windows users can run the same steps inside WSL2.

4.2 End-to-End Walk-Through

Step 0 – Pick a Video

Use the provided sample:

data/videos/robot/bedroom_01.mp4

Step 1 – Slice into 30-second Clips

./scripts/cut_video.sh data/videos/robot/bedroom_01.mp4

Result:

data/clips/robot/bedroom_01/
├── 0.mp4
├── 1.mp4
└── ...

Step 2 – Prepare JSONL Manifest

Create data/data.jsonl:

{"id": "bedroom_01", "video_path": "data/videos/robot/bedroom_01.mp4", "clip_path": "data/clips/robot/bedroom_01", "mem_path": "data/memory_graphs/robot/bedroom_01.pkl", "intermediate_path": "data/intermediate_outputs/robot/bedroom_01"}

Step 3 – Intermediate Outputs (Optional)

If you downloaded the official intermediate_outputs, skip this step.
Otherwise:

  1. Download pretrained_eres2netv2.ckpt from ModelScope to ./models.
  2. Clone the 3D-Speaker repo as ./speakerlab.

Then run:

python m3_agent/memorization_intermediate_outputs.py \
  --data_file data/data.jsonl

Outputs: speaker labels, face crops, ASR text.

Step 4 – Build the Memory Graph

python data_preparation/generate_memory_qwen.py \
  --data_file data/data.jsonl

You will obtain:

data/memory_graphs/robot/bedroom_01.pkl

Step 5 – Ask a Question

python m3_agent/control.py \
  --data_file data/annotations/robot.json

Example console log:

Q: Where did the user leave the remote?
A: It was placed on the right side of the sofa at timestamp 00:12:37.

4.3 Visualize the Graph (Debug)

python visualization.py \
  --mem_path data/memory_graphs/robot/bedroom_01.pkl \
  --clip_id 1

Navigate to http://localhost:8000 to inspect nodes and edges in your browser.


5. Training Your Own Models with verl

5.1 Memorization Model – Supervised Fine-Tuning

Repository

https://github.com/hyc2026/sft-qwen2.5-omni-thinker

System requirements:

  • 1×A100 80 GB or 4×RTX 3090 24 GB
  • 200 GB free disk for cached features

Commands:

git clone https://github.com/hyc2026/sft-qwen2.5-omni-thinker.git
cd sft-qwen2.5-omni-thinker
pip install -r requirements.txt

bash scripts/run_sft.sh \
  --model_name_or_path Qwen/Qwen2.5-Omni-7B \
  --data_path ./data/m3_memorization.jsonl \
  --output_dir ./checkpoints/mem \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4

Training time: ~6 hours on one A100.

5.2 Control Model – PPO via verl

Repository

https://github.com/hyc2026/M3-Agent-Training

Key hyper-parameters (tuned):

rollouts.n: 1024
ppo_epochs: 4
actor_lr: 1e-6
critic_lr: 5e-6
reward_model: gpt-4o-rule-mix

Launch:

python -m verl.trainer.main \
  config=ppo_m3_agent.yaml \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-Omni-7B \
  data.train_files=data/control_train.parquet \
  data.val_files=data/control_val.parquet

Expected convergence: 2,000 PPO steps → 0.72 reward on validation.


6. Troubleshooting & FAQ

Problem Symptom Fix
CUDA OOM during memorization Out of memory at step 4 Reduce batch_size in generate_memory_qwen.py from 8 to 2
Speaker diarization fails “No voice activity detected” Check audio is stereo ≥16 kHz; re-encode with ffmpeg -ar 16000
Control model hallucinates Answer not in video Increase retrieval top-k from 5 to 20 in control.py
Windows path issues FileNotFoundError: data\clips Run inside WSL2 or replace backslashes in JSONL

Common Questions

Q1: Can I run inference on CPU only?
Yes, but expect 5–10× slower. Use device_map="auto" and set torch_dtype=torch.float16.

Q2: How do I plug in my own videos?
Replace the video_path in data.jsonl; keep the same slicing and naming conventions.

Q3: How large can the memory graph grow?
With the default TTL of 30 days and importance pruning, graphs for 24-hour footage stay under 4 GB RAM.

Q4: Is there a Docker image?
Not yet; the official roadmap lists it for Q3 2025. For now, use the provided setup.sh on a clean Ubuntu 22.04 machine.


7. Further Reading & Citations

Academic Paper

Lin Long et al., Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory, arXiv:2508.09736
https://arxiv.org/abs/2508.09736

Framework Paper

Guangming Sheng et al., HybridFlow: A Flexible and Efficient RLHF Framework, arXiv:2409.19256
https://arxiv.org/abs/2409.19256

Repositories & Data

Resource URL
Full M3-Agent code & docs https://github.com/ByteDance-Seed/M3-Agent
M3-Bench dataset https://huggingface.co/datasets/ByteDance-Seed/M3-Bench
Memorization training scripts https://github.com/hyc2026/sft-qwen2.5-omni-thinker
Control training scripts (verl) https://github.com/hyc2026/M3-Agent-Training

If you build something cool with M3-Agent—like a home inventory bot or a lecture summarizer—consider opening a pull request or sharing your weights under the Apache-2.0 license. The Seed team and the wider community are listening.