OneThinker: One Model to Understand Both Images and Videos
Have you ever imagined an AI “polymath” capable of solving complex diagram-based math problems, precisely tracking objects in a video, and segmenting them—all within a single system? Traditionally, this required separate specialized models for tasks like visual question answering, video analysis, and object localization. This paradigm is now being reshaped by a unified generalist.
Today, we delve into OneThinker—a multimodal reasoning model designed to unify image and video understanding. Within a single framework, it masters ten fundamental visual tasks, including question answering, captioning, grounding, tracking, and segmentation, marking a significant step toward general visual intelligence.
What is OneThinker and What Can It Do?
In essence, OneThinker is a versatile AI model. Built upon the powerful Qwen3-VL architecture and trained at scale across diverse tasks, it achieves unprecedented generalization.
You can pose a wide variety of questions or instructions:
- Solve visual puzzles: “In this geometry diagram, if ∠ABD = 50°, what is the measure of ∠BCD?” (It reasons step-by-step and outputs the answer: C, 40°.)
- Answer video questions: “What is the protagonist holding at the 3-second mark in this clip?”
- Localize objects (Spatial Grounding): “Please draw a bounding box around the person in the red shirt.”
- Temporal grounding: “During which seconds does the car explosion occur in this movie trailer?”
- Video tracking: “Given the target object’s bounding box in the first frame, track it across all subsequent frames.”
- Image/Video Segmentation: “Please segment the shorter individual in the video.”
OneThinker handles all these tasks within a single model. It first conducts internal “thinking” (generates a chain-of-thought) and then produces a structured answer (e.g., a choice, coordinates, timestamps). This unified design not only makes it powerful but also fosters knowledge transfer across different tasks.
What Makes OneThinker Special? Its Core Technologies
Existing visual reasoning models are often specialists: some focus on images, others on videos; some excel at QA, others at detection. This fragmentation limits practical utility and generalization potential. OneThinker’s ambition is to break these barriers, built upon three key pillars:
1. Large-Scale, High-Quality Training Data: OneThinker-600K
Training a generalist requires exposure to a vast world. The team constructed the OneThinker-600K dataset, comprising approximately 600,000 samples. It spans both image and video modalities, covering eight major task categories: rule-based QA, open-ended QA, captioning, spatial/temporal/spatio-temporal grounding, tracking, and segmentation.
To provide a strong reasoning foundation, they used the powerful Seed1.5-VL model to generate high-quality chain-of-thought (CoT) annotations, resulting in the OneThinker-SFT-340K dataset for supervised fine-tuning (SFT).
2. Innovative Training Algorithm: EMA-GRPO
A major challenge arises when training such diverse tasks jointly with reinforcement learning (RL): reward heterogeneity. For instance, rewards for math problems (correct/incorrect) and bounding box detection (IoU score) differ vastly in scale and distribution. Standard RL methods would cause the model to favor some tasks over others.
OneThinker introduces the EMA-GRPO algorithm to address this elegantly. Think of it as an “intelligent balancer”:
- The Problem: Standard reward normalization can be unfair either within a task (to samples of varying difficulty) or across tasks (to rewards with different scales).
- The Solution: EMA-GRPO maintains a separate, task-wise exponential moving average (EMA) of reward standard deviations. When computing gradient updates, it normalizes using each task’s own dynamic scale (see the sketch after this list).
- The Result: This ensures fair treatment of samples within each task and balanced contributions from heterogeneous tasks like math, detection, and segmentation to the overall learning process. It’s akin to setting a different grading curve for each subject, encouraging well-rounded development.
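To make the idea concrete, here is a minimal Python sketch (not the authors’ implementation) of GRPO-style advantage computation in which the usual per-group standard deviation is replaced by a task-wise EMA of reward standard deviations. The decay factor, group structure, and function name are assumptions for illustration only.

```python
from collections import defaultdict
import numpy as np

# Hypothetical sketch of task-wise EMA reward normalization (not the paper's exact code).
# Assumption: rewards arrive in groups (one group = several rollouts of the same prompt),
# and each prompt is labeled with its task (e.g., "math_qa", "spatial_grounding").

EMA_DECAY = 0.99                      # assumed decay factor; the paper may use a different value
ema_std = defaultdict(lambda: 1.0)    # running per-task EMA of the reward standard deviation

def ema_grpo_advantages(rewards, task):
    """Compute GRPO-style advantages for one rollout group, scaled by a
    task-wise EMA of the reward std instead of the per-group std."""
    rewards = np.asarray(rewards, dtype=np.float64)

    # Update this task's EMA of the reward standard deviation with the group's std.
    group_std = rewards.std()
    ema_std[task] = EMA_DECAY * ema_std[task] + (1.0 - EMA_DECAY) * group_std

    # Center within the group (as in GRPO), but scale by the task-level EMA std,
    # so tasks with very different reward scales contribute comparably.
    centered = rewards - rewards.mean()
    return centered / (ema_std[task] + 1e-6)

# Example: a math group (binary rewards) and a detection group (IoU rewards)
print(ema_grpo_advantages([1.0, 0.0, 1.0, 1.0], task="math_qa"))
print(ema_grpo_advantages([0.62, 0.40, 0.75, 0.55], task="spatial_grounding"))
```

The key point is that groups with binary math rewards and groups with continuous IoU rewards end up on comparable advantage scales, so no single task dominates the policy update.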
3. Unified Task and Reward Formulation
Regardless of the input (image or video) or query type, OneThinker follows a consistent output format:
- Thinking Process: The model writes its reasoning steps inside <think>...</think> tags.
- Final Answer: It places the structured answer (e.g., {"boxes": {...}}) or text answer within <answer>...</answer> tags.
This design enables automated reward calculation. The total reward typically combines a task-accuracy reward (R_acc) and a format-compliance reward (R_format). For tasks requiring boxes or points, the format reward ensures the output is parsable.
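As an illustration, here is a minimal sketch of how such a combined reward could be computed for a spatial-grounding sample. The <think>/<answer> tags follow the paper’s description, but the JSON schema inside the answer, the IoU-based accuracy term, the reward weighting, and the helper names are assumptions made for this example.

```python
import json
import re

# Illustrative sketch of a combined reward (R_acc + R_format) for one grounding response.
# The tag format follows the article; the answer schema and weights are assumptions.

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def grounding_reward(model_output, gt_box):
    """Return R_acc + R_format for one spatial-grounding response."""
    has_think = THINK_RE.search(model_output) is not None
    answer_match = ANSWER_RE.search(model_output)

    r_format, r_acc = 0.0, 0.0
    if has_think and answer_match:
        try:
            pred = json.loads(answer_match.group(1))
            pred_box = pred["boxes"]        # assumed schema: {"boxes": [x1, y1, x2, y2]}
            r_format = 1.0                  # parsable, well-formed output
            r_acc = iou(pred_box, gt_box)   # IoU as the accuracy reward
        except (json.JSONDecodeError, KeyError, TypeError):
            pass                            # unparsable answer: no reward
    return r_acc + 0.1 * r_format           # assumed weighting between the two terms

# Hypothetical output for "draw a bounding box around the person in the red shirt"
output = '<think>The person in red is on the left ...</think><answer>{"boxes": [12, 30, 118, 240]}</answer>'
print(grounding_reward(output, gt_box=[10, 28, 120, 238]))
```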
How Well Does OneThinker Perform?
Theory is one thing; empirical results are what matter. OneThinker was rigorously evaluated on 31 benchmarks spanning 10 fundamental visual task categories, demonstrating impressive gains.
Here is a summary of key results:
| Task Category | Representative Benchmark | Qwen3-VL-8B (Baseline) | OneThinker-8B | Key Improvement |
|---|---|---|---|---|
| Image QA | MMMU (Multidisciplinary) | 60.2% | 70.6% | +10.4% |
| | MathVerse (Math Reasoning) | 58.1% | 64.3% | +6.2% |
| Video QA | VideoMMMU (Video Multidisciplinary) | 63.3% | 66.2% | +2.9% |
| | LongVideo-Reason (Long-Form Reasoning) | 71.5% | 79.2% | +7.7% |
| Spatial Grounding | RefCOCO testA | 92.2 | 93.7 | +1.5 |
| Temporal Grounding | ActivityNet R@0.5 | 26.1% | 43.6% | +17.5% |
| Video Tracking | GOT-10k AO | 33.7 | 73.0 | +39.3 |
| Video Segmentation | ReasonVOS J&F | 19.6 | 54.9 | +35.3 |
Insights & Analysis:
- Comprehensive Leadership: OneThinker shows significant improvement over its base model, Qwen3-VL, across the vast majority of tasks, with particularly large gains on perception-oriented tasks like tracking and segmentation.
- The Generalist Advantage: It excels not only at individual tasks but also demonstrates cross-task knowledge transfer. For example, spatial grounding training improves image QA and segmentation, while temporal grounding training benefits video QA and tracking.
- Zero-Shot Generalization: On novel, unseen tasks from MMT-Bench (e.g., point tracking, image quality assessment), OneThinker generalizes better than the baseline, highlighting the potential unlocked by unified training.
Frequently Asked Questions (FAQ)
Q1: How is OneThinker different from general-purpose models like ChatGPT or Gemini?
A1: While both are multimodal, their focus differs. Models like ChatGPT are general conversational and content-generation systems covering a broad range of domains (text, image, audio). OneThinker specializes in deep visual reasoning and fine-grained perceptual tasks, particularly those requiring precise spatial/temporal outputs (e.g., drawing boxes, localization, segmentation). Its specialized training and structured output format yield more reliable and actionable results for these professional vision tasks. Think of it as a domain expert within the visual realm.
Q2: Won’t a “unified model” perform worse on each task compared to specialized models?
A2: According to the paper’s results, OneThinker matches or surpasses many comparable-scale specialist models (e.g., Video-R1, Seg-R1) on numerous tasks. More importantly, unified training enables synergistic effects (1+1 > 2): cross-task knowledge sharing equips the model with a more comprehensive visual understanding, which can be advantageous in complex scenarios requiring integrated skills. While there may be gaps compared to some top-tier commercial specialists trained on massive proprietary data, OneThinker convincingly demonstrates the viability of the unified path.
Q3: Can I try OneThinker myself or use it for research?
A3: Absolutely! This is one of the project’s most commendable aspects. The authors have fully open-sourced all resources:
- 📄 Paper: Details the methodology and experiments.
- 🤖 Model Weights: Includes the final 8B-parameter model and the SFT checkpoint.
- 📊 Training & Evaluation Data: Contains the 600K training dataset and evaluation files for benchmarks.
- 💻 Full Code: Provides complete pipelines for environment setup, SFT training, RL training, and evaluation.
Q4: What is its tech stack, and is training expensive?
A4: OneThinker is based on the Qwen3-VL architecture, using LLaMA-Factory for SFT and EasyR1 for RL training. According to the paper, training requires at least 8 GPUs with 80GB memory each (e.g., H800). The complete process takes approximately 10 days. For researchers with limited resources, the released pre-trained models can be used directly for inference or further fine-tuning.
Getting Started with OneThinker
For developers and researchers interested in running or exploring OneThinker, here are the steps:
Environment Setup
The project requires two separate environments: one for SFT and one for RL.
```bash
# 1. Clone the repository
git clone https://github.com/tulerfeng/OneThinker
cd OneThinker

# 2. Setup SFT Environment (using LLaMA-Factory)
conda create -n llamafactory python=3.11
conda activate llamafactory
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

# 3. Setup RL Environment (using EasyR1)
conda create -n easyr1 python=3.11
conda activate easyr1
cd ../EasyR1   # step 2 left the shell inside LLaMA-Factory; EasyR1 sits alongside it at the repo root
pip install -e .
```
Obtaining Data and Models
- Download the OneThinker-train-data dataset from Hugging Face and extract it.
- Download the evaluation datasets OneThinker-eval.
- Download the trained model OneThinker-8B-model or the SFT checkpoint OneThinker-SFT-Qwen3-8B (a scripted download example follows this list).
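If you prefer to script the downloads, a minimal Python sketch using `huggingface_hub` is shown below. The exact repository IDs are assumptions based on the names above and the organization linked in the resources section, so verify them on the project page before running.

```python
from huggingface_hub import snapshot_download

# Example download script; the repo IDs below are assumed from the resource names
# in this post and the linked Hugging Face organization -- check the project page
# for the authoritative paths.

# Training data (dataset repo)
snapshot_download(repo_id="OneThink/OneThinker-train-data",
                  repo_type="dataset", local_dir="./data/OneThinker-train-data")

# Evaluation data (dataset repo)
snapshot_download(repo_id="OneThink/OneThinker-eval",
                  repo_type="dataset", local_dir="./data/OneThinker-eval")

# Final RL-trained model weights
snapshot_download(repo_id="OneThink/OneThinker-8B-model",
                  local_dir="./models/OneThinker-8B-model")
```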
Running Training (Optional, Resource-Intensive)
To replicate the training process:
```bash
# SFT Cold-Start Training
bash ./LLaMA-Factory/local_scripts/run_onethinker_sft.sh

# RL Training (based on the SFT model)
bash ./EasyR1/local_scripts/run_onethinker_rl.sh
```
Inference and Evaluation
For inference on a single example:
```bash
python ./Evaluation/inference_single/inference.py
```
To evaluate on all benchmarks:
```bash
bash ./Evaluation/Eval/eval_bench_all.sh
```
For image and some video QA tasks, evaluation can also be done using VLMEvalKit.
Conclusion and Future Outlook
OneThinker paints an exciting vision for the future: a true multimodal reasoning generalist. Through its meticulously crafted dataset, innovative EMA-GRPO algorithm, and unified task framework, it successfully merges image and video understanding, as well as cognition and perception, into a single model.
The implications of this work are profound:
- Proof of Feasibility: It demonstrates that large-scale, unified training across heterogeneous visual tasks is viable and can yield performance gains and knowledge transfer.
- Spirit of Openness: Fully open-sourcing the model, code, and data significantly lowers the barrier to entry for the research community, accelerating progress in this field.
- Directional Guidance: It provides a concrete technical pathway toward more general artificial intelligence (AGI).
The road ahead is long. Scaling to more complex tasks, processing longer videos, and integrating additional modalities (e.g., audio) remain challenges. However, OneThinker has undoubtedly taken a crucial and solid step forward. It is not just a powerful tool but an inspiring paradigm for building the next generation of AI systems.
For anyone interested in multimodal AI, computer vision, or artificial general intelligence, OneThinker is a project worthy of deep attention and exploration.
Resource Links
- Project Homepage & Code: https://github.com/tulerfeng/OneThinker
- Paper: https://arxiv.org/abs/2512.03043
- 🤗 Hugging Face Models & Datasets: https://huggingface.co/OneThink
