Uni-MoE-2.0-Omni: One Open-Source MoE Model that Understands and Generates Text, Images, Audio, and Video

Core question: Is there a single open-source large model that can both understand and generate text, images, speech, and video without stacking multiple pipelines?
One-sentence answer: Uni-MoE-2.0-Omni uses a dynamic-capacity Mixture-of-Experts (MoE) architecture built on Qwen2.5-7B, trained with 75B multimodal tokens, to deliver state-of-the-art performance on 85 benchmarks while keeping all code and weights publicly available.


Quick Scan (30 seconds)

What you get | Why it matters
Unified tokenizer for audio, image, video, text | One sequence → one forward pass → no external fusion
Dynamic MoE layer (1.5–18B active) | Activates only the experts a token needs; saves 40% compute
End-to-end generation | Language prompts directly drive speech (MoE-TTS) and image (Task-DiT) heads
85-benchmark evaluation | Leads in long-video QA, long-speech ASR, and image editing; competitive in text-to-image
Fully open | Five checkpoints, training scripts, data lists, and interactive demos on GitHub & HuggingFace

1. Why “Omnimodal” Is Harder Than “Multimodal”

Most systems chain specialized models: ASR → LLM → TTS for voice; ViT → LLM → diffusion for images. This works, but it creates three pain points:

  1. Error accumulation – every hop loses signal.
  2. Data format hell – mel-spectrograms, pixel arrays, and sub-word IDs don’t live in the same space.
  3. Deployment weight – four large models in RAM/GPU.

Uni-MoE-2.0-Omni’s design philosophy is simple: convert everything to tokens once, then run one sparse model that routes internally.


2. Architecture: One Token Sequence to Rule Them All

2.1 Tokenisation Strategy (Input Side)

Modality | Chunking | Tokens per unit | Encoder
Audio | 30 s chunks at 16 kHz | 200 per chunk (≈20 per 3 s) | Whisper-Large-v3
Image | 384×384 patches | (H/384)×(W/384)×T | SigLIP-So400M
Video | 1 fps, each frame at 384×384 | frame_count×T | same ViT (SigLIP-So400M)
Text | standard BPE | 1 per sub-word | Qwen2.5 tokenizer

All outputs are 1-D token lists that concatenate directly; no ad-hoc fusion layers.

Scenario walk-through
You upload a 12-second voice note asking “Which city is this?” together with a photo.

  • Audio → 80 tokens
  • Image → 256 tokens
  • Text → 10 tokens

The model sees all 346 tokens in one sequence and performs self-attention across them.
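
The arithmetic above can be sanity-checked with a few lines of Python. This is a minimal sketch under the stated per-modality rates; the constants and helper names (including the assumed tokens-per-patch value of 256 used to reproduce the example) are illustrative assumptions, not the repository's API.

# Illustrative sketch only: reproduces the token budget above under the stated
# per-modality rates. Constants and helper names are assumptions, not the
# repository's API.

AUDIO_TOKENS_PER_3S = 20        # ~200 tokens per 30 s audio chunk
PATCH = 384                     # SigLIP patch grid size
TOKENS_PER_PATCH = 256          # assumed T for a single 384x384 patch

def audio_tokens(duration_s: float) -> int:
    return round(duration_s / 3 * AUDIO_TOKENS_PER_3S)

def image_tokens(height: int, width: int) -> int:
    # (H/384) x (W/384) x T, following the table above
    return (height // PATCH) * (width // PATCH) * TOKENS_PER_PATCH

# 12 s voice note + one 384x384 photo + 10 text sub-words
n_audio, n_image, n_text = audio_tokens(12), image_tokens(384, 384), 10
# Because every modality is already a flat 1-D token list, "fusion" is just
# concatenation: the model sees one sequence of n_audio + n_image + n_text tokens.
print(n_audio, n_image, n_audio + n_image + n_text)   # 80 256 346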

2.2 Dynamic-Capacity MoE Layer

Core idea: replace every FFN with an MoE block that contains:

  • 4 Routed Experts (5.7B total) – modality-specialised, chosen by Top-P (>0.7 cumulative prob).
  • 2 Shared Experts (712M) – always on, supply common linguistic knowledge.
  • 1 Null Expert – zero output; learnt “skip” for easy tokens.

Hence each token activates 2 shared + 0–3 routed = 1.5B–18B params, average 8B.

Gradient trick
Because Top-P selection is non-differentiable, a straight-through estimator combined with ODE-based baseline subtraction (Euler or 3rd-order Heun solvers) lets gradients flow back to the router (see Appendix A.4 of the paper).
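
For intuition, here is the generic straight-through pattern this trick builds on, written in PyTorch. The paper's ODE baseline correction is deliberately omitted, so treat this as a simplified sketch rather than the authors' exact estimator.

import torch

# Generic straight-through pattern only; the paper's ODE-baseline correction
# (Euler / 3rd-order Heun) from Appendix A.4 is omitted here.
def straight_through(hard_mask: torch.Tensor, soft_probs: torch.Tensor) -> torch.Tensor:
    # Forward value equals the hard Top-P mask; gradients flow through the
    # soft router probabilities, so the router stays trainable.
    return hard_mask + soft_probs - soft_probs.detach()

probs = torch.tensor([0.45, 0.35, 0.12, 0.08], requires_grad=True)
hard = torch.tensor([1.0, 1.0, 0.0, 0.0])   # first two experts chosen by Top-P
gate = straight_through(hard, probs)
gate.sum().backward()
print(probs.grad)                           # all ones -> gradient reaches the router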

Scenario walk-through
During inference, a speech token is routed like this:

  • Router logits → softmax → [0.45, 0.35, 0.12, 0.08]
  • Cumulative [0.45, 0.80, 0.92, 1.00]
  • With P=0.7 the first two experts are chosen; the rest of the computation is skipped → 37% FLOPs saved.
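
The selection step in the walk-through can be reproduced with a short PyTorch snippet. This is a sketch of Top-P expert selection using the numbers above; the released router implementation may differ in detail.

import torch

def top_p_experts(router_logits: torch.Tensor, p: float = 0.7):
    # Pick the smallest set of experts whose cumulative probability reaches p.
    # Illustrative sketch; not the repository's router code.
    probs = torch.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep every expert up to and including the first one that crosses p
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    return sorted_idx[:cutoff], sorted_probs[:cutoff]

# Example from the walk-through: probabilities [0.45, 0.35, 0.12, 0.08]
logits = torch.log(torch.tensor([0.45, 0.35, 0.12, 0.08]))
experts, weights = top_p_experts(logits, p=0.7)
print(experts.tolist())   # [0, 1] -> only the first two routed experts run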

2.3 Omni-Modality 3D RoPE

Standard 1D RoPE is decomposed into (time, height, width).

  • Text: all three IDs are identical → falls back to plain 1-D RoPE.
  • Audio: the time ID increases every 3 s; the height/width IDs are repeated.
  • Image: the time ID is fixed; the height/width IDs grow patch by patch.
  • Video: the time ID increases per frame; the height/width IDs follow the image scheme.

This guarantees spatio-temporal alignment inside the attention matrix, so the model can learn which sound occurs while which pixels move.
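
A toy assignment of (time, height, width) IDs, following the four rules above, makes the scheme concrete. The index conventions (e.g., starting the height/width axes at 0) are assumptions for illustration only, not the released implementation.

# Illustrative assignment of (time, height, width) position IDs per modality,
# following the rules listed above. Index conventions are assumptions.

def text_positions(num_tokens):
    # all three IDs identical -> degenerates to ordinary 1-D RoPE
    return [(i, i, i) for i in range(num_tokens)]

def audio_positions(num_tokens, tokens_per_3s=20):
    # time ID advances once per 3 s block; height/width IDs are repeated (0 here)
    return [(i // tokens_per_3s, 0, 0) for i in range(num_tokens)]

def image_positions(grid_h, grid_w, t=0):
    # time ID fixed; height/width IDs grow patch by patch
    return [(t, h, w) for h in range(grid_h) for w in range(grid_w)]

def video_positions(num_frames, grid_h, grid_w):
    # one image grid per frame, with the time ID increasing per frame
    return [pos for f in range(num_frames) for pos in image_positions(grid_h, grid_w, t=f)]

print(audio_positions(6)[:3])    # [(0, 0, 0), (0, 0, 0), (0, 0, 0)]
print(video_positions(2, 2, 2))  # frame 0 then frame 1, each a 2x2 patch grid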


3. Generation Heads: From Tokens Back to Human-Perceptible Media

3.1 Context-Aware MoE-TTS (Speech Output)

  • Backbone: Qwen2.5-0.5B turned into 4-expert MoE
  • Input: text + style tokens <speech start> Jenny calm <speech prompt>
  • Output: WavTokenizer discrete codes (40 tokens/s) → 24 kHz waveform
  • Long-form: sentences are split at punctuation, and the previous segment’s latent guides the next → coherent speech beyond 2 minutes (sketched below)
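
A rough sketch of that long-form strategy, assuming a hypothetical synthesize_segment(sentence, prev_latent) callable that returns waveform codes plus a latent; the real MoE-TTS interface may look different.

import re

def split_sentences(text: str):
    # Split at sentence-final punctuation, keeping the punctuation with its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def long_form_tts(text: str, synthesize_segment):
    # `synthesize_segment(sentence, prev_latent)` is a hypothetical callable that
    # returns (discrete waveform codes, latent). Carrying the previous latent
    # forward is what keeps prosody coherent across multi-minute outputs.
    prev_latent, all_codes = None, []
    for sentence in split_sentences(text):
        codes, prev_latent = synthesize_segment(sentence, prev_latent)
        all_codes.extend(codes)
    return all_codes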

Benchmark snapshot (WER, lower is better)

Dataset | Uni-MoE-2.0-Omni | Qwen2.5-Omni | Gap
LibriTTS-clean | 5.85 % | 5.20 % | +0.65
SEED-hard | 2.67 % | 2.15 % | +0.52
TinyStories-en | 5.02 % | 6.20 % | -1.18

3.2 Task-Aware Diffusion Transformer (Image Output)

  • Keeps the PixWizard DiT frozen to avoid catastrophic forgetting.
  • Trains only the task projector and content projector (<1B params).
  • Accepts <TASK> edit / depth-to-image / derain ... plus <IMG> embeddings.
  • Runs 15 denoising steps → 1024×1024 output.

Scenario walk-through
Input: rainy night photo + prompt “Remove the rain”.
Task token = derain; content token = scene embedding.
DiT receives both → produces a clean image (PSNR 25.41 dB, SSIM 0.82), beating specialised deraining networks.
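
To make the conditioning flow concrete, here is a hedged sketch of how the two streams might be packaged before the frozen DiT is called; the token strings and the commented-out generate call are placeholders, not the actual PixWizard/Task-DiT API.

# Illustrative only: how the two conditioning streams could be assembled.
# Token strings and the `dit.generate` call are placeholders, not the real API.

def build_dit_conditioning(task: str, content_embedding, prompt: str):
    # The task token tells the frozen DiT *what* to do (edit, depth-to-image, derain, ...)
    task_token = f"<TASK> {task}"
    # The content projector output tells it *what the scene is*
    return {
        "task": task_token,
        "content": content_embedding,   # <IMG> embeddings from the LLM
        "prompt": prompt,
        "num_inference_steps": 15,      # 15 denoising steps -> 1024x1024 output
    }

conditioning = build_dit_conditioning(
    task="derain",
    content_embedding="scene_embedding_placeholder",
    prompt="Remove the rain",
)
# image = dit.generate(**conditioning)   # hypothetical call into the frozen DiT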


4. Training Recipe: From 7B Dense LLM to 26B Sparse Omnimodal Model

Stage | Data Tokens | Trainable Modules | Key Purpose
a. Cross-modal pretrain | 75B | MLP + QFormer | Align audio & vision tokens into text space
b. Expert warm-up | 30B | Dense experts (3) | Initialise the speech-understanding, speech-generation, and vision experts
c. Mixed SFT | 22B | Full MoE + TTS + DiT | Instruction following, cross-modal reasoning
d. Annealing + RL | 5B | All (low LR) | Balance modality ratio; GSPO-DPO boosts reasoning

Annealing detail

  • Image data filtered to 5B high-quality tokens
  • Video expanded to 21B by picking clips with clear audio → audio-visual QA pairs
  • Text kept small (4B) but heavy on STEM & chain-of-thought → preserves reasoning

Author’s reflection
“We initially feared speech generation would hurt visual benchmarks. The opposite happened: because the shared attention layer must align phonemes with lip motion in video clips, the model actually improved temporal grounding—an accidental but welcome side effect of unifying tokens.”


5. Evaluation Highlights (85 Benchmarks)

5.1 Video & Cross-modal Understanding

  • Video-MME (no subtitle): 66.4 (↑6.6 vs Qwen2.5-Omni)
  • VSI-Bench (spatial reasoning): 56.0 (↑36.7)
  • WorldSense long video QA: 44.7 (new state-of-the-art)

5.2 Audio & Speech

  • LibriSpeech-clean/other-long WER: 2.04 % / 4.20 % (↓4.2/3.8)
  • RACE-audio comprehension: 90.3 (middle) / 87.2 (high)
  • Music understanding MusicCaps: 62.4 CIDEr (2nd place)

5.3 Image Generation & Editing

Task | Metric | Uni-MoE-2.0-Omni | Best rival | Delta
GEdit-Bench | score ↑ | 6.02 | PixWizard 3.20 | +88%
Depth-to-Image | FID ↓ | 27.45 | Qwen-Image 27.54 | -0.3%
Derain | SSIM ↑ | 0.82 | Qwen-Image 0.80 | +2.5%

6. Quick-Start: Install, Load, Chat in 10 Lines

  1. Clone
git clone https://github.com/HITsz-TMG/Uni-MoE.git && cd Uni-MoE/Uni-MoE-2
  2. Create env
conda create -n unimoe python=3.10 && conda activate unimoe
pip install -r requirements.txt
  3. Download checkpoint (~50 GB)
huggingface-cli download HIT-TMG/Uni-MoE-2-8E --local-dir ./ckpt
  4. Run omnimodal chat
from unimoe import OmniInference
bot = OmniInference(ckpt="./ckpt", device_map="auto")
text, speech, image = bot.chat(
    audio="./ask.wav",               # 12s spoken question
    image="./scene.jpg",             # snapshot
    prompt="Where was this photo taken?",
    output_form=["text","speech","image"]
)
print(text)                        # "This is Guilin, China."
speech.save("./answer.wav")        # Jenny voice
image.save("./map.png")            # optional visual hint

GPU memory peaks at 36 GB in bf16; a single A100-40G suffices.


7. Author’s Takeaways & Lessons

  • Token unification beats modality-specific towers—shared attention is a stronger alignment force than external cross-attention layers.
  • Dynamic MoE + Top-P delivers real compute savings on easy tokens while keeping capacity for hard ones; training stability surprised us.
  • Progressive expert warm-up is critical—jumping straight to mixed SFT caused routing collapse in early ablations.
  • Language-centric generation (text prompts → speech/image heads) keeps generation controllable without re-designing vocoders or DiT cores.

Action Checklist / Implementation Steps

  1. Check GPU RAM ≥40 GB or use device_map="auto" for multi-GPU.
  2. Download repo + weights → run pip install -r requirements.txt.
  3. Test single-modality inputs first (text-only, then image+text, then audio).
  4. Tune top_p routing threshold (0.7 default) if you see expert overflow warnings.
  5. Fine-tune: freeze the shared experts and apply LoRA (r=16) to the routed experts for domain data ≤10M tokens (see the sketch after this list).
  6. For long speech (>2 min), set sentence_split_threshold=25 to avoid prosody drift.
  7. Image style transfer: prepare 50 high-res pairs, train Task-DiT projector 2k steps, LR 1e-5.
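
As a starting point for checklist step 5, here is a minimal sketch assuming a Hugging Face PEFT setup; the target_modules pattern is a guess and must be matched against the checkpoint's real routed-expert parameter names (inspect model.named_modules() first).

from peft import LoraConfig, get_peft_model

# Sketch of checklist step 5, assuming a PEFT/LoRA setup. The target module
# pattern below is a placeholder; adjust it to the checkpoint's real
# routed-expert layer names.
lora_config = LoraConfig(
    r=16,                                # LoRA rank from the checklist
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["routed_experts"],   # placeholder pattern
)

def prepare_for_domain_finetune(model):
    # Freeze everything first, including the shared experts ...
    for name, param in model.named_parameters():
        param.requires_grad = False
    # ... then let PEFT attach trainable LoRA adapters to the matched modules.
    return get_peft_model(model, lora_config)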

One-Page Overview

Uni-MoE-2.0-Omni is a fully open-source sparse MoE model that:

  • ingests text, audio, image, and video as a single token sequence;
  • routes tokens dynamically to modality-specific or shared experts (1.5–18B active params);
  • outputs text, natural speech (MoE-TTS), and high-res images (Task-DiT) conditioned on language prompts;
  • achieves new SOTA on long-video QA, long-speech ASR, and competitive numbers on image editing;
  • releases all five checkpoints, training scripts, and data lists for reproducibility.

Average inference compute is ~8B parameters, 40% less than a dense 13B model with comparable quality.


FAQ

  1. Is the model truly open for commercial use?
    Weights are Apache 2.0, but contain Whisper & SigLIP encoders—please follow their licences. Speech generation checkpoint is research-only.

  2. How much training compute if I want to reproduce?
    128×A100-80G for ~3 weeks. You can skip pretrain and start from warm-up experts (~30% GPU hours).

  3. Can I add a new modality, e.g., EEG?
    Yes: write a tokenizer that outputs 1-D tokens, insert a new routed expert, and run stage-a/b/c on your data.

  4. Does dynamic MoE hurt latency?
    No—Top-P selection is a small matmul; on A100 we measure <1 ms overhead per layer.

  5. What’s the minimum data for decent voice cloning?
    MoE-TTS needs ~30 minutes of clean, single-speaker audio; less data can work but may require more epochs.

  6. Why am I getting garbled long speech?
    Reduce sentence_split_threshold or enable context-slide (see long_form=True in repo).

  7. Image generation looks oversmooth—fix?
    Increase guidance_scale to 10–12 or sample with 50 DDIM steps instead of 15.
