Uni-MoE-2.0-Omni: One Open-Source MoE Model that Understands and Generates Text, Images, Audio, and Video
Core question: Is there a single open-source large model that can both understand and generate text, images, speech, and video without stacking multiple pipelines?
One-sentence answer: Uni-MoE-2.0-Omni uses a dynamic-capacity Mixture-of-Experts (MoE) architecture built on Qwen2.5-7B, trained with 75B multimodal tokens, to deliver state-of-the-art performance on 85 benchmarks while keeping all code and weights publicly available.
Quick Scan (30 seconds)
| What you get | Why it matters |
|---|---|
| Unified tokenizer for audio, image, video, text | One sequence → one forward pass → no external fusion |
| Dynamic MoE layer (1.5–18B active) | Activates only the experts a token needs; saves 40% compute |
| End-to-end generation | Language prompts directly drive speech (MoE-TTS) and image (Task-DiT) heads |
| 85-benchmark evaluation | Leads in long-video QA, long-speech ASR, and image editing; competitive in text-to-image |
| Fully open | Five checkpoints, training scripts, data lists, and interactive demos on GitHub & HuggingFace |
1. Why “Omnimodal” Is Harder Than “Multimodal”
Most systems chain specialized models: ASR → LLM → TTS for voice; ViT → LLM → diffusion for images. This works but brings three pains:
- Error accumulation – every hop loses signal.
- Data format hell – mel-spectrograms, pixel arrays, and sub-word IDs don’t live in the same space.
- Deployment weight – four large models in RAM/GPU.
Uni-MoE-2.0-Omni’s design philosophy is simple: convert everything to tokens once, then run one sparse model that routes internally.
2. Architecture: One Token Sequence to Rule Them All
2.1 Tokenisation Strategy (Input Side)
| Modality | Chunking | Tokens per unit | Encoder |
|---|---|---|---|
| Audio | 30 s chunks, 16 kHz | 200 per chunk (≈20 per 3 s) | Whisper-Large-v3 |
| Image | 384×384 patches | (H/384)×(W/384)×T | SigLIP-So400M |
| Video | 1 fps, per-frame 384×384 | frame_count×T | same ViT |
| Text | standard BPE | 1 per sub-word | Qwen2.5 tokenizer |
All outputs are 1-D token lists that concatenate directly; no ad-hoc fusion layers.
Scenario walk-through
You upload a 12 s voice note asking “Which city is this?” together with a photo.
- Audio → 80 tokens
- Image → 256 tokens
- Text → 10 tokens
The model sees 346 tokens in one sequence and performs self-attention across them.
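To make the bookkeeping concrete, here is a minimal Python sketch of the token-budget arithmetic for this walk-through. The per-second audio rate comes from the tokenisation table above; the function name and constants are illustrative and not part of the released API.

```python
# Illustrative token-budget arithmetic for the walk-through above.
# The rates approximate the tokenisation table; the real tokenizers
# live inside the released repo, not in this sketch.
AUDIO_TOKENS_PER_SECOND = 20 / 3   # ~20 Whisper-derived tokens per 3 s of audio

def sequence_length(audio_seconds: float, image_tokens: int, text_tokens: int) -> int:
    """Length of the single concatenated sequence the model attends over."""
    audio_tokens = round(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return audio_tokens + image_tokens + text_tokens

print(sequence_length(12, 256, 10))   # 80 + 256 + 10 = 346
```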
2.2 Dynamic-Capacity MoE Layer
Core idea: replace every FFN with an MoE block that contains:
- 4 Routed Experts (5.7B total) – modality-specialised, chosen by Top-P (>0.7 cumulative prob).
- 2 Shared Experts (712M) – always on, supply common linguistic knowledge.
- 1 Null Expert – zero output; a learnt “skip” for easy tokens.
Hence each token activates the 2 shared experts plus 0–3 routed experts, i.e. 1.5B–18B active params, ~8B on average.
Gradient trick
Top-P is non-differentiable → straight-through estimator + ODE baseline subtraction (Euler or 3rd-order Heun) lets gradients flow to the router (see Appendix A.4 in the paper).
Scenario walk-through
During inference, a speech token is routed like this:
- Router logits → softmax → [0.45, 0.35, 0.12, 0.08]
- Cumulative: [0.45, 0.80, 0.92, 1.00]
- With P = 0.7 the first two experts are chosen; the rest of the computation is skipped → 37% FLOPs saved.
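The selection rule above, together with a generic straight-through pass-through so gradients still reach the router, can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact recipe (the baseline-subtraction details live in its Appendix A.4), and the function name is hypothetical.

```python
import torch

def top_p_expert_mask(router_logits: torch.Tensor, p: float = 0.7) -> torch.Tensor:
    """Hard Top-P expert mask with a straight-through gradient.

    Forward: a 0/1 mask that keeps experts until their cumulative
    probability reaches p (always at least one expert).
    Backward: gradients flow as if the mask were the soft probabilities.
    """
    probs = router_logits.softmax(dim=-1)                  # e.g. [0.45, 0.35, 0.12, 0.08]
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    mass_before = sorted_probs.cumsum(dim=-1) - sorted_probs
    keep_sorted = (mass_before < p).float()                # keeps experts 1 and 2 for p = 0.7
    hard = torch.zeros_like(probs).scatter(-1, order, keep_sorted)
    # Straight-through: hard mask in the forward pass, soft gradients backward.
    return hard + probs - probs.detach()

print(top_p_expert_mask(torch.tensor([0.45, 0.35, 0.12, 0.08]).log()))  # tensor([1., 1., 0., 0.])
```

Experts whose mask entry is zero are simply not evaluated, which is where the skipped FLOPs come from.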
2.3 Omni-Modality 3D RoPE
The standard 1-D RoPE position index is decomposed into three components: (time, height, width).
- Text: all three IDs identical → falls back to 1-D RoPE.
- Audio: time-ID increases every 3 s; h, w repeated.
- Image: time-ID fixed; h, w grow patch-by-patch.
- Video: time-ID increases per frame; h, w same as for an image.
This guarantees spatio-temporal alignment inside the attention matrix, so the model can learn “what sound happens when which pixel moves”.
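The index layout can be sketched with a few plain-Python helpers; these are hypothetical illustrations of the scheme listed above, and the released implementation may differ in detail.

```python
from typing import List, Tuple

Pos3D = Tuple[int, int, int]  # (time, height, width) position IDs

def text_positions(n_tokens: int) -> List[Pos3D]:
    # all three IDs identical -> degenerates to ordinary 1-D RoPE
    return [(i, i, i) for i in range(n_tokens)]

def audio_positions(n_tokens: int, tokens_per_3s: int = 20) -> List[Pos3D]:
    # time-ID advances once per ~3 s block of audio tokens; h, w stay fixed
    return [(i // tokens_per_3s, 0, 0) for i in range(n_tokens)]

def image_positions(grid_h: int, grid_w: int, t: int = 0) -> List[Pos3D]:
    # time-ID fixed; height/width IDs follow the patch grid
    return [(t, h, w) for h in range(grid_h) for w in range(grid_w)]

def video_positions(n_frames: int, grid_h: int, grid_w: int) -> List[Pos3D]:
    # time-ID advances per frame; spatial IDs as for a single image
    return [pos for f in range(n_frames) for pos in image_positions(grid_h, grid_w, t=f)]
```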
3. Generation Heads: From Tokens Back to Human-Perceptible Media
3.1 Context-Aware MoE-TTS (Speech Output)
- Backbone: Qwen2.5-0.5B turned into a 4-expert MoE
- Input: text + style tokens, e.g. <speech start> Jenny calm <speech prompt>
- Output: WavTokenizer discrete codes (40 tokens/s) → 24 kHz waveform
- Long-form: sentences are split at punctuation and the previous segment’s latent guides the next → 2 min+ of coherent speech (see the sketch below)
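Below is a small sketch of that long-form strategy: split at sentence-final punctuation and carry the previous segment's latent forward. The tts.generate interface is a hypothetical stand-in, not the repo's actual MoE-TTS API.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split long text at sentence-final punctuation for segment-by-segment TTS."""
    parts = re.split(r"(?<=[.!?。！？])\s+", text.strip())
    return [p for p in parts if p]

def synthesize_long_form(tts, text: str) -> list:
    """Hypothetical driver: each segment is conditioned on the previous
    segment's latent so prosody stays coherent over 2+ minutes."""
    prev_latent = None
    waveforms = []
    for sentence in split_sentences(text):
        wav, prev_latent = tts.generate(sentence, context_latent=prev_latent)
        waveforms.append(wav)
    return waveforms
```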
Benchmark snapshot (WER ↓ better)
| Dataset | Uni-MoE-2.0-Omni | Qwen2.5-Omni | Gap |
|---|---|---|---|
| LibriTTS-clean | 5.85 % | 5.20 % | +0.65 |
| SEED-hard | 2.67 % | 2.15 % | +0.52 |
| TinyStories-en | 5.02 % | 6.20 % | -1.18 |
3.2 Task-Aware Diffusion Transformer (Image Output)
- Keeps the PixWizard DiT frozen to avoid catastrophic forgetting.
- Only the task projector + content projector are trained (<1B params).
- Accepts <TASK> tokens (edit / depth-to-image / derain ...) plus <IMG> embeddings.
- Runs 15 denoising steps → 1024×1024 output.
Scenario walk-through
Input: rainy night photo + prompt “Remove the rain”.
Task token = derain; content token = scene embedding.
The DiT receives both → produces a clean image (PSNR 25.41 dB, SSIM 0.82), beating specialised derain nets.
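A rough sketch of how the two conditioning signals could be assembled before being handed to the frozen DiT is shown below; the module and its dimensions are hypothetical placeholders that only mirror the description above (frozen backbone, two small trainable projectors).

```python
import torch
import torch.nn as nn

class TaskAwareConditioner(nn.Module):
    """Hypothetical sketch: only the two projectors would be trained (<1B params);
    the PixWizard-style DiT backbone itself stays frozen."""

    def __init__(self, llm_dim: int, dit_dim: int, num_tasks: int):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, dit_dim)  # <TASK> edit / derain / ...
        self.task_proj = nn.Linear(dit_dim, dit_dim)
        self.content_proj = nn.Linear(llm_dim, dit_dim)     # projects <IMG> embeddings

    def forward(self, task_id: torch.Tensor, img_embeddings: torch.Tensor) -> torch.Tensor:
        task = self.task_proj(self.task_embed(task_id)).unsqueeze(1)  # (B, 1, dit_dim)
        content = self.content_proj(img_embeddings)                   # (B, N, dit_dim)
        return torch.cat([task, content], dim=1)  # conditioning sequence for the frozen DiT
```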
4. Training Recipe: From 7B Dense LLM to 26B Sparse Omnimodal Model
| Stage | Data Tokens | Trainable Modules | Key Purpose |
|---|---|---|---|
| a. Cross-modal pretrain | 75B | MLP + QFormer | Align audio & vision tokens into text space |
| b. Expert warm-up | 30B | Dense experts (3) | Initialise speech-understand, speech-gen, vision experts |
| c. Mixed SFT | 22B | Full MoE + TTS + DiT | Instruction following, cross-modal reasoning |
| d. Annealing + RL | 5B | All (low LR) | Balance modality ratio; GSPO-DPO boosts reasoning |
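For orientation, the same curriculum written out as a config-style list; keys and module names are illustrative placeholders, not the repo's actual training-config schema.

```python
# The four-stage curriculum from the table above, as an illustrative config.
TRAINING_STAGES = [
    {"stage": "cross_modal_pretrain", "tokens": "75B",
     "trainable": ["audio_mlp", "vision_mlp", "qformer"]},
    {"stage": "expert_warmup", "tokens": "30B",
     "trainable": ["speech_understanding_expert", "speech_generation_expert", "vision_expert"]},
    {"stage": "mixed_sft", "tokens": "22B",
     "trainable": ["full_moe", "moe_tts", "task_dit"]},
    {"stage": "annealing_rl", "tokens": "5B",
     "trainable": ["all"], "lr": "low", "objective": "SFT + GSPO-DPO"},
]
```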
Annealing detail
- Image data filtered to 5B high-quality tokens
- Video expanded to 21B by picking clips with clear audio → audio-visual QA pairs
- Text kept small (4B) but heavy on STEM & chain-of-thought → preserves reasoning
Author’s reflection
“We initially feared speech generation would hurt visual benchmarks. The opposite happened: because the shared attention layer must align phonemes with lip motion in video clips, the model actually improved temporal grounding—an accidental but welcome side effect of unifying tokens.”
5. Evaluation Highlights (85 Benchmarks)
5.1 Video & Cross-modal Understanding
- Video-MME (no subtitle): 66.4 (↑6.6 vs Qwen2.5-Omni)
- VSI-Bench (spatial reasoning): 56.0 (↑36.7)
- WorldSense long video QA: 44.7 (new state-of-the-art)
5.2 Audio & Speech
- LibriSpeech-clean/other-long WER: 2.04% / 4.20% (↓4.2 / 3.8)
- RACE-audio comprehension: 90.3 (middle) / 87.2 (high)
- Music understanding (MusicCaps): 62.4 CIDEr (2nd place)
5.3 Image Generation & Editing
| Task | Metric | Uni-MoE-2.0-Omni | Best rival | Delta |
|---|---|---|---|---|
| GEdit-Bench | score ↑ | 6.02 | PixWizard 3.20 | +88% |
| Depth-to-Image | FID ↓ | 27.45 | Qwen-Image 27.54 | -0.3% |
| Derain | SSIM ↑ | 0.82 | Qwen-Image 0.80 | +2.5% |
6. Quick-Start: Install, Load, Chat in 10 Lines
- Clone

  ```bash
  git clone https://github.com/HITsz-TMG/Uni-MoE.git && cd Uni-MoE/Uni-MoE-2
  ```

- Create env

  ```bash
  conda create -n unimoe python=3.10 && conda activate unimoe
  pip install -r requirements.txt
  ```

- Download checkpoint (~50 GB)

  ```bash
  huggingface-cli download HIT-TMG/Uni-MoE-2-8E --local-dir ./ckpt
  ```

- Run omnimodal chat

  ```python
  from unimoe import OmniInference

  bot = OmniInference(ckpt="./ckpt", device_map="auto")
  text, speech, image = bot.chat(
      audio="./ask.wav",     # 12 s spoken question
      image="./scene.jpg",   # snapshot
      prompt="Where was this photo taken?",
      output_form=["text", "speech", "image"],
  )
  print(text)                  # "This is Guilin, China."
  speech.save("./answer.wav")  # Jenny voice
  image.save("./map.png")      # optional visual hint
  ```

GPU memory peaks at 36 GB in bf16; a single A100-40G suffices.
7. Author’s Takeaways & Lessons
- Token unification beats modality-specific towers—shared attention is a stronger alignment force than external cross-attention layers.
- Dynamic MoE + Top-P delivers real compute savings on easy tokens while keeping capacity for hard ones; training stability surprised us.
- Progressive expert warm-up is critical—jumping straight to mixed SFT caused routing collapse in early ablations.
- Language-centric generation (text prompts → speech/image heads) keeps generation controllable without re-designing vocoders or DiT cores.
Action Checklist / Implementation Steps
- Check GPU RAM ≥40 GB or use device_map="auto" for multi-GPU.
- Download the repo + weights → run pip install -r requirements.txt.
- Test single-modality inputs first (text-only, then image+text, then audio).
- Tune the top_p routing threshold (0.7 default) if you see expert-overflow warnings.
- Fine-tune: freeze the shared experts and apply LoRA (r=16) to the routed experts for domain data ≤10M tokens (see the sketch after this list).
- For long speech (>2 min), set sentence_split_threshold=25 to avoid prosody drift.
- Image style transfer: prepare 50 high-res pairs, train the Task-DiT projector for 2k steps at LR 1e-5.
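The fine-tuning item above mentions LoRA (r=16) on the routed experts with the shared experts frozen. A sketch using the HuggingFace peft library is shown below; the target-module regex is an assumption about how the routed-expert projections are named, so inspect the released checkpoint with model.named_modules() before relying on it.

```python
from peft import LoraConfig, get_peft_model

def add_domain_lora(model):
    """Attach LoRA adapters (r=16) to the routed experts only.

    With PEFT, every base weight -- including the shared experts -- stays
    frozen; only the injected LoRA matrices train. The target_modules regex
    below is hypothetical and must be matched to the real module names.
    """
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=r".*routed_experts\..*\.(gate_proj|up_proj|down_proj)",
    )
    return get_peft_model(model, lora_cfg)
```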
One-Page Overview
Uni-MoE-2.0-Omni is a fully open-source sparse MoE model that:
- ingests text, audio, image, and video as a single token sequence;
- routes tokens dynamically to modality-specific or shared experts (1.5–18B active params);
- outputs text, natural speech (MoE-TTS), and high-res images (Task-DiT) conditioned on language prompts;
- achieves new SOTA on long-video QA and long-speech ASR, and competitive numbers on image editing;
- releases all five checkpoints, training scripts, and data lists for reproducibility.
Average inference compute is ~8B active parameters, roughly 40% less than a dense 13B model of comparable quality.
FAQ
- Is the model truly open for commercial use?
  Weights are Apache 2.0 but contain the Whisper & SigLIP encoders—please follow their licences. The speech-generation checkpoint is research-only.
- How much training compute do I need to reproduce it?
  128×A100-80G for ~3 weeks. You can skip pretraining and start from the warm-up experts (~30% of the GPU hours).
- Can I add a new modality, e.g. EEG?
  Yes: write a tokenizer that outputs 1-D tokens, insert a new routed expert, and run stages a/b/c on your data.
- Does dynamic MoE hurt latency?
  No—Top-P selection is a small matmul; on an A100 we measure <1 ms overhead per layer.
- What’s the minimum data for decent voice cloning?
  MoE-TTS needs ~30 minutes of clean, single-speaker audio; less data works but may require more epochs.
- Why am I getting garbled long speech?
  Reduce sentence_split_threshold or enable context-slide (see long_form=True in the repo).
- Image generation looks oversmooth—fix?
  Increase guidance_scale to 10–12 or sample with 50 DDIM steps instead of 15.

