Uni-MoE-2.0-Omni: One Open-Source MoE Model that Understands and Generates Text, Images, Audio, and Video

Core question: Is there a single open-source large model that can both understand and generate text, images, speech, and video without stacking multiple pipelines?
One-sentence answer: Uni-MoE-2.0-Omni uses a dynamic-capacity Mixture-of-Experts (MoE) architecture built on Qwen2.5-7B, trained with 75B multimodal tokens, to deliver state-of-the-art performance on 85 benchmarks while keeping all code and weights publicly available.


Quick Scan (30 seconds)

What you get | Why it matters
Unified tokenizer for audio, image, video, text | One sequence → one forward pass → no external fusion
Dynamic MoE layer (1.5–18B active) | Activates only the experts a token needs; saves 40% compute
End-to-end generation | Language prompts directly drive speech (MoE-TTS) and image (Task-DiT) heads
85-benchmark evaluation | Leads in long-video QA, long-speech ASR, and image editing; competitive in text-to-image
Fully open | Five checkpoints, training scripts, data lists, and interactive demos on GitHub & HuggingFace

1. Why “Omnimodal” Is Harder Than “Multimodal”

Most systems chain specialized models: ASR → LLM → TTS for voice; ViT → LLM → diffusion for images. This works, but it creates three pain points:

  1. Error accumulation – every hop loses signal.
  2. Data format hell – mel-spectrograms, pixel arrays, and sub-word IDs don’t live in the same space.
  3. Deployment weight – four large models in RAM/GPU.

Uni-MoE-2.0-Omni’s design philosophy is simple: convert everything to tokens once, then run one sparse model that routes internally.


2. Architecture: One Token Sequence to Rule Them All

2.1 Tokenisation Strategy (Input Side)

Modality | Chunking | Tokens per unit | Encoder
Audio | 30 s chunks at 16 kHz | 200 per chunk (≈20 per 3 s) | Whisper-Large-v3
Image | 384×384 patches | (H/384)×(W/384)×T | SigLIP-So400M
Video | 1 fps, each frame at 384×384 | frame_count×T | same ViT (SigLIP-So400M)
Text | standard BPE | 1 per sub-word | Qwen2.5 tokenizer

All outputs are 1-D token lists that concatenate directly; no ad-hoc fusion layers.

Scenario walk-through
You upload a 12-second voice note asking “Which city is this?” together with a photo.

  • Audio → 80 tokens
  • Image → 256 tokens
  • Text → 10 tokens

The model sees all 346 tokens in one sequence and performs self-attention across them.
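
The arithmetic above can be sanity-checked with a few lines of Python. This is a minimal sketch under the stated per-modality rates; the constants and helper names (including the assumed tokens-per-patch value of 256 used to reproduce the example) are illustrative assumptions, not the repository's API.

# Illustrative sketch only: reproduces the token budget above under the stated
# per-modality rates. Constants and helper names are assumptions, not the
# repository's API.

AUDIO_TOKENS_PER_3S = 20        # ~200 tokens per 30 s audio chunk
PATCH = 384                     # SigLIP patch grid size
TOKENS_PER_PATCH = 256          # assumed T for a single 384x384 patch

def audio_tokens(duration_s: float) -> int:
    return round(duration_s / 3 * AUDIO_TOKENS_PER_3S)

def image_tokens(height: int, width: int) -> int:
    # (H/384) x (W/384) x T, following the table above
    return (height // PATCH) * (width // PATCH) * TOKENS_PER_PATCH

# 12 s voice note + one 384x384 photo + 10 text sub-words
n_audio, n_image, n_text = audio_tokens(12), image_tokens(384, 384), 10
# Because every modality is already a flat 1-D token list, "fusion" is just
# concatenation: the model sees one sequence of n_audio + n_image + n_text tokens.
print(n_audio, n_image, n_audio + n_image + n_text)   # 80 256 346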

2.2 Dynamic-Capacity MoE Layer

Core idea: replace every FFN with an MoE block that contains:

  • 4 Routed Experts (5.7B total) – modality-specialised, chosen by Top-P (>0.7 cumulative prob).
  • 2 Shared Experts (712M) – always on, supply common linguistic knowledge.
  • 1 Null Expert – zero output; learnt “skip” for easy tokens.

Hence each token activates 2 shared + 0–3 routed = 1.5B–18B params, average 8B.

Gradient trick
Because Top-P selection is non-differentiable, a straight-through estimator combined with ODE-based baseline subtraction (Euler or 3rd-order Heun solvers) lets gradients flow back to the router (see Appendix A.4 of the paper).
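
For intuition, here is the generic straight-through pattern this trick builds on, written in PyTorch. The paper's ODE baseline correction is deliberately omitted, so treat this as a simplified sketch rather than the authors' exact estimator.

import torch

# Generic straight-through pattern only; the paper's ODE-baseline correction
# (Euler / 3rd-order Heun) from Appendix A.4 is omitted here.
def straight_through(hard_mask: torch.Tensor, soft_probs: torch.Tensor) -> torch.Tensor:
    # Forward value equals the hard Top-P mask; gradients flow through the
    # soft router probabilities, so the router stays trainable.
    return hard_mask + soft_probs - soft_probs.detach()

probs = torch.tensor([0.45, 0.35, 0.12, 0.08], requires_grad=True)
hard = torch.tensor([1.0, 1.0, 0.0, 0.0])   # first two experts chosen by Top-P
gate = straight_through(hard, probs)
gate.sum().backward()
print(probs.grad)                           # all ones -> gradient reaches the router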

Scenario walk-through
During inference, a speech token is routed like this:

  • Router logits → softmax → [0.45, 0.35, 0.12, 0.08]
  • Cumulative [0.45, 0.80, 0.92, 1.00]
  • With P=0.7 the first two experts are chosen; the rest of the computation is skipped → 37% FLOPs saved.
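
The selection step in the walk-through can be reproduced with a short PyTorch snippet. This is a sketch of Top-P expert selection using the numbers above; the released router implementation may differ in detail.

import torch

def top_p_experts(router_logits: torch.Tensor, p: float = 0.7):
    # Pick the smallest set of experts whose cumulative probability reaches p.
    # Illustrative sketch; not the repository's router code.
    probs = torch.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep every expert up to and including the first one that crosses p
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    return sorted_idx[:cutoff], sorted_probs[:cutoff]

# Example from the walk-through: probabilities [0.45, 0.35, 0.12, 0.08]
logits = torch.log(torch.tensor([0.45, 0.35, 0.12, 0.08]))
experts, weights = top_p_experts(logits, p=0.7)
print(experts.tolist())   # [0, 1] -> only the first two routed experts run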

2.3 Omni-Modality 3D RoPE

Standard 1D RoPE is decomposed into (time, height, width).

  • Text: all three IDs are identical → falls back to plain 1-D RoPE.
  • Audio: the time ID increases every 3 s; the height/width IDs are repeated.
  • Image: the time ID is fixed; the height/width IDs grow patch by patch.
  • Video: the time ID increases per frame; the height/width IDs follow the image scheme.

This guarantees spatio-temporal alignment inside the attention matrix, so the model can learn which sound occurs while which pixels move.
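
A toy assignment of (time, height, width) IDs, following the four rules above, makes the scheme concrete. The index conventions (e.g., starting the height/width axes at 0) are assumptions for illustration only, not the released implementation.

# Illustrative assignment of (time, height, width) position IDs per modality,
# following the rules listed above. Index conventions are assumptions.

def text_positions(num_tokens):
    # all three IDs identical -> degenerates to ordinary 1-D RoPE
    return [(i, i, i) for i in range(num_tokens)]

def audio_positions(num_tokens, tokens_per_3s=20):
    # time ID advances once per 3 s block; height/width IDs are repeated (0 here)
    return [(i // tokens_per_3s, 0, 0) for i in range(num_tokens)]

def image_positions(grid_h, grid_w, t=0):
    # time ID fixed; height/width IDs grow patch by patch
    return [(t, h, w) for h in range(grid_h) for w in range(grid_w)]

def video_positions(num_frames, grid_h, grid_w):
    # one image grid per frame, with the time ID increasing per frame
    return [pos for f in range(num_frames) for pos in image_positions(grid_h, grid_w, t=f)]

print(audio_positions(6)[:3])    # [(0, 0, 0), (0, 0, 0), (0, 0, 0)]
print(video_positions(2, 2, 2))  # frame 0 then frame 1, each a 2x2 patch grid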


3. Generation Heads: From Tokens Back to Human-Perceptible Media

3.1 Context-Aware MoE-TTS (Speech Output)

  • Backbone: Qwen2.5-0.5B turned into 4-expert MoE
  • Input: text + style tokens <speech start> Jenny calm <speech prompt>
  • Output: WavTokenizer discrete codes (40 tokens/s) → 24 kHz waveform
  • Long-form: sentences are split at punctuation, and the previous segment’s latent guides the next → coherent speech beyond 2 minutes (sketched below)
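
A rough sketch of that long-form strategy, assuming a hypothetical synthesize_segment(sentence, prev_latent) callable that returns waveform codes plus a latent; the real MoE-TTS interface may look different.

import re

def split_sentences(text: str):
    # Split at sentence-final punctuation, keeping the punctuation with its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def long_form_tts(text: str, synthesize_segment):
    # `synthesize_segment(sentence, prev_latent)` is a hypothetical callable that
    # returns (discrete waveform codes, latent). Carrying the previous latent
    # forward is what keeps prosody coherent across multi-minute outputs.
    prev_latent, all_codes = None, []
    for sentence in split_sentences(text):
        codes, prev_latent = synthesize_segment(sentence, prev_latent)
        all_codes.extend(codes)
    return all_codes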

Benchmark snapshot (WER, lower is better)

Dataset | Uni-MoE-2.0-Omni | Qwen2.5-Omni | Gap
LibriTTS-clean | 5.85 % | 5.20 % | +0.65
SEED-hard | 2.67 % | 2.15 % | +0.52
TinyStories-en | 5.02 % | 6.20 % | -1.18

3.2 Task-Aware Diffusion Transformer (Image Output)

  • Keeps the PixWizard DiT frozen to avoid catastrophic forgetting.
  • Trains only the task projector and content projector (<1B params).
  • Accepts <TASK> edit / depth-to-image / derain ... plus <IMG> embeddings.
  • Runs 15 denoising steps → 1024×1024 output.

Scenario walk-through
Input: rainy night photo + prompt “Remove the rain”.
Task token = derain; content token = scene embedding.
DiT receives both → produces a clean image (PSNR 25.41 dB, SSIM 0.82), beating specialised deraining networks.
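
To make the conditioning flow concrete, here is a hedged sketch of how the two streams might be packaged before the frozen DiT is called; the token strings and the commented-out generate call are placeholders, not the actual PixWizard/Task-DiT API.

# Illustrative only: how the two conditioning streams could be assembled.
# Token strings and the `dit.generate` call are placeholders, not the real API.

def build_dit_conditioning(task: str, content_embedding, prompt: str):
    # The task token tells the frozen DiT *what* to do (edit, depth-to-image, derain, ...)
    task_token = f"<TASK> {task}"
    # The content projector output tells it *what the scene is*
    return {
        "task": task_token,
        "content": content_embedding,   # <IMG> embeddings from the LLM
        "prompt": prompt,
        "num_inference_steps": 15,      # 15 denoising steps -> 1024x1024 output
    }

conditioning = build_dit_conditioning(
    task="derain",
    content_embedding="scene_embedding_placeholder",
    prompt="Remove the rain",
)
# image = dit.generate(**conditioning)   # hypothetical call into the frozen DiT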


4. Training Recipe: From 7B Dense LLM to 26B Sparse Omnimodal Model

Stage | Data Tokens | Trainable Modules | Key Purpose
a. Cross-modal pretrain | 75B | MLP + QFormer | Align audio & vision tokens into text space
b. Expert warm-up | 30B | Dense experts (3) | Initialise the speech-understanding, speech-generation, and vision experts
c. Mixed SFT | 22B | Full MoE + TTS + DiT | Instruction following, cross-modal reasoning
d. Annealing + RL | 5B | All (low LR) | Balance modality ratio; GSPO-DPO boosts reasoning

Annealing detail

  • Image data filtered to 5B high-quality tokens
  • Video expanded to 21B by picking clips with clear audio → audio-visual QA pairs
  • Text kept small (4B) but heavy on STEM & chain-of-thought → preserves reasoning

Author’s reflection
“We initially feared speech generation would hurt visual benchmarks. The opposite happened: because the shared attention layer must align phonemes with lip motion in video clips, the model actually improved temporal grounding—an accidental but welcome side effect of unifying tokens.”


5. Evaluation Highlights (85 Benchmarks)

5.1 Video & Cross-modal Understanding

  • Video-MME (no subtitle): 66.4 (↑6.6 vs Qwen2.5-Omni)
  • VSI-Bench (spatial reasoning): 56.0 (↑36.7)
  • WorldSense long video QA: 44.7 (new state-of-the-art)

5.2 Audio & Speech

  • LibriSpeech-clean/other-long WER: 2.04 % / 4.20 % (↓4.2/3.8)
  • RACE-audio comprehension: 90.3 (middle) / 87.2 (high)
  • Music understanding MusicCaps: 62.4 CIDEr (2nd place)

5.3 Image Generation & Editing

Task | Metric | Uni-MoE-2.0-Omni | Best rival | Delta
GEdit-Bench | score ↑ | 6.02 | PixWizard 3.20 | +88%
Depth-to-Image | FID ↓ | 27.45 | Qwen-Image 27.54 | -0.3%
Derain | SSIM ↑ | 0.82 | Qwen-Image 0.80 | +2.5%

6. Quick-Start: Install, Load, Chat in 10 Lines

  1. Clone
git clone https://github.com/HITsz-TMG/Uni-MoE.git && cd Uni-MoE/Uni-MoE-2
  2. Create env
conda create -n unimoe python=3.10 && conda activate unimoe
pip install -r requirements.txt
  3. Download checkpoint (~50 GB)
huggingface-cli download HIT-TMG/Uni-MoE-2-8E --local-dir ./ckpt
  4. Run omnimodal chat
from unimoe import OmniInference
bot = OmniInference(ckpt="./ckpt", device_map="auto")
text, speech, image = bot.chat(
    audio="./ask.wav",               # 12s spoken question
    image="./scene.jpg",             # snapshot
    prompt="Where was this photo taken?",
    output_form=["text","speech","image"]
)
print(text)                        # "This is Guilin, China."
speech.save("./answer.wav")        # Jenny voice
image.save("./map.png")            # optional visual hint

GPU memory peaks at 36 GB in bf16; a single A100-40G suffices.


7. Author’s Takeaways & Lessons

  • Token unification beats modality-specific towers—shared attention is a stronger alignment force than external cross-attention layers.
  • Dynamic MoE + Top-P delivers real compute savings on easy tokens while keeping capacity for hard ones; training stability surprised us.
  • Progressive expert warm-up is critical—jumping straight to mixed SFT caused routing collapse in early ablations.
  • Language-centric generation (text prompts → speech/image heads) keeps generation controllable without re-designing vocoders or DiT cores.

Action Checklist / Implementation Steps

  1. Check GPU RAM ≥40 GB or use device_map="auto" for multi-GPU.
  2. Download repo + weights → run pip install -r requirements.txt.
  3. Test single-modality inputs first (text-only, then image+text, then audio).
  4. Tune top_p routing threshold (0.7 default) if you see expert overflow warnings.
  5. Fine-tune: freeze the shared experts and apply LoRA (r=16) to the routed experts for domain data ≤10M tokens (see the sketch after this list).
  6. For long speech (>2 min), set sentence_split_threshold=25 to avoid prosody drift.
  7. Image style transfer: prepare 50 high-res pairs, train Task-DiT projector 2k steps, LR 1e-5.
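
As a starting point for checklist step 5, here is a minimal sketch assuming a Hugging Face PEFT setup; the target_modules pattern is a guess and must be matched against the checkpoint's real routed-expert parameter names (inspect model.named_modules() first).

from peft import LoraConfig, get_peft_model

# Sketch of checklist step 5, assuming a PEFT/LoRA setup. The target module
# pattern below is a placeholder; adjust it to the checkpoint's real
# routed-expert layer names.
lora_config = LoraConfig(
    r=16,                                # LoRA rank from the checklist
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["routed_experts"],   # placeholder pattern
)

def prepare_for_domain_finetune(model):
    # Freeze everything first, including the shared experts ...
    for name, param in model.named_parameters():
        param.requires_grad = False
    # ... then let PEFT attach trainable LoRA adapters to the matched modules.
    return get_peft_model(model, lora_config)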

One-Page Overview

Uni-MoE-2.0-Omni is a fully open-source sparse MoE model that:

  • ingests text, audio, image, and video as a single token sequence;
  • routes tokens dynamically to modality-specific or shared experts (1.5–18B active params);
  • outputs text, natural speech (MoE-TTS), and high-res images (Task-DiT) conditioned on language prompts;
  • achieves new SOTA on long-video QA, long-speech ASR, and competitive numbers on image editing;
  • releases all five checkpoints, training scripts, and data lists for reproducibility.

Average inference compute is ~8B parameters, 40% less than a dense 13B model with comparable quality.


FAQ

  1. Is the model truly open for commercial use?
    Weights are Apache 2.0, but contain Whisper & SigLIP encoders—please follow their licences. Speech generation checkpoint is research-only.

  2. How much training compute if I want to reproduce?
    128×A100-80G for ~3 weeks. You can skip pretrain and start from warm-up experts (~30% GPU hours).

  3. Can I add a new modality, e.g., EEG?
    Yes: write a tokenizer that outputs 1-D tokens, insert a new routed expert, and run stage-a/b/c on your data.

  4. Does dynamic MoE hurt latency?
    No—Top-P selection is a small matmul; on A100 we measure <1 ms overhead per layer.

  5. What’s the minimum data for decent voice cloning?
    MoE-TTS needs ~30 minutes of clean, single-speaker audio; less data can work but may require more epochs.

  6. Why am I getting garbled long speech?
    Reduce sentence_split_threshold or enable context-slide (see long_form=True in repo).

  7. Image generation looks oversmooth—fix?
    Increase guidance_scale to 10–12 or sample with 50 DDIM steps instead of 15.
