Exploring MGM-Omni: An Open-Source Multi-Modal Chatbot for Everyday Use

Hello there. If you’re someone who’s curious about artificial intelligence tools that can handle more than just text (think images, videos, and even voice conversations), then MGM-Omni might catch your interest. It’s an open-source chatbot designed to process inputs from text, images, videos, and speech, and it can respond in both text and voice formats. Built on two earlier projects, Mini-Gemini and Lyra, this tool stands out for its ability to understand and generate long stretches of speech in both English and Chinese, including features like voice cloning.

Think of it as a versatile assistant that goes beyond simple chats. For instance, you could feed it a long audio recording from a meeting and ask it to summarize key points, or have it create a story in a cloned voice that sounds just like a friend. In this post, we’ll walk through what MGM-Omni is, its main features, how to set it up, ways to use it, and some performance details from evaluations. I’ll keep things straightforward, assuming you have a basic background in tech, like what you’d get from a junior college program.

[Figure: MGM-Omni overview]

What Is MGM-Omni?

MGM-Omni is a chatbot that’s built to handle multiple types of information at once—what experts call “omni-modal” capabilities. This means it can take in text, pictures, videos, or audio clips as inputs and then produce responses either as written text or spoken words. It’s based on two related projects: Mini-Gemini, which focuses on combining visual and language processing, and Lyra, which emphasizes speech handling.

One of the key things about MGM-Omni is its strength in dealing with long audio. Most similar open-source tools struggle with anything over 15 minutes, but this one can manage hour-long speech files while still grasping the big picture and the small details effectively. On the output side, it can create smooth, natural-sounding speech that’s more than 10 minutes long, which is great for things like storytelling or explanations.

It also supports real-time-like interactions through something called streaming generation, where the audio comes out smoothly without big delays. Plus, there’s a feature for zero-shot voice cloning: you provide a short audio sample—around 10 seconds—and it can mimic that voice in English or Chinese. Everything about this project is open-source, so the code, models, and even the data used for training are available for anyone to use or modify.

If you’re wondering why this matters, consider everyday scenarios. Maybe you’re a student reviewing lecture recordings, or a hobbyist creating audio content. Tools like this make those tasks easier without needing expensive proprietary software.

Connections to Other Projects

MGM-Omni doesn’t exist in isolation. It draws from a couple of foundational efforts:

  • Mini-Gemini: This is about unlocking the full potential of models that blend vision and language. You can find more on its GitHub page.
  • Lyra: Described as an efficient framework centered on speech for comprehensive understanding. Details are available in its repository.

These building blocks help MGM-Omni achieve its multi-modal features.

Core Features of MGM-Omni

Let’s break down what makes MGM-Omni useful. I’ll list them out clearly so you can see the highlights at a glance.

  1. Support for Multiple Input and Output Types: It handles audio, video, images, and text as inputs. It understands extended contexts and can output either text or speech, making it a flexible assistant for various needs.

  2. Handling Long Speech Inputs: Unlike many open-source alternatives that falter with inputs longer than 15 minutes, MGM-Omni excels at processing hour-long audio. It provides strong performance in both overall comprehension and picking up on specifics.

  3. Generating Extended Speech: Using a wealth of training data and a technique called chunk-based decoding, it can produce over 10 minutes of fluid, natural-sounding speech. This is ideal for continuous narratives.

  4. Real-Time Streaming Output: By decoding speech elements in parallel, it delivers audio smoothly and efficiently, which works well for live chats.

  5. Voice Cloning Without Prior Training: Thanks to diverse audio data in its training, you can clone a voice with just a brief recording sample. It works for both English and Chinese.

  6. Complete Open-Source Availability: All aspects—code, models, and training data—are shared publicly, encouraging community involvement and customization.

These features position MGM-Omni as a step forward in accessible AI tools. For example, if you’re into content creation, the voice cloning could help personalize podcasts or videos.

Recent Updates on MGM-Omni

As of August 18, 2025, the team behind MGM-Omni has released several resources to get you started: a detailed blog post explaining the project, an online demo for trying it out, model weights hosted on Hugging Face, and the initial code base. They plan to share more code and data soon.

If you’re eager to dive in, check out the blog for insights, the demo space for hands-on experience, the model collection for downloads, and the GitHub repository for the code.

Upcoming Developments

The project has a few items on its to-do list to make it even more robust:

  • Publishing a preprint on arXiv for academic review.
  • Releasing code for training and fine-tuning the model.
  • Sharing the training data sets.

These additions will be helpful if you want to experiment with building or improving similar tools.

Step-by-Step Guide to Installing MGM-Omni

Getting MGM-Omni up and running isn’t too complicated if you follow these steps. I’ll explain it like I’m guiding a friend through the process, assuming you have some familiarity with command-line tools from a basic tech course.

Step 1: Download the Project Files

Start by cloning the repository from GitHub. Open your terminal and run:

git clone https://github.com/dvlab-research/MGM-Omni.git

This pulls down all the necessary files to your local machine.

Step 2: Set Up Your Environment

Use Conda to create a dedicated space for the project. This keeps things organized and avoids conflicts with other software.

conda create -n mgm-omni python=3.10 -y
conda activate mgm-omni

Now navigate into the project folder:

cd MGM-Omni

Step 3: Prepare Dependencies

Update the submodules and install the required packages:

git submodule update --init --recursive
pip install --upgrade pip
pip install -e .

That’s it for installation. If everything goes smoothly, you’re ready to use the tool. Common issues might include missing Conda or network problems during cloning—double-check those basics.
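If you want a quick sanity check before moving on, the two commands below are an optional sketch rather than an official verification step; they assume the package installs under the mgm namespace used by the CLI examples later in this post, and that PyTorch is pulled in as a dependency:

# Both imports should succeed if the installation completed correctly
python -c "import mgm; print('mgm imported successfully')"
# Confirms whether a CUDA-capable GPU is visible to PyTorch
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"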

Practical Ways to Use MGM-Omni

Once installed, MGM-Omni offers several usage modes. Whether you’re cloning a voice or chatting with multi-type inputs, here’s how to do it.

Cloning a Voice from Scratch

To generate audio that mimics a reference sample:

Run this command:

python -m mgm.serve.cli_tts \
--model wcy1122/MGM-Omni-TTS-2B \
--ref-audio assets/ref_audio/Man_EN.wav

For better accuracy, add a text transcription of the reference audio with --ref-audio-text (see the example below). If you omit it, the reference audio is transcribed automatically with Whisper-large-v3.

This is useful for creating custom voices without much effort.
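
For example, here is the same command with an explicit reference transcription. The quoted transcription text is only a placeholder for illustration, not something shipped with the repository:

# Same cloning command, now with a manual transcription of the reference clip
python -m mgm.serve.cli_tts \
--model wcy1122/MGM-Omni-TTS-2B \
--ref-audio assets/ref_audio/Man_EN.wav \
--ref-audio-text "Transcription of the reference clip goes here."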

Chatting with Text Only

For a basic conversation:

python -m mgm.serve.cli \
--model wcy1122/MGM-Omni-7B \
--speechlm wcy1122/MGM-Omni-TTS-2B

To use a specific voice in responses, include --ref-audio and optionally --ref-audio-text.
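
For example, reusing the English reference clip from the voice cloning section above, a sketch of a voiced text chat looks like this:

# Text chat whose spoken replies mimic the bundled English reference voice
python -m mgm.serve.cli \
--model wcy1122/MGM-Omni-7B \
--speechlm wcy1122/MGM-Omni-TTS-2B \
--ref-audio assets/ref_audio/Man_EN.wav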

Incorporating Visual Elements in Chats

Add an image:

python -m mgm.serve.cli \
--model wcy1122/MGM-Omni-7B \
--speechlm wcy1122/MGM-Omni-TTS-2B \
--image-file assets/examples/ronaldo.jpg

Swap in --video-file for videos or --audio-file for sound clips to explore other formats.
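
For instance, an audio-only chat can reuse the sample clip from the repository’s assets, while the video variant needs a file of your own (the video path below is purely a placeholder):

# Audio input, using the sample instruction clip bundled in the repository
python -m mgm.serve.cli \
--model wcy1122/MGM-Omni-7B \
--speechlm wcy1122/MGM-Omni-TTS-2B \
--audio-file assets/examples/instruct.wav

# Video input: replace the placeholder path with a real video file
python -m mgm.serve.cli \
--model wcy1122/MGM-Omni-7B \
--speechlm wcy1122/MGM-Omni-TTS-2B \
--video-file path/to/your_video.mp4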

Combining Multiple Input Types

For a full multi-modal experience:

python -m mgm.serve.cli \
--model wcy1122/MGM-Omni-7B \
--speechlm wcy1122/MGM-Omni-TTS-2B \
--image-file assets/examples/ronaldo.jpg \
--audio-file assets/examples/instruct.wav

You can mix and match flags for images, videos, and audio as needed.

Running a Local Web Interface

For an easier interactive setup:

python -m mgm.serve.web_demo \
--model wcy1122/MGM-Omni-7B \
--speechlm wcy1122/MGM-Omni-TTS-2B

This launches a Gradio-based demo in your browser.
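When it starts, the demo prints a local URL in the terminal (Gradio apps typically serve at http://127.0.0.1:7860 unless configured otherwise); open that address in your browser to start chatting.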

These methods cover the main ways to interact with MGM-Omni, from simple tests to complex setups.

Evaluating MGM-Omni’s Performance

To give you a sense of how MGM-Omni stacks up, here are results from standard tests on speech understanding, audio comprehension, and speech creation.

Speech and Audio Comprehension Metrics

These tables show error rates for understanding spoken content. Lower numbers are better for WER (word error rate) and CER (character error rate). Benchmarks include LibriSpeech (LS), Common Voice (CM), and AISHELL.

| Model | Release Date | LS-Clean | LS-Other | CM-English | CM-Chinese | AISHELL |
|---|---|---|---|---|---|---|
| Mini-Omni2 | 2024-11 | 4.7 | 9.4 | - | - | - |
| Lyra | 2024-12 | 2.0 | 4.0 | - | - | - |
| VITA-1.5 | 2025-01 | 3.4 | 7.5 | - | - | 2.2 |
| Qwen2.5-Omni | 2025-03 | 1.6 | 3.5 | 7.6 | 5.2 | - |
| Ola | 2025-06 | 1.9 | 4.3 | - | - | - |
| MGM-Omni-7B | 2025-08 | 1.7 | 3.6 | 8.8 | 4.5 | 1.9 |
| MGM-Omni-32B | 2025-08 | 1.5 | 3.2 | 8.0 | 4.0 | 1.8 |

A dash marks a benchmark that was not reported for that model.

MGM-Omni models show competitive low error rates, especially the 32B version.

For broader audio types on AIR-Bench Chat (higher scores better):

| Model | Release Date | Speech | Sound | Music | Mix | Average |
|---|---|---|---|---|---|---|
| LLaMA-Omni | 2024-08 | 5.2 | 5.3 | 4.3 | 4.0 | 4.7 |
| Mini-Omni2 | 2024-11 | 3.6 | 3.5 | 2.6 | 3.1 | 3.2 |
| IXC2.5-OmniLive | 2024-12 | 1.6 | 1.8 | 1.7 | 1.6 | 1.7 |
| VITA-1.5 | 2025-01 | 4.8 | 5.5 | 4.9 | 2.9 | 4.5 |
| Qwen2.5-Omni | 2025-03 | 6.8 | 5.7 | 4.8 | 5.4 | 5.7 |
| Ola | 2025-06 | 7.3 | 6.4 | 5.9 | 6.0 | 6.4 |
| MGM-Omni-7B | 2025-08 | 7.3 | 6.5 | 6.3 | 6.1 | 6.5 |
| MGM-Omni-32B | 2025-08 | 7.1 | 6.5 | 6.2 | 6.2 | 6.5 |

The MGM-Omni versions achieve top averages, indicating solid handling of diverse audio.

Speech Generation Performance

On the seed-tts-eval benchmark (lower CER/WER better, higher SS better):

| Model | Release Date | Model Size | CER (Chinese) | SS (Chinese) | WER (English) | SS (English) |
|---|---|---|---|---|---|---|
| CosyVoice2 | 2024-12 | 0.5B | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen2.5-Omni-3B | 2025-03 | 0.5B | 1.58 | 0.744 | 2.51 | 0.635 |
| Qwen2.5-Omni-7B | 2025-03 | 2B | 1.42 | 0.754 | 2.33 | 0.641 |
| MOSS-TTSD-v0 | 2025-06 | 2B | 2.18 | 0.594 | 2.46 | 0.476 |
| HiggsAudio-v2 | 2025-07 | 6B | 1.66 | 0.743 | 2.44 | 0.677 |
| MGM-Omni | 2025-08 | 0.6B | 1.49 | 0.749 | 2.54 | 0.670 |
| MGM-Omni | 2025-08 | 2B | 1.38 | 0.753 | 2.28 | 0.682 |
| MGM-Omni | 2025-08 | 4B | 1.34 | 0.756 | 2.22 | 0.684 |

The 4B MGM-Omni leads in most metrics, showing effective generation for both languages. Note that for Qwen2.5-Omni, the size refers to the speech component.

These evaluations highlight MGM-Omni’s strengths in practical applications.

Examples and Demos

The project’s blog includes sample demonstrations of MGM-Omni in action. For more, try the web demo to see how it handles inputs and outputs firsthand.

How to Cite MGM-Omni in Your Work

If you’re using this in a project or paper, consider citing:

@misc{wang2025mgmomni,
  title={MGM-Omni: An Open-source Omni Chatbot},
  author={Wang, Chengyao and Zhong, Zhisheng and Peng, Bohao and Yang, Senqiao and Liu, Yuqi and Yu, Bei and Jia, Jiaya},
  year={2025},
  howpublished={\url{https://mgm-omni.notion.site}},
  note={Notion Blog}
}

@inproceedings{zhong2025lyra,
  title={Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition},
  author={Zhong, Zhisheng and Wang, Chengyao and Liu, Yuqi and Yang, Senqiao and Tang, Longxiang and Zhang, Yuechen and Li, Jingyao and Qu, Tianyuan and Li, Yanwei and Chen, Yukang and Yu, Shaozuo and Wu, Sitong and Lo, Eric and Liu, Shu and Jia, Jiaya},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

@article{li2024mgm,
  title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
  author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
  journal={arXiv:2403.18814},
  year={2024}
}

Acknowledgments for Contributions

This project builds on several other works:

  • It uses elements from the LLaVA series, Mini-Gemini, and Lyra.
  • Visual language models and encoders come from Qwen2.5-VL.
  • Language models are from the Qwen3 series.
  • Audio tokenization and flow matching from CosyVoice2.
  • Audio encoders from Belle-whisper and Qwen2-Audio.
  • Synthetic data generation via MegaTTS.

These integrations make MGM-Omni possible.

Frequently Asked Questions About MGM-Omni

Here are answers to common questions, based on the project’s details.

What kinds of inputs and outputs does MGM-Omni support?

It accepts text, images, videos, and audio inputs, and can respond with text or speech. This makes it suitable for mixed-media conversations.

How does MGM-Omni handle long audio files?

It can process hour-long speech inputs effectively, outperforming models limited to shorter durations in both broad and detailed understanding.

What’s needed for voice cloning with MGM-Omni?

Just a short audio clip of about 10 seconds. It uses zero-shot techniques for English and Chinese voices.

Is MGM-Omni fully open-source?

Yes, with code, models, and training data all set to be released. Initial parts are already available.

How do I start a local demo of MGM-Omni?

Use the command to launch a Gradio web interface, specifying the models for the core and speech components.

What makes MGM-Omni better than similar tools?

Evaluations show it excels in low error rates for speech understanding and high scores for generation, especially in long-form content.

Can MGM-Omni create long speech outputs?

Yes, over 10 minutes of natural speech, thanks to chunk-based decoding and extensive training data.

What if I run into issues during installation?

Ensure your environment is set up correctly, submodules are updated, and pip is current. The steps are designed to be straightforward.

What foundational projects does MGM-Omni rely on?

Primarily Mini-Gemini for multi-modality and Lyra for speech focus, plus integrations from Qwen series and others.

How can I add a custom voice to chats?

Include the reference audio flag in the chat command to have responses use that voice.

Wrapping up, MGM-Omni offers a practical entry into multi-modal AI without barriers. Whether for learning, creating, or experimenting, it’s worth exploring. If you’ve tried it, what stood out for you?
