Stream-Omni: Revolutionizing Multimodal Interaction

In today’s rapidly evolving landscape of artificial intelligence, we are on the brink of a new era of multimodal interaction. Stream-Omni, a cutting-edge large language-vision-speech model, is reshaping the way we interact with machines. This blog post delves into the technical principles, practical applications, and setup process of Stream-Omni, offering a comprehensive guide to this groundbreaking technology.

What is Stream-Omni?

Stream-Omni is a sophisticated large language-vision-speech model capable of supporting various multimodal interactions simultaneously. It can process inputs in the form of text, vision, and speech, and generate corresponding text or speech responses. One of its standout features is the ability to provide intermediate text results during speech interactions, such as automatic speech recognition (ASR) transcriptions and model responses, creating a seamless “see-while-hear” experience for users.

Core Technical Principles of Stream-Omni

Modal Alignment: Bridging Different Data Types

Stream-Omni’s key strength lies in its efficient modal alignment technology, which employs two primary methods:

  1. Sequence-dimension Concatenation: For visual information, Stream-Omni extracts features from visual inputs using a vision encoder and concatenates them with text features along the sequence dimension. This approach leverages the complementary nature of text and image semantics, enabling the model to understand visual elements and related text descriptions simultaneously.
  2. Layer-dimension Mapping: For speech information, Stream-Omni introduces layer-dimension mapping based on Connectionist Temporal Classification (CTC). It adds speech layers at the bottom and top of the large language model (LLM) and uses a CTC decoder to achieve precise speech-to-text mapping. This creates a direct link between speech and text, allowing speech to efficiently utilize the text capabilities of the LLM even with limited speech data.
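
To make these two methods concrete, here is a minimal PyTorch sketch of both alignment ideas. The tensor shapes, dimensions, and module names below are illustrative assumptions, not Stream-Omni’s actual implementation.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not the real model's dimensions)
batch, vis_len, txt_len, speech_len, d_model, vocab = 2, 32, 16, 100, 4096, 32000

# 1) Sequence-dimension concatenation: projected vision features sit next to
#    the text embeddings along the sequence axis before entering the LLM.
vision_feats = torch.randn(batch, vis_len, d_model)        # vision encoder + projector output
text_embeds = torch.randn(batch, txt_len, d_model)         # LLM embedding-table output
llm_input = torch.cat([vision_feats, text_embeds], dim=1)  # (batch, vis_len + txt_len, d_model)

# 2) Layer-dimension mapping: hidden states from the bottom speech layers are
#    projected onto the text vocabulary (plus a CTC blank) so a CTC decoder can
#    map variable-length speech frames onto the text token sequence.
speech_hidden = torch.randn(batch, speech_len, d_model)      # bottom speech-layer output
ctc_head = nn.Linear(d_model, vocab + 1)                     # +1 for the CTC blank symbol
ctc_log_probs = ctc_head(speech_hidden).log_softmax(dim=-1)  # per-frame distribution over text tokens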

Training Strategy: Unleashing Potential with Minimal Data

Stream-Omni adopts a three-stage training strategy:

  1. Stage One: Vision-Text Alignment: Utilizes training methods from vision-oriented multimodal models to teach the model to understand the relationship between visual and textual information.
  2. Stage Two: Speech-Text Alignment: Trains the bottom and top speech layers with a combination of CTC loss and cross-entropy loss to achieve precise alignment between speech and text (a loss sketch follows this list).
  3. Stage Three: Text-Vision-Speech Alignment: Employs multimodal data constructed through an automated pipeline to train the LLM backbone via multi-task learning. This enables the model to flexibly support various multimodal interactions.
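
As a rough illustration of how stage two might combine its objectives, the sketch below adds a CTC loss on the bottom speech layers to a cross-entropy loss on the top layers. All shapes, variable names, and the equal weighting of the two terms are assumptions, not the released training code.

import torch
import torch.nn as nn

batch, speech_len, txt_len, vocab = 2, 100, 16, 32000

# Bottom speech layers: CTC loss aligns variable-length speech frames with the text transcript.
bottom_log_probs = torch.randn(speech_len, batch, vocab + 1).log_softmax(-1)  # (T, N, C)
transcript = torch.randint(1, vocab, (batch, txt_len))
ctc_loss = nn.CTCLoss(blank=vocab)(
    bottom_log_probs,
    transcript,
    torch.full((batch,), speech_len),  # input (speech-frame) lengths
    torch.full((batch,), txt_len),     # target (text-token) lengths
)

# Top speech layers: cross-entropy over the output tokens, as in standard
# next-token prediction through the LLM.
top_logits = torch.randn(batch, txt_len, vocab)
ce_loss = nn.CrossEntropyLoss()(top_logits.reshape(-1, vocab), transcript.reshape(-1))

loss = ctc_loss + ce_loss  # the real recipe may weight the two terms differently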

Practical Applications of Stream-Omni

Visual Question Answering: A Pro at Decoding Images

Stream-Omni excels in visual question answering tasks. It can accurately interpret image content and combine it with text questions to provide appropriate answers. For instance, when provided with a building floor plan and a question about whether basement stairs lead directly to the second floor, Stream-Omni can make accurate judgments based on the image.

Speech Interaction: A Fluent Communication Partner

Stream-Omni shines in speech interaction scenarios. Whether converting speech to text or directly generating speech responses, it delivers outstanding performance. For example, users can inquire about the purpose of a device through speech, and Stream-Omni will accurately recognize the speech and provide a detailed explanation.

Setting Up and Using Stream-Omni

Environment Preparation

  1. Creating a Python Environment: It is recommended to create a Python 3.10 environment using Conda to ensure package consistency and compatibility.

    • conda create -n streamomni python=3.10 -y
    • conda activate streamomni
  2. Installing Dependencies: Use pip to install the project in editable mode along with its requirements, including flash-attn and the CosyVoice dependencies.

    • pip install -e .
    • pip install flash-attn --no-build-isolation
    • pip install -r requirements.txt
    • pip install -r CosyVoice/requirements.txt

Downloading Models and Tools

  1. Downloading the Stream-Omni Model: Obtain the Stream-Omni model from the Huggingface website and place it in a specified directory (e.g., ${STREAMOMNI_CKPT}).
  2. Downloading CosyVoice (Tokenizer & Flow Model): Acquire the CosyVoice model from the ModelScope website and place it in a specified directory (e.g., COSYVOICE_CKPT=./CosyVoice-300M-25Hz).
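
If you prefer to script these downloads, the sketch below uses huggingface_hub and modelscope. The repository IDs are assumptions; verify them against the official Stream-Omni and CosyVoice release pages before running it.

from huggingface_hub import snapshot_download as hf_snapshot_download
from modelscope import snapshot_download as ms_snapshot_download

# Stream-Omni checkpoint from Huggingface (repo ID assumed; check the project page).
streamomni_ckpt = hf_snapshot_download(
    repo_id="ICTNLP/stream-omni-8b",
    local_dir="./stream-omni-8b",
)

# CosyVoice tokenizer & flow model from ModelScope (model ID assumed; local_dir needs a recent modelscope).
cosyvoice_ckpt = ms_snapshot_download(
    "iic/CosyVoice-300M-25Hz",
    local_dir="./CosyVoice-300M-25Hz",
)

print("STREAMOMNI_CKPT =", streamomni_ckpt)
print("COSYVOICE_CKPT  =", cosyvoice_ckpt)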

Launching the Service

  1. Starting the Controller: Run the controller script with the host set to 0.0.0.0 and the port set to 10000.

    • python stream_omni/serve/controller.py --host 0.0.0.0 --port 10000
  2. Starting the CosyVoice Worker: Set the CosyVoice model path and audio save directory, then launch the worker.

    • COSYVOICE_CKPT=path_to_CosyVoice-300M-25Hz
    • WAV_DIR=path_to_save_generated_audio
    • CUDA_VISIBLE_DEVICES=0 PYTHONPATH=CosyVoice/third_party/Matcha-TTS python ./CosyVoice/cosyvoice_worker.py --port 21003 --model ${COSYVOICE_CKPT} --wav_dir ${WAV_DIR}
  3. Starting the Stream-Omni Worker: Set the Stream-Omni model path and launch the worker. If your GPU has less than 32 GB of VRAM, add the --load-8bit parameter to reduce memory usage.

    • STREAMOMNI_CKPT=path_to_stream-omni-8b
    • CUDA_VISIBLE_DEVICES=1 python ./stream_omni/serve/model_worker.py --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ${STREAMOMNI_CKPT} --model-name stream-omni
  4. Starting the Interaction Interface: Run the Gradio Web script and access the interface via http://localhost:7860 in a browser.

    • python stream_omni/serve/gradio_web.py --controller http://localhost:10000 --model-list-mode reload --port 7860

Command Line Interaction Example

Here is a simple command line interaction example demonstrating how to use Stream-Omni for vision-guided speech interaction:

export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=CosyVoice/third_party/Matcha-TTS

STREAMOMNI_CKPT=path_to_stream-omni-8b

# Replace the CosyVoice model path in run_stream_omni.py (e.g., cosyvoice = CosyVoiceModel('./CosyVoice-300M-25Hz'))
# Add --load-8bit if VRAM is less than 32GB
python ./stream_omni/eval/run_stream_omni.py \
    --model-path ${STREAMOMNI_CKPT} \
    --image-file ./stream_omni/serve/examples/cat.jpg --conv-mode stream_omni_llama_3_1 --model-name stream-omni  \
    --query ./stream_omni/serve/examples/cat_color.wav

After running the script above, you will receive the following output:

ASR Output:
What is the color of the cat?
LLM Output:
The cat is gray and black.
Speech Tokens:
<Audio_2164><Audio_2247><Audio_671><Audio_246><Audio_2172><Audio_1406><Audio_119><Audio_203><Audio_2858><Audio_2099><Audio_1716><Audio_22><Audio_1736><Audio_1038><Audio_4082><Audio_1655><Audio_2409><Audio_2104><Audio_571><Audio_2255><Audio_73><Audio_760><Audio_822><Audio_701><Audio_2583><Audio_1038><Audio_2203><Audio_1185><Audio_2103><Audio_1718><Audio_2610><Audio_1883><Audio_16><Audio_792><Audio_8><Audio_8><Audio_535><Audio_67>
Speech Output:
Audio saved at ./output_893af1597afe2551d76c37a75c813b16.wav

Interaction Methods for Different Modal Combinations

Stream-Omni supports various multimodal interaction methods. Below are some common interaction modes and their corresponding scripts:

| Input Combination | Output Type | Intermediate Output | Script File |
| --- | --- | --- | --- |
| Text + Vision (or None) | Text | / | run_stream_omni_t2t.py |
| Text + Vision (or None) | Speech | Model output text result | run_stream_omni_t2s.py |
| Speech + Vision (or None) | Text | ASR transcription of user input | run_stream_omni_s2t.py |
| Speech + Vision (or None) | Speech | Model output text result, ASR transcription of user input | run_stream_omni_s2s.py |

You can control the interaction mode by setting the inference_type parameter in model.generate() (options include text_to_text, text_to_speech, speech_to_text, and speech_to_speech).
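
For orientation, here is a hedged sketch of how such a dispatcher might look in your own wrapper code. Only inference_type is documented above; the other keyword names (images, speech) are assumptions, so consult the run_stream_omni_*.py scripts for the exact calling convention.

from typing import Optional

import torch

def run_interaction(model, input_ids: torch.Tensor,
                    image_tensor: Optional[torch.Tensor] = None,
                    speech_tensor: Optional[torch.Tensor] = None,
                    mode: str = "speech_to_speech"):
    # Dispatch one request with the desired modality combination.
    assert mode in {"text_to_text", "text_to_speech", "speech_to_text", "speech_to_speech"}
    with torch.inference_mode():
        return model.generate(
            input_ids,
            images=image_tensor,   # assumed keyword for the vision input
            speech=speech_tensor,  # assumed keyword for the speech input
            inference_type=mode,   # documented switch for the interaction mode
        )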

Performance of Stream-Omni

Visual Understanding Capability

Stream-Omni demonstrates remarkable performance across multiple visual understanding benchmarks. For example, in 11 benchmark tests including VQA-v2 and GQA, it performs on par with advanced vision-oriented multimodal models like LLaVA. This indicates that Stream-Omni can reliably interpret visual information and provide accurate visual question answering services.

Speech Interaction Capability

In knowledge-grounded speech interaction tests, Stream-Omni achieves outstanding results on benchmarks like Llama Questions and Web Questions with just 23,000 hours of speech data. Compared to speech-oriented LMMs (such as SpeechGPT, Moshi, and GLM-4-Voice) that rely on large-scale speech-data pre-training, Stream-Omni’s CTC-based speech-text mapping transfers the text knowledge of the LLM to the speech modality more efficiently, enabling knowledge-grounded speech interaction.

Vision-Grounded Speech Interaction Capability

To better align with real-world application scenarios, the research team developed the SpokenVisIT benchmark based on VisIT-Bench to evaluate the vision-grounded speech interaction capabilities of multimodal models. In this test, Stream-Omni outperforms models like VITA-1.5 in vision-grounded speech interaction, offering users a richer multimodal interaction experience.

Advantages and Limitations of Stream-Omni

Advantages

  1. Efficient Data Utilization: Stream-Omni uses only 23,000 hours of speech data during training, significantly less than comparable models such as TWIST (150,000 hours) and SpeechGPT (60,000 hours), giving it a substantial data-efficiency advantage.
  2. Outstanding Multimodal Interaction Capability: Whether in visual question answering or speech interaction, Stream-Omni delivers high-quality responses and simultaneously provides intermediate text results, enhancing the user experience.
  3. Flexible Support for Modal Combinations: It supports various multimodal interaction methods, meeting the needs of different scenarios.

Limitations

Despite Stream-Omni’s significant achievements in multimodal interaction, it has some limitations. For instance, it needs improvement in speech expressiveness and human-likeness. These factors are crucial for high-quality multimodal interaction experiences and will be directions for future research and enhancement.

Real-World Case Studies of Stream-Omni

Visual Detail Understanding Case

In a SpokenVisIT benchmark case, Stream-Omni accurately interprets the image of a staircase and provides an answer similar to GPT-4V. In contrast, VITA-1.5 generates contradictory responses when faced with different input modalities (text and speech). This highlights Stream-Omni’s advantage in speech-text semantic alignment, ensuring consistent responses regardless of the input modality.

Long Speech Generation Case

In another case, Stream-Omni demonstrates exceptional long speech generation capability, producing a high-quality speech output lasting up to 30 seconds. The generated speech is highly consistent with the corresponding text output, underscoring the effectiveness of its alignment-based fusion module and enabling high-quality vision-grounded speech interaction.

Conclusion and Outlook

Stream-Omni, as an advanced multimodal interaction model, leverages innovative modal alignment techniques and efficient training strategies to achieve remarkable results in visual understanding, speech interaction, and vision-grounded speech interaction. Its emergence injects new vitality into the development of multimodal interaction technology and provides new ideas for future smarter, more natural human-computer interaction models. As technology continues to advance, we believe Stream-Omni will unlock its immense potential in more application scenarios, bringing greater convenience and innovative experiences to our lives and work.

If you encounter any issues while using Stream-Omni or wish to learn more about it, you can explore further via the GitHub repository or the Huggingface page.