
MiniCPM: A Breakthrough in Real-time Multimodal Interaction on End-side Devices

Introduction

In the rapidly evolving field of artificial intelligence, multimodal large language models (MLLMs) have become a key focus. These models can process various types of data, such as text, images, and audio, providing a more natural and richer human-computer interaction experience. However, due to computational resource and performance limitations, most high-performance multimodal models have traditionally been confined to the cloud, making it difficult for everyday users to run them directly on local devices such as smartphones or tablets.

The MiniCPM series of models, developed jointly by the Tsinghua University Natural Language Processing Laboratory (THUNLP) and ModelBest, has changed this situation. This family of models, with its outstanding performance and efficient deployment capabilities, has made real-time multimodal AI on end-side devices possible. This article provides an in-depth look at the technical features, performance, and practical applications of MiniCPM-o 2.6 and MiniCPM-V 2.6, helping readers gain a comprehensive understanding of this cutting-edge technology.

Overview of MiniCPM Series Models

Since its initial release in February 2024, the MiniCPM series has undergone several iterations. Currently, the two most noteworthy models in this series are MiniCPM-o 2.6 and MiniCPM-V 2.6. Both models boast 8 billion parameters (8B) and deliver performance comparable to or even surpassing that of some commercial models, all while maintaining a relatively compact model size.

MiniCPM-o 2.6: The Pinnacle of Multimodal Real-time Streaming Interaction

MiniCPM-o 2.6, the latest version in the series, is built upon SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B. Trained and run in an end-to-end manner, it accepts image, video, text, and audio inputs and can generate high-quality text and speech outputs.

Key Advantages

  • Visual Understanding Capability: MiniCPM-o 2.6 achieves an average score of 70.2 on the OpenCompass benchmark, surpassing commercial models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet. It excels in single-image understanding, multi-image understanding, and video understanding tasks, demonstrating exceptional performance in handling complex scenarios and multi-language content.

  • Speech Interaction Capability: Supporting real-time bilingual speech dialogue in Chinese and English, MiniCPM-o 2.6 outperforms GPT-4o-realtime on speech understanding tasks such as automatic speech recognition (ASR) and speech-to-text translation. It also offers emotion, speech-rate, and style control, enabling personalized speech output according to user preferences.

  • Multimodal Streaming Interaction: As an innovative feature, MiniCPM-o 2.6 can receive continuous video and audio streams and interact with users in real-time. On the StreamingBench benchmark, it achieves the best level in the open-source community, surpassing GPT-4o-202408 and Claude 3.5 Sonnet.

  • Efficient Deployment Capability: Optimized for visual token density, MiniCPM-o 2.6 requires only 640 tokens to process a 1.8 million-pixel image, 75% fewer than most models. This enables efficient real-time multimodal streaming interaction on end-side devices like iPads.
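
As a rough arithmetic check of the token-density claim above (the ~2,560-token baseline is inferred from the "75% fewer" figure, not an officially reported number):

# 640 tokens for a ~1.8-megapixel image (e.g. 1344 x 1344 pixels)
minicpm_tokens = 640
# "75% fewer than most models" implies a typical baseline of about 4x as many tokens
baseline_tokens = minicpm_tokens / (1 - 0.75)  # = 2560
print(f"typical: ~{baseline_tokens:.0f} tokens, MiniCPM-o 2.6: {minicpm_tokens} tokens")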

MiniCPM-V 2.6: Outstanding Performance in Visual Understanding

Focusing on visual understanding tasks, MiniCPM-V 2.6 delivers particularly impressive performance in single-image, multi-image, and video understanding. Built on SigLip-400M and Qwen2-7B, it also boasts 8B parameters.

Key Advantages

  • Leading Visual Performance: On the OpenCompass benchmark, MiniCPM-V 2.6 achieves an average score of 65.2, outperforming models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet. It achieves the best results on multiple visual-specific benchmarks, including OCRBench.

  • Multi-image and Video Understanding: Supporting multi-image dialogue and reasoning, MiniCPM-V 2.6 excels on multi-image benchmarks like Mantis-Eval and BLINK. It can also process video inputs and provide detailed video descriptions covering both temporal and spatial information.

  • Efficient Real-time Processing: Similar to MiniCPM-o 2.6, MiniCPM-V 2.6 offers optimized visual token density, enabling efficient real-time video understanding on end-side devices.

Technical Architecture and Innovations

The MiniCPM series models adopt a unique end-to-end multimodal architecture, connecting and training different modality encoding/decoding modules to fully leverage multimodal knowledge. Below are its main technical features:

End-to-End Multimodal Architecture

The models handle different modalities of data, such as images, text, and audio, through a unified framework. This avoids the need for separate model design for each modality, significantly enhancing the model’s versatility and efficiency.
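
As a user-facing illustration of this unified design, a single chat turn can mix an image, an audio waveform, and text in one message list. The sketch below reuses the model and tokenizer objects loaded as shown in the deployment guide later in this article; the file names are placeholders.

from PIL import Image
import librosa

# One user turn mixing an image, a raw 16 kHz waveform, and a text instruction;
# the same chat() call handles all three modalities
image = Image.open('photo.jpg').convert('RGB')
audio, _ = librosa.load('clip.wav', sr=16000, mono=True)

msgs = [{'role': 'user', 'content': [image, audio, "Describe what you see and hear."]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)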

Multimodal Streaming Mechanism

To support real-time streaming interaction, the MiniCPM series transforms offline encoding/decoders into online modules suitable for streaming input/output. This mechanism is particularly advantageous for processing continuous video and audio streams, enabling the model to respond to user inputs in real-time.
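
The streaming API itself is model-specific, so the following is only an illustrative sketch of the chunking idea; process_chunk is a hypothetical callback standing in for the model's online prefill/decode step and is not part of the official interface.

import numpy as np

CHUNK_SECONDS = 1.0   # feed roughly one second of input per step
SAMPLE_RATE = 16000

def stream_session(frames, waveform, process_chunk):
    # Split synchronized video frames and audio into one-second chunks and hand
    # each chunk to the (hypothetical) online encoder as it arrives.
    samples_per_chunk = int(CHUNK_SECONDS * SAMPLE_RATE)
    n_chunks = len(waveform) // samples_per_chunk
    for i in range(n_chunks):
        audio_chunk = np.asarray(waveform[i * samples_per_chunk:(i + 1) * samples_per_chunk])
        frame = frames[min(i, len(frames) - 1)]   # roughly one frame per second
        process_chunk(frame, audio_chunk)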

Configurable Speech Scheme

In terms of speech interaction, MiniCPM-o 2.6 introduces an innovative multimodal system prompt design, allowing users to flexibly control speech styles through text or speech samples. This provides strong support for personalized voice assistant applications.
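
A rough sketch of what such a prompt could look like. The exact message schema is defined by the model's chat template, so treat the structure below (an audio reference sample plus a style instruction in the system turn) as an assumption rather than the official format; ref_voice.wav is a placeholder.

import librosa

# Reference voice sample plus a textual style instruction in the system turn
# (assumed structure, not the official schema)
ref_voice, _ = librosa.load('ref_voice.wav', sr=16000, mono=True)

msgs = [
    {'role': 'system', 'content': [ref_voice, "Speak in this voice, slowly and warmly."]},
    {'role': 'user', 'content': ["Introduce yourself in one sentence."]},
]
answer = model.chat(msgs=msgs, tokenizer=tokenizer, generate_audio=True)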

Performance Evaluation and Comparison

To comprehensively assess the performance of MiniCPM series models, the development team conducted tests on multiple authoritative benchmarks. Below are some key evaluation results:

Visual Understanding Capability Evaluation

On the OpenCompass benchmark, MiniCPM-o 2.6 achieves an average score of 70.2 in single-image understanding tasks, surpassing several commercial closed-source models. Specifically, it scores 897* on OCRBench and 71.9* on MathVista mini, showcasing leading performance.

Speech Understanding and Generation Capability Evaluation

In speech recognition (ASR) tasks, MiniCPM-o 2.6 performs favorably against models like GPT-4o-Realtime and Gemini 1.5 Pro across multiple datasets. For instance, its character error rate (CER) on the AISHELL-1 dataset is 1.6, significantly lower than GPT-4o-Realtime’s 7.3*.

In end-to-end spoken question answering (SpeechQA) tasks, MiniCPM-o 2.6 also achieves excellent results. For example, its accuracy (ACC) on the Speech Llama Q. dataset reaches 61.0, second only to GPT-4o-Realtime’s 71.7.

Multimodal Streaming Interaction Capability Evaluation

On the StreamingBench benchmark, MiniCPM-o 2.6 achieves an overall score of 66.0, surpassing GPT-4o-202408’s 64.1 and Claude 3.5 Sonnet’s 59.7. It demonstrates leading performance in real-time video understanding, multi-source understanding, contextual understanding, and overall evaluation.

Practical Applications and Case Studies

The multimodal capabilities and efficient deployment features of MiniCPM series models make them highly promising for various practical applications. Below are some typical application scenarios:

Intelligent Assistants

MiniCPM-o 2.6’s real-time speech dialogue and multimodal streaming interaction capabilities make it an ideal choice for building intelligent voice assistants. Users can interact with the assistant via speech or text, and the assistant can understand user inputs in various modalities, providing a richer range of services.

Education Sector

In education, MiniCPM series models can be used to develop intelligent tutoring systems. For example, by analyzing students’ handwritten homework images, the model can offer targeted feedback and guidance. It can also generate detailed answers and explanations based on students’ speech questions.

Healthcare

Medical imaging analysis is another significant application area for MiniCPM series models. They can process medical images (such as X-rays and CT scans) to assist doctors in preliminary diagnoses. Additionally, they can offer health consultations and rehabilitation guidance to patients via speech interaction.

Content Creation

For content creators, MiniCPM series models can serve as creative aids. Users can input images and text descriptions to have the model generate related video scripts or copywriting. The model’s speech synthesis capability can also be used to produce audio content.

Installation and Deployment Guide

To help developers and researchers better utilize MiniCPM series models, here is a detailed installation and deployment guide:

Environmental Preparation

Ensure your development environment meets the following requirements:

  • Python 3.8 or higher
  • PyTorch 1.9 or higher
  • transformers library version 4.44.2
  • Other dependency libraries (such as pillow and librosa)

You can install the basic dependencies using the following command:

pip install -r requirements_o2.6.txt

Model Download and Loading

MiniCPM series models offer various versions to accommodate different devices and application scenarios. Below are some commonly used model versions and their download links:

  • MiniCPM-o 2.6 (GPU, 18 GB): The latest version, offering GPT-4o-level visual, speech, and multimodal streaming interaction capabilities on end-side devices. Download: Hugging Face
  • MiniCPM-o 2.6 gguf (CPU, 8 GB): gguf version, with lower memory usage and higher inference efficiency. Download: Hugging Face
  • MiniCPM-o 2.6 int4 (GPU, 9 GB): int4 quantized version, with lower memory usage. Download: Hugging Face

Sample code for loading the model:

import torch
from transformers import AutoModel, AutoTokenizer

# Load in bfloat16 so the 8B model fits the ~18 GB GPU footprint listed above
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
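
If GPU memory is tight, the int4 build from the table above can be loaded the same way. The repository name below assumes the quantized weights are published under an "-int4" suffix on the project's Hugging Face page; verify it before use.

# int4 quantized variant (about 9 GB of GPU memory according to the table above)
model_int4 = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
model_int4 = model_int4.eval().cuda()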

Inference and Application Examples

Image Question Answering

from PIL import Image

image = Image.open('example.jpg').convert('RGB')
question = "What is in this image?"

msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
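
Multi-image dialogue (highlighted in the MiniCPM-V 2.6 section above) uses the same message format: pass several images in one content list. The file names below are placeholders.

# Compare two images in a single turn
image1 = Image.open('before.jpg').convert('RGB')
image2 = Image.open('after.jpg').convert('RGB')

msgs = [{'role': 'user', 'content': [image1, image2, "What changed between these two images?"]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)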

Speech Dialogue

import librosa

audio_input, _ = librosa.load('audio.wav', sr=16000, mono=True)
question = "Repeat what I said."

msgs = [{'role': 'user', 'content': [question, audio_input]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer, generate_audio=True)
print(answer)
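
When generate_audio=True, the spoken reply can also be written to disk. The output_audio_path argument below follows the project's published examples; confirm it against the model version you install.

# Request a spoken reply and save it as a wav file
# (parameter name taken from the official examples; verify for your installed version)
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    generate_audio=True,
    output_audio_path='reply.wav',
)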

Multimodal Streaming Interaction

from PIL import Image
from moviepy.editor import VideoFileClip

video_path = "video.mp4"
video = VideoFileClip(video_path)

# Extract video frames (sampled at 1 frame per second to keep the token count manageable)
# and the audio track as a 16 kHz waveform
frames = [Image.fromarray(frame) for frame in video.iter_frames(fps=1)]
audio_stream = video.audio.to_soundarray(fps=16000)

# omni_input=True treats the interleaved frame/audio sequence as one multimodal stream
msgs = [{'role': 'user', 'content': frames + [audio_stream]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer, omni_input=True)
print(answer)

FAQ

How to Choose the Right Model Version?

Select the appropriate model version based on your device and application scenario. If you are using a GPU with sufficient memory, consider the standard MiniCPM-o 2.6 or MiniCPM-V 2.6 versions. For CPU devices or environments with limited memory, it is recommended to use the gguf or int4 quantized versions.

How Does the Model Perform on Different Devices?

MiniCPM series models are optimized to run efficiently on a variety of devices. On end-side devices like iPad Pro (M4 chip), they can achieve a smooth inference speed of 16-18 tokens per second. On servers equipped with high-performance GPUs, the inference speed is even faster, making it suitable for handling a large number of concurrent requests.

How to Improve the Model’s Response Speed?

To enhance the model’s response speed, you can try the following methods:

  • Reduce the complexity of the input data, e.g. lower the image resolution or shorten the audio (see the resizing sketch after this list)
  • Use quantized model versions (e.g., int4)
  • Optimize inference code to minimize unnecessary computational overhead
  • Use multi-card parallel inference on server-side
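
A minimal example of the first tip: downscale an image before it reaches the model. The 1344-pixel cap here is an illustrative choice, not an official limit.

from PIL import Image

def downscale(image_path, max_side=1344):
    # Resize so the longer side is at most max_side pixels, preserving aspect ratio
    image = Image.open(image_path).convert('RGB')
    scale = max_side / max(image.size)
    if scale < 1:
        image = image.resize((int(image.width * scale), int(image.height * scale)), Image.LANCZOS)
    return image

small_image = downscale('example.jpg')
msgs = [{'role': 'user', 'content': [small_image, "Describe this image briefly."]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)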

What Programming Languages Does the Model Support?

Currently, MiniCPM series models are primarily called through Python interfaces. The development team provides a Python implementation based on the transformers library, making it convenient for developers to integrate into their projects.

Conclusion and Outlook

The MiniCPM series models represent a significant breakthrough in multimodal large models for end-side deployment and real-time interaction. Through this article, we have gained an in-depth understanding of the exceptional performance of MiniCPM-o 2.6 and MiniCPM-V 2.6 in visual understanding, speech interaction, and multimodal streaming processing, as well as their vast potential in various practical application scenarios.

Looking to the future, we anticipate continued development of the MiniCPM series models in the following aspects:

  • Performance Optimization: Further enhancing the model’s operational efficiency across various devices and reducing resource consumption.
  • Functional Expansion: Increasing support for more modalities and tasks, such as 3D vision and multi-language speech synthesis.
  • Ease of Use Improvement: Providing more user-friendly development interfaces and tools to lower the adoption barrier.
  • Community Building: Leveraging the power of the open-source community to continuously enrich the model’s functionalities and applications.

For developers and researchers, MiniCPM series models are not only a powerful tool but also an innovative platform full of opportunities. By delving into and practicing with this technology, we can fully utilize it to bring smarter and more efficient solutions to various industries.

We hope this article has given readers a comprehensive understanding of the technical features and application value of the MiniCPM series models, and that it inspires more innovative thinking and practical exploration. In the vast field of multimodal AI, let us witness the power of technology together and embark on a smarter future.

The content of this article is based on the official documentation of MiniCPM. All technical details and performance data originate from the official information released by the development team. For further information or to use the MiniCPM series models, please visit the GitHub project page to access the latest resources and code.
