ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI

In today’s era of rapid artificial intelligence advancement, multimodal models have become a critical bridge connecting visual perception and language understanding. Baidu’s newly launched ERNIE-4.5-VL-28B-A3B-Thinking represents a significant upgrade based on the existing ERNIE-4.5-VL-28B-A3B architecture, achieving a qualitative leap especially in multimodal reasoning capabilities. If you’re focused on AI applications in visual-language interaction or planning to develop related intelligent tools, this model deserves in-depth exploration.

Core Highlights of ERNIE-4.5-VL-28B-A3B-Thinking: What You Need to Know

The upgrade of ERNIE-4.5-VL-28B-A3B-Thinking is not a simple parameter adjustment but a systematic technical optimization that delivers enhanced capabilities. Its core advantages stem from three key areas:

1. Large-Scale High-Quality Data Training for Enhanced Modality Alignment

During its development, the model underwent an extensive mid-training phase, absorbing a massive and diverse corpus of premium visual-language reasoning data. This training approach not only significantly boosted the model’s representation power but also deepened the semantic alignment between visual and language modalities.

In simple terms, earlier models might “understand” images and “read” text but struggle to accurately map their underlying meanings. In contrast, the optimized ERNIE-4.5-VL-28B-A3B-Thinking naturally connects visual information in images with the semantics of text descriptions—much like humans—laying the groundwork for reasoning in complex scenarios.

2. Cutting-Edge Reinforcement Learning for Improved Learning Efficiency

The model leverages advanced multimodal reinforcement learning techniques, combining the GSPO (Group Sequence Policy Optimization) and IcePop strategies to stabilize MoE (Mixture of Experts) training, paired with dynamic difficulty sampling. This combination offers two distinct benefits:

  • Training stability: Mitigates common issues in multimodal model training, such as convergence challenges and parameter oscillations.
  • Learning efficiency: Enables the model to intelligently select training samples aligned with its current capabilities, accelerating the mastery of core patterns from limited data.
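
To make “dynamic difficulty sampling” more concrete, here is a minimal illustrative sketch (not Baidu’s actual training code): it assumes each training sample carries a running estimate of the model’s pass rate on it, and it preferentially draws samples the model solves only sometimes, i.e., those at the edge of its current ability.

import random

def sampling_weight(pass_rate: float) -> float:
    # Weight peaks at pass_rate = 0.5 and vanishes at 0.0 and 1.0, so the batch
    # focuses on samples that are neither trivial nor currently unsolvable.
    return pass_rate * (1.0 - pass_rate)

def draw_batch(samples, batch_size):
    # samples: list of dicts, each with a "pass_rate" estimated from recent rollouts
    weights = [sampling_weight(s["pass_rate"]) for s in samples]
    if sum(weights) == 0:
        return random.sample(samples, batch_size)  # fall back to uniform sampling
    return random.choices(samples, weights=weights, k=batch_size)

# Example: a pool of 1,000 samples with hypothetical pass-rate estimates
pool = [{"id": i, "pass_rate": random.random()} for i in range(1000)]
batch = draw_batch(pool, batch_size=32)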

3. Enhanced Practical Features to Lower Application Barriers

Addressing the real-world needs of developers and enterprises, the model prioritizes two key capability enhancements:

  • Visual Grounding: Delivers more precise localization and flexible instruction execution, enabling rapid responses to commands like “mark the defective component in the image” or “circle abnormal regions” in complex industrial settings.
  • “Thinking with Images”: When paired with tools like image zooming and image search, the model mimics human observation—zooming in and out of images to capture details and uncover deep insights, ideal for tasks requiring attention to fine-grained features or long-tail visual knowledge.
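
As a rough illustration of the image-zoom tool that “Thinking with Images” pairs with, the sketch below shows one way such a tool could be implemented on the application side using Pillow. The function name crop_and_zoom and the bounding-box convention are hypothetical and not part of the model’s API.

from PIL import Image

def crop_and_zoom(image_path: str, bbox: tuple, scale: int = 2) -> Image.Image:
    # Hypothetical "zoom in" tool: crop a region of interest and upscale it so
    # fine-grained details (labels, inscriptions, small defects) become visible.
    image = Image.open(image_path)
    region = image.crop(bbox)  # bbox = (left, upper, right, lower) in pixels
    return region.resize((region.width * scale, region.height * scale), Image.LANCZOS)

# Example: zoom into the top-left 512x512 region before re-submitting it to the model
detail = crop_and_zoom("board.jpg", bbox=(0, 0, 512, 512), scale=2)
detail.save("board_detail.jpg")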

Core Capabilities of ERNIE-4.5-VL-28B-A3B-Thinking: Beyond “Seeing” and “Saying”

Despite being a lightweight model that activates only 3 billion of its 28 billion total parameters per token, ERNIE-4.5-VL-28B-A3B-Thinking matches the performance of top-tier industry flagship models across various benchmarks. Its strengths shine in six key areas:

1. Visual Reasoning: Multi-Step Analysis for Complex Scenarios

Powered by large-scale reinforcement learning, the model excels at visual tasks requiring multi-step reasoning:

  • Chart analysis: Extracts trends from line graphs or bar charts to answer questions like “Which period had the highest growth rate?” or “Predict next quarter’s values.”
  • Causal reasoning: Infers potential cause-effect relationships (e.g., linking a broken window to a stone on the ground).
  • Scene understanding: Identifies and correlates visual elements in complex street scenes (e.g., “Are pedestrians crossing the road when the light is red?”).

2. STEM Reasoning: Solving Visual Science and Math Problems

The model’s robust visual capabilities drive significant improvements in STEM (Science, Technology, Engineering, Mathematics) tasks involving visual inputs:

  • Mathematics: Recognizes side lengths and angles of geometric shapes in images to calculate area or volume.
  • Physics: Analyzes forces acting on objects (e.g., a ball on a slope) based on visual cues.
  • Chemistry: Identifies experimental setups in images to determine reaction types or potential products.

3. Visual Grounding: Precise Response to Spatial Instructions

In scenarios requiring accurate localization, the model executes commands with precision:

  • Industrial quality inspection: Marks “solder joints with poor contact” or “abnormal areas” in circuit board images per text instructions.
  • Design assistance: Locates and labels target positions (e.g., “move this icon to the top-right corner”) in interface design mockups.
  • Medical imaging: Highlights suspected lesion areas in CT scans to assist doctors in rapid focus.

4. “Thinking with Images”: Detail Processing and Deep Exploration

The model mimics human observation habits, zooming in/out of images to uncover hidden information:

  • Cultural relic identification: First observes the overall artifact, then zooms in to examine textures or inscriptions for dating and craftsmanship analysis.
  • Product quality control: Magnifies labels on packaging images to verify compliance with text standards.
  • Map analysis: Overviews regional layouts before zooming in on specific road segments to check traffic signs or conditions.

5. Tool Utilization: Expanding Long-Tail Knowledge and Functionality

Equipped with robust tool-calling capabilities, the model supplements its built-in knowledge with external tools:

  • Image search: Identifies unfamiliar plants or animals by invoking image search tools.
  • Data calculation: Uses calculators for complex numerical operations when analyzing charts.
  • Information verification: Cross-checks time, location, or other details in images via search engines.
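
If the model is served behind an OpenAI-compatible endpoint with tool calling enabled (see the vLLM deployment section later in this guide), a tool-augmented request might look like the sketch below. The endpoint URL, the image URL, and the image_search tool definition are placeholders that your application would supply.

from openai import OpenAI

# Placeholder endpoint; vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "image_search",  # hypothetical tool implemented by your application
        "description": "Search the web for images matching a textual description.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "What to search for"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What species is the plant in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/plant.jpg"}},
        ],
    }],
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call image_search
)
print(response.choices[0].message.tool_calls or response.choices[0].message.content)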

6. Video Understanding: Capturing Temporal Changes

Beyond static images, the model excels at video content analysis with strong temporal awareness:

  • Timeline perception: Identifies when objects appear/disappear and the sequence of actions in videos.
  • Event localization: Pinpoints timeframes for key events (e.g., “someone entering a restricted area” or “items being moved”) in surveillance footage.
  • Content summarization: Extracts core insights, such as “three main topics discussed in a meeting video” or “step-by-step breakdowns of tutorial videos.”
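
In the Transformers workflow shown later in this guide, processor.process_vision_info(messages) also returns video inputs, so video questions can in principle be phrased through the same message structure. The snippet below is a sketch only: the video_url content type is an assumption, and the exact schema should be checked against the official model card.

# Sketch of a video question using the same message format as the image example below.
# The "video_url" content type is an assumption; consult the model card for the exact schema.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "At what point in the clip does someone enter the restricted area?"},
            {"type": "video_url", "video_url": {"url": "https://example.com/surveillance_clip.mp4"}},
        ],
    },
]
# These messages are then passed through processor.process_vision_info(messages),
# which separates image and video inputs before they are handed to the model.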

Figure: ERNIE-4.5-VL-28B-A3B-Thinking benchmark performance

Getting Started: A Step-by-Step Guide to Using ERNIE-4.5-VL-28B-A3B-Thinking

Whether you’re a developer or researcher, you can deploy and use ERNIE-4.5-VL-28B-A3B-Thinking through the following methods:

Method 1: Inference with the Transformers Library

If you’re familiar with Python and Hugging Face’s Transformers library, this method enables quick implementation of basic text-image interaction.

Step 1: Install Required Libraries

Ensure torch and transformers are installed in your environment:

pip install torch transformers

Step 2: Write Inference Code

import torch
from transformers import AutoProcessor, AutoTokenizer, AutoModelForCausalLM

# Model path
model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# Load model (auto-device mapping, bfloat16 precision for balance of performance and memory)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",  # Automatically assign model to available devices (CPU/GPU)
    dtype=torch.bfloat16,
    trust_remote_code=True  # Trust remote code (model may include custom components)
)

# Load processor (handles text and image inputs)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)  # Add image preprocessing functionality to the model

# Construct input messages (text + image)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

# Process text input: Generate model-compatible conversation template
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # Add generation prompt to signal response generation
)

# Process visual inputs: Extract image and video information
image_inputs, video_inputs = processor.process_vision_info(messages)

# Integrate all inputs into model-acceptable format
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,  # Pad input lengths to match
    return_tensors="pt",  # Return PyTorch tensors
)

# Move inputs to the model's device
device = next(model.parameters()).device
inputs = inputs.to(device)

# Generate response (max 1024 tokens)
generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False  # Disable the KV cache during generation
)

# Decode generated results to get the final response
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)

Code Explanation:

  • device_map="auto": Automatically distributes the model across available hardware (CPU/GPU) without manual configuration.
  • dtype=torch.bfloat16: Uses bfloat16 precision to reduce memory usage while maintaining performance.
  • processor: Unified handler for text and image inputs, eliminating the need for separate image preprocessing code (e.g., resizing, normalization).

Method 2: High-Performance Inference with vLLM

vLLM is a high-throughput, low-latency LLM inference library ideal for scenarios requiring fast responses.

Step 1: Install vLLM

Install the latest version of vLLM (supports multimodal models):

pip install uv  # For fast Python package installation
# The extra index URLs point to the vLLM nightly wheels and the PyTorch CUDA 12.9 wheels
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

Step 2: Start the vLLM Service

# Single 80GB GPU deployment (add --gpu-memory-utilization 0.95 if errors occur)
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code

Step 3: Enable Reasoning and Tool-Call Parsers (Optional)

For reasoning chain analysis or tool-calling functionality, add the following parameters:

# Enable ERNIE 4.5-specific reasoning and tool-call parsers
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
 --reasoning-parser ernie45  \
 --tool-call-parser ernie45  \
 --enable-auto-tool-choice  # Allow the model to automatically decide whether to call tools

Use Cases:

  • High-concurrency applications (e.g., customer service chatbots, intelligent Q&A systems).
  • Real-time interaction scenarios requiring low latency.
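
Once the server is running, clients can call the OpenAI-compatible API (http://localhost:8000/v1 by default for vLLM). A minimal request sketch with a placeholder image URL:

import requests

payload = {
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which period in this chart had the highest growth rate?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/revenue_chart.png"}},
        ],
    }],
    "max_tokens": 1024,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])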

Method 3: Rapid Deployment with FastDeploy

FastDeploy is Baidu’s deployment tool for fast, production-ready deployment of multi-framework models.

Step 1: Install FastDeploy

Refer to the FastDeploy Official Documentation for environment-specific installation instructions.

Step 2: Launch the Service

# Parameter notes: --max-model-len = maximum input length in tokens; --max-num-seqs = maximum
# number of concurrent sequences; --quantization wint8 = weight-only INT8 quantization to reduce
# memory usage; --reasoning-parser / --tool-call-parser = ERNIE 4.5 VL thinking parsers;
# --mm-processor-kwargs = maximum number of image pixels per input
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --max-model-len 131072 \
  --max-num-seqs 32 \
  --port 8180 \
  --quantization wint8 \
  --reasoning-parser ernie-45-vl-thinking \
  --tool-call-parser ernie-45-vl-thinking \
  --mm-processor-kwargs '{"image_max_pixels": 12845056}'

Notes:

  • A minimum of 80GB GPU memory is required for single-card deployment.
  • The --quantization wint8 parameter reduces memory usage but may slightly impact precision (task-dependent).

Method 4: Fine-Tuning with ERNIEKit

ERNIEKit is a PaddlePaddle-based training toolkit designed for ERNIE-series open-source large models, supporting instruction fine-tuning (SFT, LoRA) and alignment training (DPO).

Step 1: Download the Model

huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking

Step 2: Instruction Fine-Tuning (SFT)

# Basic instruction fine-tuning (uses LoRA to save memory)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml

# Tool-call instruction fine-tuning (for tool-integration scenarios)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml

Advanced Fine-Tuning Configurations:

  • Multi-GPU training: Refer to multi-card configuration examples in the ERNIEKit repository.
  • Fine-tuning strategies: Supports full-parameter fine-tuning, LoRA, DPO (Direct Preference Optimization), and more.

For detailed scripts and configurations, visit the ERNIEKit GitHub Repository and explore the examples folder.

License and Citation

ERNIE-4.5-VL-28B-A3B-Thinking is licensed under the Apache License 2.0, permitting commercial use subject to the terms and conditions (e.g., retaining copyright notices, disclaiming liability). Copyright © 2025 Baidu, Inc. All rights reserved.

If you use this model in research or projects, please cite the technical report:

@misc{ernie2025technicalreport,
      title={ERNIE 4.5 Technical Report},
      author={Baidu-ERNIE-Team},
      year={2025},
      primaryClass={cs.CL},
      howpublished={\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}}
}

Frequently Asked Questions (FAQ)

1. What hardware configuration is required to run ERNIE-4.5-VL-28B-A3B-Thinking?

A minimum of 80GB GPU memory (e.g., NVIDIA A100, H100) is recommended for single-card deployment. Quantization features in vLLM or FastDeploy can reduce memory requirements, but 60GB+ is still advised for stability.

2. Does the model support Chinese input?

Yes, the model supports both Chinese and English, with optimized performance for Chinese scenarios. It accurately understands Chinese instructions and visual content (e.g., Chinese labels, handwritten characters).

3. How does the model decide when to call tools?

When deploying with vLLM or FastDeploy, enable the --enable-auto-tool-choice parameter. The model automatically triggers tool calls (e.g., image search) for questions beyond its built-in knowledge (e.g., “What is the name of the flower in this image?”).

4. How much data is needed for fine-tuning?

  • Basic instruction fine-tuning: At least 10,000 high-quality text-image pairs.
  • Domain-specific tasks (e.g., industrial inspection): 5,000+ task-specific samples, paired with LoRA for optimal results with limited data.

5. How is the “Thinking with Images” feature activated?

No additional commands are required. The model automatically adopts human-like observation logic for complex images—e.g., scanning the overall scene first, then focusing on key details in images with multiple small objects.

6. What advantages does it offer over other multimodal models?

Key strengths include “lightweight efficiency” and “deep reasoning”:

  • Activates only 3 billion parameters while matching top-tier model performance, ideal for resource-constrained environments.
  • Excels at complex tasks like multi-step reasoning and causal analysis, thanks to reinforcement learning and dynamic difficulty sampling.

7. Can it be used for real-time video analysis?

Currently, the model is best suited for short video clips (≤10 seconds). For real-time long-video analysis, combine it with video processing tools (e.g., FFmpeg) to reduce input data volume via frame sampling.
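
As a concrete example of the frame-sampling workaround, the sketch below uses ffmpeg (via a subprocess call) to downsample a long video to one frame per second before the frames are sent to the model as ordinary image inputs; the file paths are placeholders.

import subprocess
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, fps: int = 1):
    # Extract `fps` frames per second so only a manageable number of images
    # has to be sent to the model for analysis.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", str(out / "frame_%05d.jpg")],
        check=True,
    )
    return sorted(out.glob("frame_*.jpg"))

# Placeholder paths; feed the resulting frames to the model as image inputs.
frames = sample_frames("warehouse_footage.mp4", "frames/", fps=1)
print(f"Extracted {len(frames)} frames")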

ERNIE-4.5-VL-28B-A3B-Thinking delivers an efficient, accurate solution for multimodal AI applications through technical innovation. Whether for research or commercial development, it unlocks new possibilities for developers. If you’re seeking a multimodal model that balances performance and resource efficiency, follow the guides above to get started—its capabilities are sure to meet your needs.