Seed1.5-VL: A Game-Changer in Multimodal AI

## Introduction

In the ever-evolving landscape of artificial intelligence, multimodal models have emerged as a key paradigm for enabling AI to perceive, reason, and act in open-ended environments. These models, which align visual and textual modalities within a unified framework, have significantly advanced research in areas such as multimodal reasoning, image editing, GUI agents, autonomous driving, and robotics. However, despite remarkable progress, current vision-language models (VLMs) still fall short of human-level generality, particularly in tasks requiring 3D spatial understanding, object counting, imaginative visual inference, and interactive gameplay.

Seed1.5-VL, the latest multimodal foundation model developed by ByteDance, addresses these challenges head-on. With a vision encoder of 532 million parameters and a Mixture-of-Experts (MoE) language model with 20 billion active parameters, Seed1.5-VL performs strongly across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art results on 38 of 60 public benchmarks. Beyond benchmark success, Seed1.5-VL excels in agent-centric tasks such as GUI control and gameplay, outperforming leading multimodal systems like OpenAI CUA and Claude 3.7. It also demonstrates robust reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles.

## Model Architecture

Seed1.5-VL comprises three main components: a vision encoder, an MLP adapter, and a large language model (LLM). This architecture is designed to efficiently process and integrate visual and textual information. The vision encoder, based on the well-established Vision Transformer (ViT) architecture, supports dynamic image resolutions and employs 2D RoPE for positional encoding, allowing it to handle images of varying dimensions and adapt to a wide range of visual tasks.
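The report does not include reference code, but the core idea of 2D RoPE for patch tokens can be sketched in a few lines of PyTorch: the channel dimension is split in half, and standard 1D rotary embeddings are applied to one half using each patch's row index and to the other half using its column index. The function names and the half-split layout below are illustrative assumptions, not the model's actual implementation.

```python
# Minimal 2D RoPE sketch (illustrative; not Seed1.5-VL's actual code).
# Assumption: half of each token's channels rotate with the patch row index,
# the other half with the column index.
import torch


def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply standard 1D rotary embedding to x given integer positions."""
    # x: (num_tokens, dim), dim must be even; pos: (num_tokens,)
    dim = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * freqs[None, :]        # (num_tokens, dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                   # even / odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """2D RoPE: first half of channels encodes rows, second half encodes columns."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)


# Usage: patch tokens from a 6x8 grid of an arbitrarily sized image.
h, w, dim = 6, 8, 64
tokens = torch.randn(h * w, dim)
rows = torch.arange(h).repeat_interleave(w)   # row index of each patch
cols = torch.arange(w).repeat(h)              # column index of each patch
tokens = rope_2d(tokens, rows, cols)
```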

The vision encoder undergoes a dedicated pre-training pipeline before being integrated with the LLM. This pre-training includes three stages: Masked Image Modeling (MIM) with 2D RoPE, Native-Resolution Contrastive Learning, and Omni-modal Pre-training. These stages enhance the encoder’s visual perception capabilities and enable it to learn comprehensive visual representations.
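As a rough illustration of the contrastive-learning stage, the sketch below implements a CLIP-style symmetric InfoNCE loss over a batch of matched image and text embeddings. The report does not specify the exact objective, so the temperature and loss form here are assumptions.

```python
# CLIP-style symmetric contrastive loss sketch (illustrative only; the report
# does not specify the exact loss used in native-resolution contrastive learning).
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0])           # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


# Usage with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```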

The MLP adapter projects visual features into multimodal tokens, which are then processed by the LLM. The LLM, initialized from an internal pre-trained model with approximately 20 billion active parameters, is trained on a large-scale corpus of high-quality text-only tokens.
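A minimal sketch of this wiring, with placeholder module names and dimensions rather than the production components, might look as follows: visual features are projected by a small MLP and placed in the token sequence alongside the embedded text before being consumed by the LLM.

```python
# Sketch of the three-component wiring (vision encoder -> MLP adapter -> LLM).
# Modules and dimensions are placeholders, not the production components.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Projects vision-encoder features into the LLM's token embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_feats)            # (num_patches, llm_dim)


vision_dim, llm_dim = 1024, 4096
adapter = MLPAdapter(vision_dim, llm_dim)

visual_feats = torch.randn(256, vision_dim)       # output of the vision encoder
text_embeds = torch.randn(32, llm_dim)            # embedded prompt tokens

# Visual tokens are placed in the sequence alongside text tokens and fed to the LLM.
multimodal_sequence = torch.cat([adapter(visual_feats), text_embeds], dim=0)
```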

## Pre-training

The pre-training of Seed1.5-VL leverages a vast corpus of 3 trillion diverse, high-quality source tokens. The data is categorized by target capability, with a specific curation process for each category. For generic image-text pairs and knowledge data, a series of filtering techniques is applied to mitigate noise and class imbalance. The distribution of visual concepts in the raw image-text pairs follows a long-tail pattern, so a targeted pre-processing framework is used to rebalance it.
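One simple way to illustrate such rebalancing is to cap the number of retained samples per visual concept, so head concepts stop growing while tail concepts are kept in full; the cap value and the notion of a per-sample concept label below are assumptions of this sketch, not the paper's actual pipeline.

```python
# Illustrative rebalancing of long-tailed image-text data by capping samples per concept.
# The threshold and the per-sample "concept" label are assumptions for the sketch.
from collections import Counter
from typing import Iterable


def rebalance(samples: Iterable[tuple[str, str]], max_per_concept: int = 1000) -> list[tuple[str, str]]:
    """Keep at most `max_per_concept` samples for each visual concept."""
    seen: Counter[str] = Counter()
    kept = []
    for concept, caption in samples:
        if seen[concept] < max_per_concept:
            kept.append((concept, caption))
            seen[concept] += 1
    return kept


# Head concepts ("dog") are capped while tail concepts ("axolotl") are fully retained.
data = [("dog", f"a photo of a dog #{i}") for i in range(5000)] + [("axolotl", "an axolotl in a tank")]
print(len(rebalance(data, max_per_concept=1000)))  # 1001
```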

For optical character recognition (OCR), a large-scale training dataset containing over 1 billion samples is built. This includes documents, scene text, tables, charts, and flowcharts. Various data augmentation techniques are applied to improve the model’s robustness in understanding textual content within images.
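The report does not enumerate the exact augmentations, but a typical OCR-oriented augmentation pass might combine mild geometric skew, blur, and contrast jitter, as in the hypothetical Pillow sketch below.

```python
# Illustrative OCR-style image augmentations (the report does not list the exact
# transforms, so rotation, blur, and contrast jitter here are assumptions).
import random

from PIL import Image, ImageEnhance, ImageFilter


def augment_for_ocr(img: Image.Image) -> Image.Image:
    """Apply mild geometric and photometric perturbations to a document image."""
    img = img.rotate(random.uniform(-3, 3), expand=True, fillcolor="white")  # slight skew
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.3, 1.0)))
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2))       # contrast jitter
    return img


# Usage on a blank white "document" stand-in.
page = Image.new("RGB", (640, 480), "white")
augmented = augment_for_ocr(page)
```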

For visual grounding and counting, the training strategy primarily uses three types of data: bounding box annotations, point annotations, and counting data. Bounding box data is sourced from widely used open-source datasets, while point data is generated using Molmo and CountGD. Counting data is constructed by sampling from the aforementioned bounding box and point data.
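For instance, counting samples can be derived mechanically from box annotations by grouping boxes per category and emitting a question/answer pair; the annotation schema and question template below are assumptions for illustration.

```python
# Illustrative construction of counting QA pairs from bounding-box annotations.
# The annotation schema and question template are assumptions for this sketch.
from collections import Counter


def boxes_to_counting_qa(boxes: list[dict]) -> list[dict]:
    """Turn per-image box annotations into 'how many X' question/answer pairs."""
    counts = Counter(box["category"] for box in boxes)
    return [{"question": f"How many {category}s are in the image?", "answer": str(n)}
            for category, n in counts.items()]


annotations = [
    {"category": "apple", "bbox": [10, 20, 60, 80]},
    {"category": "apple", "bbox": [100, 40, 150, 95]},
    {"category": "cup", "bbox": [200, 30, 240, 90]},
]
print(boxes_to_counting_qa(annotations))
# [{'question': 'How many apples are in the image?', 'answer': '2'}, ...]
```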

For 3D spatial understanding, data targeting relative depth sorting, absolute depth estimation, and 3D grounding tasks is constructed. This includes leveraging DepthAnything V2 to infer depth relationships among objects and deriving absolute depth from publicly available datasets.
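A sketch of how relative depth ordering might be derived from a predicted depth map and object boxes is shown below; the depth convention (larger means farther), the box format, and the use of the median are assumptions, not the paper's exact recipe.

```python
# Illustrative relative depth ordering from a monocular depth map and object boxes.
# Assumes a dense depth map (larger value = farther) such as one produced by a
# monocular depth estimator; the box format and convention are assumptions.
import numpy as np


def sort_objects_by_depth(depth: np.ndarray, boxes: dict[str, tuple[int, int, int, int]]) -> list[str]:
    """Order object names from nearest to farthest by median depth inside each box."""
    median_depth = {
        name: float(np.median(depth[y1:y2, x1:x2]))
        for name, (x1, y1, x2, y2) in boxes.items()
    }
    return sorted(median_depth, key=median_depth.get)


depth_map = np.random.rand(480, 640)                 # stand-in for an estimated depth map
objects = {"mug": (50, 100, 120, 200), "chair": (300, 150, 450, 400)}
print(sort_objects_by_depth(depth_map, objects))     # e.g. ['chair', 'mug']
```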

Video data is used to enhance the model’s understanding of multi-frame, time-series imagery. It covers tasks such as video captioning, video question answering, action recognition, action grounding, and multi-image understanding.

The STEM data collection includes a diverse array of problem-solving data across various domains. This involves collecting high-quality educational grounding samples and synthesizing structured tables, chemical structural diagrams, and coordinate system diagrams.
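As a toy example of such synthesis, the sketch below renders a coordinate-system diagram with matplotlib and pairs it with a question about the plotted curve; the library choice, plotted function, and QA template are assumptions rather than the actual pipeline.

```python
# Illustrative synthesis of a coordinate-system diagram with a paired question.
# Matplotlib, the plotted function, and the QA template are assumptions of this sketch.
import matplotlib

matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-5, 5, 200)
a, b = 1.0, -4.0                               # coefficients would be randomized in practice
y = a * x**2 + b

fig, ax = plt.subplots()
ax.plot(x, y)
ax.axhline(0, color="black", linewidth=0.8)    # x-axis
ax.axvline(0, color="black", linewidth=0.8)    # y-axis
ax.set_title("y = x^2 - 4")
fig.savefig("parabola.png")

sample = {
    "image": "parabola.png",
    "question": "At which x values does the curve cross the x-axis?",
    "answer": "x = -2 and x = 2",
}
```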

For GUI data, a large-scale dataset is curated from UI-TARS. This includes screenshots paired with structured metadata, such as element type, bounding box, text, and depth. The dataset is designed to support robust GUI perception, grounding, and reasoning.
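The structured metadata might be represented with a schema along the lines of the sketch below, whose field names mirror the attributes mentioned in the report (element type, bounding box, text, depth) but are otherwise assumptions.

```python
# Illustrative schema for GUI element metadata paired with a screenshot.
# Field names mirror the attributes mentioned in the report (type, bounding box,
# text, depth) but are otherwise assumptions.
from dataclasses import dataclass, field


@dataclass
class GUIElement:
    element_type: str                  # e.g. "button", "text_field"
    bbox: tuple[int, int, int, int]    # (x1, y1, x2, y2) in screenshot pixels
    text: str                          # visible or accessibility text, may be empty
    depth: int                         # depth of the element in the UI hierarchy


@dataclass
class GUISample:
    screenshot_path: str
    elements: list[GUIElement] = field(default_factory=list)


sample = GUISample(
    screenshot_path="settings_screen.png",
    elements=[GUIElement("button", (24, 880, 296, 944), "Sign in", depth=3)],
)
```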

## Post-training

The post-training stage equips Seed1.5-VL with robust instruction-following and reasoning abilities through a combination of Supervised Fine-tuning (SFT) and Reinforcement Learning (RL). The SFT stage provides the model with foundational instruction-following and reasoning capabilities. The SFT dataset includes general instruction data and Long Chain-of-Thought (LongCoT) data, which are generated via prompt engineering and rejection sampling.
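A minimal sketch of rejection sampling for LongCoT data is shown below: several candidate chains of thought are sampled per prompt, and only those whose extracted final answer matches a reference are kept. The `generate` and `extract_answer` callables are hypothetical stand-ins, not real APIs.

```python
# Illustrative rejection sampling for LongCoT data: sample several candidate
# chains of thought per prompt and keep those whose final answer verifies.
# `generate` and `extract_answer` are hypothetical stand-ins, not real APIs.
import random
from typing import Callable


def rejection_sample(prompt: str, reference_answer: str,
                     generate: Callable[[str], str],
                     extract_answer: Callable[[str], str],
                     num_candidates: int = 8) -> list[str]:
    """Return the sampled responses whose extracted final answer matches the reference."""
    kept = []
    for _ in range(num_candidates):
        response = generate(prompt)                      # sample with non-zero temperature
        if extract_answer(response) == reference_answer:
            kept.append(response)                        # accepted as LongCoT training data
    return kept


# Toy usage with a fake generator that is right about half of the time.
fake_generate = lambda p: "Reasoning... Final answer: " + random.choice(["42", "41"])
fake_extract = lambda r: r.rsplit(": ", 1)[-1]
print(len(rejection_sample("What is 6 * 7?", "42", fake_generate, fake_extract)))
```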

The RL stage further improves the model’s performance in human evaluations and its multimodal understanding capabilities. This involves preference data collection, reward model training, and optimization with reinforcement learning algorithms. The RLHF (Reinforcement Learning from Human Feedback) process collects list-wise multimodal preference datasets for reward modeling, gathered through human annotation and heuristic synthesis. The reward model is initialized from an instruction-tuned VLM and trained as a generative classifier that directly outputs answer indicator tokens expressing the preference between two responses.
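The generative-classifier formulation can be sketched as follows: the reward model reads the prompt and two candidate responses and emits an indicator token, and the preference is read off the next-token logits for the two indicators. Everything in the sketch, including the token ids, is a simplified assumption rather than the production reward model.

```python
# Sketch of reward modeling as a generative classifier: the reward model reads the
# prompt plus two candidate responses and emits a single indicator token ("1" or "2").
# The preference is read off the next-token logits; this is a simplified stand-in.
import torch


def preference_from_logits(next_token_logits: torch.Tensor,
                           token_id_1: int, token_id_2: int) -> tuple[int, float]:
    """Return (preferred_response, probability) from the indicator-token logits."""
    pair_logits = next_token_logits[[token_id_1, token_id_2]]
    probs = torch.softmax(pair_logits, dim=-1)       # renormalize over the two indicators
    preferred = 1 if probs[0] >= probs[1] else 2
    return preferred, float(probs.max())


# Toy usage with random logits over a small vocabulary; ids 11 and 22 stand in for "1" and "2".
logits = torch.randn(32000)
print(preference_from_logits(logits, token_id_1=11, token_id_2=22))
```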

## Training Infrastructure

To accelerate and stabilize pre-training, several training optimizations are developed, including hybrid parallelism, workload balancing, parallelism-aware data loading, and robust training. These optimizations significantly enhance training throughput.

The hybrid parallelism approach parallelizes the vision encoder and the language model differently. For the vision encoder and the MLP adapter, ZeRO data parallelism is used, while for the language model, standard 4-D parallelism is employed. This approach ensures efficient and balanced workload distribution.
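A schematic view of this split, with assumed parallelism degrees and configuration keys rather than the production setup, might look like the following.

```python
# Schematic parallelism layout (illustrative; keys, degrees, and the exact meaning
# of "4-D parallelism" here are assumptions rather than the production configuration).
parallelism_config = {
    "vision_encoder_and_adapter": {
        "strategy": "zero_data_parallel",   # replicate compute, shard optimizer state
    },
    "language_model": {
        "strategy": "4d_parallel",
        "tensor_parallel": 8,               # split weight matrices across GPUs
        "pipeline_parallel": 4,             # split layers into pipeline stages
        "context_parallel": 2,              # split long sequences across GPUs
        "data_parallel": 16,                # replicate the resulting model shards
    },
}

# Total GPUs implied by the LLM layout under these assumed degrees.
llm = parallelism_config["language_model"]
print(llm["tensor_parallel"] * llm["pipeline_parallel"] * llm["context_parallel"] * llm["data_parallel"])
```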

Workload balancing is achieved through a classical greedy algorithm that redistributes visual data based on computation intensity. Parallelism-aware data loading is implemented to reduce multimodal data IO overhead. Fault tolerance is ensured using the MegaScale robust training framework, which allows for efficient checkpoint saving and resuming through ByteCheckpoint.
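The greedy balancing step can be sketched as a longest-processing-time-first assignment: sort samples by an estimated cost and always hand the next one to the currently least-loaded rank. Using the per-sample visual token count as the cost proxy is an assumption of this sketch.

```python
# Classical greedy (longest-processing-time-first) balancing sketch: sort samples by an
# estimated cost and always assign the next one to the least-loaded rank. Using the
# visual token count as the cost proxy is an assumption of this sketch.
import heapq


def balance(costs: list[int], num_ranks: int) -> list[list[int]]:
    """Assign sample indices to ranks so that per-rank total cost stays roughly even."""
    heap = [(0, rank) for rank in range(num_ranks)]      # (current load, rank id)
    heapq.heapify(heap)
    assignment: list[list[int]] = [[] for _ in range(num_ranks)]
    for idx in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, rank = heapq.heappop(heap)                 # least-loaded rank so far
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + costs[idx], rank))
    return assignment


# Toy usage: per-sample visual token counts distributed over 4 ranks.
token_counts = [4096, 256, 1024, 512, 2048, 768, 128, 3072]
print(balance(token_counts, num_ranks=4))
```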

## Evaluation

Seed1.5-VL undergoes comprehensive evaluation on a wide range of public and internal benchmarks. On public benchmarks, it demonstrates strong performance across several categories: the vision encoder as a zero-shot classifier, vision tasks, video tasks, and multimodal agent tasks.

In zero-shot classification, Seed-ViT achieves an average accuracy of 82.5% across multiple datasets, comparable to models with significantly more parameters. In vision task evaluation, Seed1.5-VL shows robust performance in multimodal reasoning, general visual question answering, document and chart understanding, grounding and counting, and 3D spatial understanding. It achieves state-of-the-art performance on several benchmarks, including MathVista, V*, VLMs are Blind, ZeroBench, VisuLogic, RealWorldQA, SimpleVQA, MMStar, TextVQA, InfographicVQA, DocVQA, BLINK, LVIS-MG, VisualWebBench, RefCOCO-avg, CountBench, and FSC-147.

For video task evaluation, Seed1.5-VL excels in short video understanding, long video understanding, streaming video understanding, video reasoning, and video grounding. It achieves state-of-the-art performance on benchmarks such as MotionBench, MVBench, TOMATO, TVBench, Dream-1K, TempCompass, Charades-STA, and TACoS.

In multimodal agent tasks, Seed1.5-VL outperforms previous models in GUI grounding, GUI agent, and game agent tasks. It demonstrates exceptional grounding performance and strong generalization across diverse environments and devices.

The internal benchmarks developed by ByteDance provide a more comprehensive evaluation of the model’s capabilities, emphasizing core capabilities, comprehensive scope, evaluation accuracy, mitigation of benchmark overfitting, and task and input diversity. Seed1.5-VL achieves the second-highest overall score on the internal benchmark, excelling in the OOD, Agent, and Atomic Instruction Following categories and showing strong capabilities in STEM and Document & Diagram Understanding.

## Limitations

Despite its strong performance, Seed1.5-VL exhibits certain limitations, particularly in fine-grained visual perception and complex reasoning. It struggles with accurately counting objects in complex visual scenes and identifying subtle differences between images. In higher-level reasoning tasks, Seed1.5-VL sometimes overlooks specific conditions or introduces unfounded assumptions, leading to incomplete or invalid responses.

## Conclusion and Next Steps

Seed1.5-VL represents a significant advancement in multimodal AI, demonstrating strong capabilities in reasoning, OCR, diagram understanding, visual grounding, 3D spatial understanding, and video understanding. Its performance on public benchmarks and internal evaluations highlights its potential as a powerful and versatile multimodal model. However, there is still room for improvement, particularly in 3D spatial reasoning, hallucination mitigation, and complex combinatorial search. Addressing these challenges will be a focus of ongoing research, including efforts to unify existing model capabilities with image generation and to incorporate robust tool-use mechanisms.

The development of Seed1.5-VL builds upon substantial prior work within the AI research community. By detailing the model architecture, data synthesis pipeline, training methodology, training framework innovations, and internal evaluation design in this report, we hope to contribute to future progress in the field of multimodal AI.