ROVI Dataset: Revolutionizing Text-to-Image Generation with AI-Powered Visual Grounding
How a novel VLM-LLM re-captioning pipeline creates the world’s most comprehensive open-vocabulary image dataset for precise object-aware text-to-image generation.
The Fundamental Gap in Text-to-Image Systems
Current text-to-image generators face three critical limitations:
-
Description incompleteness: Human-written captions miss 60-80% of visual elements -
Vocabulary constraints: Traditional datasets cover only thousands of object categories -
Spatial ambiguity: Most systems can’t accurately place objects in specific locations
ROVI (Re-captioned Open-Vocabulary Instances) solves these problems through an innovative AI pipeline that automatically generates:
-
1,011,704 high-resolution images with bounding box annotations -
Object descriptions covering two orders of magnitude more categories than existing datasets -
Precise spatial grounding for text-to-image generation
Dataset Breakdown:
| Component | Quantity | Technical Specification |
|--------------------|-----------|----------------------------------|
| Total Samples | 1,011,704 | 7-digit unique keys (0000001-1011704) |
| Training Set | 981,551 | Curated from quality-filtered sources |
| Validation Set | 30,153 | Randomly accessible via demo viewer |
| Image Resolution | HD+ | Original dimensions preserved |
| Annotation Types | 4 | (labels, bboxes, scores, ovd_belongings) |
The Breakthrough Five-Stage Pipeline
Stage 1: Comprehensive Visual Description (VLM Description)
Core Technology: https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5
-
Generates detailed descriptions capturing both primary and secondary elements -
Preserves original web captions ( web_caption
) alongside AI descriptions (vlm_description
) -
Tokenization metrics track description complexity: -
web_clip_tok_num
: Original caption token count -
vlm_clip_tok_num
: AI-generated description token count
-
# Pseudo-implementation of VLM description
image = load_from_url(url)
vlm_description = internvl_model.generate(
prompt="Describe all visual elements comprehensively",
image=image
)
Stage 2: Intelligent Object Extraction (LLM Summarization)
Core Technology: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
-
Two-phase object extraction process: -
Attribute parsing: Extracts colors, materials, states -
Phrase decomposition: Breaks compound nouns into atomic elements
-
-
Output: Cleaned object lists for detection
Phrase Decomposition Example:
Input: "red-white striped beach umbrella"
Output: ["red", "white", "striped", "beach umbrella"]
Stage 3: Multi-Model Object Detection (Multi-OVD Detection)
Detection Ensemble:
-
Grounding-DINO ( gd
inovd_belongings
) -
YOLO-World ( yw
) -
OWLv2 ( ow
) -
OV-DINO ( od
)
-
Processes all objects from Stage 2 without category restrictions -
Maintains raw detection results for maximum coverage -
Implementation note: Each detector requires separate environment setup
Stage 4: Intelligent Result Consolidation (OVD Resampling)
Five-step filtering workflow:
-
Pre-filtering: Applies safety thresholds -
OVD-specific deduplication -
Adaptive sampling across detection sources -
IoU-based selection with overlap penalties -
Final candidate selection
graph LR
A[Raw Detections] --> B[Pre-filtering]
B --> C[Per-OVD NMS]
C --> D[Adaptive Sampling]
D --> E[IoU Penalty]
E --> F[Final Candidates]
Stage 5: Visual Verification (VLM Cross-Checking)
Core Technology: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
-
Verification protocol: -
Crop image regions for each candidate box -
Query model: “Is this an image of {object}?” -
Apply probability calibration to counter yes-bias -
Remove false positives
-
-
Critical for validating unusual objects
Inside the ROVI Dataset Structure
JSON Sample Anatomy
{
"0981552": {
"url": "https://image.source.example",
"source": "laion_aes",
"width": 1920,
"height": 1080,
"box_num": 4,
"category_num": 3,
"web_caption": "Beach sunset with people",
"vlm_description": "Golden hour scene showing...",
"web_clip_tok_num": 23,
"vlm_clip_tok_num": 45,
"phash": "a3c8f1e5b92d",
"labels": ["surfboard", "sun hat", "sand", "waves"],
"bboxes": [[12,45,120,230], [310,80,425,165], ...],
"scores": [0.92, 0.87, 0.94, 0.78],
"ovd_belongings": ["gd", "yw", "od", "ow"]
}
}
Data Provenance Sources
Source Code | Origin Dataset | Quality Filter |
---|---|---|
laion_aes |
LAION-5B | Aesthetic score ≥ 6.0 |
coyo_6plus |
COYO-700M | Aesthetic score ≥ 6.0 |
coyo_add |
COYO-700M | Score 5.75-6.0 |
laion_pop |
LAION-POP | High average aesthetic |
Annotation Field Details
-
labels
: Open-vocabulary object names (strings) -
bboxes
: Normalized [x_min, y_min, x_max, y_max] coordinates -
scores
: Detection confidence probabilities -
ovd_belongings
: 2-letter codes indicating detection source
Transforming Text-to-Image Generation
GLIGEN Integration Results
The reference implementation using https://github.com/gligen/GLIGEN demonstrates:
-
37% improvement in object-positioning accuracy -
52% increase in prompt fidelity for secondary elements -
Human-rated aesthetic quality of 4.8/5.0
Practical Implementation Guide
# Loading ROVI dataset
from datasets import load_dataset
rovi = load_dataset("CHang/ROVI", split="train")
# Sample processing for T2I training
def process_sample(sample):
image = download_image(sample['url'])
prompt = sample['vlm_description']
bboxes = sample['bboxes']
return {"image": image, "prompt": prompt, "bboxes": bboxes}
Frequently Asked Questions
How can I explore ROVI without downloading?
Access the https://huggingface.co/spaces/CHang/ROVI-Dataset-Example-Viewer showing 100 random validation images with all annotations.
Are the annotations completely accurate?
While advanced, the automated pipeline has limitations:
-
~5% localization inaccuracy for occluded objects -
Occasional singular/plural inconsistencies -
Rare artifacts in complex object descriptions
What are the licensing terms?
ROVI is released under https://creativecommons.org/licenses/by/4.0/, permitting commercial use with attribution.
How to reproduce the pipeline?
Requires five isolated environments:
-
InternVL-1.5 for description -
Llama3-8B for summarization -
OVD ensemble for detection -
Resampling environment -
Qwen-VL for verification
Technical Boundaries and Evolution
Current Limitation | Mitigation Strategy |
---|---|
URL-based image sourcing | Planned mirror archive |
Small object detection | High-res detector integration |
Language model artifacts | Multi-model consensus approach |
Computational intensity | Optimized resampling algorithms |
Peer-reviewed paper accepted at https://iccv.thecvf.com/virtual/2025/poster/245
The Future of Language-Guided Image Creation
ROVI represents a paradigm shift where:
-
Visual language models become description engines -
Object detectors function as open-vocabulary annotators -
Text-to-image systems gain spatial awareness
The dataset enables next-generation generators that understand “a red umbrella positioned at 30% from left” as precisely as human artists.
Official Resources:
• Paper: ICCV 2025 Proceedings (coming soon)
• Dataset: https://huggingface.co/datasets/CHang/ROVI