GigaWorld-0: Building World Models to Drive Embodied AI Forward
Have you ever wondered how AI systems can learn to interact with the real world without needing endless hours of physical trials? That’s where world models come in—they act as virtual simulators that generate realistic data for training AI agents. Today, let’s talk about GigaWorld-0, a framework that’s designed specifically as a data engine for vision-language-action learning in embodied AI. It’s a unified system that combines video generation and 3D modeling to create high-quality, controllable data. I’ll walk you through what it is, how it works, and how you can get started with it, all based on the details from its documentation and research paper.
If you’re a graduate with a background in computer science or engineering, this should feel accessible. We’ll break down the technical parts step by step, using lists and tables where it makes sense, and I’ll address common questions as we go.
What Is GigaWorld-0 and Why Does It Matter?
You might be asking: What exactly is GigaWorld-0? It’s a world model framework that serves as a data engine for embodied AI, meaning it generates synthetic data to train AI systems that interact with physical environments, like robots. Embodied AI refers to agents that perceive, reason, and act in the real world, and data efficiency is a big challenge here because collecting real-world data is expensive and time-consuming.
GigaWorld-0 addresses this by integrating two main parts: GigaWorld-0-Video for creating diverse video sequences, and GigaWorld-0-3D for ensuring those sequences are geometrically consistent and physically realistic. Together, they produce data that’s visually rich, spatially accurate, and aligned with instructions—perfect for training vision-language-action (VLA) models.
Why is this important? Traditional methods rely on real-world data, which limits scalability. GigaWorld-0 flips that by synthesizing data that’s controllable in aspects like appearance, camera views, and actions. This leads to better generalization in real robots, as shown in evaluations where VLA models trained on its data perform well without any real-world interactions during training.
The framework is open-source under Apache 2.0, with models available on Hugging Face, and it builds on efficient training tools to make large-scale generation feasible.

Figure: Architecture of GigaWorld-0-Video-Dreamer, the core video generation model.
Breaking Down the Architecture of GigaWorld-0
Let’s dive into how GigaWorld-0 is structured. It’s not just one model but a suite of components working together.
The Video Side: GigaWorld-0-Video
This part focuses on generating videos that are temporally coherent and controllable. It includes several models:
- GigaWorld-0-Video-Dreamer: The foundation for image-text-to-video generation. It uses a Mixture-of-Experts (MoE) architecture with sparse attention to handle embodied scenes efficiently. Trained on large datasets of interaction videos, it produces sequences from text prompts and initial images.
- GigaWorld-0-Video-AppearanceTransfer: Allows editing of textures, materials, and lighting in videos while keeping motions intact. Useful for augmenting data with diverse appearances.
- GigaWorld-0-Video-ViewTransfer: Renders videos from specified camera viewpoints, adapting robot trajectories to maintain consistency.
- GigaWorld-0-Video-MimicTransfer: Translates human demonstration videos into robot actions, enabling cross-embodiment data generation.
These models support multi-view generation, FP8 precision for faster training, and distillation for efficient inference.
You might wonder: How does the video generation process work mathematically? It uses flow-matching to model the generative process, where the latent states evolve over time based on conditions like text and images. The equation is:
\[\frac{d\mathbf{z}_{t}}{dt} = \mathbf{v}_{\theta}(\mathbf{z}_{t}, t, \mathbf{c})\]
Here, \(\mathbf{z}_t\) is the latent at time \(t\), and \(\mathbf{c}\) includes text and image inputs.
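To make the equation concrete, here is a minimal sketch of how such an ODE can be integrated with plain Euler steps. Everything in it is an assumption for illustration: `toy_velocity` is a hand-written stand-in for the learned network, the convention of noise at t = 0 and data at t = 1 is my choice, and the condition `c` simply doubles as the target latent, which is not how a real model consumes text and image inputs.

```python
import random


def toy_velocity(z, t, c):
    """Stand-in for v_theta: the exact velocity of the straight path
    z_t = (1 - t) * z_0 + t * c, which equals (c - z_t) / (1 - t)."""
    return [(ci - zi) / (1.0 - t) for zi, ci in zip(z, c)]


def euler_sample(z0, c, num_steps=10, velocity=toy_velocity):
    """Integrate dz/dt = v(z, t, c) from t = 0 to t = 1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # stays below 1, so the toy field never divides by zero
        v = velocity(z, t, c)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


random.seed(0)
z0 = [random.gauss(0.0, 1.0) for _ in range(4)]  # initial noise latent
c = [0.5, -1.0, 2.0, 0.0]                        # toy condition, also the target here
z1 = euler_sample(z0, c)
print(z1)  # numerically very close to c
```

In a real sampler the velocity would come from the trained transformer, and the number of steps trades speed against quality, which is where distillation for efficient inference comes in.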
The MoE in the feed-forward networks routes inputs to experts efficiently, reducing compute needs.
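If MoE routing is new to you, the sketch below shows the general idea with top-k gating for a single token. It is a generic illustration, not the GigaWorld-0 implementation: the source does not say which routing scheme is used, so top-k selection, the renormalised gate weights, and the toy scaling "experts" are all assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs.

    x: token features (list of floats)
    router_logits: one score per expert (a real router computes these from x)
    experts: list of callables mapping a feature list to a feature list
    """
    probs = softmax(router_logits)
    # Indices of the top_k highest-probability experts.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise the gate weights over the chosen experts only.
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)  # only the chosen experts are evaluated
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, chosen


# Four toy "experts", each just scaling its input by a different factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_forward([1.0, -1.0], router_logits=[0.1, 2.0, 0.3, 2.0], experts=experts)
print(used, output)
```

With four experts and `top_k=2`, only half of the expert networks run for this token, which is the source of the compute savings.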
The 3D Side: GigaWorld-0-3D
To add geometric and physical realism, this component builds 3D scenes:
- GigaWorld-0-3D-FG: Generates 3D assets for foreground objects using generative models.
- GigaWorld-0-3D-BG: Reconstructs backgrounds with 3D Gaussian Splatting (3DGS) for high-fidelity environments.
- GigaWorld-0-3D-Phys: Models object physics and performs differentiable system identification for the robotic arm.
- GigaWorld-0-3D-Act: Computes executable arm motions that are physically consistent.
This ensures data respects constraints like gravity and collisions, making it suitable for robot training.
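To give a feel for what "differentiable system identification" means, here is a toy sketch: recover an unknown physical parameter by running gradient descent on the mismatch between simulated and observed trajectories. The one-dimensional damped system, the parameter name `damping`, and the optimiser settings are all hypothetical; the actual GigaWorld-0-3D-Phys procedure for a robotic arm is far richer and is not described in this article.

```python
def simulate(damping, x0=1.0, steps=10):
    """Toy dynamics: x_{t+1} = damping * x_t. Returns [x_1, ..., x_steps]."""
    xs, x = [], x0
    for _ in range(steps):
        x = damping * x
        xs.append(x)
    return xs


def loss_and_grad(damping, observed, x0=1.0):
    """Squared trajectory error and its analytic derivative w.r.t. damping.

    Since x_t = damping**t * x0, d x_t / d damping = t * damping**(t - 1) * x0.
    """
    loss, grad = 0.0, 0.0
    for t, obs in enumerate(observed, start=1):
        pred = damping ** t * x0
        err = pred - obs
        loss += err * err
        grad += 2.0 * err * t * damping ** (t - 1) * x0
    return loss, grad


def identify_damping(observed, init=0.5, lr=0.002, iters=2000):
    """Plain gradient descent on the trajectory loss."""
    damping = init
    for _ in range(iters):
        _, grad = loss_and_grad(damping, observed)
        damping -= lr * grad
    return damping


observed = simulate(damping=0.9)       # pretend these came from a real system
estimate = identify_damping(observed)
print(round(estimate, 4))
```

The key point is that the simulator is differentiable with respect to its parameters, so the fit can use gradients instead of trial-and-error search.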
The full list of components and their functions is in this table:
| Model Name | Function |
|---|---|
| GigaWorld-0-Video-Dreamer | Image-text-to-video foundation model for embodied scenes. |
| GigaWorld-0-Video-AppearanceTransfer | Text-guided appearance transfer, edits texture, material, lighting. |
| GigaWorld-0-Video-ViewTransfer | Renders videos from user-specified camera extrinsics. |
| GigaWorld-0-Video-MimicTransfer | Translates egocentric human demonstration to robot arm trajectories. |
| GigaWorld-0-3D-FG | Generates 3D assets of foreground manipulable objects. |
| GigaWorld-0-3D-BG | Reconstructs backgrounds via 3D Gaussian Splatting (3DGS). |
| GigaWorld-0-3D-Phys | Models object physics and performs differentiable system identification. |
| GigaWorld-0-3D-Act | Synthesizes executable, physically consistent arm motions. |
By combining these, GigaWorld-0 creates data that’s not only visually appealing but also useful for tasks like manipulation.

Figure: How GigaWorld-0 handles video generation, human-to-robot transfer, and 3D scene creation.
How to Install and Set Up GigaWorld-0
If you’re thinking about trying it out, installation is straightforward. It depends on three frameworks: GigaTrain for efficient training, GigaDatasets for data handling, and GigaModels for the models themselves.
Here’s a step-by-step guide:
- Create a new conda environment:
  conda create -n giga_world_0 python=3.11.10 -y
  conda activate giga_world_0
- Install the dependencies:
  pip3 install giga-train
  pip3 install giga-datasets
  pip3 install natten
- Clone and install GigaModels:
  git clone https://github.com/open-gigaai/giga-models.git
  cd giga-models
  pip3 install -e .
- Clone the GigaWorld-0 repository:
  git clone git@github.com:open-gigaai/giga-world-0.git
This sets up a fresh environment to avoid conflicts.
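Once the installs finish, a quick sanity check can save you a confusing error later. The sketch below uses only the Python standard library and looks up the distribution names from the pip commands above; the helper name `installed_versions` is mine, and GigaModels is left out because its distribution name is not stated here.

```python
from importlib import metadata


def installed_versions(dist_names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in dist_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip commands in the install steps.
for name, version in installed_versions(["giga-train", "giga-datasets", "natten"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated `giga_world_0` environment; any line reading NOT INSTALLED points at the step to repeat.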
Preparing Data for Training
Before training, you need to organize your video data. Each video should have a corresponding text prompt.
Structure your raw data like this:
raw_data/
├── 0.mp4 # Video file 0
├── 0.txt # Prompt for video file 0
├── 1.mp4 # Video file 1
├── 1.txt # Prompt for video file 1
├── ...
Then, pack the data and extract embeddings:
python scripts/pack_data.py \
--video-dir /path/to/raw_data/ \
--save-dir /path/to/packed_data/
This prepares the data for efficient training.
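Because every video needs a matching prompt, it can be worth checking the raw folder before you pack it. This is a small helper of my own (the name `find_unpaired` is hypothetical, not part of the repository) that assumes the flat `N.mp4` / `N.txt` layout shown above.

```python
import tempfile
from pathlib import Path


def find_unpaired(raw_dir):
    """Return (videos_without_prompt, prompts_without_video) as sorted stem lists."""
    raw_dir = Path(raw_dir)
    video_stems = {p.stem for p in raw_dir.glob("*.mp4")}
    prompt_stems = {p.stem for p in raw_dir.glob("*.txt")}
    return sorted(video_stems - prompt_stems), sorted(prompt_stems - video_stems)


# Demo on a throwaway directory: video 1 has no prompt file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0.mp4", "0.txt", "1.mp4"):
        (Path(tmp) / name).touch()
    print(find_unpaired(tmp))  # (['1'], [])
```

Point it at your own `/path/to/raw_data/`; two empty lists mean every video and prompt is paired.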
Training GigaWorld-0 Models
Training is where the magic happens. Use the provided config for the video model:
python scripts/train.py --config configs.giga_world_0_video.config
For LoRA training (a lighter way to fine-tune), set config.train_mode.train_mode='lora' and config.train_mode.lora_rank=64 in the config file, then run the same command.
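Why is LoRA lighter? Instead of updating a full weight matrix, it learns two thin matrices whose product is added to the frozen weight. The sketch below just counts parameters for a single linear layer; the hidden size of 2048 is a made-up number for illustration, not taken from the GigaWorld-0 config, and only the rank of 64 comes from the setting above.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA update of one linear layer.

    LoRA freezes the d_out x d_in weight W and learns W + B @ A, where
    B is d_out x rank and A is rank x d_in.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# 2048 is a hypothetical hidden size, not taken from the GigaWorld-0 config.
full, lora = lora_param_counts(d_in=2048, d_out=2048, rank=64)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 4,194,304  lora: 262,144  ratio: 6.25%
```

For this toy layer, rank 64 trains about one sixteenth of the parameters, which is why LoRA fits on smaller hardware.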
GigaTrain uses FP8 precision and sparse attention to make this scalable, reducing memory and compute needs.
You might ask: How long does training take? It depends on your hardware, but the framework is optimized for efficiency, allowing large-scale runs.
Running Inference with GigaWorld-0
Once trained, inference generates videos. Here are examples:
- Single GPU:
  python scripts/inference.py \
    --data-path /path/to/packed_test_data/ \
    --save-dir /path/to/vis_results/ \
    --transformer-model-path /path/to/your_transformer/ \
    --text-encoder-model-path /path/to/giga_world_0_video/text_encoder/ \
    --vae-model-path /path/to/giga_world_0/vae/ \
    --gpu_ids 0
- Multi-GPU: Add more GPU IDs, like --gpu_ids 0 1 2 3 4 5 6 7.
- With LoRA: Include --lora-model-path /path/to/your_lora/.
Outputs are saved in the specified directory.
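If you launch inference from a Python script, it helps to assemble the command in one place. The sketch below only builds the argument list from the flags shown above and does not run anything; the function name `build_inference_cmd` and its parameter names are my own.

```python
import shlex


def build_inference_cmd(data_path, save_dir, transformer_path, text_encoder_path,
                        vae_path, gpu_ids=(0,), lora_path=None):
    """Assemble the argv list for scripts/inference.py using the documented flags."""
    cmd = [
        "python", "scripts/inference.py",
        "--data-path", data_path,
        "--save-dir", save_dir,
        "--transformer-model-path", transformer_path,
        "--text-encoder-model-path", text_encoder_path,
        "--vae-model-path", vae_path,
        "--gpu_ids", *[str(g) for g in gpu_ids],
    ]
    if lora_path is not None:
        # Optional LoRA weights, as in the "With LoRA" variant.
        cmd += ["--lora-model-path", lora_path]
    return cmd


cmd = build_inference_cmd(
    "/path/to/packed_test_data/", "/path/to/vis_results/",
    "/path/to/your_transformer/", "/path/to/giga_world_0_video/text_encoder/",
    "/path/to/giga_world_0/vae/", gpu_ids=(0, 1),
)
print(shlex.join(cmd))
```

To actually launch it, you could pass `cmd` to `subprocess.run(cmd, check=True)` from the repository root.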
Before running inference, download the pretrained models from Hugging Face:
- GigaWorld-0-Video-Pretrain-2b: For base image-text-to-video.
- GigaWorld-0-Video-GR1-2b: Fine-tuned on GR1 dataset.
Use:
python scripts/download.py --model-name video_pretrain --save-dir /path/to/giga_world_0_video_pretrain/
python scripts/download.py --model-name video_gr1 --save-dir /path/to/giga_world_0_video_gr1/
Evaluating Performance: Benchmarks and Visuals
How do we know GigaWorld-0 works well? On benchmarks like PBench and DreamGen Bench, it is reported to match or outperform much larger models in visual quality, physical plausibility, and instruction alignment.
On PBench (Robot Set):
| Method | Param. | Semantics | Visual Quality | Temporal Consistency | Physical Plausibility | Multi-View Consistency | Overall Score |
|---|---|---|---|---|---|---|---|
| Cosmos-Predict2-14B | 14B | 97.5 | 97.5 | 47.2 | 94.2 | 85.1 | 82.07 |
| Wan2.2-14B | 14B | 96.8 | 96.8 | 47.5 | 93.8 | 83.2 | 78.85 |
| Wan2.2-5B | 5B | 95.4 | 95.0 | 46.7 | 92.7 | 80.1 | 77.15 |
| Cosmos-Predict2.5-2B | 2B | 93.8 | 91.3 | 49.3 | 92.1 | 84.7 | 79.95 |
| GigaWorld-0-Video-Dreamer | 2B(Act.) | 97.6 | 97.6 | 48.1 | 93.6 | 88.2 | 82.07 |
GigaWorld-0-Video-Dreamer reaches the top overall score in this table (82.07, level with Cosmos-Predict2-14B) while activating only 2B parameters.
On DreamGen Bench (fine-tuned on GR1):
It excels in environment, object, and behavior metrics, showing strong generalization.
Visualizations confirm this. For example, GigaWorld-0-Video-Dreamer generates diverse trajectories from one frame, handling rigid and deformable objects.

Figure: Predicted joint trajectories aligning with ground truth.
Other visuals include multi-view generation, appearance edits, view transfers, mimic transfers, and 3D scenes—all demonstrating consistency and realism.
Real-World Applications: Downstream Tasks
The true test is in embodied tasks. GigaWorld-0 data trains VLA models like GigaBrain-0, which succeed on real robots at tasks like laundry folding, paper towel preparation, table bussing, juice making, and moving baskets, without real-world interaction during training.
Visuals from deployments:

Figure: GigaBrain-0 folding laundry on a G1 robot.
This highlights how synthetic data boosts robustness and success rates.
Related Work and Context
World models build on prior research in video generation (like HunyuanVideo, Cosmos) and robotics (DriveDreamer, GAIA). GigaWorld-0 advances this by focusing on embodied data with control over multiple dimensions.
Citation and Credits
If you’re using this in research, cite:
@misc{gigaai2025gigaworld0,
title={GigaWorld-0: World Models as Data Engine to Empower Embodied AI},
author={GigaAI},
year={2025},
eprint={2511.19861},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.19861},
}
The team includes Boyuan Wang, Chaojun Ni, and others.
FAQ
What is a world model in the context of embodied AI?
A world model is a simulator that predicts how the environment changes based on actions. In GigaWorld-0, it generates data for training AI agents to interact physically.
How does GigaWorld-0 differ from other video generation models?
It specializes in embodied scenes with controls for appearance, views, and actions, plus 3D integration for physics—making it more suitable for robotics than general models.
Can I fine-tune GigaWorld-0 on my own data?
Yes, use LoRA training as described. Pack your videos with prompts and adjust the config.
What hardware do I need for training?
It supports FP8 and sparse attention, so it’s efficient, but multi-GPU setups help for large scales.
How do I generate multi-view videos?
During inference, the models support it natively, as shown in visualizations.
Is GigaWorld-0 open-source?
Yes. The code is on GitHub under the Apache 2.0 license, and the models are on Hugging Face.
What datasets were used for pretraining?
It was trained on large embodied interaction corpora, fine-tuned on GR1 for specific tasks.
How does it handle physical realism?
Through GigaWorld-0-3D-Phys and -Act, which model dynamics and motions.
Can it transfer human demos to robots?
Yes, via GigaWorld-0-Video-MimicTransfer.
What’s the resolution of generated videos?
Up to 93x480x768 (frames x height x width) for GR1-fine-tuned models.

