GigaWorld-0: Building World Models to Drive Embodied AI Forward

Have you ever wondered how AI systems can learn to interact with the real world without needing endless hours of physical trials? That’s where world models come in—they act as virtual simulators that generate realistic data for training AI agents. Today, let’s talk about GigaWorld-0, a framework that’s designed specifically as a data engine for vision-language-action learning in embodied AI. It’s a unified system that combines video generation and 3D modeling to create high-quality, controllable data. I’ll walk you through what it is, how it works, and how you can get started with it, all based on the details from its documentation and research paper.

If you’re a graduate with a background in computer science or engineering, this should feel accessible. We’ll break down the technical parts step by step, using lists and tables where it makes sense, and I’ll address common questions as we go.

What Is GigaWorld-0 and Why Does It Matter?

You might be asking: What exactly is GigaWorld-0? It’s a world model framework that serves as a data engine for embodied AI, meaning it generates synthetic data to train AI systems that interact with physical environments, like robots. Embodied AI refers to agents that perceive, reason, and act in the real world, and data efficiency is a big challenge here because collecting real-world data is expensive and time-consuming.

GigaWorld-0 addresses this by integrating two main parts: GigaWorld-0-Video for creating diverse video sequences, and GigaWorld-0-3D for ensuring those sequences are geometrically consistent and physically realistic. Together, they produce data that’s visually rich, spatially accurate, and aligned with instructions—perfect for training vision-language-action (VLA) models.

Why is this important? Traditional methods rely on real-world data, which limits scalability. GigaWorld-0 flips that by synthesizing data whose appearance, camera viewpoints, and actions can all be controlled. According to the reported evaluations, VLA models trained on this data generalize to real robots and perform well without any real-world interaction during training.

The framework is open-source under Apache 2.0, with models available on Hugging Face, and it builds on efficient training tools to make large-scale generation feasible.

This image shows the architecture of GigaWorld-0-Video-Dreamer, the core video generation model.

Breaking Down the Architecture of GigaWorld-0

Let’s dive into how GigaWorld-0 is structured. It’s not just one model but a suite of components working together.

The Video Side: GigaWorld-0-Video

This part focuses on generating videos that are temporally coherent and controllable. It includes several models:

GigaWorld-0-Video-Dreamer: The foundation for image-text-to-video generation. It uses a Mixture-of-Experts (MoE) architecture with sparse attention to handle embodied scenes efficiently. Trained on large datasets of interaction videos, it produces sequences from text prompts and initial images.
GigaWorld-0-Video-AppearanceTransfer: Allows editing of textures, materials, and lighting in videos while keeping motions intact. Useful for augmenting data with diverse appearances.
GigaWorld-0-Video-ViewTransfer: Renders videos from specified camera viewpoints, adapting robot trajectories to maintain consistency.
GigaWorld-0-Video-MimicTransfer: Translates human demonstration videos into robot actions, enabling cross-embodiment data generation.

These models support multi-view generation, FP8 precision for faster training, and distillation for efficient inference.

You might wonder: How does the video generation process work mathematically? It uses flow-matching to model the generative process, where the latent states evolve over time based on conditions like text and images. The equation is:

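\[
\frac{d\mathbf{z}_{t}}{dt} = \mathbf{v}_{\theta}(\mathbf{z}_{t}, t, \mathbf{c})
\]

Here, \(\mathbf{z}_t\) is the latent at time \(t\), \(\mathbf{v}_{\theta}\) is the learned velocity field with parameters \(\theta\), and \(\mathbf{c}\) is the conditioning signal, which includes the text and image inputs.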
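To make the sampling side of this concrete, here is a minimal sketch of integrating such an ODE with plain Euler steps. It is an illustration under assumptions, not GigaWorld-0's sampler: `euler_sample` and `toy_velocity` are hypothetical names, the toy velocity stands in for the trained network, and the step count and the 0-to-1 time convention are my own choices.

```python
import numpy as np

def euler_sample(velocity, z0, cond, num_steps=50):
    """Integrate dz/dt = velocity(z, t, cond) from t=0 to t=1 with Euler steps."""
    z = np.asarray(z0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        z = z + dt * velocity(z, t, cond)
    return z

def toy_velocity(z, t, cond):
    # Hypothetical stand-in for the learned v_theta. It is the exact velocity
    # of the straight-line path z_t = (1 - t) * z_0 + t * cond, so the ODE
    # carries any starting noise to the conditioning target at t = 1.
    return (cond - z) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)             # z_0: random starting latent
target = np.array([1.0, -2.0, 0.5, 3.0])   # toy "condition" to reach
sample = euler_sample(toy_velocity, noise, target, num_steps=50)
print(np.round(sample, 4))
```

In a real model the velocity function would be a neural network evaluated on video latents, and the conditioning would be text and image embeddings rather than a target vector.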

The MoE layers sit in the feed-forward networks: a router sends each token to a small subset of experts, so only part of the model's parameters is active for any given input, which reduces compute.
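The routing idea can be sketched in a few lines of NumPy. This is a generic top-k MoE feed-forward layer written for illustration only; the function name `moe_ffn`, the ReLU activation, the softmax over the selected experts, and all shapes are assumptions rather than details of GigaWorld-0-Video-Dreamer.

```python
import numpy as np

def moe_ffn(x, router_w, experts_w1, experts_w2, top_k=2):
    """Top-k mixture-of-experts feed-forward layer (illustrative).

    x: (tokens, d), router_w: (d, E),
    experts_w1: (E, d, h), experts_w2: (E, h, d).
    """
    logits = x @ router_w                                   # (tokens, E)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]       # chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected experts only, so gates sum to 1 per token.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for e in range(router_w.shape[1]):
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert received no tokens, so it costs nothing
        hidden = np.maximum(x[token_ids] @ experts_w1[e], 0.0)  # ReLU
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ experts_w2[e])
    return out, top_idx

rng = np.random.default_rng(0)
tokens, d, h, num_experts = 5, 8, 16, 4
x = rng.standard_normal((tokens, d))
router_w = rng.standard_normal((d, num_experts))
w1 = rng.standard_normal((num_experts, d, h)) * 0.1
w2 = rng.standard_normal((num_experts, h, d)) * 0.1
out, chosen = moe_ffn(x, router_w, w1, w2, top_k=2)
print(out.shape, chosen.shape)
```

Each token only passes through `top_k` of the experts, which is why the number of activated parameters can be much smaller than the total parameter count.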

The 3D Side: GigaWorld-0-3D

To add geometric and physical realism, this component builds 3D scenes:

GigaWorld-0-3D-FG: Generates 3D assets for foreground objects using generative models.
GigaWorld-0-3D-BG: Reconstructs backgrounds with 3D Gaussian Splatting (3DGS) for high-fidelity environments.
GigaWorld-0-3D-Phys: Models object physics and performs differentiable system identification for the robotic arm.
GigaWorld-0-3D-Act: Computes executable arm motions that are physically consistent.

This ensures data respects constraints like gravity and collisions, making it suitable for robot training.

The full list of components and their functions is in this table:

Model Name	Function
GigaWorld-0-Video-Dreamer	Image-text-to-video foundation model for embodied scenes.
GigaWorld-0-Video-AppearanceTransfer	Text-guided appearance transfer, edits texture, material, lighting.
GigaWorld-0-Video-ViewTransfer	Renders videos from user-specified camera extrinsics.
GigaWorld-0-Video-MimicTransfer	Translates egocentric human demonstration to robot arm trajectories.
GigaWorld-0-3D-FG	Generates 3D assets of foreground manipulable objects.
GigaWorld-0-3D-BG	Reconstructs backgrounds via 3D Gaussian Splatting (3DGS).
GigaWorld-0-3D-Phys	Models object physics and performs differentiable system identification.
GigaWorld-0-3D-Act	Synthesizes executable, physically consistent arm motions.

By combining these, GigaWorld-0 creates data that’s not only visually appealing but also useful for tasks like manipulation.

This figure illustrates how GigaWorld-0 handles video generation, human-to-robot transfer, and 3D scene creation.

How to Install and Set Up GigaWorld-0

If you’re thinking about trying it out, installation is straightforward. It depends on three frameworks: GigaTrain for efficient training, GigaDatasets for data handling, and GigaModels for the models themselves.

Here’s a step-by-step guide:

Create a new conda environment:

conda create -n giga_world_0 python=3.11.10 -y
conda activate giga_world_0

Install the dependencies:

pip3 install giga-train
pip3 install giga-datasets
pip3 install natten

Clone and install GigaModels:

git clone https://github.com/open-gigaai/giga-models.git
cd giga-models
pip3 install -e .

Clone the GigaWorld-0 repository:

git clone git@github.com:open-gigaai/giga-world-0.git

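This sets up a fresh environment to avoid conflicts. The script commands in the following sections use relative paths such as scripts/train.py, so they are presumably meant to be run from the root of the cloned giga-world-0 repository.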

Preparing Data for Training

Before training, you need to organize your video data. Each video should have a corresponding text prompt.

Structure your raw data like this:

raw_data/
├── 0.mp4  # Video file 0
├── 0.txt  # Prompt for video file 0
├── 1.mp4  # Video file 1
├── 1.txt  # Prompt for video file 1
├── ...

Then, pack the data and extract embeddings:

python scripts/pack_data.py \
  --video-dir /path/to/raw_data/ \
  --save-dir /path/to/packed_data/

This prepares the data for efficient training.
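Before packing, it can help to confirm that every video really has a prompt. The following is a small sanity check written for this article, assuming only the raw_data layout shown above; `check_raw_data` is a hypothetical helper, not part of the GigaWorld-0 scripts, and the demo files are created in a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path

def check_raw_data(raw_dir):
    """Report videos without prompts, prompts without videos, and empty prompts."""
    raw_dir = Path(raw_dir)
    videos = {p.stem for p in raw_dir.glob("*.mp4")}
    prompts = {p.stem for p in raw_dir.glob("*.txt")}
    empty = sorted(
        stem for stem in videos & prompts
        if not (raw_dir / f"{stem}.txt").read_text(encoding="utf-8").strip()
    )
    return {
        "missing_prompt": sorted(videos - prompts),
        "missing_video": sorted(prompts - videos),
        "empty_prompt": empty,
    }

# Demo on a throwaway directory with one good pair and three problems.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "0.mp4").touch()
    (root / "0.txt").write_text("The robot arm picks up the cup.", encoding="utf-8")
    (root / "1.mp4").touch()                                   # no prompt file
    (root / "2.txt").write_text("Orphan prompt.", encoding="utf-8")  # no video
    (root / "3.mp4").touch()
    (root / "3.txt").write_text("  \n", encoding="utf-8")      # blank prompt
    report = check_raw_data(root)

print(report)
```

On your own data you would call `check_raw_data("/path/to/raw_data/")` and fix anything it lists before running the packing script.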

Training GigaWorld-0 Models

Training is where the magic happens. Use the provided config for the video model:

python scripts/train.py --config configs.giga_world_0_video.config

For LoRA training (a lighter way to fine-tune), set config.train_mode.train_mode='lora' and config.train_mode.lora_rank=64 in the config file, then run the same command.
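If LoRA is new to you, the sketch below shows the general idea behind a rank-64 adapter: the pretrained weight stays frozen and only two small matrices are trained. This is a generic NumPy illustration, not GigaTrain's implementation; the function name, the 512x512 layer size, and the scaling convention are all assumptions.

```python
import numpy as np

def lora_forward(x, w_frozen, lora_a, lora_b, alpha=None):
    """Linear layer with a low-rank update: y = x W^T + (alpha / r) * x A^T B^T."""
    rank = lora_a.shape[0]
    scale = (alpha if alpha is not None else rank) / rank
    return x @ w_frozen.T + scale * (x @ lora_a.T) @ lora_b.T

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 64                      # rank matches lora_rank=64
w_frozen = rng.standard_normal((d_out, d_in)) * 0.02  # pretrained, not updated
lora_a = rng.standard_normal((rank, d_in)) * 0.01     # trainable
lora_b = np.zeros((d_out, rank))                      # trainable, zero at start

x = rng.standard_normal((4, d_in))
y = lora_forward(x, w_frozen, lora_a, lora_b)

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(f"trainable: {lora_params} of {full_params} weights in this layer")
```

Because `lora_b` starts at zero, the adapted layer initially behaves exactly like the pretrained one, and training only has to learn the low-rank correction.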

GigaTrain uses FP8 precision and sparse attention to make this scalable, reducing memory and compute needs.

You might ask: How long does training take? It depends on your hardware, but the framework is optimized for efficiency, allowing large-scale runs.

Running Inference with GigaWorld-0

Once trained, inference generates videos. Here are examples:

Single GPU:

python scripts/inference.py \
  --data-path /path/to/packed_test_data/ \
  --save-dir /path/to/vis_results/ \
  --transformer-model-path /path/to/your_transformer/ \
  --text-encoder-model-path /path/to/giga_world_0_video/text_encoder/ \
  --vae-model-path /path/to/giga_world_0/vae/ \
  --gpu_ids 0

Multi-GPU:
Add more GPU IDs, like --gpu_ids 0 1 2 3 4 5 6 7.
With LoRA:
Include --lora-model-path /path/to/your_lora/.

Outputs are saved in the specified directory.
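If you launch inference from a Python script, for example in a sweep over checkpoints, the variants above can be assembled programmatically. The helper below is a convenience sketch of my own, not part of the repository; it only uses the flags shown in this section, and `build_inference_cmd` is a hypothetical name.

```python
import shlex

def build_inference_cmd(data_path, save_dir, transformer_path,
                        text_encoder_path, vae_path,
                        gpu_ids=(0,), lora_path=None):
    """Assemble the scripts/inference.py command line as an argument list."""
    cmd = [
        "python", "scripts/inference.py",
        "--data-path", data_path,
        "--save-dir", save_dir,
        "--transformer-model-path", transformer_path,
        "--text-encoder-model-path", text_encoder_path,
        "--vae-model-path", vae_path,
    ]
    if lora_path is not None:
        cmd += ["--lora-model-path", lora_path]
    # Multi-GPU runs simply list more IDs after --gpu_ids.
    cmd += ["--gpu_ids", *[str(g) for g in gpu_ids]]
    return cmd

cmd = build_inference_cmd(
    data_path="/path/to/packed_test_data/",
    save_dir="/path/to/vis_results/",
    transformer_path="/path/to/your_transformer/",
    text_encoder_path="/path/to/giga_world_0_video/text_encoder/",
    vae_path="/path/to/giga_world_0/vae/",
    gpu_ids=range(8),
    lora_path="/path/to/your_lora/",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the placeholder paths are replaced.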

To run inference with the released checkpoints rather than your own, first download them from Hugging Face:

GigaWorld-0-Video-Pretrain-2b: For base image-text-to-video.
GigaWorld-0-Video-GR1-2b: Fine-tuned on GR1 dataset.

Use:

python scripts/download.py --model-name video_pretrain --save-dir /path/to/giga_world_0_video_pretrain/
python scripts/download.py --model-name video_gr1 --save-dir /path/to/giga_world_0_video_gr1/

Evaluating Performance: Benchmarks and Visuals

How do we know GigaWorld-0 works well? According to the reported results on PBench and DreamGen Bench, it matches or beats much larger baselines on most metrics covering visual quality, physical plausibility, and instruction alignment.

On PBench (Robot Set):

Method	Param.	Semantics	Visual Quality	Temporal Consistency	Physical Plausibility	Multi-View Consistency	Overall Score
Cosmos-Predict2-14B	14B	97.5	97.5	47.2	94.2	85.1	82.07
Wan2.2-14B	14B	96.8	96.8	47.5	93.8	83.2	78.85
Wan2.2-5B	5B	95.4	95.0	46.7	92.7	80.1	77.15
Cosmos-Predict2.5-2B	2B	93.8	91.3	49.3	92.1	84.7	79.95
GigaWorld-0-Video-Dreamer	2B(Act.)	97.6	97.6	48.1	93.6	88.2	82.07

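GigaWorld-0-Video-Dreamer reaches the highest overall score in this table (82.07, level with Cosmos-Predict2-14B as listed) while activating only 2B parameters, and it leads on semantics, visual quality, and multi-view consistency.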

On DreamGen Bench (fine-tuned on GR1):

It excels in environment, object, and behavior metrics, showing strong generalization.

Visualizations confirm this. For example, GigaWorld-0-Video-Dreamer generates diverse trajectories from one frame, handling rigid and deformable objects.

Qualitative comparison of action inference

This shows predicted joint trajectories aligning with ground truth.

Other visuals include multi-view generation, appearance edits, view transfers, mimic transfers, and 3D scenes—all demonstrating consistency and realism.

Real-World Applications: Downstream Tasks

The true test is in embodied tasks. Data from GigaWorld-0 is used to train VLA models such as GigaBrain-0, which are reported to succeed on real robots at tasks like laundry folding, paper towel preparation, table bussing, juice making, and moving baskets, without real-world interaction during training.

Visuals from deployments:

GigaBrain-0 folding laundry on a G1 robot.

This highlights how synthetic data boosts robustness and success rates.

Related Work and Context

World models build on prior research in video generation (like HunyuanVideo, Cosmos) and in driving-oriented world models (DriveDreamer, GAIA). GigaWorld-0 advances this by focusing on embodied data with control over multiple dimensions.

Citation and Credits

If you’re using this in research, cite:

@misc{gigaai2025gigaworld0,
  title={GigaWorld-0: World Models as Data Engine to Empower Embodied AI},
  author={GigaAI},
  year={2025},
  eprint={2511.19861},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.19861},
}

The team includes Boyuan Wang, Chaojun Ni, and others.

FAQ

What is a world model in the context of embodied AI?

A world model is a simulator that predicts how the environment changes based on actions. In GigaWorld-0, it generates data for training AI agents to interact physically.

How does GigaWorld-0 differ from other video generation models?

It specializes in embodied scenes with controls for appearance, views, and actions, plus 3D integration for physics—making it more suitable for robotics than general models.

Can I fine-tune GigaWorld-0 on my own data?

Yes, use LoRA training as described. Pack your videos with prompts and adjust the config.

What hardware do I need for training?

It supports FP8 and sparse attention, so it’s efficient, but multi-GPU setups help for large scales.

How do I generate multi-view videos?

During inference, the models support it natively, as shown in visualizations.

Is GigaWorld-0 open-source?

Yes, code on GitHub, models on Hugging Face.

What datasets were used for pretraining?

It was trained on large embodied interaction corpora, fine-tuned on GR1 for specific tasks.

How does it handle physical realism?

Through GigaWorld-0-3D-Phys and -Act, which model dynamics and motions.

Can it transfer human demos to robots?

Yes, via GigaWorld-0-Video-MimicTransfer.

What’s the resolution of generated videos?

Up to 93x480x768 for the GR1-fine-tuned model, which reads as 93 frames at 480x768 pixels.
