WorldGrow: A Revolutionary Framework for Generating Infinite 3D Worlds

Introduction: Why Do We Need Infinite 3D Worlds?

Why is infinite 3D world generation technology so crucial, and what fundamental challenges do existing methods face?

In fields like video games, virtual reality, film production, and autonomous driving simulation, constructing large-scale, continuous, and content-rich 3D environments has always been a significant challenge. Traditional methods either rely on manual modeling, which is time-consuming and labor-intensive, or use existing generation techniques that often underperform in scalability and consistency. More importantly, with the development of embodied AI and world models, we need infinitely expandable virtual environments where AI agents can learn, navigate, and plan without boundaries.

Mainstream existing methods have clear shortcomings. Approaches based on 2D diffusion models generate multi-view images and then “lift” them to 3D, but they lack understanding of the overall 3D structure, leading to geometric errors and appearance inconsistencies. Another category of methods directly predicts 3D representations, like triplanes or UDFs, but they struggle to scale to large scenes due to limitations in the scale and quality of scene-level datasets. Meanwhile, recent 3D foundation models, while capable of generating high-quality single objects, cannot handle continuous scene-level generation.

The Birth of WorldGrow: Rethinking 3D Scene Generation

Core Question: How to achieve truly infinitely expandable 3D scenes?

WorldGrow’s answer is to decompose the infinite world into manageable 3D blocks and achieve coherent large-scale scene generation through an intelligent block synthesis and growth mechanism.

The WorldGrow framework is based on a simple yet powerful insight: instead of generating an entire infinite world at once, adopt a modular approach by decomposing the scene into standardized 3D blocks, then gradually expanding through a context-aware generation mechanism. This method, for the first time, enables the generation of theoretically infinitely expandable 3D scenes while maintaining geometric consistency and visual realism.

Author’s Reflection: During development, we deeply realized that successfully migrating object-level generation priors to the scene level hinges on rethinking how 3D representations encode spatial relationships. This is not just a scaling issue but a fundamental shift in semantic understanding—from isolated objects to interconnected environments.

Deconstructing WorldGrow’s Core Components

Data Curation: The Foundation of High-Quality Training Data

How does one prepare suitable training data for infinite 3D scene generation?

Data is the lifeblood of any AI system, and it’s particularly critical for 3D scene generation. The primary challenge WorldGrow faced was that existing 3D datasets, like Objaverse-XL, are primarily object-centric, containing isolated assets rather than continuous spatial environments.

The Scene Slicing Strategy is central to WorldGrow’s data preparation. We start from complete 3D scenes (such as houses or cities) and extract coherent, reusable blocks through a systematic segmentation process. Specifically, we import the scene mesh into Blender, place cubes within its bounding box, and extract content via Boolean intersection. To ensure spatial density, we render a top-down view and compute the occupancy rate of each extracted cube; if less than 95% of the view is covered by visible content, the cube is repositioned and re-evaluated.
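
As a rough illustration of this occupancy-driven sampling loop (not the authors’ released pipeline code), the sketch below assumes two hypothetical helpers: extract_block_via_boolean standing in for the Blender Boolean intersection, and render_top_down for the occupancy render.

```python
import numpy as np

OCCUPANCY_THRESHOLD = 0.95  # occupancy threshold described above

def sample_block(scene_bbox_min, scene_bbox_max, block_size,
                 extract_block_via_boolean, render_top_down,
                 max_tries=50, rng=None):
    """Occupancy-checked block extraction (hedged sketch with hypothetical helpers)."""
    rng = rng or np.random.default_rng()
    lo = np.asarray(scene_bbox_min, dtype=float)
    hi = np.asarray(scene_bbox_max, dtype=float) - block_size
    for _ in range(max_tries):
        # Place the cube at a random position inside the scene bounding box.
        origin = lo + rng.random(3) * (hi - lo)
        block = extract_block_via_boolean(origin, block_size)  # Blender Boolean intersection
        # Occupancy = fraction of top-down pixels covered by scene content.
        alpha = render_top_down(block)            # (H, W) coverage map in [0, 1]
        if float((alpha > 0).mean()) >= OCCUPANCY_THRESHOLD:
            return block
    return None  # no sufficiently dense block found within the retry budget
```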

The Coarse-to-Fine Data Strategy addresses a fundamental trade-off in block design. Larger 3D blocks capture broader scene context, benefiting global layout learning, but may sacrifice rendering fidelity; smaller blocks support finer visual quality but lack sufficient spatial context to learn coherent scene structures. To this end, we prepared two distinct datasets: coarse blocks and fine blocks. Coarse blocks have four times the area in the XY plane while maintaining the same height, thus capturing larger spatial volumes and richer contextual information.

Practical Application Scenario: Imagine you are generating a city environment for an open-world game. Using coarse blocks, you can quickly establish street layouts and district divisions; then use fine blocks to populate detailed building facades, road textures, and streetscape details. This hierarchical approach ensures both macro-planning rationality and micro-visual quality.

Scene-Friendly SLAT: Evolving Representations from Object to Scene

How can object-level 3D representations be adapted for scene-level generation?

WorldGrow builds upon TRELLIS’s Structured LATents (SLAT) representation but introduces key improvements to make it more suitable for scene-level generation. SLAT effectively encodes geometric structure and appearance by representing 3D objects as a combination of sparse voxel grids and DINOv2 features.
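
Conceptually, a structured latent of this kind pairs a set of occupied voxel coordinates with per-voxel feature vectors. The minimal sketch below is an assumed layout for illustration only, not TRELLIS’s actual class definition.

```python
from dataclasses import dataclass
import torch

@dataclass
class StructuredLatent:
    """Assumed minimal layout of a SLAT-style representation."""
    coords: torch.Tensor   # (N, 3) integer indices of occupied voxels in the block grid
    feats: torch.Tensor    # (N, C) per-voxel latent features (e.g., aggregated DINOv2 features)
    resolution: int        # voxel grid resolution of the block
```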

However, the original SLAT faced two main challenges in scene generation. First, direct feature aggregation performed poorly in cluttered scenes. At the object level, self-occlusion is rare, and projective feature aggregation works well; but at the scene level, this approach leads to artifacts like color bleeding between adjacent surfaces. Second, the decoder, pretrained on object data, lacked the capability to handle detailed 3D content near scene block boundaries, often producing floaters and artifacts.

WorldGrow’s solution is to introduce Occlusion-Aware Feature Aggregation. When computing sparse voxel features, each voxel center is projected onto multiple camera views, where we compute binary visibility masks using depth testing. The occlusion-aware feature is then computed by averaging DINOv2 features only from the views where the voxel is actually visible, ensuring each voxel only receives features from its observable views and preventing feature contamination across occluded surfaces.
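
A minimal PyTorch sketch of this idea follows. The tensor shapes, the depth-test tolerance tol, and the pinhole projection convention are assumptions for illustration, not the paper’s exact implementation.

```python
import torch

def occlusion_aware_aggregate(voxel_centers, feats, depths, K, w2c, tol=0.01):
    """Average per-view features only over views where each voxel is visible.
    voxel_centers: (N, 3) world-space voxel centers
    feats:  (V, C, H, W) per-view feature maps (e.g., upsampled DINOv2 features)
    depths: (V, H, W) depth maps rendered from the scene geometry
    K: (V, 3, 3) intrinsics; w2c: (V, 4, 4) world-to-camera extrinsics."""
    V, C, H, W = feats.shape
    N = voxel_centers.shape[0]
    homog = torch.cat([voxel_centers, torch.ones(N, 1, device=voxel_centers.device)], dim=-1)

    accum = torch.zeros(N, C, device=feats.device)
    counts = torch.zeros(N, 1, device=feats.device)
    for v in range(V):
        cam = (w2c[v] @ homog.T).T[:, :3]          # camera-space coordinates
        z = cam[:, 2]
        proj = (K[v] @ cam.T).T                    # pinhole projection
        u = proj[:, 0] / proj[:, 2]
        vv = proj[:, 1] / proj[:, 2]
        in_view = (z > 0) & (u >= 0) & (u < W) & (vv >= 0) & (vv < H)
        ui = u.clamp(0, W - 1).long()
        vi = vv.clamp(0, H - 1).long()
        # Binary visibility: the voxel's depth must match the rendered depth map.
        visible = in_view & ((z - depths[v, vi, ui]).abs() < tol)
        f = feats[v, :, vi, ui].T                  # (N, C) sampled features
        accum[visible] += f[visible]
        counts[visible] += 1
    # Voxels seen by no view keep a zero feature (count clamped to avoid div by zero).
    return accum / counts.clamp(min=1)
```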

Simultaneously, we retrained the decoder on scene block data, shifting its focus from isolated objects to structured scene content. This adaptation enables the decoder to better handle boundary regions, producing cleaner geometry and more coherent textures at block edges.

Author’s Reflection: We initially underestimated the importance of occlusion handling in scene generation. In early experiments, ignoring visibility tests caused wall textures to “bleed” onto furniture, destroying the scene’s realism. This lesson emphasized that 3D scene generation is not merely an extension of 2.5D tasks but requires a completely different approach to spatial reasoning.

3D Block Inpainting: Enabling Seamless Scene Expansion

How to ensure newly generated scene blocks seamlessly connect with the existing environment?

The core challenge of scene expansion is ensuring new content remains consistent with the existing environment in terms of geometry, style, and texture. WorldGrow formulates this as a 3D block inpainting task, where missing target blocks are synthesized based on their surrounding spatial neighbors.

WorldGrow’s inpainting framework operates in two stages: structure and latent space. Given a partially observed block with missing regions, the model first predicts the 3D structure, then reconstructs the corresponding latent features for high-fidelity appearance synthesis.

The key innovation lies in modifying the model’s input layer. Instead of using noisy latents as input, three components are concatenated along the channel dimension: the noisy latents, a binary mask indicating the inpainting region, and the masked known region itself. This design allows the model to condition its prediction on both the known context and explicit spatial cues of the missing area.
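
In code, this conditioning amounts to a simple channel-wise concatenation before the generator’s first layer. The sketch below uses dense tensors for readability, whereas the actual model operates on sparse voxel latents; the shapes are assumptions.

```python
import torch

def build_inpainting_input(noisy_latent, inpaint_mask, known_latent):
    """Assemble the inpainting model's input (hedged sketch).
    noisy_latent: (B, C, D, H, W) latent currently being denoised
    inpaint_mask: (B, 1, D, H, W) binary mask, 1 where content must be generated
    known_latent: (B, C, D, H, W) clean latent of the observed context."""
    masked_known = known_latent * (1.0 - inpaint_mask)   # zero out the unknown region
    return torch.cat([noisy_latent, inpaint_mask, masked_known], dim=1)
```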

The training process randomly selects one splitting position along each of the X and Y axes to divide each scene block into four quadrants, keeping one as context and masking the remaining three. For structure inpainting, a voxel-level binary mask is defined, where a value of 1 indicates voxels to be inpainted. For latent inpainting, a sparse mask is defined to guide the latent generator in reconstructing the corresponding features.
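
The corresponding training-time masking can be sketched as follows; the valid range for the split positions is an assumption, since the text only states that they are sampled randomly.

```python
import torch

def sample_quadrant_mask(res_x, res_y):
    """Split a block into four quadrants at random X/Y positions, keep one
    quadrant as context (mask = 0) and mark the other three for inpainting (mask = 1)."""
    sx = int(torch.randint(res_x // 4, 3 * res_x // 4, (1,)))   # assumed valid range
    sy = int(torch.randint(res_y // 4, 3 * res_y // 4, (1,)))
    mask = torch.ones(res_x, res_y)
    keep = int(torch.randint(0, 4, (1,)))                        # which quadrant stays known
    xs = slice(0, sx) if keep in (0, 1) else slice(sx, res_x)
    ys = slice(0, sy) if keep in (0, 2) else slice(sy, res_y)
    mask[xs, ys] = 0.0
    return mask
```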

Practical Use Case: Suppose you have generated a living room scene and now want to expand an adjacent dining room. The inpainting model considers the nearby areas of the living room (including floor continuity, wall style, and lighting conditions) to generate a dining space that is stylistically and layout-wise coherent, ensuring a smooth transition between the two areas rather than creating a completely disjointed new space.

Coarse-to-Fine Generation: Balancing Global Layout and Local Detail

How to ensure richness of local details while maintaining global structural plausibility?

One of the most innovative aspects of WorldGrow is its coarse-to-fine generation strategy, which explicitly separates layout reasoning from detail generation, with each stage operating at a different semantic level.

The Block Expansion Process starts from a seed block, and the scene is progressively expanded in the XY plane through iterative 3D block inpainting. For each new block, the inpainting model takes the previously generated left, top, and (if available) top-left blocks as context. To ensure continuity, we reuse a portion of these existing blocks: specifically, a 3/8w margin is reused from each adjacent block along the X and Y axes. Based on this context, the central 5/8w×5/8w region is inpainted to complete a new 12/8w×12/8w block. This overlapping design ensures smooth transitions across block boundaries and provides a consistent context window for each expansion step.
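
The expansion itself can be sketched as a simple scan over the block grid. In the sketch below, inpaint_block is a hypothetical wrapper around the structure and latent inpainting models, and the margin reuse described above is assumed to happen inside it.

```python
def grow_world(seed_block, n_rows, n_cols, inpaint_block):
    """Row-by-row block expansion (hedged sketch).
    inpaint_block(context) synthesizes a new block conditioned on the
    overlapping margins of its already-generated neighbors."""
    blocks = {(0, 0): seed_block}
    for i in range(n_rows):
        for j in range(n_cols):
            if (i, j) in blocks:
                continue
            context = {
                "left": blocks.get((i, j - 1)),
                "top": blocks.get((i - 1, j)),
                "top_left": blocks.get((i - 1, j - 1)),
            }
            blocks[(i, j)] = inpaint_block(context)
    return blocks
```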

Coarse Structure Generation uses the coarse structure model to establish the scene’s large-scale layout. This produces a low-resolution but spatially coherent structure defining the world’s overall geometry.

Fine Structure Refinement enriches local geometry through the fine structure generator. We upsample the coarse structure via trilinear interpolation to match the fine stage’s resolution, then partition it into standard fine blocks. We adopt a structure-guided denoising approach: for each upsampled fine block, we encode it into an initial latent, then perturb this latent with controlled Gaussian noise. The fine generator denoises it to reconstruct the refined structure. This strategy enhances details while preserving spatial distribution priors.
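
A minimal sketch of this structure-guided refinement for a single block is shown below. It assumes a 2× resolution gap between stages, a rectified-flow-style perturbation, and hypothetical encode / denoise_from interfaces for the fine generator.

```python
import torch
import torch.nn.functional as F

def refine_block(coarse_occ, encode, denoise_from, t_start=0.6):
    """Structure-guided refinement of one fine block (hedged sketch).
    coarse_occ: (1, 1, D, H, W) coarse occupancy grid for this region
    encode: maps an occupancy grid to an initial structure latent (assumed interface)
    denoise_from: runs the fine generator's reverse process from noise level t_start (assumed)."""
    # Upsample the coarse structure to the fine stage's resolution.
    fine_init = F.interpolate(coarse_occ, scale_factor=2, mode="trilinear", align_corners=False)
    z0 = encode(fine_init)
    # Perturb with controlled Gaussian noise: enough to regenerate local detail,
    # little enough to preserve the coarse layout prior.
    z_t = (1.0 - t_start) * z0 + t_start * torch.randn_like(z0)
    return denoise_from(z_t, t_start)
```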

SLAT-Based Appearance Generation completes the pipeline: once the world’s fine-level structure is in place, we generate the corresponding SLATs. This stage follows the same block-by-block generation strategy as used for structure but operates in the latent space. After all latent blocks are generated, the full SLAT is decoded by our retrained decoder into a renderable 3D world.

Author’s Reflection: The coarse-to-fine approach initially seemed to add complexity, but it actually significantly improved generation quality. We found that without coarse guidance, the fine model often produced locally plausible but globally incoherent layouts—for example, doors opening onto walls without connecting to other rooms. This hierarchical approach mirrors how humans design environments: plan the overall layout first, then refine the details.

Experimental Validation: Performance and Effectiveness Assessment

How does WorldGrow perform in actual tests, and what advantages does it offer compared to existing methods?

We evaluated WorldGrow on the large-scale 3D-FRONT dataset, comprising 3,425 curated houses with reasonable layouts and detailed furnishings. From these, we generated 120k fine blocks and 38k coarse blocks. We also validated WorldGrow’s adaptability on the UrbanScene3D dataset.

The evaluation covers two aspects: scene block generation and full scene synthesis. For block generation, we assess both geometric and visual quality. We report three standard distribution-based metrics (MMD, COV, and 1-NNA) computed with both Chamfer Distance (CD) and Earth Mover’s Distance (EMD), along with a Fréchet distance (FID) computed on PointNet++ features to assess 3D geometric quality. For visual quality, we render generated blocks from multiple fixed viewpoints and compute perceptual metrics, including CLIP score and FID variants with different feature extractors (Inception V3, DINOv2, and CLIP).
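
For reference, the distribution metrics follow the definitions standard in point-cloud generation work. A compact sketch of MMD and COV under Chamfer Distance, computed on point clouds sampled from generated and reference blocks, might look like this:

```python
import torch

def chamfer(a, b):
    """Symmetric Chamfer Distance between point clouds a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def mmd_cov(generated, reference):
    """MMD: average distance from each reference sample to its closest generated sample.
    COV: fraction of reference samples that are the nearest neighbor of some generated sample."""
    d = torch.stack([torch.stack([chamfer(g, r) for r in reference]) for g in generated])  # (G, R)
    mmd = d.min(dim=0).values.mean().item()
    cov = d.argmin(dim=1).unique().numel() / len(reference)
    return mmd, cov
```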

For full-scene synthesis, where ground-truth meshes are unavailable, we conducted a human preference study with 91 participants comparing 5 methods across 10 scenes (4 house-level, 6 unbounded), evaluating structural plausibility, geometric detail, appearance fidelity, and scene continuity.

Quantitative Results: Outperforming Existing Methods

In scene block generation, WorldGrow achieves state-of-the-art performance on geometric metrics. Specifically, it achieves 0.97 on MMD (CD) (lower is better), 51.82% on COV (CD) (higher is better), 66.30% on 1-NNA (CD) (lower is better), and 7.52 on FID (lower is better). These results demonstrate superior performance in block connectivity and structural coherence.

In visual fidelity, WorldGrow significantly outperforms all baselines, achieving a CLIP score of 0.843 (higher is better), FID-Inception of 29.87 (lower is better), FID-DINOv2 of 313.54, and FID-CLIP of 3.95. This indicates WorldGrow’s ability to generate high-quality scene blocks with realistic appearance.

Scene Examples Generated by WorldGrow
Figure: 5×5 and 9×9 block layouts generated by WorldGrow, demonstrating its ability to create diverse, coherent scenes.

Qualitative Results and Human Evaluation

The human preference study confirmed WorldGrow’s advantages in practical scene generation. In textured scenes, WorldGrow outperformed baselines in structural plausibility (4.48/5), geometric detail (4.44/5), and appearance fidelity (4.33/5). In unbounded scenes, WorldGrow scored highest in continuity (4.69/5), proving the effectiveness of our block-by-block expansion and coarse-to-fine generation strategy.

Large-Scale Scene Generation
Figure: Large-scale scene generated by WorldGrow, spanning 19×39 blocks (~1,800 m²), demonstrating scalability and consistency over large extents.

Expansion Stability Test

We conducted an expansion stability experiment to quantitatively assess long-run generation quality and error accumulation. We synthesized large 7×7 block scenes and randomly sampled 1×1 blocks exclusively from the outer regions (beyond the initial 3×3 region) for evaluation. WorldGrow maintained consistent generation quality even at distant expansions, achieving scores comparable to the main evaluation, while SynCity showed significant performance degradation (FID increased from 34.69 to 51.97). Notably, SynCity failed in 70% of expansion attempts, and only its successful cases are included in the reported scores.

Application Scenarios: Using WorldGrow in Practice

Indoor Scene Generation

How is WorldGrow applied to actual indoor environment generation?

WorldGrow excels at indoor scene generation, capable of creating diverse and plausible room layouts. For example, when generating a residential environment, the system can start from a seed bedroom and gradually expand to include connected living rooms, kitchens, and bathrooms, ensuring proper connections for doorways, corridors, and open spaces.

Practical Case: In one demonstration, WorldGrow started from a 4×4 meter seed bedroom and progressively generated a complete apartment layout including a living room, kitchen, bathroom, and connecting corridor. The entire scene covered over 1,800 square meters and contained dozens of rooms, all consistent in geometry and appearance. An embodied agent could seamlessly navigate the entire environment, demonstrating the structural plausibility and traversability of the generated scene.

Outdoor City Generation

Our experiments on the UrbanScene3D dataset show that WorldGrow can be adapted for outdoor environment generation. Despite limited training data (only 10k fine and 3k coarse blocks), WorldGrow demonstrated potential in generating coherent urban streetscapes, including building layouts, road networks, and streetscape elements.

Outdoor Scene Generation
Figure: Infinite outdoor 3D scenes generated by WorldGrow, including urban streetscapes with plausible layouts and suburban neighborhoods with consistent styles.

Embodied AI and Navigation

Environments generated by WorldGrow are particularly suitable for embodied AI tasks like navigation and planning. The generated scenes are not only visually realistic but also structurally plausible, with traversable spaces and consistent geometry. This makes them ideal testbeds for training and evaluating AI agents’ abilities to navigate, plan, and interact in complex environments.

Practical Application Scenario: In autonomous driving simulation, WorldGrow can generate infinite urban scenes for training and testing autonomous driving systems. Unlike traditional methods using predefined finite maps, WorldGrow can dynamically expand the environment, providing AI agents with ever-changing challenges, thereby improving their generalization capability.

Limitations and Future Work

Current Limitations

What are the current limitations of WorldGrow?

Although WorldGrow has achieved significant results, several limitations remain. Currently, our method only expands scenes in the XY plane, while vertical expansion along the Z-axis—essential for multi-story buildings—is an important direction for future work.

Generation quality and diversity are also bounded by the limitations of current 3D datasets in terms of scale, variety, and semantic annotations. Our block-wise design trades off fine geometric details for computational feasibility, prioritizing infinite generation capability over local detail resolution.

Furthermore, although WorldGrow naturally supports conditional control, the current implementation focuses on unconditional generation without semantic conditioning.

Future Directions

These limitations present clear opportunities for future research. Multi-level generation strategies could enable vertical expansion for complete buildings. Larger-scale dataset curation—particularly for outdoor environments—would enhance both diversity and quality.

Introducing LLM-generated captions could enable fine-grained semantic control over room types and layouts. Moreover, integrating WorldGrow into geometry-appearance unified generation models could lead to more efficient pipelines.

Author’s Reflection: The process of developing WorldGrow made us realize that advances in 3D scene generation require not only better algorithms but also richer, more diverse datasets. The limitations of current 3D datasets proved more critical than we initially anticipated. Going forward, we believe the community’s efforts on high-quality 3D scene data will be at least as important as algorithmic innovation, if not more so.

Conclusion

WorldGrow represents a significant step forward in addressing the long-standing challenge of achieving infinite scalability, coherent layouts, and realistic appearances in 3D scene generation. By combining pre-trained 3D priors with novel block inpainting and coarse-to-fine refinement strategies, our framework overcomes the fundamental scalability and coherence issues that constrained previous methods.

Comprehensive evaluation demonstrates state-of-the-art performance in geometry reconstruction and visual fidelity, while uniquely supporting the generation of large-scale scenes that maintain both local detail and global consistency. As virtual worlds become increasingly important for embodied AI training and simulation, WorldGrow provides a practical path toward scalable, high-quality 3D content generation for future world models.

Practical Summary / Action Checklist

For practitioners looking to apply WorldGrow or similar technologies, here are the key takeaways:

  1. Data Preparation is Key: Invest in high-quality, diverse 3D scene data and adopt a systematic block extraction strategy.
  2. Adopt a Hierarchical Approach: Decompose scene generation into layout planning and detail refinement stages to balance global coherence and local quality.
  3. Handle Occlusion: Implement occlusion-aware mechanisms in feature aggregation to prevent artifacts in scene generation.
  4. Design Overlap Regions: Include overlapping regions in block expansion to ensure seamless transitions and spatial continuity.
  5. Validate Expansion Stability: Test the generation system’s performance over long expansions to ensure quality doesn’t degrade with distance.
  6. Consider Specific Application Needs: Adapt block size and generation parameters for your specific use case (indoor, outdoor, navigation, etc.).

One-Page Overview

WorldGrow Core Points:

  • Problem: Infinite 3D scene generation with coherent layouts and realistic appearance.
  • Solution: Block-based generation, progressively expanded via context-aware inpainting.
  • Key Innovations:
    • Scene-friendly SLAT with occlusion-aware feature aggregation.
    • 3D block inpainting mechanism.
    • Coarse-to-fine generation strategy.
  • Performance: Outperforms existing methods on geometric and visual metrics; supports large-scale scene generation.
  • Applications: Game development, VR/AR, autonomous driving simulation, embodied AI training.
  • Limitations: Currently limited to horizontal expansion; dependent on quality of existing 3D datasets.
  • Future Directions: Vertical expansion, semantic control, larger-scale datasets.

Frequently Asked Questions (FAQ)

How does WorldGrow handle continuity between blocks?
WorldGrow uses overlapping regions and a context-aware inpainting mechanism. When generating each new block, it considers part of the content from adjacent blocks (3/8-width margin) as context, ensuring smooth transitions in geometry and appearance.

What are the advantages of WorldGrow compared to 2D-based methods?
Unlike methods that rely on lifting 2D images to 3D, WorldGrow directly works with 3D representations, avoiding multi-view inconsistency issues and maintaining consistent geometry and appearance from all viewpoints.

What computational resources does WorldGrow require?
On a single A100 GPU, generating one block takes about 20 seconds. A complete 10×10 indoor scene (~272 m²) can be generated in 30 minutes, with a peak memory usage of 13GB.

Can WorldGrow generate outdoor environments?
Yes, experiments on the UrbanScene3D dataset show that WorldGrow can be adapted to generate coherent urban streetscapes, including buildings, roads, and streetscape elements.

Does WorldGrow support semantic control?
The current implementation focuses on unconditional generation, but the framework naturally supports conditional control. Future work plans to integrate LLM-generated captions for fine-grained semantic control over room types and layouts.

How scalable is WorldGrow?
Tests show that WorldGrow can generate very large-scale scenes over 1800 m² (19×39 blocks), maintaining consistent generation quality even at distant expansions, without significant quality degradation or seam accumulation.

What are the main limitations of WorldGrow?
Main limitations include currently supporting only horizontal (XY-plane) expansion, dependence on the quality and diversity of existing 3D datasets, and the trade-off between detail resolution and computational feasibility.

How would WorldGrow benefit from future 3D datasets?
Larger-scale, more diverse 3D scene datasets would significantly improve generation quality and diversity, especially for outdoor environments and complex architectural structures.