Hybrid 3D-4D Gaussian Splatting: A New Paradigm for Dynamic Scene Reconstruction

Introduction

Accurate representation and rendering of dynamic 3D scenes are critical for applications like virtual reality, augmented reality, sports broadcasting, and film production. However, achieving high-fidelity, computationally efficient, and temporally coherent modeling of dynamic scenes remains challenging. Recent advances in neural rendering, particularly Neural Radiance Fields (NeRF), have shown promise in novel view synthesis and 3D scene reconstruction. Yet, they struggle with real-time rendering of complex dynamic scenes due to computational costs.

The Emergence of 3D and 4D Gaussian Splatting

3D Gaussian Splatting (3DGS) has emerged as a promising alternative to NeRF-based methods. Unlike NeRF, which relies on implicit representations and computationally expensive volumetric rendering, 3DGS represents scenes as collections of Gaussian primitives and leverages fast rasterization. Several extensions have been proposed to adapt 3DGS for dynamic 3D scene reconstruction, incorporating motion modeling and temporal consistency. There are two primary paradigms for applying 3DGS to dynamic 3D capture.
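
To make the 3DGS primitive concrete, the sketch below lists the per-Gaussian parameters a typical 3DGS scene stores and how the anisotropic covariance is assembled for splatting. It is a minimal illustration rather than the authors' implementation; the class and field names are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    mean: np.ndarray       # (3,) center position in world space
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z) giving orientation
    scale: np.ndarray      # (3,) per-axis standard deviations
    opacity: float         # scalar alpha used during compositing
    sh_coeffs: np.ndarray  # spherical-harmonic coefficients for view-dependent color

    def covariance(self) -> np.ndarray:
        """Assemble Sigma = R S S^T R^T, the 3x3 covariance that is projected to screen space."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T
```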

The first approach extends 3D Gaussians to dynamic scenes by tracking Gaussians over time. It uses techniques like multi-layer perceptrons, temporal residuals, or interpolation functions. These methods leverage temporal redundancy across frames to improve representation efficiency and accelerate training. However, they often struggle with fast-moving objects. The second paradigm directly optimizes 4D Gaussians, representing the entire spatiotemporal volume as a set of splatted 4D Gaussians. While this approach enables high-quality reconstructions, it incurs significant memory and computational overhead.
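
To illustrate the second paradigm, the sketch below shows how a single spatiotemporal (4D) Gaussian, queried at a time t, yields a 3D Gaussian whose center shifts with t and whose opacity is attenuated by the marginal temporal density. This is standard Gaussian conditioning, offered only as a hedged illustration of the 4DGS idea; the function and variable names are not taken from any particular implementation.

```python
import numpy as np

def slice_4d_gaussian(mean4: np.ndarray, cov4: np.ndarray, opacity: float, t: float):
    """mean4: (4,) = (x, y, z, t); cov4: (4, 4) full spatiotemporal covariance."""
    mu_xyz, mu_t = mean4[:3], mean4[3]
    cov_xx = cov4[:3, :3]   # spatial block
    cov_xt = cov4[:3, 3]    # space-time coupling (drives motion over time)
    var_t = cov4[3, 3]      # temporal extent of the primitive

    # Conditional 3D Gaussian at time t (standard Gaussian conditioning).
    mean3 = mu_xyz + cov_xt * (t - mu_t) / var_t
    cov3 = cov_xx - np.outer(cov_xt, cov_xt) / var_t

    # Marginal temporal density attenuates opacity away from mu_t.
    alpha_t = opacity * np.exp(-0.5 * (t - mu_t) ** 2 / var_t)
    return mean3, cov3, alpha_t
```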

The Innovation of Hybrid 3D-4D Gaussian Splatting

To address the inefficiencies of conventional 4DGS pipelines, we propose hybrid 3D-4D Gaussian Splatting (3D-4DGS), a novel framework that dynamically classifies Gaussians as either static (3D) or dynamic (4D), enabling an adaptive strategy that optimizes both storage and computation.

Our method begins by modeling all Gaussians as 4D and then adaptively identifying those with minimal temporal variation across the sequence. These Gaussians are classified as static and converted into a purely 3D representation by discarding the time dimension, effectively freezing their position, rotation, and color parameters. Meanwhile, fully dynamic Gaussians retain their 4D nature to capture complex motion. Importantly, this classification is performed iteratively at each densification stage, progressively refining the regions that truly require 4D modeling. The final rendering pipeline seamlessly integrates both 3D and 4D Gaussians, projecting them into screen space for alpha compositing.
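
The sketch below outlines this classification step under the criterion described above and in the implementation details: a 4D Gaussian whose temporal scale exceeds the threshold τ effectively spans the whole sequence (minimal temporal variation) and is frozen as a 3D Gaussian. The helper names (gaussians_4d, temporal_scale, to_static_3d) are hypothetical; only the thresholding logic follows the description in the text.

```python
def split_static_dynamic(gaussians_4d, tau):
    """Partition 4D Gaussians into frozen 3D (static) and retained 4D (dynamic) sets."""
    static_3d, dynamic_4d = [], []
    for g in gaussians_4d:
        if g.temporal_scale > tau:
            # Effectively constant over time: drop the time dimension and
            # freeze position, rotation, and color.
            static_3d.append(g.to_static_3d())
        else:
            # Genuinely moving or appearing/disappearing content keeps its
            # full 4D parameterization.
            dynamic_4d.append(g)
    return static_3d, dynamic_4d
```

Because this split is re-evaluated at each densification stage, a region retains 4D parameters only for as long as it actually exhibits motion, and both sets are projected and alpha-composited together in the final rendering pass.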

Experimental Validation and Results

Datasets

Our experiments are conducted on two standard, challenging datasets: Neural 3D Video (N3V) and Technicolor. N3V primarily consists of 10-second multi-view videos (plus one 40-second long sequence), while Technicolor features 16-camera light field captures of short but complex scenes.

Implementation Details

We initialize our 4D Gaussian representation using dense COLMAP reconstructions for the N3V dataset and start from a sparse COLMAP reconstruction for Technicolor. We adopt the densification pipeline from 3D Gaussian Splatting, progressively increasing the number of Gaussians by cloning and splitting operations. Unlike prior works, we do not perform periodic opacity resets during training. We set the temporal scale threshold τ to 3 for the 10-second N3V sequences and 6 for the 40-second sequence, while using a threshold of 1 for Technicolor.
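
For reference, the snippet below collects these per-dataset settings in one place. The dictionary and key names are purely illustrative assumptions; only the numeric thresholds, initialization choices, and the absence of opacity resets come from the description above.

```python
# Illustrative per-dataset settings (key names are assumptions; values from the text).
TRAIN_CONFIG = {
    "n3v_10s":     {"init": "dense_colmap",  "temporal_scale_tau": 3, "opacity_reset": False},
    "n3v_40s":     {"init": "dense_colmap",  "temporal_scale_tau": 6, "opacity_reset": False},
    "technicolor": {"init": "sparse_colmap", "temporal_scale_tau": 1, "opacity_reset": False},
}
```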

Results

Our method consistently achieves competitive or superior PSNR and SSIM scores while significantly reducing training times compared to state-of-the-art baselines. On the N3V dataset, our approach achieves an average PSNR of 32.25 dB, outperforming recent methods in both fidelity and rendering speed. For the 40-second clip from the N3V dataset, our method achieves the second-best PSNR (29.2 dB) and the lowest LPIPS (0.1173), demonstrating strong perceptual quality. On the Technicolor dataset, our model achieves 33.22 dB PSNR and 0.911 SSIM, with only 29 minutes of training time on an RTX 3090.

Conclusion

We have presented a novel hybrid 3D-4D Gaussian Splatting framework for dynamic scene reconstruction. By distinguishing static regions and selectively assigning 4D parameters only to dynamic elements, our method substantially reduces redundancy while preserving high-fidelity motion cues. Extensive experiments on the N3V and Technicolor datasets demonstrate that our approach consistently achieves competitive or superior quality and faster training compared to state-of-the-art baselines.

Limitations and Future Work

First, our heuristic scale thresholding could be refined, potentially using learning-based or data-driven methods. Second, a specialized 4D densification strategy could further reduce redundancy and optimize memory usage. Such approaches may lead to even higher reconstruction quality and more efficient training.

References

  1. Seungjun Oh, Younggeun Lee, Hyejin Jeon, et al. Hybrid 3D-4D Gaussian Splatting for Fast Dynamic Scene Representation. arXiv:2505.13215v1 [cs.CV], 19 May 2025.