OneStory: Redefining Multi-Shot Video Generation with Adaptive Memory
Abstract
OneStory addresses the critical challenge of maintaining narrative coherence across discontinuous video shots by introducing an adaptive memory system. The framework reaches 58.74% character consistency and supports minute-scale video generation through next-shot prediction and dynamic context compression. By reformulating multi-shot generation as an autoregressive task, it bridges the gap between single-scene video models and complex storytelling requirements.
What is Multi-Shot Video Generation?
Imagine watching a movie where scenes seamlessly transition between different locations and characters. Traditional AI video generators struggle with this “multi-shot” structure—sequences of non-contiguous clips that must maintain visual and narrative consistency. The core difficulty lies in preserving entities like characters and environments across time gaps while handling viewpoint changes and scene evolutions.
Key Challenges:
- Entity Persistence: When a character exits frame, they must reappear identically later.
- Environmental Continuity: Street layouts or background elements must remain consistent.
- Temporal Reasoning: Models must understand causal relationships (e.g., opening a door leading to a new scene).
A 2025 study found traditional models often replaced female characters with strangers after six shots, highlighting memory loss issues.
Why Existing Methods Fail
Current approaches fall into three categories, each with inherent flaws:
| Method Type | Flaw |
|---|---|
| Fixed-Window Attention | Discards older frames when sliding windows forward, causing memory decay |
| Single Keyframe Control | Uses one image per shot, limiting cross-shot information flow |
| Edit-and-Extend | Relies on last-frame transformations, failing for complex motions |
The table above shows how these methods underperform in real-world scenarios. For instance, a test case revealed baseline models achieved only 51% accuracy in reidentifying characters after scene cuts.
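To see why the fixed-window approach loses track of early shots, consider the toy sketch below. The window size and naming are purely illustrative and not taken from any of the baselines above; the point is only that frames which fall out of the sliding window are discarded, so later shots cannot attend to them.

```python
# Toy illustration of the fixed-window failure mode: once the sliding window
# moves past a shot, its frames are simply dropped, so a character introduced
# in shot 1 leaves no trace when shot 7 or 8 is generated.
from collections import deque

WINDOW = 4  # number of most recent frames kept as context (illustrative value)

context = deque(maxlen=WINDOW)
for shot_id in range(1, 8):
    context.append(f"frame_of_shot_{shot_id}")
    print(f"generating shot {shot_id + 1} with context: {list(context)}")

# By the final iteration, frames from shots 1-3 are gone - the model retains
# no memory of the characters or scenery they contained.
```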
OneStory’s Three Core Innovations
1. Adaptive Memory System
- Smart Frame Selection: From prior shots, selects the frames with the highest semantic relevance (42% better than random sampling).
- Dynamic Compression: Prioritizes key elements (people at 1/4 resolution, backgrounds at 1/8) using importance-guided patchification (see the sketch after this list).
- Staged Training: Starts with synthetic data to stabilize the architecture before real-world fine-tuning, boosting narrative consistency by 27%.
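The importance-guided patchification idea can be illustrated with a minimal sketch. Only the 1/4-resolution-for-people vs. 1/8-for-backgrounds split comes from the description above; the function name, the pooling choices, and the token layout are assumptions, not OneStory's actual implementation.

```python
# Minimal sketch of importance-guided compression of one memory frame.
import torch
import torch.nn.functional as F

def compress_memory_frame(frame: torch.Tensor,
                          person_mask: torch.Tensor,
                          fine_patch: int = 4,
                          coarse_patch: int = 8) -> torch.Tensor:
    """Turn one memory frame into a compact token set.

    frame:       (C, H, W) feature map from a prior shot
    person_mask: (H, W) binary mask marking character regions
    Returns an (N, C) token matrix: fine tokens for people, coarse for background.
    """
    # Fine tokens: pool character regions with a small patch size (1/4 resolution).
    fine = F.avg_pool2d(frame.unsqueeze(0), fine_patch).squeeze(0)            # (C, H/4, W/4)
    fine_mask = F.max_pool2d(person_mask[None, None].float(), fine_patch)     # keep any patch touching a person
    fine_tokens = fine.flatten(1).T[fine_mask.flatten().bool()]               # (N_fine, C)

    # Coarse tokens: pool the whole frame at 1/8 resolution for background context.
    coarse = F.avg_pool2d(frame.unsqueeze(0), coarse_patch).squeeze(0)        # (C, H/8, W/8)
    coarse_tokens = coarse.flatten(1).T                                       # (N_coarse, C)

    return torch.cat([fine_tokens, coarse_tokens], dim=0)

# Example: a 256x256 feature map with a character occupying the left half.
frame = torch.randn(16, 256, 256)
mask = torch.zeros(256, 256)
mask[:, :128] = 1
print(compress_memory_frame(frame, mask).shape)  # (2048 person + 1024 background tokens, 16)
```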
2. Dual-Phase Context Injection
- Textual Alignment: A CLIP model matches the current shot description against historical frames.
- Visual Matching: DINOv2 analyzes visual features to finalize the top-K relevant frames.
This dual process ensures both textual and visual coherence: on character-reappearance tasks, accuracy jumped from 51% to 93%.
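A minimal sketch of that two-phase retrieval is shown below. It assumes precomputed CLIP and DINOv2 embeddings; the equal-weight score blend, the choice of visual anchor, and all function names are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of two-stage context-frame selection (CLIP text phase + DINOv2 visual phase).
import torch
import torch.nn.functional as F

def select_context_frames(prompt_emb: torch.Tensor,      # (D,)  CLIP text embedding of the next-shot prompt
                          clip_frame_embs: torch.Tensor, # (N, D) CLIP image embeddings of historical frames
                          dino_frame_embs: torch.Tensor, # (N, D2) DINOv2 features of the same frames
                          dino_anchor: torch.Tensor,     # (D2,) DINOv2 feature of a reference frame (e.g., the last frame)
                          top_k: int = 4) -> torch.Tensor:
    """Return indices of the top-K historical frames most relevant to the next shot."""
    # Phase 1: textual alignment - cosine similarity between prompt and frame CLIP embeddings.
    text_score = F.cosine_similarity(prompt_emb[None], clip_frame_embs, dim=-1)      # (N,)
    # Phase 2: visual matching - cosine similarity in DINOv2 feature space.
    visual_score = F.cosine_similarity(dino_anchor[None], dino_frame_embs, dim=-1)   # (N,)
    # Equal weighting of the two scores is an assumption.
    score = 0.5 * text_score + 0.5 * visual_score
    return score.topk(min(top_k, score.numel())).indices

# Toy usage with random embeddings standing in for real CLIP / DINOv2 outputs.
idx = select_context_frames(torch.randn(512), torch.randn(20, 512),
                            torch.randn(20, 768), torch.randn(768), top_k=4)
print(idx)
```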
3. Unified Training Architecture
- Triplet Standardization: Converts variable-length sequences into a three-shot format (e.g., by inserting synthetic frames).
- Mixed Augmentation: Uses 52% cross-video interpolation and 48% first-frame transformations for training-data diversity.
- Hybrid Loss Function: Combines diffusion loss (80%) with memory selection loss (20%) for balanced optimization (see the loss sketch after this list).
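The hybrid objective can be written compactly as below. Only the 80/20 weighting comes from the description above; the concrete form of each term (MSE on predicted noise for the diffusion part, cross-entropy over frame-selection logits for the memory part) is an illustrative assumption.

```python
# Sketch of the 80/20 hybrid objective under assumed loss forms.
import torch
import torch.nn.functional as F

def hybrid_loss(pred_noise, true_noise, selection_logits, target_frame_idx,
                w_diff: float = 0.8, w_mem: float = 0.2) -> torch.Tensor:
    """Weighted sum of a diffusion term and a memory-selection term."""
    diffusion_loss = F.mse_loss(pred_noise, true_noise)                 # denoising quality
    memory_loss = F.cross_entropy(selection_logits, target_frame_idx)   # did the model pick the right memory frame?
    return w_diff * diffusion_loss + w_mem * memory_loss

# Toy example: batch of 2 latents, 16 candidate memory frames per sample.
loss = hybrid_loss(torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8),
                   torch.randn(2, 16), torch.tensor([3, 7]))
print(float(loss))
```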
Performance Benchmarks
Tested against industry standards, OneStory demonstrates superior performance:
| Metric | OneStory | Baseline A | Baseline B |
|---|---|---|---|
| Character Consistency | 58.74% | 54.54% | 56.33% |
| Environmental Coherence | 93.87% | 90.87% | 92.89% |
| Semantic Alignment | 0.5752 | 0.5526 | 0.5657 |
| Motion Control | 46.98% | 37.46% | 42.31% |
Real-World Applications:
- Character Evolution: Maintains facial identity while updating clothing styles.
- Zoom Shots: Accurately tracks target objects during close-ups.
- Composite Scenarios: Merges separate scenes into cohesive narratives.
Practical Development Guide
Evaluating Quality
Use these metrics for systematic assessment:
- Entity Tracking: YOLOv5 segmentation matching rates for repeated characters.
- Background Fidelity: DINOv2 cosine similarity between consecutive shots (see the sketch after this list).
- Semantic Matching: ViCLIP vector-angle comparison between text prompts and generated videos.
- Motion Fluency: Optical flow analysis of frame transitions.
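As an example, the background-fidelity check above reduces to a few lines once per-shot features are available. The feature-extraction step is abstracted away here; in practice `shot_features` would be pooled DINOv2 embeddings, and averaging over consecutive pairs is an assumption about the aggregation.

```python
# Background fidelity as mean cosine similarity between consecutive shots.
import torch
import torch.nn.functional as F

def background_fidelity(shot_features: torch.Tensor) -> torch.Tensor:
    """shot_features: (S, D) one pooled feature vector per shot (e.g., from DINOv2).
    Returns the mean cosine similarity between consecutive shots."""
    sims = F.cosine_similarity(shot_features[:-1], shot_features[1:], dim=-1)  # (S-1,)
    return sims.mean()

# Toy run with 6 shots of 768-dimensional features.
print(float(background_fidelity(torch.randn(6, 768))))
```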
Common Pitfalls
- Perspective Shocks: Sudden camera movements can disrupt depth perception.
- Occlusion Issues: Partially hidden objects may trigger identity swaps.
- Lighting Variations: Radical lighting changes affect background recognition.
Future Directions
Researchers plan three enhancement paths:
- Spatiotemporal Joint Modeling: Incorporating optical flow for smoother motion transitions.
- Multimodal Expansion: Adding voice commands for interactive storytelling.
- Real-Time Generation: Reducing inference latency to 120 ms per frame via sparse patchification.
Industries like gaming and education stand to benefit significantly, potentially reducing script-to-video workflow times by 70%.
FAQ
Q1: Can OneStory handle videos longer than 10 shots?
A: Yes, the current version supports up to 15 shots. Longer sequences require increasing the context budget parameter.
Q2: How does it manage multiple interacting characters?
A: The system builds attention maps to assign memory weights dynamically. Tests show over 89% consistency even with three+ characters.
Q3: What hardware is needed for training?
A: Full training requires 128 A100 GPUs (~15 hours). LoRA fine-tuning enables adaptation on a single H100 system.
By combining adaptive memory with next-shot prediction, OneStory establishes a new standard for scalable, coherent multi-shot video generation. This technology not only advances entertainment production but also empowers educational content creation and interactive storytelling applications.

