Wan-Move: Motion-Controllable Video Generation via Latent Trajectory Guidance
In a nutshell: Wan-Move is a novel framework for precise motion control in video generation. It injects motion guidance by projecting pixel-space point trajectories into a model’s latent space and copying the first frame’s features along these paths. This requires no architectural changes to base image-to-video models (like Wan-I2V-14B) and enables the generation of high-quality 5-second, 480p videos. User studies indicate its motion controllability rivals commercial tools like Kling 1.5 Pro’s Motion Brush.
In video generation, the quest to animate a static image and control its motion with precision lies at the heart of both research and creative pursuits. The central challenge has been this: how can we make models not just “see” the first frame, but also “understand” and “execute” the complex motions we envision?
Traditional approaches often present a trade-off. Methods offering coarse control, like bounding boxes, miss fine-grained details. Those striving for precise control typically introduce complex motion encoders and fusion modules, resulting in bulky models that are difficult to scale and fine-tune. This compromise between control fidelity and model elegance has hindered the widespread adoption of high-quality, user-friendly motion-controllable video generation.
Today, we delve into Wan-Move, a significant new contribution from researchers at Tongyi Lab (Alibaba Group), Tsinghua University, HKU, and CUHK. This work proposes a solution that is both simple and powerful, potentially changing how we manipulate dynamics in video. Furthermore, the team introduces MoveBench, a comprehensive benchmark designed to establish a rigorous, common ground for evaluation and progress in the field.
The Core Idea: What is “Latent Trajectory Guidance”?
The core innovation of Wan-Move is elegantly straightforward: make the original conditioning features inherently “motion-aware” to guide synthesis. It bypasses the complex path of designing auxiliary motion encoders. The process can be broken down into three key steps:
- Represent Motion with Dense Point Trajectories: First, object motion is described using dense point trajectories (e.g., tracked from a video using CoTracker). Each trajectory is a spatiotemporal path, enabling fine-grained control from local to global movement.
- Project Trajectories into Latent Space: Leveraging the translational equivariance of a pre-trained VAE in video diffusion models, these pixel-space coordinates are deterministically mapped into the model’s latent feature space.
- Feature Replication for Guidance Injection: This is the crucial step. For each trajectory, the latent feature from the first frame at the trajectory’s starting point is extracted. This feature is then “copied” along the mapped latent path to corresponding locations in all subsequent frames. This creates an aligned spatiotemporal feature map that explicitly tells the model: “This element in the scene should move along this path.”
Figure: The core of Wan-Move — Latent Trajectory Guidance. (a) Transforming point trajectories from video to latent space and replicating first-frame features along the path. (b) The training framework adds only an efficient latent feature replication step to an existing image-to-video model.
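To make the replication step concrete, here is a minimal sketch (not the authors’ implementation) of turning pixel-space tracks into a latent guidance tensor. It assumes a VAE with 8x spatial and 4x temporal compression, tracks given as a (T, N, 2) tensor of (x, y) pixel coordinates, and a hypothetical function name; consult the repository for the exact format.

```python
import torch

def build_trajectory_guidance(first_frame_latent, tracks, visibility,
                              spatial_stride=8, temporal_stride=4):
    """Copy first-frame latent features along point trajectories.

    first_frame_latent: (C, H, W) latent of the conditioning image.
    tracks:             (T, N, 2) float tensor of (x, y) pixel coordinates.
    visibility:         (T, N) boolean tensor marking visible points.
    Returns a (C, T_lat, H, W) guidance tensor aligned with the video latent.
    """
    C, H, W = first_frame_latent.shape
    T, N, _ = tracks.shape
    # Assumed causal temporal compression: first frame kept, then 4x downsampling.
    T_lat = (T - 1) // temporal_stride + 1

    guidance = torch.zeros(C, T_lat, H, W, dtype=first_frame_latent.dtype)

    # Map pixel coordinates onto the latent grid (assumed 8x spatial stride).
    lat_xy = (tracks / spatial_stride).round().long()
    lat_xy[..., 0].clamp_(0, W - 1)
    lat_xy[..., 1].clamp_(0, H - 1)

    # Latent feature at each trajectory's starting point in the first frame.
    start_feats = first_frame_latent[:, lat_xy[0, :, 1], lat_xy[0, :, 0]]  # (C, N)

    # Replicate those features along each track in every latent frame.
    for t in range(0, T, temporal_stride):
        t_lat = t // temporal_stride
        for n in range(N):
            if visibility[t, n]:
                x, y = lat_xy[t, n]
                guidance[:, t_lat, y, x] = start_feats[:, n]
    return guidance
```

The point of the sketch is that guidance is just an edited copy of the existing image condition features, which is why no new trainable module is needed.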
The advantages of this approach are clear:
- No Architectural Changes: Motion guidance is injected by directly editing the image condition features, eliminating the need for new modules (e.g., ControlNet) in the base model (e.g., Wan-I2V-14B).
- Preserves Rich Context: What’s replicated is not isolated pixel values but latent feature patches containing rich semantic and texture information, driving more natural and coherent local motion.
- Highly Scalable: With no extra trainable parameters introduced, it enables easy and scalable fine-tuning of powerful base Image-to-Video (I2V) backbones for rapid performance gains.
The Foundation for Rigorous Evaluation: The MoveBench Benchmark
The absence of unified, high-quality benchmarks often hinders technological progress. To establish a rigorous and comprehensive standard for evaluating motion-controllable video generation, the researchers created the MoveBench benchmark. Compared to existing datasets like DAVIS, VIPSeg, or MagicBench, MoveBench offers significant improvements in scale, duration, and annotation quality:
- Scale & Diversity: It comprises 1,018 carefully curated high-quality video clips, each 5 seconds long, spanning 54 distinct content categories (e.g., “Tennis,” “Cooking,” “City Traffic”) to ensure broad scenario coverage.
- High-Quality Annotations: Each clip is paired with detailed motion annotations. A human-in-the-loop pipeline ensures precision: annotators click on a target in the first frame, SAM generates an initial mask, and negative points can be added to exclude irrelevant areas. This is critical for annotating complex, small, or articulated motions. Every video includes at least one representative motion trajectory, with 192 videos featuring multi-object trajectories.
- Detailed Captions: Using Gemini, dense descriptive captions are generated for each video, covering objects, actions, and camera dynamics, providing rich semantic context for generation tasks.
Figure: The MoveBench construction pipeline combines algorithmic filtering with human expertise to ensure high data quality and precise annotations.
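The annotation step described above maps naturally onto SAM’s point-prompt interface. Below is a minimal sketch of what a single annotation interaction could look like; the checkpoint path, file name, and click coordinates are placeholders, and MoveBench’s actual annotation tooling may differ.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (checkpoint and image paths are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

first_frame = np.array(Image.open("first_frame.jpg").convert("RGB"))
predictor.set_image(first_frame)

# One positive click on the target, one negative click excluding a distractor.
point_coords = np.array([[420, 310], [600, 120]])
point_labels = np.array([1, 0])  # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(point_coords=point_coords,
                                     point_labels=point_labels,
                                     multimask_output=True)
target_mask = masks[scores.argmax()]  # trajectories can then be sampled inside this mask
```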
Performance: What the Data Shows
The team conducted extensive experiments on MoveBench and the public DAVIS dataset, comparing Wan-Move against leading academic methods (ImageConductor, LeviTor, Tora, MagicMotion) and the commercial model Kling 1.5 Pro. Metrics included visual fidelity scores (FID, FVD, PSNR, SSIM) and the motion-specific End-Point Error (EPE — the L2 distance between tracked points in the generated video and the ground-truth trajectory).
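For reference, EPE as described here reduces to a mean L2 distance over corresponding points. A minimal sketch follows; the paper’s exact masking and averaging details may differ.

```python
import numpy as np

def end_point_error(pred_tracks, gt_tracks, visibility=None):
    """Mean L2 distance between points tracked in the generated video and
    the ground-truth trajectory. Both inputs are (T, N, 2) arrays of (x, y).
    """
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)  # (T, N) per-point error
    if visibility is not None:  # optionally restrict to visible points
        err = err[visibility]
    return float(err.mean())
```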
Quantitative results demonstrate clear superiority:
- Single-Object Motion Control: On MoveBench, Wan-Move achieved the best scores across the board: FID 12.2 (↓), FVD 83.5 (↓), PSNR 17.8 (↑), SSIM 0.64 (↑), EPE 2.6 (↓). Its motion accuracy (EPE) notably outperformed other methods (baselines ranged from 3.2 to 3.4).
- Multi-Object Motion Control: On the more challenging multi-object subset of MoveBench, Wan-Move’s advantage was even greater, achieving an EPE of 2.2, compared to Tora’s 3.5.
- Human Evaluation: In a two-alternative forced-choice (2AFC) human study, Wan-Move achieved win rates over 50% in motion quality and visual quality when compared to Kling 1.5 Pro, demonstrating commercial-grade competitiveness.
- Efficiency: Thanks to its lean design, Wan-Move adds only 3 seconds of inference latency over the base I2V model. In contrast, methods using ControlNet for condition fusion added 225 seconds.
Qualitative comparisons are equally compelling:
Visual samples in the paper show Wan-Move more accurately adhering to complex motion trajectories (e.g., rotations, curved paths) while maintaining higher visual consistency and detail fidelity. Comparatively, some baseline methods exhibit motion deviation, object distortion, or unnatural background flickering.
Diverse Application Scenarios
Because point trajectories can flexibly represent various motion types, Wan-Move supports a wide range of applications:
- Object Control: Precisely control the path of one or multiple objects by specifying trajectories for single or multiple points.
- Camera Control: Simulate camera motions like panning, dollying, or rotating by dragging background points or using camera-aligned 2D trajectories calculated from monocular depth estimates (see the sketch below).
- Motion Transfer: Extract motion trajectories from one video and apply them to a different static image, animating the new image with the “motion” of another.
- 3D Rotation Control: Use depth estimation to compute an object’s 3D rotation and project it as 2D trajectories, enabling convincing 3D rotation animation.
Figure: Wan-Move supports diverse motion control applications, including single/multi-object control, camera control, motion transfer, and 3D rotation.
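As an illustration of the camera-control use case, the sketch below builds trajectories for a grid of background points that translate uniformly, which reads as a camera pan. The 81-frame count, the (T, N, 2) track layout, and the output file names are assumptions about the trajectory inputs; check the repository’s examples for the exact format.

```python
import numpy as np

def panning_camera_tracks(width, height, num_frames,
                          shift_per_frame=(4.0, 0.0), grid=8):
    """Trajectories for a grid of background points that all translate by the
    same (dx, dy) each frame, which the model interprets as a camera pan.
    Returns tracks of shape (num_frames, grid*grid, 2) and a visibility mask.
    """
    xs = np.linspace(0, width - 1, grid)
    ys = np.linspace(0, height - 1, grid)
    start = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)        # (N, 2)

    offsets = np.arange(num_frames)[:, None, None] * np.asarray(shift_per_frame)
    tracks = start[None] + offsets                                        # (T, N, 2)

    # Points that drift out of frame are marked invisible, then clamped.
    visibility = ((tracks[..., 0] >= 0) & (tracks[..., 0] < width) &
                  (tracks[..., 1] >= 0) & (tracks[..., 1] < height))
    tracks[..., 0] = tracks[..., 0].clip(0, width - 1)
    tracks[..., 1] = tracks[..., 1].clip(0, height - 1)
    return tracks, visibility

# Example: a rightward pan for an assumed 5-second, 81-frame clip at 832x480.
tracks, vis = panning_camera_tracks(832, 480, num_frames=81)
np.save("my_tracks.npy", tracks)      # hypothetical --track input
np.save("my_visibility.npy", vis)     # hypothetical --track_visibility input
```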
Getting Started: Usage and Evaluation
For researchers and developers, Wan-Move offers a complete open-source ecosystem:
- Code & Models: The project code is open-sourced on GitHub. The 14B-parameter model weights (Wan-Move-14B-480P) are released on Hugging Face and ModelScope, supporting 5-second 480p video generation.
- MoveBench Dataset: The evaluation benchmark is also publicly available to facilitate fair comparison within the community.
- Quick Start: After installing dependencies, inference can be run via simple command-line instructions, covering both evaluation on MoveBench and generation from custom images and trajectories (see the How-To section below).
Technical Insights and Ablation Studies
To validate its design choices, the paper includes a series of in-depth ablation studies:
- Trajectory Guidance Strategy: Comparing “Pixel Replication,” “Random Track Embedding,” and the proposed “Latent Feature Replication,” the results confirm that latent feature replication delivers the best video quality (PSNR 17.8) and motion control (EPE 2.6) by preserving rich local context.
- Number of Training Trajectories: The study found that sampling up to 200 trajectories during training yields the best balance: too few (e.g., 10) leads to weak control, while too many (e.g., 1024) may mismatch the sparser trajectories common at inference time.
- Generalization Capability: Wan-Move demonstrated strong robustness even when the number of input trajectories varied widely (from 1 to 1024) and when handling large-motion or out-of-distribution (OOD) scenarios.
Limitations and Future Work
Every technology has its boundaries. Wan-Move’s primary limitation stems from its reliance on point trajectories: if a target point remains occluded for an extended duration, the model may lose motion guidance. Additionally, performance may degrade in extremely cluttered scenes or when input trajectories severely violate physical laws.
Conclusion
Wan-Move represents a significant advance in motion-controllable video generation through its simple yet powerful core design of “Latent Trajectory Guidance.” It successfully bridges the gap between fine-grained motion control and model simplicity/scalability, achieving generation quality competitive with leading commercial tools. Coupled with the high-standard MoveBench benchmark, this open-source work provides researchers with a powerful tool and sets a new reference point for the community. It holds the potential to inspire further innovation and ultimately empower a broader spectrum of creators.
How-To: Run a Wan-Move Example
To quickly experience Wan-Move, follow these steps (assuming a configured Python environment):
- Clone the repo and install dependencies:

      git clone https://github.com/ali-vilab/Wan-Move.git
      cd Wan-Move
      pip install -r requirements.txt

- Download the model weights:

      huggingface-cli download Ruihang/Wan-Move-14B-480P --local-dir ./Wan-Move-14B-480P

- Run the example generation:

      python generate.py \
        --task wan-move-i2v \
        --size 480*832 \
        --ckpt_dir ./Wan-Move-14B-480P \
        --image examples/example.jpg \
        --track examples/example_tracks.npy \
        --track_visibility examples/example_visibility.npy \
        --prompt "A laptop is placed on a wooden table..." \
        --save_file example.mp4
FAQ
Q1: What is the key difference between Wan-Move and previous motion control methods like DragNUWA or MotionCtrl?
A1: The fundamental difference lies in how motion guidance is injected. Previous methods mostly required training auxiliary motion encoders (like ControlNet) to fuse motion signals into the generative model. Wan-Move bypasses this by projecting point trajectories into the latent space and replicating first-frame features, directly updating the model’s existing condition features without any extra modules. This leads to a more elegant architecture that is easier to fine-tune and scale.
Q2: Do I need to provide point trajectories myself? How can I obtain them?
A2: Yes, point trajectories are required as input during inference. You can extract them from a reference video using open-source tracking tools (like CoTracker, used in the paper) or manually specify keypoint paths. Wan-Move supports input ranging from sparse (e.g., 1 point) to dense (e.g., 1024 points) trajectories.
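As one concrete route, CoTracker can be loaded via torch.hub and run on a reference clip to produce track and visibility arrays. This is a sketch rather than the project’s official tooling: the hub entry point shown is CoTracker3’s offline variant, and the array layout and file names are assumptions meant to mirror the --track / --track_visibility flags.

```python
import numpy as np
import torch
from torchvision.io import read_video

# Read a reference clip (path is a placeholder) as (T, C, H, W) uint8 frames.
frames, _, _ = read_video("reference.mp4", output_format="TCHW", pts_unit="sec")
video = frames[None].float()  # (1, T, 3, H, W), values in [0, 255]

# Load CoTracker3 (offline variant) from torch.hub and track a regular point grid.
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline")
with torch.no_grad():
    pred_tracks, pred_visibility = cotracker(video, grid_size=10)
# pred_tracks: (1, T, N, 2) pixel (x, y); pred_visibility: (1, T, N)

np.save("example_tracks.npy", pred_tracks[0].cpu().numpy())
np.save("example_visibility.npy", pred_visibility[0].cpu().numpy())
```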
Q3: What is the video duration and resolution Wan-Move can generate?
A3: The currently released Wan-Move-14B-480P model is focused on generating 5-second videos at 832×480 resolution (480p), representing an important milestone for long-duration, high-quality motion-controllable generation.
Q4: What is the value of the MoveBench benchmark for developers?
A4: MoveBench provides a large-scale, high-quality, and consistently annotated test set. Developers can use it to objectively evaluate the motion control performance of their own models or methods, ensuring fair comparisons and quickly identifying weaknesses in specific scenarios (e.g., multi-object, large motion).
Q5: Could this technology be misused? What is the research team’s perspective?
A5: The paper explicitly acknowledges its dual-use potential. Like all powerful generative models, Wan-Move can be used for positive purposes in creative industries, education, and simulation, but also carries the risk of misuse for creating misleading or harmful content. The team promotes transparent research through open-sourcing and reminds users that they must comply with legal and ethical standards, bearing full responsibility for their use of the model.
