HY-Motion 1.0: Tencent Releases Billion-Parameter Text-to-3D Motion Generation Model

Snippet Summary: HY-Motion 1.0 is the first billion-parameter text-to-3D human motion model, pre-trained on 3,000 hours of data, covering 200+ motion categories, achieving 78.6% instruction-following accuracy and 3.43/5.0 motion quality score—significantly outperforming existing open-source solutions.


Text-to-3D Animation: It’s Actually Here Now

Picture this scenario: You type “a person kicks a soccer ball while swinging their arm,” and within seconds, a smooth, natural 3D human animation appears. This isn’t science fiction—it’s the capability that Tencent’s Hunyuan team has just open-sourced with HY-Motion 1.0.

How complex is traditional 3D animation production? Even experienced animators equipped with expensive motion capture systems need hours or even days to create just a few seconds of high-quality animation. HY-Motion 1.0’s arrival is completely changing this game.

Why Is This Different? Three Key Breakthroughs

Breakthrough One: Billion-Scale Parameters

HY-Motion 1.0 is the first model in text-to-motion generation to reach billion-parameter scale. Specifically, the standard version contains 1.0B (one billion) parameters, while the lightweight version has 0.46B (460 million) parameters.

What does this scale mean? Compare it with existing open-source models:

  • MoMask's parameter count sits far below the billion level
  • DART and LoM operate at similarly small scales
  • HY-Motion 1.0 raises the scale by an order of magnitude or more

This parameter increase delivers qualitative improvements. In instruction-following capability tests, HY-Motion 1.0 achieves an average score of 3.24 out of 5, while other models typically hover around 2.2. This isn’t incremental improvement—it’s over 40% performance gain.

Breakthrough Two: 3,000 Hours of Diverse Training Data

Data quality and scale directly determine a model’s ceiling. HY-Motion 1.0’s training data comes from three primary sources:

In-the-Wild Video Data: Extracted from 12 million high-quality video clips, covering diverse real-world action scenarios. These videos undergo rigorous preprocessing:

  • Shot boundary detection to segment coherent scenes
  • Human detection to keep only clips containing people
  • GVHMR algorithm to reconstruct 3D human motion trajectories and convert to SMPL-X parameter format

Motion Capture Data: Approximately 500 hours of professional mocap data, primarily from controlled indoor environments, offering exceptional quality but limited scene diversity.

3D Animation Assets: Motion sequences hand-crafted by professional artists for game production, exhibiting outstanding quality but relatively limited quantity.

Ultimately, after strict filtering including duplicate removal, abnormal pose detection, joint velocity outlier detection, and foot-sliding artifact detection, they obtained over 3,000 hours of high-quality motion data, with 400 hours representing top-tier quality.

Breakthrough Three: Complete Three-Stage Training Paradigm

This represents HY-Motion 1.0’s most critical innovation. Unlike traditional single-phase training, the model employs a “coarse-to-fine, supervised-to-feedback” progressive training strategy:

Stage One: Large-Scale Pretraining (3,000 hours)

  • Goal: Teach the model “how to move”
  • Data: All 3,000 hours across various quality levels
  • Results: Rapidly establishes broad motion priors with strong semantic understanding
  • Trade-off: Generated motions may exhibit high-frequency jitter and foot sliding

Stage Two: High-Quality Fine-Tuning (400 hours curated data)

  • Goal: Elevate from “roughly correct” to “precisely smooth”
  • Data: Only 400 hours of manually verified high-quality data
  • Learning rate: Reduced to 0.1× pretraining rate to prevent forgetting
  • Results: Dramatically reduces jitter and sliding, stronger anatomical consistency, accurately distinguishes fine details like “wave left hand” versus “wave right hand”

Stage Three: Reinforcement Learning Alignment (Dual Optimization)

  • Step 1, DPO (Direct Preference Optimization): human annotators compared roughly 40,000 generated motion pairs, labeling the better and worse sample in each; the model learns from 9,228 high-quality preference annotations
  • Step 2, Flow-GRPO: explicit physics and semantics reward functions enforce hard physical constraints (such as eliminating foot sliding) and precise semantic alignment

This three-stage training enables the model to achieve dual improvements in quality and control precision while maintaining diversity.

Technical Architecture: How Hybrid Transformers Understand Text and Generate Motion

Motion Representation: The 201-Dimensional Vector Secret

HY-Motion 1.0 uses SMPL-H skeleton definition (22 joints excluding hands), encoding each motion frame as a 201-dimensional vector:

  • Global root translation (3D): Defines character position in space
  • Global body orientation (6D): Using continuous 6D rotation representation
  • Local joint rotations (126D): 21 joints × 6D rotation representation
  • Local joint positions (66D): 22 joints × 3D coordinates

This representation's advantage lies in its compatibility with standard animation workflows: it can be imported directly into mainstream 3D software such as Blender and Maya. Unlike the commonly used HumanML3D representation, HY-Motion drops explicit temporal derivatives (velocities) and foot contact labels; the team found this actually accelerates training convergence.
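To make this layout concrete, here is a minimal packing sketch (not the official data loader; the function and argument names are assumptions based on the field list above):

import numpy as np

def pack_frame(root_translation, body_orientation_6d, joint_rotations_6d, joint_positions):
    """Pack one motion frame into the 201-D vector described above.

    root_translation:    (3,)    global root translation
    body_orientation_6d: (6,)    global body orientation in continuous 6D form
    joint_rotations_6d:  (21, 6) local 6D rotations for the 21 non-root joints
    joint_positions:     (22, 3) local 3D positions for all 22 joints
    """
    frame = np.concatenate([
        root_translation,                 # 3 dims
        body_orientation_6d,              # 6 dims
        joint_rotations_6d.reshape(-1),   # 126 dims
        joint_positions.reshape(-1),      # 66 dims
    ])
    assert frame.shape == (201,)
    return frame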

Dual-Stream / Single-Stream Hybrid Architecture

The model’s core is a hybrid Transformer cleverly combining dual-stream and single-stream processing:

Dual-Stream Blocks (1/3 of total layers):

  • Motion and text processed independently, preserving modality-specific representations
  • Interact through joint attention mechanism: motion features query semantic cues from text
  • Text tokens shielded from motion noise, maintaining semantic integrity

Single-Stream Blocks (2/3 of total layers):

  • Motion and text tokens concatenated into unified sequence
  • Parallel spatial and channel attention modules enable deep multimodal fusion

Dual Text Encoding Strategy:

  • Qwen3-8B extracts fine-grained token-level semantic embeddings, converted via Bidirectional Token Refiner into bidirectional representations needed for non-autoregressive generation
  • CLIP-L extracts global text embeddings, injected via AdaLN mechanism to adaptively modulate feature statistics throughout the network
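As a rough illustration of the AdaLN idea (a generic sketch, not the released implementation; the module and parameter names are invented for clarity), the global CLIP embedding is mapped to a scale and shift that modulate normalized features:

import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Generic adaptive LayerNorm: a global condition vector predicts scale and shift."""
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, seq_len, hidden_dim) token features inside a Transformer block
        # cond: (batch, cond_dim)            global text embedding (e.g., from CLIP-L)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)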

Key Attention and Positional Encoding Designs

Asymmetric Attention Mask:

  • Motion tokens globally attend to text sequence, extracting semantic cues
  • Text tokens explicitly masked from motion latents
  • Purpose: Prevent diffusion noise in motion from contaminating text embeddings

Narrow Band Temporal Mask:

  • Within motion branch, each frame only attends to a sliding window of ±60 frames (121 frames total at 30fps)
  • Hypothesis: Kinematic dynamics primarily governed by local continuity
  • Advantage: Linear computational complexity, handles long sequences

Full RoPE Positional Encoding:

  • Text and motion embeddings concatenated into single sequence before applying Rotary Position Embeddings
  • Establishes continuous relative coordinate system, enabling model to understand correspondence between specific text tokens and temporal frames
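A sketch of how the two masks might be combined into a single attention mask (an illustration only, not the released code; the [text | motion] sequence layout and the ±60-frame window follow the description above):

import torch

def build_attention_mask(num_text: int, num_motion: int, window: int = 60) -> torch.Tensor:
    """Return a boolean mask of shape (L, L) where True means attention is allowed.

    Assumed sequence layout: [text tokens | motion frames].
    - Motion tokens attend to all text tokens (asymmetric mask).
    - Text tokens attend only to other text tokens, never to noisy motion latents.
    - Motion tokens attend to motion frames within a +/- window band.
    """
    total = num_text + num_motion
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Text -> text only, shielded from motion noise
    mask[:num_text, :num_text] = True

    # Motion -> all text, to extract semantic cues
    mask[num_text:, :num_text] = True

    # Motion -> motion, narrow temporal band (linear cost in sequence length)
    frame_idx = torch.arange(num_motion)
    band = (frame_idx[:, None] - frame_idx[None, :]).abs() <= window
    mask[num_text:, num_text:] = band
    return mask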

Flow Matching: Why More Efficient Than Traditional Diffusion

HY-Motion uses Flow Matching to construct a continuous probability path from standard Gaussian noise to the complex distribution of motion data. It adopts the optimal-transport path defined by linear interpolation: x_t = (1 - t) x_0 + t x_1, where x_0 is Gaussian noise and x_1 is a clean motion sample.

The training objective minimizes mean squared error between predicted and ground-truth velocity. During inference, starting from random noise x_0, the model recovers clean motion x_1 by integrating along the predicted velocity field using ODE solvers (e.g., Euler method).
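A compact sketch of this recipe (generic flow matching consistent with the description above, not the official training loop; model stands for any network that predicts velocity from x_t, t, and the text embedding):

import torch

def flow_matching_loss(model, x1, text_emb):
    """x1: a batch of clean motion, e.g. (B, T, 201); x0: Gaussian noise of the same shape."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1)
    x_t = (1 - t) * x0 + t * x1      # linear (optimal-transport) interpolation
    v_target = x1 - x0               # ground-truth velocity along the linear path
    v_pred = model(x_t, t, text_emb)
    return torch.mean((v_pred - v_target) ** 2)

@torch.no_grad()
def euler_sample(model, text_emb, shape, steps=50, device="cpu"):
    """Integrate the predicted velocity field from noise (t=0) to clean motion (t=1)."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], 1, 1), i * dt, device=device)
        x = x + model(x, t, text_emb) * dt
    return x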

Compared to traditional DDPM diffusion models, Flow Matching advantages:

  • More stable training without complex noise scheduling
  • Fewer inference steps, faster generation
  • More elegant mathematical formulation, easier to understand and optimize

Data Processing Pipeline: From Raw Video to Precise Annotation

Automated Cleaning Pipeline

Retargeting and Unification:

  • All motions uniformly retargeted to neutral SMPL-H skeleton
  • SMPL/SMPL-H/SMPL-X formats converted via mesh fitting
  • Other skeletal structures mapped using retargeting tools

Multi-Layer Filtering Mechanism:

  • Remove duplicate sequences
  • Eliminate abnormal poses (joint angles beyond physiological range)
  • Detect joint velocity outliers (sudden changes exceeding thresholds)
  • Identify anomalous displacements (non-physical phenomena like teleportation)
  • Prune static motions (movement amplitude below threshold)
  • Detect foot-sliding artifacts (horizontal movement when foot contacts ground)
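The exact thresholds are not published; the sketch below shows how two of these checks could be implemented (joint-velocity outliers and foot sliding), with every threshold value being a placeholder assumption:

import numpy as np

def has_velocity_outlier(joint_positions, fps=30, max_speed=12.0):
    """joint_positions: (T, 22, 3) in meters. Flag the clip if any joint exceeds max_speed (m/s, placeholder)."""
    speeds = np.linalg.norm(np.diff(joint_positions, axis=0), axis=-1) * fps
    return bool((speeds > max_speed).any())

def has_foot_sliding(foot_positions, contact_height=0.05, max_slide=0.02):
    """foot_positions: (T, 3) for one foot joint, Y-up. Flag horizontal motion while the foot is near the ground."""
    in_contact = foot_positions[:-1, 1] < contact_height
    horizontal_step = np.linalg.norm(np.diff(foot_positions[:, [0, 2]], axis=0), axis=-1)
    return bool((horizontal_step[in_contact] > max_slide).any())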

Standardization Processing:

  • Uniformly resample to 30fps
  • Sequences longer than 12 seconds segmented into multiple clips
  • Normalize to canonical coordinate frame: Y-axis up, starting position at origin, lowest body point aligned to ground plane, initial facing direction along positive Z-axis
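A simplified canonicalization sketch (the real pipeline likely estimates facing from the pelvis or shoulder orientation; here it is approximated from the root's initial heading, while the other conventions follow the list above):

import numpy as np

def canonicalize(joint_positions):
    """joint_positions: (T, 22, 3), Y-up. Place the start at the origin, rest the lowest
    point on the ground plane, and rotate about Y so the initial heading points along +Z."""
    pos = joint_positions.copy()
    # Move the first-frame root to the origin (horizontally) and the lowest point onto the ground
    pos[..., 0] -= pos[0, 0, 0]
    pos[..., 2] -= pos[0, 0, 2]
    pos[..., 1] -= pos[..., 1].min()
    # Estimate the initial heading from the root's horizontal displacement (a simplification)
    heading = pos[1, 0, [0, 2]] - pos[0, 0, [0, 2]]
    angle = np.arctan2(heading[0], heading[1])          # angle measured from +Z toward +X
    c, s = np.cos(-angle), np.sin(-angle)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return pos @ rot_y.T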

Intelligent Annotation Pipeline

Direct Use of Original Video for Video-Sourced Data:
For motions extracted from video, directly use original video for annotation.

Synthetic Rendering for 3D Data:
For mocap and animation assets, texture and render SMPL-H models to generate synthetic videos.

VLM Preliminary Annotation:
Feed videos into Vision-Language Models (e.g., Gemini-2.5-Pro) with optimized prompts dedicated to human motion, obtaining preliminary descriptions and action keywords.

Manual Refinement (High-Quality Data):
For 400 hours of curated data, manually verify VLM outputs:

  • Correct erroneous descriptions
  • Supplement missing key motion elements
  • Ensure perfect text-motion correspondence

LLM Enhancement and Diversification:
Using Large Language Models to:

  • Standardize description structure while preserving original semantics
  • Create diverse paraphrases for data augmentation
  • Generate synonymous descriptions in different expression styles

Hierarchical Taxonomy: Covering 200+ Motion Categories

HY-Motion establishes a three-level motion taxonomy with six top-level categories:

1. Locomotion

  • Horizontal Movement: walking, running, side-stepping
  • Vertical Movement: jumping, squatting
  • Special Movement: crawling, climbing
  • Vehicles: riding motorcycle

2. Sports & Athletics

  • Ball Sports: tennis, soccer
  • Precision Sports: archery, shooting
  • Track & Field: high jump, long jump, shot put, sprinting

3. Fitness & Outdoor Activities

  • Gym & Strength Training: crunches, planks, leg press stretches
  • Yoga: child’s pose, pigeon pose, warrior I pose
  • Outdoor Activities: skydiving, curling

4. Daily Activities

  • Basic Postures: standing, sitting, lying down
  • Object Interaction: twisting caps
  • Housework: sweeping
  • Personal Care: shaving, applying lotion
  • Office & Study: making phone calls
  • Eating & Cooking

5. Social Interactions & Leisure

  • Solo Rhythmic Gestures
  • Solo Semantic Gestures
  • Dance: cha-cha, modern dance
  • Gymnastics & Acrobatics: handstands
  • Instrument Playing: piano
  • Martial Arts: kung fu
  • Theatrical Performance: runway walking

6. Game Character Actions

  • Defense Moves
  • Firearm Attacks: cannon/bazooka firing
  • Hit Reactions
  • Magic Attacks: staff spells
  • Melee Attacks
  • Melee Weapon Attacks: one-handed sword slashes

This taxonomy progressively refines from 6 top-level categories to over 200 fine-grained motion classes at leaf nodes—currently the industry’s most comprehensive motion classification system.

Auxiliary Module: LLM-Driven Duration Prediction and Prompt Rewriting

User inputs are often casual and colloquial, like “kick ball” or “a person is kicking soccer ball.” To help the model better understand and execute, HY-Motion introduces an independent LLM module handling two key tasks:

Duration Prediction

The LLM leverages its inherent common-sense knowledge to infer typical motion duration from text descriptions. For example:

  • “Wave hand” typically lasts 1-2 seconds
  • “Sit down to stand up” approximately 2-3 seconds
  • “Run one lap” might require 10-15 seconds

To improve accuracy, this LLM is fine-tuned on a dataset containing real motion durations, aligning predictions with training data distribution.

Prompt Rewriting

Converts users’ casual inputs into structured, model-friendly descriptions. For example:

  • Input: “kick ball”
  • Output: “A person kicks a soccer ball, extending their leg forward”

The rewriting process preserves user intent while adding motion details for more precise generation.

Two-Stage Training Strategy

Supervised Fine-Tuning (SFT):

  • Fine-tuned from Qwen3-30B-A3B model
  • Training data consists of {user prompt, optimized prompt, duration} triplets
  • User prompts synthesized by powerful LLM (Gemini-2.5-Pro) to simulate real user input diversity, including informal language, mixed Chinese-English, varying specificity levels
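For illustration, a single SFT training record might look like the following (a hypothetical example; the actual schema is not published, and the duration is only a plausible guess):

sft_example = {
    "user_prompt": "kick ball",
    "optimized_prompt": "A person kicks a soccer ball, extending their leg forward",
    "duration_seconds": 2.0,  # plausible value the duration predictor might output, not a published figure
}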

Reinforcement Learning (RL):

  • Uses Group Relative Policy Optimization (GRPO) algorithm
  • More powerful model (Qwen3-235B-A22B-Instruct-2507) serves as reward judge
  • Reward function evaluates two dimensions: semantic consistency (whether rewrite faithfully captures user intent) and temporal plausibility (whether predicted duration matches action complexity)
  • By optimizing relative advantages of candidates, guides policy toward generating semantically precise and temporally coherent instructions

Performance Comparison: Far Exceeding Existing Open-Source Solutions

Instruction-Following Capability Evaluation

On a test set containing over 2,000 text prompts covering six major categories from simple atomic actions to complex combinations, human annotators scored generated motions on a 1-5 scale:

| Model | Locomotion | Sports | Fitness | Daily | Social | Game | Average | SSAE Accuracy |
|---|---|---|---|---|---|---|---|---|
| MoMask | 2.98 | 2.41 | 2.09 | 2.07 | 2.38 | 1.97 | 2.31 | 58.0% |
| GoToZero | 2.80 | 2.23 | 2.07 | 2.00 | 2.32 | 1.74 | 2.19 | 52.7% |
| DART | 2.91 | 2.47 | 2.03 | 2.07 | 2.40 | 2.05 | 2.31 | 42.7% |
| LoM | 2.81 | 2.07 | 1.95 | 2.00 | 2.39 | 1.84 | 2.17 | 48.9% |
| HY-Motion 1.0 | 3.76 | 3.18 | 3.15 | 3.06 | 3.25 | 3.01 | 3.24 | 78.6% |

HY-Motion 1.0's average score of 3.24 represents over 40% improvement compared to the next-best model. In Structured Semantic Alignment Evaluation (SSAE), accuracy reaches 78.6%, roughly 20 to 36 percentage points higher than the four baselines.

SSAE is an automated evaluation method that turns text-motion alignment into a video question-answering task. For the prompt "a person swings their arm while shooting a soccer ball," the system decomposes it into yes/no questions:

  • “Is the person kicking their leg?”
  • “Is the person swinging their arm?”
  • “Does the person appear to be shooting a soccer ball?”

A Vision-Language Model (Gemini-2.5-Pro) then watches the rendered video and answers—the correctness rate constitutes the SSAE score.
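The scoring itself is simple once the VLM's answers are collected; a minimal sketch (assuming every decomposed question expects a "yes" when the motion follows its prompt):

prompt = "a person swings their arm while shooting a soccer ball"
questions = [
    "Is the person kicking their leg?",
    "Is the person swinging their arm?",
    "Does the person appear to be shooting a soccer ball?",
]

def ssae_accuracy(vlm_answers):
    """vlm_answers: one boolean per question, True if the VLM answered 'yes' after watching
    the rendered motion. Accuracy is the fraction answered affirmatively, aggregated over
    all prompts in the test set."""
    return sum(vlm_answers) / len(vlm_answers)

print(ssae_accuracy([True, True, False]))  # e.g., 2 of 3 prompted elements detected -> 0.67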

Motion Quality Evaluation

Using the same test set, reviewers scored motion fluidity, physical plausibility, and naturalness:

| Model | Locomotion | Sports | Fitness | Daily | Social | Game | Average |
|---|---|---|---|---|---|---|---|
| MoMask | 3.05 | 2.91 | 2.58 | 2.66 | 2.77 | 2.81 | 2.79 |
| GoToZero | 3.11 | 3.01 | 2.69 | 2.72 | 2.89 | 2.81 | 2.86 |
| DART | 3.38 | 3.33 | 2.94 | 2.95 | 3.06 | 3.07 | 3.11 |
| LoM | 3.14 | 3.08 | 2.98 | 3.01 | 3.14 | 3.01 | 3.06 |
| HY-Motion 1.0 | 3.59 | 3.51 | 3.28 | 3.37 | 3.43 | 3.41 | 3.43 |

HY-Motion 1.0 also leads in quality metrics with an average score of 3.43, approximately 10% higher than the closest competitor.

Scaling Experiments: The Power of Scale

To verify the impact of model scale and data volume, the team trained multiple variants at different sizes:

Instruction-Following Capability Scales with Model Size:

| Model | Parameters | Training Data | Average Score |
|---|---|---|---|
| DiT-0.05B | 50M | 3,000 hours | 3.10 |
| DiT-0.46B | 460M | 3,000 hours | 3.20 |
| DiT-0.46B-400h | 460M | 400 hours | 3.05 |
| DiT-1B | 1.0B | 3,000 hours | 3.34 |

Key findings:

  1. From 50M to 1B, instruction-following capability continuously improves
  2. At same scale, 3000-hour data outperforms 400-hour data (3.20 vs 3.05), proving large-scale data is crucial for semantic understanding
  3. From 460M to 1B, the parameter count more than doubles, yet the rate of improvement narrows (3.20 to 3.34)

Motion Quality Saturates at Medium Scale:

| Model | Parameters | Training Data | Average Score |
|---|---|---|---|
| DiT-0.05B | 50M | 3,000 hours | 2.91 |
| DiT-0.46B | 460M | 3,000 hours | 3.26 |
| DiT-0.46B-400h | 460M | 400 hours (high-quality only) | 3.31 |
| DiT-1B | 1.0B | 3,000 hours | 3.34 |

Key findings:

  1. From 50M to 460M, quality jumps significantly (2.91 to 3.26)
  2. Beyond 460M, continued scaling brings marginal quality gains (3.26 to 3.34)
  3. At same scale, high-quality data fine-tuning benefits quality more than large-scale data (3.31 vs 3.26), proving data quality is crucial for physical realism

These experiments validate data’s dual role: scale drives semantic understanding, quality ensures physical authenticity.

Quick Start: Two Usage Methods

Environment Setup

System Requirements: Supports macOS, Windows, and Linux

Installation Steps:

  1. Install PyTorch (visit pytorch.org for your system-appropriate version)

  2. Clone repository and install dependencies:

git clone https://github.com/Tencent-Hunyuan/HY-Motion-1.0
cd HY-Motion-1.0
pip install -r requirements.txt
  3. Download model weights:
    Follow the instructions in ckpts/README.md to download the necessary model files, including:
  • HY-Motion-1.0 (standard version, 1.0B parameters)
  • HY-Motion-1.0-Lite (lightweight version, 0.46B parameters)

Command-Line Batch Inference

Suitable for processing large volumes of prompts:

# Use standard version
python3 local_infer.py --model_path ckpts/tencent/HY-Motion-1.0

# Use lightweight version
python3 local_infer.py --model_path ckpts/tencent/HY-Motion-1.0-Lite

Common Parameter Configuration:

  • --input_text_dir: Directory containing .txt or .json format prompt files
  • --output_dir: Result save directory (default: output/local_infer)
  • --disable_duration_est: Disable LLM-based duration prediction
  • --disable_rewrite: Disable LLM-based prompt rewriting
  • --prompt_engineering_host / --prompt_engineering_model_path: (Optional) Host address/local path for motion duration prediction and prompt rewriting module

Important note: if you do not configure the prompt-engineering module parameters, you must set both --disable_duration_est and --disable_rewrite; otherwise the script will fail because it cannot reach the rewriting service.
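For example, a batch run without the prompt-engineering service could look like this (the prompts directory name is only a placeholder):

python3 local_infer.py \
    --model_path ckpts/tencent/HY-Motion-1.0 \
    --input_text_dir prompts \
    --output_dir output/local_infer \
    --disable_duration_est \
    --disable_rewrite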

Interactive Web Interface

More intuitive usage method—launch Gradio application:

python3 gradio_app.py

After running, access http://localhost:7860 in your browser to see a friendly interface where you can:

  • Directly input text descriptions
  • Adjust generation parameters
  • Real-time preview of generated motions
  • Download result files

This method is particularly suitable for rapid testing and creative exploration.

Real-World Applications: Unlocking Creative Productivity

Game Development

Traditional game character animation production involves complex workflows: concept design → mocap shooting → data cleaning → artistic adjustment → engine integration—lengthy cycles with high costs.

Using HY-Motion 1.0:

  • Input “a warrior slashes with a two-handed sword” → instantly generate base animation
  • Input “a mage casts fireball with staff” → rapid prototype validation
  • Batch generate NPC daily actions (walking, standing, talking, etc.)
  • Quickly iterate combat action designs

Estimated to reduce character animation prototyping time from days to hours.

Film and Animation Previsualization

Before formal shooting or production, directors and screenwriters can:

  • Quickly visualize action scenes from scripts
  • Preview complex fight choreography
  • Test character positioning and interactions
  • Present proof-of-concept to investors

Dramatically reduces pre-planning trial-and-error costs.

Virtual Humans and Digital Avatars

Live streaming, education, and customer service virtual personas need rich motion libraries:

  • Automatically generate explanation gestures from text scripts
  • Real-time response to user commands generating interactive motions
  • Generate subject-relevant demonstration actions for virtual teachers (physics experiments, chemistry operations)

Fitness and Sports Training

Coaches and athletes can:

  • Generate standard action demonstrations (“proper squat form”)
  • Visualize complex combination moves (“jump followed by side kick”)
  • Rapidly create motion libraries for training apps
  • Assist with action analysis and correction

Robot Motion Planning

While HY-Motion generates virtual human motions, it can serve as reference for humanoid robot motion planning:

  • Convert natural language commands to motion sequences
  • Provide demonstration data for robot learning
  • Evaluate action feasibility and naturalness

Current Limitations: Future Improvement Directions

The team candidly identifies two main limitations of current HY-Motion 1.0:

Complex Instruction Understanding Challenges

Despite significantly exceeding baseline models in semantic alignment, difficulties remain with highly detailed or complex instructions. For example:

“A person steps forward with left foot while simultaneously swinging right hand upward, then rotates body 90 degrees left, followed by bending down to touch left toe with right hand”

Such instructions containing multiple steps, precise orientations, and strict sequences may not be fully executed accurately by the model.

Root cause: the inherent difficulty of the data annotation pipeline. Whether using automatic VLM annotation or manual refinement, creating complete, accurate textual descriptions for subtle, complex motions is extremely challenging. Many details (arm angles, shifts in body center of gravity) are difficult to express precisely in natural language.

Insufficient Human-Object Interaction Capability

Current dataset primarily focuses on body kinematics, lacking explicit object geometry information. Therefore, the model may not generate accurate physical interactions with external objects. For example:

  • Contact points when grasping tools may be imprecise
  • Force application points when pushing/pulling/lifting objects may be unnatural
  • Actions requiring precise spatial alignment like sitting in chairs or opening doors may exhibit clipping or floating

This is a shared challenge across the entire field. Future needs include:

  • Datasets containing object geometry
  • Physical simulation constraints
  • Contact-aware generative models

The team indicates active research in these directions.

Why Open Source?

Tencent’s Hunyuan team fully open-sources HY-Motion 1.0, including:

  • Complete inference code
  • Pre-trained model weights (both 1.0B and 0.46B versions)
  • Detailed technical documentation
  • Online demonstration platform

The reasoning for open source is straightforward:

  1. Accelerate research progress: Enable global researchers to innovate from a higher starting point
  2. Promote technology democratization: Lower barriers to 3D animation creation, benefiting more creators
  3. Advance commercial maturity: Rapidly iterate through community feedback, accelerating technology toward practical use

As stated in the paper, they hope HY-Motion 1.0 can serve as a solid baseline, inspiring more exploration and accelerating development of scalable, high-quality motion generation technologies.

Core Insights: The Dual Truth of Data and Scale

Through HY-Motion 1.0’s development, the team distilled two key principles:

Principle One: The Duality of Data

  • Scale Drives Semantics: Expanding training data volume is the primary driver for enhancing instruction-following and semantic understanding. Experiments show models trained on 3000 hours significantly outperform same-scale models trained on 400 hours in instruction comprehension.
  • Quality Ensures Realism: Improving data quality is the decisive factor for enhancing motion fidelity and physical realism. High-quality data fine-tuning can significantly reduce jitter and sliding artifacts, even with unchanged model scale.

Principle Two: Multi-Stage Training Effectiveness
The “coarse-to-fine” three-stage framework—large-scale pretraining, high-quality fine-tuning, reinforcement learning alignment—proves necessary. This approach effectively balances the trade-off between motion diversity and precision, providing a robust optimization pathway for the field.

Frequently Asked Questions

How long does it take to generate a motion?

Depends on motion length, model scale, and hardware configuration. On servers equipped with high-end GPUs (like NVIDIA A100), generating a 5-second motion sequence (30fps, 150 frames total) typically takes several seconds to around ten seconds. The lightweight version is faster but with slightly reduced quality.

Can generated motions be directly used in games or animation production?

Yes, but post-processing is usually needed. HY-Motion outputs standard SMPL-H skeleton format, compatible with mainstream 3D software (Blender, Maya, Unity, Unreal Engine, etc.). For commercial projects, recommendations include:

  1. Post-generation refinement of details by professional animators
  2. Skeleton retargeting according to target character body type
  3. Timing adjustments to match specific scene requirements

Does it support multi-person interactive actions?

Current version primarily targets single-person motions. While training data includes some multi-person contact categories (handshakes, hugs), it generates motion sequences for individual characters. True multi-person collaborative generation (two characters simultaneously interacting with precise spatial alignment) is a next-stage research direction.

Can the model be fine-tuned for specific styles?

Yes. If you have domain-specific motion data (combat styles from particular games, specific dance choreography), you can fine-tune based on the pre-trained model. Recommend using the high-quality fine-tuning stage’s learning rate (0.1× pretraining) to preserve learned knowledge while adapting to new styles.

Are there restrictions on commercial use?

Open-source models typically follow specific license agreements—please check the project’s LICENSE file for specifics. Generally, research and non-commercial use is free, while commercial use may require additional licensing or compliance with specific terms. Recommend contacting Tencent’s Hunyuan team for explicit authorization information.

How does it compare with closed-source commercial solutions?

HY-Motion 1.0 leads among open-source solutions but may still have gaps compared to top-tier closed-source commercial products (like undisclosed internal solutions from major companies), particularly in complex scenarios and human-object interaction. Open source advantages include customizability, auditability, unrestricted usage, and community support.


One-Sentence Summary: HY-Motion 1.0 pushes text-to-3D motion generation to new heights through billion-scale parameters, 3,000 hours of diverse data, and three-stage refined training—opening a new chapter in practical AI-assisted animation production.