FantasyPortrait: Advancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers
FantasyPortrait is a state-of-the-art framework designed to create lifelike and emotionally rich animations from static portraits. It addresses the long-standing challenges of cross-identity facial reenactment and multi-character animation by combining implicit expression control with a masked cross-attention mechanism. Built upon a Diffusion Transformer (DiT) backbone, FantasyPortrait can produce high-quality animations for both single and multi-character scenarios, while preserving fine-grained emotional details and avoiding feature interference between characters.
1. Background and Challenges
Animating a static portrait into a dynamic, expressive video is a complex task with broad applications:
- Film production – breathing life into still images for storytelling.
- Virtual communication – enabling expressive avatars for meetings or chats.
- Gaming and interactive media – bringing characters to life without manual keyframing.
1.1 Limitations of Existing Methods
Most traditional methods rely on explicit geometric priors, such as:
- Facial landmarks (2D keypoints)
- 3D Morphable Models (3DMM)
While these can work in controlled scenarios, they struggle with:
- Cross-identity reenactment: Large differences in facial structure (e.g., between genders, ages, or ethnicities) often lead to artifacts, motion distortions, and flickering.
- Subtle emotion capture: Explicit geometry is insufficient to represent nuanced muscle movements or complex emotional cues.
- Multi-character animation: Driving features for different characters often interfere with each other, causing “expression leakage.”
2. Core Innovations in FantasyPortrait
FantasyPortrait introduces two main innovations:
- Expression-Augmented Implicit Control: Instead of relying on explicit geometry, FantasyPortrait extracts identity-agnostic implicit expression features from the driving video. A dedicated expression-augmented learning module enhances complex, fine-grained dynamics—especially lip motion and emotional expressions—while keeping head pose and eye motion consistent.
- Masked Cross-Attention for Multi-Character Control: This mechanism ensures that expression features for each character are controlled independently while maintaining synchronized motion. It prevents feature leakage between characters, enabling natural group animations.

3. Technical Framework
3.1 Overview
FantasyPortrait is built on a Latent Diffusion Model (LDM) with a DiT backbone. The pipeline (sketched in code after this list) is:
- Input: Static portrait image(s) + driving video(s)
- Implicit feature extraction: Capture expression-related motion without identity bias
- Expression-augmented encoding: Enhance challenging non-rigid facial motions
- Cross-attention fusion: Combine portrait and motion embeddings
- Video synthesis: Decode the latent animation into video frames
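As a rough illustration, here is a shape-level sketch of this data flow in PyTorch. The module choices, tensor sizes, and names are illustrative assumptions rather than the released implementation; the sketch only shows how portrait latents, implicit motion features, and cross-attention fusion fit together before decoding.

```python
# A minimal, shape-level sketch of the FantasyPortrait data flow described above.
# All modules and dimensions here are illustrative placeholders, not the released code.
import torch
import torch.nn as nn

B, T, D = 1, 16, 512          # batch, driving-video frames, feature width (toy values)
L = 64                        # latent tokens per frame (toy value)

portrait_latent = torch.randn(B, T * L, D)   # stand-in for the encoded reference portrait
driving_frames  = torch.randn(B, T, D)       # stand-in for per-frame implicit expression features

# Expression-augmented encoding of the driving features (placeholder module).
expr_encoder = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
motion_tokens = expr_encoder(driving_frames)                  # (B, T, D)

# Cross-attention fusion: portrait latents attend to motion tokens.
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=portrait_latent, key=motion_tokens, value=motion_tokens)

# A real DiT would iteratively denoise `fused` and decode it to frames; here we only check shapes.
print(fused.shape)            # torch.Size([1, 1024, 512])
```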

3.2 Expression-Augmented Implicit Control
The implicit expression representation consists of four components:
- e_lip: Lip motion
- e_eye: Eye gaze and blinking
- e_head: Head pose
- e_emo: Emotional expression
For the complex, non-rigid movements (e_lip and e_emo), FantasyPortrait uses learnable tokens to decompose them into smaller sub-features representing specific muscle groups or emotion dimensions. These interact with semantically aligned video tokens through multi-head cross-attention, capturing detailed region-specific relationships.
Formula (see the sketch below):
e_m = Concat(E_a(e_emo), E_a(e_lip), e_head, e_eye)
Where:
- E_a: Expression-augmented encoder
- e_m: Motion embedding
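A rough sketch of this motion-embedding construction follows, assuming an encoder E_a built from learnable sub-feature tokens that cross-attend to the corresponding implicit features. Token counts, dimensions, and module choices are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of e_m = Concat(E_a(e_emo), E_a(e_lip), e_head, e_eye).
# Only the overall structure (learnable tokens + cross-attention, then concatenation)
# follows the text; all sizes are toy values.
import torch
import torch.nn as nn

D = 256  # feature width (assumed)

class ExprAugmentedEncoder(nn.Module):
    """E_a: learnable tokens split a coarse feature into finer sub-features."""
    def __init__(self, dim: int, num_sub_tokens: int = 8):
        super().__init__()
        self.sub_tokens = nn.Parameter(torch.randn(num_sub_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, N, D) implicit features for one expression component
        q = self.sub_tokens.unsqueeze(0).expand(feat.size(0), -1, -1)
        out, _ = self.attn(query=q, key=feat, value=feat)   # (B, num_sub_tokens, D)
        return out

E_a = ExprAugmentedEncoder(D)

B = 1
e_lip  = torch.randn(B, 4, D)   # lip-motion features
e_emo  = torch.randn(B, 4, D)   # emotion features
e_head = torch.randn(B, 1, D)   # head pose
e_eye  = torch.randn(B, 1, D)   # eye gaze / blinking

# e_m = Concat(E_a(e_emo), E_a(e_lip), e_head, e_eye)
e_m = torch.cat([E_a(e_emo), E_a(e_lip), e_head, e_eye], dim=1)
print(e_m.shape)   # torch.Size([1, 18, 256]) with the toy sizes above
```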
3.3 Masked Cross-Attention for Multi-Character Animation
For each character:
- Detect and crop the face region.
- Extract motion embeddings.
- Concatenate embeddings from all characters: ê_m = {e_m^1, e_m^2, ..., e_m^N}
- Apply mask matrices in cross-attention layers to limit each motion embedding’s influence to its corresponding face region.
Mathematically (see the sketch after the definitions):
Z'_i = Z_i + softmax((M ⊙ Q_i K_i^T) / √d_k) V_i
Where:
- M: Face mask mapped to latent space
- ⊙: Element-wise multiplication
- Q_i, K_i, V_i: Query, Key, and Value projections for character i
- d_k: Dimension of the key vectors
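The update above can be written out as a few lines of single-head attention, as in the sketch below. The tensor sizes and the way the face mask M is constructed are assumptions for illustration; only the masked-attention equation itself follows this section.

```python
# Minimal single-head sketch of Z'_i = Z_i + softmax((M ⊙ Q_i K_i^T) / √d_k) V_i.
import math
import torch

B, N_lat, N_mot, d_k = 1, 64, 18, 256   # latent tokens, motion tokens, key dim (toy values)

Z = torch.randn(B, N_lat, d_k)          # video latent tokens Z_i
e_m = torch.randn(B, N_mot, d_k)        # motion embedding of character i

# Q from the latents, K/V from character i's motion embedding (placeholder projections).
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))
Q, K, V = Z @ W_q, e_m @ W_k, e_m @ W_v

# M: 1 where a latent token lies inside character i's face region, 0 elsewhere,
# broadcast over the motion-token axis (how the mask is built is an assumption here).
face_region = torch.zeros(B, N_lat, 1)
face_region[:, :16] = 1.0               # pretend the first 16 latent tokens cover this face
M = face_region.expand(B, N_lat, N_mot)

scores = M * (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
Z_out = Z + torch.softmax(scores, dim=-1) @ V
print(Z_out.shape)                      # torch.Size([1, 64, 256])
```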
4. Dataset and Benchmark
4.1 Multi-Expr Dataset
A dedicated multi-character facial expression video dataset, curated from:
- OpenVid-1M
- OpenHumanVid
Processing pipeline (a detection sketch follows this subsection):
- Detect the number of people per clip (YOLOv8) → keep clips with ≥ 2 faces.
- Filter out low-quality or artifact-prone clips using aesthetic scoring and blur detection.
- Select clips with clear expression dynamics using facial landmark motion analysis.
Scale: ~30,000 high-quality video clips with descriptive captions.
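For illustration, a minimal sketch of the first filtering step (person counting with YOLOv8 via the ultralytics package) might look like the following. The weights file, frame-sampling rate, and threshold are assumptions, and the aesthetic, blur, and landmark-motion filters are not shown; this is not the actual curation code.

```python
# Illustrative person-count filter for candidate clips (not the Multi-Expr curation code).
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")   # COCO-pretrained; class 0 is "person"

def count_people(video_path: str, sample_every: int = 30) -> int:
    """Return the maximum number of people detected on sampled frames."""
    cap = cv2.VideoCapture(video_path)
    max_people, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            result = detector(frame, verbose=False)[0]
            people = int((result.boxes.cls == 0).sum())
            max_people = max(max_people, people)
        idx += 1
    cap.release()
    return max_people

if __name__ == "__main__":
    if count_people("clip.mp4") >= 2:
        print("keep clip for the multi-character subset")
```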

4.2 ExprBench
A public benchmark for evaluating expression-driven animation quality:
- ExprBench-Single: Single-character test set
- ExprBench-Multi: Multi-character test set
Video diversity:
- Realistic human portraits
- Anthropomorphic animals
- Cartoon characters
- Wide emotional range and head movements

5. Experimental Results
5.1 Quantitative Performance
Evaluated metrics (a small computation sketch follows this list):
- FID / FVD: Video quality
- PSNR / SSIM: Frame reconstruction fidelity
- LMD / MAE: Landmark and eye motion accuracy
- AED / APD: Expression and head pose accuracy
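For reference, the frame-level metrics can be computed with standard tooling, as in the sketch below. This is generic metric code using scikit-image and NumPy, not ExprBench's own evaluation script; FID/FVD, AED, and APD require learned feature extractors and are omitted.

```python
# Generic PSNR/SSIM and landmark-distance helpers (not the ExprBench evaluation script).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_fidelity(generated: np.ndarray, reference: np.ndarray) -> tuple[float, float]:
    """PSNR and SSIM between two uint8 RGB frames of identical size."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
    return psnr, ssim

def landmark_distance(pred_lms: np.ndarray, gt_lms: np.ndarray) -> float:
    """LMD: mean Euclidean distance between predicted and ground-truth 2D landmarks."""
    return float(np.linalg.norm(pred_lms - gt_lms, axis=-1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
    b = a.copy()
    b[0, 0] = 0                                # perturb one pixel
    print(frame_fidelity(b, a))
    print(landmark_distance(rng.random((68, 2)), rng.random((68, 2))))
```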
FantasyPortrait consistently outperforms:
- LivePortrait
- Skyreels-A1
- HunyuanPortrait
- X-Portrait
- Follow-Your-Emoji
5.2 User Study
32 participants rated each method (0–10 scale) across:
- Video Quality (VQ)
- Expression Similarity (ES)
- Motion Naturalness (MN)
- Expression Richness (ER)
FantasyPortrait scored highest in all categories, especially ES and ER.
5.3 Ablation Study
Key findings:
- Removing the expression-augmented learning (EAL) module → Loss of fine-grained expression detail
- Removing masked cross-attention (MCA) → Severe expression leakage in multi-character scenarios
- Removing the Multi-Expr training data → Significant performance drop in multi-character generation

6. Installation and Quick Start
6.1 Environment Setup
git clone https://github.com/Fantasy-AMAP/fantasy-portrait.git
cd fantasy-portrait
apt-get install ffmpeg
pip install -r requirements.txt
pip install flash_attn
6.2 Download Data and Models
Multi-Expr Dataset:
Models:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P
huggingface-cli download acvlab/FantasyPortrait --local-dir ./models
6.3 Inference
Single character:
bash infer_single.sh
Multi-character (same driver):
bash infer_multi.sh
Multi-character (different drivers):
bash infer_multi_diff.sh
7. Performance
| torch_dtype | Persistent Params | Speed per Iteration | VRAM |
| --- | --- | --- | --- |
| bfloat16 | None | 15.5 s | 40 GB |
| bfloat16 | 7B | 32.8 s | 20 GB |
| bfloat16 | 0 | 42.6 s | 5 GB |
8. FAQ
Q1: Minimum VRAM requirement?
5 GB with slower performance; 20 GB or more recommended for faster inference.
Q2: Can it handle cross-gender or cross-age reenactment?
Yes. The identity-agnostic implicit features allow stable cross-identity performance.
Q3: Will expressions get mixed in multi-character mode?
No. The masked cross-attention mechanism isolates features between characters.
9. Conclusion and Outlook
FantasyPortrait advances portrait animation with:
- Expression-augmented implicit control: Captures nuanced emotions and complex lip motions.
- Masked cross-attention: Delivers independent yet synchronized multi-character animation.
Future directions:
- Inference acceleration for real-time use cases.
- Ethical safeguards to detect and prevent misuse.