FantasyPortrait: Advancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

FantasyPortrait is a state-of-the-art framework designed to create lifelike and emotionally rich animations from static portraits. It addresses the long-standing challenges of cross-identity facial reenactment and multi-character animation by combining implicit expression control with a masked cross-attention mechanism. Built upon a Diffusion Transformer (DiT) backbone, FantasyPortrait can produce high-quality animations for both single and multi-character scenarios, while preserving fine-grained emotional details and avoiding feature interference between characters.


1. Background and Challenges

Animating a static portrait into a dynamic, expressive video is a complex task with broad applications:

  • Film production – breathing life into still images for storytelling.
  • Virtual communication – enabling expressive avatars for meetings or chats.
  • Gaming and interactive media – bringing characters to life without manual keyframing.

1.1 Limitations of Existing Methods

Most traditional methods rely on explicit geometric priors, such as:

  • Facial landmarks (2D keypoints)
  • 3D Morphable Models (3DMM)

While these can work in controlled scenarios, they struggle in:

  • Cross-identity reenactment: Large differences in facial structure (e.g., between genders, ages, or ethnicities) often lead to artifacts, motion distortions, and flickering.
  • Subtle emotion capture: Explicit geometry is insufficient to represent nuanced muscle movements or complex emotional cues.
  • Multi-character animation: Driving features for different characters often interfere with each other, causing “expression leakage.”

2. Core Innovations in FantasyPortrait

FantasyPortrait introduces two main innovations:

  1. Expression-Augmented Implicit Control
    Instead of relying on explicit geometry, FantasyPortrait extracts identity-agnostic implicit expression features from the driving video. A dedicated expression-augmented learning module enhances complex, fine-grained dynamics—especially lip motion and emotional expressions—while keeping head pose and eye motion consistent.

  2. Masked Cross-Attention for Multi-Character Control
    This mechanism ensures that expression features for each character are controlled independently while maintaining synchronized motion. It prevents feature leakage between characters, enabling natural group animations.

(Figure: single- vs. multi-character animation examples)

3. Technical Framework

3.1 Overview

FantasyPortrait is built on a Latent Diffusion Model (LDM) with a DiT backbone:

  1. Input: Static portrait image(s) + driving video(s)
  2. Implicit feature extraction: Capture expression-related motion without identity bias
  3. Expression-augmented encoding: Enhance challenging non-rigid facial motions
  4. Cross-attention fusion: Combine portrait and motion embeddings
  5. Video synthesis: Decode the latent animation into video frames

(Figure: system architecture)
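
To make the data flow concrete, here is a compact pseudocode sketch of the five steps above. Every function and attribute name in it is hypothetical and does not correspond to the repository's actual API.

# Hypothetical sketch of the inference flow described above (steps 1-5).
# All names (motion_extractor, expression_encoder, dit, vae) are illustrative,
# not the real FantasyPortrait interface.

def animate_portrait(portrait_image, driving_video, model):
    # 2. Identity-agnostic implicit motion features from the driving video
    e_lip, e_eye, e_head, e_emo = model.motion_extractor(driving_video)

    # 3. Expression-augmented encoding of the non-rigid components
    e_m = model.expression_encoder(e_emo, e_lip, e_head, e_eye)

    # 4. DiT denoising conditioned on the portrait latent and the motion embedding
    latents = model.dit.sample(portrait=model.encode(portrait_image), motion=e_m)

    # 5. Decode the latent animation back into video frames
    return model.vae.decode(latents)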

3.2 Expression-Augmented Implicit Control

Implicit expression representation is extracted via:

  • e_lip: Lip motion
  • e_eye: Eye gaze and blinking
  • e_head: Head pose
  • e_emo: Emotional expression

For complex, non-rigid movements (e_lip and e_emo), FantasyPortrait uses learnable tokens to break them down into smaller sub-features representing specific muscle groups or emotion dimensions. These interact with semantically aligned video tokens through multi-head cross-attention, capturing detailed region-specific relationships.

Formula:


e_m = Concat(E_a(e_emo), E_a(e_lip), e_head, e_eye)

Where:

  • E_a: Expression-augmented encoder
  • e_m: Motion embedding
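
The paper's exact module design is not reproduced here, but a minimal PyTorch sketch of the idea (learnable tokens decomposing e_emo and e_lip into sub-features through multi-head cross-attention, then concatenation with the untouched head-pose and eye features) could look like the following. All layer choices, dimensions, and the assumption that e_head and e_eye arrive already token-shaped are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class ExpressionAugmentedEncoder(nn.Module):
    """Sketch of E_a: learnable tokens decompose a non-rigid expression
    feature (e_emo or e_lip) into finer sub-features via multi-head
    cross-attention. Dimensions and layer choices are assumptions."""

    def __init__(self, feat_dim=512, num_tokens=8, num_heads=8):
        super().__init__()
        # Learnable sub-feature tokens, e.g. per muscle group / emotion dimension
        self.tokens = nn.Parameter(torch.randn(num_tokens, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, e):  # e: (B, T, feat_dim) video-aligned implicit feature
        q = self.tokens.unsqueeze(0).expand(e.shape[0], -1, -1)
        sub, _ = self.cross_attn(q, e, e)   # tokens attend to the video-aligned features
        return self.proj(sub)               # (B, num_tokens, feat_dim)

def build_motion_embedding(e_emo, e_lip, e_head, e_eye, enc):
    """e_m = Concat(E_a(e_emo), E_a(e_lip), e_head, e_eye) along the token axis.
    Assumes e_head and e_eye are already (B, k, feat_dim) token sequences."""
    return torch.cat([enc(e_emo), enc(e_lip), e_head, e_eye], dim=1)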

3.3 Masked Cross-Attention for Multi-Character Animation

For each character:

  1. Detect and crop the face region.
  2. Extract motion embeddings.
  3. Concatenate embeddings from all characters:

ê_m = {e_m^1, e_m^2, ..., e_m^N}

  4. Apply mask matrices in cross-attention layers to limit each motion embedding's influence to its corresponding face region.

Mathematically:


Z'_i = Z_i + softmax( (M ⊙ Q_i K_i^T) / √d_k ) V_i

  • M: Face mask mapped to latent space
  • ⊙: Element-wise multiplication
  • Q_i, K_i, V_i: Query, Key, and Value projections
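
A minimal PyTorch sketch of this update follows. The formula's element-wise mask is implemented in the usual way, by setting masked logits to -inf before the softmax so each latent position only attends to the motion tokens of its own character; the shapes, the residual form, and the mask layout are assumptions rather than the authors' exact code.

import math
import torch

def masked_cross_attention(z, e_m, mask, w_q, w_k, w_v):
    """Sketch of Z'_i = Z_i + softmax((M ⊙ Q_i K_i^T) / sqrt(d_k)) V_i.

    z:    (B, N_lat, d)   video latent tokens for one DiT block
    e_m:  (B, N_mot, d)   concatenated per-character motion embeddings ê_m
    mask: (B, N_lat, N_mot) binary; 1 where the latent position lies inside
          the face region driven by that motion token, 0 elsewhere
    w_q, w_k, w_v: query/key/value projection modules (d -> d)
    """
    q, k, v = w_q(z), w_k(e_m), w_v(e_m)
    d_k = q.shape[-1]
    logits = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)  # (B, N_lat, N_mot)
    # The binary mask M is applied as the standard additive -inf mask,
    # which zeroes attention outside each character's face region.
    logits = logits.masked_fill(mask == 0, float("-inf"))
    attn = torch.softmax(logits, dim=-1)
    attn = torch.nan_to_num(attn)        # latent rows that see no motion token stay unchanged
    return z + torch.matmul(attn, v)     # residual update Z'_i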

4. Dataset and Benchmark

4.1 Multi-Expr Dataset

A dedicated multi-character facial expression video dataset, curated from:

  • OpenVid-1M
  • OpenHumanVid

Processing pipeline:

  1. Detect number of people per clip (YOLOv8) → keep clips with ≥ 2 faces.
  2. Filter out low-quality or artifact-prone clips using aesthetic scoring and blur detection.
  3. Select clips with clear expression dynamics using facial landmark motion analysis.

Scale: ~30,000 high-quality video clips with descriptive captions.
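
As an illustration of the first filtering step, the sketch below counts detected people per sampled frame with an off-the-shelf YOLOv8 model and keeps clips in which at least two people remain visible. The aesthetic-scoring, blur, and landmark-motion filters are omitted, and this is not the authors' actual pipeline code; paths and thresholds are placeholders.

# Illustrative sketch of the "keep clips with >= 2 people" filter.
# Not the authors' pipeline; clip paths and thresholds are placeholders.
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # COCO-pretrained; class id 0 is "person"

def count_people(frame):
    result = detector(frame, verbose=False)[0]
    return int((result.boxes.cls == 0).sum())

def keep_clip(path, sample_every=30, min_people=2):
    cap = cv2.VideoCapture(path)
    counts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            counts.append(count_people(frame))
        idx += 1
    cap.release()
    return bool(counts) and min(counts) >= min_people

clips = ["example_clip_0.mp4", "example_clip_1.mp4"]  # placeholder paths
multi_person_clips = [c for c in clips if keep_clip(c)]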

(Figure: three-person animation example)

4.2 ExprBench

A public benchmark for evaluating expression-driven animation quality:

  • ExprBench-Single: Single-character test set
  • ExprBench-Multi: Multi-character test set

Video diversity:

  • Realistic human portraits
  • Anthropomorphic animals
  • Cartoon characters
  • Wide emotional range and head movements

(Figure: anthropomorphic animal example)

5. Experimental Results

5.1 Quantitative Performance

Evaluated metrics:

  • FID / FVD: Video quality
  • PSNR / SSIM: Frame reconstruction fidelity
  • LMD / MAE: Landmark and eye motion accuracy
  • AED / APD: Expression and head pose accuracy
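
FID, FVD, AED, and APD all rely on pretrained feature extractors, but the frame-fidelity metrics can be computed directly. A minimal scikit-image sketch for per-frame PSNR and SSIM is shown below (it assumes paired uint8 H×W×3 frames and is not the paper's evaluation script).

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_fidelity(generated, reference):
    """Average PSNR / SSIM over paired frames.
    generated, reference: equal-length lists of uint8 HxWx3 numpy arrays."""
    psnrs, ssims = [], []
    for gen, ref in zip(generated, reference):
        psnrs.append(peak_signal_noise_ratio(ref, gen, data_range=255))
        ssims.append(structural_similarity(ref, gen, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))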

FantasyPortrait consistently outperforms:

  • LivePortrait
  • Skyreels-A1
  • HunyuanPortrait
  • X-Portrait
  • Follow-Your-Emoji

5.2 User Study

32 participants rated each method (0–10 scale) across:

  • Video Quality (VQ)
  • Expression Similarity (ES)
  • Motion Naturalness (MN)
  • Expression Richness (ER)

FantasyPortrait scored highest in all categories, especially ES and ER.


5.3 Ablation Study

Key findings:

  • Removing the expression-augmented learning (EAL) module → Loss of fine-grained expression detail
  • Removing masked cross-attention (MCA) → Severe expression leakage in multi-character scenarios
  • Removing the Multi-Expr training data → Significant performance drop in multi-character generation

(Figure: ablation visual comparison)

6. Installation and Quick Start

6.1 Environment Setup

git clone https://github.com/Fantasy-AMAP/fantasy-portrait.git
cd fantasy-portrait

apt-get install ffmpeg
pip install -r requirements.txt
pip install flash_attn

6.2 Download Data and Models

Multi-Expr Dataset:

Models:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P
huggingface-cli download acvlab/FantasyPortrait --local-dir ./models

6.3 Inference

Single character:

bash infer_single.sh

Multi-character (same driver):

bash infer_multi.sh

Multi-character (different drivers):

bash infer_multi_diff.sh

7. Performance

torch_dtype   Persistent Params   Speed per Iteration   VRAM
bfloat16      None                15.5 s                40 GB
bfloat16      7B                  32.8 s                20 GB
bfloat16      0                   42.6 s                5 GB
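
The "Persistent Params" column is the number of DiT parameters kept resident on the GPU between denoising steps; offloading more of them trades speed for memory. The commented sketch below shows how this knob is typically exposed in Wan2.1-based, DiffSynth-style pipelines; whether the FantasyPortrait inference scripts use exactly this call or an equivalent flag is an assumption, so check the provided infer_*.sh scripts.

# Assumed mapping between the table rows and a persistent-parameter setting.
# The calls below follow the DiffSynth-style Wan2.1 convention; this is an
# assumption, not a confirmed FantasyPortrait API.

# Keep all DiT weights on the GPU (fastest, ~40 GB of VRAM):
#   pipe.enable_vram_management(num_persistent_param_in_dit=None)

# Keep about 7B parameters resident and offload the rest (~20 GB):
#   pipe.enable_vram_management(num_persistent_param_in_dit=7 * 10**9)

# Offload everything between steps (slowest, ~5 GB):
#   pipe.enable_vram_management(num_persistent_param_in_dit=0)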

8. FAQ

Q1: Minimum VRAM requirement?
5 GB with slower performance; 20 GB or more recommended for faster inference.

Q2: Can it handle cross-gender or cross-age reenactment?
Yes. The identity-agnostic implicit features allow stable cross-identity performance.

Q3: Will expressions get mixed in multi-character mode?
No. The masked cross-attention mechanism isolates features between characters.


9. Conclusion and Outlook

FantasyPortrait advances portrait animation with:

  • Expression-augmented implicit control: Captures nuanced emotions and complex lip motions.
  • Masked cross-attention: Delivers independent yet synchronized multi-character animation.

Future directions:

  • Inference acceleration for real-time use cases.
  • Ethical safeguards to detect and prevent misuse.