FantasyPortrait: Advancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

FantasyPortrait is a state-of-the-art framework designed to create lifelike and emotionally rich animations from static portraits. It addresses the long-standing challenges of cross-identity facial reenactment and multi-character animation by combining implicit expression control with a masked cross-attention mechanism. Built upon a Diffusion Transformer (DiT) backbone, FantasyPortrait can produce high-quality animations for both single and multi-character scenarios, while preserving fine-grained emotional details and avoiding feature interference between characters.


1. Background and Challenges

Animating a static portrait into a dynamic, expressive video is a complex task with broad applications:

  • Film production – breathing life into still images for storytelling.
  • Virtual communication – enabling expressive avatars for meetings or chats.
  • Gaming and interactive media – bringing characters to life without manual keyframing.

1.1 Limitations of Existing Methods

Most traditional methods rely on explicit geometric priors, such as:

  • Facial landmarks (2D keypoints)
  • 3D Morphable Models (3DMM)

While these can work in controlled scenarios, they struggle in:

  • Cross-identity reenactment: Large differences in facial structure (e.g., between genders, ages, or ethnicities) often lead to artifacts, motion distortions, and flickering.
  • Subtle emotion capture: Explicit geometry is insufficient to represent nuanced muscle movements or complex emotional cues.
  • Multi-character animation: Driving features for different characters often interfere with each other, causing “expression leakage.”

2. Core Innovations in FantasyPortrait

FantasyPortrait introduces two main innovations:

  1. Expression-Augmented Implicit Control
    Instead of relying on explicit geometry, FantasyPortrait extracts identity-agnostic implicit expression features from the driving video. A dedicated expression-augmented learning module enhances complex, fine-grained dynamics—especially lip motion and emotional expressions—while keeping head pose and eye motion consistent.

  2. Masked Cross-Attention for Multi-Character Control
    This mechanism ensures that expression features for each character are controlled independently while maintaining synchronized motion. It prevents feature leakage between characters, enabling natural group animations.

(Figure: single- vs. multi-character animation examples)

3. Technical Framework

3.1 Overview

FantasyPortrait is built on a Latent Diffusion Model (LDM) with a DiT backbone:

  1. Input: Static portrait image(s) + driving video(s)
  2. Implicit feature extraction: Capture expression-related motion without identity bias
  3. Expression-augmented encoding: Enhance challenging non-rigid facial motions
  4. Cross-attention fusion: Combine portrait and motion embeddings
  5. Video synthesis: Decode the latent animation into video frames

(Figure: system architecture)
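
To make the data flow concrete, here is a compact pseudocode sketch of the five steps above. Every function and attribute name in it is hypothetical and does not correspond to the repository's actual API.

# Hypothetical sketch of the inference flow described above (steps 1-5).
# All names (motion_extractor, expression_encoder, dit, vae) are illustrative,
# not the real FantasyPortrait interface.

def animate_portrait(portrait_image, driving_video, model):
    # 2. Identity-agnostic implicit motion features from the driving video
    e_lip, e_eye, e_head, e_emo = model.motion_extractor(driving_video)

    # 3. Expression-augmented encoding of the non-rigid components
    e_m = model.expression_encoder(e_emo, e_lip, e_head, e_eye)

    # 4. DiT denoising conditioned on the portrait latent and the motion embedding
    latents = model.dit.sample(portrait=model.encode(portrait_image), motion=e_m)

    # 5. Decode the latent animation back into video frames
    return model.vae.decode(latents)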

3.2 Expression-Augmented Implicit Control

Implicit expression representation is extracted via:

  • e_lip: Lip motion
  • e_eye: Eye gaze and blinking
  • e_head: Head pose
  • e_emo: Emotional expression

For complex, non-rigid movements (e_lip and e_emo), FantasyPortrait uses learnable tokens to break them down into smaller sub-features representing specific muscle groups or emotion dimensions. These interact with semantically aligned video tokens through multi-head cross-attention, capturing detailed region-specific relationships.

Formula:


e_m = Concat(E_a(e_emo), E_a(e_lip), e_head, e_eye)

Where:

  • E_a: Expression-augmented encoder
  • e_m: Motion embedding
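
The paper's exact module design is not reproduced here, but a minimal PyTorch sketch of the idea (learnable tokens decomposing e_emo and e_lip into sub-features through multi-head cross-attention, then concatenation with the untouched head-pose and eye features) could look like the following. All layer choices, dimensions, and the assumption that e_head and e_eye arrive already token-shaped are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class ExpressionAugmentedEncoder(nn.Module):
    """Sketch of E_a: learnable tokens decompose a non-rigid expression
    feature (e_emo or e_lip) into finer sub-features via multi-head
    cross-attention. Dimensions and layer choices are assumptions."""

    def __init__(self, feat_dim=512, num_tokens=8, num_heads=8):
        super().__init__()
        # Learnable sub-feature tokens, e.g. per muscle group / emotion dimension
        self.tokens = nn.Parameter(torch.randn(num_tokens, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, e):  # e: (B, T, feat_dim) video-aligned implicit feature
        q = self.tokens.unsqueeze(0).expand(e.shape[0], -1, -1)
        sub, _ = self.cross_attn(q, e, e)   # tokens attend to the video-aligned features
        return self.proj(sub)               # (B, num_tokens, feat_dim)

def build_motion_embedding(e_emo, e_lip, e_head, e_eye, enc):
    """e_m = Concat(E_a(e_emo), E_a(e_lip), e_head, e_eye) along the token axis.
    Assumes e_head and e_eye are already (B, k, feat_dim) token sequences."""
    return torch.cat([enc(e_emo), enc(e_lip), e_head, e_eye], dim=1)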

3.3 Masked Cross-Attention for Multi-Character Animation

For each character:

  1. Detect and crop the face region.
  2. Extract motion embeddings.
  3. Concatenate embeddings from all characters:

ê_m = {e_m^1, e_m^2, ..., e_m^N}

  4. Apply mask matrices in cross-attention layers to limit each motion embedding's influence to its corresponding face region.

Mathematically:


Z'_i = Z_i + softmax( (M ⊙ Q_i K_i^T) / √d_k ) V_i

  • M: Face mask mapped to latent space
  • ⊙: Element-wise multiplication
  • Q_i, K_i, V_i: Query, Key, and Value projections
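
A minimal PyTorch sketch of this update follows. The formula's element-wise mask is implemented in the usual way, by setting masked logits to -inf before the softmax so each latent position only attends to the motion tokens of its own character; the shapes, the residual form, and the mask layout are assumptions rather than the authors' exact code.

import math
import torch

def masked_cross_attention(z, e_m, mask, w_q, w_k, w_v):
    """Sketch of Z'_i = Z_i + softmax((M ⊙ Q_i K_i^T) / sqrt(d_k)) V_i.

    z:    (B, N_lat, d)   video latent tokens for one DiT block
    e_m:  (B, N_mot, d)   concatenated per-character motion embeddings ê_m
    mask: (B, N_lat, N_mot) binary; 1 where the latent position lies inside
          the face region driven by that motion token, 0 elsewhere
    w_q, w_k, w_v: query/key/value projection modules (d -> d)
    """
    q, k, v = w_q(z), w_k(e_m), w_v(e_m)
    d_k = q.shape[-1]
    logits = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)  # (B, N_lat, N_mot)
    # The binary mask M is applied as the standard additive -inf mask,
    # which zeroes attention outside each character's face region.
    logits = logits.masked_fill(mask == 0, float("-inf"))
    attn = torch.softmax(logits, dim=-1)
    attn = torch.nan_to_num(attn)        # latent rows that see no motion token stay unchanged
    return z + torch.matmul(attn, v)     # residual update Z'_i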

4. Dataset and Benchmark

4.1 Multi-Expr Dataset

A dedicated multi-character facial expression video dataset, curated from:

  • OpenVid-1M
  • OpenHumanVid

Processing pipeline:

  1. Detect number of people per clip (YOLOv8) → keep clips with ≥ 2 faces.
  2. Filter out low-quality or artifact-prone clips using aesthetic scoring and blur detection.
  3. Select clips with clear expression dynamics using facial landmark motion analysis.

Scale: ~30,000 high-quality video clips with descriptive captions.
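
As an illustration of the first filtering step, the sketch below counts detected people per sampled frame with an off-the-shelf YOLOv8 model and keeps clips in which at least two people remain visible. The aesthetic-scoring, blur, and landmark-motion filters are omitted, and this is not the authors' actual pipeline code; paths and thresholds are placeholders.

# Illustrative sketch of the "keep clips with >= 2 people" filter.
# Not the authors' pipeline; clip paths and thresholds are placeholders.
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # COCO-pretrained; class id 0 is "person"

def count_people(frame):
    result = detector(frame, verbose=False)[0]
    return int((result.boxes.cls == 0).sum())

def keep_clip(path, sample_every=30, min_people=2):
    cap = cv2.VideoCapture(path)
    counts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            counts.append(count_people(frame))
        idx += 1
    cap.release()
    return bool(counts) and min(counts) >= min_people

clips = ["example_clip_0.mp4", "example_clip_1.mp4"]  # placeholder paths
multi_person_clips = [c for c in clips if keep_clip(c)]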

(Figure: three-person animation example)

4.2 ExprBench

A public benchmark for evaluating expression-driven animation quality:

  • ExprBench-Single: Single-character test set
  • ExprBench-Multi: Multi-character test set

Video diversity:

  • Realistic human portraits
  • Anthropomorphic animals
  • Cartoon characters
  • Wide emotional range and head movements

(Figure: anthropomorphic animal example)

5. Experimental Results

5.1 Quantitative Performance

Evaluated metrics:

  • FID / FVD: Video quality
  • PSNR / SSIM: Frame reconstruction fidelity
  • LMD / MAE: Landmark and eye motion accuracy
  • AED / APD: Expression and head pose accuracy
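
FID, FVD, AED, and APD all rely on pretrained feature extractors, but the frame-fidelity metrics can be computed directly. A minimal scikit-image sketch for per-frame PSNR and SSIM is shown below (it assumes paired uint8 H×W×3 frames and is not the paper's evaluation script).

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_fidelity(generated, reference):
    """Average PSNR / SSIM over paired frames.
    generated, reference: equal-length lists of uint8 HxWx3 numpy arrays."""
    psnrs, ssims = [], []
    for gen, ref in zip(generated, reference):
        psnrs.append(peak_signal_noise_ratio(ref, gen, data_range=255))
        ssims.append(structural_similarity(ref, gen, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))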

FantasyPortrait consistently outperforms:

  • LivePortrait
  • Skyreels-A1
  • HunyuanPortrait
  • X-Portrait
  • Follow-Your-Emoji

5.2 User Study

32 participants rated each method (0–10 scale) across:

  • Video Quality (VQ)
  • Expression Similarity (ES)
  • Motion Naturalness (MN)
  • Expression Richness (ER)

FantasyPortrait scored highest in all categories, especially ES and ER.


5.3 Ablation Study

Key findings:

  • Removing the expression-augmented learning (EAL) module → Loss of fine-grained expression detail
  • Removing masked cross-attention (MCA) → Severe expression leakage in multi-character scenarios
  • Removing the Multi-Expr training data → Significant performance drop in multi-character generation

(Figure: ablation visual comparison)

6. Installation and Quick Start

6.1 Environment Setup

git clone https://github.com/Fantasy-AMAP/fantasy-portrait.git
cd fantasy-portrait

apt-get install ffmpeg
pip install -r requirements.txt
pip install flash_attn

6.2 Download Data and Models

Multi-Expr Dataset:

Models:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P
huggingface-cli download acvlab/FantasyPortrait --local-dir ./models

6.3 Inference

Single character:

bash infer_single.sh

Multi-character (same driver):

bash infer_multi.sh

Multi-character (different drivers):

bash infer_multi_diff.sh

7. Performance

torch_dtype   Persistent Params   Speed per Iteration   VRAM
bfloat16      None                15.5 s                40 GB
bfloat16      7B                  32.8 s                20 GB
bfloat16      0                   42.6 s                5 GB
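
The "Persistent Params" column is the number of DiT parameters kept resident on the GPU between denoising steps; offloading more of them trades speed for memory. The commented sketch below shows how this knob is typically exposed in Wan2.1-based, DiffSynth-style pipelines; whether the FantasyPortrait inference scripts use exactly this call or an equivalent flag is an assumption, so check the provided infer_*.sh scripts.

# Assumed mapping between the table rows and a persistent-parameter setting.
# The calls below follow the DiffSynth-style Wan2.1 convention; this is an
# assumption, not a confirmed FantasyPortrait API.

# Keep all DiT weights on the GPU (fastest, ~40 GB of VRAM):
#   pipe.enable_vram_management(num_persistent_param_in_dit=None)

# Keep about 7B parameters resident and offload the rest (~20 GB):
#   pipe.enable_vram_management(num_persistent_param_in_dit=7 * 10**9)

# Offload everything between steps (slowest, ~5 GB):
#   pipe.enable_vram_management(num_persistent_param_in_dit=0)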

8. FAQ

Q1: Minimum VRAM requirement?
5 GB with slower performance; 20 GB or more recommended for faster inference.

Q2: Can it handle cross-gender or cross-age reenactment?
Yes. The identity-agnostic implicit features allow stable cross-identity performance.

Q3: Will expressions get mixed in multi-character mode?
No. The masked cross-attention mechanism isolates features between characters.


9. Conclusion and Outlook

FantasyPortrait advances portrait animation with:

  • Expression-augmented implicit control: Captures nuanced emotions and complex lip motions.
  • Masked cross-attention: Delivers independent yet synchronized multi-character animation.

Future directions:

  • Inference acceleration for real-time use cases.
  • Ethical safeguards to detect and prevent misuse.