ToonComposer: Turn Hours of In-Betweening and Colorization into One Click
Project & Demo: https://lg-li.github.io/project/tooncomposer
What This Article Will Give You
- ❀ A plain-language tour of why cartoon production is slow today
- ❀ A step-by-step look at how ToonComposer removes two whole steps
- ❀ A zero-hype tutorial to install and run the open-source demo
- ❀ Real numbers and side-by-side images taken directly from the original paper
- ❀ A concise FAQ that answers the questions most people ask first
1. The Old Workflow: Three Pain Points You Already Know
Traditional 2-D or anime production breaks into three stages:
1. Keyframing – an artist draws the “story poses”.
2. In-betweening – assistants draw all the frames between those poses.
3. Colorization – painters add color to every single line drawing.
Pain points:
| Step | What hurts | How much time |
|---|---|---|
| In-betweening | Large motions need many keyframes; small teams drown | Days to weeks |
| Colorization | Needs a clean, full-detail sketch on every frame | Scales with every frame |
| Error build-up | Mistakes in step 2 propagate into step 3 | Redo loops |
2. Post-Keyframing: One New Stage, Two Jobs Done Together
The paper coins the term Post-Keyframing: after the main keyframes are drawn, in-betweening and colorization happen in one neural pass.
Inputs:
- ❀ One colored reference frame
- ❀ One or more sparse keyframe sketches (you decide where)
- ❀ A short text prompt describing the scene
Output:
- ❀ A complete, colored video segment at 480p or 608p
Figure 2: old pipeline vs. Post-Keyframing pipeline.
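To make that input/output contract concrete, here is a minimal sketch of what a Post-Keyframing call looks like. The `post_keyframing` function and its argument names are hypothetical, invented purely for illustration; the real entry point is the Gradio app described in Section 6.

```python
# Hypothetical wrapper: name and signature are invented for illustration.
from pathlib import Path

def post_keyframing(
    reference: Path,            # one colored reference frame
    sketches: dict[int, Path],  # sparse keyframe sketches, keyed by frame index
    prompt: str,                # short scene description
    num_frames: int = 61,
):
    """Returns a complete, colored clip: in-betweening and colorization
    happen in one neural pass instead of two manual stages."""
    ...

clip = post_keyframing(
    reference=Path("ref_colored.png"),
    sketches={0: Path("pose_start.png"), 40: Path("pose_turn.png")},
    prompt="a girl turns her head and smiles",
)
```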
3. How ToonComposer Works—Explained Like You’re Five (and Then Like You’re Twenty-Five)
3.1 The Five-Year-Old Version
You give the computer a colored picture and a few stick-figure sketches. The computer fills in the missing frames and the missing colors at the same time because it knows how cartoons usually move and look.
3.2 The Twenty-Five-Year-Old Version
| Building block | One-sentence purpose |
|---|---|
| DiT backbone (Wan 2.1) | A modern transformer that already knows motion from millions of real videos |
| Sparse Sketch Injection | Lets you drop single line-art “hints” at any frame index without re-training |
| Spatial Low-Rank Adapter (SLRA) | Retunes only the appearance layers so the model keeps its motion talent but looks like a cartoon (sketched below) |
| Region-wise Control | You can leave parts of the sketch blank; the model hallucinates background motion |
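To ground the SLRA row: below is a generic low-rank-adapter sketch in PyTorch, assuming the usual LoRA formulation. This is not the authors' code; the paper applies the adapter to the spatial (appearance) layers of the DiT while leaving the motion pathway untouched.

```python
import torch.nn as nn

class SpatialLowRankAdapter(nn.Module):
    """Generic LoRA-style adapter: the frozen base layer keeps its pretrained
    motion knowledge; only the small rank-r bypass is trained to shift
    appearance toward the cartoon domain."""
    def __init__(self, base: nn.Linear, rank: int = 144, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # backbone stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)            # adapter starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```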
4. Key Technical Details (No External Knowledge)
- ❀ SLRA rank = 144 (the adapter holds the only trainable parameters)
- ❀ Training data = 37k anime/cartoon clips, 4 synthetic sketch styles + 1 human-sketch model (IC-Sketcher)
- ❀ Loss = Rectified Flow velocity prediction with logit-normal timestep sampling (see the sketch after this list)
- ❀ Resolution = 480p or 608p square, 1–69 frames demonstrated
- ❀ VRAM = ~14 GB at 480p with flash-attention enabled
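Here is a toy version of the stated training objective, assuming the standard rectified-flow setup (an illustration, not the paper's training script): interpolate on a straight line between data and Gaussian noise, sample the timestep from a logit-normal distribution, and regress the constant velocity.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, cond):
    """Toy rectified-flow objective: regress the constant velocity
    (noise - x0) along the straight path from data to noise."""
    noise = torch.randn_like(x0)
    # Logit-normal timestep sampling: t = sigmoid(n) with n ~ N(0, 1).
    t = torch.sigmoid(torch.randn(x0.shape[0], device=x0.device))
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))     # broadcast over C, T, H, W
    x_t = (1 - t_b) * x0 + t_b * noise            # straight-line interpolation
    v_pred = model(x_t, t, cond)                  # model predicts velocity
    return F.mse_loss(v_pred, noise - x0)
```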
5. Benchmarks: Numbers and Side-by-Side Stills
5.1 Synthetic Test Set
Metrics: lower is better for LPIPS/DISTS, higher for CLIP.
| Method | LPIPS↓ | DISTS↓ | CLIP↑ |
|---|---|---|---|
| AniDoc | 0.3734 | 0.5461 | 0.8665 |
| LVCD | 0.3910 | 0.5505 | 0.8428 |
| ToonCrafter | 0.3830 | 0.5571 | 0.8463 |
| ToonComposer | 0.1785 | 0.0926 | 0.9449 |
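If you want to reproduce the LPIPS column on your own outputs, the `lpips` PyPI package implements the metric; the snippet below is a generic usage sketch, not the authors' evaluation harness.

```python
import torch
import lpips  # pip install lpips

# LPIPS compares deep network features, so it tracks perceived similarity
# better than raw pixel error. Inputs are RGB tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net='alex')
frame_out = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in generated frame
frame_gt = torch.rand(1, 3, 256, 256) * 2 - 1    # stand-in ground-truth frame
print(loss_fn(frame_out, frame_gt).item())       # lower = more similar
```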
5.2 Human-Drawn Sketches (PKBench)
| Method | Subject Consistency↑ | Motion Smoothness↑ |
|---|---|---|
| AniDoc | 0.9456 | 0.9842 |
| LVCD | 0.8653 | 0.9724 |
| ToonCrafter | 0.8567 | 0.9674 |
| ToonComposer | 0.9509 | 0.9910 |
5.3 47-Person User Study
- ❀ 70.99% preferred ToonComposer for aesthetic quality
- ❀ 68.58% preferred ToonComposer for motion quality
6. Install & Run in 10 Minutes
6.1 Prerequisites
- ❀ NVIDIA GPU with ≥ 16 GB VRAM
- ❀ CUDA 11.8 or newer
- ❀ Python 3.10
6.2 Commands
```bash
# Clone
git clone https://github.com/TencentARC/ToonComposer
cd ToonComposer

# Create env
conda create -n tooncomposer python=3.10 -y
conda activate tooncomposer
pip install -r requirements.txt
pip install flash-attn==2.8.0.post2 --no-build-isolation
```
6.3 Start Gradio UI
```bash
python app.py --device cuda:0 --resolution 480p
```

The browser UI opens at http://localhost:7860. The first run downloads 14 GB of weights automatically if they are not cached.
7. Using the Gradio Interface
| Panel | What to do |
|---|---|
| Prompt box | Type a short scene description |
| Color reference | Upload one full-color image |
| Keyframe sketches | Click on the timeline → upload line art |
| Region mask | Optional: black out areas you want the model to invent (see the mask sketch below) |
| CFG & residual sliders | Start at the defaults (7.5, 1.0); tweak later |
| Generate | Wait 30 s–3 min, depending on frame count |
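For the region mask, something like the following works as a starting point. The polarity and file format the UI expects are assumptions here (black = model invents, white = sketch-controlled); verify against the demo's own hints.

```python
import numpy as np
from PIL import Image

# Assumption: black (0) = regions the model may invent,
# white (255) = regions the sketch controls.
sketch = Image.open("keyframe_sketch.png").convert("L")
mask = np.full((sketch.height, sketch.width), 255, dtype=np.uint8)
mask[: sketch.height // 3, :] = 0   # free the top third, e.g. for a sky
Image.fromarray(mask).save("region_mask.png")
```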
8. Real-World Tips
- ❀ Too little VRAM? Use 480p or add `--no-flash` to disable flash-attention.
- ❀ Need longer clips? Generate 60-frame blocks and fade-blend them later (see the sketch after this list).
- ❀ Line art too messy? Run an edge-cleaner or simply redraw the key poses; the model tolerates rough lines but rewards clarity.
- ❀ Offline/air-gapped servers? Pre-download the weights:

  ```bash
  huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P
  huggingface-cli download TencentARC/ToonComposer
  ```

  Then set the environment variables:

  ```bash
  export WAN21_I2V_DIR=/path/to/Wan2.1-I2V-14B-480P
  export TOONCOMPOSER_DIR=/path/to/TencentARC-ToonComposer
  ```
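For the longer-clips tip above, a simple linear cross-fade between consecutive 60-frame blocks might look like this (a minimal numpy sketch, assuming the clips share a resolution and their seam frames roughly match):

```python
import numpy as np

def fade_blend(clip_a: np.ndarray, clip_b: np.ndarray, overlap: int = 8) -> np.ndarray:
    """Concatenate two (T, H, W, C) float clips, linearly cross-fading the
    last `overlap` frames of clip_a into the first `overlap` frames of clip_b."""
    alphas = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1, 1)
    seam = (1 - alphas) * clip_a[-overlap:] + alphas * clip_b[:overlap]
    return np.concatenate([clip_a[:-overlap], seam, clip_b[overlap:]], axis=0)
```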
9. Frequently Asked Questions
Q1. Can it do 3-D cartoons?
Yes. The authors fine-tuned a light variant on 3-D rendered clips; examples are in the supplementary video.
Q2. Do I need to draw every frame?
No. One colored reference and a single keyframe sketch already produce motion. More sketches give finer control.
Q3. Is the output commercially safe?
Weights are Apache-2.0. Input images must be your own or licensed.
Q4. Why does my train look flat when I leave the background blank?
Turn on Region-wise Control and explicitly mask the background; otherwise the model thinks you want empty blue.
Q5. Will 4 K come soon?
Not in this release. Authors cite GPU memory limits; future work will investigate cascaded super-resolution.
Q6. Can I train my own style?
Paper provides the SLRA recipe and training objective, but training scripts are not yet public.
Q7. How many keyframes are optimal?
- ❀ Simple head turn: 1–2
- ❀ Complex fight scene: 4–6
- ❀ Rule of thumb: add one sketch wherever motion changes direction.
Q8. Does it work on AMD or Apple Silicon?
Code is CUDA-only today; AMD ROCm and MPS forks are community efforts.
Q9. Can I disable color and get only line art?
Technically yes: feed a gray reference and gray sketches (see the snippet below), but the model still outputs fully colored frames.
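Desaturating the reference before upload is a PIL one-liner; whether the model then keeps the output gray is not guaranteed, as noted above.

```python
from PIL import Image

# Desaturate the color reference so it carries no hue information.
Image.open("reference.png").convert("L").convert("RGB").save("reference_gray.png")
```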
Q10. How do I cite it?
```bibtex
@article{li2025tooncomposer,
  title={ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing},
  author={Li, Lingen and others},
  journal={arXiv preprint arXiv:2508.10881},
  year={2025}
}
```
10. Takeaway
ToonComposer does not replace animators; it replaces the tedium between creative decisions.
Upload a color keyframe, scribble a few poses, and you have a watchable draft in minutes—leaving you free to refine storytelling rather than chase frames one by one.
Ready to try?
Project page: https://lg-li.github.io/project/tooncomposer
Online demo: https://huggingface.co/spaces/TencentARC/ToonComposer