Scone: Teaching AI to “Pick the Right Person” in a Crowd – A Leap Towards Precise Subject-Driven Image Generation
Snippet
The Scone model addresses a critical challenge in subject-driven image generation: accurately identifying and generating only the instruction-specified subject from a reference image containing multiple candidates. It introduces an “understanding bridge strategy” within a unified understanding-generation architecture, leveraging the early semantic advantages of the understanding expert to guide the generation process. This results in superior composition and distinction capabilities, achieving a leading overall score of 8.50 among open-source models on the new SconeEval benchmark.
Have you ever imagined handing an AI a group photo with a description, and it generates a perfect image of “the person wearing glasses” reading in a library? Or giving it a picture of your pet with a pile of toys and asking it to draw “the dog holding the red ball” running on the grass?
This captures the core challenge facing today’s subject-driven image generation technology: composition is easy, distinction is hard. Existing models excel at combining multiple independent subjects into a new scene. However, when a single reference image itself contains several candidate subjects, they often struggle to choose, leading to incorrect generation or omission of the intended subject.
Today, we delve into a groundbreaking piece of work from Peking University and Kuaishou’s Kling Team—the Scone model. It is not just a powerful image generator; it is a visual semantic understanding expert that has learned to “pick the right person from a picture.”
Part 1: The Core Problem: We’ve Overlooked the Power of “Distinction”
Subject-driven generation technology has advanced rapidly, evolving from handling single subjects to now fusing elements from four or even more reference images. Technical reports show that top-tier models like GPT-4o and Gemini exhibit remarkable potential in such “composition” tasks.
However, a long-overlooked capability gap is exposed in real-world complex scenarios: Distinction.
Consider this scenario:
- Reference Image 1: A group photo containing three people: A, B, and C.
- Instruction: “Generate an image of the person wearing the striped shirt from Reference Image 1 (i.e., person B) drinking coffee.”
Ideally, the model should accurately identify B’s features and insert only B into the new scene. Current models are prone to three types of errors:
- Subject Omission: None of the people appear in the generated image.
- Subject Error: Person A or C is generated instead.
- Subject Redundancy: All persons A, B, and C are crammed into the scene.
The research points out that this is because most existing methods assume reference images are “clean”—each image corresponds to a single, clear subject for extraction. Real-world images are filled with interference and intricate details. Models lack the ability to precisely lock onto a target subject within a “multi-candidate” environment and utilize its information effectively.
This is the fundamental problem Scone aims to solve: endowing the model with the ability to distinguish the target subject within complex contexts and leverage that information for precise generation.
Part 2: Scone’s Solution: Building a Bridge Between “Understanding” and “Generation”
Scone stands for Subject-driven Composition and Distinction Enhancement. Its core innovation lies in its architecture and training strategy.
2.1 The Foundation: A Unified Understanding-Generation Model
Scone is not built from scratch; it is constructed upon a base model called BAGEL, a unified understanding-generation architecture. Internally, this type of model has two “experts”:
- The Understanding Expert: Excels at parsing the semantics of images and text. It understands what is in the picture and what the instruction is saying.
- The Generation Expert: Excels at synthesizing high-quality images based on the provided information.
Research found that in the early processing stages, the understanding expert shows higher attention to image regions relevant to the text instruction (like the target subject), while the generation expert is more focused on texture details. This means the understanding expert can lock onto “where to look” earlier and more accurately.
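To get a feel for how such an observation can be quantified, here is a toy probe: given a text-to-image attention map and a mask marking the target-subject patches, it measures how much attention mass lands on the target. This is only an illustrative sketch; the function name and the dummy data are assumptions, and the paper’s actual analysis protocol may differ.

```python
import torch

def attention_on_target(attn, target_mask):
    """Fraction of a layer's text-to-image attention that lands on the
    target-subject patches. A simple probe for comparing the two experts'
    early layers; purely illustrative."""
    # attn: [num_text_tokens, num_image_patches], each row sums to 1
    # target_mask: [num_image_patches] boolean, True for target-subject patches
    return attn[:, target_mask].sum(dim=-1).mean().item()

torch.manual_seed(0)
attn = torch.softmax(torch.randn(8, 100), dim=-1)  # dummy attention map
target_mask = torch.zeros(100, dtype=torch.bool)
target_mask[:20] = True                            # pretend the first 20 patches are the target
print(attention_on_target(attn, target_mask))      # ~0.2 for random attention
```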
2.2 The Soul: The “Understanding Bridge” Strategy
Scone’s most critical contribution is the “Understanding Bridge Strategy.” The core idea is: Transform the understanding expert into a “semantic bridge” that conveys the high-level, clean semantic information it captures to guide the generation expert.
How is this bridge built? Through a two-stage training scheme:
Stage I: Composition Training (Learning to Compose)
- Goal: Teach the model basic subject composition capabilities.
- Data: Training is performed using “single-candidate” data (where each reference image contains only one unambiguous subject).
- Result: The model learns how to harmoniously combine one or multiple subjects from different images into a new scene.
Stage II: Distinction Training (Learning to Distinguish)
This is the essence of Scone, conducted in two steps:
- Step 1: Bridge Formation. “Multi-candidate” data is introduced (reference images containing multiple subjects). The understanding expert is trained to perform early cross-modal alignment and compute a “semantic mask.” This mask acts like a spotlight, illuminating only the image regions most relevant to the instruction (the target subject) while drastically reducing the attention weight on irrelevant regions (distractor subjects). At this point, the understanding expert becomes a qualified “semantic bridge.”
- Step 2: Bridge Guidance. With the formed “bridge” (understanding expert) kept fixed, the generation expert is trained to generate under its guidance. The generation expert learns to trust and follow the semantic focus provided by the understanding expert, thereby accurately extracting target features from cluttered reference images for generation.
Crucially, this entire process introduces no additional model parameters. It unlocks the inherent potential of the unified model entirely through a sophisticated training strategy.
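To make the “semantic mask as spotlight” idea more concrete, below is a minimal, hypothetical PyTorch sketch. It is not Scone’s actual implementation (Scone works inside BAGEL’s unified architecture rather than with standalone functions); the relevance heuristic, the additive attention down-weighting, and names such as `semantic_mask` and `bridge_guided_attention` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semantic_mask(text_tokens, image_tokens):
    """Per-patch relevance of the reference image to the instruction,
    squashed to [0, 1]. Stand-in for the understanding expert's early
    cross-modal alignment described above."""
    text = F.normalize(text_tokens, dim=-1)           # [T, d]
    image = F.normalize(image_tokens, dim=-1)         # [N, d]
    relevance = (image @ text.T).max(dim=-1).values   # [N]: best-matching text token per patch
    lo, hi = relevance.min(), relevance.max()
    return (relevance - lo) / (hi - lo + 1e-6)        # ~1 for target subject, ~0 for distractors

def bridge_guided_attention(query, ref_keys, ref_values, mask, eps=1e-6):
    """Generation-side attention over reference-image tokens, with the
    semantic mask acting as a spotlight: low-relevance patches are damped
    before the softmax, so distractor subjects barely contribute."""
    d = query.shape[-1]
    logits = (query @ ref_keys.T) / d ** 0.5          # [Q, N]
    logits = logits + torch.log(mask + eps)           # additive down-weighting of irrelevant patches
    return torch.softmax(logits, dim=-1) @ ref_values # [Q, d]

# Stage II, step 1 ("bridge formation"): train only the parameters that produce
# text_tokens / image_tokens so that semantic_mask() highlights the instructed subject.
# Stage II, step 2 ("bridge guidance"): freeze those parameters
# (e.g. understanding_expert.requires_grad_(False)) and train the generation expert
# under bridge_guided_attention().

if __name__ == "__main__":
    torch.manual_seed(0)
    text = torch.randn(12, 64)      # instruction tokens
    patches = torch.randn(256, 64)  # reference-image patches (multiple candidate subjects)
    q = torch.randn(64, 64)         # generation-expert queries
    mask = semantic_mask(text, patches)
    out = bridge_guided_attention(q, patches, patches, mask)
    print(mask.shape, out.shape)    # torch.Size([256]) torch.Size([64, 64])
```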
Part 3: How to Fairly Judge “Distinction” Ability? – The SconeEval Benchmark
The “distinction” capability has been hard to measure because past evaluation benchmarks (like OmniContext) focus primarily on “composition,” and their test scenarios are overly idealized.
To address this, the team constructed a new, more challenging evaluation benchmark: SconeEval.
- Scale: Contains 409 test cases, covering three domains: characters, objects, and scenes, across 19 case types.
- Three-Tier Task Difficulty:
  - Composition Task: The traditional task. Each reference image contains one subject, requiring single or multi-subject composition.
  - Distinction Task: Each reference image contains multiple subjects. The instruction specifies one of them to generate.
  - Distinction & Composition Task: The most complex! Multiple reference images are used, and each contains multiple subjects. The model must first distinguish the target in each image, then combine them into a new scene.
- Evaluation Metrics: Uses GPT-4.1 for automated evaluation, providing both a “Composition Score” (measuring instruction following and subject consistency) and a “Distinction Score” (measuring accuracy, precision, recall, etc., in identifying the target subject).
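The exact rubric is defined by SconeEval’s judge prompt, but a toy scorer helps show why precision and recall map onto the three failure modes: omission hurts recall, while wrong or redundant subjects hurt precision. Everything below, including the set-based matching and the 0-to-10 scaling, is an illustrative assumption rather than the benchmark’s actual scoring code.

```python
def distinction_score(expected_ids, generated_ids):
    """Toy per-case distinction metrics, assuming a judge has listed which
    candidate subjects appear in the output image."""
    expected, generated = set(expected_ids), set(generated_ids)
    tp = len(expected & generated)                           # target subjects that were generated
    precision = tp / len(generated) if generated else 0.0    # penalizes wrong / redundant subjects
    recall = tp / len(expected) if expected else 0.0         # penalizes omission
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "score_0_to_10": 10 * f1}

# "Generate only subject B," but the model rendered both A and B:
print(distinction_score({"B"}, {"A", "B"}))  # precision 0.5, recall 1.0, score ≈ 6.67
```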
SconeEval provides the community with the first systematic yardstick for evaluating a model’s “distinction” capability. As shown in the comparison, it is currently the only benchmark encompassing composition, distinction, and combined tasks.
Table 1: Task Comparison Between SconeEval and Existing Benchmarks
| Benchmark | Composition Task | Distinction Task | Distinction & Composition Task |
|---|---|---|---|
| DreamBench | ✓ | ✗ | ✗ |
| OmniContext | ✓ | ✗ | ✗ |
| SconeEval (Ours) | ✓ | ✓ | ✓ |
Part 4: The Proof is in the Data: How Does Scone Perform?
The theory is elegant, but actual results are what matter. Experiments were conducted on both the OmniContext and SconeEval benchmarks.
4.1 On the OmniContext Benchmark (Emphasizes Composition)
Scone achieved the highest average score (8.01) among open-source models, proving its powerful composition capability was not sacrificed by focusing on distinction. Its performance closely trailed top closed-source models like GPT-4o (8.78) and Gemini (8.07).
4.2 On the SconeEval Benchmark (Comprehensive Test)
The results here better demonstrate Scone’s unique advantages:
- Overall Score: Scone ranks first among open-source models with a score of 8.50.
- Distinction Capability: Scone’s distinction score is notably high at 8.79, significantly leading other open-source unified models (e.g., OmniGen2 at 7.81) and pure generation models (e.g., Qwen-Image-Edit at 7.65).
- Key Finding: Unified models (like OmniGen2, Echo-4o) generally achieved higher distinction scores than pure generation models. This validates that “understanding ability” is crucial for solving the distinction problem. Scone’s “understanding bridge” strategy maximizes this advantage.
Table 2: Quantitative Comparison on the SconeEval Benchmark (Selected Models)
| Method | Composition (Avg) | Distinction (Avg) | Overall Score |
|---|---|---|---|
| GPT-4o | 8.98 | 8.90 | 8.94 |
| Gemini-2.5-Flash | 8.56 | 8.84 | 8.70 |
| Scone (Ours) | 8.21 | 8.79 | 8.50 |
| Echo-4o | 8.05 | 8.14 | 8.09 |
| Qwen-Image-Edit-2509 | 7.76 | 7.65 | 7.70 |
| OmniGen2 | 7.39 | 7.81 | 7.60 |
4.3 Ablation Studies: Every Piece Matters
Ablation experiments were conducted to verify the effectiveness of key design choices:
- High-Quality Data: Using the filtered set of 22K refined single-candidate samples improved the overall score from 7.95 (with the 70K base data) to 8.02.
- Understanding Bridge Strategy: In Stage II training, the “two-step with bridge” strategy achieved a final score of 8.50, clearly outperforming “direct fine-tuning” (7.94) and edging out the “two-step without bridge” variant (8.43).
4.4 Human Evaluation Alignment
To validate the reliability of automated scoring, a user study involving 30 evaluators (including professionals) was conducted. When comparing outputs from Scone, OmniGen2, and UniWorld-V2, Scone received a normalized preference score of 0.46, far exceeding the 0.27 scores of the other two models. This confirms that GPT-4.1 ratings align closely with human judgment and that Scone’s output quality is genuinely preferred.
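One plausible reading of “normalized preference score” is each model’s share of all evaluator choices, so the three scores sum to 1. The vote counts below are invented purely to reproduce the reported shares.

```python
# Hypothetical reading of "normalized preference": each model's share of all votes.
votes = {"Scone": 414, "OmniGen2": 243, "UniWorld-V2": 243}  # illustrative counts only
total = sum(votes.values())
shares = {model: round(count / total, 2) for model, count in votes.items()}
print(shares)  # {'Scone': 0.46, 'OmniGen2': 0.27, 'UniWorld-V2': 0.27}
```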
Part 5: Seeing is Believing: Scone’s Generation Results
Descriptions can be abstract. Let’s visually appreciate Scone’s capabilities through examples from the research.
Scenario 1: Complex Composition
- Task: Combine subjects from four separate reference images (people, objects) into a single coffee shop scene.
- Observation: Scone successfully placed all subjects naturally and harmoniously into the new environment while maintaining high consistency in each subject’s individual characteristics.
Scenario 2: Precise Distinction
- Task: A reference image contains two different dogs. The instruction is to generate “the dog with brown ears” on a beach.
- Observation: Compared to other models, Scone was the only one that accurately generated the target dog (with brown ears), effectively avoiding subject error or redundancy.
Scenario 3: Distinction & Composition
- Task: Two reference images, each containing multiple people. The instruction asks to generate an image of “the person in the blue shirt from Image 1” and “the person wearing a hat from Image 2” talking together.
- Observation: Scone precisely picked the blue-shirted person from the crowd in Image 1 and the hat-wearing person from Image 2, combining them into a new conversational scene. Other models exhibited misidentification or omission.
These cases demonstrate that Scone has made substantial progress in reducing subject redundancy, confusion, and omission.
Part 6: Practical Information: Model, Code, and Data
The Scone project emphasizes openness and reproducibility.
- Model Weights: The final Scone model checkpoint is available on Hugging Face.
- Training & Inference Code: The complete codebase, including scripts for both training stages and inference, is open-sourced on GitHub.
- Training Data: The team has released the Scone-S2I-57K dataset on Hugging Face, which includes the refined single-candidate data (22K samples) and the constructed multi-candidate data (35K samples) used for training.
- Evaluation Benchmark: The entire SconeEval benchmark, comprising 409 test cases, is also publicly released as a dataset.
This comprehensive release allows researchers and developers to not only use the model but also to reproduce the results, conduct further ablation studies, or benchmark their own methods against Scone on a fair and challenging ground.
Part 7: Limitations and Future Directions
Of course, Scone is not omnipotent. The paper candidly acknowledges a limitation it shares with existing methods: unrealistic physical interactions. For example, in a generated image, a dog might appear to pass “through” a chair leg, violating basic physical laws. This indicates that models still have room for improvement in understanding complex spatial relationships and physical constraints between objects.
For the future, the team’s research will focus on developing more efficient mechanisms to reduce redundant image token processing, enabling scalable subject-driven generation in even more complex scenarios.
FAQ: What You Might Want to Know About Scone
Q1: How does Scone compare to GPT-4o and Gemini?
A1: On the OmniContext benchmark, which emphasizes composition, top closed-source models (GPT-4o, Gemini) still lead. However, on the new SconeEval benchmark that stresses “distinction” capability, Scone, as an open-source model, shows competitive overall performance (8.50) close to Gemini (8.70) and demonstrates unique strengths in its distinction score. It provides the research and developer community with a high-performance, reproducible strong baseline.
Q2: Are the model, code, and training data open source?
A2: Yes, this project is fully open source. The paper, training/inference code, and model weights have been publicly released. The team has also published the Scone-S2I-57K training dataset and the SconeEval evaluation benchmark. All resources are available on Hugging Face and GitHub.
Q3: How can I use Scone?
A3: Researchers or developers can clone the GitHub repository and follow the provided instructions to set up the environment. The project includes detailed scripts for single-case inference, as well as for batch inference and evaluation on both the OmniContext and SconeEval benchmarks. Basic familiarity with Python and configuring deep learning environments is required.
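As a hypothetical starting point (the real repository and checkpoint identifiers are not reproduced here, so the `repo_id` below is a placeholder), fetching the released weights with `huggingface_hub` might look like this:

```python
# Minimal sketch: download the released checkpoint, then follow the project's
# own README for environment setup and the provided inference scripts.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/<scone-checkpoint>")  # placeholder repo id
print("Weights downloaded to:", local_dir)
# Next steps (per the project's GitHub README, not shown here):
#   1. Clone the GitHub repository and install its requirements.
#   2. Run the single-case or batch inference scripts, pointing them at local_dir.
```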
Q4: Can this technology be used commercially?
A4: The model’s license depends on the licenses of its base model (BAGEL) and the data it uses. Before considering commercial use, it is essential to carefully review the license files (e.g., LICENSE) provided with the model weights, code, and data to ensure compliant usage.
Q5: What is the value of the SconeEval benchmark dataset?
A5: SconeEval is the first benchmark to systematically evaluate “subject distinction” capability. It not only contains 409 carefully constructed test cases but also provides a complete evaluation protocol (prompt) and scripts. This is an invaluable resource for any research team looking to improve or assess their own model’s distinction ability.
Conclusion
The work on the Scone model shifts the research focus in subject-driven image generation from “how to collage” to “how to accurately select and collage.” Through its innovative “understanding bridge” strategy, it cleverly leverages the inherent advantages of unified models, offering a new approach to solving generation tasks in complex real-world visual scenes.
With the comprehensive open-sourcing of code, model, data, and benchmark, this work not only contributes a powerful tool but also pushes the entire field toward a more rigorous and practically relevant direction. The next generation of image generation models may truly become “visual designers” capable of understanding complex instructions and discerning fine details.

