
Solving Spatial Confusion: How CoMPaSS Transforms Text-to-Image Diffusion Models

CoMPaSS: A Framework for Better Spatial Understanding in Text-to-Image Models

Hey there, if you’re into text-to-image generation, you’ve probably noticed how these models can create stunning, realistic pictures from just a description. But have you ever wondered why they sometimes mess up simple things like “a cat to the left of a dog”? It turns out, getting spatial relationships right—like left, right, above, or below—is trickier than it seems. That’s where CoMPaSS comes in. It’s a framework designed to help existing diffusion models handle these spatial details more accurately. In this post, I’ll walk you through what CoMPaSS is, how it works, and how you can try it out yourself. We’ll break it down step by step, so even if you’re not a deep learning expert, you’ll get the idea.

Let’s start with the basics. What exactly is CoMPaSS? It’s a system that tackles two main problems in text-to-image diffusion models: unclear data about spatial relationships and text encoders that don’t always capture the order and meaning of words properly. By fixing these, CoMPaSS helps models generate images that match the spatial setups described in prompts more faithfully.

Why Do Text-to-Image Models Struggle with Spatial Relationships?

Imagine you’re describing a scene: “A brown leather sofa is placed left of a tall bookshelf, a vintage table lamp sits beside a hardcover book, a wall clock hangs above a ceramic vase, while a sleepy cat rests inside a woven basket.” Models like FLUX.1 might get the objects right but jumble the positions. Why does this happen?

From what we’ve seen in datasets like LAION, CC12M, and COCO, spatial descriptions are often ambiguous. For example:

  • Words like “left” or “right” can mean different things depending on perspective—is it from the viewer’s angle or from the object’s own orientation?
  • Sometimes spatial terms aren’t even about space, like “the right choice.”
  • References might be missing, like “looking to the right” without saying right of what.

On top of that, text encoders (the parts that turn words into usable data for the model) don’t always preserve the order of tokens in a prompt. This means the model might not distinguish between “A left of B” and “B left of A” as well as it should.

CoMPaSS addresses this with two key parts: a data engine called SCOP for creating better training data, and a module called TENOR that helps the model understand text order better. The result? Images that actually reflect the spatial configurations you describe, as shown in this teaser image from the project:

[Teaser image from the project page]

In experiments, adding CoMPaSS to models like FLUX.1-dev, SD1.4, SD1.5, and SD2.1 improved scores on benchmarks like VISOR by 98%, T2I-CompBench Spatial by 67%, and GenEval Position by 131%. That’s a big leap without hurting the model’s overall image quality.

What Is the SCOP Data Engine and How Does It Work?

You might be asking, “How do you fix ambiguous data?” SCOP, or Spatial Constraints-Oriented Pairing, is a data engine that pulls out clear spatial relationships from images and pairs them with accurate text descriptions. It’s like a filter that ensures only unambiguous pairs make it into the training data.

Here’s how SCOP processes an image in three stages:

  1. Relationship Reasoning: Start with an image and identify all objects using bounding boxes and categories (like from COCO). Then, list every possible pair of objects. For an image with, say, a person, motorcycle, car, truck, and chair, you might get 15 pairs.

  2. Spatial Constraints Enforcement: Not all pairs are useful. SCOP applies five rules to keep only the clear ones:

    • Visual Significance: The pair must take up enough space in the image (area of union over image area > threshold τ_v). This ensures the relationship is prominent.
    • Semantic Distinction: Objects must be different categories (e.g., no two motorcycles).
    • Spatial Clarity: Objects can’t be too far apart (distance between centers relative to the smaller object’s diagonal < τ_u).
    • Minimal Overlap: They shouldn’t overlap too much (intersection over min area < τ_o), but some overlap is okay for things like “on top of.”
    • Size Balance: Sizes should be comparable (min area over max area > τ_s) so neither dominates.

    These constraints whittle down the pairs—for example, from 15 to 5 unambiguous ones. (A small code sketch of these checks appears right after this list.)

  3. Relationship Decoding: For valid pairs, create descriptors like “(cup, box) (couch, box).” Then, turn these into image crops and text prompts from templates, like “a cup on top of a couch.”
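
To make the five constraints concrete, here is a minimal Python sketch of how the checks could look for a single candidate pair. It assumes COCO-style (x, y, w, h) bounding boxes and uses illustrative threshold values; the real thresholds and implementation live in the SCOP directory, so treat this as a reading aid rather than the official code.

    import math

    def passes_scop_constraints(box_a, box_b, cat_a, cat_b, image_area,
                                tau_v=0.1, tau_u=3.0, tau_o=0.5, tau_s=0.3):
        """Illustrative check of the five SCOP constraints (thresholds are made up)."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        area_a, area_b = aw * ah, bw * bh

        # Intersection and union areas of the two boxes.
        ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = area_a + area_b - inter

        # 1. Visual significance: the pair must occupy enough of the image.
        if union / image_area <= tau_v:
            return False
        # 2. Semantic distinction: the two objects must belong to different categories.
        if cat_a == cat_b:
            return False
        # 3. Spatial clarity: centers not too far apart, measured against the
        #    smaller object's diagonal.
        center_dist = math.hypot((ax + aw / 2) - (bx + bw / 2),
                                 (ay + ah / 2) - (by + bh / 2))
        small_w, small_h = min((aw, ah), (bw, bh), key=lambda s: s[0] * s[1])
        if center_dist / math.hypot(small_w, small_h) >= tau_u:
            return False
        # 4. Minimal overlap: some overlap is fine, heavy overlap is not.
        if inter / min(area_a, area_b) >= tau_o:
            return False
        # 5. Size balance: neither object should dwarf the other.
        if min(area_a, area_b) / max(area_a, area_b) <= tau_s:
            return False
        return True

For five detected objects, itertools.combinations would produce the 15 candidate pairs mentioned above, and a filter like this is what trims them down to the unambiguous few.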

When applied to the COCO training split, SCOP creates a dataset with 28,028 object pairs from 15,426 images. It’s small compared to massive datasets, but focused and high-quality. A human study showed 85.2% agreement with these pairs, meaning they’re reliable.

If you’re thinking, “Can I replicate this dataset?”, the answer is yes—the instructions are in the SCOP directory’s README. It uses the COCO dataset, so you’ll need that as a base.
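
To give a rough sense of what that replication involves, the sketch below feeds COCO annotations into the constraint check from the previous section. It assumes pycocotools and the standard instances_train2017.json file; SCOP/README.md remains the authoritative recipe.

    from itertools import combinations
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2017.json")
    for img_id in coco.getImgIds():
        info = coco.loadImgs(img_id)[0]
        image_area = info["width"] * info["height"]
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
        # Every unordered pair of annotated objects is a candidate.
        for a, b in combinations(anns, 2):
            if passes_scop_constraints(a["bbox"], b["bbox"],
                                       a["category_id"], b["category_id"],
                                       image_area):
                names = [coco.loadCats(x["category_id"])[0]["name"] for x in (a, b)]
                print(img_id, *names)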

Exploring the TENOR Module: Preserving Token Order

Now, even with great data, the model needs to understand the text properly. That’s where TENOR, or Token ENcoding ORdering, comes in. It’s a plug-and-play module that adds no extra parameters and barely increases computation time.

Why do we need it? Text encoders like CLIP or T5 often lose the sequential order of words. In a test with 6,320 prompts using COCO objects and relations like left/right/above/below, encoders failed to match equivalent rephrasings (e.g., “A left of B” vs. “B right of A”) over 95% of the time. They picked swapped or negated versions instead.

TENOR fixes this by injecting token order info into the model’s attention mechanism. It reinforces the prompt’s structure, helping the model distinguish spatial setups. It’s compatible with UNet-based models (like SD1.5) and MMDiT-based ones (like FLUX.1).
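
The exact mechanism lives in the TENOR code, but the general idea (re-injecting explicit token-position information into the text features that condition attention) can be sketched with a parameter-free module like the one below. This is my own illustration of the concept, not the actual TENOR implementation.

    import math
    import torch
    import torch.nn as nn

    class OrderAwareConditioning(nn.Module):
        """Toy example: add a fixed sinusoidal position signal to text features.

        This is NOT TENOR itself, just an illustration of how token order can be
        reinforced without adding any learnable parameters.
        """

        def __init__(self, dim: int, max_tokens: int = 512):
            super().__init__()
            position = torch.arange(max_tokens).unsqueeze(1).float()
            div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
            table = torch.zeros(max_tokens, dim)
            table[:, 0::2] = torch.sin(position * div)
            table[:, 1::2] = torch.cos(position * div)
            self.register_buffer("pos_table", table)  # a buffer, not a parameter

        def forward(self, text_hidden_states: torch.Tensor) -> torch.Tensor:
            # text_hidden_states: (batch, num_tokens, dim) from the text encoder.
            num_tokens = text_hidden_states.shape[1]
            return text_hidden_states + self.pos_table[:num_tokens]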

For training and inference, check the TENOR directory. There are specific instructions for FLUX.1-dev and for SD1.4/1.5/2.1.

How to Set Up and Try CoMPaSS Yourself

Ready to give it a go? Let’s talk about getting started. The project uses a Python environment managed with uv, and there’s a script to make it easy.

Step-by-Step Environment Setup

  1. Run the Setup Script: From the project root, execute:

    bash ./setup_env.sh
    

    This installs requirements into a .venv/ subdirectory.

  2. Activate the Environment: Once done, activate it with:

    source .venv/bin/activate
    

That’s it—you’re set. Note that for full training, you’ll need both SCOP and TENOR. For just generating images, TENOR and reference weights suffice.

Downloading Reference Weights

The project provides weights on Hugging Face for quick testing. These are the ones used in the paper’s metrics. Here’s a table of options:

Model Link
FLUX.1-dev https://huggingface.co/blurgy/CoMPaSS-FLUX.1
SD1.4 https://huggingface.co/blurgy/CoMPaSS-SD1.4
SD1.5 https://huggingface.co/blurgy/CoMPaSS-SD1.5
SD2.1 https://huggingface.co/blurgy/CoMPaSS-SD2.1

Start with FLUX.1-dev—it’s a small 50MB Rank-16 LoRA.
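
If you prefer scripting the download, the huggingface_hub client can fetch any of these repositories. A minimal example, using the FLUX.1-dev repo id from the table above:

    from huggingface_hub import snapshot_download

    # Downloads the CoMPaSS FLUX.1-dev reference weights into the local HF cache
    # and returns the local directory path.
    local_dir = snapshot_download("blurgy/CoMPaSS-FLUX.1")
    print(local_dir)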

Using the SCOP Dataset

To build the SCOP dataset yourself:

  • Go to the SCOP directory.
  • Follow the README for replicating the 28,028 pairs from COCO.
  • It involves processing images with the engine described earlier.

Training and Inference with TENOR

For FLUX.1-dev:

  • Check TENOR/flux/README.md for details.

For SD models:

  • See TENOR/sd/README.md.

These guides cover both training on SCOP data and generating images from text prompts.

How Does CoMPaSS Fit into the Bigger Picture of Text-to-Image Research?

Text-to-image diffusion models have come a long way, from early ones like GLIDE and DALL-E 2 to recent ones like PixArt, SD3, and FLUX.1. They use large datasets and encoders like CLIP or T5 to connect text and images.

But spatial control is a hot area. Some methods add training for specific tasks or use inference tricks like bounding boxes. CoMPaSS stands out because it focuses on data quality (via SCOP) and text handling (via TENOR) without heavy overhead.

It’s inspired by issues in datasets and encoders, and it works across architectures without compromising general capabilities.

FAQ: Common Questions About CoMPaSS

Here, I’ll answer some questions you might have, based on what people often ask about similar projects.

What is CoMPaSS and how does it improve text-to-image models?
CoMPaSS is a framework that boosts spatial understanding in diffusion models. It uses SCOP to create clear spatial data and TENOR to preserve text order, leading to better alignment between prompts and generated images.

How does SCOP handle ambiguous spatial descriptions?
SCOP filters pairs with constraints like visual significance, semantic distinction, spatial clarity, minimal overlap, and size balance. This ensures only clear relationships are used.

What are the spatial constraints in SCOP?

  • Visual Significance: Area(union) / Area(image) > τ_v
  • Semantic Distinction: Different categories
  • Spatial Clarity: Distance between centers / smaller object’s diagonal < τ_u
  • Minimal Overlap: Intersection / Min area < τ_o
  • Size Balance: Min area / Max area > τ_s

Why does TENOR matter for spatial relationships?
Text encoders often ignore token order, so “A left of B” might look like “B left of A.” TENOR adds order info to attention, helping the model get it right.

Can I use CoMPaSS with my own diffusion model?
Yes, it’s compatible with UNet (SD1.4/1.5/2.1) and MMDiT (FLUX.1) architectures. Follow the TENOR instructions for integration.

How big is the SCOP dataset and where does it come from?
It’s 28,028 pairs from 15,426 COCO images. You can replicate it using the SCOP README.

What benchmarks show CoMPaSS improvements?
It excels on VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%).

Do I need special hardware for training?
The README doesn’t specify, but since it’s based on existing models, standard GPU setups for diffusion training should work. Start with inference to test.

How do I cite CoMPaSS if I use it in my work?
Use this BibTeX:

@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}

Is CoMPaSS only for spatial relationships, or does it help with other things?
It’s focused on spatial, but it maintains or improves image fidelity without hurting general generation.

How-To Guide: Generating Images with CoMPaSS

If you’re eager to generate your own images, here’s a simple how-to based on the project setup.

Step 1: Set Up the Environment

As mentioned earlier:

  • Run bash ./setup_env.sh
  • Activate with source .venv/bin/activate

Step 2: Download Weights

Pick a model from the table above, say FLUX.1-dev. Download from Hugging Face and place in your project.

Step 3: Prepare TENOR

  • For FLUX: Follow TENOR/flux/README.md to load the module.
  • Input a prompt like “a motorcycle to the right of a bear.”
  • Run inference—the model should now handle the spatial part better (a hedged code sketch follows below).
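
The model-specific README is the definitive guide, but to give a feel for what inference can look like, here is a hedged sketch built on diffusers’ FluxPipeline. It assumes the reference weights load as a standard diffusers-style LoRA; if the TENOR integration needs the project’s own loading code, follow TENOR/flux/README.md instead.

    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
                                        torch_dtype=torch.bfloat16).to("cuda")
    # Assumption: the CoMPaSS reference weights are loadable as a diffusers LoRA.
    pipe.load_lora_weights("blurgy/CoMPaSS-FLUX.1")

    image = pipe("a motorcycle to the right of a bear",
                 num_inference_steps=28, guidance_scale=3.5).images[0]
    image.save("motorcycle_right_of_bear.png")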

Step 4: If Training, Use SCOP

  • Build the dataset per SCOP/README.md.
  • Train TENOR on it, as per the model-specific guides.

This process lets you see the difference firsthand. For example, without CoMPaSS, positions might flip; with it, they stick to the prompt.

Diving Deeper: The Analysis Behind TENOR

You might wonder, “How did they figure out text encoders are the problem?” They ran a proxy task: Take a base prompt like “A to the left of B,” create variations (rephrased, negated, swapped), and see if the encoder picks the right match by similarity.

Results in a table:

Text Encoder        Correct (%)   Rephrased   Negated   Swapped
CLIP ViT-L          0.02%         1           5088      1231
OpenCLIP ViT-H      0%            0           6054      266
OpenCLIP ViT-bigG   0.03%         2           6067      251
T5-XXL              4.84%         306         4777      1237

(The last three columns count which variant each encoder rated most similar to the base prompt, out of 6,320 prompts; “Correct” means the spatially equivalent rephrasing won.)

Even big encoders fail most of the time. TENOR steps in to preserve that order.
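
If you want to poke at this yourself, here is a small sketch of the same kind of probe using the CLIP text encoder from Hugging Face transformers. The model choice and prompts are mine, not the paper’s exact setup: compare a base prompt against an equivalent rephrasing and a swapped variant, and check which one the encoder rates as closer.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    prompts = {
        "base": "a cat to the left of a dog",
        "rephrased": "a dog to the right of a cat",  # same spatial layout
        "swapped": "a dog to the left of a cat",     # layout flipped
    }

    inputs = processor(text=list(prompts.values()), return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)

    # Cosine similarity of the base prompt to each variant.
    sims = feats[0] @ feats[1:].T
    for name, sim in zip(list(prompts)[1:], sims.tolist()):
        print(f"{name}: {sim:.4f}")
    # An order-aware encoder should score "rephrased" clearly above "swapped";
    # in practice the two often come out nearly tied.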

Related Concepts in Diffusion Models

If you’re familiar with diffusion models, CoMPaSS builds on ideas like using pre-trained encoders and fine-tuning for tasks. It avoids heavy additions, unlike some spatial control methods that need extra supervision or optimization.

For instance, training-based approaches adapt weights but can be costly. Inference-only ones use boxes but require manual input. CoMPaSS balances by curating data and tweaking attention lightly.

Wrapping Up: Why CoMPaSS Matters for Your Projects

Working with text-to-image models, I’ve found spatial accuracy to be a real pain point—especially for applications like design or storytelling where positions matter. CoMPaSS makes it easier without overcomplicating things. It’s efficient, data-focused, and easy to integrate.

If you try it, start small with FLUX.1-dev and a simple prompt. See how it handles “a bird below a skateboard” compared to the base model. The project page (https://compass.blurgy.xyz) and arXiv (https://arxiv.org/abs/2412.13195) have more visuals and details.

Authors include Gaoyang Zhang (https://github.com/blurgyy), Qingnan Fan (https://fqnchina.github.io), Qi Zhang (https://qzhang-cv.github.io), and others from Zhejiang University, vivo, and Ant Group.

This framework isn’t about flashy new models; it’s about fixing core issues thoughtfully. Give it a shot and let me know how it goes—spatial generation just got a lot smarter.
