USO: A Practical Guide to Unified Style and Subject-Driven Image Generation

“Upload one photo of your pet, pick any art style, type a sentence—USO does the rest.”


Table of Contents

  1. What Exactly Is USO?
  2. Why Couldn’t We Do This Before?
  3. Getting Started: Hardware, Software, and Low-Memory Tricks
  4. Four Everyday Workflows (with Ready-to-Copy Commands)
  5. Side-by-Side Results: USO vs. Popular Alternatives
  6. Troubleshooting & FAQs
  7. How It Works—Explained Like You’re Five
  8. Quick Reference & Next Steps

1. What Exactly Is USO?

USO stands for Unified Style and Subject-driven Generation.
In plain words, it is an open-source image model that merges two previously separate tasks:

| Task | Old Way | USO Way |
| --- | --- | --- |
| Style transfer | Change the look but keep the original layout. | Change the look and place any subject anywhere. |
| Subject-driven generation | Keep the subject but ignore style. | Keep the subject and paint it in any style. |
| Both at once | Two tools plus heavy editing. | One command line. |

You need only three inputs:

  1. A subject image (your cat, your product, your avatar).
  2. A style image (oil painting, cyberpunk render, hand-drawn sketch).
  3. A text prompt (what the subject should be doing or where it should be).
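Those three inputs map directly onto a single `inference.py` call. The helper below is a hypothetical convenience wrapper (not part of the USO codebase) that shows how the pieces fit together, with the subject path first and style paths after it:

```python
import shlex

def build_uso_command(prompt, subject_path="", style_paths=()):
    """Assemble an inference.py invocation from the three inputs.

    An empty subject_path means style-only generation; the subject
    image, when present, always comes first in --image_paths.
    """
    args = ["python", "inference.py", "--prompt", prompt,
            "--image_paths", subject_path, *style_paths]
    return " ".join(shlex.quote(a) for a in args)

print(build_uso_command("A corgi wearing a tuxedo",
                        subject_path="my_corgi.jpg",
                        style_paths=["oil_painting.jpg"]))
```

The same wrapper covers every workflow in Section 4 just by leaving arguments empty.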

2. Why Couldn’t We Do This Before?

Three bottlenecks held everyone back:

| Bottleneck | What Went Wrong | USO Fix |
| --- | --- | --- |
| Task silos | Style models and subject models lived in separate code bases. | A single model trained for both tasks. |
| Feature tangle | Models couldn’t tell “style” from “subject” inside one picture. | Separate encoders for style and content. |
| Lack of triplet data | Training sets only had pairs (input, output); USO needs triplets of subject, style, and final image. | Built 200k triplets with automated experts. |

3. Getting Started: Hardware, Software, and Low-Memory Tricks

3.1 Minimum & Recommended Specs

| Item | Minimum | Sweet Spot |
| --- | --- | --- |
| OS | Linux / Windows 10+ | Same |
| Python | 3.10–3.12 | 3.11 |
| CUDA | 11.8 | 12.1+ |
| VRAM | 8 GB (with offload) | 16 GB |
| RAM | 16 GB | 32 GB |

3.2 Quick Install

```bash
# 1. Create environment
python -m venv uso_env
source uso_env/bin/activate   # Windows: uso_env\Scripts\activate

# 2. Install torch
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124

# 3. Clone and install USO
git clone https://github.com/bytedance/USO.git
cd USO
pip install -r requirements.txt

# 4. Download weights
cp example.env .env
# Edit .env and set HF_TOKEN=your_huggingface_token
python weights/downloader.py
```

3.3 Low-VRAM Cheat Sheet

If you only have 8–12 GB of VRAM, add two flags:

```bash
python inference.py \
  --prompt "A corgi wearing a tuxedo" \
  --image_paths "my_corgi.jpg" "oil_painting.jpg" \
  --offload \
  --model_type flux-dev-fp8
```

  • --offload moves weights to system RAM when idle.
  • --model_type flux-dev-fp8 uses 8-bit precision; peak usage drops to ≈ 16 GB instead of 24 GB.
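To see why 8-bit weights roughly halve the footprint, here is a back-of-the-envelope calculation. It assumes a FLUX.1-dev-sized model of about 12 billion parameters (an approximate figure; activations, text encoders, and caches add real-world overhead on top):

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Raw weight storage only; activations and caches add more."""
    return n_params * bytes_per_param / 1024**3

N = 12e9  # rough parameter count, assumed for illustration
print(f"bf16 weights: {weight_memory_gb(N, 2):.1f} GB")
print(f"fp8 weights:  {weight_memory_gb(N, 1):.1f} GB")
```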

4. Four Everyday Workflows (with Ready-to-Copy Commands)

Tip: In every command, the first path is the subject image.
If you don’t need a subject, pass an empty string "".

4.1 Subject-Only: Keep the Face, Change the Scene

```bash
python inference.py \
  --prompt "The girl is riding a bike in the street" \
  --image_paths "assets/gradio_examples/identity1.jpg" \
  --width 1024 --height 1024
```

What happens: USO keeps the girl’s face and identity while the prompt moves her onto a bike in a new street scene.

4.2 Style-Only: New Look, Random Subject

```bash
python inference.py \
  --prompt "A cat sleeping on a chair" \
  --image_paths "" "assets/gradio_examples/style1.webp"
```

What happens: Any cat will appear, painted in the style you provided.

4.3 Subject + Style: The Full Combo

```bash
python inference.py \
  --prompt "The woman gave an impassioned speech on the podium" \
  --image_paths "assets/gradio_examples/identity2.webp" "assets/gradio_examples/style2.webp"
```

What happens: The same woman, delivering her speech at a podium, rendered in the selected art style.

4.4 Multi-Style Blend

```bash
python inference.py \
  --prompt "A handsome man" \
  --image_paths "" "assets/gradio_examples/style3.webp" "assets/gradio_examples/style4.webp"
```

What happens: The model merges both styles into one coherent image.


5. Side-by-Side Results: USO vs. Popular Alternatives

All numbers come from the USO-Bench dataset (50 subjects × 50 styles × multiple prompts).

| Metric | Meaning | USO | Next Best |
| --- | --- | --- | --- |
| CLIP-I ↑ | Subject similarity | 0.623 | 0.605 (UNO) |
| DINO ↑ | Identity consistency | 0.793 | 0.789 (UNO) |
| CSD ↑ | Style similarity | 0.557 | 0.540 (InstantStyle-XL) |
| CLIP-T ↑ | Text alignment | 0.282 | 0.282 (StyleStudio) |
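CLIP-I, DINO, and CSD all reduce to a cosine similarity between embedding vectors of two images (reference vs. generated), while CLIP-T compares an image embedding against a text embedding. A dependency-free sketch of that core operation (toy 3-dimensional vectors stand in for real CLIP/DINO embeddings, which have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" of a reference image and a generated image.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.1, 0.9]))
```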

Visual Example

[Comparison strip: left, reference style; center, InstantStyle; right, USO. Notice the closer brush-stroke match in the USO panel.]


6. Troubleshooting & FAQs

Q1: Can I run this on AMD or Apple Silicon?

  • AMD GPUs: Not officially tested; use ROCm at your own risk.
  • Apple Silicon: M-series chips lack CUDA; currently unsupported.

Q2: Do I need a portrait, or will any object work?

Any clear subject works—pets, products, logos, cartoons.

Q3: How do I keep the original pose?

Leave the text prompt empty (pass ""). USO will then lock the layout and swap only the style.

Q4: What licenses apply?

  • Code: Apache 2.0
  • Weights: Same as base FLUX.1 dev (check original repo)
  • Generated images: User’s responsibility—respect local laws.

Q5: Where are my images saved?

By default, outputs land in ./outputs/YYYY-MM-DD/.
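The dated layout can be reproduced with a few lines of Python (illustrative only; the exact naming inside `inference.py` may differ):

```python
from datetime import date
from pathlib import Path

# e.g. outputs/2025-01-30 on the day you run it
out_dir = Path("outputs") / date.today().isoformat()
out_dir.mkdir(parents=True, exist_ok=True)
print(out_dir)
```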


7. How It Works—Explained Like You’re Five

Imagine two artists:

  • Artist S only cares about style (colors, brushwork).
  • Artist C only cares about content (who or what is in the picture).

USO trains them together so they learn when to listen to each other and when to ignore each other.

7.1 Data Factory: Building 200k Triplets

  1. Start with public photos and AI-generated images.
  2. Run Expert A → turns any photo into a stylized version.
  3. Run Expert B → removes style, turning stylized images back to plain photos.
  4. Now you have neat triplets:

     (plain subject, style image, stylized subject)

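The four steps above can be sketched with stand-in functions for the two experts. This is purely illustrative: the real experts are image models, and the dictionaries here are stand-ins for actual images.

```python
def expert_a(photo, style_ref):
    """Stand-in for the stylization expert: photo + style -> stylized image."""
    return {"content": photo, "style": style_ref}

def expert_b(stylized):
    """Stand-in for the de-stylization expert: stylized image -> plain photo."""
    return stylized["content"]

def build_triplets(photos, style_refs):
    triplets = []
    for photo in photos:
        for style in style_refs:
            stylized = expert_a(photo, style)          # step 2: stylize
            plain = expert_b(stylized)                 # step 3: de-stylize
            triplets.append((plain, style, stylized))  # step 4: one triplet
    return triplets

print(len(build_triplets(["cat.jpg"], ["oil.jpg", "sketch.jpg"])))
```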
7.2 Two-Stage Training

| Stage | Goal | What’s Trained |
| --- | --- | --- |
| 1. Style alignment | Make the model understand new styles fast | Only the lightweight projector |
| 2. Disentanglement | Teach it to mix any subject with any style | The full DiT backbone |

7.3 Style Reward Loop

After every few steps, a reward model scores the image:

  • “Does the new picture look like the style reference?”
  • High score → keep learning direction.
  • Low score → adjust weights.

This extra feedback pushes quality beyond normal training.
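A toy, REINFORCE-flavored sketch of the idea (not USO's actual training objective): each update is scaled by how far the style score sits above or below a baseline, so good matches reinforce the current direction and poor ones push back.

```python
def reward_update(weight, grad, style_score, baseline=0.5, lr=0.1):
    """Toy reward-weighted step: scores above the baseline reinforce the
    current direction; scores below it push the weight the other way."""
    advantage = style_score - baseline
    return weight + lr * advantage * grad

w = 1.0
w = reward_update(w, grad=2.0, style_score=0.9)  # good style match: step forward
w = reward_update(w, grad=2.0, style_score=0.1)  # poor match: step back
print(round(w, 3))
```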


8. Quick Reference & Next Steps

8.1 One-Page Command Cheat Sheet

| Goal | Subject Image | Style Image | Prompt | Extra Flags |
| --- | --- | --- | --- | --- |
| Subject only | yes | no | describe scene | none |
| Style only | no | yes | describe content | none |
| Both | yes | yes | describe scene | none |
| Low VRAM | any | any | any | --offload --model_type flux-dev-fp8 |

8.2 File Tree After Install

```
USO/
├── inference.py          # Main script
├── app.py                # Gradio demo
├── weights/downloader.py # Fetch checkpoints
├── outputs/              # Generated images
└── examples/             # Sample images & prompts
```

8.3 Gradio Web Demo

```bash
python app.py
# Open http://localhost:7860 in your browser
```

For low memory:

```bash
python app.py --offload --name flux-dev-fp8
```

Happy creating!