USO: A Practical Guide to Unified Style and Subject-Driven Image Generation
“Upload one photo of your pet, pick any art style, type a sentence—USO does the rest.”
Table of Contents
- What Exactly Is USO?
- Why Couldn’t We Do This Before?
- Getting Started: Hardware, Software, and Low-Memory Tricks
- Four Everyday Workflows (with Ready-to-Copy Commands)
- Side-by-Side Results: USO vs. Popular Alternatives
- Troubleshooting & FAQs
- How It Works—Explained Like You’re Five
- Quick Reference & Next Steps
1. What Exactly Is USO?
USO stands for Unified Style and Subject-driven Generation.
In plain words, it is an open-source image model that merges two previously separate tasks:
| Task | Old Way | USO Way |
| --- | --- | --- |
| Style Transfer | Change the look but keep the original layout. | Change the look and place any subject anywhere. |
| Subject-Driven Generation | Keep the subject but ignore style. | Keep the subject and paint it in any style. |
| Both at Once | Two tools + heavy editing. | One command line. |
You need only three inputs:
- A subject image (your cat, your product, your avatar).
- A style image (oil painting, cyberpunk render, hand-drawn sketch).
- A text prompt (what the subject should be doing or where it should be).
2. Why Couldn’t We Do This Before?
Three bottlenecks held everyone back:
| Bottleneck | What Went Wrong | USO Fix |
| --- | --- | --- |
| Task Silos | Style models and subject models lived in separate code bases. | Single model trained for both tasks. |
| Feature Tangle | Models couldn’t tell “style” from “subject” inside one picture. | Separate encoders for style and content. |
| Lack of Triple Data | Training sets only had pairs (input, output). USO needs triplets: subject, style, final image. | Built 200k triplets with automated experts. |
3. Getting Started: Hardware, Software, and Low-Memory Tricks
3.1 Minimum & Recommended Specs
| Item | Minimum | Sweet Spot |
| --- | --- | --- |
| OS | Linux / Windows 10+ | Same |
| Python | 3.10–3.12 | 3.11 |
| CUDA | 11.8 | 12.1+ |
| VRAM | 8 GB (with offload) | 16 GB |
| RAM | 16 GB | 32 GB |
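Not sure what your card has? This small PyTorch snippet (a convenience check, not part of the USO repo) prints your GPU’s VRAM so you can compare it against the table:

```python
import torch

# Report the GPU's total VRAM and compare against the 8 GB minimum above.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    print("OK with --offload + fp8" if vram_gb >= 8 else "Below the 8 GB minimum")
else:
    print("No CUDA GPU detected (see FAQ Q1 for AMD / Apple Silicon).")
```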
3.2 Step-by-Step Install
```bash
# 1. Create and activate a virtual environment
python -m venv uso_env
source uso_env/bin/activate   # Windows: uso_env\Scripts\activate

# 2. Install torch
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124

# 3. Install USO's dependencies (run inside the cloned repository)
pip install -r requirements.txt

# 4. Download weights
cp example.env .env
# Edit .env and set HF_TOKEN=your_huggingface_token
python weights/downloader.py
```
3.3 Low-VRAM Cheat Sheet
If you only have 8–12 GB VRAM, add two flags:
```bash
python inference.py \
  --prompt "A corgi wearing a tuxedo" \
  --image_paths "my_corgi.jpg" "oil_painting.jpg" \
  --offload \
  --model_type flux-dev-fp8
```
- `--offload` moves weights to system RAM when idle.
- `--model_type flux-dev-fp8` uses 8-bit precision; peak usage drops to ≈16 GB instead of 24 GB.
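If you are curious what offloading looks like under the hood, here is a minimal PyTorch sketch of the general idea (an illustration of the technique only; USO’s actual implementation may differ):

```python
import torch

def run_offloaded(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Keep weights in system RAM and move them to the GPU only while needed."""
    model.to("cuda")              # load weights onto the GPU for this step
    with torch.no_grad():
        out = model(batch.to("cuda"))
    model.to("cpu")               # park weights back in system RAM when idle
    torch.cuda.empty_cache()      # hand the freed VRAM back to the driver
    return out.cpu()
```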
4. Four Everyday Workflows (with Ready-to-Copy Commands)
Tip: In every command, the first path is the subject image.
If you don’t need a subject, pass an empty string (`""`).
4.1 Subject-Only: Keep the Face, Change the Scene
```bash
python inference.py \
  --prompt "The girl is riding a bike in the street" \
  --image_paths "assets/gradio_examples/identity1.jpg" \
  --width 1024 --height 1024
```
What happens: USO keeps the girl’s face and pose but places her on a new street background.
4.2 Style-Only: New Look, Random Subject
```bash
python inference.py \
  --prompt "A cat sleeping on a chair" \
  --image_paths "" "assets/gradio_examples/style1.webp"
```
What happens: Any cat will appear, painted in the style you provided.
4.3 Subject + Style: The Full Combo
```bash
python inference.py \
  --prompt "The woman gave an impassioned speech on the podium" \
  --image_paths "assets/gradio_examples/identity2.webp" "assets/gradio_examples/style2.webp"
```
What happens: Same woman, same podium, but rendered in the selected art style.
4.4 Multi-Style Blend
```bash
python inference.py \
  --prompt "A handsome man" \
  --image_paths "" "assets/gradio_examples/style3.webp" "assets/gradio_examples/style4.webp"
```
What happens: The model merges both styles into one coherent image.
5. Side-by-Side Results: USO vs. Popular Alternatives
All numbers come from the USO-Bench dataset (50 subjects × 50 styles × multiple prompts).
| Metric | Meaning | USO | Next Best |
| --- | --- | --- | --- |
| CLIP-I ↑ | Subject similarity | 0.623 | 0.605 (UNO) |
| DINO ↑ | Identity consistency | 0.793 | 0.789 (UNO) |
| CSD ↑ | Style similarity | 0.557 | 0.540 (InstantStyle-XL) |
| CLIP-T ↑ | Text alignment | 0.282 | 0.282 (StyleStudio) |
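For intuition, CLIP-I is essentially a cosine similarity between CLIP image embeddings. A rough sketch of how such a score is computed (assuming the widely used openai/clip-vit-base-patch32 checkpoint; USO-Bench may use a different one):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i(subject_ref: str, generated: str) -> float:
    """Cosine similarity between the CLIP embeddings of two images."""
    images = [Image.open(p).convert("RGB") for p in (subject_ref, generated)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    return float(feats[0] @ feats[1])                 # higher = more similar
```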
Visual example (figure omitted): left, reference style; center, InstantStyle; right, USO, with the closer brush-stroke match.
6. Troubleshooting & FAQs
Q1: Can I run this on AMD or Apple Silicon?
- AMD GPUs: not officially tested; use ROCm at your own risk.
- Apple Silicon: M-series chips lack CUDA; currently unsupported.
Q2: Do I need a portrait, or will any object work?
Any clear subject works—pets, products, logos, cartoons.
Q3: How do I keep the original pose?
Leave the text prompt empty (`""`). USO will lock the layout and swap only the style.
Q4: What licenses apply?
- Code: Apache 2.0
- Weights: same as the base FLUX.1 dev model (check the original repo)
- Generated images: the user’s responsibility; respect local laws.
Q5: Where are my images saved?
By default, outputs land in `./outputs/YYYY-MM-DD/`.
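If you want to grab today’s results from a script, the dated folder is easy to reconstruct (a small convenience snippet; the exact naming is an assumption based on the pattern above):

```python
from datetime import date
from pathlib import Path

out_dir = Path("outputs") / date.today().isoformat()  # e.g. outputs/2025-09-01
print(sorted(out_dir.glob("*")))                      # list today's generations
```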
7. How It Works—Explained Like You’re Five
Imagine two artists:
- Artist S only cares about style (colors, brushwork).
- Artist C only cares about content (who or what is in the picture).
USO trains them together so they learn when to listen to each other and when to ignore each other.
7.1 Data Factory: Building 200k Triplets
1. Start with public photos and AI-generated images.
2. Run Expert A → turns any photo into a stylized version.
3. Run Expert B → removes style, turning stylized images back into plain photos.
4. You now have neat triplets: (plain subject, style image, stylized subject).
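In code, the factory loop might look like this minimal sketch, where `stylize` (Expert A) and `destylize` (Expert B) are hypothetical placeholders for the paper’s expert models:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Triplet:
    subject: str   # plain photo of the subject
    style: str     # style reference image
    target: str    # the subject rendered in that style

def make_triplet(photo: str,
                 stylize: Callable[[str], str],    # Expert A (hypothetical)
                 destylize: Callable[[str], str],  # Expert B (hypothetical)
                 ) -> Triplet:
    styled = stylize(photo)     # Expert A: photo -> stylized version
    plain = destylize(styled)   # Expert B: strip the style back out
    # Here the stylized image doubles as its own style reference; a real
    # pipeline could pair the target with any other exemplar of that style.
    return Triplet(subject=plain, style=styled, target=styled)
```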
7.2 Two-Stage Training
| Stage | Goal | What’s Trained |
| --- | --- | --- |
| 1. Style Alignment | Make the model understand new styles fast | Only the lightweight projector |
| 2. Disentanglement | Teach it to mix any subject with any style | Full DiT backbone |
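In PyTorch terms, the two stages boil down to which parameters have gradients enabled. A toy sketch (module names and shapes are illustrative, not USO’s real attributes):

```python
import torch.nn as nn

class UsoLike(nn.Module):
    """Stand-in with the two parts named in the table (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.style_projector = nn.Linear(768, 3072)   # lightweight projector
        self.dit_backbone = nn.Sequential(            # stand-in for the DiT
            nn.Linear(3072, 3072), nn.GELU(), nn.Linear(3072, 3072)
        )

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = UsoLike()

# Stage 1 (style alignment): train only the lightweight projector.
set_trainable(model, False)
set_trainable(model.style_projector, True)

# Stage 2 (disentanglement): unfreeze the full backbone as well.
set_trainable(model.dit_backbone, True)
```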
7.3 Style Reward Loop
After every few steps, a reward model scores the image:
- “Does the new picture look like the style reference?”
- High score → keep the current learning direction.
- Low score → adjust the weights.
This extra feedback pushes quality beyond normal training.
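As a toy illustration, the reward can be folded into the loss so that poor style matches are penalized harder (the reward model and the weighting here are placeholders, not USO’s actual objective):

```python
import torch

def reward_weighted_loss(base_loss: torch.Tensor,
                         generated: torch.Tensor,
                         style_ref: torch.Tensor,
                         reward_model) -> torch.Tensor:
    """Scale the loss by how badly the style reference was matched."""
    with torch.no_grad():
        # Placeholder scorer returning a value in [0, 1]:
        # "does the new picture look like the style reference?"
        reward = reward_model(generated, style_ref)
    # High reward -> small multiplier (keep the current direction);
    # low reward -> large multiplier (push the weights to adjust).
    return base_loss * (2.0 - reward)
```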
8. Quick Reference & Next Steps
8.1 One-Page Command Cheat Sheet
| Goal | Subject Image | Style Image | Prompt | Extra Flags |
| --- | --- | --- | --- | --- |
| Subject only | yes | no | describe scene | none |
| Style only | no | yes | describe content | none |
| Both | yes | yes | describe scene | none |
| Low VRAM | any | any | any | `--offload --model_type flux-dev-fp8` |
8.2 File Tree After Install
```
USO/
├── inference.py            # Main script
├── app.py                  # Gradio demo
├── weights/downloader.py   # Fetch checkpoints
├── outputs/                # Generated images
└── examples/               # Sample images & prompts
```
8.3 Gradio Web Demo
```bash
python app.py
# Open http://localhost:7860 in your browser
```
For low memory:
```bash
python app.py --offload --name flux-dev-fp8
```
8.4 Further Reading
- Technical paper: arXiv:2508.18966
- Online demo: Hugging Face Space
- Source code: GitHub
Happy creating!