Data-Augmentation in 2025: How to Train a Vision Model with Only One Photo per Class
(A plain-English walkthrough of the DALDA framework)
By an industry practitioner who has spent the last decade turning research papers into working products.
Contents
- Why the “one-photo” problem matters
- Meet DALDA in plain words
- How the pieces fit together
- Install everything in 15 minutes
- Run your first 1-shot experiment
- Reading the numbers: diversity vs. accuracy
- Troubleshooting mini-FAQ
- Where to go next
1. Why the “one-photo” problem matters
Imagine you are a quality-control engineer at a small factory. Every time a new scratch pattern appears on the production line, you can only take one clear photograph before the line is cleaned.
The machine-learning team still wants a classifier that can tell new scratches from old ones.
Traditional flip / rotate / crop tricks only re-arrange pixels; they do not invent new angles, lighting, or backgrounds.
Large diffusion models (Stable Diffusion, DALL-E, Midjourney) can invent new scenes, but without guardrails they happily generate a cat when you asked for a scratch.
DALDA, short for Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling, is the open-source answer to this exact dilemma.
It keeps the creative power of diffusion models while making sure the result still belongs to the correct category.
2. Meet DALDA in plain words
Component | Everyday analogy | What it really does |
---|---|---|
GPT-4 (or any LLM) | A talkative art director | Writes ten different English sentences that describe the single photo you gave it, adding new but realistic details such as “viewed from above, under soft morning light”. |
CLIPScore | A picky curator | Measures how well the original photo matches its class name. High score = faithful photo; low score = ambiguous photo. |
Adaptive Guidance Scaling (AGS) | A volume knob | Turns the influence of the original photo up or down depending on CLIPScore. Prevents the diffusion model from drifting too far. |
Stable Diffusion + IP-Adapter | The actual painter | Generates brand-new images that respect both the art-director sentences and the volume-knob guidance. |
After one click, you receive ten new images that look different yet remain unmistakably the same class.
3. How the pieces fit together
3.1 CLIPScore in one minute
CLIP is a neural network that converts both images and text into the same numeric space.
The cosine similarity between the two vectors is the CLIPScore:
- 0.90 – 1.00 → excellent match
- 0.70 – 0.89 → good match
- 0.30 – 0.69 → risky match
- < 0.30 → probably off-topic
DALDA uses the official CLIP-ViT-B/16 checkpoint to compute this score automatically.
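To make the score concrete, here is a minimal sketch of the cosine-similarity step and the rough bands listed above. It assumes the image and text embeddings have already been extracted; the three-number vectors are toy stand-ins rather than real CLIP outputs, and the function names `cosine_similarity` and `describe_match` are made up for this walkthrough, not taken from the DALDA repo.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def describe_match(score):
    """Map a score to the rough bands from the list above."""
    if score >= 0.90:
        return "excellent match"
    if score >= 0.70:
        return "good match"
    if score >= 0.30:
        return "risky match"
    return "probably off-topic"


# Toy stand-ins for real CLIP embeddings (assumption: real ones are much longer).
image_vec = [0.9, 0.1, 0.4]
text_vec = [0.8, 0.2, 0.5]
score = cosine_similarity(image_vec, text_vec)
print(f"{score:.3f} -> {describe_match(score)}")
```

In a real run the two vectors would come from the CLIP image encoder and text encoder; only the comparison step is shown here.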
3.2 AGS: the volume knob
Inside the IP-Adapter code, each cross-attention layer receives two signals:
- Text signal (from the LLM sentence)
- Image signal (from your single photo)
A scalar called λ (lambda) decides the mix:
new_attention = text_attention + λ × image_attention
- High CLIPScore (≥ 0.30 default) → λ is lowered (0.1–0.4) so the text can explore new poses.
- Low CLIPScore (< 0.30) → λ is raised (0.7–0.9) so the original photo keeps the diffusion model on a short leash.
λ is sampled from a truncated normal distribution so you still get variety rather than a single fixed weight.
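The volume knob can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repo's implementation: the λ ranges and the 0.30 threshold come from the text above, while the mean and standard deviation of the truncated normal, the rejection-sampling approach, and every function name are choices made for this example.

```python
import random

HIGH_SCORE_RANGE = (0.1, 0.4)  # λ range when CLIPScore is high (from the text)
LOW_SCORE_RANGE = (0.7, 0.9)   # λ range when CLIPScore is low (from the text)


def sample_truncated_normal(low, high, rng):
    """Draw from a normal distribution restricted to [low, high] by rejection."""
    mean = (low + high) / 2   # assumption: centred on the range
    std = (high - low) / 4    # assumption: illustrative spread
    while True:
        value = rng.gauss(mean, std)
        if low <= value <= high:
            return value


def sample_lambda(clip_score, rng, threshold=0.30):
    """Pick the λ range from the CLIPScore, then sample within it."""
    low, high = HIGH_SCORE_RANGE if clip_score >= threshold else LOW_SCORE_RANGE
    return sample_truncated_normal(low, high, rng)


def mix_attention(text_attention, image_attention, lam):
    """new_attention = text_attention + λ × image_attention, element by element."""
    return [t + lam * i for t, i in zip(text_attention, image_attention)]


rng = random.Random(0)
lam = sample_lambda(clip_score=0.78, rng=rng)
print(f"lambda = {lam:.3f}")
print(mix_attention([1.0, 2.0], [10.0, 20.0], lam))
```

Sampling instead of fixing λ means that two images generated from the same photo and prompt can still lean on the reference image to different degrees.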
3.3 Prompt generation example
You give GPT-4 the dataset description and the class name.
For the class Pomeranian it may return:
“A fluffy Pomeranian dog trots merrily along a path bordered by radiant tulips, its tail gently lifted in tune with the whimsical spring air.”
Ten such sentences are produced for each training image.
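As a rough picture of what gets sent to the LLM, the sketch below assembles a request from the two inputs named above. The wording of the instruction, the function name `build_prompt_request`, and the example dataset description are all hypothetical; the repo's actual prompt template may look quite different.

```python
def build_prompt_request(dataset_description, class_name, num_prompts=10):
    """Assemble a plain-text instruction for the LLM (hypothetical wording)."""
    return (
        f"Dataset: {dataset_description}\n"
        f"Class: {class_name}\n"
        f"Write {num_prompts} different one-sentence image descriptions of this class. "
        "Vary the viewpoint, lighting, and background, but keep every detail realistic."
    )


request = build_prompt_request(
    dataset_description="Photos of cat and dog breeds",  # hypothetical description
    class_name="Pomeranian",
)
print(request)
```

The returned string would then be passed to whichever LLM client you use; that call is left out because it depends on your provider.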
4. Install everything in 15 minutes
4.1 Hardware & OS
- Linux (Ubuntu 20.04+), macOS 12+, or Windows 10/11 with WSL2
- NVIDIA GPU with ≥ 8 GB VRAM (RTX 3060, 4060, 3090, 4090, A100, etc.)
- CUDA 11.8 or 12.x drivers already working (nvidia-smi shows your card)
4.2 Software stack
# 1. Create a clean environment
conda create -n dalda python=3.10 -y
conda activate dalda
# 2. Clone the repository
git clone https://github.com/xxx/dalda.git
cd dalda
pip install -r requirements.txt # ~2 GB download
# 3. Download IP-Adapter weights (needed for AGS)
git lfs install
git clone https://huggingface.co/h94/IP-Adapter
mv IP-Adapter/models ./models
4.3 Download a sample dataset
Oxford Pets is small (800 MB) and has 37 pet classes, which makes it a good fit for a first run.
# Images
wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz
tar -xvf images.tar.gz
# Train/val lists
wget https://thor.robots.ox.ac.uk/~vgg/data/pets/annotations.tar.gz
tar -xvf annotations.tar.gz
# Move into expected folder
mkdir -p datasets/oxford_pet
mv images annotations datasets/oxford_pet/
Folder tree must look like:
datasets
└── oxford_pet
├── images
└── annotations
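A quick way to confirm the layout before training is a small check like the one below. It is a convenience sketch written for this walkthrough (the function name `missing_dataset_dirs` is not part of the repo) and it only tests that the two folders from the tree above exist.

```python
from pathlib import Path


def missing_dataset_dirs(root="datasets"):
    """Return the expected Oxford Pets folders that are not present under root."""
    base = Path(root) / "oxford_pet"
    expected = ("images", "annotations")
    return [str(base / name) for name in expected if not (base / name).is_dir()]


missing = missing_dataset_dirs()
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("Dataset layout looks right.")
```

Run it from the repository root so that the relative path `datasets` resolves correctly.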
5. Run your first 1-shot experiment
5.1 Generate prompts with an LLM
Create a file called .env in the root directory:
AZURE_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_API_KEY=your_key
AZURE_API_VERSION=2023-07-01-preview
Then:
python generate_prompt_by_LLM.py
The script produces prompts/oxford_pet_llm.json with ten sentences for each of the 37 classes.
(No OpenAI key? Use the sample prompts already shipped with the repo.)
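If the prompt script fails straight away, a missing or misspelled entry in .env is a likely culprit. The sketch below parses simple KEY=value lines and reports which of the three keys shown above are absent. It is a stand-alone illustration: the helper names are invented for this example, and the repo may load the file differently.

```python
REQUIRED_KEYS = ("AZURE_ENDPOINT", "AZURE_API_KEY", "AZURE_API_VERSION")


def parse_env(text):
    """Parse KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """List required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Sample content with one key left out on purpose.
sample = """
# Azure OpenAI settings
AZURE_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_API_KEY=your_key
"""
print(missing_keys(parse_env(sample)))
```

To check your own file, read it with `open(".env").read()` and pass the text to `parse_env`.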
5.2 Train the classifier
python train_classifier.py \
--config cfg_dalda_pets_llm_prompt_adaptive_scaling_clip
Key config lines you might tweak:
Key | Meaning | Default |
---|---|---|
examples_per_class | Shots per class | 1 |
synthetic_probability | Fraction of synthetic images in each batch (0.0 to 1.0) | 0.5 |
resume | Continue if interrupted | [seed, shots, class_id, img_id] |
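One way to picture synthetic_probability is as a coin flip per training sample. The sketch below assumes per-sample selection, which is a guess about how the dataloader behaves rather than something confirmed from the repo; `pick_source` is a name invented for this example.

```python
import random


def pick_source(synthetic_probability, rng):
    """Return 'synthetic' with the given probability, otherwise 'real'."""
    return "synthetic" if rng.random() < synthetic_probability else "real"


rng = random.Random(42)
batch = [pick_source(0.5, rng) for _ in range(16)]
print(batch.count("synthetic"), "synthetic /", batch.count("real"), "real")
```

With a value of 0.5, individual batches will not be split exactly in half, but the ratio evens out over an epoch.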
Example log:
Epoch 1/50 - train_acc: 0.67 val_acc: 0.71
Epoch 2/50 - train_acc: 0.73 val_acc: 0.78
...
5.3 Inspect the new images
Outputs land in:
outputs/oxford_pet/synthetic/
├── 0_ABYSSINIAN_aug-00-0.png
├── 0_ABYSSINIAN_aug-01-0.png
...
Open a few files and verify they still look like the correct animal.
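Before opening images by hand, it can help to count how many files were produced per class. The sketch below assumes the file names follow the pattern visible in the listing above (a numeric prefix, the class name, then `_aug-` and two numbers); the meaning of those numbers is not interpreted here, and the helper names are made up for this example.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed pattern, inferred from names such as 0_ABYSSINIAN_aug-00-0.png
FILENAME_PATTERN = re.compile(r"^(\d+)_(.+)_aug-(\d+)-(\d+)\.png$")


def class_of(filename):
    """Extract the class name from a synthetic image file name, or None."""
    match = FILENAME_PATTERN.match(filename)
    return match.group(2) if match else None


def count_per_class(filenames):
    """Count synthetic images per class, ignoring names that do not match."""
    classes = (class_of(name) for name in filenames)
    return Counter(c for c in classes if c is not None)


output_dir = Path("outputs/oxford_pet/synthetic")
if output_dir.is_dir():
    counts = count_per_class(p.name for p in output_dir.iterdir())
    for class_name, count in sorted(counts.items()):
        print(f"{class_name}: {count}")
```

A class with far fewer files than the others is a good place to start your visual check.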
6. Reading the numbers: diversity vs. accuracy
DALDA ships with two standard diversity metrics:
Metric | Meaning | Ideal value |
---|---|---|
CLIP-I | Average CLIP similarity between synthetic images | Lower = more diverse |
LPIPS | Learned perceptual distance between synthetic images | Higher = more diverse |
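To see why a lower CLIP-I means more diversity, the sketch below averages the cosine similarity over every pair of embeddings in a set. It assumes the image embeddings are already available as plain lists of numbers; the toy vectors and the function names are illustrative, and the repo's own metric code may differ in detail.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: the first set is nearly identical, the second is spread out.
similar_set = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
varied_set = [[1.0, 0.0], [0.5, 0.8], [0.1, 1.0]]
print(f"similar set: {mean_pairwise_similarity(similar_set):.3f}")
print(f"varied set:  {mean_pairwise_similarity(varied_set):.3f}")
```

The near-identical set scores close to 1, while the spread-out set scores lower, which is the direction the "Lower = more diverse" column refers to.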
Typical 1-shot results:
Dataset | CLIPScore type | CLIP-I ↓ | LPIPS ↑ | ResNet50 top-1 ↑ |
---|---|---|---|---|
Oxford Pets | High (0.78) | 0.882 | 0.709 | 80.7 % |
Caltech-101 | High (0.84) | 0.879 | 0.705 | 78.8 % |
Flowers102 | Low (0.55) | 0.921 | 0.746 | 52.1 % |
Take-aways:
- High CLIPScore datasets → AGS encourages diversity; accuracy still rises.
- Low CLIPScore datasets → AGS turns the volume knob toward the original photo; diversity is intentionally lower to stay on-topic.
- The framework does not require fine-tuning Stable Diffusion itself, which saves hours of GPU time.
7. Troubleshooting mini-FAQ
Q1: I don’t have a GPT-4 key.
A: Use the example prompts under prompts/
or write ten simple sentences per class yourself. The rest of the pipeline still works.
Q2: Out-of-memory error.
A:
- Reduce --resolution to 256×256 in the config.
- Lower batch_size to 8.
- Use --gradient_checkpointing true.
Q3: Can I use this for object detection?
A: The current repo only demonstrates classification. To adapt it for detection, you would need to replace the IP-Adapter with a ControlNet branch (e.g., bounding-box conditioning) and adapt the dataloader.
Q4: Commercial usage?
A: All generated images are new pixels. Still, double-check your local regulations before including them in a commercial dataset.
Q5: Windows native (no WSL)?
A: Possible, but CUDA path issues are common. WSL2 is the smoothest route.
Q6: How long does 1-shot training take?
A: ~15 minutes on an RTX 3090 for Oxford Pets (37 × 10 = 370 synthetic images).
Q7: Can I mix real and synthetic data in any ratio?
A: Yes. Set synthetic_probability to any value from 0.0 (all real) to 1.0 (all synthetic).
Q8: Why is Flowers102 accuracy lower?
A: Flowers have subtle differences; CLIPScore is naturally lower. AGS therefore limits diversity, keeping the model conservative.
Q9: Reproducibility?
A: The authors report the average of three seeds. You can set --seed 42 for deterministic runs.
Q10: Can I fine-tune Stable Diffusion for my domain?
A: You can, but it defeats the purpose of “no fine-tuning.” Try the vanilla pipeline first.
8. Where to go next
- Higher resolution – swap CompVis/stable-diffusion-v1-4 for stabilityai/stable-diffusion-xl-base-1.0 and adjust memory settings.
- Other modalities – the same AGS knob works for audio or 3-D diffusion models because CLIPScore equivalents exist.
- Edge deployment – export the trained classifier to ONNX or TensorRT; synthetic images are only needed at training time.
Citation
If you use DALDA in your research or product, please cite:
@article{jung2024dalda,
title={{DALDA}: Data Augmentation Leveraging Diffusion Model and {LLM} with Adaptive Guidance Scaling},
author={Jung, Kyuheon and Seo, Yongdeuk and Cho, Seongwoo and Kim, Jaeyoung and Min, Hyun-seok and Choi, Sungchul},
journal={arXiv preprint arXiv:2409.16949},
year={2024}
}
Happy augmenting!