Data-Augmentation in 2025: How to Train a Vision Model with Only One Photo per Class

(A plain-English walkthrough of the DALDA framework)

By an industry practitioner who has spent the last decade turning research papers into working products.


Contents

  1. Why the “one-photo” problem matters
  2. Meet DALDA in plain words
  3. How the pieces fit together
  4. Install everything in 15 minutes
  5. Run your first 1-shot experiment
  6. Reading the numbers: diversity vs. accuracy
  7. Troubleshooting mini-FAQ
  8. Where to go next

1. Why the “one-photo” problem matters

Imagine you are a quality-control engineer at a small factory. Every time a new scratch pattern appears on the production line, you can only take one clear photograph before the line is cleaned.
The machine-learning team still wants a classifier that can tell new scratches from old ones.
Traditional flip / rotate / crop tricks only re-arrange pixels; they do not invent new angles, lighting, or backgrounds.
Large diffusion models (Stable Diffusion, DALL-E, Midjourney) can invent new scenes, but without guardrails they happily generate a cat when you asked for a scratch.

DALDA, short for Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling, is an open-source framework built for exactly this dilemma.
It keeps the creative power of diffusion models while making sure the result still belongs to the correct category.


2. Meet DALDA in plain words

Component | Everyday analogy | What it really does
GPT-4 (or any LLM) | A talkative art director | Writes ten different English sentences describing the class shown in your photo, adding new but realistic details such as “viewed from above, under soft morning light”.
CLIPScore | A picky curator | Measures how well the original photo matches its class name. High score = faithful photo; low score = ambiguous photo.
Adaptive Guidance Scaling (AGS) | A volume knob | Turns the influence of the original photo up or down depending on CLIPScore. Prevents the diffusion model from drifting too far.
Stable Diffusion + IP-Adapter | The actual painter | Generates brand-new images that respect both the art-director sentences and the volume-knob guidance.

Run the pipeline once and you receive ten new images per training photo that look different yet remain recognizably the same class.


3. How the pieces fit together

3.1 CLIPScore in one minute

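CLIP is a neural network that maps both images and text into the same vector space.
The cosine similarity between an image vector and a text vector is the basis of the CLIPScore. Raw CLIP cosines for a well-matched image and caption are usually far below 1.0 (often around 0.2 to 0.4), and the original CLIPScore definition rescales them by a factor of 2.5, so check which scale your tool reports before applying the rough reading bands below: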

  • 0.90 – 1.00 → excellent match
  • 0.70 – 0.89 → good match
  • 0.30 – 0.69 → risky match
  • < 0.30 → probably off-topic

DALDA uses the official CLIP-ViT-B/16 checkpoint to compute this score automatically.
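
To make the scoring step concrete, here is a minimal sketch of the cosine-similarity calculation and the reading bands listed above. The vectors are made-up stand-ins for CLIP embeddings, and the helper names cosine_similarity and match_band are illustrative, not part of the DALDA code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def match_band(score):
    """Map a score to the rough reading bands from this section."""
    if score >= 0.90:
        return "excellent match"
    if score >= 0.70:
        return "good match"
    if score >= 0.30:
        return "risky match"
    return "probably off-topic"


# Toy 4-dimensional vectors standing in for real CLIP embeddings (illustrative only).
image_embedding = [0.2, 0.9, 0.1, 0.4]
text_embedding = [0.25, 0.8, 0.0, 0.5]

score = cosine_similarity(image_embedding, text_embedding)
print(f"{score:.3f} -> {match_band(score)}")
```

In a real run the two vectors would come from the CLIP image encoder and text encoder; only the arithmetic is shown here.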

3.2 AGS: the volume knob

Inside the IP-Adapter code, each cross-attention layer receives two signals:

  1. Text signal (from the LLM sentence)
  2. Image signal (from your single photo)

A scalar called λ (lambda) decides the mix:

  new_attention = text_attention + λ × image_attention

  • High CLIPScore (≥ 0.30 default) → λ is lowered (0.1–0.4) so the text can explore new poses.
  • Low CLIPScore (< 0.30) → λ is raised (0.7–0.9) so the original photo keeps the diffusion model on a short leash.
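
The mixing rule can be tried on toy numbers. The sketch below applies new_attention = text_attention + λ × image_attention to short lists standing in for attention outputs; mix_attention is a hypothetical helper, and real attention outputs are large tensors inside the diffusion model.

```python
def mix_attention(text_attention, image_attention, lam):
    """Return text_attention + lam * image_attention, element by element."""
    if len(text_attention) != len(image_attention):
        raise ValueError("both attention outputs must have the same length")
    return [t + lam * i for t, i in zip(text_attention, image_attention)]


# Toy attention outputs (illustrative values only).
text_out = [0.5, -0.2, 0.1]
image_out = [1.0, 1.0, -1.0]

# A small lambda leaves the text signal dominant; a large one pulls
# the result toward the image signal.
for lam in (0.2, 0.8):
    print(lam, mix_attention(text_out, image_out, lam))
```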

λ is sampled from a truncated normal distribution so you still get variety rather than a single fixed weight.
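
A minimal sketch of that sampling step follows. The intervals and the 0.30 threshold come from the bullets above; the choice to centre the normal distribution on the interval midpoint, the standard deviation of 0.1, and the function names are assumptions made for illustration, not values taken from the DALDA code.

```python
import random

CLIP_THRESHOLD = 0.30  # default threshold quoted in this section


def lambda_range(clip_score, threshold=CLIP_THRESHOLD):
    """Pick the lambda interval: low weights for high scores, high weights for low scores."""
    return (0.1, 0.4) if clip_score >= threshold else (0.7, 0.9)


def sample_lambda(clip_score, rng, sigma=0.1):
    """Draw lambda from a normal distribution truncated to the chosen interval."""
    low, high = lambda_range(clip_score)
    mean = (low + high) / 2  # assumption: centre on the interval midpoint
    while True:
        # Rejection sampling: redraw until the value falls inside the interval.
        value = rng.gauss(mean, sigma)
        if low <= value <= high:
            return value


rng = random.Random(0)
print("high CLIPScore:", [round(sample_lambda(0.45, rng), 2) for _ in range(3)])
print("low CLIPScore: ", [round(sample_lambda(0.12, rng), 2) for _ in range(3)])
```

Rejection sampling keeps the example dependency-free; a library routine for truncated normals would do the same job.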

3.3 Prompt generation example

You give GPT-4 the dataset description and the class name.
For the class Pomeranian it may return:

“A fluffy Pomeranian dog trots merrily along a path bordered by radiant tulips, its tail gently lifted in tune with the whimsical spring air.”

Ten such sentences are produced for each training image.
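
As a rough illustration of this step, the sketch below builds the instruction text for the LLM and cleans up a reply that lists one sentence per line. The wording of the request and both function names are hypothetical; the actual template used by generate_prompt_by_LLM.py may differ.

```python
import re


def build_llm_request(dataset_description, class_name, num_sentences=10):
    """Compose the instruction text sent to the LLM (illustrative wording)."""
    if num_sentences < 1:
        raise ValueError("num_sentences must be at least 1")
    return (
        f"Dataset: {dataset_description}\n"
        f"Class: {class_name}\n"
        f"Write {num_sentences} different one-sentence image descriptions of a "
        f"{class_name}. Vary the viewpoint, lighting, and background, but keep "
        f"the subject clearly recognizable as a {class_name}."
    )


def parse_llm_reply(reply_text):
    """Split a one-sentence-per-line reply into a list, dropping list markers."""
    sentences = []
    for line in reply_text.splitlines():
        # Remove a leading "1.", "2)", "-" or bullet marker if present.
        cleaned = re.sub(r"^\s*(?:[-•]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            sentences.append(cleaned)
    return sentences


print(build_llm_request("Photos of 37 cat and dog breeds", "Pomeranian"))
```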


4. Install everything in 15 minutes

4.1 Hardware & OS

  • Linux (Ubuntu 20.04+) or Windows 10/11 with WSL2 (macOS is not a practical choice here, because the steps below assume an NVIDIA GPU with CUDA)
  • NVIDIA GPU with ≥ 8 GB VRAM (RTX 3060, 4060, 3090, 4090, A100, etc.)
  • CUDA 11.8 or 12.x drivers already working (nvidia-smi shows your card)

4.2 Software stack

# 1. Create a clean environment
conda create -n dalda python=3.10 -y
conda activate dalda

# 2. Clone the repository and install its dependencies
git clone https://github.com/xxx/dalda.git
cd dalda
pip install -r requirements.txt   # ~2 GB download

# 3. Download IP-Adapter weights (needed for AGS)
git lfs install
git clone https://huggingface.co/h94/IP-Adapter
mv IP-Adapter/models ./models

4.3 Download a sample dataset

Oxford Pets is small (about 800 MB) and has 37 pet classes, which makes it a good fit for a first run.

# Images
wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz
tar -xvf images.tar.gz

# Train/val lists
wget https://thor.robots.ox.ac.uk/~vgg/data/pets/annotations.tar.gz
tar -xvf annotations.tar.gz

# Move into expected folder
mkdir -p datasets/oxford_pet
mv images annotations datasets/oxford_pet/

Folder tree must look like:

datasets
└── oxford_pet
    ├── images
    └── annotations
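
Before moving on, a quick check that the layout matches the tree above can save a failed run. This is a small sketch using only the standard library; the function name missing_dataset_dirs is made up for this example.

```python
from pathlib import Path


def missing_dataset_dirs(root="datasets/oxford_pet"):
    """Return the expected sub-folders that are absent under root."""
    root_path = Path(root)
    expected = ["images", "annotations"]
    return [name for name in expected if not (root_path / name).is_dir()]


missing = missing_dataset_dirs()
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("Dataset layout looks right.")
```

Run it from the repository root so the relative path resolves correctly.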

5. Run your first 1-shot experiment

5.1 Generate prompts with an LLM

Create a file called .env in the root directory:

AZURE_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_API_KEY=your_key
AZURE_API_VERSION=2023-07-01-preview

Then:

python generate_prompt_by_LLM.py

The script produces prompts/oxford_pet_llm.json with ten sentences for each of the 37 classes.
(No Azure OpenAI key? Use the sample prompts already shipped with the repo.)
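
If the prompt script fails with an authentication error, it helps to confirm that the .env file really contains the three keys shown above. The sketch below parses .env-style text with the standard library only; the helper names are illustrative, and the repo itself may load the file differently.

```python
REQUIRED_KEYS = ("AZURE_ENDPOINT", "AZURE_API_KEY", "AZURE_API_VERSION")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """List required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Example text with one key left out on purpose.
example = """AZURE_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_API_KEY=your_key
"""
print(missing_keys(parse_env_text(example)))  # prints ['AZURE_API_VERSION']
```

To check a real file, pass the result of open(".env").read() to parse_env_text.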

5.2 Train the classifier

python train_classifier.py \
  --config cfg_dalda_pets_llm_prompt_adaptive_scaling_clip

Key config lines you might tweak:

Key | Meaning | Default
examples_per_class | Shots per class | 1
synthetic_probability | Fraction of synthetic images in each batch (0.0 to 1.0) | 0.5
resume | Continue if interrupted | [seed, shots, class_id, img_id]
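
To build intuition for synthetic_probability, the sketch below simulates the choice between a real and a synthetic image for each training sample. It assumes the setting acts as an independent per-sample probability, which is an assumption about the training loop rather than something confirmed here; pick_source is a hypothetical helper.

```python
import random


def pick_source(synthetic_probability, rng):
    """Choose 'synthetic' with the given probability, otherwise 'real'."""
    if not 0.0 <= synthetic_probability <= 1.0:
        raise ValueError("synthetic_probability must be between 0.0 and 1.0")
    return "synthetic" if rng.random() < synthetic_probability else "real"


rng = random.Random(0)
draws = [pick_source(0.5, rng) for _ in range(10_000)]

# With the default of 0.5, roughly half of the samples should be synthetic.
print(draws.count("synthetic") / len(draws))
```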

Example log:

Epoch 1/50 - train_acc: 0.67  val_acc: 0.71
Epoch 2/50 - train_acc: 0.73  val_acc: 0.78
...

5.3 Inspect the new images

Outputs land in:

outputs/oxford_pet/synthetic/
  ├── 0_ABYSSINIAN_aug-00-0.png
  ├── 0_ABYSSINIAN_aug-01-0.png
  ...

Open a few files and verify they still look like the correct animal.
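
A quick count per class helps confirm that every class received its share of synthetic images. The sketch below parses file names shaped like the two examples above; the regular expression is inferred from those two names only, so treat the pattern and the function name as assumptions.

```python
import re
from collections import Counter

# Inferred from names like "0_ABYSSINIAN_aug-00-0.png" (assumed pattern).
FILENAME_PATTERN = re.compile(r"^(\d+)_(.+)_aug-(\d+)-(\d+)\.png$")


def count_images_per_class(filenames):
    """Count files per class name, ignoring names that do not match the pattern."""
    counts = Counter()
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = ["0_ABYSSINIAN_aug-00-0.png", "0_ABYSSINIAN_aug-01-0.png", "notes.txt"]
print(count_images_per_class(sample))
```

For the real folder, pass os.listdir("outputs/oxford_pet/synthetic") instead of the sample list.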


6. Reading the numbers: diversity vs. accuracy

DALDA ships with two standard diversity metrics:

Metric | Meaning | Ideal value
CLIP-I | Average CLIP similarity between synthetic images | Lower = more diverse
LPIPS | Learned perceptual distance between image pairs (Learned Perceptual Image Patch Similarity) | Higher = more diverse
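
To show what "average similarity between synthetic images" means in practice, here is a minimal sketch that averages the cosine similarity over every pair in a set of embeddings. The 2-dimensional vectors are toy stand-ins for CLIP image embeddings, and the exact averaging used by the DALDA evaluation code may differ.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: one set of near-duplicates, one more varied set.
near_duplicates = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15]]
varied = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

print("near-duplicates:", round(mean_pairwise_similarity(near_duplicates), 3))
print("varied:         ", round(mean_pairwise_similarity(varied), 3))
```

The varied set scores lower, which is the sense in which a lower CLIP-I indicates more diverse images.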

Typical 1-shot results:

Dataset | CLIPScore type | CLIP-I ↓ | LPIPS ↑ | ResNet50 top-1 ↑
Oxford Pets | High (0.78) | 0.882 | 0.709 | 80.7 %
Caltech-101 | High (0.84) | 0.879 | 0.705 | 78.8 %
Flowers102 | Low (0.55) | 0.921 | 0.746 | 52.1 %

Take-aways:

  1. High CLIPScore datasets → AGS encourages diversity; accuracy still rises.
  2. Low CLIPScore datasets → AGS turns the volume knob toward the original photo; diversity is intentionally lower to stay on-topic.
  3. The framework does not require fine-tuning Stable Diffusion itself, which saves hours of GPU time.

7. Troubleshooting mini-FAQ

Q1: I don’t have a GPT-4 key.
A: Use the example prompts under prompts/ or write ten simple sentences per class yourself. The rest of the pipeline still works.

Q2: Out-of-memory error.
A: Try these in order (option names can differ between versions of the repo, so check them against your config):

  • Reduce --resolution to 256×256.
  • Lower batch_size to 8.
  • Use --gradient_checkpointing true.

Q3: Can I use this for object detection?
A: The current repo only demonstrates classification. Extending it to detection would mean replacing the IP-Adapter with a ControlNet branch (e.g., bounding-box conditioning) and adapting the dataloader.

Q4: Commercial usage?
A: All generated images are new pixels. Still, double-check your local regulations before including them in a commercial dataset.

Q5: Windows native (no WSL)?
A: Possible, but CUDA path issues are common. WSL2 is the smoothest route.

Q6: How long does 1-shot training take?
A: ~15 minutes on an RTX 3090 for Oxford Pets (37 × 10 = 370 synthetic images).

Q7: Can I mix real and synthetic data in any ratio?
A: Yes. Set synthetic_probability anywhere between 0.0 (all real) and 1.0 (all synthetic).

Q8: Why is Flowers102 accuracy lower?
A: Flowers have subtle differences; CLIPScore is naturally lower. AGS therefore limits diversity, keeping the model conservative.

Q9: Reproducibility?
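A: The authors report the average of three seeds. Fix the seed yourself (for example --seed 42) to make runs repeatable; some GPU operations are nondeterministic, so small run-to-run differences can remain.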

Q10: Can I fine-tune Stable Diffusion for my domain?
A: You can, but it defeats the purpose of “no fine-tuning.” Try the vanilla pipeline first.


8. Where to go next

  • Higher resolution – swap CompVis/stable-diffusion-v1-4 for stabilityai/stable-diffusion-xl-base-1.0, switch to IP-Adapter weights trained for SDXL, and adjust memory settings.
  • Other modalities – the same AGS idea could carry over to audio or 3-D diffusion models, provided a CLIPScore-style alignment metric exists for that modality.
  • Edge deployment – export the trained classifier to ONNX or TensorRT; synthetic images are only needed at training time.

Citation

If you use DALDA in your research or product, please cite:

@article{jung2024dalda,
  title={{DALDA}: Data Augmentation Leveraging Diffusion Model and {LLM} with Adaptive Guidance Scaling},
  author={Jung, Kyuheon and Seo, Yongdeuk and Cho, Seongwoo and Kim, Jaeyoung and Min, Hyun-seok and Choi, Sungchul},
  journal={arXiv preprint arXiv:2409.16949},
  year={2024}
}

Happy augmenting!