GPT-IMAGE-EDIT-1.5M: A Practical Guide to Training Open-Source Image-Editing Models That Rival GPT-4o

From raw download to 7.24-point benchmark scores—no hype, just the facts.


Table of Contents

  1. Why another image-editing dataset?
  2. What exactly is GPT-IMAGE-EDIT-1.5M?
  3. How the dataset was built—step by step
  4. Hands-on experiment: reproducing the 7.24 GEdit-EN score
  5. Download, verify, and load the data
  6. Frequently asked questions
  7. Ready-to-use PyTorch dataset snippet
  8. Next steps and closing thoughts

1. Why another image-editing dataset?

If you have ever tried to train an instruction-guided image-editing model, you have probably run into three recurring headaches:

Pain point | What it looks like | Why it matters
Instructions are too simple | “Make the sky blue” | The model never learns complex, multi-step edits
Text–image mismatch | Prompt says “add a red umbrella,” but the generated umbrella is green | Loss stalls, results look wrong
Small data volume | Public sets top out at a few hundred thousand samples | Overfitting appears after the first few epochs

Large proprietary systems such as GPT-4o have shown that data quality, not model size alone, drives photorealistic and semantically accurate edits. The problem: GPT-4o’s training data is private, leaving open-source developers behind.

Researchers from UC Santa Cruz, the University of Edinburgh, and Adobe decided to close the gap by re-processing three existing public datasets—OmniEdit, HQ-Edit, and UltraEdit—using GPT-4o itself. The result is GPT-IMAGE-EDIT-1.5M, a royalty-free collection of 1.54 million instruction–source–target triplets that anyone can download, inspect, and fine-tune on today.


2. What exactly is GPT-IMAGE-EDIT-1.5M?

2.1 Scale and composition

  • Total samples: 1 540 203
  • Origin:

    • OmniEdit ≈ 60 %
    • HQ-Edit ≈ 25 %
    • UltraEdit ≈ 15 %
  • Resolutions: 1024×1024, 1536×1024, 1024×1536 (aspect-ratio locked)
  • Language: English instructions; ~10 % of instructions were rewritten by GPT-4o for clarity

2.2 One sample unpacked

Field | Example
instruction | “Replace the wooden table with a glass one and add a vase of sunflowers on top.”
source_image | (the original source image)
edited_image | (the GPT-4o-edited result)

Each triplet is delivered as two JPEG images plus one line of JSON in a .jsonl file.
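
To see what one of those JSON lines contains, you can parse the first record of a metadata shard. The sketch below assumes the folder layout from Section 5.2 and the key names used by the loader in Section 7 (source, target, instruction); adjust the path and keys if your copy of the dataset differs.

import json

# Peek at the first record of one metadata shard.
with open("GPT-IMAGE-EDIT-1.5M/metadata/omniedit.jsonl") as f:
    record = json.loads(next(f))

print(record["instruction"])  # the edit instruction
print(record["source"])       # relative path to the source JPEG
print(record["target"])       # relative path to the edited JPEG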


3. How the dataset was built—step by step

Think of the pipeline as three refinement passes over the original data.

3.1 Pass 1: Output regeneration

  • Feed the original instruction + source image to GPT-4o’s image-edit endpoint (a minimal API sketch follows below)
  • Require 1024 px resolution, strict alignment to the source
  • Auto-reject distorted or padded outputs

Impact: ImgEdit score on OmniEdit rose from 2.94 → 3.24.
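
This guide does not reproduce the authors’ pipeline code, but the core of Pass 1 is one image-edit request per sample. Below is a minimal sketch using the OpenAI Python SDK’s images.edit endpoint; the model name, paths, and regenerate helper are placeholders rather than the actual pipeline, and the auto-rejection step is omitted.

import base64
from openai import OpenAI

client = OpenAI()

def regenerate(source_path: str, instruction: str, out_path: str) -> None:
    # One edit request per sample, pinned to 1024 px as in Pass 1.
    result = client.images.edit(
        model="gpt-image-1",            # placeholder model name
        image=open(source_path, "rb"),
        prompt=instruction,
        size="1024x1024",
    )
    # Assumes the edited image comes back base64-encoded; write it to disk.
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))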

3.2 Pass 2: Instruction rewrite

  • Problem: GPT-4o occasionally “over-creates,” so the new image no longer matches the old instruction.
  • Fix: Show GPT-4o the source and the regenerated target, then ask for a fresh, precise instruction (see the sketch below).
  • Impact: ImgEdit score climbed an additional 0.16 (3.24 → 3.40).
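
A rough equivalent of this rewrite step, using the OpenAI chat completions API, might look like the sketch below. The prompt wording and the rewrite_instruction helper are illustrative, not the authors’ exact setup.

import base64
from openai import OpenAI

client = OpenAI()

def as_data_url(path: str) -> str:
    # Inline a local JPEG as a data URL so it can be sent as an image part.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def rewrite_instruction(source_path: str, edited_path: str) -> str:
    # Show GPT-4o the source and the regenerated target, ask for a fresh instruction.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write one precise editing instruction that turns the first image into the second."},
                {"type": "image_url", "image_url": {"url": as_data_url(source_path)}},
                {"type": "image_url", "image_url": {"url": as_data_url(edited_path)}},
            ],
        }],
    )
    return response.choices[0].message.content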

3.3 Pass 3: Full pair regeneration (HQ-Edit only)

  • Problem: HQ-Edit’s source images came from DALL-E 3 and looked dated.
  • Fix: Ask GPT-4o to create a new high-quality source first, then apply the same edit instruction to it.
  • Impact: GEdit-EN score edged up from 5.67 → 5.73.

After all passes, every image was run through a padding-crop-resize script to guarantee square or 3:2 / 2:3 output without stretching, then SHA-256 checksummed.
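
For reference, here is one way such a crop-resize plus checksum step could look, using Pillow and hashlib. The crop_resize helper is a sketch, not the authors’ script; only the target resolutions come from Section 2.1.

import hashlib
from PIL import Image

TARGETS = [(1024, 1024), (1536, 1024), (1024, 1536)]  # resolutions from Section 2.1

def crop_resize(path: str, out_path: str) -> str:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    # Pick the supported aspect ratio closest to the image's own.
    tw, th = min(TARGETS, key=lambda t: abs(w / h - t[0] / t[1]))
    # Center-crop to that ratio, then resize -- aspect ratio preserved, no stretching.
    scale = min(w / tw, h / th)
    cw, ch = int(tw * scale), int(th * scale)
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch)).resize((tw, th), Image.LANCZOS)
    img.save(out_path, "JPEG", quality=95)
    # Digest of the final file, as recorded in checksum.sha256.
    return hashlib.sha256(open(out_path, "rb").read()).hexdigest()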


4. Hands-on experiment: reproducing the 7.24 GEdit-EN score

4.1 Base model

  • FluxKontext dev: a rectified-flow transformer that natively supports 1024 px images
  • Text encoder swap: the authors replaced the default T5 encoder with Qwen-VL-7B embeddings for crisper prompt understanding

4.2 Training recipe (single-node, 8×A100 80 GB)

Parameter | Value | Notes
Batch size | 256 real samples | Gradient accumulation ×4 if you only have 4 GPUs
Learning rate | 5e-5 | Cosine schedule to 1e-6
Steps | 30 000 | ~1 epoch over 1.5 M samples
Precision | bfloat16 | Flash-Attention 2 enabled
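
The recipe maps onto standard PyTorch pieces. The sketch below only illustrates the learning-rate schedule, step count, and bfloat16 autocast from the table; the AdamW optimizer, the dummy linear model, and the synthetic batch are stand-ins for the real FluxKontext training loop.

import torch
from torch import nn

model = nn.Linear(16, 16).cuda()                             # stand-in for the real editing model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)   # optimizer choice assumed, lr from the table
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=30_000, eta_min=1e-6                    # cosine from 5e-5 down to 1e-6 over 30 000 steps
)

for step in range(30_000):
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.bfloat16):       # bfloat16 as in the recipe
        x = torch.randn(256, 16, device="cuda")              # effective batch of 256
        loss = model(x).pow(2).mean()                        # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()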

4.3 Results summary

Benchmark | Baseline (original data) | GPT-IMAGE-EDIT-1.5M | Gain
GEdit-EN-full | 6.26 | 7.24 | +0.98
ImgEdit-Full | 3.52 | 3.80 | +0.28
Complex-Edit | 8.49 | 8.78 | +0.29

Scores are computed by automated multimodal LLM judges that measure instruction following, identity preservation, and perceptual quality.


5. Download, verify, and load the data

5.1 Where to get it

  • Official page: https://ucsc-vlaa.github.io/GPT-Image-Edit
  • Hugging Face mirror: search GPT-IMAGE-EDIT-1.5M
  • Total size: ~1.8 TB (JPEG, quality 95)

5.2 Folder layout

GPT-IMAGE-EDIT-1.5M/
├─ metadata/
│  ├─ omniedit.jsonl
│  ├─ hqedit.jsonl
│  └─ ultraedit.jsonl
├─ images/
│  ├─ 00000000.jpg
│  └─ ...
└─ checksum.sha256

5.3 Integrity check

sha256sum -c checksum.sha256

If any line fails, re-download only that shard.
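
If you prefer to script that check (for example, to collect exactly which shards to re-fetch), the same verification is easy to redo with hashlib. The sketch below assumes checksum.sha256 uses the standard two-column sha256sum format.

import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    # Stream the file in 1 MiB chunks so large shards don't need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

failed = []
with open("checksum.sha256") as f:
    for line in f:
        expected, path = line.split(maxsplit=1)
        path = path.strip().lstrip("*")  # sha256sum marks binary-mode entries with '*'
        if sha256_of(path) != expected:
            failed.append(path)

print("re-download these files:", failed)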


6. Frequently asked questions

Question | Short answer
Can I use this commercially? | Yes. The dataset is CC-BY-4.0. You must credit the authors and check any third-party assets in source images.
I only have a 24 GB RTX 4090. | Use --gradient_checkpointing and --mixed_precision fp16. Effective batch of 4 still converges in ~2 days.
My instructions are in Chinese. | Only English is provided. Community multilingual forks are tracked in GitHub issue #7.
Can I add my own data later? | Append new JSONL lines with the same keys (source, target, instruction) and rerun the training script.
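
Expanding on that last answer: adding your own samples just means appending JSON lines with the same three keys. Everything in the sketch below (paths and instruction text) is made up for illustration.

import json

# Hypothetical new sample appended to an existing metadata shard.
new_sample = {
    "source": "images/my_source_0001.jpg",
    "target": "images/my_edited_0001.jpg",
    "instruction": "Turn the daytime street scene into a rainy night.",
}
with open("metadata/omniedit.jsonl", "a") as f:
    f.write(json.dumps(new_sample) + "\n")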

7. Ready-to-use PyTorch dataset snippet

Save as gpt_image_edit.py:

import json, os
from PIL import Image
from torch.utils.data import Dataset

class GPTImageEditDataset(Dataset):
    """Yields (source image, edited image, instruction) triplets from one metadata shard."""

    def __init__(self, meta_file: str, img_dir: str, transform=None):
        # Each line of the .jsonl shard is one JSON record with the keys
        # 'source', 'target', and 'instruction'.
        with open(meta_file) as f:
            self.samples = [json.loads(line) for line in f]
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        # Image paths in the metadata are relative to the images/ directory.
        src_path = os.path.join(self.img_dir, item['source'])
        tgt_path = os.path.join(self.img_dir, item['target'])
        src = Image.open(src_path).convert('RGB')
        tgt = Image.open(tgt_path).convert('RGB')
        prompt = item['instruction']
        if self.transform:
            src = self.transform(src)
            tgt = self.transform(tgt)
        return {'source': src, 'target': tgt, 'prompt': prompt}

Usage example:

from torchvision import transforms
transform = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor()
])
dataset = GPTImageEditDataset('metadata/omniedit.jsonl', 'images/', transform=transform)
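
From there, a standard DataLoader handles batching; the default collate function stacks the image tensors and keeps the prompts as a list of strings for your text encoder. The batch size and worker count below are arbitrary.

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=8)
batch = next(iter(loader))
print(batch['source'].shape, batch['target'].shape, len(batch['prompt']))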

8. Next steps and closing thoughts

With GPT-IMAGE-EDIT-1.5M and a modest open-source backbone, you can now reach benchmark scores within striking distance of GPT-4o—without paying per-image API fees or locking yourself into a closed platform.

Immediate experiments to try

  1. LoRA fine-tuning on 8 GB consumer cards
  2. Video frame editing by extending the same rectified-flow transformer
  3. Plug-in for Figma / Photoshop using the provided PyTorch loader and ONNX export

The dataset, code, and model weights are all live today. Clone the repo, run the checksum, and you can be training in less time than it takes to finish your coffee.

Happy editing.