GPT-IMAGE-EDIT-1.5M: A Practical Guide to Training Open-Source Image-Editing Models That Rival GPT-4o
From raw download to 7.24-point benchmark scores—no hype, just the facts.
Table of Contents
- Why another image-editing dataset?
- What exactly is GPT-IMAGE-EDIT-1.5M?
- How the dataset was built—step by step
- Hands-on experiment: reproducing the 7.24 GEdit-EN score
- Download, verify, and load the data
- Frequently asked questions
- Ready-to-use PyTorch dataset snippet
- Next steps and closing thoughts
1. Why another image-editing dataset?
If you have ever tried to train an instruction-guided image-editing model, you have probably run into three recurring headaches:
Pain point | What it looks like | Why it matters |
---|---|---|
Instructions are too simple | “Make the sky blue” | The model never learns complex, multi-step edits |
Text–image mismatch | Prompt says “add a red umbrella,” but the generated umbrella is green | Loss stalls, results look wrong |
Small data volume | Public sets top out at a few hundred thousand samples | Overfitting appears after the first few epochs |
Large proprietary systems such as GPT-4o have shown that data quality, not model size alone, drives photorealistic and semantically accurate edits. The problem: GPT-4o’s training data is private, leaving open-source developers behind.
Researchers from UC Santa Cruz, the University of Edinburgh, and Adobe decided to close the gap by re-processing three existing public datasets—OmniEdit, HQ-Edit, and UltraEdit—using GPT-4o itself. The result is GPT-IMAGE-EDIT-1.5M, a royalty-free collection of 1.54 million instruction–source–target triplets that anyone can download, inspect, and fine-tune on today.
2. What exactly is GPT-IMAGE-EDIT-1.5M?
2.1 Scale and composition
- Total samples: 1 540 203
- Origin:
  - OmniEdit ≈ 60 %
  - HQ-Edit ≈ 25 %
  - UltraEdit ≈ 15 %
- Resolutions: 1024×1024, 1536×1024, 1024×1536 (aspect-ratio locked)
- Language: English instructions; ~10 % of instructions were rewritten by GPT-4o for clarity
2.2 One sample unpacked
Field | Example |
---|---|
instruction | “Replace the wooden table with a glass one and add a vase of sunflowers on top.” |
source_image | Original photo showing a wooden table (JPEG; image omitted here) |
edited_image | Same scene with a glass table and a vase of sunflowers (JPEG; image omitted here) |
Each triplet is delivered as two JPEG images plus one line of JSON in a .jsonl file.
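For orientation, a single metadata record might look like the line below. The field names (instruction, source, target) match the loader in Section 7, while the exact file-naming scheme here is an illustrative assumption:

{"instruction": "Replace the wooden table with a glass one and add a vase of sunflowers on top.", "source": "images/00001234_src.jpg", "target": "images/00001234_tgt.jpg"}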
3. How the dataset was built—step by step
Think of the pipeline as three refinement passes over the original data.
3.1 Pass 1: Output regeneration
- Feed the original instruction + source image to GPT-4o’s image-edit endpoint
- Require 1024 px resolution, strict alignment to the source
- Auto-reject distorted or padded outputs
Impact: ImgEdit score on OmniEdit rose from 2.94 → 3.24.
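The regeneration script itself is not published. As a rough, non-authoritative sketch, the core call could look like the following with the OpenAI Python SDK and the gpt-image-1 edit endpoint standing in for the paper’s pipeline; the rejection heuristics are only marked as a comment because they are not public:

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def regenerate_target(instruction: str, source_path: str, out_path: str) -> None:
    """Re-create the edited image for one (instruction, source) pair."""
    with open(source_path, "rb") as f:
        result = client.images.edit(
            model="gpt-image-1",   # assumption: public stand-in for the GPT-4o image-edit endpoint
            image=f,
            prompt=instruction,
            size="1024x1024",      # dataset resolutions: 1024x1024, 1536x1024, 1024x1536
        )
    image_bytes = base64.b64decode(result.data[0].b64_json)
    with open(out_path, "wb") as f:
        f.write(image_bytes)
    # The pipeline additionally auto-rejects distorted or padded outputs;
    # that filtering logic is not public and is omitted here.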
3.2 Pass 2: Instruction rewrite
- Problem: GPT-4o occasionally “over-creates,” so the new image no longer matches the old instruction.
- Fix: Show GPT-4o the source and the regenerated target, then ask for a fresh, precise instruction.
- Impact: ImgEdit score climbed an additional 0.16 (3.24 → 3.40).
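The exact rewrite prompt is not published either. A hedged sketch of the idea, sending both images to GPT-4o through the multimodal chat API, might look like this (the prompt text is an assumption, not the authors’ wording):

import base64
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "You are shown a source image and an edited image. "
    "Write one precise editing instruction that turns the source into the edited image."
)

def to_data_url(path: str) -> str:
    # Inline the JPEG as a base64 data URL so it can be sent in the chat message.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def rewrite_instruction(source_path: str, target_path: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": REWRITE_PROMPT},
                {"type": "image_url", "image_url": {"url": to_data_url(source_path)}},
                {"type": "image_url", "image_url": {"url": to_data_url(target_path)}},
            ],
        }],
    )
    return response.choices[0].message.content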
3.3 Pass 3: Full pair regeneration (HQ-Edit only)
- Problem: HQ-Edit’s source images came from DALL-E 3 and looked dated.
- Fix: Ask GPT-4o to create a new high-quality source first, then apply the same edit instruction to it.
- Impact: GEdit-EN score edged up from 5.67 → 5.73.
After all passes, every image was run through a padding-crop-resize script to guarantee square or 3:2 / 2:3 output without stretching, then SHA-256 checksummed.
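The padding-crop-resize script is not reproduced in the paper. A minimal sketch of the crop-and-resize half of that step with Pillow, assuming a simple center crop to the nearest supported aspect ratio, could look like this:

from PIL import Image

# Supported output shapes (width, height), per Section 2.1
TARGET_SIZES = [(1024, 1024), (1536, 1024), (1024, 1536)]

def crop_resize(img: Image.Image) -> Image.Image:
    """Center-crop to the closest supported aspect ratio, then resize without stretching."""
    ratio = img.width / img.height
    tw, th = min(TARGET_SIZES, key=lambda s: abs(s[0] / s[1] - ratio))
    target_ratio = tw / th
    if ratio > target_ratio:   # too wide: trim left and right
        new_w = int(img.height * target_ratio)
        left = (img.width - new_w) // 2
        img = img.crop((left, 0, left + new_w, img.height))
    else:                      # too tall: trim top and bottom
        new_h = int(img.width / target_ratio)
        top = (img.height - new_h) // 2
        img = img.crop((0, top, img.width, top + new_h))
    return img.resize((tw, th), Image.LANCZOS)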
4. Hands-on experiment: reproducing the 7.24 GEdit-EN score
4.1 Base model
- FluxKontext dev: a rectified-flow transformer that natively supports 1024 px images
- Text encoder swap: authors replaced the default T5 encoder with Qwen-VL-7B embeddings for crisper prompt understanding
4.2 Training recipe (single-node, 8×A100 80 GB)
Parameter | Value | Notes |
---|---|---|
Batch size | 256 real samples | Gradient accumulation ×4 if you only have 4 GPUs |
Learning rate | 5e-5 | Cosine schedule to 1e-6 |
Steps | 30 000 | ~1 epoch over 1.5 M samples |
Precision | bfloat16 | Flash-Attention 2 enabled |
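The full training script ships with the authors’ repo. Purely as an illustration of the schedule in the table, a plain-PyTorch setup with cosine decay from 5e-5 to 1e-6 over 30 000 steps could be wired up as follows; model, dataloader, and the loss call are placeholders for whatever backbone you fine-tune:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=30_000, eta_min=1e-6
)

for step, batch in enumerate(dataloader):
    # bfloat16 autocast mirrors the precision row of the table above
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch)   # placeholder: your model's training loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if step + 1 >= 30_000:
        break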
4.3 Results summary
Benchmark | Baseline (original data) | GPT-IMAGE-EDIT-1.5M | Gain |
---|---|---|---|
GEdit-EN-full | 6.26 | 7.24 | +0.98 |
ImgEdit-Full | 3.52 | 3.80 | +0.28 |
Complex-Edit | 8.49 | 8.78 | +0.29 |
Scores are computed by automated multimodal LLM judges that measure instruction following, identity preservation, and perceptual quality.
5. Download, verify, and load the data
5.1 Where to get it
- Official page: https://ucsc-vlaa.github.io/GPT-Image-Edit
- Hugging Face mirror: search GPT-IMAGE-EDIT-1.5M
- Total size: ~1.8 TB (JPEG, quality 95)
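If the mirror is a standard Hugging Face dataset repo, huggingface_hub can pull it in one call. The repo id below is only a placeholder; substitute whatever the search on the Hub returns, and keep the ~1.8 TB footprint in mind before downloading everything:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/GPT-IMAGE-EDIT-1.5M",   # placeholder repo id; replace with the real one
    repo_type="dataset",
)
print("Downloaded to", local_dir)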
5.2 Folder layout
GPT-IMAGE-EDIT-1.5M/
├─ metadata/
│ ├─ omniedit.jsonl
│ ├─ hqedit.jsonl
│ └─ ultraedit.jsonl
├─ images/
│ ├─ 00000000.jpg
│ └─ ...
└─ checksum.sha256
5.3 Integrity check
sha256sum -c checksum.sha256
If any line fails, re-download only that shard.
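If you would rather stay in Python, the same check for an individual file is a few lines with hashlib; the expected digest is the value listed for that file in checksum.sha256:

import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large shards never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected_hash = "..."  # paste the digest for this file from checksum.sha256
assert sha256_of("images/00000000.jpg") == expected_hash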
6. Frequently asked questions
Question | Short answer |
---|---|
Can I use this commercially? | Yes. The dataset is CC-BY-4.0. You must credit the authors and check any third-party assets in source images. |
I only have a 24 GB RTX 4090. | Use --gradient_checkpointing and --mixed_precision fp16. Effective batch of 4 still converges in ~2 days. |
My instructions are in Chinese. | Only English is provided. Community multilingual forks are tracked in GitHub issue #7. |
Can I add my own data later? | Append new JSONL lines with the same keys (source, target, instruction) and rerun the training script. |
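Expanding on the last row: adding your own pairs is just appending JSON objects, one per line, with the same three keys. A quick sketch, where the shard name and image paths are examples:

import json

new_sample = {
    "instruction": "Turn the daytime sky into a starry night.",
    "source": "images/custom_0001_src.jpg",
    "target": "images/custom_0001_tgt.jpg",
}

# Write to a separate shard (name is an example) and point the loader at it.
with open("metadata/custom.jsonl", "a") as f:
    f.write(json.dumps(new_sample) + "\n")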
7. Ready-to-use PyTorch dataset snippet
Save as gpt_image_edit.py:
import json, os
from PIL import Image
from torch.utils.data import Dataset


class GPTImageEditDataset(Dataset):
    """Yields (source image, edited image, instruction) triplets from one .jsonl shard."""

    def __init__(self, meta_file: str, img_dir: str, transform=None):
        # One JSON object per line: {"source": ..., "target": ..., "instruction": ...}
        with open(meta_file) as f:
            self.samples = [json.loads(line) for line in f]
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]
        src_path = os.path.join(self.img_dir, item['source'])
        tgt_path = os.path.join(self.img_dir, item['target'])
        src = Image.open(src_path).convert('RGB')
        tgt = Image.open(tgt_path).convert('RGB')
        prompt = item['instruction']
        # Apply the same transform to both images so their shapes stay aligned
        if self.transform:
            src = self.transform(src)
            tgt = self.transform(tgt)
        return {'source': src, 'target': tgt, 'prompt': prompt}
Usage example:
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
])
dataset = GPTImageEditDataset('metadata/omniedit.jsonl', 'images/', transform=transform)
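Wrapping it in a standard DataLoader is then the usual one-liner; batch size and worker count below are just examples:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
batch = next(iter(loader))
print(batch['source'].shape, batch['prompt'][:2])  # tensors are batched, prompts stay a list of strings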
8. Next steps and closing thoughts
With GPT-IMAGE-EDIT-1.5M and a modest open-source backbone, you can now reach benchmark scores within striking distance of GPT-4o—without paying per-image API fees or locking yourself into a closed platform.
Immediate experiments to try
- LoRA fine-tuning on 8 GB consumer cards
- Video frame editing by extending the same rectified-flow transformer
- Plug-in for Figma / Photoshop using the provided PyTorch loader and ONNX export
The dataset, code, and model weights are all live today. Clone the repo, run the checksum, and you can be training in less time than it takes to finish your coffee.
Happy editing.