ViBT: Vision Bridge Transformer at Scale – A Practical Deep Dive
What is ViBT and why does it achieve up to 4× faster inference than token-heavy conditional diffusion models while maintaining comparable quality?
ViBT is the first large-scale realization of Brownian Bridge generative models for vision tasks. Instead of the classic “noise-to-data” paradigm, it directly learns stochastic trajectories from a structured source (image/video) to a structured target, eliminating most conditioning tokens and dramatically reducing compute.

Figure: Example results of ViBT across instruction-based editing, stylization, colorization, and frame interpolation.
Why the Noise-to-Data Paradigm Feels Wrong for Conditional Generation
Most modern image and video generators (PixArt, Flux, Wan2.1, etc.) start from pure Gaussian noise even when a high-quality source image or video is available. For tasks like stylization, colorization, or frame interpolation this is inherently wasteful: the source and target are already highly correlated.
ViBT flips the script: it treats the source as the starting point x₀ and the desired output as the endpoint x₁, then learns a Brownian Bridge that connects them directly.
Core Innovations That Make ViBT Work at 20B Parameters
The Key Technical Breakthrough: Variance-Stabilized Velocity Matching
Standard velocity targets in bridge models blow up when t → 1 because the denominator (1−t) becomes tiny. This causes catastrophic gradient explosion in large models.
ViBT introduces a simple yet powerful normalization factor α that balances contributions across all timesteps, making 20B-scale training stable for the first time.
Personal reflection: I’ve trained many 7B–13B diffusion transformers and almost always hit the “late-timestep explosion” wall. The α-normalization feels almost too obvious in hindsight, but it single-handedly unlocks the entire bridge paradigm at scale.
Real-World Scenarios Where ViBT Shines
1. Instruction-Based Image Editing & Stylization
You have a photo and a text prompt like “turn this into Van Gogh’s starry night”.
ViBT concatenates the source image tokens with the text tokens and samples a short bridge trajectory. No ControlNet, no IP-Adapter, no extra conditioning tokens.
Result: 30-step inference on a single 3090 is ~6–8 seconds at 1024×1024 versus 25–30 seconds for comparable Flux-based pipelines.
2. Video Stylization
Transfer the style of one reference image to an entire video clip.
Because the source frames are already meaningful structure, ViBT avoids the massive token bloat that plagues video DiTs. A 64-frame 512×512 video can be stylized in one forward pass on a single 80GB card.
3. Grayscale Video Colorization
Classic data-to-data task. Feed grayscale frames as x₀, corresponding color frames as x₁ during training. At inference time, only the grayscale video is needed – the bridge naturally fills in plausible colors with excellent temporal consistency.
4. Video Frame Interpolation
Give two real keyframes as fixed endpoints. The Brownian Bridge formulation guarantees the trajectory stays anchored, producing flicker-free intermediate frames with far fewer steps than flow-based or diffusion interpolators.
Hands-On: Run ViBT in Five Minutes
Environment Setup
conda create -n vibT python=3.12 -y
conda activate vibT
git clone https://github.com/Yuanshi9815/ViBT.git
cd ViBT
pip install -e .
Quick Image Stylization Example
Open examples/image_stylization.ipynb and run the first few cells:
-
Load the 1.3B image editing checkpoint (fine-tuned from Qwen-Image-Edit) -
Upload your source image -
Write a style prompt -
Generate with 30 steps
That’s it – no complicated adapter setup.
Video Colorization Notebook
examples/video_colorization.ipynb works the same way: drop a grayscale clip, hit run, get a colored video in ~45 seconds (64 frames, 512×512, single A100).
Video Frame Interpolation
examples/video_frame_interpolation.ipynb takes two keyframes and produces any number of in-betweens with perfect endpoint consistency.
Available Models (December 2025)
Live demo (no install required):
https://huggingface.co/spaces/Yuanshi/ViBT
Personal Takeaway After Testing Dozens of Runs
The most surprising part is how little randomness you actually need. With traditional diffusion you fight noise at every step; with ViBT the source already carries 90% of the information, so the bridge only has to nudge pixels gently. The result feels “calmer” and more faithful to the input structure.
I now believe bridge-style models are the natural endpoint for almost all conditional vision tasks. The noise-to-data detour was a historical artifact that we no longer need.
One-Page Summary
Quick Start Checklist
-
Create conda env → vibT -
Clone repo + pip install -e . -
Open any notebook under examples/ -
Run cells → done
FAQ
Q: How is ViBT different from Rectified Flow?
A: Rectified Flow is deterministic (zero noise). ViBT uses a proper Brownian Bridge with controlled stochasticity, which gives better creativity while keeping endpoints fixed.
Q: Why is inference so much faster?
A: Primarily because almost all conditioning tokens are removed. The transformer processes 60–80% fewer tokens per step.
Q: Can I run the 20B model locally?
A: Only on multi-GPU cloud setups right now (8×80GB minimum). The 1.3B models run comfortably on 24–48GB consumer cards.
Q: When will full training code be released?
A: The team is cleaning it up; expect it in early 2026.
Q: Is LoRA fine-tuning possible?
A: Yes, and convergence is 2–3× faster than diffusion LoRAs because trajectories are shorter.
Q: How does it compare to Flux or SD3 on pure unconditional generation?
A: Slightly behind on pure text-to-image, but dramatically ahead on every conditional task (editing, control, video).
Q: Does it support ControlNet-style structure guidance?
A: Not needed in the classic sense – the source image itself is the strongest possible structure guide. Community experiments feeding depth/edge maps as x₀ already outperform traditional ControlNet.
ViBT marks the moment when conditional generation finally escaped the “start from noise” dogma. If you work with image editing, video processing, or any structured translation task, it’s worth trying today – the difference in speed and simplicity is immediately obvious.

