VISTA: Let Your Prompt Rewrite Itself—A Test-Time Agent That Turns 8-Second Ideas into High-Scoring Videos
Give VISTA a one-line prompt, grab a coffee, and come back to a short film that keeps getting better with every loop.
The One-Sentence Prompt Problem
Friday, 5 p.m.
Product manager drops a Slack message:
“Need an 8-second shot—spaceship jumps to hyperspace, stars streak, cinematic.”
You fire up Veo 3, wait 30 seconds, and get… a ship flying vertically against a static star wallpaper. The YouTube comment writes itself: “Nice screensaver.”
So you do what every generative-video wrangler does—tweak the prompt, re-generate, tweak again. By midnight your prompt is a 200-token essay and the stars still look like stickers.
The bottleneck isn’t the model; it’s the one-shot prompt.
Directors don’t hand cinematographers a Post-it that says “make it cool.” They storyboard, review rushes, and iterate. VISTA (Video Iterative Self-improvement Agent) does exactly that—inside the inference loop, without retraining anything.
What Is VISTA in 30 Seconds?
A black-box, multi-agent loop that:
- Storyboards your idea into timed scenes (9 attributes each).
- Shoots several candidates.
- Runs a critics' round-table (visual, audio, context).
- Rewrites the prompt and repeats until stars actually streak (the full loop is sketched in code after the key stats).
Key stats
- Up to 45.9% / 46.3% pairwise win rate vs direct prompting (single- and multi-scene).
- 66.4% human preference over the strongest baselines.
- Works with any text-to-video model (tested on Veo 3 and Veo 2).
- Cost: ~0.7M tokens per iteration (≈ $0.02; video generation extra).
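Here is the whole loop as a minimal Python sketch. Every callable is a pluggable placeholder with an assumed signature, for illustration only; VISTA's real API will differ:

```python
# Minimal sketch of VISTA's outer loop. Every callable here is a pluggable
# placeholder with an assumed signature, not VISTA's actual API.
from typing import Any, Callable, List, Tuple

Video = Any  # whatever your T2V backend returns

def vista_loop(
    user_prompt: str,
    plan: Callable[[str], List[str]],           # storyboard -> prompt candidates
    generate: Callable[[str], Video],           # any text-to-video model
    select: Callable[[List[Video]], int],       # pairwise tournament -> winner index
    critique: Callable[[Video], dict],          # visual/audio/context scores + notes
    rewrite: Callable[[str, dict], List[str]],  # deep-thinking prompt rewriter
    iterations: int = 5,
) -> Tuple[Video, str]:
    prompts = plan(user_prompt) + [user_prompt]  # keep the original in the pool
    best_video, best_prompt = None, user_prompt
    for _ in range(iterations):
        videos = [generate(p) for p in prompts]
        winner = select(videos)
        best_video, best_prompt = videos[winner], prompts[winner]
        notes = critique(best_video)
        scores = notes.get("scores", {})
        if scores and all(s > 8 for s in scores.values()):
            break                                # nothing left to fix, stop early
        prompts = rewrite(best_prompt, notes)
    return best_video, best_prompt
```

Keeping the raw user prompt in the candidate pool mirrors the planner's guard against over-decomposition, covered next.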
The Four-Step Loop—Storyboard, Shoot, Review, Rewrite
| Step | Human Analogy | Technical Core | Default Criteria |
|---|---|---|---|
| 1. Structured Planner | Write the shot list | MLLM → JSON scenes with 9 attributes | Realism, relevance, creativity |
| 2. Tournament Selector | Pick the best of the dailies | Pairwise comparison + bidirectional swap | Visual fidelity, physics, alignment, engagement |
| 3. 3-D Critics | Invite picky friends | Normal + adversarial + meta judges | Visual / audio / context (1-10) |
| 4. Deep-Thinking Writer | Director's notes → new script | DTPA 6-step introspection → edit actions | Keep user intent, add detail |
1) Structured Planner—From “Vibe” to Shootable JSON
Example attributes per scene:
```json
{
  "duration": 8,
  "scene_type": "live-action",
  "characters": ["white stray cat"],
  "actions": ["paw reaches toward lens"],
  "dialogues": null,
  "visual_environment": "sunset rooftop, warm tone",
  "camera": "hand-held POV, slight shake",
  "sounds": ["distant traffic", "subtle wind"],
  "mood": "heart-warming"
}
```
- Keeps the original prompt in the candidate pool, preventing over-decomposition.
- Official prompt templates are on the project page, copy-paste friendly; a hedged planner sketch follows.
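A minimal planner sketch, assuming a generic `call_llm` text-in/text-out client; the template wording and helper names are illustrative, not the official templates:

```python
# Sketch of the Structured Planner: ask an MLLM for timed scenes in the
# nine-attribute schema shown above. `call_llm` is a placeholder for your
# Gemini (or other) client, and the template wording is illustrative; the
# official templates live on the project page.
import json
from typing import Callable

SCENE_KEYS = [
    "duration", "scene_type", "characters", "actions", "dialogues",
    "visual_environment", "camera", "sounds", "mood",
]

PLANNER_TEMPLATE = (
    "Decompose this video idea into timed scenes. Return a JSON list where "
    "each scene has exactly these keys: {keys}.\nIdea: {idea}"
)

def plan_scenes(idea: str, call_llm: Callable[[str], str]) -> list[dict]:
    raw = call_llm(PLANNER_TEMPLATE.format(keys=", ".join(SCENE_KEYS), idea=idea))
    scenes = json.loads(raw)
    # Drop malformed scenes rather than letting them pollute the pool.
    return [s for s in scenes if all(k in s for k in SCENE_KEYS)]
```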
2) Tournament Selector—Why Pairwise Beats Scoring
Scoring invites “everyone’s an 8.”
VISTA runs binary tournaments:
- Each film gets a mini-review (probe critique).
- Compare A→B and B→A (swapped order) to cancel position bias.
- Knock out until one winner remains.
You can inject penalties for typical AI bloopers:
- Sudden object pop-in / pop-out
- Unnecessary subtitles
- 3+ scene transitions in an 8-second clip
Violations auto-lose the round.
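Here is a hedged sketch of that tournament; `judge` stands in for an MLLM pairwise call and `violates` encodes the penalty rules above, both assumed interfaces rather than VISTA's real ones:

```python
# Sketch of the bidirectional pairwise tournament. `judge(a, b)` stands in
# for an MLLM call that reviews both clips and returns "A" or "B";
# `violates(v)` encodes the hard penalty rules above. Both are assumptions.
from typing import Callable, List

def pairwise(a, b, judge: Callable, violates: Callable):
    if violates(a) != violates(b):
        return b if violates(a) else a   # rule violations auto-lose the round
    first = judge(a, b)                  # compare A -> B ...
    second = judge(b, a)                 # ... then B -> A to cancel position bias
    if first == "A" and second == "B":
        return a                         # `a` won under both orderings
    if first == "B" and second == "A":
        return b
    return a                             # disagreement: keep the incumbent

def tournament(videos: List, judge: Callable, violates: Callable):
    winner = videos[0]
    for challenger in videos[1:]:        # knock-out until one winner remains
        winner = pairwise(winner, challenger, judge, violates)
    return winner
```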
3) 3-D Critics—Normal vs Adversarial vs Meta
| Dimension | Normal Judge | Adversarial Judge | Meta Judge |
|---|---|---|---|
| Visual | "Stars look smooth" | "No parallax; totally fake" | Scores 7/10, asks for parallax |
| Audio | "Dialogue is clear" | "Wind noise ruins the take" | Suggests a clean ambient bed |
| Context | "Follows the prompt" | "Ship flies up, not forward" | Recommends horizontal motion |
Any metric ≤ 8 triggers concrete edit actions for the next round.
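A sketch of one critic round per dimension, with illustrative judge prompts and a placeholder `call_llm` client:

```python
# Sketch of one critic round for a single dimension (visual, audio, or
# context). The judge prompts are illustrative paraphrases; `call_llm` is
# a placeholder text-in/text-out client.
from typing import Callable

def critique_dimension(video_desc: str, dim: str,
                       call_llm: Callable[[str], str]) -> str:
    normal = call_llm(f"As a fair {dim} critic, assess this video: {video_desc}")
    adversarial = call_llm(f"As a harsh {dim} critic, list every flaw in: {video_desc}")
    # The meta judge reconciles the two views into a 1-10 score plus edits.
    return call_llm(
        "You are a meta judge. Reconcile these reviews into a 1-10 score and, "
        "if the score is 8 or below, concrete edit actions.\n"
        f"Normal: {normal}\nAdversarial: {adversarial}"
    )

def run_critics(video_desc: str, call_llm: Callable[[str], str]) -> dict:
    return {dim: critique_dimension(video_desc, dim, call_llm)
            for dim in ("visual", "audio", "context")}
```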
4) Deep-Thinking Prompting Agent (DTPA)
Six self-interrogation steps:
1. List sub-8 issues.
2. Restate the user's goal.
3. Spot model limitations (e.g., ray-traced reflections).
4. Find prompt gaps (vague sizes, missing transitions).
5. Propose surgical edits (≤ 150 words each).
6. Double-check that all issues are covered.
Sample output actions:
Sample output actions:
- "Add: sunglasses must not reflect camera rigs."
- "Change text appearance to 'slide up from bottom, <5% of screen'."
- "Transition: use a 6-frame cross-dissolve instead of a hard cut."
These actions seed new prompt variants—loop again.
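One way to pack the six steps into a single structured prompt; the wording below is a paraphrase of the steps above, not VISTA's official template:

```python
# One way to pack the six DTPA steps into a single structured prompt.
# The wording is an illustrative paraphrase, not VISTA's official template.
import json
from typing import Callable

DTPA_TEMPLATE = """You improve text-to-video prompts. Work through these steps:
1. List every issue the critics scored 8 or below.
2. Restate the user's original goal; never change it.
3. Note likely model limitations behind each issue.
4. Find gaps in the current prompt (vague sizes, missing transitions).
5. Propose surgical edit actions, each under 150 words.
6. Verify every issue from step 1 is covered by an edit.

Original goal: {goal}
Current prompt: {prompt}
Critic notes: {notes}

Return only the edit actions, as a JSON list of strings."""

def deep_think(goal: str, prompt: str, notes: str,
               call_llm: Callable[[str], str]) -> list[str]:
    reply = call_llm(DTPA_TEMPLATE.format(goal=goal, prompt=prompt, notes=notes))
    return json.loads(reply)
```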
Hands-On: Wire VISTA to Your Veo 3 Pipeline
Prerequisites: Python ≥ 3.9, Google Cloud Veo 3 access, Gemini 2.5 Flash key
```bash
# 1. Clone the reference repo
git clone https://github.com/google-research/vista.git
cd vista

# 2. One-line deps
pip install -r requirements.txt

# 3. Export keys
export GOOGLE_APPLICATION_CREDENTIALS=/path/veo3-key.json
export GEMINI_API_KEY="gemini-2.5-flash-key"

# 4. Fire the loop
python -m vista.run \
  --prompt "A spaceship entering hyperspace, stars streaking past" \
  --iterations 5 \
  --videos-per-iter 3 \
  --output-dir ./run001
```
The best video and the final prompt land in `run001/best/`.
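To poke at the artifacts from Python (the exact file names inside `best/` depend on the run, so this just lists whatever landed there):

```python
# List whatever the run wrote into the output directory; the exact file
# names inside best/ depend on the run, so we simply enumerate them.
from pathlib import Path

for item in sorted(Path("run001/best").iterdir()):
    print(item.name)
```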
Numbers Don’t Lie—Automatic & Human Eval
| Benchmark | Iterations | Win Rate vs Direct | Human Preference |
|---|---|---|---|
| Single-scene | 5 | 45.9% | 66.4% |
| Multi-scene | 5 | 46.3% | 66.4% |
Token cost ~ 0.7 M per iter (≈ $0.02, video generation billed separately).
Ablation & Scaling—What Happens If You Remove…?
Module removed → observed effect:
- No Planner: –10% win rate at initialization
- No Tournament: unstable after round 3
- Adversarial judge only: too harsh; multi-scene collapses
- Normal judge only: too soft; misses flaws
- No DTPA: lower ceiling; edits plateau
More iterations (20) → VISTA still climbs; baselines plateau at ~8 rounds.
Weaker backbone (Veo 2) → +23.8% / +33.3% vs direct, showing the boost is model-agnostic.
Show, Don’t Tell—Before vs After
Prompt: “Spaceship enters hyperspace, stars streak.”
- Direct: ship flies up, star field static → looks like a GIF.
- VISTA: horizontal acceleration, parallax streaks, engine glow pulsing with speed.
Prompt: “Couple releases lantern, night sky.”
- Direct: afternoon sky → instant fake navy, hard cut to the outro.
- VISTA: smooth dusk transition, wind-noise-free audio, cross-dissolve to the CTA.
FAQ
Q: Can I plug in Stable Video Diffusion or Runway?
A: Any T2V model that takes text prompts works; just swap the generation call. A sketch follows.
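For instance, a thin protocol keeps the loop model-agnostic; the class and method names below are placeholders, not real SDK calls:

```python
# A thin backend protocol keeps the loop model-agnostic. The class and
# method names are placeholders, not real SDK calls.
from typing import Protocol

class T2VBackend(Protocol):
    def generate(self, prompt: str) -> bytes: ...  # returns an encoded video

class VeoBackend:
    def generate(self, prompt: str) -> bytes:
        raise NotImplementedError("call your Veo endpoint here")

class RunwayBackend:
    def generate(self, prompt: str) -> bytes:
        raise NotImplementedError("call Runway's API here")

# The rest of the loop only ever sees T2VBackend.generate(prompt).
```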
Q: Will the story drift after many loops?
A: DTPA step 2 locks the original user intent; edits only add specificity.
Q: Do I have to use Gemini 2.5 Flash for judging?
A: Code is modular; tested with Gemini Pro & Qwen2.5-VL-32B—trend holds.
Takeaway
VISTA shows that the next frontier in generative video isn’t bigger models—it’s smarter, self-correcting prompts.
Next time someone pings you “make it cooler,” just hand the line to VISTA and let the agent argue with itself until the stars finally streak the right way.