VisGym: The Ultimate Test for Vision-Language Models – Why Top AI Agents Struggle with Multi-Step Tasks
The Core Question Answered Here: While Vision-Language Models (VLMs) excel at static image recognition, can they truly succeed in environments requiring perception, memory, and action over long periods? Why do the most advanced “frontier” models frequently fail at seemingly simple multi-step visual tasks?
In the rapidly evolving landscape of artificial intelligence, Vision-Language Models have become the bridge connecting computer vision with natural language processing. From identifying objects in a photo to answering complex questions about an image, their performance is often nothing short of miraculous. However, when we shift the goalpost from “looking and describing” to “looking and doing”—that is, navigating a complex environment by observing images, formulating plans, executing actions, and remembering history—the situation changes drastically.
Humans intuitively manipulate objects, solve puzzles, and explore unknown mazes by integrating deep environmental perception, memory of past steps, and planning for future actions. For AI, this capability remains a significant bottleneck. To systematically diagnose and address this gap, researchers at UC Berkeley have introduced VisGym. This is a comprehensive evaluation and training platform consisting of 17 diverse environments designed to rigorously test and improve VLMs in long-horizon visual decision-making.
This article dives deep into the design philosophy and technical architecture of VisGym. We will analyze the test results of frontier models like GPT-5 and Gemini 2.5 Pro, reveal specific weaknesses in visual interaction, and discuss how targeted fine-tuning can boost performance.
What is VisGym? A Look Inside the 17 Environments
The Core Question Answered Here: What exactly does VisGym look like, and how does its design of diverse tasks comprehensively test a model’s multimodal decision-making capabilities?
VisGym is not just a simple benchmark; it is more like a highly customizable “gymnasium” filled with a variety of “training equipment” targeting different capabilities. Unlike traditional single-task benchmarks, VisGym builds a unified yet diverse ecosystem aimed at isolating and testing generic factors that influence visual interactive decision-making, rather than just evaluating performance in specific domains.
The core value of VisGym lies in its diversity. It includes 17 meticulously designed long-horizon environments spanning symbolic logic, real-image understanding, navigation, and robotic manipulation. While the backgrounds vary, all tasks require the model to integrate visual input, language instructions, and action history to make correct decisions.
To get a clearer picture of VisGym’s composition, we can categorize these environments and examine their specific parameter configurations:
Environment Categories and Features Overview
| Environment Name | Domain | Observability | Dynamics | # Difficulty Params | Action Examples |
|---|---|---|---|---|---|
| Colorization | Real Images | Full | Known | 1 | Rotate(θ), Saturate(δ), Stop() |
| Counting | Real Images | Full | Known | 2 | Mark(x, y), Undo(), Guess(N), Stop() |
| Jigsaw | Real Images | Full | Known | 2 | Swap((r1,c1),(r2,c2)), Reorder([…]), Stop() |
| Matchstick Equation | Synthetic Images | Full | Known | 1 | Move Stick([i,s,j,t]), Undo(), Stop() |
| Maze 2D | Synthetic Images | Full | Known | 2 | Move(d), Stop() |
| Maze 3D | Synthetic Images | Partial | Known | 2 | Move(0), Turn(d), Stop() |
| Mental Rotation 2D/3D | Real/Synthetic | Full/Partial | Known | 1-3 | Rotate([dy,dp,dr]), Stop() |
| Pick & Place / Reach (Robotic Arm) | Synthetic Images | Partial | Unknown | 0 | Move([x,y,z]), Gripper, Stop() |
| Video Unshuffle | Real Images | Full | Known | 3 | Swap(i,j), Reorder([…]), Stop() |
| Zoom-In Puzzle | Real Images | Full | Known | 5 | Swap(i,j), Reorder([…]), Stop() |
Note: This table is based on environment configuration data provided by VisGym, covering tasks ranging from simple symbolic manipulation to complex robotic control.
Key Design Differences: Comparison with Other Frameworks
VisGym is not the only visual interaction testing framework, but it fills many gaps left by previous tools. Unlike LIBERO (focused mainly on robotic manipulation) or OSWorld (focused mainly on computer usage), VisGym emphasizes “cross-domain” diagnostic capabilities.
It supports comparisons between Structured Observations (like ASCII characters) and Unstructured Observations (like pixel images), supports Partially Observable Markov Decision Processes (POMDPs) where the model must infer hidden states from history, and importantly, supports Supervised Fine-Tuning (SFT) and Online Reinforcement Learning. This means researchers can use the large-scale demonstration data it generates to actually train models, rather than just grading them.
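To make the structured-vs-unstructured distinction concrete, here is a minimal sketch of the same 2D-maze state exposed in both observation modes. The helper and variable names are illustrative, not VisGym’s actual API:

```python
import numpy as np

# Structured observation: an ASCII grid the model can read as plain text.
ascii_obs = (
    "#########\n"
    "#A..#...#\n"
    "#.#.#.#.#\n"
    "#.#...#G#\n"
    "#########"
)  # 'A' = agent, 'G' = goal, '#' = wall, '.' = free cell

# Unstructured observation: the same state rendered as an RGB image array,
# which the VLM must visually ground before it can reason about it.
pixel_obs = np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder render

def to_message(obs, mode="text"):
    """Package an observation as a chat message; images travel as attachments."""
    if mode == "text":
        return {"role": "user", "content": f"Current maze state:\n{obs}"}
    return {"role": "user", "content": "Current maze state (see attached image).", "image": obs}
```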
Author’s Reflection
When examining these 17 environments, what struck me most was the clever combination of “observability” and “dynamics.” For example, the 2D Maze is fully observable, while the 3D Maze is partially observable; robotic arm tasks are not only partially observable but also have unknown dynamics. This design forces models to possess the ability to handle uncertainty, which is a necessary step towards general intelligence. Mere image understanding is far from enough here; the model must learn to think and act like a true intelligent agent exploring the world.
Technical Architecture & Core Design: How to Make Models “Play”
The Core Question Answered Here: How does VisGym translate complex visual interaction tasks into instructions VLMs can understand and execute? What are the innovations in its technical implementation?
VisGym is built on the widely used Gymnasium framework, making it compatible with classic RL environments like MuJoCo and Atari. However, to adapt to modern VLM characteristics, VisGym introduces key enhancements that allow models to control environments via natural language interaction.
1. Function-Conditioned Action Spaces
Traditional RL environments typically use discrete action IDs or continuous vectors to represent actions. This approach is unfriendly to humans and unintuitive for VLMs. VisGym redefines the action space as function calls with parameters.
For example, in a jigsaw puzzle, the model doesn’t output Action ID: 5; instead, it outputs ("swap", (1, 2)). This abstraction leverages the powerful function-calling capabilities of VLMs, allowing models to compose strategies across domains. For instance, it might call move(x, y) in one task and rotate(theta) in another. This semantic-level abstraction significantly lowers the learning difficulty for the model.
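A minimal sketch of what this looks like on the receiving side (the action set and helper below are illustrative, not taken from the VisGym codebase): the environment only needs to parse the model’s textual function call into a name and arguments before dispatching it.

```python
import ast

KNOWN_FUNCTIONS = {"swap", "reorder", "stop"}   # hypothetical action set for a jigsaw task

def parse_action(model_output: str):
    """Parse output such as '("swap", (1, 2))' into a function name and its arguments."""
    call = ast.literal_eval(model_output.strip())   # safe literal parsing, no eval()
    name, args = call[0], call[1:]
    if name not in KNOWN_FUNCTIONS:
        raise ValueError(f"unknown function: {name!r}")
    return name, args

print(parse_action('("swap", (1, 2))'))             # -> ('swap', ((1, 2),))
```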
2. Function Instructions and Environmental Feedback
To enable zero-shot interaction, VisGym provides a set of natural language descriptions of functions at the start of each task, detailing the purpose and argument constraints of each function. For example, it tells the model: “The move function accepts a direction parameter, which can be ‘up’, ‘down’, ‘left’, ‘right’.”
Additionally, beyond visual feedback (image changes), the environment provides textual feedback. After every action, the environment returns a text description, such as “invalid format”, “out of bounds”, or “executed”. This acts as a crucial aid for models with weaker visual perception.
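As a rough illustration (the grid representation and return values here are assumptions, not VisGym internals), the function instructions and the per-step textual feedback might look like this for a simple maze move:

```python
FUNCTION_INSTRUCTIONS = (
    "move(direction): Move the agent one cell. direction must be one of "
    "'up', 'down', 'left', 'right'.\n"
    "stop(): Declare the task finished."
)

def apply_move(grid, pos, direction):
    """Execute a move on a list-of-strings grid and return (new_pos, feedback_text)."""
    deltas = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    if direction not in deltas:
        return pos, "invalid format"        # malformed argument: state unchanged
    r, c = pos[0] + deltas[direction][0], pos[1] + deltas[direction][1]
    if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
        return pos, "out of bounds"         # wall or edge: state unchanged
    return (r, c), "executed"               # success: position updated
```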
3. Oracle Solvers and Data Generation
This is a highlight of VisGym. To ensure environments are solvable and provide data for supervised fine-tuning, the research team implemented heuristic multi-step solvers for each environment. These solvers can complete each task using available actions and support multiple solving strategies and randomness. This means VisGym can not only run evaluations but also automatically generate massive, high-quality, structured demonstration data for training models.
# Pseudo-code example: VisGym environment interaction loop
# (VisGym, model, and max_steps are placeholders for the environment wrapper,
#  the VLM under test, and the step budget)

# Initialize the environment
env = VisGym("Matchstick_Equation")
observation = env.reset()

# Get the natural-language function instructions for this task
instructions = env.get_function_instructions()
# Instruction example: "move([i, s, j, t]): Move the i-th matchstick to position j, direction s..."

# Start the multi-turn interaction
history = []
for step in range(max_steps):
    # The model generates a structured function call from the history,
    # the current observation, and the function instructions
    action = model.predict(history=history, obs=observation, instructions=instructions)

    # Execute the action; receive the new observation, textual feedback, and reward
    observation, feedback, reward, done, info = env.step(action)

    # Record the turn in the interaction history
    history.append((observation, action, feedback))

    if done:
        break
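Building on the loop above, a data-generation pass might look like the following sketch. The `get_oracle_solver()` and `next_action()` names are assumptions used for illustration; the point is simply that solver rollouts can be serialized directly into fine-tuning data.

```python
import json

def generate_demonstrations(env_name, n_episodes, out_path, max_steps=50):
    """Roll out a heuristic solver and dump its trajectories as JSONL for SFT."""
    with open(out_path, "w") as f:
        for _ in range(n_episodes):
            env = VisGym(env_name)                  # same assumed constructor as above
            observation = env.reset()
            solver = env.get_oracle_solver()        # assumed: per-environment heuristic solver
            trajectory = []
            for _ in range(max_steps):
                action = solver.next_action(observation)   # assumed solver interface
                observation, feedback, reward, done, info = env.step(action)
                trajectory.append({"action": action, "feedback": feedback})
                if done:
                    break
            f.write(json.dumps({"env": env_name, "trajectory": trajectory}) + "\n")
```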
The Frontier Model Exam: Real Battle Stats for GPT-5 & Gemini 2.5 Pro
The Core Question Answered Here: How do the currently accepted AI “ceiling” models—GPT-5, Gemini 2.5 Pro, etc.—actually perform when facing these long-horizon visual tasks?
VisGym conducted rigorous evaluations of 12 state-of-the-art vision-language models. These included proprietary closed-source models (like GPT-5, Gemini 2.5 Pro, Claude Sonnet 4, Grok 4 Fast) and open-weight models (like Qwen3-VL, GLM-4.5V, Llama-4-Maverick). The evaluations were divided into “Easy” and “Hard” configurations, with each model tested for 70 episodes per configuration.
Overall Performance: “Honor Students” Who Failed the Exam
The results show that even the best-performing proprietary models fall far short of human levels on VisGym.
- Best Model (Gemini 3 Pro): Achieved an average success rate of only 46.61% in the Easy configuration.
- Hard Configuration: Even the best model saw its success rate drop to 26.00%.
This means that in more difficult settings, three out of every four attempts result in complete failure. This fully demonstrates that long-horizon visual interactive decision-making remains a massive challenge for current VLMs.
Model Personality: Unique Strengths
While overall scores were low, different models showed distinct “personalities” and specialties, reflecting differences in their training data and architecture:
- GPT-5: Proved to be the leader in handling long-context visual interactions. It performed strongest in tasks requiring inference of unknown dynamics (like Matchstick Rotation) and in Hard settings. Its successful trajectories often contained more steps, showing stronger patience and planning capability.
- Gemini 2.5 Pro: Exhibited extremely strong low-level visual perception. It dominated tasks requiring tight spatial alignment, precise correspondence of local patterns, and sensitivity to subtle visual cues, such as Jigsaw, 2D Maze, and Zoom-In Puzzle.
- Qwen3-VL: Particularly good at object localization, performing best in the “Referring Dot-Pointing” task.
- Most Models: The step counts of successful episodes were concentrated around 3-5; once the required number of steps grew, success rates plummeted. This indicates that most models struggle with complex tasks requiring more than 5 steps.
Common Failure Modes
By analyzing failure trajectories, the research team summarized four recurring failure types across tasks:
- Restricted Action Space & Looping: Models tend to repeat a single operation or fixed-magnitude action, for example moving continuously in the same direction in the robotic-arm task, or always using “swap” instead of the more efficient “reorder”.
- State Mismanagement: Models fail to maintain or update internal state. They ignore textual or environmental feedback, revisit previously explored areas, or repeat illegal actions even after multiple “wall collision” messages.
- Early Termination: The model issues the “stop” command prematurely, before reaching the goal.
- Ignoring Visual or Spatial Information: Models disregard the provided visual information, for example remaining indifferent when the target object leaves the frame, or completely ignoring visual misalignment in Mental Rotation tasks.
Deep Diagnosis: Why Do Models Fail?
The Core Question Answered Here: What specific factors limit VLM performance in long-horizon visual decision-making? Is it the context length? Poor visual recognition? Or a lack of feedback mechanisms?
The greatest value of VisGym lies in its powerful controllability. The research team performed fine-grained diagnosis on the causes of model failure by controlling variables. Here are key findings that offer strong guidance for future model improvements.
1. The “Inverted-U” Trap of Context Length
Typically, we assume giving models more history (longer context) helps them make better decisions. However, in VisGym experiments, this wasn’t always the case.
Experiments show a relationship between model performance and retained conversation history turns that resembles an Inverted U:
- Short-term Gains: Retaining 1 to 4 previous turns improves performance, as the model can leverage previous feedback or visual changes.
- Long-term Drop: When the full, unbounded history is provided, model performance declines. This indicates that stale observation data and redundant visual information interfere with the model’s judgment.
This suggests that current VLMs are not good at extracting key information from long visual histories; they need a more effective “memory compression” mechanism.
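A practical takeaway is to truncate the history before it reaches the model. The sketch below (the message format and field names are illustrative, not part of VisGym) keeps only the task instructions plus the last k turns:

```python
def build_context(instructions, history, current_obs, k=4):
    """Assemble the model's context from the last k turns plus the current observation."""
    recent = history[-k:] if k is not None else history   # drop stale turns beyond k
    messages = [{"role": "system", "content": instructions}]
    for obs, action, feedback in recent:
        messages.append({"role": "assistant", "content": str(action)})
        messages.append({"role": "user", "content": f"feedback: {feedback}", "image": obs})
    messages.append({"role": "user", "content": "current observation", "image": current_obs})
    return messages
```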
2. The Huge Gap Between Visual and Text Representation
To test if models lack “visual understanding” or “logical reasoning,” researchers converted some visual tasks (like Matchstick Equation, 2D Maze) into pure text ASCII art.
The results were surprising:
- GPT-5: Performance improved by 3 to 4 times in most tasks. This suggests GPT-5’s main bottleneck lies in visual grounding (mapping pixels to semantics), not logical reasoning.
- Text Isn’t Always the Winner: In the “Matchstick Equation” task, visual performance was actually better than text. This is likely because the irregular shapes and spacing of ASCII art create distorted glyphs that models struggle with.
This reveals a profound lesson: For current VLMs, rendering a task into visual images often makes it harder. Pure logic tasks forced through the visual channel often introduce unnecessary noise.
3. Pathological Dependence on Text Feedback
Humans can understand “blocked” by seeing an object collide without being told “you hit a wall.” Can VLMs do this?
By removing the text feedback provided by the environment (keeping only visual state transitions), all models showed consistent and significant performance drops. This means current VLMs rely heavily on explicit textual descriptions to infer action validity; they can barely infer physical rules or constraints (like “hitting a wall” or “illegal move”) from pure visual changes.
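Reproducing this ablation in your own environment is straightforward: keep the visual state transitions but blank out the feedback string before it reaches the model. A small sketch of such a wrapper, built on the interaction loop shown earlier:

```python
def step_with_optional_feedback(env, action, history, use_text_feedback=True):
    """env.step() wrapper that optionally hides the textual feedback from the model."""
    observation, feedback, reward, done, info = env.step(action)
    if not use_text_feedback:
        feedback = ""       # the model must infer action validity from pixels alone
    history.append((observation, action, feedback))
    return observation, reward, done
```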
4. The Double-Edged Sword of Goal Observation
If we show the model the “Final Goal Image” at the start of a task (e.g., the completed puzzle), it should lower the difficulty. Experiments confirmed this, with models generally improving.
However, there are risks. In “Zoom-In Puzzle” and “Matchstick Equation,” GPT-5 and Gemini 2.5 Pro performed worse when shown the goal image. Further analysis showed this was due to visual misjudgment—the model incorrectly decided that the current initial state already “matched” the goal image (e.g., Gemini 2.5 Pro had an 80% error rate in Zoom-In Puzzle), leading to premature task termination.
This is an interesting paradox: Explicit goal observations can raise the theoretical ceiling, but if visual perception is weak, they can become a misleading source.
Training & Fine-Tuning: How to Make Models Stronger?
The Core Question Answered Here: Since existing models perform poorly, can we significantly improve their performance in multi-step visual decision-making using supervised fine-tuning (SFT) on data generated by VisGym?
VisGym is not just an exam hall; it is a training ground. Using built-in oracle solvers, the research team generated numerous demonstration trajectories and conducted supervised fine-tuning experiments.
1. The Huge Gains from Fine-Tuning
Whether using single-task or mixed-task fine-tuning, the fine-tuned models (based on Qwen2.5-VL-7B-Instruct) achieved State-of-the-Art (SOTA) performance on most tasks. This validates two facts:
- The tasks designed in VisGym are learnable.
- Structured solver demonstration data is extremely effective for improving visual interaction capabilities in VLMs.
2. Newer Models Generalize Better
Experiments compared base models from two different generations: Qwen2.5-VL and Qwen3-VL. Although both were trained on “Easy” difficulty data, the unseen “Hard” difficulty test showed:
- Qwen2.5-VL saw a massive drop in success rate on Hard tasks.
- Qwen3-VL also dropped on Hard tasks, but much less than the previous generation, with a success rate nearly double that of Qwen2.5-VL.
This indicates that with the advancement of large model architectures and training data, modern VLMs are significantly enhancing their Out-of-Distribution (OOD) generalization capabilities in multi-step visual decision-making.
3. Vision vs. LLM: Which Matters More?
Researchers attempted to isolate the contribution of the “Vision Encoder” versus the “LLM Backbone.” By comparing results from fine-tuning only the visual part, only the LLM part, and both:
- Most tasks benefited from fine-tuning both parts.
- The LLM contributed the larger gain, especially in tasks with partial observability or unknown dynamics.
- Vision fine-tuning was decisive mainly in tasks requiring fine perception (like Zoom-In Puzzle).
This suggests the current bottleneck might not lie in “seeing clearly” but in “remembering” and “reasoning through.”
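To run this kind of ablation yourself, the usual PyTorch pattern is to freeze one set of parameters and train the other. The submodule names below (`vision_encoder`, `language_model`) are assumptions for illustration; real VLM checkpoints expose different attribute names.

```python
def set_trainable(model, train_vision: bool, train_llm: bool):
    """Freeze or unfreeze the vision encoder and the LLM backbone independently."""
    for p in model.vision_encoder.parameters():    # assumed attribute name
        p.requires_grad = train_vision
    for p in model.language_model.parameters():    # assumed attribute name
        p.requires_grad = train_llm

# Example: an "LLM-only" fine-tuning run
# set_trainable(vlm, train_vision=False, train_llm=True)
```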
4. Information-Revealing Behaviors: Data Quality > Quantity
This is one of the most insightful findings of the study. In environments with unknown dynamics or partial observability, not all demonstration data is equally useful.
- Standard Data: Just shows how to complete the task (e.g., walk directly to the finish).
- Information-Revealing Data: Deliberately reveals hidden states during the solution. For example, in the unknown-dynamics Matchstick Rotation task, tentatively making small moves first to understand the “degree” relationship before doing the final alignment; in partially observable 3D Mental Rotation, rotating fully once to see the whole object before aligning with the target.
Experimental results show that training with “Information-Revealing” data boosted success rates from 32.9% to 70.0%. This tells us: In data-scarce or complex environments, teaching models “how to explore” is more important than teaching them “how to act.”
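To make the distinction tangible, here are two hand-written demonstrations of the same unknown-dynamics rotation task (purely illustrative, not from the paper’s dataset). The second spends its early steps probing how far a small command actually rotates the object before committing:

```python
# Standard demonstration: go straight to the answer.
standard_demo = [
    {"action": ("rotate", [0, 0, 90]), "feedback": "executed"},
    {"action": ("stop",),              "feedback": "goal reached"},
]

# Information-revealing demonstration: probe first, then commit.
information_revealing_demo = [
    {"action": ("rotate", [0, 0, 10]), "feedback": "executed"},  # small probe reveals the scale
    {"action": ("rotate", [0, 0, 10]), "feedback": "executed"},  # second probe confirms it
    {"action": ("rotate", [0, 0, 70]), "feedback": "executed"},  # commit to the remaining angle
    {"action": ("stop",),              "feedback": "goal reached"},
]
```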
Conclusion & Future Outlook: The Path to General Visual Intelligence
VisGym provides us with an incredibly valuable perspective, revealing the true level of current VLMs in visual interactive decision-making. While models like GPT-5 and Gemini 2.5 Pro are amazing at chatting and code generation, in the mirror of VisGym, they expose soft spots like weak long-context processing, fragile visual perception, and heavy reliance on text feedback.
However, through systematic diagnosis and targeted fine-tuning, we can also see significant performance improvements. In particular, the importance of “Information-Revealing” data provides a new direction for future AI agent training: Agents need to learn not just to do things, but also how to take action to acquire information in unknown worlds.
As an open-source, scalable framework, VisGym provides a unified arena for global researchers. In this arena, we are no longer just competing to see who can recognize more objects, but who can act like a true intelligent agent, perceiving, remembering, thinking, and finally acting in a complex, dynamic, and partially unknown visual world.
Practical Summary / Action Checklist
Based on findings from VisGym, if you are developing or evaluating visual interactive agents, the following recommendations are worth referencing:
- Prioritize Context Management: Don’t blindly feed unlimited history images to the model. Try truncating stale history or developing specialized memory-compression modules that retain only the key frames relevant to the current decision.
- Provide Necessary Text Feedback: If your environment allows, ensure textual feedback (e.g., “execution successful”, “hit a wall”) is provided. Current VLMs rely heavily on these cues to understand the logic behind visual changes.
- Use Goal Images with Caution: While providing a goal image can offer direction, beware of the backfire risk from visual misjudgment. For fine-grained tasks, ensure the model has high-precision perception capabilities before introducing goal observations.
- Prioritize Training LLM Reasoning: If compute is limited, prioritize fine-tuning the LLM part (the language model backbone), as the current bottleneck is often reasoning through state sequences rather than single visual feature extraction.
- Generate “Information-Revealing” Demo Data: When preparing fine-tuning data, don’t just record “expert playthroughs.” Record trajectories that include “observing the environment” and “probing boundaries.” This data teaches the model how to handle uncertainty.
- Focus on Visual vs. Text Modality Transfer: If your task is purely logical, consider keeping it in the text modality. Don’t force a conversion to images unless you are sure the model’s visual grounding capability is strong enough.
One-Page Summary (TL;DR)
- What is VisGym? A testing and training platform with 17 diverse environments (mazes, puzzles, robots, etc.) designed to evaluate VLM performance in long-horizon, multi-step visual interactive tasks.
- Core Findings: The strongest models (GPT-5, Gemini 2.5 Pro) achieve success rates of less than 50% in Easy tasks and only about 26% in Hard tasks.
- Main Bottlenecks:
  - Long Context Failure: Too many history images interfere with the model.
  - Fragile Visual Perception: Visual rendering increases task difficulty compared to text.
  - Feedback Dependence: Models cannot infer physical rules from visual changes alone; they require text descriptions.
- Training Insights:
  - SFT Works: Fine-tuning with solver-generated data significantly boosts performance.
  - Newer Models Generalize Better: Qwen3-VL outperforms Qwen2.5-VL in unseen hard tasks.
  - Data Quality is Key: Demonstration data containing “exploration” and “information-revealing” behaviors is far more valuable than pure “direct solution” data.
- Conclusion: VisGym reveals the gap towards general visual intelligence and provides a systematic method to diagnose and bridge these gaps.
Frequently Asked Questions (FAQ)
Q1: How is VisGym different from other AI benchmarks like ImageNet or VideoGameBench?
A: ImageNet mainly tests static image classification, while VideoGameBench involves interaction but focuses on specific domains. VisGym is a cross-domain, unified framework that not only evaluates performance but supports supervised fine-tuning and is specifically designed to diagnose generic factors (like context length, feedback types) influencing visual interactive decision-making.
Q2: Why does GPT-5 perform better in long-context tasks?
A: The research found that GPT-5 performed strongest in tasks requiring inference of unknown dynamics (like Matchstick Rotation) and in Hard settings. Its successful trajectories often included more steps, suggesting GPT-5 has stronger robustness in processing long-sequence information and utilizing history for planning compared to other models.
Q3: Why does showing the model the goal image sometimes lead to worse performance?
A: This is a visual perception paradox. If the model’s visual perception isn’t precise enough, it might mistakenly judge the current initial state as already matching the goal image, leading to premature task termination. In fine-grained tasks like Zoom-In Puzzle, this misjudgment is particularly pronounced.
Q4: What are “Information-Revealing” demonstration data, and why are they important?
A: “Information-Revealing” data refers to operation sequences in a trajectory that deliberately reveal hidden states or environment dynamics. For example, probing with a small move before moving a robotic arm to understand direction, or rotating fully before aligning a 3D object to see its shape. This type of data helps models learn how to make decisions in partially observable or unknown environments much more than simple “direct solution” data.
Q5: For developers, what is the biggest practical value of VisGym right now?
A: For developers, VisGym provides a ready-made, high-quality, scalable source of generated demonstration data (via its built-in solvers). These structured trajectories can be used to fine-tune custom visual agents, significantly improving their performance in multi-step interaction tasks without expensive manual annotation.

