Gelato-30B-A3B: Teach Computers to Understand & Execute GUI Instructions with AI

Introduction: The Challenge of Teaching AI to Use Computers

In an era where artificial intelligence is transforming how we interact with technology, one fundamental challenge remains: how can we teach AI agents to reliably locate and interact with specific elements on a computer screen based on simple human instructions? This problem, known as GUI grounding, represents the critical bridge between human language and computer interface interaction.

The ML Foundations research team has recently made a significant breakthrough with their release of Gelato-30B-A3B, a state-of-the-art grounding model specifically designed for graphical user interfaces. This advanced system represents a substantial leap forward in enabling AI agents to convert natural language instructions into precise, reliable click locations on computer screens.

Understanding GUI Grounding: Why It Matters

GUI grounding refers to the process of mapping natural language instructions to specific element locations within a graphical user interface. In practical terms, it’s what enables an AI to understand exactly where to click when you say “open the settings menu” or “find the save button.”

In typical computer-use agent architectures, two main components work together: a planning module that understands high-level user instructions and creates action plans, and a grounding module that translates these plans into specific interface operations. Gelato-30B-A3B specializes in this grounding function, serving as a modular component that can be integrated into broader AI systems.

Consider this real-world example: when a user requests “clear the browser cache,” a planning model like GPT-5 might break the task down into sequential steps: “open browser settings,” “locate privacy and security options,” “select clear browsing data.” Gelato’s role is to identify the precise screen coordinates of each interface element involved: the exact location of the settings icon, the spot where the privacy option appears, and so on. With those coordinates in hand, the agent can carry out the complete task click by click.
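
To make this division of labor concrete, here is a minimal sketch of such an agent loop. The plan_next_step, ground, and click functions are placeholders invented for illustration; none of these names come from the Gelato release.

from PIL import ImageGrab  # any screen-capture backend would work here

def plan_next_step(task, history):
    """Placeholder for a planning model (e.g., GPT-5): returns the next
    low-level instruction, or None when the task is complete."""
    raise NotImplementedError

def ground(screenshot, instruction):
    """Placeholder for a grounding model such as Gelato: returns the
    (x, y) pixel coordinates of the instructed element."""
    raise NotImplementedError

def click(x, y):
    """Placeholder for an OS-level click executor."""
    raise NotImplementedError

def run_agent(task, max_steps=50):
    history = []
    for _ in range(max_steps):
        step = plan_next_step(task, history)  # e.g., "open browser settings"
        if step is None:
            break                             # planner reports the task is done
        screenshot = ImageGrab.grab()         # capture the current screen state
        x, y = ground(screenshot, step)       # Gelato's role: step -> coordinates
        click(x, y)
        history.append(step)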

This separation between planning and grounding proves particularly valuable in today’s diverse computing environments. With multiple operating systems (Windows, macOS, Linux) and thousands of applications, each with unique interface layouts, having a specialized grounding model allows AI agents to adapt to this variety without requiring complete retraining for each environment.

Gelato-30B-A3B: Technical Architecture and Capabilities

Gelato-30B-A3B is a substantial AI model built upon the Qwen3-VL-30B-A3B-Instruct foundation model and incorporating a mixture-of-experts architecture. From a technical perspective, the model processes two types of input, screen captures and textual instructions, and produces a single output: the coordinates of a click action.

This architectural design enables straightforward integration into existing AI agent frameworks. During operation, a planning model determines the next high-level action to execute, then calls upon Gelato to resolve this action into a specific screen location.

Performance Benchmarks and Comparative Analysis

In standardized testing environments, Gelato-30B-A3B has demonstrated exceptional performance:

  • 63.88% accuracy on the ScreenSpot Pro benchmark
  • 69.15% accuracy on OS-World-G evaluation
  • 74.65% accuracy on the refined OS-World-G assessment

While these percentages might initially appear modest, they represent current state-of-the-art performance in the challenging domain of GUI grounding. More significantly, Gelato-30B-A3B has outperformed previous specialized computer grounding models like GTA1-32B, and even surpassed much larger visual language models such as Qwen3-VL-235B-A22B-Instruct.

Model                         Activated Parameters   ScreenSpot-Pro   OS-World-G   OS-World-G (Refined)
Qwen3-VL-30B-A3B-Instruct     3B                     60.5%            61.0%        n/a
Qwen3-VL-235B-A22B-Instruct   22B                    62.0%            66.7%        n/a
OpenCUA-72B                   72B                    60.8%            59.6%        n/a
GTA1-32B                      32B                    63.6%            65.2%        72.2%
Gelato-30B-A3B                3B                     63.88%           69.15%       74.65%

(n/a = not reported)

The Click-100k Dataset: Foundation of Gelato’s Success

Like any sophisticated AI model, Gelato-30B-A3B’s capabilities are fundamentally built upon its training data. The model’s impressive performance stems from its foundation—the meticulously curated Click-100k dataset. This specialized collection contains over 100,000 paired examples of computer screen images matched with natural language instructions.

Constructing a Comprehensive Training Resource

Rather than building from scratch, the research team integrated and refined multiple existing public datasets to create Click-100k. The compilation includes samples from ShowUI, AutoGUI, PC Agent E, WaveUI, OS Atlas, UGround, PixMo Points, SeeClick, UI VISION, and other sources, all mapped to a unified structural format.

Each data source contributed a maximum of 50,000 samples to ensure diversity and balance within the dataset. Every sample includes screen imagery, natural language instructions, target element bounding boxes, image dimensions, and normalized bounding box coordinates.
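
The exact field names of the released dataset are not spelled out here, so the record below is a hypothetical illustration of one Click-100k-style sample built from the fields just listed; the real schema may differ.

# Hypothetical Click-100k-style record; field names are illustrative only.
sample = {
    "image": "screenshots/settings_page.png",   # screen capture
    "instruction": "Open the privacy and security options",
    "bbox": [412, 180, 590, 214],               # target element in pixels (x1, y1, x2, y2)
    "width": 1920,                              # image dimensions
    "height": 1080,
    "bbox_norm": [214.6, 166.7, 307.3, 198.1],  # the same box rescaled to a 0-1000 grid
}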

Rigorous Data Quality Assurance

Simply aggregating data wasn’t sufficient. The research team implemented a comprehensive filtering pipeline to ensure training sample quality:

  1. OmniParser Validation: Eliminated clicks that didn’t land on detected interface elements
  2. Complexity Filtering: Used Qwen2.5-VL-7B and SE-GUI-3B to identify and remove overly simple interactions (like basic hyperlink clicks)
  3. Instruction-Alignment Checking: Employed GTA1-7B-2507 and UI-Venus-7B to remove samples where text instructions didn’t match click regions

This meticulous data curation strategy yielded significant benefits. Experimental results demonstrated that models trained on the filtered data achieved a 9 percentage point accuracy improvement on ScreenSpot Pro benchmarks compared to models trained on unfiltered data.
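
As a rough sketch of how these three checks chain together, the pipeline below uses placeholder predicates; the real validators are the OmniParser and VLM judges named above, whose interfaces are not described in this article.

def passes_omniparser(sample):
    """Placeholder: True if the click lands on an element OmniParser detects."""
    raise NotImplementedError

def is_too_simple(sample):
    """Placeholder for the Qwen2.5-VL-7B / SE-GUI-3B difficulty judges."""
    raise NotImplementedError

def instruction_matches_click(sample):
    """Placeholder for the GTA1-7B-2507 / UI-Venus-7B alignment judges."""
    raise NotImplementedError

def filter_dataset(samples):
    kept = []
    for s in samples:
        if not passes_omniparser(s):          # 1. click must hit a detected element
            continue
        if is_too_simple(s):                  # 2. drop trivial interactions
            continue
        if not instruction_matches_click(s):  # 3. instruction must match the click region
            continue
        kept.append(s)
    return kept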

Expanding Professional Application Coverage

Recognizing that existing public datasets often lacked professional application examples—a critical gap for general-purpose computer-use agents—the research team specifically enhanced Click-100k with data from UI VISION and a JEDI subset focusing on spreadsheet and text cell manipulation.

Additionally, they extracted data from over 85 professional application tutorial videos, using Claude-4-Sonnet to generate click bounding boxes and low-level instructions, followed by manual verification and correction. This deliberate expansion significantly improved the model’s capability with professional software tools.

The Training Process: Building Gelato-30B-A3B

Gelato-30B-A3B’s training incorporated GRPO (Group Relative Policy Optimization), a reinforcement learning approach introduced in DeepSeekMath and since adopted by similar systems.

Key Training Configuration Elements

The research team followed a DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) style framework with several important modifications:

  • Removal of the KL divergence term from the objective function
  • Implementation of a higher clipping threshold (0.28)
  • Exclusion of advantage-zero rollouts

The reward structure was sparse: positive feedback was given only when a predicted click landed within the target bounding box. This setup, similar to the one used to train the GTA1 models, proved highly effective in practice.
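
Neither the reward code nor the training loss is published here, but both ingredients are simple enough to sketch. Below is an illustrative version of the sparse click reward and of a DAPO-style clipped surrogate with no KL term and a raised upper threshold of 0.28; treat it as a sketch of the ideas, not the team’s implementation.

import torch

def click_reward(pred_x, pred_y, bbox):
    """Sparse reward: 1 if the predicted click falls inside the target
    bounding box (x1, y1, x2, y2), else 0."""
    x1, y1, x2, y2 = bbox
    return 1.0 if (x1 <= pred_x <= x2 and y1 <= pred_y <= y2) else 0.0

def clipped_policy_loss(logp_new, logp_old, advantages,
                        clip_low=0.20, clip_high=0.28):
    """DAPO-style surrogate: no KL penalty, and a higher upper clipping
    threshold than the usual symmetric 0.2. Advantage-zero rollouts are
    assumed to have been excluded upstream."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    return -torch.minimum(unclipped, clipped).mean()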

Training Progression and Outcomes

Training commenced from the Qwen3-VL-30B-A3B-Instruct baseline model and progressed through 100 reinforcement learning steps on 32 A100 GPUs with 40GB memory. Throughout this process, the model demonstrated consistent improvement across all evaluation benchmarks.

The optimal checkpoint emerged at step 84, selected based on average performance across ScreenSpot Pro, OS-World-G, and OS-World-G Refined evaluations. At this point, the model achieved 63.88% accuracy on ScreenSpot-Pro, and 67.19% and 73.40% on OS-World-G and OS-World-G Refined respectively.

Developing Refusal Capability

An interesting discovery emerged during evaluation: the research team successfully elicited refusal behavior from Gelato without explicit training for this capability. By simply appending “If you cannot find the element, return refusal” to instruction prompts and including refusal cases in evaluations (previously treated as zero-accuracy), they increased OS-World-G accuracy to 69.15% (a 1.96 percentage point improvement) and OS-World-G Refined to 74.65% (a 1.25 percentage point gain).

This refusal capability has important practical implications, preventing AI agents from executing incorrect operations when uncertain about target locations.
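
Reproducing this behavior requires only a prompt change. The snippet below shows one hedged way to append the suffix and to detect a refusal in the decoded output; matching on the word "refusal" is an assumption about the model's output format, and the exact wording it emits may vary.

import re

REFUSAL_SUFFIX = "If you cannot find the element, return refusal."

def build_instruction(instruction):
    # Appending the suffix lets the model decline instead of guessing.
    return f"{instruction}\n{REFUSAL_SUFFIX}"

def parse_output(raw_output):
    """Return (x, y) as ints, or None if the model refused (assumed to be
    signaled by the word 'refusal' appearing in the output)."""
    if "refusal" in raw_output.lower():
        return None
    match = re.search(r"\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)", raw_output)
    if match is None:
        return None
    return int(float(match.group(1))), int(float(match.group(2)))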

End-to-End Agent Performance Assessment

To evaluate Gelato’s performance in realistic scenarios, the research team integrated it into the GTA1.5 agent framework and conducted comprehensive testing within the OS-World environment.

Experimental Setup

The testing environment featured:

  • GPT-5 as the planning model, responsible for high-level action decisions
  • Gelato-30B-A3B handling grounding functions, translating actions to specific clicks
  • A maximum of 50 agent steps with 3-second intervals between actions

To ensure fair comparisons, the team conducted three runs for each model using a fixed OS-World snapshot, carefully tracking success rates.

Evaluation Results

Gelato-30B-A3B achieved a 58.71% automated success rate with minimal variation between runs. Under identical testing conditions, GTA1-32B reached a 56.97% success rate.

However, researchers identified that automated OS-World evaluation sometimes missed valid solutions. They subsequently conducted human evaluation on 20 problematic tasks. Under human assessment, Gelato’s success rate rose to 61.85%, while GTA1-32B reached 59.47%.

These results demonstrate that improved grounding capability directly translates to enhanced end-to-end agent performance. Gelato-30B-A3B not only excelled in isolated grounding tasks but also delivered measurable improvements in practical computer-use scenarios.

Implementing Gelato-30B-A3B in Projects

For developers and researchers interested in practical applications, Gelato-30B-A3B is openly available through the Hugging Face platform. Below is a practical implementation example demonstrating model loading and inference:

from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
import re
from PIL import Image, ImageDraw
import requests
from io import BytesIO

def extract_coordinates(raw_string):
    """
    Extract the first (x, y) coordinate pair from the model's raw output.
    Args:
        raw_string: str (e.g. "(100, 200)")
    Returns:
        (x, y): tuple of ints (e.g. (100, 200)); falls back to (0, 0)
        if no coordinate pair can be parsed.
    """
    try:
        matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
        x, y = matches[0]
        # int(float(...)) tolerates decimal outputs such as "(100.5, 200.0)"
        return int(float(x)), int(float(y))
    except (IndexError, ValueError):
        return 0, 0

def visualize_prediction(img, pred_x, pred_y, img_width, img_height):
    """
    Visualize predicted coordinates on the image (high visibility).
    """
    # Convert normalized (0-1000) coordinates to absolute pixel positions
    pred_x = int((pred_x * img_width) / 1000)
    pred_y = int((pred_y * img_height) / 1000)

    draw = ImageDraw.Draw(img, "RGBA")

    r = 30
    draw.ellipse(
        (pred_x - r, pred_y - r, pred_x + r, pred_y + r),
        outline="lime",
        fill=(0, 255, 0, 90),
        width=5
    )

    cross_len = 15
    draw.line((pred_x - cross_len, pred_y, pred_x + cross_len, pred_y), fill="lime", width=5)
    draw.line((pred_x, pred_y - cross_len, pred_x, pred_y + cross_len), fill="lime", width=5)

    img.save("predicted_coordinates.png")
    print(f"Predicted coordinates: ({pred_x}, {pred_y})")

# Load the model and processor
MODEL_PATH = "mlfoundations/Gelato-30B-A3B"

model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    dtype="auto"
)

processor = AutoProcessor.from_pretrained(
    MODEL_PATH
)

url = "https://github.com/QwenLM/Qwen3-VL/raw/main/cookbooks/assets/computer_use/computer_use1.jpeg"
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img_width, img_height = img.size

# Prepare messages
PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. For elements with area, return the center point.

Output the coordinate pair exactly:
(x,y)
'''
PROMPT = PROMPT.strip()
INSTRUCTION = "Reload the cache."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT + "\n\n"},
            {"type": "image", "image": img},
            {"type": "text", "text": "\n" + INSTRUCTION},
        ],
    }
]

device = next(model.parameters()).device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=32)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

# Extract the coordinates from the output text
print(f"Model output: {output_text[0]}")
pred_x, pred_y = extract_coordinates(output_text[0])

# Convert the normalized (0-1000) coordinates to pixels and visualize the prediction
visualize_prediction(img, pred_x, pred_y, img_width, img_height)

This code demonstrates the fundamental workflow for using the Gelato model: loading the model, preparing input images and instructions, generating coordinate predictions, and visualizing results. The model outputs normalized coordinates (0-1000 range) that require conversion to absolute coordinates based on actual image dimensions.

Significance and Implications of the Gelato Model

The release of Gelato-30B-A3B represents a significant milestone in the development of practical AI agents. It demonstrates that through carefully curated datasets and specialized training methodologies, substantial advances can be achieved in the critical area of GUI grounding.

Technical Contributions Summarized

  1. Advanced Grounding Performance: Surpassed previous state-of-the-art specialized computer grounding models and larger general-purpose visual language models across multiple benchmarks.

  2. High-Quality Dataset: The Click-100k dataset, through rigorous filtering and integration processes, provides a valuable training resource for GUI grounding tasks.

  3. Effective Training Methodology: The GRPO reinforcement learning approach combined with sparse reward mechanisms significantly enhanced grounding accuracy over supervised baselines.

  4. Practical Validation: When integrated with planning models like GPT-5, Gelato-30B-A3B improved success rates on OS-World computer-use tasks, demonstrating that better grounding directly translates to stronger end-to-end agent performance.

Future Directions and Applications

Gelato-30B-A3B opens new possibilities for practical AI agent deployment. As this technology matures, we can anticipate AI assisting with or executing computer operations across increasingly diverse scenarios—from simple data entry tasks to complex workflow automation, spanning desktop applications to specialized professional software.

For developers and organizations, this technology enables the creation of more intelligent and reliable automation solutions, reducing repetitive manual operations while improving efficiency and accuracy.

Frequently Asked Questions

How does the Gelato model differ from general visual language models?

Gelato is specifically optimized for GUI grounding tasks, processing screen captures and text instructions to output precise click coordinates. General visual language models typically address broader visual question-answering tasks and demonstrate lower precision for specialized GUI operations. By training on the high-quality Click-100k dataset, Gelato outperforms much larger general visual language models on GUI grounding tasks.

How can I use the Gelato model for custom tasks?

The Gelato model is readily accessible through the Hugging Face platform. The implementation process involves loading the model, preparing screen captures and text instructions, and calling the model to generate coordinate predictions. For integration into AI agent systems, Gelato typically functions as a grounding module alongside planning models like GPT-5, with the planning model determining high-level action sequences and Gelato resolving these into specific operations.

Is Gelato’s accuracy sufficient for real-world applications?

While Gelato achieves 60%-75% accuracy on benchmarks—representing current state-of-the-art performance in the challenging GUI grounding domain—practical applications can employ multiple strategies to enhance overall system reliability. These include combining various verification mechanisms, implementing fallback strategies, or incorporating human review for critical tasks. As technology continues advancing, these accuracy rates are expected to improve further.

What types of applications can Gelato handle?

Gelato has been trained on diverse application data, including web browsers, office software, and system utilities. Through the inclusion of professional application tutorial data, it has also developed capabilities with specialized software tools. However, for highly specialized or newly emerging applications, additional fine-tuning or domain-specific data augmentation may be necessary.

What format do the model’s output coordinates use?

The Gelato model outputs normalized coordinate values ranging from 0 to 1000. In practical applications, these normalized coordinates must be converted to absolute coordinates based on actual screen or image dimensions. For example, with an image width of 1920 pixels, an x-coordinate of 500 would correspond to 960 pixels (500/1000 * 1920).
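
In code, the conversion is one multiplication per axis; the snippet below reproduces the worked example above.

def to_pixels(norm_x, norm_y, img_width, img_height):
    """Convert Gelato's 0-1000 normalized coordinates to absolute pixels."""
    return int(norm_x * img_width / 1000), int(norm_y * img_height / 1000)

# Worked example: x = 500 on a 1920x1080 screen.
print(to_pixels(500, 500, 1920, 1080))  # -> (960, 540)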

Conclusion: The Future of Human-Computer Interaction

Gelato-30B-A3B represents a significant advancement in GUI grounding technology. Through the combination of high-quality datasets, carefully designed model architecture, and effective training methodologies, it establishes new standards for graphical user interface element localization.

This technological progress not only pushes the boundaries of AI agent capabilities but also lays the foundation for future intelligent and autonomous computer-use systems. As open-source community participation increases and research continues advancing, we can anticipate practical applications based on this technology emerging, fundamentally transforming how we interact with computers.

For AI researchers and developers, Gelato-30B-A3B provides a powerful tool for building more intelligent and reliable automation solutions, driving the practical application of AI technology in daily life and work environments.
