Beyond Static Prompts: How Multi-View Instructions Turbo-charge GUI Grounding — A Hands-On Guide to UI-Ins
Why read this?
Because simply re-phrasing the same user intent from four different angles can lift a 7 B model's pixel accuracy by up to 76 %, with no extra data or heavier backbones. This article shows you the exact pipeline, code, and training tricks that make it happen.
1 The Invisible Ceiling of One-Angle Instructions
Core question answered:
“Why do existing GUI-grounding models hit an accuracy wall even when the screenshot is crystal-clear?”
Summary: We trace the bottleneck to low-quality, single-angle instructions in public datasets (23 % flawed) and show that human-style perspective switching is the missing piece.
Author’s reflection:
I used to believe "more layers & bigger image resolution" would solve grounding failures. After auditing 1,909 samples from OS-Atlas, Widget-Captioning and AMEX, I realised a quarter of the labels were ambiguous or mismatched; the model never had a fair chance.
2 Dataset Autopsy: 23 % Flaw Rate & How We Cleaned It
Core question answered:
“Can we trust open-source GUI datasets for serious training?”
| Issue type | Example | Share |
|---|---|---|
| Ambiguous match | “click the button” (5 identical buttons) | 4.4 % |
| Location mismatch | bbox shifted 20 px, misses icon | 18.9 % |
| Intent–UI mismatch | instruction says “close tab”, UI shows “clear cache” | 5.3 % |
Our 3-step washer:
- OmniParser-V2 detects all clickable elements → gives a clean candidate set.
- IoU filter: drop a ground-truth label if the best candidate IoU < 0.5 (a minimal sketch follows below).
- Re-write & verify: GPT-4.1 generates appearance / function / spatial / intent views, then self-checks uniqueness.

Result: the error rate drops from 23 % → 8 %; models trained on the cleaned set gain +4.1 to +7.8 absolute points on three benchmarks.
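The IoU filter above is easy to reproduce; here is a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) pixel format (function names are illustrative, not from the released code):

```python
def iou(a, b):
    # a, b: (x1, y1, x2, y2) axis-aligned boxes in pixels
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_sample(gt_box, candidate_boxes, thresh=0.5):
    # Drop the label if no OmniParser candidate overlaps it well enough.
    return any(iou(gt_box, c) >= thresh for c in candidate_boxes)
```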
3 Four Angles of Human Description — With Code to Generate Them
Core question answered:
“What do ‘multi-view’ instructions actually look like, and how can I produce them at scale?”
Code snippet (simplified; `draw_red_box` and `gpt4_1_chat` are thin helpers you supply around your image library and GPT-4.1 client):

```python
import json

PERSPECTIVES = ["appearance", "function", "spatial", "intent"]

PROMPT = """
Screenshot: <img>
Target element highlighted in red.
Create one unambiguous instruction for each perspective.
JSON output only.
"""

def augment(img, bbox):
    # Highlight the target so GPT-4.1 knows which element to describe.
    img_with_box = draw_red_box(img, bbox)
    msgs = [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": [{"type": "image", "image": img_with_box}]},
    ]
    reply = gpt4_1_chat(msgs)   # returns the model's JSON string
    return json.loads(reply)    # keys = PERSPECTIVES
```
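A quick call on one labelled sample (the path and bbox values below are placeholders, not from the dataset) might look like:

```python
from PIL import Image

# Hypothetical sample: screenshot path and target box are placeholders.
views = augment(Image.open("screen.png"), bbox=(812, 14, 842, 44))
print(views["spatial"])   # e.g. "Choose the top-right button, left of the minimise icon."
```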
Real example for a red “×” icon:
| Perspective | Generated sentence |
|---|---|
| appearance | Click the crimson × icon in the upper-right corner. |
| function | Close the current file-manager window. |
| spatial | Choose the top-right button, left of the minimise icon. |
| intent | Get rid of this screen. |
Author’s reflection:
When I first saw the model write “get rid of this screen” I thought it was too casual, yet that high-level intent view became the highest-recall angle on dark-theme windows where the tiny × is nearly invisible.
4 Training Recipe: SFT Opens the Menu, RL Learns to Order
Core question answered:
“How do you teach a model to pick the best angle instead of feeding it a single prompt?”
4.1 Stage-1 SFT — “See the world”
- ❀ Data: 283 k cleaned samples, each carries 2 random views (1 = user instruction, 1 = hidden reasoning).
- ❀ Target: maximise the likelihood of the reasoning text + click-point jointly.
- ❀ Format forced (a small parser sketch follows the block below):
```
<think>
From spatial view: the button is below the search bar.
</think>
<tool_call>
{"name":"grounding","arguments":{"action":"click","coordinate":[x,y]}}
</tool_call>
```
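At inference time the click point has to be pulled back out of this template; a minimal regex-based parser sketch (assuming the model sticks to the format; not part of the official repo):

```python
import json
import re

def parse_action(text):
    # Grab the JSON payload between <tool_call> ... </tool_call>.
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.S)
    if not m:
        return None
    call = json.loads(m.group(1))
    return call["arguments"]["coordinate"]   # [x, y] click point
```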
4.2 Stage-2 GRPO RL — “Choose wisely”
- ❀ Prompt: “Think first, then output coordinate.” (no perspective list)
- ❀ Reward: 1 if the predicted point ∈ ground-truth box, else 0 (see the sketch after this list).
- ❀ Rollouts: 8 per sample; advantages Z-score normalised.
- ❀ Outcome: an extra +6.6 % gain for UI-Ins-7B on the ScreenSpot-Pro implicit subset.
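The 0/1 reward and group-normalised advantages take only a few lines; a sketch of the idea (not the actual training-framework code):

```python
import numpy as np

def reward(point, box):
    # 1 if the predicted (x, y) lands inside the ground-truth box, else 0.
    x, y = point
    x1, y1, x2, y2 = box
    return float(x1 <= x <= x2 and y1 <= y <= y2)

def group_advantages(rewards):
    # GRPO-style: Z-score the rewards of the rollouts for one sample.
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: 8 rollouts for one sample, 3 of them hit the box.
adv = group_advantages([1, 0, 0, 1, 0, 1, 0, 0])
```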
5 Benchmark Battle: Where the Gains Come From
Core question answered:
“Does multi-view reasoning only help fancy long instructions?”
| Benchmark | subset | Baseline Qwen2.5-VL-7B | UI-Ins-7B | gain |
|---|---|---|---|---|
| MMBench-GUI L2 | basic (explicit) | 33.9 | 64.7 | +90 % |
| MMBench-GUI L2 | advanced (implicit) | 32.5 | 80.8 | +159 % |
| UI-I2E-Bench | explicit | 73.8 | 88.9 | +20 % |
| UI-I2E-Bench | implicit | 62.7 | 76.3 | +22 % |
Take-away: The vaguer the instruction, the larger the lift—exactly where product teams suffer most.
6 Online Agent Stress-Test: AndroidWorld Live
Core question answered:
“Will the offline benchmark victory collapse once the screen scrolls, refreshes, or lags?”
Setup:
- ❀ Planner = GPT-5 (high-level step maker)
- ❀ Executor = UI-Ins-7B (pixel clicker)
- ❀ Env = AndroidWorld (real phone, 116 tasks)
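Conceptually this is a plain planner-executor loop. The sketch below is only illustrative: plan_next_step, ui_ins_ground, and the env object are hypothetical stand-ins for your planner call, the UI-Ins grounding call, and the AndroidWorld harness:

```python
def run_episode(task: str, env, max_steps: int = 30) -> bool:
    """Planner proposes a step in natural language; UI-Ins grounds it to a pixel."""
    for _ in range(max_steps):
        screenshot = env.screenshot()              # current phone screen (image)
        step = plan_next_step(task, screenshot)    # GPT-class planner, e.g. "tap the Wi-Fi toggle"
        if step == "DONE":
            return True                            # planner declares the task finished
        x, y = ui_ins_ground(screenshot, step)     # UI-Ins-7B returns a click point
        env.tap(x, y)                              # execute the click on-device
    return False
```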
| Model pipeline | success rate |
|---|---|
| Gemini 2.5 Computer-Use | 69.7 % |
| UI-TARS-2 | 73.3 % |
| UI-Ins-7B + GPT-5 | 74.1 % |
Author’s reflection:
I expected drift and latency to kill accuracy. Interestingly, multi-view reasoning acted like a robustness buffer—when the theme changed between light/dark modes, the model switched to function + spatial views and kept clicking correctly.
7 Ablation Deep-Dive: What Actually Matters?
Core question answered:
“Which ingredient gives the biggest bang for the buck?”
| Ablated component | MMB-GUI L2 | Δ |
|---|---|---|
| none (full UI-Ins-7B) | 83.1 | — |
| → remove clean data | 72.4 | −10.7 |
| → remove RL stage | 76.3 | −6.8 |
| → remove reasoning | 79.1 | −4.0 |
| → keep free-form reasoning (no views) | 78.8 | −4.3 |
Insight: Data cleaning > RL > reasoning format. Skipping data quality is like “pouring premium petrol into a muddy tank.”
8 Error Gallery: Where UI-Ins Still Fails
Core question answered:
“What systematic weaknesses remain, so I don’t deploy blindly?”
- World-knowledge gap
  Instruction: “open the app from the toy company famous for building blocks”
  The model picks Jazwares; the correct target is MEGA (visually smaller logo).
- Layout-resolution mismatch
  The model predicts the centre of the toolbar instead of the tiny dropdown arrow; the bbox centre is not always clickable.
- Visual hallucination under occlusion
  Two identical “save” icons appear; the model selects the dimmed, disabled copy.
Reflection: These errors remind us that semantic reasoning ≠ factual knowledge and pixel precision still needs segmentation-level supervision.
9 Quick-Start: Inference in 13 Lines
Core question answered:
“How do I run UI-Ins-7B inside my own agent loop today?”
```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch, re

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Tongyi-MiA/UI-Ins-7B", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("Tongyi-MiA/UI-Ins-7B")

image = Image.open("screen.jpg").convert("RGB")
instr = "Turn off the battery saver"
prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instr}]}]

text = processor.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)

# Parse the predicted click point from the generated text, e.g. "coordinate":[x, y]
coord = re.findall(r"\[(\d+),\s*(\d+)\]", processor.decode(out[0], skip_special_tokens=True))
if coord:
    x, y = map(int, coord[0])
    print("Click at", x, y)
```
Checklist for production:
- ❀ Always keep the <think> tag in the system prompt; dropping it costs ~4 % accuracy.
- ❀ Feed original-resolution screenshots; the model rescales internally.
- ❀ Batch size > 1 gives ~12 % speed-up thanks to vision-encoder caching.
10 One-Page Overview
- ❀ Problem: 23 % of public GUI instructions are noisy; single-view training under-utilises model capacity.
- ❀ Solution: Instruction-as-Reasoning (clean data, generate 4 views, SFT teaches all views, GRPO-RL learns to pick the best view).
- ❀ Results: UI-Ins-7B/32B set new SOTA on 5 benchmarks; the 7B online agent beats Gemini-2.5 & UI-TARS-2.
- ❀ Recipe: OmniParser → IoU filter → GPT-4.1 rewrite → SFT on 283 k → GRPO on 33 k → done.
- ❀ Key ablation order: data cleaning > RL > reasoning format.
- ❀ Remaining gaps: world knowledge, sub-element precision, disabled-widget hallucination.
11 Action Checklist / Implementation Steps
- Download raw GUI datasets (OS-Atlas, AMEX, etc.).
- Run OmniParser-V2 → extract all UI boxes → IoU-filter the labels.
- Use the provided GPT-4.1 prompt script to create 4-view instructions; self-verify uniqueness.
- Git-clone LLaMA-Factory → launch 1-epoch SFT (lr 5e-6, batch 256, Qwen2.5-VL-7B backbone).
- Build the reward function: point-in-box 0/1 → Z-score advantages → GRPO training (lr 1e-6, 8 rollouts, 33 k samples).
- Evaluate on MMBench-GUI L2, UI-I2E, ScreenSpot-Pro; expect +15-20 % absolute on the implicit subsets.
- Integrate into an agent: GPT-class planner → UI-Ins executor → AndroidWorld ≥ 70 % success.
12 FAQ
Q1: Do I need a 32 B model to see the benefit?
A: No. UI-Ins-7B already outperforms 32 B baselines on implicit instructions.
Q2: Can I apply this to non-English screenshots?
A: Yes. The pipeline is language-agnostic; just replace the GPT-4.1 prompt with your target language.
Q3: Is RL training unstable?
A: Policy collapse is mitigated by diverse SFT initialization; without it we observed −5.7 % after 100 RL steps.
Q4: How much GPU memory for RL on 7 B?
A: 8 rollouts need ~28 GB; reduce to 4 rollouts or use CPU offload for 24 GB cards.
Q5: Are the generated instructions public?
A: The cleaned + augmented 283 k sample set will be open-sourced under CC-BY-NC-SA 4.0.
Q6: Does the method work for bounding-box output instead of points?
A: The paper focuses on point targets; switching to box regression requires modifying the loss but the multi-view idea still applies.
Q7: What’s the largest real-world deployment so far?
A: Alibaba’s internal Tongyi Lab has adopted UI-Ins-7B as the default grounding back-end for 3 desktop-agent products (details confidential).
Author’s closing note:
If you remember only one thing, let it be this: treat your prompt as a reasoning path, not a string. Clean the data, give the model a menu of perspectives, and let RL decide what’s tasty today. The code is open—go make your GUI agent see the world from more than one angle.
