Beyond Static Prompts: How Multi-View Instructions Turbo-charge GUI Grounding — A Hands-On Guide to UI-Ins
Why read this?
Because simply re-phrasing the same user intent from four different angles can lift a 7 B model's pixel accuracy by up to 76 %, with no extra data or heavier backbones. This article shows you the exact pipeline, code, and training tricks that make it happen.
1 The Invisible Ceiling of One-Angle Instructions
Core question answered:
“Why do existing GUI-grounding models hit an accuracy wall even when the screenshot is crystal-clear?”
Summary: We trace the bottleneck to low-quality, single-angle instructions in public datasets (23 % flawed) and show that human-style perspective switching is the missing piece.
Author’s reflection:
I used to believe "more layers & bigger image resolution" would solve grounding failures. After auditing 1,909 samples from OS-Atlas, Widget-Captioning and AMEX, I realised a quarter of the labels were ambiguous or mismatched; the model never had a fair chance.
2 Dataset Autopsy: 23 % Flaw Rate & How We Cleaned It
Core question answered:
“Can we trust open-source GUI datasets for serious training?”
| Issue type | Example | Share |
|---|---|---|
| Ambiguous match | “click the button” (5 identical buttons) | 4.4 % |
| Location mismatch | bbox shifted 20 px, misses icon | 18.9 % |
| Intent–UI mismatch | instruction says “close tab”, UI shows “clear cache” | 5.3 % |
Our 3-step washer:
- OmniParser-V2 detects all clickable elements → gives a clean candidate set.
- IoU filter: drop a ground-truth label if the best candidate IoU < 0.5 (a minimal sketch follows below).
- Re-write & verify: GPT-4.1 generates appearance / function / spatial / intent views, then self-checks uniqueness.

Result: the error rate drops from 23 % → 8 %; models trained on the cleaned set gain +4.1 to +7.8 absolute points on three benchmarks.
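The IoU filter above is easy to reproduce; here is a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) pixel format (function names are illustrative, not from the released code):

```python
def iou(a, b):
    # a, b: (x1, y1, x2, y2) axis-aligned boxes in pixels
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_sample(gt_box, candidate_boxes, thresh=0.5):
    # Drop the label if no OmniParser candidate overlaps it well enough.
    return any(iou(gt_box, c) >= thresh for c in candidate_boxes)
```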
3 Four Angles of Human Description — With Code to Generate Them
Core question answered:
“What do ‘multi-view’ instructions actually look like, and how can I produce them at scale?”
Code snippet (simplified; `draw_red_box` and `gpt4_1_chat` are thin helpers you supply around your image library and GPT-4.1 client):

```python
import json

PERSPECTIVES = ["appearance", "function", "spatial", "intent"]

PROMPT = """
Screenshot: <img>
Target element highlighted in red.
Create one unambiguous instruction for each perspective.
JSON output only.
"""

def augment(img, bbox):
    # Highlight the target so GPT-4.1 knows which element to describe.
    img_with_box = draw_red_box(img, bbox)
    msgs = [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": [{"type": "image", "image": img_with_box}]},
    ]
    reply = gpt4_1_chat(msgs)   # returns the model's JSON string
    return json.loads(reply)    # keys = PERSPECTIVES
```
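A quick call on one labelled sample (the path and bbox values below are placeholders, not from the dataset) might look like:

```python
from PIL import Image

# Hypothetical sample: screenshot path and target box are placeholders.
views = augment(Image.open("screen.png"), bbox=(812, 14, 842, 44))
print(views["spatial"])   # e.g. "Choose the top-right button, left of the minimise icon."
```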
Real example for a red “×” icon:
| Perspective | Generated sentence |
|---|---|
| appearance | Click the crimson × icon in the upper-right corner. |
| function | Close the current file-manager window. |
| spatial | Choose the top-right button, left of the minimise icon. |
| intent | Get rid of this screen. |
Author’s reflection:
When I first saw the model write “get rid of this screen” I thought it was too casual, yet that high-level intent view became the highest-recall angle on dark-theme windows where the tiny × is nearly invisible.
4 Training Recipe: SFT Opens the Menu, RL Learns to Order
Core question answered:
“How do you teach a model to pick the best angle instead of feeding it a single prompt?”
4.1 Stage-1 SFT — “See the world”
- ❀ Data: 283 k cleaned samples, each carries 2 random views (1 = user instruction, 1 = hidden reasoning).
- ❀ Target: maximise the likelihood of the reasoning text + click-point jointly.
- ❀ Format forced (a small parser sketch follows the block below):
```
<think>
From spatial view: the button is below the search bar.
</think>
<tool_call>
{"name":"grounding","arguments":{"action":"click","coordinate":[x,y]}}
</tool_call>
```
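At inference time the click point has to be pulled back out of this template; a minimal regex-based parser sketch (assuming the model sticks to the format; not part of the official repo):

```python
import json
import re

def parse_action(text):
    # Grab the JSON payload between <tool_call> ... </tool_call>.
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.S)
    if not m:
        return None
    call = json.loads(m.group(1))
    return call["arguments"]["coordinate"]   # [x, y] click point
```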
4.2 Stage-2 GRPO RL — “Choose wisely”
- ❀ Prompt: “Think first, then output coordinate.” (no perspective list)
- ❀ Reward: 1 if the predicted point ∈ ground-truth box, else 0 (see the sketch after this list).
- ❀ Rollouts: 8 per sample; advantages Z-score normalised.
- ❀ Outcome: an extra +6.6 % gain for UI-Ins-7B on the ScreenSpot-Pro implicit subset.
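The 0/1 reward and group-normalised advantages take only a few lines; a sketch of the idea (not the actual training-framework code):

```python
import numpy as np

def reward(point, box):
    # 1 if the predicted (x, y) lands inside the ground-truth box, else 0.
    x, y = point
    x1, y1, x2, y2 = box
    return float(x1 <= x <= x2 and y1 <= y <= y2)

def group_advantages(rewards):
    # GRPO-style: Z-score the rewards of the rollouts for one sample.
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: 8 rollouts for one sample, 3 of them hit the box.
adv = group_advantages([1, 0, 0, 1, 0, 1, 0, 0])
```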
5 Benchmark Battle: Where the Gains Come From
Core question answered:
“Does multi-view reasoning only help fancy long instructions?”
| Benchmark | subset | Baseline Qwen2.5-VL-7B | UI-Ins-7B | gain |
|---|---|---|---|---|
| MMBench-GUI L2 | basic (explicit) | 33.9 | 64.7 | +90 % |
| MMBench-GUI L2 | advanced (implicit) | 32.5 | 80.8 | +159 % |
| UI-I2E-Bench | explicit | 73.8 | 88.9 | +20 % |
| UI-I2E-Bench | implicit | 62.7 | 76.3 | +22 % |
Take-away: The vaguer the instruction, the larger the lift—exactly where product teams suffer most.
6 Online Agent Stress-Test: AndroidWorld Live
Core question answered:
“Will the offline benchmark victory collapse once the screen scrolls, refreshes, or lags?”
Setup:
- ❀ Planner = GPT-5 (high-level step maker)
- ❀ Executor = UI-Ins-7B (pixel clicker)
- ❀ Env = AndroidWorld (real phone, 116 tasks)
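Conceptually this is a plain planner-executor loop. The sketch below is only illustrative: plan_next_step, ui_ins_ground, and the env object are hypothetical stand-ins for your planner call, the UI-Ins grounding call, and the AndroidWorld harness:

```python
def run_episode(task: str, env, max_steps: int = 30) -> bool:
    """Planner proposes a step in natural language; UI-Ins grounds it to a pixel."""
    for _ in range(max_steps):
        screenshot = env.screenshot()              # current phone screen (image)
        step = plan_next_step(task, screenshot)    # GPT-class planner, e.g. "tap the Wi-Fi toggle"
        if step == "DONE":
            return True                            # planner declares the task finished
        x, y = ui_ins_ground(screenshot, step)     # UI-Ins-7B returns a click point
        env.tap(x, y)                              # execute the click on-device
    return False
```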
| Model pipeline | success rate |
|---|---|
| Gemini 2.5 Computer-Use | 69.7 % |
| UI-TARS-2 | 73.3 % |
| UI-Ins-7B + GPT-5 | 74.1 % |
Author’s reflection:
I expected drift and latency to kill accuracy. Interestingly, multi-view reasoning acted like a robustness buffer—when the theme changed between light/dark modes, the model switched to function + spatial views and kept clicking correctly.
7 Ablation Deep-Dive: What Actually Matters?
Core question answered:
“Which ingredient gives the biggest bang for the buck?”
| Ablated component | MMB-GUI L2 | Δ |
|---|---|---|
| none (full UI-Ins-7B) | 83.1 | — |
| → remove clean data | 72.4 | −10.7 |
| → remove RL stage | 76.3 | −6.8 |
| → remove reasoning | 79.1 | −4.0 |
| → keep free-form reasoning (no views) | 78.8 | −4.3 |
Insight: Data cleaning > RL > reasoning format. Skipping data quality is like “pouring premium petrol into a muddy tank.”
8 Error Gallery: Where UI-Ins Still Fails
Core question answered:
“What systematic weaknesses remain, so I don’t deploy blindly?”
- World-knowledge gap
  Instruction: “open the app from the toy company famous for building blocks”
  The model picks Jazwares; the correct target is MEGA (visually smaller logo).
- Layout-resolution mismatch
  The model predicts the centre of the toolbar instead of the tiny dropdown arrow; the bbox centre is not always clickable.
- Visual hallucination under occlusion
  Two identical “save” icons appear; the model selects the dimmed, disabled copy.
Reflection: These errors remind us that semantic reasoning ≠ factual knowledge and pixel precision still needs segmentation-level supervision.
9 Quick-Start: Inference in 13 Lines
Core question answered:
“How do I run UI-Ins-7B inside my own agent loop today?”
```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch, re

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Tongyi-MiA/UI-Ins-7B", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("Tongyi-MiA/UI-Ins-7B")

image = Image.open("screen.jpg").convert("RGB")
instr = "Turn off the battery saver"
prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instr}]}]

text = processor.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)

# Parse the predicted click point from the generated text, e.g. "coordinate":[x, y]
coord = re.findall(r"\[(\d+),\s*(\d+)\]", processor.decode(out[0], skip_special_tokens=True))
if coord:
    x, y = map(int, coord[0])
    print("Click at", x, y)
```
Checklist for production:
- ❀ Always keep the <think> tag in the system prompt; dropping it costs ~4 % accuracy.
- ❀ Feed original-resolution screenshots; the model rescales internally.
- ❀ Batch size > 1 gives ~12 % speed-up thanks to vision-encoder caching.
10 One-Page Overview
- ❀ Problem: 23 % of public GUI instructions are noisy; single-view training under-utilises model capacity.
- ❀ Solution: Instruction-as-Reasoning (clean data, generate 4 views, SFT teaches all views, GRPO-RL learns to pick the best view).
- ❀ Results: UI-Ins-7B/32B set new SOTA on 5 benchmarks; the 7B online agent beats Gemini-2.5 & UI-TARS-2.
- ❀ Recipe: OmniParser → IoU filter → GPT-4.1 rewrite → SFT on 283 k → GRPO on 33 k → done.
- ❀ Key ablation order: data cleaning > RL > reasoning format.
- ❀ Remaining gaps: world knowledge, sub-element precision, disabled-widget hallucination.
11 Action Checklist / Implementation Steps
- Download raw GUI datasets (OS-Atlas, AMEX, etc.).
- Run OmniParser-V2 → extract all UI boxes → IoU-filter the labels.
- Use the provided GPT-4.1 prompt script to create 4-view instructions; self-verify uniqueness.
- Git-clone LLaMA-Factory → launch 1-epoch SFT (lr 5e-6, batch 256, Qwen2.5-VL-7B backbone).
- Build the reward function: point-in-box 0/1 → Z-score advantages → GRPO training (lr 1e-6, 8 rollouts, 33 k samples).
- Evaluate on MMBench-GUI L2, UI-I2E, ScreenSpot-Pro; expect +15-20 % absolute on the implicit subsets.
- Integrate into an agent: GPT-class planner → UI-Ins executor → AndroidWorld ≥ 70 % success.
12 FAQ
Q1: Do I need a 32 B model to see the benefit?
A: No. UI-Ins-7B already outperforms 32 B baselines on implicit instructions.
Q2: Can I apply this to non-English screenshots?
A: Yes. The pipeline is language-agnostic; just replace the GPT-4.1 prompt with your target language.
Q3: Is RL training unstable?
A: Policy collapse is mitigated by diverse SFT initialization; without it we observed −5.7 % after 100 RL steps.
Q4: How much GPU memory for RL on 7 B?
A: 8 rollouts need ~28 GB; reduce to 4 rollouts or use CPU offload for 24 GB cards.
Q5: Are the generated instructions public?
A: The cleaned + augmented 283 k sample set will be open-sourced under CC-BY-NC-SA 4.0.
Q6: Does the method work for bounding-box output instead of points?
A: The paper focuses on point targets; switching to box regression requires modifying the loss but the multi-view idea still applies.
Q7: What’s the largest real-world deployment so far?
A: Alibaba’s internal Tongyi Lab has adopted UI-Ins-7B as the default grounding back-end for 3 desktop-agent products (details confidential).
Author’s closing note:
If you remember only one thing, let it be this: treat your prompt as a reasoning path, not a string. Clean the data, give the model a menu of perspectives, and let RL decide what’s tasty today. The code is open—go make your GUI agent see the world from more than one angle.
