Beyond Static Prompts: How Multi-View Instructions Turbo-charge GUI Grounding — A Hands-On Guide to UI-Ins

Why read this?
Because simply re-phrasing the same user intent from four different angles can lift a 7B model's pixel accuracy by up to 76 %, with no extra data and no heavier backbone. This article walks through the exact pipeline, code, and training tricks that make it happen.


1 The Invisible Ceiling of One-Angle Instructions

Core question answered:
“Why do existing GUI-grounding models hit an accuracy wall even when the screenshot is crystal-clear?”

Summary: We trace the bottleneck to low-quality, single-angle instructions in public datasets (23 % flawed) and show that human-style perspective switching is the missing piece.

Author’s reflection:
I used to believe that more layers and higher image resolution would solve grounding failures. After auditing 1,909 samples from OS-Atlas, Widget-Captioning, and AMEX, I realised that roughly a quarter of the labels were ambiguous or mismatched; the model never had a fair chance.


2 Dataset Autopsy: 23 % Flaw Rate & How We Cleaned It

Core question answered:
“Can we trust open-source GUI datasets for serious training?”

Issue type         | Example                                              | Share
Ambiguous match    | “click the button” (5 identical buttons on screen)   | 4.4 %
Location mismatch  | bbox shifted 20 px, misses the icon                  | 18.9 %
Intent–UI mismatch | instruction says “close tab”, UI shows “clear cache” | 5.3 %

Our 3-step washer:

  1. OmniParser-V2 detects all clickable elements ➜ yields a clean candidate set.
  2. IoU filter: drop a ground-truth label if its best candidate IoU is < 0.5 (a minimal sketch follows this list).
  3. Re-write & verify: GPT-4.1 generates appearance / function / spatial / intent views, then self-checks uniqueness.
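
To make step 2 concrete, here is a minimal sketch of the IoU filter; the [x1, y1, x2, y2] box format and the candidates list (boxes from OmniParser-V2) are assumptions for illustration, not the exact pipeline code.

def iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (union + 1e-9)

def keep_label(gt_box, candidates, thr=0.5):
    # Step 2: keep the sample only if some detected element overlaps the labelled box.
    return any(iou(gt_box, c) >= thr for c in candidates)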

Result: the error rate drops from 23 % to 8 %, and models trained on the cleaned set gain +4.1 to +7.8 absolute percentage points on three benchmarks.


3 Four Angles of Human Description — With Code to Generate Them

Core question answered:
“What do ‘multi-view’ instructions actually look like, and how can I produce them at scale?”

Code snippet (simplified; draw_red_box and gpt4_1_chat are small helpers you supply, sketched below):

import json

PERSPECTIVES = ["appearance", "function", "spatial", "intent"]
PROMPT = f"""
The screenshot is attached; the target element is highlighted in red.
Write one unambiguous instruction for each perspective: {', '.join(PERSPECTIVES)}.
Return JSON only, with exactly these keys: {PERSPECTIVES}.
"""

def augment(img, bbox):
    # draw_red_box overlays the bbox on the screenshot; gpt4_1_chat is your wrapper
    # around a GPT-4.1 vision endpoint that returns the reply text.
    img_with_box = draw_red_box(img, bbox)
    msgs = [{"role": "system", "content": PROMPT},
            {"role": "user", "content": [{"type": "image", "image": img_with_box}]}]
    reply = gpt4_1_chat(msgs)
    return json.loads(reply)  # keys == PERSPECTIVES
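
The two helpers above are left to you. A minimal draw_red_box using Pillow could look like the sketch below (the 4-pixel outline width is an arbitrary choice); gpt4_1_chat is simply whatever wrapper you use to call a GPT-4.1 vision endpoint.

from PIL import ImageDraw

def draw_red_box(img, bbox, width=4):
    # Return a copy of the screenshot with the target bbox outlined in red.
    boxed = img.copy()
    ImageDraw.Draw(boxed).rectangle(bbox, outline="red", width=width)
    return boxed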

Real example for a red “×” icon:

Perspective | Generated sentence
appearance  | Click the crimson × icon in the upper-right corner.
function    | Close the current file-manager window.
spatial     | Choose the top-right button, left of the minimise icon.
intent      | Get rid of this screen.

Author’s reflection:
When I first saw the generated instruction “get rid of this screen” I thought it was too casual, yet that high-level intent view turned out to be the highest-recall angle on dark-theme windows where the tiny × is nearly invisible.


4 Training Recipe: SFT Opens the Menu, RL Learns to Order

Core question answered:
“How do you teach a model to pick the best angle instead of feeding it a single prompt?”

4.1 Stage-1 SFT — “See the world”


  • Data: 283 k cleaned samples, each carries 2 random views (1 = user instruction, 1 = hidden reasoning).

  • Target: maximise likelihood of reasoning text + click-point jointly.

  • Format forced:
<think>
From spatial view: the button is below the search bar.
</think>
<tool_call>
{"name":"grounding","arguments":{"action":"click","coordinate":[x,y]}}
</tool_call>
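
To make that target format concrete, here is a minimal sketch that serialises one training sample into the <think> / <tool_call> layout; the sample field names (reasoning, x, y) are hypothetical.

import json

def build_sft_target(sample):
    # sample: {"reasoning": str, "x": int, "y": int} -- hypothetical field names
    call = {"name": "grounding",
            "arguments": {"action": "click", "coordinate": [sample["x"], sample["y"]]}}
    return ("<think>\n" + sample["reasoning"] + "\n</think>\n"
            "<tool_call>\n" + json.dumps(call) + "\n</tool_call>")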

4.2 Stage-2 GRPO RL — “Choose wisely”


  • Prompt: “Think first, then output coordinate.” (no perspective list)

  • Reward: 1 if the predicted point lands inside the ground-truth box, else 0 (a sketch follows this list).

  • Rollouts: 8 per sample; advantages Z-score normalised.

  • Outcome: an extra +6.6 % gain for UI-Ins-7B on the ScreenSpot-Pro implicit subset.
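
A minimal sketch of that reward and of the Z-score-normalised advantages used by GRPO, assuming one group holds the 8 rollouts of a single sample; this is an illustration, not the training code itself.

import numpy as np

def point_in_box_reward(pred_xy, gt_box):
    # 1.0 iff the predicted click point falls inside the ground-truth box [x1, y1, x2, y2].
    x, y = pred_xy
    return float(gt_box[0] <= x <= gt_box[2] and gt_box[1] <= y <= gt_box[3])

def grpo_advantages(rewards, eps=1e-6):
    # rewards: the 8 rollout rewards of one sample; Z-score normalise within the group.
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)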

5 Benchmark Battle: Where the Gains Come From

Core question answered:
“Does multi-view reasoning only help fancy long instructions?”

Benchmark subset                   | Baseline Qwen2.5-VL-7B | UI-Ins-7B | Gain
MMBench-GUI L2 basic (explicit)    | 33.9                   | 64.7      | +90 %
MMBench-GUI L2 advanced (implicit) | 32.5                   | 80.8      | +159 %
UI-I2E-Bench explicit              | 73.8                   | 88.9      | +20 %
UI-I2E-Bench implicit              | 62.7                   | 76.3      | +22 %

Take-away: The vaguer the instruction, the larger the lift—exactly where product teams suffer most.


6 Online Agent Stress-Test: AndroidWorld Live

Core question answered:
“Will the offline benchmark victory collapse once the screen scrolls, refreshes, or lags?”

Setup:


  • Planner = GPT-5 (high-level step maker)

  • Executor = UI-Ins-7B (pixel clicker)

  • Env = AndroidWorld (real phone, 116 tasks)

Model pipeline          | Success rate
Gemini 2.5 Computer-Use | 69.7 %
UI-TARS-2               | 73.3 %
UI-Ins-7B + GPT-5       | 74.1 %
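
The planner/executor split can be wired up as a simple loop; gpt5_plan_next_step, ui_ins_ground, and android_env are hypothetical wrappers around the GPT-5 planner, the UI-Ins-7B grounder, and the AndroidWorld environment, shown only to illustrate the control flow.

def run_task(task, android_env, max_steps=30):
    history = []
    for _ in range(max_steps):
        screen = android_env.screenshot()                  # current phone screen
        step = gpt5_plan_next_step(task, history, screen)  # e.g. "Open Settings"
        if step == "DONE":
            return True
        x, y = ui_ins_ground(screen, step)                 # UI-Ins-7B turns the step into a click point
        android_env.click(x, y)
        history.append(step)
    return False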

Author’s reflection:
I expected drift and latency to kill accuracy. Interestingly, multi-view reasoning acted like a robustness buffer—when the theme changed between light/dark modes, the model switched to function + spatial views and kept clicking correctly.


7 Ablation Deep-Dive: What Actually Matters?

Core question answered:
“Which ingredient gives the biggest bang for the buck?”

Ablated component                     | MMB-GUI L2 | Δ
none (full UI-Ins-7B)                 | 83.1       |
→ remove clean data                   | 72.4       | −10.7
→ remove RL stage                     | 76.3       | −6.8
→ remove reasoning                    | 79.1       | −4.0
→ keep free-form reasoning (no views) | 78.8       | −4.3

Insight: Data cleaning > RL > reasoning format. Skipping data quality is like “pouring premium petrol into a muddy tank.”


8 Error Gallery: Where UI-Ins Still Fails

Core question answered:
“What systematic weaknesses remain, so I don’t deploy blindly?”

  1. World-knowledge gap
    Instruction: “open the app from the toy company famous for building blocks”
    Model picks Jazwares; correct is MEGA (visually smaller logo).

  2. Layout-resolution mismatch
    Predicts the centre of the toolbar instead of the tiny dropdown arrow; the bbox centre is not always the clickable spot.

  3. Visual hallucination under occlusion
    Two identical “save” icons appear; model selects the dimmed disabled copy.

Reflection: These errors remind us that semantic reasoning ≠ factual knowledge and pixel precision still needs segmentation-level supervision.


9 Quick-Start: Minimal Inference Example

Core question answered:
“How do I run UI-Ins-7B inside my own agent loop today?”

from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch, re

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Tongyi-MiA/UI-Ins-7B", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("Tongyi-MiA/UI-Ins-7B")

image = Image.open("screen.jpg").convert("RGB")
instr = "Turn off the battery saver"
messages = [{"role": "user",
             "content": [{"type": "image"}, {"type": "text", "text": instr}]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, then pull the [x, y] coordinate out of the tool call.
reply = processor.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
coord = re.findall(r"\[(\d+),\s*(\d+)\]", reply)
if coord:
    x, y = map(int, coord[0])
    print("Click at", x, y)

Checklist for production:


  • Always keep the <think> tag in the system prompt; dropping it costs ~4 % accuracy.

  • Feed original-resolution screenshots; the model rescales internally.

  • Batch size >1 gives ~12 % speed-up thanks to vision encoder caching.
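
For the batching tip, the processor accepts parallel lists of chats and images; a minimal sketch reusing processor and model from the quick-start above (batch_messages and batch_images are hypothetical lists, one entry per sample):

texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
         for m in batch_messages]
inputs = processor(text=texts, images=batch_images, return_tensors="pt",
                   padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=128)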

10 One-Page Overview


  • Problem: 23 % of public GUI instructions are noisy; single-view training under-utilises model capacity.

  • Solution: Instruction-as-Reasoning — clean data, generate 4 views, SFT teaches all views, GRPO-RL learns to pick best view.

  • Results: UI-Ins-7B/32B set new SOTA on 5 benchmarks; 7B online agent beats Gemini-2.5 & UI-TARS-2.

  • Recipe: OmniParser → IoU filter → GPT-4.1 rewrite → SFT on 283 k → GRPO on 33 k → done.

  • Key ablation order: data cleaning > RL > reasoning format.

  • Remaining gaps: world knowledge, sub-element precision, disabled-widget hallucination.

11 Action Checklist / Implementation Steps

  1. Download raw GUI datasets (OS-Atlas, AMEX, etc.).
  2. Run OmniParser-V2 → extract all UI boxes → IoU-filter labels.
  3. Use provided GPT-4.1 prompt script to create 4-view instructions; self-verify uniqueness.
  4. Git-clone LLaMA-Factory → launch 1-epoch SFT (lr 5e-6, batch 256, Qwen2.5-VL-7B backbone).
  5. Build reward function: point-in-box 0/1 → Z-score advantages → GRPO training (lr 1e-6, 8 rollouts, 33 k samples).
  6. Evaluate on MMBench-GUI L2, UI-I2E, ScreenSpot-Pro; expect +15-20 % absolute on implicit subsets (a scoring sketch follows this list).
  7. Integrate into Agent: GPT-class planner → UI-Ins executor → AndroidWorld ≥70 % success.
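
For step 6, the grounding metric is plain point-in-box accuracy; here is a minimal scoring sketch over prediction/ground-truth pairs (the result dictionary layout is hypothetical):

def grounding_accuracy(results):
    # results: list of {"pred": (x, y), "box": [x1, y1, x2, y2]} -- hypothetical layout
    hits = 0
    for r in results:
        x, y = r["pred"]
        x1, y1, x2, y2 = r["box"]
        hits += int(x1 <= x <= x2 and y1 <= y <= y2)
    return hits / max(len(results), 1)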

12 FAQ

Q1: Do I need a 32 B model to see the benefit?
A: No. UI-Ins-7B already outperforms 32 B baselines on implicit instructions.

Q2: Can I apply this to non-English screenshots?
A: Yes. The pipeline is language-agnostic; just replace the GPT-4.1 prompt with your target language.

Q3: Is RL training unstable?
A: Policy collapse is mitigated by diverse SFT initialization; without it we observed −5.7 % after 100 RL steps.

Q4: How much GPU memory for RL on 7 B?
A: 8 rollouts need ~28 GB; reduce to 4 rollouts or use CPU offload on 24 GB cards.

Q5: Are the generated instructions public?
A: The cleaned + augmented 283 k sample set will be open-sourced under CC-BY-NC-SA 4.0.

Q6: Does the method work for bounding-box output instead of points?
A: The paper focuses on point targets; switching to box regression requires modifying the loss, but the multi-view idea still applies (one possible IoU-style reward is sketched below).
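
Purely as an illustration, the point reward from section 4.2 could be swapped for an overlap-based one when the model emits boxes (reusing the iou helper sketched in section 2):

def box_reward(pred_box, gt_box, thr=0.5):
    # 1.0 if the predicted box overlaps the ground truth above the IoU threshold, else 0.0.
    return float(iou(pred_box, gt_box) >= thr)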

Q7: What’s the largest real-world deployment so far?
A: Alibaba’s internal Tongyi Lab has adopted UI-Ins-7B as the default grounding back-end for 3 desktop-agent products (details confidential).


Author’s closing note:
If you remember only one thing, let it be this: treat your prompt as a reasoning path, not a string. Clean the data, give the model a menu of perspectives, and let RL decide what’s tasty today. The code is open—go make your GUI agent see the world from more than one angle.