RealDevWorld: From Code that Compiles to Apps that Actually Work
What problem does this article solve?
Large language models can now spit out entire Git repositories, but static unit tests can’t tell you if the login button actually logs users in. RealDevWorld closes that gap by letting an AI agent click, type, scroll and judge the result—at human-level accuracy and a fraction of the cost.
1. Why existing benchmarks leave us flying blind
“Why can’t we just run unit tests on AI-generated front-end code?”
Because real users interact with pixels, not with functions.
Traditional approach | What it checks | What it misses | Impact |
---|---|---|---|
HumanEval / MBPP | Isolated Python functions | UI flow, assets, runtime states | False confidence |
SWE-Bench | Repo-level Python patches | GUI, multimodal inputs, visual fidelity | Silent failures |
Manual QA | Everything | Scale, cost, consistency | Bottleneck |
RealDevWorld’s authors audited 194 real-world feature requests—display dashboards, data visualizations, mini-games—and found that 92 % of critical defects only surfaced during live interaction. Static analysis never saw them.
2. Meet RealDevWorld in one minute
“What exactly is RealDevWorld?”
A two-piece framework: RealDevBench (the test set) and AppEvalPilot (the AI judge that clicks through your app).
Component | Purpose | Count / Speed | Key trait |
---|---|---|---|
RealDevBench | Curated open-ended tasks | 194 tasks, 4 domains | Multimodal inputs |
AppEvalPilot | GUI agent-as-a-judge | ~9 min / project | Human-level accuracy (0.92) |
3. Deep dive: RealDevBench—tasks that feel like freelance briefs
“What kind of assignments are in RealDevBench?”
They mirror real client specs: responsive portfolios, blog-analytics dashboards, finance trackers, mini card games—each delivered as a (requirements, features, materials) triple.
3.1 Task anatomy
- Requirements – a one-sentence brief
- Features – 5–10 bullet specs
- Materials – CSVs, images, and audio files scraped from Unsplash / Kaggle
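To make the triple concrete, here is a hypothetical Data-domain task written as a Python dict. The field names and contents are illustrative only, not the official RealDevBench schema.

```python
# Hypothetical task triple in the (requirements, features, materials) shape
# described above; names and values are made up for illustration.
task = {
    "requirements": "Build a personal finance dashboard with monthly summaries.",
    "features": [
        "Upload a CSV of transactions",
        "Monthly trend line chart",
        "Category breakdown pie chart",
        "Dark-mode toggle",
    ],
    "materials": ["finance_2024.csv", "profile_photo.jpg"],
}
```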
3.2 Domain snapshot
Domain | % of tasks | Example feature list | Sample material |
---|---|---|---|
Display | 50 % | “Dark-mode toggle, PDF resume download” | Profile photo |
Data | 14 % | “Monthly trend line, category pie chart” | Personal finance CSV |
Analysis | 19 % | “Keyword extraction table, rating trend” | Product-review CSV |
Game | 17 % | “Turn counter, AI opponent, replay button” | — |
Author’s reflection: While annotating tasks, I noticed the Game domain has no external files—just logic. That forced us to test state machines, not pixels, proving RealDevBench isn’t just a UI beauty contest.
4. AppEvalPilot: the AI that uses your app like a human
“How does an LLM actually use a GUI?”
It follows the same three stages a human tester does—design tests, execute actions, judge outcomes—wrapped into an autonomous agent loop.
4.1 Stage 1: Test-case generation
Input: task description + feature list
Output: 15–20 concrete steps in Python list syntax
Example excerpt:
[
"Click the avatar and verify navigation to About page",
"Toggle dark mode and assert CSS class persists after refresh",
"Download resume.pdf and confirm file size > 20 KB"
]
The agent uses few-shot prompting with curated QA templates; no manual scripting.
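As a rough illustration of that few-shot step, the sketch below asks an OpenAI-compatible endpoint for a JSON array of test steps. The model name, template text, and the generate_test_cases helper are assumptions made for this example, not AppEvalPilot's internals.

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works here

client = OpenAI()  # reads the API key / base URL from the environment

QA_TEMPLATE = (
    "You are a QA engineer. Given a task description and a feature list, "
    "write 15-20 concrete, verifiable GUI test steps as a JSON array of strings."
)

def generate_test_cases(description: str, features: list[str]) -> list[str]:
    """Ask the model for executable test steps and parse the JSON array it returns."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # swap in Claude-3.5-Sonnet / Qwen-VL via your own gateway
        messages=[
            {"role": "system", "content": QA_TEMPLATE},
            {"role": "user",
             "content": f"Task: {description}\nFeatures:\n- " + "\n- ".join(features)},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```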
4.2 Stage 2: GUI execution
Atomic action space (4 commands only):
Command | Implementation | Example use |
---|---|---|
Open(app) | Win+Search+Enter | Launch deployed URL |
Run(code) | PyAutoGUI | Type, click, scroll |
Tell(answer) | Stdout JSON | Report step result |
Stop | Graceful exit | End episode |
During execution the agent consumes both the accessibility (a11y) tree derived from the DOM and screenshots, fusing textual element IDs with visual grounding to stay robust against layout drift.
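For intuition, here is a minimal sketch of how the four atomic commands could map onto PyAutoGUI on a desktop. The helper names and dispatch shape are illustrative, not AppEvalPilot's actual implementation.

```python
import json
import time
import pyautogui  # drives the real mouse and keyboard, as in Run(code)

def open_app(target: str) -> None:
    """Open(app): press the Win key, type the app name or URL, press Enter."""
    pyautogui.press("win")
    time.sleep(1)
    pyautogui.write(target, interval=0.05)
    pyautogui.press("enter")

def run(code: str) -> None:
    """Run(code): execute a small PyAutoGUI snippet produced by the agent."""
    exec(code, {"pyautogui": pyautogui, "time": time})

def tell(answer: dict) -> None:
    """Tell(answer): report a step result as JSON on stdout."""
    print(json.dumps(answer))

# Example episode: launch the deployed URL, click, report, then Stop
# (Stop is simply ending the agent loop).
open_app("https://my-app.netlify.app")
run("pyautogui.click(640, 360)")
tell({"step": 1, "result": "Pass"})
```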
4.3 Stage 3: Judgment
Each test receives Pass / Fail / Uncertain plus a screenshot snippet.
Scores roll up to feature level, then to an overall 0–1 “Agent Quality”.
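A hedged sketch of that roll-up is shown below; weighting Uncertain as 0.5 is my assumption, not necessarily the paper's exact scoring rule.

```python
from collections import defaultdict

# Assumed verdict weights: Pass=1, Fail=0, Uncertain=0.5 (illustrative only).
VERDICT_SCORE = {"Pass": 1.0, "Fail": 0.0, "Uncertain": 0.5}

def agent_quality(results: list[dict]) -> float:
    """results: [{'feature': 'dark mode', 'verdict': 'Pass'}, ...] -> 0-1 score."""
    per_feature = defaultdict(list)
    for r in results:
        per_feature[r["feature"]].append(VERDICT_SCORE[r["verdict"]])
    # Average within each feature, then across features.
    feature_scores = [sum(v) / len(v) for v in per_feature.values()]
    return sum(feature_scores) / len(feature_scores)

print(agent_quality([
    {"feature": "dark mode", "verdict": "Pass"},
    {"feature": "dark mode", "verdict": "Uncertain"},
    {"feature": "resume download", "verdict": "Fail"},
]))  # (0.75 + 0.0) / 2 = 0.375
```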
5. Benchmarking the benchmark
“Does AppEvalPilot really agree with human experts?”
5.1 Human-vs-agent study
Dataset: 49 tasks (25 % of RealDevBench)
Human labelers: 3 junior QA + 1 senior arbiter
Metric | AppEvalPilot | Best previous GUI agent | Human |
---|---|---|---|
Test-case accuracy | 0.92 | 0.74 | 1.00 |
Feature-level correlation | 0.85 | 0.58 | — |
Mean time per task | 9 min | 13.5 min | 45 min+ |
Interpretation: The agent outperformed the best prior GUI agent by 18 points of absolute accuracy (0.92 vs 0.74) and cut human effort by roughly 80 %.
5.2 Model leaderboard on RealDevBench
(54 test tasks, 11 generation systems)
Generator | Agent Quality | Static Code Score | Visual Score |
---|---|---|---|
Lovable (agentic) | 0.74 | 0.58 | 0.47 |
MGX + BoN-3 | 0.78 | 0.72 | 0.41 |
Kimi-K2 (raw LLM) | 0.39 | 0.41 | 0.29 |
Take-away: Agent scaffolding (Lovable, MGX) adds ≈ 0.27 Agent Quality over raw LLM output, suggesting that design-deploy-verify pipelines matter more than raw parameter count.
6. Real-world integration guide
“How do I plug RealDevWorld into my workflow today?”
6.1 Quick-start (local)
# 1. Environment
conda create -n rdw python=3.10
conda activate rdw
git clone https://github.com/DeepWisdom/RealDevWorld
cd RealDevWorld && pip install -e .
# 2. Configure LLM
cp config/config2.yaml.example config/config2.yaml
# Edit api_key / base_url for Claude-3.5-Sonnet or Qwen-VL
# 3. Evaluate a deployed site
python -m realdevworld --url https://my-app.netlify.app
6.2 CI snippet (GitHub Actions)
- name: GUI regression test
  run: |
    pip install realdevworld
    realdevworld --url ${{ env.PREVIEW_URL }} \
      --output report.json
  continue-on-error: false
6.3 Gradio dashboard
Launch python gradio_app.py and open http://localhost:7860 for point-and-click evaluation.
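If you want a feel for what such a dashboard involves, the sketch below wraps the CLI from §6.1 in a tiny Gradio UI. It is a hedged approximation only; the shipped gradio_app.py may look quite different.

```python
import json
import subprocess
import gradio as gr

def evaluate(url: str) -> dict:
    # Shell out to the CLI shown above and return the JSON report it writes.
    subprocess.run(["realdevworld", "--url", url, "--output", "report.json"], check=True)
    with open("report.json") as f:
        return json.load(f)

demo = gr.Interface(
    fn=evaluate,
    inputs=gr.Textbox(label="Deployed URL"),
    outputs=gr.JSON(label="Evaluation report"),
    title="RealDevWorld evaluation",
)
demo.launch(server_port=7860)
```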
7. Scenario walk-through: Friday deploy freeze
Context: A two-person startup ships a new expense-tracker dashboard. They usually eyeball the UI on staging.
Step | Before RealDevWorld | After |
---|---|---|
Code ready | Manual click-through (30 min) | realdevworld --url $PREVIEW_URL (8 min) |
Bug found | Slack screenshot ping-pong | JSON report auto-opens GitHub issue |
Fix merged | Re-run manual tests | CI re-runs automatically |
Net result | Friday 8 pm still in office | Home by 6 pm |
Author’s reflection: Watching the JSON report land directly into our issue tracker felt like hiring a tireless junior QA who never misses a hover state.
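The "report auto-opens a GitHub issue" step is easy to wire up yourself. Below is a hedged sketch that files an issue for failing checks via the GitHub REST API; the report.json fields (results, verdict, feature, step) are assumed for illustration, and OWNER/REPO are placeholders.

```python
import json
import os
import requests

with open("report.json") as f:
    report = json.load(f)

# Assumed report shape: {"results": [{"feature": ..., "step": ..., "verdict": ...}]}
failures = [r for r in report.get("results", []) if r.get("verdict") == "Fail"]

if failures:
    body = "\n".join(f"- {r['feature']}: {r['step']}" for r in failures)
    resp = requests.post(
        "https://api.github.com/repos/OWNER/REPO/issues",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": "GUI regression: failing RealDevWorld checks", "body": body},
        timeout=30,
    )
    resp.raise_for_status()
```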
8. Failure modes & mitigation cheat-sheet
“What still trips AppEvalPilot up?”
Failure mode | Symptom | Quick fix |
---|---|---|
Missing media | Audio playback test fails silently | Provide stub file or mark feature optional |
Timing lag | Game key-press missed | Increase the wait timeout or add frame-level polling (sketch below) |
Hallucinated counts | “3 songs” judged OK for 10–15 range | Add explicit numeric assertions in prompt |
Layout drift | Button moved, XPath stale | Use vision + text hybrid selectors |
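For the timing-lag case, "frame-level polling" can be as simple as waiting until consecutive screenshots stop changing instead of sleeping for a fixed interval. The sketch below uses PyAutoGUI and Pillow; the thresholds and helper name are illustrative.

```python
import time
import pyautogui
from PIL import ImageChops

def wait_until_settled(timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll screenshots until two consecutive frames are identical or the timeout hits."""
    previous = pyautogui.screenshot()
    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(interval)
        current = pyautogui.screenshot()
        if ImageChops.difference(previous, current).getbbox() is None:
            return True  # no pixel changed between frames: the UI has settled
        previous = current
    return False  # still animating or loading when the timeout expired

# Example: send a game key press, then poll instead of a blind sleep.
pyautogui.press("right")
if not wait_until_settled():
    print("Warning: UI never settled; the key press may have been missed.")
```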
9. Action checklist / implementation steps
- Install the environment and clone the repo (see §6.1).
- Add your LLM key to config2.yaml.
- Pick one small feature branch and run a single evaluation: realdevworld --url https://feat-darkmode--myapp.netlify.app
- Inspect report.json; open any Fail screenshots.
- Add the CI snippet (§6.2) to your PR workflow.
- Observe one full sprint; compare the bug-escape rate before/after.
- Gradually expand to all feature branches.
One-page overview
- Problem: Static tests miss GUI and runtime defects.
- Solution: RealDevWorld, 194 real-world tasks plus an AI agent that clicks.
- Accuracy: 0.92 vs human QA; cost ~ $0.26 / run.
- Speed: 8–9 minutes per project.
- Integration: pip install, one-line CLI, GitHub Action ready.
- Take-away: If your LLM can build apps, RealDevWorld can validate them at scale.
FAQ
- Which front-end stacks are supported?
  Any web app reachable via HTTPS. React, Vue, Svelte, or vanilla JS, it makes no difference.
- Can I test a local Python desktop app?
  Yes. Use --type python_app; the agent launches the binary and attaches via accessibility APIs.
- What if my UI keeps changing?
  Re-run on every PR. Each run is < 10 minutes and fully deterministic.
- Does it handle authentication flows?
  The current version supports basic form login via test-case scripting. OAuth flows are on the roadmap.
- Is there an on-prem option?
  Cloud LLM calls are required today. A quantized local vision model release is planned for Q4.
- How do I interpret “Uncertain” results?
  Treat them as manual-review triggers. They occur < 5 % of the time in practice.
- Can I add my own tasks to RealDevBench?
  The schema is open. Submit a PR with (requirements, features, materials) triples following the examples in Appendix A.
- What about mobile native apps?
  iOS/Android support is experimental. Use the same action space via the Appium bridge; docs are in /contrib.