How I Built a Fully Automated Coding Agent MVP: Big Models for Planning, Small Models for Doing
I recently set out to test a counterintuitive hypothesis:
In multi-agent orchestration, non-reasoning small models can sometimes outperform reasoning large models in cost and speed.
To verify this, I built a minimal viable product (MVP) called hero-coding. I ran the same harness against three model groups: ChatGPT 5.4, Ling-2.6-flash, and Ling-2.5-1T.
The results overturned my initial assumption: it’s not simply “small models win,” but rather “the right model in the right place wins.”
Here’s the full breakdown of how it works, how to set it up, and where to focus your effort if you want to run this yourself.
What Is a Fully Automated Coding Agent?
In one sentence:
You drop a user story into an inbox/ directory, come back half an hour later, and check git log to see whether the task is done.
High-Level Flow
inbox/us-001.md   ← user story (markdown + frontmatter)
      │
      ▼
  Dispatcher ──── watches inbox/, spawns worker subprocess
      │
      ▼
    Worker   ──── runs pi-coding-agent --mode json
      │           atomic execution, one git commit per change
      ▼
    Judge    ──── reads git log + full diff, returns structured verdict
      │
  ┌───┴───┐
 PASS    FAIL
  │       │
  ▼       ▼
done/   append failure reason to story, restart worker
All three core components are stateless, one-shot processes:
- Worker runs once and exits.
- Judge runs once and exits.
- All state is persisted via Git and the filesystem.
This design fits the pattern of “long tasks composed of short tasks.”
For the worker piece, I reused pi-coding-agent (as Mario noted in his README: Pi intentionally avoids sub-agents and plan modes, leaving that to you).
My extension is about 400 lines of TypeScript that wraps it into an inbox-driven automation.
The takeaway: You don’t need to rewrite a coding agent to run a coding agent factory. You just need to wire up a harness.
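The dispatcher, worker, and judge loop from the flow above can be sketched as a single function. This is a minimal illustration, not the actual hero-coding code: the `Story`, `Verdict`, `Worker`, and `Judge` types and the `runStory` name are hypothetical, and it assumes `max_retries` means "retries after the first attempt".

```typescript
// Hypothetical types: the article does not show the real interfaces.
type Verdict = { pass: true } | { pass: false; reason: string };

interface Story {
  id: string;
  body: string;       // markdown text handed to the worker
  maxRetries: number; // assumed to mirror max_retries in the frontmatter
}

// Worker and Judge are injected as one-shot calls, so the loop holds no state
// beyond the story text it passes along.
type Worker = (story: Story) => void;   // edits the repo, commits, exits
type Judge = (story: Story) => Verdict; // reads git log + diff, returns a verdict

function runStory(
  story: Story,
  worker: Worker,
  judge: Judge,
): { status: 'done' | 'failed'; rounds: number; story: Story } {
  const maxRounds = story.maxRetries + 1; // first attempt plus retries (assumption)
  let current = story;
  for (let round = 1; round <= maxRounds; round++) {
    worker(current);
    const verdict = judge(current);
    if (verdict.pass) return { status: 'done', rounds: round, story: current };
    // FAIL: append the reason so the next worker run can see it.
    current = {
      ...current,
      body: `${current.body}\n\n## Previous failure (round ${round})\n${verdict.reason}\n`,
    };
  }
  return { status: 'failed', rounds: maxRounds, story: current };
}
```

In a real harness the injected functions would spawn subprocesses and the story text would be written back to disk; injecting them keeps the retry logic testable without Git or a model.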
What Is a User Story?
A minimal executable unit of work. I use markdown with frontmatter:
---
id: us-001
title: Add timezone parameter to formatDate
priority: normal
max_retries: 3
---
## Goal
Add an optional timezone parameter to formatDate, defaulting to UTC.
## Acceptance Criteria
- [ ] Add timezone?: string to the function signature
- [ ] When omitted, output is byte-identical to current behavior
- [ ] When set to Asia/Tokyo, formats in that timezone
- [ ] Add 3 tests in tests/utils.test.ts
- [ ] npm test passes
## Out of Scope
- Do not change other functions
- Do not touch locale settings
Drop it into inbox/ and the Dispatcher takes over.
The Out of Scope section is often more valuable than the Goal—non-reasoning models tend to overstep. Clear constraints lead to better results.
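To show how a dispatcher might read such a file, here is a minimal sketch of a frontmatter parser. It handles only flat `key: value` lines like the ones in the example story. The `parseStory` name, the `StoryMeta` shape, and the fallback values (`normal`, `3`) are assumptions for illustration, not taken from the repo.

```typescript
interface StoryMeta {
  id: string;
  title: string;
  priority: string;
  maxRetries: number;
}

// Minimal parser for flat `key: value` frontmatter (no nesting, lists, or quoting).
function parseStory(markdown: string): { meta: StoryMeta; body: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(markdown);
  if (!match) throw new Error('story is missing a frontmatter block');
  const [, header = '', body = ''] = match;

  const fields = new Map<string, string>();
  for (const line of header.split(/\r?\n/)) {
    const idx = line.indexOf(':');
    if (idx === -1) continue; // skip blank or malformed lines
    fields.set(line.slice(0, idx).trim(), line.slice(idx + 1).trim());
  }

  const id = fields.get('id');
  const title = fields.get('title');
  if (!id || !title) throw new Error('frontmatter needs id and title');

  // Fallbacks below are assumptions, not documented defaults.
  const retries = Number(fields.get('max_retries') ?? '3');
  return {
    meta: {
      id,
      title,
      priority: fields.get('priority') ?? 'normal',
      maxRetries: Number.isInteger(retries) && retries >= 0 ? retries : 3,
    },
    body,
  };
}
```

A production harness would more likely use a YAML library; the hand-rolled version just makes the file format explicit.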
What I Ran and How
I prepared a small TypeScript project with intentional bugs:
- formatDate(date): missing timezone support (to be added)
- parseRange("1-5"): off-by-one bug, returns [1,2,3,4] instead of [1,2,3,4,5]
- formatNumber(-1234): double-negative bug, returns --1,234 instead of -1,234
Three user stories map to three typical workloads:
- us-001: Add a feature (timezone)
- us-002: Fix a clear bug (parseRange)
- us-003: Add validation + tests (parseRange boundaries)
Model groups:
- ChatGPT 5.4 (reasoning model): accessed via local reverse proxy with a real account
- Ling-2.6-flash (non-reasoning small model): Ant Group’s BaiLing OpenAI-compatible API
- Ling-2.5-1T (non-reasoning large model): same API family
Worker and Judge are configured via ~/.pi/agent/models.json using the OpenAI-compatible protocol. Switching models is as simple as changing the model field.
All state is kept in Git. The harness is fully reproducible.
Results
Here’s a summary of the raw data (full JSON logs are in the repo under runs/):
| Task | Model | Time | Token Usage | Rounds to Pass | Notes |
|---|---|---|---|---|---|
| us-001 (add feature) | Ling-2.5-1T | 130s | 13K | 1 round | 11% of ChatGPT’s tokens, 63% of time |
| us-001 (add feature) | ChatGPT 5.4 | 205s | 120K | 2 rounds | Heavy reasoning overhead |
| us-002 (bug fix) | Ling-2.6-flash | 90s | — | 1 round | 31% faster than ChatGPT |
| us-002 (bug fix) | ChatGPT 5.4 | 131s | — | 1 round | — |
| us-003 (validation + tests) | Ling-2.5-1T | 58s | 5K | 1 round | 33% faster, roughly 60% fewer tokens |
| us-003 (validation + tests) | ChatGPT 5.4 | 86s | 13K | 1 round | — |
Key Takeaways
- Clear bug fixes: Ling-2.6-flash is 31% faster than ChatGPT (us-002). It makes more tool calls (52 vs 14), but each is extremely fast. This is its sweet spot: high-frequency, low-latency edits, completions, and quick fixes.
- Feature additions: Ling-2.5-1T uses 11% of ChatGPT’s tokens and 63% of the time (us-001). ChatGPT’s reasoning adds significant token overhead; Ling-2.5-1T, as a non-reasoning model, skips that overhead and keeps costs low.
- Input validation and tests: Ling-2.5-1T is 33% faster and uses roughly 60% fewer tokens (us-003, 5K vs 13K). Both pass, but Ling does it more efficiently across the board.
In all three cases, the Ling models outperformed ChatGPT in cost and speed when used appropriately.
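The percentages quoted above can be recomputed from the table. In this short sketch, note that "faster" in this article means a reduction in elapsed time, not a ratio of speeds, and that the token counts are the rounded figures from the table (in thousands). The helper names `pctReduction` and `pctOf` are illustrative.

```typescript
// Percentage by which `candidate` is lower than `baseline`, rounded to a whole percent.
function pctReduction(candidate: number, baseline: number): number {
  return Math.round((1 - candidate / baseline) * 100);
}

// `candidate` expressed as a percentage of `baseline`, rounded to a whole percent.
function pctOf(candidate: number, baseline: number): number {
  return Math.round((candidate / baseline) * 100);
}

console.log(pctReduction(90, 131)); // us-002 time: 31
console.log(pctOf(130, 205));       // us-001 time: 63
console.log(pctOf(13, 120));        // us-001 tokens (K): 11
console.log(pctReduction(13, 120)); // us-001 tokens (K): 89
console.log(pctReduction(58, 86));  // us-003 time: 33
console.log(pctReduction(5, 13));   // us-003 tokens (K): 62
```

Because the token figures are rounded to the nearest thousand, the last value is approximate; the full JSON logs would give exact numbers.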
Why This Matters: The Importance of a Harness
I initially tried using Ling-2.6-flash for everything. It failed on us-001 with an infinite loop:
worker → bash: echo All criteria met.
worker → bash: echo All criteria met.
worker → bash: echo All criteria met.
... (until 80 tool calls hit the limit)
The fix was to change the strategy:
- Use Ling-2.5-1T for understanding, planning, and decomposition
- Use Ling-2.6-flash for fast execution, completions, and quick patches
In my runs, the non-reasoning small model showed little planning ability on its own. You need a “brain” model to design, then a “hand” model to execute.
After switching us-001’s worker to Ling-1T, it passed in one round.
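The manual switch described here could also be automated as an escalation rule. The following is a hypothetical sketch, not something the harness is stated to do: the `nextWorkerModel` function, the `Failure` type, and the use of the display names as model ids are all assumptions.

```typescript
type Role = 'plan' | 'execute';
type Failure = 'loop' | 'judge_fail';

// Model names as written in the article; the real API ids may differ.
const MODEL_FOR_ROLE: Record<Role, string> = {
  plan: 'Ling-2.5-1T',
  execute: 'Ling-2.6-flash',
};

// Decide which model the next worker round should use.
function nextWorkerModel(current: string, failure: Failure): string {
  // A detected loop suggests the task needs planning, so escalate to the planning model.
  if (failure === 'loop' && current === MODEL_FOR_ROLE.execute) {
    return MODEL_FOR_ROLE.plan;
  }
  // Otherwise retry with the same model plus the appended failure reason.
  return current;
}
```

Starting every story on the cheaper model and escalating only on a loop would keep the fast path for simple fixes while avoiding repeated failures like the one on us-001.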
Simple but Effective Harness Practices
I encountered five failures during development—all due to typical non-reasoning model pitfalls. A small harness makes the difference.
1. Loop Detection (worker.ts, ~10 lines)
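// Assumed context: `sig` is a signature of the latest tool call and
// `child` is the worker subprocess. `recent` must be declared once,
// outside the per-event handler, so it persists across tool calls.
const recent: string[] = [];
if (sig) {
  recent.push(sig);
  if (recent.length > 6) recent.shift(); // sliding window of the last 6 calls
  if (recent.filter(s => s === sig).length >= 4) {
    child.kill('SIGKILL'); // same sig 4 times in the last 6 calls = loop
  }
}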
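The fragment above depends on surrounding state. A self-contained version of the same sliding-window check might look like the sketch below. The `createLoopDetector` and `signatureOf` names are hypothetical, and how the real worker derives `sig` is not shown in the article, so the tool-name-plus-arguments signature is an assumption.

```typescript
// Sliding-window repeat detector: reports a loop when the same signature
// appears `threshold` times within the last `windowSize` tool calls.
function createLoopDetector(windowSize = 6, threshold = 4) {
  const recent: string[] = [];
  return (sig: string): boolean => {
    recent.push(sig);
    if (recent.length > windowSize) recent.shift();
    return recent.filter(s => s === sig).length >= threshold;
  };
}

// Hypothetical signature: tool name plus serialized arguments.
function signatureOf(tool: string, args: unknown): string {
  return `${tool}:${JSON.stringify(args)}`;
}

const isLooping = createLoopDetector();
const sig = signatureOf('bash', { command: 'echo All criteria met.' });
for (let call = 1; call <= 4; call++) {
  if (isLooping(sig)) {
    console.log(`loop detected on call ${call}`); // the caller would kill the worker here
  }
}
```

With the defaults, the `echo All criteria met.` loop shown earlier would be caught on the fourth identical call rather than at the 80-call limit.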
2. Auto-Rescue Commit (dispatcher.ts, ~10 lines)
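// Commits any uncommitted work the worker left behind, so the Judge sees it.
// `git(args, cwd)` is the harness's helper that runs git and returns stdout.
async function autoRescueCommit(repo: string, round: number) {
  const status = await git(['status', '--porcelain'], repo);
  if (!status.trim()) return false; // clean tree: nothing to rescue
  await git(['add', '-A'], repo);
  await git(['commit', '-m', `chore(rescue): round ${round}`], repo);
  console.log('↳ auto-rescue: committed pending changes left by worker');
  return true;
}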
In us-003, Ling-1T fixed the code correctly but forgot to commit. Without this, the Judge would see no commit and mark it FAIL. The harness rescued it automatically.
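If you want the rescue step to log which files it picked up, the `git status --porcelain` output can be parsed before committing. This is a minimal sketch for the v1 short format only (two status characters, a space, then the path); `parsePorcelain` and `PendingChange` are illustrative names, and paths that Git quotes because of special characters are not handled.

```typescript
interface PendingChange {
  status: string; // two-character XY code, e.g. ' M', '??', 'R '
  path: string;   // current path (the new path for renames and copies)
}

// Parse `git status --porcelain` (v1) output into a list of pending changes.
// The raw output must not be trimmed first: a leading space is part of the status code.
function parsePorcelain(output: string): PendingChange[] {
  return output
    .split('\n')
    .filter(line => line.length > 3)
    .map(line => {
      const status = line.slice(0, 2);
      const rest = line.slice(3);
      // Renames and copies are reported as "old -> new"; keep the new path.
      const arrow = /[RC]/.test(status) ? rest.indexOf(' -> ') : -1;
      return { status, path: arrow === -1 ? rest : rest.slice(arrow + 4) };
    });
}

const changes = parsePorcelain(' M src/utils.ts\n?? tests/utils.test.ts\n');
console.log(changes.map(c => c.path).join(', ')); // src/utils.ts, tests/utils.test.ts
```

Logging the rescued paths makes it easier to tell whether the worker forgot a single commit or left a half-finished change behind.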
Final Thoughts
This experiment started as a hypothesis and turned into a practical guide.
- Ling-2.6-flash is faster for high-frequency, low-complexity tasks (us-002 was 31% faster than ChatGPT).
- Ling-2.5-1T is extremely token-efficient for planning and understanding (us-001 used 89% fewer tokens than ChatGPT).
- Small models need a harness; otherwise, they loop or miss commits.
- A “brain + hand” split works well: Ling-1T for design, Ling-flash for execution.
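If you already have a working harness, try this division of labor: Ling-2.5-1T for planning, Ling-2.6-flash for execution. In my runs, token usage dropped by roughly 60% to 89% against ChatGPT 5.4 where tokens were recorded (us-003 and us-001), and elapsed time fell by about a third.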
One last tip: Run the repo yourself. Watching the Worker stream tool calls in real time is more instructive than reading any article.
(All data, code, and run logs are available in the GitHub repository linked in the comments.)

