8 Days, 20 USD, One CLI: Building an Open-Source AI Manhua-Video App with Claude Code & GLM-4.7
Core question answered in one line:
A backend-only engineer with zero mobile experience can ship an end-to-end “prompt-to-manhua-video” Android app in eight calendar days and spend only twenty dollars by letting a CLI coding agent write Flutter code while a cheap but powerful LLM plans every creative step.
1. Why Another AI-Video Tool? The Mobile Gap
Core question this section answers:
If web-based manhua-video makers already exist, why bother building a mobile-native one?
- Every existing product the author tried was desktop-web only: upload reference images, drag storyboards, wait for cloud queues. None of that feels natural on a bus ride.
- During a Wuhan AI meet-up, thirty attendees repeated the same pain: “I want to create on my phone while commuting.”
- The author’s earlier two blog posts on AI manhua drew tens of thousands of reads and dozens of private “when will there be an app?” messages.
- Personal motivation: the author’s wife joked “you must be crazy” on hearing the idea; proving her wrong became an eight-day sprint target.
Author reflection:
I realised the competition wasn’t other apps; it was the empty time people spend scrolling. If creation is easier than consumption, they’ll create.
2. Success Metric: What Does “Done” Look Like on Day 8?
Core question:
Under money, time, and skill ceilings, what exact artefact counts as victory?
| Constraint | Hard Limit | Stretch Goal |
|---|---|---|
| Time | 8 calendar days (incl. 3-day New-Year break) | Demo ready by day 5 |
| Budget | ≤ 100 USD | Actual spend: 20 USD |
| Skill | Author has never shipped Android before | Use Claude Code CLI to generate Flutter |
| Output | Installable APK < 60 MB | Open-source repo with ≥ 50 stars |
| Functionality | One-sentence input → 30 s vertical video | Character face consistency ≥ 80 % |
Checklist used every evening:
- [x] 1-sentence prompt accepted
- [x] 8-scene script auto-written
- [x] 1 protagonist, 1 side character, 7 scenes configurable
- [x] Three-view character sheet generated once and reused
- [x] Each scene: 1 key image + 4 s video clip
- [x] FFmpeg concatenates clips into final MP4
- [x] MIT license repo public on GitHub
3. Tech Choices in 10 Minutes: Flutter + GLM-4.7 + ReAct Loop
Core question:
Which stack can be trusted without spending time on a proof of concept?
- Coding Agent
  - Claude Code (CLI) showed 95 % compile success on Flutter snippets in earlier toy tests: no GUI, no config hell, just “yes” to every suggestion.
- UI Framework
  - Flutter: one code-base → Android APK, hot-reload < 1 s, default Material 3 theme already “pretty enough”.
- LLM Brain
  - GLM-4.7:
    - Chinese & English scripting equally fluent
    - JSON instruction format stable across 50 prompt iterations
    - Year-end promo: 100 M tokens for a 20 USD top-up
- Media API
  - Image: Gemini Pro Vision endpoint (author had a ready-made doc)
  - Video: Veo 2.0 beta (author had a ready-made doc)
- Control Pattern
  - ReAct loop: user text → LLM thinks → JSON tool call → app executes → result fed back → LLM decides the next action … until “action: finish”. A minimal Dart sketch of this loop follows the list.
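For readers who want to see the shape of that loop in code, here is a minimal Dart sketch. The tool names mirror the data-flow diagram in the next section; the typedefs, message format, and function names are illustrative, not the repo’s actual `lib/react_loop.dart`.

```dart
import 'dart:convert';

typedef Planner = Future<String> Function(List<Map<String, String>> history);
typedef Tool = Future<String> Function(Map<String, dynamic> args);

/// Runs the ReAct loop: the planner emits one JSON tool call per turn,
/// the app executes it and feeds the observation back, until the planner
/// answers with {"action": "finish"}.
Future<void> runReactLoop(
  String userPrompt,
  Planner askPlanner,       // wraps the GLM chat endpoint
  Map<String, Tool> tools,  // write_script / draw_char / draw_scene / gen_video
) async {
  final history = <Map<String, String>>[
    {'role': 'user', 'content': userPrompt},
  ];

  while (true) {
    final reply = await askPlanner(history);
    final call = jsonDecode(reply) as Map<String, dynamic>;
    if (call['action'] == 'finish') break;

    // Dispatch to the matching tool, mirroring the Router node of the flowchart.
    final tool = tools[call['action']];
    final observation = tool == null
        ? 'error: unknown tool "${call['action']}"'
        : await tool((call['args'] ?? <String, dynamic>{}) as Map<String, dynamic>);

    // Feed the result back so the planner can decide the next step.
    history
      ..add({'role': 'assistant', 'content': reply})
      ..add({'role': 'user', 'content': 'observation: $observation'});
  }
}
```

The design point is that the planner stays in charge; the app never decides the order of steps, it only executes the single tool call that comes back each turn.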
Author reflection:
I picked Flutter over React-Native simply because Claude Code hallucinates fewer imports in Dart. That’s not engineering elegance—that’s deadline survival.
4. Data-Flow Architecture: One Picture, No Black Box
Core question:
How does a plain sentence become a watchable 30-second manhua video?
```mermaid
flowchart LR
A[User types prompt] -->|1| B[GLM-4.7 planner]
B -->|JSON tool| C[Flutter app]
C --> D{Router}
D -->|write_script| E[Local YAML]
D -->|draw_char| F[Gemini image]
D -->|draw_scene| G[Gemini image]
D -->|gen_video| H[Veo API]
F & G & H -->|URL| I[Feedback to GLM]
I --> B
B -->|action:finish| J[FFmpeg concat]
J --> K[final.mp4]
```
Consistency trick:
- Same seed and `negative_prompt` block across all calls.
- Character LoRA trained on the three-view sheet, weight 0.8.
- Video calls use the key image as the first-frame reference.
5. Day-by-Day Log: From 0 Lines to APK in the Store Folder
Core question:
What does the actual grind look like when the clock never stops?
| Day | Goal | Key Event | Hours | Outcome |
|---|---|---|---|---|
| D1 AM | Env setup | flutter doctor all green | 2 | CLI only |
| D1 PM | Prompt craft | 200-line ReAct template finished | 3 | Saved as director_ai/docs/system_prompt.md |
| D2 | Script→JSON | First successful 8-scene output | 4 | 20 k tokens burnt |
| D3 | Character lock | Three-view sheet consistent at 0.87 IoU | 6 | Seed 128475 canonised |
| D4 | Scene images | 7 images parallel, 3 threads | 5 | 1024×1792 each |
| D5 | Clip videos | 4 s clips finally stable | 7 | 12 failures → success |
| D6 | Glue & UI | FFmpeg script + progress bar | 4 | APK 58 MB |
| D7 | Dog-food | 3 friends test on Xiaomi/Samsung/Pixel | 3 | 17 bugs → 0 |
| D8 | Ship | README + demo tweet | 2 | GitHub public |
Worst moment: 2 a.m. on D5, Veo threw “invalid aspect ratio” 7 times—docs said 16:9 instead of 9:16. One word, two nights of sleep lost.
6. Walk-through: 32-Second “Strawberry Cake” Manhua
Core question:
Can a reader reproduce an entire clip right now with nothing but the repo?
Step 0 Input
User types:
“a pink-haired girl baking a strawberry cake, cute vibe”
Step 1 Script (GLM-4.7)
```json
{
  "title": "Berry Sweet",
  "scenes": 8,
  "hook": "Her heart beats louder than the oven timer.",
  "protagonist": { "name": "Berry", "trait": "pink bob, strawberry badge" }
}
```
Time: 3.2 s | Tokens: 1.1 k
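If you want to consume that JSON in your own Flutter code, a small model class might look like the sketch below. The field names come from the example above; the class itself is illustrative, not necessarily how the repo parses it.

```dart
import 'dart:convert';

/// Illustrative model for the planner's script JSON. Field names are taken
/// from the example above; this is not necessarily the repo's actual class.
class MangaScript {
  final String title;
  final int scenes;
  final String hook;
  final String protagonistName;
  final String protagonistTrait;

  MangaScript.fromJson(Map<String, dynamic> json)
      : title = json['title'] as String,
        scenes = json['scenes'] as int,
        hook = json['hook'] as String,
        protagonistName = json['protagonist']['name'] as String,
        protagonistTrait = json['protagonist']['trait'] as String;
}

void main() {
  const raw = '{"title": "Berry Sweet", "scenes": 8, '
      '"hook": "Her heart beats louder than the oven timer.", '
      '"protagonist": {"name": "Berry", "trait": "pink bob, strawberry badge"}}';
  final script = MangaScript.fromJson(jsonDecode(raw) as Map<String, dynamic>);
  print('${script.title}: ${script.scenes} scenes, hook: ${script.hook}');
}
```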
Step 2 Three-View Sheet
Prompt snippet:
「pink bob, strawberry badge on apron, three-view sheet, seed 128475, negative: blurry, extra limbs」
Output: 1024×1024 PNG, face IoU 0.87
Image source: author generation
Step 3 Key Images ×7
Example scene-3 prompt:
「close-up, Berry placing strawberry on cream mountain, window light, seed 128475」
Generation: 6 s, 1024×1792
Step 4 Video Clips ×7
Request:
{ "image": "<key_image>", "duration": 4, "motion": "subtle head tilt, cream swirl" }
Median latency 28 s, 4 s@30 fps MP4 returned.
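For orientation, here is a hedged sketch of what that per-scene call could look like from Dart with `package:http`. Only the request fields (image, duration, motion) come from the example above; the endpoint, auth header, and “response body is the MP4” behaviour are assumptions, not the actual Veo contract.

```dart
import 'dart:convert';
import 'dart:io';

import 'package:http/http.dart' as http;

/// Requests one 4-second clip for a scene. Request fields mirror the example
/// above; endpoint, auth header, and response handling are assumptions.
Future<File> requestClip({
  required String endpoint,     // your Veo-compatible video endpoint
  required String apiKey,
  required String keyImageUrl,  // the scene's key image, used as first frame
  required String motion,       // e.g. "subtle head tilt, cream swirl"
  required String outPath,
}) async {
  final response = await http
      .post(
        Uri.parse(endpoint),
        headers: {
          'Authorization': 'Bearer $apiKey',
          'Content-Type': 'application/json',
        },
        body: jsonEncode({
          'image': keyImageUrl,
          'duration': 4,
          'motion': motion,
        }),
      )
      .timeout(const Duration(minutes: 2)); // median latency observed was ~28 s

  if (response.statusCode != 200) {
    throw HttpException('clip generation failed: HTTP ${response.statusCode}');
  }
  return File(outPath).writeAsBytes(response.bodyBytes);
}
```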
Step 5 Concatenate
No re-encode:
```bash
ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp4
```
Size: 32.4 MB, 720×1280, 32 s.
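The same concat step can also be driven from the app with `Process.run`. The sketch below simply reproduces the command above and assumes an ffmpeg binary on the PATH (desktop-style; on Android the app would typically go through an FFmpeg plugin instead).

```dart
import 'dart:io';

/// Writes the concat manifest and runs the same stream-copy command as above.
/// Assumes an ffmpeg binary on the PATH; the repo may invoke it differently.
Future<void> concatClips(List<String> clipPaths, String outPath) async {
  // The concat demuxer expects one "file '<path>'" line per clip, in order.
  final listFile = File('list.txt');
  await listFile.writeAsString(clipPaths.map((p) => "file '$p'").join('\n'));

  final result = await Process.run('ffmpeg', [
    '-f', 'concat',
    '-safe', '0',
    '-i', listFile.path,
    '-c', 'copy', // stream copy, no re-encode, so 7 clips join in seconds
    outPath,
  ]);
  if (result.exitCode != 0) {
    throw Exception('ffmpeg failed: ${result.stderr}');
  }
}
```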
Play-test screenshot:
Image source: author generation
7. Character Consistency Deep Dive: Seed, LoRA, Reference Frame
Core question:
How is face-drift kept under 10 % without manual touch-up?
- Global Seed
  - Three-view sheet, key images, and the video first frame all reuse `seed=128475`.
  - An identical `negative_prompt` block removes random limb chaos.
- Lightweight LoRA
  - Train 20 steps, rank 16, on the three-view sheet → a 3.7 MB file.
  - Inference weight 0.8: keeps the face, allows style flexibility.
- Reference-Frame Video
  - Veo accepts a first-frame image; feed it a resized 512×512 crop of the face.
  - Motion descriptors limited to “subtle” or “slow” to avoid warping.
Numbers from 20-run ablation:
- Face IoU ≥ 0.85: 18 / 20
- Human rating ≥ 4 / 5: 9 / 10 clips
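To make the reuse concrete, all of these consistency settings can live in one shared block that gets merged into every image and video request. A minimal Dart sketch with illustrative field names follows; real APIs may spell seed, negative prompt, and LoRA weight differently.

```dart
/// One source of truth for the consistency settings, merged into every image
/// and video request. Field names are illustrative, not a specific API's.
const consistencyBlock = <String, dynamic>{
  'seed': 128475,
  'negative_prompt': 'blurry, extra limbs',
  'lora': {'file': 'assets/lora/berry_rank16.safetensors', 'weight': 0.8},
};

Map<String, dynamic> buildImageRequest(String prompt) => {
      'prompt': prompt,
      ...consistencyBlock, // identical seed + negative_prompt on every call
    };

Map<String, dynamic> buildVideoRequest(String keyImageUrl, String motion) => {
      'image': keyImageUrl, // the key image doubles as the first-frame reference
      'duration': 4,
      'motion': motion,     // keep it "subtle" or "slow" to avoid warping
      ...consistencyBlock,
    };
```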
8. Budget Autopsy: Where Did the 20 USD Go?
Core question:
Is “cheap” a marketing line or a repeatable fact?
| Item | Unit Price | Quantity | Subtotal |
|---|---|---|---|
| GLM-4.7 128 k | 0.015 USD / 1 k tokens | 1 300 000 | 19.5 USD |
| Gemini Pro Vision | Free tier | 60 imgs | 0 |
| Veo 2.0 beta | Free tier | 60 clips | 0 |
| Total | | | 19.5 USD ≈ 20 USD |
Promo detail: 20 USD top-up during campaign gave 100 M tokens. At list price (0.06 USD / 1 k) the same run costs ~78 USD—still under the 100 USD ceiling.
9. Repo Tour & Local Build in 5 Minutes
Core question:
How can a reader clone and see her own manhua video tonight?
```bash
git clone https://github.com/<user>/man-dao.git
cd man-dao
cp config.yaml.example config.yaml
# fill GLM_KEY, GEMINI_KEY, VEO_KEY
flutter pub get
flutter run --release
```
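Reading those keys at startup is a few lines with `package:yaml`. The flat GLM_KEY / GEMINI_KEY / VEO_KEY layout below is an assumption taken from the comment above; check `config.yaml.example` for the structure the repo actually expects.

```dart
import 'dart:io';

import 'package:yaml/yaml.dart';

/// Loads the three API keys from config.yaml. A flat GLM_KEY / GEMINI_KEY /
/// VEO_KEY layout is assumed here; see config.yaml.example for the real one.
class ApiKeys {
  final String glm;
  final String gemini;
  final String veo;

  ApiKeys(this.glm, this.gemini, this.veo);

  static Future<ApiKeys> load([String path = 'config.yaml']) async {
    final doc = loadYaml(await File(path).readAsString()) as YamlMap;
    return ApiKeys(
      doc['GLM_KEY'] as String,
      doc['GEMINI_KEY'] as String,
      doc['VEO_KEY'] as String,
    );
  }
}
```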
Key folders:
- `lib/react_loop.dart` – ReAct parser, 180 lines
- `scripts/seed_lock.py` – enforces the same seed across APIs
- `assets/lora/berry_rank16.safetensors` – 3.7 MB character weights
First successful compile:
Image source: Unsplash
10. Lessons Learnt & Road-map
Core question:
If the author started tomorrow, what would he skip, double-down on, or never do again?
Lessons
- Read docs to the pixel: the aspect-ratio typo cost 14 hours.
- Free tiers are great until day 7; always have a second provider URL ready.
- Version-control the prompt: rolling back a 200-line system prompt by Ctrl-Z is not fun.
Next milestones
- Mandarin TTS with CosyVoice + lip-sync (already tested, PR pending)
- In-app sharing to a mini-program (backendless, QR code only)
- Community LoRA market so users can swap protagonists in one click
Action Checklist / Implementation Steps
- Install Flutter 3.16 and Android Studio Hedgehog.
- Clone the repo, fill `config.yaml` with API keys.
- Run `flutter doctor` → all ticks green.
- Execute `flutter run --release` on a physical phone (camera permission needed).
- Type a one-sentence story idea → wait 8 min → receive a 30 s manhua video.
- Train your own LoRA: put 10 three-view images under `lora/train_data/` and run `scripts/lora_train.py`.
- Commit, push, and tweet the repo; the maintainer will merge useful PRs within 48 h.
One-page Overview
- Scope: backend-only engineer, zero mobile experience, 8 days, 20 USD.
- Stack: Claude Code CLI → Flutter → GLM-4.7 planner → Gemini images → Veo video → FFmpeg concat.
- Loop: ReAct pattern keeps the LLM in charge; the app just calls tools.
- Consistency: global seed + 3.7 MB LoRA + reference frame = ≤ 10 % face drift.
- Deliverable: 60 MB APK, open-source, MIT license, GitHub live now.
FAQ
- Q: Can I switch to React Native?
  A: An RN branch stub exists, but Claude Code generates more reliable Dart; feel free to PR.
- Q: What happens when the free Veo quota dries up?
  A: Swap `base_url` in `video_api.dart` to Runway or Pika; the interface is identical.
- Q: Is 20 USD a realistic long-term cost?
  A: At list price the same run costs ~78 USD, still below the 100 USD cap.
- Q: Is commercial use allowed?
  A: MIT license, do as you wish; just don’t upload copyrighted faces to the LoRA trainer.
- Q: iOS version?
  A: The Flutter code is cross-platform; you need an Apple Developer account (99 USD) and a video-export compliance description.
- Q: Why an English voice-over?
  A: The MVP skipped language lock; a Mandarin TTS PR is under review.
- Q: Is an eight-day crunch healthy?
  A: It averaged 5 hrs/day with no all-nighters; double the timeline if you want weekends.

