8 Days, 20 USD, One CLI: Building an Open-Source AI Manhua-Video App with Claude Code & GLM-4.7

Core question answered in one line:
A backend-only engineer with zero mobile experience can ship an end-to-end “prompt-to-manhua-video” Android app in eight calendar days, for only twenty dollars, by letting a CLI coding agent write the Flutter code while a cheap but powerful LLM plans every creative step.


1. Why Another AI-Video Tool? The Mobile Gap

Core question this section answers:
If web-based manhua-video makers already exist, why bother building a mobile-native one?

  • Every existing product the author tried was desktop-web only, asking users to upload reference images, drag storyboards, and wait for cloud queues—none of which feels natural on a bus ride.
  • During a Wuhan AI meet-up, thirty attendees repeated the same pain: “I want to create on my phone while commuting.”
  • The author’s earlier two blog posts on AI manhua drew tens of thousands of reads and dozens of private “when will there be an app?” messages.
  • Personal motivation: the author’s wife joked “you must be crazy” when hearing the idea—proving her wrong became an eight-day sprint target.

Author reflection:

I realised the competition wasn’t other apps; it was the empty time people spend scrolling. If creation is easier than consumption, they’ll create.


2. Success Metric: What Does “Done” Look Like on Day 8?

Core question:
Under money, time, and skill ceilings, what exact artefact counts as victory?

| Constraint | Hard Limit | Stretch Goal |
|---|---|---|
| Time | 8 calendar days (incl. 3-day New-Year break) | Demo ready by day 5 |
| Budget | ≤ 100 USD | Actual spend: 20 USD |
| Skill | Author has never shipped Android before | Use Claude Code CLI to generate Flutter |
| Output | Installable APK < 60 MB | Open-source repo with ≥ 50 stars |
| Functionality | One-sentence input → 30 s vertical video | Character face consistency ≥ 80 % |

Checklist used every evening:

  • [x] 1-sentence prompt accepted
  • [x] 8-scene script auto-written
  • [x] 1 protagonist, 1 side character, 7 scenes configurable
  • [x] Three-view character sheet generated once and reused
  • [x] Each scene: 1 key image + 4 s video clip
  • [x] FFmpeg concatenates clips into final MP4
  • [x] MIT license repo public on GitHub

3. Tech Choices in 10 Minutes: Flutter + GLM-4.7 + ReAct Loop

Core question:
Which stack can be trusted without spending time on a proof of concept?

  1. Coding Agent
    Claude Code (CLI) hit a 95 % compile-success rate on Flutter snippets in earlier toy tests: no GUI, no config hell, just “yes” to every suggestion.

  2. UI Framework
    Flutter: one code-base → Android APK, hot-reload < 1 s, default Material3 theme already “pretty enough”.

  3. LLM Brain
    GLM-4.7:

    • Chinese & English scripting equally fluent
    • JSON instruction format stable across 50 prompt iterations
    • Year-end promo: 100 M tokens for 20 USD top-up
  4. Media API

    • Image: Gemini Pro Vision endpoint (author had ready-made doc)
    • Video: Veo 2.0 beta (author had ready-made doc)
  5. Control Pattern
    ReAct loop:
    User text → LLM thinks → JSON tool call → App executes → result fed back → LLM next action … until “action: finish”.
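A minimal Dart sketch of this loop, with hypothetical callGlm and runTool helpers standing in for the repo's actual GLM client and tool router:

import 'dart:convert';

// Hypothetical stubs: callGlm posts the chat history to GLM-4.7 and
// returns its raw JSON reply; runTool executes one tool call and returns
// an observation string (e.g. an image URL) for the next turn.
Future<String> callGlm(List<Map<String, String>> history) async =>
    '{"action": "finish"}';
Future<String> runTool(String name, Map<String, dynamic> args) async => 'ok';

Future<void> reactLoop(String userPrompt) async {
  final history = [
    {'role': 'user', 'content': userPrompt},
  ];
  while (true) {
    final reply = await callGlm(history);            // LLM thinks
    final action = jsonDecode(reply) as Map<String, dynamic>;
    if (action['action'] == 'finish') break;         // planner is done
    final observation = await runTool(               // app executes
        action['action'] as String,
        (action['args'] as Map<String, dynamic>?) ?? {});
    history
      ..add({'role': 'assistant', 'content': reply})
      ..add({'role': 'user', 'content': observation}); // result fed back
  }
}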

Author reflection:

I picked Flutter over React Native simply because Claude Code hallucinates fewer imports in Dart. That’s not engineering elegance; that’s deadline survival.


4. Data-Flow Architecture: One Picture, No Black Box

Core question:
How does a plain sentence become a watchable 30-second manhua video?

flowchart LR
    A[User types prompt] -->|1| B[GLM-4.7 planner]
    B -->|JSON tool| C[Flutter app]
    C --> D{Router}
    D -->|write_script| E[Local YAML]
    D -->|draw_char| F[Gemini image]
    D -->|draw_scene| G[Gemini image]
    D -->|gen_video| H[Veo API]
    F & G & H -->|URL| I[Feedback to GLM]
    I --> B
    B -->|action:finish| J[FFmpeg concat]
    J --> K[final.mp4]
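The {Router} node is just a dispatch on the tool name in the planner's JSON; the runTool stub from the section-3 sketch expands into something like this (handler names are illustrative, not the repo's actual functions):

// Illustrative handlers standing in for the real API wrappers.
Future<String> saveScriptYaml(Map<String, dynamic> args) async => 'script.yaml';
Future<String> callGeminiImage(Map<String, dynamic> args) async => 'img.png';
Future<String> callVeo(Map<String, dynamic> args) async => 'clip.mp4';

Future<String> runTool(String name, Map<String, dynamic> args) {
  switch (name) {
    case 'write_script':
      return saveScriptYaml(args);   // persist the 8-scene script locally
    case 'draw_char':
    case 'draw_scene':
      return callGeminiImage(args);  // both hit the Gemini image endpoint
    case 'gen_video':
      return callVeo(args);          // key image in, clip URL out
    default:
      return Future.value('unknown tool: $name'); // let the planner recover
  }
}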

Consistency tricks:

  • Same seed and negative_prompt block across all calls.
  • Character LoRA trained on three-view sheet, weight 0.8.
  • Video calls use the key image as first frame reference.
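In code, the whole trick is pinning these settings once and merging them into every request; a sketch with illustrative field names:

// Canonised on D3; every image and first-frame call reuses these.
const int kSeed = 128475;
const String kNegativePrompt = 'blurry, extra limbs';

Map<String, dynamic> withConsistency(Map<String, dynamic> request) => {
      ...request,
      'seed': kSeed,                      // identical seed across all calls
      'negative_prompt': kNegativePrompt, // identical negative block
      'lora': {
        'file': 'assets/lora/berry_rank16.safetensors',
        'weight': 0.8,                    // keeps the face, allows style play
      },
    };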

5. Day-by-Day Log: From 0 Lines to APK in the Store Folder

Core question:
What does the actual grind look like when the clock never stops?

| Day | Goal | Key Event | Hours | Outcome |
|---|---|---|---|---|
| D1 AM | Env setup | flutter doctor all green | 2 | CLI only |
| D1 PM | Prompt craft | 200-line ReAct template finished | 3 | Saved as director_ai/docs/system_prompt.md |
| D2 | Script → JSON | First successful 8-scene output | 4 | 20 k tokens burnt |
| D3 | Character lock | Three-view sheet consistent at 0.87 IoU | 6 | Seed 128475 canonised |
| D4 | Scene images | 7 images in parallel, 3 threads | 5 | 1024×1792 each |
| D5 | Clip videos | 4 s clips finally stable | 7 | 12 failures → success |
| D6 | Glue & UI | FFmpeg script + progress bar | 4 | APK 58 MB |
| D7 | Dog-food | 3 friends test on Xiaomi/Samsung/Pixel | 3 | 17 bugs → 0 |
| D8 | Ship | README + demo tweet | 2 | GitHub public |

Worst moment: 2 a.m. on D5, Veo threw “invalid aspect ratio” 7 times; the docs said 16:9 where they should have said 9:16. One word, two nights of sleep lost.


6. Walk-through: 32-Second “Strawberry Cake” Manhua

Core question:
Can a reader reproduce an entire clip right now with nothing but the repo?

Step 0 Input

User types:
“a pink-haired girl baking a strawberry cake, cute vibe”

Step 1 Script (GLM-4.7)

{
  "title": "Berry Sweet",
  "scenes": 8,
  "hook": "Her heart beats louder than the oven timer.",
  "protagonist": { "name": "Berry", "trait": "pink bob, strawberry badge" }
}

Time: 3.2 s | Tokens: 1.1 k
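For readers wiring this up themselves, a sketch of the planner call, assuming an OpenAI-style chat-completions API; the endpoint URL, system prompt, and response shape below are placeholders, not the repo's actual values:

import 'dart:convert';
import 'package:http/http.dart' as http;

// Placeholders; the real key lives in config.yaml (GLM_KEY).
const glmEndpoint = 'https://<glm-api-host>/v4/chat/completions';
const glmKey = '<GLM_KEY>';

Future<Map<String, dynamic>> planScript(String idea) async {
  final resp = await http.post(
    Uri.parse(glmEndpoint),
    headers: {
      'Authorization': 'Bearer $glmKey',
      'Content-Type': 'application/json',
    },
    body: jsonEncode({
      'model': 'glm-4.7',
      'messages': [
        {'role': 'system', 'content': 'Return an 8-scene manhua script as pure JSON.'},
        {'role': 'user', 'content': idea},
      ],
    }),
  );
  final content =
      jsonDecode(resp.body)['choices'][0]['message']['content'] as String;
  return jsonDecode(content) as Map<String, dynamic>; // the script object above
}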

Step 2 Three-View Sheet

Prompt snippet:
「pink bob, strawberry badge on apron, three-view sheet, seed128475, negative: blurry, extra limbs」
Output: 1024×1024 PNG, face IoU 0.87
[Image: three-view character sheet. Source: author generation]

Step 3 Key Images ×7

Example scene-3 prompt:
「close-up, Berry placing strawberry on cream mountain, window light, seed128475」
Generation: 6 s, 1024×1792

Step 4 Video Clips ×7

Request:

{ "image": "<key_image>", "duration": 4, "motion": "subtle head tilt, cream swirl" }

Median latency: 28 s; a 4 s @ 30 fps MP4 comes back.
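A matching sketch for the clip request; the endpoint, field names, and response shape are assumptions about a generic image-to-video API, not Veo's documented contract. Note the aspect ratio: as the D5 war story shows, vertical output needs 9:16.

import 'dart:convert';
import 'package:http/http.dart' as http;

const veoEndpoint = 'https://<veo-api-host>/v1/generate'; // placeholder
const veoKey = '<VEO_KEY>';

Future<Uri> genClip(String keyImageUrl, String motion) async {
  final resp = await http.post(
    Uri.parse(veoEndpoint),
    headers: {
      'Authorization': 'Bearer $veoKey',
      'Content-Type': 'application/json',
    },
    body: jsonEncode({
      'image': keyImageUrl,   // the key image doubles as the first frame
      'duration': 4,
      'aspect_ratio': '9:16', // not 16:9; that typo cost two nights
      'motion': motion,
    }),
  );
  return Uri.parse(jsonDecode(resp.body)['video_url'] as String); // assumed field
}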

Step 5 Concatenate

No re-encode:

ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp4
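list.txt follows FFmpeg's concat-demuxer format, one file directive per clip (filenames illustrative):

file 'clip_01.mp4'
file 'clip_02.mp4'
file 'clip_03.mp4'
file 'clip_04.mp4'
file 'clip_05.mp4'
file 'clip_06.mp4'
file 'clip_07.mp4'

-c copy skips re-encoding entirely, which only works because every clip shares the same codec, resolution, and frame rate from the same Veo settings.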

Size: 32.4 MB, 720×1280, 32 s.

Play-test screenshot:
[Image: final frame. Source: author generation]


7. Character Consistency Deep Dive: Seed, LoRA, Reference Frame

Core question:
How is face-drift kept under 10 % without manual touch-up?

  1. Global Seed

    • Three-view, key images, and video first-frame all reuse seed=128475.
    • Identical negative_prompt block removes random limb chaos.
  2. Lightweight LoRA

    • Train 20 steps, rank 16, on three-view sheet → 3.7 MB file.
    • Inference weight 0.8: keeps face, allows style flexibility.
  3. Reference-Frame Video

    • Veo accepts a first-frame image; feed it a resized 512×512 crop of the face.
    • Motion descriptors are limited to “subtle” or “slow” to avoid warping.

Numbers from 20-run ablation:

  • Face IoU ≥ 0.85: 18 / 20
  • Human rating ≥ 4 / 5: 9 / 10 clips
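For clarity, “face IoU” is the plain intersection-over-union of the detected face boxes in two images; a minimal sketch of the metric (the box representation is assumed, the face detector is out of scope):

import 'dart:math' as math;

// Axis-aligned face box: top-left corner plus width/height, in pixels.
class Box {
  final double x, y, w, h;
  const Box(this.x, this.y, this.w, this.h);
}

double iou(Box a, Box b) {
  final ix = math.max(0.0, math.min(a.x + a.w, b.x + b.w) - math.max(a.x, b.x));
  final iy = math.max(0.0, math.min(a.y + a.h, b.y + b.h) - math.max(a.y, b.y));
  final inter = ix * iy;                       // overlap area
  final union = a.w * a.h + b.w * b.h - inter; // combined area
  return union == 0 ? 0.0 : inter / union;     // 1.0 means identical boxes
}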

8. Budget Autopsy: Where Did the 20 USD Go?

Core question:
Is “cheap” a marketing line or a repeatable fact?

| Item | Unit Price | Quantity | Subtotal |
|---|---|---|---|
| GLM-4.7 (128 k) | 0.015 USD / 1 k tokens | 1,300,000 tokens | 19.5 USD |
| Gemini Pro Vision | Free tier | 60 images | 0 |
| Veo 2.0 beta | Free tier | 60 clips | 0 |
| Total | | | 19.5 USD ≈ 20 USD |

Promo detail: a 20 USD top-up during the year-end campaign bought 100 M tokens. At list price (0.06 USD / 1 k), the same 1.3 M-token run would cost 1.3 M × 0.06 / 1 k ≈ 78 USD, still under the 100 USD ceiling.


9. Repo Tour & Local Build in 5 Minutes

Core question:
How can a reader clone and see her own manhua video tonight?

git clone https://github.com/<user>/man-dao.git
cd man-dao
cp config.yaml.example config.yaml
# fill GLM_KEY, GEMINI_KEY, VEO_KEY
flutter pub get
flutter run --release
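The three keys go into config.yaml; a hypothetical shape (treat the repo's config.yaml.example as authoritative, not this sketch):

# Hypothetical layout; copy config.yaml.example and fill in real keys.
GLM_KEY: "your-glm-api-key"
GEMINI_KEY: "your-gemini-api-key"
VEO_KEY: "your-veo-api-key"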

Key folders:

  • lib/react_loop.dart – ReAct parser, 180 lines
  • scripts/seed_lock.py – enforces same seed across APIs
  • assets/lora/berry_rank16.safetensors – 3.7 MB character weights

First successful compile:
[Image: screenshot. Source: Unsplash]


10. Lessons Learnt & Road-map

Core question:
If the author started tomorrow, what would he skip, double-down on, or never do again?

Lessons

  1. Read the docs to the pixel: the aspect-ratio typo cost 14 hours.
  2. Free tiers are great until day 7; always have a second provider URL ready.
  3. Version-control the prompt: rolling back a 200-line system prompt with Ctrl-Z is not fun.

Next milestones

  • Mandarin TTS with CosyVoice + lip-sync (already tested, PR pending)
  • In-app sharing to mini-program (backendless, QR-code only)
  • Community LoRA market so users can swap protagonists in one click

Action Checklist / Implementation Steps

  1. Install Flutter 3.16 and Android Studio Hedgehog.
  2. Clone repo, fill config.yaml with API keys.
  3. Run flutter doctor → all ticks green.
  4. Execute flutter run --release on a physical phone (camera permission needed).
  5. Type a one-sentence story idea → wait 8 min → receive 30 s manhua video.
  6. Train your own LoRA: put 10 three-view images under lora/train_data/ and run scripts/lora_train.py.
  7. Commit, push, and tweet the repo—maintainer will merge useful PRs within 48 h.

One-page Overview

  • Scope: Backend-only engineer, zero mobile exp, 8 days, 20 USD.
  • Stack: Claude Code CLI → Flutter → GLM-4.7 planner → Gemini img → Veo video → FFmpeg concat.
  • Loop: ReAct pattern keeps LLM in charge, app just calls tools.
  • Consistency: Global seed + 3.7 MB LoRA + reference frame = ≤ 10 % face drift.
  • Deliverable: 58 MB APK, open-source, MIT license, live on GitHub now.

FAQ

  1. Q: Can I switch to React Native?
    A: RN branch stub exists but Claude Code generates more reliable Dart; feel free to PR.

  2. Q: What happens when free Veo quota dries up?
    A: Swap base_url in video_api.dart to Runway or Pika; the interface is identical.

  3. Q: Is 20 USD a long-term realistic cost?
    A: At list price the same run costs ~78 USD; still below 100 USD cap.

  4. Q: Commercial use allowed?
    A: MIT license, do as you wish; don’t upload copyrighted faces to LoRA trainer.

  5. Q: iOS version?
    A: The Flutter code is cross-platform; you will need an Apple Developer account (99 USD/year) and an export-compliance declaration for the video features.

  6. Q: Why English voice-over?
    A: The MVP shipped without a language setting; a Mandarin TTS PR is under review.

  7. Q: Is an eight-day crunch healthy?
    A: Averaged 5 hrs/day, no all-nighters; double the timeline if you want weekends.