8 Days, 20 USD, One CLI: Building an Open-Source AI Manhua-Video App with Claude Code & GLM-4.7

Core question answered in one line:
A backend-only engineer with zero mobile experience can ship an end-to-end “prompt-to-manhua-video” Android app in eight calendar days, for only twenty dollars, by letting a CLI coding agent write the Flutter code while a cheap but powerful LLM plans every creative step.


1. Why Another AI-Video Tool? The Mobile Gap

Core question this section answers:
If web-based manhua-video makers already exist, why bother building a mobile-native one?

  • Every existing product the author tried was desktop-web only, asking users to upload reference images, drag storyboards, and wait for cloud queues—none of which feels natural on a bus ride.
  • During a Wuhan AI meet-up, thirty attendees repeated the same pain: “I want to create on my phone while commuting.”
  • The author’s earlier two blog posts on AI manhua drew tens of thousands of reads and dozens of private “when will there be an app?” messages.
  • Personal motivation: the author’s wife joked “you must be crazy” when hearing the idea—proving her wrong became an eight-day sprint target.

Author reflection:

I realised the competition wasn’t other apps; it was the empty time people spend scrolling. If creation is easier than consumption, they’ll create.


2. Success Metric: What Does “Done” Look Like on Day 8?

Core question:
Under money, time, and skill ceilings, what exact artefact counts as victory?

| Constraint | Hard Limit | Stretch Goal |
|---|---|---|
| Time | 8 calendar days (incl. 3-day New-Year break) | Demo ready by day 5 |
| Budget | ≤ 100 USD | Actual spend: 20 USD |
| Skill | Author has never shipped Android before | Use Claude Code CLI to generate Flutter |
| Output | Installable APK < 60 MB | Open-source repo with ≥ 50 stars |
| Functionality | One-sentence input → 30 s vertical video | Character face consistency ≥ 80 % |

Checklist used every evening:

  • [x] 1-sentence prompt accepted
  • [x] 8-scene script auto-written
  • [x] 1 protagonist, 1 side character, 7 scenes configurable
  • [x] Three-view character sheet generated once and reused
  • [x] Each scene: 1 key image + 4 s video clip
  • [x] FFmpeg concatenates clips into final MP4
  • [x] MIT license repo public on GitHub

3. Tech Choices in 10 Minutes: Flutter + GLM-4.7 + ReAct Loop

Core question:
Which stack can be trusted without spending time on a proof of concept?

  1. Coding Agent
    Claude Code (CLI) hit a 95 % compile-success rate on Flutter snippets in earlier toy tests: no GUI, no config hell, just “yes” to every suggestion.

  2. UI Framework
    Flutter: one code-base → Android APK, hot-reload < 1 s, default Material3 theme already “pretty enough”.

  3. LLM Brain
    GLM-4.7:

    • Chinese & English scripting equally fluent
    • JSON instruction format stable across 50 prompt iterations
    • Year-end promo: 100 M tokens for 20 USD top-up
  4. Media API

    • Image: Gemini Pro Vision endpoint (author had ready-made doc)
    • Video: Veo 2.0 beta (author had ready-made doc)
  5. Control Pattern
    ReAct loop:
    User text → LLM thinks → JSON tool call → App executes → result fed back → LLM next action … until “action: finish”.
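A minimal Dart sketch of this loop, with hypothetical callGlm and runTool helpers standing in for the repo's actual GLM client and tool router:

import 'dart:convert';

// Hypothetical stubs: callGlm posts the chat history to GLM-4.7 and
// returns its raw JSON reply; runTool executes one tool call and returns
// an observation string (e.g. an image URL) for the next turn.
Future<String> callGlm(List<Map<String, String>> history) async =>
    '{"action": "finish"}';
Future<String> runTool(String name, Map<String, dynamic> args) async => 'ok';

Future<void> reactLoop(String userPrompt) async {
  final history = [
    {'role': 'user', 'content': userPrompt},
  ];
  while (true) {
    final reply = await callGlm(history);            // LLM thinks
    final action = jsonDecode(reply) as Map<String, dynamic>;
    if (action['action'] == 'finish') break;         // planner is done
    final observation = await runTool(               // app executes
        action['action'] as String,
        (action['args'] as Map<String, dynamic>?) ?? {});
    history
      ..add({'role': 'assistant', 'content': reply})
      ..add({'role': 'user', 'content': observation}); // result fed back
  }
}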

Author reflection:

I picked Flutter over React Native simply because Claude Code hallucinates fewer imports in Dart. That’s not engineering elegance; that’s deadline survival.


4. Data-Flow Architecture: One Picture, No Black Box

Core question:
How does a plain sentence become a watchable 30-second manhua video?

flowchart LR
    A[User types prompt] -->|1| B[GLM-4.7 planner]
    B -->|JSON tool| C[Flutter app]
    C --> D{Router}
    D -->|write_script| E[Local YAML]
    D -->|draw_char| F[Gemini image]
    D -->|draw_scene| G[Gemini image]
    D -->|gen_video| H[Veo API]
    F & G & H -->|URL| I[Feedback to GLM]
    I --> B
    B -->|action:finish| J[FFmpeg concat]
    J --> K[final.mp4]
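The {Router} node is just a dispatch on the tool name in the planner's JSON; the runTool stub from the section-3 sketch expands into something like this (handler names are illustrative, not the repo's actual functions):

// Illustrative handlers standing in for the real API wrappers.
Future<String> saveScriptYaml(Map<String, dynamic> args) async => 'script.yaml';
Future<String> callGeminiImage(Map<String, dynamic> args) async => 'img.png';
Future<String> callVeo(Map<String, dynamic> args) async => 'clip.mp4';

Future<String> runTool(String name, Map<String, dynamic> args) {
  switch (name) {
    case 'write_script':
      return saveScriptYaml(args);   // persist the 8-scene script locally
    case 'draw_char':
    case 'draw_scene':
      return callGeminiImage(args);  // both hit the Gemini image endpoint
    case 'gen_video':
      return callVeo(args);          // key image in, clip URL out
    default:
      return Future.value('unknown tool: $name'); // let the planner recover
  }
}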

Consistency tricks:

  • Same seed and negative_prompt block across all calls.
  • Character LoRA trained on three-view sheet, weight 0.8.
  • Video calls use the key image as first frame reference.
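In code, the whole trick is pinning these settings once and merging them into every request; a sketch with illustrative field names:

// Canonised on D3; every image and first-frame call reuses these.
const int kSeed = 128475;
const String kNegativePrompt = 'blurry, extra limbs';

Map<String, dynamic> withConsistency(Map<String, dynamic> request) => {
      ...request,
      'seed': kSeed,                      // identical seed across all calls
      'negative_prompt': kNegativePrompt, // identical negative block
      'lora': {
        'file': 'assets/lora/berry_rank16.safetensors',
        'weight': 0.8,                    // keeps the face, allows style play
      },
    };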

5. Day-by-Day Log: From 0 Lines to APK in the Store Folder

Core question:
What does the actual grind look like when the clock never stops?

| Day | Goal | Key Event | Hours | Outcome |
|---|---|---|---|---|
| D1 AM | Env setup | flutter doctor all green | 2 | CLI only |
| D1 PM | Prompt craft | 200-line ReAct template finished | 3 | Saved as director_ai/docs/system_prompt.md |
| D2 | Script → JSON | First successful 8-scene output | 4 | 20 k tokens burnt |
| D3 | Character lock | Three-view sheet consistent at 0.87 IoU | 6 | Seed 128475 canonised |
| D4 | Scene images | 7 images in parallel, 3 threads | 5 | 1024×1792 each |
| D5 | Clip videos | 4 s clips finally stable | 7 | 12 failures → success |
| D6 | Glue & UI | FFmpeg script + progress bar | 4 | APK 58 MB |
| D7 | Dog-food | 3 friends test on Xiaomi/Samsung/Pixel | 3 | 17 bugs → 0 |
| D8 | Ship | README + demo tweet | 2 | GitHub public |

Worst moment: 2 a.m. on D5, Veo threw “invalid aspect ratio” 7 times; the docs said 16:9 where they should have said 9:16. One word, two nights of sleep lost.


6. Walk-through: 32-Second “Strawberry Cake” Manhua

Core question:
Can a reader reproduce an entire clip right now with nothing but the repo?

Step 0 Input

User types:
“a pink-haired girl baking a strawberry cake, cute vibe”

Step 1 Script (GLM-4.7)

{
  "title": "Berry Sweet",
  "scenes": 8,
  "hook": "Her heart beats louder than the oven timer.",
  "protagonist": { "name": "Berry", "trait": "pink bob, strawberry badge" }
}

Time: 3.2 s | Tokens: 1.1 k
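For readers wiring this up themselves, a sketch of the planner call, assuming an OpenAI-style chat-completions API; the endpoint URL, system prompt, and response shape below are placeholders, not the repo's actual values:

import 'dart:convert';
import 'package:http/http.dart' as http;

// Placeholders; the real key lives in config.yaml (GLM_KEY).
const glmEndpoint = 'https://<glm-api-host>/v4/chat/completions';
const glmKey = '<GLM_KEY>';

Future<Map<String, dynamic>> planScript(String idea) async {
  final resp = await http.post(
    Uri.parse(glmEndpoint),
    headers: {
      'Authorization': 'Bearer $glmKey',
      'Content-Type': 'application/json',
    },
    body: jsonEncode({
      'model': 'glm-4.7',
      'messages': [
        {'role': 'system', 'content': 'Return an 8-scene manhua script as pure JSON.'},
        {'role': 'user', 'content': idea},
      ],
    }),
  );
  final content =
      jsonDecode(resp.body)['choices'][0]['message']['content'] as String;
  return jsonDecode(content) as Map<String, dynamic>; // the script object above
}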

Step 2 Three-View Sheet

Prompt snippet:
「pink bob, strawberry badge on apron, three-view sheet, seed128475, negative: blurry, extra limbs」
Output: 1024×1024 PNG, face IoU 0.87
[Image: three-view character sheet. Source: author generation]

Step 3 Key Images ×7

Example scene-3 prompt:
「close-up, Berry placing strawberry on cream mountain, window light, seed128475」
Generation: 6 s, 1024×1792

Step 4 Video Clips ×7

Request:

{ "image": "<key_image>", "duration": 4, "motion": "subtle head tilt, cream swirl" }

Median latency: 28 s; a 4 s @ 30 fps MP4 comes back.
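A matching sketch for the clip request; the endpoint, field names, and response shape are assumptions about a generic image-to-video API, not Veo's documented contract. Note the aspect ratio: as the D5 war story shows, vertical output needs 9:16.

import 'dart:convert';
import 'package:http/http.dart' as http;

const veoEndpoint = 'https://<veo-api-host>/v1/generate'; // placeholder
const veoKey = '<VEO_KEY>';

Future<Uri> genClip(String keyImageUrl, String motion) async {
  final resp = await http.post(
    Uri.parse(veoEndpoint),
    headers: {
      'Authorization': 'Bearer $veoKey',
      'Content-Type': 'application/json',
    },
    body: jsonEncode({
      'image': keyImageUrl,   // the key image doubles as the first frame
      'duration': 4,
      'aspect_ratio': '9:16', // not 16:9; that typo cost two nights
      'motion': motion,
    }),
  );
  return Uri.parse(jsonDecode(resp.body)['video_url'] as String); // assumed field
}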

Step 5 Concatenate

No re-encode:

ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp4
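list.txt follows FFmpeg's concat-demuxer format, one file directive per clip (filenames illustrative):

file 'clip_01.mp4'
file 'clip_02.mp4'
file 'clip_03.mp4'
file 'clip_04.mp4'
file 'clip_05.mp4'
file 'clip_06.mp4'
file 'clip_07.mp4'

-c copy skips re-encoding entirely, which only works because every clip shares the same codec, resolution, and frame rate from the same Veo settings.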

Size: 32.4 MB, 720×1280, 32 s.

Play-test screenshot:
[Image: final frame. Source: author generation]


7. Character Consistency Deep Dive: Seed, LoRA, Reference Frame

Core question:
How is face-drift kept under 10 % without manual touch-up?

  1. Global Seed

    • Three-view, key images, and video first-frame all reuse seed=128475.
    • Identical negative_prompt block removes random limb chaos.
  2. Lightweight LoRA

    • Train 20 steps, rank 16, on three-view sheet → 3.7 MB file.
    • Inference weight 0.8: keeps face, allows style flexibility.
  3. Reference-Frame Video

    • Veo accepts a first-frame image; feed it a resized 512×512 crop of the face.
    • Motion descriptors are limited to “subtle” or “slow” to avoid warping.

Numbers from 20-run ablation:

  • Face IoU ≥ 0.85: 18 / 20
  • Human rating ≥ 4 / 5: 9 / 10 clips
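For clarity, “face IoU” is the plain intersection-over-union of the detected face boxes in two images; a minimal sketch of the metric (the box representation is assumed, the face detector is out of scope):

import 'dart:math' as math;

// Axis-aligned face box: top-left corner plus width/height, in pixels.
class Box {
  final double x, y, w, h;
  const Box(this.x, this.y, this.w, this.h);
}

double iou(Box a, Box b) {
  final ix = math.max(0.0, math.min(a.x + a.w, b.x + b.w) - math.max(a.x, b.x));
  final iy = math.max(0.0, math.min(a.y + a.h, b.y + b.h) - math.max(a.y, b.y));
  final inter = ix * iy;                       // overlap area
  final union = a.w * a.h + b.w * b.h - inter; // combined area
  return union == 0 ? 0.0 : inter / union;     // 1.0 means identical boxes
}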

8. Budget Autopsy: Where Did the 20 USD Go?

Core question:
Is “cheap” a marketing line or a repeatable fact?

| Item | Unit Price | Quantity | Subtotal |
|---|---|---|---|
| GLM-4.7 (128 k) | 0.015 USD / 1 k tokens | 1,300,000 tokens | 19.5 USD |
| Gemini Pro Vision | Free tier | 60 images | 0 |
| Veo 2.0 beta | Free tier | 60 clips | 0 |
| Total | | | 19.5 USD ≈ 20 USD |

Promo detail: a 20 USD top-up during the year-end campaign bought 100 M tokens. At list price (0.06 USD / 1 k), the same 1.3 M-token run would cost 1.3 M × 0.06 / 1 k ≈ 78 USD, still under the 100 USD ceiling.


9. Repo Tour & Local Build in 5 Minutes

Core question:
How can a reader clone and see her own manhua video tonight?

git clone https://github.com/<user>/man-dao.git
cd man-dao
cp config.yaml.example config.yaml
# fill GLM_KEY, GEMINI_KEY, VEO_KEY
flutter pub get
flutter run --release
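The three keys go into config.yaml; a hypothetical shape (treat the repo's config.yaml.example as authoritative, not this sketch):

# Hypothetical layout; copy config.yaml.example and fill in real keys.
GLM_KEY: "your-glm-api-key"
GEMINI_KEY: "your-gemini-api-key"
VEO_KEY: "your-veo-api-key"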

Key folders:

  • lib/react_loop.dart – ReAct parser, 180 lines
  • scripts/seed_lock.py – enforces same seed across APIs
  • assets/lora/berry_rank16.safetensors – 3.7 MB character weights

First successful compile:
[Image: screenshot. Source: Unsplash]


10. Lessons Learnt & Road-map

Core question:
If the author started tomorrow, what would he skip, double-down on, or never do again?

Lessons

  1. Read the docs to the pixel: the aspect-ratio typo cost 14 hours.
  2. Free tiers are great until day 7; always have a second provider URL ready.
  3. Version-control the prompt: rolling back a 200-line system prompt with Ctrl-Z is not fun.

Next milestones

  • Mandarin TTS with CosyVoice + lip-sync (already tested, PR pending)
  • In-app sharing to mini-program (backendless, QR-code only)
  • Community LoRA market so users can swap protagonists in one click

Action Checklist / Implementation Steps

  1. Install Flutter 3.16 and Android Studio Hedgehog.
  2. Clone repo, fill config.yaml with API keys.
  3. Run flutter doctor → all ticks green.
  4. Execute flutter run --release on a physical phone (camera permission needed).
  5. Type a one-sentence story idea → wait 8 min → receive 30 s manhua video.
  6. Train your own LoRA: put 10 three-view images under lora/train_data/ and run scripts/lora_train.py.
  7. Commit, push, and tweet the repo—maintainer will merge useful PRs within 48 h.

One-page Overview

  • Scope: Backend-only engineer, zero mobile exp, 8 days, 20 USD.
  • Stack: Claude Code CLI → Flutter → GLM-4.7 planner → Gemini img → Veo video → FFmpeg concat.
  • Loop: ReAct pattern keeps LLM in charge, app just calls tools.
  • Consistency: Global seed + 3.7 MB LoRA + reference frame = ≤ 10 % face drift.
  • Deliverable: 58 MB APK, open-source, MIT license, live on GitHub now.

FAQ

  1. Q: Can I switch to React Native?
    A: RN branch stub exists but Claude Code generates more reliable Dart; feel free to PR.

  2. Q: What happens when free Veo quota dries up?
    A: Swap base_url in video_api.dart to Runway or Pika; the interface is identical.

  3. Q: Is 20 USD a long-term realistic cost?
    A: At list price the same run costs ~78 USD; still below 100 USD cap.

  4. Q: Commercial use allowed?
    A: MIT license, do as you wish; don’t upload copyrighted faces to LoRA trainer.

  5. Q: iOS version?
    A: The Flutter code is cross-platform; you will need an Apple Developer account (99 USD/year) and an export-compliance declaration for the video features.

  6. Q: Why English voice-over?
    A: The MVP shipped without a language setting; a Mandarin TTS PR is under review.

  7. Q: Is an eight-day crunch healthy?
    A: Averaged 5 hrs/day, no all-nighters; double the timeline if you want weekends.