Novel Video Workflow: Turn Any Novel into Ready-to-Edit CapCut Videos Using Local AI (2026 Tested Guide)

Novel Video Workflow is an open-source macOS automation pipeline that converts full-length novels into short-form videos. It intelligently splits chapters, generates cloned-voice audio with IndexTTS2, creates AI illustrations via DrawThings, produces time-aligned subtitles with Aegisub, and exports .json draft projects directly compatible with CapCut (Jianying / 剪映) version 3.4.1. The entire process runs locally using Ollama (qwen3:4b recommended), requires an Apple Silicon Mac with at least 16 GB of RAM (32 GB preferred), and outputs production-ready assets in roughly 1–3 hours per chapter depending on hardware.

If you’re searching for a privacy-first, zero-subscription-cost method to batch-produce novel narration videos for TikTok, YouTube Shorts, Xiaohongshu, or Douyin in 2026, this open-source tool currently offers one of the most integrated local workflows available.

Why Creators Struggle with Novel-to-Video Conversion — and How This Tool Fixes It

Turning long-form novels into engaging vertical videos usually hits the same bottlenecks:

  • Manually splitting 300,000-word texts into chapters takes hours
  • Stock TTS voices sound robotic; good voice cloning services charge per minute
  • Finding or generating 10–20 matching scene images per chapter is tedious
  • Subtitle timing rarely syncs perfectly with speech
  • Importing scattered audio, images, and captions into CapCut and aligning everything manually kills productivity

Novel Video Workflow automates nearly every repetitive step and — most importantly — exports a CapCut project structure (.json + media folders) that the app recognizes immediately after import. After minor tweaks (transitions, BGM, opening/ending cards), you can publish.

Realistic Hardware & Software Requirements (January 2026)

| Component | Minimum | Recommended | Notes |
| --- | --- | --- | --- |
| Operating System | macOS (Apple Silicon only tested) | macOS Sonoma / Sequoia | Intel Macs untested; Linux/Windows ports depend on future community work |
| RAM | 16 GB | 32 GB | Ollama + TTS + image generation run concurrently and are memory hungry |
| Storage | 100 GB free | 200+ GB free | A 30-chapter novel can generate 15–40 GB of temporary + final assets |
| GPU | Apple M1 / M2 | M3 / M4 series | Metal acceleration is critical for DrawThings speed |
| Go compiler | 1.25+ | Latest stable | Required to build & run the project |
| CapCut client | 3.4.1 (download link provided) | Stick to 3.4.1 | Newer versions may break .json draft recognition |

Bottom line: An M2/M3 MacBook Pro or Mac mini with 32 GB unified memory delivers the smoothest experience in early 2026.

Four Mandatory AI Services You Must Run Before Starting

  1. Ollama — chapter splitting, content analysis, prompt optimization
    Default port 11434
    Recommended model: qwen3:4b (strong Chinese + English understanding, low resource usage)

    # The official install.sh targets Linux; on macOS, install via Homebrew or the Ollama app
    brew install ollama
    ollama serve &
    ollama pull qwen3:4b
    
  2. DrawThings — core image generator

    • Download from Mac App Store
    • Settings → HTTP Server → Enable + Allow LAN access
    • Confirm it listens on port 7861
  3. IndexTTS2 — high-quality TTS with voice cloning

    • Port 7860
    • Follow the official IndexTTS2 repo install guide
    • Prepare 1–3 min clean reference audio (no music, minimal noise)
  4. FFmpeg — audio slicing, format conversion

    brew install ffmpeg
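
Before starting the pipeline, it's worth confirming that all four services actually respond. A minimal sanity check, assuming the default ports above (the Ollama /api/tags route is a documented API endpoint; the DrawThings and IndexTTS2 lines only test that something answers at the root URL):

    # Ollama: list installed models via its documented REST API
    curl -s http://localhost:11434/api/tags >/dev/null && echo "Ollama OK"
    # DrawThings: root-URL reachability only (enable the HTTP server first)
    curl -s -o /dev/null http://localhost:7861 && echo "DrawThings OK"
    # IndexTTS2: root-URL reachability only
    curl -s -o /dev/null http://localhost:7860 && echo "IndexTTS2 OK"
    # FFmpeg: confirm the binary is on PATH
    ffmpeg -version | head -n 1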
    

Easiest Way to Run the Full Pipeline

Most users start with one command:

go run main.go

This launches the backend MCP service and the web console together.
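
If you prefer a compiled binary for repeated batch runs, standard Go tooling works as well (the output name below is arbitrary):

    # Build once, then reuse the binary instead of recompiling on every run
    go build -o novel-video-workflow .
    ./novel-video-workflow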

Then open your browser:

http://localhost:8080

You’ll see a clean, single-page interface.

Step-by-Step Web Console Workflow (With Folder Structure)

  1. Prepare your novel text

    mkdir -p input/SpookyInn
    cp SpookyInn.txt input/SpookyInn/SpookyInn.txt
    
  2. (Highly recommended) Add reference voice

    mkdir -p assets/ref_audio
    cp your-voice-reference.m4a assets/ref_audio/ref.m4a
    
  3. In the browser at localhost:8080

    • Select the “SpookyInn” folder
    • Check desired steps (most people select everything)
      • Chapter splitting
      • Audio synthesis (IndexTTS2)
      • Image generation (DrawThings)
      • Subtitle generation
      • CapCut project export
    • Click “Process uploaded folder”

Processing time varies:

  • TTS: 3–12 min per chapter (longest step)
  • Images: 5–60 sec per image (depends on sampling steps & model)
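
Because a full novel can keep the machine busy for hours, consider preventing idle sleep for the duration of a batch with macOS's built-in caffeinate utility:

    # -i blocks idle sleep for as long as the wrapped command keeps running
    caffeinate -i go run main.go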

What the Output Actually Looks Like

After completion, check:

output/SpookyInn/chapter_10/
├── chapter_10.wav          # full chapter audio with cloned voice
├── chapter_10.srt          # Aegisub-generated timed subtitles
├── chapter_10.json         # CapCut draft project file (the magic part)
└── images/
    ├── scene_01.png
    ├── scene_02.png
    …                       # usually 8–15 images per chapter
    └── scene_10.png
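
Before importing, a quick sanity check catches most broken chapters: the audio duration should roughly match the end time of the last subtitle cue, and the image count should land in the expected 8–15 range. Reusing the example paths above:

    # Audio duration in seconds (ffprobe ships with the ffmpeg install)
    ffprobe -v error -show_entries format=duration -of csv=p=0 \
      output/SpookyInn/chapter_10/chapter_10.wav
    # Count generated images
    ls output/SpookyInn/chapter_10/images/*.png | wc -l
    # The final cue's end timestamp should sit near the audio duration
    tail -n 4 output/SpookyInn/chapter_10/chapter_10.srt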

Importing into CapCut (3.4.1)

  1. Launch CapCut
  2. Go to Drafts → Import draft from folder
  3. Select the entire chapter_10/ folder
  4. CapCut auto-loads the timeline: audio track, image clips in order, and subtitles already timed

You now only need to:

  • Adjust pacing / zoom / transitions if desired
  • Apply consistent LUT / color grade
  • Add opening hook card + ending CTA
  • Export 1080×1920 vertical video

Key MCP Tools Exposed (For Power Users & Ollama Desktop Integration)

| Tool Name | Purpose | Needs Ref Audio? | Primary Output |
| --- | --- | --- | --- |
| generate_indextts2_audio | Single audio segment via IndexTTS2 | Yes (recommended) | .wav |
| generate_subtitles_from_indextts2 | Create .srt from existing audio | N/A | .srt |
| file_split_novel_into_chapters | AI-powered chapter detection & splitting | N/A | chapter_xx folders |
| generate_image_from_text | Text-to-image | N/A | .png |
| generate_image_from_image | Style transfer / img2img | N/A | .png |
| generate_images_from_chapter | Auto-split chapter → multiple images | N/A | Multiple .png |
| generate_images_from_chapter_with_ai_prompt | Ollama writes refined prompts → higher coherence | N/A | Multiple .png |

The last tool usually produces the most visually consistent illustrations because Ollama first summarizes scene context into optimized prompts.
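
Since these are exposed as standard MCP tools, any MCP-capable client can invoke them with the protocol's tools/call message. The sketch below is illustrative only: the endpoint path and argument names are placeholders, not taken from the repo (SYSTEM_ARCHITECTURE.md documents the actual transport and schemas); only the JSON-RPC tools/call shape is standard MCP:

    # Placeholder URL and arguments; adapt to the transport the project actually exposes
    curl -s http://localhost:8080/mcp -H 'Content-Type: application/json' -d '{
      "jsonrpc": "2.0", "id": 1, "method": "tools/call",
      "params": {
        "name": "generate_image_from_text",
        "arguments": { "prompt": "moonlit inn courtyard, horror atmosphere" }
      }
    }'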

Quick Troubleshooting FAQ

Q: DrawThings service unreachable?
A: Double-check that the HTTP server is enabled in DrawThings preferences and that it listens on port 7861. The macOS firewall may also block it.
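
If the check fails, confirm that anything is listening on the port at all:

    # Shows the process bound to TCP 7861, if any
    lsof -nP -iTCP:7861 -sTCP:LISTEN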

Q: Voice cloning sounds off?
A: Use high-SNR, monologue-style reference audio (1–3 min, no reverb/music). Volume-normalized WAV/M4A works best.
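
FFmpeg can handle most of that preparation in one pass. The loudnorm targets below are common broadcast-style defaults rather than values mandated by the project (adjust the output name if your setup expects the ref.m4a file from the earlier step):

    # Mono, 44.1 kHz WAV, loudness-normalized to roughly -16 LUFS
    ffmpeg -i raw_voice.m4a -ac 1 -ar 44100 \
      -af loudnorm=I=-16:TP=-1.5:LRA=11 assets/ref_audio/ref.wav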

Q: Too many / too few images?
A: Edit config.yaml image count settings (check file comments for exact keys).

Q: Only want audio + subtitles?
A: Uncheck image-related boxes in the web UI.

Q: CapCut imports black frames / wrong order?
A: Revert to exactly CapCut 3.4.1 — later builds sometimes change draft folder parsing.

Q: Can it run on Windows or Linux?
A: Official docs state macOS + Apple Silicon only (Jan 2026). Community ports may appear later.

Who Benefits Most from This Workflow?

  • Indie creators who want full ownership of voice, style, and data
  • Chinese/English novel translators or adapters producing daily short-form content
  • Anyone with an M-series Mac who prefers local processing over cloud API costs & rate limits
  • Experimenters who enjoy tweaking TTS timbre + illustration style templates

Project homepage: https://github.com/hulutech-web/novel-video-workflow
Deeper reading: SYSTEM_ARCHITECTURE.md and USER_GUIDE.md in the repo.

This pipeline isn’t perfect yet — generation speed and image consistency still depend heavily on your hardware — but in early 2026 it remains one of the few fully local, end-to-end novel-to-CapCut solutions that actually works in production-like settings.
