Novel Video Workflow: Turn Any Novel into Ready-to-Edit CapCut Videos Using Local AI (2026 Tested Guide)
Meta Description / Featured Snippet Summary
Novel Video Workflow is an open-source macOS automation pipeline that converts full-length novels into short-form videos by intelligently splitting chapters, generating cloned-voice audio with IndexTTS2, creating AI illustrations via DrawThings, producing time-aligned subtitles with Aegisub, and exporting .json draft projects directly compatible with CapCut (Jianying / 剪映) version 3.4.1. The entire process runs locally using Ollama (qwen3:4b recommended), requires Apple Silicon, ≥16 GB RAM (32 GB preferred), and outputs production-ready assets in roughly 1–3 hours per chapter depending on hardware.
If you’re searching for a privacy-first, zero-subscription-cost method to batch-produce novel narration videos for TikTok, YouTube Shorts, Xiaohongshu, or Douyin in 2026, this open-source tool currently offers one of the most integrated local workflows available.
Why Creators Struggle with Novel-to-Video Conversion — and How This Tool Fixes It
Turning long-form novels into engaging vertical videos usually hits the same bottlenecks:
- Manually splitting 300,000-word texts into chapters takes hours
- Stock TTS voices sound robotic; good voice cloning services charge per minute
- Finding or generating 10–20 matching scene images per chapter is tedious
- Subtitle timing rarely syncs perfectly with speech
- Importing scattered audio, images, and captions into CapCut and aligning everything manually kills productivity
Novel Video Workflow automates nearly every repetitive step and — most importantly — exports a CapCut project structure (.json + media folders) that the app recognizes immediately after import. After minor tweaks (transitions, BGM, opening/ending cards), you can publish.
Realistic Hardware & Software Requirements (January 2026)
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Operating System | macOS on Apple Silicon (only tested platform) | macOS Sonoma / Sequoia | Intel Macs untested; Linux/Windows ports depend on future community work |
| RAM | 16 GB | 32 GB | Ollama + TTS + image gen run concurrently and are memory hungry |
| Storage | 100 GB free | 200+ GB free | A 30-chapter novel can generate 15–40 GB of temporary + final assets |
| GPU | Apple M1 / M2 | M3 / M4 series | Metal acceleration critical for DrawThings speed |
| Go compiler | 1.25+ | Latest stable | Required to build & run the project |
| CapCut client | 3.4.1 (download link provided) | Stick to 3.4.1 | Newer versions may break .json draft recognition |
Bottom line: An M2/M3 MacBook Pro or Mac mini with 32 GB unified memory delivers the smoothest experience in early 2026.
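To verify your machine against this table before installing anything, macOS exposes two generic `sysctl` keys that report the chip model and unified memory size. This is standard macOS tooling, not part of the project:

```bash
# Generic macOS hardware check (not project-specific).
sysctl -n machdep.cpu.brand_string                        # e.g. "Apple M2 Pro"
echo "$(( $(sysctl -n hw.memsize) / 1073741824 )) GB RAM"
```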
Four Mandatory AI Services You Must Run Before Starting
1. Ollama — chapter splitting, content analysis, prompt optimization
   - Default port 11434
   - Recommended model: qwen3:4b (strong Chinese + English understanding, low resource usage)

   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ollama serve &
   ollama pull qwen3:4b
   ```

2. DrawThings — core image generator
   - Download from the Mac App Store
   - Settings → HTTP Server → Enable + Allow LAN access
   - Confirm it listens on port 7861

3. IndexTTS2 — high-quality TTS with voice cloning
   - Port 7860
   - Follow the official IndexTTS2 repo install guide
   - Prepare 1–3 min of clean reference audio (no music, minimal noise)

4. FFmpeg — audio slicing, format conversion

   ```bash
   brew install ffmpeg
   ```
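Before launching the pipeline, it helps to confirm that the three network services are actually reachable on their default ports. The following is a minimal pre-flight sketch: Ollama's /api/tags endpoint is part of its standard API, while for DrawThings and IndexTTS2 the check only confirms that something is listening, since their health endpoints aren't documented here.

```bash
# Pre-flight check, assuming the default ports listed above.
curl -sf http://localhost:11434/api/tags >/dev/null && echo "Ollama OK"
nc -z localhost 7861 && echo "DrawThings port open"   # checks the listener only
nc -z localhost 7860 && echo "IndexTTS2 port open"
ffmpeg -version >/dev/null 2>&1 && echo "FFmpeg OK"
```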
Easiest Way to Run the Full Pipeline
Most users start with one command:
go run main.go
This launches the backend MCP service and the web console together.
Then open your browser:
http://localhost:8080
You’ll see a clean, single-page interface.
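If you relaunch the pipeline often, you can compile a binary once instead of recompiling via go run each time. This is standard Go tooling, and the binary name below is arbitrary:

```bash
# Build once with standard Go tooling, then reuse the binary.
go build -o novel-video-workflow .
./novel-video-workflow
```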
Step-by-Step Web Console Workflow (With Folder Structure)
1. Prepare your novel text

   ```bash
   mkdir -p input/SpookyInn
   cp SpookyInn.txt input/SpookyInn/SpookyInn.txt
   ```

2. (Highly recommended) Add a reference voice; a cleanup command for raw recordings is sketched after this list

   ```bash
   mkdir -p assets/ref_audio
   cp your-voice-reference.m4a assets/ref_audio/ref.m4a
   ```

3. In the browser at localhost:8080:
   - Select the "SpookyInn" folder
   - Check the desired steps (most people select everything):
     - Chapter splitting
     - Audio synthesis (IndexTTS2)
     - Image generation (DrawThings)
     - Subtitle generation
     - CapCut project export

4. Click "Process uploaded folder"

5. Processing time varies:
   - TTS: 3–12 min per chapter (the longest step)
   - Images: 5–60 sec per image (depends on steps & model)
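If your reference recording is a raw phone capture, a quick FFmpeg pass can downmix it to mono and normalize loudness before you copy it into assets/ref_audio. This is a minimal sketch with placeholder filenames; loudnorm is FFmpeg's built-in EBU R128 normalizer.

```bash
# Clean up a raw recording for voice cloning (placeholder filenames).
ffmpeg -i raw-recording.m4a -ac 1 -ar 44100 -af loudnorm \
       assets/ref_audio/ref.wav
```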
What the Output Actually Looks Like
After completion, check:
```
output/SpookyInn/chapter_10/
├── chapter_10.wav    # full chapter audio with cloned voice
├── chapter_10.srt    # Aegisub-generated timed subtitles
├── chapter_10.json   # CapCut draft project file (the magic part)
└── images/
    ├── scene_01.png
    ├── scene_02.png
    ├── …             # usually 8–15 images per chapter
    └── scene_10.png
```
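Before importing, a quick sanity check is to compare the audio duration against the last subtitle cue: ffprobe (bundled with FFmpeg) prints the duration, and the end timestamp of the final .srt entry should land close to it. Both commands are generic, not part of the tool.

```bash
# Audio duration in seconds (generic ffprobe usage).
ffprobe -v error -show_entries format=duration -of csv=p=0 \
        output/SpookyInn/chapter_10/chapter_10.wav
# Final subtitle cue; its end time should roughly match the duration above.
tail -n 4 output/SpookyInn/chapter_10/chapter_10.srt
```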
Importing into CapCut (3.4.1)
1. Launch CapCut
2. Go to Drafts → Import draft from folder
3. Select the entire chapter_10/ folder
4. CapCut auto-loads the timeline: audio track, image clips in order, subtitles already timed
You now only need to:
- Adjust pacing / zoom / transitions if desired
- Apply consistent LUT / color grade
- Add opening hook card + ending CTA
- Export 1080×1920 vertical video
Key MCP Tools Exposed (For Power Users & Ollama Desktop Integration)
| Tool Name | Purpose | Needs Ref Audio? | Primary Output |
|---|---|---|---|
| generate_indextts2_audio | Single audio segment via IndexTTS2 | Yes (recommended) | .wav |
| generate_subtitles_from_indextts2 | Create .srt from existing audio | — | .srt |
| file_split_novel_into_chapters | AI-powered chapter detection & splitting | — | chapter_xx folders |
| generate_image_from_text | Text-to-image | — | .png |
| generate_image_from_image | Style transfer / img2img | — | .png |
| generate_images_from_chapter | Auto-split chapter → multiple images | — | Multiple .png |
| generate_images_from_chapter_with_ai_prompt | Ollama writes refined prompts → higher coherence | — | Multiple .png |
The last tool usually produces the most visually consistent illustrations because Ollama first summarizes scene context into optimized prompts.
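If you want to script these tools directly, MCP tools are invoked with a JSON-RPC 2.0 tools/call request. The sketch below shows only the request shape; the transport this project expects (stdio vs. HTTP) and the argument names such as text and ref_audio are assumptions, so check USER_GUIDE.md for the real schema.

```bash
# Shape of an MCP tools/call request (JSON-RPC 2.0). The argument names
# "text" and "ref_audio" are hypothetical; see USER_GUIDE.md for the real schema.
cat <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "generate_indextts2_audio",
    "arguments": { "text": "Chapter one opening lines...", "ref_audio": "assets/ref_audio/ref.m4a" }
  }
}
EOF
```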
Quick Troubleshooting FAQ
Q: DrawThings service unreachable?
A: Double-check HTTP server is enabled in DrawThings preferences and port is 7861. macOS firewall may block it.
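To distinguish a blocked port from a service that never started, two generic macOS commands help: lsof shows whether anything is listening, and curl shows whether it answers HTTP at all.

```bash
# Is anything listening on the DrawThings port? (generic macOS diagnostics)
lsof -nP -iTCP:7861 -sTCP:LISTEN
# Does it answer HTTP? Prints only the status code.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:7861/
```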
Q: Voice cloning sounds off?
A: Use high-SNR, monologue-style reference audio (1–3 min, no reverb/music). Volume-normalized WAV/M4A works best.
Q: Too many / too few images?
A: Edit config.yaml image count settings (check file comments for exact keys).
Q: Only want audio + subtitles?
A: Uncheck image-related boxes in the web UI.
Q: CapCut imports black frames / wrong order?
A: Revert to exactly CapCut 3.4.1 — later builds sometimes change draft folder parsing.
Q: Can it run on Windows or Linux?
A: Official docs state macOS + Apple Silicon only (Jan 2026). Community ports may appear later.
Who Benefits Most from This Workflow?
- Indie creators who want full ownership of voice, style, and data
- Chinese/English novel translators or adapters producing daily short-form content
- Anyone with an M-series Mac who prefers local processing over cloud API costs & rate limits
- Experimenters who enjoy tweaking TTS timbre + illustration style templates
Project homepage: https://github.com/hulutech-web/novel-video-workflow
Deeper reading: SYSTEM_ARCHITECTURE.md and USER_GUIDE.md in the repo.
This pipeline isn’t perfect yet — generation speed and image consistency still depend heavily on your hardware — but in early 2026 it remains one of the few fully local, end-to-end novel-to-CapCut solutions that actually works in production-like settings.
