Novel Video Workflow: Turn Any Novel into Ready-to-Edit CapCut Videos Using Local AI (2026 Tested Guide)
Meta Description / Featured Snippet Summary
Novel Video Workflow is an open-source macOS automation pipeline that converts full-length novels into short-form videos by intelligently splitting chapters, generating cloned-voice audio with IndexTTS2, creating AI illustrations via DrawThings, producing time-aligned subtitles with Aegisub, and exporting .json draft projects directly compatible with CapCut (Jianying / 剪映) version 3.4.1. The entire process runs locally using Ollama (qwen3:4b recommended), requires Apple Silicon, ≥16 GB RAM (32 GB preferred), and outputs production-ready assets in roughly 1–3 hours per chapter depending on hardware.
If you’re searching for a privacy-first, zero-subscription-cost method to batch-produce novel narration videos for TikTok, YouTube Shorts, Xiaohongshu, or Douyin in 2026, this open-source tool currently offers one of the most integrated local workflows available.
Why Creators Struggle with Novel-to-Video Conversion — and How This Tool Fixes It
Turning long-form novels into engaging vertical videos usually hits the same bottlenecks:
- Manually splitting 300,000-word texts into chapters takes hours
- Stock TTS voices sound robotic; good voice cloning services charge per minute
- Finding or generating 10–20 matching scene images per chapter is tedious
- Subtitle timing rarely syncs perfectly with speech
- Importing scattered audio, images, and captions into CapCut and aligning everything manually kills productivity
Novel Video Workflow automates nearly every repetitive step and — most importantly — exports a CapCut project structure (.json + media folders) that the app recognizes immediately after import. After minor tweaks (transitions, BGM, opening/ending cards), you can publish.
Realistic Hardware & Software Requirements (January 2026)
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Operating System | macOS on Apple Silicon (only tested platform) | macOS Sonoma / Sequoia | Intel Macs untested; Linux/Windows ports depend on future community work |
| RAM | 16 GB | 32 GB | Ollama + TTS + image gen run concurrently and are memory hungry |
| Storage | 100 GB free | 200+ GB free | A 30-chapter novel can generate 15–40 GB of temporary + final assets |
| GPU | Apple M1 / M2 | M3 / M4 series | Metal acceleration critical for DrawThings speed |
| Go compiler | 1.25+ | Latest stable | Required to build & run the project |
| CapCut client | 3.4.1 (download link provided) | Stick to 3.4.1 | Newer versions may break .json draft recognition |
Bottom line: An M2/M3 MacBook Pro or Mac mini with 32 GB unified memory delivers the smoothest experience in early 2026.
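To verify your machine against this table before installing anything, macOS exposes two generic `sysctl` keys that report the chip model and unified memory size. This is standard macOS tooling, not part of the project:

```bash
# Generic macOS hardware check (not project-specific).
sysctl -n machdep.cpu.brand_string                        # e.g. "Apple M2 Pro"
echo "$(( $(sysctl -n hw.memsize) / 1073741824 )) GB RAM"
```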
Four Mandatory AI Services You Must Run Before Starting
1. Ollama — chapter splitting, content analysis, prompt optimization
   - Default port 11434
   - Recommended model: qwen3:4b (strong Chinese + English understanding, low resource usage)

   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ollama serve &
   ollama pull qwen3:4b
   ```

2. DrawThings — core image generator
   - Download from the Mac App Store
   - Settings → HTTP Server → Enable + Allow LAN access
   - Confirm it listens on port 7861

3. IndexTTS2 — high-quality TTS with voice cloning
   - Port 7860
   - Follow the official IndexTTS2 repo install guide
   - Prepare 1–3 min of clean reference audio (no music, minimal noise)

4. FFmpeg — audio slicing, format conversion

   ```bash
   brew install ffmpeg
   ```
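Before launching the pipeline, it helps to confirm that the three network services are actually reachable on their default ports. The following is a minimal pre-flight sketch: Ollama's /api/tags endpoint is part of its standard API, while for DrawThings and IndexTTS2 the check only confirms that something is listening, since their health endpoints aren't documented here.

```bash
# Pre-flight check, assuming the default ports listed above.
curl -sf http://localhost:11434/api/tags >/dev/null && echo "Ollama OK"
nc -z localhost 7861 && echo "DrawThings port open"   # checks the listener only
nc -z localhost 7860 && echo "IndexTTS2 port open"
ffmpeg -version >/dev/null 2>&1 && echo "FFmpeg OK"
```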
Easiest Way to Run the Full Pipeline
Most users start with one command:
go run main.go
This launches the backend MCP service and the web console together.
Then open your browser:
http://localhost:8080
You’ll see a clean, single-page interface.
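If you relaunch the pipeline often, you can compile a binary once instead of recompiling via go run each time. This is standard Go tooling, and the binary name below is arbitrary:

```bash
# Build once with standard Go tooling, then reuse the binary.
go build -o novel-video-workflow .
./novel-video-workflow
```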
Step-by-Step Web Console Workflow (With Folder Structure)
1. Prepare your novel text

   ```bash
   mkdir -p input/SpookyInn
   cp SpookyInn.txt input/SpookyInn/SpookyInn.txt
   ```

2. (Highly recommended) Add a reference voice; a cleanup command for raw recordings is sketched after this list

   ```bash
   mkdir -p assets/ref_audio
   cp your-voice-reference.m4a assets/ref_audio/ref.m4a
   ```

3. In the browser at localhost:8080:
   - Select the "SpookyInn" folder
   - Check the desired steps (most people select everything):
     - Chapter splitting
     - Audio synthesis (IndexTTS2)
     - Image generation (DrawThings)
     - Subtitle generation
     - CapCut project export

4. Click "Process uploaded folder"

5. Processing time varies:
   - TTS: 3–12 min per chapter (the longest step)
   - Images: 5–60 sec per image (depends on steps & model)
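If your reference recording is a raw phone capture, a quick FFmpeg pass can downmix it to mono and normalize loudness before you copy it into assets/ref_audio. This is a minimal sketch with placeholder filenames; loudnorm is FFmpeg's built-in EBU R128 normalizer.

```bash
# Clean up a raw recording for voice cloning (placeholder filenames).
ffmpeg -i raw-recording.m4a -ac 1 -ar 44100 -af loudnorm \
       assets/ref_audio/ref.wav
```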
What the Output Actually Looks Like
After completion, check:
```
output/SpookyInn/chapter_10/
├── chapter_10.wav    # full chapter audio with cloned voice
├── chapter_10.srt    # Aegisub-generated timed subtitles
├── chapter_10.json   # CapCut draft project file (the magic part)
└── images/
    ├── scene_01.png
    ├── scene_02.png
    ├── …             # usually 8–15 images per chapter
    └── scene_10.png
```
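Before importing, a quick sanity check is to compare the audio duration against the last subtitle cue: ffprobe (bundled with FFmpeg) prints the duration, and the end timestamp of the final .srt entry should land close to it. Both commands are generic, not part of the tool.

```bash
# Audio duration in seconds (generic ffprobe usage).
ffprobe -v error -show_entries format=duration -of csv=p=0 \
        output/SpookyInn/chapter_10/chapter_10.wav
# Final subtitle cue; its end time should roughly match the duration above.
tail -n 4 output/SpookyInn/chapter_10/chapter_10.srt
```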
Importing into CapCut (3.4.1)
1. Launch CapCut
2. Go to Drafts → Import draft from folder
3. Select the entire chapter_10/ folder
4. CapCut auto-loads the timeline: audio track, image clips in order, subtitles already timed
You now only need to:
- Adjust pacing / zoom / transitions if desired
- Apply consistent LUT / color grade
- Add opening hook card + ending CTA
- Export 1080×1920 vertical video
Key MCP Tools Exposed (For Power Users & Ollama Desktop Integration)
| Tool Name | Purpose | Needs Ref Audio? | Primary Output |
|---|---|---|---|
| generate_indextts2_audio | Single audio segment via IndexTTS2 | Yes (recommended) | .wav |
| generate_subtitles_from_indextts2 | Create .srt from existing audio | — | .srt |
| file_split_novel_into_chapters | AI-powered chapter detection & splitting | — | chapter_xx folders |
| generate_image_from_text | Text-to-image | — | .png |
| generate_image_from_image | Style transfer / img2img | — | .png |
| generate_images_from_chapter | Auto-split chapter → multiple images | — | Multiple .png |
| generate_images_from_chapter_with_ai_prompt | Ollama writes refined prompts → higher coherence | — | Multiple .png |
The last tool usually produces the most visually consistent illustrations because Ollama first summarizes scene context into optimized prompts.
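If you want to script these tools directly, MCP tools are invoked with a JSON-RPC 2.0 tools/call request. The sketch below shows only the request shape; the transport this project expects (stdio vs. HTTP) and the argument names such as text and ref_audio are assumptions, so check USER_GUIDE.md for the real schema.

```bash
# Shape of an MCP tools/call request (JSON-RPC 2.0). The argument names
# "text" and "ref_audio" are hypothetical; see USER_GUIDE.md for the real schema.
cat <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "generate_indextts2_audio",
    "arguments": { "text": "Chapter one opening lines...", "ref_audio": "assets/ref_audio/ref.m4a" }
  }
}
EOF
```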
Quick Troubleshooting FAQ
Q: DrawThings service unreachable?
A: Double-check HTTP server is enabled in DrawThings preferences and port is 7861. macOS firewall may block it.
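To distinguish a blocked port from a service that never started, two generic macOS commands help: lsof shows whether anything is listening, and curl shows whether it answers HTTP at all.

```bash
# Is anything listening on the DrawThings port? (generic macOS diagnostics)
lsof -nP -iTCP:7861 -sTCP:LISTEN
# Does it answer HTTP? Prints only the status code.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:7861/
```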
Q: Voice cloning sounds off?
A: Use high-SNR, monologue-style reference audio (1–3 min, no reverb/music). Volume-normalized WAV/M4A works best.
Q: Too many / too few images?
A: Edit config.yaml image count settings (check file comments for exact keys).
Q: Only want audio + subtitles?
A: Uncheck image-related boxes in the web UI.
Q: CapCut imports black frames / wrong order?
A: Revert to exactly CapCut 3.4.1 — later builds sometimes change draft folder parsing.
Q: Can it run on Windows or Linux?
A: Official docs state macOS + Apple Silicon only (Jan 2026). Community ports may appear later.
Who Benefits Most from This Workflow?
- Indie creators who want full ownership of voice, style, and data
- Chinese/English novel translators or adapters producing daily short-form content
- Anyone with an M-series Mac who prefers local processing over cloud API costs & rate limits
- Experimenters who enjoy tweaking TTS timbre + illustration style templates
Project homepage: https://github.com/hulutech-web/novel-video-workflow
Deeper reading: SYSTEM_ARCHITECTURE.md and USER_GUIDE.md in the repo.
This pipeline isn’t perfect yet — generation speed and image consistency still depend heavily on your hardware — but in early 2026 it remains one of the few fully local, end-to-end novel-to-CapCut solutions that actually works in production-like settings.
