Pixelle-Video: The Ultimate Zero-Threshold AI Automated Short Video Engine

Summary:
Pixelle-Video is an AI-powered automated short video engine that transforms a single topic into a complete video production. It automates scriptwriting, AI image/video generation, voiceover synthesis, and background music addition. Featuring Windows one-click installation and deep support for ComfyUI and various LLMs, it enables zero-threshold video creation without any prior editing experience.


1. Introduction: Turning Video Creation into a “One-Sentence” Task

In an era where digital content consumption is exploding, short video has become the dominant medium for information dissemination. However, the traditional video production pipeline—spanning scriptwriting, asset sourcing, and editing—presents a significant technical barrier and time cost for most creators.

Imagine a solution where you only need to input a simple topic, and the AI handles everything else: writing the script, finding visuals, generating voiceovers, and editing the final cut. This is the core promise of Pixelle-Video.

As an AI fully automated short video engine, Pixelle-Video encapsulates complex video production processes within a minimalist interface. Whether you are creating a personal growth video on “How to Improve Yourself,” a science explainer on “Why Haven’t We Found Alien Civilizations?”, or a historical commentary on the “Zizhi Tongjian,” Pixelle-Video utilizes its modular AI workflow to deliver high-quality short videos in minutes.

Drawing on Pixelle-Video's core features and technical architecture, this article provides an in-depth analysis of how it works, a detailed installation and configuration guide, and a walkthrough of achieving automated video creation from scratch through its simple Web interface.

2. Core Features: An Automated Closed Loop from Idea to Render

The core value of Pixelle-Video lies in its "Fully Automated Generation" capability. It is not merely a video editing tool but a comprehensive creation platform integrating Large Language Model (LLM), Computer Vision (CV), and Text-to-Speech (TTS) technologies. Its features cover the entire lifecycle of video production, truly delivering "zero threshold, zero editing experience required."

2.1 AI Smart Scripting

The soul of a video lies in its script. Users simply input a "topic," and the system automatically invokes the configured Large Language Model (such as Qwen, DeepSeek, or GPT) to write video commentary tailored to the theme. This completely eliminates manual conceptualization and writing, producing structured scripts ready for audio-visual production.
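
To make this concrete, here is a minimal, illustrative sketch of how a topic becomes a script through any OpenAI-compatible endpoint; the base URL, model name, and prompt below are example assumptions for demonstration, not Pixelle-Video's internal code:

    # Illustrative: turn a topic into a narration script via an
    # OpenAI-compatible endpoint (endpoint/model/prompt are example values).
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.deepseek.com/v1",  # example provider endpoint
        api_key="YOUR_API_KEY",
    )
    response = client.chat.completions.create(
        model="deepseek-chat",  # any configured model works here
        messages=[
            {"role": "system", "content": "Write a short-video narration script, one sentence per line."},
            {"role": "user", "content": "Why cultivate a reading habit"},
        ],
    )
    print(response.choices[0].message.content)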

2.2 AI Image & Video Generation

To accompany the script, Pixelle-Video automatically generates exquisite AI illustrations or video clips for every single sentence.

  • Image Capabilities: Supports calling various AI painting models to generate matching visuals for each line of commentary (a minimal ComfyUI call is sketched after this list).
  • Video Generation: By integrating advanced AI video generation models like WAN 2.1, the system can create dynamic video content, moving beyond static images to significantly enhance engagement.
  • Custom Assets: Beyond AI generation, users can upload their own photos and videos via the "Custom Material" feature. The system uses AI to intelligently analyze these assets and generate scripts based on them, enabling an "asset-first" flexible creation mode.
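
As a rough sketch of what calling a local ComfyUI service involves: ComfyUI's /prompt endpoint queues a workflow graph and returns a prompt ID. The filename workflow_api.json below is a placeholder for a workflow exported in API format:

    # Queue an API-format workflow on a local ComfyUI instance.
    import json
    import requests

    COMFYUI_URL = "http://127.0.0.1:8188"  # ComfyUI's default local address

    with open("workflow_api.json") as f:  # placeholder: your exported workflow
        workflow = json.load(f)

    resp = requests.post(f"{COMFYUI_URL}/prompt", json={"prompt": workflow})
    resp.raise_for_status()
    print("queued:", resp.json()["prompt_id"])  # results appear later under /history/<prompt_id>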

2.3 Intelligent Voice Synthesis & Voice Cloning

Voice is the emotional carrier of a video. Pixelle-Video supports numerous mainstream TTS (Text-to-Speech) solutions, including Edge-TTS and Index-TTS (a minimal Edge-TTS call is sketched after the list below).

  • Multimodal Voice: It supports not just basic speech synthesis but also “Voice Cloning.” Users can upload a reference audio file (MP3/WAV/FLAC) to clone a similar tone for narration, greatly enhancing personalization.
  • Multi-language Support: The system has recently added support for multi-language TTS voices to meet regional requirements.
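
Pixelle-Video drives TTS through its workflow system, but the underlying idea can be shown with a standalone call to the edge-tts Python package (the text and voice below are arbitrary examples):

    # Minimal standalone Edge-TTS synthesis with the edge-tts package.
    import asyncio
    import edge_tts

    async def main() -> None:
        # "en-US-AriaNeural" is one of many voices edge-tts exposes.
        communicate = edge_tts.Communicate("Reading broadens the mind.", "en-US-AriaNeural")
        await communicate.save("narration.mp3")

    asyncio.run(main())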

2.4 Background Music & Visual Styles

  • BGM Addition: Music sets a video's atmosphere. The system supports adding Background Music (BGM): users can select tracks from the built-in library or place custom music files (MP3/WAV, etc.) into the designated folder for automatic mixing (an illustrative ffmpeg mix is sketched after this list).
  • Visual Templates: To craft unique video styles, Pixelle-Video offers a variety of visual templates. Whether minimalist, cinematic, or other artistic styles, users can customize the video’s look by selecting different templates and adjusting parameters.
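
For intuition, here is one common way such BGM mixing is done with ffmpeg, wrapped in Python; this is an illustrative command with placeholder filenames, not necessarily the exact invocation Pixelle-Video uses:

    # Duck the BGM to 25% volume and mix it under the narration track.
    import subprocess

    subprocess.run([
        "ffmpeg", "-y",
        "-i", "narrated_video.mp4",  # placeholder: video with voiceover
        "-i", "bgm.mp3",             # placeholder: background track
        "-filter_complex",
        "[1:a]volume=0.25[bg];[0:a][bg]amix=inputs=2:duration=first[aout]",
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy",              # keep the video stream untouched
        "output.mp4",
    ], check=True)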

2.5 Flexible Dimensions & Model Compatibility

Targeting different publishing platforms, the system supports various video aspect ratios, including portrait, landscape, and square. More importantly, its underlying architecture is built on ComfyUI, making it highly extensible. Users can rely on preset workflows or flexibly combine atomic capabilities, for example replacing the image generation model with FLUX or switching TTS to ChatTTS, meeting needs from casual users to professional tech enthusiasts.

3. Technical Architecture: A Modular Video Generation Pipeline

Pixelle-Video adopts a clear modular design. Its video generation flow is logically rigorous, primarily divided into four core stages: Script Generation → Visual Planning → Frame-by-Frame Processing → Video Synthesis.

  1. Script Generation: The starting point. Based on the user’s topic, the system uses the LLM to generate a structured video script. Users can choose to let the AI create freely or use a fixed script.
  2. Visual Planning: After the script is generated, the system performs semantic analysis on the text, planning visual prompts for each sentence (or paragraph) and deciding whether to generate static images or dynamic video clips.
  3. Frame-by-Frame Processing: This is the compute-intensive phase. The system invokes configured image/video generation services (like ComfyUI or RunningHub) to render visual content frame-by-frame based on prompts. Simultaneously, the TTS engine converts the script into audio files.
  4. Video Synthesis: Finally, the system composites the generated visual assets, audio files, and background music according to the selected timeline and visual template using tools like FFmpeg, outputting the complete video file.

This pipeline design ensures every step supports flexible customization. Users can intervene at any stage, swapping AI models, adjusting audio parameters, or switching visual styles to meet their personalized creative needs.
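
The flow can be summarized in a short skeleton. Every name below is illustrative (Pixelle-Video's actual internal API is not documented here); the stubs only show how the four stages hand data to each other:

    # Four-stage pipeline skeleton with stubbed stages.
    from dataclasses import dataclass

    @dataclass
    class Scene:
        line: str        # one sentence of the script
        prompt: str      # visual prompt planned for that sentence
        use_video: bool  # static image vs. dynamic clip

    def generate_script(topic: str) -> list[str]:
        # Stage 1: an LLM writes this in the real system; stubbed here.
        return [f"A thought about {topic}.", f"Why {topic} matters."]

    def plan_visuals(script: list[str]) -> list[Scene]:
        # Stage 2: semantic analysis turns each sentence into a prompt.
        return [Scene(s, f"illustration of: {s}", use_video=False) for s in script]

    def render_and_synthesize(scenes: list[Scene]) -> str:
        # Stages 3-4: ComfyUI renders assets, TTS speaks, ffmpeg muxes; stubbed.
        for s in scenes:
            print("render:", s.prompt)
        return "output/final.mp4"

    print(render_and_synthesize(plan_visuals(generate_script("reading habits"))))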

4. Quick Start: Installation and Deployment Guide

To lower the barrier to entry, Pixelle-Video offers two primary installation methods: a Windows One-Click All-in-One Package and Source Code Installation. Whether you are a standard Windows user or a developer, there is a suitable deployment solution.

4.1 Windows One-Click All-in-One Package (Recommended for Windows Users)

For most Windows users, the most convenient method is the official all-in-one package. Its biggest advantage is “out-of-the-box” usability; users do not need to manually install complex runtime environment dependencies like Python, uv, or ffmpeg.

Operation Steps:

  1. Download: Visit the GitHub Releases page to download the latest Windows all-in-one package archive.
  2. Extract: Unzip the downloaded archive to any local directory.
  3. Launch: Enter the unzipped folder and double-click the start.bat script.
  4. Access: The script launches the Web interface and should open http://localhost:8501 in your browser automatically. If it does not, enter the address manually.
  5. Configuration: On first use, configure the LLM API and Image Generation Service keys in the Web interface’s “⚙️ System Configuration”. Once configured, you can start generating videos.

Note: The all-in-one package contains all dependencies. No manual environment installation is required; just configure API keys on first use.

4.2 Installing from Source (For macOS / Linux Users)

For macOS and Linux users, or for developers seeking deep customization, installing from source is the more flexible choice.

Prerequisite Environment Dependencies:
Before starting, ensure your system has the Python package manager uv and the video processing tool ffmpeg installed.

  • Install uv: Visit the official uv documentation for installation instructions suitable for your system. Verify installation with uv --version in the terminal.
  • Install ffmpeg:

    • macOS: Use Homebrew: brew install ffmpeg.
    • Ubuntu / Debian: Use apt: sudo apt update then sudo apt install ffmpeg.
    • Windows: Download from the official site, unzip, and add the bin directory to your system PATH.
    • Verify installation with ffmpeg -version.

Installation Steps:

  1. Download Project: Clone the repository locally using git:

    git clone https://github.com/AIDC-AI/Pixelle-Video.git
    cd Pixelle-Video
    
  2. Launch Web Interface: Use uv to run the Streamlit app (Recommended method, handles dependencies automatically):

    uv run streamlit run web/app.py
    
  3. Configuration: Once the browser opens http://localhost:8501, expand the “⚙️ System Configuration” panel and fill in the LLM and Image service details as prompted, then save.

5. Deep Dive into the Web Interface: From Configuration to Generation

Pixelle-Video’s Web interface features an intuitive three-column layout, corresponding to System Configuration, Content/Audio-Visual Settings, and Generation Control. Below is a detailed breakdown of each section’s functions and setup tips.

5.1 ⚙️ System Configuration: The Foundation

On first use, the System Configuration is mandatory. Expanding the panel reveals two core configurations:

LLM Configuration (Large Language Model)

The LLM serves as the brain that writes the video script.

  • Quick Presets: To simplify things for beginners, the interface offers a dropdown menu to select preset models like Tongyi Qianwen (Qwen), GPT-4o, or DeepSeek. Upon selection, the system auto-fills the Base URL and Model name.
  • Manual Config: Advanced users can manually input the API Key, API Address (Base URL), and Model Name (the snippet after this list shows how these three fields map to an API call).
  • Get API Key: The interface usually provides a “Get API Key” link guiding you to register and fetch the key.
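
Before saving, the three fields can be sanity-checked outside the UI with a one-off call; the base URL and model below are examples for Tongyi Qianwen's OpenAI-compatible mode, so substitute your own provider's values:

    # Map the UI fields (API Key / API Address / Model Name) to a test call.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # API Address
        api_key="YOUR_API_KEY",                                        # API Key
    )
    reply = client.chat.completions.create(
        model="qwen-plus",  # Model Name
        messages=[{"role": "user", "content": "ping"}],
    )
    print(reply.choices[0].message.content)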

Image Configuration

This determines the source of your video visuals.

  • Local Deployment (Recommended): If you have ComfyUI deployed locally, simply enter the local service address (default: http://127.0.0.1:8188). Click "Test Connection" to confirm availability (an equivalent manual check is sketched below).
  • Cloud Deployment: For users without local compute power, cloud services like RunningHub can be used. Simply enter the API Key for the cloud image generation service.

After configuration, ensure you click “Save Configuration” to apply settings.
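
If "Test Connection" reports a failure, an equivalent manual check can help diagnose it: a running ComfyUI instance answers on its /system_stats endpoint.

    # Manually verify that a local ComfyUI service is reachable.
    import requests

    resp = requests.get("http://127.0.0.1:8188/system_stats", timeout=5)
    resp.raise_for_status()
    print(resp.json()["system"])  # basic host/VRAM info confirms the service is up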

5.2 📝 Content Input: The Start of Creativity

The left sidebar is primarily for content input and settings.

Generation Mode Selection

  • AI Generate Content: The core mode. Input a topic (e.g., “Why cultivate a reading habit”), and the AI will auto-write the script. Ideal for users who want fast video generation with AI ghostwriting.
  • Fixed Script Content: If you have a ready-made script, select this mode, paste the text, and the system will skip AI creation, generating the video directly based on the provided text.

Background Music (BGM) Settings

  • No BGM: Voiceover only.
  • Built-in Music: Select from the preset list (e.g., default.mp3).
  • Custom Music: Place your own files (MP3/WAV, etc.) into the project’s bgm/ folder; the system will auto-detect them.
  • Preview: Click “Preview BGM” to listen before generating.

5.3 🎤 Voice Settings: Giving Voice to Video

The top half of the middle column focuses on speech synthesis settings.

TTS Workflow Selection

The system automatically scans TTS workflows in the workflows/ folder. Select from the dropdown. Besides mainstream solutions like Edge-TTS and Index-TTS, you can use custom TTS workflows if familiar with ComfyUI.

Reference Audio & Voice Cloning

For workflows supporting voice cloning (like Index-TTS), upload a reference audio file.

  • Upload Formats: Supports MP3, WAV, FLAC, etc.
  • Instant Preview: Upload, enter test text, and click “Preview Voice” to hear the cloned tone.

5.4 🎨 Visual Settings: Crafting Unique Styles

The bottom half of the middle column controls the visual aesthetic.

Image Generation Settings

  • ComfyUI Workflow: Select the specific image generation workflow, supporting both local and cloud (RunningHub). The default is often image_flux.json. You can place custom workflows in the workflows/ folder.
  • Image Dimensions: Set width and height in pixels. Default is often 1024×1024, but note that different AI models have specific size limits.
  • Prompt Prefix: This is the key to controlling the overall artistic style. Users must input an English prompt prefix (e.g., "Minimalist black-and-white matchstick figure style illustration, clean lines, simple sketch style"). The system combines this prefix with each scene's script description to keep the art style consistent (see the sketch after this list). Click "Preview Style" to test.
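
The mechanism is straightforward: the fixed prefix pins the art style while the per-scene description varies. A minimal sketch of this kind of composition (the exact joining logic inside Pixelle-Video may differ):

    # Combine a fixed style prefix with a per-scene description.
    PREFIX = ("Minimalist black-and-white matchstick figure style illustration, "
              "clean lines, simple sketch style")

    def build_prompt(scene_description: str) -> str:
        return f"{PREFIX}, {scene_description}"

    print(build_prompt("a person reading under a tree at dawn"))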

Video Templates

Templates dictate the video layout, fonts, and animation.

  • Naming Conventions:

    • static_*.html: Static templates, text-based styles, no AI media needed.
    • image_*.html: Image templates, use AI-generated images as backgrounds.
    • video_*.html: Video templates, use AI-generated videos as backgrounds.
  • Size Grouping: Templates are grouped by Portrait, Landscape, and Square for easy platform selection.
  • Custom Preview: Click “Preview Template” to view effects and test parameters. HTML-savvy users can create or modify templates in the templates/ folder.

5.5 🎬 Generate Video: One-Click Output

The right sidebar is the final command center.

Generation & Progress

Once all parameters are set, click the prominent “🎬 Generate Video” button. The system begins working, displaying real-time progress (e.g., “Scene 3/5 – Generating Illustration”).

Video Preview & Output

Upon completion, the video auto-plays in the interface. Metadata like duration, file size, and scene count is displayed. The final video file is saved in the project’s output/ folder for easy access.

6. Application Scenarios and Case Studies

The power of Pixelle-Video lies in its wide adaptability. By tweaking topics, templates, and styles, it can generate various types of video content. Below are application scenarios showcased in the documentation:

6.1 Portrait Video Cases (Suitable for TikTok, Shorts, Reels)

  • Humanities & Documentary: e.g., “Scenery on the journey makes one linger,” using the default template to showcase travel beauty.
  • Cultural Deconstruction: e.g., “Santa ID,” exploring cultural icons.
  • Scientific Speculation: e.g., “Why haven’t we found alien civilizations?”, using AI-generated deep-space visuals with narration.
  • Personal Growth: e.g., “How to improve yourself,” using a cloned voice for a mentor-like vibe.
  • Deep Thinking: e.g., “How to understand Antifragility,” visualizing abstract concepts.
  • History & Culture: e.g., “Zizhi Tongjian,” using fixed images for historical gravitas.
  • Emotional: e.g., “Winter Warm Sun,” conveying warmth via soft visuals and cloned voice.
  • Novel Commentary: e.g., “Battle Through the Heavens,” fast production via custom scripts.
  • Knowledge & Science: e.g., “Health Knowledge,” using specific models (like Qwen) for illustrative images.

6.2 Landscape Video Cases (Suitable for YouTube, Bilibili)

  • Side Hustle/Money: e.g., “Side Hustle Earning,” using a cinematic template.
  • History Commentary: e.g., “Revelations of Zizhi Tongjian,” using a custom template for unique storytelling.

These cases demonstrate that, whether for a simple emotional piece or a complex science explainer, Pixelle-Video can generate satisfying videos automatically from just one keyword.

7. System Updates and Iteration History

An active open-source project relies on continuous iteration. The Pixelle-Video update log reflects its progress in functionality and stability. Here are key recent milestones:

  • 2026-01-14: Added “Digital Human Avatar” and “Image-to-Video” pipelines, expanding video forms; added multi-language TTS voice support.
  • 2026-01-06: Added support for calling RunningHub 48G VRAM machines, enabling cloud generation of higher-res/complex videos.
  • 2025-12-28: Made RunningHub concurrency limits configurable; optimized logic for LLM structured data returns.
  • 2025-12-17: Added ComfyUI API Key configuration, support for Nano Banana model, and API support for custom template parameters.
  • 2025-12-10: Built-in FAQ in the sidebar; locked edge-tts version to fix instability issues.
  • 2025-12-08: Support for fixed script splitting (paragraph/line/sentence); optimized template selection interaction with direct preview.
  • 2025-12-06: Fixed video generation API URL path handling; improved cross-platform compatibility.
  • 2025-12-05: Added Windows All-in-One Package download; optimized image/video reverse workflows.
  • 2025-12-04: Added “Custom Material” feature; AI analyzes user-uploaded photos/videos to generate scripts.
  • 2025-11-18: Optimized RunningHub service calls for parallel processing; added History page; support for batch video creation tasks.

8. Frequently Asked Questions (FAQ)

Q: How long does it take to generate a video on the first try?
A: Generation time depends on the number of scenes, network conditions, and the inference speed of the AI models used. Under normal conditions, a short video can usually be completed within a few minutes.

Q: What should I do if the generated video quality is not satisfactory?
A: Pixelle-Video offers multi-dimensional adjustments. You can try:

  1. Switch LLM Models: Different models (GPT, Qwen, DeepSeek) have different writing styles.
  2. Adjust Image Parameters: Change image dimensions or modify the “Prompt Prefix” to alter the artistic style.
  3. Optimize Audio: Switch TTS workflows or upload high-quality reference audio for better cloning.
  4. Change Templates/Dimensions: Try different visual templates and aspect ratios for a new look.

Q: What are the costs associated with using Pixelle-Video?
A: The project can run completely free of charge! You can choose among different cost strategies:

  • Totally Free: LLM via local Ollama + local ComfyUI. Zero cost, but requires capable local hardware (a sketch of this configuration follows below).
  • Recommended: LLM via Tongyi Qianwen (very low cost, high value) + Local ComfyUI.
  • Cloud: LLM via OpenAI + Image via RunningHub. Higher cost, but no local setup needed and powerful cloud compute.

Advice: If you have a powerful local GPU, use the totally free solution; otherwise, Tongyi Qianwen is recommended for the best balance.
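
For reference, the fully local, zero-cost LLM setup looks like this in code; Ollama exposes an OpenAI-compatible endpoint, and the model name is simply whatever you have pulled locally (qwen2.5 here is an example):

    # Zero-cost LLM config: point the OpenAI client at a local Ollama server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # any non-empty string; Ollama ignores it
    )
    reply = client.chat.completions.create(
        model="qwen2.5",  # example: a model pulled via `ollama pull qwen2.5`
        messages=[{"role": "user", "content": "Write a 3-sentence video script about winter sun."}],
    )
    print(reply.choices[0].message.content)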

9. Conclusion

Pixelle-Video is not just a tool; it is a practical embodiment of the philosophy of automated video creation. By combining the powerful capabilities of ComfyUI with a modern Web interface, it breaks down the technical barriers of video production, allowing everyone to become a video creator.

With new features like Digital Human Avatars and Image-to-Video, along with support for cloud computing power such as RunningHub, Pixelle-Video is becoming increasingly intelligent and powerful. Just input a topic and leave the rest to AI: this is no longer futuristic imagination, but an accessible reality.