Xiaozhi ESP32-Server: Open-Source Backend Solution for Smart Hardware
(Developed by Professor Siyuan Liu’s Research Group at South China University of Technology)
Project Overview
Xiaozhi-esp32-server is an intelligent backend system built on human-computer symbiotic intelligence theory. It provides full-stack support for the open-source hardware project xiaozhi-esp32, implementing the Xiaozhi Communication Protocol using Python, Java, and Vue. The system integrates voiceprint recognition, MCP access points, and multimodal interaction capabilities, serving as a foundational platform for IoT developers.
Target Audience 👥
This solution is designed for:
- Hardware engineers deploying ESP32-based devices
- Researchers exploring voice-controlled IoT systems
- Developers building custom smart hardware ecosystems
🎥 Functional Demonstrations:
Explore 15+ use cases in the demonstration videos, including:
- Real-time voice interruption
- Cantonese dialect support
- Appliance control via IoT commands
- Visual object recognition
Critical Security Notice ⚠️
- Third-Party Service Disclaimer
  - No commercial partnerships with integrated API providers (ASR/TTS/LLM platforms)
  - Users must independently verify service agreements and privacy policies
- Usage Restrictions
  - Not validated for production environments
  - Public deployments require additional security hardening
Deployment Architectures 🚀
Comparative Deployment Models
Type | Key Features | Use Cases | Resource Requirements |
---|---|---|---|
Simplified Setup | Core dialogue / IoT / voiceprint | Low-resource environments | 2-core CPU / 2GB RAM (API-only) |
Full-Module Setup | OTA/control panel/visual perception | Complete functionality | 4-core CPU / 8GB RAM (with FunASR) |
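As a rough illustration of how the table above might be applied when sizing a host, the sketch below picks the largest deployment model a machine can satisfy. The thresholds are copied from the table; the class and function names are hypothetical and not part of the project. The figures are also conditional (2GB assumes API-only services, 8GB assumes FunASR runs on the server), so treat the result as a starting point rather than a guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeploymentProfile:
    """Minimum resources for one deployment model (values from the table)."""
    name: str
    min_cpu_cores: int
    min_ram_gb: float


# Hypothetical constants; only the numbers come from the comparison table.
SIMPLIFIED = DeploymentProfile("Simplified Setup", 2, 2)
FULL_MODULE = DeploymentProfile("Full-Module Setup", 4, 8)


def suggest_deployment(cpu_cores: int, ram_gb: float) -> Optional[str]:
    """Return the most complete profile the host satisfies, or None."""
    # Check the more demanding profile first so it wins when both fit.
    for profile in (FULL_MODULE, SIMPLIFIED):
        if cpu_cores >= profile.min_cpu_cores and ram_gb >= profile.min_ram_gb:
            return profile.name
    return None


if __name__ == "__main__":
    print(suggest_deployment(4, 8))
    print(suggest_deployment(2, 4))
```

A host below both thresholds yields `None`, which a caller could treat as a signal to use hosted APIs only or to upgrade the machine.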
Implementation Guides:
💻 Live Test Environment:
- Control Panel: https://2662r3426b.vicp.fun
- WebSocket Endpoint: wss://2662r3426b.vicp.fun/xiaozhi/v1/
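In the live test environment, the control panel and the WebSocket endpoint share one host and differ only in scheme and path. The sketch below derives one URL from the other using the standard library. The function name is hypothetical, and the assumption that a self-hosted deployment follows the same host and `/xiaozhi/v1/` path layout is not confirmed by the project; check your own configuration before relying on it.

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_endpoint(panel_url: str, path: str = "/xiaozhi/v1/") -> str:
    """Build a ws/wss URL on the same host as the control panel URL.

    Assumption: the WebSocket service is reachable on the panel's host,
    as it is in the live test environment.
    """
    parts = urlsplit(panel_url)
    # Map the HTTP scheme to its WebSocket counterpart.
    scheme = {"https": "wss", "http": "ws"}.get(parts.scheme)
    if scheme is None or not parts.netloc:
        raise ValueError(f"expected an http(s) URL, got {panel_url!r}")
    return urlunsplit((scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(websocket_endpoint("https://2662r3426b.vicp.fun"))
```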
Technical Capabilities ✨
Implemented Features ✅
Module | Technical Specifications |
---|---|
Voice Interaction | Streaming ASR/TTS, multilingual support, real-time VAD detection |
Voiceprint Verification | Multi-user enrollment, real-time speaker identification |
Visual Perception | GLM-4V, Qwen-VL vision model integration |
Intelligent Dialogue | 10+ LLM platform compatibility (Zhipu, Volcano, Alibaba) |
Protocol Support | Client IoT control, MCP access protocol, custom plugins |
Development Roadmap 🚧
- Multi-device coordination system
- Dynamic plugin hot-swapping
- Detailed development timeline
Platform Compatibility Matrix 📋
Function | Free Tier | Performance Tier |
---|---|---|
Speech Recognition (ASR) | FunASR (local) | Volcano Doubao StreamASR |
Language Models (LLM) | Zhipu GLM-4-Flash | Volcano Doubao-1.5-Pro-32K |
Speech Synthesis (TTS) | Linkerai Streaming TTS | Volcano Dual-Stream TTS |
Vision Models (VLLM) | Zhipu GLM-4V-Flash | Qwen Qwen2.5-VL-3B |
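One way to keep the two tiers in the matrix above straight is to hold them in a single lookup and select a whole stack by tier name. This is a minimal sketch: the provider strings are copied from the table, while the dictionary layout, the tier keys, and `select_stack` are hypothetical and do not reflect the project's actual configuration file format.

```python
from typing import Dict

# Provider names copied from the compatibility matrix; the structure is
# illustrative only and is not the project's configuration schema.
PLATFORM_MATRIX: Dict[str, Dict[str, str]] = {
    "ASR": {"free": "FunASR", "performance": "Volcano Doubao StreamASR"},
    "LLM": {"free": "Zhipu GLM-4-Flash", "performance": "Volcano Doubao-1.5-Pro-32K"},
    "TTS": {"free": "Linkerai Streaming TTS", "performance": "Volcano Dual-Stream TTS"},
    "VLLM": {"free": "Zhipu GLM-4V-Flash", "performance": "Qwen Qwen2.5-VL-3B"},
}


def select_stack(tier: str) -> Dict[str, str]:
    """Return the provider for each function under the given tier."""
    if tier not in ("free", "performance"):
        raise ValueError(f"unknown tier: {tier!r}")
    return {function: options[tier] for function, options in PLATFORM_MATRIX.items()}


if __name__ == "__main__":
    for function, provider in select_stack("free").items():
        print(f"{function}: {provider}")
```

Nothing prevents mixing tiers per function in practice; a single tier switch is used here only to keep the example short.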
⚙️ Diagnostic Tools:
    python performance_tester.py       # Core module latency analysis
    python performance_tester_vllm.py  # Vision model performance test
Ecosystem Integration 👬
Project | Technology | Functionality |
---|---|---|
Android/iOS Client | Flutter | Cross-platform voice interface |
Desktop Client | Python | Hardware-free simulation |
Java Backend | Java | Enterprise-grade alternative |
Acknowledgments 🙏
Organization | Contributions |
---|---|
Shifang Ronghai | Communication protocol standardization |
Xuanfeng Technology | Function-calling framework development |
Huiyuan Design | User experience optimization |
SEO Keywords
- Open-source ESP32 backend
- Voiceprint authentication system
- MCP protocol implementation
- Multimodal interaction framework
- SCUT human-computer intelligence
- LLM integration platform
- IoT control architecture
Technical Summary
Xiaozhi-esp32-server is an open-source backend system for smart hardware, developed at South China University of Technology. It provides voice interaction, voiceprint verification, and visual perception, and can be deployed with Docker or from source. It supports 10+ AI platforms in free and performance-tier configurations. It is well suited to IoT prototyping but has not been validated for production environments.