Xiaozhi ESP32-Server: Open-Source Backend Solution for Smart Hardware

(Developed by Professor Siyuan Liu’s Research Group at South China University of Technology)


Project Overview

Xiaozhi-esp32-server is an intelligent backend system built on human-computer symbiotic intelligence theory. It provides full-stack support for the open-source hardware project xiaozhi-esp32, implementing the Xiaozhi Communication Protocol using Python, Java, and Vue. The system integrates voiceprint recognition, MCP access points, and multimodal interaction capabilities, serving as a foundational platform for IoT developers.


Target Audience 👥

This solution is designed for:

  • Hardware engineers deploying ESP32-based devices
  • Researchers exploring voice-controlled IoT systems
  • Developers building custom smart hardware ecosystems

🎥 Functional Demonstrations:
Explore 15+ use cases in the demonstration videos, including:

  • Real-time voice interruption
  • Cantonese dialect support
  • Appliance control via IoT commands
  • Visual object recognition

Critical Security Notice ⚠️

  1. Third-Party Service Disclaimer

    • The project has no commercial partnerships with the integrated API providers (ASR/TTS/LLM platforms)
    • Users must independently verify service agreements and privacy policies
  2. Usage Restrictions

    • Not validated for production environments
    • Public deployments require additional security hardening

Deployment Architectures 🚀

Comparative Deployment Models
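Type              | Key Features                        | Use Cases                 | Resource Requirements
------------------|-------------------------------------|---------------------------|------------------------------------
Simplified Setup  | Core dialogue/IoT/voiceprint        | Low-resource environments | 2-core CPU / 2 GB RAM (API-only)
Full-Module Setup | OTA/control panel/visual perception | Complete functionality    | 4-core CPU / 8 GB RAM (with FunASR)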
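
The resource figures in the comparison above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the project: the `DeploymentModel` class and `choose_deployment` function are hypothetical names, and the thresholds are copied directly from the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DeploymentModel:
    name: str
    min_cores: int
    min_ram_gb: int
    features: Tuple[str, ...]


# Thresholds and feature lists copied from the comparison table.
SIMPLIFIED = DeploymentModel(
    "Simplified Setup", 2, 2, ("core dialogue", "IoT", "voiceprint")
)
FULL = DeploymentModel(
    "Full-Module Setup", 4, 8, ("OTA", "control panel", "visual perception")
)


def choose_deployment(cores: int, ram_gb: float) -> Optional[DeploymentModel]:
    """Return the largest model the host satisfies, or None if neither fits."""
    for model in (FULL, SIMPLIFIED):  # try the more demanding model first
        if cores >= model.min_cores and ram_gb >= model.min_ram_gb:
            return model
    return None


if __name__ == "__main__":
    picked = choose_deployment(cores=2, ram_gb=4)
    print(picked.name if picked else "Host is below the minimum requirements")
```

Note that the 2 GB figure applies only when every AI module is served through remote APIs; running FunASR locally is what pushes the requirement to the full-module tier.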

💻 Live Test Environment:

Control Panel: https://2662r3426b.vicp.fun  
WebSocket Endpoint: wss://2662r3426b.vicp.fun/xiaozhi/v1/  
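
Before pointing a device or test client at the endpoint, it can help to validate the URL and prepare the first message. The sketch below uses only the standard library and opens no connection. The handshake field names (`type`, `version`, `transport`, `audio_params`) are assumptions for illustration and are not taken from this document; check them against the Xiaozhi Communication Protocol documentation before relying on them.

```python
import json
from urllib.parse import urlparse

# Endpoint copied from the live test environment listed above.
WS_ENDPOINT = "wss://2662r3426b.vicp.fun/xiaozhi/v1/"


def validate_endpoint(url: str) -> str:
    """Reject anything that is not a ws:// or wss:// URL with a host."""
    parts = urlparse(url)
    if parts.scheme not in ("ws", "wss"):
        raise ValueError(f"expected ws:// or wss://, got {parts.scheme!r}")
    if not parts.netloc:
        raise ValueError("endpoint has no host")
    return url


def build_hello(sample_rate: int = 16000, frame_ms: int = 60) -> str:
    """Serialize a hypothetical handshake message as JSON text."""
    # ASSUMPTION: these field names are illustrative only; the real
    # handshake is defined by the Xiaozhi Communication Protocol.
    message = {
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
            "format": "opus",
            "sample_rate": sample_rate,
            "channels": 1,
            "frame_duration": frame_ms,
        },
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(validate_endpoint(WS_ENDPOINT))
    print(build_hello())
```

Keeping message construction separate from the transport makes it possible to unit-test the payload without network access, whichever WebSocket client library is used afterwards.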

Technical Capabilities ✨

Implemented Features ✅
Module                  | Technical Specifications
------------------------|--------------------------------------------------------------------------
Voice Interaction       | Streaming ASR/TTS, multilingual support, real-time voice activity detection (VAD)
Voiceprint Verification | Multi-user enrollment, real-time speaker identification
Visual Perception       | GLM-4V and Qwen-VL vision model integration
Intelligent Dialogue    | Compatible with 10+ LLM platforms (including Zhipu, Volcano, Alibaba)
Protocol Support        | Client IoT control, MCP access protocol, custom plugins
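
To make "multi-user enrollment, real-time speaker identification" concrete, here is a generic sketch of how voiceprint matching is commonly done: each enrolled user is stored as an embedding vector, and an incoming utterance is assigned to the closest one by cosine similarity. This is an illustration of the general technique, not the project's implementation; the function names, the toy three-dimensional embeddings, and the 0.7 threshold are all assumptions.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # ASSUMPTION: illustrative value, tune per model
) -> Optional[str]:
    """Return the best-matching enrolled speaker, or None below the threshold."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Toy embeddings; real voiceprint models produce far longer vectors.
    enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
    print(identify_speaker([0.9, 0.1, 0.0], enrolled))
    print(identify_speaker([1.0, 1.0, 1.0], enrolled))
```

Returning `None` for low-similarity input matters in a multi-user setting: an unknown voice should be treated as unenrolled rather than mapped to the nearest known user.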

Platform Compatibility Matrix 📋

Function                 | Free Tier              | Performance Tier
-------------------------|------------------------|---------------------------
Speech Recognition (ASR) | FunASR (local)         | Volcano Doubao StreamASR
Language Models (LLM)    | Zhipu GLM-4-Flash      | Volcano Doubao-1.5-Pro-32K
Speech Synthesis (TTS)   | Linkerai Streaming TTS | Volcano Dual-Stream TTS
Vision Models (VLLM)     | Zhipu GLM-4V-Flash     | Qwen Qwen2.5-VL-3B
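
The matrix above maps naturally onto a lookup table, which is handy when a prototype needs to mix tiers (for example, free ASR with a faster LLM). The sketch below is illustrative only: `STACKS` and `select_stack` are hypothetical names, and the dictionary is not the project's configuration file format. The provider names are copied from the matrix.

```python
from typing import Dict, Optional

# Provider names copied from the compatibility matrix.
STACKS: Dict[str, Dict[str, str]] = {
    "free": {
        "ASR": "FunASR",
        "LLM": "Zhipu GLM-4-Flash",
        "TTS": "Linkerai Streaming TTS",
        "VLLM": "Zhipu GLM-4V-Flash",
    },
    "performance": {
        "ASR": "Volcano Doubao StreamASR",
        "LLM": "Volcano Doubao-1.5-Pro-32K",
        "TTS": "Volcano Dual-Stream TTS",
        "VLLM": "Qwen Qwen2.5-VL-3B",
    },
}


def select_stack(
    tier: str, overrides: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the tier's providers with optional per-module overrides."""
    if tier not in STACKS:
        raise KeyError(f"unknown tier {tier!r}; expected one of {sorted(STACKS)}")
    stack = dict(STACKS[tier])  # copy so the shared table is never mutated
    for module, provider in (overrides or {}).items():
        if module not in stack:
            raise KeyError(f"unknown module {module!r}")
        stack[module] = provider
    return stack


if __name__ == "__main__":
    mixed = select_stack("free", {"LLM": STACKS["performance"]["LLM"]})
    for module, provider in mixed.items():
        print(f"{module}: {provider}")
```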

⚙️ Diagnostic Tools:

python performance_tester.py        # Core module latency analysis  
python performance_tester_vllm.py   # Vision model performance test  
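
For readers who want to time a single provider call in isolation, the idea behind a latency test can be sketched in a few lines. This is not the code in `performance_tester.py`; `measure_latency` and `fake_module` are hypothetical names, and the sleeping stub stands in for a real ASR, LLM, or TTS request.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Call fn repeatedly and report min/mean/max wall-clock seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "mean": statistics.mean(samples),
        "max": max(samples),
    }


def fake_module() -> None:
    """Stand-in for a real provider call (hypothetical)."""
    time.sleep(0.01)


if __name__ == "__main__":
    stats = measure_latency(fake_module, runs=5)
    print({name: round(value * 1000, 2) for name, value in stats.items()}, "ms")
```

Reporting the minimum alongside the mean helps separate a provider's baseline latency from network jitter.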

Ecosystem Integration 👬

Project            | Technology | Functionality
-------------------|------------|-------------------------------
Android/iOS Client | Flutter    | Cross-platform voice interface
Desktop Client     | Python     | Hardware-free simulation
Java Backend       | Java       | Enterprise-grade alternative

Acknowledgments 🙏

Organization        | Contributions
--------------------|----------------------------------------
Shifang Ronghai     | Communication protocol standardization
Xuanfeng Technology | Function-calling framework development
Huiyuan Design      | User experience optimization

Technical Summary

Xiaozhi-esp32-server is an open-source backend system for smart hardware, developed at South China University of Technology. It provides voice interaction, voiceprint verification, and visual perception, and can be deployed through Docker or from source. It supports 10+ AI platforms in free and performance-tier configurations, which makes it well suited to IoT prototyping, but it has not been validated for production environments.

