Fun-Audio-Chat: Engineering Real-Time Voice Interaction with Dual-Resolution Representations and Core-Cocktail Training

What makes it possible to run a high-fidelity, full-duplex voice assistant on a single GPU without sacrificing text comprehension? Fun-Audio-Chat does this by processing speech at an efficient 5 Hz frame rate while generating audio at 25 Hz, combined with a two-stage training regimen that merges intermediate models to preserve the base LLM's knowledge. The open-source 8B model delivers state-of-the-art performance across spoken QA, audio understanding, and voice empathy benchmarks while cutting GPU training time nearly in half.

Why Existing Joint Speech-Text Models Hit a Wall

Why can't current …
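The dual-resolution idea in the Fun-Audio-Chat summary is easiest to see with a little arithmetic: at 5 Hz each frame covers 200 ms of speech, while at 25 Hz each frame covers 40 ms, so the same utterance needs five times fewer positions at the lower rate. The Python sketch below is illustrative only; the 60-second turn, the helper names `num_frames` and `compare`, and the comparison of quadratic attention cost are assumptions added here, not details of Fun-Audio-Chat's implementation.

```python
import math

BACKBONE_RATE_HZ = 5   # rate at which speech is processed (from the summary)
OUTPUT_RATE_HZ = 25    # rate at which audio is generated (from the summary)


def num_frames(duration_s: float, rate_hz: float) -> int:
    """Frames needed to cover `duration_s` seconds at `rate_hz` frames per second."""
    return math.ceil(duration_s * rate_hz)


def compare(duration_s: float) -> dict:
    """Hypothetical helper: compare sequence lengths at the two frame rates."""
    low = num_frames(duration_s, BACKBONE_RATE_HZ)
    high = num_frames(duration_s, OUTPUT_RATE_HZ)
    return {
        "frames_5hz": low,
        "frames_25hz": high,
        # How much shorter the sequence is when speech is processed at 5 Hz.
        "length_ratio": high / low,
        # Full self-attention computes one score per pair of positions, so the
        # score count grows with the square of sequence length. This ratio
        # covers attention scores only, not total model compute.
        "attention_pair_ratio": (high * high) / (low * low),
    }


if __name__ == "__main__":
    # Hypothetical 60-second spoken turn.
    print(compare(60.0))
```

For the hypothetical 60-second turn this gives 300 frames at 5 Hz versus 1,500 at 25 Hz, a 5x shorter sequence and 25x fewer pairwise attention scores. The attention figure describes only that one operation; end-to-end savings depend on the rest of the architecture, including whatever component still produces the 25 Hz output.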
Gemini 2.5 Flash Native Audio: When AI Voice Agents Cross the Threshold from "Functional" to "Actually Useful"

What fundamentally changed with Google's latest Gemini 2.5 Flash Native Audio update? The model now executes complex business workflows with 71.5% multi-step accuracy, maintains 90% instruction adherence across long conversations, and preserves speaker intonation across 70+ languages, making production deployment viable for customer service, financial services, and real-time translation.

For years, the gap between AI voice demo videos and real-world deployment has been painfully obvious. Anyone who's tested a "conversational AI" knows the familiar breaking points: "Sorry, I didn't catch that," awkward silence during …