
SoulX-Podcast: Achieving Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

4 months ago 高效码农

The Core Question This Article Answers

How can we build a system that generates natural, long-form, multi-speaker conversational speech while supporting dialect and paralinguistic control? SoulX-Podcast makes notable progress in this area by combining large language models with multi-stage data processing pipelines.

Recent advances in text-to-speech synthesis have significantly improved speech quality, but most existing systems still struggle with multi-speaker, multi-turn conversations. SoulX-Podcast is a specialized solution to this challenge. It supports Mandarin and English, along with several Chinese dialects including Sichuanese, Henanese, and Cantonese, and it can control paralinguistic features such as laughter and sighs, setting a new standard for …