The Core Question This Article Answers

How can we build a system that generates natural, long-form, multi-speaker conversational speech while supporting dialect and paralinguistic control? SoulX-Podcast addresses this question by combining a large language model backbone with a carefully designed multi-stage data processing pipeline.

Recent advances in text-to-speech synthesis have significantly improved speech quality, but most existing systems struggle with multi-speaker, multi-turn conversation scenarios. SoulX-Podcast emerges as a specialized solution to this challenge. It supports both Mandarin and English, along with several Chinese dialects including Sichuanese, Henanese, and Cantonese, while also controlling paralinguistic features like laughter and sighs—setting a new standard for podcast-style speech generation.

Imagine this scenario: an AI-generated podcast where three speakers from different regions converse naturally in their respective dialects, interspersed with appropriate laughter and reactions—what was once only imaginable is now possible with SoulX-Podcast.

Why Do We Need Specialized Multi-Speaker Conversational Speech Synthesis?

Limitations of Existing Systems

Most current text-to-speech systems are optimized for single-speaker scenarios and face numerous challenges when generating multi-speaker conversations. These systems often fail to maintain speaker timbre consistency, struggle with prosodic variations in long conversations, and lack fine-grained control over paralinguistic features.

In practical applications, podcasts, conversational agents, and audio content creation all require more natural interactive experiences. Traditional systems produce dialogue that sounds mechanical and disjointed, lacking the rhythmic flow and emotional expression characteristic of human conversation. This creates the need for speech synthesis systems specifically optimized for multi-speaker scenarios.

Author Insight: During development, we realized that conversational speech synthesis isn’t just a technical challenge; it also requires understanding the essence of human communication. Truly natural conversation involves not just voice quality but also rhythm, pauses, and emotional nuances—elements that together create the “liveliness” of dialogue.

Core Features of SoulX-Podcast

Unique System Value

SoulX-Podcast’s core advantages over traditional speech synthesis systems manifest in three areas: long-form dialogue stability, dialect diversity support, and paralinguistic control capabilities.

The system can continuously generate over 90 minutes of conversational audio while maintaining high speaker timbre consistency and smooth speaker transitions. In actual testing, generated conversations demonstrate contextually adaptive prosodic features, reflecting natural rhythm and intonation changes as dialogues progress.

Beyond Mandarin and English, support for Sichuanese, Henanese, and Cantonese dialects enables more personalized voice generation. More importantly, all these dialects support cross-dialect zero-shot voice cloning, where a single audio prompt can generate speech in any supported dialect.

Application Example: A media company could use SoulX-Podcast to generate podcast content in regional dialects, requiring only a host’s Mandarin sample to produce a version of the program where the host speaks Sichuanese, significantly lowering the barrier for multilingual content production.

[Figure: Performance radar chart showing SoulX-Podcast’s performance across multiple evaluation dimensions]

Data Processing Pipeline: From Raw Audio to Training-Ready Data

The Core Question This Section Answers

How do we extract high-quality, well-annotated training data from messy real-world conversation recordings?

Data processing forms the foundation for building high-quality speech synthesis systems. SoulX-Podcast’s data processing pipeline is carefully designed and optimized specifically for conversational data from real-world scenarios.

Audio Preprocessing and Enhancement

Real-world conversation recordings typically contain background music or noise, which negatively impacts subsequent transcription and speaker diarization tasks. The solution uses UVR-MDX-based vocal separation tools to remove background audio and noise, then normalizes the processed signals to consistent amplitude levels.

Practical Example: For an interview recording with slight background music, the system first identifies and separates vocals from accompaniment, retaining only the clean vocal portion for subsequent processing. This step ensures the model learns pure speech characteristics rather than irrelevant audio artifacts.
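
As a concrete illustration of the normalization step, here is a minimal Python sketch that scales an already vocal-separated recording to a consistent peak level. The file names and target level are illustrative assumptions; the report does not prescribe a specific normalization method or library.

```python
# Minimal sketch of the amplitude-normalization step that follows vocal separation.
import numpy as np
import soundfile as sf

def normalize_to_peak(in_path: str, out_path: str, target_peak: float = 0.95) -> None:
    """Scale an already vocal-separated recording to a consistent peak amplitude."""
    audio, sr = sf.read(in_path)               # float waveform, shape (T,) or (T, channels)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (target_peak / peak)
    sf.write(out_path, audio, sr)

# Example: normalize the clean vocal stem produced by a UVR-MDX style separator.
# normalize_to_peak("interview_vocals.wav", "interview_vocals_norm.wav")
```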

Segmentation and Speaker Diarization

Processing long-form conversation recordings (e.g., over 30 minutes) poses challenges for traditional speaker diarization systems. As speakers’ vocal characteristics and speaking states change over time, diarization models may incorrectly assign multiple speaker identities to the same person, causing speaker count and continuity issues.

The solution employs a multi-stage processing pipeline: first applying voice activity detection to split long recordings into short utterances, then concatenating these into approximately five-minute conversation segments. During this process, silence duration constraints are enforced—if inter-utterance silence exceeds a predetermined threshold, adjacent utterances are treated as the start and end of separate segments.

Finally, a Sortformer-based diarization model detects speaker boundaries and assigns speaker labels, generating reliable speaker turn annotations for subsequent processing.
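
The segment-building logic can be sketched as follows. This is a simplified illustration rather than the actual implementation: the 300-second target length and 2-second silence threshold are assumed values, and real utterance boundaries would come from a dedicated VAD model.

```python
# Group VAD utterances into conversation segments with a silence-duration constraint.
from dataclasses import dataclass

@dataclass
class Utterance:
    start: float  # seconds
    end: float    # seconds

def build_segments(utterances, max_segment_len=300.0, max_silence=2.0):
    """Concatenate utterances into roughly five-minute segments, starting a new
    segment whenever the inter-utterance silence exceeds the threshold."""
    segments, current = [], []
    for utt in utterances:
        if current:
            silence = utt.start - current[-1].end
            segment_len = utt.end - current[0].start
            if silence > max_silence or segment_len > max_segment_len:
                segments.append(current)
                current = []
        current.append(utt)
    if current:
        segments.append(current)
    return segments
```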

[Figure: Data processing pipeline, showing the complete workflow from raw conversation recordings to structured training data]

Quality Filtering and Speech Recognition

Although audio enhancement occurs in the initial stage, some segments may still exhibit suboptimal denoising results or inherently poor recording quality. To prevent such low-quality data from negatively impacting model training, multiple filtering criteria are applied to conversation segments, including signal-to-noise ratio and perceptual quality estimated by DNSMOS.

The speech recognition phase employs a dual-ASR transcription strategy to obtain reliable transcripts. Each utterance is transcribed by two independent ASR models: Mandarin speech uses Paraformer and Whisper, while English speech uses Parakeet and Whisper. For each utterance, two transcription results are obtained, and character error rate for Chinese or word error rate for English is calculated between them. Utterances with error rates below a predefined threshold are fully retained, while those exceeding the threshold keep only text transcriptions, with corresponding audio discarded.
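
A hedged sketch of the dual-ASR consistency check is shown below. It assumes the jiwer library for error-rate computation and an illustrative 5% threshold; the report does not specify the exact threshold.

```python
# Keep an utterance's audio only if two independent ASR hypotheses agree closely.
import jiwer

def keep_utterance_audio(hyp_a: str, hyp_b: str, lang: str, threshold: float = 0.05) -> bool:
    """CER is used for Chinese and WER for English, as described above."""
    error = jiwer.cer(hyp_a, hyp_b) if lang == "zh" else jiwer.wer(hyp_a, hyp_b)
    return error <= threshold
```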

Author Insight: The most profound lesson from data cleaning is that high-quality training data requires not just quantity but also consistency and purity. We found that even a small proportion of low-quality data disproportionately negatively impacts final model stability. This prompted us to establish a strict multi-stage filtering pipeline.

Speaker Purity Optimization

To ensure speaker label consistency, speaker purity optimization is performed based on speaker embedding clustering. For each conversation segment, embeddings of all utterances from the same speaker are clustered, and utterances whose embeddings deviate excessively from cluster centroids are identified as outliers. These outlier utterances are excluded from audio data—only their text transcriptions are retained. This strategy effectively mitigates potential speaker confusion during multi-turn conversation synthesis while maximizing overall data retention.
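
The outlier-filtering idea can be sketched with a simple centroid-distance rule over pre-computed speaker embeddings. The cosine-distance threshold below is an illustrative assumption, not a value from the report.

```python
# Flag utterances whose speaker embeddings drift too far from the speaker centroid.
import numpy as np

def flag_speaker_outliers(embeddings, max_cos_dist=0.35):
    """embeddings: (N, D) array of speaker embeddings for one speaker within a segment.
    Returns a boolean mask where True marks an outlier utterance (audio dropped, text kept)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    cosine_sim = unit @ centroid
    return (1.0 - cosine_sim) > max_cos_dist
```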

Paralinguistic and Dialect Annotation

Paralinguistic cues (like laughter and sighs) play crucial roles in enhancing conversation naturalness and expressiveness. To enable controllable generation of such cues, paralinguistic mining and annotation are performed on collected data.

Paralinguistic annotation uses a two-stage refinement framework combining high-throughput automatic detection with model-assisted verification. The first stage employs language-specific ASR models fine-tuned for paralinguistic event detection to process raw audio corpora. Chinese data uses Beats for coarse nonverbal cue identification, while English data uses Whisperd. The second stage uses the Gemini-2.5-Pro API for model-driven verification and fine-grained annotation, generating precise time-aligned annotations with corresponding text.
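
The two-stage flow can be summarized in a skeleton like the one below. The detector and verifier are placeholder stubs standing in for the Beats/Whisperd-based detection and the Gemini-2.5-Pro verification described above; they return dummy values here so the sketch stays self-contained.

```python
# Two-stage paralinguistic annotation: coarse detection followed by verification.
from dataclasses import dataclass
from typing import List

@dataclass
class ParaEvent:
    label: str   # e.g. "<|laughter|>"
    start: float
    end: float

def coarse_detect_events(audio_path: str) -> List[ParaEvent]:
    """Stage 1 placeholder: a recall-oriented detector. Returns a dummy candidate here;
    a real pipeline would run the fine-tuned detection models."""
    return [ParaEvent("<|laughter|>", 12.3, 13.1)]

def verify_event(audio_path: str, event: ParaEvent) -> bool:
    """Stage 2 placeholder: model-driven verification. Always confirms here;
    a real pipeline would query the verifier model with the audio span."""
    return True

def annotate_paralinguistics(audio_path: str) -> List[ParaEvent]:
    """Keep only the candidates that the second-stage verifier confirms."""
    return [e for e in coarse_detect_events(audio_path) if verify_event(audio_path, e)]
```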

For dialect annotation, two complementary strategies are employed: collecting publicly available recordings in specific Chinese dialects, and training dialect identification models to retrieve and categorize dialectal utterances from broader real-world datasets. This ultimately yielded approximately 2,000 hours of Sichuanese, 1,000 hours of Cantonese, and 500 hours of Henanese speech.

Application Example: In virtual assistant scenarios, the system can automatically insert appropriate laughter or sighs based on conversation content, making interactions more natural. For example, when a user tells a joke, the assistant can generate responses with genuine laughter instead of mechanical text responses.

SoulX-Podcast Model Architecture and Technical Implementation

The Core Question This Section Answers

How does SoulX-Podcast achieve natural long-form, multi-speaker conversation generation?

SoulX-Podcast adopts a two-stage generation framework inherited from the CosyVoice series of models. Specifically, a large language model first predicts semantic tokens, which are then converted to acoustic features through flow matching and finally synthesized into waveform audio via a vocoder.
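
Conceptually, the generation pipeline can be expressed as three composed stages. The functions below are placeholders that return dummy arrays; they only illustrate how the LLM token predictor, the flow-matching acoustic model, and the vocoder fit together, not the actual model code.

```python
# Skeleton of the two-stage generation framework (LLM -> flow matching -> vocoder).
import numpy as np

def llm_predict_semantic_tokens(text_tokens):
    """Stage 1 placeholder: the LLM autoregressively predicts semantic speech tokens."""
    return [0] * 50

def flow_matching_to_acoustic(semantic_tokens):
    """Stage 2 placeholder: flow matching maps semantic tokens to acoustic features."""
    return np.zeros((len(semantic_tokens) * 2, 80))

def vocoder_to_waveform(acoustic_features):
    """Stage 3 placeholder: a vocoder renders acoustic features into a waveform."""
    return np.zeros(acoustic_features.shape[0] * 256)

# Compose the stages: text tokens -> semantic tokens -> acoustic features -> waveform.
waveform = vocoder_to_waveform(flow_matching_to_acoustic(llm_predict_semantic_tokens([1, 2, 3])))
```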

Token Organization and Sequence Construction

To enable flexible multi-turn conversation generation, the system employs text-speech interleaved sequences allowing sentence-by-sentence synthesis. Specifically, each speaker’s text tokens are followed by corresponding speech tokens, then concatenated chronologically with the next speaker’s text and speech tokens. Each utterance begins with a speaker token indicating speaker identity.

Dialect control is achieved by inserting dialect-specific tokens immediately after speaker tokens, while paralinguistic cues (like laughter, sighs) are treated as text tokens placed at corresponding positions within the sequence.

Sequence Example:

<SPEAKER1><Sichuan><Text Tokens><Audio Tokens><SPEAKER2><Sichuan><Text Tokens><Audio Tokens><SPEAKER3><...>

This organization enables the model to understand conversation turn-taking structure while maintaining consistency for each speaker’s characteristics and dialect style.
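
A minimal sketch of this sequence layout, using readable string tokens in place of real vocabulary ids, might look like the following; the field names and helper function are illustrative.

```python
def build_interleaved_sequence(turns):
    """turns: list of dicts like
    {"speaker": "<SPEAKER1>", "dialect": "<Sichuan>", "text": [...], "speech": [...]}."""
    sequence = []
    for turn in turns:
        sequence.append(turn["speaker"])   # speaker identity token
        sequence.append(turn["dialect"])   # dialect control token follows the speaker token
        sequence.extend(turn["text"])      # text tokens, possibly including <|laughter|> etc.
        sequence.extend(turn["speech"])    # the corresponding speech tokens
    return sequence

demo = build_interleaved_sequence([
    {"speaker": "<SPEAKER1>", "dialect": "<Sichuan>", "text": ["t1", "t2"], "speech": ["s1", "s2", "s3"]},
    {"speaker": "<SPEAKER2>", "dialect": "<Sichuan>", "text": ["t3", "t4"], "speech": ["s4", "s5"]},
])
# demo == ["<SPEAKER1>", "<Sichuan>", "t1", "t2", "s1", "s2", "s3",
#          "<SPEAKER2>", "<Sichuan>", "t3", "t4", "s4", "s5"]
```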

Training Strategy and Curriculum Learning

Conversational speech data is relatively scarce compared to monologue speech. To effectively leverage heterogeneous data patterns and enhance performance in conversation scenarios, a curriculum learning strategy is adopted.

The first stage initializes the large language model backbone from Qwen3-1.7B and trains it on a mixture of monologue and conversation data to acquire basic text-to-speech capabilities. Subsequently, the model undergoes further training on multi-speaker conversation data in both Chinese and English, incorporating dialectal and paralinguistic elements.

Since Chinese dialect data volume is significantly smaller than Mandarin and English, additional fine-tuning on dialect data enhances the model’s dialect capabilities, ultimately producing a podcast model specifically optimized for dialect generation.

To address long-form audio generation challenges, a context regularization mechanism is introduced that progressively drops historical speech tokens while retaining their textual context. This encourages the model to rely on semantic continuity rather than low-level acoustic memory, thereby improving coherence and stability in extended conversation synthesis.
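
The idea can be illustrated with a toy dropout rule over the conversation history; the age-dependent drop schedule below is an assumption for illustration, not the schedule used in training.

```python
# Context regularization sketch: older turns lose speech tokens, text is always kept.
import random

def regularize_context(turns, max_drop_prob=0.8):
    """turns: conversation history, oldest first; each turn is {"text": [...], "speech": [...]}."""
    n = len(turns)
    regularized = []
    for i, turn in enumerate(turns):
        age = (n - 1 - i) / max(n - 1, 1)   # 1.0 for the oldest turn, 0.0 for the newest
        drop_speech = random.random() < max_drop_prob * age
        regularized.append({
            "text": turn["text"],                           # textual context is retained
            "speech": [] if drop_speech else turn["speech"],
        })
    return regularized
```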

Author Insight: Adopting a curriculum learning strategy was a key success factor. We found that having the model first master basic speech synthesis capabilities, then gradually introducing complex conversation elements, worked better than directly training the complete system. This “easy first” approach significantly improved final model stability and generalization capability.

Inference and Cross-Dialect Voice Cloning

During inference, the token organization established during training is followed: initial text and speech tokens from multiple speakers are interleaved, and the model autoregressively generates subsequent speech tokens in the same interleaved manner.

Cross-dialect voice cloning faces unique challenges: unlike clear orthographic differences between Chinese and English, various Chinese dialects—particularly Mandarin, Henanese, and Sichuanese—share identical written forms. Even Cantonese, though linguistically more distinct, still exhibits substantial textual overlap with Mandarin. Consequently, when target text closely resembles Mandarin and speech prompts are also in Mandarin, dialect control signals become weak.

To address this, a Dialect-Guided Prompting inference strategy is proposed. Specifically, before generating dialectal podcasts, a short dialect-typical sentence—one strongly reflecting target dialect style—is prepended to the input text. This initial utterance effectively guides the model toward producing speech with desired dialect characteristics in subsequent generations.
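
In practice, this amounts to prepending a guide utterance to the text fed to the model. The sketch below shows the idea; the guide sentences are illustrative examples rather than prompts taken from the report, and the Cantonese and Henanese token names are assumed by analogy with the <Sichuan> token shown earlier.

```python
# Dialect-Guided Prompting sketch: prepend a strongly dialectal opener to the target text.
DIALECT_GUIDES = {
    "<Sichuan>":   "巴适得很，今天我们就摆一哈龙门阵。",   # Sichuanese-flavored opener (illustrative)
    "<Cantonese>": "大家好，今日我哋倾下呢个话题。",       # Cantonese-flavored opener (illustrative)
    "<Henan>":     "中，咱今儿个就好好唠唠这事儿。",       # Henanese-flavored opener (illustrative)
}

def apply_dialect_guided_prompting(dialect_token: str, target_text: str) -> str:
    """Prepend a dialect-typical sentence so the first generated utterance pulls the
    model toward the target dialect; it can be trimmed from the output audio afterwards."""
    guide = DIALECT_GUIDES.get(dialect_token, "")
    return (guide + " " + target_text).strip()
```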

[Figure: SoulX-Podcast’s inference process, supporting cross-dialect prompting]

Performance Evaluation and Real-World Results

The Core Question This Section Answers

How does SoulX-Podcast perform across different tasks and scenarios?

Although SoulX-Podcast is designed for multi-turn, multi-speaker conversation synthesis, it can also handle traditional monologue speech synthesis tasks. We first compare its performance with state-of-the-art TTS models on standard monologue synthesis tasks, then evaluate its capabilities in conversation generation, paralinguistic control, and dialect synthesis.

Monologue Speech Generation

To evaluate SoulX-Podcast’s zero-shot voice cloning TTS capability, performance is assessed on Seed-TTS-eval and compared with existing zero-shot TTS models. Speech intelligibility is measured using CER for Chinese and WER for English, while speaker similarity is quantified via cosine similarity of speaker embeddings.

Evaluation results show SoulX-Podcast demonstrates significant superiority in intelligibility for zero-shot monologue TTS scenarios. Specifically, SoulX-Podcast achieves the lowest CER in the Chinese test set. In the English test set, SoulX-Podcast ranks just behind F5-TTS. In speaker similarity, SoulX-Podcast also achieves strong results, ranking just after Seed-TTS and MaskGCT on both Chinese and English test sets, demonstrating excellent performance in traditional zero-shot TTS.

| Model | test-zh CER (↓) | test-zh SIM (↑) | test-en WER (↓) | test-en SIM (↑) |
| --- | --- | --- | --- | --- |
| Seed-TTS | 1.12 | 0.796 | 2.25 | 0.762 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 |
| F5-TTS | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice2 | 1.45 | 0.748 | 2.57 | 0.652 |
| SoulX-Podcast | 1.10 | 0.743 | 1.91 | 0.661 |
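
For reference, the two measures behind these numbers, transcript error rate and speaker-embedding cosine similarity, can be computed roughly as follows (assuming the jiwer library and embeddings from any speaker-verification model):

```python
# Sketch of the evaluation measures: CER/WER for intelligibility, cosine similarity for SIM.
import numpy as np
import jiwer

def intelligibility(reference: str, hypothesis: str, lang: str) -> float:
    """CER for Chinese, WER for English, following the evaluation described above."""
    return jiwer.cer(reference, hypothesis) if lang == "zh" else jiwer.wer(reference, hypothesis)

def speaker_similarity(emb_prompt, emb_generated) -> float:
    """Cosine similarity between the prompt and generated-speech speaker embeddings."""
    emb_prompt, emb_generated = np.asarray(emb_prompt), np.asarray(emb_generated)
    return float(np.dot(emb_prompt, emb_generated) /
                 (np.linalg.norm(emb_prompt) * np.linalg.norm(emb_generated)))
```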

Podcast Generation Capabilities

To evaluate multi-turn, multi-speaker conversation generation, SoulX-Podcast is compared with representative conversation TTS systems on the ZipVoice-Dia test set. This benchmark contains natural multi-turn conversations, enabling assessment of intelligibility and cross-speaker consistency in long-form synthesis.

Results show SoulX-Podcast outperforms recent state-of-the-art models on both Chinese and English subsets. Specifically, it achieves the lowest WER/CER and highest cross-speaker similarity while maintaining competitive UTMOS scores, demonstrating superior speaker consistency and perceived quality.

| Model | CER (zh, ↓) | cpSIM (zh, ↑) | UTMOS (zh, ↑) | WER (en, ↓) | cpSIM (en, ↑) | UTMOS (en, ↑) |
| --- | --- | --- | --- | --- | --- | --- |
| ZipVoice-Dia | 3.39 | 0.553 | 2.24 | 3.32 | 0.438 | 3.10 |
| MoonCast | 27.43 | 0.441 | 1.76 | 23.62 | 0.356 | 2.30 |
| SoulX-Podcast | 2.2 | 0.599 | 2.09 | 2.27 | 0.484 | 2.96 |

Application Example: Online education platforms could use SoulX-Podcast to generate multi-teacher conversational courses where different subject teachers with unique voice styles and dialects participate in discussions, creating more engaging learning experiences.

Paralinguistic Control Evaluation

To evaluate the proposed model’s controllable paralinguistic generation capability, a dedicated paralinguistic test set is constructed. Using a large language model, 20 test utterances are generated for each of five paralinguistic labels: <|laughter|>, <|sigh|>, <|breathing|>, <|coughing|>, and <|throat_clearing|>. Corresponding audio samples are then synthesized using SoulX-Podcast in monologue speech synthesis mode.

For objective evaluation, the Qwen-2.5 Omni-FT model serves as an automatic paralinguistic recognizer. The evaluator’s task is to verify whether each synthesized utterance contains the target paralinguistic event specified in the prompt. Resulting recognition accuracies are summarized below:

| Label | Count | Correct | Error | Accuracy |
| --- | --- | --- | --- | --- |
| laughter | 20 | 20 | 0 | 1.00 |
| sigh | 20 | 17 | 3 | 0.85 |
| breathing | 20 | 15 | 5 | 0.75 |
| coughing | 20 | 14 | 6 | 0.70 |
| throat_clearing | 20 | 16 | 4 | 0.80 |
| Total/Average | 100 | 82 | 18 | 0.82 |

The model achieves a strong overall accuracy of 0.82 in controlling these paralinguistic events. It demonstrates near-perfect control over acoustically distinctive events like <|laughter|> and high fidelity for <|sigh|> and <|throat_clearing|>. The primary errors are concentrated in more acoustically subtle or ambiguous events, namely <|breathing|> and <|coughing|>, which may also be harder for the evaluator model to distinguish.

Dialect Generation Capabilities

SoulX-Podcast currently supports three major Chinese dialects: Sichuanese, Henanese, and Cantonese. Performance on these dialects is evaluated in both monologue TTS and conversation generation settings.

The monologue test set includes 1,000 samples per dialect from internal LLM-generated data plus SeedTTS, Wenetspeech-Yue-eval, and Wenetspeech-Chuan-eval. The conversation test set contains 100 LLM-generated items per dialect.

Dialect-specific ASR systems compute CER, including Wenetspeech-Chuan-ASR for Sichuanese, TeleSpeech for Henanese, and Wenetspeech-Yue-ASR for Cantonese. SoulX-Podcast achieves consistent speaker similarity across all three dialects, comparable to its performance on Mandarin and English. Relatively high CER values may partly stem from ASR system limitations.

| Dialect | Monologue Test CER (↓) | Monologue Test SIM (↑) | Dialogue Test CER (↓) | Dialogue Test cpSIM (↑) |
| --- | --- | --- | --- | --- |
| Sichuanese | 3.75 | 0.704 | 15.42 | 0.641 |
| Henanese | 8.14 | 0.705 | 28.06 | 0.647 |
| Cantonese | 9.77 | 0.680 | 19.50 | 0.627 |

Author Insight: The most encouraging finding from performance evaluation was that models specifically optimized for conversation scenarios also excel at traditional monologue tasks. This suggests that deep understanding of conversation complexity actually enhances the model’s fundamental speech synthesis capabilities, demonstrating well-designed systems’ versatility and robustness.

Conclusion and Future Outlook

SoulX-Podcast represents significant progress in speech synthesis, particularly for multi-speaker, long-form conversation generation. Through its text-speech interleaved modeling paradigm, the system generates long-form multi-turn conversation speech with consistent quality and coherence.

Experimental results demonstrate SoulX-Podcast not only excels in multi-turn conversation synthesis but also effectively generalizes to zero-shot monologue TTS. Its ability to handle multiple Chinese dialects and paralinguistic cues further highlights its potential as a unified framework for speech generation.

Author Insight: The biggest takeaway from developing SoulX-Podcast was recognizing that truly natural speech synthesis must transcend mere voice quality to capture nuances in human communication. Adding dialect diversity and paralinguistic control aren’t just technical features but steps toward more inclusive and expressive speech technology. Future work should continue exploring how synthetic speech can better reflect the richness of human emotion and culture.

Practical Summary and Action Checklist

Key Takeaways

  • SoulX-Podcast supports generating multi-speaker conversations of more than 90 minutes while maintaining timbre stability and natural prosodic variations
  • Beyond Mandarin and English, the system supports multiple dialects including Sichuanese, Henanese, and Cantonese
  • Precise control over paralinguistic elements like laughter and sighs is achieved through special token organization
  • Two-stage training strategy starts with basic speech synthesis before introducing complex conversation elements
  • Cross-dialect voice cloning enables multiple dialect generation from a single audio prompt

Implementation Recommendations

  1. During data preparation, perform strict quality filtering and speaker purity optimization
  2. Employ curriculum learning during training: single-speaker before multi-speaker, monolingual before multilingual
  3. Use dialect-guided prompting during inference to enhance dialect control effects
  4. For long-form generation, leverage context regularization mechanisms to maintain conversation coherence

One-Page Overview

SoulX-Podcast is a large language model-based speech synthesis system specifically designed for multi-speaker, multi-turn conversation scenarios. Core innovations include:

  • Text-speech interleaved sequence modeling supporting natural conversation flow
  • Dialect diversity support covering major Chinese dialects
  • Paralinguistic control enabling generation of nonverbal elements like laughter and sighs
  • Long-form generation capability maintaining stability over 90+ minutes of conversation
  • Cross-dialect voice cloning supporting multi-dialect generation from single prompts

The system achieves state-of-the-art performance across multiple tasks including monologue TTS, conversation generation, paralinguistic control, and dialect synthesis, providing powerful tools for podcasts, virtual assistants, and audio content creation.

Frequently Asked Questions (FAQ)

What dialects does SoulX-Podcast support?
The system currently supports three major Chinese dialects: Sichuanese, Henanese, and Cantonese, plus Mandarin and English. All dialects support zero-shot voice cloning.

How is cross-dialect voice cloning achieved?
Using the Dialect-Guided Prompting strategy: before generating target dialect speech, a typical dialect sentence is prepended to the input text, effectively guiding the model to produce the desired dialect characteristics in subsequent generations.

How long can the system generate conversations?
In actual testing, SoulX-Podcast can continuously generate over 90 minutes of conversational audio while maintaining speaker timbre consistency and natural conversation fluency.

What specific paralinguistic control functions are available?
The system can control generation of laughter, sighs, breathing sounds, coughing, and throat clearing—these elements are inserted as special tokens at appropriate positions in text-speech sequences.

What’s the main difference between SoulX-Podcast and traditional TTS systems?
The main difference is specific optimization for multi-speaker conversation scenarios, supporting long-form generation, dialect diversity, and paralinguistic control, whereas traditional systems primarily focus on single-speaker scenarios.

How is training data prepared?
Real conversation data is processed through multi-stage workflows including audio enhancement, segmentation and speaker diarization, quality filtering, dual-ASR transcription verification, plus paralinguistic and dialect annotation.

How does it perform on lower-resource dialects?
For dialects with relatively less data (like Henanese), additional fine-tuning strategies enhance model capability. Although performance may slightly lag behind data-rich languages, usable quality is maintained.

What practical application scenarios is the system suitable for?
Suitable for podcast production, virtual assistants, online education, entertainment content creation, and other scenarios requiring natural multi-speaker conversations, particularly applications needing dialect diversity.


All technical details and data in this article are based on the original technical report, maintaining all technical accuracy while providing accessible interpretation.