MOSS-TTSD: Open-Source Bilingual Spoken Dialogue Synthesis for AI-Powered Podcasts
In the rapidly evolving landscape of artificial intelligence, voice technology has moved beyond simple text-to-speech conversion to sophisticated dialogue generation. MOSS-TTSD (Text to Spoken Dialogue) represents a significant advancement in this field, offering a powerful, open-source solution for creating natural-sounding conversations between two speakers. Whether you’re a content creator looking to produce AI podcasts, a developer building conversational AI, or a researcher exploring voice synthesis, MOSS-TTSD provides a robust foundation for your projects.
What is MOSS-TTSD?
MOSS-TTSD is an open-source bilingual spoken dialogue synthesis model that transforms dialogue scripts between two speakers into natural, expressive conversational speech. Unlike traditional text-to-speech (TTS) systems that simply convert written text to audio, MOSS-TTSD specializes in dialogue scenarios where two speakers take turns speaking.
Think of it as a tool that can take a script like this:
[S1]Hello, how are you today?
[S2]I'm doing great, thanks for asking!
[S1]That's wonderful to hear.
And turn it into a natural-sounding conversation between two people, complete with appropriate pauses, intonation, and vocal characteristics.
This capability makes MOSS-TTSD particularly valuable for creating AI-powered podcasts, where natural-sounding dialogue between hosts or between a host and a guest is essential for listener engagement.
Why MOSS-TTSD Stands Out
MOSS-TTSD isn’t just another TTS system—it addresses specific limitations in current voice synthesis technology through several key innovations:
1. Expressive Dialogue Generation
MOSS-TTSD generates highly expressive, human-like dialogue speech with natural conversational prosody. This means the synthesized speech doesn’t just sound like a robot reading text—it captures the emotional nuances, pauses, and rhythm of real human conversation.
The model achieves this through a combination of core technologies and large-scale training data:

- A unified semantic-acoustic neural audio codec
- A pre-trained large language model
- Millions of hours of TTS data
- 400,000 hours of synthetic and real conversational speech
These elements work together to create speech that sounds genuinely conversational rather than robotic.
2. Zero-Shot Two-Speaker Voice Cloning
One of MOSS-TTSD’s most impressive features is its ability to perform zero-shot two-speaker voice cloning. This means you don’t need to train a separate model for each speaker—you can simply provide reference audio and text for both speakers, and MOSS-TTSD will accurately mimic their voices in the generated dialogue.
For example, you could provide:
- A 30-second audio clip of Speaker 1 saying “I love learning new things”
- A 30-second audio clip of Speaker 2 saying “I find it fascinating how AI is evolving”
MOSS-TTSD would then generate a conversation where Speaker 1 and Speaker 2 sound like their respective reference samples.
3. Bilingual Support
MOSS-TTSD supports both Chinese and English, making it versatile for content creators working in multiple languages. You can create conversations that mix languages naturally or produce entirely English or Chinese content.
This is particularly valuable for global content creators who want to reach international audiences without needing separate models for each language.
4. Long-Form Speech Generation
Unlike many TTS systems that struggle with longer audio segments, MOSS-TTSD is optimized for long-form speech generation. Thanks to its low-bitrate codec and training-framework optimizations, it can produce continuous audio that flows naturally for extended periods.
This is essential for podcast production, where episodes often run for 20-60 minutes or more.
5. Open Source and Commercially Usable
MOSS-TTSD is fully open source under the Apache 2.0 license, meaning you can use it for free in commercial applications without worrying about licensing restrictions.
This open-source approach encourages community development and ensures transparency in how the model works.
MOSS-TTSD v0.5: A Significant Upgrade
On July 4, 2025, the MOSS-TTSD team released version 0.5 of the model, which brought several important improvements:
- Enhanced accuracy in timbre switching (the ability to switch between speaker voices naturally)
- Improved voice cloning capabilities
- Greater model stability
The team recommends using v0.5 as the default model for all new projects, as it addresses many of the limitations present in earlier versions.
Getting Started with MOSS-TTSD
System Requirements
MOSS-TTSD has modest system requirements, making it accessible even for users with consumer-grade hardware:
- Python 3.10+
- PyTorch 2.0+
- A GPU with at least 7GB of VRAM for standard use cases
The model is designed to be efficient with GPU memory usage, as we’ll see in the technical specifications below.
Installation Process
Setting up MOSS-TTSD is straightforward. Here’s how to get started:
Using Conda (Recommended)
# Create a new environment
conda create -n moss_ttsd python=3.10 -y
conda activate moss_ttsd
# Install dependencies
pip install -r requirements.txt
pip install flash-attn
# Download the XY Tokenizer
mkdir -p XY_Tokenizer/weights
huggingface-cli download fnlp/XY_Tokenizer_TTSD_V0 xy_tokenizer.ckpt --local-dir ./XY_Tokenizer/weights/
Using pip
If you prefer not to use Conda, you can use pip directly:
# Create a virtual environment
python -m venv moss_ttsd
source moss_ttsd/bin/activate # On Windows: moss_ttsd\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install flash-attn
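Note that the pip route installs the Python dependencies only; the XY Tokenizer weights still need to be downloaded as shown in the Conda section. Once either setup completes, a quick sanity check (a minimal sketch, assuming a CUDA-capable machine) confirms that PyTorch can see your GPU:

# Quick environment sanity check before running inference
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")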
Using MOSS-TTSD: Three Main Methods
MOSS-TTSD offers three primary ways to generate spoken dialogue:
1. Local Inference
For those who prefer working directly with the model on their local machine, MOSS-TTSD provides a command-line interface for inference.
Basic Command:
python inference.py --jsonl examples/examples.jsonl --output_dir outputs --seed 42 --use_normalize
Parameters Explained:
- --jsonl: Path to your input JSONL file containing dialogue scripts
- --output_dir: Directory where generated audio files will be saved
- --seed: Random seed for reproducibility
- --use_normalize: Enables text normalization (recommended for better results)
- --dtype: Model data type (default: bf16)
- --attn_implementation: Attention implementation (default: flash_attention_2)
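Putting these together, a fully explicit invocation that spells out the documented defaults for --dtype and --attn_implementation looks like this:

python inference.py \
  --jsonl examples/examples.jsonl \
  --output_dir outputs \
  --seed 42 \
  --use_normalize \
  --dtype bf16 \
  --attn_implementation flash_attention_2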
2. Web UI
For a more visual experience, MOSS-TTSD offers a Gradio-based web interface:
python gradio_demo.py
This command launches a local web server where you can input dialogue scripts and see the results in real-time. The interface is intuitive and doesn’t require any programming knowledge.
3. API Usage
For developers looking to integrate MOSS-TTSD into larger applications, a batch processing tool is available using the SiliconFlow API:
# Set environment variables
export SILICONFLOW_API_KEY="your_siliconflow_api_key"
export SILICONFLOW_API_BASE="https://api.siliconflow.cn/v1"
# Run batch processing
python use_api.py --jsonl_file your_data.jsonl --output_dir your_output --max_workers 8
Understanding MOSS-TTSD’s Input Formats
MOSS-TTSD supports three main input formats, each with specific use cases:
Format 1: Text-Only Input (No Voice Cloning)
This format is ideal when you don’t need voice cloning—just want to generate dialogue between two speakers with default voices.
{
  "text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content[S1]..."
}
Format 2: Separate Speaker Audio References
This format allows you to provide separate reference audio for each speaker, enabling more accurate voice cloning.
{
  "base_path": "/path/to/audio/files",
  "text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content[S1]...",
  "prompt_audio_speaker1": "path/to/speaker1_audio.wav",
  "prompt_text_speaker1": "Reference text for speaker 1 voice cloning",
  "prompt_audio_speaker2": "path/to/speaker2_audio.wav",
  "prompt_text_speaker2": "Reference text for speaker 2 voice cloning"
}
Format 3: Shared Audio Reference
This format uses a single reference audio file containing both speakers’ voices.
{
  "base_path": "/path/to/audio/files",
  "text": "[S1]Speaker 1 dialogue content[S2]Speaker 2 dialogue content[S1]...",
  "prompt_audio": "path/to/shared_reference_audio.wav",
  "prompt_text": "[S1]Reference text for speaker 1[S2]Reference text for speaker 2"
}
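If you are generating these entries programmatically, a minimal Python sketch for writing a Format 2 entry might look like this (all paths and reference texts below are placeholders):

# Minimal sketch: write a Format 2 (separate speaker references) entry
# to a JSONL file. All paths and texts here are placeholders.
import json

entry = {
    "base_path": "/path/to/audio/files",
    "text": "[S1]Hello, how are you today?[S2]I'm doing great, thanks for asking!",
    "prompt_audio_speaker1": "speaker1_reference.wav",
    "prompt_text_speaker1": "I love learning new things",
    "prompt_audio_speaker2": "speaker2_reference.wav",
    "prompt_text_speaker2": "I find it fascinating how AI is evolving",
}

with open("dialogue.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")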
Practical Example: Creating a Podcast Script
Let’s walk through a concrete example of using MOSS-TTSD to create a podcast segment.
Step 1: Prepare your dialogue script
Create a text file (podcast_script.txt) with the following content:
[S1]Welcome to the AI Insights podcast, where we explore the latest developments in artificial intelligence.
[S2]I'm your host, Alex, and today we're discussing a groundbreaking new voice synthesis model.
[S1]That's right, we're talking about MOSS-TTSD, which stands for Text to Spoken Dialogue Generation.
[S2]This model can transform dialogue scripts into natural-sounding conversations between two speakers.
[S1]And it supports both Chinese and English, making it ideal for global content creators.
[S2]Exactly! Let's dive into how it works.
Step 2: Convert to JSONL format
Create a JSONL file (podcast_script.jsonl) with:
{
  "text": "[S1]Welcome to the AI Insights podcast, where we explore the latest developments in artificial intelligence.[S2]I'm your host, Alex, and today we're discussing a groundbreaking new voice synthesis model.[S1]That's right, we're talking about MOSS-TTSD, which stands for Text to Spoken Dialogue Generation.[S2]This model can transform dialogue scripts into natural-sounding conversations between two speakers.[S1]And it supports both Chinese and English, making it ideal for global content creators.[S2]Exactly! Let's dive into how it works."
}
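Rather than writing the JSONL by hand, you can convert the plain-text script with a few lines of Python (a minimal sketch, assuming every line of the script starts with an [S1] or [S2] tag):

# Minimal sketch: flatten a tagged script into MOSS-TTSD's JSONL format
import json

with open("podcast_script.txt", encoding="utf-8") as f:
    # Concatenate the [S1]/[S2]-tagged lines into one dialogue string
    text = "".join(line.strip() for line in f if line.strip())

with open("podcast_script.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")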
Step 3: Run the inference
python inference.py --jsonl podcast_script.jsonl --output_dir podcast_outputs --use_normalize
This will generate a WAV file in the podcast_outputs directory containing your AI-generated podcast.
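To verify the result without opening an audio editor, a small sketch using Python's standard wave module can report the duration of each generated file (this assumes the model writes standard PCM WAV output; the exact file names depend on your input):

# Minimal sketch: report the duration of each generated WAV file
import glob
import wave

for path in sorted(glob.glob("podcast_outputs/*.wav")):
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        print(f"{path}: {duration:.1f}s at {w.getframerate()} Hz")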
Understanding GPU Requirements
One of MOSS-TTSD’s strengths is its efficiency with GPU resources. The model is designed to run on consumer-grade hardware, making it accessible to a wide range of users.
When generating 600 seconds of audio at default bf16 precision, MOSS-TTSD uses less than 7GB of VRAM. This means it can run on most modern GPUs, including those found in laptops and desktops.
You can estimate the VRAM needed for a specific audio length using this formula:

VRAM ≈ 0.00172 × L + 5.8832

Where:

- L = Desired audio length in seconds
- VRAM = Estimated VRAM cost in GB
| Audio Length (Seconds) | VRAM Cost (GB) |
|---|---|
| 120 | 6.08 |
| 300 | 6.39 |
| 360 | 6.50 |
| 600 | 6.91 |
Note: If your reference audio prompts (e.g., prompt_audio_speaker1) are longer than the default examples, VRAM usage will be higher.
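The estimate is easy to reproduce in code. Here is a minimal sketch of the formula above (coefficients taken from this section; real usage also grows with prompt length):

# Minimal sketch of the VRAM estimate from this section
def estimate_vram_gb(audio_seconds: float) -> float:
    return 0.00172 * audio_seconds + 5.8832

for seconds in (120, 300, 360, 600, 1800):
    print(f"{seconds:>5}s -> {estimate_vram_gb(seconds):.2f} GB")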
Podcast Generation with Podever
MOSS-TTSD includes a specialized tool called Podever that simplifies podcast creation from various sources:
Converting Web Content to Podcast
python podcast_generate.py "https://www.open-moss.com/cn/moss-ttsd/"
This command will automatically:
- Extract the content from the specified URL
- Generate a podcast script
- Convert the script into natural-sounding dialogue using MOSS-TTSD
Converting PDFs to Podcast
python podcast_generate.py "examples/Attention Is All You Need.pdf"
This converts a PDF document into a podcast, making it ideal for turning academic papers or long-form articles into accessible audio content.
Generating English Podcasts
python podcast_generate.py "your_input" -l en
The -l en flag generates the podcast in English instead of the default Chinese.
Fine-Tuning MOSS-TTSD for Your Needs
MOSS-TTSD includes tools for fine-tuning the model with your own data, allowing you to customize it for specific use cases.
Setting Up the Fine-Tuning Environment
# Create a new environment
conda create -n moss_ttsd_finetune python=3.10 -y
conda activate moss_ttsd_finetune
# Install dependencies
pip install -r finetune/requirements_finetune.txt
pip install flash-attn
Data Preparation
MOSS-TTSD supports two main data formats for fine-tuning:
Format 1: Single Audio File with Full Transcript
{
  "file_path": "/path/to/audio.wav",
  "full_transcript": "[S1]Speaker content[S2]Speaker content..."
}
Format 2: Separate Reference and Main Audio Files
{
  "reference_audio": "/path/to/reference.wav",
  "reference_text": "[S1]Reference content for voice cloning[S2]Reference content for voice cloning",
  "audio": "/path/to/main.wav",
  "text": "[S1]Speaker content[S2]Speaker content..."
}
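If your dataset is a folder of audio files with matching transcripts, a minimal sketch like the following can assemble a Format 1 JSONL (the .wav/.txt pairing convention is an assumption; transcripts must already carry [S1]/[S2] tags):

# Minimal sketch: build a Format 1 fine-tuning JSONL from paired
# .wav/.txt files. The pairing convention here is an assumption.
import json
from pathlib import Path

data_dir = Path("/path/to/dataset")
with open("finetune_data.jsonl", "w", encoding="utf-8") as out:
    for wav in sorted(data_dir.glob("*.wav")):
        transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        record = {"file_path": str(wav), "full_transcript": transcript}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")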
Data Preprocessing
python finetune/data_preprocess.py --jsonl <path_to_jsonl> --model_path <path_to_model> --output_dir <output_directory> --data_name <data_name> [--use_normalize]
Full Model Fine-Tuning
python finetune/finetune.py --model_path <path_to_model> --data_dir <path_to_processed_data> --output_dir <output_directory> --training_config <training_config_file>
LoRA Fine-Tuning (Memory Efficient)
python finetune/finetune.py --model_path <path_to_model> --data_dir <path_to_processed_data> --output_dir <output_directory> --training_config <training_config_file> --lora_config <lora_config_file> --lora
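The expected contents of the LoRA config file aren't documented in this overview, so the snippet below is only a hypothetical illustration using field names common to PEFT-style LoRA setups; check the finetune/ directory in the repository for the actual schema:

# Hypothetical LoRA config: the field names are assumptions based on
# common PEFT-style setups, not the repo's documented schema.
import json

lora_config = {
    "r": 16,             # low-rank dimension (assumed field)
    "lora_alpha": 32,    # scaling factor (assumed field)
    "lora_dropout": 0.05,
}

with open("lora_config.json", "w", encoding="utf-8") as f:
    json.dump(lora_config, f, indent=2)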
Practical Applications of MOSS-TTSD
1. AI-Powered Podcast Creation
MOSS-TTSD’s primary application is in creating AI-generated podcasts. With the Podever tool, content creators can transform blog posts, research papers, or even entire books into engaging audio content without needing to record themselves.
This democratizes podcast creation, allowing anyone with text content to produce professional-quality audio.
2. Language Learning Tools
MOSS-TTSD can generate natural conversations between two speakers in Chinese and English, making it ideal for language learners. Students can listen to realistic dialogues and practice their listening comprehension skills.
3. Voice Assistant Training
Developers building voice assistants can use MOSS-TTSD to generate high-quality training data for conversational AI systems, improving the naturalness of their voice interfaces.
4. Accessibility Applications
For individuals with speech impairments or language barriers, MOSS-TTSD can help create natural-sounding voice output that matches their intended communication style.
Understanding MOSS-TTSD’s Limitations
While MOSS-TTSD is powerful, it’s important to understand its current limitations:
- Speaker Switching Errors: Occasionally, the model may misidentify speaker turns, leading to unnatural voice switching.
- Voice Cloning Deviations: Voice cloning isn’t always perfect, particularly with complex or emotional speech.
- Resource Requirements: While efficient, the model still requires a GPU for optimal performance.
The development team is actively working on addressing these limitations in future releases.
Frequently Asked Questions
Q: What’s the difference between MOSS-TTSD and traditional TTS?
A: Traditional TTS systems convert written text to speech but don’t handle dialogue between multiple speakers. MOSS-TTSD specializes in generating natural conversations between two speakers with accurate voice switching.
Q: Can I use MOSS-TTSD to clone my own voice?
A: Yes, MOSS-TTSD supports voice cloning using reference audio. You’ll need to provide a short audio clip of your voice along with corresponding text for the model to learn your vocal characteristics.
Q: How long does it take to generate a 30-minute podcast?
A: That depends on your hardware. For a 30-minute episode (1,800 seconds of audio), the VRAM formula above gives:

- VRAM: 0.00172 × 1800 + 5.8832 ≈ 8.98 GB
- Time: Roughly 1-3 minutes per minute of audio, depending on your hardware.
Q: Can I use MOSS-TTSD for commercial projects?
A: Yes, MOSS-TTSD is fully open source under the Apache 2.0 license, which allows for free commercial use.
Q: What’s the best way to get started with MOSS-TTSD?
A: The easiest way to get started is by using the Web UI:
- Install the dependencies
- Run python gradio_demo.py
- Enter a dialogue script in the interface
- Generate and listen to your conversation
Q: How does MOSS-TTSD handle mixed-language dialogue?
A: MOSS-TTSD can seamlessly handle mixed-language dialogue. For example, you can have Speaker 1 speak in English while Speaker 2 speaks in Chinese, and the model will maintain the language switching naturally.
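For instance, a mixed-language input entry might look like this (the dialogue content is purely illustrative):

{"text": "[S1]Welcome back to the show! Today we have a bilingual episode.[S2]没错，我们今天会用中英文两种语言来讨论语音合成。[S1]Exactly, and the model handles the switch naturally."}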
Q: Can I fine-tune MOSS-TTSD with my own voice?
A: Yes, MOSS-TTSD includes fine-tuning capabilities that allow you to customize the model with your own voice samples. This requires preparing a dataset of your voice and following the fine-tuning workflow.
Technical Deep Dive: How MOSS-TTSD Works
MOSS-TTSD’s architecture combines several cutting-edge technologies to create natural-sounding dialogue:
- Unified Semantic-Acoustic Neural Audio Codec: This component efficiently encodes and decodes audio while preserving the semantic meaning of the speech.
- Pre-Trained Large Language Model: Provides the contextual understanding needed for natural dialogue flow.
- Massive Training Data: The model was trained on millions of hours of TTS data and 400,000 hours of synthetic and real conversational speech.
- Optimized Training Framework: The training process specifically focuses on long-form speech generation, making it suitable for podcasts and other extended audio content.
This combination allows MOSS-TTSD to generate dialogue that sounds natural, with appropriate pauses, intonation, and speaker switching.
Real-World Example: Creating a Podcast from a Research Paper
Let’s walk through a practical example of using MOSS-TTSD to create a podcast from an academic paper.
Step 1: Obtain the research paper
Download the “Attention Is All You Need” PDF (included in the repository’s examples directory).
Step 2: Convert to podcast
python podcast_generate.py "examples/Attention Is All You Need.pdf"
Step 3: Review the output
The tool will:
- Extract key sections from the paper
- Generate a natural-sounding dialogue script
- Convert the script to audio using MOSS-TTSD
- Save the final podcast file
Step 4: Listen to your new podcast
You’ll now have a professional-sounding podcast that explains the key concepts of the paper in an engaging, conversational format.
The Future of Dialogue Synthesis
The development of models like MOSS-TTSD represents a significant step toward more natural and engaging AI-generated audio. As the technology continues to evolve, we can expect:
- Improved accuracy in speaker switching
- More natural emotional expression
- Better handling of complex or emotional speech
- Enhanced support for additional languages
- Further optimization for lower resource requirements
These improvements will make AI-generated audio even more accessible and valuable for content creators, educators, and developers.
Conclusion
MOSS-TTSD represents a significant advancement in spoken dialogue synthesis, offering a powerful, open-source solution for creating natural-sounding conversations between two speakers. With its bilingual support, voice cloning capabilities, and efficient resource usage, it opens up new possibilities for content creators, developers, and researchers.
Whether you’re looking to create AI-powered podcasts, build conversational AI applications, or explore the latest in voice synthesis technology, MOSS-TTSD provides a robust foundation to work with. The model’s open-source nature and commercial-friendly license make it accessible to a wide range of users, from individual creators to large organizations.
By using MOSS-TTSD, you’re not just getting a tool—you’re joining a community of developers and creators who are shaping the future of spoken dialogue technology. As the model continues to evolve, it will become even more powerful and versatile, further lowering the barriers to entry for high-quality voice synthesis.
To get started with MOSS-TTSD, visit the Hugging Face Model Page or check out the English Blog for detailed tutorials and examples. The MOSS-TTSD team is actively working on improvements, so be sure to check back regularly for new features and enhancements.