WenetSpeech-Yue: A Large-Scale Cantonese Speech Corpus with Multi-Dimensional Annotation

Why Cantonese Speech Processing Demands Large-Scale Annotated Resources

Cantonese, spoken by approximately 84.9 million native speakers worldwide, presents unique challenges for speech processing due to its rich tone system of nine tones in six categories, coexistence of literary and colloquial forms, and frequent code-switching with English. Despite its linguistic complexity and cultural significance, Cantonese has remained severely under-resourced in speech technology compared to major languages. The development of WenetSpeech-Yue addresses this critical gap by providing the largest open-source Cantonese speech corpus with comprehensive multi-dimensional annotations.

The WenetSpeech-Pipe Framework: Building High-Quality Speech Datasets

How can we efficiently build large-scale speech corpora with rich annotations?

WenetSpeech-Pipe provides an integrated pipeline for constructing large-scale speech corpora with multi-dimensional annotation tailored for both speech understanding and generation. This modular framework comprises six specialized modules that work in concert to transform raw audio into richly annotated, machine-learning-ready data.

The pipeline begins with Audio Collection, gathering in-the-wild speech recordings across diverse domains including storytelling, drama, commentary, vlogs, food, entertainment, news, and education. These long recordings are segmented into utterance-level clips using voice activity detection (VAD), creating a foundation for subsequent processing stages.
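The paper does not tie this stage to a specific VAD toolkit, but a minimal segmentation sketch using the open-source Silero VAD (a stand-in assumption; file paths are placeholders) illustrates how long recordings can be cut into utterance-level clips:

import torch

# Load Silero VAD from torch.hub (assumed stand-in for the pipeline's VAD stage)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio("long_recording.wav", sampling_rate=16000)  # placeholder path

# Detect speech regions; each entry gives start/end sample indices of one utterance
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Write each detected region out as an utterance-level clip
for i, ts in enumerate(speech_timestamps):
    save_audio(f"clip_{i:05d}.wav", collect_chunks([ts], wav), sampling_rate=16000)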

Figure: The complete WenetSpeech-Pipe processing workflow.

Speaker Attributes Annotation: Enabling Multi-Speaker Modeling

To support multi-speaker modeling and style-aware synthesis, WenetSpeech-Pipe incorporates a dedicated speaker attributes annotation stage. Using the pyannote toolkit for speaker diarization, the system assigns local speaker labels to segments from the same source, providing intra-recording speaker separation. The Vox-Profile tool then estimates age and gender for each segment, creating comprehensive speaker metadata that facilitates both supervised and style-controllable speech modeling.
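A minimal diarization sketch with pyannote.audio shows how per-recording speaker labels can be obtained; the pretrained pipeline name, access token, and audio path are placeholders, and the Vox-Profile age/gender step is not reproduced here:

from pyannote.audio import Pipeline

# Pretrained diarization pipeline (model name and HF token are placeholders)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)

diarization = pipeline("recording.wav")

# Local (intra-recording) speaker labels with their time spans
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")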

Speech Quality Assessment: Ensuring High-Fidelity Audio

For tasks requiring high-fidelity audio such as TTS and voice conversion, WenetSpeech-Pipe implements a rigorous quality assessment stage. Each segment undergoes three complementary evaluations: Brouhaha for signal-to-noise ratio (SNR) measurements, DNSMOS for perceptual quality scoring (MOS), and bandwidth detection for spectral characteristics analysis. These measures generate structured quality annotations with quantitative scores and spectral references, enabling downstream filtering for high-quality applications.

Multi-System ASR Transcription: Leveraging Complementary Strengths

Recognizing that single ASR systems exhibit systematic biases and error patterns, WenetSpeech-Pipe employs a multi-system ensemble approach. Each audio segment is independently transcribed using three high-performance Cantonese ASR systems: SenseVoice, Whisper, and TeleASR. These systems differ in architecture, training data, and optimization objectives, producing complementary error profiles and diverse linguistic hypotheses that form the foundation for subsequent fusion and refinement.
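As an illustration only (TeleASR is omitted because its public interface is not covered here, and the exact Whisper and SenseVoice checkpoints used in the pipeline may differ), collecting complementary hypotheses from two of the named systems might look like this:

import whisper
from funasr import AutoModel

wav_path = "segment.wav"  # placeholder utterance-level clip

# Hypothesis 1: OpenAI Whisper (variant and language forcing are assumptions)
whisper_model = whisper.load_model("medium")
hyp_whisper = whisper_model.transcribe(wav_path)["text"]

# Hypothesis 2: SenseVoice via FunASR (model id from the public SenseVoice release)
sensevoice = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0")
hyp_sensevoice = sensevoice.generate(input=wav_path, language="yue", use_itn=True)[0]["text"]

hypotheses = [hyp_whisper, hyp_sensevoice]  # later normalized and fused by ROVER-style voting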

Text Postprocessing: Standardizing Diverse Outputs

The raw transcriptions from different ASR systems exhibit significant variations in character sets (traditional vs. simplified Chinese), inclusion of non-lexical tags, and formatting inconsistencies. To ensure reliable cross-system alignment, WenetSpeech-Pipe applies a comprehensive text postprocessing pipeline that includes:

  • Traditional-to-simplified Chinese conversion using OpenCC
  • Punctuation and special symbol removal
  • Numerical expression and date standardization through rule-based rewriting
  • Whitespace insertion between Cantonese and English words for bilingual modeling
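A minimal sketch of these normalization steps, using the OpenCC Python bindings; the punctuation handling is simplified and the number/date rewriting rules are omitted, so this is not the pipeline's exact rule set:

import re
from opencc import OpenCC  # e.g. the opencc-python-reimplemented package

t2s = OpenCC("t2s")  # traditional -> simplified conversion

def normalize(text: str) -> str:
    text = t2s.convert(text)
    # Drop punctuation, special symbols, and non-lexical tags (simplified rule)
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9\s]", " ", text)
    # Insert whitespace at Cantonese/English boundaries for bilingual modeling
    text = re.sub(r"([\u4e00-\u9fff])([A-Za-z0-9])", r"\1 \2", text)
    text = re.sub(r"([A-Za-z0-9])([\u4e00-\u9fff])", r"\1 \2", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("佢今日去咗Supermarket買嘢。"))  # -> "佢今日去咗 Supermarket 买嘢"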

Figure: Text normalization process across the different ASR systems.

Recognizer Output Voting: Achieving Consensus Through Fusion

Despite text normalization, variations persist in lexical selection, word segmentation, and phonetic representation. WenetSpeech-Pipe adopts and extends the Recognizer Output Voting Error Reduction (ROVER) framework to generate unified, high-accuracy reference transcriptions. The implementation includes:

  • Dynamic programming alignment of normalized transcriptions
  • Candidate filtering based on edit distance thresholds to exclude outlier hypotheses
  • Frequency-based word selection at each aligned position
  • Pronunciation-level confidence measures for Cantonese pinyin
  • LLM-powered refinement using Qwen3-4B for context-aware corrections
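The real implementation performs dynamic-programming alignment, pronunciation-aware confidence estimation, and LLM refinement; the toy sketch below shows only the core frequency-voting idea, over hypotheses that are assumed to be pre-aligned to equal length:

from collections import Counter

def vote(aligned_hyps):
    """Frequency-based voting over pre-aligned, equal-length hypotheses.

    '*' marks an empty slot produced by alignment. Returns the fused string
    and a naive average agreement score (a stand-in for the real confidence).
    """
    fused, agreement = [], []
    for slots in zip(*aligned_hyps):
        best, count = Counter(slots).most_common(1)[0]
        agreement.append(count / len(slots))
        if best != "*":
            fused.append(best)
    return "".join(fused), sum(agreement) / len(agreement)

hyps = ["今日天氣好好", "今日天气好好", "今日天氣好好"]  # toy pre-aligned outputs
text, conf = vote(hyps)
print(text, round(conf, 3))  # -> 今日天氣好好 0.944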

The final output includes character-level forced alignment between refined transcriptions and original audio, yielding precise timestamps for each character to support fine-grained speech processing tasks.

Author’s Reflection: The multi-system approach proved crucial for handling Cantonese’s linguistic complexity. By leveraging complementary ASR systems and implementing sophisticated fusion techniques, we achieved significantly higher transcription accuracy than any single system could provide, demonstrating the power of ensemble methods in speech processing.

WenetSpeech-Yue Dataset: Comprehensive Specifications

What constitutes a comprehensive speech dataset for Cantonese?

WenetSpeech-Yue represents the largest and most comprehensive open-source Cantonese speech corpus to date, spanning 21,800 hours across ten domains with rich multi-dimensional annotations. The dataset’s architecture and organization reflect careful consideration of both research needs and practical applications.

Metadata Structure and Extensibility

All audio metadata is stored in a standardized JSON format designed for both machine readability and human interpretability. Core fields include:

  • utt_id: Unique identifier for each audio segment
  • rover_result: Consensus transcription from three ASR systems
  • confidence: Text transcription confidence score
  • jyutping_confidence: Cantonese pinyin confidence score
  • duration: Audio duration in seconds
  • Speaker attributes (speaker_id, gender, age)
  • Audio quality metrics (sample_rate, DNSMOS, SNR)
  • Precise timestamp information with start/end times
  • Extended metadata including program name, geographical information, source links, and domain classification

This structured approach ensures consistency while maintaining flexibility for future metadata expansions.
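A hypothetical record illustrating this schema; all values are invented, and the key names follow the field list above but may not match the released files exactly:

example_record = {
    "utt_id": "YUE_news_000123_0007",     # invented identifier
    "rover_result": "今日天氣好好",         # consensus transcription
    "confidence": 0.93,
    "jyutping_confidence": 0.91,
    "duration": 6.82,                      # seconds
    "speaker_id": "SPK_0007",
    "gender": "female",
    "age": "adult",
    "sample_rate": 16000,
    "dnsmos": 3.4,
    "snr": 28.5,                           # dB
    "timestamps": [{"char": "今", "start": 0.12, "end": 0.30}],
    "program": "...",                      # extended metadata elided
    "domain": "news",
}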

Domain Coverage and Distribution

WenetSpeech-Yue covers ten distinct domains carefully selected to represent the breadth of Cantonese speech in real-world contexts:

  1. Storytelling
  2. Entertainment
  3. Drama
  4. Culture
  5. Vlog
  6. Commentary
  7. Education
  8. Podcast
  9. News
  10. Others

This diverse coverage ensures models trained on the dataset encounter a wide variety of speaking styles, vocabulary, and contextual scenarios, enhancing their generalization capabilities.

Figure: Distribution of audio content across the ten domains in WenetSpeech-Yue.

Duration Characteristics and Confidence-Based Partitioning

The dataset contains both short and long recordings with an average duration of 11.40 seconds per audio segment. To maximize utility while acknowledging varying transcription quality, the data is partitioned into three subsets based on confidence scores:

  • Strong labels (confidence > 0.9): 6,771.43 hours
  • Moderate labels (0.8 < confidence ≤ 0.9): 10,615.02 hours
  • Weak labels (0.6 < confidence ≤ 0.8): 4,488.13 hours

This stratification enables flexible training strategies: high-confidence subsets can drive fine-tuning, while lower-confidence segments can be leveraged carefully to improve model robustness in semi-supervised or domain-adaptive scenarios.

Quality Metrics and Filtering Strategies

Comprehensive quality assessment reveals the dataset’s characteristics: DNSMOS scores span 2.0 to 4.4, SNR values range from -5 to 80 dB, and sampling rates vary from 8,000 to 32,000 Hz. For high-fidelity applications like TTS, samples with DNSMOS greater than 2.5 and SNR above 25 dB are retained, resulting in a 12,000-hour high-quality subset suitable for generative tasks.
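Assuming the metadata is available as JSON lines with keys like those sketched earlier (a layout assumption, not the official loader), partitioning by confidence and selecting the high-fidelity TTS subset could look like:

import json

with open("metadata.jsonl", encoding="utf-8") as f:   # hypothetical file name
    records = [json.loads(line) for line in f]

# Confidence-based partitions (thresholds as reported above)
strong   = [r for r in records if r["confidence"] > 0.9]
moderate = [r for r in records if 0.8 < r["confidence"] <= 0.9]
weak     = [r for r in records if 0.6 < r["confidence"] <= 0.8]

# High-fidelity subset for generative tasks: DNSMOS > 2.5 and SNR > 25 dB
tts_subset = [r for r in records if r["dnsmos"] > 2.5 and r["snr"] > 25.0]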

Speaker Demographic Representation

The corpus contains a diverse range of speakers, though with a predominance of male speakers particularly in the middle-age group (50.6%). This distribution reflects both the source material characteristics and opportunities for future balancing through targeted data collection.

Author’s Reflection: The confidence-based partitioning strategy emerged as one of the most valuable aspects of our approach. By explicitly quantifying and leveraging transcription confidence, we enabled more effective training protocols that acknowledge the reality of imperfect automated annotations while maximizing their utility.

WSYue-eval: A Comprehensive Benchmark for Cantonese Speech Tasks

How can we effectively evaluate Cantonese speech processing systems?

To address Cantonese’s unique linguistic characteristics, we developed WSYue-eval, a comprehensive benchmark encompassing both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) tasks. This integrated evaluation framework is specifically tailored to assess model performance across critical dimensions of Cantonese language processing.

WSYue-ASR-eval: Multi-Faceted ASR Evaluation

The WSYue-ASR-eval test set is designed for rigorous ASR evaluation with multi-round manual annotations including text transcription, emotion, age, and gender labels. The set is divided into two subsets by audio duration to enable comprehensive evaluation across speech lengths:

| Set   | Duration | Speakers | Hours |
|-------|----------|----------|-------|
| Short | 0–10 s   | 2,861    | 9.46  |
| Long  | 10–30 s  | 838      | 1.97  |

This structure allows researchers to understand model performance characteristics across different utterance lengths and speaking scenarios. The evaluation set also covers diverse real-world Cantonese usage, including code-switching and multi-domain conditions, providing a realistic assessment of model capabilities.

WSYue-TTS-eval: Zero-Shot Synthesis Assessment

For TTS evaluation, we introduced WSYue-TTS-eval with two specialized subsets:

  • Base: Contains 1,000 samples from Common Voice for evaluating real-world performance on typical speech data
  • Coverage: Combines manually curated and LLM-generated texts spanning multiple domains (daily life, news, entertainment, poetry) and incorporating diverse linguistic phenomena including polyphonic characters, tone sandhi, code-switching, proper nouns, and numerals

This two-pronged approach enables both standardized performance comparison and rigorous evaluation of generalization capabilities across challenging linguistic scenarios.

Author’s Reflection: Creating effective evaluation benchmarks proved as important as building the training data itself. The Coverage subset in particular, with its carefully constructed linguistic challenges, has been invaluable for identifying subtle weaknesses in TTS systems that wouldn’t be apparent from conventional evaluation sets.

Experimental Results: Demonstrating Practical Effectiveness

How do models trained on WenetSpeech-Yue perform against existing systems?

Comprehensive experiments across both ASR and TTS tasks demonstrate that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art Cantonese speech systems, including commercial and LLM-based models.

ASR Performance Analysis

We conducted extensive ASR experiments with two model categories: traditional architectures without large language models (w/o LLM) and LLM-augmented hybrids (w/ LLM). The evaluation encompassed diverse test sets including in-house collections (Dialogue, Reading), open-source resources (Common Voice yue/zh-HK, MDCC, Daily_Use, Commands), and our proposed WSYue-ASR-eval benchmark.

The results reveal several consistent observations:

  1. Across all model scales, models trained on WenetSpeech-Yue achieve the best performance on most evaluation sets
  2. Within the small-scale group, both SenseVoice-small-Yue and U2pp-Conformer-Yue achieve competitive results despite their smaller size
  3. In the w/o LLM category, both U2pp-Conformer-Yue and Whisper-medium-Yue surpass large-scale baselines
  4. In the w/ LLM group, U2pp-Conformer-LLM-Yue consistently attains state-of-the-art accuracy

These results highlight that WenetSpeech-Yue not only improves overall performance but also maximizes model potential across different parameter regimes, validating its utility for both traditional and LLM-enhanced ASR paradigms.

The Impact of Two-Stage Training

Our two-stage training strategy demonstrates significant performance gains:

| Model                  | Stage | WSYue-ASR-eval (Short) | WSYue-ASR-eval (Long) |
|------------------------|-------|------------------------|-----------------------|
| Whisper-medium-Yue     | 1     | 7.27%                  | 11.19%                |
| Whisper-medium-Yue     | 2     | 5.05%                  | 8.05%                 |
| U2pp-Conformer-Yue     | 1     | 7.62%                  | 12.01%                |
| U2pp-Conformer-Yue     | 2     | 5.05%                  | 8.89%                 |
| U2pp-Conformer-LLM-Yue | 1     | 6.81%                  | 10.75%                |
| U2pp-Conformer-LLM-Yue | 2     | 4.73%                  | 7.91%                 |

Stage 1, trained on mixed-confidence data, already achieves competitive Cantonese ASR performance, while Stage 2 fine-tuning on high-confidence data yields significant gains across both test sets. These results confirm that high-confidence labels are the primary driver of performance improvements.

TTS Performance Evaluation

For speech synthesis, we adopted a transfer learning approach on two pretrained TTS models (Llasa-1B and CosyVoice2) fine-tuned on the WenetSpeech-Yue TTS subset. Evaluation against zero-shot baselines and commercial systems demonstrates substantial improvements across both objective and subjective metrics.

CosyVoice2-Yue attains the lowest MER among all systems (10.33% on base set, 9.49% on coverage set) and highest speaker similarity scores (0.821 and 0.834). In subjective evaluation, CosyVoice2-Yue achieves the highest intelligibility (I-MOS: 4.45 ± 0.16), while Llasa-1B-Yue outperforms in speaker similarity (S-MOS: 4.11 ± 0.37) and accent nativeness (A-MOS: 4.34 ± 0.34).

These results confirm the effectiveness of WenetSpeech-Yue for improving Cantonese speech synthesis, with fine-tuned models significantly outperforming their pretrained counterparts and achieving competitive performance against commercial systems.

Author’s Reflection: The consistent performance improvements across both ASR and TTS tasks validate our multi-dimensional annotation approach. Particularly noteworthy was how models trained on our data demonstrated stronger generalization capabilities, suggesting that the rich metadata enables learning more robust representations.

Practical Implementation: Using WenetSpeech-Yue and Trained Models

How can researchers and developers leverage these resources?

The WenetSpeech-Yue ecosystem provides multiple entry points for different use cases, from direct dataset usage to employing pre-trained models for inference or further fine-tuning.

Dataset Access and Download

All components of the WenetSpeech-Yue ecosystem are publicly accessible: the dataset, the WSYue-eval benchmarks, and the pre-trained models are released through the project's GitHub repository and Hugging Face.

Pre-trained Models for Immediate Use

We release several pre-trained models optimized for Cantonese:

ASR Models: U2pp-Conformer-Yue, Whisper-medium-Yue, and SenseVoice-small-Yue

TTS Models: CosyVoice2-Yue

Inference Code Examples

U2pp_Conformer_Yue Inference:

dir=u2pp_conformer_yue
decode_checkpoint=$dir/u2pp_conformer_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir

python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention_rescoring \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --ctc_weight 0.5 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1

Whisper_Medium_Yue Inference:

dir=whisper_medium_yue
decode_checkpoint=$dir/whisper_medium_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir

python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --blank_penalty 0.0 \
  --ctc_weight 0.0 \
  --reverse_weight 0.0 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1

SenseVoice_Small_Yue Inference:

from funasr import AutoModel

model_dir = "sensevoice_small_yue"
wav_path = "path/to/audio.wav"

model = AutoModel(
    model=model_dir,
    device="cuda:0",
)
res = model.generate(
    input=wav_path,
    cache={},
    language="yue",
    use_itn=True,
    batch_size=64,
)
print(res[0]["text"])
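CosyVoice2_Yue Inference (illustrative sketch): the released checkpoint is assumed to load through the standard upstream CosyVoice2 interface; the checkpoint directory, prompt audio, and prompt text below are placeholders.

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2        # may require the CosyVoice repo on sys.path
from cosyvoice.utils.file_utils import load_wav

# Placeholder paths; the checkpoint directory depends on where the model is downloaded
cosyvoice = CosyVoice2("path/to/CosyVoice2-Yue", load_jit=False, load_trt=False, fp16=False)
prompt_speech_16k = load_wav("prompt.wav", 16000)

for i, out in enumerate(cosyvoice.inference_zero_shot(
        "今日天氣好好，我哋出去行下啦。",   # target text to synthesize
        "呢段係提示語音對應嘅文字。",       # transcript of the prompt audio
        prompt_speech_16k, stream=False)):
    torchaudio.save(f"zero_shot_{i}.wav", out["tts_speech"], cosyvoice.sample_rate)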

Training Strategies and Recommendations

Based on our experimental results, we recommend a two-stage training approach:

  1. Stage 1: Train on mixed medium- and high-confidence labels for rapid convergence and robust foundation
  2. Stage 2: Fine-tune on high-confidence labels to maximize transcription accuracy and final performance

This approach balances training efficiency with ultimate performance, effectively leveraging the confidence-based data partitioning.
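A minimal sketch of deriving stage-wise training lists from the confidence field; the JSON-lines layout and key names are the same assumptions as earlier, and the output only loosely mirrors a WeNet-style data list:

import json

with open("metadata.jsonl", encoding="utf-8") as f:    # hypothetical metadata file
    records = [json.loads(line) for line in f]

# Stage 1: mixed medium- and high-confidence labels; Stage 2: high-confidence only
stages = {
    "stage1.list": [r for r in records if r["confidence"] > 0.8],
    "stage2.list": [r for r in records if r["confidence"] > 0.9],
}

for path, subset in stages.items():
    with open(path, "w", encoding="utf-8") as out:
        for r in subset:
            out.write(json.dumps(
                {"key": r["utt_id"], "txt": r["rover_result"]},
                ensure_ascii=False) + "\n")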

Author’s Reflection: The release of both data and models creates a complete ecosystem for Cantonese speech technology development. We’ve been particularly encouraged to see researchers using these resources for applications beyond ASR and TTS, including speaker verification, emotion recognition, and linguistic analysis.

Conclusion: Advancing Cantonese Speech Technology

WenetSpeech-Yue represents a significant milestone in Cantonese speech processing, providing the research community with the largest and most comprehensively annotated open-source corpus for this important language. Through the integrated WenetSpeech-Pipe framework, we have demonstrated an effective approach to building high-quality speech datasets with rich multi-dimensional annotations.

The experimental results consistently show that models trained on WenetSpeech-Yue achieve state-of-the-art performance across both ASR and TTS tasks, outperforming existing systems including commercial offerings and general-purpose multilingual models. The accompanying WSYue-eval benchmark provides a rigorous foundation for ongoing evaluation and comparison.

Beyond its immediate applications, WenetSpeech-Yue serves as a valuable case study in resource development for under-resourced languages, demonstrating effective strategies for automated annotation, quality control, and multi-system fusion. We believe this work will significantly accelerate progress in Cantonese speech technology and provide a blueprint for similar efforts for other languages.

Action Checklist / Implementation Steps

  1. Access Resources: Download the WenetSpeech-Yue dataset and evaluation benchmarks from Hugging Face
  2. Select Appropriate Models: Choose from pre-trained ASR or TTS models based on your specific task requirements
  3. Implement Inference: Use provided code examples to integrate models into your applications
  4. Custom Training: For specialized needs, use the dataset with recommended two-stage training strategy
  5. Evaluation: Validate performance using WSYue-eval benchmarks for comprehensive assessment
  6. Contribution: Consider contributing improvements or extensions back to the community

One-page Overview

| Aspect | Description |
|--------|-------------|
| Dataset Name | WenetSpeech-Yue |
| Total Duration | 21,800 hours |
| Domain Coverage | 10 categories: Storytelling, Entertainment, Drama, Culture, Vlog, Commentary, Education, Podcast, News, Others |
| Annotation Types | ASR transcription, text confidence, speaker identity, age, gender, SNR, DNSMOS, character-level timestamps |
| Data Partitioning | Strong labels (>0.9): 6,771 h; Moderate (0.8–0.9): 10,615 h; Weak (0.6–0.8): 4,488 h |
| Evaluation Benchmarks | WSYue-ASR-eval (9.46 h Short + 1.97 h Long), WSYue-TTS-eval (Base + Coverage subsets) |
| Pre-trained Models | Conformer-Yue, SenseVoice-small-Yue, Whisper-medium-Yue, CosyVoice2-Yue |
| Performance | SOTA or competitive results on all tested ASR and TTS tasks |
| Access | Fully open-source via GitHub and Hugging Face |

Frequently Asked Questions (FAQ)

Q: What makes Cantonese particularly challenging for speech processing?
A: Cantonese presents unique challenges including a complex tone system with nine tones in six categories, coexistence of literary and colloquial forms, frequent code-switching with English, and limited annotated resources compared to major languages.

Q: How does WenetSpeech-Yue compare to existing Cantonese speech datasets?
A: WenetSpeech-Yue is significantly larger (21,800 hours vs. typically under 500 hours for other datasets) and more comprehensive, with multi-dimensional annotations covering speaker attributes, quality metrics, and precise timestamps, unlike previous datasets that primarily offered only speech-text alignment.

Q: What is the confidence score in the dataset metadata?
A: The confidence score represents the agreement level between multiple ASR systems during the ROVER voting process, with higher scores indicating more reliable transcriptions. This enables effective data filtering for different training stages.

Q: Can I use WenetSpeech-Yue for commercial applications?
A: The dataset is released under open-source licenses that permit commercial use, but we recommend checking the specific license terms for each component (dataset, models, code) to ensure compliance.

Q: How were the audio quality metrics (DNSMOS, SNR) calculated?
A: DNSMOS scores were generated using Microsoft’s DNSMOS model for perceptual quality assessment, while SNR values were computed using the Brouhaha toolkit for signal-to-noise ratio estimation.

Q: What types of speech are included in the different domains?
A: The dataset covers diverse speech types including narrative storytelling, conversational entertainment, dramatic performances, cultural discussions, personal vlogs, technical commentaries, educational content, podcast conversations, news broadcasts, and various other speech styles.

Q: How can I contribute to or extend this dataset?
A: While the current release is complete, we welcome community feedback and suggestions through our GitHub repository. Future expansions may incorporate community-contributed data following similar quality control processes.

Q: What computational resources are recommended for working with this dataset?
A: For full dataset training, we recommend high-performance computing resources with multiple GPUs and substantial storage capacity. For inference with pre-trained models, a single modern GPU is sufficient for most applications.