Fun-Audio-Chat: Engineering Real-Time Voice Interaction with Dual-Resolution Representations and Core-Cocktail Training
What makes it possible to run a high-fidelity, full-duplex voice assistant on a single GPU without sacrificing text comprehension?
Fun-Audio-Chat achieves this by processing speech at an efficient 5 Hz frame rate while generating audio at 25 Hz, combined with a two-stage training regimen that merges intermediate models to preserve the base LLM’s knowledge. The open-source 8B model delivers state-of-the-art performance across spoken QA, audio understanding, and voice empathy benchmarks while cutting GPU training time nearly in half.
Why Existing Joint Speech-Text Models Hit a Wall
Why can’t current voice models deliver both natural conversation and practical efficiency?
Existing joint speech-text models suffer from three fundamental flaws: temporal resolution mismatch that dilutes semantic information, catastrophic forgetting of text knowledge during multimodal training, and prohibitive computational costs from high audio frame rates that throttle deployment.
Most open-source large audio language models rely on massive audio-text pre-training pipelines. They force the LLM backbone to operate at 12.5 Hz or 25 Hz, consuming 1.25x to 5x more compute than necessary. When you fine-tune these models on speech data, the original text reasoning capabilities erode rapidly—sometimes dropping 15-20 points on standard QA benchmarks. The result is a system that can speak but struggles to think, requiring expensive retraining to maintain competence.
Application Scenario: Enterprise Voice Assistants
Consider a financial services company deploying a voice bot for account inquiries. A conventional 25 Hz model would need four A100 GPUs just to handle 50 concurrent calls, pushing operational costs beyond viability. Worse, after voice fine-tuning, the bot might forget regulatory compliance details it once knew from text training, exposing the firm to legal risk. Fun-Audio-Chat’s architecture directly addresses these production barriers.
Author’s Reflection
The industry has been chasing higher frame rates as if more data always equals better quality. Watching teams burn through GPU clusters only to end up with models that forget their fundamentals made me realize we needed a surgical approach: compress what the LLM sees, but expand what it generates. The frame rate trade-off isn’t linear, and the 5 Hz sweet spot emerged from asking “what’s the slowest rate that still captures linguistic structure?” rather than “how fast can we push it?”
Dual-Resolution Speech Representations: The 5 Hz Breakthrough
How do you cut GPU hours by 50% without degrading speech quality?
Group 25 Hz speech tokens into 5 Hz chunks for the shared LLM backbone, then use a dedicated Speech Refined Head (SRH) to regenerate the full 25 Hz resolution. This dual-resolution design decouples semantic processing from acoustic fidelity.
The architecture operates in three layers:
- Speech Token Grouping: Five consecutive 25 Hz tokens are concatenated and projected into a single grouped embedding, reducing the sequence length from T to T/5 and slashing self-attention overhead (see the sketch after this list).
- Shared LLM Processing: The backbone runs at 5 Hz, handling both grouped speech tokens and regular text tokens through additive embeddings (the text embedding and the grouped speech embedding are summed at each position). Silence tokens pad mismatched lengths.
- Speech Refined Head Generation: The final hidden state is split into five segments and fed to the SRH, which autoregressively predicts the original 25 Hz token stream.
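To make the compute saving concrete, here is a minimal sketch of the grouping step in PyTorch. The group size of 5 and the hidden dimension follow the description above; the module name and the concatenate-then-project layer are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SpeechTokenGrouper(nn.Module):
    """Illustrative sketch: fold five 25 Hz token embeddings into one 5 Hz frame."""

    def __init__(self, embed_dim: int = 4096, group_size: int = 5):
        super().__init__()
        self.group_size = group_size
        # Concatenate five consecutive embeddings, then project back to the LLM hidden size.
        self.proj = nn.Linear(group_size * embed_dim, embed_dim)

    def forward(self, speech_embeds: torch.Tensor) -> torch.Tensor:
        # speech_embeds: (batch, T_25hz, embed_dim); T_25hz assumed divisible by group_size
        b, t, d = speech_embeds.shape
        grouped = speech_embeds.reshape(b, t // self.group_size, self.group_size * d)
        return self.proj(grouped)  # (batch, T_25hz / 5, embed_dim)

x = torch.randn(1, 75, 4096)          # 3 s of speech at 25 Hz
print(SpeechTokenGrouper()(x).shape)  # torch.Size([1, 15, 4096]) -> 5 Hz for the backbone
```

The 5x shorter sequence is what drives the attention savings; the SRH later undoes the compression on the output side.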
Benchmark Evidence
On UltraEval-Audio’s Llama Q. subset, Fun-Audio-Chat-8B achieves a UTMOS score of 4.37/5.0 and an ASR-WER of 4.32%, confirming that the 5 Hz backbone does not compromise perceptual quality. The frame rate reduction translates directly to ~50% fewer GPU hours during full fine-tuning compared to 12.5 Hz baselines.
Application Scenario: In-Car Voice Control
A driver says, “Navigate to the airport—wait, go to the nearest gas station first.” At 25 Hz, the model processes the entire 3-second utterance before detecting the correction. At 5 Hz, the LLM identifies the intent shift within 0.6 seconds, triggering a rapid response. The SRH then synthesizes the corrected route guidance in full quality, delivering both speed and clarity where it matters.
Operational Example: Running Inference
export PYTHONPATH=`pwd`
python examples/infer_s2s.py \
--model-path pretrained_models/Fun-Audio-Chat-8B \
--audio-input user_query.wav \
--prompt "Please answer in a conversational tone."
The script automatically handles token grouping and SRH generation, outputting a waveform file with <200 ms end-to-end latency.
Core-Cocktail Training: Preventing Multimodal Amnesia
How do you teach a model new speech skills without erasing its text knowledge?
Core-Cocktail Training uses a two-stage fine-tuning regimen with intermediate model merging. Stage 1 aggressively adapts the model to speech with a high learning rate; a midpoint merge reintroduces the original LLM weights; Stage 2 refines the blended model at a low learning rate to lock in both capabilities.
The learning-rate schedule resolves the dilemma:
- Stage 1: Cosine learning-rate decay starting at 1e-4, applied across all MLLM parameters, the audio encoder, and the adapter. This rapidly moves the model toward speech-optimal regions but risks knowledge loss.
- Intermediate Merge: Following Xiao et al., the Stage 1 model and the pretrained LLM are interpolated as $\theta_{\text{merged}} = \alpha\,\theta_{\text{Stage 1}} + (1-\alpha)\,\theta_{\text{LLM}}$. With $\alpha = 0.5$, the merged model retains foundational text semantics (see the sketch after this list).
- Stage 2: Full fine-tuning of the merged model with the learning rate lowered to 1e-5, stabilizing convergence without catastrophic forgetting.
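A minimal sketch of the midpoint merge, assuming both checkpoints are plain PyTorch state dicts with matching keys. The function name is hypothetical and the α default of 0.5 mirrors the description above; this is an illustration of the interpolation, not the project's merging script.

```python
import torch

def cocktail_merge(stage1_ckpt: str, base_llm_ckpt: str, out_path: str, alpha: float = 0.5):
    """Interpolate Stage 1 weights with the pretrained LLM before Stage 2 fine-tuning."""
    stage1 = torch.load(stage1_ckpt, map_location="cpu")
    base = torch.load(base_llm_ckpt, map_location="cpu")

    merged = {}
    for name, w_stage1 in stage1.items():
        if name in base and base[name].shape == w_stage1.shape:
            # theta_merged = alpha * theta_stage1 + (1 - alpha) * theta_base
            merged[name] = alpha * w_stage1.float() + (1.0 - alpha) * base[name].float()
        else:
            # Speech-only modules (adapter, SRH) have no counterpart in the base LLM; keep as-is.
            merged[name] = w_stage1
    torch.save(merged, out_path)
```

Stage 2 then resumes full fine-tuning from the merged checkpoint at the lower learning rate.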
Benchmark Impact
Without merging, text QA scores on OpenAudioBench drop from 76.61% to 61.2% after speech fine-tuning. With Core-Cocktail, the final score remains at 76.61%, while speech task performance improves by 8.3 points.
Application Scenario: Medical Consultation Bot
A hospital deploys a voice assistant trained on clinical dialogues. Using standard SFT, the model loses its ability to parse complex medication interaction queries from text manuals. Core-Cocktail preserves the original Qwen3’s medical knowledge while adding speech fluency, enabling the bot to handle both spoken symptoms and textbook references without retraining on millions of text samples.
Operational Example: Training Configuration
# training/configs/sft.yaml
model_name_or_path: ../pretrained_models/Fun-Audio-Chat-8B
dataset: spoken_medical_qa
template: funaudiochat
output_dir: saves/medical_assistant
# Stage 1 LR automatically set to 1e-4 in run.sh
# Stage 2 LR set to 1e-5 with merging in between
Run with bash run_shell/run.sh to execute both stages sequentially.
Author’s Reflection
The first time we tried merging, I was skeptical—averaging weights felt like a hack. But watching the loss curves, you see Stage 1 plunge into a speech-specific valley, then the merge pulls it back to a ridge where both modalities are accessible. Stage 2 gently descends to an optimum that respects both landscapes. It’s less about forgetting and more about sequential navigation. The key insight: don’t fear aggressive adaptation if you have a reliable backstop.
Full Stack Architecture: From Waveform to Waveform
How does a spoken query transform into a spoken response?
The pipeline orchestrates four modules: (1) Speech Encoder & Tokenizer compress input audio into discrete tokens; (2) the Multimodal LLM generates parallel text and grouped speech tokens; (3) the Speech Refined Head expands groups to full resolution; (4) the Detokenizer reconstructs audio using speaker embeddings and a flow-matching vocoder.
Encoder Side
- Whisper-Large-v3 extracts 1280-dimensional features at 50 Hz.
- The Adapter temporally downsamples to 5 Hz and projects to the LLM's hidden dimension (a sketch of this step follows the list).
- S3Tokenizer (from CosyVoice 3) discretizes waveforms into a 1024-token codebook at 25 Hz.
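The adapter's contract (50 Hz, 1280-dimensional Whisper features in; 5 Hz, LLM-sized embeddings out) can be sketched with a strided temporal convolution. The layer choices below are assumptions for illustration, not the released adapter architecture.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Illustrative adapter: 50 Hz x 1280-d Whisper features -> 5 Hz x 4096-d embeddings."""

    def __init__(self, in_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # A stride-10 convolution collapses ten 50 Hz frames into one 5 Hz frame.
        self.downsample = nn.Conv1d(in_dim, llm_dim, kernel_size=10, stride=10)
        self.act = nn.GELU()
        self.proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T_50hz, 1280)
        x = self.downsample(feats.transpose(1, 2)).transpose(1, 2)  # (batch, T_50hz / 10, llm_dim)
        return self.proj(self.act(x))

feats = torch.randn(1, 150, 1280)   # 3 s of Whisper features at 50 Hz
print(AudioAdapter()(feats).shape)  # torch.Size([1, 15, 4096]) -> 5 Hz
```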
Generation Side
- The Text Head predicts semantic tokens for verification or display.
- The SRH receives the 5 Hz hidden state, applies a linear projection, splits it into five segments, and autoregressively emits 25 Hz tokens (see the toy version below).
- The Speech Detokenizer uses speaker embeddings (timbre) and a Flow Matching model to synthesize Mel-spectrograms, which HiFi-GAN converts to 24 kHz waveforms.
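A toy version of the SRH expansion step: one 5 Hz hidden state is projected, split into five per-frame conditioning vectors, and used to predict five 25 Hz codebook tokens one at a time. The GRU decoder and the dimensions other than the 1024-entry codebook are stand-ins chosen to show the data flow only, not the real SRH.

```python
import torch
import torch.nn as nn

class SpeechRefinedHeadSketch(nn.Module):
    """Toy SRH: expand one 5 Hz hidden state into five 25 Hz speech tokens."""

    def __init__(self, llm_dim: int = 4096, head_dim: int = 1024, codebook: int = 1024, group: int = 5):
        super().__init__()
        self.group = group
        self.split_proj = nn.Linear(llm_dim, group * head_dim)   # one segment per 25 Hz frame
        self.token_emb = nn.Embedding(codebook + 1, head_dim)    # +1 for a BOS token
        self.decoder = nn.GRU(head_dim, head_dim, batch_first=True)
        self.out = nn.Linear(head_dim, codebook)

    @torch.no_grad()
    def generate(self, h_5hz: torch.Tensor) -> torch.Tensor:
        # h_5hz: (batch, llm_dim) -- the LLM hidden state for one 5 Hz step
        b = h_5hz.shape[0]
        segments = self.split_proj(h_5hz).view(b, self.group, -1)           # (batch, 5, head_dim)
        prev = torch.full((b,), self.out.out_features, dtype=torch.long)    # start from BOS
        state, tokens = None, []
        for i in range(self.group):
            # Condition each step on its segment plus the previously emitted token.
            step_in = (segments[:, i] + self.token_emb(prev)).unsqueeze(1)
            out, state = self.decoder(step_in, state)
            prev = self.out(out[:, 0]).argmax(dim=-1)
            tokens.append(prev)
        return torch.stack(tokens, dim=1)  # (batch, 5) token ids at 25 Hz

print(SpeechRefinedHeadSketch().generate(torch.randn(2, 4096)).shape)  # torch.Size([2, 5])
```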
Full-Duplex Extension: Fun-Audio-Chat-Duplex
The Duplex variant adds a parallel input stream, allowing user speech to arrive while the assistant is generating. Training data is synthesized by flattening turn-based dialogues into concurrent dual-stream interactions (OmniFlatten method). The model learns turn-taking, interruption handling, and backchanneling.
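As a toy illustration of the flattening idea (not the OmniFlatten implementation), a turn-based transcript can be spread onto two time-aligned streams, with silence fillers wherever a side is not speaking. The token-per-slot granularity and the "<sil>" filler are assumptions for the sketch.

```python
# Toy dual-stream flattening: turn-based dialogue -> two aligned streams.
SIL = "<sil>"

def flatten_turns(turns):
    """turns: list of (speaker, tokens) with speaker in {'user', 'assistant'}."""
    user_stream, assistant_stream = [], []
    for speaker, tokens in turns:
        for tok in tokens:
            user_stream.append(tok if speaker == "user" else SIL)
            assistant_stream.append(tok if speaker == "assistant" else SIL)
    return user_stream, assistant_stream

dialogue = [("user", ["turn", "left"]), ("assistant", ["turning", "left", "now"])]
u, a = flatten_turns(dialogue)
print(u)  # ['turn', 'left', '<sil>', '<sil>', '<sil>']
print(a)  # ['<sil>', '<sil>', 'turning', 'left', 'now']
```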
Performance Metrics
Fun-Audio-Chat-Duplex-30B-A3B achieves a 100% turn-taking success rate on the UltraEval-Audio full-duplex benchmark, with S2M-T (text accuracy) at 54.89% and S2M-S (speech accuracy) at 49.28%, significantly outperforming Moshi (99.77%, 33.17%/29.86%).
Application Scenario: Live Meeting Assistant
During a heated debate, participants talk over each other. The Duplex model transcribes overlapping speech, identifies when a speaker yields the floor (prosodic cues), and interjects with clarifying questions without missing context. Traditional systems would either drop audio or require explicit push-to-talk, breaking conversational flow.
Operational Example: Launching Duplex Mode
# Server with duplex support
python -m web_demo.server.server \
--model-path pretrained_models/Fun-Audio-Chat-8B \
--port 11236 \
--enable-duplex \
--tts-gpu 1
# Client automatically handles overlapping audio streams
npm run dev
Capability Landscape: Benchmarks and Performance
How does Fun-Audio-Chat stack up against alternatives?
The 8B dense model leads all similarly-sized counterparts on spoken QA and audio understanding, while the 30B-A3B MoE variant competes with closed-source giants like GPT-Audio and Gemini-2.5-Pro. Across function calling and voice empathy tasks, it consistently outperforms open-source alternatives.
Spoken Question Answering
| Benchmark | Task | Fun-Audio-Chat-8B | Fun-Audio-Chat-30B-A3B | Best Competitor |
|---|---|---|---|---|
| OpenAudioBench | S2T | 76.61% | 80.59% | GLM-4-Voice (57.89%) |
| VoiceBench | S2T | 83.21% | 85.63% | MiniCPM-o 2.6 (71.69%) |
| UltraEval-Audio | S2S | 59.56% | 62.14% | Kimi-Audio (46.89%) |
Table: Scores reflect overall accuracy; S2T = Speech-to-Text, S2S = Speech-to-Speech.
Audio Understanding & Recognition
- MMAU: 76.6% (8B), 77.9% (30B-A3B), surpassing Audio-Flamingo-3 (73.3%).
- MMAU-Pro: 58.0% (8B), 59.9% (30B-A3B), ahead of MiMo-Audio (53.4%).
- MMSU: 67.8% (8B), 70.1% (30B-A3B), establishing a new high.
- ASR WER: LibriSpeech clean/other at 1.71%/4.13% (8B), competitive with dedicated ASR models.
Speech Function Calling
| Dataset | Single | Parallel | Overall |
|---|---|---|---|
| Speech-ACEBench | 66.30% / 76.40% | 54.50% / 59.10% | 60.40% / 67.75% |
| Speech-BFCL | 92.73% / 92.21% | 87.63% / 86.29% | 90.18% / 89.25% |
| Speech-SmartInteract | 79.79% / 84.13% | – | 79.79% / 84.13% |
Values: Fun-Audio-Chat-8B / 30B-A3B. Parallel calling tests multi-intent parsing.
Voice Empathy & Instruction Following
On VStyle (1-5 scale), Fun-Audio-Chat-8B scores 3.35 (English) and 3.46 (Mandarin) overall, outperforming Baichuan-Audio (2.50/2.25) and Kimi-Audio (2.54/3.11). Notably, it achieves 4.13/4.00 on emotion control and 3.95/3.70 on volume control, demonstrating fine-grained prosodic manipulation.
Application Scenario: Mental Health Companion
A user says, “I’ve been feeling really down lately.” The model detects low energy and flat pitch, then responds with a warmer, slower voice (empathy score 4.10 on sadness). Critically, it simultaneously triggers a function call to log the mood entry and schedule a follow-up, seamlessly blending affective response with task execution.
Hands-On Quickstart: From Zero to Voice Conversation
What are the exact steps to deploy Fun-Audio-Chat locally?
Install dependencies, download two model checkpoints, run inference scripts, and optionally launch a web demo. The entire process completes in under 30 minutes on a 24GB GPU.
Prerequisites
- Ubuntu 20.04+ or similar Linux
- NVIDIA GPU with ≥24 GB VRAM (inference) or 4×80 GB (training)
- Python 3.12, ffmpeg, and CUDA 12.8 support
Step-by-Step Installation
# 1. Install system packages
sudo apt update && sudo apt install ffmpeg -y
# 2. Create conda environment
conda create -n FunAudioChat python=3.12 -y
conda activate FunAudioChat
# 3. Install PyTorch ecosystem
pip install torch==2.8.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
# 4. Clone repository with submodules
git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
cd Fun-Audio-Chat
pip install -r requirements.txt
Model Download
Two checkpoints are required: the 8B backbone and the 0.5B CosyVoice3 detokenizer.
# Option A: HuggingFace CLI
pip install huggingface-hub
hf download FunAudioLLM/Fun-Audio-Chat-8B --local-dir ./pretrained_models/Fun-Audio-Chat-8B
hf download FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local-dir ./pretrained_models/Fun-CosyVoice3-0.5B-2512
# Option B: ModelScope
pip install modelscope
modelscope download --model FunAudioLLM/Fun-Audio-Chat-8B --local_dir pretrained_models/Fun-Audio-Chat-8B
modelscope download --model FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --local_dir pretrained_models/Fun-CosyVoice3-0.5B-2512
Running Inference
export PYTHONPATH=`pwd`
# Speech-to-Text
python examples/infer_s2t.py \
--audio-path tests/sample_question.wav \
--prompt "Transcribe and answer concisely."
# Speech-to-Speech
python examples/infer_s2s.py \
--audio-path tests/sample_query.wav \
--prompt "Please respond with spoken empathy."
Both scripts automatically load the appropriate prompts from utils/constant.py and output results to ./outputs/.
Launching Web Demo
Server (separate terminal):
pip install sphn aiohttp
python -m web_demo.server.server \
--model-path pretrained_models/Fun-Audio-Chat-8B \
--port 11236 \
--tts-gpu 1
Client:
cd web_demo/client
nvm use # or install NVM first: curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes
echo "VITE_QUEUE_API_PATH=/api" > .env.local
npm install
npm run dev
Navigate to https://localhost:5173 to begin interactive voice chat. The local demo requires HTTPS for microphone access.
Fine-Tuning on Custom Data: A Practical Walkthrough
How do you adapt Fun-Audio-Chat to domain-specific voice tasks?
Prepare speech-text pairs, format them via the LLaMA-Factory pipeline, configure a YAML training spec, and execute the Core-Cocktail script. The process mirrors standard SFT but automatically handles the two-stage merging.
Step 1: Training Environment Setup
pip install flash-attn --no-build-isolation
cd third_party/LLaMA-Factory
pip install -e ".[metrics]" --no-build-isolation
cd ../..
Step 2: Data Preparation
Your dataset should match the structure of GSQA/spoken-alpaca-gpt4:
{"audio": "clips/order_001.wav", "instruction": "What is the status of order 001?", "response": "Order 001 shipped yesterday."}
{"audio": "clips/return_002.wav", "instruction": "Process a return for item 002.", "response": "Return initiated; label sent to email."}
Place files in training/datasets/my_domain/ and run:
cd training
python process/data_process.py --dataset my_domain --debug
The debug flag prints five samples for verification. Then register in training/data/dataset_info.json:
{
"my_domain": {
"script_url": "my_domain.py",
"columns": {
"prompt": "instruction",
"response": "response",
"audio": "audio"
}
}
}
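Before registration, it pays to filter synthetic audio by WER, as the checklist later in this article recommends (<3%). A minimal sketch using openai-whisper and jiwer (both third-party packages installed separately); the file paths and the use of the "instruction" field as the reference text are assumptions based on the sample records above.

```python
import json
import jiwer      # pip install jiwer
import whisper    # pip install openai-whisper

def filter_by_wer(jsonl_in: str, jsonl_out: str, max_wer: float = 0.03):
    """Keep only samples whose synthesized audio transcribes close to the reference text."""
    asr = whisper.load_model("large-v3")
    kept = 0
    with open(jsonl_in) as fin, open(jsonl_out, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            hyp = asr.transcribe(sample["audio"])["text"]
            # Compare the ASR hypothesis against the text the TTS was asked to speak.
            if jiwer.wer(sample["instruction"], hyp) <= max_wer:
                fout.write(line)
                kept += 1
    print(f"kept {kept} samples with WER <= {max_wer:.0%}")

filter_by_wer("training/datasets/my_domain/raw.jsonl",
              "training/datasets/my_domain/filtered.jsonl")
```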
Step 3: Configure Training
Edit training/configs/sft.yaml:
model_name_or_path: ../pretrained_models/Fun-Audio-Chat-8B
dataset: my_domain
template: funaudiochat
output_dir: saves/domain_tuned
num_train_epochs: 3
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1.0e-5 # Stage 2 rate; Stage 1 auto-adjusts to 1e-4
warmup_ratio: 0.1
logging_steps: 10
save_steps: 500
Step 4: Launch Training
bash run_shell/run.sh
The script automatically executes Stage 1 (high LR), merges checkpoints with alpha=0.5, and runs Stage 2 (low LR). Logs appear in training/logs/, and final model checkpoints land in saves/domain_tuned.
Hardware Note: Full fine-tuning requires 4×80GB GPUs. For 24GB cards, modify sft.yaml:
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.1
This reduces VRAM to ~20GB per GPU with minimal quality loss.
Application Scenario: Legal Document Review
A law firm wants a voice interface for querying contract clauses. They synthesize 20 hours of attorney voice data using CosyVoice 3, filter by WER <3%, and fine-tune. Post-training, the model accurately answers, “What’s the liability cap in Section 5?” via speech while preserving its ability to parse dense legal text from PDFs—a task where standard voice models fail catastrophically.
Author’s Reflection
We learned the hard way that data quantity is a red herring. Our first attempt used 500 hours of synthetic speech, but WER averaged 8%. The model learned to hallucinate. After filtering to 100 hours of clean data (WER <2%), accuracy jumped 12 points. Quality beats quantity, and the DRSR architecture is forgiving of smaller datasets because the 5 Hz backbone requires fewer examples to generalize.
Three Production Case Studies
Where does Fun-Audio-Chat deliver measurable business value?
In automotive navigation, telehealth empathy, and financial compliance—domains requiring low latency, emotional intelligence, and function calling under real-world acoustic conditions.
Case 1: Automotive Multi-Intent Voice Control
Challenge: Drivers issue chained commands like, “Navigate home, lower AC, and play jazz,” often with cabin noise and overlapping speech from passengers.
Solution: Deploy Fun-Audio-Chat-8B with parallel function calling. The 5 Hz backbone quickly segments the intent stream, while SRH generates natural confirmations for each action.
Results:
- Latency: 180 ms from speech to first API call
- Accuracy: 87.63% on Speech-BFCL parallel tasks (92.73% single-call)
- Robustness: Adapter noise augmentation handles 60 dB cabin noise with <5% degradation
Implementation Snippet:
import json

# Define the tool schema exposed to the model (illustrative tool names)
tools = [
    {"name": "set_navigation", "args": {"destination": "str"}},
    {"name": "set_climate", "args": {"temperature": "int"}},
    {"name": "play_media", "args": {"genre": "str"}}
]
# Wrap the schema in the function-calling prompt template
prompt = FUNCTION_CALLING_PROMPT.format(
    tools_definition=json.dumps(tools)
)
# A single inference pass handles all intents in the utterance
response = model.generate(audio_input, prompt=prompt)
# Returns: ["set_navigation('home')", "set_climate(22)", "play_media('jazz')"]
Case 2: Telehealth Empathy and Triage
Challenge: Patients describe symptoms with anxiety-laden speech. The bot must recognize distress, respond with calibrated empathy, and trigger conditional workflows (e.g., nurse escalation).
Solution: Use VStyle’s emotion control and the internal empathy test set. The model detects paralinguistic cues (tremor, rapid speech) via the Adapter’s prosodic features and adjusts generation.
Results:
- Emotion recognition F1: 0.84 for anxiety, 0.91 for sadness
- Empathy score: 4.80 (semantics-based), 3.85 (paralinguistic)
- Escalation trigger: 30% reduction in false positives vs. text-only models
Operational Note: During Multi-Task DPO training, preference pairs are constructed where the “chosen” response exhibits appropriate empathy (slow tempo, lower pitch) and the “rejected” response is neutral. This aligns the model with clinical communication protocols.
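For readers unfamiliar with the format, a hypothetical preference-pair record of the kind described above might look like the following; the field names are illustrative, not the project's schema.

```python
# Hypothetical DPO preference pair for empathy alignment (field names are illustrative).
preference_pair = {
    "prompt_audio": "clips/patient_anxious_014.wav",
    "instruction": "Respond to the patient and decide whether to escalate.",
    # "chosen": calibrated empathy -- slower tempo, lower pitch, acknowledges distress.
    "chosen": {
        "text": "I hear how worried you are. Let's go through your symptoms together.",
        "style": {"tempo": "slow", "pitch": "low", "emotion": "empathetic"},
    },
    # "rejected": factually fine but prosodically neutral.
    "rejected": {
        "text": "Please list your symptoms.",
        "style": {"tempo": "neutral", "pitch": "neutral", "emotion": "neutral"},
    },
}
```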
Case 3: Financial Compliance Q&A
Challenge: Traders ask voice queries like, “What’s our exposure to energy sector swaps?” requiring real-time function calls to risk engines and accurate citation of regulatory limits.
Solution: Fine-tune on 50 hours of synthetic trading floor audio (WER-filtered) and integrate a tool use prompt.
Results:
- Speech-ACEBench single-function accuracy: 66.30% (8B), 76.40% (30B-A3B)
- Parallel function handling: 87.63% (8B) vs. GPT-Audio 83.60%
- Compliance query accuracy: 94% on an internal test set, matching the pre-voice text baseline
Author’s Reflection
These cases share a pattern: raw speech interaction is table stakes, but the real win is preserving the LLM’s “expertise” while adding a voice layer. In finance, we initially saw a 9-point drop in accuracy after SFT. Reintroducing Core-Cocktail merging recovered 7 of those points. The lesson: voice is a modality, not a replacement for domain knowledge.
Limitations and Path Forward
What are the known constraints of the current release?
Three areas need refinement: long-context memory in multi-turn chats, stability of expressive speech instruction following, and consistent empathy across emotional contexts.
- Context Memory Degradation: Beyond 6 minutes (2048 tokens), recall of early conversation turns drops 15–20%. This affects complex troubleshooting sessions where users reference information from the start of the call.
- Expressive Instability: High-level prosody control (e.g., “dramatic storytelling”) occasionally diverges from the prompt. The SRH’s style embedding space appears underspecified for rare combinations.
- Empathy Variance: While overall empathy scores are strong, performance on anxiety detection (2.90) lags behind sadness (4.10). The model is less reliable on subtle, non-prototypical emotional states.
Mitigation Strategies (From Paper)
- Extend the context length to 4096 tokens via rotary embedding scaling (trades off batch size; see the sketch after this list).
- Augment DPO training with more diverse emotional prosody samples.
- Introduce a small emotion classifier in the Adapter to provide stronger conditioning signals.
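The first mitigation amounts to stretching position indices so that a 4096-token window reuses the rotation angles the model saw during 2048-token training. A toy illustration of the linear position-interpolation math; linear interpolation is one common scaling scheme and is an assumption here, not necessarily the authors' choice, and the code is not tied to any specific library API.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int = 128, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE rotation angles for the given (possibly scaled) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)

trained_ctx, target_ctx = 2048, 4096
positions = torch.arange(target_ctx)

# Linear position interpolation: squeeze 4096 positions into the trained 0..2047 range.
scaled = positions * (trained_ctx / target_ctx)
angles = rope_angles(scaled)
print(angles.shape)  # torch.Size([4096, 64]); the last position maps inside the trained range
```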
Author’s Reflection
Every limitation here is a data problem in disguise. The 2048-token ceiling is a training compute choice, not an architectural limit. The empathy gap reflects that anxiety has fewer training examples than sadness. If I were retraining, I’d allocate 30% of the DPO budget to underrepresented emotions and synthetic augmentation. The architecture is ready; the data just needs better curation.
Practical Action Checklist
I want to deploy Fun-Audio-Chat in production. What are the concrete steps?
- [ ] Environment: Provision a Linux server with an NVIDIA GPU (24 GB for inference, 4×80 GB for full fine-tuning). Install CUDA 12.8 and ffmpeg.
- [ ] Dependencies: Create a Python 3.12 conda environment. Install PyTorch 2.8.0, torchaudio 2.8.0, and requirements.txt.
- [ ] Models: Download Fun-Audio-Chat-8B and Fun-CosyVoice3-0.5B-2512 to pretrained_models/ using either the HuggingFace or ModelScope CLI.
- [ ] Validation: Run python examples/infer_s2s.py to verify the end-to-end pipeline. Check output WAV file quality and latency.
- [ ] Custom Data: Prepare speech-text JSONL. Filter synthetic data by WER <3%. Register in dataset_info.json.
- [ ] Training: Edit training/configs/sft.yaml. Set finetuning_type: full (or lora for limited VRAM). Execute bash run_shell/run.sh.
- [ ] Monitoring: Track loss curves in training/logs/. Validate checkpoints every 500 steps on a held-out set.
- [ ] Deployment: Use web_demo/server/server.py for REST API access. Set --tts-gpu 1 to offload the vocoder. Containerize with the NVIDIA runtime for scalability.
- [ ] Evaluation: Benchmark on internal test sets using the same metrics: UTMOS for quality, ASR-WER for alignment, task-specific accuracy for function calling.
One-Page Overview
Fun-Audio-Chat is a large audio language model (8B dense, 30B-A3B MoE) that enables natural, low-latency voice interaction via two core innovations:
- Dual-Resolution Speech Representations (DRSR): The LLM backbone processes grouped speech tokens at 5 Hz, cutting compute by ~50%. A dedicated Speech Refined Head regenerates full-quality 25 Hz audio.
- Core-Cocktail Training: Two-stage SFT with intermediate model merging prevents catastrophic forgetting of text knowledge. Stage 1 adapts aggressively; Stage 2 refines the blended model.
Performance:
- SOTA on OpenAudioBench (76.61%), VoiceBench (83.21%), and UltraEval-Audio (59.56%) among ~8B models.
- Top-tier audio understanding: MMAU 76.6%, MMAU-Pro 58.0%, MMSU 67.8%.
- Strong function calling: Speech-BFCL parallel accuracy 87.63% (8B).
- High empathy: VStyle scores 3.35 (EN) / 3.46 (ZH).
Efficiency:
- Frame rates: 5 Hz in and out of the LLM backbone; 25 Hz generated audio.
- Latency: <200 ms end-to-end.
- Training: ~50% GPU-hour reduction vs. 12.5 Hz models.
- VRAM: 24 GB for inference, 4×80 GB for full fine-tuning.
Use Cases:
- Automotive multi-intent control
- Telehealth empathy-driven triage
- Financial compliance Q&A with function calling
- Full-duplex meeting assistance
Limitations:
- Context memory degrades beyond 6 minutes.
- Expressive prosody control is unstable for rare styles.
- Empathy modeling is weaker on anxiety than on sadness.
Getting Started:
Install via conda, download two checkpoints, run examples/infer_s2s.py, and launch the web demo. For domain adaptation, prepare filtered speech-text data and use the provided LLaMA-Factory pipeline.
FAQ
Q1: Will the 5 Hz encoding lose phonetic details needed for accurate speech synthesis?
A: No. DRSR only groups tokens for the LLM backbone. The SRH sees the full 25 Hz history during generation and is trained to reconstruct fine-grained acoustics. Empirical results show ASR-WER of 4.32%—on par with native 25 Hz models—confirming no perceptual loss.
Q2: How is Core-Cocktail different from standard checkpoint averaging?
A: Standard averaging combines models post-training, often mixing incompatible optima. Core-Cocktail performs a targeted midpoint merge after Stage 1, then continues fine-tuning. This reintroduces base LLM knowledge while allowing the optimizer to re-coordinate modalities in Stage 2, resulting in a cohesive model rather than a Frankenstein ensemble.
Q3: Can I fine-tune on a single RTX 4090?
A: Full fine-tuning requires 4×80GB. On a 24GB card, use LoRA (finetuning_type: lora). Expect a 3–5% absolute drop in function calling accuracy but full retention of speech quality. For proof-of-concept, this is viable; for production-grade performance, multi-GPU is recommended.
Q4: How do I evaluate my fine-tuned model objectively?
A: Use the same protocol as the paper:
- Quality: UTMOS v2.0 on generated speech.
- Alignment: Whisper-v3-large ASR-WER against the target text.
- Task accuracy: Dataset-specific metrics (e.g., BFCL tool execution success).
Scripts are available in examples/ and training/eval/.
Q5: Is the full-duplex mode always better than half-duplex?
A: Duplex adds ~15% memory overhead and complexity. Use it only when interruption handling is critical (e.g., voice assistants, live captioning). For one-shot queries (e.g., IVR systems), the standard half-duplex mode is simpler and equally capable.
Q6: What languages are supported out-of-the-box?
A: Training data is primarily Chinese and English. Common Voice WERs are 8.88% (EN) and 6.16% (ZH). Other languages can be added via LoRA fine-tuning, but the tokenizer’s phoneme coverage may limit synthesis quality. We recommend at least 10 hours of high-quality speech per new language.
Q7: How does the model handle background noise?
A: The Whisper encoder provides robust features. During DPO training, we augment real speech with moderate noise (SNR 10–20 dB). For extreme noise (e.g., factory floors), add custom augmentation in process/data_process.py and re-run the robustness DPO stage.
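The mixing itself is a few lines once a target SNR is chosen. A sketch using torchaudio for I/O; where you hook it into process/data_process.py depends on your pipeline, and the file paths are placeholders.

```python
import torch
import torchaudio

def mix_at_snr(speech_path: str, noise_path: str, snr_db: float = 10.0) -> torch.Tensor:
    """Mix a noise clip into speech at the requested signal-to-noise ratio."""
    speech, sr = torchaudio.load(speech_path)
    noise, nsr = torchaudio.load(noise_path)
    noise = torchaudio.functional.resample(noise, nsr, sr)
    # Loop or trim the noise to match the speech length.
    reps = (speech.shape[1] // noise.shape[1]) + 1
    noise = noise.repeat(1, reps)[:, : speech.shape[1]]
    # Scale noise so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean()
    scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

noisy = mix_at_snr("clips/order_001.wav", "noise/factory_floor.wav", snr_db=10.0)
```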
Q8: Can I replace Qwen3 with another LLM backbone?
A: Architecturally yes, but it requires retraining the Adapter and SRH from scratch. The embedding dimension mismatch must be resolved, and you’d need to re-run Core-Cocktail on the new backbone. The current checkpoint is tightly coupled to Qwen3-VL-8B’s hidden size (4096).