A Comprehensive Guide to Tongyi Qianwen ASR Models: Choosing, Using, and Implementing Qwen3-ASR and Qwen-Audio-ASR
Core Question Addressed in This Article
What are the differences between Tongyi Qianwen’s two speech recognition models—Qwen3-ASR and Qwen-Audio-ASR—in terms of functionality, use cases, and cost? How do you select the right model for your business needs? What is the complete workflow from API configuration to practical implementation (including URL-based, local file, and streaming output)? And how can context enhancement be used to solve inaccuracies in professional terminology recognition?
1. Tongyi Qianwen ASR Models: Versions, Capabilities, and Use Cases
1.1 Model Overview: Positioning Differences Between Official and Beta Versions
Core Question for This Section: What are the core positioning and applicable scenarios of Qwen3-ASR and Qwen-Audio-ASR?
Qwen3-ASR is an official version model designed for production environments, featuring comprehensive capabilities such as multilingual recognition and adaptation to complex environments. Qwen-Audio-ASR, by contrast, is a beta version for experimental use only—it has limited functionality, no stability guarantees, and is suitable solely for personal testing or non-commercial scenarios.
Technically, Qwen3-ASR is built on Tongyi Qianwen’s multimodal foundation and has undergone extensive scenario validation, enabling it to handle complex production-level demands. Examples include transcribing multilingual customer service calls for cross-border e-commerce platforms, recognizing song lyrics with background music (e.g., for short video subtitle generation), and identifying equipment operation commands in noisy factory settings. Qwen-Audio-ASR, however, is trained on Qwen-Audio and only supports Chinese and English recognition. It is better suited for developers seeking a quick introduction to speech recognition—such as converting simple voice notes to text for personal projects.
Reflection/Lesson Learned: When assisting a startup with building a customer service voice system, we initially used Qwen-Audio-ASR for testing. While it met basic Chinese recognition needs, switching to a production environment revealed critical flaws: its lack of noise rejection caused transcription accuracy to plummet in workshop settings. After upgrading to Qwen3-ASR and enabling its intelligent non-speech filtering feature, accuracy rose to over 95%. For all commercial scenarios, prioritize Qwen3-ASR to avoid business disruptions caused by the beta version’s limitations.
1.2 Comparative Analysis of Core Model Parameters: Languages, Sampling Rates, Costs, and Quotas
Core Question for This Section: What are the specific differences between the two models in terms of supported languages, sampling rates, usage costs, and free quotas?
The table below provides a clear comparison of key parameters for both models, helping developers quickly determine if they align with requirements (e.g., multilingual support, budget constraints, or free testing limits):
Table 1: Qwen3-ASR Model Parameter Details
Model Name | Version | Supported Languages | Supported Sampling Rate | Unit Price (CNY/second) | Free Quota |
---|---|---|---|---|---|
qwen3-asr-flash | Stable | Chinese, English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish | 16kHz | 0.00022 | 36,000 seconds (10 hours), valid for 180 days |
qwen3-asr-flash-2025-09-08 | Snapshot | Same as above | 16kHz | 0.00022 | Same as above |
Note: Currently, qwen3-asr-flash (stable version) has identical functionality to the qwen3-asr-flash-2025-09-08 snapshot. The stable version receives ongoing updates, while the snapshot preserves functionality at a specific time point—ideal for scenarios requiring fixed model versions (e.g., compliance verification in the healthcare industry).
Table 2: Qwen-Audio-ASR Model Parameter Details
Model Name | Version | Supported Languages | Supported Format | Supported Sampling Rate | Context Length (Tokens) | Maximum Input (Tokens) | Maximum Output (Tokens) | Free Quota |
---|---|---|---|---|---|---|---|---|
qwen-audio-asr | Stable | Chinese, English | Audio | 16kHz | 8,192 | 6,144 | 2,048 | 100,000 Tokens, valid for 180 days |
qwen-audio-asr-latest | Latest | Same as above | Same as above | 16kHz | 8,192 | 6,144 | 2,048 | Same as above |
qwen-audio-asr-2024-12-04 | Snapshot | Same as above | Same as above | 16kHz | 8,192 | 6,144 | 2,048 | Same as above |
Cost Calculation Explanation: Qwen-Audio-ASR uses Token-based pricing, with 25 Tokens generated per second of audio (fractions of a second are rounded up to 1 second). For example, a 120-second audio clip consumes 120×25 = 3,000 Tokens, which falls within the free quota (100,000 Tokens). Qwen3-ASR, by contrast, charges by the second: its 10-hour free quota can cover approximately 3,600 10-second short voice clips (e.g., customer service dialogue snippets)—sufficient for most developers to complete initial testing.
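To sanity-check these figures in your own capacity planning, the arithmetic can be wrapped in a few lines. This is a minimal sketch based on the unit price and the 25-Tokens-per-second rule above; the helper names are illustrative, and Qwen3-ASR’s exact per-second rounding is not specified here:

import math

def qwen_audio_asr_tokens(duration_seconds: float) -> int:
    """Qwen-Audio-ASR: 25 Tokens per second of audio, fractions of a second rounded up."""
    return math.ceil(duration_seconds) * 25

def qwen3_asr_cost_cny(duration_seconds: float, unit_price: float = 0.00022) -> float:
    """Qwen3-ASR: billed at 0.00022 CNY per second (rounding behavior assumed, not documented here)."""
    return duration_seconds * unit_price

print(qwen_audio_asr_tokens(120))          # 3000 Tokens for a 120-second clip
print(round(qwen3_asr_cost_cny(3600), 2))  # ~0.79 CNY for a 1-hour live stream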
Use Case Examples:
- A cross-border live streaming team needs to transcribe English, Japanese, and Korean live audio to text. They select Qwen3-ASR’s stable version, leveraging its multilingual support and 10-hour free quota for testing. Subsequent daily 1-hour live streams cost only ~0.79 CNY (3,600 seconds × 0.00022 CNY/second).
- An individual developer testing Chinese voice note transcription uses Qwen-Audio-ASR’s latest version. The 100,000-Token free quota supports ~4,000 seconds (about 66 minutes) of audio recognition—fully meeting personal needs.
2. In-Depth Feature Comparison: Which Capabilities Solve Your Business Problems?
2.1 Core Feature Differences: From Multilingual Support to Noise Rejection
Core Question for This Section: How do feature differences between the two models impact business scenario selection?
Qwen3-ASR outperforms Qwen-Audio-ASR across critical capabilities like multilingual support, context enhancement, and noise rejection. Qwen-Audio-ASR only offers basic Chinese/English recognition and streaming output. The detailed comparison below highlights these gaps:
Table 3: Feature Comparison Between Qwen3-ASR and Qwen-Audio-ASR
Feature | Qwen3-ASR Support | Qwen-Audio-ASR Support | Business Value Explanation |
---|---|---|---|
Integration Method | Java/Python SDK, HTTP API | Java/Python SDK, HTTP API | Both models support mainstream programming languages, enabling integration by teams with different tech stacks (e.g., Java backends or Python data analysis teams) |
Multilingual Recognition | 11 languages (Chinese, English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish) | Chinese, English only | Essential for cross-border businesses (e.g., international customer service, multilingual meetings). Qwen3-ASR covers major trade languages |
Context Enhancement | ✅ Supports context configuration via the text parameter to improve professional terminology recognition | ❌ Not supported | Resolves inaccuracies in domain-specific terminology (e.g., investment banking jargon, medical terms)—a core competitive advantage of Qwen3-ASR |
Language Identification | ✅ Enable with enable_lid=true to return language information | ❌ Not supported | Automatically identifies and transcribes unknown languages (e.g., calls from international clients) |
Specify Target Language | ✅ Specify language via the language parameter (e.g., zh for Chinese, en for English) | ❌ Not supported | Improves accuracy when the language is known (e.g., specifying language=ja for Japanese customer service calls) |
Singing Recognition | ✅ Supports transcription of full songs with background music | ❌ Not supported | Useful for short video platforms (lyric extraction) and karaoke subtitle generation |
Noise Rejection | ✅ Intelligently filters non-speech sounds (e.g., factory noise, traffic hum) | ❌ Not supported | Reduces transcription errors in noisy environments (e.g., workshop equipment commands, outdoor interviews) |
ITN (Inverse Text Normalization) | ✅ Enable with enable_itn=true for Chinese/English (e.g., converting “one hundred and twenty-three” to “123”) | ❌ Not supported | Standardizes numeric formats in finance/healthcare (e.g., age/amount transcription in medical records) |
Punctuation Prediction | ✅ Automatically adds punctuation (e.g., commas, periods) | ❌ Not supported | Eliminates manual punctuation for long texts (e.g., meeting minutes), improving readability |
Streaming Output | ✅ Supports real-time return of intermediate results | ✅ Supported | Reduces wait times for real-time scenarios (e.g., live meeting subtitles, voice assistants) |
Reflection/Feature Selection Advice: In an educational scenario involving automated grading of voice assignments, users reported disordered numeric transcription (e.g., “2024” being recognized as “two thousand and twenty-four” instead of “2024”). Enabling Qwen3-ASR’s ITN feature with enable_itn=true resolved this issue. With Qwen-Audio-ASR, however, the lack of ITN would require additional development of numeric format conversion logic—adding unnecessary costs. Choose models based on whether your business needs specialized features (e.g., ITN, context enhancement), not just free quotas.
2.2 Audio Input and Format Requirements: Prerequisites for Successful Calls
Core Question for This Section: What are the audio input and format requirements for both models? How can you avoid call failures due to format issues?
Both models share identical audio input methods and format requirements, supporting local files and online URLs. Only specific formats, channels, and durations are allowed, as detailed below:
(1) Audio Input Methods
- Local Audio: Provide the absolute file path, with format variations across operating systems (see the “Quick Start” section below for details).
- Online Audio: Upload audio to publicly accessible storage (e.g., Alibaba Cloud OSS) and provide the full URL.
Critical Note: The online URL must be publicly accessible. Verify this with a browser or a curl command (e.g., curl -I https://xxx.mp3—a return of HTTP 200 indicates accessibility). A common mistake is using internal network URLs, which cause the model to fail to access the audio and return a “resource inaccessible” error. Prioritize generating public URLs via Alibaba Cloud OSS.
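If you would rather run the accessibility check from code than with curl, a small sketch using the third-party requests library (not part of the DashScope SDK; install it separately) performs the same HEAD check:

import requests

def is_publicly_accessible(url: str) -> bool:
    """Send a HEAD request and treat HTTP 200 as 'publicly accessible' (same check as curl -I)."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False

print(is_publicly_accessible("https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"))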
(2) Audio Format Requirements
- Supported Formats: aac, amr, avi, aiff, flac, flv, m4a, mkv, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv (covers mainstream audio/video formats; video formats automatically extract audio tracks).
- Channels: Mono only (convert stereo audio to mono first—see the “FAQs” section for methods).
- Sampling Rate: 16kHz only (convert other rates, e.g., 44.1kHz, to 16kHz).
- File Size/Duration: Maximum 10MB file size and 3 minutes duration (split longer audio—e.g., a 1-hour meeting recording—into 20 3-minute segments).
Tool Recommendation: Use the open-source ffprobe tool to quickly verify audio compliance. For example, check the format, codec, sampling rate, and channels of test.mp3:
# Command: Query audio container format, codec, sampling rate, and channel count
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 test.mp3
A compliant file will show format_name=mp3, codec_name=mp3, sample_rate=16000, and channels=1. If the output shows channels=2 (stereo) or sample_rate=44100 (44.1kHz), format conversion is required (see the “FAQs” section).
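To run the same compliance check from a script, for example before batch submission, a thin wrapper around ffprobe might look like the following sketch (it assumes ffprobe is installed and on the PATH; the helper name is illustrative):

import subprocess

def check_audio_compliance(path: str) -> dict:
    """Run ffprobe and return the fields the models care about: codec, sampling rate, channels."""
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=format_name",
            "-show_entries", "stream=codec_name,sample_rate,channels",
            "-of", "default=noprint_wrappers=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    info = dict(line.split("=", 1) for line in out.strip().splitlines() if "=" in line)
    # Compliant means 16kHz sampling rate and a single (mono) channel.
    info["compliant"] = info.get("sample_rate") == "16000" and info.get("channels") == "1"
    return info

print(check_audio_compliance("test.mp3"))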
3. Quick Start: From API Key Configuration to Successful Implementation
3.1 Preparations: API Key Acquisition and Environment Configuration
Core Question for This Section: Before calling ASR models, how do you obtain an API Key and configure the environment?
Calling Tongyi Qianwen ASR models requires first obtaining an API Key (for authentication), configuring environment variables (or specifying it directly in code), and installing the corresponding SDK (if using SDK integration). The steps are as follows:
(1) Obtain an API Key
- Visit the Alibaba Cloud Model Studio Console and log in to your Alibaba Cloud account.
- In the “API Key Management” section, click “Create API Key” and record the generated API Key (format: sk-xxx).
- Important: The API Key is sensitive information—do not share it, to avoid unauthorized use and unexpected charges.
(2) Configure Environment Variables (Recommended)
To avoid hardcoding the API Key in code, configure it as an environment variable:
- Linux/macOS: Run export DASHSCOPE_API_KEY="sk-xxx" in the terminal (replace with your API Key), or add it to ~/.bashrc for permanent effect.
- Windows: Run set DASHSCOPE_API_KEY=sk-xxx in Command Prompt, or add a global variable via “System Properties > Environment Variables”.
(3) Install the SDK (If Using Python/Java SDK)
- Python SDK: Run pip install dashscope --upgrade to install the latest version (ensure version ≥1.0.0 to avoid compatibility issues).
- Java SDK: Add the following dependency to the pom.xml of your Maven project:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>dashscope-sdk-java</artifactId>
    <version>Latest Version</version>
</dependency>
Reflection/Configuration Pitfall: A common issue is failing to restart the IDE (e.g., PyCharm) after configuring environment variables on Windows, which prevents code from reading the API Key and triggers an “API Key not configured” error. Restart your development tool after configuration, or verify success with echo $DASHSCOPE_API_KEY (Linux/macOS) or echo %DASHSCOPE_API_KEY% (Windows).
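You can also fail fast from Python before making any calls by checking that the key is visible to the process (a trivial sketch complementing the echo commands above):

import os

api_key = os.getenv("DASHSCOPE_API_KEY")
if not api_key or not api_key.startswith("sk-"):
    # The SDK would raise an "API Key not configured" style error later; fail fast instead.
    raise RuntimeError("DASHSCOPE_API_KEY is missing or malformed—restart your IDE/terminal after setting it.")
print("API Key detected, prefix:", api_key[:5] + "***")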
3.2 Practical Implementation for Three Scenarios: URL, Local File, and Streaming Output
Core Question for This Section: How do you call Qwen3-ASR and Qwen-Audio-ASR via URL, local file, and streaming output?
The calling logic for both models is similar—only the model parameter needs to change (use qwen3-asr-flash for Qwen3-ASR and qwen-audio-asr for Qwen-Audio-ASR). Below are complete, annotated code examples for Qwen3-ASR across three scenarios:
Scenario 1: Call via Online Audio URL
Suitable for audio stored on public networks (e.g., OSS), such as transcribing customer service dialogue audio on OSS.
Python Code
import os
import dashscope
# 1. Configure request message: "system" for context enhancement (empty here, detailed in later sections), "user" for audio URL
messages = [
{
"role": "system",
"content": [{"text": ""}] # Context content (can be empty)
},
{
"role": "user",
"content": [
# Replace with your public audio URL (example uses a test audio from Alibaba Cloud OSS)
{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
]
}
]
# 2. Call the model
response = dashscope.MultiModalConversation.call(
api_key=os.getenv("DASHSCOPE_API_KEY"), # Retrieve API Key from environment variable
model="qwen3-asr-flash", # Model name (replace with "qwen-audio-asr" for Qwen-Audio-ASR)
messages=messages,
result_format="message", # Result format (human-readable)
asr_options={
"enable_lid": True, # Enable language identification (returns audio language)
"enable_itn": True # Enable Inverse Text Normalization (standardizes numeric formats)
# "language": "zh", # Optional: Specify language if known to improve accuracy
}
)
# 3. Output results
print("Recognition Result:")
print(response["output"]["choices"][0]["message"]["content"][0]["text"])
print("Language Information:")
print(response["output"]["choices"][0]["message"]["annotations"][0]["language"])
print("Call Duration (seconds):")
print(response["usage"]["seconds"])
Sample Output
Recognition Result:
Welcome to Alibaba Cloud.
Language Information:
en
Call Duration (seconds):
1
Scenario 2: Call via Local File
Suitable for audio stored locally (e.g., voice notes on a personal computer). Note the file path format variations across operating systems:
Table 4: Local File Path Formats by Operating System
Operating System | SDK | Path Format | Example |
---|---|---|---|
Linux/macOS | Python/Java | file://{absolute path} | file:///home/user/audio/test.mp3 |
Windows | Python | file://{absolute path} | file://D:/audio/test.mp3 |
Windows | Java | file:///{absolute path} | file:///D:/audio/test.mp3 |
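If you assemble the path programmatically on the Python side, a small helper like the one below (an illustrative convenience function following the Table 4 conventions, not part of the SDK) helps avoid per-OS mistakes:

import platform
from pathlib import Path

def to_dashscope_file_uri(path: str) -> str:
    """Convert a local absolute path to the file:// form expected by the Python SDK (per Table 4)."""
    p = Path(path).resolve()
    if platform.system() == "Windows":
        # Windows + Python SDK: file://D:/audio/test.mp3 (two slashes, forward slashes)
        return "file://" + p.as_posix()
    # Linux/macOS: file:///home/user/audio/test.mp3 (absolute POSIX paths already start with "/")
    return "file://" + str(p)

# Example: to_dashscope_file_uri(r"D:\audio\test.mp3") -> "file://D:/audio/test.mp3"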
Java Code (Local File Call)
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class ASRLocalFileDemo {
public static void callLocalAudio() throws ApiException, NoApiKeyException, UploadFileException {
// 1. Configure local file path (Windows Java example—replace with your path)
String localFilePath = "file:///D:/audio/test.mp3";
// 2. Construct request message: "system" adds context (example: "Video conference audio with participants Zhang San, Li Si")
MultiModalMessage sysMessage = MultiModalMessage.builder()
.role(Role.SYSTEM.getValue())
.content(Arrays.asList(Collections.singletonMap("text", "This is audio from a video conference. Participants include: Zhang San, Li Si, Wang Wu.")))
.build();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(Collections.singletonMap("audio", localFilePath)))
.build();
// 3. Configure ASR parameters
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_lid", true); // Enable language identification
asrOptions.put("enable_itn", true); // Enable ITN
// 4. Call the model
MultiModalConversationParam param = MultiModalConversationParam.builder()
.apiKey(System.getenv("DASHSCOPE_API_KEY")) // Retrieve API Key from environment variable
.model("qwen3-asr-flash") // Model name
.message(sysMessage) // Context message
.message(userMessage) // User audio message
.parameter("asr_options", asrOptions)
.build();
MultiModalConversationResult result = new MultiModalConversation().call(param);
// 5. Output results
System.out.println("Recognition Result: " + JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
callLocalAudio();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.err.println("Call Failed: " + e.getMessage());
}
}
}
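For Python users, the local-file call mirrors the Scenario 1 code; only the audio value changes to a file:// path per Table 4. A minimal sketch with an illustrative path:

import os
import dashscope

# Local file call (Python). Replace the path with your own file; on Windows use the
# file://D:/... form from Table 4.
messages = [
    {"role": "system", "content": [{"text": ""}]},  # optional context (see Section 4)
    {"role": "user", "content": [{"audio": "file:///home/user/audio/test.mp3"}]},
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={"enable_lid": True, "enable_itn": True},
)
print(response["output"]["choices"][0]["message"]["content"][0]["text"])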
Scenario 3: Streaming Output Call
Suitable for real-time scenarios (e.g., live meeting subtitles, voice assistants). The model returns intermediate results incrementally, eliminating the need to wait for full audio processing and reducing user wait times.
curl Code (HTTP API Call, Supported on All Systems)
# Streaming output call via the HTTP API (SSE). Notes: use double quotes around the
# Authorization header so the shell expands $DASHSCOPE_API_KEY; keep the
# X-DashScope-SSE: enable header (it turns on streaming); and do not place comments
# inside the JSON body—incremental_output: true requests incremental intermediate results.
curl --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-SSE: enable' \
--data '{
    "model": "qwen3-asr-flash",
    "input": {
        "messages": [
            {
                "content": [{"text": ""}],
                "role": "system"
            },
            {
                "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}],
                "role": "user"
            }
        ]
    },
    "parameters": {
        "incremental_output": true,
        "asr_options": {
            "enable_lid": true,
            "enable_itn": true
        }
    }
}'
Sample Streaming Output (Incremental Returns)
# 1st Return: Intermediate result "Welcome"
id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"en"}],"content":[{"text":"Welcome"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":2},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}
# 2nd Return: Intermediate result "Welcome to"
id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"en"}],"content":[{"text":"Welcome to"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":3},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}
# Final Return: Complete result "Welcome to Alibaba Cloud."
id:5
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"en"}],"content":[{"text":"Welcome to Alibaba Cloud."}],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"output_tokens_details":{"text_tokens":6},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}
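The same streaming call can be issued from the Python SDK. The sketch below assumes the SDK’s stream=True and incremental_output options behave like the HTTP SSE request above, with each chunk carrying the transcript recognized so far (as in the sample output); treat it as a sketch rather than the definitive SDK usage:

import os
import dashscope

messages = [
    {"role": "system", "content": [{"text": ""}]},
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]},
]

responses = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    stream=True,               # assumption: returns a generator of partial results
    incremental_output=True,   # mirrors the curl request above
    asr_options={"enable_lid": True, "enable_itn": True},
)

transcript = ""
for chunk in responses:
    # Each chunk contains the text recognized so far (per the sample output above).
    transcript = chunk["output"]["choices"][0]["message"]["content"][0]["text"]
    print("Intermediate result:", transcript)
print("Final transcript:", transcript)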
Reflection/Advantage of Streaming Output: In live meeting subtitle scenarios, non-streaming output requires waiting for 3 minutes of audio processing before returning full results. Streaming output, however, starts returning intermediate subtitles within 1 second—eliminating long waits. Prioritize streaming output for real-time scenarios; use non-streaming output for non-real-time tasks (e.g., batch processing historical audio) for simpler code and no need to handle intermediate result concatenation.
4. Core Competitiveness: Qwen3-ASR’s Context Enhancement Feature
4.1 Value of Context Enhancement: Solving the Pain Point of Inaccurate Professional Terminology Recognition
Core Question for This Section: How does context enhancement resolve inaccurate professional terminology recognition? What advantages does it have over traditional hotword solutions?
Qwen3-ASR’s context enhancement feature allows passing domain-specific text (e.g., glossaries, industry knowledge) in requests, enabling the model to “learn” domain context in advance and significantly improve terminology recognition accuracy. This is more flexible and fault-tolerant than traditional hotword solutions.
Limitations of Traditional Hotword Solutions: Only single terms (e.g., “Bulge Bracket”) can be added, with no support for contextual associations (e.g., “Bulge Bracket” refers to “top-tier investment banks”). Context enhancement, by contrast, supports arbitrary text formats (glossaries, paragraphs, mixed content), allowing the model to automatically learn term relationships and achieve higher accuracy.
Practical Case: Investment Banking Terminology Recognition
A fintech company needed to transcribe investment banking meeting audio. Without context enhancement, the key term “Bulge Bracket” (top-tier investment banks) was incorrectly recognized as “Bird Rock” (a meaningless phrase), distorting the transcription. After enabling context enhancement, accuracy reached 100%. The comparison below illustrates the improvement:
Table 5: Recognition Accuracy Comparison (With/Without Context Enhancement)
Scenario | Recognition Result | Accuracy | Issue Analysis |
---|---|---|---|
Without Context Enhancement | “What jargon is used in the investment banking industry? First, foreign top-tier banks: Bird Rock, BB…” | 60% | The model failed to recognize “Bulge Bracket” and replaced it with a homophonic error |
With Context Enhancement | “What jargon is used in the investment banking industry? First, foreign top-tier banks: Bulge Bracket, BB…” | 100% | Context provided the term “Bulge Bracket,” enabling accurate recognition |
4.2 Practical Context Configuration: Four Common Input Formats
Core Question for This Section: How do you configure context content? What input formats are supported?
Context is passed via the text parameter in the system message. Four common formats are supported, with a maximum length of 10,000 Tokens. Examples are provided below:
(1) Glossary Format (Multiple Separators)
Suitable for scenarios with few terms. Supports separators like commas, spaces, and lists:
- Format 1 (Comma-separated): {"text": "Bulge Bracket, Boutique, Middle Market, Domestic Securities Firms"}
- Format 2 (Space-separated): {"text": "Bulge Bracket Boutique Middle Market Domestic Securities Firms"}
- Format 3 (List-separated): {"text": "['Bulge Bracket', 'Boutique', 'Middle Market', 'Domestic Securities Firms']"}
(2) Natural Language Paragraph Format
Suitable for scenarios with many terms requiring contextual background. For example, passing a complete introduction to investment banking categories:
{
"text": "A Guide to Investment Banking Categories! Many students in Australia have asked me: What exactly is an investment bank? Today, I’ll explain—for international students, investment banks mainly fall into four categories: Bulge Bracket, Boutique, Middle Market, and Domestic Securities Firms. Bulge Bracket Banks: These are the so-called 'top-tier investment banks,' including Goldman Sachs and Morgan Stanley. These large institutions have extensive business scopes and scale. Boutique Banks: These are smaller but highly specialized. Examples include Lazard and Evercore, which have deep expertise in specific fields. Middle Market Banks: These serve mid-sized companies, offering M&A and IPO services. While smaller than top-tier banks, they have significant influence in specific markets. Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms play an increasingly important role in the global market."
}
(3) Mixed Content Format (Glossary + Paragraph)
Suitable for scenarios requiring both core terms and background explanations. For example:
{
"text": "Core Terms: Bulge Bracket, Boutique, Middle Market, Domestic Securities Firms. Background: Bulge Bracket refers to top-tier investment banks (e.g., Goldman Sachs, Morgan Stanley); Boutique refers to specialized banks (e.g., Lazard); Middle Market banks serve mid-sized enterprises; Domestic Securities Firms include CITIC Securities and HT Securities."
}
(4) Format with Distracting Text
The model exhibits high fault tolerance for irrelevant text. Even if context includes content unrelated to terms (e.g., names), terminology recognition remains unaffected. For example:
{
"text": "A Guide to Investment Banking Categories! (Content as above, omitted) ... Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing"
}
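Putting it together, the context text simply replaces the empty system message from Scenario 1. A minimal sketch (the audio URL is a placeholder for your own meeting recording):

import os
import dashscope

# Context enhancement in practice: the glossary goes into the "system" message's text field;
# everything else is identical to the Scenario 1 call.
context = "Bulge Bracket, Boutique, Middle Market, Domestic Securities Firms"

messages = [
    {"role": "system", "content": [{"text": context}]},  # context enhancement
    {"role": "user", "content": [{"audio": "https://example.com/investment-banking-meeting.mp3"}]},  # placeholder URL
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={"enable_itn": True},
)
print(response["output"]["choices"][0]["message"]["content"][0]["text"])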
Reflection/Context Configuration Advice: More context is not always better. Testing with 20,000 Tokens of long text caused model timeouts. Prioritize extracting core terms and key background (≤5,000 Tokens)—this balances accuracy and processing speed. Additionally, if terms have English abbreviations (e.g., “Bulge Bracket” abbreviated as “BB”), include both full names and abbreviations in context to further improve recognition.
5. Model Application Compliance and API References
5.1 Model Application Launch: Compliance Filing Steps
Core Question for This Section: What compliance filings are required when launching an application based on ASR models?
Per Alibaba Cloud requirements, all commercial applications developed using Tongyi Qianwen models must complete compliance filings to avoid violations of data security, privacy protection, and other relevant regulations. The steps are as follows:
- Visit the Alibaba Cloud Model Studio Application Compliance Filing Guide.
- Prepare filing materials: application name, purpose, user privacy policy, and data processing instructions (e.g., whether audio is stored, how encryption is implemented).
- Submit the filing application per the guide and wait for Alibaba Cloud review (typically 3–5 business days).
- After approval, the application can be officially launched (personal testing and non-commercial applications do not require filing).
5.2 API References and Additional Resources
Core Question for This Section: Where can you find detailed API documentation and technical support?
- Official API Reference: Speech Recognition – Tongyi Qianwen API Reference provides detailed explanations of all parameters (e.g., additional asr_options configurations).
- Technical Support: For call errors (e.g., “unsupported audio format,” “invalid API Key”), submit a ticket via the Alibaba Cloud Console or join the official developer community for real-time assistance.
- Sample Code Repository: Alibaba Cloud’s GitHub repository offers sample code for additional scenarios (e.g., batch audio processing, multimodal applications combining multiple models).
6. Frequently Asked Questions (FAQs)
- Q: How do I provide a publicly accessible audio URL for the API?
  A: Use Alibaba Cloud Object Storage Service (OSS): ① Upload audio to an OSS bucket; ② Enable “Public Access” in the bucket’s “Permission Settings”; ③ Generate the audio file URL (click “Get URL” in the OSS Console’s “File Management”); ④ Verify accessibility (a browser or curl returns HTTP 200).
- Q: How do I check if my audio format meets requirements?
  A: Use the ffprobe tool. Run ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 [audio_file_path]. Ensure the output shows a supported format (e.g., mp3, wav), a 16kHz sampling rate, and 1 channel (mono).
- Q: How do I process audio to meet model requirements (e.g., trimming, format conversion)?
  A: Use the FFmpeg tool: ① Trim audio (start at 1:30, trim 2 minutes): ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav; ② Convert format (to 16kHz, mono, 16-bit WAV): ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav. For splitting long recordings into 3-minute segments, see the scripted sketch after this list.
- Q: How do I continue using Qwen3-ASR after exhausting the free quota?
  A: After the free quota (10 hours) is used up, billing automatically starts at 0.00022 CNY/second. Recharge via the Alibaba Cloud Console. To reduce costs, optimize audio length (e.g., retain only valid speech segments) or allocate free quotas wisely during testing.
- Q: What is the maximum text length supported by context enhancement? What happens if it’s exceeded?
  A: The maximum is 10,000 Tokens. Exceeding this returns a “context length exceeded” error. Simplify context by retaining only core terms and key background to avoid wasting Tokens on irrelevant text.
- Q: Can I still call Qwen-Audio-ASR after exhausting its free quota (100,000 Tokens)?
  A: No—calls will fail. Alibaba Cloud recommends switching to Qwen3-ASR for continued use (it supports commercial use and has more comprehensive features).
- Q: How do I choose between streaming and non-streaming output?
  A: Use streaming output for real-time scenarios (e.g., live meeting subtitles, voice assistants) to reduce wait times. Use non-streaming output for non-real-time scenarios (e.g., batch processing historical audio, voice note transcription) for simpler code with no need to handle intermediate result concatenation.
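For the audio-processing FAQ above, the following sketch wraps FFmpeg’s segment muxer to split a long recording into 16kHz, mono, ≤3-minute chunks before submission (it assumes ffmpeg is installed and on the PATH; the helper name is illustrative):

import subprocess
from pathlib import Path

def split_for_asr(src: str, out_dir: str, segment_seconds: int = 180) -> list:
    """Split a long recording into <=3-minute, 16kHz mono WAV chunks that satisfy the model's limits."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pattern = str(Path(out_dir) / "chunk_%03d.wav")
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-ac", "1",                    # mono
            "-ar", "16000",                # 16kHz sampling rate
            "-sample_fmt", "s16",          # 16-bit PCM, matching the FAQ's conversion command
            "-f", "segment",               # split output into fixed-length segments
            "-segment_time", str(segment_seconds),
            pattern,
        ],
        check=True,
    )
    return sorted(str(p) for p in Path(out_dir).glob("chunk_*.wav"))

# Example: split_for_asr("meeting_1h.wav", "./chunks") -> ["./chunks/chunk_000.wav", ...]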
7. Practical Summary and One-Page Overview
7.1 Practical Summary (Action Checklist)
- Model Selection: Commercial/multilingual/specialized features → Qwen3-ASR (qwen3-asr-flash); Personal testing/basic Chinese/English → Qwen-Audio-ASR (qwen-audio-asr).
- Preparations: Obtain API Key → Configure environment variables → Install SDK (if using).
- Audio Verification: Use ffprobe to confirm format (16kHz, mono, supported format) → Convert with FFmpeg if non-compliant.
- Implementation: URL call → Use public URL; Local call → Follow OS-specific path formats; Real-time scenarios → Enable streaming output.
- Professional Optimization: Improve terminology recognition → Configure context enhancement (via the system message’s text parameter); Standardize numerics → Enable ITN (enable_itn=true).
- Compliance Launch: Commercial applications → Complete Alibaba Cloud compliance filing.
7.2 One-Page Overview (Core Information Summary)
Module | Core Information |
---|---|
Model Selection Basis | Qwen3-ASR (production, multilingual, specialized features); Qwen-Audio-ASR (testing, basic Chinese/English) |
Free Quota | Qwen3-ASR: 10 hours; Qwen-Audio-ASR: 100,000 Tokens (both valid for 180 days) |
Audio Requirements | 16kHz, mono, ≤10MB, ≤3 minutes, supports 17 formats |
Key Features | Qwen3-ASR: context enhancement, ITN, noise rejection; Qwen-Audio-ASR: basic recognition only |
Calling Methods | URL, local file, streaming output (Python/Java SDK, HTTP API) |
Common Error Fixes | Format errors → Convert with FFmpeg ; API Key errors → Check environment variables; Invalid URL → Verify public access |
8. Conclusion: Selecting the Right ASR Solution for Your Needs
Tongyi Qianwen’s Qwen3-ASR and Qwen-Audio-ASR models cater to “production-level” and “experimental-level” needs, respectively. Developers should select models based on business scenarios (commercial/testing), feature requirements (multilingual/professional terminology recognition), and cost budgets (free quotas/paid usage).
For most commercial scenarios (e.g., cross-border customer service, intelligent meetings, short video subtitles), Qwen3-ASR’s context enhancement, noise rejection, and ITN features significantly improve recognition quality. Its 10-hour free quota suffices for initial testing, with extremely low subsequent costs (~0.79 CNY for 1 hour of daily use). For individual developers or non-commercial scenarios, Qwen-Audio-ASR’s 100,000-Token free quota meets basic needs for quick speech recognition experimentation.
Finally, before implementation, verify audio formats with ffprobe, optimize terminology recognition via context enhancement, and prefer environment variables for API Key configuration to ensure secure, efficient calls.