A Comprehensive Guide to Tongyi Qianwen ASR Models: Choosing, Using, and Implementing Qwen3-ASR and Qwen-Audio-ASR

Core Question Addressed in This Article

What are the differences between Tongyi Qianwen’s two speech recognition models—Qwen3-ASR and Qwen-Audio-ASR—in terms of functionality, use cases, and cost? How do you select the right model for your business needs? What is the complete workflow from API configuration to practical implementation (including URL-based, local file, and streaming output)? And how can context enhancement be used to solve inaccuracies in professional terminology recognition?

1. Tongyi Qianwen ASR Models: Versions, Capabilities, and Use Cases

1.1 Model Overview: Positioning Differences Between Official and Beta Versions

Core Question for This Section: What are the core positioning and applicable scenarios of Qwen3-ASR and Qwen-Audio-ASR?
Qwen3-ASR is an official version model designed for production environments, featuring comprehensive capabilities such as multilingual recognition and adaptation to complex environments. Qwen-Audio-ASR, by contrast, is a beta version for experimental use only—it has limited functionality, no stability guarantees, and is suitable solely for personal testing or non-commercial scenarios.

Technically, Qwen3-ASR is built on Tongyi Qianwen’s multimodal foundation and has undergone extensive scenario validation, enabling it to handle complex production-level demands. Examples include transcribing multilingual customer service calls for cross-border e-commerce platforms, recognizing song lyrics with background music (e.g., for short video subtitle generation), and identifying equipment operation commands in noisy factory settings. Qwen-Audio-ASR, however, is trained on Qwen-Audio and only supports Chinese and English recognition. It is better suited for developers seeking a quick introduction to speech recognition—such as converting simple voice notes to text for personal projects.

Reflection/Lesson Learned: When assisting a startup with building a customer service voice system, we initially used Qwen-Audio-ASR for testing. While it met basic Chinese recognition needs, switching to a production environment revealed critical flaws: its lack of noise rejection caused transcription accuracy to plummet in workshop settings. After upgrading to Qwen3-ASR and enabling its intelligent non-speech filtering feature, accuracy rose to over 95%. For all commercial scenarios, prioritize Qwen3-ASR to avoid business disruptions caused by the beta version’s limitations.

1.2 Comparative Analysis of Core Model Parameters: Languages, Sampling Rates, Costs, and Quotas

Core Question for This Section: What are the specific differences between the two models in terms of supported languages, sampling rates, usage costs, and free quotas?
The table below provides a clear comparison of key parameters for both models, helping developers quickly determine if they align with requirements (e.g., multilingual support, budget constraints, or free testing limits):

Table 1: Qwen3-ASR Model Parameter Details

| Model Name | Version | Supported Languages | Supported Sampling Rate | Unit Price (CNY/second) | Free Quota |
| --- | --- | --- | --- | --- | --- |
| qwen3-asr-flash | Stable | Chinese, English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish | 16kHz | 0.00022 | 36,000 seconds (10 hours), valid for 180 days |
| qwen3-asr-flash-2025-09-08 | Snapshot | Same as above | 16kHz | 0.00022 | Same as above |

Note: Currently, qwen3-asr-flash (stable version) has identical functionality to the qwen3-asr-flash-2025-09-08 snapshot. The stable version receives ongoing updates, while the snapshot preserves functionality at a specific time point—ideal for scenarios requiring fixed model versions (e.g., compliance verification in the healthcare industry).

Table 2: Qwen-Audio-ASR Model Parameter Details

Model Name Version Supported Languages Supported Format Supported Sampling Rate Context Length (Tokens) Maximum Input (Tokens) Maximum Output (Tokens) Free Quota
qwen-audio-asr Stable Chinese, English Audio 16kHz 8,192 6,144 2,048 100,000 Tokens, valid for 180 days
qwen-audio-asr-latest Latest Same as above Same as above 16kHz 8,192 6,144 2,048 Same as above
qwen-audio-asr-2024-12-04 Snapshot Same as above Same as above 16kHz 8,192 6,144 2,048 Same as above

Cost Calculation Explanation: Qwen-Audio-ASR uses Token-based pricing, with 25 Tokens generated per second of audio (fractions of a second are rounded up to 1 second). For example, a 120-second audio clip consumes 120×25 = 3,000 Tokens, which falls within the free quota (100,000 Tokens). Qwen3-ASR, by contrast, charges by the second: its 10-hour free quota can cover approximately 3,600 10-second short voice clips (e.g., customer service dialogue snippets)—sufficient for most developers to complete initial testing.
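For a quick sanity check of budgets, the short Python sketch below applies these two pricing rules (25 Tokens per second for Qwen-Audio-ASR, 0.00022 CNY per second for Qwen3-ASR, fractional seconds rounded up). It is an illustrative estimate, not official billing logic.

import math

QWEN3_ASR_PRICE_PER_SECOND = 0.00022    # CNY per second (Table 1)
QWEN_AUDIO_ASR_TOKENS_PER_SECOND = 25   # Tokens per second (Qwen-Audio-ASR pricing)

def estimate_usage(duration_seconds: float) -> dict:
    """Rough per-clip estimate; fractions of a second are rounded up to a full second."""
    billable_seconds = math.ceil(duration_seconds)
    return {
        "qwen3_asr_cost_cny": round(billable_seconds * QWEN3_ASR_PRICE_PER_SECOND, 6),
        "qwen_audio_asr_tokens": billable_seconds * QWEN_AUDIO_ASR_TOKENS_PER_SECOND,
    }

# Example: the 120-second clip from the explanation above
print(estimate_usage(120))  # {'qwen3_asr_cost_cny': 0.0264, 'qwen_audio_asr_tokens': 3000}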

Use Case Examples:

  • A cross-border live streaming team needs to transcribe English, Japanese, and Korean live audio to text. They select Qwen3-ASR’s stable version, leveraging its multilingual support and 10-hour free quota for testing. Subsequent daily 1-hour live streams cost only ~0.79 CNY (3,600 seconds × 0.00022 CNY/second).
  • An individual developer testing Chinese voice note transcription uses Qwen-Audio-ASR’s latest version. The 100,000-Token free quota supports ~4,000 seconds (66 minutes) of audio recognition—fully meeting personal needs.


2. In-Depth Feature Comparison: Which Capabilities Solve Your Business Problems?

2.1 Core Feature Differences: From Multilingual Support to Noise Rejection

Core Question for This Section: How do feature differences between the two models impact business scenario selection?
Qwen3-ASR outperforms Qwen-Audio-ASR across critical capabilities like multilingual support, context enhancement, and noise rejection. Qwen-Audio-ASR only offers basic Chinese/English recognition and streaming output. The detailed comparison below highlights these gaps:

Table 3: Feature Comparison Between Qwen3-ASR and Qwen-Audio-ASR

| Feature | Qwen3-ASR Support | Qwen-Audio-ASR Support | Business Value Explanation |
| --- | --- | --- | --- |
| Integration Method | Java/Python SDK, HTTP API | Java/Python SDK, HTTP API | Both models support mainstream programming languages, enabling integration by teams with different tech stacks (e.g., Java backends or Python data analysis teams) |
| Multilingual Recognition | 11 languages (Chinese, English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish) | Chinese, English only | Essential for cross-border businesses (e.g., international customer service, multilingual meetings); Qwen3-ASR covers major trade languages |
| Context Enhancement | ✅ Supports context configuration via the text parameter to improve professional terminology recognition | ❌ Not supported | Resolves inaccuracies in domain-specific terminology (e.g., investment banking jargon, medical terms), a core competitive advantage of Qwen3-ASR |
| Language Identification | ✅ Enable with enable_lid=true to return language information | ❌ Not supported | Automatically identifies and transcribes unknown languages (e.g., calls from international clients) |
| Specify Target Language | ✅ Specify language via the language parameter (e.g., zh for Chinese, en for English) | ❌ Not supported | Improves accuracy when the language is known (e.g., specifying language=ja for Japanese customer service calls) |
| Singing Recognition | ✅ Supports transcription of full songs with background music | ❌ Not supported | Useful for short video platforms (lyric extraction) and karaoke subtitle generation |
| Noise Rejection | ✅ Intelligently filters non-speech sounds (e.g., factory noise, traffic hum) | ❌ Not supported | Reduces transcription errors in noisy environments (e.g., workshop equipment commands, outdoor interviews) |
| ITN (Inverse Text Normalization) | ✅ Enable with enable_itn=true for Chinese/English (e.g., converting "one hundred and twenty-three" to "123") | ❌ Not supported | Standardizes numeric formats in finance/healthcare (e.g., age/amount transcription in medical records) |
| Punctuation Prediction | ✅ Automatically adds punctuation (e.g., commas, periods) | ❌ Not supported | Eliminates manual punctuation for long texts (e.g., meeting minutes), improving readability |
| Streaming Output | ✅ Supports real-time return of intermediate results | ✅ Supported | Reduces wait times for real-time scenarios (e.g., live meeting subtitles, voice assistants) |

Reflection/Feature Selection Advice: In an educational scenario involving automated grading of voice assignments, users reported “disordered numeric transcription” (e.g., “2024” being recognized as “two thousand and twenty-four” instead of “2024”). Enabling Qwen3-ASR’s ITN feature with enable_itn=true resolved this issue. With Qwen-Audio-ASR, however, the lack of ITN would require additional development of numeric format conversion logic—adding unnecessary costs. Choose models based on whether your business needs specialized features (e.g., ITN, context enhancement), not just free quotas.

2.2 Audio Input and Format Requirements: Prerequisites for Successful Calls

Core Question for This Section: What are the audio input and format requirements for both models? How can you avoid call failures due to format issues?
Both models share identical audio input methods and format requirements, supporting local files and online URLs. Only specific formats, channels, and durations are allowed, as detailed below:

(1) Audio Input Methods

  • Local Audio: Provide the absolute file path, with format variations across operating systems (see the “Quick Start” section below for details).
  • Online Audio: Upload audio to publicly accessible storage (e.g., Alibaba Cloud OSS) and provide the full URL.

Critical Note: The online URL must be publicly accessible. Verify this using a browser or curl command (e.g., curl -I https://xxx.mp3—a return of HTTP 200 indicates accessibility). A common mistake is using internal network URLs, which cause the model to fail accessing the audio and return a “resource inaccessible” error. Prioritize generating public URLs via Alibaba Cloud OSS.
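If you prefer checking from code rather than with curl, here is a minimal Python sketch using only the standard library; the welcome.mp3 URL is the public test audio used later in the Quick Start and should be replaced with your own.

import urllib.request

def is_publicly_accessible(url: str) -> bool:
    """Send a HEAD request; HTTP 200 means the model service can likely fetch the audio."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except Exception as error:
        print(f"URL check failed: {error}")
        return False

print(is_publicly_accessible("https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"))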

(2) Audio Format Requirements

  • Supported Formats: aac, amr, avi, aiff, flac, flv, m4a, mkv, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv (covers mainstream audio/video formats; video formats automatically extract audio tracks).
  • Channels: Mono only (convert stereo audio to mono first—see the “FAQs” section for methods).
  • Sampling Rate: 16kHz only (convert other rates, e.g., 44.1kHz to 16kHz).
  • File Size/Duration: Maximum 10MB file size and 3 minutes duration (split longer audio—e.g., a 1-hour meeting recording into 20 3-minute segments).

Tool Recommendation: Use the open-source ffprobe tool to quickly verify audio compliance. For example, check the format, codec, sampling rate, and channels of test.mp3:

# Command: Query audio container format, codec, sampling rate, and channel count
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 test.mp3

A valid output will show format_name=mp3, codec_name=mp3, sample_rate=16000, and channels=1. If channels=2 (stereo) or sample_rate=44100 (44.1kHz), format conversion is required (see the “FAQs” section).
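If you need to run this check from a script, the small Python sketch below wraps the same verification and flags non-compliant files; it assumes ffprobe is installed and on the PATH.

import json
import subprocess

def is_compliant_audio(path: str) -> bool:
    """Use ffprobe's JSON output to verify the 16kHz mono requirement described above."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "stream=sample_rate,channels", "-of", "json", path],
        capture_output=True, text=True, check=True
    )
    stream = json.loads(result.stdout)["streams"][0]
    compliant = stream.get("sample_rate") == "16000" and stream.get("channels") == 1
    print(f"{path}: sample_rate={stream.get('sample_rate')}, channels={stream.get('channels')}, compliant={compliant}")
    return compliant

is_compliant_audio("test.mp3")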


3. Quick Start: From API Key Configuration to Successful Implementation

3.1 Preparations: API Key Acquisition and Environment Configuration

Core Question for This Section: Before calling ASR models, how do you obtain an API Key and configure the environment?
Calling Tongyi Qianwen ASR models requires first obtaining an API Key (for authentication), configuring environment variables (or specifying it directly in code), and installing the corresponding SDK (if using SDK integration). The steps are as follows:

(1) Obtain an API Key

  1. Visit the Alibaba Cloud Model Studio Console and log in to your Alibaba Cloud account.
  2. In the “API Key Management” section, click “Create API Key” and record the generated API Key (format: sk-xxx).
  3. Important: The API Key is sensitive information—do not share it to avoid unauthorized use and unexpected charges.

(2) Configure Environment Variables (Recommended)

To avoid hardcoding the API Key in code, configure it as an environment variable:

  • Linux/macOS: Run export DASHSCOPE_API_KEY="sk-xxx" in the terminal (replace with your API Key), or add it to ~/.bashrc for permanent effect.
  • Windows: Run set DASHSCOPE_API_KEY=sk-xxx in Command Prompt, or add a global variable via “System Properties > Environment Variables”.

(3) Install the SDK (If Using Python/Java SDK)

  • Python SDK: Run pip install dashscope --upgrade to install the latest version (ensure version ≥1.0.0 to avoid compatibility issues).
  • Java SDK: Add the following dependency to the pom.xml of your Maven project:

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>dashscope-sdk-java</artifactId>
        <version>Latest Version</version>
    </dependency>
    

Reflection/Configuration Pitfall: A common issue is failing to restart the IDE (e.g., PyCharm) after configuring environment variables on Windows, preventing code from reading the API Key and triggering an “API Key not configured” error. Restart your development tool after configuration, or verify success with echo $DASHSCOPE_API_KEY (Linux/macOS) or echo %DASHSCOPE_API_KEY% (Windows).
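The small Python sketch below performs the same verification from inside your program before any call is made; it only checks that the variable is present and looks like an sk- key, it does not validate the key against the service.

import os

api_key = os.getenv("DASHSCOPE_API_KEY")
if not api_key:
    raise RuntimeError("DASHSCOPE_API_KEY is not set; restart your IDE/terminal after configuring it.")
if not api_key.startswith("sk-"):
    print("Warning: the key does not start with 'sk-'; double-check that it was copied correctly.")
print(f"DASHSCOPE_API_KEY detected ({len(api_key)} characters).")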

3.2 Practical Implementation for Three Scenarios: URL, Local File, and Streaming Output

Core Question for This Section: How do you call Qwen3-ASR and Qwen-Audio-ASR via URL, local file, and streaming output?
The calling logic for both models is similar—only the model parameter needs modification (use qwen3-asr-flash for Qwen3-ASR and qwen-audio-asr for Qwen-Audio-ASR). Below are complete, annotated code examples for Qwen3-ASR across three scenarios:

Scenario 1: Call via Online Audio URL

Suitable for audio stored on public networks (e.g., OSS), such as transcribing customer service dialogue audio on OSS.

Python Code
import os
import dashscope

# 1. Configure request message: "system" for context enhancement (empty here, detailed in later sections), "user" for audio URL
messages = [
    {
        "role": "system",
        "content": [{"text": ""}]  # Context content (can be empty)
    },
    {
        "role": "user",
        "content": [
            # Replace with your public audio URL (example uses a test audio from Alibaba Cloud OSS)
            {"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
        ]
    }
]

# 2. Call the model
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # Retrieve API Key from environment variable
    model="qwen3-asr-flash",  # Model name (replace with "qwen-audio-asr" for Qwen-Audio-ASR)
    messages=messages,
    result_format="message",  # Result format (human-readable)
    asr_options={
        "enable_lid": True,  # Enable language identification (returns audio language)
        "enable_itn": True   # Enable Inverse Text Normalization (standardizes numeric formats)
        # "language": "zh",  # Optional: Specify language if known to improve accuracy
    }
)

# 3. Output results
print("Recognition Result:")
print(response["output"]["choices"][0]["message"]["content"][0]["text"])
print("Language Information:")
print(response["output"]["choices"][0]["message"]["annotations"][0]["language"])
print("Call Duration (seconds):")
print(response["usage"]["seconds"])
Sample Output
Recognition Result:
Welcome to Alibaba Cloud.
Language Information:
en
Call Duration (seconds):
1

Scenario 2: Call via Local File

Suitable for audio stored locally (e.g., voice notes on a personal computer). Note the file path format variations across operating systems:

Table 4: Local File Path Formats by Operating System
| Operating System | SDK | Path Format | Example |
| --- | --- | --- | --- |
| Linux/macOS | Python/Java | file://{absolute path} | file:///home/user/audio/test.mp3 |
| Windows | Python | file://{absolute path} | file://D:/audio/test.mp3 |
| Windows | Java | file:///{absolute path} | file:///D:/audio/test.mp3 |
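For Python users, a minimal local-file sketch is shown first; it reuses the Scenario 1 call unchanged and only swaps the audio source for a file:// path in the Table 4 format (replace the path with your own). The Java example follows.

import os
import dashscope

# Local file path in the Linux/macOS Python format from Table 4
local_file_path = "file:///home/user/audio/test.mp3"

messages = [
    {"role": "system", "content": [{"text": ""}]},             # Context content (can be empty)
    {"role": "user", "content": [{"audio": local_file_path}]}  # Local audio instead of a URL
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={"enable_lid": True, "enable_itn": True}
)

print(response["output"]["choices"][0]["message"]["content"][0]["text"])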
Java Code (Local File Call)
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;

public class ASRLocalFileDemo {
    public static void callLocalAudio() throws ApiException, NoApiKeyException, UploadFileException {
        // 1. Configure local file path (Windows Java example—replace with your path)
        String localFilePath = "file:///D:/audio/test.mp3";
        
        // 2. Construct request message: "system" adds context (example: "Video conference audio with participants Zhang San, Li Si")
        MultiModalMessage sysMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "This is audio from a video conference. Participants include: Zhang San, Li Si, Wang Wu.")))
                .build();
        
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("audio", localFilePath)))
                .build();
        
        // 3. Configure ASR parameters
        Map<String, Object> asrOptions = new HashMap<>();
        asrOptions.put("enable_lid", true);  // Enable language identification
        asrOptions.put("enable_itn", true);   // Enable ITN
        
        // 4. Call the model
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))  // Retrieve API Key from environment variable
                .model("qwen3-asr-flash")  // Model name
                .message(sysMessage)       // Context message
                .message(userMessage)      // User audio message
                .parameter("asr_options", asrOptions)
                .build();
        
        MultiModalConversationResult result = new MultiModalConversation().call(param);
        
        // 5. Output results
        System.out.println("Recognition Result: " + JsonUtils.toJson(result));
    }

    public static void main(String[] args) {
        try {
            callLocalAudio();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.err.println("Call Failed: " + e.getMessage());
        }
    }
}

Scenario 3: Streaming Output Call

Suitable for real-time scenarios (e.g., live meeting subtitles, voice assistants). The model returns intermediate results incrementally, eliminating the need to wait for full audio processing and reducing user wait times.

curl Code (HTTP API Call, Supported on All Systems)
# Enable streaming via the X-DashScope-SSE header; "incremental_output": true returns incremental intermediate results.
# Note: the Authorization header uses double quotes so that $DASHSCOPE_API_KEY expands from the environment.
curl --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-SSE: enable' \
--data '{
    "model": "qwen3-asr-flash",
    "input": {
        "messages": [
            {
                "content": [{"text": ""}],
                "role": "system"
            },
            {
                "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}],
                "role": "user"
            }
        ]
    },
    "parameters": {
        "incremental_output": true,
        "asr_options": {
            "enable_lid": true,
            "enable_itn": true
        }
    }
}'
Sample Streaming Output (Incremental Returns)
# 1st Return: Intermediate result "Welcome"
id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"en"}],"content":[{"text":"Welcome"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":2},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

# 2nd Return: Intermediate result "Welcome to"
id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"en"}],"content":[{"text":"Welcome to"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"output_tokens_details":{"text_tokens":3},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}

# Final Return: Complete result "Welcome to Alibaba Cloud."
id:5
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"annotations":[{"type":"audio_info","language":"en"}],"content":[{"text":"Welcome to Alibaba Cloud."}],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"output_tokens_details":{"text_tokens":6},"input_tokens_details":{"text_tokens":0},"seconds":1},"request_id":"05a122e9-2f28-9e37-8156-0e564a8126e0"}
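For Python SDK users, a hedged sketch of the same streaming call is shown below; it assumes that MultiModalConversation.call accepts stream=True and incremental_output=True analogously to the HTTP parameters above, so verify against the official API reference before relying on it.

import os
import dashscope

messages = [
    {"role": "system", "content": [{"text": ""}]},
    {"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]

# stream=True is assumed to return a generator of partial results; incremental_output mirrors the HTTP parameter above.
responses = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    stream=True,
    incremental_output=True,
    asr_options={"enable_lid": True, "enable_itn": True}
)

for partial in responses:
    # Each partial carries the newly recognized text (or the cumulative text, depending on incremental_output)
    print(partial["output"]["choices"][0]["message"]["content"][0]["text"])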

Reflection/Advantage of Streaming Output: In live meeting subtitle scenarios, non-streaming output requires waiting for 3 minutes of audio processing before returning full results. Streaming output, however, starts returning intermediate subtitles within 1 second—eliminating long waits. Prioritize streaming output for real-time scenarios; use non-streaming output for non-real-time tasks (e.g., batch processing historical audio) for simpler code and no need to handle intermediate result concatenation.

4. Core Competitiveness: Qwen3-ASR’s Context Enhancement Feature

4.1 Value of Context Enhancement: Solving the Pain Point of Inaccurate Professional Terminology Recognition

Core Question for This Section: How does context enhancement resolve inaccurate professional terminology recognition? What advantages does it have over traditional hotword solutions?
Qwen3-ASR’s context enhancement feature allows passing domain-specific text (e.g., glossaries, industry knowledge) in requests, enabling the model to “learn” domain context in advance and significantly improve terminology recognition accuracy. This is more flexible and fault-tolerant than traditional hotword solutions.

Limitations of Traditional Hotword Solutions: Only single terms (e.g., “Bulge Bracket”) can be added, with no support for contextual associations (e.g., “Bulge Bracket” refers to “top-tier investment banks”). Context enhancement, by contrast, supports arbitrary text formats (glossaries, paragraphs, mixed content), allowing the model to automatically learn term relationships and achieve higher accuracy.

Practical Case: Investment Banking Terminology Recognition
A fintech company needed to transcribe investment banking meeting audio. Without context enhancement, the key term “Bulge Bracket” (top-tier investment banks) was incorrectly recognized as “Bird Rock” (a meaningless phrase), distorting the transcription. After enabling context enhancement, accuracy reached 100%. The comparison below illustrates the improvement:

Table 5: Recognition Accuracy Comparison (With/Without Context Enhancement)

| Scenario | Recognition Result | Accuracy | Issue Analysis |
| --- | --- | --- | --- |
| Without Context Enhancement | "What jargon is used in the investment banking industry? First, foreign top-tier banks: Bird Rock, BB…" | 60% | The model failed to recognize "Bulge Bracket" and replaced it with a homophonic error |
| With Context Enhancement | "What jargon is used in the investment banking industry? First, foreign top-tier banks: Bulge Bracket, BB…" | 100% | Context provided the term "Bulge Bracket," enabling accurate recognition |

4.2 Practical Context Configuration: Four Common Input Formats

Core Question for This Section: How do you configure context content? What input formats are supported?
Context is passed via the text parameter in the system message. Four common formats are supported, with a maximum length of 10,000 Tokens. Examples are provided below:

(1) Glossary Format (Multiple Separators)

Suitable for scenarios with few terms. Supports separators like commas, spaces, and lists:

  • Format 1 (Comma-separated): {"text": "Bulge Bracket, Boutique, Middle Market, Domestic Securities Firms"}
  • Format 2 (Space-separated): {"text": "Bulge Bracket Boutique Middle Market Domestic Securities Firms"}
  • Format 3 (List-separated): {"text": "['Bulge Bracket', 'Boutique', 'Middle Market', 'Domestic Securities Firms']"}
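As a concrete illustration of the glossary format, here is a minimal Python sketch that passes the comma-separated terms through the system message's text field, using the same call shape as Scenario 1 in the Quick Start; the meeting-audio URL is a hypothetical placeholder.

import os
import dashscope

context_text = "Bulge Bracket, Boutique, Middle Market, Domestic Securities Firms"

messages = [
    {"role": "system", "content": [{"text": context_text}]},  # Glossary-format context
    {"role": "user", "content": [{"audio": "https://example.com/investment-banking-meeting.mp3"}]}  # hypothetical URL
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message",
    asr_options={"enable_itn": True}
)

print(response["output"]["choices"][0]["message"]["content"][0]["text"])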

(2) Natural Language Paragraph Format

Suitable for scenarios with many terms requiring contextual background. For example, passing a complete introduction to investment banking categories:

{
    "text": "A Guide to Investment Banking Categories! Many students in Australia have asked me: What exactly is an investment bank? Today, I’ll explain—for international students, investment banks mainly fall into four categories: Bulge Bracket, Boutique, Middle Market, and Domestic Securities Firms. Bulge Bracket Banks: These are the so-called 'top-tier investment banks,' including Goldman Sachs and Morgan Stanley. These large institutions have extensive business scopes and scale. Boutique Banks: These are smaller but highly specialized. Examples include Lazard and Evercore, which have deep expertise in specific fields. Middle Market Banks: These serve mid-sized companies, offering M&A and IPO services. While smaller than top-tier banks, they have significant influence in specific markets. Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms play an increasingly important role in the global market."
}

(3) Mixed Content Format (Glossary + Paragraph)

Suitable for scenarios requiring both core terms and background explanations. For example:

{
    "text": "Core Terms: Bulge Bracket, Boutique, Middle Market, Domestic Securities Firms. Background: Bulge Bracket refers to top-tier investment banks (e.g., Goldman Sachs, Morgan Stanley); Boutique refers to specialized banks (e.g., Lazard); Middle Market banks serve mid-sized enterprises; Domestic Securities Firms include CITIC Securities and HT Securities."
}

(4) Format with Distracting Text

The model exhibits high fault tolerance for irrelevant text. Even if context includes content unrelated to terms (e.g., names), terminology recognition remains unaffected. For example:

{
    "text": "A Guide to Investment Banking Categories! (Content as above, omitted) ... Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing"
}

Reflection/Context Configuration Advice: More context is not always better. Testing with 20,000 Tokens of long text caused model timeouts. Prioritize extracting core terms and key background (≤5,000 Tokens)—this balances accuracy and processing speed. Additionally, if terms have English abbreviations (e.g., “Bulge Bracket” abbreviated as “BB”), include both full names and abbreviations in context to further improve recognition.

5. Model Application Compliance and API References

5.1 Model Application Launch: Compliance Filing Steps

Core Question for This Section: What compliance filings are required when launching an application based on ASR models?
Per Alibaba Cloud requirements, all commercial applications developed using Tongyi Qianwen models must complete compliance filings to avoid violations of data security, privacy protection, and other relevant regulations. The steps are as follows:

  1. Visit the Alibaba Cloud Model Studio Application Compliance Filing Guide.
  2. Prepare filing materials: Application name, purpose, user privacy policy, and data processing instructions (e.g., whether audio is stored, how encryption is implemented).
  3. Submit the filing application per the guide and wait for Alibaba Cloud review (typically 3–5 business days).
  4. After approval, the application can be officially launched (personal testing or non-commercial applications do not require filing).

5.2 API References and Additional Resources

Core Question for This Section: Where can you find detailed API documentation and technical support?

  • Official API Reference: Speech Recognition – Tongyi Qianwen API Reference provides detailed explanations of all parameters (e.g., additional asr_options configurations).
  • Technical Support: For call errors (e.g., “unsupported audio format,” “invalid API Key”), submit a ticket via the Alibaba Cloud Console or join the official developer community for real-time assistance.
  • Sample Code Repository: Alibaba Cloud’s GitHub repository offers sample code for additional scenarios (e.g., batch audio processing, multimodal applications combining multiple models).

6. Frequently Asked Questions (FAQs)

  1. Q: How do I provide a publicly accessible audio URL for the API?
    A: Use Alibaba Cloud Object Storage Service (OSS): ① Upload audio to an OSS bucket; ② Enable “Public Access” in the bucket’s “Permission Settings”; ③ Generate the audio file URL (click “Get URL” in the OSS Console’s “File Management”); ④ Verify accessibility (browser or curl returns HTTP 200).

  2. Q: How do I check if my audio format meets requirements?
    A: Use the ffprobe tool. Run ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 [audio_file_path]. Ensure the output meets: supported format (e.g., mp3, wav), 16kHz sampling rate, and 1 channel (mono).

  3. Q: How do I process audio to meet model requirements (e.g., trimming, format conversion)?
    A: Use the FFmpeg tool: ① Trim audio (start at 1:30, trim 2 minutes): ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav; ② Convert format (to 16kHz, mono, 16-bit WAV): ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav.

  4. Q: How do I continue using Qwen3-ASR after exhausting the free quota?
    A: After the free quota (10 hours) is used up, billing automatically starts at 0.00022 CNY/second. Recharge via the Alibaba Cloud Console. To reduce costs, optimize audio length (e.g., retain only valid speech segments) or allocate free quotas wisely during testing.

  5. Q: What is the maximum text length supported by context enhancement? What happens if it’s exceeded?
    A: The maximum is 10,000 Tokens. Exceeding this returns a "context length exceeded" error. Simplify context by retaining only core terms and key background to avoid wasting Tokens on irrelevant text.

  6. Q: Can I still call Qwen-Audio-ASR after exhausting its free quota (100,000 Tokens)?
    A: No—calls will fail. Alibaba Cloud recommends switching to Qwen3-ASR for continued use (supports commercial use with more comprehensive features).

  7. Q: How do I choose between streaming and non-streaming output?
    A: Use streaming output for real-time scenarios (e.g., live meeting subtitles, voice assistants) to reduce wait times. Use non-streaming output for non-real-time scenarios (e.g., batch processing historical audio, voice note transcription) for simpler code and no need to handle intermediate result concatenation.

7. Practical Summary and One-Page Overview

7.1 Practical Summary (Action Checklist)

  1. Model Selection: Commercial/multilingual/specialized features → Qwen3-ASR (qwen3-asr-flash); Personal testing/basic Chinese/English → Qwen-Audio-ASR (qwen-audio-asr).
  2. Preparations: Obtain API Key → Configure environment variables → Install SDK (if using).
  3. Audio Verification: Use ffprobe to confirm format (16kHz, mono, supported format) → Convert with FFmpeg if non-compliant.
  4. Implementation: URL call → Use public URL; Local call → Follow OS-specific path formats; Real-time scenarios → Enable streaming output.
  5. Professional Optimization: Improve terminology recognition → Configure context enhancement (via system’s text parameter); Standardize numerics → Enable ITN (enable_itn=true).
  6. Compliance Launch: Commercial applications → Complete Alibaba Cloud compliance filing.

7.2 One-Page Overview (Core Information Summary)

| Module | Core Information |
| --- | --- |
| Model Selection Basis | Qwen3-ASR (production, multilingual, specialized features); Qwen-Audio-ASR (testing, basic Chinese/English) |
| Free Quota | Qwen3-ASR: 10 hours; Qwen-Audio-ASR: 100,000 Tokens (both valid for 180 days) |
| Audio Requirements | 16kHz, mono, ≤10MB, ≤3 minutes, supports 17 formats |
| Key Features | Qwen3-ASR: context enhancement, ITN, noise rejection; Qwen-Audio-ASR: basic recognition only |
| Calling Methods | URL, local file, streaming output (Python/Java SDK, HTTP API) |
| Common Error Fixes | Format errors → convert with FFmpeg; API Key errors → check environment variables; Invalid URL → verify public access |

8. Conclusion: Selecting the Right ASR Solution for Your Needs

Tongyi Qianwen’s Qwen3-ASR and Qwen-Audio-ASR models cater to “production-level” and “experimental-level” needs, respectively. Developers should select models based on business scenarios (commercial/testing), feature requirements (multilingual/professional terminology recognition), and cost budgets (free quotas/paid usage).

For most commercial scenarios (e.g., cross-border customer service, intelligent meetings, short video subtitles), Qwen3-ASR’s context enhancement, noise rejection, and ITN features significantly improve recognition quality. Its 10-hour free quota suffices for initial testing, with extremely low subsequent costs (~0.79 CNY for 1 hour of daily use). For individual developers or non-commercial scenarios, Qwen-Audio-ASR’s 100,000-Token free quota meets basic needs for quick speech recognition experimentation.

Finally, before implementation, verify audio formats with ffprobe, optimize terminology recognition via context enhancement, and prioritize environment variables for API Key configuration to ensure secure, efficient calls.