DeSTA2.5-Audio: Pioneering the Future of General-Purpose Large Audio Language Models
In the rapidly evolving landscape of artificial intelligence, the quest for models capable of robust auditory perception and precise instruction-following has gained significant momentum. DeSTA2.5-Audio, a cutting-edge Large Audio Language Model (LALM), stands at the forefront of this innovation. Designed to transcend the limitations of task-specific audio instruction-tuning, DeSTA2.5-Audio leverages a self-generated cross-modal alignment strategy, marking a paradigm shift in how we approach audio-linguistic understanding.
The Genesis of DeSTA2.5-Audio
The development of DeSTA2.5-Audio was driven by the recognition that existing LALMs often suffered from catastrophic forgetting. This phenomenon occurs when models become overly specialized in certain audio-related tasks, resulting in diminished performance on unseen tasks and a decline in their inherent language abilities. To address this, the DeSTA (Descriptive Speech-Text Alignment) framework was introduced. Unlike conventional approaches that rely on external annotations or responses from different LLMs, DeSTA empowers the backbone LLM to generate its own training targets. This self-generated approach ensures stylistic and semantic consistency with the LLM’s native output distribution, effectively preserving its instruction-following capabilities while enabling the model to adapt to auditory inputs.
The DeSTA Framework: A Deep Dive
Self-Generated Dataset Construction
The DeSTA framework begins with the collection of diverse datasets containing detailed metadata. Each audio segment’s metadata is meticulously converted into a structured textual format, following a specific schema:
[timestamp] Spoken content (non-verbal attribute name: value)
For instance, a speech clip might be represented as:
[00:00-00:05] Hello world (Gender:Female, Emotion:Happy...)
This conversion standardizes the representation of various audio types. The resulting dataset pairs each audio clip with its corresponding textual description, denoted x^{\text{text}}.
Based on these initial audio-description pairs, a text-based LLM is used to generate training targets. For each description, a prompt p is randomly sampled from a predefined, diverse instruction pool. These prompts are crafted to maximize the LLM’s ability to draw on all available information in the textual description. The backbone LLM then takes the text description x^{\text{text}} and the prompt p as inputs and produces a response y. This automated setup eliminates the need for labor-intensive, task-specific instruction design, enabling the generation of detailed and context-aware responses.
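To make the conversion concrete, the following is a minimal sketch of how per-segment metadata could be rendered into the schema shown above. The field names (`start`, `end`, `text`, `attributes`) and the helper function are illustrative assumptions, not the authors' actual data-processing code.

```python
# Minimal sketch of converting per-segment metadata into the
# "[timestamp] spoken content (attribute: value, ...)" text format.
# Field names are illustrative, not the authors' actual schema.
def segment_to_text(segment: dict) -> str:
    span = f"[{segment['start']}-{segment['end']}]"
    content = segment.get("text", "")  # empty for non-speech audio
    attrs = ", ".join(f"{k}:{v}" for k, v in segment.get("attributes", {}).items())
    return f"{span} {content} ({attrs})".strip()

example = {
    "start": "00:00", "end": "00:05",
    "text": "Hello world",
    "attributes": {"Gender": "Female", "Emotion": "Happy"},
}
print(segment_to_text(example))
# -> [00:00-00:05] Hello world (Gender:Female, Emotion:Happy)
```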
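A rough sketch of this self-generation step is shown below. The prompt-pool contents, the `query_backbone_llm` callable, and the way the description and prompt are joined are assumptions made for illustration; only the overall flow (sample a prompt, feed description plus prompt to the backbone LLM, keep its response as the training target) follows the description above.

```python
# Sketch of the self-generation step: the same backbone LLM that will later be
# aligned to audio produces the training target from (text description, prompt).
# `query_backbone_llm`, the pool contents, and the joining format are placeholders.
import random

prompt_pool = [
    "Describe everything you can infer from this audio.",
    "What is the speaker's emotional state?",
    # ... thousands of prompts in the actual instruction pool
]

def make_training_example(x_text: str, query_backbone_llm) -> dict:
    p = random.choice(prompt_pool)            # randomly sampled instruction
    y = query_backbone_llm(f"{x_text}\n{p}")  # response in the LLM's own style
    return {"description": x_text, "prompt": p, "target": y}
```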
Model Training
The model adopts a modular architecture that integrates a pre-trained audio model with an instruction-tuned LLM. To bridge the audio and language modalities, a modality adapter composed of Q-Former blocks is inserted between the two modules. In this design, the parameters of the audio model and the LLM are frozen, and only the modality adapter is fine-tuned to learn robust audio-text alignment representations.
The audio input is first encoded into a continuous representation by applying Q-Former blocks to multiple intermediate hidden states from the audio encoder. The outputs from these layers are aggregated using learnable scalar weights and then projected through a linear layer to match the LLM embedding dimension. Optionally, a linguistic representation may be incorporated by transcribing the input audio with the audio model's decoder to obtain a text sequence. When used, these discrete (text-based) features are concatenated with the continuous features along the sequence dimension to form the final audio representation.
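The adapter described above can be pictured with the following simplified PyTorch sketch. It collapses the six-layer Q-Former into a single shared cross-attention block and assumes eight attention heads, so it illustrates the multi-layer aggregation and projection rather than reproducing the released architecture.

```python
# Simplified sketch of the modality adapter: learnable queries attend to hidden
# states tapped from several Whisper encoder layers, the per-layer outputs are
# mixed with learnable scalar weights, and a linear layer maps to the LLM
# embedding dimension. Illustration only: the six-layer Q-Former is collapsed
# into one shared cross-attention block, and the head count is an assumption.
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    def __init__(self, d_audio=1280, d_llm=4096, n_queries=64, n_layers_used=4, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_audio) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        self.layer_weights = nn.Parameter(torch.zeros(n_layers_used))  # learnable scalars
        self.proj = nn.Linear(d_audio, d_llm)

    def forward(self, hidden_states):
        # hidden_states: list of [batch, frames, d_audio], one per tapped encoder layer
        batch = hidden_states[0].size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        per_layer = [self.cross_attn(q, h, h)[0] for h in hidden_states]  # [batch, n_queries, d_audio]
        weights = torch.softmax(self.layer_weights, dim=0)
        fused = sum(w * out for w, out in zip(weights, per_layer))        # weighted aggregation
        return self.proj(fused)                                           # [batch, n_queries, d_llm]

# e.g. hidden states from Whisper-large-v3 encoder layers 8, 16, 24, and 32
adapter = QFormerAdapter()
dummy_states = [torch.randn(2, 1500, 1280) for _ in range(4)]
audio_embeds = adapter(dummy_states)  # shape: [2, 64, 4096]
```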
The resulting audio embeddings are passed to the LLM along with the prompt embeddings to autoregressively generate the output sequence. The model undergoes end-to-end optimization using the standard next-token prediction loss computed on the training targets.
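As a sketch of this objective under stated assumptions (a frozen Hugging Face causal LM, audio embeddings from an adapter like the one above, and loss masking on everything except the response), a training step might look as follows. This is not the authors' training code.

```python
# Next-token prediction on the self-generated response only: audio and prompt
# positions are masked out of the labels with the usual -100 ignore index.
import torch

def training_step(llm, adapter, encoder_states, prompt_ids, target_ids, ignore_index=-100):
    audio_embeds = adapter(encoder_states)            # [batch, n_queries, d_llm]
    embed = llm.get_input_embeddings()
    inputs_embeds = torch.cat(
        [audio_embeds, embed(prompt_ids), embed(target_ids)], dim=1
    )
    # mask out audio and prompt positions so the loss covers only the response
    batch, n_queries, _ = audio_embeds.shape
    pad = torch.full(
        (batch, n_queries + prompt_ids.size(1)), ignore_index, device=target_ids.device
    )
    labels = torch.cat([pad, target_ids], dim=1)
    return llm(inputs_embeds=inputs_embeds, labels=labels).loss
```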
DeSTA-AQA5M: A Large-Scale Audio Instruction-Tuning Dataset
To construct the training corpus for DeSTA2.5-Audio, 50 publicly available datasets spanning a wide range of audio processing domains were collected. These datasets were prioritized for their comprehensive metadata covering paralinguistic features, speaker identity attributes, audio quality indicators, and environmental or contextual sounds. The dataset comprises approximately 7,000 hours of audio: 5,400 hours of speech, 1,000 hours of environmental sounds, and 500 hours of music.
The instruction pool consists of 4,000 prompts for the speech category and 3,000 prompts for the environmental sound and music categories. An upsampling strategy was applied to balance the data across domains. Each audio sample is paired with multiple prompts, and each generation input is formatted as the textual description followed by the sampled prompt:
{x^{\text{text}}} {p}
All responses were generated using the vLLM toolkit, with decoding parameters set to a temperature of 0.05 and a top-p value of 1.0. This process yielded a large-scale dataset of approximately 5 million audio-prompt-response triplets, referred to as DeSTA-AQA5M.
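For reference, response generation with vLLM at these decoding settings could look like the sketch below; the model name, prompt formatting, and `max_tokens` value are illustrative assumptions rather than the exact generation script.

```python
# Hedged sketch of target generation with vLLM at the reported decoding settings
# (temperature 0.05, top-p 1.0). Model name and prompt formatting are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.05, top_p=1.0, max_tokens=512)

description_prompt_pairs = [
    ("[00:00-00:05] Hello world (Gender:Female, Emotion:Happy)",
     "What can you tell about the speaker?"),
]
prompts = [f"{x_text}\n{p}" for x_text, p in description_prompt_pairs]
outputs = llm.generate(prompts, params)
responses = [o.outputs[0].text for o in outputs]
```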
Model Specification and Training Setup
DeSTA2.5-Audio is built upon Llama3.1-8B-Instruct and Whisper-large-v3. A six-layer Q-Former architecture with 64 queries serves as the modality adapter. The query vectors attend to intermediate hidden states from Whisper encoder layers 8, 16, 24, and 32, allowing the model to capture multi-scale acoustic features. For the optional linguistic representation, offline transcriptions are provided for speech-domain datasets. For audio and music-domain datasets, no transcription is used, and only the continuous embeddings derived from the Q-Former are used during training. During inference, a lightweight pre-trained voice activity detection (VAD) model identifies the presence of human speech in the audio input and conditionally activates the Whisper decoder when necessary.
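The inference-time routing can be summarized by the following pseudocode-style sketch, in which `run_vad`, `whisper.encode`, and `whisper.transcribe` are placeholder interfaces rather than the released implementation.

```python
# VAD-gated inference path: continuous Q-Former features are always produced;
# the Whisper decoder is run only when the VAD model detects speech.
def build_audio_inputs(audio, adapter, whisper, run_vad):
    encoder_states = whisper.encode(audio)      # intermediate encoder hidden states
    continuous = adapter(encoder_states)        # Q-Former features (always used)
    if run_vad(audio):                          # speech detected -> also run the decoder
        transcript = whisper.transcribe(audio)  # linguistic (discrete) representation
        return continuous, transcript
    return continuous, None                     # non-speech audio: continuous features only
```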
The model implementation is based on the Transformers library. It consists of 8.8 billion total parameters, with 131 million trainable parameters. Training is performed for five epochs using the Adam optimizer, a cosine annealing learning rate schedule, and 2,000 warm-up steps. The training is conducted on a cluster of 8 NVIDIA A100-80GB GPUs, with a global batch size of 96 and an initial learning rate of 1e-4. The total number of training steps is approximately 250,000.
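Under these reported hyperparameters, the optimizer and schedule could be set up roughly as follows, using `get_cosine_schedule_with_warmup` from the Transformers library; the stand-in adapter module is a placeholder, and this is an illustration rather than the authors' training script.

```python
# Illustrative optimizer/schedule matching the reported hyperparameters
# (Adam, cosine annealing, 2,000 warm-up steps, initial LR 1e-4, ~250,000 steps).
import torch
from transformers import get_cosine_schedule_with_warmup

adapter = torch.nn.Linear(1280, 4096)  # stand-in for the trainable modality adapter
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=250_000
)
```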
Evaluation Setup
To evaluate the capabilities of DeSTA2.5-Audio across instruction-following, perceptual understanding, and reasoning, a diverse suite of benchmarks is adopted:
- Dynamic-SUPERB Phase-1: Evaluates instruction-following and speech understanding across 48 classification tasks, categorized into content, semantic, paralinguistic, degradation, and speaker groups. Performance is measured by classification accuracy against ground-truth labels.
- Dynamic-SUPERB Phase-2: Extends the benchmark to 180 tasks, incorporating new contributions from the research community, including regression and open-ended generation tasks across the speech, environmental sound, and music domains. Performance is assessed using task-specific metrics.
- MMAU: A benchmark for evaluating advanced audio-language understanding and reasoning across speech, environmental sounds, and music, using multiple-choice question formats. Some questions require expert-level domain knowledge for correct interpretation. Performance is evaluated by accuracy against ground-truth answers.
- SAKURA: A benchmark developed to evaluate single-hop and multi-hop reasoning in LALMs. Single-hop questions assess basic auditory perception, while multi-hop questions require combining auditory cues with external world knowledge and reasoning beyond the immediate input. Performance is evaluated using multiple-choice questions.
- Speech-IFEval: A diagnostic benchmark designed to assess whether LALMs retain their instruction-following ability after cross-modal alignment. It introduces the instruction-following rate (IFrate) and defines the forgetting rate (∆) as the relative drop in IFrate between an LALM and its backbone LLM (one formalization of ∆ is sketched after this list).
- VoiceBench: A collection of tasks designed to evaluate spoken-interaction performance. It uses text-to-speech (TTS) systems to convert textual instructions into audio inputs, simulating realistic voice-based scenarios.
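For the forgetting rate mentioned in the Speech-IFEval entry, one natural way to write the stated "relative drop" is the following; the benchmark's exact notation may differ.

```latex
% One formalization of the "relative drop" wording; the benchmark's exact notation may differ.
\Delta = \frac{\mathrm{IFrate}_{\mathrm{LLM}} - \mathrm{IFrate}_{\mathrm{LALM}}}{\mathrm{IFrate}_{\mathrm{LLM}}} \times 100\%
```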
Main Results
Results on Dynamic-SUPERB Phase-1, MMAU, SAKURA, and Speech-IFEval
DeSTA2.5-Audio demonstrates consistently superior performance across multiple benchmarks. It achieves the highest scores on Dynamic-SUPERB Phase-1 (69.53), MMAU (57.50), SAKURA-Multi (69.85), and Speech-IFEval (93.89), indicating robustness and generalization across domains and conditions. Compared to DeSTA2, which focused primarily on speech data, DeSTA2.5-Audio shows improved performance, attributable to its broader coverage of environmental sounds and music in addition to speech. This demonstrates that the proposed data construction strategy is robust and extends effectively to other domains.
Results on Dynamic-SUPERB Phase-2
DeSTA2.5-Audio achieves the highest overall performance on the Dynamic-SUPERB Phase-2 benchmark, ranking first in both win count (14 domains) and average relative score (3.42). These results highlight the model’s robustness and versatility across a wide range of general audio tasks.
Results on VoiceBench
DeSTA2.5-Audio achieves an overall score of 74.52 on VoiceBench, outperforming models such as VITA-1.5 (64.53) and Qwen2-Audio-Instruct (55.80). It demonstrates balanced and consistent results across a broad range of subtasks, including AlpacaEval, CommonEval, and OpenbookQA. This performance is largely attributed to the strong language ability of DeSTA2.5-Audio, where it not only follows the system prompt to adapt its behavior into an interactive style but also responds appropriately based on its inherent knowledge.
Comparison Studies
Comparison between Self-Generated and Cross-Model Settings
The self-generated training data exhibit consistently lower perplexity, suggesting that the generated responses align well with the distribution of the backbone LLM. When comparing Llama3.1 and Qwen2.5, Qwen2.5 consistently outperforms Llama3.1 across all benchmarks. This performance gap may be attributed to Qwen2.5’s stronger text generation capabilities. However, there is currently no conclusive evidence indicating a corresponding advantage in auditory perception, which warrants further investigation. Nevertheless, under identical training conditions, our experimental results suggest that Qwen2.5 serves as a more effective backbone LLM than Llama3.1. These findings also indicate that our training framework generalizes well across different LLMs.
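The perplexity comparison above relies on scoring candidate training targets under the backbone LLM. A hedged sketch of such a measurement with the Transformers library is shown below; the model name and the plain-text concatenation of description-plus-prompt context and target are assumptions, not the exact evaluation script.

```python
# Hedged sketch of scoring a candidate training target under the backbone LLM,
# i.e., the perplexity measurement referred to above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

@torch.no_grad()
def target_perplexity(context: str, target: str) -> float:
    ctx_ids = tok(context, return_tensors="pt").input_ids
    tgt_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.size(1)] = -100   # score only the target tokens
    loss = model(input_ids=input_ids, labels=labels).loss
    return torch.exp(loss).item()
```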
Prompt diversity also plays a significant role in model performance. In A3, we adopt the self-generation setup with a single descriptive prompt, which already demonstrates strong zero-shot generalization. Simply increasing prompt diversity, as done in A1, further enriches the training targets and enhances the overall effectiveness of the training method. Notably, these results are achieved without any task-specific instruction pairs, which highlights the strength of the self-generation design: even when data construction relies solely on randomly sampled prompts, the model still achieves zero-shot generalization by leveraging the inherent capabilities of the LLM.
When comparing self-generation and cross-model settings, training targets in the cross-model setting result in higher perplexity, suggesting that the backbone LLM is less familiar with the data distribution. For example, while training Qwen2.5 on Qwen2.5-generated data (A2) produces strong results, training Llama3.1 on Qwen2.5-generated data (B1) leads to model degeneration, with outputs containing repetitive or nonsensical tokens. Similarly, training Llama3.1 on data generated by Gemma3-12B (B2) fails to match the performance observed in the self-generation setting (A1). These results support our distribution-mismatch hypothesis and emphasize the importance of a self-generated configuration, even when the annotator LLM is more capable. We also explore using Llama3.1-70B to generate training data (B3), representing a more powerful model from the same family. In this case, the lower perplexity (2.20) suggests closer alignment between the training data and Llama3.1's distribution. Compared to A1, however, B3 achieves better performance on Dynamic-SUPERB and SAKURA but underperforms on MMAU and Speech-IFEval, indicating that using a stronger annotator model does not necessarily lead to consistent improvements across all tasks.
With LoRA Adapter
In the LoRA adapter setting, we introduce trainable parameters into the backbone LLM, which is expected to increase model capacity and help mitigate the distribution-mismatch problem. In the self-generation setup (C1), where the dataset is well aligned with the backbone LLM, adding LoRA layers yields similar or only slightly better performance. This indicates that, under self-generation settings, incorporating LoRA adapters does not provide a significant advantage. In other words, fine-tuning a lightweight modality adapter is sufficient for cross-modal alignment under our proposed training framework, where the model focuses on learning auditory concepts without being hindered by stylistic or distributional mismatches. Interestingly, when training with Qwen2.5-generated data (C2), performance on audio processing benchmarks is comparable to the self-generation setup (A2). However, the model degrades significantly on SAKURA-Multi and Speech-IFEval, which require additional textual knowledge and instruction-following ability. This difference indicates that while adding a LoRA adapter can help mitigate the distribution mismatch and perform well on in-domain tasks, it may still degrade the model's general capabilities on benchmarks that depend on knowledge from LLM pretraining. This reveals a critical design limitation in current LALM training strategies: models such as LTU-AS and SALMONN attempt to address catastrophic forgetting by introducing LoRA adapter layers into the LLM, but our experimental results suggest that reducing the discrepancy between training data and model distribution is a more critical factor for preserving generalization ability than architectural modifications alone.
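For context, attaching LoRA layers to the backbone LLM in these comparisons could be done with the peft library roughly as follows; the rank, alpha, and target modules shown are placeholders rather than the configuration used in the experiments.

```python
# Illustrative LoRA setup via peft for the "with LoRA adapter" setting:
# trainable low-rank layers are added to the otherwise frozen backbone LLM.
# r, lora_alpha, and target_modules are placeholders, not the paper's settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()
```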
With 5 Epochs
With the 5-epoch setting, we investigate the impact of training duration on model performance. Consistent with the discussion in Section VII-A, the 5-epoch results (D1 and D2) show that prompt diversity not only improves effectiveness but also enhances training efficiency. Despite being trained for only half the number of epochs, these models achieve performance comparable to their 10-epoch counterparts (A1). Notably, while D2 continues to improve with extended training (as in A3), convergence is slower and the final performance remains worse, indicating that diverse training targets are also essential for achieving better alignment. In contrast, although D3 achieves non-trivial performance with only 5 epochs of training, B1 demonstrates that extended training under distribution mismatch leads to model degeneration. These findings underscore our key motivation: effective cross-modal alignment requires repeated exposure to audio representations across epochs. When the training data is matched with the backbone model, performance improves steadily without degrading the model's inherent language capabilities. In contrast, learning from mismatched data places a heavier burden on the model, ultimately resulting in suboptimal performance and forgetting of its pre-trained linguistic abilities.
Conclusion
DeSTA2.5-Audio represents a significant advancement in the field of Large Audio Language Models. By employing the self-generated cross-modal alignment framework, DeSTA, it effectively mitigates the catastrophic forgetting problem commonly observed in prior works. Under this framework, we successfully develop a robust and generalizable LALM without relying on task-specific instruction data. Constructed from 50 diverse audio datasets, DeSTA-AQA5M provides a large-scale, task-agnostic corpus of 5 million audio-text pairs. Trained solely on this corpus, DeSTA2.5-Audio achieves state-of-the-art or competitive results across benchmarks such as Dynamic-SUPERB (Phase-1 and Phase-2), MMAU, SAKURA, Speech-IFEval, and VoiceBench. Our findings highlight the importance of training data quality over quantity in LALM development. Compared to previous studies, our self-generation strategy offers a scalable and robust solution with better generalization. While our current design leverages text description as an intermediate bridge, we acknowledge that not all acoustic nuances can be effectively captured through textual representations. In future work, we will continue to explore methods that enable LALMs to better capture subtle, non-text-expressible audio features.