HeartMuLa: A Comprehensive Guide to Open Source Music Generation and Understanding

In the rapidly evolving landscape of artificial intelligence, the field of generative music has seen remarkable advancements. However, much of the cutting-edge progress has been locked behind closed-source commercial systems, limiting accessibility for researchers and developers. Enter HeartMuLa, a family of open-source music foundation models designed to bridge the gap between academic research and commercial-grade application. This ecosystem unifies music understanding, alignment, and controllable generation into a single, extensible framework.
In this article, we will take an in-depth look at the HeartMuLa ecosystem, exploring its architecture, performance benchmarks, and providing a practical guide to deploying these models locally.

The HeartMuLa Ecosystem: An Overview

HeartMuLa is not just a single model; it is a comprehensive suite of components working in harmony to achieve high-fidelity music generation. The framework consists of four major pillars:

  1. HeartCLAP: An audio-text alignment model that learns a shared embedding space for music semantics.
  2. HeartTranscriptor: A robust lyric recognition model optimized specifically for real-world musical scenarios.
  3. HeartCodec: A low-frame-rate, high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details.
  4. HeartMuLa: An LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions.
Figure 1: Overall comparison of HeartMuLa with existing music foundation models

As shown in Figure 1, the open-source HeartMuLa-3B model demonstrates competitive performance against established commercial models such as Suno v5 and Udio v1.5, particularly in lyric intelligibility (Phoneme Error Rate, PER) and structural coherence.

HeartCodec: The Foundation of Efficient Music Modeling

At the core of HeartMuLa’s efficiency lies HeartCodec, a novel music tokenizer designed to overcome the limitations of traditional high-frame-rate audio codecs.

Architecture and Design

Standard audio tokenizers often operate at frame rates between 25 Hz and 50 Hz. While these provide high temporal resolution, they generate extremely long token sequences, making it computationally expensive for Large Language Models (LLMs) to model long-form music. HeartCodec addresses this with a frame rate of 12.5 Hz, halving the sequence length relative to 25 Hz models: a 4-minute track, for example, becomes 240 s × 12.5 = 3,000 frames per codebook layer instead of 6,000 at 25 Hz or 12,000 at 50 Hz.
The architecture is composed of three distinct parts:

  1. Semantic-Rich Encoder: Instead of relying on a single audio encoder, HeartCodec extracts features from multiple pre-trained models: Whisper (phonetic cues), WavLM (acoustic details), and a fine-tuned MuEncoder (high-level musical semantics). This multi-encoder strategy captures complementary representations across different levels of abstraction.
  2. Ultra-Low Frame Rate Compressor: This module fuses the multi-level representations and applies a query-based quantization strategy. By inserting learnable query tokens, the model downsamples the feature sequence while retaining the essential musical information.
  3. High-Fidelity Reconstruction Decoder: To reconstruct high-quality audio from the compressed tokens, HeartCodec uses a Flow Matching approach: it maps the discrete representations into the continuous latent space of a 25 Hz SQ-Codec and employs a Diffusion Transformer backbone to predict the vector field (a generic illustration of this objective follows the list).
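
To illustrate the Flow Matching objective used by the reconstruction decoder, the sketch below shows a generic rectified-flow-style training step. The actual HeartCodec decoder, its conditioning interface, and its Diffusion Transformer backbone differ in detail, so the function signature and shapes here are assumptions.

import torch

def flow_matching_step(vector_field_model, x1, cond):
    """One generic conditional flow-matching training step (illustrative only).

    x1:   clean continuous latents, e.g. SQ-Codec latents, shape (B, T, D)
    cond: decoder conditioning, e.g. embeddings of the discrete HeartCodec tokens
    """
    x0 = torch.randn_like(x1)                           # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # random time in [0, 1)
    xt = (1 - t) * x0 + t * x1                          # point on the straight path
    target_velocity = x1 - x0                           # velocity of that path
    pred_velocity = vector_field_model(xt, t.view(-1), cond)
    return (pred_velocity - target_velocity).pow(2).mean()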

Performance Benchmarking

The objective evaluation results demonstrate that HeartCodec outperforms several state-of-the-art baselines, including SemantiCodec, XCodec, and MuCodec. Notably, it achieves the lowest Fréchet Audio Distance (FAD) and Fréchet Distance (FD), indicating superior alignment with the original audio’s time-domain and frequency-domain distributions.
Table: Comparative Evaluation of Audio Codecs

Model                Codebook    Frame rate (Hz)  VISQOL ↑  FAD ↓  STOI ↑
Ground Truth         –           –                –         –      –
SemantiCodec         1 x 32768   25               2.24      2.32   22.38
XCodec               4 x 1024    50               2.32      0.88   16.08
MuCodec              1 x 16384   25               3.07      1.02   14.73
HeartCodec (SQ Ft.)  8 x 8192    12.5             3.72      0.27   11.06
The “SQ Ft.” notation refers to the final training stage of HeartCodec, where the SQ-Codec decoder is fine-tuned to adapt to the distilled latent distribution. This stage yields significant gains in reconstruction quality compared to the base pretraining stages.

HeartMuLa: Hierarchical Architecture for Music Generation

The heart of the ecosystem is HeartMuLa, a music language model built upon the discrete tokens produced by HeartCodec. It utilizes a hierarchical factorization of the modeling process to balance computational efficiency with audio fidelity.

Global-Local Factorization

Long-form music requires understanding both long-range structure (like verses and choruses) and short-range details (like timbre and texture). HeartMuLa addresses this via a two-stage architecture:

  • Global Transformer: This component models intra-frame dependencies by predicting the base tokens (Layer 0) of the Residual Vector Quantization (RVQ). It captures the coarse semantic information and long-range musical structure.
  • Local Transformer: Once the global structure is established, the local transformer predicts the residual tokens (Layer 1 to K-1) within each frame, conditioned on the global context. This handles the synthesis of fine-grained acoustic details.
By offloading the heavy lifting of long-range modeling to the Global Transformer and delegating local detail synthesis to the Local Transformer, the system achieves high fidelity without overwhelming computational resources.
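
To make this factorization concrete, here is a minimal PyTorch sketch of the idea. It is not the released implementation: the embedding sizes, layer counts, masking, and the way global context is injected into the local model are assumptions chosen for readability; only the codebook shape (8 layers of 8,192 entries) follows the HeartCodec table above.

import torch
import torch.nn as nn

class GlobalLocalLM(nn.Module):
    def __init__(self, vocab_size=8192, num_layers_k=8, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.layer_emb = nn.Embedding(num_layers_k, d_model)
        # Global transformer: one position per 12.5 Hz frame, models layer-0 (base) tokens.
        self.global_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        # Local transformer: runs inside each frame, models residual layers 1..K-1.
        self.local_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, codes):
        # codes: (B, T, K) RVQ token ids from the codec; layer 0 is the base layer.
        B, T, K = codes.shape
        device = codes.device
        # 1) Long-range structure: causal attention over the layer-0 token sequence.
        g = self.tok_emb(codes[:, :, 0])                                    # (B, T, D)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(device)
        ctx = self.global_tf(g, mask=causal)                                # per-frame context
        # 2) Fine detail: predict residual layers within each frame, conditioned on that context.
        local = self.tok_emb(codes) + self.layer_emb(torch.arange(K, device=device))
        local = local + ctx.unsqueeze(2)                                    # broadcast (B, T, 1, D)
        d = local.shape[-1]
        local = local.reshape(B * T, K, d)
        layer_mask = nn.Transformer.generate_square_subsequent_mask(K).to(device)
        out = self.local_tf(local, mask=layer_mask)
        return self.head(out).reshape(B, T, K, -1)                          # per-layer logits

In practice the released model also conditions on the lyrics, tags, and reference audio described in the next subsection; the sketch omits them to keep the structure visible.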

Conditioning Mechanism

Controlling music generation requires precise conditioning. HeartMuLa integrates three main types of input; a rough sketch of how they might be combined follows the list:

  1. Lyrics: Text input including structural markers like [intro], [verse], and [chorus] to guide the song’s progression.
  2. Tags: High-level musical attributes (genre, instrument, mood, etc.) encapsulated in special tokens.
  3. Reference Audio: During training, embeddings from a reference audio clip are used to capture global style cues. Note that for privacy and ethical reasons, the model uses MuQ-MuLan embeddings which do not include speaker timbre information.
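
Purely as an illustration of how these three streams could feed the language model, the following sketch projects a MuQ-MuLan-style embedding and concatenates it with tag and lyric embeddings into a conditioning prefix. The ordering and the projection layer are assumptions, not the released interface.

import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    def __init__(self, d_model=512, mulan_dim=512):
        super().__init__()
        # project the MuQ-MuLan style embedding into the LM's hidden size
        self.ref_proj = nn.Linear(mulan_dim, d_model)

    def forward(self, tag_emb, lyric_emb, ref_emb):
        # tag_emb: (B, Nt, D), lyric_emb: (B, Nl, D), ref_emb: (B, mulan_dim)
        ref = self.ref_proj(ref_emb).unsqueeze(1)             # (B, 1, D) global style token
        # the prefix (style, tags, lyrics) precedes the audio tokens in the LM input
        return torch.cat([ref, tag_emb, lyric_emb], dim=1)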

Training Strategy: From Warmup to Preference Optimization

Achieving commercial-grade quality requires a rigorous training pipeline. HeartMuLa employs a Four-Stage Progressive Training Paradigm:
Figure 4: Four-Stage Progressive Training Paradigm

Stage 1: Warmup

The model is trained on 30-second music segments containing lyrics. The goal is rapid parameter convergence and establishing a foundational understanding of local acoustic texture.

Stage 2: Pretraining

Scaled to a full 100,000-hour dataset, the model learns long-range temporal dependencies and global musical structures under complete conditional inputs (Lyrics, Tags, Reference Audio).

Stage 3: Supervised Fine-Tuning (SFT)

A high-quality subset of the data is selected based on objective metrics (AudioBox and SongEval). The model is fine-tuned to improve synthesis quality and fine-grained structural control.

Stage 4: Direct Preference Optimization (DPO)

Traditional reinforcement learning for LLMs is computationally expensive and unstable. HeartMuLa instead uses Direct Preference Optimization (DPO), which optimizes the model directly on preference pairs (winning samples vs. losing samples). Three preference sets drive this stage, and a minimal form of the DPO objective is sketched after the list:

  • MuQ-similarity-based Set: Enhances semantic alignment between the audio and its text tags.
  • PER-based Set: Focuses on articulation accuracy (reducing Phoneme Error Rate).
  • AudioBox & SongEval-based Set: Optimizes for holistic audio quality and musicality.
This final stage is crucial for elevating the perceptual quality, resulting in significant improvements in vocal clarity and stylistic fidelity.
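
For reference, the standard DPO objective that this stage builds on can be written compactly as below. The per-sequence log-probabilities would come from the model being tuned and a frozen reference copy of it; beta is a tuning constant and the default value here is illustrative.

import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO objective over batches of per-sequence log-probabilities."""
    # how much more the policy prefers each sample than the frozen reference model does
    margin_win = policy_logp_win - ref_logp_win
    margin_lose = policy_logp_lose - ref_logp_lose
    # push the winner's margin above the loser's, scaled by beta
    return -F.logsigmoid(beta * (margin_win - margin_lose)).mean()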

Performance Evaluation and Benchmarks

To assess the capabilities of HeartMuLa, comprehensive objective and subjective evaluations were conducted on the multilingual HeartBeats Benchmark.

Lyric Intelligibility

One of the standout features of HeartMuLa is its lyric clarity. In English, it achieved a Phoneme Error Rate (PER) of 0.09, outperforming top commercial models. In Chinese, the PER is 0.12, and in Japanese, Korean, and Spanish, it consistently leads or ties with top performers. This suggests the model effectively learns the alignment between acoustic sounds and phonetic text across multiple languages.
Table: Objective Evaluation Results on the HeartBeats Benchmark (English)

Model             AudioBox PQ ↑  SongEval Avg ↑  Style Tag Sim ↑  PER ↓
Suno-v5           8.21           4.54            0.26             0.13
Suno-v4.5         8.24           4.51            0.25             0.14
Udio-v1.5         7.98           3.97            0.23             0.25
MiniMax-2.0       8.35           4.51            0.26             0.13
HeartMuLa (Ours)  8.14           4.48            0.26             0.09

Musical Quality

In terms of musicality, coherence, and naturalness, HeartMuLa scores competitively with Suno v4.5 and significantly outperforms open-source baselines like LeVo, YuE, and DiffRhythm 2. Subjective listening tests confirm that the generated music exhibits high harmony, structure fidelity, and creativity.

How to Deploy HeartMuLa Locally

The HeartMuLa team has released the code and model weights under the Apache 2.0 License, allowing for both academic and commercial use. Below is a step-by-step guide to deploying the HeartMuLa-3B version locally.

System Requirements

  • Python: 3.10 is recommended.
  • GPU: Due to the size of the 3B model and the LLM inference requirements, a GPU with substantial VRAM (e.g., NVIDIA A100, or consumer cards like RTX 3090/4090) is highly recommended.

Installation Steps

  1. Clone the Repository
    First, clone the heartlib repository to your local machine.

    git clone https://github.com/HeartMuLa/heartlib.git
    cd heartlib
    pip install -e .
    
  2. Download Checkpoints
    The models can be downloaded using either Hugging Face or ModelScope.
    Using Hugging Face:

    hf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'
    hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B'
    hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss'
    

    Using ModelScope:

    modelscope download --model 'HeartMuLa/HeartMuLaGen' --local_dir './ckpt'
    modelscope download --model 'HeartMuLa/HeartMuLa-oss-3B' --local_dir './ckpt/HeartMuLa-oss-3B'
    modelscope download --model 'HeartMuLa/HeartCodec-oss' --local_dir './ckpt/HeartCodec-oss'
    

    After downloading, ensure your directory structure looks like this:

    ./ckpt/
    ├── HeartCodec-oss/
    ├── HeartMuLa-oss-3B/
    ├── gen_config.json
    └── tokenizer.json
    

Generating Music

To generate music, you can run the provided example script.
Basic Command:

python ./examples/run_music_generation.py --model_path=./ckpt --version="3B"

This command will generate music based on the default lyrics and tags found in the ./assets folder. The output will be saved to ./assets/output.mp3.

Customizing Your Generation

You can modify the generation by providing your own lyrics and tags and by adjusting the sampling parameters; a combined example follows the argument list below.
Command Arguments:

  • --lyrics: Path to your lyrics file.
  • --tags: Path to your tags file.
  • --save_path: Where to save the output audio.
  • --max_audio_length_ms: Maximum length of the audio in milliseconds (default: 240000, i.e., 4 minutes).
  • --topk: Top-k sampling parameter (default: 50).
  • --temperature: Sampling temperature (default: 1.0).
  • --cfg_scale: Classifier-free guidance scale (default: 1.5).
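
Putting these arguments together, a customized run might look like the following; the file paths, output name, and the shorter maximum length are illustrative.

python ./examples/run_music_generation.py \
  --model_path=./ckpt \
  --version="3B" \
  --lyrics=./assets/lyrics.txt \
  --tags=./assets/tags.txt \
  --save_path=./assets/my_song.mp3 \
  --max_audio_length_ms=180000 \
  --topk=50 \
  --temperature=1.0 \
  --cfg_scale=1.5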

Formatting Lyrics and Tags

For best results, structure your lyrics with section markers.
Example Lyrics (./assets/lyrics.txt):
[Intro]
[Verse]
The sun creeps in across the floor
I hear the traffic outside the door
The coffee pot begins to hiss
It is another morning just like this
[Prechorus]
The world keeps spinning round and round
Feet are planted on the ground
I find my rhythm in the sound
[Chorus]
Every day the light returns
Every day the fire burns
We keep on walking down this street
Moving to the same steady beat
[Verse]
The hours tick deeply into noon
Chasing shadows, chasing the moon
Work is done and the lights go low
Watching the city start to glow
[Bridge]
It is not always easy, not always bright
Sometimes we wrestle with the night
But we make it to the morning light
[Chorus]
Every day the light returns
Every day the fire burns
We keep on walking down this street
Moving to the same steady beat
[Outro]
Just another day
Every single day

Example Tags (./assets/tags.txt):

piano,happy,wedding,synthesizer,romantic

Tags should be comma-separated without spaces. These tags act as high-level instructions to the model regarding genre, mood, and instrumentation.

Advanced Components: HeartCLAP and HeartTranscriptor

While HeartMuLa is the generative engine, the ecosystem is supported by two other critical models.

HeartCLAP: Bridging Text and Audio

HeartCLAP is an audio-text alignment model. It establishes a unified embedding space that facilitates tasks like music tagging and cross-modal retrieval. It is trained using Contrastive Learning with the InfoNCE loss, ensuring that positive music-text pairs are pulled together in the latent space while mismatched pairs are pushed apart.
The training data covers a wide spectrum of attributes: genre, mood, instrumentation, and more. Crucially, masking strategies are applied to these attributes during training so that the model remains robust to incomplete user prompts.
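
As a generic illustration of the InfoNCE objective described above (not HeartCLAP's actual code), a symmetric contrastive loss over a batch of paired music and text embeddings can be written as follows.

import torch
import torch.nn.functional as F

def symmetric_infonce(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (music, text) embedding pairs."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device) # matched pairs lie on the diagonal
    # pull matched pairs together in both retrieval directions, push mismatches apart
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2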

HeartTranscriptor: Lyrics Recognition

Standard Automatic Speech Recognition (ASR) models often struggle with sung vocals because of interference from the instrumental accompaniment. HeartTranscriptor is a model fine-tuned specifically for robust lyric recognition.
The training process involves using the Demucs model to separate vocals from the audio, followed by filtering based on word error rates. This curated dataset ensures the model learns to transcribe lyrics accurately even in complex musical arrangements.
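
As a rough sketch of that filtering idea (not the team's actual pipeline), one could keep only clips whose baseline ASR transcript stays within a word-error-rate threshold of the reference lyrics after vocal separation; the jiwer package and the 0.3 threshold below are assumptions.

from jiwer import wer

def keep_clip(reference_lyrics: str, asr_transcript: str, max_wer: float = 0.3) -> bool:
    """Keep a (separated-vocal clip, lyrics) pair only if the ASR transcript is close enough."""
    return wer(reference_lyrics.lower(), asr_transcript.lower()) <= max_wer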

Future Directions and Fine-Grained Control

The current open-source release (HeartMuLa-oss-3B) focuses on generating music conditioned on lyrics and tags. However, the team has outlined future capabilities that are currently available in internal versions or planned for release:

  • Fine-Grained Control: Users will be able to specify the style of different song sections (e.g., making the “Intro” piano-based and atmospheric, while the “Chorus” is energetic and electronic) using natural language prompts.
  • Reference Audio Conditioning: Allowing users to upload a reference audio clip to guide the stylistic generation of the song.
  • 7B Parameter Model: The team reports that scaling the model to 7B parameters achieves performance comparable to Suno, suggesting that larger models will further close the gap with commercial systems.

Conclusion

HeartMuLa represents a significant milestone in the democratization of AI music generation. By releasing a full-stack solution—from the audio tokenizer to the generative LLM—the team has provided the community with a robust foundation for research and application. The model’s ability to generate long-form, high-fidelity music with clear lyrics across multiple languages demonstrates that open-source systems can compete with closed-source giants.
Whether you are a developer looking to integrate music generation into your application or a musician exploring AI as a creative tool, HeartMuLa offers a transparent, efficient, and capable platform to explore the future of music.

Frequently Asked Questions (FAQ)

Is HeartMuLa completely open source?
Yes, the HeartMuLa ecosystem, including the 3B model weights, HeartCodec, HeartTranscriptor, and HeartCLAP, is released under the Apache 2.0 License. This allows for both academic research and commercial applications.
How does HeartMuLa compare to Suno or Udio?
Based on objective benchmarks, HeartMuLa-3B outperforms many commercial models in Lyric Intelligibility (PER) and matches or comes close to them in musical quality metrics like AudioBox PQ and SongEval. The 7B internal version is reported to achieve comparable performance to Suno.
Can I use my own voice or a specific singer’s timbre?
The current version of HeartMuLa does not support direct timbre cloning. The model uses MuQ-MuLan embeddings which exclude speaker timbre information to avoid ethical issues regarding voice copying.
What is the maximum length of music I can generate?
The model supports long-form music generation of up to six minutes, maintaining structural coherence and expressive diversity.
Can I generate music without lyrics?
Yes, you can generate instrumental music by providing empty or placeholder lyrics (like [Intro], [Main] with no text) or focusing purely on the tags to generate background music.
Do I need a powerful GPU to run the model?
While it is possible to run smaller segments on consumer GPUs with sufficient VRAM (e.g., 16GB-24GB), optimal performance and the ability to generate long-form music are best achieved with enterprise-grade GPUs like the NVIDIA A100.
How do I cite HeartMuLa in my research?
If you use HeartMuLa in your work, you can cite the arXiv paper as follows:

@misc{yang2026heartmulafamilyopensourced,
      title={HeartMuLa: A Family of Open Sourced Music Foundation Models}, 
      author={Dongchao Yang and Yuxin Xie and Yuguo Yin and Zheyu Wang and Xiaoyu Yi and Gongxi Zhu and Xiaolong Weng and Zihan Xiong and Yingzhe Ma and Dading Cong and Jingliang Liu and Zihang Huang and Jinghan Ru and Rongjie Huang and Haoran Wan and Peixu Wang and Kuoxi Yu and Helin Wang and Liming Liang and Xianwei Zhuang and Yuanyuan Wang and Haohan Guo and Junjie Cao and Zeqian Ju and Songxiang Liu and Yuewen Cao and Heming Weng and Yuexian Zou},
      year={2026},
      eprint={2601.10547},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10547}, 
}