T5Gemma: A New Collection of Encoder-Decoder Gemma Models

Introduction

In the fast-paced world of large language models (LLMs), encoder-decoder models have often been overshadowed by their decoder-only counterparts. However, encoder-decoder models like T5 still hold significant advantages in many practical applications due to their high inference efficiency, design flexibility, and rich encoder representation for input understanding. Today, we are excited to introduce T5Gemma, a new collection of encoder-decoder LLMs developed by adapting pretrained decoder-only models into the encoder-decoder architecture.

From Decoder-Only to Encoder-Decoder

T5Gemma explores the potential of building top-tier encoder-decoder models based on pretrained decoder-only models through a technique called model adaptation. The core idea is to initialize the parameters of an encoder-decoder model using the weights of an already pretrained decoder-only model and then further adapt them via UL2 or PrefixLM-based pre-training.
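To make the adaptation idea concrete, the sketch below shows one plausible weight-mapping scheme in PyTorch: every pretrained decoder-only block seeds both the new encoder and the new decoder, and the decoder's cross-attention (which has no counterpart in the source model) is seeded from the corresponding self-attention weights. The parameter names and the cross-attention initialization are illustrative assumptions, not T5Gemma's exact recipe.

```python
import torch

def adapt_decoder_only_to_encoder_decoder(dec_only_state: dict) -> dict:
    """Illustrative weight mapping (hypothetical parameter names).

    Each pretrained decoder-only block initializes both the encoder and the
    decoder; cross-attention, which does not exist in the source model, is
    seeded here from the matching self-attention weights.
    """
    enc_dec_state = {}
    for name, tensor in dec_only_state.items():
        enc_dec_state[f"encoder.{name}"] = tensor.clone()
        enc_dec_state[f"decoder.{name}"] = tensor.clone()
        if "self_attn" in name:  # add cross-attention for the decoder
            cross_name = name.replace("self_attn", "cross_attn")
            enc_dec_state[f"decoder.{cross_name}"] = tensor.clone()
    return enc_dec_state

# Toy example with two fake parameters.
toy_state = {
    "layers.0.self_attn.q_proj.weight": torch.randn(8, 8),
    "layers.0.mlp.up_proj.weight": torch.randn(16, 8),
}
print(sorted(adapt_decoder_only_to_encoder_decoder(toy_state)))
```

In the full recipe, the adapted checkpoint is then further pre-trained with UL2 or PrefixLM, as described above, rather than used as-is.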

This adaptation method offers remarkable flexibility, enabling creative combinations of model sizes. For instance, pairing a large encoder with a small decoder (e.g., a 9B encoder with a 2B decoder) creates an “unbalanced” model that makes it possible to tune the quality-efficiency trade-off for specific tasks such as summarization, where deep understanding of the input matters more than the complexity of the generated output.

Towards Better Quality-Efficiency Trade-off

Performance Benchmark

In our experiments, T5Gemma models demonstrate comparable or better performance than their decoder-only Gemma counterparts, nearly dominating the quality-inference efficiency Pareto frontier across several benchmarks, including SuperGLUE, which measures the quality of learned representations.

This performance advantage translates to real-world benefits. When measuring actual latency for GSM8K (math reasoning), T5Gemma shows clear improvements. For example, T5Gemma 9B-9B achieves higher accuracy than Gemma 2 9B but with similar latency. More impressively, T5Gemma 9B-2B delivers a significant accuracy boost over the 2B-2B model while maintaining nearly identical latency to the much smaller Gemma 2 2B model. These results highlight the flexibility and power of encoder-decoder adaptation in balancing quality and inference speed.
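A back-of-the-envelope latency model helps explain why the unbalanced 9B-2B configuration can run nearly as fast as a 2B-2B model: the encoder processes the whole input in a single parallel pass, while the decoder must run once per generated token. The hardware numbers and cost formulas below are rough illustrative assumptions, not measurements of T5Gemma or of any particular accelerator.

```python
def rough_latency_s(enc_params_b, dec_params_b, in_tokens, out_tokens,
                    peak_tflops=200.0, mem_bw_gbs=1000.0):
    """Toy latency model under illustrative assumptions:
    - the encoder pass is parallel and compute-bound (~2 FLOPs/param/token);
    - each decode step is sequential and roughly memory-bandwidth bound,
      i.e. dominated by reading the decoder's bf16 weights once per token.
    """
    enc_time = (2 * enc_params_b * 1e9 * in_tokens) / (peak_tflops * 1e12)
    step_time = (2 * dec_params_b * 1e9) / (mem_bw_gbs * 1e9)  # 2 bytes/param
    return enc_time + out_tokens * step_time

for name, enc, dec in [("9B-9B", 9, 9), ("9B-2B", 9, 2), ("2B-2B", 2, 2)]:
    t = rough_latency_s(enc, dec, in_tokens=2000, out_tokens=100)
    print(f"{name}: ~{t:.2f} s")
```

Under these toy assumptions, most of the generation time comes from the sequential decoder steps, which is why shrinking only the decoder recovers most of the speed of the smaller model while keeping the large encoder's understanding of the input.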

Pre-trained and Fine-tuned Capabilities

T5Gemma exhibits promising capabilities both before and after instruction tuning. After pre-training, it achieves significant gains on complex reasoning tasks. For instance, T5Gemma 9B-9B scores over 9 points higher on GSM8K (math reasoning) and 4 points higher on DROP (reading comprehension) than the original Gemma 2 9B model. This indicates that the encoder-decoder architecture, when initialized via adaptation, has the potential to create a more capable and performant foundational model.

These foundational improvements set the stage for even larger gains after instruction tuning. Comparing Gemma 2 IT to T5Gemma IT, the performance gap widens significantly across the board: T5Gemma 2B-2B IT's MMLU score jumps by nearly 12 points over Gemma 2 2B IT, and its GSM8K score rises from 58.0% to 70.7%. The adapted architecture not only provides a better starting point, it also appears to respond more effectively to instruction tuning, ultimately yielding a substantially more capable and helpful final model.

T5Gemma Model Variants and Applications

Multiple Sizes

T5Gemma offers a variety of model sizes, including T5-sized models (Small, Base, Large, and XL), Gemma 2-based models (2B and 9B), and an additional model sized between T5 Large and T5 XL. This range lets researchers and developers select the model size that best fits their specific needs.

Different Training Objectives

T5Gemma provides models trained with either the PrefixLM objective, for strong generative performance, or the UL2 objective, for strong representation quality, so users can choose a checkpoint based on their task requirements.
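The sketch below illustrates, in simplified form, how the two objectives shape training examples: PrefixLM feeds the encoder a prefix of the document and asks the decoder to continue it, while UL2's span-corruption denoisers mask random spans in the input and ask the decoder to reconstruct them behind sentinel markers. The sentinel naming and span-sampling heuristics here are assumptions for illustration; in practice UL2 mixes several denoisers with different noise rates and span lengths.

```python
import random

SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5-style sentinels (assumed naming)

def prefixlm_example(tokens, split):
    """PrefixLM: the encoder sees a prefix; the decoder predicts the continuation."""
    return tokens[:split], tokens[split:]

def span_corruption_example(tokens, noise_density=0.15, mean_span_len=3):
    """One UL2-style denoiser: mask random spans in the encoder input and have
    the decoder reconstruct them after sentinel markers."""
    n_noise = max(1, round(len(tokens) * noise_density))
    n_spans = max(1, round(n_noise / mean_span_len))
    starts = sorted(random.sample(range(len(tokens)), n_spans))
    inputs, targets, pos, s = [], [], 0, 0
    for start in starts:
        if start < pos:          # skip spans overlapping the previous one
            continue
        inputs.extend(tokens[pos:start])
        end = min(start + mean_span_len, len(tokens))
        inputs.append(SENTINELS[s])
        targets.append(SENTINELS[s])
        targets.extend(tokens[start:end])
        pos, s = end, s + 1
    inputs.extend(tokens[pos:])
    return inputs, targets

toks = "the quick brown fox jumps over the lazy dog".split()
print(prefixlm_example(toks, split=4))
print(span_corruption_example(toks))
```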

Unbalanced Model Configuration

The powerful and efficient unbalanced 9B-2B checkpoint in T5Gemma allows exploration of the trade-offs between encoder and decoder sizes. This configuration provides new possibilities for customizing models for specific tasks.

Practical Application Cases

Text Summarization

In text summarization tasks, T5Gemma’s unbalanced configuration (such as 9B-2B) shows unique advantages. The large encoder can deeply understand the semantic information of the input text, while the small decoder efficiently generates concise and accurate summaries. This combination not only improves summary quality but also significantly reduces generation latency, meeting the demands of real-time applications.
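As an illustrative usage sketch, a summarization call with the Hugging Face transformers seq2seq API could look like the following. The checkpoint id and the "Summarize:" prompt are placeholders and assumptions; consult the released T5Gemma model cards for the exact repository names and the recommended prompting format.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/t5gemma-9b-2b"  # placeholder id -- check the model card for the exact name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

article = "..."  # long input document to summarize
inputs = tokenizer("Summarize: " + article, return_tensors="pt").to(model.device)
summary_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```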

Machine Translation

In machine translation tasks, the bidirectional attention in T5Gemma’s encoder allows it to comprehensively capture the context of the source-language sentence. Experiments show that T5Gemma performs strongly on translation tasks across multiple languages, particularly when handling complex sentence structures and long sentences.

Question Answering Systems

For question answering systems, T5Gemma’s efficient inference and strong context understanding allow it to provide accurate, detailed answers quickly. On the SuperGLUE benchmark, T5Gemma achieves high accuracy on tasks such as COPA, WiC, and WSC, demonstrating its competitiveness in language understanding.

Future Development Directions

Model Size Expansion

Currently, T5Gemma research focuses primarily on the 2B and 9B sizes of Gemma 2. In the future, the research team plans to extend the approach to larger models (such as 27B) to explore its performance potential at scale.

Cross-Model Family Adaptation

T5Gemma’s method is highly flexible and, in principle, applicable to other model families such as LLaMA and Qwen. Future work will explore adapting different model families, for example combining a LLaMA model and a Qwen model as encoder and decoder, to leverage their respective strengths.

Multimodal Extension

Extending T5Gemma to multimodal modeling (e.g., vision-language and speech-language) is another exciting direction. This will enable the model to handle more complex input forms and expand its application scenarios.

Training Objective Optimization

The research team will continue exploring how to better combine techniques like PrefixLM, knowledge distillation, and UL2 to enhance model performance. Additionally, joint optimization methods for PrefixLM and UL2 will be investigated in future work.

Conclusion

T5Gemma successfully transforms pretrained decoder-only models into encoder-decoder models through an innovative adaptation technique, maintaining efficient inference while significantly improving performance. T5Gemma revitalizes encoder-decoder models in the LLM field and provides researchers and developers with a powerful, flexible tool. By choosing among T5Gemma’s model sizes, training objectives, and configurations, practitioners can find the right balance between quality and efficiency for different applications, extending the reach of natural language processing to more domains.
