Model2Vec: Fast and Efficient Static Embedding Models
Natural language processing (NLP) technologies are becoming ever more widespread, and from text classification to information retrieval to complex question answering systems, model performance and efficiency are critical. Model2Vec is a technique that transforms sentence transformers into compact, fast, and powerful static models, providing a new option for a wide range of NLP tasks.
Quick Start
If you’re already familiar with the basics of NLP and model deployment, you can start using Model2Vec in just minutes. Here are the basic steps to install and use Model2Vec:
pip install model2vec
Once installed, you can load pre-trained models from HuggingFace hub and immediately start creating embeddings. For example, using the potion-base-8M model:
from model2vec import StaticModel
# Load a model from HuggingFace hub
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
# Create embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
# Create token embedding sequences
token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."])
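As a quick sanity check of what these calls return (a sketch based on the calls above; potion-base-8M produces 256-dimensional vectors, other models may differ), you can inspect the outputs:
print(embeddings.shape)               # one vector per input sentence, e.g. (2, 256)
print(len(token_embeddings))          # 2: one token-embedding sequence per input
print(token_embeddings[0].shape[-1])  # same embedding dimension, e.g. 256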
If you want to distill your own Model2Vec model from scratch, follow these steps:
pip install model2vec[distill]
Then, you can use the following code to distill a model quickly on CPU:
from model2vec.distill import distill
# Distill from a Sentence Transformer model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)
# Save the model
m2v_model.save_pretrained("m2v_model")
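The saved model can then be loaded back and used like any other static model; this minimal sketch assumes from_pretrained also accepts a local directory path:
from model2vec import StaticModel
# Load the distilled model back from the local directory saved above
m2v_model = StaticModel.from_pretrained("m2v_model")
embeddings = m2v_model.encode(["Distillation needs only a vocabulary and a model."])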
After distillation, you can also fine-tune your own classification models on top of the distilled model or pre-trained models:
pip install model2vec[training]
Here’s an example of how to fine-tune a model:
import numpy as np
from datasets import load_dataset
from model2vec.train import StaticModelForClassification
# Initialize a classifier from a pre-trained model
classifier = StaticModelForClassification.from_pretrained(model_name="minishlab/potion-base-32M")
# Load a dataset
ds = load_dataset("setfit/subj")
# Train the classifier on text (X) and labels (y)
classifier.fit(ds["train"]["text"], ds["train"]["label"])
# Evaluate the classifier
classification_report = classifier.evaluate(ds["test"]["text"], ds["test"]["label"])
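Once trained, the classifier can be applied to new texts; the snippet below is a minimal sketch that assumes StaticModelForClassification exposes a predict method taking a list of strings:
# Predict labels for new, unseen texts
predictions = classifier.predict(["Its pacing is deliberate, almost documentary-like."])
print(predictions)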
Updates and Announcements
The Model2Vec team has been constantly updating and improving the models to deliver better performance and more features. Here are some key updates:
- December 2, 2024: Model2Vec training was released, enabling users to fine-tune their own classification models on top of Model2Vec models. For more information, check out the training documentation and results.
- January 30, 2025: Two new models were released: potion-base-32M and potion-retrieval-32M. Potion-base-32M is our most powerful model to date, featuring a larger vocabulary and higher dimensionality. Potion-retrieval-32M is a fine-tuned version of potion-base-32M optimized for retrieval tasks, making it the best-performing static retrieval model currently available.
- October 30, 2024: Three new models were introduced: potion-base-8M, potion-base-4M, and potion-base-2M. These models were trained using the Tokenlearn method. Users of our older English M2V models are recommended to switch to these new models, as they deliver superior performance across all tasks.
Key Features of Model2Vec
- State-of-the-Art Performance: Model2Vec models significantly outperform other static embedding models such as GloVe and BPEmb across a wide range of tasks. For detailed results, refer to our results section.
- Compact Size: Model2Vec reduces the size of a sentence transformer model by up to a factor of 50. Our best model takes up just about 30MB on disk, while our smallest model is only around 8MB, making it the smallest model on MTEB.
- Lightweight Dependencies: The base package primarily depends on numpy.
- Blazing Fast Inference: Up to 500 times faster on CPU than the original model.
- Rapid, Dataset-Free Distillation: You can distill your own model in about 30 seconds on a CPU, without any dataset.
- Fine-Tuning: Fine-tune your own classification models on top of Model2Vec models.
- Integration with Popular Libraries: Model2Vec is directly integrated into popular libraries such as Sentence Transformers and LangChain. For more details, see our integrations documentation (a short loading sketch follows this list).
- Seamless Integration with HuggingFace Hub: Easily share and load models from the HuggingFace Hub using familiar methods like from_pretrained and push_to_hub. Our models can be found here.
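As an illustration of the library integrations mentioned above, the sketch below loads a Model2Vec model inside Sentence Transformers via its StaticEmbedding module (this assumes a recent sentence-transformers release that ships StaticEmbedding.from_model2vec):
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
# Wrap a Model2Vec model as a static embedding module and build a SentenceTransformer
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
st_model = SentenceTransformer(modules=[static_embedding])
embeddings = st_model.encode(["It's dangerous to go alone!"])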
What is Model2Vec?
Model2Vec creates small yet powerful models that deliver superior performance across all tasks we’ve tested. It outperforms traditional static embedding models like GloVe and is faster to create. Similar to BPEmb, it can generate subword embeddings but with better performance. Distillation requires no data, just a vocabulary and a model.
The core concept involves passing a vocabulary through a sentence transformer model to create static embeddings for the individual tokens. Several post-processing steps are then applied to produce our top-tier models. For a more in-depth exploration, refer to the following resources (a brief conceptual sketch follows the list):
- Our initial Model2Vec blog post offers a good overview of the core concept, though we’ve made numerous improvements since then.
- Our Tokenlearn blog post explains the Tokenlearn method used to train our potion models.
- Our official documentation provides a high-level overview of how Model2Vec works.
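The following is a purely conceptual sketch of the core idea, not the library’s actual distillation code: every token in a vocabulary is embedded once by a sentence transformer, and the resulting matrix is post-processed (here only PCA; Model2Vec applies additional steps such as token weighting) into static token embeddings.
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer
# Toy vocabulary; in practice this is the tokenizer's full vocabulary
vocabulary = ["hello", "world", "dangerous", "secret"]
# Embed each token once with the teacher sentence transformer
teacher = SentenceTransformer("BAAI/bge-base-en-v1.5")
token_vectors = teacher.encode(vocabulary)
# Post-process: reduce dimensionality (the real pipeline also re-weights tokens)
static_vectors = PCA(n_components=2).fit_transform(token_vectors)
print(static_vectors.shape)  # (4, 2): one static vector per vocabulary token
At inference time, a sentence embedding is then simply the mean of the static vectors of the tokens in that sentence, which is why no further forward passes through the transformer are needed.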
Documentation
Our official documentation is available here. It includes:
- Usage documentation: Provides a technical overview of how to use Model2Vec.
- Integrations documentation: Offers examples of using Model2Vec with various downstream libraries.
- Model2Vec technical documentation: Presents a high-level overview of how Model2Vec works.
Model List
We offer a range of ready-to-use models. These models are available on the HuggingFace hub and can be loaded using the from_pretrained method. Here are some of the models:
Model | Language | Sentence Transformer | Params | Task
---|---|---|---|---
potion-base-32M | English | bge-base-en-v1.5 | 32.3M | General |
potion-base-8M | English | bge-base-en-v1.5 | 7.5M | General |
potion-base-4M | English | bge-base-en-v1.5 | 3.7M | General |
potion-base-2M | English | bge-base-en-v1.5 | 1.8M | General |
potion-retrieval-32M | English | bge-base-en-v1.5 | 32.3M | Retrieval |
M2V_multilingual_output | Multilingual | LaBSE | 471M | General |
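Any entry in the table can be loaded by its Hub identifier; for example (assuming the minishlab namespace shown in the Quick Start):
from model2vec import StaticModel
# Load the retrieval-optimized variant from the HuggingFace hub
retrieval_model = StaticModel.from_pretrained("minishlab/potion-retrieval-32M")
embeddings = retrieval_model.encode(["static embeddings for retrieval"])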
Results
We have conducted extensive experiments to evaluate the performance of Model2Vec models. The results are documented in the results folder. Here’s a summary of the key sections:
- MTEB Results: MTEB (Massive Text Embedding Benchmark) is a benchmark for evaluating text embedding quality across various NLP tasks, including text classification, clustering, and information retrieval. Model2Vec models have performed exceptionally well on MTEB, especially on text classification and information retrieval tasks, significantly outperforming other static embedding models.
- Training Results: This section showcases Model2Vec’s performance under different training conditions, such as varying parameter settings and training dataset sizes. These results help users select the right model and training strategy for their specific needs.
- Ablation Studies: By removing or modifying key components of Model2Vec, we studied their impact on model performance. For instance, we found that the post-processing steps significantly affect final performance, highlighting their importance in generating high-quality embeddings.
Summary
Model2Vec is a revolutionary technology that transforms sentence transformers into compact static models. It maintains high performance while drastically reducing model size and boosting inference speed. Whether you’re looking to quickly deploy NLP applications or need to run models in resource-constrained environments, Model2Vec is an excellent choice. With its simple installation and usage process, and integration with multiple popular libraries, Model2Vec provides robust support for a wide range of NLP tasks. As Model2Vec continues to evolve and update, we can look forward to it playing an even more significant role in the future of NLP.