docs: add info about quantization and dimensionality reduction #231

Merged 2 commits on Apr 30, 2025
2 changes: 2 additions & 0 deletions README.md
@@ -123,6 +123,8 @@ For advanced usage, please refer to our [usage documentation](https://github.com

## Updates & Announcements

- **01/05/2025**: We released backend support for `BPE` and `Unigram` tokenizers, along with quantization and dimensionality reduction. New Model2Vec models are now 50% of the size of the original models, and can be quantized to int8 to shrink to 25% of the original size, without loss of performance.

- **12/02/2025**: We released **Model2Vec training**, allowing you to fine-tune your own classification models on top of Model2Vec models. Find out more in our [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md) and [results](results/README.md#training-results).

- **30/01/2025**: We released two new models: [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) and [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M). [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is our most performant model to date, using a larger vocabulary and higher dimensions. [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is a finetune of [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) that is optimized for retrieval tasks, and is the best performing static retrieval model currently available.
48 changes: 48 additions & 0 deletions docs/usage.md
@@ -126,6 +126,54 @@ m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=Fa

**Important note:** we assume the passed vocabulary is sorted by rank frequency, i.e., we don't care about the actual word frequencies, but we do assume that the most frequent word comes first and the least frequent word comes last. If you're not sure whether this is the case, set `apply_zipf` to `False`, as in the sketch below. This disables the weighting, but will also make performance a little worse.
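
For example, a minimal sketch with an alphabetically sorted (so not frequency-ranked) word list, reusing the `BAAI/bge-base-en-v1.5` model from the quantization example below:

```python
from model2vec.distill import distill

# This word list is sorted alphabetically, not by frequency,
# so we disable the Zipf-based weighting.
vocabulary = ["apple", "banana", "cherry"]
m2v_model = distill(
    model_name="BAAI/bge-base-en-v1.5",
    vocabulary=vocabulary,
    apply_zipf=False,
)
```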

### Quantization

Models can be quantized to `float16` (default) or `int8` during distillation, or when loading from disk.

```python
from model2vec.distill import distill

# Distill a Sentence Transformer model and quantize it to int8
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", quantize_to="int8")

# Save the model. This model is now 25% of the size of a normal model.
m2v_model.save_pretrained("m2v_model")
```

You can also quantize during loading.

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8m", quantize_to="int8")
```
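
A quantized model behaves like any other static model; a minimal usage sketch (assuming the standard `encode` method on `StaticModel`):

```python
from model2vec import StaticModel

# Load the int8-quantized model and embed a couple of sentences as usual.
model = StaticModel.from_pretrained("minishlab/potion-base-8m", quantize_to="int8")
embeddings = model.encode(["quantization is cheap", "and the API stays the same"])
print(embeddings.shape)
```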

### Dimensionality reduction

Because almost all Model2Vec models have been distilled using PCA, and because PCA explicitly orders dimensions from most informative to least informative, we can perform dimensionality reduction during loading. This is very similar to how matryoshka embeddings work.

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8m", dimensionality=32)

print(model.embedding.shape)
# (29528, 32)
```
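
Since the reduction keeps only the leading PCA dimensions, loading with a lower `dimensionality` should amount to truncating the columns of the full embedding matrix. A quick sanity-check sketch (the truncation equivalence is our assumption here, not a documented guarantee):

```python
import numpy as np

from model2vec import StaticModel

full = StaticModel.from_pretrained("minishlab/potion-base-8m")
reduced = StaticModel.from_pretrained("minishlab/potion-base-8m", dimensionality=32)

# If reduction is plain truncation, the reduced embeddings
# equal the first 32 columns of the full matrix.
print(np.allclose(full.embedding[:, :32], reduced.embedding))
```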

### Combining quantization and dimensionality reduction

Combining these tricks can lead to extremely small models. For example, we can reduce the size of `potion-base-8m` from 30MB to only 1MB:

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained(
    "minishlab/potion-base-8m",
    dimensionality=32,
    quantize_to="int8",
)
print(model.embedding.nbytes)
# 944896 bytes, roughly 945 KB
```
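
The arithmetic is simply vocabulary size × dimensions × bytes per value. A back-of-the-envelope check (assuming the full `potion-base-8m` matrix is 29528 × 256 in `float32`, which matches the ~30MB figure above):

```python
# Full model: 29528 tokens x 256 dims x 4 bytes (float32).
print(29528 * 256 * 4)  # 30236672 bytes, ~30.2 MB

# Reduced and quantized: 29528 tokens x 32 dims x 1 byte (int8).
print(29528 * 32 * 1)  # 944896 bytes, ~0.94 MB
```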

This should be enough to satisfy even the strictest hardware constraints.

## Training
