Commit a2da2e9

stephantul and Pringled authored
Add fittable (#140)
* Fix tokenizer issue
* fix issue with warning
* regenerate lock file
* fix lock file
* Try to not select 2.5.1
* fix: issue with dividers in utils
* Try to not select 2.5.0
* fix: do not up version
* Attempt special fix
* feat: add training
* fix: no grad
* use numpy
* Add train_test_split
* fix: issue with fit not resetting
* feat: add lightning
* Fix bugs
* fix: reviewer comments
* fix train issue
* fix issue with trainer
* fix: truncate during training
* feat: tokenize maximum length truncation
* fixes
* typo
* Add progressbar
* small code changes, add docs
* fix training comments
* Add pipeline saving
* fix bug
* fix issue with normalize test
* change default batch size
* feat: add sklearn skops pipeline
* Device handling and automatic batch size
* Add docstrings, defaults
* docs
* fix: rename
* fix: rename
* fix installation
* rename
* Add training tutorial
* Add tutorial link
* test: add tests
* fix tests
* tests: fix tests
* Address comments
* Add inference reqs to train reqs
* fix normalize
* update lock file
* fix: move modelcards
* fix: batch size
* update lock file
* Update model2vec/inference/README.md
  Co-authored-by: Thomas van Dongen <[email protected]>
* Update model2vec/inference/README.md
  Co-authored-by: Thomas van Dongen <[email protected]>
* Update model2vec/inference/README.md
  Co-authored-by: Thomas van Dongen <[email protected]>
* Update model2vec/train/classifier.py
  Co-authored-by: Thomas van Dongen <[email protected]>
* fix: encode args
* fix: trust_remote_code
* fix notebook

---------

Co-authored-by: Thomas van Dongen <[email protected]>
1 parent 93647fd commit a2da2e9

20 files changed, with 2541 additions and 25 deletions

Makefile

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ install:
 	uv run pre-commit install

 install-no-pre-commit:
-	uv pip install ".[dev,distill]"
+	uv pip install ".[dev,distill,inference,train]"
 	uv pip install "torch<2.5.0"

 install-base:

model2vec/hf_utils.py

Lines changed: 4 additions & 2 deletions
@@ -60,6 +60,7 @@ def _create_model_card(
     license: str = "mit",
     language: list[str] | None = None,
     model_name: str | None = None,
+    template_path: str = "modelcards/model_card_template.md",
     **kwargs: Any,
 ) -> None:
     """
@@ -70,11 +71,12 @@ def _create_model_card(
     :param license: The license to use.
     :param language: The language of the model.
     :param model_name: The name of the model to use in the Model Card.
+    :param template_path: The path to the template.
     :param **kwargs: Additional metadata for the model card (e.g., model_name, base_model, etc.).
     """
     folder_path = Path(folder_path)
     model_name = model_name or folder_path.name
-    template_path = Path(__file__).parent / "model_card_template.md"
+    full_path = Path(__file__).parent / template_path

     model_card_data = ModelCardData(
         model_name=model_name,
@@ -85,7 +87,7 @@ def _create_model_card(
         library_name="model2vec",
         **kwargs,
     )
-    model_card = ModelCard.from_template(model_card_data, template_path=template_path)
+    model_card = ModelCard.from_template(model_card_data, template_path=full_path)
     model_card.save(folder_path / "README.md")
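
Review note: the change above resolves `template_path` relative to the package directory instead of hard-coding one file name. A minimal sketch of that resolution follows; `resolve_template` and its `package_dir` argument are hypothetical stand-ins for the `Path(__file__).parent / template_path` expression, not part of the commit.

```python
from pathlib import Path


def resolve_template(package_dir: Path, template_path: str = "modelcards/model_card_template.md") -> Path:
    """Join a package-relative template path onto the package directory.

    `package_dir` is a hypothetical stand-in for `Path(__file__).parent`.
    """
    return package_dir / template_path


package_dir = Path("model2vec")

# The default points at the general model card template.
default_template = resolve_template(package_dir)

# Callers such as the pipeline saver can select another template.
classifier_template = resolve_template(package_dir, "modelcards/classifier_template.md")

print(default_template.as_posix())
print(classifier_template.as_posix())
```

Because `pathlib` discards the left operand when the right operand is absolute, an absolute `template_path` would bypass the package directory entirely.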

model2vec/inference/README.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# Inference
+
+This subpackage mainly contains helper functions for inference with trained models that have been exported to `scikit-learn`-compatible pipelines.
+
+If you're looking for information on how to train a model, see the [training README](../train/README.md).
+
+# Usage
+
+Let's assume you're using our [potion-edu classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier).
+
+```python
+from model2vec.inference import StaticModelPipeline
+
+classifier = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
+label = classifier.predict("Attitudes towards cattle in the Alps: a study in letting go.")
+```
+
+`predict` returns a NumPy array with the predicted label for each input text.

model2vec/inference/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+from model2vec.utils import get_package_extras, importable
+
+_REQUIRED_EXTRA = "inference"
+
+for extra_dependency in get_package_extras("model2vec", _REQUIRED_EXTRA):
+    importable(extra_dependency, _REQUIRED_EXTRA)
+
+from model2vec.inference.model import StaticModelPipeline
+
+__all__ = ["StaticModelPipeline"]
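
Review note: this subpackage checks its optional dependencies at import time. The implementations of `get_package_extras` and `importable` are not part of this diff, so the following is only a sketch of how such a guard could work. `require_importable` is a hypothetical helper built on the standard library's `importlib.util.find_spec`, and the install hint reuses the `model2vec[inference]` extra named in the model card template.

```python
from importlib.util import find_spec


def require_importable(module_name: str, extra: str) -> None:
    """Raise an ImportError with an install hint if `module_name` is missing.

    Hypothetical helper; not the actual `model2vec.utils.importable`.
    """
    if find_spec(module_name) is None:
        raise ImportError(
            f"`{module_name}` is required for this feature. "
            f"Install it with `pip install model2vec[{extra}]`."
        )


# A standard-library module is always importable, so this passes silently.
require_importable("json", "inference")

# A missing module produces an actionable error message.
try:
    require_importable("surely_not_an_installed_module", "inference")
except ImportError as error:
    message = str(error)
```

Failing at import time with the name of the extra tells the user which install command fixes the problem, instead of surfacing a bare `ModuleNotFoundError` from deep inside the package.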

model2vec/inference/model.py

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
+from __future__ import annotations
+
+import re
+from pathlib import Path
+from tempfile import TemporaryDirectory
+
+import huggingface_hub
+import numpy as np
+import skops.io
+from sklearn.pipeline import Pipeline
+
+from model2vec.hf_utils import _create_model_card
+from model2vec.model import PathLike, StaticModel
+
+_DEFAULT_TRUST_PATTERN = re.compile(r"sklearn\..+")
+_DEFAULT_MODEL_FILENAME = "pipeline.skops"
+
+
+class StaticModelPipeline:
+    def __init__(self, model: StaticModel, head: Pipeline) -> None:
+        """Create a pipeline with a StaticModel encoder."""
+        self.model = model
+        self.head = head
+
+    @classmethod
+    def from_pretrained(
+        cls: type[StaticModelPipeline], path: PathLike, token: str | None = None, trust_remote_code: bool = False
+    ) -> StaticModelPipeline:
+        """
+        Load a StaticModelPipeline from a local path or Hugging Face Hub path.
+
+        NOTE: if you load a private model from the Hugging Face Hub, you need to pass a token.
+
+        :param path: The path to the folder containing the pipeline, or a repository on the Hugging Face Hub
+        :param token: The token to use to download the pipeline from the hub.
+        :param trust_remote_code: Whether to trust the remote code. If this is False, we will only load components coming from `sklearn`.
+        :return: The loaded pipeline.
+        """
+        model, head = _load_pipeline(path, token, trust_remote_code)
+        model.embedding = np.nan_to_num(model.embedding)
+
+        return cls(model, head)
+
+    def save_pretrained(self, path: str) -> None:
+        """Save the model to a folder."""
+        save_pipeline(self, path)
+
+    def push_to_hub(self, repo_id: str, token: str | None = None, private: bool = False) -> None:
+        """
+        Save a model to a folder, and then push that folder to the Hugging Face Hub.
+
+        :param repo_id: The id of the repository to push to.
+        :param token: The token to use to push to the hub.
+        :param private: Whether the repository should be private.
+        """
+        from model2vec.hf_utils import push_folder_to_hub
+
+        with TemporaryDirectory() as temp_dir:
+            save_pipeline(self, temp_dir)
+            self.model.save_pretrained(temp_dir)
+            push_folder_to_hub(Path(temp_dir), repo_id, private, token)
+
+    def _predict_and_coerce_to_2d(
+        self,
+        X: list[str] | str,
+        show_progress_bar: bool,
+        max_length: int | None,
+        batch_size: int,
+        use_multiprocessing: bool,
+        multiprocessing_threshold: int,
+    ) -> np.ndarray:
+        """Encode the input and coerce the embeddings to a 2D matrix."""
+        encoded = self.model.encode(
+            X,
+            show_progress_bar=show_progress_bar,
+            max_length=max_length,
+            batch_size=batch_size,
+            use_multiprocessing=use_multiprocessing,
+            multiprocessing_threshold=multiprocessing_threshold,
+        )
+        if np.ndim(encoded) == 1:
+            encoded = encoded[None, :]
+
+        return encoded
+
+    def predict(
+        self,
+        X: list[str] | str,
+        show_progress_bar: bool = False,
+        max_length: int | None = 512,
+        batch_size: int = 1024,
+        use_multiprocessing: bool = True,
+        multiprocessing_threshold: int = 10_000,
+    ) -> np.ndarray:
+        """Predict the labels of the input."""
+        encoded = self._predict_and_coerce_to_2d(
+            X,
+            show_progress_bar=show_progress_bar,
+            max_length=max_length,
+            batch_size=batch_size,
+            use_multiprocessing=use_multiprocessing,
+            multiprocessing_threshold=multiprocessing_threshold,
+        )
+
+        return self.head.predict(encoded)
+
+    def predict_proba(
+        self,
+        X: list[str] | str,
+        show_progress_bar: bool = False,
+        max_length: int | None = 512,
+        batch_size: int = 1024,
+        use_multiprocessing: bool = True,
+        multiprocessing_threshold: int = 10_000,
+    ) -> np.ndarray:
+        """Predict the probabilities of the labels of the input."""
+        encoded = self._predict_and_coerce_to_2d(
+            X,
+            show_progress_bar=show_progress_bar,
+            max_length=max_length,
+            batch_size=batch_size,
+            use_multiprocessing=use_multiprocessing,
+            multiprocessing_threshold=multiprocessing_threshold,
+        )
+
+        return self.head.predict_proba(encoded)
+
+
+def _load_pipeline(
+    folder_or_repo_path: PathLike, token: str | None = None, trust_remote_code: bool = False
+) -> tuple[StaticModel, Pipeline]:
+    """
+    Load a model and an sklearn pipeline.
+
+    This assumes the following files are present in the repo:
+    - `pipeline.skops`: The head of the pipeline.
+    - `config.json`: The configuration of the model.
+    - `model.safetensors`: The weights of the model.
+    - `tokenizer.json`: The tokenizer of the model.
+
+    :param folder_or_repo_path: The path to the folder containing the pipeline.
+    :param token: The token to use to download the pipeline from the hub. If this is None, you will only
+        be able to load the pipeline from a local folder, public repository, or a repository that you have access to
+        because you are logged in.
+    :param trust_remote_code: Whether to trust the remote code. If this is False,
+        we will only load components coming from `sklearn`. If this is True, we will load all components.
+        If you set this to True, you are responsible for whatever happens.
+    :return: The encoder model and the loaded head
+    :raises FileNotFoundError: If the pipeline file does not exist in the folder.
+    :raises ValueError: If an untrusted type is found in the pipeline, and `trust_remote_code` is False.
+    """
+    folder_or_repo_path = Path(folder_or_repo_path)
+    model_filename = _DEFAULT_MODEL_FILENAME
+    if folder_or_repo_path.exists():
+        head_pipeline_path = folder_or_repo_path / model_filename
+        if not head_pipeline_path.exists():
+            raise FileNotFoundError(f"Pipeline file does not exist in {folder_or_repo_path}")
+    else:
+        head_pipeline_path = huggingface_hub.hf_hub_download(
+            folder_or_repo_path.as_posix(), model_filename, token=token
+        )
+
+    model = StaticModel.from_pretrained(folder_or_repo_path, token=token)
+
+    unknown_types = skops.io.get_untrusted_types(file=head_pipeline_path)
+    # If the user does not trust remote code, we should check that the unknown types are trusted.
+    # By default, we trust everything coming from scikit-learn.
+    if not trust_remote_code:
+        for t in unknown_types:
+            if not _DEFAULT_TRUST_PATTERN.match(t):
+                raise ValueError(f"Untrusted type {t}.")
+    head = skops.io.load(head_pipeline_path, trusted=unknown_types)
+
+    return model, head
+
+
+def save_pipeline(pipeline: StaticModelPipeline, folder_path: str | Path) -> None:
+    """
+    Save a pipeline to a folder.
+
+    :param pipeline: The pipeline to save.
+    :param folder_path: The path to the folder to save the pipeline to.
+    """
+    folder_path = Path(folder_path)
+    folder_path.mkdir(parents=True, exist_ok=True)
+    model_filename = _DEFAULT_MODEL_FILENAME
+    head_pipeline_path = folder_path / model_filename
+    skops.io.dump(pipeline.head, head_pipeline_path)
+    pipeline.model.save_pretrained(folder_path)
+    base_model_name = pipeline.model.base_model_name
+    if isinstance(base_model_name, list) and base_model_name:
+        name = base_model_name[0]
+    elif isinstance(base_model_name, str):
+        name = base_model_name
+    else:
+        name = "unknown"
+    _create_model_card(
+        folder_path,
+        base_model_name=name,
+        language=pipeline.model.language,
+        template_path="modelcards/classifier_template.md",
+    )
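
Review note: when `trust_remote_code` is False, the loader accepts only types whose qualified name matches `sklearn\..+`. The check can be sketched in isolation as below. `find_untrusted` is a hypothetical helper name, and the type names are illustrative inputs rather than real output from `skops.io.get_untrusted_types`.

```python
from __future__ import annotations

import re

# Same pattern as _DEFAULT_TRUST_PATTERN in the diff above.
TRUST_PATTERN = re.compile(r"sklearn\..+")


def find_untrusted(type_names: list[str]) -> list[str]:
    """Return the type names that do not start with the `sklearn.` prefix."""
    # re.match anchors at the start of the string, so a name that merely
    # contains "sklearn." somewhere later is still rejected.
    return [name for name in type_names if not TRUST_PATTERN.match(name)]


candidates = [
    "sklearn.linear_model._logistic.LogisticRegression",
    "mypackage.heads.CustomHead",
    "notsklearn.Fake",
]
untrusted = find_untrusted(candidates)
print(untrusted)
```

Since the pattern requires the literal `sklearn.` prefix followed by at least one character, the check works as an allowlist on module origin. Anything outside scikit-learn needs an explicit `trust_remote_code=True`.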

model2vec/model.py

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ def normalize(self) -> bool:
     @normalize.setter
     def normalize(self, value: bool) -> None:
         """Update the config if the value of normalize changes."""
-        config_normalize = self.config.get("normalize", False)
+        config_normalize = self.config.get("normalize")
         self._normalize = value
         if config_normalize is not None and value != config_normalize:
             logger.warning(
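
Review note: this one-line change matters because `dict.get("normalize", False)` never returns `None`, so the `is not None` guard could not tell a missing key from an explicit `False`. A small sketch of the difference follows; `should_warn` is a hypothetical stand-in for the setter's condition, not code from the repository.

```python
from __future__ import annotations


def should_warn(config: dict, value: bool, default_to_false: bool) -> bool:
    """Mirror the setter's condition under the old and the new lookup.

    Hypothetical stand-in for the check in the `normalize` setter.
    """
    if default_to_false:
        config_normalize = config.get("normalize", False)  # old lookup
    else:
        config_normalize = config.get("normalize")  # new lookup
    return config_normalize is not None and value != config_normalize


empty_config: dict = {}

# Old lookup: a missing key is treated as False, so setting True warns.
old_result = should_warn(empty_config, value=True, default_to_false=True)

# New lookup: a missing key yields None, so the guard suppresses the warning.
new_result = should_warn(empty_config, value=True, default_to_false=False)

# An explicit config value still triggers the warning when it differs.
explicit_result = should_warn({"normalize": False}, value=True, default_to_false=False)
```

With the new lookup, the warning fires only when the config states a value and the caller overrides it.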

model2vec/modelcards/classifier_template.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+---
+{{ card_data }}
+---
+
+# {{ model_name }} Model Card
+
+This [Model2Vec](https://github.com/MinishLab/model2vec) model is a fine-tuned version of {% if base_model %}the [{{ base_model }}](https://huggingface.co/{{ base_model }}){% else %}a{% endif %} Model2Vec model. It also includes a classifier head on top.
+
+## Installation
+
+Install model2vec using pip:
+```
+pip install model2vec[inference]
+```
+
+## Usage
+Load this model using the `from_pretrained` method:
+```python
+from model2vec.inference import StaticModelPipeline
+
+# Load a pretrained Model2Vec model
+model = StaticModelPipeline.from_pretrained("{{ model_name }}")
+
+# Predict labels
+predicted = model.predict(["Example sentence"])
+```
+
+## Additional Resources
+
+- [All Model2Vec models on the hub](https://huggingface.co/models?library=model2vec)
+- [Model2Vec Repo](https://github.com/MinishLab/model2vec)
+- [Model2Vec Results](https://github.com/MinishLab/model2vec?tab=readme-ov-file#results)
+- [Model2Vec Tutorials](https://github.com/MinishLab/model2vec/tree/main/tutorials)
+
+## Library Authors
+
+Model2Vec was developed by the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).
+
+## Citation
+
+Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
+```
+@software{minishlab2024model2vec,
+  author = {Stephan Tulkens and Thomas van Dongen},
+  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
+  year = {2024},
+  url = {https://github.com/MinishLab/model2vec},
+}
+```

model2vec/model_card_template.md renamed to model2vec/modelcards/model_card_template.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@

 # {{ model_name }} Model Card

-This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of {% if base_model %}the [{{ base_model }}](https://huggingface.co/{{ base_model }}){% else %}a{% endif %} Sentence Transformer. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical.
+This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of {% if base_model %}the {{ base_model }}(https://huggingface.co/{{ base_model }}){% else %}a{% endif %} Sentence Transformer. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical.


 ## Installation
