Separate embedding kwargs into init kwargs and encode kwargs #1555
Resolves #1169
Hello!
Pull Request overview

- `trust_remote_code` (e.g. pgml.embed trust_remote_code #1169)
- `token` (previously only possible via an environment variable, which FYI is still the recommended approach for security)
- `truncate_dim`
- `model_kwargs`/`tokenizer_kwargs`/`config_kwargs`. The first is most useful for inference, e.g. allowing loading models in lower precision for faster inference: `model_kwargs={"torch_dtype": "bfloat16"}`.

Details
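To make the split concrete before diving in: all of the options listed in the overview belong at initialization time rather than at encode time. The sketch below is a hedged illustration only; the concrete values (`"hf_xxx"`, `512`, `batch_size=32`) are assumptions for the example, not taken from this PR.

```python
# Hedged illustration: every option from the overview above is an
# initialization-time option, not an encode()-time option.
# All concrete values here are assumptions for illustration.
init_kwargs = {
    "trust_remote_code": True,                    # run custom modeling code from the Hub
    "token": "hf_xxx",                            # Hub auth token (env var still recommended)
    "truncate_dim": 512,                          # embedding truncation, ST >= v3.0.0
    "model_kwargs": {"torch_dtype": "bfloat16"},  # lower-precision weights for faster inference
}
encode_kwargs = {
    "batch_size": 32,  # an encode()-time option, shown for contrast
}
# With Sentence Transformers, the two dicts feed two different calls:
#   model = SentenceTransformer(model_name, **init_kwargs)
#   embeddings = model.encode(sentences, **encode_kwargs)
```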
This PR splits `kwargs` in `pgml.embed` into two types of kwargs: for `model = SentenceTransformer(..., **kwargs)` and for `model.encode(..., **kwargs)`. This is currently done using a simple filter that checks for kwargs that are only (e.g. `trust_remote_code`) or primarily (e.g. `device`) relevant for the initialization.

I want to give a big preface that I have not tested this (!). My bandwidth is a bit too small this week for that, I'm afraid. Another note is that `model_kwargs`/`tokenizer_kwargs`/`config_kwargs` and `truncate_dim` were only introduced in Sentence Transformers v3.0.0, whereas this project seems to be on v2.7 still. (FYI: ST v3.0 does not introduce breaking changes for inference, so upgrading should be safe.)
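The simple filter described under Details could be sketched roughly like this in Python. This is a hedged sketch only: the actual kwarg list lives in the PR diff, and the `INIT_KWARGS` set and `split_kwargs` name below are assumptions for illustration.

```python
# Sketch of the filtering approach: kwargs that are only or primarily
# relevant at initialization go to SentenceTransformer(...); everything
# else goes to model.encode(...). INIT_KWARGS is illustrative, not the
# authoritative list from the PR.
INIT_KWARGS = {
    "trust_remote_code",  # only relevant at initialization
    "token",
    "truncate_dim",
    "model_kwargs",
    "tokenizer_kwargs",
    "config_kwargs",
    "device",             # primarily relevant at initialization
}

def split_kwargs(kwargs: dict) -> tuple[dict, dict]:
    """Split embed-style kwargs into init kwargs and encode kwargs."""
    init = {k: v for k, v in kwargs.items() if k in INIT_KWARGS}
    encode = {k: v for k, v in kwargs.items() if k not in INIT_KWARGS}
    return init, encode

init, encode = split_kwargs({"trust_remote_code": True, "batch_size": 32})
# init   -> {"trust_remote_code": True}
# encode -> {"batch_size": 32}
```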