Adding embed_array for getting the embeddings of multiple strings #686

Merged: 15 commits, Jun 5, 2023

Conversation

@jsaied99 (Contributor) commented Jun 5, 2023

Works the same way as pgml.embed but can take an array of inputs. Example:

SELECT pgml.embed(
    transformer => 'intfloat/e5-small', 
    inputs => ARRAY['Hello', 'World'],
    kwargs => '{"device": "cpu"}'::JSONB
);
SELECT pgml.embed(
    transformer => 'hkunlp/instructor-base', 
    inputs => ARRAY['Hello World', 'I love Rust'],
    kwargs => '{"device": "cpu", "instruction": "Represent the content for retrieving supporting documents:"}'::JSONB
);

For instructor models, I'm passing the same instruction to every input. We could potentially make inputs a JSON array in which each individual input carries its own instruction.

@jsaied99 jsaied99 marked this pull request as ready for review June 5, 2023 20:20

try:
    inputs = json.loads(inputs)
except json.decoder.JSONDecodeError:
Contributor:
What is the known case that this is handling?

Contributor Author:
Without passing any extra arguments, I couldn't think of a way to tell whether inputs was a plain string or a JSON string. I thought the simplest approach was to try converting it into a Python object.
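A minimal sketch of the detection approach described above, assuming the Python side receives `inputs` as a single string that is either plain text or a JSON-encoded array. The helper name `parse_inputs` is hypothetical and is not the code from this PR.

```python
import json


def parse_inputs(inputs):
    """Return a list of strings from a plain string or a JSON-encoded array.

    Hypothetical helper for illustration only.
    """
    try:
        parsed = json.loads(inputs)
    except json.decoder.JSONDecodeError:
        # Not valid JSON, so treat the value as one plain-text input.
        return [inputs]
    if isinstance(parsed, list):
        return [str(item) for item in parsed]
    # Valid JSON but not an array (for example "123" or "true"):
    # keep the original text as a single input.
    return [inputs]
```

The weakness of this approach is that a plain-text input which happens to be valid JSON, such as the literal text `["a"]`, is indistinguishable from an encoded array, which is presumably why the reviewer asks what case it handles.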


else:
    texts_with_instructions = []
    instruction = kwargs.pop("instruction")
Contributor:
Seems like it may be common to have multiple instructions with multiple inputs? Hmm, in that case I'm not sure we have a nice way to structure the args... we can leave that for some future work.

Contributor Author:
Okay I'll leave it as it is.
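For reference, a rough sketch of how one instruction can be paired with every input, plus one possible shape for the multi-instruction case left as future work. It assumes the instructor model accepts `[instruction, text]` pairs; the function name `pair_with_instructions` and the list-valued `instruction` are hypothetical and not part of this PR.

```python
def pair_with_instructions(inputs, instruction):
    """Build [instruction, text] pairs for an instructor-style model.

    `instruction` may be a single string (applied to every input, as in
    this PR) or a list with one instruction per input (a hypothetical
    extension, not implemented here).
    """
    if isinstance(instruction, str):
        instructions = [instruction] * len(inputs)
    else:
        instructions = list(instruction)
        if len(instructions) != len(inputs):
            raise ValueError("need exactly one instruction per input")
    return [[ins, text] for ins, text in zip(instructions, inputs)]
```

With a single string this reproduces the current behavior; a list of instructions would need a matching change to how `kwargs` is structured on the SQL side, which is the open question in this thread.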

result = model.encode(text, **kwargs)
if instructor:
result = model.encode(inputs, **kwargs)
if instructor and len(result) == 1:
result = result[0]
Contributor:
I think we also need to handle the multi instructor case, right? Or does instructor always just return an appropriate single dimension array?

    inputs: default!(Vec<String>, "ARRAY[]::TEXT[]"),
    kwargs: default!(JsonB, "'{}'"),
) -> Vec<Vec<f32>> {
    crate::bindings::transformers::embed_batch(transformer, &inputs, &kwargs.0)
Contributor:
I would make crate::bindings::transformers::embed take a list of inputs always, and modify pub fn embed to pass a slice in, similar to generate and generate_batch.

Contributor Author:
Sure, I'll change this. Makes sense.

Contributor Author:
I guess if we always pass an input array, I could get rid of the try/except, since inputs would always be an array of strings.
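The suggested shape, sketched in Python under stated assumptions: the batch function always takes a list, and the single-input entry point wraps its argument in a one-element list and unwraps the result. The names `embed`, `embed_batch`, and `fake_encode` here are illustrative stand-ins, not the bindings from this PR, and `fake_encode` only imitates a model's `encode` call.

```python
def embed_batch(encode, inputs, **kwargs):
    """Always takes a list of strings and returns one vector per input."""
    return [list(vector) for vector in encode(inputs, **kwargs)]


def embed(encode, text, **kwargs):
    """Single-input wrapper: pass a one-element list, return its only vector."""
    return embed_batch(encode, [text], **kwargs)[0]


def fake_encode(texts, **kwargs):
    # Stand-in for a real model: a 2-dimensional vector based on text length.
    return [[float(len(text)), 0.0] for text in texts]
```

Because the inner function only ever sees a list of strings, no JSON sniffing is needed, and the unwrapping of a single result happens in exactly one place.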

@jsaied99 jsaied99 requested a review from montanalow June 5, 2023 21:26
@montanalow montanalow merged commit 96dd570 into postgresml:master Jun 5, 2023