-
Notifications
You must be signed in to change notification settings - Fork 332
Adding embed_array for getting the embeddings of multiple strings #686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
try: | ||
inputs = json.loads(inputs) | ||
except json.decoder.JSONDecodeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the known case that this is handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without passing any extra arguments, I couldn't think of a way of knowing wether inputs was a string or a JSON string. Thought the simplest way was trying to convert it into a python object.
|
||
else: | ||
texts_with_instructions = [] | ||
instruction = kwargs.pop("instruction") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it may be common to have multiple instructions with multiple inputs? Hmm, in that case I'm not sure we have a nice way to structure the args... we can leave that for some future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I'll leave it as it is.
result = model.encode(text, **kwargs) | ||
if instructor: | ||
result = model.encode(inputs, **kwargs) | ||
if instructor and len(result) == 1: | ||
result = result[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to handle the multi instructor case, right? Or does instructor always just return an appropriate single dimension array?
pgml-extension/src/api.rs
Outdated
inputs: default!(Vec<String>, "ARRAY[]::TEXT[]"), | ||
kwargs: default!(JsonB, "'{}'"), | ||
) -> Vec<Vec<f32>> { | ||
crate::bindings::transformers::embed_batch(transformer, &inputs, &kwargs.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make crate::bindings::transformers::embed
take a list of inputs always, and modify pub fn embed
to pass a slice in, similar to generate
and generate_batch
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll change this. Makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if we do pass an input array always, I could get rid of the try catch since it's always an array of strings.
Works the same way as pgml.embed but can take an array of inputs. Example:
For instructor, I'm passing the same instruction to each input, we could potentially have a inputs array that is a json and each individual input.