pgml Python SDK with vector search support #636

santiatpml · 2023-05-19T16:03:53Z

The objective of this SDK is to provide an easy interface to PostgresML's generative AI capabilities. This version supports vector search using multiple models and text splitters.

Quick start instructions are here

levkk · 2023-05-19T17:46:36Z

pgml-sdks/python/pgml/pgml/collection.py

+        run_create_or_insert_statement(conn, create_schema_statement)
+        create_table_statement = (
+            "CREATE TABLE IF NOT EXISTS %s (\
+                                id          serial8 PRIMARY KEY,\


Suggested change

id serial8 PRIMARY KEY,\

id bigserial PRIMARY KEY,\

More idiomatic, but doesn't matter.

levkk · 2023-05-19T17:48:04Z

pgml-sdks/python/pgml/pgml/collection.py

+                                document    uuid NOT NULL,\
+                                metadata    jsonb NOT NULL DEFAULT '{}',\
+                                text        text NOT NULL,\
+                                UNIQUE (document)\


Duplicate primary key technically, since this is unique. Curious why we can't use the id field as the document identifier?

levkk · 2023-05-19T17:49:12Z

pgml-sdks/python/pgml/pgml/collection.py

+        run_create_or_insert_statement(conn, create_index_statement, autocommit=True)
+
+        create_index_statement = (
+            "CREATE INDEX CONCURRENTLY IF NOT EXISTS \


Don't need to create an index here on document, UNIQUE does it already automatically. So I think you end up with two indexes on the same field.
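
A small sketch of the duplicate-index point, using Python's built-in sqlite3 as a stand-in (an assumption: the PR targets Postgres, and the table and index names here are hypothetical). SQLite, like Postgres, backs a UNIQUE constraint with an index automatically, so an explicit index on the same column becomes a second one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " id INTEGER PRIMARY KEY,"
    " document TEXT NOT NULL,"
    " UNIQUE (document))"
)

def index_names(connection, table):
    # PRAGMA index_list returns one row per index; column 1 is the index name.
    return [row[1] for row in connection.execute(f"PRAGMA index_list({table})")]

# Only the index created implicitly by the UNIQUE constraint exists so far.
before = index_names(conn, "documents")

# The explicit index duplicates the one the UNIQUE constraint already created.
conn.execute("CREATE INDEX IF NOT EXISTS documents_document_idx ON documents (document)")
after = index_names(conn, "documents")
```

Comparing `before` and `after` shows the table going from one index on `document` to two, which doubles the index maintenance on every write to that column.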

levkk · 2023-05-19T17:51:57Z

pgml-sdks/python/pgml/pgml/collection.py

+        )
+        run_create_or_insert_statement(conn, create_statement)
+
+        index_statement = (


UNIQUE (task, splitter, model) in the table definition creates a compound index on those three columns. Having 3 additional individual indexes on the same columns may not be necessary.

levkk · 2023-05-19T17:53:54Z

pgml-sdks/python/pgml/pgml/collection.py

+                            created_at  timestamptz NOT NULL DEFAULT now(), \
+                            task        text NOT NULL, \
+                            splitter    int8 NOT NULL REFERENCES %s\
+                              ON DELETE CASCADE\


Are you sure you want to CASCADE? This automatically deletes rows from this table when the row they reference in another table is deleted, which can delete a lot of data accidentally. I generally prefer ON DELETE RESTRICT, or the default ON DELETE NO ACTION, which behaves almost identically. That way, you'll get an error when attempting to delete a splitter that's referenced by this table. Since raising an error is already the default behavior, you can just remove ON DELETE CASCADE.
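
To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Postgres (an assumption; the `splitters` and `chunks` tables and the `delete_splitter` helper are hypothetical). It deletes a referenced splitter row under each policy and reports whether the delete was blocked and how many referencing rows survive.

```python
import sqlite3

def delete_splitter(on_delete):
    """Delete a referenced splitter row and report what happened to the chunks."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.execute("CREATE TABLE splitters (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE chunks ("
        " id INTEGER PRIMARY KEY,"
        f" splitter_id INTEGER NOT NULL REFERENCES splitters (id) ON DELETE {on_delete})"
    )
    conn.execute("INSERT INTO splitters (id) VALUES (1)")
    conn.executemany("INSERT INTO chunks (splitter_id) VALUES (?)", [(1,), (1,), (1,)])
    try:
        conn.execute("DELETE FROM splitters WHERE id = 1")
        blocked = False
    except sqlite3.IntegrityError:
        # RESTRICT refuses the delete while referencing rows exist.
        blocked = True
    remaining = conn.execute("SELECT count(*) FROM chunks").fetchone()[0]
    conn.close()
    return blocked, remaining

cascade_result = delete_splitter("CASCADE")    # delete succeeds, chunks are gone
restrict_result = delete_splitter("RESTRICT")  # delete is refused, chunks remain
```

With CASCADE a single delete silently removes all three chunk rows; with RESTRICT the delete fails and nothing is lost.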

levkk · 2023-05-19T17:57:34Z

pgml-sdks/python/pgml/pgml/collection.py

+        self.transforms_table = self.name + ".transforms"
+        create_statement = (
+            "CREATE TABLE IF NOT EXISTS %s (\
+                            oid         regclass PRIMARY KEY,\


Odd way of referencing another table, I've never seen that done before. Is this compatible with logical replication? I.e. if we want to move this data to another database, the oids probably won't match anymore, since they are specific to a Postgres installation.

This should probably be called table_name. An oid is something very Postgres-specific and does not mean a table reference at all. In fact, it can reference a row in a TOAST table or a type in pg_type.

levkk · 2023-05-19T18:01:00Z

pgml-sdks/python/pgml/pgml/collection.py

+                "CREATE TABLE IF NOT EXISTS %s ( \
+                                id          serial8 PRIMARY KEY,\
+                                created_at  timestamptz NOT NULL DEFAULT now(),\
+                                chunk       int8 NOT NULL REFERENCES %s\


The convention is sometimes to use field_id to tell the reader that this column refers to the primary key (usually id) of another table. A reviewer would then check whether the referenced column has an index on it, which is often required for foreign keys: foreign key validation should be an index scan, the quickest way to confirm that the value exists in the other table.

_id is not just a railsism, it's a popular and significant readability improvement to easily distinguish local fields from foreign keys, which isn't really Hungarian notation. It's part of what-this-field-represents, rather than what-data-type-is-this-field.

It's more similar to the convention many programming languages use by adding * or & to distinguish references and pointers from local objects on the stack.

A foreign key can be polymorphic, in which case, the naming convention suddenly becomes important. Sure, one can't really enforce naming conventions programmatically, but there is no reason to make something harder to understand.

For example, created_at can be renamed to random_date and only by looking at the definition of the column one would realize that this is actually a creation timestamp, but we don't do that, we call it created_at because we want to be kind to our future selves and to reviewers and other users of our code. An argument can be made for calling variables abcd and only by reading the code fully, one will truly understand what they do, but we don't do that either.

Rails can be many things that we don't agree with, but the way ActiveRecord handles data is fantastic, and we would be wise to learn from their experience and do something most engineers are familiar and happy to work with.

And you will commonly see code that is forced to name two variables in the same context something like document_ptr and document, to distinguish the pointer from the one that has been safely dereferenced. No one would assume that the compiler checks _ptr correctness, but if they saw a variable with that suffix next to a variable without it, and what the variables actually held disagreed with the convention, people would say the variable was misnamed and likely to cause bugs or confusion.

We write code for others to read. Code should consistently follow the conventions established in the project. The convention has already been established, and there is no need to revisit that convention, when the suggestion obviously confused multiple team members in multiple places due to not just inconsistency with the greater project scope, but inconsistency within the necessary layers to achieve a solution, and also inconsistency within a single table.

Natural joins (the implicit form of USING) are generally useless, because they join on all columns in common, not just the key. In this case created_at would be shared but would not be the same, so natural joins won't work.

Note USING is reasonably safe from column changes in the joined relations since only the listed columns are combined. NATURAL is considerably more risky since any schema changes to either relation that cause a new matching column name to be present will cause the join to combine that new column as well.
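
A short sketch of that difference, run against Python's built-in sqlite3 rather than Postgres (an assumption; SQLite resolves NATURAL and USING the same way for this case, and the two tables are hypothetical). Both tables share `document_id` and `created_at`, but only `document_id` is a real key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (document_id INTEGER PRIMARY KEY, created_at TEXT NOT NULL);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL REFERENCES documents (document_id),
        created_at TEXT NOT NULL
    );
    INSERT INTO documents VALUES (1, '2023-05-19T16:00:00Z');
    INSERT INTO chunks VALUES (10, 1, '2023-05-19T16:05:00Z');
    """
)

# USING joins only on the listed column.
using_rows = conn.execute(
    "SELECT chunks.id FROM chunks JOIN documents USING (document_id)"
).fetchall()

# NATURAL joins on every shared column name: document_id AND created_at.
natural_rows = conn.execute(
    "SELECT chunks.id FROM chunks NATURAL JOIN documents"
).fetchall()
```

The USING join finds the chunk, while the NATURAL join returns nothing because the two `created_at` values differ.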

If you're still not convinced, there is the user_uuid in your example, which is yet another instance.

But similarly, if you really want to use USING, and not natural joins, it works just as well to call them both document_id, and you can keep that feature of SQL.

Calling them both document_id would also be a break with Rails conventions, though...

You seem confused about the importance of Rails conventions, and to be missing the point that this schema is not an island unto itself, nor is this SDK. The important aspect is that a database and schema in this context are useless without an application layer of logic on top. That application layer will have ORM concerns that will be simpler to maintain if the database follows a convention of using _id suffixes. The reason you see the pattern replicated so widely is that it so frequently eases the mental burden of application developers. DBAs or people working only inside the database may not understand these concepts or concerns, but that doesn't mean they don't exist.

levkk

The tables feel a bit overindexed. In Postgres, each index carries a write penalty: for each row that's inserted into the table, each index needs to be updated accordingly. When all (or most) columns are indexed, an update to any column requires an update to all indexes (i.e. Postgres can't do HOT, heap-only tuple, updates).

I think I understand the schema overall, although it would be helpful to define it in a separate .sql file which can then be executed when the SDK is used for the first time. Although, we do generate a lot of tables on the fly, so understandably that's not possible for all use cases.

Overall, I think this is great to start with, and might require some optimizations as it's deployed at scale.

levkk · 2023-05-19T19:45:27Z

It is much better to improve performance later by removing indexes than to try to improve it by adding them...I'm not sure how this kind of thinking about indexes got started.

Experience. Removing an index is dangerous, adding an index is safe.

montanalow · 2023-05-19T20:42:25Z

pgml-sdks/python/pgml/pgml/collection.py

+                log.info("id key is not present.. hashing")
+                document_id = hashlib.md5(text.encode("utf-8")).hexdigest()
+            metadata = document
+            delete_statement = "DELETE FROM %s WHERE document = %s" % (


This would be more clear to me if the field name was document_id to match the variable name. I was half expecting document to be a text value.

Right, so using one convention for names at one layer and a different convention for names in the next layer down seems confusing; there is good reason to add _id in both layers.

montanalow · 2023-05-19T20:43:24Z

pgml-sdks/python/pgml/pgml/collection.py

+            chunks = text_splitter.create_documents([text])
+            for chunk_id, chunk in enumerate(chunks):
+                insert_statement = (
+                    "INSERT INTO %s (document,splitter,chunk_id, chunk) VALUES (%s, %s, %s, %s);"


Missing _id suffixes make this hard for me to read.

Yes, chunk_id is the chunk index for the given document text. For example: if the text "hello world" is split into two chunks then chunk_id 0 will map to "hello" and chunk_id 1 will map to "world". id is the global id of the chunk across all documents and splitters.

Yes, this is a perfect example of needing both a local and a reference to the same "object" in a single context, and being forced to disambiguate one of them with an _id suffix. The confusing ones in this line are document and splitter. chunk and chunk_id are clear and easily distinguished; even though chunk_id does not in fact refer to a foreign key in a chunks table, it's clearly marked as a reference.

chunk_id is the ordering of the chunk within the document. chunk_index sounds good.
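
The "hello world" example from this thread can be sketched as follows. This is only an illustration: the whitespace splitter and the `chunk_rows` helper are hypothetical stand-ins, not the SDK's text splitter.

```python
def split_into_chunks(text):
    # Stand-in splitter: one chunk per whitespace-separated word.
    return text.split()

def chunk_rows(document_id, text):
    """Return (document_id, chunk_index, chunk) tuples ready for insertion."""
    return [
        (document_id, chunk_index, chunk)
        for chunk_index, chunk in enumerate(split_into_chunks(text))
    ]

# chunk_index is the position within one document, not a global row id.
rows = chunk_rows("doc-1", "hello world")
```

Here chunk_index 0 maps to "hello" and chunk_index 1 maps to "world", and the numbering restarts at 0 for every document.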

montanalow · 2023-05-19T20:47:48Z

pgml-sdks/python/pgml/pgml/collection.py

+                "CREATE TABLE IF NOT EXISTS %s ( \
+                                id          serial8 PRIMARY KEY,\
+                                created_at  timestamptz NOT NULL DEFAULT now(),\
+                                chunk       int8 NOT NULL REFERENCES %s\


_id is not just a railsism, it's a popular and significant readability improvement to easily distinguish local fields from foreign keys, which isn't really Hungarian notation. It's part of what-this-field-represents, rather than what-data-type-is-this-field.

It's more similar to the convention many programming languages use by adding * or & to distinguish references and pointers from local objects on the stack.

montanalow · 2023-05-19T20:51:16Z

pgml-sdks/python/pgml/pgml/collection.py

+        model_params = results[0]["parameters"]
+
+        # get all chunks that don't have embeddings
+        embeddings_statement = (


This needs to be refactored into the insert statement to avoid round tripping the vector.

montanalow · 2023-05-19T20:51:48Z

pgml-sdks/python/pgml/pgml/collection.py

+        embeddings_table = self._create_or_get_embeddings_table(
+            conn, model_id=model_id, splitter_id=splitter_id
+        )
+        select_statement = "SELECT name, parameters FROM %s WHERE id = %d;" % (


I would include this with a CTE in the embeddings statement as well to avoid the round trip.

montanalow · 2023-05-19T20:54:22Z

pgml-sdks/python/pgml/pgml/collection.py

+        results = run_select_statement(conn, select_statement)
+
+        model = results[0]["name"]
+        query_embeddings = self._get_embeddings(


A query builder can be used for the CTE here.

montanalow · 2023-05-19T20:55:31Z

pgml-sdks/python/pgml/pgml/collection.py

+        query_embeddings = self._get_embeddings(
+            conn, query, model_name=model, parameters=query_parameters
+        )
+        embeddings_table = self._create_or_get_embeddings_table(


This needs to be cached somehow to avoid multiple round trips to the db for this function.
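
One possible shape for such a cache, as a rough sketch. The `EmbeddingsTableCache` class, the `fake_lookup` function and the table naming scheme are all hypothetical; the real lookup would be the database call made by `_create_or_get_embeddings_table`.

```python
class EmbeddingsTableCache:
    """Remember embeddings table names per (model_id, splitter_id) pair."""

    def __init__(self, lookup):
        self._lookup = lookup  # callable that performs the database round trip
        self._names = {}

    def get(self, model_id, splitter_id):
        key = (model_id, splitter_id)
        if key not in self._names:
            # Only the first request for a pair reaches the database.
            self._names[key] = self._lookup(model_id, splitter_id)
        return self._names[key]

round_trips = []

def fake_lookup(model_id, splitter_id):
    # Placeholder for the real query; records each simulated round trip.
    round_trips.append((model_id, splitter_id))
    return f"embeddings_{model_id}_{splitter_id}"

cache = EmbeddingsTableCache(fake_lookup)
first = cache.get(1, 2)
second = cache.get(1, 2)
```

The second call is answered from memory, so repeated vector searches with the same model and splitter cost one lookup instead of one per search. A real version would also need to invalidate entries when a collection is archived.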

montanalow · 2023-05-19T20:56:09Z

pgml-sdks/python/pgml/pgml/collection.py

+        )
+
+        select_statement = (
+            "SELECT chunk, 1 - (%s.embedding <=> %s::float8[]::vector) AS score FROM %s ORDER BY score DESC LIMIT %d;"


This needs to be combined with the per-result queries into a single statement to avoid N+1 queries.

montanalow · 2023-05-19T20:56:45Z

pgml-sdks/python/pgml/pgml/collection.py

+        for result in results:
+            _out = {}
+            _out["score"] = result["score"]
+            select_statement = "SELECT chunk, document FROM %s WHERE id = %d" % (


Prefer joins in SQL to round trips from the app.
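
A minimal sketch of the join-based version, using Python's built-in sqlite3 as a stand-in for Postgres (an assumption). The `scores` table is a hypothetical placeholder for the similarity computation, and the column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id TEXT NOT NULL, chunk TEXT NOT NULL);
    CREATE TABLE scores (chunk_id INTEGER NOT NULL REFERENCES chunks (id), score REAL NOT NULL);
    INSERT INTO chunks VALUES (1, 'doc-a', 'hello'), (2, 'doc-a', 'world'), (3, 'doc-b', 'other');
    INSERT INTO scores VALUES (1, 0.9), (2, 0.4), (3, 0.7);
    """
)

# One statement returns score, chunk and document together, instead of one
# follow-up SELECT per search result.
results = conn.execute(
    """
    SELECT scores.score, chunks.chunk, chunks.document_id
    FROM scores
    JOIN chunks ON chunks.id = scores.chunk_id
    ORDER BY scores.score DESC
    LIMIT 2
    """
).fetchall()
```

The application makes a single round trip regardless of the LIMIT, where the loop in the diff makes one extra query per result.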

montanalow · 2023-05-19T20:57:43Z

pgml-sdks/python/pgml/pgml/database.py

+        self.pool.putconn(conn)
+        return Collection(self.pool, name)
+
+    def delete_collection(self, name: str) -> None:


Hard delete is a bit of a footgun.

Yeah, we could have some archiving mechanism, which simply alters the schema to name + '_archive', and then enable delete or restore on archives.

Adding more storage space is cheap. Recovering lost data is expensive and scary.

Also, we should support "undo".

that's what I meant by "restore"

That does not guarantee people aren't still using the table if they are ignoring the flag by using poorly written homegrown queries, which is a potentially desirable outcome for projects that outgrow simple SDK designs and want to leverage the full expressive power of SQL.
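
The archiving idea raised in this thread (renaming the schema to name + '_archive') could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the functions only build the SQL text, and executing it against Postgres is left to the caller.

```python
def quote_ident(name):
    # Double any embedded quotes, then wrap, per standard SQL identifier quoting.
    return '"' + name.replace('"', '""') + '"'

def archive_statement(collection_name):
    """Build a statement that renames the collection's schema instead of dropping it."""
    archived = collection_name + "_archive"
    return f"ALTER SCHEMA {quote_ident(collection_name)} RENAME TO {quote_ident(archived)};"

def restore_statement(collection_name):
    """Build the inverse rename, which gives users an "undo"."""
    archived = collection_name + "_archive"
    return f"ALTER SCHEMA {quote_ident(archived)} RENAME TO {quote_ident(collection_name)};"

statement = archive_statement("my_docs")
```

Because the rename changes the schema name itself, queries that still point at the old name fail loudly rather than silently reading archived data, and no rows are destroyed until someone explicitly drops the archive.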

montanalow

🥳

santiadavani and others added 13 commits May 12, 2023 16:25

Python SDK init

89c1223

create collection init

53dde0f

Upsert documents + tests

2d685e8

Creating more tables as part of collection ..

86ccef8

Register models and text splitters

5b5cee4

Refactored run select and added models

7b03e01

Embeddings and vector search

2d9202e

Incremental updates for chunks and embeddings

ea19ecc

Docstrings for all modules

dee6e5b

Minor updates

b7a0495

Added basic readme with quickstart

5186705

Updated readme with PGML_CONNECTION

5c8cf62

Updated readme

5a81918

santiatpml requested review from solidsnack, montanalow and levkk May 19, 2023 16:04

solidsnack approved these changes May 19, 2023

levkk reviewed May 19, 2023

levkk approved these changes May 19, 2023

Minor API and notebook updates

a5d1618

montanalow requested changes May 19, 2023

solidsnack approved these changes May 19, 2023

santiatpml and others added 4 commits May 22, 2023 10:51

Using document_id, chunk_id etc. for column names

368da8a

Renaming model -> model_id and splitter -> splitter_id

986b314

Performance improvements

4601edc

delete collection is replaced with archive collection

988ea41

santiatpml requested a review from montanalow May 23, 2023 18:53

santiadavani added 2 commits May 23, 2023 12:23

Support for uuids without dashes

97ec30b

Refactored upsert documents

4793563

montanalow approved these changes May 23, 2023

Merge branch 'master' into santi-pgml-memory-sdk-python

998c996

santiatpml merged commit cac1a6a into master May 23, 2023

santiatpml deleted the santi-pgml-memory-sdk-python branch May 23, 2023 22:42

SilasMarvin pushed a commit that referenced this pull request Oct 5, 2023

pgml Python SDK with vector search support (#636)

b535000
