diff --git a/pgml-cms/docs/README.md b/pgml-cms/docs/README.md
index a698c121a..8c4d7edb5 100644
--- a/pgml-cms/docs/README.md
+++ b/pgml-cms/docs/README.md
@@ -4,27 +4,27 @@ description: The key concepts that make up PostgresML.
# Overview
-PostgresML is a complete MLOps platform built on PostgreSQL.
+PostgresML is a complete MLOps platform built on PostgreSQL.
> _Move the models to the database_, _rather than continuously moving the data to the models._
-The data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable and reliable to move the models to the database, rather than continuously moving the data to the models\_.\_ PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities and goals:
+The data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable and reliable to move the models to the database, rather than continuously moving the data to the models. PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities and goals:
* **Model Serving** - _**GPU accelerated**_ inference engine for interactive applications, with no additional networking latency or reliability costs.
* **Model Store** - Download _**open-source**_ models including state of the art LLMs from HuggingFace, and track changes in performance between versions.
* **Model Training** - Train models with _**your application data**_ using more than 50 algorithms for regression, classification or clustering tasks. Fine tune pre-trained models like LLaMA and BERT to improve performance.
-* **Feature Store** - _**Scalable**_ access to model inputs, including vector, text, categorical, and numeric data. Vector database, text search, knowledge graph and application data all in one _**low-latency**_ system.
+* **Feature Store** - _**Scalable**_ access to model inputs, including vector, text, categorical, and numeric data. Vector database, text search, knowledge graph and application data all in one _**low-latency**_ system.
PostgresML handles all of the functions typically performed by a cacophony of services, as described by a16z.

*Figure: A PostgresML deployment at scale*
const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
+
{% endtab %}
{% tab title="Python" %}
```python
-pipeline = Pipeline(
- "test_pipeline",
- {
- "abstract": {
- "semantic_search": {
- "model": "intfloat/e5-small",
- },
- "full_text_search": {"configuration": "english"},
- },
- "body": {
- "splitter": {"model": "recursive_character"},
- "semantic_search": {
- "model": "hkunlp/instructor-base",
- "parameters": {
- "instruction": "Represent the Wikipedia document for retrieval: ",
- },
- },
- },
- },
-)
collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
```
{% endtab %}
{% endtabs %}
-This creates a `Pipeline` that is capable of full text search and semantic search on the `abstract` and semantic search on the `body` of documents.
-
-## **Doing vector search**
+### **Vector search with custom limit**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- },
- limit: 5,
- },
- pipeline,
-);
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- },
- "limit": 5,
- },
- pipeline,
-)
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
```
{% endtab %}
{% endtabs %}
-Let's break this down. `vector_search` takes in a `JSON` object and a `Pipeline`. The `JSON` object currently supports two keys: `query` and `limit` . The `limit` limits how many chunks should be returned, the `query` specifies the actual query to perform. Let's see another more complicated example:
+### **Metadata Filtering**
+
+We provide powerful and flexible, arbitrarily nested metadata filtering based on the [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support every operator listed there except `$nin`.
+
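+For example, since `$nin` is not supported, the same exclusion can be composed from the supported `$and` and `$ne` operators. Below is a minimal sketch (it assumes the same `test_collection` and `test_pipeline` used throughout this section) that emulates `{"uuid": {"$nin": [1, 2, 3]}}`:
+
+```python
+from pgml import Collection, Pipeline
+
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+
+# Keep only documents whose uuid is none of 1, 2 or 3
+results = (
+    await collection.query()
+    .vector_recall("Here is some query", pipeline)
+    .limit(10)
+    .filter({
+        "metadata": {
+            "$and": [
+                {"uuid": {"$ne": 1}},
+                {"uuid": {"$ne": 2}},
+                {"uuid": {"$ne": 3}},
+            ]
+        }
+    })
+    .fetch_all()
+)
+```
+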
+**Vector search with $eq metadata filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const query = "What is the best database?";
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- abstract: {
- query: query,
- full_text_filter: "database"
- },
- body: {
- query: query, parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- },
- limit: 5,
- },
- pipeline,
-);
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ metadata: {
+ uuid: {
+ $eq: 1
+ }
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-query = "What is the best database?"
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "abastract": {
- "query": query,
- "full_text_filter": "database",
- },
- "body": {
- "query": query,
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- },
- "limit": 5,
- },
- pipeline,
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "metadata": {
+ "uuid": {
+ "$eq": 1
+ }
+ }
+ })
+ .fetch_all()
)
-
```
{% endtab %}
{% endtabs %}
-The `query` in this example is slightly more intricate. We are doing vector search over both the `abstract` and `body` keys of our documents. This means our search may return chunks from both the `abstract` and `body` of our documents. We are also filtering out all `abstract` chunks that do not contain the text `"database"` we can do this because we enabled `full_text_search` on the `abstract` key in the `Pipeline` schema. Also note that the model used for embedding the `body` takes parameters, but not the model used for embedding the `abstract`.
-
-## **Filtering**
+The above query would filter out all documents that do not contain a key `uuid` with a value equal to `1`.
-We provide powerful and flexible arbitrarly nested filtering based off of [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support each operator mentioned except the `$nin`.
-
-**Vector search with $eq filtering**
+**Vector search with $gte metadata filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- filter: {
- user_id: {
- $eq: 1
- }
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ metadata: {
+ index: {
+ $gte: 3
}
- },
- limit: 5,
- },
- pipeline,
-);
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- "filter": {"user_id": {"$eq": 1}},
- },
- "limit": 5,
- },
- pipeline,
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "metadata": {
+ "index": {
+ "$gte": 3
+ }
+ }
+ })
+ .fetch_all()
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all chunks from documents that do not contain a key `user_id` equal to `1`.
+The above query would filter out all documents that do not contain a key `index` with a value greater than or equal to `3`.
-**Vector search with $gte filtering**
+**Vector search with $or and $and metadata filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ metadata: {
+ $or: [
+ {
+ $and: [
+ {
+ $eq: {
+ uuid: 1
+ }
+ },
+ {
+ $lt: {
+ index: 100
+ }
+ }
+ ]
},
- },
- filter: {
- user_id: {
- $gte: 1
+ {
+ special: {
+          $ne: true
+ }
}
- }
- },
- limit: 5,
- },
- pipeline,
-);
+ ]
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "metadata": {
+ "$or": [
+ {
+ "$and": [
+ {
+ "$eq": {
+ "uuid": 1
+ }
+ },
+ {
+ "$lt": {
+ "index": 100
+ }
+ }
+ ]
},
- },
- "filter": {"user_id": {"$gte": 1}},
- },
- "limit": 5,
- },
- pipeline,
+ {
+ "special": {
+ "$ne": True
+ }
+ }
+ ]
+ }
+ })
+ .fetch_all()
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all documents that do not contain a key `user_id` with a value greater than or equal to `1`.
+The above query would return only documents that either do not have a key `special` with a value of `True`, or have both a key `uuid` equal to `1` and a key `index` less than `100`.
+
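+To make the boolean logic explicit, here is a plain-Python sketch (illustrative only, not part of the SDK) of the predicate this filter expresses per document:
+
+```python
+def matches(doc: dict) -> bool:
+    # ($eq uuid 1 AND $lt index 100) OR ($ne special True)
+    return (
+        (doc.get("uuid") == 1 and doc.get("index", float("inf")) < 100)
+        or doc.get("special") is not True
+    )
+```
+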
+### **Full Text Filtering**
-**Vector search with $or and $and filtering**
+If full text search is enabled for the associated `Pipeline`, documents can first be filtered by full text search and then recalled by embedding similarity.
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- filter: {
- $or: [
- {
- $and: [
- {
- $eq: {
- user_id: 1
- }
- },
- {
- $lt: {
- user_score: 100
- }
- }
- ]
- },
- {
- special: {
- $ne: true
- }
- }
- ]
- }
- },
- limit: 5,
- },
- pipeline,
-);
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ full_text: {
+ configuration: "english",
+ text: "Match Me"
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- "filter": {
- "$or": [
- {"$and": [{"$eq": {"user_id": 1}}, {"$lt": {"user_score": 100}}]},
- {"special": {"$ne": True}},
- ],
- },
- },
- "limit": 5,
- },
- pipeline,
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "full_text": {
+ "configuration": "english",
+ "text": "Match Me"
+ }
+ })
+ .fetch_all()
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all documents that do not have a key `special` with a value `True` or (have a key `user_id` equal to 1 and a key `user_score` less than 100).
+The above query would first filter out all documents that do not match the full text search criteria, and then perform vector recall on the remaining documents.
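+
+The two filter types can also be combined. The following is a minimal sketch, assuming a single `filter` call accepts `metadata` and `full_text` keys together, that restricts recall to documents matching both the text search and a metadata predicate:
+
+```python
+from pgml import Collection, Pipeline
+
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+
+# Assumed combination: metadata and full text filters in one call
+results = (
+    await collection.query()
+    .vector_recall("Here is some query", pipeline)
+    .limit(10)
+    .filter({
+        "metadata": {"index": {"$gte": 3}},
+        "full_text": {
+            "configuration": "english",
+            "text": "Match Me"
+        }
+    })
+    .fetch_all()
+)
+```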
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
index ed07f8b2c..84ce15b78 100644
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
@@ -1,6 +1,2 @@
# Tutorials
-We have a number of tutorials / examples for our Python and JavaScript SDK. For a full list of examples check out:
-
-* [JavaScript Examples on Github](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml/javascript/examples)
-* [Python Examples on Github](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml/python/examples)
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md
new file mode 100644
index 000000000..78abc3a09
--- /dev/null
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md
@@ -0,0 +1,161 @@
+---
+description: >-
+ JavaScript and Python code snippets for end-to-end question answering.
+---
+# Extractive Question Answering
+
+This tutorial walks through JavaScript and Python code snippets for end-to-end extractive question answering:
+
+## Imports and Setup
+
+The SDK and datasets are imported. Builtins are used in Python for transforming text.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pgml = require("pgml");
+require("dotenv").config();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Model, Splitter, Pipeline, Builtins
+from datasets import load_dataset
+from dotenv import load_dotenv
+```
+{% endtab %}
+{% endtabs %}
+
+## Initialize Collection
+
+A collection is created to hold context passages.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const collection = pgml.newCollection("my_javascript_eqa_collection");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+collection = Collection("squad_collection")
+```
+{% endtab %}
+{% endtabs %}
+
+## Create Pipeline
+
+A pipeline is created and added to the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pipeline = pgml.newPipeline(
+ "my_javascript_eqa_pipeline",
+ pgml.newModel(),
+ pgml.newSplitter(),
+);
+
+await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+model = Model()
+splitter = Splitter()
+pipeline = Pipeline("squadv1", model, splitter)
+await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+## Upsert Documents
+
+Context passages from SQuAD are upserted into the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const documents = [
+ {
+ id: "...",
+ text: "...",
+ }
+];
+
+await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Map SQuAD records to documents using the dataset's id and context fields
+data = load_dataset("squad", split="train")
+
+documents = [
+    {"id": r["id"], "text": r["context"]}
+    for r in data
+]
+
+await collection.upsert_documents(documents)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query for Context
+
+A vector search query retrieves context passages.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const queryResults = await collection
+ .query()
+ .vector_recall(query, pipeline)
+ .fetch_all();
+
+const context = queryResults
+ .map(result => result[1])
+ .join("\n");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+results = (
+    await collection.query()
+    .vector_recall(query, pipeline)
+    .fetch_all()
+)
+
+context = " ".join(result[1] for result in results)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query for Answer
+
+The context is passed to a QA model to extract the answer.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const builtins = pgml.newBuiltins();
+
+const answer = await builtins.transform("question-answering", [
+ JSON.stringify({question, context})
+]);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+builtins = Builtins()
+
+answer = await builtins.transform(
+ "question-answering",
+ [{"question": query, "context": context}]
+)
+```
+{% endtab %}
+{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md
deleted file mode 100644
index 49aa6461b..000000000
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md
+++ /dev/null
@@ -1,231 +0,0 @@
----
-description: Example for Semantic Search
----
-
-# Semantic Search Using Instructor Model
-
-This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. In this tutorial we use [hkunlp/instructor-base](https://huggingface.co/hkunlp/instructor-base), a more advanced embeddings model that takes parameters when doing embedding and recall.
-
-[Link to full JavaScript implementation](../../../../../../pgml-sdks/pgml/javascript/examples/question\_answering.js)
-
-[Link to full Python implementation](../../../../../../pgml-sdks/pgml/python/examples/question\_answering.py)
-
-## Imports and Setup
-
-The SDK is imported and environment variables are loaded.
-
-{% tabs %}
-{% tab title="JavasScript" %}
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-from pgml import Collection, Pipeline
-from datasets import load_dataset
-from time import time
-from dotenv import load_dotenv
-from rich.console import Console
-import asyncio
-```
-{% endtab %}
-{% endtabs %}
-
-## Initialize Collection
-
-A collection object is created to represent the search collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const main = async () => { // Open the main function, we close it at the bottom
- // Initialize the collection
- const collection = pgml.newCollection("qa_collection");
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-async def main(): # Start the main function, we end it after archiving
- load_dotenv()
- console = Console()
-
- # Initialize collection
- collection = Collection("squad_collection")
-```
-{% endtab %}
-{% endtabs %}
-
-## Create Pipeline
-
-A pipeline encapsulating a model and splitter is created and added to the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- // Add a pipeline
- const pipeline = pgml.newPipeline("qa_pipeline", {
- text: {
- splitter: { model: "recursive_character" },
- semantic_search: {
- model: "intfloat/e5-small",
- },
- },
- });
- await collection.add_pipeline(pipeline);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- # Create and add pipeline
- pipeline = Pipeline(
- "squadv1",
- {
- "text": {
- "splitter": {"model": "recursive_character"},
- "semantic_search": {
- "model": "hkunlp/instructor-base",
- "parameters": {
- "instruction": "Represent the Wikipedia document for retrieval: "
- },
- },
- }
- },
- )
- await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-## Upsert Documents
-
-Documents are upserted into the collection and indexed by the pipeline.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- // Upsert documents, these documents are automatically split into chunks and embedded by our pipeline
- const documents = [
- {
- id: "Document One",
- text: "PostgresML is the best tool for machine learning applications!",
- },
- {
- id: "Document Two",
- text: "PostgresML is open source and available to everyone!",
- },
- ];
- await collection.upsert_documents(documents);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- # Prep documents for upserting
- data = load_dataset("squad", split="train")
- data = data.to_pandas()
- data = data.drop_duplicates(subset=["context"])
- documents = [
- {"id": r["id"], "text": r["context"], "title": r["title"]}
- for r in data.to_dict(orient="records")
- ]
-
- # Upsert documents
- await collection.upsert_documents(documents[:200])
-```
-{% endtab %}
-{% endtabs %}
-
-## Query
-
-A vector similarity search query is made on the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- // Perform vector search
- const query = "What is the best tool for building machine learning applications?";
- const queryResults = await collection.vector_search(
- {
- query: {
- fields: {
- text: { query: query }
- }
- }, limit: 1
- }, pipeline);
- console.log(queryResults);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- # Query for answer
- query = "Who won more than 20 grammy awards?"
- console.print("Querying for context ...")
- start = time()
- results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "text": {
- "query": query,
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
- },
- },
- }
- },
- "limit": 5,
- },
- pipeline,
- )
- end = time()
- console.print("\n Results for '%s' " % (query), style="bold")
- console.print(results)
- console.print("Query time = %0.3f" % (end - start))
-```
-{% endtab %}
-{% endtabs %}
-
-## Archive Collection
-
-The collection is archived when finished.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- await collection.archive();
-} // Close the main function
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- await collection.archive()
-# The end of the main function
-```
-{% endtab %}
-{% endtabs %}
-
-## Main
-
-Boilerplate to call main() async function.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```javascript
-main().then(() => console.log("Done!"));
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-if __name__ == "__main__":
- asyncio.run(main())
-```
-{% endtab %}
-{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md
new file mode 100644
index 000000000..697845b55
--- /dev/null
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md
@@ -0,0 +1,127 @@
+---
+description: >-
+ JavaScript and Python code snippets for using instructor models in more advanced search use cases.
+---
+# Semantic Search Using Instructor Model
+
+This tutorial shows how to use instructor models with the `pgml` SDK for more advanced search use cases.
+
+## Imports and Setup
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pgml = require("pgml");
+require("dotenv").config();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Model, Splitter, Pipeline
+from datasets import load_dataset
+from dotenv import load_dotenv
+```
+{% endtab %}
+{% endtabs %}
+
+## Initialize Collection
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const collection = pgml.newCollection("my_javascript_qai_collection");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+collection = Collection("squad_collection_1")
+```
+{% endtab %}
+{% endtabs %}
+
+## Create Pipeline
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const model = pgml.newModel("hkunlp/instructor-base", "pgml", {
+ instruction: "Represent the Wikipedia document for retrieval: ",
+});
+
+const pipeline = pgml.newPipeline(
+ "my_javascript_qai_pipeline",
+ model,
+ pgml.newSplitter(),
+);
+
+await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+model = Model("hkunlp/instructor-base", parameters={
+ "instruction": "Represent the Wikipedia document for retrieval: "
+})
+
+pipeline = Pipeline("squad_instruction", model, Splitter())
+await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+## Upsert Documents
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const documents = [
+  {
+    id: "...",
+    text: "...",
+  },
+];
+
+await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Map SQuAD records to documents using the dataset's id and context fields
+data = load_dataset("squad", split="train")
+
+documents = [
+    {"id": r["id"], "text": r["context"]} for r in data
+]
+
+await collection.upsert_documents(documents)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const queryResults = await collection
+ .query()
+ .vector_recall(query, pipeline, {
+ instruction:
+ "Represent the Wikipedia question for retrieving supporting documents: ",
+ })
+ .fetch_all();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+results = (
+    await collection.query()
+    .vector_recall(query, pipeline, {
+        "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
+    })
+    .fetch_all()
+)
+```
+{% endtab %}
+{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
index 726ef3fa3..89bf07cd8 100644
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
@@ -4,11 +4,7 @@ description: Example for Semantic Search
# Semantic Search
-This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished.
-
-[Link to full JavaScript implementation](../../../../../../pgml-sdks/pgml/javascript/examples/semantic\_search.js)
-
-[Link to full Python implementation](../../../../../../pgml-sdks/pgml/python/examples/semantic\_search.py)
+This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. It upserts sample documents, runs a vector similarity search, and prints the results.
## Imports and Setup
@@ -18,17 +14,16 @@ The SDK is imported and environment variables are loaded.
{% tab title="JavasScript" %}
```js
const pgml = require("pgml");
+
require("dotenv").config();
```
{% endtab %}
{% tab title="Python" %}
```python
-from pgml import Collection, Pipeline
+from pgml import Collection, Model, Splitter, Pipeline
from datasets import load_dataset
-from time import time
from dotenv import load_dotenv
-from rich.console import Console
import asyncio
```
{% endtab %}
@@ -41,20 +36,17 @@ A collection object is created to represent the search collection.
{% tabs %}
{% tab title="JavaScript" %}
```js
-const main = async () => { // Open the main function, we close it at the bottom
- // Initialize the collection
- const collection = pgml.newCollection("semantic_search_collection");
+const main = async () => {
+ const collection = pgml.newCollection("my_javascript_collection");
+}
```
{% endtab %}
{% tab title="Python" %}
```python
-async def main(): # Start the main function, we end it after archiving
+async def main():
load_dotenv()
- console = Console()
-
- # Initialize collection
- collection = Collection("quora_collection")
+ collection = Collection("my_collection")
```
{% endtab %}
{% endtabs %}
@@ -66,32 +58,19 @@ A pipeline encapsulating a model and splitter is created and added to the collec
{% tabs %}
{% tab title="JavaScript" %}
```js
- // Add a pipeline
- const pipeline = pgml.newPipeline("semantic_search_pipeline", {
- text: {
- splitter: { model: "recursive_character" },
- semantic_search: {
- model: "intfloat/e5-small",
- },
- },
- });
- await collection.add_pipeline(pipeline);
+const model = pgml.newModel();
+const splitter = pgml.newSplitter();
+const pipeline = pgml.newPipeline("my_javascript_pipeline", model, splitter);
+await collection.add_pipeline(pipeline);
```
{% endtab %}
{% tab title="Python" %}
```python
- # Create and add pipeline
- pipeline = Pipeline(
- "quorav1",
- {
- "text": {
- "splitter": {"model": "recursive_character"},
- "semantic_search": {"model": "intfloat/e5-small"},
- }
- },
- )
- await collection.add_pipeline(pipeline)
+model = Model()
+splitter = Splitter()
+pipeline = Pipeline("my_pipeline", model, splitter)
+await collection.add_pipeline(pipeline)
```
{% endtab %}
{% endtabs %}
@@ -103,37 +82,29 @@ Documents are upserted into the collection and indexed by the pipeline.
{% tabs %}
{% tab title="JavaScript" %}
```js
- // Upsert documents, these documents are automatically split into chunks and embedded by our pipeline
- const documents = [
- {
- id: "Document One",
- text: "document one contents...",
- },
- {
- id: "Document Two",
- text: "document two contents...",
- },
- ];
- await collection.upsert_documents(documents);
+const documents = [
+ {
+ id: "Document One",
+ text: "...",
+ },
+ {
+ id: "Document Two",
+ text: "...",
+ },
+];
+
+await collection.upsert_documents(documents);
```
{% endtab %}
{% tab title="Python" %}
```python
- # Prep documents for upserting
- dataset = load_dataset("quora", split="train")
- questions = []
- for record in dataset["questions"]:
- questions.extend(record["text"])
-
- # Remove duplicates and add id
- documents = []
- for i, question in enumerate(list(set(questions))):
- if question:
- documents.append({"id": i, "text": question})
-
- # Upsert documents
- await collection.upsert_documents(documents[:2000])
+documents = [
+ {"id": "doc1", "text": "..."},
+ {"id": "doc2", "text": "..."}
+]
+
+await collection.upsert_documents(documents)
```
{% endtab %}
{% endtabs %}
@@ -145,34 +116,21 @@ A vector similarity search query is made on the collection.
{% tabs %}
{% tab title="JavaScript" %}
```js
- // Perform vector search
- const query = "Something that will match document one first";
- const queryResults = await collection.vector_search(
- {
- query: {
- fields: {
- text: { query: query }
- }
- }, limit: 2
- }, pipeline);
- console.log("The results");
- console.log(queryResults);
+const queryResults = await collection
+ .query()
+ .vector_recall(
+ "query",
+ pipeline,
+ )
+ .fetch_all();
```
{% endtab %}
{% tab title="Python" %}
```python
- # Query
- query = "What is a good mobile os?"
- console.print("Querying for %s..." % query)
- start = time()
- results = await collection.vector_search(
- {"query": {"fields": {"text": {"query": query}}}, "limit": 5}, pipeline
- )
- end = time()
- console.print("\n Results for '%s' " % (query), style="bold")
- console.print(results)
- console.print("Query time = %0.3f" % (end - start))
+results = (
+    await collection.query()
+    .vector_recall("query", pipeline)
+    .fetch_all()
+)
```
{% endtab %}
{% endtabs %}
@@ -184,15 +142,13 @@ The collection is archived when finished.
{% tabs %}
{% tab title="JavaScript" %}
```js
- await collection.archive();
-} // Close the main function
+await collection.archive();
```
{% endtab %}
{% tab title="Python" %}
```python
- await collection.archive()
-# The end of the main function
+await collection.archive()
```
{% endtab %}
{% endtabs %}
@@ -204,7 +160,9 @@ Boilerplate to call main() async function.
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-main().then(() => console.log("Done!"));
+main().then((results) => {
+ console.log("Vector search Results: \n", results);
+});
```
{% endtab %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md
new file mode 100644
index 000000000..caa7c8a59
--- /dev/null
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md
@@ -0,0 +1,164 @@
+---
+description: >-
+ JavaScript and Python code snippets for text summarization.
+---
+# Summarizing Question Answering
+
+Here are the Python and JavaScript examples for text summarization using the `pgml` SDK.
+
+## Imports and Setup
+
+The SDK and datasets are imported. Builtins are used for transformations.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pgml = require("pgml");
+require("dotenv").config();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Model, Splitter, Pipeline, Builtins
+from datasets import load_dataset
+from dotenv import load_dotenv
+```
+{% endtab %}
+{% endtabs %}
+
+## Initialize Collection
+
+A collection is created to hold text passages.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const collection = pgml.newCollection("my_javascript_sqa_collection");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+collection = Collection("squad_collection")
+```
+{% endtab %}
+{% endtabs %}
+
+## Create Pipeline
+
+A pipeline is created and added to the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pipeline = pgml.newPipeline(
+ "my_javascript_sqa_pipeline",
+ pgml.newModel(),
+ pgml.newSplitter(),
+);
+
+await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+model = Model()
+splitter = Splitter()
+pipeline = Pipeline("squadv1", model, splitter)
+await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+## Upsert Documents
+
+Text passages are upserted into the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const documents = [
+ {
+ id: "...",
+ text: "...",
+ }
+];
+
+await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Map SQuAD records to documents using the dataset's id and context fields
+data = load_dataset("squad", split="train")
+
+documents = [
+    {"id": r["id"], "text": r["context"]}
+    for r in data
+]
+
+await collection.upsert_documents(documents)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query for Context
+
+A vector search retrieves a relevant text passage.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const queryResults = await collection
+ .query()
+ .vector_recall(query, pipeline)
+ .fetch_all();
+
+const context = queryResults[0][1];
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+results = (
+    await collection.query()
+    .vector_recall(query, pipeline)
+    .fetch_all()
+)
+
+context = results[0][1]
+```
+{% endtab %}
+{% endtabs %}
+
+## Summarize Text
+
+The text is summarized using a pretrained model.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const builtins = pgml.newBuiltins();
+
+const summary = await builtins.transform(
+  { task: "summarization", model: "sshleifer/distilbart-cnn-12-6" },
+  [context]
+);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+builtins = Builtins()
+
+summary = await builtins.transform(
+    {"task": "summarization", "model": "sshleifer/distilbart-cnn-12-6"},
+    [context]
+)
+```
+{% endtab %}
+{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
index e296155af..22dd3733c 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
@@ -1,3 +1,8 @@
+---
+description: >-
+  Release trained models when ML quality metrics computed during training improve. Track model deployments over time and roll back if needed.
+---
+
# pgml.deploy()
## Deployments
@@ -27,7 +32,7 @@ pgml.deploy(
There are 3 different deployment strategies available:
| Strategy | Description |
-| ------------- | ------------------------------------------------------------------------------------------------ |
+| ------------- |--------------------------------------------------------------------------------------------------|
| `most_recent` | The most recently trained model for this project is immediately deployed, regardless of metrics. |
| `best_score` | The model that achieved the best key metric score is immediately deployed. |
| `rollback` | The model that was deployed before to the current one is deployed. |
@@ -79,6 +84,8 @@ SELECT * FROM pgml.deploy(
(1 row)
```
+
+
### Rolling Back
In case the new model isn't performing well in production, it's easy to rollback to the previous version. A rollback creates a new deployment for the old model. Multiple rollbacks in a row will oscillate between the two most recently deployed models, making rollbacks a safe and reversible operation.
@@ -123,7 +130,7 @@ SELECT * FROM pgml.deploy(
### Specific Model IDs
-In the case you need to deploy an exact model that is not the `most_recent` or `best_score`, you may deploy a model by id. Model id's can be found in the `pgml.models` table.
+If you need to deploy an exact model that is not the `most_recent` or `best_score`, you may deploy a model by ID. Model IDs can be found in the `pgml.models` table.
#### SQL
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md
index 6b392bc26..61f6a6b0e 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md
@@ -1,3 +1,8 @@
+---
+description: >-
+ Generate high quality embeddings with faster end-to-end vector operations without an additional vector database.
+---
+
# pgml.embed()
Embeddings are a numeric representation of text. They are used to represent words and sentences as vectors, an array of numbers. Embeddings can be used to find similar pieces of text, by comparing the similarity of the numeric vectors using a distance measure, or they can be used as input features for other machine learning models, since most algorithms can't use text directly.
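+
+To make the distance measure concrete, here is a minimal, dependency-free sketch (illustrative only, not part of the extension itself) of cosine similarity, the comparison typically used between embedding vectors:
+
+```python
+def cosine_similarity(a: list[float], b: list[float]) -> float:
+    # Dot product divided by the product of the vector magnitudes
+    dot = sum(x * y for x, y in zip(a, b))
+    norm_a = sum(x * x for x in a) ** 0.5
+    norm_b = sum(x * x for x in b) ** 0.5
+    return dot / (norm_a * norm_b)
+
+# Identical directions score 1.0; orthogonal vectors score 0.0
+print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
+print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
+```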
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md
index 68373638a..6566497e5 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md
@@ -1,3 +1,8 @@
+---
+description: >-
+ Batch predict from data in a table. Online predict with parameters passed in a query. Automatically reuse pre-processing steps from training.
+---
+
# pgml.predict()
## API
@@ -51,7 +56,7 @@ LIMIT 25;
### Classification Example
-If you've already been through the [pgml.train](../pgml.train "mention") examples, you can see the predictive results of those models:
+If you've already been through the [pgml.train](../pgml.train/ "mention") examples, you can see the predictive results of those models:
```sql
SELECT
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md
index 5f5b0d89e..d00460bfa 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md
@@ -1,8 +1,6 @@
---
description: >-
- The training function is at the heart of PostgresML. It's a powerful single
- mechanism that can handle many different training tasks which are configurable
- with the function parameters.
+ Pre-process and pull data to train a model using any of 50 different ML algorithms.
---
# pgml.train()
@@ -35,7 +33,7 @@ pgml.train(
| Parameter | Example | Description |
| --------------- | ----------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name` | `'Search Results Ranker'` | An easily recognizable identifier to organize your work. |
-| `task` | `'regression'` | The objective of the experiment: `regression`, `classification` or `cluster` |
+| `task` | `'regression'` | The objective of the experiment: `regression`, `classification` or `cluster` |
| `relation_name` | `'public.search_logs'` | The Postgres table or view where the training data is stored or defined. |
| `y_column_name` | `'clicked'` | The name of the label (aka "target" or "unknown") column in the training table. |
| `algorithm` | `'xgboost'` | The algorithm to train on the dataset; see the task-specific pages for available algorithms: regression.md, classification.md, clustering.md |
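+
+The parameters in the table map directly onto a `pgml.train()` call. As a minimal sketch (the connection string is hypothetical and assumes a psycopg 3 client pointed at your own PostgresML database), the example values above can be passed using SQL named notation:
+
+```python
+import psycopg  # psycopg 3
+
+# Hypothetical DSN; point this at your PostgresML database
+DSN = "postgres://user:pass@localhost:5432/pgml"
+
+with psycopg.connect(DSN) as conn:
+    # Train an XGBoost regression model on the example table
+    result = conn.execute(
+        """
+        SELECT * FROM pgml.train(
+            project_name => 'Search Results Ranker',
+            task => 'regression',
+            relation_name => 'public.search_logs',
+            y_column_name => 'clicked',
+            algorithm => 'xgboost'
+        );
+        """
+    ).fetchone()
+    print(result)
+```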