diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md index bfc9ef6a1..1c8fe5c28 100644 --- a/pgml-cms/docs/SUMMARY.md +++ b/pgml-cms/docs/SUMMARY.md @@ -38,12 +38,11 @@ * [Overview](introduction/apis/client-sdks/getting-started.md) * [Collections](introduction/apis/client-sdks/collections.md) * [Pipelines](introduction/apis/client-sdks/pipelines.md) - * [Search](introduction/apis/client-sdks/search.md) + * [Vector Search](introduction/apis/client-sdks/search.md) + * [Document Search](introduction/apis/client-sdks/document-search.md) * [Tutorials](introduction/apis/client-sdks/tutorials/README.md) * [Semantic Search](introduction/apis/client-sdks/tutorials/semantic-search.md) - * [Semantic Search using Instructor model](introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md) - * [Extractive Question Answering](introduction/apis/client-sdks/tutorials/extractive-question-answering.md) - * [Summarizing Question Answering](introduction/apis/client-sdks/tutorials/summarizing-question-answering.md) + * [Semantic Search Using Instructor Model](introduction/apis/client-sdks/tutorials/semantic-search-1.md) ## Product diff --git a/pgml-cms/docs/introduction/apis/client-sdks/collections.md b/pgml-cms/docs/introduction/apis/client-sdks/collections.md index c5e4df68d..3af0b9093 100644 --- a/pgml-cms/docs/introduction/apis/client-sdks/collections.md +++ b/pgml-cms/docs/introduction/apis/client-sdks/collections.md @@ -1,16 +1,16 @@ --- -description: >- - Organizational building blocks of the SDK. Manage all documents and related chunks, embeddings, tsvectors, and pipelines. +description: Organizational building blocks of the SDK. Manage all documents and related chunks, embeddings, tsvectors, and pipelines. --- + # Collections Collections are the organizational building blocks of the SDK. They manage all documents and related chunks, embeddings, tsvectors, and pipelines. ## Creating Collections -By default, collections will read and write to the database specified by `DATABASE_URL` environment variable. +By default, collections will read and write to the database specified by `PGML_DATABASE_URL` environment variable. -### **Default `DATABASE_URL`** +### **Default `PGML_DATABASE_URL`** {% tabs %} {% tab title="JavaScript" %} @@ -26,9 +26,9 @@ collection = Collection("test_collection") {% endtab %} {% endtabs %} -### **Custom DATABASE\_URL** +### Custom `PGML_DATABASE_URL` -Create a Collection that reads from a different database than that set by the environment variable `DATABASE_URL`. +Create a Collection that reads from a different database than that set by the environment variable `PGML_DATABASE_URL`. {% tabs %} {% tab title="Javascript" %} @@ -46,21 +46,23 @@ collection = Collection("test_collection", CUSTOM_DATABASE_URL) ## Upserting Documents -Documents are dictionaries with two required keys: `id` and `text`. All other keys/value pairs are stored as metadata for the document. +Documents are dictionaries with one required key: `id`. All other keys/value pairs are stored and can be chunked, embedded, broken into tsvectors, and searched over as specified by a `Pipeline`. 
{% tabs %} {% tab title="JavaScript" %} ```javascript const documents = [ { - id: "Document One", + id: "document_one", + title: "Document One", text: "document one contents...", - random_key: "this will be metadata for the document", + random_key: "here is some random data", }, { - id: "Document Two", + id: "document_two", + title: "Document Two", text: "document two contents...", - random_key: "this will be metadata for the document", + random_key: "here is some random data", }, ]; await collection.upsert_documents(documents); @@ -71,35 +73,40 @@ await collection.upsert_documents(documents); ```python documents = [ { - "id": "Document 1", + "id": "document_one", + "title": "Document One", "text": "Here are the contents of Document 1", - "random_key": "this will be metadata for the document" + "random_key": "here is some random data", }, { - "id": "Document 2", + "id": "document_two", + "title": "Document Two", "text": "Here are the contents of Document 2", - "random_key": "this will be metadata for the document" - } + "random_key": "here is some random data", + }, ] -collection = Collection("test_collection") await collection.upsert_documents(documents) ``` {% endtab %} {% endtabs %} -Document metadata can be replaced by upserting the document without the `text` key. +Documents can be replaced by upserting documents with the same `id`. {% tabs %} {% tab title="JavaScript" %} ```javascript const documents = [ { - id: "Document One", - random_key: "this will be NEW metadata for the document", + id: "document_one", + title: "Document One New Title", + text: "Here is some new text for document one", + random_key: "here is some new random data", }, { - id: "Document Two", - random_key: "this will be NEW metadata for the document", + id: "document_two", + title: "Document Two New Title", + text: "Here is some new text for document two", + random_key: "here is some new random data", }, ]; await collection.upsert_documents(documents); @@ -110,39 +117,42 @@ await collection.upsert_documents(documents); ```python documents = [ { - "id": "Document 1", - "random_key": "this will be NEW metadata for the document" + "id": "document_one", + "title": "Document One", + "text": "Here is some new text for document one", + "random_key": "here is some random data", }, { - "id": "Document 2", - "random_key": "this will be NEW metadata for the document" - } + "id": "document_two", + "title": "Document Two", + "text": "Here is some new text for document two", + "random_key": "here is some random data", + }, ] -collection = Collection("test_collection") await collection.upsert_documents(documents) ``` {% endtab %} {% endtabs %} -Document metadata can be merged with new metadata by upserting the document without the `text` key and specifying the merge option. +Documents can be merged by setting the `merge` option. On conflict, new document keys will override old document keys. 
{% tabs %} {% tab title="JavaScript" %} ```javascript const documents = [ { - id: "Document One", - text: "document one contents...", + id: "document_one", + new_key: "this will be a new key in document one", + random_key: "this will replace old random_key" }, { - id: "Document Two", - text: "document two contents...", + id: "document_two", + new_key: "this will bew a new key in document two", + random_key: "this will replace old random_key" }, ]; await collection.upsert_documents(documents, { - metdata: { - merge: true - } + merge: true }); ``` {% endtab %} @@ -151,20 +161,17 @@ await collection.upsert_documents(documents, { ```python documents = [ { - "id": "Document 1", - "random_key": "this will be NEW merged metadata for the document" + "id": "document_one", + "new_key": "this will be a new key in document one", + "random_key": "this will replace old random_key", }, { - "id": "Document 2", - "random_key": "this will be NEW merged metadata for the document" - } + "id": "document_two", + "new_key": "this will be a new key in document two", + "random_key": "this will replace old random_key", + }, ] -collection = Collection("test_collection") -await collection.upsert_documents(documents, { - "metadata": { - "merge": True - } -}) +await collection.upsert_documents(documents, {"merge": True}) ``` {% endtab %} {% endtabs %} @@ -176,14 +183,12 @@ Documents can be retrieved using the `get_documents` method on the collection ob {% tabs %} {% tab title="JavaScript" %} ```javascript -const collection = Collection("test_collection") const documents = await collection.get_documents({limit: 100 }) ``` {% endtab %} {% tab title="Python" %} ```python -collection = Collection("test_collection") documents = await collection.get_documents({ "limit": 100 }) ``` {% endtab %} @@ -198,14 +203,12 @@ The SDK supports limit-offset pagination and keyset pagination. {% tabs %} {% tab title="JavaScript" %} ```javascript -const collection = pgml.newCollection("test_collection") const documents = await collection.get_documents({ limit: 100, offset: 10 }) ``` {% endtab %} {% tab title="Python" %} ```python -collection = Collection("test_collection") documents = await collection.get_documents({ "limit": 100, "offset": 10 }) ``` {% endtab %} @@ -216,41 +219,31 @@ documents = await collection.get_documents({ "limit": 100, "offset": 10 }) {% tabs %} {% tab title="JavaScript" %} ```javascript -const collection = Collection("test_collection") const documents = await collection.get_documents({ limit: 100, last_row_id: 10 }) ``` {% endtab %} {% tab title="Python" %} ```python -collection = Collection("test_collection") documents = await collection.get_documents({ "limit": 100, "last_row_id": 10 }) ``` {% endtab %} {% endtabs %} -The `last_row_id` can be taken from the `row_id` field in the returned document's dictionary. +The `last_row_id` can be taken from the `row_id` field in the returned document's dictionary. Keyset pagination does not currently work when specifying the `order_by` key. ### Filtering Documents -Metadata and full text filtering are supported just like they are in vector recall. +Documents can be filtered by passing in the `filter` key. 
{% tabs %} {% tab title="JavaScript" %} ```javascript -const collection = pgml.newCollection("test_collection") const documents = await collection.get_documents({ - limit: 100, - offset: 10, + limit: 10, filter: { - metadata: { - id: { - $eq: 1 - } - }, - full_text_search: { - configuration: "english", - text: "Some full text query" + id: { + $eq: "document_one" } } }) @@ -259,34 +252,25 @@ const documents = await collection.get_documents({ {% tab title="Python" %} ```python -collection = Collection("test_collection") -documents = await collection.get_documents({ - "limit": 100, - "offset": 10, - "filter": { - "metadata": { - "id": { - "$eq": 1 - } +documents = await collection.get_documents( + { + "limit": 100, + "filter": { + "id": {"$eq": "document_one"}, }, - "full_text_search": { - "configuration": "english", - "text": "Some full text query" - } } -}) +) ``` {% endtab %} {% endtabs %} ### Sorting Documents -Documents can be sorted on any metadata key. Note that this does not currently work well with Keyset based pagination. If paginating and sorting, use Limit-Offset based pagination. +Documents can be sorted on any key. Note that this does not currently work well with Keyset based pagination. If paginating and sorting, use Limit-Offset based pagination. {% tabs %} {% tab title="JavaScript" %} ```javascript -const collection = pgml.newCollection("test_collection") const documents = await collection.get_documents({ limit: 100, offset: 10, @@ -299,7 +283,6 @@ const documents = await collection.get_documents({ {% tab title="Python" %} ```python -collection = Collection("test_collection") documents = await collection.get_documents({ "limit": 100, "offset": 10, @@ -315,39 +298,24 @@ documents = await collection.get_documents({ Documents can be deleted with the `delete_documents` method on the collection object. -Metadata and full text filtering are supported just like they are in vector recall. - {% tabs %} {% tab title="JavaScript" %} ```javascript -const collection = pgml.newCollection("test_collection") const documents = await collection.delete_documents({ - metadata: { id: { $eq: 1 } - }, - full_text_search: { - configuration: "english", - text: "Some full text query" - } }) ``` {% endtab %} {% tab title="Python" %} ```python -documents = await collection.delete_documents({ - "metadata": { - "id": { - "$eq": 1 - } - }, - "full_text_search": { - "configuration": "english", - "text": "Some full text query" +documents = await collection.delete_documents( + { + "id": {"$eq": 1}, } -}) +) ``` {% endtab %} {% endtabs %} diff --git a/pgml-cms/docs/introduction/apis/client-sdks/document-search.md b/pgml-cms/docs/introduction/apis/client-sdks/document-search.md new file mode 100644 index 000000000..0d47336d5 --- /dev/null +++ b/pgml-cms/docs/introduction/apis/client-sdks/document-search.md @@ -0,0 +1,127 @@ +# Document Search + +SDK is specifically designed to provide powerful, flexible document search. `Pipeline`s are required to perform search. See the [Pipelines](https://postgresml.org/docs/introduction/apis/client-sdks/pipelines) for more information about using `Pipeline`s. 
+ +This section will assume we have previously ran the following code: + +{% tabs %} +{% tab title="JavaScript" %} +```javascript +const pipeline = pgml.newPipeline("test_pipeline", { + abstract: { + semantic_search: { + model: "intfloat/e5-small", + }, + full_text_search: { configuration: "english" }, + }, + body: { + splitter: { model: "recursive_character" }, + semantic_search: { + model: "hkunlp/instructor-base", + parameters: { + instruction: "Represent the Wikipedia document for retrieval: ", + } + }, + }, +}); +const collection = pgml.newCollection("test_collection"); +await collection.add_pipeline(pipeline); +``` +{% endtab %} + +{% tab title="Python" %} +```python +pipeline = Pipeline( + "test_pipeline", + { + "abstract": { + "semantic_search": { + "model": "intfloat/e5-small", + }, + "full_text_search": {"configuration": "english"}, + }, + "body": { + "splitter": {"model": "recursive_character"}, + "semantic_search": { + "model": "hkunlp/instructor-base", + "parameters": { + "instruction": "Represent the Wikipedia document for retrieval: ", + }, + }, + }, + }, +) +collection = Collection("test_collection") +``` +{% endtab %} +{% endtabs %} + +## Doing Document Search + +{% tabs %} +{% tab title="JavaScript" %} +```javascript +const results = await collection.search( + { + query: { + full_text_search: { abstract: { query: "What is the best database?", boost: 1.2 } }, + semantic_search: { + abstract: { + query: "What is the best database?", boost: 2.0, + }, + body: { + query: "What is the best database?", boost: 1.25, parameters: { + instruction: + "Represent the Wikipedia question for retrieving supporting documents: ", + } + }, + }, + filter: { user_id: { $eq: 1 } }, + }, + limit: 10 + }, + pipeline, +); +``` +{% endtab %} + +{% tab title="Python" %} +```python +results = await collection.search( + { + "query": { + "full_text_search": { + "abstract": {"query": "What is the best database?", "boost": 1.2} + }, + "semantic_search": { + "abstract": { + "query": "What is the best database?", + "boost": 2.0, + }, + "body": { + "query": "What is the best database?", + "boost": 1.25, + "parameters": { + "instruction": "Represent the Wikipedia question for retrieving supporting documents: ", + }, + }, + }, + "filter": {"user_id": {"$eq": 1}}, + }, + "limit": 10, + }, + pipeline, +) +``` +{% endtab %} +{% endtabs %} + +Just like `vector_search`, `search` takes in two arguments. The first is a `JSON` object specifying the `query` and `limit` and the second is the `Pipeline`. The `query` object can have three fields: `full_text_search`, `semantic_search` and `filter`. Both `full_text_search` and `semantic_search` function similarly. They take in the text to compare against, titled`query`, an optional `boost` parameter used to boost the effectiveness of the ranking, and `semantic_search` also takes in an optional `parameters` key which specify parameters to pass to the embedding model when embedding the passed in text. + +Lets break this query down a little bit more. We are asking for a maximum of 10 documents ranked by `full_text_search` on the `abstract` and `semantic_search` on the `abstract` and `body`. We are also filtering out all documents that do not have the key `user_id` equal to `1`. The `full_text_search` provides a score for the `abstract`, and `semantic_search` provides scores for the `abstract` and the `body`. The `boost` parameter is a multiplier applied to these scores before they are summed together and sorted by `score` `DESC`. 
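+
+As a rough sketch of that ranking, here is the arithmetic implied by the description above (the per-field scores are made-up values, and this is an illustration only, not the SQL the SDK actually generates):
+
+```python
+# Hypothetical per-field scores for a single document (illustrative values only)
+full_text_abstract = 0.8  # full_text_search score on the abstract
+semantic_abstract = 0.9   # semantic_search similarity on the abstract
+semantic_body = 0.7       # semantic_search similarity on the body
+
+# Each score is multiplied by its boost from the query above, the boosted
+# scores are summed, and documents are ordered by the sum, descending.
+score = 1.2 * full_text_abstract + 2.0 * semantic_abstract + 1.25 * semantic_body
+print(score)  # 3.635
+```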
+ +The `filter` is structured the same way it is when performing `vector_search` see [filtering with vector\_search](https://postgresml.org/docs/introduction/apis/client-sdks/search)[ ](https://postgresml.org/docs/introduction/apis/client-sdks/search#metadata-filtering)for more examples on filtering documents. + +## Fine-Tuning Document Search + +More information and examples on this coming soon... diff --git a/pgml-cms/docs/introduction/apis/client-sdks/getting-started.md b/pgml-cms/docs/introduction/apis/client-sdks/getting-started.md index 6d1a60cf8..326a76ac3 100644 --- a/pgml-cms/docs/introduction/apis/client-sdks/getting-started.md +++ b/pgml-cms/docs/introduction/apis/client-sdks/getting-started.md @@ -27,18 +27,17 @@ Once the SDK is installed, you an use the following example to get started. ```javascript const pgml = require("pgml"); -const main = async () => { +const main = async () => { // Open the main function collection = pgml.newCollection("sample_collection"); ``` {% endtab %} {% tab title="Python" %} ```python -from pgml import Collection, Model, Splitter, Pipeline +from pgml import Collection, Pipeline import asyncio -async def main(): - # Initialize collection +async def main(): # Start of the main function collection = Collection("sample_collection") ``` {% endtab %} @@ -56,20 +55,31 @@ Continuing with `main` {% tabs %} {% tab title="JavaScript" %} ```javascript -// Create a pipeline using the default model and splitter -const model = pgml.newModel(); -const splitter = pgml.newSplitter(); -const pipeline = pgml.newPipeline("sample_pipeline", model, splitter); +const pipeline = pgml.newPipeline("sample_pipeline", { + text: { + splitter: { model: "recursive_character" }, + semantic_search: { + model: "intfloat/e5-small", + }, + }, +}); await collection.add_pipeline(pipeline); ``` {% endtab %} {% tab title="Python" %} ```python -# Create a pipeline using the default model and splitter -model = Model() -splitter = Splitter() -pipeline = Pipeline("sample_pipeline", model, splitter) +pipeline = Pipeline( + "test_pipeline", + { + "text": { + "splitter": { "model": "recursive_character" }, + "semantic_search": { + "model": "intfloat/e5-small", + }, + }, + }, +) await collection.add_pipeline(pipeline) ``` {% endtab %} @@ -77,8 +87,7 @@ await collection.add_pipeline(pipeline) #### Explanation: -* The code creates an instance of `Model` and `Splitter` using their default arguments. -* Finally, the code constructs a pipeline called `"sample_pipeline"` and add it to the collection we Initialized above. This pipeline automatically generates chunks and embeddings for every upserted document. +* The code constructs a pipeline called `"sample_pipeline"` and adds it to the collection we Initialized above. This pipeline automatically generates chunks and embeddings for the `text` key for every upserted document. 
### Upsert documents @@ -87,7 +96,6 @@ Continuing with `main` {% tabs %} {% tab title="JavaScript" %} ```javascript -// Create and upsert documents const documents = [ { id: "Document One", @@ -106,15 +114,15 @@ await collection.upsert_documents(documents); ```python documents = [ { - id: "Document One", - text: "document one contents...", + "id": "Document One", + "text": "document one contents...", }, { - id: "Document Two", - text: "document two contents...", + "id": "Document Two", + "text": "document two contents...", }, -]; -await collection.upsert_documents(documents); +] +await collection.upsert_documents(documents) ``` {% endtab %} {% endtabs %} @@ -131,45 +139,58 @@ Continuing with `main` {% tabs %} {% tab title="JavaScript" %} ```javascript -// Query -const queryResults = await collection - .query() - .vector_recall("Some user query that will match document one first", pipeline) - .limit(2) - .fetch_all(); - -// Convert the results to an array of objects -const results = queryResults.map((result) => { - const [similarity, text, metadata] = result; - return { - similarity, - text, - metadata, - }; -}); +const results = await collection.vector_search( + { + query: { + fields: { + text: { + query: "Something about a document...", + }, + }, + }, + limit: 2, + }, + pipeline, +); + console.log(results); await collection.archive(); + +} // Close the main function ``` {% endtab %} {% tab title="Python" %} ```python -# Query -query = "Some user query that will match document one first" -results = await collection.query().vector_recall(query, pipeline).limit(2).fetch_all() +results = await collection.vector_search( + { + "query": { + "fields": { + "text": { + "query": "Something about a document...", + }, + }, + }, + "limit": 2, + }, + pipeline, +) + print(results) -# Archive collection + await collection.archive() + +# End of the main function ``` {% endtab %} {% endtabs %} **Explanation:** -* The `query` method is called to perform a vector-based search on the collection. The query string is `Some user query that will match document one first`, and the top 2 results are requested. -* The search results are converted to objects and printed. -* Finally, the `archive` method is called to archive the collection and free up resources in the PostgresML database. +* The `query` method is called to perform a vector-based search on the collection. The query string is `Something about a document...`, and the top 2 results are requested +* The search results are printed to the screen +* Finally, the `archive` method is called to archive the collection Call `main` function. @@ -205,24 +226,24 @@ node vector_search.js {% tab title="Python" %} ```bash -python vector_search.py +python3 vector_search.py ``` {% endtab %} {% endtabs %} -You should see the search results printed in the terminal. As you can see, our vector search engine did match document one first. +You should see the search results printed in the terminal. 
```bash [ - { - similarity: 0.8506832955692104, - text: 'document one contents...', - metadata: { id: 'Document One' } - }, - { - similarity: 0.8066114609244565, - text: 'document two contents...', - metadata: { id: 'Document Two' } - } + { + "chunk": "document one contents...", + "document": {"id": "Document One", "text": "document one contents..."}, + "score": 0.9034339189529419, + }, + { + "chunk": "document two contents...", + "document": {"id": "Document Two", "text": "document two contents..."}, + "score": 0.8983734250068665, + }, ] ``` diff --git a/pgml-cms/docs/introduction/apis/client-sdks/pipelines.md b/pgml-cms/docs/introduction/apis/client-sdks/pipelines.md index 1bae53481..be27f96eb 100644 --- a/pgml-cms/docs/introduction/apis/client-sdks/pipelines.md +++ b/pgml-cms/docs/introduction/apis/client-sdks/pipelines.md @@ -1,233 +1,229 @@ --- -description: >- - Pipelines are composed of a model, splitter, and additional optional arguments. +description: Pipelines are composed of a model, splitter, and additional optional arguments. --- -# Pipelines -Pipelines are composed of a Model, Splitter, and additional optional arguments. Collections can have any number of Pipelines. Each Pipeline is ran everytime documents are upserted. +# Pipelines -## Models +`Pipeline`s define the schema for the transformation of documents. Different `Pipeline`s can be used for different tasks. -Models are used for embedding chuncked documents. We support most every open source model on [Hugging Face](https://huggingface.co/), and also OpenAI's embedding models. +## Defining Schema -### **Create a default Model "intfloat/e5-small" with default parameters: {}** +New `Pipeline`s require schema. Here are a few examples of variations of schema along with common use cases. 
-{% tabs %} -{% tab title="JavaScript" %} -```javascript -const model = pgml.newModel() -``` -{% endtab %} +For the following section we will assume we have documents that have the structure: -{% tab title="Python" %} -```python -model = Model() +```json +{ + "id": "Each document has a unique id", + "title": "Each document has a title", + "body": "Each document has some body text" +} ``` -{% endtab %} -{% endtabs %} - -### **Create a Model with custom parameters** {% tabs %} {% tab title="JavaScript" %} ```javascript -const model = pgml.newModel( - "hkunlp/instructor-base", - "pgml", - { instruction: "Represent the Wikipedia document for retrieval: " } -) +const pipeline = pgml.newPipeline("test_pipeline", { + title: { + full_text_search: { configuration: "english" }, + }, + body: { + splitter: { model: "recursive_character" }, + semantic_search: { + model: "hkunlp/instructor-base", + parameters: { + instruction: "Represent the Wikipedia document for retrieval: ", + } + }, + }, +}); ``` {% endtab %} {% tab title="Python" %} ```python -model = Model( - name="hkunlp/instructor-base", - parameters={"instruction": "Represent the Wikipedia document for retrieval: "} +pipeline = Pipeline( + "test_pipeline", + { + "title": { + "full_text_search": {"configuration": "english"}, + }, + "body": { + "splitter": {"model": "recursive_character"}, + "semantic_search": { + "model": "hkunlp/instructor-base", + "parameters": { + "instruction": "Represent the Wikipedia document for retrieval: ", + }, + }, + }, + }, ) ``` {% endtab %} {% endtabs %} -### **Use an OpenAI model** - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const model = pgml.newModel("text-embedding-ada-002", "openai") -``` -{% endtab %} - -{% tab title="Python" %} -```python -model = Model(name="text-embedding-ada-002", source="openai") -``` -{% endtab %} -{% endtabs %} - -## Splitters - -Splitters are used to split documents into chunks before embedding them. We support splitters found in [LangChain](https://www.langchain.com/). +This `Pipeline` does two things. For each document in the `Collection`, it converts all `title`s into tsvectors enabling full text search, and splits and embeds the `body` text enabling semantic search using vectors. This kind of `Pipeline` would be great for site search utilizing hybrid keyword and semantic search. -### **Create a default Splitter "recursive\_character" with default parameters: {}** +For a more simple RAG use case, the following `Pipeline` would work well. {% tabs %} {% tab title="JavaScript" %} ```javascript -const splitter = pgml.newSplitter() +const pipeline = pgml.newPipeline("test_pipeline", { + body: { + splitter: { model: "recursive_character" }, + semantic_search: { + model: "hkunlp/instructor-base", + parameters: { + instruction: "Represent the Wikipedia document for retrieval: ", + } + }, + }, +}); ``` {% endtab %} {% tab title="Python" %} ```python -splitter = Splitter() +pipeline = Pipeline( + "test_pipeline", + { + "body": { + "splitter": {"model": "recursive_character"}, + "semantic_search": { + "model": "hkunlp/instructor-base", + "parameters": { + "instruction": "Represent the Wikipedia document for retrieval: ", + }, + }, + }, + }, +) ``` {% endtab %} {% endtabs %} -### **Create a Splitter with custom parameters** +This `Pipeline` splits and embeds the `body` text enabling semantic search using vectors. This is a very popular `Pipeline` for RAG. + +We support most every open source model on [Hugging Face](https://huggingface.co/), and OpenAI's embedding models. 
To use a model from OpenAI specify the `source` as `openai`, and make sure and set the environment variable `OPENAI_API_KEY`. {% tabs %} {% tab title="JavaScript" %} ```javascript -splitter = pgml.newSplitter( - "recursive_character", - { chunk_size: 1500, chunk_overlap: 40 } -) +const pipeline = pgml.newPipeline("test_pipeline", { + body: { + splitter: { model: "recursive_character" }, + semantic_search: { + model: "text-embedding-ada-002", + source: "openai" + }, + }, +}); ``` {% endtab %} {% tab title="Python" %} ```python -splitter = Splitter( - name="recursive_character", - parameters={"chunk_size": 1500, "chunk_overlap": 40} +pipeline = Pipeline( + "test_pipeline", + { + "body": { + "splitter": {"model": "recursive_character"}, + "semantic_search": {"model": "text-embedding-ada-002", "source": "openai"}, + }, + }, ) ``` {% endtab %} {% endtabs %} -## Adding Pipelines to a Collection - -When adding a Pipeline to a collection it is required that Pipeline has a Model and Splitter. +## Customizing the Indexes -The first time a Pipeline is added to a Collection it will automatically chunk and embed any documents already in that Collection. +By default the SDK uses HNSW indexes to efficiently perform vector recall. The default HNSW index sets `m` to 16 and `ef_construction` to 64. These defaults can be customized in the `Pipeline` schema. See [pgvector](https://github.com/pgvector/pgvector) for more information on vector indexes. {% tabs %} {% tab title="JavaScript" %} ```javascript -const model = pgml.newModel() -const splitter = pgml.newSplitter() -const pipeline = pgml.newPipeline("test_pipeline", model, splitter) -await collection.add_pipeline(pipeline) +const pipeline = pgml.newPipeline("test_pipeline", { + body: { + splitter: { model: "recursive_character" }, + semantic_search: { + model: "intfloat/e5-small", + hnsw: { + m: 100, + ef_construction: 200 + } + }, + }, +}); ``` {% endtab %} {% tab title="Python" %} ```python -model = Model() -splitter = Splitter() -pipeline = Pipeline("test_pipeline", model, splitter) -await collection.add_pipeline(pipeline) +pipeline = Pipeline( + "test_pipeline", + { + "body": { + "splitter": {"model": "recursive_character"}, + "semantic_search": { + "model": "intfloat/e5-small", + "hnsw": {"m": 100, "ef_construction": 200}, + }, + }, + }, +) ``` {% endtab %} {% endtabs %} -### Enabling full text search - -Pipelines can take additional arguments enabling full text search. When full text search is enabled, in addition to automatically chunking and embedding, the Pipeline will create the necessary tsvectors to perform full text search. +## Adding Pipelines to a Collection -For more information on full text search please see: [Postgres Full Text Search](https://www.postgresql.org/docs/15/textsearch.html). +The first time a `Pipeline` is added to a `Collection` it will automatically chunk and embed any documents already in that `Collection`. 
{% tabs %} {% tab title="JavaScript" %} ```javascript -const model = pgml.newModel() -const splitter = pgml.newSplitter() -const pipeline = pgml.newPipeline("test_pipeline", model, splitter, { - full_text_search: { - active: true, - configuration: "english" - } -}) await collection.add_pipeline(pipeline) ``` {% endtab %} {% tab title="Python" %} ```python -model = Model() -splitter = Splitter() -pipeline = Pipeline("test_pipeline", model, splitter, { - "full_text_search": { - "active": True, - "configuration": "english" - } -}) await collection.add_pipeline(pipeline) ``` {% endtab %} {% endtabs %} -### Customizing the HNSW Index - -By default the SDK uses HNSW indexes to efficiently perform vector recall. The default HNSW index sets `m` to 16 and `ef_construction` to 64. These defaults can be customized when the Pipeline is created. +> Note: After a `Pipeline` has been added to a `Collection` instances of the `Pipeline` object can be created without specifying a schema: {% tabs %} {% tab title="JavaScript" %} ```javascript -const model = pgml.newModel() -const splitter = pgml.newSplitter() -const pipeline = pgml.newPipeline("test_pipeline", model, splitter, { - hnsw: { - m: 16, - ef_construction: 64 - } -}) -await collection.add_pipeline(pipeline) +const pipeline = pgml.newPipeline("test_pipeline") ``` {% endtab %} {% tab title="Python" %} ```python -model = Model() -splitter = Splitter() -pipeline = Pipeline("test_pipeline", model, splitter, { - "hnsw": { - "m": 16, - "ef_construction": 64 - } -}) -await collection.add_pipeline(pipeline) +pipeline = Pipeline("test_pipeline") ``` {% endtab %} {% endtabs %} ## Searching with Pipelines -Pipelines are a required argument when performing vector search. After a Pipeline has been added to a Collection, the Model and Splitter can be omitted when instantiating it. +There are two different forms of search that can be done after adding a `Pipeline` to a `Collection` -{% tabs %} -{% tab title="JavaScript" %} -```javascript -const pipeline = pgml.newPipeline("test_pipeline") -const collection = pgml.newCollection("test_collection") -const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all() -``` -{% endtab %} +* [Vector Search](https://postgresml.org/docs/introduction/apis/client-sdks/search) +* [Document Search](https://postgresml.org/docs/introduction/apis/client-sdks/document-search) -{% tab title="Python" %} -```python -pipeline = Pipeline("test_pipeline") -collection = Collection("test_collection") -results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all() -``` -{% endtab %} -{% endtabs %} +See their respective pages for more information on searching. ## **Disable a Pipeline** -Pipelines can be disabled or removed to prevent them from running automatically when documents are upserted. +`Pipelines` can be disabled or removed to prevent them from running automatically when documents are upserted. {% tabs %} {% tab title="JavaScript" %} @@ -247,11 +243,11 @@ await collection.disable_pipeline(pipeline) {% endtab %} {% endtabs %} -Disabling a Pipeline prevents it from running automatically, but leaves all chunks and embeddings already created by that Pipeline in the database. +Disabling a `Pipeline` prevents it from running automatically, but leaves all tsvectors, chunks, and embeddings already created by that `Pipeline` in the database. ## **Enable a Pipeline** -Disabled pipelines can be re-enabled. +Disabled `Pipeline`s can be re-enabled. 
{% tabs %} {% tab title="JavaScript" %} @@ -271,7 +267,7 @@ await collection.enable_pipeline(pipeline) {% endtab %} {% endtabs %} -Enabling a Pipeline will cause it to automatically run and chunk and embed all documents it may have missed while disabled. +Enabling a `Pipeline` will cause it to automatically run on all documents it may have missed while disabled. ## **Remove a Pipeline** @@ -292,4 +288,4 @@ await collection.remove_pipeline(pipeline) {% endtab %} {% endtabs %} -Removing a Pipeline deletes it and all associated data from the database. Removed Pipelines cannot be re-enabled but can be recreated. +Removing a `Pipeline` deletes it and all associated data from the database. Removed `Pipelines` cannot be re-enabled but can be recreated. diff --git a/pgml-cms/docs/introduction/apis/client-sdks/search.md b/pgml-cms/docs/introduction/apis/client-sdks/search.md index 2659015dd..cb61d91b2 100644 --- a/pgml-cms/docs/introduction/apis/client-sdks/search.md +++ b/pgml-cms/docs/introduction/apis/client-sdks/search.md @@ -1,257 +1,353 @@ -# Search +# Vector Search -SDK is specifically designed to provide powerful, flexible vector search. Pipelines are required to perform search. See the [pipelines.md](pipelines.md "mention") for more information about using Pipelines. +SDK is specifically designed to provide powerful, flexible vector search. `Pipeline`s are required to perform search. See [Pipelines ](https://postgresml.org/docs/introduction/apis/client-sdks/pipelines)for more information about using `Pipeline`s. -### **Basic vector search** +This section will assume we have previously ran the following code: {% tabs %} {% tab title="JavaScript" %} -
const collection = pgml.newCollection("test_collection")
-const pipeline = pgml.newPipeline("test_pipeline")
-const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
-
+```javascript
+const pipeline = pgml.newPipeline("test_pipeline", {
+ abstract: {
+ semantic_search: {
+ model: "intfloat/e5-small",
+ },
+ full_text_search: { configuration: "english" },
+ },
+ body: {
+ splitter: { model: "recursive_character" },
+ semantic_search: {
+ model: "hkunlp/instructor-base",
+ parameters: {
+ instruction: "Represent the Wikipedia document for retrieval: ",
+ }
+ },
+ },
+});
+const collection = pgml.newCollection("test_collection");
+await collection.add_pipeline(pipeline);
+```
{% endtab %}
{% tab title="Python" %}
```python
+pipeline = Pipeline(
+ "test_pipeline",
+ {
+ "abstract": {
+ "semantic_search": {
+ "model": "intfloat/e5-small",
+ },
+ "full_text_search": {"configuration": "english"},
+ },
+ "body": {
+ "splitter": {"model": "recursive_character"},
+ "semantic_search": {
+ "model": "hkunlp/instructor-base",
+ "parameters": {
+ "instruction": "Represent the Wikipedia document for retrieval: ",
+ },
+ },
+ },
+ },
+)
collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
```
{% endtab %}
{% endtabs %}
-### **Vector search with custom limit**
+This creates a `Pipeline` that enables full text search and semantic search on the `abstract`, and semantic search on the `body`, of our documents.
+
+## **Doing Vector Search**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const collection = pgml.newCollection("test_collection")
-const pipeline = pgml.newPipeline("test_pipeline")
-const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
+const results = await collection.vector_search(
+ {
+ query: {
+ fields: {
+ body: {
+ query: "What is the best database?", parameters: {
+ instruction:
+ "Represent the Wikipedia question for retrieving supporting documents: ",
+ }
+ },
+ },
+ },
+ limit: 5,
+ },
+ pipeline,
+);
```
{% endtab %}
{% tab title="Python" %}
```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
+results = await collection.vector_search(
+ {
+ "query": {
+ "fields": {
+ "body": {
+ "query": "What is the best database?",
+ "parameters": {
+ "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
+ },
+ },
+ },
+ },
+ "limit": 5,
+ },
+ pipeline,
+)
```
{% endtab %}
{% endtabs %}
-### **Metadata Filtering**
-
-We provide powerful and flexible arbitrarly nested metadata filtering based off of [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support each operator mentioned except the `$nin`.
-
-**Vector search with $eq metadata filtering**
+Let's break this down. `vector_search` takes in a `JSON` object and a `Pipeline`. The `JSON` object currently supports two keys: `query` and `limit`. The `limit` caps how many chunks are returned, and the `query` specifies the actual search to perform. Let's look at another, more complicated example:
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const collection = pgml.newCollection("test_collection")
-const pipeline = pgml.newPipeline("test_pipeline")
-const results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- metadata: {
- uuid: {
- $eq: 1
- }
- }
- })
- .fetch_all()
+const query = "What is the best database?";
+const results = await collection.vector_search(
+ {
+ query: {
+ fields: {
+ abstract: {
+ query: query,
+ full_text_filter: "database"
+ },
+ body: {
+ query: query, parameters: {
+ instruction:
+ "Represent the Wikipedia question for retrieving supporting documents: ",
+ }
+ },
+ },
+ },
+ limit: 5,
+ },
+ pipeline,
+);
```
{% endtab %}
{% tab title="Python" %}
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "uuid": {
- "$eq": 1
- }
- }
- })
- .fetch_all()
+```python
+query = "What is the best database?"
+results = await collection.vector_search(
+ {
+ "query": {
+ "fields": {
+                "abstract": {
+ "query": query,
+ "full_text_filter": "database",
+ },
+ "body": {
+ "query": query,
+ "parameters": {
+ "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
+ },
+ },
+ },
+ },
+ "limit": 5,
+ },
+ pipeline,
)
-
+
+```
{% endtab %}
{% endtabs %}
-The above query would filter out all documents that do not contain a key `uuid` equal to `1`.
+The `query` in this example is slightly more intricate. We are doing vector search over both the `abstract` and `body` keys of our documents, which means our search may return chunks from either field. We are also filtering out all `abstract` chunks that do not contain the text `"database"`. We can do this because we enabled `full_text_search` on the `abstract` key in the `Pipeline` schema. Also note that the model used for embedding the `body` takes parameters, but the model used for embedding the `abstract` does not.
+
+## **Filtering**
-**Vector search with $gte metadata filtering**
+We provide powerful and flexible, arbitrarily nested filtering based on the [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support every operator listed there except `$nin`.
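+
+The examples below cover `$eq`, `$gte`, `$and`, `$or` and `$ne`. The remaining comparison operators follow the same shape; here is a sketch of filter fragments (the field names are placeholders, and the semantics are assumed to mirror the MongoDB operators they are named after):
+
+```python
+# Filter fragments only; use one of these as the value of the "filter" key
+# in a vector_search query, exactly as in the examples below.
+example_filters = [
+    {"user_score": {"$gt": 90}},      # greater than
+    {"user_score": {"$lt": 10}},      # less than
+    {"user_score": {"$lte": 50}},     # less than or equal to
+    {"user_id": {"$in": [1, 2, 3]}},  # matches any of the listed values
+]
+```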
+
+**Vector search with $eq filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const collection = pgml.newCollection("test_collection")
-const pipeline = pgml.newPipeline("test_pipeline")
-const results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- metadata: {
- index: {
- $gte: 3
+const results = await collection.vector_search(
+ {
+ query: {
+ fields: {
+ body: {
+ query: "What is the best database?", parameters: {
+ instruction:
+ "Represent the Wikipedia question for retrieving supporting documents: ",
+ }
+ },
+ },
+ filter: {
+ user_id: {
+ $eq: 1
+ }
}
- }
- })
- .fetch_all()
+ },
+ limit: 5,
+ },
+ pipeline,
+);
```
{% endtab %}
{% tab title="Python" %}
```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "index": {
- "$gte": 3
- }
- }
- })
- .fetch_all()
+results = await collection.vector_search(
+ {
+ "query": {
+ "fields": {
+ "body": {
+ "query": "What is the best database?",
+ "parameters": {
+ "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
+ },
+ },
+ },
+ "filter": {"user_id": {"$eq": 1}},
+ },
+ "limit": 5,
+ },
+ pipeline,
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all documents that do not contain a key `index` with a value greater than or equal to `3`.
+The above query would filter out all chunks from documents that do not contain a key `user_id` equal to `1`.
-**Vector search with $or and $and metadata filtering**
+**Vector search with $gte filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const collection = pgml.newCollection("test_collection")
-const pipeline = pgml.newPipeline("test_pipeline")
-const results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- metadata: {
- $or: [
- {
- $and: [
- {
- $eq: {
- uuid: 1
- }
- },
- {
- $lt: {
- index: 100
- }
- }
- ]
- },
- {
- special: {
- $ne: True
+const results = await collection.vector_search(
+ {
+ query: {
+ fields: {
+ body: {
+ query: "What is the best database?", parameters: {
+ instruction:
+ "Represent the Wikipedia question for retrieving supporting documents: ",
}
+ },
+ },
+ filter: {
+ user_id: {
+ $gte: 1
}
- ]
- }
- })
- .fetch_all()
+ }
+ },
+ limit: 5,
+ },
+ pipeline,
+);
```
{% endtab %}
{% tab title="Python" %}
```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "$or": [
- {
- "$and": [
- {
- "$eq": {
- "uuid": 1
- }
- },
- {
- "$lt": {
- "index": 100
- }
- }
- ]
+results = await collection.vector_search(
+ {
+ "query": {
+ "fields": {
+ "body": {
+ "query": "What is the best database?",
+ "parameters": {
+ "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
+ },
},
- {
- "special": {
- "$ne": True
- }
- }
- ]
- }
- })
- .fetch_all()
+ },
+ "filter": {"user_id": {"$gte": 1}},
+ },
+ "limit": 5,
+ },
+ pipeline,
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all documents that do not have a key `special` with a value `True` or (have a key `uuid` equal to 1 and a key `index` less than 100).
-
-### **Full Text Filtering**
+The above query would filter out all documents that do not contain a key `user_id` with a value greater than or equal to `1`.
-If full text search is enabled for the associated Pipeline, documents can be first filtered by full text search and then recalled by embedding similarity.
+**Vector search with $or and $and filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const collection = pgml.newCollection("test_collection")
-const pipeline = pgml.newPipeline("test_pipeline")
-const results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- full_text: {
- configuration: "english",
- text: "Match Me"
- }
- })
- .fetch_all()
+const results = await collection.vector_search(
+ {
+ query: {
+ fields: {
+ body: {
+ query: "What is the best database?", parameters: {
+ instruction:
+ "Represent the Wikipedia question for retrieving supporting documents: ",
+ }
+ },
+ },
+ filter: {
+ $or: [
+ {
+ $and: [
+ {
+ $eq: {
+ user_id: 1
+ }
+ },
+ {
+ $lt: {
+ user_score: 100
+ }
+ }
+ ]
+ },
+ {
+ special: {
+ $ne: true
+ }
+ }
+ ]
+ }
+ },
+ limit: 5,
+ },
+ pipeline,
+);
```
{% endtab %}
{% tab title="Python" %}
```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "full_text": {
- "configuration": "english",
- "text": "Match Me"
- }
- })
- .fetch_all()
+results = await collection.vector_search(
+ {
+ "query": {
+ "fields": {
+ "body": {
+ "query": "What is the best database?",
+ "parameters": {
+ "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
+ },
+ },
+ },
+ "filter": {
+ "$or": [
+ {"$and": [{"$eq": {"user_id": 1}}, {"$lt": {"user_score": 100}}]},
+ {"special": {"$ne": True}},
+ ],
+ },
+ },
+ "limit": 5,
+ },
+ pipeline,
)
```
{% endtab %}
{% endtabs %}
-The above query would first filter out all documents that do not match the full text search criteria, and then perform vector recall on the remaining documents.
+The above query would only return results from documents that either do not have a key `special` with a value of `true`, or have a key `user_id` equal to `1` and a key `user_score` less than `100`.
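+
+To make the nesting concrete, the same condition written as a plain Python predicate looks like this (an intuition aid only, not how the SDK actually evaluates filters):
+
+```python
+def matches(document: dict) -> bool:
+    # ($eq user_id 1 AND $lt user_score 100) OR ($ne special true)
+    and_branch = (
+        document.get("user_id") == 1
+        and "user_score" in document
+        and document["user_score"] < 100
+    )
+    or_branch = document.get("special") is not True
+    return and_branch or or_branch
+```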
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
index 84ce15b78..ed07f8b2c 100644
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
@@ -1,2 +1,6 @@
# Tutorials
+We have a number of tutorials and examples for our Python and JavaScript SDKs. For the full list of examples, check out:
+
+* [JavaScript Examples on Github](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml/javascript/examples)
+* [Python Examples on Github](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml/python/examples)
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md
deleted file mode 100644
index 78abc3a09..000000000
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md
+++ /dev/null
@@ -1,161 +0,0 @@
----
-description: >-
- JavaScript and Python code snippets for end-to-end question answering.
----
-# Extractive Question Answering
-
-Here is the documentation for the JavaScript and Python code snippets performing end-to-end question answering:
-
-## Imports and Setup
-
-The SDK and datasets are imported. Builtins are used in Python for transforming text.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-from pgml import Collection, Model, Splitter, Pipeline, Builtins
-from datasets import load_dataset
-from dotenv import load_dotenv
-```
-{% endtab %}
-{% endtabs %}
-
-## Initialize Collection
-
-A collection is created to hold context passages.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const collection = pgml.newCollection("my_javascript_eqa_collection");
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-collection = Collection("squad_collection")
-```
-{% endtab %}
-{% endtabs %}
-
-## Create Pipeline
-
-A pipeline is created and added to the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const pipeline = pgml.newPipeline(
- "my_javascript_eqa_pipeline",
- pgml.newModel(),
- pgml.newSplitter(),
-);
-
-await collection.add_pipeline(pipeline);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-model = Model()
-splitter = Splitter()
-pipeline = Pipeline("squadv1", model, splitter)
-await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-## Upsert Documents
-
-Context passages from SQuAD are upserted into the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const documents = [
- {
- id: "...",
- text: "...",
- }
-];
-
-await collection.upsert_documents(documents);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-data = load_dataset("squad")
-
-documents = [
- {"id": ..., "text": ...}
- for r in data
-]
-
-await collection.upsert_documents(documents)
-```
-{% endtab %}
-{% endtabs %}
-
-## Query for Context
-
-A vector search query retrieves context passages.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const queryResults = await collection
- .query()
- .vector_recall(query, pipeline)
- .fetch_all();
-
-const context = queryResults
- .map(result => result[1])
- .join("\n");
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-results = await collection.query()
- .vector_recall(query, pipeline)
- .fetch_all()
-
-context = " ".join(results[0][1])
-```
-{% endtab %}
-{% endtabs %}
-
-## Query for Answer
-
-The context is passed to a QA model to extract the answer.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const builtins = pgml.newBuiltins();
-
-const answer = await builtins.transform("question-answering", [
- JSON.stringify({question, context})
-]);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-builtins = Builtins()
-
-answer = await builtins.transform(
- "question-answering",
- [{"question": query, "context": context}]
-)
-```
-{% endtab %}
-{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md
new file mode 100644
index 000000000..2927773c3
--- /dev/null
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md
@@ -0,0 +1,231 @@
+---
+description: JavaScript and Python code snippets for using instructor models in more advanced search use cases.
+---
+
+# Semantic Search Using Instructor Model
+
+This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. In this tutorial we use [hkunlp/instructor-base](https://huggingface.co/hkunlp/instructor-base), a more advanced embedding model that takes an instruction parameter both when embedding documents and when embedding queries.
+
+[Link to full JavaScript implementation](../../../../../../pgml-sdks/pgml/javascript/examples/question\_answering.js)
+
+[Link to full Python implementation](../../../../../../pgml-sdks/pgml/python/examples/question\_answering.py)
+
+## Imports and Setup
+
+The SDK is imported and environment variables are loaded.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pgml = require("pgml");
+require("dotenv").config();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Pipeline
+from datasets import load_dataset
+from time import time
+from dotenv import load_dotenv
+from rich.console import Console
+import asyncio
+```
+{% endtab %}
+{% endtabs %}
+
+## Initialize Collection
+
+A collection object is created to represent the search collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const main = async () => { // Open the main function, we close it at the bottom
+ // Initialize the collection
+ const collection = pgml.newCollection("qa_collection");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+async def main(): # Start the main function, we end it after archiving
+ load_dotenv()
+ console = Console()
+
+ # Initialize collection
+ collection = Collection("squad_collection")
+```
+{% endtab %}
+{% endtabs %}
+
+## Create Pipeline
+
+A pipeline encapsulating a model and splitter is created and added to the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+ // Add a pipeline
+ const pipeline = pgml.newPipeline("qa_pipeline", {
+ text: {
+ splitter: { model: "recursive_character" },
+ semantic_search: {
+        model: "hkunlp/instructor-base",
+        parameters: {
+          instruction: "Represent the Wikipedia document for retrieval: ",
+        },
+ },
+ },
+ });
+ await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+ # Create and add pipeline
+ pipeline = Pipeline(
+ "squadv1",
+ {
+ "text": {
+ "splitter": {"model": "recursive_character"},
+ "semantic_search": {
+ "model": "hkunlp/instructor-base",
+ "parameters": {
+ "instruction": "Represent the Wikipedia document for retrieval: "
+ },
+ },
+ }
+ },
+ )
+ await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+## Upsert Documents
+
+Documents are upserted into the collection and indexed by the pipeline.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+ // Upsert documents, these documents are automatically split into chunks and embedded by our pipeline
+ const documents = [
+ {
+ id: "Document One",
+ text: "PostgresML is the best tool for machine learning applications!",
+ },
+ {
+ id: "Document Two",
+ text: "PostgresML is open source and available to everyone!",
+ },
+ ];
+ await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+ # Prep documents for upserting
+ data = load_dataset("squad", split="train")
+ data = data.to_pandas()
+ data = data.drop_duplicates(subset=["context"])
+ documents = [
+ {"id": r["id"], "text": r["context"], "title": r["title"]}
+ for r in data.to_dict(orient="records")
+ ]
+
+ # Upsert documents
+ await collection.upsert_documents(documents[:200])
+```
+{% endtab %}
+{% endtabs %}
+
+## Query
+
+A vector similarity search query is made on the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+ // Perform vector search
+ const query = "What is the best tool for building machine learning applications?";
+ const queryResults = await collection.vector_search(
+ {
+ query: {
+ fields: {
+          text: {
+            query: query,
+            parameters: {
+              instruction: "Represent the Wikipedia question for retrieving supporting documents: ",
+            },
+          },
+ }
+ }, limit: 1
+ }, pipeline);
+ console.log(queryResults);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+ # Query for answer
+ query = "Who won more than 20 grammy awards?"
+ console.print("Querying for context ...")
+ start = time()
+ results = await collection.vector_search(
+ {
+ "query": {
+ "fields": {
+ "text": {
+ "query": query,
+ "parameters": {
+ "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
+ },
+ },
+ }
+ },
+ "limit": 5,
+ },
+ pipeline,
+ )
+ end = time()
+ console.print("\n Results for '%s' " % (query), style="bold")
+ console.print(results)
+ console.print("Query time = %0.3f" % (end - start))
+```
+{% endtab %}
+{% endtabs %}
+
+## Archive Collection
+
+The collection is archived when finished.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+ await collection.archive();
+} // Close the main function
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+ await collection.archive()
+# The end of the main function
+```
+{% endtab %}
+{% endtabs %}
+
+## Main
+
+Boilerplate to call the async `main()` function.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```javascript
+main().then(() => console.log("Done!"));
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+{% endtab %}
+{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md
deleted file mode 100644
index 697845b55..000000000
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md
+++ /dev/null
@@ -1,127 +0,0 @@
----
-description: >-
- JavaScript and Python code snippets for using instructor models in more advanced search use cases.
----
-# Semantic Search using Instructor model
-
-This shows using instructor models in the `pgml` SDK for more advanced use cases.
-
-## Imports and Setup
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-from pgml import Collection, Model, Splitter, Pipeline
-from datasets import load_dataset
-from dotenv import load_dotenv
-```
-{% endtab %}
-{% endtabs %}
-
-## Initialize Collection
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const collection = pgml.newCollection("my_javascript_qai_collection");
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-collection = Collection("squad_collection_1")
-```
-{% endtab %}
-{% endtabs %}
-
-## Create Pipeline
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const model = pgml.newModel("hkunlp/instructor-base", "pgml", {
- instruction: "Represent the Wikipedia document for retrieval: ",
-});
-
-const pipeline = pgml.newPipeline(
- "my_javascript_qai_pipeline",
- model,
- pgml.newSplitter(),
-);
-
-await collection.add_pipeline(pipeline);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-model = Model("hkunlp/instructor-base", parameters={
- "instruction": "Represent the Wikipedia document for retrieval: "
-})
-
-pipeline = Pipeline("squad_instruction", model, Splitter())
-await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-## Upsert Documents
-
-{% tabs %}
-{% tab title="JavaScript" %}
-const documents = [
- {
- id: "...",
- text: "...",
- },
-];
-
-await collection.upsert_documents(documents);
-
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-data = load_dataset("squad")
-
-documents = [
- {"id": ..., "text": ...} for r in data
-]
-
-await collection.upsert_documents(documents)
-```
-{% endtab %}
-{% endtabs %}
-
-## Query
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const queryResults = await collection
- .query()
- .vector_recall(query, pipeline, {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- })
- .fetch_all();
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-results = await collection.query()
- .vector_recall(query, pipeline, {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
- })
- .fetch_all()
-```
-{% endtab %}
-{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
index 89bf07cd8..71c0e5615 100644
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
@@ -1,10 +1,14 @@
---
-description: Example for Semantic Search
+description: JavaScript and Python code snippets for performing semantic search using the SDK.
---
# Semantic Search
-This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. It loads sample data, indexes questions, times a semantic search query, and prints formatted results.
+This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished.
+
+[Link to full JavaScript implementation](../../../../../../pgml-sdks/pgml/javascript/examples/semantic\_search.js)
+
+[Link to full Python implementation](../../../../../../pgml-sdks/pgml/python/examples/semantic\_search.py)
## Imports and Setup
@@ -14,16 +18,17 @@ The SDK is imported and environment variables are loaded.
{% tab title="JavasScript" %}
```js
const pgml = require("pgml");
-
require("dotenv").config();
```
{% endtab %}
{% tab title="Python" %}
```python
-from pgml import Collection, Model, Splitter, Pipeline
+from pgml import Collection, Pipeline
from datasets import load_dataset
+from time import time
from dotenv import load_dotenv
+from rich.console import Console
import asyncio
```
{% endtab %}
@@ -36,17 +41,20 @@ A collection object is created to represent the search collection.
{% tabs %}
{% tab title="JavaScript" %}
```js
-const main = async () => {
- const collection = pgml.newCollection("my_javascript_collection");
-}
+const main = async () => { // Open the main function; we close it at the bottom
+ // Initialize the collection
+ const collection = pgml.newCollection("semantic_search_collection");
```
{% endtab %}
{% tab title="Python" %}
```python
-async def main():
+async def main(): # Start the main function; we end it after archiving
load_dotenv()
- collection = Collection("my_collection")
+ console = Console()
+
+ # Initialize collection
+ collection = Collection("quora_collection")
```
{% endtab %}
{% endtabs %}
@@ -58,19 +66,32 @@ A pipeline encapsulating a model and splitter is created and added to the collec
{% tabs %}
{% tab title="JavaScript" %}
```js
-const model = pgml.newModel();
-const splitter = pgml.newSplitter();
-const pipeline = pgml.newPipeline("my_javascript_pipeline", model, splitter);
-await collection.add_pipeline(pipeline);
+ // Add a pipeline
+ const pipeline = pgml.newPipeline("semantic_search_pipeline", {
+ text: {
+ splitter: { model: "recursive_character" },
+ semantic_search: {
+ model: "intfloat/e5-small",
+ },
+ },
+ });
+ await collection.add_pipeline(pipeline);
```
{% endtab %}
{% tab title="Python" %}
```python
-model = Model()
-splitter = Splitter()
-pipeline = Pipeline("my_pipeline", model, splitter)
-await collection.add_pipeline(pipeline)
+ # Create and add pipeline
+ pipeline = Pipeline(
+ "quorav1",
+ {
+ "text": {
+ "splitter": {"model": "recursive_character"},
+ "semantic_search": {"model": "intfloat/e5-small"},
+ }
+ },
+ )
+ await collection.add_pipeline(pipeline)
```
{% endtab %}
{% endtabs %}
@@ -82,29 +103,37 @@ Documents are upserted into the collection and indexed by the pipeline.
{% tabs %}
{% tab title="JavaScript" %}
```js
-const documents = [
- {
- id: "Document One",
- text: "...",
- },
- {
- id: "Document Two",
- text: "...",
- },
-];
-
-await collection.upsert_documents(documents);
+ // Upsert documents; they are automatically split into chunks and embedded by our pipeline
+ const documents = [
+ {
+ id: "Document One",
+ text: "document one contents...",
+ },
+ {
+ id: "Document Two",
+ text: "document two contents...",
+ },
+ ];
+ await collection.upsert_documents(documents);
```
{% endtab %}
{% tab title="Python" %}
```python
-documents = [
- {"id": "doc1", "text": "..."},
- {"id": "doc2", "text": "..."}
-]
-
-await collection.upsert_documents(documents)
+ # Prep documents for upserting
+ dataset = load_dataset("quora", split="train")
+ questions = []
+ for record in dataset["questions"]:
+ questions.extend(record["text"])
+
+ # Remove duplicates and add id
+ documents = []
+ for i, question in enumerate(list(set(questions))):
+ if question:
+ documents.append({"id": i, "text": question})
+
+ # Upsert documents
+ await collection.upsert_documents(documents[:2000])
```
{% endtab %}
{% endtabs %}
@@ -116,21 +145,34 @@ A vector similarity search query is made on the collection.
{% tabs %}
{% tab title="JavaScript" %}
```js
-const queryResults = await collection
- .query()
- .vector_recall(
- "query",
- pipeline,
- )
- .fetch_all();
+ // Perform vector search
+ const query = "Something that will match document one first";
+ const queryResults = await collection.vector_search(
+ {
+ query: {
+ fields: {
+ text: { query: query }
+ }
+ }, limit: 2
+ }, pipeline);
+ console.log("The results");
+ console.log(queryResults);
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.query()
- .vector_recall("query", pipeline)
- .fetch_all()
+ # Query
+ query = "What is a good mobile os?"
+ console.print("Querying for %s..." % query)
+ start = time()
+ results = await collection.vector_search(
+ {"query": {"fields": {"text": {"query": query}}}, "limit": 5}, pipeline
+ )
+ end = time()
+ console.print("\n Results for '%s' " % (query), style="bold")
+ console.print(results)
+ console.print("Query time = %0.3f" % (end - start))
```
{% endtab %}
{% endtabs %}
@@ -142,13 +184,15 @@ The collection is archived when finished.
{% tabs %}
{% tab title="JavaScript" %}
```js
-await collection.archive();
+ await collection.archive();
+} // Close the main function
```
{% endtab %}
{% tab title="Python" %}
```python
-await collection.archive()
+ await collection.archive()
+# The end of the main function
```
{% endtab %}
{% endtabs %}
@@ -160,9 +204,7 @@ Boilerplate to call main() async function.
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-main().then((results) => {
- console.log("Vector search Results: \n", results);
-});
+main().then(() => console.log("Done!"));
```
{% endtab %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md
deleted file mode 100644
index caa7c8a59..000000000
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md
+++ /dev/null
@@ -1,164 +0,0 @@
----
-description: >-
- JavaScript and Python code snippets for text summarization.
----
-# Summarizing Question Answering
-
-Here are the Python and JavaScript examples for text summarization using `pgml` SDK
-
-## Imports and Setup
-
-The SDK and datasets are imported. Builtins are used for transformations.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-from pgml import Collection, Model, Splitter, Pipeline, Builtins
-from datasets import load_dataset
-from dotenv import load_dotenv
-```
-{% endtab %}
-{% endtabs %}
-
-## Initialize Collection
-
-A collection is created to hold text passages.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const collection = pgml.newCollection("my_javascript_sqa_collection");
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-collection = Collection("squad_collection")
-```
-{% endtab %}
-{% endtabs %}
-
-## Create Pipeline
-
-A pipeline is created and added to the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const pipeline = pgml.newPipeline(
- "my_javascript_sqa_pipeline",
- pgml.newModel(),
- pgml.newSplitter(),
-);
-
-await collection.add_pipeline(pipeline);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-model = Model()
-splitter = Splitter()
-pipeline = Pipeline("squadv1", model, splitter)
-await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-## Upsert Documents
-
-Text passages are upserted into the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const documents = [
- {
- id: "...",
- text: "...",
- }
-];
-
-await collection.upsert_documents(documents);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-data = load_dataset("squad")
-
-documents = [
- {"id": ..., "text": ...}
- for r in data
-]
-
-await collection.upsert_documents(documents)
-```
-{% endtab %}
-{% endtabs %}
-
-## Query for Context
-
-A vector search retrieves a relevant text passage.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const queryResults = await collection
- .query()
- .vector_recall(query, pipeline)
- .fetch_all();
-
-const context = queryResults[0][1];
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-results = await collection.query()
- .vector_recall(query, pipeline)
- .fetch_all()
-
-context = results[0][1]
-```
-{% endtab %}
-{% endtabs %}
-
-## Summarize Text
-
-The text is summarized using a pretrained model.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const builtins = pgml.newBuiltins();
-
-const summary = await builtins.transform(
- {task: "summarization",
- model: "sshleifer/distilbart-cnn-12-6"},
- [context]
-);
-```
-
-
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-builtins = Builtins()
-
-summary = await builtins.transform(
- {"task": "summarization",
- "model": "sshleifer/distilbart-cnn-12-6"},
- [context]
-)
-```
-{% endtab %}
-{% endtabs %}
diff --git a/pgml-dashboard/src/components/pages/docs/landing_page/mod.rs b/pgml-dashboard/src/components/pages/docs/landing_page/mod.rs
index 16f80ab9c..063c5051b 100644
--- a/pgml-dashboard/src/components/pages/docs/landing_page/mod.rs
+++ b/pgml-dashboard/src/components/pages/docs/landing_page/mod.rs
@@ -19,9 +19,8 @@ lazy_static! {
("installation", "fullscreen"),
("collections", "overview_key"),
("pipelines", "climate_mini_split"),
+ ("semantic search", "book"),
("semantic search using instructor model", "book"),
- ("extractive question answering", "book"),
- ("summarizing question answering", "book"),
("postgresml is 8-40x faster than python http microservices", "fit_page"),
("scaling to 1 million requests per second", "bolt"),
("mindsdb vs postgresml", "arrow_split"),
@@ -43,14 +42,11 @@ lazy_static! {
.into_iter()
.map(|s| s.to_owned())
.collect();
- static ref TUTORIAL_TARGETS: Vec