diff --git a/pgml-cms/docs/README.md b/pgml-cms/docs/README.md
index a698c121a..8c4d7edb5 100644
--- a/pgml-cms/docs/README.md
+++ b/pgml-cms/docs/README.md
@@ -4,27 +4,27 @@ description: The key concepts that make up PostgresML.
# Overview
-PostgresML is a complete MLOps platform built on PostgreSQL.
+PostgresML is a complete MLOps platform built on PostgreSQL.
> _Move the models to the database_, _rather than continuously moving the data to the models._
-The data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable and reliable to move the models to the database, rather than continuously moving the data to the models\_.\_ PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities and goals:
+The data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable and reliable to move the models to the database, rather than continuously moving the data to the models. PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities and goals:
* **Model Serving** - _**GPU accelerated**_ inference engine for interactive applications, with no additional networking latency or reliability costs.
* **Model Store** - Download _**open-source**_ models including state of the art LLMs from HuggingFace, and track changes in performance between versions.
* **Model Training** - Train models with _**your application data**_ using more than 50 algorithms for regression, classification or clustering tasks. Fine tune pre-trained models like LLaMA and BERT to improve performance.
-* **Feature Store** - _**Scalable**_ access to model inputs, including vector, text, categorical, and numeric data. Vector database, text search, knowledge graph and application data all in one _**low-latency**_ system.
+* **Feature Store** - _**Scalable**_ access to model inputs, including vector, text, categorical, and numeric data. Vector database, text search, knowledge graph and application data all in one _**low-latency**_ system.
PostgresML handles all of the functions typically performed by a cacophony of services, as described by a16z.

*Figure: A PostgresML deployment at scale*
const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
+
{% endtab %}
{% tab title="Python" %}
```python
-pipeline = Pipeline(
- "test_pipeline",
- {
- "abstract": {
- "semantic_search": {
- "model": "intfloat/e5-small",
- },
- "full_text_search": {"configuration": "english"},
- },
- "body": {
- "splitter": {"model": "recursive_character"},
- "semantic_search": {
- "model": "hkunlp/instructor-base",
- "parameters": {
- "instruction": "Represent the Wikipedia document for retrieval: ",
- },
- },
- },
- },
-)
collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
```
{% endtab %}
{% endtabs %}
-This creates a `Pipeline` that is capable of full text search and semantic search on the `abstract` and semantic search on the `body` of documents.
-
-## **Doing vector search**
+### **Vector search with custom limit**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- },
- limit: 5,
- },
- pipeline,
-);
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- },
- "limit": 5,
- },
- pipeline,
-)
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
```
{% endtab %}
{% endtabs %}
-Let's break this down. `vector_search` takes in a `JSON` object and a `Pipeline`. The `JSON` object currently supports two keys: `query` and `limit` . The `limit` limits how many chunks should be returned, the `query` specifies the actual query to perform. Let's see another more complicated example:
+### **Metadata Filtering**
+
+We provide powerful and flexible, arbitrarily nested metadata filtering based on the [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support every operator listed there except `$nin`.
+
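+For example, since `$nin` is not supported, the same exclusion can be composed from the supported `$and` and `$ne` operators. Below is a minimal sketch (it assumes the same `test_collection` and `test_pipeline` used throughout this section) that emulates `{"uuid": {"$nin": [1, 2, 3]}}`:
+
+```python
+from pgml import Collection, Pipeline
+
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+
+# Keep only documents whose uuid is none of 1, 2 or 3
+results = (
+    await collection.query()
+    .vector_recall("Here is some query", pipeline)
+    .limit(10)
+    .filter({
+        "metadata": {
+            "$and": [
+                {"uuid": {"$ne": 1}},
+                {"uuid": {"$ne": 2}},
+                {"uuid": {"$ne": 3}},
+            ]
+        }
+    })
+    .fetch_all()
+)
+```
+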
+**Vector search with $eq metadata filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const query = "What is the best database?";
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- abstract: {
- query: query,
- full_text_filter: "database"
- },
- body: {
- query: query, parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- },
- limit: 5,
- },
- pipeline,
-);
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ metadata: {
+ uuid: {
+ $eq: 1
+ }
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-query = "What is the best database?"
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "abastract": {
- "query": query,
- "full_text_filter": "database",
- },
- "body": {
- "query": query,
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- },
- "limit": 5,
- },
- pipeline,
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "metadata": {
+ "uuid": {
+ "$eq": 1
+ }
+ }
+ })
+ .fetch_all()
)
-
```
{% endtab %}
{% endtabs %}
-The `query` in this example is slightly more intricate. We are doing vector search over both the `abstract` and `body` keys of our documents. This means our search may return chunks from both the `abstract` and `body` of our documents. We are also filtering out all `abstract` chunks that do not contain the text `"database"` we can do this because we enabled `full_text_search` on the `abstract` key in the `Pipeline` schema. Also note that the model used for embedding the `body` takes parameters, but not the model used for embedding the `abstract`.
-
-## **Filtering**
+The above query would filter out all documents that do not contain a key `uuid` with a value equal to `1`.
-We provide powerful and flexible arbitrarly nested filtering based off of [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support each operator mentioned except the `$nin`.
-
-**Vector search with $eq filtering**
+**Vector search with $gte metadata filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- filter: {
- user_id: {
- $eq: 1
- }
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ metadata: {
+ index: {
+ $gte: 3
}
- },
- limit: 5,
- },
- pipeline,
-);
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- "filter": {"user_id": {"$eq": 1}},
- },
- "limit": 5,
- },
- pipeline,
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "metadata": {
+ "index": {
+ "$gte": 3
+ }
+ }
+ })
+ .fetch_all()
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all chunks from documents that do not contain a key `user_id` equal to `1`.
+The above query would filter out all documents that do not contain a key `index` with a value greater than or equal to `3`.
-**Vector search with $gte filtering**
+**Vector search with $or and $and metadata filtering**
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ metadata: {
+ $or: [
+ {
+ $and: [
+ {
+ $eq: {
+ uuid: 1
+ }
+ },
+ {
+ $lt: {
+ index: 100
+ }
+ }
+ ]
},
- },
- filter: {
- user_id: {
- $gte: 1
+ {
+ special: {
+          $ne: true
+ }
}
- }
- },
- limit: 5,
- },
- pipeline,
-);
+ ]
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "metadata": {
+ "$or": [
+ {
+ "$and": [
+ {
+ "$eq": {
+ "uuid": 1
+ }
+ },
+ {
+ "$lt": {
+ "index": 100
+ }
+ }
+ ]
},
- },
- "filter": {"user_id": {"$gte": 1}},
- },
- "limit": 5,
- },
- pipeline,
+ {
+ "special": {
+ "$ne": True
+ }
+ }
+ ]
+ }
+ })
+ .fetch_all()
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all documents that do not contain a key `user_id` with a value greater than or equal to `1`.
+The above query would return only documents that either do not have a key `special` with a value of `True`, or have both a key `uuid` equal to `1` and a key `index` less than `100`.
+
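+To make the boolean logic explicit, here is a plain-Python sketch (illustrative only, not part of the SDK) of the predicate this filter expresses per document:
+
+```python
+def matches(doc: dict) -> bool:
+    # ($eq uuid 1 AND $lt index 100) OR ($ne special True)
+    return (
+        (doc.get("uuid") == 1 and doc.get("index", float("inf")) < 100)
+        or doc.get("special") is not True
+    )
+```
+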
+### **Full Text Filtering**
-**Vector search with $or and $and filtering**
+If full text search is enabled for the associated `Pipeline`, documents can first be filtered by full text search and then recalled by embedding similarity.
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-const results = await collection.vector_search(
- {
- query: {
- fields: {
- body: {
- query: "What is the best database?", parameters: {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- }
- },
- },
- filter: {
- $or: [
- {
- $and: [
- {
- $eq: {
- user_id: 1
- }
- },
- {
- $lt: {
- user_score: 100
- }
- }
- ]
- },
- {
- special: {
- $ne: true
- }
- }
- ]
- }
- },
- limit: 5,
- },
- pipeline,
-);
+const collection = pgml.newCollection("test_collection")
+const pipeline = pgml.newPipeline("test_pipeline")
+const results = await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ full_text: {
+ configuration: "english",
+ text: "Match Me"
+ }
+ })
+ .fetch_all()
```
{% endtab %}
{% tab title="Python" %}
```python
-results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "body": {
- "query": "What is the best database?",
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: ",
- },
- },
- },
- "filter": {
- "$or": [
- {"$and": [{"$eq": {"user_id": 1}}, {"$lt": {"user_score": 100}}]},
- {"special": {"$ne": True}},
- ],
- },
- },
- "limit": 5,
- },
- pipeline,
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+results = (
+ await collection.query()
+ .vector_recall("Here is some query", pipeline)
+ .limit(10)
+ .filter({
+ "full_text": {
+ "configuration": "english",
+ "text": "Match Me"
+ }
+ })
+ .fetch_all()
)
```
{% endtab %}
{% endtabs %}
-The above query would filter out all documents that do not have a key `special` with a value `True` or (have a key `user_id` equal to 1 and a key `user_score` less than 100).
+The above query would first filter out all documents that do not match the full text search criteria, and then perform vector recall on the remaining documents.
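+
+The two filter types can also be combined. The following is a minimal sketch, assuming a single `filter` call accepts `metadata` and `full_text` keys together, that restricts recall to documents matching both the text search and a metadata predicate:
+
+```python
+from pgml import Collection, Pipeline
+
+collection = Collection("test_collection")
+pipeline = Pipeline("test_pipeline")
+
+# Assumed combination: metadata and full text filters in one call
+results = (
+    await collection.query()
+    .vector_recall("Here is some query", pipeline)
+    .limit(10)
+    .filter({
+        "metadata": {"index": {"$gte": 3}},
+        "full_text": {
+            "configuration": "english",
+            "text": "Match Me"
+        }
+    })
+    .fetch_all()
+)
+```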
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
index ed07f8b2c..84ce15b78 100644
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/README.md
@@ -1,6 +1,2 @@
# Tutorials
-We have a number of tutorials / examples for our Python and JavaScript SDK. For a full list of examples check out:
-
-* [JavaScript Examples on Github](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml/javascript/examples)
-* [Python Examples on Github](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml/python/examples)
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md
new file mode 100644
index 000000000..78abc3a09
--- /dev/null
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/extractive-question-answering.md
@@ -0,0 +1,161 @@
+---
+description: >-
+ JavaScript and Python code snippets for end-to-end question answering.
+---
+# Extractive Question Answering
+
+This tutorial walks through JavaScript and Python code snippets for end-to-end extractive question answering:
+
+## Imports and Setup
+
+The SDK and datasets are imported. Builtins are used in Python for transforming text.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pgml = require("pgml");
+require("dotenv").config();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Model, Splitter, Pipeline, Builtins
+from datasets import load_dataset
+from dotenv import load_dotenv
+```
+{% endtab %}
+{% endtabs %}
+
+## Initialize Collection
+
+A collection is created to hold context passages.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const collection = pgml.newCollection("my_javascript_eqa_collection");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+collection = Collection("squad_collection")
+```
+{% endtab %}
+{% endtabs %}
+
+## Create Pipeline
+
+A pipeline is created and added to the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pipeline = pgml.newPipeline(
+ "my_javascript_eqa_pipeline",
+ pgml.newModel(),
+ pgml.newSplitter(),
+);
+
+await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+model = Model()
+splitter = Splitter()
+pipeline = Pipeline("squadv1", model, splitter)
+await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+## Upsert Documents
+
+Context passages from SQuAD are upserted into the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const documents = [
+ {
+ id: "...",
+ text: "...",
+ }
+];
+
+await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Map SQuAD records to documents using the dataset's id and context fields
+data = load_dataset("squad", split="train")
+
+documents = [
+    {"id": r["id"], "text": r["context"]}
+    for r in data
+]
+
+await collection.upsert_documents(documents)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query for Context
+
+A vector search query retrieves context passages.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const queryResults = await collection
+ .query()
+ .vector_recall(query, pipeline)
+ .fetch_all();
+
+const context = queryResults
+ .map(result => result[1])
+ .join("\n");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+results = (
+    await collection.query()
+    .vector_recall(query, pipeline)
+    .fetch_all()
+)
+
+context = " ".join(result[1] for result in results)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query for Answer
+
+The context is passed to a QA model to extract the answer.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const builtins = pgml.newBuiltins();
+
+const answer = await builtins.transform("question-answering", [
+ JSON.stringify({question, context})
+]);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+builtins = Builtins()
+
+answer = await builtins.transform(
+ "question-answering",
+ [{"question": query, "context": context}]
+)
+```
+{% endtab %}
+{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md
deleted file mode 100644
index 49aa6461b..000000000
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-1.md
+++ /dev/null
@@ -1,231 +0,0 @@
----
-description: Example for Semantic Search
----
-
-# Semantic Search Using Instructor Model
-
-This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. In this tutorial we use [hkunlp/instructor-base](https://huggingface.co/hkunlp/instructor-base), a more advanced embeddings model that takes parameters when doing embedding and recall.
-
-[Link to full JavaScript implementation](../../../../../../pgml-sdks/pgml/javascript/examples/question\_answering.js)
-
-[Link to full Python implementation](../../../../../../pgml-sdks/pgml/python/examples/question\_answering.py)
-
-## Imports and Setup
-
-The SDK is imported and environment variables are loaded.
-
-{% tabs %}
-{% tab title="JavasScript" %}
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-from pgml import Collection, Pipeline
-from datasets import load_dataset
-from time import time
-from dotenv import load_dotenv
-from rich.console import Console
-import asyncio
-```
-{% endtab %}
-{% endtabs %}
-
-## Initialize Collection
-
-A collection object is created to represent the search collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
-const main = async () => { // Open the main function, we close it at the bottom
- // Initialize the collection
- const collection = pgml.newCollection("qa_collection");
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-async def main(): # Start the main function, we end it after archiving
- load_dotenv()
- console = Console()
-
- # Initialize collection
- collection = Collection("squad_collection")
-```
-{% endtab %}
-{% endtabs %}
-
-## Create Pipeline
-
-A pipeline encapsulating a model and splitter is created and added to the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- // Add a pipeline
- const pipeline = pgml.newPipeline("qa_pipeline", {
- text: {
- splitter: { model: "recursive_character" },
- semantic_search: {
- model: "intfloat/e5-small",
- },
- },
- });
- await collection.add_pipeline(pipeline);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- # Create and add pipeline
- pipeline = Pipeline(
- "squadv1",
- {
- "text": {
- "splitter": {"model": "recursive_character"},
- "semantic_search": {
- "model": "hkunlp/instructor-base",
- "parameters": {
- "instruction": "Represent the Wikipedia document for retrieval: "
- },
- },
- }
- },
- )
- await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-## Upsert Documents
-
-Documents are upserted into the collection and indexed by the pipeline.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- // Upsert documents, these documents are automatically split into chunks and embedded by our pipeline
- const documents = [
- {
- id: "Document One",
- text: "PostgresML is the best tool for machine learning applications!",
- },
- {
- id: "Document Two",
- text: "PostgresML is open source and available to everyone!",
- },
- ];
- await collection.upsert_documents(documents);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- # Prep documents for upserting
- data = load_dataset("squad", split="train")
- data = data.to_pandas()
- data = data.drop_duplicates(subset=["context"])
- documents = [
- {"id": r["id"], "text": r["context"], "title": r["title"]}
- for r in data.to_dict(orient="records")
- ]
-
- # Upsert documents
- await collection.upsert_documents(documents[:200])
-```
-{% endtab %}
-{% endtabs %}
-
-## Query
-
-A vector similarity search query is made on the collection.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- // Perform vector search
- const query = "What is the best tool for building machine learning applications?";
- const queryResults = await collection.vector_search(
- {
- query: {
- fields: {
- text: { query: query }
- }
- }, limit: 1
- }, pipeline);
- console.log(queryResults);
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- # Query for answer
- query = "Who won more than 20 grammy awards?"
- console.print("Querying for context ...")
- start = time()
- results = await collection.vector_search(
- {
- "query": {
- "fields": {
- "text": {
- "query": query,
- "parameters": {
- "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
- },
- },
- }
- },
- "limit": 5,
- },
- pipeline,
- )
- end = time()
- console.print("\n Results for '%s' " % (query), style="bold")
- console.print(results)
- console.print("Query time = %0.3f" % (end - start))
-```
-{% endtab %}
-{% endtabs %}
-
-## Archive Collection
-
-The collection is archived when finished.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```js
- await collection.archive();
-} // Close the main function
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
- await collection.archive()
-# The end of the main function
-```
-{% endtab %}
-{% endtabs %}
-
-## Main
-
-Boilerplate to call main() async function.
-
-{% tabs %}
-{% tab title="JavaScript" %}
-```javascript
-main().then(() => console.log("Done!"));
-```
-{% endtab %}
-
-{% tab title="Python" %}
-```python
-if __name__ == "__main__":
- asyncio.run(main())
-```
-{% endtab %}
-{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md
new file mode 100644
index 000000000..697845b55
--- /dev/null
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search-using-instructor-model.md
@@ -0,0 +1,127 @@
+---
+description: >-
+ JavaScript and Python code snippets for using instructor models in more advanced search use cases.
+---
+# Semantic Search Using Instructor Model
+
+This tutorial shows how to use instructor models with the `pgml` SDK for more advanced search use cases.
+
+## Imports and Setup
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pgml = require("pgml");
+require("dotenv").config();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Model, Splitter, Pipeline
+from datasets import load_dataset
+from dotenv import load_dotenv
+```
+{% endtab %}
+{% endtabs %}
+
+## Initialize Collection
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const collection = pgml.newCollection("my_javascript_qai_collection");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+collection = Collection("squad_collection_1")
+```
+{% endtab %}
+{% endtabs %}
+
+## Create Pipeline
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const model = pgml.newModel("hkunlp/instructor-base", "pgml", {
+ instruction: "Represent the Wikipedia document for retrieval: ",
+});
+
+const pipeline = pgml.newPipeline(
+ "my_javascript_qai_pipeline",
+ model,
+ pgml.newSplitter(),
+);
+
+await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+model = Model("hkunlp/instructor-base", parameters={
+ "instruction": "Represent the Wikipedia document for retrieval: "
+})
+
+pipeline = Pipeline("squad_instruction", model, Splitter())
+await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+## Upsert Documents
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const documents = [
+  {
+    id: "...",
+    text: "...",
+  },
+];
+
+await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Map SQuAD records to documents using the dataset's id and context fields
+data = load_dataset("squad", split="train")
+
+documents = [
+    {"id": r["id"], "text": r["context"]} for r in data
+]
+
+await collection.upsert_documents(documents)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const queryResults = await collection
+ .query()
+ .vector_recall(query, pipeline, {
+ instruction:
+ "Represent the Wikipedia question for retrieving supporting documents: ",
+ })
+ .fetch_all();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+results = (
+    await collection.query()
+    .vector_recall(query, pipeline, {
+        "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
+    })
+    .fetch_all()
+)
+```
+{% endtab %}
+{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
index 726ef3fa3..89bf07cd8 100644
--- a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/semantic-search.md
@@ -4,11 +4,7 @@ description: Example for Semantic Search
# Semantic Search
-This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished.
-
-[Link to full JavaScript implementation](../../../../../../pgml-sdks/pgml/javascript/examples/semantic\_search.js)
-
-[Link to full Python implementation](../../../../../../pgml-sdks/pgml/python/examples/semantic\_search.py)
+This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. It upserts sample documents, runs a vector similarity search, and prints the results.
## Imports and Setup
@@ -18,17 +14,16 @@ The SDK is imported and environment variables are loaded.
{% tab title="JavasScript" %}
```js
const pgml = require("pgml");
+
require("dotenv").config();
```
{% endtab %}
{% tab title="Python" %}
```python
-from pgml import Collection, Pipeline
+from pgml import Collection, Model, Splitter, Pipeline
from datasets import load_dataset
-from time import time
from dotenv import load_dotenv
-from rich.console import Console
import asyncio
```
{% endtab %}
@@ -41,20 +36,17 @@ A collection object is created to represent the search collection.
{% tabs %}
{% tab title="JavaScript" %}
```js
-const main = async () => { // Open the main function, we close it at the bottom
- // Initialize the collection
- const collection = pgml.newCollection("semantic_search_collection");
+const main = async () => {
+ const collection = pgml.newCollection("my_javascript_collection");
+}
```
{% endtab %}
{% tab title="Python" %}
```python
-async def main(): # Start the main function, we end it after archiving
+async def main():
load_dotenv()
- console = Console()
-
- # Initialize collection
- collection = Collection("quora_collection")
+ collection = Collection("my_collection")
```
{% endtab %}
{% endtabs %}
@@ -66,32 +58,19 @@ A pipeline encapsulating a model and splitter is created and added to the collec
{% tabs %}
{% tab title="JavaScript" %}
```js
- // Add a pipeline
- const pipeline = pgml.newPipeline("semantic_search_pipeline", {
- text: {
- splitter: { model: "recursive_character" },
- semantic_search: {
- model: "intfloat/e5-small",
- },
- },
- });
- await collection.add_pipeline(pipeline);
+const model = pgml.newModel();
+const splitter = pgml.newSplitter();
+const pipeline = pgml.newPipeline("my_javascript_pipeline", model, splitter);
+await collection.add_pipeline(pipeline);
```
{% endtab %}
{% tab title="Python" %}
```python
- # Create and add pipeline
- pipeline = Pipeline(
- "quorav1",
- {
- "text": {
- "splitter": {"model": "recursive_character"},
- "semantic_search": {"model": "intfloat/e5-small"},
- }
- },
- )
- await collection.add_pipeline(pipeline)
+model = Model()
+splitter = Splitter()
+pipeline = Pipeline("my_pipeline", model, splitter)
+await collection.add_pipeline(pipeline)
```
{% endtab %}
{% endtabs %}
@@ -103,37 +82,29 @@ Documents are upserted into the collection and indexed by the pipeline.
{% tabs %}
{% tab title="JavaScript" %}
```js
- // Upsert documents, these documents are automatically split into chunks and embedded by our pipeline
- const documents = [
- {
- id: "Document One",
- text: "document one contents...",
- },
- {
- id: "Document Two",
- text: "document two contents...",
- },
- ];
- await collection.upsert_documents(documents);
+const documents = [
+ {
+ id: "Document One",
+ text: "...",
+ },
+ {
+ id: "Document Two",
+ text: "...",
+ },
+];
+
+await collection.upsert_documents(documents);
```
{% endtab %}
{% tab title="Python" %}
```python
- # Prep documents for upserting
- dataset = load_dataset("quora", split="train")
- questions = []
- for record in dataset["questions"]:
- questions.extend(record["text"])
-
- # Remove duplicates and add id
- documents = []
- for i, question in enumerate(list(set(questions))):
- if question:
- documents.append({"id": i, "text": question})
-
- # Upsert documents
- await collection.upsert_documents(documents[:2000])
+documents = [
+ {"id": "doc1", "text": "..."},
+ {"id": "doc2", "text": "..."}
+]
+
+await collection.upsert_documents(documents)
```
{% endtab %}
{% endtabs %}
@@ -145,34 +116,21 @@ A vector similarity search query is made on the collection.
{% tabs %}
{% tab title="JavaScript" %}
```js
- // Perform vector search
- const query = "Something that will match document one first";
- const queryResults = await collection.vector_search(
- {
- query: {
- fields: {
- text: { query: query }
- }
- }, limit: 2
- }, pipeline);
- console.log("The results");
- console.log(queryResults);
+const queryResults = await collection
+ .query()
+ .vector_recall(
+ "query",
+ pipeline,
+ )
+ .fetch_all();
```
{% endtab %}
{% tab title="Python" %}
```python
- # Query
- query = "What is a good mobile os?"
- console.print("Querying for %s..." % query)
- start = time()
- results = await collection.vector_search(
- {"query": {"fields": {"text": {"query": query}}}, "limit": 5}, pipeline
- )
- end = time()
- console.print("\n Results for '%s' " % (query), style="bold")
- console.print(results)
- console.print("Query time = %0.3f" % (end - start))
+results = (
+    await collection.query()
+    .vector_recall("query", pipeline)
+    .fetch_all()
+)
```
{% endtab %}
{% endtabs %}
@@ -184,15 +142,13 @@ The collection is archived when finished.
{% tabs %}
{% tab title="JavaScript" %}
```js
- await collection.archive();
-} // Close the main function
+await collection.archive();
```
{% endtab %}
{% tab title="Python" %}
```python
- await collection.archive()
-# The end of the main function
+await collection.archive()
```
{% endtab %}
{% endtabs %}
@@ -204,7 +160,9 @@ Boilerplate to call main() async function.
{% tabs %}
{% tab title="JavaScript" %}
```javascript
-main().then(() => console.log("Done!"));
+main().then((results) => {
+ console.log("Vector search Results: \n", results);
+});
```
{% endtab %}
diff --git a/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md
new file mode 100644
index 000000000..caa7c8a59
--- /dev/null
+++ b/pgml-cms/docs/introduction/apis/client-sdks/tutorials/summarizing-question-answering.md
@@ -0,0 +1,164 @@
+---
+description: >-
+ JavaScript and Python code snippets for text summarization.
+---
+# Summarizing Question Answering
+
+Here are the Python and JavaScript examples for text summarization using the `pgml` SDK.
+
+## Imports and Setup
+
+The SDK and datasets are imported. Builtins are used for transformations.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pgml = require("pgml");
+require("dotenv").config();
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+from pgml import Collection, Model, Splitter, Pipeline, Builtins
+from datasets import load_dataset
+from dotenv import load_dotenv
+```
+{% endtab %}
+{% endtabs %}
+
+## Initialize Collection
+
+A collection is created to hold text passages.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const collection = pgml.newCollection("my_javascript_sqa_collection");
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+collection = Collection("squad_collection")
+```
+{% endtab %}
+{% endtabs %}
+
+## Create Pipeline
+
+A pipeline is created and added to the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const pipeline = pgml.newPipeline(
+ "my_javascript_sqa_pipeline",
+ pgml.newModel(),
+ pgml.newSplitter(),
+);
+
+await collection.add_pipeline(pipeline);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+model = Model()
+splitter = Splitter()
+pipeline = Pipeline("squadv1", model, splitter)
+await collection.add_pipeline(pipeline)
+```
+{% endtab %}
+{% endtabs %}
+
+## Upsert Documents
+
+Text passages are upserted into the collection.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const documents = [
+ {
+ id: "...",
+ text: "...",
+ }
+];
+
+await collection.upsert_documents(documents);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+# Map SQuAD records to documents using the dataset's id and context fields
+data = load_dataset("squad", split="train")
+
+documents = [
+    {"id": r["id"], "text": r["context"]}
+    for r in data
+]
+
+await collection.upsert_documents(documents)
+```
+{% endtab %}
+{% endtabs %}
+
+## Query for Context
+
+A vector search retrieves a relevant text passage.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const queryResults = await collection
+ .query()
+ .vector_recall(query, pipeline)
+ .fetch_all();
+
+const context = queryResults[0][1];
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+results = (
+    await collection.query()
+    .vector_recall(query, pipeline)
+    .fetch_all()
+)
+
+context = results[0][1]
+```
+{% endtab %}
+{% endtabs %}
+
+## Summarize Text
+
+The text is summarized using a pretrained model.
+
+{% tabs %}
+{% tab title="JavaScript" %}
+```js
+const builtins = pgml.newBuiltins();
+
+const summary = await builtins.transform(
+  { task: "summarization", model: "sshleifer/distilbart-cnn-12-6" },
+  [context]
+);
+```
+{% endtab %}
+
+{% tab title="Python" %}
+```python
+builtins = Builtins()
+
+summary = await builtins.transform(
+    {"task": "summarization", "model": "sshleifer/distilbart-cnn-12-6"},
+    [context]
+)
+```
+{% endtab %}
+{% endtabs %}
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
index e296155af..22dd3733c 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.deploy.md
@@ -1,3 +1,8 @@
+---
+description: >-
+  Release trained models when ML quality metrics computed during training improve. Track model deployments over time and roll back if needed.
+---
+
# pgml.deploy()
## Deployments
@@ -27,7 +32,7 @@ pgml.deploy(
There are 3 different deployment strategies available:
| Strategy | Description |
-| ------------- | ------------------------------------------------------------------------------------------------ |
+| ------------- |--------------------------------------------------------------------------------------------------|
| `most_recent` | The most recently trained model for this project is immediately deployed, regardless of metrics. |
| `best_score` | The model that achieved the best key metric score is immediately deployed. |
| `rollback` | The model that was deployed before to the current one is deployed. |
@@ -79,6 +84,8 @@ SELECT * FROM pgml.deploy(
(1 row)
```
+
+
### Rolling Back
In case the new model isn't performing well in production, it's easy to rollback to the previous version. A rollback creates a new deployment for the old model. Multiple rollbacks in a row will oscillate between the two most recently deployed models, making rollbacks a safe and reversible operation.
@@ -123,7 +130,7 @@ SELECT * FROM pgml.deploy(
### Specific Model IDs
-In the case you need to deploy an exact model that is not the `most_recent` or `best_score`, you may deploy a model by id. Model id's can be found in the `pgml.models` table.
+If you need to deploy an exact model that is not the `most_recent` or `best_score`, you may deploy a model by ID. Model IDs can be found in the `pgml.models` table.
#### SQL
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md
index 6b392bc26..61f6a6b0e 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.embed.md
@@ -1,3 +1,8 @@
+---
+description: >-
+ Generate high quality embeddings with faster end-to-end vector operations without an additional vector database.
+---
+
# pgml.embed()
Embeddings are a numeric representation of text. They are used to represent words and sentences as vectors, an array of numbers. Embeddings can be used to find similar pieces of text, by comparing the similarity of the numeric vectors using a distance measure, or they can be used as input features for other machine learning models, since most algorithms can't use text directly.
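+
+To make the distance measure concrete, here is a minimal, dependency-free sketch (illustrative only, not part of the extension itself) of cosine similarity, the comparison typically used between embedding vectors:
+
+```python
+def cosine_similarity(a: list[float], b: list[float]) -> float:
+    # Dot product divided by the product of the vector magnitudes
+    dot = sum(x * y for x, y in zip(a, b))
+    norm_a = sum(x * x for x in a) ** 0.5
+    norm_b = sum(x * x for x in b) ** 0.5
+    return dot / (norm_a * norm_b)
+
+# Identical directions score 1.0; orthogonal vectors score 0.0
+print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
+print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
+```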
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md
index 68373638a..6566497e5 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.predict/README.md
@@ -1,3 +1,8 @@
+---
+description: >-
+ Batch predict from data in a table. Online predict with parameters passed in a query. Automatically reuse pre-processing steps from training.
+---
+
# pgml.predict()
## API
@@ -51,7 +56,7 @@ LIMIT 25;
### Classification Example
-If you've already been through the [pgml.train](../pgml.train "mention") examples, you can see the predictive results of those models:
+If you've already been through the [pgml.train](../pgml.train/ "mention") examples, you can see the predictive results of those models:
```sql
SELECT
diff --git a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md
index 5f5b0d89e..d00460bfa 100644
--- a/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md
+++ b/pgml-cms/docs/introduction/apis/sql-extensions/pgml.train/README.md
@@ -1,8 +1,6 @@
---
description: >-
- The training function is at the heart of PostgresML. It's a powerful single
- mechanism that can handle many different training tasks which are configurable
- with the function parameters.
+ Pre-process and pull data to train a model using any of 50 different ML algorithms.
---
# pgml.train()
@@ -35,7 +33,7 @@ pgml.train(
| Parameter | Example | Description |
| --------------- | ----------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_name` | `'Search Results Ranker'` | An easily recognizable identifier to organize your work. |
-| `task` | `'regression'` | The objective of the experiment: `regression`, `classification` or `cluster` |
+| `task` | `'regression'` | The objective of the experiment: `regression`, `classification` or `cluster` |
| `relation_name` | `'public.search_logs'` | The Postgres table or view where the training data is stored or defined. |
| `y_column_name` | `'clicked'` | The name of the label (aka "target" or "unknown") column in the training table. |
| `algorithm` | `'xgboost'` | The algorithm to train on the dataset; see the task-specific pages for available algorithms: regression.md, classification.md, clustering.md |
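+
+The parameters in the table map directly onto a `pgml.train()` call. As a minimal sketch (the connection string is hypothetical and assumes a psycopg 3 client pointed at your own PostgresML database), the example values above can be passed using SQL named notation:
+
+```python
+import psycopg  # psycopg 3
+
+# Hypothetical DSN; point this at your PostgresML database
+DSN = "postgres://user:pass@localhost:5432/pgml"
+
+with psycopg.connect(DSN) as conn:
+    # Train an XGBoost regression model on the example table
+    result = conn.execute(
+        """
+        SELECT * FROM pgml.train(
+            project_name => 'Search Results Ranker',
+            task => 'regression',
+            relation_name => 'public.search_logs',
+            y_column_name => 'clicked',
+            algorithm => 'xgboost'
+        );
+        """
+    ).fetchone()
+    print(result)
+```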