diff --git a/README.md b/README.md
index aa585e2d0..3a34cb672 100644
--- a/README.md
+++ b/README.md
@@ -138,6 +138,10 @@ SELECT pgml.predict(
# Installation
PostgresML installation consists of three parts: a PostgreSQL database, a Postgres extension for machine learning, and a dashboard app. The extension provides all the machine learning functionality and can be used independently from any SQL IDE. The dashboard app provides an easy-to-use interface for writing SQL notebooks and for running and tracking ML experiments and models.
+## Serverless Cloud
+
+If you want to check out the functionality without the hassle of Docker, [sign up for a free PostgresML account](https://postgresml.org/signup). You'll get a free database in seconds, with access to GPUs and state-of-the-art LLMs.
+
## Docker
```
@@ -150,19 +154,14 @@ docker run \
sudo -u postgresml psql -d postgresml
```
-For more details, take a look at our [Quick Start with Docker](https://postgresml.org/docs/guides/setup/quick_start_with_docker) documentation.
-
-## Serverless Cloud
-
-If you want to check out the functionality without the hassle of Docker, [sign up for a free PostgresML account](https://postgresml.org/signup). You'll get a free database in seconds, with access to GPUs and state of the art LLMs.
+For more details, take a look at our [Quick Start with Docker](https://postgresml.org/docs/guides/developer-docs/quick-start-with-docker) documentation.
# Getting Started
## Option 1
-- On local installation, go to dashboard app at `http://localhost:8000/` to use SQL notebooks.
-
- On the cloud console, click the **Dashboard** button to connect to your instance with a SQL notebook, or connect directly with the tools listed below.
+- On a local installation, go to the dashboard app at `http://localhost:8000/` to use SQL notebooks.
## Option 2
diff --git a/docker/dashboard.sh b/docker/dashboard.sh
index e4be965da..92b395674 100644
--- a/docker/dashboard.sh
+++ b/docker/dashboard.sh
@@ -4,6 +4,7 @@ set -e
export DATABASE_URL=postgres://postgresml:postgresml@127.0.0.1:5432/postgresml
export DASHBOARD_STATIC_DIRECTORY=/usr/share/pgml-dashboard/dashboard-static
export DASHBOARD_CONTENT_DIRECTORY=/usr/share/pgml-dashboard/dashboard-content
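+# Path to the GitBook documentation tree (pgml-docs) served by the dashboard.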
+export DASHBOARD_CONTENT_DOCS=/usr/share/pgml-docs
export SEARCH_INDEX_DIRECTORY=/var/lib/pgml-dashboard/search-index
export ROCKET_SECRET_KEY=$(openssl rand -hex 32)
export ROCKET_ADDRESS=0.0.0.0
diff --git a/packages/postgresml-dashboard/etc/systemd/system/pgml-dashboard.service b/packages/postgresml-dashboard/etc/systemd/system/pgml-dashboard.service
index 2e130814c..b2a1028a5 100644
--- a/packages/postgresml-dashboard/etc/systemd/system/pgml-dashboard.service
+++ b/packages/postgresml-dashboard/etc/systemd/system/pgml-dashboard.service
@@ -7,6 +7,7 @@ StartLimitIntervalSec=0
Environment=RUST_LOG=info
Environment=DASHBOARD_STATIC_DIRECTORY=/usr/share/pgml-dashboard/dashboard-static
Environment=DASHBOARD_CONTENT_DIRECTORY=/usr/share/pgml-dashboard/dashboard-content
+Environment=DASHBOARD_CONTENT_DOCS=/usr/share/pgml-docs
Environment=ROCKET_ADDRESS=0.0.0.0
Environment=GITHUB_STARS=${GITHUB_STARS}
Environment=SEARCH_INDEX_DIRECTORY=/var/lib/pgml-dashboard/search-index
diff --git a/pgml-dashboard/.env.development b/pgml-dashboard/.env.development
index 6129ccd80..81bf7e34a 100644
--- a/pgml-dashboard/.env.development
+++ b/pgml-dashboard/.env.development
@@ -1,2 +1,3 @@
DATABASE_URL=postgres:///pgml_dashboard_development
DEV_MODE=true
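+# Debug-level app logs in development; quiet the noisy crates: tantivy to errors, rocket to info.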
+RUST_LOG=debug,tantivy=error,rocket=info
diff --git a/pgml-dashboard/Cargo.lock b/pgml-dashboard/Cargo.lock
index ba9a3c5ef..0298a5519 100644
--- a/pgml-dashboard/Cargo.lock
+++ b/pgml-dashboard/Cargo.lock
@@ -559,6 +559,15 @@ dependencies = [
"tracing-subscriber",
]
+[[package]]
+name = "convert_case"
+version = "0.6.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ec182b0ca2f35d8fc196cf3404988fd8b8c739a4d270ff118a398feb0cbec1ca"
+dependencies = [
+ "unicode-segmentation",
+]
+
[[package]]
name = "cookie"
version = "0.17.0"
@@ -1741,6 +1750,15 @@ version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c41e0c4fef86961ac6d6f8a82609f55f31b05e4fce149ac5710e439df7619ba4"
+[[package]]
+name = "markdown"
+version = "1.0.0-alpha.13"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "92e9ce98969bb1391c8d6fdac320897ea7e86c4d356e8f220a5abd28b142e512"
+dependencies = [
+ "unicode-id",
+]
+
[[package]]
name = "markup5ever"
version = "0.11.0"
@@ -2186,6 +2204,7 @@ dependencies = [
"chrono",
"comrak",
"console-subscriber",
+ "convert_case",
"csv-async",
"dotenv",
"env_logger",
@@ -2193,6 +2212,7 @@ dependencies = [
"itertools",
"lazy_static",
"log",
+ "markdown",
"num-traits",
"once_cell",
"parking_lot 0.12.1",
@@ -2212,6 +2232,7 @@ dependencies = [
"tantivy",
"time 0.3.23",
"tokio",
+ "url",
"yaml-rust",
"zoomies",
]
@@ -4083,6 +4104,12 @@ version = "0.3.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "92888ba5573ff080736b3648696b70cafad7d250551175acbaa4e0385b3e1460"
+[[package]]
+name = "unicode-id"
+version = "0.3.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b1b6def86329695390197b82c1e244a54a131ceb66c996f2088a3876e2ae083f"
+
[[package]]
name = "unicode-ident"
version = "1.0.11"
diff --git a/pgml-dashboard/Cargo.toml b/pgml-dashboard/Cargo.toml
index 3313a16ff..172b49ddd 100644
--- a/pgml-dashboard/Cargo.toml
+++ b/pgml-dashboard/Cargo.toml
@@ -17,12 +17,14 @@ base64 = "0.21"
comrak = "0.17"
chrono = "0.4"
csv-async = "1"
+convert_case = "0.6"
dotenv = "0.15"
env_logger = "0.10"
itertools = "0.10"
parking_lot = "0.12"
lazy_static = "1.4"
log = "0.4"
+markdown = "1.0.0-alpha.13"
num-traits = "0.2"
once_cell = "1.18"
rand = "0.8"
@@ -39,6 +41,7 @@ sqlx = { version = "0.6.3", features = [ "runtime-tokio-rustls", "postgres", "js
tantivy = "0.19"
time = "0.3"
tokio = { version = "1", features = ["full"] }
+url = "2.4"
yaml-rust = "0.4"
zoomies = { git="https://github.com/HyperparamAI/zoomies.git", branch="master" }
pgvector = { version = "0.2.2", features = [ "sqlx", "postgres" ] }
diff --git a/pgml-dashboard/README.md b/pgml-dashboard/README.md
index a960ad77a..91cfdec00 100644
--- a/pgml-dashboard/README.md
+++ b/pgml-dashboard/README.md
@@ -2,4 +2,4 @@
PostgresML provides a dashboard with analytical views of the training data and model performance, as well as integrated notebooks for rapid iteration. It is primarily written in Rust using [Rocket](https://rocket.rs/) as a lightweight web framework and [SQLx](https://github.com/launchbadge/sqlx) to interact with the database.
-Please see the [quick start instructions](https://postgresml.org/user_guides/setup/quick_start_with_docker/) for general information on installing or deploying PostgresML. A [developer guide](https://postgresml.org/developer_guide/overview/) is also available for those who would like to contribute.
+Please see the [quick start instructions](https://postgresml.org/docs/guides/getting-started/sign-up) for general information on installing or deploying PostgresML. A [developer guide](https://postgresml.org/developer_guide/overview/) is also available for those who would like to contribute.
diff --git a/pgml-dashboard/content/docs/guides/dashboard/overview.md b/pgml-dashboard/content/docs/guides/dashboard/overview.md
index 4f0e16f43..70eb761f6 100644
--- a/pgml-dashboard/content/docs/guides/dashboard/overview.md
+++ b/pgml-dashboard/content/docs/guides/dashboard/overview.md
@@ -1,6 +1,6 @@
# Dashboard
-PostgresML comes with a web app to provide visibility into models and datasets in your database. If you're running [our Docker container](/docs/guides/setup/quick_start_with_docker/), you can view it running on [http://localhost:8000/](http://localhost:8000/).
+PostgresML comes with a web app to provide visibility into models and datasets in your database. If you're running [our Docker container](/docs/guides/developer-docs/quick-start-with-docker), you can view it running on [http://localhost:8000/](http://localhost:8000/).
## Generate example data
diff --git a/pgml-dashboard/content/docs/guides/setup/distributed_training.md b/pgml-dashboard/content/docs/guides/setup/distributed_training.md
index 41ff97e4f..748595f3c 100644
--- a/pgml-dashboard/content/docs/guides/setup/distributed_training.md
+++ b/pgml-dashboard/content/docs/guides/setup/distributed_training.md
@@ -22,7 +22,7 @@ psql \
-f dump.sql
```
-If you're using our Docker stack, you can import the data there:
+If you're using our Docker stack, you can import the data there:
```
psql \
diff --git a/pgml-dashboard/content/docs/guides/setup/v2/installation.md b/pgml-dashboard/content/docs/guides/setup/v2/installation.md
index 683ad7302..3dd865f33 100644
--- a/pgml-dashboard/content/docs/guides/setup/v2/installation.md
+++ b/pgml-dashboard/content/docs/guides/setup/v2/installation.md
@@ -10,7 +10,7 @@ The extension can be installed by compiling it from source, or if you're using U
!!! tip
-If you're just looking to try PostgresML without installing it on your system, take a look at our [Quick Start with Docker](/docs/guides/setup/quick_start_with_docker) guide.
+If you're just looking to try PostgresML without installing it on your system, take a look at our [Quick Start with Docker](/docs/guides/developer-docs/quick-start-with-docker) guide.
!!!
diff --git a/pgml-dashboard/src/api/docs.rs b/pgml-dashboard/src/api/docs.rs
index bfabb78e3..10fc5d948 100644
--- a/pgml-dashboard/src/api/docs.rs
+++ b/pgml-dashboard/src/api/docs.rs
@@ -24,54 +24,36 @@ async fn search(query: &str, index: &State) -> ResponseOk
)
}
-#[get("/docs/", rank = 10)]
-async fn doc_handler<'a>(path: PathBuf, cluster: &Cluster) -> Result {
- let guides = vec![
- NavLink::new("Setup").children(vec![
- NavLink::new("Installation").children(vec![
- NavLink::new("v2").href("/docs/guides/setup/v2/installation"),
- NavLink::new("Upgrade from v1.0 to v2.0")
- .href("/docs/guides/setup/v2/upgrade-from-v1"),
- NavLink::new("v1").href("/docs/guides/setup/installation"),
- ]),
- NavLink::new("Quick Start with Docker")
- .href("/docs/guides/setup/quick_start_with_docker"),
- NavLink::new("Distributed Training").href("/docs/guides/setup/distributed_training"),
- NavLink::new("GPU Support").href("/docs/guides/setup/gpu_support"),
- NavLink::new("Developer Setup").href("/docs/guides/setup/developers"),
- ]),
- NavLink::new("Training").children(vec![
- NavLink::new("Overview").href("/docs/guides/training/overview"),
- NavLink::new("Algorithm Selection").href("/docs/guides/training/algorithm_selection"),
- NavLink::new("Hyperparameter Search")
- .href("/docs/guides/training/hyperparameter_search"),
- NavLink::new("Preprocessing Data").href("/docs/guides/training/preprocessing"),
- NavLink::new("Joint Optimization").href("/docs/guides/training/joint_optimization"),
- ]),
- NavLink::new("Predictions").children(vec![
- NavLink::new("Overview").href("/docs/guides/predictions/overview"),
- NavLink::new("Deployments").href("/docs/guides/predictions/deployments"),
- NavLink::new("Batch Predictions").href("/docs/guides/predictions/batch"),
- ]),
- NavLink::new("Transformers").children(vec![
- NavLink::new("Setup").href("/docs/guides/transformers/setup"),
- NavLink::new("Pre-trained Models").href("/docs/guides/transformers/pre_trained_models"),
- NavLink::new("Fine Tuning").href("/docs/guides/transformers/fine_tuning"),
- NavLink::new("Embeddings").href("/docs/guides/transformers/embeddings"),
- ]),
- NavLink::new("Vector Operations").children(vec![
- NavLink::new("Overview").href("/docs/guides/vector_operations/overview")
- ]),
- NavLink::new("Dashboard").href("/docs/guides/dashboard/overview"),
- NavLink::new("Schema").children(vec![
- NavLink::new("Models").href("/docs/guides/schema/models"),
- NavLink::new("Snapshots").href("/docs/guides/schema/snapshots"),
- NavLink::new("Projects").href("/docs/guides/schema/projects"),
- NavLink::new("Deployments").href("/docs/guides/schema/deployments"),
- ]),
- ];
-
- render(cluster, &path, guides, "Guides", &Path::new("docs")).await
+use rocket::fs::NamedFile;
+
+#[get("/docs/guides/.gitbook/assets/", rank = 10)]
+pub async fn gitbook_assets(path: PathBuf) -> Option {
+ let path = PathBuf::from(&config::docs_dir())
+ .join("docs/guides/.gitbook/assets/")
+ .join(path);
+
+ NamedFile::open(path).await.ok()
+}
+
+#[get("/docs/", rank = 5)]
+async fn doc_handler(path: PathBuf, cluster: &Cluster) -> Result {
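+    // The GitBook table of contents lives in SUMMARY.md; parse it to build the docs nav.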
+ let root = PathBuf::from("docs/guides/");
+ let index_path = PathBuf::from(&config::docs_dir())
+ .join(&root)
+ .join("SUMMARY.md");
+ let contents = tokio::fs::read_to_string(&index_path).await.expect(
+ format!(
+ "could not read table of contents markdown: {:?}",
+ index_path
+ )
+ .as_str(),
+ );
+ let mdast = ::markdown::to_mdast(&contents, &::markdown::ParseOptions::default())
+ .expect("could not parse table of contents markdown");
+ let guides = markdown::parse_summary_into_nav_links(&mdast)
+ .expect("could not extract nav links from table of contents");
+
+ render(cluster, &path, guides, "Guides", &Path::new("docs"), &config::docs_dir()).await
}
#[get("/blog/", rank = 10)]
@@ -134,6 +116,7 @@ async fn render<'a>(
    mut nav_links: Vec<NavLink>,
nav_title: &'a str,
folder: &'a Path,
+ content: &'a str,
) -> Result<ResponseOk, Status> {
let url = path.clone();
// Get the document content
- let path = Path::new(&config::content_dir())
+ let path = Path::new(&content)
.join(folder)
.join(&(path.to_str().unwrap().to_string() + ".md"));
+ info!("path: {:?}", path);
// Read to string
let contents = match tokio::fs::read_to_string(&path).await {
Ok(contents) => contents,
@@ -244,7 +229,7 @@ async fn render<'a>(
}
pub fn routes() -> Vec {
- routes![doc_handler, blog_handler, search]
+ routes![gitbook_assets, doc_handler, blog_handler, search]
}
#[cfg(test)]
diff --git a/pgml-dashboard/src/components/navbar/template.html b/pgml-dashboard/src/components/navbar/template.html
index 7f54386ae..f000aa67f 100644
--- a/pgml-dashboard/src/components/navbar/template.html
+++ b/pgml-dashboard/src/components/navbar/template.html
@@ -15,7 +15,7 @@
<% if !standalone_dashboard { %>
-
-
\ No newline at end of file
diff --git a/pgml-docs/.gitbook/assets/scaling-postgresml-3.svg b/pgml-docs/.gitbook/assets/scaling-postgresml-3.svg
deleted file mode 100644
index 42d5c1a57..000000000
--- a/pgml-docs/.gitbook/assets/scaling-postgresml-3.svg
+++ /dev/null
@@ -1,4 +0,0 @@
-
-
-
-
\ No newline at end of file
diff --git a/pgml-docs/.gitbook/assets/select_plan.png b/pgml-docs/.gitbook/assets/select_plan.png
deleted file mode 100644
index 443972780..000000000
Binary files a/pgml-docs/.gitbook/assets/select_plan.png and /dev/null differ
diff --git a/pgml-docs/.gitbook/assets/signup_screenshot.png b/pgml-docs/.gitbook/assets/signup_screenshot.png
deleted file mode 100644
index df9c23b96..000000000
Binary files a/pgml-docs/.gitbook/assets/signup_screenshot.png and /dev/null differ
diff --git a/pgml-docs/README.md b/pgml-docs/README.md
deleted file mode 100644
index ab2252701..000000000
--- a/pgml-docs/README.md
+++ /dev/null
@@ -1,17 +0,0 @@
----
-description: Page to navigate to any part of documentation
----
-
-# Home
-
-* [getting-started](getting-started/ "mention")
-* [natural-language-processing](machine-learning/natural-language-processing/ "mention")
-* [vector-database.md](vector-database.md "mention")
-* [supervised-learning](machine-learning/supervised-learning/ "mention")
-* [unsupervised-learning.md](machine-learning/unsupervised-learning.md "mention")
-* [sdks](sdks/ "mention")
-* [chatbots.md](apps/chatbots.md "mention")
-* [use-cases](use-cases/ "mention")
-* [benchmarks](benchmarks/ "mention")
-* [monitoring.md](monitoring.md "mention")
-* [developer-docs](developer-docs/ "mention")
diff --git a/pgml-docs/SUMMARY.md b/pgml-docs/SUMMARY.md
deleted file mode 100644
index 555d1b4fa..000000000
--- a/pgml-docs/SUMMARY.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Table of contents
-
-* [Guides](README.md)
-* [Overview](overview.md)
-* [Getting Started](getting-started/README.md)
- * [Sign up](getting-started/sign-up.md)
- * [Select a plan](getting-started/select-a-plan.md)
- * [Database Credentials](getting-started/database-credentials.md)
- * [Connect to the Database](getting-started/connect-to-the-database.md)
-* [Machine Learning](machine-learning/README.md)
- * [Natural Language Processing](machine-learning/natural-language-processing/README.md)
- * [Embeddings](machine-learning/natural-language-processing/embeddings.md)
- * [Fill Mask](machine-learning/natural-language-processing/fill-mask.md)
- * [Question Answering](machine-learning/natural-language-processing/question-answering.md)
- * [Summarization](machine-learning/natural-language-processing/summarization.md)
- * [Text Classification](machine-learning/natural-language-processing/text-classification.md)
- * [Text Generation](machine-learning/natural-language-processing/text-generation.md)
- * [Text-to-Text Generation](machine-learning/natural-language-processing/text-to-text-generation.md)
- * [Token Classification](machine-learning/natural-language-processing/token-classification.md)
- * [Translation](machine-learning/natural-language-processing/translation.md)
- * [Zero-shot Classification](machine-learning/natural-language-processing/zero-shot-classification.md)
- * [Supervised Learning](machine-learning/supervised-learning/README.md)
- * [Data Pre-processing](machine-learning/supervised-learning/data-pre-processing.md)
- * [Regression](machine-learning/supervised-learning/regression.md)
- * [Classification](machine-learning/supervised-learning/classification.md)
- * [Hyperparameter Search](machine-learning/supervised-learning/hyperparameter-search.md)
- * [Joint Optimization](machine-learning/supervised-learning/joint-optimization.md)
- * [Unsupervised Learning](machine-learning/unsupervised-learning.md)
-* [Vector Database](vector-database.md)
-* [SDKs](sdks/README.md)
- * [Overview](sdks/overview.md)
- * [Getting Started](sdks/getting-started.md)
- * [Collections](sdks/collections.md)
- * [Pipelines](sdks/pipelines.md)
- * [Search](sdks/search.md)
- * [Tutorials](sdks/tutorials/README.md)
- * [Semantic Search](sdks/tutorials/semantic-search.md)
- * [Semantic Search using Instructor model](sdks/tutorials/semantic-search-using-instructor-model.md)
- * [Extractive Question Answering](sdks/tutorials/extractive-question-answering.md)
- * [Summarizing Question Answering](sdks/tutorials/summarizing-question-answering.md)
-* [Apps](apps/README.md)
- * [Chatbots](apps/chatbots.md)
- * [Fraud Detection](apps/fraud-detection.md)
- * [Recommendation Engine](apps/recommendation-engine.md)
- * [Search](apps/search.md)
- * [Time-series Forecasting](apps/time-series-forecasting.md)
-* [Use cases](use-cases/README.md)
- * [Improve Search Results with Machine Learning](use-cases/improve-search-results-with-machine-learning.md)
- * [Generating LLM embeddings with open source models in PostgresML](use-cases/generating-llm-embeddings-with-open-source-models-in-postgresml.md)
- * [Tuning vector recall while generating query embeddings in the database](use-cases/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md)
- * [Personalize embedding results with application data in your database](use-cases/personalize-embedding-results-with-application-data-in-your-database.md)
- * [LLM based pipelines with PostgresML and dbt (data build tool)](use-cases/llm-based-pipelines-with-postgresml-and-dbt-data-build-tool.md)
-* [PgCat](pgcat.md)
-* [Benchmarks](benchmarks/README.md)
- * [PostgresML is 8-40x faster than Python HTTP microservices](benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md)
- * [Million Requests per Second](benchmarks/million-requests-per-second.md)
- * [MindsDB vs PostgresML](benchmarks/mindsdb-vs-postgresml.md)
- * [GGML Quantized LLM support for Huggingface Transformers](benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md)
- * [Making Postgres 30 Percent Faster in Production](benchmarks/making-postgres-30-percent-faster-in-production.md)
-* [Monitoring](monitoring.md)
-* [FAQs](faqs.md)
-* [Developer Docs](developer-docs/README.md)
- * [Quick Start with Docker](developer-docs/quick-start-with-docker.md)
- * [Installation](developer-docs/installation.md)
- * [Contributing](developer-docs/contributing.md)
- * [Distributed Training](developer-docs/distributed-training.md)
- * [GPU Support](developer-docs/gpu-support.md)
diff --git a/pgml-docs/apps/README.md b/pgml-docs/apps/README.md
deleted file mode 100644
index 11e48c878..000000000
--- a/pgml-docs/apps/README.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Apps
-
-Easy-to-use no-code interfaces to build and deploy end-to-end ML-powered applications. These serve as solutions that can be used as-is, or as reference architectures for applications that need customization. For instance, `pgml-chat` is a no-code command line app that allows anyone to build an interactive chatbot for Slack or Discord on top of their private knowledge base.
diff --git a/pgml-docs/apps/chatbots.md b/pgml-docs/apps/chatbots.md
deleted file mode 100644
index 06c455a14..000000000
--- a/pgml-docs/apps/chatbots.md
+++ /dev/null
@@ -1,162 +0,0 @@
----
-description: CLI tool to build and deploy chatbots
----
-
-# Chatbots
-
-## Introduction
-
-A command line tool to build and deploy a _**knowledge-based**_ chatbot using PostgresML and the OpenAI API.
-
-There are two stages in building a knowledge-based chatbot:
-
-* Build a knowledge base by ingesting documents, chunking them, generating embeddings, and indexing the embeddings for fast querying
-* Generate responses to user queries by retrieving relevant documents and generating answers using the OpenAI API
-
-This tool automates both stages and provides a command line interface to build and deploy a knowledge-based chatbot.
-
-## Prerequisites
-
-Before you begin, make sure you have the following:
-
-* PostgresML Database: Sign up for a free [GPU-powered database](https://postgresml.org/signup)
-* Python version >=3.8
-* OpenAI API key
-
-## Getting started
-
-1. Create a virtual environment and install `pgml-chat` using `pip`:
-
-```bash
-pip install pgml-chat
-```
-
-`pgml-chat` will be installed in your PATH.
-
-2. Download the `.env.template` file from the PostgresML GitHub repository.
-
-```bash
-wget https://raw.githubusercontent.com/postgresml/postgresml/master/pgml-apps/pgml-chat/.env.template
-```
-
-3. Copy the template file to `.env`
-4. Update environment variables with your OpenAI API key and PostgresML database credentials.
-
-```bash
-OPENAI_API_KEY=
-DATABASE_URL=
-MODEL=hkunlp/instructor-xl
-MODEL_PARAMS={"instruction": "Represent the Wikipedia document for retrieval: "}
-QUERY_PARAMS={"instruction": "Represent the Wikipedia question for retrieving supporting documents: "}
-SYSTEM_PROMPT="You are an assistant to answer questions about an open source software named PostgresML. Your name is PgBot. You are based out of San Francisco, California."
-BASE_PROMPT="Given relevant parts of a document and a question, create a final answer.\
- Include a SQL query in the answer wherever possible. \
- Use the following portion of a long document to see if any of the text is relevant to answer the question.\
- \nReturn any relevant text verbatim.\n{context}\nQuestion: {question}\n \
- If the context is empty then ask for clarification and suggest user to send an email to team@postgresml.org or join PostgresML [Discord](https://discord.gg/DmyJP3qJ7U)."
-```
-
-## Usage
-
-You can get help on the command line interface by running:
-
-```bash
-(pgml-bot-builder-py3.9) pgml-chat % pgml-chat --help
-usage: pgml-chat [-h] --collection_name COLLECTION_NAME [--root_dir ROOT_DIR] [--stage {ingest,chat}] [--chat_interface {cli,slack}]
-
-PostgresML Chatbot Builder
-
-optional arguments:
- -h, --help show this help message and exit
- --collection_name COLLECTION_NAME
- Name of the collection (schema) to store the data in PostgresML database (default: None)
- --root_dir ROOT_DIR Input folder to scan for markdown files. Required for ingest stage. Not required for chat stage (default: None)
- --stage {ingest,chat}
- Stage to run (default: chat)
- --chat_interface {cli, slack, discord}
- Chat interface to use (default: cli)
-```
-
-### Ingest
-
-In this step, we ingest documents, chunk documents, generate embeddings and index these embeddings for fast query.
-
-```bash
-LOG_LEVEL=DEBUG pgml-chat --root_dir <root directory> --collection_name <collection name> --stage ingest
-```
-
-You will see output logging the pipeline's progress.
-
-### Chat
-
-You can interact with the bot using the command line interface or Slack.
-
-#### Command Line Interface
-
-In this step, we start chatting with the chatbot at the command line. You can increase the log level to ERROR to suppress the logs. CLI is the default chat interface.
-
-```bash
-LOG_LEVEL=ERROR pgml-chat --collection_name <collection name> --stage chat --chat_interface cli
-```
-
-You should be able to interact with the bot as shown below. Control-C to exit.
-
-```bash
-User (Ctrl-C to exit): Who are you?
-PgBot: I am PgBot, an AI assistant here to answer your questions about PostgresML, an open source software. How can I assist you today?
-User (Ctrl-C to exit): What is PostgresML?
-Found relevant documentation....
-PgBot: PostgresML is an open source software that allows you to unlock the full potential of your data and drive more sophisticated insights and decision-making processes. It provides a dashboard with analytical views of the training data and
-model performance, as well as integrated notebooks for rapid iteration. PostgresML is primarily written in Rust using Rocket as a lightweight web framework and SQLx to interact with the database.
-
-If you have any further questions or need more information, please feel free to send an email to team@postgresml.org or join the PostgresML Discord community at https://discord.gg/DmyJP3qJ7U.
-```
-
-#### Slack
-
-**Setup** You need SLACK\_BOT\_TOKEN and SLACK\_APP\_TOKEN to run the chatbot on Slack. You can get these tokens by creating a Slack app. Follow the instructions [here](https://slack.dev/bolt-python/tutorial/getting-started) to create a Slack app. Include the following environment variables in your .env file:
-
-```bash
-SLACK_BOT_TOKEN=
-SLACK_APP_TOKEN=
-```
-
-In this step, we start chatting with the chatbot on Slack. You can increase the log level to ERROR to suppress the logs.
-
-```bash
-LOG_LEVEL=ERROR pgml-chat --collection_name <collection name> --stage chat --chat_interface slack
-```
-
-If you have set up the Slack app correctly, you should see the following output:
-
-```bash
-⚡️ Bolt app is running!
-```
-
-Once the Slack app is running, you can interact with the chatbot on Slack as shown below. In the example here, the name of the bot is `PgBot`. This app responds only to direct messages to the bot.
-
-
-
-#### Discord
-
-**Setup** You need DISCORD\_BOT\_TOKEN to run the chatbot on Discord. You can get this token by creating a Discord app. Follow the instructions [here](https://discordpy.readthedocs.io/en/stable/discord.html) to create a Discord app. Include the following environment variables in your .env file:
-
-```bash
-DISCORD_BOT_TOKEN=
-```
-
-In this step, we start chatting with the chatbot on Discord. You can increase the log level to ERROR to suppress the logs.
-
-```bash
-pgml-chat --collection_name <collection name> --stage chat --chat_interface discord
-```
-
-If you have set up the Discord app correctly, you should see the following output:
-
-```bash
-2023-08-02 16:09:57 INFO discord.client logging in using static token
-```
-
-Once the Discord app is running, you can interact with the chatbot on Discord as shown below. In the example here, the name of the bot is `pgchat`. This app responds only to direct messages to the bot.
-
-
diff --git a/pgml-docs/apps/fraud-detection.md b/pgml-docs/apps/fraud-detection.md
deleted file mode 100644
index dbe05b5dd..000000000
--- a/pgml-docs/apps/fraud-detection.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Fraud Detection
-
-Describe this app, write a GitHub issue and ask people to do a :thumbsup:on the issue
diff --git a/pgml-docs/apps/recommendation-engine.md b/pgml-docs/apps/recommendation-engine.md
deleted file mode 100644
index 73e132a6e..000000000
--- a/pgml-docs/apps/recommendation-engine.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Recommendation Engine
-
-Describe this app, write a GitHub issue and ask people to do a :thumbsup:on the issue
diff --git a/pgml-docs/apps/search.md b/pgml-docs/apps/search.md
deleted file mode 100644
index 1a5b6b8f8..000000000
--- a/pgml-docs/apps/search.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Search
-
-Describe this app, write a GitHub issue and ask people to do a :thumbsup:on the issue
diff --git a/pgml-docs/apps/time-series-forecasting.md b/pgml-docs/apps/time-series-forecasting.md
deleted file mode 100644
index a7f7ab998..000000000
--- a/pgml-docs/apps/time-series-forecasting.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Time-series Forecasting
-
diff --git a/pgml-docs/benchmarks/README.md b/pgml-docs/benchmarks/README.md
deleted file mode 100644
index ce4a798b7..000000000
--- a/pgml-docs/benchmarks/README.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Benchmarks
-
diff --git a/pgml-docs/benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md b/pgml-docs/benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md
deleted file mode 100644
index da53f4702..000000000
--- a/pgml-docs/benchmarks/ggml-quantized-llm-support-for-huggingface-transformers.md
+++ /dev/null
@@ -1,434 +0,0 @@
-# GGML Quantized LLM support for Huggingface Transformers
-
-
-
-Quantization allows PostgresML to fit larger models in less RAM. These algorithms perform inference significantly faster on NVIDIA, Apple and Intel hardware. Half-precision floating point and quantized optimizations are now available for your favorite LLMs downloaded from Huggingface.
-
-## Introduction
-
-Large Language Models (LLMs) are... large. They have a lot of parameters, which make up the weights and biases of the layers inside deep neural networks. Typically, these parameters are represented by individual 32-bit floating point numbers, so a model like GPT-2 that has 1.5B parameters would need `4 bytes * 1,500,000,000 = 6GB RAM`. Leading open source models like LLaMA, Alpaca, and Guanaco currently have 65B parameters, which require about 260GB RAM. That is a lot of RAM, and it's not even counting what's needed to store the input and output data.
-
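-A rough back-of-the-envelope helper makes this arithmetic concrete (a sketch in plain Python; dtype sizes are bytes per parameter, and only the weights are counted, not activations or inputs):
-
-```python
-def weights_ram_gb(n_params: float, bytes_per_param: float) -> float:
-    # RAM for the weights alone, ignoring activations, inputs and outputs.
-    return n_params * bytes_per_param / 1e9
-
-for name, n_params in [("GPT-2", 1.5e9), ("LLaMA-65B", 65e9)]:
-    for dtype, size in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1), ("int4", 0.5)]:
-        print(f"{name:10s} {dtype:18s} {weights_ram_gb(n_params, size):7.1f} GB")
-```
-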
-Bandwidth between RAM and CPU often becomes a bottleneck for performing inference with these models, rather than the number of processing cores or their speed, because the processors become starved for data. One way to reduce the amount of RAM and memory bandwidth needed is to use a smaller datatype, like 16-bit floating point numbers, which would cut the model size in RAM by half. There are a couple of competing 16-bit standards, but NVIDIA has introduced support for bfloat16 in their latest hardware generation, which keeps the full exponent range of float32 but gives up two thirds of the precision. Most research has shown this is a good quality/performance tradeoff, and that model outputs are not terribly sensitive to truncating the least significant bits.
-
-| Format | Significand | Exponent |
-| ----------- | ----------- | -------- |
-| bfloat16 | 8 bits | 8 bits |
-| float16 | 11 bits | 5 bits |
-| float32 | 24 bits | 8 bits |
-
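-To make the truncation concrete, here's a sketch in plain Python (ignoring rounding modes): bfloat16 is simply a float32 with the low 16 mantissa bits dropped, which is why it keeps float32's range but loses precision.
-
-```python
-import struct
-
-def to_bfloat16(x: float) -> float:
-    # bfloat16 keeps float32's sign bit and 8 exponent bits, dropping the low 16 mantissa bits.
-    bits = struct.unpack("<I", struct.pack("<f", x))[0]
-    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]
-
-print(to_bfloat16(3.14159265))  # 3.140625 -- same range as float32, far fewer significant digits
-```
-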
-You can select the data type for torch tensors in PostgresML by setting the `torch_dtype` parameter in the `pgml.transform` function. The default is `float32`, but you can also use `float16` or `bfloat16`. Here's an example of using `bfloat16` with the [Falcon-7B Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) model:
-
-!!! generic
-
-!!! code\_block time="4584.906 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "model": "tiiuae/falcon-7b-instruct",
- "device_map": "auto",
- "torch_dtype": "bfloat16",
- "trust_remote_code": true
- }'::JSONB,
- args => '{
- "max_new_tokens": 100
- }'::JSONB,
- inputs => ARRAY[
- 'Complete the story: Once upon a time,'
- ]
-) AS result;
-```
-
-!!!
-
-!!! results
-
-| transform |
-| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \[\[{"generated\_text": "Complete the story: Once upon a time, there was a small village where everyone was happy and lived peacefully.\nOne day, a powerful ruler from a neighboring kingdom arrived with an evil intent. He wanted to conquer the peaceful village and its inhabitants. The ruler was accompanied by a large army, ready to take control. The villagers, however, were not to be intimidated. They rallied together and, with the help of a few brave warriors, managed to defeat the enemy. The villagers celebrated their victory, and peace was restored in the kingdom for"}]] |
-
-!!!
-
-!!!
-
-4.5 seconds is slow for an interactive response. If we're building dynamic user experiences, it's worth digging deeper into optimizations.
-
-## Quantization
-
-_Discrete quantization is not a new idea. It's been used by both algorithms and artists for more than a hundred years._
-
-Going beyond 16-bit down to 8 or 4 bits is possible, but not with hardware accelerated floating point operations. If we want hardware acceleration for smaller types, we'll need to use small integers with vectorized instruction sets. This is the process of _quantization_. Quantization can be applied to existing models trained with 32-bit floats by converting the weights to smaller integer primitives that still benefit from hardware accelerated instruction sets like Intel's [AVX](https://en.wikipedia.org/wiki/Advanced\_Vector\_Extensions). A simple way to quantize a model is to first find the maximum and minimum values of the weights, then divide the range of values into the number of buckets available in your integer type: 256 for 8-bit, 16 for 4-bit. This is called _post-training quantization_, and it's the simplest way to quantize a model.
-
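-As a toy illustration of that min/max bucketing (a minimal Python sketch of naive post-training quantization; GPTQ and GGML are considerably more sophisticated than this):
-
-```python
-import numpy as np
-
-def quantize_uint8(weights: np.ndarray):
-    # Map the float range [min, max] onto the 256 buckets available in an 8-bit integer.
-    lo, hi = float(weights.min()), float(weights.max())
-    scale = (hi - lo) / 255.0
-    q = np.round((weights - lo) / scale).astype(np.uint8)
-    return q, scale, lo
-
-def dequantize(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
-    return q.astype(np.float32) * scale + lo
-
-w = np.random.randn(1024).astype(np.float32)
-q, scale, lo = quantize_uint8(w)
-print("max round-trip error:", float(np.abs(w - dequantize(q, scale, lo)).max()))
-```
-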
-[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) is a research paper that outlines the details for quantizing LLMs after they have already been trained on full float32 precision, and the tradeoffs involved. Their work is implemented as an [open source library](https://github.com/IST-DASLab/gptq), which has been adapted to work with Huggingface Transformers by [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). PostgresML will automatically use AutoGPTQ when a HuggingFace model with GPTQ in the name is used.
-
-[GGML](https://github.com/ggerganov/ggml) is another quantization implementation focused on CPU optimization, particularly for Apple M1 & M2 silicon. It relies on the same principles, but is a different underlying implementation. As a general rule of thumb, if you're using NVIDIA hardware and your entire model will fit in VRAM, GPTQ will be faster. If you're using Apple or Intel hardware, GGML will likely be faster.
-
-The community (shoutout to [TheBloke](https://huggingface.co/TheBloke)) has been applying these quantization methods to LLMs in the Huggingface Transformers library. Many versions of your favorite LLMs are now available in more efficient formats. This might allow you to move up to a larger model size, or to fit more models in the same amount of RAM.
-
-## Using GPTQ & GGML in PostgresML
-
-You'll need to update to PostgresML 2.6.0 or later to use GPTQ or GGML, and you'll need to update your Python dependencies for PostgresML to take advantage of these new capabilities. AutoGPTQ also provides prebuilt Python wheels if you're having trouble installing the pip package, which builds it from source. They maintain a list of wheels [available for download](https://github.com/PanQiWei/AutoGPTQ/releases) on GitHub.
-
-```commandline
-pip install -r requirements.txt
-```
-
-### GPU Support
-
-PostgresML will automatically use GPTQ or GGML when a HuggingFace model has one of those libraries in its name. By default, PostgresML uses a CUDA device where possible.
-
-#### GPTQ
-
-!!! generic
-
-!!! code\_block time="281.213 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "mlabonne/gpt2-GPTQ-4bit"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \["Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger. The world was a place of great danger. The world"] |
-
-!!!
-
-!!!
-
-#### GGML
-
-!!! generic
-
-!!! code\_block time="252.213 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "marella/gpt-2-ggml"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| --------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \[" the world was filled with people who were not only rich but also powerful.\n\nThe first thing that came to mind when I thought of this place is how"] |
-
-!!!
-
-!!!
-
-#### GPT2
-
-!!! generic
-
-!!! code\_block time="279.888 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "gpt2"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \[\[{"Once upon a time, I'd get angry over the fact that my house was going to have some very dangerous things from outside. To be honest, I know it's going to be"}]] |
-
-!!!
-
-!!!
-
-This quick example running on my RTX 3090 GPU shows there is very little difference in runtime for these libraries and models when everything fits in VRAM by default. But let's see what happens when we execute the model on my Intel i9-13900 CPU instead of my GPU...
-
-### CPU Support
-
-We can specify the CPU by passing a `"device": "cpu"` argument to the `task`.
-
-#### GGML
-
-!!! generic
-
-!!! code\_block time="266.997 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "marella/gpt-2-ggml",
- "device": "cpu"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \[" we've all had an affair with someone and now the truth has been revealed about them. This is where our future comes in... We must get together as family"] |
-
-!!!
-
-!!!
-
-#### GPT2
-
-!!! generic
-
-!!! code\_block time="33224.136 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "gpt2",
- "device": "cpu"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \[\[{"generated\_text": "Once upon a time, we were able, due to our experience at home, to put forward the thesis that we're essentially living life as a laboratory creature with the help of other humans"}]] |
-
-!!!
-
-!!!
-
-Now you can see the difference. With both implementations and models forced to use only the CPU, we can see that a quantized version can be literally 100x faster. In fact, the quantized version on the CPU is as fast as the vanilla version on the GPU. This is a huge win for CPU users.
-
-### Larger Models
-
-HuggingFace and these libraries have a lot of great models. Not all of these models provide a complete config.json, so you may need to include some additional params for the task, like `model_type`.
-
-#### LLaMA
-
-!!! generic
-
-!!! code\_block time="3411.324 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "TheBloke/robin-7B-v2-GGML",
- "model_type": "llama"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| -------------------------------------------------------------------------------------------------------------------------------------- |
-| \[" in a land far away, there was a kingdom ruled by a wise and just king. The king had three sons, each of whom he loved dearly and"] |
-
-!!!
-
-!!!
-
-#### MPT
-
-!!! generic
-
-!!! code\_block time="4198.817 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "TheBloke/MPT-7B-Storywriter-GGML",
- "model_type": "mpt"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| ------------------------------------------------------------------------------------------------------------------------ |
-| \["\n\nWhen he heard a song that sounded like this:\n\n"The wind is blowing, the rain's falling. \nOh where'd the love"] |
-
-!!!
-
-!!!
-
-#### Falcon
-
-!!! generic
-
-!!! code\_block time="4198.817 ms"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "TheBloke/falcon-40b-instruct-GPTQ",
- "trust_remote_code": true
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| ------------------------------------------------------------------------------------------------------------------------ |
-| \["\n\nWhen he heard a song that sounded like this:\n\n"The wind is blowing, the rain's falling. \nOh where'd the love"] |
-
-!!!
-
-!!!
-
-### Specific Quantization Files
-
-Many of these models are published with multiple different quantization methods applied and saved into different files in the same model space, e.g. 4-bit, 5-bit, 8-bit. You can specify which quantization method you want to use by passing a `model_file` argument to the `task`, in addition to the `model`. You'll need to check the model card for file and quantization details.
-
-!!! generic
-
-!!! code\_block time="6498.597"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "TheBloke/MPT-7B-Storywriter-GGML",
- "model_file": "mpt-7b-storywriter.ggmlv3.q8_0.bin"
- }'::JSONB,
- inputs => ARRAY[
- 'Once upon a time,'
- ],
- args => '{"max_new_tokens": 32}'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| -------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \[" we made peace with the Romans, but they were too busy making war on each other to notice. The king and queen of Rome had a son named Romulus"] |
-
-!!!
-
-!!!
-
-### The whole shebang
-
-PostgresML aims to provide a flexible API to the underlying libraries. This means that you should be able to pass in any valid arguments to [`AutoModel.from_pretrained(...)`](https://huggingface.co/docs/transformers/v4.30.0/en/model\_doc/auto#transformers.FlaxAutoModelForVision2Seq.from\_pretrained) via the `task`, and additional arguments to call on the resulting pipeline during inference for `args`. PostgresML caches each model based on the `task` arguments, so calls to an identical task will be as fast as possible. The arguments that are valid for any model depend on the inference implementation it uses. You'll need to check the model card and underlying library for details.
-
-Getting GPU acceleration to work may also depend on compiling dependencies or downloading Python wheels, as well as passing in the correct arguments if the implementing library does not run on a GPU by default, like Huggingface Transformers. PostgresML will cache your model on the GPU for as long as your database connection is open, and it will be visible in the process list while it is being used. You can always check `nvidia-smi` to see whether the GPU is being used as expected. We understand this isn't ideal, but we believe the bleeding edge should be accessible to those who dare. We test many models and configurations to make sure our cloud offering has broad support, but we always appreciate GitHub issues when something is missing.
-
-Shoutout to [Tostino](https://github.com/Tostino/) for the extended example below.
-
-!!! generic
-
-!!! code\_block time="3784.565"
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task": "text-generation",
- "model": "TheBloke/vicuna-7B-v1.3-GGML",
- "model_type": "llama",
- "model_file": "vicuna-7b-v1.3.ggmlv3.q5_K_M.bin",
- "gpu_layers": 256
- }'::JSONB,
- inputs => ARRAY[
- $$A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
-
-USER: Please write an intro to a story about a woman living in New York.
-ASSISTANT:$$
- ],
- args => '{
- "max_new_tokens": 512,
- "threads": 16,
- "stop": ["USER:","USER"]
- }'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| \[" Meet Sarah, a strong-willed woman who has always had a passion for adventure. Born and raised in the bustling city of New York, she was no stranger to the hustle and bustle of life in the big apple. However, Sarah longed for something more than the monotonous routine that had become her daily life.\n\nOne day, while browsing through a travel magazine, Sarah stumbled upon an ad for a wildlife conservation program in Africa. Intrigued by the opportunity to make a difference in the world and expand her horizons, she decided to take the leap and apply for the position.\n\nTo her surprise, Sarah was accepted into the program and found herself on a plane bound for the African continent. She spent the next several months living and working among some of the most incredible wildlife she had ever seen. It was during this time that Sarah discovered a love for exploration and a desire to see more of the world.\n\nAfter completing her program, Sarah returned to New York with a newfound sense of purpose and ambition. She was determined to use her experiences to fuel her love for adventure and make the most out of every opportunity that came her way. Whether it was traveling to new destinations or taking on new challenges in her daily life, Sarah was not afraid to step outside of her comfort zone and embrace the unknown.\n\nAnd so, Sarah's journey continued as she made New York her home base for all of her future adventures. She became a role model for others who longed for something more out of life, inspiring them to chase their dreams and embrace the exciting possibilities that lay ahead."] |
-
-!!!
-
-!!!
-
-### Conclusion
-
-There are many open source LLMs. If you're looking for a list to try, check out [the leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open\_llm\_leaderboard). You can also [search for GPTQ](https://huggingface.co/models?search=gptq) and [GGML](https://huggingface.co/models?search=ggml) versions of those models on the hub to see what is popular in the community. If you're looking for a model that is not available in a quantized format, you can always quantize it yourself. If you're successful, please consider sharing your quantized model with the community!
-
-To dive deeper, you may also want to consult the docs for [ctransformers](https://github.com/marella/ctransformers) if you're using a GGML model, and [auto\_gptq](https://github.com/PanQiWei/AutoGPTQ) for GPTQ models. While Python dependencies are fantastic to let us all iterate quickly, and rapidly adopt the latest innovations, they are not as performant or resilient as native code. There is good progress being made to move a lot of this functionality into [rustformers](https://github.com/rustformers/llm) which we intend to adopt on our quest to remove Python completely on the road to PostgresML 3.0, but we're not going to slow down the pace of innovation while we build more stable and performant APIs.
-
-GPTQ & GGML are a huge win for performance and memory usage, and we're excited to see what you can do with them.
diff --git a/pgml-docs/benchmarks/making-postgres-30-percent-faster-in-production.md b/pgml-docs/benchmarks/making-postgres-30-percent-faster-in-production.md
deleted file mode 100644
index 22a501a30..000000000
--- a/pgml-docs/benchmarks/making-postgres-30-percent-faster-in-production.md
+++ /dev/null
@@ -1,48 +0,0 @@
-# Making Postgres 30 Percent Faster in Production
-
-
-
-Anyone who runs Postgres at scale knows that performance comes with trade-offs. The typical playbook is to place a pooler like PgBouncer in front of your database and turn on transaction mode. This makes multiple clients reuse the same server connection, which allows thousands of clients to connect to your database without causing a fork bomb.
-
-Unfortunately, this comes with a trade-off. Since multiple clients share the same server connection, they can't take advantage of prepared statements. Prepared statements are a way for Postgres to cache a query plan and execute it multiple times with different parameters. If you have never tried this before, you can run `pgbench` against your local DB and you'll see that `--protocol prepared` outperforms `simple` and `extended` by at least 30 percent. Giving up this feature has been the norm for production deployments for as long as I can remember, but not anymore.
-
-## PgCat Prepared Statements
-
-Since [#474](https://github.com/postgresml/pgcat/pull/474), PgCat supports prepared statements in session and transaction mode. Our initial benchmarks show a 30% increase over the extended protocol (`--protocol extended`) and 15% over the simple protocol (`--protocol simple`). Most (all?) web frameworks use at least the extended protocol, so we are looking at a **30% performance increase across the board for everyone** who writes web apps and uses Postgres in production, just by switching to named prepared statements.
-
-In Rails apps, it's as simple as setting `prepared_statements: true`.
-
-This is not only a performance benefit, but also a usability improvement for client libraries that have to use prepared statements, like the popular Rust crate [SQLx](https://github.com/launchbadge/sqlx). Until now, the typical recommendation was to just not use a pooler.
-
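-For example, here's what that looks like from a Python client (a sketch using asyncpg, which issues named prepared statements under the hood; the DSN, table and port are placeholders):
-
-```python
-import asyncio
-import asyncpg  # pip install asyncpg
-
-async def main():
-    # Point the connection at PgCat; the DSN below is a placeholder.
-    conn = await asyncpg.connect("postgres://user:password@pgcat-host:6432/mydb")
-    # Parse (F) happens once here...
-    stmt = await conn.prepare("SELECT * FROM users WHERE id = $1")
-    for user_id in (1, 2, 3):
-        # ...then each call is just Bind/Execute against the cached plan.
-        print(await stmt.fetchrow(user_id))
-    await conn.close()
-
-asyncio.run(main())
-```
-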
-## Benchmark
-
-The benchmark was conducted using `pgbench` with 1, 10, 100 and 1000 clients sending millions of queries to PgCat, which itself was running on a different EC2 machine alongside the database. This is a simple setup often used in production. Another configuration sees a pooler use its own machine, which of course increases latency but improves on availability. The clients were on another EC2 machine to simulate the latency experienced in typical web apps deployed in Kubernetes, ECS, EC2 and others.
-
-The benchmark ran in transaction mode. Session mode is faster with fewer clients, but does not scale in production with more than a few hundred clients. Only `SELECT` statements (the `-S` option) were used, since the typical `pgbench` benchmark uses a similar number of writes and reads, which is an atypical production workload. Most apps read 90% of the time and write 10% of the time. Reads are where prepared statements truly shine.
-
-## Implementation
-
-PgCat implements an internal cache & mapping between clients' prepared statements and servers that may or may not have them. If a server has the prepared statement, PgCat just forwards the `Bind (F)`, `Execute (F)` and `Describe (F)` messages. If the server doesn't have the prepared statement, PgCat fetches it from the client cache & prepares it using the `Parse (F)` message. You can refer to [Postgres docs](https://www.postgresql.org/docs/current/protocol-flow.html) for a more detailed explanation of how the extended protocol works.
-
-An important feature of PgCat's implementation is that all prepared statements are renamed and assigned globally unique names. This means that clients that don't randomize their prepared statement names, and that expect them to be gone after they disconnect from the "Postgres server", work as expected (I put "Postgres server" in quotes because they are actually talking to a proxy that pretends to be a Postgres database). The typical error when using such clients with PgBouncer is `prepared statement "sqlx_s_2" already exists`, which is pretty confusing when you see it for the first time.
-
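-Conceptually, the cache and renaming logic looks something like this (a simplified Python sketch of the idea, not PgCat's actual Rust implementation):
-
-```python
-import itertools
-
-class PreparedStatementCache:
-    """Maps client-assigned statement names to globally unique server-side names."""
-
-    def __init__(self):
-        self._names = itertools.count()
-        self.by_client = {}  # (client_id, client_name) -> (global_name, query)
-        self.on_server = {}  # server_id -> set of global_name prepared on that server
-
-    def register(self, client_id, client_name, query):
-        # Rename, so two clients can both call their statement "sqlx_s_2" safely.
-        global_name = f"pgcat_{next(self._names)}"
-        self.by_client[(client_id, client_name)] = (global_name, query)
-        return global_name
-
-    def needs_parse(self, server_id, global_name):
-        # True means a cache miss: send Parse (F) before Bind/Execute on this server.
-        prepared = self.on_server.setdefault(server_id, set())
-        if global_name in prepared:
-            return False
-        prepared.add(global_name)
-        return True
-```
-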
-## Metrics
-
-We've added two new metrics to the admin database: `prepare_cache_hit` and `prepare_cache_miss`. Prepare cache hits indicate that the prepared statement requested by the client already exists on the server. That's good because PgCat can just rewrite the messages and send them to the server immediately. Prepare cache misses indicate that PgCat had to issue a prepared statement call to the server, which requires additional time and decreases throughput. In the ideal scenario, the cache hits outnumber the cache misses by an order of magnitude. If they are the same or worse, the prepared statements are not being used correctly by the clients.
-
-
-
-Our benchmark had a 99.99% cache hit ratio, which is really good, but in production this number is likely to be lower. You can monitor your cache hit/miss ratios through the admin database by querying it with `SHOW SERVERS`.
-
-## Roadmap
-
-Our implementation is pretty simple and we are already seeing massive improvements, but we can still do better. A prepared statement created with `Parse (F)` works, but if one prepares statements explicitly using `PREPARE`, PgCat will ignore it, and that query isn't likely to work outside of session mode.
-
-Another issue is explicit `DEALLOCATE` and `DISCARD` calls. PgCat doesn't detect them currently, and a client can potentially bust the server prepared statement cache without PgCat knowing about it. It's an easy enough fix to intercept and act on that query accordingly, but we haven't built that yet.
-
-Testing with `pgbench` is an artificial benchmark, which is good and bad. It's good because, other things being equal, we can demonstrate that one implementation & configuration of the database/pooler cluster is superior to another. It's bad because in the real world, the results can differ. We are looking for users who would be willing to test our implementation against their production traffic and tell us how we did. This feature is optional and can be enabled & disabled dynamically, without restarting PgCat, with `prepared_statements = true` in `pgcat.toml`.
diff --git a/pgml-docs/benchmarks/million-requests-per-second.md b/pgml-docs/benchmarks/million-requests-per-second.md
deleted file mode 100644
index 9bb93df38..000000000
--- a/pgml-docs/benchmarks/million-requests-per-second.md
+++ /dev/null
@@ -1,238 +0,0 @@
-# Million Requests per Second
-
-The question "Does it Scale?" has become somewhat of a meme in software engineering. There is a good reason for it though, because most businesses plan for success. If your app, online store, or SaaS becomes popular, you want to be sure that the system powering it can serve all your new customers.
-
-At PostgresML, we are very concerned with scale. Our engineering background took us through scaling PostgreSQL to 100 TB+, so we're certain that it scales, but could we scale machine learning alongside it?
-
-In this post, we'll discuss how we horizontally scale PostgresML to achieve more than **1 million XGBoost predictions per second** on commodity hardware.
-
-If you missed our previous post and are wondering why someone would combine machine learning and Postgres, take a look at our PostgresML vs. Python benchmark.
-
-## Architecture Overview
-
-If you're familiar with how one runs PostgreSQL at scale, you can skip straight to the [results](#results).
-
-Part of our thesis, and the reason why we chose Postgres as our host for machine learning, is that scaling machine learning inference is very similar to scaling read queries in a typical database cluster.
-
-Inference speed varies based on the model complexity (e.g. `n_estimators` for XGBoost) and the size of the dataset (how many features the model uses), which is analogous to query complexity and table size in the database world and, as we'll demonstrate further on, scaling the latter is mostly a solved problem.
-
-_System Architecture_
-
-| Component | Description |
-| --------- | --------------------------------------------------------------------------------------------------------- |
-| Clients | Regular Postgres clients |
-| ELB | [Elastic Network Load Balancer](https://aws.amazon.com/elasticloadbalancing/) |
-| PgCat | A Postgres [pooler](https://github.com/levkk/pgcat/) with built-in load balancing, failover, and sharding |
-| Replica | Regular Postgres [replicas](https://www.postgresql.org/docs/current/high-availability.html) |
-| Primary | Regular Postgres primary |
-
-Our architecture has four components that may need to scale up or down based on load:
-
-1. Clients
-2. Load balancer
-3. [PgCat](https://github.com/levkk/pgcat/) pooler
-4. Postgres replicas
-
-We intentionally don't discuss scaling the primary in this post, because sharding, which is the most effective way to do so, is a fascinating subject that deserves its own series of posts. Spoiler alert: we sharded Postgres without any problems.
-
-### Clients
-
-Clients are regular Postgres connections coming from web apps, job queues, or pretty much anywhere that needs data. They can be long-lived or ephemeral, and they typically grow in number as the application scales.
-
-Most modern deployments use containers which are added as load on the app increases, and removed as the load decreases. This is called dynamic horizontal scaling, and it's an effective way to adapt to changing traffic patterns experienced by most businesses.
-
-### Load Balancer
-
-The load balancer is a way to spread traffic across horizontally scalable components, by routing new connections to targets in a round robin (or random) fashion. It's typically a very large box (or a fast router), but even those need to be scaled if traffic suddenly increases. Since we're running our system on AWS, this is already taken care of, for a reasonably small fee, by using an Elastic Load Balancer.
-
-### PgCat
-
-If you've used Postgres in the past, you know that it can't handle many concurrent connections. For large deployments, it's necessary to run something we call a pooler. A pooler routes thousands of clients to only a few dozen server connections by time-sharing when a client can use a server. Because most queries are very quick, this is a very effective way to run Postgres at scale.
-
-There are many poolers available presently, the most notable being PgBouncer, which has been around for a very long time, and is trusted by many large organizations. Unfortunately, it hasn't evolved much with the growing needs of highly available Postgres deployments, so we wrote [our own](https://github.com/levkk/pgcat/) which added important functionality we needed:
-
-* Load balancing of read queries
-* Failover in case a read replica is broken
-* Sharding (this feature is still being developed)
-
-In this benchmark, we used its load balancing feature to evenly distribute XGBoost predictions across our Postgres replicas.
-
-### Postgres Replicas
-
-Scaling Postgres reads is pretty straightforward. If more read queries are coming in, we add a replica to serve the increased load. If the load is decreasing, we remove a replica to save money. The data is replicated from the primary, so all replicas are identical, and all of them can serve any query, or in our case, an XGBoost prediction. PgCat can dynamically add and remove replicas from its config without disconnecting clients, so we can scale up and down as needed, without downtime.
-
-#### Parallelizing XGBoost
-
-Scaling XGBoost predictions is a little bit more interesting. XGBoost cannot serve predictions concurrently because of internal data structure locks. This is common to many other machine learning algorithms as well, because making predictions can temporarily modify internal components of the model.
-
-PostgresML bypasses that limitation because of how Postgres itself handles concurrency:
-
-_PostgresML concurrency_
-
-PostgreSQL uses the fork/multiprocessing architecture to serve multiple clients concurrently: each new client connection becomes an independent OS process. During connection startup, PostgresML loads all models inside the process' memory space. This means that each connection has its own copy of the XGBoost model and PostgresML ends up serving multiple XGBoost predictions at the same time without any lock contention.
-
-## Results
-
-We ran over 100 different benchmarks, changing the number of clients, poolers, replicas, and XGBoost predictions we requested. The benchmarks were meant to test the limits of each configuration, and what remediations were needed in each scenario. Our raw data is available below.
-
-One of the tests we ran used 1,000 clients, which were connected to 1, 2, and 5 replicas. The results were exactly what we expected.
-
-### Linear Scaling
-
-_Latency_
-
-_Throughput_
-
-Both latency and throughput, the standard measurements of system performance, scale mostly linearly with the number of replicas. Linear scaling is the north star of all horizontally scalable systems, and most are not able to achieve it because of increasing complexity that comes with synchronization.
-
-Our architecture shares nothing and requires no synchronization. The replicas don't talk to each other and the poolers don't either. Every component has the knowledge it needs (through configuration) to do its job, and they do it well.
-
-The most impressive result is serving close to a million predictions with an average latency of less than 1ms. You might notice though that `950160.7` isn't quite one million, and that's true. We couldn't reach one million with 1000 clients, so we increased to 2000 and got our magic number: **1,021,692.7 req/sec**, with an average latency of **1.7ms**.
-
-### Batching Predictions
-
-Batching is a proven method to optimize performance. If you need to get several data points, batch the requests into one query, and it will run faster than making individual requests.
-
-We should preface this result by stating that PostgresML does not yet have a batch prediction API as such. Our `pgml.predict()` function can predict multiple points, but we haven't implemented a query pattern to pass multiple rows to that function at the same time. Once we do, based on our tests, we should see a substantial increase in batch prediction performance.
-
-Regardless of that limitation, we still managed to get better results by batching queries together since Postgres needed to do less query parsing and searching, and we saved on network round trip time as well.
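-
-Concretely, batching just means asking for several predictions in one query. This is the same query shown in the Methodology section below, with a batch size of 5:
-
-```postgresql
--- One query, one round trip, five predictions:
--- the query is parsed once and results stream back together.
-SELECT pgml.predict(
-    'flights',
-    ARRAY[
-        year,
-        quarter,
-        month,
-        distance,
-        dayofweek,
-        dayofmonth,
-        flight_number_operating_airline,
-        originairportid,
-        destairportid,
-        flight_number_marketing_airline,
-        departure
-    ]
-) AS prediction
-FROM flights_mat_3
-LIMIT 5;
-```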
-
-If batching did not work at all, we would see a linear increase in latency and a linear decrease in throughput. That did not happen; instead, we got a 1.5x improvement by batching 5 predictions together, and a 1.2x improvement by batching 20. A modest success, but a success nonetheless.
-
-### Graceful Degradation and Queuing
-
-All systems, at some point in their lifetime, will come under more load than they were designed for; what happens then is an important feature (or bug) of their design. Horizontal scaling is never immediate: it takes a bit of time to spin up additional hardware to handle the load. It can take a second, or a minute, depending on availability, but in both cases, existing resources need to serve traffic the best way they can.
-
-We were hoping to test PostgresML to its breaking point, but we couldn't quite get there. As the load (number of clients) increased beyond provisioned capacity, the only thing we saw was a gradual increase in latency. Throughput remained roughly the same. This gradual latency increase was caused by simple queuing: the replicas couldn't serve all incoming requests concurrently, so the excess requests had to patiently wait in the poolers.
-
-
-
-_"What's taking so long over there!?"_
-
-Among many others, this is a very important feature of any proxy: it's a FIFO queue (first in, first out). If the system is underutilized, queue size is 0 and all requests are served as quickly as physically possible. If the system is overutilized, the queue size increases, holds as the number of requests stabilizes, and decreases back to 0 as the system is scaled up to accommodate new traffic.
-
-Queueing overall is not desirable, but it's a feature, not a bug. While autoscaling spins up an additional replica, the app continues to work, although a few milliseconds slower, which is a good trade off for not overspending on hardware.
-
-As the demand on PostgresML increases, the system gracefully handles the load. If the number of replicas stays the same, latency slowly increases, all the while remaining well within acceptable limits. Throughput holds as well, as an increasing number of clients evenly splits the available resources.
-
-If we increase the number of replicas, latency decreases and throughput increases, as the number of clients increases in parallel. We get the best result with 5 replicas, but this number is variable and can be changed as needs for latency compete with cost.
-
-## What's Next
-
-Horizontal scaling and high availability are fascinating topics in software engineering. Needing to serve 1 million predictions per second is rare, but having the ability to do that, and more if desired, is an important property of any new system.
-
-The next challenge for us is to scale writes horizontally. In the database world, this means sharding the database into multiple separate machines using a hashing function, and automatically routing both reads and writes to the right shards. There are many possible solutions on the market for this already, e.g. Citus and Foreign Data Wrappers, but none are as horizontally scalable as we'd like, although we will incorporate them into our architecture until we build the one we really want.
-
-For that purpose, we're building our own open source [Postgres proxy](https://github.com/levkk/pgcat/) which we discussed earlier in the article. As we progress further in our journey, we'll be adding more features and performance improvements.
-
-By combining PgCat with PostgresML, we are aiming to build the next generation of machine learning infrastructure that can power anything from tiny startups to unicorns and massive enterprises, without the data ever leaving our favorite database.
-
-## Methodology
-
-### ML
-
-This time, we used an XGBoost model with 100 trees:
-
-```postgresql
-SELECT * FROM pgml.train(
- 'flights',
- task => 'regression',
- relation_name => 'flights_mat_3',
- y_column_name => 'depdelayminutes',
- algorithm => 'xgboost',
- hyperparams => '{"n_estimators": 100 }',
- runtime => 'rust'
-);
-```
-
-and fetched our predictions the usual way:
-
-```postgresql
-SELECT pgml.predict(
- 'flights',
- ARRAY[
- year,
- quarter,
- month,
- distance,
- dayofweek,
- dayofmonth,
- flight_number_operating_airline,
- originairportid,
- destairportid,
- flight_number_marketing_airline,
- departure
- ]
-) AS prediction
-FROM flights_mat_3 LIMIT :limit;
-```
-
-where `:limit` is the batch size: 1, 5, or 20.
-
-#### Model
-
-The model is roughly the same as the one we used in our previous post, with just one extra feature added, which improved R2 a little bit.
-
-### Hardware
-
-#### Client
-
-The client was a `c5n.4xlarge` box on EC2. We chose the `c5n` class to get the 100 GBit NIC, since we wanted to saturate the network as much as possible. Thousands of clients were simulated using [`pgbench`](https://www.postgresql.org/docs/current/pgbench.html).
-
-#### PgCat Pooler
-
-PgCat, written in asynchronous Rust, was running on `c5.xlarge` machines (4 vCPUs, 8GB RAM) with 4 Tokio workers. We used between 1 and 35 machines, and scaled them in increments of 5-20 at a time.
-
-The pooler did a decent amount of work around parsing queries, making sure they were read-only `SELECT`s, and routing them, at random, to the replicas. If any replica was down for any reason, the pooler would route around it to the remaining machines.
-
-#### Postgres Replicas
-
-Postgres replicas were running on `c5.9xlarge` machines with 36 vCPUs and 72 GB of RAM. The hot dataset fits entirely in memory. The servers were intentionally saturated to maximum capacity before scaling up to test queuing and graceful degradation of performance.
-
-#### Raw Results
-
-Raw latency data is available [here](https://static.postgresml.org/benchmarks/reads-latency.csv) and raw throughput data is available [here](https://static.postgresml.org/benchmarks/reads-throughput.csv).
-
-## Call to Early Adopters
-
-[PostgresML](https://github.com/postgresml/postgresml/) and [PgCat](https://github.com/levkk/pgcat/) are free and open source. If your organization can benefit from simplified and fast machine learning, get in touch! We can help deploy PostgresML internally, and collaborate on new and existing features. Join our [Discord](https://discord.gg/DmyJP3qJ7U) or [email](mailto:team@postgresml.org) us!
-
-Many thanks and ❤️ to all those who are supporting this endeavor. We’d love to hear feedback from the broader ML and Engineering community about applications and other real world scenarios to help prioritize our work. You can show your support by starring us on our [Github](https://github.com/postgresml/postgresml/).
diff --git a/pgml-docs/benchmarks/mindsdb-vs-postgresml.md b/pgml-docs/benchmarks/mindsdb-vs-postgresml.md
deleted file mode 100644
index cfbc8ec7a..000000000
--- a/pgml-docs/benchmarks/mindsdb-vs-postgresml.md
+++ /dev/null
@@ -1,299 +0,0 @@
-# MindsDB vs PostgresML
-
-## Introduction
-
-There are many ways to do machine learning with data in a SQL database. In this article, we'll compare two projects that both aim to provide a SQL interface to machine learning algorithms and the data they require: **MindsDB** and **PostgresML**. We'll look at how they work, what they can do, and how they compare to each other. The **TLDR** is that PostgresML is more opinionated, more scalable, more capable and several times faster than MindsDB. On the other hand, MindsDB is 5 times more mature than PostgresML according to age and GitHub stars. What are the important factors?
-
-_We're occasionally asked what the difference is between PostgresML and MindsDB. We'd like to answer that question at length, and let you decide if the reasoning is fair._
-
-### At a glance
-
-Both projects are open source, although PostgresML allows for more permissive use with the MIT license, compared to the GPL-3.0 license used by MindsDB. PostgresML is also a significantly newer project, with the first commit in 2022, compared to MindsDB which has been around since 2017, but one of the first hints at the real differences between the two projects is the choice of programming languages. MindsDB is implemented in Python, while PostgresML is implemented with Rust. I say _in_ Python, because it's a language with a runtime, and _with_ Rust, because it's a language with a compiler that does not require a runtime. We'll see how this difference in implementation languages leads to different outcomes.
-
-| | MindsDB | PostgresML |
-| -------- | ------- | ---------- |
-| Age | 5 years | 1 year |
-| License | GPL-3.0 | MIT |
-| Language | Python | Rust |
-
-### Algorithms
-
-Both projects integrate several dozen machine learning algorithms, including the latest LLMs from Hugging Face.
-
-| | MindsDB | PostgresML |
-| ----------------- | ------- | ---------- |
-| Classification | ✅ | ✅ |
-| Regression | ✅ | ✅ |
-| Time Series | ✅ | ✅ |
-| LLM Support | ✅ | ✅ |
-| Embeddings | - | ✅ |
-| Vector Support | - | ✅ |
-| Full Text Search | - | ✅ |
-| Geospatial Search | - | ✅ |
-
-\
-Both MindsDB and PostgresML support many classical machine learning algorithms to do classification and regression. They are both able to load ~~the latest LLMs~~ some models from Hugging Face, supported by underlying implementations in libtorch. I had to cross that out after exploring all the caveats in the MindsDB implementations. PostgresML supports new models immediately upon release, as long as the underlying dependencies are met. MindsDB has to release an update to support any new models, and their current model support is extremely limited. New algorithms, tasks, and models are constantly released, so it's worth checking the documentation for the latest list.
-
-Another difference is that PostgresML also supports embedding models, and closely integrates them with vector search inside the database, which is well beyond the scope of MindsDB, since it's not a database at all. PostgresML has direct access to all the functionality provided by other Postgres extensions, like vector indexes from [pgvector](https://github.com/pgvector/pgvector) to perform efficient KNN & ANN vector recall, or [PostGIS](http://postgis.net/) for geospatial information, as well as built-in full text search. Multiple algorithms and extensions can be combined in compound queries to build state-of-the-art systems, like search and recommendations or fraud detection, that generate an end-to-end result with a single query, something that might take a dozen different machine learning models and microservices in a more traditional architecture.
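-
-As a sketch of what such a compound query can look like (the `documents` table, its pgvector `embedding` column, and the choice of embedding model are hypothetical, not part of this comparison), embedding generation and vector recall can happen in a single statement:
-
-```postgresql
--- Embed the search query once, then return the 10 nearest
--- documents by L2 distance using a pgvector index.
-WITH query AS (
-    SELECT pgml.embed(
-        'intfloat/e5-small',
-        'how do I scale machine learning inference?'
-    )::vector AS embedding
-)
-SELECT d.id, d.title
-FROM documents d, query q
-ORDER BY d.embedding <-> q.embedding
-LIMIT 10;
-```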
-
-### Architecture
-
-The architectural implementations for these projects is significantly different. PostgresML takes a data centric approach with Postgres as the provider for both storage _and_ compute. To provide horizontal scalability for inference, the PostgresML team has also created [PgCat](https://github.com/postgresml/pgcat) to distribute workloads across many Postgres databases. On the other hand, MindsDB takes a service oriented approach that connects to various databases over the network.
-
-\
-
-| | MindsDB | PostgresML |
-| ------------- | ------------- | ---------- |
-| Data Access | Over the wire | In process |
-| Multi Process | ✅ | ✅ |
-| Database | - | ✅ |
-| Replication | - | ✅ |
-| Sharding | - | ✅ |
-| Cloud Hosting | ✅ | ✅ |
-| On Premise | ✅ | ✅ |
-| Web UI | ✅ | ✅ |
-
-\
-
-
-The difference in architecture leads to different tradeoffs and challenges. There are already hundreds of ways to get data into and out of a Postgres database, from just about every service, language and platform, which makes PostgresML highly compatible with other application workflows. On the other hand, the MindsDB Python service accepts connections from specifically supported clients like `psql` and provides a pseudo-SQL interface to the functionality. The service will parse incoming MindsDB commands that look similar to SQL (but are not), for tasks like configuring database connections, or doing actual machine learning. These commands typically contain what looks like a sub-select that will actually fetch data over the wire from the configured databases for machine learning training and inference.
-
-MindsDB is actually a pretty standard Python microservice-based architecture that separates data from compute over the wire, just with a SQL-like API instead of gRPC or REST. MindsDB isn't actually a DB at all, but rather an ML service with adapters for just about every database that Python can connect to.
-
-On the other hand, PostgresML runs ML algorithms inside the database itself. It shares memory with the database, and can access data directly, using pointers to avoid the serialization and networking overhead that frequently dominates data-hungry machine learning applications. Rust is an important language choice for PostgresML because its memory safety simplifies the effort required to achieve stability along with performance in a large and complex memory space. The "tradeoff" is that it requires a Postgres database to actually host the data it operates on.
-
-In addition to the extension, PostgresML relies on PgCat to scale Postgres clusters horizontally using both sharding and replication strategies to provide both scalable compute and storage. Scaling a low latency and high availability feature store is often the most difficult operational challenge for Machine Learning applications. That's the primary driver of PostgresML's architectural choices. MindsDB leaves those issues as an exercise for the adopter, while also introducing a new single service bottleneck for ML compute implemented in Python.
-
-## Benchmarks
-
-If you missed our previous article benchmarking PostgresML vs Python microservices, spoiler alert: PostgresML is between 8-40x faster than Python microservice architectures that do the same thing, even if they use "specialized" in-memory databases like Redis. The network transit cost as well as data serialization is a major cost for data-hungry machine learning algorithms. Since MindsDB doesn't actually provide a DB, we'll create a synthetic benchmark that doesn't use stored data in a database (even though that's the whole point of SQL ML, right?). This will negate the network serialization and transit costs a MindsDB service would typically incur, and highlight the performance differences between the Python and Rust implementations.
-
-#### PostgresML
-
-We'll connect to our Postgres server running locally:
-
-```commandline
-psql postgres://postgres:password@127.0.0.1:5432
-```
-
-For both implementations, we can just pass in our data as part of the query for an apples-to-apples performance comparison. PostgresML adds the `pgml.transform` function, which takes an array of inputs to transform, given a task and a model, without any setup beyond installing the extension. Let's see how long it takes to run a sentiment analysis model on a single sentence:
-
-!!! generic
-
-!!! code\_block time="4769.337 ms"
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'I am so excited to benchmark deep learning models in SQL. I can not wait to see the results!'
- ],
- task => '{
- "task": "text-classification",
- "model": "cardiffnlp/twitter-roberta-base-sentiment"
- }'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| positivity |
-| ---------------------------------------------------- |
-| \[{"label": "LABEL\_2", "score": 0.990081250667572}] |
-
-!!!
-
-!!!
-
-The first time `transform` is run with a particular model name, it will download that pretrained transformer from HuggingFace, and load it into RAM, or VRAM if a GPU is available. In this case, that took about 5 seconds, but let's see how fast it is now that the model is cached.
-
-!!! generic
-
-!!! code\_block time="45.094 ms"
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'I don''t really know if 5 seconds is fast or slow for deep learning. How much time is spent downloading vs running the model?'
- ],
- task => '{
- "task": "text-classification",
- "model": "cardiffnlp/twitter-roberta-base-sentiment"
- }'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| ------------------------------------------------------ |
-| \[{"label": "LABEL\_1", "score": 0.49658918380737305}] |
-
-!!!
-
-!!!
-
-45ms is below the level of human perception, so we could use a deep learning model like this to build an interactive application that feels instantaneous to our users. It's worth noting that PostgresML will automatically use a GPU if it's available. This benchmark machine includes an NVIDIA RTX 3090. We can also check the speed on CPU only, by setting the `device` argument to `cpu`:
-
-!!! generic
-
-!!! code\_block time="165.036 ms"
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'Are GPUs really worth it? Sometimes they are more expensive than the rest of the computer combined.'
- ],
- task => '{
- "task": "text-classification",
- "model": "cardiffnlp/twitter-roberta-base-sentiment",
- "device": "cpu"
- }'::JSONB
-);
-```
-
-!!!
-
-!!! results
-
-| transform |
-| ----------------------------------------------------- |
-| \[{"label": "LABEL\_0", "score": 0.7333963513374329}] |
-
-!!!
-
-!!!
-
-The GPU is able to run this model about 4x faster than the i9-13900K with 24 cores.
-
-#### Model Outputs
-
-You might have noticed that the `inputs` the model was analyzing got less positive over time, and the model moved from `LABEL_2` to `LABEL_1` to `LABEL_0`. Some models use more descriptive outputs, but in this case I had to look at the [README](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment/blob/main/README.md) to see what the labels represent.
-
-Labels:
-
-* 0 -> Negative
-* 1 -> Neutral
-* 2 -> Positive
-
-It looks like this model did correctly pick up on the decreasing enthusiasm in the text, so not only is it relatively fast on a GPU, it's usefully accurate. Another thing to consider when it comes to model quality is that this model was trained on tweets, and these inputs were chosen to be about as long and complex as a tweet. It's not always clear how well a model will generalize to novel-looking inputs, so it's always important to do a little reading about a model when you're looking for ways to test and improve the quality of its output.
-
-#### MindsDB
-
-MindsDB requires a bit more setup than just the database, but I'm running it on the same machine with the latest version. I'll also use the same model, so we can compare apples to apples.
-
-```commandline
-python -m mindsdb --api postgres
-```
-
-Then we can connect to this Python service with our Postgres client:
-
-```commandline
-psql postgres://mindsdb:123@127.0.0.1:55432
-```
-
-And turn timing on to see how long it takes to run the same query:
-
-```sql
-\timing on
-```
-
-And now we can issue some MindsDB pseudo-SQL:
-
-!!! code\_block time="277.722 ms"
-
-```
-CREATE MODEL mindsdb.sentiment_classifier
-PREDICT sentiment
-USING
- engine = 'huggingface',
- task = 'text-classification',
- model_name = 'cardiffnlp/twitter-roberta-base-sentiment',
- input_column = 'text',
- labels = ['negativ', 'neutral', 'positive'];
-```
-
-!!!
-
-This kicked off a background job in the Python service to download the model and set it up, which took about 4 seconds judging from the logs, although I don't have an exact time for when the model reached "status: complete" and was ready to handle queries.
-
-Now we can write a query that will make a prediction similar to PostgresML, using the same Huggingface model.
-
-!!! generic
-
-!!! code\_block time="741.650 ms"
-
-```
-SELECT *
-FROM mindsdb.sentiment_classifier
-WHERE text = 'I am so excited to benchmark deep learning models in SQL. I can not wait to see the results!'
-```
-
-!!!
-
-!!! results
-
-| sentiment | sentiment\_explain | text |
-| --------- | -------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
-| positive | {"positive": 0.990081250667572, "neutral": 0.008058485575020313, "negativ": 0.0018602772615849972} | I am so excited to benchmark deep learning models in SQL. I can not wait to see the results! |
-
-!!!
-
-!!!
-
-Since we've provided the MindsDB model with more human-readable labels, it reuses those (including the `negativ` typo), and returns all three scores along with the input by default. However, this seems to be a bit slower than anything we've seen so far. Let's try to speed it up by only returning the label without the full sentiment\_explain.
-
-!!! generic
-
-!!! code\_block time="841.936 ms"
-
-```
-SELECT sentiment
-FROM mindsdb.sentiment_classifier
-WHERE text = 'I am so excited to benchmark deep learning models in SQL. I can not wait to see the results!'
-```
-
-!!!
-
-!!! results
-
-| sentiment |
-| --------- |
-| positive |
-
-!!!
-
-!!!
-
-It's not the sentiment\_explain that's slowing it down. I spent several hours debugging, and learned a lot more about the internal Python service architecture. I've confirmed that even though, inside the Python service, `torch.cuda.is_available()` returns `True` when the service starts, I never see a Python process use the GPU with `nvidia-smi`. MindsDB also claims to run on GPU, but I haven't been able to find any documentation, or indication in the code, why it doesn't "just work". I'm stumped on this front, but I think it's fair to assume this is a pure CPU benchmark.
-
-The other thing I learned trying to get this working is that MindsDB isn't just a single Python process. Python famously has a GIL that will impair parallelism, so the MindsDB team has cleverly built a service that can run multiple Python processes in parallel. This is great for scaling out, but it means that our query is serialized to JSON and sent to a worker, and then the worker actually runs the model and sends the results back to the parent, again as JSON, which as far as I can tell is where the 5x slow-down is happening.
-
-## Results
-
-PostgresML is the clear winner in terms of performance. It seems to me that it currently also supports more models with a looser function API than the pseudo-SQL required to create a MindsDB model. You'll notice the output structure for models on Hugging Face can vary widely. I tried several not listed in the MindsDB documentation, but received errors on creation. PostgresML just returns the model's output without restructuring, so it's able to handle more discrepancies, although that does leave it up to the end user to sort out how to use them.
-
-| task | model | MindsDB (ms) | PostgresML CPU (ms) | PostgresML GPU (ms) |
-| ----------------------- | ----------------------------------------- | ------------ | ------------------- | ------------------- |
-| text-classification | cardiffnlp/twitter-roberta-base-sentiment | 741 | 165 | 45 |
-| translation\_en\_to\_es | t5-base | 1573 | 1148 | 294 |
-| summarization | sshleifer/distilbart-cnn-12-6 | 4289 | 3450 | 479 |
-
-\
-
-
-There is a general trend: the larger and slower the model, the more time is spent inside libtorch and the less the performance of everything else matters, but for interactive models and use cases there is a significant difference. We've tried to cover the most generous use case we could between these two. If we were to compare XGBoost or other classical algorithms, which can have sub-millisecond prediction times in PostgresML, MindsDB's ~20ms of Python service overhead just to parse the incoming query would make it hundreds of times slower.
-
-## Clouds
-
-Setting these services up is a bit of work, even for someone heavily involved in the day-to-day machine learning mayhem. Managing machine learning services and databases at scale requires a significant investment over time. Both services are available in the cloud, so let's see how they compare on that front as well.
-
-MindsDB is available on the AWS marketplace, running on top of your own hardware instances. You can scale it out and configure your data sources through their web UI, very similar to the local installation, but you'll also need to figure out those data sources and how to scale them for machine learning workloads. Good luck!
-
-PostgresML is available as a fully managed database service that includes the storage, backups, metrics, and scalability through PgCat that large ML deployments need. End-to-end machine learning is rarely just about running the models, and is often more about scaling the data pipelines and managing the data infrastructure around them, so in this case PostgresML also provides a large service advantage, whereas with MindsDB, you'll still need to figure out your cloud data storage solution independently.
diff --git a/pgml-docs/benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md b/pgml-docs/benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md
deleted file mode 100644
index 6d51a11eb..000000000
--- a/pgml-docs/benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md
+++ /dev/null
@@ -1,191 +0,0 @@
-# PostgresML is 8-40x faster than Python HTTP microservices
-
-Machine learning architectures can be some of the most complex, expensive and _difficult_ arenas in modern systems. The number of technologies and the amount of required hardware compete for tightening headcount, hosting, and latency budgets. Unfortunately, the trend in the industry is only getting worse along these lines, with increased usage of state-of-the-art architectures that center around data warehouses, microservices and NoSQL databases.
-
-PostgresML is a simpler alternative to that ever-growing complexity. In this post, we explore some additional performance benefits of a more elegant architecture and discover that PostgresML outperforms traditional Python microservices by a **factor of 8** in local tests and by a **factor of 40** on AWS EC2.
-
-## Candidate architectures
-
-To consider Python microservices with every possible advantage, our first benchmark is run with Python and Redis located on the same machine. Our goal is to avoid any additional network latency, which puts it on a more even footing with PostgresML. Our second test takes place on AWS EC2, with Redis and Gunicorn separated by a network; this benchmark proves to be relatively devastating.
-
-The full source code for both benchmarks is available on [Github](https://github.com/postgresml/postgresml/tree/master/pgml-docs/docs/blog/benchmarks/python\_microservices\_vs\_postgresml).
-
-### PostgresML
-
-PostgresML architecture is composed of:
-
-1. A PostgreSQL server with PostgresML v2.0
-2. [pgbench](https://www.postgresql.org/docs/current/pgbench.html) SQL client
-
-### Python
-
-Python architecture is composed of:
-
-1. A Flask/Gunicorn server accepting and returning JSON
-2. CSV file with the training data
-3. Redis feature store with the inference dataset, serialized with JSON
-4. [ab](https://httpd.apache.org/docs/2.4/programs/ab.html) HTTP client
-
-### ML
-
-Both architectures host the same XGBoost model, running predictions against the same dataset. See [Methodology](#methodology) for more details.
-
-## Results
-
-### Throughput
-
-Throughput is defined as the number of XGBoost predictions the architecture can serve per second. In this benchmark, PostgresML outperformed Python and Redis, running on the same machine, by a **factor of 8**.
-
-In Python, most of the bottleneck comes from having to fetch and deserialize Redis data. Since the features are stored externally, they need to be passed through Python and into XGBoost. XGBoost itself is written in C++, and its Python library only provides a convenient interface. The prediction coming out of XGBoost has to go through Python again, be serialized as JSON, and sent via HTTP to the client.
-
-This is pretty much the bare minimum amount of work you can do for an inference microservice.
-
-PostgresML, on the other hand, collocates data and compute. It fetches data from a Postgres table, which already comes in a standard floating point format, and the Rust inference layer forwards it to XGBoost via a pointer.
-
-An interesting thing happened when the benchmark hit 20 clients: PostgresML throughput started to quickly decrease. This may be surprising to some, but to Postgres enthusiasts it's a known issue: Postgres isn't very good at handling more concurrent active connections than CPU threads. To mitigate this, we introduced PgBouncer (a Postgres proxy and pooler) in front of the database, and the throughput increased back up, and continued to hold as we went to 100 clients.
-
-It's worth noting that the benchmarking machine had only 16 available CPU threads (8 cores). If more cores were available, the bottleneck would only occur with more clients. The general recommendation for Postgres servers is to open around 2 connections per available CPU core, although newer versions of PostgreSQL have been incrementally chipping away at this limitation.
-
-#### Why throughput is important
-
-Throughput allows you to do more with less. If you're able to serve 30,000 queries per second using a single machine, but only using 1,000 today, you're unlikely to need an upgrade anytime soon. On the other hand, if the system can only serve 5,000 requests, an expensive and possibly stressful upgrade is in your near future.
-
-### Latency
-
-Latency is defined as the time it takes to return a single XGBoost prediction. Since most systems have limited resources, throughput directly impacts latency (and vice versa). If there are many active requests, clients waiting in the queue take longer to be serviced, and overall system latency increases.
-
-In this benchmark, PostgresML outperformed Python by a **factor of 8** as well. You'll note the same issue happens at 20 clients, and the same mitigation using PgBouncer reduces its impact. Meanwhile, Python's latency continues to increase substantially.
-
-Latency is a good metric to use when describing the performance of an architecture. In other words, if I were to use this service, I would get a prediction back in at most this long, irrespective of how many other clients are using it.
-
-#### Why latency is important
-
-Latency is important in machine learning services because they are often running as an addition to the main application, and sometimes have to be accessed multiple times during the same HTTP request.
-
-Let's take the example of an e-commerce website. A typical storefront wants to show many personalization models concurrently. Examples of such models could include "buy it again" recommendations for recurring purchases (binary classification), or "popular items in your area" (geographic clustering of purchase histories) or "customers like you bought this item" (nearest neighbour model).
-
-All of these models are important because they have been proven, over time, to be very successful at driving purchases. If inference latency is high, the models start to compete for very expensive real estate, front page and checkout, and the business has to drop some of them or, more likely, suffer from slow page loads. Nobody likes a slow app when they are trying to order groceries or dinner.
-
-### Memory utilization
-
-Python is known for using more memory than more optimized languages and, in this case, it uses **7 times** more than PostgresML.
-
-PostgresML is a Postgres extension, and it shares RAM with the database server. Postgres is very efficient at fetching and allocating only the memory it needs: it reuses `shared_buffers` and OS page cache to store rows for inference, and requires very little to no memory allocation to serve queries.
-
-Meanwhile, Python must allocate memory for each feature it receives from Redis and for each HTTP response it returns. This benchmark did not measure Redis memory utilization, which is an additional and often substantial cost of running traditional machine learning microservices.
-
-#### Training
-
-Since Python often uses Pandas to load and preprocess data, it is notably more memory hungry. Before even passing the data into XGBoost, we were already at 8GB RSS (resident set size); during actual fitting, memory utilization went to almost 12GB. This test is another best case scenario for Python, since the data has already been preprocessed, and was merely passed on to the algorithm.
-
-Meanwhile, PostgresML enjoys sharing RAM with the Postgres server and only allocates the memory needed by XGBoost. The dataset size was significant, but we managed to train the same model using only 5GB of RAM. PostgresML therefore allows training models on datasets at least twice as large as Python can, all the while using identical hardware.
-
-#### Why memory utilization is important
-
-This is another example of doing more with less. Most machine learning algorithms, outside of FAANG and research universities, require the dataset to fit into the memory of a single machine. Distributed training is not where we want it to be, and there is still so much value to be extracted from simple linear regressions.
-
-Using less RAM allows us to train larger and better models on larger and more complete datasets. If you happen to suffer from large machine learning compute bills, using less RAM can be a pleasant surprise at the end of your fiscal year.
-
-## What about UltraJSON/MessagePack/Serializer X?
-
-We spent a lot of time talking about serialization, so it makes sense to look at prior work in that field.
-
-JSON is the most user-friendly format, but it's certainly not the fastest. MessagePack and Ultra JSON, for example, are sometimes faster and more efficient at reading and storing binary information. So, would using them in this benchmark be better, instead of Python's built-in `json` module?
-
-The answer is: not really.
-
-Time to (de)serialize is important, but ultimately needing (de)serialization in the first place is the bottleneck. Taking data out of a remote system (e.g. a feature store like Redis), sending it over a network socket, parsing it into a Python object (which requires memory allocation), only to convert it again to a binary type for XGBoost, is causing unnecessary delays in the system.
-
-PostgresML does **one in-memory copy** of features from Postgres. No network, no (de)serialization, no unnecessary latency.
-
-## What about the real world?
-
-Testing over localhost is convenient, but it's not the most realistic benchmark. In production deployments, the client and the server are on different machines, and in the case of the Python + Redis architecture, the feature store is yet another network hop away.
-
-To demonstrate this, we spun up 3 EC2 instances and ran the benchmark again. This time, PostgresML outperformed Python and Redis **by a factor of 40**.
-
-The network gap between Redis and Gunicorn made things worse... a lot worse. Fetching data from a remote feature store added milliseconds to the request that the Python architecture could not spare. The additional latency compounded and, in a system that has finite resources, caused contention. Most Gunicorn threads were simply waiting on the network, and thousands of requests were stuck in the queue.
-
-PostgresML didn't have this issue, because the features and the Rust inference layer live on the same system. This architectural choice removes network latency and (de)serialization from the equation.
-
-You'll note the concurrency issue we discussed earlier hit Postgres at 20 connections, and we used PgBouncer again to save the day.
-
-Scaling Postgres, once you know how to do it, isn't as difficult as it sounds.
-
-## Methodology
-
-### Hardware
-
-Both the client and the server in the first benchmark were located on the same machine. Redis was local as well. The machine was an 8-core, 16-thread AMD Ryzen 7 5800X with 32GB RAM and a 1TB NVMe SSD, running Ubuntu 22.04.
-
-AWS EC2 benchmarks were done with one `c5.4xlarge` instance hosting Gunicorn and PostgresML, and two `c5.large` instances hosting the client and Redis, respectively. They were located in the same VPC.
-
-### Configuration
-
-Gunicorn was running with 5 workers and 2 threads per worker. Postgres was using 1, 5 and 20 connections for 1, 5 and 20 clients, respectively. PgBouncer was given a `default_pool_size` of 10, so a maximum of 10 Postgres connections were used for 20 and 100 clients.
-
-XGBoost was allowed to use 2 threads during inference, and all available CPU cores (16 threads) during training.
-
-Both `ab` and `pgbench` use all available resources, but are very lightweight; the requests were a single JSON object and a single query respectively. Both of the clients use persistent connections, `ab` by using HTTP Keep-Alives, and `pgbench` by keeping the Postgres connection open for the duration of the benchmark.
-
-## ML
-
-### Data
-
-We used the [Flight Status Prediction](https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022) dataset from Kaggle. After some post-processing, it ended up being about 2 GB of floating point features. We didn't use all columns because some of them are redundant, e.g. airport name and airport identifier, which refer to the same thing.
-
-### Model
-
-Our XGBoost model was trained with default hyperparameters and 25 estimators (also known as boosting rounds).
-
-Data used for training and inference is available [here](https://static.postgresml.org/benchmarks/flights.csv). Data stored in the Redis feature store is available [here](https://static.postgresml.org/benchmarks/flights\_sub.csv). It's only a subset because it was taking hours to load the entire dataset into Redis with a single Python process (28 million rows). Meanwhile, Postgres `COPY` only took about a minute.
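-
-For reference, the load amounts to a single command; a sketch, assuming a `flights` table whose columns match the CSV:
-
-```postgresql
--- psql meta-command; streams the CSV from the client to the server.
-\copy flights FROM 'flights.csv' WITH (FORMAT CSV, HEADER true)
-```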
-
-The PostgresML model is trained with:
-
-```sql
-SELECT * FROM pgml.train(
- project_name => 'r2',
- algorithm => 'xgboost',
- hyperparams => '{ "n_estimators": 25 }'
-);
-```
-
-It had terrible accuracy (as did the Python version), probably because we were missing any kind of weather information, which is a likely cause of delays at airports.
-
-### Source code
-
-Benchmark source code can be found on [Github](https://github.com/postgresml/postgresml/tree/master/pgml-docs/docs/blog/benchmarks/python\_microservices\_vs\_postgresml/).
diff --git a/pgml-docs/developer-docs/README.md b/pgml-docs/developer-docs/README.md
deleted file mode 100644
index b9194723c..000000000
--- a/pgml-docs/developer-docs/README.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Developer Docs
-
diff --git a/pgml-docs/developer-docs/contributing.md b/pgml-docs/developer-docs/contributing.md
deleted file mode 100644
index 75aa933fa..000000000
--- a/pgml-docs/developer-docs/contributing.md
+++ /dev/null
@@ -1,236 +0,0 @@
-# Contributing
-
-Thank you for your interest in contributing to PostgresML! We are an open source, MIT licensed project, and we welcome all contributions, including bug fixes, features, documentation, typo fixes, and Github stars.
-
-Our project consists of three (3) applications:
-
-1. Postgres extension (`pgml-extension`)
-2. Dashboard web app (`pgml-dashboard`)
-3. Documentation (`pgml-docs`)
-
-The development environment for each differs slightly, but overall we use Python, Rust, and PostgreSQL, so as long as you have all of those installed, the setup should be straightforward.
-
-## Build Dependencies
-
-1. Install the latest Rust compiler from [rust-lang.org](https://www.rust-lang.org/learn/get-started).
-2. Install a [modern version](https://apt.kitware.com/) of CMake.
-3. Install PostgreSQL development headers and other dependencies:
-
- ```commandline
- export POSTGRES_VERSION=15
- sudo apt-get update && \
- sudo apt-get install -y \
- postgresql-server-dev-${POSTGRES_VERSION} \
- bison \
- build-essential \
- clang \
- cmake \
- flex \
- libclang-dev \
- libopenblas-dev \
- libpython3-dev \
- libreadline-dev \
- libssl-dev \
- pkg-config \
- python3-dev
- ```
-4. Install the Python dependencies
-
- If your system comes with Python 3.6 or lower, you'll need to install `libpython3.7-dev` or higher. You can get it from [`ppa:deadsnakes/ppa`](https://launchpad.net/\~deadsnakes/+archive/ubuntu/ppa):
-
- ```commandline
- sudo add-apt-repository ppa:deadsnakes/ppa && \
- sudo apt update && sudo apt install -y libpython3.7-dev
- ```
-5. Clone our git repository:
-
- ```commandline
- git clone https://github.com/postgresml/postgresml && \
- cd postgresml && \
-   git submodule update --init --recursive
- ```
-
-## Postgres extension
-
-PostgresML is a Rust extension written with the `tcdi/pgrx` crate. Local development therefore requires the [latest Rust compiler](https://www.rust-lang.org/learn/get-started) and PostgreSQL development headers and libraries.
-
-The extension code is located in:
-
-```commandline
-cd pgml-extension/
-```
-
-With the build dependencies from above installed, you can initialize `pgrx` and get going:
-
-#### Pgrx command line and environments
-
-```commandline
-cargo install cargo-pgrx --version "0.9.8" --locked && \
-cargo pgrx init # This will take a few minutes
-```
-
-#### Huggingface transformers
-
-If you'd like to use huggingface transformers with PostgresML, you'll need to install the Python dependencies:
-
-```commandline
-sudo pip3 install -r requirements.txt
-```
-
-#### Update postgresql.conf
-
-`pgrx` uses Postgres 15 by default. Since `pgml` is using shared memory, you need to add it to `shared_preload_libraries` in `postgresql.conf` which, for `pgrx`, is located in `~/.pgrx/data-15/postgresql.conf`.
-
-```
-shared_preload_libraries = 'pgml' # (change requires restart)
-```
-
-Run the unit tests:
-
-```commandline
-cargo pgrx test
-```
-
-Run the integration tests:
-
-```commandline
-cargo pgrx run --release
-psql -h localhost -p 28815 -d pgml -f tests/test.sql -P pager
-```
-
-Run an interactive psql session:
-
-```commandline
-cargo pgrx run
-```
-
-Create the extension in your database:
-
-```sql
-CREATE EXTENSION pgml;
-```
-
-That's it, PostgresML is ready. You can validate the installation by running:
-
-!!! generic
-
-!!! code\_block
-
-```sql
-SELECT pgml.version();
-```
-
-!!!
-
-!!! results
-
-```
-postgres=# select pgml.version();
- version
--------------------
- 2.7.4
-(1 row)
-```
-
-!!!
-
-!!!
-
-Basic extension usage:
-
-```sql
-SELECT * FROM pgml.load_dataset('diabetes');
-SELECT * FROM pgml.train('Project name', 'regression', 'pgml.diabetes', 'target', 'xgboost');
-SELECT target, pgml.predict('Project name', ARRAY[age, sex, bmi, bp, s1, s2, s3, s4, s5, s6]) FROM pgml.diabetes LIMIT 10;
-```
-
-By default, the extension is built without CUDA support for XGBoost and LightGBM. You'll need to install CUDA locally to build and enable the `cuda` feature for cargo. CUDA can be downloaded [here](https://developer.nvidia.com/cuda-downloads?target\_os=Linux).
-
-```commandline
-CUDACXX=/usr/local/cuda/bin/nvcc cargo pgrx run --release --features pg15,python,cuda
-```
-
-If you ever want to reset the environment, simply spin up the database with `cargo pgrx run` and drop the extension and metadata tables:
-
-```postgresql
-DROP EXTENSION IF EXISTS pgml CASCADE;
-DROP SCHEMA IF EXISTS pgml CASCADE;
-CREATE EXTENSION pgml;
-```
-
-#### Packaging
-
-This requires Docker. Once Docker is installed, you can run:
-
-```bash
-bash build_extension.sh
-```
-
-which will produce a `.deb` file in the current directory (this will take about 20 minutes). The deb file can be installed with `apt-get`, for example:
-
-```bash
-apt-get install ./postgresql-pgml-12_0.0.4-ubuntu20.04-amd64.deb
-```
-
-which will take care of installing its dependencies as well. Make sure to run this as root and not with sudo.
-
-## Run the dashboard
-
-The dashboard is a web app that can be run against any Postgres database with the extension installed. There is a Dockerfile included with the source code if you wish to run it as a container.
-
-The dashboard requires a Postgres database with the [pgml-extension](https://github.com/postgresml/postgresml/tree/master/pgml-extension) to generate the core schema. See that subproject for developer setup.
-
-We develop and test this web application on Linux, OS X, and Windows using WSL2.
-
-Basic installation can be achieved with:
-
-1. Clone the repo (if you haven't already for the extension):
-
-```commandline
-git clone https://github.com/postgresml/postgresml && \
-cd postgresml/pgml-dashboard
-```
-
-2. Set the `DATABASE_URL` environment variable, for example to a running interactive `cargo pgrx run` session started previously:
-
-```commandline
-export DATABASE_URL=postgres://localhost:28815/pgml
-```
-
-3. Run migrations:
-
-```commandline
-sqlx migrate run
-```
-
-4. Run tests:
-
-```commandline
-cargo test
-```
-
-5. Incremental and automatic compilation for development cycles is supported with:
-
-```commandline
-cargo watch --exec run
-```
-
-The dashboard can be packaged for distribution. You'll need to copy the static files along with the `target/release` directory to your server.
-
-## Documentation app
-
-The documentation app (you're using it right now) is using MkDocs.
-
-```
-cd pgml-docs/
-```
-
-Once there, you can set up a virtual environment and get going:
-
-```commandline
-python3 -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
-python -m mkdocs serve
-```
-
-## General
-
-We are a cross-platform team, some of us use WSL and some use Linux or Mac OS. Keeping that in mind, it's good to use common line endings for all files to avoid production errors, e.g. broken Bash scripts.
-
-The project is presently using [Unix line endings](https://docs.github.com/en/get-started/getting-started-with-git/configuring-git-to-handle-line-endings).
diff --git a/pgml-docs/developer-docs/distributed-training.md b/pgml-docs/developer-docs/distributed-training.md
deleted file mode 100644
index 4236962b5..000000000
--- a/pgml-docs/developer-docs/distributed-training.md
+++ /dev/null
@@ -1,172 +0,0 @@
-# Distributed Training
-
-Depending on the size of your dataset and its change frequency, you may want to offload training (or inference) to secondary PostgreSQL servers to avoid excessive load on your primary. We've outlined three of the built-in mechanisms to help distribute the load.
-
-## pg\_dump (< 10GB)
-
-`pg_dump` is a [standard tool](https://www.postgresql.org/docs/12/app-pgdump.html) used to export data from a PostgreSQL database. If your dataset is small (e.g. less than 10GB) and changes infrequently, this could be the quickest and simplest way to do it.
-
-!!! example
-
-```
-# Export data from your production DB
-pg_dump \
- postgres://username:password@production-database.example.com/production_db \
- --no-owner \
- -t table_one \
- -t table_two > dump.sql
-
-# Import the data into PostgresML
-psql \
- postgres://username:password@postgresml.example.com/postgresml_db \
- -f dump.sql
-```
-
-If you're using our Docker stack, you can import the data there:
-
-```
-psql \
- postgres://postgres@localhost:5433/pgml_development \
- -f dump.sql
-```
-
-!!!
-
-PostgresML tables and functions are located in the `pgml` schema, so you can safely import your data into PostgresML without conflicts. You can also use `pg_dump` to copy the `pgml` schema to other servers which will make the trained models available in a distributed fashion.
-
-## Foreign Data Wrappers (10GB - 100GB)
-
-Foreign Data Wrappers, or [FDWs](https://www.postgresql.org/docs/12/postgres-fdw.html) for short, are another good tool for reading or importing data from another PostgreSQL database into PostgresML.
-
-Setting up FDWs is a bit more involved than `pg_dump` but they provide real time access to your production data and are good for small to medium size datasets (e.g. 10GB to 100GB) that change frequently.
-
-Official PostgreSQL [docs](https://www.postgresql.org/docs/12/postgres-fdw.html) explain FDWs with more detail; we'll document a basic example below.
-
-### Install the extension
-
-PostgreSQL comes with `postgres_fdw` already available, but the extension needs to be explicitly installed into the database. Connect to your PostgresML database as a superuser and run:
-
-```postgresql
-CREATE EXTENSION postgres_fdw;
-```
-
-### Create foreign server
-
-A foreign server is a FDW reference to another PostgreSQL database running somewhere else. In this case, that foreign server is your production database.
-
-```postgresql
-CREATE SERVER your_production_db
- FOREIGN DATA WRAPPER postgres_fdw
- OPTIONS (
- host 'production-database.example.com',
- port '5432',
- dbname 'production_db'
- );
-```
-
-### Create user mapping
-
-A user mapping is a relationship between the user you're connecting with to PostgresML and a user that exists on your production database. FDW will use this mapping to talk to your database when it wants to read some data.
-
-```postgresql
-CREATE USER MAPPING FOR pgml_user
- SERVER your_production_db
- OPTIONS (
- user 'your_production_db_user',
- password 'your_production_db_user_password'
- );
-```
-
-At this point, when you connect to PostgresML using the example `pgml_user` and then query data in your production database using FDW, it'll use the user `your_production_db_user` to connect to your DB and fetch the data. Make sure that `your_production_db_user` has `SELECT` permissions on the tables you want to query and `USAGE` permission on the schema.
-
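-For example, a sketch run on the production database, reusing the example table names from above:
-
-```postgresql
-GRANT USAGE ON SCHEMA public TO your_production_db_user;
-GRANT SELECT ON table_one, table_two TO your_production_db_user;
-```
-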
-### Import the tables
-
-The final step is to import your production database tables into PostgresML by creating a foreign schema mapping. This mapping tells PostgresML which tables are available in your database. The quickest way is to import all of them, like so:
-
-```postgresql
-IMPORT FOREIGN SCHEMA public
-FROM SERVER your_production_db
-INTO public;
-```
-
-This will import all tables from your production DB `public` schema into the `public` schema in PostgresML. The tables are now available for querying in PostgresML.
-
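-You can sanity-check the import by querying one of the foreign tables (reusing `table_one` from the `pg_dump` example above):
-
-```postgresql
-SELECT COUNT(*) FROM table_one;
-```
-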
-### Usage
-
-PostgresML snapshots the data before training on it, so every time you run `pgml.train` with a `relation_name` argument, the data will be fetched from the foreign data wrapper and imported into PostgresML.
-
-FDWs are reasonably good at fetching only the data specified by a `VIEW`, so if you place sufficient limits on your dataset in the `CREATE VIEW` statement, e.g. train only on the last two weeks of data, the FDW will do its best to fetch just those two weeks efficiently, leaving the rest behind on the primary.
-
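-For example, a hypothetical sketch that trains only on recent data (`events`, `created_at`, and `target` are placeholder names):
-
-```postgresql
-CREATE VIEW recent_events AS
-SELECT * FROM events
-WHERE created_at > now() - INTERVAL '2 weeks';
-
-SELECT * FROM pgml.train(
-    project_name => 'Recent events',
-    task => 'classification',
-    relation_name => 'recent_events',
-    y_column_name => 'target'
-);
-```
-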
-## Logical replication (100GB - 10TB)
-
-Logical replication is a [replication mechanism](https://www.postgresql.org/docs/12/logical-replication.html) that's been available since PostgreSQL 10. It allows you to copy entire tables and schemas from any database into PostgresML and keep them up-to-date in real time, fairly cheaply, as the data in production changes. This is suitable for medium to large PostgreSQL deployments (e.g. 100GB - 10TB).
-
-Logical replication is designed as a pub/sub system, where your production database is the publisher and PostgresML is the subscriber. As data in your database changes, it is streamed into PostgresML in milliseconds, very similar to how Postgres streaming replication works.
-
-The setup is slightly more involved than Foreign Data Wrappers, and is documented below. All queries must be run as a superuser.
-
-### WAL
-
-First, make sure that your production DB has logical replication enabled. For this, it has to be on PostgreSQL 10 or above and also have `wal_level` configuration set to `logical`.
-
-```
-pgml# SHOW wal_level;
- wal_level
------------
- logical
-(1 row)
-```
-
-If this is not the case, you'll need to change it and restart the server.
-
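-A sketch of making that change with `ALTER SYSTEM` (the new setting only takes effect after a restart):
-
-```postgresql
-ALTER SYSTEM SET wal_level = 'logical';
-```
-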
-### Publication
-
-The [publication](https://www.postgresql.org/docs/12/sql-createpublication.html) is created on your production DB and configures which tables are replicated using logical replication. To replicate all tables in your `public` schema, you can run this:
-
-```postgresql
-CREATE PUBLICATION all_tables
-FOR ALL TABLES;
-```
-
-### Schema
-
-Logical replication does not copy the schema, so it needs to be copied manually in advance; `pg_dump` is great for this:
-
-```bash
-# Dump the schema from your production DB
-pg_dump \
- postgres://username:password@production-db.example.com/production_db \
- --schema-only \
- --no-owner > schema.sql
-
-# Import the schema in PostgresML
-psql \
- postgres://username:password@postgresml.example.com/postgresml_db \
- -f schema.sql
-```
-
-### Subscription
-
-The [subscription](https://www.postgresql.org/docs/12/sql-createsubscription.html) is created in your PostgresML database. To replicate all the tables we marked in the previous step, run:
-
-```postgresql
-CREATE SUBSCRIPTION all_tables
-CONNECTION 'postgres://superuser:password@production-database.example.com/production_db'
-PUBLICATION all_tables;
-```
-
-As soon as you run this, logical replication will begin. It will start by copying all the data from your production database into PostgresML. That will take a while, depending on database size, network connection, and hardware performance. Each table is copied individually, and the process is parallelized.
-
-Once the copy is complete, logical replication will synchronize and replicate the data from your production database into PostgresML in real time.
-
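-You can monitor the initial copy and ongoing replication from the PostgresML side using the standard statistics view:
-
-```postgresql
-SELECT * FROM pg_stat_subscription;
-```
-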
-### Schema changes
-
-Logical replication has one notable limitation: it does not replicate schema (table) changes. If you change a table in your production DB in an incompatible way, e.g. by adding a column, the replication will break.
-
-To remediate this, when you're performing the schema change, make the change first in PostgresML and then in your production database.
-
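-For example, a sketch of adding a column safely (the table and column names are hypothetical):
-
-```postgresql
--- 1. First, on the PostgresML (subscriber) database:
-ALTER TABLE users ADD COLUMN email TEXT;
-
--- 2. Then, on the production (publisher) database:
-ALTER TABLE users ADD COLUMN email TEXT;
-```
-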
-## Native installation (10TB and beyond)
-
-For databases that are very large, e.g. 10TB+, we recommend you install the extension directly into your database.
-
-This option is available for databases of all sizes, but we recognize that many small to medium databases run on managed services, e.g. RDS, which don't allow this mechanism.
diff --git a/pgml-docs/developer-docs/gpu-support.md b/pgml-docs/developer-docs/gpu-support.md
deleted file mode 100644
index 0e6e86034..000000000
--- a/pgml-docs/developer-docs/gpu-support.md
+++ /dev/null
@@ -1,62 +0,0 @@
-# GPU Support
-
-PostgresML is capable of leveraging GPUs when the underlying libraries and hardware are properly configured on the database server. The CUDA runtime is statically linked during the build process, so it does not introduce additional dependencies on the runtime host.
-
-!!! tip
-
-Models trained on GPU may also require GPU support to make predictions. Consult the documentation for each library on configuring training vs inference.
-
-!!!
-
-## Tensorflow
-
-GPU setup for Tensorflow is covered in the [documentation](https://www.tensorflow.org/install/pip). You may acquire pre-trained GPU enabled models for fine tuning from Hugging Face.
-
-## Torch
-
-GPU setup for Torch is covered in the [documentation](https://pytorch.org/get-started/locally/). You may acquire pre-trained GPU enabled models for fine tuning from Hugging Face.
-
-## Flax
-
-GPU setup for Flax is covered in the [documentation](https://github.com/google/jax#pip-installation-gpu-cuda). You may acquire pre-trained GPU enabled models for fine tuning from Hugging Face.
-
-## XGBoost
-
-GPU setup for XGBoost is covered in the [documentation](https://xgboost.readthedocs.io/en/stable/gpu/index.html).
-
-!!! example
-
-```sql
-pgml.train(
- 'GPU project',
- algorithm => 'xgboost',
- hyperparams => '{"tree_method" : "gpu_hist"}'
-);
-```
-
-!!!
-
-## LightGBM
-
-GPU setup for LightGBM is covered in the [documentation](https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html).
-
-!!! example
-
-```sql
-pgml.train(
- 'GPU project',
- algorithm => 'lightgbm',
- hyperparams => '{"device" : "cuda"}'
-);
-```
-
-!!!
-
-## Scikit-learn
-
-None of the scikit-learn algorithms natively support GPU devices. There are a few projects that improve scikit-learn performance with additional parallelism, although we currently have not integrated these with PostgresML:
-
-* https://github.com/intel/scikit-learn-intelex
-* https://github.com/rapidsai/cuml
-
-If your project would benefit from GPU support, please consider opening an issue, so we can prioritize integrations.
diff --git a/pgml-docs/developer-docs/installation.md b/pgml-docs/developer-docs/installation.md
deleted file mode 100644
index ce07cf671..000000000
--- a/pgml-docs/developer-docs/installation.md
+++ /dev/null
@@ -1,374 +0,0 @@
-# Installation
-
-A typical PostgresML deployment consists of two parts: the PostgreSQL extension, and the dashboard web app. The extension provides all the machine learning functionality, and can be used independently. The dashboard provides a system overview for easier management, and notebooks for writing experiments.
-
-## Extension
-
-The extension can be installed by compiling it from source, or if you're using Ubuntu 22.04, from our package repository.
-
-### macOS
-
-!!! tip
-
-If you're just looking to try PostgresML without installing it on your system, take a look at our Quick Start with Docker guide.
-
-!!!
-
-#### Get the source code
-
-To get the source code for PostgresML, you can clone our Github repository:
-
-```bash
-git clone https://github.com/postgresml/postgresml
-```
-
-#### Install dependencies
-
-We provide a `Brewfile` that will install all the necessary dependencies for compiling PostgresML from source:
-
-```bash
-cd pgml-extension && \
-brew bundle
-```
-
-**Rust**
-
-PostgresML is written in Rust, so you'll need to install the latest compiler from [rust-lang.org](https://rust-lang.org). Additionally, we use the Rust PostgreSQL extension framework `pgrx`, which requires some initialization steps:
-
-```bash
-cargo install cargo-pgrx --version 0.9.8 && \
-cargo pgrx init
-```
-
-This step will take a few minutes. Perfect opportunity to get a coffee while you wait.
-
-### Compile and install
-
-With all the dependencies installed, you can compile and install the extension:
-
-```bash
-cargo pgrx install
-```
-
-This will compile all the necessary packages, including Rust bindings to XGBoost and LightGBM, together with Python support for Hugging Face transformers and Scikit-learn. The extension will be automatically installed into the PostgreSQL installation created by the `postgresql@15` Homebrew formula.
-
-### Python dependencies
-
-PostgresML uses Python packages to provide support for Hugging Face LLMs and Scikit-learn algorithms and models. To make this work on your system, you have two options: install those packages into a virtual environment (strongly recommended), or install them globally.
-
-\=== "Virtual environment"
-
-To install the necessary Python packages into a virtual environment, use the `virtualenv` tool installed previously by Homebrew:
-
-```bash
-virtualenv pgml-venv && \
-source pgml-venv/bin/activate && \
-pip install -r requirements.txt && \
-pip install -r requirements-xformers.txt --no-dependencies
-```
-
-\=== "Globally"
-
-Installing Python packages globally can cause issues with your system. If you wish to proceed nonetheless, you can do so:
-
-```bash
-pip3 install -r requirements.txt
-```
-
-\===
-
-### Configuration
-
-We have one last step remaining to get PostgresML running on your system: configuration.
-
-PostgresML needs to be loaded into shared memory by PostgreSQL. To do so, you need to add it to `shared_preload_libraries`.
-
-Additionally, if you've chosen to use a virtual environment for the Python packages, we need to tell PostgresML where to find it.
-
-Both steps can be done by editing the PostgreSQL configuration file `postgresql.conf` using your favorite editor:
-
-```bash
-vim /opt/homebrew/var/postgresql@15/postgresql.conf
-```
-
-Both settings can be added to the config, like so:
-
-```
-shared_preload_libraries = 'pgml,pg_stat_statements'
-pgml.venv = '/absolute/path/to/your/pgml-venv'
-```
-
-Save the configuration file and restart PostgreSQL:
-
-```bash
-brew services restart postgresql@15
-```
-
-### Test your installation
-
-You should be able to connect to PostgreSQL and use our extension now:
-
-!!! generic
-
-!!! code\_block time="953.681ms"
-
-```postgresql
-CREATE EXTENSION pgml;
-SELECT pgml.version();
-```
-
-!!!
-
-!!! results
-
-```
-psql (15.3 (Homebrew))
-Type "help" for help.
-
-pgml_test=# CREATE EXTENSION pgml;
-INFO: Python version: 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)]
-INFO: Scikit-learn 1.2.2, XGBoost 1.7.5, LightGBM 3.3.5, NumPy 1.25.1
-CREATE EXTENSION
-
-pgml_test=# SELECT pgml.version();
- version
----------
- 2.7.4
-(1 row)
-```
-
-!!!
-
-!!!
-
-### pgvector
-
-We like and use pgvector a lot, as documented in our blog posts and examples, to store and search embeddings. You can install pgvector from source pretty easily:
-
-```bash
-git clone --branch v0.4.4 https://github.com/pgvector/pgvector && \
-cd pgvector && \
-echo "trusted = true" >> vector.control && \
-make && \
-make install
-```
-
-**Test pgvector installation**
-
-You can create the `vector` extension in any database:
-
-!!! generic
-
-!!! code\_block time="21.075ms"
-
-```postgresql
-CREATE EXTENSION vector;
-```
-
-!!!
-
-!!! results
-
-```
-psql (15.3 (Homebrew))
-Type "help" for help.
-
-pgml_test=# CREATE EXTENSION vector;
-CREATE EXTENSION
-```
-
-!!!
-
-!!!
-
-### Ubuntu
-
-!!! note
-
-If you're looking to use PostgresML in production, [try our cloud](https://postgresml.org/plans). We support serverless deployments with modern GPUs for startups of all sizes, and dedicated GPU hardware for larger teams that would like to tweak PostgresML to their needs.
-
-!!!
-
-For Ubuntu, we compile and ship packages that include everything needed to install and run the extension. At the moment, only Ubuntu 22.04 (Jammy) is supported.
-
-#### Add our sources
-
-Add our repository to your system sources:
-
-```bash
-echo "deb [trusted=yes] https://apt.postgresml.org $(lsb_release -cs) main" | \
-sudo tee -a /etc/apt/sources.list
-```
-
-#### Install PostgresML
-
-Update your package lists and install PostgresML:
-
-```bash
-export POSTGRES_VERSION=15
-sudo apt update && \
-sudo apt install postgresml-${POSTGRES_VERSION}
-```
-
-The `postgresml-15` package includes all the necessary dependencies, including Python packages shipped inside a virtual environment. Your PostgreSQL server is configured automatically.
-
-We support PostgreSQL versions 11 through 15, so you can install the one matching your currently installed PostgreSQL version.
-
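-To verify the installation, you can connect to the server and create the extension; a sketch (the exact database and user depend on your setup):
-
-```bash
-sudo -u postgres psql -c 'CREATE EXTENSION pgml;' -c 'SELECT pgml.version();'
-```
-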
-#### Installing just the extension
-
-If you prefer to manage your own Python environment and dependencies, you can install just the extension:
-
-```bash
-export POSTGRES_VERSION=15
-sudo apt install postgresql-pgml-${POSTGRES_VERSION}
-```
-
-#### Optimized pgvector
-
-pgvector, the extension we use for storing and searching embeddings, needs to be installed separately for optimal performance. Your hardware may support vectorized operation instructions (like AVX-512), which pgvector can take advantage of to run faster.
-
-To install pgvector from source, you can simply:
-
-```bash
-git clone --branch v0.4.4 https://github.com/pgvector/pgvector && \
-cd pgvector && \
-echo "trusted = true" >> vector.control && \
-make && \
-make install
-```
-
-### Other Linux
-
-PostgresML will compile and run on pretty much any modern Linux distribution. For a quick example, you can take a look at what we do to build the extension on [Ubuntu](../../.github/workflows/package-extension.yml), and modify those steps to work on your distribution.
-
-#### Get the source code
-
-To get the source code for PostgresML, you can clone our Github repo:
-
-```bash
-git clone https://github.com/postgresml/postgresml
-```
-
-#### Dependencies
-
-You'll need the following packages installed first. The names are taken from Ubuntu (and other Debian-based distros), so you'll need to change them to fit your distribution:
-
-```
-export POSTGRES_VERSION=15
-
-build-essential
-clang
-libopenblas-dev
-libssl-dev
-bison
-flex
-pkg-config
-cmake
-libreadline-dev
-libz-dev
-tzdata
-sudo
-libpq-dev
-libclang-dev
-postgresql-${POSTGRES_VERSION}
-postgresql-server-dev-${POSTGRES_VERSION}
-python3
-python3-pip
-libpython3
-lld
-```
-
-**Rust**
-
-PostgresML is written in Rust, so you'll need to install the latest compiler version from [rust-lang.org](https://rust-lang.org).
-
-#### `pgrx`
-
-We use the `pgrx` Postgres Rust extension framework, which comes with its own installation and configuration steps:
-
-```bash
-cd pgml-extension && \
-cargo install cargo-pgrx --version 0.9.8 && \
-cargo pgrx init
-```
-
-This step will take a few minutes since it has to download and compile multiple PostgreSQL versions used by `pgrx` for development.
-
-#### Compile and install
-
-Finally, you can compile and install the extension:
-
-```bash
-cargo pgrx install
-```
-
-## Dashboard
-
-The dashboard is a web app that can be run against any Postgres database which has the extension installed. There is a [Dockerfile](../../pgml-dashboard/Dockerfile) included with the source code if you wish to run it as a container.
-
-### Get the source code
-
-To get our source code, you can clone our Github repo (if you haven't already):
-
-```bash
-git clone https://github.com/postgresml/postgresml && \
-cd postgresml/pgml-dashboard
-```
-
-### Configure your database
-
-Use an existing database which has the `pgml` extension installed, or create a new one:
-
-```bash
-createdb pgml_dashboard && \
-psql -d pgml_dashboard -c 'CREATE EXTENSION pgml;'
-```
-
-### Configure the environment
-
-Create a `.env` file with the necessary `DATABASE_URL`, for example:
-
-```bash
-DATABASE_URL=postgres:///pgml_dashboard
-```
-
-### Get Rust
-
-The dashboard is written in Rust and uses the SQLx crate to interact with Postgres. Make sure to install the latest Rust compiler from [rust-lang.org](https://rust-lang.org).
-
-### Database setup
-
-To set up the database, you'll need to install `sqlx-cli` and run the migrations:
-
-```bash
-cargo install sqlx-cli --version 0.6.3 && \
-cargo sqlx database setup
-```
-
-### Frontend dependencies
-
-The dashboard frontend uses Sass, which requires Node and the Sass compiler. You can install Node from Brew, your package repository, or by using [Node Version Manager](https://github.com/nvm-sh/nvm).
-
-If using nvm, you can install the latest stable Node version with:
-
-```bash
-nvm install stable
-```
-
-Once you have Node installed, you can install the Sass compiler globally:
-
-```bash
-npm install -g sass
-```
-
-### Compile and run
-
-Finally, you can compile and run the dashboard:
-
-```
-cargo run
-```
-
-Once compiled, the dashboard will be available on [localhost:8000](http://localhost:8000).
-
-The dashboard can also be packaged for distribution. You'll need to copy the static files along with the `target/release` directory to your server.
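-
-A minimal sketch of that process (the destination host and path are illustrative, assuming the static assets live in `static/`):
-
-```bash
-cargo build --release
-
-# Copy the compiled binary and static assets to your server
-scp -r target/release static your-server:/srv/pgml-dashboard/
-```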
diff --git a/pgml-docs/developer-docs/quick-start-with-docker.md b/pgml-docs/developer-docs/quick-start-with-docker.md
deleted file mode 100644
index 53d623220..000000000
--- a/pgml-docs/developer-docs/quick-start-with-docker.md
+++ /dev/null
@@ -1,280 +0,0 @@
-# Quick Start with Docker
-
-To try PostgresML on your system for the first time, [Docker](https://docs.docker.com/engine/install/) is a great tool to get you started quickly. We've prepared a Docker image that comes with the latest version of PostgresML and all of its dependencies. If you have Nvidia GPUs on your machine, you'll also be able to use GPU acceleration.
-
-!!! tip
-
-If you're looking to get started with PostgresML as quickly as possible, [sign up](https://postgresml.org/signup) for our free serverless [cloud](https://postgresml.org/signup). You'll get a database in seconds, and will be able to use all the latest Hugging Face models on modern GPUs.
-
-!!!
-
-## Get Started
-
-\=== "macOS"
-
-```bash
-docker run \
- -it \
- -v postgresml_data:/var/lib/postgresql \
- -p 5433:5432 \
- -p 8000:8000 \
- ghcr.io/postgresml/postgresml:2.7.3 \
- sudo -u postgresml psql -d postgresml
-```
-
-\=== "Linux with GPUs"
-
-Make sure you have CUDA, the NVIDIA Container Toolkit, and matching graphics drivers installed. You can install everything from [Nvidia](https://developer.nvidia.com/cuda-downloads).
-
-On Ubuntu, you can install everything with:
-
-```bash
-sudo apt install -y \
- cuda \
-    nvidia-container-toolkit
-```
-
-To run the container with GPU capabilities:
-
-```bash
-docker run \
- -it \
- -v postgresml_data:/var/lib/postgresql \
- --gpus all \
- -p 5433:5432 \
- -p 8000:8000 \
- ghcr.io/postgresml/postgresml:2.7.3 \
- sudo -u postgresml psql -d postgresml
-```
-
-If your machine doesn't have a GPU, just omit the `--gpus all` option, and the container will start and use the CPU instead.
-
-\=== "Windows"
-
-Install [WSL](https://learn.microsoft.com/en-us/windows/wsl/install) and [Docker Desktop](https://www.docker.com/products/docker-desktop/). You can then use the **Linux with GPUs** instructions. GPU support is included; make sure to [enable CUDA](https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl).
-
-\===
-
-Once the container is running, setting up PostgresML is as simple as creating the extension and running a few queries to make sure everything is working correctly.
-
-!!! generic
-
-!!! code\_block time="41.520ms"
-
-```postgresql
-CREATE EXTENSION IF NOT EXISTS pgml;
-SELECT pgml.version();
-```
-
-!!!
-
-!!! results
-
-```
-postgresml=# CREATE EXTENSION IF NOT EXISTS pgml;
-INFO: Python version: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
-INFO: Scikit-learn 1.2.2, XGBoost 1.7.5, LightGBM 3.3.5, NumPy 1.25.1
-CREATE EXTENSION
-Time: 41.520 ms
-
-postgresml=# SELECT pgml.version();
- version
----------
- 2.7.3
-(1 row)
-```
-
-!!!
-
-!!!
-
-You can continue using the command line, or connect to the container using any of the commonly used PostgreSQL tools like `psql`, pgAdmin, DBeaver, and others:
-
-```bash
-psql -h 127.0.0.1 -p 5433 -U postgresml
-```
-
-## Workflows
-
-PostgresML allows you to generate embeddings with open source models from Hugging Face, easily prompt LLMs with tasks like translation and text generation, and train classical machine learning models on tabular data.
-
-### Embeddings
-
-To generate an embedding, all you have to do is use the `pgml.embed(model_name, text)` function with any open source model available on Hugging Face.
-
-!!! example
-
-!!! code\_block time="51.907ms"
-
-```postgresql
-SELECT pgml.embed(
- 'intfloat/e5-small',
- 'passage: PostgresML is so easy!'
-);
-```
-
-!!!
-
-!!! results
-
-```
-postgres=# SELECT pgml.embed(
- 'intfloat/e5-small',
- 'passage: PostgresML is so easy!'
-);
-
-{0.02997742,-0.083322115,-0.074212186,0.016167048,0.09899471,-0.08137268,-0.030717574,0.03474584,-0.078880586,0.053087912,-0.027900297,-0.06316991,
- 0.04218509,-0.05953648,0.028624319,-0.047688972,0.055339724,0.06451558,-0.022694778,0.029539965,-0.03861752,-0.03565117,0.06457901,0.016581751,
-0.030634841,-0.026699776,-0.03840521,0.10052487,0.04131341,-0.036192447,0.036209006,-0.044945586,-0.053815156,0.060391728,-0.042378396,
- -0.008441956,-0.07911099,0.021774381,0.034313954,0.011788908,-0.08744744,-0.011105505,0.04577902,0.0045646844,-0.026846683,-0.03492123,0.068385094,
--0.057966642,-0.04777695,0.11460253,0.010138827,-0.0023120022,0.052329376,0.039127126,-0.100108854,-0.03925074,-0.0064703166,-0.078960024,-0.046833295,
-0.04841002,0.029004619,-0.06588247,-0.012441916,0.001127402,-0.064730585,0.05566701,-0.08166461,0.08834854,-0.030919826,0.017261868,-0.031665307,
-0.039764903,-0.0747297,-0.079097,-0.063424855,0.057243366,-0.025710078,0.033673875,0.050384883,-0.06700917,-0.020863676,0.001511638,-0.012377004,
--0.01928165,-0.0053149736,0.07477675,0.03526208,-0.033746846,-0.034142617,0.048519857,0.03142429,-0.009989936,-0.018366965,0.098441005,-0.060974542,
-0.066505,-0.013180869,-0.067969725,0.06731659,-0.008099243,-0.010721313,0.06885249,-0.047483806,0.004565877,-0.03747329,-0.048288923,-0.021769432,
-0.033546787,0.008165753,-0.0018901207,-0.05621888,0.025734955,-0.07408746,-0.053908117,-0.021819277,0.045596648,0.0586417,0.0057576317,-0.05601786,
--0.03452876,-0.049566686,-0.055589233,0.0056059696,0.034660816,0.018012922,-0.06444576,0.036400944,-0.064374834,-0.019948835,-0.09571418,0.09412033,-0.07085108,0.039256454,-0.030016104,-0.07527431,-0.019969895,-0.09996753,0.008969355,0.016372273,0.021206321,0.0041883467,0.032393526,0.04027315,-0.03194125,-0.03397957,-0.035261292,0.061776843,0.019698814,-0.01767779,0.018515844,-0.03544395,-0.08169962,-0.02272048,-0.0830616,-0.049991447,-0.04813149,-0.06792019,0.031181566,-0.04156394,-0.058702122,-0.060489867,0.0020844154,0.18472219,0.05215536,-0.038624488,-0.0029086764,0.08512023,0.08431501,-0.03901469,-0.05836445,0.118146114,-0.053862963,0.014351494,0.0151984785,0.06532256,-0.056947585,0.057420347,0.05119938,0.001644649,0.05911524,0.012656099,-0.00918104,-0.009667282,-0.037909098,0.028913427,-0.056370094,-0.06015602,-0.06306665,-0.030340875,-0.14780329,0.0502743,-0.039765555,0.00015358179,0.018831518,0.04897686,0.014638214,-0.08677867,-0.11336724,-0.03236903,-0.065230116,-0.018204475,0.022788873,0.026926292,-0.036414392,-0.053245157,-0.022078559,-0.01690316,-0.042608887,-0.000196666,-0.0018297597,-0.06743311,0.046494357,-0.013597083,-0.06582122,-0.065659754,-0.01980711,0.07082651,-0.020514658,-0.05147128,-0.012459332,0.07485931,0.037384395,-0.03292486,0.03519196,0.014782926,-0.011726298,0.016492695,-0.0141114695,0.08926231,-0.08323172,0.06442687,0.03452826,-0.015580203,0.009428933,0.06759306,0.024144053,0.055612188,-0.015218529,-0.027584016,0.1005267,-0.054801818,-0.008317948,-0.000781896,-0.0055441647,0.018137401,0.04845575,0.022881811,-0.0090647405,0.00068219384,-0.050285354,-0.05689162,0.015139549,0.03553917,-0.09011886,0.010577362,0.053231273,0.022833975,-3.470906e-05,-0.0027906548,-0.03973121,0.007263015,0.00042456342,0.07092535,-0.043497834,-0.0015815622,-0.03489149,0.050679605,0.03153052,0.037204932,-0.13364139,-0.011497628,-0.043809805,0.045094978,-0.037943177,0.0021411474,0.044974167,-0.05388966,0.03780391,0.033220228,-0.027566046,-0.043608706,0.021699436,-0.011780484,0.04654962,-0.04134961,0.00018980364,-0.0846228,-0.0055453447,0.057337128,0.08390022,-0.019327229,0.10235083,0.048388377,0.042193796,0.025521005,0.013201268,-0.0634062,-0.08712715,0.059367906,-0.007045281,0.0041695046,-0.08747506,-0.015170839,-0.07994115,0.06913491,0.06286314,0.030512255,0.0141608,0.046193067,0.0026272296,0.057590637,-0.06136263,0.069828056,-0.038925823,-0.076347575,0.08457048,0.076567,-0.06237806,0.06076619,0.05488552,-0.06070616,0.10767283,0.008605431,0.045823734,-0.0055780583,0.043272685,-0.05226901,0.035603754,0.04357865,-0.061862156,0.06919797,-0.00086810143,-0.006476894,-0.043467253,0.017243104,-0.08460669,0.07001912,0.025264058,0.048577853,-0.07994533,-0.06760861,-0.034988943,-0.024210323,-0.02578568,0.03488276,-0.0064449264,0.0345789,-0.0155197615,0.02356351,0.049044855,0.0497944,0.053986903,0.03198324,0.05944599,-0.027359396,-0.026340311,0.048312716,-0.023747599,0.041861262,0.017830249,0.0051145423,0.018402847,0.027941752,0.06337417,0.0026447168,-0.057954717,-0.037295196,0.03976777,0.057269543,0.09760822,-0.060166832,-0.039156828,0.05768707,0.020471212,0.013265894,-0.050758235,-0.020386606,0.08815887,-0.05172276,-0.040749934,0.01554588,-0.017021973,0.034403082,0.12543736}
-```
-
-!!!
-
-!!!
-
-### Training an XGBoost model
-
-#### Importing a dataset
-
-PostgresML comes with a few built-in datasets. You can also import your own CSV files or data from other sources like BigQuery, S3, and other databases or files. For our example, let's import the `digits` dataset from scikit-learn:
-
-!!! generic
-
-!!! code\_block time="47.532ms"
-
-```postgresql
-SELECT * FROM pgml.load_dataset('digits');
-```
-
-!!!
-
-!!! results
-
-```
-postgres=# SELECT * FROM pgml.load_dataset('digits');
- table_name | rows
--------------+------
- pgml.digits | 1797
-(1 row)
-```
-
-!!!
-
-!!!
-
-#### Training a model
-
-The heart of PostgresML is its `pgml.train()` function. Using only that function, you can load the data from any table or view in the database, train any number of ML models on it, and deploy the best model to production.
-
-!!! generic
-
-!!! code\_block time="222.206ms"
-
-```postgresql
-SELECT * FROM pgml.train(
- project_name => 'My First PostgresML Project',
- task => 'classification',
- relation_name => 'pgml.digits',
- y_column_name => 'target',
- algorithm => 'xgboost',
- hyperparams => '{
- "n_estimators": 25
- }'
-);
-```
-
-!!!
-
-!!! results
-
-```
-postgres=# SELECT * FROM pgml.train(
- project_name => 'My First PostgresML Project',
- task => 'classification',
- relation_name => 'pgml.digits',
- y_column_name => 'target',
- algorithm => 'xgboost',
- hyperparams => '{
- "n_estimators": 25
- }'
-);
-
-[...]
-
-INFO: Metrics: {
- "f1": 0.88244045,
- "precision": 0.8835865,
- "recall": 0.88687027,
- "accuracy": 0.8841871,
- "mcc": 0.87189955,
- "fit_time": 0.7631203,
- "score_time": 0.007338208
-}
-INFO: Deploying model id: 1
- project | task | algorithm | deployed
------------------------------+----------------+-----------+----------
- My First PostgresML Project | classification | xgboost | t
-(1 row)
-```
-
-!!!
-
-!!!
-
-#### Making predictions
-
-After training a model, you can use it to make predictions. PostgresML provides a `pgml.predict(project_name, features)` function which makes real time predictions using the best deployed model for the given project:
-
-!!! generic
-
-!!! code\_block time="8.676ms"
-
-```postgresql
-SELECT
- target,
- pgml.predict('My First PostgresML Project', image) AS prediction
-FROM pgml.digits
-LIMIT 5;
-```
-
-!!!
-
-!!! results
-
-```
- target | prediction
---------+------------
- 0 | 0
- 1 | 1
- 2 | 2
- 3 | 3
- 4 | 4
-```
-
-!!!
-
-!!!
-
-#### Automation of common ML tasks
-
-The following common machine learning tasks are performed automatically by PostgresML:
-
-1. Snapshot the data so the experiment is reproducible
-2. Split the dataset into train and test sets
-3. Train and validate the model
-4. Save it into the model store (a Postgres table)
-5. Load it and cache it during inference
-
-Check out our Training and Predictions documentation for more details. Some more advanced topics like hyperparameter search and GPU acceleration are available as well.
-
-## Dashboard
-
-The Dashboard app is running on [localhost:8000](http://localhost:8000/). You can use it to write experiments in Jupyter-style notebooks, manage projects, and visualize datasets used by PostgresML.
diff --git a/pgml-docs/docs/guides/getting-started/sign-up.md b/pgml-docs/docs/guides/getting-started/sign-up.md
index 9ec627997..11fd8b1b7 100644
--- a/pgml-docs/docs/guides/getting-started/sign-up.md
+++ b/pgml-docs/docs/guides/getting-started/sign-up.md
@@ -5,7 +5,6 @@
1. Go to [https://postgresml.org/signup](https://postgresml.org/signup)
2. Sign up using your email or using Google or Github authentication
3. Login using your account
-4. [data-pre-processing.md](../machine-learning/supervised-learning/data-pre-processing.md "mention")
diff --git a/pgml-docs/docs/guides/machine-learning/supervised-learning/data-pre-processing.md b/pgml-docs/docs/guides/machine-learning/supervised-learning/data-pre-processing.md
index 4f37feaed..90a65132b 100644
--- a/pgml-docs/docs/guides/machine-learning/supervised-learning/data-pre-processing.md
+++ b/pgml-docs/docs/guides/machine-learning/supervised-learning/data-pre-processing.md
@@ -25,9 +25,9 @@ In this example:
There are 3 steps to preprocessing data:
-* [Encoding](../../../../../pgml-dashboard/content/docs/guides/training/preprocessing.md#categorical-encodings) categorical values into quantitative values
-* [Imputing](../../../../../pgml-dashboard/content/docs/guides/training/preprocessing.md#imputing-missing-values) NULL values to some quantitative value
-* [Scaling](../../../../../pgml-dashboard/content/docs/guides/training/preprocessing.md#scaling-values) quantitative values across all variables to similar ranges
+* [Encoding](#categorical-encodings) categorical values into quantitative values
+* [Imputing](#imputing-missing-values) NULL values to some quantitative value
+* [Scaling](#scaling-values) quantitative values across all variables to similar ranges
These preprocessing steps may be specified on a per-column basis to the [train()](../../../../../docs/guides/training/overview) function. By default, PostgresML does minimal preprocessing on training data, and will raise an error during analysis if NULL values are encountered without a preprocessor. All types other than `TEXT` are treated as quantitative variables and cast to floating point representations before passing them to the underlying algorithm implementations.
diff --git a/pgml-docs/faqs.md b/pgml-docs/faqs.md
deleted file mode 100644
index 524aab00b..000000000
--- a/pgml-docs/faqs.md
+++ /dev/null
@@ -1,40 +0,0 @@
----
-description: PostgresML Frequently Asked Questions
----
-
-# FAQs
-
-## What is PostgresML?
-
-PostgresML is an open-source database extension that turns Postgres into an end-to-end machine learning platform. It allows you to build, train, and deploy ML models directly within your Postgres database without moving data between systems.
-
-## What is a DB extension?
-
-A database extension is software that extends the capabilities of a database. Postgres allows extensions to add new data types, functions, operators, indexes, etc. PostgresML uses extensions to bring machine learning capabilities natively into Postgres.
-
-## How does it work?
-
-PostgresML installs as an extension in Postgres. It provides SQL API functions for each step of the ML workflow, such as importing data, transforming features, training models, and making predictions. Models are stored back into Postgres tables. This unified approach eliminates the complexity of moving data between separate systems.
-
-## What are the benefits?
-
-Benefits include faster development cycles, reduced latency, tighter integration between ML and applications, leveraging Postgres' reliability and ACID transactions, and horizontal scaling.
-
-## What are the cons?
-
-PostgresML requires using Postgres as the database. If your data currently resides in a different database, there would be some upfront effort required to migrate the data into Postgres in order to utilize PostgresML's capabilities.
-
-## What is hosted PostgresML?
-
-Hosted PostgresML is a fully managed cloud service that provides all the capabilities of open source PostgresML without the need to run your own database infrastructure.
-
-With hosted PostgresML, you get:
-
-* Flexible compute resources - Choose CPU, RAM or GPU machines tailored to your workload
-* Horizontally scalable inference with read-only replicas
-* High availability for production applications with multi-region deployments
-* Support for multiple users and databases
-* Automated backups and point-in-time restore
-* Monitoring dashboard with metrics and logs
-
-In summary, hosted PostgresML removes the operational burden so you can focus on developing machine learning applications, while still getting the benefits of the unified PostgresML architecture.
diff --git a/pgml-docs/getting-started/README.md b/pgml-docs/getting-started/README.md
deleted file mode 100644
index 9004d48d8..000000000
--- a/pgml-docs/getting-started/README.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Getting Started
-
-PostgresML is a machine learning extension for PostgreSQL that enables you to perform training and inference on text and tabular data using SQL queries. With PostgresML, you can seamlessly integrate machine learning models into your PostgreSQL database and harness the power of cutting-edge algorithms to process data efficiently.
diff --git a/pgml-docs/getting-started/connect-to-the-database.md b/pgml-docs/getting-started/connect-to-the-database.md
deleted file mode 100644
index 2c37b93f8..000000000
--- a/pgml-docs/getting-started/connect-to-the-database.md
+++ /dev/null
@@ -1,36 +0,0 @@
-# Connect to the Database
-
-### SQL Clients
-
-Use any of these popular tools to connect to PostgresML and write SQL queries:
-
-* [Apache Superset](https://superset.apache.org/)
-* [DBeaver](https://dbeaver.io/)
-* [Data Grip](https://www.jetbrains.com/datagrip/)
-* [Postico 2](https://eggerapps.at/postico2/)
-* [Popsql](https://popsql.com/)
-* [Tableau](https://www.tableau.com/)
-* [PowerBI](https://powerbi.microsoft.com/en-us/)
-* [Jupyter](https://jupyter.org/)
-* [VSCode](https://code.visualstudio.com/)
-
-### SQL Libraries
-
-Connect directly to the database with your favorite programming language:
-
-* C++: [libpqxx](https://www.tutorialspoint.com/postgresql/postgresql\_c\_cpp.htm)
-* C#: [Npgsql](https://github.com/npgsql/npgsql), [Dapper](https://github.com/DapperLib/Dapper), or [Entity Framework Core](https://github.com/dotnet/efcore)
-* Elixir: [ecto](https://github.com/elixir-ecto/ecto) or [Postgrex](https://github.com/elixir-ecto/postgrex)
-* Go: [pgx](https://github.com/jackc/pgx), [pg](https://github.com/go-pg/pg) or [Bun](https://github.com/uptrace/bun)
-* Haskell: [postgresql-simple](https://hackage.haskell.org/package/postgresql-simple)
-* Java & Scala: [JDBC](https://jdbc.postgresql.org/) or [Slick](https://github.com/slick/slick)
-* Julia: [LibPQ.jl](https://github.com/iamed2/LibPQ.jl)
-* Lua: [pgmoon](https://github.com/leafo/pgmoon)
-* Node: [node-postgres](https://github.com/brianc/node-postgres), [pg-promise](https://github.com/vitaly-t/pg-promise), or [Sequelize](https://sequelize.org/)
-* Perl: [DBD::Pg](https://github.com/bucardo/dbdpg)
-* PHP: [Laravel](https://laravel.com/) or [PHP](https://www.php.net/manual/en/book.pgsql.php)
-* Python: [psycopg2](https://github.com/psycopg/psycopg2/), [SQLAlchemy](https://www.sqlalchemy.org/), or [Django](https://www.djangoproject.com/)
-* R: [DBI](https://github.com/r-dbi/DBI) or [dbx](https://github.com/ankane/dbx)
-* Ruby: [pg](https://github.com/ged/ruby-pg) or [Rails](https://rubyonrails.org/)
-* Rust: [postgres](https://crates.io/crates/postgres), [SQLx](https://github.com/launchbadge/sqlx) or [Diesel](https://github.com/diesel-rs/diesel)
-* Swift: [PostgresNIO](https://github.com/vapor/postgres-nio) or [PostgresClientKit](https://github.com/codewinsdotcom/PostgresClientKit)
diff --git a/pgml-docs/getting-started/database-credentials.md b/pgml-docs/getting-started/database-credentials.md
deleted file mode 100644
index 0d7df2e09..000000000
--- a/pgml-docs/getting-started/database-credentials.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Database Credentials
-
-Get your database credentials from the database connectivity tab. If you have `psql` installed on your machine, you can copy and paste the **Connecting with psql** field into your terminal.
-
-
diff --git a/pgml-docs/getting-started/select-a-plan.md b/pgml-docs/getting-started/select-a-plan.md
deleted file mode 100644
index aea9fbb23..000000000
--- a/pgml-docs/getting-started/select-a-plan.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Select a plan
-
-Click on **Start Free Project** to get a serverless GPU-powered database.
-
-
diff --git a/pgml-docs/getting-started/sign-up.md b/pgml-docs/getting-started/sign-up.md
deleted file mode 100644
index 11fd8b1b7..000000000
--- a/pgml-docs/getting-started/sign-up.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# Sign up
-
-## Create a new account
-
-1. Go to [https://postgresml.org/signup](https://postgresml.org/signup)
-2. Sign up using your email or using Google or Github authentication
-3. Login using your account
-
-
-
-
diff --git a/pgml-docs/machine-learning/README.md b/pgml-docs/machine-learning/README.md
deleted file mode 100644
index bbb96b550..000000000
--- a/pgml-docs/machine-learning/README.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Machine Learning
-
diff --git a/pgml-docs/machine-learning/natural-language-processing/README.md b/pgml-docs/machine-learning/natural-language-processing/README.md
deleted file mode 100644
index 7e349dc43..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/README.md
+++ /dev/null
@@ -1,8 +0,0 @@
-# Natural Language Processing
-
-PostgresML integrates [🤗 Hugging Face Transformers](https://huggingface.co/transformers) to bring state-of-the-art models into the data layer. There are tens of thousands of pre-trained models with pipelines to turn raw inputs into useful results. Many state-of-the-art deep learning architectures have been published and made available for download. You will want to browse all the [models](https://huggingface.co/models) available to find the perfect solution for your [dataset](https://huggingface.co/datasets) and [task](https://huggingface.co/tasks). For instance, with PostgresML you can:
-
-* Perform natural language processing (NLP) tasks like sentiment analysis, question and answering, translation, summarization and text generation
-* Access thousands of state-of-the-art language models like GPT-2, GPT-J, GPT-Neo from the :hugs: Hugging Face model hub
-* Fine-tune large language models (LLMs) on your own text data for different tasks
-* Use your existing PostgreSQL database as a vector database by generating embeddings from text stored in the database.
diff --git a/pgml-docs/machine-learning/natural-language-processing/embeddings.md b/pgml-docs/machine-learning/natural-language-processing/embeddings.md
deleted file mode 100644
index 65a7d6eac..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/embeddings.md
+++ /dev/null
@@ -1,25 +0,0 @@
----
-description: Numeric representation of text
----
-
-# Embeddings
-
-Embeddings are a numeric representation of text. They are used to represent words and sentences as vectors, an array of numbers. Embeddings can be used to find similar pieces of text, by comparing the similarity of the numeric vectors using a distance measure, or they can be used as input features for other machine learning models, since most algorithms can't use text directly.
-
-Many pretrained LLMs can be used to generate embeddings from text within PostgresML. You can browse all the [models](https://huggingface.co/models?library=sentence-transformers) available to find the best solution on Hugging Face.
-
-```sql
-SELECT pgml.embed(
- 'distilbert-base-uncased',
- 'Star Wars christmas special is on Disney'
- )::vector
-AS embedding
-```
-
-_Result_
-
-```json
-{
-"embedding" : [-0.048401695,-0.20282568,0.2653648,0.12278256,0.24706738, ...]
-}
-```
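-
-A sketch of using these embeddings for similarity search with pgvector (the `documents` table and its `embedding` column are hypothetical):
-
-```sql
-SELECT body
-FROM documents
-ORDER BY embedding <=> pgml.embed(
-    'distilbert-base-uncased',
-    'Star Wars christmas special is on Disney'
-)::vector
-LIMIT 5;
-```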
diff --git a/pgml-docs/machine-learning/natural-language-processing/fill-mask.md b/pgml-docs/machine-learning/natural-language-processing/fill-mask.md
deleted file mode 100644
index 42ef2d3e8..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/fill-mask.md
+++ /dev/null
@@ -1,30 +0,0 @@
----
-description: Task to fill words in a sentence that are hidden
----
-
-# Fill Mask
-
-Fill-mask refers to a task where certain words in a sentence are hidden or "masked", and the objective is to predict what words should fill in those masked positions. Such models are valuable when we want to gain statistical insights about the language used to train the model.
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "fill-mask"
- }'::JSONB,
- inputs => ARRAY[
-        'Paris is the <mask> of France.'
-
- ]
-) AS answer;
-```
-
-_Result_
-
-```json
-[
- {"score": 0.679, "token": 812, "sequence": "Paris is the capital of France.", "token_str": " capital"},
- {"score": 0.051, "token": 32357, "sequence": "Paris is the birthplace of France.", "token_str": " birthplace"},
- {"score": 0.038, "token": 1144, "sequence": "Paris is the heart of France.", "token_str": " heart"},
- {"score": 0.024, "token": 29778, "sequence": "Paris is the envy of France.", "token_str": " envy"},
- {"score": 0.022, "token": 1867, "sequence": "Paris is the Capital of France.", "token_str": " Capital"}]
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/question-answering.md b/pgml-docs/machine-learning/natural-language-processing/question-answering.md
deleted file mode 100644
index 5118327a4..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/question-answering.md
+++ /dev/null
@@ -1,30 +0,0 @@
----
-description: Retrieve the answer to a question from a given text
----
-
-# Question Answering
-
-Question Answering models are designed to retrieve the answer to a question from a given text, which can be particularly useful for searching for information within a document. It's worth noting that some question answering models are capable of generating answers even without any contextual information.
-
-```sql
-SELECT pgml.transform(
- 'question-answering',
- inputs => ARRAY[
- '{
- "question": "Where do I live?",
- "context": "My name is Merve and I live in İstanbul."
- }'
- ]
-) AS answer;
-```
-
-_Result_
-
-```json
-{
- "end" : 39,
- "score" : 0.9538117051124572,
- "start" : 31,
- "answer": "İstanbul"
-}
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/summarization.md b/pgml-docs/machine-learning/natural-language-processing/summarization.md
deleted file mode 100644
index 022b68ca8..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/summarization.md
+++ /dev/null
@@ -1,50 +0,0 @@
----
-description: Task of creating a condensed version of a document
----
-
-# Summarization
-
-Summarization involves creating a condensed version of a document that includes the important information while reducing its length. Different models can be used for this task, with some models extracting the most relevant text from the original document, while other models generate completely new text that captures the essence of the original content.
-
-```sql
-select pgml.transform(
- task => '{"task": "summarization",
- "model": "sshleifer/distilbart-cnn-12-6"
- }'::JSONB,
- inputs => array[
- 'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'
- ]
-);
-```
-
-_Result_
-
-```json
-[
- {"summary_text": " Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . The city is the centre and seat of government of the region and province of Île-de-France, or Paris Region . Paris Region has an estimated 18 percent of the population of France as of 2017 ."}
- ]
-```
-
-You can control the length of summary\_text by passing `min_length` and `max_length` as arguments to the SQL query.
-
-```sql
-select pgml.transform(
- task => '{"task": "summarization",
- "model": "sshleifer/distilbart-cnn-12-6"
- }'::JSONB,
- inputs => array[
- 'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'
- ],
- args => '{
- "min_length" : 20,
- "max_length" : 70
- }'::JSONB
-);
-```
-
-```json
-[
- {"summary_text": " Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . City of Paris is centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated 12,174,880, or about 18 percent"
- }
-]
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/text-classification.md b/pgml-docs/machine-learning/natural-language-processing/text-classification.md
deleted file mode 100644
index 2a378e3f1..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/text-classification.md
+++ /dev/null
@@ -1,190 +0,0 @@
----
-description: Task that involves assigning a label or category to a given text.
----
-
-# Text Classification
-
-Common use cases include sentiment analysis, natural language inference, and the assessment of grammatical correctness. It has a wide range of applications in fields such as marketing, customer service, and political analysis.
-
-### Sentiment Analysis
-
-Sentiment analysis is a type of natural language processing technique that involves analyzing a piece of text to determine the sentiment or emotion expressed within it. It can be used to classify a text as positive, negative, or neutral.
-
-_Basic usage_
-
-```sql
-SELECT pgml.transform(
- task => 'text-classification',
- inputs => ARRAY[
- 'I love how amazingly simple ML has become!',
- 'I hate doing mundane and thankless tasks. ☹️'
- ]
-) AS positivity;
-```
-
-_Result_
-
-```json
-[
- {"label": "POSITIVE", "score": 0.9995759129524232},
- {"label": "NEGATIVE", "score": 0.9903519749641418}
-]
-```
-
-The default [model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) used for text classification is a fine-tuned version of DistilBERT-base-uncased that has been specifically optimized for the Stanford Sentiment Treebank dataset (sst2).
-
-#### _Using specific model_
-
-To use one of the over 19,000 models available on Hugging Face, include the name of the desired model and the `text-classification` task as a JSONB object in the SQL query. For example, if you want to use a RoBERTa [model](https://huggingface.co/models?pipeline\_tag=text-classification) trained on around 40,000 English tweets, with POS (positive), NEG (negative), and NEU (neutral) labels for its classes, include this information in the JSONB object when making your query.
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'I love how amazingly simple ML has become!',
- 'I hate doing mundane and thankless tasks. ☹️'
- ],
- task => '{"task": "text-classification",
- "model": "finiteautomata/bertweet-base-sentiment-analysis"
- }'::JSONB
-) AS positivity;
-```
-
-_Result_
-
-```json
-[
- {"label": "POS", "score": 0.992932200431826},
- {"label": "NEG", "score": 0.975599765777588}
-]
-```
-
-#### _Using industry specific model_
-
-By selecting a model that has been specifically designed for a particular industry, you can achieve more accurate and relevant text classification. An example of such a model is [FinBERT](https://huggingface.co/ProsusAI/finbert), a pre-trained NLP model that has been optimized for analyzing sentiment in financial text. FinBERT was created by training the BERT language model on a large financial corpus, and fine-tuning it to specifically classify financial sentiment. When using FinBERT, the model will provide softmax outputs for three different labels: positive, negative, or neutral.
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'Stocks rallied and the British pound gained.',
- 'Stocks making the biggest moves midday: Nvidia, Palantir and more'
- ],
- task => '{"task": "text-classification",
- "model": "ProsusAI/finbert"
- }'::JSONB
-) AS market_sentiment;
-```
-
-_Result_
-
-```json
-[
- {"label": "positive", "score": 0.8983612656593323},
- {"label": "neutral", "score": 0.8062630891799927}
-]
-```
-
-### Natural Language Inference (NLI)
-
-NLI, or Natural Language Inference, is a type of model that determines the relationship between two texts. The model takes a premise and a hypothesis as inputs and returns a class, which can be one of three types:
-
-* Entailment: This means that the hypothesis is true based on the premise.
-* Contradiction: This means that the hypothesis is false based on the premise.
-* Neutral: This means that there is no relationship between the hypothesis and the premise.
-
-The GLUE dataset is the benchmark dataset for evaluating NLI models. There are different variants of NLI models, such as Multi-Genre NLI, Question NLI, and Winograd NLI.
-
-If you want to use an NLI model, you can find them on the :hugs: Hugging Face model hub. Look for models with "mnli".
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'A soccer game with multiple males playing. Some men are playing a sport.'
- ],
- task => '{"task": "text-classification",
- "model": "roberta-large-mnli"
- }'::JSONB
-) AS nli;
-```
-
-_Result_
-
-```json
-[
- {"label": "ENTAILMENT", "score": 0.98837411403656}
-]
-```
-
-### Question Natural Language Inference (QNLI)
-
-The QNLI task involves determining whether a given question can be answered by the information in a provided document. If the answer can be found in the document, the label assigned is "entailment". Conversely, if the answer cannot be found in the document, the label assigned is "not entailment".
-
-If you want to use a QNLI model, you can find them on the :hugs: Hugging Face model hub. Look for models with "qnli".
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'Where is the capital of France?, Paris is the capital of France.'
- ],
- task => '{"task": "text-classification",
- "model": "cross-encoder/qnli-electra-base"
- }'::JSONB
-) AS qnli;
-```
-
-_Result_
-
-```json
-[
- {"label": "LABEL_0", "score": 0.9978110194206238}
-]
-```
-
-### Quora Question Pairs (QQP)
-
-The Quora Question Pairs model is designed to evaluate whether two given questions are paraphrases of each other. This model takes the two questions and assigns a binary value as output. LABEL\_0 indicates that the questions are paraphrases of each other and LABEL\_1 indicates that the questions are not paraphrases. The benchmark dataset used for this task is the Quora Question Pairs dataset within the GLUE benchmark, which contains a collection of question pairs and their corresponding labels.
-
-If you want to use a QQP model, you can find them on the :hugs: Hugging Face model hub. Look for models with `qqp`.
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'Which city is the capital of France?, Where is the capital of France?'
- ],
- task => '{"task": "text-classification",
- "model": "textattack/bert-base-uncased-QQP"
- }'::JSONB
-) AS qqp;
-```
-
-_Result_
-
-```json
-[
- {"label": "LABEL_0", "score": 0.9988721013069152}
-]
-```
-
-### Grammatical Correctness
-
-Linguistic Acceptability is a task that involves evaluating the grammatical correctness of a sentence. The model used for this task assigns one of two classes to the sentence, either "acceptable" or "unacceptable". LABEL\_0 indicates acceptable and LABEL\_1 indicates unacceptable. The benchmark dataset used for training and evaluating models for this task is the Corpus of Linguistic Acceptability (CoLA), which consists of a collection of texts along with their corresponding labels.
-
-If you want to use a grammatical correctness model, you can find them on the :hugs: Hugging Face model hub. Look for models with `cola`.
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'I will walk to home when I went through the bus.'
- ],
- task => '{"task": "text-classification",
- "model": "textattack/distilbert-base-uncased-CoLA"
- }'::JSONB
-) AS grammatical_correctness;
-```
-
-_Result_
-
-```json
-[
- {"label": "LABEL_1", "score": 0.9576480388641356}
-]
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/text-generation.md b/pgml-docs/machine-learning/natural-language-processing/text-generation.md
deleted file mode 100644
index 8d84ca762..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/text-generation.md
+++ /dev/null
@@ -1,190 +0,0 @@
----
-description: Task of producing new text
----
-
-# Text Generation
-
-Text generation is the task of producing new text, such as filling in incomplete sentences or paraphrasing existing text. It has various use cases, including code generation and story generation. Completion generation models can predict the next word in a text sequence, while text-to-text generation models are trained to learn the mapping between pairs of texts, such as translating between languages. Popular models for text generation include GPT-based models, T5, T0, and BART. These models can be trained to accomplish a wide range of tasks, including text classification, summarization, and translation.
-
-```sql
-SELECT pgml.transform(
- task => 'text-generation',
- inputs => ARRAY[
- 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
- ]
-) AS answer;
-```
-
-_Result_
-
-```json
-[
- [
- {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and eight for the Dragon-lords in their halls of blood.\n\nEach of the guild-building systems is one-man"}
- ]
-]
-```
-
-### Model from hub
-
-To use a specific model from the :hugs: model hub, pass the model name along with the task name in the `task` argument.
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text-generation",
- "model" : "gpt2-medium"
- }'::JSONB,
- inputs => ARRAY[
- 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
- ]
-) AS answer;
-```
-
-_Result_
-
-```json
-[
- [{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone.\n\nThis place has a deep connection to the lore of ancient Elven civilization. It is home to the most ancient of artifacts,"}]
-]
-```
-
-### Maximum Length
-
-To make the generated text longer, you can include the argument `max_length` and specify the desired maximum length of the text.
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text-generation",
- "model" : "gpt2-medium"
- }'::JSONB,
- inputs => ARRAY[
- 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
- ],
- args => '{
- "max_length" : 200
- }'::JSONB
-) AS answer;
-```
-
-_Result_
-
-```json
-[
- [{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Three for the Dwarfs and the Elves, One for the Gnomes of the Mines, and Two for the Elves of Dross.\"\n\nHobbits: The Fellowship is the first book of J.R.R. Tolkien's story-cycle, and began with his second novel - The Two Towers - and ends in The Lord of the Rings.\n\n\nIt is a non-fiction novel, so there is no copyright claim on some parts of the story but the actual text of the book is copyrighted by author J.R.R. Tolkien.\n\n\nThe book has been classified into two types: fantasy novels and children's books\n\nHobbits: The Fellowship is the first book of J.R.R. Tolkien's story-cycle, and began with his second novel - The Two Towers - and ends in The Lord of the Rings.It"}]
-]
-```
-
-### Return Sequences
-
-If you want the model to generate more than one output, you can specify the number of desired output sequences by including the argument `num_return_sequences` in the arguments.
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text-generation",
- "model" : "gpt2-medium"
- }'::JSONB,
- inputs => ARRAY[
- 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
- ],
- args => '{
- "num_return_sequences" : 3
- }'::JSONB
-) AS answer;
-```
-
-_Result_
-
-```json
-[
- [
- {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and Thirteen for the human-men in their hall of fire.\n\nAll of us, our families, and our people"},
- {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and the tenth for a King! As each of these has its own special story, so I have written them into the game."},
- {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone… What's left in the end is your heart's desire after all!\n\nHans: (Trying to be brave)"}
- ]
-]
-```
-
-### Beam Search
-
-Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high-probability word combinations. Beam search achieves this by retaining the num\_beams most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set `num_beams > 1` and `early_stopping=True` so that generation finishes once all beam hypotheses have reached the EOS token.
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text-generation",
- "model" : "gpt2-medium"
- }'::JSONB,
- inputs => ARRAY[
- 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
- ],
- args => '{
- "num_beams" : 5,
- "early_stopping" : true
- }'::JSONB
-) AS answer;
-```
-
-_Result_
-
-```json
-[[
- {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Nine for the Dwarves in their caverns of ice, Ten for the Elves in their caverns of fire, Eleven for the"}
-]]
-```
-
-Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text, as well as avoiding repetitive patterns. In its most basic form, sampling means randomly picking the next word $w\_t$ according to its conditional probability distribution: $$w_t \sim P(w_t|w_{1:t-1})$$
-
-However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as temperature, top-k, or top-p. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text.
-
-You can pass `"do_sample": true` in the arguments to use sampling methods. It is recommended to alter `temperature` or `top_p`, but not both.
-
-### _Temperature_
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text-generation",
- "model" : "gpt2-medium"
- }'::JSONB,
- inputs => ARRAY[
- 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
- ],
- args => '{
- "do_sample" : true,
- "temperature" : 0.9
- }'::JSONB
-) AS answer;
-```
-
-_Result_
-
-```json
-[[{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and Thirteen for the Giants and Men of S.A.\n\nThe First Seven-Year Time-Traveling Trilogy is"}]]
-```
-
-### _Top p_
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text-generation",
- "model" : "gpt2-medium"
- }'::JSONB,
- inputs => ARRAY[
- 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone'
- ],
- args => '{
- "do_sample" : true,
- "top_p" : 0.8
- }'::JSONB
-) AS answer;
-```
-
-_Result_
-
-```json
-[[{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Four for the Elves of the forests and fields, and Three for the Dwarfs and their warriors.\" ―Lord Rohan [src"}]]
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/text-to-text-generation.md b/pgml-docs/machine-learning/natural-language-processing/text-to-text-generation.md
deleted file mode 100644
index 6761ba66e..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/text-to-text-generation.md
+++ /dev/null
@@ -1,40 +0,0 @@
-# Text-to-Text Generation
-
-Text-to-text generation methods, such as T5, are neural network architectures designed to perform various natural language processing tasks, including summarization, translation, and question answering. T5 is a transformer-based architecture pre-trained on a large corpus of text data using denoising autoencoding. This pre-training process enables the model to learn general language patterns and relationships between different tasks, which can be fine-tuned for specific downstream tasks. During fine-tuning, the T5 model is trained on a task-specific dataset to learn how to perform the specific task.
-
-_Translation_
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text2text-generation"
- }'::JSONB,
- inputs => ARRAY[
- 'translate from English to French: I''m very happy'
- ]
-) AS answer;
-```
-
-_Result_
-
-```json
-[
- {"generated_text": "Je suis très heureux"}
-]
-```
-
-Similar to other tasks, we can specify a model for text-to-text generation.
-
-```sql
-SELECT pgml.transform(
- task => '{
- "task" : "text2text-generation",
- "model" : "bigscience/T0"
- }'::JSONB,
- inputs => ARRAY[
- 'Is the word ''table'' used in the same meaning in the two previous sentences? Sentence A: you can leave the books on the table over there. Sentence B: the tables in this book are very hard to read.'
-    ]
-) AS answer;
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/token-classification.md b/pgml-docs/machine-learning/natural-language-processing/token-classification.md
deleted file mode 100644
index 6f90a04fb..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/token-classification.md
+++ /dev/null
@@ -1,60 +0,0 @@
----
-description: Task where labels are assigned to certain tokens in a text.
----
-
-# Token Classification
-
-Token classification is a task in natural language understanding, where labels are assigned to certain tokens in a text. Some popular subtasks of token classification include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models can be trained to identify specific entities in a text, such as individuals, places, and dates. PoS tagging, on the other hand, is used to identify the different parts of speech in a text, such as nouns, verbs, and punctuation marks.
-
-### Named Entity Recognition
-
-Named Entity Recognition (NER) is a task that involves identifying named entities in a text. These entities can include the names of people, locations, or organizations. The task is completed by labeling each token with a class for each named entity and a class named "O" for tokens that don't contain any entities. In this task, the input is text, and the output is the annotated text with named entities.
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'I am Omar and I live in New York City.'
- ],
- task => 'token-classification'
-) as ner;
-```
-
-_Result_
-
-```json
-[[
- {"end": 9, "word": "Omar", "index": 3, "score": 0.997110, "start": 5, "entity": "I-PER"},
- {"end": 27, "word": "New", "index": 8, "score": 0.999372, "start": 24, "entity": "I-LOC"},
- {"end": 32, "word": "York", "index": 9, "score": 0.999355, "start": 28, "entity": "I-LOC"},
- {"end": 37, "word": "City", "index": 10, "score": 0.999431, "start": 33, "entity": "I-LOC"}
-]]
-```
-
-### Part-of-Speech (PoS) Tagging
-
-PoS tagging is a task that involves identifying the parts of speech, such as nouns, pronouns, adjectives, or verbs, in a given text. In this task, the model labels each word with a specific part of speech.
-
-If you want to use a PoS tagging model, look for models tagged with `pos` on the :hugs: Hugging Face model hub.
-
-```sql
-select pgml.transform(
- inputs => array [
- 'I live in Amsterdam.'
- ],
- task => '{"task": "token-classification",
- "model": "vblagoje/bert-english-uncased-finetuned-pos"
- }'::JSONB
-) as pos;
-```
-
-_Result_
-
-```json
-[[
- {"end": 1, "word": "i", "index": 1, "score": 0.999, "start": 0, "entity": "PRON"},
- {"end": 6, "word": "live", "index": 2, "score": 0.998, "start": 2, "entity": "VERB"},
- {"end": 9, "word": "in", "index": 3, "score": 0.999, "start": 7, "entity": "ADP"},
- {"end": 19, "word": "amsterdam", "index": 4, "score": 0.998, "start": 10, "entity": "PROPN"},
- {"end": 20, "word": ".", "index": 5, "score": 0.999, "start": 19, "entity": "PUNCT"}
-]]
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/translation.md b/pgml-docs/machine-learning/natural-language-processing/translation.md
deleted file mode 100644
index 874467b2f..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/translation.md
+++ /dev/null
@@ -1,26 +0,0 @@
----
-description: Task of converting text written in one language into another language.
----
-
-# Translation
-
-Translation is the task of converting text written in one language into another language. You have the option to select from over 2000 models available on the Hugging Face [hub](https://huggingface.co/models?pipeline\_tag=translation) for translation.
-
-```sql
-select pgml.transform(
- inputs => array[
- 'How are you?'
- ],
- task => '{"task": "translation",
- "model": "Helsinki-NLP/opus-mt-en-fr"
- }'::JSONB
-);
-```
-
-_Result_
-
-```json
-[
- {"translation_text": "Comment allez-vous ?"}
-]
-```
diff --git a/pgml-docs/machine-learning/natural-language-processing/zero-shot-classification.md b/pgml-docs/machine-learning/natural-language-processing/zero-shot-classification.md
deleted file mode 100644
index 8d7e272e3..000000000
--- a/pgml-docs/machine-learning/natural-language-processing/zero-shot-classification.md
+++ /dev/null
@@ -1,38 +0,0 @@
----
-description: Task where the model predicts a class that it hasn't seen during the training.
----
-
-# Zero-shot Classification
-
-Zero Shot Classification is a task where the model predicts a class that it hasn't seen during the training phase. This task leverages a pre-trained language model and is a type of transfer learning. Transfer learning involves using a model that was initially trained for one task in a different application. Zero Shot Classification is especially helpful when there is a scarcity of labeled data available for the specific task at hand.
-
-In the example provided below, we will demonstrate how to classify a given sentence into a class that the model has not encountered before. To achieve this, we make use of `args` in the SQL query, which allows us to provide `candidate_labels`. You can customize these labels to suit the context of your task. We will use the `facebook/bart-large-mnli` model.
-
-Look for models with `mnli` to use a zero-shot classification model on the :hugs: Hugging Face model hub.
-
-```sql
-SELECT pgml.transform(
- inputs => ARRAY[
- 'I have a problem with my iphone that needs to be resolved asap!!'
- ],
- task => '{
- "task": "zero-shot-classification",
- "model": "facebook/bart-large-mnli"
- }'::JSONB,
- args => '{
- "candidate_labels": ["urgent", "not urgent", "phone", "tablet", "computer"]
- }'::JSONB
-) AS zero_shot;
-```
-
-_Result_
-
-```json
-[
- {
- "labels": ["urgent", "phone", "computer", "not urgent", "tablet"],
- "scores": [0.503635, 0.47879, 0.012600, 0.002655, 0.002308],
- "sequence": "I have a problem with my iphone that needs to be resolved asap!!"
- }
-]
-```
diff --git a/pgml-docs/machine-learning/supervised-learning/README.md b/pgml-docs/machine-learning/supervised-learning/README.md
deleted file mode 100644
index fbe6f91c5..000000000
--- a/pgml-docs/machine-learning/supervised-learning/README.md
+++ /dev/null
@@ -1,317 +0,0 @@
----
-description: A machine learning approach that uses labeled data
----
-
-# Supervised Learning
-
-PostgresML is a machine learning extension for PostgreSQL that enables you to perform training and inference using SQL queries.
-
-## Training
-
-The training function is at the heart of PostgresML. It's a powerful single mechanism that can handle many different training tasks which are configurable with the function parameters.
-
-### API
-
-Most parameters are optional and have configured defaults. The `project_name` parameter is required and is an easily recognizable identifier to organize your work.
-
-```sql
-pgml.train(
- project_name TEXT,
- task TEXT DEFAULT NULL,
- relation_name TEXT DEFAULT NULL,
- y_column_name TEXT DEFAULT NULL,
- algorithm TEXT DEFAULT 'linear',
- hyperparams JSONB DEFAULT '{}'::JSONB,
- search TEXT DEFAULT NULL,
- search_params JSONB DEFAULT '{}'::JSONB,
- search_args JSONB DEFAULT '{}'::JSONB,
- test_size REAL DEFAULT 0.25,
- test_sampling TEXT DEFAULT 'random'
-)
-```
-
-#### Parameters
-
-
-| Parameter       | Description                                                                                                                                        | Example                                |
-| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------- |
-| `project_name`  | An easily recognizable identifier to organize your work.                                                                                           | `My First PostgresML Project`          |
-| `task`          | The objective of the experiment: `regression` or `classification`.                                                                                 | `classification`                       |
-| `relation_name` | The Postgres table or view where the training data is stored or defined.                                                                          | `public.users`                         |
-| `y_column_name` | The name of the label (aka "target" or "unknown") column in the training table.                                                                   |                                        |
-| `algorithm`     | The algorithm to train on the dataset.                                                                                                            | `xgboost`                              |
-| `hyperparams`   | The hyperparameters to pass to the algorithm for training, JSON formatted.                                                                        | `{ "n_estimators": 25 }`               |
-| `search`        | If set, PostgresML will perform a hyperparameter search to find the best hyperparameters for the algorithm. See Hyperparameter Search for details. | `grid`                                 |
-| `search_params` | Search parameters used in the hyperparameter search, using the scikit-learn notation, JSON formatted.                                             | `{ "n_estimators": [5, 10, 25, 100] }` |
-| `search_args`   | Configuration parameters for the search, JSON formatted. Currently only `n_iter` is supported for `random` search.                                | `{ "n_iter": 10 }`                     |
-| `test_size`     | Fraction of the dataset to use for the test set and algorithm validation.                                                                         | `0.25`                                 |
-| `test_sampling` | Algorithm used to fetch test data from the dataset: `random`, `first`, or `last`.                                                                 | `random`                               |
-
-### Example
-
-```sql
-SELECT * FROM pgml.train(
- project_name => 'My Classification Project',
- task => 'classification',
- relation_name => 'pgml.digits',
- y_column_name => 'target'
-);
-```
-
-This will create a project named **My Classification Project**, copy the `pgml.digits` table into the `pgml` schema, naming it `pgml.snapshot_{id}` where `id` is the primary key of the snapshot, and train a linear classification model on the snapshot using the `target` column as the label.
-
-
-
-When used for the first time in a project, the `pgml.train()` function requires the `task` parameter, which can be either `regression` or `classification`. The task determines the relevant metrics and analysis performed on the data. All models trained within the project will refer to those metrics and analysis for benchmarking and deployment.
-
-The first time it's called, the function will also require a `relation_name` and `y_column_name`. The two arguments will be used to create the first snapshot of training and test data. By default, 25% of the data (specified by the `test_size` parameter) will be randomly sampled to measure the performance of the model after the `algorithm` has been trained on the other 75%.
-
-{% hint style="info" %}
-```sql
-SELECT * FROM pgml.train(
- 'My Classification Project',
- algorithm => 'xgboost'
-);
-```
-{% endhint %}
-
-Future calls to `pgml.train()` may restate the same `task` for a project or omit it, but they can't change it. Projects manage their deployed model using the metrics relevant to a particular task (e.g. `r2` or `f1`), so changing it would mean some models in the project are no longer directly comparable. In that case, it's better to start a new project.
-
-{% hint style="info" %}
-If you'd like to train multiple models on the same snapshot, follow up calls to `pgml.train()` may omit the `relation_name`, `y_column_name`, `test_size` and `test_sampling` arguments to reuse identical data with multiple algorithms or hyperparameters.
-{% endhint %}
-
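-For instance, a follow-up call can reuse the snapshot from the first run and vary only the algorithm and hyperparameters. A minimal sketch (the algorithm and hyperparameter values here are illustrative, not a recommendation):
-
-```sql
-SELECT * FROM pgml.train(
-    'My Classification Project',
-    algorithm => 'random_forest',
-    hyperparams => '{"n_estimators": 25}'
-);
-```
-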
-### Getting Training Data
-
-A large part of the machine learning workflow is acquiring, cleaning, and preparing data for training algorithms. Naturally, we think Postgres is a great place to store your data. For the purpose of this example, we'll load a toy dataset, the classic handwritten digits image collection, from scikit-learn.
-
-
-
-```sql
-SELECT * FROM pgml.load_dataset('digits');
-```
-
-```plsql
-pgml=# SELECT * FROM pgml.load_dataset('digits');
-NOTICE: table "digits" does not exist, skipping
- table_name | rows
--------------+------
- pgml.digits | 1797
-(1 row)
-```
-
-This `NOTICE` can safely be ignored. PostgresML attempts to do a clean reload by dropping the `pgml.digits` table if it exists. The first time this command is run, the table does not exist.
-
-
-
-PostgresML loaded the Digits dataset into the `pgml.digits` table. You can examine the 2D arrays of image data, as well as the label in the `target` column:
-
-```sql
-SELECT
- target,
- image
-FROM pgml.digits LIMIT 5;
-```
-
-```plsql
-target | image
--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------
- 0 | {{0,0,5,13,9,1,0,0},{0,0,13,15,10,15,5,0},{0,3,15,2,0,11,8,0},{0,4,12,0,0,8,8,0},{0,5,8,0,0,9,8,0},{0,4,11,0,1,12,7,0},{0,2,14,5,10,12,0,0},{0,0,6,13,10,0,0,0}}
- 1 | {{0,0,0,12,13,5,0,0},{0,0,0,11,16,9,0,0},{0,0,3,15,16,6,0,0},{0,7,15,16,16,2,0,0},{0,0,1,16,16,3,0,0},{0,0,1,16,16,6,0,0},{0,0,1,16,16,6,0,0},{0,0,0,11,16,10,0,0}}
- 2 | {{0,0,0,4,15,12,0,0},{0,0,3,16,15,14,0,0},{0,0,8,13,8,16,0,0},{0,0,1,6,15,11,0,0},{0,1,8,13,15,1,0,0},{0,9,16,16,5,0,0,0},{0,3,13,16,16,11,5,0},{0,0,0,3,11,16,9,0}}
- 3 | {{0,0,7,15,13,1,0,0},{0,8,13,6,15,4,0,0},{0,2,1,13,13,0,0,0},{0,0,2,15,11,1,0,0},{0,0,0,1,12,12,1,0},{0,0,0,0,1,10,8,0},{0,0,8,4,5,14,9,0},{0,0,7,13,13,9,0,0}}
- 4 | {{0,0,0,1,11,0,0,0},{0,0,0,7,8,0,0,0},{0,0,1,13,6,2,2,0},{0,0,7,15,0,9,8,0},{0,5,16,10,0,16,6,0},{0,4,15,16,13,16,1,0},{0,0,0,3,15,10,0,0},{0,0,0,2,16,4,0,0}}
-(5 rows)
-```
-
-### Training a Model
-
-Now that we've got data, we're ready to train a model using an algorithm. We'll start with the default `linear` algorithm to demonstrate the basics. See [Algorithms](../../../docs/guides/training/algorithm\_selection) for a complete list of available algorithms.
-
-```sql
-SELECT * FROM pgml.train(
- 'Handwritten Digit Image Classifier',
- 'classification',
- 'pgml.digits',
- 'target'
-);
-```
-
-```plsql
-INFO: Snapshotting table "pgml.digits", this may take a little while...
-INFO: Snapshot of table "pgml.digits" created and saved in "pgml"."snapshot_1"
-INFO: Dataset { num_features: 64, num_labels: 1, num_rows: 1797, num_train_rows: 1348, num_test_rows: 449 }
-INFO: Training Model { id: 1, algorithm: linear, runtime: python }
-INFO: Hyperparameter searches: 1, cross validation folds: 1
-INFO: Hyperparams: {}
-INFO: Metrics: {
- "f1": 0.91903764,
- "precision": 0.9175061,
- "recall": 0.9205743,
- "accuracy": 0.9175947,
- "mcc": 0.90866333,
- "fit_time": 0.17586434,
- "score_time": 0.01282608
-}
- project | task | algorithm | deployed
-------------------------------------+----------------+-----------+----------
- Handwritten Digit Image Classifier | classification | linear | t
-(1 row)
-```
-
-The output gives us information about the training run, including the `deployed` status. This is great news: it indicates that training reached a new high score for the project's key metric, and our new model was automatically deployed as the one that will be used to make new predictions for the project. See [Deployments](../../../docs/guides/predictions/deployments) for a guide to managing the active model.
-
-### Inspecting the results
-
-Now we can inspect some of the artifacts a training run creates.
-
-```sql
-SELECT * FROM pgml.overview;
-```
-
-```plsql
-pgml=# SELECT * FROM pgml.overview;
- name | deployed_at | task | algorithm | runtime | relation_name | y_column_name | test_sampling | test_size
-------------------------------------+----------------------------+----------------+-----------+---------+---------------+---------------+---------------+-----------
- Handwritten Digit Image Classifier | 2022-10-11 12:43:15.346482 | classification | linear | python | pgml.digits | {target} | last | 0.25
-(1 row)
-```
-
-## Inference
-
-The `pgml.predict()` function is the key value proposition of PostgresML. It provides online predictions using the best, automatically deployed model for a project.
-
-### API
-
-The API for predictions is very simple and only requires two arguments: the project name and the features used for prediction.
-
-```sql
-pgml.predict(
-    project_name TEXT,
-    features REAL[]
-)
-```
-
-#### Parameters
-
-| Parameter | Description | Example |
-| -------------- | -------------------------------------------------------- | ----------------------------- |
-| `project_name` | The project name used to train models in `pgml.train()`. | `My First PostgresML Project` |
-| `features` | The feature vector used to predict a novel data point. | `ARRAY[0.1, 0.45, 1.0]` |
-
-#### Example
-
-```sql
-SELECT pgml.predict(
- 'My Classification Project',
- ARRAY[0.1, 2.0, 5.0]
-) AS prediction;
-```
-
-
-
-where `ARRAY[0.1, 2.0, 5.0]` is a feature vector of the same type and in the same order as the features in the training data table or view. The resulting score can be used in other regular queries.
-
-!!! example
-
-```sql
-SELECT *,
-    pgml.predict(
-        'Buy it Again',
-        ARRAY[
-            users.location_id,
-            EXTRACT(EPOCH FROM NOW() - users.created_at),
-            users.total_purchases_in_dollars
- ]
- ) AS buying_score
-FROM users
-WHERE tenant_id = 5
-ORDER BY buying_score
-LIMIT 25;
-```
-
-!!!
-
-### Example
-
-If you've already been through the [Training Overview](../../../docs/guides/training/overview), you can see the results of those efforts:
-
-```sql
-SELECT
- target,
- pgml.predict('Handwritten Digit Image Classifier', image) AS prediction
-FROM pgml.digits
-LIMIT 10;
-```
-
-```plsql
- target | prediction
---------+------------
- 0 | 0
- 1 | 1
- 2 | 2
- 3 | 3
- 4 | 4
- 5 | 5
- 6 | 6
- 7 | 7
- 8 | 8
- 9 | 9
-(10 rows)
-```
-
-### Active Model
-
-Since it's so easy to train multiple algorithms with different hyperparameters, sometimes it's a good idea to know which deployed model is used to make predictions. You can find that out by querying the `pgml.deployed_models` view:
-
-```sql
-SELECT * FROM pgml.deployed_models;
-```
-
-```plsql
- id | name | task | algorithm | runtime | deployed_at
-----+------------------------------------+----------------+-----------+---------+----------------------------
- 4 | Handwritten Digit Image Classifier | classification | xgboost | rust | 2022-10-11 13:06:26.473489
-(1 row)
-```
-
-PostgresML will automatically deploy a model only if it has better metrics than existing ones, so it's safe to experiment with different algorithms and hyperparameters.
-
-Take a look at [Deploying Models](../../../docs/guides/predictions/deployments) documentation for more details.
-
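-If a deployment ever needs to be reverted, the active model can also be changed manually. A minimal sketch using `pgml.deploy()` with the `rollback` strategy (check the deployments guide above for the authoritative signature and strategy names):
-
-```sql
-SELECT * FROM pgml.deploy(
-    'Handwritten Digit Image Classifier',
-    strategy => 'rollback'
-);
-```
-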
-### Specific Models
-
-You may also specify a `model_id` instead of a project name, to make predictions with a particular training run. You can find model ids by querying the `pgml.models` table.
-
-```sql
-SELECT models.id, models.algorithm, models.metrics
-FROM pgml.models
-JOIN pgml.projects
- ON projects.id = models.project_id
-WHERE projects.name = 'Handwritten Digit Image Classifier';
-```
-
-```plsql
- id | algorithm | metrics
-----+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-  1 | linear    | {"f1": 0.9190376400947571, "mcc": 0.9086633324623108, "recall": 0.9205743074417114, "accuracy": 0.9175946712493896, "fit_time": 0.8388963937759399, "precision": 0.9175060987472534, "score_time": 0.019625699147582054}
-```
-
-For example, making predictions with `model_id = 1`:
-
-```sql
-SELECT
- target,
- pgml.predict(1, image) AS prediction
-FROM pgml.digits
-LIMIT 10;
-```
-
-```plsql
- target | prediction
---------+------------
- 0 | 0
- 1 | 1
- 2 | 2
- 3 | 3
- 4 | 4
- 5 | 5
- 6 | 6
- 7 | 7
- 8 | 8
- 9 | 9
-(10 rows)
-```
diff --git a/pgml-docs/machine-learning/supervised-learning/classification.md b/pgml-docs/machine-learning/supervised-learning/classification.md
deleted file mode 100644
index d801343ab..000000000
--- a/pgml-docs/machine-learning/supervised-learning/classification.md
+++ /dev/null
@@ -1,53 +0,0 @@
----
-description: >-
- Technique that assigns new observations to categorical labels or classes based
- on a model built from labeled training data.
----
-
-# Classification
-
-We currently support classification algorithms from [scikit-learn](https://scikit-learn.org/), [XGBoost](https://xgboost.readthedocs.io/), and [LightGBM](https://lightgbm.readthedocs.io/).
-
-### Gradient Boosting
-
-| Algorithm | Classification |
-| ----------------------- | -------------------------------------------------------------------------------------------------------------------------- |
-| `xgboost` | [XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBClassifier) |
-| `xgboost_random_forest` | [XGBRFClassifier](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRFClassifier) |
-| `lightgbm` | [LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier) |
-| `catboost` | [CatBoostClassifier](https://catboost.ai/en/docs/concepts/python-reference\_catboostclassifier) |
-
-### Scikit Ensembles
-
-| Algorithm | Classification |
-| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
-| `ada_boost` | [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) |
-| `bagging` | [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) |
-| `extra_trees` | [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) |
-| `gradient_boosting_trees` | [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) |
-| `random_forest` | [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) |
-| `hist_gradient_boosting` | [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) |
-
-### Support Vector Machines
-
-| Algorithm | Classification |
-| ------------ | ----------------------------------------------------------------------------------------- |
-| `svm` | [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) |
-| `nu_svm` | [NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html) |
-| `linear_svm` | [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) |
-
-### Linear Models
-
-| Algorithm | Classification |
-| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `linear` | [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LogisticRegression.html) |
-| `ridge` | [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.RidgeClassifier.html) |
-| `stochastic_gradient_descent` | [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.SGDClassifier.html) |
-| `perceptron` | [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Perceptron.html) |
-| `passive_aggressive` | [PassiveAggressiveClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.PassiveAggressiveClassifier.html) |
-
-### Other
-
-| Algorithm | Classification |
-| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `gaussian_process` | [GaussianProcessClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian\_process.GaussianProcessClassifier.html) |
diff --git a/pgml-docs/machine-learning/supervised-learning/data-pre-processing.md b/pgml-docs/machine-learning/supervised-learning/data-pre-processing.md
deleted file mode 100644
index 0625c736a..000000000
--- a/pgml-docs/machine-learning/supervised-learning/data-pre-processing.md
+++ /dev/null
@@ -1,156 +0,0 @@
-# Data Pre-processing
-
-The training function also provides the option to preprocess data with the `preprocess` param. Preprocessors can be configured on a per-column basis for the training data set. There are currently three types of preprocessing available, for both categorical and quantitative variables. Below is a brief example of training data used to learn a model of whether we should carry an umbrella.
-
-{% hint style="info" %}
-Preprocessing steps are saved after training, and repeated identically for future calls to `pgml.predict()`.
-{% endhint %}
-
-#### `weather_data`
-
-| **month** | **clouds** | **humidity** | **temp** | **rain** |
-| --------- | ---------- | ------------ | -------- | -------- |
-| 'jan' | 'cumulus' | 0.8 | 5 | true |
-| 'jan' | NULL | 0.1 | 10 | false |
-| … | … | … | … | … |
-| 'dec' | 'nimbus' | 0.9 | -2 | false |
-
-In this example:
-
-* `month` is an ordinal categorical `TEXT` variable
-* `clouds` is a nullable nominal categorical `INT4` variable
-* `humidity` is a continuous quantitative `FLOAT4` variable
-* `temp` is a discrete quantitative `INT4` variable
-* `rain` is a nominal categorical `BOOL` label
-
-There are 3 steps to preprocessing data:
-
-* [Encoding](../../../pgml-dashboard/content/docs/guides/training/preprocessing.md#categorical-encodings) categorical values into quantitative values
-* [Imputing](../../../pgml-dashboard/content/docs/guides/training/preprocessing.md#imputing-missing-values) NULL values to some quantitative value
-* [Scaling](../../../pgml-dashboard/content/docs/guides/training/preprocessing.md#scaling-values) quantitative values across all variables to similar ranges
-
-These preprocessing steps may be specified on a per-column basis to the [train()](../../../docs/guides/training/overview) function. By default, PostgresML does minimal preprocessing on training data, and will raise an error during analysis if NULL values are encountered without a preprocessor. All types other than `TEXT` are treated as quantitative variables and cast to floating point representations before passing them to the underlying algorithm implementations.
-
-```sql
-SELECT pgml.train(
-    project_name => 'preprocessed_model',
-    task => 'classification',
-    relation_name => 'weather_data',
-    y_column_name => 'rain',
-    preprocess => '{
-        "month": {"encode": {"ordinal": ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]}},
-        "clouds": {"encode": "target", "scale": "standard"},
-        "humidity": {"impute": "mean", "scale": "standard"},
-        "temp": {"scale": "standard"}
-    }'
-);
-```
-
-In some cases, it may make sense to use multiple steps for a single column. For example, the `clouds` column will be target encoded, and then scaled to the standard range to avoid dominating other variables, but there are some interactions between preprocessors to keep in mind.
-
-* `NULL` and `NaN` are treated as additional, independent categories if seen during training, so columns that `encode` will only ever `impute` when novel values are encountered after training.
-* It usually makes sense to scale all variables to the same scale.
-* It does not usually help to scale or preprocess the target data, as that is essentially the problem formulation and/or task selection.
-
-{% hint style="info" %}
-`TEXT` is used in this document to also refer to `VARCHAR` and `CHAR(N)` types.
-{% endhint %}
-
-## Predicting with Preprocessors
-
-A model that has been trained with preprocessors should use a Postgres tuple for prediction, rather than a `FLOAT4[]`. Tuples may contain multiple different types (like `TEXT` and `BIGINT`), while an ARRAY may only contain a single type. You can use parentheses around values to create a Postgres tuple.
-
-```sql
-SELECT pgml.predict('preprocessed_model', ('jan', 'nimbus', 0.5, 7));
-```
-
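-Because `humidity` is imputed with the mean, a `NULL` in that position is replaced with the training-set mean at inference time. A minimal sketch (the explicit `FLOAT4` cast is only there to help Postgres infer the tuple element type):
-
-```sql
-SELECT pgml.predict('preprocessed_model', ('jan', 'nimbus', NULL::FLOAT4, 7));
-```
-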
-## Categorical encodings
-
-Encoding categorical variables is an O(N log(M)) operation, where N is the number of rows and M is the number of distinct categories.
-
-| **name** | **description** |
-| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
-| `none` | **Default** - Casts the variable to a 32-bit floating point representation compatible with numerics. This is the default for non-`TEXT` values. |
-| `target` | Encodes the variable as the average value of the target label for all members of the category. This is the default for `TEXT` variables. |
-| `one_hot` | Encodes the variable as multiple independent boolean columns. |
-| `ordinal` | Encodes the variable as integer values provided by their position in the input array. `NULL`s are always encoded as 0.                            |
-
-### `target` encoding
-
-Target encoding is a relatively efficient way to represent a categorical variable. The average value of the target is computed for each category in the training data set. It is reasonable to `scale` target encoded variables using the same method as other variables.
-
-```sql
-preprocess => '{
- "clouds": {"encode": "target" }
-}'
-```
-
-!!! note
-
-Target encoding is currently limited to the first label column specified in a joint optimization model when there are multiple labels.
-
-!!!
-
-### `one_hot` encoding
-
-One-hot encoding converts each category into an independent boolean column, where all columns are false except the one column the instance is a member of. This is generally not as efficient or as effective as target encoding because the number of additional columns for a single feature can swamp the other features, regardless of scaling in some algorithms. In addition, the columns are highly correlated, which can also cause quality issues in some algorithms. PostgresML drops one column by default to break the correlation but preserves the information, which is also referred to as dummy encoding.
-
-```sql
-preprocess => '{
-    "clouds": {"encode": "one_hot" }
-}'
-```
-
-!!! note
-
-All one-hot encoded data is scaled from 0-1 by definition, and will not be further scaled, unlike the other encodings.
-
-!!!
-
-### `ordinal` encoding
-
-Some categorical variables have a natural ordering, like months of the year or days of the week, which can be effectively treated as a discrete quantitative variable. You may set the order of your categorical values by passing an exhaustive ordered array, e.g.
-
-```sql
-preprocess => '{
-    "month": {"encode": {"ordinal": ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]}}
-}'
-```
-
-## Imputing missing values
-
-`NULL` and `NaN` values can be replaced by several statistical measures observed in the training data.
-
-| **name** | **description** |
-| -------- | ------------------------------------------------------------------------------------ |
-| `error` | **Default** - will abort training or inference when a `NULL` or `NAN` is encountered |
-| `mean` | the mean value of the variable in the training data set |
-| `median` | the middle value of the variable in the sorted training data set |
-| `mode` | the most common value of the variable in the training data set |
-| `min` | the minimum value of the variable in the training data set |
-| `max` | the maximum value of the variable in the training data set |
-| `zero` | replaces all missing values with 0.0 |
-
-```sql
-preprocess => '{
- "temp": {"impute": "mean"}
-}'
-```
-
-## Scaling values
-
-Scaling all variables to a standardized range can help make sure that no feature dominates the model, strictly because it has a naturally larger scale.
-
-| **name** | **description** |
-| ---------- | -------------------------------------------------------------------------------------------------------------------- |
-| `preserve` | **Default** - Does not scale the variable at all. |
-| `standard` | Scales data to have a mean of zero, and variance of one. |
-| `min_max` | Scales data from zero to one. The minimum becomes 0.0 and maximum becomes 1.0. |
-| `max_abs` | Scales data from -1.0 to +1.0. Data will not be centered around 0, unless abs(min) == abs(max). |
-| `robust` | Scales data as a factor of the first and third quartiles. This method may handle outliers more robustly than others. |
-
-```sql
-preprocess => '{
- "temp": {"scale": "standard"}
-}'
-```
diff --git a/pgml-docs/machine-learning/supervised-learning/hyperparameter-search.md b/pgml-docs/machine-learning/supervised-learning/hyperparameter-search.md
deleted file mode 100644
index 6f8260cb4..000000000
--- a/pgml-docs/machine-learning/supervised-learning/hyperparameter-search.md
+++ /dev/null
@@ -1,75 +0,0 @@
-# Hyperparameter Search
-
-Models can be further refined by using hyperparameter search and cross validation. We currently support `random` and `grid` search algorithms, and k-fold cross validation.
-
-## API
-
-The parameters passed to `pgml.train()` easily allow one to perform hyperparameter tuning. The three parameters relevant to this are: `search`, `search_params` and `search_args`.
-
-| **Parameter** | **Example** |
-| --------------- | ----------------------------- |
-| `search` | `grid` |
-| `search_params` | `{"alpha": [0.1, 0.2, 0.5] }` |
-| `search_args` | `{"n_iter": 10 }` |
-
-
-
-```sql
-SELECT * FROM pgml.train(
- 'Handwritten Digit Image Classifier',
- algorithm => 'xgboost',
- search => 'grid',
- search_params => '{
- "max_depth": [1, 2, 3, 4, 5, 6],
- "n_estimators": [20, 40, 80, 160]
- }'
-);
-```
-
-
-
-You may pass any of the arguments listed in the algorithms documentation as hyperparameters. See [Algorithms](../../../docs/guides/training/algorithm\_selection) for the complete list of algorithms and their associated hyperparameters.
-
-### Search Algorithms
-
-We currently support two search algorithms: `random` and `grid`.
-
-| Algorithm | Description |
-| --------- | ----------------------------------------------------------------------------------------------- |
-| `grid` | Trains every permutation of `search_params` using a cartesian product. |
-| `random` | Randomly samples `search_params` up to `n_iter` number of iterations provided in `search_args`. |
-
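-For comparison, a `random` search samples the same space instead of exhaustively enumerating it; the number of candidates is capped with `n_iter` in `search_args`. A minimal sketch (the parameter ranges are illustrative):
-
-```sql
-SELECT * FROM pgml.train(
-    'Handwritten Digit Image Classifier',
-    algorithm => 'xgboost',
-    search => 'random',
-    search_params => '{
-        "max_depth": [1, 2, 3, 4, 5, 6],
-        "n_estimators": [20, 40, 80, 160]
-    }',
-    search_args => '{"n_iter": 10}'
-);
-```
-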
-### Analysis
-
-PostgresML automatically selects the optimal set of hyperparameters for the model, and that combination is highlighted in the Dashboard, among all other search candidates.
-
-The impact of each hyperparameter is measured against the key metric (`r2` for regression and `f1` for classification), as well as the training and test times.
-
-![Hyperparameter search analysis in the dashboard](../../../dashboard/static/images/dashboard/hyperparams.png)
-
-{% hint style="info" %}
-In our example case, it's interesting that as `max_depth` increases, the "Test Score" on the key metric trends lower, so the smallest value of `max_depth` is chosen to maximize the "Test Score".
-
-Luckily, the smallest `max_depth` values also have the fastest "Fit Time", indicating that we pay less for training these higher quality models.
-
-It's a little less obvious how the different values of `n_estimators` and `learning_rate` impact the test score. We may want to rerun our search and zoom in on the search space to get more insight.
-{% endhint %}
-
-### Performance
-
-In our example above, the grid search will train `len(max_depth) * len(n_estimators) * len(learning_rate) = 6 * 4 * 4 = 96` combinations to compare all possible permutations of `search_params`.
-
-It only took about a minute on my computer because we're using optimized Rust/C++ XGBoost bindings, but you can delete some values if you want to speed things up even further. I like to watch all cores operate at 100% utilization in a separate terminal with `htop`:
-
-![htop showing all cores at full utilization](../../../dashboard/static/images/demos/htop.png)
-
-In the end, we get the following output:
-
-```plsql
- project | task | algorithm | deployed
-------------------------------------+----------------+-----------+----------
- Handwritten Digit Image Classifier | classification | xgboost | t
-(1 row)
-```
-
-A new model has been deployed with better performance and metrics. There will also be a new analysis available for this model, viewable in the dashboard.
diff --git a/pgml-docs/machine-learning/supervised-learning/joint-optimization.md b/pgml-docs/machine-learning/supervised-learning/joint-optimization.md
deleted file mode 100644
index dac67f25a..000000000
--- a/pgml-docs/machine-learning/supervised-learning/joint-optimization.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Joint Optimization
-
-Some algorithms support joint optimization of the task across multiple outputs, which can improve results compared to using multiple independent models.
-
-To leverage multiple outputs in PostgresML, you'll need to substitute the standard usage of `pgml.train()` with `pgml.train_joint()`, which has the same API, with the notable exception of the `y_column_name` parameter, which now accepts an array instead of a simple string.
-
-```sql
-SELECT * FROM pgml.train_joint(
-    'My Joint Project',
-    task => 'regression',
-    relation_name => 'my_table',
-    y_column_name => ARRAY['target_a', 'target_b']
-);
-```
-
-
-
-You can read more in [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.multioutput) documentation.
diff --git a/pgml-docs/machine-learning/supervised-learning/regression.md b/pgml-docs/machine-learning/supervised-learning/regression.md
deleted file mode 100644
index a4d83c93a..000000000
--- a/pgml-docs/machine-learning/supervised-learning/regression.md
+++ /dev/null
@@ -1,64 +0,0 @@
----
-description: >-
- Statistical method used to model the relationship between a dependent variable
- and one or more independent variables.
----
-
-# Regression
-
-We currently support regression algorithms from [scikit-learn](https://scikit-learn.org/), [XGBoost](https://xgboost.readthedocs.io/), and [LightGBM](https://lightgbm.readthedocs.io/).
-
-### Gradient Boosting
-
-| Algorithm | Regression |
-| ----------------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| `xgboost` | [XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRegressor) |
-| `xgboost_random_forest` | [XGBRFRegressor](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRFRegressor) |
-| `lightgbm` | [LGBMRegressor](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor) |
-| `catboost` | [CatBoostRegressor](https://catboost.ai/en/docs/concepts/python-reference\_catboostregressor) |
-
-### Scikit Ensembles
-
-| Algorithm | Regression |
-| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
-| `ada_boost` | [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html) |
-| `bagging` | [BaggingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html) |
-| `extra_trees` | [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) |
-| `gradient_boosting_trees` | [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) |
-| `random_forest` | [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) |
-| `hist_gradient_boosting` | [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html) |
-
-### Support Vector Machines
-
-| Algorithm | Regression |
-| ------------ | ----------------------------------------------------------------------------------------- |
-| `svm` | [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) |
-| `nu_svm` | [NuSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html) |
-| `linear_svm` | [LinearSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html) |
-
-### Linear Models
-
-| Algorithm | Regression |
-| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
-| `linear` | [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LinearRegression.html) |
-| `ridge` | [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Ridge.html) |
-| `lasso` | [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Lasso.html) |
-| `elastic_net` | [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.ElasticNet.html) |
-| `least_angle` | [LARS](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Lars.html) |
-| `lasso_least_angle` | [LassoLars](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LassoLars.html) |
-| `orthoganl_matching_pursuit` | [OrthogonalMatchingPursuit](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.OrthogonalMatchingPursuit.html) |
-| `bayesian_ridge` | [BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.BayesianRidge.html) |
-| `automatic_relevance_determination` | [ARDRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.ARDRegression.html) |
-| `stochastic_gradient_descent` | [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.SGDRegressor.html) |
-| `passive_aggressive` | [PassiveAggressiveRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.PassiveAggressiveRegressor.html) |
-| `ransac` | [RANSACRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.RANSACRegressor.html) |
-| `theil_sen` | [TheilSenRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.TheilSenRegressor.html) |
-| `huber` | [HuberRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.HuberRegressor.html) |
-| `quantile` | [QuantileRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.QuantileRegressor.html) |
-
-### Other
-
-| Algorithm | Regression |
-| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
-| `kernel_ridge` | [KernelRidge](https://scikit-learn.org/stable/modules/generated/sklearn.kernel\_ridge.KernelRidge.html) |
-| `gaussian_process` | [GaussianProcessRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian\_process.GaussianProcessRegressor.html) |
diff --git a/pgml-docs/machine-learning/unsupervised-learning.md b/pgml-docs/machine-learning/unsupervised-learning.md
deleted file mode 100644
index 150bbfb73..000000000
--- a/pgml-docs/machine-learning/unsupervised-learning.md
+++ /dev/null
@@ -1,66 +0,0 @@
----
-description: A machine learning approach that uses unlabeled data
----
-
-# Unsupervised Learning
-
-PostgresML supports several clustering algorithms for unsupervised learning. Models can be trained using `pgml.train` on unlabeled data to identify groups within the data.
-
-## Training
-
-To build clusters on a given dataset, we can use the table or a view. Since clustering is an unsupervised algorithm, we don't need a column that represents a label as one of the inputs to `pgml.train`.
-
-## API
-
-In `pgml.train`, you need to set `cluster` as the task and pass a `project_name`. Most parameters are optional.
-
-```sql
-pgml.train(
- project_name TEXT,
- task TEXT DEFAULT NULL,
- relation_name TEXT DEFAULT NULL,
- algorithm TEXT DEFAULT 'linear',
- hyperparams JSONB DEFAULT '{}'::JSONB
-)
-```
-
-## Algorithms
-
-| Algorithm | Reference |
-| ---------------------- | ----------------------------------------------------------------------------------------------------------------- |
-| `affinity_propagation` | [AffinityPropagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) |
-| `birch` | [Birch](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html) |
-| `kmeans` | [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) |
-| `mini_batch_kmeans` | [MiniBatchKMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html) |
-
-### Example
-
-This example trains models on the sklearn digits dataset, which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for clustering. You could do something similar with a vector column.
-
-```sql
-SELECT pgml.load_dataset('digits');
-
--- create an unlabeled view of the images for unsupervised learning
-CREATE VIEW pgml.digit_vectors AS
-SELECT image FROM pgml.digits;
-
--- view the dataset
-SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;
-
--- train a simple model to cluster the data
-SELECT * FROM pgml.train('Handwritten Digit Clusters', 'cluster', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');
-
--- check out the predictions
-SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
-FROM pgml.digits
-LIMIT 10;
-```
-
-### Other Algorithms
-
-```sql
-SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation');
-SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}');
-SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}');
-SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'mini_batch_kmeans', hyperparams => '{"n_clusters": 10}');
-```
diff --git a/pgml-docs/monitoring.md b/pgml-docs/monitoring.md
deleted file mode 100644
index fbc79e996..000000000
--- a/pgml-docs/monitoring.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Monitoring
-
diff --git a/pgml-docs/overview.md b/pgml-docs/overview.md
deleted file mode 100644
index d0c98fdc4..000000000
--- a/pgml-docs/overview.md
+++ /dev/null
@@ -1,17 +0,0 @@
-# Overview
-
-PostgresML supercharges your Postgres database into an end-to-end MLOps platform, seamlessly integrating the key components of the machine learning workflow. Without moving data outside your database, PostgresML allows Postgres to function as a feature store, model store, training engine, and inference service all in one place. This consolidation streamlines building and deploying performant, real-time AI applications for developers.
-
-With PostgresML, your database becomes a full-fledged ML workbench. It supports supervised and unsupervised algorithms like regression, clustering, deep neural networks, and more. You can build models using SQL on data inside Postgres. Models are stored back into Postgres for low-latency inferences later.
-
-PostgresML also unlocks the power of large language models like GPT-3 for your database. With just a few lines of SQL, you can leverage state-of-the-art NLP to build semantic search, analyze text, extract insights, summarize documents, translate text, and more. The possibilities are endless.
-
-PostgresML is open source but also offered as a fully-managed cloud service. In addition to the SQL API, it provides JavaScript, Python, and Rust SDKs to quickly build vector search, chatbots, and other ML apps in just a few lines of code.
-
-To scale horizontally, PostgresML utilizes PgCat, an advanced PostgreSQL proxy and load balancer. PgCat enables sharding, load balancing, failover, and mirroring to achieve extremely high throughput and low latency. By keeping the entire machine learning workflow within Postgres, PostgresML avoids expensive network calls between disparate systems. This allows PostgresML to handle millions of requests per second at up to 40x the speed of other platforms. PgCat and Postgres replication deliver seamless scaling while retaining transactional integrity.
-
diff --git a/pgml-docs/pgcat.md b/pgml-docs/pgcat.md
deleted file mode 100644
index f691ef28f..000000000
--- a/pgml-docs/pgcat.md
+++ /dev/null
@@ -1,252 +0,0 @@
----
-description: Nextgen PostgreSQL Pooler
----
-
-# PgCat
-
-PgCat is a PostgreSQL pooler and proxy (like PgBouncer) with support for sharding, load balancing, failover and mirroring.
-
-## Features
-
-| **Feature** | **Status** | **Comments** |
-| ------------------------------------- | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| Transaction pooling | **Stable** | Identical to PgBouncer with notable improvements for handling bad clients and abandoned transactions. |
-| Session pooling | **Stable** | Identical to PgBouncer. |
-| Multi-threaded runtime | **Stable** | Using Tokio asynchronous runtime, the pooler takes advantage of multicore machines. |
-| Load balancing of read queries | **Stable** | Queries are automatically load balanced between replicas and the primary. |
-| Failover | **Stable** | Queries are automatically rerouted around broken replicas, validated by regular health checks. |
-| Admin database statistics | **Stable** | Pooler statistics and administration via the `pgbouncer` and `pgcat` databases. |
-| Prometheus statistics                  | **Stable**       | Statistics are reported via an HTTP endpoint for Prometheus.                                                                                                     |
-| SSL/TLS | **Stable** | Clients can connect to the pooler using TLS. Pooler can connect to Postgres servers using TLS. |
-| Client/Server authentication | **Stable** | Clients can connect using MD5 authentication, supported by `libpq` and all Postgres client drivers. PgCat can connect to Postgres using MD5 and SCRAM-SHA-256. |
-| Live configuration reloading | **Stable** | Identical to PgBouncer; all settings can be reloaded dynamically (except `host` and `port`). |
-| Auth passthrough | **Stable** | MD5 password authentication can be configured to use an `auth_query` so no cleartext passwords are needed in the config file. |
-| Sharding using extended SQL syntax | **Experimental** | Clients can dynamically configure the pooler to route queries to specific shards. |
-| Sharding using comments parsing/Regex | **Experimental** | Clients can include shard information (sharding key, shard ID) in the query comments. |
-| Automatic sharding | **Experimental** | PgCat can parse queries, detect sharding keys automatically, and route queries to the correct shard. |
-| Mirroring | **Experimental** | Mirror queries between multiple databases in order to test servers with realistic production traffic. |
-
-## Status
-
-PgCat is stable and used in production to serve hundreds of thousands of queries per second.
-
-| | | |
-| -------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | --------- |
-| [Instacart](https://tech.instacart.com/adopting-pgcat-a-nextgen-postgres-proxy-3cf284e68c2f) | [PostgresML](https://postgresml.org/blog/scaling-postgresml-to-one-million-requests-per-second) | OneSignal |
-
-Some features remain experimental and are being actively developed. They are optional and can be enabled through configuration.
-
-## Deployment
-
-See `Dockerfile` for an example deployment using Docker. The pooler is configured to spawn 4 workers, so 4 CPUs are recommended for optimal performance. That setting can be adjusted to spawn as many (or as few) workers as needed.
-
-A Docker image is available via `docker pull ghcr.io/postgresml/pgcat:latest`. See our [GitHub packages repository](https://github.com/postgresml/pgcat/pkgs/container/pgcat).
-
-For a quick local example, use the Docker Compose environment provided:
-
-```bash
-docker-compose up
-
-# In a new terminal:
-PGPASSWORD=postgres psql -h 127.0.0.1 -p 6432 -U postgres -c 'SELECT 1'
-```
-
-### Config
-
-See [**Configuration**](https://github.com/levkk/pgcat/blob/main/CONFIG.md).
-
-## Contributing
-
-The project is under active development and is looking for additional contributors and production deployments.
-
-### Local development
-
-1. Install Rust (latest stable will work great).
-2. `cargo build --release` (to get better benchmarks).
-3. Change the config in `pgcat.toml` to fit your setup (optional given next step).
-4. Install Postgres and run `psql -f tests/sharding/query_routing_setup.sql` (user/password may be required depending on your setup)
-5. `RUST_LOG=info cargo run --release` You're ready to go!
-
-### Tests
-
-When making substantial modifications to the protocol implementation, make sure to test them with pgbench:
-
-```
-pgbench -i -h 127.0.0.1 -p 6432 && \
-pgbench -t 1000 -p 6432 -h 127.0.0.1 --protocol simple && \
-pgbench -t 1000 -p 6432 -h 127.0.0.1 --protocol extended
-```
-
-See sharding README for sharding logic testing.
-
-Additionally, all features are tested with Ruby, Python, and Rust unit and integration tests.
-
-Run `cargo test` to run Rust unit tests.
-
-Run the following commands to run Ruby and Python integration tests:
-
-```
-cd tests/docker/
-docker compose up --exit-code-from main # This will also produce coverage report under ./cov/
-```
-
-### Docker-based local development
-
-You can open a Docker development environment where you can debug tests more easily. Run the following command to spin it up:
-
-```
-./dev/script/console
-```
-
-This will open a terminal in an environment similar to that used in tests. In there, you can compile the pooler, run tests, do some debugging with the test environment, etc. Objects compiled inside the container (and bundled gems) will be placed in `dev/cache` so they don't interfere with what you have on your machine.
-
-## Usage
-
-### Session mode
-
-In session mode, a client talks to one server for the duration of the connection. Prepared statements, `SET`, and advisory locks are supported. In terms of supported features, there is very little if any difference between session mode and talking directly to the server.
-
-To use session mode, change `pool_mode = "session"`.
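-
-As a hedged sketch (not from the PgCat docs), here's what session-scoped state looks like from a client, assuming PgCat listens on `127.0.0.1:6432` with `pool_mode = "session"` and psycopg2 is installed:
-
-```python
-# Hypothetical example: session-scoped settings and advisory locks are safe
-# in session mode, because the client keeps one server connection until it
-# disconnects.
-import psycopg2
-
-conn = psycopg2.connect("postgres://postgres:postgres@127.0.0.1:6432/postgres")
-cur = conn.cursor()
-
-cur.execute("SET statement_timeout = '5s'")
-cur.execute("SELECT pg_advisory_lock(42)")
-cur.execute("SELECT 1")
-print(cur.fetchone())
-cur.execute("SELECT pg_advisory_unlock(42)")
-conn.close()
-```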
-
-### Transaction mode
-
-In transaction mode, a client talks to one server for the duration of a single transaction; once it's over, the server is returned to the pool. Prepared statements, `SET`, and advisory locks are not supported; alternatives are to use `SET LOCAL` and `pg_advisory_xact_lock` which are scoped to the transaction.
-
-This mode is enabled by default.
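-
-A minimal sketch of the transaction-scoped alternatives, under the same assumptions as the session mode example above:
-
-```python
-# Hypothetical example: SET LOCAL and pg_advisory_xact_lock end with the
-# transaction, so they are safe even though the next transaction may run
-# on a different server connection.
-import psycopg2
-
-conn = psycopg2.connect("postgres://postgres:postgres@127.0.0.1:6432/postgres")
-with conn:  # one transaction
-    with conn.cursor() as cur:
-        cur.execute("SET LOCAL statement_timeout = '5s'")
-        cur.execute("SELECT pg_advisory_xact_lock(42)")
-        cur.execute("SELECT 1")
-        print(cur.fetchone())
-conn.close()
-```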
-
-### Load balancing of read queries
-
-All queries are load balanced against the configured servers using either the random or least open connections algorithms. The most straightforward configuration example would be to put this pooler in front of several replicas and let it load balance all queries.
-
-If the configuration includes a primary and replicas, the queries can be separated with the built-in query parser. The query parser, implemented with the `sqlparser` crate, will interpret the query and route all `SELECT` queries to a replica, while all other queries including explicit transactions will be routed to the primary.
-
-#### **Query parser**
-
-The query parser will do its best to determine where the query should go, but sometimes that's not possible. In that case, the client can select which server it wants using this custom SQL syntax:
-
-```sql
--- To talk to the primary for the duration of the next transaction:
-SET SERVER ROLE TO 'primary';
-
--- To talk to the replica for the duration of the next transaction:
-SET SERVER ROLE TO 'replica';
-
--- Let the query parser decide
-SET SERVER ROLE TO 'auto';
-
--- Pick any server at random
-SET SERVER ROLE TO 'any';
-
--- Reset to default configured settings
-SET SERVER ROLE TO 'default';
-```
-
-The setting will persist until it's changed again or the client disconnects.
-
-By default, all queries are routed to the first available server; `default_role` setting controls this behavior.
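-
-For example, a client can route a write to the primary and the following read to a replica (a hedged sketch; the `events` table is illustrative and PgCat is assumed to listen on `127.0.0.1:6432`):
-
-```python
-# Hypothetical example: steering individual queries with SET SERVER ROLE.
-import psycopg2
-
-conn = psycopg2.connect("postgres://postgres:postgres@127.0.0.1:6432/postgres")
-conn.autocommit = True
-cur = conn.cursor()
-
-cur.execute("SET SERVER ROLE TO 'primary'")
-cur.execute("INSERT INTO events (name) VALUES ('signup')")
-
-cur.execute("SET SERVER ROLE TO 'replica'")
-cur.execute("SELECT COUNT(*) FROM events")
-print(cur.fetchone())
-conn.close()
-```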
-
-#### Failover
-
-All servers are checked with a `;` (very fast) query before being given to a client. Additionally, the server health is monitored with every client query that it processes. If the server is not reachable, it will be banned and cannot serve any more transactions for the duration of the ban. The queries are routed to the remaining servers. If all servers become banned, the ban list is cleared: this is a safety precaution against false positives. The primary can never be banned.
-
-The ban time can be changed with `ban_time`. The default is 60 seconds.
-
-#### Sharding
-
-We use the `PARTITION BY HASH` hashing function, the same one Postgres uses for declarative partitioning. This makes it possible to shard the database using Postgres partitions and place the partitions on different servers (shards). Both read and write queries can be routed to the shards using this pooler.
-
-**Extended syntax**
-
-To route queries to a particular shard, we use this custom SQL syntax:
-
-```sql
--- To talk to a shard explicitly
-SET SHARD TO '1';
-
--- To let the pooler choose based on a value
-SET SHARDING KEY TO '1234';
-```
-
-The active shard will last until it's changed again or the client disconnects. By default, the queries are routed to shard 0.
-
-For hash function implementation, see `src/sharding.rs` and `tests/sharding/partition_hash_test_setup.sql`.
-
-**ActiveRecord/Rails**
-
-```ruby
-class User < ActiveRecord::Base
-end
-
-# Metadata will be fetched from shard 0
-ActiveRecord::Base.establish_connection
-
-# Grab a bunch of users from shard 1
-User.connection.execute "SET SHARD TO '1'"
-User.take(10)
-
-# Using id as the sharding key
-User.connection.execute "SET SHARDING KEY TO '1234'"
-User.find_by_id(1234)
-
-# Using geographical sharding
-User.connection.execute "SET SERVER ROLE TO 'primary'"
-User.connection.execute "SET SHARDING KEY TO '85'"
-User.create(name: "test user", email: "test@example.com", zone_id: 85)
-
-# Let the query parser figure out where the query should go.
-# We are still on shard = hash(85) % shards.
-User.connection.execute "SET SERVER ROLE TO 'auto'"
-User.find_by_email("test@example.com")
-```
-
-**Raw SQL**
-
-```sql
--- Grab a bunch of users from shard 1
-SET SHARD TO '1';
-SELECT * FROM users LIMIT 10;
-
--- Find by id
-SET SHARDING KEY TO '1234';
-SELECT * FROM users WHERE id = 1234;
-
--- Writing in a primary/replicas configuration.
-SET SERVER ROLE TO 'primary';
-SET SHARDING KEY TO '85';
-INSERT INTO users (name, email, zone_id) VALUES ('test user', 'test@example.com', 85);
-
-SET SERVER ROLE TO 'auto'; -- let the query router figure out where the query should go
-SELECT * FROM users WHERE email = 'test@example.com'; -- shard setting lasts until set again; we are reading from the primary
-```
-
-**With comments**
-
-Issuing separate `SET` queries to the pooler adds a round trip of latency. To reduce the impact, it's possible to include sharding information inside SQL comments sent with the query itself. This is reasonably easy to implement with ORMs like [ActiveRecord](https://api.rubyonrails.org/classes/ActiveRecord/QueryMethods.html#method-i-annotate) and [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/events.html#sql-execution-and-connection-events).
-
-```
-/* shard_id: 5 */ SELECT * FROM foo WHERE id = 1234;
-
-/* sharding_key: 1234 */ SELECT * FROM foo WHERE id = 1234;
-```
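-
-A hedged sketch of a client adding such comments itself, assuming psycopg2 and a PgCat instance on `127.0.0.1:6432` (the `users` table is illustrative):
-
-```python
-# Hypothetical example: pass the sharding key in a comment to avoid an
-# extra SET round trip to the pooler.
-import psycopg2
-
-conn = psycopg2.connect("postgres://postgres:postgres@127.0.0.1:6432/postgres")
-cur = conn.cursor()
-
-user_id = 1234
-cur.execute(
-    f"/* sharding_key: {user_id} */ SELECT * FROM users WHERE id = %s",
-    (user_id,),
-)
-print(cur.fetchone())
-conn.close()
-```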
-
-**Automatic query parsing**
-
-PgCat can use the `sqlparser` crate to parse SQL queries and extract the sharding key. This is configurable with the `automatic_sharding_key` setting. This feature is still experimental, but it's the ideal implementation for sharding, requiring no client modifications.
-
-#### Statistics reporting
-
-The stats are very similar to what PgBouncer reports, and the names are kept comparable. They are accessible by querying the admin database `pgcat`, or `pgbouncer` for compatibility.
-
-```
-psql -h 127.0.0.1 -p 6432 -d pgbouncer -c 'SHOW DATABASES'
-```
-
-Additionally, Prometheus statistics are available at `/metrics` via HTTP.
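-
-The same statistics can be read programmatically. A sketch assuming psycopg2 and admin credentials configured in `pgcat.toml` (the user and password here are placeholders):
-
-```python
-# Hypothetical example: reading pooler statistics from the admin database.
-# Admin commands run outside of transactions, hence autocommit.
-import psycopg2
-
-admin = psycopg2.connect("postgres://admin_user:admin_pass@127.0.0.1:6432/pgcat")
-admin.autocommit = True
-cur = admin.cursor()
-cur.execute("SHOW STATS")
-for row in cur.fetchall():
-    print(row)
-admin.close()
-```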
-
-#### Live configuration reloading
-
-The config can be reloaded by sending a `SIGHUP` signal to the process, e.g. `kill -s SIGHUP`, or by running `RELOAD` against the admin database. All settings except `host` and `port` can be reloaded without restarting the pooler, including sharding and replica configurations.
-
-#### Mirroring
-
-Mirroring allows routing queries to multiple databases at the same time. This is useful for prewarming replicas before placing them into the active configuration, or for testing different versions of Postgres with live traffic.
diff --git a/pgml-docs/sdks/README.md b/pgml-docs/sdks/README.md
deleted file mode 100644
index bed5fb936..000000000
--- a/pgml-docs/sdks/README.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# SDKs
-
-The SDKs are designed to facilitate the development of scalable vector search applications on PostgreSQL databases. With these SDKs, you can seamlessly manage the database tables for documents, text chunks, text splitters, large language models (LLMs), and embeddings. By leveraging the SDKs' capabilities, you can efficiently index LLM embeddings using PgVector for fast and accurate queries.
diff --git a/pgml-docs/sdks/collections.md b/pgml-docs/sdks/collections.md
deleted file mode 100644
index 92232a1d7..000000000
--- a/pgml-docs/sdks/collections.md
+++ /dev/null
@@ -1,89 +0,0 @@
-# Collections
-
-
-
-Collections are the organizational building blocks of the SDK. They manage all documents and related chunks, embeddings, tsvectors, and pipelines.
-
-## Creating Collections
-
-By default, collections will read and write to the database specified by `DATABASE_URL`.
-
-### **Default `DATABASE_URL`**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection")
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-collection = pgml.newCollection("test_collection")
-```
-{% endtab %}
-{% endtabs %}
-
-### **Custom DATABASE\_URL**
-
-Create a Collection that reads from a different database than the one set by the environment variable `DATABASE_URL`.
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection", CUSTOM_DATABASE_URL)
-```
-{% endtab %}
-
-{% tab title="Javascript" %}
-```javascript
-collection = pgml.newCollection("test_collection", CUSTOM_DATABASE_URL)
-```
-{% endtab %}
-{% endtabs %}
-
-## Upserting Documents
-
-Documents are dictionaries with two required keys: `id` and `text`. All other key/value pairs are stored as metadata for the document.
-
-**Upsert documents with metadata**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-documents = [
- {
- "id": "Document 1",
- "text": "Here are the contents of Document 1",
- "random_key": "this will be metadata for the document"
- },
- {
- "id": "Document 2",
- "text": "Here are the contents of Document 2",
- "random_key": "this will be metadata for the document"
- }
-]
-collection = Collection("test_collection")
-await collection.upsert_documents(documents)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
- const documents = [
- {
- id: "Document One",
- text: "document one contents...",
- },
- {
- id: "Document Two",
- text: "document two contents...",
- },
- ];
- await collection.upsert_documents(documents);
-```
-{% endtab %}
-{% endtabs %}
diff --git a/pgml-docs/sdks/getting-started.md b/pgml-docs/sdks/getting-started.md
deleted file mode 100644
index 871091b1e..000000000
--- a/pgml-docs/sdks/getting-started.md
+++ /dev/null
@@ -1,229 +0,0 @@
-# Getting Started
-
-## Installation
-
-{% tabs %}
-{% tab title="Python " %}
-Python 3.8.1 or newer
-
-```bash
-pip install pgml
-```
-{% endtab %}
-
-{% tab title="JavaScript " %}
-```
-npm i pgml
-```
-{% endtab %}
-{% endtabs %}
-
-
-
-## Example
-
-Once the SDK is installed, you can use the following example to get started.
-
-### Create a collection
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-from pgml import Collection, Model, Splitter, Pipeline
-import asyncio
-
-async def main():
- # Initialize collection
- collection = Collection("sample_collection")
-```
-{% endtab %}
-
-{% tab title="JavaScript " %}
-```javascript
-const pgml = require("pgml");
-
-const main = async () => {
- collection = pgml.newCollection("sample_collection");
-```
-{% endtab %}
-{% endtabs %}
-
-**Explanation:**
-
-* The code imports the pgml module.
-* It creates an instance of the `Collection` class, to which we will add pipelines and documents.
-
-### Create a pipeline
-
-Continuing with `main`
-
-{% tabs %}
-{% tab title="Python" %}
-```python
- # Create a pipeline using the default model and splitter
- model = Model()
- splitter = Splitter()
- pipeline = Pipeline("sample_pipeline", model, splitter)
- await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
- model = pgml.newModel();
- splitter = pgml.newSplitter();
-  pipeline = pgml.newPipeline("sample_pipeline", model, splitter);
- await collection.add_pipeline(pipeline);
-```
-{% endtab %}
-{% endtabs %}
-
-#### Explanation:
-
-* The code creates an instance of `Model` and `Splitter` using their default arguments.
-* Finally, the code constructs a pipeline called `"sample_pipeline"` and adds it to the collection we initialized above. This pipeline automatically generates chunks and embeddings for every upserted document.
-
-### Upsert documents
-
-Continuing with `main`
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-    documents = [
-        {
-            "id": "Document One",
-            "text": "document one contents...",
-        },
-        {
-            "id": "Document Two",
-            "text": "document two contents...",
-        },
-    ]
-    await collection.upsert_documents(documents)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
- const documents = [
- {
- id: "Document One",
- text: "document one contents...",
- },
- {
- id: "Document Two",
- text: "document two contents...",
- },
- ];
- await collection.upsert_documents(documents);
-```
-{% endtab %}
-{% endtabs %}
-
-**Explanation**
-
-* This code creates and upserts some filler documents.
-* As mentioned above, the pipeline added earlier automatically runs and generates chunks and embeddings for each document.
-
-### Query documents
-
-Continuing with `main`
-
-{% tabs %}
-{% tab title="Python" %}
-```python
- # Query
- query = "Some user query that will match document one first"
- results = await collection.query().vector_recall(query, pipeline).limit(2).fetch_all()
- print(results)
- # Archive collection
- await collection.archive()
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-const queryResults = await collection
- .query()
- .vector_recall("Some user query that will match document one first", pipeline)
- .limit(2)
- .fetch_all();
-
- // Convert the results to an array of objects
- const results = queryResults.map((result) => {
- const [similarity, text, metadata] = result;
- return {
- similarity,
- text,
- metadata,
- };
- });
- console.log(results);
-
- await collection.archive();
-```
-{% endtab %}
-{% endtabs %}
-
-**Explanation:**
-
-* The `query` method is called to perform a vector-based search on the collection. The query string is `Some user query that will match document one first`, and the top 2 results are requested.
-* The search results are converted to objects and printed.
-* Finally, the `archive` method is called to archive the collection and free up resources in the PostgresML database.
-
-Call `main` function.
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-if __name__ == "__main__":
- asyncio.run(main())
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-main().then(() => {
- console.log("Done with PostgresML demo");
-});
-```
-{% endtab %}
-{% endtabs %}
-
-### **Running the Code**
-
-Open a terminal or command prompt and navigate to the directory where the file is saved.
-
-Execute the following command:
-
-{% tabs %}
-{% tab title="Python" %}
-```bash
-python vector_search.py
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```bash
-node vector_search.js
-```
-{% endtab %}
-{% endtabs %}
-
-You should see the search results printed in the terminal. As you can see, our vector search engine did match document one first.
-
-```bash
-[
- {
- similarity: 0.8506832955692104,
- text: 'document one contents...',
- metadata: { id: 'Document One' }
- },
- {
- similarity: 0.8066114609244565,
- text: 'document two contents...',
- metadata: { id: 'Document Two' }
- }
-]
-```
diff --git a/pgml-docs/sdks/overview.md b/pgml-docs/sdks/overview.md
deleted file mode 100644
index 3e55b1ebb..000000000
--- a/pgml-docs/sdks/overview.md
+++ /dev/null
@@ -1,25 +0,0 @@
-# Overview
-
-### Key Features
-
-* **Automated Database Management**: You can easily handle the management of database tables related to documents, text chunks, text splitters, LLM models, and embeddings. This automated management system simplifies the process of setting up and maintaining your vector search application's data structure.
-* **Embedding Generation from Open Source Models**: Provides the ability to generate embeddings using hundreds of open source models. These models, trained on vast amounts of data, capture the semantic meaning of text and enable powerful analysis and search capabilities.
-* **Flexible and Scalable Vector Search**: Build flexible and scalable vector search applications. PostgresML seamlessly integrates with PgVector, a PostgreSQL extension specifically designed for handling vector-based indexing and querying. By leveraging these indices, you can perform advanced searches, rank results by relevance, and retrieve accurate and meaningful information from your database.
-
-### Use Cases
-
-* Search: Embeddings are commonly used for search functionalities, where results are ranked by relevance to a query string. By comparing the embeddings of query strings and documents, you can retrieve search results in order of their similarity or relevance.
-* Clustering: With embeddings, you can group text strings by similarity, enabling clustering of related data. By measuring the similarity between embeddings, you can identify clusters or groups of text strings that share common characteristics.
-* Recommendations: Embeddings play a crucial role in recommendation systems. By identifying items with related text strings based on their embeddings, you can provide personalized recommendations to users.
-* Anomaly Detection: Anomaly detection involves identifying outliers or anomalies that have little relatedness to the rest of the data. Embeddings can aid in this process by quantifying the similarity between text strings and flagging outliers.
-* Classification: Embeddings are utilized in classification tasks, where text strings are classified based on their most similar label. By comparing the embeddings of text strings and labels, you can classify new text strings into predefined categories.
-
-### How the SDK Works
-
-SDK streamlines the development of vector search applications by abstracting away the complexities of database management and indexing. Here's an overview of how the SDK works:
-
-* **Automatic Document and Text Chunk Management**: The SDK provides a convenient interface to manage documents and pipelines, automatically handling chunking and embedding for you. You can easily organize and structure your text data within the PostgreSQL database.
-* **Open Source Model Integration**: With the SDK, you can seamlessly incorporate a wide range of open source models to generate high-quality embeddings. These models capture the semantic meaning of text and enable powerful analysis and search capabilities.
-* **Embedding Indexing**: The Python SDK utilizes the PgVector extension to efficiently index the embeddings generated by the open source models. This indexing process optimizes search performance and allows for fast and accurate retrieval of relevant results.
-* **Querying and Search**: Once the embeddings are indexed, you can perform vector-based searches on the documents and text chunks stored in the PostgreSQL database. The SDK provides intuitive methods for executing queries and retrieving search results.
-
diff --git a/pgml-docs/sdks/pipelines.md b/pgml-docs/sdks/pipelines.md
deleted file mode 100644
index 8fe5ea3ab..000000000
--- a/pgml-docs/sdks/pipelines.md
+++ /dev/null
@@ -1,257 +0,0 @@
-# Pipelines
-
-Pipelines are composed of a Model, a Splitter, and additional optional arguments. Collections can have any number of Pipelines. Each Pipeline is run every time documents are upserted.
-
-## Models
-
-Models are used for embedding chunked documents. We support almost every open source model on [Hugging Face](https://huggingface.co/), as well as OpenAI's embedding models.
-
-### **Create a default Model "intfloat/e5-small" with default parameters: {}**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-model = Model()
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-model = pgml.newModel()
-```
-{% endtab %}
-{% endtabs %}
-
-### **Create a Model with custom parameters**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-model = Model(
- name="hkunlp/instructor-base",
- parameters={"instruction": "Represent the Wikipedia document for retrieval: "}
-)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-model = pgml.newModel(
-    "hkunlp/instructor-base",
-    "pgml",
-    {instruction: "Represent the Wikipedia document for retrieval: "}
-)
-```
-{% endtab %}
-{% endtabs %}
-
-### **Use an OpenAI model**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-model = Model(name="text-embedding-ada-002", source="openai")
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-model = pgml.newModel("text-embedding-ada-002", "openai")
-```
-{% endtab %}
-{% endtabs %}
-
-## Splitters
-
-Splitters are used to split documents into chunks before embedding them. We support splitters found in [LangChain](https://www.langchain.com/).
-
-### **Create a default Splitter "recursive\_character" with default parameters: {}**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-splitter = Splitter()
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-splitter = pgml.newSplitter()
-```
-{% endtab %}
-{% endtabs %}
-
-### **Create a Splitter with custom parameters**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-splitter = Splitter(
- name="recursive_character",
- parameters={"chunk_size": 1500, "chunk_overlap": 40}
-)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-splitter = pgml.newSplitter(
-    "recursive_character",
-    {chunk_size: 1500, chunk_overlap: 40}
-)
-```
-{% endtab %}
-{% endtabs %}
-
-## Adding Pipelines to a Collection
-
-When adding a Pipeline to a Collection, the Pipeline must have both a Model and a Splitter.
-
-The first time a Pipeline is added to a Collection it will automatically chunk and embed any documents already in that Collection.
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-model = Model()
-splitter = Splitter()
-pipeline = Pipeline("test_pipeline", model, splitter)
-await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-model = pgml.newModel()
-splitter = pgml.newSplitter()
-pipeline = pgml.newPipeline("test_pipeline", model, splitter)
-await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-### Enabling full text search
-
-Pipelines can take additional arguments enabling full text search. When full text search is enabled, in addition to automatically chunking and embedding, the Pipeline will create the necessary tsvectors to perform full text search.
-
-For more information on full text search please see: [Postgres Full Text Search](https://www.postgresql.org/docs/15/textsearch.html).
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-model = Model()
-splitter = Splitter()
-pipeline = Pipeline("test_pipeline", model, splitter, {
- "full_text_search": {
- "active": True,
- "configuration": "english"
- }
-})
-await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-model = pgml.newModel()
-splitter = pgml.newSplitter()
-pipeline = pgml.newPipeline("test_pipeline", model, splitter, {
- "full_text_search": {
-        active: true,
- configuration: "english"
- }
-})
-await collection.add_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-## Searching with Pipelines
-
-Pipelines are a required argument when performing vector search. After a Pipeline has been added to a Collection, the Model and Splitter can be omitted when instantiating it.
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-pipeline = Pipeline("test_pipeline")
-collection = Collection("test_collection")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-pipeline = pgml.newPipeline("test_pipeline")
-collection = pgml.newCollection("test_collection")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
-```
-{% endtab %}
-{% endtabs %}
-
-
-
-Pipelines can be disabled or removed to prevent them from running automatically when documents are upserted.
-
-## **Disable a Pipeline**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-pipeline = Pipeline("test_pipeline")
-collection = Collection("test_collection")
-await collection.disable_pipeline(pipeline)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-pipeline = pgml.newPipeline("test_pipeline")
-collection = pgml.newCollection("test_collection")
-await collection.disable_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-Disabling a Pipeline prevents it from running automatically, but leaves all chunks and embeddings already created by that Pipeline in the database.
-
-## **Enable a Pipeline**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-pipeline = Pipeline("test_pipeline")
-collection = Collection("test_collection")
-await collection.enable_pipeline(pipeline)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-pipeline = pgml.newPipeline("test_pipeline")
-collection = pgml.newCollection("test_collection")
-await collection.enable_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-Enabling a Pipeline will cause it to automatically run and chunk and embed all documents it may have missed while disabled.
-
-## **Remove a Pipeline**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-pipeline = Pipeline("test_pipeline")
-collection = Collection("test_collection")
-await collection.remove_pipeline(pipeline)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-pipeline = pgml.newPipeline("test_pipeline")
-collection = pgml.newCollection("test_collection")
-await collection.remove_pipeline(pipeline)
-```
-{% endtab %}
-{% endtabs %}
-
-Removing a Pipeline deletes it and all associated data from the database. Removed Pipelines cannot be re-enabled but can be recreated.
diff --git a/pgml-docs/sdks/search.md b/pgml-docs/sdks/search.md
deleted file mode 100644
index 500aed1a6..000000000
--- a/pgml-docs/sdks/search.md
+++ /dev/null
@@ -1,271 +0,0 @@
-# Search
-
-The SDK is specifically designed to provide powerful, flexible vector search. Pipelines are required to perform search. See [pipelines.md](pipelines.md "mention") for more information about using Pipelines.
-
-### **Basic vector search**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-collection = pgml.newCollection("test_collection")
-pipeline = pgml.newPipeline("test_pipeline")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).fetch_all()
-```
-{% endtab %}
-{% endtabs %}
-
-### **Vector search with custom limit**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-collection = pgml.newCollection("test_collection")
-pipeline = pgml.newPipeline("test_pipeline")
-results = await collection.query().vector_recall("Why is PostgresML the best?", pipeline).limit(10).fetch_all()
-```
-{% endtab %}
-{% endtabs %}
-
-### **Metadata Filtering**
-
-We provide powerful and flexible arbitrarily nested metadata filtering based on [MongoDB Comparison Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/). We support every operator mentioned there except `$nin`.
-
-**Vector search with $eq metadata filtering**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "uuid": {
- "$eq": 1
- }
- }
- })
- .fetch_all()
-)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-collection = pgml.newCollection("test_collection")
-pipeline = pgml.newPipeline("test_pipeline")
-results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "uuid": {
- "$eq": 1
- }
- }
- })
- .fetch_all()
-```
-{% endtab %}
-{% endtabs %}
-
-The above query would filter out all documents that do not contain a key `uuid` equal to `1`.
-
-**Vector search with $gte metadata filtering**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "index": {
- "$gte": 3
- }
- }
- })
- .fetch_all()
-)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-collection = pgml.newCollection("test_collection")
-pipeline = pgml.newPipeline("test_pipeline")
-results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "index": {
- "$gte": 3
- }
- }
- })
- .fetch_all()
-```
-{% endtab %}
-{% endtabs %}
-
-The above query would filter out all documents that do not contain a key `index` with a value greater than `3`.
-
-**Vector search with $or and $and metadata filtering**
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "$or": [
- {
- "$and": [
- {
- "$eq": {
- "uuid": 1
- }
- },
- {
- "$lt": {
- "index": 100
- }
- }
- ]
- },
- {
- "special": {
- "$ne": True
- }
- }
- ]
- }
- })
- .fetch_all()
-)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-collection = pgml.newCollection("test_collection")
-pipeline = pgml.newPipeline("test_pipeline")
-results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "metadata": {
- "$or": [
- {
- "$and": [
- {
- "$eq": {
- "uuid": 1
- }
- },
- {
- "$lt": {
- "index": 100
- }
- }
- ]
- },
- {
- "special": {
-                    "$ne": true
- }
- }
- ]
- }
- })
- .fetch_all()
-```
-{% endtab %}
-{% endtabs %}
-
-The above query would keep only documents that either have a key `uuid` equal to `1` and a key `index` less than `100`, or have a key `special` whose value is not `True`.
-
-### **Full Text Filtering**
-
-If full text search is enabled for the associated Pipeline, documents can be first filtered by full text search and then recalled by embedding similarity.
-
-{% tabs %}
-{% tab title="Python" %}
-```python
-collection = Collection("test_collection")
-pipeline = Pipeline("test_pipeline")
-results = (
- await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "full_text": {
- "configuration": "english",
- "text": "Match Me"
- }
- })
- .fetch_all()
-)
-```
-{% endtab %}
-
-{% tab title="JavaScript" %}
-```javascript
-collection = pgml.newCollection("test_collection")
-pipeline = pgml.newPipeline("test_pipeline")
-results = await collection.query()
- .vector_recall("Here is some query", pipeline)
- .limit(10)
- .filter({
- "full_text": {
- "configuration": "english",
- "text": "Match Me"
- }
- })
- .fetch_all()
-```
-{% endtab %}
-{% endtabs %}
-
-The above query would first filter out all documents that do not match the full text search criteria, and then perform vector recall on the remaining documents.
-
diff --git a/pgml-docs/sdks/tutorials/README.md b/pgml-docs/sdks/tutorials/README.md
deleted file mode 100644
index 84ce15b78..000000000
--- a/pgml-docs/sdks/tutorials/README.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Tutorials
-
diff --git a/pgml-docs/sdks/tutorials/extractive-question-answering.md b/pgml-docs/sdks/tutorials/extractive-question-answering.md
deleted file mode 100644
index 566c344fa..000000000
--- a/pgml-docs/sdks/tutorials/extractive-question-answering.md
+++ /dev/null
@@ -1,145 +0,0 @@
-# Extractive Question Answering
-
-This tutorial walks through the JavaScript and Python code snippets that perform end-to-end extractive question answering.
-
-### Imports and Setup
-
-**Python**
-
-```python
-from pgml import Collection, Model, Splitter, Pipeline, Builtins
-from datasets import load_dataset
-from dotenv import load_dotenv
-```
-
-**JavaScript**
-
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-
-The SDK and datasets are imported. Builtins are used in Python for transforming text.
-
-### Initialize Collection
-
-**Python**
-
-```python
-collection = Collection("squad_collection")
-```
-
-**JavaScript**
-
-```js
-const collection = pgml.newCollection("my_javascript_eqa_collection");
-```
-
-A collection is created to hold context passages.
-
-### Create Pipeline
-
-**Python**
-
-```python
-model = Model()
-splitter = Splitter()
-pipeline = Pipeline("squadv1", model, splitter)
-await collection.add_pipeline(pipeline)
-```
-
-**JavaScript**
-
-```js
-const pipeline = pgml.newPipeline(
- "my_javascript_eqa_pipeline",
- pgml.newModel(),
- pgml.newSplitter(),
-);
-
-await collection.add_pipeline(pipeline);
-```
-
-A pipeline is created and added to the collection.
-
-### Upsert Documents
-
-**Python**
-
-```python
-data = load_dataset("squad")
-
-documents = [
- {"id": ..., "text": ...}
- for r in data
-]
-
-await collection.upsert_documents(documents)
-```
-
-**JavaScript**
-
-```js
-const documents = [
- {
- id: "...",
- text: "...",
- }
-];
-
-await collection.upsert_documents(documents);
-```
-
-Context passages from SQuAD are upserted into the collection.
-
-### Query for Context
-
-**Python**
-
-```python
-results = await (
-    collection.query()
-    .vector_recall(query, pipeline)
-    .fetch_all()
-)
-
-context = " ".join(result[1] for result in results)
-```
-
-**JavaScript**
-
-```js
-const queryResults = await collection
- .query()
- .vector_recall(query, pipeline)
- .fetch_all();
-
-const context = queryResults
- .map(result => result[1])
- .join("\n");
-```
-
-A vector search query retrieves context passages.
-
-### Query for Answer
-
-**Python**
-
-```python
-builtins = Builtins()
-
-answer = await builtins.transform(
- "question-answering",
- [{"question": query, "context": context}]
-)
-```
-
-**JavaScript**
-
-```js
-const builtins = pgml.newBuiltins();
-
-const answer = await builtins.transform("question-answering", [
-  JSON.stringify({ question: query, context }),
-]);
-```
-
-The context is passed to a QA model to extract the answer.
diff --git a/pgml-docs/sdks/tutorials/semantic-search-using-instructor-model.md b/pgml-docs/sdks/tutorials/semantic-search-using-instructor-model.md
deleted file mode 100644
index baa109c44..000000000
--- a/pgml-docs/sdks/tutorials/semantic-search-using-instructor-model.md
+++ /dev/null
@@ -1,115 +0,0 @@
-# Semantic Search using Instructor model
-
-This tutorial shows how to use Instructor models with the `pgml` SDK for more advanced use cases.
-
-#### Imports and Setup
-
-**Python**
-
-```python
-from pgml import Collection, Model, Splitter, Pipeline
-from datasets import load_dataset
-from dotenv import load_dotenv
-```
-
-**JavaScript**
-
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-
-#### Initialize Collection
-
-**Python**
-
-```python
-collection = Collection("squad_collection_1")
-```
-
-**JavaScript**
-
-```js
-const collection = pgml.newCollection("my_javascript_qai_collection");
-```
-
-#### Create Pipeline
-
-**Python**
-
-```python
-model = Model("hkunlp/instructor-base", parameters={
- "instruction": "Represent the Wikipedia document for retrieval: "
-})
-
-pipeline = Pipeline("squad_instruction", model, Splitter())
-await collection.add_pipeline(pipeline)
-```
-
-**JavaScript**
-
-```js
-const model = pgml.newModel("hkunlp/instructor-base", "pgml", {
- instruction: "Represent the Wikipedia document for retrieval: ",
-});
-
-const pipeline = pgml.newPipeline(
- "my_javascript_qai_pipeline",
- model,
- pgml.newSplitter(),
-);
-
-await collection.add_pipeline(pipeline);
-```
-
-#### Upsert Documents
-
-**Python**
-
-```python
-data = load_dataset("squad")
-
-documents = [
- {"id": ..., "text": ...} for r in data
-]
-
-await collection.upsert_documents(documents)
-```
-
-**JavaScript**
-
-```js
-const documents = [
- {
- id: "...",
- text: "...",
- },
-];
-
-await collection.upsert_documents(documents);
-```
-
-#### Query
-
-**Python**
-
-```python
-results = await (
-    collection.query()
-    .vector_recall(query, pipeline, {
-        "instruction": "Represent the Wikipedia question for retrieving supporting documents: "
-    })
-    .fetch_all()
-)
-```
-
-**JavaScript**
-
-```js
-const queryResults = await collection
- .query()
- .vector_recall(query, pipeline, {
- instruction:
- "Represent the Wikipedia question for retrieving supporting documents: ",
- })
- .fetch_all();
-```
-
diff --git a/pgml-docs/sdks/tutorials/semantic-search.md b/pgml-docs/sdks/tutorials/semantic-search.md
deleted file mode 100644
index 69b626329..000000000
--- a/pgml-docs/sdks/tutorials/semantic-search.md
+++ /dev/null
@@ -1,176 +0,0 @@
----
-description: Example for Semantic Search
----
-
-# Semantic Search
-
-This tutorial demonstrates using the `pgml` SDK to create a collection, add documents, build a pipeline for vector search, make a sample query, and archive the collection when finished. It loads sample data, indexes questions, times a semantic search query, and prints formatted results.
-
-
-
-### Imports and Setup
-
-**Python**
-
-```python
-from pgml import Collection, Model, Splitter, Pipeline
-from datasets import load_dataset
-from dotenv import load_dotenv
-import asyncio
-```
-
-**JavaScript**
-
-```js
-const pgml = require("pgml");
-
-require("dotenv").config();
-```
-
-The SDK is imported and environment variables are loaded.
-
-### Initialize Collection
-
-**Python**
-
-```python
-async def main():
-
- load_dotenv()
-
- collection = Collection("my_collection")
-```
-
-**JavaScript**
-
-```js
-const main = async () => {
-
- const collection = pgml.newCollection("my_javascript_collection");
-
-}
-```
-
-A collection object is created to represent the search collection.
-
-### Create Pipeline
-
-**Python**
-
-```python
- model = Model()
- splitter = Splitter()
-
- pipeline = Pipeline("my_pipeline", model, splitter)
-
- await collection.add_pipeline(pipeline)
-```
-
-**JavaScript**
-
-```js
- const model = pgml.newModel();
-
- const splitter = pgml.newSplitter();
-
- const pipeline = pgml.newPipeline("my_javascript_pipeline", model, splitter);
-
- await collection.add_pipeline(pipeline);
-```
-
-A pipeline encapsulating a model and splitter is created and added to the collection.
-
-### Upsert Documents
-
-**Python**
-
-```python
- documents = [
- {"id": "doc1", "text": "..."},
- {"id": "doc2", "text": "..."}
- ]
-
- await collection.upsert_documents(documents)
-```
-
-**JavaScript**
-
-```js
- const documents = [
- {
- id: "Document One",
- text: "...",
- },
- {
- id: "Document Two",
- text: "...",
- },
- ];
-
- await collection.upsert_documents(documents);
-```
-
-Documents are upserted into the collection and indexed by the pipeline.
-
-### Query
-
-**Python**
-
-```python
-    results = await (
-        collection.query()
-        .vector_recall("query", pipeline)
-        .fetch_all()
-    )
-```
-
-**JavaScript**
-
-```js
- const queryResults = await collection
- .query()
- .vector_recall(
- "query",
- pipeline,
- )
- .fetch_all();
-```
-
-A vector similarity search query is made on the collection.
-
-### Archive Collection
-
-**Python**
-
-```python
- await collection.archive()
-```
-
-**JavaScript**
-
-```js
- await collection.archive();
-```
-
-The collection is archived when finished.
-
-### Main
-
-**Python**
-
-```python
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-**JavaScript**
-
-```javascript
-main().then((results) => {
-console.log("Vector search Results: \n", results);
-});
-```
-
-Boilerplate to call the async `main()` function.
-
diff --git a/pgml-docs/sdks/tutorials/summarizing-question-answering.md b/pgml-docs/sdks/tutorials/summarizing-question-answering.md
deleted file mode 100644
index 9080b50ac..000000000
--- a/pgml-docs/sdks/tutorials/summarizing-question-answering.md
+++ /dev/null
@@ -1,146 +0,0 @@
-# Summarizing Question Answering
-
-Here are the Python and JavaScript examples for text summarization using the `pgml` SDK.
-
-### Imports and Setup
-
-**Python**
-
-```python
-from pgml import Collection, Model, Splitter, Pipeline, Builtins
-from datasets import load_dataset
-from dotenv import load_dotenv
-```
-
-**JavaScript**
-
-```js
-const pgml = require("pgml");
-require("dotenv").config();
-```
-
-The SDK and datasets are imported. Builtins are used for transformations.
-
-### Initialize Collection
-
-**Python**
-
-```python
-collection = Collection("squad_collection")
-```
-
-**JavaScript**
-
-```js
-const collection = pgml.newCollection("my_javascript_sqa_collection");
-```
-
-A collection is created to hold text passages.
-
-### Create Pipeline
-
-**Python**
-
-```python
-model = Model()
-splitter = Splitter()
-pipeline = Pipeline("squadv1", model, splitter)
-await collection.add_pipeline(pipeline)
-```
-
-**JavaScript**
-
-```js
-const pipeline = pgml.newPipeline(
- "my_javascript_sqa_pipeline",
- pgml.newModel(),
- pgml.newSplitter(),
-);
-
-await collection.add_pipeline(pipeline);
-```
-
-A pipeline is created and added to the collection.
-
-### Upsert Documents
-
-**Python**
-
-```python
-data = load_dataset("squad")
-
-documents = [
- {"id": ..., "text": ...}
- for r in data
-]
-
-await collection.upsert_documents(documents)
-```
-
-**JavaScript**
-
-```js
-const documents = [
- {
- id: "...",
- text: "...",
- }
-];
-
-await collection.upsert_documents(documents);
-```
-
-Text passages are upserted into the collection.
-
-### Query for Context
-
-**Python**
-
-```python
-results = await (
-    collection.query()
-    .vector_recall(query, pipeline)
-    .fetch_all()
-)
-
-context = results[0][1]
-```
-
-**JavaScript**
-
-```js
-const queryResults = await collection
- .query()
- .vector_recall(query, pipeline)
- .fetch_all();
-
-const context = queryResults[0][1];
-```
-
-A vector search retrieves a relevant text passage.
-
-### Summarize Text
-
-**Python**
-
-```python
-builtins = Builtins()
-
-summary = await builtins.transform(
- {"task": "summarization",
- "model": "sshleifer/distilbart-cnn-12-6"},
- [context]
-)
-```
-
-**JavaScript**
-
-```js
-const builtins = pgml.newBuiltins();
-
-const summary = await builtins.transform(
- {task: "summarization",
- model: "sshleifer/distilbart-cnn-12-6"},
- [context]
-);
-```
-
-The text is summarized using a pretrained model.
diff --git a/pgml-docs/test.md b/pgml-docs/test.md
new file mode 100644
index 000000000..b58eb16a6
--- /dev/null
+++ b/pgml-docs/test.md
@@ -0,0 +1,6 @@
+# Table of contents
+
+* [Machine Learning](machine-learning/README.md)
+ * [Natural Language Processing](machine-learning/natural-language-processing/README.md)
+ * [Embeddings](machine-learning/natural-language-processing/embeddings.md)
+ * [Fill Mask](machine-learning/natural-language-processing/fill-mask.md)
\ No newline at end of file
diff --git a/pgml-docs/use-cases/README.md b/pgml-docs/use-cases/README.md
deleted file mode 100644
index 57881efaa..000000000
--- a/pgml-docs/use-cases/README.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Use cases
-
diff --git a/pgml-docs/use-cases/generating-llm-embeddings-with-open-source-models-in-postgresml.md b/pgml-docs/use-cases/generating-llm-embeddings-with-open-source-models-in-postgresml.md
deleted file mode 100644
index f0f2037e1..000000000
--- a/pgml-docs/use-cases/generating-llm-embeddings-with-open-source-models-in-postgresml.md
+++ /dev/null
@@ -1,350 +0,0 @@
-# Generating LLM embeddings with open source models in PostgresML
-
-
-
-PostgresML makes it easy to generate embeddings from text in your database using a large selection of state-of-the-art models with one simple call to **`pgml.embed`**`(model_name, text)`. Prove the results in this series to your own satisfaction, for free, by signing up for a GPU accelerated database.
-
-This article is the first in a multipart series that will show you how to build a post-modern semantic search and recommendation engine, including personalization, using open source models.
-
-1. Generating LLM Embeddings with HuggingFace models
-2. Tuning vector recall with pgvector
-3. Personalizing embedding results with application data
-4. Optimizing semantic results with an XGBoost ranking model - coming soon!
-
-## Introduction
-
-In recent years, embeddings have become an increasingly popular technique in machine learning and data analysis. They are essentially vector representations of data points that capture their underlying characteristics or features. In most programming environments, vectors can be efficiently represented as native array datatypes. They can be used for a wide range of applications, from natural language processing to image recognition and recommendation systems.
-
-They can also turn natural language into quantitative features for downstream machine learning models and applications.
-
-
-
-_Embeddings show us the relationships between rows in the database._
-
-A popular use case driving the adoption of "vector databases" is doing similarity search on embeddings, often referred to as "Semantic Search". This is a powerful technique that allows you to find similar items in large datasets by comparing their vectors. For example, you could use it to find similar products in an e-commerce site, similar songs in a music streaming service, or similar documents given a text query.
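-
-A toy illustration of the underlying idea (not PostgresML code): similarity between two embeddings is just a distance measure between vectors, commonly cosine similarity.
-
-```python
-# Toy example with made-up 3-dimensional "embeddings"; real models produce
-# hundreds of dimensions.
-def cosine_similarity(a, b):
-    dot = sum(x * y for x, y in zip(a, b))
-    norm_a = sum(x * x for x in a) ** 0.5
-    norm_b = sum(y * y for y in b) ** 0.5
-    return dot / (norm_a * norm_b)
-
-query = [0.9, 0.1, 0.0]
-doc_a = [0.8, 0.2, 0.1]  # similar direction -> high similarity
-doc_b = [0.0, 0.1, 0.9]  # different direction -> low similarity
-
-print(cosine_similarity(query, doc_a))  # ~0.98
-print(cosine_similarity(query, doc_b))  # ~0.01
-```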
-
-Postgres is a good candidate for this type of application because it's a general purpose database that can store both the embeddings and the metadata in the same place, and has a rich set of features for querying and analyzing them, including fast vector indexes used for search.
-
-In this first installment, we'll show you how to use the **`pgml.embed`** function to generate embeddings from text in your database using an open source pretrained model. Further chapters will expand on how to implement many of the different use cases for embeddings in Postgres, like similarity search, personalization, recommendations and fine-tuned models.
-
-## It always starts with data
-
-Most general purpose databases are full of all sorts of great data for machine learning use cases. Dealing with text data has historically required complex Natural Language Processing techniques, but embeddings created from open source models can effectively turn unstructured text into structured features, perfect for more straightforward implementations.
-
-In this example, we'll demonstrate how to generate embeddings for products on an e-commerce site. We'll use the [Amazon US Reviews](https://huggingface.co/datasets/amazon\_us\_reviews) dataset, which contains millions of product reviews. It includes the product title, a text review written by a customer and some additional metadata about the product, like its category. With just a few pieces of data, we can create a full-featured and personalized product search and recommendation engine, using both generic embeddings and later, additional fine-tuned models trained with PostgresML.
-
-PostgresML includes a convenience function for loading public datasets from [HuggingFace](https://huggingface.co/datasets) directly into your database. To load the DVD subset of the Amazon US Reviews dataset into your database, run the following command:
-
-!!! code\_block
-
-```postgresql
-SELECT *
-FROM pgml.load_dataset('amazon_us_reviews', 'Video_DVD_v1_00');
-```
-
-!!!
-
-It took about 23 minutes to download the 7.1GB raw dataset with 5,069,140 rows into a table within the `pgml` schema (where all PostgresML functionality is namespaced). Once it's done, you can see the table structure with the following command:
-
-!!! generic
-
-!!! code\_block
-
-```postgresql
-\d pgml.amazon_us_reviews
-```
-
-!!!
-
-!!! results
-
-| Column | Type | Collation | Nullable | Default |
-| ------------------ | ------- | --------- | -------- | ------- |
-| marketplace | text | | | |
-| customer\_id | text | | | |
-| review\_id | text | | | |
-| product\_id | text | | | |
-| product\_parent | text | | | |
-| product\_title | text | | | |
-| product\_category | text | | | |
-| star\_rating | integer | | | |
-| helpful\_votes | integer | | | |
-| total\_votes | integer | | | |
-| vine | bigint | | | |
-| verified\_purchase | bigint | | | |
-| review\_headline | text | | | |
-| review\_body | text | | | |
-| review\_date | text | | | |
-
-!!!
-
-!!!
-
-Let's take a peek at the first 5 rows of data:
-
-!!! code\_block
-
-```postgresql
-SELECT *
-FROM pgml.amazon_us_reviews
-LIMIT 5;
-```
-
-!!! results
-
-| marketplace | customer\_id | review\_id | product\_id | product\_parent | product\_title | product\_category | star\_rating | helpful\_votes | total\_votes | vine | verified\_purchase | review\_headline | review\_body | review\_date |
-| ----------- | ------------ | -------------- | ----------- | --------------- | ------------------------------------------------------------------------------------------------------------------- | ----------------- | ------------ | -------------- | ------------ | ---- | ------------------ | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ |
-| US | 27288431 | R33UPQQUZQEM8 | B005T4ND06 | 400024643 | Yoga for Movement Disorders DVD: Rebuilding Strength, Balance, and Flexibility for Parkinson's Disease and Dystonia | Video DVD | 5 | 3 | 3 | 0 | 1 | This was a gift for my aunt who has Parkinson's ... | This was a gift for my aunt who has Parkinson's. While I have not previewed it myself, I also have not gotten any complaints. My prior experiences with yoga tell me this should be just what the doctor ordered. | 2015-08-31 |
-| US | 13722556 | R3IKTNQQPD9662 | B004EPZ070 | 685335564 | Something Borrowed | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Teats my heart out. | 2015-08-31 |
-| US | 20381037 | R3U27V5QMCP27T | B005S9EKCW | 922008804 | Les Miserables (2012) \[Blu-ray] | Video DVD | 5 | 1 | 1 | 0 | 1 | Great movie! | Great movie. | 2015-08-31 |
-| US | 24852644 | R2TOH2QKNK4IOC | B00FC1ZCB4 | 326560548 | Alien Anthology and Prometheus Bundle \[Blu-ray] | Video DVD | 5 | 0 | 1 | 0 | 1 | Amazing | My husband was so excited to receive these as a gift! Great picture quality and great value! | 2015-08-31 |
-| US | 15556113 | R2XQG5NJ59UFMY | B002ZG98Z0 | 637495038 | Sex and the City 2 | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Love this series. | 2015-08-31 |
-
-!!!
-
-!!!
-
-## Generating embeddings from natural language text
-
-PostgresML provides a simple interface to generate embeddings from text in your database. You can use the [`pgml.embed`](https://postgresml.org/docs/guides/transformers/embeddings) function to generate embeddings for a column of text. The function takes a transformer name and a text value. The transformer will automatically be downloaded and cached on your connection process for reuse. You can find good candidate models for generating embeddings on the [Massive Text Embedding Benchmark leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
-
-Since the documents in our corpus (movie reviews) are all relatively short and similar in style, we don't need a large model. [`intfloat/e5-small`](https://huggingface.co/intfloat/e5-small) will be a good first attempt. The great thing about PostgresML is you can always regenerate your embeddings later to experiment with different embedding models.
-
-It takes a couple of minutes to download and cache the `intfloat/e5-small` model to generate the first embedding. After that, it's pretty fast.
-
-Note how we prefix the text we want to embed with either `passage:` or `query:`. The e5 model requires us to prefix our data with `passage:` when generating embeddings for our corpus, and `query:` when we want to find semantically similar content.
-
-```postgresql
-SELECT pgml.embed('intfloat/e5-small', 'passage: hi mom');
-```
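-
-The same call works from any Postgres client. A minimal sketch with psycopg2, assuming `DATABASE_URL` points at a PostgresML database:
-
-```python
-# Hypothetical example: generating a corpus embedding and a query embedding
-# with the required e5 prefixes from Python.
-import os
-import psycopg2
-
-conn = psycopg2.connect(os.environ["DATABASE_URL"])
-cur = conn.cursor()
-
-cur.execute("SELECT pgml.embed('intfloat/e5-small', %s)", ("passage: hi mom",))
-passage_embedding = cur.fetchone()[0]
-
-cur.execute("SELECT pgml.embed('intfloat/e5-small', %s)", ("query: hi mom",))
-query_embedding = cur.fetchone()[0]
-conn.close()
-```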
-
-This is a pretty powerful function, because we can pass any arbitrary text to any open source model, and it will generate an embedding for us. We can benchmark how long it takes to generate an embedding for a single review, using client-side timings in Postgres:
-
-```postgresql
-\timing on
-```
-
-Aside from using this function with strings passed from a client, we can use it on strings already present in our database tables by calling **pgml.embed** on columns. For example, we can generate an embedding for the first review using a pretty simple query:
-
-!!! generic
-
-!!! code\_block time="54.820 ms"
-
-```postgresql
-SELECT
- review_body,
- pgml.embed('intfloat/e5-small', 'passage: ' || review_body)
-FROM pgml.amazon_us_reviews
-LIMIT 1;
-```
-
-!!!
-
-
-!!!
-
-Time to generate an embedding increases with the length of the input text, and varies widely between different models. If we up our batch size (controlled by `LIMIT`), we can see the average time to compute an embedding on the first 1000 reviews is about 17ms per review:
-
-!!! code\_block time="17955.026 ms"
-
-```postgresql
-SELECT
- review_body,
- pgml.embed('intfloat/e5-small', 'passage: ' || review_body) AS embedding
-FROM pgml.amazon_us_reviews
-LIMIT 1000;
-```
-
-!!!
-
-## Comparing different models and hardware performance
-
-This database is using a single GPU with 32GB of VRAM, alongside 8 vCPUs and 16GB of system RAM. Running these benchmarks while watching the database processes with `htop` and `nvidia-smi`, it becomes clear that the bottleneck in this case is actually tokenizing the strings, which happens in a single thread on the CPU, not computing the embeddings on the GPU, which was only 20% utilized during the query.
-
-We can also do a quick sanity check to make sure we're really getting value out of our GPU by passing the device to our embedding function:
-
-!!! code\_block time="30421.491 ms"
-
-```postgresql
-SELECT
-    review_body,
- pgml.embed(
- 'intfloat/e5-small',
- 'passage: ' || review_body,
- '{"device": "cpu"}'
- ) AS embedding
-FROM pgml.amazon_us_reviews
-LIMIT 1000;
-```
-
-!!!
-
-Forcing the embedding function to use `cpu` is almost 2x slower than `cuda`, which is the default when GPUs are available.
-
-If you're managing dedicated hardware, there's always a decision to be made about resource utilization. If this is a multi-workload database with other queries using the GPU, it's probably good that we're not completely hogging it with our multi-decade-Amazon-scale data import process. But if this is a machine we've spun up just for this task, we can increase utilization to 4 concurrent connections, each running on a subset of the data, to more completely use our CPU, GPU and RAM.
-
-Another consideration is that GPUs are much more expensive right now than CPUs, and if we're primarily interested in backfilling a dataset like this, high concurrency across many CPU cores might just be the price-competitive winner.
-
-With 4x concurrency and a GPU, it'll take about 6 hours to compute all 5 million embeddings, which will cost $72 on PostgresML Cloud. If we use the CPU instead of the GPU, we'll probably want more cores and higher concurrency to get through the job faster. A 96-core CPU machine could complete the job in half the time our single GPU would take, at a lower hourly cost as well, for a total of $24. It's more cost-effective and faster in parallel, but keep in mind that if you're interactively generating embeddings for a user-facing application, the CPU will nearly double the latency: 30ms vs 17ms on the GPU.
-
-For comparison, it would cost about $299 to use OpenAI's cheapest embedding model to process this dataset. Their API calls average about 300ms, although they have high variability (200-400ms) and greater than 1000ms p99 in our measurements. They also have a default rate limit of 200 tokens per minute, which means it would take 1,425 years to process this dataset. You better call ahead.
-
-| Processor | Latency | Cost | Time |
-| --------- | ------- | ---- | --------- |
-| CPU | 30ms | $24 | 3 hours |
-| GPU | 17ms | $72 | 6 hours |
-| OpenAI | 300ms | $299 | millennia |
-
-You can also find embedding models that outperform OpenAI's `text-embedding-ada-002` model across many different tests on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It's always best to do your own benchmarking with your data, models, and hardware to find the best fit for your use case.
-
-> _HTTP requests to a different datacenter cost more time and money for lower reliability than co-located compute and storage._
-
-## Instructor embedding models
-
-The current leading model is `hkunlp/instructor-xl`. Instructor models take an additional `instruction` parameter which includes context for the embeddings use case, similar to prompts before text generation tasks.
-
-Instructions can provide a "classification" or "topic" for the text:
-
-#### Classification
-
-!!! code\_block time="17.912ms"
-
-```postgresql
-SELECT pgml.embed(
- transformer => 'hkunlp/instructor-xl',
- text => 'The Federal Reserve on Wednesday raised its benchmark interest rate.',
- kwargs => '{"instruction": "Represent the Financial statement:"}'
-);
-```
-
-!!!
-
-They can also specify particular use cases for the embedding:
-
-#### Querying
-
-!!! code\_block time="24.263 ms"
-
-```postgresql
-SELECT pgml.embed(
- transformer => 'hkunlp/instructor-xl',
- text => 'where is the food stored in a yam plant',
- kwargs => '{
- "instruction": "Represent the Wikipedia question for retrieving supporting documents:"
- }'
-);
-```
-
-!!!
-
-#### Indexing
-
-!!! code\_block time="30.571 ms"
-
-```postgresql
-SELECT pgml.embed(
- transformer => 'hkunlp/instructor-xl',
- text => 'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.',
- kwargs => '{"instruction": "Represent the Wikipedia document for retrieval:"}'
-);
-```
-
-!!!
-
-#### Clustering
-
-!!! code\_block time="18.986 ms"
-
-```postgresql
-SELECT pgml.embed(
- transformer => 'hkunlp/instructor-xl',
-    text => 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity',
- kwargs => '{"instruction": "Represent the Medicine sentence for clustering:"}'
-);
-```
-
-!!!
-
-Performance remains relatively good, even with the most advanced models.
-
-## Generating embeddings for a large dataset
-
-For our use case, we want to generate an embedding for every single review in the dataset. We'll use the `vector` datatype available from the `pgvector` extension to store (and later index) embeddings efficiently. All PostgresML cloud installations include [pgvector](https://github.com/pgvector/pgvector). To enable this extension in your database, you can run:
-
-```postgresql
-CREATE EXTENSION vector;
-```
-
-Then we can add a `vector` column for our review embeddings, with 1024 dimensions (the size of `intfloat/e5-large` embeddings, the larger e5 variant we'll use for the full backfill):
-
-```postgresql
-ALTER TABLE pgml.amazon_us_reviews
-ADD COLUMN review_embedding_e5_large vector(1024);
-```
-
-It's best practice to keep running queries on a production database relatively short, so rather than trying to update all 5M rows in one multi-hour query, we should write a function to issue the updates in smaller batches. To make iterating over the rows easier and more efficient, we'll add an `id` column with an index to our table:
-
-```postgresql
-ALTER TABLE pgml.amazon_us_reviews
-ADD COLUMN id SERIAL PRIMARY KEY;
-```
-
-Every language/framework/codebase has its own preferred method for backfilling data in a table. The 2 most important considerations are:
-
-1. Keep the number of rows per query small enough that the queries take less than a second
-2. More concurrency will get the job done faster, but keep in mind the other workloads on your database
-
-Here's an example of a very simple back-fill job implemented in pure PL/pgSQL, but I'd also love to see example PRs opened with your techniques in your language of choice for tasks like this.
-
-```postgresql
-DO $$
-BEGIN
-  FOR i IN 1..(SELECT max(id) FROM pgml.amazon_us_reviews) BY 10 LOOP
-    RAISE NOTICE 'updating % to %', i, i + 9;
-
-    -- embed the next batch of 10 reviews, skipping any already backfilled
-    UPDATE pgml.amazon_us_reviews
-    SET review_embedding_e5_large = pgml.embed(
-      'intfloat/e5-large',
-      'passage: ' || review_body
-    )
-    WHERE id BETWEEN i AND i + 9
-    AND review_embedding_e5_large IS NULL;
-
-    -- commit each batch so progress is durable and locks are released
-    COMMIT;
-  END LOOP;
-END;
-$$;
-```
-
-## What's next?
-
-That's it for now. We've got an Amazon-scale table with state-of-the-art machine learning embeddings. As a premature optimization, we'll go ahead and build an index on our new column to make our future vector similarity queries faster. For the full documentation on vector indexes in Postgres, see the [pgvector docs](https://github.com/pgvector/pgvector).
-
-!!! code\_block time="4068909.269 ms (01:07:48.909)"
-
-```postgresql
-CREATE INDEX CONCURRENTLY index_amazon_us_reviews_on_review_embedding_e5_large
-ON pgml.amazon_us_reviews
-USING ivfflat (review_embedding_e5_large vector_cosine_ops)
-WITH (lists = 2000);
-```
-
-!!!
-
-!!! tip
-
-Create indexes `CONCURRENTLY` to avoid locking your table for other queries.
-
-!!!
-
-Building a vector index on a table with this many entries takes a while, so this is a good time to take a coffee break. In the next article we'll look at how to query these embeddings to find the best products and make personalized recommendations for users. We'll also cover updating an index in real time as new data comes in.
diff --git a/pgml-docs/use-cases/improve-search-results-with-machine-learning.md b/pgml-docs/use-cases/improve-search-results-with-machine-learning.md
deleted file mode 100644
index 5a6f20cef..000000000
--- a/pgml-docs/use-cases/improve-search-results-with-machine-learning.md
+++ /dev/null
@@ -1,456 +0,0 @@
-# Improve Search Results with Machine Learning
-
-PostgresML makes it easy to use machine learning with your database and to scale workloads horizontally in our cloud. One of the most common use cases is to improve search results. In this article, we'll show you how to build a search engine from the ground up that leverages multiple types of natural language processing (NLP) and machine learning (ML) models to improve search results, including vector search and personalization with embeddings.
-
-## Keyword Search
-
-One important takeaway from this article is that search engines are built in multiple layers from simple to complex and use iterative refinement of results along the way. We'll explore what that composition and iterative refinement looks like using standard SQL and the additional functions provided by PostgresML. Our foundational layer is the traditional form of search, keyword search. This is the type of search you're probably most familiar with. You type a few words into a search box, and get back a list of results that contain those words.
-
-### Queries
-
-Our search application will start with a **documents** table. Our documents have a title and a body, as well as a unique id for our application to reference when updating or deleting existing documents. We create our table with the standard SQL `CREATE TABLE` syntax.
-
-!!! generic
-
-!!! code\_block time="10.493 ms"
-
-```sql
-CREATE TABLE documents (
- id BIGSERIAL PRIMARY KEY,
- title TEXT,
- body TEXT
-);
-```
-
-!!!
-
-!!!
-
-We can add new documents to our _text corpus_ with the standard SQL `INSERT` statement. Postgres will automatically take care of generating the unique ids, so we'll add a few **documents** with just a **title** and **body** to get started.
-
-!!! generic
-
-!!! code\_block time="3.417 ms"
-
-```sql
-INSERT INTO documents (title, body) VALUES
- ('This is a title', 'This is the body of the first document.'),
- ('This is another title', 'This is the body of the second document.'),
- ('This is the third title', 'This is the body of the third document.')
-;
-```
-
-!!!
-
-!!!
-
-As you can see, it takes a few milliseconds to insert new documents into our table. Postgres is pretty fast out of the box. We'll also cover scaling and tuning in more depth later on for production workloads.
-
-Now that we have some documents, we can immediately start using built-in keyword search functionality. Keyword queries allow us to find documents that contain the words in our queries, but not necessarily in the order we typed them. Standard variations on a root word, like pluralization or past tense, should also match our queries. This is accomplished by "stemming" the words in our queries and documents. Postgres provides 2 important functions that implement these grammatical cleanup rules on queries and documents.
-
-* `to_tsvector(config, text)` will turn plain text into a `tsvector` that can also be indexed for faster recall.
-* `to_tsquery(config, text)` will turn a plain text query into a boolean rule (and, or, not, phrase) `tsquery` that can match `@@` against a `tsvector`.
-
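-We can see the stemming and stopword rules in action by inspecting the output of these functions directly:
-
-```sql
-SELECT to_tsvector('english', 'The second documents');
--- returns: 'document':3 'second':2
-SELECT to_tsquery('english', 'second & documents');
--- returns: 'second' & 'document'
-```
-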
-You can configure the grammatical rules in many advanced ways, but we'll use the built-in `english` config for our examples. Here's how we can use the match `@@` operator with these functions to find documents that contain the word "second" in the **body**.
-
-!!! generic
-
-!!! code\_block time="0.651 ms"
-
-```sql
-SELECT *
-FROM documents
-WHERE to_tsvector('english', body) @@ to_tsquery('english', 'second');
-```
-
-!!!
-
-!!! results
-
-| id | title | body |
-| -- | --------------------- | ---------------------------------------- |
-| 2 | This is another title | This is the body of the second document. |
-
-!!!
-
-!!!
-
-Postgres provides the complete reference [documentation](https://www.postgresql.org/docs/current/datatype-textsearch.html) on these functions.
-
-### Indexing
-
-Postgres treats everything in the standard SQL `WHERE` clause as a filter. By default, it makes this keyword search work by scanning the entire table, converting each document body to a `tsvector`, and then comparing the `tsquery` to the `tsvector`. This is called a "sequential scan". It's fine for small tables, but for production use cases at scale, we'll need a more efficient solution.
-
-The first step is to store the `tsvector` in the table, so we don't have to generate it during each search. We can do this by adding a new `GENERATED` column to our table, that will automatically stay up to date. We also want to search both the **title** and **body**, so we'll concatenate `||` the fields we want to include in our search, separated by a simple space `' '`.
-
-!!! generic
-
-!!! code\_block time="17.883 ms"
-
-```sql
-ALTER TABLE documents
-ADD COLUMN title_and_body_text tsvector
-GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body )) STORED;
-```
-
-!!!
-
-!!!
-
-One nice aspect of generated columns is that they will backfill the data for existing rows. They can also be indexed, just like any other column. We can add a Generalized Inverted Index (GIN) on this new column that will pre-compute the lists of all documents that contain each keyword. This will allow us to skip the sequential scan, and instead use the index to find the exact list of documents that satisfy any given `tsquery`.
-
-!!! generic
-
-!!! code\_block time="5.145 ms"
-
-```sql
-CREATE INDEX documents_title_and_body_text_index
-ON documents
-USING GIN (title_and_body_text);
-```
-
-!!!
-
-!!!
-
-And now, we'll demonstrate a slightly more complex `tsquery` that requires both the keywords **another** and **second** to match `@@` the **title** or **body** of the document, which will automatically use our index on **title\_and\_body\_text**.
-
-!!! generic
-
-!!! code\_block time="3.673 ms"
-
-```sql
-SELECT *
-FROM documents
-WHERE title_and_body_text @@ to_tsquery('english', 'another & second');
-```
-
-!!!
-
-!!! results
-
-| id | title | body | title\_and\_body\_text |
-| -- | --------------------- | ---------------------------------------- | ----------------------------------------------------- |
-| 2 | This is another title | This is the body of the second document. | 'anoth':3 'bodi':8 'document':12 'second':11 'titl':4 |
-
-!!!
-
-!!!
-
-We can see our new `tsvector` column in the results now as well, since we used `SELECT *`. You'll notice that the `tsvector` contains the stemmed words from both the **title** and **body**, along with their position. The position information allows Postgres to support _phrase_ matches as well as single keywords. You'll also notice that _stopwords_, like "the", "is", and "of" have been removed. This is a common optimization for keyword search, since these words are so common, they don't add much value to the search results.
-
-### Ranking
-
-Ranking is a critical component of search, and it's also where Machine Learning becomes critical for great results. Our users will expect us to sort our results with the most relevant at the top. A simple arithmetic relevance score is provided by `ts_rank`. It computes the Term Frequency (TF) of each keyword in the query that matches the document. For example, if the document has 2 keyword matches out of 5 words total, its `ts_rank` will be `2 / 5 = 0.4`. The more matches and the fewer total words, the higher the score and the more relevant the document.
-
-With multiple query terms OR'd `|` together, `ts_rank` adds the numerators and denominators to account for both. For example, if the document has 2 keyword matches out of 5 words total for the first query term, and 1 keyword match out of 5 words total for the second, its ts\_rank will be `(2 + 1) / (5 + 5) = 0.3`. The full `ts_rank` function has many additional options and configurations that you can read about in the [documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING), but this should give you the basic idea.
-
-!!! generic
-
-!!! code\_block time="0.561 ms"
-
-```sql
-SELECT ts_rank(title_and_body_text, to_tsquery('english', 'second | title')), *
-FROM documents
-ORDER BY ts_rank DESC;
-```
-
-!!!
-
-!!! results
-
-| ts\_rank | id | title | body | title\_and\_body\_text |
-| ----------- | -- | ----------------------- | ---------------------------------------- | ----------------------------------------------------- |
-| 0.06079271 | 2 | This is another title | This is the body of the second document. | 'anoth':3 'bodi':8 'document':12 'second':11 'titl':4 |
-| 0.030396355 | 1 | This is a title | This is the body of the first document. | 'bodi':8 'document':12 'first':11 'titl':4 |
-| 0.030396355 | 3 | This is the third title | This is the body of the third document. | 'bodi':9 'document':13 'third':4,12 'titl':5 |
-
-!!!
-
-!!!
-
-Our document that matches 2 of the keywords has twice the score of the documents that match just one. It's important to call out that this query has no `WHERE` clause: it will rank and return every document in a potentially large table, even when the `ts_rank` is 0, i.e. not a match at all. We'll generally want to add both a basic match `@@` filter that can leverage an index, and a `LIMIT` to make sure we're not returning completely irrelevant documents or too many results per page.
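-
-As a minimal sketch of that pattern, reusing our running query, we filter with the indexed match before ranking and paginating:
-
-```sql
-SELECT ts_rank(title_and_body_text, to_tsquery('english', 'second | title')), *
-FROM documents
-WHERE title_and_body_text @@ to_tsquery('english', 'second | title')
-ORDER BY ts_rank DESC
-LIMIT 10;
-```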
-
-### Boosting
-
-A quick improvement we could make to our search query would be to differentiate the relevance of the title and body. It's intuitive that a keyword match in the title is more relevant than a keyword match in the body. We can implement a simple boosting function by multiplying the title rank 2x, and adding it to the body rank. This will _boost_ title matches up the rankings in our final results list. We can express this as a simple arithmetic formula in the `ORDER BY` clause; since Postgres doesn't allow output column aliases inside `ORDER BY` expressions, we compute the ranks in a subquery so we can reuse them in the boosting formula.
-
-!!! generic
-
-!!! code\_block time="0.561 ms"
-
-```sql
-SELECT *
-FROM (
-  SELECT
-    ts_rank(to_tsvector('english', title), to_tsquery('english', 'second | title')) AS title_rank,
-    ts_rank(to_tsvector('english', body), to_tsquery('english', 'second | title')) AS body_rank,
-    *
-  FROM documents
-) ranked
-ORDER BY (2 * title_rank) + body_rank DESC;
-```
-
-!!!
-
-!!!
-
-Wait a second... is a title match 2x or 10x, or maybe `log(π / ts_rank²)` more relevant than a body match? Since document length penalizes ts\_rank more in longer body content, maybe we should be boosting body matches instead? You might try a few equations against some test queries, but it's hard to know what value works best across all queries. Optimizing functions like this is one area where Machine Learning can help.
-
-## Learning to Rank
-
-So far we've only considered simple statistical measures of relevance like `ts_rank`'s term frequency, but people have a much more sophisticated idea of relevance. Luckily, they'll tell you exactly what they think is relevant by clicking on it. We can use this feedback to train a model that learns the optimal weights of **title\_rank** vs **body\_rank** for our boosting function. We'll redefine relevance as the probability that a user will click on a search result, given our inputs like **title\_rank** and **body\_rank**.
-
-This is considered a Supervised Learning problem, because we have a labeled dataset of user clicks that we can use to train our model. The inputs to our function are called _features_ of the data for the machine learning model, and the output is often referred to as the _label_.
-
-### Training Data
-
-First things first, we need to record some user clicks on our search results. We'll create a new table to store our training data, which are the observed inputs and output of our new relevance function. In a real system, we'd probably have separate tables to record **sessions**, **searches**, **results**, **clicks** and other events, but for simplicity in this example, we'll just record the exact information we need to train our model in a single table. Every time we perform a search, we'll record the `ts_rank` for both the **title** and **body**, and whether the user **clicked** on the result.
-
-!!! generic
-
-!!! code\_block time="0.561 ms"
-
-```sql
-CREATE TABLE search_result_clicks (
- title_rank REAL,
- body_rank REAL,
- clicked BOOLEAN
-);
-```
-
-!!!
-
-!!!
-
-One of the hardest parts of machine learning is gathering the data from disparate sources and turning it into features like this. There are often teams of data engineers involved in maintaining endless pipelines that shuttle features from one store or data warehouse to another and back again. We don't need that complexity in PostgresML and can just insert the ML features directly into the database.
-
-I've made up 4 example searches, across our 3 documents, and recorded the `ts_rank` for the **title** and **body**, and whether the user **clicked** on the result. I've cherry-picked some intuitive results, where the user always clicked on the top ranked document, that has the highest combined title and body ranks. We'll insert this data into our new table.
-
-!!! generic
-
-!!! code\_block time="2.161 ms"
-
-```sql
-INSERT INTO search_result_clicks
- (title_rank, body_rank, clicked)
-VALUES
--- search 1
- (0.5, 0.5, true),
- (0.3, 0.2, false),
- (0.1, 0.0, false),
--- search 2
- (0.0, 0.5, true),
- (0.0, 0.2, false),
- (0.0, 0.0, false),
--- search 3
- (0.2, 0.5, true),
- (0.1, 0.2, false),
- (0.0, 0.0, false),
--- search 4
- (0.4, 0.5, true),
- (0.4, 0.2, false),
- (0.4, 0.0, false)
-;
-```
-
-!!!
-
-!!!
-
-In a real application, we'd record the results of millions of searches, with the ts\_ranks and clicks from our users, but even this small amount of data is enough to train a model with PostgresML. Bootstrapping or back-filling data is also possible with several techniques. You could build the app, and have your admins or employees use it to generate training data before a public release.
-
-### Training a Model to rank search results
-
-We'll train a model for our "Search Ranking" project using the `pgml.train` function, which takes several arguments. The `project_name` is a handle we can use to refer to the model later when we're ranking results, and the `task` is the type of model we want to train. In this case, we want to train a model to predict the probability of a user clicking on a search result, given the `title_rank` and `body_rank` of the result. This is a regression problem, because we're predicting a continuous value between 0 and 1. We could also train a classification model to make a boolean prediction whether a user will click on a result, but we'll save that for another example.
-
-Here goes some machine learning:
-
-!!! generic
-
-!!! code\_block time="6.867 ms"
-
-```sql
-SELECT * FROM pgml.train(
- project_name => 'Search Ranking',
- task => 'regression',
- relation_name => 'search_result_clicks',
- y_column_name => 'clicked'
-);
-```
-
-!!!
-
-!!! results
-
-| project | task | algorithm | deployed |
-| -------------- | ---------- | --------- | -------- |
-| Search Ranking | regression | linear | t |
-
-!!!
-
-!!!
-
-SQL statements generally begin with `SELECT` to read something, but in this case we're really just interested in reading the result of the training function. The `pgml.train` function takes a few arguments, but the most important are the `relation_name` and `y_column_name`. The `relation_name` is the table we just created with our training data, and the `y_column_name` is the column we want to predict. In this case, we want to predict whether a user will click on a search result, given the **title\_rank** and **body\_rank**. There are two common machine learning **tasks** for making predictions like this. Classification makes a discrete or categorical prediction like `true` or `false`. Regression makes a floating point prediction, akin to the probability that a user will click on a search result. In this case, we want to rank search results from most likely to least likely, so we'll use the `regression` task. The project is just a name for the model we're training, and we'll use it later to make predictions.
-
-Training a model in PostgresML is actually a multi-step pipeline that gets executed to implement best practices. There are options to control the pipeline, but by default, the following steps are executed:
-
-1. The training data is split into a training set and a test set
-2. The model is trained on the training set
-3. The model is evaluated on the test set
-4. The model is saved into `pgml.models` along with the evaluation metrics
-5. The model is deployed if it's better than the currently deployed model
-
-!!! tip
-
-The `pgml.train` function will return a table with some information about the training process. It will show several columns of data about the model that was trained, including the accuracy of the model on the training data. You may see calls to `pgml.train` that use `SELECT * FROM pgml.train(...)` instead of `SELECT pgml.train(...)`. Both invocations of the function are equivalent, but calling the function in `FROM` as if it were a table gives a slightly more readable table formatted result output.
-
-!!!
-
-PostgresML automatically deploys a model for online predictions after training, if the **key metric** is better than that of the currently deployed model. We'll train many models over time for this project, and you can read more about deployments later.
-
-### Making Predictions
-
-Once a model is trained, you can use `pgml.predict` to use it on new inputs. `pgml.predict` is a function that takes our project name, along with an array of features to predict on. In this case, our features are the `title_rank` and `body_rank`. We can use the `pgml.predict` function to make predictions on the training data, but in a real application, we'd want to make predictions on new data that the model hasn't seen before. Let's do a quick sanity check, and see what the model predicts for all the values of our training data.
-
-!!! generic
-
-!!! code\_block time="3.119 ms"
-
-```sql
-SELECT
- clicked,
- pgml.predict('Search Ranking', array[title_rank, body_rank])
-FROM search_result_clicks;
-```
-
-!!!
-
-!!! results
-
-| clicked | predict |
-| ------- | ----------- |
-| t | 0.88005996 |
-| f | 0.2533733 |
-| f | -0.1604198 |
-| t | 0.910045 |
-| f | 0.27136433 |
-| f | -0.15442279 |
-| t | 0.898051 |
-| f | 0.26536733 |
-| f | -0.15442279 |
-| t | 0.886057 |
-| f | 0.24737626 |
-| f | -0.17841086 |
-
-!!!
-
-!!!
-
-!!! note
-
-If you're watching your database logs, you'll notice the first time a model is used there is a "Model cache miss". PostgresML automatically caches models in memory for faster predictions, and the cache is invalidated when a new model is deployed. The cache is also invalidated when the database is restarted or a connection is closed.
-
-!!!
-
-The model is predicting values close to 1 when there was a click, and values closer to 0 when there wasn't a click. This is a good sign that the model is learning something useful. We can also use the `pgml.predict` function to make predictions on new data, and this is where things actually get interesting in online search results with PostgresML.
-
-### Ranking Search Results with Machine Learning
-
-Search results are often computed in multiple steps of recall and (re)ranking. Each step can apply more sophisticated (and expensive) models on more and more features, before pruning less relevant results for the next step. We're going to expand our original keyword search query to include a machine learning model that will re-rank the results. We'll use the `pgml.predict` function to make predictions on the title and body rank of each result, and then we'll use the predictions to re-rank the results.
-
-It's nice to organize the query into logical steps, and we can use **Common Table Expressions** (CTEs) to do this. CTEs are like temporary tables that only exist for the duration of the query. We'll start by defining a CTE that will rank all the documents in our table by the ts\_rank for title and body text. We define a CTE using the `WITH` keyword, and then we can use the CTE as if it were a table in the rest of the query. We'll name our CTE **first\_pass\_ranked\_documents**. Having the full power of SQL gives us a lot of flexibility in this step.
-
-1. We can efficiently recall matching documents using the keyword index `WHERE title_and_body_text @@ to_tsquery('english', 'second | title'))`
-2. We can generate multiple ts\_rank scores for each row the documents using the `ts_rank` function as if they were columns in the table
-3. We can order the results by the `title_and_body_rank` and limit the results to the top 100 to avoid wasting time in the next step applying an ML model to less relevant results
-4. We'll use this new table in a second query to apply the ML model to the title and body rank of each document and re-rank the results with a second `ORDER BY` clause
-
-!!! generic
-
-!!! code\_block time="2.118 ms"
-
-```sql
-WITH first_pass_ranked_documents AS (
- SELECT
- -- Compute the ts_rank for the title and body text of each document
- ts_rank(title_and_body_text, to_tsquery('english', 'second | title')) AS title_and_body_rank,
- ts_rank(to_tsvector('english', title), to_tsquery('english', 'second | title')) AS title_rank,
- ts_rank(to_tsvector('english', body), to_tsquery('english', 'second | title')) AS body_rank,
- *
- FROM documents
- WHERE title_and_body_text @@ to_tsquery('english', 'second | title')
- ORDER BY title_and_body_rank DESC
- LIMIT 100
-)
-SELECT
- -- Use the ML model to predict the probability that a user will click on the result
- pgml.predict('Search Ranking', array[title_rank, body_rank]) AS ml_rank,
- *
-FROM first_pass_ranked_documents
-ORDER BY ml_rank DESC
-LIMIT 10;
-```
-
-!!!
-
-!!! results
-
-| ml\_rank | title\_and\_body\_rank | title\_rank | body\_rank | id | title | body | title\_and\_body\_text |
-| ----------- | ---------------------- | ----------- | ----------- | -- | ----------------------- | ---------------------------------------- | ----------------------------------------------------- |
-| -0.09153378 | 0.06079271 | 0.030396355 | 0.030396355 | 2 | This is another title | This is the body of the second document. | 'anoth':3 'bodi':8 'document':12 'second':11 'titl':4 |
-| -0.15624566 | 0.030396355 | 0.030396355 | 0 | 1 | This is a title | This is the body of the first document. | 'bodi':8 'document':12 'first':11 'titl':4 |
-| -0.15624566 | 0.030396355 | 0.030396355 | 0 | 3 | This is the third title | This is the body of the third document. | 'bodi':9 'document':13 'third':4,12 'titl':5 |
-
-!!!
-
-!!!
-
-You'll notice that calculating the `ml_rank` adds virtually no additional time to the query. The `ml_rank` is not exactly "well calibrated", since I just made up 4 searches' worth of `search_result_clicks` data, but it's a good example of how we can use machine learning to re-rank search results extremely efficiently, without having to write much code or deploy any new microservices.
-
-You can also be selective about which fields you return to the application for greater efficiency over the network, or return everything for logging and debugging modes. After all, this is all just standard SQL, with a few extra function calls involved to make predictions.
-
-## Next steps with Machine Learning
-
-With composable CTEs and a mature Postgres ecosystem, you can continue to extend your search engine capabilities in many ways.
-
-### Add more features
-
-You can bring a lot more data into the ML model as **features**, or input columns, to improve the quality of the predictions. Many documents have a notion of "popularity" or "quality", like the `average_star_rating` from customer reviews or `number_of_views` for a video. Another common set of features is the global Click Through Rate (CTR) and global Conversion Rate (CVR). You should probably track all **sessions**, **searches**, **results**, **clicks** and **conversions** in tables, and compute global stats for how appealing each document is when it appears in search results, along multiple dimensions.
-
-Not only can you track the average stats for a document across all searches globally, you can track the stats for every document for each search query it appears in; i.e. the CTR for the "apples" document is different for the "apple" keyword search vs the "fruit" keyword search. So you could use both the global CTR and the keyword-specific CTR as features in the model. You might also want to track short-term vs long-term stats, or things like "freshness".
-
-Postgres offers `MATERIALIZED VIEWS` that can be periodically refreshed to compute and cache these stats tables efficiently from the normalized tracking tables your application writes structured event data into. This prevents write amplification, where a single event would otherwise trigger updates to dozens of related statistics.
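-
-A minimal sketch of that approach, assuming a hypothetical `search_results` tracking table with a `document_id` and a boolean `clicked` column:
-
-```sql
-CREATE MATERIALIZED VIEW document_ctr AS
-SELECT document_id, count(*) FILTER (WHERE clicked)::real / count(*) AS ctr
-FROM search_results
-GROUP BY document_id;
-
--- a unique index allows refreshing without blocking reads
-CREATE UNIQUE INDEX ON document_ctr (document_id);
-
--- refresh periodically, e.g. from a scheduled job
-REFRESH MATERIALIZED VIEW CONCURRENTLY document_ctr;
-```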
-
-### Use more sophisticated ML Algorithms
-
-PostgresML offers more than 50 algorithms. Modern gradient boosted tree models like XGBoost, LightGBM and CatBoost provide state-of-the-art results for ranking problems like this, and they're also relatively fast and efficient. PostgresML makes it simple to pass an additional `algorithm` parameter to the `pgml.train` function to try a different algorithm. All the resulting models will be tracked in your project, and the best one automatically deployed. You can also pass a specific **model\_id** to `pgml.predict` instead of a **project\_name** to use a particular model, which makes it easy to compare the results of different algorithms statistically. You can also compare algorithms at the application level in A/B tests for business metrics, not just statistical measures like r².
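-
-For example, trying a gradient boosted tree on our project is just one more argument to `pgml.train`:
-
-```sql
-SELECT * FROM pgml.train(
-  project_name => 'Search Ranking',
-  task => 'regression',
-  relation_name => 'search_result_clicks',
-  y_column_name => 'clicked',
-  algorithm => 'xgboost'
-);
-```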
-
-### Train regularly
-
-You can also retrain the model whenever new data is available, which will naturally improve it over time as the dataset grows larger and covers more examples, including edge cases and outliers. It's important to note you should only need to retrain when there has been a "statistically meaningful" change in the total dataset, not on every single new search or result. Training once a day or once a week is probably sufficient to avoid "concept drift".
-
-An additional benefit of regular training is faster detection of any breakage in the data pipeline. Suppose the application team drops a column they didn't realize the model was trained on. It's much better to see that error show up within 24 hours, and lose one day of training data, than to discover it the next time a Data Scientist works on the model, only to realize the data has been missing for the last year, making it impossible to retrain the model or beat it with new versions until the entire project is revisited from the ground up. That sort of thing happens all the time in other, more complicated distributed systems, and it's a huge waste of time and money.
-
-### Vector Search w/ LLM embeddings
-
-PostgresML not only incorporates the latest vector search, including state-of-the-art HNSW recall provided by pgvector, but it can generate the embeddings _inside the database with no network overhead_ using the latest pre-trained LLMs downloaded from Huggingface. This is big enough to be its own topic, so we've outlined it in a series on how to generate LLM embeddings with Huggingface models.
-
-### Personalization & Recommendations
-
-There are a few ways to implement personalization for search results. PostgresML supports both collaborative and content-based filtering for personalization and recommendation systems. We've outlined one approach to personalizing embedding results with application data for further reading, but you can implement many different approaches using all the building blocks provided by PostgresML.
-
-### Multi-Modal Search
-
-You may want to offer search results over multiple document types. For example, a professional social networking site may return results from **People**, **Companies**, **JobPostings**, etc. You can have features specific to each document type, and PostgresML will handle the `NULL` inputs where documents don't have data for a specific feature. This will allow you to build one model that ranks all types of "documents" together to optimize a single global objective.
-
-### Tie it all together in a single query
-
-You can tier multiple models and ranking algorithms together in a single query. For example, you could recall candidates with both vector search and keyword search, join their global document level CTR & CVR and other stats, join more stats for how each document has converted on this exact query, join more personalized stats or vectors from the user history or current session, and input all those features into a tree based model to re-rank the results. Pulling all those features together from multiple feature stores in a microservice architecture and joining at the application layer would be prohibitively slow at scale, but with PostgresML you can do it all in a single query with indexed joins in a few milliseconds on the database, layering CTEs as necessary to keep the query maintainable.
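-
-A skeletal sketch of that layering, with a hypothetical `document_stats` table of global CTR/CVR per document, and assuming the 'Search Ranking' model was retrained on these three features:
-
-```sql
-WITH keyword_candidates AS (
-  SELECT id, ts_rank(title_and_body_text, to_tsquery('english', 'apples')) AS keyword_rank
-  FROM documents
-  WHERE title_and_body_text @@ to_tsquery('english', 'apples')
-  ORDER BY keyword_rank DESC
-  LIMIT 100
-),
-with_stats AS (
-  -- join in precomputed global stats (hypothetical table)
-  SELECT c.*, s.ctr, s.cvr
-  FROM keyword_candidates c
-  JOIN document_stats s ON s.document_id = c.id
-)
-SELECT pgml.predict('Search Ranking', array[keyword_rank, ctr, cvr]) AS ml_rank, *
-FROM with_stats
-ORDER BY ml_rank DESC
-LIMIT 10;
-```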
-
-### Make it fast
-
-When you have a dozen joins across many tables in a single query, it's important to make sure the query is fast. We typically target sub-100ms end-to-end search latency on large production-scale datasets, including LLM embedding generation, vector search, and personalized reranking. You can use standard SQL `EXPLAIN ANALYZE` to see which parts of the query cost the most time or memory. Postgres offers many index types (BTREE, GIST, GIN, IVFFLAT, HNSW) which can efficiently deal with billion-row datasets of numeric, text, keyword, JSON, vector or even geospatial data.
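-
-For example, prefixing any of our earlier queries with `EXPLAIN ANALYZE` prints the executed plan with per-node timings, including whether an index was used:
-
-```sql
-EXPLAIN ANALYZE
-SELECT *
-FROM documents
-WHERE title_and_body_text @@ to_tsquery('english', 'second | title');
-```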
-
-### Make it scale
-
-Modern machines are available in most clouds with hundreds of cores, which will scale to tens of thousands of queries per second. More advanced techniques like partitioning and sharding can be used to scale beyond billion row datasets or to millions of queries per second. Postgres has tried and true replication patterns that we expose with a simple slider to scale out to as many machines as necessary in our cloud hosted platform, but since PostgresML is open source, you can run it however you're comfortable scaling your Postgres workloads in house as well.
-
-## Conclusion
-
-You can use PostgresML to build a state-of-the-art search engine with cutting-edge capabilities on top of your application and domain data. It's easy to get started with our fully hosted platform, which provides additional features like horizontal scalability and GPU acceleration for the most intensive workloads at scale. The efficiency inherent to our shared-memory implementation, without network calls, means PostgresML is also more reliable and cheaper to operate than alternatives, and the integrated machine learning algorithms mean you can fully leverage all of your application data. PostgresML is also open source, and we welcome contributions from the community, especially when it comes to the rapidly evolving ML landscape and the latest improvements in foundation model capabilities.
diff --git a/pgml-docs/use-cases/llm-based-pipelines-with-postgresml-and-dbt-data-build-tool.md b/pgml-docs/use-cases/llm-based-pipelines-with-postgresml-and-dbt-data-build-tool.md
deleted file mode 100644
index d67fb8b70..000000000
--- a/pgml-docs/use-cases/llm-based-pipelines-with-postgresml-and-dbt-data-build-tool.md
+++ /dev/null
@@ -1,192 +0,0 @@
-# LLM based pipelines with PostgresML and dbt (data build tool)
-
-In the realm of data analytics and machine learning, text processing and large language models (LLMs) have become pivotal in deriving insights from textual data. Efficient data pipelines play a crucial role in enabling streamlined workflows for processing and analyzing text. This blog explores the synergy between PostgresML and dbt, showcasing how they empower organizations to build efficient data pipelines that leverage large language models for text processing, unlocking valuable insights and driving data-driven decision-making.
-
-
-
-## PostgresML
-
-PostgresML, an open-source machine learning extension for PostgreSQL, is designed to handle text processing tasks using large language models. Its motivation lies in harnessing the power of LLMs within the familiar PostgreSQL ecosystem. By integrating LLMs directly into the database, PostgresML eliminates the need for data movement and offers scalable and secure text processing capabilities. This native integration enhances data governance, security, and ensures the integrity of text data throughout the pipeline.
-
-## dbt (data build tool)
-
-dbt is an open-source command-line tool that streamlines the process of building, testing, and maintaining data infrastructure. Specifically designed for data analysts and engineers, dbt offers a consistent and standardized approach to data transformation and analysis. By providing an intuitive and efficient workflow, dbt simplifies working with data, empowering organizations to seamlessly transform and analyze their data.
-
-## PostgresML and dbt
-
-The integration of PostgresML and dbt is a natural fit for data engineers who want to quickly incorporate text processing into their workflows. PostgresML's machine learning capabilities combine with dbt's data transformation framework so that text processing tasks slot into existing pipelines, accelerating the adoption of sophisticated NLP techniques and large language models. By bridging the gap between machine learning and data engineering, PostgresML and dbt let data engineers unlock the full potential of text processing with ease and efficiency.
-
-* Streamlined Text Processing: PostgresML seamlessly integrates large language models into the data pipeline, enabling efficient and scalable text processing. It leverages the power of the familiar PostgreSQL environment, ensuring data integrity and simplifying the overall workflow.
-* Simplified Data Transformation: dbt simplifies the complexities of data transformation by automating repetitive tasks and providing a modular approach. It seamlessly integrates with PostgresML, enabling easy incorporation of large language models for feature engineering, model training, and text analysis.
-* Scalable and Secure Pipelines: PostgresML's integration with PostgreSQL ensures scalability and security, allowing organizations to process and analyze large volumes of text data with confidence. Data governance, access controls, and compliance frameworks are seamlessly extended to the text processing pipeline.
-
-## Tutorial
-
-By following this [tutorial](https://github.com/postgresml/postgresml/tree/master/pgml-extension/examples/dbt/embeddings), you will gain hands-on experience in setting up a dbt project, defining models, and executing an LLM-based text processing pipeline. We will guide you through the process of incorporating LLM-based text processing into your data workflows using PostgresML and dbt. Here's a high-level summary of the tutorial:
-
-### Prerequisites
-
-* [PostgresML DB](https://github.com/postgresml/postgresml#installation)
-* Python >=3.7.2,<4.0
-* [Poetry](https://python-poetry.org/)
-* Install `dbt` using the following commands
- * `poetry shell`
- * `poetry install`
-* Documents in a table
-
-### dbt Project Setup
-
-Once you have the prerequisites satisfied, update the `dbt` project configuration files.
-
-### Project name
-
-You can find the name of the `dbt` project in `dbt_project.yml`.
-
-```yaml
-# Name your project! Project names should contain only lowercase characters
-# and underscores. A good package name should reflect your organization's
-# name or the intended use of these models
-name: 'pgml_flow'
-version: '1.0.0'
-```
-
-### Dev and prod DBs
-
-Update the `profiles.yml` file with development and production database properties. If you are using a Docker-based local PostgresML installation, `profiles.yml` will be as follows:
-
-```yaml
-pgml_flow:
- outputs:
-
- dev:
- type: postgres
- threads: 1
- host: 127.0.0.1
- port: 5433
- user: postgres
- pass: ""
- dbname: pgml_development
- schema:
-
- prod:
- type: postgres
- threads: [1 or more]
- host: [host]
- port: [port]
- user: [prod_username]
- pass: [prod_password]
- dbname: [dbname]
- schema: [prod_schema]
-
- target: dev
-```
-
-Run `dbt debug` at the command line where the project's Python environment is activated to make sure the DB credentials are correct.
-
-### Source
-
-Update `models/schema.yml` with the schema and table where documents are ingested.
-
-```yaml
- sources:
- - name:
- tables:
- - name:
-```
-
-### Variables
-
-The provided YAML configuration includes various parameters that define the setup for a specific task involving embeddings and models.
-
-```yaml
-vars:
- splitter_name: "recursive_character"
- splitter_parameters: {"chunk_size": 100, "chunk_overlap": 20}
- task: "embedding"
- model_name: "intfloat/e5-base"
- query_string: 'Lorem ipsum 3'
- limit: 2
-```
-
-Here's a summary of the key parameters:
-
-* `splitter_name`: Specifies the name of the splitter, set as "recursive\_character".
-* `splitter_parameters`: Defines the parameters for the splitter, such as a chunk size of 100 and a chunk overlap of 20.
-* `task`: Indicates the task being performed, specified as "embedding".
-* `model_name`: Specifies the name of the model to be used, set as "intfloat/e5-base".
-* `query_string`: Provides a query string, set as 'Lorem ipsum 3'.
-* `limit`: Specifies a limit of 2, indicating the maximum number of results to be processed.
-
-These configuration parameters offer a specific setup for the task, allowing for customization and flexibility in performing embeddings with the chosen splitter, model, table, query, and result limit.
-
-## Models
-
-dbt models form the backbone of data transformation and analysis pipelines. These models allow you to define the structure and logic for processing your data, enabling you to extract insights and generate valuable outputs.
-
-### Splitters
-
-The Splitters model serves as a central repository for storing information about text splitters and their associated hyperparameters, such as chunk size and chunk overlap. This model allows you to keep track of the different splitters used in your data pipeline and their specific configuration settings.
-
-### Chunks
-
-Chunks build upon splitters and process documents, generating individual chunks. Each chunk represents a smaller segment of the original document, facilitating more granular analysis and transformations. Chunks capture essential information like IDs, content, indices, and creation timestamps.
-
-### Models
-
-Models serve as a repository for storing information about different embeddings models and their associated hyperparameters. This model allows you to keep track of the various embedding techniques used in your data pipeline and their specific configuration settings.
-
-### Embeddings
-
-Embeddings focus on generating feature embeddings from chunks, using an embedding model from the models table. These embeddings capture the semantic representation of textual data, facilitating more effective machine learning models.
-
-### Transforms
-
-The Transforms model maintains a mapping between the splitter ID, model ID, and the corresponding embeddings table for each combination. It serves as a bridge connecting the different components of your data pipeline.
-
-## Pipeline execution
-
-In order to run the pipeline, execute the following command:
-
-```bash
-dbt run
-```
-
-You should see an output similar to the one below:
-
-```bash
-22:29:58 Running with dbt=1.5.2
-22:29:58 Registered adapter: postgres=1.5.2
-22:29:58 Unable to do partial parsing because a project config has changed
-22:29:59 Found 7 models, 10 tests, 0 snapshots, 0 analyses, 307 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics, 0 groups
-22:29:59
-22:29:59 Concurrency: 1 threads (target='dev')
-22:29:59
-22:29:59 1 of 7 START sql view model test_collection_1.characters ....................... [RUN]
-22:29:59 1 of 7 OK created sql view model test_collection_1.characters .................. [CREATE VIEW in 0.11s]
-22:29:59 2 of 7 START sql incremental model test_collection_1.models .................... [RUN]
-22:29:59 2 of 7 OK created sql incremental model test_collection_1.models ............... [INSERT 0 1 in 0.15s]
-22:29:59 3 of 7 START sql incremental model test_collection_1.splitters ................. [RUN]
-22:30:00 3 of 7 OK created sql incremental model test_collection_1.splitters ............ [INSERT 0 1 in 0.07s]
-22:30:00 4 of 7 START sql incremental model test_collection_1.chunks .................... [RUN]
-22:30:00 4 of 7 OK created sql incremental model test_collection_1.chunks ............... [INSERT 0 0 in 0.08s]
-22:30:00 5 of 7 START sql incremental model test_collection_1.embedding_36b7e ........... [RUN]
-22:30:00 5 of 7 OK created sql incremental model test_collection_1.embedding_36b7e ...... [INSERT 0 0 in 0.08s]
-22:30:00 6 of 7 START sql incremental model test_collection_1.transforms ................ [RUN]
-22:30:00 6 of 7 OK created sql incremental model test_collection_1.transforms ........... [INSERT 0 1 in 0.07s]
-22:30:00 7 of 7 START sql table model test_collection_1.vector_search ................... [RUN]
-22:30:05 7 of 7 OK created sql table model test_collection_1.vector_search .............. [SELECT 2 in 4.81s]
-22:30:05
-22:30:05 Finished running 1 view model, 5 incremental models, 1 table model in 0 hours 0 minutes and 5.59 seconds (5.59s).
-22:30:05
-22:30:05 Completed successfully
-22:30:05
-22:30:05 Done. PASS=7 WARN=0 ERROR=0 SKIP=0 TOTAL=7
-```
-
-As part of the pipeline execution, some models in the workflow utilize incremental materialization. Incremental materialization is a powerful feature provided by dbt that optimizes the execution of models by only processing and updating the changed or new data since the last run. This approach reduces the processing time and enhances the efficiency of the pipeline.
-
-By configuring certain models with incremental materialization, dbt intelligently determines the changes in the source data and applies only the necessary updates to the target tables. This allows for faster iteration cycles, particularly when working with large datasets, as dbt can efficiently handle incremental updates instead of reprocessing the entire dataset.
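-
-In a dbt model file, incremental materialization is declared with a config block; here's a generic sketch (the model and column names are illustrative, not from this tutorial):
-
-```sql
--- models/chunks.sql (illustrative)
-{{ config(materialized='incremental', unique_key='id') }}
-
-SELECT id, chunk, created_at
-FROM {{ ref('documents') }}
-{% if is_incremental() %}
-  -- only process rows added since the last run
-  WHERE created_at > (SELECT max(created_at) FROM {{ this }})
-{% endif %}
-```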
-
-## Conclusions
-
-With PostgresML and dbt, organizations can leverage the full potential of LLMs, transforming raw textual data into valuable knowledge, and staying at the forefront of data-driven innovation. By seamlessly integrating LLM-based transformations, data engineers can unlock deeper insights, perform advanced analytics, and drive informed decision-making. Data governance, access controls, and compliance frameworks seamlessly extend to the text processing pipeline, ensuring data integrity and security throughout the LLM-based workflow.
diff --git a/pgml-docs/use-cases/personalize-embedding-results-with-application-data-in-your-database.md b/pgml-docs/use-cases/personalize-embedding-results-with-application-data-in-your-database.md
deleted file mode 100644
index 0e70c569d..000000000
--- a/pgml-docs/use-cases/personalize-embedding-results-with-application-data-in-your-database.md
+++ /dev/null
@@ -1,300 +0,0 @@
-# Personalize embedding results with application data in your database
-
-PostgresML makes it easy to generate embeddings using open source models from Huggingface and perform complex queries with vector indexes and application data unlike any other database. The full expressive power of SQL as a query language is available to seamlessly combine semantic, geospatial, and full text search, along with filtering, boosting, aggregation, and ML reranking in low latency use cases. You can do all of this faster, more simply, and with higher quality compared to applications built on disjoint APIs like OpenAI + Pinecone. Prove the results in this series to your own satisfaction, for free, by signing up for a GPU accelerated database.
-
-## Introduction
-
-This article is the third in a multipart series that will show you how to build a post-modern semantic search and recommendation engine, including personalization, using open source models. You may want to start with the previous articles in the series if you aren't familiar with PostgresML's capabilities.
-
-1. Generating LLM Embeddings with HuggingFace models
-2. Tuning vector recall with pgvector
-3. Personalizing embedding results with application data
-4. Optimizing semantic results with an XGBoost ranking model - coming soon!
-
-
-
-_Embeddings can be combined into personalized perspectives when stored as vectors in the database._
-
-## Personalization
-
-In the era of big data and advanced machine learning algorithms, personalization has become a critical component in many modern technologies. One application of personalization is in search and recommendation systems, where the goal is to provide users with relevant and personalized experiences. Embedding vectors have become a popular tool for achieving this goal, as they can represent items and users in a compact and meaningful way. However, standard embedding vectors have limitations, as they do not take into account the unique preferences and behaviors of individual users. To address this, a promising approach is to use aggregates of user data to personalize embedding vectors. This article will explore the concept of using aggregates to create new embedding vectors and provide a step-by-step guide to implementation.
-
-We'll continue working with the same dataset from the previous articles: 5M+ customer reviews of movies from Amazon, spanning a decade. We've already generated embeddings for each review, and aggregated them to build a consensus view of the reviews for each movie. You'll recall that our reviews also include a customer\_id as well.
-
-!!! generic
-
-!!! code\_block
-
-```postgresql
-\d pgml.amazon_us_reviews
-```
-
-!!!
-
-!!! results
-
-| Column | Type | Collation | Nullable | Default |
-| ------------------ | ------- | --------- | -------- | ------- |
-| marketplace | text | | | |
-| customer\_id | text | | | |
-| review\_id | text | | | |
-| product\_id | text | | | |
-| product\_parent | text | | | |
-| product\_title | text | | | |
-| product\_category | text | | | |
-| star\_rating | integer | | | |
-| helpful\_votes | integer | | | |
-| total\_votes | integer | | | |
-| vine | bigint | | | |
-| verified\_purchase | bigint | | | |
-| review\_headline | text | | | |
-| review\_body | text | | | |
-| review\_date | text | | | |
-
-!!!
-
-!!!
-
-## Creating embeddings for customers
-
-In the previous article, we saw that we could aggregate all the review embeddings to create a consensus view of each movie. Now we can take that a step further, and aggregate all the movie embeddings that each customer has reviewed, to create an embedding for every customer in terms of the movies they've reviewed. We're not going to worry yet about whether they liked the movie based on their star rating. Simply the fact that they chose to review a movie indicates they chose to purchase the DVD, and reveals something about their preferences. It's always easy to create more tables and indexes related to other tables in our database.
-
-!!! generic
-
-!!! code\_block time="458838.918 ms (07:38.839)"
-
-```postgresql
-CREATE TABLE customers AS
-SELECT
- customer_id AS id,
- count(*) AS total_reviews,
- avg(star_rating) AS star_rating_avg,
- pgml.sum(movies.review_embedding_e5_large)::vector(1024) AS movie_embedding_e5_large
-FROM pgml.amazon_us_reviews
-JOIN movies
- ON movies.id = amazon_us_reviews.product_id
-GROUP BY customer_id;
-```
-
-!!!
-
-!!! results
-
-SELECT 2075970
-
-!!!
-
-!!!
-
-We've just created a table aggregating our 5M+ reviews into 2M+ customers, with mostly vanilla SQL. The query includes a JOIN between the `pgml.amazon_us_reviews` table we started with and the `movies` table we created to hold the movie embeddings. We're using `pgml.sum()` again, this time to sum up all the movies a customer has reviewed and create an embedding for the customer. We'll want to be able to quickly recall a customer's embedding by their ID whenever they visit the site, so we'll create a standard Postgres index on their ID. This isn't just a vector database, it's a full AI application database.
-
-!!! generic
-
-!!! code\_block time="2709.506 ms (00:02.710)"
-
-```postgresql
-CREATE INDEX customers_id_idx ON customers (id);
-```
-
-!!!
-
-!!! results
-
-```
-CREATE INDEX
-```
-
-!!!
-
-!!!
-
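-With that index in place, recalling a single customer's embedding is a cheap point lookup. A minimal sketch of the kind of query an application would run, using a made-up customer ID:
-
-```postgresql
--- fetch one customer's embedding; the btree index on id makes this a fast point lookup
--- ('12345678' is a hypothetical ID for illustration)
-SELECT movie_embedding_e5_large
-FROM customers
-WHERE id = '12345678';
-```
-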
-## Finding a customer to personalize results for
-
-Now that we have customer embeddings around movies they've reviewed, we can incorporate those to personalize the results whenever they search. Normally, we'd have the `customers.id` handy in our application because they'd be searching and browsing our app, but we don't have an actual application or customers for this article, so we'll have to find one for our example. Let's find a customer that loves the movie "Empire Strikes Back". No "Star Wars" made our original list of "Best 1980's scifi movie", so we have a good opportunity to improve our previous results with personalization.
-
-We can find a customer that our embeddings model feels is close to the sentiment "I love all Star Wars, but Empire Strikes Back is particularly amazing". Keep in mind, we didn't want to take the time to build a vector index for queries against the customers table, so this is going to be slower than it could be, but that's fine because it's just a one-off exploration, not some frequently executed query in our application. We can still do vector searches, just without the speed boost an index provides.
-
-!!! generic
-
-!!! code\_block time="9098.883 ms (00:09.099)"
-
-```postgresql
-WITH request AS (
- SELECT pgml.embed(
- 'intfloat/e5-large',
- 'query: I love all Star Wars, but Empire Strikes Back is particularly amazing'
- )::vector(1024) AS embedding
-)
-
-SELECT
- id,
- total_reviews,
- star_rating_avg,
- 1 - (
- movie_embedding_e5_large <=> (SELECT embedding FROM request)
- ) AS cosine_similarity
-FROM customers
-ORDER BY cosine_similarity DESC
-LIMIT 1;
-```
-
-!!!
-
-!!! results
-
-| id | total\_reviews | star\_rating\_avg | cosine\_similarity |
-| -------- | -------------- | ------------------ | ------------------ |
-| 44366773 | 1 | 2.0000000000000000 | 0.8831349398621555 |
-
-!!!
-
-!!!
-
-!!! note
-
-Searching without indexes is slower (9s), but creating a vector index can take a very long time (remember indexing all the reviews took more than an hour). For frequently executed application queries, we always want to make sure we have at least one index available to improve speed.
-
-!!!
-
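-If customer lookups by sentiment ever became a hot path in the application, we'd pay that indexing cost once up front. Here's a sketch of what that could look like with pgvector's IVFFlat index; the index name and the `lists` value are assumptions for illustration, not tuned recommendations:
-
-```postgresql
--- one-time cost: build an approximate nearest neighbor index over customer
--- embeddings, so sentiment lookups against customers no longer need a full scan
-CREATE INDEX CONCURRENTLY customers_embedding_idx
-ON customers
-USING ivfflat (movie_embedding_e5_large vector_cosine_ops)
-WITH (lists = 1000);
-```
-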
-It turns out we have a customer with a very similar embedding to our desired personalization. Semantic search is wonderfully powerful. Once you've generated embeddings, you can find all the things that are similar to other things, even if they don't share any of the same words. Whether or not this customer has ever actually seen Star Wars, the model thinks their embedding is pretty close to a review like that... They seem a little picky though, with a 2-star rating average. I'm curious what the one review they've actually written looks like:
-
-!!! generic
-
-!!! code\_block time="25156.945 ms (00:25.157)"
-
-```postgresql
-SELECT product_title, star_rating, review_body
-FROM pgml.amazon_us_reviews
-WHERE customer_id = '44366773';
-```
-
-!!!
-
-!!! results
-
-| product\_title | star\_rating | review\_body |
-| ------------------------------------------------------------------ | ------------ | ----------------------------------------------------------------------------- |
-| Star Wars, Episode V: The Empire Strikes Back (Widescreen Edition) | 2 | The item was listed as new. The box was opened and had damage to the outside. |
-
-!!!
-
-!!!
-
-This is odd at first glance. The review doesn't mention anything about Star Wars, and the sentiment is actually negative; even the `star_rating` is bad. How did they end up with an embedding so close to our desired sentiment of "I love all Star Wars, but Empire Strikes Back is particularly amazing"? Remember, we didn't generate embeddings from their review text directly. We generated customer embeddings from the movies they had bothered to review. This customer has only ever reviewed one movie, and that happens to be the movie closest to our sentiment. Exactly what we were going for!
-
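-We can sanity check that reasoning directly. Since this customer reviewed exactly one movie, their `pgml.sum()` aggregate is just that movie's embedding, so the cosine similarity between the two should come out to 1 (up to floating point error). A minimal sketch:
-
-```postgresql
--- a customer with a single review should match "their" movie's embedding exactly
-SELECT
-    movies.title,
-    1 - (movies.review_embedding_e5_large <=> customers.movie_embedding_e5_large) AS similarity
-FROM customers
-JOIN pgml.amazon_us_reviews ON amazon_us_reviews.customer_id = customers.id
-JOIN movies ON movies.id = amazon_us_reviews.product_id
-WHERE customers.id = '44366773';
-```
-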
-If someone only ever bothered to write 1 review, and they are upset about the physical DVD, it's likely they are a big fan of the movie, and they are upset about the physical DVD because they wanted to keep it for a long time. This is a great example of how stacking and relating embeddings carefully can generate insights at a scale that is otherwise impossible, revealing the signal in the noise.
-
-## Personalizing search results
-
-Now we can write our personalized SQL query. It's nearly the same as our query from the previous article, but we're going to include an additional CTE to fetch the customer's embedding by ID, and then tweak our `final_score`. Instead of the generic popularity boost we've been using, we'll calculate the cosine similarity of the customer embedding to all the movies in the results, and use that as a boost. This will push movies that are similar to the customer's embedding to the top of the results. Here come the personalized query results, using customer 44366773's embedding:
-
-!!! generic
-
-!!! code\_block time="127.639 ms (00:00.128)"
-
-```postgresql
--- create a request embedding on the fly
-WITH request AS (
- SELECT pgml.embed(
- 'intfloat/e5-large',
- 'query: Best 1980''s scifi movie'
- )::vector(1024) AS embedding
-),
-
--- retrieve the customer's embedding by id
-customer AS (
- SELECT movie_embedding_e5_large AS embedding
- FROM customers
- WHERE id = '44366773'
-),
-
--- vector similarity search for movies and calculate a customer_cosine_similarity at the same time
-first_pass AS (
- SELECT
- title,
- total_reviews,
- star_rating_avg,
- 1 - (
- review_embedding_e5_large <=> (SELECT embedding FROM request)
- ) AS request_cosine_similarity,
- (1 - (
- review_embedding_e5_large <=> (SELECT embedding FROM customer)
- ) - 0.9) * 10 AS customer_cosine_similarity,
- star_rating_avg / 5 AS star_rating_score
- FROM movies
- WHERE total_reviews > 10
- ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
- LIMIT 1000
-)
-
--- grab the top 10 results, re-ranked using a combination of request similarity and customer similarity
-SELECT
- title,
- total_reviews,
- round(star_rating_avg, 2) as star_rating_avg,
- star_rating_score,
- request_cosine_similarity,
- customer_cosine_similarity,
- request_cosine_similarity + customer_cosine_similarity + star_rating_score AS final_score
-FROM first_pass
-ORDER BY final_score DESC
-LIMIT 10;
-```
-
-!!!
-
-!!! results
-
-| title | total\_reviews | star\_rating\_avg | star\_rating\_score | request\_cosine\_similarity | customer\_cosine\_similarity | final\_score |
-| -------------------------------------------------------------------- | -------------- | ----------------- | ---------------------- | --------------------------- | ---------------------------- | ------------------ |
-| Star Wars, Episode V: The Empire Strikes Back (Widescreen Edition) | 78 | 4.44 | 0.88717948717948718000 | 0.8295302273865711 | 0.9999999999999998 | 2.716709714566058 |
-| Star Wars, Episode IV: A New Hope (Widescreen Edition) | 80 | 4.36 | 0.87250000000000000000 | 0.8339361274771777 | 0.9336656923446551 | 2.640101819821833 |
-| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 255 | 4.82 | 0.96392156862745098000 | 0.8577616472530644 | 0.6676592605840725 | 2.489342476464588 |
-| The Day the Earth Stood Still | 589 | 4.76 | 0.95212224108658744000 | 0.8555529952535671 | 0.6733939449212423 | 2.4810691812613967 |
-| Forbidden Planet \[Blu-ray] | 223 | 4.79 | 0.95874439461883408000 | 0.8479982398847651 | 0.6536320269646467 | 2.4603746614682462 |
-| John Carter (Four-Disc Combo: Blu-ray 3D/Blu-ray/DVD + Digital Copy) | 559 | 4.65 | 0.93059033989266548000 | 0.8338600628541288 | 0.6700415876545052 | 2.4344919904012996 |
-| The Terminator | 430 | 4.59 | 0.91813953488372094000 | 0.8428833221752442 | 0.6638043064287047 | 2.4248271634876697 |
-| The Day the Earth Stood Still (Two-Disc Special Edition) | 37 | 4.57 | 0.91351351351351352000 | 0.8419118958433142 | 0.6636373066510914 | 2.419062716007919 |
-| The Thing from Another World | 501 | 4.71 | 0.94291417165668662000 | 0.8511107698234265 | 0.6231913893834695 | 2.4172163308635826 |
-| The War of the Worlds (Special Collector's Edition) | 171 | 4.67 | 0.93333333333333334000 | 0.8460163011246516 | 0.6371641286728591 | 2.416513763130844 |
-
-!!!
-
-!!!
-
-Bingo. Now we're boosting movies by their similarity to the customer's embedding, rescaled as `(cosine_similarity - 0.9) * 10`, and we've kept our previous boost for movies with a high average star rating. For example, Episode V's raw customer similarity of ~1.0 rescales to (1.0 - 0.9) * 10 = 1.0, a full extra point of `final_score`, while Forbidden Planet's ~0.97 rescales to only ~0.67. Not only does Episode V top the list as expected, Episode IV is a close second. This query has gotten fairly complex! But the results are perfect for me, I mean our hypothetical customer, who is searching for "Best 1980's scifi movie" but has already revealed to us with their one movie review that they think like the comment "I love all Star Wars, but Empire Strikes Back is particularly amazing". I promise I'm not just doing all of this to find a new movie to watch tonight.
-
-You can compare this to our non-personalized results from the previous article for reference. Forbidden Planet used to be the top result, but now it's #3.
-
-!!! code\_block time="124.119 ms"
-
-!!! results
-
-| title | total\_reviews | star\_rating\_avg | final\_score | star\_rating\_score | cosine\_similarity |
-| ---------------------------------------------------- | -------------: | ----------------: | -----------------: | ---------------------: | -----------------: |
-| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 255 | 4.82 | 1.8216832158805154 | 0.96392156862745098000 | 0.8577616472530644 |
-| Back to the Future | 31 | 4.94 | 1.82090702765472 | 0.98709677419354838000 | 0.8338102534611714 |
-| Warning Sign | 17 | 4.82 | 1.8136734057737756 | 0.96470588235294118000 | 0.8489675234208343 |
-| Plan 9 From Outer Space/Robot Monster | 13 | 4.92 | 1.8126103400815046 | 0.98461538461538462000 | 0.8279949554661198 |
-| Blade Runner: The Final Cut (BD) \[Blu-ray] | 11 | 4.82 | 1.8120690455673043 | 0.96363636363636364000 | 0.8484326819309408 |
-| The Day the Earth Stood Still | 589 | 4.76 | 1.8076752363401547 | 0.95212224108658744000 | 0.8555529952535671 |
-| Forbidden Planet \[Blu-ray] | 223 | 4.79 | 1.8067426345035993 | 0.95874439461883408000 | 0.8479982398847651 |
-| Aliens (Special Edition) | 25 | 4.76 | 1.803194119705901 | 0.95200000000000000000 | 0.851194119705901 |
-| Night of the Comet | 22 | 4.82 | 1.802469182369724 | 0.96363636363636364000 | 0.8388328187333605 |
-| Forbidden Planet | 19 | 4.68 | 1.795573710000297 | 0.93684210526315790000 | 0.8587316047371392 |
-
-!!!
-
-!!!
-
-Big improvement! We're doing a lot now to achieve filtering, boosting, and personalized re-ranking, but you'll notice that this extra work only takes a couple more milliseconds in PostgresML. Remember in the previous article when it took over 100ms just to retrieve 5 embedding vectors in no particular order? All this embedding magic is pretty much free when it's done inside the database. Imagine how slow a service would be if it had to load 1000 embedding vectors (not 5) like our similarity search is doing, pass those to some HTTP API where some ML black box lives, fetch a different customer embedding from a different database, and then try to combine all of that with the thousand results from the first query... This is why machine learning microservices break down at scale, and it's what makes PostgresML one step ahead of less mature vector databases.
-
-## What's next?
-
-We've got personalized results now, but `(... - 0.9) * 10` is a bit of a hack I used to scale the personalization score so it has a larger impact on the final score. Hacks and heuristics like this are frequently injected when a Product Manager tells an engineer to "just make it work", but oh no! Back To The Future is now nowhere to be found on my personalized list. We can do better! Those magic numbers are stand-ins for whatever business metric our Product Manager is actually trying to optimize. There's a way out of infinite customer complaints and one-off hacks like this, and it's called machine learning.
-
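-Before reaching for ML, it's worth seeing why the `0.9` offset was needed at all. Here's a quick sketch for eyeballing the spread of raw customer similarities over the same candidate set; the percentile choices are arbitrary assumptions for illustration:
-
-```postgresql
--- raw (un-rescaled) similarity of every candidate movie to our customer;
--- if the values cluster in a narrow band near 1.0, as the top-10 table above
--- suggests, a linear rescale like (similarity - 0.9) * 10 spreads them out
-WITH customer AS (
-    SELECT movie_embedding_e5_large AS embedding
-    FROM customers
-    WHERE id = '44366773'
-)
-SELECT percentile_cont(ARRAY[0.05, 0.5, 0.95]) WITHIN GROUP (
-    ORDER BY 1 - (review_embedding_e5_large <=> (SELECT embedding FROM customer))
-) AS similarity_percentiles
-FROM movies
-WHERE total_reviews > 10;
-```
-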
-Finding the optimal set of magic numbers that "just make it work" is, from one point of view, what modern machine learning is all about. In the next article, we'll look at building a real personalized ranking model with XGBoost on top of our personalized embeddings, one that predicts how our customer will rate a movie on our 5-star review scale. Then we can rank results based on a much more sophisticated model's predicted star rating instead of just cosine similarity and made up numbers. With all the savings we're accruing in terms of latency and infrastructure simplicity, our ability to layer additional models, refinements and techniques will put us another step ahead of the alternatives.
diff --git a/pgml-docs/use-cases/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md b/pgml-docs/use-cases/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md
deleted file mode 100644
index a8bc2af9a..000000000
--- a/pgml-docs/use-cases/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md
+++ /dev/null
@@ -1,505 +0,0 @@
-# Tuning vector recall while generating query embeddings in the database
-
-
-
-PostgresML makes it easy to generate embeddings using open source models and, unlike any other database, to perform complex queries with vector indexes. The full expressive power of SQL as a query language is available to seamlessly combine semantic, geospatial, and full text search, along with filtering, boosting, aggregation, and ML reranking in low latency use cases. You can do all of this faster, more simply, and with higher quality than applications built on disjoint APIs like OpenAI + Pinecone. Prove the results in this series to your own satisfaction, for free, by signing up for a GPU accelerated database.
-
-## Introduction
-
-This article is the second in a multipart series that will show you how to build a post-modern semantic search and recommendation engine, including personalization, using open source models.
-
-1. Generating LLM Embeddings with HuggingFace models
-2. Tuning vector recall with pgvector
-3. Personalizing embedding results with application data
-4. Optimizing semantic results with an XGBoost ranking model - coming soon!
-
-The previous article discussed how to generate embeddings that perform better than OpenAI's `text-embedding-ada-002` and save them in a table with a vector index. In this article, we'll show you how to query those embeddings effectively.
-
-
-
-_Embeddings show us the relationships between rows in the database, using natural language._
-
-Our example data is based on 5 million DVD reviews from Amazon customers submitted over a decade. For reference, that's more data than fits in a Pinecone Pod at the time of writing. Webscale: check. Let's start with a quick refresher on the data in our `pgml.amazon_us_reviews` table:
-
-!!! generic
-
-!!! code\_block time="107.207ms"
-
-```postgresql
-SELECT *
-FROM pgml.amazon_us_reviews
-LIMIT 5;
-```
-
-!!!
-
-!!! results
-
-| marketplace | customer\_id | review\_id | product\_id | product\_parent | product\_title | product\_category | star\_rating | helpful\_votes | total\_votes | vine | verified\_purchase | review\_headline | review\_body | review\_date | id | review\_embedding\_e5\_large |
-| ----------- | ------------ | -------------- | ----------- | --------------- | -------------- | ----------------- | ------------ | -------------- | ------------ | ---- | ------------------ | ---------------- | ------------ | ------------ | -- | ---------------------------- |
-| US | 16164990 | RZKBT035JA0UQ | B00X797LUS | 883589001 | Revenge: Season 4 | Video DVD | 5 | 1 | 2 | 0 | 1 | It's a hit with me | I don't usually watch soap operas, but Revenge grabbed me from the first episode. Now I have all four seasons and can watch them over again. If you like suspense and who done it's, then you will like Revenge. The ending was terrific, not to spoil it for those who haven't seen the show, but it's more fun to start with season one. | 2015-08-31 | 11 | \[-0.44635132, -1.4744929, 0.29134354, ...] |
-| US | 33386989 | R253N5W74SM7N3 | B00C6MXB42 | 734735137 | YOUNG INDIANA JONES CHRONICLES Volumes 1, 2 and 3 DVD Sets (Complete Collections All 3 Volumes DVD Sets Together) | Video DVD | 4 | 1 | 1 | 0 | 1 | great stuff. I thought excellent for the kids | great stuff. I thought excellent for the kids. The extras are a must after the movie. | 2015-08-31 | 12 | \[0.30739722, -1.2976353, 0.44150844, ...] |
-| US | 45486371 | R2D5IFTFPHD3RN | B000EZ9084 | 821764517 | Survival Island | Video DVD | 4 | 1 | 1 | 0 | 1 | Four Stars | very good | 2015-08-31 | 13 | \[-0.04560827, -1.0738801, 0.6053605, ...] |
-
-_The 1024-dimensional `review_embedding_e5_large` vectors are truncated here for readability._
066,-1.0927896,0.48220706,0.05559338,-0.20929311,-0.4278733,0.28444275,-0.0008470379,-0.09534583,-0.6519637,-1.4282455,0.18477388,0.9507184,-0.6751443,-0.18364592,-0.37007314,1.0216024,0.6869564,1.1653348,-0.7538794,-1.3345296,0.6104916,0.08152369,-0.8394207,0.87403923,0.5290044,-0.56332856,0.37691587,-0.45009997,-0.17864561,0.5992149,-0.25145024,1.0287454,1.4305328,-0.011586349,0.3485581,0.66344,0.18219411,4.940573,1.0454609,-0.23867694,-0.8316158,0.4034564,-0.49062842,0.016044907,-0.22793365,-0.38472247,0.2440083,0.41246706,1.1865108,1.2949868,0.4173234,0.5325333,0.5680148,-0.07169041,-1.005387,0.965118,-0.340425,-0.4471613,-0.40878603,-1.1905128,-1.1868874,1.2017782,0.53103817,0.3596472,-0.9262005,0.31224424,0.72889113,0.63557464,-0.07019187,-0.68807346,0.69582283,0.45101142,0.014984587,0.577816,-0.1980364,-1.0826674,0.69556504,0.88146895,-0.2119645,0.6493935,0.9528447,-0.44620317,-0.9011973,-0.50394785,-1.0315249,-0.4472283,0.7796344,-0.15637895,-0.16639937,-0.20352335,-0.68020046,-0.98728025,0.64242256,0.31667972,-0.71397847,-1.1293691,-0.9860645,0.39156264,-0.69573534,0.30602834,-0.1618791,0.23074874,-0.3379239,-0.12191323,1.6582693,0.2339738,-0.6107068,-0.26497284,0.17334077,-0.5923304,0.10445539,-0.7599427,0.5096536,-0.20216745,0.049196683,-1.1881349,-0.9009607,-0.83798426,0.44164553,-0.48808926,-0.04667333,-0.66054153,-0.66128224,-1.7136352,-0.7366011,-0.31853634,0.30232653,-0.10852443,1.9946622,0.13590258,-0.76326686,-0.25446486,0.32006142,-1.046221,0.30643058,0.52830505,1.7721215,0.71685624,0.35536727,0.02379851,0.7471644,-1.3178513,0.26788896,1.0505391,-0.8308426,-0.44220716,-0.2996315,0.2289448,-0.8129853,-0.32032526,-0.67732286,0.49977696,-0.58026063,-0.4267268,-1.165912,0.5383717,-0.2600939,0.4909254,-0.7529048,0.5186025,-0.68272185,0.37688586,-0.16525345,0.68933797,-0.43853116,0.2531767,-0.7273167,0.0042542545,0.2527112,-0.64449465,-0.07678814,-0.57123,-0.0017966144,-0.068321034,0.6406287,-0.81944615,-0.5292494,0.67187285,-0.45312735,-0.19861545,0.5808865,0.24339013,0.19081701,-0.3795915,-1.1802675,0.5864333,0.5542488,-0.026795216,-0.27652445,0.5329341,0.29494807,0.5427568,0.84580654,-0.39151683,-0.2985327,-1.0449492,0.69868237,0.39184457,0.9617548,0.8102169,0.07298472,-0.5491848,-1.012611,-0.76594234,-0.1864931,0.5790788,0.32611984,-0.7400497,0.23077846,-0.15595563,-0.06170243,-0.26768005,-0.7510913,-0.81110775,0.044999585,1.3336306,-1.774329,0.8607937,0.8938075,-0.9528547,0.43048507,-0.49937993,-0.61716783,-0.58577335,0.6208,-0.56602585,0.6925776,-0.50487256,0.80735886,0.36914152,0.6803319,0.000295409,-0.28081727,-0.65416694,0.9890088,0.5936174,-0.38552138,0.92602617,-0.46841428,-0.07666884,0.6774499,-1.1728637,0.23638526,0.35253218,0.5990712,0.47170952,1.1473405,-0.6329502,0.07515354,-0.6493073,-0.7312147,0.003280595,0.53415585,-0.84027874,0.21279827,0.73492074,-0.08271271,-0.6393985,0.21382183,-0.5933761,0.26885328,0.31527188,-0.17841923,0.8519613,-0.87693113,0.14174065,-0.3014772,0.21034332,0.7176752,0.045435462,0.43554127,0.7759069,-0.2540516,-0.21126957,-0.1182913,0.504212,0.07782592,-0.06410891,-0.016180445,0.16819397,0.7418499,-0.028192373,-0.21616131,-0.46842667,0.8750199,0.16664875,0.4422129,-0.24636972,0.011146031,0.5407099,-0.1995775,0.9732007,0.79718286,-0.3531048,-0.17953855,-0.30455542,-0.011377579,-0.21079576,1.3742573,-0.4004308,-0.30791727,-1.06878,0.53180254,0.3412094,-0.06790889,0.08864223,-0.6960799,-0.12536404,0.24884924,0.9308994,0.46485603,0.12150945,0.8934372,-1.6594642,0.27694207,-1.1839775,-0.54069275,0.2967536,0.94271827,-0.21412376,1.5007582,-0.
75979245,0.4711972,-0.005775435,-0.13180988,-0.9351274,0.5930414,0.23131478,-0.4255422,-1.1771399,-0.49364802,-0.32276222,-1.6043308,-0.27617428,0.76369554,-0.19217926,0.12788418,1.9225345,0.35335732,1.6825448,0.12466301,0.1598846,-0.43834555,-0.086372584,0.47859296,0.79709494,0.049911886,-0.52836734,-0.6721834,0.21632576,-0.36516222,1.6216894,0.8214337,0.6054308,-0.41862285,0.027636342,-0.1940268,-0.43570083,-0.14520688,0.4045223,-0.35977545,1.8254343,-0.31089872,0.19665615,-1.1023157,0.4019758,-0.4453815,-1.0864284,-0.1992614,0.11380532,0.16687272,-0.29629833,-0.728387,-0.5445154,0.23433375,-1.5238215,0.71899056,-0.8600819,1.0411007,-0.05895088,-0.8002717,-0.72914296,-0.59206986,-0.28384188,0.4074883,0.56018656,-1.068546,-1.021818,-0.050443307,1.116262,-1.3534596,0.6736171,-0.55024904,-0.31289905,0.36604482,0.004892461] |
-| US | 14006420 | R1CECK3H1URK1G | B000CEXFZG | 115883890 | Teen Titans - The Complete First Season (DC Comics Kids Collection) | Video DVD | 5 | 0 | 0 | 0 | 1 | Five Stars | Kids love the DVD. It came quickly also. | 2015-08-31 | 14 | \[-0.6312561,-1.7367789,1.2021036,-0.048960943,0.20266847,-0.53402656,0.22530322,0.58472973,0.7067528,-0.4026424,0.48143443,1.320443,1.390252,0.8614183,-0.27450773,-0.5175409,0.35882184,0.029378487,-0.7798119,-0.9161627,0.21374469,-0.5097005,0.08925354,-0.03162415,-0.777172,0.26952067,0.21780597,-0.25940415,-0.43257955,0.5047774,-0.62753534,-0.18389052,0.3908125,-0.8562782,1.197537,-0.072108865,-0.26840302,0.1337818,0.5329664,-0.02881749,0.18806009,0.15675639,-0.46279088,0.33493695,-0.5976519,0.17071217,-0.79716325,0.1967204,1.1276897,-0.20772636,0.93440086,0.34529057,0.19401568,-0.41807452,-0.86519367,0.47235286,0.33779994,1.5397296,-0.18204026,-0.016024688,0.24120326,-0.17716222,0.3138746,-0.20993066,-0.09079028,0.25766942,-0.07014277,-0.8694822,0.64777964,-0.057605933,-0.28278375,0.8075776,1.8393523,0.81496745,-0.004307902,-0.84534615,-0.03156269,0.010678162,1.8573742,0.20478101,-0.1694233,0.3143575,-0.598893,0.80677253,0.6163861,-0.46703136,2.229697,-0.53163594,-0.32738847,-0.024545679,0.729927,-0.3483534,1.2920879,0.25684443,0.34726465,0.2070297,0.47215447,1.5762097,0.5379836,-0.011129107,0.83513135,0.18692249,0.2752282,0.6455876,0.129197,-0.5211538,-1.3686453,-0.44263896,-1.0396893,0.32529148,-1.4775138,0.16855894,-0.22110634,0.5737801,1.1978029,-0.3934193,-0.2697715,0.62218326,1.4344715,0.82834864,0.766156,0.3510282,0.59684426,-0.1322549,-0.9330995,1.8485514,0.6753625,-0.33342996,-0.23867355,0.8621254,-0.4277517,-0.26068765,-0.67580503,0.13551037,0.44111,1.0628351,-1.1878395,-1.2636286,0.55473286,0.18764772,-0.06866432,-2.0283139,0.46497917,0.5886715,0.30433393,0.3501315,0.23519383,0.5980003,0.36994958,0.30603382,-0.8369203,-0.25988623,-0.93126506,-0.873884,-0.5146805,-1.8220243,-0.28068694,0.39212993,0.20002748,-0.47740325,-0.251296,-0.85625666,-1.1412939,-0.73454237,-0.7070889,-0.8038149,1.5993606,-0.42553523,0.29790545,0.75804514,-0.14183688,1.28933,0.60941213,0.89150697,0.10587394,0.74460125,0.61516047,1.3431324,0.8083828,-0.11270667,-0.5399225,-0.609704,-0.07033227,0.37664047,-0.17491077,1.3854522,-0.41539654,-0.4362298,1.1235062,-1.8496975,-2.0035222,-0.49260524,1.3446016,-0.031373296,-1.3091855,-0.19887531,-0.49534202,0.4523722,-0.16276014,-0.08273346,-0.5079003,-0.124883376,0.099591255,-0.8943932,-0.1293136,0.9836214,0.548599,-0.78369313,0.19080715,-0.088178605,-0.6870386,0.58293986,-0.39954463,-0.19963749,-0.37985775,-0.24642159,0.5121634,0.6653276,-0.4190921,1.0305376,-1.4589696,0.28977314,1.3795608,0.5321369,1.1054996,0.5312297,-0.028157832,0.4668366,1.0069275,-1.2730085,-0.11376997,-0.7962425,0.49372005,0.28656003,-0.30227122,0.24839808,1.923211,-0.37085673,0.3625795,0.16379173,-0.43515328,0.4553001,0.08762408,0.105411,-0.964348,0.66819906,-0.6617094,1.5985628,-0.23792887,0.32831386,0.38515973,-0.293926,0.5914876,-0.12198629,0.45570955,-0.703119,1.2077283,-0.82626694,-0.28149354,0.7069072,0.31349573,0.4899691,-0.4599767,-0.8091348,0.30254528,0.08147084,0.3877693,-0.79083973,1.3907013,-0.25077394,0.9531004,0.3682364,-0.8173011,-0.09942776,0.2869549,-0.045799185,0.5354464,0.6409063,-0.20659842,-0.9725278,-0.26192304,0.086217284,0.3165221,0.44227958,-0.7680571,0.5399834,0.6985113,-0.52230656,0.6970132,0.373832,-0.70743656,0.20157939,-0.6858654,-0.50790364,0.2795364,0.29279485,-0.012475173,0.076419905,-0.40851966,0.82844526,-0.48934165,-0.5245
244,-0.20289789,-0.8136387,-0.5363099,0.48981985,-0.76652956,-0.1211052,-0.056907576,0.4420836,0.066036455,0.41965017,-0.6063774,-0.8071671,-1.0445249,0.66432387,0.5274697,1.0376729,-0.7697964,-0.37606835,0.3890853,0.6605356,-0.14112039,-1.5217428,-0.15197764,-0.3213161,-1.1519533,0.60909057,0.9403774,-0.27944884,0.7312047,-0.3696203,0.74681044,1.2170473,-0.69628173,-1.6213799,-0.5346468,-0.6516008,-0.33496094,-0.43141463,1.2713503,-0.8897746,-0.087588705,-0.46260807,0.5793111,0.09900403,-0.17237963,0.62258226,0.21377154,-0.010726848,0.6530878,-0.2783685,0.00858428,-1.1332816,-0.6482847,0.7085231,0.36013532,-0.92266655,0.22018129,0.9001391,0.92635745,-0.008031485,-0.5917975,-0.568456,-0.06777777,0.8137389,-0.09866476,-0.22243339,0.64311814,-0.18830536,-0.39094377,0.19102454,-0.16511707,0.025081763,-1.8210138,-0.2697892,0.6846239,0.2854376,0.18948092,1.413507,-0.32061276,1.068837,-0.43719074,0.26041105,-1.3256634,-0.3310394,-0.727746,0.5768826,0.12309951,0.64337856,-0.35449612,0.5904533,-0.93767214,0.056747835,-0.96975976,-0.50144833,-0.68525606,0.08461835,-0.956482,0.39153412,-0.47589955,1.1512613,-0.15391372,0.22249506,0.34223804,-0.30088118,-0.12304757,-0.887302,-0.41605315,-0.4448053,0.11436053,0.36566892,0.051920563,-1.0589696,-0.21019076,-0.5414011,0.57006586,0.25899884,0.27656814,-1.2040092,-1.0228744,-0.9569173,-0.40212157,0.24625045,0.0363089,0.67136663,1.2104007,0.5976004,0.3837572,1.1889356,0.8584326,-0.19918711,-0.694845,-0.114167996,-0.108385384,-0.40644845,-0.8660314,0.7782318,0.1538889,-0.33543634,-1.2151926,0.15467443,0.68193775,-1.2943494,0.5995984,-0.954463,0.08679533,-0.70457053,-0.13386653,-0.49978074,0.75912595,0.6441198,-0.24760693,-1.6255957,-1.1165076,0.06757002,0.424513,0.8805125,-1.3958868,0.20875917,-1.9329861,-0.23697405,0.55918163,-0.23028342,0.7898856,-0.31575334,-0.10341185,-0.59226173,-0.6364673,-0.70446855,0.8730485,-0.3070955,-0.62998897,-0.25874397,-0.36943534,-0.006459128,0.19268708,0.25422436,0.7851406,0.5298526,-0.7919893,0.2925912,0.2669904,-1.3556485,-0.3184692,0.6531485,-0.43356547,-0.7023434,0.70575243,-0.64844227,-0.90868706,-0.37580702,-0.46109352,-0.06858048,-0.5020828,-1.0959914,0.19850428,-0.3697118,0.5327658,-0.24482745,-0.0050697043,-0.48321095,-0.8755402,0.33493343,0.0400091,-0.9211368,0.50489336,0.20374565,-0.49659476,-1.7711049,0.9425723,0.413107,-0.15736774,-0.3663932,-0.110296495,0.32382917,1.4628458,-0.9015841,1.0747851,0.20627196,-0.33258128,-0.68392354,0.45976254,0.7596731,-1.1001155,0.9608397,0.68715054,0.835493,1.0332432,-0.1770479,-0.47063908,-0.4371135,-1.5693063,-0.09170902,-0.14182071,0.9199287,0.089211576,-1.330432,0.74252445,-0.12902485,-1.1330069,0.37604442,-0.08594573,1.1911551,0.514451,-0.820967,-0.7663223,-0.8453414,-1.6072954,-0.006961733,0.10301163,-0.9520235,0.09837824,-0.11854994,-0.676488,0.31623104,0.9415478,0.5674442,0.5121303,0.46830702,0.5967715,1.1180271,1.109548,0.57702965,0.33545986,0.88252956,-0.23821445,0.1681848,0.13121948,-0.21055935,0.14183077,-0.12930463,-0.66376144,-0.34428838,-0.6456075,0.7975275,0.7979727,-0.07281647,-0.786334,-0.9695745,0.7647379,-1.2006234,0.2262308,-0.5081758,0.035541046,0.0056368224,-0.30493388,0.4218361,1.5293287,0.33595875,-0.4748238,1.1775192,-0.33924198,-0.6341838,1.534413,-0.19799161,1.0994059,-0.51108354,0.35798654,0.17381774,1.0035061,0.35685256,0.15786275,-0.10758176,0.039194133,0.6899009,-0.65326214,0.91365,-0.15350929,-0.1537966,-0.010726042,-0.13360718,-0.6982152,-0.52826196,-0.011109476,0.65476435,-0.9023214,0.64104265,0.5995644,1.4986526,0.57909846,0.30374798,0.39150548
,-0.3463178,0.34487796,0.052982118,-0.5143066,0.9766171,-0.74480146,1.2273649,-0.029264934,-0.21231978,0.5529358,-0.15056185,-0.021292707,-0.6332784,-0.9690395,-1.5970473,0.6537644,0.7459297,0.12835206,-0.13237919,-0.6256427,0.5145036,0.94801706,1.9347028,-0.69850945,-1.1467483,-0.14642377,0.58050627,-0.44958553,1.5241412,0.12447801,-0.5492241,0.61864674,-0.7053797,0.3704767,1.3781306,0.16836958,1.0158046,2.339806,0.25807586,-0.38426653,0.31904867,-0.18488075,4.3820143,0.3402816,0.075437106,-1.7444987,0.14969935,-1.032585,0.105298005,-0.48405352,-0.043107588,0.41331384,0.23115341,1.4535589,1.4320177,1.2625074,0.6917493,0.57606643,0.18086748,-0.56871295,0.50524384,-0.3616062,-0.030594595,0.031995427,-1.2015928,-1.0093418,0.8197662,-0.39160928,0.35074282,-1.0193396,0.536061,0.047622234,-0.24839634,0.6208857,0.59378546,1.1138327,1.1455421,0.28545633,-0.33827814,-0.10528313,-0.3800622,0.38597932,0.48995104,0.20974272,0.05999745,0.61636347,-1.0790776,0.40463042,-1.144643,-1.1443852,0.24288934,0.7188756,-0.43240666,-0.45432237,-0.026534924,-1.4719657,-0.6369496,1.2381822,-0.2820557,-0.40019664,-0.42836204,0.009404399,-0.21320148,-0.68762875,0.79391354,0.13644795,0.2921131,0.5521372,-0.39167717,0.43077433,-0.1978993,-0.5903825,-0.5364767,1.2527494,-0.6508138,1.006776,-0.80243343,0.8591213,-0.5838775,0.51986057,-2.0343292,-1.1657227,-0.19022554,0.4203408,-0.85203123,0.27117053,-0.7466831,-0.54998875,-0.78761035,-0.23125184,-0.4558538,0.27839115,-0.8282628,1.9886168,-0.081262186,-0.7112829,0.9389117,-0.4538624,-1.4541539,-0.40657237,-0.3986729,2.1551015,-0.15287222,-0.49151388,-0.0558472,-0.08496425,-0.42135897,0.9383027,0.52064234,0.15240821,-0.083340704,0.18793257,-0.27070358,-0.7748509,-0.44401792,-0.84802055,0.38330504,-0.16992734,-0.04359399,-0.5745709,0.737314,-0.68381006,1.973286,-0.48940006,0.31930843,-0.033326432,0.26788878,-0.12552531,0.48650578,-0.37769738,0.28189135,-0.61763984,-0.7224581,-0.5546388,-1.0413891,0.38789925,-0.3598852,-0.032914143,-0.26091114,0.7435369,-0.55370283,-0.28856206,0.99145585,-0.65208393,-1.2676566,0.4271154,-0.109385125,0.07578249,0.36406067,-0.24682517,0.75629663,0.7614913,-1.0769705,-0.97570497,1.9109854,-0.33307776,0.0739104,1.1380597,-0.3641174,0.22451513,-0.33712614,0.19201177,0.4894991,0.10351006,0.6902971,-1.0849994,-0.26750708,0.3598063,-0.5578461,0.50199044,0.7905739,0.6338177,-0.5717301,-0.54366827,-0.10897577,-0.33433878,-0.6747299,-0.6021895,-0.19320905,-0.5550029,0.72644496,-1.1670401,0.024564115,1.0110236,-1.599555,0.68184775,-0.7405006,-0.42144236,-1.0563204,0.89424497,-0.48237786,-0.07939503,0.5832966,0.011636782,0.26296118,0.97361255,-0.61712617,0.023346817,0.13983403,0.47923192,0.015965229,-0.70331126,0.43716618,-0.16208862,-0.3113084,0.34937248,-0.9447899,-0.67551583,0.6474735,0.54826015,0.32212958,0.32812944,-0.25576934,-0.7014241,0.47824702,0.1297568,0.14742444,0.2605472,-1.0799223,-0.4960915,1.1971446,0.5583594,0.0546587,0.9143655,-0.27093348,-0.08269074,0.29264918,0.07787958,0.6288142,-0.96116096,-0.20745337,-1.2486024,0.44887972,-0.73063356,0.080278285,0.24266525,0.75150806,-0.87237483,-0.30616572,-0.9860237,-0.009145497,-0.008834001,-0.4702344,-0.4934195,-0.13811351,1.2453324,0.25669295,-0.38921633,-0.73387384,0.80260897,0.4079765,0.11871702,-0.236781,0.38567695,0.24849908,0.07333609,0.96814114,1.071782,0.5340243,-0.58761954,0.6691571,0.059928205,1.1879109,1.6365756,0.5595157,0.27928302,-0.26380432,0.75958675,-0.19349675,-0.37584463,0.1626631,-0.11273714,0.081596196,0.64045995,0.76134443,0.7323921,-0.75440234,0.49163356,-0.36328706,0.349
9968,-0.7155915,-0.12234358,0.31324995,0.3552525,-0.07196079,0.5915569,-0.48357463,0.042654503,-0.6132918,-0.539919,-1.3009099,0.83370167,-0.035098318,0.2308337,-1.3226038,-1.5454197,-0.40349385,-2.0024583,-0.011536424,-0.05012955,-0.054146707,0.07704314,1.1840333,0.007676903,1.3632768,0.1696332,0.39087996,-0.5171457,-0.42958948,0.0700221,1.8722692,0.08307789,-0.10879701,-0.0138636725,-0.02509088,-0.08575117,1.2478887,0.5698622,0.86583894,0.22210665,-0.5863262,-0.6379792,-0.2500705,-0.7450812,0.50900066,-0.8095482,1.7303423,-0.5499353,0.26281437,-1.161274,0.4653201,-1.0534812,-0.12422981,-0.1350228,0.23891108,-0.40800253,0.30440316,-0.43603706,-0.7405148,0.2974373,-0.4674921,-0.0037770707,-0.51527864,1.2588171,0.75661725,-0.42883956,-0.13898624,-0.45078608,0.14367218,0.2798476,-0.73272926,-1.0425364,-1.1782882,0.18875533,2.1849613,-0.7969517,-0.083258845,-0.21416587,0.021902844,0.861686,0.20170754] |
-| US | 23411619 | R11MHQRE45204T | B00KXEM6XM | 651533797 | Fargo: Season 1 | Video DVD | 5 | 0 | 0 | 0 | 1 | A wonderful cover of the movie and so much more! | Great news Fargo Fans....there is another one in the works! We loved this series. Great characters....great story line and we loved the twists and turns. Cohen Bros. you are "done proud"! It was great to have the time to really explore the story and the characters. | 2015-08-31 | 15 | \[-0.19611593,-0.69027615,0.78467464,0.3645557,0.34207717,0.41759247,-0.23958844,0.11605658,0.92974365,-0.5541752,0.76759464,1.1066549,1.2487572,0.3000814,0.12316142,0.0537864,0.46125686,-0.7134164,-0.6902733,-0.030810203,-0.2626231,-0.17225128,0.29405335,0.4245395,-1.1013782,0.72367406,-0.32295582,-0.42930996,0.14767756,0.3164477,-0.2439065,-1.1365703,0.6799936,-0.21695563,1.9845483,0.29386163,-0.2292162,-0.5616508,-0.2090607,0.2147022,-0.36172745,-0.6168721,-0.7897761,1.1507696,-1.0567898,-0.5793794,-1.0577669,0.11405863,0.5670167,-0.67856425,0.41588035,-0.39696974,1.148421,-0.0018125019,-0.9563887,0.05888491,0.47841984,1.3950354,0.058197483,-0.7937125,-0.039544407,-0.02428613,0.37479407,0.40881336,-0.9731192,0.6479315,-0.5398291,-0.53990036,0.5293877,-0.60560757,-0.88233495,0.05452904,0.8653024,0.55807567,0.7858541,-0.9958526,0.33570826,-0.0056177955,0.9546163,1.0308326,-0.1942335,0.21661046,0.42235866,0.56544167,1.4272121,-0.74875134,2.0610666,0.09774256,-0.6197288,1.4207827,0.7629225,-0.053203158,1.6839175,-0.059772894,-0.978858,-0.23643266,-0.22536495,0.9444282,0.509495,-0.47264612,0.21497262,-0.60796165,0.47013962,0.8952143,-0.008930805,-0.17680325,-0.704242,-1.1091275,-0.6867162,0.5404577,-1.0234057,0.71886224,-0.769501,0.923611,-0.7606229,-0.19196886,-0.86931545,0.95357025,0.8420425,1.6821389,1.1922816,0.64718795,0.67438436,-0.83948326,-1.0336314,1.135635,0.9907036,0.14935225,-0.62381935,1.7775474,-0.054657657,0.78640664,-0.7279978,-0.45434985,1.1893182,1.2544643,-2.15092,-1.7235436,1.047173,-0.1170733,-0.051908553,-1.098293,0.17285198,-0.085874915,1.4612851,0.24653414,-0.14835985,0.3946811,-0.33008638,-0.17601183,-0.79181874,-0.001846984,-0.5688003,-0.32315254,-1.5091114,-1.3093823,0.35818374,-0.020578597,0.13254775,0.08677244,0.25909093,-0.46612057,0.02809602,-0.87092584,-1.1213324,-1.503037,1.8704559,-0.10248221,0.21668856,0.2714984,0.031719234,0.8509111,0.87941355,0.32090616,0.70586735,-0.2160697,1.2130814,0.81380475,0.8308766,0.69376045,0.20059735,-0.62706333,0.06513833,-0.25983867,-0.26937178,1.1370893,0.12345111,0.4245841,0.8032184,-0.85147107,-0.7817614,-1.1791542,0.054727774,0.33709362,-0.7165752,-0.6065557,-0.6793303,-0.10181883,-0.80588853,-0.60589695,0.04176558,0.9381139,0.86121285,-0.483753,0.27040368,0.7229057,0.3529946,-0.86491895,-0.0883965,-0.45674118,-0.57884586,0.4881854,-0.2732384,0.2983724,0.3962273,-0.12534264,0.8856427,1.3331532,-0.26294935,-0.14494254,-1.4339849,0.48596704,1.0052125,0.5438694,0.78611183,0.86212146,0.17376512,0.113286816,0.39630392,-0.9429737,-0.5384651,-0.31277686,0.98931545,0.35072982,-0.50156367,0.2987925,1.2240223,-0.3444314,-0.06413657,-0.4139552,-1.3548497,0.3713058,0.5338464,0.047096968,0.17121102,0.4908476,0.33481652,1.0725886,0.068777196,-0.18275931,-0.018743126,0.35847363,0.61257994,-0.01896591,0.53872716,-1.0410246,1.2810577,-0.65638995,-0.4950475,-0.14177354,-0.38749444,-0.12146497,-0.69324815,-0.8031308,-0.11394101,0.4511331,-0.36235264,-1.0423448,1.3434777,-0.61404437,0.103578284,-0.42243803,0.13448912,-0.0061332933,0.19688538,0.111303836,0.14047435,2.3025432,-0.20064694,-1.0677278,0.60881
45,-0.038092047,0.26895407,0.11633718,-1.5688779,-0.09998454,0.10787329,-0.30374414,0.9052384,0.4006251,-0.7892597,0.7623954,-0.34756395,-0.54056764,0.3252798,0.33199653,0.62842965,0.37663814,-0.030949261,1.0469799,0.03405783,-0.62260365,-0.34344113,-0.39576128,0.24071567,-0.0143306,-0.36152077,-0.21019648,0.15403631,0.54536396,0.070417285,-1.1143794,-0.6841382,-1.4072497,-1.2050889,0.36286953,-0.48767778,1.0853148,-0.62063366,-0.22110772,0.30935922,0.657101,-1.0029979,-1.4981637,-0.05903004,-0.85891956,-0.8045846,0.05591573,0.86750376,0.5158197,0.42628267,0.45796645,1.8688178,0.84444594,-0.8722601,-1.099219,0.1675867,0.59336346,-0.12265335,-0.41956308,0.93164825,-0.12881526,0.28344584,0.21308619,-0.039647672,0.8919175,-0.8751169,0.1825347,-0.023952499,0.55597776,1.0254196,0.3826872,-0.08271052,-1.1974314,-0.8977747,0.55039763,1.5131414,-0.451007,0.14583892,0.24330004,1.0137768,-0.48189703,-0.48874113,-0.1470369,0.49510378,0.38879463,-0.7000347,-0.061767917,0.29879406,0.050993137,0.4503994,0.44063208,-0.844459,-0.10434887,-1.3999974,0.2449593,0.2624704,0.9094605,-0.15879464,0.7038591,0.30076742,0.7341888,-0.5257968,0.34079516,-1.7379513,0.13891199,0.0982849,1.2222294,0.11706773,0.05191148,0.12235231,0.34845573,0.62851644,0.3305461,-0.52740043,-0.9233819,0.4350543,-0.31442615,-0.84617394,1.1801229,-0.0564243,2.2154071,-0.114281625,0.809236,1.0508876,0.93325424,-0.14246169,-0.70618397,0.22045197,0.043732524,0.89360833,0.17979233,0.7782733,-0.16246022,-0.21719909,0.024336463,0.48491704,0.40749896,0.8901898,-0.57082295,-0.4949802,-0.5102787,-0.21259686,0.417162,0.37601888,1.0007366,0.7449076,0.6223696,-0.49961302,0.8396295,1.117957,0.008836402,-0.49906662,-0.03272103,0.13135666,0.25935343,-1.3398852,0.18256736,-0.011611674,-0.27749947,-0.84756446,0.11329307,-0.25090477,-1.1771594,0.67494935,-0.5614711,-0.09085327,-0.3132199,0.7154967,-0.3607141,0.5187279,0.16049784,-0.73461974,-1.7925078,-1.9164195,0.7991559,0.99091554,0.7067987,-0.57791114,-0.4848671,-1.100601,-0.59190345,0.30508074,-1.0731133,0.35330638,-1.1267302,-0.011746664,-0.6839462,-1.2538619,-0.94186044,0.44130656,-0.38140884,-0.37565815,-0.44280535,-0.053642027,0.6066312,0.12132282,0.035870302,0.5325165,-0.038058326,-0.70161515,0.005607947,1.0081267,-1.2909276,-0.92740905,0.5405458,0.53192127,-0.9372405,0.7400459,-0.5593214,-0.80438167,0.9196061,0.088677965,-0.5795356,-0.62158984,-1.4840353,0.48311192,0.76646256,-0.009653425,0.664507,1.0588721,-0.55877256,-0.55249715,-0.4854527,0.43072438,-0.29720852,0.31044763,0.41128498,-0.74395776,-1.1164409,0.6381095,-0.45213065,-0.41928747,-0.7472354,-0.17209144,0.307881,0.43353182,-1.2533877,0.10122644,0.28987703,-0.43614298,-0.15241891,0.26940024,0.16055605,-1.4585212,0.52161473,0.9048135,-0.20131661,0.7265157,-0.00018197215,-0.2497379,-0.38577276,-1.3037856,0.5999186,0.4910673,0.76949763,-0.061471477,-0.4325986,0.6368372,0.16506073,-0.37456205,-0.3420613,-0.54678524,1.8179338,0.09873521,-0.15852624,-1.2694672,-0.3394376,-0.7944524,0.42282122,0.20561744,-0.7579017,-0.02898455,0.3193843,-0.880837,0.21365796,0.121797614,1.0254698,0.6885746,0.3068437,0.53845966,0.7072179,1.1950152,0.2619351,0.5534848,0.36036322,-0.635574,0.19842437,-0.8263201,-0.34289825,0.10286513,-0.8120933,-0.47783035,0.5496924,0.052244812,1.3440897,0.9016641,-0.76071066,-0.3754273,-0.57156265,-0.3039743,-0.72466373,0.6158706,0.09669343,0.86211246,0.45682988,-0.56253654,-0.3554615,0.8981484,0.16338861,0.61401916,1.6700366,0.7903558,-0.11995987,1.6473453,0.21475694,0.94213593,-1.279444,0.40164223,0.77865,1.0799583,-0.5661335,-0.
43656045,0.37110725,-0.23973094,0.6663116,-1.5518241,0.60228294,-0.8730299,-0.4106444,-0.46960723,-0.47547948,-0.918826,-0.079336844,-0.51174027,1.3490533,-0.927986,0.42585903,0.73130196,1.2575479,0.98948413,-0.314556,0.62689084,0.5758436,-0.11093489,0.039149974,-0.8506448,1.1751219,-0.96297604,0.5589994,-0.75090784,-0.33629242,0.7918035,0.75811136,-0.0606605,-0.7733524,-1.5680165,-0.6446142,0.7613113,0.721117,0.054847892,-0.4485187,-0.26608872,1.2188075,0.08169317,0.5978582,-0.64777404,-1.9049765,0.5166473,-0.7455406,-1.1504349,1.3784496,-0.24568361,-0.35371232,-0.013054923,-0.57237804,0.59931237,0.46333218,0.054302905,0.6114685,1.5471761,-0.19890086,0.84167045,0.33959422,-0.074407116,3.9876409,1.3817698,0.5491156,-1.5438982,0.07177756,-1.0054835,0.14944264,0.042414695,-0.3515721,0.049677286,0.4029755,0.9665063,1.0081058,0.40573725,0.86347926,0.74739635,-0.6202449,-0.78576154,0.8640424,-0.75356483,-0.0030959393,-0.7309192,-0.67107457,-1.1870506,0.9610583,0.14838722,0.55623454,-1.0180675,1.3138177,0.9418509,0.9516112,0.2749008,0.3799174,0.6875819,0.3593635,0.02494887,-0.042821404,-0.02257093,-0.20181343,0.24203236,0.3782816,0.16458313,-0.10500721,0.6841971,-0.85342956,-0.4882129,-1.1310949,-0.69270194,-0.16886552,0.82593036,-0.0031709322,-0.55615395,-0.31646764,-0.846376,-1.2038568,0.41713443,0.091425575,-0.050411556,-1.5898843,-0.65858334,1.0211359,-0.29832518,1.0239898,0.31851336,-0.12463779,0.06075947,-0.38864592,1.1107218,-0.6335154,-0.22827888,-0.9442285,0.93495697,-0.7868781,0.071433865,-0.9309406,0.4193446,-0.08388461,-0.530641,-1.116366,-1.057797,0.31456125,0.9027106,-0.06956576,0.18859546,-0.44057858,0.15511869,-0.70706356,0.3468956,-0.23489438,-0.21894005,0.1365304,1.2342967,0.24870403,-0.6072671,-0.56563044,-0.19893534,-1.6501249,-1.0609756,-0.14706758,1.8078117,-0.73515546,-0.42395878,0.40629613,0.5345876,-0.8564257,0.33988473,0.87946063,-0.70647347,-0.82399774,-0.28400525,-0.11244382,-1.1803491,-0.6051204,-0.48171222,0.6352527,0.9955332,0.060266595,-1.0434257,0.18751803,-0.8791377,1.5527687,-0.34049803,0.12179581,-0.65977687,-0.44843185,-0.5378742,0.41946766,0.46824372,0.24347036,-0.42384493,0.24210829,0.43362963,-0.17259134,0.47868198,-0.47093317,-0.33765036,0.15519959,-0.13469115,-0.9832437,-0.2315401,0.89967567,-0.2196765,-0.3911332,0.72678024,0.001113255,-0.03846649,-0.4437102,-0.105207585,0.9146223,0.2806104,-0.073881194,-0.08956877,0.6022565,0.34536007,0.1275348,0.5149897,-0.32749107,0.3006347,-0.10103988,0.21793392,0.9912135,0.86214256,0.30883485,-0.94117,0.98778534,0.015687397,-0.8764767,0.037501317,-0.12847403,0.0981208,-0.31701544,-0.32385334,0.43092263,-0.4069169,-0.8972079,-1.2575746,-0.47084373,-0.14999634,0.014707203,-0.37149346,0.3610224,0.2650979,-1.4389727,0.9148726,0.3496221,-0.07386527,-1.1408309,0.6867602,-0.704264,0.40382487,0.10580344,0.646804,0.9841216,0.5507306,-0.51492304,-0.34729987,0.22495836,0.42724502,-0.19653529,-1.1309057,0.5641935,-0.8154129,-0.84296966,0.29565218,-0.68338835,-0.28773895,0.21857412,0.9875624,0.80842453,0.60770905,-0.08765514,-0.512558,-0.45153108,0.022758177,-0.019249387,0.75011975,-0.5247193,-0.075737394,0.6226087,-0.42776236,0.27325255,-0.005929854,-1.0736796,0.100745015,-0.6502218,0.62724555,0.56331265,-1.1612102,0.47081968,-1.1985526,0.34841013,0.058391914,-0.51457083,0.53776836,0.66995555,-0.034272604,-0.783307,0.04816275,-0.6867638,-0.7655091,-0.29570612,-0.24291794,0.12727965,1.1767148,-0.082389325,-0.52111506,-0.6173243,1.2472475,-0.32435313,-0.1451121,-0.15679994,0.7391408,0.49221176,-0.35564727,0.5744523,1.6231831,0.158
46235,-1.2422205,-0.4208412,-0.2163598,0.38068682,1.6744317,-0.36821502,0.6042655,-0.5680786,1.0682867,0.019634644,-0.22854692,0.012767732,0.12615916,-0.2708234,0.08950687,1.3470159,0.33660004,-0.5529485,0.2527212,-0.4973868,0.2797395,-0.8398461,-0.45434773,-0.2114668,0.5345738,-0.95777416,1.04314,-0.5885558,0.4784298,-0.40601963,-0.27700382,-0.9475248,1.3175657,-0.22060044,-0.4138579,-0.5917306,-1.1157118,-0.19392541,-1.1205745,-0.45245594,0.6583289,-0.5018245,0.80024433,1.4671688,0.62446856,1.134583,-0.10825716,-0.58736664,-1.1071991,-1.7562832,0.080109626,0.7975777,0.19911054,0.69512564,-0.14862823,0.2053994,-0.4011153,1.2195913,1.0608866,0.45159817,-0.6997635,0.5517133,-0.40297875,-0.8871956,-0.5386776,0.4603326,-0.029690862,2.0928583,-0.5171186,0.9697673,-0.6123527,-0.07635037,-0.92834306,0.0715186,-0.34455565,0.4734149,0.3211016,-0.19668017,-0.79836154,-0.077905566,0.6725751,-0.73293614,-0.026289426,-0.9199058,0.66183317,-0.27440917,-0.8313121,-1.2987471,-0.73153865,-0.3919303,0.73370796,0.008246649,-1.048442,-1.7406054,-0.23710802,1.2845341,-0.8552668,0.11181834,-1.1165439,0.32813492,-0.08691622,0.21660605] |
-
-!!!
-
-!!!
-
-!!! note
-
-You may notice it took more than 100ms to retrieve those 5 rows with their embeddings. Scroll the results over to see how much numeric data there is. _Fetching an embedding over the wire takes about as long as generating it from scratch with a state-of-the-art model._ 🤯
-
-Many benchmarks completely ignore the costs of data transfer and (de)serialization, but in practice they happen multiple times per request and often become the dominant cost in typical complex systems.
-
-!!!
-
-Sorry, that was supposed to be a refresher, but it set me off. At PostgresML we're concerned about microseconds. 107.207 milliseconds better be spent doing something _really_ useful, not just fetching 5 rows. Bear with me while I belabor this point, because it reveals the source of most latency in machine learning microservice architectures that separate the database from the model, or worse, put the model behind an HTTP API in a different datacenter.
-
-It's especially harmful because, in a mature organization, the models are often owned by one team and the database by another. Both teams (let's assume the best) may be using efficient implementations and purpose-built tech, but the latency problem lies in the gap between them while communicating over a wire, and it's impossible to solve due to Conway's Law. Eliminating this gap, with its cost and organizational misalignment, is central to the design of PostgresML.
-
-> _One query. One system. One team. Simple, fast, and efficient._
-
-Rather than shipping the entire vector back to an application like a normal vector database, PostgresML includes all the algorithms needed to compute results internally. For example, we can ask PostgresML to compute the l2 norm for each embedding, a relevant computation that has the same cost as the cosine similarity function we're going to use for similarity search:
-
-!!! generic
-
-!!! code\_block time="2.268 ms"
-
-```postgresql
-SELECT pgml.norm_l2(review_embedding_e5_large)
-FROM pgml.amazon_us_reviews
-LIMIT 5;
-```
-
-!!!
-
-!!! results
-
-| norm\_l2 |
-| --------- |
-| 22.485546 |
-| 22.474796 |
-| 21.914106 |
-| 22.668892 |
-| 22.680748 |
-
-!!!
-
-!!!
-
-Most people would assume that "complex ML functions" with _`O(n * m)`_ runtime will increase load on the database compared to a "simple" `SELECT *`, but in fact, _moving the function to the database reduced the latency 50 times over_, and now our application doesn't need to do the "ML function" at all. This isn't just a problem with Postgres or databases in general, it's a problem with all programs that have to ship vectors over a wire, aka microservice architectures full of "feature stores" and "vector databases".
-
-> _Shuffling the data between programs is often more expensive than the actual computations the programs perform._
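-
-You can see the asymmetry for yourself without a benchmarking harness. Here's a minimal sketch using Postgres's built-in `pg_column_size` to compare the bytes a client must pull per row for a full embedding vs. a single value computed in the database:
-
-```postgresql
--- bytes shipped per row: the whole 1024-dimension embedding vs. one float
-SELECT
-    pg_column_size(review_embedding_e5_large) AS embedding_bytes,
-    pg_column_size(pgml.norm_l2(review_embedding_e5_large)) AS norm_bytes
-FROM pgml.amazon_us_reviews
-LIMIT 1;
-```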
-
-This should convince you that PostgresML's approach of bringing the algorithms to the data is the right one, rather than shipping data all over the place. We're not the only ones who think so. Initiatives like Apache Arrow prove the ML community is aware of this issue, but Arrow and Google's Protobuf are not a solution to this problem; they're excellently crafted band-aids spanning the festering wounds in complex ML systems.
-
-> _For legacy ML systems, it's time for surgery to cut out the necrotic tissue and stitch the wounds closed._
-
-Some systems start simple enough, or deal with little enough data, that these inefficiencies don't matter. Over time however, they will increase financial costs by orders of magnitude. If you're building new systems, rather than dealing with legacy data pipelines, you can avoid learning these painful lessons yourself, and build on top of 40 years of solid database engineering instead.
-
-## Similarity Search
-
-I hope my rant convinced you it's worth wrapping your head around some advanced SQL to handle this task more efficiently. If you're still skeptical, there are more benchmarks to come. Let's go back to our 5 million movie reviews.
-
-We'll start with semantic search. Given a user query, e.g. "Best 1980's scifi movie", we'll use an LLM to create an embedding on the fly. Then we can use our vector similarity index to quickly find the most similar embeddings we've indexed in our table of movie reviews. We'll use the `cosine distance` operator `<=>` to compare the request embedding to the review embeddings, then sort by the closest match and take the top 5. Cosine similarity is defined as `1 - cosine distance`. The two measures are complements of each other, but it's more natural to interpret results on the similarity scale `[-1, 1]`, where -1 is opposite, 0 is neutral, and 1 is identical.
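-
-As a quick sanity check of that relationship, here's a toy example on made-up 3-dimensional vectors (assuming the `vector` type from pgvector, which PostgresML ships with):
-
-```postgresql
--- parallel vectors have distance 0 (similarity 1); opposite vectors have distance 2 (similarity -1)
-SELECT
-    1 - ('[1,2,3]'::vector(3) <=> '[2,4,6]'::vector(3)) AS parallel_similarity,
-    1 - ('[1,2,3]'::vector(3) <=> '[-1,-2,-3]'::vector(3)) AS opposite_similarity;
-```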
-
-!!! generic
-
-!!! code\_block time="152.037 ms"
-
-```postgresql
-WITH request AS (
- SELECT pgml.embed(
- 'intfloat/e5-large',
- 'query: Best 1980''s scifi movie'
- )::vector(1024) AS embedding
-)
-
-SELECT
- review_body,
- product_title,
- star_rating,
- total_votes,
- 1 - (
- review_embedding_e5_large <=> (
- SELECT embedding FROM request
- )
- ) AS cosine_similarity
-FROM pgml.amazon_us_reviews
-ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
-LIMIT 5;
-```
-
-!!!
-
-!!! results
-
-| review\_body | product\_title | star\_rating | total\_votes | cosine\_similarity |
-| --------------------------------------------------- | ------------------------------------------------------------- | ------------ | ------------ | ------------------ |
-| best 80s SciFi movie ever | The Adventures of Buckaroo Banzai Across the Eighth Dimension | 5 | 1 | 0.956207707312679 |
-| One of the best 80's sci-fi movies, beyond a doubt! | Close Encounters of the Third Kind \[Blu-ray] | 5 | 1 | 0.9298004258989776 |
-| One of the Better 80's Sci-Fi, | Krull (Special Edition) | 3 | 5 | 0.9126601222760491 |
-| the best of 80s sci fi horror! | The Blob | 5 | 2 | 0.9095577631102708 |
-| Three of the best sci-fi movies of the seventies | Sci-Fi: Triple Feature (BD) \[Blu-ray] | 5 | 0 | 0.9024044582495285 |
-
-!!!
-
-!!!
-
-!!! tip
-
-Common Table Expressions (CTEs) that begin `WITH name AS (...)` can be a nice way to organize complex queries into more modular sections. They can also make it easier for Postgres to create a query plan, by introducing an optimization gate that separates the conditions in the CTE from the rest of the query. (Since Postgres 12, a CTE only acts as a gate when it's marked `MATERIALIZED` or referenced more than once; otherwise it's inlined.)
-
-Generating a query plan more quickly and only computing the values once may make your query faster overall, as long as the plan is good, but it might also make your query slower if it prevents the planner from finding a more sophisticated optimization across the gate. It's often worth checking the query plan with and without the CTE to see if it makes a difference. We'll cover query plans and tuning in more detail later.
-
-!!!
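-
-Checking is as easy as prefixing the query with Postgres's standard `EXPLAIN`. A minimal sketch, reusing the query above; run it with the CTE and again with the expression inlined to compare plans:
-
-```postgresql
-EXPLAIN (ANALYZE, BUFFERS)
-WITH request AS (
-    SELECT pgml.embed(
-        'intfloat/e5-large',
-        'query: Best 1980''s scifi movie'
-    )::vector(1024) AS embedding
-)
-SELECT review_body
-FROM pgml.amazon_us_reviews
-ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
-LIMIT 5;
-```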
-
-There's some good stuff happening in those query results, so let's break it down:
-
-* **It's fast** - We're able to generate a request embedding on the fly with a state-of-the-art model, and search 5M reviews in 152ms, including fetching the results back to the client 😍. You can't even generate an embedding from OpenAI's API in that time, much less search 5M reviews in some other database with it.
-* **It's good** - The `review_body` results are very similar to the "Best 1980's scifi movie" request text. We're using the `intfloat/e5-large` open source embedding model, which outperforms OpenAI's `text-embedding-ada-002` in most [quality benchmarks](https://huggingface.co/spaces/mteb/leaderboard).
-  * Qualitatively: the embeddings understand that our request for `scifi` is equivalent to `Sci-Fi`, `sci-fi`, `SciFi`, and `sci fi`, that `1980's` matches `80s` and `80's`, and that it's close to `seventies` (last place). We didn't have to configure any of this, and the most enthusiastic review of "best" is at the top while the least enthusiastic is at the bottom, so the model has appropriately captured "sentiment".
-  * Quantitatively: the `cosine_similarity` of all results is high and tight, 0.90-0.95 on a scale from -1 to 1. We can be confident we recalled very similar results from our 5M candidates, even though it would take 485 times as long to check all of them directly.
-* **It's reliable** - The model is stored in the database, so we don't need to worry about managing a separate service. If you repeat this query over and over, the timings will be extremely consistent, because we don't have to deal with things like random network congestion.
-* **It's SQL** - `SELECT`, `ORDER BY`, `LIMIT`, and `WITH` are all standard SQL, so you can use them on any data in your database, and further compose queries with standard SQL.
-
-This seems to actually just work out of the box... but, there is some room for improvement.
-
-_Yeah, well, that's just like, your opinion, man_
-
-1. **It's a single person's opinion** - We're searching individual reviews, not all reviews for a movie. The correct answer to this request is undisputedly "Episode V: The Empire Strikes Back". Ok, maybe "Blade Runner", but I really did like "Back to the Future"... Oh no, someone on the internet is wrong, and we need to fix it!
-2. **It's approximate** - There are more than four 80's Sci-Fi movie reviews in this dataset of 5M. It really shouldn't be including results from the 70's. More relevant reviews are not being returned, which is a pretty sneaky optimization for a database to pull, but the disclaimer was in the name.
-3. **It's narrow** - We're only searching the review text, not the product title, or incorporating other data like the star rating and total votes. Not to mention this is an intentionally crafted semantic search, rather than a keyword search of people looking for a specific title.
-
-We can fix all of these issues with the tools in PostgresML. First, to address The Dude's point, we'll need to aggregate reviews about movies and then search them.
-
-## Aggregating reviews about movies
-
-We'd really like a search for movies, not reviews, so let's create a new movies table out of our reviews table. We can use SQL aggregates over the reviews to generate some simple stats for each movie, like the number of reviews and average star rating. PostgresML provides aggregate functions for vectors.
-
-A neat thing about embeddings is that if you sum a bunch of related vectors up, the common components of the vectors will increase, and the components where there isn't good agreement will cancel out. The `sum` of all the movie review embeddings will give us a representative embedding for the movie, in terms of what people have said about it. Aggregating embeddings around related tables is a super powerful technique. In the next post, we'll show how to generate a related embedding for each reviewer, and then we can use that to personalize our search results, but one step at a time.
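-
-To build intuition for why this works, here's a toy sketch using pgvector's element-wise `+` operator on made-up 3-dimensional vectors: the component the two vectors agree on reinforces, while the component they disagree on mostly cancels.
-
-```postgresql
-SELECT '[1, 0.9, 0]'::vector(3) + '[1, -0.8, 0]'::vector(3) AS summed;
--- approximately [2, 0.1, 0]: agreement doubles, disagreement nearly cancels
-```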
-
-!!! generic
-
-!!! code\_block time="3128724.177 ms (52:08.724)"
-
-```postgresql
-CREATE TABLE movies AS
-SELECT
- product_id AS id,
- product_title AS title,
- product_parent AS parent,
- product_category AS category,
- count(*) AS total_reviews,
- avg(star_rating) AS star_rating_avg,
- pgml.sum(review_embedding_e5_large)::vector(1024) AS review_embedding_e5_large
-FROM pgml.amazon_us_reviews
-GROUP BY product_id, product_title, product_parent, product_category;
-```
-
-!!!
-
-!!! results
-
-| CREATE TABLE |
-| ------------- |
-| SELECT 298481 |
-
-!!!
-
-!!!
-
-We've just aggregated our original 5M reviews (including their embeddings) into \~300k unique movies. I like to include the model name used to generate the embeddings in the column name, so that as new models come out, we can just add new columns with new embeddings to compare side by side. Now we can create a new vector index for our movies, in addition to the one we already have on our reviews. This one is built `WITH (lists = 300)`; `lists` is one of the key parameters for tuning the vector index, and we're using a rule of thumb of about 1 list per thousand vectors.
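-
-If you don't want to hardcode that number, the rule of thumb is easy to compute on the fly. This is a hypothetical helper query, not part of the original setup:
-
-```postgresql
--- ~1 list per 1,000 vectors
-SELECT ceil(count(*) / 1000.0) AS suggested_lists
-FROM movies;
-```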
-
-!!! generic
-
-!!! code\_block time="53236.884 ms (00:53.237)"
-
-```postgresql
-CREATE INDEX CONCURRENTLY
- index_movies_on_review_embedding_e5_large
-ON movies
-USING ivfflat (review_embedding_e5_large vector_cosine_ops)
-WITH (lists = 300);
-```
-
-!!!
-
-!!! results
-
-!!!
-
-!!!
-
-Now we can quickly search for movies by what people have said about them:
-
-!!! generic
-
-!!! code\_block time="122.000 ms"
-
-```postgresql
-WITH request AS (
- SELECT pgml.embed(
- 'intfloat/e5-large',
- 'Best 1980''s scifi movie'
- )::vector(1024) AS embedding
-)
-SELECT
- title,
- 1 - (
- review_embedding_e5_large <=> (SELECT embedding FROM request)
- ) AS cosine_similarity
-FROM movies
-ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
-LIMIT 10;
-```
-
-!!!
-
-!!! results
-
-| title | cosine\_similarity |
-| ------------------------------------------------------------------ | ------------------ |
-| THX 1138 (The George Lucas Director's Cut Special Edition/ 2-Disc) | 0.8652007733744973 |
-| 2010: The Year We Make Contact | 0.8621574666546908 |
-| Forbidden Planet | 0.861032948199611 |
-| Alien | 0.8596578185151328 |
-| Andromeda Strain | 0.8592793014849687 |
-| Forbidden Planet | 0.8587316047371392 |
-| Alien (The Director's Cut) | 0.8583879679255717 |
-| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 0.8577616472530644 |
-| Strange New World | 0.8576321103975245 |
-| It Came from Outer Space | 0.8575860003514065 |
-
-!!!
-
-!!!
-
-It's somewhat expected that the movie vectors will have been diluted compared to review vectors during aggregation, but we still have results with pretty high cosine similarity of \~0.85 (compared to \~0.95 for reviews).
-
-It's important to remember that we're doing _Approximate_ Nearest Neighbor (ANN) search, so we're not guaranteed to get the exact best results. When we were searching 5M reviews, it was more likely we'd find 5 good matches just because there were more candidates, but now that we have fewer movie candidates, we may want to dig deeper into the dataset to find more high quality matches.
-
-## Tuning vector indexes for recall vs speed
-
-Inverted File Indexes (IVF) are built by clustering all the vectors into `lists` using cosine similarity. Once the `lists` are created, their center is computed by averaging all the vectors in the list. It's similar to what we did when we clustered the reviews around their movies, except these clusters are just some arbitrary number of similar vectors.
-
-When we perform a vector search, we will compare the query to the center of all `lists` to find the closest ones. The default number of `probes` in a query is 1. In that case, only the closest `list` will be exhaustively searched. This reduces the number of vectors that need to be compared from 300,000 to about 1,300 (300 list centers, plus \~1,000 vectors in the closest list). That saves a lot of work, but sometimes the best results were just on the edges of the `lists` we skipped.
-
-Most applications have an acceptable latency limit. If we have some latency budget to spare, it may be worth increasing the number of `probes` to check more `lists` for better recall. If we up the number of `probes` to 300, we can exhaustively search all lists and get the best possible results:
-
-```postgresql
-SET ivfflat.probes = 300;
-```
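-
-Note that a plain `SET` keeps the new value for the rest of the session. If you only want the extra recall for a single query, standard Postgres `SET LOCAL` scopes the setting to the current transaction, as in this sketch:
-
-```postgresql
-BEGIN;
-SET LOCAL ivfflat.probes = 300; -- reverts automatically at COMMIT or ROLLBACK
--- ... run the search query here ...
-COMMIT;
-```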
-
-!!! generic
-
-!!! code\_block time="2337.031 ms (00:02.337)"
-
-```postgresql
-WITH request AS (
- SELECT pgml.embed(
- 'intfloat/e5-large',
- 'Best 1980''s scifi movie'
- )::vector(1024) AS embedding
-)
-SELECT
- title,
- 1 - (
- review_embedding_e5_large <=> (SELECT embedding FROM request)
- ) AS cosine_similarity
-FROM movies
-ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
-LIMIT 10;
-```
-
-!!!
-
-!!! results
-
-| title | cosine\_similarity |
-| ------------------------------------------------------------------ | ------------------ |
-| THX 1138 (The George Lucas Director's Cut Special Edition/ 2-Disc) | 0.8652007733744973 |
-| Big Trouble in Little China \[UMD for PSP] | 0.8649691870870362 |
-| 2010: The Year We Make Contact | 0.8621574666546908 |
-| Forbidden Planet | 0.861032948199611 |
-| Alien | 0.8596578185151328 |
-| Andromeda Strain | 0.8592793014849687 |
-| Forbidden Planet | 0.8587316047371392 |
-| Alien (The Director's Cut) | 0.8583879679255717 |
-| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 0.8577616472530644 |
-| Strange New World | 0.8576321103975245 |
-
-!!!
-
-!!!
-
-There's a big difference in the time it takes to search all 300,000 vectors vs \~1,300: almost 20 times as long, although it does find one more result that was not in the original list:
-
-```
-| Big Trouble in Little China [UMD for PSP] | 0.8649691870870362 |
-```
-
-This is a weird result. It's not Sci-Fi like all the others and it wasn't clustered with them in the closest list, which makes sense. So why did it rank so highly? Let's dig into the individual reviews to see if we can tell what's going on.
-
-## Digging deeper into recall quality
-
-SQL makes it easy to investigate these sorts of data issues. Let's look at the reviews for `Big Trouble in Little China [UMD for PSP]`, noting it only has 1 review.
-
-!!! generic
-
-!!! code\_block
-
-```postgresql
-SELECT review_body
-FROM pgml.amazon_us_reviews
-WHERE product_title = 'Big Trouble in Little China [UMD for PSP]';
-```
-
-!!!
-
-!!! results
-
-| review\_body |
-| ----------------------- |
-| Awesome 80's cult flick |
-
-!!!
-
-!!!
-
-This confirms our model has picked up on lingo like "flick" = "movie", and it seems it must have strongly associated "cult" flicks with the "scifi" genre. But, with only 1 review, there hasn't been any generalization in the movie embedding. It's a relatively strong match for a movie, even if it's not the best for a single review match (0.86 vs 0.95).
-
-Overall, our movie results look better to me than the titles pulled just from single reviews, but we haven't completely addressed The Dude's point, as evidenced by this movie having a single review and being out of the requested genre. Embeddings often have fuzzy boundaries that we may need to firm up.
-
-## Adding a filter to the request
-
-To prevent noise in the data from leaking into our results, we can add a filter to the request to only consider movies with a minimum number of reviews. We can also add a filter to only consider movies with a minimum average review score with a `WHERE` clause.
-
-```postgresql
-SET ivfflat.probes = 1;
-```
-
-!!! generic
-
-!!! code\_block time="107.359 ms"
-
-```postgresql
-WITH request AS (
- SELECT pgml.embed(
- 'intfloat/e5-large',
- 'query: Best 1980''s scifi movie'
- )::vector(1024) AS embedding
-)
-
-SELECT
- title,
- total_reviews,
- 1 - (
- review_embedding_e5_large <=> (SELECT embedding FROM request)
- ) AS cosine_similarity
-FROM movies
-WHERE total_reviews > 10
-ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
-LIMIT 10;
-```
-
-!!!
-
-!!! results
-
-| title | total\_reviews | cosine\_similarity |
-| ---------------------------------------------------- | -------------- | ------------------ |
-| 2010: The Year We Make Contact | 29 | 0.8621574666546908 |
-| Forbidden Planet | 202 | 0.861032948199611 |
-| Alien | 250 | 0.8596578185151328 |
-| Andromeda Strain | 30 | 0.8592793014849687 |
-| Forbidden Planet | 19 | 0.8587316047371392 |
-| Alien (The Director's Cut) | 193 | 0.8583879679255717 |
-| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 255 | 0.8577616472530644 |
-| Strange New World | 27 | 0.8576321103975245 |
-| It Came from Outer Space | 155 | 0.8575860003514065 |
-| The Quatermass Xperiment (The Creeping Unknown) | 46 | 0.8572098277579617 |
-
-!!!
-
-!!!
-
-There we go. We've filtered out the noise, and now we're getting a list of movies that are all Sci-Fi. As we play with this dataset a bit, I'm getting the feeling that some of these are legit (Alien), but most are a bit too far out on the fringe for my interests. I'd like to see more popular movies as well. Let's influence these rankings to take an additional popularity score into account.
-
-## Boosting and Reranking
-
-There are a few simple cases where NoSQL vector databases facilitate a killer app, like recalling text chunks to build a prompt to feed an LLM chatbot, but in most cases, creating good search results from a user's perspective requires more context than a similarity score alone.
-
-As the Product Manager for this blog post search engine, I have an expectation that results should favor the movies that have more `total_reviews`, so that we can rely on an established consensus. Movies with higher `star_rating_avg` should also be boosted, because people very explicitly like those results. We can add boosts directly to our query to achieve this.
-
-SQL is a very expressive language that can handle a lot of complexity. To keep things clean, we'll move our current query into a second CTE that provides a first-pass ranking of our initial semantic search candidates. Then, we'll re-score and rerank those first-round candidates, boosting movies with a higher `star_rating_avg` in the final `ORDER BY` clause:
-
-!!! generic
-
-!!! code\_block time="124.119 ms"
-
-```postgresql
--- create a request embedding on the fly
-WITH request AS (
- SELECT pgml.embed(
- 'intfloat/e5-large',
- 'query: Best 1980''s scifi movie'
- )::vector(1024) AS embedding
-),
-
--- vector similarity search for movies
-first_pass AS (
- SELECT
- title,
- total_reviews,
- star_rating_avg,
- 1 - (
- review_embedding_e5_large <=> (SELECT embedding FROM request)
- ) AS cosine_similarity,
- star_rating_avg / 5 AS star_rating_score
- FROM movies
- WHERE total_reviews > 10
- ORDER BY review_embedding_e5_large <=> (SELECT embedding FROM request)
- LIMIT 1000
-)
-
--- grab the top 10 results, re-ranked with a boost for the avg star rating
-SELECT
- title,
- total_reviews,
- round(star_rating_avg, 2) as star_rating_avg,
- star_rating_score,
- cosine_similarity,
- cosine_similarity + star_rating_score AS final_score
-FROM first_pass
-ORDER BY final_score DESC
-LIMIT 10;
-```
-
-!!!
-
-!!! results
-
-| title | total\_reviews | star\_rating\_avg | star\_rating\_score | cosine\_similarity | final\_score |
-| ---------------------------------------------------- | -------------: | ----------------: | ---------------------: | -----------------: | -----------------: |
-| Forbidden Planet (Two-Disc 50th Anniversary Edition) | 255 | 4.82 | 0.96392156862745098000 | 0.8577616472530644 | 1.8216832158805154 |
-| Back to the Future | 31 | 4.94 | 0.98709677419354838000 | 0.8338102534611714 | 1.82090702765472 |
-| Warning Sign | 17 | 4.82 | 0.96470588235294118000 | 0.8489675234208343 | 1.8136734057737756 |
-| Plan 9 From Outer Space/Robot Monster | 13 | 4.92 | 0.98461538461538462000 | 0.8279949554661198 | 1.8126103400815046 |
-| Blade Runner: The Final Cut (BD) \[Blu-ray] | 11 | 4.82 | 0.96363636363636364000 | 0.8484326819309408 | 1.8120690455673043 |
-| The Day the Earth Stood Still | 589 | 4.76 | 0.95212224108658744000 | 0.8555529952535671 | 1.8076752363401547 |
-| Forbidden Planet \[Blu-ray] | 223 | 4.79 | 0.95874439461883408000 | 0.8479982398847651 | 1.8067426345035993 |
-| Aliens (Special Edition) | 25 | 4.76 | 0.95200000000000000000 | 0.851194119705901 | 1.803194119705901 |
-| Night of the Comet | 22 | 4.82 | 0.96363636363636364000 | 0.8388328187333605 | 1.802469182369724 |
-| Forbidden Planet | 19 | 4.68 | 0.93684210526315790000 | 0.8587316047371392 | 1.795573710000297 |
-
-!!!
-
-!!!
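-
-One note on the scoring function: adding the two scores weights them equally, and since `star_rating_score` only spans roughly 0.94-0.99 across these candidates, it acts as a gentle tie-breaker on top of relevance. If the boost overwhelms relevance on your data, weighting the terms is an easy tweak. A sketch with hypothetical weights (worth tuning), reusing the `request` and `first_pass` CTEs from the query above:
-
-```postgresql
--- favor semantic relevance 2:1 over the star rating boost
-SELECT
-    title,
-    cosine_similarity * 2 + star_rating_score AS final_score
-FROM first_pass
-ORDER BY final_score DESC
-LIMIT 10;
-```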
-
-This is starting to look pretty good! True confessions: I'm really surprised "Empire Strikes Back" is not on this list. What is wrong with people these days?! I'm glad I called "Blade Runner" and "Back to the Future" though. Now that I've got a list catering to my own sensibilities, I need to stop writing code and blog posts and watch some of these! In the next article, we'll look at incorporating more of ~~my preferences~~ a customer's preferences into the search results for effective personalization.
-
-P.S. I'm a little disappointed I didn't recall Aliens, because yeah, it's perfect 80's Sci-Fi, but that series has gone on so long I had associated it all with "vague timeframe". No one is perfect... right? I should probably watch "Plan 9 From Outer Space" & "Forbidden Planet", even though they are both 3 decades too early. I'm sure they are great!
diff --git a/pgml-docs/vector-database.md b/pgml-docs/vector-database.md
deleted file mode 100644
index aa269fa61..000000000
--- a/pgml-docs/vector-database.md
+++ /dev/null
@@ -1,84 +0,0 @@
----
-description: Database that stores and manages vectors
----
-
-# Vector Database
-
-A vector database is a type of database that stores and manages vectors, which are mathematical representations of data points in a multi-dimensional space. Vectors can be used to represent a wide range of data types, including images, text, audio, and numerical data. A vector database is designed to support efficient searching and retrieval of vectors, using methods such as nearest neighbor search, clustering, and indexing. These methods enable applications to find vectors that are similar to a given query vector, which is useful for tasks such as image search, recommendation systems, and natural language processing.
-
-Using a vector database involves three key steps:
-
-1. Creating embeddings
-2. Indexing your embeddings using different algorithms
-3. Querying the index using embeddings for your queries
-
-Let's break down each step in more detail.
-
-### Step 1: Creating embeddings using transformers
-
-To create embeddings for your data, you first need to choose a transformer that can generate embeddings from your input data. Some popular transformer options include BERT, GPT-2, and T5. Once you've selected a transformer, you can use it to generate embeddings for your data.
-
-In the following section, we will demonstrate how to use PostgresML to generate embeddings for a dataset of tweets commonly used in sentiment analysis. To generate the embeddings, we will use the `pgml.embed` function, which was discussed in [embeddings.md](machine-learning/natural-language-processing/embeddings.md "mention"). These embeddings will then be inserted into a table called tweet\_embeddings.
-
-```sql
-SELECT pgml.load_dataset('tweet_eval', 'sentiment');
-
-SELECT *
-FROM pgml.tweet_eval
-LIMIT 10;
-
-CREATE TABLE tweet_embeddings AS
-SELECT text, pgml.embed('distilbert-base-uncased', text) AS embedding
-FROM pgml.tweet_eval;
-
-SELECT * FROM tweet_embeddings LIMIT 2;
-```
-
-_Result_
-
-| text | embedding |
-| ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------- |
-| "QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin" | {-0.1567948312,-0.3149209619,0.2163394839,..} |
-| "Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ" | {-0.0701668188,-0.012231146,0.1304316372,.. } |
-
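-Before indexing, it can be worth a quick sanity check that every row received an embedding of the expected size. A minimal sketch, assuming the embeddings are stored as a plain Postgres array (`real[]`), which is what `pgml.embed` returns:
-
-```sql
--- Confirm all embeddings share one dimensionality (768 for
--- distilbert-base-uncased) before building an index on them.
-SELECT array_length(embedding, 1) AS dimensions, COUNT(*)
-FROM tweet_embeddings
-GROUP BY 1;
-```
-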
-### Step 2: Indexing your embeddings using different algorithms
-
-After you've created embeddings for your data, you need to index them using one or more indexing algorithms. Several options are available, from B-trees to specialized structures for exact k-nearest neighbor (KNN) or approximate nearest neighbor (ANN) search. The right choice depends on your use case and performance requirements. For example, B-trees are a good choice for range queries, while KNN and ANN structures are built for similarity search.
-
-On small datasets (<100k rows), a linear search that compares every row to the query will give sub-second results, which may be fast enough for your use case (a brute-force scan of this kind is sketched after the list below). For larger datasets, you may want to consider the indexing strategies offered by additional extensions:
-
-* [Cube](https://www.postgresql.org/docs/current/cube.html) is a built-in extension that provides a fast indexing strategy for finding similar vectors. By default it has an arbitrary limit of 100 dimensions, unless Postgres is compiled with a larger size.
-* [PgVector](https://github.com/pgvector/pgvector) supports embeddings up to 2000 dimensions out of the box, and provides a fast indexing strategy for finding similar vectors.
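-
-Here is the brute-force scan mentioned above: no index at all, just an `ORDER BY` over a distance expression. This is a sketch that assumes the pgvector extension is installed to provide the `<=>` cosine distance operator; the `::vector` casts convert the `real[]` arrays produced by `pgml.embed`.
-
-```sql
--- Linear scan: compute the distance to every row and sort.
--- Fine for small tables; no index needed.
-WITH query AS (
-    SELECT pgml.embed('distilbert-base-uncased', 'I love this movie')::vector AS embedding
-)
-SELECT text
-FROM tweet_embeddings, query
-ORDER BY tweet_embeddings.embedding::vector <=> query.embedding
-LIMIT 5;
-```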
-
-When indexing your embeddings, it's important to consider the trade-offs between accuracy and speed. Exact algorithms like B-trees provide precise results but may not be as fast as approximate methods like ANN. Similarly, some indexing algorithms require more memory or disk space than others.
-
-In the following example, we create an index on the tweet\_embeddings table using the ivfflat algorithm. ivfflat is a hybrid index that combines an Inverted File (IVF) partitioning scheme with flat, uncompressed storage of the vectors within each partition.
-
-The index is created on the embedding column of the tweet\_embeddings table, which contains the vector embeddings generated from the original tweet dataset. The `vector_cosine_ops` argument specifies the operator class to use. In this case, it's cosine distance, a common method for measuring similarity between vectors.
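-
-For reference, the cosine similarity between two vectors $a$ and $b$ is
-
-$$\text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
-
-and the cosine *distance* used for ordering results is simply $1 - \text{cosine\_similarity}(a, b)$.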
-
-By creating an index on the embedding column, the database can quickly search for and retrieve records that are similar to a given query vector. This can be useful for a variety of machine learning applications, such as similarity search or recommendation systems.
-
-```sql
-CREATE INDEX ON tweet_embeddings USING ivfflat (embedding vector_cosine_ops);
-```
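-
-One caveat: ivfflat can only index a fixed-dimension `vector` column, so if the embeddings were stored as a plain `real[]` array in Step 1, the column needs a conversion first. A hedged sketch, assuming pgvector provides the array-to-vector cast and using the 768 dimensions that `distilbert-base-uncased` produces:
-
-```sql
--- Convert the array column to a typed vector so ivfflat can index it.
-ALTER TABLE tweet_embeddings
-    ALTER COLUMN embedding TYPE vector(768)
-    USING embedding::vector(768);
-
--- lists controls the number of IVF partitions: more lists means faster
--- queries at some cost to recall. pgvector's docs suggest rows / 1000
--- as a starting point for tables under a million rows.
-CREATE INDEX ON tweet_embeddings
-    USING ivfflat (embedding vector_cosine_ops)
-    WITH (lists = 100);
-```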
-
-### Step 3: Querying the index using embeddings for your queries
-
-Once your embeddings have been indexed, you can use them to perform queries against your database. To do this, you'll need to provide a query embedding that represents the query you want to perform. The index will then return the closest matching embeddings from your database, based on the similarity between the query embedding and the stored embeddings.
-
-```sql
-WITH query AS (
-    SELECT pgml.embed('distilbert-base-uncased', 'Star Wars christmas special is on Disney')::vector AS embedding
-)
-SELECT text
-FROM tweet_embeddings, query
-ORDER BY tweet_embeddings.embedding::vector <=> query.embedding
-LIMIT 5;
-```
-
-_Result_
-
-| text |
-| ---------------------------------------------------------------------------------------------- |
-| Happy Friday with Batman animated Series 90S forever! |
-| "Fri Oct 17, Sonic Highways is on HBO tonight, Also new episode of Girl Meets World on Disney" |
-| tfw the 2nd The Hunger Games movie is on Amazon Prime but not the 1st one I didn't watch |
-| 5 RT's if you want the next episode of twilight princess tomorrow |
-| Jurassic Park is BACK! New Trailer for the 4th Movie, Jurassic World - |
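-
-A closing note on operators: the choice matters because an index is only used when the query's operator matches the index's operator class. These are pgvector's distance operators, shown here as an assumption about the extension in use:
-
-```sql
--- <->  Euclidean (L2) distance
--- <=>  cosine distance (matches vector_cosine_ops above)
--- <#>  negative inner product
-SELECT '[1,0]'::vector <=> '[0,1]'::vector AS cosine_distance;  -- 1
-```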