Sql extension docs + fixes #1427

Merged: 5 commits, Apr 29, 2024
Binary file added pgml-cms/docs/.gitbook/assets/architecture_1.png
Binary file added pgml-cms/docs/.gitbook/assets/architecture_2.png
Binary file added pgml-cms/docs/.gitbook/assets/architecture_3.png
Binary file added pgml-cms/docs/.gitbook/assets/performance_1.png
Binary file added pgml-cms/docs/.gitbook/assets/performance_2.png
10 changes: 6 additions & 4 deletions pgml-cms/docs/README.md
@@ -21,16 +21,18 @@ PostgresML allows you to take advantage of the fundamental relationship between

<figure><img src=".gitbook/assets/ml_system.svg" alt="Machine Learning Infrastructure (2.0) by a16z"><figcaption class="mt-2"><p>PostgresML handles all of the functions <a href="https://a16z.com/emerging-architectures-for-modern-data-infrastructure/">described by a16z</a></p></figcaption></figure>

These capabilities are primarily provided by two open-source software projects that may be used independently, but are designed to be used with the rest of the Postgres ecosystem:
These capabilities are primarily provided by two open-source software projects that may be used independently, but are designed to be used together with the rest of the Postgres ecosystem:

* **pgml** - an open source extension for PostgreSQL. It adds support for GPUs and the latest ML & AI algorithms _inside_ the database with a SQL API and no additional infrastructure, networking latency, or reliability costs
* **PgCat** - an open source pooler for PostgreSQL. It abstracts the scalability and reliability concerns of managing a distributed cluster of Postgres databases. Client applications connect only to the pooler, which handles load balancing, sharding, and failover, outside of any single database server.
* [**pgml**](/docs/api/sql-extension/) - an open source extension for PostgreSQL. It adds support for GPUs and the latest ML & AI algorithms _inside_ the database with a SQL API and no additional infrastructure, networking latency, or reliability costs.
* [**PgCat**](/docs/product/pgcat/) - an open source connection pooler for PostgreSQL. It abstracts the scalability and reliability concerns of managing a distributed cluster of Postgres databases. Client applications connect only to the pooler, which handles load balancing, sharding, and failover, outside of any single database server.

<figure><img src=".gitbook/assets/architecture.png" alt="PostgresML architectural diagram"><figcaption></figcaption></figure>

To learn more about how we designed PostgresML, take a look at our [architecture overview](/docs/resources/architecture/).

## Client SDK

The PostgresML team also provides [native language SDKs](https://github.com/postgresml/postgresml/tree/master/pgml-sdks/pgml) which implement best practices for common ML & AI applications. The JavaScript and Python SDKs are generated from a core Rust library, which provides a uniform API, correctness and efficiency across all environments.
The PostgresML team also provides [native language SDKs](/docs/api/client-sdk/) which implement best practices for common ML & AI applications. The JavaScript and Python SDKs are generated from a core Rust library, which provides a uniform API, correctness and efficiency across all environments.

While using the SDK is completely optional, SDK clients can perform advanced machine learning tasks in a single SQL request, without having to transfer additional data, models, hardware or dependencies to the client application.

26 changes: 13 additions & 13 deletions pgml-cms/docs/SUMMARY.md
@@ -16,8 +16,18 @@

* [Overview](api/apis.md)
* [SQL extension](api/sql-extension/README.md)
* [pgml.deploy()](api/sql-extension/pgml.deploy.md)
* [pgml.embed()](api/sql-extension/pgml.embed.md)
* [pgml.transform()](api/sql-extension/pgml.transform/README.md)
* [Fill Mask](api/sql-extension/pgml.transform/fill-mask.md)
* [Question Answering](api/sql-extension/pgml.transform/question-answering.md)
* [Summarization](api/sql-extension/pgml.transform/summarization.md)
* [Text Classification](api/sql-extension/pgml.transform/text-classification.md)
* [Text Generation](api/sql-extension/pgml.transform/text-generation.md)
* [Text-to-Text Generation](api/sql-extension/pgml.transform/text-to-text-generation.md)
* [Token Classification](api/sql-extension/pgml.transform/token-classification.md)
* [Translation](api/sql-extension/pgml.transform/translation.md)
* [Zero-shot Classification](api/sql-extension/pgml.transform/zero-shot-classification.md)
* [pgml.deploy()](api/sql-extension/pgml.deploy.md)
* [pgml.chunk()](api/sql-extension/pgml.chunk.md)
* [pgml.generate()](api/sql-extension/pgml.generate.md)
* [pgml.predict()](api/sql-extension/pgml.predict/README.md)
@@ -29,16 +39,6 @@
* [Data Pre-processing](api/sql-extension/pgml.train/data-pre-processing.md)
* [Hyperparameter Search](api/sql-extension/pgml.train/hyperparameter-search.md)
* [Joint Optimization](api/sql-extension/pgml.train/joint-optimization.md)
* [pgml.transform()](api/sql-extension/pgml.transform/README.md)
* [Fill Mask](api/sql-extension/pgml.transform/fill-mask.md)
* [Question Answering](api/sql-extension/pgml.transform/question-answering.md)
* [Summarization](api/sql-extension/pgml.transform/summarization.md)
* [Text Classification](api/sql-extension/pgml.transform/text-classification.md)
* [Text Generation](api/sql-extension/pgml.transform/text-generation.md)
* [Text-to-Text Generation](api/sql-extension/pgml.transform/text-to-text-generation.md)
* [Token Classification](api/sql-extension/pgml.transform/token-classification.md)
* [Translation](api/sql-extension/pgml.transform/translation.md)
* [Zero-shot Classification](api/sql-extension/pgml.transform/zero-shot-classification.md)
* [pgml.tune()](api/sql-extension/pgml.tune.md)
* [Client SDK](api/client-sdk/README.md)
* [Collections](api/client-sdk/collections.md)
@@ -79,6 +79,8 @@

## Resources

* [Architecture](resources/architecture/README.md)
* [Why PostgresML?](resources/architecture/why-postgresml.md)
* [FAQs](resources/faqs.md)
* [Data Storage & Retrieval](resources/data-storage-and-retrieval/tabular-data.md)
* [Tabular data](resources/data-storage-and-retrieval/tabular-data.md)
@@ -97,8 +99,6 @@
* [Contributing](resources/developer-docs/contributing.md)
* [Distributed Training](resources/developer-docs/distributed-training.md)
* [GPU Support](resources/developer-docs/gpu-support.md)
* [Deploying PostgresML](resources/developer-docs/deploying-postgresml/README.md)
* [Monitoring](resources/developer-docs/deploying-postgresml/monitoring.md)
* [Self-hosting](resources/developer-docs/self-hosting/README.md)
* [Pooler](resources/developer-docs/self-hosting/pooler.md)
* [Building from source](resources/developer-docs/self-hosting/building-from-source.md)
103 changes: 77 additions & 26 deletions pgml-cms/docs/api/sql-extension/pgml.embed.md
@@ -6,48 +6,99 @@ description: >-

# pgml.embed()

Embeddings are a numeric representation of text. They are used to represent words and sentences as vectors, an array of numbers. Embeddings can be used to find similar pieces of text, by comparing the similarity of the numeric vectors using a distance measure, or they can be used as input features for other machine learning models, since most algorithms can't use text directly.

Many pretrained LLMs can be used to generate embeddings from text within PostgresML. You can browse all the [models](https://huggingface.co/models?library=sentence-transformers) available to find the best solution on Hugging Face.
The `pgml.embed()` function generates [embeddings](/docs/use-cases/embeddings/) from text, using in-database models downloaded from Hugging Face. Thousands of [open-source models](https://huggingface.co/models?library=sentence-transformers) are available and new and better ones are being published regularly.

## API

```sql
pgml.embed(
    transformer TEXT, -- huggingface sentence-transformer name
    text TEXT, -- input to embed
    kwargs JSON -- optional arguments (see below)
    transformer TEXT,
    "text" TEXT,
    kwargs JSON
)
```

## Example
| Argument | Description | Example |
|----------|-------------|---------|
| transformer | The name of a Hugging Face embedding model. | `intfloat/e5-large-v2` |
| text | The text to embed. This can be a string or the name of a column from a PostgreSQL table. | `'I am your father, Luke'` |
| kwargs | Additional arguments that are passed to the model. | |

Let's use the `pgml.embed` function to generate embeddings for tweets, so we can find similar ones. We will use the `distilbert-base-uncased` model from :hugging: HuggingFace. This model is a small version of the `bert-base-uncased` model. It is a good choice for short texts like tweets. To start, we'll load a dataset that provides tweets classified into different topics.
### Examples

```sql
SELECT pgml.load_dataset('tweet_eval', 'sentiment');
```
#### Generate embeddings from text

View some tweets and their topics.
Creating an embedding from text is as simple as calling the function with the text you want to embed:

```sql
SELECT *
FROM pgml.tweet_eval
LIMIT 10;
{% tabs %}
{% tab title="SQL" %}

```postgresql
SELECT * FROM pgml.embed(
    'intfloat/e5-small',
    'No, that''s not true, that''s impossible.'
) AS star_wars_embedding;
```

Get a preview of the embeddings for the first 10 tweets. This will also download the model and cache it for reuse, since it's the first time we've used it.
{% endtab %}
{% endtabs %}
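
The width of the embedding depends on the model: `intfloat/e5-small` produces 384-dimensional vectors, which is why the table in the next example declares its column as `vector(384)`. A quick way to check, sketched here under the assumption that the `pgvector` extension (which provides `vector_dims()`) is installed:

```postgresql
-- Cast the returned array to a vector and inspect its dimensionality.
SELECT vector_dims(
    pgml.embed('intfloat/e5-small', 'How wide is this embedding?')::vector
);
```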

```sql
SELECT text, pgml.embed('distilbert-base-uncased', text)
FROM pgml.tweet_eval
LIMIT 10;
#### Generate embeddings from a table

SQL functions can be used as part of a query to insert, update, or even automatically generate column values of any table:

{% tabs %}
{% tab title="SQL" %}

```postgresql
CREATE TABLE star_wars_quotes (
    quote TEXT NOT NULL,
    embedding vector(384) GENERATED ALWAYS AS (
        pgml.embed('intfloat/e5-small', quote)
    ) STORED
);

INSERT INTO
    star_wars_quotes (quote)
VALUES
    ('I find your lack of faith disturbing'),
    ('I''ve got a bad feeling about this.'),
    ('Do or do not, there is no try.');
```

It will take a few minutes to generate the embeddings for the entire dataset. We'll save the results to a new table.
{% endtab %}
{% endtabs %}

```sql
CREATE TABLE tweet_embeddings AS
SELECT text, pgml.embed('distilbert-base-uncased', text) AS embedding
FROM pgml.tweet_eval;
In this example, we're using [generated columns](https://www.postgresql.org/docs/current/ddl-generated-columns.html) to automatically create an embedding of the `quote` column whenever a quote is inserted or updated.
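
If you'd rather manage the column yourself than rely on a generated column, the same call works in a plain `UPDATE`. A minimal sketch, reusing the table above and a hypothetical `embedding_manual` column:

```postgresql
-- Illustrative alternative to the generated column: backfill embeddings by hand.
ALTER TABLE star_wars_quotes ADD COLUMN embedding_manual vector(384);

UPDATE star_wars_quotes
SET embedding_manual = pgml.embed('intfloat/e5-small', quote)::vector;
```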

#### Using embeddings in queries

Once you have embeddings, you can use them in queries to find text with similar semantic meaning:

{% tabs %}
{% tab title="SQL" %}

```postgresql
SELECT
    quote
FROM
    star_wars_quotes
ORDER BY
    pgml.embed(
        'intfloat/e5-small',
        'Feel the force!'
    )::vector <=> embedding
LIMIT 1;
```

{% endtab %}
{% endtabs %}

This query returns the quote whose meaning is closest to `'Feel the force!'`: it generates an embedding of that phrase and compares it against every stored embedding with the cosine distance operator (`<=>`), ordering by the smallest distance.
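
A sequential scan is fine for a handful of rows, but larger tables usually benefit from an approximate nearest-neighbor index. A sketch, assuming the `pgvector` extension (which supplies the `vector` type and `<=>` operator used above, version 0.5.0 or newer for HNSW) is installed:

```postgresql
-- HNSW index using cosine distance, matching the <=> ordering in the query above.
CREATE INDEX ON star_wars_quotes USING hnsw (embedding vector_cosine_ops);
```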

## Performance

The first time `pgml.embed()` is called with a new model, the model is downloaded from Hugging Face and saved in the cache directory. Subsequent calls use the cached model, which is much faster, and if the database connection is kept open, the model stays loaded in memory and is reused across queries.

If a GPU is available, the model will be automatically loaded onto the GPU and the embedding generation will be even faster.
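
A quick way to see the caching behavior is to time the same call twice on one connection. The sketch below uses `psql`'s `\timing` meta-command; the exact numbers depend on your hardware and network:

```postgresql
\timing on

-- First call with a new model: downloads (if necessary) and loads it, so it is slow.
SELECT pgml.embed('intfloat/e5-small', 'warm up the model');

-- Second call on the same connection: the model is already in memory, so it returns quickly.
SELECT pgml.embed('intfloat/e5-small', 'reuse the cached model');
```
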
125 changes: 111 additions & 14 deletions pgml-cms/docs/api/sql-extension/pgml.transform/README.md
@@ -17,37 +17,134 @@ layout:

# pgml.transform()

PostgresML integrates [🤗 Hugging Face Transformers](https://huggingface.co/transformers) to bring state-of-the-art models into the data layer. There are tens of thousands of pre-trained models with pipelines to turn raw inputs into useful results. Many state of the art deep learning architectures have been published and made available for download. You will want to browse all the [models](https://huggingface.co/models) available to find the perfect solution for your [dataset](https://huggingface.co/dataset) and [task](https://huggingface.co/tasks).
The `pgml.transform()` function is the most powerful feature of PostgresML. It integrates open-source large language models, like Llama, Mixtral, and many more, allowing you to perform complex tasks on your data.

We'll demonstrate some of the tasks that are immediately available to users of your database upon installation: [translation](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#translation), [sentiment analysis](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#sentiment-analysis), [summarization](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#summarization), [question answering](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#question-answering) and [text generation](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#text-generation).
The models are downloaded from [🤗 Hugging Face](https://huggingface.co/transformers), which hosts tens of thousands of pre-trained and fine-tuned models for tasks like text generation, question answering, summarization, text classification, and more.

### Examples
## API

All of the tasks and models demonstrated here can be customized by passing additional arguments to the `Pipeline` initializer or call. You'll find additional links to documentation in the examples below.
The `pgml.transform()` function comes in two flavors, task-based and model-based.

The Hugging Face [`Pipeline`](https://huggingface.co/docs/transformers/main\_classes/pipelines) API is exposed in Postgres via:
### Task-based API

```sql
The task-based API automatically chooses a model to use based on the task:

```postgresql
pgml.transform(
    task TEXT,
    args JSONB,
    inputs TEXT[]
)
```

| Argument | Description | Example |
|----------|-------------|---------|
| task | The name of a natural language processing task. | `text-generation` |
| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
| inputs | Array of prompts to pass to the model for inference. | `['Once upon a time...']` |

#### Example

{% tabs %}
{% tab title="SQL" %}

```postgresql
SELECT *
FROM pgml.transform(
    task   => 'translation_en_to_fr',
    inputs => ARRAY['How do I say hello in French?']
);
```

{% endtab %}
{% endtabs %}
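
The `args` parameter works with the task-based form as well. A sketch, with illustrative argument values:

```postgresql
SELECT pgml.transform(
    task   => 'text-generation',
    args   => '{"max_new_tokens": 50}'::JSONB,
    inputs => ARRAY['Once upon a time,']
);
```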

### Model-based API

The model-based API requires the name of the model and the task, passed as a JSON object, which allows it to be more generic:

```postgresql
pgml.transform(
    task TEXT OR JSONB, -- task name or full pipeline initializer arguments
    call JSONB, -- additional call arguments alongside the inputs
    inputs TEXT[] OR BYTEA[] -- inputs for inference
    task JSONB,
    args JSONB,
    inputs TEXT[]
)
```

This is roughly equivalent to the following Python:
| Argument | Description | Example |
|----------|-------------|---------|
| task | Model configuration, including name and task. | `{"task": "text-generation", "model": "mistralai/Mixtral-8x7B-v0.1"}` |
| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
| inputs | Array of prompts to pass to the model for inference. | `['Once upon a time...']` |

#### Example

{% tabs %}
{% tab title="SQL" %}

```postgresql
SELECT pgml.transform(
    task => '{
        "task": "text-generation",
        "model": "TheBloke/zephyr-7B-beta-GPTQ",
        "model_type": "mistral",
        "revision": "main"
    }'::JSONB,
    inputs => ARRAY['AI is going to change the world in the following ways:'],
    args => '{
        "max_new_tokens": 100
    }'::JSONB
);
```

{% endtab %}

{% tab title="Equivalent Python" %}

```python
import transformers

def transform(task, call, inputs):
    return transformers.pipeline(**task)(inputs, **call)

transform(
    {
        "task": "text-generation",
        "model": "TheBloke/zephyr-7B-beta-GPTQ",
        "model_type": "mistral",
        "revision": "main",
    },
    {"max_new_tokens": 100},
    ['AI is going to change the world in the following ways:']
)
```

Most pipelines operate on `TEXT[]` inputs, but some require binary `BYTEA[]` data like audio classifiers. `inputs` can be `SELECT`ed from tables in the database, or they may be passed in directly with the query. The output of this call is a `JSONB` structure that is task specific. See the [Postgres JSON](https://www.postgresql.org/docs/14/functions-json.html) reference for ways to process this output dynamically.
{% endtab %}
{% endtabs %}
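
The output of `pgml.transform()` is task-specific `JSONB`, so it can be unpacked with the standard Postgres JSON operators, and the inputs can come from a table rather than a literal. A sketch, assuming a hypothetical `reviews` table with a `body` column:

```postgresql
-- Classify rows from a table and pull the predicted label out of the JSONB result.
-- The reviews table and body column are illustrative assumptions.
SELECT
    body,
    pgml.transform(
        task   => 'text-classification',
        inputs => ARRAY[body]
    ) -> 0 ->> 'label' AS sentiment
FROM reviews
LIMIT 10;
```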


### Supported tasks

PostgresML currently supports most NLP tasks available on Hugging Face:

| Task | Name | Description |
|------|-------------|---------|
| [Fill mask](fill-mask) | `fill-mask` | Fill in the blank in a sentence. |
| [Question answering](question-answering) | `question-answering` | Answer a question based on a context. |
| [Summarization](summarization) | `summarization` | Summarize a long text. |
| [Text classification](text-classification) | `text-classification` | Classify a text as positive or negative. |
| [Text generation](text-generation) | `text-generation` | Generate text based on a prompt. |
| [Text-to-text generation](text-to-text-generation) | `text-to-text-generation` | Generate text based on an instruction in the prompt. |
| [Token classification](token-classification) | `token-classification` | Classify tokens in a text. |
| [Translation](translation) | `translation` | Translate text from one language to another. |
| [Zero-shot classification](zero-shot-classification) | `zero-shot-classification` | Classify a text without training data. |
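
As one concrete example from the table above, zero-shot classification takes its candidate labels through `args`. A sketch with illustrative labels and input:

```postgresql
SELECT pgml.transform(
    task   => 'zero-shot-classification',
    args   => '{"candidate_labels": ["technology", "sports", "politics"]}'::JSONB,
    inputs => ARRAY['PostgresML runs machine learning models inside the database.']
);
```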


## Performance

!!! tip
Much like `pgml.embed()`, the models used in `pgml.transform()` are downloaded from Hugging Face and cached locally. If the connection to the database is kept open, the model remains in memory, which allows for faster inference on subsequent calls. If you want to free up memory, you can close the connection.

Models will be downloaded and stored locally on disk after the first call. They are also cached per connection to improve repeated calls in a single session. To free that memory, you'll need to close your connection. You may want to establish dedicated credentials and connection pools via [pgcat](https://github.com/levkk/pgcat) or [pgbouncer](https://www.pgbouncer.org/) for larger models that have billions of parameters. You may also pass `{"cache": false}` in the JSON `call` args to prevent this behavior.
## Additional resources

!!!
- [Hugging Face datasets](https://huggingface.co/datasets)
- [Hugging Face tasks](https://huggingface.co/tasks)