diff --git a/README.md b/README.md index 80960740f..760180c70 100644 --- a/README.md +++ b/README.md @@ -19,7 +19,7 @@

- Simple machine learning with + Generative AI and Simple ML with PostgreSQL

@@ -30,78 +30,844 @@

- Train and deploy models to make online predictions using only SQL, with an open source extension for Postgres. Manage your projects and visualize datasets using the built-in dashboard. -

-![PostgresML in practice](pgml-docs/docs/images/console.png) +# Table of contents +- [Introduction](#introduction) +- [Installation](#installation) +- [Getting started](#getting-started) +- [Natural Language Processing](#nlp-tasks) + - [Text Classification](#text-classification) + - [Zero-Shot Classification](#zero-shot-classification) + - [Token Classification](#token-classification) + - [Translation](#translation) + - [Summarization](#summarization) + - [Question Answering](#question-answering) + - [Text Generation](#text-generation) + - [Text-to-Text Generation](#text-to-text-generation) + - [Fill-Mask](#fill-mask) +- [Vector Database](#vector-database) + -The dashboard makes it easy to compare different algorithms or hyperparameters across models and datasets. +# Introduction +PostgresML is a machine learning extension to PostgreSQL that enables you to perform training and inference on text and tabular data using SQL queries. With PostgresML, you can integrate machine learning models directly into your PostgreSQL database and process data where it already lives. -[![PostgresML dashboard](pgml-docs/docs/images/dashboard/models.png)](https://cloud.postgresml.org/) +## Text Data +- Perform natural language processing (NLP) tasks like sentiment analysis, question answering, translation, summarization, and text generation +- Access 1000s of state-of-the-art language models like GPT-2, GPT-J, and GPT-Neo from the :hugs: Hugging Face model hub +- Fine-tune large language models (LLMs) on your own text data for different tasks +- Use your existing PostgreSQL database as a vector database by generating embeddings from text stored in the database. -

- See it in action — cloud.postgresml.org -

+**Translation** -Please see the [quick start instructions](https://postgresml.org/user_guides/setup/quick_start_with_docker/) for general information on installing or deploying PostgresML. A [developer guide](https://postgresml.org/docs/guides/setup/developers) is also available for those who would like to contribute. +*SQL query* -## What's in the box -See the documentation for a complete **[list of functionality](https://postgresml.org/)**. +```sql +SELECT pgml.transform( + 'translation_en_to_fr', + inputs => ARRAY[ + 'Welcome to the future!', + 'Where have you been all this time?' + ] +) AS french; +``` +*Result* -### All your favorite algorithms -Whether you need a simple linear regression, or extreme gradient boosting, we've included support for all classification and regression algorithms in [Scikit Learn](https://scikit-learn.org/) and [XGBoost](https://xgboost.readthedocs.io/) with no extra configuration. +```sql + french +------------------------------------------------------------ -### Managed model deployments -Models can be periodically retrained and automatically promoted to production depending on their key metric. Rollback capability is provided to ensure that you're always able to serve the highest quality predictions, along with historical logs of all deployments for long term study. +[ + {"translation_text": "Bienvenue à l'avenir!"}, + {"translation_text": "Où êtes-vous allé tout ce temps?"} +] +``` -### Online and offline support -Predictions are served via a standard Postgres connection to ensure that your core apps can always access both your data and your models in real time. Pure SQL workflows also enable batch predictions to cache results in native Postgres tables for lookup. -### Instant visualizations -Run standard analysis on your datasets to detect outliers, bimodal distributions, feature correlation, and other common data visualizations on your datasets. Everything is cataloged in the dashboard for easy reference. 
-### Hyperparameter search -Use either grid or random searches with cross validation on your training set to discover the most important knobs to tweak on your favorite algorithm. +**Sentiment Analysis** +*SQL query* -### SQL native vector operations -Vector operations make working with learned embeddings a snap, for things like nearest neighbor searches or other similarity comparisons. +```sql +SELECT pgml.transform( + task => 'text-classification', + inputs => ARRAY[ + 'I love how amazingly simple ML has become!', + 'I hate doing mundane and thankless tasks. ☹️' + ] +) AS positivity; +``` +*Result* +```sql + positivity +------------------------------------------------------ +[ + {"label": "POSITIVE", "score": 0.9995759129524232}, + {"label": "NEGATIVE", "score": 0.9903519749641418} +] +``` -### The performance of Postgres -Since your data never leaves the database, you retain the speed, reliability and security you expect in your foundational stateful services. Leverage your existing infrastructure and expertise to deliver new capabilities. +## Tabular data +- [47+ classification and regression algorithms](https://postgresml.org/docs/guides/training/algorithm_selection) +- [8 - 40X faster inference than HTTP based model serving](https://postgresml.org/blog/postgresml-is-8x-faster-than-python-http-microservices) +- [Millions of transactions per second](https://postgresml.org/blog/scaling-postgresml-to-one-million-requests-per-second) +- [Horizontal scalability](https://github.com/postgresml/pgcat) -### Open source -We're building on the shoulders of giants. These machine learning libraries and Postgres have received extensive academic and industry use, and we'll continue their tradition to build with the community. Licensed under MIT. 
-## Quick Start +**Training a classification model** -1) Clone this repo: +*Training* +```sql +SELECT * FROM pgml.train( + 'Handwritten Digit Image Classifier', + 'classification', + 'pgml.digits', + 'target', + algorithm => 'xgboost' +); +``` -```bash -$ git clone git@github.com:postgresml/postgresml.git +*Inference* +```sql +SELECT pgml.predict( + 'My Classification Project', + ARRAY[0.1, 2.0, 5.0] +) AS prediction; ``` -2) Start dockerized services. PostgresML will run on port 5433, just in case you already have Postgres running: +# Installation +A PostgresML installation consists of three parts: a PostgreSQL database, a Postgres extension for machine learning, and a dashboard app. The extension provides all the machine learning functionality and can be used independently from any SQL IDE. The dashboard app provides an easy-to-use interface for writing SQL notebooks and for running and tracking ML experiments and models. + +## Docker + +Step 1: Clone this repository ```bash -$ cd postgresml && docker-compose up +git clone git@github.com:postgresml/postgresml.git ``` -3) Connect to PostgreSQL in the Docker container with PostgresML installed: +Step 2: Start dockerized services. PostgresML will run on port 5433, just in case you already have Postgres running. You can find Docker installation instructions [here](https://docs.docker.com/desktop/) +```bash +cd postgresml +docker-compose up +``` +Step 3: Connect to Postgres using an SQL IDE or psql ```bash -$ psql postgres://postgres@localhost:5433/pgml_development +psql postgres://postgres@localhost:5433/pgml_development +``` + +## Free trial +If you want to check out the functionality without the hassle of Docker, [sign up for a free PostgresML account](https://postgresml.org/signup). We will provide 5GiB of storage for your data and demo notebooks to help you get started. + +# Getting Started + +## Option 1 +- On a local installation, go to the dashboard app at `http://localhost:8000/` to use SQL notebooks. 
+ +- On the hosted console, click the **Dashboard** button to connect to your instance with SQL notebooks. +![dashboard](pgml-docs/docs/images/dashboard.png) + +- Try one of the pre-built SQL notebooks +![notebooks](pgml-docs/docs/images/notebooks.png) + +## Option 2 +- Use any of these popular tools to connect to PostgresML and write SQL queries + - Apache Superset + - DBeaver + - DataGrip + - Postico 2 + - PopSQL + - Tableau + - Power BI + - Jupyter + - VSCode +## Option 3 +- Connect directly to the database with your favorite programming language + - C++: libpqxx + - C#: Npgsql, Dapper, or Entity Framework Core + - Elixir: Ecto or Postgrex + - Go: pgx, pg or Bun + - Haskell: postgresql-simple + - Java & Scala: JDBC or Slick + - Julia: LibPQ.jl + - Lua: pgmoon + - Node: node-postgres, pg-promise, or Sequelize + - Perl: DBD::Pg + - PHP: Laravel or PHP + - Python: psycopg2, SQLAlchemy, or Django + - R: DBI or dbx + - Ruby: pg or Rails + - Rust: postgres, SQLx or Diesel + - Swift: PostgresNIO or PostgresClientKit + - ... open a PR to add your favorite language and connector. +# NLP Tasks +PostgresML integrates 🤗 Hugging Face Transformers to bring state-of-the-art NLP models into the data layer. There are tens of thousands of pre-trained models with pipelines to turn raw text in your database into useful results. Many state-of-the-art deep learning architectures have been published and made available on the Hugging Face model hub. + +You can call different NLP tasks and customize them using the following SQL query. + +```sql +SELECT pgml.transform( + task => TEXT OR JSONB, -- Pipeline initializer arguments + inputs => TEXT[] OR BYTEA[], -- inputs for inference + args => JSONB -- (optional) arguments to the pipeline. +) +``` +## Text Classification + +Text classification involves assigning a label or category to a given text. Common use cases include sentiment analysis, natural language inference, and the assessment of grammatical correctness. 
+ +![text classification](pgml-docs/docs/images/text-classification.png) + +### Sentiment Analysis +Sentiment analysis is a type of natural language processing technique that involves analyzing a piece of text to determine the sentiment or emotion expressed within it. It can be used to classify a text as positive, negative, or neutral, and has a wide range of applications in fields such as marketing, customer service, and political analysis. + +*Basic usage* +```sql +SELECT pgml.transform( + task => 'text-classification', + inputs => ARRAY[ + 'I love how amazingly simple ML has become!', + 'I hate doing mundane and thankless tasks. ☹️' + ] +) AS positivity; +``` +*Result* +```json +[ + {"label": "POSITIVE", "score": 0.9995759129524232}, + {"label": "NEGATIVE", "score": 0.9903519749641418} +] +``` +The default model used for text classification is a fine-tuned version of DistilBERT-base-uncased that has been specifically optimized for the Stanford Sentiment Treebank dataset (sst2). + + +*Using specific model* + +To use one of the over 19,000 models available on Hugging Face, include the name of the desired model and `text-classification` task as a JSONB object in the SQL query. For example, if you want to use a RoBERTa model trained on around 40,000 English tweets and that has POS (positive), NEG (negative), and NEU (neutral) labels for its classes, include this information in the JSONB object when making your query. + +```sql +SELECT pgml.transform( + inputs => ARRAY[ + 'I love how amazingly simple ML has become!', + 'I hate doing mundane and thankless tasks. 
☹️' + ], + task => '{"task": "text-classification", + "model": "finiteautomata/bertweet-base-sentiment-analysis" + }'::JSONB +) AS positivity; +``` +*Result* +```json +[ + {"label": "POS", "score": 0.992932200431826}, + {"label": "NEG", "score": 0.975599765777588} +] +``` + +*Using industry specific model* + +By selecting a model that has been specifically designed for a particular industry, you can achieve more accurate and relevant text classification. An example of such a model is FinBERT, a pre-trained NLP model that has been optimized for analyzing sentiment in financial text. FinBERT was created by training the BERT language model on a large financial corpus, and fine-tuning it to specifically classify financial sentiment. When using FinBERT, the model will provide softmax outputs for three different labels: positive, negative, or neutral. + +```sql +SELECT pgml.transform( + inputs => ARRAY[ + 'Stocks rallied and the British pound gained.', + 'Stocks making the biggest moves midday: Nvidia, Palantir and more' + ], + task => '{"task": "text-classification", + "model": "ProsusAI/finbert" + }'::JSONB +) AS market_sentiment; +``` + +*Result* +```json +[ + {"label": "positive", "score": 0.8983612656593323}, + {"label": "neutral", "score": 0.8062630891799927} +] +``` + +### Natural Language Inference (NLI) +NLI, or Natural Language Inference, is a type of model that determines the relationship between two texts. The model takes a premise and a hypothesis as inputs and returns a class, which can be one of three types: +- Entailment: This means that the hypothesis is true based on the premise. +- Contradiction: This means that the hypothesis is false based on the premise. +- Neutral: This means that there is no relationship between the hypothesis and the premise. + +The GLUE dataset is the benchmark dataset for evaluating NLI models. There are different variants of NLI models, such as Multi-Genre NLI, Question NLI, and Winograd NLI. 
+ +If you want to use an NLI model, you can find one on the :hugs: Hugging Face model hub. Look for models with `mnli`. + +```sql +SELECT pgml.transform( + inputs => ARRAY[ + 'A soccer game with multiple males playing. Some men are playing a sport.' + ], + task => '{"task": "text-classification", + "model": "roberta-large-mnli" + }'::JSONB +) AS nli; +``` +*Result* +```json +[ + {"label": "ENTAILMENT", "score": 0.98837411403656} +] +``` +### Question Natural Language Inference (QNLI) +The QNLI task involves determining whether a given question can be answered by the information in a provided document. If the answer can be found in the document, the label assigned is "entailment". Conversely, if the answer cannot be found in the document, the label assigned is "not entailment". + +If you want to use a QNLI model, you can find one on the :hugs: Hugging Face model hub. Look for models with `qnli`. + +```sql +SELECT pgml.transform( + inputs => ARRAY[ + 'Where is the capital of France?, Paris is the capital of France.' + ], + task => '{"task": "text-classification", + "model": "cross-encoder/qnli-electra-base" + }'::JSONB +) AS qnli; +``` + +*Result* +```json +[ + {"label": "LABEL_0", "score": 0.9978110194206238} +] +``` + +### Quora Question Pairs (QQP) +The Quora Question Pairs model is designed to evaluate whether two given questions are paraphrases of each other. This model takes the two questions and assigns a binary value as output. LABEL_0 indicates that the questions are paraphrases of each other and LABEL_1 indicates that the questions are not paraphrases. The benchmark dataset used for this task is the Quora Question Pairs dataset within the GLUE benchmark, which contains a collection of question pairs and their corresponding labels. + +If you want to use a QQP model, you can find one on the :hugs: Hugging Face model hub. Look for models with `qqp`. 
+ +```sql +SELECT pgml.transform( + inputs => ARRAY[ + 'Which city is the capital of France?, Where is the capital of France?' + ], + task => '{"task": "text-classification", + "model": "textattack/bert-base-uncased-QQP" + }'::JSONB +) AS qqp; +``` + +*Result* +```json +[ + {"label": "LABEL_0", "score": 0.9988721013069152} +] +``` + +### Grammatical Correctness +Linguistic Acceptability is a task that involves evaluating the grammatical correctness of a sentence. The model used for this task assigns one of two classes to the sentence, either "acceptable" or "unacceptable". LABEL_0 indicates acceptable and LABEL_1 indicates unacceptable. The benchmark dataset used for training and evaluating models for this task is the Corpus of Linguistic Acceptability (CoLA), which consists of a collection of texts along with their corresponding labels. + +If you want to use a grammatical correctness model, you can find them on the :hugs: Hugging Face model hub. Look for models with `cola`. + +```sql +SELECT pgml.transform( + inputs => ARRAY[ + 'I will walk to home when I went through the bus.' + ], + task => '{"task": "text-classification", + "model": "textattack/distilbert-base-uncased-CoLA" + }'::JSONB +) AS grammatical_correctness; +``` +*Result* +```json +[ + {"label": "LABEL_1", "score": 0.9576480388641356} +] ``` -4) Validate your installation: +## Zero-Shot Classification +Zero Shot Classification is a task where the model predicts a class that it hasn't seen during the training phase. This task leverages a pre-trained language model and is a type of transfer learning. Transfer learning involves using a model that was initially trained for one task in a different application. Zero Shot Classification is especially helpful when there is a scarcity of labeled data available for the specific task at hand. 
+ +![zero-shot classification](pgml-docs/docs/images/zero-shot-classification.png) + +In the example provided below, we will demonstrate how to classify a given sentence into a class that the model has not encountered before. To achieve this, we make use of `args` in the SQL query, which allows us to provide `candidate_labels`. You can customize these labels to suit the context of your task. We will use `facebook/bart-large-mnli` model. + +Look for models with `mnli` to use a zero-shot classification model on the :hugs: Hugging Face model hub. ```sql -pgml_development=# SELECT pgml.version(); - - version ---------- - 0.8.1 -(1 row) +SELECT pgml.transform( + inputs => ARRAY[ + 'I have a problem with my iphone that needs to be resolved asap!!' + ], + task => '{ + "task": "zero-shot-classification", + "model": "facebook/bart-large-mnli" + }'::JSONB, + args => '{ + "candidate_labels": ["urgent", "not urgent", "phone", "tablet", "computer"] + }'::JSONB +) AS zero_shot; +``` +*Result* + +```json +[ + { + "labels": ["urgent", "phone", "computer", "not urgent", "tablet"], + "scores": [0.503635, 0.47879, 0.012600, 0.002655, 0.002308], + "sequence": "I have a problem with my iphone that needs to be resolved asap!!" + } +] +``` +## Token Classification +Token classification is a task in natural language understanding, where labels are assigned to certain tokens in a text. Some popular subtasks of token classification include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models can be trained to identify specific entities in a text, such as individuals, places, and dates. PoS tagging, on the other hand, is used to identify the different parts of speech in a text, such as nouns, verbs, and punctuation marks. + +![token classification](pgml-docs/docs/images/token-classification.png) + +### Named Entity Recognition +Named Entity Recognition (NER) is a task that involves identifying named entities in a text. 
These entities can include the names of people, locations, or organizations. The task is completed by labeling each token with a class for each named entity and a class named "O" (for "outside") for tokens that don't belong to any entity. In this task, the input is text, and the output is the annotated text with named entities. + +```sql +SELECT pgml.transform( + inputs => ARRAY[ + 'I am Omar and I live in New York City.' + ], + task => 'token-classification' +) as ner; +``` +*Result* +```json +[[ + {"end": 9, "word": "Omar", "index": 3, "score": 0.997110, "start": 5, "entity": "I-PER"}, + {"end": 27, "word": "New", "index": 8, "score": 0.999372, "start": 24, "entity": "I-LOC"}, + {"end": 32, "word": "York", "index": 9, "score": 0.999355, "start": 28, "entity": "I-LOC"}, + {"end": 37, "word": "City", "index": 10, "score": 0.999431, "start": 33, "entity": "I-LOC"} +]] ``` -See the documentation for a complete guide to **[working with PostgresML](https://postgresml.org/)**. +### Part-of-Speech (PoS) Tagging +PoS tagging is a task that involves identifying the parts of speech, such as nouns, pronouns, adjectives, or verbs, in a given text. In this task, the model labels each word with a specific part of speech. + +Look for models with `pos` to use a PoS tagging model on the :hugs: Hugging Face model hub. +```sql +select pgml.transform( + inputs => array [ + 'I live in Amsterdam.' 
+ ], + task => '{"task": "token-classification", + "model": "vblagoje/bert-english-uncased-finetuned-pos" + }'::JSONB +) as pos; +``` +*Result* +```json +[[ + {"end": 1, "word": "i", "index": 1, "score": 0.999, "start": 0, "entity": "PRON"}, + {"end": 6, "word": "live", "index": 2, "score": 0.998, "start": 2, "entity": "VERB"}, + {"end": 9, "word": "in", "index": 3, "score": 0.999, "start": 7, "entity": "ADP"}, + {"end": 19, "word": "amsterdam", "index": 4, "score": 0.998, "start": 10, "entity": "PROPN"}, + {"end": 20, "word": ".", "index": 5, "score": 0.999, "start": 19, "entity": "PUNCT"} +]] +``` +## Translation +Translation is the task of converting text written in one language into another language. + +![translation](pgml-docs/docs/images/translation.png) + +You have the option to select from over 2000 models available on the Hugging Face hub for translation. + +```sql +select pgml.transform( + inputs => array[ + 'How are you?' + ], + task => '{"task": "translation", + "model": "Helsinki-NLP/opus-mt-en-fr" + }'::JSONB +); +``` +*Result* +```json +[ + {"translation_text": "Comment allez-vous ?"} +] +``` +## Summarization +Summarization involves creating a condensed version of a document that includes the important information while reducing its length. Different models can be used for this task, with some models extracting the most relevant text from the original document, while other models generate completely new text that captures the essence of the original content. + +![summarization](pgml-docs/docs/images/summarization.png) + +```sql +select pgml.transform( + task => '{"task": "summarization", + "model": "sshleifer/distilbart-cnn-12-6" + }'::JSONB, + inputs => array[ + 'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). 
The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.' + ] +); +``` +*Result* +```json +[ + {"summary_text": " Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . The city is the centre and seat of government of the region and province of Île-de-France, or Paris Region . Paris Region has an estimated 18 percent of the population of France as of 2017 ."} + ] +``` +You can control the length of summary_text by passing `min_length` and `max_length` as arguments to the SQL query. + +```sql +select pgml.transform( + task => '{"task": "summarization", + "model": "sshleifer/distilbart-cnn-12-6" + }'::JSONB, + inputs => array[ + 'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.' + ], + args => '{ + "min_length" : 20, + "max_length" : 70 + }'::JSONB +); +``` + +```json +[ + {"summary_text": " Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . City of Paris is centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated 12,174,880, or about 18 percent" + } +] +``` +## Question Answering +Question Answering models are designed to retrieve the answer to a question from a given text, which can be particularly useful for searching for information within a document. 
It's worth noting that some question answering models are capable of generating answers even without any contextual information. + +![question answering](pgml-docs/docs/images/question-answering.png) + +```sql +SELECT pgml.transform( + 'question-answering', + inputs => ARRAY[ + '{ + "question": "Where do I live?", + "context": "My name is Merve and I live in İstanbul." + }' + ] +) AS answer; +``` +*Result* + +```json +{ + "end" : 39, + "score" : 0.9538117051124572, + "start" : 31, + "answer": "İstanbul" +} +``` + + +## Text Generation +Text generation is the task of producing new text, such as filling in incomplete sentences or paraphrasing existing text. It has various use cases, including code generation and story generation. Completion generation models can predict the next word in a text sequence, while text-to-text generation models are trained to learn the mapping between pairs of texts, such as translating between languages. Popular models for text generation include GPT-based models, T5, T0, and BART. These models can be trained to accomplish a wide range of tasks, including text classification, summarization, and translation. + +![text generation](pgml-docs/docs/images/text-generation.png) + +```sql +SELECT pgml.transform( + task => 'text-generation', + inputs => ARRAY[ + 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone' + ] +) AS answer; +``` +*Result* + +```json +[ + [ + {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and eight for the Dragon-lords in their halls of blood.\n\nEach of the guild-building systems is one-man"} + ] +] +``` + +To use a specific model from :hugs: model hub, pass the model name along with task name in task. 
+ +```sql +SELECT pgml.transform( + task => '{ + "task" : "text-generation", + "model" : "gpt2-medium" + }'::JSONB, + inputs => ARRAY[ + 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone' + ] +) AS answer; +``` +*Result* +```json +[ + [{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone.\n\nThis place has a deep connection to the lore of ancient Elven civilization. It is home to the most ancient of artifacts,"}] +] +``` +To make the generated text longer, you can include the argument `max_length` and specify the desired maximum length of the text. + +```sql +SELECT pgml.transform( + task => '{ + "task" : "text-generation", + "model" : "gpt2-medium" + }'::JSONB, + inputs => ARRAY[ + 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone' + ], + args => '{ + "max_length" : 200 + }'::JSONB +) AS answer; +``` +*Result* +```json +[ + [{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Three for the Dwarfs and the Elves, One for the Gnomes of the Mines, and Two for the Elves of Dross.\"\n\nHobbits: The Fellowship is the first book of J.R.R. Tolkien's story-cycle, and began with his second novel - The Two Towers - and ends in The Lord of the Rings.\n\n\nIt is a non-fiction novel, so there is no copyright claim on some parts of the story but the actual text of the book is copyrighted by author J.R.R. Tolkien.\n\n\nThe book has been classified into two types: fantasy novels and children's books\n\nHobbits: The Fellowship is the first book of J.R.R. Tolkien's story-cycle, and began with his second novel - The Two Towers - and ends in The Lord of the Rings.It"}] +] +``` +If you want the model to generate more than one output, you can specify the number of desired output sequences by including the argument `num_return_sequences` in the arguments. 
+ +```sql +SELECT pgml.transform( + task => '{ + "task" : "text-generation", + "model" : "gpt2-medium" + }'::JSONB, + inputs => ARRAY[ + 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone' + ], + args => '{ + "num_return_sequences" : 3 + }'::JSONB +) AS answer; +``` +*Result* +```json +[ + [ + {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and Thirteen for the human-men in their hall of fire.\n\nAll of us, our families, and our people"}, + {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and the tenth for a King! As each of these has its own special story, so I have written them into the game."}, + {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone… What's left in the end is your heart's desire after all!\n\nHans: (Trying to be brave)"} + ] +] +``` +Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high probability word combinations. Beam search achieves this by retaining the num_beams most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set `num_beams > 1` and `early_stopping=True` so that generation is finished when all beam hypotheses reached the EOS token. 
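The difference between greedy search and beam search described above can be illustrated outside the database. The following is a minimal Python sketch, not the Hugging Face implementation: `TOY_MODEL` is a hypothetical hand-written next-token table, and `beam_search` is an illustrative helper name. With one beam the search is greedy; with two beams it finds a sequence whose first token is less likely but whose overall probability is higher.

```python
import math

# Hypothetical toy "language model": maps the last token to a
# probability distribution over the next token.
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "<eos>": 0.1},
    "the": {"dog": 0.4, "cat": 0.3, "<eos>": 0.3},
    "a": {"dog": 0.9, "<eos>": 0.1},
    "dog": {"<eos>": 1.0},
    "cat": {"<eos>": 1.0},
}


def beam_search(model, num_beams, max_steps=10):
    """Keep the num_beams most likely hypotheses at each step.

    Stops early once every kept hypothesis ends in <eos>.
    num_beams=1 is equivalent to greedy search.
    """
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "<eos>":
                # Finished hypotheses are carried over unchanged.
                candidates.append((score, tokens))
                continue
            for token, p in model[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "<eos>" for _, tokens in beams):
            break
    best_score, best_tokens = beams[0]
    return best_tokens[1:], math.exp(best_score)


greedy_tokens, greedy_prob = beam_search(TOY_MODEL, num_beams=1)
beam_tokens, beam_prob = beam_search(TOY_MODEL, num_beams=2)
print(greedy_tokens, round(greedy_prob, 2))
print(beam_tokens, round(beam_prob, 2))
```

In this toy table, greedy search commits to `the` because it is the single most likely first token, while the two-beam search also keeps `a` alive and ends up with the more probable full sequence.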
+ +```sql +SELECT pgml.transform( + task => '{ + "task" : "text-generation", + "model" : "gpt2-medium" + }'::JSONB, + inputs => ARRAY[ + 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone' + ], + args => '{ + "num_beams" : 5, + "early_stopping" : true + }'::JSONB +) AS answer; +``` + +*Result* +```json +[[ + {"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Nine for the Dwarves in their caverns of ice, Ten for the Elves in their caverns of fire, Eleven for the"} +]] +``` +Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text and can help avoid repetitive patterns. In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution: +$$ w_t \sim P(w_t|w_{1:t-1})$$ + + +However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as temperature, top-k, or top-p. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text. + +You can pass `"do_sample": true` in `args` to use sampling methods. It is recommended to alter `temperature` or `top_p`, but not both. 
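To make the effect of `temperature` and `top_p` more concrete, here is a small Python sketch of the two ideas on a toy distribution. It is an illustration under stated assumptions, not the code the model pipeline runs: the `next_word` probabilities are made up, and `apply_temperature`, `top_p_filter`, and `sample` are hypothetical helper names.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale probabilities as p^(1/T) and renormalize.

    This is equivalent to dividing the logits by T before the softmax:
    T < 1 sharpens the distribution, T > 1 flattens it.
    """
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Made-up distribution for the word following "...halls of"
next_word = {"stone": 0.5, "gold": 0.25, "ice": 0.15, "cheese": 0.1}

sharper = apply_temperature(next_word, 0.5)   # favors "stone" even more
flatter = apply_temperature(next_word, 2.0)   # gives unlikely words a better chance
nucleus = top_p_filter(next_word, 0.8)        # drops the unlikely tail ("cheese")

rng = random.Random(0)
print(sample(nucleus, rng))
```

Lowering the temperature and lowering `top_p` both push generation toward the most likely words, which is one reason to tune only one of them at a time.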
+ +*Temperature* +```sql +SELECT pgml.transform( + task => '{ + "task" : "text-generation", + "model" : "gpt2-medium" + }'::JSONB, + inputs => ARRAY[ + 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone' + ], + args => '{ + "do_sample" : true, + "temperature" : 0.9 + }'::JSONB +) AS answer; +``` +*Result* +```json +[[{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, and Thirteen for the Giants and Men of S.A.\n\nThe First Seven-Year Time-Traveling Trilogy is"}]] +``` +*Top p* + +```sql +SELECT pgml.transform( + task => '{ + "task" : "text-generation", + "model" : "gpt2-medium" + }'::JSONB, + inputs => ARRAY[ + 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone' + ], + args => '{ + "do_sample" : true, + "top_p" : 0.8 + }'::JSONB +) AS answer; +``` +*Result* +```json +[[{"generated_text": "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Four for the Elves of the forests and fields, and Three for the Dwarfs and their warriors.\" ―Lord Rohan [src"}]] +``` +## Text-to-Text Generation +Text-to-text generation methods, such as T5, are neural network architectures designed to perform various natural language processing tasks, including summarization, translation, and question answering. T5 is a transformer-based architecture pre-trained on a large corpus of text data using denoising autoencoding. This pre-training process enables the model to learn general language patterns and relationships between different tasks, which can be fine-tuned for specific downstream tasks. During fine-tuning, the T5 model is trained on a task-specific dataset to learn how to perform the specific task. 
+![text-to-text](pgml-docs/docs/images/text-to-text-generation.png)
+
+*Translation*
+```sql
+SELECT pgml.transform(
+    task => '{
+        "task" : "text2text-generation"
+    }'::JSONB,
+    inputs => ARRAY[
+        'translate from English to French: I''m very happy'
+    ]
+) AS answer;
+```
+
+*Result*
+```json
+[
+    {"generated_text": "Je suis très heureux"}
+]
+```
+Similar to other tasks, we can specify a model for text-to-text generation.
+
+```sql
+SELECT pgml.transform(
+    task => '{
+        "task" : "text2text-generation",
+        "model" : "bigscience/T0"
+    }'::JSONB,
+    inputs => ARRAY[
+        'Is the word ''table'' used in the same meaning in the two previous sentences? Sentence A: you can leave the books on the table over there. Sentence B: the tables in this book are very hard to read.'
+    ]
+) AS answer;
+```
+## Fill-Mask
+Fill-mask is a task in which certain words in a sentence are hidden, or "masked", and the objective is to predict the words that belong in the masked positions. Such models are valuable when we want statistical insight into the language used to train the model.
+![fill mask](pgml-docs/docs/images/fill-mask.png)
+
+```sql
+SELECT pgml.transform(
+    task => '{
+        "task" : "fill-mask"
+    }'::JSONB,
+    inputs => ARRAY[
+        'Paris is the <mask> of France.'
+    ]
+) AS answer;
+```
+*Result*
+```json
+[
+    {"score": 0.679, "token": 812, "sequence": "Paris is the capital of France.", "token_str": " capital"},
+    {"score": 0.051, "token": 32357, "sequence": "Paris is the birthplace of France.", "token_str": " birthplace"},
+    {"score": 0.038, "token": 1144, "sequence": "Paris is the heart of France.", "token_str": " heart"},
+    {"score": 0.024, "token": 29778, "sequence": "Paris is the envy of France.", "token_str": " envy"},
+    {"score": 0.022, "token": 1867, "sequence": "Paris is the Capital of France.", "token_str": " Capital"}
+]
+```
+
+# Vector Database
+A vector database is a type of database that stores and manages vectors, which are mathematical representations of data points in a multi-dimensional space. Vectors can represent a wide range of data types, including images, text, audio, and numerical data. A vector database is designed to support efficient search and retrieval of vectors, using methods such as nearest neighbor search, clustering, and indexing. These methods enable applications to find vectors that are similar to a given query vector, which is useful for tasks such as image search, recommendation systems, and natural language processing.
+
+PostgresML lets you use your existing PostgreSQL database as a vector database by generating embeddings from text stored in your tables. To generate embeddings, use the `pgml.embed` function, which takes a transformer name and a text value as input. The function automatically downloads and caches the transformer for future reuse, which saves time and resources.
+
+Using a vector database involves three key steps: creating embeddings, indexing your embeddings using different algorithms, and querying the index using embeddings for your queries. Let's break down each step in more detail.
+
+## Step 1: Creating embeddings using transformers
+To create embeddings for your data, you first need to choose a transformer that can generate embeddings from your input data. Some popular transformer options include BERT, GPT-2, and T5. Once you've selected a transformer, you can use it to generate embeddings for your data.
+
+In the following example, we use PostgresML to generate embeddings for a dataset of tweets commonly used in sentiment analysis. The `pgml.embed` function generates an embedding for each tweet in the dataset, and the embeddings are then stored in a table called `tweet_embeddings`.
+```sql
+-- Load the tweet_eval sentiment dataset into pgml.tweet_eval
+SELECT pgml.load_dataset('tweet_eval', 'sentiment');
+
+SELECT *
+FROM pgml.tweet_eval
+LIMIT 10;
+
+-- Store each tweet alongside its embedding
+CREATE TABLE tweet_embeddings AS
+SELECT text, pgml.embed('distilbert-base-uncased', text) AS embedding
+FROM pgml.tweet_eval;
+
+SELECT * FROM tweet_embeddings LIMIT 2;
+```
+
+*Result*
+
+|text|embedding|
+|----|---------|
+|"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"|{-0.1567948312,-0.3149209619,0.2163394839,..}|
+|"Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ"|{-0.0701668188,-0.012231146,0.1304316372,.. }|
+
+
+## Step 2: Indexing your embeddings using different algorithms
+After you've created embeddings for your data, you need to index them using one or more indexing algorithms. Several types of indexing algorithms are available, including B-trees, k-nearest neighbors (KNN), and approximate nearest neighbors (ANN). The type you choose depends on your use case and performance requirements. For example, B-trees are a good choice for range queries, while KNN and ANN algorithms are more efficient for similarity searches.
+
+On small datasets (<100k rows), a linear search that compares every row to the query will give sub-second results, which may be fast enough for your use case. For larger datasets, you may want to consider the indexing strategies offered by additional extensions.
+
+- Cube is a built-in extension that provides a fast indexing strategy for finding similar vectors. By default it has an arbitrary limit of 100 dimensions, unless Postgres is compiled with a larger size.
+- pgvector supports embeddings up to 2000 dimensions out of the box, and provides a fast indexing strategy for finding similar vectors.
+
+When indexing your embeddings, it's important to consider the trade-offs between accuracy and speed. Exact approaches, such as B-trees or an exhaustive KNN scan, return precise results but may be slower than approximate nearest neighbor (ANN) indexes. Similarly, some indexing algorithms require more memory or disk space than others.
+
+The following example creates an index on the `tweet_embeddings` table using the `ivfflat` algorithm. IVFFlat combines an inverted file (IVF) index, which partitions the vectors into lists around cluster centroids, with flat (uncompressed) storage of the vectors within each list.
+
+The index is created on the `embedding` column of the `tweet_embeddings` table, which contains the vector embeddings generated from the original tweet dataset. The `vector_cosine_ops` argument specifies the operator class to use for the index. Here it selects cosine distance, a common measure of similarity between vectors.
+
+By creating an index on the `embedding` column, the database can quickly search for and retrieve records that are similar to a given query vector. This is useful for a variety of machine learning applications, such as similarity search or recommendation systems.
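+
+For reference, here is a sketch of the measure behind a cosine operator class. The symbols $\mathbf{a}$ (a stored embedding) and $\mathbf{b}$ (a query embedding) are notation assumed here for illustration:
+$$ \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}, \qquad d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \cos(\mathbf{a}, \mathbf{b}) $$
+Cosine similarity depends only on the angle between the two vectors, not on their lengths. A nearest-neighbor search orders rows by the distance $d_{\cos}$, so the most similar embeddings (smallest distance) come first.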
+
+```sql
+CREATE INDEX ON tweet_embeddings USING ivfflat (embedding vector_cosine_ops);
+```
+## Step 3: Querying the index using embeddings for your queries
+Once your embeddings have been indexed, you can use them to perform queries against your database. To do this, you provide a query embedding that represents the text you want to search for. The index then returns the closest matching embeddings from your database, based on the similarity between the query embedding and the stored embeddings.
+
+```sql
+WITH query AS (
+    SELECT pgml.embed('distilbert-base-uncased', 'Star Wars christmas special is on Disney')::vector AS embedding
+)
+-- `<=>` is pgvector's cosine distance operator, matching the vector_cosine_ops index
+SELECT tweet_embeddings.text FROM tweet_embeddings, query
+ORDER BY tweet_embeddings.embedding <=> query.embedding LIMIT 5;
+```
+
+*Result*
+
+|text|
+|----|
+|Happy Friday with Batman animated Series 90S forever!|
+|"Fri Oct 17, Sonic Highways is on HBO tonight, Also new episode of Girl Meets World on Disney"|
+|tfw the 2nd The Hunger Games movie is on Amazon Prime but not the 1st one I didn't watch|
+|5 RT's if you want the next episode of twilight princess tomorrow|
+|Jurassic Park is BACK!
New Trailer for the 4th Movie, Jurassic World -| + + + + + + + + diff --git a/pgml-docs/docs/images/dashboard.png b/pgml-docs/docs/images/dashboard.png new file mode 100644 index 000000000..c86fb4906 Binary files /dev/null and b/pgml-docs/docs/images/dashboard.png differ diff --git a/pgml-docs/docs/images/fill-mask.png b/pgml-docs/docs/images/fill-mask.png new file mode 100644 index 000000000..e7f7281c3 Binary files /dev/null and b/pgml-docs/docs/images/fill-mask.png differ diff --git a/pgml-docs/docs/images/notebooks.png b/pgml-docs/docs/images/notebooks.png new file mode 100644 index 000000000..c00c50468 Binary files /dev/null and b/pgml-docs/docs/images/notebooks.png differ diff --git a/pgml-docs/docs/images/question-answering.png b/pgml-docs/docs/images/question-answering.png new file mode 100644 index 000000000..790f8263d Binary files /dev/null and b/pgml-docs/docs/images/question-answering.png differ diff --git a/pgml-docs/docs/images/sentence-similarity.png b/pgml-docs/docs/images/sentence-similarity.png new file mode 100644 index 000000000..cfcbc43c8 Binary files /dev/null and b/pgml-docs/docs/images/sentence-similarity.png differ diff --git a/pgml-docs/docs/images/summarization.png b/pgml-docs/docs/images/summarization.png new file mode 100644 index 000000000..680bf4b5d Binary files /dev/null and b/pgml-docs/docs/images/summarization.png differ diff --git a/pgml-docs/docs/images/table-question-answering.png b/pgml-docs/docs/images/table-question-answering.png new file mode 100644 index 000000000..e43182e82 Binary files /dev/null and b/pgml-docs/docs/images/table-question-answering.png differ diff --git a/pgml-docs/docs/images/text-classification.png b/pgml-docs/docs/images/text-classification.png new file mode 100644 index 000000000..2f2bec4a0 Binary files /dev/null and b/pgml-docs/docs/images/text-classification.png differ diff --git a/pgml-docs/docs/images/text-generation.png b/pgml-docs/docs/images/text-generation.png new file mode 100644 index 
000000000..e67a3fbb9 Binary files /dev/null and b/pgml-docs/docs/images/text-generation.png differ diff --git a/pgml-docs/docs/images/text-to-text-generation.png b/pgml-docs/docs/images/text-to-text-generation.png new file mode 100644 index 000000000..546d77d35 Binary files /dev/null and b/pgml-docs/docs/images/text-to-text-generation.png differ diff --git a/pgml-docs/docs/images/token-classification.png b/pgml-docs/docs/images/token-classification.png new file mode 100644 index 000000000..bdd219fda Binary files /dev/null and b/pgml-docs/docs/images/token-classification.png differ diff --git a/pgml-docs/docs/images/translation.png b/pgml-docs/docs/images/translation.png new file mode 100644 index 000000000..4f73b2cde Binary files /dev/null and b/pgml-docs/docs/images/translation.png differ diff --git a/pgml-docs/docs/images/zero-shot-classification.png b/pgml-docs/docs/images/zero-shot-classification.png new file mode 100644 index 000000000..eb7d8434d Binary files /dev/null and b/pgml-docs/docs/images/zero-shot-classification.png differ diff --git a/pgml-extension/Cargo.lock b/pgml-extension/Cargo.lock index c9fdb01f1..a731ecb6c 100644 --- a/pgml-extension/Cargo.lock +++ b/pgml-extension/Cargo.lock @@ -28,9 +28,9 @@ dependencies = [ [[package]] name = "anyhow" -version = "1.0.69" +version = "1.0.70" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "224afbd727c3d6e4b90103ece64b8d1b67fbb1973b1046c2281eed3f3803f800" +checksum = "7de8ce5e0f9f8d88245311066a578d72b7af3e7088f32783804676302df237e4" [[package]] name = "approx" @@ -78,13 +78,13 @@ dependencies = [ [[package]] name = "async-trait" -version = "0.1.66" +version = "0.1.68" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b84f9ebcc6c1f5b8cb160f6990096a5c127f423fcb6e1ccc46c370cbdfb75dfc" +checksum = "b9ccdd8f2a161be9bd5c023df56f1b2a0bd1d83872ae53b71a84a12c9bf6e842" dependencies = [ "proc-macro2", "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] 
@@ -317,9 +317,9 @@ checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd" [[package]] name = "clang-sys" -version = "1.6.0" +version = "1.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "77ed9a53e5d4d9c573ae844bfac6872b159cb1d1585a83b29e7a64b7eef7332a" +checksum = "c688fc74432808e3eb684cae8830a86be1d66a2bd58e1f248ed0960a590baf6f" dependencies = [ "glob", "libc", @@ -343,13 +343,12 @@ dependencies = [ [[package]] name = "clap" -version = "4.1.8" +version = "4.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c3d7ae14b20b94cb02149ed21a86c423859cbe18dc7ed69845cace50e52b40a5" +checksum = "046ae530c528f252094e4a77886ee1374437744b2bff1497aa898bbddbbb29b3" dependencies = [ - "bitflags", + "clap_builder", "clap_derive", - "clap_lex", "once_cell", ] @@ -359,46 +358,55 @@ version = "0.10.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "eca953650a7350560b61db95a0ab1d9c6f7b74d146a9e08fb258b834f3cf7e2c" dependencies = [ - "clap 4.1.8", + "clap 4.2.1", "doc-comment", ] +[[package]] +name = "clap_builder" +version = "4.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "223163f58c9a40c3b0a43e1c4b50a9ce09f007ea2cb1ec258a687945b4b7929f" +dependencies = [ + "bitflags", + "clap_lex", +] + [[package]] name = "clap_derive" -version = "4.1.8" +version = "4.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "44bec8e5c9d09e439c4335b1af0abaab56dcf3b94999a936e1bb47b9134288f0" +checksum = "3f9644cd56d6b87dbe899ef8b053e331c0637664e9e21a33dfcdc36093f5c5c4" dependencies = [ "heck", - "proc-macro-error", "proc-macro2", "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] name = "clap_lex" -version = "0.3.2" +version = "0.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "350b9cf31731f9957399229e9b2adc51eeabdfbe9d71d9a0552275fd12710d09" -dependencies = [ - "os_str_bytes", 
-] +checksum = "8a2dd5a6fe8c6e3502f568a6353e5273bbb15193ad9a89e457b9970798efbea1" [[package]] name = "cmake" -version = "0.1.49" +version = "0.1.50" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "db34956e100b30725f2eb215f90d4871051239535632f84fea3bc92722c66b7c" +checksum = "a31c789563b815f77f4250caee12365734369f942439b7defd71e18a48197130" dependencies = [ "cc", ] [[package]] name = "convert_case" -version = "0.5.0" +version = "0.6.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fb4a24b1aaf0fd0ce8b45161144d6f42cd91677fd5940fd431183eb023b3a2b8" +checksum = "ec182b0ca2f35d8fc196cf3404988fd8b8c739a4d270ff118a398feb0cbec1ca" +dependencies = [ + "unicode-segmentation", +] [[package]] name = "core-foundation" @@ -412,15 +420,15 @@ dependencies = [ [[package]] name = "core-foundation-sys" -version = "0.8.3" +version = "0.8.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5827cebf4670468b8772dd191856768aedcb1b0278a04f989f7766351917b9dc" +checksum = "e496a50fda8aacccc86d7529e2c1e0892dbd0f898a6b5645b5561b89c3210efa" [[package]] name = "cpufeatures" -version = "0.2.5" +version = "0.2.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "28d997bd5e24a5928dd43e46dc529867e207907fe0b239c3477d924f7f2ca320" +checksum = "280a9f2d8b3a38871a3c8a46fb80db65e5e5ed97da80c4d08bf27fb63e35e181" dependencies = [ "libc", ] @@ -516,12 +524,12 @@ dependencies = [ [[package]] name = "ctor" -version = "0.1.26" +version = "0.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6d2301688392eb071b0bf1a37be05c469d3cc4dbbd95df672fe28ab021e6a096" +checksum = "dd4056f63fce3b82d852c3da92b08ea59959890813a7f4ce9c0ff85b10cf301b" dependencies = [ "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] @@ -708,13 +716,13 @@ dependencies = [ [[package]] name = "errno" -version = "0.2.8" +version = "0.3.0" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "f639046355ee4f37944e44f60642c6f3a7efa3cf6b78c78a0d989a8ce6c396a1" +checksum = "50d6a0976c999d473fe89ad888d5a284e55366d9dc9038b1ba2aa15128c4afa0" dependencies = [ "errno-dragonfly", "libc", - "winapi", + "windows-sys 0.45.0", ] [[package]] @@ -754,14 +762,14 @@ dependencies = [ [[package]] name = "filetime" -version = "0.2.20" +version = "0.2.21" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8a3de6e8d11b22ff9edc6d916f890800597d60f8b2da1caf2955c274638d6412" +checksum = "5cbc844cecaee9d4443931972e1289c8ff485cb4cc2767cb03ca139ed6885153" dependencies = [ "cfg-if", "libc", - "redox_syscall", - "windows-sys 0.45.0", + "redox_syscall 0.2.16", + "windows-sys 0.48.0", ] [[package]] @@ -818,9 +826,9 @@ checksum = "e6d5a32815ae3f33302d95fdcb2ce17862f8c65363dcfd29360480ba1001fc9c" [[package]] name = "futures-channel" -version = "0.3.27" +version = "0.3.28" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "164713a5a0dcc3e7b4b1ed7d3b433cabc18025386f9339346e8daf15963cf7ac" +checksum = "955518d47e09b25bbebc7a18df10b81f0c766eaf4c4f1cccef2fca5f2a4fb5f2" dependencies = [ "futures-core", "futures-sink", @@ -828,38 +836,38 @@ dependencies = [ [[package]] name = "futures-core" -version = "0.3.27" +version = "0.3.28" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "86d7a0c1aa76363dac491de0ee99faf6941128376f1cf96f07db7603b7de69dd" +checksum = "4bca583b7e26f571124fe5b7561d49cb2868d79116cfa0eefce955557c6fee8c" [[package]] name = "futures-macro" -version = "0.3.27" +version = "0.3.28" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3eb14ed937631bd8b8b8977f2c198443447a8355b6e3ca599f38c975e5a963b6" +checksum = "89ca545a94061b6365f2c7355b4b32bd20df3ff95f02da9329b34ccc3bd6ee72" dependencies = [ "proc-macro2", "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] name = "futures-sink" -version = "0.3.27" 
+version = "0.3.28" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ec93083a4aecafb2a80a885c9de1f0ccae9dbd32c2bb54b0c3a65690e0b8d2f2" +checksum = "f43be4fe21a13b9781a69afa4985b0f6ee0e1afab2c6f454a8cf30e2b2237b6e" [[package]] name = "futures-task" -version = "0.3.27" +version = "0.3.28" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd65540d33b37b16542a0438c12e6aeead10d4ac5d05bd3f805b8f35ab592879" +checksum = "76d3d132be6c0e6aa1534069c705a74a5997a356c0dc2f86a47765e5617c5b65" [[package]] name = "futures-util" -version = "0.3.27" +version = "0.3.28" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3ef6b17e481503ec85211fed8f39d1970f128935ca1f814cd32ac4a6842e84ab" +checksum = "26b01e40b772d54cf6c6d721c1d1abd0647a0106a12ecaa1c186273392a69533" dependencies = [ "futures-core", "futures-macro", @@ -872,9 +880,9 @@ dependencies = [ [[package]] name = "generic-array" -version = "0.14.6" +version = "0.14.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bff49e947297f3312447abdca79f45f4738097cc82b06e72054d2223f601f1b9" +checksum = "85649ca51fd72272d7821adaf274ad91c288277713d9c18820d8499a7ff69e9a" dependencies = [ "typenum", "version_check", @@ -893,13 +901,13 @@ dependencies = [ [[package]] name = "ghost" -version = "0.1.8" +version = "0.1.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "69e0cd8a998937e25c6ba7cc276b96ec5cc3f4dc4ab5de9ede4fb152bdd5c5eb" +checksum = "e77ac7b51b8e6313251737fcef4b1c01a2ea102bde68415b62c0ee9268fec357" dependencies = [ "proc-macro2", "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] @@ -966,6 +974,12 @@ dependencies = [ "libc", ] +[[package]] +name = "hermit-abi" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fed44880c466736ef9a5c5b5facefb5ed0785676d0c02d612db14e54f0d84286" + [[package]] name = "hmac" version = "0.12.1" @@ -1005,9 
+1019,9 @@ checksum = "ce23b50ad8242c51a442f3ff322d56b02f08852c77e4c0b4d3fd684abc89c683" [[package]] name = "indexmap" -version = "1.9.2" +version = "1.9.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1885e79c1fc4b10f0e172c475f458b7f7b93061064d98c3293e98c5ba0c8b399" +checksum = "bd070e393353796e801d209ad339e89596eb4c8d430d18ede6a1cced8fafbd99" dependencies = [ "autocfg", "hashbrown", @@ -1031,9 +1045,9 @@ dependencies = [ [[package]] name = "inventory" -version = "0.3.4" +version = "0.3.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "498ae1c9c329c7972b917506239b557a60386839192f1cf0ca034f345b65db99" +checksum = "7741301a6d6a9b28ce77c0fb77a4eb116b6bc8f3bef09923f7743d059c4157d3" dependencies = [ "ctor", "ghost", @@ -1041,12 +1055,13 @@ dependencies = [ [[package]] name = "io-lifetimes" -version = "1.0.6" +version = "1.0.10" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cfa919a82ea574332e2de6e74b4c36e74d41982b335080fa59d4ef31be20fdf3" +checksum = "9c66c74d2ae7e79a5a8f7ac924adbe38ee42a859c6539ad869eb51f0b52dc220" dependencies = [ + "hermit-abi 0.3.1", "libc", - "windows-sys 0.45.0", + "windows-sys 0.48.0", ] [[package]] @@ -1087,9 +1102,9 @@ checksum = "830d08ce1d1d941e6b30645f1a0eb5643013d835ce3779a5fc208261dbe10f55" [[package]] name = "libc" -version = "0.2.140" +version = "0.2.141" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "99227334921fae1a979cf0bfdfcc6b3e5ce376ef57e16fb6fb3ea2ed6095f80c" +checksum = "3304a64d199bb964be99741b7a14d26972741915b3649639149b2479bb46f4b5" [[package]] name = "libloading" @@ -1218,9 +1233,9 @@ dependencies = [ [[package]] name = "linux-raw-sys" -version = "0.1.4" +version = "0.3.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f051f77a7c8e6957c0696eac88f26b0117e54f52d3fc682ab19397a8812846a4" +checksum = "d59d8c75012853d2e872fb56bc8a2e53718e2cafe1a4c823143141c6d90c322f" 
[[package]] name = "lock_api" @@ -1241,15 +1256,6 @@ dependencies = [ "cfg-if", ] -[[package]] -name = "matchers" -version = "0.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8263075bb86c5a1b1427b5ae862e8889656f126e9f77c484496e8b47cf5c5558" -dependencies = [ - "regex-automata", -] - [[package]] name = "matrixmultiply" version = "0.3.2" @@ -1418,16 +1424,6 @@ dependencies = [ "winapi", ] -[[package]] -name = "nu-ansi-term" -version = "0.46.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "77a8165726e8236064dbb45459242600304b42a5ea24ee2948e18e023bf7ba84" -dependencies = [ - "overload", - "winapi", -] - [[package]] name = "num" version = "0.4.0" @@ -1558,9 +1554,9 @@ dependencies = [ [[package]] name = "openssl" -version = "0.10.48" +version = "0.10.49" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "518915b97df115dd36109bfa429a48b8f737bd05508cf9588977b599648926d2" +checksum = "4d2f106ab837a24e03672c59b1239669a0596406ff657c3c0835b6b7f0f35a33" dependencies = [ "bitflags", "cfg-if", @@ -1573,13 +1569,13 @@ dependencies = [ [[package]] name = "openssl-macros" -version = "0.1.0" +version = "0.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b501e44f11665960c7e7fcf062c7d96a14ade4aa98116c004b2e37b5be7d736c" +checksum = "a948666b637a0f465e8564c73e89d4dde00d72d4d473cc972f390fc3dcee7d9c" dependencies = [ "proc-macro2", "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] @@ -1590,11 +1586,10 @@ checksum = "ff011a302c396a5197692431fc1948019154afc178baf7d8e37367442a4601cf" [[package]] name = "openssl-sys" -version = "0.9.83" +version = "0.9.84" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "666416d899cf077260dac8698d60a60b435a46d57e82acb1be3d0dad87284e5b" +checksum = "3a20eace9dc2d82904039cb76dcf50fb1a0bba071cfd1629720b5d6f1ddba0fa" dependencies = [ - "autocfg", "cc", "libc", "pkg-config", @@ -1607,18 
+1602,6 @@ version = "0.1.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "efa535d5117d3661134dbf1719b6f0ffe06f2375843b13935db186cd094105eb" -[[package]] -name = "os_str_bytes" -version = "6.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9b7820b9daea5457c9f21c69448905d723fbd21136ccf521748f23fd49e723ee" - -[[package]] -name = "overload" -version = "0.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b15813163c1d831bf4a13c3610c05c0d03b39feb07f7e09fa234dac9b15aaf39" - [[package]] name = "owo-colors" version = "3.5.0" @@ -1643,7 +1626,7 @@ checksum = "9069cbb9f99e3a5083476ccb29ceb1de18b9118cafa53e90c9551235de2b9521" dependencies = [ "cfg-if", "libc", - "redox_syscall", + "redox_syscall 0.2.16", "smallvec", "windows-sys 0.45.0", ] @@ -1678,9 +1661,9 @@ checksum = "478c572c3d73181ff3c2539045f6eb99e5491218eae919370993b890cdbdd98e" [[package]] name = "pest" -version = "2.5.6" +version = "2.5.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8cbd939b234e95d72bc393d51788aec68aeeb5d51e748ca08ff3aad58cb722f7" +checksum = "7b1403e8401ad5dedea73c626b99758535b342502f8d1e361f4a2dd952749122" dependencies = [ "thiserror", "ucd-trie", @@ -1733,9 +1716,9 @@ dependencies = [ [[package]] name = "pgx" -version = "0.7.1" +version = "0.7.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fc91f19f84e7c1ba7b25953b042bd487b6e1bbec4c3af09f61a6ac31207ff776" +checksum = "4c2947326bd9a80ec122207f0a59367592f79c053390d6ee961fe17a71ef1e3d" dependencies = [ "atomic-traits", "bitflags", @@ -1760,9 +1743,9 @@ dependencies = [ [[package]] name = "pgx-macros" -version = "0.7.1" +version = "0.7.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1ebfde3c33353d42c2fbcc76bea758b37018b33b1391c93d6402546569914e94" +checksum = "96bf5c70a467b39c1a67a2e1ec7acc4ba8bb32e5bf2d3dead2d89b8442f31ff9" dependencies = [ 
"pgx-sql-entity-graph", "proc-macro2", @@ -1772,9 +1755,9 @@ dependencies = [ [[package]] name = "pgx-pg-config" -version = "0.7.1" +version = "0.7.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e97c27bab88fdb7b94e549b02267ab9595bd9d1043718d6d72bc2d34cf1e3952" +checksum = "020f2f1e0805a60321a375d0f27d771678d59b808bbb5f632c42607a661ab63a" dependencies = [ "dirs 4.0.0", "eyre", @@ -1789,14 +1772,14 @@ dependencies = [ [[package]] name = "pgx-pg-sys" -version = "0.7.1" +version = "0.7.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6b79c48c564bed305d202b852321603107e5f3ac31f25ea2cc4031475f38d0b3" +checksum = "db2371dc1ee5c6f32b9a862fe1706e7ddf862003f167d21d9886b4b4f3f2391e" dependencies = [ "bindgen 0.60.1", "eyre", "libc", - "memoffset 0.6.5", + "memoffset 0.8.0", "once_cell", "pgx-macros", "pgx-pg-config", @@ -1811,29 +1794,24 @@ dependencies = [ [[package]] name = "pgx-sql-entity-graph" -version = "0.7.1" +version = "0.7.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "573a8d8c23be24c39f7b7fbbc7e15d95aa0327acd61ba95c9c9f237fec51f205" +checksum = "5e5b7304665fe3a052dd353a08d013c4d5d780a49be8b60d27c430492b1d442e" dependencies = [ "convert_case", "eyre", "petgraph", "proc-macro2", "quote 1.0.26", - "regex", - "seq-macro", "syn 1.0.109", - "tracing", - "tracing-error", - "tracing-subscriber", "unescape", ] [[package]] name = "pgx-tests" -version = "0.7.1" +version = "0.7.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fc09f25ae560bc4e3308022999416966beda5b60d2957b9ab92bffaf2d6a86c3" +checksum = "b2dfa440a295e0a6bc1a7c87af83dc5e9f7a85c05d28b9fa77f1793f6883f917" dependencies = [ "clap-cargo", "eyre", @@ -1890,9 +1868,9 @@ checksum = "6ac9a59f73473f1b8d852421e59e64809f025994837ef743615c6d0c5b305160" [[package]] name = "postgres" -version = "0.19.4" +version = "0.19.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum 
= "960c214283ef8f0027974c03e9014517ced5db12f021a9abb66185a5751fab0a" +checksum = "0bed5017bc2ff49649c0075d0d7a9d676933c1292480c1d137776fb205b5cd18" dependencies = [ "bytes", "fallible-iterator", @@ -1904,11 +1882,11 @@ dependencies = [ [[package]] name = "postgres-protocol" -version = "0.6.4" +version = "0.6.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "878c6cbf956e03af9aa8204b407b9cbf47c072164800aa918c516cd4b056c50c" +checksum = "78b7fa9f396f51dffd61546fd8573ee20592287996568e6175ceb0f8699ad75d" dependencies = [ - "base64 0.13.1", + "base64 0.21.0", "byteorder", "bytes", "fallible-iterator", @@ -1922,9 +1900,9 @@ dependencies = [ [[package]] name = "postgres-types" -version = "0.2.4" +version = "0.2.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "73d946ec7d256b04dfadc4e6a3292324e6f417124750fc5c0950f981b703a0f1" +checksum = "f028f05971fe20f512bcc679e2c10227e57809a3af86a7606304435bc8896cd6" dependencies = [ "bytes", "fallible-iterator", @@ -1937,35 +1915,11 @@ version = "0.2.17" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5b40af805b3121feab8a3c29f04d8ad262fa8e0561883e7653e024ae4479e6de" -[[package]] -name = "proc-macro-error" -version = "1.0.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "da25490ff9892aab3fcf7c36f08cfb902dd3e71ca0f9f9517bea02a73a5ce38c" -dependencies = [ - "proc-macro-error-attr", - "proc-macro2", - "quote 1.0.26", - "syn 1.0.109", - "version_check", -] - -[[package]] -name = "proc-macro-error-attr" -version = "1.0.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a1be40180e52ecc98ad80b184934baf3d0d29f979574e439af5a55274b35f869" -dependencies = [ - "proc-macro2", - "quote 1.0.26", - "version_check", -] - [[package]] name = "proc-macro2" -version = "1.0.52" +version = "1.0.56" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"1d0e1ae9e836cc3beddd63db0df682593d7e2d3d891ae8c9083d2113e1744224" +checksum = "2b63bdb0cd06f1f4dedf69b254734f9b45af66e4a031e42a7480257d9898b435" dependencies = [ "unicode-ident", ] @@ -2140,6 +2094,15 @@ dependencies = [ "bitflags", ] +[[package]] +name = "redox_syscall" +version = "0.3.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "567664f262709473930a4bf9e51bf2ebf3348f2e748ccc50dea20646858f8f29" +dependencies = [ + "bitflags", +] + [[package]] name = "redox_users" version = "0.4.3" @@ -2147,35 +2110,26 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b033d837a7cf162d7993aded9304e30a83213c648b6e389db233191f891e5c2b" dependencies = [ "getrandom", - "redox_syscall", + "redox_syscall 0.2.16", "thiserror", ] [[package]] name = "regex" -version = "1.7.1" +version = "1.7.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "48aaa5748ba571fb95cd2c85c09f629215d3a6ece942baa100950af03a34f733" +checksum = "8b1f693b24f6ac912f4893ef08244d70b6067480d2f1a46e950c9691e6749d1d" dependencies = [ "aho-corasick", "memchr", "regex-syntax", ] -[[package]] -name = "regex-automata" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6c230d73fb8d8c1b9c0b3135c5142a8acee3a0558fb8db5cf1cb65f8d7862132" -dependencies = [ - "regex-syntax", -] - [[package]] name = "regex-syntax" -version = "0.6.28" +version = "0.6.29" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "456c603be3e8d448b072f410900c09faf164fbce2d480456f50eea6e25f9c848" +checksum = "f162c6dd7b008981e4d40210aca20b4bd0f9b60ca9271061b07f78537722f2e1" [[package]] name = "rmp" @@ -2225,9 +2179,9 @@ dependencies = [ [[package]] name = "rustix" -version = "0.36.9" +version = "0.37.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd5c6ff11fecd55b40746d1995a02f2eb375bf8c00d192d521ee09f42bef37bc" +checksum = 
"2aae838e49b3d63e9274e1c01833cc8139d3fec468c3b84688c628f44b1ae11d" dependencies = [ "bitflags", "errno", @@ -2355,9 +2309,9 @@ checksum = "e6b44e8fc93a14e66336d230954dda83d18b4605ccace8fe09bc7514a71ad0bc" [[package]] name = "serde" -version = "1.0.156" +version = "1.0.159" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "314b5b092c0ade17c00142951e50ced110ec27cea304b1037c6969246c2469a4" +checksum = "3c04e8343c3daeec41f58990b9d77068df31209f2af111e059e9fe9646693065" dependencies = [ "serde_derive", ] @@ -2374,20 +2328,20 @@ dependencies = [ [[package]] name = "serde_derive" -version = "1.0.156" +version = "1.0.159" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d7e29c4601e36bcec74a223228dce795f4cd3616341a4af93520ca1a837c087d" +checksum = "4c614d17805b093df4b147b51339e7e44bf05ef59fba1e45d83500bcfb4d8585" dependencies = [ "proc-macro2", "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] name = "serde_json" -version = "1.0.94" +version = "1.0.95" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1c533a59c9d8a93a09c6ab31f0fd5e5f4dd1b8fc9434804029839884765d04ea" +checksum = "d721eca97ac802aa7777b701877c8004d950fc142651367300d21c1cc0194744" dependencies = [ "indexmap", "itoa", @@ -2395,6 +2349,15 @@ dependencies = [ "serde", ] +[[package]] +name = "serde_spanned" +version = "0.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0efd8caf556a6cebd3b285caf480045fcc1ac04f6bd786b09a6f11af30c4fcf4" +dependencies = [ + "serde", +] + [[package]] name = "sha2" version = "0.10.6" @@ -2520,6 +2483,16 @@ dependencies = [ "winapi", ] +[[package]] +name = "socket2" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bc8d618c6641ae355025c449427f9e96b98abf99a772be3cef6708d15c77147a" +dependencies = [ + "libc", + "windows-sys 0.45.0", +] + [[package]] name = "spin" version = "0.9.8" @@ -2604,6 +2577,17 @@ 
dependencies = [ "unicode-ident", ] +[[package]] +name = "syn" +version = "2.0.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4c9da457c5285ac1f936ebd076af6dac17a61cfe7826f2076b4d015cf47bc8ec" +dependencies = [ + "proc-macro2", + "quote 1.0.26", + "unicode-ident", +] + [[package]] name = "synom" version = "0.11.3" @@ -2615,9 +2599,9 @@ dependencies = [ [[package]] name = "sysinfo" -version = "0.27.8" +version = "0.28.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a902e9050fca0a5d6877550b769abd2bd1ce8c04634b941dbe2809735e1a1e33" +checksum = "b4c2f3ca6693feb29a89724516f016488e9aafc7f37264f898593ee4b942f31b" dependencies = [ "cfg-if", "core-foundation-sys", @@ -2659,15 +2643,15 @@ checksum = "8ae9980cab1db3fceee2f6c6f643d5d8de2997c58ee8d25fb0cc8a9e9e7348e5" [[package]] name = "tempfile" -version = "3.4.0" +version = "3.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "af18f7ae1acd354b992402e9ec5864359d693cd8a79dcbef59f76891701c1e95" +checksum = "b9fbec84f381d5795b08656e4912bec604d162bff9291d6189a78f4c8ab87998" dependencies = [ "cfg-if", "fastrand", - "redox_syscall", + "redox_syscall 0.3.5", "rustix", - "windows-sys 0.42.0", + "windows-sys 0.45.0", ] [[package]] @@ -2701,22 +2685,22 @@ dependencies = [ [[package]] name = "thiserror" -version = "1.0.39" +version = "1.0.40" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a5ab016db510546d856297882807df8da66a16fb8c4101cb8b30054b0d5b2d9c" +checksum = "978c9a314bd8dc99be594bc3c175faaa9794be04a5a5e153caba6915336cebac" dependencies = [ "thiserror-impl", ] [[package]] name = "thiserror-impl" -version = "1.0.39" +version = "1.0.40" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5420d42e90af0c38c3290abcca25b9b3bdf379fc9f55c528f53a269d9c9a267e" +checksum = "f9456a42c5b0d803c8cd86e73dd7cc9edd429499f37a3550d286d5e86720569f" dependencies = [ "proc-macro2", "quote 1.0.26", 
- "syn 1.0.109", + "syn 2.0.13", ] [[package]] @@ -2775,25 +2759,24 @@ checksum = "1f3ccbac311fea05f86f61904b462b55fb3df8837a366dfc601a0161d0532f20" [[package]] name = "tokio" -version = "1.26.0" +version = "1.27.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "03201d01c3c27a29c8a5cee5b55a93ddae1ccf6f08f65365c2c918f8c1b76f64" +checksum = "d0de47a4eecbe11f498978a9b29d792f0d2692d1dd003650c24c76510e3bc001" dependencies = [ "autocfg", "bytes", "libc", - "memchr", "mio", "pin-project-lite", - "socket2", + "socket2 0.4.9", "windows-sys 0.45.0", ] [[package]] name = "tokio-postgres" -version = "0.7.7" +version = "0.7.8" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "29a12c1b3e0704ae7dfc25562629798b29c72e6b1d0a681b6f29ab4ae5e7f7bf" +checksum = "6e89f6234aa8fd43779746012fcf53603cdb91fdd8399aa0de868c2d56b6dde1" dependencies = [ "async-trait", "byteorder", @@ -2808,7 +2791,7 @@ dependencies = [ "pin-project-lite", "postgres-protocol", "postgres-types", - "socket2", + "socket2 0.5.1", "tokio", "tokio-util", ] @@ -2829,11 +2812,36 @@ dependencies = [ [[package]] name = "toml" -version = "0.5.11" +version = "0.7.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f4f7f0dd8d50a853a531c426359045b1998f04219d88799810762cd4ad314234" +checksum = "b403acf6f2bb0859c93c7f0d967cb4a75a7ac552100f9322faf64dc047669b21" dependencies = [ "serde", + "serde_spanned", + "toml_datetime", + "toml_edit", +] + +[[package]] +name = "toml_datetime" +version = "0.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3ab8ed2edee10b50132aed5f331333428b011c99402b5a534154ed15746f9622" +dependencies = [ + "serde", +] + +[[package]] +name = "toml_edit" +version = "0.19.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "239410c8609e8125456927e6707163a3b1fdb40561e4b803bc041f466ccfdc13" +dependencies = [ + "indexmap", + "serde", + "serde_spanned", + 
"toml_datetime", + "winnow", ] [[package]] @@ -2866,7 +2874,6 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "24eb03ba0eab1fd845050058ce5e616558e8f8d8fca633e6b163fe25c797213a" dependencies = [ "once_cell", - "valuable", ] [[package]] @@ -2879,33 +2886,15 @@ dependencies = [ "tracing-subscriber", ] -[[package]] -name = "tracing-log" -version = "0.1.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "78ddad33d2d10b1ed7eb9d1f518a5674713876e97e5bb9b7345a7984fbb4f922" -dependencies = [ - "lazy_static", - "log", - "tracing-core", -] - [[package]] name = "tracing-subscriber" version = "0.3.16" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a6176eae26dd70d0c919749377897b54a9276bd7061339665dd68777926b5a70" dependencies = [ - "matchers", - "nu-ansi-term", - "once_cell", - "regex", "sharded-slab", - "smallvec", "thread_local", - "tracing", "tracing-core", - "tracing-log", ] [[package]] @@ -2916,9 +2905,9 @@ checksum = "497961ef93d974e23eb6f433eb5fe1b7930b659f06d12dec6fc44a8f554c0bba" [[package]] name = "typetag" -version = "0.2.6" +version = "0.2.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "69bf9bd14fed1815295233a0eee76a963283b53ebcbd674d463f697d3bfcae0c" +checksum = "edc3ebbaab23e6cc369cb48246769d031f5bd85f1b28141f32982e3c0c7b33cf" dependencies = [ "erased-serde", "inventory", @@ -2929,13 +2918,13 @@ dependencies = [ [[package]] name = "typetag-impl" -version = "0.2.6" +version = "0.2.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bf9f5f225956dc2254c6c27500deac9390a066b2e8a1a571300627a7c4400a33" +checksum = "bb01b60fcc3f5e17babb1a9956263f3ccd2cadc3e52908400231441683283c1d" dependencies = [ "proc-macro2", "quote 1.0.26", - "syn 1.0.109", + "syn 2.0.13", ] [[package]] @@ -2952,9 +2941,9 @@ checksum = "ccb97dac3243214f8d8507998906ca3e2e0b900bf9bf4870477f125b82e68f6e" [[package]] name = "unicode-bidi" -version = 
"0.3.11" +version = "0.3.13" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "524b68aca1d05e03fdf03fcdce2c6c94b6daf6d16861ddaa7e4f2b6638a9052c" +checksum = "92888ba5573ff080736b3648696b70cafad7d250551175acbaa4e0385b3e1460" [[package]] name = "unicode-ident" @@ -2971,6 +2960,12 @@ dependencies = [ "tinyvec", ] +[[package]] +name = "unicode-segmentation" +version = "1.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1dd624098567895118886609431a7c3b8f516e41d30e0643f03d94592a147e36" + [[package]] name = "unicode-width" version = "0.1.10" @@ -3024,12 +3019,6 @@ dependencies = [ "getrandom", ] -[[package]] -name = "valuable" -version = "0.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "830b7e5d4d90034032940e4ace0d9a9a057e7a45cd94e6c007832e39edb82f6d" - [[package]] name = "vcpkg" version = "0.2.15" @@ -3050,12 +3039,11 @@ checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f" [[package]] name = "walkdir" -version = "2.3.2" +version = "2.3.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "808cf2735cd4b6866113f648b791c6adc5714537bc222d9347bb203386ffda56" +checksum = "36df944cda56c7d8d8b7496af378e6b16de9284591917d307c9b4d313c44e698" dependencies = [ "same-file", - "winapi", "winapi-util", ] @@ -3122,13 +3110,13 @@ version = "0.42.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5a3e1820f08b8513f676f7ab6c1f99ff312fb97b553d30ff4dd86f9f15728aa7" dependencies = [ - "windows_aarch64_gnullvm", - "windows_aarch64_msvc", - "windows_i686_gnu", - "windows_i686_msvc", - "windows_x86_64_gnu", - "windows_x86_64_gnullvm", - "windows_x86_64_msvc", + "windows_aarch64_gnullvm 0.42.2", + "windows_aarch64_msvc 0.42.2", + "windows_i686_gnu 0.42.2", + "windows_i686_msvc 0.42.2", + "windows_x86_64_gnu 0.42.2", + "windows_x86_64_gnullvm 0.42.2", + "windows_x86_64_msvc 0.42.2", ] [[package]] @@ -3137,7 +3125,16 @@ 
version = "0.45.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "75283be5efb2831d37ea142365f009c02ec203cd29a3ebecbc093d52315b66d0" dependencies = [ - "windows-targets", + "windows-targets 0.42.2", +] + +[[package]] +name = "windows-sys" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9" +dependencies = [ + "windows-targets 0.48.0", ] [[package]] @@ -3146,13 +3143,28 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8e5180c00cd44c9b1c88adb3693291f1cd93605ded80c250a75d472756b4d071" dependencies = [ - "windows_aarch64_gnullvm", - "windows_aarch64_msvc", - "windows_i686_gnu", - "windows_i686_msvc", - "windows_x86_64_gnu", - "windows_x86_64_gnullvm", - "windows_x86_64_msvc", + "windows_aarch64_gnullvm 0.42.2", + "windows_aarch64_msvc 0.42.2", + "windows_i686_gnu 0.42.2", + "windows_i686_msvc 0.42.2", + "windows_x86_64_gnu 0.42.2", + "windows_x86_64_gnullvm 0.42.2", + "windows_x86_64_msvc 0.42.2", +] + +[[package]] +name = "windows-targets" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7b1eb6f0cd7c80c79759c929114ef071b87354ce476d9d94271031c0497adfd5" +dependencies = [ + "windows_aarch64_gnullvm 0.48.0", + "windows_aarch64_msvc 0.48.0", + "windows_i686_gnu 0.48.0", + "windows_i686_msvc 0.48.0", + "windows_x86_64_gnu 0.48.0", + "windows_x86_64_gnullvm 0.48.0", + "windows_x86_64_msvc 0.48.0", ] [[package]] @@ -3161,42 +3173,93 @@ version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "597a5118570b68bc08d8d59125332c54f1ba9d9adeedeef5b99b02ba2b0698f8" +[[package]] +name = "windows_aarch64_gnullvm" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91ae572e1b79dba883e0d315474df7305d12f569b400fcf90581b06062f7e1bc" + [[package]] name = 
"windows_aarch64_msvc" version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e08e8864a60f06ef0d0ff4ba04124db8b0fb3be5776a5cd47641e942e58c4d43" +[[package]] +name = "windows_aarch64_msvc" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b2ef27e0d7bdfcfc7b868b317c1d32c641a6fe4629c171b8928c7b08d98d7cf3" + [[package]] name = "windows_i686_gnu" version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c61d927d8da41da96a81f029489353e68739737d3beca43145c8afec9a31a84f" +[[package]] +name = "windows_i686_gnu" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "622a1962a7db830d6fd0a69683c80a18fda201879f0f447f065a3b7467daa241" + [[package]] name = "windows_i686_msvc" version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "44d840b6ec649f480a41c8d80f9c65108b92d89345dd94027bfe06ac444d1060" +[[package]] +name = "windows_i686_msvc" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4542c6e364ce21bf45d69fdd2a8e455fa38d316158cfd43b3ac1c5b1b19f8e00" + [[package]] name = "windows_x86_64_gnu" version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8de912b8b8feb55c064867cf047dda097f92d51efad5b491dfb98f6bbb70cb36" +[[package]] +name = "windows_x86_64_gnu" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ca2b8a661f7628cbd23440e50b05d705db3686f894fc9580820623656af974b1" + [[package]] name = "windows_x86_64_gnullvm" version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "26d41b46a36d453748aedef1486d5c7a85db22e56aff34643984ea85514e94a3" +[[package]] +name = "windows_x86_64_gnullvm" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"7896dbc1f41e08872e9d5e8f8baa8fdd2677f29468c4e156210174edc7f7b953" + [[package]] name = "windows_x86_64_msvc" version = "0.42.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9aec5da331524158c6d1a4ac0ab1541149c0b9505fde06423b02f5ef0106b9f0" +[[package]] +name = "windows_x86_64_msvc" +version = "0.48.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1a515f5799fe4961cb532f983ce2b23082366b898e52ffbce459c86f67c8378a" + +[[package]] +name = "winnow" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ae8970b36c66498d8ff1d66685dc86b91b29db0c7739899012f63a63814b4b28" +dependencies = [ + "memchr", +] + [[package]] name = "wyz" version = "0.5.1" diff --git a/pgml-extension/Cargo.toml b/pgml-extension/Cargo.toml index 77e81eba4..2e68aa17c 100644 --- a/pgml-extension/Cargo.toml +++ b/pgml-extension/Cargo.toml @@ -18,8 +18,8 @@ python = ["pyo3"] cuda = ["xgboost/cuda", "lightgbm/cuda"] [dependencies] -pgx = "=0.7.1" -pgx-pg-sys = "=0.7.1" +pgx = "=0.7.4" +pgx-pg-sys = "=0.7.4" xgboost = { git="https://github.com/postgresml/rust-xgboost.git", branch = "master" } once_cell = "1" rand = "0.8" @@ -48,7 +48,7 @@ flate2 = "1.0" csv = "1.1" [dev-dependencies] -pgx-tests = "=0.7.1" +pgx-tests = "=0.7.4" [profile.dev] panic = "unwind" diff --git a/pgml-extension/Dockerfile b/pgml-extension/Dockerfile index 8a20f2324..25f336260 100644 --- a/pgml-extension/Dockerfile +++ b/pgml-extension/Dockerfile @@ -37,7 +37,7 @@ RUN useradd postgresml -m -s /bin/bash -G sudo RUN echo 'postgresml ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers USER postgresml RUN curl https://sh.rustup.rs -sSf | sh -s -- -y -RUN $HOME/.cargo/bin/cargo install cargo-pgx --version "0.7.1" +RUN $HOME/.cargo/bin/cargo install cargo-pgx --version "0.7.4" RUN $HOME/.cargo/bin/cargo pgx init RUN curl https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/apt.postgresql.org.gpg 
>/dev/null
 RUN sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
diff --git a/pgml-extension/examples/finetune.sql b/pgml-extension/examples/finetune.sql
new file mode 100644
index 000000000..ca52acfea
--- /dev/null
+++ b/pgml-extension/examples/finetune.sql
@@ -0,0 +1,90 @@
+-- Exit on error (psql)
+\set ON_ERROR_STOP true
+\timing on
+
+
+SELECT pgml.load_dataset('kde4', kwargs => '{"lang1": "en", "lang2": "es"}');
+CREATE OR REPLACE VIEW kde4_en_to_es AS
+SELECT translation->>'en' AS "en", translation->>'es' AS "es"
+FROM pgml.kde4
+LIMIT 10;
+SELECT pgml.tune(
+    'Translate English to Spanish',
+    task => 'translation',
+    relation_name => 'kde4_en_to_es',
+    y_column_name => 'es', -- translate into spanish
+    model_name => 'Helsinki-NLP/opus-mt-en-es',
+    hyperparams => '{
+        "learning_rate": 2e-5,
+        "per_device_train_batch_size": 16,
+        "per_device_eval_batch_size": 16,
+        "num_train_epochs": 1,
+        "weight_decay": 0.01,
+        "max_length": 128
+    }',
+    test_size => 0.5,
+    test_sampling => 'last'
+);
+
+SELECT pgml.load_dataset('imdb');
+SELECT pgml.tune(
+    'IMDB Review Sentiment',
+    task => 'text-classification',
+    relation_name => 'pgml.imdb',
+    y_column_name => 'label',
+    model_name => 'distilbert-base-uncased',
+    hyperparams => '{
+        "learning_rate": 2e-5,
+        "per_device_train_batch_size": 16,
+        "per_device_eval_batch_size": 16,
+        "num_train_epochs": 1,
+        "weight_decay": 0.01
+    }',
+    test_size => 0.5,
+    test_sampling => 'last'
+);
+SELECT pgml.predict('IMDB Review Sentiment', 'I love SQL');
+
+SELECT pgml.load_dataset('squad_v2');
+SELECT pgml.tune(
+    'SQuAD Q&A v2',
+    'question-answering',
+    'pgml.squad_v2',
+    'answers',
+    'deepset/roberta-base-squad2',
+    hyperparams => '{
+        "evaluation_strategy": "epoch",
+        "learning_rate": 2e-5,
+        "per_device_train_batch_size": 16,
+        "per_device_eval_batch_size": 16,
+        "num_train_epochs": 1,
+        "weight_decay": 0.01,
+        "max_length": 384,
+        "stride": 128
+    }',
+    test_size => 11873,
+    test_sampling => 'last'
+);
+
+
+SELECT pgml.load_dataset('billsum', kwargs => '{"split": "ca_test"}');
+CREATE OR REPLACE VIEW billsum_training_data
+AS SELECT title || '\n' || text AS text, summary FROM pgml.billsum;
+SELECT pgml.tune(
+    'Legal Summarization',
+    task => 'summarization',
+    relation_name => 'billsum_training_data',
+    y_column_name => 'summary',
+    model_name => 'sshleifer/distilbart-xsum-12-1',
+    hyperparams => '{
+        "learning_rate": 2e-5,
+        "per_device_train_batch_size": 2,
+        "per_device_eval_batch_size": 2,
+        "num_train_epochs": 1,
+        "weight_decay": 0.01,
+        "max_input_length": 1024,
+        "max_summary_length": 128
+    }',
+    test_size => 0.01,
+    test_sampling => 'last'
+);
diff --git a/pgml-extension/examples/transformers.sql b/pgml-extension/examples/transformers.sql
index 36f019350..e7fabbb7d 100644
--- a/pgml-extension/examples/transformers.sql
+++ b/pgml-extension/examples/transformers.sql
@@ -32,89 +32,60 @@ SELECT pgml.transform(
         'Dominic Cobb is the foremost practitioner of the artistic science of extraction, inserting oneself into a subject''s dreams to obtain hidden information without the subject knowing, a concept taught to him by his professor father-in-law, Dr. Stephen Miles. Dom''s associates are Miles'' former students, who Dom requires as he has given up being the dream architect for reasons he won''t disclose. Dom''s primary associate, Arthur, believes it has something to do with Dom''s deceased wife, Mal, who often figures prominently and violently in those dreams, or Dom''s want to "go home" (get back to his own reality, which includes two young children). Dom''s work is generally in corporate espionage. As the subjects don''t want the information to get into the wrong hands, the clients have zero tolerance for failure. Dom is also a wanted man, as many of his past subjects have learned what Dom has done to them. One of those subjects, Mr. Saito, offers Dom a job he can''t refuse: to take the concept one step further into inception, namely planting thoughts into the subject''s dreams without them knowing. Inception can fundamentally alter that person as a being. Saito''s target is Robert Michael Fischer, the heir to an energy business empire, which has the potential to rule the world if continued on the current trajectory. Beyond the complex logistics of the dream architecture of the case and some unknowns concerning Fischer, the biggest obstacles in success for the team become worrying about one aspect of inception which Cobb fails to disclose to the other team members prior to the job, and Cobb''s newest associate Ariadne''s belief that Cobb''s own subconscious, especially as it relates to Mal, may be taking over what happens in the dreams.'
     ]
 );
+SELECT pgml.transform(
+    inputs => ARRAY[
+        'I love how amazingly simple ML has become!',
+        'I hate doing mundane and thankless tasks. ☹️'
+    ],
+    task => '{"task": "text-classification",
+              "model": "finiteautomata/bertweet-base-sentiment-analysis"
+             }'::JSONB
+) AS positivity;

-SELECT pgml.load_dataset('kde4', kwargs => '{"lang1": "en", "lang2": "es"}');
-CREATE OR REPLACE VIEW kde4_en_to_es AS
-SELECT translation->>'en' AS "en", translation->>'es' AS "es"
-FROM pgml.kde4
-LIMIT 10;
-SELECT pgml.tune(
-    'Translate English to Spanish',
-    task => 'translation',
-    relation_name => 'kde4_en_to_es',
-    y_column_name => 'es', -- translate into spanish
-    model_name => 'Helsinki-NLP/opus-mt-en-es',
-    hyperparams => '{
-        "learning_rate": 2e-5,
-        "per_device_train_batch_size": 16,
-        "per_device_eval_batch_size": 16,
-        "num_train_epochs": 1,
-        "weight_decay": 0.01,
-        "max_length": 128
-    }',
-    test_size => 0.5,
-    test_sampling => 'last'
-);
+SELECT pgml.transform(
+    task => 'text-classification',
+    inputs => ARRAY[
+        'I love how amazingly simple ML has become!',
+        'I hate doing mundane and thankless tasks. ☹️'
+    ]
+) AS positivity;
+
+SELECT pgml.transform(
+    inputs => ARRAY[
+        'Stocks rallied and the British pound gained.',
+        'Stocks making the biggest moves midday: Nvidia, Palantir and more'
+    ],
+    task => '{"task": "text-classification",
+              "model": "ProsusAI/finbert"
+             }'::JSONB
+) AS market_sentiment;

-SELECT pgml.load_dataset('imdb');
-SELECT pgml.tune(
-    'IMDB Review Sentiment',
-    task => 'text-classification',
-    relation_name => 'pgml.imdb',
-    y_column_name => 'label',
-    model_name => 'distilbert-base-uncased',
-    hyperparams => '{
-        "learning_rate": 2e-5,
-        "per_device_train_batch_size": 16,
-        "per_device_eval_batch_size": 16,
-        "num_train_epochs": 1,
-        "weight_decay": 0.01
-    }',
-    test_size => 0.5,
-    test_sampling => 'last'
+SELECT pgml.transform(
+    inputs => ARRAY[
+        'I have a problem with my iphone that needs to be resolved asap!!'
+    ],
+    task => '{"task": "zero-shot-classification",
+              "model": "roberta-large-mnli"
+             }'::JSONB,
+    args => '{"candidate_labels": ["urgent", "not urgent", "phone", "tablet", "computer"]
+             }'::JSONB
+) AS zero_shot;
+
+SELECT pgml.transform(
+    inputs => ARRAY[
+        'Hugging Face is a French company based in New York City.'
+    ],
+    task => 'token-classification'
 );
-SELECT pgml.predict('IMDB Review Sentiment', 'I love SQL');

-SELECT pgml.load_dataset('squad_v2');
-SELECT pgml.tune(
-    'SQuAD Q&A v2',
+SELECT pgml.transform(
     'question-answering',
-    'pgml.squad_v2',
-    'answers',
-    'deepset/roberta-base-squad2',
-    hyperparams => '{
-        "evaluation_strategy": "epoch",
-        "learning_rate": 2e-5,
-        "per_device_train_batch_size": 16,
-        "per_device_eval_batch_size": 16,
-        "num_train_epochs": 1,
-        "weight_decay": 0.01,
-        "max_length": 384,
-        "stride": 128
-    }',
-    test_size => 11873,
-    test_sampling => 'last'
-);
+    inputs => ARRAY[
+        '{
+            "question": "Am I dreaming?",
+            "context": "I got a good nights sleep last night and started a simple tutorial over my cup of morning coffee. The capabilities seem unreal, compared to what I came to expect from the simple SQL standard I studied so long ago. The answer is staring me in the face, and I feel the uncanny call from beyond the screen to check the results."
+        }'
+    ]
+) AS answer;


-SELECT pgml.load_dataset('billsum', kwargs => '{"split": "ca_test"}');
-CREATE OR REPLACE VIEW billsum_training_data
-AS SELECT title || '\n' || text AS text, summary FROM pgml.billsum;
-SELECT pgml.tune(
-    'Legal Summarization',
-    task => 'summarization',
-    relation_name => 'billsum_training_data',
-    y_column_name => 'summary',
-    model_name => 'sshleifer/distilbart-xsum-12-1',
-    hyperparams => '{
-        "learning_rate": 2e-5,
-        "per_device_train_batch_size": 2,
-        "per_device_eval_batch_size": 2,
-        "num_train_epochs": 1,
-        "weight_decay": 0.01,
-        "max_input_length": 1024,
-        "max_summary_length": 128
-    }',
-    test_size => 0.01,
-    test_sampling => 'last'
-);
diff --git a/pgml-extension/src/bindings/transformers.py b/pgml-extension/src/bindings/transformers.py
index 43040f42a..da109b9f2 100644
--- a/pgml-extension/src/bindings/transformers.py
+++ b/pgml-extension/src/bindings/transformers.py
@@ -3,7 +3,7 @@ import math
 import shutil
 import time
-
+import numpy as np
 import datasets
 from rouge import Rouge
@@ -40,6 +40,12 @@ __cache_transformer_by_model_id = {}
 __cache_sentence_transformer_by_name = {}
+class NumpyJSONEncoder(json.JSONEncoder):
+    def default(self, obj):
+        if isinstance(obj, np.float32):
+            return float(obj)
+        return super().default(obj)
+
 def transform(task, args, inputs):
     task = json.loads(task)
     args = json.loads(args)
@@ -50,7 +56,7 @@ def transform(task, args, inputs):
     if pipe.task == "question-answering":
         inputs = [json.loads(input) for input in inputs]
-    return json.dumps(pipe(inputs, **args))
+    return json.dumps(pipe(inputs, **args), cls = NumpyJSONEncoder)
 def embed(transformer, text, kwargs):
     kwargs = json.loads(kwargs)
@@ -101,7 +107,7 @@ def tokenize_summarization(tokenizer, max_length, x, y):
     return datasets.Dataset.from_dict(encoding.data)
 def tokenize_text_generation(tokenizer, max_length, y):
-    encoding = tokenizer(y, max_length=max_length)
+    encoding = tokenizer(y, max_length=max_length, truncation=True, padding="max_length")
     return datasets.Dataset.from_dict(encoding.data)
 def tokenize_question_answering(tokenizer, max_length, x, y):
diff --git a/pgml-extension/tests/test.sql b/pgml-extension/tests/test.sql
index db89f25e6..ed14c510d 100644
--- a/pgml-extension/tests/test.sql
+++ b/pgml-extension/tests/test.sql
@@ -27,3 +27,4 @@ SELECT pgml.load_dataset('wine');
 \i examples/multi_classification.sql
 \i examples/regression.sql
 \i examples/vectors.sql
+
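The `NumpyJSONEncoder` added in `transformers.py` relies on the standard `json.JSONEncoder.default` hook, which `json.dumps` calls only for objects it cannot serialize by itself. The sketch below illustrates that pattern in isolation. It is not the extension's code: `decimal.Decimal` stands in for `np.float32` so the example runs without NumPy, and `FallbackJSONEncoder`, `dump_scores`, and `pipeline_output` are hypothetical names chosen for illustration.

```python
import json
from decimal import Decimal


class FallbackJSONEncoder(json.JSONEncoder):
    """Hypothetical stand-in for the NumpyJSONEncoder in the diff."""

    def default(self, obj):
        # json.dumps calls default() only for objects it cannot serialize.
        # Assumption: Decimal plays the role that np.float32 plays in the diff.
        if isinstance(obj, Decimal):
            return float(obj)
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


def dump_scores(results):
    """Serialize pipeline-style results with the fallback encoder."""
    return json.dumps(results, cls=FallbackJSONEncoder)


# Hypothetical result shaped like a text-classification pipeline output.
pipeline_output = [{"label": "POSITIVE", "score": Decimal("0.5")}]
encoded = dump_scores(pipeline_output)
print(encoded)
```

Without the `cls` argument, `json.dumps(pipeline_output)` raises `TypeError`. That failure mode is presumably what the diff addresses for pipelines that return `np.float32` scores.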
