Commit e29c4a3

llm blogpost (#1421)
1 parent fc941c5 commit e29c4a3

35 files changed

+65
-104
lines changed

README.md

Lines changed: 0 additions & 12 deletions
@@ -30,7 +30,6 @@
 </a>
 </p>
 
-
 # Table of contents
 - [Introduction](#introduction)
 - [Installation](#installation)
@@ -87,8 +86,6 @@ SELECT pgml.transform(
 ]
 ```
 
-
-
 **Sentiment Analysis**
 *SQL query*
 
@@ -117,7 +114,6 @@ SELECT pgml.transform(
 - [Millions of transactions per second](https://postgresml.org/blog/scaling-postgresml-to-one-million-requests-per-second)
 - [Horizontal scalability](https://github.com/postgresml/pgcat)
 
-
 **Training a classification model**
 
 *Training*
@@ -242,7 +238,6 @@ SELECT pgml.transform(
 ```
 The default <a href="https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english" target="_blank">model</a> used for text classification is a fine-tuned version of DistilBERT-base-uncased that has been specifically optimized for the Stanford Sentiment Treebank dataset (sst2).
 
-
 *Using specific model*
 
 To use one of the over 19,000 models available on Hugging Face, include the name of the desired model and `text-classification` task as a JSONB object in the SQL query. For example, if you want to use a RoBERTa <a href="https://huggingface.co/models?pipeline_tag=text-classification" target="_blank">model</a> trained on around 40,000 English tweets and that has POS (positive), NEG (negative), and NEU (neutral) labels for its classes, include this information in the JSONB object when making your query.
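As a hedged sketch of the JSONB syntax this hunk describes: the specific model name below (`cardiffnlp/twitter-roberta-base-sentiment`) is our assumption about which RoBERTa tweet model is meant, and the `task`/`inputs` parameter names follow the `pgml.transform` calls shown elsewhere in this README.

```sql
-- Pass the task and model as a JSONB object; the model name here is illustrative.
SELECT pgml.transform(
    task   => '{
        "task": "text-classification",
        "model": "cardiffnlp/twitter-roberta-base-sentiment"
    }'::JSONB,
    inputs => ARRAY['I love how amazingly simple ML has become!']
) AS classification;
```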
@@ -681,7 +676,6 @@ SELECT pgml.transform(
 Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text, as well as avoiding repetitive patterns. In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution:
 $$ w_t \sim P(w_t|w_{1:t-1})$$
 
-
 However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as temperature, top-k, or top-p. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text.
 
 You can pass `do_sample = True` in the arguments to use sampling methods. It is recommended to alter `temperature` or `top_p` but not both.
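A minimal sketch of passing sampling arguments, following the `args` convention used by `pgml.transform` in this README; the model name and parameter values are illustrative assumptions, not part of the original text.

```sql
-- Enable sampling and tune temperature (the hunk above recommends
-- adjusting temperature OR top_p, not both).
SELECT pgml.transform(
    task   => '{"task": "text-generation", "model": "gpt2"}'::JSONB,
    inputs => ARRAY['Three Rings for the Elven-kings'],
    args   => '{"do_sample": true, "temperature": 0.7, "max_new_tokens": 40}'::JSONB
);
```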
@@ -821,7 +815,6 @@ SELECT * from tweet_embeddings limit 2;
 |"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"|{-0.1567948312,-0.3149209619,0.2163394839,..}|
 |"Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ"|{-0.0701668188,-0.012231146,0.1304316372,.. }|
 
-
 ## Step 2: Indexing your embeddings using different algorithms
 After you've created embeddings for your data, you need to index them using one or more indexing algorithms. There are several different types of indexing algorithms available, including B-trees, k-nearest neighbors (KNN), and approximate nearest neighbors (ANN). The specific type of indexing algorithm you choose will depend on your use case and performance requirements. For example, B-trees are a good choice for range queries, while KNN and ANN algorithms are more efficient for similarity searches.
 
@@ -860,7 +853,6 @@ SELECT * FROM items, query ORDER BY items.embedding <-> query.embedding LIMIT 5;
 |5 RT's if you want the next episode of twilight princess tomorrow|
 |Jurassic Park is BACK! New Trailer for the 4th Movie, Jurassic World -|
 
-
 <!-- ## Sentence Similarity
 Sentence Similarity involves determining the degree of similarity between two texts. To accomplish this, Sentence similarity models convert the input texts into vectors (embeddings) that encapsulate semantic information, and then measure the proximity (or similarity) between the vectors. This task is especially beneficial for tasks such as information retrieval and clustering/grouping.
 ![sentence similarity](pgml-cms/docs/images/sentence-similarity.png)
@@ -869,7 +861,6 @@ Sentence Similarity involves determining the degree of similarity between two te
 <!-- # Regression
 # Classification -->
 
-
 # LLM Fine-tuning
 
 In this section, we will provide a step-by-step walkthrough for fine-tuning a Language Model (LLM) for different tasks.
@@ -1036,7 +1027,6 @@ Fine-tuning a language model requires careful consideration of training paramete
 * hub_token: Your Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Replace "YOUR_HUB_TOKEN" with the actual token.
 * push_to_hub: A boolean flag indicating whether to push the model to the Hugging Face Model Hub after fine-tuning.
 
-
 #### 5.3 Monitoring
 During training, metrics like loss and gradient norm will be printed as info and also logged in the pgml.logs table. Below is a snapshot of such output.
 
@@ -1151,7 +1141,6 @@ Here is an example pgml.transform call for real-time predictions on the newly mi
 Time: 175.264 ms
 ```
 
-
 **Batch predictions**
 
 ```sql
@@ -1247,7 +1236,6 @@ SELECT pgml.tune(
 
 By following these steps, you can effectively restart training from a previously trained model, allowing for further refinement and adaptation of the model based on new requirements or insights. Adjust parameters as needed for your specific use case and dataset.
 
-
 ## 8. Hugging Face Hub vs. PostgresML as Model Repository
 We utilize the Hugging Face Hub as the primary repository for fine-tuning Large Language Models (LLMs). Leveraging the HF hub offers several advantages:

pgml-apps/pgml-chat/README.md

Lines changed: 0 additions & 5 deletions
@@ -14,7 +14,6 @@ Before you begin, make sure you have the following:
 - Python version >=3.8
 - (Optional) OpenAI API key
 
-
 # Getting started
 1. Create a virtual environment and install `pgml-chat` using `pip`:
 ```bash
@@ -104,7 +103,6 @@ model performance, as well as integrated notebooks for rapid iteration. Postgres
 If you have any further questions or need more information, please feel free to send an email to team@postgresml.org or join the PostgresML Discord community at https://discord.gg/DmyJP3qJ7U.
 ```
 
-
 ### Slack
 
 **Setup**
@@ -128,7 +126,6 @@ Once the slack app is running, you can interact with the chatbot on Slack as sho
 
 ![Slack Chatbot](./images/slack_screenshot.png)
 
-
 ### Discord
 
 **Setup**
@@ -194,8 +191,6 @@ pip install .
 4. Check the [roadmap](#roadmap) for features that you would like to work on.
 5. If you are looking for features that are not included here, please open an issue and we will add it to the roadmap.
 
-
-
 # Roadmap
 - ~~Use a collection for chat history that can be retrieved and used to generate responses.~~
 - Support for file formats like rst, html, pdf, docx, etc.
(binary file changed, 942 KB; content not shown)

pgml-cms/blog/SUMMARY.md

Lines changed: 5 additions & 4 deletions
@@ -1,13 +1,14 @@
 # Table of contents
 
 * [Home](README.md)
-* [Meet us at the 2024 Postgres Conference!](meet-us-at-the-2024-postgres-conference.md)
-* [The 1.0 SDK is Here](the-1.0-sdk-is-here.md)
-* [Using PostgresML with Django and embedding search](using-postgresml-with-django-and-embedding-search.md)
-* [PostgresML is going multicloud](postgresml-is-going-multicloud.md)
 * [Introducing the OpenAI Switch Kit: Move from closed to open-source AI in minutes](introducing-the-openai-switch-kit-move-from-closed-to-open-source-ai-in-minutes.md)
 * [Speeding up vector recall 5x with HNSW](speeding-up-vector-recall-5x-with-hnsw.md)
 * [How-to Improve Search Results with Machine Learning](how-to-improve-search-results-with-machine-learning.md)
+* [LLMs are commoditized; data is the differentiator](llms-are-commoditized-data-is-the-differentiator.md)
+* [PostgresML is going multicloud](postgresml-is-going-multicloud.md)
+* [The 1.0 SDK is Here](the-1.0-sdk-is-here.md)
+* [Using PostgresML with Django and embedding search](using-postgresml-with-django-and-embedding-search.md)
+* [Meet us at the 2024 Postgres Conference!](meet-us-at-the-2024-postgres-conference.md)
 * [pgml-chat: A command-line tool for deploying low-latency knowledge-based chatbots](pgml-chat-a-command-line-tool-for-deploying-low-latency-knowledge-based-chatbots-part-i.md)
 * [Announcing Support for AWS us-east-1 Region](announcing-support-for-aws-us-east-1-region.md)
 * [LLM based pipelines with PostgresML and dbt (data build tool)](llm-based-pipelines-with-postgresml-and-dbt-data-build-tool.md)

pgml-cms/blog/announcing-support-for-aws-us-east-1-region.md

Lines changed: 0 additions & 4 deletions
@@ -27,12 +27,8 @@ To demonstrate the impact of moving the data closer to your application, we've c
 
 <figure><img src=".gitbook/assets/image (8).png" alt=""><figcaption></figcaption></figure>
 
-\\
-
 <figure><img src=".gitbook/assets/image (9).png" alt=""><figcaption></figcaption></figure>
 
-\\
-
 ## Using the New Region
 
 To take advantage of latency savings, you can [deploy a dedicated PostgresML database](https://postgresml.org/signup) in `us-east-1` today. We make it as simple as filling out a very short form and clicking "Create database".

pgml-cms/blog/data-is-living-and-relational.md

Lines changed: 0 additions & 2 deletions
@@ -56,6 +56,4 @@ Meanwhile, denormalized datasets:
 
 We think it’s worth attempting to move the machine learning process and modern data architectures beyond the status quo. To that end, we’re building the PostgresML Gym, a free offering, to provide a test bed for real-world ML experimentation in a Postgres database. Your personal Gym will include the PostgresML dashboard, several tutorial notebooks to get you started, and access to your own personal PostgreSQL database, supercharged with our machine learning extension.
 
-
-
 Many thanks and ❤️ to all those who are supporting this endeavor. We’d love to hear feedback from the broader ML and Engineering community about applications and other real world scenarios to help prioritize our work.

pgml-cms/blog/generating-llm-embeddings-with-open-source-models-in-postgresml.md

Lines changed: 0 additions & 2 deletions
@@ -216,8 +216,6 @@ For comparison, it would cost about $299 to use OpenAI's cheapest embedding mode
 | GPU | 17ms | $72 | 6 hours |
 | OpenAI | 300ms | $299 | millennia |
 
-\\
-
 You can also find embedding models that outperform OpenAI's `text-embedding-ada-002` model across many different tests on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It's always best to do your own benchmarking with your data, models, and hardware to find the best fit for your use case.
 
 > _HTTP requests to a different datacenter cost more time and money for lower reliability than co-located compute and storage._
pgml-cms/blog/llms-are-commoditized-data-is-the-differentiator.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# LLMs are Commoditized; Data is the Differentiator
+
+<div align="left">
+
+<figure><img src=".gitbook/assets/montana.jpg" alt="Author" width="100"><figcaption></figcaption></figure>
+
+</div>
+
+Montana Low
+
+April 14, 2024
+
+## Introduction
+
+Last year, OpenAI’s GPT-4 launched to great fanfare and was widely hailed as the arrival of AI. Last week, Meta’s Llama 3 surpassed the launch performance of GPT-4, making AI truly available to all with an open-weight model.
+
+The closed-source GPT-4 is rumored to have more than 1 trillion parameters, making it more than 10x larger and more expensive to operate than the latest 70 billion parameter open-weight model from Meta. Yet the smaller open-weight model achieves indistinguishable response quality when judged by English-speaking human evaluators in side-by-side comparisons. Meta is still training a larger 405B version of Llama 3, and plans to release the weights to the community in the next couple of months.
+
+Open-weight models are not only leading in high-end performance; further optimized and scaled-down open-weight versions are also replacing many of the tasks that only proprietary vendors could serve last year. Mistral, Qwen, Yi, and a host of community members regularly contribute high-quality fine-tuned models optimized for specific tasks at a fraction of the operational cost.
+
+<figure><img src=".gitbook/assets/open-weight-models.png"><figcaption>GPT-4 progress has stagnated across recent updates. We look forward to continuing the trend lines when Llama 3 405B and other models are tested soon.</figcaption></figure>
+
+## Increasing Complexity
+
+At the same time, few of the thinly implemented LLM wrapper applications survived their debut last year. Quality, latency, security, complexity, and other concerns have stymied many efforts.
+
+The machine learning infrastructure required to deliver value continues to grow increasingly complex, despite (or perhaps because of) advances on multiple fronts. Tree-based approaches still outperform LLMs on tabular data. Older encoder models can handle tasks like sentiment analysis orders of magnitude more efficiently. LLMs and vector databases are just two of the many commoditized components of the machine learning stack, part of a toolkit that continues to grow.
+
+<figure><img src=".gitbook/assets/machine-learning-platform.png"><figcaption>Original diagram credit to a16z.com</figcaption></figure>
+
+The one aspect that remains consistent is that data differentiates open-source algorithms and models. In the modern age of LLMs, fine-tuning, RAG, re-ranking, and RLHF all require data. Implementing high-quality search, personalization, recommendation, anomaly detection, forecasting, classification, and many more use cases all depends on the data.
+
+The hard part of AI & ML systems has always been managing that data. Vastly more engineers have a full-time job managing data pipelines than managing models. Vastly more money is spent on data management systems than on LLMs, and this will continue to be the case, because data is the bespoke differentiator.
+
+Getting the data to the models in a timely manner often requires multiple teams across multiple disciplines collaborating for multiple quarters. When the landscape changes as quickly as modern AI & ML does, many applications are out of date before they launch, and unmaintainable in the long term. Unfortunately for those teams, the speed of innovation is only increasing.
+
+Keeping up with the latest innovations in just one small area of the field is a full-time job, and wiring all of those innovations together with ever-changing business requirements is several other people’s. That is the force that created the previous diagram, with its profusion of siloed solutions and interconnections. Only the most lucrative businesses can afford the engineers and services required by the status quo.
+
+### _Move models to the data, rather than constantly pulling data to the models_
+
+In-database machine learning represents a strategic shift to leverage data more effectively. By enabling machine learning operations directly within database environments, even organizations outside the “magnificent seven” can build real-world applications that are more efficient, effective, and reactive to real-time data changes. How?
+
+- *Reduced engineering overhead:* Eliminate the need for a fleet of engineers managing data pipelines full-time.
+- *Increased efficiency:* Reduce the number of external network calls between your data and the models, which are costly in speed, spend, and uptime.
+- *Enhanced security:* No need to send your data to multiple third parties, or to worry about new attack vectors in unproven technology.
+- *Scalability:* Store and scale your data on a proven platform that handles millions of requests per second and billion-row datasets.
+- *Flexibility:* Open-weight models on an open-source platform give you greater control over upgrades, use cases, and deployment options.
+
+## How PostgresML fits in
+We built PostgresML after a series of hard lessons learned while building (and re-building) and then scaling the machine learning platform at Instacart during one of the company’s highest-ever growth periods. At the end of the day, nothing worked better than building it all on a trusted, 35-year-old RDBMS. That’s why I’m confident that in-database machine learning is the future of real-world AI applications.
+PostgresML brings AI & ML capabilities directly into a PostgreSQL database. It allows users to train, deploy, and predict using models inside the database. It offers all the benefits of in-database machine learning, packaged in a few easy-to-access ways. You can use our open-source extension or our hosted cloud. You can get started quickly with SDKs in Python and JavaScript, or you can get complete AI & ML capabilities with just a few SQL calls. That means generating embeddings, performing vector operations, and using transformers for NLP, all directly where your data resides. Real-world applications range from predicting customer behavior to automating financial forecasts.
+
+<figure><img src=".gitbook/assets/machine-learning-platform.png"><figcaption>Original diagram credit to a16z.com</figcaption></figure>
+
+## Conclusion
+The practical benefits of in-database machine learning are many, and we built PostgresML to deliver those benefits in the simplest way. By running LLMs and other predictive models inside the database, PostgresML enhances the agility and performance of software engineering teams. For developers, this means less context switching and greater ease of use, as they can manage data and model training in the environment they are already familiar with. Users benefit from reduced latency and improved accuracy in their predictive models. Organizations benefit from more performant applications, and also from the flexibility of a platform that can be updated with the latest models once a week rather than once a year.
+Feel free to give PostgresML a try and let us know what you think. We’re open source, and we welcome contributions from the community, especially in the rapidly evolving ML/AI landscape.
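The new post's claim of "complete AI & ML capabilities with just a few SQL calls" can be sketched as follows; the embedding model name is illustrative, and the `pgml.embed` signature is assumed to mirror the `pgml.transform` convention used in the README diffed above rather than quoted from this commit.

```sql
-- Generate an embedding where the data lives (model name is illustrative).
SELECT pgml.embed('intfloat/e5-small-v2', 'PostgresML rocks') AS embedding;

-- Run an NLP transformer in the same database.
SELECT pgml.transform(
    task   => '{"task": "text-classification"}'::JSONB,
    inputs => ARRAY['PostgresML rocks']
) AS sentiment;
```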

pgml-cms/blog/meet-us-at-the-2024-postgres-conference.md

Lines changed: 0 additions & 2 deletions
@@ -22,7 +22,6 @@ Why should you care? It's not every day you get to dive headfirst into the world
 Save 25% on your ticket with our discount code: 2024\_POSTGRESML\_25
 {% endhint %}
 
-\
 PostgresML CEO and founder, Montana Low, will kick off the event on April 17th with a keynote about navigating the confluence of hardware evolution and machine learning technology.
 
 We’ll also be hosting a masterclass in retrieval augmented generation (RAG) on April 18th. Our own Silas Marvin will give hands-on guidance to equip you with the ability to implement RAG directly within your database.
@@ -37,4 +36,3 @@ So, why sit on the sidelines when you could be right in the thick of it, soaking
 
 See you there!
 
-\\
