
Commit d59367e

Lev Kokotov and gitbook-bot authored and committed
GITBOOK-63: change request with no subject merged in GitBook
1 parent f8d3414 commit d59367e

7 files changed: +282 -3 lines changed

pgml-docs/docs/guides/SUMMARY.md

Lines changed: 3 additions & 0 deletions
@@ -49,6 +49,9 @@
* [Tuning vector recall while generating query embeddings in the database](use-cases/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md)
* [Personalize embedding results with application data in your database](use-cases/personalize-embedding-results-with-application-data-in-your-database.md)
* [LLM based pipelines with PostgresML and dbt (data build tool)](use-cases/llm-based-pipelines-with-postgresml-and-dbt-data-build-tool.md)
* [Data Storage & Retrieval](data-storage-and-retrieval/README.md)
* [Tabular data](data-storage-and-retrieval/tabular-data.md)
* [Vectors](data-storage-and-retrieval/vectors.md)
* [Deploying PostgresML](deploying-postgresml/README.md)
* [PostgresML Cloud](deploying-postgresml/postgresml-cloud/README.md)
* [Plans](deploying-postgresml/postgresml-cloud/plans/README.md)
pgml-docs/docs/guides/data-storage-and-retrieval/README.md

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Data Storage & Retrieval
pgml-docs/docs/guides/data-storage-and-retrieval/tabular-data.md

Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
# Tabular data

Tabular data is data stored in tables. While that's a bit of a recursive definition, tabular data is any kind of format that defines rows and columns, and it's the most common type of data storage mechanism. Examples of tabular data include spreadsheets, database tables, CSV files, and Pandas dataframes.

Storing and accessing tabular data has been the subject of decades of study and is the core purpose of many database systems. PostgreSQL has been leading the charge on optimal tabular storage for a long time and remains one of the most popular and effective ways to store, organize and retrieve this kind of data today.

### Creating tables

Postgres makes it really easy to create and use tables. If you're looking to use PostgresML for a supervised learning project, creating a table will feel very similar to creating a Pandas dataframe, except it will be durable and easily accessible for as long as the database exists.

For the rest of this guide, we'll take the [USA House Prices](https://www.kaggle.com/code/fatmakursun/supervised-unsupervised-learning-examples/) dataset from Kaggle, store it in Postgres and query it for basic statistics. The dataset has seven (7) columns and 5,000 rows:

| Column                       | Data type | Postgres data type |
| ---------------------------- | --------- | ------------------ |
| Avg. Area Income             | Float     | REAL               |
| Avg. Area House Age          | Float     | REAL               |
| Avg. Area Number of Rooms    | Float     | REAL               |
| Avg. Area Number of Bedrooms | Float     | REAL               |
| Area Population              | Float     | REAL               |
| Price                        | Float     | REAL               |
| Address                      | String    | VARCHAR            |

Once we know the column names and data types, the Postgres table definition almost writes itself:

```plsql
CREATE TABLE usa_house_prices (
  "Avg. Area Income" REAL NOT NULL,
  "Avg. Area House Age" REAL NOT NULL,
  "Avg. Area Number of Rooms" REAL NOT NULL,
  "Avg. Area Number of Bedrooms" REAL NOT NULL,
  "Area Population" REAL NOT NULL,
  "Price" REAL NOT NULL,
  "Address" VARCHAR NOT NULL
);
```

The column names are double quoted because they contain special characters like `.` and spaces, which could otherwise be interpreted as part of the SQL syntax. Generally speaking, it's good practice to double quote all entity names when using them in a PostgreSQL query, although most of the time it's not required.

If you run this using `psql`, you'll get something like this:

```
postgresml=# CREATE TABLE usa_house_prices (
  "Avg. Area Income" REAL NOT NULL,
  "Avg. Area House Age" REAL NOT NULL,
  "Avg. Area Number of Rooms" REAL NOT NULL,
  "Avg. Area Number of Bedrooms" REAL NOT NULL,
  "Area Population" REAL NOT NULL,
  "Price" REAL NOT NULL,
  "Address" VARCHAR NOT NULL
);
CREATE TABLE
postgresml=#
```

### Ingesting data

Right now the table is empty, and that's a bit boring. Let's import the USA House Prices dataset into it using one of the easiest and fastest ways to do so in Postgres: `COPY`.

If you're like me and prefer to use the terminal, you can open up `psql` and ingest the dataset like this:

```
postgresml=# \copy usa_house_prices FROM 'USA_Housing.csv' CSV HEADER;
COPY 5000
```

As expected, Postgres copied all 5,000 rows into the `usa_house_prices` table. `COPY` accepts CSV, text, and Postgres binary formats, but CSV is definitely the most common.

You may have noticed that we used the `\copy` command in the terminal, not `COPY`. The copy command comes in two forms: `\copy`, a `psql` command that copies a file from the local system to the remote database server, and `COPY`, the SQL command that runs on the server itself and is more commonly used in applications. If you're writing your own application to ingest data into Postgres, you'll be using `COPY`.
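
For example, if the CSV file is accessible on the database server itself, the server-side `COPY` achieves the same result without `psql`. A minimal sketch, assuming a hypothetical path on the server:

```sql
-- Server-side COPY: the path is resolved on the database server, not the client,
-- and the file must be readable by the Postgres server process.
COPY usa_house_prices FROM '/tmp/USA_Housing.csv' CSV HEADER;
```
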
### Querying data

Querying data stored in tables is what this is all about. After all, just storing data isn't particularly interesting or useful. Postgres has one of the most comprehensive and powerful query languages of all the data storage systems we've worked with, so for our example we won't have any trouble calculating some statistics to understand our data better.

Let's compute some basic statistics on the "Avg. Area Income" column using SQL:

```sql
SELECT
  count(*),
  avg("Avg. Area Income"),
  max("Avg. Area Income"),
  min("Avg. Area Income"),
  percentile_cont(0.75)
    WITHIN GROUP (ORDER BY "Avg. Area Income") AS percentile_75,
  stddev("Avg. Area Income")
FROM usa_house_prices;
```

which produces exactly what we want:

```
 count |        avg        |    max    |   min    | percentile_75  |      stddev
-------+-------------------+-----------+----------+----------------+-------------------
  5000 | 68583.10897773437 | 107701.75 | 17796.63 | 75783.33984375 | 10657.99120344229
```

The SQL language is very expressive and allows you to select, filter and aggregate any number of columns from any number of tables with a single query.
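
For instance, a single query can combine a filter with an aggregate. A small sketch, reusing the average income we computed above as a threshold:

```sql
-- Average price of houses located in areas with above-average income
SELECT avg("Price")
FROM usa_house_prices
WHERE "Avg. Area Income" > 68583;
```
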
### Adding more data

Because databases store data in perpetuity, adding more data to Postgres can take several forms. The simplest and most commonly used way is to insert it into a table we already have. Using the USA House Prices example, we can add a new row to the table with just one query:

```sql
INSERT INTO usa_house_prices (
  "Avg. Area Income",
  "Avg. Area House Age",
  "Avg. Area Number of Rooms",
  "Avg. Area Number of Bedrooms",
  "Area Population",
  "Price",
  "Address"
) VALUES (
  199778.0,
  43.0,
  3.0,
  2.0,
  57856.0,
  5000000000.0,
  '1 Infinite Loop, Cupertino, California'
);
```

Another way to add more data to a table is to run `COPY` again with a different CSV as the source. Many ETL pipelines from places like Snowflake or Redshift split their output into multiple CSVs, which can be imported into Postgres one by one with multiple `COPY` statements, as sketched below.
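
A minimal sketch of that loop, assuming the export was split into files named `part_*.csv` (hypothetical names) and a database called `postgresml`:

```bash
# Import each CSV part into the same table. \copy streams the file
# from the local machine to the database server.
for file in part_*.csv; do
  psql -d postgresml -c "\copy usa_house_prices FROM '$file' CSV HEADER"
done
```
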
Adding rows is pretty simple, but now that our dataset is changing, we should explore some tools to help us protect it against bad values.

### Data integrity

Databases store very important data, and they were built with many safety features to protect that data from common errors. In machine learning, one of the most common errors is data duplication, i.e. having the same row appear in a table twice. Postgres can easily protect us against this with unique indexes.

Looking at the USA House Prices dataset, we can find its natural key pretty easily. Since most columns are aggregates, the only column that looks unique is the "Address". After all, there should never be more than one house at a single address, not for sale anyway.

To ensure that our dataset reflects this, let's add a unique index to our table. To do so, we can use this SQL query:

```sql
CREATE UNIQUE INDEX ON usa_house_prices USING btree("Address");
```

Postgres scans the whole table, ensures there are no duplicates in the "Address" column, and creates an index on that column using the B-tree algorithm.

If we now attempt to insert the same row again, we'll get an error:

```
ERROR:  duplicate key value violates unique constraint "usa_house_prices_Address_idx"
DETAIL:  Key ("Address")=(1 Infinite Loop, Cupertino, California) already exists.
```
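
If duplicates are expected in incoming data and should simply be skipped rather than rejected with an error, one option (a sketch, not required for this dataset) is to let the unique index arbitrate the conflict with `ON CONFLICT`:

```sql
-- Skip any row whose "Address" already exists instead of failing the whole INSERT.
INSERT INTO usa_house_prices (
  "Avg. Area Income",
  "Avg. Area House Age",
  "Avg. Area Number of Rooms",
  "Avg. Area Number of Bedrooms",
  "Area Population",
  "Price",
  "Address"
) VALUES (
  199778.0,
  43.0,
  3.0,
  2.0,
  57856.0,
  5000000000.0,
  '1 Infinite Loop, Cupertino, California'
) ON CONFLICT ("Address") DO NOTHING;
```
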
Postgres supports many more indexing algorithms, namely GiST, BRIN, GIN, and Hash. Many extensions, for example `pgvector`, implement their own index types like HNSW and IVFFlat to efficiently search and retrieve specialized values. We explore those in our guide about [Vectors](vectors.md).

### Accelerating recall

Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan to return a result becomes quite large and queries become slow. To help with that, tables should have indexes that order and organize commonly accessed columns. Scanning a B-tree index can be done in _O(log n)_ time, which is orders of magnitude faster than an _O(n)_ full table scan.

#### Querying an index

Postgres automatically uses indexes when possible in order to accelerate recall. Using our example above, we can query data by the "Address" column, and we can do so very quickly thanks to the unique index we created.
```sql
SELECT
  "Avg. Area House Age",
  "Address"
FROM usa_house_prices
WHERE "Address" = '1 Infinite Loop, Cupertino, California';
```

which produces

```
 Avg. Area House Age |                Address
---------------------+----------------------------------------
                  43 | 1 Infinite Loop, Cupertino, California
(1 row)
```

which is exactly what we expected. Since we have a unique index on the table, we should only be getting one row back with that address.

To ensure that Postgres is using an index when querying a table, we can ask it to produce the query execution plan it's going to use before executing the query. A query plan is the list of steps that Postgres will take in order to get the result we requested.

To get the query plan for any query, prepend the keyword `EXPLAIN` to the query you're planning on running:

```
postgresml=# EXPLAIN (FORMAT JSON) SELECT
  "Avg. Area House Age",
  "Address"
FROM usa_house_prices
WHERE "Address" = '1 Infinite Loop, Cupertino, California';

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 [                                                                                            +
   {                                                                                          +
     "Plan": {                                                                                +
       "Node Type": "Index Scan",                                                             +
       "Parallel Aware": false,                                                               +
       "Async Capable": false,                                                                +
       "Scan Direction": "Forward",                                                           +
       "Index Name": "usa_house_prices_Address_idx",                                          +
       "Relation Name": "usa_house_prices",                                                   +
       "Alias": "usa_house_prices",                                                           +
       "Startup Cost": 0.28,                                                                  +
       "Total Cost": 8.30,                                                                    +
       "Plan Rows": 1,                                                                        +
       "Plan Width": 51,                                                                      +
       "Index Cond": "((\"Address\")::text = '1 Infinite Loop, Cupertino, California'::text)" +
     }                                                                                        +
   }                                                                                          +
 ]
```

The query plan indicates that Postgres will run an "Index Scan" using the index `usa_house_prices_Address_idx`, which is exactly what we want.

The ability to create indexes on datasets of any size, and to then efficiently query that data, is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query datasets that would never fit into memory, and it can do so quicker and more efficiently than most database systems currently used across the industry.

#### Maintaining an index

Indexes are automatically updated when new data is added and old data is removed. Postgres automatically ensures that indexes are efficiently organized and ACID compliant. When using Postgres tables, the system guarantees that the data will always be consistent, no matter how many concurrent changes are made to the tables.
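
To see which indexes exist on the table at any point, you can describe it from `psql`; the output will include the unique index we created:

```
postgresml=# \d usa_house_prices
```
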
pgml-docs/docs/guides/data-storage-and-retrieval/vectors.md

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Vectors

pgml-docs/docs/guides/deploying-postgresml/self-hosting/README.md

Lines changed: 36 additions & 0 deletions
@@ -49,3 +49,39 @@ CREATE EXTENSION
postgres=#
```

### GPU support

If you have access to Nvidia GPUs and would like to use them for accelerating LLMs or XGBoost/LightGBM/CatBoost, you'll need to install CUDA and the matching drivers.

#### Installing CUDA

Nvidia has an apt repository that can be added to your system pretty easily:

```bash
curl -LsSf \
  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb \
  -o /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
sudo apt update
sudo apt install -y cuda
```

Once installed, you should check your installation by running `nvidia-smi`:

```
$ nvidia-smi

Fri Oct  6 09:38:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 Ti     On  | 00000000:08:00.0  On |                  N/A |
|  0%   41C    P8              28W / 290W |   1268MiB /  8192MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

It's important that the CUDA version and the Nvidia driver version are compatible. When installing CUDA for the first time, it's common to have to reboot the system before both are detected successfully.
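
A quick way to cross-check both versions from the shell (a sketch; `nvcc` is only available if the CUDA toolkit is installed and on your `PATH`):

```bash
# Driver version reported by the kernel module
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version
nvcc --version
```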

pgml-docs/docs/guides/deploying-postgresml/self-hosting/backups.md

Lines changed: 21 additions & 2 deletions
@@ -88,15 +88,34 @@ repo1-retention-full=14
repo1-retention-archive=14
```

-This configuration will ensure that you have at least 14 backups and 14 backups worth of WAL files. Because Postgres allows point-in-time recovery, you'll be able to restore your database to any version (up to millisecond precision) going back two (2) weeks.
+This configuration will ensure that you have at least 14 backups and 14 backups worth of WAL files. Because Postgres allows point-in-time recovery, you'll be able to restore your database to any version (up to millisecond precision) going back two weeks.

#### Automating backups

Backups can be automated by running `pgbackrest backup --stanza=main` from a cron job, as sketched below. You can edit your cron with `crontab -e` and add a daily midnight run, ensuring that you have fresh backups every day. Make sure you're editing the crontab of the `postgres` user, since no other user will be allowed to back up Postgres or read the pgBackRest configuration file.
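
A minimal crontab entry for that schedule could look like this (a sketch, assuming the `main` stanza from the configuration above):

```
# Run a pgBackRest backup every day at midnight
0 0 * * * pgbackrest backup --stanza=main
```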

#### Backup overruns

If backups are taken frequently and take a long time to complete, it's possible for one backup to overrun another. pgBackRest uses lock files located in `/tmp/pgbackrest` to ensure that no two backups run concurrently. If a backup attempts to start while another one is running, pgBackRest will abort the later backup.

This is a good safety measure, but if it happens, the backup schedule will break and you could end up with missing backups. There are a couple of options to avoid this problem: take less frequent backups so they don't overrun each other, or implement lock-and-wait protection outside of pgBackRest.

#### Lock and wait

To implement lock-and-wait protection using only Bash, you can use `flock(1)`. Flock opens and holds a filesystem lock on a file until the command it's running completes. When the lock is released, any other waiting flock will take the lock and run its own command.

To implement backups that don't overrun, it's usually sufficient to protect the pgBackRest command with flock, like so:

```bash
touch /tmp/pgbackrest-flock-lock
flock /tmp/pgbackrest-flock-lock pgbackrest backup --stanza=main
```
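
If backups are scheduled with cron as shown earlier, the same protection can go directly into the crontab entry (a sketch, reusing the lock file from the example above):

```
0 0 * * * flock /tmp/pgbackrest-flock-lock pgbackrest backup --stanza=main
```
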
If you find yourself in a situation with too many overrunning backups, you end up with a system that's constantly backing up. As comforting as that sounds, it's not a great backup policy, since you can't be sure that your backup schedule is being followed. If that's your situation, it may be time to consider alternative backup solutions like filesystem snapshots (e.g. ZFS snapshots) or volume-level snapshots (e.g. EBS snapshots).

### PostgresML considerations

-Since PostgresML stores most of its data in regular Postgres tables, a PostgreSQL backup is a valid PostgresML backup. The only thing stored outside of Postgres is the Hugging Face LLM cache, which is stored directly on disk in `/var/lib/postgresql/.cache`. In case of a disaster, the cache will be lost, but that's fine. Since it's only a cache, next time a PostgresML `pgml.embed()` or `pgml.transform()` function is used, PostgresML will automatically repopulate all the necessary files in the cache from Hugging Face and resume normal operations.
+Since PostgresML stores most of its data in regular Postgres tables, a PostgreSQL backup is a valid PostgresML backup. The only thing stored outside of Postgres is the Hugging Face LLM cache, which is stored directly on disk in `/var/lib/postgresql/.cache`. In case of a disaster, the cache will be lost, but that's fine; since it's only a cache, next time PostgresML `pgml.embed()` or `pgml.transform()` functions are used, PostgresML will automatically repopulate all the necessary files in the cache from Hugging Face and resume normal operations.

#### HuggingFace cold starts

pgml-docs/docs/guides/deploying-postgresml/self-hosting/running-on-ec2.md

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ or a RAIDZ1 with 5 volumes:
RAIDZ1 protects against single volume failure, allowing you to replace an EBS volume without taking your database offline or restoring from backup. Considering EBS guarantees and additional redundancy provided by RAIDZ, this is a reasonable configuration to use for systems that require good durability and performance guarantees.

-A RAID configuration with at 4 volumes allows up to 4x read throughput which, in EBS terms, can produce up to 600MBps, without having to pay for additional IOPS.
+A RAID configuration with 4 volumes allows up to 4x read throughput of a single volume which, in EBS terms, can produce up to 600MBps, without having to pay for additional IOPS.

####
