Commit b259b3e

Lev Kokotov authored and gitbook-bot committed
GITBOOK-55: change request with no subject merged in GitBook
1 parent 098e46f commit b259b3e

6 files changed

+267
-0
lines changed

pgml-docs/docs/guides/SUMMARY.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,11 @@
4949
* [Tuning vector recall while generating query embeddings in the database](use-cases/tuning-vector-recall-while-generating-query-embeddings-in-the-database.md)
5050
* [Personalize embedding results with application data in your database](use-cases/personalize-embedding-results-with-application-data-in-your-database.md)
5151
* [LLM based pipelines with PostgresML and dbt (data build tool)](use-cases/llm-based-pipelines-with-postgresml-and-dbt-data-build-tool.md)
52+
* [Deployment](deployment/README.md)
53+
* [PostgresML Cloud](deployment/postgresml-cloud.md)
54+
* [Self-hosting](deployment/self-hosting/README.md)
55+
* [Replication](deployment/self-hosting/replication.md)
56+
* [Building from Source](deployment/self-hosting/building-from-source.md)
5257
* [PgCat](pgcat.md)
5358
* [Benchmarks](benchmarks/README.md)
5459
* [PostgresML is 8-40x faster than Python HTTP microservices](benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md)
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
# Deployment
2+
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
# PostgresML Cloud
2+
3+
PostgresML Cloud is a fully managed deployment of PostgresML, operated and supported by the team that created it.
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
1+
# Self-hosting
2+
3+
PostgresML is a Postgres extension, so running it is very similar to running a self-hosted PostgreSQL database server. A typical architecture consists of a primary database that will serve reads and writes, optional replicas to scale reads horizontally, and a pooler to load balance connections.
4+
5+
### Operating system
6+
7+
At PostgresML, we prefer running Postgres on Ubuntu, mainly because of its extensive network of supported hardware architectures, packages, and drivers. The rest of this guide will assume that we're using Ubuntu 22.04, the current long term support release of Ubuntu, but you can run PostgresML pretty easily on any other flavor of Linux.
8+
9+
### Installing PostgresML
10+
11+
PostgresML for Ubuntu 22.04 can be installed directly from our APT repository. There is no need to install any additional dependencies or compile from source.
12+
13+
To add our APT repository to your sources, you can run:
14+
15+
```bash
16+
echo "deb [trusted=yes] https://apt.postgresml.org jammy main" | \
17+
sudo tee -a /etc/apt/sources.list
18+
```
19+
20+
We don't sign our Debian packages since we can rely on HTTPS to guarantee the authenticity of our binaries.
21+
22+
Once you've added the repository, make sure to update APT:
23+
24+
```bash
25+
sudo apt update
26+
```
27+
28+
Finally, you can install PostgresML:
29+
30+
```bash
31+
sudo apt install -y postgresml-14
32+
```
33+
34+
Ubuntu 22.04 ships with PostgreSQL 14, but if you have a different version installed on your system, just change `14` in the package name to your Postgres version. We currently support all versions supported by the community: Postgres 12 through 15.
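
If you're not sure which version is installed, the package name can be derived from the version string. This is a minimal sketch: `pg_major` and `pgml_package` are hypothetical helpers written for this guide, and it assumes `pg_config` is on your `PATH` and prints a string of the form `PostgreSQL 14.9 (...)`.

```bash
# Extract the major version from a string like "PostgreSQL 14.9 (Ubuntu ...)".
pg_major() {
  local version="${1#PostgreSQL }"   # drop the leading "PostgreSQL "
  echo "${version%%.*}"              # keep everything before the first dot
}

# Map a major version to the package naming scheme used in this guide.
pgml_package() {
  case "$1" in
    12|13|14|15) echo "postgresml-$1" ;;
    *) echo "unsupported Postgres version: $1" >&2; return 1 ;;
  esac
}

# Usage (assumes pg_config is installed):
#   sudo apt install -y "$(pgml_package "$(pg_major "$(pg_config --version)")")"
```

The `case` statement only accepts the versions listed above, so an unexpected version fails loudly instead of asking APT for a package that doesn't exist.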
35+
36+
### Validate your installation
37+
38+
You should be able to connect to Postgres and install the extension into the database of your choice:
39+
40+
```bash
41+
sudo -u postgres psql
42+
```
43+
44+
```
45+
postgres=# CREATE EXTENSION pgml;
46+
INFO: Python version: 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]
47+
INFO: Scikit-learn 1.1.3, XGBoost 1.7.1, LightGBM 3.3.3, NumPy 1.23.5
48+
CREATE EXTENSION
49+
postgres=#
50+
```
51+
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
# Building from Source
2+
Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
1+
# Replication
2+
3+
PostgresML is fully integrated into the Postgres replication system and requires no special considerations. Setting up a PostgreSQL replica may seem like a daunting task, but it's actually quite a straightforward, step-by-step process.
4+
5+
### Architecture
6+
7+
PostgreSQL replication is composed of three (3) parts: a primary, a replica, and a Write-Ahead Log archive. Each is independently configured and operated, providing a high degree of reliability in the architecture.
8+
9+
#### Primary
10+
11+
The primary serves all queries, including writes and reads. In a replicated configuration, every single write made to the primary is replicated to the replicas and to the Write-Ahead Log archive.
12+
13+
#### Replica
14+
15+
A replica serves only read queries. Setting up additional replicas helps to horizontally scale the read capacity of a database cluster. Replicas can be added dynamically as demand on the system increases, and removed as the number of clients and queries decreases.
16+
17+
Postgres supports three (3) kinds of replication: streaming, logical, and log-shipping. Streaming replication sends data changes as they are written to the database files, ensuring that replicas are almost byte-for-byte identical to the primary. Logical replication sends row-level changes as they are decoded by the primary, e.g. the effects of `INSERT` / `UPDATE` / `DELETE` statements, which are then applied on the replica. Log-shipping replicas download the Write-Ahead Log from the archive and replay it at their own pace.
18+
19+
Each replication type has its own pros and cons. In this guide, we'll focus on setting up the more commonly used streaming replication.
20+
21+
#### Write-Ahead Log archive
22+
23+
The Write-Ahead Log (WAL) archive is a safe place where the primary can upload every single data change that occurs, so that the replicas can download and apply those changes on their own system. Typically, the WAL archive is stored on a separate machine, on network-attached storage or, more commonly these days, in an object storage system like S3 or Cloudflare's R2.
24+
25+
### Dependencies
26+
27+
PostgreSQL replication requires third-party software to operate smoothly. At PostgresML, we're big fans of the [pgBackRest](https://pgbackrest.org) project, and we'll be using it in this guide. In order to install it and some other dependencies on your system, add the PostgreSQL APT repository to your sources:
28+
29+
```bash
30+
sudo apt install -y postgresql-common
31+
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
32+
sudo apt update
33+
```
34+
35+
Finally, install pgBackRest:
36+
37+
```bash
38+
sudo apt install -y pgbackrest
39+
```
40+
41+
### Configure the primary
42+
43+
The primary needs to be configured to allow replication. By default, replication is disabled in PostgreSQL. First, to enable replication, change the following settings in `/etc/postgresql/14/main/postgresql.conf`:
44+
45+
```
46+
archive_mode = on
47+
wal_level = replica
48+
archive_command = 'pgbackrest --stanza=main archive-push %p'
49+
```
50+
51+
Second, Postgres requires replicas to connect to the primary as a user with replication permissions. To create this user, log in as a superuser and run:
52+
53+
```sql
54+
CREATE ROLE replication_user PASSWORD '<secure password>' LOGIN REPLICATION;
55+
```
56+
57+
Once the user is created, it has to be allowed to connect to the database from another machine. Postgres configures this type of access in `/etc/postgresql/14/main/pg_hba.conf`.
58+
59+
Open that file and append this to the end:
60+
61+
```
62+
host replication replication_user 0.0.0.0/0 scram-sha-256
63+
```
64+
65+
This configures Postgres to allow the `replication_user` to connect from anywhere (`0.0.0.0/0`) and authenticate using the now default SCRAM-SHA-256 algorithm.
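
Allowing connections from `0.0.0.0/0` is convenient for a tutorial, but the same rule can be scoped to the network your replicas live on. The sketch below assumes a made-up private subnet, `10.0.0.0/24`. The `hba_replication_rule` helper is illustrative and not part of Postgres.

```bash
# Print a pg_hba.conf rule that lets a role open replication connections
# from a given CIDR range, authenticating with SCRAM-SHA-256.
hba_replication_rule() {
  local role="$1" cidr="$2"
  printf 'host replication %s %s scram-sha-256\n' "$role" "$cidr"
}

# 10.0.0.0/24 is a hypothetical subnet; substitute the range your replicas use.
hba_replication_rule replication_user 10.0.0.0/24
```

The printed line can be appended to `pg_hba.conf` in place of the `0.0.0.0/0` rule, so that only machines in that range are allowed to attempt a replication connection.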
66+
67+
Finally, restart PostgreSQL for all these settings changes to take effect:
68+
69+
```bash
70+
sudo service postgresql restart
71+
```
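
After the restart, it can be worth confirming that Postgres picked up the new values. This is a small sketch, where `check_setting` is a hypothetical helper and the commented-out `psql` calls assume you can run commands as the `postgres` user on the primary:

```bash
# Compare the value Postgres reports for a setting with the value we expect.
check_setting() {
  local name="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $name = $actual"
  else
    echo "mismatch: $name = $actual (expected $expected)"
    return 1
  fi
}

# On the primary, with Postgres running (-A: unaligned output, -t: rows only):
#   check_setting wal_level replica "$(sudo -u postgres psql -Atc 'SHOW wal_level')"
#   check_setting archive_mode on "$(sudo -u postgres psql -Atc 'SHOW archive_mode')"
```

A mismatch usually means the setting was changed in a different configuration file than the one the server is reading, or that the server hasn't been restarted yet.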
72+
73+
### Create a WAL archive
74+
75+
In this guide, we'll be using an S3 bucket for the WAL archive. S3 is a very reliable and affordable place to store WAL. We've used it in the past to transfer, store and replicate petabytes of data.
76+
77+
#### Create an S3 bucket
78+
79+
You can create an S3 bucket in the AWS Console or by using the AWS CLI:
80+
81+
```bash
82+
aws s3api create-bucket \
83+
--bucket postgresml-tutorial-wal-archive \
84+
--create-bucket-configuration="LocationConstraint=us-west-2"
85+
```
86+
87+
By default, S3 buckets are protected against public access, so it's a safe place to store your WAL.
88+
89+
#### Configure pgBackRest
90+
91+
pgBackRest can be configured by editing the `/etc/pgbackrest.conf` file. This file must be readable by the `postgres` user, and since it will contain your AWS credentials, it should not be readable by other users.
92+
93+
Using the S3 bucket we created above, we can configure pgBackRest to use it for the WAL archive:
94+
95+
```
96+
[main]
97+
pg1-path=/var/lib/postgresql/14/main/
98+
99+
[global]
100+
process-max=4
101+
repo1-path=/wal-archive/main
102+
repo1-s3-bucket=postgresml-tutorial-wal-archive
103+
repo1-s3-endpoint=s3.us-west-2.amazonaws.com
104+
repo1-s3-region=us-west-2
105+
repo1-s3-key=<YOUR AWS ACCESS KEY ID>
106+
repo1-s3-key-secret=<YOUR AWS SECRET ACCESS KEY>
107+
repo1-type=s3
108+
start-fast=y
109+
compress-type=lz4
110+
archive-mode-check=n
111+
archive-check=n
112+
113+
[global:archive-push]
114+
compress-level=3
115+
```
116+
117+
Once configured, we can create the archive:
118+
119+
```bash
120+
sudo -u postgres pgbackrest stanza-create --stanza main
121+
```
122+
123+
You can validate that the archive was created successfully by listing the files using the AWS CLI:
124+
125+
```bash
126+
aws s3 ls s3://postgresml-tutorial-wal-archive/wal-archive/main/
127+
PRE archive/
128+
PRE backup/
129+
```
130+
131+
### Create a replica
132+
133+
A PostgreSQL replica should run on a different system than the primary. The two machines have to be able to communicate via the network in order for Postgres to send changes made to the primary over to the replica.
134+
135+
#### Install dependencies
136+
137+
Before configuring the replica, we need to make sure it's running the same software the primary is. Before proceeding, follow the [Self-hosting](./) guide to install PostgresML on the system. Once done, install pgBackRest and configure it the same way we did above for the primary. The replica has to be able to access the WAL files stored in the WAL archive.
138+
139+
#### Replicating data
140+
141+
A streaming replica is byte-for-byte identical to the primary, so in order to create one, we first need to copy all the database files stored on the primary over to the replica. Postgres provides a very handy command line tool for this called `pg_basebackup`.
142+
143+
On Ubuntu 22.04, the PostgreSQL 14 Debian package automatically creates a new Postgres data directory. Since the replica has to have the same data as the primary, the first thing we need to do is delete that automatically created data directory and replace it with the one stored on the primary.
144+
145+
To do so, first, stop the PostgreSQL server:
146+
147+
```
148+
sudo service postgresql stop
149+
```
150+
151+
Once stopped, delete the data directory:
152+
153+
```
154+
sudo rm -r /var/lib/postgresql/14/main
155+
```
156+
157+
Finally, copy the data directory from the primary onto the replica:
158+
159+
```
160+
PGPASSWORD='<secure password>' pg_basebackup \
    -h '<the host or IP address of the primary>' \
    -p 5432 \
    -U replication_user \
    -D /var/lib/postgresql/14/main
165+
```
166+
167+
Depending on how big your database is, this will take a few seconds to a few hours. Once complete, don't start Postgres just yet. We need to set a few configuration options first.
168+
169+
#### Configuring the replica
170+
171+
In order to start replicating from the primary, the replica needs to be able to connect to it. To do so, edit the configuration file `/etc/postgresql/14/main/postgresql.conf` and add the following settings:
172+
173+
```
174+
primary_conninfo = 'host=<the host or IP of the primary> port=5432 user=replication_user password=<secure password>'
175+
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
176+
```
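
Storing the password directly in `primary_conninfo` works, but libpq can also read it from a password file, which keeps the secret out of `postgresql.conf`. This is a sketch under a few assumptions: `pgpass_replication_line` is a hypothetical helper, `primary.internal` is a placeholder host name, and the file is placed in the home directory of the user that runs Postgres.

```bash
# Format a ~/.pgpass entry: hostname:port:database:username:password.
# The database name "replication" matches streaming replication connections.
pgpass_replication_line() {
  local host="$1" port="$2" role="$3" password="$4"
  printf '%s:%s:replication:%s:%s\n' "$host" "$port" "$role" "$password"
}

# Example with placeholder values:
pgpass_replication_line primary.internal 5432 replication_user 'secret'
```

The password file is ignored unless its permissions are `0600`, and a password containing `:` or `\` has to be escaped with a backslash. With the entry in place, the `password=` part can be left out of `primary_conninfo`.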
177+
178+
#### Enable standby mode
179+
180+
By default, if Postgres is started as a replica, it will download all the WAL it can find from the archive, apply the data changes and promote itself to the primary role. To avoid this and keep the Postgres replica running as a read replica, we need to configure it to run in standby mode. To do so, place a file called `standby.signal` into the data directory, like so:
181+
182+
```
183+
sudo -u postgres touch /var/lib/postgresql/14/main/standby.signal
184+
```
185+
186+
#### Start the replica
187+
188+
Finally, the replica is ready to start:
189+
190+
```
191+
sudo service postgresql start
192+
```
193+
194+
If you connect to it with `psql`, you can validate it's running in read-only mode:
195+
196+
```
197+
SELECT pg_is_in_recovery();
198+
```
199+
200+
which will return `true`.
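
The same check can be scripted, for example as part of a health check. In this sketch, `describe_role` is a hypothetical helper. It relies on `psql -At` printing booleans as a bare `t` or `f`.

```bash
# Interpret the single-character result psql prints for a boolean with -At.
describe_role() {
  case "$1" in
    t) echo "replica (in recovery)" ;;
    f) echo "primary (not in recovery)" ;;
    *) echo "unexpected result: $1" >&2; return 1 ;;
  esac
}

# On the replica, with Postgres running:
#   describe_role "$(sudo -u postgres psql -Atc 'SELECT pg_is_in_recovery()')"
```

If this reports a primary on a machine that is supposed to be a replica, the `standby.signal` file is most likely missing from the data directory.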
201+
202+
### Adding more replicas
203+
204+
Adding more replicas to the system is done the same way. The number of replicas that can stream from a primary at the same time is capped by its `max_wal_senders` setting (10 by default), which is more than enough to serve millions of queries per second and provide high availability for enterprise-grade deployments of PostgresML.
