pgml-docs/docs/blog/scaling-postgresml-to-one-million-requests-per-second.md
There are many poolers available presently, the most notable being PgBouncer, which …

- Failover in case a read replica is broken
- Sharding (this feature is still being developed)

In this benchmark, we used its load balancing feature to evenly distribute XGBoost predictions across our Postgres replicas.
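To make that concrete, here is a minimal sketch of the kind of single-prediction query each benchmark client sends. The project name and feature values are hypothetical placeholders; the important part is that the client connects to a pooler rather than to any particular replica, so the load balancing is transparent to the application.

```sql
-- Run one XGBoost prediction through the pooler; the pooler picks a replica.
-- 'my_xgboost_project' and the feature values are hypothetical placeholders.
SELECT pgml.predict(
    'my_xgboost_project',       -- a project previously trained with pgml.train()
    ARRAY[0.1, 2.0, 5.0, 1.5]   -- feature vector for a single prediction
) AS prediction;
```

From the application's point of view, nothing changes as replicas are added or removed; only the pooler's configuration does.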
#### Postgres Replicas
Scaling Postgres reads is a solved problem. If more read queries are coming in, …
## Results
We ran over 100 different benchmarks, changing the number of clients, poolers, replicas, and XGBoost predictions we requested. The benchmarks were meant to test the limits of each configuration and what remediations were needed in each scenario. Our raw data is available [below](#methodology).

### Summary

One of the tests we ran used 1,000 clients connected to 1, 2, and 5 replicas. The results were exactly what we expected.
Both latency and throughput, the standard measurements of system performance, sc…
Our architecture shares nothing and requires no synchronization. The replicas don't talk to each other and the poolers don't either. Every component has the knowledge it needs (through configuration) to do its job, and they do it well.
!!! success

The most impressive result is serving close to a million predictions with an average latency of less than 1ms. You might notice, though, that `950160.7` isn't quite one million, and that's true. We couldn't reach one million with 1000 clients, so we increased to 2000 and got our magic number: **1,021,692.7 req/sec**, with an average latency of **1.7ms**.
### Batching Predictions
Batching is a proven method to optimize performance. If you need to get several data points, batch the requests into one query, and it will run faster than making individual requests.
We should preface this result by stating that PostgresML does not yet have a batch prediction API as such. Our `pgml.predict()` function can predict multiple points, but we haven't implemented a query pattern to pass multiple Postgres rows to that function at the same time. Once we do, based on our tests, we should see a quadratic increase in performance.
Regardless of that limitation, we still managed to get better results by batching queries together, since Postgres needed to do less query parsing and data fetching, and we saved on network round trip time as well.
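Without a dedicated batch API, one way to batch is simply to pack several `pgml.predict()` calls into a single statement, so the query text is parsed once and only one network round trip is paid. Here is a minimal sketch with a hypothetical project name and feature vectors; the exact batching pattern used in the benchmark may differ.

```sql
-- Several predictions in one statement: one parse, one network round trip.
-- 'my_xgboost_project' and the feature vectors are hypothetical placeholders.
SELECT
    pgml.predict('my_xgboost_project', ARRAY[0.1, 2.0, 5.0, 1.5]) AS prediction_1,
    pgml.predict('my_xgboost_project', ARRAY[0.3, 1.2, 4.8, 0.9]) AS prediction_2,
    pgml.predict('my_xgboost_project', ARRAY[0.7, 0.4, 3.1, 2.2]) AS prediction_3;
```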
If batching did not work at all, we would see a linear increase in latency and a linear decrease in throughput. That did not happen; instead, we got a good increase in throughput and a sublinear increase in latency. A modest success, but a success nonetheless.
All systems, at some point in their lifetime, will come under more load than they were designed for; what happens then is an important feature (or bug) of their design. Horizontal scaling is never immediate: it takes a bit of time to spin up additional hardware to handle the load. It can take a second, or a minute, depending on availability, but in both cases, existing resources need to serve traffic the best way they can.
We were hoping to test PostgresML to its breaking point, but we couldn't quite get there. As load (the number of clients) increased beyond provisioned capacity, the only thing we saw was a gradual increase in latency. Throughput remained roughly the same. This gradual latency increase was caused by simple queuing: the replicas couldn't serve requests concurrently, so the requests had to patiently wait in the poolers.
Among many others, this is a very important feature of any proxy: it's a FIFO queue …
Queueing overall is not desirable, but it's a feature, not a bug. While autoscaling spins up an additional replica, the app continues to work, although a few milliseconds slower, which is a good trade-off for not overspending on hardware.
As the demand on PostgresML increases, the system gracefully handles the load. If the number of replicas stays the same, latency slowly increases, all the while remaining well below acceptable ranges. Throughput holds as well, as the growing number of clients evenly splits the available resources.

If we increase the number of replicas, latency decreases and throughput increases, eventually stabilizing as the number of clients grows in parallel. We get the best result with 5 replicas, but this number is variable and can be changed as the need for low latency competes with cost.
## What's Next
Horizontal scaling and high availability are fascinating topics in software engineering. After doing this benchmark, we had no more doubts about our chosen architecture. Needing to serve 1 million predictions per second is rare, but having the ability to do that, and more if desired, is an important aspect for any new system.
The next challenge for us is to scale writes horizontally. In the database world, this means sharding the database across multiple separate machines using a hashing function, and automatically routing both reads and writes to the right shards. There are many possible solutions on the market for this already, e.g. Citus and Foreign Data Wrappers, but none are as horizontally scalable as we'd like, although we will incorporate them into our architecture until we build the one we really want.
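For intuition only, here is a toy sketch of hash-based routing using Postgres's built-in `hashtext()` function. It illustrates the idea of mapping a key to a shard; it is not how PgCat, Citus, or any production sharding layer is necessarily implemented, and the key and shard count are made up.

```sql
-- Toy illustration: map a key to one of 4 hypothetical shards.
-- A real router would apply the same function consistently to reads and writes.
SELECT abs(hashtext('customer_12345')) % 4 AS shard_number;
```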
For that purpose, we're building our own open source [Postgres proxy](https://github.com/levkk/pgcat/), which we discussed earlier in the article. As we progress further in our journey, we'll be adding more features and performance improvements.