
Commit 882fe8b

2.0 announcement (#328)
1 parent 20b79f2 commit 882fe8b

File tree

1 file changed: +17 −14 lines changed


pgml-docs/docs/blog/postgresml-is-moving-to-rust-for-our-2.0-release.md

Lines changed: 17 additions & 14 deletions
@@ -5,7 +5,7 @@ image: https://postgresml.org/blog/images/abstraction.webp
 image_alt: Moving from one abstraction layer to another.
 ---
 
-PostgresML is moving to Rust for our 2.0 release
+PostgresML is Moving to Rust for our 2.0 Release
 ================================================
 
 <p class="author">
@@ -14,12 +14,13 @@ PostgresML is moving to Rust for our 2.0 release
 September 19, 2022
 </p>
 
-PostgresML is a fairly young project. We recently released 1.0 and now we're considering what we want to accomplish for 2.0. In addition to simplifying the workflow for building models, we'd like to address runtime speed, memory consumption and the overall reliability we've seen for machine learning deployments running at scale.
 
-Python is generally touted as fast enough for machine learning, and is the de facto industry standard with tons of popular libraries implementing all the latest and greatest algorithms. Many of these algorithms (torch, tensorflow, xgboost, numpy) have been optimized in C, but not all of them. For example, most of the [linear algorithms](https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/linear_model) in scikit learn are implemented in pure Python, although they rely on numpy, which is a convenient optimization. It also uses cython in a few performance critical places. This ecosystem has allowed PostgresML to offer a ton of functionality with minimal duplication of effort.
+PostgresML is a fairly young project. We recently released v1.0 and now we're considering what we want to accomplish for v2.0. In addition to simplifying the workflow for building models, we'd like to address runtime speed, memory consumption and the overall reliability we've seen is needed for machine learning deployments running at scale.
 
+Python is generally touted as fast enough for machine learning, and is the de facto industry standard with tons of popular libraries implementing all the latest and greatest algorithms. Many of these algorithms (Torch, TensorFlow, XGBoost, NumPy) have been optimized in C, but not all of them. For example, most of the [linear algorithms](https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/linear_model) in scikit-learn are written in pure Python, although they do use NumPy, which is a convenient optimization. It also uses Cython in a few performance-critical places. This ecosystem has allowed PostgresML to offer a ton of functionality with minimal duplication of effort.
 
-## Ambition starts with a simple benchmark
+
+## Ambition Starts With a Simple Benchmark
 <figure>
 <img alt="Ferris the crab" src="/blog/images/rust_programming_crab_sea.webp" />
 <figcaption>Rust mascot image by opensource.com</figcaption>
@@ -34,7 +35,7 @@ FROM generate_series(1, 1280000) i
 GROUP BY i % 10000;
 ```
 
-Spoiler alert, idiomatic Rust is about 10x faster than native SQL, the embedded PL/pgSQL scripting language, and Python in this benchmark. Rust comes close to the hand optimized assembly version of the Basic Linear Algebra Subroutines implementation for the dot product. Numpy is supposed to provide optimizations in cases like this, but it's actually the worst performer. Data movement from Postgres to PL/Python is pretty good. It's even faster than the pure SQL equivalent, but adding the extra conversion from Python list to Numpy array takes almost as much time as everything else. Machine Learning systems that move relatively large quantities of data around can become dominated by these extraneous operations, rather than the ML algorithms that actually generate value.
+Spoiler alert: idiomatic Rust is about 10x faster than native SQL, embedded PL/pgSQL, and pure Python. Rust comes close to the hand-optimized assembly version of the Basic Linear Algebra Subroutines (BLAS) implementation. NumPy is supposed to provide optimizations in cases like this, but it's actually the worst performer. Data movement from Postgres to PL/Python is pretty good; it's even faster than the pure SQL equivalent, but adding the extra conversion from Python list to NumPy array takes almost as much time as everything else. Machine learning systems that move relatively large quantities of data around can become dominated by these extraneous operations, rather than the ML algorithms that actually generate value.
 
 <center>
 <iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vShmCVrYwmscys5TIo7c_C-1M3gE_GwENc4tTiU7A6_l3YamjJx7v5bZcafLIDcEIbFu-C2Buao4rQ6/pubchart?oid=815608582&amp;format=interactive"></iframe>
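The "idiomatic Rust" contender this paragraph refers to boils down to a plain dot product over two slices. A minimal standalone sketch of that shape (the function name is assumed here; the real code runs inside the Postgres extension, not as a standalone binary):

```rust
// Sketch of an idiomatic Rust dot product, the kind of code the benchmark
// compares against SQL, PL/pgSQL, Python and NumPy. The iterator chain
// compiles to a single tight loop with no intermediate allocations.
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0, 6.0];
    // 1*4 + 2*5 + 3*6 = 32
    println!("{}", dot_product(&a, &b));
}
```

No conversion step, no interpreter warm-up: the input slices are already in their final representation, which is the point the benchmark makes.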
@@ -101,7 +102,7 @@ Spoiler alert, idiomatic Rust is about 10x faster than native SQL, the embedded
 ORDER BY 1
 LIMIT 1;
 ```
-=== "Numpy"
+=== "NumPy"
 ```sql linenums="1" title="define_numpy.sql"
 CREATE OR REPLACE FUNCTION dot_product_numpy(a FLOAT4[], b FLOAT4[])
 RETURNS FLOAT4
@@ -179,15 +180,17 @@ ML isn't just about basic math and a little bit of business logic. It's about al
 <figcaption>Layers of abstraction must remain a good value.</figcaption>
 </figure>
 
-The results are somewhat staggering. We didn't spend any time intentionally optimizing Rust over Python. Most of the time spent was just trying to get things to compile. 😅 It's hard to believe the difference is this big, but those fringe operations outside of the core machine learning algorithms really do dominate, requiring up to 35x more time in Python during inference. The difference between classification and regression speeds here are related to the dataset size. The scikit learn handwritten image classification dataset effectively has 64 features (pixels) vs the diabetes regression dataset having only 10 features. **The more data we're dealing with, the bigger the improvement we see in Rust**. We're even giving Python some leeway by warming up the runtime on the connection before the test, which typically takes a second or two to interpret all of PostgresML's dependencies. Since Rust is a compiled language, there is no longer a need to warmup the connection.
+The results are somewhat staggering. We didn't spend any time intentionally optimizing Rust over Python. Most of the time spent was just trying to get things to compile. 😅 It's hard to believe the difference is this big, but those fringe operations outside of the core machine learning algorithms really do dominate, requiring up to 35x more time in Python during inference. The difference between classification and regression speeds here is related to the dataset size: the scikit-learn handwritten image classification dataset effectively has 64 features (pixels) vs. the diabetes regression dataset's 10 features.
+
+**The more data we're dealing with, the bigger the improvement we see in Rust**. We're even giving Python some leeway by warming up the runtime on the connection before the test, which typically takes a second or two to interpret all of PostgresML's dependencies. Since Rust is a compiled language, there is no longer a need to warm up the connection.
 
 <center>
 <iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vShmCVrYwmscys5TIo7c_C-1M3gE_GwENc4tTiU7A6_l3YamjJx7v5bZcafLIDcEIbFu-C2Buao4rQ6/pubchart?oid=345126465&amp;format=interactive"></iframe>
 </center>
 
 > _This language comparison uses in-process data access. Python based machine learning microservices that communicate with other services over HTTP with JSON or gRPC interfaces will look even worse in comparison, especially if they are stateless and rely on yet another database to provide their data over yet another wire._
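The claim that extraneous operations dominate can be made concrete with a small standalone sketch (illustrative sizes and values only, not the post's actual benchmark harness): the same reduction, timed with and without an extra copy/convert pass standing in for a representation change such as Python list → NumPy array.

```rust
use std::time::Instant;

// Same iterator-style reduction as the benchmark's dot product.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    // Illustrative size; the post's benchmark generates 1,280,000 rows.
    let n = 1_280_000;
    let a: Vec<f32> = (0..n).map(|i| ((i % 128) as f32) * 0.01).collect();
    let b: Vec<f32> = a.clone();

    let t = Instant::now();
    let fast = dot(&a, &b);
    let direct = t.elapsed();

    let t = Instant::now();
    // Extra pass over the data before the math, standing in for a
    // representation change (f32 -> f64 -> f32 round-trips exactly).
    let a2: Vec<f32> = a.iter().map(|&x| f64::from(x) as f32).collect();
    let b2: Vec<f32> = b.iter().map(|&x| f64::from(x) as f32).collect();
    let slow = dot(&a2, &b2);
    let with_conversion = t.elapsed();

    assert_eq!(fast, slow); // identical math either way
    println!("direct: {:?}, with conversion: {:?}", direct, with_conversion);
}
```

The answer is identical in both cases; only the time spent shuffling bytes differs, which is exactly the overhead the PL/Python + NumPy path pays on every call.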
-## Preserving backward compatibility
+## Preserving Backward Compatibility
 ```sql linenums="1" title="train.sql"
 SELECT pgml.train(
     project_name => 'Handwritten Digit Classifier',
@@ -203,10 +206,10 @@ SELECT pgml.predict('Handwritten Digit Classifier', image)
 FROM pgml.digits;
 ```
 
-The API is identical between versions 1.0 and 2.0. We take breaking changes seriously and we're not going to break existing deployments just because we're rewriting the whole project. The only reason we're bumping the major version is because we feel like this is a dramatic change, but we intend to preserve a full compatibility layer with models trained on 1.0 in Python. This means that to get the full performance benefits, you'll need to retrain models after upgrading.
+The API is identical between v1.0 and v2.0. We take breaking changes seriously and we're not going to break existing deployments just because we're rewriting the whole project. The only reason we're bumping the major version is that we feel this is a dramatic change, but we intend to preserve a full compatibility layer with models trained on v1.0 in Python. However, this does mean that to get the full performance benefits, you'll need to retrain models after upgrading.
 
-## Ensuring high quality Rust implementations
-Besides backwards compatibility, we're building a Python compatibility layer to guarantee we can preserve the full Python model training APIs, when Rust APIs are not at parity in terms of functionality, quality or performance. We started this journey thinking that the older algorithms in scikit learn that are implemented in vanilla Python would be the best candidates for replacement in Rust, but that is only partly true. There are high quality efforts in [linfa](https://github.com/rust-ml/linfa) and [smartcore](https://github.com/smartcorelib/smartcore) that also show 10-30x speedup over scikit, but they still lack some of the deeper functionality like joint regression, some of the more obscure algorithms and hyperparams, and some of the error handling that has been hardened into scikit with mass adoption.
+## Ensuring High Quality Rust Implementations
+Besides backwards compatibility, we're building a Python compatibility layer to guarantee we can preserve the full Python model training APIs when Rust APIs are not at parity in terms of functionality, quality or performance. We started this journey thinking that the older vanilla Python algorithms in scikit-learn would be the best candidates for replacement in Rust, but that is only partly true. There are high quality efforts in [linfa](https://github.com/rust-ml/linfa) and [smartcore](https://github.com/smartcorelib/smartcore) that also show 10-30x speedups over scikit-learn, but they still lack some of the deeper functionality like joint regression, some of the more obscure algorithms and hyperparameters, and some of the error handling that has been hardened into scikit-learn with mass adoption.
 
 <center>
 <iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/
@@ -225,11 +228,11 @@ The Rust implementations also produce high quality predictions against test sets
 <iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vShmCVrYwmscys5TIo7c_C-1M3gE_GwENc4tTiU7A6_l3YamjJx7v5bZcafLIDcEIbFu-C2Buao4rQ6/pubchart?oid=631927399&amp;format=interactive"></iframe>
 </center>
 
-Interestingly, the training times for some of the simplest algorithms are much worse in the Rust implementation. Until we can guarantee each algorithm implementation is an upgrade in every way, we'll continue to use the Python compatibility layer on a case by case basis to avoid any unpleasant surprises.
+Interestingly, the training times for some of the simplest algorithms are worse in the Rust implementation. Until we can guarantee each Rust algorithm is an upgrade in every way, we'll continue to use the Python compatibility layer on a case-by-case basis to avoid any unpleasant surprises.
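One way to picture this case-by-case fallback is a per-algorithm routing table; the sketch below is purely hypothetical (the names `Runtime` and `runtime_for` and the algorithm list are illustrative assumptions, not the actual PostgresML internals):

```rust
// Hypothetical sketch: route each algorithm to the native Rust implementation
// only where it is a strict upgrade, otherwise keep the Python compatibility
// layer that wraps the battle-tested scikit-learn code paths.
#[derive(Debug, PartialEq)]
enum Runtime {
    Rust,
    PythonCompat,
}

fn runtime_for(algorithm: &str) -> Runtime {
    match algorithm {
        // Assumed allowlist of algorithms whose Rust ports are vetted.
        "linear" | "xgboost" => Runtime::Rust,
        // Everything else stays on the hardened Python path for now.
        _ => Runtime::PythonCompat,
    }
}

fn main() {
    println!("{:?}", runtime_for("linear")); // Rust
    println!("{:?}", runtime_for("svm"));    // PythonCompat
}
```

The allowlist shrinks toward "everything" as each Rust implementation is proven equal or better, without ever surprising an existing deployment.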
 
-We believe that [machine learning in Rust](https://www.arewelearningyet.com/) is mature enough to add significant value now, where we'll be using the same underlying C libraries, and that it's worth contributing to the Rust implementations further to help bring them up to full feature parity. With this goal in mind, we intend to drop our Python compatibility layer completely in 3.0, and only support 2.0 models trained with Rust long term. Part of our 2.0 release process will include a benchmark suite for the full API we support via all Python libraries, so that we can track our progress toward pure Rust implementations across the board.
+We believe that [machine learning in Rust](https://www.arewelearningyet.com/) is mature enough to add significant value now. We'll be using the same underlying C/C++ libraries, and it's worth contributing to the Rust ML ecosystem to bring it up to full feature parity. Our v2.0 release will include a benchmark suite for the full API we support via all Python libraries, so that we can track our progress toward pure Rust implementations over time.
 
-Many thanks and ❤️ to all those who are supporting this endeavor. We’d love to hear feedback from the broader ML and Engineering community about applications and other real world scenarios to help prioritize our work. We'd also appreciate your support in the form of [stars on our github](https://github.com/postgresml/postgresml).
+Many thanks and ❤️ to all those who are supporting this endeavor. We’d love to hear feedback from the broader ML and engineering community about applications and other real-world scenarios to help prioritize our work. You can show your support by [starring us on GitHub](https://github.com/postgresml/postgresml).
 
 <center>
 <video controls autoplay loop muted width="90%" style="box-shadow: 0 0 8px #000;">

0 commit comments
