---
author: Lev Kokotov
description: Machine learning in Python is slow and error-prone, while Rust makes it fast and reliable.
---


# Oxidizing Machine Learning

<p class="author">
  <img width="54px" height="54px" src="/images/team/lev.jpg" alt="Author" />
  Lev Kokotov<br/>
  September 7, 2022
</p>


Machine learning in Python can be hard to deploy at scale. We all love Python, but it's no secret
that its overhead is large:

* Load data from large CSV files
* Do some post-processing with NumPy
* Move and join data into a Pandas dataframe
* Load data into the algorithm

Each step incurs at least one copy of the data in memory; 4x the storage and compute just to train a model already sounds inefficient, and once you add Python's per-object memory overhead, the price tag grows well beyond that.

Even if you could find the money to pay for the compute, fitting the dataset you want into the RAM you have becomes difficult.

The status quo needed a shake-up, and along came Rust.

## The State of ML in Rust

Doing machine learning in anything but Python sounds wild, but if you look under the hood, ML algorithms are mostly written in C++: `libtorch` (Torch), XGBoost, large parts of Tensorflow, `libsvm` (Support Vector Machines), and the list goes on. A linear regression can be (and is) written in about 10 lines of for-loops.
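
To make that last claim concrete, here's a rough sketch of what such a loop looks like: single-feature linear regression trained with plain gradient descent, no libraries involved. The learning rate, epoch count, and toy data are made up for illustration.

```rust
// Fit y = w * x + b with gradient descent, in plain for-loops.
fn fit(xs: &[f64], ys: &[f64], learning_rate: f64, epochs: usize) -> (f64, f64) {
    let (mut w, mut b) = (0.0, 0.0);
    let n = xs.len() as f64;
    for _ in 0..epochs {
        let (mut dw, mut db) = (0.0, 0.0);
        for (x, y) in xs.iter().zip(ys) {
            let err = (w * x + b) - y; // prediction error for this sample
            dw += 2.0 * err * x / n;   // gradient w.r.t. the weight
            db += 2.0 * err / n;       // gradient w.r.t. the bias
        }
        w -= learning_rate * dw;
        b -= learning_rate * db;
    }
    (w, b)
}

fn main() {
    // Toy data generated from y = 2x + 1; the fit should recover those values.
    let (w, b) = fit(&[1.0, 2.0, 3.0, 4.0], &[3.0, 5.0, 7.0, 9.0], 0.05, 10_000);
    println!("w = {:.2}, b = {:.2}", w, b);
}
```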

It should then come as no surprise that the Rust ML community is alive and doing well:

* SmartCore[^1] is rivaling Scikit for commodity algorithms (see the sketch right after this list)
* XGBoost bindings[^2] work great for gradient boosted trees
* Torch bindings[^3] are first class for building any kind of neural network
* Tensorflow bindings[^4] are also in the mix, although parts of them are still Python (e.g. Keras)
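
As a taste of SmartCore, here's a minimal linear regression sketch adapted from the crate's documented fit/predict API. The module paths below are from the `0.2` release that was current at the time of writing and may differ in newer versions, and the toy data is made up.

```rust
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::linear::linear_regression::LinearRegression;

fn main() {
    // Four samples with two features each; targets follow y = x1 + 2 * x2.
    let x = DenseMatrix::from_2d_array(&[
        &[1.0, 1.0],
        &[2.0, 1.0],
        &[2.0, 2.0],
        &[3.0, 2.0],
    ]);
    let y = vec![3.0, 4.0, 6.0, 7.0];

    // Ordinary least squares with default parameters, Scikit-style fit/predict.
    let model = LinearRegression::fit(&x, &y, Default::default()).unwrap();
    let predictions = model.predict(&x).unwrap();
    println!("{:?}", predictions);
}
```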

If you start missing NumPy, don't worry: the Rust version[^5] has you covered, and the list of available tools keeps growing.
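
For a flavor of what that looks like, here is a small `ndarray` sketch covering the kind of operations you'd normally reach for NumPy to do; the arrays are toy data, and you'd add `ndarray` to your `Cargo.toml` to run it.

```rust
use ndarray::{array, Axis};

fn main() {
    let a = array![[1.0, 2.0], [3.0, 4.0]];
    let b = array![[5.0, 6.0], [7.0, 8.0]];

    let product = a.dot(&b);                          // matrix multiplication
    let column_means = a.mean_axis(Axis(0)).unwrap(); // per-column mean
    let scaled = &a * 2.0;                            // elementwise broadcast

    println!("{}\n{}\n{}", product, column_means, scaled);
}
```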

When you only need 4 bytes to represent a floating point number instead of Python's 26 bytes[^6], suddenly you can do a lot more with the same hardware.
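
The Rust side of that comparison is easy to check for yourself (the Python side is measured separately, e.g. with `sys.getsizeof` on a float object):

```rust
fn main() {
    // A 32-bit float costs exactly 4 bytes and a 64-bit float costs 8,
    // with no per-element object header when stored in a Vec or an ndarray.
    println!("f32: {} bytes", std::mem::size_of::<f32>());
    println!("f64: {} bytes", std::mem::size_of::<f64>());
}
```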

## XGBoost, Rustified

Let's do a quick example to illustrate our point.

XGBoost is a popular decision tree algorithm which uses gradient boosting, a fancy optimization technique, to train an ensemble of trees on data that could confuse simpler linear models. It comes with a Python interface, which calls into its C++ primitives, but now it has a Rust interface as well.

_Cargo.toml_
```toml
[dependencies]
xgboost = "0.1"
```

_src/main.rs_
```rust
use xgboost::{parameters, Booster, DMatrix};

fn main() {
    // Data is read directly into the C++ data structure.
    let train = DMatrix::load("train.txt").unwrap();
    let test = DMatrix::load("test.txt").unwrap();

    // Task (regression or classification)
    let learning_params = parameters::learning::LearningTaskParametersBuilder::default()
        .objective(parameters::learning::Objective::BinaryLogistic)
        .build()
        .unwrap();

    // Tree parameters (e.g. depth)
    let tree_params = parameters::tree::TreeBoosterParametersBuilder::default()
        .max_depth(2)
        .eta(1.0)
        .build()
        .unwrap();

    // Gradient boosting parameters
    let booster_params = parameters::BoosterParametersBuilder::default()
        .booster_type(parameters::BoosterType::Tree(tree_params))
        .learning_params(learning_params)
        .build()
        .unwrap();

    // Train on train data, test accuracy on test data
    let evaluation_sets = &[(&train, "train"), (&test, "test")];

    // Final algorithm configuration
    let params = parameters::TrainingParametersBuilder::default()
        .dtrain(&train)
        .boost_rounds(2) // n_estimators
        .booster_params(booster_params)
        .evaluation_sets(Some(evaluation_sets))
        .build()
        .unwrap();

    // Train!
    let model = Booster::train(&params).unwrap();

    // Save and load later in any language that has XGBoost bindings.
    model.save("/tmp/xgboost_model.bin").unwrap();
}
```

<small>Example created from the `rust-xgboost`[^7] documentation and my own experiments.</small>

That's it! You just trained an XGBoost model in Rust with a few lines of efficient and ergonomic code.
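
Since the model was saved in XGBoost's standard binary format, you can load it back, from Rust or any other language with XGBoost bindings, and run predictions. Here's a short sketch based on the same `rust-xgboost` documentation; the file paths are the placeholders used above.

```rust
use xgboost::{Booster, DMatrix};

fn main() {
    // Load the model trained and saved in the previous example.
    let model = Booster::load("/tmp/xgboost_model.bin").unwrap();

    // Score a dataset in the same text format used for training.
    let test = DMatrix::load("test.txt").unwrap();
    let predictions = model.predict(&test).unwrap();

    println!("{:?}", &predictions[..predictions.len().min(5)]);
}
```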

Unlike Python, Rust compiles and verifies your code, so you'll know that it's likely to work before you even run it. When it can take several hours to train a model, it's great to know that you don't have a syntax error on your last line.


[^1]: [SmartCore](https://smartcorelib.org/)
[^2]: [XGBoost bindings](https://github.com/davechallis/rust-xgboost)
[^3]: [Torch bindings](https://github.com/LaurentMazare/tch-rs)
[^4]: [Tensorflow bindings](https://github.com/tensorflow/rust)
[^5]: [rust-ndarray](https://github.com/rust-ndarray/ndarray)
[^6]: [Python floating points](https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/floatobject.h#L15)
[^7]: [`rust-xgboost`](https://docs.rs/xgboost/latest/xgboost/)
