
Commit 6f73eaa

preprocessing (#470)
Co-authored-by: Montana Low <montana.low@gmail.com>
1 parent 003ad04 commit 6f73eaa

File tree

11 files changed: +1074 −354 lines

.gitignore

Lines changed: 3 additions & 0 deletions

```diff
@@ -158,3 +158,6 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
+
+# local scratch pad
+scratch.sql
```

pgml-dashboard/src/models.rs

Lines changed: 36 additions & 19 deletions

```diff
@@ -369,7 +369,7 @@ pub enum Runtime {
     Rust,
 }
 
-#[derive(FromRow)]
+#[derive(FromRow, Debug)]
 #[allow(dead_code)]
 pub struct Model {
     pub id: i64,
@@ -520,7 +520,7 @@ impl Model {
     }
 }
 
-#[derive(FromRow)]
+#[derive(FromRow, Debug)]
 #[allow(dead_code)]
 pub struct Snapshot {
     pub id: i64,
@@ -665,34 +665,51 @@ impl Snapshot {
     }
 
     pub fn labels<'a>(&'a self) -> Option<Vec<&'a serde_json::Map<String, serde_json::Value>>> {
-        match self.columns() {
-            Some(columns) => Some(
-                columns
-                    .into_iter()
-                    .filter(|column| {
-                        self.y_column_name
-                            .contains(&column["name"].as_str().unwrap().to_string())
-                    })
-                    .collect(),
-            ),
-            None => None,
-        }
+        self.columns().map(|columns|
+            columns
+                .into_iter()
+                .filter(|column| {
+                    self.y_column_name
+                        .contains(&column["name"].as_str().unwrap().to_string())
+                })
+                .collect()
+        )
     }
 
     pub async fn models(&self, pool: &PgPool) -> anyhow::Result<Vec<Model>> {
         Model::get_by_snapshot_id(pool, self.id).await
     }
 
     pub fn target_stddev(&self, name: &str) -> f32 {
-        self.analysis
+        match self.analysis
             .as_ref()
             .unwrap()
             .as_object()
             .unwrap()
-            .get(&format!("{}_stddev", name))
-            .unwrap()
-            .as_f64()
-            .unwrap() as f32
+            .get(&format!("{}_stddev", name)) {
+            // 2.1
+            Some(value) => value.as_f64().unwrap() as f32,
+            // 2.2+
+            None => {
+                let columns = self.columns().unwrap();
+                let column = columns.iter().find(|column|
+                    &column["name"].as_str().unwrap() == &name
+                );
+                match column {
+                    Some(column) => {
+                        column.get("statistics")
+                            .unwrap()
+                            .as_object()
+                            .unwrap()
+                            .get("std_dev")
+                            .unwrap()
+                            .as_f64()
+                            .unwrap() as f32
+                    },
+                    None => 0.
+                }
+            }
+        }
     }
 }
```

Lines changed: 130 additions & 0 deletions

# Preprocessing Data

The training function also provides the option to preprocess data with the `preprocess` param. Preprocessors can be configured on a per-column basis for the training data set. There are currently three types of preprocessing available, for both categorical and quantitative variables. Below is a brief example of training data used to learn a model of whether we should carry an umbrella or not.

!!! note
    Preprocessing steps are saved after training, and repeated identically for future calls to `predict`.
### `weather_data`

| **month** | **clouds** | **humidity** | **temp** | **rain** |
|-----------|------------|--------------|----------|----------|
| 'jan'     | 'cumulus'  | 0.8          | 5        | true     |
| 'jan'     | NULL       | 0.1          | 10       | false    |
| ...       | ...        | ...          | ...      | ...      |
| 'dec'     | 'nimbus'   | 0.9          | -2       | false    |

In this example:

- `month` is an ordinal categorical `TEXT` variable
- `clouds` is a nullable nominal categorical `TEXT` variable
- `humidity` is a continuous quantitative `FLOAT4` variable
- `temp` is a discrete quantitative `INT4` variable
- `rain` is a nominal categorical `BOOL` label
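For readers who want to follow along, here is a minimal sketch of a table matching the description above (the DDL and sample rows are illustrative assumptions, not part of the commit):

```postgresql
CREATE TABLE weather_data (
    month TEXT,      -- ordinal categorical
    clouds TEXT,     -- nullable nominal categorical
    humidity FLOAT4, -- continuous quantitative
    temp INT4,       -- discrete quantitative
    rain BOOL        -- nominal categorical label
);

INSERT INTO weather_data (month, clouds, humidity, temp, rain) VALUES
    ('jan', 'cumulus', 0.8, 5, true),
    ('jan', NULL, 0.1, 10, false),
    ('dec', 'nimbus', 0.9, -2, false);
```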
There are 3 steps to preprocessing data:

- [Encoding](#categorical-encodings) categorical values into quantitative values
- [Imputing](#imputing-missing-values) NULL values to some quantitative value
- [Scaling](#scaling-values) quantitative values across all variables to similar ranges

These preprocessing steps may be specified on a per-column basis to the [train()](/user_guides/training/overview/) function. By default, PostgresML does minimal preprocessing on training data, and will raise an error during analysis if NULL values are encountered without a preprocessor. All types other than `TEXT` are treated as quantitative variables and cast to floating point representations before passing them to the underlying algorithm implementations.
```postgresql title="pgml.train()"
select pgml.train(
    project_name => 'preprocessed_model',
    task => 'classification',
    relation_name => 'weather_data',
    target => 'rain',
    preprocess => '{
        "month": {"encode": {"ordinal": ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]}},
        "clouds": {"encode": "target", "scale": "standard"},
        "humidity": {"impute": "mean", "scale": "standard"},
        "temp": {"scale": "standard"}
    }'
);
```
In some cases, it may make sense to use multiple steps for a single column. For example, the `clouds` column will be target encoded, and then scaled to the standard range to avoid dominating other variables, but there are some interactions between preprocessors to keep in mind:

- `NULL` and `NaN` are treated as additional, independent categories if seen during training, so columns that `encode` will only ever `impute` when novel values are encountered after training.
- It usually makes sense to scale all variables to the same scale.
- It does not usually help to scale or preprocess the target data, as that is essentially the problem formulation and/or task selection.

!!! note
    TEXT is used in this document to also refer to VARCHAR and CHAR(N) types.
## Categorical encodings

Encoding categorical variables is an O(N log(M)) operation, where N is the number of rows and M is the number of distinct categories.

| **name**  | **description** |
|-----------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| `none`    | **Default** - Casts the variable to a 32-bit floating point representation compatible with numerics. This is the default for non-`TEXT` values.   |
| `target`  | Encodes the variable as the average value of the target label for all members of the category. This is the default for `TEXT` variables.          |
| `one_hot` | Encodes the variable as multiple independent boolean columns.                                                                                     |
| `ordinal` | Encodes the variable as integer values provided by their position in the input array. NULLs are always 0.                                         |
### `target` encoding

Target encoding is a relatively efficient way to represent a categorical variable. The average value of the target is computed for each category in the training data set. It is reasonable to `scale` target encoded variables using the same method as other variables.

```
preprocess => '{
    "clouds": {"encode": "target"}
}'
```
!!! note
    Target encoding is currently limited to the first label column specified in a joint optimization model when there are multiple labels.
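To build intuition for what target encoding computes, here is a hedged plain-SQL sketch of the equivalent aggregate over the example table (an illustration, not the PostgresML implementation):

```postgresql
-- The per-category mean of the target is (conceptually) the value
-- that replaces each 'clouds' category after encoding.
SELECT clouds,
       AVG(rain::INT4)::FLOAT4 AS target_encoding
FROM weather_data
GROUP BY clouds;
```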
### `one_hot` encoding

One-hot encoding converts each category into an independent boolean column, where all columns are false except the one column the instance is a member of. This is generally not as efficient or as effective as target encoding, because the number of additional columns for a single feature can swamp the other features in some algorithms, regardless of scaling. In addition, the columns are highly correlated, which can also cause quality issues in some algorithms. PostgresML drops one column by default to break the correlation while preserving the information, which is also referred to as dummy encoding.

```
preprocess => '{
    "clouds": {"encode": "one_hot"}
}'
```
!!! note
    All one-hot encoded data is scaled from 0-1 by definition, and will not be further scaled, unlike the other encodings, which may be scaled.
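As an illustration (again a plain-SQL sketch under assumed column names, not the PostgresML internals), dummy encoding the cloud categories from the example might produce columns like:

```postgresql
-- One boolean column per category, with one category dropped (dummy
-- encoding): both zeros imply the dropped 'cumulus' category.
-- NULL is treated as its own category seen during training.
SELECT clouds,
       (clouds IS NOT DISTINCT FROM 'nimbus')::INT4 AS clouds_nimbus,
       (clouds IS NULL)::INT4                       AS clouds_null
FROM weather_data;
```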
### `ordinal` encoding

Some categorical variables have a natural ordering, like months of the year or days of the week, and can be effectively treated as a discrete quantitative variable. You may set the order of your categorical values by passing an exhaustive ordered array, e.g.

```
preprocess => '{
    "month": {"encode": {"ordinal": ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]}}
}'
```
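Conceptually, ordinal encoding maps each category to its 1-based position in the supplied array, with NULLs mapping to 0. A hedged plain-SQL sketch of that mapping:

```postgresql
SELECT month,
       COALESCE(array_position(
           ARRAY['jan','feb','mar','apr','may','jun',
                 'jul','aug','sep','oct','nov','dec'],
           month), 0) AS month_encoded  -- NULL or unlisted values become 0
FROM weather_data;
```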
## Imputing missing values

NULL and NAN values can be replaced by several statistical measures observed in the training data.

| **name** | **description** |
|----------|----------------------------------------------------------------------------------------|
| `error`  | **Default** - will abort training or inference when a `NULL` or `NAN` is encountered    |
| `mean`   | the mean value of the variable in the training data set                                 |
| `median` | the middle value of the variable in the sorted training data set                        |
| `mode`   | the most common value of the variable in the training data set                          |
| `min`    | the minimum value of the variable in the training data set                              |
| `max`    | the maximum value of the variable in the training data set                              |
| `zero`   | replaces all missing values with 0.0                                                    |
!!! example
    ```
    preprocess => '{
        "temp": {"impute": "mean"}
    }'
    ```
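The statistic is computed once over the training data set and then substituted wherever a value is missing, including for future predictions. A hedged plain-SQL sketch of mean imputation over the example table:

```postgresql
SELECT COALESCE(
           humidity,
           (SELECT AVG(humidity) FROM weather_data)  -- mean of the training data
       ) AS humidity_imputed
FROM weather_data;
```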
## Scaling values

Scaling all variables to a standardized range can help make sure that no feature dominates the model simply because it has a naturally larger scale.

| **name**   | **description** |
|------------|---------------------------------------------------------------------------------------------------------------------------|
| `preserve` | **Default** - Does not scale the variable at all.                                                                           |
| `standard` | Scales data to have a mean of zero, and variance of one.                                                                    |
| `min_max`  | Scales data from zero to one. The minimum becomes 0.0 and the maximum becomes 1.0.                                          |
| `max_abs`  | Scales data from -1.0 to +1.0. Data will not be centered around 0, unless abs(min) == abs(max).                             |
| `robust`   | Scales data as a factor of the first and third quartiles. This method may handle outliers more robustly than others.       |
!!! example
    ```
    preprocess => '{
        "temp": {"scale": "standard"}
    }'
    ```
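For reference, `standard` scaling subtracts the training-set mean and divides by the training-set standard deviation. A hedged plain-SQL sketch over the example table (using the population standard deviation as an assumption):

```postgresql
-- (x - mean) / stddev, computed here with window aggregates over the table.
SELECT temp,
       ((temp - AVG(temp) OVER ()) / STDDEV_POP(temp) OVER ())::FLOAT4
           AS temp_scaled
FROM weather_data;
```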

pgml-extension/Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

pgml-extension/Cargo.toml

Lines changed: 1 addition & 0 deletions

```diff
@@ -18,6 +18,7 @@ cuda = ["xgboost/cuda", "lightgbm/cuda"]
 
 [dependencies]
 pgx = "=0.5.6"
+pgx-pg-sys = "=0.5.6"
 xgboost = { git="https://github.com/postgresml/rust-xgboost.git", branch = "master" }
 once_cell = "1"
 rand = "0.8"
```

pgml-extension/examples/image_classification.sql

Lines changed: 1 addition & 1 deletion

```diff
@@ -22,7 +22,7 @@ SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;
 SELECT * FROM pgml.train('Handwritten Digits', 'classification', 'pgml.digits', 'target');
 
 -- check out the predictions
-SELECT target, pgml.predict('Handwritten Digits', image) AS prediction
+SELECT target, pgml.predict('Handwritten Digits', image::FLOAT4[]) AS prediction
 FROM pgml.digits
 LIMIT 10;
```

0 commit comments