From 92b80f272580ee4e97a5c57b8760f8dfa93ccdb0 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Mon, 11 Apr 2022 17:41:57 -0700
Subject: [PATCH 01/15] MVP goals
---
README.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++--
sql/test.sql | 5 ++--
2 files changed, 81 insertions(+), 6 deletions(-)
diff --git a/README.md b/README.md
index 1883fb017..99ab004cd 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,82 @@
-## Postgres ML demo
+## PostgresML
+
+PostgresML aims to be the easiest way to gain value from machine learning. Anyone with a basic understanding of SQL should be able to build and deploy models to production, while receiving the benefits of a high-performance machine learning platform. PostgresML leverages state-of-the-art algorithms with built-in best practices, without having to set up additional infrastructure or learn another programming language.
+
+Getting started is as easy as creating a `table` or `view` that holds your training data, and then registering it with PostgresML.
+
+```sql
+SELECT pgml.create_regression('Red Wine Quality', training_data_table_or_view_name, label_column_name);
+```
+
+And predict novel datapoints:
+
+```sql
+SELECT pgml.predict('Red Wine Quality', red_wines.*)
+FROM pgml.red_wines
+LIMIT 3;
+
+ quality
+---------
+ 0.896432
+ 0.834822
+ 0.954502
+(3 rows)
+```
+
+PostgresML similarly supports classification to predict numeric scores rather than classes for novel data.
+
+```sql
+SELECT pgml.create_classification('Handwritten Digit Classifier', pgml.mnist_training_data, label_column_name);
+```
+
+And predict novel datapoints:
+
+```sql
+SELECT pgml.predict('Handwritten Digit Classifier', mnist_test_data.*)
+FROM pgml.mnist_test_data
+LIMIT 1;
+
+ digit | likelihood
+-------+------------
+ 5 | 0.956432
+(1 row)
+```
+
+Check out the [documentation](https://TODO) for the full capabilities, including:
+- [Creating Training Sets](https://TODO)
+ - [Classification](https://TODO)
+ - [Regression](https://TODO)
+- [Supported Algorithms](https://TODO)
+ - [Scikit Learn](https://TODO)
+ - [XGBoost](https://TODO)
+ - [Tensorflow](https://TODO)
+ - [PyTorch](https://TODO)
+
+### Planned features
+- Model management dashboard
+- Data explorer
+- More algorithms and libraries, including custom algorithm support
+
+
+### FAQ
+
+*How well does this scale?*
+
+Petabyte-sized Postgres deployments have been [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have pushed maximum table sizes beyond the exabyte range toward the yottabyte scale. Machine learning models can be horizontally scaled using well-tested Postgres replication techniques on top of a mature storage and compute platform.
+
+*How reliable is this system?*
+
+Postgres is widely considered mission-critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML lets an infrastructure organization leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happen automatically alongside normal data backup procedures.
+
+*How good are the models?*
+
+Model quality is often a tradeoff between compute resources and incremental quality improvements. PostgresML allows stakeholders to choose algorithms from several libraries that will provide the most bang for the buck. In addition, PostgresML automatically applies best practices for data cleaning, like imputing missing values by default and normalizing data, to prevent common problems in production. Beyond quick 0-to-1 value creation, PostgresML enables further expert iteration with custom data preparation and algorithm implementations. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time, but that shouldn't get in the way of a quick start.
+
+*Is PostgresML fast?*
+
+Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack: the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parallelization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. Check out our [benchmarks](https://todo).
+
-Quick demo with Postgres, PL/Python, and Scikit.
### Installation in WSL or Ubuntu
@@ -29,7 +105,7 @@ Install Scikit globally (I didn't bother setup Postgres with a virtualenv, but i
sudo pip3 install sklearn
```
-### Run the demo
+### Run the example
```bash
sudo mkdir /app/models
diff --git a/sql/test.sql b/sql/test.sql
index 3268d83b1..0488f0a9e 100644
--- a/sql/test.sql
+++ b/sql/test.sql
@@ -20,6 +20,5 @@ WITH latest_model AS (
)
SELECT pgml.score(
(SELECT model_name FROM latest_model), -- last model we just trained
-
- -- features as variadic arguments
- 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4) AS score;
+ 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4 -- features as variadic arguments
+) AS score;
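
The FAQ in this patch claims automatic imputation and normalization. As a point of reference, here is a minimal scikit-learn sketch of that style of preprocessing; it illustrates the general technique, not PostgresML's actual implementation:

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# impute missing values and normalize features before fitting,
# the kind of defaults the FAQ describes
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LinearRegression(),
)

X = [[7.4, 0.70], [7.8, float("nan")], [6.3, 0.30]]  # nan marks a missing value
y = [5.0, 5.0, 6.0]
model.fit(X, y)
print(model.predict([[7.0, float("nan")]]))
```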
From f18f276be63af5fef7601c2fb3c8e908f65815fc Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Tue, 12 Apr 2022 10:12:29 -0700
Subject: [PATCH 02/15] Use unittest as the test running harness
---
pgml/tests/test_train.py | 31 +++++++++++++++++--------------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/pgml/tests/test_train.py b/pgml/tests/test_train.py
index fe2438ea7..1cb1c79d0 100644
--- a/pgml/tests/test_train.py
+++ b/pgml/tests/test_train.py
@@ -1,3 +1,4 @@
+import unittest
from pgml.train import train
@@ -14,18 +15,20 @@ def fetch(self, n):
return self._values
-def test_train():
- it = PlPyIterator(
- [
- {
- "value": 5,
- "weight": 5,
- },
- {
- "value": 34,
- "weight": 5,
- },
- ]
- )
+class TestTrain(unittest.TestCase):
+ def test_train(self):
+ it = PlPyIterator(
+ [
+ {
+ "value": 5,
+ "weight": 5,
+ },
+ {
+ "value": 34,
+ "weight": 5,
+ },
+ ]
+ )
- train(it, y_column="weight", name="test", save=False)
+ train(it, y_column="weight", name="test", save=False)
+ self.assertTrue(True)
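
For context, the pattern this commit adopts, runnable outside Postgres with `python -m unittest`; `FakeCursor` is a hypothetical stand-in for plpy's cursor, mirroring the `PlPyIterator` above:

```python
import unittest

class FakeCursor:
    """Mimics plpy's cursor.fetch(n): returns all rows once, then None."""
    def __init__(self, rows):
        self._rows = rows
        self._done = False

    def fetch(self, n):
        if self._done:
            return None
        self._done = True
        return self._rows

class TestFakeCursor(unittest.TestCase):
    def test_fetch_drains(self):
        cur = FakeCursor([{"value": 5, "weight": 5}])
        self.assertEqual(len(cur.fetch(5)), 1)
        self.assertIsNone(cur.fetch(5))

if __name__ == "__main__":
    unittest.main()
```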
From 3c66272e891b5fcab69cb185e8f52384ffedc009 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Tue, 12 Apr 2022 10:14:46 -0700
Subject: [PATCH 03/15] remove validate because validation has a different
meaning in ML, and we will be more liberal with data types we accept
---
pgml/pgml/train.py | 3 ---
pgml/pgml/validate.py | 13 -------------
pgml/tests/test_validate.py | 22 ----------------------
sql/install.sql | 15 ---------------
4 files changed, 53 deletions(-)
delete mode 100644 pgml/pgml/validate.py
delete mode 100644 pgml/tests/test_validate.py
diff --git a/pgml/pgml/train.py b/pgml/pgml/train.py
index 968cc6e59..2eb8d945b 100644
--- a/pgml/pgml/train.py
+++ b/pgml/pgml/train.py
@@ -12,7 +12,6 @@
from pgml.sql import all_rows
from pgml.exceptions import PgMLException
-from pgml.validate import check_type
def train(cursor, y_column, name, save=True, destination="/tmp/pgml_models"):
@@ -34,8 +33,6 @@ def train(cursor, y_column, name, save=True, destination="/tmp/pgml_models"):
for row in all_rows(cursor):
row = row.copy()
- check_type(row)
-
if y_column not in row:
PgMLException(
f"Column `{y}` not found. Did you name your `y_column` correctly?"
diff --git a/pgml/pgml/validate.py b/pgml/pgml/validate.py
deleted file mode 100644
index 2fa08acb3..000000000
--- a/pgml/pgml/validate.py
+++ /dev/null
@@ -1,13 +0,0 @@
-"""
-Run some basic sanity checks on the data.
-"""
-
-# import sklearn
-from pgml.exceptions import PgMLException
-
-
-def check_type(row):
- """We only accept certain column types for now."""
- for col in row:
- if type(row[col]) not in (int, float):
- raise PgMLException(f"Column '{col}' is not a integer or float.")
diff --git a/pgml/tests/test_validate.py b/pgml/tests/test_validate.py
deleted file mode 100644
index b7118c4b0..000000000
--- a/pgml/tests/test_validate.py
+++ /dev/null
@@ -1,22 +0,0 @@
-from pgml.validate import check_type
-from pgml.exceptions import PgMLException
-
-import pytest
-
-
-def test_check_type():
- row = {
- "col1": 1,
- "col2": "text",
- "col3": 1.5,
- }
-
- check_type(row)
-
- row = {
- "col1": 1,
- "col2": Exception(),
- }
-
- with pytest.raises(PgMLException):
- check_type(row)
diff --git a/sql/install.sql b/sql/install.sql
index a34757dbf..81c61e2e8 100644
--- a/sql/install.sql
+++ b/sql/install.sql
@@ -31,21 +31,6 @@ CREATE TABLE pgml.model_versions(
successful BOOL NULL
);
----
---- Run some validations on the table/view to make sure
---- it'll work without our package.
----
-CREATE OR REPLACE FUNCTION pgml.validate(table_name TEXT)
-RETURNS BOOL
-AS $$
- from pgml.sql import all_rows
- from pgml.validate import check_type
-
- for row in all_rows(plpy.cursor(f"SELECT * FROM {table_name}")):
- check_type(row)
- return True
-$$ LANGUAGE plpython3u;
-
---
--- Train the model.
---
From 958cfba7c0309ccd847b8e3777fbfec0ca1266a2 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Tue, 12 Apr 2022 10:25:14 -0700
Subject: [PATCH 04/15] keep model in memory to avoid going to disk
---
README.md | 2 --
scikit_train_and_predict.sql | 16 ++++++++--------
2 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/README.md b/README.md
index 99ab004cd..43bbada25 100644
--- a/README.md
+++ b/README.md
@@ -108,8 +108,6 @@ sudo pip3 install sklearn
### Run the example
```bash
-sudo mkdir /app/models
-sudo chown postgres:postgres /app/models
psql -f scikit_train_and_predict.sql
```
diff --git a/scikit_train_and_predict.sql b/scikit_train_and_predict.sql
index 3aa93e97a..a67f88a17 100644
--- a/scikit_train_and_predict.sql
+++ b/scikit_train_and_predict.sql
@@ -45,27 +45,27 @@ AS $$
rfc = RandomForestClassifier()
rfc.fit(X, y)
- with open("/app/models/postgresml-rfc.pickle", "wb") as f:
- pickle.dump(rfc, f)
- return "OK"
+ return pickle.dumps(rfc).hex()
$$ LANGUAGE plpython3u;
-SELECT scikit_learn_train_example();
+;
-CREATE OR REPLACE FUNCTION scikit_learn_predict_example(value INT)
+CREATE OR REPLACE FUNCTION scikit_learn_predict_example(model TEXT, value INT)
RETURNS DOUBLE PRECISION
AS $$
import pickle
- with open("/app/models/postgresml-rfc.pickle", "rb") as f:
- m = pickle.load(f)
+ m = pickle.loads(bytes.fromhex(model))
r = m.predict([[value,]])
return r[0]
$$ LANGUAGE plpython3u;
+WITH model as (
+ SELECT scikit_learn_train_example() AS pickle
+)
SELECT value,
weight,
- scikit_learn_predict_example(value::int) AS prediction
+ scikit_learn_predict_example((SELECT model.pickle FROM model), value::int) AS prediction
FROM scikit_train_view LIMIT 5;
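
The round trip this commit relies on, sketched as plain Python with sample data assumed: pickle the fitted model, pass it around as hex text, and unpickle it for prediction, never touching the filesystem:

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10).fit([[1], [2], [3], [4]], [0, 0, 1, 1])

hex_blob = pickle.dumps(rfc).hex()        # what the TEXT-returning function emits
restored = pickle.loads(bytes.fromhex(hex_blob))
print(restored.predict([[3]]))            # same model, no disk involved
```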
From 14b1f6121d7565fe31d523e5c44953a47da333bd Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Tue, 12 Apr 2022 10:39:10 -0700
Subject: [PATCH 05/15] use bytea directly for pl/python rather than hex/text
conversion
---
scikit_train_and_predict.sql | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/scikit_train_and_predict.sql b/scikit_train_and_predict.sql
index a67f88a17..6f8b5c990 100644
--- a/scikit_train_and_predict.sql
+++ b/scikit_train_and_predict.sql
@@ -26,7 +26,7 @@ INSERT INTO scikit_train_data (value, weight) SELECT generate_series(1, 500), 5.
CREATE OR REPLACE FUNCTION scikit_learn_train_example()
-RETURNS TEXT
+RETURNS BYTEA
AS $$
from sklearn.ensemble import RandomForestClassifier
import pickle
@@ -45,18 +45,18 @@ AS $$
rfc = RandomForestClassifier()
rfc.fit(X, y)
- return pickle.dumps(rfc).hex()
+ return pickle.dumps(rfc)
$$ LANGUAGE plpython3u;
;
-CREATE OR REPLACE FUNCTION scikit_learn_predict_example(model TEXT, value INT)
+CREATE OR REPLACE FUNCTION scikit_learn_predict_example(model BYTEA, value INT)
RETURNS DOUBLE PRECISION
AS $$
import pickle
- m = pickle.loads(bytes.fromhex(model))
+ m = pickle.loads(model)
r = m.predict([[value,]])
return r[0]
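
PL/Python passes BYTEA values as Python bytes in both directions, which is what lets this commit drop the hex conversion. A hedged sketch of the predict side as a plain function:

```python
import pickle

def predict_example(model: bytes, value: int) -> float:
    # with a BYTEA argument there is no bytes.fromhex() step,
    # unlike the TEXT version in the previous commit
    m = pickle.loads(model)
    return float(m.predict([[value]])[0])
```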
From 829b62e2a50f9255d1daa561c78b5c937f52bd64 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Tue, 12 Apr 2022 16:59:27 -0700
Subject: [PATCH 06/15] add a draft schema to support snapshots and multiple
training runs for a project
---
README.md | 4 +-
benchmarks.sql | 23 +++++++++++
pgml/pgml/model.py | 95 ++++++++++++++++++++++++++++++++++++++++++++
pgml/pgml/sql.py | 3 +-
sql/install.sql | 99 ++++++++++++++++++++++++++++++++++++++++++----
sql/test.sql | 3 --
6 files changed, 214 insertions(+), 13 deletions(-)
create mode 100644 benchmarks.sql
create mode 100644 pgml/pgml/model.py
diff --git a/README.md b/README.md
index 43bbada25..46b7cc760 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@ PostgresML aims to be the easiest way to gain value from machine learning. Anyon
Getting started is as easy as creating a `table` or `view` that holds the training data, and then registering that with PostgresML.
```sql
-SELECT pgml.create_regression('Red Wine Quality', training_data_table_or_view_name, label_column_name);
+SELECT pgml.model_regression('Red Wine Quality', training_data_table_or_view_name, label_column_name);
```
And predict novel datapoints:
@@ -23,7 +23,7 @@ LIMIT 3;
(3 rows)
```
-PostgresML similarly supports classification to predict numeric scores rather than classes for novel data.
+PostgresML similarly supports classification to predict discrete classes rather than numeric scores for novel data.
```sql
SELECT pgml.create_classification('Handwritten Digit Classifier', pgml.mnist_training_data, label_column_name);
diff --git a/benchmarks.sql b/benchmarks.sql
new file mode 100644
index 000000000..f2a6bfc5c
--- /dev/null
+++ b/benchmarks.sql
@@ -0,0 +1,23 @@
+--
+-- CREATE EXTENSION
+--
+CREATE EXTENSION IF NOT EXISTS plpython3u;
+
+CREATE OR REPLACE FUNCTION pg_call()
+RETURNS INT
+AS $$
+BEGIN
+ RETURN 1;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION py_call()
+RETURNS INT
+AS $$
+ return 1;
+$$ LANGUAGE plpython3u;
+
+\timing on
+SELECT generate_series(1, 50000), pg_call(); -- Time: 20.679 ms
+SELECT generate_series(1, 50000), py_call(); -- Time: 67.355 ms
+
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
new file mode 100644
index 000000000..f6dc37d47
--- /dev/null
+++ b/pgml/pgml/model.py
@@ -0,0 +1,95 @@
+import plpy
+
+class Regression:
+ """Provides continuous real number predictions learned from the training data.
+ """
+ def __init__(
+ model_name: str,
+ relation_name: str,
+ y_column_name: str,
+ implementation: str = "sklearn.linear_model"
+ ) -> None:
+ """Create a regression model from a table or view filled with training data.
+
+ Args:
+ model_name (str): a human friendly identifier
+ relation_name (str): the table or view that stores the training data
+ y_column_name (str): the column in the training data that acts as the label
+ implementation (str, optional): the algorithm used to implement the regression. Defaults to "sklearn.linear_model".
+ """
+
+ data_source = f"SELECT * FROM {table_name}"
+
+ # Start training.
+ start = plpy.execute(f"""
+ INSERT INTO pgml.model_versions
+ (name, data_source, y_column)
+ VALUES
+ ('{table_name}', '{data_source}', '{y}')
+ RETURNING *""", 1)
+
+ id_ = start[0]["id"]
+ name = f"{table_name}_{id_}"
+
+ destination = models_directory(plpy)
+
+ # Train!
+ pickle, msq, r2 = train(plpy.cursor(data_source), y_column=y, name=name, destination=destination)
+ X = []
+ y = []
+ columns = []
+
+ for row in all_rows(cursor):
+ row = row.copy()
+
+ if y_column not in row:
+ PgMLException(
+ f"Column `{y}` not found. Did you name your `y_column` correctly?"
+ )
+
+ y_ = row.pop(y_column)
+ x_ = []
+
+ # Always pull the columns in the same order from the row.
+ # Python dict iteration is not always in the same order (hash table).
+ if not columns:
+ for col in row:
+ columns.append(col)
+
+ for column in columns:
+ x_.append(row[column])
+ X.append(x_)
+ y.append(y_)
+
+ X_train, X_test, y_train, y_test = train_test_split(X, y)
+
+ # Just linear regression for now, but can add many more later.
+ lr = LinearRegression()
+ lr.fit(X_train, y_train)
+
+ # Test
+ y_pred = lr.predict(X_test)
+ msq = mean_squared_error(y_test, y_pred)
+ r2 = r2_score(y_test, y_pred)
+
+ path = os.path.join(destination, name)
+
+ if save:
+ with open(path, "wb") as f:
+ pickle.dump(lr, f)
+
+ return path, msq, r2
+
+
+ plpy.execute(f"""
+ UPDATE pgml.model_versions
+ SET pickle = '{pickle}',
+ successful = true,
+ mean_squared_error = '{msq}',
+ r2_score = '{r2}',
+ ended_at = clock_timestamp()
+ WHERE id = {id_}""")
+
+ return name
+
+ model
diff --git a/pgml/pgml/sql.py b/pgml/pgml/sql.py
index 508ae6045..95b19fed9 100644
--- a/pgml/pgml/sql.py
+++ b/pgml/pgml/sql.py
@@ -1,6 +1,7 @@
"""Tools to run SQL.
"""
import os
+import plpy
def all_rows(cursor):
@@ -14,7 +15,7 @@ def all_rows(cursor):
yield row
-def models_directory(plpy):
+def models_directory():
"""Get the directory where we store our models."""
data_directory = plpy.execute(
"""
diff --git a/sql/install.sql b/sql/install.sql
index 81c61e2e8..4f00a5202 100644
--- a/sql/install.sql
+++ b/sql/install.sql
@@ -1,10 +1,87 @@
+SET client_min_messages TO WARNING;
-- Create the PL/Python3 extension.
CREATE EXTENSION IF NOT EXISTS plpython3u;
+---
+--- Create schema for models.
+---
DROP SCHEMA pgml CASCADE;
CREATE SCHEMA IF NOT EXISTS pgml;
+CREATE OR REPLACE FUNCTION pgml.auto_updated_at(tbl regclass)
+RETURNS VOID
+AS $$
+ DECLARE name_parts TEXT[];
+ DECLARE name TEXT;
+BEGIN
+ name_parts := string_to_array(tbl::TEXT, '.');
+ name := name_parts[array_upper(name_parts, 1)];
+
+ EXECUTE format('DROP TRIGGER IF EXISTS %s_auto_updated_at ON %s', name, tbl);
+ EXECUTE format('CREATE TRIGGER %s_auto_updated_at BEFORE UPDATE ON %s
+ FOR EACH ROW EXECUTE PROCEDURE pgml.set_updated_at()', name, tbl);
+END;
+$$
+LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION pgml.set_updated_at()
+RETURNS TRIGGER
+AS $$
+BEGIN
+ IF (
+ NEW IS DISTINCT FROM OLD
+ AND NEW.updated_at IS NOT DISTINCT FROM OLD.updated_at
+ ) THEN
+ NEW.updated_at := CURRENT_TIMESTAMP;
+ END IF;
+ RETURN new;
+END;
+$$
+LANGUAGE plpgsql;
+
+CREATE TABLE pgml.projects(
+ id BIGSERIAL PRIMARY KEY,
+ name TEXT NOT NULL,
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP
+);
+SELECT pgml.auto_updated_at('pgml.projects');
+
+CREATE TABLE pgml.snapshots(
+ id BIGSERIAL PRIMARY KEY,
+ relation TEXT NOT NULL,
+ y TEXT NOT NULL,
+ validation_ratio FLOAT4 NOT NULL,
+ validation_strategy TEXT NOT NULL,
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP
+);
+SELECT pgml.auto_updated_at('pgml.snapshots');
+
+CREATE TABLE pgml.models(
+ id BIGSERIAL PRIMARY KEY,
+ project_id BIGINT,
+ snapshot_id BIGINT,
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ pickle BYTEA,
+ CONSTRAINT project_id_fk FOREIGN KEY(project_id) REFERENCES pgml.projects(id),
+ CONSTRAINT snapshot_id_fk FOREIGN KEY(snapshot_id) REFERENCES pgml.snapshots(id)
+);
+SELECT pgml.auto_updated_at('pgml.models');
+
+CREATE TABLE pgml.promotions(
+ project_id BIGINT NOT NULL,
+ model_id BIGINT NOT NULL,
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ CONSTRAINT project_id_fk FOREIGN KEY(project_id) REFERENCES pgml.projects(id),
+ CONSTRAINT model_id_fk FOREIGN KEY(model_id) REFERENCES pgml.models(id)
+);
+CREATE INDEX promotions_project_id_created_at_idx ON pgml.promotions(project_id, created_at);
+SELECT pgml.auto_updated_at('pgml.promotions');
+
+
---
--- Extension version.
---
@@ -15,20 +92,28 @@ AS $$
return pgml.version()
$$ LANGUAGE plpython3u;
+CREATE OR REPLACE FUNCTION pgml.model_regression(model_name TEXT, relation_name TEXT, y_column_name TEXT, algorithm TEXT)
+RETURNS VOID
+AS $$
+ import pgml
+ pgml.model.regression(model_name, relation_name, y_column_name, algorithm)
+$$ LANGUAGE plpython3u;
+
+
---
--- Track table versions.
---
CREATE TABLE pgml.model_versions(
id BIGSERIAL PRIMARY KEY,
- name VARCHAR,
- location VARCHAR NULL,
+ name VARCHAR NOT NULL,
data_source TEXT,
y_column VARCHAR,
started_at TIMESTAMP WITHOUT TIME ZONE DEFAULT CURRENT_TIMESTAMP,
ended_at TIMESTAMP WITHOUT TIME ZONE NULL,
mean_squared_error DOUBLE PRECISION,
r2_score DOUBLE PRECISION,
- successful BOOL NULL
+ successful BOOL NULL,
+ pickle BYTEA
);
---
@@ -54,14 +139,14 @@ AS $$
id_ = start[0]["id"]
name = f"{table_name}_{id_}"
- destination = models_directory(plpy)
+ destination = models_directory()
# Train!
- location, msq, r2 = train(plpy.cursor(data_source), y_column=y, name=name, destination=destination)
+ pickle, msq, r2 = train(plpy.cursor(data_source), y_column=y, name=name, destination=destination)
plpy.execute(f"""
UPDATE pgml.model_versions
- SET location = '{location}',
+ SET pickle = '{pickle}',
successful = true,
mean_squared_error = '{msq}',
r2_score = '{r2}',
@@ -85,7 +170,7 @@ AS $$
if model_name in SD:
model = SD[model_name]
else:
- SD[model_name] = load(model_name, models_directory(plpy))
+ SD[model_name] = load(model_name, models_directory())
model = SD[model_name]
scores = model.predict([features,])
diff --git a/sql/test.sql b/sql/test.sql
index 0488f0a9e..9ee8a766c 100644
--- a/sql/test.sql
+++ b/sql/test.sql
@@ -6,9 +6,6 @@
SELECT pgml.version();
--- Valiate our wine data.
-SELECT pgml.validate('wine_quality_red');
-
-- Train twice
SELECT pgml.train('wine_quality_red', 'quality');
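
The draft model.py above sketches the core fit, test, and save cycle. Condensed into a runnable form outside Postgres (sample data assumed; the real code stores the pickle in pgml.models):

```python
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = [[i] for i in range(100)]
y = [2 * i + 1 for i in range(100)]

# split, fit, and score as the draft intends
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
algo = LinearRegression().fit(X_train, y_train)

y_pred = algo.predict(X_test)
print(mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))

weights = pickle.dumps(algo)  # persisted to a BYTEA column by the patch
```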
From 9907aaab9ab3f5f14c246a920161934b0b612ae9 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Tue, 12 Apr 2022 20:04:17 -0700
Subject: [PATCH 07/15] sketch out the regression model training cycle
---
pgml/pgml/model.py | 166 ++++++++++++++++++++++++++++-----------------
sql/install.sql | 20 ++++--
sql/test.sql | 22 +++---
3 files changed, 132 insertions(+), 76 deletions(-)
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
index f6dc37d47..93c9365d3 100644
--- a/pgml/pgml/model.py
+++ b/pgml/pgml/model.py
@@ -1,95 +1,139 @@
+from cmath import e
import plpy
+from sklearn.linear_model import LinearRegression
+from sklearn.model_selection import train_test_split
+from sklearn.metrics import mean_squared_error, r2_score
+
+import pickle
+
+from pgml.exceptions import PgMLException
+
class Regression:
"""Provides continuous real number predictions learned from the training data.
"""
def __init__(
- model_name: str,
+ self,
+ project_name: str,
relation_name: str,
y_column_name: str,
- implementation: str = "sklearn.linear_model"
+ algorithm: str = "sklearn.linear_model",
+ test_size: float or int = 0.1,
+ test_sampling: str = "random"
) -> None:
"""Create a regression model from a table or view filled with training data.
Args:
- model_name (str): a human friendly identifier
+ project_name (str): a human friendly identifier
relation_name (str): the table or view that stores the training data
y_column_name (str): the column in the training data that acts as the label
- implementation (str, optional): the algorithm used to implement the regression. Defaults to "sklearn.linear_model".
+ algorithm (str, optional): the algorithm used to implement the regression. Defaults to "sklearn.linear_model".
+ test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
+ test_sampling: (str, optional): How to sample to create the test data. Defaults to "random". Valid values are ["first", "last", "random"].
"""
- data_source = f"SELECT * FROM {table_name}"
-
- # Start training.
- start = plpy.execute(f"""
- INSERT INTO pgml.model_versions
- (name, data_source, y_column)
- VALUES
- ('{table_name}', '{data_source}', '{y}')
- RETURNING *""", 1)
-
- id_ = start[0]["id"]
- name = f"{table_name}_{id_}"
-
- destination = models_directory(plpy)
+ plpy.warning("snapshot")
+ # Create a snapshot of the relation
+ snapshot = plpy.execute(f"INSERT INTO pgml.snapshots (relation, y, test_size, test_sampling, status) VALUES ('{relation_name}', '{y_column_name}', {test_size}, '{test_sampling}', 'new') RETURNING *", 1)[0]
+ plpy.execute(f"""CREATE TABLE pgml.snapshot_{snapshot['id']} AS SELECT * FROM "{relation_name}";""")
+ plpy.execute(f"UPDATE pgml.snapshots SET status = 'created' WHERE id = {snapshot['id']}")
+
+ plpy.warning("project")
+ # Find or create the project
+ project = plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{project_name}'", 1)
+ plpy.warning(f"project {project}")
+ if (project.nrows == 1):
+ plpy.warning("project found")
+ project = project[0]
+ else:
+ try:
+ project = plpy.execute(f"INSERT INTO pgml.projects (name) VALUES ('{project_name}') RETURNING *", 1)
+ plpy.warning(f"project inserted {project}")
+ if (project.nrows() == 1):
+ project = project[0]
+
+ except Exception as e: # handle race condition to insert
+ plpy.warning(f"project retry: #{e}")
+ project = plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{project_name}'", 1)[0]
+
+ plpy.warning("model")
+ # Create the model
+ model = plpy.execute(f"INSERT INTO pgml.models (project_id, snapshot_id, algorithm, status) VALUES ({project['id']}, {snapshot['id']}, '{algorithm}', 'training') RETURNING *")[0]
+
+ plpy.warning("data")
+ # Prepare the data
+ data = plpy.execute(f"SELECT * FROM pgml.snapshot_{snapshot['id']}")
+
+ # Sanity check the data
+ if data.nrows == 0:
+ PgMLException(
+ f"Relation `{y_column_name}` contains no rows. Did you pass the correct `relation_name`?"
+ )
+ if y_column_name not in data[0]:
+ PgMLException(
+ f"Column `{y_column_name}` not found. Did you pass the correct `y_column_name`?"
+ )
+
+ # Always pull the columns in the same order from the row.
+ # Python dict iteration is not always in the same order (hash table).
+ columns = []
+ for col in data[0]:
+ if col != y_column_name:
+ columns.append(col)
- # Train!
- pickle, msq, r2 = train(plpy.cursor(data_source), y_column=y, name=name, destination=destination)
+ # Split the label from the features
X = []
y = []
- columns = []
-
- for row in all_rows(cursor):
- row = row.copy()
-
- if y_column not in row:
- PgMLException(
- f"Column `{y}` not found. Did you name your `y_column` correctly?"
- )
-
- y_ = row.pop(y_column)
+ for row in data:
+ plpy.warning(f"row: {row}")
+ y_ = row.pop(y_column_name)
x_ = []
- # Always pull the columns in the same order from the row.
- # Python dict iteration is not always in the same order (hash table).
- if not columns:
- for col in row:
- columns.append(col)
-
for column in columns:
x_.append(row[column])
+
X.append(x_)
y.append(y_)
- X_train, X_test, y_train, y_test = train_test_split(X, y)
-
- # Just linear regression for now, but can add many more later.
- lr = LinearRegression()
- lr.fit(X_train, y_train)
-
+ # Split into training and test sets
+ plpy.warning("split")
+ if (test_sampling == 'random'):
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
+ else:
+ if (test_sampling == 'first'):
+ X.reverse()
+ y.reverse()
+ if isinstance(split, float):
+ split = 1.0 - split
+ split = test_size
+ if isinstance(split, float):
+ split = int(test_size * X.len())
+ X_train, X_test, y_train, y_test = X[0:split], X[split:X.len()-1], y[0:split], y[split:y.len()-1]
+
+ # TODO normalize and clean data
+
+ plpy.warning("train")
+ # Train the model
+ algo = LinearRegression()
+ algo.fit(X_train, y_train)
+
+ plpy.warning("test")
# Test
- y_pred = lr.predict(X_test)
+ y_pred = algo.predict(X_test)
msq = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
- path = os.path.join(destination, name)
-
- if save:
- with open(path, "wb") as f:
- pickle.dump(lr, f)
-
- return path, msq, r2
-
+ plpy.warning("save")
+ # Save the model
+ weights = pickle.dumps(algo)
plpy.execute(f"""
- UPDATE pgml.model_versions
- SET pickle = '{pickle}',
- successful = true,
+ UPDATE pgml.models
+ SET pickle = '\\x{weights.hex()}',
+ status = 'successful',
mean_squared_error = '{msq}',
- r2_score = '{r2}',
- ended_at = clock_timestamp()
- WHERE id = {id_}""")
-
- return name
+ r2_score = '{r2}'
+ WHERE id = {model['id']}
+ """)
- model
+ # TODO: promote the model?
diff --git a/sql/install.sql b/sql/install.sql
index 4f00a5202..04dcc51f6 100644
--- a/sql/install.sql
+++ b/sql/install.sql
@@ -47,13 +47,15 @@ CREATE TABLE pgml.projects(
updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP
);
SELECT pgml.auto_updated_at('pgml.projects');
+CREATE UNIQUE INDEX projects_name_idx ON pgml.projects(name);
CREATE TABLE pgml.snapshots(
id BIGSERIAL PRIMARY KEY,
relation TEXT NOT NULL,
y TEXT NOT NULL,
- validation_ratio FLOAT4 NOT NULL,
- validation_strategy TEXT NOT NULL,
+ test_size FLOAT4 NOT NULL,
+ test_sampling TEXT NOT NULL,
+ status TEXT NOT NULL,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP
);
@@ -61,14 +63,19 @@ SELECT pgml.auto_updated_at('pgml.snapshots');
CREATE TABLE pgml.models(
id BIGSERIAL PRIMARY KEY,
- project_id BIGINT,
- snapshot_id BIGINT,
+ project_id BIGINT NOT NULL,
+ snapshot_id BIGINT NOT NULL,
+ algorithm TEXT NOT NULL,
+ status TEXT NOT NULL,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ mean_squared_error DOUBLE PRECISION,
+ r2_score DOUBLE PRECISION,
pickle BYTEA,
CONSTRAINT project_id_fk FOREIGN KEY(project_id) REFERENCES pgml.projects(id),
CONSTRAINT snapshot_id_fk FOREIGN KEY(snapshot_id) REFERENCES pgml.snapshots(id)
);
+CREATE INDEX models_project_id_created_at_idx ON pgml.models(project_id, created_at);
SELECT pgml.auto_updated_at('pgml.models');
CREATE TABLE pgml.promotions(
@@ -92,11 +99,12 @@ AS $$
return pgml.version()
$$ LANGUAGE plpython3u;
-CREATE OR REPLACE FUNCTION pgml.model_regression(model_name TEXT, relation_name TEXT, y_column_name TEXT, algorithm TEXT)
+CREATE OR REPLACE FUNCTION pgml.model_regression(project_name TEXT, relation_name TEXT, y_column_name TEXT)
RETURNS VOID
AS $$
import pgml
- pgml.model.regression(model_name, relation_name, y_column_name, algorithm)
+ from pgml.model import Regression
+ Regression(project_name, relation_name, y_column_name)
$$ LANGUAGE plpython3u;
diff --git a/sql/test.sql b/sql/test.sql
index 9ee8a766c..eadc30ca9 100644
--- a/sql/test.sql
+++ b/sql/test.sql
@@ -7,15 +7,19 @@
SELECT pgml.version();
-- Train twice
-SELECT pgml.train('wine_quality_red', 'quality');
+-- SELECT pgml.train('wine_quality_red', 'quality');
-SELECT * FROM pgml.model_versions;
+-- SELECT * FROM pgml.model_versions;
+
+-- \timing
+-- WITH latest_model AS (
+-- SELECT name || '_' || id AS model_name FROM pgml.model_versions ORDER BY id DESC LIMIT 1
+-- )
+-- SELECT pgml.score(
+-- (SELECT model_name FROM latest_model), -- last model we just trained
+-- 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4 -- features as variadic arguments
+-- ) AS score;
\timing
-WITH latest_model AS (
- SELECT name || '_' || id AS model_name FROM pgml.model_versions ORDER BY id DESC LIMIT 1
-)
-SELECT pgml.score(
- (SELECT model_name FROM latest_model), -- last model we just trained
- 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4 -- features as variadic arguments
-) AS score;
+
+SELECT pgml.model_regression('Red Wine', 'wine_quality_red', 'quality');
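
The deterministic `first`/`last` split in this draft still has placeholder bugs (`X.len()` is not valid Python, and `split` is read before assignment). A corrected sketch of the intended behavior, under the assumption that `first` reserves the head of the relation for testing and `last` the tail:

```python
from sklearn.model_selection import train_test_split

def split_data(X, y, test_size=0.1, test_sampling="random"):
    if test_sampling == "random":
        return train_test_split(X, y, test_size=test_size, random_state=0)
    if test_sampling == "first":
        # reversing lets the tail-slicing below serve both "first" and "last"
        X, y = X[::-1], y[::-1]
    n_test = int(test_size * len(X)) if isinstance(test_size, float) else test_size
    n_train = len(X) - n_test
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

X = [[i] for i in range(10)]
y = list(range(10))
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2, test_sampling="last")
print(len(X_train), len(X_test))  # 8 2
```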
From b50f000aecd496d4cdb4cced9b1efd60cac4d55d Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Wed, 13 Apr 2022 12:12:31 -0700
Subject: [PATCH 08/15] break it down into model classes
---
pgml/pgml/model.py | 144 +++++++++++++++++++--------------------
pgml/pgml/score.py | 17 -----
pgml/pgml/sql.py | 17 -----
pgml/pgml/train.py | 72 --------------------
pgml/tests/test_train.py | 35 ++--------
sql/install.sql | 20 +++---
6 files changed, 85 insertions(+), 220 deletions(-)
delete mode 100644 pgml/pgml/score.py
delete mode 100644 pgml/pgml/train.py
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
index 93c9365d3..0471cff7a 100644
--- a/pgml/pgml/model.py
+++ b/pgml/pgml/model.py
@@ -1,7 +1,6 @@
-from cmath import e
import plpy
-
from sklearn.linear_model import LinearRegression
+from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
@@ -9,84 +8,48 @@
from pgml.exceptions import PgMLException
-class Regression:
- """Provides continuous real number predictions learned from the training data.
- """
- def __init__(
- self,
- project_name: str,
- relation_name: str,
- y_column_name: str,
- algorithm: str = "sklearn.linear_model",
- test_size: float or int = 0.1,
- test_sampling: str = "random"
- ) -> None:
- """Create a regression model from a table or view filled with training data.
-
- Args:
- project_name (str): a human friendly identifier
- relation_name (str): the table or view that stores the training data
- y_column_name (str): the column in the training data that acts as the label
- algorithm (str, optional): the algorithm used to implement the regression. Defaults to "sklearn.linear_model".
- test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
- test_sampling: (str, optional): How to sample to create the test data. Defaults to "random". Valid values are ["first", "last", "random"].
- """
-
- plpy.warning("snapshot")
- # Create a snapshot of the relation
- snapshot = plpy.execute(f"INSERT INTO pgml.snapshots (relation, y, test_size, test_sampling, status) VALUES ('{relation_name}', '{y_column_name}', {test_size}, '{test_sampling}', 'new') RETURNING *", 1)[0]
- plpy.execute(f"""CREATE TABLE pgml.snapshot_{snapshot['id']} AS SELECT * FROM "{relation_name}";""")
- plpy.execute(f"UPDATE pgml.snapshots SET status = 'created' WHERE id = {snapshot['id']}")
-
- plpy.warning("project")
+class Project:
+ def __init__(self, name):
# Find or create the project
- project = plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{project_name}'", 1)
- plpy.warning(f"project {project}")
- if (project.nrows == 1):
- plpy.warning("project found")
- project = project[0]
+ result = plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{name}'", 1)
+ if (result.nrows == 1):
+ self.__dict__ = dict(result[0])
else:
try:
- project = plpy.execute(f"INSERT INTO pgml.projects (name) VALUES ('{project_name}') RETURNING *", 1)
- plpy.warning(f"project inserted {project}")
- if (project.nrows() == 1):
- project = project[0]
-
+ self.__dict__ = dict(plpy.execute(f"INSERT INTO pgml.projects (name) VALUES ('{name}') RETURNING *", 1)[0])
except Exception as e: # handle race condition to insert
- plpy.warning(f"project retry: #{e}")
- project = plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{project_name}'", 1)[0]
+ self.__dict__ = dict(plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{name}'", 1)[0])
- plpy.warning("model")
- # Create the model
- model = plpy.execute(f"INSERT INTO pgml.models (project_id, snapshot_id, algorithm, status) VALUES ({project['id']}, {snapshot['id']}, '{algorithm}', 'training') RETURNING *")[0]
+class Snapshot:
+ def __init__(self, relation_name, y_column_name, test_size, test_sampling):
+ self.__dict__ = dict(plpy.execute(f"INSERT INTO pgml.snapshots (relation_name, y_column_name, test_size, test_sampling, status) VALUES ('{relation_name}', '{y_column_name}', {test_size}, '{test_sampling}', 'new') RETURNING *", 1)[0])
+ plpy.execute(f"""CREATE TABLE pgml.snapshot_{self.id} AS SELECT * FROM "{relation_name}";""")
+ self.__dict__ = dict(plpy.execute(f"UPDATE pgml.snapshots SET status = 'created' WHERE id = {self.id} RETURNING *")[0])
- plpy.warning("data")
- # Prepare the data
- data = plpy.execute(f"SELECT * FROM pgml.snapshot_{snapshot['id']}")
+ def data(self):
+ data = plpy.execute(f"SELECT * FROM pgml.snapshot_{self.id}")
# Sanity check the data
if data.nrows == 0:
PgMLException(
- f"Relation `{y_column_name}` contains no rows. Did you pass the correct `relation_name`?"
+ f"Relation `{self.y_column_name}` contains no rows. Did you pass the correct `relation_name`?"
)
- if y_column_name not in data[0]:
+ if self.y_column_name not in data[0]:
PgMLException(
- f"Column `{y_column_name}` not found. Did you pass the correct `y_column_name`?"
+ f"Column `{self.y_column_name}` not found. Did you pass the correct `y_column_name`?"
)
# Always pull the columns in the same order from the row.
# Python dict iteration is not always in the same order (hash table).
- columns = []
- for col in data[0]:
- if col != y_column_name:
- columns.append(col)
+ columns = list(data[0].keys())
+ columns.remove(self.y_column_name)
+ columns.sort()
# Split the label from the features
X = []
y = []
for row in data:
- plpy.warning(f"row: {row}")
- y_ = row.pop(y_column_name)
+ y_ = row.pop(self.y_column_name)
x_ = []
for column in columns:
@@ -96,44 +59,79 @@ def __init__(
y.append(y_)
# Split into training and test sets
- plpy.warning("split")
- if (test_sampling == 'random'):
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
+ if (self.test_sampling == 'random'):
+ return train_test_split(X, y, test_size=self.test_size, random_state=0)
else:
- if (test_sampling == 'first'):
+ if (self.test_sampling == 'first'):
X.reverse()
y.reverse()
if isinstance(split, float):
split = 1.0 - split
- split = test_size
+ split = self.test_size
if isinstance(split, float):
- split = int(test_size * X.len())
- X_train, X_test, y_train, y_test = X[0:split], X[split:X.len()-1], y[0:split], y[split:y.len()-1]
+ split = int(self.test_size * X.len())
+ return X[0:split], X[split:X.len()-1], y[0:split], y[split:y.len()-1]
# TODO normalize and clean data
- plpy.warning("train")
+
+class Model:
+ def __init__(self, project, snapshot, algorithm):
+ self.__dict__ = dict(plpy.execute(f"INSERT INTO pgml.models (project_id, snapshot_id, algorithm, status) VALUES ({project.id}, {snapshot.id}, '{algorithm}', 'training') RETURNING *")[0])
+
+ def fit(self, snapshot):
+ X_train, X_test, y_train, y_test = snapshot.data()
+
# Train the model
- algo = LinearRegression()
+ algo = {
+ 'linear': LinearRegression,
+ 'random_forest': RandomForestRegressor
+ }[self.algorithm]()
algo.fit(X_train, y_train)
- plpy.warning("test")
# Test
y_pred = algo.predict(X_test)
msq = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
- plpy.warning("save")
# Save the model
weights = pickle.dumps(algo)
- plpy.execute(f"""
+ self.__dict__ = dict(plpy.execute(f"""
UPDATE pgml.models
SET pickle = '\\x{weights.hex()}',
status = 'successful',
mean_squared_error = '{msq}',
r2_score = '{r2}'
- WHERE id = {model['id']}
- """)
+ WHERE id = {self.id}
+ RETURNING *
+ """)[0])
+class Regression:
+ """Provides continuous real number predictions learned from the training data.
+ """
+ def __init__(
+ self,
+ project_name: str,
+ relation_name: str,
+ y_column_name: str,
+ algorithms: str = ["linear", "random_forest"],
+ test_size: float or int = 0.1,
+ test_sampling: str = "random"
+ ) -> None:
+ """Create a regression model from a table or view filled with training data.
+
+ Args:
+ project_name (str): a human friendly identifier
+ relation_name (str): the table or view that stores the training data
+ y_column_name (str): the column in the training data that acts as the label
+ algorithm (str, optional): the algorithm used to implement the regression. Defaults to "linear". Valid values are ["linear", "random_forest"].
+ test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
+ test_sampling: (str, optional): How to sample to create the test data. Defaults to "random". Valid values are ["first", "last", "random"].
+ """
+ project = Project(project_name)
+ snapshot = Snapshot(relation_name, y_column_name, test_size, test_sampling)
+ for algorithm in algorithms:
+ model = Model(project, snapshot, algorithm)
+ model.fit(snapshot)
# TODO: promote the model?
diff --git a/pgml/pgml/score.py b/pgml/pgml/score.py
deleted file mode 100644
index cbb415825..000000000
--- a/pgml/pgml/score.py
+++ /dev/null
@@ -1,17 +0,0 @@
-"""Score"""
-
-import os
-import pickle
-
-from pgml.exceptions import PgMLException
-
-
-def load(name, source):
- """Load a model from file."""
- path = os.path.join(source, name)
-
- if not os.path.exists(path):
- raise PgMLException(f"Model source directory `{path}` does not exist.")
-
- with open(path, "rb") as f:
- return pickle.load(f)
diff --git a/pgml/pgml/sql.py b/pgml/pgml/sql.py
index 95b19fed9..ed8827bff 100644
--- a/pgml/pgml/sql.py
+++ b/pgml/pgml/sql.py
@@ -14,20 +14,3 @@ def all_rows(cursor):
for row in rows:
yield row
-
-def models_directory():
- """Get the directory where we store our models."""
- data_directory = plpy.execute(
- """
- SELECT setting FROM pg_settings WHERE name = 'data_directory'
- """,
- 1,
- )[0]["setting"]
-
- models_dir = os.path.join(data_directory, "pgml_models")
-
- # TODO: Ideally this happens during extension installation.
- if not os.path.exists(models_dir):
- os.mkdir(models_dir, 0o770)
-
- return models_dir
diff --git a/pgml/pgml/train.py b/pgml/pgml/train.py
deleted file mode 100644
index 2eb8d945b..000000000
--- a/pgml/pgml/train.py
+++ /dev/null
@@ -1,72 +0,0 @@
-"""
-Train the model.
-"""
-
-# TODO: import more models here
-from sklearn.linear_model import LinearRegression
-from sklearn.model_selection import train_test_split
-from sklearn.metrics import mean_squared_error, r2_score
-
-import pickle
-import os
-
-from pgml.sql import all_rows
-from pgml.exceptions import PgMLException
-
-
-def train(cursor, y_column, name, save=True, destination="/tmp/pgml_models"):
- """Train the model on data on some rows.
-
- Arguments:
- - cursor: iterable with rows,
- - y_column: the name of the column containing the y predicate (a.k.a solution),
- - name: the name of the model, e.g 'test_model',
- - save: to save the model to disk or not.
-
- Return:
- Path on disk where the model was saved or could be saved if saved=True.
- """
- X = []
- y = []
- columns = []
-
- for row in all_rows(cursor):
- row = row.copy()
-
- if y_column not in row:
- PgMLException(
- f"Column `{y}` not found. Did you name your `y_column` correctly?"
- )
-
- y_ = row.pop(y_column)
- x_ = []
-
- # Always pull the columns in the same order from the row.
- # Python dict iteration is not always in the same order (hash table).
- if not columns:
- for col in row:
- columns.append(col)
-
- for column in columns:
- x_.append(row[column])
- X.append(x_)
- y.append(y_)
-
- X_train, X_test, y_train, y_test = train_test_split(X, y)
-
- # Just linear regression for now, but can add many more later.
- lr = LinearRegression()
- lr.fit(X_train, y_train)
-
- # Test
- y_pred = lr.predict(X_test)
- msq = mean_squared_error(y_test, y_pred)
- r2 = r2_score(y_test, y_pred)
-
- path = os.path.join(destination, name)
-
- if save:
- with open(path, "wb") as f:
- pickle.dump(lr, f)
-
- return path, msq, r2
diff --git a/pgml/tests/test_train.py b/pgml/tests/test_train.py
index 1cb1c79d0..9a0bb4b0d 100644
--- a/pgml/tests/test_train.py
+++ b/pgml/tests/test_train.py
@@ -1,34 +1,7 @@
import unittest
-from pgml.train import train
+import pgml
-
-class PlPyIterator:
- def __init__(self, values):
- self._values = values
- self._returned = False
-
- def fetch(self, n):
- if self._returned:
- return
- else:
- self._returned = True
- return self._values
-
-
-class TestTrain(unittest.TestCase):
- def test_train(self):
- it = PlPyIterator(
- [
- {
- "value": 5,
- "weight": 5,
- },
- {
- "value": 34,
- "weight": 5,
- },
- ]
- )
-
- train(it, y_column="weight", name="test", save=False)
+class TestRegression(unittest.TestCase):
+ def test_init(self):
+ pgml.model.Regression("Test", "test", "test_y")
self.assertTrue(True)
diff --git a/sql/install.sql b/sql/install.sql
index 04dcc51f6..9bee2724d 100644
--- a/sql/install.sql
+++ b/sql/install.sql
@@ -33,7 +33,7 @@ BEGIN
NEW IS DISTINCT FROM OLD
AND NEW.updated_at IS NOT DISTINCT FROM OLD.updated_at
) THEN
- NEW.updated_at := CURRENT_TIMESTAMP;
+ NEW.updated_at := clock_timestamp();
END IF;
RETURN new;
END;
@@ -43,21 +43,21 @@ LANGUAGE plpgsql;
CREATE TABLE pgml.projects(
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL,
- created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
- updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
+ updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp()
);
SELECT pgml.auto_updated_at('pgml.projects');
CREATE UNIQUE INDEX projects_name_idx ON pgml.projects(name);
CREATE TABLE pgml.snapshots(
id BIGSERIAL PRIMARY KEY,
- relation TEXT NOT NULL,
- y TEXT NOT NULL,
+ relation_name TEXT NOT NULL,
+ y_column_name TEXT NOT NULL,
test_size FLOAT4 NOT NULL,
test_sampling TEXT NOT NULL,
status TEXT NOT NULL,
- created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
- updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
+ updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp()
);
SELECT pgml.auto_updated_at('pgml.snapshots');
@@ -67,8 +67,8 @@ CREATE TABLE pgml.models(
snapshot_id BIGINT NOT NULL,
algorithm TEXT NOT NULL,
status TEXT NOT NULL,
- created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
- updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
+ updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
mean_squared_error DOUBLE PRECISION,
r2_score DOUBLE PRECISION,
pickle BYTEA,
@@ -81,7 +81,7 @@ SELECT pgml.auto_updated_at('pgml.models');
CREATE TABLE pgml.promotions(
project_id BIGINT NOT NULL,
model_id BIGINT NOT NULL,
- created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
+ created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
CONSTRAINT project_id_fk FOREIGN KEY(project_id) REFERENCES pgml.projects(id),
CONSTRAINT model_id_fk FOREIGN KEY(model_id) REFERENCES pgml.models(id)
);
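
Two patterns the new classes lean on are hydrating an object straight from a result row, and find-or-create that tolerates a concurrent insert. A hedged sketch, with a generic `execute()` standing in for `plpy.execute` and parameter placeholders in place of the patch's f-string interpolation:

```python
class Project:
    @classmethod
    def from_row(cls, row: dict):
        project = cls()
        project.__dict__ = dict(row)  # result columns become attributes
        return project

def find_or_create(execute, name: str) -> Project:
    rows = execute("SELECT * FROM pgml.projects WHERE name = %s", (name,))
    if rows:
        return Project.from_row(rows[0])
    try:
        row = execute("INSERT INTO pgml.projects (name) VALUES (%s) RETURNING *", (name,))[0]
        return Project.from_row(row)
    except Exception:
        # lost the race: the unique index on projects(name) means another
        # session inserted it first, so this re-select must succeed
        return Project.from_row(execute("SELECT * FROM pgml.projects WHERE name = %s", (name,))[0])
```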
From 89b467d16204d44c524a7beb213eef61d605c730 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Wed, 13 Apr 2022 19:53:50 -0700
Subject: [PATCH 09/15] add categoricals
---
pgml/pgml/model.py | 253 +++++++++++++++++++++++++++++----------
pgml/pgml/sql.py | 22 +---
pgml/tests/test_train.py | 2 +-
sql/install.sql | 91 +++-----------
sql/test.sql | 22 ++--
5 files changed, 220 insertions(+), 170 deletions(-)
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
index 0471cff7a..a6059f329 100644
--- a/pgml/pgml/model.py
+++ b/pgml/pgml/model.py
@@ -1,38 +1,104 @@
import plpy
from sklearn.linear_model import LinearRegression
-from sklearn.ensemble import RandomForestRegressor
+from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pickle
from pgml.exceptions import PgMLException
+from pgml.sql import q
-class Project:
- def __init__(self, name):
- # Find or create the project
- result = plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{name}'", 1)
- if (result.nrows == 1):
- self.__dict__ = dict(result[0])
- else:
- try:
- self.__dict__ = dict(plpy.execute(f"INSERT INTO pgml.projects (name) VALUES ('{name}') RETURNING *", 1)[0])
- except Exception as e: # handle race condition to insert
- self.__dict__ = dict(plpy.execute(f"SELECT * FROM pgml.projects WHERE name = '{name}'", 1)[0])
+class Project(object):
+ _cache = {}
+
+ @classmethod
+ def find(cls, id):
+ result = plpy.execute(f"""
+ SELECT *
+ FROM pgml.projects
+ WHERE id = {q(id)}
+ """, 1)
+ if (result.nrows == 0):
+ return None
+
+ project = Project()
+ project.__dict__ = dict(result[0])
+ project.__init__()
+ cls._cache[project.name] = project
+ return project
+
+ @classmethod
+ def find_by_name(cls, name):
+ if name in cls._cache:
+ return cls._cache[name]
+
+ result = plpy.execute(f"""
+ SELECT *
+ FROM pgml.projects
+ WHERE name = {q(name)}
+ """, 1)
+ if (result.nrows == 0):
+ return None
+
+ project = Project()
+ project.__dict__ = dict(result[0])
+ project.__init__()
+ cls._cache[name] = project
+ return project
+
+ @classmethod
+ def create(cls, name, objective):
+ project = Project()
+ project.__dict__ = dict(plpy.execute(f"""
+ INSERT INTO pgml.projects (name, objective)
+ VALUES ({q(name)}, {q(objective)})
+ RETURNING *
+ """, 1)[0])
+ project.__init__()
+ cls._cache[name] = project
+ return project
+
+ def __init__(self):
+ self._deployed_model = None
+
+ @property
+ def deployed_model(self):
+ if self._deployed_model is None:
+ self._deployed_model = Model.find_deployed(self.id)
+ return self._deployed_model
-class Snapshot:
- def __init__(self, relation_name, y_column_name, test_size, test_sampling):
- self.__dict__ = dict(plpy.execute(f"INSERT INTO pgml.snapshots (relation_name, y_column_name, test_size, test_sampling, status) VALUES ('{relation_name}', '{y_column_name}', {test_size}, '{test_sampling}', 'new') RETURNING *", 1)[0])
- plpy.execute(f"""CREATE TABLE pgml.snapshot_{self.id} AS SELECT * FROM "{relation_name}";""")
- self.__dict__ = dict(plpy.execute(f"UPDATE pgml.snapshots SET status = 'created' WHERE id = {self.id} RETURNING *")[0])
+class Snapshot(object):
+ @classmethod
+ def create(cls, relation_name, y_column_name, test_size, test_sampling):
+ snapshot = Snapshot()
+ snapshot.__dict__ = dict(plpy.execute(f"""
+ INSERT INTO pgml.snapshots (relation_name, y_column_name, test_size, test_sampling, status)
+ VALUES ({q(relation_name)}, {q(y_column_name)}, {q(test_size)}, {q(test_sampling)}, 'new')
+ RETURNING *
+ """, 1)[0])
+ plpy.execute(f"""
+ CREATE TABLE pgml."snapshot_{snapshot.id}" AS
+ SELECT * FROM "{snapshot.relation_name}";
+ """)
+ snapshot.__dict__ = dict(plpy.execute(f"""
+ UPDATE pgml.snapshots
+ SET status = 'created'
+ WHERE id = {q(snapshot.id)}
+ RETURNING *
+ """)[0])
+ return snapshot
def data(self):
- data = plpy.execute(f"SELECT * FROM pgml.snapshot_{self.id}")
+ data = plpy.execute(f"""
+ SELECT *
+ FROM pgml."snapshot_{self.id}"
+ """)
# Sanity check the data
if data.nrows == 0:
PgMLException(
- f"Relation `{self.y_column_name}` contains no rows. Did you pass the correct `relation_name`?"
+ f"Relation `{self.relation_name}` contains no rows. Did you pass the correct `relation_name`?"
)
if self.y_column_name not in data[0]:
PgMLException(
@@ -74,64 +140,127 @@ def data(self):
# TODO normalize and clean data
+class Model(object):
+ @classmethod
+ def create(cls, project, snapshot, algorithm_name):
+ result = plpy.execute(f"""
+ INSERT INTO pgml.models (project_id, snapshot_id, algorithm_name, status)
+ VALUES ({q(project.id)}, {q(snapshot.id)}, {q(algorithm_name)}, 'training')
+ RETURNING *
+ """)
+ model = Model()
+ model.__dict__ = dict(result[0])
+ model.__init__()
+ model._project = project
+ return model
+
+ @classmethod
+ def find_deployed(cls, project_id):
+ result = plpy.execute(f"""
+ SELECT models.*
+ FROM pgml.models
+ JOIN pgml.deployments
+ ON deployments.model_id = models.id
+ AND deployments.project_id = {q(project_id)}
+ ORDER by deployments.created_at DESC
+ LIMIT 1
+ """)
+ if (result.nrows == 0):
+ return None
+
+ model = Model()
+ model.__dict__ = dict(result[0])
+ model.__init__()
+ return model
-class Model:
- def __init__(self, project, snapshot, algorithm):
- self.__dict__ = dict(plpy.execute(f"INSERT INTO pgml.models (project_id, snapshot_id, algorithm, status) VALUES ({project.id}, {snapshot.id}, '{algorithm}', 'training') RETURNING *")[0])
+ def __init__(self):
+ self._algorithm = None
+ self._project = None
+
+ @property
+ def project(self):
+ if self._project is None:
+ self._project = Project.find(self.project_id)
+ return self._project
+
+ @property
+ def algorithm(self):
+ if self._algorithm is None:
+ if self.pickle is not None:
+ self._algorithm = pickle.loads(self.pickle)
+ else:
+ self._algorithm = {
+ 'linear_regression': LinearRegression,
+ 'random_forest_regression': RandomForestRegressor,
+ 'random_forest_classification': RandomForestClassifier
+ }[self.algorithm_name + '_' + self.project.objective]()
+
+ return self._algorithm
def fit(self, snapshot):
X_train, X_test, y_train, y_test = snapshot.data()
# Train the model
- algo = {
- 'linear': LinearRegression,
- 'random_forest': RandomForestRegressor
- }[self.algorithm]()
- algo.fit(X_train, y_train)
+ self.algorithm.fit(X_train, y_train)
# Test
- y_pred = algo.predict(X_test)
+ y_pred = self.algorithm.predict(X_test)
msq = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Save the model
- weights = pickle.dumps(algo)
-
self.__dict__ = dict(plpy.execute(f"""
UPDATE pgml.models
- SET pickle = '\\x{weights.hex()}',
+ SET pickle = '\\x{pickle.dumps(self.algorithm).hex()}',
status = 'successful',
- mean_squared_error = '{msq}',
- r2_score = '{r2}'
- WHERE id = {self.id}
+ mean_squared_error = {q(msq)},
+ r2_score = {q(r2)}
+ WHERE id = {q(self.id)}
RETURNING *
""")[0])
-class Regression:
- """Provides continuous real number predictions learned from the training data.
- """
- def __init__(
- self,
- project_name: str,
- relation_name: str,
- y_column_name: str,
- algorithms: str = ["linear", "random_forest"],
- test_size: float or int = 0.1,
- test_sampling: str = "random"
- ) -> None:
- """Create a regression model from a table or view filled with training data.
-
- Args:
- project_name (str): a human friendly identifier
- relation_name (str): the table or view that stores the training data
- y_column_name (str): the column in the training data that acts as the label
- algorithm (str, optional): the algorithm used to implement the regression. Defaults to "linear". Valid values are ["linear", "random_forest"].
- test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
- test_sampling: (str, optional): How to sample to create the test data. Defaults to "random". Valid values are ["first", "last", "random"].
- """
- project = Project(project_name)
- snapshot = Snapshot(relation_name, y_column_name, test_size, test_sampling)
- for algorithm in algorithms:
- model = Model(project, snapshot, algorithm)
- model.fit(snapshot)
- # TODO: promote the model?
+ def deploy(self):
+ plpy.execute(f"""
+ INSERT INTO pgml.deployments (project_id, model_id)
+ VALUES ({q(self.project_id)}, {q(self.id)})
+ """)
+
+ def predict(self, data):
+ return self.algorithm.predict(data)
+
+
+def train(
+ project_name: str,
+ objective: str,
+ relation_name: str,
+ y_column_name: str,
+ test_size: float or int = 0.1,
+ test_sampling: str = "random"
+) -> None:
+ """Create a regression model from a table or view filled with training data.
+
+ Args:
+ project_name (str): a human friendly identifier
+ objective (str): Defaults to "regression". Valid values are ["regression", "classification"].
+ relation_name (str): the table or view that stores the training data
+ y_column_name (str): the column in the training data that acts as the label
+ algorithm (str, optional): the algorithm used to implement the objective. Defaults to "linear". Valid values are ["linear", "random_forest"].
+ test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
+ test_sampling: (str, optional): How to sample to create the test data. Defaults to "random". Valid values are ["first", "last", "random"].
+ """
+ project = Project.create(project_name, objective)
+ snapshot = Snapshot.create(relation_name, y_column_name, test_size, test_sampling)
+ best_model = None
+ best_error = None
+ if objective == "regression":
+ algorithms = ["linear", "random_forest"]
+ elif objective == "classification":
+ algorithms = ["random_forest"]
+ else:
+ raise PgMLException(f"Unknown objective `{objective}`. Valid values are ['regression', 'classification'].")
+
+ for algorithm_name in algorithms:
+ model = Model.create(project, snapshot, algorithm_name)
+ model.fit(snapshot)
+ if best_error is None or model.mean_squared_error < best_error:
+ best_error = model.mean_squared_error
+ best_model = model
+ best_model.deploy()
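
The train() flow above is a straightforward model-selection loop: fit every candidate algorithm on the same snapshot, score each against the held-out test split, and deploy the candidate with the lowest mean squared error. A minimal standalone sketch of the same pattern using scikit-learn directly; the estimators match the patch's "linear" and "random_forest" regression candidates, but the toy data is illustrative only:

```python
# Sketch of the candidate-selection loop in pgml.model.train, outside Postgres.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = [[i, 2 * i] for i in range(100)]
y = [3.0 * i for i in range(100)]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

candidates = {"linear": LinearRegression(), "random_forest": RandomForestRegressor()}
best_name, best_error = None, None
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)  # train on the snapshot's training split
    error = mean_squared_error(y_test, estimator.predict(X_test))  # score on the test split
    if best_error is None or error < best_error:  # keep the lowest-error candidate
        best_name, best_error = name, error

print(f"would deploy: {best_name} (mse={best_error:.4f})")
```

In the patch itself the winner is recorded in pgml.deployments via Model.deploy(), so pgml.predict picks it up on the next call.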
diff --git a/pgml/pgml/sql.py b/pgml/pgml/sql.py
index ed8827bff..79ab69bdc 100644
--- a/pgml/pgml/sql.py
+++ b/pgml/pgml/sql.py
@@ -1,16 +1,6 @@
-"""Tools to run SQL.
-"""
-import os
-import plpy
-
-
-def all_rows(cursor):
- """Fetch all rows from a plpy-like cursor."""
- while True:
- rows = cursor.fetch(5)
- if not rows:
- return
-
- for row in rows:
- yield row
-
+from plpy import quote_literal
+
+def q(obj):
+ if isinstance(obj, str):
+ return quote_literal(obj)
+ return obj
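
The new q() helper routes string values through plpy.quote_literal before they are spliced into f-string SQL, while non-string values pass through untouched. A rough standalone illustration of the intended behavior; quote_literal is stubbed here because the real function only exists inside PL/Python:

```python
# Sketch of q()'s behavior outside PL/Python. quote_literal is a stand-in;
# the real implementation is provided by the plpy module inside Postgres.
def quote_literal(literal: str) -> str:
    return "'" + literal.replace("'", "''") + "'"  # crude approximation of Postgres quoting

def q(obj):
    if isinstance(obj, str):
        return quote_literal(obj)  # strings are quoted and escaped
    return obj                     # numbers pass through as-is

print(q("Red Wine"))  # 'Red Wine' -- safe to interpolate into SQL
print(q(0.1))         # 0.1
```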
diff --git a/pgml/tests/test_train.py b/pgml/tests/test_train.py
index 9a0bb4b0d..28ab8598e 100644
--- a/pgml/tests/test_train.py
+++ b/pgml/tests/test_train.py
@@ -3,5 +3,5 @@
class TestRegression(unittest.TestCase):
def test_init(self):
- pgml.model.Regression("Test", "test", "test_y")
+ pgml.model.train("Test", "regression", "test", "test_y")
self.assertTrue(True)
diff --git a/sql/install.sql b/sql/install.sql
index 9bee2724d..b2758cda5 100644
--- a/sql/install.sql
+++ b/sql/install.sql
@@ -43,6 +43,7 @@ LANGUAGE plpgsql;
CREATE TABLE pgml.projects(
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL,
+ objective TEXT NOT NULL,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp()
);
@@ -65,7 +66,7 @@ CREATE TABLE pgml.models(
id BIGSERIAL PRIMARY KEY,
project_id BIGINT NOT NULL,
snapshot_id BIGINT NOT NULL,
- algorithm TEXT NOT NULL,
+ algorithm_name TEXT NOT NULL,
status TEXT NOT NULL,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
@@ -78,19 +79,19 @@ CREATE TABLE pgml.models(
CREATE INDEX models_project_id_created_at_idx ON pgml.models(project_id, created_at);
SELECT pgml.auto_updated_at('pgml.models');
-CREATE TABLE pgml.promotions(
+CREATE TABLE pgml.deployments(
project_id BIGINT NOT NULL,
model_id BIGINT NOT NULL,
created_at TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT clock_timestamp(),
CONSTRAINT project_id_fk FOREIGN KEY(project_id) REFERENCES pgml.projects(id),
CONSTRAINT model_id_fk FOREIGN KEY(model_id) REFERENCES pgml.models(id)
);
-CREATE INDEX promotions_project_id_created_at_idx ON pgml.promotions(project_id, created_at);
-SELECT pgml.auto_updated_at('pgml.promotions');
+CREATE INDEX deployments_project_id_created_at_idx ON pgml.deployments(project_id, created_at);
+SELECT pgml.auto_updated_at('pgml.deployments');
---
---- Extension version.
+--- Extension version
---
CREATE OR REPLACE FUNCTION pgml.version()
RETURNS TEXT
@@ -99,88 +100,24 @@ AS $$
return pgml.version()
$$ LANGUAGE plpython3u;
-CREATE OR REPLACE FUNCTION pgml.model_regression(project_name TEXT, relation_name TEXT, y_column_name TEXT)
-RETURNS VOID
-AS $$
- import pgml
- from pgml.model import Regression
- Regression(project_name, relation_name, y_column_name)
-$$ LANGUAGE plpython3u;
-
-
---
---- Track table versions.
----
-CREATE TABLE pgml.model_versions(
- id BIGSERIAL PRIMARY KEY,
- name VARCHAR NOT NULL,
- data_source TEXT,
- y_column VARCHAR,
- started_at TIMESTAMP WITHOUT TIME ZONE DEFAULT CURRENT_TIMESTAMP,
- ended_at TIMESTAMP WITHOUT TIME ZONE NULL,
- mean_squared_error DOUBLE PRECISION,
- r2_score DOUBLE PRECISION,
- successful BOOL NULL,
- pickle BYTEA
-);
-
+--- Regression
---
---- Train the model.
----
-CREATE OR REPLACE FUNCTION pgml.train(table_name TEXT, y TEXT)
-RETURNS TEXT
+CREATE OR REPLACE FUNCTION pgml.train(project_name TEXT, objective TEXT, relation_name TEXT, y_column_name TEXT)
+RETURNS VOID
AS $$
- from pgml.train import train
- from pgml.sql import models_directory
- import os
-
- data_source = f"SELECT * FROM {table_name}"
-
- # Start training.
- start = plpy.execute(f"""
- INSERT INTO pgml.model_versions
- (name, data_source, y_column)
- VALUES
- ('{table_name}', '{data_source}', '{y}')
- RETURNING *""", 1)
-
- id_ = start[0]["id"]
- name = f"{table_name}_{id_}"
+ from pgml.model import train
- destination = models_directory()
-
- # Train!
- pickle, msq, r2 = train(plpy.cursor(data_source), y_column=y, name=name, destination=destination)
-
- plpy.execute(f"""
- UPDATE pgml.model_versions
- SET pickle = '{pickle}',
- successful = true,
- mean_squared_error = '{msq}',
- r2_score = '{r2}',
- ended_at = clock_timestamp()
- WHERE id = {id_}""")
-
- return name
+ train(project_name, objective, relation_name, y_column_name)
$$ LANGUAGE plpython3u;
-
---
--- Predict
---
-CREATE OR REPLACE FUNCTION pgml.score(model_name TEXT, VARIADIC features DOUBLE PRECISION[])
+CREATE OR REPLACE FUNCTION pgml.predict(project_name TEXT, VARIADIC features DOUBLE PRECISION[])
RETURNS DOUBLE PRECISION
AS $$
- from pgml.sql import models_directory
- from pgml.score import load
- import pickle
-
- if model_name in SD:
- model = SD[model_name]
- else:
- SD[model_name] = load(model_name, models_directory())
- model = SD[model_name]
+ from pgml.model import Project
- scores = model.predict([features,])
- return scores[0]
+ return Project.find_by_name(project_name).deployed_model.predict([features,])[0]
$$ LANGUAGE plpython3u;
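
The rewritten pgml.predict resolves a project by name and delegates to whichever model was most recently deployed for it, replacing the old per-backend SD dictionary with caching inside Project. A standalone sketch of that two-level pattern; FakeCatalog, fetch_deployed_pickle, and the toy model are hypothetical stand-ins for the pgml.projects and pgml.deployments lookups:

```python
# Sketch of the caching behind pgml.predict: the deployed model is fetched and
# unpickled once per session, then reused from memory.
import pickle
from sklearn.linear_model import LinearRegression

class FakeCatalog:
    def fetch_deployed_pickle(self, project_name):
        model = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])
        return pickle.dumps(model)  # what pgml.models stores in its pickle column

_cache = {}  # mirrors Project._cache, keyed by project name

def predict(catalog, project_name, features):
    if project_name not in _cache:  # first call in a session hits the catalog
        _cache[project_name] = pickle.loads(catalog.fetch_deployed_pickle(project_name))
    return _cache[project_name].predict([features])[0]  # later calls stay in memory

print(predict(FakeCatalog(), "Red Wine", [0.5]))  # ~0.5
```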
diff --git a/sql/test.sql b/sql/test.sql
index eadc30ca9..5b239ced0 100644
--- a/sql/test.sql
+++ b/sql/test.sql
@@ -6,20 +6,14 @@
SELECT pgml.version();
--- Train twice
--- SELECT pgml.train('wine_quality_red', 'quality');
-
--- SELECT * FROM pgml.model_versions;
+\timing
--- \timing
--- WITH latest_model AS (
--- SELECT name || '_' || id AS model_name FROM pgml.model_versions ORDER BY id DESC LIMIT 1
--- )
--- SELECT pgml.score(
--- (SELECT model_name FROM latest_model), -- last model we just trained
--- 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4 -- features as variadic arguments
--- ) AS score;
+SELECT pgml.train('Red Wine', 'regression', 'wine_quality_red', 'quality');
+SELECT pgml.predict('Red Wine', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.predict('Red Wine', 6.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.predict('Red Wine', 5.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.predict('Red Wine', 3.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
-\timing
+SELECT pgml.train('Red Wine Categories', 'classification', 'wine_quality_red', 'quality');
+SELECT pgml.predict('Red Wine', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
-SELECT pgml.model_regression('Red Wine', 'wine_quality_red', 'quality');
From d9d6727cc150ac0b66dce66b3bf1452c98eda73b Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Wed, 13 Apr 2022 20:00:38 -0700
Subject: [PATCH 10/15] Update pgml/tests/test_train.py
Co-authored-by: Lev Kokotov
---
pgml/tests/test_train.py | 1 -
1 file changed, 1 deletion(-)
diff --git a/pgml/tests/test_train.py b/pgml/tests/test_train.py
index 28ab8598e..1440de966 100644
--- a/pgml/tests/test_train.py
+++ b/pgml/tests/test_train.py
@@ -4,4 +4,3 @@
class TestRegression(unittest.TestCase):
def test_init(self):
pgml.model.train("Test", "regression", "test", "test_y")
- self.assertTrue(True)
From dfb57c6fa67923af84a402e8a6474f792ce0644a Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Wed, 13 Apr 2022 20:09:47 -0700
Subject: [PATCH 11/15] fix categorical test
---
sql/test.sql | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/sql/test.sql b/sql/test.sql
index 5b239ced0..7522f83ec 100644
--- a/sql/test.sql
+++ b/sql/test.sql
@@ -8,12 +8,12 @@ SELECT pgml.version();
\timing
-SELECT pgml.train('Red Wine', 'regression', 'wine_quality_red', 'quality');
-SELECT pgml.predict('Red Wine', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
-SELECT pgml.predict('Red Wine', 6.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
-SELECT pgml.predict('Red Wine', 5.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
-SELECT pgml.predict('Red Wine', 3.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.train('Red Wine Scores', 'regression', 'wine_quality_red', 'quality');
+SELECT pgml.predict('Red Wine Scores', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.predict('Red Wine Scores', 6.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.predict('Red Wine Scores', 5.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.predict('Red Wine Scores', 3.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
SELECT pgml.train('Red Wine Categories', 'classification', 'wine_quality_red', 'quality');
-SELECT pgml.predict('Red Wine', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
+SELECT pgml.predict('Red Wine Categories', 7.4, 0.7, 0, 1.9, 0.076, 11, 34, 0.99, 2, 0.5, 9.4);
From a1ef9094c41f545fffeb2b7bc34ad12dd9de1386 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Thu, 14 Apr 2022 10:14:09 -0700
Subject: [PATCH 12/15] docs
---
pgml/pgml/model.py | 152 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 139 insertions(+), 13 deletions(-)
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
index a6059f329..d8d051196 100644
--- a/pgml/pgml/model.py
+++ b/pgml/pgml/model.py
@@ -10,10 +10,32 @@
from pgml.sql import q
class Project(object):
+ """
+ Projects group and compare multiple models trained on a particular dataset for a specific objective.
+
+ Attributes:
+ id (int): a unique identifier
+ name (str): a human friendly unique identifier
+ objective (str): the purpose of this project
+ created_at (Timestamp): when this project was created
+ updated_at (Timestamp): when this project was last updated
+ """
+
_cache = {}
+ def __init__(self):
+ self._deployed_model = None
+
@classmethod
- def find(cls, id):
+ def find(cls, id: int):
+ """
+ Get a Project from the database.
+
+ Args:
+ id (int): the project id
+ Returns:
+ Project or None: instantiated from the database if found
+ """
result = plpy.execute(f"""
SELECT *
FROM pgml.projects
@@ -29,7 +51,18 @@ def find(cls, id):
return project
@classmethod
- def find_by_name(cls, name):
+ def find_by_name(cls, name: str):
+ """
+ Get a Project from the database by name.
+
+ This is the preferred API to retrieve projects; they are cached by
+ name to avoid a round trip to the database on every use.
+
+ Args:
+ name (str): the project name
+ Returns:
+ Project or None: instantiated from the database if found
+ """
if name in cls._cache:
return cls._cache[name]
@@ -48,7 +81,17 @@ def find_by_name(cls, name):
return project
@classmethod
- def create(cls, name, objective):
+ def create(cls, name: str, objective: str):
+ """
+ Create a Project and save it to the database.
+
+ Args:
+ name (str): a human friendly identifier
+ objective (str): valid values are ["regression", "classification"].
+ Returns:
+ Project: instantiated from the database
+ """
+
project = Project()
project.__dict__ = dict(plpy.execute(f"""
INSERT INTO pgml.projects (name, objective)
@@ -59,18 +102,48 @@ def create(cls, name, objective):
cls._cache[name] = project
return project
- def __init__(self):
- self._deployed_model = None
-
@property
def deployed_model(self):
+ """
+ Returns:
+ Model: that should currently be used for predictions
+ """
if self._deployed_model is None:
self._deployed_model = Model.find_deployed(self.id)
return self._deployed_model
class Snapshot(object):
+ """
+ Snapshots capture a set of training & test data for repeatability.
+
+ Attributes:
+ id (int): a unique identifier
+ relation_name (str): the name of the table or view to snapshot
+ y_column_name (str): the label for training data
+ test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
+ test_sampling (str, optional): How to sample to create the test data. Defaults to "random". Valid values are ["first", "last", "random"].
+ status (str): The current status of the snapshot, e.g. 'new' or 'created'
+ created_at (Timestamp): when this snapshot was created
+ updated_at (Timestamp): when this snapshot was last updated
+ """
@classmethod
- def create(cls, relation_name, y_column_name, test_size, test_sampling):
+ def create(cls, relation_name: str, y_column_name: str, test_size: float or int, test_sampling: str):
+ """
+ Create a Snapshot and save it to the database.
+
+ This creates both a metadata record in the snapshots table, as well as creating a new table
+ that holds a snapshot of all the data currently present in the relation so that training
+ runs may be repeated, or further analysis may be conducted against the input.
+
+ Args:
+ relation_name (str): the name of the table or view to snapshot
+ y_column_name (str): the label for training data
+ test_size (float or int, optional): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
+ test_sampling (str, optional): How to sample to create the test data. Defaults to "random". Valid values are ["first", "last", "random"].
+ Returns:
+ Snapshot: metadata instantiated from the database
+ """
+
snapshot = Snapshot()
snapshot.__dict__ = dict(plpy.execute(f"""
INSERT INTO pgml.snapshots (relation_name, y_column_name, test_size, test_sampling, status)
@@ -90,6 +163,10 @@ def create(cls, relation_name, y_column_name, test_size, test_sampling):
return snapshot
def data(self):
+ """
+ Returns:
+ list, list, list, list: All rows from the snapshot split into X_train, X_test, y_train, y_test sets.
+ """
data = plpy.execute(f"""
SELECT *
FROM pgml."snapshot_{self.id}"
@@ -141,11 +218,35 @@ def data(self):
# TODO normalize and clean data
class Model(object):
+ """Models use an algorithm on a snapshot of data to record the parameters learned.
+
+ Attributes:
+ project (Project): the project this model belongs to
+ snapshot (Snapshot): the snapshot that provides the training and test data
+ algorithm_name (str): the name of the algorithm used to train this model
+ status (str): The current status of the model, e.g. 'new', 'training' or 'successful'
+ created_at (Timestamp): when this model was created
+ updated_at (Timestamp): when this model was last updated
+ mean_squared_error (float): the mean squared error of predictions on the test set
+ r2_score (float): the coefficient of determination (R²) of predictions on the test set
+ pickle (bytes): the serialized version of the model parameters
+ algorithm: the in-memory version of the model parameters that can make predictions
+ """
@classmethod
- def create(cls, project, snapshot, algorithm_name):
+ def create(cls, project: Project, snapshot: Snapshot, algorithm_name: str):
+ """
+ Create a Model and save it to the database.
+
+ Args:
+ project (Project): the project this model belongs to
+ snapshot (Snapshot): the snapshot that provides the training and test data
+ algorithm_name (str): valid values are ["linear", "random_forest"]
+ Returns:
+ Model: instantiated from the database
+ """
result = plpy.execute(f"""
INSERT INTO pgml.models (project_id, snapshot_id, algorithm_name, status)
- VALUES ({q(project.id)}, {q(snapshot.id)}, {q(algorithm_name)}, 'training')
+ VALUES ({q(project.id)}, {q(snapshot.id)}, {q(algorithm_name)}, 'new')
RETURNING *
""")
model = Model()
@@ -155,7 +256,13 @@ def create(cls, project, snapshot, algorithm_name):
return model
@classmethod
- def find_deployed(cls, project_id):
+ def find_deployed(cls, project_id: int):
+ """
+ Args:
+ project_id (int): The project id
+ Returns:
+ Model: that should currently be used for predictions of the project
+ """
result = plpy.execute(f"""
SELECT models.*
FROM pgml.models
@@ -179,6 +286,10 @@ def __init__(self):
@property
def project(self):
+ """
+ Returns:
+ Project: that this model belongs to
+ """
if self._project is None:
self._project = Project.find(self.project_id)
return self._project
@@ -197,7 +308,13 @@ def algorithm(self):
return self._algorithm
- def fit(self, snapshot):
+ def fit(self, snapshot: Snapshot):
+ """
+ Learns the parameters of this model and records them in the database.
+
+ Args:
+ snapshot (Snapshot): dataset used to train this model
+ """
X_train, X_test, y_train, y_test = snapshot.data()
# Train the model
@@ -220,12 +337,21 @@ def fit(self, snapshot):
""")[0])
def deploy(self):
+ """Promote this model to the active version for the project that will be used for predictions"""
plpy.execute(f"""
INSERT INTO pgml.deployments (project_id, model_id)
VALUES ({q(self.project_id)}, {q(self.id)})
""")
- def predict(self, data):
+ def predict(self, data: list):
+ """Use the model for a set of features.
+
+ Args:
+ data (list): list of features to form a single prediction for
+
+ Returns:
+ float or int: a continuous score for regressions or a class label for classifications
+ """
return self.algorithm.predict(data)
@@ -236,7 +362,7 @@ def train(
y_column_name: str,
test_size: float or int = 0.1,
test_sampling: str = "random"
-) -> None:
+):
"""Create a regression model from a table or view filled with training data.
Args:
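
Taken together, the documented classes form a small pipeline: create a Project with an objective, snapshot the training relation, fit a Model per algorithm, and deploy the winner. A hedged end-to-end sketch against the documented API; it only runs inside PL/Python, where the plpy module and the pgml schema exist:

```python
# End-to-end use of the documented API. Runs only inside a plpython3u
# function, since pgml.model imports plpy at module load.
from pgml.model import Project, Snapshot, Model

project = Project.create("Red Wine Scores", "regression")
snapshot = Snapshot.create("wine_quality_red", "quality", test_size=0.1, test_sampling="random")

model = Model.create(project, snapshot, "linear")
model.fit(snapshot)   # trains, tests, and pickles the parameters into pgml.models
model.deploy()        # becomes the model pgml.predict uses for this project

print(model.mean_squared_error, model.r2_score)  # populated by fit()
```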
From c2de3d84456658aca3b04e5e8d4b4a8ec53bb107 Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Thu, 14 Apr 2022 10:56:21 -0700
Subject: [PATCH 13/15] make test that "works"
---
pgml/pgml/model.py | 11 +++----
pgml/tests/plpy.py | 16 ++++++++++
pgml/tests/test_model.py | 65 ++++++++++++++++++++++++++++++++++++++++
pgml/tests/test_train.py | 6 ----
4 files changed, 87 insertions(+), 11 deletions(-)
create mode 100644 pgml/tests/plpy.py
create mode 100644 pgml/tests/test_model.py
delete mode 100644 pgml/tests/test_train.py
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
index d8d051196..b38d3f427 100644
--- a/pgml/pgml/model.py
+++ b/pgml/pgml/model.py
@@ -41,7 +41,7 @@ def find(cls, id: int):
FROM pgml.projects
WHERE id = {q(id)}
""", 1)
- if (result.nrows == 0):
+ if (len(result) == 0):
return None
project = Project()
@@ -71,7 +71,7 @@ def find_by_name(cls, name: str):
FROM pgml.projects
WHERE name = {q(name)}
""", 1)
- if (result.nrows == 0):
+ if (len(result) == 0):
return None
project = Project()
@@ -159,7 +159,7 @@ def create(cls, relation_name: str, y_column_name: str, test_size: float or int,
SET status = 'created'
WHERE id = {q(snapshot.id)}
RETURNING *
- """)[0])
+ """, 1)[0])
return snapshot
def data(self):
@@ -172,8 +172,9 @@ def data(self):
FROM pgml."snapshot_{self.id}"
""")
# Sanity check the data
- if data.nrows == 0:
+ if len(data) == 0:
PgMLException(
f"Relation `{self.relation_name}` contains no rows. Did you pass the correct `relation_name`?"
)
@@ -272,7 +273,7 @@ def find_deployed(cls, project_id: int):
ORDER by deployments.created_at DESC
LIMIT 1
""")
- if (result.nrows == 0):
+ if (len(result) == 0):
return None
model = Model()
diff --git a/pgml/tests/plpy.py b/pgml/tests/plpy.py
new file mode 100644
index 000000000..4bbbbc6fd
--- /dev/null
+++ b/pgml/tests/plpy.py
@@ -0,0 +1,16 @@
+from collections import deque
+
+execute_results = deque()
+
+def quote_literal(literal):
+ return "'" + literal + "'"
+
+def execute(sql, lines = 0):
+ if len(execute_results) > 0:
+ result = execute_results.popleft()
+ return result
+ else:
+ return []
+
+def add_mock_result(result):
+ execute_results.append(result)
diff --git a/pgml/tests/test_model.py b/pgml/tests/test_model.py
new file mode 100644
index 000000000..02605982d
--- /dev/null
+++ b/pgml/tests/test_model.py
@@ -0,0 +1,65 @@
+# stub out plpy
+from . import plpy
+import sys
+sys.modules['plpy'] = plpy
+
+import time
+import unittest
+from pgml import model
+
+class TestModel(unittest.TestCase):
+ def test_the_world(self):
+ plpy.add_mock_result(
+ [{"id": 1, "name": "Test", "objective": "regression", "created_at": time.time(), "updated_at": time.time()}]
+ )
+ plpy.add_mock_result(
+ [{"id": 1, "relation_name": "test", "y_column_name": "test_y", "test_size": 0.1, "test_sampling": "random", "status": "new", "created_at": time.time(), "updated_at": time.time()}]
+ )
+ plpy.add_mock_result(
+ "OK"
+ )
+ plpy.add_mock_result(
+ [{"id": 1, "relation_name": "test", "y_column_name": "test_y", "test_size": 0.1, "test_sampling": "random", "status": "created", "created_at": time.time(), "updated_at": time.time()}]
+ )
+ plpy.add_mock_result(
+ [{"id": 1, "project_id": 1, "snapshot_id": 1, "algorithm_name": "linear", "status": "new", "r2_score": None, "mean_squared_error": None, "pickle": None, "created_at": time.time(), "updated_at": time.time()}]
+ )
+ plpy.add_mock_result(
+ [
+ {"a": 1, "b": 2, "test_y": 3},
+ {"a": 2, "b": 3, "test_y": 4},
+ {"a": 3, "b": 4, "test_y": 5},
+ ]
+ )
+ plpy.add_mock_result(
+ [{"id": 1, "project_id": 1, "snapshot_id": 1, "algorithm_name": "linear", "status": "new", "r2_score": None, "mean_squared_error": None, "pickle": None, "created_at": time.time(), "updated_at": time.time()}]
+ )
+
+ plpy.add_mock_result(
+ [{"id": 1, "project_id": 1, "snapshot_id": 1, "algorithm_name": "linear", "status": "new", "r2_score": None, "mean_squared_error": None, "pickle": None, "created_at": time.time(), "updated_at": time.time()}]
+ )
+ plpy.add_mock_result(
+ [
+ {"a": 1, "b": 2, "test_y": 3},
+ {"a": 2, "b": 3, "test_y": 4},
+ {"a": 3, "b": 4, "test_y": 5},
+ ]
+ )
+ plpy.add_mock_result(
+ [{"id": 1, "project_id": 1, "snapshot_id": 1, "algorithm_name": "linear", "status": "new", "r2_score": None, "mean_squared_error": None, "pickle": None, "created_at": time.time(), "updated_at": time.time()}]
+ )
+
+ plpy.add_mock_result(
+ [{"id": 1, "project_id": 1, "snapshot_id": 1, "algorithm_name": "linear", "status": "new", "r2_score": None, "mean_squared_error": None, "pickle": None, "created_at": time.time(), "updated_at": time.time()}]
+ )
+ plpy.add_mock_result(
+ [
+ {"a": 1, "b": 2, "test_y": 3},
+ {"a": 2, "b": 3, "test_y": 4},
+ {"a": 3, "b": 4, "test_y": 5},
+ ]
+ )
+ plpy.add_mock_result(
+ [{"id": 1, "project_id": 1, "snapshot_id": 1, "algorithm_name": "linear", "status": "new", "r2_score": None, "mean_squared_error": None, "pickle": None, "created_at": time.time(), "updated_at": time.time()}]
+ )
+ model.train("Test", "regression", "test", "test_y")
diff --git a/pgml/tests/test_train.py b/pgml/tests/test_train.py
deleted file mode 100644
index 1440de966..000000000
--- a/pgml/tests/test_train.py
+++ /dev/null
@@ -1,6 +0,0 @@
-import unittest
-import pgml
-
-class TestRegression(unittest.TestCase):
- def test_init(self):
- pgml.model.train("Test", "regression", "test", "test_y")
From ffedbc559ac06d785ed240356a4784377af69b7b Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Thu, 14 Apr 2022 11:05:53 -0700
Subject: [PATCH 14/15] Update pgml/pgml/model.py
Co-authored-by: Lev Kokotov
---
pgml/pgml/model.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
index b38d3f427..c5375f24b 100644
--- a/pgml/pgml/model.py
+++ b/pgml/pgml/model.py
@@ -214,7 +214,8 @@ def data(self):
split = self.test_size
if isinstance(split, float):
split = int(self.test_size * len(X))
- return X[0:split], X[split:X.len()-1], y[0:split], y[split:y.len()-1]
+ return X[:split], X[split:], y[:split], y[split:]
+
# TODO normalize and clean data
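
This one-line fix addresses two defects in the removed line: X.len() is not a method on a Python list (the built-in is len(X)), and the [split:X.len()-1] upper bound silently dropped the last row of every split, because slice ends are exclusive in Python. A quick demonstration of the off-by-one:

```python
# Why the old slice lost a row: Python slice upper bounds are exclusive.
X = [10, 20, 30, 40, 50]
split = 2

old_tail = X[split:len(X) - 1]  # [30, 40] -- the 50 is silently dropped
new_tail = X[split:]            # [30, 40, 50] -- keeps every remaining row
print(old_tail, new_tail)
```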
From aa44f9468ae16a298ef4a24f3c49edd0841ca9ae Mon Sep 17 00:00:00 2001
From: Montana Low
Date: Thu, 14 Apr 2022 11:11:31 -0700
Subject: [PATCH 15/15] remove parens around ifs
---
pgml/pgml/model.py | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/pgml/pgml/model.py b/pgml/pgml/model.py
index b38d3f427..35e92176b 100644
--- a/pgml/pgml/model.py
+++ b/pgml/pgml/model.py
@@ -41,7 +41,7 @@ def find(cls, id: int):
FROM pgml.projects
WHERE id = {q(id)}
""", 1)
- if (len(result) == 0):
+ if len(result) == 0:
return None
project = Project()
@@ -71,7 +71,7 @@ def find_by_name(cls, name: str):
FROM pgml.projects
WHERE name = {q(name)}
""", 1)
- if (len(result) == 0):
+ if len(result) == 0:
return None
project = Project()
@@ -203,10 +203,10 @@ def data(self):
y.append(y_)
# Split into training and test sets
- if (self.test_sampling == 'random'):
+ if self.test_sampling == 'random':
return train_test_split(X, y, test_size=self.test_size, random_state=0)
else:
- if (self.test_sampling == 'first'):
+ if self.test_sampling == 'first':
X.reverse()
y.reverse()
if isinstance(split, float):
@@ -273,7 +273,7 @@ def find_deployed(cls, project_id: int):
ORDER by deployments.created_at DESC
LIMIT 1
""")
- if (len(result) == 0):
+ if len(result) == 0:
return None
model = Model()