
LOAN / TABULAR ML – END-TO-END COOKBOOK

======================================

SECTION 1 · DATA LOADING FROM S3


----------------------------------
import boto3, pandas as pd
from io import BytesIO, StringIO

bucket = "my-data-bucket"
key = "loan/loan_data.csv"

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(StringIO(obj["Body"].read().decode("utf-8")))
# For Parquet:
# df = pd.read_parquet(BytesIO(obj["Body"].read()))
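# If the optional s3fs package is installed, pandas can also read the object
# straight from its S3 URI, without the explicit boto3 call:
# df = pd.read_csv(f"s3://{bucket}/{key}")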

------------------------------------------------------------

SECTION 2 · MISSING-VALUE IMPUTATION


--------------------------------------
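num_cols and cat_cols are assumed throughout the cookbook; one common way to derive
them from the loaded frame (assuming "target" is the label column, as in later sections):

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = [c for c in num_cols if c != "target"]   # keep the label out of the feature lists
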
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

------------------------------------------------------------

SECTION 3 · NUMERIC SCALING & TRANSFORMS


------------------------------------------
from sklearn.preprocessing import StandardScaler, MinMaxScaler
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
# or MinMaxScaler / PowerTransformer
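Loan amounts and incomes are often heavily right-skewed; a PowerTransformer sketch
(Yeo-Johnson, the scikit-learn default) as a drop-in alternative to plain scaling:

from sklearn.preprocessing import PowerTransformer
df[num_cols] = PowerTransformer(method="yeo-johnson").fit_transform(df[num_cols])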

------------------------------------------------------------

SECTION 4 · CATEGORICAL ENCODING


----------------------------------
# 4.1 One-Hot
df = pd.get_dummies(df, columns=cat_cols, dtype=int)

# 4.2 Ordinal / Label


from sklearn.preprocessing import OrdinalEncoder
df[cat_cols] = OrdinalEncoder(handle_unknown="use_encoded_value",
                              unknown_value=-1).fit_transform(df[cat_cols])

# 4.3 Target / Mean Encoding


import category_encoders as ce
enc = ce.TargetEncoder(cols=cat_cols)
df[cat_cols] = enc.fit_transform(df[cat_cols], df["target"])
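
Note that fitting a target encoder on the full frame leaks label information into the
features; a safer sketch, assuming the data have already been split into train and test
partitions (X_train, y_train, X_test), re-fits the encoder on the training rows only:

enc = ce.TargetEncoder(cols=cat_cols)
X_train[cat_cols] = enc.fit_transform(X_train[cat_cols], y_train)
X_test[cat_cols] = enc.transform(X_test[cat_cols])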

------------------------------------------------------------

SECTION 5 · DATE-TIME FEATURES


--------------------------------
df["issue_d"] = pd.to_datetime(df["issue_d"])
df["issue_year"] = df["issue_d"].dt.year
df["issue_qtr"] = df["issue_d"].dt.quarter
df["issue_wkday"]= df["issue_d"].dt.weekday

------------------------------------------------------------

SECTION 6 · TEXT VECTORS (TF-IDF)


-----------------------------------
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2),
                        stop_words="english")
X_text = tfidf.fit_transform(df["review_text"])
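
The TF-IDF output is a sparse matrix; to use it alongside the numeric loan features,
one option is to stack the two blocks horizontally (a sketch, keeping everything sparse):

from scipy.sparse import hstack, csr_matrix
X_all = hstack([X_text, csr_matrix(df[num_cols].to_numpy())])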

------------------------------------------------------------
SECTION 7 · POLYNOMIAL / INTERACTIONS
---------------------------------------
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = pd.DataFrame(poly.fit_transform(df[num_cols]),
                       columns=poly.get_feature_names_out(num_cols))

------------------------------------------------------------

SECTION 8 · BINNING CONTINUOUS VARS


-------------------------------------
df["income_band"] = pd.cut(df["annual_inc"],
bins=[0,40_000,80_000,120_000, float("inf")],
labels=["Low","Mid","High","VeryHigh"])

------------------------------------------------------------

SECTION 9 · CLASS BALANCING (UPSAMPLING)


------------------------------------------
from sklearn.utils import resample
maj = df[df.target == 0]; minority = df[df.target == 1]
min_up = resample(minority, replace=True, n_samples=len(maj), random_state=42)
df_bal = pd.concat([maj, min_up]).sample(frac=1, random_state=42).reset_index(drop=True)
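
If the optional imbalanced-learn package is available, synthetic oversampling (SMOTE)
is an alternative to duplicating rows; a minimal sketch, assuming a fully numeric
feature matrix X and label vector y:

from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)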

------------------------------------------------------------

SECTION 10 · FEATURE SELECTION


-------------------------------
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
vt = VarianceThreshold(threshold=0.01)
X_vt = vt.fit_transform(df.drop(columns="target"))
skb = SelectKBest(mutual_info_classif, k=30)
X_k = skb.fit_transform(df.drop(columns="target"), df["target"])
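
SelectKBest returns a bare array; the surviving column names can be recovered from
its support mask:

kept_cols = df.drop(columns="target").columns[skb.get_support()]
print(list(kept_cols))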

------------------------------------------------------------

SECTION 11 · END-TO-END PIPELINE + EXPORT


-----------------------------------------
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import joblib

num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scale", StandardScaler())])

cat_pipe = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                     ("onehot", ce.OneHotEncoder(use_cat_names=True,
                                                 handle_unknown="ignore"))])

prep = ColumnTransformer([("num", num_pipe, num_cols),
                          ("cat", cat_pipe, cat_cols)])

rf_model = Pipeline([("prep", prep),
                     ("clf", RandomForestClassifier(random_state=42))])

rf_model.fit(X_train, y_train)
joblib.dump(rf_model, "rf_full_pipeline.pkl")
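
X_train and y_train are assumed throughout; a typical way to produce them from the
frame used above (stratified on the label):

from sklearn.model_selection import train_test_split
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)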

------------------------------------------------------------

SECTION 12 · TIME-SERIES FEATURE ENGINEERING


---------------------------------------------
for lag in [1,7,30]: df[f"y_lag_{lag}"] = df["y"].shift(lag)
df["y_roll_mean_7"] = df["y"].rolling(7).mean()
df["y_ema_0_9"] = df["y"].ewm(alpha=0.9).mean()
df_ts = df.dropna()
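
With lagged features like these, random shuffling leaks future information into the
training set; a chronological cross-validation sketch using scikit-learn's TimeSeriesSplit:

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(df_ts):
    train_fold, test_fold = df_ts.iloc[train_idx], df_ts.iloc[test_idx]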

------------------------------------------------------------

SECTION 13 · DIMENSIONALITY REDUCTION


--------------------------------------
# PCA
from sklearn.decomposition import PCA
X_pca = PCA(n_components=0.95).fit_transform(df[num_cols])  # keep components explaining 95% of variance
# UMAP
import umap
X_umap = umap.UMAP(random_state=42).fit_transform(df[num_cols])
# Truncated SVD (sparse)
from sklearn.decomposition import TruncatedSVD
X_svd = TruncatedSVD(n_components=300, random_state=42).fit_transform(X_text)

------------------------------------------------------------

SECTION 14 · HYPER-PARAMETER OPTIMISATION


------------------------------------------
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from skopt import BayesSearchCV

# Full GridSearch example for RF omitted here for brevity; a minimal sketch follows
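
A minimal GridSearchCV sketch over the rf_model pipeline from Section 11 (the grid
values are illustrative assumptions, not tuned recommendations):

param_grid = {"clf__n_estimators": [200, 500],
              "clf__max_depth": [None, 10, 20]}
grid = GridSearchCV(rf_model, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)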

------------------------------------------------------------

SECTION 15 · QUICK-START SNIPPETS: COMMON ML MODELS


----------------------------------------------------
# 1 Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2 Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 3 Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_train, y_train)

# 4 Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier().fit(X_train, y_train)

# 5 XGBoost
import xgboost as xgb
xgb_clf = xgb.XGBClassifier(random_state=42).fit(X_train, y_train)

# 6 LightGBM
import lightgbm as lgb
lgb_clf = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)

# 7 Support Vector Machine


from sklearn.svm import SVC
svc = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# 8 k-Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)

# 9 Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB().fit(X_train, y_train)

#10 Multi-layer Perceptron


from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500,
                    random_state=42).fit(X_train, y_train)
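
Any of the fitted classifiers above can be sanity-checked on the held-out split
(X_test / y_test from the earlier train_test_split); a minimal sketch using the random forest:

from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, y_pred))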

------------------------------------------------------------

END OF COOKBOOK
