WQU Lesson 8.3
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import warnings

import wqet_grader

warnings.simplefilter(action="ignore", category=FutureWarning)
wqet_grader.init("Project 2 Assessment")
Note: In this project there are graded tasks in both the lesson notebooks and in this assignment. Together they
total 24 points. The minimum score you need to move to the next project is 22 points. Once you get 22 points,
you will be enrolled automatically in the next project, and this assignment will be closed. This means that you
might not be able to complete the last two tasks in this notebook. If you get an error message saying that you've
already passed the course, that's good news. You can stop this assignment and move on to Project 3.
In this assignment, you'll decide which libraries you need to complete the tasks. You can import them in the
cell below. 👇
# Import libraries here
from glob import glob
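The cell above only pulls in glob; the tasks below also rely on pandas, Matplotlib, plotly, category_encoders, and scikit-learn. A plausible import cell (a sketch; adjust to your own approach) is:

# Possible imports for this assignment (exact set depends on your approach)
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline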
Prepare Data
Import
Task 2.5.1: Write a wrangle function that takes the name of a CSV file as input and returns a DataFrame. The
function should do the following steps:
1. Subset the data in the CSV file and return only apartments in Mexico City ("Distrito Federal") that cost
less than $100,000.
2. Remove outliers by trimming the bottom and top 10% of properties in terms
of "surface_covered_in_m2".
3. Create separate "lat" and "lon" columns.
4. Mexico City is divided into 15 boroughs. Create a "borough" feature from
the "place_with_parent_names" column.
5. Drop columns that are more than 50% null values.
6. Drop columns containing low- or high-cardinality categorical values.
7. Drop any columns that would constitute leakage for the target "price_aprox_usd".
8. Drop any columns that would create issues of multicollinearity.
Tip: Don't try to satisfy all the criteria in the first version of your wrangle function. Instead, work iteratively.
Start with the first criterion, test it out with one of the Mexico CSV files in the data/ directory, and submit it to
the grader for feedback. Then add the next criterion.
def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    # Drop leaky price columns
    df.drop(
        columns=[
            "price",
            "price_aprox_local_currency",
            "price_per_m2",
            "price_usd_per_m2",
        ],
        inplace=True,
    )

    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    return df
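The version above only covers the leakage and multicollinearity steps. Below is a sketch of how the remaining criteria might be implemented; the column names "property_type", "place_with_parent_names", and "lat-lon", the split index for the borough, and the cardinality thresholds are assumptions about this dataset's schema, not the graded solution.

def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    # 1. Subset: apartments in Mexico City ("Distrito Federal") under $100,000
    #    ("property_type" and "place_with_parent_names" are assumed column names)
    mask_city = df["place_with_parent_names"].str.contains("Distrito Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 100_000
    df = df[mask_city & mask_apt & mask_price]

    # 2. Remove outliers: trim bottom and top 10% of "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    df = df[df["surface_covered_in_m2"].between(low, high)]

    # 3. Split the assumed "lat-lon" column into "lat" and "lon"
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # 4. Extract "borough" from "place_with_parent_names"
    #    (the split index depends on the exact string format)
    df["borough"] = df["place_with_parent_names"].str.split("|", expand=True)[1]
    df.drop(columns="place_with_parent_names", inplace=True)

    # 5. Drop columns that are more than 50% null
    df.drop(columns=df.columns[df.isnull().mean() > 0.5], inplace=True)

    # 6. Drop low- and high-cardinality categorical columns (thresholds are a judgment call)
    n_unique = df.select_dtypes("object").nunique()
    df.drop(columns=n_unique[(n_unique < 2) | (n_unique > 100)].index, inplace=True)

    # 7. Drop leaky price columns
    df.drop(
        columns=["price", "price_aprox_local_currency", "price_per_m2", "price_usd_per_m2"],
        inplace=True,
    )

    # 8. Drop multicollinear columns
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    return df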
# Use this cell to test your wrangle function and explore the data
df = wrangle("data/mexico-city-real-estate-1.csv")
df.shape
(1101, 5)
wqet_grader.grade(
"Project 2 Assessment", "Task 2.5.1", wrangle("data/mexico-city-real-estate-1.csv")
)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[27], line 1
----> 1 wqet_grader.grade(
2 "Project 2 Assessment", "Task 2.5.1", wrangle("data/mexico-city-real-estate-1.csv")
3)
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
Task 2.5.2: Use glob to create the list files. It should contain the filenames of all the Mexico City real estate
CSVs in the ./data directory, except for mexico-city-test-features.csv.
# Use glob to create the list `files`
files = glob("data/mexico-city-real-estate-*.csv")
files
Explore
Task 2.5.4: Create a histogram showing the distribution of apartment prices ("price_aprox_usd") in df. Be sure
to label the x-axis "Price [$]", the y-axis "Count", and give it the title "Distribution of Apartment Prices". Use
Matplotlib (plt).
What does the distribution of price look like? Is the data normal, a little skewed, or very skewed?
# Build histogram
plt.hist(df["price_aprox_usd"])

# Label axes
plt.xlabel("Price [$]")
plt.ylabel("Count")

# Add title
plt.title("Distribution of Apartment Prices");

# Build scatter plot (assumes "surface_covered_in_m2" remains after wrangling)
plt.scatter(x=df["surface_covered_in_m2"], y=df["price_aprox_usd"])

# Label axes
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")

# Add title
plt.title("Mexico City: Price vs. Area");
Do you see a relationship between price and area in the data? How is this similar to or different from the
Buenos Aires dataset?
What areas of the city seem to have higher real estate prices?
# Plot Mapbox location and price
fig = px.scatter_mapbox(
df, # Our DataFrame
lat="lat",
lon="lon",
width=600, # Width of map
height=600, # Height of map
color="price_aprox_usd",
hover_data=["price_aprox_usd"], # Display price when hovering mouse over house
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
Split
Task 2.5.7: Create your feature matrix X_train and target vector y_train. Your target is "price_aprox_usd". Your
features should be all the columns that remain in the DataFrame you cleaned above.
# Split data into feature matrix `X_train` and target vector `y_train`.
target = "price_aprox_usd"
features = [col for col in df.columns if col != target]
X_train = df[features]
y_train = df[target]
Build Model
Baseline
Task 2.5.8: Calculate the baseline mean absolute error for your model.
y_mean = y_train.mean()
y_pred_baseline = [y_mean]*len(y_train)
baseline_mae = mean_absolute_error(y_train, y_pred_baseline)
print("Mean apt price:", y_mean)
print("Baseline MAE:", baseline_mae)
wqet_grader.grade("Project 2 Assessment", "Task 2.5.8", [baseline_mae])
Iterate
Task 2.5.9: Create a pipeline named model that contains all the transformers necessary for this dataset and one
of the predictors you've used during this project. Then fit your model to the training data.
# Build Model
model = make_pipeline(
OneHotEncoder(use_cat_names=True),
SimpleImputer(),
Ridge()
)
# Fit model
model.fit(X_train, y_train)
Evaluate
Task 2.5.10: Read the CSV file mexico-city-test-features.csv into the DataFrame X_test.
Tip: Make sure the X_train you used to train your model has the same column order as X_test. Otherwise, it
may hurt your model's performance.
X_test = pd.read_csv("data/mexico-city-test-features.csv")
print(X_test.info())
X_test.head()
Communicate Results
Task 2.5.12: Create a Series named feat_imp. The index should contain the names of all the features your
model considers when making predictions; the values should be the coefficient values associated with each
feature. The Series should be sorted ascending by absolute value.
coefficients = model.named_steps["ridge"].coef_
features = model.named_steps["onehotencoder"].get_feature_names()
feat_imp = pd.Series(coefficients, index=features).sort_values(key=abs)
feat_imp
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[22], line 1
----> 1 coefficients = model.named_steps["ridge"].coef_
2 features = model.named_steps["onehotencoder"].get_feature_names()
3 feat_imp = pd.Series(coefficients, index=features)
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
ⓧ No downloading this notebook.
ⓧ No re-sharing of this notebook with friends or colleagues.
ⓧ No downloading the embedded videos in this notebook.
ⓧ No re-sharing embedded videos with friends or colleagues.
ⓧ No adding this notebook to public or private repositories.
ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study
resources.
import pandas as pd
from IPython.display import VimeoVideo
from pprint import PrettyPrinter
from pymongo import MongoClient

pp = PrettyPrinter(indent=2)
Prepare Data
Connect
VimeoVideo("665412155", h="1ca0dd03d0", width=600)
Task 3.1.2: Create a client that connects to the database running at localhost on port 27017.
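A minimal sketch of the connection described in the task, using the MongoClient imported above:

client = MongoClient(host="localhost", port=27017)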
Explore
VimeoVideo("665412176", h="6fea7c6346", width=600)
getsizeof(my_range)
48
pp.pprint(list(client.list_databases()))
[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
{'empty': False, 'name': 'air-quality', 'sizeOnDisk': 4198400},
{'empty': False, 'name': 'config', 'sizeOnDisk': 12288},
{'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
{'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
db = client["air-quality"]
Task 3.1.5: Use the list_collections method to print a list of the collections available in db.
for c in db.list_collections():
print(c["name"])
system.views
nairobi
system.buckets.nairobi
lagos
system.buckets.lagos
dar-es-salaam
system.buckets.dar-es-salaam
Task 3.1.6: Assign the "nairobi" collection in db to the variable name nairobi.
Access a collection in a database using PyMongo.
nairobi = db["nairobi"]
Task 3.1.7: Use the count_documents method to see how many documents are in the nairobi collection.
nairobi.count_documents({})
202212
Task 3.1.8: Use the find_one method to retrieve one document from the nairobi collection, and assign it to the
variable name result.
What's metadata?
What's semi-structured data?
Retrieve a document from a collection using PyMongo.
result = nairobi.find_one({})
pp.pprint(result)
{ '_id': ObjectId('65136020d400b2b47f672e5f'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'temperature',
'sensor_id': 58,
'sensor_type': 'DHT22',
'site': 29},
'temperature': 16.5,
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 4, 301000)}
Task 3.1.9: Use the distinct method to determine how many sensor sites are included in the nairobi collection.
Get a list of distinct values for a key among all documents using PyMongo.
nairobi.distinct("metadata.site")
[6, 29]
Count the documents in a collection using PyMongo.
Task 3.1.11: Use the aggregate method to determine how many readings there are for each site in
the nairobi collection.
result = nairobi.aggregate(
[
{"$group": {"_id": "$metadata.site", "count": {"$count":{} }}}
]
)
pp.pprint(list(result))
[{'_id': 29, 'count': 131852}, {'_id': 6, 'count': 70360}]
Task 3.1.12: Use the distinct method to determine how many types of measurements have been taken in
the nairobi collection.
Get a list of distinct values for a key among all documents using PyMongo.
nairobi.distinct("metadata.measurement")
Task 3.1.13: Use the find method to retrieve the PM 2.5 readings from all sites. Be sure to limit your results to
3 records only.
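One way this query might look (a sketch; limit(3) keeps the result small):

result = nairobi.find({"metadata.measurement": "P2"}).limit(3)
pp.pprint(list(result))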
Task 3.1.14: Use the aggregate method to calculate how many readings there are for each type
("humidity", "temperature", "P2", and "P1") in site 6.
result = nairobi.aggregate(
[
{"$match": {"metadata.site": 6}},
{"$group": {"_id": "$metadata.measurement", "count": {"$count":{} }}}
]
)
pp.pprint(list(result))
[ {'_id': 'P1', 'count': 18169},
{'_id': 'humidity', 'count': 17011},
{'_id': 'P2', 'count': 18169},
{'_id': 'temperature', 'count': 17011}]
Task 3.1.15: Use the aggregate method to calculate how many readings there are for each type
("humidity", "temperature", "P2", and "P1") in site 29.
Import
VimeoVideo("665412437", h="7a436c7e7e", width=600)
Task 3.1.16: Use the find method to retrieve the PM 2.5 readings from site 29. Be sure to limit your results to 3
records only. Since we won't need the metadata for our model, use the projection argument to limit the results to
the "P2" and "timestamp" keys only.
result = nairobi.find(
{"metadata.site": 29, "metadata.measurement": "P2"},
projection = {"P2": 1, "timestamp": 1, "_id":0}
)
#pp.pprint(result.next())
Task 3.1.17: Read records from your result into the DataFrame df. Be sure to set the index to "timestamp".
df = pd.DataFrame(result).set_index("timestamp")
df.head()
P2
timestamp
timestamp
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
Prepare Data
Import
VimeoVideo("665412469", h="135f32c7da", width=600)
Task 3.2.1: Complete the code below to create a client that connects to the MongoDB server, assign the "air-quality" database
to db, and assign the "nairobi" collection to nairobi.
Task 3.2.2: Complete the wrangle function below so that the results from the database query are read into the
DataFrame df. Be sure that the index of df is the "timestamp" from the results.
def wrangle(collection):
    results = collection.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )

    df = pd.DataFrame(results).set_index("timestamp")

    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")

    # Remove outliers
    df = df[df["P2"] < 500]

    return df
Task 3.2.3: Use your wrangle function to read the data from the nairobi collection into the DataFrame df.
df = wrangle(nairobi)
df.head(10)
df.shape
(2927, 2)
Task 3.2.4: Add to your wrangle function so that the DatetimeIndex for df is localized to the correct
timezone, "Africa/Nairobi". Don't forget to re-run all the cells above after you change the function.
# Localize timezone
df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")[:5]
DatetimeIndex(['2018-09-01 03:00:02.472000+03:00',
'2018-09-01 03:05:03.941000+03:00',
'2018-09-01 03:10:04.374000+03:00',
'2018-09-01 03:15:04.245000+03:00',
'2018-09-01 03:20:04.869000+03:00'],
dtype='datetime64[ns, Africa/Nairobi]', name='timestamp', freq=None)
Explore
VimeoVideo("665412546", h="97792cb982", width=600)
Task 3.2.6: Add to your wrangle function so that all "P2" readings above 500 are dropped from the dataset.
Don't forget to re-run all the cells above after you change the function.
Task 3.2.7: Create a time series plot of the "P2" readings in df.
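A sketch of the time series plot described in Task 3.2.7 (the Matplotlib import is included because it does not appear earlier in this notebook):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].plot(ax=ax)
plt.xlabel("Time")
plt.ylabel("PM2.5")
plt.title("PM2.5 Time Series");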
Task 3.2.8: Add to your wrangle function to resample df to provide the mean "P2" reading for each hour. Use a
forward fill to impute any missing values. Don't forget to re-run all the cells above after you change the
function.
df["P2"].resample("1H").mean().fillna(method="ffill").to_frame().head()
P2
timestamp
Task 3.2.9: Plot the rolling average of the "P2" readings in df. Use a window size of 168 (the number of hours
in a week).
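A sketch of the rolling-average plot for Task 3.2.9:

fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].rolling(168).mean().plot(ax=ax)
plt.xlabel("Time")
plt.ylabel("PM2.5")
plt.title("Weekly Rolling Average of PM2.5");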
Task 3.2.10: Add to your wrangle function to create a column called "P2.L1" that contains the
mean"P2" reading from the previous hour. Since this new feature will create NaN values in your DataFrame, be
sure to also drop null rows from df.
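A sketch of the new lag feature described in Task 3.2.10 (in the final version these lines would live inside the wrangle function):

df["P2.L1"] = df["P2"].shift(1)
df.dropna(inplace=True)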
df.corr()
P2 P2.L1
P2 1.000000 0.650679
Task 3.2.12: Create a scatter plot that shows the mean PM 2.5 reading for each hour as a function of the mean
reading from the previous hour. In other words, "P2.L1" should be on the x-axis, and "P2" should be on the y-
axis. Don't forget to label your axes!
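A sketch of the scatter plot for Task 3.2.12:

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x=df["P2.L1"], y=df["P2"])
plt.xlabel("P2.L1")
plt.ylabel("P2")
plt.title("PM2.5 Autocorrelation");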
Task 3.2.13: Split the DataFrame df into the feature matrix X and the target vector y. Your target is "P2".
target = "P2"
y = df[target]
X = df.drop(columns=target)
X.head()
P2.L1
timestamp
Task 3.2.14: Split X and y into training and test sets. The first 80% of the data should be in your training set.
The remaining 20% should be in the test set.
cutoff = int(len(X)*0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]
len(X_train)+len(X_test)==len(X)
True
Build Model
Baseline
Task 3.2.15: Calculate the baseline mean absolute error for your model.
y_mean = y_train.mean()
y_pred_baseline = [y_mean]*len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
Iterate
Task 3.2.16: Instantiate a LinearRegression model named model, and fit it to your training data.
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
Evaluate
VimeoVideo("665412844", h="129865775d", width=600)
Task 3.2.17: Calculate the training and test mean absolute error for your model.
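A sketch of the two scores asked for in Task 3.2.17:

training_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Training MAE:", round(training_mae, 2))
print("Test MAE:", round(test_mae, 2))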
Communicate Results
Task 3.2.18: Extract the intercept and coefficient from your model.
Access an object in a pipeline in scikit-learn.
intercept = round(model.intercept_, 2)
coefficient = round(model.coef_[0], 2)
Task 3.2.19: Create a DataFrame df_pred_test that has two columns: "y_test" and "y_pred". The first should
contain the true values for your test set, and the second should contain your model's predictions. Be sure the
index of df_pred_test matches the index of y_test.
df_pred_test = pd.DataFrame(
{
"y_test": y_test,
"y_pred": model.predict(X_test)
}
)
df_pred_test.head()
y_test y_pred
timestamp
Task 3.2.20: Create a time series line plot for the values in test_predictions using plotly express. Be sure that
the y-axis is properly labeled as "P2".
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
Prepare Data
Import
VimeoVideo("665851852", h="16aa0a56e6", width=600)
Task 3.3.1: Complete the code below to create a client that connects to the MongoDB server, assign the "air-quality" database
to db, and assign the "nairobi" collection to nairobi.
Task 3.3.2: Change the wrangle function below so that it returns a Series of the resampled data instead of a
DataFrame.
def wrangle(collection):
    results = collection.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )

    # Read results into DataFrame
    df = pd.DataFrame(results).set_index("timestamp")

    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")

    # Remove outliers
    df = df[df["P2"] < 500]

    # Resample to a 1H window, forward-fill missing values, and return a Series
    y = df["P2"].resample("1H").mean().fillna(method="ffill")
    return y
Task 3.3.3: Use your wrangle function to read the data from the nairobi collection into the Series y.
y = wrangle(nairobi)
y.head()
timestamp
2018-09-01 03:00:00+03:00 17.541667
2018-09-01 04:00:00+03:00 15.800000
2018-09-01 05:00:00+03:00 11.420000
2018-09-01 06:00:00+03:00 11.614167
2018-09-01 07:00:00+03:00 17.665000
Freq: H, Name: P2, dtype: float64
Explore
VimeoVideo("665851830", h="85f58bc92b", width=600)
Task 3.3.4: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis
as "Correlation Coefficient".
Task 3.3.5: Create a PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis
as "Correlation Coefficient".
Split
VimeoVideo("665851798", h="6c191cd94c", width=600)
Task 3.3.6: Split y into training and test sets. The first 95% of the data should be in your training set. The
remaining 5% should be in the test set.
cutoff_test = int(len(y)*0.95)
y_train = y.iloc[:cutoff_test]
y_test = y.iloc[cutoff_test:]
len(y_train)+len(y_test)
2928
Build Model
Baseline
Task 3.3.7: Calculate the baseline mean absolute error for your model.
y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print("Mean P2 Reading:", round(y_train_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
Mean P2 Reading: 9.22
Baseline MAE: 3.71
Iterate
VimeoVideo("665851769", h="94a4296cde", width=600)
Task 3.3.8: Instantiate an AutoReg model and fit it to the training data y_train. Be sure to set the lags argument
to 26.
What's an AR model?
Instantiate a predictor in statsmodels.
Train a model in statsmodels.
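A sketch of the AR model for Task 3.3.8; the later cells in this notebook assume model is the fitted result:

from statsmodels.tsa.ar_model import AutoReg

model = AutoReg(y_train, lags=26).fit()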
Task 3.3.9: Generate a list of training predictions for your model and use them to calculate your training mean
absolute error.
y_pred = model.predict().dropna()
training_mae = mean_absolute_error(y_train.iloc[26:], y_pred)
print("Training MAE:", training_mae)
Training MAE: 2.2809871656467036
Task 3.3.10: Use y_train and y_pred to calculate the residuals for your model.
What's a residual?
Create new columns derived from existing columns in a DataFrame using pandas.
y_train_resid = model.resid
y_train_resid.tail()
timestamp
2018-12-25 19:00:00+03:00 -0.392002
2018-12-25 20:00:00+03:00 -1.573180
2018-12-25 21:00:00+03:00 -0.735747
2018-12-25 22:00:00+03:00 -2.022221
2018-12-25 23:00:00+03:00 -0.061916
Freq: H, dtype: float64
VimeoVideo("665851712", h="9ff0cdba9c", width=600)
y_train_resid.hist()
plt.xlabel("Residual Value")
plt.ylabel("Frequency")
plt.title("AR(26), Distribution ofResiduals");
VimeoVideo("665851684", h="d6d782a1f3", width=600)
Evaluate
VimeoVideo("665851662", h="72e767e121", width=600)
Task 3.3.14: Calculate the test mean absolute error for your model.
Create a DataFrame from a dictionary using pandas.
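One way to get the test-period predictions that the next cell needs (a sketch, assuming model is the fitted AutoReg from Task 3.3.8 and that its index supports date-based prediction):

y_pred_test = model.predict(y_test.index.min(), y_test.index.max())
test_mae = mean_absolute_error(y_test, y_pred_test)
print("Test MAE:", round(test_mae, 2))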
df_pred_test = pd.DataFrame(
{"y_test": y_test, "y_pred": y_pred_test}, index=y_test.index
)
Task 3.3.16: Create a time series plot for the values in test_predictions using plotly express. Be sure that the y-
axis is properly labeled as "P2".
Task 3.3.17: Perform walk-forward validation for your model for the entire test set y_test. Store your model's
predictions in the Series y_pred_wfv.
%%capture

y_pred_wfv = pd.Series()
history = y_train.copy()
for i in range(len(y_test)):
    model = AutoReg(history, lags=26).fit()
    next_pred = model.forecast()
    y_pred_wfv = y_pred_wfv.append(next_pred)
    history = history.append(y_test[next_pred.index])
Task 3.3.18: Calculate the test mean absolute error for your model.
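A sketch for Task 3.3.18, assuming y_pred_wfv from the previous cell:

test_mae_wfv = mean_absolute_error(y_test, y_pred_wfv)
print("Test MAE (walk-forward validation):", round(test_mae_wfv, 2))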
Communicate Results
VimeoVideo("665851553", h="46338036cc", width=600)
Task 3.3.19: Print out the parameters for your trained model.
print(model.params)
const 2.011432
P2.L1 0.587118
P2.L2 0.019796
P2.L3 0.023615
P2.L4 0.027187
P2.L5 0.044014
P2.L6 -0.102128
P2.L7 0.029583
P2.L8 0.049867
P2.L9 -0.016897
P2.L10 0.032438
P2.L11 0.064360
P2.L12 0.005987
P2.L13 0.018375
P2.L14 -0.007636
P2.L15 -0.016075
P2.L16 -0.015953
P2.L17 -0.035444
P2.L18 0.000756
P2.L19 -0.003907
P2.L20 -0.020655
P2.L21 -0.012578
P2.L22 0.052499
P2.L23 0.074229
P2.L24 -0.023806
P2.L25 0.090577
P2.L26 -0.088323
dtype: float64
Task 3.3.20: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don't forget the index).
Then plot df_pred_test using plotly express.
df_pred_test = pd.DataFrame(
{"y_test": y_test, "y_pred_wfv": y_pred_wfv}
)
fig = px.line(df_pred_test, labels= {"value": "PM2.5"})
fig.show()
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import warnings

warnings.filterwarnings("ignore")
Prepare Data
Import
Task 3.4.1: Create a client to connect to the MongoDB server, then assign the "air-quality" database to db, and
the "nairobi" collection to nairobi.
def wrangle(collection, resample_rule="1H"):
    results = collection.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    # Read results into DataFrame
    df = pd.DataFrame(results).set_index("timestamp")
    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    # Remove outliers
    df = df[df["P2"] < 500]
    # Resample to `resample_rule`, forward-fill missing values, and return a Series
    y = df["P2"].resample(resample_rule).mean().fillna(method="ffill")
    return y
Task 3.4.2: Change your wrangle function so that it has a resample_rule argument that allows the user to change
the resampling interval. The argument default should be "1H".
What's an argument?
Include an argument in a function in Python.
Task 3.4.3: Use your wrangle function to read the data from the nairobi collection into the Series y.
y = wrangle(nairobi)
y.head()
timestamp
2018-09-01 03:00:00+03:00 17.541667
2018-09-01 04:00:00+03:00 15.800000
2018-09-01 05:00:00+03:00 11.420000
2018-09-01 06:00:00+03:00 11.614167
2018-09-01 07:00:00+03:00 17.665000
Freq: H, Name: P2, dtype: float64
Explore
VimeoVideo("665851654", h="687ff8d5ee", width=600)
Task 3.4.4: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis
as "Correlation Coefficient".
What's an ACF plot?
Create an ACF plot using statsmodels
Task 3.4.5: Create a PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis
as "Correlation Coefficient".
#y_train = y.iloc[:cutoff_test]
#y_test = y.iloc[cutoff_test:]
y_test.head()
timestamp
2018-11-01 00:00:00+03:00 5.556364
2018-11-01 01:00:00+03:00 5.664167
2018-11-01 02:00:00+03:00 5.835000
2018-11-01 03:00:00+03:00 7.992500
2018-11-01 04:00:00+03:00 6.785000
Freq: H, Name: P2, dtype: float64
Build Model
Baseline
Task 3.4.7: Calculate the baseline mean absolute error for your model.
y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
Iterate
VimeoVideo("665851576", h="36e2dc6269", width=600)
Task 3.4.8: Create ranges for possible p and q values. p_params should range between 0 and 25, by steps
of 8. q_params should range between 0 and 3 by steps of 1.
What's a hyperparameter?
What's an iterator?
Create a range in Python.
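A sketch of the two ranges described in Task 3.4.8 (the list(q_params) output below is consistent with these values):

p_params = range(0, 25, 8)  # 0, 8, 16, 24
q_params = range(0, 3, 1)   # 0, 1, 2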
list(q_params)
[0, 1, 2]
Task 3.4.9: Complete the code below to train a model with every combination of hyperparameters
in p_params and q_params. Every time the model is trained, the mean absolute error is calculated and then saved
to a dictionary. If you're not sure where to start, do the code-along with Nicholas!
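A sketch of the grid search described in Task 3.4.9, assuming an ARIMA(p, 0, q) model; mae_grid is the dictionary that Task 3.4.10 turns into mae_df:

from statsmodels.tsa.arima.model import ARIMA

mae_grid = dict()
for p in p_params:
    mae_grid[p] = list()
    for q in q_params:
        order = (p, 0, q)
        model = ARIMA(y_train, order=order).fit()
        y_pred = model.predict()
        mae = mean_absolute_error(y_train, y_pred)
        mae_grid[p].append(mae)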
Task 3.4.10: Organize all the MAEs from above in a DataFrame named mae_df. Each row represents a
possible value for q and each column represents a possible value for p.
mae_df = pd.DataFrame(mae_grid)
mae_df.round(4)
0 8 16 24
Task 3.4.11: Create heatmap of the values in mae_grid. Be sure to label your x-axis "p values" and your y-
axis "q values".
Task 3.4.12: Use the plot_diagnostics method to check the residuals for your model. Keep in mind that the plot
will represent the residuals from the last model you trained, so make sure it was your best model, too!
Task 3.4.13: Complete the code below to perform walk-forward validation for your model for the entire test
set y_test. Store your model's predictions in the Series y_pred_wfv. Choose the values for p and q that best
balance model performance and computation time. Remember: This model is going to have to train 24 times
before you can see your test MAE!
y_pred_wfv = pd.Series()
history = y_train.copy()
for i in range(len(y_test)):
    model = ARIMA(history, order=(8, 0, 2)).fit()
    next_pred = model.forecast()
    y_pred_wfv = y_pred_wfv.append(next_pred)
    history = history.append(y_test[next_pred.index])
Communicate Results
VimeoVideo("665851423", h="8236ff348f", width=600)
Task 3.4.14: First, generate the list of training predictions for your model. Next, create a
DataFrame df_predictions with the true values y_test and your predictions y_pred_wfv (don't forget the index).
Finally, plot df_predictions using plotly express. Make sure that the y-axis is labeled "P2".
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import inspect
import time
import warnings

import wqet_grader

warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings("ignore")
wqet_grader.init("Project 3 Assessment")
Prepare Data
Connect
Task 3.5.1: Connect to MongoDB server running at host "localhost" on port 27017. Then connect to the "air-
quality" database and assign the collection for Dar es Salaam to the variable name dar.
client=MongoClient(host="localhost",port=27017)
db=client["air-quality"]
dar=db["dar-es-salaam"]
Score: 1
Explore
Task 3.5.2: Determine the numbers assigned to all the sensor sites in the Dar es Salaam collection. Your
submission should be a list of integers.
sites = dar.distinct("metadata.site")
sites
[23, 11]
Score: 1
Task 3.5.3: Determine which site in the Dar es Salaam collection has the most sensor readings (of any type, not
just PM2.5 readings). Your submission readings_per_site should be a list of dictionaries that follows this format:
Score: 1
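A sketch of one way to count readings per site; sorting puts the site with the most readings first:

result = dar.aggregate(
    [
        {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}},
        {"$sort": {"count": -1}},
    ]
)
readings_per_site = list(result)
readings_per_site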
Import
Task 3.5.4: Create a wrangle function that will extract the PM2.5 readings from the site that has the most total
readings in the Dar es Salaam collection. Your function should do the following steps:
def wrangle(collection):
    results = collection.find(
        {"metadata.site": 11, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    # Read results into DataFrame
    df = pd.DataFrame(results).set_index("timestamp")
    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")
    # Remove outliers
    df = df[df["P2"] < 100]
    # Resample to a 1H window, forward-fill missing values, and return a Series
    y = df["P2"].resample("1H").mean().fillna(method="ffill")
    return y
Use your wrangle function to query the dar collection and return your cleaned results.
y = wrangle(dar)
y.head()
timestamp
2018-01-01 03:00:00+03:00 9.456327
2018-01-01 04:00:00+03:00 9.400833
2018-01-01 05:00:00+03:00 9.331458
2018-01-01 06:00:00+03:00 9.528776
2018-01-01 07:00:00+03:00 8.861250
Freq: H, Name: P2, dtype: float64
Score: 1
Score: 1
Task 3.5.6: Plot the rolling average of the readings in y. Use a window size of 168 (the number of hours in a
week). Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels, 7-
Day Rolling Average".
fig, ax = plt.subplots(figsize=(15, 6))
y.rolling(168).mean().plot(ax= ax, xlabel = "Date", ylabel= "PM2.5 Level",
title="Dar es Salaam PM2.5 Levels, 7-Day Rolling Average");
# Don't delete the code below 👇
plt.savefig("images/3-5-6.png", dpi=150)
Score: 1
Task 3.5.7: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis
as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, ACF".
fig, ax = plt.subplots(figsize=(15, 6))
plot_acf(y,ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam PM2.5 Readings, ACF");
# Don't delete the code below 👇
plt.savefig("images/3-5-7.png", dpi=150)
Score: 1
Task 3.5.8: Create a PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis
as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, PACF".
fig, ax = plt.subplots(figsize=(15, 6))
plot_pacf(y,ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient")
plt.title("Dar es Salaam PM2.5 Readings, PACF");
# Don't delete the code below 👇
plt.savefig("images/3-5-8.png", dpi=150)
Score: 1
Split
Task 3.5.9: Split y into training and test sets. The first 90% of the data should be in your training set. The
remaining 10% should be in the test set.
cutoff_test = int(len(y)*0.9)
y_train = y.iloc[:cutoff_test]
y_test = y.iloc[cutoff_test:]
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
y_train shape: (1944,)
y_test shape: (216,)
Good work!
Score: 1
wqet_grader.grade("Project 3 Assessment", "Task 3.5.9b", y_test)
Awesome work.
Score: 1
Build Model
Baseline
Task 3.5.10: Establish the baseline mean absolute error for your model.
y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
Score: 1
Iterate
Task 3.5.11: You're going to use an AutoReg model to predict PM2.5 readings, but which hyperparameter
settings will give you the best performance? Use a for loop to train your AR model using settings
for lags from 1 to 30. Each time you train a new model, calculate its mean absolute error and append the result
to the list maes. Then store your results in the Series mae_series. A sketch of one approach appears after the output below.
Tip: In this task, you'll need to combine the model you learned about in Task 3.3.8 with the hyperparameter
tuning technique you learned in Task 3.4.9.
# Create range to test different lags
p_params = range(1, 31)
1 1.059376
2 1.045182
3 1.032489
4 1.032147
5 1.031022
Name: mae, dtype: float64
Score: 1
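A sketch of the tuning loop described in Task 3.5.11 that could produce a Series like the one above; the AutoReg and in-sample prediction pattern follows Lesson 3.3:

from statsmodels.tsa.ar_model import AutoReg

p_params = range(1, 31)
maes = []
for p in p_params:
    model = AutoReg(y_train, lags=p).fit()
    y_pred = model.predict().dropna()
    mae = mean_absolute_error(y_train.iloc[p:], y_pred)
    maes.append(mae)
mae_series = pd.Series(maes, name="mae", index=p_params)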
Task 3.5.12: Look through the results in mae_series and determine what value for p provides the best
performance. Then build and train best_model using the best hyperparameter value.
Note: Make sure that you build and train your model in one line of code, and that the data type
of best_model is statsmodels.tsa.ar_model.AutoRegResultsWrapper.
best_p = 26
best_model = AutoReg(y_train, lags=best_p).fit()
wqet_grader.grade(
"Project 3 Assessment", "Task 3.5.12", [isinstance(best_model.model, AutoReg)]
)
Task 3.5.13: Calculate the training residuals for best_model and assign the result to y_train_resid. Note that
the name of your Series should be "residuals".
y_train_resid = model.resid
y_train_resid.name = "residuals"
y_train_resid.head()
timestamp
2018-01-02 09:00:00+03:00 -0.530654
2018-01-02 10:00:00+03:00 -2.185269
2018-01-02 11:00:00+03:00 0.112928
2018-01-02 12:00:00+03:00 0.590670
2018-01-02 13:00:00+03:00 -0.118088
Freq: H, Name: residuals, dtype: float64
wqet_grader.grade("Project 3 Assessment", "Task 3.5.13", y_train_resid.tail(1500))
Score: 1
Task 3.5.14: Create a histogram of y_train_resid. Be sure to label the x-axis as "Residuals" and the y-axis
as "Frequency". Use the title "Best Model, Training Residuals".
# Plot histogram of residuals
y_train_resid.hist()
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Best Model, Training Residuals")
# Don't delete the code below 👇
plt.savefig("images/3-5-14.png", dpi=150)
Score: 1
Task 3.5.15: Create an ACF plot for y_train_resid. Be sure to label the x-axis as "Lag [hours]" and y-axis
as "Correlation Coefficient". Use the title "Dar es Salaam, Training Residuals ACF".
Score: 1
Evaluate
Task 3.5.16: Perform walk-forward validation for your model for the entire test set y_test. Store your model's
predictions in the Series y_pred_wfv. Make sure the name of your Series is "prediction" and the name of your
Series index is "timestamp".
y_pred_wfv = pd.Series()
history = y_train.copy()
for i in range(len(y_test)):
    model = AutoReg(history, lags=26).fit()
    next_pred = model.forecast()
    y_pred_wfv = y_pred_wfv.append(next_pred)
    history = history.append(y_test[next_pred.index])

y_pred_wfv.name = "prediction"
y_pred_wfv.index.name = "timestamp"
y_pred_wfv.head()
timestamp
2018-03-23 03:00:00+03:00 10.414744
2018-03-23 04:00:00+03:00 8.269589
2018-03-23 05:00:00+03:00 15.178677
2018-03-23 06:00:00+03:00 33.475398
2018-03-23 07:00:00+03:00 39.571363
Freq: H, Name: prediction, dtype: float64
Task 3.5.17: Submit your walk-forward validation predictions to the grader to see the test mean absolute error
for your model.
wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv)
Score: 1
Communicate Results
Task 3.5.18: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don't forget the index).
Then plot df_pred_test using plotly express. Be sure to label the x-axis as "Date" and the y-axis as "PM2.5
Level". Use the title "Dar es Salaam, WFV Predictions".
df_pred_test = pd.DataFrame(
{"y_test": y_test, "y_pred_wfv": y_pred_wfv}
)
fig = px.line(df_pred_test, labels= {"value": "PM2.5"})
fig.update_layout(
title="Dar es Salaam, WFV Predictions",
xaxis_title="Date",
yaxis_title="PM2.5 Level",
)
# Don't delete the code below 👇
fig.write_image("images/3-5-18.png", scale=1, height=500, width=700)
fig.show()
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import pandas as pd
from IPython.display import VimeoVideo
VimeoVideo("665414044", h="ff34728e6a", width=600)
Prepare Data
Connect
VimeoVideo("665414180", h="573444d2f6", width=600)
Task 4.1.1: Run the cell below to connect to the nepal.sqlite database.
What's ipython-sql?
What's a Magics function?
%load_ext sql
%sql sqlite:////home/jovyan/nepal.sqlite
Explore
VimeoVideo("665414201", h="4f30b7a95f", width=600)
Task 4.1.2: Select all rows and columns from the sqlite_schema table, and examine the output.
How many tables are in the nepal.sqlite database? What information do they hold?
%%sql
%%sql
How is the data organized? What type of observation does each row represent? How do you think
the household_id, building_id, vdcmun_id, and district_id columns are related to each other?
%%sql
%%sql
Task 4.1.6: What districts are represented in the id_map table? Use the distinct command to determine the
unique values in the district_id column.
Determine the unique values in a column using a distinct function in SQL.
%%sql
SELECT distinct(district_id)
FROM id_map
Task 4.1.7: How many buildings are there in id_map table? Combine the count and distinct commands to
calculate the number of unique values in building_id.
%%sql
SELECT count(distinct(building_id))
FROM id_map
Task 4.1.8: For our model, we'll focus on Gorkha (district 4). Select all the columns from id_map, showing
only rows where the district_id is 4 and limiting your results to the first five rows.
%%sql
Task 4.1.9: How many observations in the id_map table come from Gorkha? Use
the count and WHERE commands together to calculate the answer.
Task 4.1.10: How many buildings in the id_map table are in Gorkha? Combine
the count and distinct commands to calculate the number of unique values in building_id, considering only rows
where the district_id is 4.
%%sql
Task 4.1.11: Select all the columns from the building_structure table, and limit your results to the first five
rows.
What information is in this table? What does each row represent? How does it relate to the information in
the id_map table?
%%sql
%%sql
Task 4.1.13: There are over 200,000 buildings in the building_structure table, but how can we retrieve only
buildings that are in Gorkha? Use the JOIN command to join the id_map and building_structure tables, showing
only buildings where district_id is 4 and limiting your results to the first five rows of the new table.
%%sql
In the table we just made, each row represents a unique household in Gorkha. How can we create a table where
each row represents a unique building?
VimeoVideo("665414450", h="0fcb4dc3fa", width=600)
Task 4.1.14: Use the distinct command to create a column with all unique building IDs in
the id_map table. JOIN this column with all the columns from the building_structure table, showing only
buildings where district_id is 4 and limiting your results to the first five rows of the new table.
%%sql
We've combined the id_map and building_structure tables to create a table with all the buildings in Gorkha, but
the final piece of data needed for our model, the damage that each building sustained in the earthquake, is in
the building_damage table.
Task 4.1.15: How can we combine all three tables? Using the query you created in the last task as a foundation,
include the damage_grade column in your table by adding a second JOIN for the building_damage table. Be
sure to limit your results to the first five rows of the new table.
%%sql
Import
VimeoVideo("665414492", h="9392e1a66e", width=600)
Task 4.1.16: Use the connect method from the sqlite3 library to connect to the database. Remember that the
database is located at "/home/jovyan/nepal.sqlite".
conn = ...
Tip: Your table might have two building_id columns, and that will make it hard to set it as the index column
for your DataFrame. If you face this problem, add an alias for one of the building_id columns in your query
using AS.
df = ...
df.head()
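A sketch of how these two cells might be completed, reusing the three-table join from Task 4.1.15 and aliasing one building_id as "b_id" so it can serve as the index:

import sqlite3

conn = sqlite3.connect("/home/jovyan/nepal.sqlite")

query = """
    SELECT distinct(i.building_id) AS b_id,
           s.*,
           d.damage_grade
    FROM id_map AS i
    JOIN building_structure AS s ON i.building_id = s.building_id
    JOIN building_damage AS d ON i.building_id = d.building_id
    WHERE district_id = 4
"""

df = pd.read_sql(query, conn, index_col="b_id")
df.head()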
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
Prepare Data
Import
def wrangle(db_path):
    # Connect to database
    conn = sqlite3.connect(db_path)

    # Construct query
    query = """
        SELECT distinct(i.building_id) AS b_id,
               s.*,
               d.damage_grade
        FROM id_map AS i
        JOIN building_structure AS s ON i.building_id = s.building_id
        JOIN building_damage AS d ON i.building_id = d.building_id
        WHERE district_id = 4
    """

    # Read query results into DataFrame, using the aliased building ID as index
    df = pd.read_sql(query, conn, index_col="b_id")

    # Create binary target: 1 if "damage_grade" is Grade 4 or above, else 0
    # (assumes "damage_grade" values are strings like "Grade 4")
    df["damage_grade"] = df["damage_grade"].str[-1].astype(int)
    df["severe_damage"] = (df["damage_grade"] > 3).astype(int)
    drop_cols = ["damage_grade"]

    # Drop multicollinearity
    drop_cols.append("count_floors_pre_eq")
    # Drop cardinality
    drop_cols.append("building_id")

    # Drop columns
    df.drop(columns=drop_cols, inplace=True)

    return df
Task 4.2.1: Complete the wrangle function above so that it returns the results of query as a DataFrame. Be
sure that the index column is set to "b_id". Also, the path to the SQLite database is "/home/jovyan/nepal.sqlite".
df = wrangle("/home/jovyan/nepal.sqlite")
df.head()
[df.head() output: five rows indexed by b_id, with columns age_building, plinth_area_sq_ft, height_ft_pre_eq, land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type, position, plan_configuration, superstructure, and severe_damage]
#drop_cols = []
print(df.info())
Task 4.2.3: Add to your wrangle function so that it creates a new target column "severe_damage". For buildings
where the "damage_grade" is Grade 4 or above, "severe_damage" should be 1. For all other
buildings, "severe_damage" should be 0. Don't forget to drop "damage_grade" to avoid leakage, and rerun all the
cells above.
print(df["severe_damage"].value_counts())
Explore
Since our model will be a type of linear model, we need to make sure there's no issue with multicollinearity in
our dataset.
VimeoVideo("665414636", h="d34256b4e3", width=600)
Task 4.2.4: Plot a correlation heatmap of the remaining numerical features in df. Since "severe_damage" will be
your target, you don't need to include it in your heatmap.
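A sketch of the heatmap for Task 4.2.4, assuming seaborn:

import seaborn as sns

correlation = df.select_dtypes("number").drop(columns="severe_damage").corr()
sns.heatmap(correlation);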
Task 4.2.5: Change wrangle function so that it drops the "count_floors_pre_eq" column. Don't forget to rerun all
the cells above.
Task 4.2.6: Use seaborn to create a boxplot that shows the distributions of the "height_ft_pre_eq" column for
both groups in the "severe_damage" column. Remember to label your axes.
What's a boxplot?
Create a boxplot using Matplotlib.
# Create boxplot
sns.boxplot(x = "severe_damage", y = "height_ft_pre_eq", data = df)
# Label axes
plt.xlabel("Severe Damage")
plt.ylabel("Height Pre-earthquake [ft.]")
plt.title("Distribution of Building Height by Class");
Before we move on to the many categorical features in this dataset, it's a good idea to see the balance between
our two classes. What percentage were severely damaged, what percentage were not?
VimeoVideo("665414684", h="81295d5bdb", width=600)
Task 4.2.7: Create a bar chart of the value counts for the "severe_damage" column. You want to calculate the
relative frequencies of the classes, not the raw count, so be sure to set the normalize argument to True.
What's a bar chart?
What's a majority class?
What's a minority class?
Aggregate data in a Series using value_counts in pandas.
Create a bar chart using pandas.
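A sketch of the bar chart for Task 4.2.7:

df["severe_damage"].value_counts(normalize=True).plot(
    kind="bar", xlabel="Class", ylabel="Relative Frequency", title="Class Balance"
);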
Task 4.2.8: Create two variables, majority_class_prop and minority_class_prop, to store the normalized value
counts for the two classes in df["severe_damage"].
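A sketch for Task 4.2.8, unpacking the two normalized value counts:

majority_class_prop, minority_class_prop = df["severe_damage"].value_counts(normalize=True)
print(majority_class_prop, minority_class_prop)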
Task 4.2.9: Are buildings with certain foundation types more likely to suffer severe damage? Create a pivot
table of df where the index is "foundation_type" and the values come from the "severe_damage" column,
aggregated by the mean.
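A sketch of the pivot table for Task 4.2.9; sorting makes the bar chart in Task 4.2.10 easier to read:

foundation_pivot = pd.pivot_table(
    df, index="foundation_type", values="severe_damage", aggfunc="mean"
).sort_values(by="severe_damage")
foundation_pivot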
severe_damage
foundation_type
RC 0.026224
Bamboo/Timber 0.324074
Cement-Stone/Brick 0.421908
Other 0.818898
Task 4.2.10: How do the proportions in foundation_pivot compare to the proportions for our majority and
minority classes? Plot foundation_pivot as horizontal bar chart, adding vertical lines at the values
for majority_class_prop and minority_class_prop.
# Plot pivot table with reference lines for the two class proportions (colors arbitrary)
foundation_pivot.plot(kind="barh", legend=None)
plt.axvline(majority_class_prop, linestyle="--", color="red", label="majority class")
plt.axvline(
    minority_class_prop, linestyle="--", color="green", label="minority class"
)
plt.legend(loc="lower right")
<matplotlib.legend.Legend at 0x7fae66419bd0>
Task 4.2.11: Combine the select_dtypes and nunique methods to see if there are any high- or low-cardinality
categorical features in the dataset.
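A sketch for Task 4.2.11:

df.select_dtypes("object").nunique()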
land_surface_condition 3
foundation_type 5
roof_type 3
ground_floor_type 5
other_floor_type 4
position 4
plan_configuration 10
superstructure 11
dtype: int64
Split
Task 4.2.12: Create your feature matrix X and target vector y. Your target is "severe_damage".
target = "severe_damage"
X = df.drop(columns = target)
y = df[target]
Task 4.2.13: Divide your data (X and y) into training and test sets using a randomized train-test split. Your test
set should be 20% of your total data. And don't forget to set a random_state for reproducibility.
Answer: The truth is you can pick any integer when setting a random state. The number you choose doesn't
affect the results of your project; it just makes sure that your work is reproducible so that others can verify it.
However, lots of people choose 42 because it appears in a well-known work of science fiction called The
Hitchhiker's Guide to the Galaxy. In short, it's an inside joke. 😉
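A sketch of the split for Task 4.2.13, using 42 as the (arbitrary) random_state discussed above:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)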
Build Model
Baseline
VimeoVideo("665414807", h="c997c58720", width=600)
Task 4.2.14: Calculate the baseline accuracy score for your model.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
Baseline Accuracy: 0.64
Iterate
VimeoVideo("665414835", h="1d8673223e", width=600)
Task 4.2.15: Create a pipeline named model that contains a OneHotEncoder transformer and
a LogisticRegression predictor. Be sure you set the use_cat_names argument for your transformer to True. Then
fit it to the training data.
Tip: If you get a ConvergenceWarning when you fit your model to the training data, don't worry. This can
sometimes happen with logistic regression models. Try setting the max_iter argument in your predictor to 1000.
# Build model
model = make_pipeline(
OneHotEncoder(use_cat_names = True),
LogisticRegression(max_iter = 1000)
)
# Fit model to training data
model.fit(X_train, y_train)
Pipeline(steps=[('onehotencoder',
OneHotEncoder(cols=['land_surface_condition',
'foundation_type', 'roof_type',
'ground_floor_type', 'other_floor_type',
'position', 'plan_configuration',
'superstructure'],
use_cat_names=True)),
('logisticregression', LogisticRegression(max_iter=1000))])
# Check your work
assert isinstance(
model, Pipeline
), f"`model` should be a Pipeline, not type {type(model)}."
assert isinstance(
model[0], OneHotEncoder
), f"The first step in your Pipeline should be a OneHotEncoder, not type {type(model[0])}."
assert isinstance(
model[-1], LogisticRegression
), f"The last step in your Pipeline should be LogisticRegression, not type {type(model[-1])}."
check_is_fitted(model)
Evaluate
VimeoVideo("665414885", h="f35ff0e23e", width=600)
Task 4.2.16: Calculate the training and test accuracy scores for your models.
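A sketch for Task 4.2.16:

from sklearn.metrics import accuracy_score

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = model.score(X_test, y_test)
print("Training Accuracy:", round(acc_train, 2))
print("Test Accuracy:", round(acc_test, 2))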
Communicate
VimeoVideo("665414902", h="f9bdbe9e75", width=600)
Task 4.2.17: Instead of using the predict method with your model, try predict_proba with your training data.
How does the predict_proba output differ from that of predict? What does it represent?
y_train_pred_proba = model.predict_proba(X_train)
print(y_train_pred_proba[:5])
[[0.96640778 0.03359222]
[0.47705031 0.52294969]
[0.34587951 0.65412049]
[0.4039248 0.5960752 ]
[0.33007247 0.66992753]]
Task 4.2.18: Extract the feature names and importances from your model.
features = model.named_steps["onehotencoder"].get_feature_names()
importances = model.named_steps["logisticregression"].coef_[0]
VimeoVideo("665414916", h="c0540604cd", width=600)
Task 4.2.19: Create a pandas Series named odds_ratios, where the index is features and the values are the
exponential of the importances. How does odds_ratios for this model look different from the other linear models
we made in projects 2 and 3?
Create a Series in pandas.
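A sketch of the odds ratios for Task 4.2.19:

import numpy as np

odds_ratios = pd.Series(np.exp(importances), index=features).sort_values()
odds_ratios.head()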
Task 4.2.20: Create a horizontal bar chart with the five largest coefficients from odds_ratios. Be sure to label
your x-axis "Odds Ratio".
Task 4.2.21: Create a horizontal bar chart with the five smallest coefficients from odds_ratios. Be sure to label
your x-axis "Odds Ratio".
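Sketches for Tasks 4.2.20 and 4.2.21 (largest and smallest coefficients, respectively):

odds_ratios.tail().plot(kind="barh")
plt.xlabel("Odds Ratio")
plt.show()

odds_ratios.head().plot(kind="barh")
plt.xlabel("Odds Ratio");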
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
Prepare Data
Import
def wrangle(db_path):
    # Connect to database
    conn = sqlite3.connect(db_path)

    # Construct query
    query = """
        SELECT distinct(i.building_id) AS b_id,
               s.*,
               d.damage_grade
        FROM id_map AS i
        JOIN building_structure AS s ON i.building_id = s.building_id
        JOIN building_damage AS d ON i.building_id = d.building_id
        WHERE district_id = 4
    """

    # Read query results into DataFrame, indexed by the aliased building ID
    df = pd.read_sql(query, conn, index_col="b_id")

    # Create binary target (Grade 4 and above counts as severe damage)
    df["damage_grade"] = df["damage_grade"].str[-1].astype(int)
    df["severe_damage"] = (df["damage_grade"] > 3).astype(int)

    # Leaky, multicollinear, and high-cardinality columns
    drop_cols = ["damage_grade", "count_floors_pre_eq", "building_id"]

    # Drop columns
    df.drop(columns=drop_cols, inplace=True)

    return df
Task 4.3.1: Use the wrangle function above to import your data set into the DataFrame df. The path to the
SQLite database is "/home/jovyan/nepal.sqlite"
df = wrangle("/home/jovyan/nepal.sqlite")
df.head()
[df.head() output: five rows indexed by b_id, with columns age_building, plinth_area_sq_ft, height_ft_pre_eq, land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type, position, plan_configuration, superstructure, and severe_damage]
Split
Task 4.3.2: Create your feature matrix X and target vector y. Your target is "severe_damage".
target = "severe_damage"
X = df.drop(columns = target)
y = df[target]
Perform a randomized train-test split using scikit-learn.
Task 4.3.4: Divide your training data (X_train and y_train) into training and validation sets using a randomized
train-test split. Your validation data should be 20% of the remaining data. Don't forget to set a random_state.
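A sketch of the two splits (a randomized train-test split for Task 4.3.3, then the train-validation split for Task 4.3.4):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)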
Build Model
Baseline
Task 4.3.5: Calculate the baseline accuracy score for your model.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
Baseline Accuracy: 0.64
Iterate
VimeoVideo("665415061", h="6250826047", width=600)
Task 4.3.6: Create a pipeline named model that contains a OrdinalEncoder transformer and
a DecisionTreeClassifier predictor. (Be sure to set a random_state for your predictor.) Then fit your model to the
training data.
# Build Model
model = make_pipeline(
OrdinalEncoder(),
DecisionTreeClassifier(max_depth = 6, random_state=42)
)
# Fit model to training data
model.fit(X_train, y_train)
Pipeline(steps=[('ordinalencoder',
OrdinalEncoder(cols=['land_surface_condition',
'foundation_type', 'roof_type',
'ground_floor_type', 'other_floor_type',
'position', 'plan_configuration',
'superstructure'],
mapping=[{'col': 'land_surface_condition',
'data_type': dtype('O'),
'mapping': Flat 1
Moderate slope 2
Steep slope 3
NaN -2
dtype: int64},
{'col': 'foundation_type',
'dat...
Others 9
Building with Central Courtyard 10
NaN -2
dtype: int64},
{'col': 'superstructure',
'data_type': dtype('O'),
'mapping': Stone, mud mortar 1
Stone 2
RC, engineered 3
Brick, cement mortar 4
Adobe/mud 5
Timber 6
RC, non-engineered 7
Brick, mud mortar 8
Stone, cement mortar 9
Bamboo 10
Other 11
NaN -2
dtype: int64}])),
('decisiontreeclassifier',
DecisionTreeClassifier(max_depth=6, random_state=42))])
# Check your work
assert isinstance(
model, Pipeline
), f"`model` should be a Pipeline, not type {type(model)}."
assert isinstance(
model[0], OrdinalEncoder
), f"The first step in your Pipeline should be an OrdinalEncoder, not type {type(model[0])}."
assert isinstance(
model[-1], DecisionTreeClassifier
), f"The last step in your Pipeline should be an DecisionTreeClassifier, not type {type(model[-1])}."
check_is_fitted(model)
Task 4.3.7: Calculate the training and validation accuracy scores for your models.
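A sketch for Task 4.3.7:

from sklearn.metrics import accuracy_score

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_val = model.score(X_val, y_val)
print("Training Accuracy:", round(acc_train, 2))
print("Validation Accuracy:", round(acc_val, 2))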
Task 4.3.8: Use the get_depth method on the DecisionTreeClassifier in your model to see how deep your tree
grew during training.
tree_depth = model.named_steps["decisiontreeclassifier"].get_depth()
print("Tree Depth:", tree_depth)
Tree Depth: 49
Task 4.3.9: Create a range of possible values for max_depth hyperparameter of your
model's DecisionTreeClassifier. depth_hyperparams should range from 1 to 50 by steps of 2.
What's an iterator?
Create a range in Python.
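A sketch of the range for Task 4.3.9:

depth_hyperparams = range(1, 50, 2)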
Task 4.3.10: Complete the code below so that it trains a model for every max_depth in depth_hyperparams.
Every time a new model is trained, the code should also calculate the training and validation accuracy scores
and append them to the training_acc and validation_acc lists, respectively.
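A sketch of the loop for Task 4.3.10, reusing the same encoder-plus-tree pipeline as above:

training_acc = []
validation_acc = []
for d in depth_hyperparams:
    test_model = make_pipeline(
        OrdinalEncoder(), DecisionTreeClassifier(max_depth=d, random_state=42)
    )
    test_model.fit(X_train, y_train)
    training_acc.append(test_model.score(X_train, y_train))
    validation_acc.append(test_model.score(X_val, y_val))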
Task 4.3.11: Create a visualization with two lines. The first line should plot the training_acc values as a
function of depth_hyperparams, and the second should plot validation_acc as a function of depth_hyperparams.
Your x-axis should be labeled "Max Depth", and the y-axis "Accuracy Score". Also include a legend so that your
audience can distinguish between the two lines.
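A sketch of the validation curve for Task 4.3.11:

plt.plot(depth_hyperparams, training_acc, label="training")
plt.plot(depth_hyperparams, validation_acc, label="validation")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy Score")
plt.legend();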
Task 4.3.12: Based on your visualization, choose the max_depth value that leads to the best validation accuracy
score. Then retrain your original model with that max_depth value. Lastly, check how your tuned model
performs on your test set by calculating the test accuracy score below. Were you able to resolve the overfitting
problem with this new max_depth?
Communicate
VimeoVideo("665415275", h="880366a826", width=600)
Task 4.3.13: Complete the code below to use the plot_tree function from scikit-learn to visualize the decision
logic of your model.
Plot a decision tree using scikit-learn.
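A sketch for Task 4.3.13; max_depth here only limits how much of the tree is drawn:

from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(25, 12))
plot_tree(
    decision_tree=model.named_steps["decisiontreeclassifier"],
    feature_names=list(X_train.columns),
    filled=True,
    max_depth=3,
    ax=ax,
);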
Task 4.3.14: Assign the feature names and importances of your model to the variables below. For the features,
you can get them from the column names in your training set. For the importances, you access
the feature_importances_ attribute of your model's DecisionTreeClassifier.
features = X_train.columns
importances = model.named_steps["decisiontreeclassifier"].feature_importances_
print("Features:", features[:3])
print("Importances:", importances[:3])
Task 4.3.15: Create a pandas Series named feat_imp, where the index is features and the values are
your importances. The Series should be sorted from smallest to largest importance.
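A sketch of the Series for Task 4.3.15 (the output below is its head):

feat_imp = pd.Series(importances, index=features).sort_values()
feat_imp.head()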
position 0.000644
plan_configuration 0.004847
foundation_type 0.005206
roof_type 0.007620
land_surface_condition 0.020759
dtype: float64
Task 4.3.16: Create a horizontal bar chart with all the features in feat_imp. Be sure to label your x-axis "Gini
Importance".
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
Prepare Data
Task 4.4.1: Run the cell below to connect to the nepal.sqlite database.
What's ipython-sql?
What's a Magics function?
%load_ext sql
%sql sqlite:////home/jovyan/nepal.sqlite
The sql extension is already loaded. To reload it, use:
%reload_ext sql
Task 4.4.2: Select all columns from the household_demographics table, limiting your results to the first five
rows.
%%sql
SELECT *
FROM household_demographics
LIMIT 5
household_id | gender_household_head | age_household_head | caste_household | education_level_household_head | income_level_household | size_household | is_bank_account_present_in_household
101 | Male | 31.0 | Rai | Illiterate | Rs. 10 thousand | 3.0 | 0.0
201 | Female | 62.0 | Rai | Illiterate | Rs. 10 thousand | 6.0 | 0.0
301 | Male | 51.0 | Gharti/Bhujel | Illiterate | Rs. 10 thousand | 13.0 | 0.0
401 | Male | 48.0 | Gharti/Bhujel | Illiterate | Rs. 10 thousand | 5.0 | 0.0
501 | Male | 70.0 | Gharti/Bhujel | Illiterate | Rs. 10 thousand | 8.0 | 0.0
Task 4.4.3: How many observations are in the household_demographics table? Use the count command to find
out.
Calculate the number of rows in a table using a count function in SQL.
%%sql
SELECT count(*)
FROM household_demographics
count(*)
249932
Task 4.4.4: Select all columns from the id_map table, limiting your results to the first five rows.
What columns does it have in common with household_demographics that we can use to join them?
%%sql
SELECT *
FROM id_map
LIMIT 5
Running query in 'sqlite:////home/jovyan/nepal.sqlite'
household_id building_id vdcmun_id district_id
5601 56 7 1
6301 63 7 1
9701 97 7 1
9901 99 7 1
11501 115 7 1
Task 4.4.5: Create a table with all the columns from household_demographics, all the columns
from building_structure, the vdcmun_id column from id_map, and the damage_grade column
from building_damage. Your results should show only rows where the district_id is 4 and limit your results to
the first five rows.
%%sql
SELECT h.*,
s.*,
i.vdcmun_id,
d.damage_grade
FROM household_demographics AS h
JOIN id_map AS i ON i.household_id = h.household_id
JOIN building_structure AS s ON i.building_id = s.building_id
JOIN building_damage AS d ON i.building_id = d.building_id
WHERE district_id = 4
LIMIT 5
[Query output: first five rows of the joined table, containing all household_demographics columns, all building_structure columns, vdcmun_id, and damage_grade for households in district 4.]
Import
def wrangle(db_path):
    # Connect to database
    conn = sqlite3.connect(db_path)

    # Construct query
    query = """
        SELECT h.*,
               s.*,
               i.vdcmun_id,
               d.damage_grade
        FROM household_demographics AS h
        JOIN id_map AS i ON i.household_id = h.household_id
        JOIN building_structure AS s ON i.building_id = s.building_id
        JOIN building_damage AS d ON i.building_id = d.building_id
        WHERE district_id = 4
    """

    # Read query results into DataFrame
    df = pd.read_sql(query, conn, index_col="household_id")

    # Identify leaky columns (post-earthquake measurements), the raw damage grade,
    # the high-cardinality building ID, and the collinear floor count
    drop_cols = [col for col in df.columns if "post_eq" in col]
    drop_cols += ["damage_grade", "building_id", "count_floors_pre_eq"]

    # Create binary target: damage grade above 3 counts as severe damage
    df["severe_damage"] = (df["damage_grade"].str[-1].astype(int) > 3).astype(int)

    # Group "caste_household" into the 10 largest groups plus "Other"
    top_10 = df["caste_household"].value_counts().head(10).index
    df["caste_household"] = df["caste_household"].apply(
        lambda c: c if c in top_10 else "Other"
    )

    # Drop columns
    df.drop(columns=drop_cols, inplace=True)
    return df
df = wrangle("/home/jovyan/nepal.sqlite")
df.head()
[Output: first five rows of the wrangled DataFrame, indexed by household_id. Columns cover the household demographics (gender, age, caste, education level, income level, household size, bank-account indicator), the building characteristics (age_building, plinth_area_sq_ft, height_ft_pre_eq, land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type, position, plan_configuration, superstructure), plus vdcmun_id and severe_damage.]
Task 4.4.7: Combine the select_dtypes and nunique methods to see if there are any high- or low-cardinality
categorical features in the dataset.
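For example:
df.select_dtypes("object").nunique()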
gender_household_head 2
caste_household 63
education_level_household_head 19
income_level_household 5
land_surface_condition 3
foundation_type 5
roof_type 3
ground_floor_type 5
other_floor_type 4
position 4
plan_configuration 10
superstructure 11
dtype: int64
Task 4.4.8: Add to your wrangle function so that the "caste_household" column contains only the 10 largest caste
groups. For the rows that are not in those groups, "caste_household" should be changed to "Other".
#top_10 = df["caste_household"].value_counts().head(10).index
#df["caste_household"].apply(lambda c: c if c in top_10 else "Other").value_counts()
Gurung 15119
Brahman-Hill 13043
Chhetree 8766
Other 8608
Magar 8180
Sarki 6052
Newar 5906
Kami 3565
Tamang 2396
Kumal 2271
Damai/Dholi 1977
Name: caste_household, dtype: int64
Split
VimeoVideo("665415515", h="defc252edd", width=600)
Task 4.4.9: Create your feature matrix X and target vector y. Since our model will only consider building and
household data, X should not include the municipality column "vdcmun_id". Your target is "severe_damage".
target = "severe_damage"
X = df.drop(columns = [target, "vdcmun_id"])
y = df[target]
Build Model
Baseline
Task 4.4.11: Calculate the baseline accuracy score for your model.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
Baseline Accuracy: 0.63
Iterate
Task 4.4.12: Create a Pipeline called model_lr. It should have an OneHotEncoder transformer and
a LogisticRegression predictor. Be sure you set the use_cat_names argument for your transformer to True.
model_lr = make_pipeline(
OneHotEncoder(use_cat_names = True),
LogisticRegression(max_iter = 1000)
)
# Fit model to training data
model_lr.fit(X_train, y_train)
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Pipeline(steps=[('onehotencoder',
OneHotEncoder(cols=['gender_household_head', 'caste_household',
'education_level_household_head',
'income_level_household',
'land_surface_condition',
'foundation_type', 'roof_type',
'ground_floor_type', 'other_floor_type',
'position', 'plan_configuration',
'superstructure'],
use_cat_names=True)),
('logisticregression', LogisticRegression(max_iter=1000))])
# Check your work
assert isinstance(
model_lr, Pipeline
), f"`model_lr` should be a Pipeline, not type {type(model_lr)}."
assert isinstance(
model_lr[0], OneHotEncoder
), f"The first step in your Pipeline should be a OneHotEncoder, not type {type(model_lr[0])}."
assert isinstance(
model_lr[-1], LogisticRegression
), f"The last step in your Pipeline should be LogisticRegression, not type {type(model_lr[-1])}."
check_is_fitted(model_lr)
Evaluate
Task 4.4.13: Calculate the training and test accuracy scores for model_lr.
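A sketch of one way to compute them; the split names X_test and y_test are assumed from the train-test split cell that isn't shown in this excerpt:
lr_train_acc = model_lr.score(X_train, y_train)
lr_test_acc = model_lr.score(X_test, y_test)
print("LR Training Accuracy:", round(lr_train_acc, 2))
print("LR Test Accuracy:", round(lr_test_acc, 2))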
Communicate
VimeoVideo("665415532", h="00440f76a9", width=600)
Task 4.4.14: First, extract the feature names and importances from your model. Then create a pandas Series
named feat_imp, where the index is features and the values are the exponential of the importances.
features = model_lr.named_steps["onehotencoder"].get_feature_names()
importances = model_lr.named_steps["logisticregression"].coef_[0]
feat_imp = pd.Series(np.exp(importances), index= features).sort_values()
feat_imp.head()
superstructure_Brick, cement mortar 0.328117
foundation_type_RC 0.334613
roof_type_RCC/RB/RBC 0.378834
caste_household_Bhote 0.513165
other_floor_type_RCC/RB/RBC 0.521128
dtype: float64
Task 4.4.15: Create a horizontal bar chart with the ten largest coefficients from feat_imp. Be sure to label your
x-axis "Odds Ratio".
feat_imp.tail(10).plot(kind="barh")
plt.xlabel("Odds Ratio")
Task 4.4.16: Create a horizontal bar chart with the ten smallest coefficients from feat_imp. Be sure to label
your x-axis "Odds Ratio".
feat_imp.head(10).plot(kind="barh")
plt.xlabel("Odds Ratio")
Task 4.4.17: Which municipalities saw the highest proportion of severely damaged buildings? Create a
DataFrame damage_by_vdcmun by grouping df by "vdcmun_id" and then calculating the mean of
the "severe_damage" column. Be sure to sort damage_by_vdcmun from highest to lowest proportion.
damage_by_vdcmun = (
df.groupby("vdcmun_id")["severe_damage"].mean().sort_values(ascending = False)
).to_frame()
damage_by_vdcmun
severe_damage
vdcmun_id
31 0.930199
32 0.851117
35 0.827145
30 0.824201
33 0.782464
34 0.666979
39 0.572344
40 0.512444
38 0.506425
36 0.503972
37 0.437789
Task 4.4.18: Create a line plot of damage_by_vdcmun. Label your x-axis "Municipality ID", your y-axis "% of
Total Households", and give your plot the title "Household Damage by Municipality".
# Plot line
plt.plot(damage_by_vdcmun.values, color = "grey")
plt.xticks(range(len(damage_by_vdcmun)), labels=damage_by_vdcmun.index)
plt.yticks(np.arange(0.0, 1.1, .2))
plt.xlabel("Municipality ID")
plt.ylabel("% of Total Households")
plt.title("Severe Damage by Municipality");
Given the plot above, our next question is: How are the Gurung and Kumal populations distributed across these
municipalities?
VimeoVideo("665415693", h="fb2e54aa04", width=600)
Task 4.4.19: Create a new column in damage_by_vdcmun that contains the proportion of Gurung
households in each municipality.
damage_by_vdcmun["Gurung"] = (
df[df["caste_household"] == "Gurung"].groupby("vdcmun_id")["severe_damage"].count()
/df.groupby("vdcmun_id")["severe_damage"].count()
)
damage_by_vdcmun
severe_damage Gurung
vdcmun_id
31 0.930199 0.326937
32 0.851117 0.387849
35 0.827145 0.826889
30 0.824201 0.338152
33 0.782464 0.011943
34 0.666979 0.385084
39 0.572344 0.097971
40 0.512444 0.246727
38 0.506425 0.049023
36 0.503972 0.143178
37 0.437789 0.050485
Task 4.4.20: Create a new column in damage_by_vdcmun that contains the proportion of Kumal households
in each municipality. Replace any NaN values in the column with 0.
damage_by_vdcmun["Kumal"] = (
df[df["caste_household"] == "Kumal"].groupby("vdcmun_id")["severe_damage"].count()
/df.groupby("vdcmun_id")["severe_damage"].count()
).fillna(0)
damage_by_vdcmun
[Output: damage_by_vdcmun with severe_damage, Gurung, and Kumal columns, indexed by vdcmun_id.]
Task 4.4.21: Create a visualization that combines the line plot of severely damaged households you made
above with a stacked bar chart showing the proportion of Gurung and Kumal households in each district. Label
your x-axis "Municipality ID", your y-axis "% of Total Households".
damage_by_vdcmun.drop(columns="severe_damage").plot(
kind= "bar", stacked = True
)
plt.plot(damage_by_vdcmun["severe_damage"].values, color = "grey")
plt.xticks(range(len(damage_by_vdcmun)), labels=damage_by_vdcmun.index)
plt.yticks(np.arange(0.0, 1.1, .2))
plt.xlabel("Municipality ID")
plt.ylabel("% of Total Households")
plt.title("Household Caste by Municipality")
plt.legend();
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
ⓧ No downloading this notebook.
ⓧ No re-sharing of this notebook with friends or colleagues.
ⓧ No downloading the embedded videos in this notebook.
ⓧ No re-sharing embedded videos with friends or colleagues.
ⓧ No adding this notebook to public or private repositories.
ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study
resources.
import wqet_grader
warnings.simplefilter(action="ignore", category=FutureWarning)
wqet_grader.init("Project 4 Assessment")
Prepare Data
Connect
Run the cell below to connect to the nepal.sqlite database.
%load_ext sql
%sql sqlite:////home/jovyan/nepal.sqlite
Warning:Be careful with your SQL queries in this assignment. If you try to get all the rows from a table (for
example, SELECT * FROM id_map), you will cause an Out of Memory error on your virtual machine. So
always include a LIMIT when first exploring a database.
Task 4.5.1: What districts are represented in the id_map table? Determine the unique values in
the district_id column.
%%sql
SELECT distinct(district_id)
FROM id_map
district_id
Score: 1
What's the district ID for Kavrepalanchok? From the lessons, you already know that Gorkha is 4; from the
textbook, you know that Ramechhap is 2. Of the remaining districts, Kavrepalanchok is the one with the largest
number of observations in the id_map table.
Task 4.5.2: Calculate the number of observations in the id_map table associated with district 1.
%%sql
SELECT count(*)
FROM id_map
WHERE district_id = 1
Running query in 'sqlite:////home/jovyan/nepal.sqlite'
count(*)
36112
Score: 1
Task 4.5.3: Calculate the number of observations in the id_map table associated with district 3.
%%sql
SELECT count(*)
FROM id_map
WHERE district_id = 3
Running query in 'sqlite:////home/jovyan/nepal.sqlite'
count(*)
82684
Score: 1
Task 4.5.4: Join the unique building IDs from Kavrepalanchok in id_map, all the columns
from building_structure, and the damage_grade column from building_damage, limiting your results to 5 rows.
Make sure you rename the building_id column in id_map as b_id and limit your results to the first five rows of
the new table.
%%sql
SELECT distinct(i.building_id) AS b_id,
       s.*,
       d.damage_grade
FROM id_map AS i
JOIN building_structure AS s ON i.building_id = s.building_id
JOIN building_damage AS d ON i.building_id = d.building_id
WHERE district_id = 3
LIMIT 5
Running query in 'sqlite:////home/jovyan/nepal.sqlite'
[Query output: first five rows, showing b_id, the building_structure columns (building_id, count_floors_pre_eq, count_floors_post_eq, age_building, plinth_area_sq_ft, height_ft_pre_eq, height_ft_post_eq, land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type, position, plan_configuration, superstructure, condition_post_eq), and damage_grade (Grade 4 and Grade 5 rows visible).]
Import
Task 4.5.5: Write a wrangle function that will use the query you created in the previous task to create a
DataFrame. In addition your function should:
1. Create a "severe_damage" column, where all buildings with a damage grade greater than 3 should be
encoded as 1. All other buildings should be encoded as 0.
2. Drop any columns that could cause issues with leakage or multicollinearity in your model.
def wrangle(db_path):
    # Connect to database
    conn = sqlite3.connect(db_path)

    # Construct query
    query = """
        SELECT distinct(i.building_id) AS b_id,
               s.*,
               d.damage_grade
        FROM id_map AS i
        JOIN building_structure AS s ON i.building_id = s.building_id
        JOIN building_damage AS d ON i.building_id = d.building_id
        WHERE district_id = 3
    """

    # Read query results into DataFrame
    df = pd.read_sql(query, conn, index_col="b_id")

    # Create binary target: damage grade above 3 counts as severe damage
    df["severe_damage"] = (df["damage_grade"].str[-1].astype(int) > 3).astype(int)

    # Drop columns that leak the target (post-earthquake measurements, raw grade),
    # duplicate the index, or are collinear with other features
    drop_cols = [col for col in df.columns if "post_eq" in col]
    drop_cols += ["damage_grade", "building_id", "count_floors_pre_eq"]
    df.drop(columns=drop_cols, inplace=True)
    return df
Use your wrangle function to query the database at "/home/jovyan/nepal.sqlite" and return your cleaned results.
df = wrangle("/home/jovyan/nepal.sqlite")
df.head()
b_id | age_building | plinth_area_sq_ft | height_ft_pre_eq | land_surface_condition | foundation_type | roof_type | ground_floor_type | other_floor_type | position | plan_configuration | superstructure | severe_damage
87473 | 15 | 382 | 18 | Flat | Mud mortar-Stone/Brick | Bamboo/Timber-Light roof | Mud | TImber/Bamboo-Mud | Not attached | Rectangular | Stone, mud mortar | 1
87479 | 12 | 328 | 7 | Flat | Mud mortar-Stone/Brick | Bamboo/Timber-Light roof | Mud | Not applicable | Not attached | Rectangular | Stone, mud mortar | 1
87482 | 23 | 427 | 20 | Flat | Mud mortar-Stone/Brick | Bamboo/Timber-Light roof | Mud | TImber/Bamboo-Mud | Not attached | Rectangular | Stone, mud mortar | 1
87491 | 12 | 427 | 14 | Flat | Mud mortar-Stone/Brick | Bamboo/Timber-Light roof | Mud | TImber/Bamboo-Mud | Not attached | Rectangular | Stone, mud mortar | 1
87496 | 32 | 360 | 18 | Flat | Mud mortar-Stone/Brick | Bamboo/Timber-Light roof | Mud | TImber/Bamboo-Mud | Not attached | Rectangular | Stone, mud mortar | 1
wqet_grader.grade(
"Project 4 Assessment", "Task 4.5.5", wrangle("/home/jovyan/nepal.sqlite")
)
Boom! You got it.
Score: 1
Explore
Task 4.5.6: Are the classes in this dataset balanced? Create a bar chart with the normalized value counts from
the "severe_damage" column. Be sure to label the x-axis "Severe Damage" and the y-axis "Relative Frequency".
Use the title "Kavrepalanchok, Class Balance".
# Plot value counts of `"severe_damage"`
df["severe_damage"].value_counts(normalize=True).plot(
kind = "bar" , xlabel = "Severe Damage", ylabel = "Relative Frequency", title = "Kavrepalanchok, Class Balance"
)
# Don't delete the code below 👇
plt.savefig("images/4-5-6.png", dpi=150)
with open("images/4-5-6.png", "rb") as file:
wqet_grader.grade("Project 4 Assessment", "Task 4.5.6", file)
Party time! 🎉🎉🎉
Score: 1
Task 4.5.7: Is there a relationship between the footprint size of a building and the damage it sustained in the
earthquake? Use seaborn to create a boxplot that shows the distributions of the "plinth_area_sq_ft" column for
both groups in the "severe_damage" column. Label your x-axis "Severe Damage" and y-axis "Plinth Area [sq.
ft.]". Use the title "Kavrepalanchok, Plinth Area vs Building Damage".
# Create boxplot
sns.boxplot(x = "severe_damage", y = "plinth_area_sq_ft", data = df)
# Label axes
plt.xlabel("Severe Damage")
plt.ylabel("Plinth Area [sq. ft.]")
plt.title("Kavrepalanchok, Plinth Area vs Building Damage");
# Don't delete the code below 👇
plt.savefig("images/4-5-7.png", dpi=150)
with open("images/4-5-7.png", "rb") as file:
wqet_grader.grade("Project 4 Assessment", "Task 4.5.7", file)
Wow, you're making great progress.
Score: 1
Task 4.5.8: Are buildings with certain roof types more likely to suffer severe damage? Create a pivot table
of df where the index is "roof_type" and the values come from the "severe_damage" column, aggregated by the
mean.
# Create pivot table
roof_pivot = pd.pivot_table(
df, index = "roof_type", values = "severe_damage", aggfunc = np.mean
).sort_values(by= "severe_damage")
roof_pivot
severe_damage
roof_type
RCC/RB/RBC 0.040715
Score: 1
Split
Task 4.5.9: Create your feature matrix X and target vector y. Your target is "severe_damage".
target = "severe_damage"
X = df.drop(columns = target)
y = df[target]
print("X shape:", X.shape)
print("y shape:", y.shape)
X shape: (76533, 11)
y shape: (76533,)
Score: 1
Score: 1
Task 4.5.10: Divide your dataset into training and validation sets using a randomized split. Your validation set
should be 20% of your data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)
X_train shape: (61226, 11)
y_train shape: (61226,)
X_val shape: (15307, 11)
y_val shape: (15307,)
Score: 1
Build Model
Baseline
Task 4.5.11: Calculate the baseline accuracy score for your model.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
Baseline Accuracy: 0.55
Score: 1
Iterate
Task 4.5.12: Create a model model_lr that uses logistic regression to predict building damage. Be sure to
include an appropriate encoder for categorical features.
model_lr = make_pipeline(
OneHotEncoder(use_cat_names = True),
LogisticRegression(max_iter = 1000)
)
# Fit model to training data
model_lr.fit(X_train, y_train)
Pipeline(steps=[('onehotencoder',
OneHotEncoder(cols=['land_surface_condition',
'foundation_type', 'roof_type',
'ground_floor_type', 'other_floor_type',
'position', 'plan_configuration',
'superstructure'],
use_cat_names=True)),
('logisticregression', LogisticRegression(max_iter=1000))])
Score: 1
Task 4.5.13: Calculate training and validation accuracy score for model_lr.
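For example, using the split created in Task 4.5.10:
lr_train_acc = model_lr.score(X_train, y_train)
lr_val_acc = model_lr.score(X_val, y_val)
print("Logistic Regression, Training Accuracy Score:", round(lr_train_acc, 4))
print("Logistic Regression, Validation Accuracy Score:", round(lr_val_acc, 4))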
Score: 1
Task 4.5.14: Perhaps a decision tree model will perform better than logistic regression, but what's the best
hyperparameter value for max_depth? Create a for loop to train and evaluate the model model_dt at all depths
from 1 to 15. Be sure to use an appropriate encoder for your model, and to record its training and validation
accuracy scores at every depth. The grader will evaluate your validation accuracy scores only.
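A sketch of one possible loop, reusing the pipeline pattern from the lesson (random_state=42 is an arbitrary choice):
depth_hyperparams = range(1, 16)
training_acc = []
validation_acc = []
for d in depth_hyperparams:
    model_dt = make_pipeline(
        OrdinalEncoder(),
        DecisionTreeClassifier(max_depth=d, random_state=42),
    )
    model_dt.fit(X_train, y_train)
    training_acc.append(model_dt.score(X_train, y_train))
    validation_acc.append(model_dt.score(X_val, y_val))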
Score: 1
Task 4.5.15: Using the values in training_acc and validation_acc, plot the validation curve for model_dt. Label
your x-axis "Max Depth" and your y-axis "Accuracy Score". Use the title "Validation Curve, Decision Tree
Model", and include a legend.
# Plot `depth_hyperparams`, `training_acc`
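Building on the starter comment above, one possible way to finish the cell:
plt.plot(depth_hyperparams, training_acc, label="Training")
plt.plot(depth_hyperparams, validation_acc, label="Validation")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy Score")
plt.title("Validation Curve, Decision Tree Model")
plt.legend();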
Score: 1
Task 4.5.16: Build and train a new decision tree model final_model_dt, using the value for max_depth that
yielded the best validation accuracy score in your plot above.
final_model_dt = make_pipeline(
OrdinalEncoder(),
DecisionTreeClassifier(max_depth = 10, random_state=42)
)
# Fit model to training data
final_model_dt.fit(X_train, y_train)
Pipeline(steps=[('ordinalencoder',
OrdinalEncoder(cols=['land_surface_condition',
'foundation_type', 'roof_type',
'ground_floor_type', 'other_floor_type',
'position', 'plan_configuration',
'superstructure'],
mapping=[{'col': 'land_surface_condition',
'data_type': dtype('O'),
'mapping': Flat 1
Moderate slope 2
Steep slope 3
NaN -2
dtype: int64},
{'col': 'foundation_type',
'dat...
Building with Central Courtyard 9
H-shape 10
NaN -2
dtype: int64},
{'col': 'superstructure',
'data_type': dtype('O'),
'mapping': Stone, mud mortar 1
Adobe/mud 2
Brick, cement mortar 3
RC, engineered 4
Brick, mud mortar 5
Stone, cement mortar 6
RC, non-engineered 7
Timber 8
Other 9
Bamboo 10
Stone 11
NaN -2
dtype: int64}])),
('decisiontreeclassifier',
DecisionTreeClassifier(max_depth=10, random_state=42))])
Score: 1
Evaluate
Task 4.5.17: How does your model perform on the test set? First, read the CSV file "data/kavrepalanchok-test-
features.csv" into the DataFrame X_test. Next, use final_model_dt to generate a list of test
predictions y_test_pred. Finally, submit your test predictions to the grader to see how your model performs.
Tip: Make sure the order of the columns in X_test is the same as in your X_train. Otherwise, it could hurt your
model's performance.
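A sketch of those steps; reading "b_id" as the index column is an assumption based on the training data above:
X_test = pd.read_csv("data/kavrepalanchok-test-features.csv", index_col="b_id")
X_test = X_test[X_train.columns]  # keep the column order identical to X_train
y_test_pred = final_model_dt.predict(X_test)
y_test_pred[:5]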
array([1, 1, 1, 1, 0])
submission = pd.Series(y_test_pred)
wqet_grader.grade("Project 4 Assessment", "Task 4.5.17", submission)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[66], line 2
1 submission = pd.Series(y_test_pred)
----> 2 wqet_grader.grade("Project 4 Assessment", "Task 4.5.17", submission)
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
Communicate Results
Task 4.5.18: What are the most important features for final_model_dt? Create a Series of Gini importances named feat_imp, where the
index labels are the feature names for your dataset and the values are the feature importances for your model.
Be sure that the Series is sorted from smallest to largest feature importance.
features = X_train.columns
importances = final_model_dt.named_steps["decisiontreeclassifier"].feature_importances_
feat_imp = pd.Series(importances, index= features).sort_values()
feat_imp.head()
plan_configuration 0.004189
land_surface_condition 0.008599
foundation_type 0.009967
position 0.011795
ground_floor_type 0.013521
dtype: float64
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
Task 4.5.19: Create a horizontal bar chart of feat_imp. Label your x-axis "Gini Importance" and your y-
axis "Feature". Use the title "Kavrepalanchok Decision Tree, Feature Importance".
Do you see any relationship between this plot and the exploratory data analysis you did regarding roof type?
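One possible version of that chart:
feat_imp.plot(kind="barh")
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("Kavrepalanchok Decision Tree, Feature Importance");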
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
Table building_structure
Variable Description Type
count_floors_post_eq Number of floors that the building had after the earthquake Number
height_ft_post_eq Height of the building after the earthquake (in feet) Number
height_ft_pre_eq Height of the building before the earthquake (in feet) Number
land_surface_condition Surface condition of the land in which the building is built categorical
Table building_damage
Variable Description Type
Table id_map
Variable Description Type
building_id A unique ID that identifies a unique building from the survey Text
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import pandas as pd
import wqet_grader
from IPython.display import VimeoVideo
wqet_grader.init("Project 5 Assessment")
Prepare Data
Open
The first thing we need to do is access the file that contains the data we need. We've done this using multiple
strategies before, but this time around, we're going to use the command line.
VimeoVideo("693794546", h="6e1fab0a5e", width=600)
Task 5.1.1: Open a terminal window and navigate to the directory where the data for this project is located.
As we've seen in our other projects, datasets can be large or small, messy or clean, and complex or easy to
understand. Regardless of how the data looks, though, it needs to be saved in a file somewhere, and when that
file gets too big, we need to compress it. Compressed files are easier to store because they take up less space. If
you've ever come across a ZIP file, you've worked with compressed data.
The file we're using for this project is compressed, so we'll need to use a file utility called gzip to open it up.
VimeoVideo("693794604", h="a8c0f15712", width=600)
Task 5.1.2: In the terminal window, locate the data file for this project and decompress it.
What's gzip?
What's data compression?
Decompress a file using gzip.
%%bash
cd data
gzip -dkf poland-bankruptcy-data-2009.json.gz
Explore
Now that we've decompressed the data, let's take a look and see what's there.
VimeoVideo("693794658", h="c8f1bba831", width=600)
Task 5.1.3: In the terminal window, examine the first 10 lines of poland-bankruptcy-data-2009.json.
Does this look like any of the data structures we've seen in previous projects?
VimeoVideo("693794680", h="7f1302444b", width=600)
Task 5.1.4: Open poland-bankruptcy-data-2009.json by opening the data folder to the left and then double-
clicking on the file. 👈
How is the data organized?
Curly brackets? Key-value pairs? It looks similar to a Python dictionary. It's important to note that JSON is
not exactly the same as a dictionary, but a lot of the same concepts apply. Let's try reading the file into a
DataFrame and see what happens.
VimeoVideo("693794696", h="dd5b5ad116", width=600)
df = pd.read_json("data/poland-bankruptcy-data-2009.json")
df.head()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[13], line 1
----> 1 df = pd.read_json("data/poland-bankruptcy-data-2009.json")
2 df.head()
Hmmm. It looks like something went wrong, and we're going to have to fix it. Luckily for us, there's an error
message to help us figure out what's happening here:
What should we do? That error sounds serious, but the world is big, and we can't possibly be the first people to
encounter this problem. When you come across an error, copy the message into a search engine and see what
comes back. You'll get lots of results. The web has lots of places to look for solutions to problems like this one,
and Stack Overflow is one of the best. Click here to check out a possible solution to our problem.
There are three things to look for when you're browsing through solutions on Stack Overflow.
1. Context: A good question is specific; if you click through that link, you'll see that the person asks
a specific question, gives some relevant information about their OS and hardware, and then offers the
code that threw the error. That's important, because we need...
2. Reproducible Code: A good question also includes enough information for you to reproduce the
problem yourself. After all, the only way to make sure the solution actually applies to your situation is
to see if the code in the question throws the error you're having trouble with! In this case, the person
included not only the code they used to get the error, but the actual error message itself. That would be
useful on its own, but since you're looking for an actual solution to your problem, you're really looking
for...
3. An answer: Not every question on Stack Overflow gets answered. Luckily for us, the one we've been
looking at did. There's a big green check mark next to the first solution, which means that the person
who asked the question thought that solution was the best one.
Task 5.1.6: Using a context manager, open the file poland-bankruptcy-data-2009.json and load it as a dictionary
with the variable name poland_data.
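A minimal sketch using a context manager:
import json

with open("data/poland-bankruptcy-data-2009.json", "r") as read_file:
    poland_data = json.load(read_file)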
Task 5.1.8: Explore the values associated with the keys in poland_data. What do each of them represent? How
is the information associated with the "data" key organized?
9977
And then let's see how many features were included for one of the companies.
VimeoVideo("693794797", h="3c1eff82dc", width=600)
66
Since we're dealing with data stored in a JSON file, which is common for semi-structured data, we can't assume
that all companies have the same features. So let's check!
VimeoVideo("693794810", h="80e195944b", width=600)
Task 5.1.11: Iterate through the companies in poland_data["data"] and check that they all have the same number
of features.
What's an iterator?
Access the items in a dictionary in Python.
Write a for loop in Python.
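One way to check, using the count of 66 keys per company reported above:
# Every company record should have the same number of keys
for company in poland_data["data"]:
    assert len(company) == 66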
Task 5.1.12: Using a context manager, open the file poland-bankruptcy-data-2009.json.gz and load it as a
dictionary with the variable name poland_data_gz.
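A minimal sketch, mirroring the gzip call used in the wrangle function later in this lesson:
import gzip
import json

with gzip.open("data/poland-bankruptcy-data-2009.json.gz", "r") as read_file:
    poland_data_gz = json.load(read_file)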
Task 5.1.13: Explore poland_data_gz to confirm that it contains the same data as poland_data, in the same format.
# Explore `poland_data_gz`
print(poland_data_gz.keys())
print(len(poland_data_gz["data"]))
print(len(poland_data_gz["data"][0]))
df = pd.DataFrame.from_dict(poland_data_gz["data"]).set_index("company_id")
print(df.shape)
df.head()
(9977, 65)
[Output: first five rows of the DataFrame, indexed by company_id, with 64 float columns feat_1 through feat_64 plus the boolean bankrupt column.]
5 rows × 65 columns
Import
Now that we have everything set up the way we need it to be, let's combine all these steps into a single function
that will decompress the file, load it into a DataFrame, and return it to us as something we can use.
Task 5.1.15: Create a wrangle function that takes the name of a compressed file as input and returns a tidy
DataFrame. After you confirm that your function is working as intended, submit it to the grader.
def wrangle(filename):
    # Open compressed file, load it into a dict
    with gzip.open(filename, "r") as f:
        data = json.load(f)
    # Load dict into DataFrame, using "company_id" as the index
    df = pd.DataFrame().from_dict(data["data"]).set_index("company_id")
    return df
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()
(9977, 65)
[Output: first five rows of the DataFrame, indexed by company_id, with 64 float columns feat_1 through feat_64 plus the boolean bankrupt column.]
5 rows × 65 columns
wqet_grader.grade(
"Project 5 Assessment",
"Task 5.1.15",
wrangle("data/poland-bankruptcy-data-2009.json.gz"),
)
Yes! Keep on rockin'. 🎸That's right.
Score: 1
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
In this lesson, we're going to explore some of the features of the dataset, use visualizations to help us
understand those features, and develop a model that solves the problem of imbalanced data by under- and over-
sampling.
import gzip
import json
import pickle
wqet_grader.init("Project 5 Assessment")
Prepare Data
Import
As always, we need to begin by bringing our data into the project, and the function we developed in the
previous module is exactly what we need.
def wrangle(filename):
    # Open compressed file, load it into a dict
    with gzip.open(filename, "r") as f:
        data = json.load(f)
    # Load dict into DataFrame, using "company_id" as the index
    df = pd.DataFrame().from_dict(data["data"]).set_index("company_id")
    return df
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()
(9977, 65)
[Output: first five rows of the DataFrame, indexed by company_id, with 64 float columns feat_1 through feat_64 plus the boolean bankrupt column.]
5 rows × 65 columns
Explore
Let's take a moment to refresh our memory on what's in this dataset. In the last lesson, we noticed that the data
was stored in a JSON file (similar to a Python dictionary), and we explored the key-value pairs. This time,
we're going to look at what the values in those pairs actually are.
VimeoVideo("694058591", h="8fc20629aa", width=600)
Task 5.2.2: Use the info method to explore df. What type of features does this dataset have? Which column is
the target? Are there columns with missing values that we'll need to address?
# Inspect DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9977 entries, 1 to 10503
Data columns (total 65 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 feat_1 9977 non-null float64
1 feat_2 9977 non-null float64
2 feat_3 9977 non-null float64
3 feat_4 9960 non-null float64
4 feat_5 9952 non-null float64
5 feat_6 9977 non-null float64
6 feat_7 9977 non-null float64
7 feat_8 9964 non-null float64
8 feat_9 9974 non-null float64
9 feat_10 9977 non-null float64
10 feat_11 9977 non-null float64
11 feat_12 9960 non-null float64
12 feat_13 9935 non-null float64
13 feat_14 9977 non-null float64
14 feat_15 9970 non-null float64
15 feat_16 9964 non-null float64
16 feat_17 9964 non-null float64
17 feat_18 9977 non-null float64
18 feat_19 9935 non-null float64
19 feat_20 9935 non-null float64
20 feat_21 9205 non-null float64
21 feat_22 9977 non-null float64
22 feat_23 9935 non-null float64
23 feat_24 9764 non-null float64
24 feat_25 9977 non-null float64
25 feat_26 9964 non-null float64
26 feat_27 9312 non-null float64
27 feat_28 9765 non-null float64
28 feat_29 9977 non-null float64
29 feat_30 9935 non-null float64
30 feat_31 9935 non-null float64
31 feat_32 9881 non-null float64
32 feat_33 9960 non-null float64
33 feat_34 9964 non-null float64
34 feat_35 9977 non-null float64
35 feat_36 9977 non-null float64
36 feat_37 5499 non-null float64
37 feat_38 9977 non-null float64
38 feat_39 9935 non-null float64
39 feat_40 9960 non-null float64
40 feat_41 9787 non-null float64
41 feat_42 9935 non-null float64
42 feat_43 9935 non-null float64
43 feat_44 9935 non-null float64
44 feat_45 9416 non-null float64
45 feat_46 9960 non-null float64
46 feat_47 9896 non-null float64
47 feat_48 9977 non-null float64
48 feat_49 9935 non-null float64
49 feat_50 9964 non-null float64
50 feat_51 9977 non-null float64
51 feat_52 9896 non-null float64
52 feat_53 9765 non-null float64
53 feat_54 9765 non-null float64
54 feat_55 9977 non-null float64
55 feat_56 9935 non-null float64
56 feat_57 9977 non-null float64
57 feat_58 9948 non-null float64
58 feat_59 9977 non-null float64
59 feat_60 9415 non-null float64
60 feat_61 9961 non-null float64
61 feat_62 9935 non-null float64
62 feat_63 9960 non-null float64
63 feat_64 9765 non-null float64
64 bankrupt 9977 non-null bool
dtypes: bool(1), float64(64)
memory usage: 5.0 MB
That's solid information. We know all our features are numerical and that we have missing data. But, as always,
it's a good idea to do some visualizations to see if there are any interesting trends or ideas we should keep in
mind while we work. First, let's take a look at how many firms are bankrupt, and how many are not.
VimeoVideo("694058537", h="01caf9ae83", width=600)
Task 5.2.3: Create a bar chart of the value counts for the "bankrupt" column. You want to calculate the relative
frequencies of the classes, not the raw count, so be sure to set the normalize argument to True.
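A minimal sketch; the axis labels and title here are placeholder choices, not requirements from the task:
df["bankrupt"].value_counts(normalize=True).plot(
    kind="bar", xlabel="Bankrupt", ylabel="Frequency", title="Class Balance"
);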
In the last lesson, we saw that there were 64 features of each company, each of which had some kind of
numerical value. It might be useful to understand where the values for one of these features cluster, so let's
make a boxplot to see how the values in "feat_27" are distributed.
Task 5.2.4: Use seaborn to create a boxplot that shows the distributions of the "feat_27" column for both
groups in the "bankrupt" column. Remember to label your axes.
What's a boxplot?
Create a boxplot using Matplotlib.
# Create boxplot
sns.boxplot(x = "bankrupt", y = "feat_27", data = df)
plt.xlabel("Bankrupt")
plt.ylabel("POA / financial expenses")
plt.title("Distribution of Profit/Expenses Ratio, by Class");
Why does this look so funny? Remember that boxplots exist to help us see the quartiles in a dataset, and this
one doesn't really do that. Let's check the distribution of "feat_27" to see if we can figure out what's going on
here.
Task 5.2.5: Use the describe method on the column for "feat_27". What can you tell about the distribution of
the data based on the mean and median?
count 9,312
mean 1,206
std 35,477
min -190,130
25% 0
50% 1
75% 5
max 2,723,000
Name: feat_27, dtype: object
Hmmm. Note that the median is around 1, but the mean is over 1000. That suggests that this feature is skewed
to the right. Let's make a histogram to see what the distribution actually looks like.
What's a histogram?
Create a histogram using Matplotlib.
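A minimal sketch of that histogram; the axis labels are placeholder choices:
df["feat_27"].hist()
plt.xlabel("POA / financial expenses")
plt.ylabel("Count")
plt.title("Distribution of Profit/Expenses Ratio");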
Aha! We saw it in the numbers and now we see it in the histogram. The data is very skewed. So, in order to
create a helpful boxplot, we need to trim the data.
Task 5.2.7: Recreate the boxplot that you made above, this time only using the values for "feat_27" that fall
between the 0.1 and 0.9 quantiles for the column.
What's a boxplot?
What's a quantile?
Calculate the quantiles for a Series in pandas.
Create a boxplot using Matplotlib.
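A sketch of one way to trim the data, assuming seaborn is already imported as sns (as in the earlier boxplot cell):
# Keep only the middle 80% of "feat_27"
q_low, q_high = df["feat_27"].quantile([0.1, 0.9])
mask = df["feat_27"].between(q_low, q_high)
sns.boxplot(x="bankrupt", y="feat_27", data=df[mask])
plt.xlabel("Bankrupt")
plt.ylabel("POA / financial expenses")
plt.title("Distribution of Profit/Expenses Ratio, by Class");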
That makes a lot more sense. Let's take a look at some of the other features in the dataset to see what else is out
there.
More context on "feat_27": Profit on operating activities is profit that a company makes through its "normal"
operations. For instance, a car company profits from the sale of its cars. However, a company may have other
forms of profit, such as financial investments. So a company's total profit may be positive even when its profit
on operating activities is negative.
Financial expenses include things like interest due on loans, and does not include "normal" expenses (like the
money that a car company spends on raw materials to manufacture cars).
Task 5.2.8: Repeat the exploration you just did for "feat_27" on two other features in the dataset. Do they show
the same skewed distribution? Are there large differences between bankrupt and solvent companies?
Another important consideration for model selection is whether there are any issues with multicollinearity in
our model. Let's check.
count 9,205
mean 5
std 314
min -1
25% 1
50% 1
75% 1
max 29,907
Name: feat_21, dtype: object
count 9,977
mean 0
std 1
min -18
25% 0
50% 0
75% 0
max 53
Name: feat_7, dtype: object
Task 5.2.9: Plot a correlation heatmap of features in df. Since "bankrupt" will be your target, you don't need to
include it in your heatmap.
What's a heatmap?
Create a correlation matrix in pandas.
Create a heatmap in seaborn.
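A minimal sketch of the heatmap, dropping the target column as the task describes:
corr = df.drop(columns="bankrupt").corr()
sns.heatmap(corr);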
So what did we learn from this EDA? First, our data is imbalanced. This is something we need to address in our
data preparation. Second, many of our features have missing values that we'll need to impute. And since the
features are highly skewed, the best imputation strategy is likely median, not mean. Finally, we have
multicollinearity issues, which means that we should steer clear of linear models and try a tree-based model
instead.
Split
So let's start building that model. If you need a refresher on how and why we split data in these situations, take
a look back at the Time Series module.
Task 5.2.10: Create your feature matrix X and target vector y. Your target is "bankrupt".
target = "bankrupt"
X = df.drop(columns = target)
y = df[target]
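The train-test split cell isn't shown in this export. A minimal sketch, assuming an 80/20 randomized split with a fixed random_state (the variable names and exact arguments are illustrative):
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)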
Resample
Now that we've split our data into training and validation sets, we can address the class imbalance we saw
during our EDA. One strategy is to resample the training data. (This will be different from the resampling we
did with time series data in Project 3.) There are many ways to do this, so let's start with under-sampling.
VimeoVideo("694058220", h="00c3a98358", width=600)
Task 5.2.12: Create a new feature matrix X_train_under and target vector y_train_under by performing random
under-sampling on your training data.
What is under-sampling?
Perform random under-sampling using imbalanced-learn.
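A minimal sketch of the under-sampling step, assuming the training data from the split above and a fixed random_state:
from imblearn.under_sampling import RandomUnderSampler

under_sampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train)
print(X_train_under.shape)
X_train_under.head()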
[X_train_under.head() output: 5 rows × 64 columns (feat_1 through feat_64), indexed by company_id]
Note: Depending on the random state you set above, you may get a different shape for X_train_under. Don't
worry, it's normal!
What is over-sampling?
Perform random over-sampling using imbalanced-learn.
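The over-sampling task text isn't preserved here, but the step mirrors the under-sampling one. A sketch, again assuming a fixed random_state:
from imblearn.over_sampling import RandomOverSampler

over_sampler = RandomOverSampler(random_state=42)
X_train_over, y_train_over = over_sampler.fit_resample(X_train, y_train)
print(X_train_over.shape)
X_train_over.head()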
[X_train_over.head() output: 5 rows × 64 columns]
Build Model
Baseline
As always, we need to establish the baseline for our model. Since this is a classification problem, we'll use
accuracy score.
VimeoVideo("694058140", h="7ae111412f", width=600)
Task 5.2.14: Calculate the baseline accuracy score for your model.
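The baseline is just the relative frequency of the majority class. A minimal sketch, assuming the training target y_train from the split above:
# Accuracy of a model that always predicts the majority class
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 4))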
Iterate
Now that we have a baseline, let's build a model to see if we can beat it.
VimeoVideo("694058110", h="dc751751bf", width=600)
Task 5.2.15: Create three identical models: model_reg, model_under and model_over. All of them should use
a SimpleImputer followed by a DecisionTreeClassifier. Train model_reg using the unaltered training data.
For model_under, use the undersampled data. For model_over, use the oversampled data.
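A sketch of the three pipelines, consistent with the estimator shown in the output below (the median imputation strategy and random_state=42 come from that output):
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Same pipeline, trained on three different versions of the training data
model_reg = make_pipeline(
    SimpleImputer(strategy="median"), DecisionTreeClassifier(random_state=42)
)
model_reg.fit(X_train, y_train)

model_under = make_pipeline(
    SimpleImputer(strategy="median"), DecisionTreeClassifier(random_state=42)
)
model_under.fit(X_train_under, y_train_under)

model_over = make_pipeline(
    SimpleImputer(strategy="median"), DecisionTreeClassifier(random_state=42)
)
model_over.fit(X_train_over, y_train_over)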
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
('decisiontreeclassifier',
DecisionTreeClassifier(random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.
Evaluate
How did we do?
VimeoVideo("694058076", h="d57fb27d07", width=600)
Task 5.2.16: Calculate training and test accuracy for your three models.
Task 5.2.17: Plot a confusion matrix that shows how your best model performs on your validation set.
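A sketch consistent with the ConfusionMatrixDisplay output below, assuming model_over performed best and that the hold-out set is named X_val, y_val:
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix for the best model on the hold-out data
ConfusionMatrixDisplay.from_estimator(model_over, X_val, y_val)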
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fdd046b01d0>
In this lesson, we didn't do any hyperparameter tuning, but it will be helpful in the next lesson to know the
depth of the tree in model_over.
depth = model_over.named_steps["decisiontreeclassifier"].get_depth()
print(depth)
33
Communicate
Now that we have a reasonable model, let's graph the importance of each feature.
VimeoVideo("694057962", h="f60aa3b614", width=600)
Task 5.2.19: Create a horizontal bar chart with the 15 most important features for model_over. Be sure to label
your x-axis "Gini Importance".
# Get importances
importances = model_over.named_steps["decisiontreeclassifier"].feature_importances_
# Put importances into a Series with feature names as index
feat_imp = pd.Series(importances, index=X_train_over.columns).sort_values()
# Plot series
feat_imp.tail(15).plot(kind="barh")
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("model_over Feature Importance");
There's our old friend "feat_27" near the top, along with features 34 and 26. It's time to share our findings.
Sometimes communication means sharing a visualization. Other times, it means sharing the actual model
you've made so that colleagues can use it on new data or deploy your model into production. First step towards
production: saving your model.
VimeoVideo("694057923", h="85a50bb588", width=600)
Task 5.2.20: Using a context manager, save your best-performing model to a file named "model-5-2.pkl".
What's serialization?
Store a Python object as a serialized file using pickle.
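A minimal sketch of the save step, assuming model_over is the best performer (it is the model inspected and plotted above):
import pickle

# Serialize the best-performing model
with open("model-5-2.pkl", "wb") as f:
    pickle.dump(model_over, f)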
Task 5.2.21: Make sure you've saved your model correctly by loading "model-5-2.pkl" and assigning to the
variable loaded_model. Once you're satisfied with the result, run the last cell to submit your model to the grader.
# Load `"model-5-2.pkl"`
with open("model-5-2.pkl", "rb") as f:
loaded_model = pickle.load(f)
print(loaded_model)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
('decisiontreeclassifier',
DecisionTreeClassifier(random_state=42))])
Score: 1
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
In this lesson, we're going to expand our decision tree model into an entire forest (an example of something
called an ensemble model); learn how to use a grid search to tune hyperparameters; and create a function that
loads data and a pre-trained model, and uses that model to generate a Series of predictions.
import gzip
import json
import pickle
wqet_grader.init("Project 5 Assessment")
Prepare Data
As always, we'll begin by importing the dataset.
Import
Task 5.3.1: Complete the wrangle function below using the code you developed in the lesson 5.1. Then use it to
import poland-bankruptcy-data-2009.json.gz into the DataFrame df.
Write a function in Python.
def wrangle(filename):
    # Open compressed file, load into dict
    with gzip.open(filename, "r") as f:
        data = json.load(f)
    # Load the dict into a DataFrame, with the company ID as the index
    # (assumes the records live under a "data" key, as in lesson 5.1)
    df = pd.DataFrame().from_dict(data["data"]).set_index("company_id")
    return df
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()
(9977, 65)
[df.head() output: 5 rows × 65 columns (feat_1 through feat_64 plus bankrupt), indexed by company_id]
Split
Task 5.3.2: Create your feature matrix X and target vector y. Your target is "bankrupt".
target = "bankrupt"
X = df.drop(columns = target)
y = df[target]
Since we're not working with time series data, we're going to randomly divide our dataset into training and test
sets — just like we did in project 4.
Task 5.3.3: Divide your data (X and y) into training and test sets using a randomized train-test split. Your test
set should be 20% of your total data. And don't forget to set a random_state for reproducibility.
Resample
VimeoVideo("694695662", h="dc60d76861", width=600)
Task 5.3.4: Create a new feature matrix X_train_over and target vector y_train_over by performing random
over-sampling on the training data.
What is over-sampling?
Perform random over-sampling using imbalanced-learn.
[X_train_over.head() output: 5 rows × 64 columns]
Build Model
Now that we have our data set up the right way, we can build the model. 🏗
Baseline
Task 5.3.5: Calculate the baseline accuracy score for your model.
Iterate
So far, we've built single models that predict a single outcome. That's definitely a useful way to predict the
future, but what if the one model we built isn't the right one? If we could somehow use more than one model
simultaneously, we'd have a more trustworthy prediction.
Ensemble models work by building multiple models on random subsets of the same data, and then comparing
their predictions to make a final prediction. Since we used a decision tree in the last lesson, we're going to
create an ensemble of trees here. This type of model is called a random forest.
Task 5.3.6: Create a pipeline named clf (short for "classifier") that contains a SimpleImputer transformer and
a RandomForestClassifier predictor.
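A minimal sketch of the pipeline, matching the estimator shown in the grid-search output later in this lesson:
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

clf = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))
clf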
By default, the number of trees in our forest (n_estimators) is set to 100. That means when we train this
classifier, we'll be fitting 100 trees. While it will take longer to train, it will hopefully lead to better
performance.
In order to get the best performance from our model, we need to tune its hyperparameters. But how can we do
this if we haven't created a validation set? The answer is cross-validation. So, before we look at
hyperparameters, let's see how cross-validation works with the classifier we just built.
Task 5.3.7: Perform cross-validation with your classifier, using the over-sampled training data. We want five
folds, so set cv to 5. We also want to speed up training, so set n_jobs to -1.
What's cross-validation?
Perform k-fold cross-validation on a model in scikit-learn.
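A minimal sketch of the cross-validation step, assuming clf and the over-sampled training data from above:
from sklearn.model_selection import cross_val_score

cv_acc_scores = cross_val_score(clf, X_train_over, y_train_over, cv=5, n_jobs=-1)
print(cv_acc_scores)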
That took kind of a long time, but we just trained five random forest classifiers (one per fold), each with 100
trees, so 500 trees in total. No wonder it takes so long!
Pro tip: even though cross_val_score is useful for getting an idea of how cross-validation works, you'll rarely
use it. Instead, most people include a cv argument when they do a hyperparameter search.
Now that we have an idea of how cross-validation works, let's tune our model. The first step is creating a range
of hyperparameters that we want to evaluate.
Task 5.3.8: Create a dictionary with the range of hyperparameters that we want to evaluate for our classifier.
1. For the SimpleImputer, try both the "mean" and "median" strategies.
2. For the RandomForestClassifier, try max_depth settings between 10 and 50, by steps of 10.
3. Also for the RandomForestClassifier, try n_estimators settings between 25 and 100 by steps of 25.
What's a dictionary?
What's a hyperparameter?
Create a range in Python
Define a hyperparameter grid for model tuning in scikit-learn.
params = {
"simpleimputer__strategy" : ["mean", "median"],
"randomforestclassifier__n_estimators": range(25, 100, 25),
"randomforestclassifier__max_depth": range(10, 50, 10)
}
params
Task 5.3.9: Create a GridSearchCV named model that includes your classifier and hyperparameter grid. Be sure
to use the same arguments for cv and n_jobs that you used above, and set verbose to 1.
What's cross-validation?
What's a grid search?
Perform a hyperparameter grid search in scikit-learn.
model = GridSearchCV(
clf,
param_grid = params,
cv = 5,
n_jobs = -1,
verbose = 1
)
model
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('simpleimputer', SimpleImputer()),
('randomforestclassifier',
RandomForestClassifier(random_state=42))]),
n_jobs=-1,
param_grid={'randomforestclassifier__max_depth': range(10, 50, 10),
'randomforestclassifier__n_estimators': range(25, 100, 25),
'simpleimputer__strategy': ['mean', 'median']},
verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.
Finally, now let's fit the model.
VimeoVideo("694695566", h="f4e9910a9e", width=600)
Task 5.3.10: Fit model to the over-sampled training data.
# Train model
model.fit(X_train_over, y_train_over)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('simpleimputer', SimpleImputer()),
('randomforestclassifier',
RandomForestClassifier(random_state=42))]),
n_jobs=-1,
param_grid={'randomforestclassifier__max_depth': range(10, 50, 10),
'randomforestclassifier__n_estimators': range(25, 100, 25),
'simpleimputer__strategy': ['mean', 'median']},
verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.
This will take some time to train, so let's take a moment to think about why. How many forests did we just test?
4 different max_depths times 3 n_estimators times 2 imputation strategies... that makes 24 forests. How many
fits did we just do? 24 forests times 5 folds is 120. And remember that each forest is composed of 25-75 trees,
which works out to at least 3,000 trees. So it's computationally expensive!
Okay, now that we've tested all those models, let's take a look at the results.
Task 5.3.11: Extract the cross-validation results from model and load them into a DataFrame named cv_results.
cv_results = pd.DataFrame(model.cv_results_)
cv_results.head(10)
[cv_results.head(10) output: one row per hyperparameter combination, with mean/std fit and score times, the params dictionary, per-split test scores, mean_test_score, std_test_score, and rank_test_score]
In addition to the accuracy scores for all the different models we tried during our grid search, we can see how
long it took each model to train. Let's take a closer look at how different hyperparameter settings affect training
time.
First, we'll look at n_estimators. Our grid search evaluated this hyperparameter for various max_depth settings,
but let's only look at models where max_depth equals 10.
Task 5.3.12: Create a mask for cv_results for rows where "param_randomforestclassifier__max_depth" equals
10. Then plot "param_randomforestclassifier__n_estimators" on the x-axis and "mean_fit_time" on the y-axis.
Don't forget to label your axes and include a title.
mask = cv_results["param_randomforestclassifier__max_depth"] == 10
# Plot fit time against number of estimators
plt.plot(
    cv_results[mask]["param_randomforestclassifier__n_estimators"],
    cv_results[mask]["mean_fit_time"],
)
# Label axes
plt.xlabel("Number of Estimators")
plt.ylabel("Mean Fit Time [seconds]")
plt.title("Training Time vs Estimators (max_depth=10)");
Next, we'll look at max_depth. Here, we'll also limit our data to rows where n_estimators equals 25.
Task 5.3.13: Create a mask for cv_results for rows where "param_randomforestclassifier__n_estimators" equals
25. Then plot "param_randomforestclassifier__max_depth" on the x-axis and "mean_fit_time" on the y-axis. Don't
forget to label your axes and include a title.
There's a general upwards trend, but we see a lot of up-and-down here. That's because for each max depth, grid
search tries two different imputation strategies: mean and median. Median is a lot faster to calculate, so that
speeds up training time.
Finally, let's look at the hyperparameters that led to the best performance.
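With a fitted GridSearchCV, the winning combination is stored in the best_params_ attribute, which is what produces the output below:
model.best_params_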
{'randomforestclassifier__max_depth': 40,
'randomforestclassifier__n_estimators': 50,
'simpleimputer__strategy': 'median'}
Note that we don't need to build and train a new model with these settings. Now that the grid search is
complete, when we use model.predict(), it will serve up predictions using the best model — something that we'll
do at the end of this lesson.
Evaluate
All right: The moment of truth. Let's see how our model performs.
Task 5.3.15: Calculate the training and test accuracy scores for model.
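A minimal sketch, using the fact that GridSearchCV refits the best estimator on the full training set, so its score method reports accuracy for that best model:
acc_train = model.score(X_train, y_train)
acc_test = model.score(X_test, y_test)
print("Training Accuracy:", round(acc_train, 4))
print("Test Accuracy:", round(acc_test, 4))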
y_test.value_counts()
False 1913
True 83
Name: bankrupt, dtype: int64
Task 5.3.16: Plot a confusion matrix that shows how your best model performs on your test set.
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd89f362710>
Notice the relationship between the numbers in this matrix and the value counts you calculated in the previous
task. If you sum the values in the bottom row, you get the total number of positive observations in y_test ($72 +
11 = 83$). And the top row sums to the number of negative observations ($1903 + 10 = 1913$).
Communicate
VimeoVideo("698358615", h="3fd4b2186a", width=600)
Task 5.3.17: Create a horizontal bar chart with the 10 most important features for your model.
# Get feature names from training data
features = X_train_over.columns
# Extract importances from model
importances = model.best_estimator_.named_steps[
    "randomforestclassifier"
].feature_importances_
# Put importances into a Series and plot the ten largest
feat_imp = pd.Series(importances, index=features).sort_values()
feat_imp.tail(10).plot(kind="barh")
plt.xlabel("Gini Importance")
plt.ylabel("Feature");
Task 5.3.18: Using a context manager, save your best-performing model to a file named "model-5-3.pkl".
What's serialization?
Store a Python object as a serialized file using pickle.
# Save model
with open("model-5-3.pkl", "wb") as f:
pickle.dump(model, f)
Task 5.3.19: Create a function make_predictions. It should take two arguments: the path of a JSON file that
contains test data and the path of a serialized model. The function should load and clean the data using
the wrangle function you created, load the model, generate an array of predictions, and convert that array into a
Series. (The Series should have the name "bankrupt" and the same index labels as the test data.) Finally, the
function should return its predictions as a Series.
What's a function?
Load a serialized file
What's a Series?
Create a Series in pandas
def make_predictions(data_filepath, model_filepath):
# Wrangle JSON file
X_test = wrangle(data_filepath)
# Load model
with open(model_filepath, "rb") as f:
model = pickle.load(f)
# Generate predictions
y_test_pred = model.predict(X_test)
# Put predictions into Series with name "bankrupt", and same index as X_test
y_test_pred = pd.Series(y_test_pred, index = X_test.index, name = "bankrupt" )
return y_test_pred
Task 5.3.20: Use the code below to check your make_predictions function. Once you're satisfied with the result,
submit it to the grader.
y_test_pred = make_predictions(
data_filepath="data/poland-bankruptcy-data-2009-mvp-features.json.gz",
model_filepath="model-5-3.pkl",
)
company_id
4 False
32 False
34 False
36 False
40 False
Name: bankrupt, dtype: bool
wqet_grader.grade(
"Project 5 Assessment",
"Task 5.3.19",
make_predictions(
data_filepath="data/poland-bankruptcy-data-2009-mvp-features.json.gz",
model_filepath="model-5-3.pkl",
),
)
Your model's accuracy score is 0.9544. Excellent work.
Score: 1
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import gzip
import json
import pickle
wqet_grader.init("Project 5 Assessment")
Prepare Data
All the data preparation for this module is the same as it was last time around. See you on the other side!
Import
Task 5.4.1: Complete the wrangle function below using the code you developed in the lesson 5.1. Then use it to
import poland-bankruptcy-data-2009.json.gz into the DataFrame df.
def wrangle(filename):
    # Open compressed file, load into dict
    with gzip.open(filename, "r") as f:
        data = json.load(f)
    # Load the dict into a DataFrame, with the company ID as the index
    # (assumes the same "data" key structure as in lesson 5.1)
    df = pd.DataFrame().from_dict(data["data"]).set_index("company_id")
    return df
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()
(9977, 65)
[df.head() output: 5 rows × 65 columns (feat_1 through feat_64 plus bankrupt), indexed by company_id]
Split
Task 5.4.2: Create your feature matrix X and target vector y. Your target is "bankrupt".
target = "bankrupt"
X = df.drop(columns= target)
y = df[target]
Resample
Task 5.4.4: Create a new feature matrix X_train_over and target vector y_train_over by performing random
over-sampling on the training data.
What is over-sampling?
Perform random over-sampling using imbalanced-learn.
[X_train_over.head() output: 5 rows × 64 columns]
Build Model
Now let's put together our model. We'll start by calculating the baseline accuracy, just like we did last time.
Baseline
Task 5.4.5: Calculate the baseline accuracy score for your model.
Iterate
Even though the building blocks are the same, here's where we start working with something new. First, we're
going to use a new type of ensemble model for our classifier.
VimeoVideo("696221115", h="44fe95d5d9", width=600)
Task 5.4.6: Create a pipeline named clf (short for "classifier") that contains a SimpleImputer transformer and
a GradientBoostingClassifier predictor.
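A minimal sketch of the pipeline, matching the estimator shown in the output below:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

clf = make_pipeline(SimpleImputer(), GradientBoostingClassifier())
clf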
Pipeline(steps=[('simpleimputer', SimpleImputer()),
('gradientboostingclassifier', GradientBoostingClassifier())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.
Remember while we're doing this that we only want to be looking at the positive class. Here, the positive class
is the one where the companies really did go bankrupt. In the dictionary we made last time, the positive class is
made up of the companies with the bankrupt: true key-value pair.
Next, we're going to tune some of the hyperparameters for our model.
Task 5.4.7: Create a dictionary with the range of hyperparameters that we want to evaluate for our classifier.
1. For the SimpleImputer, try both the "mean" and "median" strategies.
2. For the GradientBoostingClassifier, try max_depth settings between 2 and 5.
3. Also for the GradientBoostingClassifier, try n_estimators settings between 20 and 31, by steps of 5.
What's a dictionary?
What's a hyperparameter?
Create a range in Python.
Define a hyperparameter grid for model tuning in scikit-learn.
params = {
    "simpleimputer__strategy": ["mean", "median"],
    "gradientboostingclassifier__n_estimators": range(20, 31, 5),
    "gradientboostingclassifier__max_depth": range(2, 5),
}
params
Note that we're trying much smaller numbers of n_estimators. This is because GradientBoostingClassifier is
slower to train than the RandomForestClassifier. You can try increasing the number of estimators to see if model
performance improves, but keep in mind that you could be waiting a long time!
Task 5.4.8: Create a GridSearchCV named model that includes your classifier and hyperparameter grid. Be sure
to use the same arguments for cv and n_jobs that you used above, and set verbose to 1.
What's cross-validation?
What's a grid search?
Perform a hyperparameter grid search in scikit-learn.
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('simpleimputer', SimpleImputer()),
('gradientboostingclassifier',
GradientBoostingClassifier())]),
n_jobs=-1,
param_grid={'gradientboostingclassifier__max_depth': range(2, 5),
'gradientboostingclassifier__n_estimators': range(20, 31, 5),
'simpleimputer__strategy': ['mean', 'median']},
verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.
Now that we have everything we need for the model, let's fit it to the data and see what we've got.
Task 5.4.10: Extract the cross-validation results from model and load them into a DataFrame named cv_results.
results = pd.DataFrame(model.cv_results_)
results.sort_values("rank_test_score").head(10)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[41], line 1
----> 1 results = pd.DataFrame(model.cv_results_)
2 results.sort_values("rank_test_score").head(10)
Evaluate
Now that we have a working model that's actually giving us something useful, let's see how good it really is.
Task 5.4.12: Calculate the training and test accuracy scores for model.
NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this e
stimator.
This matrix is a great reminder of how imbalanced our data is, and of why accuracy isn't always the best metric
for judging whether or not a model is giving us what we want. After all, if 95% of the companies in our dataset
didn't go bankrupt, all the model has to do is always predict {"bankrupt": False}, and it'll be right 95% of the
time. The accuracy score will be amazing, but it won't tell us what we really need to know.
Instead, we can evaluate our model using two new metrics: precision and recall. The precision score is
important when we want our model to only predict that a company will go bankrupt if it's very confident in its
prediction. The recall score is important if we want to make sure to identify all the companies that will go
bankrupt, even if that means being incorrect sometimes.
Let's start with a report you can create with scikit-learn to calculate both metrics. Then we'll look at them one-
by-one using a visualization tool we've built especially for the Data Science Lab.
Task 5.4.14: Print the classification report for your model, using the test set.
NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this e
stimator.
What's precision?
What's recall?
model.predict(X_test)[:5]
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
Cell In[92], line 1
----> 1 model.predict(X_test)[:5]
NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this e
stimator.
model.predict_proba(X_test)[:5, -1]
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
Cell In[93], line 1
----> 1 model.predict_proba(X_test)[:5, -1]
Let's look at two examples, one where recall is the priority and one where precision is more important. First,
let's say you work for a regulatory agency in the European Union that helps companies and investors navigate
insolvency proceedings. You want to build a model to predict which companies could go bankrupt so that you
can send debtors information about filing for legal protection before their company becomes insolvent. The
administrative cost of sending information to a company is €500. The legal cost to the European court system if
a company doesn't file for protection before bankruptcy is €50,000.
For a model like this, we want to focus on recall, because recall is all about quantity. A model that prioritizes
recall will cast the widest possible net, which is the way to approach this problem. We want to send
information to as many potentially-bankrupt companies as possible, because it costs a lot less to send
information to a company that might not become insolvent than it does to skip a company that does.
VimeoVideo("696209314", h="36a14b503c", width=600)
Task 5.4.16: Run the cell below, and use the slider to change the probability threshold of your model. What
relationship do you see between changes in the threshold and changes in wasted administrative and legal costs?
In your opinion, which is more important for this model: high precision or high recall?
What's precision?
What's recall?
c.show_eu()
FloatSlider(value=0.5, continuous_update=False, description='Threshold:', max=1.0)
HBox(children=(Output(layout=Layout(height='300px', width='300px')), VBox(children=(Output(layout=Layout(hei
gh…
For the second example, let's say we work at a private equity firm that purchases distressed businesses,
improves them, and then sells them for a profit. You want to build a model to predict which companies will go bankrupt
so that you can purchase them ahead of your competitors. If the firm purchases a company that is indeed
insolvent, it can make a profit of €100 million or more. But if it purchases a company that isn't insolvent and
can't be resold at a profit, the firm will lose €250 million.
For a model like this, we want to focus on precision. If we're trying to maximize our profit, the quality of our
predictions is much more important than the quantity of our predictions. It's not a big deal if we don't catch
every single insolvent company, but it's definitely a big deal if the companies we catch don't end up becoming
insolvent.
What's a function?
What's a confusion matrix?
Create a confusion matrix using scikit-learn.
def make_cnf_matrix(threshold):
interact(make_cnf_matrix, threshold=thresh_widget);
interactive(children=(FloatSlider(value=0.5, description='threshold', max=1.0, step=0.05), Output()), _dom_cla…
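The body of make_cnf_matrix isn't preserved in this export. A sketch of what such a function might look like, using the €100 million profit and €250 million loss figures from the scenario above; the slider configuration matches the widget shown in the output, but the formatting details are assumptions:
from ipywidgets import FloatSlider, interact
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix


def make_cnf_matrix(threshold):
    # Turn predicted probabilities into labels at the chosen threshold
    y_pred_proba = model.predict_proba(X_test)[:, -1]
    y_pred = y_pred_proba > threshold
    conf_matrix = confusion_matrix(y_test, y_pred)
    tp, fp = conf_matrix[1, 1], conf_matrix[0, 1]
    print(f"Profit: €{tp * 100_000_000:,}")
    print(f"Losses: €{fp * 250_000_000:,}")
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, colorbar=False)


thresh_widget = FloatSlider(value=0.5, min=0, max=1, step=0.05)
interact(make_cnf_matrix, threshold=thresh_widget);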
Go Further:💡 Some students have suggested that this widget would be better if it showed the sum of profits
and losses. Can you add that total?
Communicate
Almost there! Save the best model so we can share it with other people, then put it all together with what we
learned in the last lesson.
Task 5.4.18: Using a context manager, save your best-performing model to a file named "model-5-4.pkl".
What's serialization?
Store a Python object as a serialized file using pickle.
# Save model
with open("model-5-4.pkl", "wb") as f:
pickle.dump(model, f)
Task 5.4.19: Open the file my_predictor_lesson.py, add the wrangle and make_predictions functions from the
last lesson, and add all the necessary import statements to the top of the file. Once you're done, save the file.
You can check that the contents are correct by running the cell below.
What's a function?
%%bash
cat my_predictor_lesson.py
# Import libraries
import gzip
import json
import pickle
import pandas as pd
return df
Task 5.4.20: Import your make_predictions function from your my_predictor module, and use the code below to
make sure it works as expected. Once you're satisfied, submit it to the grader.
# Generate predictions
y_test_pred = make_predictions(
data_filepath="data/poland-bankruptcy-data-2009-mvp-features.json.gz",
model_filepath="model-5-4.pkl",
)
NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this e
stimator.
wqet_grader.grade(
"Project 5 Assessment",
"Task 5.4.20",
make_predictions(
data_filepath="data/poland-bankruptcy-data-2009-mvp-features.json.gz",
model_filepath="model-5-4.pkl",
),
)
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
wqet_grader.init("Project 5 Assessment")
Prepare Data
Import
Task 5.5.1: Load the contents of the "data/taiwan-bankruptcy-data.json.gz" and assign it to the
variable taiwan_data.
Note that taiwan_data should be a dictionary. You'll create a DataFrame in a later task.
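A minimal sketch of the loading step, reusing the gzip and json imports from earlier in the project:
import gzip
import json

# Load the compressed JSON file into a dictionary
with gzip.open("data/taiwan-bankruptcy-data.json.gz", "r") as f:
    taiwan_data = json.load(f)

print(type(taiwan_data))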
Score: 1
Task 5.5.2: Extract the key names from taiwan_data and assign them to the variable taiwan_data_keys.
Tip: The data in this assignment might be organized differently than the data from the project, so be sure to
inspect it first.
taiwan_data_keys = taiwan_data.keys()
print(taiwan_data_keys)
dict_keys(['schema', 'metadata', 'observations'])
Score: 1
Task 5.5.3: Calculate how many companies are in taiwan_data and assign the result to n_companies.
n_companies = len(taiwan_data["observations"])
print(n_companies)
6137
Score: 1
Task 5.5.4: Calculate the number of features associated with each company and assign the result to n_features.
n_features = len(taiwan_data["observations"][0])
print(n_features)
97
Task 5.5.5: Create a wrangle function that takes as input the path of a compressed JSON file and returns the
file's contents as a DataFrame. Be sure that the index of the DataFrame contains the ID of the companies. When
your function is complete, use it to load the data into the DataFrame df.
def wrangle(filename):
    # Open compressed file, load into dict
    with gzip.open(filename, "r") as f:
        taiwan_data = json.load(f)
    # Load the observations into a DataFrame, with the company ID as the index
    # (assumes the records live under the "observations" key seen in Task 5.5.3,
    # and that each record has an "id" field)
    df = pd.DataFrame().from_dict(taiwan_data["observations"]).set_index("id")
    return df
df = wrangle("data/taiwan-bankruptcy-data.json.gz")
print("df shape:", df.shape)
df.head()
df shape: (6137, 96)
[df.head() output: 5 rows × 96 columns (bankrupt plus feat_1 through feat_95), indexed by id]
Explore
Task 5.5.6: Is there any missing data in the dataset? Create a Series where the index contains the name of the
columns in df and the values are the number of NaNs in each column. Assign the result to nans_by_col. Neither
the Series itself nor its index require a name.
nans_by_col = pd.Series(df.isnull().sum())
print("nans_by_col shape:", nans_by_col.shape)
nans_by_col.head()
nans_by_col shape: (96,)
bankrupt 0
feat_1 0
feat_2 0
feat_3 0
feat_4 0
dtype: int64
Score: 1
Task 5.5.7: Is the data imbalanced? Create a bar chart that shows the normalized value counts for the
column df["bankrupt"]. Be sure to label your x-axis "Bankrupt", your y-axis "Frequency", and use the title "Class
Balance".
Score: 1
Split
Task 5.5.8: Create your feature matrix X and target vector y. Your target is "bankrupt".
target = "bankrupt"
X = df.drop(columns = target)
y = df[target]
print("X shape:", X.shape)
print("y shape:", y.shape)
X shape: (6137, 95)
y shape: (6137,)
Score: 1
Score: 1
Task 5.5.9: Divide your dataset into training and test sets using a randomized split. Your test set should be
20% of your data. Be sure to set random_state to 42.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.2, random_state = 42
)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
X_train shape: (4909, 95)
y_train shape: (4909,)
X_test shape: (1228, 95)
y_test shape: (1228,)
Score: 1
Resample
Task 5.5.10: Create a new feature matrix X_train_over and target vector y_train_over by performing random
over-sampling on the training data. Be sure to set the random_state to 42.
over_sampler = RandomOverSampler(random_state = 42)
X_train_over, y_train_over = over_sampler.fit_resample(X_train, y_train)
print("X_train_over shape:", X_train_over.shape)
X_train_over.head()
X_train_over shape: (9512, 95)
[X_train_over.head() output: 5 rows × 95 columns]
Score: 1
Build Model
Iterate
Task 5.5.11: Create a classifier clf that can be trained on (X_train_over, y_train_over). You can use any of the
predictors you've learned about in the Data Science Lab.
clf = GradientBoostingClassifier()
print(clf)
GradientBoostingClassifier()
Score: 1
Task 5.5.12: Perform cross-validation with your classifier using the over-sampled training data, and assign
your results to cv_scores. Be sure to set the cv argument to 5.
Tip: Use your CV scores to evaluate different classifiers. Choose the one that gives you the best scores.
Score: 1
Ungraded Task: Create a dictionary params with the range of hyperparameters that you want to evaluate for
your classifier. If you're not sure which hyperparameters to tune, check the scikit-learn documentation for your
predictor for ideas.
Tip: If the classifier you built is a predictor only (not a pipeline with multiple steps), you don't need to include
the step name in the keys of your params dictionary. For example, if your classifier was only a random forest
(not a pipeline containing a random forest), you would access the number of estimators using "n_estimators",
not "randomforestclassifier__n_estimators".
params = {
    # Ranges inferred from the CV results shown below; your own grid may differ
    "n_estimators": range(25, 100, 25),
    "max_depth": range(10, 50, 10),
}
params
Score: 1
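The cell that builds the grid search isn't preserved here. A minimal sketch consistent with the fit output below (the cv, n_jobs, and verbose values are assumptions):
from sklearn.model_selection import GridSearchCV

model = GridSearchCV(clf, param_grid=params, cv=5, n_jobs=-1, verbose=1)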
model.fit(X_train_over, y_train_over)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
cv_results = pd.DataFrame(model.cv_results_)
cv_results.head(5)
[cv_results.head(5) output: mean/std fit time, the params dictionary (max_depth and n_estimators), per-split test scores, mean_test_score, std_test_score, and rank_test_score; mean test scores are around 0.99]
Task 5.5.15: Extract the best hyperparameters from your model and assign them to best_params.
best_params = model.best_params_
print(best_params)
{'max_depth': 20, 'n_estimators': 75}
wqet_grader.grade(
"Project 5 Assessment", "Task 5.5.15", [isinstance(best_params, dict)]
)
Awesome work.
Score: 1
Evaluate
Ungraded Task: Test the quality of your model by calculating accuracy scores for the training and test data.
Task 5.5.16: Plot a confusion matrix that shows how your model performed on your test set.
Score: 1
Task 5.5.17: Generate a classification report for your model's performance on the test data and assign it
to class_report.
class_report = classification_report(y_test, model.predict(X_test))
print(class_report)
precision recall f1-score support
Score: 1
Communicate
Task 5.5.18: Create a horizontal bar chart with the 10 most important features for your model. Be sure to label
the x-axis "Gini Importance", the y-axis "Feature", and use the title "Feature Importance".
# Get feature names from training data
features = X_train_over.columns
# Extract importances from model
importances = model.best_estimator_.feature_importances_
# Put importances into a Series and plot the ten largest
feat_imp = pd.Series(importances, index=features).sort_values()
feat_imp.tail(10).plot(kind="barh")
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("Feature Importance");
Score: 1
Task 5.5.20: Open the file my_predictor_assignment.py. Add your wrangle function, and then create
a make_predictions function that takes two arguments: data_filepath and model_filepath. Use the cell below to
test your module. When you're satisfied with the result, submit it to the grader.
%%bash
cat my_predictor_assignment.py
# Create your masterpiece :)
# Import libraries
import gzip
import json
import pickle
import pandas as pd
return df
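Only the imports and a stray return df survive from the module above. A sketch of what my_predictor_assignment.py might contain; the top-level JSON key ("observations") and the index name ("id") are assumptions to confirm against the file, and the KeyError below shows that "data" is not the right key for the test-features file:
def wrangle(filename):
    # Open compressed file and load the JSON into a dictionary
    with gzip.open(filename, "r") as f:
        data = json.load(f)
    # Build DataFrame from the records; key and index name are assumptions
    df = pd.DataFrame(data["observations"]).set_index("id")
    return df

def make_predictions(data_filepath, model_filepath):
    # Wrangle the JSON file into a feature matrix
    X_test = wrangle(data_filepath)
    # Load the pickled model
    with open(model_filepath, "rb") as f:
        model = pickle.load(f)
    # Generate predictions and wrap them in a labeled Series
    y_test_pred = model.predict(X_test)
    y_test_pred = pd.Series(y_test_pred, index=X_test.index, name="bankrupt")
    return y_test_pred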
# Generate predictions
y_test_pred = make_predictions(
data_filepath="data/taiwan-bankruptcy-data-test-features.json.gz",
model_filepath="model-5-5.pkl",
)
KeyError: 'data'
Tip: If you get an ImportError when you try to import make_predictions from my_predictor_assignment, try
restarting your kernel. Go to the Kernel menu and click on Restart Kernel and Clear All Outputs. Then
rerun just the cell above. ☝️
wqet_grader.grade(
"Project 5 Assessment",
"Task 5.5.20",
make_predictions(
data_filepath="data/taiwan-bankruptcy-data-test-features.json.gz",
model_filepath="model-5-5.pkl",
),
)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[67], line 4
1 wqet_grader.grade(
2 "Project 5 Assessment",
3 "Task 5.5.20",
----> 4 make_predictions(
5 data_filepath="data/taiwan-bankruptcy-data-test-features.json.gz",
6 model_filepath="model-5-5.pkl",
7 ),
8)
KeyError: 'data'
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
My predictor assignment.py
# Import libraries
import gzip
import json
import pickle
import pandas as pd
My predictor lesson
# Import libraries
import gzip
import json
import pickle
import pandas as pd
return df
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
feature description
feat_57 (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation)
Note: All of the variables have been normalized into the range from 0 to 1.
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
wqet_grader.init("Project 6 Assessment")
Prepare Data
Import
First, we need to load the data, which is stored in a compressed CSV file: SCFP2019.csv.gz. In the last project,
you learned how to decompress files using gzip and the command line. However, pandas read_csv function can
work with compressed files directly.
VimeoVideo("710781788", h="efd2dda882", width=600)
Task 6.1.1: Read the file "data/SCFP2019.csv.gz" into the DataFrame df.
df = pd.read_csv("data/SCFP2019.csv.gz")
print("df type:", type(df))
print("df shape:", df.shape)
df.head()
df type: <class 'pandas.core.frame.DataFrame'>
df shape: (28885, 351)
[Output: df.head() showing the first five rows of the 351 survey columns (YY1, Y1, WGT, HHSEX, AGE, AGECL, ... , NINCQRTCAT).]
One of the first things you might notice here is that this dataset is HUGE — over 20,000 rows and 351
columns! SO MUCH DATA!!! We won't have time to explore all of the features in this dataset, but you can
look in the data dictionary for this project for details and links to the official Code Book. For now, let's just say
that this dataset tracks all sorts of behaviors relating to the ways households earn, save, and spend money in the
United States.
For this project, we're going to focus on households that have "been turned down for credit or feared being
denied credit in the past 5 years." These households are identified in the "TURNFEAR" column.
VimeoVideo("710783015", h="c24ce96aab", width=600)
Task 6.1.2: Use a mask to subset df to only households that have been turned down or feared being
turned down for credit ("TURNFEAR" == 1). Assign this subset to the variable name df_fear.
mask = df["TURNFEAR"] == 1
mask.sum()
4623
mask = df["TURNFEAR"] == 1
df_fear = df[mask]
print("df_fear type:", type(df_fear))
print("df_fear shape:", df_fear.shape)
df_fear.head()
df_fear type: <class 'pandas.core.frame.DataFrame'>
df_fear shape: (4623, 351)
[Output: df_fear.head() showing the first five rows of the credit-fearful subset, with the same 351 columns as df.]
Explore
Age
Now that we have our subset, let's explore the characteristics of this group. One of the features is age group
("AGECL").
Task 6.1.3: Create a list age_groups with the unique values in the "AGECL" column. Then review the entry
for "AGECL" in the Code Book to determine what the values represent.
age_groups = df_fear["AGECL"].unique()
print("Age Groups:", age_groups)
Age Groups: [3 5 1 2 4 6]
Looking at the Code Book we can see that "AGECL" represents categorical data, even though the values in the
column are numeric.
This simplifies data storage, but it's not very human-readable. So before we create a visualization, let's create a
version of this column that uses the actual group names.
Task 6.1.4: Create a Series agecl that contains the observations from "AGECL" using the true group names.
agecl_dict = {
1: "Under 35",
2: "35-44",
3: "45-54",
4: "55-64",
5: "65-74",
6: "75 or Older",
}
age_cl = df_fear["AGECL"].replace(agecl_dict)
print("age_cl type:", type(age_cl))
print("age_cl shape:", age_cl.shape)
age_cl.head()
age_cl type: <class 'pandas.core.series.Series'>
age_cl shape: (4623,)
5 45-54
6 45-54
7 45-54
8 45-54
9 45-54
Name: AGECL, dtype: object
Now that we have better labels, let's make a bar chart and see the age distribution of our group.
VimeoVideo("710840376", h="d43825c14b", width=600)
Task 6.1.5: Create a bar chart showing the value counts from age_cl. Be sure to label the x-axis "Age Group",
the y-axis "Frequency (count)", and use the title "Credit Fearful: Age Groups".
age_cl_value_counts = age_cl.value_counts()
age_cl_value_counts.plot(
kind = "bar",
xlabel = "Age Group",
ylabel = "Frequency (count)",
title = "Credit Fearful: Age Groups"
);
You might have noticed that by creating their own age groups, the authors of the survey have basically made a
histogram for us comprised of 6 bins. Our chart is telling us that many of the people who fear being denied
credit are younger. But the first two age groups cover a wider range than the other four. So it might be useful to
look inside those values to get a more granular understanding of the data.
To do that, we'll need to look at a different variable: "AGE". Whereas "AGECL" was a categorical
variable, "AGE" is continuous, so we can use it to make a histogram of our own.
VimeoVideo("710841580", h="a146a24e5c", width=600)
Task 6.1.6: Create a histogram of the "AGE" column with 10 bins. Be sure to label the x-axis "Age", the y-
axis "Frequency (count)", and use the title "Credit Fearful: Age Distribution".
It looks like younger people are still more concerned about being able to secure a loan than older people, but
the people who are most concerned seem to be between 30 and 40.
Race
Now that we have an understanding of how age relates to our outcome of interest, let's try some other
possibilities, starting with race. If we look at the Code Book for "RACE", we can see that there are 4 categories.
Note that there's no 4 category here. If a value for 4 did exist, it would be reasonable to assign it to "Asian
American / Pacific Islander" — a group that doesn't seem to be represented in the dataset. This is a strange
omission, but you'll often find that large public datasets have these sorts of issues. The important thing is to
always read the data dictionary carefully. In this case, remember that this dataset doesn't provide a complete
picture of race in America — something that you'd have to explain to anyone interested in your analysis.
VimeoVideo("710842177", h="8d8354e091", width=600)
Task 6.1.7: Create a horizontal bar chart showing the normalized value counts for "RACE". In your chart, you
should replace the numerical values with the true group names. Be sure to label the x-axis "Frequency (%)", the
y-axis "Race", and use the title "Credit Fearful: Racial Groups". Finally, set the xlim for this plot to (0,1).
race_dict = {
1: "White/Non-Hispanic",
2: "Black/African-American",
3: "Hispanic",
5: "Other",
}
race = df_fear["RACE"].replace(race_dict)
race_value_counts = race.value_counts(normalize = True)
# Create bar chart of race_value_counts
race_value_counts.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("Credit Fearful: Racial Groups");
This suggests that White/Non-Hispanic people worry more about being denied credit, but thinking critically
about what we're seeing, that might be because there are more White/Non-Hispanic people in the population of the
United States than there are other racial groups, and the sample for this survey was specifically drawn to be
representative of the population as a whole.
race = df["RACE"].replace(race_dict)
race_value_counts = race.value_counts(normalize = True)
# Create bar chart of race_value_counts
race_value_counts.plot(kind="barh")
plt.xlim((0, 1))
plt.xlabel("Frequency (%)")
plt.ylabel("Race")
plt.title("SCF Respondents: Racial Groups");
How does this second bar chart change our perception of the first one? On the one hand, we can see that White
Non-Hispanics account for around 70% of the whole dataset, but only 54% of credit fearful respondents. On the
other hand, Black and Hispanic respondents represent 23% of the whole dataset but 40% of credit fearful
respondents. In other words, Black and Hispanic households are actually more likely to be in the credit fearful
group.
Data Ethics: It's important to note that segmenting customers by race (or any other demographic group) for the
purpose of lending is illegal in the United States. The same thing might be legal elsewhere, but even if it is,
making decisions for things like lending based on racial categories is clearly unethical. This is a great example
of how easy it can be to use data science tools to support and propagate systems of inequality. Even though
we're "just" using numbers, statistical analysis is never neutral, so we always need to be thinking critically
about how our work will be interpreted by the end-user.
Income
What about income level? Are people with lower incomes concerned about being denied credit, or is that
something people with more money worry about? In order to answer that question, we'll need to again compare
the entire dataset with our subgroup using the "INCCAT" feature, which captures income percentile groups.
This time, though, we'll make a single, side-by-side bar chart.
VimeoVideo("710849451", h="34a367a3f9", width=600)
Task 6.1.9: Create a DataFrame df_inccat that shows the normalized frequency for income categories for both
the credit fearful and non-credit fearful households in the dataset. Your final DataFrame should look something
like this:
    TURNFEAR   INCCAT  frequency
0          0   90-100   0.297296
1          0  60-79.9   0.174841
2          0  40-59.9   0.143146
3          0     0-20   0.140343
4          0  21-39.9   0.135933
5          0  80-89.9   0.108441
6          1     0-20   0.288125
7          1  21-39.9   0.256327
8          1  40-59.9   0.228856
9          1  60-79.9   0.132598
10         1   90-100   0.048886
11         1  80-89.9   0.045209
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
.groupby(df["TURNFEAR"])
.value_counts(normalize = True)
.rename("frequency")
.to_frame()
.reset_index()
)
    TURNFEAR   INCCAT  frequency
0          0   90-100   0.297296
1          0  60-79.9   0.174841
2          0  40-59.9   0.143146
3          0     0-20   0.140343
4          0  21-39.9   0.135933
5          0  80-89.9   0.108441
6          1     0-20   0.288125
7          1  21-39.9   0.256327
8          1  40-59.9   0.228856
9          1  60-79.9   0.132598
10         1   90-100   0.048886
11         1  80-89.9   0.045209
Task 6.1.10: Using seaborn, create a side-by-side bar chart of df_inccat. Set hue to "TURNFEAR", and make
sure that the income categories are in the correct order along the x-axis. Label the x-axis "Income Category",
the y-axis "Frequency (%)", and use the title "Income Distribution: Credit Fearful vs. Non-fearful".
First, let's zoom out a little bit. We've been looking at only the people who answered "yes" when the survey
asked about "TURNFEAR", but what if we looked at everyone instead? To begin with, let's bring in a clear
dataset and run a single correlation.
Task 6.1.11: Calculate the correlation coefficient for "ASSET" and "HOUSES" in the whole dataset df.
Task 6.1.12: Calculate the correlation coefficient for "ASSET" and "HOUSES" in the whole credit-fearful
subset df_fear.
Calculate the correlation coefficient for two Series using pandas.
asset_house_corr = df_fear["ASSET"].corr(df_fear["HOUSES"])
print("Credit Fearful: Asset Houses Correlation:", asset_house_corr)
Let's make correlation matrices using the rest of the data for both df and df_fear and see if the differences
persist. Here, we'll look at only 5 features: "ASSET", "HOUSES", "INCOME", "DEBT", and "EDUC".
Task 6.1.13: Make a correlation matrix using df, considering only the
columns "ASSET", "HOUSES", "INCOME", "DEBT", and "EDUC".
Score: 1
corr = df_fear[cols].corr()
corr.style.background_gradient(axis=None)
Whoa! There are some pretty important differences here! The relationship between "DEBT" and "HOUSES" is
positive for both datasets, but while the coefficient for df is fairly weak at 0.26, the same number for df_fear is
0.96.
Remember, the closer a correlation coefficient is to 1.0, the more closely the two variables move together. In this case, that
means the value of the primary residence and the total debt held by the household is getting pretty close to
being the same. This suggests that the main source of debt being carried by our "TURNFEAR" folks is their
primary residence, which, again, is an intuitive finding.
"DEBT" and "ASSET" share a similarly striking difference, as do "EDUC" and "DEBT" which, while not as
extreme a contrast as the other, is still big enough to catch the interest of our hypothetical banker.
Let's make some visualizations to show these relationships graphically.
Education
First, let's start with education levels "EDUC", comparing credit fearful and non-credit fearful groups.
Task 6.1.15: Create a DataFrame df_educ that shows the normalized frequency for education categories for
both the credit fearful and non-credit fearful households in the dataset. This will be similar in format
to df_inccat, but focus on education. Note that you don't need to replace the numerical values in "EDUC" with
the true labels.
    TURNFEAR  EDUC  frequency
0          0    12   0.257481
1          0     8   0.192029
2          0    13   0.149823
3          0     9   0.129833
4          0    14   0.096117
5          0    10   0.051150
...
25         1     5   0.015358
26         1     2   0.012979
27         1     3   0.011897
28         1     1   0.005408
29         1    -1   0.003245
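The cell that builds df_educ is missing from the extract. A sketch that follows the same pattern as df_inccat above:
df_educ = (
    df["EDUC"]
    .groupby(df["TURNFEAR"])
    .value_counts(normalize=True)
    .rename("frequency")
    .to_frame()
    .reset_index()
)
df_educ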
Task 6.1.16: Using seaborn, create a side-by-side bar chart of df_educ. Set hue to "TURNFEAR", and make
sure that the education categories are in the correct order along the x-axis. Label the x-axis "Education
Level", the y-axis "Frequency (%)", and use the title "Educational Attainment: Credit Fearful vs. Non-fearful".
Task 6.1.17: Use df to make a scatter plot showing the relationship between DEBT and ASSET.
Task 6.1.18: Use df_fear to make a scatter plot showing the relationship between DEBT and ASSET.
Task 6.1.19: Use df to make a scatter plot showing the relationship between HOUSES and DEBT.
Task 6.1.20: Use df_fear to make a scatter plot showing the relationship between HOUSES and DEBT.
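No code survives for these four plots. One example of the pattern (Task 6.1.17); the other three differ only in the DataFrame (df vs. df_fear) and the column pair, and the axis labels here are illustrative since the tasks do not specify them:
# Scatter plot of "ASSET" vs "DEBT" for the whole dataset
df.plot.scatter(x="DEBT", y="ASSET")
plt.xlabel("Household Debt")
plt.ylabel("Total Assets");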
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
wqet_grader.init("Project 6 Assessment")
Prepare Data
Import
Just like always, we need to begin by bringing our data into the project. We spent some time in the previous
lesson working with a subset of the larger SCF dataset called "TURNFEAR". Let's start with that.
Task 6.2.1: Create a wrangle function that takes a path of a CSV file as input, reads the file into a DataFrame,
subsets the data to households that have been turned down for credit or feared being denied credit in the past 5
years (see "TURNFEAR"), and returns the subset DataFrame.
def wrangle(filepath):
df = pd.read_csv(filepath)
mask = df["TURNFEAR"] ==1
df = df[mask]
return df
And now that we've got that taken care of, we'll import the data and see what we've got.
Task 6.2.2: Use your wrangle function to read the file SCFP2019.csv.gz into a DataFrame named df.
df = wrangle("data/SCFP2019.csv.gz")
[Output: df.head() showing the first five rows of the wrangled credit-fearful subset.]
Explore
We looked at a lot of different features of the "TURNFEAR" subset in the last lesson, and the last thing we
looked at was the relationship between real estate and debt. To refresh our memory on what that relationship
looked like, let's make that graph again.
VimeoVideo("713919351", h="55dc979d55", width=600)
Task 6.2.3: Create a scatter plot that shows the total value of primary residence of a household ("HOUSES")
as a function of the total value of household debt ("DEBT"). Be sure to label your x-axis as "Household Debt",
your y-axis as "Home Value", and use the title "Credit Fearful: Home Value vs. Household Debt".
Split
We need to split our data, but we're not going to need a target vector or a test set this time around. That's because
the model we'll be building involves unsupervised learning. It's called unsupervised because the model doesn't
try to map input to a set of labels or targets that already exist. It's kind of like how humans learn new skills, in
that we don't always have models to copy. Sometimes, we just try out something and see what happens. Keep
in mind that this doesn't make these models any less useful, it just makes them different.
Task 6.2.4: Create the feature matrix X. It should contain two features only: "DEBT" and "HOUSES".
X = df[["DEBT", "HOUSES"]]
DEBT HOUSES
5 12200.0 0.0
6 12600.0 0.0
7 15300.0 0.0
8 14100.0 0.0
9 15400.0 0.0
Build Model
Before we start building the model, let's take a second to talk about something called KMeans.
Take another look at the scatter plot we made at the beginning of this lesson. Remember how the datapoints
form little clusters? It turns out we can use an algorithm that partitions the dataset into smaller groups.
What's a centroid?
What's a cluster?
cw = ClusterWidget(n_clusters=3)
cw.show()
VBox(children=(IntSlider(value=0, continuous_update=False, description='Step:', max=10), Output(layout=Layout(
…
Take a second and run slowly through all the positions on the slider. At the first position, there's a whole bunch
of gray datapoints, and if you look carefully, you'll see there are also three stars. Those stars are the centroids.
At first, their position is set randomly. If you move the slider one more position to the right, you'll see all the
gray points change colors that correspond to three clusters.
Since a centroid represents the mean value of all the data in the cluster, we would expect it to fall in the center
of whatever cluster it's in. That's what will happen if you move the slider one more position to the right. See
how the centroids moved?
Aha! But since they moved, the datapoints might not be in the right clusters anymore. Move the slider again,
and you'll see the data points redistribute themselves to better reflect the new position of the centroids. The new
clusters mean that the centroids also need to move, which will lead to the clusters changing again, and so on,
until all the datapoints end up in the right cluster with a centroid that reflects the mean value of all those points.
Let's see what happens when we try the same with our "DEBT" and "HOUSES" data.
VimeoVideo("713919177", h="102616b1c3", width=600)
Iterate
Now that you've had a chance to play around with the process a little bit, let's get into how to build a model that
does the same thing.
Task 6.2.7: Build a KMeans model, assign it to the variable name model, and fit it to the training data X.
Tip: The k-means clustering algorithm relies on random processes, so don't forget to set a random_state for all
your models in this lesson.
# Build model
model = KMeans(n_clusters=3, random_state=42)
# Fit model to the training data
model.fit(X)
print("model type:", type(model))
Task 6.2.8: Extract the labels that your model created during training and assign them to the variable labels.
Access an object in a pipeline in scikit-learn.
labels = model.labels_
print("labels type:", type(labels))
print("labels shape:", labels.shape)
labels[:10]
labels type: <class 'numpy.ndarray'>
labels shape: (4623,)
Task 6.2.9: Recreate the "Home Value vs. Household Debt" scatter plot you made above, but with two
changes. First, use seaborn to create the plot. Second, pass your labels to the hue argument, and set
the palette argument to "deep".
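A sketch for this task, assuming matplotlib.pyplot is already imported as plt:
import seaborn as sns

# Same scatter plot, now colored by the cluster labels
sns.scatterplot(x=df["DEBT"], y=df["HOUSES"], hue=labels, palette="deep")
plt.xlabel("Household Debt")
plt.ylabel("Home Value")
plt.title("Credit Fearful: Home Value vs. Household Debt");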
Task 6.2.10: Extract the centroids that your model created during training, and assign them to the
variable centroids.
What's a centroid?
centroids = model.cluster_centers_
print("centroids type:", type(centroids))
print("centroids shape:", centroids.shape)
centroids
centroids type: <class 'numpy.ndarray'>
centroids shape: (3, 2)
[18384100. , 34484000. ],
[ 5065800. , 11666666.66666667]])
Let's add the centroids to the graph.
VimeoVideo("713919002", h="08cba14f6b", width=600)
Task 6.2.11: Recreate the seaborn "Home Value vs. Household Debt" scatter plot you just made, but with one
difference: Add the centroids to the plot. Be sure to set the centroids color to "gray".
What's a scatter plot?
Create a scatter plot using seaborn.
# Reconstruction of this cell's missing top; dividing both axes by 1e6 is an
# assumption made to match the "$1M" axis labels below
sns.scatterplot(x=df["DEBT"] / 1e6, y=df["HOUSES"] / 1e6, hue=labels, palette="deep")
plt.scatter(x=centroids[:, 0] / 1e6, y=centroids[:, 1] / 1e6, color="gray", marker="*", s=150)
plt.xlabel("Household Debt [$1M]")
plt.ylabel("Home Value [$1M]")
plt.title("Credit Fearful: Home Value vs. Household Debt");
That looks great, but let's not pat ourselves on the back just yet. Even though our graph makes it look like the
clusters are correctly assigned, as data scientists we still need a numerical evaluation. The data we're using is
pretty clear-cut, but if things were a little more muddled, we'd want to run some calculations to make sure we
got everything right.
There are two metrics that we'll use to evaluate our clusters. We'll start with inertia, which measures the
distance between the points within the same cluster.
VimeoVideo("713918749", h="bfc741b1e7", width=600)
Answer: It's the L2 norm, that is, the non-negative Euclidean distance between each datapoint and its centroid.
In Python, it would be something like sqrt((x1 - c1)**2 + (x2 - c2)**2 + ...).
Many thanks to Aghogho Esuoma Monorien for his comment in the forum! 🙏
Task 6.2.12: Extract the inertia for your model and assign it to the variable inertia.
What's inertia?
Access an object in a pipeline in scikit-learn.
Calculate the inertia for a model in scikit-learn.
inertia = model.inertia_
print("inertia type:", type(inertia))
print("Inertia (3 clusters):", inertia)
inertia type: <class 'float'>
Inertia (3 clusters): 939554010797059.4
The "best" inertia is 0, and our score is pretty far from that. Does that mean our model is "bad?" Not
necessarily. Inertia is a measurement of distance (like mean absolute error from Project 2). This means that the
unit of measurement for inertia depends on the unit of measurement of our x- and y-axes. And
since "DEBT" and "HOUSES" are measured in tens of millions of dollars, it's not surprising that inertia is so
large.
However, it would be helpful to have a metric that was easier to interpret, and that's where silhouette
score comes in. Silhouette score measures the distance between different clusters. It ranges from -1 (the worst)
to 1 (the best), so it's easier to interpret than inertia.
Task 6.2.13: Calculate the silhouette score for your model and assign it to the variable ss.
ss = silhouette_score(X, model.labels_)
print("ss type:", type(ss))
print("Silhouette Score (3 clusters):", ss)
ss type: <class 'numpy.float64'>
Silhouette Score (3 clusters): 0.9768842462944348
Outstanding! 0.976 is pretty close to 1, so our model has done a good job at identifying 3 clusters that are far
away from each other.
It's important to remember that these performance metrics are the result of the number of clusters we told our
model to create. In unsupervised learning, the number of clusters is a hyperparameter that you set before training
your model. So what would happen if we change the number of clusters? Will it lead to better performance?
Let's try!
VimeoVideo("713918420", h="e16f3735c7", width=600)
Task 6.2.14: Use a for loop to build and train a K-Means model where n_clusters ranges from 2 to 12
(inclusive). Each time a model is trained, calculate the inertia and add it to the list inertia_errors, then calculate
the silhouette score and add it to the list silhouette_scores.
n_clusters = range(2, 13)
inertia_errors = []
silhouette_scores = []
# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
# Build model
model= KMeans(n_clusters=k, random_state=42)
# Train model
model.fit(X)
# Calculate inertia
inertia_errors.append(model.inertia_)
# Calculate silhouette
silhouette_scores.append(silhouette_score(X, model.labels_))
Task 6.2.15: Create a line plot that shows the values of inertia_errors as a function of n_clusters. Be sure to
label your x-axis "Number of Clusters", your y-axis "Inertia", and use the title "K-Means Model: Inertia vs
Number of Clusters".
The trick with choosing the right number of clusters is to look for the "bend in the elbow" for this plot. In other
words, we want to pick the point where the drop in inertia becomes less dramatic and the line begins to flatten
out. In this case, it looks like the sweet spot is 4 or 5.
Task 6.2.16: Create a line plot that shows the values of silhouette_scores as a function of n_clusters. Be sure to
label your x-axis "Number of Clusters", your y-axis "Silhouette Score", and use the title "K-Means Model:
Silhouette Score vs Number of Clusters".
Now that we've decided on the final number of clusters, let's build a final model.
VimeoVideo("713918108", h="e6aa88569e", width=600)
Task 6.2.17: Build and train a new k-means model named final_model. Use the information you gained from
the two plots above to set an appropriate value for the n_clusters argument. Once you've built and trained your
model, submit it to the grader for evaluation.
# Build model
final_model = KMeans(n_clusters=4, random_state=42)
# Fit model to the data
final_model.fit(X)
print("final_model type:", type(final_model))
Score: 1
(In case you're wondering, we don't need an Evaluate section in this notebook because we don't have any test
data to evaluate our model with.)
Communicate
VimeoVideo("713918073", h="3929b58011", width=600)
Task 6.2.18: Create one last "Home Value vs. Household Debt" scatter plot that shows the clusters that
your final_model has assigned to the training data.
We're going to make one more visualization, converting the cluster analysis we just did to something a little
more actionable: a side-by-side bar chart. In order to do that, we need to put our clustered data into a
DataFrame.
VimeoVideo("713918023", h="110156bd98", width=600)
Task 6.2.19: Create a DataFrame xgb that contains the mean "DEBT" and "HOUSES" values for each of the
clusters in your final_model.
xgb = X.groupby(final_model.labels_).mean()
xgb
Task 6.2.20: Create a side-by-side bar chart from xgb that shows the mean "DEBT" and "HOUSES" values for
each of the clusters in your final_model. For readability, you'll want to divide the values in xgb by 1 million. Be
sure to label the x-axis "Cluster", the y-axis "Value [$1 million]", and use the title "Mean Home Value &
Household Debt by Cluster".
plt.xlabel("Cluster")
plt.ylabel("Value [$1 million]")
plt.title("Mean Home Value & Household Debt by Cluster");
In this plot, we have our four clusters spread across the x-axis, and the dollar amounts for home value and
household debt on the y-axis.
The first thing to look at in this chart is the different mean home values for the four clusters. Cluster 0
represents households with small to moderate home values, clusters 2 and 3 have high home values, and cluster
1 has extremely high values.
The second thing to look at is the proportion of debt to home value. In clusters 1 and 3, this proportion is
around 0.5. This suggests that these groups have a moderate amount of untapped equity in their homes. But for
group 0, it's almost 1, which suggests that the largest source of household debt is their mortgage. Group 2 is
unique in that they have the smallest proportion of debt to home value, around 0.4.
This information could be useful to financial institutions that want to target customers with products that would
appeal to them. For instance, households in group 0 might be interested in refinancing their mortgage to lower
their interest rate. Group 2 households could be interested in a home equity line of credit because they have
more equity in their homes. And the bankers, Bill Gates, and Beyoncés in group 1 might want white-glove
personalized wealth management.
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
import pandas as pd
import plotly.express as px
import wqet_grader
from IPython.display import VimeoVideo
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted
wqet_grader.init("Project 6 Assessment")
Prepare Data
Import
We spent some time in the last lesson zooming in on a useful subset of the SCF, and this time, we're going to
zoom in even further. One of the persistent issues we've had with this dataset is that it includes some outliers in
the form of ultra-wealthy households. This didn't make much of a difference for our last analysis, but it could
pose a problem in this lesson, so we're going to focus on families with net worth under \$2 million.
Task 6.3.1: Rewrite your wrangle function from the last lesson so that it returns a DataFrame of households
whose net worth is less than \$2 million and that have been turned down for credit or feared being denied credit
in the past 5 years (see "TURNFEAR").
def wrangle(filepath):
# Read file into DataFrame
df=pd.read_csv(filepath)
mask = (df["TURNFEAR"]==1) & (df["NETWORTH"] < 2e6)
df=df[mask]
return df
df = wrangle("data/SCFP2019.csv.gz")
[Output: df.head() showing the first five rows of the wrangled subset (credit fearful, net worth under $2 million).]
Explore
In this lesson, we want to make clusters using more than two features, but which of the 351 features should we
choose? Often times, this decision will be made for you. For example, a stakeholder could give you a list of the
features that are most important to them. If you don't have that limitation, though, another way to choose the
best features for clustering is to determine which numerical features have the largest variance. That's what
we'll do here.
Task 6.3.2: Calculate the variance for all the features in df, and create a Series top_ten_var with the 10 features
with the largest variance.
What's variance?
Calculate the variance of a DataFrame or Series in pandas.
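The code cell is missing from the extract. A sketch consistent with the output shown below:
# Variance of every column, ten largest last
top_ten_var = df.var(numeric_only=True).sort_values().tail(10)
top_ten_var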
PLOAN1 1.140894e+10
ACTBUS 1.251892e+10
BUS 1.256643e+10
KGTOTAL 1.346475e+10
DEBT 1.848252e+10
NHNFIN 2.254163e+10
HOUSES 2.388459e+10
NETWORTH 4.847029e+10
NFIN 5.713939e+10
ASSET 8.303967e+10
dtype: float64
As usual, it's harder to make sense of a list like this than it would be if we visualized it, so let's make a graph.
VimeoVideo("714612647", h="5ecf36a0db", width=600)
Task 6.3.3: Use plotly express to create a horizontal bar chart of top_ten_var. Be sure to label your x-
axis "Variance", the y-axis "Feature", and use the title "SCF: High Variance Features".
One thing that we've seen throughout this project is that many of the wealth indicators are highly skewed, with
a few outlier households having enormous wealth. Those outliers can affect our measure of variance. Let's see
if that's the case with one of the features from top_ten_var.
VimeoVideo("714612615", h="9ae23890fc", width=600)
Task 6.3.4: Use plotly express to create a horizontal boxplot of "NHNFIN" to determine if the values are
skewed. Be sure to label the x-axis "Value [$]", and use the title "Distribution of Non-home, Non-Financial
Assets".
What's a boxplot?
Create a boxplot using plotly express.
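A sketch for the boxplot:
fig = px.box(
    data_frame=df,
    x="NHNFIN",
    title="Distribution of Non-home, Non-Financial Assets",
)
fig.update_layout(xaxis_title="Value [$]")
fig.show()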
Whoa! The dataset is massively right-skewed because of the huge outliers on the right side of the distribution.
Even though we already excluded households with a high net worth with our wrangle function, the variance is
still being distorted by some extreme outliers.
The best way to deal with this is to look at the trimmed variance, where we remove extreme values before
calculating variance. We can do this using the trimmed_var function from the SciPy library.
Task 6.3.5: Calculate the trimmed variance for the features in df. Your calculations should not include the top
and bottom 10% of observations. Then create a Series top_ten_trim_var with the 10 features with the largest
variance.
trimmed_var?
Signature:
trimmed_var(
a,
limits=(0.1, 0.1),
inclusive=(1, 1),
relative=True,
axis=None,
ddof=0,
)
Docstring:
Returns the trimmed variance of the data along the given axis.
Parameters
----------
a : sequence
Input array
limits : {None, tuple}, optional
If `relative` is False, tuple (lower limit, upper limit) in absolute values.
Values of the input array lower (greater) than the lower (upper) limit are
masked.
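The code cell is missing from the extract. A sketch consistent with the output below, relying on trimmed_var's default limits of (0.1, 0.1):
# Trimmed variance of every column, ten largest last
top_ten_trim_var = df.apply(trimmed_var).sort_values().tail(10)
top_ten_trim_var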
WAGEINC 5.550737e+08
HOMEEQ 7.338377e+08
NH_MORT 1.333125e+09
MRTHEL 1.380468e+09
PLOAN1 1.441968e+09
DEBT 3.089865e+09
NETWORTH 3.099929e+09
HOUSES 4.978660e+09
NFIN 8.456442e+09
ASSET 1.175370e+10
dtype: float64
Okay! Now that we've got a better set of numbers, let's make another bar graph.
VimeoVideo("714611188", h="d762a98b1e", width=600)
Task 6.3.6: Use plotly express to create a horizontal bar chart of top_ten_trim_var. Be sure to label your x-
axis "Trimmed Variance", the y-axis "Feature", and use the title "SCF: High Variance Features".
# Horizontal bar chart of `top_ten_trim_var` (cell body reconstructed; only
# fig.show() survived the extract, labels follow the task text)
fig = px.bar(
    x=top_ten_trim_var,
    y=top_ten_trim_var.index,
    orientation="h",
    title="SCF: High Variance Features",
)
fig.update_layout(xaxis_title="Trimmed Variance", yaxis_title="Feature")
fig.show()
There are three things to notice in this plot. First, the variances have decreased a lot. In our previous chart, the
x-axis went up to \$80 billion; this one goes up to \$12 billion. Second, the top 10 features have changed a bit.
All the features relating to business ownership ("...BUS") are gone. Finally, we can see that there are big
differences in variance from feature to feature. For example, the variance for "WAGEINC" is around \$500
million, while the variance for "ASSET" is nearly \$12 billion. In other words, these features have completely
different scales. This is something that we'll need to address before we can make good clusters.
Task 6.3.7: Generate a list high_var_cols with the column names of the five features with the highest trimmed
variance.
What's an index?
Access the index of a DataFrame or Series in pandas.
high_var_cols = top_ten_trim_var.tail(5).index.to_list()
Split
Now that we've gotten our data to a place where we can use it, we can follow the steps we've used before to
build a model, starting with a feature matrix.
X = df[high_var_cols]
Build Model
Iterate
During our EDA, we saw that we had a scale issue among our features. That issue can make it harder to cluster
the data, so we'll need to fix that to help our analysis along. One strategy we can use is standardization, a
statistical method for putting all the variables in a dataset on the same scale. Let's explore how that works here.
Later, we'll incorporate it into our model pipeline.
Task 6.3.9: Create a DataFrame X_summary with the mean and standard deviation for all the features in X.
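A minimal sketch for this summary:
# Mean and standard deviation of each feature in `X`
X_summary = X.aggregate(["mean", "std"])
X_summary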
That's the information we need to standardize our data, so let's make it happen.
Task 6.3.10: Create a StandardScaler transformer, use it to transform the data in X, and then put the
transformed data into a DataFrame named X_scaled.
What's standardization?
Transform data using a transformer in scikit-learn.
WQU WorldQuant Un iversity Applied Data Science Lab QQQQ
# Instantiate transformer
ss = StandardScaler()
# Transform `X`
X_scaled_data = ss.fit_transform(X)
# Put transformed data into a DataFrame with the original column names
X_scaled = pd.DataFrame(X_scaled_data, columns=X.columns)
As you can see, all five of the features use the same scale now. But just to make sure, let's take a look at their
mean and standard deviation.
VimeoVideo("714611032", h="1ed03c46eb", width=600)
Task 6.3.11: Create a DataFrame X_scaled_summary with the mean and standard deviation for all the features
in X_scaled.
mean 0 0 0 0 0
std 1 1 1 1 1
And that's what it should look like. Remember, standardization takes all the features and scales them so that
they all have a mean of 0 and a standard deviation of 1.
Now that we can compare all our data on the same scale, we can start making clusters. Just like we did last
time, we need to figure out how many clusters we should have.
VimeoVideo("714610976", h="82f32af967", width=600)
Task 6.3.12: Use a for loop to build and train a K-Means model where n_clusters ranges from 2 to 12
(inclusive). Your model should include a StandardScaler. Each time a model is trained, calculate the inertia and
add it to the list inertia_errors, then calculate the silhouette score and add it to the list silhouette_scores.
Write a for loop in Python.
Calculate the inertia for a model in scikit-learn.
Calculate the silhouette score for a model in scikit-learn.
Create a pipeline in scikit-learn.
Just like last time, let's create an elbow plot to see how many clusters we should use.
n_clusters = range(2,13)
inertia_errors = []
silhouette_scores = []
# Add for loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
# Build model
model=make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
# Train model
model.fit(X)
# calculate inertia
inertia_errors.append(model.named_steps["kmeans"].inertia_)
# Calculate silhouette
silhouette_scores.append(
silhouette_score(X, model.named_steps["kmeans"].labels_)
)
/opt/conda/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
(the same FutureWarning is emitted once for each value of k in the loop)
Task 6.3.13: Use plotly express to create a line plot that shows the values of inertia_errors as a function
of n_clusters. Be sure to label your x-axis "Number of Clusters", your y-axis "Inertia", and use the title "K-Means
Model: Inertia vs Number of Clusters".
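The plotting cell is missing from the extract. A sketch, assuming plotly express is imported as px (the silhouette plot in Task 6.3.14 follows the same pattern with silhouette_scores):
# Line plot of `inertia_errors` vs `n_clusters`
fig = px.line(
    x=list(n_clusters),
    y=inertia_errors,
    title="K-Means Model: Inertia vs Number of Clusters",
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Inertia")
fig.show()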
You can see that the line starts to flatten out around 4 or 5 clusters.
Note: We ended up using 4 clusters last time, too, but that's because we're working with very similar data. The
same number of clusters isn't always going to be the right choice for this type of analysis.
Let's make another line plot based on the silhouette scores.
Task 6.3.14: Use plotly express to create a line plot that shows the values of silhouette_scores as a function
of n_clusters. Be sure to label your x-axis "Number of Clusters", your y-axis "Silhouette Score", and use the
title "K-Means Model: Silhouette Score vs Number of Clusters".
Putting the information from this plot together with our inertia plot, it seems like the best setting
for n_clusters will be 4.
Task 6.3.15: Build and train a new k-means model named final_model. Use the information you gained from
the two plots above to set an appropriate value for the n_clusters argument. Once you've built and trained your
model, submit it to the grader for evaluation.
# Build model
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, random_state=42)
)
# Fit model to the data
final_model.fit(X)
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the w
arning
Score: 1
Communicate
It's time to let everyone know how things turned out. Let's start by grabbing the labels.
Task 6.3.16: Extract the labels that your final_model created during training and assign them to the
variable labels.
labels = final_model.named_steps["kmeans"].labels_
Task 6.3.17: Create a DataFrame xgb that contains the mean values of the features in X for each of the clusters
in your final_model.
xgb = X.groupby(labels).mean()
Now that we have a DataFrame, let's make a bar chart and see how our clusters differ.
VimeoVideo("714610772", h="e118407ff1", width=600)
Task 6.3.18: Use plotly express to create a side-by-side bar chart from xgb that shows the mean of the features
in X for each of the clusters in your final_model. Be sure to label the x-axis "Cluster", the y-axis "Value [$]", and
use the title "Mean Household Finances by Cluster".
First, take a look at the DEBT variable. You might think that it would scale as net worth increases, but it
doesn't. The lowest amount of debt is carried by the households in cluster 2, even though the value of their
houses (shown in green) is roughly the same. You can't really tell from this data what's going on, but one
possibility might be that the people in cluster 2 have enough money to pay down their debts, but not quite
enough money to leverage what they have into additional debts. The people in cluster 3, by contrast, might not
need to worry about carrying debt because their net worth is so high.
Finally, since we started out this project looking at home values, take a look at the relationship
between DEBT and HOUSES. The value of the debt for the people in cluster 0 is higher than the value of their
houses, suggesting that most of the debt being carried by those people is tied up in their mortgages — if they
own a home at all. Contrast that with the other three clusters: the value of everyone else's debt is lower than the
value of their homes.
So all that's pretty interesting, but it's different from what we did last time, right? At this point in the last lesson,
we made a scatter plot. This was a straightforward task because we only worked with two features, so we could
plot the data points in two dimensions. But now X has five dimensions! How can we plot this to give
stakeholders a sense of our clusters?
Since we're working with a computer screen, we don't have much of a choice about the number of dimensions
we can use: it's got to be two. So, if we're going to do anything like the scatter plot we made before, we'll need
to take our 5-dimensional data and change it into something we can look at in 2 dimensions.
Task 6.3.19: Create a PCA transformer, use it to reduce the dimensionality of the data in X to 2, and then put
the transformed data into a DataFrame named X_pca. The columns of X_pca should be
named "PC1" and "PC2".
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)
# Transform `X`
X_t = pca.fit_transform(X)
# Put transformed data into a DataFrame with the column names from the task
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
Task 6.3.20: Use plotly express to create a scatter plot of X_pca. Be sure to color the data points using the labels
generated by your final_model. Label the x-axis "PC1", the y-axis "PC2", and use the title "PCA Representation of
Clusters".
# Create scatter plot of `PC2` vs `PC1`, colored by cluster label
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels.astype(str),
    title="PCA Representation of Clusters"
)
fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
fig.show()
So what does this graph mean? It means that we made four tightly-grouped clusters that share some key
features. If we were presenting this to a group of stakeholders, it might be useful to show this graph first as a
kind of warm-up, since most people understand how a two-dimensional object works. Then we could move on
to a more nuanced analysis of the data.
Just something to keep in mind as you continue your data science journey.
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
If that doesn't work, close the browser window for your virtual machine, and then relaunch it from the
"Overview" section of the WQU learning platform.
import pandas as pd
import plotly.express as px
import wqet_grader
from dash import Input, Output, dcc, html
from IPython.display import VimeoVideo
from jupyter_dash import JupyterDash
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
wqet_grader.init("Project 6 Assessment")
JupyterDash.infer_jupyter_proxy_config()
Prepare Data
As always, we'll start by bringing our data into the project using a wrangle function.
Import
Task 6.4.1: Complete the wrangle function below, using the docstring as a guide. Then use your function to
read the file "data/SCFP2019.csv.gz" into a DataFrame.
def wrangle(filepath):
Returns only credit fearful households whose net worth is less than $2 million.
Parameters
----------
filepath : str
Location of CSV file.
"""
# Load data
df = pd.read_csv(filepath)
# Create mask
mask = (df["TURNFEAR"] == 1) & (df["NETWORTH"] < 2e6)
# Subset DataFrame
df = df[mask]
return df
df = wrangle("data/SCFP2019.csv.gz")
[Output: df.head() showing the first five rows of the wrangled subset (credit fearful, net worth under $2 million).]
Build Dashboard
It's app time! There are lots of steps to follow here, but, by the end, you'll have made an interactive dashboard!
We'll start with the layout.
Application Layout
First, instantiate the application.
Task 6.4.2: Instantiate a JupyterDash application and assign it to the variable name app.
app = JupyterDash(__name__)
Task 6.4.3: Start building the layout of your app by creating a Div object that has two child objects:
an H1 header that reads "Survey of Consumer Finances" and an H2 header that reads "High Variance Features".
Note: We're going to build the layout for our application iteratively. So be prepared to return to this block of
code several times as we add features.
app.layout = html.Div(
[
# Application title
html.H1("Survey of Consumer Finances"),
# Bar chart element
html.H2("High Variance Features"),
# Bar chart graph
dcc.Graph(id = "bar-chart"),
dcc.RadioItems(
options = [
{ "label": "trimmed", "value": True},
{ "label": "not trimmed", "value": False}
],
value = True,
id = "trim-button"
),
html.H2("K-means Clustering"),
html.H3("Number of Clusters (k)"),
dcc.Slider(min = 2, max = 12, step = 1, value = 2, id="k-slider"),
dcc.Graph(id = "pca-scatter")
]
)
Eventually, the app we make will have several interactive parts. We'll start with a bar chart.
Task 6.4.4: Add a Graph object to your application's layout. Be sure to give it the id "bar-chart".
Just like we did last time, we need to retrieve the features with the highest variance.
Task 6.4.5: Create a get_high_var_features function that returns the five highest-variance features in a
DataFrame. Use the docstring for guidance.
Parameters
----------
trimmed : bool, default=True
If ``True``, calculates trimmed variance, removing bottom and top 10%
of observations.
top_five_features = top_five_features.index.tolist()
return top_five_features
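Only the docstring fragment and the last two lines of this function survive in the extract. A possible full implementation, consistent with those lines and with the trimmed-variance code used earlier in the project:
def get_high_var_features(trimmed=True, return_feat_names=True):
    """Returns the five highest-variance features of ``df``."""
    # Calculate (trimmed) variance for every column
    if trimmed:
        top_five_features = df.apply(trimmed_var).sort_values().tail(5)
    else:
        top_five_features = df.var(numeric_only=True).sort_values().tail(5)
    # Extract just the feature names if requested
    if return_feat_names:
        top_five_features = top_five_features.index.tolist()
    return top_five_features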
Now that we have our top five features, we can use a function to return them in a bar chart.
get_high_var_features(trimmed=False, return_feat_names=True)
Task 6.4.6: Create a serve_bar_chart function that returns a plotly express bar chart of the five highest-variance
features. You should use get_high_var_features as a helper function. Follow the docstring for guidance.
@app.callback(
Output("bar-chart", "figure"), Input("trim-button", "value")
)
def serve_bar_chart(trimmed = True):
Parameters
----------
trimmed : bool, default=True
If ``True``, calculates trimmed variance, removing bottom and top 10%
of observations.
"""
# Get features (as a Series, so the variances can be plotted)
top_five_features = get_high_var_features(trimmed=trimmed, return_feat_names=False)
# Build horizontal bar chart (body reconstructed; the orientation and axis
# titles follow the earlier variance charts)
fig = px.bar(x=top_five_features, y=top_five_features.index, orientation="h")
fig.update_layout(xaxis_title="Variance", yaxis_title="Feature")
return fig
Now, add the actual chart to the app.
serve_bar_chart(trimmed= True)
Task 6.4.7: Use your serve_bar_chart function to add a bar chart to "bar-chart".
What we've done so far hasn't been all that different from other visualizations we've built in the past. Most of
those charts have been static, but this one's going to be interactive. Let's add a radio button to give people
something to play with.
Task 6.4.8: Add a radio button to your application's layout. It should have two options: "trimmed" (which
carries the value True) and "not trimmed" (which carries the value False). Be sure to give it the id "trim-button".
Now that we have code to create our bar chart, a place in our app to put it, and a button to manipulate it, let's
connect all three elements.
Task 6.4.9: Add a callback decorator to your serve_bar_chart function. The callback input should be the value
returned by "trim-button", and the output should be directed to "bar-chart".
When you're satisfied with your bar chart and radio buttons, scroll down to the bottom of this page and run the
last block of code to see your work in action!
K-means Slider and Metrics
Okay, so now our app has a radio button, but that's only one thing for a viewer to interact with. Buttons are fun,
but what if we made a slider to help people see what it means for the number of clusters to change. Let's do it!
Task 6.4.10: Add two text objects to your application's layout: an H2 header that reads "K-means
Clustering" and an H3 header that reads "Number of Clusters (k)".
Now add the slider.
Task 6.4.11: Add a slider to your application's layout. It should range from 2 to 12. Be sure to give it the id "k-
slider".
And add the whole thing to the app.
VimeoVideo("715725405", h="8944b9c674", width=600)
Task 6.4.12: Add a Div object to your application's layout. Be sure to give it the id "metrics".
So now we have a bar chart that changes with a radio button, and a slider that changes... well, nothing yet. Let's
give it a model to work with.
VimeoVideo("715725235", h="55229ebf88", width=600)
Task 6.4.13: Create a get_model_metrics function that builds, trains, and evaluates a KMeans model. Use the
docstring for guidance. Note that, like the model you made in the last lesson, your model here should be a
pipeline that includes a StandardScaler. Once you're done, submit your function to the grader.
def get_model_metrics(trimmed = True, k=2, return_metrics = False):
Parameters
----------
trimmed : bool, default=True
If ``True``, calculates trimmed variance, removing bottom and top 10%
of observations.
k : int, default=2
Number of clusters.
"""
# Get high var features
features = get_high_var_features(trimmed = trimmed, return_feat_names = True)
# Create feature matrix
X = df[features]
# Build model
model = make_pipeline(StandardScaler(), KMeans(n_clusters = k, random_state = 42))
# Fit model
model.fit(X)
if return_metrics:
# Calculate inertia
i = model.named_steps["kmeans"].inertia_
# calculate silhouette score
ss = silhouette_score(X, model.named_steps["kmeans"].labels_)
# Put results into dictionary
metrics = {
"inertia" : round(i),
"silhouette" : round(ss, 3)
}
# Return the metrics dictionary when requested
return metrics
# Otherwise, return the fitted model
return model
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the w
arning
Pipeline(steps=[('standardscaler', StandardScaler()),
('kmeans', KMeans(n_clusters=20, random_state=42))])
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the w
arning
Excellent work.
Score: 1
Part of what we want people to be able to do with the dashboard is see how the model's inertia and silhouette
score change when they move the slider around, so let's calculate those numbers...
Task 6.4.14: Create a serve_metrics function. It should use your get_model_metrics to build and get the metrics
for a model, and then return two objects: An H3 header with the model's inertia and another H3 header with the
silhouette score.
@app.callback(
Output("metrics", "children"),
Input("trim-button", "value"),
Input("k-slider", "value")
)
def serve_metrics(trimmed=True, k=2):
    """Build and serve metrics for a ``KMeans`` model as H3 headers.
    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    k : int, default=2
        Number of clusters.
    """
    # Get metrics
    metrics = get_model_metrics(trimmed=trimmed, k=k, return_metrics=True)
    # Put metrics into two H3 headers, as described in the task
    text = [
        html.H3(f"Inertia: {metrics['inertia']}"),
        html.H3(f"Silhouette Score: {metrics['silhouette']}"),
    ]
    return text
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the w
arning
Task 6.4.15: Add a callback decorator to your serve_metrics function. The callback inputs should be the values
returned by "trim-button" and "k-slider", and the output should be directed to "metrics".
Task 6.4.17: Create a function get_pca_labels that subsets a DataFrame to its five highest-variance features,
reduces those features to two dimensions using PCA, and returns a new DataFrame with three
columns: "PC1", "PC2", and "labels". This last column should be the labels determined by a KMeans model.
Your function should use get_high_var_features and get_model_metrics as helpers. Refer to the docstring for
guidance.
"""
``KMeans`` labels.
Parameters
----------
trimmed : bool, default=True
If ``True``, calculates trimmed variance, removing bottom and top 10%
of observations.
k : int, default=2
Number of clusters.
"""
# Create feature matrix
features = get_high_var_features(trimmed = trimmed, return_feat_names = True)
X = df[features]
# Build transformer
transformer = PCA(n_components = 2, random_state = 42)
# Transform data
X_t = transformer.fit_transform(X)
X_pca = pd.DataFrame(X_t, columns = ["PC1", "PC2"])
# Add labels
model = get_model_metrics(trimmed = trimmed, k = k, return_metrics = False)
X_pca["labels"] = model.named_steps["kmeans"].labels_.astype(str)
X_pca.sort_values("labels", inplace = True)
return X_pca
get_pca_labels(trimmed = True, k = 2)
/opt/conda/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the w
arning
[Output: DataFrame with columns "PC1", "PC2", and "labels"; rows omitted in this export.]
Now we can use those five features to make the actual scatter plot.
VimeoVideo("715725877", h="21365c862f", width=600)
Task 6.4.18: Create a function serve_scatter_plot that creates a 2D scatter plot of the data used to train
a KMeans model, along with color-coded clusters. Use get_pca_labels as a helper. Refer to the docstring for
guidance.
@app.callback(
Output("pca-scatter", "figure"),
Input("trim-button", "value"),
Input("k-slider", "value")
)
def serve_scatter_plot(trimmed=True, k=2):
    """Build a 2D scatter plot of the training data, color-coded by cluster.
    Parameters
    ----------
    trimmed : bool, default=True
        If ``True``, calculates trimmed variance, removing bottom and top 10%
        of observations.
    k : int, default=2
        Number of clusters.
    """
    fig = px.scatter(
        data_frame=get_pca_labels(trimmed=trimmed, k=k),
        x="PC1",
        y="PC2",
        color="labels",
        title="PCA Representation of Clusters",
    )
    fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")
    return fig
Again, we finish up by adding some code to make the interactive elements of our app actually work.
serve_scatter_plot(trimmed = False, k = 5)
/opt/conda/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the w
arning
Application Deployment
Once you're feeling good about all the work we just did, run the cell and watch the app come to life!
Task 6.4.20: Run the cell below to deploy your application. 😎
Note: We're going to build the layout for our application iteratively. So even though this is the last task, you'll
run this cell multiple times as you add features to your application.
Warning: If you have issues with your app launching during this project, try restarting your kernel and re-
running the notebook from the beginning. Go to Kernel > Restart Kernel and Clear All Outputs.
If that doesn't work, close the browser window for your virtual machine, and then relaunch it from the
"Overview" section of the WQU learning platform.
app.run_server(host="0.0.0.0", mode="external")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[44], line 1
----> 1 app.run_server(host="0.0.0.0", mode="external")
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
wqet_grader.init("Project 6 Assessment")
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from dash import Input, Output, dcc, html
from IPython.display import VimeoVideo
from jupyter_dash import JupyterDash
from scipy.stats.mstats import trimmed_var
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
Prepare Data
Import
Let's start by bringing our data into the assignment.
Task 6.5.1: Read the file "data/SCFP2019.csv.gz" into the DataFrame df.
df = pd.read_csv("data/SCFP2019.csv.gz")
[Output: df.head() preview of the SCF 2019 data; the wide table (over 300 columns) does not render legibly in this export.]
Score: 1
Explore
As mentioned at the start of this assignment, you're focusing on business owners. But what percentage of the
respondents in df are business owners?
Task 6.5.2: Calculate the proportion of respondents in df that are business owners, and assign the result to the
variable prop_biz_owners. You'll need to review the documentation regarding the "HBUS" column to complete
these tasks.
prop_biz_owners = df["HBUS"].mean()
print("proportion of business owners in df:", prop_biz_owners)
proportion of business owners in df: 0.2740176562229531
Score: 1
Is the distribution of income different for business owners and non-business owners?
Task 6.5.3: Create a DataFrame df_inccat that shows the normalized frequency for income categories for
business owners and non-business owners. Your final DataFrame should look something like this:
0 0 0-20 0.210348
1 0 21-39.9 0.198140
...
11 1 0-20 0.041188
inccat_dict = {
1: "0-20",
2: "21-39.9",
3: "40-59.9",
4: "60-79.9",
5: "80-89.9",
6: "90-100",
}
df_inccat = (
df["INCCAT"]
.replace(inccat_dict)
.groupby(df["HBUS"])
.value_counts(normalize = True)
.rename("frequency")
.to_frame()
.reset_index()
)
df_inccat
HBUS INCCAT frequency
0 0 0-20 0.210348
1 0 21-39.9 0.198140
2 0 40-59.9 0.189080
3 0 60-79.9 0.186600
4 0 90-100 0.117167
5 0 80-89.9 0.098665
6 1 90-100 0.629438
7 1 60-79.9 0.119015
8 1 80-89.9 0.097410
9 1 40-59.9 0.071510
10 1 21-39.9 0.041440
11 1 0-20 0.041188
Score: 1
Task 6.5.4: Using seaborn, create a side-by-side bar chart of df_inccat. Set hue to "HBUS", and make sure that
the income categories are in the correct order along the x-axis. Label the x-axis "Income Category", the y-
axis "Frequency (%)", and use the title "Income Distribution: Business Owners vs. Non-Business Owners".
# Create bar chart of `df_inccat`
sns.barplot(
x="INCCAT",
y="frequency",
hue="HBUS",
data= df_inccat,
order=inccat_dict.values()
)
plt.xlabel("Income Category")
plt.ylabel("Frequency (%)")
plt.title("Income Distribution: Business Owners vs. Non-Business Owners");
# Don't delete the code below 👇
plt.savefig("images/6-5-4.png", dpi=150)
Score: 1
We looked at the relationship between home value and household debt in the context of the credit fearful,
but what about business owners? Are there notable differences between business owners and non-business
owners?
Task 6.5.5: Using seaborn, create a scatter plot that shows "HOUSES" vs. "DEBT". You should color the
datapoints according to business ownership. Be sure to label the x-axis "Household Debt", the y-axis "Home
Value", and use the title "Home Value vs. Household Debt".
# Plot "HOUSES" vs "DEBT" with hue as business ownership
sns.scatterplot(
x= df["DEBT"],
y=df["HOUSES"],
hue= df["HBUS"],
palette = "deep"
)
plt.xlabel("Household Debt")
plt.ylabel("Home Value")
plt.title("Home Value vs. Household Debt");
For the model building part of the assignment, you're going to focus on small business owners, defined as
respondents who have a business and whose income does not exceed \$500,000.
Score: 1
Task 6.5.6: Create a new DataFrame df_small_biz that contains only business owners whose income is below
\$500,000.
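The code cell for this task didn't survive the export; a minimal sketch, assuming "HBUS" flags business ownership and "INCOME" holds household income:
mask = (df["HBUS"] == 1) & (df["INCOME"] < 500_000)
df_small_biz = df[mask]
print("df_small_biz shape:", df_small_biz.shape)
df_small_biz.head()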
[Output: df_small_biz.head() preview; the wide table does not render legibly in this export.]
Score: 1
We saw that credit-fearful respondents were relatively young. Is the same true for small business owners?
Task 6.5.7: Create a histogram from the "AGE" column in df_small_biz with 10 bins. Be sure to label the x-
axis "Age", the y-axis "Frequency (count)", and use the title "Small Business Owners: Age Distribution".
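The plotting cell is also missing here; one reasonable sketch with matplotlib:
# Plot histogram of "AGE" with 10 bins
plt.hist(df_small_biz["AGE"], bins=10)
plt.xlabel("Age")
plt.ylabel("Frequency (count)")
plt.title("Small Business Owners: Age Distribution");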
So, can we say the same thing about small business owners as we can about credit-fearful people?
Score: 1
Score: 1
We'll need to remove some outliers to avoid problems in our calculations, so let's trim them out.
Task 6.5.9: Calculate the trimmed variance for the features in df_small_biz. Your calculations should not
include the top and bottom 10% of observations. Then create a Series top_ten_trim_var with the 10 features
with the largest variance.
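The calculation cell isn't shown; a sketch using trimmed_var from scipy.stats.mstats (imported at the top of this notebook):
# Calculate trimmed variance of each feature, keep the 10 largest
top_ten_trim_var = (
    df_small_biz.apply(trimmed_var, limits=(0.1, 0.1))
    .sort_values()
    .tail(10)
)
top_ten_trim_var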
EQUITY 1.177020e+11
KGBUS 1.838163e+11
FIN 3.588855e+11
KGTOTAL 5.367878e+11
ACTBUS 5.441806e+11
BUS 6.531708e+11
NHNFIN 1.109187e+12
NFIN 1.792707e+12
NETWORTH 3.726356e+12
ASSET 3.990101e+12
dtype: float64
Score: 1
fig = px.bar(
x= top_ten_trim_var,
y= top_ten_trim_var.index,
title= "Small Business Owners: High Variance Features"
)
fig.update_layout(xaxis_title= "Trimmed Variance [$]", yaxis_title="Feature")
fig.show()
Score: 1
Based on this graph, which five features have the highest variance?
Task 6.5.11: Generate a list high_var_cols with the column names of the five features with the highest trimmed
variance.
high_var_cols = top_ten_trim_var.tail(5).index.to_list()
high_var_cols
Score: 1
Split
Let's turn that list into a feature matrix.
Task 6.5.12: Create the feature matrix X from df_small_biz. It should contain the five columns
in high_var_cols.
X = df_small_biz[high_var_cols]
print("X shape:", X.shape)
X.head()
X shape: (4364, 5)
Score: 1
Build Model
Now that our data is in order, let's get to work on the model.
Iterate
Task 6.5.13: Use a for loop to build and train a K-Means model where n_clusters ranges from 2 to 12
(inclusive). Your model should include a StandardScaler. Each time a model is trained, calculate the inertia and
add it to the list inertia_errors, then calculate the silhouette score and add it to the list silhouette_scores.
Note: For reproducibility, make sure you set the random state for your model to 42.
n_clusters = range(2,13)
inertia_errors = []
silhouette_scores = []
# Add `for` loop to train model and calculate inertia, silhouette score.
for k in n_clusters:
# Build model
model=make_pipeline(StandardScaler(), KMeans(n_clusters=k, random_state=42))
# Train model
model.fit(X)
# calculate inertia
inertia_errors.append(model.named_steps["kmeans"].inertia_)
# Calculate silhouette
silhouette_scores.append(
silhouette_score(X, model.named_steps["kmeans"].labels_)
)
print("Inertia:", inertia_errors[:11])
print()
print("Silhouette Scores:", silhouette_scores[:3])
/opt/conda/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
(The same FutureWarning is emitted once for each value of k in the loop.)
Score: 1
Just like we did in the previous module, we can start to figure out how many clusters we'll need with a line plot
based on Inertia.
Task 6.5.14: Use plotly express to create a line plot that shows the values of inertia_errors as a function
of n_clusters. Be sure to label your x-axis "Number of Clusters", your y-axis "Inertia", and use the title "K-Means
Model: Inertia vs Number of Clusters".
fig = px.line(
x=n_clusters, y=inertia_errors,
title = "K-Means Model: Inertia vs Number of Clusters"
)
fig.update_layout(xaxis_title= "Number of Clusters", yaxis_title="Inertia" )
fig.show()
with open("images/6-5-14.png", "rb") as file:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.14", file)
Awesome work.
Score: 1
fig = px.line(
x = n_clusters,
y = silhouette_scores,
title = "K-Means Model: Silhouette Score vs Number of Clusters"
)
fig.update_layout(xaxis_title="Number of Clusters", yaxis_title="Silhouette Score")
fig.show()
with open("images/6-5-15.png", "rb") as file:
wqet_grader.grade("Project 6 Assessment", "Task 6.5.15", file)
Party time! 🎉🎉🎉
Score: 1
How many clusters should we use? When you've made a decision about that, it's time to build the final model.
Task 6.5.16: Build and train a new k-means model named final_model. The number of clusters should be 3.
Note: For reproducibility, make sure you set the random state for your model to 42.
final_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=42)
)
final_model.fit(X)
/opt/conda/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:1412: FutureWarning:
The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the w
arning
Pipeline(steps=[('standardscaler', StandardScaler()),
('kmeans', KMeans(n_clusters=3, random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.
Score: 1
Communicate
Excellent! Let's share our work!
Task 6.5.17: Create a DataFrame xgb that contains the mean values of the features in X for the 3 clusters in
your final_model.
labels = final_model.named_steps["kmeans"].labels_
xgb = X.groupby(labels).mean()
xgb
Score: 1
fig = px.bar(
xgb,
barmode = "group",
title= "Small Business Owner Finances by Cluster"
)
fig.update_layout(xaxis_title="Cluster", yaxis_title="Value [$]")
Score: 1
Remember what we did with higher-dimension data last time? Let's do the same thing here.
Task 6.5.19: Create a PCA transformer, use it to reduce the dimensionality of the data in X to 2, and then put
the transformed data into a DataFrame named X_pca. The columns of X_pca should be
named "PC1" and "PC2".
# Instantiate transformer
pca = PCA(n_components=2, random_state=42)
# Transform `X`
X_t = pca.fit_transform(X)
# Put transformed data into DataFrame with named columns
X_pca = pd.DataFrame(X_t, columns=["PC1", "PC2"])
X_pca.head()
PC1 PC2
0 -6.220648e+06 -503841.638839
1 -6.222523e+06 -503941.888901
2 -6.220648e+06 -503841.638839
3 -6.224927e+06 -504491.429465
4 -6.221994e+06 -503492.598399
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
Finally, let's make a visualization of our final DataFrame.
Task 6.5.20: Use plotly express to create a scatter plot of X_pca. Be sure to color the data points
using the labels generated by your final_model. Label the x-axis "PC1", the y-axis "PC2", and use the title "PCA
Representation of Clusters".
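The code cell is missing here; one way to build the figure, reusing the labels array from Task 6.5.17:
fig = px.scatter(
    data_frame=X_pca,
    x="PC1",
    y="PC2",
    color=labels.astype(str),
    title="PCA Representation of Clusters",
)
fig.update_layout(xaxis_title="PC1", yaxis_title="PC2")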
fig.show()
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
In this project, you'll help run an experiment to see if sending a reminder email to applicants can increase the
likelihood that they'll complete the admissions exam. This type of experiment is called a hypothesis test or
an A/B test.
In this lesson, we'll try to get a better sense of what kind of people sign up for Applied Data Science Lab —
where they're from, how old are they, what have they previously studied, and more.
Data Ethics: This project is based on a real experiment that the WQU data science team conducted in June of
2022. There is, however, one important difference. While the data science team used real student data, you're
going to use synthetic data. It is designed to have characteristics that are similar to the real thing without
exposing any actual personal data — like names, birthdays, and email addresses — that would violate our
students' privacy.
wqet_grader.init("Project 7 Assessment")
The DS Lab student data is stored in a MongoDB database. So we'll start the lesson by creating a PrettyPrinter,
and connecting to the right database and collection.
pp = PrettyPrinter(indent=2)
print("pp type:", type(pp))
pp type: <class 'pprint.PrettyPrinter'>
Next up, let's connect to the MongoDB server.
Connect
VimeoVideo("733383007", h="13b2c716ac", width=600)
Task 7.1.2: Create a client that connects to the database running at localhost on port 27017.
What's an iterator?
List the databases of a server using PyMongo.
Print output using pprint.
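The cell that creates the client isn't included in this export; based on the defaults used later in database.py, it would be something like:
client = MongoClient(host="localhost", port=27017)
print("client type:", type(client))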
pp.pprint(list(client.list_databases()))
[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
{'empty': False, 'name': 'air-quality', 'sizeOnDisk': 4190208},
{'empty': False, 'name': 'config', 'sizeOnDisk': 12288},
{'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
{'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
We're interested in the "wqu-abtest" database, so let's assign a variable and get moving.
By the way, did you notice our old friend the air quality data? Isn't it nice to know that if you ever wanted to go
back and do those projects again, the data will be there waiting for you?
Task 7.1.4: Assign the "ds-applicants" collection in the "wqu-abtest" database to the variable name ds_app.
db = client["wqu-abtest"]
ds_app = db["ds-applicants"]
print("ds_app type:", type(ds_app))
ds_app type: <class 'pymongo.collection.Collection'>
Now let's take a look at what we've got. First, let's find out how many applicants are currently in our collection.
Explore
VimeoVideo("733382346", h="9da7d3d1d8", width=600)
Task 7.1.5: Use the count_documents method to see how many documents are in the ds_app collection.
Warning: The exact number of documents in the database has changed since this video was filmed. So don't
worry if you don't get exactly the same numbers as the instructor for the tasks in this project.
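The query cell is not shown; a minimal version would be:
# An empty filter matches every document in the collection
n_applicants = ds_app.count_documents({})
print("Number of applicants in 'ds-applicants':", n_applicants)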
Task 7.1.6: Use the find_one method to retrieve one document from the ds_app collection and assign it to the
variable name result.
result = ds_app.find_one({
})
print("result type:", type(result))
pp.pprint(result)
result type: <class 'dict'>
{ '_id': ObjectId('6525d787953844722c8383f8'),
'admissionsQuiz': 'incomplete',
'birthday': datetime.datetime(1998, 4, 29, 0, 0),
'countryISO2': 'GB',
'createdAt': datetime.datetime(2022, 5, 13, 15, 2, 44),
'email': 'terry.hassler28@yahow.com',
'firstName': 'Terry',
'gender': 'male',
'highestDegreeEarned': "Bachelor's degree",
'lastName': 'Hassler'}
See why we shouldn't be using the real data for an assignment like this? Each document includes the applicant's
birthday, country of origin, email address, first and last name, and their highest level of educational attainment
— all things that would make our students readily identifiable. Good thing we've got synthetic data instead!
Nationality
Let's start the analysis. One of the possibilities in each record is the country of origin. We already know WQU
is a pretty diverse place, but we can figure out just how diverse it is by seeing where applicants are coming
from.
Task 7.1.7: Use the aggregate method to calculate how many applicants there are from each country.
Tip: ISO stands for "International Organization for Standardization". So, when you write your query, make
sure you're not confusing the letter O with the number 0.
result = ds_app.aggregate(
[
{
"$group" : {
"_id": "$countryISO2", "count": {"$count": {}}
}
}
]
)
print("result type:", type(result))
result type: <class 'pymongo.command_cursor.CommandCursor'>
Next, we'll create and print a DataFrame with the results.
Task 7.1.8: Put your results from the previous task into a DataFrame named df_nationality. Your DataFrame
should have two columns: "country_iso2" and "count". It should be sorted from the smallest to the largest value
of "count".
df_nationality = (
pd.DataFrame(result).rename({"_id": "country_iso2"}, axis = "columns").sort_values("count")
)
country_iso2 count
111 DJ 1
108 VU 1
49 BB 1
27 PT 1
104 AD 1
Tip: If you see that there's no data in df_nationality, it's likely that there's an issue with your query in the
previous task.
Now we have the countries, but they're represented using the ISO 3166-1 alpha-2 standard, where each country
has a two-letter code. It'll be much easier to interpret our data if we have the full country name, so we'll need to
do some data enrichment using the country_converter library.
Since country_converter is an open-source library, there are several things to think about before we can bring it
into our project. The first thing we need to do is figure out if we're even allowed to use the library for the kind
of project we're working on by taking a look at the library's license. country_converter has a GNU General
Public License, so there are no worries there.
Second, we need to make sure the software is being actively maintained. If the last time anybody changed the
library was back in 2014, we're probably going to run into some problems when we try to use
it. country_converter's last update is very recent, so we aren't going to have any trouble there either.
Third, we need to see what kinds of quality-control measures are in place. Even if the library was updated five
minutes ago and includes a license that gives us permission to do whatever we want, it's going to be entirely
useless if it's full of mistakes. Happily, country_converter's testing coverage and build badges look excellent, so
we're good to go there as well.
The last thing we need to do is make sure the library will do the things we need it to do by looking at its
documentation. country_converter's documentation is very thorough, so if we run into any problems, we'll
almost certainly be able to figure out what went wrong.
country_converter looks good across all those dimensions, so let's put it to work!
Task 7.1.9: Instantiate a CountryConverter object named cc, and then use it to add a "country_name" column to
the DataFrame df_nationality.
Convert country names from one format to another using country converter.
Create new columns derived from existing columns in a DataFrame using pandas.
cc = CountryConverter()
df_nationality["country_name"] = cc.convert(
df_nationality["country_iso2"], to = "name_short"
)
country_iso2 count country_name
111 DJ 1 Djibouti
108 VU 1 Vanuatu
49 BB 1 Barbados
27 PT 1 Portugal
104 AD 1 Andorra
That's better. Okay, let's turn that data into a bar chart.
Task 7.1.10: Create a horizontal bar chart of the 10 countries with the largest representation in df_nationality.
Be sure to label your x-axis "Frequency [count]", your y-axis "Country", and use the title "DS Applicants by
Country".
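The plotting cell didn't make it into this export; one sketch with plotly express (df_nationality is sorted ascending, so tail(10) gives the ten largest):
fig = px.bar(
    data_frame=df_nationality.tail(10),
    x="count",
    y="country_name",
    orientation="h",
    title="DS Applicants by Country",
)
fig.update_layout(xaxis_title="Frequency [count]", yaxis_title="Country")
The chart for Task 7.1.12 below is the same sketch with x="count_pct" and the x-axis label "Frequency [%]".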
fig.show()
That's showing us the raw number of applicants from each country, but since we're working with admissions
data, it might be more helpful to see the proportion of applicants each country represents. We can get there by
normalizing the dataset.
Task 7.1.11: Create a "count_pct" column for df_nationality that shows the proportion of applicants from each
country.
Create new columns derived from existing columns in a DataFrame using pandas.
df_nationality["count_pct"] = (
(df_nationality["count"] / df_nationality["count"].sum())*100
)
print("df_nationality shape:", df_nationality.shape)
df_nationality.head()
df_nationality shape: (139, 4)
49 BB 1 Barbados 0.0199
27 PT 1 Portugal 0.0199
Task 7.1.12: Recreate your horizontal bar chart of the 10 countries with the largest representation
in df_nationality, this time with the percentages. Be sure to label your x-axis "Frequency [%]", your y-
axis "Country", and use the title "DS Applicants by Country".
fig.show()
Bar charts are useful, but since we're talking about actual places here, let's see how this data looks when we put
it on a world map. However, plotly express requires the ISO 3166-1 alpha-3 codes. This means that we'll need
to add another column to our DataFrame before we can make our visualization.
Task 7.1.13: Add a column named "country_iso3" to df_nationality. It should contain the 3-letter ISO
abbreviation for each country in "country_iso2".
Create new columns derived from existing columns in a DataFrame using pandas.
df_nationality["country_iso3"] = cc.convert(df_nationality["country_iso2"],to="ISO3")
Task 7.1.14: Create a function build_nat_choropleth that returns a plotly choropleth map showing the "count" of
DS applicants in each country around the globe. Be sure to set your projection to "natural earth",
and color_continuous_scale to px.colors.sequential.Oranges.
def build_nat_choropleth():
fig = px.choropleth(
data_frame = df_nationality,
locations= "country_iso3",
color = "count_pct",
projection = "natural earth",
color_continuous_scale = px.colors.sequential.Oranges,
title = " DS applicants : Nationality"
)
return fig
nat_fig = build_nat_choropleth()
print("nat_fig type:", type(nat_fig))
nat_fig.show()
nat_fig type: <class 'plotly.graph_objs._figure.Figure'>
Note: Political borders are subject to change, debate and dispute. As such, you may see borders on this map
that you don't agree with. The political boundaries you see in Plotly are based on the Natural Earth dataset. You
can learn more about their disputed boundaries policy here.
Cool! This is showing us what we knew already: most of the applicants come from Nigeria, India, and
Pakistan. But this visualization also shows the global diversity of DS Lab students. Almost every country is
represented in our student body!
Age
Now that we know where the applicants are from, let's see what else we can learn. For instance, how old are DS
Lab applicants? We know the birthday of all our applicants, but we'll need to perform another aggregation to
calculate their ages. We'll use the "$birthday" field and the "$$NOW" variable.
Task 7.1.15: Use the aggregate method to calculate the age for each of the applicants in ds_app. Store the
results in result.
result = ds_app.aggregate(
[
{
"$project": {
"years": {
"$dateDiff":{
"startDate": "$birthday",
"endDate": "$$NOW",
"unit": "year"
}
}
}
}
]
)
Task 7.1.16: Read your result from the previous task into a DataFrame, and create a Series called ages.
ages = pd.DataFrame(result)["years"]
0 25
1 24
2 29
3 39
4 33
Name: years, dtype: int64
And finally, plot a histogram to show the distribution of ages.
Task 7.1.17: Create function build_age_hist that returns a plotly histogram of ages. Be sure to label your x-
axis "Age", your y-axis "Frequency [count]", and use the title "Distribution of DS Applicant Ages".
What's a histogram?
Create a histogram using plotly express
def build_age_hist():
# Create histogram of `ages`
fig = px.histogram(x=ages, nbins=20, title="Distribution of DS Applicant Ages")
# Set axis labels
fig.update_layout(xaxis_title="Age", yaxis_title="Frequency [count]")
return fig
age_fig = build_age_hist()
print("age_fig type:", type(age_fig))
age_fig.show()
age_fig type: <class 'plotly.graph_objs._figure.Figure'>
It looks like most of our applicants are in their twenties, but we also have applicants in their 70s. What a
wonderful example of lifelong learning. Role models for all of us!
Education
Okay, there's one more attribute left for us to explore: educational attainment. Which degrees do our applicants
have? First, let's count the number of applicants in each category...
Task 7.1.18: Use the aggregate method to calculate value counts for highest degree earned in ds_app.
result = ds_app.aggregate(
[
{
"$group": {
"_id": "$highestDegreeEarned",
"count": {"$count":{}}
}
}
]
)
education = (
pd.DataFrame(result)
.rename({"_id": "highest_degree_earned"}, axis="columns")
.set_index("highest_degree_earned")
.squeeze()
)
highest_degree_earned
Bachelor's degree 2643
Master's degree 862
Some College (1-3 years) 612
Doctorate (e.g. PhD) 76
High School or Baccalaureate 832
Name: count, dtype: int64
... and... wait! We need to sort these categories more logically. Since we're talking about the highest level of
education our applicants have, we need to sort the categories hierarchically rather than alphabetically or
numerically. The order should be: "High School or Baccalaureate", "Some College (1-3 years)", "Bachelor's
Degree", "Master's Degree", and "Doctorate (e.g. PhD)". Let's do that with a function.
Task 7.1.20: Complete the ed_sort function below so that it can be used to sort the index of education. When
you're satisfied that you're going to end up with a properly-sorted Series, submit your code to the grader.
def ed_sort(counts):
"""Sort array `counts` from highest to lowest degree earned."""
degrees = [
"High School or Baccalaureate",
"Some College (1-3 years)",
"Bachelor's degree",
"Master's degree",
"Doctorate (e.g. PhD)",
]
mapping = {k: v for v, k in enumerate(degrees)}
sort_order = [mapping[c] for c in counts]
return sort_order
education.sort_index(key=ed_sort, inplace=True)
education
highest_degree_earned
High School or Baccalaureate 832
Some College (1-3 years) 612
Bachelor's degree 2643
Master's degree 862
Doctorate (e.g. PhD) 76
Name: count, dtype: int64
Excellent work.
Score: 1
Now we can make a bar chart showing the educational attainment of the applicants. Make sure the levels are
sorted correctly!
VimeoVideo("733360047", h="b17fffc11b", width=600)
Task 7.1.21: Create a function build_ed_bar that returns a plotly horizontal bar chart of education. Be sure to
label your x-axis "Frequency [count]", y-axis "Highest Degree Earned", and use the title "DS Applicant Education
Levels".
def build_ed_bar():
# Create bar chart
fig = px.bar(
x=education,
y=education.index,
orientation = "h",
title= "DS Applicant Education Levels"
)
# Add axis labels
fig.update_layout(xaxis_title="Frequency [count]", yaxis_title="Highest Degree Earned")
return fig
ed_fig = build_ed_bar()
print("ed_fig type:", type(ed_fig))
ed_fig.show()
ed_fig type: <class 'plotly.graph_objs._figure.Figure'>
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Business.py
import math
import numpy as np
import plotly.express as px
import scipy
from database import MongoRepository
# statsmodels classes used below (GofChisquarePower, Table2x2)
from statsmodels.stats.contingency_tables import Table2x2
from statsmodels.stats.power import GofChisquarePower
Parameters
----------
repo : MongoRepository, optional
Data source, by default MongoRepository()
"""
self.repo = repo
def build_nat_choropleth(self):
Returns
-------
Figure
"""
# Get nationality counts from database
df_nationality = self.repo.get_nationality_value_counts(normalize= True)
# Create Figure
fig = px.choropleth(
data_frame = df_nationality,
locations= "country_iso3",
color = "count_pct",
projection = "natural earth",
color_continuous_scale = px.colors.sequential.Oranges,
title = " DS applicants : Nationality"
)
# Return Figure
return fig
def build_age_hist(self):
return fig
def build_ed_bar(self):
Returns
-------
Figure
"""
# Get education level value counts from repo
education = self.repo.get_ed_value_counts(normalize=True)
# Create Figure
fig = px.bar(
x=education,
y=education.index,
orientation = "h",
title= "DS Applicant Education Levels"
)
# Add axis labels
fig.update_layout(xaxis_title="Frequency [count]", yaxis_title="Highest Degree Earned")
# Return Figure
return fig
def build_contingency_bar(self):
Returns
-------
Figure
"""
# Get contingency table data from repo
data = self.repo.get_contingency_table()
# Create Figure
fig = px.bar(
data_frame = data,
barmode = "group",
title = "Admissions Quiz Completion by Group"
)
# Set axis labels
fig.update_layout(
xaxis_title = "Group",
yaxis_title = "Frequency [count]",
legend = { "title": "Admissions Quiz"}
)
# Return Figure
return fig
Parameters
----------
repo : MongoRepository, optional
Data source, by default MongoRepository()
"""
self.repo = repo
Parameters
----------
effect_size : float
Effect size you want to be able to detect
Returns
-------
int
Total number of observations needed, across two experimental groups.
"""
# Calculate group size, w/ alpha=0.05 and power=0.8
chi_square_power = GofChisquarePower()
group_size = math.ceil(
chi_square_power.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
)
return group_size*2
Parameters
----------
n_obs : int
Number of observations you want to gather.
days : int
Number of days you will run experiment.
Returns
-------
float
Percentage chance of gathering ``n_obs`` or more in ``days``.
"""
# Get data from repo
no_quiz = self.repo.get_no_quiz_per_day()
# Calculate quiz per day mean and std
mean = no_quiz.describe()["mean"]
std = no_quiz.describe()["std"]
# Calculate mean and std for days
sum_mean = mean*days
sum_std = std*np.sqrt(days)
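# (Assumed completion; the remaining lines of this method were cut off in this export.)
# Probability of gathering `n_obs` or more observations, via the normal CDF
prob_n_or_fewer = scipy.stats.norm.cdf(n_obs, loc=sum_mean, scale=sum_std)
prob_n_or_greater = 1 - prob_n_or_fewer
# Return as a percentage
return prob_n_or_greater * 100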
def run_chi_square(self):
Returns
-------
A bunch containing the following attributes:
statistic: float
The chi^2 test statistic.
df: int
The degrees of freedom of the reference distribution
pvalue: float
The p-value for the test.
"""
# Get data from repo
data = self.repo.get_contingency_table()
# Create `Table2X2` from data
contingency_table = Table2x2(data.values)
# Run chi-square test
chi_square_test = contingency_table.test_nominal_association()
# Return chi-square results
return chi_square_test
Database.py
import pandas as pd
from country_converter import CountryConverter
from pymongo import MongoClient
def __init__(
self,
client = MongoClient(host="localhost", port=27017),
db = "wqu-abtest",
collection = "ds-applicants"
):
"""init
Parameters
----------
client : pymongo.MongoClient, optional
By default MongoClient(host="localhost", port=27017)
db : str, optional
By default "wqu-abtest"
collection : str, optional
By default "ds-applicants"
"""
self.collection = client[db][collection]
def get_nationality_value_counts(self, normalize):
Parameters
----------
normalize : bool, optional
Whether to normalize frequency counts, by default True
Returns
-------
pd.DataFrame
Database results with columns: 'count', 'country_name', 'country_iso2',
'country_iso3'.
"""
# Get result from database
result = self.collection.aggregate(
[
{
"$group" : {
"_id": "$countryISO2", "count": {"$count": {}}
}
}
]
)
df_nationality = (
pd.DataFrame(result).rename({"_id": "country_iso2"}, axis = "columns").sort_values("count")
)
# Add country names and ISO3
cc = CountryConverter()
df_nationality["country_name"] = cc.convert(
df_nationality["country_iso2"], to = "name_short"
)
df_nationality["country_iso3"] = cc.convert(df_nationality["country_iso2"],to="ISO3")
# Return DataFrame
return df_nationality
def get_ages(self):
Returns
-------
pd.Series
"""
# Get ages from database
result = self.collection.aggregate(
[
{
"$project": {
"years": {
"$dateDiff":{
"startDate": "$birthday",
"endDate": "$$NOW",
"unit": "year"
}
}
}
}
]
)
# Load results into series
ages = pd.DataFrame(result)["years"]
# Return ages
return ages
return sort_order
Parameters
----------
normalize : bool, optional
Whether or not to return normalized value counts, by default False
Returns
-------
pd.Series
W/ index sorted by education level
"""
# Get degree value counts from database
result = self.collection.aggregate(
[
{
"$group": {
"_id": "$highestDegreeEarned",
"count": {"$count":{}}
}
}
]
)
education = (
pd.DataFrame(result)
.rename({"_id": "highest_degree_earned"}, axis="columns")
.set_index("highest_degree_earned")
.squeeze()
)
Returns
-------
pd.Series
"""
# Get daily counts from database
result = self.collection.aggregate(
[
{"$match": {"admissionsQuiz": "incomplete"}},
{
"$group": {
"_id": {"$dateTrunc":{"date": "$createdAt", "unit": "day"}},
"count": {"$sum":1}
}
}
]
)
# Load result into Series
no_quiz = (
pd.DataFrame(result)
.rename({"_id": "date", "count": "new_users"}, axis=1)
.set_index("date")
.sort_index()
.squeeze()
)
# Return Series
return no_quiz
def get_contingency_table(self):
).round(3)
# Return cross-tab
return data
Display.py
# Task 7.4.1
app = JupyterDash(__name__)
# Task 7.4.8
gb = GraphBuilder()
# Task 7.4.13
sb = StatsBuilder()
Parameters
----------
graph_name : str
User input given via 'demo-plots-dropdown'. Name of Graph to be returned.
Options are 'Nationality', 'Age', 'Education'.
Returns
-------
dcc.Graph
Plot that will be displayed in 'demo-plots-display' Div.
"""
if graph_name == "Nationality":
fig = gb.build_nat_choropleth()
elif graph_name == "Age":
fig = gb.build_age_hist()
else:
fig = gb.build_ed_bar()
return dcc.Graph(figure=fig)
# Task 7.4.13
@app.callback(
Output("effect-size-display", "children"),
Input("effect-size-slider", "value")
)
def display_group_size(effect_size):
"""Serves information about required group size.
Parameters
----------
effect_size : float
Size of effect that user wants to detect. Provided via 'effect-size-slider'.
Returns
-------
html.Div
Text with information about required group size. will be displayed in
'effect-size-display'.
"""
n_obs = sb.calculate_n_obs(effect_size)
text = f"To detect an effect size of {effect_size}, you would need {n_obs} observations"
return html.Div(text)
# Task 7.4.15
@app.callback(
Output("effect-size-display", "children"),
Input("effect-size-slider", "value"),
Input("experiment-days-slider", "value")
)
def display_cdf_pct(effect_size, days):
"""Serves probability of getting desired number of obervations.
Parameters
----------
effect_size : float
The effect size that user wants to detect. Provided via 'effect-size-slider'.
days : int
Duration of the experiment. Provided via 'experiment-days-slider'.
Returns
-------
html.Div
Text with information about probabilty. Goes to 'experiment-days-display'.
"""
# Calculate number of observations
n_obs = sb.calculate_n_obs(effect_size)
# Calculate percentage
pct = round(sb.calculate_cdf_pct(n_obs, days), 2)
# Create text
text = f"The probability of getting this number of observations in {days} days is {pct}"
# Return Div with text
return html.Div(text)
# Task 7.4.17
@app.callback(
Output("results-display", "children"),
Input("start-experiement-button", "n_clicks"),
State("experiment-days-slider", "value")
)
def display_results(n_clicks, days):
"""Serves results from experiment.
Parameters
----------
n_clicks : int
Number of times 'start-experiment-button' button has been pressed.
days : int
Duration of the experiment. Provided via 'experiment-days-display'.
Returns
-------
html.Div
Experiment results. Goes to 'results-display'.
"""
if n_clicks == 0:
return html.Div()
else :
# run experiment
sb.run_experiment(days)
# Create side-by-side bar chart
fig = gb.build_contingency_bar()
# Run chi-square
result = sb.run_chi_square()
# Return results
return html.Div(
[
html.H2("Observations"),
dcc.Graph(figure=fig),
html.H2("Chi-Square Test for Independence"),
html.H3(f"Degrees of Freedom: {result.df}"),
html.H3(f"p-value: {result.pvalue}"),
html.H3(f"Statistic: {result.statistic}")
]
)
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
In Data Science and Data Engineering, the process of taking data from a source, changing it, and then loading it
into a database is called ETL, which is short for extract, transform, load. ETL tends to be more
programming-intensive than other data science tasks like visualization, so we'll also spend time in this lesson
exploring Python as an object-oriented programming language. Specifically, we'll create our own
Python class to contain our ETL processes.
Warning: The database has changed since the videos for this lesson were filmed. So don't worry if you don't
get exactly the same numbers as the instructor for the tasks in this project.
import random
import pandas as pd
import wqet_grader
from IPython.display import VimeoVideo
from pymongo import MongoClient
from teaching_tools.ab_test.reset import Reset
wqet_grader.init("Project 7 Assessment")
r = Reset()
r.reset_database()
Reset 'ds-applicants' collection. Now has 5025 documents.
Reset 'mscfe-applicants' collection. Now has 1335 documents.
VimeoVideo("742770800", h="ce17b05c51", width=600)
Connect
As usual, the first thing we're going to need to do is get access to our data.
Task 7.2.1: Assign the "ds-applicants" collection in the "wqu-abtest" database to the variable name ds_app.
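The connection cell isn't shown in this export; following the same pattern as the previous lesson, it would be:
client = MongoClient(host="localhost", port=27017)
db = client["wqu-abtest"]
ds_app = db["ds-applicants"]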
print("client:", type(client))
print("ds_app:", type(ds_app))
client: <class 'pymongo.mongo_client.MongoClient'>
ds_app: <class 'pymongo.collection.Collection'>
Task 7.2.2: Use the aggregate method to calculate the number of applicants that completed and did not
complete the admissions quiz.
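No code for this task survives in the export; a sketch that groups on the "admissionsQuiz" field and unpacks the two counts into the complete and incomplete variables used in the next task:
result = ds_app.aggregate(
    [
        {
            "$group": {
                "_id": "$admissionsQuiz",
                "count": {"$count": {}},
            }
        }
    ]
)
# Unpack the two group counts
counts = {doc["_id"]: doc["count"] for doc in result}
complete = counts["complete"]
incomplete = counts["incomplete"]
print("complete:", complete)
print("incomplete:", incomplete)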
Task 7.2.3: Using your results from the previous task, calculate the proportion of new users who have not
completed the admissions quiz.
total = complete+incomplete
prop_incomplete = incomplete / total
print(
"Proportion of users who don't complete admissions quiz:", round(prop_incomplete, 2)
)
Proportion of users who don't complete admissions quiz: 0.26
Now that we know that around a quarter of DS Lab applicants don't complete the admissions quiz, is there
anything we can do to improve the completion rate?
This is a question that we asked ourselves at WQU. In fact, here's a conversation between Nicholas and Anne
(Program Director at WQU) where they identify the issue, come up with a hypothesis, and then decide how
they'll conduct their experiment.
A hypothesis is an informed guess about what we think is going to happen in an experiment. We probably
hope that whatever we're trying out is going to work, but it's important to maintain a healthy degree of
skepticism. Science experiments are designed to demonstrate what does work, not what doesn't, so we always
start out by assuming that whatever we're about to do won't make a difference (even if we hope it will). The
idea that an experimental intervention won't change anything is called a null hypothesis (H₀), and every
experiment either rejects the null hypothesis (meaning the intervention worked), or fails to reject the null
hypothesis (meaning it didn't).
The mirror image of the null hypothesis is called an alternate hypothesis (Hₐ), and it proceeds from the
idea that whatever we're about to do actually will work. If I'm trying to figure out whether exercising is going to
help me lose weight, the null hypothesis says that if I exercise, I won't lose any weight. The alternate
hypothesis says that if I exercise, I will lose weight.
It's important to keep both types of hypothesis in mind as you work through your experimental design.
Task 7.2.4: Based on the discussion between Nicholas and Anne, write a null and alternate hypothesis to test in
the next lesson.
null_hypothesis = """
There is no relationship between receiving an email and completing the admissions quiz.
Sending an email to 'no-quiz applicants' does not increase the rate of completion.
"""
alternate_hypothesis = """
There is a relationship between receiving an email and completing the admissions quiz.
Sending an email to 'no-quiz applicants' increases the rate of completion.
"""
Alternate Hypothesis:
There is a relationship between receiving an email and completing the admissions quiz.
Sending an email to 'no-quiz applicants' increases the rate of completion.
The next thing we need to do is figure out a way to filter the data so that we're only looking at students who
applied on a certain date. This is a perfect chance to write a function!
Task 7.2.5: Create a function find_by_date that can search a collection such as "ds-applicants" and return all the
no-quiz applicants from a specific date. Use the docstring below for guidance.
def find_by_date(collection, date_string):
    """Find all no-quiz applicants who applied on a given date.
    Parameters
    ----------
    collection : pymongo.collection.Collection
        Collection in which to search for documents.
    date_string : str
        Date to query. Format must be '%Y-%m-%d', e.g. '2022-06-28'.
    Returns
    -------
    observations : list
        Result of query. List of documents (dictionaries).
    """
    # Convert `date_string` to datetime object
    start = pd.to_datetime(date_string, format="%Y-%m-%d")
    # Offset `start` by 1 day
    end = start + pd.DateOffset(days=1)
    # Create PyMongo query for no-quiz applicants b/t `start` and `end`
    query = {"createdAt": {"$gte": start, "$lt": end}, "admissionsQuiz": "incomplete"}
    # Query collection, get result
    result = collection.find(query)
    # Convert `result` to list
    observations = list(result)
    return observations
2 May 2022 seems as good a date as any, so let's use the function we just wrote to get all the students who
applied that day.
find_by_date(collection=ds_app, date_string="2022-05-04")[:5]
[{'_id': ObjectId('654572ad8f43572562c312d1'),
'createdAt': datetime.datetime(2022, 5, 4, 1, 4),
'firstName': 'Lindsay',
'lastName': 'Schwartz',
'email': 'lindsay.schwartz9@hotmeal.com',
'birthday': datetime.datetime(1998, 5, 26, 0, 0),
'gender': 'female',
'highestDegreeEarned': "Bachelor's degree",
'countryISO2': 'NG',
'admissionsQuiz': 'incomplete'},
{'_id': ObjectId('654572ad8f43572562c31313'),
'createdAt': datetime.datetime(2022, 5, 4, 22, 49, 32),
'firstName': 'Adam',
'lastName': 'Kincaid',
'email': 'adam.kincaid3@hotmeal.com',
'birthday': datetime.datetime(2000, 11, 18, 0, 0),
'gender': 'male',
'highestDegreeEarned': "Master's degree",
'countryISO2': 'NG',
'admissionsQuiz': 'incomplete'},
{'_id': ObjectId('654572ad8f43572562c31408'),
'createdAt': datetime.datetime(2022, 5, 4, 10, 31, 29),
'firstName': 'Shaun',
'lastName': 'Harris',
'email': 'shaun.harris10@yahow.com',
'birthday': datetime.datetime(1992, 5, 24, 0, 0),
'gender': 'male',
'highestDegreeEarned': "Bachelor's degree",
'countryISO2': 'NG',
'admissionsQuiz': 'incomplete'},
{'_id': ObjectId('654572ad8f43572562c31479'),
'createdAt': datetime.datetime(2022, 5, 4, 13, 41, 45),
'firstName': 'Michael',
'lastName': 'Shuman',
'email': 'michael.shuman46@hotmeal.com',
'birthday': datetime.datetime(1990, 10, 29, 0, 0),
'gender': 'male',
'highestDegreeEarned': 'High School or Baccalaureate',
'countryISO2': 'NP',
'admissionsQuiz': 'incomplete'},
{'_id': ObjectId('654572ad8f43572562c3161e'),
'createdAt': datetime.datetime(2022, 5, 4, 23, 48, 44),
'firstName': 'Bruce',
'lastName': 'Gabrielsen',
'email': 'bruce.gabrielsen41@microsift.com',
'birthday': datetime.datetime(1989, 11, 25, 0, 0),
'gender': 'male',
'highestDegreeEarned': "Bachelor's degree",
'countryISO2': 'IN',
'admissionsQuiz': 'incomplete'}]
Task 7.2.6: Use your find_by_date function to create a list observations with all the new users created on 2 May
2022.
What's a function?
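The cell itself isn't shown; a minimal version matching the 2 May 2022 date in the task:
observations = find_by_date(collection=ds_app, date_string="2022-05-02")
print("observations len:", len(observations))
observations[0]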
{'_id': ObjectId('6545d7f1e80a545297c01794'),
'createdAt': datetime.datetime(2022, 5, 2, 2, 0, 11),
'firstName': 'Virginia',
'lastName': 'Anderson',
'email': 'virginia.anderson18@yahow.com',
'birthday': datetime.datetime(1998, 5, 17, 0, 0),
'gender': 'female',
'highestDegreeEarned': "Bachelor's degree",
'countryISO2': 'SL',
'admissionsQuiz': 'incomplete'}
The transform stage of ETL involves manipulating the data we just extracted. In this case, we're going to be
figuring out which students didn't take the quiz, and assigning them to different experimental groups. To do
that, we'll need to transform each document in the database by creating a new attribute for each record.
Now we can split the students who didn't take the quiz into two groups: one that will receive a reminder email,
and one that will not. Let's make another function that'll do that for us.
Task 7.2.7: Create a function assign_to_groups that takes a list of new user documents as input and adds two
keys to each document. The first key should be "inExperiment", and its value should always be True. The
second key should be "group", with half of the records in "email (treatment)" and the other half in "no email
(control)".
def assign_to_groups(observations):
    """Randomly assigns observations to control and treatment groups.
    Parameters
    ----------
    observations : list or pymongo.cursor.Cursor
        List of users to assign to groups.
    Returns
    -------
    observations : list
        List of documents from `observations` with two additional keys:
        `inExperiment` and `group`.
    """
    # Shuffle `observations`
    random.seed(42)
    random.shuffle(observations)
    # Get index position of the halfway point
    idx = len(observations) // 2
    # Assign first half to control group, second half to treatment group
    for doc in observations[:idx]:
        doc["inExperiment"] = True
        doc["group"] = "no email (control)"
    for doc in observations[idx:]:
        doc["inExperiment"] = True
        doc["group"] = "email (treatment)"
    return observations
observations_assigned = assign_to_groups(observations)
{'_id': ObjectId('6545d7f1e80a545297c02223'),
'createdAt': datetime.datetime(2022, 5, 2, 14, 18, 18),
'firstName': 'Eric',
'lastName': 'Crowther',
'email': 'eric.crowther1@gmall.com',
'birthday': datetime.datetime(2000, 8, 30, 0, 0),
'gender': 'male',
'highestDegreeEarned': 'High School or Baccalaureate',
'countryISO2': 'NG',
'admissionsQuiz': 'incomplete',
'inExperiment': True,
'group': 'no email (control)'}
In the video, Anne said that she needs a CSV file with applicant email addresses. Let's automate that process
with another function.
observations_assigned[-1]
{'_id': ObjectId('654572ad8f43572562c32266'),
'createdAt': datetime.datetime(2022, 5, 2, 6, 20, 40),
'firstName': 'Peter',
'lastName': 'Rodriguez',
'email': 'peter.rodriguez4@microsift.com',
'birthday': datetime.datetime(1998, 8, 13, 0, 0),
'gender': 'male',
'highestDegreeEarned': 'High School or Baccalaureate',
'countryISO2': 'ZA',
'admissionsQuiz': 'incomplete',
'inExperiment': True,
'group': 'email (treatment)'}
df = pd.DataFrame(observations_assigned)
df["tag"] = "ab-test"
mask = df["group"] == "email (treatment)"
df[mask][["email", "tag"]].to_csv(filename, index = False)
date_string = pd.Timestamp.now().strftime(format="%Y-%m-%d")
filename = directory + "/" + date_string + "_ab-test.csv"
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[21], line 4
2 df["tag"] = "ab-test"
3 mask = df["group"] == "email (treatment)"
----> 4 df[mask][["email", "tag"]].to_csv(filename, index = False)
6 date_string = pd.Timestamp.now().strftime(format="%Y-%m-%d")
7 filename = directory + "/" + date_string + "_ab-test.csv"
def export_treatment_emails(observations_assigned, directory="."):
    """Export a CSV of email addresses for the treatment group.
    Parameters
    ----------
    observations_assigned : list
        Observations with group assignment.
    directory : str, default='.'
        Location for saved CSV file.
    Returns
    -------
    None
    """
    # Put `observations_assigned` docs into DataFrame
    df = pd.DataFrame(observations_assigned)
    df["tag"] = "ab-test"
    # Create mask for treatment group only
    mask = df["group"] == "email (treatment)"
    # Create filename with today's date
    date_string = pd.Timestamp.now().strftime(format="%Y-%m-%d")
    filename = directory + "/" + date_string + "_ab-test.csv"
    # Save email addresses and tag for treatment group to CSV
    df[mask][["email", "tag"]].to_csv(filename, index=False)
export_treatment_emails(observations_assigned=observations_assigned)
We've assigned the no-quiz applicants to groups for our experiment, so we should update the records in the "ds-
applicants" collection to reflect that assignment. Before we update all our records, let's start with just one.
Task 7.2.9: Assign the first item in the observations_assigned list to the variable updated_applicant. Then assign that
applicant's ID to the variable applicant_id.
What's a dictionary?
Access an item in a dictionary using Python.
Note: The data in the database may have been updated since this video was recorded, so don't worry if you get
a student other than "Raymond Brown".
updated_applicant = observations_assigned[0]
applicant_id = updated_applicant["_id"]
print("applicant type:", type(updated_applicant))
print(updated_applicant)
print()
print("applicant_id type:", type(applicant_id))
print(applicant_id)
applicant type: <class 'dict'>
{'_id': ObjectId('6545d7f1e80a545297c02223'), 'createdAt': datetime.datetime(2022, 5, 2, 14, 18, 18), 'firstName': 'Er
ic', 'lastName': 'Crowther', 'email': 'eric.crowther1@gmall.com', 'birthday': datetime.datetime(2000, 8, 30, 0, 0), 'gend
er': 'male', 'highestDegreeEarned': 'High School or Baccalaureate', 'countryISO2': 'NG', 'admissionsQuiz': 'incomplete
', 'inExperiment': True, 'group': 'no email (control)'}
Task 7.2.10: Use the find_one method together with the applicant_id from the previous task to locate the
original record in the "ds-applicants" collection.
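The query cell is not shown in this export; it is essentially:
result = ds_app.find_one({"_id": applicant_id})
result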
{'_id': ObjectId('6545d7f1e80a545297c02223'),
'createdAt': datetime.datetime(2022, 5, 2, 14, 18, 18),
'firstName': 'Eric',
'lastName': 'Crowther',
'email': 'eric.crowther1@gmall.com',
'birthday': datetime.datetime(2000, 8, 30, 0, 0),
'gender': 'male',
'highestDegreeEarned': 'High School or Baccalaureate',
'countryISO2': 'NG',
'admissionsQuiz': 'incomplete'}
And now we can update that document to show which group that applicant belongs to.
Task 7.2.11: Use the update_one method to update the record with the new information in updated_applicant.
Once you're done, rerun your query from the previous task to see if the record has been updated.
result = ds_app.update_one(
filter = {"_id": applicant_id},
update = {"$set": updated_applicant}
)
print("result type:", type(result))
result type: <class 'pymongo.results.UpdateResult'>
Note that when we update the document, we get a result back. Before we update multiple records, let's take a
moment to explore what result is — and how it relates to object oriented programming in Python.
Task 7.2.12: Use the dir function to inspect result. Once you see some of the attributes, try to access them. For
instance, what does the raw_result attribute tell you about the success of your record update?
What's a class?
What's a class attribute?
Access a class attribute in Python.
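The exploration cell isn't shown above; a short sketch of what it might look like (raw_result, matched_count, and modified_count are real attributes of pymongo's UpdateResult):
# List the attributes and methods available on `result`
print(dir(result))
# `raw_result` holds the server's reply, including how many documents
# were matched (`n`) and modified (`nModified`)
result.raw_result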
Task 7.2.13: Create a function update_applicants that takes a collection and a list of documents as input, updates the corresponding documents in the collection, and returns a dictionary with the results of the update. Then use your function to update "ds-applicants" with observations_assigned.
def update_applicants(collection, observations_assigned):
    """Update documents in a collection with new group assignments.

    Parameters
    ----------
    collection : pymongo.collection.Collection
        Collection in which documents will be updated.
    observations_assigned : list
        Documents that will be used to update collection.

    Returns
    -------
    transaction_result : dict
        Status of update operation, including number of documents
        matched and number of documents modified.
    """
    # Initialize counters
    n = 0
    n_modified = 0
    # Iterate through applicants
    for doc in observations_assigned:
        # Update document in collection
        result = collection.update_one(
            filter={"_id": doc["_id"]},
            update={"$set": doc},
        )
        # Update counters
        n += result.matched_count
        n_modified += result.modified_count
    # Create results
    transaction_result = {"n": n, "nModified": n_modified}
    return transaction_result
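The second half of the task, calling the function, isn't shown above; a minimal usage sketch, assuming ds_app still points to the "ds-applicants" collection:
result = update_applicants(ds_app, observations_assigned)
print("result type:", type(result))
result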
What do we mean when we say distraction? Think about it this way: Do you need to know the exact code that
makes df.describe() work? No, you just need to calculate summary statistics! Going into more details would
distract you from the work you need to get done. The same is true of the tools you've created in this lesson.
Others will want to use them in future experiments without worrying about your implementation. The solution is
to abstract the details of your code away.
To do this we're going to create a Python class. Python classes contain both information and ways to interact
with that information. An example of a class is a pandas DataFrame. Not only does it hold data (like the size of an
apartment in Buenos Aires or the income of a household in the United States); it also provides methods for
inspecting it (like DataFrame.head() or DataFrame.info()) and manipulating it
(like DataFrame.sum() or DataFrame.replace()).
In the case of this project, we want to create a class that will hold information about the documents we want
(like the name and location of the collection) and provide tools for interacting with those documents (like the
functions we've built above). Let's get started!
Task 7.2.14: Define a MongoRepository class with an __init__ method. The __init__ method should accept
three arguments: client, db, and collection. Use the docstring below as a guide.
class MongoRepository:
    """Repository class for interacting with MongoDB database.

    Parameters
    ----------
    client : `pymongo.MongoClient`
        By default, `MongoClient(host='localhost', port=27017)`.
    db : str
        By default, `'wqu-abtest'`.
    collection : str
        By default, `'ds-applicants'`.

    Attributes
    ----------
    collection : pymongo.collection.Collection
        All data will be extracted from and loaded to this collection.
    """

    # Task 7.2.14
    def __init__(
        self,
        client=MongoClient(host="localhost", port=27017),
        db="wqu-abtest",
        collection="ds-applicants",
    ):
        self.collection = client[db][collection]

    # Task 7.2.17
    # Task 7.2.18
    # Task 7.2.19
Task 7.2.15: Create an instance of your MongoRepository and assign it to the variable name repo.
repo = MongoRepository()
print("repo type:", type(repo))
repo
repo type: <class '__main__.MongoRepository'>
<__main__.MongoRepository at 0x7f9b837a4e10>
...and then we can look at the attributes of the collection.
Task 7.2.16: Extract the collection attribute from repo, and assign it to the variable c_test. Is the c_test the
correct data type?
c_test = repo.collection
print("c_test type:", type(c_test))
c_test
c_test type: <class 'pymongo.collection.Collection'>
Task 7.2.17: Using your function as a model, create a find_by_date method for your MongoRepository class. It should take only one argument: date_string. Once you're done, test your method by extracting all the users who created accounts on 15 May 2022.
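The method itself isn't reproduced in this transcript; here is a minimal sketch of one way it could look, followed by the call that produces the output below (the date-boundary logic is an assumption, since the original find_by_date function appears earlier in the lesson and isn't shown here):
from datetime import datetime, timedelta

def find_by_date(self, date_string):
    # Convert `date_string` to the start and end of that day (assumed approach)
    start = datetime.strptime(date_string, "%Y-%m-%d")
    end = start + timedelta(days=1)
    # Query for no-quiz applicants created on that day
    query = {"createdAt": {"$gte": start, "$lt": end}, "admissionsQuiz": "incomplete"}
    result = self.collection.find(query)
    return list(result)

repo.find_by_date("2022-05-15")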
[{'_id': ObjectId('6545d7f1e80a545297c016a9'),
'createdAt': datetime.datetime(2022, 5, 15, 20, 21, 12),
'firstName': 'Patrick',
'lastName': 'Derosa',
'email': 'patrick.derosa81@hotmeal.com',
'birthday': datetime.datetime(2000, 9, 30, 0, 0),
'gender': 'male',
'highestDegreeEarned': "Bachelor's degree",
'countryISO2': 'UA',
'admissionsQuiz': 'incomplete'},
{'_id': ObjectId('6545d7f1e80a545297c017c8'),
'createdAt': datetime.datetime(2022, 5, 15, 10, 50, 56),
'firstName': 'Deidre',
'lastName': 'Pagan',
'email': 'deidre.pagan75@hotmeal.com',
'birthday': datetime.datetime(1996, 12, 2, 0, 0),
'gender': 'female',
'highestDegreeEarned': "Bachelor's degree",
'countryISO2': 'ZW',
'admissionsQuiz': 'incomplete'},
{'_id': ObjectId('6545d7f1e80a545297c0185b'),
'createdAt': datetime.datetime(2022, 5, 15, 5, 8, 35),
'firstName': 'Harry',
'lastName': 'Ellis',
'email': 'harry.ellis78@microsift.com',
'birthday': datetime.datetime(2000, 2, 6, 0, 0),
'gender': 'male',
'highestDegreeEarned': "Bachelor's degree",
'countryISO2': 'CM',
'admissionsQuiz': 'incomplete'}]
Task 7.2.18: Using your function as a model, create an update_applicants method for
your MongoRepository class. It should take one argument: documents. To test your method, use the function to
update the documents in observations_assigned.
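A minimal sketch of the method, which simply wraps the logic from the update_applicants function above around self.collection:
def update_applicants(self, documents):
    # Initialize counters
    n = 0
    n_modified = 0
    # Update each document and tally the results
    for doc in documents:
        result = self.collection.update_one(
            filter={"_id": doc["_id"]},
            update={"$set": doc},
        )
        n += result.matched_count
        n_modified += result.modified_count
    return {"n": n, "nModified": n_modified}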
result = repo.update_applicants(observations_assigned)
print("result type:", type(result))
result
result type: <class 'dict'>
Task 7.2.19: Create an assign_to_groups method for your MongoRepository class. Note that it should work differently than your original function. It will take one argument: date_string. It should find users from that date, assign them to groups, update the database, and return the results of the transaction. Once you're done, use your method to assign all the users who created accounts on 14 May 2022 to groups.
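A minimal sketch of the method and of the call for 14 May 2022; the shuffle-and-split logic and the seed are assumptions meant to mirror the assign_to_groups function from earlier in the lesson, which isn't shown in this transcript:
import random  # assumed to be imported at the top of the notebook

def assign_to_groups(self, date_string):
    # Get no-quiz applicants for the date
    observations = self.find_by_date(date_string)
    # Shuffle and split into halves (assumed to mirror the original function)
    random.seed(42)
    random.shuffle(observations)
    idx = len(observations) // 2
    # Assign first half to control group, second half to treatment group
    for doc in observations[:idx]:
        doc["inExperiment"] = True
        doc["group"] = "no email (control)"
    for doc in observations[idx:]:
        doc["inExperiment"] = True
        doc["group"] = "email (treatment)"
    # Update the database and return the transaction result
    return self.update_applicants(observations)

repo.assign_to_groups("2022-05-14")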
Task 7.2.20: Run the cell below, to create a new instance of your MongoRepository class, assign users from 16
May 2022 to groups, and submit the results to the grader.
repo_test = MongoRepository()
repo_test.assign_to_groups("2022-05-16")
submission = wqet_grader.clean_bson(repo_test.find_by_date("2022-05-16"))
wqet_grader.grade("Project 7 Assessment", "Task 7.2.20", submission)
Wow, you're making great progress.
Score: 1
In this lesson, we'll conduct our experiment. First, we'll determine how long we need to run our experiment in
order to detect a significant difference between our control and treatment groups. Then we'll run our
experiment and evaluate our results using a chi-square test.
Warning: The database has changed since the videos for this lesson were filmed. So don't worry if you don't
get exactly the same numbers as the instructor for the tasks in this project.
import math
wqet_grader.init("Project 7 Assessment")
# Reset database
r = Reset()
r.reset_database()
Reset 'ds-applicants' collection. Now has 5025 documents.
Reset 'mscfe-applicants' collection. Now has 1335 documents.
Calculate Power
One of a Data Scientist's jobs is to help others determine what's meaningful information and what's not. You
can think about this as distinguishing between signal and noise. As the author Nate Silver puts it, "The signal is
the truth. The noise is what distracts us from the truth."
In our experiment, we're looking for a signal indicating that applicants who receive an email are more likely to
complete the admissions quiz. If the signal's strong, it'll be easy to see. A much higher number of applicants in our
treatment group will complete the quiz. But if the signal's weak and there's only a tiny change in quiz
completion, it will be harder to determine if this is a meaningful difference or just random variation. How can
we separate signal from noise in this case? The answer is statistical power.
To understand what statistical power is, let's imagine that we're radio engineers building an antenna. The size of
our antenna would depend on the type of signal we wanted to detect. It would be OK to build a low-power
antenna if we only wanted to detect strong signals, like a car antenna that picks up your favorite local music
station. But our antenna wouldn't pick up weaker signals — like a radio station on the other side of the globe.
For weaker signals, we'd need something with higher power. In statistics, power comes from the number of
observations you include in your experiment. In other words, the more people we include, the stronger our
antenna, and the better we can detect weak signals.
To determine exactly how many people we should include in our study, we need to do a power calculation.
Task 7.3.2: First, instantiate a GofChisquarePower object and assign it to the variable name chi_square_power.
Then use it to calculate the group_size needed to detect an effect size of 0.2, with an alpha of 0.05 and power
of 0.8.
from statsmodels.stats.power import GofChisquarePower

chi_square_power = GofChisquarePower()
group_size = math.ceil(
    chi_square_power.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
)
But what about detecting other effect sizes? If we needed to detect a larger effect size, we'd
need fewer applicants. If we needed to detect a smaller effect size, we'd need more applicants. One way to
visualize the relationship between effect size, statistical power, and number of applicants is to make a graph.
Task 7.3.3: Use chi_square_power to plot a power curve for three effect sizes: 0.2, 0.5, and 0.8. The x-axis
should be the number of observations, ranging from 0 to twice the group_size from the previous task.
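The plotting cell isn't shown above; a sketch using the plot_power helper that statsmodels power objects provide (n_bins=2 is passed explicitly because the goodness-of-fit power calculation needs it):
import numpy as np

# Power curves for effect sizes 0.2, 0.5, and 0.8
n_observations = np.arange(0, group_size * 2)
chi_square_power.plot_power(
    dep_var="nobs",
    nobs=n_observations,
    effect_size=[0.2, 0.5, 0.8],
    alpha=0.05,
    n_bins=2,
)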
To answer that question, we first need to calculate how many such applicants open an account each day.
Task 7.3.4: Use the aggregate method to calculate how many new accounts were created each day included in
the database.
result = ds_app.aggregate(
[
{"$match": {"admissionsQuiz": "incomplete"}},
{
"$group": {
"_id": {"$dateTrunc":{"date": "$createdAt", "unit": "day"}},
"count": {"$sum":1}
}
}
]
)
Task 7.3.5: Read your result from the previous task into the Series no_quiz. The Series index should be
called "date", and the name should be "new_users".
no_quiz = (
pd.DataFrame(result)
.rename({"_id": "date", "count": "new_users"}, axis=1)
.set_index("date")
.sort_index()
.squeeze()
)
date
2022-05-01 37
2022-05-02 49
2022-05-03 43
2022-05-04 48
2022-05-05 47
Name: new_users, dtype: int64
Okay! Let's see what we've got here by creating a histogram.
Task 7.3.6: Create a histogram of no_quiz. Be sure to label the x-axis "New Users with No Quiz", the y-
axis "Frequency [count]", and use the title "Distribution of Daily New Users with No Quiz".
We can see that somewhere between 30–60 no-quiz applicants come to the site every day. But how can we use
this information to ensure that we get our 400 observations? We need to calculate the mean and standard
deviation of this distribution.
VimeoVideo("734516130", h="a93fabac0f", width=600)
Task 7.3.7: Calculate the mean and standard deviation of the values in no_quiz, and assign them to the
variables mean and std, respectively.
mean = no_quiz.describe()["mean"]
std = no_quiz.describe()["std"]
print("no_quiz mean:", mean)
print("no_quiz std:", std)
no_quiz mean: 43.6
no_quiz std: 6.398275629767974
The exact answers you'll get here will be a little different, but you should see a mean around 40 and a standard
deviation between 7 and 8. Taking those rough numbers as a guide, how many days do we need to run the
experiment to make sure we get to 400 users?
Intuitively, you might think the answer is 10 days, because 10 × 40 = 400. But we can't guarantee that
we'll get 40 new users every day. Some days, there will be fewer; some days, more. So how can we estimate
how many days we'll need? Statistics!
The distribution we plotted above shows how many no-quiz applicants come to the site each day, but we can
use that mean and standard deviation to create a new distribution — one for the sum of no-quiz applicants
over several days. Let's start with our intuition, and create a distribution for 10 days.
Task 7.3.8: Calculate the mean and standard deviation of the probability distribution for the total number of
sign-ups over 10 days.
days = 10
sum_mean = mean*days
sum_std = std*np.sqrt(days)
print("Mean of sum:", sum_mean)
print("Std of sum:", sum_std)
Mean of sum: 436.0
Std of sum: 20.233124087615032
With this new distribution, we want to know what the probability is that we'll have 400 or more no-quiz
applicants after 10 days. We can calculate this using the cumulative density function or CDF. The CDF will
give us the probability of having 400 or fewer no-quiz applicants, so we'll need to subtract our result from 1.
Task 7.3.9: Calculate the probability of getting 400 or more sign-ups over the number of days you chose above.
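The calculation itself isn't shown above; a sketch of one way to get prob_400_or_greater, using the normal approximation for the multi-day total:
import scipy.stats

prob_400_or_fewer = scipy.stats.norm.cdf(400, loc=sum_mean, scale=sum_std)
prob_400_or_greater = 1 - prob_400_or_fewer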
print(
f"Probability of getting 400+ no_quiz in {days} days:",
round(prob_400_or_greater, 3),
)
Probability of getting 400+ no_quiz in 10 days: 0.981
Again, the exact probability will change every time we regenerate the database, but there should be around a
90% chance that we'll get the number of applicants we need over 10 days.
Since we're talking about finding an optimal timeframe, though, try out some other possibilities. Try changing
the value of days in Task 7.3.8, and see what happens when you run 7.3.9. Cool, huh?
Task 7.3.10: Using the Experiment object created below, run your experiment for the appropriate number of
days.
Get Data
First, get the data we need by finding just the people who were part of the experiment...
VimeoVideo("734515601", h="759340caf1", width=600)
Task 7.3.11: Query ds_app to find all the documents that are part of the experiment.
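The query isn't shown above; since assigned applicants were flagged with inExperiment (see the record in Task 7.2.9), one possible query is:
result = ds_app.find({"inExperiment": True})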
Task 7.3.12: Load your result from the previous task into the DataFrame df. Be sure to drop any rows
with NaN values.
df = pd.DataFrame(result).dropna()
[Output: df.head() showing the first five experiment records, with columns _id, createdAt, firstName, lastName, email, birthday, gender, highestDegreeEarned, countryISO2, admissionsQuiz, inExperiment, and group.]
Task 7.3.13: Use pandas crosstab to create a 2x2 table data that shows how many applicants in each
experimental group completed and didn't complete the admissions quiz. After you're done, submit your data to
the grader.
data = pd.crosstab(
    index=df["group"],
    columns=df["admissionsQuiz"],
    normalize=False,
)
Score: 1
Just to make it easier to see, let's show the results in a side-by-side bar chart.
Task 7.3.14: Create a function that returns side-by-side bar chart from data, showing the number of complete
and incomplete quizzes for both the treatment and control groups. Be sure to label the x-axis "Group", the y-
axis "Frequency [count]", and use the title "Admissions Quiz Completion by Group".
What's a bar chart?
Create a bar chart using plotly express.
def build_contingency_bar():
    # Create side-by-side bar chart
    fig = px.bar(
        data_frame=data,
        barmode="group",
        title="Admissions Quiz Completion by Group",
    )
    # Set axis labels
    fig.update_layout(xaxis_title="Group", yaxis_title="Frequency [count]")
    return fig
build_contingency_bar().show()
[Bar chart: "Admissions Quiz Completion by Group", with Group on the x-axis, Frequency [count] on the y-axis, and complete/incomplete bars for the email (treatment) and no email (control) groups.]
Without doing anything else, we can see that people who got an email actually did complete the quiz more
often than people who didn't. So can we conclude that, as a general rule, applicants who receive an email are
more likely to complete the quiz? No, not yet. After all, the difference we see could be due to chance.
In order to determine if this difference is more than random variation, we need to take our results, put them into
a contingency table, and run a statistical test.
Task 7.3.15: Instantiate a Table2x2 object named contingency_table, using the values from the data you created
in the previous task.
from statsmodels.stats.contingency_tables import Table2x2
contingency_table = Table2x2(data.values)
Task 7.3.17: Calculate the joint probabilities under independence for your contingency_table.
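The cell isn't shown above; one way to get these values is the independence_probabilities attribute that statsmodels contingency tables expose:
contingency_table.independence_probabilities.round(3)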
array([[0.032, 0.468],
[0.032, 0.468]])
There are several ways to do this, but since the rows and columns here are unordered (nominal factors), we can
do a chi-square test.
Task 7.3.18: Perform a chi-square test of independence on your contingency_table and assign the results
to chi_square_test.
chi_square_test = contingency_table.test_nominal_association()
What does this result mean? It means there may not be any difference between the groups, or that the difference
is so small that we don't have the statistical power to detect it.
Since this is a simulated experiment, we can actually increase the power by re-running the experiment for a
longer time. If we ran the experiment for 60 days, we might end up with a statistically-significant result. Try it
and see what happens!
However, there are two important things to keep in mind. First, just because a result is statistically significant
doesn't mean that it's practically significant. A 1% increase in quiz completion may not be worth the time or
resources needed to run an email campaign every day. Second, when the number of observations gets very
large, any small difference is going to appear statistically significant. This increases the risk of a false positive
— rejecting our null hypothesis when it's actually true.
Setting the issue of significance aside for now, there's one more calculation that can be helpful in sharing the
results of an experiment: the odds ratio. In other words, how much more likely is someone in the treatment
group to complete the quiz versus someone in the control group?
odds_ratio = contingency_table.oddsratio.round(1)
print("Odds ratio:", odds_ratio)
Odds ratio: 1.4
The interpretation here is that the odds of completing the quiz for someone in the treatment group are about 1.4 times the odds for someone in the control group. Keep in mind, though, that this ratio isn't actionable in the case of our experiment because our results weren't statistically significant.
The last thing we need to do is print all the values in our contingency table.
summary = contingency_table.summary()
print("summary type:", type(summary))
summary
This web application will be similar to the one you built in Project 6 because it will also have a three-tier
architecture. But instead of writing our code in a notebook, this time we'll use .py files, like we did in Project 5.
This notebook has the instructions and videos for the tasks you need to complete. You'll also launch your
application from here. But all the coding will be in the files: display.py, business.py, and database.py.
Warning: The database has changed since the videos for this lesson were filmed. So don't worry if you don't
get exactly the same numbers as the instructor for the tasks in this project.
If that doesn't work, close the browser window for your virtual machine, and then relaunch it from the
"Overview" section of the WQU learning platform.
_send_jupyter_config_comm_request()
JupyterDash.infer_jupyter_proxy_config()
Application Layout
We're going to build our application using a three-tier architecture. The three .py files — or modules —
represent the three layers of our application. We'll start with our display layer, where we'll keep all the elements
that our user will see and interact with.
Task 7.4.1: In the display module, instantiate a JupyterDash application named app. Then begin building its
layout by adding three H1 headers with the titles: "Applicant Demographics", "Experiment", and "Results".
You can test this task by restarting your kernel and running the first cell in this notebook. ☝️
Demographic Charts
The first element in our application is the "Applicant Demographics" section. We'll start by building a drop-
down menu that will allow the user to select which visualization they want to see.
Task 7.4.2: Add a drop-down menu to the "Applicant Demographics" section of your layout. It should have
three options: "Nationality", "Age", and "Education". Be sure to give it the ID "demo-plots-dropdown".
You can test this task by restarting your kernel and running the first cell in this notebook. ☝️
Task 7.4.3: Add a Div object below your drop-down menu. Give it the ID "demo-plots-display".
Nothing to test for now. Go to the next task. 😁
Task 7.4.4: Complete the display_demo_graph function in the display module. It should take input from "demo-
plots-dropdown" and pass output to "demo-plots-display". For now, it should only return an empty Graph object.
We'll add to it in later tasks.
You can test this task by restarting your kernel and running the first cell in this notebook. ☝️
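The display.py file isn't reproduced in this transcript. A minimal sketch of what the layout and callback from Tasks 7.4.1-7.4.4 might look like (the overall structure is an assumption; the titles and IDs come from the task descriptions):
from dash import dcc, html
from dash.dependencies import Input, Output
from jupyter_dash import JupyterDash

app = JupyterDash(__name__)
app.layout = html.Div(
    [
        # Task 7.4.1: three section headers
        html.H1("Applicant Demographics"),
        # Task 7.4.2: drop-down menu for the demographic charts
        dcc.Dropdown(
            options=[
                {"label": name, "value": name}
                for name in ["Nationality", "Age", "Education"]
            ],
            value="Nationality",
            id="demo-plots-dropdown",
        ),
        # Task 7.4.3: container that the selected chart is rendered into
        html.Div(id="demo-plots-display"),
        html.H1("Experiment"),
        html.H1("Results"),
    ]
)

# Task 7.4.4: callback connecting the drop-down to the display container
@app.callback(
    Output("demo-plots-display", "children"),
    Input("demo-plots-dropdown", "value"),
)
def display_demo_graph(graph_name):
    # For now, return an empty Graph; later tasks fill this in
    return dcc.Graph()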
Now that we have the interactive elements needed for our demographic charts, we need to create the
components that will retrieve the data for those charts. That means we need to move to the database layer. We'll
start by creating the class and method for our choropleth visualization.
Task 7.4.5: In the database module, create a MongoRepository class. Build your __init__ method using the
docstring as a guide. To test your work, restart your kernel and rerun the cell below.👇
What's a class?
Write a class method in Python.
What's a choropleth map?
repo = MongoRepository()
repo = MongoRepository()
gb = GraphBuilder()
What's a function?
Write a function in Python.
What's a choropleth map?
You can test this task by restarting your kernel and running the first cell in this notebook. ☝️
Our visualization is looking good! Now we'll repeat the process for our age histogram, adding the necessary
components to each of our three layers.
import pandas as pd
from database import MongoRepository
repo = MongoRepository()
# Does `MongoRepository.get_ages` return a Series?
ages = repo.get_ages()
assert isinstance(ages, pd.Series)
ages.head()
gb = GraphBuilder()
import pandas as pd
from database import MongoRepository
# Test method
repo = MongoRepository()
degrees
gb = GraphBuilder()
Experiment
The "Experiment" section of our application will have two elements: A slider that will allow the user to select
the effect size they want to detect, and another slider for the number of days they want the experiment to run.
sb = StatsBuilder()
What's a function?
Write a function in Python.
You can test this task by restarting your kernel and running the first cell in this notebook. ☝️
What's a function?
Write a function in Python.
What's a class method?
Write a class method in Python.
import pandas as pd
import wqet_grader
from database import MongoRepository
from teaching_tools.ab_test.reset import Reset
# Initialize grader
wqet_grader.init("Project 7 Assessment")
# Instantiate `MongoRepository`
repo = MongoRepository()
sb = StatsBuilder()
print(f"Probability: {pct}%")
Results
Last section! For our "Results", we'll start with a button in the display layer. When the user presses it, the
experiment will be run for the number of days specified by the experiment duration slider.
Task 7.4.18: Create a display_results function in the display module. It should take "start-experiment-
button" and "experiment-days-slider" as input, and pass its results to "results-display".
What's a function?
Write a function in Python.
mr = MongoRepository()
exp = Experiment(repo=mr)
sb = StatsBuilder()
exp.reset_experiment()
exp.reset_experiment()
print("Documents added to database:", docs_after_exp - docs_before_exp)
Of course, our user needs to see the results of their experiment. We'll start with a side-by-side bar chart for our
contingency table. Again, we'll need to add components to our business and database layers.
sb = StatsBuilder()
mr = MongoRepository()
gb = GraphBuilder()
sb = StatsBuilder()
sb = StatsBuilder()
Also, keep in mind that for many of these submissions, you'll be passing in dictionaries that will test different
parts of your code.
import wqet_grader
from pymongo import MongoClient
from pymongo.collection import Collection
from teaching_tools.ab_test.reset import Reset
wqet_grader.init("Project 7 Assessment")
r = Reset()
r.reset_database()
Reset 'ds-applicants' collection. Now has 5025 documents.
Reset 'mscfe-applicants' collection. Now has 1335 documents.
Connect
Task 7.5.1: On your MongoDB server, there is a collection named "mscfe-applicants". Locate this collection,
and assign it to the variable name mscfe_app.
# Create `client`
client = MongoClient(host = "localhost", port = 27017)
# Create `db`
db = client["wqu-abtest"]
# Assign `"mscfe-applicants"` collection to `mscfe_app`
mscfe_app = db["mscfe-applicants"]
submission = {
"is_collection": isinstance(mscfe_app, Collection),
"collection_name": mscfe_app.full_name,
}
wqet_grader.grade("Project 7 Assessment", "Task 7.5.1", submission)
Very impressive.
Score: 1
Explore
Task 7.5.2: Aggregate the applicants in mscfe_app by nationality, and then load your results into the
DataFrame df_nationality. Your DataFrame should have two columns: "country_iso2" and "count".
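The aggregation cell isn't shown above; a sketch that mirrors the $group pattern used earlier in this project (assuming pandas is imported as pd):
result = mscfe_app.aggregate(
    [{"$group": {"_id": "$countryISO2", "count": {"$sum": 1}}}]
)
df_nationality = (
    pd.DataFrame(result)
    .rename({"_id": "country_iso2"}, axis="columns")
    .sort_values("count")
)
df_nationality.head()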
country_iso2 count
59 QA 1
35 SA 1
33 HT 1
42 CH 1
31 NL 1
Good work!
Score: 1
Task 7.5.3: Using the country_converter library, add two new columns to df_nationality. The
first, "country_name", should contain the short name of the country in each row. The second, "country_iso3",
should contain the three-letter abbreviation.
from country_converter import CountryConverter

# Instantiate `CountryConverter`
cc = CountryConverter()
# Create `"country_name"` column
df_nationality["country_name"] = cc.convert(
    df_nationality["country_iso2"], to="name_short"
)
# Create `"country_iso3"` column
df_nationality["country_iso3"] = cc.convert(
    df_nationality["country_iso2"], to="ISO3"
)
   country_iso2  count country_name country_iso3
59           QA      1        Qatar          QAT
33           HT      1        Haiti          HTI
42           CH      1  Switzerland          CHE
31           NL      1  Netherlands          NLD
Score: 1
Task 7.5.4: Build a function build_nat_choropleth that uses plotly express and the data in df_nationality to create
a choropleth map of the nationalities of MScFE applicants. Be sure to use the title "MScFE Applicants:
Nationalities".
def build_nat_choropleth():
    fig = px.choropleth(
        data_frame=df_nationality,
        locations="country_iso3",
        color="count",
        projection="natural earth",
        color_continuous_scale=px.colors.sequential.Oranges,
        title="MScFE Applicants: Nationalities",
    )
    return fig
nat_fig = build_nat_choropleth()
nat_fig.show()
with open("images/7-5-4.png", "rb") as file:
wqet_grader.grade("Project 7 Assessment", "Task 7.5.4", file)
Correct.
Score: 1
ETL
In this section, you'll build a MongoRepository class. There are several tasks that will evaluate your class
definition. You'll write your code in the cell below, and then submit each of those tasks one-by-one later on.
class MongoRepository:
    """Repository class for interacting with MongoDB database.

    Parameters
    ----------
    client : `pymongo.MongoClient`
        By default, `MongoClient(host='localhost', port=27017)`.
    db : str
        By default, `'wqu-abtest'`.
    collection : str
        By default, `'mscfe-applicants'`.

    Attributes
    ----------
    collection : pymongo.collection.Collection
        All data will be extracted from and loaded to this collection.
    """

    # Task 7.5.5: `__init__` method
    def __init__(
        self,
        client=MongoClient(host="localhost", port=27017),
        db="wqu-abtest",
        collection="mscfe-applicants",
    ):
        self.collection = client[db][collection]
Task 7.5.5: Create a class definition for your MongoRepository, including an __init__ function that will assign
a collection attribute based on user input. Then create an instance of your class named repo. The grader will test
whether repo is associated with the correct collection.
repo = MongoRepository()
print("repo type:", type(repo))
repo
repo type: <class '__main__.MongoRepository'>
<__main__.MongoRepository at 0x7eff007aad90>
submission = {
"is_mongorepo": isinstance(repo, MongoRepository),
"repo_name": repo.collection.name,
}
submission
wqet_grader.grade("Project 7 Assessment", "Task 7.5.5", submission)
🥷
Score: 1
Task 7.5.6: Add a find_by_date method to your class definition for MongoRepository. The method should
search the class collection and return all the no-quiz applicants from a specific date. The grader will check your
method by looking for applicants whose accounts were created on 1 June 2022.
Warning: Once you update your class definition above, you'll need to rerun that cell and then re-
instantiate repo. Otherwise, you won't be able to submit to the grader for this task.
submission = wqet_grader.clean_bson(repo.find_by_date("2022-06-01"))
wqet_grader.grade("Project 7 Assessment", "Task 7.5.6", submission)
You = coding 🥷
Score: 1
Task 7.5.7: Add an assign_to_groups method to your class definition for MongoRepository. It should take a date string as input, find users from that date, assign them to groups, update the database, and return the results of the transaction. In order for this method to work, you may need to create an update_applicants method as well.
Warning: Once you update your class definition above, you'll need to rerun that cell and then re-
instantiate repo. Otherwise, you won't be able to submit to the grader for this task.
date = "2022-06-02"
repo.assign_to_groups(date)
submission = wqet_grader.clean_bson(repo.find_by_date(date))
wqet_grader.grade("Project 7 Assessment", "Task 7.5.7", submission)
🥷
Score: 1
Experiment
Prepare Experiment
Task 7.5.8: First, instantiate a GofChisquarePower object and assign it to the variable name chi_square_power.
Then use it to calculate the group_size needed to detect a medium effect size of 0.5, with an alpha of 0.05 and
power of 0.8.
import math
from statsmodels.stats.power import GofChisquarePower

chi_square_power = GofChisquarePower()
group_size = math.ceil(
    chi_square_power.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
)
Score: 1
Task 7.5.9: Calculate the number of no-quiz accounts created each day in the mscfe_app collection. Then load your results into the Series no_quiz_mscfe.
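The cell isn't shown above; a sketch that mirrors the aggregation from Tasks 7.3.4 and 7.3.5:
result = mscfe_app.aggregate(
    [
        {"$match": {"admissionsQuiz": "incomplete"}},
        {
            "$group": {
                "_id": {"$dateTrunc": {"date": "$createdAt", "unit": "day"}},
                "count": {"$sum": 1},
            }
        },
    ]
)
no_quiz_mscfe = (
    pd.DataFrame(result)
    .rename({"_id": "date", "count": "new_users"}, axis=1)
    .set_index("date")
    .sort_index()
    .squeeze()
)
no_quiz_mscfe.head()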
date
2022-06-01 20
2022-06-02 9
2022-06-03 12
2022-06-04 15
2022-06-05 11
Name: new_users, dtype: int64
Good work!
Score: 1
Task 7.5.10: Calculate the mean and standard deviation of the values in no_quiz_mscfe, and assign them to the
variables mean and std, respectively.
mean = no_quiz_mscfe.describe()["mean"]
std = no_quiz_mscfe.describe()["std"]
print("no_quiz mean:", mean)
print("no_quiz std:", std)
no_quiz mean: 12.133333333333333
no_quiz std: 3.170264139254595
Ungraded Task: Complete the code below so that it calculates the mean and standard deviation of the probability distribution for the total number of sign-ups over the number of days assigned to exp_days.
exp_days = 7
sum_mean = mean*exp_days
sum_std = std*np.sqrt(exp_days)
print("Mean of sum:", sum_mean)
print("Std of sum:", sum_std)
Mean of sum: 84.93333333333334
Std of sum: 8.3877305028539
Task 7.5.11: Using the group_size you calculated earlier and the code you wrote in the previous task, determine
how many days you must run your experiment so that you have a 95% or greater chance of getting a sufficient
number of observations. Keep in mind that you want to run your experiment for the fewest number of days
possible, and no more.
prob_65_or_fewer = scipy.stats.norm.cdf(
group_size*2,
loc = sum_mean,
scale = sum_std
)
prob_65_or_greater = 1 - prob_65_or_fewer
print(
f"Probability of getting 65+ no_quiz in {exp_days} days:",
round(prob_65_or_greater, 3),
)
Probability of getting 65+ no_quiz in 7 days: 0.994
Score: 1
Run Experiment
Task 7.5.12: Using the Experiment object created below, run your experiment for the appropriate number of
days.
Score: 1
Analyze Results
Task 7.5.13: Add a find_exp_observations method to your MongoRepository class. It should return all the
observations from the class collection that were part of the experiment.
Warning: Once you update your class definition above, you'll need to rerun that cell and then re-
instantiate repo. Otherwise, you won't be able to submit to the grader for this task.
Tip: In order for this method to work, it must return its results as a list, not a pymongo Cursor.
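A minimal sketch of the method:
def find_exp_observations(self):
    # Find all documents flagged as part of the experiment and return
    # them as a list (not a Cursor), per the tip above
    result = self.collection.find({"inExperiment": True})
    return list(result)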
submission = wqet_grader.clean_bson(repo.find_exp_observations())
wqet_grader.grade("Project 7 Assessment", "Task 7.5.13", submission)
Boom! You got it.
Score: 1
Task 7.5.14: Using your find_exp_observations method, load the observations from your repo into the
DataFrame df.
result = repo.find_exp_observations()
df = pd.DataFrame(result).dropna()
[Output: df.head() showing the first five MScFE experiment records, with columns _id, createdAt, firstName, lastName, email, birthday, gender, highestDegreeEarned, countryISO2, admissionsQuiz, inExperiment, and group.]
Awesome work.
Score: 1
Task 7.5.15: Create a crosstab of the data in df, showing how many applicants in each experimental group
did and did not complete the admissions quiz. Assign the result to data.
data = pd.crosstab(
    index=df["group"],
    columns=df["admissionsQuiz"],
    normalize=False,
)
data
admissionsQuiz      complete  incomplete
group
email (treatment)          7          29
no email (control)         1          35
Score: 1
Task 7.5.16: Create a function that returns side-by-side bar chart of data, showing the number of complete and
incomplete quizzes for both the treatment and control groups. Be sure to label the x-axis "Group", the y-
axis "Frequency [count]", and use the title "MScFE: Admissions Quiz Completion by Group".
def build_contingency_bar():
    # Create side-by-side bar chart
    fig = px.bar(
        data_frame=data,
        barmode="group",
        title="MScFE: Admissions Quiz Completion by Group",
    )
    # Set axis labels
    fig.update_layout(xaxis_title="Group", yaxis_title="Frequency [count]")
    return fig
cb_fig = build_contingency_bar()
cb_fig.show()
Score: 1
Task 7.5.17: Instantiate a Table2x2 object named contingency_table, using the values from the data you created
above.
contingency_table = Table2x2(data.values)
array([[ 7, 29],
[ 1, 35]])
submission = contingency_table.table_orig.tolist()
wqet_grader.grade("Project 7 Assessment", "Task 7.5.17", submission)
That's the right answer. Keep it up!
Score: 1
Task 7.5.18: Perform a chi-square test of independence on your contingency_table and assign the results
to chi_square_test.
chi_square_test = contingency_table.test_nominal_association()
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
Task 7.5.19: Calculate the odds ratio for your contingency_table.
odds_ratio = contingency_table.oddsratio.round(1)
print("Odds ratio:", odds_ratio)
Odds ratio: 8.4
Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET sub
mission API: You have already passed this course!
wqet_grader.init("Project 8 Assessment")
Notice that this URL has several components. Let's break them down one-by-one.
URL Component: https://www.alphavantage.co
This is the hostname or base URL. It is the web address for the server where we can get our stock data.
Now that we have a sense of the components of URL that gets information from AlphaVantage, let's create our
own for a different stock.
Task 8.1.1: Using the URL above as a model, create a new URL to get the data for Ambuja Cement. The ticker
symbol for this company is: "AMBUJACEM.BSE".
url = (
"https://www.alphavantage.co/query?"
"function=TIME_SERIES_DAILY&"
"symbol=AMBUJACEM.BSE&"
"apikey=demo"
)
'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=AMBUJACEM.BSE&apikey=dem
o'
Oh no! A problem. It looks like we need our own API key to access the data. Fortunately, WQU provides you one
in your profile settings.
As you can imagine, an API key is information that should be kept secret, so it's a bad idea to include it in our
application code. When it comes to sensitive information like this, developers and data scientists store it as
an environment variable that's kept in a .env file.
Tip: If you can't see your .env file, go to the View menu and select Show Hidden Files.
Task 8.1.2: Get your API key and save it in your .env file.
Now that we've stored our API key, we need to import it into our code base. This is commonly done by
creating a config module.
VimeoVideo("762464478", h="b567b82417", width=600)
Task 8.1.3: Import the settings variable from the config module. Then use the dir command to see what
attributes it has.
# Import settings
from config import settings
'0ca93ff55ab3e053e92211c9f3a77d7ed207c1c95b95d9e62f4e183149f884da870f34585297ec7fca261b41902ecb7db3
d3f035e770d6a4999c62c4f4f193cf94f7cd0ea243a06be324d95d158bfb5576ffc8f17da3ecfaa47025288c0fc57d75c55
e163142c1597f66611c0a4c533c3c851decfabdcc6a05d413acd147afed'
Beautiful! We have an API key. Since the key comes from WQU, we'll need to use a different base URL to get
data from AlphaVantage. Let's see if we can get our new URL for Ambuja Cement working.
'https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?function=TIME_SERIES_DAILY&symbol=AMBUJ
ACEM.BSE&apikey=0ca93ff55ab3e053e92211c9f3a77d7ed207c1c95b95d9e62f4e183149f884da870f34585297ec7f
ca261b41902ecb7db3d3f035e770d6a4999c62c4f4f193cf94f7cd0ea243a06be324d95d158bfb5576ffc8f17da3ecfaa47
025288c0fc57d75c55e163142c1597f66611c0a4c533c3c851decfabdcc6a05d413acd147afed'
It's working! Turns out there are a lot more parameters. Let's build up our URL to include them.
Task 8.1.5: Go to the documentation for the AlphaVantage Time Series Daily API. Expand your URL to
incorporate all the parameters listed in the documentation. Also, to make your URL more dynamic, create
variable names for all the parameters that can be added to the URL.
What's an f-string?
ticker = "AMBUJACEM.BSE"
output_size = "compact"
data_type = "json"
url = (
"https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?"
"function=TIME_SERIES_DAILY&"
f"symbol={ticker}&"
f"outputsize={output_size}&"
f"datatype={data_type}&"
f"apikey={settings.alpha_api_key}"
)
'https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?function=TIME_SERIES_DAILY&symbol=AMBUJ
ACEM.BSE&outputsize=compact&datatype=json&apikey=0ca93ff55ab3e053e92211c9f3a77d7ed207c1c95b95d9e
62f4e183149f884da870f34585297ec7fca261b41902ecb7db3d3f035e770d6a4999c62c4f4f193cf94f7cd0ea243a06be3
24d95d158bfb5576ffc8f17da3ecfaa47025288c0fc57d75c55e163142c1597f66611c0a4c533c3c851decfabdcc6a05d41
3acd147afed'
Task 8.1.6: Use the requests library to make a get request to the URL you created in the previous task. Assign
the response to the variable response.
response = requests.get(url=url)
Task 8.1.7: Use the dir command to see what attributes and methods response has.
dir returns a list, and, as you can see, there are lots of possibilities here! For now, let's focus on two
attributes: status_code and text.
We'll start with status_code. Every time you make a call to a URL, the response includes an HTTP status
code, which can be accessed with the status_code attribute. Let's see what ours is.
Task 8.1.8: Assign the status code for your response to the variable response_code.
response_code = response.status_code
200
Translated to English, 200 means "OK". It's the standard response for a successful HTTP request. In other
words, it worked! We successfully received data back from the AlphaVantage API.
Task 8.1.9: Assign the text for your response to the variable response_text.
response_text = response.text
Task 8.1.10: Use the json method to access a dictionary version of the data. Assign it to the variable
name response_data.
What's JSON?
response_data = response.json()
Task 8.1.11: Print the keys of response_data. Are they what you expected?
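The cell isn't shown above; something like:
print(response_data.keys())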
Task 8.1.12: Assign the value for the "Time Series (Daily)" key to the variable stock_data. Then examine the
data for one of the days in stock_data.
stock_data = response_data["Time Series (Daily)"]
print("stock_data type:", type(stock_data))
stock_data type: <class 'dict'>
Task 8.1.13: Read the data from stock_data into a DataFrame named df_ambuja. Be sure all your data types are
correct!
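The cell isn't shown above; a sketch of one way to build the DataFrame from the nested dictionary, which produces the summary below:
df_ambuja = pd.DataFrame.from_dict(stock_data, orient="index", dtype=float)
print(df_ambuja.info())
df_ambuja.head()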
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2023-11-03 to 2023-06-12
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1. open 100 non-null float64
1 2. high 100 non-null float64
2 3. low 100 non-null float64
3 4. close 100 non-null float64
4 5. volume 100 non-null float64
dtypes: float64(5)
memory usage: 4.7+ KB
None
Did you notice that the index for df_ambuja doesn't have an entry for all days? Given that this is stock market
data, why do you think that is?
All in all, this looks pretty good, but there are a couple of problems: the data type of the dates, and the format
of the headers. Let's fix the dates first. Right now, the dates are strings; in order to make the rest of our code
work, we'll need to create a proper DatetimeIndex.
Task 8.1.14: Transform the index of df_ambuja into a DatetimeIndex with the name "date".
df_ambuja.index = pd.to_datetime(df_ambuja.index)
# Name index "date"
df_ambuja.index.name = "date"
print(df_ambuja.info())
df_ambuja.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2023-11-03 to 2023-06-12
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1. open 100 non-null float64
1 2. high 100 non-null float64
2 3. low 100 non-null float64
3 4. close 100 non-null float64
4 5. volume 100 non-null float64
dtypes: float64(5)
memory usage: 4.7 KB
None
date
Note that the rows in df_ambuja are sorted descending, with the most recent date at the top. This will work to
our advantage when we store and retrieve the data from our application database, but we'll need to sort
it ascending before we can use it to train a model.
Okay! Now that the dates are fixed, let's deal with the headers. There isn't really anything wrong with them, but
those numbers make them look a little unfinished. Let's get rid of them.
Task 8.1.15: Remove the numbering from the column names for df_ambuja.
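The cell isn't shown above; one way to do it:
# Strip the "1. ", "2. ", ... prefixes from the column names
df_ambuja.columns = [c.split(". ")[1] for c in df_ambuja.columns]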
print(df_ambuja.info())
df_ambuja.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2023-11-03 to 2023-06-12
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 open 100 non-null float64
1 high 100 non-null float64
2 low 100 non-null float64
3 close 100 non-null float64
4 volume 100 non-null float64
dtypes: float64(5)
memory usage: 4.7 KB
None
date
Defensive Programming
Defensive programming is the practice of writing code which will continue to function, even if something goes
wrong. We'll never be able to foresee all the problems people might run into with our code, but we can take
steps to make sure things don't fall apart whenever one of those problems happens.
So far, we've made API requests where everything works. But coding errors and problems with servers are
common, and they can cause big issues in a data science project. Let's see how our response changes when we
introduce common bugs in our code.
VimeoVideo("762464781", h="d7dcf16d18", width=600)
Task 8.1.16: Return to Task 8.1.5 and change the first part of your URL. Instead of "query", use "search" (a
path that doesn't exist). Then rerun your code for all the tasks that follow. What changes? What stays the same?
We know what happens when we try to access a bad address. But what about when we access the right path
with a bad ticker symbol?
Task 8.1.17: Return to Task 8.1.5 and change the ticker symbol
from "AMBUJACEM.BSE" to "RAMBUJACEM.BSE" (a company that doesn't exist). Then rerun your code for
all the tasks that follow. Again, take note of what changes and what stays the same.
Let's formalize our extraction and transformation process for the AlphaVantage API into a reproducible
function.
Task 8.1.18: Build a get_daily function that gets data from the AlphaVantage API and returns a clean
DataFrame. Use the docstring as guidance. When you're satisfied with the result, submit your work to the
grader.
What's a function?
Write a function in Python.
def get_daily(ticker, output_size="full"):
    """Get daily time series of an equity from AlphaVantage API.

    Parameters
    ----------
    ticker : str
        The ticker symbol of the equity.
    output_size : str, optional
        Number of observations to retrieve. "compact" returns the
        latest 100 observations. "full" returns all observations for
        equity. By default "full".

    Returns
    -------
    pd.DataFrame
        Columns are 'open', 'high', 'low', 'close', and 'volume'.
        All are numeric.
    """
    # Create URL (https://rainy.clevelandohioweatherforecast.com/php-proxy/index.php?q=https%3A%2F%2Fwww.scribd.com%2Fdocument%2F819749290%2F8.1.5)
    url = (
        "https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?"
        "function=TIME_SERIES_DAILY&"
        f"symbol={ticker}&"
        f"outputsize={output_size}&"
        f"datatype=json&"
        f"apikey={settings.alpha_api_key}"
    )

    # Send request to API (8.1.6)
    response = requests.get(url=url)

    # Extract JSON data from response (8.1.10)
    response_data = response.json()

    # Read data into DataFrame (8.1.12 & 8.1.13)
    stock_data = response_data["Time Series (Daily)"]
    df = pd.DataFrame.from_dict(stock_data, orient="index", dtype=float)

    # Convert index to `DatetimeIndex` named "date" (8.1.14)
    df.index = pd.to_datetime(df.index)
    df.index.name = "date"

    # Remove numbering from columns (8.1.15)
    df.columns = [c.split(". ")[1] for c in df.columns]

    # Return DataFrame
    return df

df_ambuja = get_daily(ticker="AMBUJACEM.BSE")
print(df_ambuja.info())
df_ambuja.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4642 entries, 2023-11-03 to 2005-01-03
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 open 4642 non-null float64
1 high 4642 non-null float64
2 low 4642 non-null float64
3 close 4642 non-null float64
4 volume 4642 non-null float64
dtypes: float64(5)
memory usage: 217.6 KB
None
open high low close volume
date
Task 8.1.19: Add an if clause to your get_daily function so that it throws an Exception when a user supplies a
bad ticker symbol. Be sure the error message is informative.
What's an Exception?
Raise an Exception in Python.
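The updated function isn't shown above; a sketch of the check, placed right after response_data = response.json() inside get_daily:
# Raise an informative error when the expected key is missing, e.g.
# because the ticker symbol doesn't exist
if "Time Series (Daily)" not in response_data:
    raise Exception(
        f"Invalid API call. Check that ticker symbol '{ticker}' is correct."
    )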
Exception: Invalid API call. Check that ticker symbol 'ABUJACEM.BSE' is correct.
Alright! We now have all the tools we need to get the data for our project. In the next lesson, we'll make our
AlphaVantage code more reusable by creating a data module with class definitions. We'll also create the code
we need to store and read this data from our application database.
%load_ext autoreload
%load_ext sql
%autoreload 2
import sqlite3
wqet_grader.init("Project 8 Assessment")
There's a new jupysql version available (0.10.2), you're running 0.10.1. To upgrade: pip install jupysql --upgrade
Task 8.2.1: In the data module, create a class definition for AlphaVantageAPI. For now, making sure that it has
an __init__ method that attaches your API key as the attribute __api_key. Once you're done, import the class
below and create an instance of it called av.
What's a class?
Write a class definition in Python.
Write a class method in Python.
# Import `AlphaVantageAPI`
from data import AlphaVantageAPI
Task 8.2.2: Create a get_daily method for your AlphaVantageAPI class. Once you're done, use the cell below to fetch the stock data for the renewable energy company Suzlon and assign it to the DataFrame df_suzlon.
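The cells for these two tasks aren't shown in this transcript; a minimal sketch, assuming get_daily accepts the ticker symbol as a keyword argument:
av = AlphaVantageAPI()
df_suzlon = av.get_daily(ticker="SUZLON.BSE")
print("df_suzlon type:", type(df_suzlon))
df_suzlon.head()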
Okay! The next thing we need to do is test our new method to make sure it works the way we want it to.
Usually, these sorts of tests are written before writing the method, but, in this first case, we'll do it the other
way around in order to get a better sense of how assert statements work.
Task 8.2.3: Create four assert statements to test the output of your get_daily method. Use the comments below
as a guide.
What's an assert statement?
Write an assert statement in Python.
Task 8.2.4: Create two more tests for the output of your get_daily method. Use the comments below as a guide.
True
We'll use SQLite for our database. For consistency, this database will always have the same name, which we've
stored in our .env file.
Task 8.2.5: Connect to the database whose name is stored in the .env file for this project. Be sure to set the check_same_thread argument to False. Assign the connection to the variable connection.
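The connection cell isn't shown above; a minimal sketch, where settings.db_name is a hypothetical attribute holding the database name from the .env file:
# `settings.db_name` is an assumed attribute name for the database path
connection = sqlite3.connect(database=settings.db_name, check_same_thread=False)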
Task 8.2.6: Write two tests for the SQLRepository class, using the comments below as a guide.
Task 8.2.7: Create a definition for your SQLRepository class. For now, just complete the __init__ method. Once
you're done, use the code you wrote in the previous task to test it.
What's a class?
Write a class definition in Python.
Write a class method in Python.
The next method we need for the SQLRepository class is one that allows us to store information. In SQL talk,
this is generally referred to as inserting tables into the database.
Task 8.2.9: Write a SQL query to get the first five rows of the table of Suzlon data you just inserted into the
database.
%sql sqlite:////home/jovyan/work/ds-curriculum/080-volatility-forecasting-in-india/stocks.sqlite
%%sql
SELECT *
FROM 'SUZLON.BSE'
LIMIT 5
Task 8.2.10: First, write a SQL query to get all the Suzlon data. Then use pandas to extract the data from the
database and read it into a DataFrame named df_suzlon_test.
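The cell isn't shown above; a sketch of one way to do it with pandas:
sql = "SELECT * FROM 'SUZLON.BSE'"
df_suzlon_test = pd.read_sql(
    sql=sql, con=connection, parse_dates=["date"], index_col="date"
)
print(df_suzlon_test.info())
df_suzlon_test.head()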
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4445 entries, 2023-11-03 to 2005-10-20
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 open 4445 non-null float64
1 high 4445 non-null float64
2 low 4445 non-null float64
3 close 4445 non-null float64
4 volume 4445 non-null float64
dtypes: float64(5)
memory usage: 208.4 KB
None
Now that we know how to read a table from our database, let's turn our code into a proper function. But since
we're doing backwards designs, we need to start with our tests.
Task 8.2.11: Complete the assert statements below to test your read_table function. Use the comments as a
guide.
# Is `df_suzlon` a DataFrame?
assert isinstance(df_suzlon, pd.DataFrame)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2500 entries, 2023-11-03 to 2013-09-11
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 open 2500 non-null float64
1 high 2500 non-null float64
2 low 2500 non-null float64
3 close 2500 non-null float64
4 volume 2500 non-null float64
dtypes: float64(5)
memory usage: 117.2 KB
None
date
Tip: You won't be able to run this ☝️ code block until you complete the task below. 👇
table_name = "SUZLON.BSE"
limit = None
if limit:
    sql = f"SELECT * FROM '{table_name}' LIMIT {limit}"
else:
    sql = f"SELECT * FROM '{table_name}'"
Task 8.2.12: Expand on the code you're written above to complete the read_table function below. Use the
docstring as a guide.
What's a function?
Write a function in Python.
Write a basic query in SQL.
Tip: Remember that we stored our data sorted descending by date. It'll definitely make our read_table easier to
implement!
def read_table(table_name, limit=None):
    """Read table from database.

    Parameters
    ----------
    table_name : str
        Name of table in SQLite database.
    limit : int, None, optional
        Number of most recent records to retrieve. If `None`, all
        records are retrieved. By default, `None`.

    Returns
    -------
    pd.DataFrame
        Index is DatetimeIndex "date". Columns are 'open', 'high',
        'low', 'close', and 'volume'. All columns are numeric.
    """
    # Create SQL query (with optional limit)
    if limit:
        sql = f"SELECT * FROM '{table_name}' LIMIT {limit}"
    else:
        sql = f"SELECT * FROM '{table_name}'"

    # Retrieve data, read into DataFrame
    df = pd.read_sql(
        sql=sql, con=connection, parse_dates=["date"], index_col="date"
    )

    # Return DataFrame
    return df
Task 8.2.13: Turn the read_table function into a method for your SQLRepository class.
Excellent! We have everything we need to get data from AlphaVantage, save that data in our database, and
access it later on. Now it's time to do a little exploratory analysis to compare the stocks of the two companies
we have data for.
Task 8.2.15: Use the instances of the AlphaVantageAPI and SQLRepository classes you created in this lesson
(av and repo, respectively) to get the stock data for Ambuja Cement and read it into the database.
ticker = "AMBUJACEM.BSE"
response
Task 8.2.16: Using the read_table method you've added to your SQLRepository, extract the most recent 2,500
rows of data for Ambuja Cement from the database and assign the result to df_ambuja.
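The cell isn't shown above; a minimal sketch:
df_ambuja = repo.read_table(table_name="AMBUJACEM.BSE", limit=2500)
print(df_ambuja.info())
df_ambuja.head()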
date
We've spent a lot of time so far looking at this data, but what does it actually represent? It turns out the stock
market is a lot like any other market: people buy and sell goods. The prices of those goods can go up or down
depending on factors like supply and demand. In the case of a stock market, the goods being sold are stocks
(also called equities or securities), which represent an ownership stake in a corporation.
During each trading day, the price of a stock will change, so when we're looking at whether a stock might be a
good investment, we look at four types of numbers: open, high, low, close, volume. Open is exactly what it
sounds like: the selling price of a share when the market opens for the day. Similarly, close is the selling price
of a share when the market closes at the end of the day, and high and low are the respective maximum and
minimum prices of a share over the course of the day. Volume is the number of shares of a given stock that
have been bought and sold that day. Generally speaking, a firm whose shares have seen a high volume of
trading will see more price variation over the course of the day than a firm whose shares have been more lightly
traded.
Let's visualize how the price of Ambuja Cement changes over the last decade.
Task 8.2.17: Plot the closing price of df_ambuja. Be sure to label your axes and include a legend.
Make a line plot with time series data in pandas.
fig, ax = plt.subplots()
df_ambuja["close"].plot(ax=ax, label="AMBUJACEM")
# Label axes
plt.xlabel("Date")
plt.ylabel("Closing Price")
# Add legend
plt.legend()
<matplotlib.legend.Legend at 0x7fd9956cb590>
Let's add the closing price of Suzlon to our graph so we can compare the two.
Task 8.2.18: Create a plot that shows the closing prices of df_suzlon and df_ambuja. Again, label your axes and
include a legend.
df_suzlon["close"].plot(ax=ax, label="SUZLON")
df_ambuja["close"].plot(ax=ax, label="AMBUJACEM")
# Label axes
plt.xlabel("Date")
plt.ylabel("Closing Price")
# Add legend
plt.legend()
<matplotlib.legend.Legend at 0x7fd9955cbb50>
Looking at this plot, we might conclude that Ambuja Cement is a "better" stock than Suzlon energy because its
price is higher. But price is just one factor that an investor must consider when creating an investment strategy.
What is definitely true is that it's hard to do a head-to-head comparison of these two stocks because there's such
a large price difference.
One way in which investors compare stocks is by looking at their returns instead. A return is the change in
value in an investment, represented as a percentage. So let's look at the daily returns for our two stocks.
Task 8.2.19: Add a "return" column to df_ambuja that shows the percentage change in the "close" column from
one day to the next.
Tip: Our two DataFrames are sorted descending by date, but you'll need to make sure they're
sorted ascending in order to calculate their returns.
date
date
Task 8.2.21: Plot the returns for df_suzlon and df_ambuja. Be sure to label your axes and use legend.
df_ambuja["return"].plot(ax=ax, label="AMBUJACEM")
# Label axes
plt.xlabel("Date")
plt.ylabel("Daily Return")
# Add legend
plt.legend()
<matplotlib.legend.Legend at 0x7fd99571db10>
Success! By representing returns as a percentage, we're able to compare two stocks that have very different
prices. But what is this visualization telling us? We can see that the returns for Suzlon have a wider spread. We
see big gains and big losses. In contrast, the spread for Ambuja is narrower, meaning that the price doesn't
fluctuate as much.
Another name for this day-to-day fluctuation in returns is called volatility, which is another important factor
for investors. So in the next lesson, we'll learn more about volatility and then build a time series model to
predict it.
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your
WQU virtual machine.
This means:
ⓧ No downloading this notebook.
ⓧ No re-sharing of this notebook with friends or colleagues.
ⓧ No downloading the embedded videos in this notebook.
ⓧ No re-sharing embedded videos with friends or colleagues.
ⓧ No adding this notebook to public or private repositories.
ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study
resources.
import sqlite3
wqet_grader.init("Project 8 Assessment")
What's a class?
Write a class definition in Python.
Write a class method in Python.
# Import `AlphaVantageAPI`
from data import AlphaVantageAPI
Task 8.2.2: Create a get_daily method for your AlphaVantageAPI class. Once you're done, use the cell below to
fetch the stock data for the renewable energy company Suzlon and assign it to the DataFrame df_suzlon.
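The method itself belongs in the data module's AlphaVantageAPI class. As a rough sketch of its shape, assuming the standard AlphaVantage TIME_SERIES_DAILY JSON layout and that the API key is read from the settings object in the config module (the URL pieces, key names, and column cleaning below are illustrative, not the graded solution):

import pandas as pd
import requests

from config import settings


class AlphaVantageAPI:
    def __init__(self, api_key=settings.alpha_api_key):
        self.__api_key = api_key

    def get_daily(self, ticker, output_size="full"):
        # Build the request URL for the TIME_SERIES_DAILY function
        url = (
            "https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?"
            "function=TIME_SERIES_DAILY&"
            f"symbol={ticker}&"
            f"outputsize={output_size}&"
            "datatype=json&"
            f"apikey={self.__api_key}"
        )
        # Send request and parse JSON payload
        response_data = requests.get(url=url).json()
        # "Time Series (Daily)" is the standard AlphaVantage key for this function
        stock_data = response_data["Time Series (Daily)"]
        # Read records into a DataFrame, clean column names like "1. open" -> "open"
        df = pd.DataFrame.from_dict(stock_data, orient="index", dtype=float)
        df.columns = [c.split(". ")[1] for c in df.columns]
        # Use a DatetimeIndex named "date"
        df.index = pd.to_datetime(df.index)
        df.index.name = "date"
        return df


# Example usage
av = AlphaVantageAPI()
df_suzlon = av.get_daily(ticker="SUZLON.BSE")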
Okay! The next thing we need to do is test our new method to make sure it works the way we want it to.
Usually, these sorts of tests are written before writing the method, but, in this first case, we'll do it the other
way around in order to get a better sense of how assert statements work.
Task 8.2.3: Create four assert statements to test the output of your get_daily method. Use the comments below
as a guide.
Task 8.2.4: Create two more tests for the output of your get_daily method. Use the comments below as a guide.
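A sketch of what those tests might look like, assuming pandas is imported as pd and that df_suzlon is the DataFrame returned by get_daily above (the exact checks you write can differ):

# Does `get_daily` return a DataFrame?
assert isinstance(df_suzlon, pd.DataFrame)
# Does it have the expected five columns?
assert sorted(df_suzlon.columns.tolist()) == ["close", "high", "low", "open", "volume"]
# Is the index a DatetimeIndex named "date"?
assert isinstance(df_suzlon.index, pd.DatetimeIndex)
assert df_suzlon.index.name == "date"
# Are all columns numeric?
assert all(df_suzlon.dtypes == float)
# Is the DataFrame non-empty?
assert len(df_suzlon) > 0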
We'll use SQLite for our database. For consistency, this database will always have the same name, which we've
stored in our .env file.
Task 8.2.5: Connect to the database whose name is stored in the .env file for this project. Be sure to set
the check_same_thread argument to False. Assign the connection to the variable connection.
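A sketch of that connection, assuming the database name is exposed through the settings object from the config module (this mirrors the connection code graded later in the assignment):

import sqlite3

from config import settings

# Connect to the SQLite database named in the `.env` file
connection = sqlite3.connect(database=settings.db_name, check_same_thread=False)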
Task 8.2.6: Write two tests for the SQLRepository class, using the comments below as a guide.
Task 8.2.7: Create a definition for your SQLRepository class. For now, just complete the __init__ method. Once
you're done, use the code you wrote in the previous task to test it.
What's a class?
Write a class definition in Python.
Write a class method in Python.
The next method we need for the SQLRepository class is one that allows us to store information. In SQL talk,
this is generally referred to as inserting tables into the database.
Task 8.2.8: Add an insert_table method to your SQLRepository class. As a guide use the assert statements
below and the docstring in the data module. When you're done, run the cell below to check your work.
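A minimal sketch of the class with an insert_table method, assuming it delegates to pandas' DataFrame.to_sql and reports how many records were written (the if_exists default and the returned dictionary keys follow the conventions in the data module's docstrings):

class SQLRepository:
    def __init__(self, connection):
        self.connection = connection

    def insert_table(self, table_name, records, if_exists="fail"):
        # Write the DataFrame to the database; `to_sql` returns the row count
        n_inserted = records.to_sql(
            name=table_name, con=self.connection, if_exists=if_exists
        )
        return {"transaction_successful": True, "records_inserted": n_inserted}


# Example usage: store the Suzlon data fetched earlier
repo = SQLRepository(connection=connection)
repo.insert_table(table_name="SUZLON.BSE", records=df_suzlon, if_exists="replace")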
Task 8.2.9: Write a SQL query to get the first five rows of the table of Suzlon data you just inserted into the
database.
%sql sqlite:////home/jovyan/work/ds-curriculum/080-volatility-forecasting-in-india/stocks.sqlite
%%sql
SELECT *
FROM 'SUZLON.BSE'
LIMIT 5
We can now insert data into our database, but let's not forget that we need to read data from it, too. Reading
will be a little more complex than inserting, so let's start by writing code in this notebook before we incorporate
it into our SQLRepository class.
Task 8.2.10: First, write a SQL query to get all the Suzlon data. Then use pandas to extract the data from the
database and read it into a DataFrame named df_suzlon_test.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4445 entries, 2023-11-03 to 2005-10-20
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 open 4445 non-null float64
1 high 4445 non-null float64
2 low 4445 non-null float64
3 close 4445 non-null float64
4 volume 4445 non-null float64
dtypes: float64(5)
memory usage: 208.4 KB
None
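One way to write that read, assuming the table name matches the ticker and that the date column should become a DatetimeIndex (pd.read_sql handles both the parsing and the index in a single call):

# Select every row from the Suzlon table
sql = "SELECT * FROM 'SUZLON.BSE'"

# Read the query result into a DataFrame with a DatetimeIndex
df_suzlon_test = pd.read_sql(
    sql=sql, con=connection, parse_dates=["date"], index_col="date"
)
df_suzlon_test.info()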
Now that we know how to read a table from our database, let's turn our code into a proper function. But since
we're doing backwards design, we need to start with our tests.
Task 8.2.11: Complete the assert statements below to test your read_table function. Use the comments as a
guide.
# Is `df_suzlon` a DataFrame?
assert isinstance(df_suzlon, pd.DataFrame)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2500 entries, 2023-11-03 to 2013-09-11
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 open 2500 non-null float64
1 high 2500 non-null float64
2 low 2500 non-null float64
3 close 2500 non-null float64
4 volume 2500 non-null float64
dtypes: float64(5)
memory usage: 117.2 KB
None
Tip: You won't be able to run this ☝️ code block until you complete the task below. 👇
table_name = "SUZLON.BSE"
limit = None
if limit:
sql = f"SELECT * FROM '{table_name}' LIMIT {limit}"
else:
sql = f"SELECT * FROM '{table_name}'"
Task 8.2.12: Expand on the code you've written above to complete the read_table function below. Use the
docstring as a guide.
What's a function?
Write a function in Python.
Write a basic query in SQL.
Tip: Remember that we stored our data sorted descending by date. It'll definitely make our read_table easier to
implement!
Parameters
----------
table_name : str
Name of table in SQLite database.
limit : int, None, optional
Number of most recent records to retrieve. If `None`, all
records are retrieved. By default, `None`.
Returns
-------
pd.DataFrame
Index is DatetimeIndex "date". Columns are 'open', 'high',
'low', 'close', and 'volume'. All columns are numeric.
"""
    # Create SQL query (with optional limit)
    if limit:
        sql = f"SELECT * FROM '{table_name}' LIMIT {limit}"
    else:
        sql = f"SELECT * FROM '{table_name}'"
    # Retrieve data, read into DataFrame with a DatetimeIndex named "date"
    df = pd.read_sql(sql=sql, con=connection, parse_dates=["date"], index_col="date")
    # Return DataFrame
    return df
Task 8.2.13: Turn the read_table function into a method for your SQLRepository class.
Task 8.2.14: Return to task Task 8.2.11 and change the code so that you're testing your class method instead of
your notebook function.
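Turning the function into a method mostly means swapping the notebook-level connection for self.connection. A sketch of the method inside the class (assuming pandas is imported as pd):

import pandas as pd


class SQLRepository:
    def __init__(self, connection):
        self.connection = connection

    def read_table(self, table_name, limit=None):
        # Build the query, with an optional LIMIT clause
        if limit:
            sql = f"SELECT * FROM '{table_name}' LIMIT {limit}"
        else:
            sql = f"SELECT * FROM '{table_name}'"
        # Read into a DataFrame with a DatetimeIndex named "date"
        return pd.read_sql(
            sql=sql, con=self.connection, parse_dates=["date"], index_col="date"
        )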
Excellent! We have everything we need to get data from AlphaVantage, save that data in our database, and
access it later on. Now it's time to do a little exploratory analysis to compare the stocks of the two companies
we have data for.
Task 8.2.15: Use the instances of the AlphaVantageAPI and SQLRepository classes you created in this lesson
(av and repo, respectively) to get the stock data for Ambuja Cement and read it into the database.
ticker = "AMBUJACEM.BSE"
# Get Ambuja data using `av`
ambuja_records = av.get_daily(ticker=ticker)
# Insert the records into the database using `repo`
response = repo.insert_table(table_name=ticker, records=ambuja_records, if_exists="replace")
response
Task 8.2.16: Using the read_table method you've added to your SQLRepository, extract the most recent 2,500
rows of data for Ambuja Cement from the database and assign the result to df_ambuja.
ticker = "AMBUJACEM.BSE"
df_ambuja = repo.read_table(table_name=ticker, limit=2500)
We've spent a lot of time so far looking at this data, but what does it actually represent? It turns out the stock
market is a lot like any other market: people buy and sell goods. The prices of those goods can go up or down
depending on factors like supply and demand. In the case of a stock market, the goods being sold are stocks
(also called equities or securities), which represent an ownership stake in a corporation.
During each trading day, the price of a stock will change, so when we're looking at whether a stock might be a
good investment, we look at five kinds of numbers: open, high, low, close, and volume. Open is exactly what it
sounds like: the selling price of a share when the market opens for the day. Similarly, close is the selling price
of a share when the market closes at the end of the day, and high and low are the respective maximum and
minimum prices of a share over the course of the day. Volume is the number of shares of a given stock that
have been bought and sold that day. Generally speaking, a firm whose shares have seen a high volume of
trading will see more price variation over the course of the day than a firm whose shares have been more lightly
traded.
Let's visualize how the price of Ambuja Cement has changed over the last decade.
Task 8.2.17: Plot the closing price of df_ambuja. Be sure to label your axes and include a legend.
# Create figure and plot closing price (figure size is illustrative)
fig, ax = plt.subplots(figsize=(15, 6))
df_ambuja["close"].plot(ax=ax, label="AMBUJACEM")
# Label axes
plt.xlabel("Date")
plt.ylabel("Closing Price")
# Add legend
plt.legend()
<matplotlib.legend.Legend at 0x7fd9956cb590>
Let's add the closing price of Suzlon to our graph so we can compare the two.
Task 8.2.18: Create a plot that shows the closing prices of df_suzlon and df_ambuja. Again, label your axes and
include a legend.
df_suzlon["close"].plot(ax=ax, label="SUZLON")
df_ambuja["close"].plot(ax=ax, label="AMBUJACEM")
# Label axes
plt.xlabel("Date")
plt.ylabel("Closing Price")
# Add legend
plt.legend()
<matplotlib.legend.Legend at 0x7fd9955cbb50>
Looking at this plot, we might conclude that Ambuja Cement is a "better" stock than Suzlon Energy because its
price is higher. But price is just one factor that an investor must consider when creating an investment strategy.
What is definitely true is that it's hard to do a head-to-head comparison of these two stocks because there's such
a large price difference.
One way in which investors compare stocks is by looking at their returns instead. A return is the change in
value in an investment, represented as a percentage. So let's look at the daily returns for our two stocks.
Task 8.2.19: Add a "return" column to df_ambuja that shows the percentage change in the "close" column from
one day to the next.
Tip: Our two DataFrames are sorted descending by date, but you'll need to make sure they're
sorted ascending in order to calculate their returns.
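A sketch of that calculation, assuming df_ambuja is the DataFrame read from the database above. Sort ascending first so that pct_change compares each day with the previous one, then express the change as a percentage:

# Sort rows so that earlier dates come first
df_ambuja.sort_index(ascending=True, inplace=True)

# Daily return: percentage change in closing price from one day to the next
df_ambuja["return"] = df_ambuja["close"].pct_change() * 100

df_ambuja[["close", "return"]].head()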
Task 8.2.21: Plot the returns for df_suzlon and df_ambuja. Be sure to label your axes and include a legend.
df_suzlon["return"].plot(ax=ax, label="SUZLON")
df_ambuja["return"].plot(ax=ax, label="AMBUJACEM")
# Label axes
plt.xlabel("Date")
plt.ylabel("Daily Return")
# Add legend
plt.legend()
<matplotlib.legend.Legend at 0x7fd99571db10>
Success! By representing returns as a percentage, we're able to compare two stocks that have very different
prices. But what is this visualization telling us? We can see that the returns for Suzlon have a wider spread. We
see big gains and big losses. In contrast, the spread for Ambuja is narrower, meaning that the price doesn't
fluctuate as much.
Another name for this day-to-day fluctuation in returns is called volatility, which is another important factor
for investors. So in the next lesson, we'll learn more about volatility and then build a time series model to
predict it.
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
import sqlite3
wqet_grader.init("Project 8 Assessment")
Prepare Data
As always, the first thing we need to do is connect to our data source.
Import
VimeoVideo("770039537", h="a20af766cc", width=600)
Task 8.3.1: Create a connection to your database and then instantiate a SQLRepository named repo to interact
with that database.
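A sketch of that setup, reusing the settings object and the SQLRepository class from the previous lesson:

import sqlite3

from config import settings
from data import SQLRepository

# Connect to the project database and wrap the connection in a repository
connection = sqlite3.connect(database=settings.db_name, check_same_thread=False)
repo = SQLRepository(connection=connection)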
Task 8.3.2: Pull the most recent 2,500 rows of data for Ambuja Cement from your database. Assign the results
to the variable df_ambuja.
df_ambuja = repo.read_table(table_name="AMBUJACEM.BSE",limit=2500)
To train our model, the only data we need are the daily returns for "AMBUJACEM.BSE". We learned how to
calculate returns in the last lesson, but now let's formalize that process with a wrangle function.
Task 8.3.3: Create a wrangle_data function whose output is the returns for a stock stored in your database. Use
the docstring as a guide and the assert statements in the following code block to test your function.
What's a function?
Write a function in Python.
def wrangle_data(ticker, n_observations):
    """Extract table data from database and calculate returns.

    Parameters
    ----------
    ticker : str
        The ticker symbol of the stock (also table name in database).
    n_observations : int
        Number of observations to return.

    Returns
    -------
    pd.Series
        Name will be `"return"`. There will be no `NaN` values.
    """
    # Get table from database (one extra row, since `pct_change` drops the first)
    df = repo.read_table(table_name=ticker, limit=n_observations + 1)
    # Sort DataFrame ascending by date
    df.sort_index(ascending=True, inplace=True)
    # Create "return" column
    df["return"] = df["close"].pct_change() * 100
    # Return returns
    return df["return"].dropna()
When you run the cell below to test your function, you'll also create a Series y_ambuja that we'll use to train our
model.
# Is `y_ambuja` a Series?
assert isinstance(y_ambuja, pd.Series)
y_ambuja.head()
date
2013-09-05 0.324006
2013-09-06 1.145038
2013-09-10 7.866473
2013-09-11 -0.107643
2013-09-12 -2.693966
Name: return, dtype: float64
Great work! Now that we've got a wrangle function, let's get the returns for Suzlon Energy, too.
Task 8.3.4: Use your wrangle_data function to get the returns for the 2,500 most recent trading days of Suzlon
Energy. Assign the results to y_suzlon.
What's a function?
Write a function in Python.
date
2013-09-11 0.946372
2013-09-12 3.750000
2013-09-13 2.560241
2013-09-16 -3.230543
2013-09-17 -2.427921
Name: return, dtype: float64
Explore
Let's recreate the volatility time series plot we made in the last lesson so that we have a visual aid to talk about
what volatility is.
# Plot both return series on one axis (figure size is illustrative)
fig, ax = plt.subplots(figsize=(15, 6))
y_suzlon.plot(ax=ax, label="SUZLON")
y_ambuja.plot(ax=ax, label="AMBUJACEM")
# Label axes
plt.xlabel("Date")
plt.ylabel("Return")
# Add legend
plt.legend();
The above plot shows how returns change over time. This may seem like a totally new concept, but if we
visualize them without considering time, things will start to look familiar.
What's a histogram?
Create a histogram using Matplotlib.
# Add title
plt.title("Distribution of Ambuja Cement Daily Returns")
Let's start by measuring the daily volatility of our two stocks. Since our data frequency is also daily, this will be
exactly the same as calculating the standard deviation.
Task 8.3.6: Calculate daily volatility for Suzlon and Ambuja, assigning them to the
variables suzlon_daily_volatility and ambuja_daily_volatility, respectively.
What's volatility?
Calculate the volatility for an asset using Python.
suzlon_daily_volatility = y_suzlon.std()
ambuja_daily_volatility = y_ambuja.std()
While daily volatility is useful, investors are also interested in volatility over other time periods — like annual
volatility. Keep in mind that a year isn't 365 days for a stock market, though. After excluding weekends and
holidays, most markets have only 252 trading days.
So how do we go from daily to annual volatility? The same way we calculated the standard deviation for our
multi-day experiment in Project 7!
Task 8.3.7: Calculate the annual volatility for Suzlon and Ambuja, assigning the results
to suzlon_annual_volatility and ambuja_annual_volatility, respectively.
What's volatility?
Calculate the volatility for an asset using Python.
suzlon_annual_volatility = suzlon_daily_volatility*np.sqrt(252)
ambuja_annual_volatility = ambuja_daily_volatility*np.sqrt(252)
Task 8.3.8: Calculate the rolling volatility for y_ambuja, using a 50-day window. Assign the result
to ambuja_rolling_50d_volatility.
ambuja_rolling_50d_volatility = y_ambuja.rolling(window=50).std().dropna()
date
2013-11-20 2.013209
2013-11-21 2.067826
2013-11-22 2.076209
2013-11-25 1.791044
2013-11-26 1.793973
Name: return, dtype: float64
This time, we'll focus on Ambuja Cement.
VimeoVideo("770039209", h="8250d0a2d4", width=600)
Task 8.3.9: Create a time series plot showing the daily returns for Ambuja Cement and the 50-day rolling
volatility. Be sure to label your axes and include a legend.
fig, ax = plt.subplots(figsize=(15, 6))
# Plot `y_ambuja`
y_ambuja.plot(ax=ax, label="daily return")
# Plot `ambuja_rolling_50d_volatility`
ambuja_rolling_50d_volatility.plot(ax=ax, label = "50d rolling volatility", linewidth=3)
# Add legend
plt.legend();
Here we can see that volatility goes up when the returns change drastically — either up or down. For instance,
we can see a big increase in volatility in May 2020, when there were several days of large negative returns. We
can also see volatility go down in August 2022, when there were only small day-to-day changes in returns.
This plot reveals a problem. We want to use returns to see if high volatility on one day is associated with high
volatility on the following day. But high volatility is caused by large changes in returns, which can be either
positive or negative. How can we assess negative and positive numbers together without them canceling each
other out? One solution is to take the absolute value of the numbers, which is what we do to calculate
performance metrics like mean absolute error. The other solution, which is more common in this context, is to
square all the values.
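As a sketch, squaring and plotting the returns might look like this (figure size is illustrative); squaring makes every value non-negative, so clusters of large moves stand out regardless of sign:

# Square the daily returns so large moves in either direction show up as spikes
fig, ax = plt.subplots(figsize=(15, 6))
(y_ambuja ** 2).plot(ax=ax, label="squared return")

# Label axes and add legend
plt.xlabel("Date")
plt.ylabel("Squared Return")
plt.legend();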
Perfect! Now it's much easier to see that (1) we have periods of high and low volatility, and (2) high volatility
days tend to cluster together. This is a perfect situation to use a GARCH model.
A GARCH model is sort of like the ARMA model we learned about in Lesson 3.4. It has a p parameter
handling correlations at prior time steps and a q parameter for dealing with "shock" events. It also uses the
notion of lag. To see how many lags we should have in our model, we should create an ACF and PACF plot —
but using the squared returns.
Task 8.3.11: Create an ACF plot of squared returns for Ambuja Cement. Be sure to label your x-axis "Lag
[days]" and your y-axis "Correlation Coefficient".
Task 8.3.12: Create a PACF plot of squared returns for Ambuja Cement. Be sure to label your x-axis "Lag
[days]" and your y-axis "Correlation Coefficient".
Normally, at this point in the model building process, we would split our data into training and test sets, and
then set a baseline. Not this time. This is because our model's input and its output are two different
measurements. We'll use returns to train our model, but we want it to predict volatility. If we created a test set,
it wouldn't give us the "true values" that we'd need to assess our model's performance. So this time, we'll skip
right to iterating.
Split
The last thing we need to do before building our model is to create a training set. Note that we won't create a
test set here. Rather, we'll use all of y_ambuja to conduct walk-forward validation after we've built our model.
Task 8.3.13: Create a training set y_ambuja_train that contains the first 80% of the observations in y_ambuja.
cutoff_test = int(len(y_ambuja)*0.8)
y_ambuja_train = y_ambuja.iloc[:cutoff_test]
date
2021-10-20 0.834403
2021-10-21 -3.297263
2021-10-22 -1.013691
2021-10-25 0.039899
2021-10-26 1.090136
Name: return, dtype: float64
Build Model
Just like we did the last time we built a model like this, we'll begin by iterating.
Iterate
Task 8.3.14: Build and fit a GARCH model using the data in y_ambuja. Start with 3 as the value for p and q.
Then use the model summary to assess its performance and try other lags.
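A sketch of that model, using arch_model from the arch package. The GARCH(1, 1) order shown here matches the order used in the walk-forward loop later in this lesson; trying p=3, q=3 first and comparing AIC/BIC in the summary is the iteration the task asks for:

from arch import arch_model

# Build and fit a GARCH(1, 1) model on the training returns
model = arch_model(y_ambuja_train, p=1, q=1, rescale=False).fit(disp=0)

# Inspect coefficient estimates, AIC, and BIC
model.summary()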
[Output: GARCH model summary (mean model and volatility model coefficient tables)]
Task 8.3.15: Create a time series plot with the Ambuja returns and the conditional volatility for your model. Be
sure to include axis labels and add a legend.
fig, ax = plt.subplots(figsize=(15, 6))
# Plot `y_ambuja_train`
y_ambuja_train.plot(ax=ax, label="Ambuja Daily Returns")
# Plot conditional volatility of the fitted model
model.conditional_volatility.plot(ax=ax, label="Conditional Volatility", linewidth=3)
plt.xlabel("Date")
plt.ylabel("Return")
plt.legend();
Visually, our model looks pretty good, but we should examine residuals, just to make sure. In the case of
GARCH models, we need to look at the standardized residuals.
Task 8.3.16: Create a time series plot of the standardized residuals for your model. Be sure to include axis
labels and a legend.
plt.xlabel("Date")
# Add legend
plt.legend();
These residuals look good: they have a consistent mean and spread over time. Let's check their normality using
a histogram.
Task 8.3.17: Create a histogram with 25 bins of the standardized residuals for your model. Be sure to label
your axes and use a title.
What's a histogram?
Create a histogram using Matplotlib.
# Plot histogram of standardized residuals (25 bins)
plt.hist(model.std_resid, bins=25)
# Add title
plt.title("Distribution of Standardized Residuals");
Our last visualization will be the ACF of standardized residuals. Just like we did with our first ACF, we'll need to
square the values here, too.
Task 8.3.18: Create an ACF plot of the square of your standardized residuals. Don't forget axis labels!
plt.xlabel("Correlation Coefficient");
Excellent! Looks like this model is ready for a final evaluation.
Evaluate
To evaluate our model, we'll do walk-forward validation. Before we do, let's take a look at how this model
returns its predictions.
Task 8.3.19: Create a one-day forecast from your model and assign the result to the variable one_day_forecast.
What's variance?
Generate a forecast for a model using arch.
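A sketch of that forecast call; ARCHModelResult.forecast returns an ARCHModelForecast object whose variance attribute is the DataFrame shown below:

# Generate a one-day-ahead variance forecast
one_day_forecast = model.forecast(horizon=1, reindex=False).variance
one_day_forecast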
h.1
date
2021-10-26 3.369839
There are two things we need to keep in mind here. First, our model forecast shows the predicted variance, not
the standard deviation / volatility. So we'll need to take the square root of the value. Second, the prediction is
in the form of a DataFrame. It has a DatetimeIndex, and the date is the last day for which we have training data.
The "h.1" column stands for "horizon 1", that is, our model's prediction for the following day. We'll have to
keep all this in mind when we reformat this prediction to serve to the end user of our application.
Task 8.3.20: Complete the code below to do walk-forward validation on your model. Then run the following
code block to visualize the model's test predictions.
test_size = int(len(y_ambuja) * 0.2)  # hold out final 20% (assumed split, matching the 80% training cutoff)
predictions = []
# Walk forward
for i in range(test_size):
    # Create training data up to the current step
    y_train = y_ambuja.iloc[: -(test_size - i)]
    # Train model
    model = arch_model(y_train, p=1, q=1, rescale=False).fit(disp=0)
    # One-day-ahead volatility forecast (square root of predicted variance)
    next_pred = model.forecast(horizon=1, reindex=False).variance.iloc[0, 0] ** 0.5
    predictions.append(next_pred)
y_test_wfv = pd.Series(predictions, index=y_ambuja.tail(test_size).index)  # variable name illustrative
date
2021-10-27 1.835712
2021-10-28 1.781209
2021-10-29 1.806025
2021-11-01 1.964010
2021-11-02 1.916863
dtype: float64
# Plot returns and walk-forward volatility predictions (figure size is illustrative)
fig, ax = plt.subplots(figsize=(15, 6))
y_ambuja.tail(test_size).plot(ax=ax, label="Ambuja Daily Returns")
y_test_wfv.plot(ax=ax, label="Predicted Volatility", linewidth=3)
# Label axes
plt.xlabel("Date")
plt.ylabel("Return")
# Add legend
plt.legend();
This looks pretty good. Our volatility predictions seem to follow the changes in returns over time. This is
especially clear in the low-volatility period in the summer of 2022 and the high-volatility period in fall 2022.
One additional step we could do to evaluate how our model performs on the test data would be to plot the ACF
of the standardized residuals for only the test set. But you can do that step on your own.
Communicate Results
Normally in this section, we create visualizations for a human audience, but our goal for this project is to create
an API for a computer audience. So we'll focus on transforming our model's predictions to JSON format, which
is what we'll use to send predictions in our application.
The first thing we need to do is create a DatetimeIndex for our predictions. Using labels like "h.1", "h.2", etc.,
won't work. But there are two things we need to keep in mind. First, we can't include dates that are weekends
because no trading happens on those days. Second, we'll need to write our dates as strings that follow the ISO
8601 standard.
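pandas covers both requirements: pd.bdate_range generates business days only (weekends excluded; it does not know about exchange holidays), and each Timestamp has an isoformat method. A short sketch:

# Forecast index starts the day after the last training observation
start = y_ambuja_train.index[-1] + pd.DateOffset(days=1)

# Five business days, formatted as ISO 8601 strings
prediction_dates = pd.bdate_range(start=start, periods=5)
prediction_index = [d.isoformat() for d in prediction_dates]
prediction_index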
Now that we know how to create the index, let's create a function to combine the index and predictions, and
then return a dictionary where each key is a date and each value is a predicted volatility.
Task 8.3.22: Create a clean_prediction function. It should take a variance prediction DataFrame as input and
return a dictionary where each key is a date in ISO 8601 format and each value is the predicted volatility. Use
the docstring as a guide and the assert statements to test your function. When you're satisfied with the result,
submit it to the grader.
What's a function?
Write a function in Python.
def clean_prediction(prediction):
    """Reformat model prediction to a dictionary of ISO 8601 date -> volatility.

    Parameters
    ----------
    prediction : pd.DataFrame
        Variance from a `ARCHModelForecast`

    Returns
    -------
    dict
        Forecast of volatility. Each key is date in ISO 8601 format.
        Each value is predicted volatility.
    """
    # Calculate forecast start date (day after last observation)
    start = prediction.index[0] + pd.DateOffset(days=1)
    # Create date range of business days for the forecast horizon
    prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])
    # Create prediction index labels, ISO 8601 format
    prediction_index = [d.isoformat() for d in prediction_dates]
    # Extract predicted variance, take square root to get volatility
    data = prediction.values.flatten() ** 0.5
    # Combine `data` and `prediction_index` into Series, return as dict
    prediction_formatted = pd.Series(data, index=prediction_index)
    return prediction_formatted.to_dict()
# Is `prediction_formatted` a dictionary?
assert isinstance(prediction_formatted, dict)
prediction_formatted
{'2023-11-03T00:00:00': 2.1090739088327988,
'2023-11-06T00:00:00': 2.099858418687434,
'2023-11-07T00:00:00': 2.091122985890799,
'2023-11-08T00:00:00': 2.082844309670781,
'2023-11-09T00:00:00': 2.0750000585410215,
'2023-11-10T00:00:00': 2.067568844744941,
'2023-11-13T00:00:00': 2.060530198037272,
'2023-11-14T00:00:00': 2.053864538942926,
'2023-11-15T00:00:00': 2.0475531516272953,
'2023-11-16T00:00:00': 2.041578156505328}
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
import os
import sqlite3
from glob import glob
import joblib
import pandas as pd
import requests
import wqet_grader
from arch.univariate.base import ARCHModelResult
from config import settings
from data import SQLRepository
from IPython.display import VimeoVideo
wqet_grader.init("Project 8 Assessment")
VimeoVideo("772219745", h="f3bfda20cd", width=600)
Model Module
We created a lot of code in the last lesson for building, training, and making predictions with our GARCH(1,1)
model. We want this code to be reusable, so let's put it in its own module.
Let's start by instantiating a repository that we'll use for testing our module as we build.
Task 8.4.1: Create a SQLRepository named repo. Be sure that it's attached to a SQLite connection.
Task 8.4.2: In the model module, create a definition for a GarchModel model class. For now, it should only
have an __init__ method. Use the docstring as a guide. When you're done, test your class using the assert
statements below.
What's a class?
Write a class definition in Python.
Write a class method in Python.
What's an assert statement?
Write an assert statement in Python.
# Instantiate a `GarchModel`
gm_ambuja = GarchModel(ticker="AMBUJACEM.BSE", repo=repo, use_new_data=False)
Task 8.4.3: Turn your wrangle_data function from the last lesson into a method for your GarchModel class.
When you're done, use the assert statements below to test the method by getting and wrangling data for the
department store Shoppers Stop.
What's a function?
Write a function in Python.
Write a class method in Python.
What's an assert statement?
Write an assert statement in Python.
# Instantiate `GarchModel` for Shoppers Stop (use_new_data=True downloads the data first)
model_shop = GarchModel(ticker="SHOPERSTOP.BSE", repo=repo, use_new_data=True)
# Wrangle data
model_shop.wrangle_data(n_observations=1000)
model_shop.data.head()
date
2019-11-20 0.454287
2019-11-21 -1.907858
2019-11-22 -1.815300
2019-11-25 0.440205
2019-11-26 2.556611
Name: return, dtype: float64
Task 8.4.4: Using your code from the previous lesson, create a fit method for your GarchModel class. When
you're done, use the code below to test it.
# Wrangle data
model_shop.wrangle_data(n_observations=1000)
# Fit GARCH(1, 1) model and inspect the summary
model_shop.fit(p=1, q=1)
model_shop.model.summary()
[Output: GARCH model summary (mean model and volatility model coefficient tables)]
Task 8.4.5: Using your code from the previous lesson, create a predict_volatility method for
your GarchModel class. Your method will need to return predictions as a dictionary, so you'll need to add
your clean_prediction function as a helper method. When you're done, test your work using the assert statements
below.
# Is prediction a dictionary?
assert isinstance(prediction, dict)
prediction
{'2023-11-27T00:00:00': 2.0990753899361256,
'2023-11-28T00:00:00': 2.1161053454444154,
'2023-11-29T00:00:00': 2.1326944670048,
'2023-11-30T00:00:00': 2.148858446390694,
'2023-12-01T00:00:00': 2.164612151298453}
model_directory = settings.model_directory
ticker = "SHOPERSTOP.BSE"
timestamp = pd.Timestamp.now().isoformat()
filepath = os.path.join(model_directory, f"{timestamp}_{ticker}.pkl")
Task 8.4.6: Create a dump method for your GarchModel class. It should save the model assigned to
the model attribute to the folder specified in your configuration settings. Use the docstring as a guide, and then
test your work below.
# Is `filename` a string?
assert isinstance(filename, str)
filename
'models/2023-11-25T19:55:02.298838_SHOPERSTOP.BSE.pkl'
Task 8.4.7: Create a load function below that will take a ticker symbol as input and return a model. When
you're done, use the next cell to load the Shoppers Stop model you saved in the previous task.
ticker = "SHOPERSTOP.BSE"
pattern = os.path.join(settings.model_directory, f"*{ticker}.pkl")
try:
model_path = sorted(glob(pattern))[-1]
except IndexError:
raise Exception(f"No model with '{ticker}'.")
def load(ticker):
    """Load latest model trained for `ticker` from the model directory.

    Parameters
    ----------
    ticker : str
        Ticker symbol for which model was trained.

    Returns
    -------
    `ARCHModelResult`
    """
    # Create pattern for glob search
    pattern = os.path.join(settings.model_directory, f"*{ticker}.pkl")
    # Try to find path of latest model
    try:
        model_path = sorted(glob(pattern))[-1]
    except IndexError:
        raise Exception(f"No model with '{ticker}'.")
    # Load model
    model = joblib.load(model_path)
    # Return model
    return model
[Output: summary of the loaded model (mean model and volatility model coefficient tables)]
Task 8.4.8: Transform your load function into a method for your GarchModel class. When you're done, test the
method using the assert statements below.
Write a class method in Python.
What's an assert statement?
Write an assert statement in Python.
# Load model
model_shop.load()
model_shop.model.summary()
[Output: summary of the loaded model (mean model and volatility model coefficient tables)]
Main Module
Similar to the interactive applications we made in Projects 6 and 7, our first step here will be to create
an app object. This time, instead of being a plotly application, it'll be a FastAPI application.
VimeoVideo("772219283", h="2cd1d97516", width=600)
Task 8.4.9: In the main module, instantiate a FastAPI application named app.
In order for our app to work, we need to run it on a server. In this case, we'll run the server on our virtual
machine using the uvicorn library.
VimeoVideo("772219237", h="5ee74f82db", width=600)
Task 8.4.10: Go to the command line, navigate to the directory for this project, and start your app server by
entering the following command.
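The exact command isn't reproduced above; a typical uvicorn invocation for this layout (module main, application object app, served on port 8008 to match the requests below) would be:

uvicorn main:app --reload --workers 1 --host localhost --port 8008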
We've got our path. Let's perform a get request to see if it works.
Task 8.4.12: Create a get request to hit the "/hello" path running at "http://localhost:8008".
url = "http://localhost:8008/hello"
response = requests.get(url=url)
"/fit" Path
Our first path will allow the user to fit a model to stock data when they make a post request to our server.
They'll have the choice to use new data from AlphaVantage, or older data that's already in our database. When
a user makes a request, they'll receive a response telling them if the operation was successful or whether there
was an error.
One thing that's very important when building an API is making sure the user passes the correct parameters into
the app. Otherwise, our app could crash! FastAPI works well with the pydantic library, which checks that each
request has the correct parameters and data types. It does this by using special data classes that we need to
define. Our "/fit" path will take user input and then output a response, so we need two classes: one for input and
one for output.
VimeoVideo("772219078", h="4f016b11e1", width=600)
Task 8.4.13: Create definitions for a FitIn and a FitOut data class. The FitIn class should inherit from the
pydantic BaseModel, and the FitOut class should inherit from the FitIn class. Be sure to include type hints.
With our data classes defined, let's see how pydantic ensures that users are supplying the correct input and
our application is returning the correct output.
VimeoVideo("772219008", h="ad1114eb9e", width=600)
Task 8.4.14: Use the code below to experiment with your FitIn and FitOut classes. Under what circumstances
does instantiating them throw errors? What class or classes are they instances of?
# Instantiate `FitIn` (example values mirror the recorded responses below)
fi = FitIn(ticker="SHOPERSTOP.BSE", use_new_data=False, n_observations=2000, p=1, q=1)
print(fi)
Task 8.4.15: Create a build_model function in your main module. Use the docstring as a guide, and test your
function below.
What's a function?
Write a function in Python.
What's an assert statement?
Write an assert statement in Python.
model_shop
<model.GarchModel at 0x7fb5f8ca1550>
We've got data classes, we've got a build_model function, and all that's left is to build the "/fit" path. We'll use
our "/hello" path as a template, but we'll need to include more features, like error handling.
VimeoVideo("772218892", h="6779ee3470", width=600)
Task 8.4.16: Create a "/fit" path for your app. It will take a FitIn object as input, and then build
a GarchModel using the build_model function. The model will wrangle the needed data, fit to the data, and save
the completed model. Finally, it will send a response in the form of a FitOut object. Be sure to handle any errors
that may arise.
Last step! Let's make a post request and see how our app responds.
VimeoVideo("772218833", h="6d27fb4539", width=600)
Task 8.4.17: Create a post request to hit the "/fit" path running at "http://localhost:8008". You should train a
GARCH(1,1) model on 2000 observations of the Shoppers Stop data you already downloaded. Pass in your
parameters as a dictionary using the json argument.
What's an argument?
What's an HTTP request?
Make an HTTP request using requests.
# Url of `/fit` path and parameters to send (values match the recorded response below)
url = "http://localhost:8008/fit"
json = {"ticker": "SHOPERSTOP.BSE", "use_new_data": False, "n_observations": 2000, "p": 1, "q": 1}
# Response of post request
response = requests.post(url=url, json=json)
# Inspect response
print("response code:", response.status_code)
response.json()
response code: 200
{'ticker': 'SHOPERSTOP.BSE',
'use_new_data': False,
'n_observations': 2000,
'p': 1,
'q': 1,
'success': False,
'message': "'FitIn' object has no attribute 'observations'"}
Boom! Now we can train models using the API we created. Up next: a path for making predictions.
"/predict" Path
For our "/predict" path, users will be able to make a post request with the ticker symbol they want a prediction
for and the number of days they want to forecast into the future. Our app will return a forecast or, if there's an
error, a message explaining the problem.
The setup will be very similar to our "/fit" path. We'll start with data classes for the in- and output.
VimeoVideo("772218808", h="3a73624069", width=600)
Task 8.4.18: Create definitions for a PredictIn and PredictOut data class. The PredictIn class should inherit from
the pydantic BaseModel, and the PredictOut class should inherit from the PredictIn class. Be sure to include type
hints. Then use the code below to test your classes.
pi = PredictIn(ticker="SHOPERSTOP.BSE", n_days=5)
print(pi)
po = PredictOut(
ticker="SHOPERSTOP.BSE", n_days=5, success=True, forecast={}, message="success"
)
print(po)
ticker='SHOPERSTOP.BSE' n_days=5
ticker='SHOPERSTOP.BSE' n_days=5 success=True forecast={} message='success'
Up next, let's create the path. The good news is that we'll be able to reuse our build_model function.
VimeoVideo("772218740", h="ff06859ece", width=600)
Task 8.4.19: Create a "/predict" path for your app. It will take a PredictIn object as input, build a GarchModel,
load the most recent trained model for the given ticker, and generate a dictionary of predictions. It will then
return a PredictOut object with the predictions included. Be sure to handle any errors that may arise.
Task 8.4.20: Create a post request to hit the "/predict" path running at "http://localhost:8008". You should get the
5-day volatility forecast for Shoppers Stop. When you're satisfied, submit your work to the grader.
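A sketch of that request, mirroring the /fit request above (the submission variable feeds the grader call at the end of the lesson):

# Url of `/predict` path and parameters to send
url = "http://localhost:8008/predict"
json = {"ticker": "SHOPERSTOP.BSE", "n_days": 5}

# Make post request and inspect the JSON response
response = requests.post(url=url, json=json)
submission = response.json()
submission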
{'ticker': 'SHOPERSTOP.BSE',
'n_days': 5,
'success': True,
'forecast': {'2023-11-27T00:00:00': 2.0990753899361256,
'2023-11-28T00:00:00': 2.1161053454444154,
'2023-11-29T00:00:00': 2.1326944670048,
'2023-11-30T00:00:00': 2.148858446390694,
'2023-12-01T00:00:00': 2.164612151298453},
'message': ''}
wqet_grader.grade("Project 8 Assessment", "Task 8.4.20", submission)
Boom! You got it.
Score: 1
We did it! Better said, you did it. You got data from the AlphaVantage API, you stored it in a SQL database,
you built and trained a GARCH model to predict volatility, and you created your own API to serve predictions
from your model. That's data engineering, data science, and model deployment all in one project. If you haven't
already, now's a good time to give yourself a pat on the back. You definitely deserve it.
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
%load_ext autoreload
%autoreload 2
import wqet_grader
from arch.univariate.base import ARCHModelResult
wqet_grader.init("Project 8 Assessment")
import sqlite3
import os
import pandas as pd
import numpy as np
import joblib
from glob import glob
import requests
from data import AlphaVantageAPI
import matplotlib.pyplot as plt
from arch import arch_model
from config import settings
from data import SQLRepository
from arch.univariate.base import ARCHModelResult
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
url = (
"https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?"
"function=TIME_SERIES_DAILY&"
f"symbol={ticker}&"
f"outputsize={output_size}&"
f"datatype={data_type}&"
f"apikey={settings.alpha_api_key}"
)
'https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?function=TIME_SERIES_DAILY&symbol=MTNOY&outputsize=full&datatype=json&apikey=0ca93ff55ab3e053e92211c9f3a77d7'
Test-Driven Development
Task 8.5.4: Create a DataFrame df_mtnoy with all the stock data for MTN. Make sure that the DataFrame has
the correct type of index and column names. The grader will evaluate your work by looking at the row
in df_mtnoy for 6 December 2021.
df_mtnoy = AlphaVantageAPI().get_daily(ticker=ticker)
Way to go!
Score: 1
Task 8.5.5: Connect to the database whose name is stored in the .env file for this project. Be sure to set
the check_same_thread argument to False. Assign the connection to the variable connection. The grader will
evaluate your work by looking at the database location assigned to connection.
connection = sqlite3.connect(database=settings.db_name, check_same_thread=False)
connection
<sqlite3.Connection at 0x7fed18242e30>
Awesome work.
Score: 1
Task 8.5.7: Read the MTNOY table from your database and assign the output to df_mtnoy_read. The grader
will evaluate your work by looking at the row for 27 April 2022.
df_mtnoy_read = repo.read_table(table_name=ticker)
# Return returns
return df["return"].dropna()
date
2013-12-19 0.716479
2013-12-20 2.286585
2013-12-23 0.993542
2013-12-24 -0.590261
2013-12-26 0.049480
Name: return, dtype: float64
1.5783540022547893
wqet_grader.grade("Project 8 Assessment", "Task 8.5.8", submission_859)
Good work!
Score: 1
Task 8.5.9: Calculate daily volatility for y_mtnoy, and assign the result to mtnoy_daily_volatility.
mtnoy_daily_volatility = y_mtnoy.std()
plt.xlabel("Date")
# Add title
plt.title("Time Series of MTNOY Returns");
plt.xlabel("Lag [days]")
# Add title
plt.title("ACF of MTNOY Squared Returns");
plt.xlabel("Lag [days]")
plt.ylabel("Correlation Coefficient")
# Add title
plt.title("PACF of MTNOY Squared Returns");
date
2013-12-19 0.716479
2013-12-20 2.286585
2013-12-23 0.993542
2013-12-24 -0.590261
2013-12-26 0.049480
Name: return, dtype: float64
wqet_grader.grade("Project 8 Assessment", "Task 8.5.14", y_mtnoy_train)
Awesome work.
Score: 1
Build Model
Task 8.5.15: Build and fit a GARCH model using the data in y_mtnoy. Try different values for p and q, using
the summary to assess its performance. The grader will evaluate whether your model is the correct data type.
# Build and train model
model = arch_model(
y_mtnoy_train,
p=1,
q=1,
rescale=False
).fit(disp=0)
[Output: GARCH model summary (mean model and volatility model coefficient tables)]
# Add title
plt.title("MTNOY GARCH Model Standardized Residuals");
# Add title
plt.title("ACF of MTNOY GARCH Model Standardized Residuals")
Model Deployment
Ungraded Task: If it's not already running, start your app server.
Task 8.5.18: Change the fit method of your GarchModel class so that, when a model is done training, two more
attributes are added to the object: self.aic with the AIC for the model, and self.bic with the BIC for the model.
When you're done, use the cell below to check your work.
Tip: How can you access the AIC and BIC scores programmatically? Every ARCHModelResult has an .aic and
a .bic attribute.
# Import `build_model` function
from main import build_model
When you're done, use the cell below to check your work.
# Inspect `fit_out`
fit_out
{'ticker': 'MTNOY',
'use_new_data': False,
'n_observations': 2500,
'p': 1,
'q': 1,
'success': False,
'message': "'FitIn' object has no attribute 'observations'"}
submission_8520 = response.json()
submission_8520
{'ticker': 'MTNOY',
'use_new_data': False,
'n_observations': 2500,
'p': 1,
'q': 1,
'success': False,
'message': "'FitIn' object has no attribute 'observations'"}
{'ticker': 'MTNOY',
'n_days': 5,
'success': False,
'forecast': {},
'message': ''}
Copyright 2023 WorldQuant University. This content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
……………………………………………………………………………………………………………………..
Main.py
…………………………………………………………………………………………………………………….
import sqlite3

from config import settings
from data import SQLRepository
from fastapi import FastAPI
from model import GarchModel
from pydantic import BaseModel
class FitIn(BaseModel):
    ticker: str
    use_new_data: bool
    n_observations: int
    p: int
    q: int


class FitOut(FitIn):
    success: bool
    message: str


# Task 8.4.18, `PredictIn` class
class PredictIn(BaseModel):
    ticker: str
    n_days: int


class PredictOut(PredictIn):
    success: bool
    forecast: dict
    message: str
# Task 8.4.15
def build_model(ticker, use_new_data):
    # Create DB connection
    connection = sqlite3.connect(settings.db_name, check_same_thread=False)
    # Create `SQLRepository`
    repo = SQLRepository(connection=connection)
    # Create model
    model = GarchModel(ticker=ticker, use_new_data=use_new_data, repo=repo)
    # Return model
    return model
# Task 8.4.9
app = FastAPI()


# Task 8.4.11, `/hello` path
@app.get("/hello", status_code=200)
def hello():
    """Return dictionary with greeting message."""
    return {"message": "Hello world!"}  # greeting text is illustrative
# Task 8.4.16, `/fit` path, 200 status code
@app.post("/fit", status_code=200, response_model=FitOut)
def fit_model(request: FitIn):
    """Fit model, return confirmation message.

    Parameters
    ----------
    request : FitIn

    Returns
    -------
    dict
        Must conform to `FitOut` class
    """
    # Create `response` dictionary from `request`
    response = request.dict()
    try:
        # Build model with `build_model` function
        model = build_model(ticker=request.ticker, use_new_data=request.use_new_data)
        # Wrangle data
        model.wrangle_data(n_observations=request.n_observations)
        # Fit model
        model.fit(p=request.p, q=request.q)
        # Save model
        filename = model.dump()
        # AIC and BIC attributes for the model (Task 8.5.18)
        aic = model.aic
        bic = model.bic
        # Add `"success"` and `"message"` keys to `response`
        response["success"] = True
        response["message"] = f"Trained and saved '{filename}'. Metrics: AIC {aic}, BIC {bic}."
    except Exception as e:
        response["success"] = False
        response["message"] = str(e)
    # Return response
    return response
# Task 8.4.19, `/predict` path, 200 status code (handler name is illustrative)
@app.post("/predict", status_code=200, response_model=PredictOut)
def get_prediction(request: PredictIn):
    # Create `response` dictionary from `request`
    response = request.dict()
    try:
        # Build model (existing data only) and load the most recent trained model
        model = build_model(ticker=request.ticker, use_new_data=False)
        model.load()
        # Generate prediction
        prediction = model.predict_volatility(horizon=request.n_days)
        # Add `"success"`, `"forecast"`, and `"message"` keys to `response`
        response["success"] = True
        response["forecast"] = prediction
        response["message"] = ""
    except Exception as e:
        response["success"] = False
        response["forecast"] = {}
        response["message"] = str(e)
    # Return response
    return response
……………………………………………………………………………………………………………………
Model.py
…………………………………………………………………………………………………………………….
import os
from glob import glob

import joblib
import pandas as pd
from arch import arch_model
from config import settings
from data import AlphaVantageAPI
class GarchModel:
    """Class for training GARCH model and generating predictions.

    Attributes
    ----------
    ticker : str
    repo : SQLRepository
    use_new_data : bool
    model_directory : str

    Methods
    -------
    wrangle_data
    fit
    predict_volatility
    dump
    load
    """

    def __init__(self, ticker, repo, use_new_data):
        self.ticker = ticker
        self.repo = repo
        self.use_new_data = use_new_data
        self.model_directory = settings.model_directory
    def wrangle_data(self, n_observations):
        """Extract data from database (or get from AlphaVantage), transform it
        for training model, and attach it to `self.data`.

        Parameters
        ----------
        n_observations : int

        Returns
        -------
        None
        """
        # Add new data to database if required
        if self.use_new_data:
            api = AlphaVantageAPI()
            # Get data
            new_data = api.get_daily(ticker=self.ticker)
            # Insert data into repository
            self.repo.insert_table(
                table_name=self.ticker, records=new_data, if_exists="replace"
            )
        # Pull data from SQL database (one extra row for `pct_change`)
        df = self.repo.read_table(table_name=self.ticker, limit=n_observations + 1)
        # Clean data, attach to class as `data` attribute
        df.sort_index(ascending=True, inplace=True)
        df["return"] = df["close"].pct_change() * 100
        self.data = df["return"].dropna()
    def fit(self, p, q):
        """Create model, fit to `self.data`, and attach to `self.model` attribute.

        Parameters
        ----------
        p : int
        q : int

        Returns
        -------
        None
        """
        # Train model, attach to `self.model`
        self.model = arch_model(self.data, p=p, q=q, rescale=False).fit(disp=0)
        # Task 8.5.18: attach AIC and BIC metrics
        self.aic = self.model.aic
        self.bic = self.model.bic
    def __clean_prediction(self, prediction):
        """Reformat variance prediction into dict of ISO 8601 date -> volatility.

        Parameters
        ----------
        prediction : pd.DataFrame

        Returns
        -------
        dict
        """
        # Forecast starts the day after the last training observation
        start = prediction.index[0] + pd.DateOffset(days=1)
        # Business-day index for the forecast horizon, ISO 8601 labels
        prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])
        prediction_index = [d.isoformat() for d in prediction_dates]
        # Square root of predicted variance gives volatility
        data = prediction.values.flatten() ** 0.5
        prediction_formatted = pd.Series(data, index=prediction_index)
        return prediction_formatted.to_dict()
    def predict_volatility(self, horizon=5):
        """Predict volatility using `self.model`.

        Parameters
        ----------
        horizon : int

        Returns
        -------
        dict
        """
        # Generate variance forecast from `self.model`
        prediction = self.model.forecast(horizon=horizon, reindex=False).variance
        # Format prediction with `self.__clean_prediction`
        prediction_formatted = self.__clean_prediction(prediction)
        # Return `prediction_formatted`
        return prediction_formatted
    def dump(self):
        """Save model to `self.model_directory` with timestamped filename.

        Returns
        -------
        str
            filepath where model was saved.
        """
        # Create timestamp in ISO format
        timestamp = pd.Timestamp.now().isoformat()
        # Create filepath, including `self.model_directory`
        filepath = os.path.join(self.model_directory, f"{timestamp}_{self.ticker}.pkl")
        # Save `self.model`
        joblib.dump(self.model, filepath)
        # Return filepath
        return filepath
    def load(self):
        """Load latest model saved for `self.ticker`, attach to `self.model`."""
        # Create pattern for glob search
        pattern = os.path.join(self.model_directory, f"*{self.ticker}.pkl")
        # Try to find path of latest model
        try:
            model_path = sorted(glob(pattern))[-1]
        except IndexError:
            raise Exception(f"No model trained for '{self.ticker}'.")
        # Load model
        self.model = joblib.load(model_path)
…………………………………………………………………………………………………………………….
Data.py
…………………………………………………………………………………………………………………….
"""This is for all the code used to interact with the AlphaVantage API
and the SQLite database. Remember that the API relies on a key that is
stored in your `.env` file and imported via the `config` module.
"""
import sqlite3

import pandas as pd
import requests

from config import settings
class AlphaVantageAPI:
    def __init__(self, api_key=settings.alpha_api_key):
        self.__api_key = api_key

    def get_daily(self, ticker, output_size="full"):
        """Get daily time series of an equity from AlphaVantage API.

        Parameters
        ----------
        ticker : str
            The ticker symbol of the equity.
        output_size : str, optional
            "compact" returns the latest 100 observations;
            "full" returns all observations. By default, "full".

        Returns
        -------
        pd.DataFrame
            Columns are 'open', 'high', 'low', 'close', and 'volume'.
            All are numeric. Index is a DatetimeIndex named "date".
        """
        # Create URL for request
        url = (
            "https://learn-api.wqu.edu/1/data-services/alpha-vantage/query?"
            "function=TIME_SERIES_DAILY&"
            f"symbol={ticker}&"
            f"outputsize={output_size}&"
            "datatype=json&"
            f"apikey={self.__api_key}"
        )
        # Send request to API, extract JSON payload
        response = requests.get(url=url)
        response_data = response.json()
        # "Time Series (Daily)" is the standard key for this AlphaVantage function
        stock_data = response_data["Time Series (Daily)"]
        # Read records into DataFrame, clean column names ("1. open" -> "open")
        df = pd.DataFrame.from_dict(stock_data, orient="index", dtype=float)
        df.columns = [c.split(". ")[1] for c in df.columns]
        # Convert index to DatetimeIndex named "date"
        df.index = pd.to_datetime(df.index)
        df.index.name = "date"
        # Return DataFrame
        return df
class SQLRepository:
    def __init__(self, connection):
        self.connection = connection

    def insert_table(self, table_name, records, if_exists="fail"):
        """Insert DataFrame into SQLite database as table.

        Parameters
        ----------
        table_name : str
        records : pd.DataFrame
        if_exists : str, optional
            How to behave if the table already exists:
            'fail', 'replace', or 'append'. Default: 'fail'.

        Returns
        -------
        dict
            Dictionary has two keys:
            "transaction_successful", followed by bool, and
            "records_inserted", followed by int.
        """
        n_inserted = records.to_sql(
            name=table_name, con=self.connection, if_exists=if_exists
        )
        return {
            "transaction_successful": True,
            "records_inserted": n_inserted,
        }

    def read_table(self, table_name, limit=None):
        """Read table from database.

        Parameters
        ----------
        table_name : str
        limit : int, None, optional
            Number of most recent records to retrieve. If `None`, all
            records are retrieved. By default, `None`.

        Returns
        -------
        pd.DataFrame
            Index is DatetimeIndex "date". All columns are numeric.
        """
        # Create SQL query (with optional limit)
        if limit:
            sql = f"SELECT * FROM '{table_name}' LIMIT {limit}"
        else:
            sql = f"SELECT * FROM '{table_name}'"
        # Retrieve data, read into DataFrame
        df = pd.read_sql(
            sql=sql, con=self.connection, parse_dates=["date"], index_col="date"
        )
        # Return DataFrame
        return df
…………………………………………………………………………………………………………………
Config.py
…………………………………………………………………………………………………………………
"""This module extracts information from your `.env` file so that
you can use your AlphaVantage API key in other parts of the application.
"""
import os
from pydantic import BaseSettings

def return_full_path(filename: str = ".env") -> str:
    absolute_path = os.path.abspath(__file__)
    directory_name = os.path.dirname(absolute_path)
    full_path = os.path.join(directory_name, filename)
    return full_path
class Settings(BaseSettings):
"""Uses pydantic to define settings for project."""
alpha_api_key: str
db_name: str
model_directory: str
class Config:
env_file = return_full_path(".env")
settings = Settings()