
Advanced Machine Learning

Final Project
Team Name: Byte
Team Members: Javier Pacheco and Manav Middha
EDA and Initial Thoughts
Variable Information

● 18 variables
● 8 float, 9 int, 1 object
● 0 null values
● Numerical and categorical values change only with each sub_id
● Time variables change with each row
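
A minimal sketch of these checks with pandas; the file name train.csv is an assumption, as the slides do not name the data source:

    import pandas as pd

    df = pd.read_csv("train.csv")        # hypothetical file name
    print(df.shape)                      # expect 18 columns
    print(df.dtypes.value_counts())      # expect 8 float, 9 int, 1 object
    print(df.isnull().sum().sum())       # expect 0 null values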


Variable Correlations

● obs and sub_id are highly correlated
● num_0 is positively correlated with num_1, and num_1 with num_2
● t_2, t_3, and t_4 are positively correlated amongst themselves
● y_1 is positively correlated with t_2, t_3, and t_4
● y_2 is strongly positively correlated with t_1 (see the sketch below)
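
These pairs can be read off a correlation matrix; a minimal sketch, reusing the df loaded above:

    corr = df.corr(numeric_only=True)

    # Spot-check the pairs called out above.
    print(corr.loc["obs", "sub_id"])
    print(corr.loc["num_0", "num_1"], corr.loc["num_1", "num_2"])
    print(corr.loc[["t_2", "t_3", "t_4"], ["t_2", "t_3", "t_4"]])
    print(corr.loc["y_1", ["t_2", "t_3", "t_4"]])
    print(corr.loc["y_2", "t_1"])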
Variable Distributions

● t_1 had many zero values and outliers
● The y variables and the other time variables are approximately normally distributed
● num_0 and num_2 are skewed (checked in the sketch below)
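
A quick way to confirm these shapes (df as above); skew near zero suggests approximate normality, large values indicate skew:

    print(df[["y_1", "y_2", "t_2", "t_3", "t_4"]].skew())
    print(df[["num_0", "num_2"]].skew())
    print((df["t_1"] == 0).mean())       # fraction of zero values in t_1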


Distribution of Time-Dependent Variables

There are a lot of outliers, so some sort of outlier treatment needs to be done.
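
One possible treatment, as a sketch only since the slides do not fix a method, is IQR-based clipping of the time-dependent columns:

    def clip_iqr(s, k=1.5):
        # Clip a series to [Q1 - k*IQR, Q3 + k*IQR].
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return s.clip(q1 - k * iqr, q3 + k * iqr)

    time_cols = ["t_1", "t_2", "t_3", "t_4"]
    df[time_cols] = df[time_cols].apply(clip_iqr)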
Initial Thoughts

1. Multiple data points exist for the same obs and the same target variable; we
need a measure of central tendency to collapse them.
2. Some values are 0. Are they missing? Do we need to impute them?
3. What kind of models should we use?
4. What is the ideal testing method?
Checkpoint 1
Initial Steps

1. We took the mean of the time-dependent variables for each obs so as to have
a single row for each target variable.
2. We tried imputing the zeros with the per-obs mean of the corresponding
time-dependent variable, but it did not yield good results (both steps are
sketched below).
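
A sketch of both steps in pandas; treating zeros as missing before imputation is our reading of step 2:

    import numpy as np

    time_cols = ["t_1", "t_2", "t_3", "t_4"]

    # Step 2 (attempted): treat zeros as missing, fill with the per-obs mean.
    df[time_cols] = df[time_cols].replace(0, np.nan)
    df[time_cols] = df.groupby("obs")[time_cols].transform(lambda s: s.fillna(s.mean()))

    # Step 1: collapse to one row per obs by averaging the time-dependent columns.
    agg = df.groupby("obs", as_index=False).mean(numeric_only=True)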
Models tried

1. Linear regression - This model gave the best results after dropping features
that were not significant at the 5% significance level. On the public
leaderboard, this gave the best MAE of 4.46.
2. Random Forest Regressor with parameter tuning
3. XGBoost regressor
Both Random Forest and XGBoost were overfitting and not generalizing well,
leading to MAEs between 4.6 and 4.7 on the public leaderboard.
Linear regression was selected as the final model.
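
A sketch of the significance-based selection with statsmodels; the backward-elimination loop and the choice of y_1 as the target are our assumptions:

    import statsmodels.api as sm

    def select_significant(X, y, alpha=0.05):
        # Drop the least significant feature until every p-value is below alpha.
        cols = list(X.columns)
        while cols:
            fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
            pvals = fit.pvalues.drop("const")
            if pvals.max() < alpha:
                break
            cols.remove(pvals.idxmax())
        return cols

    X, y = agg.drop(columns=["obs", "y_1", "y_2"]), agg["y_1"]   # agg from the sketch above
    kept = select_significant(X, y)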
Learnings from Checkpoint 1

1. Linear models are going to work best.
2. No outlier treatment was done earlier; we need a way to handle outliers.
3. Taking the mean loses a lot of information; we need to incorporate all
the data.
Checkpoint 2
Initial Steps

We transposed the data so that we have one row for each (obs, target variable)
pair. Each obs now has ~160 columns, so we need some sort of feature
reduction technique.
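
A sketch of the reshape; the per-obs step counter and the resulting column names (t_1_0 … t_1_39 and so on) are our assumptions:

    df["step"] = df.groupby("obs").cumcount()        # within-obs time index
    wide = df.pivot(index="obs", columns="step",
                    values=["t_1", "t_2", "t_3", "t_4"])
    wide.columns = [f"{var}_{step}" for var, step in wide.columns]
    wide = wide.reset_index()                        # ~160 t_ columns per obs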
Models Tried

1. Linear Regression - Only features significant at the 5% significance level
were kept. Gave an MAE of 3.99 on the public dataset.
2. Linear regression with PCA - Used PCA to reduce dimensionality, since it
works better here than OLS-based feature selection. Kept the top 50
components and got an MAE of 3.94 on the public dataset (see the sketch
after this list).
3. Random Forest Regressor with PCA - Did not generalize well and did not
yield good results on the holdout dataset.
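
A sketch of approach 2 with scikit-learn; standardizing before PCA and the split variable names are our assumptions:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    pca_lr = make_pipeline(
        StandardScaler(),
        PCA(n_components=50),       # top 50 dimensions, as on the slide
        LinearRegression(),
    )
    pca_lr.fit(X_train, y_train)
    preds = pca_lr.predict(X_test)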
Models Tried (continued)

4. Huber Regressor - This model deals with outliers and high dimensionality,
the two biggest issues with this dataset. The vanilla version with default
hyperparameters gave an MAE of 3.809 on the public dataset, and after tuning
the regularization and epsilon parameters it gave an MAE of 3.807.

5. Huber Regressor with PCA - Applied Huber regression on top of the PCA
components rather than the original features. Gave an MAE of 3.797 on the
public dataset.

Huber Regressor with PCA was selected as the final model.
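
A sketch of the final model; the slides do not give the tuned alpha (regularization) and epsilon values, so the grid below is illustrative:

    from sklearn.decomposition import PCA
    from sklearn.linear_model import HuberRegressor
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    huber_pca = make_pipeline(
        StandardScaler(),
        PCA(n_components=50),
        HuberRegressor(max_iter=1000),
    )
    # Tune the regularization strength (alpha) and outlier threshold (epsilon).
    grid = GridSearchCV(
        huber_pca,
        {"huberregressor__alpha": [1e-4, 1e-3, 1e-2],
         "huberregressor__epsilon": [1.1, 1.35, 1.5]},
        scoring="neg_mean_absolute_error",
        cv=5,
    )
    grid.fit(X_train, y_train)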


Learnings from Checkpoint 2

1. Linear models are going to work best.
2. The Huber Regressor takes care of outliers and high dimensionality.
3. Not all t_ variables have the same significance. We need a way to weigh
later t_ values more heavily than earlier ones, and still need to work on this.
Checkpoint 3
Initial Steps

We took only the last 15 time-dependent values for each obs, since the data is
essentially a time series and later values should have more impact.
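
With the wide layout from Checkpoint 2 this is a column selection; 40 steps per variable is inferred from the ~160 columns mentioned earlier and may not match the real data:

    n_steps = 40                     # ~160 t_ columns / 4 variables (inferred)
    keep = [f"{var}_{s}" for var in ["t_1", "t_2", "t_3", "t_4"]
            for s in range(n_steps - 15, n_steps)]
    X_last15 = wide[["obs"] + keep]  # last 15 steps of each time variable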
Models Tried

1. Huber Regressor - The vanilla Huber Regressor gave an MAE of 3.789.
2. Neural Nets - Used a 3-layer neural network with the ReLU activation
function. This gave an MAE of 3.97 on the public dataset. With proper
tuning/layer selection, we feel that neural nets have the potential to
perform better (a sketch follows).
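
A sketch of the 3-layer ReLU network with Keras; the layer widths and training settings are not on the slides, so these are placeholders:

    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1),                    # linear output for regression
    ])
    model.compile(optimizer="adam", loss="mae")   # MAE matches the competition metric
    model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)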

Huber Regressor was selected as the best model.


Learnings from the Competition / Mistakes Identified

1. Personal point of view: more robust testing should have been done on our
end, as we realised after seeing the private leaderboard that our models
overfit.
2. We should have spent more time on feature engineering.
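
One concrete form of more robust testing is local k-fold cross-validation on the competition metric, e.g. reusing the huber_pca pipeline sketched above:

    from sklearn.model_selection import cross_val_score

    mae = -cross_val_score(huber_pca, X, y, cv=5,
                           scoring="neg_mean_absolute_error")
    print(mae.mean(), mae.std())     # local estimate of leaderboard MAE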
