Advanced Machine Learning Final Project
Team Name: Byte
Team Members: Javier Pacheco and Manav Middha
EDA and Initial Thoughts
Variable Information
● 18 variables
● 8 float, 9 int, 1 object
● 0 null values
● Numerical and categorical values only; the time-dependent values changed with each sub_id
1. Multiple data points for the same obs and the same target variable, so we need a
measure of central tendency
2. Some values are 0. Are they missing? Do we need to impute them?
3. What kind of models should we use?
4. What is the ideal testing method?
Checkpoint 1
Initial Steps
1. We took the mean of the time-dependent variables for each obs so as to have
a single row for each target variable.
2. We tried imputing the zeros with each obs's mean of that time-dependent
variable, but it did not yield good results.
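A minimal pandas sketch of these two steps on toy data (column names such as `obs`, `sub_id`, and `sensor` are placeholders, not the real schema):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data: several rows
# (one per sub_id) for each observation "obs".
df = pd.DataFrame({
    "obs":    [1, 1, 1, 2, 2, 2],
    "sub_id": [1, 2, 3, 1, 2, 3],
    "sensor": [4.0, 0.0, 6.0, 10.0, 12.0, 0.0],
})

# Step 1: collapse the repeated measurements to one row per obs via the mean.
per_obs_mean = df.groupby("obs")["sensor"].mean()

# Step 2 (the imputation we tried): treat zeros as missing and fill them
# with each obs's mean of the remaining values.
df["sensor_imputed"] = df["sensor"].replace(0.0, np.nan)
df["sensor_imputed"] = df.groupby("obs")["sensor_imputed"].transform(
    lambda s: s.fillna(s.mean())
)
```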
Models tried
1. Linear regression - This model gave the best results after dropping features
that were not significant at the 5% significance level. On the public leaderboard, it
gave the best MAE of 4.46.
2. Random Forest Regressor with parameter tuning
3. XGBoost regressor
Both Random Forest and XGBoost were overfitting and not generalizing well,
leading to MAEs between 4.6 and 4.7 on the public leaderboard.
Linear regression was therefore selected as the final model.
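The significance-based selection for the linear model can be sketched as follows. This is an illustrative, numpy-only version on synthetic data: it keeps features whose |t|-statistic exceeds ~1.96, the large-sample two-sided 5% threshold (the feature names and data here are made up, not the real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))          # columns: x1, x2, x3 (x3 is pure noise)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
resid = y - Xd @ beta
sigma2 = resid @ resid / (n - Xd.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
t = beta / se

# Keep features whose |t| clears the ~5% two-sided threshold.
keep = [j for j in range(p) if abs(t[j + 1]) > 1.96]
```

The informative features x1 and x2 survive the cut; the model is then refit on the reduced column set.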
Learnings from Checkpoint 1
Transposed the data so that we have one row for each obs and target variable.
We now have ~160 columns for each obs, so we need some form of feature
reduction.
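The transpose step can be sketched with a pandas pivot on toy data (again, `obs`, `sub_id`, and the sensor names are placeholders):

```python
import pandas as pd

# Long format: one row per (obs, sub_id) measurement.
long_df = pd.DataFrame({
    "obs":      [1, 1, 1, 2, 2, 2],
    "sub_id":   [1, 2, 3, 1, 2, 3],
    "sensor_a": [0.1, 0.2, 0.3, 1.0, 1.1, 1.2],
    "sensor_b": [5.0, 5.5, 6.0, 7.0, 7.5, 8.0],
})

# Pivot so each obs becomes a single row with one column per
# (variable, time step) pair -- this is what blows the width up
# to ~160 columns on the real data.
wide = long_df.pivot(index="obs", columns="sub_id",
                     values=["sensor_a", "sensor_b"])
wide.columns = [f"{var}_t{t}" for var, t in wide.columns]
```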
Models Tried
4. Huber Regressor - This model is robust to outliers and handles high dimensions,
addressing the two biggest issues with this dataset. The vanilla version with
default hyperparameters gave an MAE of 3.809 on the public leaderboard, and
tuning the regularization and epsilon parameters improved this to 3.807.
5. Huber Regressor with PCA - Applied Huber regression to PCA-transformed features
rather than the original dataset. This gave an MAE of 3.797 on the public leaderboard.
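A minimal scikit-learn sketch of the PCA-then-Huber pipeline on synthetic data; the component count, `epsilon`, and `alpha` values here are illustrative, not our tuned settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 40))        # stand-in for the ~160 wide columns
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)
y[:10] += 25.0                        # a few outliers Huber should resist

# Scale, reduce dimensionality with PCA, then fit a robust Huber regressor
# on the principal components instead of the raw features.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    HuberRegressor(epsilon=1.35, alpha=1e-3, max_iter=500),
)
model.fit(X, y)
mae = mean_absolute_error(y, model.predict(X))
```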
Took only the last 15 time-dependent values for each obs, since the data is
essentially a time series and later values should have more impact.
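Keeping only the trailing window per obs is a one-liner with a grouped `tail`; this toy version keeps the last 3 of 5 steps (the real data used the last 15, and the column names are placeholders):

```python
import pandas as pd

# Toy long-format data: 5 time steps per obs.
df = pd.DataFrame({
    "obs":    [1] * 5 + [2] * 5,
    "sub_id": list(range(1, 6)) * 2,
    "sensor": [float(v) for v in range(10)],
})

# Sort by time within each obs, then keep only the trailing window.
last_k = (
    df.sort_values(["obs", "sub_id"])
      .groupby("obs")
      .tail(3)
)
```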
Final Learnings
1. Personal point of view: we should have followed a more robust testing
methodology on our end, as we realised our models had overfit once we saw the
private leaderboard.
2. We should have spent more time on feature engineering.