Project Re-Cell by Patel Dakshesh Maheshbhai
Bivariate Analysis:
5. Model Building – OLS Regression
Linear Model Performance Check
Regression
6. Linear Test for Multicollinearity
Regression Removing Multicollinearity
Dropping high p-values
Assumption Test for linearity for independence
Test for Normality
Test for Homoscedasticity
Problem Statement & Understanding
Business Context:
Buying and selling used phones and tablets used to be something that happened on a
handful of online marketplace sites. But the used and refurbished device market has
grown considerably over the past decade, and a new IDC (International Data
Corporation) forecast predicts that the used phone market would be worth $52.7bn by
2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This
growth can be attributed to an uptick in demand for used phones and tablets that offer
considerable savings compared with new models.
Objective:
The rising potential of this comparatively under-the-radar market fuels the need for an
ML-based solution to develop a dynamic pricing strategy for used and refurbished
devices. Re-Cell, a startup aiming to tap the potential in this market, has hired you as a
data scientist. They want you to analyze the data provided and build a linear regression
model to predict the price of a used phone/tablet and identify factors that significantly
influence it.
Data Description and Dictionary:
The data contains the different attributes of used/refurbished phones and tablets. The
data was collected in the year 2021. The detailed data dictionary is given below.
There are 34 different phone brands and 4 operating systems, and 4G and 5G
support are each recorded as "yes" or "no".
Android is the most popular operating system, with 3,246 phones using it.
2,359 phones have 4G connectivity, while only 152 phones support 5G.
The average values for features like screen size, camera megapixels, internal
memory, battery, weight, and prices are greater than the median, indicating right-
skewed data.
The average and median values for RAM are almost the same, showing little to
no skewness.
The average number of days a used phone has been in use is less than the
median, indicating left-skewed data.
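The mean-versus-median reasoning in the observations above can be sketched as a small helper. This is a rough heuristic, not a formal skewness test; the function name `skew_direction`, the `tolerance` parameter, and the sample values are assumptions made for illustration.

```python
import pandas as pd


def skew_direction(values: pd.Series, tolerance: float = 0.05) -> str:
    """Classify skew by comparing mean and median (a rough heuristic).

    The gap between mean and median is judged relative to the standard
    deviation; `tolerance` is an arbitrary cut-off chosen for this sketch.
    """
    mean, median, spread = values.mean(), values.median(), values.std()
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "roughly symmetric"
    return "right-skewed" if mean > median else "left-skewed"


# Illustrative values only: one heavy device pulls the mean above the median.
weights = pd.Series([140, 150, 160, 165, 480])
print(skew_direction(weights))
```

A mean well above the median points to a long right tail, and a mean well below it points to a long left tail.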
Data Overview
Checking the shape of the dataset
   brand_name  os       screen_size  4g   5g   main_camera_mp  selfie_camera_mp  int_memory
0  Honor       Android  14.50        yes  no   13.0            5.0               64.0
1  Honor       Android  17.30        yes  yes  13.0            16.0              128.0
2  Honor       Android  16.69        yes  yes  13.0            8.0               128.0
3  Honor       Android  25.50        yes  yes  13.0            8.0               64.0
4  Honor       Android  15.32        yes  no   13.0            8.0               64.0

   ram  battery  weight  release_year  days_used  normalized_used_price  normalized_new_price
0  3.0  3020.0   146.0   2020          127        4.307572               4.715100
1  8.0  4300.0   213.0   2020          325        5.162097               5.519018
2  8.0  4200.0   213.0   2020          162        5.111084               5.884631
3  6.0  7250.0   480.0   2020          345        5.135387               5.630961
4  3.0  5000.0   185.0   2020          293        4.389995               4.947837
The data has 3,455 rows and 15 columns. A high percentage of the devices run Android,
and the most recent devices were released in 2020.
There are 11 numerical columns (float and integer) and 4 object columns in the dataset.
The target variable, the normalized price of a used device, is a float.
Checking for duplicate values
The dataset was checked for duplicate rows, and none were found.
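The shape, data-type, and duplicate checks described above can be sketched with pandas. The small frame below is a stand-in built from a few values in the sample rows, not the full Re-Cell data, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Stand-in frame using a few values from the sample rows shown earlier.
df = pd.DataFrame(
    {
        "brand_name": ["Honor", "Honor", "Honor"],
        "os": ["Android", "Android", "Android"],
        "screen_size": [14.50, 17.30, 16.69],
        "release_year": [2020, 2020, 2020],
    }
)

n_rows, n_cols = df.shape                                # (rows, columns)
n_numeric = df.select_dtypes(include="number").shape[1]  # float and integer columns
n_duplicates = int(df.duplicated().sum())                # fully repeated rows

print(f"{n_rows} rows, {n_cols} columns, {n_numeric} numeric, {n_duplicates} duplicates")
```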
main_camera_mp
selfie_camera_mp
int_memory
ram
battery
weight
Exploratory Data Analysis
Univariate Analysis:
Q1. What does the distribution of normalized used device prices
look like?
normalized_used_price
The distribution of the normalized used price for the devices appears approximately
normal. There appear to be outliers on both the lower and higher ends. The average
normalized used price for the devices is $4.36.
normalized_new_price
The normalized new price distribution for devices resembles a normal curve, though it
includes outliers at both the lower and higher extremes. The average normalized new
price is $5.23.
screen_size
The distribution of screen size does not show a clear pattern, although it most
closely resembles a normal distribution. There appear to be outliers on both the
lower and higher ends. The average screen size for the devices is 13.71 cm.
main_camera_mp
The distribution of the main camera resolution appears to be skewed slightly left.
There appear to be outliers on the upper end. The average resolution of the main
camera for the devices is 9.46 MP.
selfie_camera_mp
The distribution of the selfie camera resolution appears to be skewed slightly right.
There appear to be outliers on the upper end. The average resolution of the selfie
camera for the devices is 6.55 MP.
int_memory
The distribution of internal memory appears to be skewed right. There appear to be
outliers on the upper end. The average amount of internal memory for the devices is
54.57 GB.
ram
There does not seem to be a discernible distribution for RAM in the devices. The
average amount of RAM for the devices is 4.04 GB.
weight
The weight distribution of the devices is slightly right-skewed, with numerous outliers
at the higher end and a few at the lower end. The average weight is 182.75 grams.
battery
The distribution of energy capacity appears to be multimodal. There appear to be
outliers on the upper end. The average energy capacity for the devices is
182.75 mAh.
days_used
brand_name
Various other brands, grouped together, make up the largest share of the used device
market at 14.5%. They are followed by Samsung and Huawei with 9.9% and 7.3%,
respectively.
os
The used phone market is dominated by Android devices, which make up 93.1% of it.
iOS has the smallest share of the used phone market, at 1.0%.
4g
5g
release_year
Many of the devices (18.6%) had a release year of 2014. This is followed by 2013 and
2015 with 16.5% and 14.9%, respectively.
Bivariate Analysis
Q3: Which attributes are highly correlated with the normalized
price of a used device?
Correlation Check
The normalized used price of a device is highly positively correlated with the
normalized new price, battery capacity, selfie camera resolution, and screen size. It is
negatively correlated with the number of days the device has been used.
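A correlation check of this kind can be sketched as follows. The frame holds made-up numbers chosen so the signs are obvious; only the column names come from the data dictionary, and `corr_with_target` is a name introduced for this sketch.

```python
import pandas as pd

# Made-up values for illustration; only the column names match the dataset.
df = pd.DataFrame(
    {
        "normalized_used_price": [1.0, 2.0, 3.0, 4.0, 5.0],
        "normalized_new_price": [2.0, 4.0, 6.0, 8.0, 10.0],
        "days_used": [500, 400, 300, 200, 100],
        "os": ["Android"] * 5,
    }
)

# Pearson correlation of every numeric column with the target.
corr_with_target = (
    df.select_dtypes(include="number")
    .corr()["normalized_used_price"]
    .drop("normalized_used_price")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Values near +1 or -1 indicate a strong linear relationship with the target; the sign gives its direction.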
RAM
There does not seem to be a discernible distribution for RAM in the devices. The
average amount of RAM for the devices is 4 GB. The brand with the most RAM is OnePlus,
and the brand with the least RAM is Celkon.
Battery
The weight distribution of devices varies noticeably across brands, with none appearing
normally distributed and some showing outliers. Google has the heaviest devices, while
Micromax has the lightest.
Screen
People who buy phones and tablets primarily for entertainment prefer large screens,
as they offer a better viewing experience. Huawei has the greatest number of devices
with screens larger than 6 inches, taking 13.6% of this market. It is followed by
Samsung and other miscellaneous brands with shares of 10.8% and 9%, respectively.
Camera
Huawei leads in the number of devices with selfie cameras over 8 MP, holding 13.3% of
the market, followed by Vivo at 11.9% and Oppo at 11.5%.
Rear Camera
Rear cameras typically have higher resolution than front cameras, with a threshold of
16 MP set for analysis. Sony leads in devices with main cameras over 16 MP, capturing
39.4% of the market, followed by Motorola at 11.7% and other brands at 9.6%.
Price
There appears to be a positive relationship between the release year of a device and
its normalized used price: the more recent the release year, the higher the
normalized used price.
It appears that devices with 4g availability have a higher normalized price than
devices that do not.
It appears that devices with 5g availability have a higher normalized price than
devices that do not.
Devices that possess 5g have a higher normalized price than devices that possess
4g.
Data Preprocessing:
Missing Value Imputation
We will impute the missing values in the data by the column medians grouped by
release_year and brand_name.
We will impute the remaining missing values in the data by the column medians
grouped by brand_name.
We will fill the remaining missing values in the main_camera_mp column by the column
median.
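The three imputation passes described above can be sketched with pandas `groupby(...).transform("median")`. The helper name `impute_by_group_median` and the toy values are assumptions made for illustration; only the column names and the order of the passes come from the text.

```python
import numpy as np
import pandas as pd


def impute_by_group_median(df, columns, group_keys):
    """Fill NaNs in `columns` with the median of each `group_keys` group."""
    out = df.copy()
    for col in columns:
        group_median = out.groupby(group_keys)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out


# Toy data: brand B has no camera value at all, so only pass 3 can fill it.
toy = pd.DataFrame(
    {
        "brand_name": ["A", "A", "A", "B", "C", "C", "C"],
        "release_year": [2014, 2014, 2015, 2014, 2014, 2015, 2015],
        "main_camera_mp": [8.0, np.nan, np.nan, np.nan, 13.0, 13.0, 16.0],
    }
)
cols = ["main_camera_mp"]

# Pass 1: medians within each (release_year, brand_name) group.
imputed = impute_by_group_median(toy, cols, ["release_year", "brand_name"])
# Pass 2: medians within each brand_name for what is still missing.
imputed = impute_by_group_median(imputed, cols, ["brand_name"])
# Pass 3: overall column median for the remaining main_camera_mp gaps.
imputed["main_camera_mp"] = imputed["main_camera_mp"].fillna(
    imputed["main_camera_mp"].median()
)
print(imputed)
```

Groups whose values are all missing yield a NaN median, which is why the later, coarser passes are needed.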
Feature Engineering
Outlier Check
Data Preparation for modeling
Model Building - Linear Regression
Printing x_train and y_train datatype:
OLS Regression
OLS Regression Result
Model Performance Check
We will be using metric functions defined in sklearn for RMSE, MAE, and R².
We will define a function to calculate MAPE and adjusted R².
We will create a function which will print out all the above metrics in one go.
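A minimal sketch of such a metrics helper is shown below. The function names (`adj_r2_score`, `mape_score`, `model_performance`) and the toy numbers are assumptions made for illustration; the sklearn calls are `mean_squared_error`, `mean_absolute_error`, and `r2_score`, with RMSE taken as the square root of the MSE.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def adj_r2_score(n_predictors, targets, predictions):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    r2 = r2_score(targets, predictions)
    n = len(targets)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


def mape_score(targets, predictions):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.mean(np.abs((targets - predictions) / targets)) * 100)


def model_performance(targets, predictions, n_predictors):
    """Collect all five metrics in a one-row data frame."""
    return pd.DataFrame(
        {
            "RMSE": np.sqrt(mean_squared_error(targets, predictions)),
            "MAE": mean_absolute_error(targets, predictions),
            "R-squared": r2_score(targets, predictions),
            "Adj. R-squared": adj_r2_score(n_predictors, targets, predictions),
            "MAPE": mape_score(targets, predictions),
        },
        index=[0],
    )


# Toy values for illustration only.
perf = model_performance([4.0, 5.0, 6.0, 5.0], [4.5, 5.0, 5.5, 5.0], n_predictors=1)
print(perf)
```

Calling the same helper on the training and test predictions gives two rows that can be compared side by side when checking for overfitting.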
Linear Regression Assumptions
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
Among the variables being considered, dropping os_iOS would have the least impact on
the predictive power of the model. We'll drop os_iOS and check the VIF again.
VIF after dropping os_iOS:
VIF after dropping brand_name_Others:
Dropping weight:
Among the variables being considered, dropping weight would have the least impact on
the predictive power of the model. We'll drop weight and check the VIF again.
There are no more predictors that have multicollinearity and the assumption is
satisfied.
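The VIF checks used in this section can be sketched with NumPy alone: each predictor is regressed on the remaining predictors plus an intercept, and VIF = 1 / (1 - R²). The helper name `vif_table`, the synthetic columns, and the cut-off of 5 in the comment are assumptions made for illustration, not values taken from the analysis above.

```python
import numpy as np
import pandas as pd


def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor for every column of X (no constant column)."""
    vifs = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1 - ss_res / ss_tot
        vifs[col] = np.inf if r2 >= 1 else 1 / (1 - r2)
    return pd.Series(vifs, name="VIF")


# Synthetic predictors: "c" is almost a copy of "a", while "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
vif = vif_table(pd.DataFrame({"a": a, "b": b, "c": c}))
print(vif)  # a common rule of thumb (an assumption here) flags VIF above 5
```

After a high-VIF column is dropped, the table is recomputed on the remaining columns, which mirrors the drop-and-recheck cycle described above.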
Dropping high p-value variables (if needed):
We will drop the predictor variables having a p-value greater than 0.05 as they do
not significantly impact the target variable.
But sometimes p-values change after dropping a variable. So, we'll not drop all
variables at once.
Instead, we will do the following:
Build a model, check the p-values of the variables, and drop the column
with the highest p-value.
Create a new model without the dropped feature, check the p-values of
the variables, and drop the column with the highest p-value.
Repeat the above two steps till there are no columns with p-value > 0.05.
The above process can also be done manually by picking one variable at a time that
has a high p-value, dropping it, and building a model again. But that might be a little
tedious and using a loop will be more efficient.
Observation:
The final model, olsmod2, includes predictor variables from x_train6, with no p-values
exceeding 0.05. It has an adjusted R-squared of 0.841, explaining ~84% of the variance.
Compared to olsmod1 (adjusted R-squared of 0.845), the dropped variables had
minimal impact. Comparable RMSE and MAE values for train and test sets confirm the
model is not overfitting.
Training Performance
Test Performance
We will test for linearity and independence by making a plot of fitted values vs
residuals and checking for patterns.
If there is no pattern, then we say the model is linear and residuals are
independent.
Otherwise, the model is showing signs of non-linearity and residuals are not
independent.
Creating a data frame with actual, fitted, and residual values and plotting the fitted
values vs residuals:
The scatter plot of residuals versus fitted values illustrates the distribution of errors. If a
pattern exists in the plot, it suggests non-linearity in the data, meaning the model does
not account for non-linear effects. Since no pattern is observed, the assumptions of
linearity and independence are met.
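A data frame of actual, fitted, and residual values can be sketched as below. The fit uses a plain NumPy least-squares solve on synthetic data as a stand-in for the fitted OLS model; the variable names and the column labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, truly linear data standing in for the training set.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=100)

design = np.column_stack([np.ones_like(x), x])  # intercept + one predictor
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef

df_pred = pd.DataFrame(
    {
        "Actual Values": y,
        "Fitted Values": fitted,
        "Residuals": y - fitted,
    }
)
print(df_pred.head())
```

Plotting "Fitted Values" on the x-axis against "Residuals" on the y-axis gives the diagnostic plot. For OLS with an intercept, the residuals always average to zero and are uncorrelated with the fitted values, so the thing to look for is shape (curvature or a funnel), not a shift in level.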
TEST FOR NORMALITY
Shapiro Result (statistic=0.9748184084892273, pvalue=3.1171865175534697e-20)
Since the p-value is below 0.05, the Shapiro-Wilk test rejects normality, so strictly
speaking the residuals are not normal.
However, the test statistic (0.975) is close to 1, and as an approximation we can
accept this distribution as close to being normal.
So, the assumption is treated as satisfied.
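The Shapiro-Wilk check can be sketched with `scipy.stats.shapiro`. The helper name `shapiro_check` and the synthetic, deliberately skewed sample are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def shapiro_check(residuals, alpha=0.05):
    """Return (statistic, p-value, looks_normal) for the Shapiro-Wilk test."""
    statistic, p_value = stats.shapiro(residuals)
    return float(statistic), float(p_value), bool(p_value >= alpha)


# Synthetic, strongly right-skewed "residuals": the test should reject normality.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)
statistic, p_value, looks_normal = shapiro_check(skewed)
print(statistic, p_value, looks_normal)
```

With thousands of residuals the test becomes sensitive to very small departures from normality, which is one reason to read the statistic and a Q-Q plot alongside the p-value.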
TEST FOR HOMOSCEDASTICITY
Since the p-value is greater than 0.05, we can say that the residuals are
homoscedastic. So, this assumption is satisfied.
We can observe here that our model has returned good prediction results, and the
actual and predicted values are comparable.
We can also visualize the comparison results as a bar graph.
Final Model Summary
Training Performance:
Test Performance:
Actionable Insights and Recommendations
Insights:
o New vs. Used Price Relationship: The price of used devices strongly correlates
with their new counterparts. Higher-priced new devices lead to higher resale
value, making them key targets for refurbishment.
o Key Features Driving Value: Large screen sizes, higher RAM, better front and rear
cameras, and 4G connectivity significantly boost resale prices. Devices with these
specifications from specific brands are especially profitable.
o Impact of Brand and Features: While certain brands like Samsung and some main
camera configurations negatively affect resale prices, others contribute positively.
Recommendations:
Prioritize refurbishing devices with large screens, high RAM, and excellent camera
specifications, as these are in high demand.
Focus on newer models with high initial market prices to maximize revenue
potential.
Expand to selling other used gadgets, such as smartwatches, to diversify offerings
and attract more customers.
Collect and analyse customer demographics to refine product selection and cater
to different market segments.
Retailers should avoid overstocking models or brands that do not retain value
well, like some Samsung models with specific camera configurations.