SLF - ReCell Project - Presentation
Phones/Tablets Market
ReCell_Supervised Learning - Foundations
29-Jul-2023
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
Contents / Agenda
● Executive Summary
● EDA Results
● Data Preprocessing
● Appendix
Executive Summary
Based on the analysis, the following are the actionable insights and recommendations.
Actionable Insights:
1) Screen size, main camera MP, selfie camera MP, RAM, days used, normalized new price and 4g_yes have
positive coefficients, so the predicted price of a used device rises as any of these variables
increases, with the others held constant.
2) The price of a used device depends strongly on the price of the new device. A one-unit increase in the
normalized new price raises the normalized used price by 0.428 units, assuming all other variables are
held constant.
3) Years since release has a negative coefficient, which means that the older the phone, the lower the
price of the used device.
4) The brand names Lenovo, Nokia and Xiaomi appear to increase the price of the used device; these brands
may be in demand.
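Insight 2 can be written out as a worked relation. This is a minimal sketch, assuming the fitted model is an ordinary linear regression on the normalized prices; the symbols \beta_0, \beta_j and x_j are introduced here for illustration and do not appear in the slides.

```latex
\widehat{\text{used}} = \beta_0 + 0.428\,\text{new} + \sum_{j} \beta_j x_j
\quad\Longrightarrow\quad
\Delta\widehat{\text{used}} = 0.428\,\Delta\text{new}
\quad (\text{all } x_j \text{ held fixed})
```

For example, raising the normalized new price by 0.5 units would raise the predicted normalized used price by 0.428 × 0.5 = 0.214 units.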
Recommendations:
1. Features such as the number of days used, battery and weight appear to have no impact on the price of the
used device. This suggests that dealers maintain a certain quality level in used devices by making the
necessary checks to keep the refurbished market attractive. The company therefore need not be overly
concerned about these factors.
2. Adding customer attributes such as gender, age and income to the analysis could give more insight into
the used devices market.
Business Problem Overview and Solution Approach
Problem Overview / Statement
Over the past decade, the use of refurbished devices has increased significantly. According to the forecast,
the market would be worth $52bn by 2023, a compound annual growth rate of 13.6% relative to 2018.
Apart from offering cost-effective options to both buyers and sellers, the used phone market offers a
number of other benefits:
● used devices can also be sold with warranties,
● buying used phones helps reduce the environmental impact of manufacturing phones/tablets and its
negative effects on the health of the people involved,
● third-party vendors provide attractive offers to customers for refurbished devices, etc.
Start-ups such as ReCell intend to enter this growing market and want to know the price of used
devices and the factors that influence it.
Data Overview
Missing values
Observations:
● There are 15 columns and 3454 rows in the data frame.
● Of the 15 columns, 9 are of type float, 2 are integers and 4 are objects.
● 6 columns have missing values, as shown in the table of missing values.
● The statistical summary shows that for some features the mean is slightly higher than the median,
which suggests a distribution slightly skewed to the right, while for days_used the mean is less than
the median, so its distribution is expected to be left skewed.
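The mean-versus-median reasoning above can be sketched in a few lines. This is only a rule of thumb, and the helper name `skew_direction` and the toy values are hypothetical, not taken from the ReCell data.

```python
from statistics import mean, median


def skew_direction(values):
    """Guess the skew direction from the gap between mean and median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # a long right tail pulls the mean above the median
    if m < med:
        return "left"    # a long left tail pulls the mean below the median
    return "symmetric"


# Hypothetical toy samples for illustration only.
right_tailed = [1, 2, 2, 3, 3, 3, 20]
left_tailed = [1, 18, 19, 19, 20, 20, 20]

print(skew_direction(right_tailed))
print(skew_direction(left_tailed))
```

The comparison is a quick screen, not a test; a histogram or a skewness statistic is still needed to confirm the shape.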
EDA Results_Univariate Analysis
Observations:
● The normalized used price, normalized new price and
screen size follow an approximately normal distribution with
outliers on both sides
● The battery data is slightly skewed to the right, with outliers on
the right side only.
EDA Results_Univariate Analysis (Cont’d)
Observations:
● The resolution of the main camera seems to follow a normal distribution
with outliers on the right side only
● The resolution of the selfie camera has a right-skewed distribution with
outliers on the right side only
● The weight column follows a right-skewed distribution with a large
number of outliers on the right and a few on the left side
● The number of days used shows a left-skewed distribution with no
outliers
EDA Results_Univariate Analysis (Cont’d)
Observations:
● Most of the brand names are grouped under the category Others
● Samsung appears to be the most in-demand brand of all
● About 93% of refurbished devices run the Android operating system
● In the refurbished market, 4G devices are more widely available
than 5G devices
● Most of the refurbished devices were released in 2014
EDA Results_Bivariate Analysis
Observations:
● The heat map shows a high positive correlation between normalized new
price and normalized used price
● There is a high positive correlation between weight and screen size
● Battery and screen size are also highly correlated
● The box plot indicates that OnePlus devices tend to offer more RAM
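The heat map values are pairwise correlation coefficients. As a minimal sketch, assuming the heat map uses the Pearson coefficient (the slides do not name the method), the computation for one pair of columns looks like this; the function name and the toy values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy values, not rows from the ReCell data.
screen_size_cm = [10.0, 12.5, 15.0, 20.0, 25.5]
weight_g = [120, 140, 165, 300, 480]

print(round(pearson(screen_size_cm, weight_g), 3))
```

A value close to +1 corresponds to the "high positive correlation" cells in the heat map, and a value close to 0 to the uncorrelated ones.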
EDA Results_Bivariate Analysis (Cont’d)
Brand vs high selfie camera
Brand vs Large screen size
Observations:
● The line plot shows that the price of used devices increases with the year of release
● The box plots indicate that the price of used devices with 5G is higher than that of devices with 4G
Data Preprocessing_ Missing value treatment
Data Preprocessing_Duplication check & Feature engineering (Cont’d)
Duplicate value check results
Feature engineering results
Observations:
● The duplication check showed that no complete row is duplicated
● In feature engineering, a new column called years_since_release was introduced and the
release_year column was dropped; the Feature engineering results table shows the statistics
of the newly introduced column.
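The feature engineering step can be sketched as follows. The slides do not give the formula, so this assumes years_since_release is measured from 2021, the data collection year stated in the data dictionary; the function name and sample rows are hypothetical.

```python
# Assumption: the reference year is the data collection year (2021).
REFERENCE_YEAR = 2021


def add_years_since_release(rows):
    """Return copies of the rows with release_year replaced by years_since_release."""
    engineered = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != "release_year"}
        new_row["years_since_release"] = REFERENCE_YEAR - row["release_year"]
        engineered.append(new_row)
    return engineered


# Hypothetical sample rows for illustration only.
devices = [
    {"brand_name": "Nokia", "release_year": 2014},
    {"brand_name": "Xiaomi", "release_year": 2020},
]
engineered = add_years_since_release(devices)
print(engineered)
```

Replacing the calendar year with an age makes the coefficient easier to read: it becomes the change in price per additional year since release.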
Data Preprocessing _Outliers check (Cont’d)
Observations:
● Outliers are present in all the numerical columns except years_since_release and days_used.
● In this analysis the outliers are not treated, as treating them may result in the loss of useful
information.
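An outlier check of this kind is commonly done with the box plot rule. This is a minimal sketch, assuming the 1.5 × IQR fences that box plots use by default (the slides do not state the rule applied); the function name and the toy values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical toy values, not taken from the ReCell data.
print(iqr_outliers([1, 2, 3, 4, 5, 100]))   # [100]
print(iqr_outliers([1, 2, 3, 4, 5]))        # []
```

The function only flags the points; leaving them in the data, as done in this analysis, is a separate decision.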
Data Preprocessing_Data preparation for modeling (Cont’d)
Observations:
● After introducing dummy variables for the categorical columns, the number of columns increased
from 15 to 49
● After the train/test split, 2417 rows were assigned to the train data and 1037 to the test
data.
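The row counts above are consistent with a 70:30 split of the 3454 rows. The slides do not state the ratio, so the sketch below assumes a 30% test fraction with the test count rounded up; the function name is hypothetical.

```python
from math import ceil


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the test count up."""
    n_test = ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test


print(split_sizes(3454, 0.30))   # (2417, 1037)
```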
Overview of initial ML model (Before applying VIF & P-Value check)
Observations:
● The adjusted R-squared is 0.842, so the model explains
roughly 84% of the variance in the normalized used price,
which is reasonably good.
● The constant (intercept) is 1.31, which is the predicted
normalized used price when all predictor variables
are zero
● A long list of dummy variables was introduced
for the categorical data (brands and operating
system); however, most of them have a p-value > 0.05,
which means they do not have a significant impact on
the model and hence will be dropped from the final
model.
● By dropping the high p-value variables, the number
of predictor variables dropped from 45 to 11 in
the final model. (See final model in next slide)
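For reference, the adjusted R-squared quoted above has the standard definition below, where n is the number of training rows and p the number of predictors. The symbols are introduced here for illustration and do not appear in the slides.

```latex
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Unlike the plain R-squared, this value does not automatically rise when a predictor is added, which makes it the more suitable measure when comparing the initial model with the reduced one.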
Summary of final ML model factors for prediction
Observations:
Training & Test data performance metrics
Observations:
APPENDIX
Data Background and Contents
The data contains the different attributes of used/refurbished phones and tablets. The data was
collected in the year 2021. The detailed data dictionary is given below.
● brand_name: Name of manufacturing brand
● os: OS on which the device runs
● screen_size: Size of the screen in cm
● 4g: Whether 4G is available or not
● 5g: Whether 5G is available or not
● main_camera_mp: Resolution of the rear camera in megapixels
● selfie_camera_mp: Resolution of the front camera in megapixels
● int_memory: Amount of internal memory (ROM) in GB
● ram: Amount of RAM in GB
● battery: Energy capacity of the device battery in mAh
● weight: Weight of the device in grams
● release_year: Year when the device model was released
● days_used: Number of days the used/refurbished device has been used
● normalized_new_price: Normalized price of a new device of the same model in euros
● normalized_used_price: Normalized price of the used/refurbished device in euros
EDA Results_Univariate Analysis
Observations:
● The ram column has a normal distribution with outliers on both sides
● The internal memory has a right-skewed distribution with outliers on the right side only
Model Assumptions
The following are the assumptions of the model:
1. No Multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No Heteroscedasticity
Linear regression assumptions check
Test for Multicollinearity
● 7 variables [screen_size, weight, brand_name_Apple, brand_name_Huawei, brand_name_Others, brand_name_Samsung, os_iOS] were observed to
have VIF > 5 (see Table-1 in appendix)
● After dropping the brand_name_Apple, brand_name_Others and weight variables, the VIF of all predictor variables fell below 5 (see Table-2
in appendix); therefore the assumption of no multicollinearity is satisfied.
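For reference, the variance inflation factor used in this check has the standard definition below, where R_j^2 is the R-squared obtained by regressing predictor j on all the other predictors. The symbols are introduced here for illustration and do not appear in the slides.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition the threshold VIF > 5 corresponds to R_j^2 > 0.8, since 1 / (1 - 0.8) = 5: more than 80% of the variance of that predictor is explained by the others.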
Linear regression assumptions check (Cont’d)
● On the Q-Q plot, except for the tail values, the residuals lie on a straight line, indicating that
the distribution is approximately normal
● According to the Shapiro test, the residuals are not normally distributed, as the p-value is less than 0.05
● The test results indicate that the distribution is not precisely normal, but it can be assumed
to be close to normal
Results of Multicollinearity treatment
Table-3
Table-1 Table-2