Project Re-Cell by Patel Dakshesh Maheshbhai

The document outlines a project titled 'Re-Cell' aimed at developing a dynamic pricing strategy for used and refurbished phones and tablets through data analysis and linear regression modeling. It includes sections on problem understanding, data overview, exploratory data analysis, data preprocessing, model building, and insights. The analysis reveals significant market growth for used devices, with Android dominating the market and various factors influencing pricing, such as camera resolution and battery capacity.

PROJECT

re-cell

BY-PATEL DAKSHESH MAHESHBHAI


Content and Sub-Content

1. Problem Statement and Understanding
 Business Context
 Objective
 Data Description & Dictionary

2. Data Overview
 Checking the shape of the dataset
 Checking the datatype of columns of the dataset
 Checking duplicate values in the dataset
 Checking missing values

3. Exploratory Data Analysis
Univariate Analysis:
o What does the distribution of normalized used device prices look like?
o What percentage of the used device market is dominated by Android devices?
Bivariate Analysis:
 Which attributes are highly correlated with the normalized price of a used device?
 The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
 A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
 Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
 A lot of devices nowadays offer great selfie cameras, allowing us to capture our favourite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?

4. Data Preprocessing
 Missing Value Imputation
 Feature Engineering
 Outlier Check
 Data Preparation for Modeling

5. Model Building – Linear Regression
 OLS Regression
 Model Performance Check

6. Linear Regression Assumptions
 Test for Multicollinearity
 Removing Multicollinearity
 Dropping high p-values
 Test for Linearity and Independence
 Test for Normality
 Test for Homoscedasticity

7. Final Model Summary

8. Insights and Recommendations

Problem Statement & Understanding
 Business Context:
Buying and selling used phones and tablets used to be something that happened on a
handful of online marketplace sites. But the used and refurbished device market has
grown considerably over the past decade, and a new IDC (International Data
Corporation) forecast predicts that the used phone market would be worth $52.7bn by
2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This
growth can be attributed to an uptick in demand for used phones and tablets that offer
considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both
consumers and businesses that are looking to save money when purchasing one. There
are plenty of other benefits associated with the used device market. Used and
refurbished devices can be sold with warranties and can also be insured with proof of
purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide
attractive offers to customers for refurbished devices. Maximizing the longevity of
devices through second-hand trade also reduces their environmental impact and helps
in recycling and reducing waste. The impact of the COVID-19 outbreak may further
boost this segment as consumers cut back on discretionary spending and buy phones
and tablets only for immediate needs.

 Objective:
The rising potential of this comparatively under-the-radar market fuels the need for an
ML-based solution to develop a dynamic pricing strategy for used and refurbished
devices. Re-Cell, a startup aiming to tap the potential in this market, has hired you as a
data scientist. They want you to analyze the data provided and build a linear regression
model to predict the price of a used phone/tablet and identify factors that significantly
influence it.

 Data Description and Dictionary:
The data contains the different attributes of used/refurbished phones and tablets. The
data was collected in the year 2021. The detailed data dictionary is given below.

Column                   Description
brand_name               Name of manufacturing brand
os                       OS on which the device runs
screen_size              Size of the screen in cm
4g                       Whether 4G is available or not
5g                       Whether 5G is available or not
main_camera_mp           Resolution of the rear camera in megapixels
selfie_camera_mp         Resolution of the front camera in megapixels
int_memory               Amount of internal memory (ROM) in GB
ram                      Amount of RAM in GB
battery                  Energy capacity of the device battery in mAh
weight                   Weight of the device in grams
release_year             Year when the device model was released
days_used                Number of days the used/refurbished device has been used
normalized_new_price     Normalized price of a new device of the same model in euros
normalized_used_price    Normalized price of the used/refurbished device in euros

 There are 34 different brands and 4 operating systems; the 4g and 5g columns
record "yes" or "no" for whether a device supports each network.
 Android is the most popular operating system, with 3,246 phones using it.
 2,359 phones have 4G connectivity, while only 152 phones support 5G.
 The average values for features like screen size, camera megapixels, internal
memory, battery, weight, and prices are greater than the median, indicating right-
skewed data.
 The average and median values for RAM are almost the same, showing little to
no skewness.
 The average number of days a used phone has been in use is less than the
median, indicating left-skewed data.
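
The mean-versus-median comparison used in the bullets above can be illustrated with a small sketch. The values below are hypothetical and are not taken from the dataset, and `skew_direction` is an illustrative helper; the comparison is only a rough heuristic for the direction of skew.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skew by comparing the mean with the median (rough heuristic)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"


# Hypothetical internal-memory values in GB (illustration only).
int_memory = [16, 16, 32, 32, 32, 64, 64, 128, 512]
print(mean(int_memory), median(int_memory), skew_direction(int_memory))
```

A few very large values pull the mean above the median, which is the pattern described for most of the numeric columns.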

Data Overview
Checking the shape of the dataset
   brand_name  os       screen_size  4g   5g   main_camera_mp  selfie_camera_mp  int_memory  ram  battery  weight  release_year  days_used  normalized_used_price  normalized_new_price
0  Honor       Android  14.50        yes  no   13.0            5.0               64.0        3.0  3020.0   146.0   2020          127        4.307572               4.715100
1  Honor       Android  17.30        yes  yes  13.0            16.0              128.0       8.0  4300.0   213.0   2020          325        5.162097               5.519018
2  Honor       Android  16.69        yes  yes  13.0            8.0               128.0       8.0  4200.0   213.0   2020          162        5.111084               5.884631
3  Honor       Android  25.50        yes  yes  13.0            8.0               64.0        6.0  7250.0   480.0   2020          345        5.135387               5.630961
4  Honor       Android  15.32        yes  no   13.0            8.0               64.0        3.0  5000.0   185.0   2020          293        4.389995               4.947837

The dataset has 3,454 rows and 15 columns. A high percentage of devices run Android.
Devices released as recently as 2020 are included.

Checking the data types of the columns for the dataset

There are 11 numerical (float and integer) types and 4 object types in the dataset.
The target variable is the normalized price of a used device and is a float type.

Checking for duplicate values

The dataset contains no duplicate rows.

Checking for missing values

There are missing values for the following columns:

 main_camera_mp
 selfie_camera_mp
 int_memory
 ram
 battery
 weight
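
The overview checks in this section (shape, data types, duplicates, missing values) could be run along these lines with pandas. This is a minimal sketch: the four-row frame is a hypothetical stand-in for the real data, and only the column names come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the real file has far more rows.
df = pd.DataFrame({
    "brand_name": ["Honor", "Honor", "Samsung", "Samsung"],
    "os": ["Android", "Android", "Android", "Android"],
    "main_camera_mp": [13.0, np.nan, 8.0, 8.0],
    "weight": [146.0, 213.0, np.nan, 185.0],
})

n_rows, n_cols = df.shape              # shape of the dataset
dtypes = df.dtypes                     # datatype of each column
n_duplicates = df.duplicated().sum()   # number of fully duplicated rows
missing = df.isnull().sum()            # missing values per column

print(n_rows, n_cols)
print(dtypes)
print(n_duplicates)
print(missing[missing > 0])            # only columns that have gaps
```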

Exploratory Data Analysis
Univariate Analysis:
Q1. What does the distribution of normalized used device prices
look like?

 normalized_used_price

The distribution of the normalized used price appears approximately normal, with
outliers at both the lower and higher ends. The average normalized used price is 4.36.

 normalized_new_price

The normalized new price distribution resembles a normal curve, though it includes
outliers at both the lower and higher extremes. The average normalized new price is
5.23.

 screen_size

The distribution of screen size does not show a clear pattern, although it most
closely resembles a normal distribution. There seem to be outliers on both the lower
and higher ends. The average screen size is 13.71 cm.

 main_camera_mp

The distribution of main camera resolution appears to be skewed slightly left. There
seem to be outliers on the upper end. The average main camera resolution is 9.46 MP.

 selfie_camera_mp

The distribution of selfie camera resolution appears to be skewed slightly right.
There seem to be outliers on the upper end. The average selfie camera resolution is
6.55 MP.

 int_memory

The distribution of internal memory appears to be skewed right. There seem to be
outliers on the upper end. The average amount of internal memory is 54.57 GB.

 ram

There does not seem to be a discernible distribution for RAM in the devices. The
average amount of RAM is 4.04 GB.

 weight

The weight distribution of the devices is slightly right-skewed, with numerous outliers
at the higher end and a few at the lower end. The average weight is 182.75 grams.

 battery

The distribution of battery energy capacity appears to be multimodal. There seem to be
outliers on the upper end. The average energy capacity is about 3,133 mAh.

 days_used

The distribution of the number of days a device has been used appears to be skewed
slightly to the left. There seem to be no outliers. The average number of days used is
about 675.

 brand_name

The "Others" category, which groups miscellaneous brands, accounts for the largest
share of devices in the data (14.5%). It is followed by Samsung and Huawei with 9.9%
and 7.3%, respectively.

 os

Q2: What percentage of the used device market is dominated by
Android devices?

The used device market is dominated by Android devices, which make up 93.1% of it.
iOS has the smallest share at 1.0%.

 4g

The majority of devices (67.6%) support 4G.

 5g

The majority of devices (95.6%) do not support 5G.

 release_year

The most common release year is 2014 (18.6% of devices), followed by 2013 and 2015
with 16.5% and 14.9%, respectively.

Bivariate Analysis
Q3: Which attributes are highly correlated with the normalized
price of a used device?

 Correlation Check

The normalized used price of a device is highly positively correlated with the
normalized new price, battery capacity, selfie camera resolution, and screen size. It is
negatively correlated with the number of days the device has been used.
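
A correlation check of this kind might be sketched as follows, assuming pandas and its default Pearson correlation. The five rows are hypothetical values chosen for illustration, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Hypothetical values for illustration only (not rows from the dataset).
df = pd.DataFrame({
    "normalized_used_price": [4.3, 5.2, 5.1, 5.0, 4.4],
    "normalized_new_price": [4.7, 5.5, 5.9, 5.6, 4.9],
    "days_used": [900, 300, 350, 400, 800],
})

# Pearson correlation of every numeric column with the target,
# sorted from most positive to most negative.
corr = (
    df.corr()["normalized_used_price"]
    .drop("normalized_used_price")
    .sort_values(ascending=False)
)
print(corr)
```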

 RAM

Q4: The amount of RAM is important for the smooth functioning
of a device. How does the amount of RAM vary with the brand?

There does not seem to be a discernible distribution for RAM in the devices. The
average amount of RAM is 4 GB. The brand with the most RAM is OnePlus and the brand
with the least RAM is Celkon.

 Battery

Q5: People who travel frequently require devices with large
batteries to run through the day. But a large battery often increases
weight, making the device feel uncomfortable in the hands. Checking
how the weight varies for phones offering large batteries (more
than 4500 mAh).

The weight distribution of devices varies noticeably across brands, with none appearing
normally distributed and some showing outliers. Google has the heaviest devices, while
Micromax has the lightest.

 Screen

Q6: Bigger screens are desirable for entertainment purposes as
they offer a better viewing experience. How many phones and
tablets are available across different brands with a screen size
larger than 6 inches?

People who buy phones and tablets primarily for entertainment prefer large screens, as
they offer a better viewing experience. Huawei has the greatest number of devices with
screens larger than 6 inches, accounting for 13.6% of such devices. It is followed by
Samsung and other miscellaneous brands with shares of 10.8% and 9%, respectively.
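
Because screen_size is recorded in centimetres, the 6-inch threshold has to be converted before filtering (6 in × 2.54 cm/in = 15.24 cm). A minimal sketch of the count per brand, using hypothetical (brand, screen size) records:

```python
from collections import Counter

CM_PER_INCH = 2.54
threshold_cm = 6 * CM_PER_INCH  # 6 inches expressed in cm (about 15.24)

# Hypothetical (brand_name, screen_size in cm) records for illustration.
devices = [
    ("Honor", 14.50), ("Honor", 17.30), ("Honor", 16.69),
    ("Samsung", 25.50), ("Samsung", 15.32), ("Nokia", 12.70),
]

# Count, per brand, the devices whose screen exceeds the threshold.
large_by_brand = Counter(
    brand for brand, size_cm in devices if size_cm > threshold_cm
)
print(large_by_brand.most_common())
```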

 Camera

Q7: A lot of devices nowadays offer great selfie cameras, allowing
us to capture our favorite moments with loved ones. What is the
distribution of devices offering greater than 8MP selfie cameras
across brands?

Huawei leads in the number of devices with selfie cameras over 8 MP, holding 13.3% of
the market, followed by Vivo at 11.9% and Oppo at 11.5%.

 Rear Camera

Rear cameras typically have higher resolution than front cameras, with a threshold of
16 MP set for analysis. Sony leads in devices with main cameras over 16 MP, capturing
39.4% of the market, followed by Motorola at 11.7% and other brands at 9.6%.

 Price

Prices of used devices across years.

There appears to be a positive relationship between the release year of a device and
its normalized used price: devices released more recently tend to have a higher
normalized used price.

Prices for used phones and tablets offering 4G and 5G networks.

 It appears that devices with 4g availability have a higher normalized price than
devices that do not.
 It appears that devices with 5g availability have a higher normalized price than
devices that do not.
 Devices that possess 5g have a higher normalized price than devices that possess
4g.

Data Preprocessing:
 Missing Value Imputation

We will impute the missing values in the data by the column medians grouped by
release_year and brand_name.

There are 6 variables with missing values:

 main_camera_mp has 179 missing values


 selfie_camera_mp has 2 missing values
 int_memory has 4 missing values
 ram has 4 missing values
 battery has 6 missing values
 weight has 7 missing values

We will impute the remaining missing values in the data by the column medians
grouped by brand_name.

We will fill the remaining missing values in the main_camera_mp column by the column
median.

All missing values have been treated.
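
The three-step imputation described above could be sketched in pandas as follows. The rows are hypothetical and only one column is shown; the column names come from the data dictionary, and the use of `groupby(...).transform("median")` is an assumption about how the grouping is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the data dictionary.
df = pd.DataFrame({
    "brand_name": ["Honor", "Honor", "Honor", "Nokia", "Nokia", "Celkon"],
    "release_year": [2020, 2020, 2019, 2014, 2015, 2013],
    "main_camera_mp": [13.0, np.nan, 8.0, 5.0, np.nan, np.nan],
})
col = "main_camera_mp"

# Step 1: median within each (release_year, brand_name) group.
group_median = df.groupby(["release_year", "brand_name"])[col].transform("median")
df[col] = df[col].fillna(group_median)

# Step 2: median within each brand for values that are still missing.
brand_median = df.groupby("brand_name")[col].transform("median")
df[col] = df[col].fillna(brand_median)

# Step 3: overall column median for anything left.
df[col] = df[col].fillna(df[col].median())

print(df[col].tolist())
```

Each step only fills values the previous step could not, so more specific group medians take priority over broader ones.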

 Feature Engineering

 Let's create a new column years_since_release from the release_year column.


 We will consider the year of data collection, 2021, as the baseline.
 We will drop the release_year column.
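
A minimal sketch of this feature-engineering step, assuming a pandas DataFrame with a release_year column; the two rows and the constant name `DATA_COLLECTION_YEAR` are hypothetical.

```python
import pandas as pd

DATA_COLLECTION_YEAR = 2021  # baseline year stated in the text

# Hypothetical rows for illustration.
df = pd.DataFrame({"brand_name": ["Honor", "Nokia"], "release_year": [2020, 2014]})

# Age of the model at data collection time, then drop the original column.
df["years_since_release"] = DATA_COLLECTION_YEAR - df["release_year"]
df = df.drop(columns="release_year")
print(df)
```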

 Outlier Check

Let's check for outliers in the data.

 There are quite a few outliers in the data


 However, we will not treat them as they are proper values

 Data Preparation for modeling

 We want to predict the normalized price of used devices


 Before we proceed to build a model, we'll have to encode categorical features
 We'll split the data into train and test to be able to evaluate the model that we
build on the train data
 We will build a Linear Regression model using the train data and then check its
performance

Splitting the data in 70:30 ratio for train to test data

 Number of rows in train data = 2417


 Number of rows in test data = 1037
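
The row counts above (2,417 + 1,037 = 3,454) are consistent with a 70:30 split in which the test share is rounded up. The hand-rolled sketch below reproduces that arithmetic; it is not the project's actual code, the function name `split_indices` and the seed are hypothetical, and a library helper such as scikit-learn's `train_test_split` would normally be used instead.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.30, seed=1):
    """Shuffle row indices and split them into train and test lists."""
    # Assumed rounding convention: the test share is rounded up.
    n_test = math.ceil(n_rows * test_fraction)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(3454)
print(len(train_idx), len(test_idx))
```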

Model Building - Linear Regression
Printing x_train and y_train datatype:

 OLS Regression

o Adjusted R-squared: It reflects the fit of the model.
 Adjusted R-squared values generally range from 0 to 1, where a higher value
generally indicates a better fit, assuming certain conditions are met.
 In our case, the value for adj. R-squared is 0.845, which is good.

o const coefficient: It is the Y-intercept.
 It means that if all the predictor variables are zero, then the expected output
(i.e., Y) would be equal to the const coefficient.
 In our case, the value for const coefficient is 1.6815.

o Coefficient of a predictor variable: It represents the change in the output Y
due to a one-unit change in the predictor variable (everything else held constant).
 For example, the coefficient of normalized_new_price is 0.4146.
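
Written out, the interpretation above corresponds to the usual linear model form. The symbols are introduced here for illustration; only the two numeric values come from the text.

```latex
\begin{align*}
\hat{y} &= \beta_0 + \sum_{j} \beta_j x_j,
  \qquad \beta_0 = 1.6815,
  \quad \beta_{\text{normalized\_new\_price}} = 0.4146 \\
\Delta \hat{y} &= 0.4146 \times \Delta x_{\text{normalized\_new\_price}}
  \quad \text{(all other predictors held constant)}
\end{align*}
```

So a one-unit increase in normalized_new_price raises the predicted normalized used price by 0.4146, other predictors being equal.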

OLS Regression Result

 Model Performance Check

Let us check the performance of the model using different metrics.

 We will be using metric functions defined in sklearn for RMSE, MAE, and R².
 We will define a function to calculate MAPE and adjusted R².
 We will create a function which will print out all the above metrics in one go.
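
The metrics listed above can be written out in plain Python to show what each one computes. This is a sketch with hypothetical values, not the project's helper; the function name `regression_metrics` is illustrative, and in practice the sklearn metric functions mentioned above would supply RMSE, MAE, and R².

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return RMSE, MAE, R-squared, adjusted R-squared and MAPE (in percent)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)      # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R-squared": r2,
        # Penalises R-squared for the number of predictors used.
        "Adj. R-squared": 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1),
        "MAPE": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Hypothetical actual and predicted values for illustration.
metrics = regression_metrics([4.0, 5.0, 6.0, 5.0], [4.2, 4.8, 6.1, 5.1], n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```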

Linear Regression Assumptions
We will be checking the following Linear Regression assumptions:

 No Multicollinearity
 Linearity of variables
 Independence of error terms
 Normality of error terms
 No Heteroscedasticity

 TEST FOR MULTICOLLINEARITY

 We will test for multicollinearity using VIF.


 General Rule of thumb:
 If VIF is 1 then there is no correlation between the k-th predictor and the
remaining predictor variables.
 If VIF exceeds 5 or is close to exceeding 5, we say there is moderate
multicollinearity.
 If VIF is 10 or exceeding 10, it shows signs of high multicollinearity.
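
For reference, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The NumPy sketch below follows this textbook definition on synthetic data; it is an illustration rather than the project's function, and library implementations may treat the intercept differently.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        ss_res = resid @ resid
        ss_tot = (y - y.mean()) @ (y - y.mean())
        r2 = 1 - ss_res / ss_tot
        result.append(1 / (1 - r2))
    return result


# Synthetic predictors: c is nearly a copy of a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

The near-duplicate pair (a, c) should show a VIF well above 10, while the unrelated column stays close to 1.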

Let's define a function to check VIF and remove multicollinearity:

Dropping os_iOS would have the least impact on the predictive power of the model
(amongst the variables being considered). We'll drop os_iOS and check the VIF again.

VIF after dropping os_iOS:

Dropping the brand_name_Others:

Dropping brand_name_Others would have the least impact on the predictive power of
the model (amongst the variables being considered). We'll drop brand_name_Others and
check the VIF again.

VIF after dropping brand_name_Others:

Dropping the years_since_release:

Dropping years_since_release would have the least impact on the predictive power of
the model (amongst the variables being considered). We'll drop years_since_release and
check the VIF again.

VIF after dropping years_since_release:

Dropping the weight:

Dropping weight would have the least impact on the predictive power of the model
(amongst the variables being considered). We'll drop weight and check the VIF again.

VIF after dropping weight:

There are no more predictors that have multicollinearity and the assumption is
satisfied.

Dropping high p-value variables (if needed):

 We will drop the predictor variables having a p-value greater than 0.05 as they do
not significantly impact the target variable.
 But sometimes p-values change after dropping a variable. So, we'll not drop all
variables at once.
 Instead, we will do the following:
 Build a model, check the p-values of the variables, and drop the column
with the highest p-value.
 Create a new model without the dropped feature, check the p-values of
the variables, and drop the column with the highest p-value.
 Repeat the above two steps till there are no columns with p-value > 0.05.

The above process can also be done manually by picking one variable at a time that
has a high p-value, dropping it, and building a model again. But that might be a little
tedious and using a loop will be more efficient.

Checking the p-values on the right dataset:

OLS Regression for updated dataset (no multicollinearity and no insignificant
predictors)

Observation:

The final model, olsmod2, includes predictor variables from x_train6, with no p-values
exceeding 0.05. It has an adjusted R-squared of 0.841, explaining ~84% of the variance.
Compared to olsmod1 (adjusted R-squared of 0.845), the dropped variables had
minimal impact. Comparable RMSE and MAE values for train and test sets confirm the
model is not overfitting.

Training Performance

Test Performance

 TEST FOR LINEARITY AND INDEPENDENCE

 We will test for linearity and independence by making a plot of fitted values vs
residuals and checking for patterns.
 If there is no pattern, then we say the model is linear and residuals are
independent.
 Otherwise, the model is showing signs of non-linearity and residuals are not
independent.

Creating a data frame with actual, fitted, and residual values and plotting the fitted
values vs residuals:

The scatter plot of residuals versus fitted values illustrates the distribution of errors. If a
pattern exists in the plot, it suggests non-linearity in the data, meaning the model does
not account for non-linear effects. Since no pattern is observed, the assumptions of
linearity and independence are met.

 TEST FOR NORMALITY

 We will test for normality by checking the distribution of residuals, by checking
the Q-Q plot of residuals, and by using the Shapiro-Wilk test.
 If the residuals follow a normal distribution, the points on the Q-Q plot will fall
along a straight line; otherwise they will not.
 If the p-value of the Shapiro-Wilk test is greater than 0.05, we can say the
residuals are normally distributed.
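
A Shapiro-Wilk check of this kind might be run with `scipy.stats.shapiro`, as in the sketch below. The residuals here are synthetic and deliberately right-skewed, so they stand in for a case where the test rejects normality; they are not the model's residuals.

```python
import numpy as np
from scipy import stats

# Synthetic, deliberately right-skewed "residuals" (illustration only).
rng = np.random.default_rng(0)
residuals = rng.exponential(size=500) - 1.0

statistic, p_value = stats.shapiro(residuals)
looks_normal = p_value > 0.05  # decision rule stated in the text

print(statistic, p_value, looks_normal)
```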

The histogram of residuals does have a bell shape.

Checking Q-Q plot:

The residuals follow a straight line except for the tails.

Shapiro Result (statistic=0.9748184084892273, pvalue=3.1171865175534697e-20)

 Since p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
 Strictly speaking, the residuals are not normal.
 However, as an approximation, we can accept this distribution as close to being
normal.
 So, the assumption is satisfied.

 TEST FOR HOMOSCEDASTICITY:

 We will test for homoscedasticity by using the Goldfeld-Quandt test.
 If we get a p-value greater than 0.05, we can say that the residuals are
homoscedastic. Otherwise, they are heteroscedastic.

Goldfeld-Quandt Test: [('F statistic', 1.0643431899824787), ('p-value', 0.141633165194831)]

Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this
assumption is satisfied.

Prediction on Test data set:

 We can observe here that our model has returned pretty good prediction results,
and the actual and predicted values are comparable.
 We can also visualize the comparison as a bar graph.

 Final Model Summary

Training Performance:

Test Performance:

 The model explains approximately 84% of the variation in the data.
 Train and test RMSE and MAE are low and comparable, indicating no overfitting.
 The MAPE on the test set suggests predictions are within 4.6% of the normalized
used device prices.
 The final model, olsmodel_final, is suitable for both prediction and inference.

 Actionable Insights and Recommendations

Insights:

o New vs. Used Price Relationship: The price of used devices strongly correlates
with their new counterparts. Higher-priced new devices lead to higher resale
value, making them key targets for refurbishment.
o Key Features Driving Value: Large screen sizes, higher RAM, better front and rear
cameras, and 4G connectivity significantly boost resale prices. Devices with these
specifications from specific brands are especially profitable.
o Impact of Brand and Features: While certain brands like Samsung and some main
camera configurations negatively affect resale prices, others contribute positively.

Recommendations:

 Prioritize refurbishing devices with large screens, high RAM, and excellent camera
specifications, as these are in high demand.
 Focus on newer models with high initial market prices to maximize revenue
potential.
 Expand to selling other used gadgets, such as smartwatches, to diversify offerings
and attract more customers.
 Collect and analyse customer demographics to refine product selection and cater
to different market segments.
 Retailers should avoid overstocking models or brands that do not retain value
well, like some Samsung models with specific camera configurations.
