Deep Sharan (REPORT)
Deep Sharan (REPORT)
DATA SCIENCE
1
ACKNOWLEDGEMENT
I would like to acknowledge the contributions of the following people without whose help and
guidance this report would not have been completed.
I acknowledge the counsel and support of Dr. Sanjay Kumar Singh, Professor, IT
Department, with respect and gratitude, whose expertise, guidance, support, encouragement,
and enthusiasm has made this report possible. Their feedback vastly improved the quality of this
report and provided an enthralling experience. I am indeed proud and fortunate to be supported
by him. I am also thankful to Prof. (Dr.) Amit Sinha, H.O.D of Information Technology
Department, and Dr. Kanika Gupta, A.H.O.D of Information Technology Department for
his constant encouragement, valuable suggestions and moral support and blessings.
Although it is not possible to name individually, I shall ever remain indebted to the faculty
members of ABES Engineering College, Ghaziabad for their persistent support and cooperation
extended during this work.
This acknowledgement will remain incomplete if I fail to express our deep sense of obligation to
my parents and God for their consistent blessings and encouragement.
2
CONTENTS
Page no.
Acknowledgement
Student’s Declaration
Regression
Dataset
Project
3
Regression
In the context of machine learning and data science, regression specifically refers
to the estimation of a continuous dependent variable or response from a list of
input variables, or features. There are a variety of regression techniques, ranging
from the simplest (linear regression), to complicated statistical classic regression
models (Lasso, Elastic Net, etc.), to more complex techniques including gradient
boosting and neural networks.
4
Why is Regression Important?
5
Data Science
The regression method of forecasting, as the name implies, is used for
forecasting and for finding the casual relationship between variables.
From a business point of view, the regression method of forecasting
can be helpful for an individual working with data in the following
ways:
6
Meaning of Regression
Our first job is to collect the details of the GRE scores and CGPAs of all
the students of a college in a tabular form. The GRE scores and the
CGPAs are listed in the 1st and 2nd columns, respectively.
Here, we can see a linear relationship between CGPA and GRE score in
the scatter plot. This indicates that if the CGPA increases, the GRE
scores also increase. Thus, it would also mean that a student with a
high CGPA is likely to have a greater chance of getting a high GRE
score.
7
In a regression algorithm, we usually have one dependent variable and
one or more than one independent variable where we try to regress
the dependent variable "Y" (in this case, GRE score) using the
independent variable "X" (in this case, CGPA). In layman's terms, we
are trying to understand how the value of "Y" changes concerning the
change in "X".
8
Dependent and Independent variables
9
variable,” “responding variable,” “explained variable,” “outcome
variable,” “experimental variable,” or “output variable.”
10
Regression lines are mainly used for forecasting procedures. The
significance of the line is that it describes the interrelation of a
dependent variable “Y” with one or more independent variables “X”. It
is used to minimize the squared deviations of predictions.
Regression line of Y on X: This gives the most probable Y values from the
given values of X.
Regression line of X on Y: This gives the most probable values of X from
the given values of Y.
If the two regression lines coincide, i.e. only a single line exists,
correlation tends to be either perfect positive or perfect negative.
However, if the variables are independent, then the correlation is zero,
and the lines of regression will be at right angles.
Regression lines are widely used in the financial sector and business
procedures. Financial Analysts use linear regression techniques to
predict prices of stocks, commodities and perform valuations, whereas
businesses employ regressions for forecasting sales, inventories, and
many other variables essential for business strategy and planning.
11
What is the Regression Equation?
Let us consider one regression line, say Y on X and another line, say X
on Y, then there will be one regression equation for each regression
line:
Ye = a + bX
Where,
12
The parameter “a” indicates the distance of a line above or below the
origin, i.e. the level of the fitted line, whereas parameter "b" indicates
the change in the value of the independent variable Y for one unit of
change in the dependent variable X.
The parameters "a" and "b" can be calculated using the least square
method. According to this method, the line needs to be drawn to
connect all the plotted points. In mathematical terms, the sum of the
squares of the vertical deviations of observed Y from the calculated
values of Y is the least. In other words, the best-fitted line is obtained
when ∑ (Y-Ye)2 is the minimum.
∑ Y = Na + b ∑ X
Xe = a + bY
Where,
13
a and b are the two unknown constants that determine the position of the
line.
∑ X = Na + b ∑ Y
14
When Should I Use Regression Analysis?
15
Do education and IQ affect earnings?
Impact of exercise habits and diet affect weight.
Do drinking coffee and smoking cigarettes reduce the mortality rate?
Does a particular exercise have an impact on bone density?
These research questions create a huge amount of data that entwines
numerous independent and dependent variables and question their
influence on each other. It is an important task to untangle this web of
related variables and find out which variables are statistically essential
and the role of each of these variables. To answer all these questions
and rescue us in this game of variables, we need to take the help of
regression analysis for all the scenarios.
16
Now, let us understand how regression can help control the other
variables in the process.
This model isolates the role of each variable while holding the other
variables constant. You can examine the effect of coffee intake while
controlling the smoking factor. On the other hand, you can also look at
smoking while controlling for coffee intake.
17
PROJECT BASED LEARNING AND IMPLEMENTATION
DATASET: BLACK FRIDAY
Description:
Black Friday is a colloquial term for the Friday after Thanksgiving in the
United States. It traditionally marks the start of the Christmas shopping
season in the United States. Many stores offer highly promoted sales at
discounted prices and often open early, sometimes as early as midnight
or even on Thanksgiving.
Black Friday marks the beginning of the Christmas shopping festival
across the US. On Black Friday big shopping giants like Amazon,
Flipkart, etc. lure customers by offering discounts and deals on
different product categories. The product categories range from
electronic items, Clothing, kitchen appliances, Décor. Research has
been carried out to predict sales by various researchers. The
analysis of this data serves as a basis to provide discounts on various
product items. With the purpose of analyzing and predicting the sales,
we have used three models. The dataset Black Friday Sales Dataset
available on Kaggle has been used for analysis and prediction
purposes. The models used for prediction are linear regression, lasso
18
regression, ridge regression, Decision Tree Regressor, and Random
Forest Regressor. Mean Squared Error (MSE) is used as a performance
evaluation measure. Random Forest Regressor outperforms the other
models with the least MSE score
CODE:-
import numpy as np
import pandas as pd
import seaborn as sns
In [2]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
In [3]:
train_data.head() #head gives data of first 5 rows
Out[3]:
Use Prod Ge A Occu City_ Stay_In_Cur Marit Product_ Product_ Product_ Pur
r_I uct_I nd g patio Categ rent_City_Ye al_Sta Category Category Category cha
D D er e n ory ars tus _1 _2 _3 se
0
100 P000
- 837
0 000 6904 F 10 A 2 0 3 NaN NaN
1 0
1 2
7
0
100 P002
- 152
1 000 4894 F 10 A 2 0 1 6.0 14.0
1 00
1 2
7
19
Use Prod Ge A Occu City_ Stay_In_Cur Marit Product_ Product_ Product_ Pur
r_I uct_I nd g patio Categ rent_City_Ye al_Sta Category Category Category cha
D D er e n ory ars tus _1 _2 _3 se
0
100 P000
- 105
3 000 8544 F 10 A 2 0 12 14.0 NaN
1 7
1 2
7
100 P002 5
796
4 000 8544 M 5 16 C 4+ 0 8 NaN NaN
9
2 2 +
In [4]:
test_data.head()
Out[4]:
Use Prod Ge A Occu City_C Stay_In_Curr Marita Product_ Product_ Product_
r_I uct_I nde g patio ategor ent_City_Year l_Statu Category_ Category_ Category_
D D r e n y s s 1 2 3
4
100
P0012 6-
0 000 M 7 B 2 1 1 11.0 NaN
8942 5
4
0
2
100
P0011 6-
1 000 M 17 C 0 0 3 5.0 NaN
3442 3
9
5
3
100
P0028 6-
2 001 F 1 B 4+ 1 5 14.0 NaN
8442 4
0
5
3
100
P0014 6-
3 001 F 1 B 4+ 1 4 9.0 NaN
5342 4
0
5
20
Use Prod Ge A Occu City_C Stay_In_Curr Marita Product_ Product_ Product_
r_I uct_I nde g patio ategor ent_City_Year l_Statu Category_ Category_ Category_
D D r e n y s s 1 2 3
In [5]:
train_data.info() #gives data of the totals no. of columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category_1 550068 non-null int64
9 Product_Category_2 376430 non-null float64
10 Product_Category_3 166821 non-null float64
11 Purchase 550068 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB
In [6]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 233599 non-null int64
1 Product_ID 233599 non-null object
2 Gender 233599 non-null object
3 Age 233599 non-null object
4 Occupation 233599 non-null int64
5 City_Category 233599 non-null object
6 Stay_In_Current_City_Years 233599 non-null object
7 Marital_Status 233599 non-null int64
8 Product_Category_1 233599 non-null int64
9 Product_Category_2 161255 non-null float64
10 Product_Category_3 71037 non-null float64
dtypes: float64(2), int64(4), object(5)
memory usage: 19.6+ MB
In [7]:
# Dropping columns with null values in train and test data
21
train_data.drop(['Product_Category_2','Product_Category_3','User_ID','Product_
ID'],axis=1,inplace=True)
test_data.drop(['Product_Category_2','Product_Category_3','User_ID','Product_I
D'],axis=1,inplace=True)
In [8]:
train_data.head(10) #to ckech the data of first particular rows instert the
values in bracets
Out[8]:
Gend Ag Occupati City_Categ Stay_In_Current_City_ Marital_Sta Product_Catego Purcha
er e on ory Years tus ry_1 se
0-
0 F 10 A 2 0 3 8370
17
0-
1 F 10 A 2 0 1 15200
17
0-
2 F 10 A 2 0 12 1422
17
0-
3 F 10 A 2 0 12 1057
17
55
4 M 16 C 4+ 0 8 7969
+
26-
5 M 15 A 3 0 1 15227
35
46-
6 M 7 B 2 1 1 19215
50
46-
7 M 7 B 2 1 1 15854
50
46-
8 M 7 B 2 1 1 15686
50
26-
9 M 20 A 1 1 8 7871
35
In [9]:
test_data.head(10)
22
Out[9]:
Gende Ag Occupatio City_Categor Stay_In_Current_City_Ye Marital_Stat Product_Category
r e n y ars us _1
46-
0 M 7 B 2 1 1
50
26-
1 M 17 C 0 0 3
35
36-
2 F 1 B 4+ 1 5
45
36-
3 F 1 B 4+ 1 4
45
26-
4 F 1 C 1 0 4
35
46-
5 M 1 C 3 1 2
50
46-
6 M 1 C 3 1 1
50
46-
7 M 1 C 3 1 2
50
26-
8 M 7 A 1 0 10
35
18-
9 M 15 A 4+ 0 5
25
In [10]:
sns.scatterplot(data=train_data,x='Occupation',y='Purchase')
Out[10]:
<AxesSubplot:xlabel='Occupation', ylabel='Purchase'>
23
In [11]:
sns.scatterplot(data=test_data,x='Occupation',y='Gender')
Out[11]:
<AxesSubplot:xlabel='Occupation', ylabel='Gender'>
In [12]:
sns.barplot(data=train_data,x='Occupation',y='Purchase')
Out[12]:
24
<AxesSubplot:xlabel='Occupation', ylabel='Purchase'>
In [13]:
sns.barplot(data=test_data,y='Occupation',x='Gender')
Out[13]:
<AxesSubplot:xlabel='Gender', ylabel='Occupation'>
In [14]:
sns.barplot(data=train_data,x='Marital_Status',y='Purchase')
25
Out[14]:
<AxesSubplot:xlabel='Marital_Status', ylabel='Purchase'>
In [15]:
sns.barplot(data=train_data,x='City_Category',y='Purchase')
Out[15]:
<AxesSubplot:xlabel='City_Category', ylabel='Purchase'>
In [16]:
sns.barplot(data=train_data,x='Age',y='Purchase')
26
Out[16]:
<AxesSubplot:xlabel='Age', ylabel='Purchase'>
In [17]:
# temporarily concatenating the train and test dataframes
df = pd.concat([train_data.assign(ind="train"),
test_data.assign(ind="test")],ignore_index=True)
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783667 entries, 0 to 783666
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 783667 non-null object
1 Age 783667 non-null object
2 Occupation 783667 non-null int64
3 City_Category 783667 non-null object
4 Stay_In_Current_City_Years 783667 non-null object
5 Marital_Status 783667 non-null int64
6 Product_Category_1 783667 non-null int64
7 Purchase 550068 non-null float64
8 ind 783667 non-null object
dtypes: float64(1), int64(3), object(5)
memory usage: 53.8+ MB
In [19]:
# One hot encoding multiple categorical columns in combined dataset
complete_dataset = pd.get_dummies(data=df,drop_first=True, columns=['Gender',
'Age','City_Category','Stay_In_Current_City_Years'])
27
In [20]:
# Splitting the above dataset into train and test dataset
train_data2, test_data2 =
complete_dataset[complete_dataset["ind"].eq("train")].copy(), \
complete_dataset[complete_dataset["ind"].eq("test")].copy().reset_index(drop=T
rue)
# Removing the unwanted column ind used for marking train and test data
train_data2.drop('ind',axis=1,inplace=True)
test_data2.drop('ind',axis=1,inplace=True)
In [21]:
# Splitting the training dataset into X_train, y_train
X_train, y_train = train_data2.drop("Purchase",axis=1),
train_data2["Purchase"]
X_test = test_data2
y_train.head()
Out[21]:
0 8370.0
1 15200.0
2 1422.0
3 1057.0
4 7969.0
Name: Purchase, dtype: float64
In [22]:
# Using feature selection to select the best features for regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
2 Product_Category_1 73684.939778
1
City_Category_C 2055.228480
1
28
Specs Score
3 Gender_M 2010.442472
0 Occupation 238.831156
1
City_Category_B 200.691083
0
8 Age_51-55 120.383197
4 Age_18-25 42.903923
6 Age_36-45 24.746972
1
Stay_In_Current_City_Years_2 15.790576
3
7 Age_46-50 6.050446
9 Age_55+ 4.637896
1
Stay_In_Current_City_Years_3 2.402775
4
In [23]:
X_train =
X_train[["Product_Category_1","City_Category_C","Gender_M","Occupation","City_
Category_B"]]
In [24]:
pip install LightGBM
Requirement already satisfied: LightGBM in c:\users\acer\appdata\local\
programs\python\python310\lib\site-packages (3.3.2)
Requirement already satisfied: scikit-learn!=0.22.0 in c:\users\acer\appdata\
local\programs\python\python310\lib\site-packages (from LightGBM) (1.1.1)
Requirement already satisfied: scipy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (1.8.1)
Requirement already satisfied: numpy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (1.22.4)
Requirement already satisfied: wheel in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from LightGBM) (0.37.1)
29
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\acer\appdata\
local\programs\python\python310\lib\site-packages (from scikit-learn!=0.22.0-
>LightGBM) (3.1.0)
Requirement already satisfied: joblib>=1.0.0 in c:\users\acer\appdata\local\
programs\python\python310\lib\site-packages (from scikit-learn!=0.22.0-
>LightGBM) (1.1.0)
Note: you may need to restart the kernel to use updated packages.
WARNING: There was an error checking the latest version of pip.
In [25]:
pip install XGBoost
Requirement already satisfied: XGBoost in c:\users\acer\appdata\local\
programs\python\python310\lib\site-packages (1.6.1)
Requirement already satisfied: numpy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from XGBoost) (1.22.4)
Requirement already satisfied: scipy in c:\users\acer\appdata\local\programs\
python\python310\lib\site-packages (from XGBoost) (1.8.1)
Note: you may need to restart the kernel to use updated packages.
WARNING: There was an error checking the latest version of pip.
In [26]:
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
In [27]:
regressor = LGBMRegressor()
xgb_regressor = XGBRegressor()
In [28]:
regressor.fit(X_train,y_train)
xgb_regressor.fit(X_train,y_train)
Out[28]:
XGBRegressor
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...)
In [29]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, cross_validate
30
In [30]:
print(cv_results["test_score"].mean())
0.6364900199920567
In [31]:
X_test =
test_data2[["Product_Category_1","City_Category_C","Gender_M","Occupation","Ci
ty_Category_B"]]
In [32]:
y_pred = regressor.predict(X_test)
In [33]:
y_pred
Out[33]:
array([13804.62771151, 10207.34592312, 6158.6480155 , ...,
13571.47548186, 19648.31996153, 2450.37255679])
In [34]:
test_data2["Purchase"] = y_pred
In [35]:
test_data2.head()
Out[35]:
A A A A A
g g g g g A
Cit Cit
O Ma Pro G e e e e e g Stay_I Stay_I Stay_I Stay_I
Pu y_ y_
cc rit duct en _ _ _ _ _ e n_Cur n_Cur n_Cur n_Curr
rc Cat Cat
up al_ _Ca de 1 2 3 4 5 _ rent_C rent_C rent_C ent_Cit
ha ego ego
ati Sta tego r_ 8 6 6 6 1 5 ity_Ye ity_Ye ity_Ye y_Year
se ry_ ry_
on tus ry_1 M - - - - - 5 ars_1 ars_2 ars_3 s_4+
B C
2 3 4 5 5 +
5 5 5 0 5
13
80
4.6
0 7 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0
27
71
2
10
20
7.3
1 17 0 3 1 0 1 0 0 0 0 0 1 0 0 0 0
45
92
3
2 1 1 5 61 0 0 0 1 0 0 0 1 0 0 0 0 1
58.
64
80
31
A A A A A
g g g g g A
Cit Cit
O Ma Pro G e e e e e g Stay_I Stay_I Stay_I Stay_I
Pu y_ y_
cc rit duct en _ _ _ _ _ e n_Cur n_Cur n_Cur n_Curr
rc Cat Cat
up al_ _Ca de 1 2 3 4 5 _ rent_C rent_C rent_C ent_Cit
ha ego ego
ati Sta tego r_ 8 6 6 6 1 5 ity_Ye ity_Ye ity_Ye y_Year
se ry_ ry_
on tus ry_1 M - - - - - 5 ars_1 ars_2 ars_3 s_4+
B C
2 3 4 5 5 +
5 5 5 0 5
15
23
80.
3 1 1 4 85 0 0 0 1 0 0 0 1 0 0 0 0 1
92
56
24
48.
4 1 0 4 48 0 0 1 0 0 0 0 0 1 1 0 0 0
94
56
In [36]:
import math
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
In [37]:
from sklearn.datasets import make_blobs
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
import seaborn as sns
In [40]:
# Generate sample data
X, y = make_circles(n_samples=1000, factor=0.009, noise=0.15)
#-------------------------------------------------------------
ads_arr = StandardScaler().fit_transform(X)
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=y)
Out[40]:
<AxesSubplot:>
32
In [41]:
#distance calculation UDF
def minkowski_(point_a,point_b,p=2):
if p==1:
print('----> Manhattan')
dist = np.sum(abs(point_a-point_b))
print('Manual Distance :',dist)
elif p==2:
#print('----> Euclidean')
dist = np.sqrt(np.sum(np.square(point_a-point_b)))
#print('Manual Distance :',dist)
return dist
#------------------------------------------------------------------
#UDF for Calculation of distance from a point to every ther point
def distance_to_all(curr_vec,data,p_=2):
return distance_list
In [49]:
def core_border_noise_mapping(data=ads_arr,min_points=10,epsilon=0.38):
33
#Initializing trays for collecting labels & dictionaries for key-value
combinations
core = []
interim = []
density_reachable = []
idx_dict = {} #Mapping of each point to its directly density reachable
points (within epsilon)
nmin_dict= {} #Count of total density reachable points within epsilon for
each point
#----------------------------------------------------------------------------
-------------
for idx in range(len(data)): #For each point of data
current_arr = data[idx]
current_to_all =
np.array(distance_to_all(curr_vec=current_arr,data=data,p_=2)) #Calculating
distance
#----------------------------------------------------------------------------
-------------
idx_dict_updated = {} #Copy of the mapping dict with removal of self
distance (a point has 0 distance with itself)
for key,val in idx_dict.items():
val_ = val[val!=key]
idx_dict_updated.update({key : val_})
#----------------------------------------------------------------------------
-------------
#Classifying between Core and non-core (interim) points through nmin
parameter
for (key,value) in nmin_dict.items():
if value>=min_points:
core.append(key)
elif value<=min_points:
34
interim.append(key)
#----------------------------------------------------
print('Total core points :',len(core))
print('Total interim points :',len(interim))
#----------------------------------------------------
#Calculating the directly density reachable points (All points which are
within eps of any point)
for key_ in idx_dict_updated.keys():
val = list(idx_dict_updated[key_])
density_reachable += val
#----------------------------------------------------
noise = []
border = []
#Classifying between border and noisy points
for idx in interim:
if idx in density_reachable:
border.append(idx)
elif idx not in density_reachable:
noise.append(idx)
return core,border,noise,idx_dict_updated
In [50]:
#!pip install kneed #If not already installed
from sklearn.neighbors import NearestNeighbors #Nearest Neighbor Calculator
#from kneed import KneeLocator
nbrs =
NearestNeighbors(n_neighbors=6,algorithm='auto',metric='minkowski',p=2,n_jobs=
-1).fit(ads_arr)
distances, indices = nbrs.kneighbors(ads_arr)
#----------------------------------------------------------------------------
-------------------------
distances = np.sort(distances[:,-1], axis=0)
i = np.arange(len(distances))
#knee = KneeLocator(i, distances, S=1, curve='convex',
direction='increasing', interp_method='polynomial')
35
#----------------------------------------------------------------------------
-------------------------
sns.set_style('darkgrid')
ax = sns.lineplot(x=range(0,len(ads_arr)),y=distances,color='g')
ax.set(xlabel='No of Points',ylabel='Distance of kth neighbor',title='Elbow
Curve of kth distance-vs-point')
#knee.plot_knee()
Out[50]:
[Text(0.5, 0, 'No of Points'),
Text(0, 0.5, 'Distance of kth neighbor'),
Text(0.5, 1.0, 'Elbow Curve of kth distance-vs-point')]
In [51]:
epsilon = 0.31 #Calculate through Nearest Neighbours Distance graph
min_points = 5
core,border,noise,idx_dict_updated =
core_border_noise_mapping(data=ads_arr,min_points=min_points,epsilon=epsilon)
Total core points : 963
Total interim points : 37
Total density reachable points : 999
Total noisy points : 1
Total border points : 36
In [56]:
def expand_clusters(point, neighbors_, border_, core_, idx_dict_updated_):
36
#print('Length of neighbor at START :',len(neighbors_)) ##
i = 0 #Initializing
nextPoint = neighbors_[i]
counter_assign.append(nextPoint)
#print('Next point else :',nextPoint)
nextNeighbors = list(idx_dict_updated_[nextPoint])
#return clusters_
In [57]:
clusters = np.array([np.nan]*len(ads_arr)) #Initializing clusters with
required size and nan value
print('Cluster length :',clusters.shape)
37
if idx not in counter_assign: #Checking if that point is not assigned
already
expand_clusters(point=idx,
neighbors_=neighbors,
border_=border,
core_=core,
idx_dict_updated_=idx_dict_updated) #Greedy
search algo to allot cluster till the edge
38
In [59]:
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=clusters)
Out[59]:
<AxesSubplot:>
In [60]:
from sklearn.cluster import DBSCAN
#----------------------------------------------------------------------------
---------
dbscan = DBSCAN(eps=epsilon,min_samples=min_points,metric='euclidean',n_jobs=-
1)
dbscan.fit(ads_arr)
dbscan_labels = dbscan.labels_
In [61]:
sns.scatterplot(x=ads_arr[:,0],y=ads_arr[:,1],hue=dbscan_labels)
Out[61]:
<AxesSubplot:>
39
40
CONCLUSION
With traditional methods not being of much help to business growth in terms of
revenue, the use of Machine learning approaches proves to be an important
point for the shaping of the business plan taking into consideration the
shopping pattern of consumers.
Projection of sales concerning several factors including the sale of last year
helps businesses take on suitable strategies for increasing the sales of goods
that are in demand. Thus the dataset is used for the experimentation, Black
Friday Sales Dataset from Kaggle [9].
The models used are Linear Regression, Lasso Regression, Ridge Regression,
Decision Tree Regressor, and Random Forest Regressor. The evaluation measure
used is Mean Squared Error (MSE).
Based on Table II Random Forest Regressor is best suitable for the prediction of
sales based on a given dataset.
Thus the proposed model will predict the customer purchase on Black Friday
and give the retailer insight into customer choice of products. This will result in
a discount based on customer-centric choices thus increasing the profit to the
retailer as well as the customer.
41
Thank You
42