SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC | 12B Status by UGC | Approved by AICTE
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600 119
March – 2022
SATHYABAMA INSTITUTE
OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
(Established under Section 3 of UGC Act, 1956)
Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai - 600119
www.sathyabamauniversity.ac.in
SCHOOL OF COMPUTING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Snigdha Saha
(38110546), Shivani Agarwal (38110526) carried out the project entitled “Target
Corporation Sales Prediction” under our supervision from __to__.
Internal Guide
Dr. Veena K.
Head of the Department
Dr. S.VIGNESHWARI, M.E., Ph.D.,
Dr. LAKSHMANAN L, M.E., Ph.D.
DECLARATION
DATE:
PLACE:
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
Abstract
List of Figures
1. INTRODUCTION
Outline of the Project
Proposed System
Advantages
Disadvantages
2. LITERATURE SURVEY
3. PROJECT DESIGN
Workflow Diagram
System Architecture
Block Diagram
4. PROJECT IMPLEMENTATION, ALGORITHMS AND METHODS USED
Introduction
Hardware Requirement
Software Requirement
Working Explanation
5. IMPLEMENTED SCREENSHOTS
6. SUMMARY & CONCLUSION
APPENDIX
A. SOURCE CODE
REFERENCES
LIST OF FIGURES
Figure No.  Title
2.1         Workflow Diagram
2.2         System Architecture Diagram
3.2.1       Numerical Features Graph
CHAPTER I
INTRODUCTION:
⮚ Outline of the project:
Day-to-day competition between shopping centres and big marts is becoming ever more
intense because of the rapid growth of global malls and online shopping. Each market
seeks to offer personalized, limited-time deals to attract customers depending on the
period of time, so that the sales volume of each item can be estimated for the
organization's stock control, transportation, and logistics services. Modern machine
learning algorithms are very advanced and provide methods for predicting or forecasting
sales for any kind of organization, which is extremely beneficial as a low-cost
approach to prediction. Better prediction is always helpful, both in developing and in
improving marketing strategies for the marketplace. In today's modern world, huge
shopping centres such as big malls and marts record data related to the sales of items
or products, together with their various dependent and independent factors, as an
important step towards predicting future demand and managing inventory. The dataset
built from these dependent and independent variables is a composite of item attributes,
data gathered from customers, and inventory-management data in a data warehouse. The
data is thereafter refined in order to obtain accurate predictions and to uncover new,
interesting insights that shed light on the task's data. These can then be used to
forecast future sales by employing machine learning algorithms such as random forests
and simple or multiple linear regression models.
● Proposed System:
In this paper, we address the problem of predicting or forecasting Target mart sales of
an item, based on customers' future demand in different Target mart stores across
various locations and products, using previous records. We started by making some
hypotheses about the data without looking at it. Then we moved on to data exploration,
where we found some nuances in the data that required remediation. Next, we performed
data cleaning and feature engineering, where we imputed missing values, resolved other
irregularities, made new features, and also made the data model-friendly. Finally, we
built regression, decision tree, and random forest models and got a glimpse of how to
tune them for better results.
● Advantages:
● Better prediction helps in developing and improving marketing strategies of the market.
● Spacing out projects that would otherwise preoccupy staff needed to support an
increase in demand.
● Disadvantages:
CHAPTER II
LITERATURE SURVEY:
A great deal of work has already been carried out in the territory of sales
forecasting. A concise review of the important work in the field of big-mart sales is
given in this section. Numerous statistical methodologies, for example regression,
ARIMA (Auto-Regressive Integrated Moving Average), and ARMA (Auto-Regressive Moving
Average), have been utilized to develop sales forecasting standards. However, sales
forecasting is a complex problem that is influenced by both external and internal
factors, and there are two significant drawbacks to the statistical techniques, as set
out by A. S. Weigend et al. A hybrid seasonal quantile regression and ARIMA
(Auto-Regressive Integrated Moving Average) approach to daily food sales forecasting
was recommended by N. S. Arunraj, who also found that the performance of the individual
model was relatively lower than that of the hybrid model.
“A Forecast for Big Mart Sales Based on Random Forests and Multiple Linear Regression”
(2018), by Kadam, H., Shevade, R., Ketkar, P., and Rajguru, used random forest and
linear regression for prediction analysis, which gives lower accuracy. To overcome
this, the XGBoost algorithm can be used, which gives higher accuracy and is more
efficient.
CHAPTER III
PROJECT DESIGN:
● Workflow Diagram (Figure 2.1)
● System Architecture (Figure 2.2)
CHAPTER IV
PROJECT IMPLEMENTATION, ALGORITHMS AND METHODS USED:
● Software Requirements:
⮚ Windows operating system (e.g., Windows 8 or 10).
⮚ Any Python Interpreter.
⮚ Machine learning libraries installed (NumPy, pandas, scikit-learn, XGBoost, Matplotlib, seaborn).
⮚ Target Mart Dataset.
● Working Explanation:
The aim is to build a predictive model and find out the sales of each product at
a particular store. Using this model, big marts like Target can try to understand
the properties of products and stores that play a key role in increasing sales.
So the idea is to find out the properties of a product and store that impact the
product's sales. We came up with certain hypotheses relating to the various factors
of the products and stores in order to solve the problem statement. We develop a
predictive model using different ML algorithms, such as linear regression,
polynomial regression, and ridge regression, for forecasting the sales of a
business such as Target Corporation.
• We perform some basic data exploration and come up with inferences about the data.
• Our model uses the Target Corporation sales dataset. After preprocessing and
filling in missing values, we use an ensemble of regressors: decision trees, linear
regression, ridge regression, polynomial regression, and XGBoost regression (a rough
comparison sketch follows this list).
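As a rough sketch only (assuming a preprocessed feature matrix X and target Y, as
prepared in the appendix), comparing these models might look like this:

# Minimal comparison sketch. X and Y are assumed to be the encoded
# features and the 'Item_Outlet_Sales' target built later in the appendix.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
for name, model in [('Linear', LinearRegression()),
                    ('Ridge', Ridge(alpha=0.009)),
                    ('XGBoost', XGBRegressor())]:
    model.fit(X_train, Y_train)
    print(name, 'R2:', r2_score(Y_test, model.predict(X_test)))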
● Prerequisites:
The dataset used is the Target (Big Mart) sales dataset, which contains item
attributes (weight, visibility, MRP, type) and outlet attributes (size, location,
type), along with each item's outlet sales. Regression models that contain more
than one predictor variable are called multiple regression models.
• Model Design:
This shows the architecture diagram of the proposed model, which focuses on applying
the different algorithms to the dataset. Here, we calculate the accuracy, MAE, MSE,
and RMSE, and finally conclude which algorithm yields the best result. The following
algorithms are used:
A. Linear Regression:
• Build a scatter plot and check for 1) a linear or non-linear pattern in the
data and 2) anomalous variance (outliers). Consider a transformation if the
pattern isn't linear. If there are outliers, consider removing them only when
there is a non-statistical justification.
• Fit the data to the least-squares line and confirm the model assumptions
using the residual plot (for the constant standard deviation assumption) and
the normal probability plot (for the normality assumption). A transformation
might be necessary if the assumptions do not appear to be met.
• If required, transform the data and construct a regression line by least
squares on the transformed data.
• If a transformation has been applied, return to step 1. If not, continue to
step 5.
• When a "good-fit" model is found, write the least-squares regression line
equation, including the standard errors of the estimates and the R-squared
value.
• A linear regression model has the form:
Y = β0 + β1x1 + β2x2 + … + βnxn
R-squared: the proportion of the total variance in Y (the dependent variable)
that is explained by the X's (the independent variables).
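As an illustrative sketch (assuming the sales data has already been split into
x_train, x_test, y_train, and y_test), fitting the least-squares line and checking
the residual plot might look like this:

# Sketch: fit a least-squares line and inspect the residuals.
# x_train, x_test, y_train, y_test are assumed from an earlier split.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lin = LinearRegression()
lin.fit(x_train, y_train)
y_pred = lin.predict(x_test)
print('R-squared:', r2_score(y_test, y_pred))

# Residual plot: a random scatter around zero supports the
# constant-standard-deviation assumption.
plt.scatter(y_pred, y_test - y_pred)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()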
B. Polynomial Regression:
• Polynomial regression is a regression algorithm that models the relationship
between the dependent variable (y) and the independent variable (x) as an
nth-degree polynomial. The equation for polynomial regression is given below:
y = b0 + b1x + b2x^2 + b3x^3 + … + bnx^n
• It is often referred to as a special case of multiple linear regression in
ML, since we add polynomial terms to the multiple linear regression equation
to convert it into polynomial regression and improve accuracy.
• The dataset used for training in polynomial regression is non-linear in
nature.
• It uses a linear regression model to fit complex, non-linear functions and
datasets.
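A minimal sketch of this idea, assuming an illustrative degree of 2 (an assumption,
not fixed by this report) and the same train/test split as above:

# Sketch: polynomial regression as a linear model on polynomial terms.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# degree=2 adds squared and interaction terms to the linear equation
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x_train, y_train)
print('R-squared:', poly_model.score(x_test, y_test))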
C. Ridge Regression:
Ridge regression is a model tuning method used to evaluate data that suffers
from multicollinearity. It performs the L2 regularization procedure. When
multicollinearity arises, the least-squares estimates are unbiased but their
variances are high, so the predicted values can be far removed from the actual
values. The cost function for ridge regression is:
min ( ||Y − Xθ||² + λ||θ||² )
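The minimizer of this cost function has the closed form θ = (XᵀX + λI)⁻¹XᵀY, which
a toy NumPy sketch (with made-up data, purely for illustration) can demonstrate:

# Sketch: ridge regression via its closed-form solution on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

lam = 0.009                       # regularization strength (lambda)
I = np.eye(X.shape[1])
theta = np.linalg.solve(X.T @ X + lam * I, X.T @ Y)
print(theta)                      # shrunk coefficient estimates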
D. XGBoost Regression:
"Extreme Gradient Boosting" is similar to, but much more efficient than, the
classic gradient boosting framework. It has both a linear model solver and a
tree learning algorithm, which makes XGBoost several times faster than earlier
gradient boosting implementations. It supports various objective functions,
including regression, classification, and ranking. Because XGBoost has very
high predictive power yet is comparatively slow to deploy, it is well suited
for competitions. It also has extra functionality for cross-validation and for
finding important variables.
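A small sketch of that extra functionality (assuming the X_train and Y_train split
used elsewhere in this report):

# Sketch: XGBoost with 5-fold cross-validation and feature importances.
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

xgb = XGBRegressor(n_estimators=100)
scores = cross_val_score(xgb, X_train, Y_train, cv=5, scoring='r2')
print('Mean CV R2:', scores.mean())

xgb.fit(X_train, Y_train)
print('Feature importances:', xgb.feature_importances_)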
● Methodology:
4. Feature Engineering – modifying existing variables and creating new ones
for analysis.
5. Implementing the model – making predictive models on the data.
1. Importing Libraries:
The first step in any Data Analysis step is importing necessary libraries.
2. Data Preprocessing:
Real-world data is often messy: incomplete, unstructured, inconsistent, redundant,
and sprinkled with wacky values. So, without deploying data preprocessing
techniques, it is almost impossible to gain insights from raw data. Data
preprocessing is the process of converting raw data into a suitable format for
extracting insights. It is the first and foremost step in the data science life
cycle. Data preprocessing makes sure that data is clean, organized, and ready to
feed to the machine learning model.
If a dataset containing missing values is loaded using pandas, the missing values
get replaced with NaN (Not a Number) values. These NaN values can be identified
using methods like isna() or isnull(), and they can be imputed using fillna().
This process is known as missing data imputation.
# Checking for missing values in each feature:
big_mart_data.isnull().sum()
Missing values in Numerical Features can be imputed using Mean and Median.
Mean is sensitive to outliers and median is immune to outliers. If you want to impute
the missing values with mean values, then outliers in numerical features need to be
addressed properly.
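For instance, a minimal imputation sketch (using the Item_Weight column, and
choosing the median purely for illustration):

# Sketch: mean vs. median imputation for a numerical feature.
mean_value = big_mart_data['Item_Weight'].mean()
median_value = big_mart_data['Item_Weight'].median()
# median is the safer choice when the column contains outliers
big_mart_data['Item_Weight'].fillna(median_value, inplace=True)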
It's time to do some analysis on each feature to understand the data and get
some insights.
6. Exploratory Data Analysis:
# Outlet_Establishment_Year column:
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.show()
# Item_Fat_Content column:
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.show()
# Item_Type column:
plt.figure(figsize=(30,6))
sns.countplot(x='Item_Type', data=big_mart_data)
plt.show()
There are many feature encoding techniques, such as one-hot encoding and label
encoding. In this report, the replace() function is used to encode categorical
data as numerical data, as sketched below.
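A short sketch of replace()-based encoding (the Outlet_Size mapping below is an
illustrative assumption, not the full encoding scheme):

# Sketch: map ordered category strings to integers with replace().
big_mart_data['Outlet_Size'].replace({'Small': 0, 'Medium': 1, 'High': 2},
                                     inplace=True)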
8. Correlation:
For feature importance and feature scaling, we need to split data into independent
and dependent features.
● X – Independent features used to train the model.
● y – Dependent feature or target label.
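With the sales data, this split might look like:

# Independent features X and dependent target label y.
X = big_mart_data.drop(columns='Item_Outlet_Sales')
y = big_mart_data['Item_Outlet_Sales']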
9. Feature Importance:
● Machine Learning Model performance depends on features that are used to train a
model. Feature importance describes which features are relevant to build a model.
● Feature Importance refers to the techniques that assign a score to input/label
features based on how useful they are at predicting a target variable. Feature
importance helps in Feature Selection.
We’ll be using ExtraTreesRegressor class for Feature Importance. This class implements a
meta estimator that fits a number of randomized decision trees on various samples of the
dataset and uses averaging to improve the predictive accuracy and control over-fitting.
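A minimal sketch of this step (assuming the X and y split above):

# Sketch: score feature relevance with ExtraTreesRegressor.
from sklearn.ensemble import ExtraTreesRegressor
import pandas as pd

et = ExtraTreesRegressor(random_state=0)
et.fit(X, y)
importances = pd.Series(et.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))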
Feature scaling is a technique used to normalize or standardize data. When the columns of a
dataset have very different ranges of values, scaling brings them to a common level.
StandardScaler is a class used to implement feature scaling; it standardizes each column to
zero mean and unit variance (MinMaxScaler would instead rescale values to the range (0, 1)).
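A one-line sketch of scaling (again assuming the X built above):

# Sketch: standardize each column to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # returns a NumPy array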
In this report, the regression algorithms described above are used to build a predictive
model for item outlet sales.
CHAPTER V
IMPLEMENTED SCREENSHOTS:
[Screenshot 1]
[Screenshot 2]
[Screenshot 3]
CHAPTER VI
SUMMARY & CONCLUSION:
In this work, the effectiveness of various algorithms on past sales data was studied,
and the best-performing algorithm was reviewed. We propose software that uses a
regression approach for predicting sales based on sales data from the past, with which
the prediction accuracy of linear regression, polynomial regression, ridge regression,
and XGBoost regression can be determined and compared. We can conclude that ridge and
XGBoost regression give better predictions than the linear and polynomial regression
approaches. In the future, forecasting sales and building a sales plan can help to
avoid unforeseen cash-flow problems and to manage production, staffing, and financing
needs more effectively. In future work, we can also consider the ARIMA model, which
models the sales as a time series.
APPENDIX
SOURCE CODE:
# Importing Libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn import metrics
from xgboost import XGBRegressor
%matplotlib inline

big_mart_data = pd.read_csv('Train.csv')
big_mart_data.head()
# number of data points & number of features
big_mart_data.shape
# getting some information about the dataset
big_mart_data.info()
#Categorical Features:
#Item_Identifier
#Item_Fat_Content
#Item_Type
#Outlet_Identifier
#Outlet_Size
#Outlet_Location_Type
#Outlet_Type
# checking for missing values
big_mart_data.isnull().sum()
# mean value of "Item_Weight" column
big_mart_data['Item_Weight'].mean()
# filling the missing values in "Item_Weight" column with the mean value
big_mart_data['Item_Weight'].fillna(big_mart_data['Item_Weight'].mean(), inplace=True)
# mode of "Outlet_Size" column
big_mart_data['Outlet_Size'].mode()
# filling the missing values in "Outlet_Size" column with Mode
mode_of_Outlet_size = big_mart_data.pivot_table(
    values='Outlet_Size', columns='Outlet_Type',
    aggfunc=lambda x: x.mode()[0])
miss_values = big_mart_data['Outlet_Size'].isnull()
print(miss_values)
big_mart_data.loc[miss_values, 'Outlet_Size'] = big_mart_data.loc[
    miss_values, 'Outlet_Type'].apply(lambda x: mode_of_Outlet_size[x])
big_mart_data.describe()
sns.set()
# Item_Weight distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Weight'])
plt.show()
# Item Visibility distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Visibility'])
plt.show()
# Item MRP distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_MRP'])
plt.show()
# Item_Outlet_Sales distribution
plt.figure(figsize=(6,6))
sns.distplot(big_mart_data['Item_Outlet_Sales'])
plt.show()
# Outlet_Establishment_Year column
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_data)
plt.show()
# Item_Fat_Content column
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_data)
plt.show()
# Outlet_Size column
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Size', data=big_mart_data)
plt.show()
big_mart_data.head()
big_mart_data['Item_Fat_Content'].value_counts()
big_mart_data.replace({'Item_Fat_Content': {'low fat': 'Low Fat', 'LF': 'Low Fat',
                                            'reg': 'Regular'}}, inplace=True)
big_mart_data['Item_Fat_Content'].value_counts()
# Label Encoding
encoder = LabelEncoder()
big_mart_data['Item_Identifier'] = encoder.fit_transform(big_mart_data['Item_Identifier'])
big_mart_data['Item_Fat_Content'] = encoder.fit_transform(big_mart_data['Item_Fat_Content'])
big_mart_data['Item_Type'] = encoder.fit_transform(big_mart_data['Item_Type'])
big_mart_data['Outlet_Identifier'] = encoder.fit_transform(big_mart_data['Outlet_Identifier'])
big_mart_data['Outlet_Size'] = encoder.fit_transform(big_mart_data['Outlet_Size'])
big_mart_data['Outlet_Location_Type'] = encoder.fit_transform(big_mart_data['Outlet_Location_Type'])
big_mart_data['Outlet_Type'] = encoder.fit_transform(big_mart_data['Outlet_Type'])
big_mart_data.head()
X = big_mart_data.drop(columns='Item_Outlet_Sales', axis=1)
Y = big_mart_data['Item_Outlet_Sales']
print(X)
print(Y)
# XGBoost Regressor
# Train/test split (an 80/20 split is assumed here; the original listing
# omitted this step)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
regressor = XGBRegressor()
regressor.fit(X_train, Y_train)
# prediction on training data
training_data_prediction = regressor.predict(X_train)
# R squared value
r2_train = metrics.r2_score(Y_train, training_data_prediction)
print('R Squared value = ', r2_train)
# prediction on test data
test_data_prediction = regressor.predict(X_test)
# R squared value
r2_test = metrics.r2_score(Y_test, test_data_prediction)
print('R Squared value = ', r2_test)
# RMSE
rmse_xgboost = np.sqrt(metrics.mean_squared_error(Y_test, test_data_prediction))
print('RMSE = ', rmse_xgboost)
# Linear Regressor
new_data = big_mart_data.copy()
# Independent variables
x = new_data.drop(columns='Item_Outlet_Sales', axis=1)
# Dependent variable
y = new_data['Item_Outlet_Sales'].values.reshape(-1, 1)
# train/test split (assumed 80/20, mirroring the XGBoost section)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)
regressor = LinearRegression()
regressor.fit(x_train, y_train)
# Prediction
y_pred = regressor.predict(x_test)
print(r2_score(y_test, y_pred))
# RMSE
rmse_linear = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE = ', rmse_linear)
# Ridge Regression
rr = Ridge(alpha=0.009)
rr.fit(x_train, y_train)
rr_pred = rr.predict(x_test)
print(r2_score(y_test, rr_pred))
# RMSE
rmse_ridge = np.sqrt(metrics.mean_squared_error(y_test, rr_pred))
print('RMSE = ', rmse_ridge)
REFERENCES
[1] Ching Wu Chu and Guoqiang Peter Zhang, “A comparative study of linear and
nonlinear models for aggregate retails sales forecasting”, Int. Journal Production
Economics, vol. 86, pp. 217- 231, 2003.
[2] Wang, Haoxiang. "Sustainable development and management in consumer
electronics using soft computation." Journal of Soft Computing Paradigm (JSCP)
1, no. 01 (2019): 56.
[3] Suma, V., and Shavige Malleshwara Hills. "Data Mining based Prediction of
Demand in Indian Market for Refurbished Electronics." Journal of Soft Computing
Paradigm (JSCP) 2, no. 02 (2020): 101-110.
[4] Giuseppe Nunnari, Valeria Nunnari, “Forecasting Monthly Sales Retail Time
Series: A Case Study”, Proc. of IEEE Conf. on Business Informatics (CBI), July
2017.
[5] Maike Krause-Traudes et al. Spatial data mining for retail sales forecasting.
Tech.rep. Fraunhofer-Institut Intelligente Analyse- und Informationssysteme
(IAIS), 2008.
[6] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[7] Cerrada, M., & Aguilar, J. (2008). Reinforcement learning in system
identification. In Reinforcement Learning. IntechOpen.
[8] Welling, M. (2011). A first encounter with Machine Learning. Irvine, CA.:
University of California, 12.
[9] Michie, D., et al. (eds.) (1994). Machine Learning, Neural and Statistical
Classification, 350.
[10] Mitchell, T. M. (1999). Machine learning and data mining. Communications of
the ACM, 42(11), 30- 36.
[11] Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting methods and
applications. John Wiley & Sons (2008)
[12] Kadam, H., Shevade, R., Ketkar, P. and Rajguru.: “A Forecast for Big Mart
Sales Based on Random Forests and Multiple Linear Regression.” (2018).
[13] C. M. Wu, P. Patil and S. Gunaseelan: Comparison of Different Machine
Learning Algorithms for Multiple Regression on Black Friday Sales Data (2018).
[14] Das, P., Chaudhury: Prediction of retail sales of footwear using feed forward
and recurrent neural networks (2018)
35
[15] Das, P., Chaudhury, S.: Comparison of Different Machine Learning
Algorithms for Multiple Regression on Black Friday Sales Data (2007)
[16] G. Behera and N. Nain, "A Comparative Study of Big Mart Sales
Prediction", pp. 1-13, 2019. [Accessed 10 October 2019].
[17] S. Beheshti-Kashi, H. Karimi, K. Thoben and M. Lütjen, "A survey on
retail sales forecasting and prediction in fashion markets", Systems Science
& Control Engineering, vol. 3, no. 1, pp. 154-161, 2014. Available:
10.1080/21642583.2014.999389 [Accessed 27 January 2020].
[18] A. Chandel, A. Dubey, S. Dhawale and M. Ghuge, "Sales Prediction
System using Machine Learning", International Journal of Scientific Research
and Engineering Development, vol. 2, no. 2, pp. 1-4, 2019. [Accessed 27 January
2020].
[19] M. Wistuba, N. Schilling and L. Schmidt-Thieme, "Hyperparameter
Optimization Machines," 2016 IEEE International Conference on Data
Science and Advanced Analytics (DSAA), Montreal, QC, 2016, pp. 41-50.
[20] M. Wistuba, N. Schilling and L. Schmidt-Thieme, "Learning
hyperparameter optimization initializations," 2015 IEEE International
Conference on Data Science and Advanced Analytics (DSAA), Paris, 2015, pp.
1-10.