Customer Transaction Analysis - Vu Truong

Uploaded by

Minh Duc Ha

– Abstract –

In this report, I apply several approaches to analyzing a customer transaction dataset. My
objective is to identify insights in the data and perform some diagnostic analysis. I also apply
machine learning to predict future performance and evaluate how much sales and profit have
improved over prior years. Finally, I use the K-means clustering algorithm for customer
segmentation.
CONTENTS
1. Data Overview
2. Data Validation
3. Exploratory Data Analysis
3.1 Orders, Products and Customers
3.2 Category
3.3 Sales
3.4 Quantity
3.5 Ship Mode
4. Descriptive Analysis
4.1 Total Order by quarters and years
4.2 Total Orders by states
4.3 Sales over months and years
4.4 Category Sales over quarters and years
4.5 Profit by Category
4.6 Sale and Profit ratio by Quarters and Category
4.7 Time Delivery
5. Diagnostic Analysis
5.1 Which category made the most sales over years?
5.2 Which products got the most sales and quantity?
6. Sale prediction in 2019
7. Strategies to improve Sales and Profit
7.1 Minimize negative profit
7.2 Focus on promoting key items
7.3 Reduce Discount
7.4 Attract more customers
8. Customer Segmentation
9. Conclusion
10. Resources
1. Data Overview
The original dataset contains 2846 rows and 20 columns.
There are 20 attributes, including:
• Row ID: index of row
• Order ID: index of order
• Order Date: date of ordering
• Ship Date: date of shipping order
• Ship Mode: Options of shipping method
• Customer ID: index of customer
• Customer Name: name of customer
• Product ID: index of product
• Category: category of product
• Sub-Category: sub-category of product
• Product Name: name of product
• City: city
• State: state
• Postal Code: postal code of city and state
• Region: region in US
• Country: United States
• Quantity: number of items in the order
• Discount: discount rate applied to the order
• Profit: profit earned from the order
• Sales: total price of the order after applying the discount

Figure 1.1: First 10 rows and several columns of dataset

2. Data Validation
Before starting the analysis, we should pre-process the data, including cleaning and
transformation steps. The first thing I did was remove 40 blank rows that had no data recorded.
I also filled in the missing postal code for the city of Burlington, Vermont, with its actual
value of 5401. Then I extracted the numeric values from the Sales column, because its raw entries
combine a currency label (USD) with digits. Finally, I removed several columns that are
unnecessary for the subsequent analysis.
When looking at the remaining attributes:
• Order Date values range from 5 Jan 2015 to 30 Dec 2018
• Ship Date values range from 12 Jan 2015 to 4 Dec 2019
• There are 3 distinct values in Category: Office Supplies, Furniture, and Technology
• There are 17 possible values recorded in Sub-Category
• The number of distinct State and City are 14 and 120, respectively
• There are 4 Ship Mode classes: First, Second, Standard, Same Day
• Quantity values range from 1 to 14
• Discount values range from 0 to 0.7
• Profit values range from -6600 to 5040
• Sales has non-negative values with a maximum of 11200

After the data validation, the dataset contains 2806 rows and 17 columns without missing values.
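
The cleaning steps for blank rows and the Sales column can be sketched as follows. This is a minimal illustration using pandas, not the exact procedure used for this report: the raw string formats shown are assumptions, since the report only states that Sales mixed a currency label (USD) with digits.

```python
import pandas as pd

# Hypothetical raw values; the real formatting of the Sales column is not
# shown in the report, so these strings are assumptions for illustration.
raw = pd.DataFrame({"Sales": ["USD 261.96", "731.94 USD", "USD 1,044.63", None]})

# Remove rows where every column is empty (the "blank rows" step).
clean = raw.dropna(how="all").copy()

# Strip everything except digits and the decimal point, then convert to float.
clean["Sales"] = (
    clean["Sales"]
    .str.replace(r"[^0-9.]", "", regex=True)
    .astype(float)
)

print(clean)
```

Because Sales has no negative values in this dataset, dropping the minus sign along with the other non-numeric characters is harmless here; a column that can be negative would need a different pattern.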

3. Exploratory Data Analysis
3.1 Orders, Products and Customers
There were 2882 orders and 673 distinct customers, while 1431 products were sold over four
years.

3.2 Category
There are three categories in this dataset. The most common category was Office Supplies, which
accounted for more than half of the total. Furniture and Technology came next, each with
around 300 products.

Figure 3.1: Number of products by Category


3.3 Sales

Figure 3.2: Sales distribution


Looking at the sales distribution, we can observe that most products were sold for under 1000 USD;
in particular, more than 2000 products had sales below 200 USD. There are several outliers
with sales above 1000 USD, but these are very uncommon.

3.4 Quantity

Figure 3.3: Number of transactions by quantity


The most common quantities were 2 and 3, together accounting for approximately 50% of
transactions. Transactions with quantities of 1, 4, or 5 were less common, each at a mid-level of
around 300 observations. Quantities higher than 10 were rare, occurring in only a small number of
transactions.

3.5 Ship Mode


Standard Class is the most popular ship mode, with nearly 1700 observations, followed by Second
Class and First Class. Far below is the Same Day option, with around 150 observations. This
suggests that most customers selected Standard Class as their primary shipping mode when
shopping online.

Figure 3.4: Number of transactions by ship mode

4. Descriptive Analysis
4.1 Total Order by quarters and years
Below is the number of orders by quarter over four years. We can observe an upward trend in
total orders. In more detail, the figure was always low in the first quarter, then increased
gradually and peaked in the fourth quarter. This means customers shopped more toward the
end of the year, so we should launch appropriate plans to boost this metric.

Figure 4.1: Total orders by year and quarter

4.2 Total Orders by states

Figure 4.2: Total order by state

The diagram shows that New York had the highest number of orders, at more than 1000.
Pennsylvania and Ohio were in second and third place, with 601 and 478 orders, respectively. The
remaining states had a small number of orders, accounting for only 25% in total. This indicates
that shopping activity was concentrated in large states, particularly New York, Pennsylvania,
and Ohio.

4.3 Sales over months and years

Figure 4.3: Total sales by month


The line chart above shows that sales had an upward trend over the years. Within a given
year, sales in the first six months tended to be lower than in the last six months. Toward the end
of the year, sales increased dramatically in September, then declined significantly in October
before rising again two months later. This suggests that customers tended to spend
more at the end of the year, while shopping activity in the first half of the year was less frequent.

4.4 Category Sales over quarters and years


The monthly sales were determined by adding the sales of all categories during that month. Thus,
let's have a look at the sales by category.
The diagram indicates that all categories trended upward from the first to the fourth quarter.
Except for the year 2017, the figure for Technology fluctuated between 7000 USD and 30000 USD. In
the last quarter of each year, Technology consistently ranked first, meaning that
customers purchased more technology products at the end of the year.

Figure 4.4: Total sales by month and category

4.5 Profit by Category
While total sales were always positive, profit can be negative. The line chart
below describes the profit of the three categories from January 2015 to December 2018. Overall,
profits grew over the given period in all categories.
Office Supplies was the most stable category, with its profit ranging from 0 to 5000 USD. In
contrast, Furniture had the lowest profit for most of the period. Its profits fluctuated in the
first two years between -1400 USD and 1400 USD and peaked at 2077 USD in December 2016. From
this point, the figures leveled off, reaching a plateau of 1050 USD at the end of 2018.
Regarding Technology, its profits had an upward trend in the first two years before
declining in the following year. Then the figure rose again and reached a
peak of approximately 15000 USD at the end of the period.

Figure 4.5: Profit by quarter and category

4.6 Sale and Profit ratio by Quarters and Category

Figure 4.6: Sale and Profit ratio by Quarters and Category


The profit ratio, obtained by dividing profit by total sales, is displayed in the combined chart
above. Notice that in the first quarter of 2015 the profit ratio was negative, while it
was positive for the rest of the period. The fourth quarter's sales were always the highest,
but its profit ratio halved from 14.6% to 6.7% before jumping threefold to 20.8% in 2018.
The second quarter began at 6.5% in 2015, then remained steady at around 16%. In contrast,
the figure for the third quarter fell continuously from 17.2% to 7.3% over the given
period. We can conclude that the second quarter performed better than the third
quarter in terms of profit ratio, even though sales in the second quarter were lower than in the
third during the period.
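
As a rough illustration of how the quarterly profit ratio can be computed, the sketch below aggregates sales and profit per quarter with pandas and divides one by the other. The rows are made-up placeholders, not values from the dataset; only the column names follow the attribute list in Section 1.

```python
import pandas as pd

# Illustrative rows only; these numbers are placeholders, not report data.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2015-01-05", "2015-02-10", "2015-04-03", "2015-05-20"]
    ),
    "Sales": [100.0, 300.0, 200.0, 200.0],
    "Profit": [-30.0, 10.0, 20.0, 6.0],
})

# Sum sales and profit within each calendar quarter.
quarter = df["Order Date"].dt.to_period("Q")
quarterly = df.groupby(quarter)[["Sales", "Profit"]].sum()

# Profit ratio = total profit of the quarter / total sales of the quarter.
quarterly["Profit Ratio"] = quarterly["Profit"] / quarterly["Sales"]

print(quarterly)
```

Note that the ratio is taken on the quarterly totals rather than averaged over individual transactions, which matches the definition given above.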

4.7 Time Delivery


We can find the delivery time by calculating the difference in days between ship date and
order date. The delivery time distribution shows that 800 transactions were shipped 4 days
after ordering, which is the most common delivery duration. Next are 5 and 2 days, with
around 600 and 400 transactions, respectively.
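
The delivery time calculation described above might look like the following in pandas. The three rows are invented for illustration; the column names come from Section 1, and the summary by ship mode is one possible way to produce the comparison discussed in this section.

```python
import pandas as pd

# Placeholder transactions; the dates are examples, not report data.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-10"]),
    "Ship Date": pd.to_datetime(["2015-01-09", "2015-01-06", "2015-01-15"]),
    "Ship Mode": ["Standard Class", "Same Day", "Standard Class"],
})

# Delivery time in whole days = ship date minus order date.
df["Delivery Days"] = (df["Ship Date"] - df["Order Date"]).dt.days

# Number of transactions for each delivery duration.
distribution = df["Delivery Days"].value_counts().sort_index()

# Fastest, average, and slowest delivery per ship mode.
by_mode = df.groupby("Ship Mode")["Delivery Days"].agg(["min", "mean", "max"])

print(distribution)
print(by_mode)
```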

Figure 4.7: Time delivery distribution

Figure 4.8: Time delivery by ship mode


Looking at the delivery time distribution across Ship Mode, we can observe that 157
transactions were shipped on the same date as the order date, labeled Same Day. First Class
transactions were shipped within 1 to 3 days, which was faster than Second Class, with an
average shipping duration of 3.27 days. Standard Class transactions were delivered slowest,
taking 4 to 7 days. This means customers should allow more waiting time to receive products
when selecting Standard Class as Ship Mode.

5. Diagnostic Analysis
5.1 Which category made the most sales over years?

Figure 5.1: Percentage of sales by category and year


The stacked bar chart indicates how each category's share of sales changed over four years.
Technology is considered the category that made the most sales because its share increased
from 35.09% in 2015 to 40.82% in 2018.
We should investigate why most sales were generated by Technology products. As we know,
Sales = Quantity * Price * (1 - Discount). In other words, sales depend on quantity, price per
unit, and discount. Let's have a look at these attributes.
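
Since the dataset records Sales, Quantity, and Discount but not the unit price, the price per unit has to be recovered by inverting the formula above. The helper below is a small sketch of that rearrangement; the function name and the example numbers are hypothetical.

```python
def unit_price(sales: float, quantity: int, discount: float) -> float:
    """Invert Sales = Quantity * Price * (1 - Discount) to recover Price."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return sales / (quantity * (1 - discount))


# Example with made-up numbers: 2 items sold at a 50% discount for 100 USD
# in total implies an undiscounted price of 100 USD per item.
example_price = unit_price(sales=100.0, quantity=2, discount=0.5)
print(example_price)
```

The discount guard is safe for this dataset because the Discount values reported in Section 2 never exceed 0.7.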

Figure 5.2: Distribution of sales by category


The Quantity distributions of the three categories are very similar in terms of mean, IQR, and
outliers. This means that the number of items has little or no impact on total sales.
Looking at the price per unit distribution, we notice that all categories have outliers,
making comparison difficult. To make the three diagrams easier to compare, we will look
at a shorter interval, from 0 to 500 USD.

Figure 5.3: Price per unit distribution with long and short intervals
Office Supplies products were sold at the cheapest prices but had numerous outliers. While
the other two categories have the same mean and IQR, Technology has more outliers with higher
prices than Furniture, resulting in higher total sales.
There is one remaining factor that might affect sales, which is the discount attribute. As we can
see, Technology has a wider IQR than Furniture, but its mean and minimum value are the same. This
means that most Technology products were not discounted, whereas a large number of Furniture
products were discounted by 10% to 50%.

Figure 5.4: Discount distribution by category

We can also look at the sub-categories of Technology. Nearly 40% of Technology sales
came from Phones, followed by Machines and Copiers, at around 25% and 20%,
respectively. The remaining 17.4% belongs to Accessories.
Figure 5.5: Types of Technology products

Finally, we can conclude that the higher sales of Technology products resulted from the following:
• More Technology products were sold at higher prices than Furniture items.
• Technology products were discounted less than Office Supplies ones.
• 40% of Technology sales came from the Phones sub-category.

5.2 Which products got the most sales and quantity?

Figure 5.6: Sales versus quantity


From the scatter plot above, it is quite hard to determine which products brought in both quantity
and sales, because some products were high in quantity but low in sales and vice versa.
Since we want to find the products that were high in both sales and quantity, I divide each
attribute into 4 groups using the first quartile (Q1), median (Q2), and third quartile (Q3) as
thresholds. Products with sales lower than Q1 were labeled 1, sales between Q1 and Q2 were
labeled 2, sales between Q2 and Q3 were labeled 3, and sales higher than Q3 were labeled 4. The
same process applies to the quantity attribute.

Figure 5.7: Several high sale and quantity products

After applying this method, there are 130 products in total that were high in both sales and quantity.
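
The quartile labeling can be sketched as below. Two points are assumptions rather than details from the report: values that fall exactly on a threshold are given the lower label, and "high in both" is taken to mean label 4 for sales and for quantity. The product rows are placeholders.

```python
import numpy as np
import pandas as pd


def quartile_label(values: pd.Series) -> pd.Series:
    """Label each value 1 to 4 using Q1, Q2 (median), and Q3 as thresholds."""
    q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])
    # side="left" assigns a value equal to a threshold to the lower group.
    labels = np.searchsorted([q1, q2, q3], values.to_numpy(), side="left") + 1
    return pd.Series(labels, index=values.index)


# Placeholder per-product totals; not taken from the dataset.
products = pd.DataFrame(
    {
        "Sales": [50.0, 120.0, 300.0, 900.0, 1500.0, 80.0, 400.0, 2000.0],
        "Quantity": [1, 3, 2, 9, 12, 2, 5, 14],
    },
    index=[f"P{i}" for i in range(1, 9)],
)

products["Sales Label"] = quartile_label(products["Sales"])
products["Quantity Label"] = quartile_label(products["Quantity"])

# Assumed definition of "high in both": top group on each attribute.
top = products[(products["Sales Label"] == 4) & (products["Quantity Label"] == 4)]
print(top)
```

A manual threshold lookup is used here instead of `pd.qcut` because quantity data often contains repeated values, which can produce duplicate bin edges.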

6. Sale prediction in 2019
From the dataset provided, we can easily calculate monthly sales, as shown in Figure 4.3.
However, we should also forecast next year's sales in order to give the manager a better
overview for short-term plans.
I use the PyCaret library to predict sales in 2019, because PyCaret can compare
numerous time-series models, evaluate them with different metrics, and return the best model.
Before training models, I split the dataset into training and test sets. I select the sales from
the beginning of the data to June 2018 for the training set and keep the rest for the test set.
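
The chronological split described above can be illustrated with plain pandas. This sketch shows only the split itself, on a dummy monthly series; it does not reproduce the PyCaret setup used for the report, and the variable names are hypothetical.

```python
import pandas as pd

# Dummy monthly series for January 2015 to December 2018; the values are
# placeholders standing in for the real monthly sales totals.
months = pd.period_range("2015-01", "2018-12", freq="M")
monthly_sales = pd.Series(range(len(months)), index=months, dtype=float)

# Everything up to and including June 2018 is used for training;
# the remaining months of 2018 are held out for testing.
cutoff = pd.Period("2018-06", freq="M")
train = monthly_sales[monthly_sales.index <= cutoff]
test = monthly_sales[monthly_sales.index > cutoff]

print(len(train), len(test))
```

Splitting by date rather than at random keeps the test set strictly in the future relative to the training set, which is what a forecasting evaluation needs.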

Figure 6.1: Line chart of training set and test set

Figure 6.2: Top 15 best models


After evaluating around 30 models, it turns out that Bayesian Ridge gets the lowest MASE
(mean absolute scaled error). Though ARIMA and Seasonal Naive Forecaster are two popular models
for time-series forecasting, we cannot choose either of them because of their poor performance.
Using the Bayesian Ridge model, we are able to predict sales over the next 12 months.
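
For readers unfamiliar with the metric, MASE compares the forecast's mean absolute error with the error of a naive forecast on the training data, so values below 1 indicate a forecast that beats the naive baseline. The function below is a generic sketch of that definition, not the implementation used by the forecasting library; the seasonal period `m` and the sample numbers are assumptions.

```python
import numpy as np


def mase(y_true, y_pred, y_train, m: int = 1) -> float:
    """Mean absolute scaled error.

    m is the seasonal period of the naive baseline (m=1 compares each
    training point with the one immediately before it).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    forecast_mae = np.mean(np.abs(y_true - y_pred))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(forecast_mae / naive_mae)


# Made-up numbers: the naive in-sample error is 2 and the forecast error
# is 1.5, so the score is below 1.
score = mase(y_true=[18, 20], y_pred=[17, 22], y_train=[10, 12, 14, 16])
print(score)
```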

Figure 6.3: Sales prediction in 2019
The line chart above shows the sales prediction for 2019. It has the same pattern as the
previous year: sales are low in the first half of the year, increase in September, and decline in
October before rising again at the end of 2019.
We should calculate the total sales for 2019 to evaluate how it compares with the previous year.
The overall sales forecast for 2019 is 229808 USD, a slight rise of 5.6% over 2018. Looking at
sales growth over the years, it reached a peak of 25.9% in 2016, then dropped by half in 2017
before increasing to 18.6% in 2018. No prior year had sales growth under 10%, and 5.6% seems to
be a very low yearly sales growth for the retail industry. This suggests that the business team
should make more effort to achieve a high volume of sales in 2019.
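
The growth figure can be checked with a short calculation. The 2018 total is not quoted directly in the text, so in this sketch it is back-calculated from the 282807 USD target that Section 7.4 describes as a 30% increase over 2018; treat it as an implied value rather than a reported one.

```python
def growth_rate(current: float, previous: float) -> float:
    """Year-over-year growth as a fraction of the previous year's value."""
    return (current - previous) / previous


forecast_2019 = 229808.0   # forecast total stated in the report
target_2019 = 282807.0     # target stated in Section 7.4 (30% above 2018)

# Implied 2018 total, derived from the 30% target (an assumption).
sales_2018 = target_2019 / 1.30

growth = growth_rate(forecast_2019, sales_2018)
print(f"Implied 2018 sales: {sales_2018:.0f} USD")
print(f"Forecast growth for 2019: {growth:.1%}")
```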

Figure 6.4: Total sales and sales growth over years

7. Strategies to improve Sales and Profit
Besides the analysis process to discover insights, we should also recommend several strategies and
present them to our manager. Based on the data provided, I make the following suggestions on how
to generate more sales and profit.

7.1 Minimize negative profit

Figure 7.1: Profit by sub-category


It can be seen from the waterfall chart that among the 17 sub-categories, three
(Supplies, Bookcases, and Tables) yield negative profit, meaning that their
costs were higher than their sales. Hence, if these negative values can be alleviated, we may
generate 13.4K USD in profit.

7.2 Focus on promoting key items


Since profit is obtained through sales, we should aim to boost sales, leading to an
increase in profit. As shown in Figure 5.1, Technology and Office Supplies account for most of
the sales. We set the Furniture category aside because its sales dropped over four years.
Our findings above revealed that 130 products were high in both sales and quantity.
Among them, 48 belong to Office Supplies and 36 to Technology.
Hence, we should make these 84 products the focus of the next campaign.

Figure 7.2: Several major products of Office Supplies and Technology categories

7.3 Reduce Discount
Discount vouchers can be considered a primary factor with a large effect on profit. Hence,
for products that were heavily discounted, we should reduce the discount to retain more profit.
There are 172 Binders transactions that were discounted by 70%. Also, the Phones and
Tables sub-categories had a 40% discount in 116 and 59 transactions, respectively. Several
transactions of Bookcases, Machines, and Copiers were discounted by 40% to 70%. The products in
these transactions should be discounted less to make more profit.

Figure 7.3: Sub-categories with high discount

7.4 Attract more customers


As shown in Section 6, the sales prediction for 2019 is 229808 USD, an increase of only 5.6%
compared to last year. This can put pressure on the business team to achieve higher metrics. For
example, if the board of directors aims for sales growth of 30% in 2019, how many customers does
the business team need to reach that target?
To solve this problem, we should first calculate total customers and profit each year. I find
that the correlation between the number of customers and profit is 0.88, meaning that the two
attributes have a strong positive linear relationship.

Figure 7.4: Table of total customers, profit, and sales over 4 years

Similarly, the correlation coefficient between the number of customers and sales is 0.98,
meaning that these two attributes also have a strong positive linear relationship.

I apply machine learning, specifically a simple linear regression model, to predict the
number of customers needed in 2019 if sales reach 282807 USD (an increase of 30% from last year).
It turns out that we need around 469 customers actively purchasing products to achieve that target.
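
A simple linear regression of this kind can be sketched with NumPy. The yearly figures below are hypothetical stand-ins for the table in Figure 7.4, so the output will not match the 469 customers reported above; only the 282807 USD target is taken from the text.

```python
import numpy as np

# Hypothetical yearly totals for 2015 to 2018; the real values are in
# Figure 7.4 and are not reproduced here.
sales = np.array([120000.0, 150000.0, 180000.0, 215000.0])
customers = np.array([250.0, 290.0, 330.0, 380.0])

# Pearson correlation between sales and number of customers.
r = np.corrcoef(sales, customers)[0, 1]

# Least-squares line: customers = slope * sales + intercept.
slope, intercept = np.polyfit(sales, customers, deg=1)

# Number of customers the fitted line implies for the target sales.
target_sales = 282807.0
predicted_customers = slope * target_sales + intercept

print(f"correlation: {r:.3f}")
print(f"predicted customers at target: {predicted_customers:.0f}")
```

With only four yearly observations, a fit like this is an extrapolation well outside the observed range, so the prediction is best read as a rough guide.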

Figure 7.5: Customers prediction by sales

I find that the correlation coefficient between sales and profit is 0.89. Hence, we can also build
a linear regression model to forecast profit from sales. When the target sales figure is 282807
USD, we would earn a profit of 42442 USD, equivalent to 28.9% growth from last year.

Figure 7.6: Profit prediction by sales

8. Customer Segmentation
Customer segmentation is the process of dividing customers into groups that share similar
characteristics. In this report, I use unsupervised learning, specifically the K-means clustering
method, to perform this task.
I select Sales and Profit as the two attributes for segmenting customers. Let's have a look at the
distributions of these variables. From Figure 8.1, the sales per customer distribution is highly
right-skewed, while that of profit is a bell curve with an average of 133 USD.

Figure 8.1: Sales and profit distribution

I try different numbers of clusters from 1 to 10 to evaluate which yields the best result. I
plot the relationship between the number of clusters and the Within-Cluster Sum of Squares (WCSS).
Then, I apply the Elbow method to select the number of clusters at which the change in WCSS
begins to level off. In this case, the “Elbow” is located between 2 and 4, indicating that
3 clusters is a good fit.
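
The WCSS curve used by the Elbow method can be produced along the following lines. This sketch assumes scikit-learn's `KMeans`, whose `inertia_` attribute is the within-cluster sum of squares, and it runs on synthetic (Sales, Profit) points rather than the real customer data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic per-customer (Sales, Profit) points around three made-up
# centers; these are placeholders, not the customers in the dataset.
rng = np.random.default_rng(0)
centers = np.array([[1000.0, 50.0], [3500.0, 300.0], [7000.0, 900.0]])
X = np.vstack(
    [c + rng.normal(scale=[200.0, 40.0], size=(50, 2)) for c in centers]
)

# Fit K-means for k = 1..10 and record the WCSS of each fit.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

for k, value in enumerate(wcss, start=1):
    print(k, round(value))
```

Plotting `wcss` against k gives the curve in which the elbow is read off by eye; the point where further clusters stop reducing WCSS substantially is the candidate number of groups.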

Figure 8.2: Number of clusters versus WCSS

We can view 3 groups of customers through a scatter plot.

Figure 8.3: Three groups of customers

Cluster characteristics:
• Cluster 0: Sales were less than 2000 USD, and many customers were bringing negative profit.
• Cluster 1: Group 1 has higher sales than group 0, but there were still a few negative profit
customers.
• Cluster 2: Only 12 customers belong to group 2, which purchased more than 6000 USD and brought
positive profit.

Figure 8.4: Simple interpretation of result

As we can see, the model mainly divides customers based on their sales. If we want
a better segmentation result or a more in-depth analysis, we should request additional data.

9. Conclusion
After utilizing different techniques, I have discovered the following insights:
• Shopping activities were more frequent in large states such as New York, Pennsylvania, and
Ohio.
• The number of orders was always highest in the fourth quarter, indicating that an appropriate
plan should be put in place at the end of the year to maximize this metric.
• Sales had an upward trend over the years. In particular, sales were low in the first six
months but increased considerably in the remaining months.
• Customers tended to spend more at the end of the year, especially on
Technology products.
• Technology products contributed primarily to the profit.
• Profit ratios in the second quarter were stable, at around 16% for three consecutive years.
• Standard Class was the most common ship mode, with delivery times of 4 to 7 days.
• Technology outperformed other categories in terms of sales due to high-value items and
minimal discounts.
• 130 products were high in both sales and quantity.
• The sales forecast for 2019 shows only 5.6% growth, much lower than in prior years.

I also recommend several strategies to improve sales and profits:


• Minimize negative profit from Tables, Bookcases, and Supplies products.
• Pay attention to 84 key items from Office Supplies and Technology categories.
• Keep discounts as low as possible.
• Attract a higher number of customers.

Finally, I classify customers into 3 groups, which allows the marketing department to understand
them at a deeper level.

10. Resources

[1] help.tableau.com. (n.d.). Get Started with Tableau Prep. [online] Available at:
https://help.tableau.com/current/prep/en-us/prep_get_started.htm [Accessed 30 Jan. 2023].

[2] GitHub. (n.d.). GitHub - pycaret/pycaret: An open-source, low-code machine learning
library in Python. [online] Available at: https://github.com/pycaret/pycaret.

[3] Hshan.T (2021). Exploring Customers Segmentation With RFM Analysis and K-Means
Clustering. [online] The Startup. Available at:
https://medium.com/swlh/exploring-customers-segmentation-with-rfm-analysis-and-k-means-clustering-93aa4c79f7a7.

[4] Briggs, J. (2020). K-Means Clustering in Python. [online] Medium. Available at:
https://towardsdatascience.com/k-means-clustering-in-python-4061510145cc [Accessed 30
Jan. 2023].
