Customer Transaction Analysis - Vu Truong
In this report, I apply several approaches to analyzing a customer transaction dataset. My
objective is to identify insights from the data and perform diagnostic analysis. I also apply
machine learning to forecast future sales and evaluate how much sales and profit have improved
over prior years. Finally, I use the K-means clustering algorithm to segment customers.
CONTENTS
1. Data Overview
2. Data Validation
3. Exploratory Data Analysis
3.1 Orders, Products and Customers
3.2 Category
3.3 Sales
3.4 Quantity
3.5 Ship Mode
4. Descriptive Analysis
4.1 Total Order by quarters and years
4.2 Total Orders by states
4.3 Sales over months and years
4.4 Category Sales over quarters and years
4.5 Profit by Category
4.6 Sale and Profit ratio by Quarters and Category
4.7 Time Delivery
5. Diagnostic Analysis
5.1 Which category made the most sales over years?
5.2 Which products got the most sales and quantity?
6. Sale prediction in 2019
7. Strategies to improve Sales and Profit
7.1 Minimize negative profit
7.2 Focus on promoting key items
7.3 Reduce Discount
7.4 Attract more customers
8. Customer Segmentation
9. Conclusion
10. Resources
1. Data Overview
The original dataset contains 2846 rows and 20 columns.
There are 20 attributes, including:
• Row ID: index of row
• Order ID: index of order
• Order Date: date of ordering
• Ship Date: date of shipping order
• Ship Mode: Options of shipping method
• Customer ID: index of customer
• Customer Name: name of customer
• Product ID: index of product
• Category: category of product
• Sub-Category: sub-category of product
• Product Name: name of product
• City: city
• State: state
• Postal Code: postal code of city and state
• Region: region in US
• Country: United States
• Quantity: number of items in the order
• Discount: discount rate applied to the order
• Profit: profit earned from the order
• Sales: total price of the order after the discount is applied
2. Data Validation
Before starting the analysis, the data must be pre-processed, which includes cleaning and
transformation steps. First, I removed 40 blank rows that had no data recorded. I also filled in the
missing postal code for Burlington, Vermont with its actual value of 5401. Then I extracted the
numeric values from the Sales column, because its raw entries combine a currency label (USD) with
digits. Finally, I removed several columns that are unnecessary for subsequent analysis.
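The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal
illustration, not the code used for the report: the toy rows, the "$"-prefixed Sales format, the
"Postal Code" column name, and the choice of dropped columns are all assumptions, since the report
does not state them.

```python
import pandas as pd

# Toy stand-in for the raw export; values and formats are assumed, not taken from the dataset.
raw = pd.DataFrame({
    "Row ID": [1, 2, None, 4],
    "City": ["Burlington", "New York City", None, "Columbus"],
    "State": ["Vermont", "New York", None, "Ohio"],
    "Postal Code": [None, 10024, None, 43229],
    "Sales": ["$261.96", "$1,731.94", None, "$14.62"],
    "Country": ["United States", "United States", None, "United States"],
})

# 1. Drop rows in which every field is empty.
df = raw.dropna(how="all").copy()

# 2. Fill the missing postal code for Burlington, Vermont.
mask = (
    (df["City"] == "Burlington")
    & (df["State"] == "Vermont")
    & df["Postal Code"].isna()
)
df.loc[mask, "Postal Code"] = 5401

# 3. Strip currency symbols and thousands separators, then convert Sales to float.
df["Sales"] = df["Sales"].str.replace(r"[^0-9.\-]", "", regex=True).astype(float)

# 4. Drop columns not needed later (which columns the report dropped is not stated).
df = df.drop(columns=["Row ID", "Country"])

print(df)
```

Checking `df.isna().sum()` afterwards is a quick way to confirm that no missing values remain.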
When looking at the remaining attributes:
• Order Date values range from 5 Jan 2015 to 30 Dec 2018
• Ship Date values range from 12 Jan 2015 to 4 Dec 2019
• There are 3 distinct values in Category: Office Supplies, Furniture, and Technology
• There are 17 possible values recorded in Sub-Category
• There are 14 distinct states and 120 distinct cities
• There are 4 Ship Mode classes: First, Second, Standard, Same Day
• Quantity values range from 1 to 14
• Discount values range from 0 to 0.7
• Profit values range from -6600 to 5040
• Sales has non-negative values with a maximum of 11200
After the data validation, the dataset contains 2806 rows and 17 columns without missing values.
3. Exploratory Data Analysis
3.1 Orders, Products and Customers
There were 2882 orders and 673 distinct customers, while 1431 products were sold over four
years.
3.2 Category
The dataset contains three product categories. The most common was Office Supplies, which
accounted for more than half of the total. Furniture and Technology followed with roughly equal
counts of around 300 products each.
3.4 Quantity
4. Descriptive Analysis
4.1 Total Order by quarters and years
Below is the number of orders by quarter over four years. We can observe an upward trend in
total orders. In more detail, the figure was always lowest in the first quarter, then increased
gradually and peaked in the fourth quarter. This suggests that customers shop more toward the end
of the year, so we should plan campaigns for that period to boost orders further.
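A count like this could be produced by grouping distinct order IDs by year and quarter. The sketch
below uses a made-up `orders` frame; only the `Order ID` and `Order Date` column names come from
the dataset description.

```python
import pandas as pd

# Made-up rows for illustration only.
orders = pd.DataFrame({
    "Order ID": ["A1", "A1", "A2", "A3", "A4", "A5"],
    "Order Date": pd.to_datetime([
        "2015-01-05", "2015-01-05", "2015-11-20",
        "2015-12-02", "2016-02-14", "2016-12-30",
    ]),
})

# An order can span several rows (one per product), so count distinct IDs.
orders_per_quarter = (
    orders
    .assign(Year=orders["Order Date"].dt.year,
            Quarter=orders["Order Date"].dt.quarter)
    .groupby(["Year", "Quarter"])["Order ID"]
    .nunique()
    .unstack("Quarter", fill_value=0)
)

print(orders_per_quarter)
```

Counting rows instead of distinct order IDs would overstate the number of orders whenever one
order contains several products.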
4.2 Total Orders by states
The diagram shows that New York had the most orders, with more than 1000. Pennsylvania and Ohio
were in second and third place, with 601 and 478 orders, respectively. The remaining states
together accounted for only 25% of orders. This indicates that shopping activity was concentrated
in large states, particularly New York, Pennsylvania, and Ohio.
4.3 Sales over months and years
4.5 Profit by Category
While sales are always non-negative, profit can be negative. The line chart below describes the
profit of the 3 categories from January 2015 to December 2018. Overall, profits grew over the
given period in all categories.
Office Supplies was the most stable category, with profit ranging from 0 to 5000 USD. In
contrast, Furniture had the lowest profit for most of the period. Its profits fluctuated in the
first two years between -1400 USD and 1400 USD and peaked at 2077 USD in December 2016. After that
point, the figures changed little and plateaued at 1050 USD at the end of 2018.
Regarding Technology, its profits trended upward in the first two years before declining in the
following year. Then the figure rose again and reached a peak of approximately 15000 USD at the
end of the period.
4.6 Sale and Profit ratio by Quarters and Category
The profit ratio for the second quarter started at 6.5% in 2015, then held steady at around 16%.
In contrast, the ratio for the third quarter fell continuously from 17.2% to 7.3% over the given
period. We can conclude that the second quarter outperformed the third quarter in terms of profit
ratio, even though its sales were lower than those of the third quarter during the period.
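The report does not define the profit ratio. A common definition, assumed here, is total profit
divided by total sales for the period. Under that assumption, a quarterly ratio could be computed
as in this sketch, which uses made-up transactions:

```python
import pandas as pd

# Made-up transactions for illustration only.
tx = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2015-04-10", "2015-05-02", "2015-08-15", "2015-09-01"]
    ),
    "Sales": [200.0, 300.0, 400.0, 600.0],
    "Profit": [30.0, 45.0, 50.0, 30.0],
})

# Sum sales and profit per calendar quarter first, then take the ratio of the sums.
quarter = tx["Order Date"].dt.to_period("Q")
totals = tx.groupby(quarter)[["Sales", "Profit"]].sum()
totals["Profit Ratio (%)"] = 100 * totals["Profit"] / totals["Sales"]

print(totals)
```

In this toy data the second quarter has lower sales than the third but a higher ratio, which is
the kind of pattern described above. Averaging per-transaction ratios instead of dividing the sums
would weight small and large orders equally and can give a different answer.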
5. Diagnostic Analysis
5.1 Which category made the most sales over years?
Figure 5.3: Price per unit distribution with long and short intervals
Office Supplies products were sold at the cheapest prices but had numerous outliers. While the
other two categories have the same mean and IQR, Technology has more high-priced outliers than
Furniture, resulting in higher total sales.
There is one remaining factor that might affect sales: the discount. As we can see, Technology
has a wider IQR than Furniture, but its mean and minimum value are the same. This means that most
Technology products were not discounted, whereas a large number of Furniture products were
discounted by 10% to 50%.
Finally, we can conclude that the higher sales of Technology products resulted from the following:
• More Technology products were sold at higher prices than Furniture items.
• Technology products were discounted less than Office Supplies ones.
• 40% of Technology sales came from the Phones sub-category.
5.2 Which products got the most sales and quantity?
After applying this method, 130 products in total were high in both sales and quantity.
6. Sale prediction in 2019
From the dataset provided, we can easily calculate quarterly sales, as shown in Figure 4.3. We
should also forecast next year's sales to give the manager a better basis for short-term planning.
I use the PyCaret library to predict sales in 2019 because it can compare numerous time-series
models, evaluate them on different metrics, and return the best one.
Before training the models, I split the dataset into training and test sets. Sales from the
beginning through June 2018 form the training set, and the rest form the test set.
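The chronological split described above can be sketched with pandas alone. The monthly
aggregation and the synthetic values below are assumptions for illustration; the PyCaret calls
themselves are left out because the report does not show them.

```python
import pandas as pd

# Made-up monthly sales (USD) covering January 2015 to December 2018.
months = pd.period_range("2015-01", "2018-12", freq="M")
monthly_sales = pd.Series(
    range(1000, 1000 + 100 * len(months), 100), index=months, dtype=float
)

# Chronological split: no shuffling, so the test set is strictly later than the training set.
cutoff = pd.Period("2018-06", freq="M")
train = monthly_sales[monthly_sales.index <= cutoff]
test = monthly_sales[monthly_sales.index > cutoff]

print(len(train), "training months,", len(test), "test months")
```

Keeping the test months at the end of the series, rather than sampling them at random, ensures
that a model is always evaluated on data later than anything it was trained on.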
Figure 6.3: Sales prediction in 2019
The line chart above shows the sales prediction for 2019. It follows the same pattern as the
previous year: low in the first half of the year, an increase in September, and a decline in
October before rising again at the end of 2019.
We should identify the total sales of 2019 to evaluate how it compares with the previous year.
The overall sales forecast for 2019 is 229808 USD, a slight rise of 5.6% over 2018. Looking at
sales growth over the years, it peaked at 25.9% in 2016, dropped by half in 2017, and then rose to
18.6% in 2018. No prior year had sales growth under 10%, and 5.6% seems to be a very low yearly
growth rate for the retail industry. This suggests that the business team should make more effort
to achieve a high volume of sales in 2019.
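The growth figure can be cross-checked with simple arithmetic. The sketch below uses only the two
numbers stated above. The implied 2018 total is an inference from them, not a value quoted from
the dataset, and it is approximate because 5.6% is rounded.

```python
forecast_2019 = 229_808  # USD, forecast stated in the report
growth_2019 = 0.056      # 5.6% rise over 2018, stated in the report


def yoy_growth(current, previous):
    """Year-over-year growth as a fraction of the previous year's value."""
    return (current - previous) / previous


# Invert the growth formula to recover the 2018 total implied by the two figures.
implied_2018 = forecast_2019 / (1 + growth_2019)

print(f"Implied 2018 sales: {implied_2018:,.0f} USD")
print(f"Check: {yoy_growth(forecast_2019, implied_2018):.1%} growth")
```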
7. Strategies to improve Sales and Profit
Besides discovering insights through analysis, we should also recommend several strategies to our
manager. Based on the data provided, I suggest the following ways to generate more sales and
profit.
Figure 7.2: Several major products of Office Supplies and Technology categories
7.3 Reduce Discount
Discounts can be considered a primary factor with a large effect on profit. Hence, for heavily
discounted products, we should reduce the discount to retain more profit.
There are 172 Binders transactions that were discounted by 70%. Also, the Phones and Tables
sub-categories had a 40% discount in 116 and 59 transactions, respectively. Several transactions
of Bookcases, Machines, and Copiers were discounted by 40% to 70%. The products in these
transactions should be discounted less to improve profit.
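Counts like these could be obtained by filtering on the discount and grouping by sub-category.
The sketch below uses made-up rows, and the 0.4 threshold for "heavily discounted" is an
assumption based on the 40% to 70% range mentioned above.

```python
import pandas as pd

# Made-up transactions for illustration only.
tx = pd.DataFrame({
    "Sub-Category": ["Binders", "Binders", "Phones", "Tables", "Binders", "Phones"],
    "Discount": [0.7, 0.7, 0.4, 0.4, 0.2, 0.0],
    "Profit": [-12.0, -8.5, 15.0, -40.0, 6.0, 55.0],
})

# Keep only heavily discounted transactions (threshold is an assumption).
heavy = tx[tx["Discount"] >= 0.4]

# Count transactions and total their profit per sub-category and discount level.
summary = (
    heavy.groupby(["Sub-Category", "Discount"])
    .agg(transactions=("Profit", "count"), total_profit=("Profit", "sum"))
    .sort_values("transactions", ascending=False)
)

print(summary)
```

Including the total profit next to the count shows which discount levels actually lose money,
which helps prioritize where to cut discounts first.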
7.4 Attract more customers
Figure 7.4: Table of total customers, profit, and sales over 4 years
I build a simple linear regression model to predict the number of customers needed in 2019 if
sales reach 282807 USD (an increase of 30% over last year). It turns out that around 469 customers
actively purchasing products are needed to achieve that target.
The correlation coefficient between sales and profit is 0.89, so we can also build a linear
regression model to forecast profit from sales. At the target sales of 282807 USD, the model
predicts a profit of 42442 USD, equivalent to 28.9% growth over last year.
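A simple linear regression of this kind could look like the sketch below. The yearly totals are
hypothetical placeholders, not the values from Figure 7.4, so the outputs will not reproduce the
469 customers or 42442 USD quoted above. Only the target sales figure is taken from the report.

```python
import numpy as np

# Hypothetical yearly totals (NOT the report's Figure 7.4 values).
sales = np.array([120_000.0, 150_000.0, 180_000.0, 215_000.0])
customers = np.array([250.0, 290.0, 330.0, 380.0])
profit = np.array([15_000.0, 20_000.0, 26_000.0, 32_000.0])


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


target_sales = 282_807  # USD, target stated in the report

c_slope, c_intercept = fit_line(sales, customers)
p_slope, p_intercept = fit_line(sales, profit)

predicted_customers = c_slope * target_sales + c_intercept
predicted_profit = p_slope * target_sales + p_intercept
r = np.corrcoef(sales, profit)[0, 1]  # Pearson correlation coefficient

print(f"Customers needed: {predicted_customers:.0f}")
print(f"Predicted profit: {predicted_profit:,.0f} USD (r = {r:.2f})")
```

Because the target lies well above the observed sales range, the prediction is an extrapolation
and should be treated as a rough planning figure.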
8. Customer Segmentation
Customer segmentation is the process of dividing customers into groups whose members share
similar characteristics. In this report, I use unsupervised learning, specifically the K-means
clustering method, to perform this task.
I select Sales and Profit as the two attributes for segmenting customers. Let's look at the
distributions of these variables. From Figure 8.1, the distribution of sales per customer is
highly right-skewed, while that of profit is a bell curve with an average of 133 USD.
I try cluster counts from 1 to 10 to evaluate which number of groups yields the best result. I
plot the number of clusters against the Within-Cluster Sum of Squares (WCSS). Then I apply the
Elbow method to select the number of clusters at which the decrease in WCSS begins to level off.
In this case, the “Elbow” is located between 2 and 4, indicating that 3 is a good choice.
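The Elbow procedure can be sketched with scikit-learn as follows. The customer data here is
synthetic and the library choice is an assumption, since the report does not name the
implementation. `inertia_` is scikit-learn's name for the WCSS of a fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic (Sales, Profit) per customer in three loose groups; not the report's data.
X = np.vstack([
    rng.normal([800, 50], [300, 80], size=(200, 2)),
    rng.normal([3500, 400], [600, 200], size=(60, 2)),
    rng.normal([8000, 1500], [900, 400], size=(12, 2)),
])

# Fit K-means for k = 1..10 and record the within-cluster sum of squares.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

for k, value in enumerate(wcss, start=1):
    print(f"k={k:2d}  WCSS={value:,.0f}")
```

Plotting `wcss` against `k` gives the Elbow curve. The elbow is read off visually as the point
after which adding clusters reduces WCSS only marginally.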
We can view 3 groups of customers through a scatter plot.
Cluster   Characteristics
0         Sales were less than 2000 USD, and many customers were bringing negative profit.
1         Group 1 has higher sales than group 0, but there were still a few negative profit customers.
2         Only 12 customers belong to group 2, which purchased more than 6000 USD and brought positive profit.
As we can see, the model mainly divides customers based on their sales. If we want a better
segmentation result or a more in-depth analysis, we should request that more data be collected.
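A per-cluster profile like the one in the table could be derived as sketched below. The data is
synthetic and the use of scikit-learn and unscaled features is an assumption. Note that K-means
cluster labels are arbitrary, so label 0 is not guaranteed to be the low-sales group.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic per-customer totals; not the report's data.
customers = pd.DataFrame(
    np.vstack([
        rng.normal([800, 50], [300, 80], size=(200, 2)),
        rng.normal([3500, 400], [600, 200], size=(60, 2)),
        rng.normal([8000, 1500], [900, 400], size=(12, 2)),
    ]),
    columns=["Sales", "Profit"],
)

# Assign each customer to one of 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
customers["Cluster"] = kmeans.fit_predict(customers[["Sales", "Profit"]])

# Summarize each cluster: size, sales range, and number of loss-making customers.
profile = customers.groupby("Cluster").agg(
    customers=("Sales", "count"),
    min_sales=("Sales", "min"),
    max_sales=("Sales", "max"),
    negative_profit=("Profit", lambda p: int((p < 0).sum())),
)

print(profile)
```

In this sketch the features are not rescaled, so the much larger range of Sales dominates the
Euclidean distance and the clusters split mainly along the sales axis.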
9. Conclusion
After utilizing different techniques, I have discovered the following insights:
• Shopping activities were more frequent in large states such as New York, Pennsylvania, and
Ohio.
• The number of orders was always highest in the fourth quarter, indicating that an appropriate
plan should be put in place at the end of the year to maximize this metric.
• Sales had an upward trend over years. This figure, in particular, was low in the first six
months but considerably increased in the remaining months.
• Customers had a habit of spending more at the end of year, especially purchasing
Technology products.
• Technology products contributed primarily to the profit.
• Profit ratios in the second quarter were stable, at around 16% for three consecutive years.
• The Standard Class was the most common ship mode, with delivery times of 4 to 7 days.
• Technology outperformed other categories in terms of sales due to high-value items and
minimal discounts.
• 130 products were high in sales and quantity.
• The sales prediction in 2019 is only a 5.6% growth, much lower than in prior years.
Finally, I classify customers into 3 groups, which allows the marketing department to gain a
deeper understanding of them.
10. Resources
[1] help.tableau.com. (n.d.). Get Started with Tableau Prep. [online] Available at:
https://help.tableau.com/current/prep/en-us/prep_get_started.htm [Accessed 30 Jan. 2023].
[3] Hshan.T (2021). Exploring Customers Segmentation With RFM Analysis and K-Means
Clustering. [online] The Startup. Available at:
https://medium.com/swlh/exploring-customers-segmentation-with-rfm-analysis-and-k-means-clustering-93aa4c79f7a7.
[4] Briggs, J. (2020). K-Means Clustering in Python. [online] Medium. Available at:
https://towardsdatascience.com/k-means-clustering-in-python-4061510145cc [Accessed 30
Jan. 2023].