Unit 3 Mid Emerging Areas of Analytics
Procter and Gamble who use analytics to create products and solutions. Marshall
(2016) and MacKenzie et al. (2013) reported that Amazon’s recommender systems
resulted in a sales increase of 35%.
❖ Davenport and Harris (2007) and Hopkins et al. (2010) reported that there was a high
correlation between use of analytics and business performance. They claimed that the
majority of high performers (measured in terms of profit, shareholder return and
revenue, etc.) strategically apply analytics in their daily operations, as compared to low
performers.
A few of the problems that e-commerce companies such as Amazon and Flipkart try to address
are as follows:
1. Forecasting demand for products directly sold by the company; excess inventory
and shortage can impact both the top line and the bottom line.
2. Cancellation of orders placed by customers before their delivery. The ability to predict
cancellations and intervene can save the cost incurred on unnecessary logistics.
3. Fraudulent transactions resulting in financial loss to the company.
4. Predicting delivery time since it is an important service level agreement from the
customer perspective.
5. Predicting what a customer is likely to buy in future to create recommender systems.
❖ Analytics is used to solve a wide range of problems, from simple process
improvement such as reducing procurement cycle time to complex decision-making
problems such as farm advisory systems that involve accurate weather prediction and
commodity price forecasting, so that farmers can be advised about crop selection,
crop rotation, etc.
❖ Figure 1.2 shows the pyramid of analytics applications. At the bottom of the pyramid,
analytics is used for process improvement; at the top, it is used for decision making
and as a competitive strategy.
WHY ANALYTICS
❖ According to the theory of the firm (Coase, 1937; Fama, 1980), as proposed by several
economists, firms exist to minimize transaction costs. Transactions take place when
goods or services are transferred to customers from the supplier.
❖ The cost of decision making is an important element of transaction cost. Michalos
(1970) groups the costs of decision making into three categories:
1. Cost of reaching a decision with the help of a decision maker or procedure; this
is also known as production cost, that is, cost of producing a decision.
argument to prove that the probability of winning increases to 2/3 when the contestant changes
his/ her initial choice, many scholars did not accept her argument that changing the initial
option is the right decision.
❖ Table 1.1 shows why changing the initial option increases the probability of winning.
The expensive item can be behind any one of the three doors as shown in Table 1.1
(rows 2−4).
❖ Assume that the contestant has chosen door 1 initially. Columns 4 and 5 (last row) give
the probability of winning the car if the contestant stays with door 1 (column 4) or
switches away from door 1 (column 5), respectively.
❖ The above argument can be extended to any number of doors without loss of generality.
In the case of the Monty Hall problem, the number of alternatives available to the player is
just two. Even when the number of options is only 2, many find it difficult to
comprehend that changing the initial choice will increase the probability of winning.
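The 1/3 versus 2/3 result can be checked by enumerating every case exactly. The sketch below is an illustration, not part of the original notes: it assumes the standard three-door rules (the host always opens an empty door that the contestant did not pick, choosing at random when two such doors exist), and the function name `win_probabilities` is made up for this example.

```python
from fractions import Fraction

DOORS = (1, 2, 3)

def win_probabilities(initial_pick=1):
    """Return (P(win | stay), P(win | switch)) by exact enumeration.

    Assumption: the car is equally likely to be behind each door, and the
    host opens an empty, unpicked door uniformly at random.
    """
    stay = Fraction(0)
    switch = Fraction(0)
    for car in DOORS:
        p_car = Fraction(1, len(DOORS))
        # Doors the host may open: not the contestant's pick, not the car.
        host_options = [d for d in DOORS if d != initial_pick and d != car]
        for opened in host_options:
            p = p_car / len(host_options)
            # The only closed door left besides the initial pick.
            switched_to = next(d for d in DOORS if d not in (initial_pick, opened))
            if initial_pick == car:
                stay += p
            if switched_to == car:
                switch += p
    return stay, switch

stay, switch = win_probabilities(initial_pick=1)
print(stay, switch)  # 1/3 2/3
```

Using `Fraction` keeps the arithmetic exact, so the output matches the last row of Table 1.1 without any rounding.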
DESCRIPTIVE ANALYTICS
“If the statistics are boring, then you’ve got the wrong numbers”. —Edward R. Tufte
❖ Descriptive analytics is the simplest form of analytics that mainly uses simple
descriptive statistics, data visualization techniques, and business related queries to
understand past data.
❖ One of the primary objectives of descriptive analytics is to find innovative ways of
summarizing data. Descriptive analytics is used for understanding the trends in past
data, which can be useful for generating insights.
❖ Figure 1.5 shows a visualization of relationship breakups reported on Facebook. It is
clear from Figure 1.5 that spikes in breakups occurred during spring break and in
December before Christmas. There could be many reasons for the increase in breakups
during December (we hope it is not a New Year resolution to change partners). Many
believe that since December is a holiday season, couples get a lot of time to talk to
each other, and that is probably where the problem starts.
❖ However, descriptive analytics is not about why a pattern exists, but about what the
pattern means for a business.
❖ From the fact that there is a significant increase in breakups during December, we can
deduce the following insights (or possibilities):
1. There will be more traffic to online dating sites during December/January.
2. There will be greater demand for relationship counsellors and lawyers.
3. There will be greater demand for housing and the housing prices are likely to increase
in December/January.
4. There will be greater demand for household items.
5. People would like to forget the past, so they might change the brand of beer they
drink.
Descriptive analytics using visualization identifies trends in the data and connects the dots
to gain insights about associated businesses.
❖ In 2002, Target hired statistician Andrew Pole; one of his assignments was to predict
whether a customer is pregnant (Duhigg, 2012). At the outset, the question posed by
the marketing department to Pole may look bizarre, but it made great business sense.
❖ Any marketer would like to identify the price-insensitive customers among the
shoppers, and who can beat soon-to-be parents? A list of interesting applications of
predictive analytics is presented in Table 1.2.
PRESCRIPTIVE ANALYTICS
Every decision has a consequence. —Damon Darrel
❖ Prescriptive analytics is the highest level of analytics capability which is used for
choosing optimal actions once an organization gains insights through descriptive and
predictive analytics.
❖ In many cases, prescriptive analytics is solved as a separate optimization problem.
Prescriptive analytics assists users in finding the optimal solution to a problem or in
making the right choice/decision among several alternatives.
❖ Operations Research (OR) techniques form the core of prescriptive analytics. Apart
from operations research techniques, machine learning algorithms, metaheuristics,
and advanced statistical models are used in prescriptive analytics.
❖ Note that actionable items can be derived directly after descriptive and predictive
analytics model development; however, they may not be the optimal action.
metrics such as overall accuracy, sensitivity, specificity, and area under the receiver operating
characteristic curve (AUC).
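As a minimal sketch of how the first three of these metrics are computed from a binary confusion matrix (the counts below are hypothetical and chosen only for illustration; AUC is omitted because it needs predicted scores rather than counts):

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: actual positives predicted positive/negative.
    fp/tn: actual negatives predicted positive/negative.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total      # share of all cases classified correctly
    sensitivity = tp / (tp + fn)      # true positive rate (recall)
    specificity = tn / (tn + fp)      # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts, not taken from the notes.
print(classification_metrics(tp=40, fn=10, fp=5, tn=45))  # (0.85, 0.8, 0.9)
```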
Pattern Discovery
Revise the Association Rule Mining notes and understand the algorithms presented
below.
1. Suppose you have the set C of all frequent closed itemsets on a data set D, as well as
the support count for each frequent closed itemset. Describe an algorithm to
determine whether a given itemset X is frequent or not, and the support of X if it is
frequent.
Answer:
Algorithm: Itemset Freq Tester. Determine if an itemset is frequent.
Input: C, set of all frequent closed itemsets along with their support counts; test itemset, X.
Output: Support of X if it is frequent, otherwise -1.
Method:
(1) s = ∅;
(2) for each itemset, l ∈ C
(3)     if X ⊆ l and (length(l) < length(s) or s = ∅) then {
(4)         s = l;
(5)     }
(6) if s ≠ ∅ then {
(7)     return support(s);
(8) }
(9) return -1;
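A Python sketch of the Itemset Freq Tester is given below. It relies on the fact that the support of X equals the support of the smallest closed itemset containing X. The containment test is non-strict here so that an X that is itself closed is also matched. The function name, the dictionary representation of C, and the toy data are assumptions made for illustration.

```python
def itemset_support(closed_itemsets, x):
    """Return the support count of x, or -1 if x is not frequent.

    closed_itemsets: dict mapping each frequent closed itemset (a frozenset)
    to its support count.
    """
    x = frozenset(x)
    smallest = None
    for itemset in closed_itemsets:
        # Keep the shortest closed itemset that contains x.
        if x <= itemset and (smallest is None or len(itemset) < len(smallest)):
            smallest = itemset
    return closed_itemsets[smallest] if smallest is not None else -1

# Hypothetical data set D = [{a,b,c}, {a,b,c}, {a,b}, {a}], min support count 2.
# Its frequent closed itemsets and their support counts are:
closed = {
    frozenset("a"): 4,
    frozenset("ab"): 3,
    frozenset("abc"): 2,
}
print(itemset_support(closed, {"b"}))  # 3, since {b} occurs exactly where {a, b} does
print(itemset_support(closed, {"d"}))  # -1, no closed superset exists
```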
2. An itemset X is called a generator on a data set D if there does not exist a proper
sub-itemset Y ⊂ X such that support(X) = support(Y). A generator X is a frequent
generator if support(X) passes the minimum support threshold. Let G be the set of all
frequent generators on a data set D. (a) Can you determine whether an itemset A is
frequent and the support of A, if it is frequent, using only G and the support counts
of all frequent generators? If yes, present your algorithm. Otherwise, what other
information is needed? Can you give an algorithm assuming the information needed
is available? (b) What is the relationship between closed itemsets and generators?
Algorithm: InferSupport. Determine if an itemset is frequent.
Input:
• l is an itemset;
• FG is the set of frequent generators;
• PBd(FG) is the positive border of FG;
Output: Support of l if it is frequent, otherwise -1.
Method:
3. A database has four transactions. Let min_sup = 60% and min_conf = 80%.
(a) At the granularity of item category (e.g., item_i could be “Milk”), for the following
rule template,
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
list the frequent k-itemset for the largest k and all of the strong association rules (with
their support s and confidence c) containing the frequent k-itemset for the largest k.
k = 3 and the frequent 3-itemset is {Bread, Milk, Cheese}. The rules are
Bread ∧ Cheese ⇒ Milk, [75%, 100%]
Cheese ∧ Milk ⇒ Bread, [75%, 100%]
Cheese ⇒ Milk ∧ Bread, [75%, 100%]
(b) At the granularity of brand-item category (e.g., item_i could be “Sunset-Milk”), for the
following rule template,
∀X ∈ customer, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3)
list the frequent k-itemset for the largest k. Note: do not print any rules.
k = 3 and the frequent 3-itemsets are {(Wheat-Bread, Dairyland-Milk, Tasty-Pie),
(Wheat-Bread, Sunset-Milk, Dairyland-Cheese)}.
4. Give a short example to show that items in a strong association rule may actually be
negatively correlated.
Consider the following table:
Let the minimum support be 40%. Let the minimum confidence be 60%. A ⇒ B is a
strong rule because it satisfies minimum support and minimum confidence with a
support of 65/150 = 43.3% and a confidence of 65/105 = 61.9%. However, the
correlation (lift) between A and B is corr(A, B) = P(A ∪ B)/(P(A) × P(B)) =
0.433/(0.700 × 0.667) = 0.928, which is less than 1, meaning that the occurrence of A
is negatively correlated with the occurrence of B.
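The arithmetic can be reproduced with a short sketch. The counts used are the ones implied by the figures quoted in the answer: 150 transactions in total, 65 containing both A and B, 0.700 × 150 = 105 containing A, and 0.667 × 150 ≈ 100 containing B. The function name `rule_metrics` is an illustrative choice.

```python
def rule_metrics(n_total, n_a, n_b, n_ab):
    """Support, confidence and lift of the rule A => B from raw counts."""
    support = n_ab / n_total
    confidence = n_ab / n_a                      # P(B | A)
    lift = support / ((n_a / n_total) * (n_b / n_total))
    return support, confidence, lift

support, confidence, lift = rule_metrics(n_total=150, n_a=105, n_b=100, n_ab=65)
print(round(support, 3), round(confidence, 3), round(lift, 3))  # 0.433 0.619 0.929
```

The rule clears both thresholds (43.3% ≥ 40%, 61.9% ≥ 60%) even though the lift is below 1. The unrounded lift is about 0.929; the 0.928 quoted in the answer comes from using the rounded probabilities.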
5. The following contingency table summarizes supermarket transaction data, where hot
dogs refers to the transactions containing hot dogs, ¬hot dogs refers to the transactions
that do not contain hot dogs, hamburgers refers to the transactions containing
hamburgers, and ¬hamburgers refers to the transactions that do not contain hamburgers.
(a) Suppose that the association rule “hot dogs ⇒ hamburgers” is mined. Given a
minimum support threshold of 25% and a minimum confidence threshold of 50%,
is this association rule strong?
(b) Based on the given data, is the purchase of hot dogs independent of the
purchase of hamburgers? If not, what kind of correlation relationship exists
between the two?
(c) Compare the use of the all confidence, max confidence, Kulczynski, and
cosine measures with lift and correlation on the given data.
(a) Suppose that the association rule “hotdogs ⇒ hamburgers” is mined. Given a minimum
support threshold of 25% and a minimum confidence threshold of 50%, is this association
rule strong?
For the rule, support = 2000/5000 = 40%, and confidence = 2000/3000 = 66.7%. Therefore,
the association rule is strong.
(b) Based on the given data, is the purchase of hotdogs independent of the purchase of
hamburgers?
If not, what kind of correlation relationship exists between the two?
corr(hot dogs, hamburgers) = P({hot dogs, hamburgers})/(P({hot dogs}) ×
P({hamburgers})) = 0.4/(0.6 × 0.5) = 1.33 > 1. So, the purchase of hot dogs is NOT independent
of the purchase of hamburgers.
There exists a POSITIVE correlation between the two.
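For part (c), the measures can be computed side by side. This is a sketch using their standard definitions. The counts are taken from the answers above where stated (5000 transactions, 3000 with hot dogs, 2000 with both); the 2500 transactions with hamburgers is an assumption derived from the probability 0.5 used in the lift calculation, since the contingency table itself is not reproduced here.

```python
from math import sqrt

def pattern_measures(n_total, n_a, n_b, n_ab):
    """Lift and four null-invariant measures for a pair of itemsets A and B."""
    conf_a_b = n_ab / n_a   # P(B | A)
    conf_b_a = n_ab / n_b   # P(A | B)
    return {
        "lift": (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total)),
        "all_confidence": min(conf_a_b, conf_b_a),
        "max_confidence": max(conf_a_b, conf_b_a),
        "kulczynski": (conf_a_b + conf_b_a) / 2,
        "cosine": n_ab / sqrt(n_a * n_b),
    }

measures = pattern_measures(n_total=5000, n_a=3000, n_b=2500, n_ab=2000)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
# lift: 1.333, all_confidence: 0.667, max_confidence: 0.800,
# kulczynski: 0.733, cosine: 0.730
```

Under these counts all four null-invariant measures are above 0.5, which agrees with the positive association indicated by the lift. Unlike lift, they do not depend on the total number of transactions.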