IBM Slide CHAPTER 3
• Figure 1 above shows how BAO technology drives the flow and transformation of data,
refining raw data into insights and actionable decisions.
• As Figure 1 shows, organizations today gather massive amounts of raw data from various
sources, which can be dynamic. Information that is available to businesses has grown
exponentially in the past several decades. Although most executives are convinced of its
value, their enterprises need sophisticated analytics to capture that value.
• This data feeds business intelligence and performance management capabilities to
generate information that promotes awareness and measures the state of the business.
Additional analysis of information and data can generate insights.
• Advanced analytics can further process the information to provide foresight that
supports key decisions by business users. Input to transactional systems drives action
within the business environment.
• Enterprise information and content management solutions can help establish
performance measurements and key performance indicators that help
management and personnel take action on the data.
The Need for BAO Now
IBM ICE (Innovation Centre for Education)
• The application of business analytics is opening up important new possibilities for clients
and promises to transform the way consulting is practiced.
• BAO has defined the following competency areas that bring together critical skills that
are necessary to define and drive IBM leadership in the growing analytics market.
• Through these competencies, our clients can operate at a new level of intelligence and
achieve “breakaway” levels by using the following:
• BAO Strategy
• Business Intelligence and Performance Management
• Advanced Analytics and Optimization
• Enterprise Information Management
• Enterprise Content Management
• By using the BAO Strategy, clients can achieve business objectives faster, with
less risk, and at a lower cost by defining and helping to implement
improvements in how information is identified and acted upon. Applied
enterprise-wide and deep within a business function, this strategy addresses
both what to do and how to do it with actions that span policy, analytics,
business process, organization, applications, and data.
• An organization is made up of people with different skills and roles, all trying to “pull in the same direction” with
the goal of optimizing business performance. Each of these people requires a different level of information
and detail in order to make decisions that affect performance. Only IBM offers the complete range of
integrated Business Analytics capabilities to address the needs of your people.
• Through highly visual scorecards, dashboards, reports, and real-time activity monitoring, decision makers
gain immediate insight into the health of the business and can understand what is happening in their
area of the business.
• By analyzing trends, statistics, correlation, and context, decision makers can understand what leads to the best
outcomes and discover why things are on or off track.
• Knowing what is likely to happen equips decision makers with the foresight they need to intervene.
Simulation through predictive modeling and “what-if” analysis enables decision makers to predict and act,
changing course to improve outcomes. Financial and operational planning, budgeting, and forecasting
put resources in the right place and set targets for those allocations.
• Everyone in the organization can be confident in common, consistent, and trusted data. IBM allows you to
pull data from a range of systems and makes it easier to turn this data into information. Whatever
their knowledge level, everyone will be able to consume the information in a manner that is relevant to
them.
• The right information, delivered in the right way to the right people at the right time, leads to optimized decision
making.
BAO Capabilities: Predictive Analysis and Mining
What is Cluster Analysis?
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
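Two of the choices above can be made concrete with a small sketch: a distance-based similarity measure (Euclidean) combined with exclusive, single-level partitioning, where each point is assigned to exactly one cluster. The points and cluster centers below are hypothetical values chosen for illustration; they are not from the slides, and nearest-center assignment is only one of many ways to form clusters.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_exclusive(points, centers):
    """Exclusive partitioning: each point gets the index of its single nearest center."""
    return [
        min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        for p in points
    ]

# Hypothetical 2-D data (full-space clustering, low dimensional).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
centers = [(1.0, 1.5), (8.5, 8.0)]

labels = assign_exclusive(points, centers)
print(labels)
```

A non-exclusive scheme would instead return, for each point, a set of clusters or a membership weight per cluster.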
What are Association Rules?
• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click
stream) analysis, and DNA sequence analysis.
Generating Rules - Term
• Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
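The brute-force steps can be sketched as follows, using the five-transaction market-basket data from this chapter. The thresholds minsup = 0.4 and minconf = 0.6 are assumed values for illustration, and the function names are hypothetical. Even with only six distinct items, the enumeration visits several hundred candidate rules, which is why the approach does not scale.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def all_rules(items):
    """Yield every rule X -> Y with X and Y non-empty and disjoint."""
    items = sorted(items)
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            z = frozenset(combo)
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    x = frozenset(lhs)
                    yield x, z - x

def brute_force(minsup, minconf):
    """List all rules, score each one, and prune those below the thresholds."""
    items = set().union(*transactions)
    n = len(transactions)
    total, kept = 0, []
    for x, y in all_rules(items):
        total += 1
        both, lhs = count(x | y), count(x)
        support = both / n
        confidence = both / lhs if lhs else 0.0  # antecedent may never occur
        if support >= minsup and confidence >= minconf:
            kept.append((x, y, support, confidence))
    return total, kept

# Assumed thresholds, for illustration only.
total, kept = brute_force(minsup=0.4, minconf=0.6)
print(total, "candidate rules examined;", len(kept), "survive pruning")
```

For d items the number of candidate rules is 3^d - 2^(d+1) + 1, so the count grows exponentially with the number of distinct items.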
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
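The support and confidence figures for these rules can be reproduced with a short script. This is a minimal sketch; the helper names are hypothetical. It enumerates every binary partition of {Milk, Diaper, Beer} and shows that support is fixed by the itemset while confidence depends on which side each item falls.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

itemset = frozenset({"Milk", "Diaper", "Beer"})
support = count(itemset) / len(transactions)  # same for every rule below

results = {}
for k in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), k):
        x = frozenset(lhs)
        y = itemset - x
        confidence = count(itemset) / count(x)
        results[(x, y)] = round(confidence, 2)
        print(sorted(x), "->", sorted(y), f"s={support}, c={confidence:.2f}")
```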
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary
partitioning of a frequent itemset
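The two steps can be sketched on the market-basket data from this chapter. The thresholds (minsup = 0.4, minconf = 0.6) are assumed for illustration, and the function names are hypothetical. Step 1 here still enumerates every itemset naively; a practical miner would prune that search, but the sketch shows how separating the support test from the confidence test shrinks the set of rules to score.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(minsup):
    """Step 1: keep every itemset whose support is at least minsup."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            c = count(frozenset(combo))
            if c / N >= minsup:
                frequent[frozenset(combo)] = c
    return frequent

def rules_from(frequent, minconf):
    """Step 2: binary-partition each frequent itemset, keep high-confidence rules."""
    rules = []
    for itemset, c in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                x = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its count is already stored.
                confidence = c / frequent[x]
                if confidence >= minconf:
                    rules.append((x, itemset - x, c / N, confidence))
    return rules

# Assumed thresholds, for illustration only.
frequent = frequent_itemsets(minsup=0.4)
rules = rules_from(frequent, minconf=0.6)
print(len(frequent), "frequent itemsets;", len(rules), "rules kept")
```

Because all rules from one itemset share its support, the support test runs once per itemset instead of once per rule.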
[Figure: itemset lattice over items A to E, showing the single items (A, B, C, D, E), the pairs (AB, AC, ..., DE), and the triples (ABC, ABD, ..., CDE).]
Prediction Problems: Classification vs. Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute and
uses it in classifying new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
• Credit/loan approval
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the
class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical
formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of each test sample is compared with the classified result from the
model
• Accuracy rate is the percentage of test set samples that are correctly classified by
the model
• The test set must be independent of the training set (otherwise overfitting inflates the accuracy estimate)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
[Figure: training data feeds the classification algorithms, which produce a classifier; the classifier is then applied to testing data and to unseen data such as (Jeff, Professor, 4) to answer "Tenured?"]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
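The model-usage step can be sketched with the four labeled rows above, treated here as a test set. The classification rule in the code (tenured if rank is Professor or years exceed 6) is a hypothetical model assumed for illustration; the slides do not state which model the classifier learned. The sketch estimates accuracy on the labeled rows, then classifies the unseen tuple (Jeff, Professor, 4).

```python
# Labeled rows from the table, assumed to serve as the test set.
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    """Hypothetical rule standing in for a learned classifier."""
    return "yes" if rank == "Professor" or years > 6 else "no"

def accuracy(rows):
    """Fraction of rows whose known label matches the classifier's output."""
    correct = sum(1 for _, rank, years, label in rows if classify(rank, years) == label)
    return correct / len(rows)

acc = accuracy(test_set)
jeff = classify("Professor", 4)
print(f"accuracy on test set: {acc:.0%}")
print("Jeff tenured?", jeff)
```

Under this assumed rule, Merlisa is the one misclassified row: seven years of service triggers a "yes" although her known label is "no".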
Linear Regression
• Linear dependence: constant rate of increase of one variable with respect
to another (as opposed to, e.g., diminishing returns).
• Regression analysis describes the relationship between two (or more)
variables.
• Examples:
• Income and educational level
• Demand for electricity and the weather
• Home sales and interest rates
• Our focus:
• Gain some understanding of the mechanics:
• the regression line
• regression error
• Learn how to interpret and use the results.
• Learn how to set up a regression analysis.
Two Main Questions:
• Prediction and forecasting
• Predict home sales for December given the interest rate for this month.
• Use time series data (e.g., sales vs. year) to forecast future performance
(next year sales).
• Predict the selling price of houses in some area.
• Collect data on several houses (number of bedrooms, number of bathrooms, square footage, lot size,
property tax) and their selling price.
• Can we use this data to predict the selling price of a specific house?
• Quantifying causality
• Determine factors that relate to the variable to be predicted; e.g., predict
growth for the economy in the next quarter: use past history on
quarterly growth, index of leading economic indicators, and others.
• Want to determine advertising expenditure and promotion for the 1999
Ford Explorer.
• Sales over a quarter might be influenced by: ads in print, ads in radio, ads in
TV, and other promotions.
Motivating Example
• Predict the selling prices of houses in the region.
• Intuitively, we should compare the house for which we need a predicted selling price with houses
that have sold recently in the same area, of roughly the same size, same style, etc.
• Idea: Treat it as a multiple sample problem.
• Unfortunately, the list of houses meeting these criteria may be quite small, or there may not be a house
with exactly the same characteristics.
• Alternative approach: Consider the factors that determine the selling price of a house in this region.
• Collect recent historical data on selling prices, and a number of characteristics
about each house sold (size, age, style, etc.).
• Idea: Treat it as a one sample problem.
• To predict the selling price of a house without any particular knowledge of the house, we use the
average selling price of all of the houses in the data set.
• Better idea:
• One of the factors that cause houses in the data set to sell for different amounts of money is the fact that
houses come in various sizes.
• A preliminary model might posit that the average value per square foot of a new house is $40 and that
the average lot sells for $20,000. The predicted selling price of a house of size X (in square feet) would be:
20,000 + 40X.
• A house of 2,000 square feet would be estimated to sell for 20,000 + 40(2,000) = $100,000.
Motivating Example
• Probability Model:
• We know, however, that this is just an approximation, and the selling price of this
particular house of 2,000 square feet is not likely to be exactly $100,000.
• Prices for houses of this size may actually range from $50,000 to $150,000.
• In other words, the deterministic model is not really suitable. We should therefore
consider a probabilistic model.
• Let Y be the actual selling price of the house. Then
Y = 20,000 + 40X + ε,
where ε (Greek letter epsilon) represents a random error term (which
might be positive or negative).
• If the error term is usually small, then we can say the model is a good one.
• The random term, in theory, accounts for all the variables that are not part of the
model (for instance, lot size, neighborhood, etc.).
• The value of ε will vary from sale to sale, even if the house size remains constant.
That is, houses of the exact same size may sell for different prices.
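The probabilistic model can be illustrated by simulating sales from it and then recovering the regression line from the simulated data. This is a sketch under stated assumptions: the error term is drawn from a normal distribution with a standard deviation of $15,000, house sizes are drawn uniformly between 1,000 and 3,000 square feet, and the line is fitted by ordinary least squares. None of these choices come from the slides, and the function names are hypothetical.

```python
import random

def simulate_sales(n, seed=0, noise_sd=15_000):
    """Draw n (size, price) pairs from Y = 20,000 + 40X + epsilon.

    Assumptions: sizes are uniform on [1000, 3000] square feet and
    epsilon is normal with mean 0 and standard deviation noise_sd.
    """
    rng = random.Random(seed)
    sizes = [rng.uniform(1_000, 3_000) for _ in range(n)]
    prices = [20_000 + 40 * x + rng.gauss(0, noise_sd) for x in sizes]
    return sizes, prices

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

sizes, prices = simulate_sales(200)
intercept, slope = fit_line(sizes, prices)

# Regression error: the gap between each observed price and the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(sizes, prices)]

print(f"fitted line: {intercept:,.0f} + {slope:.1f} X")
print(f"prediction for 2,000 sq ft: {intercept + slope * 2_000:,.0f}")
```

The fitted intercept and slope will be close to, but not exactly, 20,000 and 40, because the error term shifts each simulated sale away from the deterministic line.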
IBM Business Analytics Maturity Model
• The IBM Business Analytics Maturity Model identifies five stages of information and analytics
maturity and maps them to the maturity of business operations within an
organization:
• Ad hoc: At this stage, information resides mainly in spreadsheets and extracts, and most
analysis is ad hoc. Business operations follow a high command-and-control
structure.
• Foundational: At this stage, the organization sets up data warehouses, governance models, and
production reporting processes. Business operations are generally driven by task integration
(most likely through ERP systems).
• Competitive: At this stage, Master Data Management (MDM), dashboards, and
scorecards are widely used within the organization, and most business operations run on
process automation and workflow methodologies.
• Differentiating: At this stage, organizations start using predictions, contextual business rules, and
patterns. Business operations run on complete business process integration and
collaboration (such as CRM).
• Breakaway: At this highest stage of maturity, organizations use a very high level of
internally built analytical capability, applied as prescriptive, real-time, pattern-based
strategies with situational context.
Advantages to Implementing BAO Solutions
• Organizations that are just beginning to develop analytical decision support often
start with ad hoc tools, such as spreadsheets and SQL. These tools have the
advantages of simplicity, low cost, and flexibility, and therefore, encourage
experimentation.