Week 05

The document discusses the analytics process and data mining, emphasizing the importance of data quality and the steps involved in data pre-processing. It outlines different machine learning approaches, including supervised and unsupervised learning, and introduces concepts of statistical learning and model selection. Additionally, it covers prediction errors, cross-validation methods, and the bias-variance trade-off in model evaluation.


BUSINESS INTELLIGENCE & ANALYTICS

Analytics process

Saji K Mathew, PhD


Professor, Department of Management Studies
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
Mindfulness
"I just completed a thorough statistical examination of the life of President Bush. For fifty-eight years, close to 21,000 observations, he did not die once. I can hence pronounce him immortal, with a high degree of statistical significance."
Data mining process
How to decide on variables
Include a variable if:
- The variable is important in making a managerial decision (e.g., square-foot area of a sales outlet)
- The variable helps to control for important factors (e.g., seasonality)
- There aren't too many variables (parsimony, Occam's razor): is it really necessary to include this variable?
- Data for the variable are available
Data mining process
Inspect data
Data quality
Good data are characterized by (Han et al., 2012):
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Interpretability
Data problems in the real world
- Missing data
- Noise
- Inconsistency
- Units of measurement

Solution: discrepancy detection and data pre-processing
Data pre-processing
- Data cleaning: missing values (ignore the tuple, replace manually, or replace following a method); noisy data (smoothing techniques)
- Data integration: database normalization
- Data transformation: scaling, normalization
- Data reduction: aggregation
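As a minimal illustration of the cleaning and transformation steps above, here is a sketch in plain Python of mean imputation for missing values and min-max scaling; the `area_sqft` column and its values are made up for illustration:

```python
# Sketch of two pre-processing steps: mean imputation for missing
# values and min-max scaling to [0, 1]. Data are illustrative only.

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

area_sqft = [1200, None, 1500, 900]   # one missing value
cleaned = impute_mean(area_sqft)      # missing entry replaced by the mean
scaled = min_max_scale(cleaned)       # all values now lie in [0, 1]
```

In practice one would pick the imputation method per variable (mean, median, model-based), as the "replacement following a method" bullet suggests.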
Data mining process
Machine learning
- In supervised learning, for each observation of the predictor measurement(s) xi, i = 1, ..., n, there is an associated response measurement yi. E.g., regression.
- In unsupervised learning, for every observation i = 1, ..., n, we observe a vector of measurements xi but no associated response yi. E.g., clustering.
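The supervised/unsupervised contrast can be sketched with toy data (all values invented for illustration): least-squares simple regression uses the responses y, while a one-dimensional two-group split uses only the x's:

```python
# Supervised vs. unsupervised learning on toy data (values invented).

# Supervised: each x has an associated response y, and we fit
# simple linear regression y = a + b*x by least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# Unsupervised: no responses at all -- just group the observations.
# Here, a crude 1-D "clustering" by a midpoint threshold.
points = [0.9, 1.1, 5.0, 5.2]
threshold = (min(points) + max(points)) / 2
clusters = [0 if p < threshold else 1 for p in points]
```

The point of the contrast: the regression step needs the `ys`; the grouping step never looks at any response.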
Statistical vs algorithmic (CS)
Explanatory vs predictive modeling

[Figure: explanatory vs. predictive modeling, illustrated with the number of credit cards in a family]
Statistical learning
More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

Y = f(X) + ε

Statistical learning refers to a set of approaches for estimating f.
- Training criterion: residual sum of squares, RSS = Σi (yi − f̂(xi))²
- Choice of models: prediction error vs. flexibility (the bias-variance trade-off)

LHS: Different fits (linear regression (orange)


RHS: Red Line: Test MSE, Grey line: Training MSE
Prediction
Three sources of error in predicted Y:
- Reducible error, due to inaccurate estimation of f
- Irreducible error, due to randomness (ε)
- Test data variation

Reducible error can be reduced by better learning techniques.
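The reducible/irreducible split above can be written as the standard decomposition of the expected squared prediction error, assuming Y = f(X) + ε with E(ε) = 0 and treating X and the fitted f̂ as fixed:

```latex
E\big[(Y - \hat{Y})^2\big]
  = E\big[(f(X) + \varepsilon - \hat{f}(X))^2\big]
  = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}}
```

The cross term vanishes because E(ε) = 0; no amount of improvement to f̂ can remove Var(ε), which is why it is called irreducible.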
Cross validation
Training error vs. testing error (prediction error)
Mean Squared Error (MSE), a measure of testing error:

MSE = (1/n) Σi (yi − f̂(xi))²

Three kinds of cross validation:
- Test set approach
- Leave-one-out cross-validation (LOOCV)
- K-fold cross validation
The test set method
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.
[Figure: linear regression example; test-set Mean Squared Error = 2.4]
LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.
3. Train on the remaining R − 1 datapoints.
4. Note your error on (xk, yk).
When you've done all points, report the mean error.
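A minimal sketch of the LOOCV loop, again assuming simple linear regression as the learner (`loocv_mse` is an illustrative helper, not from the slides):

```python
def loocv_mse(xs, ys):
    """Leave-one-out CV: for each record k, fit simple linear regression
    on the other R-1 points and record the squared error on (xk, yk)."""
    errs = []
    for k in range(len(xs)):
        # Steps 1-2: temporarily remove the kth record.
        tx = [x for i, x in enumerate(xs) if i != k]
        ty = [y for i, y in enumerate(ys) if i != k]
        # Step 3: train on the remaining R-1 datapoints.
        xb, yb = sum(tx) / len(tx), sum(ty) / len(ty)
        b = sum((x - xb) * (y - yb) for x, y in zip(tx, ty)) \
            / sum((x - xb) ** 2 for x in tx)
        a = yb - b * xb
        # Step 4: note the error on the held-out record.
        errs.append((ys[k] - (a + b * xs[k])) ** 2)
    return sum(errs) / len(errs)   # mean error over all points
```

Note the cost: the model is refit R times, once per record, which is why k-fold with small k is often preferred on large datasets.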
k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, coloured red, green, and blue).
- For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
- For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
- For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.
Then report the mean error.
[Figure: linear regression example; MSE (3-fold) = 2.05]
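The three-partition procedure generalizes to any k; here is a sketch in plain Python with simple linear regression as the learner (`kfold_mse` is an illustrative name, not from the slides):

```python
import random

def kfold_mse(xs, ys, k=3, seed=0):
    """k-fold CV: randomly partition the data into k folds; for each fold,
    train on the other folds, sum squared errors on the held-out fold,
    and report the mean per-point error over all folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal partitions
    total, n = 0.0, 0
    for fold in folds:
        # Train on all the points not in this partition.
        train = [i for i in idx if i not in fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        xb, yb = sum(tx) / len(tx), sum(ty) / len(ty)
        b = sum((x - xb) * (y - yb) for x, y in zip(tx, ty)) \
            / sum((x - xb) ** 2 for x in tx)
        a = yb - b * xb
        # Find the test-set sum of errors on the held-out points.
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
        n += len(fold)
    return total / n
```

With k = 3 this mirrors the red/green/blue example above; LOOCV is the special case k = R.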
