
Confusion matrix and ROC - I

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Confusion matrix
• Receiver operating characteristics curve

2
Why Evaluate?

• Multiple methods are available to classify or predict


• For each method, multiple choices are available for settings
• To choose best model, need to assess each model’s performance

3
Accuracy Measures (Classification)

Misclassification error
• Error = classifying a record as belonging to one class when it belongs to
another class.

• Error rate = percent of misclassified records out of the total records in the
validation data

4
Confusion Matrix

Classification Confusion Matrix


Predicted Class
Actual Class 1 0
1 201 85
0 25 2689

201 1’s correctly classified as “1”


85 1’s incorrectly classified as “0”
25 0’s incorrectly classified as “1”
2689 0’s correctly classified as “0”

5
Error Rate
Classification Confusion Matrix
Predicted Class
Actual Class 1 0
1 201 85
0 25 2689

Overall error rate = (25+85)/3000 = 3.67%


Accuracy = 1 – error rate = (201 + 2689)/3000 = 96.33%
If multiple classes, error rate is:
(sum of misclassified records)/(total records)

6
Cutoff for classification
Most algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly

• Default cutoff value is 0.50


If >= 0.50, classify as “1”
If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50

7
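As a minimal sketch of this two-step process (assuming the predicted probabilities are already in a NumPy array; the variable names are illustrative):

```python
import numpy as np

# Hypothetical predicted probabilities of belonging to class "1"
probs = np.array([0.996, 0.988, 0.762, 0.506, 0.471, 0.218, 0.048])

cutoff = 0.50                            # the default cutoff value
labels = (probs >= cutoff).astype(int)   # 1 if prob >= cutoff, else 0
print(labels)                            # [1 1 1 1 0 0 0]
```

Changing `cutoff` to another value (say 0.80) reclassifies the same records without refitting the model.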
Cutoff Table
(24 records sorted by predicted probability of class "1", shown in two columns)

Actual Class  Prob. of "1"      Actual Class  Prob. of "1"
1             0.996             1             0.506
1             0.988             0             0.471
1             0.984             0             0.337
1             0.980             1             0.218
1             0.948             0             0.199
1             0.889             0             0.149
1             0.848             0             0.048
0             0.762             0             0.038
1             0.707             0             0.025
1             0.681             0             0.022
1             0.656             0             0.016
0             0.622             0             0.004

• If the cutoff is 0.50: 13 records are classified as “1”
• If the cutoff is 0.80: seven records are classified as “1”
8
Confusion Matrix for Different Cutoffs
Cutoff probability for success: 0.25

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 11 1
non-owner 4 8

Cutoff probability for success: 0.75

Classification Confusion Matrix


Predicted Class

Actual Class owner non-owner

owner 7 5
non-owner 1 11

9
Compute Outcome Measures

10
When One Class is More Important
In many cases it is more important to identify members of one class

– Tax fraud
– Credit default
– Response to promotional offer
– Detecting electronic network intrusion
– Predicting delayed flights

In such cases, we are willing to tolerate greater overall error, in return for better identifying the important class for further attention

11
ROC curves

• ROC = Receiver Operating Characteristic


• Started in electronic signal detection theory (1940s - 1950s)
• Has become very popular in biomedical applications, particularly
radiology and imaging
• Also used in machine learning applications to assess classifiers
• Can be used to compare tests/procedures
ROC curves: simplest case

• Consider diagnostic test for a disease


• Test has 2 possible outcomes:
– ‘positive’ = suggesting presence of disease
– ‘negative’

• An individual can test either positive or negative for the disease


ROC Analysis
• True Positives = Test states you have the disease when you do have the
disease
• True Negatives = Test states you do not have the disease when you do not
have the disease
• False Positives = Test states you have the disease when you do not have
the disease
• False Negatives = Test states you do not have the disease when you do have the disease
Specific Example

[Figure: two overlapping distributions of a test result, one for patients with the disease and one for patients without the disease. A threshold on the test result divides patients called “negative” (left) from patients called “positive” (right).]

Some definitions ...

[Figures: the same overlapping distributions, highlighting each region. True positives: patients with the disease whose test result falls above the threshold. False positives: patients without the disease above the threshold. True negatives: patients without the disease below the threshold. False negatives: patients with the disease below the threshold.]

Moving the Threshold: right

[Figure: shifting the threshold toward “+” produces fewer false positives but more false negatives.]

Moving the Threshold: left

[Figure: shifting the threshold toward “-” produces more false positives but fewer false negatives.]
Threshold Value
• The outcome of a logistic regression model is a probability
• Often, we want to make a binary prediction
• We can do this using a threshold value t
• If P(y = 1) ≥ t, predict positive; if P(y = 1) < t, predict negative
• What value should we pick for t?

23
Threshold Value
• Often selected based on which errors are “better”
• If t is large, predict positive rarely (when P(y=1) is large)
– More errors where we say negative, but it is actually positive
– Detects only the patients most likely to be positive
• If t is small, predict negative rarely (when P(y=1) is small)
– More errors where we say positive, but it is actually negative
– Detects all patients who are positive
• With no preference between the errors, select t = 0.5
– Predicts the more likely outcome

24
Selecting a Threshold Value

• Compare actual outcomes to predicted outcomes using a confusion matrix


(classification matrix)

25
True disease state vs. Test result

                      Test
              not rejected           rejected
No disease    ☺ correct:             X Type I error
(D = 0)       specificity            (False +)

Disease       X Type II error        ☺ Power = 1 − β;
(D = 1)       (False −)              sensitivity

Classification matrix: Meaning of each cell

27
Alternate Accuracy Measures

If “C1” is the important class,


Sensitivity = % of “C1” class correctly classified
Sensitivity = n1,1 / (n1,0+ n1,1 )
Specificity = % of “C0” class correctly classified
Specificity = n0,0 / (n0,0+ n0,1 )
False positive rate = % of predicted “C1’s” that were not “C1’s”
False negative rate = % of predicted “C0’s” that were not “C0’s”

28
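As a small sketch of these measures, using the counts from the confusion matrix shown earlier (the cell names follow the slide's n notation):

```python
# Counts from the earlier confusion matrix, with class "1" as "C1"
n11, n10 = 201, 85    # actual 1s: predicted 1, predicted 0
n00, n01 = 2689, 25   # actual 0s: predicted 0, predicted 1

sensitivity = n11 / (n10 + n11)   # fraction of the important class caught
specificity = n00 / (n00 + n01)   # fraction of the other class caught
print(round(sensitivity, 3), round(specificity, 3))  # 0.703 0.991
```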
Receiver Operator Characteristic (ROC) Curve

• True positive rate (sensitivity) on y-axis


– Proportion of positives labelled as positive
• False positive rate (1-specificity) on x-axis
– Proportion of negative labelled as positive
• Low Threshold
– Low specificity
– High sensitivity

29
Selecting a Threshold using ROC

• Captures all thresholds simultaneously


• High threshold
– High specificity
– Low sensitivity
• Low Threshold
– Low specificity
– High sensitivity

30
Thank You

31
Confusion Matrix and ROC-II

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Receiver operating characteristics curve


• Optimum threshold value

2
ROC analysis
• True Positive Fraction
– TPF = TP / (TP+FN)
– also called sensitivity
– true abnormals called abnormal by the
observer
• False Positive Fraction
– FPF = FP / (FP+TN)
• Specificity = TN / (TN+FP)
– True normals called normal by the observer
– FPF = 1 - specificity
Evaluating classifiers (via
their ROC curves)

Classifier A can’t distinguish between normal and abnormal.

B is better but makes some mistakes.

C makes very few mistakes.

“Perfect” means no false positives and no false negatives.
ROC analysis
• ROC = receiver operator/operating characteristic/curve
Area Under the ROC Curve (AUC)

7
Area Under the ROC Curve (AUC)

• What is a good AUC?


– Maximum of 1 (perfect prediction)

8
Area Under the ROC Curve (AUC)

• What is a good AUC?


• Maximum of 1 (perfect
prediction)
• Baseline of 0.5
(random guessing)

9
Selecting a Threshold using ROC

• Choose best threshold for best trade off


– cost of failing to detect positives
– costs of raising false alarms

10
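A minimal sketch of how the whole threshold range can be examined at once with scikit-learn (the labels and probabilities here are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])                   # actual classes
y_prob = np.array([0.95, 0.85, 0.70, 0.60, 0.55, 0.30, 0.20, 0.10])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # one point per threshold
print("AUC =", roc_auc_score(y_true, y_prob))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```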
ROC Plot
• A typical ROC plot with a few points in it is shown in the following figure.

• Note the four cornered points are the four extreme cases of classifiers

11
Interpretation of Different Points in ROC Plot
• The four points (A, B, C, and D)
• A: TPR = 1, FPR = 0, the ideal model, i.e., the perfect
classifier, no false results
• B: TPR = 0, FPR = 1, the worst classifier, not able to predict a single instance correctly
• C: TPR = 0, FPR = 0, the model predicts every instance
to be a Negative class, i.e., it is an ultra-conservative
classifier
• D: TPR = 1, FPR = 1, the model predicts every instance
to be a Positive class, i.e., it is an ultra-liberal classifier

12
Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC
plot.
• The points on the upper diagonal region
• All points in the upper-diagonal region correspond to “good” classifiers, as their TPR is higher than their FPR (i.e., FPRs are lower than TPRs)
• Here, X is better than Z as X has higher TPR and
lower FPR than Z.
• If we compare X and Y, neither classifier is superior
to the other

13
Interpretation of Different Points in ROC Plot

• Let us interpret the different points in the ROC


plot.
• The points on the lower diagonal region
– The lower-diagonal triangle corresponds to classifiers that are worse than random classifiers
– If a classifier is worse than random guessing, we can get a good classifier simply by reversing its predictions
– W’(0.2, 0.4) is the better version of W(0.4, 0.2); W’ is a mirror reflection of W

14
Tuning a Classifier through ROC Plot
• Using ROC plot, we can compare two or more
classifiers by their TPR and FPR values and this
plot also depicts the trade-off between TPR
and FPR of a classifier.
• Examining ROC curves can give insights into
the best way of tuning parameters of
classifier.
• For example, in the curve C2, the result is
degraded after the point P.
• Similarly for the observation C1, beyond Q the
settings are not acceptable.

15
Comparing Classifiers through ROC Plot
• We can use the concept of “area under
curve” (AUC) as a better method to
compare two or more classifiers.
• If a model is perfect, then its AUC = 1.
• If a model simply performs random
guessing, then its AUC = 0.5
• A model that is strictly better than another would have a larger value of AUC than the other.
• Here, C3 is best, and C2 is better than C1
as AUC(C3)>AUC(C2)>AUC(C1).

16
ROC curve

[Figure: ROC curve with True Positive Rate (sensitivity) on the y-axis and False Positive Rate (1 − specificity) on the x-axis, both running from 0% to 100%.]
ROC curve comparison

[Figures: two ROC plots on the same axes as above. A good test bows toward the upper-left corner; a poor test lies close to the diagonal.]
ROC curve extremes

[Figures: the best test reaches the upper-left corner (the two distributions don’t overlap at all); the worst test runs along the diagonal (the two distributions overlap completely).]
ROC curve extremes

20
Typical ROC

21
ROC curve extremes

22
Example

• Let us consider an application of logistic regression involving a direct mail


promotion being used by Simmons Stores.
• Simmons owns and operates a national chain of women’s apparel stores.
• Five thousand copies of an expensive four-color sales catalog have been
printed, and each catalog includes a coupon that provides a $50 discount
on purchases of $200 or more.
• The catalogs are expensive and Simmons would like to send them to only
those customers who have the highest probability of using the coupon.

Sources: Statistics for Business and Economics, 11th Edition, by David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams

23
Variables

• Management thinks that annual spending at Simmons Stores and whether


a customer has a Simmons credit card are two variables that might be
helpful in predicting whether a customer who receives the catalog will use
the coupon.
• Simmons conducted a pilot study using a random sample of 50 Simmons
credit card customers and 50 other customers who do not have a
Simmons credit card.
• Simmons sent the catalog to each of the 100 customers selected.
• At the end of a test period, Simmons noted whether the customer used the coupon or not.

24
Data (10 customer out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0

25
Explanation of Variables

• The amount each customer spent last year at Simmons is shown in


thousands of dollars and the credit card information has been coded as 1
if the customer has a Simmons credit card and 0 if not.
• In the Coupon column, a 1 is recorded if the sampled customer used the
coupon and 0 if not.

26
Loading the data file and getting some statistical detail

27
Method’s description

• DataFrame.describe(): This method is used to get basic statistical details such as central tendency, dispersion and the shape of the dataset’s distribution.

• numpy.unique(): This method gives the unique values in a particular column.

• Series.value_counts(): Returns an object containing counts of unique values.

• ravel(): Returns a one-dimensional array with all the input array elements.

28
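A minimal sketch of these calls on the Simmons data (the file name simmons.csv and the column names Spending, Card, Coupon are assumptions based on the description above):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("simmons.csv")       # hypothetical file name

print(df.describe())                  # central tendency, dispersion, shape
print(np.unique(df["Coupon"]))        # unique values in a column -> [0 1]
print(df["Card"].value_counts())      # counts of each unique value
y = df[["Coupon"]].values.ravel()     # flatten to a one-dimensional array
```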
Split dataset into training and testing sets

29
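Continuing the sketch above, one common way to split the 100 customers (the 75/25 split and random_state are illustrative choices, not necessarily those used on the slide):

```python
from sklearn.model_selection import train_test_split

X = df[["Spending", "Card"]]          # predictors
y = df["Coupon"]                      # response

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```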
Building the model and predicting values

30
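A minimal sketch of fitting the logistic model and producing both class predictions and the underlying probabilities (continuing the split above):

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()                # default settings
model.fit(X_train, y_train)                 # estimate the coefficients

y_pred = model.predict(X_test)              # class labels at the 0.5 cutoff
y_prob = model.predict_proba(X_test)[:, 1]  # P(Coupon = 1) for each customer
```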
Calculate probability of predicting data values

31
Summary for logistic model

32
Accuracy Checking

• By using accuracy_score function.


• By using confusion matrix

Predicted (0) Predicted (1)


Actual (0) True Negative(tn) False Positive(fp)
Actual (1) False Negative(fn) True Positive(tp)

33
Calculating Accuracy Score using Confusion Matrix

34
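Both routes give the same number; a sketch reusing y_test and y_pred from above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, y_pred))     # overall fraction correct

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))    # same value, from the matrix cells
```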
Generating Classification Report

• Recall tells us, when the actual value is yes, how often the model predicts yes.
• Precision tells us, when the model predicts yes, how often it is correct.

35
Interpreting Classification Report

• Precision = tp / (tp + fp)
• Recall = tp / (tp + fn)
• Accuracy = (tp + tn) / (tp + tn + fp + fn)

             Predicted (0)    Predicted (1)
Actual (0)   tn               fp
Actual (1)   fn               tp

36
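These formulas can be checked directly against scikit-learn's report; a sketch reusing the confusion-matrix cells defined above:

```python
from sklearn.metrics import classification_report

precision = tp / (tp + fp)   # of the predicted 1s, how many were right
recall = tp / (tp + fn)      # of the actual 1s, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(classification_report(y_test, y_pred))  # same measures, per class
```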
Thank You

37
Performance of Logistic Model-III

Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda
Python demo for accuracy prediction in logistic regression model using Receiver
operating characteristics curve

2
Sensitivity and Specificity

• To check what type of error we are making, we use two parameters:

1. Sensitivity = tp/(tp+fn) True Positive Rate(tpr)

2. Specificity = tn/(tn+fp) True Negative Rate (tnr)

3
Specificity and Sensitivity Relationship with Threshold

Lower threshold → sensitivity increases, specificity decreases
Higher threshold → sensitivity decreases, specificity increases

Which threshold value should be chosen?

4
Measuring Accuracy, Specificity and Sensitivity

5
ROC Curve for Training dataset

6
ROC Curve for Test data set

7
Threshold value selection

• The outcome of a logistic regression model is a probability.
• Selecting a good threshold value is often challenging.
• Threshold values on the ROC curve:
  Threshold = 1 → TPR = 0, FPR = 0
  Threshold = 0 → TPR = 1, FPR = 1
• Threshold values are often selected based on which errors are “better”.

8
Accuracy checking for different threshold values

9
Accuracy checking for different threshold values

10
Accuracy checking for different threshold values

11
Accuracy checking for different threshold values

12
Calculating Optimal Threshold Value

13
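One common criterion (not necessarily the one used on the slide) is to pick the threshold that maximizes TPR − FPR, known as Youden's J statistic; a sketch reusing y_test and y_prob from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_prob)

j = tpr - fpr                        # Youden's J at every threshold
best = thresholds[np.argmax(j)]      # threshold with the best trade-off
print("optimal threshold ~", best)

y_pred_opt = (y_prob >= best).astype(int)  # reclassify with the new cutoff
```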
Optimal Threshold Value in ROC Curve

14
Classification Report using Optimal Threshold Value

15
Thank You

16
Regression Analysis Model Building - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Introduction

• Model building is the process of developing an estimated regression


equation that describes the relationship between a dependent variable
and one or more independent variables.
• The major issues in model building are finding the proper functional form
of the relationship and selecting the independent variables to be included
in the model.

2
General Linear Regression Model
• Suppose we collected data for one dependent variable y and k
independent variables x1,x2, . . . , xk.
• Objective is to use these data to develop an estimated regression equation
that provides the best relationship between the dependent and
independent variables.

• zj (where j = 1, 2, . . . , p) is a function of x1, x2, . . . , xk (the variables for which data are collected):

  y = b0 + b1z1 + b2z2 + . . . + bpzp + ε

• In some cases, each zj may be a function of only one x variable.

3
Simple first-order model with one predictor variable

  E(y) = b0 + b1x1
4
Modelling Curvilinear Relationships

• To illustrate, let us consider the problem facing Reynolds, Inc., a


manufacturer of industrial scales and laboratory equipment.
• Managers at Reynolds want to investigate the relationship between length
of employment of their salespeople and the number of electronic
laboratory scales sold.
• Table in the next slide gives the number of scales sold by 15 randomly
selected salespeople for the most recent sales period and the number of
months each salesperson has been employed by the firm.

Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)

5
Data
Scales Months
Sold Employed
275 41
296 106
317 76
376 104
162 22
150 12
367 85
308 111
189 40
235 51
83 9
112 12
67 6
325 56
189 19

6
Importing libraries and table

7
SCATTER DIAGRAM FOR THE REYNOLDS EXAMPLE

8
Python code for the Reynolds example: first-order model

9
First-order regression equation

10
Standardized residual plot for the Reynolds example: first-
order model

11
Standardized residual plot for the Reynolds example: first-
order model

12
Need for curvilinear relationship

• Although the computer output shows that the relationship is significant (p-value = .000) and that a linear relationship explains a high percentage of the variability in sales (R-sq = 78.1%), the standardized residual plot suggests that a curvilinear relationship is needed.

13
Second-order model with one predictor variable

• Set z1 = x1 and z2 = x1², giving the model

  E(y) = b0 + b1x1 + b2x1²
14
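A minimal sketch of fitting this second-order model to the Reynolds data with statsmodels (the slides' output was produced in Python, but the exact code is not shown, so this is an illustrative reconstruction):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Reynolds data from the table above
df = pd.DataFrame({
    "Sales":  [275, 296, 317, 376, 162, 150, 367, 308,
               189, 235,  83, 112,  67, 325, 189],
    "Months": [ 41, 106,  76, 104,  22,  12,  85, 111,
                40,  51,   9,  12,   6,  56,  19]})
df["MonthsSq"] = df["Months"] ** 2           # the squared term z2 = x^2

model = smf.ols("Sales ~ Months + MonthsSq", data=df).fit()
print(model.summary())     # p-values, R-sq and R-sq(adj) as on the slides
```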
New Data set

• The data for the MonthsSq independent variable is obtained by squaring


the values of Months.

15
Python output for the Reynolds example:
second-order model

16
Second-order regression model

17
Standardized residual plot for the Reynolds example:
second-order model

18
Interpretation of the second-order model

• The corresponding standardized residual plot shows that the previous curvilinear pattern has been removed.
• At the .05 level of significance, the computer output shows that the overall model is significant (the p-value for the F test is 0.000)
• Note also that the p-value corresponding to the t-ratio for MonthsSq (p-value = .002) is less than .05
• Hence we can conclude that adding MonthsSq to the model involving
Months is significant.
• With an R-sq(adj) value of 88.6%, we should be pleased with the fit
provided by this estimated regression equation.
19
Meaning of linearity in GLM

• In multiple regression analysis the word linear in the term “general linear model” refers only to the fact that b0, b1, . . . , bp all have exponents of 1
• It does not imply that the relationship between y and the xi’s is linear.
• Indeed, we have seen one example of how equation general linear model
can be used to model a curvilinear relationship.

20
Thank you

21
Regression Analysis Model Building (Interaction)- II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Incorporating Interaction of the independent variable to the regression


model
• Python demo

2
Interaction
• If the original data set consists of observations for y and two independent variables x1 and x2, we can develop a second-order model with two predictor variables by setting z1 = x1, z2 = x2, z3 = x1², z4 = x2², and z5 = x1x2 in the general linear model equation
• The model obtained is

  E(y) = b0 + b1x1 + b2x2 + b3x1² + b4x2² + b5x1x2

• In this second-order model, the variable z5 = x1x2 is added to account for the potential effects of the two variables acting together.

3
Example – Interaction

• A company introduces a new shampoo product.


• Two factors believed to have the most influence on sales are unit selling
price and advertising expenditure.
• To investigate the effects of these two variables on sales, prices of $2.00,
$2.50, and $3.00 were paired with advertising expenditures of $50,000
and $100,000 in 24 test markets.

Source: Statistics for Business and Economics, 11th Edition, by David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams

4
Price    Advertising Expenditure ($1000s)    Sales (1000s)
2 50 478
2.5 50 373
3 50 335
2 50 473
2.5 50 358
3 50 329
2 50 456
2.5 50 360
3 50 322
2 50 437
2.5 50 365
3 50 342
2 100 810
2.5 100 653
3 100 345
2 100 832
2.5 100 641
3 100 372
2 100 800
2.5 100 620
3 100 390
2 100 790
2.5 100 670
3 100 393

5
MEAN UNIT SALES (1000s)

                 Advertising Expenditure
Price            $50,000        $100,000
$2.00            461            808
$2.50            364            646
$3.00            332            375
6
Interpretation of interaction

• Note that the sample mean sales corresponding to a price of $2.00 and an
advertising expenditure of $50,000 is 461,000, and the sample mean sales
corresponding to a price of $2.00 and an advertising expenditure of
$100,000 is 808,000.
• Hence, with price held constant at $2.00, the difference in mean sales
between advertising expenditures of $50,000 and $100,000 is 808,000 -
461,000 = 347,000 units.

7
Interpretation of interaction

• When the price of the product is $2.50, the difference in mean sales is
646,000 -364,000 = 282,000 units.
• Finally, when the price is $3.00, the difference in mean sales is 375,000 -
332,000 = 43,000 units.
• Clearly, the difference in mean sales between advertising expenditures of
$50,000 and $100,000 depends on the price of the product.
• In other words, at higher selling prices, the effect of increased advertising
expenditure diminishes.
• These observations provide evidence of interaction between the price and
advertising expenditure variables.
8
Importing Data

9
Mean unit sales (1000s) as a function of selling price

10
Mean unit sales (1000s) as a function of Advertising
Expenditure($1000s)

11
Need for study the interaction between variable

• When interaction between two variables is present, we cannot study the


effect of one variable on the response y independently of the other
variable.
• In other words, meaningful conclusions can be developed only if we
consider the joint effect that both variables have on the response.

12
Estimated regression equation, a general linear model involving three independent variables (z1, z2, and z3):

  E(y) = b0 + b1z1 + b2z2 + b3z3,  with z1 = Price, z2 = AdvExp, z3 = PriceAdv
13
Interaction variable

• The data for the PriceAdv independent variable is obtained by multiplying


each value of Price times the corresponding value of AdvExp.

14
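A minimal sketch of building the interaction term and fitting the model, assuming the 24 observations of the table are in a DataFrame df with columns Price, AdvExp and Sales (the column names are assumptions):

```python
import statsmodels.formula.api as smf

df["PriceAdv"] = df["Price"] * df["AdvExp"]   # interaction term x1 * x2

model = smf.ols("Sales ~ Price + AdvExp + PriceAdv", data=df).fit()
print(model.summary())       # check the p-value of the PriceAdv t test
```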
New Model

15
New Model

16
Interpretation

• Because the model is significant ( p-value for the F test is 0.000) and the p-
value corresponding to the t test for PriceAdv is 0.000, we conclude that
interaction is significant given the linear effect of the price of the product
and the advertising expenditure.
• Thus, the regression results show that the effect of advertising expenditure on sales depends on the price.

17
Transformations Involving the Dependent Variable

Miles per
Gallon Weight
28.7 2289
29.2 2113
34.2 2180
27.9 2448
33.3 2026
26.4 2702
23.9 2657
30.5 2106
18.1 3226
19.5 3213
14.3 3607
20.9 2888

18
Importing data

19
Scatter diagram

20
Model 1

21
Standardized residual plot corresponding to the first-order
model.

22
Standardized residual plot corresponding to the first-order
model

23
Model 2

24
Residual plot for model 2

25
Residual plot of model 2

26
• The miles-per-gallon estimate is obtained by finding the number whose
natural logarithm is 3.2675.
• Using a calculator with an exponential function, or raising e to the power
3.2675, we obtain 26.2 miles per gallon.

27
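A minimal sketch of the log transformation and the back-transformation, using the miles-per-gallon data above (the column names are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "MPG":    [28.7, 29.2, 34.2, 27.9, 33.3, 26.4,
               23.9, 30.5, 18.1, 19.5, 14.3, 20.9],
    "Weight": [2289, 2113, 2180, 2448, 2026, 2702,
               2657, 2106, 3226, 3213, 3607, 2888]})

df["LogMPG"] = np.log(df["MPG"])                   # transform the response
model = smf.ols("LogMPG ~ Weight", data=df).fit()

pred_log = model.predict(pd.DataFrame({"Weight": [2800]}))  # log-scale value
print(np.exp(pred_log))   # back-transform with e**x, as in e**3.2675 ≈ 26.2
```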
Nonlinear Models That Are Intrinsically Linear

28
Thank You

29
χ² Test of Independence - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• To understand the χ² Test of Independence

2
χ² Test of Independence

• It is used to analyze the frequencies of two variables with multiple


categories to determine whether the two variables are independent.
• Qualitative Variables
• Nominal Data

3
χ² Test of Independence: Investment Example
• In which region of the country do you reside?
  A. Northeast  B. Midwest  C. South  D. West
• Which type of financial investment are you most likely to make today?
  E. Stocks  F. Bonds  G. Treasury bills

Contingency Table
                          Type of Financial Investment
                          E        F        G
Geographic    A                             o13      nA
Region        B                                      nB
              C                                      nC
              D                                      nD
                          nE       nF       nG       N
4
χ² Test of Independence: Investment Example

If A and F are independent:
  P(A) = nA / N        P(F) = nF / N
  P(A ∩ F) = P(A) · P(F) = (nA / N)(nF / N)

so the expected frequency is

  eAF = N · P(A ∩ F) = N (nA / N)(nF / N) = nA nF / N

Contingency Table
                          Type of Financial Investment
                          E        F        G
Geographic    A                    e12               nA
Region        B                                      nB
              C                                      nC
              D                                      nD
                          nE       nF       nG       N
5
χ² Test of Independence: Formulas

Expected frequencies:

  eij = (ni · nj) / N

where: i = the row
       j = the column
       ni = the total of row i
       nj = the total of column j
       N = the total of all frequencies
6
χ² Test of Independence: Formulas

Calculated (observed) χ²:

  χ² = Σ (fo − fe)² / fe

where: df = (r − 1)(c − 1)
       r = the number of rows
       c = the number of columns
7
Example for Independence

8
χ² Test of Independence

Ho: Type of gasoline is independent of income
Ha: Type of gasoline is not independent of income

9
χ² Test of Independence

r = 4, c = 3
                              Type of Gasoline
Income                Regular    Premium    Extra Premium
Less than $30,000
$30,000 to $49,999
$50,000 to $99,000
At least $100,000

10
χ² Test of Independence: Gasoline Preference Versus Income Category

α = .01
df = (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6
χ²(.01, 6) = 16.812

If χ²Cal > 16.812, reject Ho.
If χ²Cal ≤ 16.812, do not reject Ho.

11
Python code

12
Gasoline Preference Versus Income Category: Observed Frequencies

                              Type of Gasoline
Income                Regular    Premium    Extra Premium    Total
Less than $30,000     85         16         6                107
$30,000 to $49,999    102        27         13               142
$50,000 to $99,000    36         22         15               73
At least $100,000     15         23         25               63
Total                 238        88         59               385

13
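The whole test can be reproduced in one call with SciPy; a sketch using the observed frequencies above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: income categories; columns: Regular, Premium, Extra Premium
observed = np.array([[ 85, 16,  6],
                     [102, 27, 13],
                     [ 36, 22, 15],
                     [ 15, 23, 25]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, dof, p)      # chi2 ≈ 70.78 with dof = 6, p far below .01
```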
Gasoline Preference Versus Income Category: Expected Frequencies

  eij = (ni · nj) / N

  e11 = (107)(238)/385 = 66.15
  e12 = (107)(88)/385 = 24.46
  e13 = (107)(59)/385 = 16.40

(expected frequencies in parentheses)
                              Type of Gasoline
Income                Regular        Premium        Extra Premium    Total
Less than $30,000     85 (66.15)     16 (24.46)     6 (16.40)        107
$30,000 to $49,999    102 (87.78)    27 (32.46)     13 (21.76)       142
$50,000 to $99,000    36 (45.13)     22 (16.69)     15 (11.19)       73
At least $100,000     15 (38.95)     23 (14.40)     25 (9.65)        63
Total                 238            88             59               385
14
Gasoline Preference Versus Income Category: χ² Calculation

χ² = Σ (fo − fe)² / fe
   = (85 − 66.15)²/66.15 + (16 − 24.46)²/24.46 + (6 − 16.40)²/16.40
   + (102 − 87.78)²/87.78 + (27 − 32.46)²/32.46 + (13 − 21.76)²/21.76
   + (36 − 45.13)²/45.13 + (22 − 16.69)²/16.69 + (15 − 11.19)²/11.19
   + (15 − 38.95)²/38.95 + (23 − 14.40)²/14.40 + (25 − 9.65)²/9.65
   = 70.78
15
Gasoline Preference Versus Income Category: Conclusion

[Figure: χ² distribution with df = 6; rejection region (α = 0.01) to the right of 16.812.]

χ²Cal = 70.78 > 16.812, reject Ho.

16
Contingency Tables

Contingency Tables
• Useful in situations involving multiple population proportions
• Used to classify sample observations according to two or more
characteristics
• Also called a cross-classification table.

17
Contingency Table Example

Hand Preference vs. Gender


Dominant Hand: Left vs. Right
Gender: Male vs. Female

• 2 categories for each variable, so the table is called a 2 x 2 table

• Suppose we examine a sample of 300 college students

18
Contingency Table Example

Sample results organized in a contingency table:

sample size = n = 300
120 Females, 12 were left handed
180 Males, 24 were left handed

                      Gender
Hand Preference    Female    Male
Left               12        24       36
Right              108       156      264
                   120       180      300


19
Contingency Table Example

H0: π1 = π2 (Proportion of females who are left handed is equal to the proportion of males who are left handed)
H1: π1 ≠ π2 (The two proportions are not the same; hand preference is not independent of gender)

• If H0 is true, then the proportion of left-handed females should be the


same as the proportion of left-handed males.
• The two proportions above should be the same as the proportion of left-
handed people overall.

20
The Chi-Square Test Statistic

The Chi-square test statistic is:

  χ² = Σ (fo − fe)² / fe   (summed over all cells)

where:
fo = observed frequency in a particular cell
fe = expected frequency in a particular cell if H0 is true

χ² for the 2 x 2 case has 1 degree of freedom

Assumed: each cell in the contingency table has an expected frequency of at least 5
21
The Chi-Square Test Statistic

The 2 test statistic approximately follows a chi-square


distribution with one degree of freedom

Decision Rule:
If 2 > 2U, reject H0,
otherwise, do not reject 
H0
0 Do not Reject H0 
reject H0 2U
22
Observed vs. Expected Frequencies

                      Gender
Hand Preference    Female              Male
Left               Observed = 12       Observed = 24       36
                   Expected = 14.4     Expected = 21.6
Right              Observed = 108      Observed = 156      264
                   Expected = 105.6    Expected = 158.4
                   120                 180                 300


The Chi-Square Test Statistic

Using the observed and expected frequencies above, the test statistic is:

  χ² = Σ (fo − fe)² / fe
     = (12 − 14.4)²/14.4 + (108 − 105.6)²/105.6 + (24 − 21.6)²/21.6 + (156 − 158.4)²/158.4
     = 0.7576
24
The Chi-Square Test Statistic

The test statistic is χ² = 0.7576; χ²U with 1 d.f. = 3.841 (α = .05)

Decision Rule:
If χ² > 3.841, reject H0; otherwise, do not reject H0

Here, χ² = 0.7576 < χ²U = 3.841, so you do not reject H0 and conclude that there is insufficient evidence that the two proportions are different.

[Figure: χ² distribution with rejection region to the right of χ²U = 3.841.]

25
χ² Test for the Differences Among More Than Two Proportions

• Extend the χ² test to the case with more than two independent populations:

H0: π1 = π2 = … = πc
H1: Not all of the πj are equal (j = 1, 2, …, c)

26
The Chi-Square Test Statistic

The Chi-square test statistic is:

  χ² = Σ (fo − fe)² / fe   (summed over all cells)

where:
• fo = observed frequency in a particular cell of the 2 x c table
• fe = expected frequency in a particular cell if H0 is true
• χ² for the 2 x c case has (2 − 1)(c − 1) = c − 1 degrees of freedom

Assumed: each cell in the contingency table has an expected frequency of at least 5
27
χ² Test with More Than Two Proportions: Example

The sharing of patient records is a controversial issue in health care. A survey of 500 respondents asked whether they objected to their records being shared by insurance companies, by pharmacies, and by medical researchers. The results are summarized in the following table:

28
χ² Test with More Than Two Proportions: Example

                          Organization
Object to          Insurance     Pharmacies    Medical
Record Sharing     Companies                   Researchers
Yes                410           295           335
No                 90            205           165
χ² Test with More Than Two Proportions: Example

                          Organization
Object to          Insurance     Pharmacies    Medical        Row Sum
Record Sharing     Companies                   Researchers
Yes                410           295           335            1040
No                 90            205           165            460
Column Sum         500           500           500            1500
χ² Test with More Than Two Proportions: Example

The overall proportion is:

  p̄ = (X1 + X2 + ... + Xc) / (n1 + n2 + ... + nc)
    = (410 + 295 + 335) / (500 + 500 + 500) = 0.6933

                          Organization
Object to          Insurance       Pharmacies      Medical
Record Sharing     Companies                       Researchers
Yes                fo = 410        fo = 295        fo = 335
                   fe = 346.667    fe = 346.667    fe = 346.667
No                 fo = 90         fo = 205        fo = 165
                   fe = 153.333    fe = 153.333    fe = 153.333
χ² Test with More Than Two Proportions: Example

Cell contributions (fo − fe)² / fe:

                          Organization
Object to          Insurance     Pharmacies    Medical
Record Sharing     Companies                   Researchers
Yes                11.571        7.700         0.3926
No                 26.159        17.409        0.888

The Chi-square test statistic is:  χ² = Σ (fo − fe)² / fe = 64.1196
χ² Test with More Than Two Proportions: Example

H0: π1 = π2 = π3
H1: Not all of the πj are equal (j = 1, 2, 3)

Decision Rule: If χ² > χ²U, reject H0; otherwise, do not reject H0.
χ²U = 5.991 is from the chi-square distribution with 2 degrees of freedom.

Conclusion: Since 64.1196 > 5.991, you reject H0 and conclude that at least one proportion of respondents who object to their records being shared is different across the three organizations

33
Thank You

34
χ² Test of Independence - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Using python to test the independence of variables


• Understanding goodness of fit test for Poisson

2
Example

• Records of 50 students studying in ABN School were taken at random; the first 10 entries look like this:

res_num aa pe sm ae r g c
1 99 19 1 2 0 0 1
2 46 12 0 0 0 0 0
3 57 15 1 1 0 0 0
4 94 18 2 2 1 1 1
5 82 13 2 1 1 1 1
6 59 12 0 0 2 0 0
7 61 12 1 2 0 0 0
8 29 9 0 0 1 1 0
9 36 13 1 1 0 0 0
10 91 16 2 2 1 1 0

3
Example

Here :
• res_num = registration no.
• aa= academic ability
• pe = parent education
• sm = student motivation
• r = religion
• g = gender

4
Python code

5
Hypothesis

• Test the hypothesis that “gender and student motivation” are


independent

6
Python code

7
Observed values
Gender Student motivation
0 1 2 Row Sum
(Disagree ) (Not (Agree)
decided )

0 (Male) 10 13 6 29

1(Female ) 4 9 8 21

Column 14 22 14 50
Sum

8
Expected frequency (contingency table)

Gender          Student motivation
                0                   1         2
0 (Male)        29·14/50 = 8.12     12.76     8.12
1 (Female)      5.88                9.24      5.88
9
Frequency Table

Gender    Student motivation
          0             1              2
0         fo = 10       fo = 13        fo = 6
          fe = 8.12     fe = 12.76     fe = 8.12
1         fo = 4        fo = 9         fo = 8
          fe = 5.88     fe = 9.24      fe = 5.88

10
Chi sq. calculation

χ² = Σ (fo − fe)² / fe
   = 0.435 + 0.005 + 0.554 + 0.601 + 0.006 + 0.764
   = 2.365

11
Python code

12
Python code

Degrees of freedom = (2 − 1) × (3 − 1) = 2

13
Python code

Contingency
table

14
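The hand computation above can be reproduced with plain NumPy; a sketch using the observed gender-by-motivation counts:

```python
import numpy as np

observed = np.array([[10, 13, 6],     # row 0: male
                     [ 4,  9, 8]])    # row 1: female

row_sums = observed.sum(axis=1)       # [29, 21]
col_sums = observed.sum(axis=0)       # [14, 22, 14]
n = observed.sum()                    # 50

expected = np.outer(row_sums, col_sums) / n    # e_ij = n_i * n_j / N
chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)                       # [[8.12 12.76 8.12] [5.88 9.24 5.88]]
print(round(chi2, 3))                 # 2.365
```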
χ² Goodness of Fit Test

15
χ² Goodness-of-Fit Test

• The χ² goodness-of-fit test compares expected (theoretical) frequencies of categories from a population distribution to the observed (actual) frequencies from a distribution to determine whether there is a difference between what was expected and what was observed

16
χ² Goodness-of-Fit Test

  χ² = Σ (fo − fe)² / fe

  df = k − 1 − p

where: fo = frequency of observed values
       fe = frequency of expected values
       k = number of categories
       p = number of parameters estimated from the sample data

17
Goodness of Fit Test: Poisson Distribution
1. Set up the null and alternative hypotheses.
H0: Population has a Poisson probability distribution
Ha: Population does not have a Poisson distribution

2. Select a random sample and


• Record the observed frequency fi for each value of the Poisson random variable.
• Compute the mean number of occurrences μ.

3. Compute the expected frequency of occurrences ei


for each value of the Poisson random variable.

18
Goodness of Fit Test: Poisson Distribution

4. Compute the value of the test statistic

  χ² = Σ (i = 1 to k) (fi − ei)² / ei

where:
fi = observed frequency for category i
ei = expected frequency for category i
k = number of categories

19
Goodness of Fit Test: Poisson Distribution
5. Rejection rule:
   p-value approach: Reject H0 if p-value < α
   Critical value approach: Reject H0 if χ² ≥ χ²α
   where α is the significance level and there are k − 2 degrees of freedom

20
Goodness of Fit Test: Poisson Distribution
• Example: Parking Garage

In studying the need for an additional entrance to a city parking garage, a consultant has recommended an analysis approach that is applicable only in situations where the number of cars entering during a specified time period follows a Poisson distribution.

21
Goodness of Fit Test: Poisson Distribution
A random sample of 100 one- minute time intervals resulted in the
customer arrivals listed below. A statistical test must be conducted to
see if the assumption of a Poisson distribution is reasonable.

# Arrivals 0 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 0 1 4 10 14 20 12 12 9 8 6 3 1

22
Goodness of Fit Test: Poisson Distribution

• Hypotheses
H0: Number of cars entering the garage during
a one-minute interval is Poisson distributed

Ha: Number of cars entering the garage during a


one-minute interval is not Poisson distributed

23
Python Code

24
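A sketch of the whole Poisson goodness-of-fit computation with SciPy (the category grouping mirrors the slides: {0, 1, 2} combined and "10 or more" as a tail):

```python
import numpy as np
from scipy.stats import poisson, chi2

arrivals = np.arange(13)                                 # 0 .. 12
freq = np.array([0, 1, 4, 10, 14, 20, 12, 12, 9, 8, 6, 3, 1])

mu = (arrivals * freq).sum() / freq.sum()                # 600/100 = 6.0
expected = freq.sum() * poisson.pmf(arrivals, mu)        # expected frequencies

# Combine the small-expected categories as on the slides
fo = np.array([freq[:3].sum(), *freq[3:10], freq[10:].sum()])
fe = np.array([expected[:3].sum(), *expected[3:10],
               freq.sum() - expected[:10].sum()])        # tail = 100 - cum. sum

stat = ((fo - fe) ** 2 / fe).sum()
crit = chi2.ppf(0.95, df=len(fo) - 1 - 1)                # k - p - 1 = 7 d.f.
print(round(stat, 3), round(crit, 3))                    # ~3.268 vs 14.067
```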
Goodness of Fit Test: Poisson Distribution

• Estimate of Poisson Probability Function

Total Arrivals = 0(0) + 1(1) + 2(4) + . . . + 12(1) = 600
Estimate of μ = 600/100 = 6
Total Time Periods = 100
Hence,

  f(x) = (6^x e^−6) / x!

25
Goodness of Fit Test: Poisson Distribution
• Expected Frequencies

x f (x ) nf (x ) x f (x ) nf (x )
0 .0025 .25 7 .1377 13.77
1 .0149 1.49 8 .1033 10.33
2 .0446 4.46 9 .0688 6.88
3 .0892 8.92 10 .0413 4.13
4 .1339 13.39 11 .0225 2.25
5 .1606 16.06 12+ .0201 2.01
6 .1606 16.06 Total 1.0000 100.00

26
Python code

27
Python code

28
Goodness of Fit Test: Poisson Distribution
• Observed and Expected Frequencies
i fi ei fi - ei
0 or 1 or 2 5 6.20 -1.20
3 10 8.92 1.08
4 14 13.39 0.61
5 20 16.06 3.94
6 12 16.06 -4.06
7 12 13.77 -1.77
8 9 10.33 -1.33
9 8 6.88 1.12
10 or more 10 8.39 1.61
29
Python code

30
Goodness of Fit Test: Poisson Distribution
• Rejection Rule
With α = .05 and k − p − 1 = 9 − 1 − 1 = 7 d.f.
(where k = number of categories and p = number of population parameters estimated), χ².05 = 14.067
Reject H0 if p-value < .05 or χ² > 14.067.
• Test Statistic

  χ² = (−1.20)²/6.20 + (1.08)²/8.92 + ... + (1.61)²/8.39 = 3.268

31
Python code

32
Goodness of Fit Test: Poisson Distribution

[Figure: χ² distribution with df = 7; rejection region (α = 0.05) to the right of 14.067.]

χ²Cal = 3.268 < 14.067, do not reject Ho.

33
Thank You

34
χ² Goodness of Fit Test

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Python demo for testing GOF for Poisson distribution


• Understanding goodness of fit test for:
– Uniform
– Normal
• Python demo for testing GOF for uniform and normal distribution

2
Goodness of Fit for the Uniform Distribution
• Milk Sales Data

Month        Litres
January      1,610
February     1,585
March        1,649
April        1,590
May          1,540
June         1,397
July         1,410
August       1,350
September    1,495
October      1,564
November     1,602
December     1,655
Total        18,447

3
Hypotheses and Decision Rules
Ho: The monthly figures for milk sales are uniformly distributed
Ha: The monthly figures for milk sales are not uniformly distributed

α = .01
df = k − 1 − p = 12 − 1 − 0 = 11
χ²(.01, 11) = 24.725

If χ²Cal > 24.725, reject Ho.
If χ²Cal ≤ 24.725, do not reject Ho.

4
Python code

5
Calculations

fe = 18,447 / 12 = 1,537.25

Month        fo        fe          (fo − fe)²/fe
January      1,610     1,537.25    3.44
February     1,585     1,537.25    1.48
March        1,649     1,537.25    8.12
April        1,590     1,537.25    1.81
May          1,540     1,537.25    0.00
June         1,397     1,537.25    12.80
July         1,410     1,537.25    10.53
August       1,350     1,537.25    22.81
September    1,495     1,537.25    1.16
October      1,564     1,537.25    0.47
November     1,602     1,537.25    2.73
December     1,655     1,537.25    9.02
Total        18,447    18,447.00   74.37

χ²Cal = 74.37
6
Python code

7
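For a uniform null hypothesis, scipy.stats.chisquare uses equal expected frequencies by default; a sketch with the milk data:

```python
import numpy as np
from scipy.stats import chisquare

litres = np.array([1610, 1585, 1649, 1590, 1540, 1397,
                   1410, 1350, 1495, 1564, 1602, 1655])

# Under Ho each month expects 18,447 / 12 = 1,537.25 litres
stat, p = chisquare(litres)
print(round(stat, 2), p)      # ~74.37, p far below .01 -> reject Ho
```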
Conclusion

[Figure: χ² distribution with df = 11; rejection region (α = 0.01) to the right of 24.725.]

χ²Cal = 74.37 > 24.725, reject Ho.

8
Goodness of Fit Test: Normal Distribution
1. Set up the null and alternative hypotheses.
2. Select a random sample and
a. Compute the mean and standard deviation.
b. Define intervals of values so that the expected frequency is at least 5 for
each interval.
c. For each interval record the observed frequencies
3. Compute the expected frequency, ei , for each interval.

9
Goodness of Fit Test: Normal Distribution
4. Compute the value of the test statistic.

  χ² = Σ (i = 1 to k) (fi − ei)² / ei

5. Reject H0 if χ² ≥ χ²α
(where α is the significance level and there are k − 3 degrees of freedom)

10
Normal Distribution Goodness of Fit Test
• Example: IQL Computers

IQL Computers manufactures and sells a general purpose microcomputer. As part of a study to evaluate sales personnel, management wants to determine, at the α = 0.05 significance level, if the annual sales volume (number of units sold by a salesperson) follows a normal probability distribution.

11
Normal Distribution Goodness of Fit Test

A simple random sample of 30 of the salespeople was


taken and their numbers of units sold are below.

33 43 44 45 52 52 56 58 63 64
64 65 66 68 70 72 73 73 74 75
83 84 85 86 91 92 94 98 102 105
(mean = 71, standard deviation = 18.23)

12
Python code

13
Normal Distribution Goodness of Fit Test
• Hypotheses
H0: The population of number of units sold
has a normal distribution with mean 71
and standard deviation 18.23

Ha: The population of number of units sold


does not have a normal distribution with
mean 71 and standard deviation 18.23

14
Normal Distribution Goodness of Fit Test
• Interval Definition

To satisfy the requirement of an expected frequency of at


least 5 in each interval we will divide the normal distribution
into 30/5 = 6 equal probability intervals.

15
Normal Distribution Goodness of Fit Test
• Interval Definition

[Figure: normal curve divided into six equal-probability intervals, each with area 1.00/6 = .1667. The boundaries are 71 ± .43(18.23) and 71 ± .97(18.23): approximately 53.37, 63.15, 71, 78.85 and 88.63.]
16
Python code

17
Normal Distribution Goodness of Fit Test
• Observed and Expected Frequencies
i fi ei f i - ei
Less than 53.02 6 5 1
53.02 to 63.03 3 5 -2
63.03 to 71.00 6 5 1
71.00 to 78.97 5 5 0
78.97 to 88.98 4 5 -1
More than 88.98 6 5 1
Total 30 30

18
Python code

19
Normal Distribution Goodness of Fit Test
• Rejection Rule
With α = .05 and k − p − 1 = 6 − 2 − 1 = 3 d.f.
(where k = number of categories and p = number of population parameters estimated), χ².05 = 7.815
Reject H0 if p-value < .05 or χ² > 7.815.

• Test Statistic

  χ² = (1)²/5 + (−2)²/5 + (1)²/5 + (0)²/5 + (−1)²/5 + (1)²/5 = 1.600

20
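A sketch of the full normal goodness-of-fit computation (using the six equal-probability intervals; ddof=2 accounts for the two estimated parameters):

```python
import numpy as np
from scipy.stats import norm, chisquare

units = np.array([33, 43, 44, 45, 52, 52, 56, 58, 63, 64,
                  64, 65, 66, 68, 70, 72, 73, 73, 74, 75,
                  83, 84, 85, 86, 91, 92, 94, 98, 102, 105])
mu, sigma = 71, 18.23

# Interval boundaries at the 1/6, 2/6, ..., 5/6 quantiles of N(71, 18.23)
cuts = norm.ppf(np.arange(1, 6) / 6, loc=mu, scale=sigma)
fo = np.histogram(units, bins=[-np.inf, *cuts, np.inf])[0]  # observed counts
fe = np.full(6, len(units) / 6)                             # 5 expected each

stat, _ = chisquare(fo, fe, ddof=2)
print(fo, round(stat, 3))            # counts per interval, chi2 = 1.6
```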
Thank you

21
Cluster analysis: Introduction - I

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Understanding cluster analysis and its purpose


• Introduction to types of data and how to handle them

2
Cluster Analysis

• Cluster analysis is the art of finding


groups in data
• In cluster analysis basically, one wants to
form groups in such a way that objects in
the same group are similar to each other,
whereas objects in different groups are
as dissimilar as possible

3
Cluster analysis

• The classification of similar objects into


groups is an important human activity, this is
part of the learning process
• i.e. A child learns to distinguish between cats
and dogs, between tables and chairs,
between men and women, by means of
continuously improving subconscious
classification schemes
• This explains why cluster analysis is often
considered as a branch of pattern recognition
and artificial intelligence

4
Example

• Let’s illustrate with the help of an example:


• It is a plot of twelve objects, on which two variables were measured. For
instance, the weight of an object might be displayed on the vertical axis
and its height on the horizontal one

5
Example

• Because this example contains only two variables, we can investigate it by merely looking

at the plot

• In this small data set there are clearly two distinct groups of objects

• Such groups are called clusters, and to discover them is the aim of cluster analysis

6
Cluster and discriminant analysis

• Cluster Analysis is an unsupervised classification technique in the sense that it is applied to a dataset where patterns want to be discovered (i.e. groups of individuals or variables want to be found). No prior knowledge is needed for this grouping, and it is sensitive to several decisions that have to be taken (similarity/dissimilarity measures, clustering method, ...)

• Discriminant Analysis (DA) is a statistical technique used to build a prediction model that is used to classify objects from a dataset depending on the features observed on them. In this case, the dependent variable is the grouping variable, which identifies to which group an object belongs. This grouping variable should be known at the beginning, for the function to be built up. Sometimes DA is considered a supervised tool, as there is a previously known classification for the elements of the dataset

7
Cluster analysis and discriminant analysis

• Cluster analysis can be used not only to identify a structure already


present in the data, but also to impose a structure on a more or less
homogeneous data set that has to be split up in a “fair” way, for instance
when dividing a country into telephone areas
• Cluster analysis is quite different from discriminant analysis in that it
actually establishes the groups, whereas discriminant analysis assigns
objects to groups that were defined in advance
Telephone area code for USA

8
Types of data and how to handle them

• Let us take an example, there are n objects to be clustered, which may be


persons, flowers, words, countries, or anything
• Clustering algorithms typically operate on either of two input structures:
– The first represents the objects by means of p measurements or
attributes, such as height, weight, sex, color, and so on
– These measurements can be arranged in an n-by-p matrix, where the
rows correspond to the objects and the columns to the attributes

9
Example

[Figure: an n-by-p data matrix, with objects as rows and attributes as columns.]
10
Types of data and how to handle them

• The second structure is a collection of proximities that must be available


for all pairs of objects
• These proximities make up an n-by-n table, which is called a one-mode
matrix because the row and column entities are the same set of objects
• We shall consider two types of proximities, namely dissimilarities (which measure how far away two objects are from each other) and similarities (which measure how much they resemble each other)

11
Type of data

• Interval-Scaled Variables
• In this situation the n objects are characterized by p continuous
measurements
• These values are positive or negative real numbers, such as height, weight,
temperature, age, cost, ..., which follow a linear scale
• For instance, the time interval between 1900 and 1910 was equal in length
to that between 1960 and 1970

Time scale in years

12
Type of data

• Also, it takes the same amount of energy to heat an object of -16.4°C to -


12.4°C as to increase it from 35.2°C to 39.2°C
• In general it is required that intervals keep the same importance
throughout the scale

13
Interval-Scaled Variables

• These measurements can be organized in an n-by-p matrix, where the rows correspond to the objects (or cases) and the columns correspond to the variables.
• When the fth measurement of the ith object is denoted by xif (where i = 1, . . . , n and f = 1, . . . , p) this matrix looks like:

  [ x11  ...  x1f  ...  x1p ]
  [ ...       ...       ... ]
  [ xi1  ...  xif  ...  xip ]
  [ ...       ...       ... ]
  [ xn1  ...  xnf  ...  xnp ]

14
Interval-Scaled Variables

• For example :
Person Weight(Kg) Height(cm)
• Take eight people, the weight (in A 15 95
kilograms) and the height (in centimetres) B 49 156
• In this situation, n = 8 and p = 2. C 13 95
D 45 160
E 85 178
F 66 176
G 12 90
H 10 78

Table :1

15
Figure 1

[Figure 1: scatter plot of height (cm) against weight (kg) for persons A-H. Two clusters are visible: the small children (A, C, G, H) and the adults (B, D, E, F).]

16
Interval-Scaled Variables

• The units on the vertical axis are drawn to the same size as those on the horizontal axis, even

though they represent different physical concepts

• The plot contains two obvious clusters, which can in this case be interpreted easily: the one

consists of small children and the other of adults

• However, other variables might have led to completely different clustering

• For instance, measuring the concentration of certain natural hormones might have yielded a

clear cut partition into different male and female persons

17
Interval-Scaled Variables
• Let us now consider the effect of changing measurement
Person Weight(lb) Height(in)
units.
A 33.1 37.4
• If weight and height of the subjects had been expressed in B 108 61.4
pounds and inches, the results would have looked quite C 28.7 37.4

different. D 99.2 63
E 187.4 70
• A pound equals 0.4536 kg and an inch is 2.54 cm F 145.5 69.3
• Therefore, Table 2 contains larger numbers in the column G 26.5 35.4
of weights and smaller numbers in the column of heights. H 22 30.7

Table 2

18
Figure 2

[Figure 2: the same persons plotted with weight in pounds and height in inches; the plot looks much flatter than Figure 1.]

19
Interpretation
• Although plotting essentially the same data as Figure 1, Figure 2 looks
much flatter
• In this figure, the relative importance of the variable “weight” is much
larger than in Figure 1
• As a consequence, the two clusters are not as nicely separated as in Figure
1 because in this particular example the height of a person gives a better
indication of adulthood than his or her weight. If height had been
expressed in feet (1 ft = 30.48 cm), the plot would become flatter still and
the variable “weight” would be rather dominant
• In some applications, changing the measurement units may even lead one
to see a very different clustering structure

20
Standardizing the data

• To avoid this dependence on the choice of measurement units, one has the option of standardizing the data
• This converts the original measurements to unitless variables
• First one calculates the mean value of variable f, given by:

  mf = (x1f + x2f + . . . + xnf) / n

for each f = 1, . . . , p

21
Standardizing the data

• Then one computes a measure of the dispersion or “spread” of this fth variable
• Generally, we use the standard deviation for this purpose:

  stdf = sqrt( ((x1f − mf)² + . . . + (xnf − mf)²) / n )

22
Standardizing the data

• However, this measure is affected very much by the presence of outlying


values
• For instance, suppose that one of the xif has been wrongly recorded, so
that it is much too large
• In this case stdf will be unduly inflated, because xif - mf is squared
• Hartigan (1975, p. 299) notes that one needs a dispersion measure that is
not too sensitive to outliers
• Therefore, we will use the mean absolute deviation, where the contribution of each measurement xif is proportional to the absolute value |xif − mf|:

  sf = (|x1f − mf| + |x2f − mf| + . . . + |xnf − mf|) / n
23
Standardizing the data

• Let us assume that sf is nonzero (otherwise variable f is constant over all objects and must be removed)
• Then the standardized measurements, sometimes called z-scores, are defined by:

  zif = (xif − mf) / sf

• They are unitless because both the numerator and the denominator are expressed in the same units
• By construction, the zif have mean value zero and their mean absolute deviation is equal to 1

24
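A minimal sketch of this standardization, using the weight/height data of Table 1 above:

```python
import numpy as np

# Weights (kg) and heights (cm) of the eight persons from Table 1
x = np.array([[15, 95], [49, 156], [13, 95], [45, 160],
              [85, 178], [66, 176], [12, 90], [10, 78]], dtype=float)

m = x.mean(axis=0)                 # mean m_f of each variable
s = np.abs(x - m).mean(axis=0)     # mean absolute deviation s_f
z = (x - m) / s                    # unitless z-scores

print(z.mean(axis=0))              # ~0 for every column
print(np.abs(z).mean(axis=0))      # exactly 1 for every column
```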
Standardizing the data

• When applying standardization, one forgets about the original data and
uses the new data matrix in all subsequent computations

25
Detecting outlier

• The advantage of using sf rather than stdf in the denominator of the z-score formula is that sf will not be blown up as much by an outlying xif
• Hence the corresponding zif will remain noticeable, so the ith object can be recognized as an outlier by the clustering algorithm, which will typically put it in a separate cluster

26
Standardizing the data
• The preceding description might convey the impression that
standardization would be beneficial in all situations.
• However, it is merely an option that may or may not be useful in a given
application
• Sometimes the variables have an absolute meaning, and should not be
standardized
• For instance, it may happen that several variables are expressed in the
same units, so they should not be divided by different sf
• Often standardization dampens a clustering structure by reducing the
large effects because the variables with a big contribution are divided by a
large sf

27
Thank you

28
Cluster analysis: Part - II

Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES

1
Agenda

• Explain effect of standardization(with help of an example)


• Different types of distances computation between the objects

2
Example

• Let’s take four persons A, B, C, D with the following age and height:

Person    Age (yr)    Height (cm)
A         35          190
B         40          190
C         35          160
D         40          160

TABLE 1

[Figure 1: scatter plot of height against age; A and B form one cluster (height 190) and C and D another (height 160).]

Finding Groups in Data: An Introduction to Cluster Analysis, by Leonard Kaufman and Peter J. Rousseeuw, March 1990, John Wiley & Sons, Inc.

3
Example

• In Figure 1 we can see two distinct clusters
• Let us standardize the data of Table 1
• The mean age equals m1 = 37.5 and the mean absolute deviation of the first variable works out to be s1 = (2.5 + 2.5 + 2.5 + 2.5)/4 = 2.5
• Therefore, standardization converts age 40 to +1 ((40 − 37.5)/2.5 = 1) and age 35 to −1 ((35 − 37.5)/2.5 = −1)
• Analogously, m2 = 175 cm and s2 = (15 + 15 + 15 + 15)/4 = 15 cm, so 190 cm is standardized to +1 and 160 cm to −1

4
Example
• The resulting data matrix, which is unitless, is given in Table 2
• Note that the new averages are zero and that the mean deviations equal 1

• Table 2
Person Variable 1 Variable 2
A 1 1
B -1 1
C 1 -1
D -1 -1

• Even when the data are converted to very strange units standardization will always yield
the same numbers

5
Example

• Plotting the values of Table 2 in Figure 2 does not give a very exciting result
• Figure 2 shows no clustering structure because the four points lie at the vertices of a square
• One could say that there are four clusters, each consisting of a single point, or that there is only one big cluster containing four points
• Here standardizing is no solution

[Figure 2: the four standardized points (±1, ±1) at the vertices of a square.]

6
Choice of measurement (Units)- Merits and demerits

• The choice of measurement units gives rise to relative weights of the


variables
• Expressing a variable in smaller units will lead to a larger range for that
variable, which will then have a large effect on the resulting structure
• On the other hand, by standardizing one attempts to give all variables an
equal weight, in the hope of achieving objectivity
• As such, it may be used by a practitioner who possesses no prior
knowledge

7
Choice of measurement- Merits and demerits

• However, it may well be that some variables are intrinsically more


important than others in a particular application, and then the assignment
of weights should be based on subject-matter knowledge
• On the other hand, there have been attempts to devise clustering
techniques that are independent of the scale of the variables

8
Distances computation between the objects
• The next step is to compute distances between the objects, in order to quantify their degree of dissimilarity
• It is necessary to have a distance for each pair of objects i and j.
• The most popular choice is the Euclidean distance:

  d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + . . . + (xip − xjp)² )

• When the data are being standardized, one has to replace all x by z in this expression
• This formula corresponds to the true geometrical distance between the points with coordinates (xi1, . . . , xip) and (xj1, . . . , xjp)

9
Example

• let us consider the special case with p =


2 (Figure 3)
• The figure shows two points with coordinates (xi1, xi2) and (xj1, xj2)
• It is clear that the actual distance
between objects i and j is given by the
length of the hypotenuse of the
triangle, yielding expression in previous
slide by virtue of Pythagoras’ theorem
Figure 3: Illustration of the Euclidean distance formula

10
Distances computation between the objects

• Another well-known metric is the city block or Manhattan distance, defined by:

  d(i, j) = |xi1 − xj1| + |xi2 − xj2| + . . . + |xip − xjp|
11
Interpretation

• Suppose you live in a city where the streets are all north-south or east-
west, and hence perpendicular to each other
• Let Figure 3 be part of a street map of such a city, where the streets are
portrayed as vertical and horizontal lines

12
Interpretation

• Then the actual distance you would have to travel by car to get from location i to location j would total |xi1 − xj1| + |xi2 − xj2|
location i to location j would total lxi1 – xj1l + lxi2 – xj2l

• This would be the shortest length among all possible paths from i to j

• Only a bird could fly straight from point i to point j, thereby covering the

Euclidean distance between these points

13
Mathematical Requirements of a Distance Function
• Both the Euclidean metric and the Manhattan metric satisfy the following
mathematical requirements of a distance function, for all objects i, j, and h:
• (D1) d(i, j) ≥ 0
• (D2) d(i, i) = 0
• (D3) d(i, j) = d(j, i)
• (D4) d(i, j) ≤ d(i, h) + d(h, j)
• Condition (D1) merely states that distances are nonnegative numbers and (D2) says
that the distance of an object to itself is zero
• Axiom (D3) is the symmetry of the distance function
• The triangle inequality (D4) looks a little bit more complicated, but is necessary to allow
a geometrical interpretation
• It says essentially that going directly from i to j is shorter than making a detour over
object h

14
Distances computation between the objects

• Note that d(i, j) = 0 does not necessarily imply that i = j, because it can very well happen that two different objects have the same measurements for the variables under study
• However, the triangle inequality implies that i and j will then have the
same distance to any other object h, because d(i, h) ≤ d(i, j) + d( j, h) = d(j,
h) and at the same time d( j, h) ≤ d( j, i) + d(i, h) = d(i, h), which together
imply that d(i, h) = d(j, h)

15
Minkowski distance

• A generalization of both the Euclidean and the Manhattan metric is the Minkowski distance, given by:

  d(i, j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + . . . + |xip − xjp|^q )^(1/q)

where q is any real number larger than or equal to 1

• This is also called the Lq metric, with the Euclidean (q = 2) and the Manhattan (q = 1) as special cases

16
Example for Calculation of Euclidean and Manhattan Distance

• Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in the given figure. The Euclidean distance between the two is sqrt(2² + 3²) = 3.61. The Manhattan distance between the two is 2 + 3 = 5.

Figure 4
17
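The same arithmetic in NumPy, with the Minkowski distance added for one illustrative choice q = 3:

```python
import numpy as np

x1 = np.array([1, 2])
x2 = np.array([3, 5])

euclidean = np.sqrt(((x1 - x2) ** 2).sum())          # sqrt(2^2 + 3^2) = 3.61
manhattan = np.abs(x1 - x2).sum()                    # 2 + 3 = 5
minkowski = (np.abs(x1 - x2) ** 3).sum() ** (1 / 3)  # L3 metric, q = 3
print(round(euclidean, 2), manhattan, round(minkowski, 2))
```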
n-by-n Matrix
• For example, the Euclidean distances between the objects of the following table can be obtained as on the next slide:

Person    Weight (kg)    Height (cm)
A         15             95
B         49             156
C         13             95
D         45             160
E         85             178
F         66             176
G         12             90
H         10             78

• Euclidean distance between B and E:
  sqrt((49 − 85)² + (156 − 178)²) = 42.2

18
n-by-n Matrix

[Table: 8-by-8 symmetric matrix of pairwise Euclidean distances between persons A-H.]
19
Interpretation

• The distance between object B and object E can be located at the


intersection of the fifth row and the second column, yielding 42.2
• The same number can also be found at the intersection of the second row
and the fifth column, because the distance between B and E is equal to
the distance between E and B
• Therefore, a distance matrix is always symmetric
• Moreover, note that the entries on the main diagonal are always zero,
because the distance of an object to itself has to be zero

20
Distance matrix
• It would suffice to write down only the lower triangular half of the distance matrix

[Table: lower triangular half of the 8-by-8 distance matrix, with rows B-H and columns A-G.]

21
Selection of variables

• It should be noted that a variable not containing any relevant information


(say, the telephone number of each person) is worse than useless,
because it will make the clustering less apparent.
• The occurrence of several such “trash variables” will kill the whole clustering because they yield a lot of random terms in the distances, thereby hiding the useful information provided by the other variables.
• Therefore, such non informative variables must be given a zero weight in
the analysis, which amounts to deleting them

22
Selection of variables

• The selection of “good” variables is a nontrivial task and may involve quite
some trial and error (in addition to subject-matter knowledge and
common sense)
• In this respect, cluster analysis may be considered an exploratory
technique

23
Thank you

24
