Ca 3 Merged
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
• Confusion matrix
• Receiver operating characteristics curve
2
Why Evaluate?
3
Accuracy Measures (Classification)
Misclassification error
• Error = classifying a record as belonging to one class when it belongs to
another class.
• Error rate = percent of misclassified records out of the total records in the
validation data
4
Confusion Matrix
5
Error Rate
Classification Confusion Matrix

                 Predicted Class
Actual Class       1        0
     1            201       85
     0             25     2689
6
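The error rate can be computed directly from the four confusion-matrix counts. A minimal sketch, using the counts from the matrix above:

```python
# Confusion matrix counts from the slide (actual class x predicted class).
tp, fn = 201, 85    # actual 1: predicted 1, predicted 0
fp, tn = 25, 2689   # actual 0: predicted 1, predicted 0

total = tp + fn + fp + tn
error_rate = (fn + fp) / total   # misclassified records / total records
accuracy = (tp + tn) / total

print(total, round(error_rate, 4), round(accuracy, 4))
```

Note that accuracy is simply 1 minus the error rate.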
Cutoff for classification
Most algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly
7
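The two-step process can be sketched in a couple of lines; the probabilities below are illustrative values taken from the cutoff table that follows:

```python
# Step 1: predicted probabilities of belonging to class "1" (illustrative).
probs = [0.996, 0.762, 0.506, 0.218, 0.048]

# Step 2: compare each probability to the cutoff and classify accordingly.
cutoff = 0.5
classes = [1 if p >= cutoff else 0 for p in probs]
print(classes)
```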
Cutoff Table
Actual Class   Prob. of "1"
     1            0.996
     1            0.988
     1            0.984
     1            0.980
     1            0.948
     1            0.889
     1            0.848
     0            0.762
     1            0.707
     1            0.681
     1            0.656
     0            0.622
     1            0.506
     0            0.471
     0            0.337
     1            0.218
     0            0.199
     0            0.149
     0            0.048
     0            0.038
     0            0.025
     0            0.022
     0            0.016
     0            0.004

(24 records, sorted in descending order of the predicted probability of class "1")
Resulting confusion matrices at two cutoff values (rows: actual class; columns: predicted owner, predicted non-owner):

Lower cutoff:
owner        11    1
non-owner     4    8

Higher cutoff:
owner         7    5
non-owner     1   11
9
Compute Outcome Measures
10
When One Class is More Important
In many cases it is more important to identify members of one class
– Tax fraud
– Credit default
– Response to promotional offer
– Detecting electronic network intrusion
– Predicting delayed flights
11
ROC curves
(Figures: distributions of the test result for people with the disease and without the disease, separated by a threshold.)
Some definitions ...
(Figure: the threshold splits the two overlapping distributions of test results into true positives, true negatives, false positives, and false negatives; results above the threshold are called "+", results below are called "-".)
Moving the Threshold: left
(Figure: shifting the threshold to the left trades false negatives for false positives.)
23
Threshold Value
• Often selected based on which errors are “better”
• If t is large, predict positive rarely (only when P(y=1) is large)
– More errors where we say negative, but it is actually positive
– Detects most patients who are negative
• If t is small, predict negative rarely (only when P(y=1) is small)
– More errors where we say positive, but it is actually negative
– Detects most patients who are positive
• With no preference between the errors, select t = 0.5
– Predicts the more likely outcome
24
Selecting a Threshold Value
25
True disease state vs. Test result

                        Test not rejected (-)            Test rejected (+)
No disease (D = 0)      correct: specificity             Type I error (false +)
Disease (D = 1)         Type II error (false -)          correct: power = 1 - β; sensitivity
Classification matrix: Meaning of each cell
27
Alternate Accuracy Measures
28
Receiver Operator Characteristic (ROC) Curve
29
Selecting a Threshold using ROC
30
Thank You
31
Confusion Matrix and ROC-II
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
ROC analysis
• True Positive Fraction
– TPF = TP / (TP + FN)
– also called sensitivity
– true abnormals called abnormal by the observer
• False Positive Fraction
– FPF = FP / (FP + TN)
• Specificity = TN / (TN + FP)
– true normals called normal by the observer
– FPF = 1 - specificity
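These fractions follow directly from the four confusion-matrix counts. A small helper, shown with the counts from the earlier confusion-matrix slide:

```python
def roc_fractions(tp, fn, fp, tn):
    """Return (TPF i.e. sensitivity, FPF, specificity)."""
    tpf = tp / (tp + fn)           # true abnormals called abnormal
    specificity = tn / (tn + fp)   # true normals called normal
    fpf = fp / (fp + tn)           # = 1 - specificity
    return tpf, fpf, specificity

tpf, fpf, spec = roc_fractions(tp=201, fn=85, fp=25, tn=2689)
```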
Evaluating classifiers (via their ROC curves)
Classifier A can’t distinguish between normal and abnormal.
7
Area Under the ROC Curve (AUC)
8
Area Under the ROC Curve (AUC)
9
Selecting a Threshold using ROC
10
ROC Plot
• A typical ROC plot with a few points on it is shown in the following figure.
• Note that the four corner points correspond to four extreme cases of classifiers.
11
Interpretation of Different Points in ROC Plot
• The four points (A, B, C, and D):
• A: TPR = 1, FPR = 0 – the ideal model, i.e., the perfect classifier; no false results
• B: TPR = 0, FPR = 1 – the worst classifier, unable to predict a single instance correctly
• C: TPR = 0, FPR = 0 – the model predicts every instance to be the Negative class, i.e., an ultra-conservative classifier
• D: TPR = 1, FPR = 1 – the model predicts every instance to be the Positive class, i.e., an ultra-liberal classifier
12
Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC plot.
• The points in the upper-diagonal region:
• All points that lie in the upper-diagonal region correspond to “good” classifiers, because their TPR is higher than their FPR.
• Here, X is better than Z, as X has a higher TPR and a lower FPR than Z.
• If we compare X and Y, neither classifier is superior to the other.
13
Interpretation of Different Points in ROC Plot
14
Tuning a Classifier through ROC Plot
• Using ROC plot, we can compare two or more
classifiers by their TPR and FPR values and this
plot also depicts the trade-off between TPR
and FPR of a classifier.
• Examining ROC curves can give insights into
the best way of tuning parameters of
classifier.
• For example, in the curve C2, the result is
degraded after the point P.
• Similarly for the observation C1, beyond Q the
settings are not acceptable.
15
Comparing Classifiers through ROC Plot
• We can use the concept of “area under the curve” (AUC) as a better method to compare two or more classifiers.
• If a model is perfect, then its AUC = 1.
• If a model simply performs random guessing, then its AUC = 0.5.
• A model that is strictly better than another has a larger AUC.
• Here, C3 is best, and C2 is better than C1, as AUC(C3) > AUC(C2) > AUC(C1).
16
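AUC can be approximated from a handful of (FPR, TPR) points with the trapezoidal rule. A minimal sketch; the two point sets below are illustrative, not from the slides:

```python
def auc(points):
    """Area under a ROC curve given (fpr, tpr) points, via the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

perfect = [(0, 0), (0, 1), (1, 1)]   # ideal classifier: AUC = 1
random_guess = [(0, 0), (1, 1)]      # diagonal: AUC = 0.5
```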
ROC curve
(Figure: True Positive Rate on the vertical axis versus False Positive Rate (1 - specificity) on the horizontal axis; both axes run from 0% to 100%.)
ROC curve comparison
(Figure: two ROC plots shown side by side for comparison, each plotting True Positive Rate against False Positive Rate from 0% to 100%.)
20
Typical ROC
21
ROC curve extremes
22
Example
Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)
23
Variables
24
Data (10 customers out of 100)
Customer Spending Card Coupon
1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0
25
Explanation of Variables
26
Loading data file and get some statistical detail
27
Method’s description
• ravel(): returns a flattened, one-dimensional array containing all the elements of the input array.
28
Split dataset into training and testing sets
29
Building the model and predicting values
30
Calculate probability of predicting data values
31
Summary for logistic model
32
Accuracy Checking
33
Calculating Accuracy Score using Confusion Matrix
34
Generating Classification Report
35
Interpreting Classification Report
36
Thank You
37
Performance of Logistic Model-III
Dr A. RAMESH
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
Python demo for accuracy prediction in logistic regression model using Receiver
operating characteristics curve
2
Sensitivity and Specificity
• To check what type of error we are making, we use two parameters:
Specificity and Sensitivity Relationship with Threshold
Threshold (Lower): Sensitivity (increases), Specificity (decreases)
Threshold (Higher): Sensitivity (decreases), Specificity (increases)
4
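This relationship can be checked numerically. A sketch using a few labeled records from the earlier cutoff table:

```python
# (actual class, predicted probability of class 1), from the cutoff table
records = [(1, 0.996), (1, 0.848), (0, 0.762), (1, 0.506),
           (0, 0.471), (1, 0.218), (0, 0.149), (0, 0.048)]

def sens_spec(records, threshold):
    """Sensitivity and specificity at a given classification threshold."""
    tp = sum(1 for y, p in records if y == 1 and p >= threshold)
    fn = sum(1 for y, p in records if y == 1 and p < threshold)
    tn = sum(1 for y, p in records if y == 0 and p < threshold)
    fp = sum(1 for y, p in records if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

low = sens_spec(records, 0.2)    # low threshold: high sensitivity
high = sens_spec(records, 0.8)   # high threshold: high specificity
```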
Measuring Accuracy, Specificity and Sensitivity
5
ROC Curve for Training dataset
6
ROC Curve for Test data set
7
Threshold value selection
• Threshold values are often selected based on which errors are “better”.
8
Accuracy checking for different threshold values
9
Accuracy checking for different threshold values
10
Accuracy checking for different threshold values
11
Accuracy checking for different threshold values
12
Calculating Optimal Threshold Value
13
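One common rule for picking an optimal threshold (the lecture's exact method may differ) is to maximize Youden's J = sensitivity + specificity - 1 over the candidate thresholds; a sketch with sample records from the cutoff table:

```python
records = [(1, 0.996), (1, 0.848), (0, 0.762), (1, 0.506),
           (0, 0.471), (1, 0.218), (0, 0.149), (0, 0.048)]

def youden_j(records, t):
    """Youden's J statistic (sensitivity + specificity - 1) at threshold t."""
    tp = sum(1 for y, p in records if y == 1 and p >= t)
    fn = sum(1 for y, p in records if y == 1 and p < t)
    tn = sum(1 for y, p in records if y == 0 and p < t)
    fp = sum(1 for y, p in records if y == 0 and p >= t)
    return tp / (tp + fn) + tn / (tn + fp) - 1

# Candidate thresholds: the predicted probabilities themselves.
best = max((p for _, p in records), key=lambda t: youden_j(records, t))
```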
Optimal Threshold Value in ROC Curve
14
Classification Report using Optimal Threshold Value
15
Thank You
16
Regression Analysis Model Building - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Introduction
2
General Linear Regression Model
• Suppose we collected data for one dependent variable y and k independent variables x1, x2, . . . , xk.
• The objective is to use these data to develop an estimated regression equation that provides the best relationship between the dependent and independent variables.
3
Simple first-order model with one predictor variable
4
Modelling Curvilinear Relationships
Sources: Statistics for Business and Economics,11th Edition by David R. Anderson (Author), Dennis J.
Sweeney (Author), Thomas A. Williams (Author)
5
Data
Scales Months
Sold Employed
275 41
296 106
317 76
376 104
162 22
150 12
367 85
308 111
189 40
235 51
83 9
112 12
67 6
325 56
189 19
6
Importing libraries and table
7
SCATTER DIAGRAM FOR THE REYNOLDS EXAMPLE
8
Python code for the Reynolds example: first-order model
9
First-order regression equation
10
Standardized residual plot for the Reynolds example: first-
order model
11
Standardized residual plot for the Reynolds example: first-
order model
12
Need for curvilinear relationship
13
Second-order model with one predictor variable
14
New Data set
15
Python output for the Reynolds example:
second-order model
16
Second-order regression model
17
Standardized residual plot for the Reynolds example:
second-order model
18
Interpretation of the second-order model
• In multiple regression analysis, the word linear in the term “general linear model” refers only to the fact that the parameters b0, b1, . . . , bp all appear with an exponent of 1.
• It does not imply that the relationship between y and the xi’s is linear.
• Indeed, we have seen one example of how the general linear model can be used to model a curvilinear relationship.
20
Thank you
21
Regression Analysis Model Building (Interaction)- II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Interaction
• If the original data set consists of observations for y and two independent variables x1 and x2, we can develop a second-order model with two predictor variables by setting z1 = x1, z2 = x2, z3 = x1^2, z4 = x2^2, and z5 = x1x2 in the general linear model equation.
• The model obtained is
E(y) = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + b5x1x2
• In this second-order model, the variable z5 = x1x2 is added to account for the potential effects of the two variables acting together.
• This type of effect is called interaction.
3
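Constructing the z-variables, including the interaction term z5 = x1x2, is mechanical; a sketch with a few hypothetical (x1, x2) rows:

```python
rows = [(2.0, 50.0), (2.5, 50.0), (3.0, 100.0)]   # hypothetical (x1, x2) pairs

# z1..z5 for the second-order model with interaction:
# z1 = x1, z2 = x2, z3 = x1^2, z4 = x2^2, z5 = x1*x2
design = [(x1, x2, x1 ** 2, x2 ** 2, x1 * x2) for x1, x2 in rows]
```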
Example – Interaction
4
Price   Advertising Expenditure ($1000s)   Sales (1000s)
2 50 478
2.5 50 373
3 50 335
2 50 473
2.5 50 358
3 50 329
2 50 456
2.5 50 360
3 50 322
2 50 437
2.5 50 365
3 50 342
2 100 810
2.5 100 653
3 100 345
2 100 832
2.5 100 641
3 100 372
2 100 800
2.5 100 620
3 100 390
2 100 790
2.5 100 670
3 100 393
5
MEAN UNIT SALES (1000s)
6
Interpretation of interaction
• Note that the sample mean sales corresponding to a price of $2.00 and an
advertising expenditure of $50,000 is 461,000, and the sample mean sales
corresponding to a price of $2.00 and an advertising expenditure of
$100,000 is 808,000.
• Hence, with price held constant at $2.00, the difference in mean sales
between advertising expenditures of $50,000 and $100,000 is 808,000 -
461,000 = 347,000 units.
7
Interpretation of interaction
• When the price of the product is $2.50, the difference in mean sales is
646,000 -364,000 = 282,000 units.
• Finally, when the price is $3.00, the difference in mean sales is 375,000 -
332,000 = 43,000 units.
• Clearly, the difference in mean sales between advertising expenditures of
$50,000 and $100,000 depends on the price of the product.
• In other words, at higher selling prices, the effect of increased advertising
expenditure diminishes.
• These observations provide evidence of interaction between the price and
advertising expenditure variables.
8
Importing Data
9
Mean unit sales (1000s) as a function of selling price
10
Mean unit sales (1000s) as a function of Advertising
Expenditure($1000s)
11
Need for study the interaction between variable
12
Estimated regression equation, a general linear model
involving three independent variables (z1, z2, and z3)
13
Interaction variable
14
New Model
15
New Model
16
Interpretation
• Because the model is significant (the p-value for the F test is 0.000) and the p-value corresponding to the t test for PriceAdv is 0.000, we conclude that interaction is significant, given the linear effect of the price of the product and the advertising expenditure.
• Thus, the regression results show that the effect of advertising expenditure on sales depends on the price.
17
Transformations Involving the Dependent Variable
Miles per Gallon   Weight
28.7 2289
29.2 2113
34.2 2180
27.9 2448
33.3 2026
26.4 2702
23.9 2657
30.5 2106
18.1 3226
19.5 3213
14.3 3607
20.9 2888
18
Importing data
19
Scatter diagram
20
Model 1
21
Standardized residual plot corresponding to the first-order
model.
22
Standardized residual plot corresponding to the first-order
model
23
Model 2
24
Residual plot for model 2
25
Residual plot of model 2
26
• The miles-per-gallon estimate is obtained by finding the number whose
natural logarithm is 3.2675.
• Using a calculator with an exponential function, or raising e to the power
3.2675, we obtain 26.2 miles per gallon.
27
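The back-transformation from the log scale is just exponentiation:

```python
import math

log_mpg = 3.2675           # predicted natural log of miles per gallon
mpg = math.exp(log_mpg)    # back-transform to the original scale
print(round(mpg, 1))
```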
Nonlinear Models That Are Intrinsically Linear
28
Thank You
29
χ² Test of Independence - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
χ² Test of Independence
3
χ² Test of Independence: Investment Example
• In which region of the country do you reside?
A. Northeast B. Midwest C. South D. West
• Which type of financial investment are you most likely to make today?
E. Stocks F. Bonds G. Treasury bills
Contingency Table

                  Type of Financial Investment
                       E      F      G
Geographic    A       ...    ...    O13     nA
Region        B                             nB
              C                             nC
              D                             nD
                      nE     nF     nG      N

(Oij denotes the observed frequency in row i, column j.)
4
χ² Test of Independence: Investment Example

If A and F are independent:
P(A) = nA / N and P(F) = nF / N
P(A ∩ F) = P(A) P(F)
eAF = N P(A ∩ F) = N (nA / N)(nF / N) = (nA nF) / N

Contingency Table

                  Type of Financial Investment
                       E      F      G
Geographic    A       ...    e12    ...     nA
Region        B                             nB
              C                             nC
              D                             nD
                      nE     nF     nG      N
5
χ² Test of Independence: Formulas

Expected frequencies:  eij = (ni nj) / N

where: i = the row
       j = the column
       ni = the total of row i
       nj = the total of column j
       N = the total of all frequencies
6
χ² Test of Independence: Formulas

Calculated (observed) χ²:  χ² = Σ (fo - fe)² / fe

where: df = (r - 1)(c - 1)
7
Example for Independence
8
χ² Test of Independence

Ho: Type of gasoline is independent of income
Ha: Type of gasoline is not independent of income
9
χ² Test of Independence

Type of Gasoline: r = 4, c = 3

Income               Regular   Premium   Extra Premium
Less than $30,000
$30,000 to $49,999
$50,000 to $99,000
At least $100,000
10
χ² Test of Independence: Gasoline Preference Versus Income Category

α = .01
df = (r - 1)(c - 1) = (4 - 1)(3 - 1) = 6
χ²(.01, 6) = 16.812

If χ²Cal ≥ 16.812, reject Ho.
If χ²Cal < 16.812, do not reject Ho.
11
Python code
12
Gasoline Preference Versus Income Category: Observed Frequencies

Income               Regular   Premium   Extra Premium   Total
Less than $30,000        85        16          6           107
$30,000 to $49,999      102        27         13           142
$50,000 to $99,000       36        22         15            73
At least $100,000        15        23         25            63
Total                   238        88         59           385
13
Gasoline Preference Versus Income Category: Expected Frequencies

eij = (ni)(nj) / N

e11 = (107)(238) / 385 = 66.15
e12 = (107)(88) / 385 = 24.46
e13 = (107)(59) / 385 = 16.40

Income               Regular        Premium        Extra Premium   Total
Less than $30,000    (66.15)  85    (24.46) 16     (16.40)  6        107
$30,000 to $49,999   (87.78) 102    (32.46) 27     (21.76) 13        142
$50,000 to $99,000   (45.13)  36    (16.69) 22     (11.19) 15         73
At least $100,000    (38.95)  15    (14.40) 23      (9.65) 25         63
Total                   238            88             59             385
14
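The whole table of expected frequencies, and the resulting χ² statistic, can be reproduced in a few lines (observed counts from the slide; a small gap to the slide's 70.78 comes from intermediate rounding of the expected values):

```python
observed = [[85, 16, 6],
            [102, 27, 13],
            [36, 22, 15],
            [15, 23, 25]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
N = sum(row_totals)

# e_ij = (row total i)(column total j) / N
expected = [[ri * cj / N for cj in col_totals] for ri in row_totals]

# chi-square = sum over all cells of (fo - fe)^2 / fe
chi2 = sum((fo - fe) ** 2 / fe
           for obs_row, exp_row in zip(observed, expected)
           for fo, fe in zip(obs_row, exp_row))
```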
Gasoline Preference Versus Income Category: χ² Calculation

χ² = Σ (fo - fe)² / fe = 70.78

df = 6, α = 0.01; non-rejection region below the critical value 16.812

Since χ²Cal = 70.78 ≥ 16.812, reject Ho.
16
Contingency Tables
• Useful in situations involving multiple population proportions
• Used to classify sample observations according to two or more
characteristics
• Also called a cross-classification table.
17
Contingency Table Example
18
Contingency Table Example

Sample size n = 300:
– 120 Females, 12 were left handed
– 180 Males, 24 were left handed

Hand Preference   Female   Male   Total
Left                 12      24     36
Right               108     156    264
20
The Chi-Square Test Statistic

χ² = Σ over all cells of (fo - fe)² / fe

where:
fo = observed frequency in a particular cell
fe = expected frequency in a particular cell if H0 is true

Decision Rule:
If χ² > χ²U, reject H0; otherwise, do not reject H0.
22
Observed vs. Expected Frequencies

Hand Preference   Female                           Male                             Total
Left              Observed = 12, Expected = 14.4   Observed = 24, Expected = 21.6     36
Right             Observed = 108, Expected = 105.6 Observed = 156, Expected = 158.4  264
25
χ² Test for the Differences Among More Than Two Proportions

• Extend the χ² test to the case with more than two independent populations:
H0: π1 = π2 = … = πc
H1: Not all of the πj are equal (j = 1, 2, …, c)
26
The Chi-Square Test Statistic
Assumed: each cell in the contingency table has expected frequency of at least 5
27
χ² Test with More Than Two Proportions: Example
28
χ² Test with More Than Two Proportions: Example

Organization
Object to Record Sharing   Insurance Companies   Pharmacies   Medical Researchers
No                                  90               205              165
χ² Test with More Than Two Proportions: Example

Each cell contributes (fo - fe)² / fe:

Organization
Object to Record Sharing   Insurance Companies   Pharmacies   Medical Researchers
Yes                              11.571              7.700           0.3926
No                               17.409              0.888          26.159

The Chi-square test statistic is: χ² = Σ over all cells of (fo - fe)² / fe = 64.1196
χ² Test with More Than Two Proportions: Example
H0: π1 = π2 = π3
H1: Not all of the πj are equal (j = 1, 2, 3)
Conclusion: Since 64.1196 > 5.991, you reject H0 and you conclude that at
least one proportion of respondents who object to their records being shared
is different across the three organizations
33
Thank You
34
χ² Test of Independence - II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Example
res_num aa pe sm ae r g c
1 99 19 1 2 0 0 1
2 46 12 0 0 0 0 0
3 57 15 1 1 0 0 0
4 94 18 2 2 1 1 1
5 82 13 2 1 1 1 1
6 59 12 0 0 2 0 0
7 61 12 1 2 0 0 0
8 29 9 0 0 1 1 0
9 36 13 1 1 0 0 0
10 91 16 2 2 1 1 0
3
Example
Here :
• res_num = registration no.
• aa= academic ability
• pe = parent education
• sm = student motivation
• r = religion
• g = gender
4
Python code
5
Hypothesis
6
Python code
7
Observed values
Student motivation

Gender        0 (Disagree)   1 (Not decided)   2 (Agree)   Row Sum
0 (Male)           10              13               6          29
1 (Female)          4               9               8          21
Column Sum         14              22              14          50
8
Expected frequency (contingency table)
9
Frequency Table

         0                   1                    2
0   fo = 10, fe = 8.12   fo = 13, fe = 12.76   fo = 6, fe = 8.12
1   fo = 4, fe = 5.88    fo = 9, fe = 9.24     fo = 8, fe = 5.88
10
Chi sq. calculation

χ² = Σ (fo - fe)² / fe
   = 0.435 + 0.005 + 0.554 + 0.601 + 0.006 + 0.764
   = 2.365
11
Python code
12
Python code
Degrees of
freedom =
(2-1)*(3-1)
13
Python code
Contingency
table
14
χ² Goodness of Fit Test
15
χ² Goodness-of-Fit Test
16
χ² Goodness-of-Fit Test

χ² = Σ (fo - fe)² / fe
df = k - 1 - p

where: fo = frequency of observed values
       k = number of categories
       p = number of parameters estimated from the sample data
17
Goodness of Fit Test: Poisson Distribution
1. Set up the null and alternative hypotheses.
H0: Population has a Poisson probability distribution
Ha: Population does not have a Poisson distribution
18
Goodness of Fit Test: Poisson Distribution
where:
fi = observed frequency for category i
ei = expected frequency for category i
k = number of categories
19
Goodness of Fit Test: Poisson Distribution
5. Rejection rule:
p-value approach: Reject H0 if p-value <
20
Goodness of Fit Test: Poisson Distribution
• Example: Parking Garage
21
Goodness of Fit Test: Poisson Distribution
A random sample of 100 one- minute time intervals resulted in the
customer arrivals listed below. A statistical test must be conducted to
see if the assumption of a Poisson distribution is reasonable.
# Arrivals 0 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 0 1 4 10 14 20 12 12 9 8 6 3 1
22
Goodness of Fit Test: Poisson Distribution
• Hypotheses
H0: Number of cars entering the garage during
a one-minute interval is Poisson distributed
23
Python Code
24
Goodness of Fit Test: Poisson Distribution

f(x) = (6^x e^(-6)) / x!
25
Goodness of Fit Test: Poisson Distribution
• Expected Frequencies
x f (x ) nf (x ) x f (x ) nf (x )
0 .0025 .25 7 .1377 13.77
1 .0149 1.49 8 .1033 10.33
2 .0446 4.46 9 .0688 6.88
3 .0892 8.92 10 .0413 4.13
4 .1339 13.39 11 .0225 2.25
5 .1606 16.06 12+ .0201 2.01
6 .1606 16.06 Total 1.0000 100.00
26
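The expected-frequency column is n·f(x) with n = 100 and λ = 6; a sketch using only the standard library:

```python
import math

n, lam = 100, 6

def poisson_pmf(x):
    """Poisson probability f(x) = lam^x e^(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)

expected = {x: n * poisson_pmf(x) for x in range(12)}
expected["12+"] = n * (1 - sum(poisson_pmf(x) for x in range(12)))
```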
Python code
27
Python code
28
Goodness of Fit Test: Poisson Distribution
• Observed and Expected Frequencies
i fi ei fi - ei
0 or 1 or 2 5 6.20 -1.20
3 10 8.92 1.08
4 14 13.39 0.61
5 20 16.06 3.94
6 12 16.06 -4.06
7 12 13.77 -1.77
8 9 10.33 -1.33
9 8 6.88 1.12
10 or more 10 8.39 1.61
29
Python code
30
Goodness of Fit Test: Poisson Distribution

• Rejection Rule
With α = .05 and k - p - 1 = 9 - 1 - 1 = 7 d.f. (where k = number of categories and p = number of population parameters estimated), χ².05 = 14.067
Reject H0 if p-value < .05 or χ² > 14.067.

• Test Statistic
χ² = (-1.20)² / 6.20 + (1.08)² / 8.92 + ... + (1.61)² / 8.39 = 3.268
31
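The statistic can be reproduced from the observed/expected table above (values from the slide; the small gap to 3.268 comes from rounding in the slide's ei column):

```python
# fi and ei from the "Observed and Expected Frequencies" table
fi = [5, 10, 14, 20, 12, 12, 9, 8, 10]
ei = [6.20, 8.92, 13.39, 16.06, 16.06, 13.77, 10.33, 6.88, 8.39]

chi2 = sum((f - e) ** 2 / e for f, e in zip(fi, ei))
```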
Python code
32
Goodness of Fit Test: Poisson Distribution

df = 7, α = 0.05; non-rejection region below the critical value 14.067

Since χ²Cal = 3.268 < 14.067, do not reject Ho.
33
Thank You
34
χ² Goodness of Fit Test
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Goodness of fit for Uniform Distribution
Month Litres
• Milk Sales Data January 1,610
February 1,585
March 1,649
April 1,590
May 1,540
June 1,397
July 1,410
August 1,350
September 1,495
October 1,564
November 1,602
December 1,655
18,447
3
Hypotheses and Decision Rules

Ho: The monthly figures for milk sales are uniformly distributed
Ha: The monthly figures for milk sales are not uniformly distributed

α = .01
df = k - 1 - p = 12 - 1 - 0 = 11
χ²(.01, 11) = 24.725

If χ²Cal ≥ 24.725, reject Ho.
If χ²Cal < 24.725, do not reject Ho.
4
Python code
5
Calculations

fe = 18,447 / 12 = 1,537.25

Month        fo        fe        (fo - fe)² / fe
January     1,610    1,537.25        3.44
February    1,585    1,537.25        1.48
March       1,649    1,537.25        8.12
April       1,590    1,537.25        1.81
May         1,540    1,537.25        0.00
June        1,397    1,537.25       12.80
July        1,410    1,537.25       10.53
August      1,350    1,537.25       22.81
September   1,495    1,537.25        1.16
October     1,564    1,537.25        0.47
November    1,602    1,537.25        2.73
December    1,655    1,537.25        9.02
Total      18,447   18,447.00       74.38

χ²Cal = 74.37
6
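For the uniform-distribution test, fe is simply the yearly total divided by 12, and χ² follows directly:

```python
# Monthly milk sales (litres), January through December
litres = [1610, 1585, 1649, 1590, 1540, 1397,
          1410, 1350, 1495, 1564, 1602, 1655]

fe = sum(litres) / len(litres)                    # 18,447 / 12 = 1,537.25
chi2 = sum((fo - fe) ** 2 / fe for fo in litres)  # about 74.4
```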
Python code
7
Conclusion

df = 11

Since χ²Cal = 74.37 ≥ 24.725, reject Ho.
8
Goodness of Fit Test: Normal Distribution
1. Set up the null and alternative hypotheses.
2. Select a random sample and
a. Compute the mean and standard deviation.
b. Define intervals of values so that the expected frequency is at least 5 for
each interval.
c. For each interval record the observed frequencies
3. Compute the expected frequency, ei , for each interval.
9
Goodness of Fit Test: Normal Distribution

4. Compute the value of the test statistic.

χ² = Σ from i = 1 to k of (fi - ei)² / ei

5. Reject H0 if χ² exceeds the upper-tail critical value χ²α.
10
Normal Distribution Goodness of Fit Test
• Example: IQL Computers
11
Normal Distribution Goodness of Fit Test
33 43 44 45 52 52 56 58 63 64
64 65 66 68 70 72 73 73 74 75
83 84 85 86 91 92 94 98 102 105
(mean = 71, standard deviation = 18.23)
12
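The mean and standard deviation quoted on the slide can be reproduced with the standard library; note that 18.23 is the population standard deviation of these 30 values, not the sample one:

```python
import statistics

units = [33, 43, 44, 45, 52, 52, 56, 58, 63, 64,
         64, 65, 66, 68, 70, 72, 73, 73, 74, 75,
         83, 84, 85, 86, 91, 92, 94, 98, 102, 105]

mean = statistics.mean(units)    # 71
sd = statistics.pstdev(units)    # population sd, about 18.23
```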
Python code
13
Normal Distribution Goodness of Fit Test
• Hypotheses
H0: The population of number of units sold
has a normal distribution with mean 71
and standard deviation 18.23
14
Normal Distribution Goodness of Fit Test
• Interval Definition
15
Normal Distribution Goodness of Fit Test
• Interval Definition
Areas
= 1.00/6
= .1667
17
Normal Distribution Goodness of Fit Test
• Observed and Expected Frequencies
i fi ei f i - ei
Less than 53.02 6 5 1
53.02 to 63.03 3 5 -2
63.03 to 71.00 6 5 1
71.00 to 78.97 5 5 0
78.97 to 88.98 4 5 -1
More than 88.98 6 5 1
Total 30 30
18
Python code
19
Normal Distribution Goodness of Fit Test

• Rejection Rule
With α = .05 and k - p - 1 = 6 - 2 - 1 = 3 d.f. (where k = number of categories and p = number of population parameters estimated), χ².05 = 7.815

• Test Statistic
χ² = (1)²/5 + (-2)²/5 + (1)²/5 + (0)²/5 + (-1)²/5 + (1)²/5 = 1.600
20
Thank you
21
Cluster analysis: Introduction - I
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Cluster Analysis
3
Cluster analysis
4
Example
5
Example
• Because this example contains only two variables, we can investigate it by merely looking
at the plot
• In this small data set there are clearly two distinct groups of objects
• Such groups are called clusters, and to discover them is the aim of cluster analysis
6
Cluster and discriminant analysis
7
Cluster analysis and discriminant analysis
8
Types of data and how to handle them
9
Example
Attributes
Objects
10
Types of data and how to handle them
11
Type of data
• Interval-Scaled Variables
• In this situation the n objects are characterized by p continuous
measurements
• These values are positive or negative real numbers, such as height, weight,
temperature, age, cost, ..., which follow a linear scale
• For instance, the time interval between 1900 and 1910 was equal in length
to that between 1960 and 1970
12
Type of data
13
Interval-Scaled Variables
14
Interval-Scaled Variables

• For example: take eight people, with weight (in kilograms) and height (in centimetres).
• In this situation, n = 8 and p = 2.

Person   Weight (kg)   Height (cm)
A            15            95
B            49           156
C            13            95
D            45           160
E            85           178
F            66           176
G            12            90
H            10            78

Table 1
15
Figure 1
(Scatter plot of height in cm versus weight in kg for persons A–H.)
16
Interval-Scaled Variables

• The units on the vertical axis are drawn to the same size as those on the horizontal axis.
• The plot contains two obvious clusters, which can in this case be interpreted easily: one consists of the children (A, C, G, H) and the other of the adults (B, D, E, F).
• For instance, measuring the concentration of certain natural hormones might have yielded a …
Interval-Scaled Variables

• Let us now consider the effect of changing measurement units.
• If the weight and height of the subjects had been expressed in pounds and inches, the results would have looked quite different.
• A pound equals 0.4536 kg and an inch is 2.54 cm.
• Therefore, Table 2 contains larger numbers in the column of weights and smaller numbers in the column of heights.

Person   Weight (lb)   Height (in)
A           33.1          37.4
B          108            61.4
C           28.7          37.4
D           99.2          63
E          187.4          70
F          145.5          69.3
G           26.5          35.4
H           22            30.7

Table 2
18
Figure 2
(Scatter plot of height in inches versus weight in lb for persons A–H.)
19
Interpretation
• Although plotting essentially the same data as Figure 1, Figure 2 looks much flatter.
• In this figure, the relative importance of the variable “weight” is much larger than in Figure 1.
• As a consequence, the two clusters are not as nicely separated as in Figure 1, because in this particular example the height of a person gives a better indication of adulthood than his or her weight.
• If height had been expressed in feet (1 ft = 30.48 cm), the plot would become flatter still, and the variable “weight” would be rather dominant.
• In some applications, changing the measurement units may even lead one to see a very different clustering structure.
20
Standardizing the data
21
Standardizing the data
22
Standardizing the data
24
Standardizing the data
• When applying standardization, one forgets about the original data and
uses the new data matrix in all subsequent computations
25
Detecting outlier
26
Standardizing the data
• The preceding description might convey the impression that
standardization would be beneficial in all situations.
• However, it is merely an option that may or may not be useful in a given
application
• Sometimes the variables have an absolute meaning, and should not be
standardized
• For instance, it may happen that several variables are expressed in the
same units, so they should not be divided by different sf
• Often standardization dampens a clustering structure by reducing the
large effects because the variables with a big contribution are divided by a
large sf
27
Thank you
28
Cluster analysis: Part - II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
2
Example
• Let us take four persons A, B, C, D with the following age and height:

Person   Age (yr)   Height (cm)
A           35          190
B           40          190
C           35          160
D           40          160

Table 1

(Figure 1: plot of height versus age; A and B at height 190, C and D at height 160.)

Finding Groups in Data: An Introduction to Cluster Analysis
Author(s): Leonard Kaufman, Peter J. Rousseeuw
March 1990, John Wiley & Sons, Inc.
3
Example
4
Example
• The resulting data matrix, which is unitless, is given in Table 2
• Note that the new averages are zero and that the mean deviations equal 1
• Table 2
Person Variable 1 Variable 2
A 1 1
B -1 1
C 1 -1
D -1 -1
• Even when the data are converted to very strange units, standardization will always yield the same numbers.
5
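The standardization used here (subtract the mean, then divide by the mean absolute deviation, following Kaufman & Rousseeuw) indeed maps the four persons of Table 1 to ±1:

```python
def standardize(values):
    """z = (x - mean) / mean absolute deviation (not the standard deviation)."""
    m = sum(values) / len(values)
    s = sum(abs(v - m) for v in values) / len(values)
    return [(v - m) / s for v in values]

ages = [35, 40, 35, 40]          # persons A, B, C, D
heights = [190, 190, 160, 160]
```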
Example
6
Choice of measurement (Units)- Merits and demerits
7
Choice of measurement- Merits and demerits
8
Distances computation between the objects

• The next step is to compute distances between the objects, in order to quantify their degree of dissimilarity.
• It is necessary to have a distance for each pair of objects i and j.
• The most popular choice is the Euclidean distance:

d(i, j) = sqrt( (xi1 - xj1)² + (xi2 - xj2)² + ... + (xip - xjp)² )

• When the data are being standardized, one has to replace all x by z in this expression.
• This formula corresponds to the true geometrical distance between the points with coordinates (xi1, ..., xip) and (xj1, ..., xjp).
9
Example
10
Distances computation between the objects
11
Interpretation
• Suppose you live in a city where the streets are all north-south or east-
west, and hence perpendicular to each other
• Let Figure 3 be part of a street map of such a city, where the streets are
portrayed as vertical and horizontal lines
12
Interpretation
• Then the actual distance you would have to travel by car to get from point i to point j is the Manhattan (city block) distance.
• This would be the shortest length among all possible paths from i to j.
• Only a bird could fly straight from point i to point j, thereby covering the Euclidean distance between them.
Mathematical Requirements of a Distance Function
• Both the Euclidean metric and the Manhattan metric satisfy the following
mathematical requirements of a distance function, for all objects i, j, and h:
• (D1) d(i, j) ≥ 0
• (D2) d(i, i) = 0
• (D3) d(i, j) = d(j, i)
• (D4) d(i, j) ≤ d(i, h) + d(h, j)
• Condition (D1) merely states that distances are nonnegative numbers and (D2) says
that the distance of an object to itself is zero
• Axiom (D3) is the symmetry of the distance function
• The triangle inequality (D4) looks a little bit more complicated, but is necessary to allow
a geometrical interpretation
• It says essentially that going directly from i to j is shorter than making a detour over
object h
14
Distances computation between the objects
• Note that d(i, j) = 0 does not necessarily imply that i = j, because it can very well happen that two different objects have the same measurements for the variables under study.
• However, the triangle inequality implies that i and j will then have the same distance to any other object h, because d(i, h) ≤ d(i, j) + d(j, h) = d(j, h) and at the same time d(j, h) ≤ d(j, i) + d(i, h) = d(i, h), which together imply that d(i, h) = d(j, h).
15
Minkowski distance
16
Example for Calculation of Euclidean and Manhattan Distance

• Let x1 = (1, 2) and x2 = (3, 5) represent two objects as in the given figure. The Euclidean distance between the two is sqrt(2² + 3²) = 3.61. The Manhattan distance between the two is 2 + 3 = 5.

Figure 4
17
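Both metrics are one-liners in code; the figures on this slide, and the B–E distance computed on the next one, check out:

```python
import math

def euclidean(p, q):
    """True geometrical (straight-line) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

x1, x2 = (1, 2), (3, 5)
```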
n-by-n Matrix

• For example, the Euclidean distances between the objects of the following table can be obtained as shown on the next slide:

Person   Weight (kg)   Height (cm)
A            15            95
B            49           156
C            13            95
D            45           160
E            85           178
F            66           176
G            12            90
H            10            78

• Euclidean distance between B and E:
sqrt( (49 - 85)² + (156 - 178)² ) = 42.2
18
n- by- n Matrix
A B C D E F G H
A
B
C
D
E
F
G
H
19
Interpretation
20
Distance matrix
• It would suffice to write down only the lower triangular half of the
distance matrix
A B C D E F G
B
C
D
E
F
G
H
21
Selection of variables
22
Selection of variables
• The selection of “good” variables is a nontrivial task and may involve quite
some trial and error (in addition to subject-matter knowledge and
common sense)
• In this respect, cluster analysis may be considered an exploratory
technique
23
Thank you
24