Unit-2 MFAI
Descriptive Statistics
Dr. T. Kusuma
Assistant Professor
H&S
VNRVJIET
Types of Variables
Data
  Categorical (defined categories)
    Examples: Marital Status, Political Party, Eye Color
  Numerical
    Discrete (counted items)
      Examples: Number of Children, Defects per hour
    Continuous (measured characteristics)
      Examples: Weight, Voltage
Measurement Levels
Quantitative Data
  Ratio Data: differences between measurements, true zero exists
  Interval Data: differences between measurements but no true zero
Qualitative Data
  Ordinal Data: ordered categories (rankings, order, or scaling)
  Nominal Data: categories (no ordering or direction)
…continued
Nominal – level of measurement that applies to data that
consist of names, labels, or categories (with no implied
criteria by which the data can be ordered). For example,
Color: Red Blue Green Yellow etc.
Gender: Male Female
…continued
Ordinal – level of measurement that applies to data that may be
arranged in order. However, differences between data values
cannot be determined or are meaningless.
i.e. the variables are still classified into categories, but these
categories are ordered and there is no equivalent distance
between the categories.
Ratings: Bad, Poor, Average, Good, Excellent
Income Bracket: Low, Lower Middle, Middle, Upper Middle,
High
…continued
Interval – level of measurement that applies to data that can
be arranged in order and differences between data values are
meaningful.
• The variables are still classified into ordered categories,
but there is an equivalent distance between these
categories.
• This allows for a direct comparison between categories
such that the difference between any two sequential data
points is exactly the same as the difference between any
other two sequential data points.
• The zero point is arbitrary, i.e. we can add and
subtract interval-level values, but multiplying or dividing
them is not meaningful.
Examples:
Nominal: blood types
Ordinal: class grades
Discrete: number of babies born in a hospital per day
Continuous: age, exam score
Interval: shoe size
Ratio: salary
Tabular & Graphical Presentation of
Data
Data in raw form are usually not easy to use for decision
making
…continued
Categorical Variables    Numerical Variables
Categorical Data
  ◦ Frequency Distribution Table
  ◦ Bar Chart
  ◦ Pie Chart
  ◦ Pareto Diagram
Tabulating
Class Frequency
◦ Class frequency is the number of
observations in the data set that fall into a
particular class
Frequency Distribution
◦ It is a summary technique that organizes
data into classes and provides in tabular
form a list of the classes along with the
number of observations in each class.
…continued
Class Relative Frequency
◦ Class frequency divided by the total
number of observations in the data set
\[ \text{class relative frequency} = \frac{\text{class frequency}}{n} \]
Class Percentage
◦ Class relative frequency multiplied by
100
\[ \text{class percentage} = (\text{class relative frequency}) \times 100 \]
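The three quantities defined above (class frequency, class relative frequency, class percentage) can be tabulated in a few lines of Python. This is only a sketch: the `responses` list is hypothetical illustrative data, not taken from the slides, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical categorical observations (illustrative only, not from the slides)
responses = ["A", "B", "A", "O", "AB", "O", "A", "B", "O", "A"]

n = len(responses)                    # total number of observations
class_frequency = Counter(responses)  # class -> number of observations

# class -> (frequency, relative frequency, percentage)
table = {}
for label, freq in class_frequency.items():
    relative = freq / n               # class frequency / n
    table[label] = (freq, relative, relative * 100)

for label, (freq, rel, pct) in sorted(table.items()):
    print(f"{label:<3} {freq:>3} {rel:>6.2f} {pct:>6.1f}%")
```

The relative frequencies always add up to 1 (and the percentages to 100), which is a quick check on any hand-built table.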
Why Frequency Distributions?
Visualization
…continued
Objectives of visualization
samples of a variable
occurring in different
categories.
Histogram is a graphical representation that organizes a
group of data points into user-specified ranges.
Unlike a bar chart, there are
no spaces between contiguous
columns.
…cont
1. Pie chart
2. Bar Chart
3. Pareto diagram
Example: Hospital Patients by Unit (variables are categorical)
Hospital Unit      Number of Patients    % of Total
Intensive Care     ...                   4%
Maternity          ...                   6%
Surgery            4,630                 ...
(Percentages are rounded to the nearest percent)
[Bar Chart: Hospital Patients by Unit; x-axis: Cardiac Care, Emergency, Intensive Care, Maternity, Surgery; y-axis: Number of patients per year, 0 to 5,000]
[Pie Chart: Hospital Patients by Unit]
Example 3
Example: 400 defective items are examined for
cause of defect:
Source of Manufacturing Error    Number of defects
Bad Weld                         34
Poor Alignment                   223
Missing Part                     25
Paint Flaw                       78
Electrical Short                 19
Cracked case                     21
Total                            400
Pareto Diagram
Step 1: Sort the defect causes by number of defects, in descending order
Step 2: Determine % in each category
Step 3: Show results graphically
Source of Manufacturing Error    Number of defects    % of Total Defects
Poor Alignment                   223                  55.75
Paint Flaw                       78                   19.50
Bad Weld                         34                   8.50
Missing Part                     25                   6.25
Cracked case                     21                   5.25
Electrical Short                 19                   4.75
Total                            400                  100%
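Steps 1 and 2 can be sketched in Python using the defect counts from the example. The cumulative percentage column is what the line graph of a Pareto diagram plots; the variable names here are assumptions made for illustration.

```python
# Defect counts from the example (400 defective items)
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked case": 21,
}

total = sum(defects.values())

# Step 1: sort causes by number of defects, descending
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Step 2: percentage per category and running cumulative percentage
rows = []
cumulative = 0
for cause, count in ordered:
    cumulative += count
    rows.append((cause, count, 100 * count / total, 100 * cumulative / total))

for cause, count, pct, cum_pct in rows:
    print(f"{cause:<18} {count:>4} {pct:>7.2f} {cum_pct:>7.2f}")
```

Step 3 would plot the third column as bars and the fourth as a line over the same category order.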
continued
[Pareto diagram: bar graph of % of defects in each category, with a line graph of cumulative %; categories in order: Poor Alignment, Paint Flaw, Bad Weld, Missing Part, Cracked case, Electrical Short]
Exercise
1) Suppose you give a survey concerning favorite color,
and the data you collect look like the table
below. Draw a bar graph, a pie chart, and a Pareto diagram.
blue  red    orange  blue    blue  yellow  green  red     pink
blue  green  blue    purple  blue  blue    green  yellow  pink
blue  red    pink    green   blue  yellow  green  blue
2) Let's say you want to determine the distribution
of colors in a bag of Skittles. You open up a bag,
and you find that there are 15 red, 7 orange, 7
yellow, 13 green, and 8 purple.
3) 30 students were asked what their majors were. The following represents
their responses (M=Management; A=Accounting; E=Economics;
S=Statistics):
A M M A M E
M S A E E M
A S E M A M
A M A S A M
E E M A M M
Graphs to Describe Numerical Variables
Numerical Data
  ◦ Frequency Distributions and Cumulative Distributions
  ◦ Histogram
  ◦ Ogive
Sturges' Rule
Class Intervals and Class Boundaries
◦ Sort raw data in ascending order:
  12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35,
  37, 38, 41, 43, 44, 46, 53, 58
◦ Find range: 58 - 12 = 46
◦ Select number of classes: 5 (usually between 5 and 15)
◦ Compute interval width: 10 (46/5 = 9.2, then round up)
◦ Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60
◦ Count observations & assign to classes
continued
Interval               Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15
20 but less than 30    6           .30                  30
30 but less than 40    5           .25                  25
40 but less than 50    4           .20                  20
50 but less than 60    2           .10                  10
Total                  20          1.00                 100
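The counting step can be checked with a short Python sketch. It assumes the choices made in the worked example (first class starts at 10, class width 10, five classes) and uses the 20 raw values listed there; the variable names are illustrative.

```python
# Raw data from the example, already sorted
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35,
        37, 38, 41, 43, 44, 46, 53, 58]

start = 10        # lower boundary of the first class (assumed as in the example)
width = 10        # interval width
num_classes = 5   # number of classes

n = len(data)
frequencies = [0] * num_classes
for x in data:
    # "lower but less than upper": integer division picks the class index
    frequencies[(x - start) // width] += 1

for k, freq in enumerate(frequencies):
    lower = start + k * width
    print(f"{lower} but less than {lower + width}: "
          f"{freq:>2}  {freq / n:.2f}  {100 * freq / n:.0f}%")
```

Changing `width` and `num_classes` together is an easy way to see how the choice of intervals changes the shape of the table.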
Histogram
[Histogram of the frequency distribution (no gaps between bars); x-axis: Temperature in Degrees, 0 to 70; y-axis: Frequency]
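Questions for Grouping Data into Intervals
◦ Many narrow classes can leave gaps from empty classes.
[Histogram with narrow classes; x-axis: Temperature, 4 to 60 in steps of 4; y-axis: Frequency]
◦ Few wide classes can compress variation too much and yield a blocky
distribution, and can obscure important patterns of variation.
[Histogram with wide classes; x-axis: Temperature, 0, 30, 60; y-axis: Frequency]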
Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20    3           15           3                      15
20 but less than 30    6           30           9                      45
30 but less than 40    5           25           14                     70
40 but less than 50    4           20           18                     90
50 but less than 60    2           10           20                     100
The Ogive: Graphing CFD
Interval               Upper interval endpoint   Cumulative Percentage
Less than 10           10                        0
10 but less than 20    20                        15
20 but less than 30    30                        45
30 but less than 40    40                        70
40 but less than 50    50                        90
50 but less than 60    60                        100
[Ogive: Daily High Temperature; x-axis: Interval endpoints, 10 to 60; y-axis: Cumulative Percentage, 0 to 100]
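The points plotted on the ogive are the upper interval endpoints paired with the cumulative percentages. A minimal sketch, assuming the class frequencies 3, 6, 5, 4, 2 obtained from the 20 raw values of the example:

```python
from itertools import accumulate

# Upper endpoints and frequencies of the five classes in the example
upper_endpoints = [20, 30, 40, 50, 60]
frequencies = [3, 6, 5, 4, 2]

n = sum(frequencies)

# Running totals of the class frequencies
cumulative_frequency = list(accumulate(frequencies))
cumulative_percentage = [100 * cf / n for cf in cumulative_frequency]

# The ogive starts at 0% at the lower boundary of the first class (10)
ogive_points = [(10, 0.0)] + list(zip(upper_endpoints, cumulative_percentage))

for endpoint, pct in ogive_points:
    print(f"less than {endpoint}: {pct:.0f}%")
```

Joining `ogive_points` with straight line segments gives the ogive.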
Exercise
1. Represent the following set of data in tabular and graphical form:
02, 07, 16, 21, 31, 03, 08, 17, 21, 55, 03, 13, 18, 22, 55, 04,
14,19, 25, 57, 06, 15, 20, 29, 58.
2. Alex measured the lengths of leaves on the oak tree (to the
nearest cm):
9,16,13,7,8,4,18,10,17,18,9,12,5,9,9,16,1,8,17,1,10,5,9,11,15,
6,14,9,1,12,5,16,4,16,8,15,14,17
Construct a frequency distribution table and draw histogram and
ogive plots.
3. Let’s say you have a list of IQ scores for a gifted classroom in a
particular elementary school. The IQ scores are: 118, 123, 124,
125, 127, 128, 129, 130, 130, 133, 136, 138, 141, 142, 149,
150, 154.
Relationships Between Variables
Graphs illustrated so far have involved
only a single variable
When two variables exist other
techniques are used like Scatter diagram
Categorical (Qualitative) Variables    Numerical (Quantitative) Variables
[Scatter diagram; x-axis: Volume per Day, 0 to 70; data rows shown: (55, 195), (60, 200)]
Describing Data Numerically
Central Tendency
  ◦ Arithmetic Mean
  ◦ Median
  ◦ Mode
Variation
  ◦ Range
  ◦ Interquartile Range
  ◦ Variance
  ◦ Standard Deviation
  ◦ Coefficient of Variation
Summary Definitions
The central tendency is the extent to
which all the data values group around a
typical or central value.
Mean:
The arithmetic mean (often just called “mean”)
is the most common measure of central
tendency
Affected by extreme values (outliers)
Mean = sum of values divided by the number of values
For a sample of size n:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
where \bar{X} is pronounced "x-bar" and X_i is the ith value.
Examples:
Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
…continued
House Price in Lowtown:
97,000
93,000
110,000
121,000
113,000
95,000
100,000
122,000
99,000
2,000,000 (outlier)
\[ \sum x_i = 2{,}950{,}000 \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{2{,}950{,}000}{10} = 295{,}000 \]
The "average" or mean price for this sample of 10 houses in Lowtown is $295,000
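The effect of the outlier on the mean can be seen by recomputing the Lowtown mean with and without the 2,000,000 house. A small sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean

# House prices in Lowtown from the example
prices = [97_000, 93_000, 110_000, 121_000, 113_000,
          95_000, 100_000, 122_000, 99_000, 2_000_000]

mean_all = mean(prices)  # includes the outlier

# Same sample with the single outlier removed
typical = [p for p in prices if p != 2_000_000]
mean_without_outlier = mean(typical)

print(f"mean with outlier:    {mean_all:,.0f}")
print(f"mean without outlier: {mean_without_outlier:,.0f}")
```

One extreme value moves the mean from roughly 105,600 to 295,000, which is higher than nine of the ten prices.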
The Median
In an ordered array, the median is the “middle” number (50%
above, 50% below)
The location of the median when the values are in numerical order
(smallest to largest):
\[ \text{Median position} = \frac{n+1}{2} \text{ position in the ordered data} \]
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
Note that \frac{n+1}{2} is not the value of the median, only the position of
the median in the ranked data
Not affected by extreme values
…continued
Example
[Figure: two data sets plotted on a 0 to 10 axis; Median = 3 for both]
Example: Consider the Fancy town data. First,
we put the data in numerical increasing order to
get
[Figure: data plotted on a 0 to 14 axis, Mode = 9; data plotted on a 0 to 6 axis, No Mode]
Review Example
Five houses on a hill by the beach
House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000
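The three measures of central tendency for the five house prices can be computed with Python's standard `statistics` module. This is a sketch for checking hand calculations; the variable names are assumptions.

```python
from statistics import mean, median, mode

# Prices of the five houses in the review example
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

house_mean = mean(house_prices)      # sum of values / number of values
house_median = median(house_prices)  # middle value of the ordered data
house_mode = mode(house_prices)      # most frequently occurring value

print(f"Mean:   ${house_mean:,.0f}")
print(f"Median: ${house_median:,.0f}")
print(f"Mode:   ${house_mode:,.0f}")
```

The mean ($600,000) is pulled upward by the $2,000,000 house, while the median ($300,000) and mode ($100,000) are not.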
…continued
Mean for Grouped Data
\[ \bar{X} = \frac{\sum fX}{n} \]
Where
\bar{X} = Mean
\sum fX = Sum of cross products of frequency in each class with midpoint X of each class
n = Total number of observations (Total frequency) = \sum f
continued
Example:
Find the arithmetic mean for the following continuous
frequency distribution:
Class     X      f     fX
0-1       0.5    1     0.5
1-2       1.5    4     6.0
2-3       2.5    8     20.0
3-4       3.5    7     24.5
4-5       4.5    3     13.5
5-6       5.5    2     11.0
Totals           25    75.5
\[ \bar{X} = \frac{\sum fX}{n} = \frac{75.5}{25} = 3.02 \]
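The grouped-mean calculation can be reproduced in Python by taking each class midpoint as X and summing f times X. A minimal sketch with the class limits and frequencies of the example; the names are illustrative.

```python
# Classes (lower, upper) and frequencies from the example
classes = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
frequencies = [1, 4, 8, 7, 3, 2]

# Midpoint X of each class
midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(frequencies)                                        # total frequency
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints)) # sum of f * X

grouped_mean = sum_fx / n
print(f"n = {n}, sum fX = {sum_fx}, mean = {grouped_mean:.2f}")
```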
Median for Grouped Data
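Example:
Find the median for the following
continuous frequency distribution (the values below match the distribution
used in the arithmetic mean example: n = 25, median class 2-3):
\[ \text{Median} = L + \frac{(n/2) - m}{f} \times c = 2 + \frac{(25/2) - 5}{8} \times 1 = 2.9375 \]
where L = lower boundary of the median class, m = cumulative frequency of the
classes before the median class, f = frequency of the median class, and
c = class interval of the median class.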
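The substitution above can be checked with a short Python sketch. It assumes the usual reading of the symbols (L is the lower boundary of the median class, m the cumulative frequency before it, f its frequency, c its width) and reuses the frequency distribution from the arithmetic mean example, whose numbers match n = 25, m = 5, f = 8.

```python
# Classes (lower, upper) and frequencies, assumed to be the same
# distribution as in the grouped-mean example
classes = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
frequencies = [1, 4, 8, 7, 3, 2]

n = sum(frequencies)
half = n / 2

cumulative_before = 0  # m: cumulative frequency before the current class
grouped_median = None
for (lower, upper), f in zip(classes, frequencies):
    if cumulative_before + f >= half:
        # First class whose cumulative frequency reaches n/2: the median class
        c = upper - lower
        grouped_median = lower + (half - cumulative_before) / f * c
        break
    cumulative_before += f

print(f"median class starts at {lower}, m = {cumulative_before}, "
      f"median = {grouped_median}")
```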
Mode for Grouped Data
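\[ \text{Mode} = L + \frac{d_1}{d_1 + d_2} \times c \]
Where
L = Lower boundary of the modal class
d_1 = f - f_1 and d_2 = f - f_2
f = Frequency of the modal class
f_1 = Frequency preceding the modal class
f_2 = Frequency succeeding the modal class
c = Class Interval of the modal class
Modal class: The class that has the highest
frequency.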
…continued
Example:
Find the mode for the following
continuous frequency distribution:
Class        Frequency
150 - 154    5
155 - 159    2
160 - 164    6
165 - 169    8
170 - 174    9
175 - 179    11
180 - 184    6
185 - 189    3
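A sketch of the grouped-mode calculation for this table follows. Two assumptions are made and are not stated on the slides: d1 and d2 are the differences between the modal frequency and the frequencies of the preceding and succeeding classes, and, because the class limits leave a gap of 1 (154 to 155), the lower boundary of each class is taken as its lower limit minus 0.5.

```python
# Lower class limits and frequencies from the example table
lower_limits = [150, 155, 160, 165, 170, 175, 180, 185]
frequencies = [5, 2, 6, 8, 9, 11, 6, 3]
class_width = 5  # e.g. 155 - 150

# Modal class: the class with the highest frequency
k = frequencies.index(max(frequencies))
f = frequencies[k]
f1 = frequencies[k - 1] if k > 0 else 0                     # preceding class
f2 = frequencies[k + 1] if k < len(frequencies) - 1 else 0  # succeeding class

# Assumption: continuity correction of 0.5 gives the lower class boundary
L = lower_limits[k] - 0.5

d1 = f - f1
d2 = f - f2
grouped_mode = L + d1 / (d1 + d2) * class_width

print(f"modal class starts at {lower_limits[k]}, d1 = {d1}, d2 = {d2}, "
      f"mode = {grouped_mode:.2f}")
```

If the lower limit 175 is used as L instead of the boundary 174.5, the result shifts up by 0.5.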
Measures of Variation
Variation
Example (data plotted on a 0 to 14 axis; smallest value 1, largest value 13):
Range = 13 - 1 = 12
Why The Range Can Be Misleading
[Figure: two different data sets, each plotted on a 7 to 12 axis]
Range = 12 - 7 = 5 for both data sets
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
The Variance
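◦ Sample variance:
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
◦ Sample standard deviation:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]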
Example
Sample Data (X_i): 10 12 14 15 17 18 18 24
n = 8, Mean = \bar{X} = 16
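The sample variance and standard deviation for this data set can be worked out step by step and cross-checked against Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

# Sample data from the example
data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
x_bar = sum(data) / n                             # sample mean
squared_devs = [(x - x_bar) ** 2 for x in data]   # (X_i - X-bar)^2

sample_variance = sum(squared_devs) / (n - 1)     # divide by n - 1
sample_std_dev = sqrt(sample_variance)

print(f"mean = {x_bar}, sum of squared deviations = {sum(squared_devs)}")
print(f"S^2 = {sample_variance:.4f}, S = {sample_std_dev:.4f}")

# Cross-check against the standard library (both use n - 1)
assert abs(sample_variance - variance(data)) < 1e-9
assert abs(sample_std_dev - stdev(data)) < 1e-9
```

The sum of squared deviations is 130, so S^2 = 130/7 and S is about 4.31.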
Comparing Standard Deviations
[Figure: data sets plotted on an 11 to 21 axis]
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
Class    Frequency
...
25-30    8
Total    60
…continued
Solution:
Measures of Variation- Summary Characteristics
The more the data are spread out, the greater the range,
variance, and standard deviation.
The more the data are concentrated, the smaller the range,
variance, and standard deviation.
If the values are all the same (no variation), all these
measures will be zero.
Stock A:
◦ Average price last year = $50
◦ Standard deviation = $5
\[ CV_A = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
Stock B:
◦ Average price last year = $100
◦ Standard deviation = $5
\[ CV_B = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less
variable relative to its price
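The coefficient of variation for the two stocks can be computed with a one-line helper. The function name `coefficient_of_variation` is a hypothetical name chosen for this sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage: (S / X-bar) * 100."""
    return 100 * std_dev / mean


# Stock A: average price $50, standard deviation $5
cv_a = coefficient_of_variation(5, 50)
# Stock B: average price $100, standard deviation $5
cv_b = coefficient_of_variation(5, 100)

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the CV is unit-free, it allows variability to be compared across series with different means or different units.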
Quartile Measures
Quartiles split the ranked data into 4
segments with an equal number of values
per segment
25% 25% 25% 25%
Q1 Q2 Q3
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
…continued
The shape of the distribution is said
to be symmetric if the observations
are balanced, or evenly distributed,
about the center.
[Histogram: Symmetric Distribution; x-axis: 1 to 9; y-axis: Frequency, 0 to 10]
…continued
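The shape of the distribution is said to be
skewed if the observations are not
symmetrically distributed around the center.
Positively Skewed Distribution
A positively skewed distribution (skewed to the right) has a tail that
extends to the right in the direction of positive values.
[Histogram: Positively Skewed Distribution; x-axis: 1 to 9; y-axis: Frequency]
Negatively Skewed Distribution
A negatively skewed distribution (skewed to the left) has a tail that
extends to the left in the direction of negative values.
[Histogram: Negatively Skewed Distribution; x-axis: 1 to 9; y-axis: Frequency]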
Questions
a. Variance
b. Median
c. Range
d. Mean
Scatter plots
Frequency distribution(Two way table)
Correlation coefficient
Regression analysis
Example
An ice cream shop keeps track of how much ice cream they sell versus the
temperature on that day. Here are their figures for the last 12 days.
Represent the data in a scatter diagram.
Temperature °C    Ice Cream Sales
14.2°             $215
16.4°             $325
11.9°             $185
15.2°             $332
18.5°             $406
22.1°             $522
19.4°             $412
25.1°             $614
23.4°             $544
18.1°             $421
22.6°             $445
17.2°             $408
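As a numerical companion to the scatter diagram, the Pearson correlation coefficient for the 12 (temperature, sales) pairs can be computed directly from its definition. This is a sketch in plain Python; the variable names are assumptions, and the pairs themselves are the points a scatter diagram would plot.

```python
from math import sqrt

# Data from the ice cream example: temperature in °C and sales in dollars
temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
               19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522,
         412, 614, 544, 421, 445, 408]

n = len(temperature)
mean_t = sum(temperature) / n
mean_s = sum(sales) / n

# Sums of cross products and squared deviations
sxy = sum((t - mean_t) * (s - mean_s) for t, s in zip(temperature, sales))
sxx = sum((t - mean_t) ** 2 for t in temperature)
syy = sum((s - mean_s) ** 2 for s in sales)

r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
print(f"r = {r:.4f}")
```

A value of r close to +1 agrees with what the scatter diagram shows: sales tend to rise with temperature.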
Frequency distributions for bivariate data
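Two-way frequency table (rows: Gender; columns: categories of the second variable; last column: row total)
male      3    2    8    13
female    2    4    1    7
total     5    6    9    20
Joint relative frequency: each cell frequency divided by the grand total (20 here)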
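Joint relative frequencies for the two-way table can be obtained by dividing every cell by the grand total. In this sketch the column labels of the second variable are not recoverable from the slide, so the columns are kept as unnamed positions; the variable names are assumptions.

```python
# Cell counts from the two-way table (columns are the three categories
# of the second variable; their labels are not given)
counts = {
    "male": [3, 2, 8],
    "female": [2, 4, 1],
}

grand_total = sum(sum(row) for row in counts.values())

# Joint relative frequency: cell frequency / grand total
joint_relative = {
    gender: [cell / grand_total for cell in row]
    for gender, row in counts.items()
}

# Marginal totals for rows and columns, as in the "total" row and column
row_totals = {gender: sum(row) for gender, row in counts.items()}
column_totals = [sum(col) for col in zip(*counts.values())]

for gender, row in joint_relative.items():
    print(gender, [f"{value:.2f}" for value in row])
```

The joint relative frequencies sum to 1, and dividing `row_totals` or `column_totals` by the grand total gives the marginal relative frequencies.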