
Unit-2 MFAI


UNIT-II

Descriptive Statistics

Dr.T.kusuma
Assistant Professor
H&S
VNRVJIET
Types of Variables

• Categorical or qualitative variables place an individual into one of several groups or categories (hair color, race, gender, etc.).

• Numerical or quantitative variables are measured in terms of numbers (age, height, weight, etc.).

• Discrete variables can take only whole-number (integer) values (the number of babies born in a hospital per day, the number of wires in a cable).

• Continuous variables can take any value within a range (the distance between two towns, age, length, weight, time, the points on a line).
Types of Data

Data
• Categorical (defined categories)
  Examples: Marital Status, Political Party, Eye Color
• Numerical
  ◦ Discrete (counted items)
    Examples: Number of Children, Defects per hour
  ◦ Continuous (measured characteristics)
    Examples: Weight, Voltage
Measurement Levels

Quantitative Data
• Ratio Data: differences between measurements, true zero exists
• Interval Data: differences between measurements but no true zero
Qualitative Data
• Ordinal Data: ordered categories (rankings, order, or scaling)
• Nominal Data: categories (no ordering or direction)
…continued
Nominal – level of measurement that applies to data that
consist of names, labels, or categories (with no implied
criteria by which the data can be ordered). For example,
Color: Red Blue Green Yellow etc.
Gender: Male Female
…continued
Ordinal – level of measurement that applies to data that may be
arranged in order. However, differences between data values
cannot be determined or are meaningless.
i.e. the variables are still classified into categories, but these
categories are ordered and there is no equivalent distance
between the categories.
Ratings: Bad, Poor, Average, Good, Excellent
Income Bracket: Low, Lower Middle, Middle, Upper Middle,
High
…continued
Interval – level of measurement that applies to data that can
be arranged in order and differences between data values are
meaningful.
• The variables are still classified into ordered categories,
but there is an equivalent distance between these
categories.
• This allows for a direct comparison between categories
such that the difference between any two sequential data
points is exactly the same as the difference between any
other two sequential data points.
• The zero point is arbitrary, so interval-level values can be added and subtracted, but multiplying or dividing them (taking ratios) is not meaningful.

E.g.: shoe size, temperature (°F), IQ, etc.


…continued

Ratio – level of measurement that applies to data that can be arranged in order, and both differences between data values and ratios of data values are meaningful. Data have a true zero.
• The ratio level variables have all of the characteristics of nominal, ordinal and interval variables, but also have a meaningful zero point.

For example: weight of a person, the Kelvin scale of temperature, monthly salary, etc.
Variable
• Categorical or Qualitative variable (hair color, race, gender, etc.)
  ◦ Nominal (blood types)
  ◦ Ordinal (class grades)
• Numerical or Quantitative variable (age, height, weight, etc.)
  ◦ Discrete (number of babies born in a hospital per day)
  ◦ Continuous (age, exam score)
  ◦ Interval (shoe size)
  ◦ Ratio (salary)
Tabular & Graphical Presentation of
Data
 Data in raw form are usually not easy to use for decision
making

 Some type of organization is needed


 Table
 Graph

 The type of graph to use depends on the variable being


summarized

…continued

Categorical Variables
• Frequency distribution
• Bar chart
• Pie chart
• Pareto diagram

Numerical Variables
• Line chart
• Frequency distribution
• Histogram and Ogive
• Scatter plot
…continued

Categorical Data
• Tabulating Data: Frequency Distribution Table
• Graphing Data: Bar Chart, Pie Chart, Pareto Diagram
Tabulating

 Qualitative Data are non-numerical


◦ Major Discipline
◦ Political Party
◦ Gender
◦ Eye color

 Summarized in two ways:


◦ Class Frequency
◦ Class Relative Frequency
…continued
 Class
◦ A class is one of the categories into which
qualitative data can be classified

 Class Frequency
◦ Class frequency is the number of
observations in the data set that fall into a
particular class
 Frequency Distribution
◦ It is a summary technique that organizes
data into classes and provides in tabular
form a list of the classes along with the
number of observations in each class.

…continued
 Class Relative Frequency
◦ Class frequency divided by the total number of observations in the data set:
\[ \text{class relative frequency} = \dfrac{\text{class frequency}}{n} \]

 Class Percentage
◦ Class relative frequency multiplied by 100:
\[ \text{class percentage} = (\text{class relative frequency}) \times 100 \]
Why Frequency Distributions?

A frequency distribution is a way to


summarize data
 The distribution condenses the raw data
into a more useful form...
 and allows for a quick visual
interpretation of the data

Visualization
…continued
Objectives of visualization

 It is often said that “A picture is worth a thousand words”.

 Words that are enhanced with appropriate graphs reduce or


remove the need for lengthy explanations.

 In statistics, a rule of thumb for effective communication is


to present numbers pictorially using charts and graphs.
 There are two main objectives for using data visualization
in statistics:
1. A visual context enables viewers to more easily
detect patterns, trend or correlations.
2. Pictures can make it easier to communicate
statistical results to an audience.
Types of plots

Scatter plot uses dots to represent the relationship between two numeric variables and shows how being high or low on one numeric variable relates to being high or low on a second numeric variable.

Dot plot displays one dot per individual observation of a variable. It is best suited to relatively small data sets.
…cont

 Line plot illustrates how variable changes with respect to


another variable, for example time. Multiple lines can be
plotted to compare different variables.
 Box plot is used to display the sample distribution of a
variable and detect extreme values or other unusual
characteristics.
 Density plot shows the distribution of a numeric variable.
…cont

 Pie chart shows the proportions of a variable that occur in


different categories.
 Bar chart shows the number of observations of a variable occurring in different categories.
 Histogram is a graphical representation that organizes a group of data points into user-specified ranges. Unlike a bar chart, there are no spaces between contiguous columns.
…cont

 Contour plot shows how variable changes according to two


other variables which could, for example, be directions on a
map

 Surface plot is another way to show how variable changes


according to two other variables. It can be helpful in
regression analysis for viewing the relationship between a
dependent and two independent variables.
…cont
Example 1
Frequency distribution table
Example: Adult Aphasia
Table: Data on 22 Adult Aphasias
Subject Type of Aphasia Subject Type of Aphasia
1 Broca’s 12 Broca’s
2 Anomic 13 Anomic
3 Anomic 14 Broca’s
4 Conduction 15 Anomic
5 Broca’s 16 Anomic
6 Conduction 17 Anomic
7 Conduction 18 Conduction
8 Anomic 19 Broca’s
9 Conduction 20 Anomic
10 Anomic 21 Conduction
11 Conduction 22 Anomic
Table: Frequency Distribution of Data on 22 Adult Aphasias
Type of Aphasia Frequency
Anomic 10
Broca’s 5
Conduction 7
Total 22
Table: Frequency, relative frequency, and
class percentage on 22 Adult Aphasias

Type of Aphasia   Frequency   Relative Frequency   Class Percentage
Anomic            10          10/22 = .455         45.5%
Broca’s           5           5/22 = .227          22.7%
Conduction        7           7/22 = .318          31.8%
Total             22          22/22 = 1.00         100%
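The three columns of this table can be reproduced mechanically from the raw data. The following is a minimal Python sketch, assuming the 22 observations are entered as a list in subject order; the name `frequency_table` is illustrative and not part of the slides.

```python
from collections import Counter

# The 22 aphasia types from the data table, in subject order (1 to 22).
aphasia = [
    "Broca's", "Anomic", "Anomic", "Conduction", "Broca's", "Conduction",
    "Conduction", "Anomic", "Conduction", "Anomic", "Conduction",
    "Broca's", "Anomic", "Broca's", "Anomic", "Anomic", "Anomic",
    "Conduction", "Broca's", "Anomic", "Conduction", "Anomic",
]


def frequency_table(values):
    """Map each class to (frequency, relative frequency, percentage)."""
    counts = Counter(values)
    n = len(values)
    return {cls: (f, f / n, 100 * f / n) for cls, f in sorted(counts.items())}


table = frequency_table(aphasia)
for cls, (f, rel, pct) in table.items():
    print(f"{cls:<12}{f:>3}{rel:>8.3f}{pct:>7.1f}%")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on any hand-built table.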
Graphical method
• Graphical methods for describing Qualitative
variables

1. Pie chart
2. Bar Chart
3. Pareto diagram

 Bar charts and pie charts are often used for qualitative (category) data

 Height of bar or size of pie slice shows the frequency or percentage for each category
 Pie Chart
◦ A graph that displays data in a circular
format.
◦ The categories of the qualitative variable
are represented by the slices of a pie.
◦ Each slice of a pie represents a portion or
percentage of the total.
 Bar Chart
◦ A graphical representation of information in
the form of bars.
◦ Bars of equal width are drawn to represent
different categories, with the length of each
bar being proportional to the number or
frequency of occurrence of each category.
Pareto Diagram
 A bar chart, where categories are shown in
descending order of frequency
 A cumulative polygon is often shown in the
same graph
 Used to separate the “vital few” from the
“trivial many”
 The purpose is to highlight the most important
among a (typically large) set of factors.
Example 2

Hospital Patients by Unit

Hospital Unit     Number of Patients   % of Total
Cardiac Care      1,052                11.93
Emergency         2,245                25.46
Intensive Care    340                  3.86
Maternity         552                  6.26
Surgery           4,630                52.50

(Variables are categorical)

Bar Chart: Hospital Patients by Unit (one bar per hospital unit; the other axis shows the number of patients per year, 0 to 5,000)

Pie Chart: Hospital Patients by Unit (Cardiac Care 12%, Emergency 25%, Intensive Care 4%, Maternity 6%, Surgery 53%; percentages are rounded to the nearest percent)
Example 3
Example: 400 defective items are examined for
cause of defect:

Source of Manufacturing Error   Number of defects
Bad Weld                        34
Poor Alignment                  223
Missing Part                    25
Paint Flaw                      78
Electrical Short                19
Cracked case                    21
Total                           400
Pareto Diagram
Step 1: Sort the defect causes by number of defects, in descending order
Step 2: Determine % in each category
Step 3: Show results graphically

Source of Manufacturing Error   Number of defects   % of Total Defects
Poor Alignment                  223                 55.75
Paint Flaw                      78                  19.50
Bad Weld                        34                  8.50
Missing Part                    25                  6.25
Cracked case                    21                  5.25
Electrical Short                19                  4.75
Total                           400                 100%
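The three Pareto steps translate directly into code. A short Python sketch, assuming the defect counts are held in a dictionary; `pareto_table` is a hypothetical helper name, and the cumulative column is the quantity the line graph of a Pareto diagram would show.

```python
# Defect counts from the example table (400 defective items in total).
defects = {
    "Bad Weld": 34, "Poor Alignment": 223, "Missing Part": 25,
    "Paint Flaw": 78, "Electrical Short": 19, "Cracked case": 21,
}


def pareto_table(counts):
    """Return rows (cause, count, percent, cumulative percent), largest first."""
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    # Step 1: sort by count, descending.
    for cause, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        pct = 100 * count / total      # Step 2: percentage in each category
        cumulative += pct              # running total for the cumulative line
        rows.append((cause, count, pct, cumulative))
    return rows


pareto = pareto_table(defects)
for cause, count, pct, cum in pareto:
    print(f"{cause:<18}{count:>5}{pct:>8.2f}{cum:>8.2f}")
```

Reading the cumulative column shows how few causes account for most defects: the first two rows already cover about three quarters of the total.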
Pareto Diagram: Cause of Manufacturing Defect (bar graph, left axis: % of defects in each category, 0% to 60%; line graph, right axis: cumulative %, 0% to 100%; categories from left to right: Poor Alignment, Paint Flaw, Bad Weld, Missing Part, Cracked case, Electrical Short)
Exercise
1) Let's suppose you give a survey concerning favorite color, and the data you collect looks something like the table below. Draw a bar graph, pie chart and Pareto diagram.
blue, orange, red, blue, blue, yellow, green, red, pink
blue, green, blue, purple, blue, blue, green, yellow, pink
blue, red, pink, green, blue, yellow, green, blue
2) Let's say you want to determine the distribution of colors in a bag of Skittles. You open up a bag, and you find that there are 15 red, 7 orange, 7 yellow, 13 green, and 8 purple.
3) 30 students were asked what their majors were. The following represents their responses (M=Management; A=Accounting; E=Economics; S=Statistics):
A M M A M E
M S A E E M
A S E M A M
A M A S A M
E E M A M M
Graphs to Describe Numerical
Variables

Numerical Data
• Frequency Distributions and Cumulative Distributions
• Histogram
• Ogive
Sturge’s Rule
Class Intervals and Class Boundaries

Each class grouping has the same width

Determine the width of each interval by

\[ w = \text{interval width} = \dfrac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}} \]

 Use at least 5 but no more than 15-20 intervals
 Intervals never overlap
 Round up the interval width to get desirable interval endpoints
Frequency Distribution
Example
Example: A manufacturer of
insulation randomly selects 20
winter days and records the daily
high temperature
24, 35, 17, 21, 24, 37, 26, 46,
58, 30, 32, 13, 12, 38, 41, 43,
44, 27, 53, 27
Frequency Distribution Example

 Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
 Find range: 58 - 12 = 46
 Select number of classes: 5 (usually between 5 and 15)
 Compute interval width: 10 (46/5 = 9.2, then round up)
 Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60
 Count observations & assign to classes
continued

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38,
41, 43, 44, 46, 53, 58

Interval              Frequency   Relative Frequency   Percentage
10 but less than 20   3           .15                  15
20 but less than 30   6           .30                  30
30 but less than 40   5           .25                  25
40 but less than 50   4           .20                  20
50 but less than 60   2           .10                  10
Total                 20          1.00                 100
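The counting step can be checked with a few lines of Python. This is a sketch only: it assumes equal-width classes of the form "lower but less than upper", starting at 10 with width 10 as chosen in the example, and `grouped_frequencies` is an illustrative name.

```python
# Daily high temperatures for the 20 sampled winter days.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def grouped_frequencies(data, start, width, k):
    """Count observations in k classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for x in data:
        idx = int((x - start) // width)   # index of the class containing x
        if not 0 <= idx < k:
            raise ValueError(f"{x} falls outside the chosen classes")
        counts[idx] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


classes = grouped_frequencies(temps, start=10, width=10, k=5)
n = len(temps)
for lower, upper, f in classes:
    print(f"{lower} but less than {upper}: f = {f}, relative = {f / n:.2f}")
```

Because the lower endpoint is included and the upper one excluded, a value such as 30 lands in "30 but less than 40", matching the rule that intervals never overlap.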
Histogram

 A graph of the data in a frequency distribution is


called a histogram
 The interval endpoints are shown on the
horizontal axis
 the vertical axis is either frequency, relative
frequency, or percentage
 Bars of the appropriate heights are used to
represent the number of observations within
each class
Histogram Example

Interval              Frequency
10 but less than 20   3
20 but less than 30   6
30 but less than 40   5
40 but less than 50   4
50 but less than 60   2

Histogram: Daily High Temperature (horizontal axis: temperature in degrees, endpoints 0 to 70; vertical axis: frequency; bar heights 3, 6, 5, 4, 2; no gaps between bars)
Questions for Grouping Data into
Intervals

1. How wide should each interval be? (How many classes should be used?)

2. How should the endpoints of the intervals be determined?
 Often answered by trial and error, subject
to user judgment
 The goal is to create a distribution that is
neither too "jagged" nor too "blocky”
 Goal is to appropriately show the pattern of
variation in the data
How Many Class
Intervals?
 Many (narrow class intervals)
 may yield a very jagged distribution with gaps from empty classes
 can give a poor indication of how frequency varies across classes

 Few (wide class intervals)
 may compress variation too much and yield a blocky distribution
 can obscure important patterns of variation

(Figures: two histograms of temperature against frequency, one with many narrow classes and one with only a few wide classes; x-axis labels are upper class endpoints.)
Ogive graph
• An ogive, sometimes called a cumulative frequency polygon, is a type of frequency polygon that shows cumulative frequencies.
• In other words, the cumulative percents are
added on the graph from left to right
• An ogive graph plots cumulative
frequency on the y-axis and class
boundaries along the x-axis. It’s very similar
to a histogram, only instead of rectangles, an
ogive has a single point marking where the
top right of the rectangle would be.
The Cumulative Frequency
Distribution
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38,
41, 43, 44, 46, 53, 58

Class                 Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20   3           15           3                      15
20 but less than 30   6           30           9                      45
30 but less than 40   5           25           14                     70
40 but less than 50   4           20           18                     90
50 but less than 60   2           10           20                     100
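Cumulative frequencies are running totals of the class frequencies, so they can be generated rather than added by hand. A small Python sketch, assuming the five class frequencies of the temperature example (3, 6, 5, 4, 2); the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies and upper class endpoints for the temperature example.
frequencies = [3, 6, 5, 4, 2]
upper_endpoints = [20, 30, 40, 50, 60]

cum_freq = list(accumulate(frequencies))      # running totals
n = cum_freq[-1]                              # total number of observations
cum_pct = [100 * c / n for c in cum_freq]     # cumulative percentages

# Points an ogive would plot: it starts at 0% at the lowest class boundary.
ogive_points = [(10, 0.0)] + list(zip(upper_endpoints, cum_pct))
print(ogive_points)
```

Each ogive point pairs an upper class endpoint with the percentage of observations below it, which is why the curve never decreases and ends at 100%.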
The Ogive Graphing CFD
Interval              Upper interval endpoint   Cumulative Percentage
Less than 10          10                        0
10 but less than 20   20                        15
20 but less than 30   30                        45
30 but less than 40   40                        70
40 but less than 50   50                        90
50 but less than 60   60                        100

Ogive: Daily High Temperature (horizontal axis: interval endpoints, 10 to 60; vertical axis: cumulative percentage, 0 to 100)
Exercise
1. Represent the following set of data in tabular and graphical form:

02, 07, 16, 21, 31, 03, 08, 17, 21, 55, 03, 13, 18, 22, 55, 04,
14,19, 25, 57, 06, 15, 20, 29, 58.

2. Alex measured the lengths of leaves on the oak tree (to the
nearest cm):
9,16,13,7,8,4,18,10,17,18,9,12,5,9,9,16,1,8,17,1,10,5,9,11,15,
6,14,9,1,12,5,16,4,16,8,15,14,17
Construct a frequency distribution table and draw histogram and ogive plots.
3. Let’s say you have a list of IQ scores for a gifted classroom in a
particular elementary school. The IQ scores are: 118, 123, 124,
125, 127, 128, 129, 130, 130, 133, 136, 138, 141, 142, 149,
150, 154.
Relationships Between Variables
Graphs illustrated so far have involved
only a single variable
When two variables exist other
techniques are used like Scatter diagram

Categorical (Qualitative) Variables: Cross tables
Numerical (Quantitative) Variables: Scatter plots


Scatter Diagrams
Scatter Diagrams are used for
paired observations taken from
two numerical variables

one variable is measured on the


vertical axis and the other variable is
measured on the horizontal axis
Scatter Diagram Example
Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

Scatter plot: Cost per Day vs. Production Volume (horizontal axis: volume per day, 0 to 70; vertical axis: cost per day, 0 to 250)
Describing Data
Numerically
Central Tendency
• Arithmetic Mean
• Median
• Mode

Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
Summary Definitions
 The central tendency is the extent to
which all the data values group around a
typical or central value.

 The variation is the amount of dispersion,


or scattering, of values



Summarizing Data Sets

Numerical Measures of Central Tendency

 Central tendency is the value


or values around which the data
tend to cluster

 Variability shows how strongly


the data cluster around that (those)
value(s)
Measures of Central Tendency

Mean:
 The arithmetic mean (often just called “mean”) is the most common measure of central tendency
 Affected by extreme values (outliers)
 Mean = sum of values divided by the number of values

For a sample of size n (the mean is written X̄, pronounced x-bar):

\[ \bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n} = \dfrac{X_1 + X_2 + \cdots + X_n}{n} \]

where n is the sample size and X_i is the ith observed value.

…continued
Examples

Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
…continued

Example: During a two week period 10 houses were sold in Fancytown.

House Price in Fancytown
231,000
313,000
299,000
312,000
285,000
317,000
294,000
297,000
315,000
287,000
Σx = 2,950,000

\[ \bar{x} = \dfrac{\sum_{i=1}^{10} x_i}{n} = \dfrac{2{,}950{,}000}{10} = 295{,}000 \]

The “average” or mean price for this sample of 10 houses in Fancytown is $295,000.
Example: During a two week period 10 houses were sold in Lowtown.

House Price in Lowtown
97,000
93,000
110,000
121,000
113,000
95,000
100,000
122,000
99,000
2,000,000 (outlier)
Σx = 2,950,000

\[ \bar{x} = \dfrac{\sum_{i=1}^{10} x_i}{n} = \dfrac{2{,}950{,}000}{10} = 295{,}000 \]

The “average” or mean price for this sample of 10 houses in Lowtown is $295,000.
The Median
 In an ordered array, the median is the “middle” number (50%
above, 50% below)

 The location of the median when the values are in numerical order
(smallest to largest):
\[ \text{Median position} = \dfrac{n+1}{2} \text{ position in the ordered data} \]
 If the number of values is odd, the median is the middle number

 If the number of values is even, the median is the average of the


two middle numbers

Note that (n+1)/2 is not the value of the median, only the position of the median in the ranked data
 Not affected by extreme values
…continued

Example

(Figure: dot plots of two data sets on a 0 to 10 axis; Median = 3 for both.)
Example: Consider the Fancytown data. First, we put the data in numerical increasing order to get

231,000 285,000 287,000 294,000 297,000
299,000 312,000 313,000 315,000 317,000

Since there are 10 (even) data values, the median is the mean of the two values in the middle.

\[ \text{Median}, M = \dfrac{297{,}000 + 299{,}000}{2} = \$298{,}000 \]
Example: Consider the Lowtown data. We put the data in numerical increasing order to get

93,000 95,000 97,000 99,000 100,000
110,000 113,000 121,000 122,000 2,000,000

Since there are 10 (even) data values, the median is the mean of the two values in the middle.

\[ \text{Median}, M = \dfrac{100{,}000 + 110{,}000}{2} = \$105{,}000 \]
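The two towns make the effect of an outlier easy to see in code. A brief Python sketch using the standard-library `statistics` module; the list names `fancytown` and `lowtown` are chosen here for illustration.

```python
from statistics import mean, median

# Sale prices of the 10 houses in each town, as listed in the examples.
fancytown = [231_000, 313_000, 299_000, 312_000, 285_000,
             317_000, 294_000, 297_000, 315_000, 287_000]
lowtown = [97_000, 93_000, 110_000, 121_000, 113_000,
           95_000, 100_000, 122_000, 99_000, 2_000_000]

for name, prices in (("Fancytown", fancytown), ("Lowtown", lowtown)):
    # median() sorts internally and averages the two middle values when n is even.
    print(f"{name}: mean = {mean(prices):,.0f}, median = {median(prices):,.0f}")
```

Both towns share the same mean, but the single 2,000,000 sale pulls the Lowtown mean far above its typical price, while the median stays near the bulk of the data.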
The Mode
Value that occurs most often
Not affected by extreme values
Used for either numerical or
categorical data
There may be no mode
There may be several modes

(Figure: dot plot on a 0 to 14 axis; Mode = 9)
(Figure: dot plot on a 0 to 6 axis; No Mode)
Review Example
 Five houses on a hill by the beach

House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000
Sum $3,000,000

 Mean: ($3,000,000/5) = $600,000
 Median: middle value of ranked data = $300,000
 Mode: most frequent value = $100,000
Which Measure to Choose?

 The mean is generally used, unless extreme values (outliers) exist.
 When outliers exist, the median is often used instead, since the median is not sensitive to extreme values. For example, median home prices may be reported for a region.
 In some situations it makes sense to report both the mean and the median.


Mean for Grouped Data

Formula for Mean is given by

\[ \bar{X} = \dfrac{\sum fX}{n} \]

Where
X̄ = Mean
ΣfX = Sum of cross products of frequency f in each class with midpoint X of each class
n = Total number of observations (Total frequency) = Σf
continued

Example:
Find the arithmetic mean for the following
continuous
frequency distribution:

Class 0-1 1-2 2-3 3-4 4-5 5-6


Frequency 1 4 8 7 3 2
…continued
Solution:

Class    X     f    fX
0-1      0.5   1    0.5
1-2      1.5   4    6.0
2-3      2.5   8    20.0
3-4      3.5   7    24.5
4-5      4.5   3    13.5
5-6      5.5   2    11.0
Totals         25   75.5

\[ \bar{X} = \dfrac{\sum fX}{n} = \dfrac{75.5}{25} = 3.02 \]
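The grouped-mean calculation above can be sketched in Python as follows. Each class is stored as a (lower limit, upper limit, frequency) tuple, which is a representation chosen for this illustration, as is the function name `grouped_mean`.

```python
# (lower limit, upper limit, frequency) for each class of the example.
classes = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]


def grouped_mean(classes):
    """Mean of grouped data: sum of f * midpoint divided by total frequency."""
    n = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return sum_fx / n


print(grouped_mean(classes))
```

The result is an approximation to the mean of the raw data, because every observation in a class is treated as if it sat at the class midpoint.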
Median for Grouped Data

Formula for Median is given by

\[ \text{Median} = L + \dfrac{(n/2) - m}{f} \times c \]

Where
L = Lower limit of the median class
n = Total number of observations = Σf
m = Cumulative frequency preceding the median class
f = Frequency of the median class
c = Class interval of the median class

Median class: The class where the middle position is located is called the median class and this is also the class where the median is located.
…continued

Example:
Find the median for the following
continuous frequency distribution:

Class 0-1 1-2 2-3 3-4 4-5 5-6


Frequency 1 4 8 7 3 2
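Solution:
Class   Frequency   Cumulative Frequency
0-1     1           1
1-2     4           5
2-3     8           13
3-4     7           20
4-5     3           23
5-6     2           25
Total   25

The middle position n/2 = 12.5 falls in the class 2-3, so L = 2, m = 5, f = 8 and c = 1.

\[ \text{Median} = L + \dfrac{(n/2) - m}{f} \times c = 2 + \dfrac{(25/2) - 5}{8} \times 1 = 2.9375 \]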
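A sketch of the same interpolation in Python, assuming the classes are listed in increasing order as (lower limit, upper limit, frequency) tuples; `grouped_median` is a hypothetical name. The loop finds the median class as the first class whose cumulative frequency reaches n/2.

```python
# (lower limit, upper limit, frequency) for each class of the example.
classes = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]


def grouped_median(classes):
    """Median = L + ((n/2 - m) / f) * c, interpolated inside the median class."""
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    m = 0                                  # cumulative frequency before this class
    for lower, upper, f in classes:
        if f > 0 and m + f >= half:        # first class reaching the middle position
            return lower + (half - m) / f * (upper - lower)
        m += f
    raise ValueError("median class not found")


print(grouped_median(classes))
```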
Mode for Grouped Data
\[ \text{Mode} = L + \dfrac{d_1}{d_1 + d_2} \times c \]

Where
L = Lower limit of the modal class
d_1 = f_1 - f_0 and d_2 = f_1 - f_2
f_1 = Frequency of the modal class
f_0 = Frequency preceding the modal class
f_2 = Frequency succeeding the modal class
c = Class interval of the modal class

Modal class: The class that has the highest frequency.
…continued

Example:
Find the mode for the following
continuous frequency distribution:

Class 0-1 1-2 2-3 3-4 4-5 5-6


Frequency 1 4 8 7 3 2
…continued
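Solution:
Class   Frequency
0-1     1
1-2     4
2-3     8
3-4     7
4-5     3
5-6     2
Total   25

The modal class is 2-3 (highest frequency, 8), so L = 2 and c = 1.
d_1 = f_1 - f_0 = 8 - 4 = 4
d_2 = f_1 - f_2 = 8 - 7 = 1

\[ \text{Mode} = L + \dfrac{d_1}{d_1 + d_2} \times c = 2 + \dfrac{4}{5} \times 1 = 2.8 \]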
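The mode formula can be sketched the same way. Two choices below are assumptions made for this illustration and are not stated on the slides: when several classes tie for the highest frequency the first one is used, and a missing neighbouring class (at either end of the table) is treated as having frequency 0.

```python
# (lower limit, upper limit, frequency) for each class of the example.
classes = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]


def grouped_mode(classes):
    """Mode = L + d1 / (d1 + d2) * c, computed inside the modal class."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                       # first class with highest frequency
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0                 # assumption: 0 if no preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # assumption: 0 if no succeeding class
    d1, d2 = f1 - f0, f1 - f2
    if d1 + d2 == 0:
        raise ValueError("mode is not defined by this formula for flat data")
    return lower + d1 / (d1 + d2) * (upper - lower)


print(grouped_mode(classes))
```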
Exercise:
1. The frequency table shows the weights of some patients at a doctor's surgery. Calculate the mean, median and mode.

2. The frequency table shows the race times of a group of


athletes who took part in a 400m race. 6 people completed the
400m race in a time 45 seconds up to 50 seconds, 9 people
completed the race in a time 50 seconds up to 55 seconds, 9
people completed the race in a time 55 seconds up to 60 seconds
and the remaining 3 athletes completed the race in a time of 60
up to 65 seconds.
Exercise continued
3. You grew fifty baby carrots using special soil.
You dig them up and measure their lengths (to the
nearest mm) and group the results:
Length (mm) Frequency

150 - 154 5

155 - 159 2

160 – 164 6

165 – 169 8

170 – 174 9

175 – 179 11

180 – 184 6

185 – 189 3
Measures of Variation

Variation
• Range
• Variance
• Standard Deviation
• Coefficient of Variation

 Measures of variation give information on the spread or variability or dispersion of the data values.

(Figure: two distributions with the same center, different variation)
The Range

 Simplest measure of variation


 Difference between the largest and the smallest values:

Range = Xlargest – Xsmallest

Example:

(Figure: dot plot on a 0 to 14 axis, smallest value 1 and largest value 13)

Range = 13 - 1 = 12
Why The Range Can Be Misleading

 Ignores the way in which data are distributed

(Figure: two differently distributed data sets on a 7 to 12 axis)
Range = 12 - 7 = 5 for both data sets

 Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
The Variance

 Average (approximately) of squared deviations


of values from the mean

◦ Sample variance:

\[ S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

Where X̄ = arithmetic mean
n = sample size
X_i = ith value of the variable X
The Standard Deviation

 Most commonly used measure of variation


 Shows variation about the mean
 Is the square root of the variance
 Has the same units as the original data

Sample standard deviation:

\[ S = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
Example

Sample Data (X_i): 10 12 14 15 17 18 18 24
n = 8, Mean = X̄ = 16

\[ S = \sqrt{\dfrac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}} \]

\[ = \sqrt{\dfrac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\dfrac{130}{7}} = 4.3095 \]

A measure of the “average” scatter around the mean
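The worked example can be verified with a direct implementation of the two formulas. A minimal Python sketch; the function names are illustrative, and the standard library's `statistics.stdev` is shown only as a cross-check since it also uses the n - 1 divisor.

```python
from math import sqrt
from statistics import stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(xs))


s = sample_std(data)
print(round(s, 4), round(stdev(data), 4))
```

Dividing by n - 1 rather than n is what makes this the sample (not population) version, which is the one defined on the slides.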
Comparing Standard
Deviations
(Figure: three dot plots on an 11 to 21 axis)
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
Comparing Standard Deviations

(Figure: two distributions, one with a smaller standard deviation and one with a larger standard deviation)


Standard Deviation for Grouped
Data
Example:
Frequency Distribution of Return on Investment of Mutual Funds

Return on Investment   Number of Mutual Funds
5-10                   10
10-15                  12
15-20                  16
20-25                  14
25-30                  8
Total                  60

\[ S = \sqrt{\dfrac{\sum f (X - \bar{X})^2}{n - 1}} \]

where X is the midpoint of each class, f is its frequency and n = Σf.
…continued
Solution:
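The worked solution is not included in the text at this point. One way to carry it out is sketched below in Python, assuming each class is represented by its midpoint (as in the grouped-mean formula) and using the n - 1 divisor from the formula above; the function names are hypothetical.

```python
from math import sqrt

# (lower limit, upper limit, number of mutual funds) for each class.
returns = [(5, 10, 10), (10, 15, 12), (15, 20, 16), (20, 25, 14), (25, 30, 8)]


def grouped_mean(classes):
    """Mean of grouped data using class midpoints."""
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n


def grouped_std(classes):
    """S = sqrt( sum f (X - mean)^2 / (n - 1) ), with X the class midpoint."""
    n = sum(f for _, _, f in classes)
    xbar = grouped_mean(classes)
    ss = sum(f * ((lo + hi) / 2 - xbar) ** 2 for lo, hi, f in classes)
    return sqrt(ss / (n - 1))


print(round(grouped_mean(returns), 2), round(grouped_std(returns), 2))
```

Under these assumptions the midpoints are 7.5, 12.5, 17.5, 22.5 and 27.5, the mean return is about 17.33, and S is about 6.44.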
Measures of Variation- Summary Characteristics

 The more the data are spread out, the greater the range,
variance, and standard deviation.

 The more the data are concentrated, the smaller the range,
variance, and standard deviation.

 If the values are all the same (no variation), all these
measures will be zero.

 None of these measures are ever negative.


The Coefficient of Variation

 Measures relative variation


 Always in percentage (%)
 Shows variation relative to mean
 Can be used to compare the variability of two or more sets of data measured in different units

\[ CV = \left( \dfrac{S}{\bar{X}} \right) \times 100\% \]
Comparing Coefficients of Variation

 Stock A:
◦ Average price last year = $50
◦ Standard deviation = $5
\[ CV_A = \left( \dfrac{S}{\bar{X}} \right) \times 100\% = \dfrac{\$5}{\$50} \times 100\% = 10\% \]
 Stock B:
◦ Average price last year = $100
◦ Standard deviation = $5
\[ CV_B = \left( \dfrac{S}{\bar{X}} \right) \times 100\% = \dfrac{\$5}{\$100} \times 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less variable relative to its price
Quartile Measures
 Quartiles split the ranked data into 4
segments with an equal number of values
per segment
25% 25% 25% 25%

Q1 Q2 Q3

 The first quartile, Q1, is the value for which 25%


of the observations are smaller and 75% are
larger
 Q2 is the same as the median (50% of the
observations are smaller and 50% are larger)
 Only 25% of the observations are greater than
the third quartile Q3
Quartile Measures- Example

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)

Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = (12+13)/2 = 12.5

Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16

Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = (18+21)/2 = 19.5
Q1 and Q3 are measures of non-central location
Q2 = median, is a measure of central tendency
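The position rule used in this example, q(n+1)/4 for the q-th quartile, can be sketched in Python as below. The function name `quartile` is illustrative. Fractional positions are handled by linear interpolation between the two neighbouring values, which is an assumption that agrees with the slide's averaging whenever the position ends in .5; other textbooks and software use slightly different quartile rules.

```python
def quartile(sorted_data, q):
    """q-th quartile (q = 1, 2, 3) at 1-based position q * (n + 1) / 4."""
    n = len(sorted_data)
    pos = q * (n + 1) / 4
    lower = int(pos)                  # rank of the value just below the position
    frac = pos - lower                # how far to move toward the next value
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)


data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
iqr = q3 - q1                         # interquartile range
print(q1, q2, q3, iqr)
```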
Interquartile Range (IQR)
 The IQR is Q3 – Q1 and measures the spread in
the middle 50% of the data

 The IQR is also called the mid spread because


it covers the middle 50% of the data

 The IQR is a measure of variability that is not


influenced by outliers or extreme values

 Measures like Q1, Q3, and IQR that are not


influenced by outliers are called resistant
measures
Shape of the Distribution

Describes how data are distributed


Measures of shape: either symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
…continued
The shape of the distribution is said
to be symmetric if the observations
are balanced, or evenly distributed,
about the center.
(Figure: Symmetric Distribution; histogram with frequency on the vertical axis and values 1 to 9 on the horizontal axis)
…continued
 The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center.

A positively skewed distribution (skewed to the right) has a tail that extends to the right in the direction of positive values.
(Figure: Positively Skewed Distribution; histogram with frequency on the vertical axis and values 1 to 9 on the horizontal axis)

A negatively skewed distribution (skewed to the left) has a tail that extends to the left in the direction of negative values.
(Figure: Negatively Skewed Distribution; histogram with frequency on the vertical axis and values 1 to 9 on the horizontal axis)
Questions

1. Which of the following is not true with regards


to finding the median of a data set?

a. The first step is to order the data set from


smallest to largest.
b. If n is even the sample median is the
average of the middle two values in the
ordered list.
c. Any repeated values should be removed
from the data set prior to determining the
median.
d. If n is odd, the sample median is the single
middle value of the ordered list.
2. A sample of 10 individuals was taken at a
college football game and the number of games
they attend a year was recorded. The
descriptive statistics for the data are given as
follows.

What is the five-number summary for this data?


a. 1, 2, 4, 5.25, 7
b. 1, 2, 3.9, 5.25, 7
c. 10, 3.9, 0.605, 1.912, 1.0
d. 1, –0.605, 3.9, 0.605, 7
3. Which of the following is not affected by
extreme outliers?

a. Variance
b. Median
c. Range
d. Mean

4. Which of the following is not a measure of variability?

a. Interquartile range
b. Standard deviation
c. Population variance
d. Proportion
Bivariate data

Data for two variables (usually two


types of related data).
Deals with two variables that can change
and are compared to find relationships.
 If one variable is influencing another
variable, then you will have bivariate
data that has an independent and a
dependent variable.
Types of bivariate analysis

Scatter plots
Frequency distribution(Two way table)
Correlation coefficient
Regression analysis
Example
An ice cream shop keeps track of how much ice cream they sell versus the temperature on that day. Here are their figures for the last 12 days. Represent the data in a scatter diagram.

Temperature °C   Ice Cream Sales
14.2°            $215
16.4°            $325
11.9°            $185
15.2°            $332
18.5°            $406
22.1°            $522
19.4°            $412
25.1°            $614
23.4°            $544
18.1°            $421
22.6°            $445
17.2°            $408
Frequency distributions for bivariate data

Two way table


Joint frequencies
Marginal frequencies
Conditional frequencies
Two way table

It is a table listing two categorical variables whose values have been paired.
Each set of numbers in a two-way table has a specific name.

          software   teaching   forming   total
male
female
total
Joint relative frequency

The middle cells are the joint frequency numbers (the purple cells in the table above).
The joint relative frequency is the ratio of the frequency in a particular category and the total number of data values.
Marginal relative frequency
The marginal frequency numbers are the numbers on the edges of a table: the column on the very right and the row on the very bottom (the green cells in the table above).
The marginal relative frequency is the ratio of the sum of the joint frequencies in a row or column and the total number of data values.
Conditional relative frequency

The ratio of a joint relative frequency


and related marginal relative frequency
This is a similar set up to conditional
probability, where the limitation, or
condition, is preceded by the word given
For example, the percentage of people
that selected software as a career, given
those people are female in the above
table
Example 1
Travis is considering running away and taking another job. He surveys 20 of his friends to determine the most popular career options. He asks his friends, 13 male and 7 female, which career they would prefer: software, teaching, or forming.

          software   teaching   forming   total
male      3          2          8         13
female    2          4          1         7
total     5          6          9         20

Two way frequency table


 Joint relative frequencies?
 Marginal relative frequencies?
 Find the percentage of people that selected software as a career, given those people are female?
…continued
Solution:

Gender \ Career   software   teaching   forming   total
male              3/20       2/20       8/20      13/20
female            2/20       4/20       1/20      7/20
total             5/20       6/20       9/20      20/20

Two-way relative frequency table
…continued

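 Joint relative frequencies are: 3/20, 2/20, 8/20, 2/20, 4/20, 1/20 (purple coloured cells)

 Marginal relative frequencies are: 5/20, 6/20, 9/20, 13/20, 7/20 (green coloured cells)

 Conditional relative frequency: the condition "given those people are female" means dividing by the marginal relative frequency of females (7/20), so the percentage of people that selected software as a career, given those people are female, is
\[ \dfrac{2/20}{7/20} \times 100 = \dfrac{2}{7} \times 100 \approx 28.6\% \]
(Dividing by the software marginal instead, (2/20)/(5/20) × 100 = 40%, gives the percentage of people who are female, given that they selected software.)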
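The joint, marginal and conditional relative frequencies for this survey can be recomputed exactly with fractions. A Python sketch, with the table stored as a nested dictionary (a layout chosen for this illustration). The key point is which marginal goes in the denominator: conditioning on "female" divides by the female row total, 7/20, which gives 2/7 (about 28.6%); dividing by the software column total, 5/20, instead answers a different question, namely the share of software choosers who are female (40%).

```python
from fractions import Fraction

# Survey counts from the two-way frequency table (20 friends in total).
counts = {
    "male":   {"software": 3, "teaching": 2, "forming": 8},
    "female": {"software": 2, "teaching": 4, "forming": 1},
}
total = sum(sum(row.values()) for row in counts.values())

# Joint relative frequency: cell count divided by the grand total.
joint = {(g, c): Fraction(v, total)
         for g, row in counts.items() for c, v in row.items()}

# Marginal relative frequencies: row totals and column totals over the grand total.
row_marginal = {g: Fraction(sum(row.values()), total) for g, row in counts.items()}
careers = list(counts["male"])
col_marginal = {c: Fraction(sum(counts[g][c] for g in counts), total) for c in careers}


def career_given_gender(career, gender):
    """Conditional relative frequency of a career, given the gender."""
    return joint[(gender, career)] / row_marginal[gender]


software_given_female = career_given_gender("software", "female")
female_given_software = joint[("female", "software")] / col_marginal["software"]
print(float(software_given_female * 100), float(female_given_software * 100))
```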
Example 2
