Prob and Stat - Unit1
Unit -1
Introduction to Statistics
Statistics:
• The word statistics has two meanings:
• In the most common usage – statistics refers to numerical facts
• Examples include numbers that represent:
a) annual income
b) age
c) the percentage of students who scored grade A
d) the starting salary of a typical college graduate
• What will be other examples of statistics? ……………..
The following examples present some statistics:
• Approximately 30% of Google’s employees were female in July 2014
(USA TODAY, July 24, 2014).
• In 2013, author James Patterson earned $90 million from the sale of
his books (Forbes, September 29, 2014).
• As per the CBS report, the hotel and restaurant, manufacturing and
transportation sectors of Nepal will witness negative growth of 16.3
percent, 1.1 percent and 2.3 percent, respectively, in the current
fiscal year (The Himalayan Times, April 30, 2020).
• The second meaning of statistics refers to the field or
discipline of study.
• Statistics is the science of collecting, analyzing, presenting,
and interpreting data, as well as of making decisions based
on such analyses.
• A comprehensive definition given by Croxton and Cowden
is:
“Statistics may be defined as the collection, presentation,
analysis and interpretation of numerical data”
• Statistical methods help us make scientific and intelligent
decisions.
• Decisions made by using statistical methods are called
educated guesses.
• Decisions made without using statistical (or scientific)
methods are called pure guesses and, hence, may prove to
be unreliable.
• For example: …….
Applications:
Accounting: The number of individual accounts receivable is generally
large, and checking the validity of each one is time-consuming. Auditors
therefore draw conclusions from sample data as to whether the
accounts receivable amount shown on the client’s balance sheet is
acceptable.
Finance: Financial analysts use a variety of statistical information
and methods to guide their investment recommendations.
Levels of measurement (diagram, higher levels first):
• Interval Data: differences between measurements, but no true zero
• Ordinal Data: ordered categories (rankings, order, or scaling)
Measures of Center and Location
Overview
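• Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• Weighted sample mean: $\bar{X}_W = \dfrac{\sum w_i x_i}{\sum w_i}$
• Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
• Weighted population mean: $\mu_W = \dfrac{\sum w_i x_i}{\sum w_i}$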
Measures of Center for Ungrouped and Grouped Data
a) Mean
b) Median
• In an ordered array, the median is the “middle” number
• If n or N is odd, the median is the middle number
• If n or N is even, the median is the average of the two middle
numbers
• The advantage of using the median as a measure of central tendency is
that it is not influenced by outliers.
• When outliers exist, use median instead of mean as a measure of
central tendency.
• The median is the value of the middle term in a data set
that has been ranked in increasing order.
• Ungrouped data: $\text{Median} = \left(\dfrac{n + 1}{2}\right)^{\text{th}}$ value
• Grouped data: $\text{Median} = l + \dfrac{n/2 - cf}{f} \times h$
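The median rules for ungrouped data can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names and the sample values are hypothetical and chosen only to show the odd case, the even case, and the effect of an outlier on the mean versus the median.

```python
def median(data):
    """Median of ungrouped data, taken from the middle of the sorted values."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd n: the ((n + 1) / 2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average of the two middle values


def mean(data):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(data) / len(data)


# Hypothetical data sets (not from the slides)
odd_data = [9, 3, 7, 5, 11]
even_data = [9, 3, 7, 5]
with_outlier = [9, 3, 7, 5, 1000]

print("odd n:", median(odd_data))
print("even n:", median(even_data))
print("with outlier, mean:", mean(with_outlier), "median:", median(with_outlier))
```

Replacing 11 with 1000 pulls the mean far from the bulk of the data while the median stays at the same middle value, which is the reason the median is preferred when outliers exist.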
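Example (weighted mean):
Value $x_i$ (days)    Weight $w_i$
5                     4
6                     12
7                     8
8                     2
$\bar{X}_W = \dfrac{\sum w_i x_i}{\sum w_i} = \dfrac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \dfrac{164}{26} = 6.31$ days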
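The weighted mean example can be checked with a short Python sketch. The function name `weighted_mean` is hypothetical; the values (5, 6, 7, 8 days) and weights (4, 12, 8, 2) are the ones used in the example.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


days = [5, 6, 7, 8]      # values x_i from the example
weights = [4, 12, 8, 2]  # weights w_i from the example

result = weighted_mean(days, weights)
print(round(result, 2), "days")
```

With equal weights the function reduces to the ordinary arithmetic mean.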
Which measure of location is the “best”?
• Mean is generally used, unless extreme values (outliers)
exist
• Then median is often used, since the median is not
sensitive to extreme values.
Relationships Among the Mean, Median, and Mode
Partition values
• The values of the variate that divide the total number of observations into a number of equal parts are
known as partition values.
• When the values of the variate are arranged in ascending or descending order of magnitude,
the median is the value of the variate that divides the total frequency
into two equal parts.
• Similarly, the given series can be divided into four, ten, or a hundred equal parts.
• Quartile:
The values of the variate that divide the total frequency into four equal parts are
called quartiles. There are three quartiles: the first quartile (Q1), the second quartile
(Q2), and the third quartile (Q3).
• Decile:
Deciles are the values that divide a set of observations into ten
equal parts, so there are nine deciles. They are denoted
D1, D2, D3, D4, ……… D9.
• Percentile:
Percentiles divide a set of observations into 100 equal parts. The
percentiles (or centiles) are denoted P1, P2, P3, P4, ……… P99.
Percentiles
• The pth percentile in an ordered array of n values is
the value in the ith position, where
$i = \dfrac{p}{100}(n + 1)$
• Example: The 60th percentile in an ordered array of 19
values is the value in the 12th position:
$i = \dfrac{p}{100}(n + 1) = \dfrac{60}{100}(19 + 1) = 12$
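The position rule can be sketched in Python as follows. The function names and the 19-value data set are hypothetical. The slides do not say what to do when the position is not a whole number; linear interpolation between the two neighbouring values is used here as an assumption.

```python
def percentile_position(p, n):
    """Position i = p/100 * (n + 1) of the pth percentile in an ordered array of n values."""
    return p * (n + 1) / 100


def percentile_value(ordered, p):
    """Value at the pth percentile position (positions are 1-based).

    Assumption: a fractional position is resolved by linear interpolation,
    and positions outside 1..n are clamped to the ends of the array.
    """
    n = len(ordered)
    i = min(max(percentile_position(p, n), 1), n)
    lower = int(i)        # whole part of the position
    frac = i - lower      # fractional part of the position
    if frac == 0 or lower == n:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


# Hypothetical ordered array of 19 values: 10, 20, ..., 190
values = list(range(10, 200, 10))

print(percentile_position(60, 19))   # position of the 60th percentile
print(percentile_value(values, 60))  # value found at that position
```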
Calculation of Partition value:
• Quartile: $Q_i = L + \dfrac{\frac{in}{4} - \text{c.f.}}{f} \times h$, where i = 1, 2, 3
• Decile: $D_i = L + \dfrac{\frac{in}{10} - \text{c.f.}}{f} \times h$, where i = 1, 2, 3, …, 9
• Percentile: $P_i = L + \dfrac{\frac{in}{100} - \text{c.f.}}{f} \times h$, where i = 1, 2, 3, 4, ……, 99
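The three formulas share one pattern: locate the class containing the (i·n/parts)th observation, then interpolate inside it. A minimal Python sketch under the usual reading of the symbols (L = lower boundary of that class, c.f. = cumulative frequency of the classes before it, f = its frequency, h = its width) is given below. The function name and the frequency table are hypothetical.

```python
def grouped_partition_value(classes, i, parts):
    """ith partition value of grouped data split into `parts` equal parts.

    classes: list of (lower_boundary, upper_boundary, frequency) tuples,
             in ascending order with continuous boundaries.
    parts:   4 for quartiles, 10 for deciles, 100 for percentiles.
    """
    if not 1 <= i < parts:
        raise ValueError("i must satisfy 1 <= i < parts")
    n = sum(freq for _, _, freq in classes)
    target = i * n / parts  # the (i*n/parts)th observation
    cum_freq = 0            # c.f.: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if freq > 0 and cum_freq + freq >= target:
            width = upper - lower  # h
            return lower + (target - cum_freq) / freq * width
        cum_freq += freq
    raise ValueError("frequency table is empty")


# Hypothetical frequency table (n = 40)
table = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]

q1 = grouped_partition_value(table, 1, 4)
q2 = grouped_partition_value(table, 2, 4)    # same as the grouped median
q3 = grouped_partition_value(table, 3, 4)
d9 = grouped_partition_value(table, 9, 10)
p50 = grouped_partition_value(table, 50, 100)
print(q1, q2, q3, d9, p50)
```

Because Q2, D5, and P50 all target the (n/2)th observation, they return the same value as the grouped median formula.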
Example:
X minimum    Q1    Median (Q2)    Q3    X maximum
12           30    45             57    70
Interquartile range = Q3 - Q1 = 57 - 30 = 27
Box and Whisker Plot
Example:
• Symmetric
• Right Skewed
• Left Skewed
Why Use a Boxplot?
• A boxplot provides an alternative to a histogram, a dot plot, and a stem-and-
leaf plot. Among the advantages of a boxplot over a histogram are ease of
construction and convenient handling of outliers. In addition, the
construction of a boxplot does not involve subjective judgements, as does a
histogram. That is, two individuals will construct the same boxplot for a
given set of data - which is not necessarily true of a histogram, because the
number of classes and the class endpoints must be chosen. On the other
hand, the boxplot lacks the details the histogram provides.
• Dot plots and stem plots retain the identity of the individual observations; a
boxplot does not. Many sets of data are more suitable for display as
boxplots than as stem plots. Both boxplots and stem plots are useful for
making side-by-side comparisons.
Measures of Variation
Overview (diagram): Variation branches into Sample Variance and Sample Standard Deviation
Variation
• Measures of variation give information on the
spread or variability of the data values.
Same center,
different variation
Measures of Dispersion for Grouped and Ungrouped Data
Range
• Range = Largest value – smallest value
[Figure: two dot plots on an axis from 7 to 12; in both, Range = 12 - 7 = 5]
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
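The sensitivity of the range to a single extreme value can be reproduced with the two data sets above. This is a small Python sketch; the function name `data_range` is hypothetical.

```python
def data_range(data):
    """Range = largest value - smallest value."""
    return max(data) - min(data)


# The two data sets from the example: identical except for the last value
original = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(original))      # range without the outlier
print(data_range(with_outlier))  # range after replacing 5 with 120
```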
Variance
• Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
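• Sample standard deviation (ungrouped data): $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• Population standard deviation (ungrouped data): $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$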
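The difference between the sample (n - 1) and population (N) divisors can be seen in a short Python sketch. The data set is hypothetical; the hand-written functions follow the definitions above and are cross-checked against Python's standard `statistics` module.

```python
import math
import statistics


def sample_variance(data):
    """s^2: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def population_variance(data):
    """sigma^2: squared deviations from the mean, divided by N."""
    n = len(data)
    if n == 0:
        raise ValueError("population variance needs at least one value")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

s2 = sample_variance(data)
sigma2 = population_variance(data)
s = math.sqrt(s2)          # sample standard deviation
sigma = math.sqrt(sigma2)  # population standard deviation

# Cross-check against the standard library
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(sigma2, statistics.pvariance(data))

print(f"s^2 = {s2:.4f}, s = {s:.4f}")
print(f"sigma^2 = {sigma2:.4f}, sigma = {sigma:.4f}")
```

The standard deviation is reported in the same units as the data, whereas the variance is in squared units.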
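For grouped data, the standard deviation is computed using
the following relationships:
• Sample standard deviation: $s = \sqrt{\dfrac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{\sum f x^2}{n - 1} - \dfrac{(\sum f x)^2}{n(n - 1)}}$
• Population standard deviation: $\sigma = \sqrt{\dfrac{\sum f(x - \mu)^2}{N}} = \sqrt{\dfrac{\sum f x^2}{N} - \dfrac{(\sum f x)^2}{N^2}}$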
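The definitional and shortcut forms for grouped data should give the same result, which a short Python sketch can confirm. As an assumption for illustration, the values 5, 6, 7, 8 with weights 4, 12, 8, 2 from the weighted mean example are reused here as a frequency table; the function names are hypothetical.

```python
import math


def grouped_sd_definition(values, freqs, sample=True):
    """SD from sum f (x - mean)^2, divided by n - 1 (sample) or N (population)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    ss = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return math.sqrt(ss / (n - 1 if sample else n))


def grouped_sd_shortcut(values, freqs, sample=True):
    """SD from the shortcut form using sum f x^2 and (sum f x)^2."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    if sample:
        return math.sqrt(sum_fx2 / (n - 1) - sum_fx ** 2 / (n * (n - 1)))
    return math.sqrt(sum_fx2 / n - sum_fx ** 2 / n ** 2)


values = [5, 6, 7, 8]  # x (treated as class values; an assumption)
freqs = [4, 12, 8, 2]  # f (treated as frequencies; an assumption)

s_def = grouped_sd_definition(values, freqs, sample=True)
s_short = grouped_sd_shortcut(values, freqs, sample=True)
sigma_def = grouped_sd_definition(values, freqs, sample=False)
sigma_short = grouped_sd_shortcut(values, freqs, sample=False)

print(f"s: {s_def:.4f} (definition), {s_short:.4f} (shortcut)")
print(f"sigma: {sigma_def:.4f} (definition), {sigma_short:.4f} (shortcut)")
```

The shortcut form avoids computing each deviation from the mean, which is convenient by hand, although it can lose precision in floating point when the mean is large relative to the spread.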
Comparing Standard Deviations
[Figure: three dot plots on a common axis from 11 to 21]
Data set    Mean    s
Data A      15.5    3.338
Data B      15.5    0.9258
Data C      15.5    4.57
Coefficient of Variation (CV)
• The C.V. is the most widely used relative measure of dispersion for comparing two or more
distributions. It expresses the standard deviation as a percentage of the mean: C.V. = (S.D. / Mean) × 100%.
• When comparing two or more distributions, the lower the C.V., the more
homogeneous (or consistent, uniform, regular, or stable) the
distribution.
             Variety I    Variety II
Mean (kg)    60           50
S.D. (kg)    10           9
C.V. for Variety I = $\dfrac{10}{60} \times 100 = 16.7\%$ (less variability, more consistent)
C.V. for Variety II = $\dfrac{9}{50} \times 100 = 18.0\%$
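The comparison of the two varieties (means 60 and 50 kg, standard deviations 10 and 9 kg) can be reproduced with a small Python sketch. The function name and the dictionary layout are hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """C.V. = (standard deviation / mean) * 100, expressed as a percentage."""
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return sd / mean * 100


# (mean, S.D.) in kg for each variety, taken from the table
varieties = {"Variety I": (60, 10), "Variety II": (50, 9)}

cv = {name: coefficient_of_variation(sd, mean) for name, (mean, sd) in varieties.items()}
more_consistent = min(cv, key=cv.get)  # the lower C.V. is the more consistent one

for name, value in cv.items():
    print(f"{name}: C.V. = {value:.1f}%")
print("More consistent:", more_consistent)
```

Although Variety II has the smaller standard deviation in absolute terms, its C.V. is larger because its mean is also smaller, which is why a relative measure is used for the comparison.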
The Empirical Rule
If the data distribution is bell-shaped:
• μ ± 1σ contains about 68% of the values in
the population or the sample
• μ ± 2σ contains about 95% of the values in
the population or the sample
• μ ± 3σ contains about 99.7% of the values
in the population or the sample
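The 68%, 95%, and 99.7% figures can be recovered from the normal distribution. The sketch below assumes the data are exactly normal, which is an idealisation of "bell-shaped", and uses the identity P(|X - μ| < kσ) = erf(k / √2) with Python's standard `math.erf`. The function name is hypothetical.

```python
import math


def normal_coverage(k):
    """Probability that an exactly normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"mu ± {k} sigma: {normal_coverage(k) * 100:.2f}%")
```

Real data that are only roughly bell-shaped will match these percentages approximately rather than exactly.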
Tchebysheff’s Theorem
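• Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the
values fall within k standard deviations of the mean (μ ± kσ)
• Examples:
At least                    within
$(1 - 1/1^2) = 0\%$         k=1 (μ ± 1σ)
$(1 - 1/2^2) = 75\%$        k=2 (μ ± 2σ)
$(1 - 1/3^2) = 89\%$        k=3 (μ ± 3σ)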
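The bound and a quick check against data can be sketched in Python. The function names are hypothetical. The data set is the 25-value list with the outlier 120 from the range example, treated here as a population; it is used only to illustrate that the bound holds even for a strongly skewed data set.

```python
import math


def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean: 1 - 1/k^2."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k >= 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed fraction of values within mu ± k*sigma (population formulas)."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / n


# Skewed data set from the range example (one extreme value)
data = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

for k in (1, 2, 3):
    print(f"k={k}: bound {chebyshev_bound(k):.1%}, observed {fraction_within(data, k):.1%}")
```

The observed fractions are at or above the bound for every k, as the theorem guarantees; for bell-shaped data the empirical rule gives much tighter figures.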