
Prob and Stat - Unit1

The document provides an introduction to statistics, defining it as the science of collecting, analyzing, presenting, and interpreting data. It covers key concepts such as population, sample, variables, and types of data, along with measurement scales and branches of statistics. Additionally, it discusses measures of central tendency, variability, and partition values like quartiles and percentiles.

Probability and Statistics

Unit -1
Introduction to Statistics
Statistics:
• The word statistics has two meanings:
• In the most common usage – statistics refers to numerical facts
• The number that represents –
a) annual income
b) age
c) the percentage of students who scored grade A
d) the starting salary of a typical college graduate
• What will be other examples of statistics? ……………..
The following examples present some statistics:
• Approximately 30% of Google’s employees were female in July 2014
(USA TODAY, July 24, 2014).
• In 2013, author James Patterson earned $90 million from the sale of
his books (Forbes, September 29, 2014).
• As per the CBS report, the hotel and restaurant, manufacturing and
transportation sectors of Nepal will witness negative growth of 16.3
percent, 1.1 percent and 2.3 percent, respectively, in the current
fiscal year (The Himalayan Times, April 30, 2020).
• The second meaning of statistics refers to the field or
discipline of study.
• Statistics is the science of collecting, analyzing, presenting,
and interpreting data, as well as of making decisions based
on such analyses.
• A comprehensive definition given by Croxton and Cowden
is:
“Statistics may be defined as the collection, presentation,
analysis and interpretation of numerical data”
• Statistical methods help us make scientific and intelligent
decisions.
• Decisions made by using statistical methods are called
educated guesses.
• Decisions made without using statistical (or scientific)
methods are called pure guesses and, hence, may prove to
be unreliable.
• For example: …….
Applications:
Accounting: The number of individual accounts receivable is generally large, so checking the validity of every account is time-consuming. Based on sample data, auditors draw conclusions as to whether the accounts receivable amount shown on the client’s balance sheet is acceptable.
Finance: Financial analysts use a variety of statistical information and methods to guide their investment recommendations.

Economics: Economists use a variety of statistical information and methods in forecasting, planning, and formulating economic policies, for example price index numbers, unemployment rates, manufacturing capacity utilization, human development indices, and quality control charts.
Basic Terms
Population or target population: The collection of all
elements/members whose characteristics are being studied.
For example:………………..
Sample: A portion/fraction of the population of interest.
For example: ……………

Fig. 1. The relation between population and sample


Goal of Sample:
Usually populations are so large that a researcher cannot examine
the entire group. Therefore, a sample is selected to represent the
population in a research study. The goal is to use the results
obtained from the sample to help answer questions about the
population.
Basic terms continued…..
Survey:
A survey is a research method used for collecting data from a
predefined group of respondents to gain information and insights
into various topics of interest.
Census:
The procedure of systematically calculating, acquiring, and recording information about all the members of a given population.
Sample Survey:
The procedure of systematically calculating, acquiring, and recording information from only a portion of a population of interest.
• Variable
- A variable is a characteristic under study that assumes
different values for different elements.
- A variable is often denoted by letters x, y, or z
- The value of a variable for an element is called an
observation or measurement.
• Data
- collection of information/observations
- The goal of statistics is to help researchers organize and
interpret the data.
Types of Variables
• Some variables (such as the height of a person, price of groceries) can be measured numerically, whereas others (such as occupation, income sources) cannot.
• Variables are classified into two types:
a) Quantitative Variable
b) Qualitative Variable
i) Quantitative Variable
• A variable that can be measured numerically is called a quantitative
variable.
• The data collected on a quantitative variable are called quantitative
data.
• Example: Number of workers: 23, 24, 25, 15, 19, 18
• Other examples:
- Annual Gross sale
- No. of accidents
- Weight of a laptop
- Temperature
- No. of gadgets owned
• As the above examples show, the values that a quantitative variable can assume may be countable or noncountable.
• Quantitative variables may be classified into two categories
a) Discrete Variable
b) Continuous Variable
A) Discrete Variable
• Variable whose values are countable.
• In other words, a discrete variable can assume only certain values
with no intermediate values.
• For example:
- No. of accidents
- The no. of daily admissions in a general hospital
- The no. of people who visit a bank on any day
- The no. of books in a library
B) Continuous Variable
• A variable that can assume any numerical value over a certain
interval or intervals is called a continuous variable.
• Example:
- Price of book: USD105.6
- Annual salary
- Body temperature
- Expenditure on food on any day
- The time it takes to complete a certain task
ii) Qualitative or Categorical Variable
• A variable that cannot assume a numerical value but can be
classified into two or more nonnumeric categories is called a
qualitative or categorical variable.
• The data collected on such a variable are called qualitative data.
• Examples:
- Gender of a person
- A person’s blood type
- Occupation
- Modes of transportation
Measuring Variables
• To establish relationships between variables, researchers must
observe the variables and record their observations. This
requires that the variables be measured.
• The process of measuring a variable requires a set of categories
called a scale of measurement and a process that classifies each
individual into one category.
Four Types of Measurement Scales

Levels of measurement, listed from the highest to the lowest:
• Ratio data (highest level, strongest form of measurement): differences between measurements, true zero exists
• Interval data: differences between measurements but no true zero
• Ordinal data: ordered categories (rankings, order, or scaling)
• Nominal data (lowest level, weakest form of measurement): categories (no ordering or direction)
Nominal data:
Categorical data and numbers that are simply used as identifiers or
names represent a nominal scale of measurement.
Examples: Gender: a) male b) female
Ordinal data:
An ordinal scale of measurement represents an ordered series of
relationships or rank order.
Individuals competing in a contest may be fortunate to achieve first,
second, or third place.
First, second, and third place represent ordinal data
Examples: organizational chart, post, educational qualification
Interval data:
• A scale which represents quantity and has equal units but for which
zero represents simply an additional point of measurement is an
interval scale
• Example: Temperature, pH, SAT Score, IQ Test
Ratio data:
• The ratio scale of measurement is similar to the interval scale in that it
also represents quantity and has equality of units.
• However, this scale also has an absolute zero (no numbers exist below
the zero).
• Very often, physical measures will represent ratio data (for example,
height and weight).
Example: Scale of measurement
Branches of Statistics
• Descriptive statistics are methods for organizing and
summarizing data.
• For example, tables or graphs are used to organize data, and
descriptive values such as the average score are used to
summarize data.
• A descriptive value for a population is called a parameter and a
descriptive value for a sample is called a statistic.
• Inferential statistics are methods for using sample data to make
general conclusions (inferences) about populations.
• Because a sample is typically only a part of the whole population,
sample data provide only limited information about the
population. As a result, sample statistics are generally imperfect
representatives of the corresponding population parameters.
Things to remember….

• A descriptive study may be performed either on a sample or on a population. Only when an inference is made about the population, based on information obtained from the sample, does the study become inferential.
• Descriptive statistics and inferential statistics are interrelated. You
must almost always use techniques of descriptive statistics to
organize and summarize the information obtained from a sample
before carrying out an inferential analysis.
• Furthermore, as you will see, the preliminary descriptive analysis of a
sample often reveals features that lead you to the choice of the
appropriate inferential method.
Describing Data with Numerical Measures
a) Measure of central tendency and location
b) Measure of Variability
Topics:
• Compute and interpret the mean, median, and mode for a set of data
• Compute the range, variance, and standard deviation and know what
these values mean
• Construct and interpret a box and whiskers plot
• Compute and explain the coefficient of variation
• Use numerical measures along with graphs, charts, and tables to
interpret data
Summary Measures

Describing Data Numerically
- Center and Location: Mean, Median, Mode, Weighted Mean
- Other Measures of Location: Percentiles, Quartiles
- Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Center and Location
Overview

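Center and Location: Mean, Median, Mode, Weighted Mean

Mean
Sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

Weighted Mean
Sample: $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i}$
Population: $\mu_W = \frac{\sum w_i x_i}{\sum w_i}$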
Measures of Center for Ungrouped and Grouped Data
a) Mean
b) Median
• In an ordered array, the median is the “middle” number
• If n or N is odd, the median is the middle number
• If n or N is even, the median is the average of the two middle
numbers
• The advantage of using the median as a measure of central tendency is
that it is not influenced by outliers.
• When outliers exist, use median instead of mean as a measure of
central tendency.
• The median is the value of the middle term in a data set that has been ranked in increasing order.

Median = value of the $\left(\frac{n + 1}{2}\right)$th term in the ranked data set

173,175 49,723 20,352 10,824 40,911 18,038 61,848


Find the median for these data.
Example with an even number of values, where the two middle ranked values are 28.0 and 28.2:
Median = (28.0 + 28.2) / 2 = 56.2 / 2 = 28.1 = $28.1 million
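The median rule above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name `median` is chosen here for convenience, and the even-sized case uses only the two middle values 28.0 and 28.2, since the full data set for that example is not reproduced in these notes.

```python
def median(values):
    """Return the median: the ((n + 1) / 2)th value of the ranked data."""
    ranked = sorted(values)          # rank in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]           # odd n: the single middle value
    # even n: average of the two middle values
    return (ranked[mid - 1] + ranked[mid]) / 2


# The seven values listed in the notes (odd n, so the 4th ranked value).
data = [173175, 49723, 20352, 10824, 40911, 18038, 61848]
odd_median = median(data)

# Even n: only the two middle values from the second example are known.
even_median = median([28.0, 28.2])
```

For the seven listed values the ranked order is 10,824; 18,038; 20,352; 40,911; 49,723; 61,848; 173,175, so the median is the 4th value, 40,911.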
Calculating median for grouped data

$\text{Median} = l + \frac{n/2 - cf}{f} \times h$

where l = lower limit of the median class
n/2 = median position
cf = cumulative frequency of the class preceding the median class
f = frequency of the median class
h = class width of the median class
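As a rough sketch of the grouped-data formula, the following Python function locates the median class and applies $l + \frac{n/2 - cf}{f} \times h$. The frequency table is hypothetical (it does not come from these notes), equal class widths are assumed, and the function name is illustrative.

```python
def grouped_median(lower_limits, freqs, width):
    """Median for grouped data with equal class widths."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    target = n / 2                   # median position n/2
    cf = 0                           # cumulative frequency before this class
    for l, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= target:
            # this is the median class: apply l + (n/2 - cf) / f * h
            return l + (target - cf) / f * width
        cf += f
    raise ValueError("median class not found")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]            # n = 40, so n/2 = 20
median_value = grouped_median(lower_limits, freqs, width=10)
```

Here the cumulative frequencies are 5, 13, 25, 35, 40, so the median class is 20-30 with cf = 13 and f = 12, giving 20 + (20 - 13) / 12 × 10 ≈ 25.83.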
c) Mode
Mode for ungrouped data
• The mode is the value that occurs with the highest frequency in a data
set.
• Example: …….
• Advantage:
- Can be used for both Qualitative and Quantitative data, whereas
the mean and median can be calculated for only quantitative data
- Not affected by outliers
• Disadvantage: (dependent on the nature of data set)
- There may be no mode
- There may be several modes
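The points above (no mode, several modes, and use with qualitative data) can be illustrated with a short Python sketch. The data values are hypothetical, and treating a data set in which no value repeats as having "no mode" is a convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []                    # no value repeats: no mode (assumed convention)
    return [v for v, c in counts.items() if c == top]


one_mode = modes([2, 3, 3, 5, 7])                 # quantitative data
two_modes = modes(["A", "B", "A", "O", "B"])      # qualitative data (blood types)
no_mode = modes([1, 2, 3])
```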
Calculating mode for grouped data
d) Weighted Mean
• The weighted mean is an average computed by giving different weights to the individual values. If all the weights are equal, the weighted mean is the same as the arithmetic mean.
• Like the arithmetic (simple) mean, it represents the average of a data set. It is used when the values do not all contribute equally, for example when each value occurs with a different frequency.
• For a given set of non-negative data x1, x2, x3, …, xn with non-negative weights w1, w2, w3, …, wn (not all zero), the weighted mean is given by:

$\bar{X}_w = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n}{w_1 + w_2 + w_3 + \dots + w_n} = \frac{\sum w x}{\sum w}$

where w = given weight


Example: Sample of 26 Repair Projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

Weighted mean days to complete:

$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
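The repair-project calculation can be checked with a few lines of Python. This is only a sketch; the function and variable names are illustrative, with the frequencies acting as the weights.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


days = [5, 6, 7, 8]          # days to complete
freq = [4, 12, 8, 2]         # number of projects (the weights)

mean_days = weighted_mean(days, freq)              # 164 / 26
equal_weights = weighted_mean([1, 2, 3], [1, 1, 1])  # reduces to the simple mean
```

With equal weights the result coincides with the arithmetic mean, as stated above.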
Which measure of location is the “best”?
• Mean is generally used, unless extreme values (outliers)
exist
• Then median is often used, since the median is not
sensitive to extreme values.
Relationships Among the Mean, Median, and Mode
Partition values
• The values of the variate that divide the total number of observations into a number of equal parts are known as partition values.
• If the values of the variate are arranged in ascending or descending order of magnitudes,
then we have seen that median is that value of the variate which divides the total frequencies
in two equal parts.
• Similarly the given series can be divided into four, ten and hundred equal parts.
• Quartile:
The values of the variate that divide the total frequency into four equal parts are called quartiles. There are three quartiles: the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3).
• Decile:
Deciles are the values that divide a set of observations into ten equal parts. Therefore, there are nine deciles, denoted D1, D2, D3, D4, …, D9.
• Percentile:
Percentiles divide a set of observations into 100 equal parts. Therefore, there are 99 percentiles (or centiles), denoted P1, P2, P3, P4, …, P99.
Percentiles
• The pth percentile in an ordered array of n values is the value in the ith position, where

$i = \frac{p}{100}(n + 1)$

• Example: The 60th percentile in an ordered array of 19 values is the value in the 12th position:

$i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$
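The position rule can be sketched in Python as follows. The notes do not say what to do when the position is not a whole number; linear interpolation between the two neighbouring values is assumed here as one common convention. The data (the integers 1 to 19) are hypothetical.

```python
def percentile_position(p, n):
    """1-based position of the p-th percentile: i = p / 100 * (n + 1)."""
    return p * (n + 1) / 100


def percentile(ordered, p):
    """Value at the p-th percentile of an already ordered list."""
    n = len(ordered)
    i = percentile_position(p, n)
    whole = int(i)                   # whole part of the position
    frac = i - whole                 # fractional part of the position
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    # Assumption: interpolate linearly when the position is fractional.
    return ordered[whole - 1] + frac * (ordered[whole] - ordered[whole - 1])


ordered = list(range(1, 20))         # 19 hypothetical values: 1, 2, ..., 19
position = percentile_position(60, len(ordered))
p60 = percentile(ordered, 60)        # value in the 12th position
```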
Calculation of Partition value:
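• Quartile: $Q_i = L + \frac{\frac{in}{4} - \text{c.f.}}{f} \times h$, where i = 1, 2, 3

• Decile: $D_i = L + \frac{\frac{in}{10} - \text{c.f.}}{f} \times h$, where i = 1, 2, 3, …, 9

• Percentile: $P_i = L + \frac{\frac{in}{100} - \text{c.f.}}{f} \times h$, where i = 1, 2, 3, 4, …, 99

Here L is the lower limit of the class containing the partition value, c.f. is the cumulative frequency of the class preceding it, f is the frequency of that class, and h is its class width.

Note: Median = $Q_2 = D_5 = P_{50}$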
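Since the three formulas differ only in the divisor (4, 10, or 100), they can be sketched as one Python function. The frequency table below is hypothetical, equal class widths are assumed, and the function name and parameters are illustrative.

```python
def partition_value(lower_limits, freqs, width, i, parts):
    """i-th partition value when the data are split into `parts` equal parts.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    """
    n = sum(freqs)
    target = i * n / parts           # position i*n/4, i*n/10 or i*n/100
    cf = 0                           # cumulative frequency of preceding classes
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= target:
            return lower + (target - cf) / f * width
        cf += f
    raise ValueError("partition class not found")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (n = 40).
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]

q1 = partition_value(lower_limits, freqs, 10, 1, 4)
q2 = partition_value(lower_limits, freqs, 10, 2, 4)
q3 = partition_value(lower_limits, freqs, 10, 3, 4)
d5 = partition_value(lower_limits, freqs, 10, 5, 10)
p50 = partition_value(lower_limits, freqs, 10, 50, 100)
```

For this table Q1 = 16.25 and Q3 = 35, and the values q2, d5, and p50 coincide, consistent with Median = Q2 = D5 = P50.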


Interquartile Range

• Can eliminate some outlier problems by using the interquartile range
• Eliminate some high- and low-valued observations and calculate the range from the remaining values.
• Interquartile range = 3rd quartile – 1st quartile

Interquartile Range

Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the values lie in each of the four intervals between these points)

Interquartile range = 57 – 30 = 27
Box and Whisker Plot

• A Graphical display of data using 5-number summary:

Minimum -- Q1 -- Median -- Q3 -- Maximum

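Example (figure): a box drawn from the 1st quartile to the 3rd quartile with a line at the median, and whiskers extending to the minimum and the maximum; each of the four sections contains 25% of the values.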
Features of Box and Whisker plot:
- Gives a graphic presentation of data using five measures: the median, the
first quartile, the third quartile, and the smallest and the largest values in
the data set between the lower and the upper inner fences.
- Can help visualize the center, the spread, and the skewness of a data set.
- It also helps detect outliers.
- Its summary values are always located at actual data points, are quickly computable (originally by hand), and have no tuning parameters. Box plots are particularly useful for comparing distributions across groups.
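The quantities a box-and-whisker plot displays can be computed with Python's standard `statistics` module. This is a sketch under stated assumptions: the data are hypothetical (chosen so that the summary is 12, 30, 45, 57, 70), and the inner fences are taken to lie 1.5 × IQR beyond the quartiles, which is the usual convention but is not defined in these notes.

```python
from statistics import quantiles

# Hypothetical data set with 11 values.
data = [12, 20, 30, 35, 40, 45, 50, 55, 57, 65, 70]

# method="exclusive" places each quartile at position p * (n + 1),
# interpolating when the position is not a whole number.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1                        # interquartile range

# Assumption: inner fences at 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the inner fences are flagged as potential outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number_summary = (min(data), q1, q2, q3, max(data))
```

With 11 values the quartile positions are 3, 6, and 9, so no interpolation is needed, and no value falls outside the fences (−10.5 and 97.5).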
Shape of Box and Whisker Plot

• Symmetric

• Right Skewed

• Left Skewed
Why Use a Boxplot?
• A boxplot provides an alternative to a histogram, a dot plot, and a stem-and-
leaf plot. Among the advantages of a boxplot over a histogram are ease of
construction and convenient handling of outliers. In addition, the
construction of a boxplot does not involve subjective judgements, as does a
histogram. That is, two individuals will construct the same boxplot for a
given set of data - which is not necessarily true of a histogram, because the
number of classes and the class endpoints must be chosen. On the other
hand, the boxplot lacks the details the histogram provides.

• Dot plots and stem plots retain the identity of the individual observations; a
boxplot does not. Many sets of data are more suitable for display as
boxplots than as a stem plot. A boxplot as well as a stem plot are useful for
making side-by-side comparisons.
Measures of Variation

Variation
- Range
- Interquartile Range
- Variance: Population Variance, Sample Variance
- Standard Deviation: Population Standard Deviation, Sample Standard Deviation
- Coefficient of Variation
Variation
• Measures of variation give information on the
spread or variability of the data values.

Same center,
different variation
Measures of Dispersion for Grouped and Ungrouped Data
Range
• Range = Largest value – smallest value

Example:
Range = 267,277 – 49,651 = 217,626 square miles
Disadvantages of the Range
• Ignores the way in which data are distributed

(Figure: two dot plots over the values 7 to 12 with differently distributed points.)
Range = 12 - 7 = 5 for both data sets

• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
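The outlier sensitivity shown above is easy to reproduce in Python. A minimal sketch; the function name `value_range` is illustrative, and the lists mirror the two data sets above.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


# The two data sets from the slide: identical except for the last value.
without_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

range_without = value_range(without_outlier)
range_with = value_range(with_outlier)
```

Changing a single observation from 5 to 120 moves the range from 4 to 119, although the other 24 values are unchanged.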
Variance

• Average of squared deviations of values from the mean (individual series)

• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data

• Sample standard deviation (ungrouped data): $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

• Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
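The variance and standard deviation formulas for ungrouped data translate directly into Python. The sketch below uses a small hypothetical data set and illustrative function names; the only difference between the sample and population versions is the divisor (n − 1 versus N).

```python
import math


def sample_variance(xs):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum((x - mu)^2) / N."""
    N = len(xs)
    mu = sum(xs) / N
    return sum((x - mu) ** 2 for x in xs) / N


# Hypothetical data: mean 5, sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)            # 32 / 7
sigma2 = population_variance(data)    # 32 / 8
s = math.sqrt(s2)                     # sample standard deviation
sigma = math.sqrt(sigma2)             # population standard deviation
```

The standard deviations are in the same units as the data, while the variances are in squared units.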
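For grouped data, the standard deviation is computed using the following relationships:

Sample standard deviation: $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum f x^2}{n - 1} - \frac{(\sum f x)^2}{n(n - 1)}}$

Population standard deviation: $\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}} = \sqrt{\frac{\sum f x^2}{N} - \left(\frac{\sum f x}{N}\right)^2}$

where f is the frequency of each value (or of each class, with x taken as the class midpoint) and n (or N) = $\sum f$.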
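As a check that the definitional and shortcut forms of the grouped sample standard deviation agree, the sketch below applies both to the repair-project frequency table from the weighted mean example (x = days to complete, f = frequency). The function names are illustrative.

```python
import math


def grouped_sample_sd(xs, fs):
    """Definitional form: sqrt(sum(f * (x - mean)^2) / (n - 1))."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return math.sqrt(sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / (n - 1))


def grouped_sample_sd_shortcut(xs, fs):
    """Shortcut form: sqrt(sum(f x^2)/(n - 1) - (sum(f x))^2 / (n (n - 1)))."""
    n = sum(fs)
    sum_fx = sum(f * x for x, f in zip(xs, fs))
    sum_fx2 = sum(f * x * x for x, f in zip(xs, fs))
    return math.sqrt(sum_fx2 / (n - 1) - sum_fx ** 2 / (n * (n - 1)))


days = [5, 6, 7, 8]
freq = [4, 12, 8, 2]                 # n = 26, sum(f x) = 164, sum(f x^2) = 1052

sd_definition = grouped_sample_sd(days, freq)
sd_shortcut = grouped_sample_sd_shortcut(days, freq)
```

Both forms give a sample variance of 1052/25 − 164²/(26 × 25) ≈ 0.7015, that is, a standard deviation of about 0.84 days.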
Comparing Standard Deviations

(Figure: three dot plots on a common axis from 11 to 21, all with the same mean but different spread.)
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
Coefficient of Variation (CV)
• C.V. is the most widely used relative measure of dispersion for comparing two or more distributions.

• When comparing two or more distributions, the lower the C.V., the more homogeneous, consistent, uniform, regular, or stable the distribution.

• C.V. is used to compare two or more distributions in terms of their variability, consistency, uniformity, homogeneity, equitability, stability, etc.
Coefficient of Variation (CV)
Note: A low CV indicates that there is a low variation in the data set
and hence, a higher consistency.
Population: $CV = \frac{\sigma}{\mu} \times 100\%$
Sample: $CV = \frac{s}{\bar{x}} \times 100\%$
• E.g. 1. Consider the distribution of the yields (per plot) of two paddy varieties, with the information given below:

              Variety I    Variety II
Mean (kg)     60           50
S.D. (kg)     10           9

C.V. for Variety I = (10 / 60) × 100 = 16.7% (less variability, more consistent)
C.V. for Variety II = (9 / 50) × 100 = 18.0%

* In terms of S.D. alone, the interpretation would be reversed, because Variety II has the smaller S.D.
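The paddy example can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is illustrative; the means and standard deviations are the ones given in the table.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, in percent."""
    return sd / mean * 100


cv_variety_1 = coefficient_of_variation(10, 60)
cv_variety_2 = coefficient_of_variation(9, 50)

# The lower CV indicates the more consistent variety.
more_consistent = "Variety I" if cv_variety_1 < cv_variety_2 else "Variety II"

# Comparing S.D. alone points the other way, since 9 < 10.
smaller_sd = "Variety I" if 10 < 9 else "Variety II"
```

The two comparisons disagree because the C.V. scales each S.D. by its own mean, which matters when the means differ.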


The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
• μ ± 1σ contains about 68% of the values in the population or the sample
• μ ± 2σ contains about 95% of the values in the population or the sample
• μ ± 3σ contains about 99.7% of the values in the population or the sample
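The three percentages can be recovered numerically if "bell-shaped" is taken to mean exactly normal, which is an assumption made for this sketch. It uses `NormalDist` from Python's standard `statistics` module; the helper name `coverage` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()       # mean 0, standard deviation 1


def coverage(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1 = coverage(1)               # about 0.68
within_2 = coverage(2)               # about 0.95
within_3 = coverage(3)               # about 0.997
```

For real data that are only roughly bell-shaped, the observed proportions will be close to, but not exactly, these values.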
Tchebysheff’s Theorem

• Regardless of how the data are distributed, at least $(1 - 1/k^2)$ of the values will fall within k standard deviations of the mean

• Examples:
At least                    within
$(1 - 1/1^2)$ = 0%          k = 1 (μ ± 1σ)
$(1 - 1/2^2)$ = 75%         k = 2 (μ ± 2σ)
$(1 - 1/3^2)$ = 89%         k = 3 (μ ± 3σ)
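The bound can be tabulated and checked against a strongly skewed data set with a short Python sketch. The function names are illustrative, and the data set (24 small values and one value of 120) is used here only as a non-bell-shaped example.

```python
from statistics import fmean, pstdev


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations: 1 - 1/k^2."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)


def fraction_within(values, k):
    """Observed proportion of values within k population SDs of the mean."""
    mu = fmean(values)
    sigma = pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


bounds = {k: chebyshev_bound(k) for k in (1, 2, 3)}

# A skewed data set with one extreme value.
skewed = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]
observed_2 = fraction_within(skewed, 2)
observed_3 = fraction_within(skewed, 3)
```

Even for this skewed data set, 24 of the 25 values (96%) lie within 2 standard deviations of the mean, which respects the guaranteed minimum of 75%.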
