1 - Presentation of Data
STAT-402 3(2-1)
Statistics and Probability
Probability & Statistics for Engineers & Scientists
by
Walpole, R. E., R. H. Myers, S. L. Myers and K. Ye.
Why STATISTICS?
Statistics is the study of data, and data are everywhere. It is the science of
organizing data, learning from it, and drawing conclusions from it.
Why BIOLOGY?
Biology is the study of living organisms: their origins, anatomy, morphology,
physiology, behaviour, and distribution.
Some possible questions?
• A new treatment for HIV disease works better than current therapies
• High blood pressure is demonstrated to be associated with heart disease
• A study suggests that a certain pollutant may be harmful to humans
• Hormone replacement therapy is determined to carry increased risk of certain types
of cancer (and the evidence is so compelling that the study is stopped earlier than
planned)
Biostatisticians play essential roles in designing the studies, analyzing the data,
and creating new methods for addressing the problems above.
Biostatistics (Biometry)
The branch of statistics that deals with data relating to living organisms, i.e., biostatistics
applies statistical methods to a wide range of topics in biology.
Bioinformatics is the science of storing, retrieving and
analysing large and complex biological data such as genetic
codes.
It is a highly interdisciplinary field involving many different
types of specialists, including biologists, molecular life
scientists, computer scientists, mathematicians and
statisticians.
Bioinformatics is mainly used to extract knowledge from
biological data through the development of algorithms and
software.
Why STATISTICS?
[Figure: raw DATA (a long list of numbers such as 4; 3, 2; 2, 5, 3; 2, 5, 40, 6, 2; ...) is passed through statistical tools (X̄, S, Min, Max, Outlier) to give Information: X̄ = 7, S = 5, Min = 1, Max = 40, Outlier = 40.]
What is Statistics?
• “Statistics is a way to get information from data”
[Figure: Data → Statistics → Information → Decision making]
STATISTICS
Statistics may be defined as the science of the
• Collection
• Representation
• Analysis
• Interpretation
of data.
Statistics (Definitions)
Descriptive
Involves the organization, summarization and display of data in tables, graphs and summary numbers such as X̄, S, r, p.
Inferential
Uses sample information such as X̄, S, r, p to draw inferences about the population.
Descriptive Statistics
Descriptive Statistics consists of the tools and techniques
designed to describe data, such as frequency tables, graphs,
and numerical measures like averages and measures of variation.
Inferential Statistics
Inferential Statistics consists of techniques that allow a
decision-maker to reach a conclusion about characteristics of
a larger data set (Population) based upon a subset (Sample)
of those data.
Population
• The totality of the observations made on all the objects under investigation
that possess some common specific characteristic of particular
interest to researchers
Parameter
• A numerical quantity calculated from population data,
e.g., population mean (μ), variance (σ²), proportion (P).
Sample
• A representative part of the population, selected to obtain information about
the characteristics of the population
• The size of the sample is denoted by “n”
Statistic
• A numerical quantity calculated from the sample data,
e.g., sample mean (X̄), variance (S²), proportion (p̂)
• It is used to give information about the unknown value of the corresponding population
parameter, i.e., it is an educated guess at the population parameter
Population vs Sample
[Figure: a Sample is drawn from the Population; Statistical Inference leads from the Sample back to the Population.]
Population (has parameters): μ, σ, ρ
Sample (has statistics): X̄, S, r
Variable:
Any characteristic that may vary from individual to individual is known as
Variable.
•Height of a tree
•BMI of a student
•Number of insects on a tree
•Colour of a flower
•Gender of a student
Type of Variable
Variable
  Qualitative (categorical)
    Characteristic which varies in quality (not numerically), e.g.,
    • Eye colour
    • Behaviour
    • Gender
    • Blood group
    • Taste
  Quantitative
    Discrete, e.g.,
    • No. of students
    • No. of chairs
    • No. of deaths
    • No. of births in a hospital
    • No. of accidents
    Continuous, e.g.,
    • Height
    • Weight
    • Marks
    • Time
    • Distance
    • Temperature
Qualitative Variable
• When the characteristic being studied is
nonnumeric, it is called a qualitative variable or
an attribute.
• For example, gender, religious affiliation, type
of automobile owned, eye colour, etc.
• When the data are qualitative, we are usually
interested in how many or what proportion fall
in each category.
Quantitative Variable
When the variable studied can be reported
numerically, the variable is called a quantitative
variable.
For example,
• balance in your checking account
• ages of company employees
• life of an automobile battery
• number of children in a family
Discrete Variable
Discrete variables can assume only certain values, and
there are usually “gaps” between the values.
For example, number of bedrooms in a house (1, 2, 3,
4, etc.)
Typically, discrete variables result from counting.
Continuous Variable
A continuous variable can assume any value within a
specific range, i.e., its domain is an interval in which
every value is possible, without gaps.
Examples
• air pressure in a tire
• weight of a shipment of tomatoes
• height of a student
Typically, continuous variables result from measuring.
Some important notations
• Variables are usually denoted by X, Y, Z, etc.
• Sum of squares of all values of X: $\sum X^2$

X       X²
2       4
4       16
5       25
6       36
8       64
Total   25      145

(1) $\sum X = 25$
(2) $\sum X^2 = 145$
(3) $(\sum X)^2 = (25)^2 = 625$
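The three quantities are easy to confuse, so a minimal Python sketch may help. It uses the five values of X from this slide; the variable names are illustrative and not part of the course material.

```python
# The five values of X used on this slide.
x = [2, 4, 5, 6, 8]

sum_x = sum(x)                          # (1) sum of the values
sum_x_squared = sum(v ** 2 for v in x)  # (2) sum of the squared values
square_of_sum = sum_x ** 2              # (3) square of the sum

print(sum_x, sum_x_squared, square_of_sum)  # 25 145 625
```

Note that (2) and (3) differ: squaring each value before adding gives 145, while adding first and then squaring gives 625.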
Some ingredients of statistics formula
X       X-6     (X-6)²
2       -4      16
4       -2      4
5       -1      1
6       0       0
8       2       4
Total   25      -5      25

(1) $\sum (X - 6) = -5$
(2) $\sum (X - 6)^2 = 25$
Some ingredients of statistics formula
X       X-5
2       -3
4       -1
5       0
6       1
8       3
Total   25      0

(1) $\bar{X} = \frac{\sum X}{n} = \frac{25}{5} = 5$
(2) $\sum (X - \bar{X}) = 0$

Sum of deviations of values from mean is always zero
Some ingredients of statistics formula
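X       (X-5)²  (X-6)²
2       9       16
4       1       4
5       0       1
6       1       0
8       9       4
Total   25      20      25

(1) $\sum (X - \bar{X})^2 = 20$
(2) $\sum (X - 6)^2 = 25$

• Sum of squared deviations of values from the mean is always the minimum, i.e., it is smaller than the sum of squared deviations from any other value
• Sum of squared deviations is never negative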
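Both properties of deviations can be checked numerically. The following is a rough Python sketch using the same five values of X; the helper names `sum_dev` and `sum_sq_dev` and the grid of candidate values are assumptions made for illustration only.

```python
x = [2, 4, 5, 6, 8]
mean = sum(x) / len(x)  # 25 / 5 = 5.0


def sum_dev(values, a):
    """Sum of deviations of the values from a reference value a."""
    return sum(v - a for v in values)


def sum_sq_dev(values, a):
    """Sum of squared deviations of the values from a reference value a."""
    return sum((v - a) ** 2 for v in values)


dev_from_mean = sum_dev(x, mean)  # deviations from the mean cancel out
ss_mean = sum_sq_dev(x, mean)     # squared deviations from the mean
ss_six = sum_sq_dev(x, 6)         # squared deviations from 6

# Try reference values 0.0, 0.1, ..., 10.0 and keep the one with the
# smallest sum of squared deviations (illustrative grid, not a proof).
candidates = [a / 10 for a in range(0, 101)]
best = min(candidates, key=lambda a: sum_sq_dev(x, a))

print(dev_from_mean, ss_mean, ss_six, best)
```

On this grid the smallest sum of squares occurs at the mean, 5, which agrees with the table: 20 for deviations from 5 against 25 for deviations from 6.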
Presentation of data
After collecting statistical data, the next step is
the presentation of the data so that valid
inferences can be drawn.
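Gender: M F M M F M F M M M

Gender   Frequency   Relative frequency   % frequency   Cumulative frequency
Male     7           0.7                  70            7
Female   3           0.3                  30            10
Total    10          1.0                  100

Gender   Sec A   Sec B   Total
Male     3       4       7

[Figure: simple bar chart of Frequency by Sex (Male, Female), and multiple bar chart of Frequency by Sex for Sec A and Sec B.]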
Simple Bar Chart
A bar chart is a type of chart which shows the values of different categories of
data as rectangular bars with different lengths.
Example: Draw a simple bar chart to represent the population of 5 cities of the
province of Punjab.

Cities       Population ('000)
Lahore       10,355
Rawalpindi   4,765
Faisalabad   3,675
Sargodha     1,550
Multan       3,100

[Figure: Bar diagram showing population of 5 cities of Punjab; x-axis: Cities, y-axis: Population in '000'.]
Multiple Bar Chart
Cities       Population ('000)   Male    Female
Lahore       10,355              5,385   4,970
Rawalpindi   4,765               2,478   2,287
Faisalabad   3,675               1,911   1,764
Sargodha     1,550               806     744

[Figure: Multiple bar chart showing population of males and females; x-axis: Cities, y-axis: Population.]
[Figure: second bar chart of the same male and female populations by city; x-axis: Cities, y-axis: Population.]
Discrete data – Frequency Distribution
Example:
The following data represent the number of infected plants in a sample of
twenty experimental plots.
1 2 4 3 0 1 2 3 1 1 0 2
1 0 2 3 0 0 1 3
Make a frequency distribution showing the frequency, relative frequency, % frequency and
cumulative frequency of the data, and interpret your results. Draw an
appropriate graph.
Discrete Frequency Distribution
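As a minimal sketch, the frequency distribution for the infected-plant data can be computed in Python as follows. The dictionary keys (`x`, `f`, `rel`, `pct`, `cf`) are illustrative names, not notation from the course.

```python
from collections import Counter

# Number of infected plants in each of the twenty experimental plots.
plants = [1, 2, 4, 3, 0, 1, 2, 3, 1, 1, 0, 2,
          1, 0, 2, 3, 0, 0, 1, 3]

n = len(plants)
counts = Counter(plants)

table = []
cf = 0  # running cumulative frequency
for value in sorted(counts):
    f = counts[value]
    cf += f
    table.append({
        "x": value,          # number of infected plants
        "f": f,              # frequency
        "rel": f / n,        # relative frequency
        "pct": 100 * f / n,  # % frequency
        "cf": cf,            # cumulative frequency
    })

for row in table:
    print(row)
```

One possible reading of the result: the most common value is 1 infected plant (6 of the 20 plots, 30%), and 15 of the 20 plots have at most 2 infected plants.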
Graphical Representation of Discrete Data
[Figure: Bar chart representing the infected items; x-axis: No. of infected items (0, 1, 2, 3, 4), y-axis: Frequency; bar heights 5, 6, 4, 4, 1.]
Pie Chart
A pie chart is a type of graph in which a circle is divided into sectors that each
represent a proportion of the whole.
Example: The following data represent the blood groups of different students.

Blood group   Number of students
A+            75
B+            80
O+            30
O-            15
Total         200
Pie Chart
Blood group   Frequency   Angle of sector (degrees)   Cumulative angle
A+            75          (75/200) x 360 = 135        135
B+            80          144                         279
O+            30          54                          333
O-            15          27                          360
Total         200         360

[Figure: pie chart with sectors A+, B+, O+ and O-.]
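The sector angles follow from angle = (frequency / total) x 360. A short Python sketch of that calculation for the blood-group data is given below; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Number of students in each blood group.
blood_groups = {"A+": 75, "B+": 80, "O+": 30, "O-": 15}
total = sum(blood_groups.values())  # 200

# Angle of each sector in degrees: (frequency / total) x 360.
angles = {group: f * 360 / total for group, f in blood_groups.items()}

# Running total of the angles, used when drawing the sectors one by one.
cumulative_angles = list(accumulate(angles.values()))

print(angles)
print(cumulative_angles)
```

The last cumulative angle should always come out as 360, which is a quick check that no category has been missed.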
Line Chart / Historigram
The following data represent the production of a factory over 9 years.

Year         2000 2001 2002 2003 2004 2005 2006 2007 2008
Production   50   65   75   80   85   110  90   85   70

[Figure: Historigram; x-axis: Years (2000 to 2008), y-axis: Production.]
Line Chart / Historigram
The following data represent the production of two companies over 9 years.

Year        2000 2001 2002 2003 2004 2005 2006 2007 2008
Company A   50   65   75   80   85   110  90   85   70
Company B   45   60   80   90   95   100  80   75   80

[Figure: Historigram with one line each for Company A and Company B; x-axis: Years (2000 to 2008), y-axis: Production.]
Frequency Table for continuous variable
The following data represent the heights of 30 wheat plants taken from the
experimental area. Construct a frequency distribution and appropriate
graphs to explain the distribution of the data:
87 91 89 88 89 91 87 92 90 98 95
97 96 100 101 96 98 99 98 100 102 99
101 105 103 107 105 106 107 112
Plant height (cm) of the sample of 30 plants, grouped into classes:

Classes    Frequency (f)
86–90      6
91–95      4
96–100     10
101–105    6
106–110    3
111–115    1
Total      30
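The grouping can be reproduced with a few lines of Python. This is only a sketch: the starting limit (86), the class width (5) and the number of classes (6) are read off the table of classes for these data, and the variable names are illustrative.

```python
# Heights (cm) of the 30 wheat plants.
heights = [87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95,
           97, 96, 100, 101, 96, 98, 99, 98, 100, 102, 99,
           101, 105, 103, 107, 105, 106, 107, 112]

lower, width, k = 86, 5, 6  # first lower limit, class width, number of classes

# Class limits: (86, 90), (91, 95), ..., (111, 115).
classes = [(lower + i * width, lower + i * width + width - 1) for i in range(k)]

# Tally each height into its class.
freq = {c: 0 for c in classes}
for h in heights:
    index = (h - lower) // width
    freq[classes[index]] += 1

for (lo, hi), f in freq.items():
    print(f"{lo}-{hi}: {f}")
```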
Frequency Distribution
Some definitions
Class Limits
The class limits are the values of the variable that separate one class from
the next. Sometimes classes are written as 20--25, 25--30, etc. In such a case
the class limits mean "20 but less than 25", "25 but less than 30", etc.
Class marks or midpoints
The class mark or midpoint is the value that divides a class into two
equal parts. It is obtained by dividing the sum of the lower and upper class limits
(or class boundaries) of a class by 2.
Class interval
The class interval is the difference between two successive lower class limits, two
successive upper class limits, or two successive midpoints. It is denoted by "h".
Construction of a frequency distribution
• Subtract any upper class limit from the lower class limit of the next class and
divide the difference by 2 to get the continuity correction factor, e.g., (91 - 90) / 2 = 0.5
• Subtract this factor from all lower class limits and add it to all upper class
limits to obtain the class boundaries.
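A small Python sketch of these two steps, applied to the plant-height classes (86–90, 91–95, and so on), is given below. It also computes the midpoints and the class interval h from the definitions; all variable names are illustrative assumptions.

```python
# Class limits of the plant-height frequency distribution.
class_limits = [(86, 90), (91, 95), (96, 100), (101, 105), (106, 110), (111, 115)]

# Gap between an upper limit and the next lower limit, halved.
gap = class_limits[1][0] - class_limits[0][1]  # 91 - 90 = 1
correction = gap / 2                           # continuity correction factor

# Subtract the factor from lower limits and add it to upper limits.
boundaries = [(lo - correction, hi + correction) for lo, hi in class_limits]

# Midpoint: (lower limit + upper limit) / 2.
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

# Class interval h: difference between two successive lower limits.
h = class_limits[1][0] - class_limits[0][0]

print(correction, boundaries, midpoints, h)
```

The upper boundary of each class now coincides with the lower boundary of the next, so the classes leave no gaps on a continuous scale.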
Histogram
• X-axis = Class Boundaries
• Y-axis = Frequency

Class Boundaries   f
85.5–90.5          6
90.5–95.5          4
95.5–100.5         10
100.5–105.5        6
105.5–110.5        3
110.5–115.5        1

[Figure: Histogram; x-axis: Class Boundaries (85.5–90.5 to 110.5–115.5), y-axis: Frequency; bar heights 6, 4, 10, 6, 3, 1.]
Frequency Polygon
Frequency polygons are a graphical device for understanding the shapes of distributions.
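• X-axis = Mid Points
• Y-axis = Frequency

Classes    Frequency (f)   Midpoint (X)
           0               83
86–90      6               88
91–95      4               93
96–100     10              98
101–105    6               103
106–110    3               108
111–115    1               113
           0               118

The extra midpoints 83 and 118, with frequency 0, close the polygon on the x-axis.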
TYPES OF FREQUENCY CURVE
• Symmetrical distribution
• Skewed distribution
Symmetrical distribution
A frequency distribution or curve is symmetrical
if values equidistant from a central maximum
have the same frequencies.
Skewed distribution
A frequency distribution or curve is skewed
when it departs from symmetry.
Cumulative Frequency Polygon / Ogive
A cumulative frequency polygon is a plot of the cumulative frequency against
the upper class boundary.
• X-axis = Upper Class Boundaries
• Y-axis = Cumulative Frequency
[Table columns: Marks, f, UCB (upper class boundary), CF (cumulative frequency)]
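To illustrate how the UCB and CF columns are built, here is a hedged Python sketch. It reuses the plant-height distribution rather than a marks data set, and it starts the ogive at the lower boundary of the first class with a cumulative frequency of 0, which is a common convention assumed here rather than something stated on the slide.

```python
from itertools import accumulate

# Plant-height distribution: upper class boundaries and class frequencies.
upper_boundaries = [90.5, 95.5, 100.5, 105.5, 110.5, 115.5]
frequencies = [6, 4, 10, 6, 3, 1]

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Points of the ogive: (upper class boundary, cumulative frequency).
# Assumed convention: the curve starts at the first lower boundary with CF = 0.
ogive_points = [(85.5, 0)] + list(zip(upper_boundaries, cumulative))

for ucb, cf in ogive_points:
    print(ucb, cf)
```

The final cumulative frequency equals the total number of observations (30 here), which is a useful check on the table.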
Frequency Distribution
• Advantage of frequency distribution:
A frequency distribution represents a large data set with a small
number of groups/classes
• Disadvantage of frequency distribution:
The identity of individual observations is lost during the grouping process
• Assumption:
Each observation in a class is assumed to lie at the center of the class
(mid-value)
• Solution:
Present the data using a stem and leaf display
Stem & Leaf Display
• In addition to showing the number of observations falling in the various classes,
a stem and leaf display shows what those observations actually are.
• Each number in the data set is divided into two parts, a Stem and a Leaf.
• A stem is the leading digit(s) of each number and is used in sorting, while a leaf is the
remaining trailing digit.
e.g., 32 → stem 3, leaf 2; 132 → stem 13, leaf 2; 02 → stem 0, leaf 2
• A relatively small data set can be represented by stem and leaf display.
Example: Represent the following data by a stem and leaf display by
(i) taking 10 units as the width of the class
(ii) taking 5 units as the width of the class
32 45 38 41 49 36 52 56 51 62
63 59 68

(i) Class width 10
Stem   Leaf
3      2 8 6
4      5 1 9
5      2 6 1 9
6      2 3 8

(ii) Class width 5
Stem   Leaf
3*     2
3.     8 6
4*     1
4.     5 9
5*     2 1
5.     6 9
6*     2 3
6.     8

* indicates leaves 0 to 4 and . indicates leaves 5 to 9; * and . are called placeholders.
Example
Use the data below to make a stem-and-leaf plot by taking 10 as a unit.
85 115 126 92 104
85 116 100 121 123
79 90 110 129 108
107 78 131 114 92
131 88 97 99 116
93 84 75 70 132

Stem   Leaf
7      0 5 8 9
8      4 5 5 8
9      0 2 2 3 7 9
10     0 4 7 8
11     0 4 5 6 6
12     1 3 6 9
13     1 1 2

For example, the row 7 | 0 5 8 9 represents the values 70, 75, 78 and 79.
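A stem-and-leaf display with 10 as the unit can be sketched in Python by splitting each value into a quotient (stem) and remainder (leaf). The function name `stem_and_leaf` and its `unit` parameter are illustrative assumptions, and the leaves are sorted within each stem.

```python
from collections import defaultdict


def stem_and_leaf(values, unit=10):
    """Group values by stem (value // unit), listing sorted leaves (value % unit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, unit)
        plot[stem].append(leaf)
    return dict(plot)


data = [85, 115, 126, 92, 104,
        85, 116, 100, 121, 123,
        79, 90, 110, 129, 108,
        107, 78, 131, 114, 92,
        131, 88, 97, 99, 116,
        93, 84, 75, 70, 132]

plot = stem_and_leaf(data)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem:>3} | {leaves}")
```

Unlike a grouped frequency table, every original observation can be read back from the display, for example stem 13 with leaves 1 1 2 gives 131, 131 and 132.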
Dotplot
Dot Plot (Single Quantitative variable)
Weight: 32 36 40 40 44 48 52 52 56 60 64 64 68
[Figure: dot plot of the variable; x-axis: Weight in Kg, from 32 to 68.]
Scatter Plot
A scatter plot is a useful graph to compare two related variables.

Student   % Mid   % Final
S1        65      87
S2        72      91
S3        45      45
S4        80      60
S5        68      79
S6        66      47

[Figure: Scatterplot of % Final vs % Mid with the line of equality; x-axis: % Mid, y-axis: % Final; points labelled S1 to S6.]
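The line of equality is the line on which % Final equals % Mid, so points above it are students who improved and points below it are students who did worse in the final. A minimal Python sketch of that comparison follows; the dictionary layout and list names are assumptions for illustration.

```python
# (% Mid, % Final) for each student.
marks = {
    "S1": (65, 87),
    "S2": (72, 91),
    "S3": (45, 45),
    "S4": (80, 60),
    "S5": (68, 79),
    "S6": (66, 47),
}

# Position of each point relative to the line of equality (Final = Mid).
above = [s for s, (mid, final) in marks.items() if final > mid]
on_line = [s for s, (mid, final) in marks.items() if final == mid]
below = [s for s, (mid, final) in marks.items() if final < mid]

print("Above the line (improved):", above)
print("On the line (no change):", on_line)
print("Below the line (declined):", below)
```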