
1 - Presentation of Data

The document outlines the course STAT-402, focusing on Statistics and Probability, emphasizing the importance of statistics in data analysis and decision-making. It introduces biostatistics and bioinformatics, highlighting their roles in biological research and data management. Additionally, it covers key statistical concepts, including population vs. sample, types of variables, and methods for data presentation.

STAT-402 3(2-1)

Statistics and Probability

Instructor : Muhammad Arif

M.Phil (Statistics), M.Sc (Computer Science)

Probability & Statistics for Engineers & Scientists
by
Walpole, R. E., R. H. Myers, S. L. Myers and K. Ye.

Why STATISTICS?
Statistics is the study of data, and data is everywhere. It is the science of
learning from data, organizing it, and drawing conclusions from it.

In order to make sound decisions, we need to be able to understand and use
data. Statistics provides us with the methods and tools to do this.

Why BIOLOGY?
Biology is the study of living organisms: their origins, anatomy, morphology,
physiology, behaviour, and distribution.
Some possible questions
• A new treatment for HIV disease works better than current therapies
• High blood pressure is demonstrated to be associated with heart disease
• A study suggests that a certain pollutant may be harmful to humans
• Hormone replacement therapy is determined to carry increased risk of certain types of cancer (and the evidence is so compelling that the study is stopped earlier than planned)
Biostatisticians play essential roles in designing the studies, analyzing the data,
and creating new methods for addressing the above problems.

Biostatistics (Biometry)
Biostatistics is the branch of statistics that deals with data relating to living organisms, i.e., it
applies statistical methods to a wide range of topics in biology.
Bioinformatics is the science of storing, retrieving and analysing large and
complex biological data such as genetic codes.
It is a highly interdisciplinary field involving many different types of
specialists, including biologists, molecular life scientists, computer
scientists, mathematicians and statisticians.
Bioinformatics is mainly used to extract knowledge from biological data
through the development of algorithms and software.
Why STATISTICS?
[Figure: raw DATA (a mass of numbers) is passed through statistical tools (X̄, S, Min, Max, Outlier) to give Information: X̄ = 7, S = 5, Min = 1, Max = 40, Outlier = 40]

What is Statistics?
• “Statistics is a way to get information from data”

[Figure: Data → Statistics → Information → Decision making]

Data: Facts, especially numerical facts, collected together for reference.
Information: Knowledge communicated concerning some particular facts.

STATISTICS
Statistics may be defined as the science of
• Collection
• Representation
• Analysis
• Interpretation
of numerical data under uncertain conditions.

Statistics (Definitions)

• Statistics is the subject which deals with variability.
• No two objects in a universe are exactly alike. If they were, there would be no statistical problem.
• It also deals with uncertainty, as every process of getting observations involves deficiencies or chance variation.
Branches of Statistics

Descriptive: involves the organization, summarization and display of data in tables, graphs and summary numbers such as X̄, S, r, p.

Inferential: uses sample information such as X̄, S, r, p to draw inferences about the population.

Descriptive Statistics
Descriptive statistics consists of the tools and techniques designed to
describe data, such as frequency tables, graphs, and numerical measures
like averages and measures of variation.

Inferential Statistics
Inferential statistics consists of techniques that allow a decision-maker
to reach a conclusion about characteristics of a larger data set
(population) based upon a subset (sample) of those data.
Population
• Totality of the observations made on all the objects (under investigation)
possessing some common specific characteristics, which are of particular
interest to researchers
• The size of the population is denoted by “N”

Examples:
• Population of all voters in Pakistan
• Population of all mobile users in Punjab
• Population of fish in a pond

Parameter
• Numerical quantity calculated from population data, e.g.,
population mean (μ), variance (σ²), proportion (P).

Sample
• A representative part of the population which is selected to obtain information concerning
the characteristics of the population
• The size of the sample is denoted by “n”

Statistic
• Numerical quantity calculated from the sample data, e.g.,
sample mean (X̄), variance (S²), proportion (p̂)
• It is used to give information about the unknown value of the corresponding population
parameter, i.e., it is an educated guess at the population parameter

Population vs Sample

[Figure: a Sample is drawn from the Population; Statistical Inference leads from the sample back to the population]

Population (has parameters): µ, σ, ρ
Sample (has statistics): X̄, S, r

Population: A population is the group of all items under investigation.
Sample: A representative part of the population.

Variable:
Any characteristic that may vary from individual to individual is known as a
variable.

• Height of a tree
• BMI of a student
• Number of insects on a tree
• Colour of a flower
• Gender of a student
Type of Variable

Variable

Qualitative (categorical)
Characteristic which varies in quality (not numerically), e.g.,
• Eye colour
• Behaviour
• Gender
• Blood group
• Taste

Quantitative: Discrete
• No. of students
• No. of chairs
• No. of deaths
• No. of births in a hospital
• No. of accidents

Quantitative: Continuous
• Height
• Weight
• Marks
• Time
• Distance
• Temperature
Qualitative Variable
• When the characteristic being studied is
nonnumeric, it is called a qualitative variable or
an attribute.
• For example, gender, religious affiliation, type
of automobile owned, eye colour, etc.
• When the data are qualitative, we are usually
interested in how many or what proportion fall
in each category.
Quantitative Variable
When the variable studied can be reported
numerically, the variable is called a quantitative
variable.
For example,
• balance in your checking account
• ages of company employees
• life of an automobile battery
• number of children in a family
Discrete Variable
Discrete variables can assume only certain values, and
there are usually “gaps” between the values.
For example, number of bedrooms in a house (1, 2, 3,
4, etc.)
Typically, discrete variables result from counting.

Continuous Variable
A continuous variable can assume any value within a
specific range, i.e., its domain is an interval containing all
possible values without gaps.
Examples
• air pressure in a tire
• weight of a shipment of tomatoes
• height of a student
Typically, continuous variables result from measuring.

Some important notations
• Variables are usually denoted by X, Y, Z, etc.
• Number of values in a data set: $n$
• Sum of all the values of variable X: $\sum X$
• Sum of squares of all values of X: $\sum X^2$
• Deviation of values of X from a: $(X - a)$
• Sum of deviations of X from a: $\sum (X - a)$
• Sum of absolute deviations of X from a: $\sum |X - a|$
Some ingredients of statistical formulas

X      X²
2       4
4      16
5      25
6      36
8      64
25    145   (totals)

(1) $\sum X = 25$
(2) $\sum X^2 = 145$
(3) $(\sum X)^2 = (25)^2 = 625$
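The three quantities in this table are easy to confuse, so a short check in Python may help. This is a minimal sketch using the five X values from the slide; the variable names are illustrative only.

```python
# X values from the table above
x = [2, 4, 5, 6, 8]

sum_x = sum(x)                        # sum of X
sum_x_sq = sum(v ** 2 for v in x)     # sum of X squared: square first, then add
sq_sum_x = sum(x) ** 2                # (sum of X) squared: add first, then square

print(sum_x, sum_x_sq, sq_sum_x)
```

The last two lines differ only in the order of squaring and summing, which is exactly the difference between formulas (2) and (3).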
Some ingredients of statistical formulas

X     X−6   (X−6)²
2     −4     16
4     −2      4
5     −1      1
6      0      0
8      2      4
25    −5     25   (totals)

(1) $\sum (X - 6) = -5$
(2) $\sum (X - 6)^2 = 25$
Some ingredients of statistical formulas

X     X−5
2     −3
4     −1
5      0
6      1
8      3
25     0   (totals)

(1) $\bar{X} = \dfrac{\sum X}{n} = \dfrac{25}{5} = 5$
(2) $\sum (X - \bar{X}) = 0$

The sum of deviations of values from the mean is always zero.
Some ingredients of statistical formulas

X     (X−5)²   (X−6)²
2       9       16
4       1        4
5       0        1
6       1        0
8       9        4
25     20       25   (totals)

(1) $\sum (X - \bar{X})^2 = 20$
(2) $\sum (X - 6)^2 = 25$

• The sum of squared deviations of values from the mean is always a minimum
• The sum of squared deviations is never negative
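The two properties of deviations stated on these slides can be verified numerically. The sketch below reuses the slide's five values; the helper names `sum_dev` and `sum_sq_dev` and the grid of trial values are assumptions made for illustration.

```python
x = [2, 4, 5, 6, 8]
mean = sum(x) / len(x)                 # 25 / 5 = 5.0

def sum_dev(values, a):
    """Sum of deviations of the values from a."""
    return sum(v - a for v in values)

def sum_sq_dev(values, a):
    """Sum of squared deviations of the values from a."""
    return sum((v - a) ** 2 for v in values)

print(sum_dev(x, mean), sum_dev(x, 6))          # about the mean, about 6
print(sum_sq_dev(x, mean), sum_sq_dev(x, 6))    # about the mean, about 6

# Try many candidate centres between 0 and 10 and keep the one that
# gives the smallest sum of squared deviations.
trial = [a / 10 for a in range(0, 101)]
best = min(trial, key=lambda a: sum_sq_dev(x, a))
print(best)
```

Among the trial centres, the smallest sum of squared deviations occurs at the mean, which is the sense in which it is "always a minimum".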
Presentation of data
After collecting statistical data, the next step is
the presentation of the data so that valid
inferences can be drawn.

Methods for the presentation of data
• Frequency Distribution
• Graphical presentation
• Stem and Leaf display
Presentation of Qualitative data
Example 1: Consider the data about the gender of 10 students

Gender M F M M F M F M M M

• Construct a frequency distribution with relative frequency and % frequency for the
above data and interpret your results. Draw an appropriate graph.

Example 2: Suppose we have also collected data on the sections of these 10
students:

Gender  M F M M F M F M M M
Section A A A B B B A B A B

• Construct the cross tabulation of the above data and interpret your results.
Also draw an appropriate graph.
Gender   f    r.f   %f    cf
Male     7    0.7    70    7
Female   3    0.3    30   10
Total   10    1.0   100

Gender   Sec A   Sec B   Total
Male       3       4       7
Female     2       1       3
Total      5       5      10

[Figure: Bar Chart of frequency by sex (Male 7, Female 3); x-axis: Sex, y-axis: Frequency]
[Figure: Multiple Bar Chart of frequency by sex and section (Male: Sec A 3, Sec B 4; Female: Sec A 2, Sec B 1); x-axis: Sex, y-axis: Frequency]
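The frequency table and the cross tabulation above can be reproduced by simple counting. The following is a minimal sketch using Python's standard `collections.Counter` on the ten observations from Examples 1 and 2; the variable names are illustrative.

```python
from collections import Counter

gender  = ["M", "F", "M", "M", "F", "M", "F", "M", "M", "M"]
section = ["A", "A", "A", "B", "B", "B", "A", "B", "A", "B"]
n = len(gender)

# Frequency distribution: f, relative frequency, % frequency, cumulative frequency
freq = Counter(gender)
cf = 0
for g in ("M", "F"):
    f = freq[g]
    cf += f
    print(g, f, f / n, 100 * f / n, cf)

# Cross tabulation: count each (gender, section) pair
cross = Counter(zip(gender, section))
for g in ("M", "F"):
    row = [cross[(g, s)] for s in ("A", "B")]
    print(g, row, sum(row))
```

Pairing the two lists with `zip` keeps each student's gender and section together, which is what a cross tabulation requires.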
Simple Bar Chart
A bar chart is a type of chart which shows the values of different categories of
data as rectangular bars of different lengths.
Example: Draw a simple bar chart to represent the population of 5 cities of the
province of Punjab.

Cities        Population ('000)
Lahore        10,355
Rawalpindi     4,765
Faisalabad     3,675
Sargodha       1,550
Multan         3,100

[Figure: Bar diagram showing population of 5 cities of Punjab; x-axis: Cities, y-axis: Population in '000']
Multiple Bar Chart

Cities        Population ('000)   Male    Female
Lahore        10,355              5,385   4,970
Rawalpindi     4,765              2,478   2,287
Faisalabad     3,675              1,911   1,764
Sargodha       1,550                806     744

[Figure: Multiple Bar Chart showing population of Males and Females; x-axis: Cities, y-axis: Population]
Component bar chart

Cities        Population ('000)   Male    Female
Lahore        10,355              5,385   4,970
Rawalpindi     4,765              2,478   2,287
Faisalabad     3,675              1,911   1,764
Sargodha       1,550                806     744

[Figure: Component Bar Chart showing population of both Males and Females and Total; x-axis: Cities, y-axis: Population]
Discrete data – Frequency Distribution
Example:
The following data represent the number of infected plants in a sample of
twenty experimental plots.

1 2 4 3 0 1 2 3 1 1 0 2
1 0 2 3 0 0 1 3

Construct a frequency distribution with relative frequency, % frequency and
cumulative frequency for these data and interpret your results. Draw an
appropriate graph.

Discrete Frequency Distribution

No. of infected   Tally      Frequency   Relative          %f    c.f
plants (X)                   (f)         Frequency
0                 ||||/      5           5/20 = 0.25       25     5
1                 ||||/ |    6           0.30              30    11
2                 ||||       4           0.20              20    15
3                 ||||       4           0.20              20    19
4                 |          1           0.05               5    20
Total                        20          1.00             100

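The columns of this table follow mechanically from the raw counts. The sketch below recomputes them from the twenty observations in the example; it is one possible way to do the tallying, with illustrative variable names.

```python
from collections import Counter

# Number of infected plants in each of the twenty plots
plants = [1, 2, 4, 3, 0, 1, 2, 3, 1, 1, 0, 2,
          1, 0, 2, 3, 0, 0, 1, 3]
n = len(plants)
freq = Counter(plants)

rows = []   # (X, f, relative frequency, % frequency, cumulative frequency)
cf = 0
for value in sorted(freq):
    f = freq[value]
    cf += f
    rows.append((value, f, f / n, 100 * f / n, cf))

for row in rows:
    print(row)
```

The cumulative frequency is a running total of f, so its last entry must equal the number of plots.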
Graphical Representation of Discrete Data
[Figure: Bar Chart representing the infected items, with frequencies 5, 6, 4, 4 and 1 for 0, 1, 2, 3 and 4 infected items; x-axis: No. of infected items, y-axis: Frequency]
Pie Chart
A pie chart is a type of graph in which a circle is divided into sectors that each
represent a proportion of the whole.
Example: The following data represent the blood groups of different students

Blood Groups   Number of students
A+              75
B+              80
O+              30
O-              15
Total          200
Pie Chart

Blood groups   Frequency   Angle of sector (degrees)   Cumulative angle
A+              75         (75/200) × 360 = 135        135
B+              80         144                         279
O+              30          54                         333
O-              15          27                         360
Total          200         360

[Figure: Pie chart with sectors A+, B+, O+ and O-]

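Each sector angle is the class frequency as a share of the total, scaled to 360 degrees. A minimal sketch of that calculation for the blood group data follows; the dictionary and variable names are illustrative.

```python
# Blood group frequencies from the table above
groups = {"A+": 75, "B+": 80, "O+": 30, "O-": 15}
total = sum(groups.values())

angles = {}
cumulative = {}
running = 0.0
for name, f in groups.items():
    angle = f * 360 / total      # sector angle in degrees
    running += angle
    angles[name] = angle
    cumulative[name] = running
    print(name, f, angle, running)
```

The cumulative angle of the last sector should come out to 360, which is a quick check that no category was missed.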
Line Chart / Historigram
The following data represent the production of a factory over 9 years

Years        2000  2001  2002  2003  2004  2005  2006  2007  2008
Production     50    65    75    80    85   110    90    85    70

[Figure: Historigram of production by year; x-axis: Years (2000 to 2008), y-axis: Production]

Line Chart / Historigram
The following data represent the production of two companies over 9 years

Years        2000  2001  2002  2003  2004  2005  2006  2007  2008
Company A      50    65    75    80    85   110    90    85    70
Company B      45    60    80    90    95   100    80    75    80

[Figure: Historigram with one line each for Company A and Company B; x-axis: Years (2000 to 2008), y-axis: Production]
Frequency Table for continuous variable

The following data represent the heights of 30 wheat plants taken from the
experimental area. Construct a frequency distribution and appropriate
graphs to explain the distribution of the data:

87 91 89 88 89 91 87 92 90 98 95
97 96 100 101 96 98 99 98 100 102 99
101 105 103 107 105 106 107 112

The plant heights (cm) of the sample of 30 plants, grouped into classes:

Classes     Frequency (f)
86–90        6
91–95        4
96–100      10
101–105      6
106–110      3
111–115      1
Total       30
Frequency Distribution

• Tabular arrangement of data in which various items are arranged into
classes or groups and the number of items falling in each class is
stated.
• The number of observations falling in a particular class is referred to
as the class frequency "f".
• Data presented in the form of a frequency distribution is also called
grouped data.

Some definitions
Class Limits
The class limits are the values of the variable which are used to separate
two classes. Sometimes classes are taken as 20--25, 25--30, etc. In such a
case, these class limits mean "20 but less than 25", "25 but less than 30", etc.
Class marks or midpoints
The class mark or midpoint is the value which divides a class into two
equal parts. It is obtained by dividing the sum of the lower and upper class limits
(or class boundaries) of a class by 2.
Class interval
The difference between two successive lower class limits, two successive
upper class limits, or two successive midpoints. It is denoted by "h".
Construction of a frequency distribution

• Decide the number of classes:
$K = 1 + 3.3 \log_{10}(n) = 5.87$ or $\sqrt{n} = 5.48$, so take 6 classes (here $n = 30$)
• Determine the range of variation of the data, i.e.,
$R = \text{Max} - \text{Min} = 112 - 87 = 25$
• Determine the approximate size of the class interval:
$h = R / K = 25/6 = 4.17 \approx 5$
• Decide where to locate the class limit of the first class:
just below the minimum value in the data, giving 86–90, 91–95, …
• Distribute the data into the appropriate classes:
use a tally mark (|) for each value
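The construction steps can be followed in code for the 30 plant heights. This is a sketch under the slide's own choices: the class width is rounded up to 5 and the first lower limit is set to 86 by hand, since both are judgment calls rather than formula outputs.

```python
import math

heights = [87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95,
           97, 96, 100, 101, 96, 98, 99, 98, 100, 102, 99,
           101, 105, 103, 107, 105, 106, 107, 112]
n = len(heights)

k_rule = 1 + 3.3 * math.log10(n)      # about 5.87
k_sqrt = math.sqrt(n)                 # about 5.48
k = math.ceil(k_rule)                 # number of classes

data_range = max(heights) - min(heights)
h_raw = data_range / k                # about 4.17
h = 5                                 # rounded up to a convenient width (slide's choice)
first_lower = 86                      # just below the minimum of 87 (slide's choice)

# Inclusive class limits: 86-90, 91-95, ...
classes = [(first_lower + i * h, first_lower + i * h + h - 1) for i in range(k)]
freq = [sum(lo <= v <= hi for v in heights) for lo, hi in classes]

for (lo, hi), f in zip(classes, freq):
    print(f"{lo}-{hi}: {f}")
```

Summing `freq` and comparing it with `n` confirms that every observation fell into exactly one class.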
Frequency Distribution

Classes    Class Boundaries   Freq (f)   Mid-Point (X)   Class Interval (h)   c.f.   r.f.    % freq
86–90      85.5–90.5           6          88             5                     6     0.200    20.0
91–95      90.5–95.5           4          93             5                    10     0.133    13.3
96–100     95.5–100.5         10          98             5                    20     0.333    33.3
101–105    100.5–105.5         6         103             5                    26     0.200    20.0
106–110    105.5–110.5         3         108             5                    29     0.100    10.0
111–115    110.5–115.5         1         113             5                    30     0.033     3.3
Total                         30                                                     1.000   100.0
Class Boundaries

• Subtract any upper class limit from the next lower class limit and
divide the difference by 2 to get the continuity correction factor.
• Subtract this factor from all lower class limits and add it to all upper class
limits.

For example, (91 − 90)/2 = 1/2 = 0.5 or (96 − 95)/2 = 1/2 = 0.5

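The continuity correction is easy to apply programmatically once the class limits are listed. The sketch below assumes the six classes used for the plant height data and also derives the midpoints; the names are illustrative.

```python
# Class limits (lower, upper) for the plant height data
classes = [(86, 90), (91, 95), (96, 100), (101, 105), (106, 110), (111, 115)]

# Half the gap between one upper limit and the next lower limit
correction = (classes[1][0] - classes[0][1]) / 2      # (91 - 90) / 2

boundaries = [(lo - correction, hi + correction) for lo, hi in classes]
midpoints = [(lo + hi) / 2 for lo, hi in classes]

for limits, bounds, mid in zip(classes, boundaries, midpoints):
    print(limits, bounds, mid)
```

After the correction, the upper boundary of each class coincides with the lower boundary of the next, so the classes leave no gaps on a continuous scale.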
Histogram
• X-axis = Class Boundaries
• Y-axis = Frequency

Class Boundaries    f
85.5–90.5           6
90.5–95.5           4
95.5–100.5         10
100.5–105.5         6
105.5–110.5         3
110.5–115.5         1

[Figure: Histogram of the frequencies above; x-axis: Class Boundaries, y-axis: Frequency]

Frequency Polygon
Frequency polygons are a graphical device for understanding the shapes of distributions.
• X-axis = Mid Points
• Y-axis = Frequency

Classes     (f)   (X)
(none)       0     83
86–90        6     88
91–95        4     93
96–100      10     98
101–105      6    103
106–110      3    108
111–115      1    113
(none)       0    118
TYPES OF FREQUENCY CURVE
• Symmetrical distribution
• Skewed distribution

Symmetrical distribution
A frequency distribution or curve is symmetrical
if values equidistant from a central maximum
have the same frequencies.

Skewed distribution
A frequency distribution or curve is skewed
when it departs from symmetry.

[Figure: three frequency curves labelled Symmetric, Negatively skewed and Positively skewed]

Cumulative Frequency Polygon / Ogive
A cumulative frequency polygon is a plot of the cumulative frequency against
the upper class boundary.
X-axis = Upper Class Boundaries
Y-axis = Cumulative Frequency

Marks     f     UCB     CF
< 30.5     0    30.5      0
31–40     20    40.5     20
41–50     30    50.5     50
51–60     60    60.5    110
61–70     20    70.5    130
71–80     25    80.5    155
81–90     25    90.5    180
Reading from the ogive:
• Approximately how many students got marks of 62 or below? 115
• Approximately how many students got marks of 95 or above? 10
• 50% of the students (100) got at most how many marks? 58 marks

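Readings from an ogive can be approximated by straight-line interpolation between the plotted points. The sketch below is an assumption-laden illustration: it uses only the rows present in the table above (up to the boundary 90.5), evaluates at the mark itself rather than at a corrected boundary, and the helper names `cf_at` and `mark_at` are hypothetical.

```python
# (upper class boundary, cumulative frequency) pairs from the table rows shown
points = [(30.5, 0), (40.5, 20), (50.5, 50), (60.5, 110),
          (70.5, 130), (80.5, 155), (90.5, 180)]

def cf_at(mark):
    """Approximate cumulative frequency at a given mark."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x0 <= mark <= x1:
            return c0 + (mark - x0) * (c1 - c0) / (x1 - x0)
    raise ValueError("mark lies outside the tabulated boundaries")

def mark_at(cf):
    """Approximate mark below which `cf` students fall."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= cf <= c1:
            return x0 + (cf - c0) * (x1 - x0) / (c1 - c0)
    raise ValueError("cumulative frequency lies outside the tabulated range")

print(cf_at(62))      # students with marks of 62 or below
print(mark_at(100))   # mark reached by the 100th student
```

Interpolation gives about 113 students at a mark of 62 and a mark of about 58.8 for the 100th student, which is in line with the approximate graph readings of 115 and 58. The question about marks of 95 or above cannot be checked this way because the tabulated rows stop at 90.5.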
Frequency Distribution
• Advantage of frequency distribution:
A frequency distribution is useful for representing a large data set with a small
number of groups/classes
• Disadvantage of frequency distribution:
The identity of individual observations is lost during the grouping process
• Assumption:
Each observation in a class is assumed to be at the center of the class
(mid-value)
• Solution:
Present the data using a stem and leaf display
Stem & Leaf Display
• In addition to information on the number of observations falling in the various classes,
a stem and leaf display shows what those observations actually are.

• Each number in the data set is divided into two parts, a Stem and a Leaf.

• A stem is the leading digit(s) of each number and is used in sorting, while a leaf is
the trailing digit.

For example, 32 has stem 3 and leaf 2; 132 has stem 13 and leaf 2; 02 has stem 0 and leaf 2.
• A relatively small data set can be represented by a stem and leaf display.

Example: Represent the following data by a stem and leaf display by
(i) taking 10 units as the width of the class
(ii) taking 5 units as the width of the class

32 45 38 41 49 36 52 56 51 62
63 59 68

(i) Class width 10
Stem   Leaf
3      2 8 6
4      5 1 9
5      2 6 1 9
6      2 3 8

(ii) Class width 5
Stem   Leaf
3*     2
3.     8 6
4*     1
4.     5 9
5*     2 1
5.     6 9
6*     2 3
6.     8

* indicates leaves 0–4 and . indicates leaves 5–9; * and . are called placeholders.
Example
Use the data below to make a stem-and-leaf plot by taking 10 as a unit.

85 115 126 92 104
85 116 100 121 123
79 90 110 129 108
107 78 131 114 92
131 88 97 99 116
93 84 75 70 132

Stem   Leaf
7      0589
8      4558
9      022379
10     0478
11     04566
12     1369
13     112

The first row, stem 7 with leaves 0589, represents the values 70, 75, 78 and 79.
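Splitting each value into a stem and a leaf is integer division and remainder by the unit. A minimal sketch for the 30 values in this example, assuming a unit of 10 and sorted leaves, could look like this:

```python
from collections import defaultdict

data = [85, 115, 126, 92, 104, 85, 116, 100, 121, 123,
        79, 90, 110, 129, 108, 107, 78, 131, 114, 92,
        131, 88, 97, 99, 116, 93, 84, 75, 70, 132]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)   # e.g. 115 -> stem 11, leaf 5
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:>2} | {leaves}")
```

Sorting the data first means the leaves on each stem come out in increasing order, matching the ordered display on the slide.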
Dotplot
Dot Plot (Single Quantitative Variable)

Weight (kg): 32 36 40 40 44 48 52 52 56 60 64 64 68

[Figure: Dot plot of the weights; x-axis: Weight in Kg, from 32 to 68]
Scatter Plot
A scatter plot is a useful graph to compare two related variables.

The following data represent the percent marks obtained by students in the
Mid and Final tests.

Student   Mid   Final
S1         65    87
S2         72    91
S3         45    45
S4         80    60
S5         68    79
S6         66    47

[Figure: Scatterplot of % Final vs % Mid with points S1 to S6 and the line of equality; x-axis: % Mid, y-axis: % Final]
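The line of equality in the scatter plot separates students who improved from those who did not. The sketch below classifies each student relative to that line using the six pairs of marks; the labels "above", "on" and "below" are illustrative wording, not taken from the slide.

```python
# (mid, final) percent marks for each student
marks = {"S1": (65, 87), "S2": (72, 91), "S3": (45, 45),
         "S4": (80, 60), "S5": (68, 79), "S6": (66, 47)}

position = {}
for student, (mid, final) in marks.items():
    if final > mid:
        position[student] = "above"   # did better in the final
    elif final < mid:
        position[student] = "below"   # did better in the mid
    else:
        position[student] = "on"      # same marks in both tests
    print(student, mid, final, position[student])
```

Points above the line correspond to students whose final marks exceed their mid marks, and points below it to students whose marks dropped.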
