Block I (Introduction To Statistics)
Structure
1.0 Introduction
1.1 Objectives
1.2 Meaning of Statistics
1.2.1 Statistics in Singular Sense
1.2.2 Statistics in Plural Sense
1.2.3 Definition of Statistics
1.3 Types of Statistics
1.3.1 On the Basis of Function
1.3.2 On the Basis of Distribution of Data
1.4 Scope and Use of Statistics
1.5 Limitations of Statistics
1.6 Distrust and Misuse of Statistics
1.7 Let Us Sum Up
1.8 Unit End Questions
1.9 Glossary
1.10 Suggested Readings
1.0 INTRODUCTION
The word statistics has different meanings for different people. Knowledge of
statistics is applied in day-to-day life in many ways. In daily life it may mean a
general count of items; in the railways, statistics means the number of trains
operating, the number of passengers carried, the freight moved, and so on. Thus
statistics is used by people to take decisions about problems on the basis of the
different types of quantitative and qualitative information available to them.
1.1 OBJECTIVES
After going through this unit, you will be able to:
Define the term statistics;
Explain the status of statistics;
Describe the nature of statistics;
State basic concepts used in statistics; and
Analyse the uses and misuses of statistics.
Introduction to Statistics
1.2 MEANING OF STATISTICS
The word statistics is derived from the Latin word ‘status’ or the Italian ‘statista’,
meaning statesman. Professor Gottfried Achenwall used it in the 18th century.
In the early period, these words referred to the political state of a region. The
word ‘statista’ was used for keeping records of the census or of data related to the
wealth of a state. Gradually its meaning and usage were extended, and its nature
changed accordingly.
The word statistics conveys different meanings in the singular and the plural
sense. It can therefore be defined in two different ways.
vii) Statistics should be comparable: Only comparable data have any
meaning. For statistical analysis, the data should be comparable with respect
to time, place, group, etc.
Thus, it may be stated that “All statistics are numerical statements of facts, but
all numerical statements of facts are not necessarily statistics”.
1.2.3 Definition of Statistics
In this unit the emphasis is on the term statistics as a branch of science. It deals with
the classification, tabulation and analysis of numerical facts. Different statisticians
have defined this aspect of statistics in different ways. For example:
According to Seligman, “Statistics is the science which deals with the methods
of collecting, classifying, presenting, comparing and interpreting numerical data
collected to throw some light on any sphere of enquiry”.
Among all the definitions, the one given by Croxton and Cowden is considered
the most appropriate, as it covers all aspects and fields of statistics.
Though various bases have been adopted to classify statistics, the following are the
two major ways of classifying statistics: (i) on the basis of function and (ii) on
the basis of the distribution of data.
Parametric statistics rest on certain basic assumptions. The first is that the
population from which the sample is drawn is normally distributed. A normally
distributed population is spread symmetrically over the continuum from about
–3 SD to +3 SD and is unimodal, so that its mean, median and mode coincide. If the
samples come from different populations, those populations are assumed to have
equal variances (homogeneity of variance). The samples are selected independently
of one another. Every item in the population has an equal chance of being
selected for the sample. This reflects the random nature of the sample,
which is also a good safeguard against experimenter bias.
However, along with many advantages, parametric statistics have some
disadvantages. They are bound by the rigid assumption of normal
distribution, which narrows their scope of use. With a small sample, the
assumption of normality often cannot be met, and parametric statistics then cannot be
used. Further, computation in parametric statistics is lengthy and complex because
of large samples and numerical calculations. The t-test, F-test and r-test are some of the
major parametric statistics used for data analysis.
Nonparametric statistics are those statistics which are not based on the
assumption that the population is normally distributed. They are therefore also known
as distribution-free statistics. They are not restricted to interval scale
data or normally distributed data. Data that are not continuous can be handled
with these statistics. For samples in which it is difficult to maintain the assumption
of normal distribution, nonparametric statistics are used for analysis. Samples
with a small number of items are treated with nonparametric statistics because
normality cannot be assumed. They can be used with nominal data as well as
ordinal data. Some of the common nonparametric statistics are the chi-square
test, Spearman’s rank difference method of correlation, Kendall’s rank
difference method, the Mann-Whitney U test, etc.
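As an illustration of a distribution-free statistic, Spearman's rank difference method mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not a full implementation: it assumes there are no tied values, it uses the usual formula rho = 1 − 6Σd² / (n(n² − 1)), and the function names and the two judges' rankings are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rank difference correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    # d is the difference between the two ranks of the same item
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks given by two judges to five candidates
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
rho = spearman_rho(judge_a, judge_b)
print(round(rho, 2))  # 0.8
```

Only the order of the scores enters the computation, which is why no assumption about the shape of the population distribution is needed.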
Self Assessment Questions
1) State true/false for the following statements:
i) Parametric statistics is known as distribution free statistics. (T/F)
ii) Nonparametric tests assume normality of distribution. (T/F)
iii) T test is an example of parametric test. (T/F)
iv) Nonparametric tests are not bound to be used with interval scale. (T/F)
v) Parametric tests are bound to be used with either interval or ratio scale. (T/F)
vi) In case of small sample where normal distribution cannot be attained, the use of nonparametric test is more appropriate. (T/F)
2) Define the term sample and population with one example each.
1.4 SCOPE AND USE OF STATISTICS
Statistical applications have a wide scope. Some of the major ones are given
below:
Problem solving: Knowing the meaningful differences between two or more variables
enables an individual to find the most applicable solution to a problem situation,
and statistics makes this possible. During problem solving, statistics helps
the person analyse his/her pattern of responses and identify the correct solution, thereby
minimising the error factor.
Theoretical researches: Theories evolve on the basis of facts obtained from the
field. Statistical analyses establish the significance of those facts for a particular
paradigm or phenomenon. Researchers use statistical measures
to decide, from the facts and data, whether a particular theory can be maintained or
should be challenged. The significance of the relationships between facts and factors
helps them to explore the connections among them.
Statistics deals with aggregates of facts. It cannot deal with a single observation.
Thus statistical methods do not give any recognition to an object, a person or
an event in isolation. This is a serious limitation of statistics.
Statistical conclusions are true only on the average. Thus, statistical inferences
may not be considered exact in the way inferences based on mathematical laws are.
There are many other fields, such as agriculture, space, medicine, geology and technology,
where statistics is extensively used to predict results and to bring precision
to decisions.
Self Assessment Question
1) Write three applications of statistics in daily life.
2) List at least two misuses of statistics.
1.9 GLOSSARY
Statistics in singular sense : The scientific methods for collection, presentation, analysis and interpretation of data.
Statistics in plural sense : A set of numerical scores known as statistical data.
Correlational statistics : Statistics that express the direction (positive or negative) and magnitude of the relationship between two or more variables.
Descriptive statistics : Statistics that describe the central tendency or variability of the scores in a distribution.
Inferential statistics : Statistics that enable researchers to draw conclusions about a population or events on the basis of observed data.
Non parametric statistics : Statistics free from the assumption of normal distribution.
Parametric statistics : Statistics based on the assumption of normal distribution.
Statistics : The branch of mathematics that deals with inferring the chances of a particular pattern in a population or in events on the basis of observed patterns.
UNIT 2 DESCRIPTIVE STATISTICS
Structure
2.0 Introduction
2.1 Objectives
2.2 Meaning of Descriptive Statistics
2.3 Organising Data
2.3.1 Classification
2.3.2 Tabulation
2.3.3 Graphical Presentation of Data
2.3.4 Diagrammatical Presentation of Data
2.4 Summarising Data
2.4.1 Measures of Central Tendency
2.4.2 Measures of Dispersion
2.5 Use of Descriptive Statistics
2.6 Let Us Sum Up
2.7 Unit End Questions
2.8 Glossary
2.9 Suggested Readings
2.0 INTRODUCTION
We have learned in the previous unit that, from the point of view of function,
statistics may be descriptive, correlational or inferential. In this
unit we shall discuss the various aspects of descriptive statistics, particularly
how to organise and describe data.
2.1 OBJECTIVES
After going through this unit, you will be able to:
Define the nature and meaning of descriptive statistics;
Describe the methods of organising and condensing raw data;
Explain concept and meaning of different measures of central tendency; and
Analyse the meaning of different measures of dispersion.
2.2 MEANING OF DESCRIPTIVE STATISTICS
Let us take up a hypothetical example of two groups of students taking a problem
solving test. One group is taken as experimental group in that the subjects in this
group are given training in problem solving while the other group subjects do
not get any training. Both were tested on problem solving and the scores they
obtained were as given in the table below.
2.3.1 Classification
Classification is a summary of the frequency of individual scores or ranges
of scores for a variable. In the simplest form of a distribution, we have each
value of the variable together with the number of persons who obtained that value.
Once data are collected, researchers have to arrange them in a format from which
they can draw some conclusions.
The arrangement of data in groups according to similarities is known as
classification. By classifying data, investigators move a step beyond
the raw scores and towards a concrete decision. Classification is done with the
following objectives:
Presenting data in a condensed form
Explaining the affinities and diversities of the data
Facilitating comparisons
Classification may be qualitative or quantitative.
Frequency distribution
A much clearer picture of the information in the scores emerges when the raw data are
organised as a frequency distribution. A frequency distribution shows the number
of cases falling within a given class interval or range of scores. It is a table
that shows each score obtained by a group of individuals
and how frequently each score occurred.
A frequency distribution can be constructed with ungrouped data or grouped data.
i) Ungrouped frequency distribution: An ungrouped frequency distribution may be
constructed by listing all score values, either from highest to lowest or from
lowest to highest, and placing a tally mark (/) beside each score every time it
occurs. The frequency of occurrence of each score is denoted by ‘f’.
ii) Grouped frequency distribution: If there is a wide range of score values in
the data, it is difficult to get a clear picture of such a series of data. In this
case a grouped frequency distribution should be constructed. A grouped frequency
distribution is a table that organises data into classes, that is, into groups of
values describing one characteristic of the data. It shows the number of
observations from the data set that fall into each of the classes.
Construction of Frequency Distribution
Before proceeding, we need to know a few terms used in the discussion that follows,
for instance, a variable. A variable refers to the phenomenon under study. It
may be the performance of students on a problem solving task, or it may be a
method of teaching students that could affect their performance.
Here performance is one variable, which is being studied, and the method of
teaching is another variable, which is being manipulated. Variables are of two
kinds:
i) Continuous variable
ii) Discrete variable
Variables which can take all possible values within a specified range
are termed continuous variables. Examples are age (it can be measured in years,
months, days, hours, minutes, seconds, etc.), weight (in lbs), height (in cm), etc.
On the other hand, variables which cannot take all possible values within
the specified range are termed discrete variables. Examples are the number
of children, marks obtained in an examination (out of 200), etc.
Preparation of Frequency Distribution
To prepare a frequency distribution, we first determine the range of the given data,
that is, the difference between the highest and lowest scores. Before constructing
any grouped frequency distribution, it is important to decide the following:
1) The number of class intervals: There are no hard and fast rules regarding
the number of classes into which data should be grouped. If there are very
few scores, it is useless to have a large number of class intervals. Ordinarily,
the number of classes should be between 5 and 30.
2) Limits of each class interval: Another factor used in determining the number
of classes is the size, width or range of the class, which is known as the ‘class
interval’ and is denoted by ‘i’.
Class intervals should be of uniform width, resulting in classes of the same size in the
frequency distribution. The width of the class should be a whole number,
conveniently divisible by 2, 3, 5, 10 or 20.
There are three methods for describing the class limits for distribution:
i) Exclusive method
ii) Inclusive method
iii) True or actual class method
i) Exclusive method: In this method, the classes are so
formed that the upper limit of one class also becomes the lower limit of the
next class. The exclusive method ensures continuity between
two successive classes. In this classification, a score equal
to the upper limit of a class is excluded from that class, i.e., a score of 40 will be included
in the class 40 to 50 and not in the class 30 to 40.
ii) Inclusive method: In this method, a class includes scores which are
equal to its upper limit. The inclusive method is preferred when
measurements are given in whole numbers.
iii) True or actual class method: In the inclusive method, the upper limit of a class is not
equal to the lower limit of the next class. Therefore, there is no continuity
between the classes.
However, many statistical measures require continuous classes. To obtain
continuous classes, it is assumed that an observation or score does not represent just
a point on a continuous scale but an interval of unit length, of which the given score
is the middle point.
Thus, mathematically, a score is an interval that extends from 0.5 units below to
0.5 units above the face value of the score on a continuum. Class limits extended
in this way are known as true or actual class limits.
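The exclusive method and the true class limits described above can be sketched in Python as follows. This is only an illustrative sketch: the function names, the scores, and the choice of starting point and class width are assumptions made for the example.

```python
def grouped_frequency(scores, start, width, n_classes):
    """Count scores per class using the exclusive method: lower <= score < upper."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        # A score equal to the upper limit is left for the next class
        count = sum(1 for s in scores if lower <= s < upper)
        table.append((lower, upper, count))
    return table


def true_limits(lower, upper):
    """True limits of an inclusive whole-number class, e.g. 30-39 -> 29.5-39.5."""
    return lower - 0.5, upper + 0.5


# Hypothetical whole-number test scores
scores = [31, 40, 45, 38, 50, 59, 42, 30, 49, 55]
for lower, upper, count in grouped_frequency(scores, start=30, width=10, n_classes=3):
    print(f"{lower}-{upper}: {count}")

print(true_limits(30, 39))  # (29.5, 39.5)
```

Note that the score of 40 is counted in the class 40-50 and not in 30-40, as the exclusive method requires.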
Types of frequency distributions: There are various ways to arrange the frequencies
of a data array, depending on the requirements of the statistical analysis or the study.
Some of them are discussed below:
Relative frequency distribution: A relative frequency distribution is a distribution
that indicates the proportion of the total number of cases observed at each score
value or interval of score values.
Cumulative frequency distribution: Sometimes an investigator is interested in knowing
the number of observations less than a particular value. This is possible by
computing the cumulative frequency. The cumulative frequency corresponding to
a class interval is the sum of the frequencies for that class and for all classes prior to
that class.
Cumulative relative frequency distribution: A cumulative relative frequency
distribution is one in which the entry for any score or class interval expresses that
score’s cumulative frequency as a proportion of the total number of cases. Given
below are the ability scores of 20 students.
10, 14, 14, 13, 16, 17, 18, 20, 22, 23, 23, 24, 25, 18, 12, 13, 14, 16, 19, 20
Let us see how the above scores could be formed into a frequency distribution.
Score    Frequency    Cum. Freq.    Rel. Cum. Freq.
10       1            1             1/20
12       1            2             2/20
13       2            4             4/20
14       3            7             7/20
16       2            9             9/20
17       1            10            10/20
18       2            12            12/20
19       1            13            13/20
20       2            15            15/20
22       1            16            16/20
23       2            18            18/20
24       1            19            19/20
25       1            20            20/20
Total    20
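The same table can be reproduced with a short Python sketch. The scores are the 20 ability scores given above; the function name and the output layout are illustrative assumptions.

```python
from collections import Counter

scores = [10, 14, 14, 13, 16, 17, 18, 20, 22, 23,
          23, 24, 25, 18, 12, 13, 14, 16, 19, 20]


def frequency_table(data):
    """Return rows of (score, frequency, cumulative freq., relative cumulative freq.)."""
    counts = Counter(data)
    total = len(data)
    rows, running = [], 0
    for score in sorted(counts):
        running += counts[score]  # add this score's frequency to the running total
        rows.append((score, counts[score], running, running / total))
    return rows


for score, f, cum_f, rel_cum_f in frequency_table(scores):
    print(f"{score:>5} {f:>3} {cum_f:>4} {rel_cum_f:>6.2f}")
```

The last cumulative frequency always equals the total number of cases, which is a quick check on the tallying.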
2.3.2 Tabulation
Tabulation is the process of presenting classified data in the form of a table.
A tabular presentation makes data more intelligible and fit for further
statistical analysis. A table is a systematic arrangement of classified data in rows
and columns with appropriate headings and sub-headings.
Components of a Statistical Table
The main components of a table are :
Table number, Title of the table, Caption, Stub, Body of the table, Head note,
Footnote, and Source of data
                         TITLE
Stub Head     | Caption
Stub Entries  | Column Head I        | Column Head II
              | Sub Head | Sub Head  | Sub Head | Sub Head
              | MAIN BODY OF THE TABLE
Total         |
Footnote(s) :
Source :
Self Assessment Questions
1) Statistical techniques that summarise, organise and simplify data are
called:
i) Population statistics ii) Sample statistics
iii) Descriptive statistics iv) Inferential statistics
2) Which one of the following alternatives is an example of descriptive statistics?
i) In a sample of school children, the investigator found that the average
weight was 35 kg.
ii) The instructor calculated the class average on the final exam; it was
76%.
iii) On the basis of marks in the first term exam, a teacher predicted that
Ramesh would pass the final examination.
iv) Both (i) and (ii)
3) Which one of the following statements is appropriate regarding the objective/s
of classification?
i) Presenting data in a condensed form
ii) Explaining the affinities and diversities of the data
iii) Facilitating comparisons
iv) All of these
4) Define the following terms
i) Discrete variable
ii) Continuous variable
iii) Ungrouped frequency distribution
iv) Grouped frequency distribution.
A graph is drawn on two mutually perpendicular lines called the X and Y axes,
on which appropriate scales are indicated.
The horizontal line is called the abscissa and the vertical line the ordinate. Just as
there are different kinds of frequency distributions, there are many kinds of graphs
too, which enhance the reader’s scientific understanding. The most commonly used
among these are bar graphs, line graphs, pie charts, pictographs, etc. Here we will
discuss some of the important types of graphical patterns used in statistics.
Now label the class intervals on the abscissa, stating the exact limits or midpoints of
the class intervals. You can also add one extra class interval with zero frequency on
each side of the class-interval range.
The size of the small squares used on the graph paper depends upon the
number of classes to be plotted.
The next step is to plot the frequencies on the ordinate, using the most convenient
scale of small squares for the range of the whole distribution.
To plot a frequency polygon, mark each frequency against its
class at the corresponding height on the ordinate.
After marking all the frequencies, draw a line joining the points. This is the
polygon. A polygon is a multi-sided figure, and various considerations have to be
kept in mind to get a smooth polygon when N is small or the frequency
distribution is irregular.
Frequency Curve: A frequency curve is a smooth freehand curve drawn through the
frequency polygon. The objective of smoothing the frequency polygon is to
eliminate, as far as possible, the random or erratic fluctuations that are present in
the data.
Bar Diagram: This is also known as a dimensional diagram. The bar diagram is most
useful for categorical data. A bar is defined as a thick line. A bar diagram is drawn
from the frequency distribution table, representing the variable on the horizontal
axis and the frequency on the vertical axis. The height of each bar
corresponds to the frequency or value of the variable.
Multiple Bar Diagram: This diagram is used when comparisons are to be shown
between two or more sets of interrelated phenomena or variables. A set of bars
for a person, place or related phenomenon is drawn side by side without any gap.
To distinguish between the different bars in a set, different colours or shades are
used.
After the calculation of the angles for each component, segments are drawn in
the circle in succession corresponding to the angles at the center for each segment.
Different segments are shaded with different colour, shades or numbers.
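The angle calculation referred to above can be sketched in Python. The sketch assumes the conventional rule that each component's central angle is its share of the total multiplied by 360 degrees; the function name and the component values are hypothetical.

```python
def pie_angles(values):
    """Convert component values into central angles (degrees) that sum to 360."""
    total = sum(values)
    if total <= 0:
        raise ValueError("component values must have a positive total")
    return [value * 360 / total for value in values]


# Hypothetical monthly expenditure by component
components = {"Food": 50, "Rent": 30, "Other": 20}
angles = pie_angles(list(components.values()))
for name, angle in zip(components, angles):
    print(f"{name}: {angle:.0f} degrees")
```

With these figures the segments are 180, 108 and 72 degrees, which together complete the circle.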
In Statistics there are three most commonly used measures of central tendency.
These are:
1) Mean,
2) Median, and
3) Mode.
1) Mean: The arithmetic mean is the most popular and widely used measure of
central tendency. Whenever we refer to the average of data, we are
talking about its arithmetic mean. It is obtained by dividing the sum of
the values of the variable by the number of values.
Merits and limitations of the arithmetic mean: The first advantage
of the arithmetic mean is its universality, i.e., it exists for every data set. The
arithmetic mean is clearly defined, and a data set has only one. It is
also a useful measure for further statistics and for comparisons among different
data sets. One of the major limitations of the arithmetic mean is that it cannot
be computed for open-ended class intervals.
2) Median: The median is the middlemost value in a data distribution. It divides
the distribution into two equal parts, so that one half of the
observations lie below and one half lie above that point. Since the median
denotes the position of an observation in an array, it is also called a positional
average. More technically, the median of an array of numbers arranged in
order of magnitude is either the middle value or the arithmetic mean of
the two middle values. For example, the set of numbers 2, 3, 5, 7, 9, 12, 15
has the median 7.
Thus, for ungrouped data arranged in order of magnitude, the median is the
$\left(\frac{n+1}{2}\right)^{\text{th}}$ value, where n denotes the number of given observations.
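The definitions of the mean and the median for ungrouped data can be checked with a small Python sketch. The function names are illustrative; the data are the seven numbers used in the example above.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n + 1) / 2)th value when n is odd
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [2, 3, 5, 7, 9, 12, 15]
print(median(data))          # 7
print(round(mean(data), 2))  # 7.57
```

Here n = 7, so the median is the (7 + 1)/2 = 4th value in order of magnitude, which is 7.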
Quartile Deviation
Quartile deviation is denoted as Q. It is also known as the semi-interquartile range. It
avoids the problems associated with the range. The inter-quartile range covers only the
middle 50% of the distribution: it is the difference between the 75th
and 25th percentiles, and the quartile deviation is half of this difference. The 75th
percentile is the score which has 75% of the scores below it, and the 25th percentile
is the score which has 25% of the scores below it.
Merits and limitations: QD is a simple measure of dispersion. When the
median is taken as the measure of central tendency, QD is the most relevant measure
of the dispersion of the distribution. In comparison to the range, QD is more useful because
the range speaks only about the highest and lowest scores, while QD speaks about the
middle 50% of the scores of a distribution. As only the middle 50% of scores are used in QD,
extreme scores have no effect on its computation, giving more reliable results.
In the case of an open-ended distribution, QD is more reliable than other
measures of dispersion. It is not recommended to use QD in further mathematical
computations. It is not a completely reliable measure of dispersion, as it does not
include all the scores. As QD is based on only 50% of the scores, it is not useful
in each and every statistical situation.
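A minimal Python sketch of the computation follows. It takes the quartile deviation as half the inter-quartile range, Q = (Q3 − Q1)/2, which is the usual textbook convention, and it relies on the default quartile method of the standard `statistics.quantiles` function; other quartile conventions can give slightly different values. The function name and the data are hypothetical.

```python
import statistics


def quartile_deviation(values):
    """Half the distance between the 75th and 25th percentiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    inter_quartile_range = q3 - q1
    return inter_quartile_range / 2


# Hypothetical scores
data = [1, 2, 3, 4, 5, 6, 7]
print(quartile_deviation(data))  # 2.0

# Replacing the highest score with an extreme value leaves QD unchanged
print(quartile_deviation([1, 2, 3, 4, 5, 6, 70]))  # 2.0
```

The second call illustrates the point made above: an extreme score changes the range but not the quartile deviation.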
Properties of SD
If all the scores in a sample have an identical value, the SD will be 0 (zero).
In different samples drawn from the same population, SDs differ much less
than the other measures of dispersion.
For a symmetrical or normal distribution, the following relationships are true:
Mean ± 1 SD covers 68.27% of cases
Mean ± 2 SD covers 95.45% of cases
Mean ± 3 SD covers 99.73% of cases
Merits: It is based on all observations. It is amenable to further mathematical
treatments. Of all measures of dispersion, standard deviation is least affected by
fluctuation of sampling.
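Two of these properties can be verified with Python's standard `statistics` module. This is only a sketch: the identical scores are hypothetical, and the percentages are computed from the exact normal curve, so they may differ in the last digit from figures read off printed tables.

```python
from statistics import NormalDist, pstdev

# Identical scores have no spread, so the SD is zero
print(pstdev([7, 7, 7, 7]))  # 0.0


def coverage(k):
    """Proportion of a normal distribution lying within mean ± k SD."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(f"Mean ± {k} SD covers {coverage(k) * 100:.2f}% of cases")
```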
Skewness and Kurtosis
There are two other important characteristics of frequency distribution that provide
useful information about its nature. They are known as skewness and Kurtosis.
Skewness: Skewness is the degree of asymmetry of a distribution. In some
frequency distributions, scores are more concentrated at one end of the scale.
Such a distribution is called a skewed distribution.
Thus, skewness refers to the extent to which a distribution of data points is
concentrated at one end or the other. Skewness and variability are usually related:
the greater the skewness, the greater the variability.
Skewness has both direction and magnitude. In actual practice, frequency
distributions are rarely symmetrical; rather, they show varying degrees of
asymmetry or skewness.
In a perfectly symmetrical distribution, the mean, median and mode coincide,
whereas this is not the case in a distribution that is asymmetrical or skewed. If
the frequency curve of a distribution has a longer tail to the right,
the distribution is said to be positively skewed (Fig. 2.1).
If the curve has a longer tail towards the left, that is, towards the origin, it is
said to be negatively skewed (Fig. 2.2).
There are two measures of skewness, one based on the SD and one based on
percentiles. There are different ways to compute the skewness of a frequency distribution.
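As one possible SD-based measure, the sketch below uses Pearson's coefficient, 3 × (mean − median) / SD. The choice of this particular formula is an assumption made for illustration, since the text does not state which formula it intends; the function name and the data sets are hypothetical.

```python
import statistics


def pearson_skewness(values):
    """Pearson's coefficient of skewness: 3 * (mean - median) / SD."""
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("skewness is undefined when all scores are identical")
    return 3 * (statistics.mean(values) - statistics.median(values)) / sd


right_tailed = [1, 2, 2, 3, 3, 3, 10]   # long tail towards high scores
left_tailed = [1, 8, 8, 8, 9, 9, 10]    # long tail towards low scores
symmetric = [1, 2, 3, 4, 5]

print(pearson_skewness(right_tailed) > 0)  # True: positively skewed
print(pearson_skewness(left_tailed) < 0)   # True: negatively skewed
print(pearson_skewness(symmetric))         # 0.0
```

The sign of the coefficient gives the direction of skewness and its size gives the magnitude; for a symmetrical distribution the mean and median coincide and the coefficient is zero.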
Kurtosis in the Curves
2.8 GLOSSARY
Abscissa : X axis
Array : A rough grouping of data.
Classification : A systematic grouping of data.
Cumulative frequency distribution : A classification which shows the cumulative frequency below the upper real limit of the corresponding class interval.
Data : Any sort of information that can be analysed.
Discrete data : Data obtained by counting the cases in a classification.
Exclusive classification : The classification system in which the upper limit of a class becomes the lower limit of the next class.
Frequency distribution : Arrangement of data values according to their magnitude.
Inclusive classification : The classification system in which the upper limit of a class differs from the lower limit of the next class.
Secondary data : Information gathered through already maintained records about a variable.
Mean : The sum of the scores divided by the number of scores.
Median : The mid point of a score distribution.
Mode : The most frequently occurring score in a score distribution.
Central Tendency : The tendency of scores to cluster towards the centre of a distribution.
Arithmetic mean : Mean for stable scores.
Dispersion : The extent to which scores tend to scatter from their mean and from each other.
Standard Deviation : The square root of the mean of the squared deviations of scores from their mean.
Skewness : Tendency of scores to concentrate towards one end of the abscissa.
Kurtosis : Curvedness of a frequency distribution graph.
Platykurtic : Curvedness with a flat tendency towards the abscissa.
Mesokurtic : Curvedness of a normal distribution of scores.
Leptokurtic : Curvedness with a peaked tendency away from the abscissa.
Range : Difference between the two extremes of a score distribution.
2.9 SUGGESTED READINGS
Asthana, H. S., and Bhushan, B. (2007). Statistics for Social Sciences (with
SPSS Application). Prentice Hall of India, New Delhi.
Yule, G. U., and Kendall, M. G. (1991). An Introduction to the Theory of Statistics.
Universal Books, Delhi.
Nagar, A. L., and Das, R. K. (1983). Basic Statistics. Oxford University Press,
Delhi.
UNIT 3 INFERENTIAL STATISTICS
Structure
3.0 Introduction
3.1 Objectives
3.2 Concept and Meaning of Inferential Statistics
3.3 Inferential Procedures
3.3.1 Estimation
3.3.2 Point Estimation
3.3.3 Interval Estimation
3.4 Hypothesis Testing
3.4.1 Statement of Hypothesis
3.4.2 Level of Significance
3.4.3 One-Tail Test and Two-Tail Test
3.4.4 Errors in Hypothesis Testing
3.4.5 Power of a Test
3.5 General Procedure for Testing Hypothesis
3.5.1 Test of Hypothesis about a Population Mean
3.5.2 Testing Hypothesis about a Population Mean (Small Sample)
3.6 ‘t’ Test for Significance of Difference between Means
3.6.1 Assumption for ‘t’ Test
3.6.2 ‘t’ test for Independent Sample
3.6.3 ‘t’ Test for Paired Observation by Difference Method
3.7 Let Us Sum Up
3.8 Unit End Question
3.9 Glossary
3.10 Suggested Readings
3.0 INTRODUCTION
Before conducting any study, the investigator must decide whether he/she
will depend on census details or sample details. On the basis of the information
contained in the sample, we try to draw conclusions about the population. This
process is known as statistical inference. Statistical inference is widely applicable
in the behavioural sciences, especially in psychology. For example, before a Lok
Sabha or Vidhan Sabha election begins, or just before the declaration of
election results, print and electronic media conduct opinion polls or exit polls to
predict the election result. In this process not all voters are included in the survey;
only a portion of the voters, i.e. a sample, is included in order to draw inferences
about the population. This is called inferential statistics, and the present unit
deals with it in detail.
3.1 OBJECTIVES
After going through this unit, you will be able to :
define inferential statistics;
state the concept of estimation;
distinguish between point estimation and interval estimation; and
3.3.1 Estimation
In estimation a sample is drawn and studied and inference is made about the
population characteristics on the basis of what is discovered about the sample.
There may be sampling variations because of chance fluctuations, variations in
sampling techniques, and other sampling errors. We, therefore, do not expect
our estimate of the population characteristics to be exactly correct. We do, however,
expect it to be close. The real question in estimation is not whether our estimate
is correct or not but how close it is to the true value.
Our first interest is in using the sample mean (X̄ ) to estimate the population
mean (µ).
Characteristics of X̄ as an estimate of µ
The sample mean (X̄) is often used to estimate a population mean (µ). For
example, the sample mean of 45.0 from the Academic Anxiety Test may be used
to estimate the mean Academic Anxiety of the population of college students. Using
this sample would lead to an estimate of 45.0 for the population mean. The
sample mean is an unbiased and consistent estimator of the population mean.
Unbiased Estimator: An unbiased estimator is one for which, if we were to obtain
an infinite number of random samples of a certain size, the mean of the statistic
would be equal to the parameter. The sample mean (X̄) is an unbiased estimate
of µ because, if we look at all possible random samples of size N from a population,
the mean of the sample means would be equal to µ.
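The idea of unbiasedness can be checked on a very small case. The sketch below is only an illustration: the five-score population and the function name are assumptions, not data from this unit. It lists every possible sample of a given size (drawn without replacement), computes each sample mean, and averages those means.

```python
from itertools import combinations
from statistics import mean


def mean_of_sample_means(population, n):
    """Average the sample mean over every possible sample of size n."""
    sample_means = [mean(sample) for sample in combinations(population, n)]
    return mean(sample_means)


# Hypothetical population of five scores (an assumption for illustration).
population = [2, 4, 6, 8, 10]
mu = mean(population)

# Compare the population mean with the mean of all sample means.
print("population mean:", mu)
print("mean of sample means (n = 2):", mean_of_sample_means(population, 2))
```

Any single sample mean may be above or below µ, but taken over all possible samples the sample means average out to µ, which is what "unbiased" means here.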
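[Figure: one-tail test; the rejection region lies beyond the critical Z score of +1.645, and the rest of the distribution is the acceptance region]
[Figure: two-tail test; rejection regions lie in both tails, with the acceptance region between them]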
For a two-tailed test with the p chosen to be .05, each tail of the H0 distribution
ends with a rejection or critical region of area .025 extending beyond the critical
Z score of 1.96 in the tail. If the computed Z score lies at or beyond –1.96 or +1.96,
then the observed difference falls within the rejection region and consequently
the null hypothesis is rejected. But in a one-tail test with chosen p = .05, if the
computed Z score is equal to or greater than 1.645 then the observed difference
falls within the rejection region. Hence, the null hypothesis is rejected. It is clear
that with an identical p, an observed difference may be significant in a one-tail
test though it may fail to be significant in a two-tail test.
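The comparison between the two kinds of test can be sketched in a few lines. This is a minimal illustration, assuming a standard normal (Z) distribution and an upper-tail alternative for the one-tail case; the function names and the example Z score of 1.80 are hypothetical.

```python
from statistics import NormalDist


def critical_z(alpha, two_tailed):
    """Critical Z score that cuts off the rejection region(s)."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


def rejects_null(z, alpha=0.05, two_tailed=True):
    """True when the computed Z score falls in the rejection region."""
    cutoff = critical_z(alpha, two_tailed)
    if two_tailed:
        return abs(z) >= cutoff
    return z >= cutoff  # assumption: the alternative points to the upper tail


z = 1.80  # hypothetical computed Z score
print("one-tail critical Z:", round(critical_z(0.05, two_tailed=False), 3))
print("two-tail critical Z:", round(critical_z(0.05, two_tailed=True), 2))
print("significant in one-tail test:", rejects_null(z, two_tailed=False))
print("significant in two-tail test:", rejects_null(z, two_tailed=True))
```

With p = .05 the critical values come out at about 1.645 and 1.96, so a Z score of 1.80 is significant in the one-tail test but not in the two-tail test.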
Type I error: When the null hypothesis is true, a decision to reject it is an error,
and this kind of error is known as type I error in statistics. The probability of
making a type I error is denoted as ‘α’ (read as alpha). The null hypothesis is
rejected if the probability ‘p’ of the observed result occurring by chance does not
exceed the chosen level of significance. The higher the chosen level of p for
rejecting the null hypothesis, the greater is the probability of type I error.
iv) ......................... is that probability of chance of occurrence of
observed results.
v) Level of significance is denoted by .................................................
vi) When the null hypothesis is true, a decision to reject is known as
......................................................................................
vii) When a null hypothesis is false, a decision to accept is known as
...........................................................................
3.9 GLOSSARY
Confidence Level : It gives the percentage (probability) of samples
where the population mean would remain within
the confidence interval around the sample mean.
Estimation : It is a method of prediction about a parameter value
on the basis of a statistic.
Hypothesis testing : The statistical procedures for testing hypotheses.
Independent sample : Samples in which the subjects in the groups are
different individuals and not deliberately
matched on any relevant characteristics.
Level of significance : The probability value that forms the boundary
between rejecting and not rejecting the null
hypothesis.
Null hypothesis : The hypothesis that is tentatively held to be true
(symbolized by H0).
One-tail test : A statistical test in which the alternative
hypothesis specifies direction of the departure
from what is expected under the null hypothesis.
Parameter : It is a measure of some characteristic of the
population.
Population : The entire number of units of research interest.
Power of a test : An index that reflects the probability that a
statistical test will correctly reject the null
hypothesis relative to the size of the sample
involved.
Sample : A sub set of the population under study
Statistical Inference : It is the process of drawing conclusions about an unknown
population from a known sample drawn from it.
Statistical hypothesis : The hypothesis which may or may not be true
about the population parameter.
t-test : It is a parametric test for the significance of
differences between means.
Type I error : A decision error in which the statistical decision
is to reject the null hypothesis when it is actually
true.
Type II error : A decision error in which the statistical decision
is not to reject the null hypothesis when it is
actually false.
Two-tail test : A statistical test in which the alternative
hypothesis does not specify the direction of
departure from what is expected under the null
hypothesis.
3.10 SUGGESTED READINGS
Yule, G. U., and Kendall, M. G. (1991). An Introduction to the Theory of Statistics.
Universal Books, Delhi.
Nagar, A. L., and Das, R. K. (1983). Basic Statistics. Oxford University Press,
Delhi.
Sani, F., and Todman, J. (2006). Experimental Design and Statistics for
Psychology: A First Course. Blackwell Publishing.
UNIT 4 FREQUENCY DISTRIBUTION AND
GRAPHICAL PRESENTATION
Structure
4.0 Introduction
4.1 Objectives
4.2 Arrangement of Data
4.2.1 Simple Array
4.2.2 Discrete Frequency Distribution
4.2.3 Grouped Frequency Distribution
4.2.4 Types of Grouped Frequency Distributions
4.3 Tabulation of Data
4.3.1 Components of a Statistical Table
4.3.2 General Rules for Preparing Table
4.3.3 Importance of Tabulation
4.4 Graphical Presentation of Data
4.4.1 Histogram
4.4.2 Frequency Polygon
4.4.3 Frequency Curves
4.4.4 Cumulative Frequency Curves or Ogives
4.4.5 Misuse of Graphical Presentations
4.5 Diagrammatic Presentation of Data
4.5.1 Bar Diagram
4.5.2 Sub-divided Bar Diagram
4.5.3 Multiple Bar Diagram
4.5.4 Pie Diagram
4.5.5 Pictograms
4.6 Let Us Sum Up
4.7 Unit End Questions
4.8 Glossary
4.9 Suggested Readings
4.0 INTRODUCTION
Data collected from either primary or secondary sources need to be systematically
presented, as these are invariably in an unsystematic or rudimentary form. Such raw
data fail to reveal any meaningful information. The data should be rearranged
and classified in a suitable manner to understand the trend and message of the
collected information. This unit, therefore, deals with the method of getting the
data organised in all respects, in a tabular form or in graphical presentation.
4.1 OBJECTIVES
After going through this Unit, you will be able to:
Explain the methods of organising and condensing statistical data;
Define the concept of frequency distribution and state its various types;
Analyse the different methods of presenting statistical data;
Explain how to draw tables, graphs, diagrams, pictograms, etc.; and
Describe the uses and misuses of graphical techniques.
On the other hand, those variables which cannot take all the possible values
within the given specified range are termed discrete variables. Examples are the
number of children, marks obtained in an examination (out of 200), etc.
2) What would be the limits of each class interval? Another factor used in
determining the number of classes is the size, width or range of the class,
which is known as the ‘class interval’ and is denoted by ‘i’. Class intervals should
be of uniform width, resulting in same-size classes in the frequency
distribution. The width of the class should be a whole number and
conveniently divisible by 2, 3, 5, 10, or 20.
The width of a class interval (i) = [Largest observation (OL) – Smallest observation (OS)] / Number of class intervals
To apply this formula, the range of scores is first determined by
subtracting the lowest value from the highest value of the data array.
Now, the next step is to decide from where the classes should start. There are
three methods for describing the class limits of a distribution:
Exclusive method
Inclusive method
True or actual class method
Exclusive method: In this method of class formation, the classes are so formed
that the upper limit of one class also becomes the lower limit of the next class.
Exclusive method of classification ensures continuity between two successive
classes. In this classification, a score equal to the upper limit of a class is
excluded from that class, i.e., a score of 40 will be included in the class of 40 to 50
and not in the class of 30 to 40.
Finally we count the number of scores falling in each class and record the
appropriate number in frequency column. The number of scores falling in each
class is termed as class frequency. Tally bar is used to count these frequencies.
Example: Scores of 30 students are given below. Prepare the frequency
distribution by using exclusive method of classification.
3, 30, 14, 30, 27, 11, 25, 16, 18, 33, 49, 35, 18, 10, 25, 20, 14, 18, 9, 39, 14, 29,
20, 25, 29, 15, 22, 20, 29, 29
The above ungrouped data do not provide any useful information about the
observations; rather, they are difficult to understand.
Solution:
Step 1: First of all arrange the raw scores in ascending order of their magnitude.
3,9,10,11,14,14,14,15,16,18,18,18,20,20,20,22,25,25,25,27,29,29,29,29,30,30,33,35,39,49
Step 2: Determine the range of scores by adding 1 to the difference between the
largest and smallest scores in the data array. For the above array of data it
is (49 – 3) + 1 = 47.
Step 3: Decide the number of classes, say 5 for the present array of data.
Step 4: To decide the approximate size of the class interval, divide the range by
the decided number of classes (5 for this example). If the quotient is a
fraction, accept the next integer. For example, 47/5 = 9.4; take it as
10.
Step 5: Find the lower class-limit of the lowest class interval and add the width
of the class interval to get the upper class-limit (e.g. 3 – 12).
Step 6: Find the class-limits for the remaining classes: (13-22), (23-32),
(33-42), (43-52).
Step 7: Pick up each item from the data array and put a tally mark (I) against
the class to which it belongs. Tallies are marked in bunches of five: four
vertical marks and a fifth drawn across the first four. Count the
number of observations, i.e., the frequency, in each class (an example is
given).
Table 4.3: Preparing class intervals by marking tallies for the data
frequencies in the exclusive method.
Class Interval   Tallies         Frequency
40-50            I               1
30-40            IIII            5
20-30            IIII IIII II    12
10-20            IIII IIII       10
0-10             II              2
Total                            30
Note: The tallying of the observations in a frequency distribution may be checked
for omitted or duplicated entries: the sum of the frequencies should
equal the total number of scores in the array.
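The steps of the exclusive method can also be sketched in code as a check on hand tallying. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the classes are assumed to start at 0, as in Table 4.3. A score equal to an upper limit is counted in the next class.

```python
import math

# The 30 scores from the example above.
scores = [3, 30, 14, 30, 27, 11, 25, 16, 18, 33, 49, 35, 18, 10, 25,
          20, 14, 18, 9, 39, 14, 29, 20, 25, 29, 15, 22, 20, 29, 29]


def class_width(values, n_classes):
    """Steps 2 and 4: range = (largest - smallest) + 1, rounded-up width."""
    value_range = max(values) - min(values) + 1
    return math.ceil(value_range / n_classes)


def exclusive_frequencies(values, start, width, n_classes):
    """Count scores with lower <= score < upper for each class."""
    table = {}
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        table[(lower, upper)] = sum(lower <= v < upper for v in values)
    return table


width = class_width(scores, n_classes=5)
table = exclusive_frequencies(scores, start=0, width=width, n_classes=5)
for (lower, upper), frequency in table.items():
    print(f"{lower}-{upper}: {frequency}")
print("Total:", sum(table.values()))
```

The printed total should equal the number of scores (30), which is the check described in the note above.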
Inclusive method: In this method, a class includes the scores which are equal
to its upper limit. The inclusive method is preferred when measurements
are given in whole numbers. The above example may be presented in the following
form by using the inclusive method of classification (refer to the table below).
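As a rough illustration of inclusive counting, the sketch below uses the class limits quoted in Steps 5 and 6 above, (3-12) to (43-52), and counts a score in a class when it lies between the lower and upper limits, both included. The function name is hypothetical and the choice of these particular limits is an assumption made for the example.

```python
# The 30 scores from the example above.
scores = [3, 30, 14, 30, 27, 11, 25, 16, 18, 33, 49, 35, 18, 10, 25,
          20, 14, 18, 9, 39, 14, 29, 20, 25, 29, 15, 22, 20, 29, 29]

# Class limits taken from Steps 5 and 6 (assumed here for illustration).
classes = [(3, 12), (13, 22), (23, 32), (33, 42), (43, 52)]


def inclusive_frequencies(values, class_limits):
    """Count scores with lower <= score <= upper for each class."""
    return {
        (lower, upper): sum(lower <= v <= upper for v in values)
        for lower, upper in class_limits
    }


inclusive_table = inclusive_frequencies(scores, classes)
for (lower, upper), frequency in inclusive_table.items():
    print(f"{lower}-{upper}: {frequency}")
print("Total:", sum(inclusive_table.values()))
```

Because the upper limit of one class and the lower limit of the next differ by one, no whole-number score can fall between two classes or into both.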
Table number: When there is more than one table in a particular analysis, each
table should be marked with a number for reference and identification. The
number should be written in the centre at the top of the table.
Title of the table: Every table should have an appropriate title, which describes
the content of the table. The title should be clear, brief, and self-explanatory.
Title of the table should be placed either centrally on the top of the table or just
below or after the table number.
Caption: Captions are brief and self-explanatory headings for columns. Captions
may involve headings and sub-headings. The captions should be placed in the
middle of the columns. For example, we can divide students of a class into males
and females, rural and urban, high SES and Low SES etc.
Stub: Stubs stand for brief and self-explanatory headings for rows. A relatively
more important classification is given in rows. A stub consists of two parts: (i)
Stub head: it describes the nature of the stub entry; (ii) Stub entry: it is the description
of the row entries.
Body of the table: This is the real table and contains numerical information or
data in different cells. This arrangement of data remains according to the
description of captions and stubs.
Head note: This is written at the extreme right hand below the title and explains
the unit of the measurements used in the body of the tables.
Source of data: The source from which data have been taken is to be mentioned
at the end of the table. The reference to the source must be complete so that
potential readers who want to consult the original source may do so.
– Items in the table should be placed logically and related items should be
placed nearby.
– All items should be clearly stated.
– If an item is repeated in the table, its full form should be written.
– The unit of measurement should be explicitly mentioned, preferably in the
form of a head note.
– The rules for forming a table are presented diagrammatically in the table below.
TITLE
Stub Head Caption
Stub Entries Column Head I Column Head II
Sub Head Sub Head Sub Head Sub Head
It simplifies complex data: If data are presented in tabular form, these can be
readily understood. Confusions are avoided while going through the data for
further analysis or drawing the conclusions about the observation.
It facilitates comparison: Data in the statistical table are arranged in rows and
columns very systematically. Such an arrangement enables you to compare the
information in an easy and comprehensive manner.
Tabulation presents the data in true perspective: With the help of tabulation, the
repetitions can be dropped out and data can be presented in true perspective
highlighting the relevant information.
Figures can be worked-out more easily: Tabulation also facilitates further analysis
and finalization of figures for understanding the data.
Self Assessment Questions
1) What points are to be kept in mind while taking decision for preparing
a frequency distribution in respect of (a) the number of classes and (b)
width of class interval.
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
2) Differentiate between following pairs of statistical terms
i) Column and row entry
ii) Caption and stub head
iii) Head note and foot note
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
3) State briefly the importance of tabulation in statistical analysis.
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
...............................................................................................................
4.4.1 Histogram
It is one of the most popular methods for presenting a continuous frequency
distribution in the form of a graph. In this type of distribution the upper limit of a
class is the lower limit of the following class. The histogram consists of a series of
rectangles, each with its width equal to the class interval of the variable on the
horizontal axis and its height equal to the corresponding frequency on the vertical
axis. The steps in constructing a histogram are as follows:
Step 1: Construct a frequency distribution in table form.
Step 2: Before drawing axes, decide on a suitable scale for the horizontal axis,
then determine the number of squares (on the graph paper) required
for the width of the graph.
Step 3: Draw bars of equal width for each class interval. The height of a bar
corresponds to the frequency in that particular interval. The edge of a
bar represents both the upper real limit for one interval and the lower
real limit for the next higher interval.
Step 4: Identify class intervals along the horizontal axis by using either the real
limits or the midpoints of the class intervals. Real limits are
placed under the edges of each bar; midpoints are placed under the
middle of each bar.
Step 5: Label both axes and give an appropriate title to the histogram.
Table 4.8: Results of 200 students on an academic achievement test.
Class Interval   Frequency
10-20            12
20-30            10
30-40            35
40-50            55
50-60            45
60-70            25
70-80            18
Let us take a simple example to demonstrate the construction of a histogram based
on the above data.
Fig. 4.1: Histogram (horizontal axis: Achievement Scores, 10 to 80; vertical axis: Frequency, 0 to 60)
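Where graph paper is not at hand, the shape of the histogram for Table 4.8 can be previewed as text. The sketch below is only an approximation: the bars are drawn sideways, one row per class, and the function name and the scale of one symbol per 5 students are assumptions chosen for the example.

```python
# Class intervals and frequencies from Table 4.8.
class_frequencies = {
    (10, 20): 12, (20, 30): 10, (30, 40): 35, (40, 50): 55,
    (50, 60): 45, (60, 70): 25, (70, 80): 18,
}


def text_histogram(freqs, unit=5):
    """Return one line per class; each '#' stands for about `unit` students."""
    lines = []
    for (lower, upper), frequency in freqs.items():
        bar = "#" * round(frequency / unit)
        lines.append(f"{lower:>3}-{upper:<3} {bar} ({frequency})")
    return lines


for line in text_histogram(class_frequencies):
    print(line)
```

The longest bar belongs to the 40-50 class, matching the tallest rectangle in the histogram.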
4.4.2 Frequency Polygon
Prepare an abscissa originating from ‘O’ and ending at ‘X’. Then construct the
ordinate starting from ‘O’ and ending at ‘Y’. Now label the class intervals on the
abscissa, stating the exact limits or midpoints of the class intervals. It is also
customary to add one extra class with zero frequency at each end of the
class-interval range. The size of the unit of small squares on the graph paper depends
upon the number of classes to be plotted. The next step is to mark the frequencies on the
ordinate using the most convenient unit of small squares, depending
on the range of the whole distribution. To obtain an impressive visual figure it is
recommended to use a 3:4 ratio of ordinate to abscissa, though there is no
hard and fast rule in this regard. To plot a frequency polygon, mark each
frequency against its class at the height of the respective ordinate.
After marking all the frequencies, draw a line joining the points. This is the polygon. A
polygon is a multi-sided figure, and some adjustment may be needed
to get a smooth polygon in the case of a small N or an irregular frequency distribution.
The most common way is to compute the smoothed frequency of a class by
averaging the frequency of that class with the frequencies of the classes
immediately above and below it. For instance, the frequency 4 of class-interval 75-79
might be smoothed as (6 + 4 + 5)/3 = 5.
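The plotting points and the smoothing rule described above can be sketched as follows. The function names are hypothetical, the frequencies are those of Table 4.8, and one assumption is labelled in the code: for the first and last classes, the missing neighbouring class is treated as having zero frequency.

```python
def polygon_points(lower_limits, width, freqs):
    """Midpoint/frequency pairs, with a zero-frequency class at each end."""
    midpoints = [lower + width / 2 for lower in lower_limits]
    xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))


def smoothed(freqs):
    """Average each frequency with its two neighbours.

    Assumption: a class beyond either end counts as frequency 0.
    """
    padded = [0] + list(freqs) + [0]
    return [
        (padded[i - 1] + padded[i] + padded[i + 1]) / 3
        for i in range(1, len(padded) - 1)
    ]


lower_limits = [10, 20, 30, 40, 50, 60, 70]   # Table 4.8 classes
frequencies = [12, 10, 35, 55, 45, 25, 18]    # Table 4.8 frequencies

for x, y in polygon_points(lower_limits, 10, frequencies):
    print(x, y)
print([round(f, 2) for f in smoothed(frequencies)])
```

Joining the printed points in order with straight lines gives the polygon; plotting the smoothed values against the same midpoints gives the smoothed version.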
[Figure: Frequency polygon (horizontal axis: Achievement Scores, 0 to 90; vertical axis: Frequency, 0 to 60)]
4.4.4 Cumulative Frequency Curve or Ogive
The graph of a cumulative frequency distribution is known as a cumulative
frequency curve or ogive. Since there are two types of cumulative frequency
distribution, i.e., ‘less than’ and ‘more than’ cumulative frequencies, we can
have two types of ogives.
i) ‘Less than’ Ogive: In a ‘less than’ ogive, the less than cumulative frequencies
are plotted against the upper class boundaries of the respective classes. It is
an increasing curve, sloping upwards from left to right.
ii) ‘More than’ Ogive: In a ‘more than’ ogive, the more than cumulative frequencies
are plotted against the lower class boundaries of the respective classes. It is
a decreasing curve and slopes downwards from left to right.
Example of ‘less than’ and ‘more than’ cumulative frequencies based on the data
reported in Table 4.8
Class Interval   Frequency   Less than c.f.   More than c.f.
10-20            12          12               200
20-30            10          22               188
30-40            35          57               178
40-50            55          112              143
50-60            45          157              88
60-70            25          182              43
70-80            18          200              18
The ogives for the cumulative frequency distributions given in above table are
drawn in Fig. 4.3
Fig. 4.3: Ogives (horizontal axis: Achievement Score, 10 to 80; vertical axis: cumulative frequency)
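The two cumulative-frequency columns in the table above can be reproduced with running totals. This is a minimal sketch with hypothetical function names: the ‘less than’ column accumulates the frequencies from the lowest class upwards, and the ‘more than’ column accumulates them from the highest class downwards.

```python
from itertools import accumulate


def less_than_cf(freqs):
    """Running total from the lowest class to the highest."""
    return list(accumulate(freqs))


def more_than_cf(freqs):
    """Running total from the highest class down to the lowest."""
    return list(accumulate(reversed(freqs)))[::-1]


lower_bounds = [10, 20, 30, 40, 50, 60, 70]
upper_bounds = [20, 30, 40, 50, 60, 70, 80]
frequencies = [12, 10, 35, 55, 45, 25, 18]

# Points to plot: 'less than' against upper boundaries,
# 'more than' against lower boundaries.
less_than_points = list(zip(upper_bounds, less_than_cf(frequencies)))
more_than_points = list(zip(lower_bounds, more_than_cf(frequencies)))
print(less_than_points)
print(more_than_points)
```

The first list rises to the total of 200 and the second falls from 200, which is why one ogive slopes upwards and the other downwards.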
A sub-divided bar diagram for the hypothetical data given in above Table 4.9 is
drawn in Fig. 4.5
Fig. 4.5: Subdivided Bar diagram (labels: Frequency, Metropolitan City, Psychological Parameters)
Degree of any component part = (Component Value / Total Value) × 360°
After the calculation of the angles for each component, segments are drawn in
the circle in succession corresponding to the angles at the center for each segment.
Different segments are marked with different colours, shades or numbers.
Table 4.11: Placement of 1000 software engineers who passed out from
institute X in four different companies in 2009.
Company Placement
A 400
B 200
C 300
D 100
Pie Diagram Representing Placement in Four Different Companies.
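The angle formula can be applied to the placement figures of Table 4.11 as a worked example. The sketch below is illustrative only; the function name is hypothetical.

```python
def pie_angles(values):
    """Angle in degrees for each component: value / total x 360."""
    total = sum(values.values())
    return {name: value * 360 / total for name, value in values.items()}


# Placement figures from Table 4.11.
placements = {"A": 400, "B": 200, "C": 300, "D": 100}

angles = pie_angles(placements)
for company, angle in angles.items():
    print(f"Company {company}: {angle} degrees")
print("Sum of angles:", sum(angles.values()))
```

Company A, with 400 of the 1000 placements, takes 144° of the circle, and the four angles together make up the full 360°.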
4.5.5 Pictograms
Pictograms are also known as cartographs. In a pictogram we use appropriate pictures to
represent the data, the number or the size of the pictures being
proportional to the values of the different magnitudes to be presented. For showing
the population of human beings, human figures are used. We may represent 1 lakh
people by one human figure. Pictograms present only approximate values.
Self Assessment Questions
1) Explain the following terms:
i) Frequency polygon
........................................................................................................
........................................................................................................
........................................................................................................
ii) Bar diagram
........................................................................................................
........................................................................................................
........................................................................................................
iii) Subdivided bar diagram
........................................................................................................
........................................................................................................
........................................................................................................
iv) Multiple bar diagram
........................................................................................................
........................................................................................................
........................................................................................................
v) Pie diagram
........................................................................................................
........................................................................................................
........................................................................................................
4.8 GLOSSARY
Abscissa (X-axis) : The horizontal axis of a graph.
Array : A rough grouping of data.
Bar diagram : It consists of thick vertical lines (bars)
corresponding to values of variables.
Body of the Table : This is the real table and contains numerical
information or data in different cells
Caption : It is part of table, which labels data presented
in the column of table.
Classification : A systematic grouping of data.
Continuous : When data are in regular in a classification.
Cumulative frequency distribution : A classification which shows the cumulative
frequency below the upper real limit of the
corresponding class interval.
Data : Any sort of information that can be analysed.
Discrete : When data are counted in a classification.
Exclusive classification : The classification system in which the upper
limit of the class becomes the lower limit of
next class.
Histogram : It is a set of adjacent rectangles presented
vertically with areas proportional to the
frequencies.
Frequency distribution : Arrangement of data values according to their
magnitude.
Frequency Polygon : It is a broken line graph to represent frequency
distribution.
Inclusive classification : When the upper limit of a class differs from the
lower limit of its successive class.
Ogive : It is the graph of cumulative frequency
Open-end distributions : Classification having no lower or upper
endpoints.
Ordinate (Y-axis) : The vertical axis of a graph.
Pictogram : In pictogram data are presented in the form
of pictures.
Pie diagram : It is a circle sub-divided into components to
present proportion of different constituent
parts of a total
Primary data : The information gathered directly from the
variable.
Qualitative classification : When data are classified on the basis of
attributes.
Quantitative classification : When data are classified on the basis of
number or frequency
Relative frequency distribution : It is a frequency distribution where the
frequency of each value is expressed as a
fraction or percentage of the total number of
observations.
Secondary data : Information gathered through already
maintained records about a variable.
Stub : It is a part of table. It stands for brief and self
explanatory headings of rows.
Tabulation : It is a systematic presentation of classified data
in rows and columns with appropriate
headings and sub headings.
4.9 SUGGESTED READINGS
Yule, G. U., and Kendall, M. G. (1991). An Introduction to the Theory of Statistics.
Universal Books, Delhi.
Sani, F., and Todman, J. (2006). Experimental Design and Statistics for
Psychology: A First Course. Blackwell Publishing.