Part 1 Notes AGB Unit 1
Application (AGB-Unit I)
1. Introduction and importance of statistics and biostatistics
2. Parameter, statistic and observation
3. Sampling methods
4. Classification and tabulation of data
5. Graphical and diagrammatic representation of data
Dr. Rojan P. M., Assistant Professor, AGB
Biostatistics
'Biostatistics is the application of statistical methods to a wide variety of fields of biology or life
sciences including human biology, medicine, public health, agriculture, veterinary, microbiology and
genetics.' Biostatistics is also called biometry, literally meaning biological measurement (Greek origin,
bios = life + metron = measure).
Francis Galton (1822-1911), first cousin of Charles Darwin, is called the 'Father of Biostatistics and
Eugenics'.
Karl Pearson (1857-1936) laid the foundation for the 'descriptive and correlational statistics'. He also
emphasised that the whole doctrine of heredity rests on statistical basis.
The term biometry was coined by W.F.R. Weldon (1860-1906), a zoologist at University College,
London.
Ronald A. Fisher (1890-1962) was a dominant figure in statistics and biometry.
DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS
The two branches of statistics are Descriptive Statistics and Inferential Statistics.
Descriptive Statistics
Descriptive statistics is the study of statistical procedures that deal with the collection,
organisation, graphical representation and processing or summarisation of data to make it informative
and comprehensible.
There are two types of descriptive statistics:
• Measures of central tendency, e.g., mean (average) and median.
• Measures of spread, e.g., range, variance and standard deviation.
Inferential Statistics
Inferential statistics involves those statistical procedures which are used to draw an inference
about the conditions or characteristics of a large population by studying attributes of some small
samples drawn from that population randomly. The inference is considered as a generalisation about
the large population.
An example of inferential statistics is testing the efficacy of a new antihypertensive or anticancer
drug, where the physician has only a limited number of patients with which to assess the drug's efficacy.
Parameter
Parameter is any numerical property, characteristic or fact that is descriptive of a population.
Usually all the characteristics of a population can be specified in terms of a few parameters.
Sample
A sample is a small group or subset selected from a population that represents the attributes of the
entire population and can be used for investigating its properties. Suppose researchers want to find out
some specific feature of a population, but it is not possible to study every single individual in it.
They select a small number of individuals from the population, study them and use that
information to draw conclusions about the whole population. This selected group is called a sample.
For example, suppose we want to study the average height of students studying at KVASU. It is not
necessary to observe the height measurements of all the students. Instead, we can take a small
representative sample of a few students from different batches, measure them and report the results.
Statistic
A statistic is any descriptive characteristic or measure obtained from sample
data. In other words, a statistic is a function of the sample observations.
Example: Average height of students obtained from sample observations.
VARIABLE
A variable is a quality or characteristic which is being observed or measured and can vary from
one individual to another. For example, animals of the same species may differ in their length, weight,
age, sex, etc. Variables may be of two types: quantitative and qualitative.
Random variable: Whenever the height, weight or age of an individual is determined, the result is
referred to as a value of the respective variable. When the values obtained arise as a result of chance
factors, the variable is called a random variable. Values obtained from measurement procedures are
described as observations.
When observations within the data for a particular variable do not have the same value, they
exhibit variation. Variation between observations can be due to many factors. For example, the
variation in human height is largely hereditary, but it may also be due to diet or disease.
STATISTICAL DATA
Any record, descriptive or qualitative account, or symbolic representation of any attribute,
event or process expressed in quantitative form is considered as data. The scientific record of the
results or observations of an experiment or a series of experiments is also called data.
Sources of statistical data
The main sources for the collection of biological data are:
1. Experiments 2. Surveys 3. Records
Experiments
Experiments are performed in the fields, laboratories (biochemistry, physiology,
pharmacology), and in hospitals. Data is collected with a specific objective by one or more workers and
is compiled for analysis and conclusion. The data is made available to various scientific workers through
theses and scientific papers published in scientific journals.
Surveys
Surveys are carried out by trained teams to collect epidemiological data or animal biometric data
(height, length, girth) from field studies.
Records
Data collected through experiments or surveys are maintained in registers, books and journals
over a long period of time, to be consulted or referred to for various purposes in the future.
Records provide a secondary source of data, while experiments and surveys form primary sources.
Primary Data
The data collected by the investigator from personal experimental studies or measurements
are called primary data. They are original and raw.
Secondary Data
The data obtained from some secondary source such as journals, magazines, newspapers or
research papers, etc. are known as secondary data. These data have already been collected by some
other person and organised by statistical procedures. These are in finished form and ready to analyse
and interpret. Even though this saves time and money, secondary data may not be very accurate.
THE LEVELS OF MEASUREMENTS
Nominal: A nominal scale is used to identify categories by name or label. The order, distance
and ratio of measurements are not meaningful, so the scale can be used only for identification of the
various categories. Nominal data measure qualitative characteristics of data expressed as categories.
Examples: gender (male or female), disease category (acute or chronic), place of residence (rural or
urban), etc.
Ordinal: Ordinal data satisfy both the identification and order criteria, but intervals and ratios
between measurements have no meaningful interpretation in the case of ordinal data.
Ordinal variables may be qualitative or quantitative. Examples: educational level of a respondent (no
schooling, primary incomplete, primary complete, secondary incomplete, secondary complete,
college or higher), status of a disease (severe, moderate, normal), etc.
Interval: Interval data have better properties than the nominal and ordinal data. In addition to
identification and order, interval data possess the additional property that the difference between
interval scale measurements is meaningful. However, there is a limitation of the interval data due to
the fact that there is no true starting point (zero) in case of interval scale data. Examples: temperature,
IQ level, ranking an experience, score in a competition, etc. If we consider temperature data, the
zero point is arbitrary and does not mean the absence of temperature, i.e., zero on these scales is not
an absolute zero. Hence, a ratio between two temperature values on the Celsius or Fahrenheit scale is
not meaningful.
Ratio: Ratio data are the highest level of measurements with optimum properties. Ratios
between measurements are meaningful because there is a true starting point (zero). The ratio scale
satisfies all four criteria, including an absolute zero, implying that both the difference between two
values and the ratio of two values are meaningful. Examples: age, height, weight, etc.
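The distinction between interval and ratio scales can be made concrete with a small computation. The following is a minimal sketch in Python, with made-up numbers: the ratio of two Celsius temperatures changes when the same temperatures are expressed in Fahrenheit, whereas the ratio of two weights is unchanged by a change of units.

```python
# Illustration only: ratios are meaningful on a ratio scale (true zero),
# but not on an interval scale such as Celsius/Fahrenheit temperature.

def c_to_f(c):
    return c * 9 / 5 + 32

a_c, b_c = 10.0, 20.0                 # two temperatures in Celsius
a_f, b_f = c_to_f(a_c), c_to_f(b_c)   # the same temperatures in Fahrenheit

print(b_c / a_c)   # 2.0   -> looks like "twice as hot"
print(b_f / a_f)   # ~1.36 -> the "ratio" changes with the scale, so it is meaningless

# Weight (ratio scale, true zero): the ratio is the same in kg and in pounds.
w1_kg, w2_kg = 10.0, 20.0
print(w2_kg / w1_kg)                          # 2.0
print((w2_kg * 2.2046) / (w1_kg * 2.2046))    # still 2.0
```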
SAMPLING DISTRIBUTION
Statistic: A measure computed from the data of a sample (Eg: Sample Mean, x̄)
Parameter: A measure calculated from the data of a population (Eg: Population Mean, μ)
The sampling distribution of the mean is the distribution of sample means. As the sample size n
increases, the distribution of the sample means obtained from any population (irrespective of the
population's distribution) with mean μ and standard deviation σ approaches a normal distribution
with mean μ and standard deviation σ/√n (central limit theorem).
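The central limit theorem stated above can be checked by simulation. The following is a small sketch (Python/NumPy, illustrative numbers only): sample means drawn from a clearly non-normal (exponential) population cluster around μ with a spread close to σ/√n.

```python
# Simulation sketch of the central limit theorem.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=5.0, size=100_000)   # a skewed population
mu, sigma = population.mean(), population.std()

n, repeats = 50, 10_000
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(repeats)])

print("population mean", round(mu, 3),
      "-> mean of sample means", round(sample_means.mean(), 3))
print("sigma/sqrt(n)   ", round(sigma / np.sqrt(n), 3),
      "-> sd of sample means  ", round(sample_means.std(), 3))
```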
ESTIMATION
Generally, parameters are unknown; they have to be estimated by the corresponding statistics.
Estimate: the value of the population parameter obtained from the sample
Estimator: the statistic or method of estimation used to estimate the value of the population parameter
Types of Estimate
Point estimate: a single value which is used to estimate the population parameter
Interval estimate: an interval within which the population parameter is expected to lie; it is
called a confidence interval
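As a sketch of the two types of estimate, the Python snippet below computes a point estimate (the sample mean) and an approximate 95% interval estimate using the normal approximation x̄ ± 1.96·s/√n. The heights are made-up illustrative data; for a sample this small a t-based interval would be more exact.

```python
# Point estimate and approximate 95% confidence interval for the mean.
import numpy as np

heights = np.array([162, 158, 171, 165, 169, 160, 174, 167, 163, 170])  # cm, illustrative
n = len(heights)
x_bar = heights.mean()              # point estimate of the population mean
s = heights.std(ddof=1)             # sample standard deviation
half_width = 1.96 * s / np.sqrt(n)  # normal-approximation margin of error

print("point estimate:", round(x_bar, 2), "cm")
print("95% interval estimate: (%.2f, %.2f) cm"
      % (x_bar - half_width, x_bar + half_width))
```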
Properties of estimator
Unbiasedness
An estimator is said to be unbiased if its expected value is identical with the population
parameter being estimated
Consistency
Means that, as the sample size increases, the estimates (produced by the estimator)
"converge" to the true value
Minimum variance
Best estimator
If an estimator is unbiased and consistent, it is called the best estimator
Efficient estimator
An unbiased, minimum-variance estimator is called an efficient estimator
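Unbiasedness can be demonstrated by simulation. The sketch below (Python/NumPy, illustrative numbers) averages two versions of the sample variance over many samples: the divisor (n−1) version centres on the true σ², while the divisor n version is systematically too small (biased).

```python
# Simulation sketch of unbiasedness using the sample variance.
import numpy as np

rng = np.random.default_rng(7)
true_var = 4.0                      # population variance (sigma = 2)
n, repeats = 5, 50_000

biased, unbiased = [], []
for _ in range(repeats):
    x = rng.normal(loc=10.0, scale=2.0, size=n)
    biased.append(((x - x.mean()) ** 2).sum() / n)          # divisor n
    unbiased.append(((x - x.mean()) ** 2).sum() / (n - 1))  # divisor n - 1

print("true variance       :", true_var)
print("mean of /n estimate :", round(np.mean(biased), 3))    # ~3.2, biased downwards
print("mean of /(n-1)      :", round(np.mean(unbiased), 3))  # ~4.0, unbiased
```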
THEORY OF SAMPLING
Census and Sampling are the two methods by which any required information or data may be
collected.
CENSUS METHOD
Complete enumeration of the data from each and every unit of the population or universe.
(Refer: Livestock census)
Merits
Data obtained from each and every unit
More accurate
Demerits
Difficult if the population is very large
Requires more effort, money and time
SAMPLING METHOD
Learning about the population on the basis of samples drawn from the population. A
portion of the population is known as a sample, and the process of selecting a sample is called
sampling. For statistical inference about a population from a sample, it is essential that the sample is
representative of the population.
Merits
Saves time
Less cost
Demerits
It is only an estimate of the population parameter
SAMPLING METHODS
Stratified Random Sampling
The population of size N is subdivided into a definite number of non-overlapping and distinct sub-
populations of sizes N1, N2, …, Nk such that N1 + N2 + … + Nk = N.
The procedure of dividing the population into distinct sub-populations is called stratification, and each
sub-population is called a stratum. While forming a stratum, we see that the units within each stratum
are more homogeneous with respect to the character under study.
Within each stratum of size Ni, a random sample of size ni is drawn such that n1 + n2 + … + nk = n, where
n is the size of the sample.
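A common way to choose the ni is proportional allocation, ni = n·Ni/N, with a simple random sample drawn within each stratum. The sketch below (Python/NumPy) uses illustrative stratum sizes; the stratum names are hypothetical.

```python
# Sketch of proportional allocation in stratified random sampling.
import numpy as np

rng = np.random.default_rng(3)
strata_sizes = {"farm_A": 500, "farm_B": 300, "farm_C": 200}  # N1, N2, N3 (illustrative)
N = sum(strata_sizes.values())                                 # N = 1000
n = 100                                                        # total sample size

sample = {}
for stratum, N_i in strata_sizes.items():
    n_i = round(n * N_i / N)                   # proportional allocation n_i = n * N_i / N
    units = np.arange(N_i)                     # unit labels within the stratum
    sample[stratum] = rng.choice(units, size=n_i, replace=False)  # SRS within the stratum
    print(stratum, "-> n_i =", n_i)            # 50, 30, 20
```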
Systematic Sampling
Consists of selecting only the 1st unit at random, the rest being selected according to some
predetermined pattern involving regular spacing of units (e.g., randomly selecting the first unit, then
selecting every 10th item thereafter).
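A minimal sketch of systematic selection (Python/NumPy, illustrative N and n): choose a random start within the first sampling interval k = N/n, then take every k-th unit.

```python
# Sketch of systematic sampling.
import numpy as np

rng = np.random.default_rng(5)
N, n = 200, 20
k = N // n                           # sampling interval (here 10)
start = rng.integers(0, k)           # random start within the first interval
selected = np.arange(start, N, k)    # every k-th unit after the random start
print(selected)                      # 20 evenly spaced unit indices
```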
Cluster sampling
The total population is divided, depending on the problem under study, into some recognisable
subdivisions called clusters, and simple random samples are drawn from these clusters.
Purposive/Deliberate/Subjective/Judgment sampling
It is the one in which the investigator takes the samples exclusively at his discretion.
Convenient sampling
The investigator chooses the samples at his convenience or by ease of access.
Quota sampling
It is a type of judgment sampling wherein quotas are set up according to some specified characteristics.
CLASSIFICATION AND TABULATION OF DATA
The data obtained by the investigator is unorganised and does not give much information, so the
data set has to be rearranged. The most elementary rearrangement of data is called an array, i.e., 'an
arrangement of the observations according to size/magnitude'. The observed values are arranged in
order of magnitude and the rearranged data is termed arrayed data; an arrangement of the observed
values in order of magnitude from the smallest value to the largest is called an ordered array.
Objectives of Classification of Data
The objectives of compilation and classification of data are:
• to make the data simple and meaningful, so as to leave a lasting impression.
• to make the data easily accessible, easily understandable and fit for proper use.
• to present data in condensed form by summation of items, so that it is easy to draw statistical inferences.
• to ensure easy detection of errors and omissions in the data.
• to help define the problem and suggest solutions.
• to ensure quick comparison and easy study of the data.
Methods of classification of data
1. Classification by Space (Geographical Data)
In this classification, data is classified by location of occurrence, i.e., according to area or region. The
data is organised into sets of categories in the order of their geographical location, for example,
state-wise production of fish in India.
2. Classification by Time (Chronological Data)
In this classification, the data is classified by the time of occurrence of the observations or occurrence
of an event. The categories are arranged in chronological order. For example, data of egg production
of a poultry farm for the last five years.
When data consisting of a large number of observations are divided into groups that have
defined upper and lower limits, each group is called a class. The size of the class is called the class interval.
For example, the values 10-20, 20-30 and 30-40 are classes of the series 10 to 40.
There are two ways of classifying the data on the basis of class intervals:
(i) Exclusive method: In the exclusive method, the upper limit of a class is the lower limit of the
succeeding class. This method ensures the continuity of the data.
(ii) Inclusive method: Under the inclusive method, the upper limit of one class is included in that class.
For continuous variables, the exclusive method should be used; in the case of discrete variables it
is possible to use the inclusive method.
Class Limits
The two ends of a class are called class limits. The smaller value of the class represents the lower class
limit of the class and the higher value represents the upper class limit. For example, in case of class 91-
94, 91 is the lower class limit and 94 is the upper class limit.
Class Boundary
The class boundaries are the limits up to which the two limits of each class or group may be extended
to fill the gap existing between successive classes. For example, in the class 91-94, the lower class limit
can be extended to 90.5, which is called the lower class boundary. Similarly, the upper class limit can
be extended to 94.5, which represents the upper class boundary.
Class Width or Class Magnitude
The difference between the upper and lower class boundaries is described as class magnitude. It is also
called class size, class range or class width. The class width can be calculated by the formula:
Class width or Class range = Largest value (or Upper class boundary) – Smallest value (or
Lower class boundary)
Mid Value of Class
The value just at the middle of the class is called mid value of the class. It is also known as mid point
or central value. It is calculated as the arithmetic mean of the upper class limit and lower class limit of
a class or the highest and lowest limits of the class interval. For example, the mid value of the class
91-94 will be 92.5. The formula used for calculating mid value is:
Mid value = (lower class limit + upper class limit) / 2
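A tiny worked sketch (Python) of the class calculations above for the inclusive class 91-94, assuming measurements recorded to the nearest whole unit so that the gap to be bridged is 1 (i.e., each limit is extended by 0.5):

```python
# Class boundary, class width and mid value for the class 91-94.
lower_limit, upper_limit = 91, 94

lower_boundary = lower_limit - 0.5               # 90.5
upper_boundary = upper_limit + 0.5               # 94.5
class_width = upper_boundary - lower_boundary    # 4.0
mid_value = (lower_limit + upper_limit) / 2      # 92.5

print(lower_boundary, upper_boundary, class_width, mid_value)
```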
An arrangement of the frequencies of a variable and their presentation in defined groups is called a
frequency table. The number of times a value occurs in a series is called the frequency of that value of
the variable.
PARTS OF A TABLE
A table should essentially contain seven parts namely:
1. Table number: When a book or report contains more than one table, each table must have a
number.
2. Title of the table: Every table must have a suitable heading. The heading should be short, clear and
convey the purpose of the table.
3. Captions and stubs: Captions refer to the vertical column headings, while stubs refer to the
horizontal row headings.
4. Head notes: It is a statement given below the title which clarifies the contents of the table. It explains
the entire table or its main parts. For example, the milk yield of the state in different years may be
expressed in a head note as being in metric tonnes, or the cattle population of the state as being in millions.
5. Body: The figures that are presented to the readers form the body of the table, which must
contain subtotals and grand totals.
6. Source: When secondary data is presented in a table, the source of the data needs to be
given. The source should give the name of the book, page number, table number, etc., from which the
data have been collected.
7. Foot note: A footnote is a pointer; it tells the reader that whatever bit of text they are reading
requires additional information to make complete sense. For example, in a table giving information on
the actual milk production of the state for the years 2000 to 2018, if the projected value is given
only for 2018, it needs to be marked with a footnote as "projected value", which will be mentioned at
the bottom of the table.
GRAPHICAL AND DIAGRAMMATIC REPRESENTATION OF DATA
One of the most convincing and appealing ways in which statistical results may be presented is
through diagrams and graphs. There are numerous ways in which statistical data may be displayed
pictorially such as different types of diagrams, graphs and maps.
1. One-dimensional diagrams
(a) Line diagram
(b) Bar diagram
There are four types of bar diagrams.
(i) Simple bar diagram
(ii) Divided bar diagram
(iii) Percentage bar diagram
(iv) Multiple bar diagram
2. Two-dimensional diagrams or area diagrams
Rectangles, squares and circles (pie diagram)
3. Three-dimensional diagrams or volume diagrams
Cubes, cylinders and spheres
4. Pictograms and cartograms
LINE DIAGRAM
It is the simplest type of a diagram. For diagrammatic representation of data, the frequencies of the
discrete variable can be presented by a line diagram. The variable is taken on the x-axis, and the
frequencies of the observations on the y-axis. The straight lines are drawn whose lengths are
proportional to the frequencies.
BAR DIAGRAM
Bar diagrams are commonly used in practice to represent statistical data. They are also known as
one-dimensional diagrams because only the length of the bar is important, not the width. Instead of
the straight lines of a line diagram, one can construct rectangular bars of equal width, and such a
representation is called a bar diagram. In the case of a very large number of items, line diagrams may
be drawn instead of bars.
The following points should be taken into consideration while constructing a bar diagram.
(i) They may be in the shape of horizontal or vertical bars.
(ii) The width of the bars should be uniform throughout the diagram.
(iii) The gap between one and the other bar should be uniform throughout.
Bar diagrams can be of the following types:
Simple bar diagram: A simple bar diagram is used to represent only one variable. As one bar represents
only one figure, there are as many bars as the number of figures.
Divided bar diagram: In a divided bar diagram, the frequency is divided into different components and
such a representation is called a divided bar diagram.
Percentage bar diagram: In percentage bar diagrams, the length of the bars is kept equal to 100 and
the divisions of the bar correspond to the percentages of different components. This diagram is called
a percentage divided bar diagram.
Multiple bar diagram: Multiple bar diagrams are preferred whenever a comparison between two or
more related variables is to be made. The technique of simple bar diagrams can be extended to
represent two or more sets of interrelated data in a diagram.
Pie diagram
It is used for percentage distribution. Different components are represented by sectors of a
circle. A circle subtends an angle of 360° at the centre, which represents the total, and the angles of the
sectors are proportional to the respective values or measurements of the different components
(usually a pie chart is not used for depicting a large number of components). This sort of
representation is called a pie chart, also known as a circular chart or sector chart.
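The angle calculation behind a pie diagram is simply each component's share of the total multiplied by 360°. The sketch below (Python) uses illustrative livestock counts, not real data.

```python
# Sector angles for a pie diagram: angle = (value / total) * 360 degrees.
components = {"cattle": 450, "goat": 300, "poultry": 200, "others": 50}  # illustrative
total = sum(components.values())

for name, value in components.items():
    angle = value / total * 360
    print(f"{name:8s} {value:4d}  {value / total:6.1%}  {angle:6.1f} deg")
# The sector angles add up to 360 degrees.
```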
Pictogram:
When statistical data is represented by pictures, the presentation becomes more attractive, and such
pictures are called pictograms. Pictograms are diagrams of pictorial or semi-pictorial nature and are
drawn in different sizes according to scale.
It will be better for the statistician to present tables for detailed reference, and diagrams for rapid
understanding. Diagrammatic representation has the following limitations:
• They can give only a limited amount of information because they show approximate values.
• They can be used only for comparative studies.
• Diagrams cannot be analysed further.
• Diagrammatic representation is mainly useful to the common man; its utility to an expert is limited.
GRAPHIC REPRESENTATION OF DATA
Graphic methods enable the statisticians to present quantitative data in a simple, clear and
effective manner. An important step in statistical analysis is to prepare a frequency distribution table,
but graphic representation of the data in the frequency distribution table can reveal relationships that
might be overlooked in the table itself. The frequency table of most biological variables develops a
distribution which can be compared with standard distributions such as the normal, binomial and
Poisson. A graph is a visual form of representation of statistical data. Comparisons can
be made between two or more phenomena very easily with the help of a graph. The frequency
distribution can be represented graphically in any of the following ways:
(i) Histogram
(ii) Frequency polygon
(iii) Frequency curve
(iv) Cumulative frequency curve
(v) Scatter diagram or dot diagram
HISTOGRAM
Histogram is the most important method for displaying the frequency distribution. Histogram
is a set of vertical bars whose areas are proportional to the frequencies represented. In constructing
the histogram, the variable should be taken on the horizontal axis (x-axis) and the frequencies
depending on it on the vertical axis (y-axis). Each class is represented by a distance which is always
proportional to its class interval. When all the classes are of equal lengths, the heights of the rectangles
will be proportional to the frequencies of the respective classes. In this way, there are a number of
rectangles, each with the class interval distance as its width and the frequency as its height.
Histogram is two-dimensional where both the length as well as the width are important. Whereas a
bar diagram is one-dimensional, i.e. only the length of the bar is important, and not the width.
The histogram can be constructed in two ways depending upon the class-intervals:
(i) For distributions that have equal class intervals.
(ii) For distributions that have unequal class intervals.
In the first case, if the class intervals are equal, the height of the rectangles will be proportional
to the frequency. In the case of unequal class intervals, a correction must be made: take the lowest
class interval as the reference and adjust the frequencies of the other classes accordingly. For example,
if one class interval is twice the lowest class interval, the height of its rectangle is divided by two; if it
is three times the lowest, the height is divided by three, and so on. The frequency is then represented
by the area of the bar.
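The adjustment for unequal class intervals can be sketched numerically. In the Python example below (illustrative classes and frequencies), the plotted height is the frequency divided by how many times wider the class is than the smallest class, so that the area of each bar stays proportional to its frequency.

```python
# Adjusting histogram heights for unequal class intervals.
classes = [(0, 10), (10, 20), (20, 40), (40, 70)]   # widths 10, 10, 20, 30 (illustrative)
freqs   = [5, 12, 18, 9]

smallest_width = min(upper - lower for lower, upper in classes)
for (lower, upper), f in zip(classes, freqs):
    times_wider = (upper - lower) / smallest_width
    height = f / times_wider                      # adjusted height of the rectangle
    print(f"{lower:2d}-{upper:2d}  freq={f:2d}  height={height:5.1f}  "
          f"area={height * (upper - lower):6.1f}")  # area stays proportional to frequency
```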
FREQUENCY POLYGON
Frequency distribution can be portrayed graphically by means of a frequency polygon. To
construct a frequency polygon, we mark the frequencies on the vertical axis and the values of the
variable on the horizontal axis as in the case of histogram. A dot is placed above the mid-point of each
class and the height of a given dot corresponds to the frequency of the relevant class interval. By
connecting the dots by a straight line, the frequency polygon is prepared. A frequency polygon is simply
a line graph that connects the mid-points of all the bars in a histogram.
FREQUENCY CURVE
As the number of observations increases and the class intervals become narrower, the frequency
polygon or histogram approaches the form of a smooth curve. Such a curve is obtained for the
distribution of individuals in a large sample or in a population. In a
majority of the biological characters, the frequency distributions approximate to a symmetrical bell
shaped curve known as the normal curve. The frequency curve is drawn freehand to eliminate as far
as possible, the accidental variations that might be present in the biological, agricultural and other
data. The total area under the curve should be equal to the area under the original histogram or
polygon.
CUMULATIVE FREQUENCY CURVE or OGIVE
Sometimes it is desirable to determine the number of
observations that fall above or below a certain value rather than within a given interval. In such cases,
the regular frequency distribution may be converted to a cumulative frequency distribution. A graph
of cumulative frequency distribution is called the ogive (pronounced "oh-jive"). There are two methods
of constructing ogive, namely:
(i) The "less than" method
(ii) The "more than" method-
In the "less than" method, we start with the upper limit of the classes and go on adding the frequencies
(Less than cumulative frequencies plotted against the upper class limits). However, in the case of
"more than" method, we start with the lower limit of classes (More than cumulative frequencies
plotted against the lower class limits). The first method gives a rising curve, whereas the second
method shows a declining curve.
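The two cumulative series behind the ogives can be computed directly. The sketch below (Python/NumPy, illustrative frequencies) pairs the "less than" cumulative frequencies with the upper class limits and the "more than" cumulative frequencies with the lower class limits.

```python
# "Less than" and "more than" cumulative frequencies for an ogive.
import numpy as np

classes = [(10, 20), (20, 30), (30, 40), (40, 50)]   # illustrative classes
freqs = np.array([4, 10, 7, 3])

upper_limits = [u for _, u in classes]
lower_limits = [l for l, _ in classes]

less_than = np.cumsum(freqs)                 # rising series: 4, 14, 21, 24
more_than = np.cumsum(freqs[::-1])[::-1]     # falling series: 24, 20, 10, 3

print(list(zip(upper_limits, less_than)))    # plot against upper class limits
print(list(zip(lower_limits, more_than)))    # plot against lower class limits
```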
Lorenz Curve is a modification of the Ogive when the variables and the cumulative frequencies
are expressed as percentages. It is a graphical method of studying dispersion.