1. Define Statistics. Explain the Nature, Scope and Functions of Statistics.
Introduction:
The term “statistics” is used in two senses: first, in the plural sense, meaning a collection of numerical facts or estimates, that is, the figures themselves. It is in this sense that the public usually think of statistics, e.g., figures relating to population, profits of different units in an industry, etc. Secondly, as a singular noun, the term ‘statistics’ denotes the various methods adopted for the collection, analysis and interpretation of the facts numerically represented. In the singular sense, the term ‘statistics’ is better described as statistical methods. In our study of the subject, we shall be more concerned with the second meaning of the word ‘statistics’.
Definition:
A.L. Bowley defines, “Statistics may be called the science of counting”. At another place he
defines, “Statistics may be called the science of averages”. Both these definitions are narrow and throw light
only on one aspect of Statistics. According to King, “The science of statistics is the method of judging collective, natural or social phenomena from the results obtained from the analysis or enumeration or collection of estimates”.
Functions of Statistics:
It is not easy to grasp large masses of figures, and hence they are simplified either by taking a few figures to serve as a representative sample or by taking an average to give a bird’s eye view of the large masses.
The comparison between two different groups is best represented by certain statistical measures, such as averages, coefficients, rates, ratios, etc.
An individual’s knowledge is limited to what he can observe and see, and that is a very small part of the social organism. His knowledge is extended in various ways by studying certain conclusions and results, the basis of which are numerical investigations.
5. To provide guidance in the formulation of policies:
The purpose of statistics is to enable correct decisions, whether they are taken by a businessman or by Government. In fact, statistics is a great servant of business in management, governance and development. Sampling methods are employed in industry in tackling the problem of standardisation of
products. Big business houses maintain a separate department for statistical intelligence, the work of which
is to collect, compare and coordinate figures for formulating future policies of the firm regarding production
and sales.
But for the development of the statistical science, it would not be possible to estimate the population of a
country or to know the quantity of wheat, rice and other agricultural commodities produced in the country
during any year.
Statistics is indispensable in planning in the modern age, which is termed “the age of planning”. Almost all over the world, governments are resorting to planning for economic development.
Statistical data and the techniques of statistical analysis have proved immensely useful in dealing with economic problems such as wages, prices, time series analysis and demand analysis.
Statistics is an indispensable tool of production control. Business executives are relying more and more on statistical techniques for studying the needs and desires of their valued customers.
In industry, statistics is widely used in quality control: in production engineering, statistical tools such as inspection plans and control charts are used to find out whether the product is conforming to specifications or not.
Statistics is intimately related to mathematics; recent advancements in statistical technique are the outcome of wide applications of mathematics.
In medical science, the statistical tools for the collection, presentation and analysis of observed facts relating to the causes and incidence of diseases, and the results of the application of various drugs and medicines, are of great importance.
In education and psychology, statistics has found wide application, such as in determining the reliability and validity of a test, factor analysis, etc.
In war, the theory of decision functions can be of great assistance to military personnel in planning “maximum destruction with minimum effort.”
2. What are the measures of central tendency? Explain the different measures of central tendency.
There are three main measures of central tendency: the mode, the median and the mean. Each of these
measures describes a different indication of the typical or central value in the distribution. The mode is
the most commonly occurring value in a distribution.
Introduction:
A measure of central tendency is a single value that attempts to describe a set of data by identifying the
central position within that set of data. As such, measures of central tendency are sometimes called
measures of central location. They are also classed as summary statistics. The mean (often called the average)
is most likely the measure of central tendency that you are most familiar with, but there are others, such as
the median and the mode.
The mean, median and mode are all valid measures of central tendency, but under different conditions, some
measures of central tendency become more appropriate to use than others. In the following sections, we will
look at the mean, mode and median, and learn how to calculate them and under what conditions they are
most appropriate to be used.
Mean (Arithmetic):
The mean (or average) is the most popular and well known measure of central tendency. It can be used with
both discrete and continuous data, although its use is most often with continuous data (see
our Types of Variable guide for data types). The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set. So, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of...":

x̄ = (Σ xi) / n
You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample
mean? This is because, in statistics, samples and populations have very different meanings and these
differences are very important, even if, in the case of the mean, they are calculated in the same way. To
acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower
case letter "mu", denoted as µ:
The mean is essentially a model of your data set: it is the single value that best summarises all of the data. You will notice, however, that the mean is not often one of the actual values that you have observed in your data set.
However, one of its important properties is that it minimises error in the prediction of any one value in your
data set. That is, it is the value that produces the lowest amount of error from all other values in the data
set.
An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
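As a minimal sketch, this zero-sum property can be checked in a few lines of Python (the sample values are made up):

    # Any made-up sample will do; deviations from the mean always sum to zero.
    data = [4, 8, 15, 16, 23, 42]
    mean = sum(data) / len(data)          # 18.0
    deviations = [x - mean for x in data]
    print(sum(deviations))                # 0.0 (up to floating-point rounding)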
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are
values that are unusual compared to the rest of the data set by being especially small or large in numerical
value.
Suppose, for example, that we have the salaries of ten staff, two of whom earn far more than the rest. The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value
might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in
the $12k to 18k range. The mean is being skewed by the two large salaries. Therefore, in this situation, we
would like to have a better measure of central tendency. As we will find out later, taking the median would be
a better measure of central tendency in this situation.
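A short Python sketch makes the point; the ten salary figures below are hypothetical, chosen only to be consistent with the situation described above (eight salaries between $12k and $18k plus two large ones, giving a mean of $30.7k):

    import statistics

    # Hypothetical salaries in $k: eight typical values and two large outliers.
    salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

    print(statistics.mean(salaries))    # 30.7 -> pulled up by the two outliers
    print(statistics.median(salaries))  # 15.5 -> much closer to a typical salary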
Median:
The median is the middle score for a set of data that has been arranged in order of magnitude. The
median is less affected by outliers and skewed data.
Mode:
The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option.
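All three measures can be computed with Python's built-in statistics module; a small sketch with made-up scores:

    import statistics

    scores = [2, 3, 3, 4, 4, 4, 5, 6]   # made-up data

    print(statistics.mean(scores))      # 3.875
    print(statistics.median(scores))    # 4.0
    print(statistics.mode(scores))      # 4, the most frequent score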
3. What are the measures of dispersion? Elucidate the methods of measuring dispersion.
Statistical dispersion:
In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is
stretched or squeezed.[1] Common examples of measures of statistical dispersion are the variance,
standard deviation, and interquartile range.
Dispersion is contrasted with location or central tendency, and together they are the most used
properties of distributions.
Measures:
A measure of statistical dispersion is a nonnegative real number that is zero if all the data are the same and
increases as the data become more diverse.
Most measures of dispersion have the same units as the quantity being measured. In other words, if the
measurements are in metres or seconds, so is the measure of dispersion. Examples of dispersion measures
include:
Standard deviation
Interquartile range (IQR)
Median absolute deviation (MAD)
These are frequently used (together with scale factors) as estimators of scale parameters, in which
capacity they are called estimates of scale. Robust measures of scale are those unaffected by a small
number of outliers, and include the IQR and MAD.
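A brief sketch (made-up data) computing these measures with Python's statistics module, showing that the robust measures are far less affected by a single outlier:

    import statistics

    data = [2, 4, 4, 4, 5, 5, 7, 9, 60]           # one large outlier

    print(statistics.pstdev(data))                 # inflated by the outlier

    q1, _, q3 = statistics.quantiles(data, n=4)    # quartiles
    print(q3 - q1)                                 # IQR: robust to the outlier

    med = statistics.median(data)
    print(statistics.median(abs(x - med) for x in data))  # MAD: also robust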
All the above measures of statistical dispersion have the useful property that they are location-invariant and linear in scale. This means that if a random variable X has a dispersion of SX, then a linear transformation Y = aX + b for real a and b should have dispersion SY = |a|SX, where |a| is the absolute value of a (that is, ignoring a preceding negative sign).
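This scale property is easy to verify numerically; a sketch with made-up values, using the standard deviation as the dispersion measure:

    import statistics

    x = [3, 7, 7, 19]                     # made-up data; pstdev(x) = 6.0
    a, b = -2, 5                          # any real constants
    y = [a * xi + b for xi in x]

    print(statistics.pstdev(y))           # 12.0
    print(abs(a) * statistics.pstdev(x))  # 12.0, i.e. SY = |a| * SX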
Other measures of dispersion are dimensionless. In other words, they have no units even if the variable itself
has units. These include:
Coefficient of variation
Entropy:
While the entropy of a discrete variable is location-invariant and scale-independent, and therefore not a measure of dispersion in the above sense, the entropy of a continuous variable is location-invariant and additive in scale: if Hx is the entropy of a continuous variable x and y = ax + b, then Hy = Hx + log(a).
Variance (the square of the standard deviation) – location-invariant but not linear in scale.
Variance-to-mean ratio – mostly used for count data when the term coefficient of dispersion is used and
when this ratio is dimensionless, as count data are themselves dimensionless, not otherwise.
Some measures of dispersion have specialized purposes, among them the Allan variance and the
Hadamard variance.
For categorical variables, it is less common to measure dispersion by a single number; see qualitative
variation. One measure that does so is the discrete entropy.
Sources:
In the physical sciences, such variability may result from random measurement errors: instrument
measurements are often not perfectly precise, i.e., reproducible, and there is additional inter-rater
variability in interpreting and reporting the measured results. One may assume that the quantity being
measured is stable, and that the variation between measurements is due to observational error. A
system of a large number of particles is characterized by the mean values of a relatively small number of
macroscopic quantities such as temperature, energy, and density. The standard deviation is an important
measure in Fluctuation theory, which explains many physical phenomena, including why the sky is blue.[2]
In the biological sciences, the quantity being measured is seldom unchanging and stable, and the variation
observed might additionally be intrinsic to the phenomenon: It may be due to inter-individual variability, that is,
distinct members of a population differing from each other. Also, it may be due
to intra-individual variability, that is, one and the same subject differing in tests taken at different times or in
other differing conditions. Such types of variability are also seen in the arena of manufactured products;
even there, the meticulous scientist finds variation.
In economics, finance, and other disciplines, regression analysis attempts to explain the dispersion of
a dependent variable, generally measured by its variance, using one or more independent variables each of
which itself has positive dispersion. The fraction of variance explained is called the coefficient of
determination.
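For the simple case of one independent variable, the coefficient of determination is just the square of the linear correlation coefficient; a sketch in Python (made-up data; statistics.correlation requires Python 3.10+):

    import statistics

    x = [1, 2, 3, 4, 5]                   # independent variable (made up)
    y = [2.1, 3.9, 6.2, 8.0, 9.8]         # dependent variable (made up)

    r = statistics.correlation(x, y)      # Pearson r
    print(r ** 2)                         # fraction of variance explained, near 1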
A mean-preserving spread (MPS) is a change from one probability distribution A to another probability
distribution B, where B is formed by spreading out one or more portions of A's probability density function
while leaving the mean (the expected value) unchanged.[3] The concept of a mean-preserving spread
provides a partial ordering of probability distributions according to their dispersions: of two probability
distributions, one may be ranked as having more dispersion than the other, or alternatively neither may be
ranked as having more dispersion.
Correlation:
While studying statistics, one comes across the concept of correlation. It is a statistical method which enables the researcher to find whether two variables are related and to what extent they are related. Correlation may be thought of as the sympathetic movement of two or more variables: when a change in one particular variable is accompanied by a change in other variables as well, whether in the same or the opposite direction, the variables are said to be correlated. Given data in which two or more variables are recorded, we can study the related variation of these variables. There are three possibilities:
Positive Correlation
Negative Correlation
No correlation
When the values of one variable increase with an increase in the other variable, it is a positive correlation. On the other hand, if the values of one variable decrease as the values of the other variable increase, it is a negative correlation. There may also be the case where there is no change in one variable with any change in the other variable. In this case, there is said to be no correlation between the two.
Definition:
Correlation describes the relationship between two or more variables, and the correlation coefficient is a number which can be used to describe the strength of that relationship. Simple correlation is defined as related variation between any two variables. Multiple correlation and partial correlation involve related variation among three or more variables. Two variables are correlated when they vary in such a way that the higher and lower values of one variable correspond to the higher and lower values of the other variable, or when the higher values of one variable correspond to the lower values of the other.
Correlation Symbol
The symbol of correlation is r.
Correlation Formula
In the formula (given under Karl Pearson’s coefficient of correlation below):
b = the slope of the regression line, also called the regression coefficient
X = First Score
Y = Second Score
∑XY = Sum of the products of the First and Second Scores
∑X = Sum of First Scores
∑Y = Sum of Second Scores
∑X² = Sum of squared First Scores
∑Y² = Sum of squared Second Scores
Coefficient of Correlation:
The coefficient of correlation, r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. It is also called the Pearson product-moment correlation coefficient. The algebraic method of measuring correlation is called the coefficient of correlation. There are mainly three coefficients of correlation:
Karl Pearson’s coefficient of correlation
Spearman’s rank correlation coefficient
Coefficient of concurrent deviations (concurrent correlation)
Karl Pearson’s Coefficient of correlation: The most important algebraic method of measuring correlation is Karl Pearson’s coefficient of correlation, or the Pearsonian coefficient of correlation. It is widely used in Statistics. It is denoted by r. The formula is given by

r = (n∑xy − ∑x∑y) / (√(n∑x² − (∑x)²) × √(n∑y² − (∑y)²))
Interpretation of Karl Pearson’s coefficient of correlation: r denotes the degree of correlation between two variables and takes values between −1 and +1.
When r is −1, there is a perfect negative correlation.
When r is between −1 and 0, there is a negative correlation.
When r is 0, there is no correlation.
When r is between 0 and 1, there is a positive correlation.
When r is 1, there is a perfect positive correlation.
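The raw-score formula above translates directly into Python; a self-contained sketch with made-up scores:

    import math

    def pearson_r(x, y):
        # Karl Pearson's coefficient via the raw-score formula given above.
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sx2 = sum(a * a for a in x)
        sy2 = sum(b * b for b in y)
        return (n * sxy - sx * sy) / (
            math.sqrt(n * sx2 - sx ** 2) * math.sqrt(n * sy2 - sy ** 2))

    print(pearson_r([1, 2, 3, 4], [2, 4, 5, 9]))   # about 0.96: strong positive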
Types of Correlation:
1. Positive Correlation:
2. Negative Correlation:
3. Partial Correlation:
The correlation is partial if we study the relationship between two variables keeping all other variables
constant.
Example: The relationship between yield and rainfall at a constant temperature is partial correlation.
4. Linear Correlation:
When a unit change in one variable results in a constant change in the other variable, we say the correlation is linear. When there is a linear correlation, the points plotted will lie on a straight line.
Example: Consider the variables with the following values.
X:  10  20  30  40  50
Y:  20  40  60  80  100
Here, there is a linear relationship between the variables. There is a ratio 1:2 at all points. Also, if we plot
them they will be in a straight line.
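Feeding this table into Python (statistics.correlation requires Python 3.10+) confirms a perfect positive linear correlation:

    import statistics

    x = [10, 20, 30, 40, 50]
    y = [20, 40, 60, 80, 100]              # y = 2x at every point

    print(statistics.correlation(x, y))    # 1.0: perfect positive correlation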
A scatter plot is a type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. Scatter plots will often show at a glance whether a relationship exists between two sets of data. If the data displayed on the graph resemble a line rising from left to right, the slope of the line is positive and there is a positive correlation between the two sets of data.
7. Spearman's Correlation:
Spearman's rank correlation coefficient allows us to identify easily the strength of correlation within a data set of two variables, and whether the correlation is positive or negative. The Spearman coefficient is denoted by the Greek letter rho (ρ):

ρ = 1 − (6∑d²) / (n(n² − 1))

where d is the difference between the ranks of each pair of values and n is the number of pairs.
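A minimal sketch of this rank formula in Python, assuming no tied values (the paired scores are made up):

    def spearman_rho(x, y):
        # rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid when there are no ties.
        def ranks(values):
            ordered = sorted(values)
            return [ordered.index(v) + 1 for v in values]  # rank 1 = smallest
        d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
        n = len(x)
        return 1 - 6 * d2 / (n * (n * n - 1))

    # Made-up scores for five items ranked by two judges
    print(spearman_rho([86, 97, 99, 100, 101], [2, 20, 28, 27, 50]))  # 0.9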
8. Non-Linear Correlation:
When the amount of change in one variable is not in a constant ratio to the change in the other variable, we say that the correlation is non-linear. Here there is a non-linear relationship between the variables: the ratio between them is not fixed at all points, and if we plot them on a graph, the points will not lie on a straight line but on a curve. Non-linear correlation is also known as curvilinear correlation.
9. Simple Correlation:
If there are only two variables under study, the correlation is said to be simple. Example: The correlation between price and demand is simple.
10. Multiple Correlation:
When one variable is related to a number of other variables, the correlation is not simple. It is multiple if there is one variable on one side and a set of variables on the other side. Example: The relationship of yield with both rainfall and fertilizer together is multiple correlation.
The range of the correlation coefficient is −1 to +1. If the linear correlation coefficient takes values close to 0, the correlation is weak.
Positive Correlation:
A relationship between two variables in which both variables move in the same direction. A positive correlation exists when one variable increases as the other increases, or one variable decreases as the other decreases. When the values of two variables x and y move in the same direction, the correlation is said to be positive: when there is an increase in x, there will be an increase in y also; similarly, when there is a decrease in x, there will be a decrease in y also.
Price and supply are two variables which are positively correlated. When price increases, supply also increases; when price decreases, supply decreases.
In a strong positive correlation the variables change together and the plotted points lie close together, approximately forming a line.
In a weak positive correlation the variables change together but the points on the graph are dispersed.
Negative Correlation:
In a negative correlation, as the values of one variable increase, the values of the second variable decrease, or as the values of one variable decrease, the values of the other variable increase. When the values of two variables x and y move in opposite directions, we say the correlation is negative: when there is an increase in x, there will be a decrease in y; similarly, when there is a decrease in x, there will be an increase in y.
When price increases, demand decreases; when price decreases, demand increases. So price and demand are negatively correlated.
The closer the correlation coefficient is to either −1 or +1, the stronger the relationship is between the two variables. A perfect negative correlation of −1.0 indicates that, for every member of the sample, a higher score on one variable is related to a lower score on the other variable.
Covariance and Correlation:
Covariance and correlation both describe the degree of similarity between two random variables. Suppose that X and Y are real-valued random variables for the experiment, with means E(X), E(Y) and variances var(X), var(Y), respectively. The covariance of X and Y is defined by

cov(X, Y) = E[(X − E(X))(Y − E(Y))]

and the correlation is obtained by dividing the covariance by the product of the standard deviations of X and Y.
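A sketch verifying this relationship (made-up data; statistics.covariance and statistics.correlation require Python 3.10+):

    import statistics

    x = [1, 2, 3, 4, 5]                   # made-up paired observations
    y = [2, 4, 5, 4, 5]

    cov = statistics.covariance(x, y)     # sample covariance
    r = cov / (statistics.stdev(x) * statistics.stdev(y))

    print(cov, r)   # r agrees with statistics.correlation(x, y)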
Probability:
The classical definition gives the probability of an event A as
P(A) = m / (m + n)
where m is the number of cases favorable to the event and n the number of unfavorable cases; equivalently,
P(A) = Number of cases favorable to the occurrence of the event / Total number of mutually exclusive and exhaustive cases.
That is, the number of cases favorable to the event A is divided by the total number of ways in which the event can happen.
The approaches of probability
The branch of mathematics that is concerned with the analysis of random phenomena is known as probability theory. The occurrence of a random event cannot be predicted with certainty; it can be described only in terms of chance. There are different definitions of probability, which are:
Types of probability
Objective probability
Subjective probability
Objective probability
The probability of the occurrence of an event is based entirely on analysis in which each measure rests on recorded observation, in place of a subjective estimate.
Objective probabilities are of two types:
1) Classical probability
2) Empirical probability
Classical probability
Classical probability is the first approach to the theory of probability.
According to Laplace, "probability is the ratio of the number of favorable cases to the total number of equally likely
cases".
The fundamental assumption of this theory is that the various outcomes of an event are equally likely, and thus the probability of each of these outcomes is also equal. In this theory, the probability of the happening of an event is determined a priori, that is, prior to any experiment.
P(A) = Number of favorable cases / Total number of equally likely cases
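A one-line illustration in Python of this a priori calculation, using the familiar example of drawing an ace from a standard 52-card deck:

    from fractions import Fraction

    # P(ace) = favorable cases / total equally likely cases = 4/52
    print(Fraction(4, 52))   # 1/13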
Relative Frequency approach/ Empirical probability
The scientific study of measuring uncertainty is known as 'probability'.
Probabilities are empirically determined when their numerical values are based upon a sample or a census of
data which gives a distribution of events.
According to von Mises, "If an experiment is performed repeatedly under essentially homogeneous and identical conditions, then the limiting value of the ratio of the number of times the event occurs to the number of trials, as the number of trials becomes indefinitely large, is called the probability of happening of the event, it being assumed that the limit is finite and unique".
Symbolically, if A is the name of an event, f is the frequency with which that event occurred and n is the sample size, then
P(A) = f / n
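The limiting-ratio idea can be illustrated with a quick simulation; a sketch in Python tossing a fair coin, where the relative frequency f/n settles towards 0.5 as n grows:

    import random

    random.seed(0)                        # reproducible runs
    for n in (100, 10_000, 1_000_000):
        f = sum(random.random() < 0.5 for _ in range(n))
        print(n, f / n)                   # relative frequency approaches 0.5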
Subjective Approach to probability
The classical and empirical approaches to probability are objective in character, whereas the subjective approach treats the probability of an event as a measure of one's degree of certainty that the particular event will occur.