Unit - 1
Statistics is a form of mathematical analysis that uses quantified models, representations and synopses
for a given set of experimental data or real-life studies. Statistics studies methodologies to gather,
review, analyze and draw conclusions from data. Some statistical measures include mean, regression
analysis, skewness, kurtosis, variance, and analysis of variance.
Statistics is a term used to summarize a process that an analyst uses to characterize a data set. If the data
set depends on a sample of a larger population, then the analyst can develop interpretations about the
population primarily based on the statistical outcomes from the sample. Statistical analysis involves the
process of gathering and evaluating data and then summarizing the data into a mathematical form.
Statistical methods analyze large volumes of data and their properties. Statistics is used in various
disciplines such as psychology, business, physical and social sciences, humanities, government, and
manufacturing. Statistical data is gathered using a sample procedure or other method. Two types of
statistical methods are used in analyzing data: descriptive statistics and inferential statistics. Descriptive
statistics summarize data from a sample using measures such as the mean or standard deviation. Inferential
statistics are used when the data are viewed as a subset of a specific population.
Thus, statistics may be regarded as an important member of the mathematics family. In the words of Connor,
“Statistics is a branch of applied mathematics which specializes in data.”
Sampling Techniques and Estimation Theory are very powerful and indispensable tools for conducting
any social survey, pertaining to any strata of society and then analyzing the results and drawing valid
inferences. The most important application of statistics in sociology is in the field of Demography for
studying mortality (death rates), fertility (birth rates), marriages, population growth and so on.
Changes in demand, supply, habits, fashion etc. can be anticipated with the help of statistics. Statistics
is of utmost significance in determining the prices of various products, identifying phases of boom and
depression, and so on. The use of statistics helps in the smooth running of a business and in reducing
uncertainties, and thus contributes to its success.
Limitations of Statistics
1. Sampling Bias: Statistics relies on data collected from samples, and if the sample is not
representative of the entire population, the results can be biased. Sampling bias occurs when
certain groups or individuals are more likely to be included in the sample than others.
2. Assumptions: Many statistical methods are based on assumptions about the data, such as
normality, independence, and homogeneity of variance. If these assumptions are not met, the
results may be invalid.
3. Causation vs. Correlation: Statistics can show relationships between variables, but it cannot
prove causation. Just because two variables are correlated does not mean that one causes the
other.
4. Data Quality: Statistics can only work with the data it is given. If the data is incomplete,
inaccurate, or biased, the results will also be flawed.
5. Sensitivity to Outliers: Outliers, extreme data points, can significantly affect statistical results,
especially in small samples. They can skew means and standard deviations, leading to
misleading conclusions.
6. Interpretation: Statistical results require careful interpretation. Misinterpretation or
miscommunication of statistical findings can lead to incorrect conclusions.
7. Ethical Concerns: Statistics can be misused to manipulate or misrepresent data for personal or
political gain. Ethical considerations are important in the collection, analysis, and reporting of
data.
8. Overfitting: When fitting complex models to data, there is a risk of overfitting, where the
model captures noise in the data rather than the underlying patterns. This can result in poor
generalization to new data.
9. Data Availability: Statistics relies on available data, and sometimes the data needed for a
particular analysis may not exist or may be difficult to obtain.
10. Inference vs. Reality: Statistical results are based on inference and probability, not absolute
certainty. There is always some degree of uncertainty associated with statistical conclusions.
11. Complexity: Some real-world phenomena are too complex to be accurately represented by
statistical models. For example, modelling human behaviour can be challenging due to its
multifaceted nature.
12. Context Dependency: The interpretation of statistical results can depend on the context in
which they are applied. What is statistically significant in one context may not be in another.
13. Resource Intensive: Some advanced statistical analyses require significant computational
resources, and not all researchers or organizations may have access to these resources.
Every minute of the working day, decisions are made by businesses around the world that determine
whether companies will be profitable and grow or whether they will stagnate and die. Most of these
decisions are made with the assistance of information gathered about the marketplace, the economic
and financial environment, the workforce, the competition, and other factors. Such information is in the
form of data or is accompanied by data. Business statistics provides the tools through which such data
are collected, analyzed, summarized, and presented to facilitate the decision-making process.
Virtually every area of business uses statistics in decision making.
Measures of Central Tendency
Measures of central tendency are statistical measures used to describe the central or typical value of a
dataset. They provide insight into where the bulk of the data is concentrated.
A measure of central tendency gives information about the centre, or middle part, of a group of numbers;
it is the single value which can be taken as representative of the whole distribution. The following tools
are used to measure central tendency:
Mean
The arithmetic mean is the average of a group of numbers. Because the arithmetic mean is so widely
used, most statisticians refer to it simply as the mean.
The arithmetic mean is obtained by adding all the observations and dividing the sum by the number of
observations.
The population mean is denoted by the Greek letter mu ($\mu$). The sample mean is denoted by $\bar{X}$.
Calculation of Mean

Direct Method: $\bar{X} = \frac{\sum X}{n}$ for ungrouped data; $\bar{X} = \frac{\sum fX}{\sum f}$ for grouped data, where $n = \sum f$.

Short-Cut Method: $\bar{X} = A + \frac{\sum d}{n}$ for ungrouped data; $\bar{X} = A + \frac{\sum fd}{\sum f}$ for grouped data, where $d = X - A$ and A is known as the assumed mean.

Step Deviation Method: $\bar{X} = A + \frac{\sum u}{n} \times h$ for ungrouped data; $\bar{X} = A + \frac{\sum fu}{\sum f} \times h$ for grouped data, where $u = \frac{X - A}{h}$ and h is the common width of the class intervals.
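As a quick numerical check, the three methods can be compared in a short Python sketch. The grouped frequency distribution below (mid-values, frequencies, assumed mean A and class width h) is entirely made up for illustration; all three methods return the same mean.

```python
# Illustrative grouped frequency distribution (hypothetical data)
mid = [5, 15, 25, 35, 45]   # mid-values X of equal-width classes
f   = [4, 8, 12, 6, 2]      # frequencies
n   = sum(f)                # n = Σf

# Direct method: X̄ = Σ fX / Σ f
mean_direct = sum(fi * xi for fi, xi in zip(f, mid)) / n

# Short-cut method: X̄ = A + Σ fd / Σ f, with d = X − A
A = 25                      # assumed mean (any convenient mid-value)
mean_shortcut = A + sum(fi * (xi - A) for fi, xi in zip(f, mid)) / n

# Step-deviation method: X̄ = A + (Σ fu / Σ f) · h, with u = (X − A) / h
h = 10                      # common class width
mean_step = A + (sum(fi * (xi - A) / h for fi, xi in zip(f, mid)) / n) * h

print(mean_direct, mean_shortcut, mean_step)  # all three agree: 23.125
```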
Properties of Arithmetic Mean
1. The sum of the deviations of the individual items from the arithmetic mean is always zero. This
means $\sum (x - \bar{x}) = 0$, where x is the value of an item and $\bar{x}$ is the arithmetic mean. Since the
sum of the deviations in the positive direction is equal to the sum of the deviations in the
negative direction, the arithmetic mean is regarded as a measure of central tendency.
2. The sum of the squared deviations of the individual items from the arithmetic mean is always
minimum. In other words, the sum of the squared deviations taken from any value other than
the arithmetic mean will be higher.
3. As the arithmetic mean is based on all the items in a series, a change in the value of any item
will lead to a change in the value of the arithmetic mean.
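The first two properties are easy to verify numerically. A small sketch with made-up observations:

```python
# Hypothetical observations
x = [4, 7, 9, 12, 18]
xbar = sum(x) / len(x)                   # arithmetic mean = 10.0

# Property 1: Σ(x − x̄) = 0
print(sum(xi - xbar for xi in x))        # 0.0 (up to float rounding)

# Property 2: Σ(x − a)² is minimised at a = x̄
sse = lambda a: sum((xi - a) ** 2 for xi in x)
print(sse(xbar) < sse(xbar - 1), sse(xbar) < sse(xbar + 1))  # True True
```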
Merits:
▪ The calculation of arithmetic mean is very simple. It is also simple to understand the meaning
of arithmetic mean
▪ Calculation of arithmetic mean is based on all the observations and hence, it can be regarded as
representative of the given data
▪ Arithmetic mean can be calculated even if the detailed observation is not known but the sum of
observations and number of observations are known
▪ It is least affected by the fluctuation of sampling
▪ It provides a good basis for the comparison of two or more distributions
Demerits:
▪ It is unduly affected by extreme observations; a single very large or very small value can pull
the mean away from the bulk of the data
▪ It cannot be calculated for distributions with open-ended classes without assuming the limits of
those classes
▪ It cannot be determined graphically and is not suitable for qualitative data that can only be
ranked
▪ Its value may not coincide with any actual observation in the data
Median
Median is defined as the value of the middle item (or the mean of the values of the two middle items)
when the data are arranged in an ascending or descending order of magnitude. Thus, in an ungrouped
frequency distribution if the n values are arranged in ascending or descending order of magnitude, the
median is the middle value if n is odd. When n is even, the median is the mean of the two middle values.
Calculation of Median

In case of ungrouped data, we first arrange the data in ascending or descending order. If n is odd:
$$\text{Median} = \left(\frac{n+1}{2}\right)\text{th observation}$$
If n is even:
$$\text{Median} = \frac{\left(\frac{n}{2}\right)\text{th observation} + \left(\frac{n}{2}+1\right)\text{th observation}}{2}$$

In case of grouped data, we first find the cumulative frequencies and then use:
$$\text{Median} = L + \frac{\frac{n}{2} - cf}{f} \times h$$
Where
L – lower limit of the median class
h – width of the class interval
cf – cumulative frequency of the class preceding the median class
f – frequency of the median class
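A short Python sketch of both cases, using made-up values for the ungrouped data and a hypothetical frequency table for the grouped formula:

```python
# Median for ungrouped data (hypothetical values)
data = sorted([7, 3, 9, 5, 12, 8])      # [3, 5, 7, 8, 9, 12], n = 6 (even)
n = len(data)
if n % 2:                               # n odd: the middle value
    median = data[n // 2]
else:                                   # n even: mean of the two middle values
    median = (data[n // 2 - 1] + data[n // 2]) / 2
print(median)                           # (7 + 8) / 2 = 7.5

# Median for grouped data: Median = L + ((n/2 − cf) / f) · h
# Hypothetical classes 0–10, 10–20, 20–30, 30–40 with frequencies below
freq = [5, 8, 12, 5]   # N = 30, so N/2 = 15 falls in class 20–30
N = sum(freq)
L, h = 20, 10          # lower limit and width of the median class
cf = 5 + 8             # cumulative frequency before the median class
f = 12                 # frequency of the median class
print(L + ((N / 2 - cf) / f) * h)   # 20 + ((15 − 13)/12)·10 ≈ 21.67
```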
Characteristics of Median
1. Unlike the arithmetic mean, the median can be computed from open-ended distributions. This
is because it is located in the median class-interval, which would not be an open-ended class.
2. The median can also be determined graphically whereas the arithmetic mean cannot be
ascertained in this manner.
3. As it is not influenced by the extreme values, it is preferred in case of a distribution having
extreme values.
4. In case of the qualitative data where the items are not counted or measured but are scored or
ranked, it is the most appropriate measure of central tendency.
Merits:
▪ It is not influenced by extreme values, so it represents distributions with extreme observations
better than the mean
▪ It can be computed for open-ended distributions and can also be determined graphically
▪ It is the most appropriate measure of central tendency for qualitative data where items are
ranked rather than measured
Demerits:
▪ In case of ungrouped data, the process of calculating median requires their arrangement in the
order of magnitude which may be a cumbersome task, particularly when the number of
observations is very large
▪ In comparison to arithmetic mean, it is much affected by the fluctuations of sampling
▪ Since it is not possible to define weighted median like weighted mean, this average is not
suitable when different items are of unequal importance
▪ Since it is not based on the magnitude of all the observations, different sets of observations
may give the same median
Uses:
▪ It is an appropriate measure of central tendency when the characteristics are not measurable but
different items are capable of being ranked
▪ Median is used to convey the idea of a typical observation of the given data
▪ Median is often computed when quick estimates of averages are desired
▪ When the given data has class intervals with open ends, median is preferred as a measure of
central tendency since it is not possible to calculate mean in this case
Mode
The mode is the most frequently occurring value in a set of data or in other words it is that value which
occurs maximum number of times in a distribution. It is the value at the point around which the items
are most heavily concentrated.
Calculation of Mode
Mode in case of Ungrouped Data: In case of ungrouped data, the mode is the value which has the
maximum frequency.
Mode in case of Grouped Data: In the case of grouped data, the mode is determined by the following
formula:
$$\text{Mode} = L + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times h$$
Where
L – lower limit of the modal class
$f_1$ – frequency of the modal class
$f_0$ – frequency of the class preceding the modal class
$f_2$ – frequency of the class succeeding the modal class
h – width of the class interval
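A short Python sketch of this formula, using a hypothetical frequency table in which the modal class is 20–30:

```python
# Mode for grouped data: Mode = L + (f1 − f0) / ((f1 − f0) + (f1 − f2)) · h
# Hypothetical classes 0–10, 10–20, 20–30, 30–40
freq = [4, 10, 16, 6]       # modal class is 20–30 (highest frequency, 16)
L, h = 20, 10               # lower limit and width of the modal class
f1, f0, f2 = 16, 10, 6      # modal, preceding, and succeeding frequencies
mode = L + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * h
print(mode)                 # 20 + 6/(6 + 10)·10 = 23.75
```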
Merits:
▪ It is easy to understand and easy to calculate. In many cases it can be located just by inspection.
▪ It can be located in situations where the variable is not measurable but categorization or ranking
of observation is possible
▪ Like the median, it is not affected by extreme observations
▪ It can be determined even if the distribution has open end classes
▪ It is a value around which there is more concentration of observations and hence the best
representative of the data
Demerits:
▪ It is not based on all the observations of the data
▪ It is ill-defined when the distribution has more than one mode (bimodal or multimodal series)
▪ It is not capable of further algebraic treatment and can fluctuate considerably from sample to
sample
Empirical Relation between Mean, Median and Mode
It has been observed that for a moderately skewed distribution, the difference between the mean and the
mode is approximately three times the difference between the mean and the median, i.e.
$$\text{Mean} - \text{Mode} = 3(\text{Mean} - \text{Median})$$
This is an empirical formula only and can give only approximate results, so its frequent use should be
avoided. However, it may be used when the mode is ill-defined or the series is bimodal.
Percentiles
Percentiles are measures of central tendency that divide a group of data into 100 parts. The nth
percentile is the value such that at least n percent of the data are below that value and at most (100 - n)
percent are above that value.
Specifically, the 87th percentile is a value such that at least 87% of the data are below the value and no
more than 13% are above the value.
Percentiles are widely used in reporting test results. Almost all college or university students have taken
the SAT, ACT, GRE, or GMAT examination. In most cases, the results for these examinations are
reported in percentile form and also as raw scores.
The location of the Pth percentile in an ordered data set is given by
$$i = \left(\frac{P}{100}\right) N$$
Where
i – location (position) of the Pth percentile in the ordered data
P – percentile of interest
N – number of observations
If $x_1, x_2, x_3, \ldots, x_k$ are k values (or mid-values in case of class intervals) of a variable X with their
corresponding frequencies $f_1, f_2, f_3, \ldots, f_k$, then we first form a cumulative frequency distribution.
After that we determine the ith percentile class, just as we do in the case of the median.
$$P_i = L + \frac{\frac{i}{100}N - CF}{f} \times h$$
Where
L – lower limit of the percentile class
CF – cumulative frequency of the class preceding the percentile class
f – frequency of the percentile class
h – width of the class interval
N – total frequency
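A minimal Python sketch of the ungrouped case, using the location formula i = (P/100)·N with the common textbook convention that a whole-number location is averaged with the next value and a fractional location is rounded up (the scores below are made up):

```python
import math

def percentile(data, P):
    """Pth percentile of ungrouped data via i = (P/100)·N."""
    data = sorted(data)
    N = len(data)
    i = (P / 100) * N
    if i.is_integer():                       # whole-number location:
        i = int(i)                           # average the i-th and (i+1)-th
        return (data[i - 1] + data[i]) / 2   # ordered values
    return data[math.ceil(i) - 1]            # else round up to the next whole

scores = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(scores, 25))                # i = 2 → (12 + 13)/2 = 12.5
```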
Merits of Percentiles:
• Percentiles are easy to understand and communicate. They represent the relative position of a
data point within a dataset, making it accessible to a wide range of people, including non-
statisticians.
• Percentiles are resistant to outliers, extreme values, or skewed distributions. They focus on the
position of data points rather than their actual values, making them suitable for analyzing data
with anomalies.
• Percentiles enable meaningful comparisons between different datasets or groups. For example,
you can compare the performance of students in two schools by looking at their respective
percentile scores.
• Percentiles provide a clear and interpretable way to understand how a specific data point
compares to others within a dataset. For instance, a student in the 90th percentile in a math
exam performed better than 90% of their peers.
• They can be used to normalize data, making it easier to compare variables with different units
or scales. This is particularly useful in fields like standardized testing and financial analysis.
Demerits of Percentiles:
• Percentiles condense data into percentile ranks, which can lead to a loss of information. You
will not have access to the actual data values, which may be necessary for detailed analysis.
• Percentiles may not fully capture the characteristics of the data distribution. In cases where the
shape of the distribution is important, other measures like mean and standard deviation might
be more informative.
• The choice of percentiles (e.g., 25th, 50th, and 75th) is somewhat arbitrary and may not always
be the most relevant for a particular analysis. Different percentiles may be more appropriate in
certain situations.
• Percentiles can be misleading when dealing with highly skewed distributions. For example, in
a positively skewed income distribution, the 50th percentile might not represent the typical
income.
• The interpretation of percentiles can be sensitive to sample size. In smaller datasets, percentiles
may not provide a reliable estimate of where data points fall within the population.
Quartiles
Quartiles are measures of central tendency that divide a group of data into four subgroups or parts.
The three quartiles are denoted as Q1, Q2, and Q3. The first quartile, Q1, separates the first, or lowest,
one-fourth of the data from the upper three-fourths and is equal to the 25th percentile. The second
quartile, Q2, separates the second quarter of the data from the third quarter. Q2 is located at the 50th
percentile and equals the median of the data. The third quartile, Q3, divides the first three-quarters of
the data from the last quarter and is equal to the value of the 75th percentile.
• Start by arranging your ungrouped data in ascending order from smallest to largest. This step
is essential for finding quartiles accurately
• To find the positions of the quartiles, you can use the following formulas:
First Quartile (Q1): Position = (n + 1) / 4
Second Quartile (Q2, also the Median): Position = (n + 1) / 2
Third Quartile (Q3): Position = 3 * (n + 1) / 4
• After finding the positions, you can calculate the quartiles as follows:
❖ Q1: If the position is a whole number (e.g., 10, 20, 30, etc.), you can take the data point
at that position as Q1. If the position is not a whole number, you can calculate Q1 by
taking the weighted average of the two closest data points. For example, if the position
is 10.5, you would take the average of the 10th and 11th data points.
❖ Q2 (Median): Q2 is simply the value at the position you calculated in Step 2.
❖ Q3: Similar to Q1, you can find Q3 using the same method. If the position is a whole
number, take the data point at that position. If it's not a whole number, average the two
closest data points.
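A small Python sketch of these steps with made-up data, using linear interpolation between the two closest observations when a position is fractional (for a position ending in .5 this reduces to the simple average described above):

```python
# Quartile positions for ungrouped data via the (n + 1) convention
data = sorted([22, 7, 15, 10, 18, 12, 25, 9, 30])   # hypothetical, n = 9
n = len(data)

def at_position(values, pos):
    """Value at a (possibly fractional) 1-based position, interpolating
    between the two nearest ordered observations when pos is not whole."""
    lower = int(pos)                     # index of the earlier observation
    frac = pos - lower
    if frac == 0:
        return values[lower - 1]
    return values[lower - 1] + frac * (values[lower] - values[lower - 1])

q1 = at_position(data, (n + 1) / 4)      # position 2.5 → 9.5
q2 = at_position(data, (n + 1) / 2)      # position 5   → 15 (the median)
q3 = at_position(data, 3 * (n + 1) / 4)  # position 7.5 → 23.5
print(q1, q2, q3)
```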
For grouped data, if $x_1, x_2, x_3, \ldots, x_k$ are k values (or mid-values in case of class intervals) of a variable X with their
corresponding frequencies $f_1, f_2, f_3, \ldots, f_k$, then we first form a cumulative frequency distribution.
After that we determine the ith quartile class, just as we do in the case of the median.
$$Q_i = L + \frac{\frac{iN}{4} - CF}{f} \times h$$
Where
L – lower limit of the quartile class
CF – cumulative frequency of the class preceding the quartile class
f – frequency of the quartile class
h – width of the class interval
N – total frequency
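A minimal Python sketch of the grouped formula, with a hypothetical frequency table; the helper locates the quartile class from the cumulative frequencies before interpolating:

```python
# i-th quartile for grouped data: Qi = L + ((iN/4 − CF) / f) · h
# Hypothetical classes 0–10, 10–20, 20–30, 30–40 and their frequencies
bounds = [0, 10, 20, 30]      # lower limits of the classes
freq = [6, 10, 14, 10]        # N = 40

def quartile_grouped(i, bounds, freq, h=10):
    """Find the quartile class from the cumulative frequencies,
    then interpolate within it."""
    N = sum(freq)
    target = i * N / 4        # iN/4
    cf = 0                    # cumulative frequency before current class
    for L, f in zip(bounds, freq):
        if cf + f >= target:  # the quartile class has been reached
            return L + (target - cf) / f * h
        cf += f

print(quartile_grouped(1, bounds, freq))  # Q1 = 10 + (10 − 6)/10 · 10 = 14.0
print(quartile_grouped(3, bounds, freq))  # Q3 = 20 + (30 − 16)/14 · 10 = 30.0
```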
Merits of Quartiles
❖ Quartiles are less sensitive to extreme outliers compared to other measures like the mean and
standard deviation. This makes them useful for summarizing data with outliers.
❖ Quartiles are easy to interpret. The first quartile (Q1) represents the 25th percentile, the second
quartile (Q2) represents the median (50th percentile), and the third quartile (Q3) represents the
75th percentile.
❖ Quartiles work well with skewed data distributions and are robust in the presence of non-
normality, making them suitable for a wide range of datasets.
❖ Quartiles provide a quick way to get a sense of the spread and central tendency of data, making
them valuable for initial data exploration.
❖ Quartiles are commonly used in the construction of box plots, which provide a visual
representation of the data's spread and central tendency.
Demerits of Quartiles:
❖ Quartiles divide the data into four parts, which may not provide as much detail as other
measures, like percentiles or histograms, in describing the distribution.
❖ Quartiles require the data to be sorted in ascending order. This can be cumbersome for large
datasets and may introduce bias if the data is not well-organized.
❖ In cases where the data distribution is highly skewed or multimodal, quartiles alone may not
fully represent the complexity of the data.
❖ Quartiles are primarily descriptive statistics and may not be appropriate for more advanced
statistical analyses that require precise measures of central tendency or dispersion.
❖ Quartiles may not provide stable estimates for small sample sizes, as they rely on dividing the
data into quarters, which can be affected by a limited number of data points.
Measures of Dispersion
Measures of central tendency yield information about the center or middle part of a data set. However,
business researchers can use another group of analytic tools, measures of variability, to describe the
spread or the dispersion of a set of data. Using measures of variability in conjunction with measures of
central tendency makes possible a more complete numerical description of the data.
The concept of dispersion is related to the extent of scatter or variability in observations. The variability,
in an observation, is often measured as its deviation from a central value.
“The measure of the degree to which numerical data tend to spread about an average value is called
the measure of variability or dispersion.”
Range
It is the simplest measure of dispersion. For ungrouped data, the range is the difference between the
highest and lowest values in a set of data.
For grouped data the range is defined as the difference between upper limit of the highest class and
the lower limit of the lowest class.
Merits:
▪ It is the simplest measure of dispersion to understand and the easiest to compute
▪ It gives a quick, rough indication of the variability of the data
Demerits:
▪ It is based only on the two extreme observations and ignores how the remaining values are
distributed between them
▪ It is highly sensitive to outliers and to fluctuations of sampling
▪ It cannot be computed for distributions with open-ended classes
Interquartile Range
Another measure of variability is the interquartile range. The interquartile range is the range of values
between the first and third quartile. Essentially, it is the range of the middle 50% of the data and is
determined by computing the value of Q3 - Q1.
The interquartile range is especially useful in situations where data users are more interested in values
toward the middle and less interested in extremes. In describing a real estate housing market, Realtors
might use the interquartile range as a measure of housing prices when describing the middle half of the
market for buyers who are interested in houses in the midrange. In addition, the interquartile range is
used in the construction of box-and-whisker plots.
Many times, the interquartile range is reduced to the semi-interquartile range or quartile deviation, as
shown below:
$$\text{Semi-interquartile range (Quartile Deviation)} = \frac{Q_3 - Q_1}{2}$$
It may be noted that interquartile range or the quartile deviation is an absolute measure of dispersion.
It can be changed into a relative measure of dispersion as follows:
$$\text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
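Putting the three measures together, a short Python sketch with illustrative quartile values (e.g. obtained by the methods above):

```python
# Interquartile range, quartile deviation, and coefficient of QD
q1, q3 = 14.0, 30.0                 # hypothetical first and third quartiles

iqr = q3 - q1                       # range of the middle 50% of the data
qd = (q3 - q1) / 2                  # semi-interquartile range
coeff_qd = (q3 - q1) / (q3 + q1)    # relative (unit-free) measure

print(iqr, qd, round(coeff_qd, 3))  # 16.0 8.0 0.364
```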
Merits of Quartile Deviation
▪ It is simple to understand and easy to calculate
▪ Since it is based only on the middle 50% of the observations, it is not affected by extreme values
▪ It can be computed for distributions with open-ended classes
Mean Absolute Deviation
The mean absolute deviation (MAD) is the average of the absolute values of the deviations around
the mean for a set of numbers.
❖ Measures the ‘average’ distance of each observation away from the mean of the data
❖ Gives an equal weight to each observation
❖ Generally, more sensitive than the range or interquartile range, since a change in any value will
affect it
For ungrouped data:
$$MAD = \frac{\sum |X - \bar{X}|}{N}$$
For grouped data:
$$MAD = \frac{\sum f\,|X - \bar{X}|}{\sum f}$$
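A minimal Python sketch of both formulas with made-up data:

```python
# Mean absolute deviation for ungrouped data (hypothetical values)
x = [4, 7, 9, 12, 18]
xbar = sum(x) / len(x)                          # mean = 10.0
mad = sum(abs(xi - xbar) for xi in x) / len(x)  # Σ|X − X̄| / N
print(mad)                                      # (6+3+1+2+8)/5 = 4.0

# Grouped version: MAD = Σ f|X − X̄| / Σ f (hypothetical mid-values)
mid = [5, 15, 25]
f = [2, 5, 3]
n = sum(f)
m = sum(fi * xi for fi, xi in zip(f, mid)) / n  # grouped mean = 16.0
mad_g = sum(fi * abs(xi - m) for fi, xi in zip(f, mid)) / n
print(mad_g)                                    # (2·11 + 5·1 + 3·9)/10 = 5.4
```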
❖ A major advantage of mean deviation is that it is simple to understand and easy to calculate.
❖ It takes into consideration each and every item in the distribution. As a result, a change in the
value of any item will have its effect on the magnitude of mean deviation.
❖ The values of extreme items have less effect on the value of the mean deviation.
❖ As deviations are taken from a central value, it is possible to have meaningful comparisons of
the formation of different distributions.
The main limitation of the mean absolute deviation is that it ignores the signs of the deviations, which
makes it unsuitable for further algebraic treatment. In view of this limitation, it is seldom used in
business studies. A better measure, known as the standard deviation, is more frequently used.
Variance:
Variance is a measure of variability based on the squared deviations of the observed values in the data
set about the mean value.
The variance is the average of the squared deviations about the arithmetic mean for a set of numbers.
The population variance is denoted by $\sigma^2$ (sigma squared) and the sample variance is denoted by $s^2$.
Calculation of Variance
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} \qquad s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
For grouped data, each squared deviation is weighted by its frequency, e.g. $\sigma^2 = \frac{\sum f(X - \mu)^2}{N}$.
Merits of Variance
▪ Variance indicates how much individual data points deviate from the mean or average. This is
valuable in understanding the spread of data.
▪ Variance is a critical tool in decision-making, particularly in quality control and finance, where
understanding and managing variability are essential.
▪ It helps analysts and researchers gain insights into the consistency or variability of data, which
can lead to more informed conclusions.
▪ Variance is a fundamental component in many statistical calculations, such as standard
deviation and coefficient of variation.
▪ Variance is sensitive to extreme values (outliers), making it a valuable tool for identifying data
points that significantly differ from the rest.
Demerits of Variance
▪ Variance is measured in the square of the original units (e.g., square meters, square dollars),
which can be challenging to interpret and compare directly with the original data.
▪ Variance can be heavily influenced by the presence of outliers and the distribution of data. In
some cases, it may not accurately reflect the central tendency of the data.
▪ Variance is not a robust statistic, meaning it can be greatly affected by small changes or
fluctuations in the data, particularly in smaller sample sizes.
▪ Variance treats each data point independently and does not consider relationships between
variables. It may not capture important dependencies in multivariate data.
▪ Because variance squares the differences between data points and the mean, it can give more
weight to extreme values, potentially leading to misleading interpretations.
▪ Variance assumes that data follows a normal distribution. When dealing with non-normally
distributed data, other measures of dispersion (e.g., interquartile range) may be more
appropriate
Standard Deviation:
It is the most widely used measure of dispersion. The standard deviation is defined as the positive
square root of the arithmetic mean of the squared deviations of all the observations from their mean;
in other words, it is the square root of the mean squared deviation, and it can be computed by taking
the positive square root of the variance. The population standard deviation is denoted by $\sigma$ (sigma)
and the sample standard deviation is denoted by $s$.
Direct Formula
Population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$ (ungrouped), $\sigma = \sqrt{\frac{\sum f(X - \mu)^2}{N}}$ (grouped)
Sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$ (ungrouped), $s = \sqrt{\frac{\sum f(X - \bar{X})^2}{n - 1}}$ (grouped)

Computational Formula
Population: $\sigma = \sqrt{\frac{\sum X^2 - (\sum X)^2/N}{N}}$ (ungrouped), $\sigma = \sqrt{\frac{\sum fX^2 - (\sum fX)^2/N}{N}}$ (grouped)
Sample: $s = \sqrt{\frac{\sum x^2 - (\sum x)^2/n}{n - 1}}$ (ungrouped), $s = \sqrt{\frac{\sum fx^2 - (\sum fx)^2/n}{n - 1}}$ (grouped)
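A short Python sketch confirming, on made-up data, that the definitional and computational forms of the population standard deviation agree, and showing the n − 1 divisor for the sample version:

```python
import math

# Hypothetical observations
X = [4, 7, 9, 12, 18]
N = len(X)
mu = sum(X) / N                                      # mean = 10.0

# Definitional form: σ = √(Σ(X − µ)² / N)
sigma_def = math.sqrt(sum((x - mu) ** 2 for x in X) / N)

# Computational form: σ = √((ΣX² − (ΣX)²/N) / N)
sigma_comp = math.sqrt((sum(x * x for x in X) - sum(X) ** 2 / N) / N)

# Sample standard deviation uses n − 1 in the denominator
s = math.sqrt(sum((x - mu) ** 2 for x in X) / (N - 1))

print(sigma_def, sigma_comp, s)   # the two σ forms agree (≈ 4.775); s ≈ 5.339
```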
The standard deviation is a frequently used measure of dispersion. It enables us to determine how
far individual items in a distribution deviate from its mean. In a symmetrical, bell-shaped (normal)
distribution:
(i) About 68 per cent of the values in the population fall within ± 1 standard deviation from the mean.
(ii) About 95 per cent of the values will fall within ± 2 standard deviations from the mean.
(iii) About 99.7 per cent of the values will fall within ± 3 standard deviations from the mean.
Merits of Standard Deviation
▪ Standard deviation tells you how spread out or dispersed the data points are in a dataset. A
higher standard deviation indicates greater variability, while a lower standard deviation
suggests more consistency or precision.
▪ Standard deviation is expressed in the same units as the data, making it easy to interpret. For
example, if you are analyzing a dataset of exam scores in points, the standard deviation will
also be in points.
▪ Standard deviation allows for easy comparison of variability between different datasets or
groups. You can quickly assess which dataset has more or less variation.
▪ Standard deviation is a fundamental component in many other statistical calculations, such as
confidence intervals, z-scores, and hypothesis testing. It helps in making informed decisions
and drawing meaningful conclusions.
▪ Standard deviation is sensitive to outliers, making it valuable for identifying extreme values
that might skew the overall distribution.
Demerits of Standard Deviation
▪ While the sensitivity to outliers can be an advantage, it can also be a drawback. Extreme outliers
can greatly influence the standard deviation, potentially leading to an inaccurate representation
of the data's central tendency and variability.
▪ Standard deviation is most informative when the data follows a normal distribution (bell-shaped
curve). In cases where the data is not normally distributed, the standard deviation may not
provide a complete picture of the data's variability.
▪ Standard deviation measures the spread of data around the mean but does not account for
skewness (asymmetry) or kurtosis (peakedness) in the data distribution. Other statistics, like
skewness and kurtosis coefficients, are needed to describe these aspects.
▪ The interpretation of standard deviation values often requires context. For instance, a standard
deviation of 5 might be considered high for the exam scores of a highly competitive class but
low for a class with consistently low scores.
▪ Standard deviation can be influenced by the sample size. Smaller sample sizes may yield less
stable standard deviation estimates, which can make comparisons between datasets with
different sample sizes challenging.
Coefficient of Variation
A measure of relative variability that expresses the standard deviation as a percentage of the mean.
The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a useful
statistic for comparing the degree of variation from one data series to another, even if the means are
drastically different from each other.
In the investing world, the coefficient of variation allows you to determine how much volatility (risk)
you are assuming in comparison to the amount of return you can expect from your investment.
$$\text{Coefficient of Variation (CV)} = \frac{\sigma}{\mu} \times 100$$
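For example, the following sketch compares two hypothetical investments; the second has the smaller standard deviation but the larger CV, i.e., more risk per unit of expected return:

```python
# Comparing the relative variability of two hypothetical investments
mean_a, sd_a = 12.0, 3.0     # mean return 12%, standard deviation 3%
mean_b, sd_b = 5.0, 2.0      # mean return 5%, standard deviation 2%

cv_a = sd_a / mean_a * 100   # 25.0
cv_b = sd_b / mean_b * 100   # 40.0

# B has the smaller SD but the larger CV: more volatility per unit of return
print(cv_a, cv_b)
```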
1. Relative Comparison: CV allows for the comparison of the variability of different data sets,
even if they have different units or scales. This is particularly useful when comparing data
from different contexts.
2. Standardized Measure: CV standardizes the measure of variability, making it easy to
interpret. A higher CV indicates greater relative variability, while a lower CV suggests less
relative variability.
3. Useful for Risk Assessment: In finance and investment, CV is often used to assess the risk
associated with different investment options. It helps investors understand the risk-to-reward
ratio.
4. Applicable to Different Data Types: CV can be applied to various types of data, including
financial data, biological data, and more, making it versatile in different fields.
Skewness:
Skewness describes asymmetry from the normal distribution in a set of statistical data. Skewness can
come in the form of "negative skewness" or "positive skewness", depending on whether data points are
skewed to the left (negative skew) or to the right (positive skew) of the data average.
Tests of Skewness
In order to ascertain whether a distribution is skewed or not, the following tests may be applied.
Skewness is present if:
• The values of the mean, median and mode do not coincide
• The quartiles are not equidistant from the median
• The frequencies are not equally distributed on either side of the point of maximum frequency,
i.e., the curve is not symmetrical
• Measures of skewness help us to know to what degree and in which direction (positive or
negative) the frequency distribution has a departure from symmetry.
• Positive or negative skewness can be detected graphically (as below) depending on whether the
right tail or the left tail is longer but, we do not get idea of the magnitude
• Hence some statistical measures are required to find the magnitude of lack of symmetry
Positively skewed distribution: Mean > Median > Mode
Symmetrical distribution: Mean = Median = Mode
Negatively skewed distribution: Mean < Median < Mode
Coefficient of Skewness:
Karl Pearson's coefficient of skewness is
$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$$
When the mode is ill-defined, the form based on the empirical relation is used instead:
$$S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$
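A minimal Python sketch of both coefficients, with hypothetical summary values chosen to satisfy the empirical relation Mean − Mode = 3(Mean − Median), so the two forms agree:

```python
# Karl Pearson's coefficients of skewness (hypothetical summary values)
mean, median, mode, sd = 45.0, 42.0, 36.0, 12.0

sk_mode = (mean - mode) / sd            # first coefficient:  9/12 = 0.75
sk_median = 3 * (mean - median) / sd    # second coefficient: 9/12 = 0.75

print(sk_mode, sk_median)  # both positive ⇒ positively skewed
```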
Kurtosis:
This is another measure of the shape of a frequency curve. While skewness refers to the extent of lack
of symmetry, kurtosis refers to the extent to which a frequency curve is peaked. Kurtosis is a Greek
word which means bulginess.
Kurtosis describes the amount of peakedness of a distribution. Distributions that are high and thin are
referred to as leptokurtic distributions. Distributions that are flat and spread out are referred to as
platykurtic distributions. Between these two types are distributions that are more “normal” in shape,
referred to as mesokurtic distributions.
▪ When the peak of a curve becomes relatively high then that curve is called Leptokurtic.
▪ When the curve is flat-topped, then it is called Platykurtic.
▪ Since normal curve is neither very peaked nor very flat topped, so it is taken as a basis for
comparison. The normal curve is called Mesokurtic.
Calculation of Kurtosis
Kurtosis is commonly measured by the moment coefficient $\beta_2 = \frac{\mu_4}{\mu_2^2}$, where $\mu_2$ and $\mu_4$ are the
second and fourth central moments of the distribution. For the normal (mesokurtic) curve $\beta_2 = 3$;
a leptokurtic curve has $\beta_2 > 3$ and a platykurtic curve has $\beta_2 < 3$.
Difference between Skewness and Kurtosis
1. The characteristic of a frequency distribution that ascertains its symmetry about the mean is
called skewness. On the other hand, kurtosis means the relative pointedness of the standard
bell curve defined by the frequency distribution.
2. Skewness is a measure of the degree of lopsidedness in the frequency distribution. Conversely,
kurtosis is a measure of degree of peakedness in the frequency distribution.
3. Skewness is an indicator of lack of symmetry, i.e., both left and right sides of the curve are
unequal, with respect to the central point. As against this, kurtosis is a measure of data, that is
either peaked or flat, with respect to the probability distribution.
4. Skewness shows how much, and in which direction, the values deviate from the mean. In
contrast, kurtosis explains how tall and sharp the central peak is.
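A short Python sketch computing the moment coefficient of kurtosis for a small, made-up data set:

```python
# Moment coefficient of kurtosis: β₂ = µ₄ / µ₂²  (β₂ = 3 for the normal curve)
x = [2, 4, 4, 4, 5, 5, 7, 9]                     # hypothetical data
n = len(x)
m = sum(x) / n                                    # mean = 5.0

mu2 = sum((xi - m) ** 2 for xi in x) / n          # second central moment
mu4 = sum((xi - m) ** 4 for xi in x) / n          # fourth central moment

beta2 = mu4 / mu2 ** 2
print(beta2)                                      # ≈ 2.78 < 3 ⇒ platykurtic
```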