Descriptive Statistics
• The term “descriptive statistics” refers to the analysis, summary, and
presentation of findings related to a data set derived from a sample or
entire population.
• Descriptive statistics comprises three main categories – Frequency
Distribution, Measures of Central Tendency, and Measures of Variability.
Frequency Distribution
• Used for both quantitative and qualitative data, frequency distribution
depicts the frequency or count of the different outcomes in a data set
or sample.
• The frequency distribution is normally presented in a table or a graph.
Each entry in the table or graph is accompanied by the count or
frequency of the values’ occurrences in an interval, range, or specific
group.
• Common charts and graphs used in frequency distribution presentation
and visualization include bar charts, histograms, pie charts, and line
charts.
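As a minimal sketch, a frequency distribution for qualitative data can be tallied with Python's standard library. The `responses` list is made-up illustrative data, not taken from these slides.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data only).
responses = ["bus", "car", "car", "bike", "bus", "car", "walk", "car"]

# Counter tallies how many times each distinct outcome occurs.
frequency = Counter(responses)

# Print the distribution as a simple table, most frequent outcome first,
# showing both the count and the share of all observations.
for outcome, count in frequency.most_common():
    share = count / len(responses)
    print(f"{outcome:<6}{count:>3}{share:>8.1%}")
```

The same counts could then be passed to a plotting library to draw a bar chart or pie chart.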
Central Tendency
• Central tendency refers to a dataset’s descriptive summary using a
single value reflecting the center of the data distribution.
• Measures of central tendency are also known as measures of central
location.
• The mean, median, and mode are the measures of central tendency.
• The mean, considered the most popular measure of central tendency, is
the arithmetic average: the sum of the values divided by their count.
• The median refers to the middle value of a data set arranged in ascending order.
• The mode refers to the score or value that is most frequent in a data
set.
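The three measures can be computed with Python's built-in `statistics` module. This is only a sketch; the `scores` list is hypothetical data chosen so that each measure is easy to check by hand.

```python
import statistics

# Hypothetical test scores (illustrative data only).
scores = [70, 85, 85, 90, 100]

mean_score = statistics.mean(scores)      # sum of the values divided by their count
median_score = statistics.median(scores)  # middle value once the data are sorted
mode_score = statistics.mode(scores)      # most frequently occurring value

print(mean_score, median_score, mode_score)
```

Here the mean is 430 / 5 = 86, while the median and the mode are both 85.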
Variability
• A measure of variability is a summary statistic reflecting the degree of
dispersion in a sample.
• The measures of variability determine how far apart the data points
appear to fall from the center.
• Dispersion, spread, and variability all refer to and denote the range
and width of the distribution of values in a data set.
• The range, standard deviation, and variance each describe a different
aspect of the spread.
• The range gives an idea of the degree of dispersion: it is the distance
between the highest and lowest values within a data set.
• The standard deviation measures the typical distance between a value
in a data set and the mean of the same data set. It is the square root
of the variance.
• The variance reflects the degree of the spread and is essentially an
average of the squared deviations.
Frequency Distribution
• The distribution is a summary of the frequency of individual values or
ranges of values for a variable.
• One of the most common ways to describe a single variable is with
a frequency distribution.
• Frequency distributions can be depicted in two ways, as a table or as
a graph.
• The table below shows an age frequency distribution with five
categories of age ranges defined.
• The same frequency distribution can be depicted in a graph.

Category              Percent
Under 35 years old    9%
36–45                 21%
46–55                 45%
56–65                 19%
66+                   6%
Size of the winter coat    38  39  40  42  43  44  45
Total number of shirts     33  11  22  55  44  11  22
• The range in statistics for a given data set is the difference between
the highest and lowest values.
• For example, if the given data set is {2,5,8,10,3}, then the range will
be 10 – 2 = 8.
• Thus, the range could also be defined as the difference between the
highest observation and lowest observation.
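The range example above can be checked in a couple of lines of Python, using the same data set:

```python
# Data set from the example above.
data = [2, 5, 8, 10, 3]

# Range = highest observation minus lowest observation.
data_range = max(data) - min(data)
print(data_range)  # 10 - 2 = 8
```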
• Standard Deviation
• The Standard Deviation is a measure of how spread out numbers are.
• Its symbol is σ (the Greek letter sigma).
• The formula is easy: it is the square root of the Variance.
• Variance
• The Variance is defined as the average of the squared differences from
the Mean.
• To calculate the variance, follow these steps:
• Work out the Mean (the simple average of the numbers).
• Then for each number: subtract the Mean and square the result (the
squared difference).
• Then work out the average of those squared differences.
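The three steps translate almost directly into code. The sketch below is one possible implementation; the function names and the sample list are illustrative assumptions. It divides by the number of values, matching the "average of the squared differences" definition used here (the population variance). The sample variance, which divides by one less than the number of values, is a different quantity.

```python
import math

def population_variance(values):
    """Average of the squared differences from the mean."""
    # Step 1: work out the mean.
    mean = sum(values) / len(values)
    # Step 2: for each number, subtract the mean and square the result.
    squared_diffs = [(v - mean) ** 2 for v in values]
    # Step 3: average those squared differences.
    return sum(squared_diffs) / len(squared_diffs)

def population_std_dev(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))

# Hypothetical data with mean 5, chosen so the result is easy to verify by hand.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(sample), population_std_dev(sample))
```

For this list the squared differences sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2.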
You and your friends have just measured the heights of your dogs (in
millimeters):
• The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm
and 300mm.
• Find out the Mean, the Variance, and the Standard Deviation.
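• Your first step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300) / 5
= 1970 / 5
= 394
• Next, take each height's difference from the Mean, square it, and
average the results to get the Variance:
Variance = (206² + 76² + (−224)² + 36² + (−94)²) / 5
= (42436 + 5776 + 50176 + 1296 + 8836) / 5
= 108520 / 5
= 21704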
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation
σ = √21704
= 147.32...
= 147 (to the nearest mm)
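The dog-height results can be cross-checked with Python's `statistics` module. This is a sketch that assumes the population formulas (`pvariance` and `pstdev`, which divide by the number of values), since that is the definition the example uses.

```python
import statistics

# Dog heights at the shoulders, in millimeters (from the example above).
heights_mm = [600, 470, 170, 430, 300]

mean_height = statistics.mean(heights_mm)    # 1970 / 5
variance = statistics.pvariance(heights_mm)  # population variance
std_dev = statistics.pstdev(heights_mm)      # square root of the population variance

print(mean_height, variance, round(std_dev))
```

Using `statistics.variance` and `statistics.stdev` instead would give the sample versions, which divide by one less than the number of values and therefore produce larger numbers.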
Probability and stats
• Probability implies 'likelihood' or 'chance'.
• When an event is certain to happen then the probability of
occurrence of that event is 1 and when it is certain that the event
cannot happen then the probability of that event is 0.
• To calculate a probability, we need the number of favorable cases (m)
and the total number of equally likely cases (n); the probability is
m/n. This can be explained using the following example.
• A coin is tossed. What is the probability of getting a head?
• Total number of equally likely outcomes (n) = 2 (i.e. head or tail)
• Number of outcomes favorable to head (m) = 1
• P(head) = 1/2
• A random experiment is a mechanism that produces a definite
outcome that cannot be predicted with certainty.
• The sample space associated with a random experiment is the set of
all possible outcomes.
• An event is a subset of the sample space.
• Construct a sample space for the experiment that consists of tossing a
single coin.
• The outcomes could be labeled h for heads and t for tails. Then the
sample space is the set S={h,t}.
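These definitions map naturally onto Python sets. The sketch below models the single coin toss; the helper name `probability` is an illustrative choice, and the calculation assumes all outcomes in the sample space are equally likely.

```python
from fractions import Fraction

# Sample space for tossing a single coin: h = heads, t = tails.
sample_space = {"h", "t"}

# An event is a subset of the sample space; here, "the coin shows heads".
event_head = {"h"}
is_valid_event = event_head.issubset(sample_space)

def probability(event, space):
    """Favorable outcomes (m) over equally likely outcomes (n)."""
    return Fraction(len(event & space), len(space))

p_head = probability(event_head, sample_space)          # m = 1, n = 2
p_certain = probability(sample_space, sample_space)     # certain event
p_impossible = probability(set(), sample_space)         # impossible event

print(is_valid_event, p_head, p_certain, p_impossible)
```

The certain event (the whole sample space) has probability 1 and the impossible event (the empty set) has probability 0, in line with the bullets above.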
DISTRIBUTIONS
One way is to visualize the
grades and see if you can find a
trend in the data.
• The graph that you have plotted is called the frequency distribution of
the data.
• You see that there is a smooth curve like structure that defines our
data, but do you notice an anomaly?
• We have an abnormally low frequency at a particular score range.
• So the best guess is that any missing values belong in that score
range, which would fill in the dent in the distribution.
• For any data scientist, student, or practitioner, distributions are a
must-know concept. They provide the basis for analytics and inferential
statistics.
• While the concept of probability gives us the mathematical
calculations, distributions help us to actually visualize what’s
happening underneath.
Common Data Types
• Discrete Data, as the name suggests, can take only specified values. For example,
when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or
2.45.
• Continuous Data can take any value within a given range. The range may be
finite or infinite. Examples include a girl's weight or height, or the length of a road.
The weight of a girl could be 54 kg, 54.5 kg, or 54.5436 kg.
Types of Distributions
• Bernoulli Distribution
• Uniform Distribution
• Binomial Distribution
• Normal Distribution
• Poisson Distribution
• Exponential Distribution
Bernoulli Distribution
• A Bernoulli distribution has only two possible outcomes, namely 1
(success) and 0 (failure), and a single trial.
• So the random variable X which has a Bernoulli distribution can take
value 1 with the probability of success, say p, and the value 0 with the
probability of failure, say q or 1-p.
• In a coin toss, for example, the occurrence of a head denotes success,
and the occurrence of a tail denotes failure.
• Probability of getting a head = 0.5 = Probability of getting a tail, since
a fair coin has two equally likely outcomes.
• The probability mass function is given by: P(X = x) = p^x (1 − p)^(1 − x), where x ∈ {0, 1}.
• It can also be written as:
P(X = x) = 1 − p, if x = 0
P(X = x) = p, if x = 1
The probabilities of success and failure need not be equal, as in the result of a
fight between me and the Undertaker. He is pretty much certain to win, so in this case
the probability of my success is 0.15 while the probability of my failure is 0.85.
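The Bernoulli probability mass function can be written as a short Python function. This is a sketch: the name `bernoulli_pmf` and the input checks are illustrative choices, and `p_win = 0.15` is the success probability from the example above.

```python
import math

def bernoulli_pmf(x, p):
    """P(X = x) = p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability between 0 and 1")
    return p ** x * (1 - p) ** (1 - x)

p_win = 0.15  # probability of success in the example above

p_success = bernoulli_pmf(1, p_win)  # reduces to p
p_failure = bernoulli_pmf(0, p_win)  # reduces to 1 - p

# The two probabilities cover every possible outcome, so they sum to 1.
print(p_success, p_failure, math.isclose(p_success + p_failure, 1.0))
```

Setting x = 1 leaves only the factor p, and setting x = 0 leaves only 1 − p, which is why the single formula covers both outcomes.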
The expected value of any distribution is the mean of that distribution. The expected
value of a random variable X from a Bernoulli distribution is found as follows: