Descriptive Statistics


Class 3

Descriptive Statistics
• The term “descriptive statistics” refers to the analysis, summary, and
presentation of findings related to a data set derived from a sample or
entire population.
• Descriptive statistics comprises three main categories – Frequency
Distribution, Measures of Central Tendency, and Measures of Variability.
Frequency Distribution
• Used for both quantitative and qualitative data, frequency distribution
depicts the frequency or count of the different outcomes in a data set
or sample.
• The frequency distribution is normally presented in a table or a graph.
Each entry in the table or graph is accompanied by the count or
frequency of the values’ occurrences in an interval, range, or specific
group.
• Common charts and graphs used in frequency distribution presentation
and visualization include bar charts, histograms, pie charts, and line
charts.
Central Tendency
• Central tendency refers to a dataset’s descriptive summary using a
single value reflecting the center of the data distribution.
• Measures of central tendency are also known as measures of central
location.
• The mean, median, and mode are the measures of central tendency.
• The mean, considered the most popular measure of central tendency, is
the arithmetic average of the values in a data set.
• The median refers to the middle score for a data set in ascending order.
• The mode refers to the score or value that is most frequent in a data
set.
Variability
• A measure of variability is a summary statistic reflecting the degree of
dispersion in a sample.
• The measures of variability determine how far apart the data points
appear to fall from the center.
• Dispersion, spread, and variability all denote how widely the values
in a data set are distributed.
• The range, standard deviation, and variance each describe a different
aspect of the spread.
• The range gives an idea of the distance between the highest and
lowest values within a data set.
• The standard deviation indicates how far, on average, the values in a
data set lie from the mean value of the same data set.
• The variance reflects the degree of the spread and is essentially an
average of the squared deviations.
Frequency Distribution
• The distribution is a summary of the frequency of individual values or
ranges of values for a variable.
• One of the most common ways to describe a single variable is with
a frequency distribution.
• Frequency distributions can be depicted in two ways, as a table or as
a graph.
• The table below shows an age frequency distribution with five
categories of age ranges defined.

Category              Percent
Under 35 years old    9%
36–45                 21%
46–55                 45%
56–65                 19%
66+                   6%

• The same frequency distribution can be depicted in a graph.

This type of graph is often referred to as a histogram or bar chart.
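As a rough sketch of how such a table could be produced, the snippet below counts values per category and converts the counts to percentages. The `ages` list and the `age_category` helper are hypothetical, invented only for illustration, and placing an age of exactly 35 in the first category is an assumption.

```python
from collections import Counter

# Hypothetical ages, invented only to illustrate the counting step.
ages = [29, 38, 41, 47, 50, 52, 53, 58, 61, 70]

def age_category(age):
    """Map an age to one of the five categories from the table.

    Assumption: an age of exactly 35 goes into the first category.
    """
    if age <= 35:
        return "Under 35 years old"
    if age <= 45:
        return "36-45"
    if age <= 55:
        return "46-55"
    if age <= 65:
        return "56-65"
    return "66+"

# Frequency distribution: how many values fall in each category.
counts = Counter(age_category(age) for age in ages)

# The same distribution expressed as percentages of the total.
percents = {category: 100 * count / len(ages) for category, count in counts.items()}

for category, count in counts.items():
    print(f"{category:<20}{count:>3}{percents[category]:>6.0f}%")
```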


Mean
• The mean is the average of the given numbers and is calculated by
dividing the sum of the given numbers by the total number of numbers.
Mean = (Sum of all the observations/Total number of observations)
What is the mean of 2, 4, 6, 8 and 10?
• First, add all the numbers.
• 2 + 4 + 6 + 8 + 10 = 30
• Now divide by 5 (total number of observations).
• Mean = 30/5 = 6
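The same calculation can be sketched in a few lines of Python. The function name `mean` is just an illustrative choice.

```python
def mean(values):
    """Sum of all the observations divided by the number of observations."""
    return sum(values) / len(values)

result = mean([2, 4, 6, 8, 10])
print(result)  # 6.0
```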
Median
• The value of the middle-most observation obtained after arranging
the data in ascending order is called the median of the data.
Median Formula When n is Odd
Median = [(n + 1)/2]th term
Median Formula When n is Even
Median = [(n/2)th term + ((n/2) + 1)th term]/2
• Find the median of the following set.
• {42, 40, 50, 60, 35, 58, 32}
• Step 1: Arrange the data items in ascending order.
• Original set: {42, 40, 50, 60, 35, 58, 32}
• Ordered Set: {32, 35, 40, 42, 50, 58, 60}
• Step 2: Count the number of observations. If the number of
observations is odd, then we will use the following formula: Median =
[(n + 1)/2]th term
• Step 3: Calculate the median using the formula.
• Median = [(n + 1)/2]th term
• = [(7 + 1)/2]th term = 4th term = 42
• Median = 42
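A minimal sketch of both median formulas (odd and even n) is shown below. The function name `median` is illustrative, and the even-sized example set is hypothetical. Note that Python lists are 0-indexed, so the [(n + 1)/2]th term sits at index n // 2.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)          # Step 1: arrange in ascending order
    n = len(ordered)                  # Step 2: count the observations
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the [(n + 1)/2]th term, which is index n // 2 when counting from 0.
        return ordered[mid]
    # Even n: average of the (n/2)th and ((n/2) + 1)th terms.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median([42, 40, 50, 60, 35, 58, 32])
even_result = median([2, 4, 6, 8])   # hypothetical even-sized example
print(odd_result, even_result)
```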
Mode
In statistics, the mode or modal value of a given set of data is the
value that occurs most frequently in that set.

Size of the winter coat    38   39   40   42   43   44   45
Total number of shirts     33   11   22   55   44   11   22
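For data given as a table of values and counts, like the one above, the mode is the value with the highest count. A small sketch, using the sizes and totals exactly as they appear in the table (variable names are illustrative):

```python
sizes = [38, 39, 40, 42, 43, 44, 45]
totals = [33, 11, 22, 55, 44, 11, 22]

# Pair each size with its count.
frequency = dict(zip(sizes, totals))

# The mode is the size whose count is largest.
mode_size = max(frequency, key=frequency.get)
print(mode_size, frequency[mode_size])
```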
Range
• The range in statistics for a given data set is the difference between
the highest and lowest values.
• For example, if the given data set is {2,5,8,10,3}, then the range will
be 10 – 2 = 8.
• Thus, the range could also be defined as the difference between the
highest observation and lowest observation.
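In code, the range of the example data set is a one-line calculation (the name `data_range` is illustrative, chosen to avoid shadowing Python's built-in `range`):

```python
data = [2, 5, 8, 10, 3]

# Range = highest observation - lowest observation.
data_range = max(data) - min(data)
print(data_range)  # 8
```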
Standard Deviation
• The Standard Deviation is a measure of how spread out numbers are.
• Its symbol is σ (the Greek letter sigma).
• The formula is easy: it is the square root of the Variance.
Variance
• The Variance is defined as the average of the squared differences
from the Mean.
• To calculate the variance, follow these steps:
1. Work out the Mean (the simple average of the numbers).
2. Then for each number, subtract the Mean and square the result (the
squared difference).
3. Then work out the average of those squared differences.
You and your friends have just measured the heights of your dogs (in
millimeters):
• The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm
and 300mm.
• Find out the Mean, the Variance, and the Standard Deviation.
• Your first step is to find the Mean
Mean = (600 + 470 + 170 + 430 + 300)/5 = 1970/5 = 394

Now we calculate each dog's difference from the Mean: 206, 76, −224, 36 and −94.


Variance

σ^2 = (206^2 + 76^2 + (−224)^2 + 36^2 + (−94)^2)/5
= (42436 + 5776 + 50176 + 1296 + 8836)/5
= 108520/5
= 21704
And the Standard Deviation is just the square root of Variance, so:

Standard Deviation
σ = √21704
= 147.32...
= 147 (to the nearest mm)
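The dog-height example can be checked with a short script that follows the same steps: mean, squared differences, their average, and the square root. This is a sketch of the population variance (dividing by n, as in the worked example), not the sample variance.

```python
import math

heights = [600, 470, 170, 430, 300]  # dog heights in millimeters

# Step 1: the mean.
mean = sum(heights) / len(heights)

# Step 2: squared difference of each height from the mean.
squared_diffs = [(h - mean) ** 2 for h in heights]

# Step 3: variance is the average of the squared differences.
variance = sum(squared_diffs) / len(heights)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev))
```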
Probability and stats
• Probability means 'likelihood' or 'chance'.
• When an event is certain to happen, the probability of
occurrence of that event is 1, and when it is certain that the event
cannot happen, the probability of that event is 0.
• Thus, to calculate the probability we need the number of
favorable cases (m) and the total number of equally likely cases (n);
the probability is then m/n. This can be explained using the following example.
• A coin is tossed. What is the probability of getting a head?
• Total number of equally likely outcomes (n) = 2 (i.e. head or tail)
• Number of outcomes favorable to head (m) = 1
• P(head) = 1/2
• A random experiment is a mechanism that produces a definite
outcome that cannot be predicted with certainty.
• The sample space associated with a random experiment is the set of
all possible outcomes.
• An event is a subset of the sample space.
• Construct a sample space for the experiment that consists of tossing a
single coin.
• The outcomes could be labeled h for heads and t for tails. Then the
sample space is the set S={h,t}.
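The ideas of sample space, event, and probability can be sketched with Python sets. The `probability` helper is hypothetical and assumes every outcome in the sample space is equally likely.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) = favorable outcomes / equally likely outcomes.

    Assumes all outcomes in sample_space are equally likely.
    """
    favorable = event & sample_space   # an event is a subset of the sample space
    return Fraction(len(favorable), len(sample_space))

sample_space = {"h", "t"}              # tossing a single coin

p_head = probability({"h"}, sample_space)
p_certain = probability(sample_space, sample_space)   # certain event
p_impossible = probability(set(), sample_space)       # impossible event

print(p_head, p_certain, p_impossible)
```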
DISTRIBUTIONS
• One way to understand a data set, such as a set of grades, is to
visualize it and see if you can find a trend in the data.
• The graph that you have plotted is called the frequency distribution of
the data.
• You see that there is a smooth curve-like structure that defines our
data, but do you notice an anomaly?
• We have an abnormally low frequency at a particular score range.
• So the best guess is that some values are missing there; filling them
in would remove the dent in the distribution.
• For any data scientist, student, or practitioner, distribution is a
must-know concept. It provides the basis for analytics and inferential
statistics.
• While the concept of probability gives us the mathematical
calculations, distributions help us to actually visualize what’s
happening underneath.
Common Data Types
• Discrete Data, as the name suggests, can take only specified values. For example,
when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or
2.45.
• Continuous Data can take any value within a given range. The range may be
finite or infinite. Examples include a girl’s weight or height, or the length of a road.
The weight of a girl can be any value, such as 54 kg, 54.5 kg, or 54.5436 kg.

Types of Distributions
• Bernoulli Distribution
• Uniform Distribution
• Binomial Distribution
• Normal Distribution
• Poisson Distribution
• Exponential Distribution
Bernoulli Distribution
• A Bernoulli distribution has only two possible outcomes, namely 1
(success) and 0 (failure), and a single trial.
• So the random variable X which has a Bernoulli distribution can take
value 1 with the probability of success, say p, and the value 0 with the
probability of failure, say q or 1-p.
• For example, in a coin toss, the occurrence of a head denotes success,
and the occurrence of a tail denotes failure.
• For a fair coin, probability of getting a head = 0.5 = probability of
getting a tail, since the two outcomes are equally likely.
• The probability mass function is given by: P(X = x) = p^x (1-p)^(1-x), where x ∈ {0, 1}.
• It can also be written as: P(X = x) = 1-p if x = 0, and p if x = 1.
• The probabilities of success and failure need not be equal, like the result of a
fight between me and Undertaker. He is pretty much certain to win, so in this case the
probability of my success is 0.15 while the probability of my failure is 0.85.
The expected value of any distribution is the mean of the distribution. The expected
value of a random variable X from a Bernoulli distribution is found as follows:

E(X) = 1*p + 0*(1-p) = p

The variance of a random variable from a Bernoulli distribution is:

V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
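A minimal sketch of the Bernoulli formulas, using the success probability p = 0.15 from the example above. The function name `bernoulli_pmf` is illustrative.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable only takes the values 0 and 1")
    return p ** x * (1 - p) ** (1 - x)

p = 0.15                           # probability of success from the example

p_success = bernoulli_pmf(1, p)    # equals p
p_failure = bernoulli_pmf(0, p)    # equals 1 - p

expected_value = 1 * p + 0 * (1 - p)   # E(X) = p
variance = p * (1 - p)                 # V(X) = p(1 - p)

print(p_success, p_failure, expected_value, variance)
```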


Uniform Distribution
• When you roll a fair die, the
outcomes are 1 to 6.
• The probabilities of getting these
outcomes are equally likely and that
is the basis of a uniform
distribution.
• A variable X is said to be uniformly
distributed if the density function is:
f(x) = 1/(b-a) for a ≤ x ≤ b, and 0 otherwise
For a Uniform Distribution, a and b
are the parameters.
The graph of a uniform density is a flat
line of height 1/(b-a) between a and b.
• The number of bouquets sold daily at a flower shop is uniformly
distributed with a maximum of 40 and a minimum of 10.
• Let’s try calculating the probability that the daily sales will fall
between 15 and 30.
• The probability that daily sales will fall between 15 and 30 is
(30-15)*(1/(40-10)) = 0.5
• Similarly, the probability that daily sales are greater than 20 is
(40-20)*(1/(40-10)) = 0.667
The mean and variance of X following a uniform distribution are:
• Mean -> E(X) = (a+b)/2
• Variance -> V(X) = (b-a)²/12
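The flower-shop numbers can be reproduced with a small helper. This is a sketch that treats daily sales as a continuous uniform variable on [a, b], as the worked example does; the function name `uniform_probability` is hypothetical.

```python
def uniform_probability(low, high, a, b):
    """P(low <= X <= high) for X uniformly distributed on [a, b]."""
    # Clip the requested interval to the support [a, b].
    low, high = max(low, a), min(high, b)
    if high <= low:
        return 0.0
    return (high - low) * (1 / (b - a))

a, b = 10, 40                                    # minimum and maximum daily sales

p_15_to_30 = uniform_probability(15, 30, a, b)   # between 15 and 30
p_over_20 = uniform_probability(20, b, a, b)     # greater than 20

mean = (a + b) / 2                               # E(X) = (a + b)/2
variance = (b - a) ** 2 / 12                     # V(X) = (b - a)^2 / 12

print(p_15_to_30, round(p_over_20, 3), mean, variance)
```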
• The standard uniform density has parameters a = 0 and b = 1, so the
PDF for the standard uniform density is given by:
f(x) = 1 for 0 ≤ x ≤ 1, and 0 otherwise
Binomial Distribution
• Let’s get back to cricket.
• Suppose that you won the toss today and this indicates a successful
event.
• You toss again but you lost this time.
• If you win a toss today, this does not mean that you will win the
toss tomorrow. Let’s assign a random variable, say X, to the number
of times you won the toss. What can be the possible value of X? It can
be any whole number from 0 up to the number of times you tossed the coin.
• There are only two possible outcomes. Head denoting success and tail
denoting failure. For a fair coin, probability of getting a head = 0.5 and the
probability of failure can be easily computed as: q = 1 - p = 0.5.
• A distribution where only two outcomes are possible, such as success or
failure, gain or loss, win or lose and where the probability of success and
failure is same for all the trials is called a Binomial Distribution.
• Each trial is independent since the outcome of the previous toss doesn’t
determine or affect the outcome of the current toss. An experiment with
only two possible outcomes repeated n number of times is called
binomial. The parameters of a binomial distribution are n and p where n is
the total number of trials and p is the probability of success in each trial.
The properties of a Binomial Distribution are
• Each trial is independent.
• There are only two possible outcomes in a trial- either a success or a
failure.
• A total number of n identical trials are conducted.
• The probability of success and failure is same for all trials. (Trials are
identical.)
The mathematical representation of the binomial distribution is given
by:
P(X = x) = [n!/(x!(n-x)!)] * p^x * q^(n-x), for x = 0, 1, ..., n

When the probability of success does not equal the probability of
failure, the graph of the binomial distribution is skewed.
When the probability of success equals the probability of failure
(p = q = 0.5), the graph of the binomial distribution is symmetric.

The mean and variance of a binomial distribution are given by:

Mean -> µ = n*p

Variance -> Var(X) = n*p*q
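As an illustration, the standard binomial PMF and the mean and variance formulas can be sketched as follows. The values n = 10 and p = 0.5 (ten tosses of a fair coin) are hypothetical, chosen only for the example, and `binomial_pmf` is an illustrative name.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

n, p = 10, 0.5                     # hypothetical: ten tosses of a fair coin

# Probability of each possible number of successes, 0 through n.
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = n * p                       # µ = n*p
variance = n * p * (1 - p)         # Var(X) = n*p*q

print(pmf[5], mean, variance)
```

With p = 0.5 the list `pmf` is symmetric around n/2; choosing a different p makes it skewed.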


Normal Distribution
• The normal distribution approximates the behavior of many naturally
occurring quantities.
Any distribution is known as Normal distribution if it has the following
characteristics:
• The mean, median and mode of the distribution coincide.
• The curve of the distribution is bell-shaped and symmetrical about
the line x=μ.
• The total area under the curve is 1.
• Exactly half of the values are to the left of the center and the other
half to the right.
The PDF of a random variable X following a normal distribution is
given by:
f(x) = (1/(σ√(2π))) * e^(-(x-µ)^2/(2σ^2)), for -∞ < x < ∞

The mean and variance of a random variable X which is
normally distributed are given by:
Mean -> E(X) = µ
Variance -> Var(X) = σ^2
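A short sketch of the standard normal density formula, with checks of two of the characteristics listed above (symmetry about the mean and a peak at the mean). The function name `normal_pdf` and the parameter values µ = 50, σ = 10 are hypothetical.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu, sigma = 50.0, 10.0             # hypothetical parameters

# Symmetry about x = mu: equal distances left and right give equal densities.
left = normal_pdf(mu - 7.0, mu, sigma)
right = normal_pdf(mu + 7.0, mu, sigma)

# The curve peaks at the mean (which is also the median and mode).
peak = normal_pdf(mu, mu, sigma)

# With the default arguments the function gives the standard normal (mean 0, sd 1).
standard_at_zero = normal_pdf(0.0)

print(left, right, peak, standard_at_zero)
```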
The graph of a random variable X ~ N (µ, σ) is
shown below.
A standard normal distribution is defined as the distribution with mean 0
and standard deviation 1. For such a case, the PDF becomes:
f(x) = (1/√(2π)) * e^(-x^2/2), for -∞ < x < ∞
Poisson Distribution
• Suppose you work at a call center. Approximately how many calls do
you get in a day? It can be any number. The total number of
calls at a call center in a day can be modeled by a Poisson distribution. Some
more examples are
• The number of emergency calls recorded at a hospital in a day.
• The number of thefts reported in an area on a day.
• The number of customers arriving at a salon in an hour.
• The number of suicides reported in a particular city.
• The number of printing errors on each page of a book.
Poisson Distribution is applicable in situations where events occur at
random points of time and space wherein our interest lies only in the
number of occurrences of the event.
A distribution is called Poisson distribution when the following
assumptions are valid:
1. Any successful event should not influence the outcome of another
successful event.
2. The probability of success in an interval must be the same for all
intervals of equal length.
3. The probability of success in an interval approaches zero as the
interval becomes smaller.
Some notations used in Poisson distribution are:
• λ is the rate at which an event occurs,
• t is the length of a time interval,
• And X is the number of events in that time interval.
• Here, X is called a Poisson Random Variable and the probability
distribution of X is called Poisson distribution.
• Let µ denote the mean number of events in an interval of length t.
Then, µ = λ*t.
• The PMF of X following a Poisson distribution is given by:
P(X = x) = e^(-µ) * µ^x / x!, for x = 0, 1, 2, ...
The graph shown below illustrates the shift in the curve due to
increase in mean.

The mean and variance of X following a Poisson distribution:


Mean -> E(X) = µ
Variance -> Var(X) = µ
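The sketch below evaluates the standard Poisson PMF and checks numerically that the mean and variance both come out equal to µ. The rate of 3 calls per hour and the 2-hour interval are hypothetical values, and the sum is truncated at 60 events, where the remaining probability is negligible for this µ.

```python
import math

def poisson_pmf(x, mu):
    """P(X = x) = e^(-mu) * mu^x / x! for x = 0, 1, 2, ..."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

rate = 3          # hypothetical λ: 3 calls per hour
t = 2             # hypothetical interval length: 2 hours
mu = rate * t     # µ = λ*t, the mean number of events in the interval

# Truncate the infinite sum at 60 events; the tail beyond that is negligible here.
pmf = [poisson_pmf(x, mu) for x in range(60)]

total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))

print(round(total, 6), round(mean, 6), round(variance, 6))
```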
How ML works
• Gathering past data in any form suitable for processing. The better
the quality of data, the more suitable it will be for modeling
• Data Processing – Sometimes, the data collected is in the raw form
and it needs to be pre-processed.
Example: Some tuples may have missing values for certain attributes,
and, in this case, it has to be filled with suitable values in order to
perform machine learning or any form of data mining.
• Missing values for numerical attributes such as the price of the house
may be replaced with the mean value of the attribute, whereas
missing values for categorical attributes may be replaced with the
mode (the most frequent value) of the attribute. The choice depends on
the types of filters we use.
• If data is in the form of text or images, then converting it to numerical
form will be required, be it a list, array, or matrix. Simply put, data is to
be made relevant and consistent. It is to be converted into a format
understandable by the machine.
• Divide the input data into training, cross-validation and test sets. A
commonly used ratio between the respective sets is 6:2:2.
• Building models with suitable algorithms and techniques on the
training set.
• Testing our conceptualized model with data which was not fed to the
model at the time of training and evaluating its performance using
metrics such as F1 score, precision and recall.
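Several of the steps above (filling missing values, splitting the data, and computing evaluation metrics) can be sketched in plain Python. All function names and sample values below are hypothetical, and a real project would more likely use a library; this is only meant to make the steps concrete.

```python
from collections import Counter

def impute_mean(values):
    """Replace None in a numeric column with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None in a categorical column with the most frequent known value."""
    known = [v for v in values if v is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def split_6_2_2(rows):
    """Split rows into training, cross-validation and test sets in a 6:2:2 ratio."""
    n = len(rows)
    train_end = n * 6 // 10
    validation_end = n * 8 // 10
    return rows[:train_end], rows[train_end:validation_end], rows[validation_end:]

def precision_recall_f1(actual, predicted, positive=1):
    """Precision, recall and F1 score for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    both = precision + recall
    f1 = 2 * precision * recall / both if both else 0.0
    return precision, recall, f1

# Hypothetical data, invented for illustration only.
prices = impute_mean([200, None, 300, 400])
colors = impute_mode(["red", None, "blue", "red"])
train, validation, test = split_6_2_2(list(range(10)))
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])

print(prices, colors, len(train), len(validation), len(test), precision, recall, f1)
```

In practice the rows are usually shuffled before splitting so that each set is representative of the whole.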
Types of machine learning problems
There are various ways to classify machine learning problems. Here, we
discuss the most obvious ones.
1. On basis of the nature of the learning “signal” or “feedback”
available to a learning system

Supervised learning: The model or algorithm is presented with
example inputs and their desired outputs, and it then finds patterns and
connections between the input and the output. The goal is to learn a
general rule that maps inputs to outputs. The training process
continues until the model achieves the desired level of accuracy on the
training data. Some real-life examples are:
• Image Classification: You train with images/labels. Then in the future
you give a new image expecting that the computer will recognize the
new object.
• Market Prediction/Regression: You train the computer with historical
market data and ask the computer to predict the new price in the
future.
Unsupervised learning: No labels are given to the learning algorithm,
leaving it on its own to find structure in its input. It is used for clustering
population in different groups. Unsupervised learning can be a goal in
itself (discovering hidden patterns in data).
• Clustering: You ask the computer to separate similar data into
clusters; this is essential in research and science.
• High Dimension Visualization: Use the computer to help us visualize
high dimension data.
• Generative Models: After a model captures the probability
distribution of your input data, it will be able to generate more data.
This can be very useful to make your classifier more robust.
Semi-supervised learning: Problems where you have a large amount of
input data and only some of the data is labeled are called
semi-supervised learning problems. These problems sit in between
supervised and unsupervised learning. For example, a photo archive
where only some of the images are labeled, (e.g. dog, cat, person) and
the majority are unlabeled.
Reinforcement learning: A computer program interacts with a dynamic
environment in which it must perform a certain goal (such as driving a
vehicle or playing a game against an opponent). The program is
provided feedback in terms of rewards and punishments as it navigates
its problem space.
2. The two most common use cases of supervised learning are:
• Classification: Inputs are divided into two or more classes, and the
learner must produce a model that assigns unseen inputs to one or
more (multi-label classification) of these classes, predicting
whether or not something belongs to a particular class. This is
typically tackled in a supervised way. Classification models can be
categorized in two groups: Binary classification and Multiclass
Classification. Spam filtering is an example of binary classification,
where the inputs are email (or other) messages and the classes are
“spam” and “not spam”.
• Regression: It is also a supervised learning problem, but it predicts a
numeric value, so the outputs are continuous rather than discrete. For
example, predicting stock prices using historical data.
3. The most common unsupervised learning tasks are:
• Clustering: Here, a set of inputs is to be divided into groups. Unlike in
classification, the groups are not known beforehand, making this
typically an unsupervised task. For example, the points of a dataset
may be divided into groups identifiable by
the colors red, green and blue.
• Density estimation: The task is to find the distribution of inputs in
some space.
• Dimensionality reduction: It simplifies inputs by mapping them into a
lower-dimensional space. Topic modeling is a related problem, where
a program is given a list of human language documents and is tasked
to find out which documents cover similar topics.
Some commonly used machine learning algorithms are Linear
Regression, Logistic Regression, Decision Tree, SVM(Support vector
machines), Naive Bayes, KNN(K nearest neighbors), K-Means, Random
Forest, etc.
Terminologies of Machine Learning
• Model: A model is a specific representation learned from data by
applying some machine learning algorithm. A model is also
called a hypothesis.
• Feature: A feature is an individual measurable property of our data. A
set of numeric features can be conveniently described by a feature
vector. Feature vectors are fed as input to the model. For example, in
order to predict a fruit, there may be features like color, smell,
taste, etc. Note: Choosing informative, discriminating and
independent features is a crucial step for effective algorithms. We
generally employ a feature extractor to extract the relevant features
from the raw data.
• Target (Label): A target variable or label is the value to be predicted by
our model. For the fruit example discussed in the features section, the
label with each set of input would be the name of the fruit like apple,
orange, banana, etc.
• Training: The idea is to give a set of inputs (features) and their expected
outputs (labels), so after training, we will have a model (hypothesis)
that will then map new data to one of the categories trained on.
• Prediction: Once our model is ready, it can be fed a set of inputs for
which it will provide a predicted output (label). Only if the model
performs well on unseen data can we say that the
model performs well.
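To tie these terms together, here is a minimal sketch of the fruit example using a 1-nearest-neighbor classifier, chosen only because it is short. The feature values (weight in grams, redness from 0 to 1), the labels, and the function names `train` and `predict` are all hypothetical.

```python
import math

# Hypothetical training data: each feature vector is (weight in grams, redness 0 to 1).
training_features = [(150, 0.9), (170, 0.8), (120, 0.2), (130, 0.1)]
# Targets (labels): the value the model should predict for each feature vector.
training_labels = ["apple", "apple", "banana", "banana"]

def train(features, labels):
    """Training for 1-nearest-neighbor simply stores the labeled examples.

    The returned list plays the role of the model (hypothesis).
    """
    return list(zip(features, labels))

def predict(model, feature_vector):
    """Predict the label of the stored example closest to feature_vector."""
    nearest = min(model, key=lambda pair: math.dist(pair[0], feature_vector))
    return nearest[1]

model = train(training_features, training_labels)

# Prediction on inputs the model has not seen during training.
first = predict(model, (160, 0.85))
second = predict(model, (125, 0.15))
print(first, second)
```

Because the two features are on very different scales, the weight dominates the distance here; real pipelines usually rescale features first.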
