Probability and Statistics: Lums Undergraduate SS-4-6

This document discusses numerical descriptive statistics used to describe datasets. It covers measures of central tendency like mean, median and mode. It also discusses measures of variability such as range, variance, standard deviation and coefficient of variation. Finally, it discusses measures of shape like skewness and kurtosis. Standardizing data using z-scores is described to identify outliers. The empirical rule and Chebychev's inequality are also summarized.


Probability and Statistics

LUMS
Undergraduate
SS-4-6
Numerical Descriptive Statistics
• Numerical descriptive statistics take a different approach to
answer the same set of questions:
– Provide more precise information about a dataset’s distribution.
– The increased precision comes at the cost of stronger aggregation.
• Three basic types of numerical descriptive statistics:
– Measures of Central Location: Mean, Median, Mode
– Measures of Variability: Range, Variance, Standard Deviation,
Coefficient of Variation, Percentiles and Quartiles.
– Measures of Shape: Skewness and Kurtosis
• Ideally, employ visual and numerical descriptive statistics in
tandem to shed light on information embedded in datasets.
Measures of Central Location
• Average (i.e. arithmetic mean) is the most popular
measure of central location:
– computed by adding all the observations and dividing by the
total number of observations.
– appropriate for describing quantitative data only.
– Possesses nice theoretical properties:
• Sum of deviations from mean is zero.
• Linked to the measures of variation in a dataset.
• Changing value of a single observation changes the average.
• Central Limit Theorem
– Sensitive to outliers (extreme values) e.g. what happens to
average household income in a poor neighborhood when a
billionaire moves in?
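The billionaire example can be illustrated with a small sketch using Python's standard library (the income figures are hypothetical):

```python
from statistics import mean, median

# Hypothetical annual household incomes (in $1000s) in a poor neighborhood.
incomes = [28, 31, 35, 40, 42, 45, 50]

mean_before = mean(incomes)
median_before = median(incomes)

# A billionaire moves in: one extreme value (also in $1000s).
incomes_with_outlier = incomes + [1_000_000]

mean_after = mean(incomes_with_outlier)
median_after = median(incomes_with_outlier)

# The mean explodes, while the median barely moves.
print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

A single outlier drags the mean to over $125 million while the median shifts only from 40 to 41, which is why the median is preferred for income and property-value data.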
Measures of Central Location
• Median: Place the observations in ascending order; the observation (or the
average of the two observations) falling in the middle is the median.
– Median not sensitive to outliers
– Often used for income and property values datasets.
– Cannot be computed for nominal data.
• Mode: value/class that occurs most frequently in a dataset.
– Most suitable for nominal data, but also used for ordinal data.
– Datasets may have more than one modal class.
– Not a good measure of central location for quantitative data.
Measures of Variability
• Measures of central location fail to tell the complete story
about a dataset’s distribution e.g. how are observations
spread out around the mean (on average)?
Measures of Variability
• Range: simplest measure of variability, calculated by
subtracting smallest observation from largest observation.
– Fails to provide information on the dispersion of the observations
located between the two end points.
• Variance, and its related measure Standard Deviation, is a
measure of variability that incorporates all the data points.
– Variance is calculated by subtracting the mean from each observation,
squaring the differences, and dividing the sum of the squares by the
number of observations (for a sample, by the number of observations minus one).
– Standard deviation (square root of the variance) used to compare
the average degree of variability between two quantitative datasets.
• Commonly used as a measure of risk in finance.
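The three measures above can be computed directly; a minimal sketch on made-up data, cross-checked against Python's `statistics` module:

```python
from statistics import pvariance

data = [4, 8, 6, 5, 3, 7, 9, 6]

# Range: largest observation minus smallest observation.
data_range = max(data) - min(data)

# Population variance: average of the squared deviations from the mean.
m = sum(data) / len(data)
variance = sum((x - m) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance, in the data's own units.
std_dev = variance ** 0.5

# Sanity check against the library implementation.
assert abs(variance - pvariance(data)) < 1e-12
```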
Measures of Variability
• Coefficient of variation: Standard deviation of a variable
divided by its mean:
– A standardized measure of variation, when comparing the degree of
variability between variables with different means:
• Variation in salaries of managers and CEOs?
• Variation in the weights of watermelons and apples?
– Interpreted as the variation in a variable as a percentage of its mean.
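The watermelons-versus-apples comparison can be sketched as follows (the weights are hypothetical):

```python
from statistics import mean, stdev

# Hypothetical weights: watermelons in kilograms, apples in grams.
watermelons = [6.1, 7.3, 5.8, 6.9, 7.0]
apples = [152, 168, 140, 160, 175]

def coefficient_of_variation(data):
    """Standard deviation expressed as a fraction of the mean."""
    return stdev(data) / mean(data)

cv_melons = coefficient_of_variation(watermelons)
cv_apples = coefficient_of_variation(apples)

# Raw standard deviations are not comparable across different units and
# means, but the coefficients of variation are.
print(f"CV watermelons: {cv_melons:.1%}, CV apples: {cv_apples:.1%}")
```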
• All of the above-mentioned measures of variability are
sensitive to outliers.
– Measures of relative variability are not sensitive to outliers.
• Percentiles: provide information about the position of a particular
observation relative to the entire dataset; often used to define benchmarks
in business applications.
Measures of Variability
– For example, suppose your SAT score of 1340 is at the 80th percentile:
this implies that 80% of students scored below you, while 20% of students
scored above you.
– Caution: This doesn’t mean you scored 80% on the exam!
• Difference between Q1 (25th percentile) and Q3 (75th
percentile) is called the interquartile range:
– Median is known as Q2 (50th percentile)
– Measures the spread around the middle 50% of the observations.
– Large values are indicative of high variability and the presence of outliers.
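Quartiles and the interquartile range can be computed with the standard library's `statistics.quantiles` (a sketch on made-up exam scores):

```python
from statistics import quantiles, median

scores = [51, 55, 58, 60, 62, 64, 66, 68, 70, 72, 75, 78, 81, 85, 93]

# quantiles(..., n=4) returns the three quartile cut points [Q1, Q2, Q3].
q1, q2, q3 = quantiles(scores, n=4)

# The interquartile range measures the spread of the middle 50% of the data.
iqr = q3 - q1

# Q2 (the 50th percentile) is just the median.
assert q2 == median(scores)
```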
• Measures of variation don’t tell us much about symmetry of
distribution, outliers and concentration of data in tails relative
to center of distribution.
Measures of Shape
Normal Distribution: A special type of symmetric uni-modal
distribution that is bell shaped, frequently encountered in
statistical modelling:
Many statistical techniques require/assume that data follows a bell-shaped
distribution.

[Figure: frequency histogram of a variable showing a bell-shaped curve — a
normal distribution has a bell-shaped histogram]
Measures of Shape
Skewness: A skewed distribution is one with a long tail
extending either to the right or the left of the distribution.
– Positively Skewed (Right Skewed): long tail extending to the right;
implies mean > median, i.e. more outliers on the RHS.
– Negatively Skewed (Left Skewed): long tail extending to the left;
implies mean < median, i.e. more outliers on the LHS.
Measures of Shape
Kurtosis: A measure of the concentration of data in the tails relative to
the center of the distribution:
– Negative Excess Kurtosis: Relatively less concentration in the tails.
– Positive Excess Kurtosis: Relatively more concentration in the tails.
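Skewness and excess kurtosis can be computed from the standardized third and fourth moments; a minimal sketch on a made-up right-skewed dataset (the helper functions below are illustrative, not from the slides):

```python
from statistics import mean, median, pstdev

def skewness(data):
    """Third standardized moment: positive for a long right tail."""
    m, s, n = mean(data), pstdev(data), len(data)
    return sum(((x - m) / s) ** 3 for x in data) / n

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m, s, n = mean(data), pstdev(data), len(data)
    return sum(((x - m) / s) ** 4 for x in data) / n - 3

# A right-skewed dataset: most values small, with a long tail to the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

assert skewness(right_skewed) > 0
assert mean(right_skewed) > median(right_skewed)  # mean pulled toward the tail
```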
Overview: Numerical Descriptive Statistics

Describing Data Numerically:
– Central Tendency: Mean, Median, Mode
– Variation: Range, Variance/Std. Deviation, Coefficient of Variation,
Interquartile Range
– Shape: Skewness and Kurtosis
Some Rules of the Expectation Operator
• If k is some constant, then we can mathematically prove the following results:
– Rule 1: If E(xᵢ) = x̄ then E(xᵢ + k) = x̄ + k
• Adding a constant to each observation changes the average by that constant.
– Rule 2: If Var(xᵢ) = σ² then Var(xᵢ + k) = σ²
• Adding a constant to each observation does not change the variance.
– Rule 3: If E(xᵢ) = x̄ then E(k·xᵢ) = k·E(xᵢ) = k·x̄
• Multiplying each observation by a constant changes the average by a factor
of that constant.
– Rule 4: If Var(xᵢ) = σ² then Var(k·xᵢ) = k²·Var(xᵢ) = k²·σ²
• Multiplying each observation by a constant changes the variance by the
squared factor of that constant.
• We apply these rules to standardize datasets to identify outliers.
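The four rules can be verified numerically on any dataset; a sketch with arbitrary values:

```python
from statistics import mean, pvariance

x = [2, 4, 6, 8, 10]
k = 5

# Rules 1 and 2: adding a constant shifts the mean by k
# but leaves the variance unchanged.
shifted = [xi + k for xi in x]
assert mean(shifted) == mean(x) + k
assert pvariance(shifted) == pvariance(x)

# Rules 3 and 4: multiplying by a constant scales the mean by k
# and the variance by k squared.
scaled = [k * xi for xi in x]
assert mean(scaled) == k * mean(x)
assert pvariance(scaled) == k ** 2 * pvariance(x)
```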
Standardizing Datasets
• Z-scores used to identify outliers in a dataset. To calculate Z-
score of each observation:
– Subtract the mean of the variable from each observation.
– Divide each resulting deviation by the standard deviation of the variable.
– The resulting distribution (of Z-scores) has a mean of 0 and a standard
deviation of 1.
– Each observation's Z-score is interpreted as the number of standard
deviations it lies above or below the mean.
• Converting each observation into its corresponding Z-score does not change
a non-normal distribution into a normal distribution.
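The standardization steps above can be sketched as follows (the data and the |z| > 2 cut-off are illustrative; 3 is also commonly used):

```python
from statistics import mean, pstdev

data = [12, 14, 15, 15, 16, 17, 18, 45]

m, s = mean(data), pstdev(data)

# Z-score: number of standard deviations each observation lies from the mean.
z_scores = [(x - m) / s for x in data]

# A common rule of thumb: flag |z| > 2 (or 3) as a potential outlier.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2]

# The standardized data has mean 0 and standard deviation 1.
assert abs(mean(z_scores)) < 1e-9
assert abs(pstdev(z_scores) - 1) < 1e-9
```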
The Empirical Rule
• Approximately 68% of all observations fall within one standard deviation
of the mean.
• Approximately 95% of all observations fall within two standard deviations
of the mean.
• Approximately 99.7% of all observations fall within three standard
deviations of the mean.
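The empirical rule can be checked on simulated bell-shaped data using Python's standard library (the mean 100 and standard deviation 15 are arbitrary):

```python
import random
from statistics import mean, pstdev

random.seed(0)
sample = [random.gauss(100, 15) for _ in range(100_000)]

m, s = mean(sample), pstdev(sample)

def within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - m) <= k * s for x in sample) / len(sample)

# Approximately 68%, 95% and 99.7% for k = 1, 2, 3.
print(f"k=1: {within(1):.3f}, k=2: {within(2):.3f}, k=3: {within(3):.3f}")
```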
Chebychev’s Inequality
• For any type of distribution and any number k > 1, at least
100 × (1 − 1/k²)% of the observations lie within k standard deviations of
either side of the mean.
• Two special cases of Chebychev’s inequality are applied
frequently, namely, when k = 2 and k = 3:
– At least 75% of the observations in any data set lie within 2 standard
deviations to either side of the mean.
– At least 89% of the observations in any data set lie within 3 standard
deviations to either side of the mean.
• Does the empirical rule violate Chebychev’s inequality?
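Chebychev's inequality gives a lower bound that must hold for any dataset, however skewed; a sketch that checks the k = 2 and k = 3 cases on made-up data with a large outlier:

```python
from statistics import mean, pstdev

# Any dataset, even a highly skewed one, must satisfy Chebychev's bound.
data = [1, 1, 1, 2, 2, 3, 3, 4, 5, 50]

m, s = mean(data), pstdev(data)

for k in (2, 3):
    bound = 1 - 1 / k ** 2  # 0.75 for k = 2, about 0.889 for k = 3
    fraction = sum(abs(x - m) <= k * s for x in data) / len(data)
    assert fraction >= bound  # the inequality always holds
```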
