Statistics Batch4 Lecture
All-In-One Course
Batch (4)
Success Point
Statistics
What is statistics?
The field of statistics: the practice and study of collecting and analyzing data.
Statistics in everyday life can be used to estimate household budgets. Knowing the average fuel,
food, and entertainment costs helps a person prepare for the likely expenses they will have next
month or the month after that, and these numbers can be found by averaging the values found on
previous bills and receipts.
What can statistics do?
● How likely is someone to purchase a product? Are people more likely to purchase it if they
can use a different payment method?
● How many sizes of jeans need to be manufactured so they can fit 95% of the population?
Should the same number of each size be produced?
Example: In a marketing department, you're analyzing sales data for a new product launch.
Descriptive statistics would include metrics like average daily sales, standard deviation of
sales, highest and lowest sales days, etc. These statistics provide a snapshot of how the
product is performing in terms of sales volume and variability.
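The metrics named in this example can be sketched in a few lines of Python. The daily sales figures below are hypothetical and exist only to illustrate the calculation; the sketch assumes the standard-library `statistics` module.

```python
import statistics

# Hypothetical daily unit sales for the first ten days after the launch
daily_sales = [120, 135, 128, 160, 95, 142, 150, 110, 175, 135]

average_sales = statistics.mean(daily_sales)   # average daily sales
sales_std = statistics.stdev(daily_sales)      # sample standard deviation
highest_day = max(daily_sales)                 # best sales day
lowest_day = min(daily_sales)                  # worst sales day

print(f"Average daily sales: {average_sales}")
print(f"Standard deviation:  {sales_std:.2f}")
print(f"Highest / lowest:    {highest_day} / {lowest_day}")
```

Together these four numbers give the "snapshot" of volume (average, highest, lowest) and variability (standard deviation) described above.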
Inferential Statistics
Sample data:
● 50% of friends drive to work.
● 25% take the bus.
● 25% bike.
What percentage of people drive to work based on the sample data?
Types of Data
Being able to classify the data you are working with is key.
We can classify data in two main ways: based on its type and its measurement
level.
Categorical
- categories or groups (e.g. car brands)
- answers to “Yes” and “No” questions
Numeric
- represents numbers
- Discrete. Children: the number of children you want to have is directly understandable and is discrete.
- Continuous. Weight: body weight can vary by incomprehensibly small amounts and is continuous.
Examples of Discrete
A, B, C, D, E, F or 0 to 100%
Numeric
- represents numbers
- Discrete
- Continuous
Categorical
- categories or groups (e.g. car brands)
- “Yes” and “No” questions
Levels of Measurement
Qualitative: Nominal, Ordinal
Quantitative: Interval, Ratio
Levels of Measurement (Qualitative)
Nominal: cannot be ordered, not numbers (each category is separate and cannot occur at the
same time)
Ordinal: categories that can be ordered
E.g.: rating your lunch as Disgusting, Unappetizing, Neutral, Tasty, or Delicious
(although we have words and not numbers, it is obvious that these preferences are
ordered from negative to positive.)
Levels of Measurement (Quantitative)
Ratio: has a true zero
E.g.: I have 2 apples and you have 6 apples, so you have 3 times as many as I do, because the
ratio 6 / 2 = 3.
Other example:
E.g.: Temperature
Usually, temperature is expressed in Celsius or Fahrenheit. They are both interval variables.
Today: 5°C (or) 41°F
Yesterday: 10°C (or) 50°F
In terms of Celsius, it seems today is twice as cold as yesterday, but in terms of Fahrenheit, not really. The issue
comes from the fact that zero degrees Celsius and zero degrees Fahrenheit are not true zeros. These
scales were artificially created by humans for convenience.
Zero kelvin is the temperature at which the motion of atoms is at its minimum, and nothing can be colder than
zero kelvin. Temperature in kelvin therefore has a true zero and is a ratio variable.
Definition: Interval data is numeric data where the difference between values is meaningful, but
there is no true zero. This means you can add and subtract the values, but ratios (like "twice as
much") don’t make sense because the zero point is arbitrary.
Key Features:
● Equal intervals: The difference between values is consistent (e.g., the difference between
10°C and 20°C is the same as the difference between 30°C and 40°C).
● No absolute zero: Zero does not indicate the absence of the quantity (e.g., 0°C doesn’t
mean “no temperature”).
● Mathematical operations: Addition and subtraction make sense, but multiplication and
division don’t.
Examples: Temperature in Celsius or Fahrenheit, IQ scores.
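The temperature example can be checked numerically. The sketch below converts the two readings from the example (5°C and 10°C) into Fahrenheit and Kelvin and compares the ratios; the helper function names are illustrative, not part of the lecture.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

def celsius_to_kelvin(celsius):
    return celsius + 273.15

today_c, yesterday_c = 5.0, 10.0

# The "ratio" depends on which scale we pick, so it carries no real meaning
ratio_celsius = yesterday_c / today_c
ratio_fahrenheit = celsius_to_fahrenheit(yesterday_c) / celsius_to_fahrenheit(today_c)

# Kelvin has a true zero, so this ratio is the physically meaningful one
ratio_kelvin = celsius_to_kelvin(yesterday_c) / celsius_to_kelvin(today_c)

# Differences, by contrast, are consistent: 5 Celsius degrees = 5 kelvins
diff_celsius = yesterday_c - today_c
diff_kelvin = celsius_to_kelvin(yesterday_c) - celsius_to_kelvin(today_c)

print(ratio_celsius, round(ratio_fahrenheit, 3), round(ratio_kelvin, 3))
print(diff_celsius, round(diff_kelvin, 2))
```

The Celsius ratio is 2, the Fahrenheit ratio is about 1.22, and the Kelvin ratio is about 1.02, which is why "twice as warm" is not a meaningful statement for interval data.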
Levels of Measurement (Quantitative)
Definition: Ratio data has all the properties of interval data, but it also has a true zero, meaning
that zero represents a total absence of the quantity being measured. With ratio data, you can
perform all mathematical operations, including multiplication and division.
Key Features:
● Equal intervals: Like interval data, the difference between values is consistent.
● True zero: Zero means the complete absence of the measured quantity (e.g., 0 kg means no
weight at all).
● Mathematical operations: You can add, subtract, multiply, and divide the values, and you
can make meaningful statements like "twice as much" (e.g., 4 meters is twice as long as 2
meters).
● Somewhat agree ( 4 )
● Strongly agree ( 5 )
It is important to note that attaching numbers to these answers doesn’t necessarily make them numeric variables.
Data Analysis and Visualization Techniques
Visualization Techniques
● Categorical Variables
● Numerical Variables
Frequency
Example: Twenty students were asked how many hours they worked per day. Three
students worked two hours, five students worked three hours, and so on.
To find the relative frequencies, divide each frequency by the total frequency.
To find the cumulative relative frequencies, add all the previous relative
frequencies to the relative frequency for the current row.
Cumulative Frequency
The last entry of the cumulative relative frequency column is one, indicating that one
hundred percent of the data has been accumulated.
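A frequency table like this one can be built as follows. Only the first two counts (three students at two hours, five students at three hours) come from the example; the remaining counts are assumed values chosen so that the total is twenty students.

```python
from itertools import accumulate

# Hours worked per day -> number of students.
# Counts for 2 and 3 hours are from the example; the rest are assumed.
frequencies = {2: 3, 3: 5, 4: 3, 5: 6, 6: 2, 7: 1}

total = sum(frequencies.values())

# Relative frequency: each frequency divided by the total frequency
relative = {hours: count / total for hours, count in frequencies.items()}

# Cumulative relative frequency: running total of the counts divided by the
# total, which equals the sum of the relative frequencies up to that row
running_counts = accumulate(frequencies.values())
cumulative_relative = {
    hours: running / total for hours, running in zip(frequencies, running_counts)
}

print("hours  freq  relative  cumulative")
for hours, count in frequencies.items():
    print(f"{hours:>5}  {count:>4}  {relative[hours]:>8.2f}  {cumulative_relative[hours]:>10.2f}")
```

The last row of the cumulative column comes out as 1.00, matching the statement that one hundred percent of the data has been accumulated.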
Numerical Variables
An explanation for the choice may be young adults under 25 cannot afford the product,
while adults over 60 have no interest in the product.
Cross Table
The term "cross table" typically refers to a table or matrix in which data is
organized in rows and columns to show the relationship between two or more
categorical variables. Each cell holds the frequency or count of observations for
one combination of categories. In data visualization, a cross table is commonly
presented as a side-by-side bar chart.
In a side-by-side bar chart, the categories of one variable are displayed along the
x-axis (horizontal axis), and the frequencies or counts for each category of the other
variable are represented by bars placed side by side. The height or length of each
bar represents the frequency or count of observations in that combination of categories.
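As a minimal sketch, a cross table of counts for two categorical variables can be built with the standard library. The survey records (city and preferred transport) are hypothetical.

```python
from collections import Counter

# Hypothetical survey records: (city, preferred transport)
records = [
    ("City A", "Car"), ("City A", "Bus"), ("City A", "Car"),
    ("City B", "Bike"), ("City B", "Car"), ("City B", "Bus"),
    ("City B", "Bus"), ("City A", "Bike"),
]

counts = Counter(records)                      # count each (city, transport) pair
rows = sorted({city for city, _ in records})   # row labels
cols = sorted({mode for _, mode in records})   # column labels

# cross_table[row][column] = number of observations in that combination
cross_table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print(f"{'':8}" + "".join(f"{c:>6}" for c in cols))
for r in rows:
    print(f"{r:8}" + "".join(f"{cross_table[r][c]:>6}" for c in cols))
```

Each row of this table corresponds to one group of side-by-side bars in the chart.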
Scatter Plots
Notes
Scatter Plots are used when we are presenting two numerical variables.
Outliers are data points that go against the logic of the whole dataset.
Population, Sample
Population
- The entire set of items or individuals of interest in a study.
- Its size is denoted by “N”.
- The numbers we have obtained when using a population are called “parameters”.
Sample
- A subset selected from the larger population.
- Its size is denoted by “n”.
- The numbers we have obtained with a sample are called “statistics”.
Population, Sample
Let’s say we would like to survey the job prospects of the students
studying at NY University.
Student Database:
The safest way would be to get access to the student database and contact
individuals in a random manner. However, such surveys are almost impossible to
conduct without assistance from the university.
Sampling Methods
Random Sampling: Every individual in the population has an equal chance of being selected. This
method helps reduce bias and is the basis for many statistical tests.
Stratified Sampling: The population is divided into subgroups (strata), and random samples are
taken from each stratum. This ensures that each subgroup is adequately represented in the sample.
Cluster Sampling: The population is divided into clusters (usually geographically), and entire
clusters are randomly selected. This method is cost-effective and useful when a population is too
large or spread out.
Systematic Sampling: Individuals are selected at regular intervals from an ordered list. This method
is simple and quick, but it requires that the list be random or that periodic patterns do not exist in
the population.
Convenience Sampling: Individuals are selected based on their easy availability. While not
statistically rigorous, this method is often used in preliminary research.
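The five methods above can be sketched on a toy population. The population of 100 students, the two faculties used as strata, and the groups of ten used as clusters are all hypothetical choices made for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch gives repeatable samples

# Hypothetical population: 100 students as (student_id, faculty) pairs
population = [(i, "Science" if i % 4 == 0 else "Arts") for i in range(100)]

# Random sampling: every student has an equal chance of selection
random_sample = random.sample(population, 10)

# Stratified sampling: draw 20% at random from each faculty (stratum)
strata = {}
for student in population:
    strata.setdefault(student[1], []).append(student)
stratified_sample = []
for members in strata.values():
    stratified_sample.extend(random.sample(members, len(members) // 5))

# Cluster sampling: split into 10 clusters of 10, keep 2 whole clusters
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [s for cluster in random.sample(clusters, 2) for s in cluster]

# Systematic sampling: random start, then every 10th student on the list
start = random.randrange(10)
systematic_sample = population[start::10]

# Convenience sampling: simply the first 10 students that are easy to reach
convenience_sample = population[:10]
```

Note that the stratified sample keeps the 25/75 split between the two faculties, while the convenience sample contains only the lowest student IDs and may not represent the population.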
What we have done
● Populations
● Samples
● Types of Variables
● Measurement Levels
● Graphics and Tables
Descriptive Statistics
Mean, Median, and Mode
They are all in their own way trying to measure the “common” point within the
data, that which is “normal”.
The first measure is the "Mean", also known as the simple average.
It is denoted by the Greek letter 'μ' for a population and 'x̄' for a sample.
Measures of Central Tendency
We can find the mean of a data set by adding up all of its components and then
dividing the sum by their number.
$$\text{Mean} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}$$
The mean is the most common measure of central tendency, but a downside is that it
is easily affected by outliers.
Measures of Central Tendency
● To calculate the median, first order the data from smallest to largest.
● If the number of observations is odd, the median is the number at position (n + 1) / 2
in the ordered list, where n is the number of observations.
● If the number of observations is even, take the average of the two values at the positions
directly below and above (n + 1) / 2.
Measures of Central Tendency
The mode is the value that occurs most often. In general, a dataset can have multiple
modes. Usually, two or three modes are tolerable, but more than that would defeat the
purpose of finding a mode.
Measures of Central Tendency
The example shows us that the measures of central tendency should be used
together, rather than independently.
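A small sketch of why the three measures are best read together, using the standard-library `statistics` module. The price list is hypothetical, with one deliberately extreme value.

```python
import statistics

# Hypothetical prices; the last value is an outlier
prices = [1, 2, 3, 3, 5, 6, 6, 7, 8, 66]

mean_price = statistics.mean(prices)       # pulled upwards by the outlier
median_price = statistics.median(prices)   # n is even: average of 5th and 6th values
modes = statistics.multimode(prices)       # every value that occurs most often

# Without the outlier the mean changes a lot, the median only slightly
trimmed = prices[:-1]
mean_without_outlier = statistics.mean(trimmed)
median_without_outlier = statistics.median(trimmed)

print(mean_price, median_price, modes)
print(round(mean_without_outlier, 2), median_without_outlier)
```

Here the mean is 10.7 while the median is 5.5 and the modes are 3 and 6, so the mean alone would give a misleading picture of a "typical" price.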
Measures of Asymmetry (Skewness)
After exploring the measures of central tendency, let's move on to the measures
of asymmetry.
Formula:
We will not get into the computation but rather the meaning of skewness.
Skewness
If the distribution of data is skewed to the right (positive skew), the mean is
greater than the median.
If the distribution of data is skewed to the left (negative skew), the mean is less
than the median.
If the mean, median, and mode are equal, there is zero skew, because the distribution
of data is symmetrical.
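The three cases can be illustrated with a short sketch. The function below uses one common definition of skewness (the third central moment divided by the second central moment raised to the power 1.5); this is an assumption and may differ from the formula shown on the slide. The datasets are hypothetical.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]      # long tail to the right
left_skewed = [-x for x in right_skewed]      # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 2), statistics.mean(data), statistics.median(data))
```

For the right-skewed data the skewness is positive and the mean (3.5) exceeds the median (3); for the mirrored data both relationships flip; for the symmetric data the skewness is zero and the mean equals the median.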
Skewness
Figures: positive skew, negative skew, zero skew
Variance
Variance measures the dispersion of a set of data points around their mean.
In simpler terms, variance tells us how much the data points in a dataset vary or
spread out from the average value. A high variance indicates that the data points
are spread out over a wide range, while a low variance suggests that the data
points are clustered closely around the mean.
Variance (example)
Imagine you have a dataset representing sales figures for a product over several months. Each data point in this dataset is a
monthly sales figure. Now, you want to know not just the average sales but also how much the sales figures fluctuate or deviate
from this average. Variance gives you precisely that.
If the variance is high, it means the data points are spread out widely from the mean, indicating a lot of variability in your dataset.
Conversely, if the variance is low, it means the data points are clustered closely around the mean, suggesting less variability.
Standard Deviation
The formulas for the population and sample standard deviation are the square root of
the population variance and the square root of the sample variance, respectively.
A low standard deviation indicates that the values tend to be close to the mean
of the set, while a high standard deviation indicates that the values are spread
out over a wider range.
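Since the formulas themselves are not reproduced here, the sketch below uses the usual textbook convention as an assumption: the population variance divides the sum of squared deviations by N, and the sample variance divides it by n - 1. The data values are hypothetical.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]  # hypothetical observations

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance divides by N; sample variance divides by n - 1
population_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)

# Standard deviation is the square root of the corresponding variance
population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

# The standard library gives the same results
print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
print(round(population_std, 3), round(sample_std, 3))
```

Because the standard deviation is expressed in the same units as the data, it is usually easier to interpret than the variance.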
Coefficient of Variation
Imagine you have two products: Product A and Product B. Both products have varying
sales figures over several months. Product A has an average monthly sales figure of
$10,000, while Product B has an average monthly sales figure of $20,000.
Now, let's say the standard deviation for Product A is $2,000, and for Product B, it's $5,000.
At first glance, you might think Product B has more variability in sales because its standard
deviation is higher. However, when you calculate the coefficient of variation (CoV), you get
a better understanding of the relative variability.
Example
For Product A:
Standard Deviation = $2,000
Mean = $10,000
Coefficient of Variation (CoV) = (Standard Deviation / Mean) * 100
= (2000 / 10000) * 100
= 20%
For Product B:
Standard Deviation = $5,000
Mean = $20,000
Coefficient of Variation (CoV) = (Standard Deviation / Mean) * 100
= (5000 / 20000) * 100
= 25%
Example
The insight from this analysis is that while Product B has higher average sales compared to
Product A, it also exhibits higher variability in its sales figures relative to its average.
The higher standard deviation and coefficient of variation for Product B suggest that its sales
figures fluctuate more widely around the average compared to Product A. This indicates that
while Product B may have higher average sales, it also carries a higher degree of risk or
uncertainty in its sales performance.
Therefore, while Product B may offer greater potential for higher sales, it also comes with
greater relative variability or risk in its sales figures, which may need to be taken into account
when setting sales targets or making business decisions. On the other hand, Product A, despite
having lower average sales, demonstrates more consistent sales performance relative to its
average.
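The calculation in this example can be reproduced with a small helper. The function name is illustrative; the figures are the ones given for Product A and Product B.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation relative to the mean, as a percentage."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=2_000, mean=10_000)   # Product A
cv_b = coefficient_of_variation(std_dev=5_000, mean=20_000)   # Product B

print(f"Product A: {cv_a}%")
print(f"Product B: {cv_b}%")
print("More variable relative to its mean:", "B" if cv_b > cv_a else "A")
```

Because the coefficient of variation has no units, it allows products with very different sales levels to be compared directly.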
Covariance
Now, we'll explore measures that can help us explore the relationship between
two variables.
When two variables move together, they are correlated, and the main statistic to
measure this relationship is called covariance. Unlike variance, which is never
negative, covariance may be > 0 (positive), = 0 (equal to zero), or < 0 (negative).
Covariance
● Positive Covariance: Indicates that when one variable increases, the other variable
tends to increase as well. For example, if the covariance between house size and price
is positive, it suggests that larger houses tend to have higher prices.
● Negative Covariance: Indicates that when one variable increases, the other variable
tends to decrease. For instance, if the covariance between house size and price is
negative, it suggests that smaller houses tend to have higher prices.
● Covariance of Zero: Suggests that there's no linear relationship between the variables.
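The three cases can be sketched with a hand-written sample covariance (sum of the products of deviations, divided by n - 1, which is assumed here as the usual sample convention). All house sizes and prices are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

house_size = [50, 70, 90, 110, 130]        # square metres (hypothetical)

price_rising = [100, 140, 170, 230, 260]   # larger houses cost more
price_falling = [260, 230, 170, 140, 100]  # larger houses cost less
price_u_shape = [180, 160, 150, 160, 180]  # related to size, but not linearly

print(sample_covariance(house_size, price_rising))    # positive
print(sample_covariance(house_size, price_falling))   # negative
print(sample_covariance(house_size, price_u_shape))   # zero
```

The last case shows why a covariance of zero only rules out a linear relationship: the U-shaped prices clearly depend on size, yet the covariance is zero.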
Covariance
Limitations of Covariance
● Covariance is not standardized, meaning its value depends on the scale of the
variables. Therefore, it's not directly comparable across different datasets or
variables.
● Covariance only measures the direction of the relationship between variables and not
the strength or the degree of relationship.
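The first limitation can be demonstrated by changing only the unit of one variable. In this sketch the same hypothetical prices are expressed once in thousands of dollars and once in dollars; the relationship is identical, but the covariance is not.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

house_size = [50, 70, 90, 110, 130]                 # square metres (hypothetical)
price_thousands = [100, 140, 170, 230, 260]         # price in thousands of dollars
price_dollars = [p * 1000 for p in price_thousands] # same prices, in dollars

cov_thousands = sample_covariance(house_size, price_thousands)
cov_dollars = sample_covariance(house_size, price_dollars)

print(cov_thousands)  # 2050.0
print(cov_dollars)    # 2050000.0
```

The covariance grows by a factor of 1,000 purely because of the unit change, so its magnitude says nothing about how strong the relationship is.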
Covariance