
Statistics Batch4 Lecture

The document provides an overview of statistics, including its definitions, applications, and limitations, emphasizing its importance in data analysis. It covers various types of statistics such as descriptive, inferential, predictive, and prescriptive, along with the classification of data types and measurement levels. Additionally, it discusses data visualization techniques and sampling methods, highlighting their relevance in understanding and analyzing data effectively.

Uploaded by

noble true
Copyright
© © All Rights Reserved

Data Analyst

All-In-One Course
Batch (4)

Success Point
Statistics
What is statistics?

The field of statistics : the practice and study of collecting and analyzing data

A summary statistic: a fact about or summary of some data


Statistics is everywhere!

Sport statistics, Personal finances, Investment, etc.

Statistics in everyday life can be used to estimate household budgets. Knowing the average fuel,
food, and entertainment costs helps a person prepare for the likely expenses of the next month
or the month after that. These numbers can be found by averaging the values on previous bills
and receipts.
What can statistics do?

Allow us to answer practical questions:


- What is the average salary in the USA?
- How many customer inquiries is a company likely to receive per week?

It has applications across society:


- Developing safer products such as cars or airplanes
- Helping governments understand the needs of a population
- Validating scientific breakthroughs, such as Covid-19 vaccines


What can statistics do?

● How likely is someone to purchase a product? Are people more likely to purchase it if they
can use a different payment method?

● How many sizes of jeans need to be manufactured so they can fit 95% of the population?
Should the same number of each size be produced?

● A/B test: which ad is more effective in getting people to purchase a product?


Limitations of Statistics

● Statistics requires specific, measurable questions:


○ Is rock music more popular than jazz?
○ On average, do women live longer than men?

● We can’t use statistics to find out why relationships exist


○ Why is Game of Thrones so popular?
Statistics in Data Analysis
Descriptive Statistics

Definition: Descriptive statistics summarize and describe the main features of a
dataset. They help in understanding the basic characteristics of the data.

Example: In a marketing department, you're analyzing sales data for a new product launch.
Descriptive statistics would include metrics like average daily sales, standard deviation of
sales, highest and lowest sales days, etc. These statistics provide a snapshot of how the
product is performing in terms of sales volume and variability.
Inferential Statistics

Definition: Inferential statistics involve making inferences or predictions about a
population based on a sample of data.

Example: In an HR department, you're conducting a survey to understand employee
satisfaction. Instead of surveying the entire workforce, you randomly select a sample of
employees. After analyzing the responses from the sample, you use inferential statistics to
draw conclusions about the entire employee population's satisfaction levels. This allows
you to make generalizations about the workforce as a whole based on the sampled data.
Predictive Analysis

Definition: Predictive analysis involves using historical data to make predictions
about future events or outcomes.

Example: In a manufacturing plant, you're tasked with predicting equipment failures to
optimize maintenance schedules. By analyzing past maintenance records and equipment
performance data, you can develop predictive models to forecast when a machine might
fail. This helps in scheduling preventive maintenance, reducing downtime, and avoiding
costly repairs.
Prescriptive Analysis

Definition: Prescriptive analysis focuses on recommending actions or decisions
based on the outcomes of descriptive, predictive, and inferential analyses.

Example: In a retail company, you're using prescriptive analysis to optimize pricing
strategies. After analyzing historical sales data (descriptive), predicting future sales trends
(predictive), and inferring customer preferences (inferential), you can prescribe specific
pricing changes. For example, if the data suggests that lowering prices on certain items
during specific times leads to increased sales, the prescriptive analysis would recommend
implementing these price adjustments to maximize revenue.
Type of Statistics
Type of Statistics

Descriptive Statistics

- Describe and summarize the data.

Example:

● 50% of friends drive to work.
● 25% take the bus.
● 25% bike

Inferential Statistics

- Use a sample of data to make inferences about a larger population

Example:

● What percentage of people drive to work based on the sample data?
Types of Data

Different types of variables require different types of statistical and visualization
approaches.

Therefore, being able to classify the data you are working with is key.

We can classify data in two main ways: by its type and by its measurement
level.
Types of Data

Types of Data Measurement Levels

Categorical
- categories or groups (eg. car brands)
- “Yes” and “No” questions
Types of Data

Categorical (example questions):

Are you currently enrolled in a data analyst course at Success Point?

Do you own a car?

Yes / No
Types of Data

Types of Data Measurement Levels

Numeric
represents numbers

Discrete

Continuous
Types of Data

Discrete: can usually be counted in a finite number of steps.

# number of children (1, 2, 3, …)

IELTS exam score (5, 5.5, 6, 6.5, …)

Continuous: can take infinitely many values within a range, so it is impossible to count.

# weight (68.0389 kg) ⇒ if you gain 0.01 pound

new weight: 68.0434 kg / 150.01 lb

* the reading on a typical scale won’t change


Types of Data

Discrete
Children: the number of children you want to have is directly understandable and is
discrete.

Continuous
Weight: body weight can vary by incomprehensibly small amounts and is continuous.
Examples of Discrete

Grades: A, B, C, D, E, F or 0 to 100%

Number of objects

Money
Examples of Continuous

Height Area Distance Time

Apart from weight, other measurements are also continuous.


Examples

Time on a clock is Discrete

Time in general is Continuous
Types of Data

Types of Data Measurement Levels

Numeric
represents numbers
- Discrete
- Continuous

Categorical
- categories or groups (eg. car brands)
- “Yes” and “No” questions
Levels of Measurement

Types of Data Measurement Levels

Quantitative
- Interval
- Ratio

Qualitative
- Nominal
- Ordinal
Levels of Measurement (Qualitative)
Nominal: cannot be ordered, not numbers (each category is separate and cannot occur at the
same time)

Eg: Vehicle Brands (Toyota, BMW, etc)

Seasons (Summer, Winter, etc)

Ordinal: consists of groups or categories that follow a strict order.

Eg: rating your lunch such as Disgusting, Unappetizing, Neutral, Tasty, Delicious

(although we have words and not numbers, it is obvious that these preferences are
ordered from Negative to Positive.)
Levels of Measurement (Quantitative)
Ratio: Has a True Zero

Eg: I have 2 apples and you have 6 apples. So, you have 3 times as many as I do, because the
ratio is 6 / 2 = 3.

Other examples: number of objects, distance, time
Levels of Measurement (Quantitative)
Interval: Doesn’t have a True Zero

Eg: Temperature
Usually, temperature is expressed in Celsius or Fahrenheit. They are both interval variables.
Today: 5°C (or) 41°F
Yesterday: 10°C (or) 50°F

In terms of Celsius, it seems today is twice as cold as yesterday, but in terms of Fahrenheit, not
really. The issue comes from the fact that zero degrees Celsius and zero degrees Fahrenheit are not
true zeros. These scales were artificially created by humans for convenience.

There is another scale, Kelvin, which has a true zero.

Zero kelvin (absolute zero) is the temperature at which atoms essentially stop moving, and nothing
can be colder than zero kelvin.

This equals -273.15 degrees Celsius or -459.67 degrees Fahrenheit.
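
The temperature example above can be checked with a short Python sketch. The helper function names are made up for illustration; the readings (5°C and 10°C) are the ones from the slide. The "twice as" ratio only appears on the Celsius scale, and it disappears on the Fahrenheit and Kelvin scales.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin (0 K is the true zero)."""
    return c + 273.15


today_c, yesterday_c = 5.0, 10.0  # readings from the slide

# The same two days, compared as a ratio on three different scales.
ratio_c = yesterday_c / today_c
ratio_f = celsius_to_fahrenheit(yesterday_c) / celsius_to_fahrenheit(today_c)
ratio_k = celsius_to_kelvin(yesterday_c) / celsius_to_kelvin(today_c)

print(f"Celsius ratio:    {ratio_c:.3f}")
print(f"Fahrenheit ratio: {ratio_f:.3f}")
print(f"Kelvin ratio:     {ratio_k:.3f}")
```

Only the Kelvin ratio is physically meaningful, because Kelvin is the only one of the three scales with a true zero.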


Levels of Measurement (Quantitative)
In statistics, levels of measurement refer to the types of data you can collect and how you can
analyze it.

Definition: Interval data is numeric data where the difference between values is meaningful, but
there is no true zero. This means you can add and subtract the values, but ratios (like "twice as
much") don’t make sense because the zero point is arbitrary.

Key Features:
● Equal intervals: The difference between values is consistent (e.g., the difference between
10°C and 20°C is the same as the difference between 30°C and 40°C).
● No absolute zero: Zero does not indicate the absence of the quantity (e.g., 0°C doesn’t
mean “no temperature”).
● Mathematical operations: Addition and subtraction make sense, but multiplication and
division don’t.
Examples: Temperature in Celsius or Fahrenheit, IQ scores.
Levels of Measurement (Quantitative)

Definition: Ratio data has all the properties of interval data, but it also has a true zero, meaning
that zero represents a total absence of the quantity being measured. With ratio data, you can
perform all mathematical operations, including multiplication and division.

Key Features:
● Equal intervals: Like interval data, the difference between values is consistent.
● True zero: Zero means the complete absence of the measured quantity (e.g., 0 kg means no
weight at all).
● Mathematical operations: You can add, subtract, multiply, and divide the values, and you
can make meaningful statements like "twice as much" (e.g., 4 meters is twice as long as 2
meters).

Examples: Height, weight, age, income, distance, time.


Types of data

Numeric (Quantitative)

● Continuous (Measured)

○ Airplane speed
○ Time spent waiting in line
○ Stock price

● Discrete (Counted)

○ Number of pets
○ Number of packages shipped

Categorical (Qualitative)

● Nominal (Unordered)

○ Married/unmarried
○ Eye color

● Ordinal (Ordered)

○ Strongly disagree
○ Somewhat disagree
○ Neither agree nor disagree
○ Somewhat agree
○ Strongly agree
Types of data

Nominal (Unordered)

● Married/unmarried ( 1 / 0 )
● Country of residence ( 1, 2, …)

Ordinal (Ordered)

● Strongly disagree ( 1 )
● Somewhat disagree ( 2 )
● Neither agree nor disagree ( 3 )
● Somewhat agree ( 4 )
● Strongly agree ( 5 )

It is important to note that coding categories as numbers doesn’t make them numeric variables.
Data Analysis and Visualization Techniques
Visualization Techniques

● Categorical Variables

● Numerical Variables
Frequency

A frequency is the number of times a value of the data occurs.

Example: Twenty students were asked how many hours they worked per day. Three
students worked two hours, five students worked three hours, and so on.

Data value Frequency


2 3
3 5
Relative Frequency

A relative frequency is the ratio (fraction or proportion) of the number of times a
value of the data occurs in the set of all outcomes to the total number of
outcomes.

To find the relative frequencies, divide each frequency by the total frequency.

Relative frequencies can be written as fractions, percents, or decimals.


Frequency

The sum of the values in the relative frequency column is 20/20, or 1.


Cumulative Frequency

Cumulative relative frequency is the accumulation of the previous relative
frequencies.

To find the cumulative relative frequencies, add all the previous relative
frequencies to the relative frequency for the current row.
Cumulative Frequency

The last entry of the cumulative relative frequency column is one, indicating that one
hundred percent of the data has been accumulated.
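
The three columns (frequency, relative frequency, cumulative relative frequency) can be built with a small Python sketch. Only the first two rows (three students at 2 hours, five at 3 hours) come from the slide. The rest of the twenty values are assumed purely for illustration.

```python
from collections import Counter
from fractions import Fraction

# Hours worked per day by twenty students.
# ASSUMPTION: only the 2-hour and 3-hour counts are from the slide;
# the remaining values are hypothetical.
hours = [2] * 3 + [3] * 5 + [4] * 3 + [5] * 6 + [6] * 2 + [7] * 1

frequency = Counter(hours)       # data value -> number of occurrences
total = sum(frequency.values())  # total frequency (20 students)

rows = []
cumulative = Fraction(0)
for value in sorted(frequency):
    relative = Fraction(frequency[value], total)  # frequency / total
    cumulative += relative                        # running sum so far
    rows.append((value, frequency[value], relative, cumulative))

print("value  frequency  relative  cumulative")
for value, freq, rel, cum in rows:
    print(f"{value:>5}  {freq:>9}  {float(rel):>8.2f}  {float(cum):>10.2f}")
```

Using `Fraction` keeps the sums exact, so the last cumulative entry is exactly one.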
Numerical Variables

Age 18 to 25 Age 26 to 30 Age 31 to 35 Age 36 to 40 ….. Age 60 +


The Histogram

What is The Histogram? When should we use it?

Excel Lesson file


The Histogram

We may create a histogram with unequal intervals.

For example: age groups. You've likely completed some survey where you were asked
about your age and the possible answers were 18 to 25, then 26 to 30, 31 to 35, and so on
until 60 plus.

An explanation for this choice may be that young adults under 25 cannot afford the product,
while adults over 60 have no interest in the product.

In general, it is recommended to use equal intervals.
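
As a rough sketch of what a histogram with equal intervals does, the Python code below groups ages into bins of width 10 and prints a text bar for each bin. The ages, the bin width, and the starting point are all hypothetical.

```python
# Hypothetical ages from a survey (not from the lecture).
ages = [19, 22, 24, 27, 31, 33, 35, 38, 41, 44, 47, 52, 58, 63]

bin_width = 10  # every interval has the same width
start = 10      # lower edge of the first interval

counts = {}
for age in ages:
    # Lower edge of the interval this age falls into.
    lower = start + ((age - start) // bin_width) * bin_width
    counts[lower] = counts.get(lower, 0) + 1

# One text bar per interval; bar length is the number of observations.
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")
```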


How do we represent relationships between two
variables?
Cross Tables and Scatter Plots
Variables

Categorical: Cross Table / Contingency Table (or) Side-by-side Bar Chart

Numerical: Scatter Plots
Excel file
Data Analysis & Visualization

How do we represent relationships between two variables?

Categorical : Cross Table / Contingency table (or) Side-by-Side Bar Chart

Numerical : Scatter Plot

Excel file
Cross Table

The term "cross table" (or contingency table) refers to a table where data is
organized in rows and columns to show the relationship between two or more
categorical variables. In data visualization, a cross table is often paired with a
"side-by-side bar chart", which plots the same counts.

In a side-by-side bar chart, the categories of one variable are displayed along the
x-axis (horizontal axis), and the bars for the categories of the other variable are
drawn side by side within each group. The height or length of each bar represents
the frequency or count of observations in that combination of categories.
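
A cross table can be sketched in Python by counting how often each pair of categories occurs. The records below (department and commute type) are hypothetical and only illustrate the counting.

```python
from collections import Counter

# Hypothetical survey records: (department, how the person commutes).
records = [
    ("Sales", "Drive"), ("Sales", "Bus"), ("Sales", "Drive"), ("Sales", "Bike"),
    ("HR", "Bike"), ("HR", "Drive"), ("HR", "Bus"), ("HR", "Bus"),
]

pair_counts = Counter(records)  # (row category, column category) -> count
row_labels = sorted({dept for dept, _ in records})
col_labels = sorted({commute for _, commute in records})

# The cross table: one row per department, one column per commute type.
table = {
    dept: {commute: pair_counts[(dept, commute)] for commute in col_labels}
    for dept in row_labels
}

print("Dept".ljust(8) + "".join(label.rjust(7) for label in col_labels))
for dept in row_labels:
    cells = "".join(str(table[dept][c]).rjust(7) for c in col_labels)
    print(dept.ljust(8) + cells)
```

Each cell of this table would become one bar in a side-by-side bar chart.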
Scatter Plots
Notes

Scatter Plots are used when we are presenting two numerical variables.

Scatter Plots represent lots and lots of observations.

Outliers are data points that go against the logic of the whole dataset.
Population, Sample
Population, Sample
Population
- The entire set of items or individuals of interest in a study.
- Denoted by “N”.
- The numbers we have obtained when using a population are called “parameters”.

Sample
- A subset selected from the larger population.
- Denoted by “n”.
- The numbers we have obtained with a sample are called “statistics”.
Population, Sample

Let’s say we would like to perform a survey of the job prospects of the students
studying at NY University.

What is the population?

** Population is hard to define and hard to observe in real life.


Sample

Sample ⇒ Less time consuming

Less costly (cheaper)

Data analysts will almost always be working with sample data.
Sample
Randomness
- A random sample is collected when each member of the sample is chosen
from the population strictly by chance.

Representativeness
- A representative sample is a subset of the population that accurately reflects
the members of the entire population.

Example: Interviewed 50 students from the canteen.

- Randomness: Violated. Not chosen by chance. They were a group of non-University
students who were there for lunch.
- Representativeness: It represents only students who eat in the canteen.
Sample

Student Database:

The safest way would be to get access to the student database and contact
individuals in a random manner. However, such surveys are almost impossible to
conduct without assistance from the university.
Sampling Methods

Random Sampling: Every individual in the population has an equal chance of being selected. This
method helps reduce bias and is the basis for many statistical tests.
Stratified Sampling: The population is divided into subgroups (strata), and random samples are
taken from each stratum. This ensures that each subgroup is adequately represented in the sample.
Cluster Sampling: The population is divided into clusters (usually geographically), and entire
clusters are randomly selected. This method is cost-effective and useful when a population is too
large or spread out.
Systematic Sampling: Individuals are selected at regular intervals from an ordered list. This method
is simple and quick, but it requires that the list be random or that periodic patterns do not exist in
the population.
Convenience Sampling: Individuals are selected based on their easy availability. While not
statistically rigorous, this method is often used in preliminary research.
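
The five sampling methods above can be sketched with Python's standard `random` module. The population of 100 students, the "year" attribute used as the stratum, and all sample sizes are hypothetical choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 students, each tagged with a year of study.
population = [{"id": i, "year": 1 + i % 4} for i in range(100)]

# Random sampling: every student has an equal chance of selection.
simple = rng.sample(population, 10)

# Stratified sampling: group by year, then sample within each group.
strata = {}
for student in population:
    strata.setdefault(student["year"], []).append(student)
stratified = [s for group in strata.values() for s in rng.sample(group, 3)]

# Cluster sampling: split into clusters of 10 and keep whole clusters.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [s for cluster in rng.sample(clusters, 2) for s in cluster]

# Systematic sampling: every k-th student after a random start.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Convenience sampling: whoever is easiest to reach (here, the first 10).
convenience = population[:10]

for name, sample in [("random", simple), ("stratified", stratified),
                     ("cluster", cluster_sample), ("systematic", systematic),
                     ("convenience", convenience)]:
    print(f"{name:<12} n = {len(sample)}")
```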
What we have done

● Populations
● Samples
● Types of Variables
● Measurement Levels
● Graphics and Tables
Descriptive Statistics
Mean, Median, and Mode

Mean, Median, and Mode

They are all, in their own way, trying to measure the “common” point within the
data, the value that is “normal”.

Together they are known as the Measures of Central Tendency.


Measures of Central Tendency

The first measure is the "Mean", also known as the simple average.

It is denoted by the Greek letter 'μ' for a population and 'x̄' for a sample.
Measures of Central Tendency

We can find the mean of a data set by adding up all of its components and then
dividing the sum by the number of components.

Mean = (x1 + x2 + x3 + … + xn) / n

Here n is the number of observations in a sample (use N for a population).

The mean is the most common measure of central tendency, but its downside is that
it is easily affected by outliers.
Measures of Central Tendency

The median is basically the middle number in an ordered data set.

● To calculate the median, first order the data from smallest to largest.
● If the number of observations n is odd, the median is the value at position (n + 1) / 2
in the ordered list.
● If the number of observations is even, (n + 1) / 2 is not a whole number, so take the
average of the two values just below and just above that position.
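
A small Python sketch, using the standard `statistics` module and hypothetical values, shows how a single outlier pulls the mean while the median barely moves.

```python
import statistics

values = [1, 2, 3, 4, 5]       # hypothetical data, odd number of observations
with_outlier = values + [100]  # the same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)      # value at position (5 + 1) / 2 = 3

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # average of the 3rd and 4th values

print(f"without outlier: mean = {mean_before}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
```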
Measures of Central Tendency

The mode is the value that occurs most often.

It can be used for both numerical and categorical data.

In general, we often have multiple modes. Usually, two or three modes are
tolerable, but more than that would defeat the purpose of finding a mode.
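
The mode can be found for numerical and categorical data alike. The sketch below uses `statistics.mode` and `statistics.multimode` (available from Python 3.8) on hypothetical data.

```python
import statistics

shoe_sizes = [38, 40, 40, 41, 42]                  # hypothetical numerical data
colors = ["red", "blue", "red", "green", "blue"]   # hypothetical categorical data

single_mode = statistics.mode(shoe_sizes)  # the one most frequent value
all_modes = statistics.multimode(colors)   # every value tied for most frequent

print("mode of shoe sizes:", single_mode)
print("modes of colors:", all_modes)
```

`multimode` returns the tied values in the order they first appear in the data.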
Measures of Central Tendency

Which measure is best?


Measures of Central Tendency

Which measure is best?

The example shows us that the measures of central tendency should be used
together, rather than independently.
Measures of Asymmetry (Skewness)

After exploring the measures of central tendency, let's move on to the measures
of asymmetry.

The most commonly used tool to measure asymmetry is skewness.

Formula: We will not get into computation but rather the meaning of skewness.
Skewness

Skewness denotes whether the data points in a dataset are predominantly
clustered on one side. It is named for the direction in which the tail extends,
not for the side on which most of the data is clustered.

Skewness is significant as it provides insights into the distribution of data,
offering valuable information regarding its positioning and spread.
Skewness

If the distribution of data is skewed to the right (positive skew), the mean is
typically greater than the median.

If the distribution of data is skewed to the left (negative skew), the mean is
typically less than the median.

If the distribution of data is symmetrical, the mean, median, and mode are equal
and the skew is zero.
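
The lecture skips the skewness formula, so the Python sketch below is only one common choice, not necessarily the one on the slide. It uses the moment coefficient of skewness (third central moment divided by the second central moment to the power 1.5). The three datasets are hypothetical.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness (one common definition, assumed here)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]      # long tail to the right
left_skewed = [11 - x for x in right_skewed]  # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed),
                   ("symmetric", symmetric)]:
    print(f"{name:<10} skew = {skewness(data):+.2f}  "
          f"mean = {statistics.mean(data)}  median = {statistics.median(data)}")
```

The sign of the result matches the direction of the tail, and in these datasets the mean sits on the tail side of the median.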
Skewness

Positive skew
Skewness

Negative skew
Skewness

Zero skew
Variance

Variance measures the dispersion of a set of data points around their mean
Variance

Variance in statistics measures the dispersion or spread of a set of data points
around their mean or average.

In simpler terms, variance tells us how much the data points in a dataset vary or
spread out from the average value. A high variance indicates that the data points
are spread out over a wide range, while a low variance suggests that the data
points are clustered closely around the mean.
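
As a minimal sketch with hypothetical monthly sales figures, the code below computes the variance by hand and compares it with Python's `statistics` module. The population variance divides by N, while the sample variance divides by n - 1.

```python
import statistics

sales = [10, 12, 9, 11, 13]  # hypothetical monthly sales (in thousands)

n = len(sales)
mean = sum(sales) / n
squared_deviations = [(x - mean) ** 2 for x in sales]

population_variance = sum(squared_deviations) / n    # divide by N
sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1

print("population variance:", population_variance)
print("sample variance:    ", sample_variance)

# The standard library gives the same answers.
print(statistics.pvariance(sales), statistics.variance(sales))
```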
Variance (example)
Imagine you have a dataset representing sales figures for a product over several months. Each data point in this dataset is a
monthly sales figure. Now, you want to know not just the average sales but also how much the sales figures fluctuate or deviate
from this average. Variance gives you precisely that.
If the variance is high, it means the data points are spread out widely from the mean, indicating a lot of variability in your dataset.
Conversely, if the variance is low, it means the data points are clustered closely around the mean, suggesting less variability.

As a data analyst, understanding variance helps you:


● Assess Data Spread: Variance gives you a quantitative measure of how spread out your data points are, providing insights
into the variability within your dataset.
● Compare Datasets: You can use variance to compare the variability of different datasets. For example, you might compare
the sales variance of two different products to see which one has more consistent sales.
● Identify Patterns: High variance may indicate potential patterns or trends in your data that you may want to investigate
further.
● Make Inferences: Variance is often used in statistical tests and models to make inferences about populations based on
sample data.
Standard Deviation

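The standard deviation is the square root of the variance: the population standard
deviation is the square root of the population variance, and the sample standard
deviation is the square root of the sample variance.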

A low standard deviation indicates that the values tend to be close to the mean
of the set, while a high standard deviation indicates that the values are spread
out over a wider range.
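
The formulas themselves are not reproduced in this text. For reference, the standard textbook definitions, written with the same symbols used earlier (μ and N for a population, x̄ and n for a sample), are:

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

The expression under each square root is the corresponding variance.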
Coefficient of Variation

It is equal to the standard deviation divided by the mean, and is also called the
relative standard deviation.

Coefficient of Variation = Standard Deviation / Mean

Standard Deviation: the common measure of variability for a single dataset.

Coefficient of Variation: used for comparing the variability of two or more datasets.


Example

Imagine you have two products: Product A and Product B. Both products have varying
sales figures over several months. Product A has an average monthly sales figure of
$10,000, while Product B has an average monthly sales figure of $20,000.

Now, let's say the standard deviation for Product A is $2,000, and for Product B, it's $5,000.

At first glance, you might think Product B has more variability in sales because its standard
deviation is higher. However, when you calculate the coefficient of variation (CoV), you get
a better understanding of the relative variability.
Example

For Product A:
Standard Deviation = $2,000
Mean = $10,000
Coefficient of Variation (CoV) = (Standard Deviation / Mean) * 100
= (2000 / 10000) * 100
= 20%

For Product B:
Standard Deviation = $5,000
Mean = $20,000
Coefficient of Variation (CoV) = (Standard Deviation / Mean) * 100
= (5000 / 20000) * 100
= 25%
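
The same calculation as a short Python sketch. The function name is made up for illustration; the figures are the ones given for Product A and Product B.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    return 100 * std_dev / mean


cov_a = coefficient_of_variation(std_dev=2_000, mean=10_000)  # Product A
cov_b = coefficient_of_variation(std_dev=5_000, mean=20_000)  # Product B

print(f"Product A: {cov_a:.0f}%")
print(f"Product B: {cov_b:.0f}%")
```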
Example

The insight from this analysis is that while Product B has higher average sales compared to
Product A, it also exhibits higher variability in its sales figures relative to its average.

The higher standard deviation and coefficient of variation for Product B suggest that its sales
figures fluctuate more widely around the average compared to Product A. This indicates that
while Product B may have higher average sales, it also carries a higher degree of risk or
uncertainty in its sales performance.

Therefore, while Product B may offer greater potential for higher sales, it also comes with
greater relative variability or risk in its sales figures, which may need to be taken into account
when setting sales targets or making business decisions. On the other hand, Product A, despite
having lower average sales, demonstrates more consistent sales performance relative to its
average.
Covariance

Now, we'll explore measures that can help us explore the relationship between
two variables.

When two variables are related, the main statistic used to measure how they vary
together is called covariance. Unlike variance, which is never negative, covariance
may be: > 0 (positive), = 0 (equal to zero), < 0 (negative).

excel*
Covariance

Covariance gives a sense of direction:

> 0 : the two variables move together

< 0 : the two variables move in opposite directions

= 0 : the two variables have no linear relationship (this alone does not prove they are independent)


Covariance

Interpretation of Covariance Value

● Positive Covariance: Indicates that when one variable increases, the other variable
tends to increase as well. For example, if the covariance between house size and price
is positive, it suggests that larger houses tend to have higher prices.

● Negative Covariance: Indicates that when one variable increases, the other variable
tends to decrease. For instance, if the covariance between house size and price is
negative, it suggests that smaller houses tend to have higher prices.

● Covariance of Zero: Suggests that there's no linear relationship between the variables.
Covariance

Limitations of Covariance
● Covariance is not standardized, meaning its value depends on the scale of the
variables. Therefore, it's not directly comparable across different datasets or
variables.
● Covariance only measures the direction of the relationship between variables and not
the strength or the degree of relationship.
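
A minimal Python sketch of the sample covariance, using hypothetical house sizes and prices, illustrates the three signs and the scale problem described above. The function name and all figures are assumptions for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


sizes = [50, 70, 90, 110, 130]              # hypothetical house sizes
prices_k = [100, 140, 170, 220, 270]        # hypothetical prices, in thousands
prices_usd = [p * 1000 for p in prices_k]   # the same prices, in dollars

positive = sample_covariance(sizes, prices_k)        # larger house, higher price
negative = sample_covariance(sizes, prices_k[::-1])  # larger house, lower price
rescaled = sample_covariance(sizes, prices_usd)      # same data, different unit

# y = x squared: clearly dependent on x, yet the covariance is zero.
xs = [-2, -1, 0, 1, 2]
zero = sample_covariance(xs, [x ** 2 for x in xs])

print(positive, negative, rescaled, zero)
```

Changing the price unit multiplies the covariance by 1000 even though the relationship is unchanged, which is why covariance values cannot be compared across datasets. The last case shows that a covariance of zero only rules out a linear relationship.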
Covariance

Use in Decision Making:


● Understanding covariance can be useful in decision-making processes, such as real
estate investments. If you're considering investing in properties, knowing the
covariance between house size and price can help you predict how changes in one
variable might affect the other.
● And, covariance is used in portfolio management in finance to understand how
different assets move in relation to each other. Assets with low or negative covariance
can help in diversification to reduce overall risk.
