A Statistical Perspective On Data Mining
Name: Keerthi S
Class: 5th sem
BCA (B)
Statistical perspective on data mining
• Data mining involves extracting valuable insights from large datasets using
computational techniques.
• From a statistical perspective, data mining is deeply rooted in probability
theory, hypothesis testing, and statistical modeling.
• The goal is to analyze data to identify patterns, make predictions, and inform
decision-making processes.
1. Point estimation
• Point estimation is a statistical technique used to estimate the value of an
unknown parameter in a population based on a sample from that population.
• A point estimate is a single value that serves as an approximation of this
parameter.
• Example: Imagine a farmer wants to estimate the average weight of all apples
in their orchard. Measuring every apple in the orchard would take too much
time, so the farmer takes a random sample of 50 apples instead.
1. The farmer weighs each of these 50 apples and calculates the average
weight, which turns out to be 150 grams.
2. This sample mean of 150 grams is used as the point estimate for the average
weight of all apples in the orchard.
• In this case, the point estimate (150 grams) is a single value standing in
for the unknown population parameter, the true average weight of all apples.
1. Bias of an Estimator
The bias of an estimator is the difference between the expected (average)
value of the estimator and the true value of the parameter it is estimating.
2. Unbiased Estimator
An unbiased estimator is one whose expected value is equal to the true value
of the parameter it estimates. This means it does not systematically
overestimate or underestimate the parameter.
3. Mean Squared Error (MSE)
The Mean Squared Error (MSE) of an estimator measures the average of the squares of the
errors, where the error is the difference between the estimator and the true parameter
value. MSE combines both the variance of the estimator and its bias.
4. Squared Error
Squared Error refers to the square of the difference between an estimated value and the
true parameter value for a single estimate. It helps measure how far an estimate is from the
actual parameter value.
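In symbols, these standard definitions can be written as follows, with \(\hat{\theta}\) the estimator and \(\theta\) the true parameter:

\[
\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta,
\qquad
\mathrm{MSE}(\hat{\theta}) = E\!\left[(\hat{\theta}-\theta)^2\right]
= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2
\]

An estimator is unbiased exactly when \(\mathrm{Bias}(\hat{\theta}) = 0\), in which case the MSE reduces to the variance alone.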
• Let's say you're trying to estimate the mean height of a group of people, and you use a
sample mean as your point estimate.
• Given:
• The true mean height of the population = 170 cm.
• Your point estimate from the sample = 172 cm.
• The error of this estimate is 172 − 170 = 2 cm, so its squared error is (172 − 170)² = 4 cm².
• Jackknife Estimator
The Jackknife Estimator is a resampling technique used to estimate the bias
and variance of a statistic. It involves systematically leaving out one
observation at a time from the sample set and recalculating the estimate to
create a "pseudo-sample." These pseudo-estimates are then averaged to
obtain the jackknife estimate.
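A minimal sketch of the jackknife in Python (the jackknife function and the apple-weight sample are illustrative assumptions, not from the slides):

```python
import numpy as np

def jackknife(data, statistic):
    """Jackknife estimates of bias and standard error for `statistic`."""
    n = len(data)
    full = statistic(data)
    # Leave-one-out pseudo-samples: drop observation i, recompute the statistic
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    mean_loo = loo.mean()
    bias = (n - 1) * (mean_loo - full)                          # jackknife bias estimate
    se = np.sqrt((n - 1) / n * ((loo - mean_loo) ** 2).sum())   # jackknife standard error
    return full - bias, bias, se                                # bias-corrected estimate

# Hypothetical apple weights (grams), echoing the orchard example
weights = np.array([148.0, 152.0, 150.0, 149.0, 151.0])
est, bias, se = jackknife(weights, np.mean)
print(est, bias, se)  # the sample mean is unbiased, so the bias comes out 0
```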
2. Model-based summarization
• Summarization is the presentation of a summary or report of the data in a
comprehensible and informative manner.
• The summary is computed from the entire dataset, so it conveys information
about the data as a whole.
• A carefully constructed summary conveys trends and patterns in the dataset
in a simplified manner.
Descriptive statistics
Descriptive statistics refers to a set of methods used to summarize and
describe the main features of a dataset, such as its central tendency,
variability, and distribution.
Ages = [10, 12, 10, 14, 16]
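A minimal sketch computing these summaries for the ages list above, using Python's standard statistics module:

```python
import statistics

ages = [10, 12, 10, 14, 16]

print(statistics.mean(ages))    # central tendency: (10+12+10+14+16)/5 = 12.4
print(statistics.median(ages))  # middle of the sorted values: 12
print(statistics.mode(ages))    # most frequent value: 10
print(statistics.pstdev(ages))  # variability: population standard deviation
```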
Frequency distribution
• A frequency distribution is a visual or tabular representation of
the number of observations within a given interval
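For the same ages list, a frequency table can be built by counting how often each value occurs; a minimal sketch using collections.Counter:

```python
from collections import Counter

ages = [10, 12, 10, 14, 16]
frequency = Counter(ages)             # maps each value to its count
for value, count in sorted(frequency.items()):
    print(value, count)               # 10 -> 2, 12 -> 1, 14 -> 1, 16 -> 1
```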
Histogram
• A histogram is a type of bar chart used to represent the distribution
of numerical data. It shows how often different ranges (or bins) of
values occur in the dataset
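A minimal sketch with matplotlib (the waiting-time values are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical waiting times (minutes) for illustration
waiting_times = [4.2, 5.1, 5.8, 4.9, 6.3, 5.5, 4.7, 5.0, 6.1, 5.4]

plt.hist(waiting_times, bins=4, edgecolor="black")  # 4 equal-width bins
plt.xlabel("Waiting time (minutes)")
plt.ylabel("Frequency")
plt.title("Distribution of waiting times")
plt.show()
```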
Hypothesis testing
• Hypothesis testing is a statistical method for deciding, based on sample
data, whether there is enough evidence to support a claim about a population.
• Example
• A coffee shop claims that the average waiting time for an order is 5
minutes. A customer wants to test whether the average waiting time
is different from 5 minutes based on a sample of orders.
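A minimal sketch of this test as a two-sided one-sample t-test in Python (the sample values and the 0.05 significance level are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of observed waiting times (minutes)
waiting_times = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 6.0, 5.3])

# H0: mean waiting time = 5  vs.  H1: mean waiting time != 5 (two-sided)
t_stat, p_value = stats.ttest_1samp(waiting_times, popmean=5)

if p_value < 0.05:
    print("Reject H0: the average waiting time differs from 5 minutes")
else:
    print("Fail to reject H0: no significant evidence of a difference")
```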
• Decision Making:
To determine whether there is enough evidence to support a particular claim or hypothesis
about a population. For example, testing if a new drug is more effective than an existing one.
• Evaluate Assumptions:
Hypothesis testing helps to assess the validity of assumptions or theories about a population.
For example, it can test whether the average income in a city is equal to a certain value, based
on sample data.
• Scientific Validation:
It is commonly used in scientific research to validate or reject a research hypothesis. In
experimental settings, researchers use hypothesis tests to determine if the effects observed in
the experiment are statistically significant.
• Comparing Groups:
Hypothesis testing allows for comparisons between groups or treatments. For instance,
comparing the performance of two different teaching methods by testing whether their
outcomes (e.g., exam scores) differ significantly.
• Predictive Analysis:
It helps in understanding the relationships between variables and can be used to predict
outcomes based on sample data.
Regression
Purpose:
• Regression is used to model the relationship between a dependent (response) variable and one or
more independent (predictor) variables. It predicts the value of the dependent variable based on
the independent variables.
• Unlike correlation, regression distinguishes a response variable from predictors and can be
used for prediction; on its own, however, a regression fit does not establish causation.
• Key points:
• The most common form of regression is linear regression, where the relationship between the
variables is modeled as a straight line.
• In simple linear regression, the relationship is between two variables:
• Y = β₀ + β₁X
• Y: Dependent variable
• X: Independent variable
• β₀: Y-intercept
• β₁: Slope (shows how much Y changes for a unit change in X)
• Example:
• Simple Linear Regression Example: Predicting exam scores based on the number of study hours.
• The equation might be: Exam Score = 50 + 5 × Study Hours.
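A minimal sketch fitting a line of this form by least squares in Python (the hours/scores data are hypothetical, so the fitted coefficients will differ from the 50 and 5 in the example):

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
scores = np.array([55, 61, 64, 71, 74, 80], dtype=float)

# Least-squares fit of scores = b0 + b1 * hours
b1, b0 = np.polyfit(hours, scores, deg=1)  # polyfit returns the slope first for deg=1
print(f"Exam Score = {b0:.1f} + {b1:.1f} x Study Hours")

# Predicted score for 7 hours of study
print(b0 + b1 * 7)
```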
Correlation
• Purpose:
Correlation measures the strength and direction of a linear relationship between two
variables.
It does not imply causation, only the degree of association between variables.
• Key points:
• The most commonly used measure is the Pearson correlation coefficient (denoted as r),
which ranges from -1 to 1.
• r = 1: Perfect positive correlation (as one variable increases, the other increases proportionally).
• r = -1: Perfect negative correlation (as one variable increases, the other decreases proportionally).
• r = 0: No linear correlation (the variables have no linear relationship, though they may still be related non-linearly).
• r > 0: Positive correlation (both variables tend to increase or decrease together).
• r < 0: Negative correlation (one variable tends to increase as the other decreases).
• Example:
• If you are studying the relationship between study hours and exam scores, a correlation
of 0.85 would indicate a strong positive correlation — as study hours increase, exam
scores tend to increase.
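A minimal sketch computing Pearson's r in Python (same hypothetical hours/scores data as in the regression sketch):

```python
import numpy as np

# Hypothetical data: study hours vs. exam scores
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
scores = np.array([55, 61, 64, 71, 74, 80], dtype=float)

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(hours, scores)[0, 1]
print(r)  # near +1 here: a strong positive linear association
```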
THANK YOU