A Statistical Perspective On Data Mining

The document provides an overview of data mining from a statistical perspective, emphasizing techniques such as point estimation, bias of an estimator, and hypothesis testing. It discusses various statistical methods and visualizations, including descriptive statistics, regression, and correlation, to analyze and summarize data effectively. Additionally, it highlights the importance of these methods in decision-making and predictive analysis.
Copyright
© All Rights Reserved

Data mining

A statistical perspective on data mining

Name: Keerthi S
Class: 5th sem, BCA (B)
Statistical perspective on data mining

• Data mining involves extracting valuable insights from large datasets using
computational techniques.
• From a statistical perspective, data mining is deeply rooted in probability
theory, hypothesis testing, and statistical modeling.
• The goal is to analyze data to identify patterns, make predictions, and inform
decision-making processes.
Point estimation
• Point estimation is a statistical technique used to estimate the value of an
unknown parameter in a population based on a sample from that population.
• A point estimate is a single value that serves as an approximation of this
parameter.
• Example: Imagine a farmer wants to estimate the average weight of all apples
in their orchard. Measuring every apple in the orchard would take too much
time, so the farmer takes a random sample of 50 apples instead.
1. The farmer weighs each of these 50 apples and calculates the average
weight, which turns out to be 150 grams.
2. This sample mean of 150 grams is used as the point estimate for the average
weight of all apples in the orchard.
• In this case, the point estimate (150 grams) is the single value that
approximates the true average weight of all apples in the orchard.
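The farmer's procedure can be sketched in a few lines of Python. The orchard below is simulated with made-up weights (an assumption for illustration, not data from the slides), so the printed numbers will not be exactly 150 grams.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical orchard: 5,000 apple weights in grams (simulated, not from the slides).
orchard = [random.gauss(150, 20) for _ in range(5000)]

# The farmer weighs a simple random sample of 50 apples.
sample = random.sample(orchard, 50)

# The sample mean is the point estimate of the population mean.
point_estimate = statistics.mean(sample)
true_mean = statistics.mean(orchard)

print(f"point estimate: {point_estimate:.1f} g, true mean: {true_mean:.1f} g")
```

The point estimate is usually close to, but not equal to, the true mean; the gap is the estimation error discussed next.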
1. Bias of an Estimator
The bias of an estimator is the difference between the expected (average)
value of the estimator and the true value of the parameter it is estimating.

2. Unbiased Estimator
An unbiased estimator is one whose expected value is equal to the true value
of the parameter it estimates. This means it does not systematically
overestimate or underestimate the parameter.
3. Mean Squared Error (MSE)
The Mean Squared Error (MSE) of an estimator measures the average of the squares of the
errors, where the error is the difference between the estimator and the true parameter
value. MSE combines both the variance of the estimator and its bias.
4. Squared Error
Squared Error refers to the square of the difference between an estimated value and the
true parameter value for a single estimate. It helps measure how far an estimate is from the
actual parameter value.
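In symbols, the four definitions above can be written as follows. The notation is an assumption made here, since the slides do not fix one: θ is the true parameter, θ̂ its estimator, and E[·] the expected value over repeated samples.

```latex
\begin{align*}
\operatorname{Bias}(\hat\theta) &= E[\hat\theta] - \theta \\
\hat\theta \text{ is unbiased} &\iff E[\hat\theta] = \theta \\
\operatorname{MSE}(\hat\theta) &= E\big[(\hat\theta - \theta)^2\big]
  = \operatorname{Var}(\hat\theta) + \operatorname{Bias}(\hat\theta)^2 \\
\text{squared error of a single estimate} &= (\hat\theta - \theta)^2
\end{align*}
```

The third line is the sense in which MSE combines variance and bias: an unbiased estimator has an MSE equal to its variance.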
• Let's say you're trying to estimate the mean height of a group of people, and you use a
sample mean as your point estimate.
• Given:
• The true mean height of the population = 170 cm.
• Your point estimate from the sample, x̄ = 172 cm.
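With these two numbers, the error and squared error of this one estimate follow directly. The short sketch below only does that arithmetic; bias and MSE cannot be read off a single estimate, because they are averages over many repeated samples.

```python
true_mean = 170.0  # cm, true population mean from the example
estimate = 172.0   # cm, sample mean used as the point estimate

# Error of this single estimate: estimate minus true value.
error = estimate - true_mean

# Squared error of this single estimate.
squared_error = error ** 2

print(f"error = {error} cm, squared error = {squared_error} cm^2")
```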
• Jackknife Estimator
The Jackknife Estimator is a resampling technique used to estimate the bias
and variance of a statistic. It involves systematically leaving out one
observation at a time from the sample set and recalculating the estimate to
create a "pseudo-sample." These pseudo-estimates are then averaged to
obtain the jackknife estimate.
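A minimal leave-one-out sketch of this idea is shown below. The function name `jackknife`, the sample heights, and the choice of the bias-corrected form n·θ̂ − (n − 1)·(average of the leave-one-out estimates) are assumptions for illustration; the bias and variance formulas are the standard jackknife ones.

```python
import statistics


def jackknife(sample, statistic):
    """Return (bias-corrected estimate, bias estimate, variance estimate)."""
    n = len(sample)
    full_estimate = statistic(sample)

    # Recompute the statistic n times, each time leaving out one observation.
    leave_one_out = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n

    # Standard jackknife bias and variance estimates.
    bias = (n - 1) * (loo_mean - full_estimate)
    variance = (n - 1) / n * sum((v - loo_mean) ** 2 for v in leave_one_out)

    return full_estimate - bias, bias, variance


# Hypothetical heights in cm (made-up values, not from the slides).
heights = [168.0, 172.0, 171.0, 169.0, 175.0, 165.0]
corrected, bias, variance = jackknife(heights, statistics.mean)
print(corrected, bias, variance)
```

For the sample mean the estimated bias is zero (the sample mean is unbiased) and the jackknife variance equals the usual s²/n, which makes the mean a convenient sanity check before applying the function to other statistics.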
2. Model-based summarization
• It can be defined as the presentation of a summary/report of generated
data in a comprehensible and informative manner.
• The summary is derived from the entire dataset, so that it conveys
information about the data as a whole.
• It is a carefully performed summary that will convey trends and patterns
from the dataset in a simplified manner.
Descriptive statistics
Descriptive statistics refers to a set of methods used to summarize and
describe the main features of a dataset, such as its central tendency,
variability, and distribution.
Ages = [10, 12, 10, 14, 16]
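For the Ages data, the usual measures of central tendency and variability can be computed with Python's standard `statistics` module. Which measures the original slides showed is not recoverable, so the selection below (mean, median, mode, range, sample variance) is an assumption.

```python
import statistics

ages = [10, 12, 10, 14, 16]

mean = statistics.mean(ages)          # central tendency: arithmetic average
median = statistics.median(ages)      # central tendency: middle value when sorted
mode = statistics.mode(ages)          # central tendency: most frequent value
data_range = max(ages) - min(ages)    # variability: spread between extremes
variance = statistics.variance(ages)  # variability: sample variance (n - 1 divisor)

print(mean, median, mode, data_range, variance)
```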
Frequency distribution
• A frequency distribution is a visual or tabular representation of
the number of observations within a given interval
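As a small illustration, a tabular frequency distribution for the Ages data from the descriptive statistics section can be built with `collections.Counter`. Treating each distinct age as its own interval is a simplifying assumption.

```python
from collections import Counter

ages = [10, 12, 10, 14, 16]

# Count how many times each value occurs.
frequency = Counter(ages)

# Print the distribution as a simple two-column table.
print("Age | Frequency")
for age in sorted(frequency):
    print(f"{age:>3} | {frequency[age]}")
```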
Histogram
• A histogram is a type of bar chart used to represent the distribution
of numerical data. It shows how often different ranges (or bins) of
values occur in the dataset

• Bins: The data range is divided into intervals (bins), and each bar
represents the number of data points in that interval.
• X-axis: Represents the range of values or intervals (bins).
• Y-axis: Represents the frequency or count of data points in each bin.
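The binning step behind a histogram can be sketched without any plotting library. The example reuses the ten exam scores from this document's box plot example; the bin width of 10 and the starting edge of 40 are assumptions chosen for illustration.

```python
scores = [45, 55, 60, 65, 70, 75, 80, 85, 90, 95]
bin_width = 10  # assumed bin width
start = 40      # assumed lower edge of the first bin

# Map each score to the lower edge of its bin and count occurrences.
counts = {}
for score in scores:
    lower_edge = start + ((score - start) // bin_width) * bin_width
    counts[lower_edge] = counts.get(lower_edge, 0) + 1

# Text histogram: one row per bin, one '#' per data point.
for lower_edge in sorted(counts):
    upper = lower_edge + bin_width - 1
    print(f"{lower_edge}-{upper} | {'#' * counts[lower_edge]}")
```

Each row corresponds to one bar: the label is the bin on the x-axis and the number of `#` marks is the frequency on the y-axis.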
Box plot
• A box plot is a standardized way of displaying the distribution of data
based on a five-number summary.
• The five numbers are the minimum, first quartile, median, third quartile,
and maximum; the plot also shows the interquartile range and any outliers.
• Example : Consider the following set of 10
exam scores: Scores = [45, 55, 60, 65, 70, 75,
80, 85, 90, 95]
•Minimum: 45 (the smallest data point)
•Q1 (Lower Quartile): 60
•Median (Q2): 72.5
•Q3 (Upper Quartile): 85
•Maximum: 95 (the largest data point)
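The values above match the "median of halves" convention for quartiles: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half. Other conventions, including those in `statistics.quantiles`, give slightly different quartiles for this data. The sketch below assumes the median-of-halves convention and the common 1.5 × IQR rule for outliers; the function name is hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    upper_half = values[half + n % 2:]  # skip the middle value when n is odd
    return (
        values[0],
        statistics.median(lower_half),
        statistics.median(values),
        statistics.median(upper_half),
        values[-1],
    )


scores = [45, 55, 60, 65, 70, 75, 80, 85, 90, 95]
minimum, q1, median, q3, maximum = five_number_summary(scores)

# Interquartile range and the 1.5 * IQR fences used to flag outliers.
iqr = q3 - q1
outliers = [s for s in scores if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]

print(minimum, q1, median, q3, maximum, iqr, outliers)
```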
Scatter diagram
• A scatter plot is a type of graph used to display the relationship
between two continuous variables.
• Each point on the plot represents a pair of values, with one
value on the x-axis and the other on the y-axis.
• Scatter plots help visualize the correlation or trend between
variables, as well as the spread or distribution of the data.
• Purpose:
• To determine if there is a relationship between the two
variables.
• To identify trends, correlations, or patterns (positive, negative,
or no correlation).
• To detect outliers or unusual patterns
Bayes theorem
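The slide content for this topic did not survive extraction. For reference, the standard statement of the theorem for two events A and B (symbols assumed here) is:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
```

Here P(A) is the prior probability of A, P(B | A) is the likelihood of observing B when A holds, and P(A | B) is the updated (posterior) probability of A once B has been observed.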
Hypothesis testing
• Hypothesis testing is a statistical method used to determine whether
there is enough evidence in a sample of data to support or reject a
hypothesis about a population.

• Example
• A coffee shop claims that the average waiting time for an order is 5
minutes. A customer wants to test whether the average waiting time
is different from 5 minutes based on a sample of orders
•Decision Making:
•To determine whether there is enough evidence to support a particular claim or hypothesis
about a population. For example, testing if a new drug is more effective than an existing one.
•Evaluate Assumptions:
•Hypothesis testing helps to assess the validity of assumptions or theories about a population.
For example, it can test whether the average income in a city is equal to a certain value, based
on sample data.
•Scientific Validation:
•It is commonly used in scientific research to validate or reject a research hypothesis. In
experimental settings, researchers use hypothesis tests to determine if the effects observed in
the experiment are statistically significant.
•Comparing Groups:
•Hypothesis testing allows for comparisons between groups or treatments. For instance,
comparing the performance of two different teaching methods by testing whether their
outcomes (e.g., exam scores) differ significantly.
•Predictive Analysis:
•It helps in understanding the relationships between variables and can be used to predict
outcomes based on sample data.
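The coffee shop example above can be sketched as a two-sided one-sample t-test of H0: mean waiting time = 5 minutes. The ten waiting times are made-up values, and the 5% significance level is an assumption; 2.262 is the standard two-sided critical value of the t distribution with 9 degrees of freedom at that level.

```python
import math
import statistics

# Hypothetical waiting times in minutes for 10 orders (not from the slides).
waiting_times = [5.5, 6.0, 4.8, 5.9, 6.2, 5.1, 5.7, 6.4, 5.3, 6.1]
claimed_mean = 5.0      # H0: the average waiting time is 5 minutes
critical_value = 2.262  # t critical value, two-sided, alpha = 0.05, df = 9

n = len(waiting_times)
sample_mean = statistics.mean(waiting_times)
sample_sd = statistics.stdev(waiting_times)  # sample standard deviation

# Test statistic: distance from the claimed mean in standard-error units.
t_statistic = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))

# Reject H0 when the statistic falls outside the critical region.
reject_h0 = abs(t_statistic) > critical_value

print(f"t = {t_statistic:.3f}, reject H0: {reject_h0}")
```

With this made-up sample the statistic exceeds the critical value, so the customer would conclude that the average waiting time differs from 5 minutes; a different sample could lead to the opposite decision.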
Regression
Purpose:
• Regression is used to model the relationship between a dependent (response) variable and one or
more independent (predictor) variables. It predicts the value of the dependent variable based on
the independent variables.
• Regression quantifies how the dependent variable changes with the predictors. Like correlation,
it does not by itself establish a causal relationship; that requires a suitable study design.
• Key points:
• The most common form of regression is linear regression, where the relationship between the
variables is modeled as a straight line.
• In simple linear regression, the relationship is between two variables:
• Y = β₀ + β₁X
• Y: Dependent variable
• X: Independent variable
• β₀: Y-intercept
• β₁: Slope (shows how much Y changes for a unit change in X)
• Example:
• Simple Linear Regression Example: Predicting exam scores based on the number of study hours.
• The equation might be: Exam Score = 50 + 5 × Study Hours.
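A least-squares fit of the simple linear model can be sketched in plain Python. The five data points are made up so that they lie exactly on the example line Exam Score = 50 + 5 × Study Hours; real data would scatter around the fitted line.

```python
# Hypothetical data lying exactly on Exam Score = 50 + 5 * Study Hours.
hours = [1, 2, 3, 4, 5]
scores = [55, 60, 65, 70, 75]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares estimates of the slope (beta_1) and intercept (beta_0).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Predict the exam score for 6 study hours.
predicted = intercept + slope * 6

print(f"Exam Score = {intercept} + {slope} * Study Hours; prediction for 6 h: {predicted}")
```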
Correlation
• Purpose:
• Correlation measures the strength and direction of a linear relationship between two
variables.
• It does not imply causation, only the degree of association between variables.
• Key points:
• The most commonly used measure is the Pearson correlation coefficient (denoted as r),
which ranges from -1 to 1.
• r = 1: Perfect positive correlation (the points lie exactly on an upward-sloping straight line).
• r = -1: Perfect negative correlation (the points lie exactly on a downward-sloping straight line).
• r = 0: No linear correlation (the variables may still be related in a nonlinear way).
• r > 0: Positive correlation (both variables tend to increase or decrease together).
• r < 0: Negative correlation (one variable tends to increase as the other decreases).
• Example:
• If you are studying the relationship between study hours and exam scores, a correlation
of 0.85 would indicate a strong positive correlation — as study hours increase, exam
scores tend to increase.
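The Pearson coefficient can be computed directly from its definition. The sketch below uses made-up data and a hypothetical helper `pearson_r`; it is not meant to reproduce the 0.85 in the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical study hours and exam scores on an exact straight line.
hours = [1, 2, 3, 4, 5]
scores = [55, 60, 65, 70, 75]
r_linear = pearson_r(hours, scores)

# y = x^2 on symmetric x values: fully determined by x, yet not linearly related.
xs = [-2, -1, 0, 1, 2]
r_quadratic = pearson_r(xs, [x ** 2 for x in xs])

print(r_linear, r_quadratic)
```

The second case shows why r measures only linear association: y is a function of x, but the coefficient is zero.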
THANK YOU
