A Statistical Perspective On Data Mining
Name: Keerthi S
Class: 5th sem
BCA (B)
Statistical perspective on data mining
• Data mining involves extracting valuable insights from large datasets using
computational techniques.
• From a statistical perspective, data mining is deeply rooted in probability
theory, hypothesis testing, and statistical modeling.
• The goal is to analyze data to identify patterns, make predictions, and inform
decision-making processes.
1. Point estimation
• Point estimation is a statistical technique used to estimate the value of an
unknown parameter in a population based on a sample from that population.
• A point estimate is a single value that serves as an approximation of this
parameter.
• Example: Imagine a farmer wants to estimate the average weight of all apples
in their orchard. Measuring every apple in the orchard would take too much
time, so the farmer takes a random sample of 50 apples instead.
1. The farmer weighs each of these 50 apples and calculates the average
weight, which turns out to be 150 grams.
2. This sample mean of 150 grams is used as the point estimate for the average
weight of all apples in the orchard.
• In this case, the point estimate (150 grams) is a single value standing in
for the unknown population parameter, the true average weight of all apples.
1. Bias of an Estimator
The bias of an estimator is the difference between the expected (average)
value of the estimator and the true value of the parameter it is estimating.
2. Unbiased Estimator
An unbiased estimator is one whose expected value is equal to the true value
of the parameter it estimates. This means it does not systematically
overestimate or underestimate the parameter.
3. Mean Squared Error (MSE)
The Mean Squared Error (MSE) of an estimator measures the average of the squares of the
errors, where the error is the difference between the estimator and the true parameter
value. MSE combines both the variance of the estimator and its bias.
4. Squared Error
Squared Error refers to the square of the difference between an estimated value and the
true parameter value for a single estimate. It helps measure how far an estimate is from the
actual parameter value.
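In symbols, these standard definitions can be written as follows, with \(\hat{\theta}\) the estimator and \(\theta\) the true parameter:

\[
\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta,
\qquad
\mathrm{MSE}(\hat{\theta}) = E\!\left[(\hat{\theta}-\theta)^2\right]
= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2
\]

An estimator is unbiased exactly when \(\mathrm{Bias}(\hat{\theta}) = 0\), in which case the MSE reduces to the variance alone.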
• Let's say you're trying to estimate the mean height of a group of people, and you use a
sample mean as your point estimate.
• Given:
• The true mean height of the population = 170 cm.
• Your point estimate from the sample = 172 cm.
• The error of this estimate is 172 − 170 = 2 cm, so its squared error is (172 − 170)² = 4 cm².
• Jackknife Estimator
The Jackknife Estimator is a resampling technique used to estimate the bias
and variance of a statistic. It involves systematically leaving out one
observation at a time from the sample set and recalculating the estimate to
create a "pseudo-sample." These pseudo-estimates are then averaged to
obtain the jackknife estimate.
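A minimal sketch of the jackknife in Python (the jackknife function and the apple-weight sample are illustrative assumptions, not from the slides):

```python
import numpy as np

def jackknife(data, statistic):
    """Jackknife estimates of bias and standard error for `statistic`."""
    n = len(data)
    full = statistic(data)
    # Leave-one-out pseudo-samples: drop observation i, recompute the statistic
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    mean_loo = loo.mean()
    bias = (n - 1) * (mean_loo - full)                          # jackknife bias estimate
    se = np.sqrt((n - 1) / n * ((loo - mean_loo) ** 2).sum())   # jackknife standard error
    return full - bias, bias, se                                # bias-corrected estimate

# Hypothetical apple weights (grams), echoing the orchard example
weights = np.array([148.0, 152.0, 150.0, 149.0, 151.0])
est, bias, se = jackknife(weights, np.mean)
print(est, bias, se)  # the sample mean is unbiased, so the bias comes out 0
```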
2. Model-based summarization
• Summarization is the presentation of a summary or report of the data in a
comprehensible and informative manner.
• The summary is computed from the entire dataset, so it conveys information
about the data as a whole.
• A carefully constructed summary conveys trends and patterns in the dataset
in a simplified manner.
Descriptive statistics
Descriptive statistics refers to a set of methods used to summarize and
describe the main features of a dataset, such as its central tendency,
variability, and distribution.
Ages = [10, 12, 10, 14, 16]
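A minimal sketch computing these summaries for the ages list above, using Python's standard statistics module:

```python
import statistics

ages = [10, 12, 10, 14, 16]

print(statistics.mean(ages))    # central tendency: (10+12+10+14+16)/5 = 12.4
print(statistics.median(ages))  # middle of the sorted values: 12
print(statistics.mode(ages))    # most frequent value: 10
print(statistics.pstdev(ages))  # variability: population standard deviation
```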
Frequency distribution
• A frequency distribution is a visual or tabular representation of
the number of observations within a given interval
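For the same ages list, a frequency table can be built by counting how often each value occurs; a minimal sketch using collections.Counter:

```python
from collections import Counter

ages = [10, 12, 10, 14, 16]
frequency = Counter(ages)             # maps each value to its count
for value, count in sorted(frequency.items()):
    print(value, count)               # 10 -> 2, 12 -> 1, 14 -> 1, 16 -> 1
```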
Histogram
• A histogram is a type of bar chart used to represent the distribution
of numerical data. It shows how often different ranges (or bins) of
values occur in the dataset
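A minimal sketch with matplotlib (the waiting-time values are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical waiting times (minutes) for illustration
waiting_times = [4.2, 5.1, 5.8, 4.9, 6.3, 5.5, 4.7, 5.0, 6.1, 5.4]

plt.hist(waiting_times, bins=4, edgecolor="black")  # 4 equal-width bins
plt.xlabel("Waiting time (minutes)")
plt.ylabel("Frequency")
plt.title("Distribution of waiting times")
plt.show()
```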
Hypothesis testing
• Hypothesis testing is a statistical method for deciding, based on sample
data, whether there is enough evidence to support a claim about a population.
• Example
• A coffee shop claims that the average waiting time for an order is 5
minutes. A customer wants to test whether the average waiting time
is different from 5 minutes based on a sample of orders.
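A minimal sketch of this test as a two-sided one-sample t-test in Python (the sample values and the 0.05 significance level are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of observed waiting times (minutes)
waiting_times = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 6.0, 5.3])

# H0: mean waiting time = 5  vs.  H1: mean waiting time != 5 (two-sided)
t_stat, p_value = stats.ttest_1samp(waiting_times, popmean=5)

if p_value < 0.05:
    print("Reject H0: the average waiting time differs from 5 minutes")
else:
    print("Fail to reject H0: no significant evidence of a difference")
```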
• Decision Making:
To determine whether there is enough evidence to support a particular claim or hypothesis
about a population. For example, testing if a new drug is more effective than an existing one.
• Evaluate Assumptions:
Hypothesis testing helps to assess the validity of assumptions or theories about a population.
For example, it can test whether the average income in a city is equal to a certain value, based
on sample data.
• Scientific Validation:
It is commonly used in scientific research to validate or reject a research hypothesis. In
experimental settings, researchers use hypothesis tests to determine if the effects observed in
the experiment are statistically significant.
• Comparing Groups:
Hypothesis testing allows for comparisons between groups or treatments. For instance,
comparing the performance of two different teaching methods by testing whether their
outcomes (e.g., exam scores) differ significantly.
• Predictive Analysis:
It helps in understanding the relationships between variables and can be used to predict
outcomes based on sample data.
Regression
Purpose:
• Regression is used to model the relationship between a dependent (response) variable and one or
more independent (predictor) variables. It predicts the value of the dependent variable based on
the independent variables.
• Unlike correlation, regression distinguishes a response variable from predictors and can be
used for prediction; on its own, however, a regression fit does not establish causation.
• Key points:
• The most common form of regression is linear regression, where the relationship between the
variables is modeled as a straight line.
• In simple linear regression, the relationship is between two variables:
• Y = β₀ + β₁X
• Y: Dependent variable
• X: Independent variable
• β₀: Y-intercept
• β₁: Slope (shows how much Y changes for a unit change in X)
• Example:
• Simple Linear Regression Example: Predicting exam scores based on the number of study hours.
• The equation might be: Exam Score = 50 + 5 × Study Hours.
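A minimal sketch fitting a line of this form by least squares in Python (the hours/scores data are hypothetical, so the fitted coefficients will differ from the 50 and 5 in the example):

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
scores = np.array([55, 61, 64, 71, 74, 80], dtype=float)

# Least-squares fit of scores = b0 + b1 * hours
b1, b0 = np.polyfit(hours, scores, deg=1)  # polyfit returns the slope first for deg=1
print(f"Exam Score = {b0:.1f} + {b1:.1f} x Study Hours")

# Predicted score for 7 hours of study
print(b0 + b1 * 7)
```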
Correlation
• Purpose:
Correlation measures the strength and direction of a linear relationship between two
variables.
It does not imply causation, only the degree of association between variables.
• Key points:
• The most commonly used measure is the Pearson correlation coefficient (denoted as r),
which ranges from -1 to 1.
• r = 1: Perfect positive correlation (as one variable increases, the other increases proportionally).
• r = -1: Perfect negative correlation (as one variable increases, the other decreases proportionally).
• r = 0: No linear correlation (the variables have no linear relationship, though they may still be related non-linearly).
• r > 0: Positive correlation (both variables tend to increase or decrease together).
• r < 0: Negative correlation (one variable tends to increase as the other decreases).
• Example:
• If you are studying the relationship between study hours and exam scores, a correlation
of 0.85 would indicate a strong positive correlation — as study hours increase, exam
scores tend to increase.
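A minimal sketch computing Pearson's r in Python (same hypothetical hours/scores data as in the regression sketch):

```python
import numpy as np

# Hypothetical data: study hours vs. exam scores
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
scores = np.array([55, 61, 64, 71, 74, 80], dtype=float)

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(hours, scores)[0, 1]
print(r)  # near +1 here: a strong positive linear association
```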
THANK YOU