ML-Lecture-5-data-quality

Uploaded by

Shohanur Rahman


Machine Learning

Lecture 5: Data Quality

COURSE CODE: CSE451
2023
Course Teacher
Dr. Mrinal Kanti Baowaly
Associate Professor
Department of Computer Science and
Engineering, Bangabandhu Sheikh
Mujibur Rahman Science and
Technology University, Bangladesh.

Email: mkbaowaly@gmail.com
Data Quality
▪ Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context.
▪ Components of data quality: completeness, consistency, conformity, accuracy, integrity and timeliness.

Source: Link1
#1: Completeness
▪ Completeness is defined as expected comprehensiveness.
▪ Data can be complete even if optional data is missing. As long as the data meets expectations, it is considered complete.
▪ For example, a customer’s first name and last name are mandatory but the middle name is optional, so a record can be considered complete even if a middle name is not available.
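The mandatory-versus-optional rule can be sketched in a few lines of Python. This is only an illustration: the field names and the `MANDATORY_FIELDS` tuple are assumptions, not part of the lecture.

```python
# Hypothetical schema: which fields count as mandatory is an assumption.
MANDATORY_FIELDS = ("first_name", "last_name")


def is_complete(record):
    """Return True when every mandatory field is present and non-blank."""
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            return False
    return True


customers = [
    # Optional middle name is missing, record is still complete.
    {"first_name": "Ada", "middle_name": None, "last_name": "Lovelace"},
    # Mandatory last name is blank, record is incomplete.
    {"first_name": "Alan", "middle_name": "Mathison", "last_name": ""},
]

completeness = [is_complete(c) for c in customers]
print(completeness)  # [True, False]
```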
#2: Consistency
▪ Consistency means that data across all systems reflects the same information and is synchronized across the enterprise.

▪ Examples of some inconsistencies:

• A business unit’s status is closed, but there are sales for that business unit.
• An employee’s status is terminated, but the pay status is active.
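A consistency rule such as the employee example can be expressed as a cross-field check. The sketch below assumes hypothetical field names (`status`, `pay_status`) and status values; a real system would define its own.

```python
# Hypothetical employee records; field names and values are assumptions.
employees = [
    {"id": 1, "status": "active", "pay_status": "active"},
    {"id": 2, "status": "terminated", "pay_status": "active"},
    {"id": 3, "status": "terminated", "pay_status": "stopped"},
]


def find_inconsistent(rows):
    """Return ids of employees who are terminated but still have active pay."""
    return [
        row["id"]
        for row in rows
        if row["status"] == "terminated" and row["pay_status"] == "active"
    ]


inconsistent_ids = find_inconsistent(employees)
print(inconsistent_ids)  # [2]
```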
#3: Conformity
▪ Conformity means the data follows a set of standard data definitions such as data type, size and format.

▪ For example, the customer’s date of birth is in the format “mm/dd/yyyy”.
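One possible way to test conformity to the “mm/dd/yyyy” format is shown below. It is a sketch using only the Python standard library; the function name `conforms_to_mmddyyyy` is hypothetical. The regular expression enforces the exact digit widths, and `datetime.strptime` rejects dates that cannot exist.

```python
import re
from datetime import datetime


def conforms_to_mmddyyyy(text):
    """Return True when text is a real calendar date written as mm/dd/yyyy."""
    # Enforce two-digit month and day and a four-digit year.
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", text):
        return False
    # Reject impossible dates such as month 13 or February 30.
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return False
    return True


samples = ["02/28/1990", "1990-02-28", "2/8/1990", "02/30/1990"]
results = [conforms_to_mmddyyyy(s) for s in samples]
print(results)  # [True, False, False, False]
```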
#4: Accuracy
▪ Accuracy is the degree to which data correctly reflects the real-world object or event being described.

▪ Examples:
• The recorded sales of a business unit match the actual sales.
• The address of an employee in the employee database is the employee’s real address.
#5: Integrity
▪ Integrity means the validity of data across relationships; it ensures that all data in a database can be traced and connected to other data.

▪ For example, in a customer database, there should be a valid customer, a valid address and a valid relationship between them. If there is address relationship data without a customer, that data is not valid and is considered an orphaned record.
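Orphaned records can be found by checking every reference against the set of valid keys. The following is a minimal sketch with made-up tables; the names `customers`, `addresses` and `customer_id` are assumptions for illustration.

```python
# Hypothetical tables: customers keyed by id, addresses referencing a customer.
customers = {101: "Rahim", 102: "Karim"}
addresses = [
    {"address_id": 1, "customer_id": 101, "city": "Dhaka"},
    {"address_id": 2, "customer_id": 999, "city": "Khulna"},  # no such customer
]


def find_orphans(address_rows, customer_ids):
    """Return address rows whose customer_id matches no known customer."""
    return [row for row in address_rows if row["customer_id"] not in customer_ids]


orphans = find_orphans(addresses, set(customers))
orphan_ids = [row["address_id"] for row in orphans]
print(orphan_ids)  # [2]
```

In a relational database the same guarantee is usually enforced with a foreign key constraint rather than checked after the fact.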
#6: Timeliness
▪ Timeliness refers to whether information is available when it is expected and needed.
▪ Data should be recorded as soon as possible after the real-world event because, with the passage of time, the information becomes less useful and less accurate.
▪ Examples:
• Companies that are required to publish their quarterly results within a given time frame
• Customer service providing up-to-date information to customers
• A credit system checking credit card account activity in real time
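Timeliness can be monitored by measuring the lag between when an event happened and when it was recorded. The sketch below assumes a 24-hour limit (`MAX_LAG`) and hypothetical field names; the appropriate limit depends entirely on the application.

```python
from datetime import datetime, timedelta

# Assumed threshold for illustration only.
MAX_LAG = timedelta(hours=24)

# Hypothetical events with the time they occurred and the time they were recorded.
events = [
    {"id": "a", "occurred": datetime(2023, 5, 1, 9, 0), "recorded": datetime(2023, 5, 1, 9, 5)},
    {"id": "b", "occurred": datetime(2023, 5, 1, 9, 0), "recorded": datetime(2023, 5, 3, 9, 0)},
]

# An event is late when it was recorded more than MAX_LAG after it occurred.
late_ids = [e["id"] for e in events if e["recorded"] - e["occurred"] > MAX_LAG]
print(late_ids)  # ['b']
```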
Data Quality Problems
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems:

◦ Noise
◦ Outliers
◦ Missing values
◦ Duplicate or Redundant data
Noise
Noise refers to modification of original values.
◦ Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen

[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]

Noisy Data
▪ Noisy data (or corrupt data) is meaningless information.
▪ It cannot be understood and interpreted correctly by machines.
▪ It unnecessarily increases the amount of storage space required and can adversely affect the results of any data mining analysis.
▪ Noisy data can be caused by faulty data collection instruments, human or computer errors at data entry, data transmission errors, limited buffer size for coordinating synchronized data transfer, inconsistencies in naming conventions or data codes, and inconsistent formats for input fields (e.g., dates).
How to Handle Noisy Data
▪ Remove noise from data (called data smoothing) using the binning method, regression or clustering.
▪ Collect more data. It is the best way to cut the noise out, but data is expensive.
▪ Use Principal Component Analysis (PCA) for dimensionality reduction.
▪ Use regularization and cross-validation (CV) to prevent overfitting.

Detail: Link
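As an illustration of the binning method, the sketch below smooths values by bin means: the data are sorted, split into equal-frequency bins, and each value is replaced by the mean of its bin. The function name, the sample values and the bin size of 3 are assumptions; other variants (smoothing by bin medians or bin boundaries) follow the same pattern.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins, and replace
    each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
# The three bin means are 9, 22 and 29, each repeated three times.
print(smoothed)
```

Note that the result is returned in sorted order, not in the original order of the input.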
Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.

How to detect outliers: use visualization methods such as box plots, histograms and scatter plots. Link1, Link2
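A box plot conventionally marks points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as in this sketch; the sample data and the multiplier `k=1.5` are assumptions, and quartile values can differ slightly between libraries depending on the interpolation method.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(readings)
print(outliers)  # [102]
```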
How to Handle Outliers
▪ Drop the outlier records: Sometimes it’s best to completely remove those records from your dataset to stop them from skewing your analysis.
▪ Cap your outliers’ data: Another way to handle true outliers is to cap them. For example, if you’re using income, you might find that people above a certain income level behave in the same way as those with a lower income. In this case, you can cap the income value at a level that preserves that behavior.
▪ Assign a new value: If an outlier seems to be due to a mistake in your data, try imputing a new value. Common imputation methods include using the mean of a variable or using a regression model to predict the value.
▪ Try a transformation: A different approach to true outliers is to transform the data (for example, by taking logarithms) rather than use the raw values.
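Two of the options above, capping and transforming, can be sketched as follows. The income figures and the cap of 100,000 are invented for illustration, and the base-10 logarithm is just one common choice of transformation.

```python
import math


def cap(values, lower, upper):
    """Clamp every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


incomes = [22_000, 35_000, 41_000, 58_000, 1_200_000]

# Capping: the cap level is an assumed value, chosen per application.
capped = cap(incomes, 0, 100_000)
print(capped)  # [22000, 35000, 41000, 58000, 100000]

# Transformation: a log scale compresses the extreme value
# instead of discarding or altering it.
log_incomes = [math.log10(v) for v in incomes]
print([round(x, 2) for x in log_incomes])
```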
Missing Values
Reasons for missing values
◦ Information is not collected
(e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
◦ Eliminate data objects
◦ Estimate missing values (mean, median, mode, prediction, etc.)
◦ Ignore the missing value during analysis
◦ Replace with all possible values (weighted by their probabilities)
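The first two strategies can be sketched with the standard library. Here missing values are represented by `None`, which is an assumption of this example; the sample ages are invented.

```python
from statistics import mean, median

# Hypothetical attribute with two missing entries.
ages = [25, None, 31, 40, None, 28]

# Strategy 1: eliminate the data objects that have a missing value.
observed = [a for a in ages if a is not None]

# Strategy 2: estimate the missing values from the observed ones.
mean_filled = [a if a is not None else mean(observed) for a in ages]
median_filled = [a if a is not None else median(observed) for a in ages]

print(observed)       # [25, 31, 40, 28]
print(mean_filled)    # missing entries replaced by the mean, 31
print(median_filled)  # missing entries replaced by the median, 29.5
```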
Dealing with duplicate data
▪ You should probably remove duplicate data.
▪ Duplicate data can bias your fitted model or cause it to overfit.
▪ But you should
1) be sure they are not real data that coincidentally have identical values
2) try to figure out why you have duplicates in your data. For example, sometimes people intentionally ‘oversample’ rare categories in training data.
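Before removing duplicates it helps to count them, so that the two checks above can be made first. The sketch below works on a list of dictionaries; the record layout and the helper names are assumptions for illustration.

```python
from collections import Counter


def row_key(row):
    """Build a hashable key from a record so identical rows compare equal."""
    return tuple(sorted(row.items()))


def drop_duplicates(rows):
    """Keep the first occurrence of each distinct row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"name": "Rahim", "label": "spam"},
    {"name": "Karim", "label": "ham"},
    {"name": "Rahim", "label": "spam"},  # exact duplicate of the first row
]

# Inspect how often each row occurs before deciding to drop anything.
counts = Counter(row_key(r) for r in rows)
n_duplicated = sum(1 for c in counts.values() if c > 1)

deduplicated = drop_duplicates(rows)
print(n_duplicated, len(deduplicated))  # 1 2
```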
HW: Data Cleaning with Python, Pandas and NumPy
According to IBM Data Analytics, you can expect to spend up to 80% of your time cleaning data.

Practice:
Data Preprocessing | Data Cleaning Python
Data Cleaning In Python Basics Using Pandas
Pythonic Data Cleaning With Pandas and NumPy
End of
Lecture-5
