1.2.1. Retrieving Data - 1.2.2. Cleaning Data

Retrieving data

1
Learning Goals
In this section, we will cover:

- Retrieving data from multiple data sources:


• SQL databases
• NoSQL databases
• APIs
• Cloud data sources

- Understanding common issues that arise when importing data

2
Reading CSV Files
Comma-separated values (CSV) files consist of rows of data whose fields are separated by commas.
In Pandas, CSV files can typically be read using just a few lines of code:
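A minimal sketch (the file name data.csv is an assumption for illustration):

import pandas as pd

# Read a CSV file into a DataFrame; pandas infers column names from the header row.
df = pd.read_csv("data.csv")
df.head()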

3
Reading CSV Files: Useful Arguments
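The original slide shows a table of arguments; as a hedged sketch, some commonly used read_csv arguments (the file name and values are hypothetical):

import pandas as pd

df = pd.read_csv(
    "data.csv",
    sep=",",              # field delimiter ("\t" for tab-separated files)
    header=0,             # row number to use for column names
    na_values=["?", ""],  # extra strings to treat as missing values
    nrows=100,            # read only the first 100 rows
)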

4
JSON Files
JavaScript Object Notation (JSON) files are a standard way to store data across platforms.

JSON files are very similar in structure to Python dictionaries.

Reading JSON files into Python:
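A minimal sketch (the file name and the "price" field are assumptions based on the slide):

import json
import pandas as pd

# Read a JSON file directly into a DataFrame.
df = pd.read_json("data.json")

# Or load it as a plain Python dictionary and inspect a field such as "price".
with open("data.json") as f:
    data = json.load(f)
data["price"]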


5
SQL Databases
Structured Query Language (SQL) databases are relational databases with fixed schemas.

There are many types of SQL databases, which function similarly (with some
subtle differences in syntax).
Examples of SQL databases:
- Microsoft SQL Server
- Postgres
- MySQL
- AWS Redshift
- Oracle DB
- Db2 Family
6
Reading SQL Data
While this example uses sqlite3, there are several other packages available.

The sqlite3 module creates a connection with the database.

Data is read into pandas by combining a query with this connection.
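A sketch following the variable names used in the summary below (the database file and table name are hypothetical):

import sqlite3
import pandas as pd

# Create a path variable and a connection to the database.
path = "classic_rock.db"
con = sqlite3.connect(path)

# Write a query and read the table into a DataFrame.
query = "SELECT * FROM rock_songs;"
observations = pd.read_sql(query, con)

# Read the list of tables from sqlite_master.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table';", con)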

7
NoSQL Databases

Not-only SQL (NoSQL) databases are not relational and vary more in structure.

Depending on the application, they may perform more quickly or reduce technical overhead.

Most NoSQL databases store data in JSON format.

Examples of NoSQL databases:


- Document databases: MongoDB, CouchDB
- Key-value stores: Riak, Voldemort, Redis
- Graph databases: Neo4j, HyperGraph
- Wide-column stores: Cassandra, HBase
8
Reading NoSQL Data
This example uses the pymongo module to read data stored in MongoDB, although there are several other packages available.

We first make a connection with the database (MongoDB needs to be running).

Data is read into pandas by combining a query with this connection.

Here, query should be replaced with a MongoDB query document (or { } to select all).
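A minimal pymongo sketch (the host, database, and collection names are assumptions for illustration):

from pymongo import MongoClient
import pandas as pd

# Make a connection to a running MongoDB instance.
con = MongoClient("localhost", 27017)
db = con.database_name

# An empty query document {} selects every record in the collection.
query = {}
observations = pd.DataFrame(list(db.collection_name.find(query)))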

9
APIs and Cloud Data Access
A variety of data providers make data available via Application Programming Interfaces (APIs), which makes it easy to access such data from Python.

There are also a number of datasets available online in various formats.

One example available online is the UC Irvine Machine Learning Repository.

Here, we read one of its datasets into Pandas directly via the URL.
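A sketch of reading a dataset from a URL (the Iris dataset URL below is one commonly used UCI example; any CSV-formatted URL works the same way):

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv(url, header=None, names=names)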

10
Summary
Reading in Database Files
The steps to read in a database file using the sqlite3 library are:
● create a path variable that references the path to your database
● create a connection variable that references the connection to your database
● create a query variable that contains the SQL query that reads in the data table from
your database
● create an observations variable and assign it the result of the pandas read_sql function
● create a tables variable that reads the list of tables from sqlite_master
JSON files are a standard way to store data across platforms. Their structure is similar to
Python dictionaries.
NoSQL databases are not relational and vary more in structure. Most NoSQL databases
store data in JSON format.
11
Learning Recap
In this section, we discussed:

- Retrieving data from multiple data sources:


• SQL databases
• NoSQL databases
• APIs
• Cloud data sources

- Understanding common issues that arise when importing data


12
Cleaning data

13
Learning Goals

In this section, we will cover:


- Why data cleaning is important for Machine Learning
- Issues that arise with messy data
- How to identify duplicate or unnecessary data
- Policies for dealing with outliers

14
Why is Data Cleaning so Important?
Decisions and analytics are increasingly driven by data and models.

Key aspects of the Machine Learning workflow depend on clean data:

- Observations: An instance of the data (usually a point or row in a dataset)
- Labels: Output variable(s) being predicted
- Algorithms: Computer programs that estimate models based on available data
- Features: Information we have for each observation (variables)
- Model: Hypothesized relationship between the features and the labels

Messy data can lead to a "garbage-in, garbage-out" effect and unreliable outcomes.

15
Why is Data Cleaning so Important?
The main data problems companies face:

- Lack of data
- Too much data
- Bad data

Having data ready for ML and AI ensures you are ready to infuse AI across
your organization.

16
How Can Data be Messy?

- Duplicate or unnecessary data


- Inconsistent text and typos
- Missing data
- Outliers
Data sourcing issues:
- Multiple systems
- Different database types
- On premises, in cloud
- ... and more.

17
Duplicate or Unnecessary Data

● Pay attention to duplicate values and research why there are multiple values.
● Duplicate entries are problematic for multiple reasons. An entry appearing more than once receives disproportionate weight during training, so models that succeed on these frequent entries only look like they perform well.
● Duplicate entries can ruin the split between train, validation, and test sets when identical entries are not all in the same set. This can lead to biased performance estimates and disappointing model performance in production.
18
Duplicate or Unnecessary Data

● There are many possible causes for duplicate entries in databases, such as processing steps that were rerun anywhere in the data pipeline. While the existence of duplicates hurts the learning process greatly, it is relatively easy to fix.
● One option is to enforce columns to be unique whenever applicable.
● Another is to run a script to automatically detect and delete duplicate
entries. This can be done easily with Pandas’ drop_duplicates
functionality shown in the next slide.
● Refer to https://pandas.pydata.org/docs/reference/frame.html
19
>>> df
  FirstName LastName  PhoneNo
0         A        B        1
1         A        B        1
2         A        B        2
>>> df.drop_duplicates(subset=["FirstName", "LastName"])
  FirstName LastName  PhoneNo
0         A        B        1

Duplicate or Unnecessary Data


● Consider a dataset containing ramen ratings:
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie',
'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
20
Duplicate or Unnecessary Data
● By default, drop_duplicates removes duplicate rows based on all columns:
>>> df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0

● To remove duplicates on specific column(s), use subset:


>>> df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
21
Duplicate or Unnecessary Data
● Unnecessary values can slow down the data analysis process and distract you
from achieving certain goals.
● Other elements you'll need to remove as they add nothing to your data include:
○ Personally identifiable information (PII)
○ URLs
○ HTML tags
○ Boilerplate text (for ex. in emails)
○ Tracking codes
○ Excessive blank space between text
● It's a good idea to look at the features you're bringing in and filter the data as
necessary (be careful not to filter too much if you may use features later).
● Refer to https://pandas.pydata.org/docs/reference/frame.html
22
Duplicate or Unnecessary Data

How to filter:
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
... index=['mouse', 'rabbit'],
... columns=['one', 'two', 'three'])
>>> df
one two three
mouse 1 2 3
rabbit 4 5 6
>>> # select columns by name
>>> df.filter(items=['one', 'three'])
one three
mouse 1 3
rabbit 4 6

23
Duplicate or Unnecessary Data

How to filter:
>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
one three
mouse 1 3
rabbit 4 6
>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
one two three
rabbit 4 5 6

24
Duplicate or Unnecessary Data

How to remove data:
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns:
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

25
Policies for Missing Data

Remove the data: remove the row(s) entirely.

Impute the data: replace with substituted values. Fill in the missing data with the most common value, the average value, etc.

Mask the data: create a category for missing values.

What are the pros and cons of each of these approaches?
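A minimal pandas sketch of the three policies (the toy DataFrame is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NYC", None, "LA"]})

# Remove: drop rows containing any missing value.
removed = df.dropna()

# Impute: fill missing values with a substitute such as the mean.
imputed = df.fillna({"age": df["age"].mean()})

# Mask: create an explicit category for missing values.
masked = df.fillna({"city": "Missing"})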

26
Outliers
An outlier is an observation in data that is distant from most other observations.

Typically, these observations are aberrations and do not accurately represent the phenomenon we are trying to explain through the model.

If we do not identify and deal with outliers, they can have a significant impact on the model.

It is important to remember that some outliers are informative and provide insights into the data.

27
How to Find Outliers?

28
Detecting Outliers: Plots
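The slide's plots are not reproduced here; as a hedged sketch, box plots and histograms are common visual checks (the toy data echoes the ramen ratings above):

import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.Series([4.0, 4.0, 3.5, 15.0, 5.0], name="rating")

# Box plot: points beyond the whiskers (1.5 x IQR) are candidate outliers.
ratings.plot(kind="box")
plt.show()

# Histogram: values far from the bulk of the distribution stand out.
ratings.plot(kind="hist", bins=10)
plt.show()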

29
Detecting Outliers: Statistics
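The slide's figures are not reproduced here; a sketch of two standard statistical rules (the toy data is hypothetical):

import pandas as pd

ratings = pd.Series([4.0, 4.0, 3.5, 15.0, 5.0], name="rating")

# Interquartile range (IQR) rule: flag points outside 1.5 x IQR of the quartiles.
q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (ratings - ratings.mean()) / ratings.std()
z_outliers = ratings[z.abs() > 3]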

30
Detecting Outliers: Residuals
Residuals (differences between actual and predicted values of the outcome
variable) represent model failure.

Approaches to calculating residuals:

- Standardized: residual divided by the standard error.
- Deleted: residual from fitting the model on all data excluding the current observation.
- Studentized: deleted residual divided by the residual standard error (based on all data, or all data excluding the current observation).
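A sketch using statsmodels, which is not named on the slide (the data is hypothetical; the last point is constructed to be an outlier):

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
y = np.array([1.1, 1.9, 3.2, 3.9, 10.0])
model = sm.OLS(y, X).fit()

# Studentized (external) residuals: large absolute values flag outliers.
influence = model.get_influence()
studentized = influence.resid_studentized_external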

31
Policies for Outliers
Remove them.

Assign the mean or median value.

Transform the variable.

Predict what the value should be:

- Using 'similar' observations to predict likely values.
- Using regression.

Keep them, but focus on models that are resistant to outliers.
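Sketches of some of these policies in pandas (the toy data and the 95th-percentile threshold are assumptions):

import numpy as np
import pandas as pd

ratings = pd.Series([4.0, 4.0, 3.5, 15.0, 5.0], name="rating")
upper = ratings.quantile(0.95)

# Remove them.
trimmed = ratings[ratings <= upper]

# Assign the median value to extreme points.
capped = ratings.where(ratings <= upper, ratings.median())

# Transform the variable (a log transform compresses large values).
logged = np.log1p(ratings)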

32
Summary
Retrieving Data
You can retrieve data from multiple sources:
● SQL databases
● NoSQL databases
● APIs
● Cloud data sources
The two most common formats for delimited flat files are comma-separated (CSV) and tab-separated (TSV). It is also possible to use special characters as separators.
SQL databases are relational databases with fixed schemas.
33
Summary
Data Cleaning
● Data Cleaning is important because messy data will lead to unreliable outcomes.
Some common issues that make data messy are: duplicate or unnecessary data,
inconsistent data and typos, missing data, outliers, and data source issues.
● You can identify duplicate or unnecessary data with tools such as pandas' drop_duplicates and filter.
● Common policies to deal with missing data are: remove rows with missing values, impute the missing data, and mask the data by creating a category for missing values.
● Common methods to find outliers are: through plots, statistics, or residuals.
● Common policies to deal with outliers are: remove outliers, impute them, use a
variable transformation, or use a model that is resistant to outliers.

34
Learning Recap
In this section, we discussed:
- Why data cleaning is important for Machine Learning
- Issues that arise with messy data
- How to identify duplicate or unnecessary data
- Policies for dealing with outliers

35
