1.2.1. Retrieving Data - 1.2.2. Cleaning Data

Retrieving data

1
Learning Goals
In this section, we will cover:

- Retrieving data from multiple data sources:


• SQL databases
• NoSQL databases
• APIs
• Cloud data sources

- Understanding common issues that arise when importing data

2
Reading CSV Files
Comma-separated values (CSV) files consist of rows of data whose fields are separated by commas.
In Pandas, CSV files can typically be read using just a few lines of code:
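A minimal sketch (the file name data.csv is an assumption for illustration):

import pandas as pd

# Read a CSV file into a DataFrame; pandas infers column names from the header row.
df = pd.read_csv("data.csv")
df.head()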

3
Reading CSV Files: Useful Arguments
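The original slide shows a table of arguments; as a hedged sketch, some commonly used read_csv arguments (the file name and values are hypothetical):

import pandas as pd

df = pd.read_csv(
    "data.csv",
    sep=",",              # field delimiter ("\t" for tab-separated files)
    header=0,             # row number to use for column names
    na_values=["?", ""],  # extra strings to treat as missing values
    nrows=100,            # read only the first 100 rows
)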

4
JSON Files
JavaScript Object Notation (JSON) files are a standard way to store data across platforms.

JSON files are very similar in structure to Python dictionaries.

Reading JSON files into Python:
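A minimal sketch (the file name and the "price" field are assumptions based on the slide):

import json
import pandas as pd

# Read a JSON file directly into a DataFrame.
df = pd.read_json("data.json")

# Or load it as a plain Python dictionary and inspect a field such as "price".
with open("data.json") as f:
    data = json.load(f)
data["price"]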


5
SQL Databases
Structured Query Language (SQL) databases are relational databases with fixed schemas.

There are many types of SQL databases, which function similarly (with some
subtle differences in syntax).
Examples of SQL databases:
- Microsoft SQL Server
- Postgres
- MySQL
- AWS Redshift
- Oracle DB
- Db2 Family
6
Reading SQL Data
While this example uses sqlite3, there are several other packages available.

The sqlite3 module creates a connection with the database.

Data is read into pandas by combining a query with this connection.
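A sketch following the variable names used in the summary below (the database file and table name are hypothetical):

import sqlite3
import pandas as pd

# Create a path variable and a connection to the database.
path = "classic_rock.db"
con = sqlite3.connect(path)

# Write a query and read the table into a DataFrame.
query = "SELECT * FROM rock_songs;"
observations = pd.read_sql(query, con)

# Read the list of tables from sqlite_master.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table';", con)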

7
NoSQL Databases

Not-only SQL (NoSQL) databases are not relational and vary more in structure.

Depending on the application, they may perform more quickly or reduce technical overhead.

Most NoSQL databases store data in JSON format.

Examples of NoSQL databases:


- Document databases: MongoDB, CouchDB
- Key-value stores: Riak, Voldemort, Redis
- Graph databases: Neo4j, HyperGraph
- Wide-column stores: Cassandra, HBase
8
Reading NoSQL Data
This example uses the pymongo module to read data stored in MongoDB, although there are several other packages available.

We first make a connection with the database (MongoDB needs to be running).

Data is read into pandas by combining a query with this connection.

Here, query should be replaced with a MongoDB query document (or { } to select all).
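A minimal pymongo sketch (the host, database, and collection names are assumptions for illustration):

from pymongo import MongoClient
import pandas as pd

# Make a connection to a running MongoDB instance.
con = MongoClient("localhost", 27017)
db = con.database_name

# An empty query document {} selects every record in the collection.
query = {}
observations = pd.DataFrame(list(db.collection_name.find(query)))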

9
APIs and Cloud Data Access
A variety of data providers make data available via Application Programming Interfaces (APIs), which makes it easy to access such data from Python.

There are also a number of datasets available online in various formats.

One example available online is the UC Irvine Machine Learning Repository.

Here, we read one of its datasets into Pandas directly via the URL.
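A sketch of reading a dataset from a URL (the Iris dataset URL below is one commonly used UCI example; any CSV-formatted URL works the same way):

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv(url, header=None, names=names)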

10
Summary
Reading in Database Files
The steps to read in a database file using the sqlite3 library are:
● create a path variable that references the path to your database
● create a connection variable that references the connection to your database
● create a query variable that contains the SQL query that reads in the data table from
your database
● create an observations variable and assign it the result of the pandas read_sql function
● create a tables variable that reads the list of tables from sqlite_master
JSON files are a standard way to store data across platforms. Their structure is similar to
Python dictionaries.
NoSQL databases are not relational and vary more in structure. Most NoSQL databases
store data in JSON format.
11
Learning Recap
In this section, we discussed:

- Retrieving data from multiple data sources:


• SQL databases
• NoSQL databases
• APIs
• Cloud data sources

- Understanding common issues that arise when importing data


12
Cleaning data

13
Learning Goals

In this section, we will cover:


- Why data cleaning is important for Machine Learning
- Issues that arise with messy data
- How to identify duplicate or unnecessary data
- Policies for dealing with outliers

14
Why is Data Cleaning so Important?
Decisions and analytics are increasingly driven by data and models.

Key aspects of the Machine Learning workflow depend on clean data:

- Observations: An instance of the data (usually a point or row in a dataset)
- Labels: Output variable(s) being predicted
- Algorithms: Computer programs that estimate models based on available data
- Features: Information we have for each observation (variables)
- Model: Hypothesized relationship between the features and the labels

Messy data can lead to a "garbage-in, garbage-out" effect and unreliable outcomes.

15
Why is Data Cleaning so Important?
The main data problems companies face:

- Lack of data
- Too much data
- Bad data

Having data ready for ML and AI ensures you are ready to infuse AI across
your organization.

16
How Can Data be Messy?

- Duplicate or unnecessary data


- Inconsistent text and typos
- Missing data
- Outliers
Data sourcing issues:
- Multiple systems
- Different database types
- On premises, in cloud
- ... and more.

17
Duplicate or Unnecessary Data

● Pay attention to duplicate values and research why there are multiple values.
● Duplicate entries are problematic for multiple reasons. An entry appearing more than once receives disproportionate weight during training, so models that succeed on these frequent entries only look like they perform well.
● Duplicate entries can ruin the split between train, validation, and test sets when identical entries are not all in the same set. This can lead to biased performance estimates and disappointing model performance in production.
18
Duplicate or Unnecessary Data

● There are many possible causes for duplicate entries in databases, such as processing steps that were rerun anywhere in the data pipeline. While the existence of duplicates hurts the learning process greatly, it is relatively easy to fix.
● One option is to enforce columns to be unique whenever applicable.
● Another is to run a script to automatically detect and delete duplicate
entries. This can be done easily with Pandas’ drop_duplicates
functionality shown in the next slide.
● Refer to https://pandas.pydata.org/docs/reference/frame.html
19
>>> df
  FirstName LastName  PhoneNo
0         A        B        1
1         A        B        1
2         A        B        2
>>> df.drop_duplicates(subset=["FirstName", "LastName"])
  FirstName LastName  PhoneNo
0         A        B        1

Duplicate or Unnecessary Data


● Consider a dataset containing ramen ratings:
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie',
'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
20
Duplicate or Unnecessary Data
● By default, drop_duplicates removes duplicate rows based on all columns:
>>> df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0

● To remove duplicates on specific column(s), use subset:


>>> df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
21
Duplicate or Unnecessary Data
● Unnecessary values can slow down the data analysis process and distract you
from achieving certain goals.
● Other elements you'll need to remove as they add nothing to your data include:
○ Personally identifiable information (PII)
○ URLs
○ HTML tags
○ Boilerplate text (for ex. in emails)
○ Tracking codes
○ Excessive blank space between text
● It's a good idea to look at the features you're bringing in and filter the data as
necessary (be careful not to filter too much if you may use features later).
● Refer to https://pandas.pydata.org/docs/reference/frame.html
22
Duplicate or Unnecessary Data

How to filter:
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
... index=['mouse', 'rabbit'],
... columns=['one', 'two', 'three'])
>>> df
one two three
mouse 1 2 3
rabbit 4 5 6
>>> # select columns by name
>>> df.filter(items=['one', 'three'])
one three
mouse 1 3
rabbit 4 6

23
Duplicate or Unnecessary Data

How to filter:
>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
one three
mouse 1 3
rabbit 4 6
>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
one two three
rabbit 4 5 6

24
Duplicate or Unnecessary Data

How to remove data:
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns:
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

25
Policies for Missing Data

Remove the data: remove the row(s) entirely.

Impute the data: replace with substituted values. Fill in the missing data with the most common value, the average value, etc.

Mask the data: create a category for missing values.

What are the pros and cons of each of these approaches?
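A minimal pandas sketch of the three policies (the toy DataFrame is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NYC", None, "LA"]})

# Remove: drop rows containing any missing value.
removed = df.dropna()

# Impute: fill missing values with a substitute such as the mean.
imputed = df.fillna({"age": df["age"].mean()})

# Mask: create an explicit category for missing values.
masked = df.fillna({"city": "Missing"})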

26
Outliers
An outlier is an observation in data that is distant from most other observations.

Typically, these observations are aberrations and do not accurately represent the phenomenon we are trying to explain through the model.

If we do not identify and deal with outliers, they can have a significant impact on the model.

It is important to remember that some outliers are informative and provide insights into the data.

27
How to Find Outliers?

28
Detecting Outliers: Plots
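The slide's plots are not reproduced here; as a hedged sketch, box plots and histograms are common visual checks (the toy data echoes the ramen ratings above):

import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.Series([4.0, 4.0, 3.5, 15.0, 5.0], name="rating")

# Box plot: points beyond the whiskers (1.5 x IQR) are candidate outliers.
ratings.plot(kind="box")
plt.show()

# Histogram: values far from the bulk of the distribution stand out.
ratings.plot(kind="hist", bins=10)
plt.show()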

29
Detecting Outliers: Statistics
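The slide's figures are not reproduced here; a sketch of two standard statistical rules (the toy data is hypothetical):

import pandas as pd

ratings = pd.Series([4.0, 4.0, 3.5, 15.0, 5.0], name="rating")

# Interquartile range (IQR) rule: flag points outside 1.5 x IQR of the quartiles.
q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (ratings - ratings.mean()) / ratings.std()
z_outliers = ratings[z.abs() > 3]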

30
Detecting Outliers: Residuals
Residuals (differences between actual and predicted values of the outcome
variable) represent model failure.

Approaches to calculating residuals:

- Standardized: residual divided by the standard error.
- Deleted: residual from fitting the model on all data excluding the current observation.
- Studentized: deleted residual divided by the residual standard error (based on all data, or all data excluding the current observation).
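A sketch using statsmodels, which is not named on the slide (the data is hypothetical; the last point is constructed to be an outlier):

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
y = np.array([1.1, 1.9, 3.2, 3.9, 10.0])
model = sm.OLS(y, X).fit()

# Studentized (external) residuals: large absolute values flag outliers.
influence = model.get_influence()
studentized = influence.resid_studentized_external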

31
Policies for Outliers
Remove them.

Assign the mean or median value.

Transform the variable.

Predict what the value should be:

- Using 'similar' observations to predict likely values.
- Using regression.

Keep them, but focus on models that are resistant to outliers.
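Sketches of some of these policies in pandas (the toy data and the 95th-percentile threshold are assumptions):

import numpy as np
import pandas as pd

ratings = pd.Series([4.0, 4.0, 3.5, 15.0, 5.0], name="rating")
upper = ratings.quantile(0.95)

# Remove them.
trimmed = ratings[ratings <= upper]

# Assign the median value to extreme points.
capped = ratings.where(ratings <= upper, ratings.median())

# Transform the variable (a log transform compresses large values).
logged = np.log1p(ratings)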

32
Summary
Retrieving Data
You can retrieve data from multiple sources:
● SQL databases
● NoSQL databases
● APIs
● Cloud data sources
The two most common formats for delimited flat files are comma-separated (CSV) and tab-separated (TSV). It is also possible to use special characters as separators.
SQL databases are relational databases with fixed schemas.
33
Summary
Data Cleaning
● Data Cleaning is important because messy data will lead to unreliable outcomes.
Some common issues that make data messy are: duplicate or unnecessary data,
inconsistent data and typos, missing data, outliers, and data source issues.
● You can identify duplicate or unnecessary data with tools such as pandas' drop_duplicates and filter.
● Common policies to deal with missing data are: remove rows with missing values, impute the missing data, and mask the data by creating a category for missing values.
● Common methods to find outliers are: through plots, statistics, or residuals.
● Common policies to deal with outliers are: remove outliers, impute them, use a
variable transformation, or use a model that is resistant to outliers.

34
Learning Recap
In this section, we discussed:
- Why data cleaning is important for Machine Learning
- Issues that arise with messy data
- How to identify duplicate or unnecessary data
- Policies for dealing with outliers

35
