1.2.1. Retrieving Data - 1.2.2. Cleaning Data
Learning Goals
In this section, we will cover:
Reading CSV Files
Comma-separated values (CSV) files consist of rows of data whose fields are separated by commas.
In Pandas, CSV files can typically be read using just a few lines of code:
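A minimal sketch of reading CSV data with pandas. A real file path (e.g. a hypothetical `data.csv`) would normally be passed to `read_csv`; here an in-memory string stands in so the example is self-contained:

```python
import io
import pandas as pd

# Inline CSV payload; in practice this would be a file path such as
# "data.csv" (hypothetical name used for illustration).
csv_text = """name,price,quantity
apple,1.50,10
banana,0.25,40
"""

# pd.read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```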
Reading CSV Files: Useful Arguments
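A sketch of some commonly useful `read_csv` arguments (the column names and the `"?"` missing-value marker below are illustrative):

```python
import io
import pandas as pd

raw = "a;b;c\n1;2;?\n4;5;6\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",             # delimiter other than a comma (a TSV would use "\t")
    header=0,            # which row to use as column names
    na_values=["?"],     # extra strings to treat as missing
    dtype={"a": float},  # force a column's type
)
print(df["c"].isna().sum())  # 1
```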
JSON Files
JavaScript Object Notation (JSON) files are a standard way to store data across platforms.
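A minimal sketch of reading JSON, whose structure mirrors Python dictionaries. The record fields (`item`, `price`) are illustrative:

```python
import io
import json
import pandas as pd

# JSON records resemble Python dictionaries.
payload = '[{"item": "apple", "price": 1.5}, {"item": "banana", "price": 0.25}]'

# Either load into plain Python objects...
records = json.loads(payload)

# ...or straight into a DataFrame.
df = pd.read_json(io.StringIO(payload))
```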
SQL Databases
Structured Query Language (SQL) is the standard language for relational databases, which have fixed schemas.
There are many types of SQL databases, which function similarly (with some
subtle differences in syntax).
Examples of SQL databases:
- Microsoft SQL Server
- Postgres
- MySQL
- AWS Redshift
- Oracle DB
- Db2 Family
Reading SQL Data
While this example uses sqlite3, there
are several other packages available.
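A minimal sketch using sqlite3 with pandas. An in-memory database stands in for a real database file, and the table and column names are illustrative:

```python
import sqlite3
import pandas as pd

# ":memory:" stands in for a real file path such as "my_data.db"
# (hypothetical name; pass that path to sqlite3.connect instead).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices (item TEXT, price REAL)")
con.executemany("INSERT INTO prices VALUES (?, ?)",
                [("apple", 1.5), ("banana", 0.25)])

query = "SELECT * FROM prices"
observations = pd.read_sql(query, con)  # returns a DataFrame
con.close()
```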
NoSQL Databases
Not-only SQL (NoSQL) databases are not relational and vary more in structure.
APIs and Cloud Data Access
A variety of data providers make data available via Application Programming Interfaces (APIs), which make it easy to access such data from Python.
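A sketch of turning an API's JSON response into a DataFrame. In practice the payload would come from an HTTP call, e.g. `requests.get(url).json()`; here a canned payload stands in, and the `results`, `symbol`, and `price` field names are hypothetical:

```python
import json
import pandas as pd

# Stands in for: payload = requests.get(url).json()
payload = json.loads('{"results": [{"symbol": "ABC", "price": 10.0},'
                     ' {"symbol": "XYZ", "price": 20.5}]}')

# Many APIs nest the records under a key; json_normalize flattens them.
df = pd.json_normalize(payload["results"])
```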
Summary
Reading in Database Files
The steps to read in a database file using the sqlite library are:
● create a path variable that references the path to your database
● create a connection variable that references the connection to your database
● create a query variable that contains the SQL query that reads in the data table from
your database
● create an observations variable to hold the result of the pandas read_sql function
● create a tables variable to read in the data from the table sqlite_master
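The steps above can be sketched as follows. An in-memory SQLite database stands in for a real database file, and the table and column names are illustrative:

```python
import sqlite3
import pandas as pd

# Hypothetical database path; ":memory:" is substituted so the
# sketch runs without an external file.
path = ":memory:"                       # 1. path to your database
con = sqlite3.connect(path)             # 2. connection to your database
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.execute("INSERT INTO sales VALUES ('north', 100.0)")

query = "SELECT * FROM sales"           # 3. SQL query for the data table
observations = pd.read_sql(query, con)  # 4. read_sql from pandas
tables = pd.read_sql(                   # 5. list tables via sqlite_master
    "SELECT name FROM sqlite_master WHERE type = 'table'", con)
con.close()
```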
JSON files are a standard way to store data across platforms. Their structure is similar to
Python dictionaries.
NoSQL databases are not relational and vary more in structure. Most NoSQL databases
store data in JSON format.
Learning Recap
In this section, we discussed:
Learning Goals
Why is Data Cleaning so Important?
Decisions and analytics are increasingly driven by data and models.
Messy data can lead to a “garbage-in, garbage-out” effect and unreliable outcomes.
Why is Data Cleaning so Important?
The main data problems companies face:
- Lack of data
- Too much data
- Bad data
Having data ready for ML and AI ensures you are ready to infuse AI across
your organization.
How Can Data be Messy?
Duplicate or Unnecessary Data
● Pay attention to duplicate values and research why there are multiple
values.
● Duplicate entries are problematic for multiple reasons. An entry
appearing more than once receives disproportionate weight during
training, so a model can appear to perform well simply by succeeding
on the frequent entries.
● Duplicate entries can also contaminate the split between train, validation,
and test sets when identical entries do not all land in the same set. This
leads to biased performance estimates and a model that disappoints in
production.
Duplicate or Unnecessary Data
How to filter:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
... index=['mouse', 'rabbit'],
... columns=['one', 'two', 'three'])
>>> df
one two three
mouse 1 2 3
rabbit 4 5 6
>>> # select columns by name
>>> df.filter(items=['one', 'three'])
one three
mouse 1 3
rabbit 4 6
Duplicate or Unnecessary Data
How to filter:
>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
one three
mouse 1 3
rabbit 4 6
>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
one two three
rabbit 4 5 6
Duplicate or Unnecessary Data
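A minimal sketch of spotting and removing duplicate rows with pandas (the `id` and `value` columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

# duplicated() flags repeats of earlier rows; drop_duplicates() removes them.
dupes = df.duplicated()
deduped = df.drop_duplicates()
```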
Policies for Missing Data
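A sketch of the three common policies for missing data, removal, imputation, and masking, assuming a small illustrative DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0],
                   "city": ["NY", None, "LA"]})

# Policy 1: remove rows with missing values.
dropped = df.dropna()

# Policy 2: impute the missing value (here, with the column mean).
imputed = df.assign(age=df["age"].fillna(df["age"].mean()))

# Policy 3: mask by creating a category for missing values.
masked = df.assign(city=df["city"].fillna("missing"))
```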
Outliers
An outlier is an observation in data that is distant from most other observations.
If we do not identify and deal with outliers, they can have a significant impact on
the model.
How to Find Outliers?
Detecting Outliers: Plots
Detecting Outliers: Statistics
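A minimal statistical sketch using the interquartile range (IQR) rule, one common convention, on illustrative data:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# The 1.5 * IQR fences are a common rule of thumb, not a universal law.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
```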
Detecting Outliers: Residuals
Residuals (differences between actual and predicted values of the outcome
variable) indicate where the model fails to fit the data.
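A sketch of residual-based detection, assuming illustrative data with a simple linear trend: fit a line, compute residuals, and inspect the points the model fits worst (in practice a threshold such as a z-score on the residuals would be applied):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.0, 2.1, 2.9, 4.0, 15.0])  # last point breaks the trend

# Fit a line and compute residuals (actual minus predicted).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# The point with the largest absolute residual is the prime outlier suspect.
worst = int(np.argmax(np.abs(residuals)))
```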
Policies for Outliers
Common policies:
- Remove the outliers.
- Impute them.
- Use a variable transformation.
- Use a model that is resistant to outliers.
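A minimal sketch of common ways to handle outliers (removal, capping, transformation), assuming an illustrative pandas Series; the cutoff of 100 below is an assumption for this toy data, and capping is one simple form of imputation:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0])  # 500 is the outlier

removed = s[s < 100]                      # remove the outliers
capped = s.clip(upper=s[s < 100].max())   # impute by capping at the max inlier
logged = np.log(s)                        # variable transformation
# A robust model (e.g. median-based or tree-based) is the fourth option.
```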
Summary
Retrieving Data
You can retrieve data from multiple sources:
● SQL databases
● NoSQL databases
● APIs
● Cloud data sources
The two most common formats for delimited flat files are comma-separated
values (CSV) and tab-separated values (TSV). It is also possible to use other
special characters as separators.
SQL represents a set of relational databases with fixed schemas.
Summary
Data Cleaning
● Data Cleaning is important because messy data will lead to unreliable outcomes.
Some common issues that make data messy are: duplicate or unnecessary data,
inconsistent data and typos, missing data, outliers, and data source issues.
● You can identify duplicate or unnecessary data and remove it.
● Common policies to deal with missing data are: remove a row with missing
columns, impute the missing data, and mask the data by creating a category for
missing values.
● Common methods to find outliers are: through plots, statistics, or residuals.
● Common policies to deal with outliers are: remove outliers, impute them, use a
variable transformation, or use a model that is resistant to outliers.
Learning Recap
In this section, we discussed:
- Why data cleaning is important for Machine Learning
- Issues that arise with messy data
- How to identify duplicate or unnecessary data
- Policies for dealing with outliers