Exploratory Data Analysis-1 (EDA-1)
• Data Cleaning,
• Imputation Techniques,
• Data Analysis and Visualization (Scatter Diagram, Correlation Analysis),
• Transformations
• Auto EDA libraries
EDA
Data Quality:
• Validity,
• Accuracy,
• Completeness,
• Consistency,
• Uniformity.
Validity
•Data-Type Constraints: values in a particular column must be of a particular
datatype, e.g., boolean, numeric, date, etc.
•Range Constraints: typically, numbers or dates should fall within a certain range.
•Mandatory Constraints: certain columns cannot be empty.
•Set-Membership Constraints: values of a column come from a fixed set of discrete
values. For example, blood groups form a fixed set of discrete values.
Accuracy
Another thing to note is the difference between accuracy and precision. Saying that you
live on the Earth is actually true, but not precise. Where on the Earth? Saying that you
live at a particular street address is more precise.
Consistency and Uniformity
The degree to which the data is consistent, within the same data set or across
multiple data sets.
Inconsistency occurs when two values in the data set contradict each other.
A valid age, say 3, might not match the marital status, say divorced.
A customer may be recorded in two different tables with two different genders.
Which one is true?
The degree to which the data is specified using the same unit of measure.
The weight may be recorded either in pounds or kilograms. The date might follow the
USA format or the European format. The currency is sometimes in USD and
sometimes in Euros.
So the data must be converted to a single unit of measure.
Outliers
Outliers are data points that are distinctly different from other observations. They
could be real outliers or mistakes.
Outliers
What to do?
While outliers are not hard to detect, we have to determine the right solution to handle
them. It depends highly on the dataset and the goal of the project.
The methods of handling outliers are similar to those for missing data: we either drop,
adjust, or keep them. Refer to the missing data section for possible solutions.
Data Cleaning Steps
Drop duplicate rows
Rename the columns
Drop unnecessary columns
Remove strings in columns
Change the data types
Outliers
Refer to the IPython notebook for the data cleaning steps
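The steps above can be sketched in pandas; the frame and column names here are hypothetical, standing in for the dataset used in the notebook:

```python
import pandas as pd

# Hypothetical messy data illustrating the cleaning steps listed above
df = pd.DataFrame({
    "Emp ID": [1, 2, 2, 3],
    "salary": ["$100", "$200", "$200", "$300"],
    "notes": ["a", "b", "b", "c"],
})

df = df.drop_duplicates()                      # drop duplicate rows
df = df.rename(columns={"Emp ID": "emp_id"})   # rename the columns
df = df.drop(columns=["notes"])                # drop unnecessary columns
df["salary"] = df["salary"].str.replace("$", "", regex=False)  # remove strings in columns
df["salary"] = df["salary"].astype(int)        # change the data types
```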
Missing Values:
• Detection
• Treatment
What is a missing value?
Some values in the data set may be missing for various reasons such as human error,
machine failure, etc.
Missing Values: How to find?
Missing Data Heatmap: when there is a smaller number of features, we can visualize
the missing data via a heatmap. The horizontal axis shows the feature name; the
vertical axis shows the observations/rows; the yellow colour represents the missing
data while the blue colour represents otherwise.
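A small sketch of detecting missing values with pandas; the `Ozone`/`Wind` columns are made-up examples. With seaborn installed, the heatmap itself is the one call shown in the comment:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries
df = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

missing_counts = df.isnull().sum()   # number of missing values per feature
# Heatmap of missingness (True cells show as missing):
# import seaborn as sns; sns.heatmap(df.isnull(), cbar=False)
```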
Treat missing values
In statistics, this method is called the listwise deletion technique: we drop the entire
observation if it contains a missing value.
We perform this only if we are sure that the missing data is not informative;
otherwise, we should consider other solutions.
Treat missing values: Impute the Missing Values
When the feature is a numeric variable, we can conduct missing data imputation. We
replace the missing values with the average or median value computed from the
non-missing data of the same feature.
data_cleaned3['Ozone'] = data_cleaned3['Ozone'].fillna(med)
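In the snippet above, `med` is assumed to be the median of the non-missing `Ozone` values; a self-contained sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

data_cleaned3 = pd.DataFrame({"Ozone": [41.0, 36.0, np.nan, 18.0, np.nan]})

med = data_cleaned3["Ozone"].median()   # median ignores missing values
data_cleaned3["Ozone"] = data_cleaned3["Ozone"].fillna(med)
```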
df.method()                description
dropna()                   Drop observations with missing values
dropna(how='all')          Drop observations where all cells are NA
dropna(axis=1, how='all')  Drop columns where all the values are missing
dropna(thresh=5)           Drop rows that contain fewer than 5 non-missing values
fillna(0)                  Replace missing values with zeros
isnull()                   Returns True if the value is missing
notnull()                  Returns True for non-missing values
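The methods in the table, demonstrated on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

no_empty_cols = df.dropna(axis=1, how="all")  # drops column "c" (all values missing)
complete_rows = no_empty_cols.dropna()        # keeps only rows with no missing values
filled = df.fillna(0)                         # replaces every missing value with zero
mask = df.isnull()                            # True wherever a value is missing
```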
Refer to the IPython notebook for missing values treatment
Scatter Plot
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. Scatter plots are used to observe relationships between
variables.
Temperature (°C)   Ice Cream Sales
14.2°              $215
16.4°              $325
11.9°              $185
15.2°              $332
18.5°              $406
22.1°              $522
19.4°              $412
25.1°              $614
23.4°              $544
18.1°              $421
22.6°              $445
17.2°              $408
Which variable affects which one?
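A minimal matplotlib sketch of the table above as a scatter chart; sales are plotted on the y-axis against temperature on the x-axis, assuming temperature is the variable that drives sales:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

fig, ax = plt.subplots()
ax.scatter(temperature, sales)      # one dot per (temperature, sales) pair
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Ice Cream Sales ($)")
```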
Scatter Plot

Cigarettes (X) in Years   Lung Capacity (Y)
0                         45
5                         42
10                        33
15                        31
20                        29

[Scatter plot of Cigarettes (X) vs Lung Capacity (Y)]
Correlation
Pearson Correlation
Correlation is a bi-variate analysis that measures the strength of linear association between
two variables and the direction of the relationship. Correlation is a statistical technique used
to determine the degree to which two variables are linearly related.
[ -1 ….. 0 ….. +1 ]
High correlation (negative)   Low correlation   High correlation (positive)
Correlation r - Interpretation
• Positive r indicates a positive linear association between x and y, and negative
r indicates a negative linear relationship
• r is always between -1 and +1
• The strength increases as r moves away from zero toward either -1 or +1
• The extreme values +1 and -1 indicate a perfect linear relationship (points lie exactly
along a straight line)
• Graded interpretation: r of 0.1-0.3 = weak; 0.4-0.7 = moderate; 0.8-1.0 = strong
correlation
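Pearson's r can be computed directly from its definition; a numpy sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # y is perfectly linear in x

# r = covariance(x, y) / (std(x) * std(y))
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
r_np = np.corrcoef(x, y)[0, 1]   # the same value via numpy's built-in
```

Since y is an exact positive linear function of x, r comes out as +1, the extreme value described above.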
Correlation
Source: https://en.wikipedia.org/wiki/Correlation_and_dependence
Non-linear association will be covered in EDA -
Scatter Plot and Correlation: Smoking and Lung Capacity
• Data: sample group response data on smoking habits and measured lung capacities
Smoking v Lung Capacity Data

N   Cigarettes (X)   Lung Capacity (Y)
1   0                45
2   5                42
3   10               33
4   15               31
5   20               29

[Scatter plot: Smoking (yrs) on the x-axis vs Lung Capacity (Y) on the y-axis]
• r = -0.96 implies it is almost certain that a smoker will have diminished lung capacity
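The r ≈ -0.96 quoted above can be reproduced from the table with numpy:

```python
import numpy as np

# Data from the Smoking v Lung Capacity table
cigarettes_years = np.array([0, 5, 10, 15, 20])
lung_capacity = np.array([45, 42, 33, 31, 29])

r = np.corrcoef(cigarettes_years, lung_capacity)[0, 1]  # ≈ -0.96
```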
Transformations
Dummy Variable
Feature Scaling
Dummy variables
pd.get_dummies(df)
sklearn.preprocessing.OneHotEncoder
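A quick sketch of creating dummy variables with pandas; the blood-group column is an illustrative example:

```python
import pandas as pd

df = pd.DataFrame({"blood_group": ["A", "B", "O", "A"]})
dummies = pd.get_dummies(df)   # one 0/1 indicator column per category
```

Each category becomes its own column (blood_group_A, blood_group_B, blood_group_O), with a 1 in the column matching that row's value.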
Feature Scaling
Some machine learning algorithms are sensitive to feature scaling: their results will
vary with the units of the features. To remove the effect of differing scales, feature
scaling is required.
Standardization
Standardization is a scaling technique where the values are centred around the mean
with a unit standard deviation.
Normalization
Normalization is a scaling technique in which values are shifted and rescaled so
that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
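Both techniques in a minimal numpy sketch (scikit-learn's StandardScaler and MinMaxScaler implement the same formulas):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

standardized = (x - x.mean()) / x.std()            # centred on 0 with unit std dev
normalized = (x - x.min()) / (x.max() - x.min())   # rescaled into [0, 1]
```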
Automatic EDA methods
Exploratory data analysis (EDA) is an essential early step in most data science projects
and it often consists of taking the same steps to characterize a dataset (e.g. find out data
types, missing information, distribution of values, correlations, etc.).
Given the repetitiveness and similarity of such tasks, there are a few libraries that
automate them and help speed up the process.
Libraries:
pandas_profiling
sweetviz
Thank you