Exploratory Data Analysis-1 (EDA-1)

Exploratory Data Analysis (EDA) involves cleaning data, imputing missing values, and visualizing data distributions and relationships. Key techniques include scatter plots to visualize correlations between variables, and calculating correlation coefficients to quantify linear relationships. For example, a scatter plot and correlation analysis of cigarette smoking (in years) versus lung capacity showed a negative linear relationship, with increased smoking associated with lower lung function.

Exploratory Data Analysis-1 (EDA-1)

• Data Cleaning
• Imputation Techniques
• Data Analysis and Visualization (Scatter Diagram, Correlation Analysis)
• Transformations
• Auto EDA libraries
EDA

1) Describe the dataset: number of rows/columns, missing data, data types, preview.
2) Clean the data: handle missing data, invalid data types, incorrect values, and outliers.
3) Visualize data distributions: bar charts, histograms, box plots.
4) Calculate and visualize correlations (relationships) between variables: heat map.
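The first step above can be sketched with a few pandas calls; the toy DataFrame here is hypothetical, chosen only to illustrate the output of each call.

```python
# A minimal sketch of step 1: describing a dataset with pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, 18.0],
    "Month": [5, 5, 6, 6],
})

print(df.shape)                     # number of rows/columns -> (4, 2)
print(df.dtypes.to_dict())          # data types of each column
print(df.isnull().sum().to_dict())  # missing data -> {'Ozone': 1, 'Month': 0}
print(df.head(2))                   # preview of the first rows
```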
Data Cleaning
Data cleaning (or cleansing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Data Quality dimensions:
• Validity
• Accuracy
• Completeness
• Consistency
• Uniformity
Validity
• Data-type constraints: values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.
• Range constraints: typically, numbers or dates should fall within a certain range.
• Mandatory constraints: certain columns cannot be empty.
• Set-membership constraints: values of a column come from a fixed set of discrete values. For example, blood groups form a fixed set of discrete values.
Validity
• Regular expression patterns: text fields that have to match a certain pattern. For example, phone numbers may be required to have the pattern (999) 999–9999.
• Cross-field validation: certain conditions that span multiple fields must hold. For example, a patient's date of discharge from the hospital cannot be earlier than the date of admission.
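The validity constraints above can be checked mechanically. The sketch below uses pandas; the column names (blood_group, admit_date, discharge_date, phone) are hypothetical, chosen only to illustrate one check per constraint.

```python
# A minimal sketch of set-membership, cross-field, and regex validity checks.
import pandas as pd

df = pd.DataFrame({
    "blood_group": ["A+", "O-", "XX"],  # "XX" violates set membership
    "admit_date": pd.to_datetime(["2024-01-05", "2024-01-10", "2024-01-12"]),
    "discharge_date": pd.to_datetime(["2024-01-08", "2024-01-09", "2024-01-15"]),
    "phone": ["(123) 456-7890", "555-1234", "(999) 999-9999"],
})

# Set-membership constraint: blood group must be one of the fixed values
valid_groups = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}
bad_group = ~df["blood_group"].isin(valid_groups)

# Cross-field validation: discharge cannot precede admission
bad_dates = df["discharge_date"] < df["admit_date"]

# Regular-expression pattern: (999) 999-9999
bad_phone = ~df["phone"].str.fullmatch(r"\(\d{3}\) \d{3}-\d{4}")

print(bad_group.sum(), bad_dates.sum(), bad_phone.sum())  # -> 1 1 1
```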
Accuracy
The degree to which the data is close to the true values.
While defining all possible valid values allows invalid values to be easily spotted, validity does not guarantee accuracy: a valid street address might not actually exist.

Also note the difference between accuracy and precision. Saying that you live on the Earth is true, but not precise: where on the Earth? Saying that you live at a particular street address is more precise.
Consistency and Uniformity
Consistency is the degree to which the data agrees within the same data set or across multiple data sets. Inconsistency occurs when two values in the data contradict each other.

A valid age, say 3, might not match the marital status, say divorced. A customer may be recorded in two different tables with two different genders; which one is true?

Uniformity is the degree to which the data is specified using the same unit of measure. Weight may be recorded either in pounds or in kilos; dates may follow the US or the European format; currency may be sometimes in USD and sometimes in Euros. Such data must be converted to a single unit of measure.
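Enforcing uniformity usually means converting to a single unit. A minimal sketch, using a hypothetical value-plus-unit-label column layout:

```python
# Convert a weight column recorded in mixed units to kilograms.
import pandas as pd

df = pd.DataFrame({
    "weight": [150.0, 70.0, 200.0],
    "unit": ["lb", "kg", "lb"],
})

LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms
df["weight_kg"] = df.apply(
    lambda row: row["weight"] * LB_TO_KG if row["unit"] == "lb" else row["weight"],
    axis=1,
)
print(df["weight_kg"].round(2).tolist())  # -> [68.04, 70.0, 90.72]
```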
Outliers
Outliers are observations that are distinctively different from the rest of the data. They could be genuine extreme values or mistakes.

How to find them?
Depending on whether the feature is numeric or categorical, we can use different techniques to study its distribution and detect outliers.
Outliers: Histogram/Box Plot
When the feature is numeric, we can use a histogram and a box plot to detect outliers. If the histogram shows that the data is highly skewed, outliers are likely; confirm with the box plot.
Outliers: Descriptive Statistics
For numeric features, outliers may be distinctive enough to show up in the descriptive statistics. For example, for the feature Ozone, the maximum value is 168 while the 75th percentile is only 68; the value 168 could be an outlier.
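The box-plot check can be expressed numerically with the standard 1.5×IQR rule. The Ozone values below are made up for illustration; only 168 sits far above the rest.

```python
# A minimal sketch of flagging numeric outliers with the 1.5*IQR rule.
import pandas as pd

ozone = pd.Series([12, 18, 23, 28, 34, 45, 52, 59, 64, 68, 168])

q1, q3 = ozone.quantile(0.25), ozone.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # box-plot whisker fences

outliers = ozone[(ozone < lower) | (ozone > upper)]
print(outliers.tolist())  # -> [168]
```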
Outliers: Bar Chart
When the feature is categorical, we can use a bar chart to learn about its categories and their distribution. For example, the feature Month has a reasonable distribution except for category 2: the 2nd month has only one value, which could be an outlier.
Outliers: What to Do?
While outliers are not hard to detect, we have to determine the right way to handle them. This depends heavily on the dataset and the goal of the project.

The methods for handling outliers are similar to those for missing data: we either drop, adjust, or keep them. Refer to the missing data section for possible solutions.
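One common way to "adjust" outliers is capping (winsorizing): clipping values outside the 1.5×IQR fences back to the fence values. This is an illustrative choice, not the only valid treatment.

```python
# A minimal sketch of capping outliers at the IQR fences.
import pandas as pd

ozone = pd.Series([12, 18, 23, 28, 34, 45, 52, 59, 64, 68, 168])

q1, q3 = ozone.quantile(0.25), ozone.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

capped = ozone.clip(lower=lower, upper=upper)
print(capped.max())  # -> 115.5
```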
Data Cleaning Steps
• Remove duplicate rows
• Rename the columns
• Drop unnecessary columns
• Remove strings embedded in numeric columns
• Convert columns to appropriate data types
• Handle outliers

Refer to the IPython notebook for the data cleaning steps.
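The steps above can be sketched on a toy DataFrame; the column names and values here are hypothetical.

```python
# A minimal sketch of the data cleaning steps.
import pandas as pd

df = pd.DataFrame({
    "Emp ID": [1, 1, 2, 3],
    "Salary": ["50,000", "50,000", "62,500", "48,000"],
    "Notes": ["a", "a", "b", "c"],
})

df = df.drop_duplicates()                         # remove duplicate rows
df = df.rename(columns={"Emp ID": "emp_id"})      # rename the columns
df = df.drop(columns=["Notes"])                   # drop unnecessary columns
df["Salary"] = df["Salary"].str.replace(",", "")  # remove strings in columns
df["Salary"] = df["Salary"].astype(int)           # change the data type

print(df.shape, df["Salary"].dtype)  # -> (3, 2) int64
```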
Missing Values
• Detection
• Treatment
What is a missing value?
Some values in a data set may be missing for various reasons, such as human error, machine failure, etc.
Missing Values: How to Find?
Missing Data Heatmap:
When there is a small number of features, we can visualize the missing data via a heatmap. The horizontal axis shows the feature name; the vertical axis shows the observations/rows; yellow represents missing data and blue otherwise.
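The heatmap described above is typically drawn with seaborn (e.g. `sns.heatmap(df.isnull())`); the sketch below just computes the per-column missing counts, which conveys the same information numerically. The toy data is hypothetical.

```python
# A minimal sketch of detecting missing values with pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan, 23.0],
    "Temp": [67, 72, 74, 62, np.nan],
})

print(df.isnull().sum().to_dict())  # -> {'Ozone': 2, 'Temp': 1}
```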
Treat Missing Values: Drop the Observation
In statistics, this is called the listwise deletion technique: we drop the entire observation if it contains a missing value. We should do this only if we are sure that the missing data is not informative; otherwise, we should consider other solutions.
Treat Missing Values: Impute the Missing Data
When the feature is numeric, we can impute the missing values with the mean or median of the non-missing values of the same feature:

data_cleaned3['Ozone'] = data_cleaned3['Ozone'].fillna(med)

For categorical features, we can impute with the mode:

df = df.fillna(df.mode().iloc[0])
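A self-contained sketch of both imputations above, median for a numeric column and mode for a categorical column, on hypothetical data:

```python
# Median imputation (numeric) and mode imputation (categorical).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Ozone": [10.0, np.nan, 30.0, np.nan, 50.0],
    "Wind_dir": ["N", "N", None, "S", "N"],
})

med = df["Ozone"].median()  # median of the non-missing values: 30.0
df["Ozone"] = df["Ozone"].fillna(med)
df["Wind_dir"] = df["Wind_dir"].fillna(df["Wind_dir"].mode().iloc[0])

print(df["Ozone"].tolist())     # -> [10.0, 30.0, 30.0, 30.0, 50.0]
print(df["Wind_dir"].tolist())  # -> ['N', 'N', 'N', 'S', 'N']
```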
Treat Missing Values: pandas Methods

df.method()                  description
dropna()                     Drop missing observations
dropna(how='all')            Drop observations where all cells are NA
dropna(axis=1, how='all')    Drop a column if all its values are missing
dropna(thresh=5)             Drop rows that contain fewer than 5 non-missing values
fillna(0)                    Replace missing values with zeros
isnull()                     Returns True if the value is missing
notnull()                    Returns True for non-missing values
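The variants in the table can be exercised on a small frame with one all-NaN row and one all-NaN column:

```python
# A minimal sketch exercising the dropna/fillna variants from the table.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 5.0],
    "c": [np.nan, np.nan, np.nan],  # entirely missing column
})

print(len(df.dropna(how="all")))              # drops the all-NaN row -> 2
print(df.dropna(axis=1, how="all").shape[1])  # drops column c -> 2
print(df.fillna(0).isnull().sum().sum())      # no missing values left -> 0
```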
Refer to the IPython notebook for missing values treatment.
Scatter Plot
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values of two different numeric variables. Scatter plots are used to observe relationships between variables.

Temperature (°C)    Ice Cream Sales
14.2°               $215
16.4°               $325
11.9°               $185
15.2°               $332
18.5°               $406
22.1°               $522
19.4°               $412
25.1°               $614
23.4°               $544
18.1°               $421
22.6°               $445
17.2°               $408

Which variable affects which one?
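The table above can be plotted and its Pearson correlation computed directly:

```python
# Compute the Pearson correlation of the temperature / sales data above.
import numpy as np

temp = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
                 19.4, 25.1, 23.4, 18.1, 22.6, 17.2])
sales = np.array([215, 325, 185, 332, 406, 522,
                  412, 614, 544, 421, 445, 408])

r = np.corrcoef(temp, sales)[0, 1]
print(round(r, 2))  # a strong positive correlation, close to +1

# To draw the scatter plot (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(temp, sales)
# plt.xlabel("Temperature (°C)"); plt.ylabel("Ice Cream Sales ($)")
# plt.show()
```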
Scatter Plot

Cigarettes (X) in Years    Lung Capacity (Y)
0                          45
5                          42
10                         33
15                         31
20                         29

[Scatter plot of lung capacity (Y) against years of smoking (X), showing a downward trend.]
Correlation
Pearson Correlation
Correlation is a bivariate analysis that measures the strength of the linear association between two variables and the direction of the relationship. Correlation is a statistical technique used to determine the degree to which two variables are linearly related.

r ranges over [ -1 ... 0 ... +1 ]: values near -1 or +1 indicate high correlation; values near 0 indicate low correlation.
Correlation r - Interpretation
• Positive r indicates a positive linear association between x and y; negative r indicates a negative linear relationship
• r is always between -1 and +1
• The strength increases as r moves away from zero toward either -1 or +1
• The extreme values +1 and -1 indicate a perfect linear relationship (points lie exactly along a straight line)
• Graded interpretation: |r| 0.1-0.3 = weak; 0.4-0.7 = moderate; 0.8-1.0 = strong correlation
Correlation
Source: https://en.wikipedia.org/wiki/Correlation_and_dependence Non-linear association will be covered in EDA -
Scatter Plot and Correlation: Smoking and Lung Capacity

• Example: investigate the relationship between cigarette smoking and lung capacity

• Data: sample group responses on smoking habits, and measured lung capacities

Smoking vs. Lung Capacity Data

N    Cigarettes (X)    Lung Capacity (Y)
1    0                 45
2    5                 42
3    10                33
4    15                31
5    20                29

[Scatter plot of lung capacity against smoking (yrs), showing a clear downward trend.]

• r = -0.96 implies a strong negative linear association: more years of smoking is associated with diminished lung capacity
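The reported r can be verified directly from the table's five data points:

```python
# Verify the Pearson correlation of the smoking / lung-capacity data.
import numpy as np

cigarettes = np.array([0, 5, 10, 15, 20])        # years of smoking (X)
lung_capacity = np.array([45, 42, 33, 31, 29])   # measured capacity (Y)

r = np.corrcoef(cigarettes, lung_capacity)[0, 1]
print(round(r, 2))  # -> -0.96
```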
Transformations
Dummy Variable
Feature Scaling
Dummy Variables

Categorical variables have to be converted to numerical ones using a method called one-hot encoding (OHE).

OHE in pandas and sklearn:

pd.get_dummies(df)
sklearn.preprocessing.OneHotEncoder
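A minimal sketch of the pandas route, on a hypothetical single-column frame:

```python
# One-hot encode a categorical column with pd.get_dummies.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})
encoded = pd.get_dummies(df)

print(sorted(encoded.columns))
# -> ['colour_blue', 'colour_green', 'colour_red']
print(int(encoded["colour_red"].sum()))  # "red" appears twice -> 2
```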
Feature Scaling
Some machine learning algorithms are sensitive to feature scale: results will vary with the units of the features. To remove the effect of scale, feature scaling is required.

Standardization
Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation.

Normalization
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as min-max scaling.
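Both techniques can be written in a couple of lines with numpy (sklearn's StandardScaler and MinMaxScaler do the same):

```python
# Standardization (mean 0, std 1) and min-max normalization ([0, 1]).
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

standardized = (x - x.mean()) / x.std()           # center, unit std
normalized = (x - x.min()) / (x.max() - x.min())  # shift and rescale to [0, 1]

print(round(standardized.std(), 6))  # -> 1.0
print(normalized.tolist())           # -> [0.0, 0.25, 0.5, 0.75, 1.0]
```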
Automatic EDA methods
Exploratory data analysis (EDA) is an essential early step in most data science projects
and it often consists of taking the same steps to characterize a dataset (e.g. find out data
types, missing information, distribution of values, correlations, etc.).

Given the repetitiveness and similarity of such tasks, a few libraries automate them and help speed up the process.

Libraries:
pandas_profiling
sweetviz
Thank you
