Exploratory Data Analysis-1 (EDA-1)
• Data Cleaning,
• Imputation Techniques,
• Data Analysis and Visualization (Scatter Diagram, Correlation Analysis),
• Transformations
• Auto EDA libraries
EDA
Data Quality:
• Validity,
• Accuracy,
• Completeness,
• Consistency,
• Uniformity.
Validity
•Data-Type Constraints: values in a particular column must be of a particular
datatype, e.g., boolean, numeric, date, etc.
•Range Constraints: typically, numbers or dates should fall within a certain range.
•Mandatory Constraints: certain columns cannot be empty.
•Set-Membership Constraints: values of a column come from a fixed set of discrete
values. For example, blood groups form a fixed set of discrete values.
Accuracy
Another thing to note is the difference between accuracy and precision. Saying that you
live on the Earth is actually true, but not precise. Where on the Earth? Saying that you
live at a particular street address is more precise.
Consistency and Uniformity
The degree to which the data is consistent, within the same data set or across
multiple data sets.
Inconsistency occurs when two values in the data set contradict each other.
A valid age, say 3, might not match the marital status, say divorced.
A customer may be recorded in two different tables with two different genders.
Which one is true?
The degree to which the data is specified using the same unit of measure.
The weight may be recorded either in pounds or kilograms. The date might follow the
USA format or the European format. The currency is sometimes in USD and
sometimes in Euros.
So the data must be converted to a single unit of measure.
Outliers
Outliers are data points that are distinctly different from other observations. They
could be real outliers or mistakes.
Outliers
What to do?
While outliers are not hard to detect, we have to determine the right solution to handle
them. It depends highly on the dataset and the goal of the project.
The methods of handling outliers are similar to those for missing data: we either drop,
adjust, or keep them. Refer to the missing data section for possible solutions.
Data Cleaning Steps
Drop duplicate rows
Rename the columns
Drop unnecessary columns
Remove strings in columns
Change the data types
Outliers
Refer to the IPython notebook for the data cleaning steps
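The steps above can be sketched in pandas; the frame and column names here are hypothetical, standing in for the dataset used in the notebook:

```python
import pandas as pd

# Hypothetical messy data illustrating the cleaning steps listed above
df = pd.DataFrame({
    "Emp ID": [1, 2, 2, 3],
    "salary": ["$100", "$200", "$200", "$300"],
    "notes": ["a", "b", "b", "c"],
})

df = df.drop_duplicates()                      # drop duplicate rows
df = df.rename(columns={"Emp ID": "emp_id"})   # rename the columns
df = df.drop(columns=["notes"])                # drop unnecessary columns
df["salary"] = df["salary"].str.replace("$", "", regex=False)  # remove strings in columns
df["salary"] = df["salary"].astype(int)        # change the data types
```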
Missing Values:
• Detection
• Treatment
What is a missing value?
Some values in the data set may be missing for various reasons such as human error,
machine failure, etc.
Missing Values: How to find?
Missing Data Heatmap: when there is a smaller number of features, we can visualize
the missing data via a heatmap. The horizontal axis shows the feature name; the
vertical axis shows the observations/rows; the yellow colour represents the missing
data while the blue colour represents otherwise.
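A small sketch of detecting missing values with pandas; the `Ozone`/`Wind` columns are made-up examples. With seaborn installed, the heatmap itself is the one call shown in the comment:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries
df = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

missing_counts = df.isnull().sum()   # number of missing values per feature
# Heatmap of missingness (True cells show as missing):
# import seaborn as sns; sns.heatmap(df.isnull(), cbar=False)
```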
Treat missing values
In statistics, this method is called the listwise deletion technique: we drop the entire
observation if it contains a missing value.
We perform this only if we are sure that the missing data is not informative;
otherwise, we should consider other solutions.
Treat missing values: Impute the Missing Values
When the feature is a numeric variable, we can conduct missing data imputation. We
replace the missing values with the average or median value computed from the
non-missing data of the same feature.
data_cleaned3['Ozone'] = data_cleaned3['Ozone'].fillna(med)
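In the snippet above, `med` is assumed to be the median of the non-missing `Ozone` values; a self-contained sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

data_cleaned3 = pd.DataFrame({"Ozone": [41.0, 36.0, np.nan, 18.0, np.nan]})

med = data_cleaned3["Ozone"].median()   # median ignores missing values
data_cleaned3["Ozone"] = data_cleaned3["Ozone"].fillna(med)
```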
df.method()                description
dropna()                   Drop observations with missing values
dropna(how='all')          Drop observations where all cells are NA
dropna(axis=1, how='all')  Drop columns where all the values are missing
dropna(thresh=5)           Drop rows that contain fewer than 5 non-missing values
fillna(0)                  Replace missing values with zeros
isnull()                   Returns True if the value is missing
notnull()                  Returns True for non-missing values
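The methods in the table, demonstrated on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

no_empty_cols = df.dropna(axis=1, how="all")  # drops column "c" (all values missing)
complete_rows = no_empty_cols.dropna()        # keeps only rows with no missing values
filled = df.fillna(0)                         # replaces every missing value with zero
mask = df.isnull()                            # True wherever a value is missing
```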
Refer to the IPython notebook for missing values treatment
Scatter Plot
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. Scatter plots are used to observe relationships between
variables.
Temperature (°C)   Ice Cream Sales
14.2°              $215
16.4°              $325
11.9°              $185
15.2°              $332
18.5°              $406
22.1°              $522
19.4°              $412
25.1°              $614
23.4°              $544
18.1°              $421
22.6°              $445
17.2°              $408
Which variable affects which one?
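A minimal matplotlib sketch of the table above as a scatter chart; sales are plotted on the y-axis against temperature on the x-axis, assuming temperature is the variable that drives sales:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

fig, ax = plt.subplots()
ax.scatter(temperature, sales)      # one dot per (temperature, sales) pair
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Ice Cream Sales ($)")
```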
Scatter Plot

Cigarettes (X) in Years   Lung Capacity (Y)
0                         45
5                         42
10                        33
15                        31
20                        29

[Scatter plot of Cigarettes (X) vs Lung Capacity (Y)]
Correlation
Pearson Correlation
Correlation is a bi-variate analysis that measures the strength of linear association between
two variables and the direction of the relationship. Correlation is a statistical technique used
to determine the degree to which two variables are linearly related.
[ -1 ….. 0 ….. +1 ]
High correlation (negative)   Low correlation   High correlation (positive)
Correlation r - Interpretation
• Positive r indicates a positive linear association between x and y, and negative
r indicates a negative linear relationship
• r is always between -1 and +1
• The strength increases as r moves away from zero toward either -1 or +1
• The extreme values +1 and -1 indicate a perfect linear relationship (points lie exactly
along a straight line)
• Graded interpretation: r of 0.1-0.3 = weak; 0.4-0.7 = moderate; 0.8-1.0 = strong
correlation
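Pearson's r can be computed directly from its definition; a numpy sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # y is perfectly linear in x

# r = covariance(x, y) / (std(x) * std(y))
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
r_np = np.corrcoef(x, y)[0, 1]   # the same value via numpy's built-in
```

Since y is an exact positive linear function of x, r comes out as +1, the extreme value described above.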
Correlation
Source: https://en.wikipedia.org/wiki/Correlation_and_dependence
Non-linear association will be covered in EDA -
Scatter Plot and Correlation: Smoking and Lung Capacity
• Data: sample group response data on smoking habits and measured lung capacities
Smoking v Lung Capacity Data

N   Cigarettes (X)   Lung Capacity (Y)
1   0                45
2   5                42
3   10               33
4   15               31
5   20               29

[Scatter plot: Smoking (yrs) on the x-axis vs Lung Capacity (Y) on the y-axis]
• r = -0.96 implies it is almost certain that a smoker will have diminished lung capacity
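The r ≈ -0.96 quoted above can be reproduced from the table with numpy:

```python
import numpy as np

# Data from the Smoking v Lung Capacity table
cigarettes_years = np.array([0, 5, 10, 15, 20])
lung_capacity = np.array([45, 42, 33, 31, 29])

r = np.corrcoef(cigarettes_years, lung_capacity)[0, 1]  # ≈ -0.96
```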
Transformations
Dummy Variable
Feature Scaling
Dummy variables
pd.get_dummies(df)
sklearn.preprocessing.OneHotEncoder
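A quick sketch of creating dummy variables with pandas; the blood-group column is an illustrative example:

```python
import pandas as pd

df = pd.DataFrame({"blood_group": ["A", "B", "O", "A"]})
dummies = pd.get_dummies(df)   # one 0/1 indicator column per category
```

Each category becomes its own column (blood_group_A, blood_group_B, blood_group_O), with a 1 in the column matching that row's value.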
Feature Scaling
Some machine learning algorithms are sensitive to feature scaling: their results will
vary with the units of the features. To remove the effect of differing scales, feature
scaling is required.
Standardization
Standardization is a scaling technique where the values are centred around the mean
with a unit standard deviation.
Normalization
Normalization is a scaling technique in which values are shifted and rescaled so
that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
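Both techniques in a minimal numpy sketch (scikit-learn's StandardScaler and MinMaxScaler implement the same formulas):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

standardized = (x - x.mean()) / x.std()            # centred on 0 with unit std dev
normalized = (x - x.min()) / (x.max() - x.min())   # rescaled into [0, 1]
```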
Automatic EDA methods
Exploratory data analysis (EDA) is an essential early step in most data science projects
and it often consists of taking the same steps to characterize a dataset (e.g. find out data
types, missing information, distribution of values, correlations, etc.).
Given the repetitiveness and similarity of such tasks, there are a few libraries that
automate them and help speed up the process.
Libraries:
pandas_profiling
sweetviz
Thank you