Exploratory Data Analysis

Uploaded by

Neha Gupta

Exploratory Data Analysis is a critical step in the data science process. It is the
foundation for understanding and interpreting complex data sets. EDA helps data
scientists identify patterns, spot anomalies, test hypotheses, and check
assumptions through various statistical and graphical techniques.
It is the initial examination of data and should occur before any assumptions or
conclusions are made to avoid faulty analysis.
EDA is important for understanding and preparing data before using it for
machine learning or complex modeling. It can help identify issues like missing
data, outliers, and anomalies.
Exploratory Data Analysis (EDA) is a process of describing the data by means of
statistical and visualization techniques in order to bring important aspects of that
data into focus for further analysis.
The four types of EDA are univariate non-graphical, multivariate non-graphical,
univariate graphical, and multivariate graphical.
The main purpose of EDA is to help look at data before making any assumptions.
It can help identify obvious errors, better understand patterns within the data,
detect outliers or anomalous events, and find interesting relations among the
variables.
Python libraries for EDA include Pandas for data manipulation, Matplotlib and
Seaborn for visualisations, Plotly for interactive plots, and Dask for scalable
computing. These libraries enhance data analysis by offering powerful tools for
summarising, visualising, and managing large datasets effectively.
Applications: EDA can be applied to enhance customer segmentation, optimize
marketing strategies, perform market basket analysis, detect anomalies, and
predict trends. This analysis informs decisions across various departments, from
marketing to risk management.

Exploratory Data Analysis (EDA) uses both quantitative and graphical techniques:
• Quantitative techniques
  These techniques include interval estimation and hypothesis testing:
  ◦ Interval estimation: This technique constructs a range of values that is
    likely to contain an unknown population parameter, such as a mean. A
    confidence interval is an example of interval estimation.
  ◦ Hypothesis testing: This technique assesses whether the sample data
    provide enough evidence to reject a stated proposition (the null
    hypothesis); it does not prove the proposition true or false.
• Graphical techniques
  These techniques summarize data visually or diagrammatically. Some examples
  of graphical techniques include:
  ◦ Scatterplot: This technique plots one variable on the x-axis and
    another on the y-axis to show the points for each case in the dataset.
  ◦ Run chart: This technique plots data as a line graph over time.
  ◦ Heat map: This technique uses color to represent values in the data.
  ◦ Multivariate chart: This technique graphically shows the
    relationships between factors and response.
  ◦ Bubble chart: This technique displays multiple circles (bubbles) in a
    two-dimensional plot.
  ◦ Rootogram: This technique plots the square roots of the number of
    observations in different ranges of a quantitative variable.
EDA techniques are often graphical because graphics help analysts explore data
openly and discover new insights.
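As a minimal sketch of interval estimation, the block below computes a 95% confidence interval for a mean, using the seven student heights from the univariate example in this document. The function name `mean_confidence_interval` is hypothetical, and the t critical value 2.447 (two-sided 95%, 6 degrees of freedom) is an assumption copied from a standard t table rather than computed.

```python
import math
import statistics

# Student heights in cm, taken from the univariate example in this document.
heights = [164, 167.3, 170, 174.2, 178, 180, 186]


def mean_confidence_interval(sample, t_critical):
    """Return (low, high) bounds of a confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    margin = t_critical * std_err
    return mean - margin, mean + margin


# Assumption: 2.447 is the two-sided 95% t critical value for n - 1 = 6
# degrees of freedom, read from a standard t table.
low, high = mean_confidence_interval(heights, t_critical=2.447)
print(f"95% CI for the mean height: ({low:.1f}, {high:.1f}) cm")
```

With only seven observations the interval is wide, which is itself a useful exploratory finding about how little the sample pins down the mean.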

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis
that employs a variety of techniques (graphical and quantitative) to better
understand data. It is easy to get lost in the visualizations of EDA and lose
track of its purpose: EDA aims to make the downstream analysis easier.
To put EDA in context, the data science steps are: obtain data; clean and load
data; exploratory data analysis; model building; model evaluation; data
visualization and presentation.
The objectives of EDA are to discover underlying patterns, spot anomalies, frame
hypotheses and check assumptions, with the aim of finding a well-fitting model
(if one exists). At a more granular level, EDA involves understanding the
relationships between variables, including: determining relationships among the
explanatory variables; assessing the relationships between explanatory and
outcome variables (direction and rough size estimates); detecting the presence
of outliers; ranking the important explanatory variables; and drawing
conclusions as to whether individual explanatory variables are statistically
significant.
In this post, we present a systematic approach to EDA (based on the sources
listed below) to present EDA techniques in a concise manner.
Categorising EDA techniques
EDA techniques are either graphical or quantitative. Each of these techniques
is, in turn, either univariate or multivariate. Quantitative methods normally
involve calculation of summary statistics. Graphical methods summarize the data
in a diagrammatic or visual way. Univariate methods look at one variable (data
column) at a time, while multivariate methods look at two or more variables at
a time to explore relationships. Usually, multivariate EDA will be bivariate
(looking at exactly two variables). Thus, the four types of EDA techniques are
univariate non-graphical, univariate graphical, multivariate non-graphical, and
multivariate graphical. Non-graphical and graphical methods complement each
other. We can see graphical methods as more qualitative (involving subjective
judgment) and quantitative methods as more objective.
If we are focusing on data from observation of a single variable on n subjects, i.e.
a sample of size n, we also need to look graphically at the distribution of the
sample. A common working assumption is that the distribution is approximately
normal. Note that a large sample size does not by itself make the data normal:
the central limit theorem concerns the distribution of the sample mean, not of
the individual observations, so the assumption should be checked against the
data. There are also exceptions to this idea: for example, distributions could
evolve over time, or the distribution could be unknown.
Univariate non-graphical EDA
Univariate non-graphical EDA techniques are concerned with understanding the
underlying sample distribution and making observations about the population.
This also involves outlier detection. For univariate categorical data, we are
interested in the range of values (the categories that occur) and the frequency
of each. Univariate EDA for quantitative data involves making preliminary
assessments about the population distribution of the variable using the data
from the observed sample. The characteristics of the population distribution
inferred include center, spread, modality, shape and outliers. Measures of
central tendency include the mean, median and mode. The most common measure of
central tendency is the mean. For skewed distributions, or when there is concern
about outliers, the median may be preferred. Measures of spread include
variance, standard deviation, and interquartile range. Spread is an indicator
of how far away from the center we are still likely to find data values.
Univariate EDA also involves finding the skewness (a measure of asymmetry) and
kurtosis (traditionally described as peakedness relative to a Gaussian shape,
and largely driven by how heavy the tails are).
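The measures named above can be sketched with Python's standard `statistics` module, applied here to the seven student heights from the univariate example in this document. The helper name `describe` is hypothetical. Skewness and excess kurtosis are computed from divide-by-n central moments, which is one of several conventions; other tools may apply small-sample corrections and give slightly different values.

```python
import statistics

# Student heights in cm, taken from the univariate example in this document.
heights = [164, 167.3, 170, 174.2, 178, 180, 186]


def describe(sample):
    """Return center, spread and shape measures for a numeric sample."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Quartiles; statistics.quantiles uses the "exclusive" method by default.
    q1, _, q3 = statistics.quantiles(sample, n=4)
    # Central moments with a divide-by-n convention (an assumption; other
    # conventions apply bias corrections).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return {
        "mean": mean,
        "median": statistics.median(sample),
        "variance": statistics.variance(sample),  # sample variance (n - 1)
        "std_dev": statistics.stdev(sample),      # sample standard deviation
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,      # 0 for a Gaussian
    }


summary = describe(heights)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Comparing the mean with the median is a quick first check for skew: when the two are close, as they are for this sample, the center is not being pulled strongly by one tail.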
Univariate graphical EDA
For graphical analysis of univariate quantitative data, histograms are typically
used (for categorical data, the corresponding plot is a bar chart of category
counts). The histogram represents the frequency (count) or proportion
(count/total count) of cases for a range of values. Typically, between about 5
and 30 bins are chosen. Histograms are one of the best ways to quickly learn a
lot about your data, including central tendency, spread, modality, shape and
outliers. Stem-and-leaf plots could also be used for the same purpose. Boxplots
can also be used to present information about the central tendency, symmetry
and skew, as well as outliers. Quantile-normal plots (QQ plots) and other
techniques could also be used here.
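A histogram and a boxplot of the same variable can be drawn side by side with Matplotlib. This is a minimal sketch using the seven student heights from the univariate example in this document; the choice of 5 bins and the `Agg` backend (so the script runs without a display) are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt

# Student heights in cm, taken from the univariate example in this document.
heights = [164, 167.3, 170, 174.2, 178, 180, 186]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: counts of cases falling into each of 5 equal-width bins.
counts, bin_edges, _ = ax_hist.hist(heights, bins=5)
ax_hist.set_xlabel("Height (cm)")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Histogram")

# Boxplot: median, quartiles, whiskers and any points flagged as outliers.
ax_box.boxplot(heights)
ax_box.set_ylabel("Height (cm)")
ax_box.set_title("Boxplot")

fig.tight_layout()
```

Calling `fig.savefig("heights.png")` (the file name is arbitrary) or `plt.show()` with an interactive backend renders the figure. With so few observations the histogram is coarse; the bin count is worth varying to see how stable the apparent shape is.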
Multivariate non-graphical EDA
Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables in the form of either cross-tabulation or statistics.
For each combination of one categorical variable (usually explanatory) and one
quantitative variable (usually outcome), we can compute a statistic of the
quantitative variable separately for each level of the categorical variable, and
then compare the statistics across levels of the categorical variable. Comparing
the means is an informal version of one-way ANOVA. Comparing medians is a robust
informal version of one-way ANOVA (adapted from source). For two quantitative
variables, we can calculate covariance and/or correlation. When we have many
quantitative variables, we typically calculate the pairwise covariances and/or
correlations and assemble them into a matrix.
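The three non-graphical summaries described above (cross-tabulation, per-level statistics, and covariance/correlation matrices) can be sketched with pandas. The data are the advertisement and temperature tables that appear in this document's multivariate and bivariate examples; the column names are hypothetical labels chosen for this sketch.

```python
import pandas as pd

# Advertisement click data from the multivariate example in this document.
ads = pd.DataFrame({
    "advertisement": ["Ad1", "Ad3", "Ad2", "Ad1", "Ad3"],
    "gender": ["Male", "Female", "Female", "Male", "Male"],
    "click_rate": [80, 55, 123, 66, 35],
})

# Cross-tabulation of two categorical variables (counts per combination).
counts = pd.crosstab(ads["advertisement"], ads["gender"])

# A statistic of the quantitative variable for each level of a categorical one.
by_gender = ads.groupby("gender")["click_rate"].agg(["mean", "median"])

# Temperature and ice cream sales from the bivariate example in this document.
weather = pd.DataFrame({
    "temperature": [20, 25, 35],
    "ice_cream_sales": [2000, 2500, 5000],
})

# Pairwise covariances and correlations assembled into matrices.
cov_matrix = weather.cov()
corr_matrix = weather.corr()

print(counts, by_gender, cov_matrix, corr_matrix, sep="\n\n")
```

Comparing the `mean` and `median` columns of `by_gender` across levels is the informal ANOVA-style comparison mentioned in the text; with this few rows it is only illustrative.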
Multivariate graphical EDA
For two categorical variables, the most commonly used graphical technique is
the grouped barplot, with each group representing one level of one of the
variables and each bar within a group representing a level of the other
variable. For one categorical and one quantitative variable, we could use
side-by-side (parallel) boxplots, one per category. For two quantitative
variables, the basic graphical EDA technique is the scatterplot, which has one
variable on the x-axis, one on the y-axis and a point for each case in your
dataset. Typically, the explanatory variable goes on the x-axis. Additional
categorical variables can be accommodated by the use of colour or symbols.
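A basic scatterplot can be sketched with Matplotlib as follows, using the temperature and ice cream sales values from this document's bivariate example, with the explanatory variable (temperature) on the x-axis. The `Agg` backend is an assumption so that the script runs without a display, and the axis labels carry no units because the source table gives none.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt

# Values from the bivariate example in this document.
temperature = [20, 25, 35]
ice_cream_sales = [2000, 2500, 5000]

fig, ax = plt.subplots()

# One point per case: explanatory variable on x, outcome on y.
points = ax.scatter(temperature, ice_cream_sales)
ax.set_xlabel("Temperature")
ax.set_ylabel("Ice cream sales")
ax.set_title("Ice cream sales vs temperature")
```

To encode an additional categorical variable, `ax.scatter` accepts a `c` argument for per-point colours and a `marker` argument for the symbol, so each category can be drawn with its own call.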
Univariate, Bivariate and Multivariate data and its analysis
Univariate data refers to a type of data in which each observation or data point
corresponds to a single variable. In other words, it involves the measurement or
observation of a single characteristic or attribute for each individual or item in
the dataset. Analyzing univariate data is the simplest form of analysis in
statistics.

Heights (in cm)   164   167.3   170   174.2   178   180   186

Suppose that the heights of seven students in a class are recorded (table above).
There is only one variable, height, and the data do not involve any cause or
relationship.
Key points in Univariate analysis:
1. No Relationships: Univariate analysis focuses solely on describing and
summarizing the distribution of the single variable. It does not explore
relationships between variables or attempt to identify causes.
2. Descriptive Statistics: Descriptive statistics, such as measures of
central tendency (mean, median, mode) and measures of
dispersion (range, standard deviation), are commonly used in the analysis
of univariate data.
3. Visualization: Histograms, box plots, and other graphical representations
are often used to visually represent the distribution of the single variable.
Bivariate data
Bivariate data involves two different variables, and the analysis of this type of
data focuses on understanding the relationship or association between these two
variables. An example of bivariate data is temperature and ice cream sales in
the summer season.

Temperature   Ice Cream Sales
20            2000
25            2500
35            5000

Suppose temperature and ice cream sales are the two variables of a bivariate
dataset (table above). The table shows that the two variables are positively
related: as the temperature increases, the sales also increase, although not in
strict proportion.
Key points in Bivariate analysis:
1. Relationship Analysis: The primary goal of analyzing bivariate data is to
understand the relationship between the two variables. This relationship
could be positive (both variables increase together), negative (one
variable increases while the other decreases), or show no clear pattern.
2. Scatterplots: A common visualization tool for bivariate data is a
scatterplot, where each data point represents a pair of values for the two
variables. Scatterplots help visualize patterns and trends in the data.
3. Correlation Coefficient: A quantitative measure called the correlation
coefficient is often used to quantify the strength and direction of the linear
relationship between two variables. The correlation coefficient ranges from
-1 to 1.
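The correlation coefficient described in point 3 can be computed directly from its definition. The sketch below implements the Pearson correlation coefficient in plain Python and applies it to the temperature and ice cream sales values from the table above; the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Values from the bivariate example in this document.
temperature = [20, 25, 35]
ice_cream_sales = [2000, 2500, 5000]

r = pearson_r(temperature, ice_cream_sales)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship, but with only three data points it says little about the population, and it does not by itself establish that temperature causes the change in sales.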
Multivariate data
Multivariate data refers to datasets where each observation or sample point
consists of multiple variables or features. These variables can represent different
aspects, characteristics, or measurements related to the observed phenomenon.
When dealing with three or more variables, the data is specifically categorized as
multivariate.
As an example of this type of data, suppose an advertiser wants to compare the
popularity of four advertisements on a website.

Advertisement   Gender   Click rate
Ad1             Male     80
Ad3             Female   55
Ad2             Female   123
Ad1             Male     66
Ad3             Male     35

The click rates could be measured for both men and women, and relationships
between the variables can then be examined. Multivariate data is similar to
bivariate data but contains more than two variables.
Key points in Multivariate analysis:
1. Analysis Techniques: The way to analyze this data depends on the goals
to be achieved. Some of the techniques are regression analysis, principal
component analysis, path analysis, factor analysis and multivariate
analysis of variance (MANOVA).
2. Goals of Analysis: The choice of analysis technique depends on the
specific goals of the study. For example, researchers may be interested in
predicting one variable based on others, identifying underlying factors that
explain patterns, or comparing group means across multiple variables.
3. Interpretation: Multivariate analysis allows for a more nuanced
interpretation of complex relationships within the data. It helps uncover
patterns that may not be apparent when examining variables individually.

Univariate                    Bivariate                       Multivariate

Summarizes a single variable  Summarizes two variables.       Summarizes more than two
at a time.                                                    variables.

Does not deal with causes     Deals with causes and           Deals with causes and
or relationships.             relationships; analysis is      relationships; analysis is
                              done.                           done.

Contains no dependent         Contains only one dependent     Similar to bivariate, but
variable.                     variable.                       contains more than two
                                                              variables.

The main purpose is to        The main purpose is to          The main purpose is to study
describe.                     explain.                        the relationships among the
                                                              variables.

Example: height.              Example: temperature and ice    Example: an advertiser
                              cream sales during the summer   compares the popularity of
                              vacation.                       four advertisements on a
                                                              website; click rates are
                                                              measured for both men and
                                                              women, and relationships
                                                              between the variables are
                                                              examined.
