Exploratory Data Analysis
Exploratory Data Analysis
Exploratory Data Analysis is a critical step in the data science process. It is the
foundation for understanding and interpreting complex data sets. EDA helps data
scientists identify patterns, spot anomalies, test hypotheses, and check
assumptions through various statistical and graphical techniques.
It is the initial examination of data and should occur before any assumptions or
conclusions are made to avoid faulty analysis.
EDA is important for understanding and preparing data before using it for
machine learning or complex modeling. It can help identify issues like missing
data, outliers, and anomalies.
Exploratory Data Analysis (EDA) is a process of describing the data by means of
statistical and visualization techniques in order to bring important aspects of that
data into focus for further analysis.
The four types of EDA are univariate non-graphical, multivariate non- graphical,
univariate graphical, and multivariate graphical.
The main purpose of EDA is to help look at data before making any assumptions.
It can help identify obvious errors, as well as better understand patterns within
the data, detect outliers or anomalous events, find interesting relations among
the variables.
Python libraries for EDA include Pandas for data manipulation, Matplotlib and
Seaborn for visualisations, Plotly for interactive plots, and Dask for scalable
computing. These libraries enhance data analysis by offering powerful tools for
summarising, visualising, and managing large datasets effectively.
Applications: EDA can be applied to enhance customer segmentation, optimize
marketing strategies, perform market basket analysis, detect anomalies, and
predict trends. This analysis informs decisions across various departments, from
marketing to risk management.
Suppose that the heights of seven students in a class is recorded (above table).
There is only one variable, which is height, and it is not dealing with any cause or
relationship.
Key points in Univariate analysis:
1. No Relationships: Univariate analysis focuses solely on describing and
summarizing the distribution of the single variable. It does not explore
relationships between variables or attempt to identify causes.
2. Descriptive Statistics: Descriptive statistics, such as measures of
central tendency (mean, median, mode) and measures of
dispersion (range, standard deviation), are commonly used in the analysis
of univariate data.
3. Visualization: Histograms, box plots, and other graphical representations
are often used to visually represent the distribution of the single variable.
Bivariate data
Bivariate data involves two different variables, and the analysis of this type of
data focuses on understanding the relationship or association between these two
variables. Example of bivariate data can be temperature and ice cream sales in
summer season.
20 2000
25 2500
35 5000
Suppose the temperature and ice cream sales are the two variables of a bivariate
data(table 2). Here, the relationship is visible from the table that temperature
and sales are directly proportional to each other and thus related because as the
temperature increases, the sales also increase.
Key points in Bivariate analysis:
1. Relationship Analysis: The primary goal of analyzing bivariate data is to
understand the relationship between the two variables. This relationship
could be positive (both variables increase together), negative (one
variable increases while the other decreases), or show no clear pattern.
2. Scatterplots: A common visualization tool for bivariate data is a
scatterplot, where each data point represents a pair of values for the two
variables. Scatterplots help visualize patterns and trends in the data.
3. Correlation Coefficient: A quantitative measure called the correlation
coefficient is often used to quantify the strength and direction of the linear
relationship between two variables. The correlation coefficient ranges from
-1 to 1.
Multivariate data
Multivariate data refers to datasets where each observation or sample point
consists of multiple variables or features. These variables can represent different
aspects, characteristics, or measurements related to the observed phenomenon.
When dealing with three or more variables, the data is specifically categorized as
multivariate.
Example of this type of data is suppose an advertiser wants to compare the
popularity of four advertisements on a website.
Ad1 Male 80
Femal
Ad3 55
e
Femal
Ad2 123
e
Ad1 Male 66
Ad3 Male 35
The click rates could be measured for both men and women and relationships
between variables can then be examined. It is similar to bivariate but contains
more than one dependent variable.
Key points in Multivariate analysis:
1. Analysis Techniques:The ways to perform analysis on this data depends
on the goals to be achieved. Some of the techniques are regression
analysis, principal component analysis, path analysis, factor analysis
and multivariate analysis of variance (MANOVA).
2. Goals of Analysis: The choice of analysis technique depends on the
specific goals of the study. For example, researchers may be interested in
predicting one variable based on others, identifying underlying factors that
explain patterns, or comparing group means across multiple variables.
3. Interpretation: Multivariate analysis allows for a more nuanced
interpretation of complex relationships within the data. It helps uncover
patterns that may not be apparent when examining variables individually.
It only summarize
It only summarize two It only summarize more
single variable at a
variables than 2 variables.
time.
It does not deal with It does deal with causes It does not deal with
causes and and relationships and causes and relationships
relationships. analysis is done. and analysis is done.
It is similar to bivariate
It does not contain any It does contain only one
but it contains more
dependent variable. dependent variable.
than 2 variables.
Example, Suppose an
advertiser wants to
compare the popularity
The example of of four advertisements
The example of a bivariate can be on a website.
univariate can be temperature and ice Then their click rates
height. sales in summer could be measured for
vacation. both men and women
and relationships
between variable can be
examined