CCS346 EDA Unit 1 Notes
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for EDA-
Data transformation techniques-merging database, reshaping and pivoting, Transformation techniques.
EDA FUNDAMENTALS
Data and Information
Data is a collection of facts such as numbers, discrete objects, figures, words, symbols, events,
measurements, observations, or descriptions of things.
Data can be Qualitative or Quantitative. Qualitative data is descriptive information (It describes
something). Quantitative data is numerical information (numbers).
Quantitative data can be Discrete or Continuous. Discrete data take only certain values (whole numbers).
Continuous data take any value (within a range). Discrete data is counted and continuous data is measured.
Example: Qualitative data
Most common names in India
Colour of a Dress
Example: Quantitative data
Height, Weight, Leaves on the tree, Customers in a shop.
Information is defined as classified or organized data that has meaningful value. Information is
processed data used to make decisions and take action. Processed data must meet the following criteria:
Accuracy
Completeness
Timeliness
Data science is the field of study that combines knowledge of mathematics and statistics to extract
meaningful insights from data. Data science deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information and make business decisions.
Data science combines math and statistics, specialized programming, advanced analytics, Artificial
Intelligence (AI) and machine learning to extract meaningful insights from data and used to guide decision
making and strategic planning.
Data Science is extraction, preparation, analysis, visualization, and maintenance of information. Data
science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics.
The Data Science Lifecycle
Data science's lifecycle consists of five distinct stages, each with its own tasks:
1. Capture - Data acquisition, data entry, signal reception, data extraction - This stage involves gathering
raw structured and unstructured data.
2. Maintain - Data warehousing, data cleansing, data staging, data processing, data architecture - This
stage covers taking the raw data and putting it in a form that can be used.
3. Process - Data mining, clustering/classification, data modeling, data summarization - Data scientists take
the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in
predictive analysis.
4. Analyze - Exploratory/confirmatory, predictive analysis, regression, text mining, qualitative analysis -
This stage involves performing the various analyses on the data.
5. Communicate - Data reporting, data visualization, business intelligence, decision making - In this final
step, analysts prepare the analyses in easily readable forms such as charts, graphs and reports.
Data science tools
Data science tools used in various stages of the data science process are:
o Data Analysis - SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner
o Data Warehousing - Informatica, AWS Redshift, Wega
o Data Visualization - Jupyter, Tableau, Cognos, RAW
o Machine Learning - Spark MLlib, Mahout, Azure ML Studio
Phases of data analysis
3. Data processing
• Preprocessing involves the process of selecting and organizing the dataset before actual analysis.
• Common tasks involve correctly exporting the dataset, placing them under the right tables, structuring
them, and exporting them in the correct format.
4. Data cleaning
• Preprocessed data is still not ready for detailed analysis.
• It must be correctly transformed for an incompleteness check, duplicates check, error check, and missing
value check.
• This stage involves responsibilities such as matching the correct record, finding inaccuracies in the
dataset, understanding the overall data quality, removing duplicate items, and filling in the missing values.
• Data cleaning is dependent on the types of data under study.
• Hence, it is essential for data scientists or EDA experts to comprehend different types of datasets.
• An example of data cleaning is using outlier detection methods for quantitative data cleaning.
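As a hedged sketch of the outlier-detection example mentioned above (the column name, the values, and the 1.5 threshold are illustrative assumptions, not from the notes), the interquartile range (IQR) rule can be used to flag suspicious quantitative values with pandas:

```python
import pandas as pd

# Illustrative data: one value (250) is clearly inconsistent with the rest.
weights = pd.Series([68, 72, 65, 70, 250, 71, 69], name="weight_kg")

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = weights[(weights < lower) | (weights > upper)]
print(outliers)                                  # flags the 250 entry
clean = weights[(weights >= lower) & (weights <= upper)]
```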
5. EDA
• Exploratory data analysis is the stage where the message contained in the data is actually understood.
• Several types of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithm
• Generalized models or mathematical formulas represent relationships among different variables, such as
correlation or causation. These models or equations involve one or more variables that depend on other
variables to cause an event.
• For example, when buying pens, the total price of pens
(Total) = price for one pen (UnitPrice) * the number of pens bought (Quantity).
Hence, the model would be Total = UnitPrice * Quantity. Here, the total price is dependent on the unit price.
Hence, the total price is referred to as the dependent variable and the unit price is referred to as an
independent variable.
• In general, a model always describes the relationship between independent and dependent variables.
• Inferential statistics deals with quantifying relationships between particular variables.
• The Judd model for describing the relationship between data, model, and the error still holds true:
Data = Model + Error
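A minimal sketch of Data = Model + Error using the pen example: a line is fitted to made-up purchase data and the leftover error (residuals) is inspected. The numbers and the use of numpy.polyfit are illustrative assumptions, not part of the notes.

```python
import numpy as np

quantity = np.array([1, 2, 3, 4, 5, 6])
# Observed totals: roughly 10 per pen, plus a small amount of "error".
total = np.array([10.2, 19.8, 30.5, 39.7, 50.3, 59.6])

# Model: Total = UnitPrice * Quantity (polyfit with degree 1 also estimates an intercept).
unit_price, intercept = np.polyfit(quantity, total, 1)
model = unit_price * quantity + intercept
error = total - model          # Data = Model + Error  =>  Error = Data - Model

print(unit_price)              # close to 10
print(error)                   # small residuals around zero
```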
7. Data Product
• Any computer software that uses data as inputs, produces outputs, and provides feedback based on the
output to control the environment is referred to as a data product.
• A data product is generally based on a model developed during data analysis.
Example: a recommendation model that inputs user purchase history and recommends a related item that
the user is highly likely to buy.
8. Communication
• This stage deals with disseminating the results to end stakeholders to use the result for business
intelligence.
• One of the most notable steps in this stage is data visualization.
• Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar
charts to show the analyzed result.
Applications of Data Science:
Healthcare
Gaming
Image Recognition
Recommendation Systems
Fraud Detection
Speech Recognition
Airline Route Planning
Virtual Reality
SIGNIFICANCE OF EDA
➢ Different fields such as science, economics, engineering, and marketing accumulate and store data primarily
in electronic databases. Appropriate and well-established decisions should be made using the data collected.
➢ It is practically impossible to make sense of datasets containing more than a handful of data points
without the help of computer programs. To extract insights from the collected data and to make further
decisions, data mining is performed, which involves distinct analysis processes.
➢ Exploratory data analysis is the key and first exercise in data mining. It allows us to visualize data to
understand it as well as to create hypotheses (ideas) for further analysis. The exploratory analysis centers
around creating a synopsis of data or insights for the next steps in a data mining project. EDA actually
reveals the ground truth about the content without making any underlying assumptions.
➢ Hence, data scientists use this process to actually understand what type of modeling and hypotheses can
be created. Key components of exploratory data analysis include summarizing data, statistical analysis, and
visualization of data.
➢ Python provides expert tools for exploratory analysis:
• Pandas for summarizing
• SciPy, along with others, for statistical analysis
• Matplotlib and Plotly for visualizations
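A minimal sketch of how these libraries divide the work during exploration (the column names and values are made up for illustration):

```python
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [23, 35, 31, 42, 29, 51, 38, 27],
                   "income": [28, 52, 44, 61, 39, 75, 58, 33]})

# Pandas: summarizing
print(df.describe())

# SciPy: statistical analysis (e.g., skewness of a variable)
print(stats.skew(df["income"]))

# Matplotlib: visualization
df["income"].plot(kind="hist", bins=5)
plt.xlabel("income")
plt.show()
```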
STEPS IN EDA
The four different steps involved in exploratory data analysis are,
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of the Results
1. Problem Definition
• It is essential to define the business problem to be solved before trying to extract useful insights from the data.
• The problem definition works as the driving force for a data analysis plan execution. The main tasks
involved in problem definition are
o defining the main objective of the analysis
o defining the main deliverables
o outlining the main roles and responsibilities
o obtaining the current status of the data
o defining the timetable, and performing cost/benefit analysis
• Based on the problem definition, an execution plan can be created.
2. Data Preparation
• This step involves methods for preparing the dataset before actual analysis. This step involves
o defining the sources of data
o defining data schemas and tables
o understanding the main characteristics of the data
o cleaning the dataset
o deleting non-relevant datasets
o transforming the data
o dividing the data into required chunks for analysis
3. Data Analysis
• This is one of the most crucial steps, as it deals with descriptive statistics and analysis of the data.
• The main tasks involve
o summarizing the data
o finding hidden correlations and relationships among the data (see the sketch after this list)
o developing predictive models
o evaluating the models
o calculating the accuracies
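As a small hedged illustration of the "finding hidden correlations" task (the columns and values are synthetic, not from the notes), pandas can compute a correlation matrix directly:

```python
import pandas as pd

df = pd.DataFrame({"hours_studied": [2, 4, 5, 7, 8, 10],
                   "exam_score":    [48, 55, 60, 72, 78, 90],
                   "shoe_size":     [7, 9, 8, 7, 10, 8]})

# Pairwise Pearson correlations: hours_studied and exam_score are strongly
# correlated, while shoe_size is unrelated to either.
print(df.corr())
```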
MAKING SENSE OF DATA
➢ A dataset contains many observations about a particular object. For example, a dataset about patients in a
hospital describes each patient by variables such as patient ID, name, address, date of birth, email, gender,
and weight; each of these features is a variable.
➢ Each observation can have a specific value for each of these variables.
For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = yoshmimukhiya@gmail.com
Weight = 10
Gender = Female
➢ These datasets are stored in hospitals and are presented for analysis.
➢ Most of this data is stored in some sort of database management system in tables/schema.
Table for storing patient information
➢ The table contains five observations (001, 002, 003, 004, 005).
➢ Each observation describes variables (PatientID, name, address, dob, email, gender, and weight).
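A minimal sketch of such a table in pandas, using only the single example patient shown above (the remaining observations from the notes' table are not reproduced here):

```python
import pandas as pd

patients = pd.DataFrame({
    "PatientID": [1001],
    "Name": ["Yoshmi Mukhiya"],
    "Address": ["Mannsverk 61, 5094, Bergen, Norway"],
    "DateOfBirth": pd.to_datetime(["2018-07-10"]),
    "Email": ["yoshmimukhiya@gmail.com"],
    "Weight": [10],
    "Gender": ["Female"],
})

print(patients)
print(patients.dtypes)   # each column is a variable, each row an observation
```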
Types of datasets
➢ Most datasets broadly fall into two groups—Numerical Data and Categorical Data.
Numerical data
Discrete data
➢ This is data that is countable and its values can be listed.
➢ For example, if we flip a coin, the number of heads in 200 coin flips can only take a finite set of values,
from 0 to 200.
➢ A variable that represents a discrete dataset is referred to as a discrete variable.
➢ The discrete variable takes a fixed number of distinct values.
Example:
o The Country variable can have values such as Nepal, India, Norway, and Japan.
o The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
➢ A variable that can have an infinite number of numerical values within a specific range is classified as
continuous data.
➢ A variable describing continuous data is a continuous variable.
➢ Continuous data can follow an interval measure of scale or ratio measure of scale
Example:
o The temperature of a city
o The weight variable is a continuous variable
Categorical data
➢ This type of data represents the characteristics of an object
➢ Examples: gender, marital status, type of address, or categories of the movies.
➢ This data is often referred to as qualitative datasets in statistics.
Examples of categorical data
o Gender (Male, Female, Other, or Unknown)
o Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never
Married, Domestic Partner, Unmarried, Widowed, or Unknown)
o Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery,
Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
o Blood type (A, B, AB, or O)
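As a hedged sketch (illustrative values), categorical data such as blood type can be stored explicitly with the pandas category dtype, which records the allowed labels without implying any order:

```python
import pandas as pd

blood = pd.Series(["A", "O", "B", "AB", "O", "A"], dtype="category")
print(blood.cat.categories)    # the distinct labels: A, AB, B, O
print(blood.value_counts())    # counts per category (no ordering implied)
```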
Measurement scales
➢ There are four different types of measurement scales in statistics:
Nominal,
Ordinal,
Interval and
Ratio.
Nominal Scale:
• A nominal scale associates numbers with variables for naming or labeling the object. It is the
most basic of the four levels of measurement. This is a method of measuring the objects or events
into a discrete category. It assigns a number to an object only for the identification of the object. So
it is a categorical data or qualitative data.
• The numbers are only used for labeling variables and without any quantitative value. The scales are
referred to as labels.
Ordinal Scale:
It establishes the rank between the variables of a scale but not the difference value between the
variables. The ordinal scale is the next level of data measurement scale. Here the “Ordinal” is the
indication of “Order”.
Ordinal measurement assigns a numerical value to the variables based on their relative ranking.
In a nominal scale, there is no predefined order for arranging the data.
➢ The main difference in the ordinal and nominal scale is the order.
Example of ordinal scale using the Likert scale:
➢ The answer to the question is scaled down to five different ordinal values, Strongly Agree, Agree,
Neutral, Disagree, and Strongly Disagree.
➢ These Scales are referred to as the Likert scale.
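A minimal sketch of an ordinal (Likert) variable in pandas, where the order of the five responses is made explicit (the survey answers themselves are illustrative assumptions):

```python
import pandas as pd

levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
answers = pd.Series(["Agree", "Neutral", "Strongly Agree", "Disagree", "Agree"])

# Ordered categorical: the ranking between values is known, but not the distance.
likert = answers.astype(pd.CategoricalDtype(categories=levels, ordered=True))
print(likert.min(), likert.max())      # order-aware comparisons now work
print((likert >= "Neutral").sum())     # how many responses are Neutral or better
```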
More examples of the Likert scale: (survey-question figures not reproduced)
Interval Scale:
In an interval scale, both the order of the values and the exact differences between the values are
significant, but there is no true zero point. Because the differences are meaningful, the measures of
central tendency—mean, median, mode—and standard deviations can be calculated on interval data.
Ratio Scale:
Ratio scale is the most advanced measurement scale, which has variables that are labeled in order
and have a calculated difference between variables. This scale has a fixed starting point, i.e., the actual
zero value is present. Ratio scale is purely quantitative. Among the four levels of measurement, ratio scale
is the most precise.
Examples of ratio scales are age, weight, height, income, distance, etc.
Examples: the measure of energy, mass, length, duration, electrical energy and volume.
SOFTWARE TOOLS FOR EDA
NumPy
➢ NumPy is Numerical Python, which is a Python library. NumPy is used for working with arrays.
➢ It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
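A small hedged example of NumPy arrays (the values are chosen only for illustration):

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)         # (2, 3): two observations, three values each
print(a.mean(axis=0))  # column means: [2.5 3.5 4.5]
print(a.T @ a)         # matrix product, one of the linear-algebra operations
```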
Pandas
➢ Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning,
exploring, and manipulating data.
SciPy
➢ SciPy stands for Scientific Python which is a scientific computation library that uses NumPy.
➢ It provides more utility functions for optimization, stats and signal processing. Like NumPy, SciPy is
open source so we can use it freely. SciPy has optimized and added functions that are frequently used in
NumPy and Data Science.
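A hedged sketch of the kind of utility SciPy adds on top of NumPy, here a one-dimensional optimization (the function is an arbitrary example, not from the notes):

```python
from scipy import optimize

# Minimize f(x) = (x - 3)^2 + 1; the minimum is at x = 3 with value 1.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)
print(result.x, result.fun)   # approximately 3.0 and 1.0
```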
Matplotlib
➢ Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
➢ It provides a huge library of customizable plots used to create professional reporting applications,
interactive analytical applications, complex dashboard applications, web/GUI applications, etc.
VISUAL AIDS FOR EDA
Univariate analysis (Used for univariate data - the data containing one variable)
Univariate plots show the frequency or the distribution shape of a variable. Below are the visual tools used
to analyze univariate data. Histograms are two-dimensional plots in which the x-axis is divided into a range
of numerical bins or time intervals. The y-axis shows the frequency values, which are counts of
occurrences of values for each bin.
Example:
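Since the original example figure is not reproduced here, a minimal Matplotlib sketch of a histogram (using randomly generated data as an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

heights = np.random.normal(loc=170, scale=8, size=500)   # synthetic data

plt.hist(heights, bins=20)   # x-axis: bins of values, y-axis: counts per bin
plt.xlabel("height (cm)")
plt.ylabel("frequency")
plt.title("Histogram of a single (univariate) variable")
plt.show()
```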
Bivariate Plots (Used for bivariate data - the data containing two variables)
Example:
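As a hedged sketch of a bivariate plot (synthetic data, illustrative variable names), one variable can be plotted against another as a scatter plot:

```python
import numpy as np
import matplotlib.pyplot as plt

temperature = np.random.uniform(15, 35, 100)              # synthetic
sales = 20 * temperature + np.random.normal(0, 40, 100)   # loosely related to it

plt.scatter(temperature, sales)
plt.xlabel("temperature (°C)")
plt.ylabel("ice-cream sales")
plt.title("Bivariate plot: two variables at a time")
plt.show()
```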
Multivariate Plots (Used for Multivariate data - the data containing more than two variables)
Multivariate data is high-dimensionality data and has applications in areas such as deep learning, for
example, visualizing natural-language text or images.
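A minimal sketch of one common multivariate visual aid, a scatter-plot matrix over several numeric columns (the DataFrame here is synthetic):

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(200, 4), columns=["a", "b", "c", "d"])

# One panel per pair of variables, histograms on the diagonal.
scatter_matrix(df, diagonal="hist", figsize=(8, 8))
plt.show()
```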
Scatter Plot
A scatter plot shows the values of two variables as points in a plane (a third variable can be encoded by
color or size) and is used to reveal relationships or correlations between them.
DATA TRANSFORMATION TECHNIQUES
Data transformation techniques are techniques used to transform the raw data into a clean and ready-to-use
dataset.
There are different types of data transformation:
1. Data smoothing
Smoothing is a technique where an algorithm is applied in order to remove noise from the dataset when
trying to identify a trend. Noise can have a bad effect on the data, and by eliminating or reducing it one can
extract better insights or identify patterns that would not be seen otherwise.
There are three algorithm types that help with data smoothing:
o Clustering: Where one can group similar values together to form a cluster while labeling any value out
of the cluster as an outlier.
o Binning: Using an algorithm for binning will help to split the data into bins and smooth the data values
within each bin (see the sketch after this list).
o Regression: Regression algorithms are used to identify the relation between two dependent attributes and
help to predict an attribute based on the value of the other.
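A hedged sketch of smoothing by bin means, as mentioned in the binning item above (the column name, values, and bin count are assumptions): each value is replaced by the mean of the bin it falls into.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], name="price")

# Split the values into 3 equal-width bins, then smooth each value by
# replacing it with the mean of its bin.
bins = pd.cut(prices, bins=3)
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))
```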
2. Attribute construction
3. Data Generalization
4. Data Aggregation
5. Data Discretization
6. Data Normalization
7. Integration
8. Manipulation