
The Yenepoya Institute of Arts, Science, Commerce and Management (YIASCM)

Course: IV Semester BCA (All specializations) & BSc

Statistics for Machine Learning
Unit-2: Exploratory Data Analysis and Visualization
Objective
At the end of this session, the learner will be able to understand:
• Data pre-processing
• Steps for data pre-processing
• Data Cleaning
• How to clean data for Machine Learning?
What is Data Pre-processing?
• Data pre-processing includes the steps we need to follow to transform or encode data so that it can be easily parsed by the machine.
• For a model to be accurate and precise in its predictions, the algorithm must be able to easily interpret the data's features.
Data Pre-processing
• Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and most crucial step when creating a machine learning model.
• When working on a machine learning project, we do not always come across clean, well-formatted data.
• Before performing any operation on data, it must be cleaned and put into a consistent format.
• For this, we use data pre-processing.
Why is Data Pre-processing important?
• The majority of real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy values due to their heterogeneous origins. Noisy data means meaningless data.
• Applying data mining algorithms to such noisy data would not give quality results, as they would fail to identify patterns effectively.
• Data pre-processing is therefore important for improving the overall data quality.
  - Duplicate or missing values may give an incorrect view of the overall statistics of the data.
  - Outliers and inconsistent data points often disturb the model's overall learning, leading to false predictions.
  - Quality decisions must be based on quality data. Data pre-processing is important to obtain this quality data, without which it would just be a Garbage In, Garbage Out scenario.
Why do we need Data Pre-processing?
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data pre-processing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
Steps for data pre-processing
Data Cleaning: This is done as part of data pre-processing to clean the data by filling missing values, smoothing the noisy data, resolving inconsistencies, and removing outliers.

1. Missing values
Here are a few ways to solve this issue:
• Ignore those tuples
  This method should be considered when the dataset is huge and numerous missing values are present within a tuple.
• Fill in the missing values
  There are many methods to achieve this, such as filling in the values manually, predicting the missing values using a regression method, or using numerical methods like the attribute mean.
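A minimal pandas sketch of both approaches on a small made-up table (the column names are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "salary": [30000, 42000, np.nan, 52000, 39000],
})

# Option 1: ignore (drop) tuples that contain missing values
dropped = df.dropna()

# Option 2: fill missing values with the attribute (column) mean
filled = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(filled)
```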
2. Noisy Data
This involves removing random error or variance in a measured variable. It can be done with the help of the following techniques:
• Binning
  This technique works on sorted data values to smooth out any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median or boundary values (a small binning sketch follows this list).
• Regression
  This data mining technique is generally used for prediction. It helps to smooth noise by fitting the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.
• Clustering
  Creation of groups/clusters from data having similar values. The values that do not lie in any cluster can be treated as noisy data and removed.
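A minimal pandas sketch of smoothing by bin means (the values and the choice of three equal-frequency bins are illustrative):

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Divide the sorted values into 3 equal-frequency bins
bins = pd.qcut(values, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed_by_mean = values.groupby(bins).transform("mean")

print(pd.DataFrame({"original": values, "bin": bins, "smoothed": smoothed_by_mean}))
```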
3. Removing outliers
Outliers are data points that differ significantly from the other observations present in a given dataset. They can occur because of variability in measurement or mistakes made while recording data points.
• Most common causes of outliers in a data set:

Data Entry Errors: Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
Measurement Error (instrument errors): It is the most common source of
outliers. This is caused when the measurement instrument used turns out to be
faulty.
Experimental errors (data extraction or experiment planning errors)
Intentional (dummy outliers made to test detection methods)
Data processing errors (data manipulation or data set unintended mutations)
Sampling errors (extracting or mixing data from wrong or various sources)
How to detect Outliers?
1. Z-score method
2. Robust Z-score
3. IQR method
4. Winsorization method (Percentile Capping)
5. DBSCAN Clustering
6. Isolation Forest
7. Linear Regression Models (PCA, LMS)
8. Standard Deviation
9. Percentile
10. Visualizing the data
Z-score method
• This method assumes that the variable has a Gaussian distribution. The z-score represents the number of standard deviations an observation is away from the mean; observations with a very large absolute z-score (commonly |z| > 3) are flagged as outliers.
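A minimal NumPy sketch of z-score detection (the data and the threshold of 3 standard deviations are illustrative conventions):

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

z_scores = (data - data.mean()) / data.std()

# Flag observations more than 3 standard deviations from the mean
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # the extreme value 102 is flagged
```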
• IQR Method
• In this method, we detect outliers using the Inter Quartile Range (IQR), where IQR = Q3 - Q1 tells us the spread of the middle 50% of the data set.
• Any value that falls outside the range (Q1 - 1.5 x IQR) to (Q3 + 1.5 x IQR) is treated as an outlier.
• Q1 represents the 1st quartile of the data.
• Q2 represents the 2nd quartile (median) of the data.
• Q3 represents the 3rd quartile of the data.
• (Q1 - 1.5 x IQR) acts as the lower fence (smallest non-outlying value) and (Q3 + 1.5 x IQR) as the upper fence (largest non-outlying value) of the data set.
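A minimal pandas sketch of the IQR fences (the 1.5 multiplier follows the usual convention; the data is made up):

```python
import pandas as pd

data = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)  # values outside the fences, e.g. 102 here
```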
• Visualizing the data
• Data visualization is useful for data cleaning, exploring data, detecting outliers and unusual groups, identifying trends and clusters, etc. Below is a list of data visualization plots for spotting outliers (a small sketch follows this list):
• Box and whisker plot (box plot)
• Scatter plot
• Histogram
• Distribution Plot
• QQ plot
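A minimal Matplotlib/SciPy sketch drawing a histogram and a QQ plot for one synthetic variable with two injected outliers:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 200), [95, 110]])  # two injected outliers

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(data, bins=30)
ax1.set_title("Histogram")

stats.probplot(data, dist="norm", plot=ax2)
ax2.set_title("QQ plot")

plt.tight_layout()
plt.show()
```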
Methods to handle outliers:
1. Deleting observations
2. Transforming values
3. Imputation
4. Treating them separately
Data Exploration:
• Data exploration refers to the initial step in data analysis.
• Data analysts use data visualization and statistical techniques to describe dataset characteristics, such as size, quantity, and accuracy, in order to understand the nature of the data better.
 Data exploration techniques include both manual analysis and automated data
exploration software solutions that visually explore and identify relationships
between different data variables, the structure of the dataset, the presence of
outliers, and the distribution of data values to reveal patterns and points of
interest, enabling data analysts to gain greater insight into the raw data.
 Data is often gathered in large, unstructured volumes from various sources.
 Data analysts must first understand and develop a comprehensive view of the
data before extracting relevant data for further analysis, such as univariate,
bivariate, multivariate, and principal components analysis.
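As a first pass, this kind of overview can be obtained with a few pandas calls (the file name and dataset are hypothetical):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

print(df.shape)       # size: rows and columns
df.info()             # column types and missing-value counts
print(df.describe())  # summary statistics for numeric columns
print(df.head())      # a quick look at the first few records
```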
Why is Data Exploration Important?
• Humans process visual data better than numerical data.
• Therefore, it is extremely challenging for data scientists and data analysts to assign meaning to thousands of rows and columns of data points and communicate that meaning without any visual components.
• Data visualization in data exploration leverages familiar visual cues such as shapes, dimensions, colours, lines, points, and angles so that data analysts can effectively visualize and define the metadata and then perform data cleansing.
• Performing the initial step of data exploration enables data analysts to understand the data better and visually identify anomalies and relationships that might otherwise go undetected.
What can Data Exploration Do?
• The goals of data exploration fall into these three categories.
1. Archival: Data Exploration can convert data from physical formats (such as
books, newspapers, and invoices) into digital formats (such as databases)
for backup.
2. Transfer the data format: If you want to transfer the data from your current
website into a new website under development, you can collect data from
your own website by extracting it.
3. Data analysis: As the most common goal, the extracted data can be further
analysed to generate insights. This may sound similar to the data analysis
process in data mining, but note that data analysis is the goal of data
Exploration, not part of its process. What's more, the data is analysed
differently. One example is that e-store owners extract product details from
eCommerce websites like Amazon to monitor competitors' strategies.
Data Visualization
• Data visualization is a crucial aspect of machine learning that enables analysts to understand and make sense of data patterns, relationships, and trends.
• Through data visualization, insights and patterns in data can be easily interpreted and communicated to a wider audience, making it a critical component of machine learning.
Significance of Data Visualization
• Data visualization helps machine learning analysts to better understand and analyze complex data sets by presenting them in an easily understandable format.
• Data visualization is an essential step in data preparation and analysis as it helps to identify outliers, trends, and patterns in the data that may be missed by other forms of analysis.
• With the increasing availability of big data, it has become more important than ever to use data visualization techniques to explore and understand the data.
• Machine learning algorithms work best when they have high-quality and clean data, and data visualization can help to identify and remove any inconsistencies or anomalies in the data.
Types of Data Visualization Approaches

Machine learning may make use of a wide variety of data visualization approaches, including:
• Line Charts
• Scatter Plots
• Bar Charts
• Heat Maps
• Tree Maps
• Box Plots
Line Charts:
• In a line chart, each data point is represented by a point on the graph, and these points are connected by a line.
• We may find patterns and trends in the data across time by using line charts.
• Time-series data is frequently displayed using line charts.
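A minimal Matplotlib sketch of a time-series line chart (the monthly figures are made up):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 162, 158]  # hypothetical monthly sales

plt.plot(months, sales, marker="o")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly sales trend")
plt.show()
```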


Scatter Plots:
• A quick and efficient method of displaying the relationship between two variables is to use scatter plots.
• With one variable plotted on the x-axis and the other on the y-axis, each data point in a scatter plot is represented by a point on the graph.
• We may use scatter plots to visualize data and find patterns, clusters, and outliers.
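A minimal Matplotlib sketch of a scatter plot between two loosely related synthetic variables:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
height = rng.normal(170, 10, 100)                   # hypothetical heights (cm)
weight = 0.9 * height - 80 + rng.normal(0, 5, 100)  # loosely related weights (kg)

plt.scatter(height, weight, alpha=0.7)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs. weight")
plt.show()
```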
Heat Maps:
• Heat maps are a type of graphical representation that displays data in a matrix format.
• The value of the data point that each matrix cell represents determines its hue.
• Heat maps are often used to visualize the correlation between variables or to identify patterns in time-series data.
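A minimal sketch of a correlation heat map using pandas and Matplotlib's imshow on synthetic data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
})
df["c"] = df["a"] * 0.8 + rng.normal(scale=0.2, size=100)  # correlated with "a"

corr = df.corr()  # pairwise correlation matrix

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation heat map")
plt.show()
```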
Tree Maps:
• Tree maps are used to display hierarchical data in a compact format and are useful in showing the relationship between different levels of a hierarchy.
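A minimal sketch of a tree map, assuming the third-party squarify package is installed (categories and sizes are made up):

```python
import matplotlib.pyplot as plt
import squarify  # third-party package: pip install squarify

sizes = [500, 300, 120, 80]  # hypothetical category totals
labels = ["Electronics", "Clothing", "Books", "Other"]

squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.title("Sales by category (tree map)")
plt.show()
```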
Box Plots:
• Box plots are a graphical representation of the distribution of a set of data.
• In a box plot, the median is shown by a line inside the box, while the box itself spans the interquartile range (the middle 50% of the data).
• The whiskers extend from the box to the highest and lowest values in the data, excluding outliers.
• Box plots can help us to identify the spread and skewness of the data.
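A minimal Matplotlib sketch of a box plot with two injected outliers:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(70, 8, 150), [5, 2]])  # two injected outliers

plt.boxplot(scores, vert=True)
plt.ylabel("Exam score")
plt.title("Box plot of exam scores")
plt.show()  # outliers appear as individual points beyond the whiskers
```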


Uses of Data Visualization:
• Data visualization has several uses in machine learning:
 Identify trends and patterns in data: It may be challenging to spot trends and
patterns in data using conventional approaches, but data visualization tools may
be utilized to do so.
 Communicate insights to stakeholders: Data visualization can be used to
communicate insights to stakeholders in a format that is easily understandable
and can help to support decision-making processes.
 Monitor machine learning models: Data visualization can be used to monitor
machine learning models in real time and to identify any issues or anomalies in
the data.
 Improve data quality: Data visualization can be used to identify outliers and
inconsistencies in the data and to improve data quality by removing them.
Feature selection:
• Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.
• It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve.
Why Feature Selection?

• Machine learning models follow a simple rule: whatever goes in, comes out. If we put garbage into our model, we can expect the output to be garbage too. In this case, garbage refers to noise in our data.
• To train a model, we collect enormous quantities of data to help the machine learn better. Usually, a good portion of the data collected is noise, while some of the columns of our dataset might not contribute significantly to the performance of our model.
• Further, having a lot of data can slow down the training process and make the resulting model slower.
• The model may also learn from this irrelevant data and become inaccurate.
• Hence, apart from choosing the right model for our data, we need to choose the right data to put into our model.
Consider, for example, a dataset of used cars: the model of the car, the year of manufacture, and the miles it has travelled are quite important for deciding whether the car is old enough to be crushed. However, the name of the previous owner does not determine whether the car should be crushed; worse, it can confuse the algorithm into finding patterns between names and the other features. Hence we can drop that column.
Feature Selection Models

• Feature selection models are of two types:
• Supervised Models: Supervised feature selection refers to methods that use the output label class for feature selection. They use the target variable to identify the variables that can increase the efficiency of the model.
• Unsupervised Models: Unsupervised feature selection refers to methods that do not need the output label class for feature selection. We use them for unlabeled data.
Types of Supervised models:
We can further divide the supervised models into three:
1. Filter Method: In this method, features are dropped based on their relation to the output, i.e. how strongly they correlate with the output. We use correlation to check whether features are positively or negatively correlated with the output labels and drop features accordingly. E.g.: Information Gain, Chi-Square Test, Fisher's Score, etc.
2. Wrapper Method: We split our data into subsets and train a model on them. Based on the output of the model, we add and remove features and train the model again. It forms the subsets using a greedy approach and evaluates the accuracy of all the possible combinations of features. E.g.: Forward Selection, Backwards Elimination, etc.
3. Intrinsic Method: This method combines the qualities of both the Filter and Wrapper methods to create the best subset. It takes care of the machine-training iterative process while keeping the computation cost to a minimum. E.g.: Lasso and Ridge Regression.
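A minimal scikit-learn sketch contrasting a filter method (chi-square scores) with a wrapper method (recursive feature elimination); the dataset and the choice of keeping two features are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest chi-square score
filter_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Filter keeps features:", filter_selector.get_support(indices=True))

# Wrapper method: recursively eliminate features based on a model
wrapper_selector = RFE(LogisticRegression(max_iter=1000),
                       n_features_to_select=2).fit(X, y)
print("Wrapper keeps features:", wrapper_selector.get_support(indices=True))
```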
How to Choose a Feature Selection Model?

• How do we know which feature selection model will work for our problem? The process is relatively simple, with the choice depending on the types of input and output variables.
• Variables are of two main types:
  - Numerical Variables: which include integers and floating-point numbers.
  - Categorical Variables: which include labels, strings, Boolean variables, etc.
• Based on whether we have numerical or categorical variables as inputs and outputs, we can choose our feature selection method as follows:
Input Variable → Output Variable: Feature Selection Model
• Numerical → Numerical: Pearson's correlation coefficient (linear); Spearman's rank coefficient (nonlinear)
• Numerical → Categorical: ANOVA correlation coefficient (linear); Kendall's rank coefficient (nonlinear)
• Categorical → Numerical: Kendall's rank coefficient (linear); ANOVA correlation coefficient (nonlinear)
• Categorical → Categorical: Chi-Squared test (contingency tables); Mutual Information
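Two of the tests above, sketched with SciPy on made-up data (a numerical-to-numerical pair and a categorical-to-categorical contingency table):

```python
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

# Numerical input vs. numerical output: Pearson's correlation
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])
r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")

# Categorical input vs. categorical output: chi-squared test on a contingency table
contingency = np.array([[30, 10],   # feature level A: counts per class
                        [15, 45]])  # feature level B: counts per class
chi2_stat, p_value, dof, _ = chi2_contingency(contingency)
print(f"Chi-squared = {chi2_stat:.2f}, p = {p_value:.4f}")
```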
What are the advantages and limitations of z-score method, IQR method and
visualizing the data method of detecting outliers?

1. Z-score method:
Advantages:
• It provides a standardized score indicating how many standard deviations an
observation is from the mean.
• It's easy to understand and implement.
• It works well for normally distributed data.
Limitations:
• It assumes that the data is normally distributed, which may not always be the
case.
• It can be sensitive to extreme values, especially in small datasets.
• It may not be effective for skewed distributions.
2. Interquartile Range (IQR) method:
Advantages:
• It's robust to outliers and resistant to skewed distributions.
• It's simple to calculate and understand.
• It provides a measure of the spread of the middle 50% of the data.
Limitations:
• It relies on quartiles, which may not be representative if the dataset is small.
• It may not be as informative about the location of outliers as the z-score
method.
• It's less effective for normally distributed data compared to the z-score
method.

3. Visualizing the data method:


Advantages:
• It allows for a quick and intuitive understanding of the distribution of data.
• It can reveal patterns and anomalies that may not be evident from summary
statistics alone.
• It's particularly useful for identifying outliers in multidimensional datasets.
Limitations:
• It can be subjective and dependent on the choice of visualization technique.
• It may not be suitable for large datasets with many variables, as it can be
time-consuming.
• It requires some expertise to interpret the visualizations accurately.
What are the potential benefits of reducing the number of features in a dataset,
and how can feature selection improve model performance and interpretability?

Reducing the number of features in a dataset, also known as feature selection, can offer
several potential benefits:
1.Improved model performance:
• By removing irrelevant or redundant features, feature selection can reduce
overfitting, where the model learns noise in the data rather than the underlying
patterns. This can lead to better generalization performance on unseen data.
• It can also reduce computational complexity, making the training process faster
and more efficient, especially for algorithms that are sensitive to the curse of
dimensionality.
2.Enhanced interpretability:
• With fewer features, it becomes easier to understand and interpret the model.
Simplifying the model can help identify the most important factors driving the
predictions, making it easier to communicate the results to stakeholders.
• Feature selection can highlight the most relevant variables, allowing domain
experts to gain insights into the underlying processes and relationships in the
data.
3. Reduced overfitting:
• Feature selection helps to mitigate the risk of overfitting by reducing the model's
reliance on noisy or irrelevant features. This allows the model to capture the
underlying patterns in the data more effectively, leading to better generalization
performance on new data.
• Removing irrelevant features can also improve the model's robustness to changes
in the dataset, such as missing values or outliers.

4. Faster training and inference:


• With fewer features, models require less computational resources for training and
inference. This can lead to faster model development and deployment, which is
particularly important in real-time or resource-constrained applications.

5. Improved data visualization:


• Feature selection can result in a reduced-dimensional space, making it easier to
visualize and explore the data. This can help identify relationships and patterns
that may not be apparent in high-dimensional spaces.
How can outliers arise in a dataset?

Outliers can arise in a dataset due to various reasons, and they can have
different underlying causes depending on the nature of the data and the
context of the problem

 Measurement errors: Outliers can occur due to errors in data collection,


recording, or measurement. These errors could be human errors, instrument
malfunctions, or data entry mistakes. For example, a sensor malfunctioning
could record an extreme value, leading to an outlier.

 Natural variation: In many real-world phenomena, there can be inherent


variability or randomness. Occasionally, extreme or unusual events can
occur, resulting in outliers. For instance, in a dataset of daily temperatures,
an unusually hot or cold day could be considered an outlier.

 Data entry errors: Outliers may arise from mistakes during data entry or
data preprocessing. Human error or typos when entering data into a
database or spreadsheet can result in values that are far from the typical
range.
 Sampling errors: Outliers can occur due to issues with the sampling process.
If the sample size is too small or if the sampling method is biased, it may not
accurately represent the underlying population, leading to outliers.

 Genuine extreme values: Sometimes, outliers represent real and


meaningful observations that are genuinely extreme. These could be rare
events, anomalies, or exceptional cases that are legitimately part of the data
distribution.

 Data transformation: Outliers can also arise during data transformation or


aggregation processes. For example, when summarizing data from multiple
sources, inconsistencies or extreme values in one source can lead to outliers in
the aggregated dataset.

 Intentional data points: In some cases, outliers may be deliberately


introduced into the dataset. This could be done for various reasons, such as
testing the robustness of a model or representing special cases in the analysis.
