
The Yenepoya Institute of Arts, Science, Commerce and Management (YIASCM)

Course: IV Semester BCA (All specializations) & BSc

Statistics for Machine Learning
Unit-2: Exploratory Data Analysis and Visualization
Objective
At the end of this session, the learner will be able to understand:
• Data pre-processing
• Steps for data pre-processing
• Data Cleaning
• How to clean data for Machine Learning?
What is Data Pre-processing?
• Data pre-processing includes the steps we need to follow to transform or encode data so that it can be easily parsed by the machine.
• For a model to be accurate and precise in its predictions, the algorithm must be able to easily interpret the data's features.
Data Pre-processing
• Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and most crucial step when creating a machine learning model.
• When working on a machine learning project, we do not always come across clean, well-formatted data.
• Before performing any operation on data, it must be cleaned and put into a consistent format.
• For this, we use data pre-processing.
Why is Data Pre-processing important?
• The majority of real-world datasets for machine learning are highly susceptible to missing, inconsistent, and noisy values due to their heterogeneous origins. Noisy data means meaningless data.
• Applying data mining algorithms to such noisy data would not give quality results, as they would fail to identify patterns effectively.
• Data pre-processing is therefore important for improving the overall data quality.
  - Duplicate or missing values may give an incorrect view of the overall statistics of the data.
  - Outliers and inconsistent data points often disturb the model's overall learning, leading to false predictions.
  - Quality decisions must be based on quality data. Data pre-processing is important to obtain this quality data, without which it would just be a Garbage In, Garbage Out scenario.
Why do we need Data Pre-processing?
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data pre-processing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
Steps for data pre-processing
Data Cleaning: This is done as part of data pre-processing to clean the data by filling missing values, smoothing the noisy data, resolving inconsistencies, and removing outliers.

1. Missing values
Here are a few ways to solve this issue:
• Ignore those tuples
  This method should be considered when the dataset is huge and numerous missing values are present within a tuple.
• Fill in the missing values
  There are many methods to achieve this, such as filling in the values manually, predicting the missing values using a regression method, or using numerical methods like the attribute mean.
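A minimal pandas sketch of both approaches on a small made-up table (the column names are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "salary": [30000, 42000, np.nan, 52000, 39000],
})

# Option 1: ignore (drop) tuples that contain missing values
dropped = df.dropna()

# Option 2: fill missing values with the attribute (column) mean
filled = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(filled)
```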
2. Noisy Data
This involves removing random error or variance in a measured variable. It can be done with the help of the following techniques:
• Binning
  This technique works on sorted data values to smooth out any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median or boundary values (a small binning sketch follows this list).
• Regression
  This data mining technique is generally used for prediction. It helps to smooth noise by fitting the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.
• Clustering
  Creation of groups/clusters from data having similar values. The values that do not lie in any cluster can be treated as noisy data and removed.
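A minimal pandas sketch of smoothing by bin means (the values and the choice of three equal-frequency bins are illustrative):

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Divide the sorted values into 3 equal-frequency bins
bins = pd.qcut(values, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed_by_mean = values.groupby(bins).transform("mean")

print(pd.DataFrame({"original": values, "bin": bins, "smoothed": smoothed_by_mean}))
```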
3. Removing outliers
Outliers are data points that differ significantly from the other observations present in a given dataset. They can occur because of variability in measurement or mistakes made while recording data points.
• Most common causes of outliers in a data set:

Data Entry Errors: Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
Measurement Error (instrument errors): It is the most common source of
outliers. This is caused when the measurement instrument used turns out to be
faulty.
Experimental errors (data extraction or experiment planning errors)
Intentional (dummy outliers made to test detection methods)
Data processing errors (data manipulation or data set unintended mutations)
Sampling errors (extracting or mixing data from wrong or various sources)
How to detect Outliers?
1. Z-score method
2. Robust Z-score
3. IQR method
4. Winsorization method (Percentile Capping)
5. DBSCAN Clustering
6. Isolation Forest
7. Linear Regression Models (PCA, LMS)
8. Standard Deviation
9. Percentile
10. Visualizing the data
Z-score method
• This method assumes that the variable has a Gaussian distribution. The z-score represents the number of standard deviations an observation is away from the mean; observations with a very large absolute z-score (commonly |z| > 3) are flagged as outliers.
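A minimal NumPy sketch of z-score detection (the data and the threshold of 3 standard deviations are illustrative conventions):

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

z_scores = (data - data.mean()) / data.std()

# Flag observations more than 3 standard deviations from the mean
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # the extreme value 102 is flagged
```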
• IQR Method
• In this method, we detect outliers using the Inter Quartile Range (IQR), where IQR = Q3 - Q1 tells us the spread of the middle 50% of the data set.
• Any value that falls outside the range (Q1 - 1.5 x IQR) to (Q3 + 1.5 x IQR) is treated as an outlier.
• Q1 represents the 1st quartile of the data.
• Q2 represents the 2nd quartile (median) of the data.
• Q3 represents the 3rd quartile of the data.
• (Q1 - 1.5 x IQR) acts as the lower fence (smallest non-outlying value) and (Q3 + 1.5 x IQR) as the upper fence (largest non-outlying value) of the data set.
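A minimal pandas sketch of the IQR fences (the 1.5 multiplier follows the usual convention; the data is made up):

```python
import pandas as pd

data = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)  # values outside the fences, e.g. 102 here
```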
• Visualizing the data
• Data visualization is useful for data cleaning, exploring data, detecting outliers and unusual groups, identifying trends and clusters, etc. Below is a list of data visualization plots for spotting outliers (a small sketch follows this list):
• Box and whisker plot (box plot)
• Scatter plot
• Histogram
• Distribution Plot
• QQ plot
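A minimal Matplotlib/SciPy sketch drawing a histogram and a QQ plot for one synthetic variable with two injected outliers:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 200), [95, 110]])  # two injected outliers

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(data, bins=30)
ax1.set_title("Histogram")

stats.probplot(data, dist="norm", plot=ax2)
ax2.set_title("QQ plot")

plt.tight_layout()
plt.show()
```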
Methods to handle outliers:
1. Deleting observations
2. Transforming values
3. Imputation
4. Treating them separately
Data Exploration:
• Data exploration refers to the initial step in data analysis.
• Data analysts use data visualization and statistical techniques to describe dataset characteristics, such as size, quantity, and accuracy, in order to understand the nature of the data better.
 Data exploration techniques include both manual analysis and automated data
exploration software solutions that visually explore and identify relationships
between different data variables, the structure of the dataset, the presence of
outliers, and the distribution of data values to reveal patterns and points of
interest, enabling data analysts to gain greater insight into the raw data.
 Data is often gathered in large, unstructured volumes from various sources.
 Data analysts must first understand and develop a comprehensive view of the
data before extracting relevant data for further analysis, such as univariate,
bivariate, multivariate, and principal components analysis.
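As a first pass, this kind of overview can be obtained with a few pandas calls (the file name and dataset are hypothetical):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

print(df.shape)       # size: rows and columns
df.info()             # column types and missing-value counts
print(df.describe())  # summary statistics for numeric columns
print(df.head())      # a quick look at the first few records
```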
Why is Data Exploration Important?
• Humans process visual data better than numerical data.
• Therefore, it is extremely challenging for data scientists and data analysts to assign meaning to thousands of rows and columns of data points and communicate that meaning without any visual components.
• Data visualization in data exploration leverages familiar visual cues such as shapes, dimensions, colours, lines, points, and angles so that data analysts can effectively visualize and define the metadata and then perform data cleansing.
• Performing the initial step of data exploration enables data analysts to understand the data better and visually identify anomalies and relationships that might otherwise go undetected.
What can Data Exploration Do?
• The goals of data exploration fall into these three categories.
1. Archival: Data Exploration can convert data from physical formats (such as
books, newspapers, and invoices) into digital formats (such as databases)
for backup.
2. Transfer the data format: If you want to transfer the data from your current
website into a new website under development, you can collect data from
your own website by extracting it.
3. Data analysis: As the most common goal, the extracted data can be further
analysed to generate insights. This may sound similar to the data analysis
process in data mining, but note that data analysis is the goal of data
Exploration, not part of its process. What's more, the data is analysed
differently. One example is that e-store owners extract product details from
eCommerce websites like Amazon to monitor competitors' strategies.
Data Visualization
• Data visualization is a crucial aspect of machine learning that enables analysts to understand and make sense of data patterns, relationships, and trends.
• Through data visualization, insights and patterns in data can be easily interpreted and communicated to a wider audience, making it a critical component of machine learning.
Significance of Data Visualization
• Data visualization helps machine learning analysts to better understand and analyze complex data sets by presenting them in an easily understandable format.
• Data visualization is an essential step in data preparation and analysis as it helps to identify outliers, trends, and patterns in the data that may be missed by other forms of analysis.
• With the increasing availability of big data, it has become more important than ever to use data visualization techniques to explore and understand the data.
• Machine learning algorithms work best when they have high-quality and clean data, and data visualization can help to identify and remove any inconsistencies or anomalies in the data.
Types of Data Visualization Approaches

Machine learning may make use of a wide variety of data visualization approaches, including:
• Line Charts
• Scatter Plots
• Bar Charts
• Heat Maps
• Tree Maps
• Box Plots
Line Charts:
• In a line chart, each data point is represented by a point on the graph, and these points are connected by a line.
• We may find patterns and trends in the data across time by using line charts.
• Time-series data is frequently displayed using line charts.
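A minimal Matplotlib sketch of a time-series line chart (the monthly figures are made up):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 162, 158]  # hypothetical monthly sales

plt.plot(months, sales, marker="o")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly sales trend")
plt.show()
```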


Scatter Plots:
• A quick and efficient method of displaying the relationship between two variables is to use scatter plots.
• With one variable plotted on the x-axis and the other on the y-axis, each data point in a scatter plot is represented by a point on the graph.
• We may use scatter plots to visualize data and find patterns, clusters, and outliers.
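A minimal Matplotlib sketch of a scatter plot between two loosely related synthetic variables:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
height = rng.normal(170, 10, 100)                   # hypothetical heights (cm)
weight = 0.9 * height - 80 + rng.normal(0, 5, 100)  # loosely related weights (kg)

plt.scatter(height, weight, alpha=0.7)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs. weight")
plt.show()
```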
Heat Maps:
• Heat maps are a type of graphical representation that displays data in a matrix format.
• The value of the data point that each matrix cell represents determines its hue.
• Heat maps are often used to visualize the correlation between variables or to identify patterns in time-series data.
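A minimal sketch of a correlation heat map using pandas and Matplotlib's imshow on synthetic data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
})
df["c"] = df["a"] * 0.8 + rng.normal(scale=0.2, size=100)  # correlated with "a"

corr = df.corr()  # pairwise correlation matrix

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation heat map")
plt.show()
```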
Tree Maps:
• Tree maps are used to display hierarchical data in a compact format and are useful in showing the relationship between different levels of a hierarchy.
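A minimal sketch of a tree map, assuming the third-party squarify package is installed (categories and sizes are made up):

```python
import matplotlib.pyplot as plt
import squarify  # third-party package: pip install squarify

sizes = [500, 300, 120, 80]  # hypothetical category totals
labels = ["Electronics", "Clothing", "Books", "Other"]

squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.title("Sales by category (tree map)")
plt.show()
```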
Box Plots:
• Box plots are a graphical representation of the distribution of a set of data.
• In a box plot, the median is shown by a line inside the box, while the box itself spans the interquartile range (the middle 50% of the data).
• The whiskers extend from the box to the highest and lowest values in the data, excluding outliers.
• Box plots can help us to identify the spread and skewness of the data.
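A minimal Matplotlib sketch of a box plot with two injected outliers:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(70, 8, 150), [5, 2]])  # two injected outliers

plt.boxplot(scores, vert=True)
plt.ylabel("Exam score")
plt.title("Box plot of exam scores")
plt.show()  # outliers appear as individual points beyond the whiskers
```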


Uses of Data Visualization:
• Data visualization has several uses in machine learning:
 Identify trends and patterns in data: It may be challenging to spot trends and
patterns in data using conventional approaches, but data visualization tools may
be utilized to do so.
 Communicate insights to stakeholders: Data visualization can be used to
communicate insights to stakeholders in a format that is easily understandable
and can help to support decision-making processes.
 Monitor machine learning models: Data visualization can be used to monitor
machine learning models in real time and to identify any issues or anomalies in
the data.
 Improve data quality: Data visualization can be used to identify outliers and
inconsistencies in the data and to improve data quality by removing them.
Feature selection:
• Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.
• It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve.
Why Feature Selection?

• Machine learning models follow a simple rule: whatever goes in, comes out. If we put garbage into our model, we can expect the output to be garbage too. In this case, garbage refers to noise in our data.
• To train a model, we collect enormous quantities of data to help the machine learn better. Usually, a good portion of the data collected is noise, while some of the columns of our dataset might not contribute significantly to the performance of our model.
• Further, having a lot of data can slow down the training process and make the resulting model slower.
• The model may also learn from this irrelevant data and become inaccurate.
• Hence, apart from choosing the right model for our data, we need to choose the right data to put into our model.
Consider, for example, a dataset of used cars: the model of the car, the year of manufacture, and the miles it has travelled are quite important for deciding whether the car is old enough to be crushed. However, the name of the previous owner does not determine whether the car should be crushed; worse, it can confuse the algorithm into finding patterns between names and the other features. Hence we can drop that column.
Feature Selection Models

• Feature selection models are of two types:
• Supervised Models: Supervised feature selection refers to methods that use the output label class for feature selection. They use the target variable to identify the variables that can increase the efficiency of the model.
• Unsupervised Models: Unsupervised feature selection refers to methods that do not need the output label class for feature selection. We use them for unlabeled data.
Types of Supervised models:
We can further divide the supervised models into three:
1. Filter Method: In this method, features are dropped based on their relation to the output, i.e. how strongly they correlate with the output. We use correlation to check whether features are positively or negatively correlated with the output labels and drop features accordingly. E.g.: Information Gain, Chi-Square Test, Fisher's Score, etc.
2. Wrapper Method: We split our data into subsets and train a model on them. Based on the output of the model, we add and remove features and train the model again. It forms the subsets using a greedy approach and evaluates the accuracy of all the possible combinations of features. E.g.: Forward Selection, Backwards Elimination, etc.
3. Intrinsic Method: This method combines the qualities of both the Filter and Wrapper methods to create the best subset. It takes care of the machine-training iterative process while keeping the computation cost to a minimum. E.g.: Lasso and Ridge Regression.
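A minimal scikit-learn sketch contrasting a filter method (chi-square scores) with a wrapper method (recursive feature elimination); the dataset and the choice of keeping two features are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest chi-square score
filter_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Filter keeps features:", filter_selector.get_support(indices=True))

# Wrapper method: recursively eliminate features based on a model
wrapper_selector = RFE(LogisticRegression(max_iter=1000),
                       n_features_to_select=2).fit(X, y)
print("Wrapper keeps features:", wrapper_selector.get_support(indices=True))
```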
How to Choose a Feature Selection Model?

• How do we know which feature selection model will work for our problem? The process is relatively simple, with the choice depending on the types of input and output variables.
• Variables are of two main types:
  - Numerical Variables: which include integers and floating-point numbers.
  - Categorical Variables: which include labels, strings, Boolean variables, etc.
• Based on whether we have numerical or categorical variables as inputs and outputs, we can choose our feature selection method as follows:
Input Variable → Output Variable: Feature Selection Model
• Numerical → Numerical: Pearson's correlation coefficient (linear); Spearman's rank coefficient (nonlinear)
• Numerical → Categorical: ANOVA correlation coefficient (linear); Kendall's rank coefficient (nonlinear)
• Categorical → Numerical: Kendall's rank coefficient (linear); ANOVA correlation coefficient (nonlinear)
• Categorical → Categorical: Chi-Squared test (contingency tables); Mutual Information
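Two of the tests above, sketched with SciPy on made-up data (a numerical-to-numerical pair and a categorical-to-categorical contingency table):

```python
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

# Numerical input vs. numerical output: Pearson's correlation
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])
r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")

# Categorical input vs. categorical output: chi-squared test on a contingency table
contingency = np.array([[30, 10],   # feature level A: counts per class
                        [15, 45]])  # feature level B: counts per class
chi2_stat, p_value, dof, _ = chi2_contingency(contingency)
print(f"Chi-squared = {chi2_stat:.2f}, p = {p_value:.4f}")
```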
What are the advantages and limitations of z-score method, IQR method and
visualizing the data method of detecting outliers?

1. Z-score method:
Advantages:
• It provides a standardized score indicating how many standard deviations an
observation is from the mean.
• It's easy to understand and implement.
• It works well for normally distributed data.
Limitations:
• It assumes that the data is normally distributed, which may not always be the
case.
• It can be sensitive to extreme values, especially in small datasets.
• It may not be effective for skewed distributions.
2. Interquartile Range (IQR) method:
Advantages:
• It's robust to outliers and resistant to skewed distributions.
• It's simple to calculate and understand.
• It provides a measure of the spread of the middle 50% of the data.
Limitations:
• It relies on quartiles, which may not be representative if the dataset is small.
• It may not be as informative about the location of outliers as the z-score
method.
• It's less effective for normally distributed data compared to the z-score
method.

3. Visualizing the data method:


Advantages:
• It allows for a quick and intuitive understanding of the distribution of data.
• It can reveal patterns and anomalies that may not be evident from summary
statistics alone.
• It's particularly useful for identifying outliers in multidimensional datasets.
Limitations:
• It can be subjective and dependent on the choice of visualization technique.
• It may not be suitable for large datasets with many variables, as it can be
time-consuming.
• It requires some expertise to interpret the visualizations accurately.
What are the potential benefits of reducing the number of features in a dataset,
and how can feature selection improve model performance and interpretability?

Reducing the number of features in a dataset, also known as feature selection, can offer
several potential benefits:
1.Improved model performance:
• By removing irrelevant or redundant features, feature selection can reduce
overfitting, where the model learns noise in the data rather than the underlying
patterns. This can lead to better generalization performance on unseen data.
• It can also reduce computational complexity, making the training process faster
and more efficient, especially for algorithms that are sensitive to the curse of
dimensionality.
2.Enhanced interpretability:
• With fewer features, it becomes easier to understand and interpret the model.
Simplifying the model can help identify the most important factors driving the
predictions, making it easier to communicate the results to stakeholders.
• Feature selection can highlight the most relevant variables, allowing domain
experts to gain insights into the underlying processes and relationships in the
data.
3. Reduced overfitting:
• Feature selection helps to mitigate the risk of overfitting by reducing the model's
reliance on noisy or irrelevant features. This allows the model to capture the
underlying patterns in the data more effectively, leading to better generalization
performance on new data.
• Removing irrelevant features can also improve the model's robustness to changes
in the dataset, such as missing values or outliers.

4. Faster training and inference:


• With fewer features, models require less computational resources for training and
inference. This can lead to faster model development and deployment, which is
particularly important in real-time or resource-constrained applications.

5. Improved data visualization:


• Feature selection can result in a reduced-dimensional space, making it easier to
visualize and explore the data. This can help identify relationships and patterns
that may not be apparent in high-dimensional spaces.
How can outliers arise in a dataset?

Outliers can arise in a dataset due to various reasons, and they can have
different underlying causes depending on the nature of the data and the
context of the problem

 Measurement errors: Outliers can occur due to errors in data collection,


recording, or measurement. These errors could be human errors, instrument
malfunctions, or data entry mistakes. For example, a sensor malfunctioning
could record an extreme value, leading to an outlier.

 Natural variation: In many real-world phenomena, there can be inherent


variability or randomness. Occasionally, extreme or unusual events can
occur, resulting in outliers. For instance, in a dataset of daily temperatures,
an unusually hot or cold day could be considered an outlier.

 Data entry errors: Outliers may arise from mistakes during data entry or
data preprocessing. Human error or typos when entering data into a
database or spreadsheet can result in values that are far from the typical
range.
 Sampling errors: Outliers can occur due to issues with the sampling process.
If the sample size is too small or if the sampling method is biased, it may not
accurately represent the underlying population, leading to outliers.

 Genuine extreme values: Sometimes, outliers represent real and


meaningful observations that are genuinely extreme. These could be rare
events, anomalies, or exceptional cases that are legitimately part of the data
distribution.

 Data transformation: Outliers can also arise during data transformation or


aggregation processes. For example, when summarizing data from multiple
sources, inconsistencies or extreme values in one source can lead to outliers in
the aggregated dataset.

 Intentional data points: In some cases, outliers may be deliberately


introduced into the dataset. This could be done for various reasons, such as
testing the robustness of a model or representing special cases in the analysis.
