Machine: Learning

This document discusses machine learning and data preprocessing. It provides an overview of machine learning concepts like supervised learning, deep learning, and the machine learning workflow. It then focuses on key steps in data preprocessing, including data quality assessment, cleaning, transformation, and reduction. Specific techniques covered are handling missing values, outliers, and approaches like binning, winsorizing, and imputation.

Uploaded by

ARCHANA R
Copyright © All Rights Reserved

data science

MACHINE LEARNING
Group 1
1. Farhan Imam Naufal
2. Annisa Dwi Ari
3. Salwa Ahla Amania
4. Mikhael Aditha Sembiring Meliala
5. Djeremiah Christofel
What is MACHINE LEARNING?
Machine learning is a method of data analysis that automates analytical model building.

Learn from experience vs. learn from data
ARTIFICIAL INTELLIGENCE
The ability of a machine to imitate intelligent human behavior.

MACHINE LEARNING
An application of AI that allows a system to automatically learn and improve from experience.

DEEP LEARNING
An application of machine learning that uses complex algorithms and deep neural networks to train a model.
TYPES OF MACHINE LEARNING

SUPERVISED LEARNING: Task-Driven (Classification / Regression)
UNSUPERVISED LEARNING: Data-Driven (Clustering)
REINFORCEMENT LEARNING: Learn From Errors (Playing Games)
MACHINE LEARNING WORKFLOW
PREPROCESSING
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is
the first, crucial step in creating a machine learning model.

DATA QUALITY ASSESSMENT: Information, DType Check, Null & Outlier, Discrepancy, etc.
DATA CLEANING: Change DType, Handling Null, Handling Outlier, etc.
DATA TRANSFORMATION: Normalization, Generalization, etc.
DATA REDUCTION: Dimensionality Reduction, Numerosity Reduction, etc.
DATA CLEANING
Data cleaning is the process of preparing data for analysis by removing or modifying
data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

Data cleaning is labor-intensive, and it is the most important step if you want to
build a data culture, let alone make airtight predictions. It involves:
Fixing spelling and syntax errors
Standardizing data sets
Correcting mistakes such as empty fields
Identifying duplicate data points
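
A minimal sketch of three of the steps above (standardizing, flagging empty fields, and removing duplicates), using only the Python standard library. The function name `clean_records` and the sample records are hypothetical; spelling correction is not shown.

```python
def clean_records(records):
    """Standardize text fields, mark empty fields as missing, drop duplicate rows."""
    cleaned, seen = [], set()
    for rec in records:
        # Standardize: trim whitespace and use one casing for text fields.
        row = {k: v.strip().lower() if isinstance(v, str) else v
               for k, v in rec.items()}
        # Mark empty strings as None so they can be handled explicitly later.
        row = {k: (None if v == "" else v) for k, v in row.items()}
        # Rows that are identical after standardizing count as duplicates.
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


raw = [
    {"name": " Alice ", "city": "Jakarta"},
    {"name": "alice", "city": "jakarta "},   # duplicate once standardized
    {"name": "Bob", "city": ""},             # empty field
]
cleaned = clean_records(raw)
print(cleaned)
```

Standardizing before checking for duplicates matters here: the first two records only match after whitespace and casing are normalized.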
MISSING VALUES
Missing data means the absence of observations in columns. It appears as
values such as “0”, “NA”, “NaN”, “NULL”, “Not Applicable”, “None”.

Why does a dataset have missing values?
The cause can be data corruption, failure to record data, lack of
information, incomplete results, a person intentionally not providing the data,
some system or equipment failure, etc. There can be many
reasons for missing values in your dataset.

Why handle missing values?
One of the biggest impacts of missing data is that it can bias the results of
machine learning models or reduce the accuracy of the model. So it is very
important to handle missing values.
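
A minimal sketch of counting missing values per column when they are encoded as placeholder strings like the ones listed above. The names `MISSING_TOKENS`, `is_missing`, and `missing_counts` are hypothetical. As an assumption of this sketch, "0" is left out of the token set because zero is often a legitimate value; add it only for fields where zero cannot occur.

```python
import math

# Placeholder strings treated as missing (assumption: extend for your own data).
MISSING_TOKENS = {"na", "nan", "null", "not applicable", "none", ""}


def is_missing(value):
    """Return True if the value looks like a missing-data marker."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value.strip().lower() in MISSING_TOKENS
    return False


def missing_counts(rows):
    """Count missing entries per column for a list of dict rows."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + int(is_missing(value))
    return counts


rows = [
    {"age": "34", "income": "NA"},
    {"age": "NULL", "income": "5200"},
    {"age": "0", "income": "None"},
]
counts = missing_counts(rows)
print(counts)
```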
MISSING VALUES TYPES

MCAR (Missing Completely At Random)
In MCAR, the probability of data being missing is the same for all the
observations. In this case, there is no relationship between the missing data
and any other values, observed or unobserved (the data which is not recorded),
within the given dataset.

MAR (Missing At Random)
In MAR, the reason for the missing values can be explained by variables on which
you have complete information, as there is some relationship between the missing
data and other values/data. In this case, the data is not missing for all the
observations. It is missing only within sub-samples of the data, and there is
some pattern in the missing values.

MNAR (Missing Not At Random)
Missing values depend on the unobserved data. If there is some structure/pattern
in the missing data and other observed data cannot explain it, then it is Missing
Not At Random (MNAR). If the missing data does not fall under MCAR or MAR, then
it can be categorized as MNAR.
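
A minimal simulation sketch of the difference between MCAR and MAR. The data, the column names (`group`, `income`), and the probabilities are all hypothetical. Under MCAR every value is dropped with the same probability; under MAR the probability depends only on an observed column. MNAR, where missingness depends on the unrecorded value itself, is not simulated here.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: each person has an age group and an income.
people = [{"group": random.choice(["young", "old"]),
           "income": random.gauss(50, 10)} for _ in range(10_000)]


def mask_mcar(rows, p=0.2):
    """MCAR: every income has the same probability p of being missing."""
    return [dict(r, income=None) if random.random() < p else r for r in rows]


def mask_mar(rows, p_young=0.4, p_old=0.05):
    """MAR: missingness depends only on the observed 'group' column."""
    out = []
    for r in rows:
        p = p_young if r["group"] == "young" else p_old
        out.append(dict(r, income=None) if random.random() < p else r)
    return out


def missing_rate(rows, group):
    """Share of missing incomes within one group."""
    sub = [r for r in rows if r["group"] == group]
    return sum(r["income"] is None for r in sub) / len(sub)


mcar, mar = mask_mcar(people), mask_mar(people)
for name, data in [("MCAR", mcar), ("MAR", mar)]:
    print(name, round(missing_rate(data, "young"), 2),
          round(missing_rate(data, "old"), 2))
```

Comparing missing rates across an observed column, as `missing_rate` does, is one simple way to see that data is not MCAR: the rates should be roughly equal under MCAR and clearly different under this MAR setup.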

Handling MISSING VALUES
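
A minimal sketch of two common ways to handle missing values: deletion and median imputation. Both are assumptions chosen for illustration, not methods confirmed by this slide, and the names `drop_missing` and `impute_median` are hypothetical. Missing entries are represented as `None`.

```python
from statistics import median


def drop_missing(values):
    """Deletion: keep only the observed values."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Imputation: replace each missing value with the median of the observed ones."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]


ages = [23, None, 31, 27, None, 45]
print(drop_missing(ages))
print(impute_median(ages))
```

Deletion shrinks the dataset, while imputation keeps its size but adds values that were never observed, so the choice depends on how much data is missing and why.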
OUTLIERS
An outlier is an object that deviates significantly from the rest of the objects. Outliers can be
caused by measurement or execution errors. The analysis of outlier data is referred to as
outlier analysis or outlier mining.
Detecting OUTLIERS
BOXPLOT SCATTERPLOT Z-SCORE IQR
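
A minimal sketch of the two numeric methods named above, z-score and IQR, using the standard library. The thresholds (3 standard deviations, 1.5 times the IQR) are common conventions assumed here, and the function names and data are hypothetical. The IQR fences are the same ones a boxplot draws as whiskers.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]
print(zscore_outliers(data))
print(iqr_outliers(data))
```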
Handling OUTLIERS
Handling outliers, like failing to detect them at all, carries the risk of substantially affecting
the outcome of an analysis or machine learning model. From a mathematical point of view, there is
no right or wrong answer on how to treat outlying observations. Alongside the mathematics, the
qualitative information you have available can play an even more important role in deciding what
to do with outliers.

If we can't rectify the outliers, then we may consider some of the following methods to handle them.
Doing nothing
Deleting/Trimming
After deleting the outliers, we should be careful not to run the outlier detection test once again. As the IQR and standard
deviation change after the removal of outliers, this may lead to wrongly detecting some new values as outliers.
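
A minimal sketch of this caveat, using a z-score test with an assumed threshold of 2.5 on hypothetical data. After the first pass removes the most extreme value, the standard deviation shrinks, and a second pass flags a value that the first pass accepted.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


data = list(range(10, 21)) + [30, 100]

first_pass = zscore_outliers(data)
trimmed = [v for v in data if v not in first_pass]   # trimming: delete the outliers
second_pass = zscore_outliers(trimmed)               # re-running changes the verdict

print(first_pass)
print(second_pass)
```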

Winsorizing
Unlike trimming, here we replace the outliers with other values. A common approach is replacing
the outliers on the upper side with the 95th percentile value and the outliers on the lower side
with the 5th percentile value.
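
A minimal sketch of winsorizing at the 5th and 95th percentiles with the standard library. The function name `winsorize` and the data are hypothetical, and the "inclusive" percentile method is an assumption; other percentile definitions give slightly different caps.

```python
from statistics import quantiles


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles instead of deleting them."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


data = list(range(1, 21)) + [500]   # 21 values; 500 is an obvious outlier
capped = winsorize(data)
print(capped)
```

The number of observations is unchanged; only the extreme values are pulled in to the percentile caps.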

Transformation
Use a transformation such as the log transformation in the case of a right-tailed distribution.
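
A minimal sketch of a log transformation on hypothetical right-tailed data. It uses `math.log1p`, that is log(1 + x), as an assumption so that zeros are allowed; a plain logarithm works the same way for strictly positive values.

```python
import math
from statistics import median


def log_transform(values):
    """Apply log(1 + x); defined for x > -1 and preserves the ordering."""
    return [math.log1p(v) for v in values]


incomes = [20, 25, 30, 35, 40, 1000]   # right-tailed: one very large value
logged = log_transform(incomes)

# How far the largest value sits from the middle, before and after.
ratio_raw = max(incomes) / median(incomes)
ratio_logged = max(logged) / median(logged)
print(ratio_raw, ratio_logged)
```

The large value is still the largest after the transformation, but it sits much closer to the rest of the data.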

Binning
Binning, or discretization of continuous data into groups such as low, medium and high,
converts the outlier values into count values.
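
A minimal sketch of binning into low, medium and high. The cut-off values (20 and 50), the function name `bin_value`, and the data are hypothetical. Once binned, the extreme value 900 is simply one more member of the "high" group.

```python
from collections import Counter


def bin_value(value, low_max=20, medium_max=50):
    """Map a continuous value to one of three ordered groups."""
    if value <= low_max:
        return "low"
    if value <= medium_max:
        return "medium"
    return "high"


purchases = [5, 12, 18, 25, 31, 47, 52, 900]
labels = [bin_value(v) for v in purchases]
counts = Counter(labels)
print(counts)
```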

Use robust estimators
Robust estimators, such as the median when measuring central tendency and decision trees for
classification tasks, can handle outliers better.
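
A minimal sketch comparing the mean and the median on hypothetical data, before and after one outlier is added. Decision trees are not shown.

```python
from statistics import mean, median

clean = [48, 50, 51, 49, 52]
with_outlier = clean + [500]

# How much each estimator moves when a single outlier is added.
mean_shift = mean(with_outlier) - mean(clean)
median_shift = median(with_outlier) - median(clean)

print(mean_shift, median_shift)
```

One extreme value moves the mean far from the bulk of the data, while the median barely changes.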

Imputing
Another method is to treat the outliers as missing values and then impute them using methods
similar to those we saw while handling missing values.
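
A minimal sketch of this idea under two assumptions: outliers are detected with 1.5 times IQR fences, and they are imputed with the median of the remaining values. The function name `impute_outliers` and the data are hypothetical.

```python
from statistics import median, quantiles


def impute_outliers(values, k=1.5):
    """Mark IQR outliers as missing, then fill them with the median of the rest."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    # Step 1: treat outliers as missing values (None).
    masked = [v if low <= v <= high else None for v in values]
    # Step 2: impute the missing values from the observed ones.
    fill = median(v for v in masked if v is not None)
    return [fill if v is None else v for v in masked]


data = [12, 14, 13, 15, 14, 13, 200]
print(impute_outliers(data))
```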

Thank You
