Machine Learning
Group 1
1. Farhan Imam Naufal
2. Annisa Dwi Ari
3. Salwa Ahla Amania
4. Mikhael Aditha Sembiring Meliala
5. Djeremiah Christofel
What is MACHINE LEARNING?
Machine learning is a method of data analysis that automates analytical model
building.
Learn from data vs. learn from experience
ARTIFICIAL INTELLIGENCE
The ability of a machine to imitate intelligent human behavior.
MACHINE LEARNING
An application of AI that allows a system to automatically learn and improve from experience.
Data cleaning is a lot of muscle work. There’s a reason data cleaning is the most
important step if you want to create a data culture, let alone make airtight
predictions. It involves:
Fixing spelling and syntax errors
Standardizing data sets
Correcting mistakes such as empty fields
Identifying duplicate data points
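The four cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names `city` and `sales`, and the `SPELLING_FIXES` map are all invented for the example, and a real project would likely use a data-frame library instead.

```python
# Hypothetical toy records; field names and values are invented for illustration.
raw = [
    {"city": "Jakarta", "sales": 100},
    {"city": " jakarta", "sales": 100},
    {"city": "Bandung", "sales": 250},
    {"city": "Surabya", "sales": 300},
    {"city": "", "sales": 400},
]

# Assumed correction map for known misspellings.
SPELLING_FIXES = {"Surabya": "Surabaya"}


def clean(records):
    """Standardize, correct, and de-duplicate a list of record dicts."""
    cleaned, seen = [], set()
    for rec in records:
        # Standardize: trim whitespace and use one capitalization style.
        city = rec["city"].strip().title()
        # Fix known spelling errors.
        city = SPELLING_FIXES.get(city, city)
        # Represent empty fields explicitly as None instead of "".
        city = city or None
        row = {"city": city, "sales": rec["sales"]}
        # Identify duplicates by comparing the standardized row contents.
        key = (row["city"], row["sales"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Standardizing before checking for duplicates matters here: "Jakarta" and " jakarta" only count as the same data point once whitespace and capitalization agree.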
MISSING VALUES
Missing data means the absence of observations in columns. It can appear as
values such as “0”, “NA”, “NaN”, “NULL”, “Not Applicable”, or “None”.
Why handle missing values?
One of the biggest impacts of missing data is that it can bias the results of
machine learning models or reduce their accuracy, so it is very important to
handle missing values.
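Before missing values can be handled they have to be found. The sketch below counts them per column, treating the placeholder strings listed earlier as missing. The function names, the `zero_is_missing` switch, and the sample rows are assumptions made for illustration. In particular, "0" is often a legitimate measurement, so it is only counted as missing when explicitly requested.

```python
import math

# Placeholder strings that are treated as missing (an assumed convention).
MISSING_TOKENS = {"", "NA", "NaN", "NULL", "Not Applicable", "None"}


def is_missing(value, zero_is_missing=False):
    """Return True if value should be counted as a missing observation."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in MISSING_TOKENS:
        return True
    # Zero can be a real measurement, so it is only missing when asked for.
    if zero_is_missing and value in (0, "0"):
        return True
    return False


def missing_counts(rows, zero_is_missing=False):
    """Count missing observations per column in a list of row dicts."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if is_missing(value, zero_is_missing):
                counts[column] += 1
    return counts


# Hypothetical sample rows.
rows = [
    {"age": 34, "income": "NA"},
    {"age": None, "income": 5200},
    {"age": 0, "income": "NULL"},
    {"age": float("nan"), "income": 4100},
]
print(missing_counts(rows))
print(missing_counts(rows, zero_is_missing=True))
```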
MISSING VALUES TYPES
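MCAR (Missing Completely At Random)
In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no relationship between the missing data and any other values, observed or unobserved.

MAR (Missing At Random)
The missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and the other observed values. In this case, the data is not missing for all the observations. It is missing only within some sub-groups of the data.

MNAR (Missing Not At Random)
Missing values depend on the unobserved data. If there is a pattern in the missing data and the other observed data can not explain it, then it is Missing Not At Random (MNAR).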
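A small simulation can make the three mechanisms concrete. In this hedged sketch the variables `age` (always observed) and `income` (sometimes missing), the sample size, and all probabilities are arbitrary choices for illustration: MCAR uses one fixed probability for every row, MAR lets the probability depend on the observed `age`, and MNAR lets it depend on the `income` value that goes missing.

```python
import random


def simulate(n=10_000, seed=0):
    """Return complete data plus three copies with income masked differently."""
    rng = random.Random(seed)
    # age is always observed; income is the value that may go missing.
    complete = [(rng.randint(20, 60), rng.gauss(5000, 1500)) for _ in range(n)]

    def mask(prob_missing):
        # prob_missing(age, income) gives each row's chance of losing its income.
        return [
            (age, None if rng.random() < prob_missing(age, income) else income)
            for age, income in complete
        ]

    mcar = mask(lambda age, income: 0.20)                             # same for everyone
    mar = mask(lambda age, income: 0.35 if age < 40 else 0.05)        # depends on observed age
    mnar = mask(lambda age, income: 0.35 if income > 6000 else 0.05)  # depends on income itself
    return complete, mcar, mar, mnar


def missing_rate(rows, keep):
    """Share of missing incomes among rows whose age satisfies keep(age)."""
    selected = [income for age, income in rows if keep(age)]
    return sum(v is None for v in selected) / len(selected)


def observed_mean(rows):
    """Mean income over the rows where income is still present."""
    values = [income for _, income in rows if income is not None]
    return sum(values) / len(values)


complete, mcar, mar, mnar = simulate()
for name, rows in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    young = missing_rate(rows, lambda age: age < 40)
    older = missing_rate(rows, lambda age: age >= 40)
    print(f"{name}: missing under 40 = {young:.2f}, 40 and over = {older:.2f}, "
          f"observed mean income = {observed_mean(rows):.0f}")
```

In the MNAR copy the observed mean income falls below the true mean because high incomes are removed more often, and nothing in the observed `age` column reveals that. This is why MNAR is the hardest case to correct.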
Handling MISSING VALUES
OUTLIERS
An outlier is an object that deviates significantly from the rest of the objects. Outliers can be
caused by measurement or execution errors. The analysis of outlier data is referred to as
outlier analysis or outlier mining.
Detecting OUTLIERS
Boxplot
Scatterplot
Z-score
IQR (interquartile range)
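The two numeric methods, Z-score and IQR, can be sketched with Python's standard `statistics` module. The thresholds used here (3 standard deviations, 1.5 times the IQR) are common conventions rather than fixed rules, and the sample data is invented for illustration.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, so nothing can deviate
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print("Z-score:", zscore_outliers(data))
print("IQR:", iqr_outliers(data))
```

A boxplot conventionally draws its whiskers with the same 1.5 times IQR rule, so the IQR function flags roughly the points a boxplot would show as individual dots. The Z-score relies on the mean and standard deviation, which the outliers themselves inflate, so it can miss outliers in small samples.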
Handling OUTLIERS
Like failing to detect outliers at all, the way outliers are handled can have a substantial impact
on the outcome of an analysis or machine learning model. From a mathematical point of view, there is
no right or wrong answer on how to treat outlying observations. Alongside the mathematics, the
qualitative information you have available should play an important role in decisions about
outliers.
If we can't rectify the outliers, then we may consider some of the following methods to handle them.
Doing nothing
Deleting/Trimming: After deleting the outliers, we should be careful not to run the outlier detection test once again. As the IQR and standard deviation change after the removal of outliers, this may lead to wrongly detecting some new values as outliers.
Winsorizing: Unlike trimming, here we replace the outliers with other values. A common approach is replacing the outliers on the upper side with the 95th percentile value and the outliers on the lower side with the 5th percentile value.
Transformation: Use a transformation such as a log transformation in the case of a right-tailed distribution.
Binning: Binning, or discretization of continuous data into groups such as low, medium and high, converts the outlier values into count values.
Use robust estimators: Robust estimators such as the median when measuring central tendency, and decision trees for classification tasks, can handle the outliers better.
Imputing: Another method is to treat the outliers as missing values and then impute them using methods similar to those we saw while handling missing values.
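Several of the handling methods above can be sketched with the standard library. All function names, the bin edges, the choice of the median as the imputed value, and the sample data are assumptions made for illustration, not the only way to apply each method.

```python
import math
import statistics
from collections import Counter


def iqr_bounds(values, k=1.5):
    """Lower and upper fences from the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim(values):
    """Deleting/Trimming: drop values outside the fences (run once only)."""
    lower, upper = iqr_bounds(values)
    return [v for v in values if lower <= v <= upper]


def winsorize(values):
    """Winsorizing: clamp values to the 5th and 95th percentiles."""
    cuts = statistics.quantiles(values, n=20, method="inclusive")  # 5%, 10%, ..., 95%
    low, high = cuts[0], cuts[-1]
    return [min(max(v, low), high) for v in values]


def log_transform(values):
    """Transformation: log1p compresses a long right tail (needs values >= 0)."""
    return [math.log1p(v) for v in values]


def bin_counts(values, low_edge=12, high_edge=14):
    """Binning: map each value to a group, then count the groups."""
    def label(v):
        if v < low_edge:
            return "low"
        return "medium" if v < high_edge else "high"
    return Counter(label(v) for v in values)


def impute_outliers(values):
    """Imputing: replace outliers with the median of the remaining values."""
    lower, upper = iqr_bounds(values)
    median = statistics.median(v for v in values if lower <= v <= upper)
    return [v if lower <= v <= upper else median for v in values]


# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print("trimmed:", trim(data))
print("winsorized max:", max(winsorize(data)))
print("log of max:", round(log_transform(data)[-1], 3))
print("bins:", dict(bin_counts(data)))
print("mean vs median:", statistics.fmean(data), statistics.median(data))
print("imputed:", impute_outliers(data))
```

The mean versus median line illustrates the robust estimator point: a single extreme value pulls the mean well away from the bulk of the data, while the median barely moves.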
Thank You