SML Updated UNIT-2
When creating a machine learning project, it is not always the case that the data we come across is clean and formatted; raw data usually has to be pre-processed before it can be fed to a machine learning model.
1. Missing values
Here are a few ways to solve this issue:
Ignore those tuples
This method should be considered when the dataset is huge and numerous
missing values are present within a tuple.
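As a minimal sketch of this idea (assuming pandas; the column names and values below are purely illustrative), dropping tuples that contain missing values looks like this, with mean-imputation shown as a common alternative:

import numpy as np
import pandas as pd

# Illustrative data with missing values (NaN)
df = pd.DataFrame({
    "age": [25, np.nan, 32, 47, np.nan],
    "salary": [50000, 62000, np.nan, 81000, 58000],
})

# Ignore (drop) the tuples that contain missing values
dropped = df.dropna()

# A common alternative: fill missing values with the column mean
filled = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(filled)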
2. Noisy data
Binning
This technique works on sorted data values to smooth any noise present in them. The data is divided into equal-sized bins, and each bin/bucket is dealt with independently. All data in a segment can be replaced by its mean, median, or boundary values.
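A minimal sketch of smoothing by bin means and by bin boundaries, assuming NumPy and equal-frequency bins (the sample values are illustrative):

import numpy as np

# Sorted data values (illustrative)
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float))

# Divide the data into equal-sized bins
bins = np.array_split(data, 3)

# Smoothing by bin means: every value in a bin becomes the bin's mean
by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: every value becomes the nearest bin boundary
by_boundary = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])

print(by_mean)
print(by_boundary)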
Regression
This data mining technique is generally used for prediction. It helps to smooth noise by fitting the data points to a regression function. A linear regression equation is used if there is only one independent attribute; otherwise, polynomial equations are used.
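As a sketch (assuming scikit-learn; the synthetic data is illustrative), noisy values of a single attribute can be replaced by the values predicted from a fitted regression line:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 * x.ravel() + 5 + rng.normal(0, 2, size=50)  # linear trend plus noise

# Fit the regression function and replace noisy points with fitted values
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)
# For a curved trend, a higher-degree fit via np.polyfit plays the same role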
Clustering
Groups/clusters are created from data having similar values. Values that do not lie in any cluster can be treated as noisy data and removed.
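A sketch of this idea using K-means (the algorithm choice and the 3-standard-deviation cut-off are assumptions for illustration, not prescriptions from these notes):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(0, 0.5, (50, 2)),   # cluster around (0, 0)
    rng.normal(5, 0.5, (50, 2)),   # cluster around (5, 5)
    [[20.0, 20.0]],                # one far-away noisy point
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
dist = np.linalg.norm(points - km.cluster_centers_[km.labels_], axis=1)

# Points unusually far from their cluster centre are treated as noise
clean = points[dist <= dist.mean() + 3 * dist.std()]
print(len(points), "->", len(clean))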
3. Removing outliers:
Outliers are data points that differ significantly from the other observations in a given dataset. They can occur because of variability in measurement or because of mistakes made while recording data points.
The most common causes of outliers in a data set:
Data Entry Errors: Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
Measurement Error (instrument errors): It is the most common source of
outliers. This is caused when the measurement instrument used turns out to be
faulty.
Experimental errors (data extraction or experiment planning errors)
Intentional (dummy outliers made to test detection methods)
Data processing errors (data manipulation or data set unintended mutations)
Sampling errors (extracting or mixing data from wrong or various sources)
How to detect outliers?
IQR (Inter-Quartile Range) method: data points that fall below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR are treated as outliers, where IQR = Q3 - Q1.
Q1 represents the 1st quartile (25th percentile) of the data.
Q2 represents the 2nd quartile (median) of the data.
Q3 represents the 3rd quartile (75th percentile) of the data.
(Q1 - 1.5 x IQR) represents the smallest non-outlier value in the data set and (Q3 + 1.5 x IQR) represents the largest non-outlier value.
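A minimal sketch of the IQR rule with NumPy (the sample values are illustrative):

import numpy as np

data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 20, 45], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is treated as an outlier
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # -> [45.]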
Visualizing the data
Data visualization represents data graphically using elements such as shapes, dimensions, colours, lines, points, and angles, so that data analysts can effectively visualize and define the metadata and then perform data cleansing.
Performing data visualization as an initial step of data exploration helps data analysts identify outliers, trends, and patterns in the data that may be missed by other forms of analysis.
With the increasing availability of big data, it has become more important than ever to work with clean data, and data visualization can help to identify and remove inconsistencies or anomalies in the data.
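As a sketch (assuming matplotlib), a box plot makes the quartiles and any points beyond the 1.5 x IQR whiskers visible at a glance:

import numpy as np
import matplotlib.pyplot as plt

data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 20, 45], dtype=float)

# Points beyond the whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR) are drawn as outliers
plt.boxplot(data)
plt.title("Box plot for spotting outliers")
plt.ylabel("value")
plt.show()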
Feature Selection
Machine learning models follow a simple rule: whatever goes in comes out. If we put garbage into our model, we can expect the output to be garbage too. In this case, garbage refers to noise in our data.
To train a model, we collect enormous quantities of data to help the machine learn better.
How do we know which feature selection method will work for our model? The process is relatively simple, with the choice depending on the types of input and output variables.
Variables are of two main types:
Numerical Variables: which include integers and floating-point numbers.
Categorical Variables: which include labels, strings, Boolean variables, etc.
Based on whether we have numerical or categorical variables as inputs and outputs, we can choose an appropriate statistical measure for feature selection, as the sketch below illustrates.
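As a sketch (assuming scikit-learn; the synthetic data and k=2 are illustrative): with numerical inputs and a numerical output, a correlation-based score such as f_regression is a common filter choice, while chi2 is typical when both sides are categorical:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # numerical input features
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=100)   # numerical output

# Keep the 2 features whose scores against the target are highest
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # mask showing which features were kept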
Outlier detection methods: advantages and limitations
1. Z-score method:
Advantages:
• It provides a standardized score indicating how many standard deviations an
observation is from the mean.
• It's easy to understand and implement.
• It works well for normally distributed data.
Limitations:
• It assumes that the data is normally distributed, which may not always be the
case.
• It can be sensitive to extreme values, especially in small datasets.
• It may not be effective for skewed distributions.
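A minimal sketch of the z-score method, where z = (x - mean) / standard deviation; a threshold of |z| > 3 is the common convention, relaxed to 2 here only because the illustrative sample is tiny:

import numpy as np

data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 20, 45], dtype=float)

# z-score: how many standard deviations each point lies from the mean
z = (data - data.mean()) / data.std()

# Flag points with |z| above the chosen threshold as outliers
outliers = data[np.abs(z) > 2]
print(outliers)  # -> [45.]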
2. Interquartile Range (IQR) method:
Advantages:
• It's robust to outliers and resistant to skewed distributions.
• It's simple to calculate and understand.
• It provides a measure of the spread of the middle 50% of the data.
Limitations:
• It relies on quartiles, which may not be representative if the dataset is small.
• It may not be as informative about the location of outliers as the z-score
method.
• It's less effective for normally distributed data compared to the z-score
method.
Reducing the number of features in a dataset, also known as feature selection, can offer
several potential benefits:
1. Improved model performance:
• By removing irrelevant or redundant features, feature selection can reduce
overfitting, where the model learns noise in the data rather than the underlying
patterns. This can lead to better generalization performance on unseen data.
• It can also reduce computational complexity, making the training process faster
and more efficient, especially for algorithms that are sensitive to the curse of
dimensionality.
2. Enhanced interpretability:
• With fewer features, it becomes easier to understand and interpret the model.
Simplifying the model can help identify the most important factors driving the
predictions, making it easier to communicate the results to stakeholders.
• Feature selection can highlight the most relevant variables, allowing domain
experts to gain insights into the underlying processes and relationships in the
data.
3. Reduced overfitting:
• Feature selection helps to mitigate the risk of overfitting by reducing the model's
reliance on noisy or irrelevant features. This allows the model to capture the
underlying patterns in the data more effectively, leading to better generalization
performance on new data.
• Removing irrelevant features can also improve the model's robustness to changes
in the dataset, such as missing values or outliers.
Outliers can arise in a dataset for various reasons, and they can have different underlying causes depending on the nature of the data and the context of the problem.
Data entry errors: Outliers may arise from mistakes during data entry or
data preprocessing. Human error or typos when entering data into a
database or spreadsheet can result in values that are far from the typical
range.
Sampling errors: Outliers can occur due to issues with the sampling process.
If the sample size is too small or if the sampling method is biased, it may not
accurately represent the underlying population, leading to outliers.