
UNIT – I

Introduction to Machine Learning – SCSB4009


UNIT 1 INTRODUCTION TO MACHINE LEARNING

Machine learning - basic concepts in machine learning - types of machine learning -
examples of machine learning - applications - the bias-variance - data
preprocessing - noise removal - normalization.

Introduction
Machine Learning is a field of study that gives computers the ability to learn
without being explicitly programmed (Arthur Samuel). "A computer program is said
to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with
experience E." (Tom Mitchell)

Learning = improving with experience at some task
- Improve over task T,
- with respect to performance measure P,
- based on experience E.
E.g., learn to play checkers:
- T: play checkers
- P: % of games won in the world tournament
- E: opportunity to play against itself
Model
A machine learning model is a program (or set of programs) that can be used to
find patterns and make decisions from an unseen dataset. It can take any one of
the following forms:
- Mathematical equations
- Relational diagrams such as graphs/trees
- Logical if/else rules
- Groupings called clusters
Training set, Test set and Validation set

• Divide the total dataset into three subsets:

– Training data is used for learning the parameters of the model.

– Validation data is not used for learning, but is used for deciding what type of
model and what amount of regularization works best.

– Test data is used to get a final, unbiased estimate of how well the network
works. We expect this estimate to be worse than on the validation data.

We could then re-divide the total dataset to get another unbiased estimate of the
true error rate.

DIFFERENCE BETWEEN TRADITIONAL PROGRAMMING AND MACHINE LEARNING

In traditional programming, a programmer writes explicit rules (the program) that
are applied to input data to produce the output. In machine learning, the input
data together with the expected outputs are fed to a learning algorithm, which
produces the program (the model); that model can then be applied to new data to
produce outputs.

Need for Machine Learning

- The Data-Information-Knowledge-Wisdom (DIKW) pyramid illustrates the progression
from raw data to valuable insights. It gives you a framework to discuss the level
of meaning and utility within data. Each level of the pyramid builds on the lower
levels, and to effectively make data-driven decisions, you need all four levels.
- Wisdom is the ability to make well-informed decisions and take
effective action based on understanding of the underlying knowledge.
- Knowledge is the result of analyzing and interpreting information to
uncover patterns, trends, and relationships. It provides an understanding of
"how" and "why" certain phenomena occur.
- Information is organized, structured, and contextualized data.
Information is useful for answering basic questions like "who," "what,"
"where," and "when."
- Data refers to raw, unprocessed facts and figures without context. It is
the foundation for all subsequent layers but holds limited value in isolation.
Types of Machine Learning

Supervised learning
Supervised learning is a type of machine learning in which a model is trained on a
"labelled dataset". Labelled datasets contain both the input and the output
parameters. Supervised learning algorithms learn to map inputs to the correct
outputs; both the training and validation datasets are labelled.

Example: Consider a scenario where you have to build an image classifier to
differentiate between cats and dogs. If you feed labelled images of dogs and cats
to the algorithm, the machine will learn to distinguish a dog from a cat using
these labelled images. When we input new dog or cat images that it has never seen
before, it will use the learned model to predict whether the image shows a dog or
a cat. This is how supervised learning works, and this is an example of image
classification.
There are two main categories of supervised learning that are mentioned below:

Classification - deals with predicting categorical target variables, which
represent discrete classes or labels. For instance, classifying emails as spam or
not spam, or predicting whether a patient has a high risk of heart disease.
Regression - deals with predicting continuous target variables, which represent
numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product.
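As a concrete illustration of supervised classification, here is a minimal sketch
in Python. It assumes scikit-learn is installed and uses the built-in iris dataset
as a stand-in for the labelled cat/dog images described above: X holds the input
features and y the known class labels.

    # Minimal supervised-learning sketch (assumes scikit-learn is installed).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)     # labelled dataset: inputs and outputs
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    clf = LogisticRegression(max_iter=200)   # a simple classification model
    clf.fit(X_train, y_train)                # learn the input -> output mapping
    print("test accuracy:", clf.score(X_test, y_test))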

Unsupervised learning
Unsupervised learning is a type of machine learning technique in which an
algorithm discovers patterns and relationships using unlabeled data.
Unlike supervised learning, unsupervised learning doesn’t involve providing the
algorithm with labeled target outputs.
The primary goal of Unsupervised learning is often to discover hidden patterns,
similarities, or clusters within the data, which can then be used for various
purposes, such as data exploration, visualization, dimensionality reduction, and
more.

Example: Consider that you have a dataset that contains information about the
purchases you made from a shop. Through clustering, the algorithm can group
customers with similar purchasing behaviour, revealing customer segments without
predefined labels. This type of information can help businesses target customers
as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned
below:

Clustering - Clustering is the process of grouping data points into clusters based
on their similarity. This technique is useful for identifying patterns and
relationships in data without the need for labeled examples.
Association - Association rule learning is a technique for discovering
relationships between items in a dataset. It identifies rules indicating that the
presence of one item implies the presence of another item with a specific
probability.
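The clustering idea above can be sketched in a few lines of Python. This is only
an illustration: the synthetic "purchase" data, the number of clusters, and the
use of scikit-learn's KMeans are assumptions, not part of the original notes.

    # Minimal clustering sketch (assumes NumPy and scikit-learn are installed).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # two artificial groups of customers: low spenders and high spenders
    purchases = np.vstack([rng.normal(20, 5, size=(50, 2)),
                           rng.normal(80, 5, size=(50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(purchases)
    print(kmeans.labels_[:10])        # cluster assignment of the first 10 customers
    print(kmeans.cluster_centers_)    # typical purchasing behaviour per cluster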

Semi-supervised learning

o It is a combination of the supervised and unsupervised learning models.

o The training data includes only a few of the desired outputs; most of it is
unlabelled.

Reinforcement learning

A reinforcement learning algorithm is a learning method that interacts with the
environment by producing actions and discovering errors.

Trial and error, together with delayed reward, are the most relevant
characteristics of reinforcement learning. In this technique, the model keeps
improving its performance using reward feedback to learn the behaviour or pattern.

These algorithms are often specific to a particular problem, e.g. the Google
self-driving car, or AlphaGo, where a bot competes with humans and even with
itself to become a better and better player of the game of Go.

Each time we feed in data, the agent learns and adds the data to its knowledge,
which becomes its training data. So, the more it learns, the better trained and
hence more experienced it becomes.

Example: Consider that you are training an AI agent to play a game like chess.
The agent explores different moves and receives positive or negative feedback
based on the outcome. Reinforcement learning also finds applications in areas
such as robotics, where agents learn to perform tasks by interacting with their
surroundings.
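To make the reward-feedback idea concrete, below is a toy tabular Q-learning
sketch in Python (NumPy only). The environment, a hypothetical 1-D corridor of
five states with a reward only at the goal, is invented purely for illustration.

    # Toy Q-learning sketch: actions are 0 = left, 1 = right.
    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.9, 0.1     # learning rate, discount, exploration
    rng = np.random.default_rng(0)

    for episode in range(500):
        s = 0
        while s != n_states - 1:                       # episode ends at the goal
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # reward feedback updates the value estimate of the chosen action
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next

    print(Q)   # the learned values favour moving right in every state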

Examples of machine learning

• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant, or an unusual
sound in your car engine
• Prediction:
– Future stock prices or currency exchange rates
• The web contains a lot of data; tasks with very big datasets often use machine
learning, especially if the data is noisy or non-stationary.
• Spam filtering, fraud detection
• Recommendation systems:
– Lots of noisy data

Applications of machine learning

1. Image Recognition

Image recognition is one of the most common applications of machine learning.
It is used to identify objects, persons, places, digital images, etc. A popular
use case of image recognition and face detection is the automatic friend tagging
suggestion:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we
upload a photo with our Facebook friends, we automatically get a tagging
suggestion with names, and the technology behind this is machine learning's face
detection and recognition algorithm.

2. Speech Recognition

Speech recognition is the process of converting voice instructions into text; it
is also known as "speech to text" or "computer speech recognition". At present,
machine learning algorithms are widely used in various speech recognition
applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition
technology to follow voice instructions.
3. Traffic prediction

Google Maps predicts traffic conditions, such as whether traffic is clear,
slow-moving, or heavily congested, with the help of two sources:
- the real-time location of vehicles from the Google Maps app and road sensors
- the average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It takes
information from the user and sends it back to its database to improve
performance.

4. Email Spam and Malware Filtering

Whenever we receive a new email, it is automatically filtered as important,
normal, or spam. Important mail arrives in our inbox marked with the important
symbol, while spam emails go to our spam box; the technology behind this is
machine learning. Below are some spam filters used by Gmail:
1. Content Filter
2. Header filter
3. General blacklists filter
4. Rules-based filters
5. Permission filters

5. Product recommendations

Machine learning is widely used by various e-commerce and entertainment companies
such as Amazon, Netflix, etc., for recommending products to the user. Whenever we
search for a product on Amazon, we start getting advertisements for the same
product while surfing the internet in the same browser, and this is because of
machine learning.
Bias and Variance

Bias
Bias is the inability of a model to capture the true relationship in the data,
because of which there is some difference or error between the model's predicted
value and the actual value. These differences between the actual (or expected)
values and the predicted values are known as bias error, or error due to bias.
Bias is a systematic error that occurs due to wrong assumptions made in the
machine learning process.

Let Y be the true value of a parameter, and let Ŷ be an estimator of Y based on a
sample of data. Then the bias of the estimator is given by:

    Bias(Ŷ) = E[Ŷ] - Y

where E[Ŷ] is the expected value of the estimator Ŷ. Bias measures how well the
model fits the data.

Low Bias: Low bias value means fewer assumptions are taken to build the target
function. In this case, the model will closely match the training dataset.
High Bias: High bias value means more assumptions are taken to build the target
function. In this case, the model will not match the training dataset closely.

Variance
Variance is the measure of spread in data from its mean position. In machine
learning, variance is the amount by which the performance of a predictive model
changes when it is trained on different subsets of the training data. More
specifically, variance is the variability of the model: how sensitive it is to a
different subset of the training dataset, i.e. how much it adjusts when trained on
a new subset of the training data.

Let Y be the actual values of the target variable, and let Ŷ be the predicted
values. The variance of a model is the expected value of the squared difference
between the predicted values and the expected value of the predicted values:

    Var(Ŷ) = E[(Ŷ - E[Ŷ])²]

where E[Ŷ] is the expected value of the predicted values, averaged over all the
training data.
Variance errors are either low-variance or high-variance errors.

Low variance: Low variance means that the model is less sensitive to changes in
the training data and can produce consistent estimates of the target function with
different subsets of data from the same distribution. Combined with high bias,
this is the case of underfitting, where the model fails to generalize on both
training and test data.
High variance: High variance means that the model is very sensitive to changes
in the training data and can result in significant changes in the estimate of the
target function when trained on different subsets of data from the same
distribution. This is the case of overfitting, where the model performs well on
the training data but poorly on new, unseen test data: it fits the training data
so closely that it fails on new data.
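The two formulas above can be checked numerically. The following sketch (NumPy
only) uses a made-up true value and the sample mean as the estimator; the
distribution and sample size are illustrative assumptions.

    # Estimate bias and variance of an estimator over many training samples.
    import numpy as np

    rng = np.random.default_rng(0)
    true_value = 5.0                        # Y, the quantity being estimated
    estimates = []
    for _ in range(2000):                   # many different training samples
        sample = rng.normal(true_value, 2.0, size=20)
        estimates.append(sample.mean())     # Y_hat on this sample
    estimates = np.array(estimates)

    bias = estimates.mean() - true_value                      # E[Y_hat] - Y
    variance = ((estimates - estimates.mean()) ** 2).mean()   # E[(Y_hat - E[Y_hat])^2]
    print(f"bias = {bias:.4f}, variance = {variance:.4f}")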

Data preprocessing

In the real world, the available data is often:
1. Incomplete data
2. Inaccurate data
3. Outlier data
4. Data with missing values
5. Data with inconsistent values
6. Duplicate data
Data preprocessing improves the quality of the data and hence the results of data
mining and machine learning techniques. The raw data must be preprocessed to give
accurate results. The process of detecting and removing errors in data is called
data cleaning. Data wrangling means making the data processable for machine
learning algorithms. Some data errors include human errors such as typographical
errors or incorrect measurements, and structural errors like improper data
formats. Data errors can also arise from omission and duplication of attributes.
Noise is a random component and involves distortion of a value or the introduction
of spurious objects. The term noise is often used when the data has a spatial or
temporal component. Certain deterministic distortions, such as those in the form
of a streak, are known as artifacts.
Data preprocessing involves the following steps:
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Splitting dataset into training and test set
7. Feature scaling

1. Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a
machine learning model works entirely on data. The data collected for a particular
problem in a proper format is known as the dataset.
Datasets come in different formats for different purposes: for example, the
dataset needed to create a machine learning model for a business purpose will be
different from the dataset required for a medical problem such as liver-patient
diagnosis. So each dataset is different from another dataset. To use the dataset
in our code, we usually put it into a CSV file. However, sometimes we may also
need to use an HTML or xlsx file.
2. Importing libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. These libraries are used to perform some specific
jobs. There are three specific libraries that we will use for data preprocessing,
which are:

i) Numpy
ii) Matplotlib
iii) Pandas
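In code, these three libraries are typically imported with their conventional
aliases:

    import numpy as np                 # numerical arrays and linear algebra
    import matplotlib.pyplot as plt    # plotting and visualisation
    import pandas as pd                # tabular data handling (DataFrames)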

3. Importing dataset
We need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a
working directory. To set a working directory in Spyder IDE, we need to follow
the below steps:

a) Save your Python file in the directory which contains the dataset.
b) Go to the File explorer option in Spyder IDE, and select the required directory.
c) Click the F5 button or the run option to execute the file.
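A minimal sketch of loading the dataset with pandas is shown below; the file name
"Data.csv" and the assumption that the last column holds the target are made up
for illustration.

    import pandas as pd

    dataset = pd.read_csv("Data.csv")   # read the CSV from the working directory
    X = dataset.iloc[:, :-1].values     # independent variables (all but last column)
    y = dataset.iloc[:, -1].values      # dependent variable (last column)
    print(dataset.head())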

4. Finding missing data


If our dataset contains some missing data, then it may create a huge problem for
our machine learning model. Hence it is necessary to handle missing values
present in the dataset. There are mainly two ways to handle missing data, which
are:
By deleting the particular row: The first way is commonly used to deal with null
values. In this way, we simply delete the specific row or column that contains
null values.
By calculating the mean: In this way, we calculate the mean of the column or row
that contains the missing value and put it in place of the missing value.
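Both ways can be sketched with pandas and scikit-learn; the column names "Age" and
"Salary" are hypothetical examples, not taken from a real dataset.

    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.read_csv("Data.csv")

    # Option 1: delete every row that contains a null value
    df_dropped = df.dropna()

    # Option 2: replace missing numeric values with the column mean
    imputer = SimpleImputer(strategy="mean")
    df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])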

5. Encoding Categorical data


Categorical data is data that has categories; for example, in our dataset there
are two categorical variables, Country and Purchased.
Since a machine learning model works entirely on mathematics and numbers, a
categorical variable in the dataset may create trouble while building the model.
So it is necessary to encode these categorical variables into numbers.
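A short sketch of encoding the Country and Purchased columns, assuming pandas and
scikit-learn are available and that Purchased takes only two values:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv("Data.csv")

    # one-hot encode the multi-class Country column
    df = pd.get_dummies(df, columns=["Country"])

    # label-encode the binary Purchased column (e.g. No -> 0, Yes -> 1)
    df["Purchased"] = LabelEncoder().fit_transform(df["Purchased"])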

6. Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set
and a test set. This is one of the crucial steps of data preprocessing, as it lets
us check how well our machine learning model generalizes. Suppose we train our
machine learning model on one dataset and then test it on a completely different
dataset; the model will then have difficulty understanding the correlations
between them.

If we train our model very well and its training accuracy is very high, but its
performance decreases when we provide it with a new dataset, the model is not
generalizing. So we always try to build a machine learning model that performs
well with the training set and also with the test dataset. Here, we can define
these datasets as:

Training set: a subset of the dataset used to train the machine learning model;
we already know the output.
Test set: a subset of the dataset used to test the machine learning model; by
using the test set, the model predicts the output.
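In practice the split is usually done with scikit-learn; the sketch below reuses
the X and y arrays from the earlier loading example, and the 80/20 split is a
common, assumed choice rather than a fixed rule.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)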
7. Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is
a technique to standardize the independent variables of the dataset to a specific
range. In feature scaling, we put our variables in the same range and on the same
scale so that no variable dominates the others.
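A minimal feature-scaling sketch with scikit-learn: the scaler is fitted on the
training data only and the same transformation is then applied to the test data.

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training set
    X_test_scaled = scaler.transform(X_test)         # reuse them on the test set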

Noise removal

Noise is a random error or variance in a measured value. It results from
inaccurate measurements, inaccurate data collection, or irrelevant information.
The following can be reasons for noisy data:
i) Errors in data collection, such as malfunctioning sensors or human error during
data entry, can introduce noise into the data.
ii) Noise can also be introduced by measurement mistakes, such as inaccurate
instruments or environmental conditions.
iii) Another form of noise in data is inherent variability resulting from either
natural fluctuations or unforeseen events.
iv) If data pretreatment operations like normalization or transformation are not
done appropriately, they may unintentionally add noise.
v) Inaccurate data point labeling or annotation can introduce noise and affect the
learning process.
Noise can be reduced by using binning, a method where the given data values are
sorted and distributed into equal-frequency bins, which are also called buckets.
The binning method then uses the neighbouring values to smooth the noisy data.
Some of the techniques commonly used are 'smoothing by bin means', where the mean
of the bin replaces the values in the bin; 'smoothing by bin medians', where the
bin median replaces the bin values; and 'smoothing by bin boundaries', where each
bin value is replaced by the closest bin boundary (the maximum and minimum values
of a bin are called its bin boundaries). Binning methods may also be used as a
discretization technique.
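Smoothing by bin means can be sketched as follows; the twelve sorted values and
the choice of three bins are illustrative assumptions.

    import numpy as np

    values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # sorted data
    bins = np.split(values, 3)                  # three equal-frequency bins
    # replace every value by the mean of its bin (smoothing by bin means)
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print(smoothed)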

Noise removal techniques:


Data preprocessing: It consists of methods to improve the quality of the data and
lessen noise from errors or inconsistencies, such as data cleaning, normalization,
and outlier elimination.
Fourier Transform: The Fourier Transform is a mathematical technique used to
transform signals from the time or spatial domain to the frequency domain. In the
context of noise removal, it can help identify and filter out noise by
representing the signal as a combination of different frequencies. Relevant
frequencies can be retained while noise frequencies are filtered out (a small
sketch is given after these techniques).
Constructive Learning: Constructive learning involves training a machine learning
model to distinguish between clean and noisy data instances. This approach
typically requires labeled data where the noise level is known. The model learns
to classify instances as either clean or noisy, allowing for the removal of noisy
data points from the dataset.
Autoencoders: Autoencoders are neural network architectures that consist of an
encoder and a decoder. The encoder compresses the input data into a lower-
dimensional representation, while the decoder reconstructs the original data from
this representation. Autoencoders can be trained to reconstruct clean signals while
effectively filtering out noise during the reconstruction process.
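The Fourier Transform approach mentioned above can be sketched with NumPy on a
synthetic 1-D signal; the signal, the noise level, and the 10 Hz cut-off are
assumptions made for illustration.

    import numpy as np

    t = np.linspace(0, 1, 500, endpoint=False)
    clean = np.sin(2 * np.pi * 5 * t)                  # 5 Hz signal of interest
    noisy = clean + 0.5 * np.random.default_rng(0).normal(size=t.size)

    spectrum = np.fft.rfft(noisy)
    freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
    spectrum[freqs > 10] = 0                           # crude low-pass filter at 10 Hz
    denoised = np.fft.irfft(spectrum, n=t.size)
    print(np.abs(denoised - clean).mean())             # smaller than for the noisy signal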
Normalization

Normalization is an essential step in the preprocessing of data for machine
learning models, and it is a feature scaling technique. Normalization is
especially crucial for data manipulation: scaling the range of data down or up
before it is used in subsequent stages, in fields such as soft computing, cloud
computing, etc.
Data normalization improves the consistency and comparability of
different predictive models by standardizing the range of independent
variables or features within a dataset, leading to more steady and
dependable results.
Although there are many feature normalization techniques in machine learning, a
few of them are most frequently used. These are as follows:

Min-Max Scaling:
This technique is also simply referred to as scaling. The Min-Max scaling method
shifts and rescales the values of the attributes so that they end up ranging
between 0 and 1.
Standardization scaling:
Standardization scaling is also known as Z-score normalization, in which values
are centred around the mean with a unit standard deviation: the mean of the
attribute becomes zero and the resulting distribution has a unit standard
deviation. Mathematically, we calculate the standardized value by subtracting the
mean from the feature value and dividing by the standard deviation.
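Both techniques can be written out directly with NumPy; the small array of values
is purely illustrative.

    import numpy as np

    x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

    min_max = (x - x.min()) / (x.max() - x.min())   # rescales values into [0, 1]
    z_score = (x - x.mean()) / x.std()              # zero mean, unit standard deviation
    print(min_max)    # [0.   0.25 0.5  0.75 1.  ]
    print(z_score)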

There are several reasons for the need for data normalization as follows:

i) Normalisation is essential to machine learning for a number of reasons.


Throughout the learning process, it guarantees that every feature
contributes equally, preventing larger-magnitude features from
overshadowing others.
ii) It enables faster convergence of algorithms for optimisation, especially
those that depend on gradient descent. Normalisation improves the
performance of distance-based algorithms like k-Nearest Neighbours.
iii) Normalisation improves overall performance by addressing model
sensitivity problems in algorithms such as Support Vector Machines and
Neural Networks.
iv) Because it assumes uniform feature scales, it also supports the use of
regularisation techniques like L1 and L2 regularisation.
v) In general, normalisation is necessary when working with attributes that
have different scales; otherwise, the effectiveness of a significant attribute
that is equally important (on a lower scale) could be diluted due to other
attributes having values on a larger scale.

Advantages of Data Normalization


1. More clustered indexes could potentially be produced.
2. Index searching is accelerated, which leads to quicker data retrieval.
3. Quicker data modification commands.
4. The removal of redundant and null values to produce more compact data.
5. Reduction of anomalies resulting from data modification.
6. Conceptual clarity and simplicity of upkeep, enabling simple adaptations
to changing needs.
7. Because more rows can fit on a data page with narrower tables,
searching, sorting, and index creation are more efficient.

Disadvantages of Data Normalization


1. It gets harder to link tables together when the information is spread across
multiple tables, and it becomes more difficult to understand the database as a
whole.
2. Because normalized data is stored as coded values rather than the actual data,
tables contain codes instead of real values, so you have to keep consulting the
lookup (query) tables.
3. This information model is hard to query because it is designed for programs,
not for ad hoc queries; query tools built on top of it, composed of SQL
accumulated over time, frequently perform this function. If you do not first
understand the needs of the client, it may be challenging to demonstrate knowledge
and understanding.
4. A comprehensive understanding of the various normal forms is essential to
completing the normalization process successfully. Careless use can lead to a
poor design with significant anomalies and inconsistent data.
