Unit 1 Notes
Introduction
Machine Learning is the "field of study that gives computers the ability to learn without
being explicitly programmed" (Arthur Samuel). "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure
P, if its performance at tasks in T, as measured by P, improves with
experience E." (Tom Mitchell)
Learning = improving with experience at some task:
- Improve over task T,
- with respect to performance measure P,
- based on experience E.
- E.g., learn to play checkers:
- T : Play checkers
- P : % of games won in world tournament
- E: opportunity to play against self
Model
A machine learning model is a program that can be used to find
patterns in data and make decisions from an unseen dataset. It can be any one of the
following:
- Mathematical equations
- Relational diagrams like graphs/trees
- Logical if/else rules
- Groupings called clusters
Training set, Test set and Validation set
– Training data is used for learning the parameters of the model.
– Validation data is used for deciding which model and settings work best.
– Test data is used to get a final, unbiased estimate of how well the
model works. We expect this estimate to be worse than on the
validation data.
We could then re-divide the total dataset to get another unbiased estimate of the
true error rate.
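As a rough sketch of such a split (assuming scikit-learn is available and that X and y are an existing feature matrix and label vector, which are hypothetical here), the dataset can be divided three ways by calling train_test_split twice:

from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Then carve a validation set out of the remaining 80% (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# Tune the model on X_val, and report the final unbiased estimate on X_test only once.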
Supervised learning
Supervised learning is when a model is trained on a "labelled dataset".
Labelled datasets have both input and output parameters.
In supervised learning, algorithms learn to map inputs to the correct outputs.
Both the training and validation datasets are labelled.
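A minimal supervised learning sketch (using scikit-learn and its built-in iris dataset; the choice of classifier is only an illustration, not prescribed by these notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Labelled dataset: X holds the inputs, y holds the correct outputs.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model learns the mapping from inputs to labels on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))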
Unsupervised learning
Unsupervised learning is a type of machine learning technique in which an
algorithm discovers patterns and relationships using unlabeled data.
Unlike supervised learning, unsupervised learning doesn’t involve providing the
algorithm with labeled target outputs.
The primary goal of Unsupervised learning is often to discover hidden patterns,
similarities, or clusters within the data, which can then be used for various
purposes, such as data exploration, visualization, dimensionality reduction, and
more.
Example: Consider a dataset that contains information about the purchases
customers made from a shop. Through clustering, the algorithm can group
customers with similar purchasing behavior, revealing segments of
potential customers without predefined labels. This type of information can help
businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned
below:
Clustering - Clustering is the process of grouping data points into clusters based
on their similarity. This technique is useful for identifying patterns and
relationships in data without the need for labeled examples.
Association - Association rule learning is a technique for discovering
relationships between items in a dataset. It identifies rules that indicate the
presence of one item implies the presence of another item with a specific
probability.
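As a small clustering illustration (the customer numbers below are invented; scikit-learn's KMeans is assumed to be available):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: [number of purchases, average spend] per customer (toy values).
X = np.array([[2, 10], [3, 12], [25, 80], [27, 75], [26, 90], [4, 9]])

# Group the points into 2 clusters based on similarity alone, with no labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster index assigned to each customer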
Reinforcement learning
Trial and error with delayed reward are the most relevant characteristics of reinforcement
learning. In this technique, the model keeps improving its performance using
reward feedback to learn the desired behavior or pattern.
These algorithms are built for a particular problem, e.g. the Google self-driving
car, or AlphaGo, where a bot competes with humans and even with itself to become
a better and better player of the game of Go.
Each time the agent acts, it learns from the outcome and adds that experience to its
knowledge, which serves as training data. So the more it learns, the better trained,
and hence more experienced, it becomes.
Example: Consider training an AI agent to play a game like chess.
The agent explores different moves and receives positive or negative feedback
based on the outcome. Reinforcement learning also finds applications where agents
learn to perform tasks by interacting with their surroundings.
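A toy sketch of the reward-feedback loop (a two-armed bandit with made-up reward probabilities; this only illustrates trial and error, not the full algorithms behind AlphaGo or a chess agent):

import random

true_reward_prob = [0.3, 0.7]   # hidden reward probability of each action (assumed)
value_estimate = [0.0, 0.0]     # the agent's current estimate of each action's value
counts = [0, 0]

for step in range(1000):
    # Epsilon-greedy: mostly exploit the best-looking action, sometimes explore.
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = value_estimate.index(max(value_estimate))
    reward = 1 if random.random() < true_reward_prob[action] else 0
    # Update the running estimate from the reward feedback (trial and error).
    counts[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)  # approaches the true reward probabilities over time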
Applications of machine learning include:
1. Image recognition
2. Speech recognition
3. Product recommendations
Bias
Bias is the inability of the model to capture the true relationship in the data, because of
which there is some difference or error between the model's predicted values and the
actual values. These differences between actual or expected values and the predicted
values are known as bias error, or error due to bias. Bias is a systematic
error that occurs due to wrong assumptions in the machine learning process.
Low Bias: Low bias value means fewer assumptions are taken to build the target
function. In this case, the model will closely match the training dataset.
High Bias: High bias value means more assumptions are taken to build the target
function. In this case, the model will not match the training dataset closely.
Variance
Variance is the measure of spread in data from its mean position. In machine
learning, variance is the amount by which the performance of a predictive model
changes when it is trained on different subsets of the training data. More
specifically, variance measures how sensitive the model is to a different subset of
the training dataset, i.e. how much the learned function changes on a new
subset of the training data.
Let $Y$ be the actual values of the target variable, and $\hat{Y}$ the predicted values
of the target variable. Then the variance of a model can be measured as the
expected value of the square of the difference between the predicted values and the
expected value of the predicted values:
$$\mathrm{Variance} = E\big[(\hat{Y} - E[\hat{Y}])^2\big]$$
where $E[\hat{Y}]$ is the expected value of the predicted values, averaged
over all the training data.
Variance errors are classified as either low-variance or high-variance errors.
Low variance: Low variance means that the model is less sensitive to changes in
the training data and can produce consistent estimates of the target function with
different subsets of data from the same distribution. Combined with high bias, this is
the case of underfitting, where the model fails to generalize on both training and test data.
High variance: High variance means that the model is very sensitive to changes
in the training data and can result in significant changes in the estimate of the
target function when trained on different subsets of data from the same
distribution. This is the case of overfitting when the model performs well on the
training data but poorly on new, unseen test data, because it fits the training data so
closely that it fails to generalize.
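This sensitivity can be seen directly by refitting the same model on different bootstrap subsets of a made-up noisy dataset and measuring how much its predictions move (a rough sketch; the polynomial degrees are arbitrary stand-ins for a simple and a flexible model):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # toy data

def predictions_over_subsets(degree, n_subsets=200):
    preds = []
    for _ in range(n_subsets):
        idx = rng.choice(x.size, size=x.size, replace=True)  # bootstrap subset
        coefs = np.polyfit(x[idx], y[idx], degree)
        preds.append(np.polyval(coefs, x))
    return np.array(preds)

for degree in (1, 9):
    preds = predictions_over_subsets(degree)
    # Variance: expected squared deviation of the predictions from their mean.
    variance = np.mean(np.var(preds, axis=0))
    print(f"degree {degree}: variance across subsets = {variance:.4f}")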
Data preprocessing
To perform data preprocessing with Python, we first import the following libraries:
i) NumPy
ii) Matplotlib
iii) Pandas
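These libraries are conventionally imported with short aliases, for example:

import numpy as np               # numerical arrays and mathematical operations
import matplotlib.pyplot as plt  # plotting and visualization
import pandas as pd              # tabular data (DataFrames) and CSV import/export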
3. Importing dataset
We need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as the
working directory in the IDE (for example, Spyder).
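Once the working directory is set, the dataset can be read with pandas (the file name 'Data.csv' and the assumption that the last column is the dependent variable are placeholders for illustration):

import pandas as pd

# Read the CSV file placed in the current working directory (hypothetical name).
dataset = pd.read_csv('Data.csv')

# Matrix of independent variables X and dependent variable y,
# assuming the label is stored in the last column of the file.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values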
6. Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set
and test set. This is one of the crucial steps of data preprocessing as by doing this,
we can enhance the performance of our machine learning model. Suppose we have
trained our machine learning model on one dataset and then test it on a
completely different dataset. Then it will be difficult for our model to
understand the correlations in the data.
If we train our model very well and its training accuracy is very high, but we then
provide a new dataset to it, its performance will decrease. So we always
try to make a machine learning model which performs well with the training set
and also with the test dataset. Here, we can define these datasets as:
Training set: A subset of the dataset used to train the machine learning model; we
already know the output for it.
Test set: A subset of the dataset used to test the machine learning model; the model
predicts the output for the test set.
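A minimal sketch of this split with scikit-learn, continuing with the X and y arrays from the previous step:

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the test set; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)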
7. Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a
technique to standardize the independent variables of the dataset in a specific
range. In feature scaling, we put our variables in the same range and in the same
scale so that no single variable dominates the others.
Noise removal
Min-Max Scaling:
This technique is also referred to as min-max normalization. As discussed
above, the Min-Max scaling method shifts and rescales the values of the
attributes so that they end up ranging between 0 and 1.
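A short sketch of Min-Max scaling on a single toy attribute (the values are made up; scikit-learn's MinMaxScaler is assumed):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18.0], [25.0], [40.0], [60.0]])  # toy attribute values

# x' = (x - min) / (max - min), so every value ends up between 0 and 1.
scaled = MinMaxScaler().fit_transform(ages)
print(scaled.ravel())  # approximately [0.  0.167  0.524  1. ]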
Standardization scaling:
Standardization scaling is also known as Z-score normalization, in which
values are centered around the mean with a unit standard deviation, which
means the mean of the attribute becomes zero and the resultant distribution has a unit
standard deviation. Mathematically, we can calculate the standardization
by subtracting the mean from the feature value and dividing by the standard
deviation.
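The corresponding sketch for standardization (z-score), reusing the same toy attribute:

import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[18.0], [25.0], [40.0], [60.0]])  # same toy attribute

# z = (x - mean) / standard deviation: the mean becomes 0 and the std becomes 1.
standardized = StandardScaler().fit_transform(ages)
print(standardized.ravel())
print(standardized.mean(), standardized.std())  # approximately 0 and 1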
There are several reasons for the need for data normalization as follows: