ML - Unit 1
Definition
“Machine learning enables a machine to automatically learn
from data, improve performance from experiences, and predict
things without being explicitly programmed”.
Machine learning overcomes the limitations of hand-written rules. The machine
learns how the input and output data are correlated and derives a rule itself,
so programmers do not need to write new rules each time new data arrives. The
algorithms adapt in response to new data and experiences to improve efficacy
over time. One crucial task of the data scientist is to choose carefully which
data to provide to the machine.
The machine uses algorithms to simplify reality and capture this discovery as
a model.
The learning stage is used to describe the data and summarize it into a model.
New data are transformed into a feature vector, passed through the model, and
yield a prediction.
In other words, the previously trained model can be used to make inferences on
new data.
Supervised machine learning can be classified into two types of problems, which are
given below:
Classification
Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the
output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or
"Blue", etc. Classification algorithms predict the categories present in the dataset.
b) Regression
Regression algorithms are used to solve regression problems, in which the output
variable is continuous and there is a relationship (often assumed linear) between
the input and output variables. They are used to predict continuous quantities,
such as market trends and weather.
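The difference between the two problem types can be illustrated with a minimal sketch in pure Python. The data, the 1-nearest-neighbour classifier, and the neighbour-averaging regressor below are all illustrative assumptions, not algorithms from the notes: the classifier returns a category, while the regressor returns a continuous value.

```python
# Minimal sketch (toy data, illustrative names): classification predicts a
# category; regression predicts a continuous value.

def nearest_label(x, data):
    """Classification: return the label of the closest training point."""
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

def nearest_value(x, data, k=2):
    """Regression: average the target values of the k closest points."""
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

labelled = [(1.0, "Red"), (2.0, "Red"), (8.0, "Blue"), (9.0, "Blue")]
numeric  = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]

print(nearest_label(1.5, labelled))   # categorical output ("Red")
print(nearest_value(2.5, numeric))    # continuous output (25.0)
```

The same input type (a number) leads to a categorical answer in one case and a numeric answer in the other, which is exactly the classification/regression split described above.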
Disadvantages:
These algorithms are not able to solve complex tasks.
They may predict the wrong output if the test data differ from the training
data.
Training the algorithms can require a lot of computational time.
4. Unsupervised Machine Learning
1) Clustering
Clustering groups the objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the
objects of other groups.
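The grouping idea can be sketched in a few lines of pure Python. The centroids and points below are made-up numbers for illustration: each object is assigned to the cluster centre it is closest to, so similar objects land in the same group (this is the assignment step that clustering algorithms such as k-means repeat).

```python
# Minimal sketch of cluster assignment: each 1-D point joins the group whose
# centre (centroid) it is most similar (closest) to.

def assign_cluster(point, centroids):
    """Return the index of the nearest centroid for a 1-D point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))

centroids = [2.0, 10.0]            # two hypothetical cluster centres
points = [1.5, 2.5, 9.0, 11.0]
groups = [assign_cluster(p, centroids) for p in points]
print(groups)   # points near 2.0 fall in group 0, points near 10.0 in group 1
```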
2) Association
Association rule learning is an unsupervised learning technique, which finds
interesting relations among variables within a large dataset.
The main aim of this learning algorithm is to find dependencies between data items
and map the variables accordingly, for example so that a retailer can maximize
profit. It is mainly applied in market basket analysis, web usage mining,
continuous production, etc.
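A minimal market-basket sketch, with hypothetical transactions, shows the two core measures behind association rule learning: the support of an itemset (how often it appears) and the confidence of a rule "if A then B" (how often B appears among baskets containing A).

```python
# Toy market-basket data (assumed, illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread"}))               # bread appears in 3 of 4 baskets
print(confidence({"bread"}, {"milk"}))  # milk in 2 of the 3 bread baskets
```

Rules with high support and high confidence (e.g. "bread → milk" here) are the "interesting relations" the text refers to.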
Disadvantages:
The output of an unsupervised algorithm can be less accurate because the
dataset is not labeled.
Working with unsupervised learning is more difficult because it uses an
unlabelled dataset that does not map to known outputs.
Semi-Supervised Learning
Continuity Assumption: The algorithm assumes that the points which are closer
to each other are more likely to have the same output label.
Cluster Assumption: The data can be divided into discrete clusters and points in
the same cluster are more likely to share an output label.
Manifold Assumption: The data lie approximately on a manifold of much lower
dimension than the input space. This assumption allows the use of distances and
densities which are defined on a manifold.
Applications of Semi-Supervised Learning
1. Speech Analysis: Since labeling audio files is a very labor-intensive task,
Semi-Supervised learning is a natural approach to this problem.
2. Internet Content Classification: Labeling every webpage is an impractical and
infeasible process, so Semi-Supervised learning algorithms are used instead. Even
the Google search algorithm uses a variant of Semi-Supervised learning to rank the
relevance of a webpage for a given query.
3. Protein Sequence Classification: Since DNA strands and protein sequences are
typically very long, Semi-Supervised learning has become prominent in this field.
Reinforcement learning
Neuron in Biology
The activation function calculates the output value for the neuron. This
output value is then passed on to the next layer of the neural network
through another synapse.
This serves as a broad overview of deep learning neurons.
Illustration of an ANN (biological term → ANN equivalent):
Cell → Neuron
Dendrites → Weights or interconnections
Soma → Net input
Axon → Output
Missing or incomplete records. It is difficult to obtain every data point for
every record in a dataset. Missing data sometimes appear as empty cells or as a
particular character, such as a question mark.
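For instance, a minimal pure-Python sketch (with hypothetical records) of handling such placeholders: detect the empty cells and "?" markers, then impute them with the mean of the observed values. Mean imputation is just one common strategy, assumed here for illustration.

```python
# Hypothetical raw column where "" and "?" mark missing entries.
raw_ages = ["25", "?", "31", "", "28"]

# Keep only the observed values and compute their mean.
observed = [float(v) for v in raw_ages if v not in ("?", "")]
mean_age = sum(observed) / len(observed)

# Replace each missing entry with the mean of the observed values.
cleaned = [float(v) if v not in ("?", "") else mean_age for v in raw_ages]
print(cleaned)
```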
Ethical Considerations:
Consider ethical implications, biases, and fairness in the data and
model. Ensure that the machine learning system does not
inadvertently perpetuate or exacerbate existing biases.
Documentation:
Document the entire machine learning pipeline, including data
sources, preprocessing steps, model architecture, and deployment
details.
This documentation is crucial for future reference, collaboration,
and troubleshooting.
User Interface (if applicable):
If the machine learning system is user-facing, design a user interface
that facilitates interaction and provides meaningful insights.
Ensure that users understand the system's capabilities and
limitations.
Security:
Implement security measures to protect the machine learning system
from potential attacks or unauthorized access, especially if it
involves sensitive data.
Compliance and Regulations:
Ensure that the machine learning system complies with relevant
regulations and standards, especially if it involves sensitive data or is
deployed in regulated industries.
A major issue when using machine learning algorithms is the lack of both
quality and quantity of data.
Data plays a significant role in machine learning, and it must be of good quality
as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to
less accuracy in classification and low-quality results.
The training data must cover all cases that have already occurred as well as
those that are likely to occur.
If there is too little training data, there will be sampling noise in the
model, called a non-representative training set.
Such a model will not be accurate in its predictions and will be biased
towards one class or group.
Overfitting occurs when the model or the algorithm fits the training data too
well. Conversely, whenever a machine learning model is trained with too little
data, it produces incomplete and inaccurate results and destroys the accuracy
of the model.
Hence, regular monitoring and maintenance become compulsory.
Different results for different actions require data changes; hence editing of
code, as well as resources for monitoring it, also become necessary.
8. Customer Segmentation
To identify the customers who respond to the recommendations shown by the
model and those who do not even check them.
The machine learning process is very complex, which is also another major
issue faced by machine learning engineers and data scientists.
There are many hit-and-trial experiments; hence the probability of error is
higher than expected.
These errors occur when certain elements of the dataset are weighted heavily
or given more importance than others.
Biased data leads to inaccurate results, skewed outcomes, and other analytical
errors.
A machine learning model is said to be good if the training data has a good
set of features, with few or no irrelevant features.
The FIND-S algorithm starts from the most specific hypothesis and generalizes
it by considering only positive examples.
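The idea can be sketched in pure Python on a toy "EnjoySport"-style dataset (attribute names and values below are assumptions for illustration). Starting from the first positive example as the most specific hypothesis, an attribute is generalized to "?" only when a later positive example disagrees with it; negative examples are ignored.

```python
# Minimal FIND-S sketch: generalize only on positive examples.

def find_s(examples):
    """examples: list of (attribute_tuple, label), label 'yes'/'no'."""
    positives = [attrs for attrs, label in examples if label == "yes"]
    hypothesis = list(positives[0])        # most specific consistent start
    for attrs in positives[1:]:
        for i, value in enumerate(attrs):
            if hypothesis[i] != value:     # disagreement -> generalize to "?"
                hypothesis[i] = "?"
    return hypothesis

data = [
    (("sunny", "warm", "high"), "yes"),
    (("sunny", "warm", "low"),  "yes"),
    (("rainy", "cold", "high"), "no"),     # negative: ignored by FIND-S
]
print(find_s(data))   # ['sunny', 'warm', '?']
```

Only the third attribute differs between the two positive examples, so only it is generalized; the negative example has no effect on the hypothesis.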
o Input Nodes or Input Layer:
This is the primary component of the Perceptron, which accepts the initial data into
the system for further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. Weight
is directly proportional to the strength of the associated input neuron in deciding
the output. Further, bias can be considered as the intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the
neuron will fire or not.
o Sign function
o Step function
o Sigmoid function
The data scientist uses the activation function to take a decision based on the
problem statement and form the desired outputs.
The activation function chosen (e.g., Sign, Step, or Sigmoid) may differ between
perceptron models, depending on whether the learning process is slow or suffers
from vanishing or exploding gradients.
The perceptron model begins with the multiplication of all input values and their
weights, then adds these values together to create the weighted sum.
Then this weighted sum is applied to the activation function 'f' to obtain the desired
output.
This activation function is also known as the step function and is represented by 'f'.
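The steps above can be sketched in a few lines of pure Python. The weights and bias are made-up values (chosen here so the perceptron happens to compute logical AND, an illustrative assumption): inputs are multiplied by their weights, the bias is added to form the weighted sum, and the step function 'f' decides whether the neuron fires.

```python
# Minimal perceptron forward pass: weighted sum, then step activation.

def step(z):
    """Step activation 'f': fire (1) if the weighted sum reaches 0."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(weighted_sum)

# Hypothetical weights/bias under which the unit behaves like logical AND.
weights, bias = [1.0, 1.0], -1.5
print(perceptron([1, 1], weights, bias))   # fires: 1
print(perceptron([1, 0], weights, bias))   # does not fire: 0
```

With inputs (1, 1) the weighted sum is 0.5, so the neuron fires; with (1, 0) it is -0.5, so it does not.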
14. Linear Regression
o Regression is a supervised learning technique.
o Regression analysis is a statistical method to model the relationship between
a dependent (target) variable and one or more independent (predictor)
variables.
o Linear regression is a method which is used for predictive analysis.
o It is one of the simplest regression algorithms, modeling the relationship
between continuous variables.
o Linear regression shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence called linear
regression.
o If there is only one input variable (x), then such linear regression is
called simple linear regression. And if there is more than one input variable,
then such linear regression is called multiple linear regression.
o The relationship between variables in a linear regression model can be
illustrated with an example: predicting the salary of an employee on the
basis of years of experience.
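The salary example can be worked through with a minimal least-squares sketch in pure Python. The years and salary figures are made-up numbers for illustration; the closed-form formulas are the standard ones for simple linear regression, b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x).

```python
# Minimal simple linear regression: fit y = b0 + b1*x by least squares.

def fit_simple_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x).
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the line must pass through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

years  = [1.0, 2.0, 3.0, 4.0]      # years of experience (hypothetical)
salary = [30.0, 35.0, 40.0, 45.0]  # salary in thousands (hypothetical)
b0, b1 = fit_simple_linear(years, salary)
print(b0, b1)                      # intercept 25.0, slope 5.0
prediction = b0 + b1 * 5.0         # predicted salary for 5 years: 50.0
```

Because the toy data lie exactly on a line, the fit recovers it perfectly; on real salary data the fitted line would only approximate the points.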