Machine L
Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. That year, he
published a paper in the IBM Journal of Research and Development with what was, at the
time, an obscure and curious title. The paper investigated the use of machine learning in
the game of checkers “to verify the fact that a computer can be programmed so that it will
learn to play a better game of checkers than can be played by the person who wrote the
program.”
He defined machine learning as “the field of study that gives computers the ability to
learn without being explicitly programmed.” However, there is no universally accepted
definition for machine learning. Different authors define the term differently.
A commonly cited definition of machine learning is: “machine learning
focuses on the development of computer programs that can access data and use
it to learn for themselves.”
Machine learning is a growing technology which enables computers to learn
automatically from past data.
Machine learning is recognized as a field of computer science; however, it does not
take the same approach as traditional computer science. Whereas traditional
computing is driven by algorithms that humans create and manage,
machine learning is driven by algorithms that let the machine itself learn from data and
improve with experience.
Machine Learning is one of the most popular sub-fields of Artificial Intelligence. Machine
learning concepts are used almost everywhere, such as healthcare, finance,
infrastructure, marketing, self-driving cars, recommendation systems, chatbots, social
sites, gaming, cyber security, and many more.
Supervised learning: In this type of machine learning, data scientists supply algorithms
with labeled training data and define the variables they want the algorithm to assess for
correlations. Both the input and the output of the algorithm are specified.
Unsupervised learning: This type of machine learning involves algorithms that train on
unlabeled data. The algorithm scans through data sets looking for any meaningful
connection. The data that algorithms train on as well as the predictions or
recommendations they output are predetermined.
The patterns found can be used to group data points into subsets. Many deep learning
models, including neural networks, can be trained as unsupervised algorithms.
Unsupervised learning algorithms are good for the following tasks:
Clustering: Splitting the dataset into groups based on similarity.
Anomaly detection: Identifying unusual data points in a data set.
Association mining: Identifying sets of items in a data set that frequently occur
together.
Dimensionality reduction: Reducing the number of variables in a dataset.
Semi-supervised Learning
This approach to machine learning involves a mix of the two preceding types. Data
scientists may feed an algorithm a small amount of labeled training data together with
unlabeled data, and the model is free to explore the data on its own and develop its own
understanding of the dataset.
Reinforcement learning
Data scientists typically use reinforcement learning to teach a machine to complete a
multi-step process for which there are clearly defined rules. Data scientists program an
algorithm to complete a task and give it positive or negative cues as it works out how to
complete a task. But for the most part, the algorithm decides on its own what steps to
take along the way.
The main differences between Supervised and Unsupervised learning are given
below:
Overfitting in Machine Learning
A statistical model is said to be overfitted when it fits its training data closely but does
not make accurate predictions on testing data. When a model is trained too closely on
its training data, it starts learning from the noise and inaccurate data entries in the data
set. Testing on held-out data then reveals high variance: the model does not categorize
the data correctly because it has absorbed too many details and too much noise.
Non-parametric and non-linear methods are especially prone to overfitting, because
these types of machine learning algorithms have more freedom in building the model
from the dataset and can therefore build unrealistic models.
The causes of overfitting can be numerous:
Complex models. Using an overly complex model for a simple task can lead to
overfitting. For instance, using a high-degree polynomial regression for data
that's linear in nature.
Insufficient data. If there's not enough data, the model might find patterns that
don't really exist.
Noisy data. If the training data contains errors or random fluctuations, an
overfitted model will treat these as patterns.
The impact of overfitting is significant. While an overfitted model will have high accuracy
on its training data, it will perform poorly on new, unseen data because it's not
generalized enough.
Preventing overfitting is better than curing it. Here are some steps to take:
Simpler models. Start with a simpler model and only add complexity if necessary.
More data. If possible, collect more data. The more data a model is trained on,
the better it can generalize.
Overfitting vs Underfitting
Overfitting: High accuracy on training data, low accuracy on new data. Imagine a
GPS that works perfectly in your hometown but gets lost everywhere else.
Underfitting: Low accuracy on both training and new data. It's like a GPS that
can't even navigate your hometown.
Both overfitting and underfitting lead to poor predictions on new data, but for different
reasons. While overfitting is often due to an overly complex model or noisy data,
underfitting might result from an overly simple model or not enough features.
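The contrast above can be turned into a rough diagnostic. The sketch below is only illustrative: the function name diagnose_fit and the two cut-offs (min_train_accuracy and max_gap) are assumptions chosen for this example, not standard values.

```python
def diagnose_fit(train_accuracy, test_accuracy,
                 min_train_accuracy=0.7, max_gap=0.1):
    """Roughly label a model from its training and test accuracy.

    Both thresholds are illustrative assumptions, not standard values.
    """
    if train_accuracy < min_train_accuracy:
        # Poor even on data it has already seen: the model is too simple.
        return "underfitting"
    if train_accuracy - test_accuracy > max_gap:
        # Good on seen data, clearly worse on unseen data.
        return "overfitting"
    return "reasonable fit"


print(diagnose_fit(0.99, 0.70))
print(diagnose_fit(0.55, 0.52))
print(diagnose_fit(0.90, 0.88))
```

In practice the acceptable gap depends on the problem, so the thresholds should be treated as tunable.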
The training data is the largest subset of the original dataset and is used to
train or fit the machine learning model. The training data is fed to the ML
algorithms first, which lets them learn how to make predictions for the given task.
What is Test Dataset?
Once we train the model with the training dataset, it's time to test the model with the
test dataset. This dataset is used to evaluate the performance of the model and to check
that the model generalizes well to new or unseen data. The test dataset is another
subset of the original data, which is independent of the training dataset. However, it has
similar features and a similar class probability distribution, and it serves as a
benchmark for model evaluation once model training is completed. Test data is a
well-organized dataset that contains data for each type of scenario for a given problem
that the model would face when used in the real world. Usually, the test dataset is
approximately 20-25% of the total original data for an ML project.
Splitting the dataset into train and test sets is one of the important parts of data
pre-processing, because it lets us measure how well the model generalizes and so
build a model with better predictive performance.
To see why, suppose we train our model on one dataset and then test it on a
completely different dataset from another source. The model will not have learned the
correlations between the features of that second dataset.
Therefore, training and testing the model with two unrelated datasets will
decrease its measured performance. Hence it is important to split a single dataset into
two parts, i.e., a train set and a test set.
In this way, we can easily evaluate the performance of our model. For example, if it
performs well on the training data but does not perform well on the test dataset, the
model is probably overfitted. For splitting the dataset, we can use
the train_test_split function of scikit-learn.
Consider a case where you have labeled data for 1,000 records. One way to train the
model is to expose all 1,000 records during the training process. Then you take a small
set of the same data to test the model, which would give good results in this case.
But this is not an accurate way of testing, because the model has already seen those
records. So, we set aside a portion of the data, called the ‘test set’, before starting the
training process. The remaining data is called the ‘training set’, which we use for training
the model. The training set passes through the model multiple times until the accuracy is
high and errors are minimized.
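The hold-out procedure described above can be sketched with the standard library alone. This is a minimal illustration, not the scikit-learn implementation: the function name split_train_test, the 20% test fraction, and the fixed seed are assumptions made for the example.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a test set.

    The test set is removed before any training happens, so the
    model never sees those records while it is being fitted.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 1,000 record ids standing in for 1,000 labeled records.
train_set, test_set = split_train_test(range(1000), test_fraction=0.2)
print(len(train_set), len(test_set))
```

With scikit-learn available, train_test_split offers the same idea with extra options; the sketch only shows the mechanics.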
Missing values are a common issue in machine learning. They are data points that are
absent for a specific variable in a dataset, which leaves the information incomplete and can
harm the accuracy and dependability of your models. They can be represented in various
ways, such as blank cells, null values, or special symbols like “NA” or “unknown.” These
missing data points pose a significant challenge in data analysis and can lead to
inaccurate or biased results. One of the easiest ways to handle missing or corrupted data
is to drop those rows or columns or to replace the missing entries with some other value.
Missing Values
In pandas, isnull() and dropna() help to find the columns/rows with missing data and
drop them.
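The two strategies mentioned above, dropping incomplete rows and replacing missing entries, can be sketched in plain Python. This is not the pandas API: the helper names (is_missing, drop_incomplete, fill_with_mean), the set of missing-value markers, and the sample rows are all made up for illustration.

```python
# Markers treated as "missing" in this sketch (an assumption).
MISSING_MARKERS = (None, "", "NA", "unknown")


def is_missing(value):
    """Return True if the value is one of the missing-value markers."""
    return value in MISSING_MARKERS


def drop_incomplete(rows):
    """Keep only the rows in which every field is present."""
    return [row for row in rows
            if not any(is_missing(v) for v in row.values())]


def fill_with_mean(rows, column):
    """Replace missing entries of a numeric column with the column mean."""
    present = [row[column] for row in rows if not is_missing(row[column])]
    mean = sum(present) / len(present)
    return [{**row, column: mean if is_missing(row[column]) else row[column]}
            for row in rows]


rows = [
    {"age": 30, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 50, "city": "NA"},
]
complete_rows = drop_incomplete(rows)
filled_rows = fill_with_mean(rows, "age")
print(complete_rows)
print([row["age"] for row in filled_rows])
```

Dropping rows is simplest but discards data; filling with the mean keeps every row at the cost of inventing a plausible value.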
Here,
For actual values:
Total Yes = 12+1 = 13
Total No = 3+9 = 12
Similarly, for predicted values:
Total Yes = 12+3 = 15
Total No = 1+9 = 10
For a model to be accurate, the values along the main diagonal should be high. The total
sum of all the values in the matrix equals the total number of observations in the test data set.
For the above matrix, total observations = 12+3+1+9 = 25
Now, accuracy = sum of the values along the main diagonal / total observations
= (12+9) / 25
= 21 / 25
= 84%
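The arithmetic above can be checked with a few lines of Python. The variable names are assumptions; the mapping of the four counts to cells follows from the row and column totals given in the text (12 appears in both "Yes" totals, so it is the correctly predicted "Yes" count).

```python
# Cell counts from the worked example above.
true_positive = 12   # actual Yes, predicted Yes
false_positive = 3   # actual No,  predicted Yes
false_negative = 1   # actual Yes, predicted No
true_negative = 9    # actual No,  predicted No

# Totals for the actual and predicted classes.
actual_yes = true_positive + false_negative
actual_no = false_positive + true_negative
predicted_yes = true_positive + false_positive
predicted_no = false_negative + true_negative

# Accuracy is the diagonal sum divided by all observations.
total = true_positive + false_positive + false_negative + true_negative
accuracy = (true_positive + true_negative) / total
print(f"accuracy = {accuracy:.0%}")
```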
False Positive and False Negative
False positives are those cases that wrongly get classified as True but are False.
False negatives are those cases that wrongly get classified as False but are True.
In the term ‘False Positive,’ the word ‘Positive’ refers to the ‘Yes’ row of the predicted
value in the confusion matrix. The complete term indicates that the system has
predicted it as a positive, but the actual value is negative.
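As a small illustration of these definitions, the sketch below counts each outcome from a list of actual labels and a list of predicted labels. The function name confusion_counts and the five sample labels are made up for the example.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if actually positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: wrong if actually positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


actual = ["Yes", "No", "Yes", "No", "No"]
predicted = ["Yes", "Yes", "No", "No", "No"]
counts = confusion_counts(actual, predicted)
print(counts)
```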
Additionally, there is a stage called 'hyperparameter tuning', in which the model's
hyperparameters are fine-tuned to get the best performance. This stage usually comes
right after model selection or after the model evaluation stage, so that the performance
can be further optimized.
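One common way to carry out this stage is an exhaustive grid search, sketched below. The text does not prescribe a particular method, so grid search is an assumption here, as are the names grid_search and toy_score and the hyperparameters "depth" and "rate". The toy_score function merely stands in for "train the model with these settings and measure it on validation data".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for a validation score; it peaks at
    # depth=3 and rate=0.1 purely by construction.
    return -(params["depth"] - 3) ** 2 - (params["rate"] - 0.1) ** 2


best_params, best_score = grid_search(
    {"depth": [1, 3, 5], "rate": [0.01, 0.1, 1.0]}, toy_score
)
print(best_params)
```

The score should come from validation data rather than the test set, so that the test set remains an unbiased final check.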
One of the primary differences between machine learning and deep learning is that
feature engineering is done manually in machine learning. In the case of deep learning,
the model consisting of neural networks will automatically determine which features to
use (and which not to use).
Here we train the model using historical data that consists of emails
categorized as spam or not spam. This labeled information is fed as input to
the model.
Healthcare Diagnosis
Sentiment Analysis
Fraud Detection
There are two techniques used in unsupervised learning: clustering and association.
Clustering
Clustering problems involve dividing data into subsets. These subsets, also called
clusters, contain data points that are similar to each other. Different clusters reveal
different details about the objects, unlike classification or regression.
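To make the idea concrete, here is a minimal sketch of k-means, one widely used clustering algorithm, on one-dimensional data. The function name kmeans_1d, the fixed starting centroids, and the six sample points are assumptions chosen to keep the example deterministic.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster numbers by repeatedly assigning and re-averaging."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


final_centroids, final_clusters = kmeans_1d(
    [1, 2, 3, 10, 11, 12], centroids=[0, 5]
)
print(final_centroids, final_clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between points.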
Association
For example, an e-commerce website can suggest other items for you to buy, based on
the prior purchases that you have made, spending habits, items in your wishlist, other
customers’ purchase habits, and so on.
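A very small version of this idea is to count which pairs of items are bought together often. The sketch below is an illustration only: the function name frequent_pairs, the min_support threshold, and the sample baskets are made up, and real association-rule algorithms go further than pair counting.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

A shop could use such a result to suggest milk to a customer who has just added bread to the basket.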
K-means KNN
The Naive Bayes classifier is called ‘naive’ because it makes assumptions that may or
may not turn out to be correct.
The algorithm assumes that the presence of one feature of a class is not related to the
presence of any other feature (absolute independence of features), given the class
variable.
For instance, a fruit may be considered to be a cherry if it is red in color and round in
shape, regardless of other features. This assumption may or may not be right (as an
apple also matches the description).
How Will You Know Which Machine Learning Algorithm to Choose for Your
Classification Problem?
While there is no fixed rule to choose an algorithm for a classification problem, you can
follow these guidelines:
If the training dataset is small, use models that have low variance and high
bias.
If the training dataset is large, use models that have high variance and little
bias.
Random Forest
A ‘random forest’ is a supervised machine learning algorithm that is generally used for
classification problems. It operates by constructing multiple decision trees during the
training phase. The random forest chooses the decision of the majority of the trees as
the final decision.
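The final voting step can be sketched as follows. Only the vote is shown, not the training of the trees; the function name majority_vote and the sample predictions are assumptions for the example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    On a tie, Counter.most_common keeps the class seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Predictions from five hypothetical trees for a single email.
votes = ["spam", "not spam", "spam", "spam", "not spam"]
print(majority_vote(votes))
```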
Bias
Bias in a machine learning model occurs when the predicted values are far from the
actual values. Low bias indicates a model whose predicted values are very close to
the actual ones.
Underfitting: High bias can cause an algorithm to miss the relevant relations between
features and target outputs.
Variance
Variance refers to the amount the target model will change when trained with different
training data. For a good model, the variance should be minimized.
Overfitting: High variance can cause an algorithm to model the random noise in the
training data rather than the intended outputs.
Precision
Precision is the ratio of the number of events you correctly recall to the total number of
events you recall (a mix of correct and wrong recalls).
Recall
Recall is the ratio of the number of events you correctly recall to the total number of events.
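In classification terms, the events you correctly recall are the true positives, the wrong recalls are the false positives, and the events you miss are the false negatives. The sketch below applies both ratios to the counts from the confusion matrix example earlier in these notes (12 true positives, 3 false positives, 1 false negative); the function names are assumptions.

```python
def precision(true_positive, false_positive):
    """Correct positive predictions out of all positive predictions."""
    return true_positive / (true_positive + false_positive)


def recall(true_positive, false_negative):
    """Correct positive predictions out of all actual positives."""
    return true_positive / (true_positive + false_negative)


p = precision(12, 3)   # 12 of the 15 predicted "Yes" were right
r = recall(12, 1)      # 12 of the 13 actual "Yes" were found
print(f"precision = {p:.3f}, recall = {r:.3f}")
```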
A decision tree builds classification (or regression) models as a tree structure, with
datasets broken up into ever-smaller subsets while developing the decision tree, literally
in a tree-like way with branches and nodes. Decision trees can handle both categorical
and numerical data.
Logistic Regression
Logistic regression is a classification algorithm used to predict a binary outcome for a
given set of independent variables.
Logistic regression outputs a probability between 0 and 1, which is converted to a 0 or 1
using a threshold value, generally 0.5. Any value above 0.5 is considered as 1, and any
value below 0.5 is considered as 0.
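The thresholding step can be sketched as below. The weights and bias are made-up numbers rather than fitted values, the function names are assumptions, and a value of exactly 0.5 is mapped to 0 here because the text leaves that case open.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the predicted probability into a 0/1 class label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return 1 if probability > threshold else 0


# Hypothetical one-feature model: weight 1.0, bias -1.0.
print(predict_label([2.0], weights=[1.0], bias=-1.0))
print(predict_label([0.0], weights=[1.0], bias=-1.0))
```

Raising or lowering the threshold trades false positives against false negatives, which is why 0.5 is a default rather than a rule.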
The K nearest neighbor (KNN) algorithm is a classification algorithm that assigns a new
data point to the neighboring group to which it is most similar.
In K nearest neighbors, K is a positive integer: the number of neighbors consulted. So,
for every new data point we want to classify, we compute to which neighboring group it
is closest.
Let us classify an object using the following example. Consider there are three clusters:
Football
Basketball
Tennis ball
Let the new data point to be classified be a black ball. We use KNN to classify it. Assume
K = 5 (initially).
Observe that the five selected points do not all belong to the same cluster. There are three
tennis balls and one each of basketball and football.
When multiple classes are involved, we prefer the majority. Here the majority is with
the tennis ball, so the new data point is assigned to this cluster.
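The example can be reproduced in a few lines. The coordinates below are invented so that the five nearest neighbors are three tennis balls, one basketball and one football, as in the text; the function name knn_classify and the use of Euclidean distance are also assumptions.

```python
import math
from collections import Counter


def knn_classify(query, labeled_points, k=5):
    """Label a point by majority vote among its k nearest neighbors."""
    by_distance = sorted(labeled_points,
                         key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# (coordinates, label) pairs; the positions are made up.
balls = [
    ((1.0, 1.0), "tennis ball"),
    ((1.2, 0.8), "tennis ball"),
    ((0.9, 1.3), "tennis ball"),
    ((2.0, 2.0), "basketball"),
    ((0.0, 2.5), "football"),
    ((6.0, 6.0), "basketball"),
    ((7.0, 1.0), "football"),
]
black_ball = (1.0, 1.2)
print(knn_classify(black_ball, balls, k=5))
```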
F1 score
The F1 score is a metric that combines both Precision and Recall. It is the harmonic
mean of precision and recall.
F1 = 2 * (P * R) / (P + R)
The F1 score is one when both Precision and Recall scores are one.
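The formula translates directly into code. The function name f1_score and the guard for the case where both inputs are zero are assumptions added for the sketch.

```python
def f1_score(precision, recall):
    """F1 = 2 * (P * R) / (P + R), with 0.0 when both inputs are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


print(f1_score(1.0, 1.0))
print(f1_score(0.5, 1.0))
```

Because the two values are combined through their product, a low precision or a low recall pulls the score down more than a simple average would.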
Correlation: Correlation tells us how strongly two random variables are related to each
other. It takes values between -1 and +1.
Covariance: Covariance tells us the direction of the linear relationship between two
random variables. It can take any value between - ∞ and + ∞.
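The difference in range can be seen numerically. The sketch below uses the sample covariance (dividing by n - 1) and the Pearson correlation; the function names and the sample data are assumptions for illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both spreads."""
    return covariance(xs, ys) / math.sqrt(
        covariance(xs, xs) * covariance(ys, ys)
    )


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
ys_scaled = [20, 40, 60, 80]  # same relationship, different units

print(covariance(xs, ys), covariance(xs, ys_scaled))
print(correlation(xs, ys), correlation(xs, ys_scaled))
```

Rescaling one variable multiplies the covariance by the same factor but leaves the correlation unchanged, which is why correlation is the easier of the two to compare across datasets.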
Support vectors are the data points that are nearest to the hyperplane. They influence
the position and orientation of the hyperplane, so removing the support vectors will alter
the position of the hyperplane. The support vectors help us build our support vector
machine model.
Ensemble learning
K-Fold Cross Validation is a popular resampling technique that divides the whole
dataset into K sets (folds) of equal size. Each fold is used once as the test set while the
remaining K-1 folds are used for training.
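The splitting itself can be sketched with index lists. This is a minimal illustration without shuffling; the function name k_fold_indices is an assumption, and when the dataset size is not divisible by K the first folds receive one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(test, train)
```

Every sample lands in a test fold exactly once, so averaging the K scores uses all of the data for evaluation.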