ML-QB-Unit 1

UNIT-1

2 MARKS
1. HUMAN LEARNING?

Human learning is the process by which humans acquire new knowledge, skills, behaviours, or values through experience, study, observation, or being taught, and apply them to work effectively in the complexity of the real world.

2. MACHINE LEARNING DEFINITION?

Machine Learning (ML) is the field of computer science with the help of which
computer systems can make sense of data in much the same way as human beings do.
In simple words, ML is a type of artificial intelligence that extracts patterns out of raw
data by using an algorithm or method. The main focus of ML is to allow computer
systems to learn from experience without being explicitly programmed and without human
intervention.

3. MACHINE LEARNING AUTHOR DEFINITION?

According to Arthur Samuel, machine learning is the field of study that gives computers the ability to learn
without being explicitly programmed.

4. WHAT ARE THE TYPES OF MACHINE LEARNING?

The four main types of machine learning are

supervised,
unsupervised,
semi-supervised, and
reinforcement learning.

5. WHAT ARE THE ALGORITHMS USED IN SUPERVISED LEARNING?

 Linear regression.
 Logistic regression.
 Decision tree.
 SVM algorithm.
 Naive Bayes algorithm.
 KNN algorithm.
 Random forest algorithm.
6. WHAT ARE THE ALGORITHMS USED IN UNSUPERVISED LEARNING?

o K-means clustering
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

7. WHAT IS REINFORCEMENT LEARNING?

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

8. WHAT ARE THE APPLICATIONS OF MACHINE LEARNING?

o Image recognition
o Speech recognition
o Traffic prediction
o E-commerce product recommendations
o Self-driving cars
o Email spam filtering
o Malware detection
o Virtual personal assistants

9. WHAT ARE THE TOOLS OF MACHINE LEARNING?

Machine learning tools include: 1. TensorFlow, 2. PyTorch, 3. Google Cloud
ML Engine, 4. Amazon Machine Learning (AML), 5. Accord.NET, 6. Apache Mahout, 7.
Shogun, 8. Oryx2.

10. WHAT IS THE DIFFERENCE BETWEEN HUMAN LEARNING AND MACHINE
LEARNING?

Humans acquire knowledge through experience, either directly
or shared by others. Machines acquire knowledge through experience shared with them in the
form of past data.

11. WHAT ARE THE TYPES OF HUMAN LEARNING?

Three Major Types of Learning


Learning through association - Classical Conditioning.
Learning through consequences – Operant Conditioning.
Learning through observation – Observational Learning.

12. DEFINITION OF DATA AND ITS TYPES?

Data refers to a systematic record of a specific quantity; it is the collection of the different values of
that quantity. In other words, it is a set of facts and
figures which are useful for a particular purpose such as a survey or an analysis.
Data can be classified as qualitative and quantitative.

13. WHAT IS A TRAINING DATASET?

The training data is the biggest (in size) subset of the original dataset,
which is used to train or fit the machine learning model. Firstly, the training data is fed
to the ML algorithms, which lets them learn how to make predictions for the given task.

14. WHAT IS TESTING DATASET?

Once we train the model with the training dataset, it's time to test the
model with the test dataset. This dataset evaluates the performance of the model and
ensures that the model can generalize well with the new or unseen dataset. The test
dataset is another subset of original data, which is independent of the training dataset.
15. WHAT IS DATA PREPROCESSING?
Data preprocessing, a component of data preparation, describes any type of
processing performed on raw data to prepare it for another data processing
procedure. It has traditionally been an important preliminary step for the data mining
process.

16. WHAT ARE THE ACTIVITIES OF MACHINE LEARNING?

o Collecting data: machines initially learn from the data that you give them.
o Preparing the data: after you have your data, you have to prepare it.
o Choosing a model
o Training the model
o Evaluating the model
o Parameter tuning
o Making predictions

17. WHAT ARE THE ISSUES OF MACHINE LEARNING?

o Inadequate training data
o Poor quality of data
o Non-representative training data
o Overfitting and underfitting

18. WHAT ARE THE PROBLEMS NOT TO BE SOLVED BY MACHINE LEARNING?

Not every problem can be solved by machine learning. Problems that can be solved
directly with simple deterministic rules, or for which little or no representative data is
available, are better handled without machine learning.

19. WHAT DO YOU MEAN BY INTERPOLATE?

Interpolation is a method for generating points between given points.
For example, for the points 1 and 2, we may interpolate and find the points 1.33 and 1.66.
Interpolation has many uses; in machine learning we often deal with missing data in a
dataset, and interpolation is often used to substitute those values.
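A minimal Python sketch of interpolation (assuming NumPy and pandas are installed; the values are invented for illustration):

import numpy as np
import pandas as pd

# Interpolate a value between two known points: for x = 1.5 between (1, 10) and (2, 20) we expect 15.
print(np.interp(1.5, [1, 2], [10, 20]))      # 15.0

# In machine learning, interpolation is often used to fill missing values in a column.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])
print(s.interpolate())                       # NaNs replaced by linearly interpolated values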
13 MARKS

1. MACHINE LEARNING AND TYPES OF MACHINE LEARNING WITH
ALGORITHM EXAMPLES?

Machine learning is a subfield of artificial intelligence, which is broadly defined as the
capability of a machine to imitate intelligent human behavior. Artificial intelligence systems
are used to perform complex tasks in a way that is similar to how humans solve problems.
Types of machine learning:
 Supervised Learning Algorithm
 Unsupervised Learning Algorithm
 Semi-supervised Learning Algorithm
 Reinforcement Learning Algorithm

SUPERVISED LEARNING ALGORITHM:

Supervised learning is a type of machine learning in which the machine needs external
supervision to learn. The supervised learning models are trained using the labeled dataset. Once
the training and processing are done, the model is tested by providing sample test data to check
whether it predicts the correct output.

The goal of supervised learning is to map input data with the output data.

Supervised learning can be divided further into two categories of problem:

 Classification
 Regression

a) Classification

Classification algorithms are used to solve classification problems in which the output
variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The
classification algorithms predict the categories present in the dataset. Some real-world examples
of classification algorithms are spam detection, email filtering, etc. A small illustrative sketch follows the list of algorithms below.

Some popular classification algorithms are given below:


o Random Forest Algorithm
o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
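As an illustrative sketch (not part of the original notes), a tiny scikit-learn classification example using logistic regression; the data values are made up:

from sklearn.linear_model import LogisticRegression

# Toy labelled data: hours studied -> pass (1) / fail (0)
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)                          # train on the labelled dataset
print(clf.predict([[2.5], [5.5]]))     # predicted categories, e.g. [0 1]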

b) Regression

Regression algorithms are used to solve regression problems in which there is a relationship
between the input and output variables and the output is continuous. These are used to predict continuous output
variables, such as market trends, weather prediction, etc. A small illustrative sketch follows the list of algorithms below.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
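A matching regression sketch (again with invented data), predicting a continuous output with scikit-learn:

from sklearn.linear_model import LinearRegression

# Toy labelled data: years of experience -> salary (continuous output)
X = [[1], [2], [3], [4], [5]]
y = [30000, 35000, 40000, 45000, 50000]

reg = LinearRegression()
reg.fit(X, y)                          # fit the relationship between input and output
print(reg.predict([[6]]))              # predicted continuous value, about 55000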

Advantages and Disadvantages of Supervised Learning

Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.


o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Unsupervised Machine Learning

Unsupervised learning is different from the supervised learning technique; as its name suggests,
there is no need for supervision. It means that, in unsupervised machine learning, the machine is
trained using the unlabeled dataset, and the machine predicts the output without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted
dataset according to the similarities, patterns, and differences. Machines are instructed to
find the hidden patterns from the input dataset.

Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:

o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the data. It is a
way to group the objects into a cluster such that the objects with the most similarities remain in
one group and have fewer or no similarities with the objects of other groups. An example of the
clustering algorithm is grouping the customers by their purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
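A minimal K-Means clustering sketch with scikit-learn (the points are invented and form two obvious groups):

from sklearn.cluster import KMeans

# Unlabelled points: two groups around (1, 1) and (8, 8)
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)         # cluster assignment for each point
print(labels)                          # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)         # learned cluster centres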

2) Association

Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it
can generate maximum profit. This algorithm is mainly applied in Market Basket analysis,
Web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
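A simplified, library-free sketch of the idea behind association rule learning: counting the support of item pairs in a few invented market-basket transactions (real Apriori implementations add candidate generation and pruning on top of this):

from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

# Support of an itemset = fraction of transactions containing it
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.items():
    print(pair, "support =", count / len(transactions))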

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:
o These algorithms can be used for more complicated tasks than supervised ones
because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks as getting an unlabeled dataset
is easier than getting a labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate as the dataset is not
labelled, and the algorithms are not trained with the exact output beforehand.
o Working with unsupervised learning is more difficult as it works with an unlabelled
dataset that does not map to a known output.

Semi-Supervised Learning

Semi-supervised learning is a type of machine learning algorithm that lies between
supervised and unsupervised machine learning. It represents the intermediate ground
between supervised (with labelled training data) and unsupervised (with no labelled
training data) algorithms and uses a combination of labelled and unlabeled datasets during the
training period.

To overcome the drawbacks of supervised learning and unsupervised learning algorithms,
the concept of semi-supervised learning is introduced. The main aim of semi-supervised
learning is to effectively use all the available data, rather than only labelled data as in
supervised learning.

Advantages and disadvantages of Semi-supervised Learning

Advantages:

o The algorithm is simple and easy to understand.
o It is highly efficient.
o It is used to solve the drawbacks of supervised and unsupervised learning algorithms.

Disadvantages:

o Iteration results may not be stable.
o We cannot apply these algorithms to network-level data.
o Accuracy is low.
Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a
software component) automatically explores its surroundings by hit and trial, taking
actions, learning from experiences, and improving its performance. The agent gets rewarded for
each good action and punished for each bad action; hence the goal of a reinforcement learning
agent is to maximize the rewards.

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing
the tendency that the required behaviour would occur again by adding something. It
enhances the strength of the behaviour of the agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to positive RL. It increases the tendency that the specific behaviour would
occur again by avoiding the negative condition.

Advantages

o It helps in solving complex real-world problems which are difficult to be solved by
general techniques.
o The learning model of RL is similar to the learning of human beings; hence the most accurate
results can be found.
o It helps in achieving long-term results.

Disadvantage

o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states, which can weaken the
results.

The curse of dimensionality limits reinforcement learning for real physical systems.
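A minimal sketch of the reward-driven loop described above, using tabular Q-learning on an invented five-state corridor where the agent is rewarded only for reaching the last state (the environment and hyperparameters are assumptions chosen for illustration):

import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # action-value table
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != n_states - 1:      # an episode ends at the goal state
        # epsilon-greedy action selection (explore vs exploit)
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move Q towards reward + discounted future value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))           # learned policy: non-terminal states should prefer action 1 (right)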

2. APPLICATIONS OF MACHINE LEARNING?

Machine learning is a buzzword for today's technology, and it is growing very rapidly day by
day. We are using machine learning in our daily life even without knowing it, for example in Google
Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications:

1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. The popular use case of image recognition
and face detection is, Automatic friend tagging suggestion:

Facebook provides us a feature of automatic friend tagging suggestion. Whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with names, and the
technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face
recognition and person identification in the picture.

2. Speech Recognition

While using Google, we get an option of "Search by voice," it comes under speech recognition,
and it's a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known as
"Speech to text", or "Computer speech recognition." At present, machine learning algorithms
are widely used by various applications of speech recognition. Google assistant, Siri, Cortana,
and Alexa are using speech recognition technology to follow the voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.

It predicts the traffic conditions, such as whether traffic is cleared, slow-moving, or heavily
congested, with the help of two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time

Everyone who is using Google Maps is helping this app to make it better. It takes information
from the user and sends it back to its database to improve the performance.

4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some
product on Amazon, we start getting advertisements for the same product while surfing the internet
on the same browser, and this is because of machine learning.

Google understands the user interest using various machine learning algorithms and suggests
products as per customer interest.

Similarly, when we use Netflix, we find some recommendations for entertainment series,
movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company,
is working on self-driving cars. It uses an unsupervised learning method to train the car models to
detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, and spam.
We always receive an important mail in our inbox with the important symbol and spam emails in
our spam box, and the technology behind this is Machine learning. Below are some spam filters
used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
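A minimal scikit-learn sketch of spam filtering with the Naïve Bayes classifier mentioned above; the example messages are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "free lottery ticket claim now", "project report attached"]
labels = ["spam", "normal", "spam", "normal"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # bag-of-words features

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))   # likely ['spam']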
7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As
the name suggests, they help us in finding information using our voice instructions. These
assistants can help us in various ways just by our voice instructions, such as playing music, calling
someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

These assistants record our voice instructions, send them to the server on the cloud, decode them
using ML algorithms, and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, there may be various ways that a fraudulent
transaction can take place, such as fake accounts, fake IDs, and money stolen in the middle of a
transaction. So, to detect this, a feed-forward neural network helps us by checking whether it is
a genuine transaction or a fraudulent transaction.

For each genuine transaction, the output is converted into some hash values, and these values
become the input for the next round. For each genuine transaction, there is a specific pattern
which changes for a fraudulent transaction; hence, the network detects it and makes our online transactions
more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a
risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural
network is used for the prediction of stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With this, medical
technology is growing very fast and is able to build 3D models that can predict the exact position
of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and we are not aware of the language, it is not a problem
at all, as machine learning also helps us here by converting the text into languages we know.
Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural
machine translation system that translates the text into our familiar language, and this is called automatic
translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which is used with image recognition to translate the text from one language to another
language.

3. TOOLS USED IN MACHINE LEARNING?

Machine learning is one of the most revolutionary technologies that is making lives
simpler. It is a subfield of artificial intelligence, which analyses the data, builds the
model, and makes predictions. Due to its popularity and great applications, every tech
enthusiast wants to learn and build new machine learning apps.

There are different tools, software, and platforms available for machine
learning, and new software and tools are evolving day by day. Although there are
many options and much availability of machine learning tools, choosing the best tool for your
model is a challenging task. If you choose the right tool for your model, you can make it
faster and more efficient.

1. TensorFlow

TensorFlow is one of the most popular open-source libraries used to train and build both
machine learning and deep learning models. It provides a JS library and was developed
by Google Brain Team. It offers a powerful library, tools, and resources for numerical
computation, specifically for large scale machine learning and deep learning projects. It enables
data scientists/ML developers to build and deploy machine learning applications efficiently.

Features:

Below are some top features:

o TensorFlow enables us to build and train our ML models easily.
o It also enables you to run the existing models using TensorFlow.js.
o It helps in building a neural network.
o It provides support for distributed computing.
o It is open-source software and highly flexible.
o It also enables the developers to perform numerical computations using data flow graphs.
o It enables us to easily deploy and train the model in the cloud.
o It can be used in two ways, i.e., by installing through NPM or by script tags.
o It is free to use.
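A minimal TensorFlow/Keras sketch of building and training a small model; the tiny dataset (a logical OR function) is an assumption chosen only for illustration:

import tensorflow as tf

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 1, 1, 1]                                   # OR function

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=200, verbose=0)             # train the small neural network
print(model.predict(X))                            # probabilities close to [0, 1, 1, 1]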
2. PyTorch

PyTorch is an open-source machine learning framework, which is based on the Torch library.
This framework is free and open-source and was developed by FAIR (Facebook's AI Research lab).
It is one of the popular ML frameworks, which can be used for various applications, including
computer vision and natural language processing. Different deep learning software is built on
top of PyTorch, such as PyTorch Lightning, Hugging Face's Transformers, Tesla Autopilot, etc.

It specifies a Tensor class containing an n-dimensional array that can perform tensor
computations along with GPU support.

Features:

Below are some top features:

o It enables the developers to create neural networks using the Autograd module.
o It is more suitable for deep learning research, with good speed and flexibility.
o It can also be used on cloud platforms.
o It includes tutorial courses, various tools, and libraries.
o It allows changing the network behaviour dynamically without any lag.
o It is easy to use due to its hybrid front-end.
o It is freely available.
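A minimal PyTorch sketch of its tensor and autograd features (illustrative only):

import torch

# An n-dimensional tensor with gradient tracking enabled
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

y = (x ** 2).sum()        # a simple scalar function of x
y.backward()              # autograd computes dy/dx automatically

print(x.grad)             # gradients equal 2 * x
# If a GPU is available, the same tensors can be moved to it, e.g. x.to("cuda")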

3. Google Cloud ML Engine

While training a classifier with a huge amount of data, a computer system might not perform
well. However, various machine learning or deep learning projects require millions or billions
of training examples, or the algorithm being used takes a long time for execution. In
such a case, one should go for the Google Cloud ML Engine. It is a hosted platform where ML
developers and data scientists build and run optimum-quality machine learning models. It
provides a managed service that allows developers to easily create ML models with any type of
data and of any size.

Features:

Below are the top features:

o Provides machine learning model training, building, deep learning and predictive
modelling.
o The two services, namely, prediction and training, can be used independently or
combinedly.
o It can be used by enterprises, i.e., for identifying clouds in a satellite image, responding
faster to emails of customers.
o It can be widely used to train a complex model.

4. Amazon Machine Learning (AML)

Amazon provides a great number of machine learning tools, and one of them is Amazon
Machine Learning or AML. Amazon Machine Learning (AML) is a cloud-based and robust
machine learning software application, which is widely used for building machine learning
models and making predictions. Moreover, it integrates data from multiple sources,
including Redshift, Amazon S3, or RDS.

Features

Below are some top features:

o AML offers visualization tools and wizards.


o Enables the users to identify the patterns, build mathematical models, and make
predictions.
o It provides support for three types of models, which are multi-class classification, binary
classification, and regression.
o It permits users to import the model into or export the model out from Amazon Machine
Learning.

5. Accord.NET

Accord.NET is a .NET-based machine learning framework, which is used for scientific computing.
It is combined with audio and image processing libraries that are written in C#. This framework
provides different libraries for various applications in ML, such as pattern recognition, linear
algebra, and statistical data processing. Popular packages of the Accord.NET framework
are Accord.Statistics, Accord.Math, and Accord.MachineLearning.

Features

Below are some top features:


o It contains 38+ kernel functions.
o It consists of more than 40 non-parametric and parametric estimations of statistical
distributions.
o It is used for creating production-grade computer audition, computer vision, signal
processing, and statistics apps.
o It contains more than 35 hypothesis tests, including two-way and one-way ANOVA tests,
non-parametric tests such as the Kolmogorov-Smirnov test, and many more.

6. Apache Mahout

Apache Mahout is an open-source project of Apache Software Foundation, which is used for
developing machine learning applications mainly focused on Linear Algebra. It is a distributed
linear algebra framework and mathematically expressive Scala DSL, which enable the
developers to promptly implement their own algorithms. It also provides Java/Scala libraries to
perform Mathematical operations mainly based on linear algebra and statistics.

Features:

Below are some top features:

o It enables developers to implement machine learning techniques, including


recommendation, clustering, and classification.
o It is an efficient framework for implementing scalable algorithms.
o It consists of matrix and vector libraries.
o It provides support for multiple distributed backends(including Apache Spark)
o It runs on top of Apache Hadoop using the MapReduce paradigm.

7. Shogun

Shogun is a free and open-source machine learning software library, which was created
by Gunnar Raetsch and Soeren Sonnenburg in the year 1999. This software library is written
in C++ and supports interfaces for different languages such as Python, R, Scala, C#, Ruby, etc.,
using SWIG (Simplified Wrapper and Interface Generator). The main focus of Shogun is on
different kernel-based algorithms such as Support Vector Machines (SVM), K-Means clustering,
etc., for regression and classification problems. It also provides a complete implementation of
Hidden Markov Models.

Features:
Below are some top features:

o The main focus of Shogun is on different kernel-based algorithms such as Support Vector
Machines (SVM), K-Means clustering, etc., for regression and classification problems.
o It provides support for the use of pre-calculated kernels.
o It also offers a combined kernel using the Multiple Kernel Learning functionality.
o It was initially designed for processing huge datasets that consist of up to 10 million
samples.
o It also enables users to work with interfaces in different programming languages such as
Lua, Python, Java, C#, Octave, Ruby, MATLAB, and R.

8. Oryx2

Oryx2 is a realization of the lambda architecture built on Apache Kafka and Apache Spark. It is
widely used for real-time large-scale machine learning projects. It is a framework for building
apps, including packaged end-to-end applications for filtering, regression, classification, and
clustering. It is written in Java and uses technologies including Apache Spark, Hadoop, Tomcat, Kafka, etc.
The latest version of Oryx2 is Oryx 2.8.0.

Features:

Below are some top features:

o It has three tiers: a specialization tier on top providing ML abstractions, a generic lambda
architecture tier, and an end-to-end implementation of the same standard ML algorithms.
o It is well suited for large-scale real-time machine learning projects.
o It contains three layers which are arranged side by side, named the speed
layer, batch layer, and serving layer.
o It also has a data transport layer that transfers data between different layers and receives
input from external sources.

9. Apache Spark MLlib

Apache Spark MLlib is a scalable machine learning library that runs on Apache Mesos, Hadoop,
Kubernetes, standalone, or in the cloud. Moreover, it can access data from different data sources.
It is an open-source cluster-computing framework that offers an interface for complete clusters
along with data parallelism and fault tolerance.
For optimized numerical processing of data, MLlib provides linear algebra packages such as
Breeze and netlib-Java. It uses a query optimizer and physical execution engine for achieving
high performance with both batch and streaming data.

Features

Below are some top features:

o MLlib contains various algorithms, including classification, regression, clustering,
recommendations, association rules, etc.
o It runs on different platforms such as Hadoop, Apache Mesos, Kubernetes, standalone, or in
the cloud, against diverse data sources.
o It contains high-quality algorithms that provide great results and performance.
o It is easy to use as it provides interfaces in Java, Python, Scala, R, and SQL.

10. Google ML kit for Mobile

For mobile app developers, Google brings ML Kit, which is packaged with the expertise of
machine learning and technology to create more robust, optimized, and personalized apps. This
toolkit can be used for face detection, text recognition, landmark detection, image labelling,
and barcode scanning applications. One can also use it for working offline.

Features:

Below are some top features:

o The ML Kit is optimized for mobile.
o It includes the advantages of different machine learning technologies.
o It provides easy-to-use APIs that enable powerful use cases in your mobile apps.
o It includes the Vision API and Natural Language APIs to detect faces, text, and objects,
identify different languages, and provide reply suggestions.

Conclusion

In this topic, we have discussed some popular machine learning tools. However, there are many
more ML tools, but choosing a tool depends completely on the requirements of one's
project, skills, and the price of the tool. Most of these tools are freely available, except for some
tools such as RapidMiner. Each tool works in a different language and provides some
specifications.
4. TYPES OF DATA IN MACHINE LEARNING?

Data is majorly classified into four categories:

 Nominal data
 Ordinal data
 Discrete data
 Continuous data

Let us discuss the different types of data in statistics here with examples.

Qualitative or Categorical Data


Qualitative data, also known as categorical data, describes data that fits into categories.
Qualitative data is not numerical. Categorical information involves categorical
variables that describe features such as a person's gender, home town, etc. Categorical
measures are defined in terms of natural language specifications, not in terms of numbers.
Sometimes categorical data can hold numerical values (quantitative values), but those values do
not have a mathematical sense. Examples of categorical data are birthdate, favourite sport, and
school postcode. Here, the birthdate and school postcode hold quantitative values, but they do
not carry numerical meaning.

Nominal Data
Nominal data is one of the types of qualitative information which helps to label the variables
without providing the numerical value. Nominal data is also called the nominal scale. It cannot
be ordered and measured. But sometimes, the data can be qualitative and quantitative. Examples
of nominal data are letters, symbols, words, gender etc.
The nominal data are examined using the grouping method. In this method, the data are grouped
into categories, and then the frequency or the percentage of the data can be calculated. These
data are visually represented using the pie charts.
Ordinal Data
Ordinal data is a type of data that follows a natural order. The significant feature of
ordinal data is that the difference between the data values is not determined. This variable is
mostly found in surveys, finance, economics, questionnaires, and so on.
The ordinal data is commonly represented using a bar chart. These data are investigated and
interpreted through many visualisation tools. The information may be expressed using tables in
which each row in the table shows the distinct category.

Quantitative or Numerical Data


Quantitative data is also known as numerical data which represents the numerical value (i.e.,
how much, how often, how many). Numerical data gives information about the quantities of a
specific thing. Some examples of numerical data are height, length, size, weight, and so on. The
quantitative data can be classified into two different types based on the data sets. The two
different classifications of numerical data are discrete data and continuous data.

Discrete Data
Discrete data can take only discrete values. Discrete information contains only a finite number of
possible values. Those values cannot be subdivided meaningfully. Here, things can be counted in
whole numbers.
Example: Number of students in the class

Continuous Data
Continuous data is data that can be calculated. It has an infinite number of probable values that
can be selected within a given specific range.
Example: Temperature range
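A small pandas sketch (the column values are invented) showing how the four data categories typically appear in a dataset:

import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female"],        # nominal (categories, no order)
    "grade": ["low", "high", "medium"],            # ordinal (categories with a natural order)
    "num_students": [35, 42, 28],                  # discrete (countable whole numbers)
    "temperature": [36.6, 37.2, 36.9],             # continuous (measurable within a range)
})

# Ordinal data can be given an explicit order:
df["grade"] = pd.Categorical(df["grade"], categories=["low", "medium", "high"], ordered=True)
print(df.dtypes)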

5. MACHINE LEARNING ACTIVITIES?

The typical activities in a machine learning project (see question 16 in the 2-marks section) are: collecting data, preparing the data, choosing a model, training the model, evaluating the model, parameter tuning, and making predictions.

6. HOW TO TRAIN AND TEST THE DATASET IN MACHINE LEARNING?
Train and test datasets are the two key concepts of machine learning, where the training
dataset is used to fit the model, and the test dataset is used to evaluate the model.

What is Training Dataset?

The training data is the biggest (in size) subset of the original dataset, which is used to train
or fit the machine learning model. Firstly, the training data is fed to the ML algorithms, which
lets them learn how to make predictions for the given task.

Input                    Output (Label)
The new UI is great      Positive
Update is really slow    Negative

The training data varies depending on whether we are using Supervised Learning or
Unsupervised Learning Algorithms.

For Unsupervised learning, the training data contains unlabeled data points, i.e., inputs are not
tagged with the corresponding outputs. Models are required to find the patterns from the given
training datasets in order to make predictions.

On the other hand, for supervised learning, the training data contains labels in order to train the
model and make predictions.

The type of training data that we provide to the model is highly responsible for the model's
accuracy and prediction ability. It means that the better the quality of the training data, the better
will be the performance of the model. Training data is approximately more than or equal to 60%
of the total data for an ML project.

What is Test Dataset?

Once we train the model with the training dataset, it's time to test the model with the test dataset.
This dataset evaluates the performance of the model and ensures that the model can generalize
well with the new or unseen dataset. The test dataset is another subset of original data, which is
independent of the training dataset. Usually, the test dataset is approximately 20-25% of the
total original data for an ML project.

At this stage, we can also check and compare the testing accuracy with the training accuracy,
which means how accurate our model is with the test dataset against the training dataset. If the
accuracy of the model on the training data is much greater than that on the testing data, then the model is said
to be overfitting.
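A minimal sketch of comparing training and testing accuracy with scikit-learn (the dataset loader and the model are placeholders chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
train_acc = accuracy_score(y_train, model.predict(x_train))
test_acc = accuracy_score(y_test, model.predict(x_test))

# A training accuracy much higher than the testing accuracy suggests overfitting.
print("train accuracy:", train_acc, "test accuracy:", test_acc)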

The testing data should:

o represent or be a part of the original dataset;
o be large enough to give meaningful predictions.

Need of Splitting dataset into Train and Test set

Splitting the dataset into train and test sets is one of the important parts of data pre-processing, as
by doing so, we can improve the performance of our model and hence give better predictability.
We can understand it as if we train our model with a training set and then test it with a
completely different test dataset, and then our model will not be able to understand the
correlations between the features.

Therefore, if we train and test the model with two different datasets, then it will decrease the
performance of the model. Hence it is important to split a dataset into two parts, i.e., train and
test set.

In this way, we can easily evaluate the performance of our model. Such as, if it performs well
with the training data, but does not perform well with the test dataset, then it is estimated that the
model may be overfitted.

For splitting the dataset, we can use the train_test_split function of scikit-learn.

The below lines of code can be used to split the dataset:

1. from sklearn.model_selection import train_test_split


2. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

In the first line of the above code, we have imported the train_test_split function from
the sklearn library.

In the second line, we have used four variables, which are

o x_train: It is used to represent features for the training data
o x_test: It is used to represent features for the testing data
o y_train: It is used to represent dependent variables for the training data
o y_test: It is used to represent dependent variables for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are
arrays of data, and test_size specifies the size of the test set. The test_size may be
.5, .3, or .2, which tells the dividing ratio of the training and testing sets.
o The last parameter, random_state, is used to set a seed for a random generator so that you
always get the same result, and the most used value for this is 42.

How do training and testing data work in Machine Learning?

Machine learning algorithms enable machines to make predictions and solve problems on
the basis of past observations or experiences. An algorithm takes these experiences or observations
from the training data that is fed to it. Further, one of the great things about ML
algorithms is that they can learn and improve over time on their own, as they are trained with the
relevant training data.

Once the model is trained enough with the relevant training data, it is tested with the test data.
We can understand the whole process of training and testing in three steps, which are as follows:

1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in Supervised
Learning), and the model transforms the training data into text vectors or a number of
data features.
3. Test: In the last step, we test the model by feeding it with the test data/unseen dataset.
This step ensures that the model is trained efficiently and can generalize well.

The above process is typically illustrated as a flowchart of these three steps (feed, define, test).


15 MARKS

1. DATA PREPROCESSING

Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, it is not always the case that we come across clean
and formatted data. And while doing any operation with data, it is mandatory to clean it and put
it in a formatted way. For this, we use the data preprocessing task.
Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format
which cannot be directly used for machine learning models. Data preprocessing is a required task
for cleaning the data and making it suitable for a machine learning model, which also increases
the accuracy and efficiency of the machine learning model.

It involves below steps:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine learning
model completely works on data. The collected data for a particular problem in a proper format
is known as the dataset.

Datasets may be of different formats for different purposes; for example, if we want to create a
machine learning model for a business purpose, then the dataset will be different from the dataset
required for a liver patient problem. So each dataset is different from another dataset. To use the dataset in
our code, we usually put it into a CSV file. However, sometimes we may also need to use an
HTML or XLSX file.

What is a CSV File?

CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and we can use these datasets in
programs.

Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded
from "https://www.superdatascience.com/pages/machine-learning". For real-world
problems, we can download datasets online from various sources such
as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.
We can also create our own dataset by gathering data using various APIs with Python and putting that data
into a .csv file.

2) Importing Libraries

In order to perform data preprocessing using Python, we need to import some predefined Python
libraries. These libraries are used to perform some specific jobs. There are three specific libraries
that we will use for data preprocessing, which are:

Numpy: The Numpy Python library is used for including any type of mathematical operation in the
code. It is the fundamental package for scientific calculation in Python. It also supports
large, multidimensional arrays and matrices. In Python, we can import it as:

1. import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with
this library, we need to import a sub-library pyplot. This library is used to plot any type of charts
in Python for the code. It will be imported as below:

1. import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python libraries
and used for importing and managing the datasets. It is an open-source data manipulation and
analysis library. It will be imported as below:

1. import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets

Now we need to import the datasets which we have collected for our machine learning project.
But before importing a dataset, we need to set the current directory as a working directory. To set
a working directory in Spyder IDE, we need to follow the below steps:

1. Save your Python file in the directory which contains dataset.


2. Go to File explorer option in Spyder IDE, and select the required directory.
3. Click on F5 button or run option to execute the file.

Note: We can set any directory as a working directory, but it must contain the required dataset.

Once the Python file is saved in the folder that contains the required dataset, that folder is set as
the working directory.

read_csv() function:

Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to
read a CSV file and perform various operations on it. Using this function, we can read a CSV file
locally as well as through a URL.

We can use read_csv function as below:

1. data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable to store our dataset, and inside the function, we have
passed the name of our dataset. Once we execute the above line of code, it will successfully
import the dataset into our code. We can also check the imported dataset in the Variable Explorer
section by double-clicking on data_set.

Indexing starts from 0, which is the default indexing in Python. We can also change the format
of our dataset by clicking on the format option.

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent variables)
and the dependent variables from the dataset. In our dataset, there are three independent variables,
Country, Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:

To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used to
extract the required rows and columns from the dataset.

1. x= data_set.iloc[:,:-1].values

In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for all
the columns. Here we have used :-1, because we don't want to take the last column as it contains
the dependent variable. So by doing this, we will get the matrix of features.
By executing the above code, we will get output as:

1. [['India' 38.0 68000.0]


2. ['France' 43.0 45000.0]
3. ['Germany' 30.0 54000.0]
4. ['France' 48.0 65000.0]
5. ['Germany' 40.0 nan]
6. ['India' 35.0 58000.0]
7. ['Germany' nan 53000.0]
8. ['France' 49.0 79000.0]
9. ['India' 50.0 88000.0]
10. ['France' 37.0 77000.0]]

As we can see in the above output, there are only three variables.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

1. y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of dependent
variables.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using Python language for machine learning, then extraction is mandatory,
but for R language it is not required.
4) Handling Missing data:

The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning model.
Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. In this
way, we just delete the specific row or column which consists of null values. But this way is not
so efficient, and removing data may lead to a loss of information which will not give an accurate
output.

By calculating the mean: In this way, we will calculate the mean of the column or row which
contains any missing value and put it in place of the missing value. This strategy is useful
for features which have numeric data such as age, salary, year, etc. Here, we will use this
approach.

To handle missing values, we will use Scikit-learn library in our code, which contains various
libraries for building machine learning models. Here we will use Imputer class
of sklearn.preprocessing library. Below is the code for it:

1. #handling missing data (Replacing missing data with the mean value)
2. from sklearn.preprocessing import Imputer
3. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
4. #Fitting imputer object to the independent variables x.
5. imputer= imputer.fit(x[:, 1:3])
6. #Replacing missing data with the calculated mean value
7. x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],


['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object

As we can see in the above output, the missing values have been replaced with the mean of the remaining
column values.
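Note: the Imputer class used above has been removed in recent scikit-learn versions. A sketch of the equivalent code with the current SimpleImputer API (same idea, mean replacement) would look like this:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')   # nm is numpy, imported earlier
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])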

5) Encoding Categorical data:

Categorical data is data which has some categories; in our dataset, there are two
categorical variables, Country and Purchased.
Since a machine learning model works completely on mathematics and numbers, a categorical
variable in the dataset may create trouble while building the model. So it is
necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will encode the Country variable into numbers. To do this, we will
use the LabelEncoder() class from the preprocessing library.

1. #Catgorical data
2. #for Country Variable
3. from sklearn.preprocessing import LabelEncoder
4. label_encoder_x= LabelEncoder()
5. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This class has
successfully encoded the variables into digits.

But in our case, there are three country categories, and as we can see in the above output, these
categories are encoded into 0, 1, and 2. From these values, the machine learning model may assume
that there is some ordering or correlation between these categories, which will produce the wrong output. To
remove this issue, we will use dummy encoding.

Dummy Variables:

Dummy variables are variables which have values 0 or 1. The value 1 indicates the presence of
that category in a particular column, and the rest of the variables become 0. With dummy encoding, we will
have a number of columns equal to the number of categories.
In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values. For
dummy encoding, we will use the OneHotEncoder class of the preprocessing library.

1. #for Country Variable


2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. label_encoder_x= LabelEncoder()
4. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
5. #Encoding for dummy variables
6. onehot_encoder= OneHotEncoder(categorical_features= [0])
7. x= onehot_encoder.fit_transform(x).toarray()

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,


6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, all the variables are encoded into the numbers 0 and 1 and
divided into three columns.

These columns can also be inspected in the Variable Explorer section by clicking on the x variable.
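Note: the categorical_features argument of OneHotEncoder has likewise been removed in recent scikit-learn versions. A sketch of the current approach uses ColumnTransformer to one-hot encode only the first column:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)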
For Purchased Variable:

1. labelencoder_y= LabelEncoder()
2. y= labelencoder_y.fit_transform(y)

For the second categorical variable, we will only use the labelencoder object of the LabelEncoder class.
Here we are not using the OneHotEncoder class because the Purchased variable has only two
categories, yes or no, which are automatically encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])



6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and a test set. This
is one of the crucial steps of data preprocessing, as by doing this we can enhance the
performance of our machine learning model.

Suppose we have trained our machine learning model on one dataset and we test it on a
completely different dataset. Then it will be difficult for our model to understand the
correlations between the features.

If we train our model very well and its training accuracy is also very high, but we provide a new
dataset to it, then it will decrease the performance. So we always try to make a machine learning
model which performs well with the training set and also with the test dataset. Here, we can
define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know the
output.

Test set: A subset of dataset to test the machine learning model, and by using the test set, model
predicts the output.

For splitting the dataset, we will use the below lines of code:

1. from sklearn.model_selection import train_test_split


2. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

o In the above code, the first line is used for splitting arrays of the dataset into random train
and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variables for the training data
o y_test: dependent variables for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are
arrays of data, and test_size specifies the size of the test set. The test_size may be
.5, .3, or .2, which tells the dividing ratio of the training and testing sets.
o The last parameter, random_state, is used to set a seed for a random generator so that you
always get the same result, and the most used value for this is 42.

Output:

By executing the above code, we will get four different variables, which can be seen under the
Variable Explorer section: the x and y variables are divided into four variables with the
corresponding values.

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset in a specific range. In feature scaling, we put
our variables in the same range and on the same scale so that no variable dominates the other
variables.

Consider the dataset used above, which contains the Age and Salary columns.

As we can see, the Age and Salary column values are not on the same scale. Many machine learning
models are based on Euclidean distance, and if we do not scale the variables, then it will cause
some issues in our machine learning model.

The Euclidean distance between two points (x1, y1) and (x2, y2) is given as: sqrt((x2 - x1)^2 + (y2 - y1)^2).

If we compute the distance using any two values from Age and Salary, then the Salary values will dominate the Age
values, and it will produce an incorrect result. So to remove this issue, we need to perform
feature scaling for machine learning.

There are two ways to perform feature scaling in machine learning:

Standardization: x' = (x - mean(x)) / standard deviation(x)

Normalization (min-max scaling): x' = (x - min(x)) / (max(x) - min(x))

Here, we will use the standardization method for our dataset.

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:
1. from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features.
And then we will fit and transform the training dataset.

1. st_x= StandardScaler()
2. x_train= st_x.fit_transform(x_train)

For the test dataset, we will directly apply the transform() function instead of fit_transform() because
the scaler has already been fitted on the training set.

1. x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test. As we
can see in the output, all the variables are scaled to values roughly between -1 and 1.

Note: Here, we have not scaled the dependent variable because it has only two values, 0 and 1.
But if a variable has a wider range of values, then we will also need to scale it.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more
understandable.

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('Dataset.csv')
8.
9. #Extracting Independent Variable
10. x= data_set.iloc[:, :-1].values
11.
12. #Extracting Dependent variable
13. y= data_set.iloc[:, 3].values
14.
15. #handling missing data(Replacing missing data with the mean value)
16. from sklearn.preprocessing import Imputer
17. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
18.
19. #Fitting imputer object to the independent varibles x.
20. imputer= imputer.fit(x[:, 1:3])
21.
22. #Replacing missing data with the calculated mean value
23. x[:, 1:3]= imputer.transform(x[:, 1:3])
24.
25. #for Country Variable
26. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
27. label_encoder_x= LabelEncoder()
28. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
29.
30. #Encoding for dummy variables
31. onehot_encoder= OneHotEncoder(categorical_features= [0])
32. x= onehot_encoder.fit_transform(x).toarray()
33.
34. #encoding for purchased variable
35. labelencoder_y= LabelEncoder()
36. y= labelencoder_y.fit_transform(y)
37.
38. # Splitting the dataset into training and test set.
39. from sklearn.model_selection import train_test_split
40. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
41.
42. #Feature Scaling of datasets
43. from sklearn.preprocessing import StandardScaler
44. st_x= StandardScaler()
45. x_train= st_x.fit_transform(x_train)
46. x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But there are some
steps or lines of code which are not necessary for all machine learning models, so we can
exclude them from our code to make it reusable for all models.
