Industrial Training Report
ON
“INTRODUCTION TO MACHINE
LEARNING”
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING
Submitted By
Vishakha Tandon , Enrollment no. (1708210168)
MORADABAD INSTITUTE OF TECHNOLOGY
CERTIFICATE
Certified that the Industrial Training entitled “Introduction To Machine Learning” submitted by
1708210168 is their own work and has been carried out under my supervision. It is recommended that
the candidate's industrial training work now be evaluated by the university.
I would like to acknowledge the contributions of the following people, without whose help and guidance this
report would not have been completed.
I acknowledge, with respect and gratitude, the counsel and support of our training coordinator, Mrs Shilpi Rani,
CSE Department, whose expertise, guidance, support, encouragement, and enthusiasm have made this report
possible. Her feedback vastly improved the quality of this report and made the work an enthralling experience. I am
indeed proud and fortunate to be supported by her.
I am also thankful to Moradabad Institute Of Technology for its constant encouragement, valuable
suggestions, moral support, and blessings.
Although it is not possible to name everyone individually, I shall ever remain indebted to the faculty members of
Moradabad Institute of Technology for their persistent support and cooperation during this work.
This acknowledgement would remain incomplete if I failed to express my deep sense of obligation to my parents
and God for their consistent blessings and encouragement.
Vishakha Tandon
1708210168
Table of Contents
Chapter 1
Introduction
History of Machine Learning
Types of Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Semi-Supervised Learning
The Challenges Facing Machine Learning
Chapter 2
Technology Implemented
Data Preprocessing, Analysis & Visualization
Machine Learning Algorithms
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. Support Vector Machine (SVM)
5. Naïve Bayes Algorithm
7. K-Means Algorithm
8. Random Forest
Chapter 3
Result Discussion
Dataset Description
Result
Project Code And Output
Disadvantages of Machine Learning
Future Scope
Bibliography
Chapter 1
Machine Learning is the science of getting computers to learn without being explicitly programmed. It is
closely related to computational statistics, which focuses on making predictions using computers. In its
application across business problems, machine learning is also referred to as predictive analytics. Machine
Learning focuses on the development of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions in the future based on the examples
that we provide. The primary aim is to allow computers to learn automatically, without human intervention or
assistance, and adjust their actions accordingly.
The name machine learning was coined in 1959 by Arthur Samuel. Tom M. Mitchell provided a widely quoted,
more formal definition of the algorithms studied in the machine learning field: "A computer program is said
to learn from experience E with respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E." This follows Alan Turing's
proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?" is
replaced with the question "Can machines do what we (as thinking entities) can do?". In Turing’s proposal the
characteristics that could be possessed by a thinking machine and the various implications in constructing one
are exposed.
The types of machine learning algorithms differ in their approach, the type of data they input and output, and
the type of task or problem that they are intended to solve. Broadly, Machine Learning can be divided into
four categories.
I. Supervised Learning
II. Unsupervised Learning
III. Reinforcement Learning
IV. Semi-supervised Learning
Machine learning enables analysis of massive quantities of data. While it generally delivers faster, more
accurate results in order to identify profitable opportunities or dangerous risks, it may also require additional
time and resources to train it properly.
Supervised Learning
Supervised Learning is a type of learning in which we are given a data set and already know what the
correct output should look like, on the assumption that there is a relationship between the input and the output.
Basically, it is the task of learning a function that maps an input to an output based on example input-output
pairs. It infers a function from labeled training data consisting of a set of training examples.
Supervised learning problems are categorized into regression and classification problems.
Unsupervised Learning
Unsupervised Learning is a type of learning that allows us to approach problems with little or no idea what our
results should look like. We can derive structure by clustering the data based on relationships among the
variables in the data. With unsupervised learning there is no feedback based on the prediction results. Basically,
it is a type of self-organized learning that helps in finding previously unknown patterns in a data set without
pre-existing labels.
Reinforcement Learning
Reinforcement learning is a learning method in which an agent interacts with its environment by producing
actions and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant
characteristics of reinforcement learning. This method allows machines and software agents to automatically
determine the ideal behavior within a specific context in order to maximize their performance. Simple reward
feedback is required for the agent to learn which action is best.
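The trial-and-error idea above can be sketched with an epsilon-greedy multi-armed bandit, which is the simplest reinforcement learning setting (rewards here are immediate, so delayed reward is not modelled). The function name run_bandit, the reward probabilities, the step count, and the seed are all illustrative assumptions, not part of the training material.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit with Bernoulli rewards (illustrative)."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was tried
    values = [0.0] * len(reward_probs)  # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best estimate so far
            action = max(range(len(values)), key=values.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for this action
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Hypothetical environment: three actions with different success rates
values, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print("estimated values:", values)
print("best action:", best_action)
```

The agent only ever sees the reward of the action it took, which is the simple reward feedback the paragraph above refers to.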
Semi-Supervised Learning
Semi-supervised learning falls somewhere in between supervised and unsupervised learning, since it uses both
labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of
unlabeled data. Systems that use this method are able to considerably improve learning accuracy. Usually,
semi-supervised learning is chosen when acquiring labeled data requires skilled and relevant resources.
Acquiring unlabeled data, by contrast, generally doesn’t require additional resources.
Literature Survey
Theory
A core objective of a learner is to generalize from its experience. The computational analysis of machine
learning algorithms and their performance is a branch of theoretical computer science known as computational
learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not
yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite
common. The bias–variance decomposition is one way to quantify generalization error.
For the best performance in the context of generalization, the complexity of the hypothesis should match the
complexity of the function underlying the data. If the hypothesis is less complex than the function, then the
model has underfit the data. If the complexity of the model is increased in response, then the training error
decreases. But if the hypothesis is too complex, then the model is subject to overfitting and generalization will
be poorer.
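A minimal sketch of the underfitting and overfitting behaviour described above, assuming NumPy and a made-up noisy quadratic data set. The data, the polynomial degrees, and the seed are illustrative assumptions. The hypothesis complexity is the polynomial degree: the training error can only go down as the degree grows, while the error on held-out points need not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic function observed with a little noise
x_train = np.linspace(-1, 1, 12)
y_train = x_train**2 + rng.normal(0, 0.05, x_train.size)
x_held = np.linspace(-0.95, 0.95, 11)   # held-out points between the training ones
y_held = x_held**2 + rng.normal(0, 0.05, x_held.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_errors, held_errors = {}, {}
for degree in (1, 2, 9):   # too simple, matching, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mse(coeffs, x_train, y_train)
    held_errors[degree] = mse(coeffs, x_held, y_held)
    print(degree, train_errors[degree], held_errors[degree])
```

Degree 1 underfits the quadratic, degree 2 matches the underlying function, and degree 9 has enough freedom to chase the noise in the twelve training points.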
In addition to performance bounds, learning theorists study the time complexity and feasibility of learning. In
computational learning theory, a computation is considered feasible if it can be done in polynomial time. There
are two kinds of time complexity results. Positive results show that a certain class of functions can be learned in
polynomial time. Negative results show that certain classes cannot be learned in polynomial time.
While there has been much progress in machine learning, challenges remain. For example, the
mainstream machine learning technologies are black-box approaches, which raises concerns about their
potential risks. To tackle this challenge, we may want to make machine learning more explainable and
controllable. As another example, the computational complexity of machine learning algorithms is usually very
high, so we may want to invent lightweight algorithms or implementations. Furthermore, in many domains
such as physics, chemistry, biology, and the social sciences, people usually seek elegantly simple equations (e.g.,
the Schrödinger equation) to uncover the underlying laws behind various phenomena. Machine learning projects
also take much more time: we have to gather and prepare data and then train the algorithm, and there are many
more uncertainties. That is why, while in traditional website or application development an experienced team can
estimate the time quite precisely, a machine learning project used, for example, to provide product
recommendations can take much less or much more time than expected. Even the best machine
learning engineers don’t know how deep learning networks will behave when analyzing different sets of
data. It also means that machine learning engineers and data scientists cannot guarantee that the training
process of a model can be replicated.
Chapter 2
Technology Implemented
Features
Interpreted
In Python there are no separate compilation and execution steps as in C/C++. It runs the program directly from
the source code. Internally, Python converts the source code into an intermediate form called bytecode,
which is then executed by the Python virtual machine on the specific computer.
Platform Independent
Python programs can be developed and executed on multiple operating system platforms. Python can
be used on Linux, Windows, Macintosh, Solaris and many more.
Multi- Paradigm
Python is a multi-paradigm programming language. Object-oriented programming and structured
programming are fully supported, and many of its features support functional programming and
aspect-oriented programming.
Simple
Python is a very simple language. It is very easy to learn as it is close to the English language. In Python,
more emphasis is placed on the solution to the problem than on the syntax.
Rich Library Support
Python's standard library is vast. It can help to do various things involving regular expressions,
documentation generation, unit testing, threading, databases, web browsers, CGI, email, XML, HTML,
WAV files, cryptography, GUI and many more.
Free and Open Source
Firstly, Python is freely available. Secondly, it is open-source. This means that its source code is available
to the public. We can download it, change it, use it, and distribute it. This is called FLOSS (Free/Libre and
Open Source Software). As the Python community, we’re all headed toward one goal- an ever-bettering
Python.
5. Community Support-
It’s always very helpful when there’s strong community support built around a programming
language. Python is an open-source language, which means that there are plenty of resources open to
programmers, from beginners to professionals. A lot of Python documentation is available
online as well as in Python communities and forums, where programmers and machine learning
developers discuss errors, solve problems, and help each other out. The Python programming language is
completely free, as is its variety of useful libraries and tools.
6. Growing Popularity-
As a result of the advantages discussed above, Python is becoming more and more popular among data
scientists. According to StackOverflow, the popularity of Python is predicted to grow until 2020, at
least. This means it’s easier to find developers and replace team members if required. Also, the cost
of their work may not be as high as when using a less popular programming language.
1. Rescaling Data -
For data with attributes of varying scales, we can rescale the attributes to the same scale. Rescaling
attributes into the range 0 to 1 is called normalization. We use the MinMaxScaler class from scikit-learn,
which gives us values between 0 and 1.
2. Standardizing Data -
With standardizing, we can take attributes with a Gaussian distribution and different means and standard
deviations and transform them into a standard Gaussian distribution with a mean of 0 and a standard
deviation of 1.
3. Normalizing Data -
In this task, we rescale each observation to a length of 1 (a unit norm). For this, we use the Normalizer
class.
4. Binarizing Data -
Using a binary threshold, it is possible to transform our data by marking the values above it 1 and those
equal to or below it, 0. For this purpose, we use the Binarizer class.
5. Mean Removal-
We can remove the mean from each feature to center it on zero.
7. Label Encoding -
Some labels can be words or numbers. Usually, training data is labelled with words to make it readable.
Label encoding converts word labels into numbers to let algorithms work on them.
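The preprocessing steps listed above can be sketched with scikit-learn as follows. The classes are the ones named in the text plus StandardScaler and LabelEncoder, which are the usual scikit-learn classes for standardizing, mean removal, and label encoding; the small array and the label list are made-up illustrations, not data from the training.

```python
import numpy as np
from sklearn.preprocessing import (
    Binarizer, LabelEncoder, MinMaxScaler, Normalizer, StandardScaler)

# Hypothetical data: two attributes on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# 1. Rescaling: every column is mapped into the range 0 to 1
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# 2. Standardizing: every column gets mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

# 3. Normalizing: every row (observation) is scaled to unit length
normalized = Normalizer(norm="l2").fit_transform(X)

# 4. Binarizing: values above the threshold become 1, the rest 0
binarized = Binarizer(threshold=2.0).fit_transform(X)

# 5. Mean removal: centre every column on zero without rescaling it
centered = StandardScaler(with_std=False).fit_transform(X)

# 7. Label encoding: word labels become integers (classes sorted alphabetically)
labels = ["negative", "positive", "neutral", "positive"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(rescaled, standardized, normalized, binarized, centered, encoded, sep="\n")
```

In practice a scaler would be fitted on the training data only and then applied to the test data with transform, so that no information from the test set leaks into training.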
Machine Learning Algorithms
There are many types of Machine Learning algorithms, specific to different use cases. When we work with
datasets, a machine learning algorithm works in two stages, training and testing. Under supervised learning,
we split the dataset into training data and test data in Python ML, usually around 80% for training and 20%
for testing. The following are the algorithms of Python Machine Learning -
1. Linear Regression-
Linear regression is one of the supervised Machine Learning algorithms in Python; it observes continuous
features and predicts an outcome. Depending on whether it runs on a single variable or on many features, we
call it simple linear regression or multiple linear regression.
This is one of the most popular Python ML algorithms, yet it is often under-appreciated. It assigns optimal weights
to variables to create a line a*X+b that predicts the output. We often use linear regression to estimate real values,
such as the number of calls or the cost of houses, based on continuous variables. The regression line is the
best-fit line Y=a*X+b, which denotes the relationship between the independent and dependent variables.
Fig 1
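As a minimal sketch of simple linear regression, the snippet below fits Y=a*X+b with scikit-learn's LinearRegression. The five data points are made up and lie exactly on the line with a=3 and b=2, so the fitted weights can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on Y = 3*X + 2
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 3.0 * X.ravel() + 2.0

model = LinearRegression().fit(X, y)
a = model.coef_[0]        # slope (weight of the single feature)
b = model.intercept_      # intercept
prediction = model.predict(np.array([[6.0]]))[0]

print("a =", a, "b =", b, "prediction for X=6:", prediction)
```

With more than one column in X, the same call performs multiple linear regression and coef_ holds one weight per feature.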
2. Logistic Regression -
Logistic regression is a supervised classification algorithm in Python Machine Learning that is used to
estimate discrete values like 0/1, yes/no, and true/false, based on a given set of independent
variables. We use a logistic function to predict the probability of an event, which gives us an output between 0
and 1. Although its name says ‘regression’, this is actually a classification algorithm. Logistic regression fits
data to a logit function and is also called logit regression.
Fig 2
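A minimal sketch of logistic regression as a binary classifier, assuming scikit-learn and a made-up one-feature data set in which small values belong to class 0 and large values to class 1. The predicted probabilities lie between 0 and 1, as described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature, two classes
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

new_points = np.array([[1.0], [4.0]])
proba = clf.predict_proba(new_points)[:, 1]   # probability of class 1
predicted = clf.predict(new_points)           # discrete 0/1 decision

print("P(class 1):", proba)
print("predicted classes:", predicted)
```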
4. Decision Tree -
A decision tree falls under supervised Machine Learning algorithms in Python and is used for both
classification and regression, although mostly for classification. This model takes an instance, traverses the
tree, and compares important features against a determined conditional statement. Whether it descends to the left
child branch or the right depends on the result. Usually, more important features are closer to the root.
A decision tree can work with both categorical and continuous
dependent variables. Here, we split a population into two or more homogeneous sets. Tree models where the
target variable can take a discrete set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class labels. Decision
trees where the target variable can take continuous values (typically real numbers) are called regression trees.
Fig 4
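A minimal sketch of a classification tree, assuming scikit-learn and a made-up data set with a single feature (named "value" here purely for illustration). The tree learns one conditional statement at the root, and a new instance descends left or right depending on the result.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: small values are "low", large values are "high"
X = [[1], [2], [3], [10], [11], [12]]
y = ["low", "low", "low", "high", "high", "high"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned rules; leaves carry the class labels
print(export_text(tree, feature_names=["value"]))

predicted = tree.predict([[2.5], [11.5]])
print(predicted)
```

Replacing DecisionTreeClassifier with DecisionTreeRegressor and the word labels with real numbers would give a regression tree.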
7. kNN Algorithm -
This is a Python Machine Learning algorithm for classification and regression, although mostly for
classification. It is a supervised learning algorithm that stores the training points and uses a distance function,
usually Euclidean, to compare a new case with them. It classifies new cases using a majority vote of their k
nearest neighbors: the class assigned to a case is the one most common among its k nearest neighbors. k-NN is a
type of instance-based learning, or lazy learning, where the function is only approximated locally and all
computation is deferred until classification. k-NN is a special case of a variable-bandwidth, kernel density
"balloon" estimator with a uniform kernel.
Fig 5
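A minimal sketch of k-NN classification with k=3, assuming scikit-learn and six made-up points in two well-separated groups. A new case receives the class that is most common among its three nearest training points under the default Euclidean distance.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: class "a" near (1, 1), class "b" near (8, 8)
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["a", "a", "a", "b", "b", "b"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

predicted = knn.predict([[2, 2], [7, 7]])

# Indices of the 3 training points closest to the first new case
distances, indices = knn.kneighbors([[2, 2]])
print(predicted, indices)
```

Fitting only stores the training points; the distance computation happens at prediction time, which is why k-NN is called lazy learning.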
8. K-Means Algorithm -
k-Means is an unsupervised algorithm that solves the problem of clustering. It groups data into a chosen number
of clusters. The data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters.
k-means clustering is a method of vector quantization, originally from signal processing (where it still finds
use), that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a
prototype of the cluster. Finding the optimal partition is computationally difficult (NP-hard), but heuristics such
as Lloyd's algorithm make k-means rather easy to apply even to large data sets. It is often used as a
preprocessing step for other algorithms, for example to find a starting configuration. k-means clustering has
also been used as a feature learning (or dictionary learning) step, in either (semi-)supervised learning or
unsupervised learning.
Fig 6
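A minimal sketch of k-means clustering with k=2, assuming scikit-learn and six made-up points forming two well-separated groups. No labels are supplied; the algorithm assigns each observation to the cluster with the nearest mean.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data: one group near (1, 1.5), one near (8.5, 8.3)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

cluster_ids = kmeans.labels_          # cluster index of every observation
centers = kmeans.cluster_centers_     # the cluster means (prototypes)
print(cluster_ids)
print(centers)
```

The numbering of the clusters is arbitrary, so only the grouping (which points share a cluster) is meaningful.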
9. Random Forest -
A random forest is an ensemble of decision trees. To classify a new object based on its attributes, each tree
provides a classification, that is, the trees vote for a class. The classification with the most votes wins in the
forest. Random forests, or random decision forests, are an ensemble learning method for classification, regression,
and other tasks that operates by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Fig 7
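A minimal sketch of the voting idea, assuming scikit-learn and a made-up two-class data set. Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so the explicit vote count below is only an illustration of the principle.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: class 0 near (1.5, 1.5), class 1 near (8.5, 8.5)
X = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = [[1.5, 1.5]]
# Ask every individual tree for its classification of the sample
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
forest_prediction = forest.predict(sample)[0]

print("votes for class 0:", votes.count(0), "votes for class 1:", votes.count(1))
print("forest prediction:", forest_prediction)
```

Each tree is trained on a bootstrap sample of the data, which is why individual trees can disagree while the ensemble remains stable.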
Chapter 3
Result Discussion
Result
This training has introduced us to Machine Learning. We now know that Machine Learning is a technique of
training machines to perform the activities a human brain can do, albeit a bit faster and better than an average
human being. We have seen that machines can beat human champions in games such as Chess and
Mahjong, which are considered very complex. We have seen that machines can be trained to perform human
activities in several areas and can aid humans in living better lives. Machine learning is a quickly growing field in
computer science. It has applications in nearly every other field of study and is already being implemented
commercially, because machine learning can solve problems too difficult or time consuming for humans to
solve. To describe machine learning in general terms, a variety of models are used to learn patterns in data and
make accurate predictions based on the patterns they observe.
Machine Learning can be Supervised or Unsupervised. If we have a smaller amount of clearly labelled
data for training, we opt for Supervised Learning. Unsupervised Learning generally gives better
performance and results for large data sets. If we have a huge data set easily available, we go for deep learning
techniques. We have also learned about Reinforcement Learning and Deep Reinforcement Learning. We now know
what Neural Networks are, along with their applications and limitations. Specifically, we have developed a thought
process for approaching the problems that machine learning is so good at solving. We have learnt how machine
learning differs from descriptive statistics.
Finally, when it comes to developing machine learning models of our own, we looked at the choices of
various development languages, IDEs, and platforms. The next thing we need to do is start learning and
practicing each machine learning technique. The subject is vast in breadth, but in terms of
depth, each topic can be learned in a few hours. The topics are independent of each other. We need to take
up one topic at a time, learn it, practice it, and implement its algorithm(s) in a language
of our choice. This is the best way to start studying Machine Learning. Practicing one topic at a time, we can
soon acquire the breadth that is eventually required of a Machine Learning expert.
Chapter 4
Project Report
Overview-
To classify tweets about US airlines as positive, negative, or neutral. This can help
customers choose airlines with better service.
Dataset Description-
We are given the Twitter US Airline Sentiment dataset, which contains 14,601 tweets about the
major U.S. airlines. The tweets are labelled as positive, negative, or neutral based on the nature
of the respective Twitter user’s feedback regarding the airline.
The dataset is further split into training and test sets in a stratified fashion.
The training set contains 11,680 tweets, whereas the test set contains 2,921 tweets.
Fig 8
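A stratified split of the kind described above can be sketched with scikit-learn's train_test_split. This is only an illustration under stated assumptions: the labels are randomly generated stand-ins with made-up class proportions, the tweets are replaced by integer ids, and test_size=0.2 is an assumed setting that happens to reproduce the stated sizes of 11,680 and 2,921 for 14,601 samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_tweets = 14601
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real data (class proportions are made up)
tweets = np.arange(n_tweets)
sentiments = rng.choice(["negative", "neutral", "positive"],
                        size=n_tweets, p=[0.6, 0.2, 0.2])

# stratify keeps the class proportions the same in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    tweets, sentiments, test_size=0.2, stratify=sentiments, random_state=0)

print(len(x_tr), len(x_te))
for label in ("negative", "neutral", "positive"):
    print(label, np.mean(y_tr == label), np.mean(y_te == label))
```

Stratification matters for sentiment data because the classes are usually imbalanced, and an unstratified split could leave the test set with a different mix of sentiments than the training set.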
Result-
Our project classifies tweets by sentiment with about 78% accuracy (the best result, obtained with the SVC classifier).
Project Code And Output –
import string
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Mounted at /content/drive
data=pd.read_csv("/content/drive/MyDrive/Tweets.csv")
len(x_train)
10248
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
documents=[]
for text in x_train:
    documents.append(word_tokenize(text))
documents2=[]
for text in x_test:
    documents2.append(word_tokenize(text))
documents2[1]
['@', 'USAirways', 'how', 'is', 'it', 'that', 'my', 'flt', 'to', 'EWR', 'was', 'Cancelled',
 'Flightled', 'yet', 'flts', 'to', 'NYC', 'from', 'USAirways', 'are', 'still', 'flying', '?']
[('better', 'RBR')]
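from nltk.corpus import wordnet
def get_simple_pos(tag):
    # Map a Penn Treebank POS tag to the matching WordNet POS constant
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN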
#cleaning text
# NOTE: `stop`, `pos_tag` and `lematizer` are defined in cells not reproduced in this report
def clean_text(word):
    cleaned=[]
    for w in word:
        # keep only tokens that are not in the stop list
        if w.lower() not in stop:
            pos=pos_tag([w])
            clean_word=lematizer.lemmatize(w,pos=get_simple_pos(pos[0][1]))
            cleaned.append(clean_word.lower())
    return cleaned
#train docs
documents=[clean_text(tweet) for tweet in documents]
#test docs
documents2=[clean_text(tweet) for tweet in documents2 ]
documents2[5]
['united', 'still', 'wait', 'hear', 'back', 'wallet', 'steal', 'one', 'plane', 'would',
 'appreciate', 'resolution']
'usairways call 12 time last three day 's unacceptable 'm willing wait hold 's op tion'
'usairways 've hold change date ticket 3 hour someone please assist unacceptable'
count_vect=CountVectorizer(max_features=4000,ngram_range=(1,2))
x_train_features=count_vect.fit_transform(train_documents)
#count_vect.get_feature_names()
x_test_features=count_vect.transform(test_documents)
ListOfClassifiers = [
LogisticRegression(C=0.000000001,solver='liblinear',max_iter=200),
RandomForestClassifier(n_estimators=200),
DecisionTreeClassifier(),
SVC(kernel="rbf"),
MultinomialNB(),
GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=4, random_state=0),
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=18, min_samples_leaf=25,
]
advanceFeatures=x_train_features.toarray()
advanceFeaturesTest= x_test_features.toarray()
Accuracy=[]
ClassifierModel=[]
for classifier in ListOfClassifiers:
    try:
        fit=classifier.fit(x_train_features,y_train)
        pred = fit.predict(x_test_features)
    except Exception:
        # Fall back to dense arrays for classifiers that reject sparse input
        fit = classifier.fit(advanceFeatures,y_train)
        pred = fit.predict(advanceFeaturesTest)
    accuracy = accuracy_score(y_test,pred)
    Accuracy.append(accuracy)
    ClassifierModel.append(classifier.__class__.__name__)
    print ('Accuracy of '+classifier.__class__.__name__+' is '+str(accuracy))
OUTPUT
Accuracy of LogisticRegression is 0.6407103825136612
Accuracy of RandomForestClassifier is 0.7693533697632058
Accuracy of DecisionTreeClassifier is 0.7040072859744991
Accuracy of SVC is 0.7855191256830601
Accuracy of MultinomialNB is 0.7739071038251366
Accuracy of GradientBoostingClassifier is 0.7597905282331512
Accuracy of AdaBoostClassifier is 0.7515938069216758
Chapter 5
Every coin has two faces, and each face has its own properties and features. It’s time to uncover the
faces of ML, a very powerful tool that holds the potential to revolutionize the way things
work.
3. Continuous Improvement -
As ML algorithms gain experience, they keep improving in accuracy and efficiency. This lets
them make better decisions. Say we need to build a weather forecast model. As the amount of
data we have keeps growing, our algorithms learn to make more accurate predictions faster.
5. Wide Applications -
Whether we are an e-seller or a healthcare provider, we can make ML work for us. Where it does
apply, it holds the capability to help deliver a much more personal experience.
Disadvantages of Machine Learning
For all its power and popularity, Machine Learning isn’t perfect.
The following factors serve to limit it:
1. Data Acquisition -
Machine Learning requires massive data sets to train on, and these should be
inclusive, unbiased, and of good quality. There can also be times when we must wait for new
data to be generated.
3. Interpretation of Results -
Another major challenge is the ability to accurately interpret the results generated by the
algorithms. We must also carefully choose the algorithms for our purpose.
4. High error-susceptibility -
Machine Learning is autonomous but highly susceptible to errors. Suppose we train an
algorithm with data sets too small to be inclusive. We end up with biased predictions
coming from a biased training set. This can lead, for example, to irrelevant advertisements being
displayed to customers. In the case of ML, such blunders can set off a chain of errors that can go
undetected for long periods of time. And when they do get noticed, it takes quite some time to
recognize the source of the issue, and even longer to correct it.
Applications of Machine Learning
Machine learning is one of the most exciting technologies that one could come
across. As is evident from the name, it gives the computer the ability that makes it more similar
to humans: the ability to learn. Machine learning is actively being used today, perhaps in
many more places than one would expect. We probably use a learning algorithm dozens of times a day
without even knowing it. Applications of Machine Learning include:
Web Search Engine: One of the reasons why search engines like Google, Bing, etc.
work so well is that the system has learnt how to rank pages through a complex
learning algorithm.
Photo Tagging Applications: Be it Facebook or any other photo tagging application,
the ability to tag friends makes it even more engaging. It is all possible because of a
face recognition algorithm that runs behind the application.
Spam Detector: Our mail agent, like Gmail or Hotmail, does a lot of hard work for us
in classifying mails and moving spam to the spam folder. This is again
achieved by a spam classifier running in the back end of the mail application.
Future Scope
The future of Machine Learning is as vast as the limits of the human mind. We can always keep
learning, and teaching computers how to learn, while at the same time wondering how
some of the most complex machine learning algorithms have been running in the back of our
own minds so effortlessly all the time. There is a bright future for machine learning. Companies
like Google, Quora, and Facebook hire people with machine learning skills. There is intense
research in machine learning at the top universities in the world. The global machine-learning-
as-a-service market is rising rapidly, mainly due to the Internet revolution. The process
of connecting the world virtually has generated a vast amount of data, which is boosting the
adoption of machine learning solutions. Considering all these applications and the dramatic
improvements that ML has brought us, it doesn't take a genius to realize that in the coming future
we will see more advanced applications of ML, applications that will stretch the
capabilities of machine learning to an unimaginable level.
Bibliography
https://expertsystem.com/
https://www.geeksforgeeks.org/
https://www.wikipedia.org/
https://www.coursera.org/learn/machine-learning
https://machinelearningmastery.com/
https://towardsdatascience.com/machine-learning/home