BigData Assessment2 26230605

The document discusses four machine learning models - SVM, KNN, fully connected neural network, and CNN - that were used to classify chest x-ray images into normal and pneumonia categories. It describes the image datasets, data preprocessing steps like resizing and augmentation, and model training procedures. Key steps included identifying hyperparameters, implementing models in Google Colab, and using techniques like k-fold cross validation to evaluate performance. The goal was to correctly classify images using these machine learning techniques.

Uploaded by

abaid choughtai

Big Data Analytics and Modelling CMP9781M-2122 Abaid Ur Rahman - 26230605

University of Lincoln
Abaid ur Rahman Choughtai 26230605

Introduction:

The purpose of this report is to demonstrate the machine learning techniques used and to explain the processes
involved in obtaining results from those modelling techniques. The document describes the steps taken to
achieve those results and gives procedural information about the four machine learning models that were used.
The models were applied to an image dataset. The main goal is to train each model on the image dataset and to
classify the images correctly.

Introduction to Multi-Class Image Classification:

Image classification using learning algorithms is an active, state-of-the-art area of computer science
research. Multiclass classification is a machine learning classification task that consists of more than two classes
or outputs [1]. Put plainly, multi-class image classification is a common task in computer vision in which images
are categorized into two or more classes using some modelling technique [2]. For example, a model that identifies
the type of animal shown in an image performs multi-class classification. Multi-class classification is a very
common machine learning task. In the example above, we take numerous training examples divided into several
separate classes. The model learns the patterns specific to each class and uses those patterns to predict the
class of future data. A lot of machine learning techniques exist for classification.
The main purpose of this assessment is to perform image classification and prediction on the dataset. The main
techniques used in the assessment are the K-nearest neighbor classifier, a fully connected neural network, a
simple convolutional neural network, and a support vector machine. Extra models were used as bonus work.
The image dataset used here was provided on Blackboard for the computer vision module. It consists of
pneumonia chest x-rays and normal chest x-rays. The dataset contains 4026 images in total, each labeled as
either a pneumonia or a normal chest x-ray, so the task here is the two-class case. The images are divided into
separate training, validation and test sets.

Preparation of Datasets and Models:

The first main part of the process was to understand and analyze the image dataset using the methodologies
discussed during lectures, workshops and in the research papers. The images are given as input to the
models, which learn patterns using the different machine learning techniques discussed below. The
main topics covered for the models and techniques are listed below:
● Datasets and its information
● Methodologies
● SVM model
● Fully connected Neural Net
● K-nearest neighbor algorithm
● Convolutional Neural Network
● K-fold cross-validation

Datasets and its information:

The chest x-ray image dataset was provided in the computer vision module and consists of normal and
not-normal (pneumonia) images. The dataset was acquired from a health institute working on disease detection
from chest x-rays. Each image is labeled as a normal or a not-normal chest x-ray, and the dataset contains
about 4026 images in total. Pneumonia is a type of chest infection that affects the tiny air sacs (alveoli) in the
lungs. These alveoli become inflamed and fill with fluid [11]. The dataset is depicted as:

The folder contains images of normal and not-normal chest x-rays, which need to be resized. The image dataset
is split into training, validation and testing sets using the train_test_split function. Further processing applied
to the images is described briefly below.
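The three-way split can be sketched by calling train_test_split twice. This is a minimal illustration and not the report's actual script: the random placeholder arrays, the 60/20/20 proportions, the stratify option and the random_state value are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the resized x-ray images (assumption):
# 100 flattened 8x8 "images" with labels 0 = normal, 1 = pneumonia.
rng = np.random.default_rng(0)
X = rng.random((100, 64))
y = np.array([0] * 50 + [1] * 50)

# First hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Then split the remainder into training (75%) and validation (25%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the proportion of normal and pneumonia images the same in every subset, which matters when the classes are imbalanced.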

Colaboratory:

To implement the techniques and models we use Google Colab. Google developed the AI framework
TensorFlow and a development tool called Colaboratory (Colab). TensorFlow is open source and Colab is free for
public use. We implement the techniques in this tool. It is an online tool used to read, write, and
edit Python scripts. We uploaded all the chest x-ray data to Google Drive, in a folder named chest x-ray,
and the script for accessing Google Drive is as given:
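A typical way to mount Google Drive from a Colab notebook is sketched below. The mount point and the folder path (MyDrive/chest_xray) are assumptions made for illustration, and the import is guarded so that the snippet also runs outside Colab, where nothing is mounted.

```python
import os

# Hypothetical location of the uploaded folder inside Google Drive.
DRIVE_ROOT = "/content/drive"
DATA_DIR = os.path.join(DRIVE_ROOT, "MyDrive", "chest_xray")

try:
    # google.colab is only importable inside a Colab runtime.
    from google.colab import drive
    drive.mount(DRIVE_ROOT)  # asks for authorization on first use
    IN_COLAB = True
except ImportError:
    IN_COLAB = False  # running elsewhere: nothing is mounted

print("Data directory:", DATA_DIR, "| mounted:", IN_COLAB)
```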

Data Augmentation:

Data augmentation is an effective way to increase the size of a training dataset. In data analysis, data
augmentation refers to techniques that increase the amount of data by adding modified
copies of existing samples, for example flips, translations or rotations. The neural network treats such
images as distinct. Whenever a machine learning model is trained, its parameters
are tuned so that a given input produces the desired output. The goal is to reach the point where
the model’s loss is low, which happens when the parameters are tuned correctly
[3]. The accompanying figure shows an example of data augmentation as a human viewer would perceive it.
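The flips, translations and rotations mentioned above can be sketched with plain NumPy array operations. The helper name augment and the toy 3x3 image are hypothetical; a real pipeline would normally use an image library and would not wrap pixels around the border when translating.

```python
import numpy as np

def augment(image):
    """Return simple modified copies of a 2-D grayscale image array."""
    return {
        # Mirror the image left to right.
        "horizontal_flip": np.fliplr(image),
        # Rotate by 90 degrees counter-clockwise.
        "rotate_90": np.rot90(image),
        # Translate one pixel to the right (pixels wrap around the edge).
        "shift_right": np.roll(image, shift=1, axis=1),
    }

# Toy 3x3 "image" used only for illustration.
image = np.arange(9).reshape(3, 3)
copies = augment(image)
for name, arr in copies.items():
    print(name)
    print(arr)
```

Each copy keeps the label of the original image, so one labeled x-ray yields several training samples.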

In digital imaging, a grayscale image is one in which the value of each pixel represents only the intensity
of light. A grayscale image therefore contains only shades of gray, from black to white, and no colour. The
definition of grayscale is: a range of gray shades from white to black [4]. Ready-made functions for
converting an image to grayscale are available in Python imaging libraries such as scikit-image. The conversion
reduces each pixel from three colour values to a single intensity value, which gives the machine learning models
less data to process when extracting features from the images.
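As a rough sketch of what a grayscale conversion does, the function below collapses the three colour channels into one intensity channel with a weighted sum. The function name and the luminance weights (0.299, 0.587, 0.114) are assumptions for illustration; the library function used in the report may apply different weights.

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB array into an (H, W) intensity array."""
    weights = np.array([0.299, 0.587, 0.114])  # assumed luminance weights
    return rgb @ weights

rgb = np.zeros((2, 2, 3))
rgb[0, 0] = [1.0, 1.0, 1.0]  # white pixel
rgb[1, 1] = [1.0, 0.0, 0.0]  # pure red pixel
gray = to_grayscale(rgb)
print(gray.shape)
print(gray)
```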

Image Resize:

Image resizing is necessary when the total number of pixels needs to be increased or decreased. We used an image
resize function to bring every image in the dataset to a uniform size, so that the machine learning models receive
inputs of the same dimensions and can learn features from the images more consistently.
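To make the idea concrete, here is a minimal nearest-neighbour resize written with NumPy. It is a sketch only: the helper name resize_nearest is hypothetical, and library resize functions usually interpolate between pixels instead of copying them.

```python
import numpy as np

def resize_nearest(image, new_height, new_width):
    """Resize a 2-D image with nearest-neighbour sampling."""
    height, width = image.shape
    # Map every output row/column index back to a source index.
    rows = np.arange(new_height) * height // new_height
    cols = np.arange(new_width) * width // new_width
    return image[rows[:, None], cols]

small = np.array([[1, 2],
                  [3, 4]])
large = resize_nearest(small, 4, 4)
print(large)
```

Applying the same target size to every image gives all samples the same number of features after flattening, which the SVM, KNN and MLP models require.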

Identifying the Hyper Parameters:

Any machine learning algorithm needs data divided into training, validation, and testing sets. A model is
trained on the training set and then evaluated on the validation or testing set. The validation set is
held out from the training data and is used to tune the model, while the testing set is kept for the final
evaluation. Every machine learning algorithm has multiple hyperparameters that can be changed, increased or
decreased to improve the performance of the model.

For the Support Vector Machine (SVM) algorithm, we used the sklearn SVC model.
The Support Vector Classification model is based on libsvm. SVC implements the one-vs-one approach for
multi-class classification [5]. For the SVM we tried the sigmoid kernel as well as the linear kernel. SVMs
are effective in high-dimensional spaces and use a subset of the training points in the decision function; these
points are known as support vectors. Different kernel functions can be specified for the decision function [5].
Further details about the SVM are given later.

For the K-nearest neighbor (KNN) algorithm, we tried different numbers of neighbors to find out
which value of K gives the best results. The KNN algorithm is a supervised learning classifier
that uses proximity to classify or predict the grouping of an individual data point [6]. Given a query
point, the algorithm finds the K training points closest to it and assigns the query point the label that is most
common among those neighbors. Although the names are similar, KNN is not a clustering method and should not be
confused with K-means clustering, which is unsupervised.
Further details of the KNN implementation are given below along with the results.

For the fully connected neural network, we used the L-BFGS optimizer. L-BFGS (limited-memory
Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton optimization method used for parameter estimation when
training machine learning models, including on large amounts of data [7].
The working of the model is described in detail under the heading Results and Explanation.

For the CNN model, we used the Adam optimizer. With this optimizer the model
gave improved accuracy and reduced the training loss efficiently.
Other details are described later.

Furthermore, cross-validation is a technique for evaluating a machine learning model and checking its performance
on the dataset. It helps to compare and select a good model for a predictive modelling problem. There are multiple
ways of doing cross-validation, and we chose the K-fold cross-validation technique. It
repeatedly splits the dataset into a training part and a testing part and validates on the testing part. The
algorithm works as follows: we pick a number K and split the dataset into K folds of roughly equal size. In each
iteration, K-1 folds are used for training and the remaining fold is used as the test set, so that every fold
serves as the test set exactly once.
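The fold rotation can be sketched in a few lines of NumPy. The generator name k_fold_indices, the seed and the toy sample count are hypothetical; in practice scikit-learn's cross-validation utilities do the same bookkeeping.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # k folds of nearly equal size
    for i in range(k):
        test_idx = folds[i]  # one fold is held out for testing
        # The other k-1 folds together form the training set.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=20, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

A model is fitted and scored once per pair, and the K scores are usually averaged to give a single performance estimate.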

The confusion matrix was obtained from the model results. It is a very popular measure used
in classification problems: a performance measurement for machine learning classification
where the output can be two or more classes. For two classes it is a table with the four combinations of
predicted and actual (true) values: true positives, false positives, false negatives and true negatives [9].
The attached table shows its form.
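For the two-class case the four counts can be computed directly, as in this minimal sketch. The function name, the label coding (0 = normal, 1 = pneumonia) and the toy label lists are assumptions for illustration; rows are actual classes and columns are predicted classes.

```python
import numpy as np

def confusion_matrix_binary(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels 0 (normal) and 1 (pneumonia)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return np.array([[tn, fp], [fn, tp]])

# Toy labels used only for illustration.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]
cm = confusion_matrix_binary(y_true, y_pred)
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(cm)
print(accuracy)
```

Accuracy is the sum of the diagonal divided by the total, which is how the accuracy figures reported later relate to each model's confusion matrix.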

Results and Explanation:

● Support Vector Machine (SVM) model

The support vector machine model was introduced by Vladimir Vapnik. SVMs are powerful methods for
solving classification problems. The objective of the support vector machine algorithm is to find
a hyperplane in a multi-dimensional space, with one dimension per feature, that correctly classifies the data
points. Hyperplanes are the decision boundaries that help classify the data points. The input is a set of training
sample pairs and the output is a set of weights, one for each feature, whose linear combination predicts the
value. The optimization maximizes the margin, the distance between the hyperplane and the nearest data points
(the support vectors), which reduces the number of nonzero weights to the few that correspond to the support
vectors and therefore determine the hyperplane [14].

The SVM is implemented with the scikit-learn library. The SVC() class has the tuning
hyperparameters C (regularization), kernel and gamma. C is a penalty parameter applied to
misclassification, or error. It tells the SVM optimizer how much error is tolerable.
The smaller the value of C, the more misclassification is tolerated and the wider the margin of the hyperplane; a
larger C gives a narrower margin. Gamma determines how far the influence of a single training point reaches: a
lower value of gamma fits the training data more loosely. Gamma is set to auto, in which case scikit-learn uses
1 / n_features as its value. The main function of the kernel is to transform
the given data into the form required by the decision function.
Four types of kernel can be applied to the model: linear, Gaussian (RBF), polynomial and sigmoid.
With the sigmoid kernel the SVM is equivalent to a two-layer perceptron neural network, in which the sigmoid acts
as the activation function. The first model implemented for this assessment is the SVM, with a regularization value
between 0 and 1. Linear and sigmoid kernels were tried to check the accuracy of the model, with gamma set to
auto. The kernel used for the reported result was the linear kernel. The accuracy recorded by the SVM model
on the chest x-ray dataset is 0.948. The classification report, 10-fold cross-validation
and confusion matrix were produced to fully understand the accuracy of the method used.
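The configuration described above can be sketched with scikit-learn as follows. This is not the report's script: the synthetic two-class data stands in for the flattened x-ray images, and C=1.0 is an assumed value within the stated 0 to 1 range.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic, well-separated two-class data standing in for flattened images (assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 16)), rng.normal(3.0, 1.0, (50, 16))])
y = np.array([0] * 50 + [1] * 50)

# Linear kernel with gamma="auto", as in the text; C=1.0 is an assumed value.
# gamma is ignored by the linear kernel and only matters for RBF, polynomial and sigmoid.
model = SVC(kernel="linear", C=1.0, gamma="auto")

# 10-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=10)
print(scores.shape, scores.mean())
```

Swapping kernel="linear" for kernel="sigmoid" is enough to compare the two kernels mentioned in the text under the same cross-validation scheme.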

The confusion matrix displays the actual and predicted values for chest x-ray.

The implemented 10-fold cross validation for support vector machine model is shown below.

● Fully Connected Neural Network

An MLP classifier was implemented in this part of the assessment. MLP is a multi-layer perceptron classifier,
which relies on an underlying neural network to perform classification [12]. An MLP is a feedforward artificial
neural network model that maps sets of inputs onto appropriate outputs. In a fully connected layer, each output
dimension depends on each input dimension [15]. By contrast, a CNN uses shared convolutional kernels, is very
suitable for multidimensional data and requires fewer parameters at the same network depth, which reduces
complexity and speeds up the learning process.
The MLPClassifier trains using backpropagation. It is fitted on two arrays: one holds the training samples and
the other holds the target values for those samples. The model
optimizes the log-loss function using L-BFGS (a quasi-Newton method), stochastic gradient descent or Adam
[16]. The parameters set for the method are solver, alpha, hidden_layer_sizes and random_state. The solver
parameter selects the weight optimizer. alpha is the strength of the L2 regularization term. hidden_layer_sizes
gives the number of neurons in each hidden layer. random_state seeds the random number generator used to
initialize the weights and biases, which makes the results reproducible.
The implemented fully connected model achieves an accuracy of 0.89, as shown in the classification report.
A fully connected neural network is a series of fully connected layers in which every neuron in one layer is
connected to every neuron in the next layer. The attached figures show the classification report, confusion
matrix, and 10-fold cross-validation.
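A minimal sketch of an MLPClassifier configured with the four parameters named above is given here. The synthetic data and every parameter value (alpha, the two hidden layer sizes, random_state, max_iter) are assumptions for illustration and not the values used in the report.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data standing in for flattened x-ray images (assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 16)), rng.normal(3.0, 1.0, (50, 16))])
y = np.array([0] * 50 + [1] * 50)

clf = MLPClassifier(
    solver="lbfgs",              # quasi-Newton weight optimizer
    alpha=1e-5,                  # L2 regularization strength (assumed value)
    hidden_layer_sizes=(16, 8),  # two hidden layers with 16 and 8 neurons (assumed)
    random_state=1,              # fixes the random weight initialization
    max_iter=500,
)
clf.fit(X, y)  # X holds the samples, y the target values
print(clf.score(X, y))
```

After fitting, clf.coefs_ holds one weight matrix per pair of consecutive layers, which makes the fully connected structure visible.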

Furthermore, the attached figure displays the confusion matrix of the fully connected network.

The figure attached below shows the 10-fold cross-validation on the chest x-ray dataset for the fully connected
neural network.

● K-nearest Neighbor Algorithm

The KNN algorithm is a supervised machine learning algorithm and is known as a lazy learning algorithm, because
it stores the training data instead of fitting a model in advance. The algorithm
can be used to solve classification as well as regression problems. A supervised machine learning algorithm is
one that relies on labeled input data to learn and then produces an outcome for unlabeled data. In a classification
problem, a class label is assigned to a data point by majority vote: the label that is most
frequently represented around the data point is chosen [17]. Strictly, a majority requires more than
50% of the votes, which applies when there are only two categories.
To determine which data points are closest to the query point, the distance between the query point and each
data point is calculated. These distances determine the decision boundaries. To implement the K-nearest neighbor
classifier, we import it from the sklearn library and then set the number of neighbors, which is 2 in this
scenario. The two main choices in K-NN are the number of neighbors k and the similarity measure. The hard part
is finding the best similarity measure for the classification process. Distance formulae such as Euclidean
distance, Manhattan distance and others are used to find the distance between the query and the data points
[18]. KNN can be sped up by using an inverted list, locality-sensitive hashing or a KD-tree [19].
KNN does not build clusters or update data points over iterations; each query point is classified directly from
the labels of the training points nearest to it. The accuracy of this algorithm is 0.82.
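The distance calculation and majority vote can be sketched in plain NumPy. The function name knn_predict and the toy points are hypothetical, and k=3 is used here (not the report's 2) so that a two-class vote cannot tie.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points: class 0 near the origin, class 1 near (1, 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.15, 0.15]))
pred_b = knn_predict(X_train, y_train, np.array([0.95, 1.0]))
print(pred_a, pred_b)
```

Replacing np.linalg.norm with a sum of absolute differences would give the Manhattan distance mentioned above.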

● Convolutional Neural Network (CNN) algorithm

A convolutional neural network is a deep learning architecture that learns directly from data, eliminating the
need for manual feature extraction [20]. CNNs are particularly useful for finding patterns in images to recognize
objects and faces. Conv-Nets are useful for several important reasons: they eliminate
manual feature extraction, produce accurate recognition results, and can be reused as pre-trained models.
A Conv-Net has many layers, each of which learns to detect different features of an image.
Filters are applied to the training image, and the convolved image becomes the input to the next layer.
We used the ReLU activation function with five layers of neurons and dropout, where the dropout rate
varies with each layer. An activation function defines the output of a neuron given an input or set of inputs.
ReLU returns 0 for a negative input and returns the input unchanged when it is positive. An RMSprop
optimizer was used. In all these layers we also set the padding and stride of the filters. We then
used the binary_crossentropy loss function, which computes the cross-entropy between the predicted
probabilities and the true labels. This loss function is used for classification problems with two classes. The
accuracy obtained with the CNN algorithm is 0.73.
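The two functions described in this paragraph can be written out in NumPy to show what they compute. This is an illustrative sketch and not the deep learning framework's implementation; the function names, the clipping constant eps and the toy probabilities are assumptions.

```python
import numpy as np

def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return np.maximum(0.0, x)

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between 0/1 labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

activations = relu(np.array([-2.0, 0.0, 3.5]))
print(activations)

y_true = np.array([1.0, 0.0, 1.0])
good = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8]))  # confident and correct
bad = binary_crossentropy(y_true, np.array([0.2, 0.7, 0.3]))   # mostly wrong
print(good, bad)
```

The loss is small when the predicted probabilities agree with the labels and grows as they diverge, which is the quantity the optimizer reduces during training.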

The graph shows the accuracy on the testing and validation datasets.

The confusion matrix was computed from the model's predictions.

The classification report can be seen as:

And the accuracy from the model can be stated as:

Conclusion:

Different algorithms were applied to the same dataset: Support Vector Machine (SVM), K-nearest neighbors
classification (K-NN), a fully connected neural network (F-NN), and a convolutional neural
network (CNN). The classification reports generated from these models show different accuracy results. The
accuracy of the SVM model is 0.95 (95%), the K-NN model 0.824 (82.4%), the F-NN model 0.89 (89%), and the CNN
model 0.73 (73%). The SVM model predicts more accurately than any other model used in this scenario with the given
dataset. We therefore conclude that the support vector machine performed best on this dataset compared with
the other models.

References:
[1] “Multiclass classification in machine learning,” DataRobot AI Cloud, 27-Dec-2019. [Online]. Available:
https://www.datarobot.com/blog/multiclass-classification-in-machine-learning/. [Accessed: 25-Jun-2022].

[2] N. Desai, “MultiClass image classification - geek culture - medium,” Geek Culture, 16-Apr-2021. [Online].
Available: https://medium.com/geekculture/multiclass-image-classification-dcf9585f2ff9. [Accessed: 25-Jun-2022].

[3] A. Gandhi, “Data Augmentation,” Nanonets AI & Machine Learning Blog, 19-May-2021. [Online]. Available:
https://nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part-2/.
[Accessed: 25-Jun-2022].

[4] M. Popescu and A. Naaji, “Detection of small tumors of the brain using medical imaging,” in
Handbook of Decision Support Systems for Neurological Disorders, Elsevier, 2021, pp. 33–53.

[5] “1.4. Support vector machines,” scikit-learn. [Online]. Available:
https://scikit-learn.org/stable/modules/svm.html. [Accessed: 25-Jun-2022].

[6] “What is the k-nearest neighbors algorithm?,” Ibm.com. [Online]. Available:
https://www.ibm.com/uk-en/topics/knn. [Accessed: 25-Jun-2022].

[7] M. M. Najafabadi, T. M. Khoshgoftaar, F. Villanustre, and J. Holt, “Large-scale distributed L-BFGS,” J. Big
Data, vol. 4, no. 1, 2017.

[8] V. Lyashenko, “Cross-validation in machine learning: How to do it right,” neptune.ai, 06-Oct-2020. [Online].
Available: https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right. [Accessed: 25-Jun-2022].

[9] S. Narkhede, “Understanding confusion matrix,” Towards Data Science, 09-May-2018. [Online]. Available:
https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62. [Accessed: 25-Jun-2022].

[10] A. Kulkarni, D. Chong, and F. A. Batarseh, “Foundations of data imbalance and solutions for a data democracy,”
in Data Democracy, Elsevier, 2020, pp. 83–106.

[11] “What is pneumonia?,” Asthma + Lung UK, 08-Oct-2015. [Online]. Available:
https://www.blf.org.uk/support-for-you/pneumonia/what-is-pneumonia. [Accessed: 26-Jun-2022].

[12] A. Nair, “A beginner’s guide to Scikit-learn’s MLPClassifier,” Analytics India Magazine, 20-Jun-2019.
[Online]. Available: https://analyticsindiamag.com/a-beginners-guide-to-scikit-learns-mlpclassifier/.
[Accessed: 26-Jun-2022].

[13] “KNN algorithm - Finding Nearest Neighbors,” Tutorialspoint.com. [Online]. Available:
https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm.
[Accessed: 26-Jun-2022].

[14] Mit.edu. [Online]. Available: https://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf.
[Accessed: 20-Aug-2022].

[15] B. Ramsundar and R. B. Zadeh, TensorFlow for Deep Learning. Sebastopol, CA: O’Reilly Media, 2018.

[16] “Sklearn.Neural_network.MLPClassifier,” scikit-learn. [Online]. Available:
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html.
[Accessed: 20-Aug-2022].

[17] “What is the k-nearest neighbors algorithm?,” Ibm.com. [Online]. Available:
https://www.ibm.com/uk-en/topics/knn. [Accessed: 23-Aug-2022].

[18] “Sklearn.Neighbors.KNeighborsClassifier,” scikit-learn. [Online]. Available:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.
[Accessed: 23-Aug-2022].

[19] “What parameters to optimize in KNN?,” Stack Overflow. [Online]. Available:
https://stackoverflow.com/questions/43726728/what-parameters-to-optimize-in-knn. [Accessed: 23-Aug-2022].

[20] S. Kulshrestha, “What is A convolutional neural network?,” in Developing an Image Classifier Using
TensorFlow, Berkeley, CA: Apress, 2019.
