
SUMMARY

TRAINING
this model is developed for predicting DATASET
variable length B-cell epitopes. It is ALONG WITH
developed on LBtope_Variable dataset TESTING
that contain 14876 unique
B-cell epitopes and 23321 unique non B-
cellepitopes. These epitopes and non- all the common epitopes in both the datasets
epitopes have variable length. are removed and also the epitopes having
length less than 5 and greater than 50 are
removed.
REFERENCE : B Cell Epitopes

Non B Cell Epitopes


VALIDATION DATASET
The model is validated on the LBtope_Confirm dataset, which contains only those epitopes and non-epitopes that have been experimentally validated by two or more studies: 1042 unique B-cell epitopes and 1795 unique non-B-cell epitopes.

REFERENCE: B-cell epitopes

Non-B-cell epitopes


Contents:
• Method 1: Support Vector Machine (SVM)
• Method 2: XGBoost
• Method 3: Decision Tree Classifier
• Method 4: RNN (Recurrent Neural Network)
• Method 5: CNN (Convolutional Neural Network)
• Method 6: CNN with k-fold cross-validation (k = 5, 10)
• Method 7: Amino Acid Composition (AAC)
• Method 8: Dipeptide Composition (DPC), gaps 1 and 2
• Method 9: Voting Classifier
• Method 10: BERT
METHOD - 1 : SUPPORT VECTOR MACHINE

METHODOLOGY
• Use a TF-IDF vectorizer to convert all sequences into numerical features.
• Split the LBtope_Variable dataset into training and testing sets.
• Vectorize x_train (fit and transform) and x_test (transform only).
• Other parameters: kernel = linear and RBF (see the sketch below).
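
A minimal sketch of this pipeline is given below. The variable names (sequences, labels) and the character n-gram configuration of the TF-IDF vectorizer are assumptions, since the slides do not state how the vectorizer is set up.

    # Sketch only: assumes peptide strings in `sequences` and 0/1 labels in `labels`.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, matthews_corrcoef, classification_report

    X_train, X_test, y_train, y_test = train_test_split(
        sequences, labels, test_size=0.2, random_state=42)

    # Character n-grams are one common way to apply TF-IDF to peptide strings (assumed here).
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    X_train_vec = vectorizer.fit_transform(X_train)   # fit + transform on the training split
    X_test_vec = vectorizer.transform(X_test)         # transform only on the test split

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel)
        clf.fit(X_train_vec, y_train)
        y_pred = clf.predict(X_test_vec)
        print(kernel, accuracy_score(y_test, y_pred), matthews_corrcoef(y_test, y_pred))
        print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))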

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
ROC curve
Results: Testing data, LBtope_Variable
Accuracy: 0.6100785340314137

              precision    recall  f1-score   support

    negative       0.61      1.00      0.76      4661
    positive       0.00      0.00      0.00      2979

    accuracy                           0.61      7640
   macro avg       0.31      0.50      0.38      7640
weighted avg       0.37      0.61      0.46      7640

The model is defaulting to the majority class rather than learning discriminative features from the sequences.

INTERPRETATION:
• The model is heavily biased towards predicting the negative class, failing to identify any positive instances.
• The overall performance metrics indicate significant room for improvement, especially in correctly identifying positive cases.
METHOD - 2 : XGBOOST

METHODOLOGY
• Use a TF-IDF vectorizer to convert all sequences into numerical features.
• Split the LBtope_Variable dataset into training and testing sets.
• Vectorize x_train (fit and transform) and x_test (transform only), then train an XGBoost classifier (see the sketch below).
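
A minimal sketch, reusing the vectorized features from the Method 1 sketch; the same swap gives Method 3 by substituting DecisionTreeClassifier.

    # Sketch only: reuses X_train_vec, X_test_vec, y_train, y_test from the Method 1 sketch.
    from xgboost import XGBClassifier
    from sklearn.tree import DecisionTreeClassifier   # drop-in replacement for Method 3
    from sklearn.metrics import accuracy_score, matthews_corrcoef

    for clf in (XGBClassifier(), DecisionTreeClassifier()):
        clf.fit(X_train_vec, y_train)
        y_pred = clf.predict(X_test_vec)
        print(type(clf).__name__,
              accuracy_score(y_test, y_pred),
              matthews_corrcoef(y_test, y_pred))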

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
ROC curve
Results: Testing data, LBtope_Variable

Accuracy: 0.618455497382199

              precision    recall  f1-score   support

           0       0.62      1.00      0.76      4725
           1       0.00      0.00      0.00      2915

    accuracy                           0.62      7640
   macro avg       0.31      0.50      0.38      7640
weighted avg       0.38      0.62      0.47      7640
INTERPRETATION:
• The model is heavily biased towards predicting the negative class, failing to identify any positive instances.
• The overall performance metrics indicate significant room for improvement, especially in correctly identifying positive cases.
METHOD - 3 : DECISION TREES

METHODOLOGY
• Use a TF-IDF vectorizer to convert all sequences into numerical features.
• Split the LBtope_Variable dataset into training and testing sets.
• Vectorize x_train (fit and transform) and x_test (transform only), then train a decision tree classifier.

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
Results: Testing data, LBtope_Variable

Accuracy: 0.6100785340314137
              precision    recall  f1-score   support

    negative       0.61      1.00      0.76      4661
    positive       0.00      0.00      0.00      2979

    accuracy                           0.61      7640
   macro avg       0.31      0.50      0.38      7640
weighted avg       0.37      0.61      0.46      7640
INTERPRETATION:
• The model is heavily biased towards predicting the negative class, failing to identify any positive instances.
• The overall performance metrics indicate significant room for improvement, especially in correctly identifying positive cases.
METHOD - 4 : Recurrent Neural Network
METHODOLOGY
• Use a TF-IDF vectorizer to convert all sequences into numerical features.
• Split the LBtope_Variable dataset into training and testing sets.
• Vectorize x_train (fit and transform) and x_test (transform only).

MODEL (see the sketch below)
1. Embedding layer: the input dimension is the vocabulary size plus 1 (for the pad token); the output dimension is a 16-dimensional vector.
2. SimpleRNN layer: 16 units; it processes the sequence one step at a time and maintains a hidden state.
3. Dense layer: a fully connected layer with 16 units and ReLU activation; it processes the output of the RNN layer and applies a non-linear transformation.
4. Output Dense layer: a single unit with sigmoid activation, used for binary classification.
5. Loss function: binary cross-entropy; optimizer: Adam.
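
A minimal Keras sketch of this architecture. Note that an Embedding layer consumes integer token ids rather than TF-IDF vectors, so the sketch assumes the peptides have been integer-encoded and padded; vocab_size and max_len are placeholder values.

    # Sketch only: assumes integer-encoded, padded peptides (not TF-IDF vectors).
    import tensorflow as tf
    from tensorflow.keras import layers, models

    vocab_size, max_len = 20, 50    # 20 amino acids, max peptide length (assumed)

    model = models.Sequential([
        layers.Embedding(input_dim=vocab_size + 1, output_dim=16),  # +1 for the pad id 0
        layers.SimpleRNN(16),                      # processes the sequence step by step
        layers.Dense(16, activation="relu"),       # non-linear transform of the RNN output
        layers.Dense(1, activation="sigmoid"),     # binary epitope / non-epitope output
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.build(input_shape=(None, max_len))
    model.summary()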

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
RESULTS

[Figures: epoch vs. accuracy on the test split, and the ROC curve]

Results: Testing data, LBtope_Variable

Accuracy: 0.6663612723350525    MCC: 0.28156001957534404

              precision    recall  f1-score   support

           0       0.68      0.85      0.76      4603
           1       0.64      0.40      0.49      3037

    accuracy                           0.67      7640
   macro avg       0.66      0.62      0.62      7640
weighted avg       0.66      0.67      0.65      7640

Confusion matrix:
[[3906  697]
 [1820 1217]]
With validation dataset: LBtope_Confirm

Accuracy: 0.7271765949947128    MCC: 0.38499575326462343

              precision    recall  f1-score   support

           0       0.74      0.88      0.80      1795
           1       0.69      0.47      0.56      1042

    accuracy                           0.73      2837
   macro avg       0.71      0.67      0.68      2837
weighted avg       0.72      0.73      0.71      2837

Confusion matrix:
[[1577  218]
 [ 556  486]]
INTERPRETATION:
• The flat, similar curves for training and validation accuracy suggest that the model's learning is stagnant: it is neither improving nor overfitting. However, the accuracy level indicates that the model's performance is relatively low.
• The AUC is 0.70, which means the model has a moderate ability to distinguish between the classes, performing better than random guessing but still with room for improvement.
METHOD - 5 : CNN
METHODOLOGY
• Use a TF-IDF vectorizer to convert all sequences into numerical features.
• Split the LBtope_Variable dataset into training and testing sets.
• Vectorize x_train (fit and transform) and x_test (transform only).

MODEL (see the sketch below)
1. Embedding layer: the input dimension is the vocabulary size plus 1 (for the pad token); the output dimension is a 1000-dimensional vector.
2. Convolutional layer: 128 filters, window size 5, ReLU activation.
3. Global max pooling layer: a single value per feature map (128 values for 128 filters).
4. Dense layer: 128 neurons with ReLU activation.
5. Dropout layer: drops a fraction (0.5) of the inputs during training.
6. Output Dense layer: a single unit with sigmoid activation, used for binary classification.
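
A minimal Keras sketch of this CNN, again assuming integer-encoded, padded peptides; vocab_size, max_len and the commented fit() arguments are placeholders.

    # Sketch only: assumes integer-encoded, padded peptides.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    vocab_size, max_len = 20, 50    # assumed placeholder values

    model = models.Sequential([
        layers.Embedding(input_dim=vocab_size + 1, output_dim=1000),   # +1 for the pad token
        layers.Conv1D(filters=128, kernel_size=5, activation="relu"),  # 128 filters, window size 5
        layers.GlobalMaxPooling1D(),            # one value per feature map -> 128 values
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                    # drop half the activations during training
        layers.Dense(1, activation="sigmoid"),  # binary classification output
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.build(input_shape=(None, max_len))
    # model.fit(X_train_ids, y_train, epochs=16, batch_size=32, validation_split=0.1)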

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
RESULTS
epochs=15, batch_size=32: testing accuracy 0.689397931098938, validation accuracy 0.8396193161790624
epochs=16, batch_size=32: testing accuracy 0.7065445184707642, validation accuracy 0.8403242862178357
epochs=20, batch_size=32: testing accuracy 0.6912303566932678, validation accuracy 0.8491364117025026
Results: Testing data, LBtope_Variable

Classification report:
              precision    recall  f1-score   support

           0       0.73      0.79      0.76      4661
           1       0.62      0.53      0.57      2979

    accuracy                           0.69      7640
   macro avg       0.67      0.66      0.67      7640
weighted avg       0.69      0.69      0.69      7640

MCC: 0.3360494589435545
Confusion matrix:
[[3698  963]
 [1396 1583]]
Validation data: LBtope_Confirm

MCC: 0.7165044856049663

              precision    recall  f1-score   support

           0       0.86      0.89      0.88      1795
           1       0.80      0.74      0.77      1042

    accuracy                           0.84      2837
   macro avg       0.83      0.82      0.82      2837
weighted avg       0.84      0.84      0.84      2837

Confusion matrix:
[[1606  189]
 [ 266  776]]
RANDOM TRIES
batch_size=64, epochs=100: accuracy 69.82
batch_size=32, epochs=10:  accuracy 68.2
batch_size=32, epochs=13:  accuracy 68.6
batch_size=8,  epochs=15:  accuracy 69.06
batch_size=8,  epochs=16:  accuracy 69.86
batch_size=8,  epochs=16:  accuracy 70.46
batch_size=8,  epochs=18:  accuracy 71.06
batch_size=8,  epochs=21:  accuracy 71.3
batch_size=4,  epochs=50:  accuracy 69.03
INTERPRETATION

• The validation accuracy shows a slight increase initially but then flattens and
fluctuates around 0.63.
• This indicates that while the model is improving on the training data, it is not showing
similar improvement on the validation data.
METHOD - 6 : CNN with KFold - 5
Results: Testing data LBtope-variable

Mean cross-validation accuracy: 0.7072543501853943    MCC: 0.3725244342365087
Classification report:
              precision    recall  f1-score   support

           0       0.74      0.80      0.77     23321
           1       0.64      0.55      0.60     14876

    accuracy                           0.71     38197
   macro avg       0.69      0.68      0.68     38197
weighted avg       0.70      0.71      0.70     38197
With validation data: LBtope_Confirm

Accuracy: 0.8854423686993302    MCC: 0.7545398019978594

              precision    recall  f1-score   support

           0       0.91      0.91      0.91      1795
           1       0.85      0.84      0.84      1042

    accuracy                           0.89      2837
   macro avg       0.88      0.88      0.88      2837
weighted avg       0.89      0.86      0.89      2837

Confusion matrix:
[[1604  191]
 [ 137  905]]
CNN with KFold - 10
Results: Testing data LBtope-variable

Mean cross-validation accuracy: 0.7143228650093079    MCC: 0.3845715900895881
Classification report:
              precision    recall  f1-score   support

           0       0.74      0.82      0.78     23321
           1       0.66      0.55      0.60     14876

    accuracy                           0.71     38197
   macro avg       0.70      0.68      0.69     38197
weighted avg       0.71      0.71      0.71     38197
With validation data: LBtope_Confirm

Accuracy: 0.8709904829044766    MCC: 0.718803700

              precision    recall  f1-score   support

           0       0.88      0.93      0.90      1795
           1       0.86      0.77      0.81      1042

    accuracy                           0.87      2837
   macro avg       0.87      0.85      0.86      2837
weighted avg       0.87      0.87      0.87      2837

Confusion matrix:
[[1665  130]
 [ 236  806]]
INTERPRETATION
• With cv = 5, the model shows higher precision and recall for the majority class (class 0) than for the minority class (class 1), indicating a bias towards the majority class.
• With cv = 10, the model performs well overall, but improving recall for the positive class and further tuning could enhance its performance.
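
For reference, a minimal sketch of how the k-fold evaluation above could be wired up, assuming a build_cnn() helper that returns a freshly compiled model like the Method 5 sketch, integer-encoded inputs X (a NumPy array of padded token ids), labels y as a NumPy array, and placeholder epoch/batch-size values.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score, matthews_corrcoef

    def cross_validate_cnn(X, y, build_cnn, k=5, epochs=15, batch_size=32):
        """Train a fresh CNN on each fold and return the mean accuracy and MCC."""
        accs, mccs = [], []
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
        for train_idx, test_idx in skf.split(X, y):
            model = build_cnn()                              # new model per fold
            model.fit(X[train_idx], y[train_idx],
                      epochs=epochs, batch_size=batch_size, verbose=0)
            preds = (model.predict(X[test_idx]) > 0.5).astype(int).ravel()
            accs.append(accuracy_score(y[test_idx], preds))
            mccs.append(matthews_corrcoef(y[test_idx], preds))
        return np.mean(accs), np.mean(mccs)

    # mean_acc, mean_mcc = cross_validate_cnn(X, y, build_cnn, k=5)   # or k=10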
METHOD - 7: PFeature - AAC

METHODOLOGY
• To preprocess the training dataset and extract relevant features, we employed CD-HIT for sequence clustering and redundancy removal, followed by Pfeature for feature extraction based on amino acid composition (AAC), which yields 20 distinct features.
• The obtained features are then passed to LazyClassifier to identify the best-fitting classification model (see the sketch below).
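
The slides compute AAC with Pfeature; as a rough equivalent, the 20-dimensional amino acid composition is simply the fraction of each residue in a peptide, and LazyClassifier comes from the lazypredict package. The names sequences and labels below are placeholders.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from lazypredict.Supervised import LazyClassifier

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def aac(seq):
        """20-dimensional amino acid composition: fraction of each residue."""
        seq = seq.upper()
        return np.array([seq.count(a) / len(seq) for a in AMINO_ACIDS])

    # X = np.vstack([aac(s) for s in sequences]); y = np.array(labels)   # placeholders
    # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # models, _ = LazyClassifier(verbose=0, ignore_warnings=True).fit(X_train, X_test, y_train, y_test)
    # print(models.head())   # ranked candidate models; Extra Trees was the best fit here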

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
Results: Testing data, LBtope_Variable    Best-fit model: Extra Trees Classifier

Accuracy: 0.6868533171028606    MCC: 0.363363972584853

              precision    recall  f1-score   support

           0       0.68      0.80      0.73      1779
           1       0.70      0.55      0.61      1507

    accuracy                           0.69      3286
   macro avg       0.69      0.67      0.67      3286
weighted avg       0.69      0.69      0.68      3286

Confusion matrix:
[[1429  350]
 [ 684  823]]
With validation data: LBtope_Confirm

Accuracy: 0.700740218540712    MCC: 0.38371043785935655

              precision    recall  f1-score   support

    negative       0.79      0.71      0.75      1795
    positive       0.58      0.68      0.63      1042

    accuracy                           0.70      2837
   macro avg       0.69      0.70      0.69      2837
weighted avg       0.72      0.70      0.70      2837

Confusion matrix:
[[1275  520]
 [ 329  713]]
METHOD - 8: PFeature - DPC - 1

METHODOLOGY
• To preprocess the training dataset and extract relevant features, we employed CD-HIT for sequence clustering and redundancy removal, followed by Pfeature for feature extraction based on dipeptide composition with a gap of 1 (alternate residues), which yields 400 distinct features.
• The obtained features are then passed to LazyClassifier to identify the best-fitting classification model (see the sketch below).
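
A rough sketch of the gapped dipeptide composition, assuming that a "gap of g" means counting residue pairs separated by g intervening positions (the slides compute this with Pfeature); the resulting 400 features feed into LazyClassifier exactly as in the AAC sketch.

    import numpy as np
    from itertools import product

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]   # 400 pairs

    def dpc(seq, gap=1):
        """400-dimensional dipeptide composition with the given gap."""
        seq = seq.upper()
        counts = dict.fromkeys(DIPEPTIDES, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:          # skip non-standard residues
                counts[pair] += 1
                total += 1
        if total == 0:
            return np.zeros(len(DIPEPTIDES))
        return np.array([counts[p] / total for p in DIPEPTIDES])

    # X = np.vstack([dpc(s, gap=1) for s in sequences])   # gap=2 for the DPC-2 variant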

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
Results: Testing data, LBtope_Variable    Best-fit model: Random Forest Classifier

Accuracy: 0.7096774193548387    MCC: 0.363363972584853

              precision    recall  f1-score   support

    negative       0.69      0.84      0.76      1779
    positive       0.74      0.56      0.64      1507

    accuracy                           0.71      3286
   macro avg       0.72      0.70      0.70      3286
weighted avg       0.72      0.71      0.70      3286

Confusion matrix:
[[1490  289]
 [ 665  842]]
With validation data: LBtope_Confirm

Accuracy: 0.7151921043355658    MCC: 0.402687364

              precision    recall  f1-score   support

    negative       0.79      0.74      0.77      1795
    positive       0.60      0.67      0.63      1042

    accuracy                           0.72      2837
   macro avg       0.70      0.71      0.70      2837
weighted avg       0.72      0.72      0.72      2837

Confusion matrix:
[[1333  462]
 [ 346  696]]
PFeature - DPC - 2

METHODOLOGY
• To preprocess the training dataset and extract relevant features, we employed CD-HIT for sequence clustering and redundancy removal, followed by Pfeature for feature extraction based on dipeptide composition with a gap of 2, which yields 400 distinct features.
• The obtained features are then passed to LazyClassifier to identify the best-fitting classification model.

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
Results: Testing data, LBtope_Variable    Best-fit model: Extra Trees Classifier

Accuracy: 0.7087644552647596    MCC: 0.4106882852918963

              precision    recall  f1-score   support

    negative       0.71      0.79      0.75      1779
    positive       0.71      0.61      0.66      1507

    accuracy                           0.71      3286
   macro avg       0.71      0.70      0.70      3286
weighted avg       0.71      0.71      0.71      3286

Confusion matrix:
[[1406  373]
 [ 584  923]]
With validation data: LBtope_Confirm

Accuracy: 0.697567853366232    MCC: 0.36

              precision    recall  f1-score   support

    negative       0.81      0.69      0.74      1795
    positive       0.57      0.72      0.63      1042

    accuracy                           0.70      2837
   macro avg       0.69      0.70      0.69      2837
weighted avg       0.72      0.70      0.70      2837

Confusion matrix:
[[1233  562]
 [ 296  746]]
METHOD - 9: VOTING CLASSIFIER
METHODOLOGY
• To preprocess the training dataset and extract relevant features, we employed CD-HIT for sequence clustering and redundancy removal, followed by Pfeature for feature extraction based on dipeptide composition with a gap of 1, which yields 400 distinct features.
• The obtained features are then trained with a soft-voting ensemble of a Random Forest classifier, an Extra Trees classifier, LightGBM, SVC, and NuSVC (see the sketch below).
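
A minimal sketch of the ensemble, reading "Random Classifier" on the slide as scikit-learn's RandomForestClassifier; LGBMClassifier is from the lightgbm package, and SVC/NuSVC need probability=True so that soft voting can average predicted probabilities. The DPC features (X_train, y_train, X_test) are assumed from the earlier sketches.

    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
    from sklearn.svm import SVC, NuSVC
    from lightgbm import LGBMClassifier

    ensemble = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier()),
            ("et", ExtraTreesClassifier()),
            ("lgbm", LGBMClassifier()),
            ("svc", SVC(probability=True)),
            ("nusvc", NuSVC(probability=True)),
        ],
        voting="soft",   # average class probabilities across the five models
    )
    # ensemble.fit(X_train, y_train)
    # y_pred = ensemble.predict(X_test)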

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
Results: Testing data, LBtope_Variable

Accuracy: 0.70    MCC: 0.3823098

              precision    recall  f1-score   support

    negative       0.70      0.77      0.73      1783
    positive       0.69      0.61      0.65      1503

    accuracy                           0.70      3286
   macro avg       0.69      0.69      0.69      3286
weighted avg       0.69      0.70      0.69      3286

Confusion matrix:
[[1366  417]
 [ 585  918]]
With validation data: LBtope_Confirm

Accuracy: 0.77    MCC: 0.50040267

              precision    recall  f1-score   support

    negative       0.80      0.85      0.82      1795
    positive       0.71      0.64      0.67      1042

    accuracy                           0.77      2837
   macro avg       0.76      0.74      0.75      2837
weighted avg       0.77      0.77      0.77      2837

Confusion matrix:
[[1518  277]
 [ 371  671]]
METHOD - 10: BERT
METHODOLOGY
• Model Selection: A pre-trained BERT model (BertForSequenceClassification)
was chosen for sequence classification tasks.
• Training Setup: The model was fine-tuned on the training data using the cross-
entropy loss function and the Adam optimizer.
• Training Loop: During training, the model's parameters were updated in each
epoch using backpropagation, and the loss and accuracy were recorded.
• Evaluation: After each epoch, the model's performance was evaluated on the
validation set without updating the model parameters.
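
A minimal sketch of this fine-tuning loop with the Hugging Face transformers library. The checkpoint name, learning rate, number of epochs, and the train_loader/val_loader DataLoaders (yielding input_ids, attention_mask, labels batches) are all assumptions, since the slides do not state them.

    import torch
    from transformers import BertForSequenceClassification

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2).to(device)          # checkpoint name assumed
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # learning rate assumed
    epochs = 3                                                 # assumed

    for epoch in range(epochs):
        model.train()
        for input_ids, attention_mask, labels in train_loader:   # assumed DataLoader
            optimizer.zero_grad()
            out = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=labels.to(device))              # returns cross-entropy loss
            out.loss.backward()
            optimizer.step()

        model.eval()                                           # evaluation: no parameter updates
        correct = total = 0
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_loader:  # assumed DataLoader
                logits = model(input_ids=input_ids.to(device),
                               attention_mask=attention_mask.to(device)).logits
                correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
                total += labels.size(0)
        print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")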

METRICS
Accuracy score, Matthews correlation coefficient (MCC), recall, precision, F1-score
ROC CURVE
Results: Testing data, LBtope_Variable

Training:
Accuracy of last epoch: 0.94059735
Loss of last epoch: 0.159418595

Testing:
Accuracy of last epoch: 0.701178010
Loss of last epoch: 1.02653

[Figures: epoch vs. accuracy and epoch vs. loss on the testing data]
With validation data: LBtope_Confirm

Accuracy: 0.9407825167430    MCC: 0.872104955
Loss: 0.20065079126088

              precision    recall  f1-score   support

    negative       0.93      0.97      0.95      1795
    positive       0.95      0.88      0.92      1042

    accuracy                           0.94      2837
   macro avg       0.94      0.93      0.94      2837
weighted avg       0.94      0.94      0.94      2837

Confusion matrix:
[[1749   49]
 [ 122  920]]
Note: the shaded region in the plots marks the random-tries runs.
