Summary
TRAINING AND TESTING DATASET
This model is developed for predicting variable-length B-cell epitopes. It is developed on the LBtope_Variable dataset, which contains 14876 unique B-cell epitopes and 23321 unique non-B-cell epitopes of variable length. All epitopes common to both datasets are removed, as are epitopes with length less than 5 or greater than 50.
REFERENCE : B Cell Epitopes
METHODOLOGY
Using the TF-IDF vectorizer (which converts all the sequences to numerical features):
- split the dataset into training and testing sets - LBtope_Variable
- vectorize x_train (fit and transform) and x_test (transform only)
Other parameters:
- kernel: linear and rbf
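The pipeline above can be sketched with scikit-learn as follows. This is a minimal illustration, not the original code: the function name is invented, and the char-level n-gram analyzer is an assumption about how peptide strings are fed to TF-IDF.

```python
# Sketch of the TF-IDF + SVM pipeline described above (scikit-learn).
# Char-level analysis of the peptide strings is an assumption here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_tfidf_svm(sequences, labels, kernel="linear"):
    # split the dataset into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(
        sequences, labels, test_size=0.2, random_state=42)
    # fit + transform on x_train, transform only on x_test
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    x_train_vec = vectorizer.fit_transform(x_train)
    x_test_vec = vectorizer.transform(x_test)
    clf = SVC(kernel=kernel)          # kernel: "linear" or "rbf"
    clf.fit(x_train_vec, y_train)
    return clf, vectorizer, clf.score(x_test_vec, y_test)
```

Fitting the vectorizer only on the training split (and merely transforming the test split) avoids leaking test-set vocabulary statistics into training.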
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
ROC curve
Results: testing data, LBtope_Variable
Accuracy: 0.6100785340314137
INTERPRETATION
This model is only memorizing the data provided, not learning the features from it.
METHODOLOGY
Using the TF-IDF vectorizer (which converts all the sequences to numerical features):
- split the dataset into training and testing sets - LBtope_Variable
- vectorize x_train (fit and transform) and x_test (transform only)
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
ROC curve
Results: testing data, LBtope_Variable
Accuracy: 0.618455497382199
METHODOLOGY
Using the TF-IDF vectorizer (which converts all the sequences to numerical features):
- split the dataset into training and testing sets - LBtope_Variable
- vectorize x_train (fit and transform) and x_test (transform only)
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
Results: testing data, LBtope_Variable
Accuracy: 0.6100785340314137
MODEL
1. Embedding layer: the input dimension is the vocabulary size plus 1 (for padding); the output dimension is a 16-dimensional vector.
2. SimpleRNN(16): this layer consists of 16 units; it processes the sequence data one step at a time and maintains a hidden state.
3. Dense layer: a fully connected layer with 16 units and ReLU activation; it processes the output from the RNN layer and applies a non-linear transformation.
4. Output Dense layer: the output layer with a single unit and sigmoid activation, used for binary classification.
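The recurrence in step 2 can be made concrete with a minimal NumPy forward pass. This is an illustration only: the weights below are random placeholders, but the update rule (tanh of input projection plus hidden-state projection) matches what a Keras SimpleRNN computes.

```python
import numpy as np

def simple_rnn_forward(x_seq, w_x, w_h, b):
    """Minimal SimpleRNN forward pass: the sequence is processed one
    timestep at a time, and the hidden state h is carried between steps
    (tanh activation, as in Keras's SimpleRNN)."""
    h = np.zeros(w_h.shape[0])
    for x_t in x_seq:                        # one embedded token per step
        h = np.tanh(x_t @ w_x + h @ w_h + b)
    return h                                 # final hidden state (16-dim here)

rng = np.random.default_rng(0)
emb_dim, units, seq_len = 16, 16, 10
h = simple_rnn_forward(
    rng.normal(size=(seq_len, emb_dim)),     # embedded input sequence
    rng.normal(size=(emb_dim, units)),       # input-to-hidden weights
    rng.normal(size=(units, units)),         # hidden-to-hidden weights
    np.zeros(units))
```

The final hidden state is what the downstream Dense(16, ReLU) layer consumes.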
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
RESULTS (test)
• The AUC is 0.70, which means the model has a moderate ability to distinguish between the classes, performing better than random guessing but still with room for improvement.
METHOD - 5 : CNN
METHODOLOGY
Using the TF-IDF vectorizer (which converts all the sequences to numerical features):
- split the dataset into training and testing sets - LBtope_Variable
- vectorize x_train (fit and transform) and x_test (transform only)
MODEL
1. Embedding layer: the input dimension is the vocabulary size plus 1 (for padding); the output dimension is a 1000-dimensional vector.
2. Convolutional layer: 128 filters with a window size of 5 and ReLU activation.
3. Global max pooling layer: a single value per feature map (128 values in total, since the convolutional layer has 128 filters).
4. Dense layer: 128 neurons, with ReLU activation applied to the layer's output.
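Steps 2 and 3 (convolution then global max pooling) reduce every sequence to one value per filter. The NumPy sketch below illustrates that mechanism with random placeholder weights; it is not the trained model, just the same arithmetic a Keras Conv1D + GlobalMaxPooling1D pair performs.

```python
import numpy as np

def conv1d_global_max_pool(x, filters, bias):
    """x: (seq_len, emb_dim) embedded sequence.
    filters: (n_filters, window, emb_dim), e.g. 128 filters of window 5.
    Valid 1-D convolution + ReLU, then global max pooling, which keeps
    a single value per feature map (so 128 values for 128 filters)."""
    n_filters, window, _ = filters.shape
    n_pos = x.shape[0] - window + 1
    feature_maps = np.empty((n_pos, n_filters))
    for i in range(n_pos):
        patch = x[i:i + window]                               # (window, emb_dim)
        act = np.tensordot(filters, patch, axes=([1, 2], [0, 1])) + bias
        feature_maps[i] = np.maximum(0.0, act)                # ReLU
    return feature_maps.max(axis=0)                           # (n_filters,)

rng = np.random.default_rng(0)
pooled = conv1d_global_max_pool(
    rng.normal(size=(50, 16)),           # toy 50-step, 16-dim embedding
    rng.normal(size=(128, 5, 16)),       # 128 filters, window size 5
    np.zeros(128))
```

Global max pooling is what makes the architecture length-independent: however long the epitope, the Dense layer always receives a fixed 128-value vector.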
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
RESULTS (epochs = 15, batch_size = 32)
Testing accuracy: 0.689397931098938
Validation accuracy: 0.8396193161790624
Results: testing data, LBtope_Variable
MCC: 0.7165044856049663
INTERPRETATION
• The validation accuracy shows a slight increase initially but then flattens and fluctuates around 0.63.
• This indicates that while the model is improving on the training data, it is not showing similar improvement on the validation data.
METHOD - 6 : CNN with KFold - 5
Results: testing data, LBtope_Variable
MCC (cv = 5): 0.3725244342365087
MCC (cv = 10): 0.3845715900895881
Accuracy: 0.8709904829044766
Confusion matrix:
[[1665  130]
 [ 236  806]]
Classification report:
              precision  recall  f1-score  support
accuracy                          0.87      2837
macro avg        0.87     0.85    0.86      2837
weighted avg     0.87     0.87    0.87      2837
INTERPRETATION
• The model shows higher precision and recall for the majority class (class 0) than for the minority class (class 1), indicating a bias towards the majority class for cv = 5.
• While the model performs well, focusing on improving recall for the positive class and further tuning could enhance its performance for cv = 10.
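The fold loop behind these numbers can be sketched with scikit-learn. A stand-in linear classifier is used below in place of the CNN (the cross-validation logic is identical); the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

def kfold_mcc(make_model, X, y, n_splits=5):
    """Train a fresh model on each fold and average the MCC across folds.
    StratifiedKFold keeps the epitope/non-epitope ratio in every fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = make_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        scores.append(matthews_corrcoef(y[test_idx], preds))
    return float(np.mean(scores))
```

Stratification matters here because the two classes are imbalanced (1795 vs 1042 in the confusion matrix above); plain KFold could produce folds with skewed class ratios.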
METHOD - 7: PFeature - AAC
METHODOLOGY
• To preprocess the training dataset and extract relevant features, we employed CD-HIT for sequence clustering and redundancy removal, followed by Pfeature for feature extraction based on amino acid composition, which results in 20 distinct features.
• The obtained features are then trained with LazyClassifier to find the best-fit model for classification.
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
Results: testing data, LBtope_Variable (best-fit model: Extra Trees Classifier)
MCC: 0.363363972584853
Accuracy: 0.6868533171028606
Confusion matrix:
[[1429  350]
 [ 684  823]]
With validation data (LBtope_Fixed):
Accuracy: 0.700740218540712, MCC: 0.38371043785935655
METHODOLOGY
• To preprocess the training dataset and extract relevant features, we employed CD-HIT for sequence clustering and redundancy removal, followed by Pfeature for feature extraction based on amino acid composition, which results in 400 distinct features (alternate, with a gap of 1).
• The obtained features are then trained with LazyClassifier to find the best-fit model for classification.
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
Results: LBtope_Variable (best-fit model: Random Forest Classifier)
MCC: 0.363363972584853
Accuracy: 0.7096774193548387
Confusion matrix:
[[1490  289]
 [ 665  842]]
With validation data (LBtope_Fixed):
MCC: 0.402687364
Accuracy: 0.7151921043355658
METHODOLOGY
• To preprocess the training dataset and extract relevant features, we employed CD-HIT for sequence clustering and redundancy removal, followed by Pfeature for feature extraction based on amino acid composition, which results in 400 distinct features (with a gap of 2).
• The obtained features are then trained with LazyClassifier to find the best-fit model for classification.
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
Results: LBtope_Variable (best-fit model: Extra Trees Classifier)
MCC: 0.4106882852918963
Accuracy: 0.7087644552647596
Confusion matrix:
[[1406  373]
 [ 584  923]]
With validation data (LBtope_Fixed):
Accuracy: 0.697567853366232, MCC: 0.36
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
Results: LBtope_Variable
MCC: 0.3823098
Accuracy: 0.70
Confusion matrix:
[[1366  417]
 [ 585  918]]
With validation data (LBtope_Fixed):
MCC: 0.50040267
Accuracy: 0.77
METRICS
Accuracy score, Matthews correlation coefficient, recall, precision, F1-score
ROC curve
Results: LBtope_Variable
Training:
Accuracy of last epoch: 0.94059735
Loss of last epoch: 0.159418595
Testing:
Accuracy of last epoch: 0.701178010
Loss of last epoch: 1.02653