
CMB Project Report


Date : 21 Dec 2024

Indian Statistical Institute, Kolkata


BIOACTIVITY PREDICTION IN DRUG DISCOVERY
Ranjan Kumar Choubey CS2316, Mona Kumari CS2311
Supervisor: Prof. Malay Bhattacharyya

ABSTRACT
The project “BIOACTIVITY PREDICTION APP” applies bioinformatics to the drug discovery process. It predicts
the drug-likeness of a compound, the potency of the input molecules, and the solubility of the molecules.
Drug discovery is pivotal to curing living beings or protecting them against disease, and it should be as
swift as possible so that lives can be saved. At present, drug discovery is a slow process that takes a long
time to yield effective drugs and medicines. This slow rate is hard to accept when advances in technology are
speeding up most other fields. Because bioinformatics has emerged as a standout field in medicine, we have
drawn on previous studies to build machine learning models that speed up the calculation of molecular
potency, expressed as pIC50 values, and the prediction of molecular solubility.
Keywords: Bioinformatics, Potency, Solubility, Random Forest, Linear Regression, Drug Discovery
I. INTRODUCTION
Drug discovery is the step-by-step process by which new drugs are identified. Pharmaceutical companies
generally follow well-established pharmacology- and chemistry-based approaches, and they face various
difficulties in finding new drugs [1]. Bioinformatics aims to make it possible to produce more drugs in a
shorter time and with lower risk [2]. This has given rise to the now well-known field of computer-aided drug
design (CADD) [3], [4]. The exponential growth of biological data has favored the development of primary and
secondary databases of nucleic acid sequences, protein sequences, and structures. Popular databases include
ChEMBL, GenBank, SWISS-PROT, PDB, PIR, SCOP, and CATH; they are publicly available and hosted on online
servers around the world. We undertook a detailed study of Alzheimer’s disease, which is treated here as
acting on a single protein. When a disease attacks the human body, it either inhibits a protein or releases
it, which creates an imbalance in the body [5]. This imbalance is the cause of illness and of its other
effects on the body. We created a dataset in which different compounds are compared. The compounds are first
checked against the Lipinski descriptors for drug-likeness. PaDEL descriptors are then used to generate
molecular fingerprints, which are fed to the model to produce the predictions.
II. PROJECT OBJECTIVE
The proposed machine learning models are intended to be scalable, adaptive, flexible and accurate. The
proposed research predicts the potency and solubility of drug-like molecules that have been filtered using
the Lipinski Rule of Five. It is intended to benefit researchers and biologists who currently carry out these
steps manually to discover a drug. The manual process increases the chance of human error, which further
delays drug discovery and raises its cost. Our machine learning algorithm considers only the molecular
weight, octanol-water partition coefficient, number of hydrogen bond donors and number of hydrogen bond
acceptors of the molecule. Input molecules must be given in SMILES notation together with the ChEMBL ID of
the molecule.
The main objectives of the project are:
1. To assess the applicability of machine learning algorithms that have demonstrated their efficiency in
predicting the potency and solubility of drug-like molecules with better predictive rates.
2. To apply the best machine learning procedures for prediction.

3. To develop a prediction model for the potency of drug-like molecules using a Random Forest model, and a
prediction model for the solubility of the molecules using a Linear Regression model.
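The Rule of Five filter mentioned in this section can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the four descriptors have already been computed for each molecule, the `Descriptors` class and function names are hypothetical, and the `max_violations` parameter is an assumption because the report does not say how many violations it tolerates. The thresholds are the standard Lipinski limits.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hypothetical container for the four Lipinski descriptors."""
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five limits are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def is_drug_like(d: Descriptors, max_violations: int = 0) -> bool:
    """Keep a molecule if it exceeds at most `max_violations` limits (assumed default: none)."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values, for illustration only.
small = Descriptors(mol_weight=300.0, log_p=2.5, h_donors=2, h_acceptors=4)
large = Descriptors(mol_weight=720.0, log_p=6.1, h_donors=3, h_acceptors=12)
```

Here `small` passes the filter while `large` exceeds three of the four limits and is rejected.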
III. WORKING PROCEDURE

Figure 1: Flow Diagram for Potency Prediction Model

Figure 2: Flow Diagram for Solubility Prediction Model
3.1 ALGORITHM FOR POTENCY PREDICTION
Step 1: Gather the data from the ChEMBL database and prepare it by removing missing values.
Step 2: Perform exploratory data analysis on the gathered data.
Step 3: Split the gathered data into a training dataset and a testing dataset.
Step 4: Create the Random Forest model using the training data.
Step 5: Test the Random Forest model using the testing data.
Step 6: Use the model to predict the potency of the molecules.
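Steps 1 and 3 above can be sketched in plain Python. The record layout (`smiles`, `ic50_nm`), the assumption that IC50 values are reported in nanomolar, the 80/20 split and the toy values are all illustrative assumptions rather than details of the report's dataset. The conversion uses the standard definition pIC50 = -log10(IC50 in mol/L).

```python
import math
import random


def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 in nanomolar to pIC50 (1 nM = 1e-9 mol/L)."""
    return 9.0 - math.log10(ic50_nm)


def prepare(records):
    """Drop records with missing or non-positive values and attach pIC50."""
    cleaned = []
    for rec in records:
        if rec.get("smiles") is None or rec.get("ic50_nm") is None:
            continue  # Step 1: remove rows with missing values
        if rec["ic50_nm"] <= 0:
            continue  # log10 is undefined for non-positive values
        cleaned.append({**rec, "pic50": ic50_nm_to_pic50(rec["ic50_nm"])})
    return cleaned


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Step 3: shuffle reproducibly and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records with made-up IC50 values, for illustration only.
records = [
    {"smiles": "CCO", "ic50_nm": 1000.0},
    {"smiles": "c1ccccc1", "ic50_nm": 50.0},
    {"smiles": "CC(=O)O", "ic50_nm": None},  # missing value, dropped
    {"smiles": "CCN", "ic50_nm": 10.0},
    {"smiles": "CCC", "ic50_nm": 250.0},
    {"smiles": "CO", "ic50_nm": 5000.0},
]
cleaned = prepare(records)
train, test = train_test_split(cleaned)
```

A fixed seed keeps the split reproducible, so that a model trained later can be compared across runs on the same held-out molecules.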

3.2 ALGORITHM FOR SOLUBILITY PREDICTION


Step 1: Gather the data from the ChEMBL database and prepare it by removing missing values.
Step 2: Split the gathered data into a training dataset and a testing dataset.
Step 3: Create the Linear Regression model using the training data.
Step 4: Test the Linear Regression model using the testing data.
Step 5: Use the model to predict the solubility of the molecules.
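Step 4 tests the model on held-out data. The report does not state which metric it uses, so the sketch below assumes two common choices for regression: root mean squared error (RMSE) and the coefficient of determination (R squared). The function names and the toy values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between experimental and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values: a perfect prediction and a constant prediction at the mean.
experimental = [1.0, 2.0, 3.0]
perfect = [1.0, 2.0, 3.0]
constant = [2.0, 2.0, 2.0]
```

A perfect prediction gives an RMSE of 0 and an R squared of 1, while always predicting the mean gives an R squared of 0, which is a useful baseline when reading a predicted versus experimental plot.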

3.3 RANDOM FOREST


Random Forest is a popular machine learning algorithm that belongs to the family of supervised learning
techniques. It can be used for both classification and regression problems. It is based on the concept of
ensemble learning, which is the process of combining multiple models to solve a complex problem and improve
model performance. As the name suggests, a Random Forest contains a number of decision trees, each built on a
different subset of the given dataset, and combines their outputs to improve predictive accuracy on that
dataset. Instead of depending on a single decision tree, the random forest takes a prediction from each tree
and derives the final result from them: the majority vote for classification, or the average for regression.
A larger number of trees in the forest generally leads to higher accuracy and helps prevent overfitting. It
is also reported to maintain accuracy when a large portion of the data is missing.
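The idea of combining many trees can be illustrated with a deliberately simplified sketch. It is not the model used in the project: each "tree" here is a depth-one regression stump on a single feature, trained on a bootstrap sample, and the forest averages the stump predictions. A full Random Forest also grows deeper trees and samples a random subset of features at each split, which this sketch omits. All names and data values are illustrative.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-one regression tree: choose the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        error = sum((y - l_mean) ** 2 for y in left) + sum((y - r_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, l_mean, r_mean)
    if best is None:
        # Every x in this sample is identical, so predict the sample mean everywhere.
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict(forest, x):
    """Regression forests average the individual tree predictions."""
    return mean(l if x <= t else r for t, l, r in forest)


# Made-up one-feature data with a step between x = 4 and x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 5.3]
forest = fit_forest(xs, ys)
```

Because each stump sees a slightly different sample, individual thresholds vary, and averaging smooths out that variation.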
3.4 LINEAR REGRESSION
Linear regression helps in analyzing data, finding relationships and patterns in it, and eventually making
informed predictions. It is one of the best-known and best-understood algorithms in statistics and machine
learning. The algorithm models a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Because the relationship is linear, the model
describes how the value of the dependent variable changes with the value of the independent variable. With a
single independent variable, the linear regression model is a sloped straight line representing the
relationship between the variables.
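For the single-variable case, the fitted line can be computed in closed form by ordinary least squares. The sketch below is illustrative only: `fit_line` is a hypothetical helper, the descriptor values are made up, and the report does not state which descriptors feed its solubility model, which may well use several independent variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict_line(slope, intercept, x):
    """Evaluate the fitted line at a new x value."""
    return slope * x + intercept


# Made-up points that lie exactly on y = 2x + 1.
descriptor = [0.0, 1.0, 2.0, 3.0]
response = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(descriptor, response)
```

The slope gives the change in the dependent variable per unit change in the independent variable, which is the quantity the paragraph above describes.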
IV. RESULTS AND ANALYSIS
4.1 RANDOM FOREST MODEL FOR POTENCY PREDICTION

Figure 3: Predicted vs Experimental pIC50 values of the drug-like molecules.

Figure 4: Input Data

Figure 5: Output
4.2 LINEAR REGRESSION MODEL FOR SOLUBILITY PREDICTION

Figure 6: Predicted vs Experimental logS (solubility) values of the drug-like molecules.

Figure 7: Linear Regression Model Performance

Figure 8: Input Data

Figure 9: Output
V. CONCLUSION
As described above, we have created machine learning models to predict the potency and solubility of
molecules. We first trained each model by feeding it the input dataset, and then used the trained model to
make predictions for new molecules. As technology advances, new models are expected to make these predictions
more accurate. Although it is not easy to predict the potency and solubility of unknown compounds
successfully, doing so will be of great benefit to biologists and researchers, as it can considerably speed
up the process of drug discovery and lead to earlier medicines for unknown diseases, thus benefiting mankind.
VI. REFERENCES
[1] M. Iskar, G. Zeller, X.M. Zhao, V. van Noort, P. Bork, “Drug discovery in the age of systems biology: the
rise of computational approaches for data integration”, Curr Opin Biotechnol, 23, pp. 609–616, 2012.
[2] S.S. Ortega, L.C. Cara, M.K. Salvador, “In silico pharmacology for a multidisciplinary drug discovery
process”, Drug Metabol Drug Interact, 27, pp. 199–207, 2012.
[3] C.M. Song, S.J. Lim, J.C. Tong, “Recent advances in computer aided drug design”, Brief Bioinform, 10,
pp. 579–591, 2009.
[4] A. Speck-Planche, M.N. Cordeiro, “Computer-aided drug design, synthesis and evaluation of new anti-cancer
drugs”, Curr Top Med Chem [Epub ahead of print], 2013.
[5] N. Siddharthan, M. Raja Prabhu, S. Balayogan, “Bioinformatics in Drug Discovery a Revi”
