CMB Project Report
ABSTRACT
The project “BIOACTIVITY PREDICTION APP” is a step forward in bioinformatics-based drug discovery. It predicts
the drug-likeness of a compound, the potency of the input molecules, and the solubility of the molecules. Drug
discovery is pivotal to curing living beings or protecting them against disease, and it should be as swift as
possible so that lives can be saved. At present, drug discovery is a slow process, and developing effective
drugs and medicines takes a long time. This pace is unacceptable today, when advances in technology are
speeding up everything else. Because bioinformatics has emerged as a standout field in medicine, we drew on
previous studies to build machine learning models that speed up the calculation of molecular potency,
expressed as pIC50 values, and the prediction of molecular solubility.
Keywords: Bioinformatics, Potency, Solubility, Random Forest, Linear Regression, Drug Discovery
I. INTRODUCTION
Drug discovery is the step-by-step process by which new drugs are found. Pharmaceutical companies generally
follow well-established pharmacology- and chemistry-based approaches to drug discovery and face various
difficulties in finding new drugs [1]. The purpose of applying bioinformatics to drug discovery is to produce
more drugs in a shorter time and at lower risk [2]. There is now a distinct, well-known field devoted to this,
computer-aided drug design (CADD) [3], [4]. The exponential growth of biological data has favored the
development of primary and secondary databases of nucleic acid sequences, protein sequences, and structures.
Some of the most popular databases include ChEMBL, GenBank, SWISS-PROT, PDB, PIR, SCOP, and CATH. These
resources are in the public domain and are hosted on online servers around the world. We carried out a
detailed study of Alzheimer’s disease, which attacks a single protein. When a new disease attacks the human
body, it either inhibits a protein or releases it, which creates an imbalance in the body [5]. This imbalance
is the cause of illness and other effects in the body. We created a dataset in which different compounds are
compared. First, each compound is checked against the Lipinski descriptors for drug-likeness. Next, PaDEL
descriptors are used to generate molecular fingerprints, which are fed to the model. Finally, the model
predicts the results.
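The drug-likeness check described above can be illustrated with a minimal sketch. It assumes the four Lipinski descriptors have already been computed for each compound; the `Descriptors` class, the function names, the `max_violations` cutoff, and the example values are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed Lipinski descriptors for one compound (hypothetical container)."""
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient (LogP)
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five limits the compound exceeds."""
    checks = [
        d.mol_weight > 500,   # molecular weight should not exceed 500 Da
        d.log_p > 5,          # LogP should not exceed 5
        d.h_donors > 5,       # at most 5 hydrogen bond donors
        d.h_acceptors > 10,   # at most 10 hydrogen bond acceptors
    ]
    return sum(checks)


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumption: a compound passes if it has at most `max_violations` violations.

    The report does not state the cutoff it uses; one allowed violation is a
    common convention, and 0 gives a stricter filter.
    """
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values, for illustration only.
small = Descriptors(mol_weight=350.0, log_p=2.5, h_donors=2, h_acceptors=5)
large = Descriptors(mol_weight=650.0, log_p=6.2, h_donors=7, h_acceptors=12)
print(is_drug_like(small), is_drug_like(large))
```

Only compounds that pass such a filter would go on to fingerprint generation and model training.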
II. PROJECT OBJECTIVE
The proposed machine learning model is well trained, scalable, well researched, adaptive, flexible, and
accurate, and uses features of advanced neural networks to optimize its learning. The proposed research
predicts the potency and solubility of drug-like molecules that have been filtered using the Lipinski Rule of
Five. It is of greatest benefit to researchers and biologists who currently carry out these steps manually to
discover a drug. The manual process increases the chance of human error, which further delays drug discovery
and raises its cost. Our machine learning algorithm considers only the molecular weight, octanol-water
partition coefficient, number of hydrogen bond donors, and number of hydrogen bond acceptors of the molecule.
Input molecules must be given in SMILES notation together with the ChEMBL ID of the molecule.
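Potency in this report is expressed as pIC50, the negative base-10 logarithm of the IC50 concentration in molar units. The sketch below shows that conversion. It assumes the IC50 values are given in nanomolar; the report does not state the unit, and the function name is hypothetical.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 value in nanomolar (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in mol/L). Since 1 nM = 1e-9 mol/L, this equals
    9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


# A lower IC50 (a more potent compound) gives a higher pIC50.
print(pic50_from_nm(1000.0))  # 1000 nM = 1 micromolar
print(pic50_from_nm(10.0))    # 10 nM
```

Working on the logarithmic scale spreads the values more evenly than raw IC50 concentrations, which span several orders of magnitude.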
The main objectives of the project are:
1. To assess the applicability of the proposed machine learning algorithms, which have demonstrated their
efficiency, to predicting the potency and solubility of drug-like molecules with better predictive
rates.
2. To apply the best machine learning procedures for prediction.
Figure 6: Predicted vs. experimental logS (solubility) values of the drug-like molecules.