Seminar Report on AI-Driven Drug Discovery
ACKNOWLEDGEMENT
With due respect and gratitude, I take this opportunity to thank those who have helped me directly and indirectly. I convey my sincere thanks to Prof. M.P. Wankhade, Head of the Computer Department, and Prof. L.B. Pawar for their help in selecting the seminar topic and for their support.
I thank my seminar guide, Prof. L.B. Pawar, for her guidance, timely help, and valuable suggestions, without which this seminar would not have been possible. Her direction has always been encouraging and inspiring for me. Attempts have been made to minimize the errors in the report.
I would also like to express my appreciation and thanks to all my friends who, knowingly or unknowingly, have assisted and encouraged me throughout my hard work.
CONTENTS
ABSTRACT
1. Introduction
2. Method
2.1 FragAdd Framework
2.2 Data Representation
2.3 Graph Neural Networks
2.4 Training Details
2.5 Virtual Screening Pipeline
3. Results
3.1 Molecular Property Prediction Benchmark
3.2 Adding Strategy Exploration
3.3 Visualization of Molecular Representation
3.4 Application in Virtual Screening
3.5 Combination of FragAdd with Other Methods
4. Conclusion
5. References
1 Introduction
Drug discovery is becoming increasingly costly [1, 2]. Research and development of a new medicine can cost from hundreds of millions to billions of US dollars, a figure that has increased exponentially in the last decade [3]. With the growing availability of big data, deep learning is a promising approach to accelerate drug discovery in areas such as compound synthesis, virtual screening, and de novo drug design [4–7]. However, the effectiveness of deep learning depends on the availability of labeled data, which is expensive, time-consuming, and sometimes impractical to obtain [8]. Pretraining can help address this issue by learning background knowledge from a large amount of unlabeled data [9], and this knowledge has been shown to significantly improve the performance of downstream tasks [10]. Recently, the masked language model approach has been widely utilized for pretraining small molecules [11–13]. Infomax
[14] was one of the first graph pretraining methods to promote mutual information
between local and global representations. Hu et al. [12] then implemented Mask on a molecular graph and discussed the advantages of using local- and global-level tasks simultaneously. Grover [11] further advanced the Mask concept by proposing a 1-hop Mask augmentation, which asks the model to predict artificial labels at the local and graph levels. MolCLR [13] then brought contrastive learning from computer vision and developed two deletion augmentations, bond deletion and subgraph removal, which can corrupt the molecule further. Although Mask-based pretraining methods have shown some success in small molecule deep learning, they are not ideal for small molecules due to two intrinsic properties: a limited vocabulary size and a non-sequential molecular structure. For example, molecules have a much smaller
vocabulary size of less than 20, while university-level English speakers know
approximately 10 000 word families on average [15]. If all the masked atoms in the
molecules are predicted as carbon, an accuracy of approximately 74% (counted for 1
million molecules) can be achieved. This task is too easy, which prevents the
pretraining method from learning useful information. Furthermore, unlike human
language, where words are arranged sequentially, molecules have chemical structures
that are essential to their properties [5, 16]. Applying a mask to chemical bonds does not change the structure of a molecule, whereas deleting bonds significantly modifies its properties. Consequently, this obstacle
prevents the pretraining method from gaining valuable knowledge. In contrast to the
existing pretraining strategies that involve reducing or eliminating information through
the use of masks, we introduce a novel approach called FragAdd, which involves the
addition of a chemically implausible molecular fragment to the input molecule. This
strategy is intended to provide structural variation and prevent the collapse of the
molecular structure. To learn rich local information while producing a meaningful
molecular representation, we designed a series of experiments to explore how the
adding strategy can be implemented. The fragments used in the strategy were taken
from a fragment database created using pretraining data.
2 Method
2.1 FragAdd framework
We created FragAdd to pretrain small molecules and use the pretrained model for
downstream objectives such as property prediction and virtual screening. Pretraining
provides Artificial Intelligence (AI) systems with a basic understanding of the data by
learning the patterns in small molecule data [9]. As a small molecule pretraining
framework, FragAdd introduces novel augmentation and training objectives to process
molecule graphs and update parameters. After pretraining with unlabeled data, the model
is further refined on supervised tasks, for example, predicting the toxicity of molecules.
Inspired by the modular nature of small molecules, FragAdd changes the molecular
structure to provide diversity and avoids predicting molecular vocabulary to increase the
difficulty of pretraining tasks. Diversity describes the number of chemical forms
generated from the augmentation, and difficulty indicates how challenging the task is for
an intelligent system to complete. Focusing on these two aspects, we corrupt the molecular structure to increase diversity and adjust the difficulty level through multiple operations. Molecules have a modular nature: a molecule can be regarded as a collection of molecular fragments joined by addition reactions. Pharmacists use this idea to optimize the quality of drug candidates by adding or deleting parts of molecules. Based on this idea, FragAdd attaches a fragment to the outside of the input molecule to imitate a natural addition reaction. During the augmentation process, FragAdd generates a chemically invalid fragment and adds it to the input molecule, as shown in
Fig. 1. We generated a fragment database from all molecules in the pretraining dataset.
To sample a fragment from the database, we designed a two-step approach: first choosing a subgroup based on size (the number of atoms in a fragment [17]), and then randomly sampling one fragment from the chosen group (fragments larger than 20 atoms are placed into one group). Further, we corrupted the sampled fragment by atom mutation and ring breaking, so that the fragment can be distinguished from the original molecule. Atom mutation replaces some atoms with a different atom type, and ring breaking deletes a bond in a ring, if a ring exists. The ratios of mutation and breaking can be adjusted to reach a suitable difficulty level. To attach the damaged fragment to the input molecule, we connected two randomly sampled carbons from the two pieces. If no carbon is available for connection, the atom indexed zero in the molecular graph is chosen. The FragAdd augmentation thus corrupts a dynamic region whose size depends on the added fragment, instead of the fixed local region corrupted by Mask-like methods.
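To make the augmentation concrete, the following is a minimal sketch in Python with RDKit. The fragment database layout (frag_db_by_size, a dict mapping fragment size to a list of RDKit molecules), the mutation probability, and the mutated atom types are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the FragAdd augmentation with RDKit; parameters are illustrative.
import random
from rdkit import Chem

def corrupt_fragment(frag, mutate_prob=0.3):
    """Make the fragment chemically implausible: mutate some atom types
    and break one ring bond if the fragment contains a ring."""
    rw = Chem.RWMol(frag)
    for atom in rw.GetAtoms():
        if random.random() < mutate_prob:
            atom.SetAtomicNum(random.choice([6, 7, 8, 9, 16]))  # C, N, O, F, S
    ring_bonds = [b for b in rw.GetBonds() if b.IsInRing()]
    if ring_bonds:
        b = random.choice(ring_bonds)
        rw.RemoveBond(b.GetBeginAtomIdx(), b.GetEndAtomIdx())
    return rw.GetMol()

def frag_add(mol, frag_db_by_size):
    """Two-step sampling: pick a size group uniformly, then a fragment; corrupt
    it and attach it through two carbons (falling back to atom index zero)."""
    size = random.choice(list(frag_db_by_size))
    frag = corrupt_fragment(random.choice(frag_db_by_size[size]))
    n = mol.GetNumAtoms()
    combined = Chem.RWMol(Chem.CombineMols(mol, frag))
    mol_carbons = [a.GetIdx() for a in mol.GetAtoms() if a.GetAtomicNum() == 6] or [0]
    frag_carbons = [a.GetIdx() + n for a in frag.GetAtoms() if a.GetAtomicNum() == 6] or [n]
    combined.AddBond(random.choice(mol_carbons), random.choice(frag_carbons),
                     Chem.BondType.SINGLE)
    # Per-atom labels for the local objective: 0 = original atom, 1 = added atom.
    labels = [0] * n + [1] * frag.GetNumAtoms()
    return combined.GetMol(), labels
```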
For the pretraining objectives, FragAdd locally classifies whether each atom belongs to the added fragment while globally summing up the number of added atoms. Previous work has proved the effectiveness of pretraining small molecules at both the local and global levels [11, 12]. Locally, FragAdd predicts a binary label for each atom, so that the model learns to decompose molecules into fragments and to determine which fragment is chemically unreasonable. Globally, FragAdd predicts the number of added atoms to summarize the chemical knowledge into the molecular representation obtained by pooling. Both levels of training objectives are vital for effective pretraining.
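A minimal PyTorch sketch of the two objectives follows, assuming node_repr and graph_repr come from the GNN encoder of Section 2.3; casting the global objective as a regression and the head names are assumptions of this sketch (the 0.1 weight is taken from Section 2.4).

```python
# Sketch of the two FragAdd pretraining heads in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

local_head = nn.Linear(300, 1)   # per atom: does it belong to the added fragment?
global_head = nn.Linear(300, 1)  # per graph: how many atoms were added?

def fragadd_loss(node_repr, graph_repr, atom_labels, added_counts, global_weight=0.1):
    # Local objective: binary classification for every atom.
    local_logits = local_head(node_repr).squeeze(-1)
    local_loss = F.binary_cross_entropy_with_logits(local_logits, atom_labels.float())
    # Global objective: predict the number of added atoms from the pooled vector.
    global_pred = global_head(graph_repr).squeeze(-1)
    global_loss = F.mse_loss(global_pred, added_counts.float())
    # Section 2.4 reports weighting the global term by 0.1.
    return local_loss + global_weight * global_loss
```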
2.2 Data representation
A molecule graph is represented as $G = (V, E)$, with a node feature vector $x_v$ for each node $v \in V$.
2.3 Graph neural networks
Graph Neural Networks (GNNs) [19] use a message-passing approach, in which the representations of the neighbors of a node are aggregated to iteratively update the representation of that node. After $k$ rounds of aggregation, the representation of a node captures the structural information within its $k$-hop neighborhood. Formally, the $k$-th layer of a GNN is expressed as
$$h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)},\ \mathrm{AGGREGATE}^{(k)}\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right),$$
where $h_v^{(k)}$ is the feature vector of node $v$ at the $k$-th layer and $\mathcal{N}(v)$ is the set of neighbors of node $v$. We implemented the Graph Isomorphism Network (GIN) [20] as our model. GIN is the
most expressive of the GNNs for the representation learning of graphs. Moreover, GIN uses Multi-Layer Perceptrons (MLPs) as the aggregation function, which has been shown to satisfy the conditions for a maximally powerful GNN. For the pretraining of molecular graphs, GIN is the most recognized architecture. When setting up the GIN model, all hyperparameters were kept the same as in previous work to exclude the model's influence during comparison. Five GIN layers were used to process molecule graphs. Nodes were embedded into 300-dimensional vectors, and no dropout was used. Only the node features of the last layer were used for model outputs, and mean pooling was used to read out global representations. We used a linear layer to predict the training objective for all the pretraining tasks.
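The following is a minimal sketch of such an encoder with PyTorch Geometric; featurizing atoms by atomic number alone and the class name are simplifications, not the paper's exact setup.

```python
# Sketch of the five-layer, 300-dimensional GIN encoder with mean pooling.
import torch.nn as nn
from torch_geometric.nn import GINConv, global_mean_pool

class GINEncoder(nn.Module):
    def __init__(self, dim=300, num_layers=5, num_atom_types=119):
        super().__init__()
        self.atom_emb = nn.Embedding(num_atom_types, dim)
        self.layers = nn.ModuleList(
            GINConv(nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)))
            for _ in range(num_layers))

    def forward(self, z, edge_index, batch):
        h = self.atom_emb(z)                     # initial node features
        for conv in self.layers:
            h = conv(h, edge_index)              # one round of message passing
        return h, global_mean_pool(h, batch)     # node and pooled graph vectors
```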
2.4 Training details
We pretrained on two million small molecules from the ZINC database for 100 epochs, and approximately 13 400 fragments were obtained using the BRICS algorithm. Instead of increasing the pretraining data size to achieve the best benchmark result, we kept the data size at two million molecules and conducted more rounds of exploration on adding strategies. We set the random seed to zero and the batch size to 256. The Adam optimizer was used with a learning rate of 0.001; no weight decay or learning rate schedule was applied, to keep the setup minimal. When combining the local and global losses, we weighted the global training objective by a ratio of 0.1. The pretrained model was fine-tuned on eight classification datasets from MoleculeNet, with the batch size reduced to 32. The MoleculeNet classification datasets are the most widely accepted benchmarks for small molecule property prediction, comprising three biophysics and five physiology datasets [21]. For small downstream datasets such as SIDER, we added a dropout rate of 0.5. Further, a linear layer was used to predict the final binary label, and the accuracy was averaged across all tasks for each dataset.
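As a rough sketch of this optimization setup, reusing the GINEncoder sketch above and assuming PyTorch Geometric batches with atomic numbers in batch.z; the fine-tuning learning rate is an assumption, since the text fixes only the pretraining rate.

```python
# Sketch of the pretraining/fine-tuning optimizer setup described above.
import torch
import torch.nn.functional as F

encoder = GINEncoder()
task_head = torch.nn.Linear(300, 1)        # binary label per downstream task
dropout = torch.nn.Dropout(p=0.5)          # used for small datasets such as SIDER

# Pretraining: Adam, lr 0.001, batch size 256, no weight decay or schedule.
pretrain_opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Fine-tuning on a MoleculeNet dataset with batch size 32.
finetune_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters()), lr=1e-3)

def finetune_step(batch, labels):
    _, graph_repr = encoder(batch.z, batch.edge_index, batch.batch)
    logits = task_head(dropout(graph_repr)).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    finetune_opt.zero_grad()
    loss.backward()
    finetune_opt.step()
    return loss.item()
```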
2.5 Virtual screening pipeline
We took Estrogen Receptor alpha (ERα) binding data from the Nuclear Receptor Activity (NURA) dataset and divided it into reference and search data. The search data were then combined with two million molecules to form the final virtual screening dataset. The NURA dataset contains information on small molecules that act as nuclear receptor modulators [22]. From the 11 nuclear receptors in NURA, we obtained 1287 ERα binding-active and 4861 inactive molecules. We sampled 20% of the ERα data as reference data, which were used as templates for the similarity search and for fine-tuning. The other 80% of the ERα data were merged with two million small molecules from the ZINC database for screening.
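A small sketch of this split, assuming hypothetical file and column names:

```python
# Sketch of the ERα data split; file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

er = pd.read_csv("nura_er_alpha.csv")        # 1287 actives + 4861 inactives
er = er[er["activity"] != "weak"]            # weak binders are dropped (see below)

# 20% reference (fine-tuning and search templates), 80% into the search pool.
reference, search = train_test_split(
    er, test_size=0.8, stratify=er["label"], random_state=0)

zinc = pd.read_csv("zinc_2m.csv")            # two million ZINC molecules
screening_db = pd.concat([search, zinc], ignore_index=True)
```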
All weakly active ERα binders were eliminated for simplicity. We fine-tuned FragAdd on the ERα reference data for the purpose of generating molecular representations. We set the batch size to 32, which is suitable for a small dataset, and fine-tuned the pretrained model for 30 epochs. To make the fine-tuning process easier, we excluded weakly active data and took into account only clearly active or inactive data. During training, a linear layer was used to classify binding activity, and mean pooling created the molecular representations for the similarity search. We employed the Python library FAISS to carry out the molecular similarity search on embeddings from the GIN model, and the Tanimoto coefficient to search over fingerprints. FAISS is a Python library for similarity searching and clustering of large-scale vectors [23]. The distance between molecular representations was calculated as the minimum Euclidean (L2) distance (the maximum inner product
search could also be used). In this study, we chose the RDKit fingerprint and set the fingerprint size to 300, the same as the pretrained embedding size. Additionally, the k-nearest fingerprints were defined by the Tanimoto coefficient, which is the ratio of the intersection of two fingerprint vectors to their union.
We used AutoDock Vina (version 1.2.3), a widely used docking program for protein-ligand interactions [24], to investigate
interaction between unknown screening retrievals and ER protein. To begin our analysis,
we first created three-dimensional molecular structures with Open Babel [25]. We then
carefully determined the center of the grid box, using the mean value of atoms
coordinates within the binding pocket of ER. This approach helped us to accurately
define our docking search space, which was set to a dimension of 30 angstroms. Apart
from these specific settings, we adhered to the default parameters provided by AutoDock
Vina. Finally, we visualized the docking pose with Pymol and Discovery Studio [26, 27]
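A sketch of the corresponding Vina invocation, assuming hypothetical file names and a pocket_coords array (n x 3) of binding-pocket atom coordinates; all other options are left at Vina's defaults, as in the text.

```python
# Sketch of the docking setup: box centered on the mean pocket coordinates,
# 30 Å edges, otherwise AutoDock Vina defaults.
import subprocess
import numpy as np

center = np.asarray(pocket_coords).mean(axis=0)

subprocess.run([
    "vina",
    "--receptor", "er_alpha.pdbqt",
    "--ligand", "retrieval.pdbqt",
    "--center_x", f"{center[0]:.3f}",
    "--center_y", f"{center[1]:.3f}",
    "--center_z", f"{center[2]:.3f}",
    "--size_x", "30", "--size_y", "30", "--size_z", "30",
    "--out", "docked_pose.pdbqt",
], check=True)
```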
3 Results
3.2 Adding strategy exploration
To determine how to implement the adding strategy described in Section 2.1, we explored four components that influence the diversity and difficulty of augmentation on small molecules (Fig. 2a). At the beginning of the FragAdd augmentation process, a fragment must be sampled from the fragment database. However, the generated fragment database is unbalanced with respect to fragment size (the number of atoms in a fragment), resulting in a decrease in diversity when sampling from the database directly (one-step sampling). For example, fragments with fewer than 3 or more than 20 atoms have nearly no chance of being selected. Therefore, a better sampling method that tackles this imbalance in fragment size can contribute to the diversity of corruption. For fragment corruption, how the fragment should be damaged to adjust the difficulty to a reasonable level needs to be explored. Additionally, it is crucial to choose the connection bond in the fragment addition step. If most connection bonds are obviously wrong, the model only needs to break the bond to separate the molecule into two parts, which makes it too easy for the model to learn valuable molecular information. Finally, the training objectives directly affect the difficulty of the pretraining tasks locally and globally. Based on the benchmark, we found the best solution for each of the four chosen components, as shown in
Fig. 2b. Compared with one-step sampling, first choosing the fragment size substantially improves the accuracy, showing the importance of keeping the fragment size distribution normalized. For fragment corruption, atom mutation and ring breaking contribute independently to the invalid chemical information. Additionally, the carbon-carbon (C-C) bond proved to be a more effective choice for connecting fragments than a random bond. This superiority can be attributed to the high prevalence of C-C bonds in our pretraining dataset, where they constitute approximately 59% of all bonds in small molecules. Furthermore, carbon atoms in these molecules are connected to 1.05 hydrogen atoms on average, a higher connectivity than other atoms (e.g., O: 0.06, N: 0.32). This statistical prevalence of C-C bonds and the connectivity pattern of carbon atoms make C-C bond attachment more chemically reasonable and effective for maintaining molecular integrity. The results also show that the local and global training objectives are both essential to pretraining performance, as they learn rich local information while producing a high-quality graph representation.
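Statistics like these can be reproduced with a few lines of RDKit; the sketch below, with illustrative helper names, counts the C-C bond fraction and the average hydrogen count per heavy-atom type over a list of SMILES.

```python
# Sketch reproducing the quoted bond and hydrogen statistics.
from collections import defaultdict
from rdkit import Chem

def bond_and_hydrogen_stats(smiles_list):
    cc, total = 0, 0
    h_sum, atom_count = defaultdict(int), defaultdict(int)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        for bond in mol.GetBonds():
            total += 1
            if {bond.GetBeginAtom().GetSymbol(), bond.GetEndAtom().GetSymbol()} == {"C"}:
                cc += 1
        for atom in mol.GetAtoms():
            h_sum[atom.GetSymbol()] += atom.GetTotalNumHs()
            atom_count[atom.GetSymbol()] += 1
    cc_ratio = cc / total if total else 0.0             # ≈ 0.59 on the pretraining set
    avg_h = {s: h_sum[s] / atom_count[s] for s in atom_count}  # e.g., C ≈ 1.05
    return cc_ratio, avg_h
```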
3.3 Visualization of molecular representation
The comparison shows that FragAdd learns structural details about the presence of fragments in molecules. We also noticed that FragAdd generates subgroups within the same color, especially for the scaffolds colored blue and red, which have subgroups that lie far apart in the t-SNE space. We further found that the subgroups differ significantly in their side chains, showing that FragAdd can learn structural information deeper than the algorithm used to compute the scaffold.
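A sketch of this kind of visualization, assuming embeddings from the encoder and scikit-learn's t-SNE; coloring by Bemis-Murcko scaffolds from RDKit is an assumption about how the scaffolds were computed.

```python
# Sketch: project encoder embeddings to 2-D with t-SNE and color by scaffold.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from rdkit.Chem.Scaffolds import MurckoScaffold

def plot_tsne(embeddings, smiles):
    scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(s) for s in smiles]
    coords = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(embeddings))
    # Points sharing a color share a scaffold; separated subgroups of one color
    # hint at side-chain differences, as discussed above.
    _, color_ids = np.unique(scaffolds, return_inverse=True)
    plt.scatter(coords[:, 0], coords[:, 1], c=color_ids, s=4, cmap="tab10")
    plt.savefig("tsne_scaffolds.png", dpi=300)
```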
3.4 Application in virtual screening
We replaced the fingerprint method used in virtual screening with FragAdd and investigated whether it could help retrieve more of the desired molecules from the screening database. Virtual screening is a common technique for the in-silico development of new medicines [32–34]; it searches molecule libraries for the molecules with the highest probability of a particular property or activity. To generate molecular representations with abundant chemical information, pretraining methods have been employed [35]. This approach is advantageous over the traditional fingerprint method, as it does not require artificial rules to extract chemical information. Nevertheless, the application of pretraining in virtual screening has not been extensively studied. We created a scenario to find molecules that bind to the estrogen receptor (ER) among the top outputs of a molecular similarity search. ER is a crucial therapeutic target, especially considering that approximately 70% of breast cancer patients exhibit ER-positive status [36, 37]. Given this prevalence and the critical role of ER in the disease's progression, our study focuses on this receptor to better understand its interactions and potential avenues for therapeutic intervention. The ERα dataset, comprising 6148 molecules, was split into reference and search subsets in a 1:4 ratio, and the search subset was combined with two million molecules to form the final search dataset. The reference subset was employed to fine-tune the model and served as the source of reference molecules during the search process. We used a k-nearest neighbor search for each reference molecule, calculating the distance between molecular representations and setting k to 200. As most molecules in the
search data do not have ER binding activity labels, we used different methods to analyze
known and unknown retrievals (known retrievals include molecules that have a binding
label). The analysis of known ERα ligands suggests that pretraining and fine-tuning are
beneficial for virtual screening, as demonstrated in Fig. 4. FragAdd achieved the highest
true binder rate for known binders and retrieved more than half of the true binders in the
top 200 outputs for each reference molecule. The traditional fingerprint method was not
successful in retrieving enough true binders, which highlights the advantages of deep
learning compared to the fingerprint method. We further explored the roles of pretraining
and fine-tuning in virtual screening. Combining the true binder rate and inactive number
results, we found that fine-tuning improves performance by decreasing the number of
inactive binding molecules. To gain an intuitive understanding of the function of
pretraining and fine-tuning, we visualized the ER data using tmap [38]. Comparing before
and after fine-tuning reveals that fine-tuning helps classify active and inactive binding to
reduce the inactive number. Without pretraining, many molecules mix with other ones
instead of forming a tree structure, which indicates that pretraining assists in learning the
chemical features of each molecule.
In contrast to known ER ligands, the lack of binding activity labels for unknown
retrievals makes it difficult to analyze them. To address this, we conducted a docking
study to assess their binding to the ER protein (Fig. 5). Docking is a computational
technique used to predict protein-ligand interactions and binding affinity. We used the
affinity gap to evaluate the binding of the unknown retrievals. FragAdd achieved the
closest affinity gap to zero, indicating that it retrieves better unknown binders than the
traditional fingerprint method. This confirms that both pretraining and fine-tuning are
essential for unknown retrievals. To further understand the affinity gap result, we visualized the docking pose of a high-affinity unknown retrieval, ZINC1627292. The molecule interacts with the protein target through two hydrogen bonds on either side of the molecule and a T-shaped stacking between benzene rings. Of the three interactions, the
hydrogen bond with His524 and the Pi-Pi interaction with Phe404 are conserved in the
natural binders for ER. For both known and unknown retrievals, FragAdd increases the
number of potential binders in the top 200 outputs.
3.5 Combination of FragAdd with other methods
FragAdd preserves the original molecule, thus allowing the integration of other augmentation techniques. As an adding approach, FragAdd only adds a bond to one carbon atom of the original molecule; this means that FragAdd is compatible with Mask and its derivatives, raising the question of whether FragAdd can be combined with other methods. If it can, FragAdd will offer a new choice for other pretraining frameworks. FragAdd improves the average performance when added to other methods, indicating that the adding and deleting strategies can be used simultaneously. To implement this idea, we conducted a Mask-like augmentation on the input molecule, then attached a fragment to the masked molecule, and summed the two loss items. We tested this operation for Infomax, Atom Mask, and Bond Mask (Bond Mask hides the bond types of some bonds inside the molecular graph). For Infomax and Atom Mask, accuracy improves by more than 1% after combination with FragAdd. For Bond Mask, the accuracy stays the same, suggesting that the ratio of the loss items should be adjusted for the best combination performance.
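A sketch of this combination for Atom Mask, reusing the frag_add helper sketched in Section 2.1; the mask token (atomic number 0, RDKit's dummy atom) and the masking probability are illustrative assumptions.

```python
# Sketch: mask some atom types first, then attach a corrupted fragment,
# and train with the Mask loss and the FragAdd loss summed.
import random
from rdkit import Chem

def atom_mask(mol, mask_prob=0.15):
    """Replace a fraction of atom types with a dummy token; remember targets."""
    rw = Chem.RWMol(mol)
    targets = {}
    for atom in rw.GetAtoms():
        if random.random() < mask_prob:
            targets[atom.GetIdx()] = atom.GetAtomicNum()  # true type to predict
            atom.SetAtomicNum(0)
    return rw.GetMol(), targets

def combined_augment(mol, frag_db_by_size):
    masked, mask_targets = atom_mask(mol)                       # deleting strategy
    augmented, frag_labels = frag_add(masked, frag_db_by_size)  # adding strategy
    # The two resulting loss items are summed during training.
    return augmented, mask_targets, frag_labels
```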
4 Conclusion
We propose a pretraining framework, FragAdd, which uses fragments obtained from decomposition as the added part of an adding strategy, as an alternative to the Mask-based strategy in small molecule pretraining. Our results show that FragAdd outperforms previous baselines on molecular property prediction and virtual screening tasks. It achieved the best average accuracy on eight classification datasets and excelled in two datasets related to drug discovery. This performance is attributed to the extraction of molecular representations that capture structural details. We also found that both pretraining and fine-tuning are essential for virtual screening, and that FragAdd can be used in conjunction with other self-supervised methods. A pretrained-model-based molecule search engine has the potential to greatly accelerate the drug discovery process.
However, we have noticed that FragAdd occasionally incorporates excessive structural
variations, resulting in a bias during subsequent virtual screening. Additionally, the
training of FragAdd has utilized the same model and dataset as previous studies, which
might not be adequate for achieving optimal performance. Currently, we are focusing on
developing a dependable molecule search engine that can cater to the specific
requirements of biomedical research.
5 References
[1] H. F. Lynch and C. T. Robertson, Challenges in confirming drug effectiveness after early approval, Science, vol. 374, no. 6572, pp. 1205–1207, 2021.
[3] S. Simoens and I. Huys, R&D costs of new medicines: A landscape analysis, Front. Med., vol. 8, p. 760762, 2021.
[4] H. Beck, M. Härter, B. Haß, C. Schmeck, and L. Baerfacker, Small molecules and their impact in drug discovery: A perspective on the occasion of the 125th anniversary of the Bayer Chemical Research Laboratory, Drug Discov. Today, vol. 27, no. 6, pp. 1560–1574, 2022.
[5] Y. Ye, Unleashing the power of big data to guide precision medicine in China, Nature, vol. 606, no. 7916, pp. 49–51, 2022.
[6] Y. Wang, Z. Qiu, Q. Jiao, C. Chen, Z. Meng, and X. Cui, Structure-based protein drug affinity prediction with spatial attention mechanisms, in Proc. IEEE Int. Conf. Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 2021, pp. 92–97.
[7] Q. Jiao, Z. Qiu, Y. Wang, C. Chen, Z. Yang, and X. Cui, Edge-gated graph neural network for predicting protein ligand binding affinities, in Proc. IEEE Int. Conf. Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 2021, pp. 334–339.
[8] Y. LeCun and I. Misra, Self-supervised learning: The dark matter of intelligence, https://ai.meta.com/blog/selfsupervised-learning-the-dark-matter-of-intelligence/, 2021.
[9] C. Cai, S. Wang, Y. Xu, W. Zhang, K. Tang, Q. Ouyang, L. Lai, and J. Pei, Transfer learning for drug discovery, J. Med. Chem., vol. 63, no. 16, pp. 8683–8694, 2020.
[10] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, Self-supervised graph transformer on large-scale molecular data, in Proc. 34th Int. Conf. Neural Information Processing Systems, Virtual Event, 2020, pp. 12559–12571.
[11] W. H. Hu, B. W. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec, Strategies for pre-training graph neural networks, presented at Int. Conf. Learning Representations (ICLR), Virtual Event, 2020.
[12] Y. Wang, J. Wang, Z. Cao, and A. Barati Farimani, Molecular contrastive learning of representations via graph neural networks, Nat. Mach. Intell., vol. 4, no. 3, pp. 279–287, 2022.
[13] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, Deep graph Infomax, presented at Int. Conf. Learning Representations (ICLR), Vancouver, Canada, 2018.
[14] J. Milton and J. Treffers-Daller, Vocabulary size revisited: The link between vocabulary size and academic achievement, Appl. Linguist. Rev., vol. 4, no. 1, pp. 151–172, 2013.
[15] X. Zhang, C. Chen, Z. Meng, Z. Yang, H. Jiang, and X. Cui, CoAtGIN: Marrying convolution and attention for graph-based molecule property prediction, in Proc. IEEE Int. Conf. Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 2022, pp. 374–379.
[16] G. Landrum, RDKit: Open-source cheminformatics, https://www.rdkit.org, 2023.
[17] J. Degen, C. Wegscheid-Gerlach, A. Zaliani, and M. Rarey, On the art of compiling and using ‘drug-like’ chemical fragment spaces, ChemMedChem, vol. 3, no. 10, pp. 1503–1507, 2008.
[18] Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow, Gated graph sequence neural networks, presented at Int. Conf. Learning Representations (ICLR), San Juan, Puerto Rico, 2016.
[19] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, How powerful are graph neural networks? presented at Int. Conf. Learning Representations (ICLR), Vancouver, Canada, 2018.
[20] Z. Wu, B. Ramsundar, E. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, MoleculeNet: A benchmark for molecular machine learning, Chem. Sci., vol. 9, no. 2, pp. 513–530, 2018.
[21] C. Valsecchi, F. Grisoni, S. Motta, L. Bonati, and D. Ballabio, NURA: A curated dataset of nuclear receptor modulators, Toxicol. Appl. Pharmacol., vol. 407, p. 115244, 2020.
[22] J. Johnson, M. Douze, and H. Jégou, Billion-scale similarity search with GPUs, IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, 2021.
[23] O. Trott and A. J. Olson, AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading, J. Comput. Chem., vol. 31, no. 2, pp. 455–461, 2010.
[24] N. M. O’Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison, Open Babel: An open chemical toolbox, J. Cheminf., vol. 3, no. 1, p. 33, 2011.
[25] W. L. DeLano, PyMOL: An open-source molecular graphics tool, CCP4 Newsletter On Protein Crystallography, vol. 40, no. 1, pp. 82–92, 2002.
[26] Dassault Systèmes, BIOVIA Discovery Studio Visualizer, https://www.3ds.com, 2023.
[27] W. Hamilton, Z. T. Ying, and J. Leskovec, Inductive representation learning on large graphs, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 1025–1035.
[28] G. Subramanian, B. Ramsundar, V. Pande, and R. A. Denny, Computational modeling of β-secretase 1 (BACE1) inhibitors using ligand based approaches, J. Chem. Inf. Model., vol. 56, no. 10, pp. 1936–1949, 2016.
[29] K. M. Gayvert, N. S. Madhukar, and O. Elemento, A data driven approach to predicting successes and failures of clinical trials, Cell Chem. Biol., vol. 23, no. 10, pp. 1294–1301, 2016.
[30] G. Hinton and S. Roweis, Stochastic neighbor embedding, in Proc. 15th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2002, pp. 857–864.
[31] A. A. Sadybekov, A. V. Sadybekov, Y. Liu, C. Iliopoulos-Tsoutsouvas, X. P. Huang, J. Pickett, B. Houser, N. Patel, N. K. Tran, F. Tong, et al., Synthon-based ligand discovery in virtual libraries of over 11 billion compounds, Nature, vol. 601, no. 7893, pp. 452–459, 2022.
[32] F. Gentile, J. C. Yaacoub, J. Gleave, M. Fernandez, A. T. Ton, F. Ban, A. Stern, and A. Cherkasov, Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking, Nat. Protoc., vol. 17, no. 3, pp. 672–697, 2022.
[33] J. Wang, Z. Qiu, X. Zhang, Z. Yang, W. Zhao, and X. Cui, Boosting deep learning based docking with cross-attention and centrality embedding, in Proc. IEEE Int. Conf. Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 2022, pp. 360–365.
[34] K. Atz, F. Grisoni, and G. Schneider, Geometric deep learning on molecular representations, Nat. Mach. Intell., vol. 3, no. 12, pp. 1023–1032, 2021.
[35] D. Bafna, F. Ban, P. S. Rennie, K. Singh, and A. Cherkasov, Computer-aided ligand discovery for estrogen receptor alpha, Int. J. Mol. Sci., vol. 21, no. 12, p. 4193, 2020.
[36] M. Kriegel, H. J. Wiederanders, S. Alkhashrom, J. Eichler, and Y. A. Muller, A PROSS-designed extensively mutated estrogen receptor α variant displays enhanced thermal stability while retaining native allosteric regulation and structure, Sci. Rep., vol. 11, no. 1, p. 10509, 2021.