AI For Drug Discovery (Practical Demo Using RDKit)


Practical Chemoinformatics

AI in Drug Discovery
RDKIT Demo (for Beginners!)

Muthukumarasamy Karthikeyan Ph.D


Chief Scientist

Chemoinformatics and AI Tools in Drug Design
• With over twenty-five years of experience in the in silico design of drugs using chemoinformatics
tools (for cancer, malaria, tuberculosis, sickle cell anemia, and antifungal, antibacterial, and
antiviral agents), I thought it was time to highlight a key component of chemoinformatics,
namely open-source tools like RDKit, for better utilization in the field of Artificial
Intelligence. Chemoinformatics combines computational and statistical methods to analyze
chemical compounds, while AI encompasses machine learning, deep learning, and natural
language processing for pattern recognition and prediction. The fundamental challenge lies in
translating pictorial chemical structures, easily understood by chemists, into numeric formats
that computers can process and analyze.

• This translation process, called inductive learning, converts molecular structures into
descriptors that include biological activity, physicochemical properties, and toxicological
information. These descriptors enable the creation of AI models that transform data into
knowledge. Once the model is trained and optimized using experimental data, it gains the
ability to predict properties of new molecules. These predictions guide decisions about which
molecules to synthesize, evaluate biologically, and test for properties and toxicology. The
predicted data can then be compared with experimental results to assess prediction accuracy.
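The last step above, comparing predicted data with experimental results, is usually summarized with an error metric. The slides do not name one, so the sketch below assumes root-mean-square error (RMSE) and mean absolute error (MAE). The function names are ours, and the numbers in the usage lines are placeholders, not measurements.

```python
import math

def rmse(predicted, experimental):
    """Root-mean-square error between two equal-length, non-empty sequences."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predicted, experimental):
    """Mean absolute error between two equal-length, non-empty sequences."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)

# Placeholder values only: three "predicted" and three "experimental" numbers.
example_predicted = [1.0, 2.0, 3.0]
example_experimental = [1.0, 2.0, 5.0]
print("RMSE:", rmse(example_predicted, example_experimental))
print("MAE: ", mae(example_predicted, example_experimental))
```

A lower value means the model's predictions sit closer to the measured values. RMSE penalizes large individual errors more heavily than MAE.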
• Data quality remains the critical factor in AI modeling, as errors in input data
will lead to model failure. Modern AI systems are being developed to detect
input data errors, ensuring clean data for building high-quality models with
minimal errors through the learning process. The current success of generative
AI in producing textual outputs suggests that when fine-tuned for chemical
information, without proprietary data restrictions, it could revolutionize drug
discovery by identifying valuable therapeutic molecules more efficiently.

• Today's students and faculty, particularly those in bioinformatics familiar with
sequences, should expand their knowledge to understand molecules at the
atomic level using free chemoinformatics tools like RDKit. While modern
students quickly grasp these tools, success requires patience and persistence
through installation challenges and source code implementation. Starting with
tutorials builds confidence before advancing to novel methods for research
projects.
• The field employs high-throughput screening for data generation, predictive
modeling through machine learning techniques like Random Forest and
Support Vector Machines, and deep learning approaches such as GANs and
VAEs for creating new chemical structures. Chemoinformatics tools provide
molecular structure visualization, while QSAR models predict bioactivity based
on chemical structure. Virtual screening helps select promising compounds
from chemical libraries. Looking forward, integrating genomics, proteomics,
and metabolomics data will enhance the accuracy of AI predictions in
understanding bacterial resistance mechanisms.

• This practical approach to learning chemoinformatics, particularly using
RDKit, creates opportunities for technical and scientific positions in
pharmaceutical industries. The key lies in mastering the translation of
molecular structures into numeric descriptors for correlating structure with
property, toxicity, or activity of interest.
AI to Find the "Needle (Drug) in a Haystack"?
• An estimated 10^200 compounds could be made
• Over 250 million substances are currently registered (CAS)
• Drug company biologists screen up to 1 million compounds
• Chemists select 50-100 compounds for follow-up
• 1-2 compounds are selected as potential drugs
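The funnel above can be put into numbers. The small sketch below uses only the figures quoted on the slide, taking the upper bound of each range. The stage labels and the function name are ours.

```python
# Stage sizes taken from the bullets above (upper bounds of the quoted ranges).
funnel = [
    ("screened by biologists", 1_000_000),
    ("selected for follow-up", 100),
    ("potential drugs", 2),
]

def survival_fractions(stages):
    """Return (label, count, fraction of the first stage) for every stage."""
    start = stages[0][1]
    return [(label, count, count / start) for label, count in stages]

for label, count, fraction in survival_fractions(funnel):
    print(f"{label:<25} {count:>9,}  {fraction:.6%}")
```

On these figures, roughly one screened compound in ten thousand reaches follow-up and about two in a million become drug candidates. That attrition is what predictive models aim to reduce.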

11/24/2024 M. Karthikeyan, CSIR-NCL, PUNE
Chemical Structure Representation
Representation of Structure: 1D, 2D & 3D
• 1D model (SMILES)

• Constitution (topological) graph ("2D")

• 3D model

• Molecular surface
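To make the first two levels concrete, here is a library-free sketch of ethanol as a 1D line notation and as a 2D constitution graph. The atom and bond lists are written by hand for illustration; in practice a toolkit such as RDKit derives them by parsing the SMILES string. Hydrogens are left implicit, as in SMILES.

```python
# 1D: line notation (SMILES) for ethanol
smiles = "CCO"

# 2D: constitution as a graph, where atoms are nodes and bonds are edges.
atoms = ["C", "C", "O"]            # heavy atoms, indexed 0..2
bonds = [(0, 1, 1), (1, 2, 1)]     # (atom index, atom index, bond order)

def adjacency_matrix(n_atoms, bond_list):
    """Build a symmetric matrix whose entries are bond orders (0 = not bonded)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bond_list:
        matrix[i][j] = order
        matrix[j][i] = order
    return matrix

adj = adjacency_matrix(len(atoms), bonds)

# Number of heavy-atom neighbours of each atom (a simple topological quantity).
degrees = [sum(1 for entry in row if entry) for row in adj]

for symbol, row, degree in zip(atoms, adj, degrees):
    print(symbol, row, "degree:", degree)
```

The adjacency matrix is already a numeric form of the structure, which is why topological ("2D") descriptors can be computed without any 3D coordinates.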



Molecule Properties
SPC: Structure Property Correlation

CHEMICAL PROPERTIES
pKa
Log P
Solubility
Stability

INTRINSIC PROPERTIES
Molar Volume
Connectivity Indices
Charge Distribution
Molecular Weight
Polar Surface Area

BIOLOGICAL PROPERTIES
Activity
Toxicity
Biotransformation
Pharmacokinetics

Calculation of Structure Descriptors
• Physical, chemical, or biological properties cannot be directly
calculated from the structure of a compound.
• The way around this is to represent the structure of the compound by
structure descriptors and then to establish a relationship between
the structure descriptors and the property.
• A variety of structure descriptors have been developed,
encoding 1D, 2D, or 3D structure information (Practical
Chemoinformatics).
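As a library-free illustration of the idea, the sketch below turns a molecular formula into a few simple constitutional descriptors. The function name and the choice of descriptors are ours; the atomic masses are standard rounded values. RDKit's Descriptors module provides many more descriptors, computed directly from a molecule object.

```python
# Standard atomic masses (rounded) for the elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def constitutional_descriptors(formula):
    """Compute simple descriptors from a formula given as {element: count}."""
    mol_weight = sum(ATOMIC_MASS[element] * n for element, n in formula.items())
    heavy_atoms = sum(n for element, n in formula.items() if element != "H")
    hetero_atoms = sum(n for element, n in formula.items() if element not in ("C", "H"))
    return {
        "mol_weight": round(mol_weight, 3),
        "heavy_atoms": heavy_atoms,
        "hetero_atoms": hetero_atoms,
    }

# Ethanol, C2H6O, written out by hand.
ethanol = {"C": 2, "H": 6, "O": 1}
desc = constitutional_descriptors(ethanol)
print(desc)
```

Each molecule becomes a fixed-length row of numbers, which is the form a machine learning model needs in order to relate structure to a property.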

Chemical Structures: A Function of Activity
Chemical Structures to Property Prediction?
Quantitative Structure/Property Relationships (QSPR)

Molecular Structure --//--> Property (no direct calculation)
Molecular Structure --> Representation (2D & 3D) --> Structure Descriptors --> Model Building (Machine Learning) --> Property
How to Convert Chemical Structures to Numeric Data? (Descriptors)

Commercial Tools & Open Source Tools (RDKit)
# 'rdkit-pypi' is the legacy package name; current releases are published on PyPI as 'rdkit'
!pip install rdkit
import pandas as pd
import numpy as np
import warnings

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem, PandasTools, MACCSkeys, AtomPairs, rdFingerprintGenerator
from rdkit import DataStructs
from rdkit.Chem.rdmolops import PatternFingerprint
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem.AtomPairs.Pairs import GetAtomPairFingerprintAsBitVect

# Imatinib (Gleevec) as an example molecule
glvc = Chem.MolFromSmiles("Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C")
smiles_list = [
    'CC(=O)OC1=CC=CC=C1',    # Aspirin-like structure (phenyl acetate)
    'C1=CC=C(C=C1)CCN',      # Phenethylamine structure
    'CC1=CC=C(C=C1)O',       # p-Cresol
    'C1=CC=C(C(=C1)O)O',     # Catechol (the two OH groups are on adjacent ring atoms)
    'CC(=O)CCCN',            # 5-Aminopentan-2-one
    'COC1=CC=CC=C1CC=O',     # 2-(2-methoxyphenyl)acetaldehyde
    'CC1=CC=CC=C1N',         # o-Toluidine
    'C1=CC=C(C=C1)C(=O)O',   # Benzoic acid
    'CCOC(=O)CC'             # Ethyl propionate
]

# one row per molecule; the 'mol' column holds the RDKit molecule objects
df = pd.DataFrame(smiles_list, columns=['smiles'])
PandasTools.AddMoleculeColumnToFrame(df, 'smiles', 'mol')

from rdkit.Chem.Draw import MolsToGridImage

def display_structures(smiles_list):
    # parse every SMILES string and draw the molecules in a grid
    mol_list = []
    for smiles in smiles_list:
        mol_list.append(Chem.MolFromSmiles(smiles))
    return MolsToGridImage(mol_list, molsPerRow=5)
Calculate Molecular Descriptors (MACCS Keys)
df_maccs = []
for mol in df['mol']:
    # generate bitvector object
    maccs_bitvector = MACCSkeys.GenMACCSKeys(mol)
    arr = np.zeros((0,), dtype=np.int8)
    # convert the RDKit explicit vectors into numpy arrays
    DataStructs.ConvertToNumpyArray(maccs_bitvector, arr)
    df_maccs.append(arr)
MACCS = pd.concat([df, pd.DataFrame(df_maccs)], axis=1)
Calculate Molecular Descriptors (Morgan FP)
df_mf = []
for mol in df['mol']:
    mf_bitvector = AllChem.GetMorganFingerprintAsBitVect(mol, radius=1, nBits=1024)
    arr = np.zeros((0,), dtype=np.int8)
    # convert the RDKit explicit vectors into numpy arrays
    DataStructs.ConvertToNumpyArray(mf_bitvector, arr)
    df_mf.append(arr)

MF = pd.concat([df, pd.DataFrame(df_mf)], axis=1)


# print first molecule in the dataset (column 1 is 'mol', bit columns start at 2)
display(MF.iloc[0,1])
nmpyrrole = MF.iloc[0,2:]
print(f'No of ones: {nmpyrrole[nmpyrrole==1].count()}')
print(f'No of zeros: {nmpyrrole[nmpyrrole==0].count()} \n')
print(nmpyrrole[nmpyrrole==1])

# print second molecule in the dataset
display(MF.iloc[1,1])
pyrrole = MF.iloc[1,2:]
print(f'No of ones: {pyrrole[pyrrole==1].count()}')
print(f'No of zeros: {pyrrole[pyrrole==0].count()} \n')
print(pyrrole[pyrrole==1])
## Atom Pair Fingerprint
# create an empty list
df_apf = []
apgen = rdFingerprintGenerator.GetAtomPairGenerator(fpSize=4096)
# run a for loop to iterate through each molecule
for mol in df['mol']:
    apf_bitvector = apgen.GetFingerprint(mol)
    # convert the RDKit explicit vectors into numpy arrays
    arr = np.array(apf_bitvector)
    df_apf.append(arr)
APF = pd.concat([df, pd.DataFrame(df_apf)], axis=1)

## Topological Torsion Fingerprint
df_ttf = []
ttgen = rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=2048)
for mol in df['mol']:
    ttf_bitvector = ttgen.GetFingerprint(mol)
    # convert the RDKit explicit vectors into numpy arrays
    arr = np.array(ttf_bitvector)
    df_ttf.append(arr)
TTF = pd.concat([df, pd.DataFrame(df_ttf)], axis=1)


Molecules as Binary and Bitmap representation
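A fingerprint is a fixed-length string of 0s and 1s, so it can be drawn as a black-and-white bitmap by reshaping it into a grid. The library-free sketch below assumes a 512-bit fingerprint laid out as 32 rows of 16 bits, the same shape the plotting code in this deck uses. The on-bit positions are arbitrary placeholders, not a real molecule's fingerprint, and the function names are ours.

```python
def to_bitmap(on_bits, n_bits=512, n_cols=16):
    """Expand a list of on-bit indices into rows of 0/1 values."""
    bits = [0] * n_bits
    for index in on_bits:
        bits[index] = 1
    # cut the flat bit list into consecutive rows of n_cols bits
    return [bits[start:start + n_cols] for start in range(0, n_bits, n_cols)]

def render(rows):
    """Draw each row as text: '#' for an on bit, '.' for an off bit."""
    return "\n".join("".join("#" if bit else "." for bit in row) for row in rows)

# Placeholder on-bit positions, chosen only to show the layout.
placeholder_on_bits = [0, 17, 34, 511]
grid = to_bitmap(placeholder_on_bits)
print(render(grid[:3]))  # first three rows of the bitmap
```

Two molecules with many on bits in the same positions produce visibly similar bitmaps, which is the intuition behind the similarity scores that follow.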

Molecular similarity visualization


Compute Molecular Similarity/Dissimilarity Scores
(Tanimoto Dissimilarity, Dice, Cosine)
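The three scores named above can all be written in terms of the number of on bits in each fingerprint (a and b) and the number of on bits they share (c): Tanimoto = c / (a + b - c), Dice = 2c / (a + b), Cosine = c / sqrt(a * b). The library-free sketch below represents each fingerprint as a Python set of on-bit indices. The function names and the example bit sets are illustrative only, and returning 0.0 for empty fingerprints is our convention. In RDKit the corresponding calls are DataStructs.TanimotoSimilarity, DataStructs.DiceSimilarity and DataStructs.CosineSimilarity.

```python
import math

def tanimoto(fp_a, fp_b):
    """Shared on bits divided by the on bits present in either fingerprint."""
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0

def dice(fp_a, fp_b):
    """Twice the shared on bits divided by the total on bits of both."""
    total = len(fp_a) + len(fp_b)
    return 2 * len(fp_a & fp_b) / total if total else 0.0

def cosine(fp_a, fp_b):
    """Shared on bits divided by the geometric mean of the two on-bit counts."""
    denominator = math.sqrt(len(fp_a) * len(fp_b))
    return len(fp_a & fp_b) / denominator if denominator else 0.0

# Illustrative on-bit sets (not computed from real molecules).
fp_1 = {1, 2, 3, 4}
fp_2 = {3, 4, 5, 6, 7, 8}

print("Tanimoto similarity:   ", tanimoto(fp_1, fp_2))
print("Tanimoto dissimilarity:", 1 - tanimoto(fp_1, fp_2))
print("Dice similarity:       ", dice(fp_1, fp_2))
print("Cosine similarity:     ", cosine(fp_1, fp_2))
```

For the same pair of fingerprints, Dice and Cosine are never smaller than Tanimoto, so similarity thresholds are not interchangeable between metrics.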
Try It Yourself!
(Python, Conda, RDKit & Libraries)
Please use an online platform for searching and testing the code!
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import MolsToGridImage
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem, PandasTools, MACCSkeys, AtomPairs, rdFingerprintGenerator
from rdkit.Chem import rdDepictor
from rdkit.Chem.rdmolops import PatternFingerprint
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem.AtomPairs.Pairs import GetAtomPairFingerprintAsBitVect
from sklearn.metrics import jaccard_score

# Example molecules: benzene and imatinib (Gleevec)
mol = Chem.MolFromSmiles("c1ccccc1")
glvc = Chem.MolFromSmiles("Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C")

# Data set: one row per molecule, RDKit molecule objects in the 'mol' column
smiles_list = [
    'CC(=O)OC1=CC=CC=C1',    # Aspirin-like structure (phenyl acetate)
    'C1=CC=C(C=C1)CCN',      # Phenethylamine structure
    'CC1=CC=C(C=C1)O',       # p-Cresol
    'C1=CC=C(C(=C1)O)O',     # Catechol (the two OH groups are on adjacent ring atoms)
    'CC(=O)CCCN',            # 5-Aminopentan-2-one
    'COC1=CC=CC=C1CC=O',     # 2-(2-methoxyphenyl)acetaldehyde
    'CC1=CC=CC=C1N',         # o-Toluidine
    'C1=CC=C(C=C1)C(=O)O',   # Benzoic acid
    'CCOC(=O)CC'             # Ethyl propionate
]
df = pd.DataFrame(smiles_list, columns=['smiles'])
PandasTools.AddMoleculeColumnToFrame(df, 'smiles', 'mol')

# Drawing settings and structure display
IPythonConsole.ipython_useSVG = True
IPythonConsole.molSize = 300, 300
rdDepictor.SetPreferCoordGen(True)
examples = """C(C)(C)O isopropanol
C(Cl)(Cl)(Cl)Cl carbon tetrachloride
CC(=O)O acetic acid"""
smiles_list = [x.split(" ", 1) for x in examples.split("\n")]
# the next line overrides the list above with the SMILES from the data frame
smiles_list = df['smiles'].tolist()

def display_structures(smiles_list):
    mol_list = []
    for smiles in smiles_list:
        mol_list.append(Chem.MolFromSmiles(smiles))
    return MolsToGridImage(mol_list, molsPerRow=5)

display_structures(smiles_list)

# MACCS keys
df_maccs = []
for mol in df['mol']:
    maccs_bitvector = MACCSkeys.GenMACCSKeys(mol)  # generate bitvector object
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(maccs_bitvector, arr)  # convert the RDKit explicit vectors into numpy arrays
    df_maccs.append(arr)
MACCS = pd.concat([df, pd.DataFrame(df_maccs)], axis=1)
MACCS.head()

# Morgan fingerprints
df_mf = []
for mol in df['mol']:
    mf_bitvector = AllChem.GetMorganFingerprintAsBitVect(mol, radius=1, nBits=1024)
    arr = np.zeros((0,), dtype=np.int8)
    # convert the RDKit explicit vectors into numpy arrays
    DataStructs.ConvertToNumpyArray(mf_bitvector, arr)
    df_mf.append(arr)
MF = pd.concat([df, pd.DataFrame(df_mf)], axis=1)
MF.head()

# Count on and off bits for the first molecule (bit columns start at index 2)
display(MF.iloc[0,1])
molecule = MF.iloc[0,2:]
num_ones = np.sum(molecule)
num_zeros = len(molecule) - num_ones
print("Number of ones:", num_ones)
print("Number of zeros:", num_zeros)

display(MF.iloc[0,1])
nmpyrrole = MF.iloc[0,2:]
print(f'No of ones: {nmpyrrole[nmpyrrole==1].count()}')
print(f'No of zeros: {nmpyrrole[nmpyrrole==0].count()} \n')
print(nmpyrrole[nmpyrrole==1])

# print second molecule in the dataset
display(MF.iloc[1,1])
pyrrole = MF.iloc[1,2:]
print(f'No of ones: {pyrrole[pyrrole==1].count()}')
print(f'No of zeros: {pyrrole[pyrrole==0].count()} \n')
print(pyrrole[pyrrole==1])

# Atom pair fingerprints
df_apf = []
apgen = rdFingerprintGenerator.GetAtomPairGenerator(fpSize=4096)
for mol in df['mol']:
    apf_bitvector = apgen.GetFingerprint(mol)
    # convert the RDKit explicit vectors into numpy arrays
    arr = np.array(apf_bitvector)
    df_apf.append(arr)
APF = pd.concat([df, pd.DataFrame(df_apf)], axis=1)

# Topological torsion fingerprints
df_ttf = []
ttgen = rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=2048)
for mol in df['mol']:
    ttf_bitvector = ttgen.GetFingerprint(mol)
    # convert the RDKit explicit vectors into numpy arrays
    arr = np.array(ttf_bitvector)
    df_ttf.append(arr)
TTF = pd.concat([df, pd.DataFrame(df_ttf)], axis=1)

# Pick two molecules to compare
display(MF.iloc[2,1])
display(MF.iloc[7,1])
first_mol_obj = MF.iloc[0, 1]
second_mol_obj = MF.iloc[3, 1]
first_mol_obj
second_mol_obj
Draw.MolsToGridImage([first_mol_obj, second_mol_obj])

# RDKit path-based fingerprints, 512 bits each
first_fp = Chem.RDKFingerprint(first_mol_obj, maxPath=7, fpSize=512)
second_fp = Chem.RDKFingerprint(second_mol_obj, maxPath=7, fpSize=512)
first_fp.ToBitString()
this_fp_array = np.array([int(bit) for bit in first_fp.ToBitString()])
that_fp_array = np.array([int(bit) for bit in second_fp.ToBitString()])

# Show both fingerprints as 32 x 16 bitmaps
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(this_fp_array.reshape(32,16), cmap='binary')
plt.title('this_fp')
plt.subplot(1, 2, 2)
plt.imshow(that_fp_array.reshape(32,16), cmap='binary')
plt.title('that_fp')

# Morgan fingerprints of the first molecule: sparse counts, then a 512-bit vector
MFP_this = AllChem.GetMorganFingerprint(first_mol_obj, 2)
MFP_this
MFP_this_bits = AllChem.GetMorganFingerprintAsBitVect(first_mol_obj, 5, nBits=512)
MFP_this_bits

# Dice similarity
DataStructs.DiceSimilarity(first_fp, first_fp)
DataStructs.DiceSimilarity(first_fp, second_fp)
# both vectors are 512 bits long, so this runs, but the two fingerprint types are different
DataStructs.DiceSimilarity(first_fp, MFP_this_bits)

# Tanimoto
commonBits = first_fp & second_fp
print('first:', first_fp.GetNumOnBits(), 'second:', second_fp.GetNumOnBits(), 'num in common:', commonBits.GetNumOnBits())
print(commonBits.GetNumOnBits() / (first_fp.GetNumOnBits() + second_fp.GetNumOnBits() - commonBits.GetNumOnBits()))
print('Tanimoto:', DataStructs.TanimotoSimilarity(first_fp, second_fp))

jaccard_score(np.array(first_fp), np.array(second_fp))

# Similarity maps
from rdkit.Chem.Draw import SimilarityMaps
mola = Chem.MolFromSmiles("O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5")
molb = Chem.MolFromSmiles("OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N")
Draw.MolsToGridImage([mola, molb])
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(mola, molb, SimilarityMaps.GetMorganFingerprint)
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(molb, mola, SimilarityMaps.GetMorganFingerprint)
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(
    mola, molb,
    lambda m, idx: SimilarityMaps.GetMorganFingerprint(m, atomId=idx, radius=3, fpType='bv'),
    metric=DataStructs.TanimotoSimilarity)
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    AllChem,
    rdCoordGen,
)
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from matplotlib import pyplot as plt
from matplotlib.patches import Rectangle
from IPython.display import SVG

niacinamide = Chem.MolFromSmiles("c1cc(cnc1)C(=O)N")
rdCoordGen.AddCoords(niacinamide)
niacinamide

# bitInfo records which atom environments set each bit
info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(niacinamide, radius=2, bitInfo=info)
on_bits = [(niacinamide, i, info) for i in fp.GetOnBits()]
labels = [str(i[1]) for i in on_bits]
Draw.DrawMorganBits(on_bits, molsPerRow=5, legends=labels)  # Draw the on bits

# Compare aspirin with salicylic acid using three similarity metrics
aspirin = AllChem.MolFromSmiles('O=C(C)Oc1ccccc1C(=O)O')
salicylic_acid = AllChem.MolFromSmiles('O=C(O)c1ccccc1O')
bit_asp = {}
bit_sal = {}
aspirin_fp = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048, bitInfo=bit_asp)
salicylic_acid_fp = AllChem.GetMorganFingerprintAsBitVect(salicylic_acid, 2, nBits=2048, bitInfo=bit_sal)
print("Salicylic acid:", sorted(set(salicylic_acid_fp.GetOnBits())))
print("Aspirin", sorted(set(aspirin_fp.GetOnBits())))
print("TanimotoSimilarity", DataStructs.FingerprintSimilarity(aspirin_fp, salicylic_acid_fp, metric=DataStructs.TanimotoSimilarity))
print("DiceSimilarity", DataStructs.FingerprintSimilarity(aspirin_fp, salicylic_acid_fp, metric=DataStructs.DiceSimilarity))
print("CosineSimilarity", DataStructs.FingerprintSimilarity(aspirin_fp, salicylic_acid_fp, metric=DataStructs.CosineSimilarity))

• Acknowledgement (source code and testing): Mr Piyush


Descriptor Tools and AI Methods
Descriptor (Features) Calculation
• CDK / RDKit 2500+ (approx.)
• PaDEL 905+
• MOE 180
Types of Descriptors
• Constitutional
• Topological & Shape Based
• Physico-Chemical
• Quantum-Chemical
• Electrostatic
• Drug-like indices
Algorithm (ML/DL: AI)
• GA, KNN, RNN, GRU, SVM, SVR, Linear Regression



Open Source Chemoinformatics Tools
• OpenBabel
• CDK
• RDKit

• C++
• JAVA
• Python
• Perl
• Ruby

ISBN 978-81-322-1780-0: for designing new molecules; chemoinformatics using open source data, tools
Practical Cheminformatics (Do It Yourself)

• Open Source Tools, Techniques and Data in Chemoinformatics
• Chemoinformatics Approach for the Design and Screening of Focused Virtual Libraries
• Machine Learning Methods in Chemoinformatics for Drug Discovery
• Docking and Pharmacophore Modeling for Virtual Screening
• Active Site Directed Pose Prediction Programs for Efficient Filtering of Molecules
• Representation, Fingerprinting and Modeling of Chemical Reactions
• Predictive Methods for Organic Spectral Data Simulation
• Chemical Text Mining for Lead Discovery
• Integration of Automated Workflow in Chemoinformatics for Drug Discovery
• Cloud Computing Infrastructure for Chemoinformatics



Coming Soon..
Generative AI for Chemical Research

Fine Tuning LLM for Chemistry, Biology and Medicine

(Cancer, Epilepsy, Sickle Cell Anemia, Anti-microbial Resistance .. )

Contact (Collaboration / Training AI Drug Design) :


• Email: karthincl@gmail.com / m.karthikeyan@ncl.res.in
• Phone : +91-020-25902483
Questions ?
