AI For Drug Discovery (Practical Demo Using RDKit)
AI in Drug Discovery
RDKit Demo (for Beginners!)
Chemoinformatics and AI Tools in Drug Design
• With over twenty-five years of experience in the in silico design of drugs using chemoinformatics
tools, covering cancer, malaria, tuberculosis, sickle cell anemia, and antifungal, antibacterial, and
antiviral agents, I think it is time to highlight a key component of chemoinformatics,
namely open-source tools such as RDKit, and how they can be put to better use in the field of Artificial
Intelligence. Chemoinformatics combines computational and statistical methods to analyze
chemical compounds, while AI encompasses machine learning, deep learning, and natural
language processing for pattern recognition and prediction. The fundamental challenge lies in
translating pictorial chemical structures, easily understood by chemists, into numeric formats
that computers can process and analyze.
• This translation converts molecular structures into descriptors, which are paired with
biological activity, physicochemical property, and toxicological data. Learning from such
examples, called inductive learning, yields AI models that transform data into
knowledge. Once a model is trained and optimized using experimental data, it can
predict properties of new molecules. These predictions guide decisions about which
molecules to synthesize, evaluate biologically, and test for properties and toxicology. The
predicted data can then be compared with experimental results to assess prediction accuracy.
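The train, predict, and compare loop described above can be sketched with a deliberately tiny example. This is an illustration only, not a production QSPR model: the single descriptor (carbon count), the helper names `fit_line` and `predict`, and the choice of n-alkanes are assumptions made for this sketch, and the boiling points are approximate literature values.

```python
# Toy QSPR: fit boiling point against carbon count for n-alkanes,
# then predict an unseen compound and compare with experiment.

# (descriptor, property) pairs: carbon count -> boiling point in deg C.
# Approximate literature values for n-pentane to n-octane.
training = [(5, 36.1), (6, 68.7), (7, 98.4), (8, 125.7)]


def fit_line(pairs):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(training)


def predict(carbon_count):
    """Predict a boiling point from the fitted line."""
    return slope * carbon_count + intercept


# n-Nonane was not used for training; it boils at roughly 150.8 deg C.
predicted = predict(9)
experimental = 150.8
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted={predicted:.2f}, experimental={experimental}, "
      f"error={predicted - experimental:.2f}")
```

The prediction overshoots the experimental value by a few degrees, which is the kind of gap that comparing predicted and experimental data is meant to expose.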
Chemoinformatics and AI Tools in Drug Design
• Data quality remains the critical factor in AI modeling, as errors in input data
will lead to model failure. Modern AI systems are being developed to detect
input data errors, ensuring clean data for building high-quality models with
minimal errors through the learning process. The current success of generative
AI in producing textual outputs suggests that when fine-tuned for chemical
information, without proprietary data restrictions, it could revolutionize drug
discovery by identifying valuable therapeutic molecules more efficiently.
M. Karthikeyan, CSIR-NCL, PUNE, 11/24/2024
Chemical Structure Representation
Representation of Structure: 1D, 2D and 3D
• 1D model (SMILES)
• Constitution (topological) graph ("2D")
• 3D model
• Molecular surface
INTRINSIC PROPERTIES
• Molar Volume
• Connectivity Indices
• Charge Distribution
• Molecular Weight
• Polar Surface Area
BIOLOGICAL PROPERTIES
• Activity
• Toxicity
• Biotransformation
• Pharmacokinetics
Calculation of Structure Descriptors
• Physical, chemical, or biological properties cannot be calculated directly
from the structure of a compound.
• An indirect approach is used instead: represent the structure of the compound by structure
descriptors, and then establish a relationship between
the structure descriptors and the property.
• A variety of structure descriptors has been developed,
encoding 1D, 2D, or 3D structure information (Practical
Chemoinformatics).
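As a concrete illustration of representing a structure by descriptors, the sketch below parses two SMILES strings with RDKit and collects a few standard descriptors into a small table. The choice of molecules and descriptors, and the name `descriptor_table`, are assumptions made for this example; any other set of descriptors could be used in the same way.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Two small molecules given as SMILES strings (chosen for illustration)
smiles = {"ethanol": "CCO", "benzoic acid": "C1=CC=C(C=C1)C(=O)O"}

descriptor_table = {}
for name, smi in smiles.items():
    mol = Chem.MolFromSmiles(smi)  # 1D string -> molecular graph
    descriptor_table[name] = {
        "heavy_atoms": mol.GetNumAtoms(),            # graph size (no explicit H)
        "mol_wt": Descriptors.MolWt(mol),            # average molecular weight
        "tpsa": Descriptors.TPSA(mol),               # topological polar surface area
        "h_bond_donors": Descriptors.NumHDonors(mol),
        "logp_estimate": Descriptors.MolLogP(mol),   # Crippen logP estimate
    }

for name, row in descriptor_table.items():
    print(name, {key: round(value, 2) for key, value in row.items()})
```

Each row is a numeric description of one molecule; relating such rows to a measured property is the second step of the indirect approach.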
Chemical Structures to Property Prediction?
Activity as a Function of Chemical Structure
Quantitative Structure/Property Relationships (QSPR): Molecular Structure -> Property
import warnings

import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import (
    AllChem,
    AtomPairs,
    Descriptors,
    Draw,
    MACCSkeys,
    PandasTools,
    rdDepictor,
    rdFingerprintGenerator,
)
from rdkit.Chem.AtomPairs.Pairs import GetAtomPairFingerprintAsBitVect
from rdkit.Chem.Draw import IPythonConsole, MolsToGridImage
from rdkit.Chem.rdmolops import PatternFingerprint
from sklearn.metrics import jaccard_score
# Two reference molecules: benzene and imatinib (Gleevec, hence "glvc")
mol = Chem.MolFromSmiles("c1ccccc1")
glvc = Chem.MolFromSmiles("Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C")

import pandas as pd

# Small example dataset of SMILES strings
smiles_list = [
    'CC(=O)OC1=CC=CC=C1',   # Aspirin-like structure (phenyl acetate)
    'C1=CC=C(C=C1)CCN',     # Phenethylamine structure
    'CC1=CC=C(C=C1)O',      # p-Cresol
    'C1=CC=C(C(=C1)O)O',    # Catechol (the two OH groups are adjacent)
    'CC(=O)CCCN',           # 5-Aminopentan-2-one
    'COC1=CC=CC=C1CC=O',    # 2-(2-methoxyphenyl)acetaldehyde
    'CC1=CC=CC=C1N',        # o-Toluidine
    'C1=CC=C(C=C1)C(=O)O',  # Benzoic acid
    'CCOC(=O)CC'            # Ethyl propionate
]
df = pd.DataFrame(smiles_list, columns=['smiles'])
# Add a 'mol' column holding an RDKit molecule for every SMILES string
PandasTools.AddMoleculeColumnToFrame(df, 'smiles', 'mol')

# MACCS keys fingerprint
df_maccs = []
for mol in df['mol']:
    maccs_bitvector = MACCSkeys.GenMACCSKeys(mol)  # generate bitvector object
    arr = np.zeros((0,), dtype=np.int8)
    # convert the RDKit explicit vectors into numpy arrays
    DataStructs.ConvertToNumpyArray(maccs_bitvector, arr)
    df_maccs.append(arr)
MACCS = pd.concat([df, pd.DataFrame(df_maccs)], axis=1)
MACCS.head()

# Morgan (circular) fingerprint, radius 1, folded to 1024 bits
df_mf = []
for mol in df['mol']:
    mf_bitvector = AllChem.GetMorganFingerprintAsBitVect(mol, radius=1, nBits=1024)
    arr = np.zeros((0,), dtype=np.int8)
    # convert the RDKit explicit vectors into numpy arrays
    DataStructs.ConvertToNumpyArray(mf_bitvector, arr)
    df_mf.append(arr)
MF = pd.concat([df, pd.DataFrame(df_mf)], axis=1)
MF.head()
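The Morgan loop above calls `AllChem.GetMorganFingerprintAsBitVect`, which recent RDKit releases flag as deprecated in favour of the `rdFingerprintGenerator` module already used in this demo for atom pairs and topological torsions. A minimal sketch of the generator-based equivalent follows, assuming an RDKit version that provides `GetMorganGenerator` and `GetFingerprintAsNumPy`; the variable names are illustrative.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Same settings as the Morgan loop in this demo: radius 1, 1024 bits
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=1, fpSize=1024)

phenyl_acetate = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1")
p_cresol = Chem.MolFromSmiles("CC1=CC=C(C=C1)O")

# Bit vectors for similarity calculations
fp_a = morgan_gen.GetFingerprint(phenyl_acetate)
fp_b = morgan_gen.GetFingerprint(p_cresol)

# NumPy array of 0/1 values, ready to be used as a feature row
arr_a = morgan_gen.GetFingerprintAsNumPy(phenyl_acetate)

print("bits:", fp_a.GetNumBits(), "on bits:", fp_a.GetNumOnBits())
print("Tanimoto:", DataStructs.TanimotoSimilarity(fp_a, fp_b))
```

Creating the generator once and reusing it for every molecule keeps the fingerprint settings in one place.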
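# Notebook drawing settings: SVG output, 300 x 300 depictions, CoordGen layout
IPythonConsole.ipython_useSVG = True
IPythonConsole.molSize = 300, 300
rdDepictor.SetPreferCoordGen(True)

# A small "SMILES name" table, split into [smiles, name] pairs
examples = """C(C)(C)O isopropanol
C(Cl)(Cl)(Cl)Cl carbon tetrachloride
CC(=O)O acetic acid"""
smiles_list = [x.split(" ", 1) for x in examples.split("\n")]
# For the rest of the demo, use the SMILES strings from the DataFrame instead
smiles_list = df['smiles'].tolist()

def display_structures(smiles_list):
    """Draw all molecules of a SMILES list in a grid, five per row."""
    mol_list = []
    for smiles in smiles_list:
        mol_list.append(Chem.MolFromSmiles(smiles))
    return MolsToGridImage(mol_list, molsPerRow=5)

display_structures(smiles_list)

# Inspect the Morgan bits of the first molecule.
# Column 1 holds the molecule, columns 2 onward hold the fingerprint bits.
# display() is available in Jupyter/IPython sessions.
display(MF.iloc[0, 1])
molecule = MF.iloc[0, 2:]
num_ones = np.sum(molecule)
num_zeros = len(molecule) - num_ones
print("Number of ones:", num_ones)
print("Number of zeros:", num_zeros)

# "nmpyrrole" and "pyrrole" are only variable names here; the rows hold the
# first and second molecules of the dataset (phenyl acetate, phenethylamine).
display(MF.iloc[0, 1])
nmpyrrole = MF.iloc[0, 2:]
print(f'No of ones: {nmpyrrole[nmpyrrole==1].count()}')
print(f'No of zeros: {nmpyrrole[nmpyrrole==0].count()} \n')
print(nmpyrrole[nmpyrrole==1])

# print second molecule in the dataset
display(MF.iloc[1, 1])
pyrrole = MF.iloc[1, 2:]
# Count on the second row itself (the original indexed nmpyrrole by mistake)
print(f'No of ones: {pyrrole[pyrrole==1].count()}')
print(f'No of zeros: {pyrrole[pyrrole==0].count()} \n')
print(pyrrole[pyrrole==1])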
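The counts of ones and zeros printed above depend on the chosen fingerprint length. Hashed fingerprints such as Morgan map substructure identifiers onto a fixed number of bit positions, so two different substructures can end up on the same bit. The following is a conceptual sketch of that folding step, not RDKit's implementation; `fold_to_bits` and the feature identifiers are hypothetical and chosen so that a collision is visible.

```python
def fold_to_bits(feature_ids, n_bits):
    """Fold integer feature identifiers into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for feature_id in feature_ids:
        bits[feature_id % n_bits] = 1  # identifiers sharing a remainder collide
    return bits


# Hypothetical substructure identifiers; 5 and 13 collide when folded to 8 bits
feature_ids = [5, 13, 18, 27]

short_fp = fold_to_bits(feature_ids, 8)
long_fp = fold_to_bits(feature_ids, 32)

print("on bits at length 8: ", sum(short_fp))   # 3, one collision
print("on bits at length 32:", sum(long_fp))    # 4, no collision
```

Shorter fingerprints save memory but lose information through such collisions, which is one reason to check the number of on bits when choosing nBits or fpSize.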
# Atom pair fingerprint
df_apf = []
apgen = rdFingerprintGenerator.GetAtomPairGenerator(fpSize=4096)
for mol in df['mol']:
    apf_bitvector = apgen.GetFingerprint(mol)
    # convert the RDKit explicit vectors into numpy arrays
    arr = np.array(apf_bitvector)
    df_apf.append(arr)
APF = pd.concat([df, pd.DataFrame(df_apf)], axis=1)

# Topological torsion fingerprint
df_ttf = []
ttgen = rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=2048)
for mol in df['mol']:
    ttf_bitvector = ttgen.GetFingerprint(mol)
    # convert the RDKit explicit vectors into numpy arrays
    arr = np.array(ttf_bitvector)
    df_ttf.append(arr)
TTF = pd.concat([df, pd.DataFrame(df_ttf)], axis=1)

display(MF.iloc[2, 1])
display(MF.iloc[7, 1])

# Pick two molecules from the dataset and compare them
first_mol_obj = MF.iloc[0, 1]
second_mol_obj = MF.iloc[3, 1]
first_mol_obj
second_mol_obj
Draw.MolsToGridImage([first_mol_obj, second_mol_obj])

# RDKit (path-based) fingerprints, 512 bits
first_fp = Chem.RDKFingerprint(first_mol_obj, maxPath=7, fpSize=512)
second_fp = Chem.RDKFingerprint(second_mol_obj, maxPath=7, fpSize=512)
first_fp.ToBitString()

this_fp_array = np.array([int(bit) for bit in first_fp.ToBitString()])
that_fp_array = np.array([int(bit) for bit in second_fp.ToBitString()])

# Show both 512-bit fingerprints as 32 x 16 grids.
# The first panel is reconstructed by analogy with the second one.
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(this_fp_array.reshape(32, 16), cmap='binary')
plt.title('this_fp')
plt.subplot(1, 2, 2)
plt.imshow(that_fp_array.reshape(32, 16), cmap='binary')
plt.title('that_fp')

# Morgan fingerprints of the first molecule: sparse counts and a 512-bit vector
MFP_this = AllChem.GetMorganFingerprint(first_mol_obj, 2)
MFP_this
MFP_this_bits = AllChem.GetMorganFingerprintAsBitVect(first_mol_obj, 5, nBits=512)
MFP_this_bits

# Dice similarity
DataStructs.DiceSimilarity(first_fp, first_fp)
DataStructs.DiceSimilarity(first_fp, second_fp)
# This runs because both vectors have 512 bits, but it mixes two fingerprint
# types, so the value has no chemical meaning
DataStructs.DiceSimilarity(first_fp, MFP_this_bits)

# Tanimoto, computed by hand and with RDKit
commonBits = first_fp & second_fp
print('first:', first_fp.GetNumOnBits(), 'second:', second_fp.GetNumOnBits(),
      'num in common:', commonBits.GetNumOnBits())
print(commonBits.GetNumOnBits() / (first_fp.GetNumOnBits() + second_fp.GetNumOnBits()
                                   - commonBits.GetNumOnBits()))
print('Tanimoto:', DataStructs.TanimotoSimilarity(first_fp, second_fp))
# scikit-learn's Jaccard score on the 0/1 arrays is the same quantity
jaccard_score(np.array(first_fp), np.array(second_fp))

# Similarity maps. Depending on the RDKit version, these calls may also
# require a draw2d drawing object.
from rdkit.Chem.Draw import SimilarityMaps
mola = Chem.MolFromSmiles("O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5")  # aflatoxin B1
molb = Chem.MolFromSmiles("OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N")  # thiamine
Draw.MolsToGridImage([mola, molb])
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(mola, molb, SimilarityMaps.GetMorganFingerprint)
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(molb, mola, SimilarityMaps.GetMorganFingerprint)
fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(
    mola, molb,
    lambda m, idx: SimilarityMaps.GetMorganFingerprint(m, atomId=idx, radius=3, fpType='bv'),
    metric=DataStructs.TanimotoSimilarity)
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    AllChem,
    rdCoordGen,
)
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from matplotlib import pyplot as plt
from matplotlib.patches import Rectangle
from IPython.display import SVG
# Niacinamide with 2D coordinates from CoordGen
niacinamide = Chem.MolFromSmiles("c1cc(cnc1)C(=O)N")
rdCoordGen.AddCoords(niacinamide)
niacinamide

# bitInfo records which atom environment set each bit
info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(niacinamide, radius=2, bitInfo=info)
on_bits = [(niacinamide, i, info) for i in fp.GetOnBits()]
labels = [str(i[1]) for i in on_bits]
Draw.DrawMorganBits(on_bits, molsPerRow=5, legends=labels)  # Draw the on bits
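The `info` dictionary filled through `bitInfo` is what `DrawMorganBits` uses to find the atom environment behind each bit. As a small sketch of its structure, the loop below prints, for every on bit, the central atom and radius of the environments that set it. It reuses the niacinamide call from this demo; the printing format is only an illustration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

niacinamide = Chem.MolFromSmiles("c1cc(cnc1)C(=O)N")

info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(niacinamide, radius=2, bitInfo=info)

# info maps each on bit to a tuple of (central atom index, radius) pairs
for bit in sorted(info):
    for atom_idx, radius in info[bit]:
        symbol = niacinamide.GetAtomWithIdx(atom_idx).GetSymbol()
        print(f"bit {bit}: centred on atom {atom_idx} ({symbol}), radius {radius}")
```

A bit that lists several environments was set more than once, either by equivalent atoms or by a hashing collision.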
# Compare aspirin with its metabolite salicylic acid
aspirin = AllChem.MolFromSmiles('O=C(C)Oc1ccccc1C(=O)O')
salicylic_acid = AllChem.MolFromSmiles('O=C(O)c1ccccc1O')

bit_asp = {}
bit_sal = {}
aspirin_fp = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048, bitInfo=bit_asp)
salicylic_acid_fp = AllChem.GetMorganFingerprintAsBitVect(salicylic_acid, 2, nBits=2048, bitInfo=bit_sal)

print("Salicylic acid:", sorted(set(salicylic_acid_fp.GetOnBits())))
print("Aspirin:", sorted(set(aspirin_fp.GetOnBits())))

# The same pair of fingerprints scored with three similarity metrics
print("TanimotoSimilarity", DataStructs.FingerprintSimilarity(aspirin_fp, salicylic_acid_fp, metric=DataStructs.TanimotoSimilarity))
print("DiceSimilarity", DataStructs.FingerprintSimilarity(aspirin_fp, salicylic_acid_fp, metric=DataStructs.DiceSimilarity))
print("CosineSimilarity", DataStructs.FingerprintSimilarity(aspirin_fp, salicylic_acid_fp, metric=DataStructs.CosineSimilarity))
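The three metrics printed above differ only in how the number of shared on bits is normalised. The sketch below writes them out for plain Python sets of on-bit indices, using the usual definitions for binary fingerprints. The function names and the two example sets are hypothetical, and both sets are assumed to be non-empty.

```python
from math import sqrt


def tanimoto(a, b):
    """Shared on bits divided by on bits present in either fingerprint."""
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def dice(a, b):
    """Twice the shared on bits divided by the total on bits of both."""
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(a, b):
    """Shared on bits divided by the geometric mean of the on-bit counts."""
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical on-bit index sets: 4 and 6 on bits, 3 of them shared
fp_x = {1, 4, 7, 9}
fp_y = {4, 7, 9, 12, 15, 20}

print(f"Tanimoto: {tanimoto(fp_x, fp_y):.4f}")  # 3 / 7
print(f"Dice:     {dice(fp_x, fp_y):.4f}")      # 6 / 10
print(f"Cosine:   {cosine(fp_x, fp_y):.4f}")    # 3 / sqrt(24)
```

For the same pair of fingerprints Tanimoto never exceeds Dice, so similarity thresholds are not interchangeable between metrics.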