2019 10 28 Chemoinformatics - Notes

The document outlines the drug discovery pipeline, emphasizing the importance of rationalized drug design and the key-lock principle where small molecules bind to target proteins to alter biological pathways. It details the various stages of drug discovery, including target identification, hit identification, hit characterization, lead optimization, organic synthesis, and assays, highlighting the complexity and cost involved in developing new drugs. The average cost of drug development is estimated to be around 2 billion USD, taking approximately 12 years to complete, with only about 26 new drugs reaching the market annually.

S1133: The Drug Discovery Pipeline

Lecture Notes – Chloé-Agathe Azencott – 2019

1 Modern Therapeutic Research


See Appendix A for a historical overview of therapeutic research.

1.1 Rationalized drug design


In most cases, targets are proteins involved in a biological pathway necessary to the development
of the disease. The goal of the drug discovery process is to identify a small molecular compound
that inhibits (or activates, depending on the particular case under investigation) the target protein
in such a way as to alter this biological pathway. In addition, the drug must also be non-toxic to
the patient and have the possibility to reach the protein target. How easily will the compound reach
the bloodstream? How effectively will it be carried to the target site? Will it be broken down by
the metabolism before it can act? Will it then be eliminated from the body or is there a risk that it
accumulates, inducing adverse effects? Once a target has been identified, the process of discovering
a matching drug is therefore complex, long and costly.

What is a drug?
• Legal definition: “Toute substance ou composition présentée comme possédant des propriétés cu-
ratives ou préventives à l’égard des maladies humaines ou animales, ainsi que toute substance ou
composition pouvant être utilisée chez l’homme ou chez l’animal ou pouvant leur être administrée,
en vue d’établir un diagnostic médical ou de restaurer, corriger ou modifier leurs fonctions physi-
ologiques en exerçant une action pharmacologique, immunologique ou métabolique.” [code de la
Santé publique, article L.5111-1]
• Rough translation: Any substance or compound presented as having curative or preventive properties towards human or animal diseases, as well as any substance or compound that can be used in or administered to humans or animals so as to establish a medical diagnosis or to restore, correct or modify their physiological functions by exerting a pharmacological, immunological or metabolic action.
• Key-lock principle: Most drugs are relatively small molecules that work by binding to a protein. Generally speaking, a small molecule that binds to a protein is called a ligand. The protein is then referred to as the receptor. The terminology comes from the term “receptor” used to denote a protein embedded in the plasma membrane of a cell, whose role is to receive chemical signals from outside the cell. The binding site of the protein, where the ligand attaches to the receptor, is also called its active site or binding pocket. Note that a protein can have several such pockets. The protein that the drug binds to is called its target.

Illustration: a ligand bound in a pocket of its target.

For details about the different modes of binding, see Appendix B.


• Targeting a pathogen: Antibiotics and antiviral drugs target proteins that are only found in the
targeted pathogen (bacterium, virus) and are crucial for its survival or multiplication. Penicillin,
for example, is an antibiotic that targets an enzyme necessary to build the cell walls of bacteria.
Without this enzyme, bacteria burst and die. Similarly, protease inhibitors such as Saquinavir inhibit an enzyme necessary for HIV to spread through the body of its host.

From serendipity to rationalized drug design Ancient Greeks and Egyptians treated infections with mould; some moulds actually produce penicillin, which inhibits the biological processes that result in the formation of bacterial cell walls.

Illustration: Biapenem (an antibiotic of the β-lactam family), seen alone on the left image, binding to PBP-1A, a PBP (Penicillin Binding Protein, in blue on the right image), an enzyme which is involved in the formation of peptidoglycan, a polymer that forms the bacterial cell wall. The mode of action of most β-lactam antibiotics is to inhibit this protein, which results in bacterial death.

Overview of the drug discovery pipeline Using the key-lock principle, the modern approach to drug discovery can be described as a pipeline, which will be detailed in the remainder of this section. The steps of the drug discovery pipeline are: target identification (finding a protein that we want to interfere with); hit identification (finding small organic molecules, or hits, that bind this protein); hit characterization (determining the physico-chemical and pharmacological properties of these hits, and keeping as leads those that present the best properties); lead optimization (refining the leads into drug candidates); and, finally, assays evaluating the efficacy and toxicity of the candidates.

Illustration: the drug discovery pipeline.
Target (protein linked to the disease on which one wants to act) → Hits (compounds binding to the target) → Leads (desirable ADME-Tox properties) → Candidates (optimized and synthesizable) → Drug (approved for medical use and sale)

For drugs that do not work following the key-lock principle, see Appendix F.

1.2 Target identification


• Basic research into the disease process and its causes;


• Elucidation of (some of) the biochemical and biological processes involved;
• Choice of the target: researchers must determine which step of the process should be targeted. Where is intervention most likely to bring the desired effect (interrupting the disease process) with a limited amount of undesirable effects?
This is, ultimately, the topic of most bioinformatics research. Our focus here is on what happens after
a target has been identified.

1.3 Hit identification


First of all, pharmacologists must identify, in the chemical space, the compounds that are most likely to be active (that is to say, to bind to the protein and modify its activity in the organism). This involves screening large panels of chemicals, each of which must be tested against the protein.
• Hit: A compound that binds to the target. At this stage, we are not looking for a specific effect. Any compound that binds to the target protein is going to be relevant.
• Identify compounds known to have an effect on the target. An extensive literature and database search is conducted to identify compounds that are known to have a biological effect on the target. Those compounds can be endogenous, meaning that they are normally found in the organism, or exogenous, meaning that they are not.
Two essential, complementary approaches to finding new hits are:
• Combinatorial chemistry:
– For the history of the development of combinatorial chemistry, see Appendix C.
– Combinatorial chemistry is a family of techniques that can be used to generate large numbers of derivatives of a given compound for testing. Robert Merrifield, who received the Nobel prize in 1984 for his development (in the 1960s) of the methodology for chemical synthesis on a solid matrix, is considered the father of those techniques, together with Mario Geysen, who developed in the early 1980s the pin method for the simultaneous synthesis of diverse peptides. Combinatorial chemistry consists of adding a variety of substituting groups to a fixed portion of the starting compound, called a scaffold. The resulting compounds form what is called a combinatorial library.

Here is one scaffold with three possible derivatization sites. If one has ten possible substituting groups per site, one can potentially generate $10^3 = 1000$ compounds, hence the name combinatorial.
– The main technique for combinatorial chemistry is solid phase synthesis:
∗ Linkers, that are chemically stable, connect the scaffold to resin beads.
∗ Reactants are passed over (in solution) to generate intermediate compounds.
∗ A last step of detachment from the resin yields the derived compounds.

Source: Combinatorial Chemistry. Synthesis and application. Wilson & Czarnik ed., 1997

• (High-throughput) screening: All compounds in a library are added to a solution containing the target to evaluate whether or not they bind. The library can be generated by combinatorial chemistry if an interesting scaffold is known; otherwise, the library will typically contain all available in-house chemicals, or a catalog provided by a chemical company, or a set of known drugs.
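
The size of a combinatorial library (one scaffold, three derivatization sites, ten substituting groups per site) can be checked with a few lines of Python. This is only a counting sketch: the scaffold and substituent names are hypothetical placeholders, and no real chemistry is encoded.

```python
from itertools import product

def enumerate_library(scaffold, substituents_per_site):
    """Yield one (scaffold, choice_1, ..., choice_k) tuple per library member."""
    for choice in product(*substituents_per_site):
        yield (scaffold, *choice)

# Hypothetical placeholder names: ten substituting groups for each of 3 sites.
sites = [[f"R{site}_{k}" for k in range(10)] for site in (1, 2, 3)]
library = list(enumerate_library("scaffold", sites))

print(len(library))  # 1000, i.e. 10 ** 3
print(library[0])    # ('scaffold', 'R1_0', 'R2_0', 'R3_0')
```

Adding a fourth site with ten substituents would multiply the library size by ten again, which is why such libraries grow so quickly.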

1.4 Hit characterization


Once hits with good target binding abilities are identified, they need to be characterized: can they be used as drugs? Inhibiting the target is not a sufficient condition for being a promising drug candidate. Indeed, in order to be both safely administered and effective, a drug must have low toxicity and satisfy good pharmacokinetics requirements.
• Which of the hits are more suitable as medication?
• ADME-Tox properties: Pharmacokinetics requirements, compiled under the acronym of ADME (Ab-
sorption, Distribution, Metabolism, Elimination), characterize the ability of the molecule to reach
the target protein in the tissue where it is localized before being degraded.
Ideally, the drug is given orally, and must then enter the bloodstream via the digestive tract. This process can be hindered by factors such as intestinal transit time, compound solubility, or chemical reactivity in the stomach. If absorption is too low, the drug must be administered in a less desirable and more invasive way (such as inhalation, patches, or intravenous injection).
Natural biological barriers (in particular the blood-brain barrier) can also negatively impact the journey of the drug to its target. Even if the drug is satisfactorily absorbed, it can be partially blocked by membranes or by binding to proteins other than the one intended.
Moreover, chemicals usually start being broken down as soon as they enter the body; in particular, the liver will metabolize the drug, converting it into new metabolites that can either be inactive or, on the contrary, more potent (or cause more undesirable side effects) than the original drug.
While we wish for as much as possible of the administered dose to reach the target tissue, elim-
ination must also be taken into consideration, as the accumulation of foreign substances in the
organism can adversely affect healthy metabolism.
– Absorption: Drugs administered orally have to be absorbed before they can be transported via
the circulatory system (i.e. blood vessels) to their site of action.
– Distribution: In order to be effective, a drug must be able to reach its site of action. To pass through lipid bilayer membranes, drugs must be reasonably soluble in both water and lipids. If necessary, medications will be packaged in time-release capsules that ensure their level remains constant over several hours, or in a coating that ensures they can pass unharmed through the stomach’s acidity into the small intestine.

Bi-lipidic structure of the cell membrane. Source: http://physio1.wikispaces.com.

– Metabolism: Chemicals (in particular, enzymes in the gastro-intestinal tract and in the liver) start breaking down compounds as soon as they enter the body. The drug may be inactivated by this mechanism. In addition, the resulting metabolites may have undesired pharmacological effects.
– Excretion/Elimination: The liver and kidney are the main organs involved in the elimination of
drugs (and, more generally, waste). The liver breaks down toxic substances through a series of
complex metabolic reactions. The kidneys further process the broken down waste and eliminate
it from the body through urine.
– Toxicity: Obviously, the compound should be as non-toxic, non-carcinogenic and non-mutagenic as possible for the patient.

1.5 Lead optimization


After promising drug candidates have been identified, they are assayed in vitro to verify that they do
indeed bind to the target protein. This is followed by a phase of lead optimization, during which the
chemical structure of the drug candidates is refined in order to meet the ADME and toxicity require-
ments.
• Optimize ligand-receptor interaction (pharmacodynamics)
• Optimize ADME-Tox properties (pharmacokinetics)
• Design of the synthetic path to produce the lead compound
• Synthesis of analogues: The main tools for lead optimization are:
– Combinatorial chemistry
– Structure-based design: use information about the structure of the target and that of the lead to tweak the lead, usually visually, in appropriate software for positioning compounds and computing binding energies (see Section 3).

1.6 Organic synthesis


Once a drug candidate has been identified, a reliable process to produce it must be established. Al-
though molecular compounds can sometimes be extracted from a natural source, this technique usu-
ally proves to be laborious and the alternative of synthesizing molecules from commercially avail-
able starting materials is preferable. In addition, combinatorial synthesis, a technique by which large
numbers of compounds can be synthesized simultaneously to create chemical libraries for biological
screening, also relies on a thorough understanding of organic synthesis. Planning the total synthesis
of compounds with interesting biological or physical properties is therefore one of the core con-
cerns of organic chemistry. Devising the optimal multi-step route to a novel and potentially artificial
compound is a challenging problem, and organic chemists face the daily challenge of choosing the
most appropriate combination of reactants and reagents, as well as the necessary conditions and best
sequence of their assembly. See more details on the history of organic synthesis in Appendix D.

1.7 Assays


Last but not least, drug candidates must be assayed to evaluate whether they indeed have the intended
effect, determine dosage, and assess their toxicity. Pre-clinical assays aim at demonstrating safety and
are used to file a new drug application and obtain permission to enter clinical trials. Clinical trials on
humans are used to determine that the drug is indeed effective (and, if a drug already exists on the
market with the same indication, that it is more effective than this drug), and determine dosage. Finally,
drugs are continuously monitored for side effects after they have been released, through a process known as pharmacovigilance or post-marketing assays. For more details, including on regulatory agencies, see Appendix E.

1.8 Engineering in drug discovery

Illustration: the drug discovery pipeline (Target → Hits → Leads → Candidates → Drug), annotated with durations of 52 months and 90 months and with a cost of USD 500 M to USD 2 B.

The cost of drug discovery


• In 2003, DiMasi et al. evaluated the average time from the start of clinical testing to marketing
approval to be about 90 months. In addition, they estimated that preclinical development alone,
from lead identification to initial human testing, lasts an average of 52 months.
• Three years later, Adams and Brantner assessed the cost of the development of a new drug as
ranging from five hundred million to two billion US dollars, averaging at USD 868 M.
• Recent estimates (2013) [DGH16] are closer to 2.8 billion US dollars. The development of a new drug is estimated to take an average of 12 years and cost around 2 billion USD.
• Only about 26 new drugs make it to the market every year, in spite of numerous technological breakthroughs and a careful and progressive reconsideration of the process itself, coming from the large efforts of academic research as well as the pharmaceutical industry.
• Moreover, adverse drug reactions (ADRs) may still happen, leading to large human and financial
costs:
– annual direct hospital cost in the US (2012) of 1.56 billion USD [Mig+12];
– 2.2 million hospitalized patients, 100 000 deaths per year in the US (1998) [LPC98].
• Precision medicine: The highest-grossing drugs in the US (for indications such as diabetes, asthma, cardiovascular diseases, depression, etc.) only help between 1 in 25 and 1 in 4 of the patients who take them [Sch15]. This does not mean that the other patients cannot be cured, but that their treatment must be adjusted: symptoms are not enough to know which treatment will work.
• The therapeutic research process is complex, costly, and time-consuming.

How can engineering help? Engineers in pharmaceutical companies have roles in many fields: au-
tomation, biotechnology, electrical engineering, mechanical engineering, mechatronics, computer en-
gineering, and more.
You can consult Appendix G for some information about the role of robotics and automation in the drug discovery pipeline. We will focus here on the role of computer science, through a field called chemoinformatics.

2 Chemoinformatics
By the 1970s, the amount of data and information produced by chemical research had grown large
enough that it became obvious that it could only be processed and analyzed by computer methods,
pushing the development of databases of chemical compounds and reactions. Furthermore, many of
the problems faced by chemists, from the prediction of physical, chemical and biological properties
of compounds and materials to structure elucidation or organic synthesis are so complex that they
require informatics-based approaches.

Chemoinformatics
Help from computer science:
“...the mixing of information resources to transform data into information, and information into knowl-
edge, for the intended purpose of making better decisions faster in the arena of drug lead identification
and optimisation.” – F. K. Brown
“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel. This
definition encompasses many aspects and these problems include:
– Representing, storing and retrieving chemical compounds and reactions;
– Predicting physical, chemical and biological properties of compounds;
– Drug design;
– Structure elucidation;
– Predicting the course of chemical reactions;
– Designing organic syntheses.

Illustration: the drug discovery pipeline (Target → Hits → Leads → Candidates → Drug), overlaid with the label “Chemoinformatics”.

Virtual chemical space In 2009, Blum and Reymond built a virtual library containing of the order of $10^{9}$ potential drugs: all those made of hydrogen and no more than 13 atoms of carbon, oxygen, nitrogen, sulfur, and chlorine. This is rather restrictive, considering that, for example, morphine contains 17 carbons and erythromycin is made of 37 carbon atoms, 13 oxygen atoms and 1 nitrogen atom. In 1996, Bohacek et al. estimated that there are potentially of the order of $10^{60}$ drug-sized organic compounds. By comparison, the universe is estimated to contain $3 \times 10^{23}$ stars, according to van Dokkum and Conroy.
Although the combinatorial libraries of chemicals used in high-throughput screening are designed to
cover as much of the chemical space as possible, they still leave large areas of it unexplored, which
motivates the need for new, fast and accurate techniques.
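
To get a feel for these orders of magnitude, the short computation below compares a library of about 10^9 compounds, a chemical space of about 10^60 compounds, and about 3 × 10^23 stars. The variable names are ours, and the figures are the rough estimates quoted above rather than precise values.

```python
library_size = 1e9       # virtual library of Blum and Reymond (order of magnitude)
chemical_space = 1e60    # drug-sized organic compounds, after Bohacek et al.
stars = 3e23             # stars in the universe, after van Dokkum and Conroy

fraction_covered = library_size / chemical_space
compounds_per_star = chemical_space / stars

print(f"fraction of chemical space in the library: {fraction_covered:.0e}")
print(f"drug-sized compounds per star: {compounds_per_star:.1e}")
```

Even a billion-compound virtual library therefore covers a fraction of about 10^-51 of the estimated drug-sized chemical space.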

2.1 Representing chemicals in silico
One of the core concerns of chemoinformatics is the representation of small organic compounds in
silico.
Many chemoinformatics applications, including high-throughput virtual screening, benefit from being
able to rapidly predict the physical, chemical, and biological properties of small molecules to screen
large repositories and identify suitable candidates.
Ab initio methods, such as quantum mechanical methods, can in most cases still not be applied sys-
tematically due to complexity and computational cost issues. When annotated data are available,
machine learning methods that try to extract relevant information more or less automatically from
the data provide a suitable alternative.

Representing and visualizing chemicals Here are some examples of different ways to describe the
same molecule:

Name: Amoxicillin; amoxycillin; amoxicilline; amoxicillin anhydrous; Clamoxyl; Amopenixin; Amolin; Moxal; AMPC ...
Chemical formula: C16H19N3O5S.
IUPAC name: (2S,5R,6R)-6-[[(2R)-2-amino-2-(4-hydroxyphenyl)acetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylic acid.
Molecular graph (structural diagram of the molecule).
Ball-and-stick representation: the atoms are colored according to their type and the representation can be used in 3D.
Solvent-excluded surface, or Connolly surface, representing the surface of the molecule.
CPK (Corey, Pauling and Koltun) or space-filling representation, in which atoms are represented by a sphere delimiting the locations of their electrons.

None of these are straightforward to deal with as a computational representation of a molecule.

2.1.1 SMILES strings

Molecules can be represented using so-called SMILES strings (Simplified Molecular Input Line Entry
System). For a complete specification, see
http://www.daylight.com/meetings/summerschool98/course/dave/smiles-intro.html.
• Atoms are represented by their atomic symbol: C, O, N, S, ...
• Atoms in aromatic rings are represented in lower case: c, n, ...
• Hydrogens and charges are specified inside square brackets: [Fe+2] or [Fe++].
• No square brackets implies normal valence: [CH4] is equivalent to C.
• Single (-), double (=), triple (#), aromatic (:) bonds. Single and aromatic bonds may be omitted if unambiguous. E.g. C=O, C#N, C-C or CC.
• Branches are indicated between parentheses, e.g. O([H])[H] represents water (usually written simply as O).
• Rings are written by breaking one bond of the cycle and marking the two atoms it joined with the same number: c1ccccc1 represents an aromatic cycle of 6 carbons.
• Caution: SMILES are not unique; each molecule has multiple possible representations. However, one SMILES string represents a single molecular graph (though not necessarily a unique molecule, because of stereoisomerism).

Examples
• SMILES string of cytosine: Nc1[nH]c(=O)ncc1.

Molecular graph of cytosine. Unlabeled nodes are carbon atoms.


Start with the nitrogen at the top: N. It is attached to a carbon that is part of an aromatic ring:
c1 (c for the aromatic carbon; 1 to denote where the ring starts). Turn counter-clockwise: [nH],
then c. There are two possible directions after this last aromatic carbon: either continue along the
ring, or branch towards the oxygen. Here is the branch: (=O) (() for the branch, = for the double
bond, and O for the oxygen). Now we continue along the ring: an aromatic nitrogen n, followed
by one aromatic carbon c, followed by an aromatic carbon which should also be connected to the
beginning of our ring: c1.
• SMILES string of caffeine: CN1C=NC2=C1C(=O)N(C)C(=O)N2C.

Molecular graph of caffeine. Unlabeled nodes are carbon atoms.


Start with the CH3 in the upper-left corner: C. It is attached to a nitrogen that is part of a first ring:
N1. Turn counter-clockwise: a carbon, a double bond, a nitrogen, then a carbon that is also part of a
second ring: C=NC2. Turn clockwise: a double bond, then a carbon which closes the first ring: =C1.
Then there is a carbon: C, a branch containing a double bond and an oxygen: (=O), a nitrogen: N,
a branch containing a CH3: (C), a carbon: C, a branch containing a double bond and an oxygen:
(=O), a nitrogen that closes the second ring: N2, and finally a CH3: C.
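
The two walk-throughs above can be mirrored by a small reader that turns a SMILES string into a list of atoms and a list of bonds. This is a minimal sketch for a subset of the notation only (bracket atoms, common one- and two-letter atoms, bond symbols, branches, one-digit ring labels); the function names are ours, and it assumes that two adjacent lower-case atoms are joined by an aromatic bond, which is not always true in full SMILES.

```python
import re

# Tokens of the subset handled here: bracket atoms, Cl/Br, one-letter atoms,
# aromatic atoms, bond symbols, branch parentheses, one-digit ring labels.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")

def default_bond(a, b):
    """Implicit bond: aromatic between two lower-case atoms, single otherwise."""
    return ":" if a.islower() and b.islower() else "-"

def smiles_to_graph(smiles):
    """Return (atoms, bonds); bonds are (index, index, bond symbol) triples."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES: {smiles!r}")
    atoms, bonds = [], []
    prev, pending = None, None   # last atom read; explicit bond awaiting an atom
    branch_points = []           # atoms at which a branch was opened
    open_rings = {}              # ring label -> (atom index, bond symbol or None)
    for tok in tokens:
        if tok == "(":
            branch_points.append(prev)
        elif tok == ")":
            prev = branch_points.pop()
        elif tok in "-=#:":
            pending = tok
        elif tok.isdigit():
            if tok in open_rings:        # second occurrence: close the ring
                other, opened_with = open_rings.pop(tok)
                kind = pending or opened_with or default_bond(atoms[other], atoms[prev])
                bonds.append((other, prev, kind))
            else:                        # first occurrence: remember the atom
                open_rings[tok] = (prev, pending)
            pending = None
        else:                            # an atom, possibly in square brackets
            symbol = re.search(r"[A-Za-z][a-z]?", tok).group(0)
            atoms.append(symbol)
            if prev is not None:
                kind = pending or default_bond(atoms[prev], symbol)
                bonds.append((prev, len(atoms) - 1, kind))
            prev, pending = len(atoms) - 1, None
    return atoms, bonds

for name, smi in [("cytosine", "Nc1[nH]c(=O)ncc1"),
                  ("caffeine", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C")]:
    atoms, bonds = smiles_to_graph(smi)
    print(name, len(atoms), "atoms,", len(bonds), "bonds")
# cytosine 8 atoms, 8 bonds
# caffeine 14 atoms, 15 bonds
```

Hydrogens are not stored as separate vertices here, so the counts are those of the heavy atoms and of the bonds between them.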

Canonical SMILES As the above examples illustrate, while a SMILES string determines a unique molecular graph, the converse is far from true. Canonical representations are desirable, to associate each molecule with a unique SMILES string. The state of the art is however still not stable, and multiple canonicalization algorithms exist. For more details, see Appendix I.

2.1.2 Molecular fingerprints

Expert knowledge descriptors The first in silico representations of molecules to have been developed were based on lists of descriptors, each corresponding to a characteristic thought relevant by experts. Such descriptors range from the presence of particular functional groups to the number of rotatable bonds or the partial charges on each of the atoms of the molecule.
One of the most popular sets of such descriptors used in chemical informatics is given by DRAGON, which provides 3224 molecular features, divided in 22 blocks, and ranging from simple atom type, functional group and fragment counts to several advanced topological and geometrical descriptors. With the help of expert chemists, more and more features have been added over the years since the first release of DRAGON in 1997. Unfortunately, some of these descriptors are prohibitively complex to compute for large data sets.
Another example of such descriptors is the PowerMV set of descriptors, which provides atom-based, fragment-based, and real-valued descriptors. The latter include characteristics such as Gasteiger partial charges, electronegativities, or logP (estimated using the XlogP predictor).
These representations are hard to define and some of the descriptors can be difficult to compute. Moreover, they are potentially incomplete, as a key feature for the desired application might have been missed by the experts.

Molecular fingerprints Nowadays, molecules are typically represented by so-called molecular fingerprints. A fingerprint is a bit string in which each bit corresponds to a particular molecular feature, designed to be chemically relevant, and is set to 1 if the molecule exhibits that feature and to 0 otherwise.
• Define feature vectors that record the presence/absence (or number of occurrences) of pre-determined
molecular features in a compound.
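\[
\phi(A) = \left(\phi_s(A)\right)_{s \text{ feature}}
\quad \text{where} \quad
\phi_s(A) =
\begin{cases}
1 & \text{if } s \text{ occurs in } A \\
0 & \text{otherwise.}
\end{cases}
\]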

Example of the construction of a fingerprint for amoxicillin.

Examples of molecular fingerprints The most widely known sets of fingerprints include the MACCS
keys and the CACTVS substructure keys.
• MACCS keys: answers to a set of true/false questions about a chemical structure
– “Are there fewer than 3 oxygen atoms?”

– “Is there at least one halogen atom present?”
– ...
• PubChem Substructure Fingerprints (aka CACTVS Subgraph Keys):
presence/absence of molecular substructures, e.g.:
– individual atoms of a given type
– rings of a given size containing given atoms
– more general patterns involving several bonds and atom types
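
A key-based fingerprint of this kind can be sketched as a list of true/false questions evaluated on a molecule. In the toy example below a molecule is reduced to its element counts (amoxicillin, C16H19N3O5S); the first two questions are the ones quoted above, the third is added for illustration, and none of them is claimed to match the actual MACCS or PubChem key definitions.

```python
# Element counts of amoxicillin, read off its formula C16H19N3O5S.
amoxicillin = {"C": 16, "H": 19, "N": 3, "O": 5, "S": 1}

HALOGENS = ("F", "Cl", "Br", "I")

# Each key is a (question, predicate) pair; a predicate takes element counts.
KEYS = [
    ("fewer than 3 oxygen atoms", lambda m: m.get("O", 0) < 3),
    ("at least one halogen atom", lambda m: any(m.get(x, 0) > 0 for x in HALOGENS)),
    ("at least one sulfur atom", lambda m: m.get("S", 0) >= 1),  # illustrative key
]

def fingerprint(molecule, keys=KEYS):
    """Return one bit per key: 1 if the key's answer is yes, 0 otherwise."""
    return [1 if predicate(molecule) else 0 for _, predicate in keys]

print(fingerprint(amoxicillin))  # [0, 0, 1]
```

Real substructure keys need the molecular graph rather than element counts, but the resulting object is the same: a fixed-length vector of bits indexed by features.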

The molecular graph Whereas most sets of fingerprints or descriptors (from DRAGON descriptors to
MACCS keys) are rather heterogeneous collections of all sorts of computable molecular properties that
heavily rely on expert knowledge, substructure-based representations are derived in a more principled
and automated way. In what follows we will focus on such representations and on extracting structural
information from the molecular graph.
Small molecules are most commonly represented as labeled graphs of bonds. The vertices repre-
sent the atoms, and the edges represent the bonds. Edges are labeled by the bond type (e.g. single,
double) they correspond to. Labels on the vertices correspond to the element (e.g. C, N, O) of the
atom they correspond to, and can be expanded to include more information about the local chemical
environment.

Amoxicillin represented as an undirected labeled graph


Examples of labels:
• The element
• The binding affinity
E.g. XSCORE: polar, hydrophobic, hydrogen-bond donating/accepting/both, other.
• The element-hybridization state
E.g. C.sp3, N.sp2.
• SYBYL: 53 labels
E.g. sp3 carbon, trigonal planar nitrogen, halogen, sulfoxide sulfur.

Structure-based fingerprints
• Paths fingerprints One can choose to use all labeled paths of length d, or up to d, starting from each vertex of the graph. Paths are allowed to self-intersect and traverse the same vertex twice, so as to capture ring structure. Edges cannot appear more than once in a path.
– Daylight fingerprints: all substructures with N atoms bonded by N − 1 bonds, for 3 ≤ N ≤ 7.
– Labeled sub-paths (walks): consider all possible paths of length d, or up to d, in the molecular graph.

Two sub-paths of length 3 in the molecular graph of amoxicillin: CsCsCdO and NsCsCsS.


• Circular fingerprints So-called circular, or extended-connectivity, substructures are labeled trees rooted at each vertex of the molecular graph. A depth parameter d controls the depth of the trees. For a given tree, the algorithm recursively labels each tree vertex from the leaf nodes to the root, appending to each parent’s label the labels of its children. Each resulting vertex label is then considered as a feature. For the labeling process to be unique, the vertices of the graph need to be ordered in a unique canonical way. This ordering is achieved using Morgan’s algorithm.

Example of a circular substructure of depth 2 in the molecular graph of amoxicillin: C{sC{sN|sC}|sN{sC}|sS{sC}}.


• Other examples include frequent pattern fingerprints [Hel+04].
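
The recursive labeling behind circular substructures can be illustrated on a very small graph. The sketch below is a simplified variant and not the algorithm described above: it sorts the children's labels alphabetically instead of using Morgan's canonical ordering, and an atom that can be reached along two routes is repeated in the tree. The molecule (acetic acid, with s and d marking single and double bonds) and the function name are our own choices.

```python
def circular_label(atoms, bonds, root, depth):
    """Label of the tree of the given depth rooted at atom index `root`."""
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b, kind in bonds:
        adjacency[a].append((b, kind))
        adjacency[b].append((a, kind))

    def label(node, parent, remaining):
        if remaining == 0:
            return atoms[node]
        # Children are all neighbours except the atom we arrived from;
        # alphabetical sorting stands in for a canonical ordering.
        children = sorted(
            kind + label(nxt, node, remaining - 1)
            for nxt, kind in adjacency[node]
            if nxt != parent
        )
        if not children:
            return atoms[node]
        return atoms[node] + "{" + "|".join(children) + "}"

    return label(root, None, depth)

# Acetic acid, CC(=O)O: methyl carbon, carboxyl carbon, two oxygens.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, "s"), (1, 2, "d"), (1, 3, "s")]

print(circular_label(atoms, bonds, 0, 2))  # C{sC{dO|sO}}
print(circular_label(atoms, bonds, 1, 1))  # C{dO|sC|sO}
```

Collecting the labels obtained for every root atom gives the set of features of the molecule for that depth.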

Example (path fingerprints) Let us consider paths of length 6, using the atoms C (carbon), c (carbon
in an aromatic cycle), O (oxygen), N (nitrogen), n (nitrogen in an aromatic cycle). If we do not consider
other atom types, nor distinguish between bond types, there are $(5^6 + 5^3)/2 = 7875$ possible patterns:
each of the $5^6$ strings of 6 atoms is identified with its reverse (CCCccc and cccCCC are equivalent, as
molecular graphs have no orientation), and the $5^3$ palindromes are their own reverse.
The fingerprint of amoxicillin, for paths of length 6 built on the 5 above atoms, ignoring bond type,
looks like:
CCCCCC  CCCCCc  CCCCCO  CCCCCN  CCCCCn  ...  CCCNCC  CCCNCc  CCCNCO  ...
0       0       0       0       0       ...  1       0       1       ...
Alternatively, we can choose to represent the fingerprint (for paths of length 6) of a molecule using
a list structure. Such a fingerprint can be built by reading the SMILES string and listing all paths of
length 6 that are encountered. This approach requires less memory (the vector fingerprint is
usually quite sparse) and guarantees that no atom type is forgotten (here, the sulfur S).
For amoxicillin, the list would look like: [‘CCCNCC’, ‘CCCNCO’, ...].
To encode counts of paths rather than their mere presence/absence, one can
use dictionaries following the same principle. For amoxicillin, again, the dictionary would look like:
{‘CCCNCC’: 3, ‘CCCNCO’: 2, ...}.
Fingerprints tend to be very long and sparse; as an example, an experiment on 50 000 random
molecules from the ChemDB database yielded 300 000 possible paths of length 8. Encoding each of
the 50 000 molecules as a vector of length 300 000 is inefficient, and compression strategies, about
which you can read more in Appendix I.1, are necessary.
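The dictionary representation described above can be sketched as follows. This is a minimal illustration, not the procedure used to produce the amoxicillin example: the function name `path_fingerprint` and the toy graphs are assumptions, bond types are ignored, and the graph is given directly as a list of atoms and bonds rather than parsed from a SMILES string. Paths may revisit a vertex but never reuse a bond, and a path and its reverse are counted once.

```python
from collections import Counter


def path_fingerprint(atoms, bonds, n_atoms):
    """Count the labeled paths made of `n_atoms` atoms in a molecular graph.

    atoms: list of atom labels, indexed by vertex.
    bonds: list of pairs of vertices (u, v).
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for u, v in bonds:
        adjacency[u].append(v)
        adjacency[v].append(u)

    counts = Counter()

    def extend(path, used_bonds):
        if len(path) == n_atoms:
            # Each path is found once from each end: keep one orientation.
            if path <= path[::-1]:
                label = "".join(atoms[i] for i in path)
                # CCO and OCC denote the same pattern: keep the smaller string.
                counts[min(label, label[::-1])] += 1
            return
        for neighbor in adjacency[path[-1]]:
            bond = frozenset((path[-1], neighbor))
            if bond not in used_bonds:  # a bond appears at most once
                extend(path + [neighbor], used_bonds | {bond})

    for start in range(len(atoms)):
        extend([start], frozenset())
    return counts


# Hypothetical toy fragment: a carbon bonded to C, O and N.
fragment = path_fingerprint(["C", "C", "O", "N"], [(0, 1), (1, 2), (1, 3)], 3)
print(dict(fragment))    # counts of paths (dictionary fingerprint)
print(sorted(fragment))  # presence/absence (list fingerprint)
```

On a three-membered ring, paths of 4 atoms go once around the cycle and come back to their starting vertex, which is how allowing repeated vertices captures ring structure.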

Deep learning for molecular graphs In recent years, many research efforts have concentrated on
extending the concepts of convolutional neural networks from images to graphs. These approaches
are beyond the scope of this introductory course. While they sometimes (far
from always) outperform classical machine learning approaches based on fingerprints, they require
much more computational power. An interesting area of development is that of generative models,
which one can hope to employ to generate molecules with properties similar to those of the training
set. See for example [LPB13; Duv+15; Col+17; HYL17] and [Kad+17; CKK17].

2.2 Computing chemical similarity


Similar Property Principle
• Molecules having similar structures should exhibit similar activities [JM90]. Hence it makes sense
to use structure-based representations and compare molecules by comparing substructures.
• Use fingerprints to describe molecules, and apply distance- or similarity-based learning algorithms.

2.2.1 Distance between two binary or real-valued vectors

• A distance over the space $X$ (here $X = \{0, 1\}^p$ for binary fingerprints and $X = \mathbb{N}^p$ for count
fingerprints) is a function $d$ such that:

$$d: X \times X \to \mathbb{R}$$
$$d(x, x) = 0$$
$$d(x, z) = d(z, x) \quad \text{(symmetry)}$$
$$d(x, z) \le d(x, w) + d(w, z) \quad \text{(triangle inequality)}.$$

• Real-valued vectors: $X = \mathbb{R}^p$, which contains the cases $X = \{0, 1\}^p$ and $X = \mathbb{N}^p$.

– Euclidean distance:
$$d(x, z) = \|x - z\|_2 = \sqrt{\sum_{j=1}^p (x_j - z_j)^2}.$$

– Manhattan distance:
$$d(x, z) = \|x - z\|_1 = \sum_{j=1}^p |x_j - z_j|.$$

– $\ell_q$-norm associated distance:
$$d(x, z) = \|x - z\|_q = \left( \sum_{j=1}^p |x_j - z_j|^q \right)^{1/q}.$$

– $\ell_\infty$ distance:
$$d(x, z) = \|x - z\|_\infty = \max_{j=1,\dots,p} |x_j - z_j|.$$

• In the special case of binary-valued vectors, $X = \{0, 1\}^p$, one often uses the Hamming distance,
which counts the number of bits that are different between x and z:
$$d(x, z) = \sum_{j=1}^p (x_j \ \mathrm{XOR}\ z_j).$$

Note: For binary vectors, this is equal to the Manhattan distance (and to the squared Euclidean distance).
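The distances above are straightforward to compute. The following plain-Python sketch (function names and example vectors are illustrative) also checks, on one pair of binary fingerprints, that the Hamming distance coincides with the Manhattan distance and the squared Euclidean distance.

```python
import math


def euclidean(x, z):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def manhattan(x, z):
    return sum(abs(a - b) for a, b in zip(x, z))


def l_infinity(x, z):
    # Largest absolute difference over all coordinates.
    return max(abs(a - b) for a, b in zip(x, z))


def hamming(x, z):
    # Number of positions at which two binary vectors differ.
    return sum(a ^ b for a, b in zip(x, z))


# Two hypothetical binary fingerprints of length 5.
x = [1, 0, 1, 1, 0]
z = [1, 1, 0, 1, 0]
print(hamming(x, z), manhattan(x, z), euclidean(x, z) ** 2)
```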

2.2.2 Similarity between two vectors

The idea of a measure of similarity is converse to that of a distance: the closer two points are, the
smaller their distance and the larger their similarity.

Similarity between two real-valued vectors:


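• Pearson’s correlation:
$$\rho(x, z) = \frac{\sum_{j=1}^p (x_j - \bar{x})(z_j - \bar{z})}{\sqrt{\sum_{j=1}^p (x_j - \bar{x})^2} \, \sqrt{\sum_{j=1}^p (z_j - \bar{z})^2}},$$
where $\bar{x} = \frac{1}{p} \sum_{j=1}^p x_j$.
• Suppose the data is centered: $\bar{x} = \bar{z} = 0$. Then
$$\rho(x, z) = \frac{\sum_{j=1}^p x_j z_j}{\sqrt{\sum_{j=1}^p x_j^2} \, \sqrt{\sum_{j=1}^p z_j^2}} = \frac{\langle x, z \rangle}{\|x\| \, \|z\|} = \cos(\theta),$$
where θ is the angle between x and z, as shown for two dimensions on the illustration. Hence, for
centered data, Pearson’s correlation coincides with the cosine similarity.
[Figure: vectors x and z in the plane (axes: feature 1, feature 2). Pearson’s correlation between centered vectors x and z is given by the cosine of their angle.]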
• The dot product can be used to measure similarity between two vectors:
$$s(x, z) = \langle x, z \rangle = \sum_{j=1}^p x_j z_j.$$
Note that Pearson’s correlation between two vectors that are centered and of norm 1 is their dot
product.
• The dot product can also be computed in another feature space, using a mapping $\Phi: \mathbb{R}^p \to \mathbb{R}^d$:
$$s(x, z) = \langle \Phi(x), \Phi(z) \rangle = \sum_{j=1}^d \Phi_j(x) \, \Phi_j(z),$$
where $\Phi_j(x)$ denotes the j-th coordinate of $\Phi(x)$.
• Such a similarity is a kernel, and can sometimes be computed directly in the space $\mathbb{R}^p$. For reminders
about kernels, see Appendix J.
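The relation between Pearson's correlation, the cosine similarity and the dot product can be checked numerically. In this minimal sketch (function names are illustrative), Pearson's correlation is computed as the cosine similarity of the centered vectors.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def cosine(x, z):
    # Cosine of the angle between x and z.
    return dot(x, z) / math.sqrt(dot(x, x) * dot(z, z))


def pearson(x, z):
    # Center both vectors, then take the cosine of their angle.
    mean_x = sum(x) / len(x)
    mean_z = sum(z) / len(z)
    return cosine([a - mean_x for a in x], [b - mean_z for b in z])


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```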

Similarity between two fingerprints


• Idea: Compare the number of entries that are common to the two fingerprints.
• Binary fingerprints $X = \{0, 1\}^p$: Tanimoto similarity
$$s(x, z) = \frac{\sum_{j=1}^p (x_j \ \mathrm{AND}\ z_j)}{\sum_{j=1}^p (x_j \ \mathrm{OR}\ z_j)}$$
• Count fingerprints $X = \mathbb{N}^p$: MinMax similarity
$$s(x, z) = \frac{\sum_{j=1}^p \min(x_j, z_j)}{\sum_{j=1}^p \max(x_j, z_j)}$$

Note: For a binary fingerprint, Tanimoto = MinMax.

• Tanimoto and MinMax are kernels.
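Both similarities are easy to compute on the sparse list and dictionary fingerprints introduced earlier. The sketch below is illustrative: the function names and the toy fingerprints are assumptions, and returning 0 for two empty fingerprints is an arbitrary convention.

```python
def tanimoto(x, z):
    """Tanimoto similarity between two binary fingerprints given as
    collections of the patterns that are present."""
    x, z = set(x), set(z)
    union = len(x | z)
    return len(x & z) / union if union else 0.0  # convention if both are empty


def minmax(x, z):
    """MinMax similarity between two count fingerprints given as
    dictionaries mapping patterns to counts."""
    patterns = set(x) | set(z)
    numerator = sum(min(x.get(p, 0), z.get(p, 0)) for p in patterns)
    denominator = sum(max(x.get(p, 0), z.get(p, 0)) for p in patterns)
    return numerator / denominator if denominator else 0.0


# Hypothetical fingerprints of two small molecules.
print(tanimoto(["CCO", "CCN"], ["CCO", "NCO"]))
print(minmax({"CCO": 3, "CCN": 1}, {"CCO": 2, "NCO": 2}))
```

When all counts are equal to 1, `minmax` returns the same value as `tanimoto`, as stated in the note above.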

Limitations These representations do not model


• accessible surface, through which interactions happen;
• 3D configuration
– Information can be difficult to get (one can use simulations);
– Multiple conformations are possible in solution (rotatable bonds, etc.).
• stereoisomers, which may however have different ADME-Tox properties.

What about large molecules?


• Drug-like compounds: typically about 22 heavy atoms. Proteins are much larger and representations
based on molecular graphs become intractable.
• Sequence-based representations:
– Local Alignment kernel: based on the Smith-Waterman alignment algorithm [Sai+04];
– String kernels [Les+04]: similar to fingerprints, but use all possible sequences of amino acids of
fixed length k instead of paths/trees.
• Use hierarchy from the Enzyme Commission [JV08].
• Use the 3D binding pocket information [Kel+06].

2.3 Learning from molecular fingerprints


Classical machine learning and data mining techniques can be applied to the vectorial feature repre-
sentations we have described above.
• Any distance / similarity can be used;
• Clustering can for instance be used to select compounds for biological screening, or to process
substructure search outputs. Examples: [But99; Bar; RBW03].
• Classification and regression are used to predict properties of molecular compounds.
In what follows we will focus on supervised learning applications in chemoinformatics.

2.3.1 Supervised learning

Remember the principle: Learn $f$ given $\{(x^i, y^i)\}_{i=1,\dots,n}$.

• Classification: $Y = \{-1, +1\}$
E.g. Toxic vs. non-toxic; inhibits a given receptor vs. not; anticancerous vs. not.
• Regression: $Y = \mathbb{R}$
E.g. Solubility; LogP; IC50 (half maximal inhibitory concentration).
• Virtually all machine learning algorithms for classification and/or regression have been applied to
chemoinformatics problems, including but not limited to: decision trees, Gaussian processes,
logistic/linear regression, k-nearest neighbors, naive Bayes, neural networks, random forests, support
vector machines. In what follows we present a few specific applications.

Performance of supervised models One point we have not discussed much so far is how to measure
the performance of a supervised model. Let us assume a data set of n samples $(x^i, y^i)$ and a predictor
$f$.
• For regression models, performance is typically measured using
– The mean absolute error:
$$\frac{1}{n} \sum_{i=1}^n |f(x^i) - y^i|.$$

– The root mean squared error:
$$\sqrt{\frac{1}{n} \sum_{i=1}^n \left( f(x^i) - y^i \right)^2}.$$

– Pearson’s correlation between the predictions and the true values (its square is sometimes
reported as a coefficient of determination):
$$\frac{\sum_{i=1}^n (y^i - \bar{y}) \left( f(x^i) - \overline{f(x)} \right)}{\sqrt{\sum_{i=1}^n (y^i - \bar{y})^2} \, \sqrt{\sum_{i=1}^n \left( f(x^i) - \overline{f(x)} \right)^2}} \quad \text{where} \quad \bar{u} = \frac{1}{n} \sum_{i=1}^n u^i.$$

• For classification models, performance is measured based on the confusion matrix:

                      Actually positive      Actually negative
Predicted positive    True Positive (TP)     False Positive (FP)
Predicted negative    False Negative (FN)    True Negative (TN)

The true positive number TP is the number of actually positive samples that were predicted to be
positive by the model. The false positive number FP is the number of actually negative samples
that were predicted to be positive by the model, and so on and so forth.
• Many measures can be derived from this matrix:
– The accuracy is the proportion of correct predictions (TP + TN) / (TP + TN + FP + FN).
– If your data set is imbalanced, that is to say, contains many more samples of one class than of the
other, a classifier that assigns the majority class to every sample can have a good accuracy.
– The true positive rate, or sensitivity, or recall, which is the proportion of positive samples that
are (correctly) predicted to be positive: TPR = TP / (TP + FN).
– The false positive rate, which is the proportion of negative samples that are (wrongly) predicted
to be positive: FPR = FP / (TN + FP).
– The specificity, which is the true negative rate, that is to say, proportion of negative samples that
are (correctly) predicted to be negative: TNR = TN / (TN + FP) = 1 - FPR.
– The precision, or positive predictive value, is the proportion of positive predictions that are cor-
rect: PPV = TP / (TP + FP).
– Depending on the application, sensitivity or specificity can be more important. How costly/harmful
are false positives, versus false negatives? Think of a fire alarm, a blood test, etc.
– A classifier that labels all samples as positive will have a sensitivity of 1 and a specificity of
0. Conversely, a classifier that labels all samples as negative will have a sensitivity of 0 and a
specificity of 1.
– The F-score, or F1-score, which is the harmonic mean of precision and recall, combines both
aspects.

• Many classifiers return a score rather than a binary answer. The score is then converted into a
binary answer with the help of a threshold: all samples with a score larger than the threshold are
labeled positive, and all other samples are labeled negative. For example, if a model outputs the
probability of the sample being positive, then it makes sense to label positive all samples for which
the output is greater than 1/2. However, the classifier might perform better with a threshold of, say,
0.45. For such models, it is interesting to look at the evolution of the above evaluation scores with
the threshold. In practice, people often look at:
– ROC curves (for Receiver Operating Characteristic), which are constructed by plotting, for each
possible threshold, the true positive rate vs. the false positive rate. (The possible thresholds are
obtained by listing all predicted values in ascending order.) The ROC curve of a random predictor
follows the diagonal line; good models have a TPR that increases faster than their FPR and are
close to the upper left corner.
– PR curves, or Precision-Recall curves, work on the same principle but plot precision against recall.
Good predictors maintain a good precision as recall increases, and are close to the upper right
corner.
– ROC curves and PR curves can be summarized in a single number, the area under the curve. This
number is between 0 and 1, the higher the better.

[Figure: ROC curves of a random predictor, a perfect predictor and a real predictor (dashed line). Axes: False Positive Rate (0 to 1, in steps of 1/6) and True Positive Rate (0 to 1, in steps of 1/4). Each point of the dashed curve is annotated with the corresponding threshold.]
Example (dashed line) of an ROC curve, built from the following data:
Predicted value   0.09  0.12  0.17  0.20  0.52  0.73  0.81  0.90  0.94  0.95
True label        -     -     +     -     -     -     +     +     -     +
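The measures above can be computed on the ten samples of this example. The following plain-Python sketch (function names are illustrative) derives the confusion matrix at a threshold of 1/2, the associated measures, and the points and area of the ROC curve. It assumes that no two samples share the same predicted value and that both classes are present.

```python
scores = [0.09, 0.12, 0.17, 0.20, 0.52, 0.73, 0.81, 0.90, 0.94, 0.95]
labels = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1]  # 1 for "+", 0 for "-"


def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN) when predicting positive above the threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        if score > threshold:
            if label:
                tp += 1
            else:
                fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    # Assumes at least one positive prediction and one sample of each class.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "TPR": recall,
        "FPR": fp / (fp + tn),
        "precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }


def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold past each score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above all scores: no positive prediction
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    # Trapezoidal rule over consecutive points of the curve.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


print(classification_metrics(*confusion_counts(scores, labels, 0.5)))
print(area_under_curve(roc_points(scores, labels)))
```

At a threshold of 1/2, three of the four positive samples are recovered, at the price of three false positives among the six negative samples.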

2.3.2 Virtual High-Throughput Screening

Virtual High-Throughput Screening (vHTS) is the cost-effective, in silico complement of experimental
high-throughput screening.
In spite of recent technological advances, which make high-throughput screening more and more
affordable, experimental screening is still extremely costly, as the protein must be produced in
sufficient quantities, and the compounds to be tested must be purchased or synthesized first. In
practice, even the best-equipped pharmaceutical companies rarely test more than two million
compounds, a fraction of the number of commercially available drug-like compounds.
Docking, which consists in modeling the van der Waals and electrostatic interactions between the
target protein and a small molecule, is one possible approach to the in silico screening of chemicals
(see Section 3). However, despite the development of effective approximations of intensive quantum
mechanical calculations, docking is still very time- and resource-consuming. Moreover, it requires
accurate knowledge of the three-dimensional structure of both the protein and the compounds to be
tested, which is a major drawback to its systematic application.

Unlike screening by docking or de novo design, which require the structure of the target protein to
be known, a vHTS algorithm is a ligand-based approach that uses data from a first exploratory HTS
experiment to predict the activity of new sets of compounds in silico. vHTS is used to facilitate the
selection of compounds for experimental screening in HTS bioassays and translates into additional
protein inhibitor, anti-cancer, and antibiotic leads, which would have otherwise been missed. Can
we leverage knowledge extracted from a previous exploratory screen to predict, among a library of
untested compounds, which ones are most likely to be active and should therefore be tested first?
Virtual screening can be formulated as a:
• Binary classification problem: classify compounds as active/inactive;
• Regression problem: predict the biological activity of compounds;
• Ranking problem: rank compounds by biological activity [Swa+09; Mar+17].

2.3.3 QSAR/QSPR

Property prediction problems in chemoinformatics are referred to as QSAR and QSPR.

• QSAR: Quantitative Structure-Activity Relationship: use the structure of the molecule to predict its
biological activity. This is often framed as classification: is the molecule biologically active or not?
• QSPR: Quantitative Structure-Property Relationship, i.e. regression: use the structure of the molecule
to predict a quantitative property of a molecule, such as boiling point, melting point, solubility.
Bibliography: [Aze+07; Dev].

Other uses of molecular fingerprints Molecular fingerprints, or vectorial representations of
chemicals, can be used for a variety of purposes other than learning. These include:
• Database indexing and search;
• Prediction of 3D structures of small compounds and proteins;
• Reaction prediction.

3 Molecular modeling and simulations


A basic introduction to drugs, drug targets, and molecular interactions:
http://www.youtube.com/watch?v=u49k72rUdyc

3.1 Structure-based design


Structure-based drug design is much more focused than combinatorial chemistry. Instead of exploring
the chemical space at random, one leverages the biophysical laws of molecular binding to refine the
ligand. It can be conducted in three fashions:
• Inspection: molecules known to bind the active site are modified so as to maximize complementary
interactions in the binding pocket.
• Virtual screening: databases of available compounds are docked in silico into the binding pocket,
and scored based on predicted interactions with the site.
• De novo generation: small molecular fragments (benzene rings, amino groups, etc.) are positioned
into the pocket, scored, and linked in silico.

Structure-based virtual screening aims at minimizing the binding energy between the target protein
and a ligand. The idea is to use the 3D structure of the target protein (and more importantly, of its
binding pockets, although all atoms play a role) and that of a list of potential ligands, and to find
those ligands that fit best into the binding pockets.
Ideally, structures are obtained experimentally through X-ray crystallography, but homology modeling
makes it possible to infer the 3D structure using the sequence of the protein and the known structure
of homologous proteins. Similarly, the binding pocket is ideally obtained using the 3D structure of
the complex formed by the protein with a known ligand (one then just needs to “expand” around that
known ligand to find the shape of the pocket). If there is no known ligand, or if the complex has
not been crystallized, it is possible to predict the binding pocket (see e.g. PocketFinder in the DOCK
suite).

X-ray crystallography proceeds through the following steps:

• Purification and crystallization of the protein to obtain a solid crystal of the protein of interest (and
this protein only);
• Exposure to an X-ray beam to get a diffraction pattern;
• Reconstruction of the electron density of the crystal from the diffraction pattern, based on the
interpretation of scattering as a Fourier transform.

Diffraction pattern of DNA observed by Rosalind Franklin in 1953.

The process of X-ray crystallography.

Ligand libraries: which small organic compounds to test to see whether they bind the target:
• NCI (∼ 275 000 compounds);
• ZINC ($35 \times 10^6$ commercially available compounds);
• Commercial libraries, e.g. Nanosyn (64 898 compounds), Enamine’s pharmacologically diverse set
(∼ 23 000 compounds);
• But also: possible to explore the virtual space of not-yet-available compounds!

The search spectrum Virtual screening requires evaluating the binding affinity between a molecule
and a protein. If we focus on molecular modeling (as opposed to statistical modeling, as done
with the machine learning vHTS approach above), the problem is made difficult by:
• the need to account for 6 degrees of freedom (3 degrees for coordinates + 3 degrees of rotation), at
every point in the system, to position the ligand with respect to the protein;
• the fact that every atom interacts with every other atom.

Virtual screening can be performed at three scales:
• The local scale: quantum mechanics molecular modeling
Martin Karplus, Michael Levitt and Arieh Warshel, Nobel laureates 2013.
• The intermediate scale: molecular mechanics, molecular dynamics, Brownian dynamics.
• The global scale: molecular docking.
We will now detail each of these scales.

3.2 Quantum mechanics molecular modeling (QM/MM)


Electrons are too small to be modeled by classical mechanics. Quantum mechanics is required to
accurately model their behavior, by treating the full distribution of electrons using the Schrödinger
equation.
• Computationally expensive ⇒ one of the first applications of general-purpose computing on GPUs
(Graphics Processing Units), years before deep learning!
• Difficult to deal with systems of more than a few hundred atoms.

Other applications of QM/MM include protein structure prediction; protein folding mechanics; DNA/RNA
simulations; lipid layer (cell membrane) simulations; and the refinement of structures obtained from
X-ray or NMR.

3.3 Molecular dynamics


• Molecules are modeled as balls connected by springs.
• Atoms are modeled as single particles, represented by balls of constant net charge, with the Van der
Waals radius as radius.
• Molecular mechanics rely on force fields, which are a collection of parameters for a potential energy
function $U$; atoms then move according to Newton's equation $-\nabla U(x) = m\ddot{x}$.
– Parameters can be derived either from fitting against experimental data, or from quantum mechanics
calculations.
– Typical energy functions include:
∗ Bond stretches: $\sum_{\text{bonds}} \frac{1}{2} k_r (r - r_0)^2$;
∗ Angle bending: $\sum_{\text{angles}} \frac{1}{2} k_\theta (\theta - \theta_0)^2$;
∗ Torsional rotation: $\sum_{\text{torsions}} \frac{1}{2} V_n (1 + \cos(n\phi - \delta))$;
∗ Improper torsion (sp2, artificial term to force planarity): $\sum_{\text{sp2 torsions}} V_{\text{improper}}$;
∗ Electrostatic interactions are usually modeled with Coulomb potentials: $\sum_{\text{atoms } i,j} \frac{q_i q_j}{r_{ij}}$;
∗ Van der Waals interactions are usually modeled as Lennard-Jones interactions: $\sum_{\text{atoms } i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)$.
As the attractive term decreases as $r^{-6}$, where $r$ is the distance between the two atoms
considered, this is a short-range interaction that does not need to be computed for all pairs of atoms,
unlike the Coulomb potential.
• A short video introduction to molecular dynamics:
http://www.youtube.com/watch?v=lLFEqKl3sm4
• Simulation of a drug entering the binding pocket of its target: http://youtu.be/ckTqh50r 2w
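The non-bonded terms of the energy function listed above can be sketched as follows. This is a deliberately simplified illustration: the function name, the coordinates and the parameter values are hypothetical, units are arbitrary, and a single pair of Lennard-Jones parameters A and B is shared by all atom pairs, whereas a real force field uses pair-specific values. The cutoff illustrates why the Lennard-Jones term, unlike the Coulomb term, need not be evaluated for distant pairs.

```python
import math


def nonbonded_energy(coords, charges, A, B, cutoff):
    """Return (Coulomb energy, Lennard-Jones energy) of a set of atoms.

    coords: list of (x, y, z) positions; charges: list of net charges.
    A, B: Lennard-Jones parameters shared by all pairs (simplification).
    cutoff: distance beyond which the Lennard-Jones term is neglected.
    """
    coulomb = 0.0
    lennard_jones = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            # Long-range term: computed for every pair of atoms.
            coulomb += charges[i] * charges[j] / r
            # Short-range term: skipped beyond the cutoff.
            if r <= cutoff:
                lennard_jones += A / r ** 12 - B / r ** 6
    return coulomb, lennard_jones


# Two hypothetical atoms of opposite charge, placed at the distance that
# minimizes the Lennard-Jones term when A = B = 1.
atoms = [(0.0, 0.0, 0.0), (2 ** (1 / 6), 0.0, 0.0)]
print(nonbonded_energy(atoms, [1.0, -1.0], A=1.0, B=1.0, cutoff=10.0))
```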

Other applications of molecular dynamics include lead optimization, to correct deficiencies in
the structure of a lead compound, while maintaining (or improving) its favorable ADME-Tox properties.

4 Molecular Docking
In docking, the space is discretized with a grid, and computations are only done on points on this
grid. The idea is to maximize the shape complementarity of the ligand and the protein pocket, while
minimizing the binding energy, which is approximated with a scoring function, which can be:
• Force-field based (e.g. DOCK, AutoDock). The biggest bottleneck here is the modeling of the
solvent, as accurate models such as those used in QM/MM are too time consuming to compute.
Several options are possible:
– Include a distance-based factor in the Coulomb term;
– Poisson-Boltzmann/surface area (PB/SA);
– Generalized-Born/surface area (GB/SA) model.
• Empirical (e.g. FlexX, SCORE). Binding energies are approximated by a weighted sum of terms,
which are empirically fitted on a set of known protein-ligand complexes. Those are much faster to
compute, but less accurate.
• Knowledge-based (e.g. DrugScore, ITScore): Pairwise atom-atom potentials are obtained based on
the pair frequency and the inverse Boltzmann relation. These methods are a good intermediate (in
terms of speed and accuracy) between the force-field and the empirical approaches.
For more details, see e.g. [HGZ10].
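To give an idea of how a knowledge-based potential is obtained, the inverse Boltzmann relation is commonly written $w(r) = -k_B T \ln\left( g_{\text{obs}}(r) / g_{\text{ref}}(r) \right)$, where $g_{\text{obs}}$ is the observed frequency of a pair of atom types at distance $r$ in known complexes and $g_{\text{ref}}$ its frequency in a reference state. The sketch below is a hedged illustration, not the procedure of any specific scoring function: the function name, the histogram counts and the uniform reference state are made up, and $k_B T$ is set to 1.

```python
import math


def inverse_boltzmann(observed_counts, reference_counts, kT=1.0):
    """Pairwise potential per distance bin for one pair of atom types.

    observed_counts: number of times the pair is seen in each distance bin.
    reference_counts: counts expected in the reference state (assumption).
    Returns None for bins where the potential is undefined (empty bins).
    """
    observed_total = sum(observed_counts)
    reference_total = sum(reference_counts)
    potential = []
    for observed, reference in zip(observed_counts, reference_counts):
        if observed == 0 or reference == 0:
            potential.append(None)
            continue
        ratio = (observed / observed_total) / (reference / reference_total)
        # Pairs seen more often than expected get a favorable (negative) energy.
        potential.append(-kT * math.log(ratio))
    return potential


# Hypothetical counts over four distance bins, with a uniform reference state.
print(inverse_boltzmann([10, 40, 25, 25], [25, 25, 25, 25]))
```

In practice, the choice of the reference state is the delicate part of such methods.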

Correct vs. incorrect docking

Left: the ligand (in blue, licorice style) fits nicely within the pocket of the protein (in red, solvent
excluded surface style). Right: The ligand binds to the protein but the pose is wrong.
Source: Bernhard Knapp – Medical University of Vienna
http://www.meduniwien.ac.at/imc/einf i d med inf/2011 MD.pdf

Other applications of docking


• Inverse Virtual Screening (IVS):
1 drug, many potential receptors
The goal of Inverse Virtual Screening (IVS) is to identify other proteins that can be targeted by a
given drug. The aim can be both to discover potential secondary effects or to discover new potential
indications of an already approved drug. This second aim is referred to as drug repositioning.
• protein-protein docking: Many proteins bind to each other, forming protein complexes. Understand-
ing the spatial configuration they adopt, whether mutations prevent them from binding, and in this
case, how they could be made to bind again, are a few of the questions addressed by protein-protein
docking.
A list of molecular modeling and simulation software is given in Appendix H.

5 Open questions and current challenges
In spite of the great progress made in the domain of drug design in the last 35 years, a number of
additional questions remain largely unsolved.

Mechanisms of action Many mechanisms of action are still unknown. Identifying the target(s) of
a molecular compound is the first step towards proposing a mechanism of action, which is usually
required to obtain an authorization to enter clinical trials. This is a major center of interest regarding
traditional plant medicines.

Synthesis mechanisms Many synthesis mechanisms are still unknown. Drugs then have to be har-
vested from nature, which can be a complex and costly process.

Adverse drug reaction prediction Undesirable side effects are a major cause of failure of clinical
trials and the main reason for withdrawal from market. They can be due to:
• Poor estimation of ADME-Tox properties
• Stereochemistry: Molecules often exist in multiple configurations. Movable bonds, such as rotatable
bonds, give rise to conformers; stereocenters give rise to symmetries and isomers. Enantiomers are
chiral molecules (i.e. molecules that cannot be superimposed with their mirror image; human hands
are an example of chiral objects) that are mirror images of each other. Stereoisomers are molecules
that have the same molecular graph, but differ in the three-dimensional position of their atoms.
It is not always possible to synthesize only one of multiple conformers. Stereoisomers may have
different ADME-Tox properties and biological activities.
• Drug interactions:
– Drug-drug interactions, e.g. antacids prevent the absorption of many medicines; aspirin should
not be combined with blood thinners.
– Drug-food interactions, e.g. Grapefruit interacts with the absorption of a number of medications;
green-leafy vegetables, which contain vitamin K, may decrease the effectiveness of blood thin-
ners; black liquorice may decrease the activity of high blood pressure medications.
– Drug-condition interactions.
• Multiple targets (or drug promiscuity): a drug may bind not only to its primary, intended target, but
also to secondary targets, resulting in unwanted effects. Drug promiscuity can also lead to higher
efficacy, as with the antipsychotic drug clozapine, which can be more effective than expected due to
hitting multiple targets related to the disease.
Also note that how tolerable adverse effects are depends on the disease being treated: while nausea
is a tolerable side effect of an anti-cancer medication, it is much less tolerable for a medication that
must be taken daily, for instance for high blood pressure, or for a less severe disease, such as a head
cold.

Drug repositioning Drug promiscuity can also lead to finding novel indications for existing drugs, as
they may hit targets for other diseases.
• Pharmacokinetic properties already known;
• Potentially faster development;
• Intellectual property & regulatory issues.
The importance of the question is illustrated by a recent NIH program devoted to “discovering new
therapeutic uses for existing molecules”: http://www.ncats.nih.gov/ntu

Drug specificity On the other hand, can we design drugs that bind specifically to the intended target
and to no other?
Example: Drugs targeting kinases, often used in cancer treatment, are particularly prone to suffering
from a lack of specificity.
“A typical protein kinase must recognize between one and a few hundred bona fide phosphorylation
sites in a background of about 700,000 potentially phosphorylatable residues. Multiple mechanisms
have evolved that contribute to this exquisite specificity, including the structure of the catalytic site,
local and distal interactions between the kinase and substrate, the formation of complexes with scaf-
folding and adaptor proteins that spatially regulate the kinase, systems-level competition between
substrates, and error-correction mechanisms. The responsibility for the recognition of substrates by
protein kinases appears to be distributed among a large number of independent, imperfect specificity
mechanisms.” [UF07]

Precision medicine
• Drug response: Will this patient respond or not to the treatment?
• Can one find a function f that takes information (clinical, genetic, otherwise) about the patient and
returns appropriate drug(s)?
• Can one find which biomarkers are predictive of the patient’s response?

References and further reading


[Aze+07] Chloé-Agathe Azencott et al. “One-to four-dimensional kernels for virtual screening and
the prediction of physical, chemical, and biological properties”. In: J chem inf model 47.3
(2007), pp. 965–974.
[Bal+07] Pierre Baldi et al. “Lossless compression of chemical fingerprints using integer entropy
codes improves storage and retrieval”. In: J chem inf model 47.6 (2007), pp. 2098–2109.
[Bar] “Clustering of chemical structures on the basis of two-dimensional similarity measures”.
In: Journal of Chemical Information and Computer Science 32.6 (1992), pp. 644–649.
[But99] Darko Butina. “Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and
Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets”. In:
J. Chem. Inf. Comput. Sci. 39.4 (1999), pp. 747–750.
[CKK17] Mehdi Cherti, Balazs Kegl, and Akin Kazakci. “De novo drug design with deep generative
models : an empirical study”. In: ICLR. 2017.
[Col+17] Connor W. Coley et al. “Convolutional embedding of attributed molecular graphs for phys-
ical property prediction”. In: J. Chem. Inf. Model. 57.8 (2017), pp. 1757–1772.
[Dev] “5-Year trends in QSAR and its machine learning methods”. In: Current Computer Aided Drug
Design 12.4 (), pp. 265–271.
[DGH16] Joseph A DiMasi, Henry G Grabowski, and Ronald W Hansen. “Innovation in the pharma-
ceutical industry: new estimates of R&D costs”. In: Journal of health economics 47 (2016),
pp. 20–33.
[Dre00] J. Drews. “Drug discovery: a historical perspective”. In: Science 287.5460 (2000), pp. 1960–
1964.
[Duv+15] David K Duvenaud et al. “Convolutional networks on graphs for learning molecular fin-
gerprints”. In: Advances in Neural Information Processing Systems 28. 2015, pp. 2224–2232.
[Gle+06] Robert C. Glem et al. “Circular fingerprints: flexible molecular descriptors with applications
from physical chemistry to ADME”. In: IDrugs 9.3 (2006), pp. 199–204.
[Hel+04] Christoph Helma et al. “Data mining and machine learning techniques for the identi-
fication of mutagenicity inducing substructures and structure activity relationships of
noncongeneric compounds”. In: Journal of chemical information and computer sciences 44.4
(2004), pp. 1402–1411.

[HGZ10] Sheng-You Huang, Sam Z. Grinter, and Xiaoqin Zou. “Scoring functions and their evaluation
methods for protein–ligand docking: recent advances and future directions”. In: Physical
Chemistry Chemical Physics 12.40 (2010), p. 12899.
[HYL17] William L. Hamilton, Rex Ying, and Jure Leskovec. “Representation learning on graphs:
methods and applications”. In: arxiv:1709.05584 [cs] (2017). arXiv: 1709.05584.
[JM90] Mark A Johnson and Gerald M Maggiora. “Concepts and applications of molecular similar-
ity”. In: J. Comput. Chem. 13.4 (1990). Ed. by John Wiley & Sons, pp. 539–540.
[JV08] Laurent Jacob and Jean-Philippe Vert. “Protein-ligand interaction prediction: an improved
chemogenomics approach”. In: Bioinformatics 24.19 (2008), pp. 2149–2156.
[Kad+17] Artur Kadurin et al. “druGAN: an advanced generative adversarial autoencoder model for
de novo generation of new molecules with desired molecular properties in silico”. In: Mol.
Pharmaceutics 14.9 (2017), pp. 3098–3104.
[Kel+06] Esther Kellenberger et al. “sc-PDB: An annotated database of druggable binding sites from
the protein data bank”. In: J. Chem. Inf. Model. 46.2 (2006), pp. 717–727.
[Les+04] Christina S. Leslie et al. “Mismatch string kernels for discriminative protein classification”.
In: Bioinformatics 20.4 (2004), pp. 467–476.
[LPB13] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. “Deep architectures and deep learn-
ing in chemoinformatics: the prediction of aqueous solubility for drug-like molecules”. In:
J. Chem. Inf. Model. 53.7 (2013), pp. 1563–1575.
[LPC98] J. Lazarou, B. H. Pomeranz, and P. N. Corey. “Incidence of adverse drug reactions in hospital-
ized patients: a meta-analysis of prospective studies”. In: JAMA 279.15 (1998), pp. 1200–
1205.
[LW71] Michael Levandowsky and David Winter. “Distance between Sets”. In: Nature 234.5323
(1971), pp. 34–35.
[Mar+17] E. J. Martin et al. “Profile-QSAR 2.0: Kinase virtual screening accuracy comparable to four-
concentration IC50s for realistically novel compounds”. In: Journal of Chemical Information
and Modeling 49.4 (2017), pp. 756–766.
[MC+11] Giarratana M.-C. et al. “Proof of principle for transfusion of in vitro-generated red blood
cells”. In: Blood 118.19 (2011), pp. 5071–5079.
[Mig+12] A. Miguel et al. “Frequency of adverse drug reactions in hospitalized patients: a systematic
review and meta-analysis.” In: Pharmacoepidemiol Drug Saf 21.11 (2012), pp. 1139–1154.
[Mor65] H. L. Morgan. “The Generation of a Unique Machine Description for Chemical Structures-A
Technique Developed at Chemical Abstracts Service.” In: J. Chem. Doc. 5.2 (1965), pp. 107–
113.
[O’B12] Noel M. O’Boyle. “Towards a Universal SMILES representation - A standard method to gen-
erate canonical SMILES based on the InChI”. In: Journal of Cheminformatics 4.1 (2012), p. 22.
[Pee] D. Peer and J. Lieberman. “Special delivery: targeted therapy with small RNAs”. In: Gene Therapy
18 (2011), pp. 1127–1133.
[RBW03] John W Raymond, C. John Blankley, and Peter Willett. “Comparison of chemical clustering
methods using graph- and fingerprint-based similarity measures”. In: Journal of Molecular
Graphics and Modelling 21.5 (2003), pp. 421–433.
[Sai+04] Hiroto Saigo et al. “Protein homology detection using string alignment kernels”. In: Bioin-
formatics 20.11 (2004), pp. 1682–1689.
[Sch15] Nicholas J. Schork. “Personalized medicine: Time for one-person trials”. In: Nature News
520.7549 (2015), p. 609.
[Swa+09] S. Joshua Swamidass et al. “Influence Relevance Voting: An Accurate And Interpretable Vir-
tual High Throughput Screening Method”. In: Journal of Chemical Information and Modeling
49.4 (2009), pp. 756–766.
[UF07] Jeffrey A. Ubersax and James E. Ferrell. “Mechanisms of specificity in protein phosphoryla-
tion”. In: Nat. Rev. Mol. Cell Biol. 8.7 (2007), pp. 530–541.

[WWW89] David Weininger, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for
generation of unique SMILES notation”. In: J. Chem. Inf. Comput. Sci. 29.2 (1989), pp. 97–
101.

A History of Therapeutic Research


A.1 Serendipity: natural products discovered by chance
Until very recently, drugs were natural products whose benefits were discovered by chance
(hence the term serendipity). It was only at the beginning of the 20th century that we started
understanding their mode of action and studying how to synthesize them industrially rather than harvest
them (generally from plants). Nowadays, drug design has been rationalized and involves many fields,
such as medicine, biochemistry, organic and synthetic chemistry, pharmacology, microbiology,
physiology, and toxicology, but also computerized chemical modeling, physics, computer science, and robotics.
Here are some examples of natural products that have been used in traditional medicine for a long
time.
• Medicinal fungi include antibiotic agents from the Penicillium family, but also anti-cancer drugs
(including mitotic inhibitors), statins (widely used to lower blood cholesterol levels),
immunosuppressants derived for instance from Tolypocladium inflatum, and other fungi used to treat malaria
or diabetes. Lingzhi mushroom (Ganoderma lucidum) has been used for several thousand years in
traditional Chinese medicine for its various (supposed or proven) health benefits.
• Botanical medicine
– Quinine naturally occurs in the bark of the cinchona tree, whose medicinal use was first discovered
by the Quechua people (indigenous to Peru and Bolivia). It has anti-inflammatory, anti-malarial,
muscle-relaxant and fever-reducing properties.
– Digitalis plants, or foxgloves, contain digitoxin, a substance that can be used for treating various
heart conditions. It has severe side effects and has indeed been used as a poison by a number of
authors of detective fiction.
– Cocaine is obtained from the leaves of the coca plant. Its anesthetic and nervous system stimulant
effects are well known.
– Aspirin is a well-known analgesic, fever reducer, and anti-inflammatory substance. Its natural
precursor, salicin, is found in the bark of the willow tree, whose medicinal properties were first
reported scientifically in 1763.

The Nobel Prize in Physiology or Medicine 2015


• William C. Campbell and Satoshi Ōmura “for their discoveries concerning a novel therapy against
infections caused by roundworm parasites.” (Avermectin)
• Youyou Tu “for her discoveries concerning a novel therapy against Malaria.” (Artemisinin, an extract
from Artemisia annua) Although the plant was found to be effective against malaria in a large-scale
screen of herbal remedies, its use gave inconsistent results. Revisiting the ancient
literature, Tu discovered clues that guided her in properly extracting the active component from
this plant, and showed its effectiveness against malaria both in animals and humans.

Serendipity
Serendipity: Chance discoveries that have been exploited with sagacity
• In the laboratory
– Acetaminophen (paracetamol) is a derivative of acetanilide. The analgesic effects of acetanilide
were discovered when an inexperienced pharmacist mistakenly gave it to doctors in place of the
naphthalene they were investigating for treating a patient suffering from intestinal parasites.

– Cisplatin had long been known as Peyrone’s salt when a team led by Barnett Rosenberg, which was
investigating the effect of an electric current on the growth of E. coli cells, discovered that the
impressive cell elongation they were observing was not due to the electric current but to the
cisplatin produced by the reaction of the platinum electrodes with the nutrient solution in which
the bacteria were growing. It is now used as a cytotoxic agent in the treatment of cancer.
– Heparin is an anticoagulant that was discovered by a scientist searching for procoagulants in dog
liver.
– Penicillin, the first antibiotic to successfully treat bacterial infections, was discovered when
Alexander Fleming realized that a petri dish containing a culture of Staphylococcus bacteria had been
contaminated by a mold that was now killing them.
• During clinical trials
– Dimenhydrinate (Dramamine) was developed as an antihistamine but is now used against travel
sickness, thanks to a chance observation of participants in the clinical trials.
– Sildenafil (Viagra) was initially developed as a heart medicine for specific use against angina
pectoris. During clinical trials it proved to have little effect against angina, but provoked many
penile erections.

A.2 1670 – 1900s: Diseases caused by external agents


1670: Identification of bacteria
Antonie van Leeuwenhoek (1632-1723) was a Dutch scientist (initially a tradesman) from Delft. He
developed a great skill at grinding lenses and built a large number of microscopes (or rather powerful
magnifying lenses); he was the first to observe Charophyte algae, and ciliates living in water. Later
on, he examined the plaque from between his own teeth and those of (most likely) members of his
family, leading to the first observation of bacteria ever recorded. He also discovered blood and sperm
cells.

1850–1900: First links to diseases


• Casimir Davaine (1812-1882) discovered the bacterium now known as Bacillus anthracis, which causes
anthrax, in the blood of dying sheep. He established the link with the disease and observed that
it could be transmitted between animals. [Recherches sur les infusoires du sang dans la maladie
connue sous le nom de sang de rate par M. C. Davaine. Note présentée par M. Cl. Bernard. Comptes
rendus de l’Académie des sciences, 57 (1863).]
• Louis Pasteur (1822-1895) studied fermentation and demonstrated that the growth of bacteria in
nutrient broth was due to biogenesis, invalidating the theory of “spontaneous generation”. While not the
first to propose germ theory, he is the one who developed it and conducted experiments that
convinced his contemporaries that it was right. He also showed that micro-organisms are responsible
for causing beverages to spoil, leading (in collaboration with Claude Bernard) to the development
of pasteurization. From this he developed the idea that micro-organisms infecting animals and
humans lead to disease. Later on, this led to his work on immunization and vaccination. [La
théorie des germes et ses applications à la médecine et à la chirurgie. Lecture faite à l’Académie de
Médecine par M. Pasteur en son nom et au nom de MM. Joubert et Chamberland. Le 30 avril 1878.]
• Joseph Lister (1827-1912), building on this work, was the first to propose antiseptic principles in
the practice of surgery.
• Robert Koch (1843-1910, Nobel 1905), now considered the father of modern bacteriology, identified
the causative agents of tuberculosis (Mycobacterium tuberculosis, also known as Koch’s bacillus) and
cholera, and received the Nobel Prize in Physiology or Medicine in 1905.
• 1890: Koch’s postulates to determine whether a bacterium is the cause of a given disease:
– The bacteria must be present in every case of the disease;

– The bacteria must be isolated and grown in culture;
– The disease must be reproduced when a healthy host is inoculated with the bacteria;
– The bacteria must be found again in the diseased host.
In spite of a number of limitations (bacteria that are hard to grow in culture; immunocompromised
patients; differences between patients), Koch’s postulates are still useful guidelines nowadays.

A.3 Early 20th century – rationalization of chemotherapy


The principle of chemotherapy
• Paul Ehrlich (1854-1915, Nobel 1908) received the Nobel Prize in Physiology or Medicine for his
contributions to the field of immunology. His discovery of arsphenamine, the first effective treatment
against syphilis, marked the beginning of chemotherapy. Here chemotherapy is to be understood as the
use of a chemical substance to treat a disease; nowadays the term is mostly restricted to the treatment
of cancer. Arsphenamine was discovered by synthesizing and screening many arsenical compounds.
• The principle of chemotherapy is as follows: Some chemicals can, at concentrations tolerable by
the host, directly interfere with the growth of the micro-organisms causing a disease.
• Hence the idea of a magic bullet: a therapeutic agent that kills only the targeted organism.

1930: The era of antibiotics


• 1928 Fleming (Nobel 1945): Penicillin inhibits bacterial proliferation. ⇒ Penicillin-like antibiotics.
• 1932 Domagk (Nobel 1939): Prontosil (a red dye) is effective against streptococcal infections. ⇒
Sulfonamides.
• Efforts in synthesizing or isolating compounds similar to penicillin
– Aminoglycosides (e.g. 1943 Schatz & Waksman (Waksman: Nobel 1952): streptomycin, the first treatment
against tuberculosis)
– Macrolides (e.g. 1949 Aguilar: erythromycin). There is also a Nobel Prize associated with that
discovery, but it is that of Woodward, whose team successfully synthesized erythromycin. He had
received the prize much earlier for his work in organic synthesis, and was still working on the
synthesis of erythromycin when he died of a heart attack.

1950s: The rise of genetics


• 1957 Crick formulates the central dogma of molecular biology, leading to
– Cell- and molecular biology-based understanding of diseases;
– Rational hypotheses of disease-relevant mechanisms.
• 1980s: identification of single gene diseases (e.g. sickle cell anemia, cystic fibrosis).
Kristen Knickerbocker, MSU

Late 1980s: The AIDS epidemic
Considerable resources led to considerable development
• in immunology;
• in combinatorial chemistry to develop new drug candidates (see Appendix C);
• in the automation of drug discovery experiments with robotics.

Summary of the history of antibiotics

Source: compoundchem.com.

B Modes of ligand-target binding


• An agonist binds to a receptor and activates it. It often replaces or complements a naturally
occurring substance.
E.g.: Morphine binds to opioid receptors in the central nervous system; nicotine mimics the action
of acetylcholine (a neurotransmitter) at nicotinic acetylcholine receptors, stabilizing gated ion
channels in their open state; phenylephrine, used as a decongestant, stimulates α1-adrenergic receptors,
which ultimately induces smooth muscle and blood vessel constriction; isoproterenol, a drug used
in the treatment of bradycardia, binds to beta-adrenoreceptors and has a structure similar to that
of adrenaline.
• An antagonist also binds to a receptor, but lacks the key feature needed to activate it; instead, by
preventing the natural activator of that receptor from binding, it inactivates it.
E.g.: Beta-blockers interfere with the binding of epinephrine and other stress hormones to their
receptors (their action is essentially opposite to that of phenylephrine); neuromuscular blocking
drugs, used to relax the muscles during surgery, are analogs of curare that interfere with the binding
of acetylcholine or its analogs to nicotinic cholinergic receptors.
• An allosteric modulator also binds to a receptor, but at a different site. It can either increase or
decrease the response to the natural activator.
GPCRs: G Protein-Coupled Receptors are proteins which play key roles in cell signaling. They are
the main targets of allosteric modulators. Loratadine (Claritin) is an anti-histamine that relieves
allergies by blocking the histamine receptor. A number of antidepressant medications (Prozac,
Zoloft) affect the serotonin receptor. The adrenergic receptors affected by beta-blockers also
belong to the GPCR family.

Epping-Jordan et al. Allosteric modulation: A novel approach to drug discovery.

One also encounters the terms “inhibitors” and “activators”, usually referring to a target that is an
enzyme. Enzymes are proteins that catalyze chemical reactions and thereby regulate their rate.
• An inhibitor decreases the activity of an enzyme.
E.g. Aspirin inhibits the activity of the cyclooxygenases COX-1 and COX-2, which are responsible
for the formation of prostaglandins, themselves involved in inflammatory response; HIV protease
inhibitors.
• An inducer or activator increases the activity of an enzyme.
E.g. Barbiturates and benzodiazepines. Both barbiturates and benzodiazepines activate GABA-A
receptors. GABA is an essential neurotransmitter in the central nervous system (CNS) of
mammals. The binding sites of barbiturates, GABA and benzodiazepines are distinct. This explains
in part why barbiturates are more dangerous in overdose.

C Combinatorial chemistry and the AIDS epidemic


It was quickly discovered that the causative agent of AIDS was the HIV-1 virus. Researchers
extensively studied the life cycle of this virus, providing a list of potential targets for pharmaceutical
intervention. Among those was the HIV-1 protease, an enzyme that is essential to the processing and
maturation of HIV proteins: its role is to cleave nascent proteins for assembly into the final proteins.
Without this protease, the virus cannot generate HIV virions and therefore cannot replicate, meaning
that it stops infecting the body. Moreover, this protein is specific to HIV, meaning that there was hope
to affect it without having a strong effect on human proteins. Soon the crystal structure of HIV-1
protease was elucidated. In addition, similar (although distinct) proteases can be found in humans. A
particular example is renin, an enzyme secreted by the kidneys which is involved in the
regulation of blood pressure. Several molecules were known to inhibit the function of renin. Computers
were a tool that could be used to leverage those known inhibitors of renin to rapidly generate novel
compounds. Computers also made it possible to evaluate the ability of those novel compounds to
bind to HIV-1 protease by computing binding affinities based on the physical laws of chemistry.
Unfortunately, limitations in computing power meant that a number of approximations had to be used,
resulting in poor estimates of actual ligand-receptor binding energies. To this day, this remains a
limitation of de novo drug design. Indinavir, commercialized as Crixivan, is a protease inhibitor that was
approved by the FDA in 1996.

D History of organic synthesis


The importance and difficulty of the task is probably best illustrated by the number of Nobel Prizes in
Chemistry awarded to scientists for their contribution to the development of organic synthesis: Emil
Fischer in 1902 for his work on sugar and purine syntheses, Fritz Haber in 1918 for the synthesis of
ammonia from its elements, Hans Fischer in 1930 for the synthesis of hemin, Diels and Alder in 1950
for the development of diene synthesis, du Vigneaud in 1955 for the first synthesis of a polypeptide
hormone, Woodward in 1965 for his outstanding achievements in the art of organic synthesis, Brown
and Wittig in 1979 for the development of the use of boron- and phosphorus-containing compounds
as important reagents, Corey in 1990 for laying out the principles of retrosynthesis, and Chauvin,
Grubbs and Schrock in 2005 for the development of the metathesis method in organic synthesis.
Organic synthesis was born in 1828, when Wöhler discovered how to prepare urea, a naturally occur-
ring organic compound, from ammonium cyanate. This first synthesis was followed by many milestone
syntheses, from acetic acid to glucose and quinine. Until the late 1950s, most syntheses were planned
by identifying compounds structurally similar to the desired molecule and selecting commercially
available starting materials accordingly. The understanding of the electronic nature of chemical bond-
ing and of the importance of molecular conformation, together with the work of Woodward, opened
the door to more complex syntheses. In 1957, Corey started developing the principles of retrosynthe-
sis, in which an organic compound is split into simpler precursor structures without any assumption
with regards to the starting materials. Simultaneous advances in chromatographic and spectroscopic
techniques, by facilitating the analysis of reaction mixtures and the purification and characterization
of organic compounds, furthered the blooming of the field, which is now vigorous and influential.
Nowadays, strong theoretical bases, such as conformational analysis, Woodward and Hoffmann rules,
and retrosynthetic analysis, combined with a thorough knowledge of chemical reagents, reactions
and conditions probably give synthetic chemists the potential to make any structure, given the man-
power and time. However, the process remains intricate, time consuming, and uncertain; facilitating
fast, efficient, economical organic syntheses with low environmental impact is still very much a core
concern for organic chemists.

E Assays
Pre-clinical assays This phase often involves in vivo experiments on animals. When a lead molecule
with good drug-likeness is discovered, the pre-clinical data collected during lead optimization, aimed
at demonstrating safety in animals, are used to file an investigational new drug application with the
FDA (Food and Drug Administration) in the USA, or the equivalent with the European Medicines Agency
(EMA, formerly the European Agency for the Evaluation of Medicinal Products, EMEA) in Europe, and
obtain permission to enter Phase I clinical trials. Pre-clinical assays include:
• in vitro assays
• in vivo assays on animals to evaluate
– toxicity, mutagenicity, carcinogenicity
– pharmacokinetics
– efficacy

The animals being used are generally rodents (mice, rats). Pigs are also favored for their biologi-
cal closeness to humans. Once a drug’s interest has been demonstrated, trials will sometimes be
conducted on primates.
• Animal research regulations: The types of experiments that can be conducted on animals are
carefully regulated. Policies are put in place to ensure that animals receive a certain standard of care
and treatment and aren’t subjected to unnecessary pain. Ethical boards and review committees are
often mandatory to control the use of animals for scientific experimentation.
– U.S.A: Animal Welfare Act.
– France:
∗ http://www.enseignementsup-recherche.gouv.fr/pid29417/utilisation-des-animaux-a-des-fins-scientifiques.html
∗ https://www.inserm.fr/recherche-inserm/ethique/utilisation-animaux-fins-recherche
– Basel declaration: the 3R principle (Replace, Reduce, Refine) http://www.basel-declaration.org/
→ pre-approval for permission to enter Phase I clinical trial.

Clinical assays
• Phase I: Tests on a small number of healthy volunteers to determine
– maximal tolerated dose: the highest dose that does not produce unacceptable toxicity;
– pharmacokinetics;
– adverse effects.
Phase I trials in oncology are already conducted on patients.
• Phase II: Tests on a small number (a few dozen) of patients to determine
– most appropriate dosage;
– efficacy;
– tolerance.
• Phase III: Double-blind, placebo-controlled trials to confirm tolerance and efficacy. These larger
trials usually involve hundreds if not thousands of patients.
See https://www.inserm.fr/recherche-inserm/recherche-clinique/etre-volontaire-essai-clinique
Regulatory agencies can always require additional trials.
Pharmacovigilance, or drug safety, is concerned with the collection, detection, assessment, monitoring,
and prevention of adverse drug reactions. Phase IV can be considered as the first real-world test of the
drug. Indeed, the true safety profile of a drug can only be characterized by ongoing safety surveillance,
through an adverse event monitoring system and a continuing post-marketing surveillance study.
Phase IV never ends as long as the drug is being sold. Drug safety is constantly monitored
through:
• Phase IV trials, or postmarketing surveillance, which take place after the drug has been approved
for sale. They can be required by regulatory agencies for further monitoring of some adverse effects,
or undertaken by the pharmaceutical company to evaluate for instance drug interactions, effects in a
given subpopulation, or the potential of the drug for uses other than the one it has been approved for.
• Non-interventional studies, which keep evaluating tolerance and efficacy in a large-scale, real-world
setting.
• Adverse event reporting, which is conducted by healthcare professionals and patients, and addressed
to both pharmaceutical companies and regulatory agencies.

WHO Program for International Drug Monitoring

• 106 full members + 33 associate members.
• Goals:
– Enhance patient care and safety;
– Provide reliable information for the effective assessment of the risk-benefit profile of medicine.
• Centralization: the Uppsala Monitoring Centre http://www.who-umc.org/

Regulatory agencies
• In the U.S., the FDA (Food and Drug Administration) is in charge of protecting public health; its
remit also includes food safety, medical devices, and tobacco products.
• In Europe, the EMA (European Medicines Agency) harmonizes the work of national regulatory agen-
cies.
• In France, the ANSM (Agence Nationale de Sécurité du Médicament et des produits de santé, Na-
tional Agency for Medication and Health Products Safety).

In France: the Agence Nationale de Sécurité du Médicament


• In France, marketing authorizations (autorisations de mise sur le marché, AMM) are granted by the
ANSM (Agence Nationale de Sécurité du Médicament et des produits de santé):
https://www.ansm.sante.fr/Activites/Autorisations-de-Mise-sur-le-Marche-AMM/L-AMM-et-le-parcours-du-medicament
• 31 CRPV (centres régionaux de pharmacovigilance, regional pharmacovigilance centers)
• The public drug database: http://base-donnees-publique.medicaments.gouv.fr/
For each drug with a marketing authorization, it lists:
– Name;
– Composition;
– Pharmaceutical form;
– Clinical data: therapeutic indications, dosage, contraindications, warnings, interactions,
pregnancy, breastfeeding, driving, adverse effects, overdose.
– Pharmacology: pharmacodynamics, pharmacokinetics, preclinical safety.
– Pharmaceutical data: excipients, incompatibilities, shelf life, storage, packaging,
use, handling.
– Marketing authorization: holder, number, date of authorization / of renewal.
• The drug supply chain (le circuit du médicament):
https://solidarites-sante.gouv.fr/soins-et-maladies/medicaments/le-circuit-du-medicament/
• Reporting an adverse effect: healthcare professionals are required to report any adverse effect
they observe, but you can also file a report yourself:
http://ansm.sante.fr/Declarer-un-effet-indesirable/

F Beyond the pipeline


Not all drugs are small organic chemical compounds that work along the “key-lock” principle. Among
other types of therapies, you can find:
• Monoclonal antibodies are antibodies made by identical immune cells that are all clones of the
same parent cell (hence “monoclonal”). They all bind to the same antigenic determinant. In cancer
treatment, they attach to the cancer cells in a way that mimics how antibodies attach to invaders.
They can work in multiple ways:
– Make the cancer cells more visible to the immune system. E.g. Rituximab attaches to CD20, a
protein only found on B cells. The immune system is then more prone to identify B cells as a
target. This treatment is used against B-cell lymphomas; although it also lowers the number
of healthy B cells, the body produces new healthy B cells to replace them.
– Block growth factor receptors. This prevents growth factors (chemicals that attach to these
receptors to signal the cells to grow) from getting through. E.g. Cetuximab blocks the epidermal growth
factor receptor and is used to slow down or block the progression of colon, head and neck cancers.
– Block angiogenesis. Similarly to how monoclonal antibodies can block growth factors, they can
block the signals used by cancerous cells to attract blood vessels that will bring them the oxygen
and nutrients they need to grow. E.g. Bevacizumab intercepts the vascular endothelial growth
factor (VEGF).
– Deliver radiation or chemotherapy directly to cancer cells. E.g. Ibritumomab, a treatment used
against non-Hodgkin’s lymphoma, combines a monoclonal antibody with radioactive particles; the
antibody attaches to cancerous blood cells, which the particles then irradiate. In HER2+ breast cancer,
ado-trastuzumab emtansine uses the antibody trastuzumab to deliver a chemotherapy drug specifically
to cancer cells carrying HER2 receptors.
• Gene therapy
– Replace a mutated gene with a healthy copy.
– Introduce a new gene.
– Inactivate a mutated gene: small RNA therapy to knock down the expression of disease-causing
genes. [Pee] https://pharmaphorum.com/r-d/views-analysis-r-d/the-promising-future-of-rna/
– Delivery: viral (integration of the genetic material into the host DNA) or non-viral (e.g. injection
of naked DNA or of oligonucleotides). Targeting the right region of the genome might also be
difficult.
– Issues:
∗ Short-lived effect: the rapid division of cells means that patients must be treated multiple times
for the therapeutic DNA to be fully integrated into their genome.
∗ The immune response of the patient may lead to rejection of the treatment, particularly if multiple
inoculations are required.
∗ The use of viral vectors is associated with increased risks of toxicity and inflammatory response.
∗ Complex traits, that is to say, those that are due to mutations in multiple genes, cannot be easily
treated this way, as one would have to target multiple (and as yet mostly undiscovered) regions
of the genome.
∗ Mutagenesis can be induced by the integration of DNA in a sensitive spot, in which case the
therapy would lead to an increased risk of cancer. This happened in the first clinical trials
for X-linked severe combined immunodeficiency (“bubble baby disease”), in which 3 out of 20
patients developed leukemia.
– Gene therapy holds promise for treating a number of diseases, in particular cancers and
autoimmune diseases. Currently, it is mostly available as part of clinical trials, for severe diseases
that have no other known cures. In 2012, Alipogene tiparvovec (commercialized as Glybera), a
treatment for a rare inherited disorder called lipoprotein lipase deficiency, was approved in Europe,
becoming the first gene therapy to be approved in either Europe or the US.
• Cell therapy
– Transplanting cells from donor to patient.
– E.g. bone marrow transplants.
– Transfusion of red blood cells generated in vitro from the patient’s own stem cells [MC+11].
– This is not to be confused with the alternative medicine meaning of “cell therapy”, whereby
illnesses are “treated” by the injection of animal cells. Since cells from another species cannot
replace human cells, this is unlikely to ever work. In addition, serious adverse effects have been
reported, and current scientific evidence does not support the claim that this type of “cell therapy”
is effective in treating cancer or any other disease.
Alain Fischer’s lectures at the Collège de France:
http://www.college-de-france.fr/site/alain-fischer/p8400912045226082 content.htm

G Robotics and automation of the drug discovery pipeline


Automation of combinatorial chemistry
• Combinatorial Chemistry Unit at Parc Cientı́fic de Barcelona:
https://www.youtube.com/watch?v=mMGml9DkNBM
• Combinatorial chemistry (Royal Society of Chemistry): https://www.youtube.com/watch?v=MVgsX7PM4F4

High-throughput screening High-Throughput Screening (HTS) automatically screens tens of thousands
of compounds against a protein target, making it possible to quickly identify which of those
compounds bind to it. It is a complex process, involving in particular
• Robotics
• Optical readers
• Liquid handlers
• Control software.
Despite recent technological advances, HTS is still a very costly process, of the order of at least a
dollar per compound.
Some examples of automation of HTS:
• Compound Screening at the Broad Institute
http://www.youtube.com/watch?v=xakRli5vxd4
• Novartis: Robots speeds up the pace of drug discovery https://www.youtube.com/watch?v=EwzhEAZRX5o

H Molecular modeling and simulation software


The list at http://www.rcsb.org/pdb/static.do?p=software/software links/modeling and simulation.html
includes:
• Abalone, a general purpose molecular modeling program focused on the dynamics of biopolymers;
• Affinity, a free energy function for estimating binding affinities;
• AMBER (Assisted Model Building with Energy Refinement), a molecular dynamics and energy mini-
mization program;
• Animations, a PDB (Protein Data Bank) viewer with an educational point of view, which includes
an animated simulator to view molecule forces on a picometer distance scale and an attometer time
scale;
• ANM (Anisotropic Network Model), a simple NMA tool for analysis of vibrational motions in molec-
ular systems;
• AutoDock3.0, a suite of automated docking tools designed to predict how small molecules, such as
substrate or drug candidates, bind to a receptor of known 3D structure;
• CHARMM (Chemistry at HARvard Molecular Mechanics), a molecular dynamics and energy mini-
mization program;

• 3D-DOCK Suite, which includes FTDock, which performs rigid-body docking between biomolecules;
RPScore, which uses pair potentials to screen output from FTDock; and MultiDock, which performs
multiple copy side-chain refinement;
• FIRST, which analyzes the flexibility in molecular structures of any size, and quickly explores the
available conformational space of the input molecule;
• FTDOCK, a program for carrying out rigid-body docking between biomolecules;
• GROMOS, a general-purpose molecular dynamics computer simulation package for the study of
biomolecular systems;
• GROMACS, a complete modelling package for proteins, membrane systems and more, including
fast molecular dynamics, normal mode analysis, essential dynamics analysis and many trajectory
analysis utilities;
• MolSoft ICM programs and modules for applications including structure analysis, modeling,
docking, homology modeling and virtual ligand screening;
• NAMD, a parallel object-oriented molecular dynamics simulation program;
• OpenContact, an open source, PC software tool for quickly mapping the energetically dominant
atom-atom interactions between chains or domains of a given protein;
• YASARA, a complete molecular graphics and modeling program, including interactive molecular dy-
namics simulations, structure determination, analysis and prediction, docking, movies and eLearn-
ing;
• ZMM, an Internal Coordinate Molecular Modeling Program for theoretical studies of systems of any
complexity: small molecules, peptides, proteins, nucleic acids, and ligand-receptor complexes.

I Canonical SMILES
OCC, [CH3][CH2][OH], C-C-O and C(O)C are all valid SMILES strings for ethanol. Canonicalization
maps all of them to a single string, the canonical SMILES, here CCO.
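
To make the need for canonicalization concrete, here is a minimal sketch that checks that the four strings above describe the same heavy-atom graph as CCO. The helper names (parse_simple_smiles, same_structure) are hypothetical, the parser only handles a tiny subset of SMILES (C, N, O atoms, bracket atoms, explicit single bonds and branches), and hydrogens and bond orders are ignored. It is an illustration only, not a canonicalization algorithm.

```python
import re
from itertools import permutations


def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds) over heavy atoms.

    Assumption: only C, N, O atoms, bracket atoms such as [CH3], explicit
    '-' single bonds and parenthesised branches appear. Rings, charges,
    aromaticity, hydrogens and bond orders are not handled.
    """
    atoms, bonds = [], set()
    branch_stack, previous = [], None
    for token in re.findall(r"\[[^\]]+\]|[CNO]|[()\-]", smiles):
        if token == "(":
            branch_stack.append(previous)      # remember the branching atom
        elif token == ")":
            previous = branch_stack.pop()      # go back to the branching atom
        elif token == "-":
            continue                           # explicit single bond, nothing to store
        else:
            atoms.append(token.strip("[]")[0])  # one-letter element symbol
            current = len(atoms) - 1
            if previous is not None:
                bonds.add(frozenset((previous, current)))
            previous = current
    return atoms, bonds


def same_structure(mol_a, mol_b):
    """Brute-force graph isomorphism test, only usable for very small molecules."""
    atoms_a, bonds_a = mol_a
    atoms_b, bonds_b = mol_b
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    for perm in permutations(range(len(atoms_a))):
        elements_match = all(atoms_a[i] == atoms_b[perm[i]] for i in range(len(atoms_a)))
        mapped_bonds = {frozenset(perm[i] for i in bond) for bond in bonds_a}
        if elements_match and mapped_bonds == bonds_b:
            return True
    return False


reference = parse_simple_smiles("CCO")
variants = ["OCC", "[CH3][CH2][OH]", "C-C-O", "C(O)C"]
results = [same_structure(parse_simple_smiles(s), reference) for s in variants]
print(results)
```

Comparing every pair of strings by graph isomorphism does not scale, which is why a canonical string that can be compared by simple equality is preferred in practice.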
Multiple canonicalization algorithms exist [WWW89; O’B12]. We will describe here how to use Mor-
gan’s algorithm to decide in which order to visit the nodes of the graph.

Morgan’s algorithm [Mor65]


• Label each heavy atom according to its heavy atom valence, i.e. assign to each non-hydrogen atom
the number of non-hydrogen atoms it is bound to.
• Relabel each atom with the sum of the labels of its neighbors.
• Repeat until the labels are as unique as possible. This happens when a new iteration does not
increase the number of unique labels.
• Assign 1 to the atom with the highest label. Then assign 2 to the neighbor of atom 1 with the
highest label, and so on. To break ties, assign the smallest number to the atom with the highest
bond order (from the “issuing” atom).
Symmetry-invariant atoms get assigned the same number. However, there are known cases where
the Morgan algorithm is not able to distinguish between atoms that are not equivalent by symmetry.

Example: Morgan labelling of proline.

[Figure: panels (a) to (e), showing the successive Morgan labellings of proline.]

(a) First, we label each heavy atom by its number of heavy atom neighbors. There are 3 different labels.
(b) We then replace each atom’s label by the sum of its neighbors’ labels. There are 4 different labels.
(c) And again. There are 5 different labels. (d) If we repeat the process again, we still do not get more
than 5 different labels. We stop iterating here. (e) Finally, we relabel as 1 the atom with the highest
label (35). Then we assign 2 to its neighbor with the highest label (25) and 3 to both other neighbors,
which are indistinguishable (symmetry equivalent). We then move to atom 2 and label its neighbors. The
double-bonded O, having the highest bond order from the issuing atom, gets label 4, and the other O
gets label 5. Finally, the neighbors of the nodes labeled 3 get label 6.
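
The label refinement in the proline example can be reproduced with a short Python sketch. It is only an illustration: the atom names (N, CA, CB, CG, CD, C, O1, O2) and the function name `morgan_labels` are assumptions made here, and the final numbering step with its bond-order tie-break is not implemented. Following the example, the function returns the labels of the last iteration computed, i.e. the first one that does not increase the number of distinct labels.

```python
def morgan_labels(adjacency):
    """Iterate the neighbor-sum relabelling on a heavy-atom graph.

    adjacency maps each atom name to the list of its heavy-atom neighbors.
    Returns the labels of the last iteration and the number of distinct
    labels observed at each step (initial labelling included).
    """
    # Initial label: number of heavy-atom neighbors.
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    counts = [len(set(labels.values()))]
    while True:
        # New label: sum of the current labels of the neighbors.
        labels = {atom: sum(labels[n] for n in nbrs)
                  for atom, nbrs in adjacency.items()}
        counts.append(len(set(labels.values())))
        # Stop as soon as an iteration brings no new distinct label.
        if counts[-1] <= counts[-2]:
            return labels, counts


# Heavy-atom graph of proline (hypothetical atom names).
proline = {
    "N": ["CA", "CD"],
    "CA": ["N", "CB", "C"],
    "CB": ["CA", "CG"],
    "CG": ["CB", "CD"],
    "CD": ["CG", "N"],
    "C": ["CA", "O1", "O2"],
    "O1": ["C"],
    "O2": ["C"],
}

labels, counts = morgan_labels(proline)
start_atom = max(labels, key=labels.get)  # atom that would be numbered 1
print(counts)
print(start_atom, labels[start_atom])
```

Because atom types are ignored, N and CB receive the same label in this sketch, which is consistent with the two neighbors numbered 3 in the example.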

I.1 Fingerprint compression


Fingerprints are often used in database searches as a way to filter out obvious mismatches before
doing a computationally expensive sub-graph isomorphism test. They are usually long and sparse
binary vectors: there are a lot of potential sub-structures, but any given molecule only possesses a
small fraction of them. They therefore lend themselves to various compression schemes.
• Again, systematic enumeration leads to long, sparse vectors. For example, 50 000 random
compounds from the ChemDB database contain 300 000 paths of length up to 8 that appear at least
once, but only 300 non-zeros on average per fingerprint.
• “Naive” Compression: List the positions of the 1s. (This is conceptually similar to the list/dictionary
encoding of the fingerprints). On the ChemDB example:
– $2^{18} = 262\,144 < 300\,000 < 2^{19} = 524\,288$. Therefore 19 bits are required to encode each position.
– average encoding: 300 × 19 = 5 700 bits instead of 300 000.
• Modulo Compression (lossy)
– Use a fixed fingerprint length d. The entry at bit b of the compressed fingerprint is based on the
sum of all entries at bits b, b + d, b + 2d, . . . of the uncompressed fingerprint: it is set to 1 if
this sum is non-zero (i.e. at least one of the bits is non-zero) and 0 otherwise. Because the
fingerprints are very sparse, the compressed version will still have many zeros. A zero in the
compressed version indicates that all bits that have been “folded” together were set to 0. A one
only indicates that at least one of those bits was set to 1 – this is where information is lost.
– Typically, 512 or 1024 bits are used.
• Elias-Gamma Monotone Encoding (lossless)
– Encode the first non-zero index $j_0$ following the Elias-Gamma encoding, i.e. decompose it into its
highest power of 2 plus the rest: $j_0 = 2^N + m$, encode N in unary, i.e. N zeros followed by a 1,
and append m in binary over N bits. E.g. $9 = 2^3 + 1$ → 0001001.
→ (N + 1) (unary) bits + N (binary) bits.
The trailing zeros are unnecessary → $2 \times \lfloor \log_2(j) \rfloor$ bits required.
– For the following non-zero indexes, encode $j_{i+1} - j_i$ similarly.
– For the ChemDB example: average compressed size = 1 800 bits, with no loss, versus 5 700 bits
for the naive encoding.
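
The three schemes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the ChemDB figures: a fingerprint is represented as a collection of positions of its 1 bits, positions are assumed to start at 1 for the Elias-Gamma code (which cannot encode 0), and all function names are hypothetical.

```python
def naive_bits(n_positions, n_on):
    """Size of the naive encoding: one fixed-width index per 1 bit."""
    bits_per_index = (n_positions - 1).bit_length()
    return n_on * bits_per_index


def fold(on_bits, d):
    """Lossy modulo compression: bit b collects bits b, b + d, b + 2d, ..."""
    return {i % d for i in on_bits}


def elias_gamma(j):
    """Elias-Gamma code of j >= 1: N zeros, a 1, then m over N bits."""
    if j < 1:
        raise ValueError("Elias-Gamma encodes positive integers only")
    n = j.bit_length() - 1           # j = 2**n + m
    m = j - (1 << n)
    rest = format(m, "b").zfill(n) if n else ""
    return "0" * n + "1" + rest


def encode_monotone(on_bits):
    """Encode the first index, then the gaps between successive indexes."""
    out, prev = [], 0
    for j in sorted(on_bits):
        out.append(elias_gamma(j - prev))
        prev = j
    return "".join(out)


def decode_monotone(code):
    """Invert encode_monotone, showing that the scheme is lossless."""
    on_bits, pos, prev = [], 0, 0
    while pos < len(code):
        n = 0
        while code[pos] == "0":      # unary part: count the zeros
            n += 1
            pos += 1
        pos += 1                     # skip the 1 that ends the unary part
        m = int(code[pos:pos + n], 2) if n else 0
        pos += n
        prev += (1 << n) + m         # gap added to the previous index
        on_bits.append(prev)
    return on_bits


print(naive_bits(300_000, 300))
print(elias_gamma(9))
print(sorted(fold({3, 1027, 5000}, 1024)))
code = encode_monotone([9, 10, 14])
print(code, decode_monotone(code))
```

In the folding example, positions 3 and 1027 collide on bit 3 of the 1024-bit fingerprint, which is exactly the loss of information described above, whereas decoding the monotone code returns the original positions.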

J Kernels
• Any function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is symmetric and positive semi-definite, i.e. such that for any
$n \in \mathbb{N}$ and any $x_1, x_2, \dots, x_n \in \mathcal{X}$, the $n \times n$ matrix $K$ defined by $K_{ij} = k(x_i, x_j)$ is positive
semi-definite, is a kernel. That is to say, given one such function k, there exists a Hilbert space
$\mathcal{H}$ (a Hilbert space is a vector space with a dot product; in essence, think of $\mathbb{R}^d$ or $\mathbb{C}^d$ where
$d \in \mathbb{N} \cup \{+\infty\}$) and a function $\Phi : \mathcal{X} \to \mathcal{H}$ such that $k(x, z) = \langle \Phi(x), \Phi(z) \rangle$.
• Example: the polynomial kernel
$$k(x, z) = (\langle x, z \rangle + c)^m,$$
where $c \in \mathbb{R}$ and $m \in \mathbb{N}$.
For $\mathcal{X} = \mathbb{R}^p$, this kernel corresponds to a mapping Φ to a space of dimension $d \gg p$, as this new
feature space contains all monomials of p variables of degree up to m.
• Example: the Gaussian kernel
$$k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right).$$
This kernel corresponds to a mapping Φ to a space of infinite dimension that contains all monomials
of p variables of any degree.
• When a machine learning algorithm only needs to access data points through their dot products
with other data points, one can apply the kernel trick, which consists in replacing the dot product
with a kernel. This is equivalent to applying the algorithm in the feature space to which Φ maps,
but without doing computations in this space, which is computationally attractive when the
feature space is very large (or infinite-dimensional).
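• Example: kernel ridge regression. Given data $(X, y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, where the i-th row of X is the
sample $x_i \in \mathbb{R}^p$, ridge regression consists in
– Learning: Find the vector of regression weights $\beta \in \mathbb{R}^p$ that minimizes
$$\sum_{i=1}^{n} \left( y_i - \langle \beta, x_i \rangle \right)^2 + \lambda \|\beta\|_2^2,$$
or, equivalently, computing
$$\beta = \left( \lambda I_p + X^\top X \right)^{-1} X^\top y,$$
where $I_p$ is the identity matrix of dimension p × p.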
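– Predicting: Given a sample $x \in \mathbb{R}^p$, return $f(x) = \langle \beta, x \rangle$.
Some algebraic manipulations, based on the identity
$\left( \lambda I_p + X^\top X \right)^{-1} X^\top = X^\top \left( \lambda I_n + X X^\top \right)^{-1}$, allow us to rewrite the prediction function as
$$f(x) = x^\top X^\top \left( \lambda I_n + X X^\top \right)^{-1} y,$$
where $I_n$ is the identity matrix of dimension n × n. This can be rewritten as
$$f(x) = \kappa^\top \left( \lambda I_n + K \right)^{-1} y,$$
where $\kappa \in \mathbb{R}^n$ is the vector such that $\kappa_i = \langle x, x_i \rangle$ and $K \in \mathbb{R}^{n \times n}$ is the matrix such that
$K_{ij} = \langle x_i, x_j \rangle$. Therefore the ridge regression model can be fitted and applied using
$x_1, x_2, \dots, x_n$ and $x$ only inside dot products, which can be replaced with kernels, using
$\kappa_i = k(x, x_i)$ and $K_{ij} = k(x_i, x_j)$.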
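
The equivalence between the two forms of ridge regression can be checked numerically with a small, dependency-free Python sketch. The toy data, the value of λ, the function names and the hand-written Gaussian elimination solver are all assumptions made for illustration; in practice a linear algebra library would be used. With the plain dot product as kernel, the kernel form should return the same prediction as the weight vector β; swapping in the Gaussian kernel only changes the function passed as `kernel`.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def ridge_weights(xs, y, lam):
    """beta = (lam * I_p + X^T X)^{-1} X^T y."""
    p = len(xs[0])
    A = [[sum(x[a] * x[b] for x in xs) + (lam if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    rhs = [sum(x[a] * yi for x, yi in zip(xs, y)) for a in range(p)]
    return solve(A, rhs)


def kernel_ridge_fit(xs, y, kernel, lam):
    """alpha = (lam * I_n + K)^{-1} y, with K_ij = kernel(x_i, x_j)."""
    n = len(xs)
    A = [[kernel(xs[i], xs[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(A, y)


def kernel_ridge_predict(xs, alpha, kernel, x):
    """f(x) = kappa^T alpha, with kappa_i = kernel(x, x_i)."""
    return sum(kernel(x, xi) * a for xi, a in zip(xs, alpha))


# Toy data: n = 4 samples in dimension p = 2 (arbitrary values).
xs = [[1.0, 2.0], [2.0, 0.5], [0.0, 1.0], [3.0, 1.5]]
y = [1.0, 2.0, 0.5, 3.0]
lam, x_new = 0.1, [1.5, 1.0]

primal_pred = dot(ridge_weights(xs, y, lam), x_new)
alpha = kernel_ridge_fit(xs, y, dot, lam)
dual_pred = kernel_ridge_predict(xs, alpha, dot, x_new)
alpha_rbf = kernel_ridge_fit(xs, y, gaussian_kernel, lam)
rbf_pred = kernel_ridge_predict(xs, alpha_rbf, gaussian_kernel, x_new)
print(primal_pred, dual_pred, rbf_pred)
```

Note that the kernel form solves an n × n system instead of a p × p one, so it pays off when the feature space is much larger than the number of samples.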
