2019 10 28 Chemoinformatics - Notes
What is a drug?
• Legal definition: “Toute substance ou composition présentée comme possédant des propriétés cu-
ratives ou préventives à l’égard des maladies humaines ou animales, ainsi que toute substance ou
composition pouvant être utilisée chez l’homme ou chez l’animal ou pouvant leur être administrée,
en vue d’établir un diagnostic médical ou de restaurer, corriger ou modifier leurs fonctions physi-
ologiques en exerçant une action pharmacologique, immunologique ou métabolique.” [code de la
Santé publique, article L.5111-1]
• Rough translation: Any substance or compound presented as having curative or preventive properties
towards human or animal diseases, as well as any substance or compound that can be used in or administered
to humans or animals, so as to establish a medical diagnosis or to restore, correct or modify
their physiological functions by exerting a pharmacological, immunological or metabolic action.
• Key-lock principle: Most drugs are relatively small molecules that work by binding to a protein.
Generally speaking, a small molecule that binds to a protein is called a ligand. The protein is
then referred to as the receptor. The terminology comes from the term “receptor” used to denote
a protein embedded in the plasma membrane of a cell, whose role is to receive chemical signals
from outside the cell. The binding site of the protein, where the ligand attaches to the receptor, is
also called its active site or binding pocket. Note that a protein can have several such pockets. The
protein that the drug binds to is called its target.
[Figure: a ligand bound in the pocket of its target.]
enzyme necessary for HIV to spread through the body of its host.
From serendipity to rationalized drug design. Ancient Greeks and Egyptians treated infections with
mould; some moulds actually produce penicillin, which inhibits the biological processes that result in the
formation of bacterial cell walls.
[Figure: chemical structure of biapenem (left); biapenem in PBP-1A (right).]
Illustration: Biapenem (an antibiotic of the β-lactam family), seen alone on the left image, binding
to a PBP (Penicillin Binding Protein, in blue on the right image), an enzyme which is involved in the
formation of peptidoglycan, a polymer that forms the bacterial cell wall. The mode of action of most
β-lactam antibiotics is to inhibit this protein, which results in bacterial death.
Overview of the drug discovery pipeline. Using the key-lock principle, the modern approach to drug
discovery can be described as a pipeline, which will be detailed in the remainder of this section. The
steps of the drug discovery pipeline are: target identification (finding a protein that we want
to interfere with); hit identification (finding small organic molecules, or hits, that bind this protein);
hit characterization (determining the physico-chemical and pharmacological properties of these hits, and
keeping as leads those that present the best properties); lead optimization (refining the leads into drug
candidates); and, finally, assays evaluating the efficacy and toxicity of the candidates.
For drugs that do not work following the key-lock principle, see Appendix F.
1.3 Hit identification
First of all, pharmacologists must identify in the chemical space the compounds that are most likely to
be active (that is to say, bind to the protein and modify its level of expression in the organism). This
involves screening large panels of chemicals, each of which must be tested against the protein.
• Hit: A compound that binds to the target. At this stage, we are not looking for a specific effect: any
compound that binds to the target protein is relevant.
• Identify compounds known to have an effect on the target. An extensive literature and database
search is conducted to identify compounds that are known to have a biological effect on the target.
Those compounds can be endogenous, meaning that they are normally found in the organism, or
exogenous, meaning that they are not. Two essential, complementary approaches are:
• Combinatorial chemistry:
– For the history of the development of combinatorial chemistry, see Appendix C.
– Combinatorial chemistry is a family of techniques that can be used to generate large numbers
of derivatives of a given compound for testing. Robert Merrifield, who got the Nobel prize in
1984 for his development (in the 1960s) of the methodology for chemical synthesis on a solid
matrix, is considered the father of those techniques, together with Mario Geysen, who developed
in the early 1980s the pin method for the simultaneous synthesis of diverse peptides. Combi-
natorial chemistry consists in adding a variety of substituting groups to a fixed portion of the
start compound, called a scaffold. The resulting compounds form what is called a combinatorial
library.
Example: a scaffold with three possible derivatization sites. If one has ten possible substituting
groups per site, one can potentially generate $10^3 = 1000$ compounds, hence the name combinatorial.
– The main technique for combinatorial chemistry is solid phase synthesis:
∗ Linkers, that are chemically stable, connect the scaffold to resin beads.
∗ Reactants are passed over (in solution) to generate intermediate compounds.
∗ A last step of detachment from the resin yields the derived compounds.
Source: Combinatorial Chemistry. Synthesis and application. Wilson & Czarnik ed., 1997
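The counting argument behind a combinatorial library can be sketched in a few lines of Python. This is only an illustration: the scaffold name and the substituent labels R1 to R10 are hypothetical placeholders, not real chemical groups, and the function name is an assumption of this sketch.

```python
from itertools import product

# Hypothetical labels: 10 candidate substituting groups, reused at each site.
SUBSTITUENTS = [f"R{i}" for i in range(1, 11)]

def enumerate_library(scaffold, groups, n_sites):
    """Return every (scaffold, groups-per-site) combination as a list."""
    return [(scaffold, combo) for combo in product(groups, repeat=n_sites)]

library = enumerate_library("scaffold", SUBSTITUENTS, n_sites=3)
print(len(library))  # 10 ** 3 = 1000 virtual compounds
```

Adding a fourth site, or ten more groups per site, multiplies the library size accordingly, which is why such libraries grow so quickly.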
• (High-throughput) screening: All compounds in a library are added to a solution containing the
target to evaluate whether or not they bind. The library can be generated by combinatorial chemistry
if an interesting scaffold is known; otherwise, the library will typically contain all available in-house
chemicals, or a catalog provided by a chemical company, or a set of known drugs.
Once hits with good target binding abilities are identified, they need to be characterized:
can they be used as drugs? Inhibiting the target is not a sufficient condition for being a promising
drug candidate. Indeed, in order to be both safe to administer and effective, a drug must have low
toxicity and meet good pharmacokinetic requirements.
• Which of the hits are more suitable as medication?
• ADME-Tox properties: Pharmacokinetics requirements, compiled under the acronym of ADME (Ab-
sorption, Distribution, Metabolism, Elimination), characterize the ability of the molecule to reach
the target protein in the tissue where it is localized before being degraded.
Ideally, the drug is given orally, and must then enter the bloodstream via the digestive tract. This
process can be hindered by factors such as intestinal transit time, compound solubility, or chemical
reactivity in the stomach. If absorption is too low, the drug must be administered in a less desirable
and more invasive way (such as inhalation, patches, or intravenous injection).
Natural biological barriers (in particular the blood-brain barrier) can also negatively impact the
journey of the drug to its target. Even if the drug is satisfactorily absorbed, it can be partially blocked
by membranes or by binding to proteins other than the one intended.
Moreover, chemicals usually break down as soon as they enter the body; in particular, the liver
will metabolize the drug, converting it into new metabolites that can either be inactive or, on the
contrary, more potent (or cause more undesirable side effects) than the original drug.
While we wish for as much as possible of the administered dose to reach the target tissue, elim-
ination must also be taken into consideration, as the accumulation of foreign substances in the
organism can adversely affect healthy metabolism.
– Absorption: Drugs administered orally have to be absorbed before they can be transported via
the circulatory system (i.e. blood vessels) to their site of action.
– Distribution: In order to be effective, a drug must be able to reach its site of action. To pass
through lipid bilayer membranes, drugs must be reasonably soluble in both water and lipids. If
necessary, medications are packaged in time-release capsules that keep their level
constant over several hours, or in a coating that lets them pass unharmed through the stomach’s
acidity into the small intestine.
– Metabolism: Chemicals (in particular, enzymes in the gastro-intestinal tract and in the liver)
start breaking down compounds as soon as they enter the body. The drug may be inactivated
by this mechanism. In addition, the resulting metabolites may have undesired pharmacological
effects.
– Excretion/Elimination: The liver and kidney are the main organs involved in the elimination of
drugs (and, more generally, waste). The liver breaks down toxic substances through a series of
complex metabolic reactions. The kidneys further process the broken down waste and eliminate
it from the body through urine.
– Toxicity: The compound should be as non-toxic, non-carcinogenic and non-mutagenic as possible
for the patient.
After promising drug candidates have been identified, they are assayed in vitro to verify that they do
indeed bind to the target protein. This is followed by a phase of lead optimization, during which the
chemical structure of the drug candidates is refined in order to meet the ADME and toxicity require-
ments.
• Optimize ligand-receptor interaction (pharmacodynamics)
• Optimize ADME-Tox properties (pharmacokinetics)
• Design of the synthetic path to produce the lead compound
• Synthesis of analogues: The main tools for lead optimization are:
– Combinatorial chemistry
– Structure-based design: use information about the structure of the target and that of the lead
to tweak the lead, usually visually in an appropriate software for positioning compounds and
computing binding energies (see Section 3).
Once a drug candidate has been identified, a reliable process to produce it must be established. Al-
though molecular compounds can sometimes be extracted from a natural source, this technique usu-
ally proves to be laborious and the alternative of synthesizing molecules from commercially avail-
able starting materials is preferable. In addition, combinatorial synthesis, a technique by which large
numbers of compounds can be synthesized simultaneously to create chemical libraries for biological
screening, also relies on a thorough understanding of organic synthesis. Planning the total synthesis
of compounds with interesting biological or physical properties is therefore one of the core con-
cerns of organic chemistry. Devising the optimal multi-step route to a novel and potentially artificial
compound is a challenging problem, and organic chemists face the daily challenge of choosing the
most appropriate combination of reactants and reagents, as well as the necessary conditions and best
sequence of their assembly. See more details on the history of organic synthesis in Appendix D.
1.7 Assays
Last but not least, drug candidates must be assayed to evaluate whether they indeed have the intended
effect, determine dosage, and assess their toxicity. Pre-clinical assays aim at demonstrating safety and
are used to file a new drug application and obtain permission to enter clinical trials. Clinical trials on
humans are used to determine that the drug is indeed effective (and, if a drug already exists on the
market with the same indication, that it is more effective than this drug), and determine dosage. Finally,
drugs are continuously monitored for side effects after they have been released, through a process
known as pharmacovigilance or post-marketing surveillance. For more details, including on regulatory
agencies, see Appendix E.
How can engineering help? Engineers in pharmaceutical companies have roles in many fields: au-
tomation, biotechnology, electrical engineering, mechanical engineering, mechatronics, computer en-
gineering, and more.
You can consult Appendix G for some information about the role of robotics and automation in the
drug discovery pipeline. We will focus here on the role of computer science, through a field called
chemoinformatics.
2 Chemoinformatics
By the 1970s, the amount of data and information produced by chemical research had grown large
enough that it became obvious that it could only be processed and analyzed by computer methods,
pushing the development of databases of chemical compounds and reactions. Furthermore, many of
the problems faced by chemists, from the prediction of physical, chemical and biological properties
of compounds and materials to structure elucidation or organic synthesis are so complex that they
require informatics-based approaches.
Chemoinformatics
Help from computer science:
“...the mixing of information resources to transform data into information, and information into knowl-
edge, for the intended purpose of making better decisions faster in the arena of drug lead identification
and optimisation.” – F. K. Brown
“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel. This
definition encompasses many aspects and these problems include:
– Representing, storing and retrieving chemical compounds and reactions;
– Predicting physical, chemical and biological properties of compounds;
– Drug design;
– Structure elucidation;
– Predicting the course of chemical reactions;
– Designing organic syntheses.
[Figure: the drug discovery pipeline, with chemoinformatics supporting every step.]
Target: protein linked to the disease on which one wants to act
Hits: compounds binding to the target
Leads: desirable ADME-Tox properties
Candidates: optimized and synthesizable
Drug: approved for medical use and sale
Virtual chemical space. In 2009, Blum and Reymond built a virtual library containing of the order of
$10^9$ potential drugs: all those made of hydrogen and no more than 13 atoms of carbon, oxygen,
nitrogen, sulfur, and chlorine. This is rather limiting, considering that, for example, morphine contains
17 carbons and erythromycin contains 37 carbon atoms, 13 oxygen atoms and 1 nitrogen atom. In
1996, Bohacek et al. estimated that there are potentially of the order of $10^{60}$ drug-sized organic
compounds. By comparison, the universe is estimated to contain $3 \times 10^{23}$ stars, according to van Dokkum
and Conroy.
Although the combinatorial libraries of chemicals used in high-throughput screening are designed to
cover as much of the chemical space as possible, they still leave large areas of it unexplored, which
motivates the need for new, fast and accurate techniques.
2.1 Representing chemicals in silico
One of the core concerns of chemoinformatics is the representation of small organic compounds in
silico.
Many chemoinformatics applications, including high-throughput virtual screening, benefit from being
able to rapidly predict the physical, chemical, and biological properties of small molecules to screen
large repositories and identify suitable candidates.
Ab initio methods, such as quantum mechanical methods, can in most cases still not be applied sys-
tematically due to complexity and computational cost issues. When annotated data are available,
machine learning methods that try to extract relevant information more or less automatically from
the data provide a suitable alternative.
Representing and visualizing chemicals. Here are some examples of different ways to describe the
same molecule:
[Figure: the same molecule drawn as a structural formula and rendered in the two 3D styles below.]
• Solvent-excluded surface, or Connolly surface, representing the surface of the molecule.
• CPK (Corey, Pauling and Koltun) or space-filling representation, in which atoms are represented
by spheres delimiting the locations of their electrons.
Molecules can be represented using so-called SMILES strings (Simplified Molecular Input Line Entry
System). For a complete specification, see
http://www.daylight.com/meetings/summerschool98/course/dave/smiles-intro.html.
• Atoms are represented by their atomic symbol: C, O, N, S, ...
• Atoms in aromatic rings are represented in lower case: c, n, ...
• Hydrogens and formal charges are attached with square brackets: [Fe+2] or [Fe++].
• No square brackets implies normal valence: [CH4] is equivalent to C.
• Single (-), double (=), triple (#), aromatic (:) bonds. Single and aromatic bonds may be omitted if
unambiguous. E.g. C=O, C#N, C-C or CC.
• Branches are indicated between parentheses, e.g. O([H])[H] represents water with explicit hydrogens.
• Break cycles, use numbers to indicate where: c1ccccc1 represents an aromatic cycle of 6 carbons.
• ! SMILES are not unique: each molecule has multiple possible representations. However, one SMILES
string represents a single molecular graph (not necessarily a unique molecule, due to stereoisomerism).
Examples
• SMILES string of cytosine: Nc1[nH]c(=O)ncc1.
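To make the SMILES rules listed above more concrete, here is a minimal tokenizer sketch in Python. It is an assumption-laden illustration, not a SMILES parser: the regular expression only covers the subset of the syntax described in these notes (bracket atoms, common organic atoms, aromatic lower-case atoms, the four bond symbols, branches and single-digit ring closures), and the function names are hypothetical.

```python
import re

# One token per bracket atom, two-letter atom, one-letter atom (upper case:
# aliphatic, lower case: aromatic), bond symbol, branch parenthesis or ring digit.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")

def tokenize(smiles):
    """Split a SMILES string into tokens; reject anything outside the subset."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax in {smiles!r}")
    return tokens

def count_atom_tokens(smiles):
    """Count atom tokens (a bracketed hydrogen such as [H] is counted too)."""
    return sum(1 for t in tokenize(smiles) if t[0].isalpha() or t[0] == "[")

print(tokenize("c1ccccc1"))                   # ['c', '1', 'c', 'c', 'c', 'c', 'c', '1']
print(count_atom_tokens("Nc1[nH]c(=O)ncc1"))  # 8
```

The second call returns 8, which matches the eight non-hydrogen atoms of cytosine (four carbons, three nitrogens, one oxygen).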
Canonical SMILES. As the above examples illustrate, while a SMILES string determines a unique molecular
graph, the converse is far from true. Canonical representations are desirable, to associate each
molecule with a unique SMILES string. The state of the art is however still not stable, and multiple
canonicalization algorithms exist. For more details, see Appendix I.
2.1.2 Molecular fingerprints
Expert knowledge descriptors. The first in silico representations of molecules to have been developed
were based on lists of descriptors, each corresponding to a characteristic thought relevant by
experts. Such descriptors range from the presence of particular functional groups to the number of
rotatable bonds or the partial charges on each of the atoms of the molecule.
One of the most popular sets of such descriptors used in chemical informatics is given by DRAGON,
which provides 3224 molecular features, divided in 22 blocks, and ranging from simple atom type and
functional group and fragment counts to several advanced topological and geometrical descriptors.
With the help of expert chemists, more and more features have been added along the years since the
first release of DRAGON in 1997. Unfortunately, some of these descriptors are prohibitively complex
to compute for large data sets.
Another example of such descriptors is the PowerMV set of descriptors, which provides atom-based,
fragment-based, and real-valued descriptors. The latter include characteristics such as Gasteiger partial
charges, electronegativities, or logP (estimated using the XlogP predictor).
These representations are hard to define and some of the descriptors can be difficult to compute.
Moreover, they are potentially incomplete, as a key feature for the desired application might have
been missed out by the experts.
Molecular fingerprints. Nowadays, molecules are typically represented by so-called molecular fingerprints.
A fingerprint is a bit string in which each bit corresponds to a particular molecular feature,
designed to be chemically relevant, and is set to 1 if the molecule exhibits that feature and to 0
otherwise.
• Define feature vectors that record the presence/absence (or number of occurrences) of pre-determined
molecular features in a compound:
$$\phi(A) = \left(\phi_s(A)\right)_{s \text{ feature}}, \quad \text{where} \quad \phi_s(A) = \begin{cases} 1 & \text{if } s \text{ occurs in } A, \\ 0 & \text{otherwise.} \end{cases}$$
Examples of molecular fingerprints. The most widely known sets of fingerprints include the MACCS
keys and the CACTVS substructure keys.
• MACCS keys: answers to a set of true/false questions about a chemical structure
– “Are there fewer than 3 oxygen atoms?”
– “Is there at least one halogen atom present?”
– ...
• PubChem Substructure Fingerprints (aka CACTVS Subgraph Keys):
presence/absence of molecular substructures, e.g.:
– individual atoms of a given type
– rings of a given size containing given atoms
– more general patterns involving several bonds and atom types
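Key-based fingerprints of this kind can be sketched as a list of yes/no predicates evaluated on a molecule. The sketch below is a toy under stated assumptions: a molecule is reduced to the list of its non-hydrogen element symbols, and the three keys are hypothetical examples modeled on the questions quoted above, not the actual MACCS or PubChem key definitions.

```python
from collections import Counter

HALOGENS = ("F", "Cl", "Br", "I")

# Hypothetical key set for illustration; NOT the real MACCS or PubChem keys.
KEYS = [
    ("fewer than 3 oxygen atoms", lambda atoms: atoms["O"] < 3),
    ("at least one halogen atom", lambda atoms: any(atoms[h] > 0 for h in HALOGENS)),
    ("at least one nitrogen atom", lambda atoms: atoms["N"] > 0),
]

def key_fingerprint(element_symbols):
    """Return one bit per key: 1 if the answer is yes, 0 otherwise."""
    atoms = Counter(element_symbols)  # missing elements count as 0
    return [int(test(atoms)) for _, test in KEYS]

# Chloroform, CHCl3 (non-hydrogen atoms only).
print(key_fingerprint(["C", "Cl", "Cl", "Cl"]))  # [1, 1, 0]
```

Real key sets also test for substructures (rings, bond patterns), which requires the molecular graph rather than a bare atom count.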
The molecular graph. Whereas most sets of fingerprints or descriptors (from DRAGON descriptors to
MACCS keys) are rather heterogeneous collections of all sorts of computable molecular properties that
heavily rely on expert knowledge, substructure-based representations are derived in a more principled
and automated way. In what follows we will focus on such representations and on extracting structural
information from the molecular graph.
Small molecules are most commonly represented as labeled graphs of bonds. The vertices repre-
sent the atoms, and the edges represent the bonds. Edges are labeled by the bond type (e.g. single,
double) they correspond to. Labels on the vertices correspond to the element (e.g. C, N, O) of the
atom they correspond to, and can be expanded to include more information about the local chemical
environment.
[Figure: labeled molecular graph of a small molecule, with vertices labeled by element (C, N, O, S) and edges labeled by bond type (e.g. d for double).]
Structure-based fingerprints
• Path fingerprints: One can choose to use all labeled paths of length d, or up to d, starting from each
vertex of the graph. Paths are allowed to self-intersect and traverse the same vertex twice, so as to
capture ring structure. Edges cannot appear more than once in a path.
– Daylight fingerprints: all substructures with N atoms bonded by N − 1 bonds, for 3 ≤ N ≤ 7.
– Labeled sub-paths (walks): consider all possible paths of length d, or up to d, in the molecular
graph.
[Figure: the same molecular graph, highlighting two example labeled paths, CsCsCdO and NsCsCsS (s: single bond, d: double bond), and an example labeled subtree, C{sC{sN|sC}|sN{sC}|sS{sC}}.]
Example (path fingerprints). Let us consider paths of length 6, using the atoms C (carbon), c (carbon
in an aromatic cycle), O (oxygen), N (nitrogen), n (nitrogen in an aromatic cycle). If we do not consider
other atom types, nor distinguish between bond types, there are $(5^6 + 5^3)/2 = 7875$ possible patterns:
there are $5^6$ strings of 6 symbols, and each of the $5^6 - 5^3$ non-palindromic strings is identified with its reverse
(note that CCCccc and cccCCC are equivalent, as molecular graphs have no orientation).
The fingerprint of amoxicillin, for paths of length 6 built on the 5 above atoms, ignoring bond type,
looks like:
CCCCCC  CCCCCc  CCCCCO  CCCCCN  CCCCCn  ...  CCCNCC  CCCNCc  CCCNCO  ...
0       0       0       0       0       ...  1       0       1       ...
Alternatively, we can choose to represent the fingerprint (for paths of length 6) of a molecule using
a list structure. Such a fingerprint can be built by reading the SMILES string and listing all paths of
length 6 that are encountered. This approach requires less memory (the vector fingerprint is
usually quite sparse) and guarantees that no atom type is forgotten (here, the sulfur S).
For amoxicillin, the list would look like: [‘CCCNCC’, ‘CCCNCO’, ...].
If one would rather encode counts of paths than their mere presence/absence, then one can
use dictionaries following the same principle. For amoxicillin, again, the dictionary would look like:
{‘CCCNCC’: 3, ‘CCCNCO’: 2, ...}.
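The list and dictionary representations can be sketched as follows. This is a minimal illustration under several assumptions: the molecule is given directly as a list of atom labels plus a list of bonds (rather than parsed from a SMILES string), bond types are ignored, each undirected path is counted once, and the function name `path_fingerprint` is hypothetical. As in the definition of path fingerprints, a path may revisit an atom but never reuses a bond.

```python
from collections import Counter

def path_fingerprint(atoms, bonds, n_atoms):
    """Count labeled paths made of n_atoms atoms in a molecular graph."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    found = set()  # atom-index sequences, one orientation per undirected path

    def extend(path, used_bonds):
        if len(path) == n_atoms:
            found.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in neighbors[path[-1]]:
            bond = frozenset((path[-1], nxt))
            if bond not in used_bonds:  # a bond may appear only once
                extend(path + [nxt], used_bonds | {bond})

    for start in range(len(atoms)):
        extend([start], frozenset())

    counts = Counter()
    for path in found:
        labels = tuple(atoms[i] for i in path)
        # A path and its reverse are the same pattern: keep one spelling.
        counts["".join(min(labels, labels[::-1]))] += 1
    return counts

# Acetic acid, CC(=O)O: atoms 0..3, three bonds.
counts = path_fingerprint(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)], n_atoms=3)
print(dict(counts))    # {'CCO': 2, 'OCO': 1} (key order may vary)
print(sorted(counts))  # list version: ['CCO', 'OCO']
```

The dictionary only stores the patterns that actually occur, so atom types outside a predefined alphabet are never silently dropped.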
Fingerprints tend to be very long and sparse; as an example, an experiment on 50 000 random
molecules from the ChemDB database yielded 300 000 possible paths of length 8. Encoding each of
the 50 000 molecules as a vector of length 300 000 is inefficient, and compression strategies, about
which you can read more in Appendix I.1, are necessary.
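As an illustration of what compression can look like, one widely used idea is to fold a long, sparse fingerprint into a short bit vector by hashing each path to a position modulo the vector length. The sketch below assumes this hashing scheme for illustration only; it is not necessarily the strategy described in Appendix I.1, and the function name and the default length of 1024 bits are assumptions.

```python
import zlib

def fold_fingerprint(paths, n_bits=1024):
    """Hash each path label to one of n_bits positions and set that bit."""
    bits = [0] * n_bits
    for label in paths:
        # crc32 is deterministic across runs, unlike Python's built-in
        # hash() on strings, so the same path always maps to the same bit.
        bits[zlib.crc32(label.encode("utf-8")) % n_bits] = 1
    return bits

fingerprint = fold_fingerprint(["CCCNCC", "CCCNCO"])
print(len(fingerprint), sum(fingerprint))
```

The price of folding is that two different paths may collide on the same bit, so a folded fingerprint is a lossy summary of the original one.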
Deep learning for molecular graphs. In recent years, many research efforts have concentrated on extending
the concepts of convolutional neural networks from applications to images to applications to
graphs. These approaches are beyond the scope of this introductory course. While they sometimes (far
from always) outperform classical machine learning approaches based on fingerprints, they require
much more computational power. An interesting area of development is that of generative models,
which one can hope to employ to generate molecules with properties similar to those of the training
set. See for example [LPB13; Duv+15; Col+17; HYL17] and [Kad+17; CKK17].
• A distance over the space $\mathcal{X}$ (here $\mathcal{X} = \{0, 1\}^p$ for binary fingerprints and $\mathcal{X} = \mathbb{N}^p$ for count
fingerprints) is a function $d$ such that:
$$d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$
$$d(x, x) = 0$$
$$d(x, z) = d(z, x) \quad \text{(symmetry)}$$
$$d(x, z) \leq d(x, w) + d(w, z) \quad \text{(triangle inequality)}.$$
– Manhattan distance:
$$d(x, z) = \|x - z\|_1 = \sum_{j=1}^{p} |x_j - z_j|.$$
– $\ell_\infty$ distance:
$$d(x, z) = \|x - z\|_\infty = \max_{j=1,\dots,p} |x_j - z_j|.$$
• In the special case of binary-valued vectors, $\mathcal{X} = \{0, 1\}^p$, one often uses the Hamming distance,
which counts the number of bits that differ between $x$ and $z$:
$$d(x, z) = \sum_{j=1}^{p} (x_j \ \mathrm{XOR}\ z_j).$$
Note: For binary vectors, this is equal to the Manhattan distance (and to the squared Euclidean distance).
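These distances are short to write down in plain Python, which also makes it easy to check the note above on a small example. The vectors below are made-up toy fingerprints, and the function names are assumptions of this sketch.

```python
def manhattan(x, z):
    return sum(abs(a - b) for a, b in zip(x, z))

def l_infinity(x, z):
    return max(abs(a - b) for a, b in zip(x, z))

def hamming(x, z):
    # XOR is 1 exactly when the two bits differ (binary vectors only).
    return sum(a ^ b for a, b in zip(x, z))

def squared_euclidean(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))

# Toy binary fingerprints: they differ in positions 0, 3 and 4.
x = [1, 0, 1, 1, 0]
z = [0, 0, 1, 0, 1]
print(hamming(x, z), manhattan(x, z), squared_euclidean(x, z))  # 3 3 3

# Toy count fingerprints: the three distances no longer coincide.
u = [3, 0, 2]
v = [1, 1, 2]
print(manhattan(u, v), l_infinity(u, v), squared_euclidean(u, v))  # 3 2 5
```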
2.2.2 Similarity between two vectors
The idea of a measure of similarity is converse to that of a distance: the closer two points are, the
smaller their distance and the larger their similarity.
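• Pearson's correlation:
$$\rho(x, z) = \frac{\sum_{j=1}^{p} (x_j - \bar{x})(z_j - \bar{z})}{\sqrt{\sum_{j=1}^{p} (x_j - \bar{x})^2} \, \sqrt{\sum_{j=1}^{p} (z_j - \bar{z})^2}},$$
where $\bar{x} = \frac{1}{p} \sum_{j=1}^{p} x_j$.
• Suppose the data is centered: $\bar{x} = \bar{z} = 0$. Then
$$\rho(x, z) = \frac{\sum_{j=1}^{p} x_j z_j}{\sqrt{\sum_{j=1}^{p} x_j^2} \, \sqrt{\sum_{j=1}^{p} z_j^2}} = \frac{\langle x, z \rangle}{\|x\| \, \|z\|} = \cos(\theta),$$
where $\theta$ is the angle between $x$ and $z$, as shown for two dimensions on the illustration. Hence, for centered data,
Pearson's correlation coincides with the cosine similarity.
[Figure: two vectors x and z in the (feature 1, feature 2) plane. Pearson's correlation between centered vectors x and z is given by the cosine of their angle.]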
• The dot product can be used to measure similarity between two vectors:
$$s(x, z) = \langle x, z \rangle = \sum_{j=1}^{p} x_j z_j.$$
Note that Pearson's correlation between two vectors that are centered and of norm 1 is their dot
product.
• The dot product can also be computed in another feature space, using a mapping $\Phi : \mathbb{R}^p \to \mathbb{R}^d$:
$$s(x, z) = \langle \Phi(x), \Phi(z) \rangle = \sum_{j=1}^{d} \Phi_j(x) \, \Phi_j(z).$$
• Such a similarity is a kernel, and can sometimes be computed directly in the space $\mathbb{R}^p$. For reminders
about kernels, see Appendix J.
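The relations between these similarities can be checked numerically. The sketch below computes Pearson's correlation as the cosine of the centered vectors, and then illustrates the kernel idea with a standard textbook example chosen here as an assumption: for $p = 2$, the squared dot product $\langle x, z \rangle^2$ equals the dot product of the images under the mapping $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. All function names are hypothetical.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def cosine(x, z):
    return dot(x, z) / (math.sqrt(dot(x, x)) * math.sqrt(dot(z, z)))

def center(x):
    mean = sum(x) / len(x)
    return [a - mean for a in x]

def pearson(x, z):
    # Pearson's correlation is the cosine similarity of the centered vectors.
    return cosine(center(x), center(z))

def phi(x):
    # Explicit feature map for the degree-2 kernel in two dimensions.
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, 4.0]
print(dot(phi(x), phi(z)))  # dot product computed in the feature space
print(dot(x, z) ** 2)       # same value, computed directly in R^p
```

Both printed values equal 121 up to floating-point rounding: the kernel gives the feature-space similarity without ever building $\Phi(x)$.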
• Count fingerprints, $\mathcal{X} = \mathbb{N}^p$: MinMax similarity
$$s(x, z) = \frac{\sum_{j=1}^{p} \min(x_j, z_j)}{\sum_{j=1}^{p} \max(x_j, z_j)}.$$
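A direct transcription of the MinMax similarity is given below as a sketch. The count vectors are made-up toy fingerprints, and returning 1.0 for two all-zero vectors is a convention chosen here, not something stated in these notes.

```python
def minmax_similarity(x, z):
    numerator = sum(min(a, b) for a, b in zip(x, z))
    denominator = sum(max(a, b) for a, b in zip(x, z))
    # Assumed convention: two empty fingerprints are considered identical.
    return numerator / denominator if denominator else 1.0

# Toy count fingerprints: min sums to 3, max sums to 6.
print(minmax_similarity([3, 0, 2], [1, 1, 2]))  # 0.5
```

On binary vectors the same formula reduces to the number of shared bits divided by the number of bits set in either vector.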
gistic/linear regression, k-nearest neighbors, naive Bayes, neural networks, random forests, support
vector machines. In what follows we present a few specific applications.
Performance of supervised models. One point we have not discussed much so far is how to measure
the performance of a supervised model. Let us assume a data set of $n$ samples $(x^i, y^i)$ and a predictor
$f$.
• For regression models, performance is typically measured using
– The mean absolute error:
$$\frac{1}{n} \sum_{i=1}^{n} |f(x^i) - y^i|.$$
– The correlation coefficient, i.e. Pearson's correlation between the predictions and
the true values (its square is sometimes reported as the coefficient of determination):
$$\frac{\sum_{i=1}^{n} (y^i - \bar{y}) \left( f(x^i) - \overline{f(x)} \right)}{\sqrt{\sum_{i=1}^{n} (y^i - \bar{y})^2} \, \sqrt{\sum_{i=1}^{n} \left( f(x^i) - \overline{f(x)} \right)^2}}, \quad \text{where } \bar{u} = \frac{1}{n} \sum_{i=1}^{n} u^i.$$
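Both regression metrics follow directly from their formulas. The sketch below uses made-up true values and predictions; the function names are assumptions.

```python
import math

def mean_absolute_error(y_true, y_pred):
    return sum(abs(p - y) for y, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(y_true, y_pred):
    y_bar = sum(y_true) / len(y_true)
    p_bar = sum(y_pred) / len(y_pred)
    numerator = sum((y - y_bar) * (p - p_bar) for y, p in zip(y_true, y_pred))
    denominator = math.sqrt(sum((y - y_bar) ** 2 for y in y_true)) * math.sqrt(
        sum((p - p_bar) ** 2 for p in y_pred)
    )
    return numerator / denominator

# Toy data: every prediction is off by exactly 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(pearson_r(y_true, y_pred))            # 2 / sqrt(5), about 0.894
```

The two metrics answer different questions: the first measures how far predictions are from the truth, the second only whether they vary together with it.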
• Many classifiers return a score rather than a binary answer. The score is then converted into a
binary answer with the help of a threshold: all samples with a score larger than the threshold are
labeled positive, and all other samples are labeled negative. For example, if a model outputs the
probability of the sample being positive, then it makes sense to label positive all samples for which
the output is greater than 1/2. However, the classifier might perform better with a threshold of, say,
0.45. For such models, it is interesting to look at the evolution of the above evaluation scores with
the threshold. In practice, people often look at:
– ROC curves (for Receiver Operating Characteristic), which are constructed by plotting, for each
possible threshold, the true positive rate vs the false positive rate. (The possible thresholds are
obtained by listing all predicted values in ascending order.) The ROC curve of a random predictor
follows the diagonal line; good models have a TPR that increases faster than their FPR and are
close to the upper left corner.
– PR curves, or Precision-Recall curves, work on the same principle but plot precision against recall.
Good predictors maintain a good precision as recall increases, and are close to the upper right
corner.
– ROC curves and PR curves can be summarized in a single number, the area under the curve. This
number is between 0 and 1, the higher the better.
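The construction of a ROC curve and of its area under the curve can be sketched as follows. The scores and labels are made-up toy data, and one convention is assumed: a sample is labeled positive when its score is greater than or equal to the threshold, so that using each predicted score as a threshold adds that sample to the positives.

```python
def roc_curve(scores, labels):
    """Return (FPR, TPR) points, from the strictest to the loosest threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # infinite threshold: nothing is labeled positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal rule on consecutive (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: 3 positives and 3 negatives, one negative ranked second.
scores = [0.95, 0.94, 0.9, 0.6, 0.4, 0.3]
labels = [1, 0, 1, 1, 0, 0]
points = roc_curve(scores, labels)
print(area_under_curve(points))  # 7/9, about 0.778
```

The value 7/9 is also the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is a common way to interpret the area under the ROC curve.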
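[Figure: example ROC curve comparing a random, a perfect and a real predictor; points on the real curve are annotated with the corresponding score thresholds (e.g. Inf, 0.95, 0.94, 0.9).]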
Unlike screening by docking or de novo design, which require the structure of the target protein to
be known, a vHTS algorithm is a ligand-based approach that uses data from a first exploratory HTS
experiment to predict the activity of new sets of compounds in silico. vHTS is used to facilitate the
selection of compounds for experimental screening in HTS bioassays and translates into additional
protein inhibitor, anti-cancer, and antibiotic leads, which would have otherwise been missed. Can
we leverage knowledge extracted from a previous exploratory screen to predict, among a library of
untested compounds, which ones are most likely to be active and should therefore be tested first?
Virtual screening can be formulated as a:
• Binary classification problem: classify compounds as active/inactive;
• Regression problem: predict the biological activity of compounds;
• Ranking problem: rank compounds by biological activity [Swa+09; Mar+17].
2.3.3 QSAR/QSPR
Structure-based virtual screening aims at minimizing the binding energy between the target protein
and a ligand. The idea of structure-based virtual screening is to use the 3D structure of the target
protein (and more importantly, of its binding pockets, although all atoms play a role) and that of a list
of potential ligands and try to find those ligands that fit best into the binding pockets.
Ideally, structures are obtained experimentally through X-ray crystallography, but homology modeling
makes it possible to infer the 3D structure using the sequence of the protein and the known structure
of homologous proteins. Similarly, the binding pocket is ideally obtained using the 3D structure of
the complex formed by the protein with a known ligand (one then just needs to “expand” around that
known ligand to find the shape of the pocket). If there is no known ligand, or if the complex has
not been crystallized, it is possible to predict the binding pocket (see e.g. PocketFinder in the DOCK
suite).
Ligand libraries: which small organic compounds to test to see whether they bind the target:
• NCI (∼ 275 000 compounds);
• ZINC ($35 \times 10^6$ commercially available compounds);
• Commercial libraries, e.g. Nanosyn (64 898 compounds), Enamine’s pharmacologically diverse set
(∼ 23 000 compounds);
• But also: possible to explore the virtual space of not-yet-available compounds!
The search spectrum. Virtual screening requires evaluating the binding affinity between a molecule
and a protein. If we focus on molecular modeling (as opposed to statistical modeling, as done
with the machine learning vHTS approach above), the problem is made difficult by:
• the need to account for 6 degrees of freedom (3 degrees of translation + 3 degrees of rotation), at
every point in the system, to position the ligand with respect to the protein;
• the fact that every atom interacts with every other atom.
Virtual screening can be performed at three scales:
• The local scale: quantum mechanics molecular modeling
Martin Karplus, Michael Levitt and Arieh Warshel, Nobel laureates 2013.
• The intermediate scale: molecular mechanics, molecular dynamics, Brownian dynamics.
• The global scale: molecular docking.
We will now detail each of these scales.
Other applications of QM/MM include protein structure prediction; protein folding mechanics; DNA/RNA
simulations; lipid layer (cell membrane) simulations; and the refinement of structures obtained from
X-ray or NMR.
Other applications of molecular dynamics include lead optimization, to improve on deficiencies in
the structure of a lead compound, while maintaining (or improving) its favorable ADME-Tox properties.
4 Molecular Docking
In docking, the space is discretized with a grid, and computations are only done on points on this
grid. The idea is to maximize the shape complementarity of the ligand and the protein pocket, while
minimizing the binding energy, which is approximated with a scoring function, which can be:
• Force-field based (e.g. DOCK, AutoDock). The biggest bottleneck here is the modelization of the
solvent, as accurate models such as those used in QM/MM are too time consuming to compute.
Several options are possible:
– Include a distance-based factor in the Coulomb term;
– Poisson-Boltzman/surface area (PB/SA);
– Generalized-Born/surface area (GB/SA) model.
• Empirical (e.g. FlexX, SCORE). Binding energies are approximated by a weighted sum of terms,
which are empirically fitted on a set of known protein-ligand complexes. Those are much faster to
compute, but less accurate.
• Knowledge-based (e.g. DrugScore, ITScore): Pairwise atom-atom potentials are obtained based on
the pair frequency and the inverse Boltzmann relation. These methods are a good intermediate (in
terms of speed and accuracy) between the force-field and the empirical approaches.
For more details, see e.g. [HGZ10].
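To illustrate two of the scoring families above, here is a minimal sketch of a force-field-style pairwise score (a 12-6 van der Waals term plus a Coulomb term with a distance-dependent dielectric) and of the inverse Boltzmann relation used by knowledge-based scores. All function names, the single atom type, the dielectric form eps(r) = 4r and the numerical constants are assumptions chosen for illustration; they are not the parameters of DOCK, AutoDock, DrugScore or any other program.

```python
import math

COULOMB_CONSTANT = 332.06  # approx. kcal*angstrom/(mol*e^2); illustrative value
KT_ROOM = 0.593            # approx. k_B*T in kcal/mol near 298 K


def lennard_jones(r, epsilon, sigma):
    """12-6 van der Waals term; its minimum is -epsilon at r = 2**(1/6) * sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb_distance_dependent(r, q_i, q_j, slope=4.0):
    """Coulomb term with a distance-dependent dielectric eps(r) = slope * r."""
    return COULOMB_CONSTANT * q_i * q_j / (slope * r * r)


def force_field_score(ligand, protein, epsilon=0.1, sigma=3.4):
    """Sum both terms over all ligand-protein atom pairs.

    Each atom is a tuple (x, y, z, partial_charge). Using one (epsilon, sigma)
    pair for every atom is a deliberate simplification.
    """
    total = 0.0
    for (xl, yl, zl, ql) in ligand:
        for (xp, yp, zp, qp) in protein:
            r = math.dist((xl, yl, zl), (xp, yp, zp))
            total += lennard_jones(r, epsilon, sigma)
            total += coulomb_distance_dependent(r, ql, qp)
    return total


def inverse_boltzmann(observed_density, reference_density, kt=KT_ROOM):
    """Knowledge-based pair potential: w(r) = -kT * ln(g_obs(r) / g_ref(r))."""
    return -kt * math.log(observed_density / reference_density)


# Toy example: one ligand atom and one protein atom, 4 angstroms apart.
toy_ligand = [(0.0, 0.0, 0.0, 0.4)]
toy_protein = [(4.0, 0.0, 0.0, -0.4)]
toy_score = force_field_score(toy_ligand, toy_protein)
```

With the inverse Boltzmann relation, atom pairs observed more often than expected at a given distance receive a favorable (negative) potential, and pairs observed less often receive an unfavorable one.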
Left: the ligand (in blue, licorice style) fits nicely within the pocket of the protein (in red, solvent
excluded surface style). Right: The ligand binds to the protein but the pose is wrong.
Source: Bernhard Knapp – Medical University of Vienna
http://www.meduniwien.ac.at/imc/einf i d med inf/2011 MD.pdf
5 Open questions and current challenges
In spite of the great progress made in the domain of drug design in the last 35 years, a number of
additional questions remain largely unsolved.
Mechanisms of action Many mechanisms of action are still unknown. Identifying the target(s) of
a molecular compound is the first step towards proposing a mechanism of action, which is usually
required to obtain an authorization to enter clinical assays. This is a major center of interest regarding
traditional plant medicines.
Synthesis mechanisms Many synthesis mechanisms are still unknown. Drugs then have to be har-
vested from nature, which can be a complex and costly process.
Adverse drug reaction prediction Undesirable side effects are a major cause of failure of clinical
trials and the main reason for withdrawal from market. They can be due to:
• Poor estimation of ADME-Tox properties
• Stereochemistry: Molecules often exist in multiple configurations. Movable bonds, such as rotatable
bonds, give rise to conformers; stereocenters give rise to symmetries and isomers. Enantiomers are
chiral molecules (i.e. molecules that cannot be superimposed with their mirror image; human hands
are an example of chiral objects) that are mirror images of each other. Stereoisomers are molecules
that have the same molecular graph, but differ in the three-dimensional position of their atoms.
It is not always possible to synthesize only one of multiple stereoisomers. Stereoisomers may have
different ADME-Tox properties and biological activities.
• Drug interactions:
– Drug-drug interactions, e.g. antacids prevent the absorption of many medicines; aspirin should
not be combined with blood thinners.
– Drug-food interactions, e.g. grapefruit interferes with the absorption of a number of medications;
leafy green vegetables, which contain vitamin K, may decrease the effectiveness of blood thinners;
black liquorice may decrease the activity of high blood pressure medications.
– Drug-condition interactions.
• Multiple targets (or drug promiscuity): a drug may bind not only to its primary, intended target, but
also to secondary targets, resulting in unwanted effects. Drug promiscuity can also lead to higher
efficacy, as with the antipsychotic drug clozapine, which can be more efficient than expected due to
hitting multiple targets related to the disease.
Also note that how tolerable adverse effects are depends on the disease being treated: while nausea
is a tolerable side effect of an anti-cancer medication, it is much less tolerable for a medication that
must be taken daily, for instance for high blood pressure, or for a less severe disease, such as a head
cold.
Drug repositioning Drug promiscuity can also lead to finding novel indications for existing drugs, as
they may hit targets for other diseases.
• Pharmacokinetic properties are already known;
• Development is potentially faster;
• Intellectual property and regulatory issues remain.
The importance of the question is illustrated by a recent NIH program devoted to “discovering new
therapeutic uses for existing molecules”: http://www.ncats.nih.gov/ntu
Drug specificity On the other hand, can we design drugs that bind specifically to the intended target
and to no other?
Example: Drugs targeting kinases, often used in cancer treatment, are particularly prone to suffering
from lack of specificity.
“A typical protein kinase must recognize between one and a few hundred bona fide phosphorylation
sites in a background of about 700,000 potentially phosphorylatable residues. Multiple mechanisms
have evolved that contribute to this exquisite specificity, including the structure of the catalytic site,
local and distal interactions between the kinase and substrate, the formation of complexes with scaf-
folding and adaptor proteins that spatially regulate the kinase, systems-level competition between
substrates, and error-correction mechanisms. The responsibility for the recognition of substrates by
protein kinases appears to be distributed among a large number of independent, imperfect specificity
mechanisms.” [UF07]
Precision medicine
• Drug response: Will this patient respond or not to the treatment?
• Can one find a function f that takes information (clinical, genetic, otherwise) about the patient and
returns appropriate drug(s)?
• Can one find which biomarkers are predictive of the patient’s response?
[HGZ10] Sheng-You Huang, Sam Z. Grinter, and Xiaoqin Zou. “Scoring functions and their evaluation
methods for protein–ligand docking: recent advances and future directions”. In: Physical
Chemistry Chemical Physics 12.40 (2010), p. 12899.
[HYL17] William L. Hamilton, Rex Ying, and Jure Leskovec. “Representation learning on graphs:
methods and applications”. In: arxiv:1709.05584 [cs] (2017). arXiv: 1709.05584.
[JM90] Mark A Johnson and Gerald M Maggiora. “Concepts and applications of molecular similar-
ity”. In: J. Comput. Chem. 13.4 (1990). Ed. by John Wiley & Sons, pp. 539–540.
[JV08] Laurent Jacob and Jean-Philippe Vert. “Protein-ligand interaction prediction: an improved
chemogenomics approach”. In: Bioinformatics 24.19 (2008), pp. 2149–2156.
[Kad+17] Artur Kadurin et al. “druGAN: an advanced generative adversarial autoencoder model for
de novo generation of new molecules with desired molecular properties in silico”. In: Mol.
Pharmaceutics 14.9 (2017), pp. 3098–3104.
[Kel+06] Esther Kellenberger et al. “sc-PDB: An annotated database of druggable binding sites from
the protein data bank”. In: J. Chem. Inf. Model. 46.2 (2006), pp. 717–727.
[Les+04] Christina S. Leslie et al. “Mismatch string kernels for discriminative protein classification”.
In: Bioinformatics 20.4 (2004), pp. 467–476.
[LPB13] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. “Deep architectures and deep learn-
ing in chemoinformatics: the prediction of aqueous solubility for drug-like molecules”. In:
J. Chem. Inf. Model. 53.7 (2013), pp. 1563–1575.
[LPC98] J. Lazarou, B. H. Pomeranz, and P. N. Corey. “Incidence of adverse drug reactions in hospital-
ized patients: a meta-analysis of prospective studies”. In: JAMA 279.15 (1998), pp. 1200–
1205.
[LW71] Michael Levandowsky and David Winter. “Distance between Sets”. In: Nature 234.5323
(1971), pp. 34–35.
[Mar+17] E. J. Martin et al. “Profile-QSAR 2.0: Kinase virtual screening accuracy comparable to four-
concentration IC50s for realistically novel compounds”. In: Journal of Chemical Information
and Modeling 49.4 (2017), pp. 756–766.
[MC+11] Giarratana M.-C. et al. “Proof of principle for transfusion of in vitro-generated red blood
cells”. In: Blood 118.19 (2011), pp. 5071–5079.
[Mig+12] A. Miguel et al. “Frequency of adverse drug reactions in hospitalized patients: a systematic
review and meta-analysis.” In: Pharmacoepidemiol Drug Saf 21.11 (2012), pp. 1139–1154.
[Mor65] H. L. Morgan. “The Generation of a Unique Machine Description for Chemical Structures-A
Technique Developed at Chemical Abstracts Service.” In: J. Chem. Doc. 5.2 (1965), pp. 107–
113.
[O’B12] Noel M. O’Boyle. “Towards a Universal SMILES representation - A standard method to gen-
erate canonical SMILES based on the InChI”. In: Journal of Cheminformatics 4.1 (2012), p. 22.
[Pee] “Special delivery: targeted therapy with small RNAs”. In: Gene therapy 18 (2011), pp. 1127–
1133.
[RBW03] John W Raymond, C. John Blankley, and Peter Willett. “Comparison of chemical clustering
methods using graph- and fingerprint-based similarity measures”. In: Journal of Molecular
Graphics and Modelling 21.5 (2003), pp. 421–433. (Visited on 10/25/2014).
[Sai+04] Hiroto Saigo et al. “Protein homology detection using string alignment kernels”. In: Bioin-
formatics 20.11 (2004), pp. 1682–1689.
[Sch15] Nicholas J. Schork. “Personalized medicine: Time for one-person trials”. In: Nature News
520.7549 (2015), p. 609.
[Swa+09] S. Joshua Swamidass et al. “Influence Relevance Voting: An Accurate And Interpretable Vir-
tual High Throughput Screening Method”. In: Journal of Chemical Information and Modeling
49.4 (2009), pp. 756–766.
[UF07] Jeffrey A. Ubersax and James E. Ferrell. “Mechanisms of specificity in protein phosphoryla-
tion”. In: Nat. Rev. Mol. Cell Biol. 8.7 (2007), pp. 530–541.
[WWW89] David Weininger, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for
generation of unique SMILES notation”. In: J. Chem. Inf. Comput. Sci. 29.2 (1989), pp. 97–
101.
Serendipity
Serendipity: Chance discoveries that have been exploited with sagacity
• In the laboratory
– Acetaminophen (paracetamol) is a derivative of acetanilide. The analgesic effects of acetanilide
were discovered when an inexperienced pharmacist mistakenly gave it to doctors in place of the
naphthalene they were investigating for treating a patient suffering from intestinal parasites.
– Cisplatin was long known as Peyrone’s salt when a team led by Barnett Rosenberg, which was
investigating the effect of an electric current on the growth of E. coli cells, discovered that the
impressive cell elongation they were observing was not due to the electric current but to the
cisplatin produced by the reaction of the platinum electrodes with the nutrient solution in which
the bacteria were. It is now used as a cytotoxic in the treatment of cancer.
– Heparin is an anticoagulant that was discovered by a scientist searching for a procoagulant in dog
liver.
– Penicillin, the first antibiotic to successfully treat bacterial infection, was discovered when Alexan-
der Fleming realized that a petri dish containing a culture of Staphylococcus bacteria had been
contaminated by a mold that was now killing it.
• During clinical trials
– Dimenhydrinate (Dramamine) was developed as an antihistamine but is now used against travel
sickness, thanks to a chance observation of participants in the clinical trials.
– Sildenafil (Viagra) was initially developed as a heart medicine for specific use against angina
pectoris. During clinical trials it proved to have little effect against angina, but provoked many
penile erections.
– The bacteria must be isolated and grown in culture;
– The disease must be reproduced when a healthy host is inoculated with the bacteria;
– The bacteria must be found again in the diseased host.
In spite of a number of limitations (bacteria that are hard to grow in culture; immunocompromised
patient; differences between patients), Koch’s postulates are still useful guidelines nowadays.
Late 1980s: The AIDS epidemic
Considerable resources led to considerable development
• in immunology;
• in combinatorial chemistry to develop new drug candidates (see Appendix C);
• in the automation of drug discovery experiments with robotics.
Source: compoundchem.com.
decrease the response to the natural activator.
GPCRs: G Protein-Coupled Receptors are proteins which play key roles in cell signaling. They are
the main targets of allosteric modulators. Loratadine (Claritin) is an anti-histamine that relieves
allergies by blocking the histamine receptor. A number of antidepressant medications (Prozac,
Zoloft) affect the serotonin receptor. The adrenergic receptors affected by beta-blockers also
belong to the GPCR family.
One also encounters the terms “inhibitors” and “activators”, usually referring to a target that is an
enzyme. Enzymes are proteins that regulate the rate of chemical reactions.
• An inhibitor inhibits the activity of an enzyme.
E.g. Aspirin inhibits the activity of the cyclooxygenases COX-1 and COX-2, which are responsible
for the formation of prostaglandins, themselves involved in inflammatory response; HIV protease
inhibitors.
• An inducer or activator increases the activity of an enzyme.
E.g. Barbiturates and benzodiazepines. Both barbiturates and benzodiazepines activate GABA-A
receptors. GABA is an essential neurotransmitter in the central nervous system (CNS) of
mammals. The binding sites of barbiturates, GABA and benzodiazepines are distinct. This explains
in part why barbiturates are more dangerous in overdose.
compounds. Computers also made it possible to evaluate the ability of those novel compounds to
bind to HIV-1 protease by computing binding affinities based on the physical laws of chemistry.
Unfortunately, limitations in computing power meant that a number of approximations had to be used,
resulting in poor estimates of actual ligand-receptor binding energies. To this day, this remains a
limitation of de novo drug design. Indinavir, commercialized as Crixivan, is a protease inhibitor that was
approved by the FDA in 1996.
E Assays
Pre-clinical assays This phase often involves in vivo experiments on animals. When a lead molecule
with good drug-likeness is discovered, the preclinical data collected during lead optimization, which
aims at demonstrating safety in animals, is used to file a new drug application with the FDA (Food and
Drug Administration) in the USA or the European Medicines Agency (EMA, formerly the European
Agency for the Evaluation of Medicinal Products, EMEA) in Europe, and to obtain permission to
enter Phase I clinical trials.
• in vitro assays
• in vivo assays on animals to evaluate
– toxicity, mutagenicity, carcinogenicity
– pharmacokinetics
– efficacy
The animals being used are generally rodents (mice, rats). Pigs are also favored for their biologi-
cal closeness to humans. Once a drug’s interest has been demonstrated, trials will sometimes be
conducted on primates.
• Animal research regulations: The types of experiments that can be conducted on animals are
carefully regulated. Policies are put in place to ensure that animals receive a certain standard of care
and treatment and aren’t subjected to unnecessary pain. Ethical boards and review committees are
often mandatory to control the use of animals for scientific experimentation.
– U.S.A: Animal Welfare Act.
– France:
∗ http://www.enseignementsup-recherche.gouv.fr/pid29417/utilisation-des-animaux-a-des-fins-scientifiques.html
∗ https://www.inserm.fr/recherche-inserm/ethique/utilisation-animaux-fins-recherche
– Basel declaration: the 3R principle (Replace, Reduce, Refine) http://www.basel-declaration.org/
→ pre-approval for permission to enter Phase I clinical trial.
Clinical assays
• Phase I: Tests on a small number of healthy volunteers to determine
– maximal tolerated dose: the highest dose that does not produce unacceptable toxicity;
– pharmacokinetics;
– adverse effects.
Phase I trials in oncology are already conducted on patients.
• Phase II: Tests on a small number (a few dozen) of patients to determine
– most appropriate dosage;
– efficacy;
– tolerance.
• Phase III: Double-blind, placebo-controlled trials to confirm tolerance and efficacy. These larger
trials usually involve hundreds if not thousands of patients. See
https://www.inserm.fr/recherche-inserm/recherche-clinique/etre-volontaire-essai-clinique
Regulatory agencies can always require additional trials.
Pharmacovigilance, or drug safety, is concerned with the collection, detection, assessment, monitoring,
and prevention of adverse drug reactions. Phase IV can be considered as the first real-world test of the
drug. Indeed, the true safety profile of a drug can only be characterized by ongoing safety surveillance,
through an adverse event monitoring system and a continuing post-marketing surveillance study.
Phase IV trials never end as long as the drug is being sold. Drug safety is constantly monitored
through:
• Phase IV trials, or postmarketing surveillance, take place after the drug has been approved for sale.
They can be required by regulatory agencies for further monitoring of some adverse effects, or
undertaken by the pharmaceutical company to evaluate for instance drug interactions, effects in a given
subpopulation, or the potential of the drug for uses other than the one it has been approved for.
• Non-Interventional Studies: Keep evaluating tolerance and efficacy in a large-scale, real-world set-
ting.
• Adverse Event Reporting is conducted by healthcare professionals and patients, and addressed to
both pharmaceutical companies and regulatory agencies.
• 106 full members + 33 associate members.
• Goals:
– Enhance patient care and safety;
– Provide reliable information for the effective assessment of the risk-benefit profile of medicine.
• Centralization: the Uppsala Monitoring Centre http://www.who-umc.org/
Regulatory agencies
• In the U.S., the FDA (Food and Drug Administration) is in charge of public health (also includes food
safety, medical devices, tobacco products).
• In Europe, the EMA (European Medicines Agency) harmonizes the work of national regulatory agen-
cies.
• In France, the ANSM (Agence Nationale de Sécurité du Médicament et des produits de santé, Na-
tional Agency for Medication and Health Products Safety).
protein only found on B cells. The immune system is then more prone to identify B cells as a
target. This treatment is used against B-cell lymphomas; although it also lowers the number
of healthy B cells, the body produces new healthy B cells to replace them.
– Block growth factor receptors. This prevents growth factors (chemicals that attach to these
receptors to signal the cells to grow) from getting through. E.g. Cetuximab blocks the epidermal growth
factor receptor (EGFR) and is used to slow down or block the progression of colon, head and neck cancers.
– Block angiogenesis. Similarly to how monoclonal antibodies can block growth factors, they can
block the signals used by cancerous cells to attract blood vessels that will bring them the oxygen
and nutrients they need to grow. E.g. Bevacizumab intercepts the vascular endothelial growth
factor (VEGF).
– Deliver radiation or chemotherapy directly to cancer cells. E.g. Ibritumomab, a treatment used
against non-Hodgkin’s lymphoma, combines a monoclonal antibody with radioactive particles,
which then attach to cancerous blood cells and irradiate them. In HER2+ breast cancer,
ado-trastuzumab emtansine uses trastuzumab to deliver the cytotoxic agent emtansine specifically to
cancer cells carrying the HER2 receptor.
• Gene therapy
– Replace a mutated gene with a healthy copy.
– Introduce a new gene.
– Inactivate a mutated gene: small RNA therapy to knock down the expression of disease-causing
genes [Pee]. https://pharmaphorum.com/r-d/views-analysis-r-d/the-promising-future-of-rna/
– Delivery: viral (integration of the genetic material into the host DNA) or non-viral (e.g. injection
of naked DNA or of oligonucleotides). Targeting the right region of the genome might also be
difficult.
– Issues:
∗ short-lived: The rapid division of cells means that patients must be treated multiple times for
the therapeutic DNA to be fully integrated into their genome.
∗ the immune response of the patient may lead to rejecting the treatment, particularly if multiple
inoculations are required.
∗ the use of viral vectors is associated with increased risks of toxicity and inflammatory response.
∗ complex traits, that is to say, those that are due to mutations in multiple genes, cannot be easily
treated this way, as one would have to target multiple (and yet mostly undiscovered) regions
of the genome.
∗ mutagenesis can be induced by the integration of DNA in a sensitive spot, in which case the
therapy would lead to an increased risk of cancer. This happened with the first clinical trials
for X-linked severe combined immunodeficiency (“bubble baby disease”), in which 3 out of 20
patients developed leukemia.
– Gene therapy holds promise for treating a number of diseases, in particular cancers and
autoimmune diseases. Currently, it is mostly available as part of clinical trials, for severe diseases
that have no other known cures. In 2012, Alipogene tiparvovec (commercialized as Glybera), a
treatment for a rare inherited disorder called lipoprotein lipase deficiency, became the first gene
therapy to be approved in either Europe or the US.
• Cell therapy
– Transplanting cells from donor to patient.
– E.g. bone marrow transplants.
– Transfusion of red blood cells generated in vitro from the patient’s own stem cells [MC+11].
– This is not to be confused with the alternative medicine meaning of “cell therapy”, whereby
illnesses are “treated” by the injection of animal cells. Since cells from another species cannot
replace human cells, this is unlikely to ever work. In addition, serious adverse effects have been
reported, and current scientific evidence does not support the claim that this type of “cell therapy”
is effective in treating cancer or any other disease.
Cours d’Alain Fischer au Collège de France:
http://www.college-de-france.fr/site/alain-fischer/p8400912045226082 content.htm
• 3D-DOCK Suite, which includes FTDock, which performs rigid-body docking between biomolecules;
RPScore, which uses pair potentials to screen output from FTDock; and MultiDock, which performs
multiple copy side-chain refinement;
• FIRST, which analyzes the flexibility in molecular structures of any size, and quickly explores the
available conformational space of the input molecule;
• FTDOCK, a program for carrying out rigid-body docking between biomolecules;
• GROMOS, a general-purpose molecular dynamics computer simulation package for the study of
biomolecular systems;
• GROMACS, a complete modelling package for proteins, membrane systems and more, including
fast molecular dynamics, normal mode analysis, essential dynamics analysis and many trajectory
analysis utilities;
• MolSoft ICM programs and modules for applications including structure analysis, modeling,
docking, homology modeling and virtual ligand screening;
• NAMD, a parallel object-oriented molecular dynamics simulation program;
• OpenContact, an open source, PC software tool for quickly mapping the energetically dominant
atom-atom interactions between chains or domains of a given protein;
• YASARA, a complete molecular graphics and modeling program, including interactive molecular dy-
namics simulations, structure determination, analysis and prediction, docking, movies and eLearn-
ing;
• ZMM, an Internal Coordinate Molecular Modeling Program for theoretical studies of systems of any
complexity: small molecules, peptides, proteins, nucleic acids, and ligand-receptor complexes.
I Canonical SMILES
OCC, [CH3][CH2][OH], C-C-O and C(O)C all represent the structure of ethanol and can be represented
by the canonical SMILES string CCO.
Multiple canonicalization algorithms exist [WWW89; O’B12]. We will describe here how to use Mor-
gan’s algorithm to decide in which order to visit the nodes of the graph.
Example: Morgan labelling of proline.
(a) First, we label each heavy atom by its number of heavy atom neighbors. There are 3 different labels.
(b) We then replace each atom’s label by the sum of its neighbors’ labels. There are 4 different labels.
(c) And again. There are 5 different labels.
(d) If we repeat the process again, we still do not get more than 5 different labels. We stop iterating here.
(e) Finally, we relabel as 1 the atom with highest label (35). Then we assign 2 to its neighbor with
highest label (25) and 3 to both other neighbors, which are indistinguishable (symmetry equivalent).
We then move to atom 2 and label its neighbors. The double-bonded O, having the highest bond order
from the issuing atom, gets label 4, and the other O gets label 5. Finally the neighbors of the nodes
labeled 3 get label 6.
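The iterative part of this example, steps (a) to (d), can be sketched in a few lines. The function name `morgan_labels`, the adjacency-list representation and the PDB-style atom names used for proline are assumptions made for illustration. The sketch keeps the labels of the last iteration computed, as in the example above, and does not implement the final renumbering of step (e).

```python
def morgan_labels(neighbors):
    """Iterate Morgan's connectivity labels on a molecular graph.

    neighbors maps each heavy atom to the list of its heavy-atom neighbors.
    Returns (labels, history), where history[i] is the number of distinct
    labels after iteration i. Iteration stops as soon as that number no
    longer increases.
    """
    # Step (a): initial label = number of heavy-atom neighbors.
    labels = {atom: len(adj) for atom, adj in neighbors.items()}
    history = [len(set(labels.values()))]
    while True:
        # Steps (b), (c), ...: new label = sum of the neighbors' labels.
        labels = {atom: sum(labels[n] for n in adj) for atom, adj in neighbors.items()}
        history.append(len(set(labels.values())))
        if history[-1] <= history[-2]:
            return labels, history


# Heavy-atom graph of proline (ring N-CA-CB-CG-CD, carboxyl group on CA).
proline = {
    "N": ["CA", "CD"],
    "CA": ["N", "CB", "C"],
    "CB": ["CA", "CG"],
    "CG": ["CB", "CD"],
    "CD": ["CG", "N"],
    "C": ["CA", "O", "OXT"],
    "O": ["C"],
    "OXT": ["C"],
}

labels, history = morgan_labels(proline)
print(history)                   # [3, 4, 5, 5]
print(labels["CA"], labels["C"])  # 35 25
```

The two ring neighbors of the atom labeled 35 (here `N` and `CB`) end up with the same label because the procedure only looks at connectivity, not at element types; this is the symmetry equivalence mentioned in step (e).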
– Use a fixed fingerprint length d. The entry at bit b is based on the sum of all entries at bits b, b + d,
b + 2d, ... of the uncompressed fingerprint: it is set to 1 if this sum is non-zero (i.e. at least one
of the bits is non-zero) and 0 otherwise. Because the fingerprints are very sparse, the compressed
version will still have many zeros. A zero in the compressed version indicates that all bits that
have been “folded” together were set to 0. A one only indicates that at least one of those bits
was set to 1; this is where information is lost.
– Typically, 512 or 1024 bits are used.
• Elias-Gamma Monotone Encoding (lossless)
– Encode the first non-zero index $j_0$ following the Elias-Gamma encoding, i.e. decompose it into its
highest power of 2 plus the rest, $j_0 = 2^N + m$, encode $N$ in unary, i.e. $N$ zeros followed by a 1,
and append $m$ in binary over $N$ bits. E.g. $9 = 2^3 + 1 \to 0001001$.
→ $(N + 1)$ (unary) bits + $N$ (binary) bits.
The trailing zeros are unnecessary → $2 \times \lfloor \log(j) \rfloor$ bits required.
– For the following non-zero indexes, encode $j_{i+1} - j_i$ similarly.
– For the ChemDB example: average compressed size = 1 800 bits, with no loss, versus 5 700 bits
for the naive encoding.
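Both compression schemes can be sketched as follows. The function names and the representation of a fingerprint as a list of the indices of its non-zero bits are assumptions made for illustration. Folding is written with 0-based bit indices, whereas the Elias-Gamma encoder assumes distinct 1-based indices, since the code is only defined for integers greater than or equal to 1.

```python
def fold_fingerprint(on_bits, d):
    """Lossy folding to length d: bit b of the result is the OR of bits
    b, b + d, b + 2d, ... of the uncompressed fingerprint (0-based indices)."""
    return sorted({j % d for j in on_bits})


def elias_gamma(j):
    """Elias-Gamma code of an integer j >= 1, as a string of '0' and '1'."""
    if j < 1:
        raise ValueError("Elias-Gamma only encodes integers >= 1")
    n = j.bit_length() - 1           # j = 2**n + m with 0 <= m < 2**n
    m = j - (1 << n)
    remainder = format(m, "b").zfill(n) if n > 0 else ""
    return "0" * n + "1" + remainder  # n zeros, a 1, then m over n bits


def encode_monotone(on_bits):
    """Lossless encoding of the distinct, 1-based indices of the non-zero bits:
    the first index, then each successive gap, all Elias-Gamma coded."""
    code, previous = [], 0
    for j in sorted(on_bits):
        code.append(elias_gamma(j - previous))
        previous = j
    return "".join(code)


print(elias_gamma(9))                         # 0001001
print(encode_monotone([9, 10, 14]))           # 0001001 + 1 + 00100
print(fold_fingerprint([3, 515, 1030], 512))  # [3, 6]
```

In the folding example, bits 3 and 515 collide on folded bit 3: from the folded fingerprint alone one can no longer tell which of the two was set, whereas the gap encoding can be decoded exactly.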
J Kernels
• Any function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is symmetric and positive semi-definite, i.e. such that for any
$n \in \mathbb{N}$ and any $\{x_1, x_2, \dots, x_n\} \subset \mathcal{X}$, the $n \times n$ matrix $K$ defined by $K_{ij} = k(x_i, x_j)$ is positive
semi-definite, is a kernel. That is to say, given one such function $k$, there exists a Hilbert space
$\mathcal{H}$ (a Hilbert space is a vector space with a dot product; in essence, think of $\mathbb{R}^d$ or $\mathbb{C}^d$ where
$d \in \mathbb{N} \cup \{+\infty\}$) and a function $\Phi : \mathcal{X} \to \mathcal{H}$ such that $k(x, z) = \langle \Phi(x), \Phi(z) \rangle$.
• Example: the polynomial kernel, for $x, z \in \mathbb{R}^p$,
$$k(x, z) = \left( \langle x, z \rangle + c \right)^m,$$
where $c \in \mathbb{R}$ and $m \in \mathbb{N}$.
This kernel corresponds to a mapping $\Phi$ to a space of dimension $d \gg p$, as this new feature space
contains all monomials of $p$ variables of degree up to $m$.
• Example: the Gaussian kernel
$$k(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right).$$
This kernel corresponds to a mapping $\Phi$ to a space of infinite dimension that contains all monomials
of $p$ variables of any degree.
• When a machine learning algorithm only needs to access data points through their dot products
with other data points, one can apply the kernel trick, which consists
in replacing the dot product with a kernel. This is equivalent to applying the algorithm in the
feature space to which Φ maps, but without doing computations in this space, which is interesting
computationally when this feature space is very large (or infinite-dimensional).
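• Example: kernel ridge regression. Given data $(X, y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, where the rows of $X$ are the samples
$x_1, x_2, \dots, x_n$, ridge regression consists in
– Learning: Find the vector of regression weights $\beta \in \mathbb{R}^p$ that minimizes
$$\sum_{i=1}^{n} \left( y_i - \langle \beta, x_i \rangle \right)^2 + \lambda \|\beta\|_2^2,$$
– Predicting: Given a sample $x \in \mathbb{R}^p$, return $f(x) = \langle \beta, x \rangle$.
Some algebraic manipulations allow us to rewrite the prediction function as
$$f(x) = x^\top X^\top \left( \lambda I_n + X X^\top \right)^{-1} y = \kappa^\top \left( \lambda I_n + K \right)^{-1} y,$$
where $\kappa \in \mathbb{R}^n$ is the vector such that $\kappa_i = \langle x, x_i \rangle$ and $K \in \mathbb{R}^{n \times n}$ is the matrix such that
$K_{ij} = \langle x_i, x_j \rangle$. Therefore the ridge regression model can be fitted and applied using $x_1, x_2, \dots, x_n$ and
$x$ only inside dot products, which can be replaced with kernels, using $\kappa_i = k(x, x_i)$ and
$K_{ij} = k(x_i, x_j)$.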
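The kernel trick for ridge regression can be checked numerically. The sketch below uses NumPy and random data; the function names, the random seed and the values of n, p, λ and σ are all assumptions made for illustration. It verifies that the dual form with the plain dot product gives the same predictions as the usual ridge solution β = (XᵀX + λI_p)⁻¹Xᵀy, and then swaps in the Gaussian kernel without changing anything else.

```python
import numpy as np


def linear_kernel(a, b):
    """Plain dot products between the rows of a and the rows of b."""
    return a @ b.T


def polynomial_kernel(a, b, c=1.0, m=2):
    """(<x, z> + c)^m for every pair of rows."""
    return (a @ b.T + c) ** m


def gaussian_kernel(a, b, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)) for every pair of rows."""
    sq_norms_a = (a ** 2).sum(axis=1)[:, None]
    sq_norms_b = (b ** 2).sum(axis=1)[None, :]
    sq_dists = np.maximum(sq_norms_a + sq_norms_b - 2.0 * (a @ b.T), 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def kernel_ridge_fit(K, y, lam):
    """Dual coefficients alpha = (lam * I_n + K)^{-1} y."""
    return np.linalg.solve(lam * np.eye(K.shape[0]) + K, y)


def kernel_ridge_predict(kappa, alpha):
    """kappa[i, j] = k(new sample i, training sample j)."""
    return kappa @ alpha


rng = np.random.default_rng(0)
n, p, lam = 20, 5, 0.5
X = rng.normal(size=(n, p))      # rows are the samples x_1, ..., x_n
y = rng.normal(size=n)
X_new = rng.normal(size=(3, p))

# Primal ridge regression, computed in the original p-dimensional space.
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
primal_pred = X_new @ beta

# Dual form with the linear kernel: only dot products between samples are used.
alpha = kernel_ridge_fit(linear_kernel(X, X), y, lam)
dual_pred = kernel_ridge_predict(linear_kernel(X_new, X), alpha)

# Same code with the Gaussian kernel: a non-linear model in an implicit feature space.
alpha_rbf = kernel_ridge_fit(gaussian_kernel(X, X), y, lam)
rbf_pred = kernel_ridge_predict(gaussian_kernel(X_new, X), alpha_rbf)

print(np.allclose(primal_pred, dual_pred))
```

Note that the dual form solves an n × n system instead of a p × p one, so it pays off when the feature space is much larger than the number of samples, which is always the case for the Gaussian kernel.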