
International Journal of Molecular Sciences

Article
Structure Driven Prediction of Chromatographic Retention
Times: Applications to Pharmaceutical Analysis
Roman Szucs 1,2,*, Roland Brown 1, Claudio Brunelli 1, James C. Heaton 1 and Jasna Hradski 2

1 Pfizer R&D UK Limited, Ramsgate Road, Sandwich CT13 9NJ, UK; roland.brown@pfizer.com (R.B.);
claudio.brunelli@pfizer.com (C.B.); james.heaton@pfizer.com (J.C.H.)
2 Department of Analytical Chemistry, Faculty of Natural Sciences, Comenius University in Bratislava,
Mlynská Dolina CH2, Ilkovičova 6, SK-84215 Bratislava, Slovakia; hradski1@uniba.sk
* Correspondence: roman.szucs@pfizer.com

Abstract: Pharmaceutical drug development relies heavily on the use of Reversed-Phase Liquid
Chromatography methods. These methods are used to characterize active pharmaceutical ingredients
and drug products by separating the main component from related substances such as process related
impurities or main component degradation products. The results presented here indicate that retention
models based on Quantitative Structure Retention Relationships can be used for de-risking methods
used in pharmaceutical analysis and for the identification of optimal conditions for separation of
known sample constituents from postulated/hypothetical components. The prediction of retention
times for hypothetical components in established methods is highly valuable as these compounds
are not usually readily available for analysis. Here we discuss the development and optimization of
retention models, selection of the most relevant structural molecular descriptors, regression model
building and validation. We also present a practical example applied to chromatographic method
development and discuss the accuracy of these models on selection of optimal separation parameters.

Keywords: Quantitative Structure Retention Relationships; chromatographic method development;
pharmaceutical analysis

Citation: Szucs, R.; Brown, R.; Brunelli, C.; Heaton, J.C.; Hradski, J. Structure Driven Prediction of
Chromatographic Retention Times: Applications to Pharmaceutical Analysis. Int. J. Mol. Sci. 2021, 22,
3848. https://doi.org/10.3390/ijms22083848

1. Introduction
Pharmaceutical analysis is an important area of chemical analysis used to support
diverse and excessively complex activities associated with drug development. The application
of Reversed-Phase Liquid Chromatography (RP-LC) is ubiquitous in the support
of process chemistry optimisation, formulation development as well as key quality control
assessment for the release of materials designated for all stages of pre-clinical and
clinical trials.
In process chemistry development, RP-LC is commonly used to assess the assay/purity
of starting materials, isolated synthetic intermediates and Active Pharmaceutical Ingredients
(APIs). This usually requires baseline separation of all known components of complex
mixtures, their identification and subsequent quantitation. This is performed in accordance
with the International Council for Harmonisation of Technical Requirements for
Pharmaceuticals for Human Use guidelines as applied to product specification, impurities
management and method validation [1–3]. In addition, purging of process related impurities,
synthetic by-products and key degradants requires their chromatographic monitoring
at all relevant interventions (e.g., isolation steps). Process chemistry understanding relies
heavily on the application of RP-LC. Chemists are required to understand the impact of
synthetic parameters on the quality of their processes which make important starting materials,
intermediates and final API. This is an essential requirement of commercial synthetic route
development. Lastly, the understanding of degradation also requires chromatographic
separation of key degradation products from the main component and their subsequent
identification and quantitation [4–6].

Academic Editor: Josef Jampilek
Received: 26 March 2021
Accepted: 6 April 2021
Published: 8 April 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).

Int. J. Mol. Sci. 2021, 22, 3848. https://doi.org/10.3390/ijms22083848 https://www.mdpi.com/journal/ijms



In order to nominate a commercial synthetic process, chemists often generate a relatively
large number of hypothetical chemical structures that could be generated as process
related impurities. These could be formed as by-products due to impurities present in
starting materials or due to side reactions. Another potential for the formation of these
undesirable components is degradation reactions which take place either during synthesis
or during storage. Realistically, many of these theoretical components will never be
observed. However, the analytical methodology (e.g., RP-LC) supporting synthetic process
development should be able to at least detect them, if indeed they were to form under
certain extreme synthetic or storage conditions.
The satisfactory performance of chromatographic methods can only be guaranteed
for a defined/known sample composition. This can be referred to as the key predictive
sample set (KPSS), for which the given method was developed and subsequently validated.
Evolving requirements during pharmaceutical development, such as changes in synthetic
or formulation processes, may lead to alteration of the KPSS, for example in order to
manage new process related impurities or degradation products. Examples may include
changes in sources of synthetic starting materials, alteration of process chemistry conditions
or formulation manufacturing parameters. An inherent, if perhaps somewhat obvious,
constraint of chromatography lies in the fact that unless these new KPSS components are
physically available, e.g., obtained either by synthesis or purification, it has been virtually
impossible to predict whether the current version of the chromatographic method used to
support particular synthetic or formulation development activities will be able to detect
and quantify them. One way to overcome this is to either synthesise these components or
to obtain them by purification. The production of compounds whose sole purpose is to
de-risk existing analytical methodology, and which may never be formed under “normal”
conditions, may be costly and time consuming. Consequently, such activities are often
deferred to the latter stages of the drug development lifecycle. Should it transpire, at
this stage, that the existing chromatographic method is not capable of detecting these
components, should they form, method re-development activity may be triggered. At best,
this would necessitate the considerable effort of repeat method robustness and validation
work, followed by retest of samples previously analysed using the insufficiently selective
methodology. At worst, prior development decisions, for example around synthetic process
or formulation, made on the basis of what now proves to be incomplete information, may
then need to be revisited.
Chromatographic method development typically starts with the definition of require-
ments for the capabilities of the analytical technique being used. However, significant
consideration is also paid to final product specification. Method performance understanding
includes at least the following parameters: minimal tolerable resolution of key components
and determination of the accuracy, precision and sensitivity or range requirements. Once
the necessary performance criteria are understood, and the separation mode capable of
achieving these is identified, the next step in the method development process is to select
a suitable combination of the stationary phase [7], mobile phase (solvent) and pH. This critical
step, which ultimately affects the robustness of the method, is in present-day analytical
laboratories carried out by a combination of experimental screening [8] and in-silico
prediction tools. These tools, capable of calculating key physicochemical properties (e.g., logP,
pKa, aqueous/solvent solubilities) of pharmaceutical substances, are employed to assist
in method design decision making. One example of such software is ACD/Labs Percepta
(ACD/Labs, Toronto, ON, Canada). Selection of suitable stationary and mobile phases is
followed by more detailed optimisation, for example the column temperature, content of
the organic solvent in isocratic or gradient elution, pH of the mobile phase and concentration
of the buffer and/or ion pairing reagent. Such optimisation can be carried out using a
“One Factor at a Time” approach. However, modern approaches employ multi-factorial
interpolation software such as ACD/Labs LC Simulator (ACD/Labs, Toronto, ON, Canada)
or DryLab (Molnar-Institute, Berlin, Germany). These enable extrapolation of a relatively
small number of experiments that lead to accurate prediction of chromatographic retention
times within the intervals of tested conditions.
An alternative approach to de-risking chromatographic methods, one which does not
rely on the availability of all hypothetical components of the KPSS, is to build sufficiently
accurate regression retention models. These are often referred to as Quantitative Structure
Retention Relationship (QSRR) models, which can be used to predict retention of these new
KPSS components. In QSRR models a mathematical relationship is built between molecular
descriptors (or features) and measured retention times, factors or indexes. If relevant
structural descriptors can be obtained for a given hypothetical structure, then its retention
time can be predicted using the QSRR model. This facilitates assessment of the separation
method (de-risking) for potential co-elutions with other sample components. Although
the concept of QSRR is not new [8–12], the last decade brought significant expansion of its
application especially in the pharmaceutical industry. The renewed interest in this field
is probably triggered by progress in availability of diverse structural descriptors [13–18],
structure geometry optimisation software as well as broad availability of feature selection
and regression algorithms [19]. Progress in high performance computing, as well as
more widespread and affordable access to it, has inevitably played a significant role
in this development. In liquid chromatography, which is by far the most frequently used
technique in pharmaceutical development, QSRR models were developed for RP-LC,
Hydrophilic Interaction Liquid Chromatography and Ion Chromatography separation
modes, with applications ranging from method development to non-targeted screening
for metabolomics [9], environmental or food pollutants and toxins. These applications,
published between 2015 and 2020, were recently extensively reviewed [20]. Analytes for
which QSRR models were built range from small molecules and lipids [20] to peptides and
proteins [21–25].
In addition to the de-risking of analytical methods, further benefits of QSRR models
can be derived from their ability to provide complementary information in support of
structural elucidation challenges. Such challenges are common to both the pharmaceutical
industry and metabolite identification in metabonomic studies [9]. For example, in situations
where multiple structural hypotheses remain consistent with available spectroscopic
(e.g., Nuclear Magnetic Resonance or Mass Spectrometry) data, additional information
based on retention time matching with QSRR prediction might usefully narrow the field.
Even if not viewed as definitive, such information might certainly help drive business
decision making, for example prioritising which proposed chemical entity should be syn-
thesized first in order to confirm the identity of unknown chromatographic peaks observed
in samples.
In this contribution, we describe how the development of accurate retention models
can be used to de-risk chromatographic methods in instances where a previously unseen
component is postulated. We will describe an optimized approach to model development
which includes selection of molecular descriptors using feature selection algorithms. Model
validation, as well as practical application of these models to predict retention extrapolated from
a small number of experiments, will also be discussed.

2. Results and Discussion


2.1. Development of the Statistical Retention Models
As stated previously, the objective of the development of statistical retention models
is to create a mathematical relationship between measured retention time and chemical
structure. The process of building QSRR models typically starts with data collection. The
purpose of this is to create a database of chemical structures and corresponding retention
times. A recent review [20] lists a number of data sources that have been used to build
QSRR models. The numbers of compounds in these databases vary from a few tens of
compounds to several hundreds or even thousands. Although the perception that
larger datasets typically generate more accurate predictions prevails among researchers,
this perception was successfully challenged in some recent publications, in which it was
demonstrated that significantly smaller datasets of compounds bearing structural similarity
to the analytes of interest generated highly accurate models [26–28]. A certain degree of
variability of retention times can occur if measured at different time points or utilising
different batches of stationary phases. It is therefore preferable to generate the entire
dataset in as few chromatographic injections as possible. However, if additional data is
required for the purpose of creating QSRR models it is essential that it is collected in a
well-controlled environment.
Chemical structures need to be converted into their numerical representation by
expressing them through structural descriptors. Structural descriptors can range from
measured or calculated physicochemical properties such as octanol-water partition coeffi-
cients, to a series of theoretical descriptors which are products of complex cheminformatics
algorithms [29]. Contemporary software packages can generate large numbers of structural
descriptors (features). It is usually necessary to apply some form of data pre-processing to
eliminate constant or nearly constant features as well as those which are highly correlated.
Finally, when large numbers of descriptors are generated, an evolutionary searching or
genetic algorithm is required to identify or preserve those which positively impact model
performance [30]. Selection of suitable regression algorithms can also have a major impact
on the accuracy of predictions. There is a relatively large number of classification and
regression algorithms available in commercial or open source platforms e.g., WEKA [19,31].
Selection and optimisation of these algorithms can be carried out either manually or by
automated procedures [32]. The final step in QSRR model development is usually model
validation, which provides an estimate of how accurate the prediction of retention time
might be for a hypothetical chemical entity.

2.1.1. Data Collection, Molecular Descriptor Calculation and Data Preprocessing


An API and 23 related components from a representative API development program
were selected for initial screening in which multiple stationary and mobile phases were
tested for overall best chromatographic performance (see Materials and Methods for
details). The mixture of 24 components, comprising the API, synthetic intermediates,
process related impurities, synthetic by-products and degradation products from the
same development program, exhibited a high degree of structural similarity. Figure 1 shows
pairwise structural similarities expressed as the Tanimoto index [33], which was calculated
using ACD/Labs Spectrus DB (ACD/Labs, Toronto, ON, Canada). Nearly 75% of pairs
have similarities higher than 0.8; only three compounds exhibit lower pairwise similarities
in the range 0.5–0.7. The high similarities within the dataset are in line with previously
published papers in which relatively small numbers of structurally similar compounds
were used to build accurate QSRR models [26–28]. For each geometry-optimised 3-D
molecular structure, three types of (native) descriptors were calculated: Dragon (4886
descriptors), MOE (256 descriptors) and VolSurf+3D (128 descriptors) (see e.g., [29] for
details of what descriptors are and how they are calculated). All zero variance and highly
correlated (absolute Correlation Coefficient |R| > 0.85) descriptors were eliminated. Of multiple
highly correlated descriptors, the one with the best correlation with the retention time was
preserved [34,35].
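
The pre-processing rule described above can be illustrated with a minimal Python sketch. It is not the procedure used in this work, only one possible reading of it: descriptors are ranked by |R| with the retention time and a descriptor is kept only if it is not highly correlated with any descriptor already kept. The function names, the greedy ordering and the toy data are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def filter_descriptors(descriptors, retention_times, threshold=0.85):
    """descriptors: dict mapping descriptor name -> list of values, one per compound.

    Returns the names of the descriptors that survive the filter.
    """
    # Step 1: drop zero-variance (constant) descriptors.
    varying = {name: v for name, v in descriptors.items() if len(set(v)) > 1}
    # Step 2: rank by |R| with the retention time, best first.
    ranked = sorted(varying,
                    key=lambda n: abs(pearson(varying[n], retention_times)),
                    reverse=True)
    # Step 3: greedy pass; a descriptor is kept only if |R| <= threshold
    # against every descriptor kept so far (assumed reading of the rule).
    kept = []
    for name in ranked:
        if all(abs(pearson(varying[name], varying[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: "b" duplicates "a", "c" is constant.
toy = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10.5],
       "c": [1, 1, 1, 1, 1], "d": [5, 1, 4, 2, 3]}
toy_tr = [1.0, 2.0, 3.0, 4.0, 5.0]
selected = filter_descriptors(toy, toy_tr)
```

In this toy case "c" is removed as constant and "b" is removed because it is highly correlated with "a", which correlates better with the retention time.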

Figure 1. Pairwise structural similarities expressed as Tanimoto index. See text for details.
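
For readers unfamiliar with the Tanimoto index shown in Figure 1, the short Python sketch below computes it for binary fingerprints represented as sets of "on" bit positions. The fingerprint type used by ACD/Labs Spectrus DB is not stated here, so the representation, the function names and the handling of two empty fingerprints are assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index c / (a + b - c) for two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty fingerprints are identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def pairwise_similarities(fingerprints):
    """fingerprints: dict name -> set of bits. Returns {(name1, name2): similarity}."""
    names = sorted(fingerprints)
    return {(p, q): tanimoto(fingerprints[p], fingerprints[q])
            for i, p in enumerate(names) for q in names[i + 1:]}

# Hypothetical fingerprints for three compounds.
example = {"A": {1, 2}, "B": {2, 3}, "C": {1, 2}}
sims = pairwise_similarities(example)
```

For a dataset of 24 compounds this yields 24 × 23 / 2 = 276 unique pairs.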

2.1.2. Generation of Training and Test Sets
From the entire dataset of 24 compounds, 3 random components were removed, and
these were used as the external test set. The remaining 21 compounds formed the training
set that was used to identify significant descriptors and to build and optimize regression
models. This process was repeated 8 times, each time removing 3 different components.
This way 8 training set/test set combinations were created. The superscripts T1 to T8 in
Tables S1 and S2 indicate which training set/test set combination compounds belong to.
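
The splitting scheme can be sketched in Python as follows. The sketch assumes that the 8 test sets are disjoint, so that each of the 24 compounds appears in exactly one test set (8 × 3 = 24); this is an interpretation of the description above. The function name, the seed and the integer compound identifiers are hypothetical.

```python
import random

def make_splits(compounds, n_splits=8, test_size=3, seed=0):
    """Return a list of (training_set, test_set) pairs with disjoint test sets."""
    if n_splits * test_size > len(compounds):
        raise ValueError("not enough compounds for disjoint test sets")
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    splits = []
    for i in range(n_splits):
        test = shuffled[i * test_size:(i + 1) * test_size]
        train = [c for c in shuffled if c not in test]
        splits.append((train, test))
    return splits

# 24 hypothetical compound identifiers -> 8 combinations of 21 training + 3 test compounds.
splits = make_splits(list(range(24)))
```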
2.1.3. Selection of Molecular Descriptors
An evolutionary search (ES) algorithm combined with Multiple Linear Regression (MLR)
implemented in WEKA [31] was used to select significant descriptors. One thousand generations
were calculated with a population size of 100. The mutation probability was set to 2%
and the cross-over probability was set to 6%. Because of the random nature of evolutionary
searching, this selection was applied to every training set and repeated three times for all
native descriptors (Dragon, VolSurf+3D and MOE) as well as all combinations of descriptors
(Dragon & VolSurf+3D, Dragon & MOE, VolSurf+3D & MOE, Dragon & VolSurf+3D &
MOE). Root Mean Square Error (RMSE) calculated from 7-fold cross validation applied to
each training set was used to identify and select significant descriptors. Thirty descriptors
most frequently selected by ES are listed in Table 1.
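
The general idea of an evolutionary search over descriptor subsets can be sketched in Python as below. This is not the WEKA implementation and does not reproduce its operators: binary tournament selection, one-point cross-over, per-bit mutation and elitism are choices made for this sketch. The fitness function is a hypothetical stand-in; in the work described here the fitness was the 7-fold cross-validated RMSE of an MLR model. Defaults use fewer generations and a smaller population than the 1000 and 100 quoted above so that the example runs quickly.

```python
import random

def evolutionary_search(n_features, fitness, generations=50, pop_size=20,
                        p_mutation=0.02, p_crossover=0.06, seed=0):
    """Search for a boolean feature mask minimising `fitness` (lower is better).

    Requires n_features >= 2. Returns (best_mask, best_score).
    """
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    best_score = fitness(best)

    def pick(population):
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.choice(population), rng.choice(population)
        return a if fitness(a) <= fitness(b) else b

    for _ in range(generations):
        children = [list(best)]  # elitism: the best mask always survives
        while len(children) < pop_size:
            child = list(pick(pop))
            if rng.random() < p_crossover:
                mate = pick(pop)
                cut = rng.randrange(1, n_features)  # one-point cross-over
                child[cut:] = mate[cut:]
            # Per-bit mutation: flip each bit with probability p_mutation.
            child = [(not bit) if rng.random() < p_mutation else bit for bit in child]
            children.append(child)
        pop = children
        candidate = min(pop, key=fitness)
        score = fitness(candidate)
        if score < best_score:
            best, best_score = list(candidate), score
    return best, best_score

# Hypothetical stand-in for cross-validated RMSE: features 0 and 3 are "informative".
INFORMATIVE = {0, 3}

def toy_fitness(mask):
    missing = sum(1 for i in INFORMATIVE if not mask[i])
    extra = sum(1 for i, bit in enumerate(mask) if bit and i not in INFORMATIVE)
    return missing + 0.1 * extra

best_mask, best_score = evolutionary_search(8, toy_fitness, generations=30)
```

Because the outcome depends on the random seed, repeating the search several times and counting how often each descriptor is selected, as done above, is a natural way to stabilise the selection.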

Table 1. List of the 30 Most Frequently Selected Descriptors by Evolutionary Search.

Descriptor      Description
CATS2D_03_DL    CATS2D Donor-Lipophilic at lag 03
LOGP_N-oct      Log Octanol/water
CATS2D_09_DA    CATS2D Donor-Acceptor at lag 09
CATS2D_09_NL    CATS2D Negative-Lipophilic at lag 09
F03[C-O]        Frequency of C-O at topological distance 3
GATS5s          Geary autocorrelation of lag 5 weighted by I-state
GATS6e          Geary autocorrelation of lag 6 weighted by Sanderson electronegativity
GATS6m          Geary autocorrelation of lag 6 weighted by mass
GATS7m          Geary autocorrelation of lag 7 weighted by mass
HATS4e          leverage-weighted autocorrelation of lag 4/weighted by Sanderson electronegativity
HATS5s          leverage-weighted autocorrelation of lag 5/weighted by I-state
Mor10p          signal 10/weighted by polarizability
AMW             average molecular weight
BLTA96          Verhaar Algae base-line toxicity from MLOGP (mmol/L)
Mor24p          signal 24/weighted by polarizability
N-075           R–N–R/R–N–X
nArCOOR         number of esters (aromatic)
NNRS            normalized number of ring systems
TDB07m          3D Topological distance-based descriptors, lag 7 weighted by mass
TDB08s          3D Topological distance-based descriptors, lag 8 weighted by I-state
a_acc           Number of hydrogen bond acceptor atoms
logS            Log of the aqueous solubility
PEOE_VSA_NEG    Total negative van der Waals surface area
PEOE_VSA+0      Sum of vi where qi is in the range of 0.00–0.05
SMR_VSA7        Sum of vi such that Ri > 0.56
ACACDO          H-bond acceptor and donor
L0LgS           Solubility profiling coefficient
L2LgS           Solubility profiling coefficient
pctFU4          Percent unionized species at pH 4
pctFU6          Percent unionized species at pH 6

2.1.4. Selection of Regression Algorithm


Five regression algorithms (Table 2) implemented in WEKA [31] were applied to all
training sets. Each training set consisted of either native descriptors or their combinations,
selected by ES as described above. For each training set, RMSE as well as R were
calculated using 7-fold cross validation. Results are summarized in Table S3. Figure 2
shows the RMSE (a) and R (b) averaged for all 8 training sets. In addition, Figure 2a
also shows the average RMSE values for all applied regression algorithms. It can be
seen from Figure 2 that mixed descriptors provide marginally better performance than
native descriptors and that Support Vector Machine (SVM) and Gaussian Processes (GPR)
regression algorithms consistently outperform MLR, Random Forest (RF) and Partial Least
Squares (PLS). Overall best performance was obtained using a mixture of all descriptors
and the SVM algorithm. Further attempts to optimise the SVM hyperparameters, such as
the complexity factor and the exponent in the Normalized Polynomial Kernel, did not
lead to further improvement of RMSE or R values. Therefore, it was decided
to use the WEKA default values, i.e., the complexity factor was set to 1.0 and the exponent to 2.0.
The best performing algorithm and the mixture of all 3 descriptor types were used to validate
the model.
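
As an illustration of how RMSE and R can be obtained from k-fold cross validation, the Python sketch below pools the out-of-fold predictions from all folds and computes both statistics once. Whether the statistics are pooled or averaged per fold is an assumption of this sketch, as are the function names, the deterministic fold assignment and the single-descriptor straight-line model, which merely stands in for any of the regression algorithms in Table 2.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(x, y):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares line y = a + b*x; a stand-in for any regression algorithm."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    a = my - b * mx
    return lambda value: a + b * value

def cross_validate(x, y, k=7, fit=fit_line):
    """Pool out-of-fold predictions from k folds, then return (RMSE, R)."""
    preds = [None] * len(x)
    for fold in range(k):
        test_idx = [i for i in range(len(x)) if i % k == fold]
        train_idx = [i for i in range(len(x)) if i % k != fold]
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = model(x[i])
    return rmse(y, preds), pearson_r(y, preds)

# Hypothetical data: 21 training compounds, one descriptor, noise-free linear response.
x_demo = [float(i) for i in range(21)]
y_demo = [2.0 * v + 1.0 for v in x_demo]
cv_rmse, cv_r = cross_validate(x_demo, y_demo)
```

With 21 training compounds and k = 7, each fold holds out 3 compounds.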
Table 2. Regression Algorithms and their settings.

Algorithm                         Settings
Support Vector Machine [36,37]    Normalized training data; Polynomial Kernel; without hyperparameter tuning
Gaussian Processes                Normalized Polynomial Kernel
Multiple Linear Regression        M5 attribute selection method
Random Forest [38]                WEKA default setting
Partial Least Squares (PLS)       Optimal number of PLS factors determined using Leave One Out cross validation

Figure 2. Comparison of Root Mean Square Error (RMSE) (a) and Correlation Coefficient (R) (b) for all calculated
descriptors, their combinations and all regression algorithms. Each bar corresponds to the average value for all training
sets. Figure 2a also contains the average value for all applied algorithms. SVM: Support vector machine; GPR: Gaussian
processes regression; MLR: multiple linear regression; RF: random forest; PLS: partial least squares.

2.1.5. Model Validation
In order to assess the ability of QSRR models to predict retention times of compounds
that were not used in their development or optimisation, retention times for eight test
sets, created as described in Section 2.1.2, were predicted. This was repeated for all six
screening conditions as described in Section 2.2. QSRR predicted retention times are
shown in Table S1 and Figure 3 demonstrates the match between QSRR predicted and
experimentally determined retention times. Finally, the corresponding RMSE and R values
are provided in Table 3.

Figure 3. Predicted vs experimental retention times (tR) for 6 screening conditions. See Table 4 for
the details of experiments.

Table 3. Root mean square error (RMSE) and correlation coefficient (R) values for test sets at 6
screening conditions. See Table 4 for the details of experiments.

        Experiment #1   Experiment #2   Experiment #3   Experiment #4   Experiment #5   Experiment #6
RMSE    0.4262          0.9981          0.3472          1.0133          0.4091          0.8401
R       0.9769          0.9763          0.9851          0.9792          0.9799          0.9874

2.2. Application to Method Development
As described in the introduction, optimisation is performed once a suitable stationary
and mobile phase, buffer and pH [20] is selected. At this stage, it is typically column
temperature and the content of organic modifier in the mobile phase (Gradient time = tG [min])
that are optimised. The details of the initial six experiments are presented in Table 4.
Experimental retention times for KPSS for these experiments are shown in Table S2. These
measured retention times were extrapolated using the ACD/Labs LC Simulator software.
Linear extrapolation was used for the mathematical relationship between natural logarithm
of the retention factor of each component (lnk) in the KPSS and tG [min]. Quadratic
extrapolation was applied to the relationship between lnk and 1/T where T [Kelvin] is
the column temperature. Figure 4a shows the resolution of every component of the KPSS
for all combinations of tG and 1/T. For the purposes of clarity, the retention model built
from experimentally determined retention times is depicted as RtModelEXP. The optimal
temperature and gradient composition, the so-called centre point, were selected to consider
maximum method robustness i.e., where the overall resolution is maximum and least
affected by alteration of T or tG (Figure 4). Figure 5 shows the separation at the optimal
temperature and gradient.
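
The two relationships described above (lnk linear in tG, lnk quadratic in 1/T) can be sketched in Python for a single compound measured in a 3 × 2 screening design. This is not the algorithm of ACD/Labs LC Simulator. The retention factor is taken as k = (tR − t0)/t0, where the column dead time t0 is an assumed input, and the function name and all retention times in the example are hypothetical.

```python
import math

def predict_retention_time(measured, t0, temp_c, tg):
    """Predict tR [min] at column temperature temp_c [deg C] and gradient time tg [min].

    measured: dict {(temp_c, tg_min): tR_min} covering 3 temperatures x 2 gradient times.
    t0: column dead time [min] (assumed known).
    """
    temps = sorted({t for t, _ in measured})
    g1, g2 = sorted({g for _, g in measured})
    # Step 1: at each measured temperature, ln k is taken as linear in tG.
    lnk_at_tg = []
    for t in temps:
        y1 = math.log((measured[(t, g1)] - t0) / t0)
        y2 = math.log((measured[(t, g2)] - t0) / t0)
        lnk_at_tg.append(y1 + (y2 - y1) * (tg - g1) / (g2 - g1))
    # Step 2: ln k is taken as quadratic in 1/T (T in Kelvin); the quadratic through
    # three points is evaluated in Lagrange form.
    xs = [1.0 / (t + 273.15) for t in temps]
    x = 1.0 / (temp_c + 273.15)
    lnk = 0.0
    for i, yi in enumerate(lnk_at_tg):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xs[i] - xj)
        lnk += yi * weight
    # Convert the retention factor back to a retention time.
    return t0 * (1.0 + math.exp(lnk))

# Hypothetical retention times [min] for one compound under the six screening conditions.
EXAMPLE_T0 = 1.0
EXAMPLE_MEASURED = {(20, 15): 8.0, (20, 45): 17.0,
                    (40, 15): 7.4, (40, 45): 15.5,
                    (60, 15): 6.9, (60, 45): 14.2}
tr_mid = predict_retention_time(EXAMPLE_MEASURED, EXAMPLE_T0, temp_c=40, tg=30)
```

Evaluating such a model for every KPSS component over a grid of temperatures and gradient times gives the retention times from which a resolution map like Figure 4 can be assembled.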

Table 4. Screening sequence used to optimize the column temperature and gradient elution. See
Materials and Methods for other conditions.

Experiment   Column Temperature (°C)   Gradient Profile a
1            20                        Time = 0 min, %B = 5%; Time = 15 min, %B = 95%
2            20                        Time = 0 min, %B = 5%; Time = 45 min, %B = 95%
3            40                        Time = 0 min, %B = 5%; Time = 15 min, %B = 95%
4            40                        Time = 0 min, %B = 5%; Time = 45 min, %B = 95%
5            60                        Time = 0 min, %B = 5%; Time = 15 min, %B = 95%
6            60                        Time = 0 min, %B = 5%; Time = 45 min, %B = 95%
a Followed by 4 min equilibration.

Figure 4. Resolution heat map for key predictive sample set (KPSS). Intensity represents overall chromatogram resolution.
High resolution is depicted by red color, low resolution is depicted by blue color. (a) constructed from experimental retention
times. (b) constructed from Quantitative Structure Retention Relationship (QSRR) predicted retention times. The diamond
indicates the center point selected from the model created from experimental retention times.
In order to assess the suitability of the QSRR, we essentially replicated the
process described above, except that in this case, instead of measured retention times, we used
QSRR predicted retention times (Table S1) as described above (see Section 2.1). Again,
for the purpose of clarity, this retention model is denoted RtModelQSRR. Agreement
between the predicted chromatographic separation of KPSS components from RtModelEXP
and RtModelQSRR, at the experimental conditions corresponding to the centre point, is
demonstrated in Figure 5. For this subset of compounds, the retention times predicted
from RtModelEXP and RtModelQSRR are nearly identical. Also, the resolution heatmap
constructed from QSRR predicted retention times, although not entirely identical to the
one constructed from experimentally obtained retention times, indicates similar optimal
resolution of all compounds belonging to KPSS (Figure 4b). This may not be the case for all
compounds, as the accuracy of prediction varies from compound to compound. This is also
demonstrated in Figure 3.

Figure 5. Predicted chromatogram for KPSS components from the retention model built from
experimentally determined retention times (RtModelEXP) (solid line) and the retention model built
from QSRR predicted retention times (RtModelQSRR) (dashed line). Column temperature 40 °C.
Gradient profile: Time = 0 min, %B = 15%; Time = 12 min, %B = 45%; Time = 17 min, %B = 95%. See
Materials and Methods for other details.
In order to compare retention times predicted from RtModelEXP and those predicted
from RtModelQSRR we used all 24 compounds. We then created all possible combinations
of two to ten components from this compound set. For each of these combinations we
calculated a resolution coefficient (RC) according to Equation (1):

$$RC = \prod_{i,j} \frac{1}{e^{\left(\frac{Rs_{limit}}{Rs_{i,j}} - 1\right)}} \qquad (1)$$

where Rslimit = 1.25 is the minimal satisfactory resolution between two components and Rsi,j is
the actual chromatographic resolution between two components in the mixture. If Rsi,j
is equal to or exceeds Rslimit then it is set to Rslimit. Consequently, if the resolution
between every pair of components is equal to or exceeds Rslimit, the RC has a value of
one, whereas if the resolution between any two components is zero, the RC value will also be
zero (i.e., 1/e^∞ ≈ 0). All other values fall between zero and one.
Note that for the calculation of the resolution between two components we used an average
peak width of 0.1 min. The black line in Figure 6 shows the portion of all combinations for
which both models (RtModelEXP and RtModelQSRR) predicted baseline separation of all
components in the mixture (RC = 1).
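As an illustration of Equation (1), the sketch below computes RC for a list of retention times. Two points are assumptions rather than statements from the text: the resolution of a pair is taken as the retention time difference divided by the 0.1 min peak width (i.e., the width is treated as a baseline width common to both peaks), and the product runs over all pairs of components. The function names are hypothetical.

```python
import math
from itertools import combinations

RS_LIMIT = 1.25        # minimal satisfactory resolution (from the text)
PEAK_WIDTH_MIN = 0.1   # average peak width in min (from the text)


def resolution(t1: float, t2: float, width: float = PEAK_WIDTH_MIN) -> float:
    """Rs of two peaks of equal width (assumed to be a baseline width)."""
    return abs(t2 - t1) / width


def resolution_coefficient(retention_times, rs_limit: float = RS_LIMIT) -> float:
    """Product over all pairs of 1 / exp(rs_limit / Rs - 1), with Rs capped at rs_limit."""
    rc = 1.0
    for t1, t2 in combinations(retention_times, 2):
        rs = min(resolution(t1, t2), rs_limit)
        if rs == 0.0:
            return 0.0  # co-eluting pair: 1 / e^infinity, treated as exactly zero
        rc *= math.exp(-(rs_limit / rs - 1.0))
    return rc


print(resolution_coefficient([1.0, 2.0, 3.0]))   # every pair well resolved
print(resolution_coefficient([5.0, 5.0625]))     # one partially resolved pair
print(resolution_coefficient([4.2, 4.2, 6.0]))   # one co-eluting pair
```

A pair with Rs equal to half of Rslimit contributes a factor of 1/e, which shows how quickly RC falls once any pair drops below the limit.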

Figure 6. Portion (%) of all combinations of compounds containing two to ten components for which RtModelEXP and
RtModelQSRR predicted baseline separation (Resolution Coefficient (RC) = 1). The total number of combinations evaluated
is in parentheses. Black line corresponds to model built from predicted data and red line corresponds to model built from
mixture of predicted and experimental data. See text for details.
These data demonstrate that of all theoretical mixtures containing up to seven compo-
nents which were separated with a resolution of at least 1.25, more than 80% were identified
with both models. Even for the most complex mixtures containing ten components, nearly
65% of all combinations were identified with both models. It can be concluded that once
QSRR derived retention times are established, they can be used to identify conditions in
which all components are fully separated. However, the observation described in Figure 6
(black line) represents an extreme case since we are comparing a model built from entirely
experimental data with one built from entirely QSRR predicted data. Practically, this
scenario will almost always be applied to a mixture of components, for some of which the
measured data will be available. We simulated this scenario by randomly replacing ap-
proximately 20% (five out of 24) of retention times obtained from RtModelQSRR with retention
times obtained from RtModelEXP. As shown in Figure 6 (red line), there were noticeable
increases in the proportion of mixtures identified as baseline separated in both models. In
practical terms, we usually have many experimentally determined retention times available
and few QSRR determined data. We would typically be looking at 2–5 components with
which to estimate successful separation. These components are likely to be subtle molecular
modifications within the acceptable structural similarity properties of the model.
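The combination analysis can be sketched as follows. The retention times are invented for five hypothetical compounds (the study used 24), and one interpretation is assumed: the percentage is taken relative to the mixtures that are baseline separated according to the experimental model. A mixture has RC = 1 exactly when every adjacent pair of peaks reaches Rslimit, so only neighbouring peaks need to be checked.

```python
from itertools import combinations
from math import comb

RS_LIMIT = 1.25
PEAK_WIDTH_MIN = 0.1


def baseline_separated(times) -> bool:
    """True when every adjacent pair has Rs >= RS_LIMIT, which is equivalent to RC = 1."""
    ordered = sorted(times)
    return all((b - a) / PEAK_WIDTH_MIN >= RS_LIMIT for a, b in zip(ordered, ordered[1:]))


def agreement_by_size(exp_rt, qsrr_rt, sizes):
    """For each mixture size return (separated in both models, separated in the experimental model)."""
    result = {}
    for k in sizes:
        both = separated_exp = 0
        for mixture in combinations(sorted(exp_rt), k):
            if baseline_separated([exp_rt[c] for c in mixture]):
                separated_exp += 1
                if baseline_separated([qsrr_rt[c] for c in mixture]):
                    both += 1
        result[k] = (both, separated_exp)
    return result


# Hypothetical retention times (min); B/C co-elute experimentally, D/E in the QSRR model.
exp_rt = {"A": 2.00, "B": 2.50, "C": 2.55, "D": 4.00, "E": 6.00}
qsrr_rt = {"A": 2.10, "B": 2.40, "C": 2.70, "D": 4.00, "E": 4.05}

agreement = agreement_by_size(exp_rt, qsrr_rt, sizes=(2, 3))
for k, (both, separated_exp) in agreement.items():
    print(k, f"{100 * both / separated_exp:.1f}% of {separated_exp} separated mixtures agree")

# Size of the real search space: number of k-component mixtures drawn from 24 compounds.
print({k: comb(24, k) for k in (2, 10)})
```

Replacing a random subset of the `qsrr_rt` values with the corresponding `exp_rt` values before calling `agreement_by_size` would mimic the mixed scenario described above.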
Lastly, pairwise resolutions were calculated for all 24 compounds using
both QSRR and experimentally determined retention times. The same assumptions re-
garding the peak widths as in previous calculations were made. All pairs that exhibited
resolution higher than 20 were excluded as these components would always be separated
even if the error of prediction was excessive. RC values for all remaining pairs were
calculated for retention times predicted from RtModelEXP and RtModelQSRR. RC values
for these models were compared. Figure 7 shows what proportion of pairwise RC values
calculated from RtModelQSRR fell within specified intervals of RC values calculated
from RtModelEXP. This figure demonstrates that in excess of 60% of pairwise RC values
obtained from RtModelQSRR fall within ±0.1 of RC values obtained from RtModelEXP. This
again indicates that the likelihood of making a correct decision with regard to the selection of optimal
separation conditions based on QSRR derived models is high.
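A minimal sketch of the pairwise comparison is given below, using invented retention times for four hypothetical compounds. It assumes that the exclusion of pairs with resolution above 20 is applied to the experimental retention times (the text does not say which model was used) and that resolution is the retention time difference divided by the 0.1 min peak width.

```python
import math
from itertools import combinations

RS_LIMIT = 1.25
PEAK_WIDTH_MIN = 0.1


def pair_rc(t1: float, t2: float) -> float:
    """RC of a single pair: 1 / exp(RS_LIMIT / Rs - 1), with Rs capped at RS_LIMIT."""
    rs = min(abs(t2 - t1) / PEAK_WIDTH_MIN, RS_LIMIT)
    return 0.0 if rs == 0.0 else math.exp(-(RS_LIMIT / rs - 1.0))


def pairs_within_tolerance(exp_rt, qsrr_rt, tolerance=0.1, rs_cutoff=20.0):
    """Return (pairs whose two RC values differ by at most `tolerance`, pairs kept)."""
    within = kept = 0
    for a, b in combinations(sorted(exp_rt), 2):
        if abs(exp_rt[a] - exp_rt[b]) / PEAK_WIDTH_MIN > rs_cutoff:
            continue  # trivially separated pair, excluded from the comparison
        kept += 1
        rc_exp = pair_rc(exp_rt[a], exp_rt[b])
        rc_qsrr = pair_rc(qsrr_rt[a], qsrr_rt[b])
        if abs(rc_exp - rc_qsrr) <= tolerance:
            within += 1
    return within, kept


# Hypothetical retention times (min).
exp_rt = {"A": 2.00, "B": 2.05, "C": 2.30, "D": 5.00}
qsrr_rt = {"A": 2.00, "B": 2.05, "C": 2.10, "D": 5.20}

within, kept = pairs_within_tolerance(exp_rt, qsrr_rt)
print(f"{within} of {kept} retained pairs agree within 0.1 RC units")
```

In this toy data set every pair involving compound D is discarded by the cut-off, and only the A/B pair agrees, because the predicted retention of C moves it much closer to A and B than was observed.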

Figure 7. Portion (%) of pairwise RC values calculated from RtModelQSRR falling within certain
interval of RC values calculated from RtModelEXP. See text for details.

3. Materials and Methods

3.1. Instrumentation
All experiments were performed using an Agilent 1290 Infinity UHPLC (Agilent
Technologies, Waldbronn, Germany) liquid chromatography apparatus equipped with
a diode array detector, autosampler, and thermostat. A quadrupole time-of-flight mass
spectrometer, Agilent 6550i (Agilent Technologies, Singapore), was employed to track
chromatographic peaks between different methods. Chromatographic data were collected
and processed using MassHunter Workstation LC/MS data acquisition software (Agilent
Technologies, Santa Clara, CA, USA). The column employed in this study was a Waters
BEH Acquity C18 (2.1 mm id × 100 mm, 1.7 µm) (Waters, Milford, MA, USA). The gradient
eluent consisted of acetonitrile (Mobile phase B) and 10 mM ammonium acetate
solution, pH adjusted to 4.9 with acetic acid (Mobile phase A). The dataset for building QSRR
models was obtained at a column temperature of 60 °C with the following gradient profile: Time
= 0 min, %B = 5%; Time = 45 min, %B = 95%, followed by 4 min equilibration. All other
gradient profiles are specified in Table 4. All data were collected at column temperatures
as specified in Table 4 and with an eluent flow rate of 0.4 mL/min. The injection volume
was 2 µL and the UV detection was carried out at 254 nm.
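The gradient profiles are given as (time, %B) break points. Assuming linear ramps between break points and ignoring the system dwell volume (both are assumptions, not statements from the method description), the programmed composition at any time can be sketched as below; the helper name `percent_b` is hypothetical.

```python
from bisect import bisect_right


def percent_b(t_min: float, profile) -> float:
    """Programmed %B at time t_min for a profile of (time in min, %B) break points.

    Linear interpolation between break points; the composition is held
    constant before the first and after the last break point.
    """
    times = [t for t, _ in profile]
    if t_min <= times[0]:
        return float(profile[0][1])
    if t_min >= times[-1]:
        return float(profile[-1][1])
    k = bisect_right(times, t_min)  # index of the first break point after t_min
    (t0, b0), (t1, b1) = profile[k - 1], profile[k]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


qsrr_dataset_gradient = [(0, 5), (45, 95)]          # profile used for the QSRR dataset
figure5_gradient = [(0, 15), (12, 45), (17, 95)]    # profile quoted in the Figure 5 caption

print(percent_b(22.5, qsrr_dataset_gradient))  # midpoint of the 45 min gradient
print(percent_b(6.0, figure5_gradient))        # first segment
print(percent_b(14.5, figure5_gradient))       # second, steeper segment
```

For the 45 min profile this corresponds to a slope of 2 %B per minute, compared with 6 %B per minute for the 15 min gradients in Table 4.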
3.2. Chemicals and Reagents
All standards used throughout the study were synthesized and characterized at Pfizer
R&D UK Limited (Sandwich, UK). Standard solutions were initially prepared at a concentration
of 1 mg/mL in a diluent consisting of a 50:50 (v/v) mixture of acetonitrile and
water and stored in a refrigerator. They were diluted 50-fold with diluent prior to injection.
Acetonitrile (HPLC grade), ammonium acetate (LCMS grade) and acetic acid (Analytical
grade) were purchased from Fisher Scientific (Loughborough, UK). Deionized water was
prepared in house by MilliQ LC-Pak (Merck, Amsterdam, The Netherlands).

3.3. Software
AlvaDesc (Alvascience Srl, Lecco, Italy) software was used to calculate Dragon [13]
descriptors (formerly DragonX), Molecular Operating Environment (MOE, Chemical Com-
puting Group Inc, Montreal, QC, Canada) software was used to calculate MOE descriptors,
and Molecular Discovery Software (Molecular Discovery, Borehamwood, UK)
was used to calculate VolSurf+3D descriptors [14]. Prior to descriptor calculation, 3D
conformers were generated using Corina (Molecular Networks GmbH, Nürnberg, Germany
and Altamira LLC, Columbus, OH, USA), followed by energy minimization using the MMFF94
force field embedded in MOE software.
WEKA [39] (version 3.8, Waikato, New Zealand) platform was used for feature selec-
tion and for the development and optimization of regression algorithms.
ACD/Labs LC Simulator (ACD/Labs, Toronto, ON, Canada) version 2019 was used
to carry out two-dimensional resolution optimisation.

4. Conclusions
Chromatographic QSRR models were demonstrated to be useful for the prediction
of retention times for hypothetical components with favourable accuracy. Likewise, the
optimum resolution space was shown to be accurately represented when calculated using
this approach. This was achieved by using a combination of Dragon, MOE and VolSurf+3D
descriptors with a Support Vector Machine regression algorithm which outperformed all
other tested conditions. An Evolutionary Search algorithm was used to reduce the number of
considered molecular descriptors from which the retention models were built. The retention
times predicted from these models were used to build two-dimensional (gradient time
versus temperature) resolution maps in order to identify optimal separation conditions.
We found excellent agreement between the resolution of sample components obtained
from a model built using experimental retention times and that obtained from QSRR predicted
retention times. These results indicate the usefulness of QSRR for the identification of
optimal chromatographic conditions as well as for de-risking of existing methods for
new/hypothetical components. It thus raises the prospect of an alternative approach to
separation optimisation and de-risking that would not inherently rely on the availability of
physical samples.

Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/ijms22083848/s1,
Table S1: QSRR predicted retention times from second screening, Table S2:
Experimental retention times from second screening, Table S3: Selection of regression algorithm.
Author Contributions: Conceptualization, R.S. and R.B.; methodology, R.S. and J.C.H.; software, R.S.
and J.H.; validation, R.S. and C.B.; formal analysis, C.B.; investigation, R.S. and J.C.H.; resources, R.B.;
data curation, R.S. and J.H.; writing—original draft preparation, R.S.; writing—review and editing,
R.S. and J.C.H.; visualization, R.S. and J.H.; supervision, R.B.; project administration, R.B.; funding
acquisition, R.B. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.
Acknowledgments: The work of J.H. was supported by the Slovak Research and Development
Agency (APVV-17-0318).
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
RP-LC Reversed-Phase Liquid Chromatography
API Active Pharmaceutical Ingredient
KPSS Key Predictive Sample Set
QSRR Quantitative Structure Retention Relationship
R Correlation Coefficient
ES Evolutionary Search
MLR Multiple Linear Regression
RMSE Root Mean Square Error
SVM Support Vector Machine
GPR Gaussian Processes Regression
RF Random Forest
PLS Partial Least Squares
RtModelEXP Retention model built from experimental retention times
RtModelQSRR Retention model built from QSRR predicted retention times
RC Resolution Coefficient
Rslimit Minimal satisfactory resolution between two components
Rsi,j Actual chromatographic resolution between two components in the mixture

References
1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH
Harmonised Tripartite Guideline: Specifications: Test Procedures and Acceptance Criteria for New Drug Substances and New
Drug Products: Chemical Substances Q6A. Available online: https://database.ich.org/sites/default/files/Q6A%20Guideline.pdf
(accessed on 14 November 2020).
2. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH
Harmonised Tripartite Guideline: Impurities in New Drug Substances Q3A(R2). Available online: https://database.ich.org/
sites/default/files/Q3A%28R2%29%20Guideline.pdf (accessed on 31 July 2020).
3. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use.
ICH Harmonised Tripartite Guideline: Validation of Analytical Procedures: Text and Methodology Q2(R1). Available online:
https://database.ich.org/sites/default/files/Q2%28R1%29%20Guideline.pdf (accessed on 31 July 2020).
4. Olsen, B.A.; Sreedhara, A.; Baertschi, S.W. Impurity investigations by phases of drug and product development. TrAC, Trends
Anal. Chem. 2018, 101, 17–23. [CrossRef]
5. Baertschi, S.W.; Alsante, K.M.; Reed, R.A. (Eds.) Pharmaceutical Stress Testing: Predicting Drug Degradation, 2nd ed.; CRC Press:
Boca Raton, FL, USA, 2011. [CrossRef]
6. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use.
ICH Harmonised Tripartite Guideline: Stability Testing of New Drug Substances and Products Q1A(R2). Available online:
https://database.ich.org/sites/default/files/Q1A%28R2%29%20Guideline.pdf (accessed on 22 February 2021).
7. Fekete, S.; Fekete, J.; Molnár, I.; Ganzler, K. Rapid high performance liquid chromatography method development with high
prediction accuracy, using 5 cm long narrow bore columns packed with sub-2 µm particles and Design Space computer modeling.
J. Chromatogr. A 2009, 1216, 7816–7823. [CrossRef]
8. Szucs, R.; Brunelli, C.; Lestremau, F.; Hanna-Brown, M. Liquid chromatography in the pharmaceutical industry. In Liq-
uid Chromatography: Applications, 2nd ed.; Fanali, S., Haddad, P.R., Poole, C.F., Riekkola, M.-L., Eds.; Elsevier: Amsterdam,
The Netherlands, 2017; pp. 515–537. [CrossRef]
9. Witting, M.; Böcker, S. Current status of retention time prediction in metabolite identification. J. Sep. Sci. 2020, 43, 1746–1754.
[CrossRef]
10. Taraji, M.; Haddad, P.R.; Amos, R.I.J.; Talebi, M.; Szucs, R.; Dolan, J.W.; Pohl, C.A. Chemometric-assisted method development in
hydrophilic interaction liquid chromatography: A review. Anal. Chim. Acta 2018, 1000, 20–40. [CrossRef]
11. Kaliszan, R. Quantitative structure property (retention) relationships in liquid chromatography. In Liquid Chromatography:
Fundamentals and Instrumentation, 2nd ed.; Fanali, S., Haddad, P.R., Poole, C.F., Riekkola, M.-L., Eds.; Elsevier: Amsterdam,
The Netherlands, 2017; pp. 553–572. [CrossRef]
12. Bouwmeester, R.; Martens, L.; Degroeve, S. Comprehensive and Empirical Evaluation of Machine Learning Algorithms for Small
Molecule LC Retention Time Prediction. Anal. Chem. 2019, 91, 3694–3703. [CrossRef] [PubMed]
13. Mauri, A.; Consonni, V.; Pavan, M.; Todeschini, R. DRAGON software: An easy approach to molecular descriptor calculations.
MATCH Commun. Math. Comput. Chem. 2006, 56, 237–248.
14. Cruciani, G.; Crivori, P.; Carrupt, P.A.; Testa, B. Molecular fields in quantitative structure-permeation relationships: The VolSurf
approach. J. Mol. Struct. THEOCHEM 2000, 503, 17–30. [CrossRef]
15. Valdés-Martiní, J.R.; Marrero-Ponce, Y.; García-Jacas, C.R.; Martinez-Mayorga, K.; Barigye, S.J.; Vaz D‘Almeida, Y.S.; Pham-The,
H.; Pérez-Giménez, F.; Morell, C.A. QuBiLS-MAS, open source multi-platform software for atom- and bond-based topological
(2D) and chiral (2.5D) algebraic molecular descriptors computations. J. Cheminformatics 2017, 9, 35. [CrossRef]
16. Yap, C.W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011,
32, 1466–1474. [CrossRef] [PubMed]
17. Cao, D.-S.; Xu, Q.-S.; Hu, Q.-N.; Liang, Y.-Z. ChemoPy: Freely available python package for computational biology and
chemoinformatics. Bioinformatics 2013, 29, 1092–1094. [CrossRef] [PubMed]
18. Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R.; Willighagen, E.L. Recent Developments of the Chemistry Development
Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics. Curr. Pharm. Des. 2006, 12, 2111–2120. [CrossRef]
19. Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Morgan Kaufmann:
Cambridge, MA, USA, 2016.
20. Haddad, P.R.; Taraji, M.; Szücs, R. Prediction of Analyte Retention Time in Liquid Chromatography. Anal. Chem. 2021, 93, 228–256.
[CrossRef]
21. Henneman, A.; Palmblad, M. Retention Time Prediction and Protein Identification. In Mass Spectrometry Data Analysis in
Proteomics; Matthiesen, R., Ed.; Humana: New York, NY, USA, 2020; pp. 115–132. [CrossRef]
22. Moruz, L.; Käll, L. Peptide retention time prediction. Mass Spectrom. Rev. 2017, 36, 615–623. [CrossRef]
23. Krokhin, O.V.; Spicer, V. Predicting Peptide Retention Times for Proteomics. Curr. Protoc. Bioinformatics 2010, 13.14.11–13.14.15.
[CrossRef]
24. Tarasova, I.A.; Masselon, C.D.; Gorshkov, A.V.; Gorshkov, M.V. Predictive chromatography of peptides and proteins as a
complementary tool for proteomics. Analyst 2016, 141, 4816–4832. [CrossRef]
25. Krokhin, O. Peptide retention prediction in reversed-phase chromatography: Proteomic applications. Expert Rev. Proteomics 2012,
9, 1–4. [CrossRef] [PubMed]
26. Wen, Y.; Talebi, M.; Amos, R.I.J.; Szucs, R.; Dolan, J.W.; Pohl, C.A.; Haddad, P.R. Retention prediction in reversed phase high
performance liquid chromatography using quantitative structure-retention relationships applied to the Hydrophobic Subtraction
Model. J. Chromatogr. A 2018, 1541, 1–11. [CrossRef] [PubMed]
27. Wen, Y.; Amos, R.I.J.; Talebi, M.; Szucs, R.; Dolan, J.W.; Pohl, C.A.; Haddad, P.R. Retention Index Prediction Using Quantitative
Structure-Retention Relationships for Improving Structure Identification in Nontargeted Metabolomics. Anal. Chem. 2018, 90,
9434–9440. [CrossRef]
28. Taraji, M.; Haddad, P.R.; Amos, R.I.J.; Talebi, M.; Szucs, R.; Dolan, J.W.; Pohl, C.A. Rapid Method Development in Hydrophilic
Interaction Liquid Chromatography for Pharmaceutical Analysis Using a Combination of Quantitative Structure-Retention
Relationships and Design of Experiments. Anal. Chem. 2017, 89, 1870–1878. [CrossRef] [PubMed]
29. Mauri, A.; Consonni, V.; Todeschini, R. Molecular descriptors. In Handbook of Computational Chemistry, 2nd ed.; Leszczynski, J.,
Kaczmarek-Kedziera, A., Puzyn, T., Papadopoulos, M.G., Reis, H., Shukla, M.K., Eds.; Springer: Cham, Switzerland, 2017; pp.
2065–2093. [CrossRef]
30. Leardi, R. Genetic algorithms in chemistry. J. Chromatogr. A 2007, 1158, 226–233. [CrossRef] [PubMed]
31. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA Data Mining Software: An Update.
SIGKDD Explor. 2009, 11, 10–18. [CrossRef]
32. Kotthoff, L.; Thornton, C.; Hoos, H.H.; Hutter, F.; Leyton-Brown, K. Auto-WEKA 2.0: Automatic model selection and hyperpa-
rameter optimization in WEKA. J. Mach. Learn. Res. 2017, 18, 826–830.
33. Willett, P.; Barnard, J.M.; Downs, G.M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983–996. [CrossRef]
34. Aalizadeh, R.; Thomaidis, N.S.; Bletsou, A.A.; Gago-Ferrero, P. Quantitative Structure-Retention Relationship Models to Support
Nontarget High-Resolution Mass Spectrometric Screening of Emerging Contaminants in Environmental Samples. J. Chem. Inf.
Model. 2016, 56, 1384–1398. [CrossRef] [PubMed]
35. Passarin, P.B.S.; Lourenço, F.R. Modeling an in silico platform to predict chromatographic profiles of UV filters using ChromSimu-
lator. Microchem. J. 2020, 157, 105002. [CrossRef]
36. Shevade, S.K.; Keerthi, S.S.; Bhattacharyya, C.; Murthy, K.R.K. Improvements to the SMO Algorithm for SVM Regression. IEEE
Trans. Neural Netw. 2000, 11, 1188–1193. [CrossRef]
37. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [CrossRef]
38. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
39. Frank, E.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and
Techniques”, 4th ed.; Morgan Kaufmann, 2016. Available online: https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016
_appendix.pdf (accessed on 14 November 2020).
