
Archives of Toxicology (2022) 96:1279–1295

https://doi.org/10.1007/s00204-022-03252-y

REVIEW ARTICLE

Prediction reliability of QSAR models: an overview of various validation tools

Priyanka De1 · Supratik Kar2 · Pravin Ambure3 · Kunal Roy1

Received: 28 December 2021 / Accepted: 14 February 2022 / Published online: 10 March 2022
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022

Abstract
The reliability of any quantitative structure–activity relationship (QSAR) model depends on multiple aspects such as the
accuracy of the input dataset, selection of significant descriptors, the appropriate splitting process of the dataset, statistical
tools used, and most notably on the measures of validation. Validation, the most crucial step in QSAR model development,
confirms the reliability of the developed QSAR models and the acceptability of each step in the model development. The
present review deals with various validation tools that involve multiple techniques that improve the model quality and robust-
ness. The double cross-validation tool helps in building improved quality models using different combinations of the same
training set in an inner cross-validation loop. This exhaustive method is also integrated for small datasets (< 40 compounds)
in another tool, namely the small dataset modeler tool. The main aim of QSAR researchers is to improve prediction quality by
lowering the prediction errors for the query compounds. ‘Intelligent’ selection of multiple models and consensus predictions
integrated in the intelligent consensus predictor tool were found to be more externally predictive than individual models.
Furthermore, another tool, the Prediction Reliability Indicator, is explained to help understand the quality of predictions for
a true external set. This tool uses a composite scoring technique to classify the predictions of query compounds as 'good', 'moderate', or
'bad'. We have also discussed a quantitative read-across tool which predicts a chemical response based on the
similarity with structural analogues. The discussed tools are freely available from https://dtclab.webs.com/software-tools
or http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/ and https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home
(for read-across).

Keywords QSAR · Validation · Double cross-validation · Small dataset modeling · Intelligent consensus prediction · Read-across

Introduction

A growing number of studies have been conducted in recent years wherein computational methods have been used to predict the physicochemical properties and biological activities of chemical compounds. Quantitative structure−activity relationship (QSAR) (Dearden 2016) modeling is a popular in silico technique performed to find a quantitative correlation between the structural features (known as descriptors) and a known response (activity/property/toxicity) for a set of molecules using various chemometric methodologies. QSAR evolves at the crossroads of chemistry, statistics, biology, and toxicological studies. The main aim is to identify and optimize new leads to shorten the time and reduce the expenditure for drug discovery (Hsu et al. 2017). The fundamental assumption regarding QSAR modeling is that a chemical structure possesses unique features (geometric, steric, and electronic properties) responsible for its physical, chemical, and biological properties.

The European Union (EU) envisaged that QSAR models would increasingly be used for hazard and risk assessments of chemicals (Commission of the European Communities 2001). It is also necessary to create and apply QSARs to address animal welfare concerns by replacing, reducing,

* Kunal Roy
kunalroy_in@yahoo.com; kunal.roy@jadavpuruniversity.in

1 Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700032, India
2 Interdisciplinary Center for Nanotoxicity, Department of Chemistry, Physics and Atmospheric Sciences, Jackson State University, Jackson, MS 39217, USA
3 ProtoQSAR S.L., Valencia, Spain

and refining animal testing in toxicological assessments. In November 2004, the European Commission and the OECD (Organisation for Economic Co-operation and Development) member countries adopted principles for the validation of QSAR models for use in the regulatory assessment of chemical safety (OECD 2004). According to the agreed guidelines of the OECD, a QSAR model should be developed with

(a) a defined endpoint,
(b) an unambiguous algorithm to guarantee model transparency,
(c) a defined domain of applicability,
(d) proper measures of validation, including internal performance (as determined by goodness-of-fit and robustness) and predictivity (as represented by external validation), and
(e) possible mechanistic interpretation.

Validation is crucial for the development and application of any QSAR model. It confirms the reliability of the developed model and the acceptability of each step of model development. The debate between internal versus external validation prevails predominantly among QSAR practitioners (Roy 2007). Some QSAR studies reported an inconsistency between internal and external predictivity (Novellino et al. 1995; Norinder 1996). According to researchers, there might be an inconsistency between internal and external predictability, i.e., high internal predictivity may result in low external predictivity and vice versa (Kubinyi 1998). However, external validation is considered the 'gold standard' for checking the predictive potential of QSAR models. Some researchers consider cross-validation to be more appropriate for checking the predictive ability of QSAR models to circumvent the loss of information from splitting the dataset into training and test sets (Héberger 2017). Several validation metrics (as discussed later) are used to check the quality of predictions generated by regression-based and classification-based QSAR models (Gramatica and Sangion 2016; Todeschini et al. 2016).

The present review discusses several prediction reliability tools exploring various strategies to determine model reliability and predictivity. We discuss the tools that engage in model-building through a double cross-validation approach on large and small datasets. Furthermore, we explain the utility of intelligent selection of multiple models and various forms of consensus prediction. We also describe a tool that implements a similarity-based reliability scoring approach to understand the quality of predictions for a new query compound and ensure the developed models' reliability. We further report a similarity-based quantitative read-across tool addressing the quality of predictions both quantitatively and qualitatively.

Predictive QSAR model development approaches

Modern QSAR methods use multiple descriptors combined with the application of both linear and non-linear modeling approaches, with a strong emphasis on rigorous model validation to afford robust and predictive QSAR models. Several types of research, along with our understanding of QSAR model development and validation, led us to establish a general outline of the QSAR model workflow as described in Fig. 1. This figure illustrates the classical QSAR model development algorithm, which includes: (a) collection of pertinent data with a defined endpoint, (b) descriptor calculation and data pre-treatment, (c) model development through analysis of the correlation between the input data and the calculated descriptors, (d) validation of the model, and (e) design and prediction of the activity of new query molecules. The QSAR modeling scheme is described briefly in the following section.

(i) Dataset preparation and data curation: One of the most challenging parts of QSAR is dataset collection with a "defined endpoint" as explained in OECD principle 1. The intent is to confirm the transparency of the endpoint aimed for prediction models, considering that a given endpoint could be dependent on the experimental protocol and the experimental conditions. Data curation is an essential and time-consuming step in the QSAR model development process. Erroneous data (both in chemical structures and biological data) retrieved from online sources require strict curation to avoid false or non-predictive models (Ambure and Cordeiro 2020).
(ii) Calculation of molecular descriptors: The molecular structures applied for QSAR modeling need to be translated into numbers, i.e., molecular descriptors. The molecular descriptor is an encoded representation of the information about a chemical compound in the form of numerical values based on its chemical constitution, allowing the correlation of chemical structure with physical properties, chemical reactions, or biological activity (Consonni and Todeschini 2010). In a QSAR model, descriptors of a molecule, which describe specific aspects of a molecule, are predictors (X) of the dependent variable (Y). A QSAR study uses a variety of descriptors that can be classified into different dimensions or categories, as shown in Table 1.
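As a toy illustration of step (ii), a few 0D constitutional descriptors can be counted directly from a molecular formula. This is only a sketch for intuition (the function name and the formula-level parsing are our own, not part of any descriptor package such as DRAGON or PaDEL, which work from full connection tables):

```python
import re

def constitutional_descriptors(formula):
    """Parse a simple molecular formula (e.g. 'C6H6O') and return a few
    toy 0D constitutional descriptors: total atom count, non-H atom count,
    and per-element counts. Illustrative helper only."""
    counts = {}
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element:
            counts[element] = counts.get(element, 0) + (int(num) if num else 1)
    total = sum(counts.values())
    heavy = total - counts.get("H", 0)  # "number of non-H atoms" in Table 1
    return {"n_atoms": total, "n_heavy_atoms": heavy, **counts}

print(constitutional_descriptors("C6H6O"))  # phenol: 13 atoms, 7 heavy atoms
```

Real descriptor software computes hundreds to thousands of such values per molecule, which is why the feature-selection step discussed below becomes necessary.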


Fig. 1 Schematic representation of QSAR methodology according to OECD guidelines
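One of the dataset-division methods listed under step (iii) below, the Kennard-Stone algorithm (Kennard and Stone 1969), selects a training set by starting from the two most distant samples in descriptor space and repeatedly adding the sample whose nearest already-selected neighbour is farthest away. A minimal sketch, assuming plain Euclidean distances (not the implementation of any particular published tool):

```python
def kennard_stone(X, n_train):
    """Split the samples in X (a list of descriptor vectors) into a
    diverse training set of size n_train and a test set, following the
    Kennard-Stone maximin rule. Illustrative sketch only."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    # seed with the most distant pair of samples
    pairs = [(dist(X[i], X[j]), i, j)
             for i in range(len(X)) for j in range(i + 1, len(X))]
    _, i0, j0 = max(pairs)
    selected = [i0, j0]
    while len(selected) < n_train:
        remaining = [k for k in range(len(X)) if k not in selected]
        # maximin rule: the candidate whose nearest selected neighbour
        # is farthest away covers the most new chemical space
        best = max(remaining,
                   key=lambda k: min(dist(X[k], X[s]) for s in selected))
        selected.append(best)
    test = [k for k in range(len(X)) if k not in selected]
    return selected, test

train, test = kennard_stone([[0.0], [1.0], [2.0], [10.0]], n_train=3)
print(train, test)  # [0, 3, 2] [1]
```

Applied to four one-dimensional "compounds", the two extremes enter first, then the most distant interior point; the remaining sample becomes the test set, which by construction lies inside the chemical space of the training set.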

(iii) Dataset division: A predictive model's performance must be determined by dividing the dataset into a training set and a test set. Among all chemicals, only the training set molecules are used for developing QSAR models, and the external predictivity of the models is examined through the use of the test set compounds. In developing the QSAR model, it is necessary to select a training set in such a way that it encompasses a wide chemical domain. The test set compounds must lie within the chemical space of the training set. Dataset division involves different methods, including (a) Euclidean distance (diversity-based) (Golmohammadi et al. 2012), (b) Kennard-Stone (Kennard and Stone 1969), (c) k-means clustering (Likas et al. 2003), (d) sorted response (Roy 2018), etc.
(iv) Feature selection: Feature selection is a vital step that involves identifying important predictor variables to develop correlations with the response variable. Feature selection helps decrease the model complexity, decreases the risk of overfitting or overtraining, and helps select the most critical descriptors among a pool of hundreds or thousands. In this way, the dimensionality of the input descriptors is minimized without the loss of essential information (Goodarzi et al. 2012). Finally, these selected descriptors are used to build a mathematical model linking them to the biological activity of the corresponding compounds. According to the OECD guidelines, several feature selection techniques have been applied using a mechanistic basis, including genetic algorithms, genetic function approximation (GFA), forward selection, backward elimination, stepwise regression, simulated annealing, etc.
(v) Model development algorithms: OECD guideline 2 explains that a QSAR model should be developed using an "unambiguous algorithm" (Directorate 2007). The rule focuses on bringing transparency to model-building, rendering it reproducible by others and making it possible to achieve the endpoint estimates. This embraces the methods implemented during data pre-treatment, division of data, feature selection, and model development. Linear modeling techniques involve multiple linear regression (MLR) (Pope and Webster 1972; De and Roy 2018), ordinary least squares (OLS), partial least squares (PLS) (Wold et al. 2001), principal component analysis (PCA) (Abdi and Williams 2010), principal component regression (PCR), etc.

In QSAR, model-building tools can be grouped into two major categories: the regression-based approach and the classification-based approach. Regression-based approaches are effective when both the dependent (response variable) and independent (molecular descriptors) variables are quantitative (Roy et al. 2015a, b). In the case of classification-based modeling, a relationship between the descriptors and the graded values


Table 1 Types of 0D-3D descriptors used in the QSAR study


Dimension of descriptors   Parameters   Examples

0D Constitutional indices Number of atoms, number of non-H atoms, number of bonds, number of aromatic bonds, sum of
atomic van der Waals volumes (scaled on carbon atom), etc.
Molecular property Unsaturation count, unsaturation index, hydrophilic factor
Atom and bond counts
1D Fragment counts, fingerprints Atom centered fragments (C-001, H-046, O-056, etc.)
2D Topological Wiener index (W), Zagreb group indices, Balaban J index, Randic branching index (χ), Molecular
connectivity index, subgraph count, Chi indices, etc.
Structural Chiral centers, rotatable bonds, HBond donor, HBond acceptor
Physicochemical parameters Heat of formation (Hf), Log of the partition coefficient using Ghose and Crippen’s method
(thermodynamic param- (AlogP), Desolvation free energy (Fh2o, Foct)
eters)
Connectivity indices Average connectivity index, valence connectivity index, solvation connectivity index, modified
Randic index, connectivity topochemical index, perturbation connectivity index
Functional group counts Number of terminal primary C(sp3), number of total secondary C(sp3), number of ring quaternary
C(sp3), number of carboxylic acids, number of hydroxyl groups, etc.
2D matrix based Balaban-like index from adjacency matrix, logarithmic spectral positive sum from adjacency
matrix, spectral absolute deviation from adjacency matrix, etc.
2D atom pairs Presence or absence of any two atoms at a particular topological distance (B01[C–C], B09[C-F],
etc.), frequency of two atoms at a particular topological distance (F01[C-F], F05[O-N]), sum of
occurrence of two atoms at a particular topological distance (T(N..I), T(O..N))
3D Electronic Dipole moment, highest occupied molecular orbital (HOMO), lowest unoccupied molecular orbital
(LUMO), superdelocalizability
Spatial The radius of gyration, Jurs descriptors, area, density, volume, etc.
Receptor surface Hydrophobicity, partial charge, electrostatic (ELE) potential, van der Waals (VDW) potential, and
analysis parameters hydrogen bonding propensity
Molecular shape analysis Difference volume (DIFFV), Common overlap steric volume (COSV), Common overlap volume
ratio (Fo), Noncommon overlap steric volume (NCOSV), Root mean square to shape reference
(ShapeRMS)
Geometric Molecular eccentricity, spherosity, asphericity, aromaticity index, gravitational index
Other 3D descriptors 3D matrix based (Wiener-like index, Randic-like index, Balaban-like index, etc., all from the geometric
matrix; spectral moment), 3D autocorrelations (3D topological distance-based descriptors:
unweighted; weighted by mass, polarizability, van der Waals volume, Sanderson electronegativ-
ity, ionization potential), 3D Morse descriptors, WHIM descriptors, GETAWAY descriptors,
quantum-chemical descriptors

0D, 1D, and 2D descriptors may be collectively grouped under the broad class of 2D descriptors in general
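As a concrete instance of a 2D topological descriptor from Table 1, the Wiener index (W) is the sum of shortest-path distances over all pairs of heavy atoms in the hydrogen-suppressed molecular graph. A minimal sketch (the graph is hand-coded here; descriptor software derives it from the connection table automatically):

```python
def wiener_index(n_atoms, bonds):
    """Wiener index of a hydrogen-suppressed molecular graph given as a
    bond list of atom-index pairs. Illustrative sketch only."""
    INF = float("inf")
    # Floyd-Warshall all-pairs shortest paths on the molecular graph
    d = [[0 if i == j else INF for j in range(n_atoms)] for i in range(n_atoms)]
    for i, j in bonds:
        d[i][j] = d[j][i] = 1
    for k in range(n_atoms):
        for i in range(n_atoms):
            for j in range(n_atoms):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # sum the topological distances over unordered atom pairs
    return sum(d[i][j] for i in range(n_atoms) for j in range(i + 1, n_atoms))

# n-butane as a 4-atom path graph: C0-C1-C2-C3
print(wiener_index(4, [(0, 1), (1, 2), (2, 3)]))  # 10
```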

of the response variable(s) is established. Here, the response is offered in a Boolean form like active/inactive and positive/negative, or categorical (as observed in linear discriminant analysis, logistic regression, and cluster analysis).

(vi) Determination of domain of applicability: One of the most essential checkpoints in QSAR modeling is determining the applicability domain (AD) of a model, as explained in OECD principle 3. The applicability domain denotes a physicochemical space (both the response and chemical structure space) within which a QSAR model can predict with a certain degree of reliability (Roy et al. 2015a, b). This space is defined by the features of the compounds in the training set, and it is mandatory to examine whether the prediction of the test set molecules is reliable or not. The concept of AD is used to avoid an unjustified extrapolation of property predictions.
(vii) QSAR model validation: Before interpreting and predicting biological responses of untested compounds, any QSAR model needs to be validated. Here, the model's predictive power is established, and the ability to reproduce the biological activities of the untested compounds is measured. In consonance with the fourth principle of OECD guidelines, statistical validation of models in terms of goodness-of-fit, robustness, and predictivity is an extremely impor-


tant step during QSAR model development. The validation of QSAR models is crucial if these models are used for virtual screening. Each validation parameter aims to judge the accuracy of prediction, i.e., determining whether the experimental value is close to the model-derived value. The model fitness, determined using the coefficient of determination or correlation coefficient for the training set, measures the degree of achieved correlation between the experimental (Yexp) and calculated (Ycalc) response values. Data fitting does not confirm the predictability of a model but instead demonstrates the model's statistical quality. Different internal and external validation metrics for both regression and classification modeling are utilized to check model prediction quality, as discussed in the following section.
(viii) Mechanistic interpretation: The fifth OECD principle focuses on identifying the features of the variables that may contribute to a more thorough understanding of the response being modeled. Chemicals that act through a specific mechanism can only be designed and developed with absolute certainty using structural analogues. However, it is evident that furnishing mechanistic information may not always be feasible. The rule suggests that the modeler should report any such information if it is available, facilitating future research on that endpoint. A mechanistic interpretation from the literature can be added, and therefore, the fifth OECD principle encourages the reporting of such information to enrich the physicochemical understanding of the response being modeled.

Regression and classification validation metrics

The reliability of a developed QSAR model is confirmed through the validation process. The quality of input data, dataset diversity, predictability on an external set, applicability domain determination, and mechanistic interpretability are also confirmed through various validation metrics. QSAR model validation can be classified into two major types: (a) internal validation and (b) external validation. Internal validation in QSAR modeling involves activity prediction of the molecules/compounds used for generating the model. This is followed by estimating metrics for detecting the precision of predictions. Internal validation is useful in the case of cross-validation approaches (Konovalov et al. 2008), where the internal data are partitioned into calibration (training) and validation (test) subsets. The calibration set is used for model-building purposes, and the validation set is utilized for model predictivity assessment. Assessment of the prediction capability and applicability of a QSAR model to predict newly designed or untested molecules is done using external validation metrics. In most cases, some compounds from the original dataset are used for validation purposes when true external data points are limited or not available.

Regression-based validation metrics

One of the main quality metrics to check the goodness-of-fit of a regression model is the determination coefficient (R²), which measures the variation of the observed data with the fitted ones. The maximum possible value for R² is 1, which defines a perfect correlation.

Adjusted R² (R²adj) is a modified version of the determination coefficient and is also known as the explained variance. The R²adj parameter incorporates the information on the number of samples and the independent variables used in the model.

Considering the internal validation of a regression-based QSAR model, the leave-one-out cross-validation (Q²LOO) metric is calculated. Here, a model is developed by modifying the original training set of n compounds by removing one compound. The activity of the omitted compound is then predicted using the model developed with n−1 compounds. This cycle is repeated until all the training set compounds have been eliminated once and the predicted activity data are obtained for all the training set compounds. The model predictivity is thus measured using the predicted residual sum of squares (PRESS) and the cross-validated R² (Q²) (Table 2). The PRESS value is defined as the sum of squared differences between the experimental and leave-one-out predicted data. The standard deviation of error of predictions (SDEP) is calculated from the PRESS value (Table 2). A model is considered satisfactory if the value of Q² is higher than the predetermined value of 0.6. However, numerous studies have suggested that leave-one-out prediction should neither be considered the ultimate standard for judging the predictive power of models nor for model selection (Konovalov et al. 2007; Veerasamy et al. 2011). There is a chance of overfitting and overestimation in LOO due to structural redundancy (Höltje and Sippl 2001). Leave-many-out (LMO) or leave-some-out (LSO) might be a better alternative, where a part of the training data (m compounds, 1 ≤ m < n, where n is the sample size) is held out and the remaining data are modeled. The model is developed using the remaining compounds in each cycle, and the held-out compounds are predicted. This cycle continues until all the compounds are predicted, and the predicted values are used for the calculation of Q²LMO. Therefore, the LMO technique partly reflects external validation in the context of internal validation.

Although Q²LOO provides a measure of model robustness, it may not be sufficient to characterize the performance of the model during the prediction of new query/test compounds. Furthermore, Q²LOO can provide an overestimation of model


Table 2 Validation metrics for regression modeling

Determination coefficient (R²)
Equation: R² = 1 − Σ(Yobs − Ypred)² / Σ(Yobs − Ȳtraining)²
Description: Metric to check the goodness-of-fit of a regression model; it measures the variation of the observed data with the predicted ones. The maximum possible value for R² is 1, which defines a perfect correlation. Yobs denotes the observed and Ypred the calculated response values for the training set compounds; Ȳtraining is the mean observed response of the training set compounds.

Explained variance or adjusted R² (R²adj)
Equation: R²adj = [(n − 1) × R² − p] / (n − p − 1)
Description: Modified version of the determination coefficient. The R²adj parameter incorporates the information on the number of samples and the independent variables used in the model. n is the number of training set compounds and p is the number of predictor variables.

Leave-one-out cross-validation (Q²LOO)
Equation: Q²LOO = 1 − Σ(Yobs(training) − Ypred(training))² / Σ(Yobs(training) − Ȳtraining)²
Description: Cross-validated R² (Q²) is checked for internal validation. Yobs(training) is the observed response, and Ypred(training) is the predicted response of the training set molecules based on the leave-one-out (LOO) technique.

Predictive R² or R²pred or Q²ext(F1)
Equation: Q²ext(F1) = 1 − Σ(Yobs(test) − Ypred(test))² / Σ(Yobs(test) − Ȳtraining)²
Description: Employed for judging external predictivity; a measure of correlation between the observed and predicted data of the test set. Yobs(test) is the observed and Ypred(test) the predicted response of the test set molecules; Ȳtraining denotes the mean observed response of the training set.

Q²ext(F2)
Equation: Q²ext(F2) = 1 − Σ(Yobs(test) − Ypred(test))² / Σ(Yobs(test) − Ȳtest)²
Description: Helps in the judgment of the predictivity of a model using the test set mean (Ȳtest).

Q²ext(F3)
Equation: Q²ext(F3) = 1 − [Σ(Yobs(test) − Ypred(test))² / n_test] / [Σ(Yobs(training) − Ȳtraining)² / n_train]
Description: Measured to determine external predictivity employing both training and test set features. Yobs(test) is the observed and Ypred(test) the predicted response of the test set molecules; Yobs(training) is the observed response and Ȳtraining the mean observed response of the training set molecules. The threshold for Q²ext(F3) is 0.5.

Concordance correlation coefficient (CCC)
Equation: CCC = 2Σ(xᵢ − x̄)(yᵢ − ȳ) / [Σ(xᵢ − x̄)² + Σ(yᵢ − ȳ)² + n(x̄ − ȳ)²]
Description: Measures both precision and accuracy, detecting the distance of the observations from the fitting line and the degree of deviation of the regression line from that passing through the origin, respectively. n denotes the number of compounds; xᵢ and yᵢ are the observed and predicted values, and x̄ and ȳ their respective means.

Root mean square error in predictions (RMSEp)
Equation: RMSEp = √[Σ(Yobs(test) − Ypred(test))² / n_test]
Description: Gives a measure of model external validation. A lower value of this parameter is desirable for good external predictivity.

rm² metrics
Equation: r̄m² = (rm² + r′m²) / 2 and Δrm² = |rm² − r′m²|, where rm² = r² × (1 − √(r² − r0²)) and r′m² = r² × (1 − √(r² − r′0²))
Description: r² is the squared correlation coefficient between the observed and predicted response values, and r0² and r′0² are the respective squared correlation coefficients when the regression line is passed through the origin, with and without interchanging the axes. For acceptable prediction, the value of Δrm² should preferably be lower than 0.2, provided that the value of r̄m² is more than 0.5 (Ojha et al. 2011).

Predicted residual sum of squares (PRESS)
Equation: PRESS = Σ(Yobs − Ypred)²
Description: Sum of squared differences between the experimental and predicted data. Yobs and Ypred correspond to the observed and LOO-predicted values.


Table 2 (continued)

Standard deviation of error of prediction (SDEP)
Equation: SDEP = √(PRESS / n)
Description: The value of the standard deviation of error of prediction (SDEP) is calculated from PRESS; n refers to the number of observations.

Mean absolute error (MAE)
Equation: MAE = (1/n) × Σ|Yobs − Ypred|
Description: Also known as the average absolute error (AAE); considered a better index of errors in the context of predictive modeling studies.
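Several of the Table 2 metrics can be computed directly from observed and predicted response lists. The sketch below (the function name and toy data are illustrative, not taken from any of the cited tools) implements Q²ext(F1), RMSEp, MAE, and CCC exactly as defined in the table:

```python
from math import sqrt

def regression_validation_metrics(y_obs_train, y_obs_test, y_pred_test):
    """Compute a few external validation metrics from Table 2.
    The training-set observed responses are needed only for the
    training mean used in Q2_ext(F1). Illustrative sketch only."""
    n = len(y_obs_test)
    mean_train = sum(y_obs_train) / len(y_obs_train)
    press = sum((o - p) ** 2 for o, p in zip(y_obs_test, y_pred_test))
    q2_f1 = 1 - press / sum((o - mean_train) ** 2 for o in y_obs_test)
    rmsep = sqrt(press / n)
    mae = sum(abs(o - p) for o, p in zip(y_obs_test, y_pred_test)) / n
    # concordance correlation coefficient (CCC)
    x_bar = sum(y_obs_test) / n
    y_bar = sum(y_pred_test) / n
    s_xy = sum((o - x_bar) * (p - y_bar) for o, p in zip(y_obs_test, y_pred_test))
    s_xx = sum((o - x_bar) ** 2 for o in y_obs_test)
    s_yy = sum((p - y_bar) ** 2 for p in y_pred_test)
    ccc = 2 * s_xy / (s_xx + s_yy + n * (x_bar - y_bar) ** 2)
    return {"Q2_ext_F1": q2_f1, "RMSEp": rmsep, "MAE": mae, "CCC": ccc}

# toy data: a model that predicts the test responses almost perfectly
m = regression_validation_metrics(
    y_obs_train=[1.0, 2.0, 3.0, 4.0],
    y_obs_test=[1.5, 2.5, 3.5],
    y_pred_test=[1.4, 2.6, 3.5],
)
print(m)
```

With near-perfect toy predictions, Q²ext(F1) and CCC approach 1 and RMSEp and MAE approach 0, matching the acceptance thresholds quoted in the text (Q²ext(F1) > 0.6, CCC > 0.85).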

quality as a result of structural redundancy in the training set data. Thus, the performance of a model on an external dataset is considered mandatory for the judgment of predictivity. The metric employed for judging external predictivity is termed predictive R² or R²pred or Q²ext(F1). The Q²ext(F1) metric is characterized by a minimum threshold value of 0.6, i.e., models showing a value of more than 0.6 are considered to be externally predictive, with the ideal value being 1.0. Schüürmann and co-workers (Schüürmann et al. 2008) defined another external validation metric, Q²ext(F2), for the judgment of the predictivity of a model using the test set. Consonni et al. (2009) introduced another external validation metric, Q²ext(F3). This metric measures the model predictability and is sensitive to the selection of the training dataset; it tends to penalize models fitted to a very homogeneous dataset even if predictions are close to the truth, with a threshold value of 0.6.

Another metric that checks the model reliability is the concordance correlation coefficient (CCC) (Chirico and Gramatica 2011). It measures both precision and accuracy, detecting the distance of the observations from the fitting line and the degree of deviation of the regression line from that passing through the origin, respectively. Any deviation of the regression line from the concordance line (the line passing through the origin) gives a value of CCC smaller than 1. The desirable threshold value for CCC is 0.85.

The root-mean-square error in predictions (RMSEp) gives a measure of model external validation. This metric is comparatively simpler and directly depicts the prediction errors for the test set observations with respect to the total number of test set samples. A lower value of this metric is desirable for good external predictivity.

The rm² metrics: the training set mean value and the distance of the mean from the response values of each compound play a decisive role in computing the Q² values. The Q² value increases with the rise in the value of the denominator in the expression of Q². Thus, even for a considerable deviation between the predicted and observed response values, satisfactory Q² values may be obtained if the molecules exhibit a considerably broad range of response data. Using the concept of the regression-through-origin approach, Roy et al. (2012) introduced a new metric, rm² or modified r², that penalizes the r² value of a model when there is a large deviation between r² (the squared correlation coefficient between the observed (Y axis) and predicted (X axis) values of the compounds with intercept) and r0² (the squared correlation coefficient between the observed (Y axis) and predicted (X axis) values of the compounds without intercept) (Table 2).

MAE-based criteria: Roy et al. (2016) have shown that the conventional correlation-based external validation metrics (Q²ext(F1), Q²ext(F2)) often provide a biased judgment of model predictivity, since such metrics are influenced by factors such as the response range and the distribution of data. Here, the authors defined a set of criteria using the simple 'mean absolute error' (MAE) and the corresponding standard deviation (σ) of the predicted residuals to judge the external predictivity of models. Note that MAE = (1/n) × Σ|Yobs − Ypred|, where Yobs and Ypred are the respective observed and predicted response values of the test set comprising n compounds. The response range of the training set compounds is employed here to define the threshold values. Furthermore, the authors proposed the application of the MAE-based criteria to 95% of the test set data, removing the 5% of data with the highest predicted residual values to preclude the possibility of biased prediction quality due to any outlier prediction. The following criteria for MAE-based prediction quality are followed:

i. Good predictions: in simple terms, an error of 10% of the training set range should be acceptable, while an error of more than 20% of the training set range is a very high error. Thus, the criterion for good predictions is as follows:

MAE ≤ 0.1 × training set range and (MAE + 3σ) ≤ 0.2 × training set range.

Here, σ indicates the standard deviation of the absolute errors for the test data. For a normal distribution pattern, mean ± 3σ covers 99.7% of the data points.

ii. Bad predictions: a value of MAE of more than 15% of the training set range is considered high, while an error higher than 25% of the training set range is judged as very high. Thus, a prediction is considered bad when
13
1286 Archives of Toxicology (2022) 96:1279–1295

MAE > 0.15 × training set range or (MAE + 3𝜎) > 0.25 × training set range.

Predictions which do not fall under either of the above can be √ o b t a i n e d t h r o u g h R2p c a l c u l a t i o n
2 2
two conditions may be considered as of moderate quality. ( Rp = R × R2 − R2r ). A robust QSAR model should have
This criterion is applied for judging the quality of test set R2p value less than 0.5. At the ideal condition, the average
prediction when there are at least 10 data points signifying value of R2 for the randomized models should be zero, i.e.,
statistical reliability and there is no systemic error in model R2r should be zero. Consequently, in such a case, the value of
predictions. R2p should be equal to the value of R2 for the developed
Randomisation of response (Y-scrambling)–Randomisa- QSAR model. Thus, as proposed by Todeschini, √ the cor-
tion is an assessment to ensure the developed QSAR model rected formula of R2p (cR 2p ) is cR 2p = R × R2 − R2r (Todes-
is not due to chance, thereby giving an idea of model robust- chini 2010).
ness (Rücker et al. 2007). In this technique, validation met-
rics are checked by repetitive permutation of the response Classification-based QSAR validation metrics
data (Y) of n compounds in the training set with respect to
the X (descriptor) matrix which is kept unchanged. The cal- In a binary classification model, several validation metrics
culations are repeated with randomized activities, followed are utilized to evaluate the model's performance in terms of
by a probabilistic examination of the results. Every run will accurate qualitative prediction of the dependent variable.
yield approximations of R2 and Q2 , which are recorded. For Classification models are generally assessed using a statisti-
an acceptable QSAR model, the average correlation coeffi- cal method that is based on the Bayesian approach (Ghosh
cient ( Rr ) of randomized models should be less than the et al. 2020). A binary classification model is typically a
correlation coefficient ( R) of a non-random model. The dif- two-class model, i.e., positive and negative, or active and
ference between mean-squared correlation coefficients of the inactive. The results obtained can be arranged in a contin-
randomized ( R2r ) and that of the non-random ( R2 ) models gency table (also known as confusion matrix) (Table 3). The
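As a concrete illustration, the MAE-based criteria described earlier can be sketched in a few lines (an illustrative sketch, not the DTC tools' code; the 95% trimming step is simplified here):

```python
import statistics

def mae_quality(abs_errors, train_range):
    """Label prediction quality by the MAE-based criteria (Roy et al. 2016).

    abs_errors  : absolute prediction errors for the test set
    train_range : max(y_train) - min(y_train)
    """
    # MAE(95%): drop the worst 5% of absolute errors to limit outlier bias
    kept = sorted(abs_errors)[: max(1, round(len(abs_errors) * 0.95))]
    mae = statistics.mean(kept)
    sigma = statistics.pstdev(kept)   # std. dev. of the retained absolute errors
    if mae <= 0.1 * train_range and mae + 3 * sigma <= 0.2 * train_range:
        return "good"
    if mae > 0.15 * train_range or mae + 3 * sigma > 0.25 * train_range:
        return "bad"
    return "moderate"
```

Predictions falling between the two thresholds are labelled moderate, mirroring the text above.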

Table 3 Contingency table or confusion matrix for classification modeling

                          Experimental
                          Active                Inactive              Total
Predicted   Active        True positive (TP)    False positive (FP)   TP + FP
            Inactive      False negative (FN)   True negative (TN)    FN + TN
            Total         TP + FN               FP + TN               N = TP + FP + FN + TN

Table 4 Validation metrics for classification modeling

Sl No.  Classification metric                    Equation
1       Sensitivity                              Sensitivity = TP/(TP + FN)
2       Specificity                              Specificity = TN/(TN + FP)
3       Precision                                Precision = TP/(TP + FP)
4       Accuracy                                 Accuracy = (TP + TN)/(TP + FN + TN + FP)
5       F-measure                                F-measure = 2/[(1/Precision) + (1/Sensitivity)]
6       G-means                                  G-means = √(Specificity × Sensitivity)
7       Cohen's kappa (κ)                        Pr(a) = (TP + TN)/(TP + FP + TN + FN)
                                                 Pr(e) = [{(TP + FP) × (TP + FN)} + {(TN + FP) × (TN + FN)}]/(TP + FN + FP + TN)²
                                                 Cohen's κ = [Pr(a) − Pr(e)]/[1 − Pr(e)]
8       Matthews correlation coefficient (MCC)   MCC = [(TP × TN) − (FP × FN)]/√[(TP + FP) × (TP + FN) × (TN + FP) × (TN + FN)]

Pr(a): relative observed agreement between the predicted classification of the model and the known classification; Pr(e): hypothetical probability of chance agreement
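As an illustration, the metrics of Table 4 can be computed directly from the four confusion-matrix counts (a minimal sketch with no guards against zero denominators):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the Table 4 classification metrics from confusion-matrix counts."""
    n = tp + tn + fp + fn
    sn = tp / (tp + fn)                   # sensitivity (recall)
    sp = tn / (tn + fp)                   # specificity
    prec = tp / (tp + fp)                 # precision
    acc = (tp + tn) / n                   # accuracy
    f_measure = 2 / (1 / prec + 1 / sn)   # harmonic mean of precision and recall
    g_means = math.sqrt(sp * sn)
    pr_a = (tp + tn) / n                  # observed agreement
    pr_e = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (pr_a - pr_e) / (1 - pr_e)    # Cohen's kappa
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "Precision": prec, "Accuracy": acc,
            "F-measure": f_measure, "G-means": g_means,
            "Kappa": kappa, "MCC": mcc}
```

For example, a model with TP = 40, TN = 45, FP = 5, and FN = 10 yields Sn = 0.8, Sp = 0.9, and Accuracy = 0.85.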

The statistical metrics explaining the quality of a classification model are given below and in Table 4.

In classification QSAR modeling, the compounds are classified into four main categories: (a) true positives (TP), (b) true negatives (TN), (c) false positives (FP), and (d) false negatives (FN) (Table 3). Researchers have used a variety of statistical tests to assess the classifier model performance and classification capability. Sensitivity (Sn) is the percentage of active compounds correctly predicted and is expressed as the ratio of true-positive results to the total number of positive data. Specificity (Sp) is the ratio of true-negative results to the total number of negative data. Accuracy (Acc) implies the fraction of correctly predicted compounds. Precision indicates the accuracy of a predicted class (the ratio between the true positives and total predicted positives), and F-measure refers to the harmonic mean of Recall (or Sensitivity) and Precision. Higher values for recall and precision give higher values for F-measure, thereby implying better classification.

G-means is a combination term that merges Sn and Sp into a single parameter via the geometric mean. This allows an easy assessment of the model's ability to distinguish between active and inactive samples.

Cohen's kappa (κ) can be utilized to determine the concordance between classification (predicted) models and known classifications (Cohen 1960). It is a measure of the degree of agreement. It returns values from −1 (total disagreement) through 0 (random classification) to 1 (total agreement).

The Matthews correlation coefficient (MCC) measures the quality of binary classifications and compares different classifiers. In any case where the numbers of positive and negative compounds are not equal, the terms sensitivity, specificity, and accuracy are not reliable. MCC uses all four values (TP, TN, FP, and FN) and is directly calculated from the confusion matrix to provide a more balanced prediction evaluation. Like Cohen's kappa, the value of MCC also ranges from −1 to 1.

Prediction reliability detection tools

As discussed earlier, the process of QSAR modeling consists of three important steps: model development, model selection, and model interpretation. The model development process involves various feature selection practices including stepwise multiple linear regression (S-MLR), genetic algorithm, genetic function approximation, etc. Model selection is based on the identification of the finest model (based on validation metric values) from a set of alternative models. When it comes to the reliability of QSAR/QSPR models, validation is essential. After a model has been selected, several internal and external validation metrics are assessed which help in demonstrating the actual predictive performance of the chosen model. Several groups of researchers in QSAR have suggested external validation to be the gold standard in demonstrating the predictive ability of a model (Golbraikh and Tropsha 2002; Gramatica and Sangion 2016; Gramatica 2020). Multiple modeling in consensus form has been introduced to achieve a lower degree of predicted residuals for query compounds (Roy et al. 2015b; Khan et al. 2019a; Roy et al. 2019). In the following sections, we will discuss various tools from the DTC Laboratory (https://sites.google.com/site/kunalroyindia/home/qsar-model-development-tools) that help understand the prediction ability of one or more QSAR models.

(i) Double cross-validation (version 2.0) tool

The most common scheme of external validation is the hold-out method. Here, the original dataset is divided into training and test sets, where the training set is used for model-building purposes followed by model selection based on internal validation metrics, and the test set is used for model validation through external validation metrics. This approach ensures that the test set is never applied during the model-building procedure and remains unseen by the developed model. However, a single training set does not ensure optimal feature selection, since a fixed training set composition leads to a bias in feature selection. This issue is more apparent in the case of MLR models than in partial least-squares (PLS) or principal component regression (PCR) models, which are more robust and generalized methods. Baumann and Baumann (2014) discussed the concept of double cross-validation (DCV), which Roy and Ambure implemented in a tool (Roy and Ambure 2016) where the training set is further divided into 'n' number of calibration and validation sets. The tool is freely available from http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/. The algorithm comprises two nested cross-validation loops (Bates et al. 2021), namely, the outer loop and the inner loop (Fig. 2). The outer loop consists of data points that are split arbitrarily into disjoint subsets known as training set compounds and test set compounds. The training set is utilized in the inner loop for model development and model selection, and the test set is used exclusively for checking model predictivity. The training set is further split into k number of calibration and validation sets in the inner loop by applying the k-fold cross-validation technique (Wainer and Cawley 2021). In the k-fold cross-validation method, the training data are initially segregated into k subsets, followed by preparing k iterations of calibration and validation sets. At each iteration, a different subset of the data is excluded and used as the validation set, while the remaining k−1 subsets are used as calibration sets. The data are passed through a stratification process, i.e., data rearrangement which helps

Fig. 2 Schematic diagram of double cross-validation algorithm (colour figure online)

maintain data uniformity (each fold is representative of the whole dataset). Each k-fold calibration set is then used to develop multiple linear regression (MLR) models, while the respective validation sets are applied to find the prediction errors. The tool provides two methods of feature selection: stepwise multiple linear regression (S-MLR) (Maleki et al. 2014; Ojha and Roy 2018) and genetic algorithm-MLR (GA-MLR) (Leardi 2001). The prediction error is checked using the mean absolute error (MAE95%) (Roy et al. 2016). There is also a provision for the generation of PLS models in the tool. Furthermore, the models in the inner loop are selected based on three major criteria as follows:

i) The models with the lowest MAE value (on the validation set) are selected.
ii) Consensus predictions of the top models are selected based on the MAE value of the validation set.
iii) Searching out the best descriptor combination from the top models.

Researchers have found the DCV approach to be reliable and useful, and it has thus been successfully employed in various applications, for example, quantitative structure–property relationship (QSPR) modeling of the sweetness potency of organic chemicals (Ojha and Roy 2018), developing nano-QSAR models for TiO2-based photocatalysts (Mikolajczyk et al. 2018), inhalational toxicity modeling (Nath et al. 2022), modeling of diagnostic agents (De et al. 2019; De et al. 2020, 2022; De and Roy 2020, 2021), etc.

(ii) Intelligent consensus predictor tool

A well-validated QSAR model engages different classes of descriptors, which accentuate many features of molecular structures. Individual QSAR models may exaggerate a few important features, undervalue other features, and overlook some significant characteristic features. Roy et al. (2018b) proposed an "intelligent" selection of multiple models that would enhance the quality of predictions for query compounds. This software helps judge the performance of consensus predictions compared to the quality obtained from the individual MLR models, based on the MAE-based criteria (95%). The tool "Intelligent Consensus Prediction" is available from http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/. The tool takes multiple individual models (M1, M2, M3, etc.) as input, derived using different combinations of descriptors from the training set. There are four ways of consensus prediction explained in the work:

(i) Consensus model 0 (CM0): it provides a simple average of predictions from all input individual models.
(ii) Consensus model 1 (CM1): it is the average of predictions from all individual qualified models. It is calculated from the arithmetic average of predicted response values attained from the 'n' qualified models for test compounds, rather than from all existing individual models.
(iii) Consensus model 2 (CM2): it is the weighted average prediction (WAP) from all qualified individual models. In CM2, the average is evaluated by giving a proper weightage to the qualified models for a particular test set compound.
(iv) Consensus model 3 (CM3): compound-wise best selection of predictions from qualified individual models. The best model for a particular test compound is selected based on its cross-validated mean absolute error (MAEcv). The model with the lowest MAEcv value is the best for a particular test set compound.


The tool further provides additional selection criteria, which include:

(a) Euclidean distance cut-off: this is used to find a fitting model to predict the test set compound, where the 10 most similar training compounds are selected based on the Euclidean distance score. The user can set a Euclidean cut-off ranging from 0 to 1 to restrict the selection to only those training set compounds with a Euclidean distance score less than or equal to the set cut-off value.
(b) Applicability domain: the AD helps to check whether the test/query compound is in the chemical space of the model or not. A simple standardization approach is used for AD determination.
(c) Dixon Q test: this test can be employed to spot and remove a response outlier from the selected similar training set compounds.

The complete calculation method is demonstrated in the published article by Roy et al., and the methodology is given in Fig. 3. The ICP method has found good application in the prediction of pharmaceuticals (Khan et al. 2019a), organic chemicals and dyes (Roy et al. 2019; Khan and Roy 2019; Ghosh and Roy 2019; Ojha et al. 2020), determining aquatic toxicity (Hossain and Roy 2018), inhalational toxicity (Nath et al. 2022), polymer properties (Khan et al. 2018), etc.

(iii) Prediction Reliability Indicator tool

A QSAR model is developed based on the physicochemical features of an appropriately designed training set having experimentally derived response data, and the model is then validated using one or more test set(s) for which the experimental response data are available. Whether such a model can provide reliable predictions for a completely new data set (true external set) is quite an interesting question.
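The Euclidean distance-based selection of similar training compounds used in criterion (a) can be sketched as follows (an illustrative sketch; the tool's normalization of distance scores to the 0–1 range is omitted here, so the cut-off value shown is a placeholder):

```python
import math

def most_similar_compounds(query, train_descriptors, cutoff=0.5, k=10):
    """Return indices of up to k training compounds most similar to a query,
    ranked by Euclidean distance in (pre-scaled) descriptor space."""
    scored = []
    for idx, x in enumerate(train_descriptors):
        d = math.dist(query, x)        # Euclidean distance score
        if d <= cutoff:                # user-set cut-off on the distance score
            scored.append((d, idx))
    scored.sort()                      # closest (most similar) first
    return [idx for _, idx in scored[:k]]
```

Training compounds beyond the cut-off are excluded entirely, so a query far from the training chemical space may end up with fewer than k (or no) similar compounds, which is itself a reliability signal.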

Fig. 3 "Intelligent Consensus Prediction" algorithm

Fig. 4 Methodology applied for scoring test/query compounds in "Prediction Reliability Indicator" tool

Roy et al. (2018a, b) have developed a new scheme (Fig. 4) to define the reliability of predictions from QSAR models for new query compounds and implemented the method in a new tool called "Prediction Reliability Indicator", freely available from http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/. This tool is applicable for predictions from MLR and PLS models. The work aimed at formulating a set of rules/criteria that ultimately empower the user to estimate the quality of predictions for individual test (external) compounds. Predictions for test/external sets can be of varying quality: they may not be good in all cases, and a model can show moderate to bad/unreliable predictions for some of the external set compounds. Keeping this variation of prediction quality in mind, the authors hypothesized three rules/criteria which might assist in classifying the quality of predictions for individual test/external set compounds into good, moderate, and poor/unreliable ones. The three rules are discussed briefly in the following segment:

(a) Rule/criterion 1: the scoring is based on the quality of leave-one-out (LOO) predictions of the 10 training compounds closest to a test/external compound. Here, the 10 most similar compounds are identified for each test/query compound (based on Euclidean distance similarity), after which the mean of the absolute LOO prediction errors (MAELOO) is calculated for the selected closest 10 compounds. A test/query compound whose closest training compounds have the lowest MAELOO is predicted well and gets the highest prediction score (Prediction Score = 3). Test/query compounds whose close training compounds have medium MAELOO values get a moderate score (Prediction Score = 2), and those whose close training compounds have high MAELOO values get the least score (Prediction Score = 1). The MAE-based criteria (Roy et al. 2016), which involve MAELOO and the standard deviation (σLOO) of the absolute prediction error values, are applied here for scoring the compounds.

(b) Rule/criterion 2: scoring based on the similarity-based AD using the standardization method. The applicability domain (AD) of a model plays an important role in identifying the uncertainty in the prediction of a specific chemical (test/query) by that model. This is based on how similar the test/query compound is to those in the training set. When a test/query compound is similar to only a small fraction, or none, of the training compounds, the prediction is considered unreliable or fails to fall under the training set AD. Here, a simple AD based on the standardization approach (Roy et al. 2015a, b) is employed.

(c) Rule/criterion 3: scoring based on the proximity of predictions to the training set observed/experimental response mean. It was observed earlier that the quality of fit or prediction is better for compounds possessing experimental response values (training and test compounds) close to the training set observed response mean. Thus, in rule/criterion 3, the authors have proposed to assess the prediction quality of a test compound based on the closeness of its predicted response value to the training set observed/experimental response mean. The predicted response value (Ypred(test)) of each test compound is first computed using the training set model, and then this Ypred(test) value is compared with the training set experimental response mean (Ymean(train)) and the corresponding standard deviation (σ(train)). The scoring is carried out in the following manner:

(i) A test compound with a Ypred(test) value falling within the range Ymean(train) ± 2σ(train), that is, (Ymean(train) + 2σ(train)) ≥ Ypred(test) ≥ (Ymean(train) − 2σ(train)), can be assumed to be well (good) predicted by the model and thus gets a score of 3.
(ii) A test compound with a Ypred(test) value falling within the range Ymean(train) ± 3σ(train) but outside the range Ymean(train) ± 2σ(train) can be presumed to be predicted moderately by the model and thus gets a score of 2.
(iii) A test compound with a Ypred(test) value falling outside the range Ymean(train) ± 3σ(train) can be assumed to be predicted poorly by the model and thus gets a score of 1.

Furthermore, after these three criteria are checked, a weighting scheme is employed to compute a composite score for judging the prediction quality of each test compound using all three individual scores. The composite score is defined as follows:

Composite score = W1 × score(rule1) + W2 × score(rule2) + W3 × score(rule3).

Here, score(rule1), score(rule2), and score(rule3) represent the scores obtained after applying the respective rules, whereas W1, W2, and W3 indicate the weightage (automatic or user-provided) given to each of the three individual scores. The PRI tool thus offers a unique composite score which can act as a marker of the prediction quality of a true external test compound. This tool has found application for the prediction of external set/query compounds in many areas, viz., endocrine disruptor chemicals (Khan et al. 2019b), metal oxide nanoparticles (De et al. 2018), organic chemicals (Khan and Roy 2019; Khan et al. 2019c; De et al. 2020, 2022; Nath et al. 2022), etc.
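Rule/criterion 3 and the composite score can be sketched as follows (an illustrative sketch; the weights shown are placeholders, whereas the tool assigns the weightage automatically or takes it from the user):

```python
import statistics

def rule3_score(y_pred, y_train):
    """Score a predicted response by its proximity to the training response mean."""
    mean = statistics.mean(y_train)
    sd = statistics.pstdev(y_train)
    dev = abs(y_pred - mean)
    if dev <= 2 * sd:
        return 3   # good: within mean +/- 2 sigma
    if dev <= 3 * sd:
        return 2   # moderate: between 2 sigma and 3 sigma
    return 1       # poor: outside mean +/- 3 sigma

def composite_score(s1, s2, s3, w=(0.5, 0.25, 0.25)):
    """Weighted combination of the three rule scores (placeholder weights)."""
    return w[0] * s1 + w[1] * s2 + w[2] * s3
```

A compound scoring 3 on all three rules gets the maximum composite score, while disagreement between the rules pulls the composite toward the moderate range.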


(iv) Small dataset modeler (version 1.0.0) tool

Various specialized datasets, involving nanomaterials, properties of catalysts, radiosensitizer molecules, etc., have smaller numbers of data points, where the division of the data into training and test sets may not produce robust and predictive models. A small dataset with 25–50 compounds cannot be used for conventional double cross-validation, as dividing the data set into training and test sets and further into calibration and validation sets is not possible. Ambure et al. have developed a new tool called the Small Dataset Modeler, version 1.0.0 (http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/), solely for small datasets, which includes a double cross-validation approach to develop a model for a small number of data points without training and test set division of the dataset (Ambure et al. 2019) (Fig. 5). Here, the whole input set (containing n compounds) goes into a loop where it is repeatedly split into calibration and validation sets (as in the inner loop of DCV). All possible combinations (k) are obtained using validation sets of r compounds and calibration sets of n−r compounds. The tool asks the user for the number of compounds (i.e., r) in the validation set, based on which all probable combinations of calibration and validation sets are produced. Multiple linear regression (MLR) models are generated using the calibration set compounds employing the Genetic Algorithm-Multiple Linear Regression (GA-MLR) method (Devillers 1996; Venkatasubramanian and Sundaram 2002) of variable selection, while the validation sets are employed to judge the predictive ability of the models. Numerous important internal (R2, R2adj, Q2LMO, MAELOO, rm2(LOO)) and external (Q2F1, Q2F2, rm2(test), CCC, MAEtest) validation metrics are measured in the exhaustive DCV method for all the chosen models. The tool is designed in such a way that it also develops partial least-squares regression (PLS-R) models based on the descriptors selected in the MLR models. The final top model selection can be done in any of the following five recommended ways:

(i) Any model (MLR/PLS) with the smallest MAE (95%) in the validation set is chosen.
(ii) Any model (MLR/PLS) with the smallest MAE (95%) in the modeling set is chosen.
(iii) Any model (MLR/PLS) with the lowest Q2(Leave-Many-Out) (modeling set) is chosen.
(iv) Implementing consensus predictions using the best models that are chosen depending on the MAE (95%) in the validation sets. Consensus predictions can be of two types: (a) using a simple arithmetic average of predictions of the best models; (b) using a weighted average of predictions (WAP) by assigning proper weights to the top chosen models depending on the mean absolute error obtained from leave-one-out cross-validation, MAEcv(95%).
(v) A pool of exclusive descriptors from the best models with the smallest MAE (95%) obtained from the validation set is again employed to build models. In the case of MLR, the best descriptor combinations are put through the Best Subset Selection method; in the case of a PLS model, descriptors nominated in the top models are pooled together for a PLS run.

Fig. 5 Methodology behind the "Small Dataset Modeler" (version 1.0.0) tool to perform QSAR modeling for a small set of data points

The method proposed in the "Small Dataset Modeler" tool confirms internal divisions of small datasets within the DCV technique without taking any test set into account. The approach of the "Small Dataset Modeler" tool integrates data curation, an exhaustive DCV technique, and optimal modeling techniques entailing consensus predictions to develop models, principally for a small set of data points. The methodology behind the tool is schematically presented in Fig. 5. Small dataset modeling has found use in environmental toxicity modeling, including the acute toxicity of antifungal agents toward fish species (Nath et al. 2021) and soil ecotoxicity (Lavado et al. 2022), radiosensitization modeling (De and Roy 2020), modeling of Hepatitis C virus inhibitor proteins (Ejeh et al. 2021), and modeling anesthetics causing GABA inhibition (Stošić et al. 2020).

(v) Read-Across-v3.1 tool

The read-across methodology has gained immense attention in recent years because it is a non-testing approach that can be utilized for data-gap filling. The basic aim of the read-across technique is to predict endpoint information for one or more chemicals (i.e., the target chemicals) using data on the same endpoint from other substances (the source chemicals) using the similarity principle. The method is widely used as an alternative tool for hazard assessment to fill data gaps (ECHA 2011). Read-across-based predictions seem to be more fitting for small data sets (limited source compounds); hence, it has provided promising results in nanosafety assessment, where data are limited. Chatterjee and co-workers (2022) developed a new prediction-oriented quantitative read-across approach based on certain similarity principles. The reported work verifies the efficiency of the newly developed read-across algorithm in filling nanosafety data gaps. A tool has been developed to facilitate the implementation of the approach (Fig. 6) for quantitative read-across, which is available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home. The tool allows the users to optimize different hyperparameters, including the similarity kernel functions and the distance and similarity thresholds, to get the best quality of quantitative predictions. Mainly, three types of similarity estimation techniques were introduced, involving Euclidean distance, the Gaussian kernel function, and the Laplacian kernel function. The algorithm developed in this study was optimized using three small nanotoxicity datasets (n ≤ 20). The algorithm is based on two basic steps: (a) finding the 10 most similar training compounds for each query or test compound; (b) calculating the weighted average prediction for the test set compounds from the most similar training set compounds.

Fig. 6 Quantitative read-across algorithm
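The two basic steps (a) and (b) above can be sketched as follows (an illustrative sketch of the general scheme only; the Gaussian kernel form with a γ bandwidth and the function signature are assumptions, not the tool's exact implementation):

```python
import math

def read_across_predict(query, sources, responses, k=5, gamma=1.0):
    """Predict a query endpoint as the similarity-weighted average of the
    k most similar source compounds (Gaussian kernel on Euclidean distance)."""
    sims = []
    for x, y in zip(sources, responses):
        d = math.dist(query, x)          # Euclidean distance in descriptor space
        s = math.exp(-gamma * d * d)     # Gaussian kernel similarity
        sims.append((s, y))
    sims.sort(reverse=True)              # keep the k most similar sources
    top = sims[:k]
    return sum(s * y for s, y in top) / sum(s for s, _ in top)
```

A Laplacian kernel variant would simply replace the similarity line with exp(−gamma × d), making the weighting less sensitive to distant sources.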


Different hyperparameters, like the sigma and gamma values in the Gaussian and Laplacian kernel functions, have been optimized. The effect of the number of close training compounds on the prediction quality has also been evaluated; 2–5 close training compounds can efficiently predict the toxicity of query compounds. Another feature incorporated in the tool involves a distance threshold for the Euclidean distance similarity estimation and a similarity threshold for the Gaussian and Laplacian kernel function similarity estimations. This generated better predictions at a distance threshold of 0.4–0.5 and a similarity threshold of 0.00–0.05. This algorithm is an easy-to-use, proficient, and expert-independent alternative method for nanoparticle toxicity prediction, which can further assist in data-gap filling and prioritization. Version 3.1 of this tool also computes classification-based validation metrics and generates a receiver operating characteristic (ROC) curve for predictions, which can be used to estimate the uncertainty of predictions. The tool is also applicable for several endpoints other than nanotoxicity, for example, the activity/toxicity/property of organic compounds in general.

Future perspectives

Over the past few decades, the QSAR methodology has received both praise and criticism in connection with its reliability, limitations, successes, and failures. The above discussion of the aforementioned tools from the DTC Laboratory provides methods and information relating to QSAR model development and validation, pointing out current trends, unresolved problems, and persistent challenges associated with the evolution of QSAR. Furthermore, there are a few scopes for further refining the present tools, like the inclusion of the computation of Golbraikh and Tropsha's (2002) criteria in the Double Cross Validation tool and the computation of the leave-many-out cross-validation (Q2LMO) criteria for both the Double Cross Validation tool and the Small Dataset Modeler tool (PLS version), etc. Additionally, there is an opportunity to incorporate an uncertainty measure of predictions in the read-across tool, which will improve the reliability of quantitative predictions for untested molecules.

Conclusion

The QSAR domain has expanded substantially in the past few years as databases and their applications have grown. As the field of QSAR evolves through the decades, it is necessary to evaluate the effectiveness of QSAR models in predicting the behavior of new molecules. A QSAR model stands on the pillars of the various validation metrics used to assess the quality of a predictive model and portray the true picture of the prediction errors. The present review explains various internal and external validation metrics necessary for model predictivity assessment. Furthermore, a brief explanation of various innovative QSAR modeling tools developed by the Drug Theoretics and Cheminformatics (DTC) laboratory (https://sites.google.com/site/kunalroyindia/home/qsar-model-development-tools) is given for better selection and development of models. These tools are aimed at addressing various features like the selection of the training set, model development methodology, model selection techniques, the use of multiple models, scoring of query compounds, etc. These improvements have helped in enhancing the quality of predictions of QSAR models. The tools highly assist in the reliability estimation of untested chemicals when experimental data are unavailable. However, most of these tools cannot be used for classification-based/graded data, but they are well suited for quantitative models like MLR and PLS regression. Furthermore, the tools have a major role in different fields for predicting chemicals associated with the pharmaceutical industry, cosmeceuticals, polymer chemistry, diagnostic agents, dyes, nano-chemistry, food chemistry, etc.

Acknowledgements PD thanks the Indian Council of Medical Research, New Delhi, for a Senior Research Fellowship.

Declarations

Conflict of interest The authors declare no conflict of interest.

References

Abdi H, Williams LJ (2010) Principal component analysis. Wiley Interdiscip Rev 2(4):433–459
Ambure P, Cordeiro MNDS (2020) Importance of data curation in QSAR studies especially while modeling large-size datasets. In: Roy K (ed) Ecotoxicol QSARs. Springer, New York, pp 97–109
Ambure P, Gajewicz-Skretna A, Cordeiro MND, Roy K (2019) New workflow for QSAR model development from small data sets: small dataset curator and small dataset modeler integration of data curation, exhaustive double cross-validation, and a set of optimal model selection techniques. J Chem Inform Model 59(10):4070–4076
Bates S, Hastie T, Tibshirani R (2021) Cross-validation: what does it estimate and how well does it do it? arXiv:210400673
Baumann D, Baumann K (2014) Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation. J Cheminform 6(1):1–19
Chatterjee M, Banerjee A, De P, Gajewicz-Skretna A, Roy K (2022) A novel quantitative read-across tool designed purposefully to fill the existing gaps in nanosafety data. Environ Sci Nano 9(1):189–203
Chirico N, Gramatica P (2011) Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J Chem Inf Model 51(9):2320–2335


Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46
Consonni V, Todeschini R (2010) Molecular descriptors. In: Recent advances in QSAR studies. Springer, New York, pp 29–102
Consonni V, Ballabio D, Todeschini R (2009) Comments on the definition of the Q2 parameter for QSAR validation. J Chem Inf Model 49(7):1669–1678
De P, Roy K (2018) Greener chemicals for the future: QSAR modelling of the PBT index using ETA descriptors. SAR QSAR Environ Res 29(4):319–337
De P, Roy K (2020) QSAR modeling of PET imaging agents for the diagnosis of Parkinson's disease targeting dopamine receptor. Theor Chem Acc 139:176
De P, Roy K (2021) QSAR and QSAAR modeling of nitroimidazole sulfonamide radiosensitizers: application of small dataset modeling. Struct Chem 32(2):631–642
De P, Kar S, Roy K, Leszczynski J (2018) Second generation periodic table-based descriptors to encode toxicity of metal oxide nanoparticles to multiple species: QSTR modeling for exploration of toxicity mechanisms. Environ Sci Nano 5(11):2742–2760
De P, Bhattacharyya D, Roy K (2019) Application of multilayered strategy for variable selection in QSAR modeling of PET and SPECT imaging agents as diagnostic agents for Alzheimer's disease. Struct Chem 30(6):2429–2445
De P, Bhattacharyya D, Roy K (2020) Exploration of nitroimidazoles as radiosensitizers: application of multilayered feature selection approach in QSAR modeling. Struct Chem 31(3):1043–1055
De P, Bhayye S, Kumar V, Roy K (2022) In silico modeling for quick prediction of inhibitory activity against 3CLpro enzyme in SARS CoV diseases. J Biomol Struct Dyn 40(3):1010–1036
Dearden JC (2016) The history and development of quantitative structure-activity relationships (QSARs). Int J Quant Struct-Property Relat 1(1):1–44
Devillers J (1996) Genetic algorithms in molecular modeling. Academic Press, NY
Directorate E (2007) Environment health and safety publications series on testing and assessment No. 69: guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. OECD, Paris, France
ECHA (2011) The use of alternatives to testing on animals for the REACH regulation. European Chemicals Agency, Helsinki, Finland
Ejeh S, Uzairu A, Shallangwa GA, Abechi SE (2021) Computational insight to design new potential hepatitis C virus NS5B polymerase inhibitors with drug-likeness and pharmacokinetic ADMET parameters predictions. Future J Pharm Sci 7(1):1–13
Ghosh S, Ojha PK, Roy K (2019) Exploring QSPR modeling for adsorption of hazardous synthetic organic chemicals (SOCs) by SWCNTs. Chemosphere 228:545–555
Ghosh K, Bhardwaj B, Amin S, Jha T, Gayen S (2020) Identification of structural fingerprints for ABCG2 inhibition by using Monte Carlo optimization, Bayesian classification, and structural and physicochemical interpretation (SPCI) analysis. SAR QSAR Environ Res 31(6):439–455
Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graph Model 20(4):269–276
Golmohammadi H, Dashtbozorgi Z, Acree WE Jr (2012) Quantitative structure–activity relationship prediction of blood-to-brain partitioning behavior using support vector machine. Eur J Pharm Sci 47(2):421–429
Goodarzi M, Dejaegher B, Heyden YV (2012) Feature selection methods in QSAR studies. J AOAC Int 95(3):636–651
Gramatica P (2020) Principles of QSAR modeling: comments and suggestions from personal experience. IJQSPR 5(3):61–97
Gramatica P, Sangion A (2016) A historical excursus on the statistical validation parameters for QSAR models: a clarification concerning metrics and terminology. J Chem Inf Model 56(6):1127–1131
Héberger K, Rácz A, Bajusz D (2017) Which performance parameters are best suited to assess the predictive ability of models? In: Advances in QSAR modeling. Springer, New York, pp 89–104
Höltje H-D, Sippl W (2001) Rational approaches to drug design: proceedings of the 13th European symposium on quantitative structure-activity relationships, August 27–September 1, 2000. JR Prous Science
Hossain KA, Roy K (2018) Chemometric modeling of aquatic toxicity of contaminants of emerging concern (CECs) in Dugesia japonica and its interspecies correlation with daphnia and fish: QSTR and QSTTR approaches. Ecotoxicol Environ Saf 166:92–101
Hsu H-H, Hsu Y-C, Chang L-J, Yang J-M (2017) An integrated approach with new strategies for QSAR models and lead optimization. BMC Genom 18(2):1–9
Kennard RW, Stone LA (1969) Computer aided design of experiments. Technometrics 11(1):137–148
Khan K, Roy K (2019) Ecotoxicological QSAR modelling of organic chemicals against Pseudokirchneriella subcapitata using consensus predictions approach. SAR QSAR Environ Res 30(9):665–681
Khan PM, Rasulev B, Roy K (2018) QSPR modeling of the refractive index for diverse polymers using 2D descriptors. ACS Omega 3(10):13374–13386
Khan K, Benfenati E, Roy K (2019a) Consensus QSAR modeling of toxicity of pharmaceuticals to different aquatic organisms: ranking and prioritization of the DrugBank database compounds. Ecotoxicol Environ Saf 168:287–297
Khan K, Roy K, Benfenati E (2019b) Ecotoxicological QSAR modeling of endocrine disruptor chemicals. J Hazard Mater 369:707–718
Khan PM, Roy K, Benfenati E (2019c) Chemometric modeling of Daphnia magna toxicity of agrochemicals. Chemosphere 224:470–479
Konovalov DA, Coomans D, Deconinck E, Vander Heyden Y (2007) Benchmarking of QSAR models for blood-brain barrier permeation. J Chem Inf Model 47(4):1648–1656
Konovalov DA, Llewellyn LE, Vander Heyden Y, Coomans D (2008) Robust cross-validation of linear regression QSAR models. J Chem Inf Model 48(10):2081–2094
Kubinyi H, Hamprecht FA, Mietzner T (1998) Three-dimensional quantitative similarity–activity relationships (3D QSiAR) from SEAL similarity matrices. J Med Chem 41(14):2553–2564
Lavado GJ, Baderna D, Carnesecchi E, Toropova AP, Toropov AA, Dorne JLC, Benfenati E (2022) QSAR models for soil ecotoxicity: development and validation of models to predict reproductive toxicity of organic chemicals in the collembola Folsomia candida. J Hazard Mater 423:127236
Leardi R (2001) Genetic algorithms in chemometrics and chemistry: a review. J Chemom 15(7):559–569
Likas A, Vlassis N, Verbeek JJ (2003) The global k-means clustering algorithm. Pattern Recognit 36(2):451–461
Maleki A, Daraei H, Alaei L, Faraji A (2014) Comparison of QSAR models based on combinations of genetic algorithm, stepwise multiple linear regression, and artificial neural network methods to predict Kd of some derivatives of aromatic sulfonamides as carbonic anhydrase II inhibitors. Russ J Bioorganic Chem 40(1):61–75
Mikolajczyk A, Gajewicz A, Mulkiewicz E, Rasulev B, Marchelek M, Diak M, Hirano S, Zaleska-Medynska A, Puzyn T (2018) Nano-QSAR modeling for ecosafe design of heterogeneous TiO2-based nano-photocatalysts. Environ Sci Nano 5(5):1150–1160
Nath A, De P, Roy K (2021) In silico modelling of acute toxicity of 1,2,4-triazole antifungal agents towards zebrafish (Danio rerio) embryos: application of the small dataset modeller tool. Toxicol in Vitro 75:105205
Nath A, De P, Roy K (2022) QSAR modelling of inhalation toxicity of diverse volatile organic molecules using no observed adverse effect concentration (NOAEC) as the endpoint. Chemosphere 287:131954
Norinder U (1996) Single and domain mode variable selection in 3D QSAR applications. J Chemom 10(2):95–105
Novellino E, Fattorusso C, Greco G (1995) Use of comparative molecular field analysis and cluster analysis in series design. Pharm Acta Helv 70(2):149–154
Ojha PK, Roy K (2018) Development of a robust and validated 2D-QSPR model for sweetness potency of diverse functional organic molecules. Food Chem Toxicol 112:551–562
Ojha PK, Mitra I, Das RN, Roy K (2011) Further exploring rm2 metrics for validation of QSPR models. Chemometr Intell Lab Syst 107(1):194–205
Ojha PK, Kar S, Roy K, Leszczynski J (2020) Chemometric modeling of power conversion efficiency of organic dyes in dye sensitized solar cells for the future renewable energy. Nano Energy 70:104537
Organisation for Economic Co-operation and Development (OECD) (2004) The report from the expert group on (quantitative) structure-activity relationships [(Q)SARs] on the principles for the validation of (Q)SARs. Series on Testing and Assessment, p 206
Pope P, Webster J (1972) The use of an F-statistic in stepwise regression procedures. Technometrics 14(2):327–340
Roy K (2007) On some aspects of validation of predictive quantitative structure–activity relationship models. Expert Opin Drug Discov 2(12):1567–1577
Roy K (2018) Quantitative structure-activity relationships (QSARs): a few validation methods and software tools developed at the DTC laboratory. J Indian Chem Soc 95(12):1497–1502
Roy K, Ambure P (2016) The "double cross-validation" software tool for MLR QSAR model development. Chemom Intell Lab Syst 159:108–126
Roy K, Mitra I, Kar S, Ojha PK, Das RN, Kabir H (2012) Comparative studies on some metrics for external validation of QSPR models. J Chem Inf Model 52(2):396–408
Roy K, Kar S, Ambure P (2015a) On a simple approach for determining applicability domain of QSAR models. Chemom Intell Lab Syst 145:22–29
Roy K, Kar S, Das RN (2015b) Statistical methods in QSAR/QSPR. In: A primer on QSAR/QSPR modeling. Springer, New York, pp 37–59
Roy K, Das RN, Ambure P, Aher RB (2016) Be aware of error measures. Further studies on validation of predictive QSAR models. Chemometr Intell Lab Syst 152:18–33
Roy K, Ambure P, Kar S (2018a) How precise are our quantitative structure–activity relationship derived predictions for new query chemicals? ACS Omega 3(9):11392–11406
Roy K, Ambure P, Kar S, Ojha PK (2018b) Is it possible to improve the quality of predictions from an "intelligent" use of multiple QSAR/QSPR/QSTR models? J Chemom 32(4):e2992
Roy J, Ghosh S, Ojha PK, Roy K (2019) Predictive quantitative structure–property relationship (QSPR) modeling for adsorption of organic pollutants by carbon nanotubes (CNTs). Environ Sci Nano 6(1):224–247
Rücker C, Rücker G, Meringer M (2007) y-Randomization and its variants in QSPR/QSAR. J Chem Inf Model 47(6):2345–2357
Schüürmann G, Ebert R-U, Chen J, Wang B, Kühne R (2008) External validation and prediction employing the predictive squared correlation coefficient—test set activity mean vs training set activity mean. J Chem Inf Model 48(11):2140–2145
Stošić B, Janković R, Stošić M, Marković D, Stanković D, Sokolović D, Veselinović AM (2020) In silico development of anesthetics based on barbiturate and thiobarbiturate inhibition of GABAA. Comput Biol Chem 88:107318
Todeschini R (2010) Milano Chemometrics. University of Milano-Bicocca, Milano, Italy (personal communication)
Todeschini R, Ballabio D, Grisoni F (2016) Beware of unreliable Q2! A comparative study of regression metrics for predictivity assessment of QSAR models. J Chem Inf Model 56(10):1905–1913
Veerasamy R, Rajak H, Jain A, Sivadasan S, Varghese CP, Agrawal RK (2011) Validation of QSAR models - strategies and importance. Int J Drug Des Discov 3:511–519
Venkatasubramanian V, Sundaram A (2002) Genetic algorithms: introduction and applications. In: Encyclopedia of computational chemistry, vol 2. Wiley, New Jersey
Wainer J, Cawley G (2021) Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst Appl 182:115222
White Paper on a Strategy for a Future Chemicals Policy (2001) Commission of the European Communities, Brussels, Belgium
Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst 58(2):109–130

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
