Chemoinformatics and Chemical Genomics: Potential Utility of in Silico Methods
Received: 25 May 2012; Revised: 26 June 2012; Accepted: 27 June 2012; Published online in Wiley Online Library: 10 August 2012
Introduction
Chemoinformatics is a broad discipline standing at the interface of chemistry, biology and computer science (Agrafiotis et al., 2007). Chemoinformatics generally implies the mixing of information technology and management to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the area of drug design and lead identification (Chen, 2006; Brown, 1998). Chemoinformatics can easily be confused with computational chemistry, which primarily focuses on theoretical quantum mechanical calculations (Engel, 2006). However, chemoinformatics encompasses the design, creation, storage, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information (Agrafiotis et al., 2007). The scope of chemoinformatics is very broad, but a pragmatic use of it for toxicology involves acquiring structural information about chemicals, studying their effects on living systems, classifying chemicals on the basis of structural information and functional properties, and linking the discovered associations with various types of adverse effects and toxicities in biological systems. Chemoinformatics is also used to accumulate empirical knowledge that is then built into relational databases for future retrieval and further analysis as more information becomes available. Relational databases can also be used to construct computational models for toxicological endpoints of interest, to help predict the effects of new chemical substances (e.g. impurities) in light of potential human exposure conditions. In this regard, the study of quantitative structure–activity relationships (QSAR) is an integral part of chemoinformatics and will be reviewed in terms of its practical use. With the ability to study global gene expression profiles following exposure to chemicals, the scope of chemoinformatics has expanded even further. The study of the effects of small chemicals/molecules on cell function at the genome level, that is, the study of gene expression profiling following exposure to different chemicals, is the purview of chemical genomics or chemogenomics. Therefore, with the advent of informatics in biology and chemistry simultaneously, and with the development of interconnected database resources, the scopes of bioinformatics and chemoinformatics have significantly overlapped and expanded to include, as a major focus, the effects of specific chemical structures/structural classes (or other properties) on gene expression.
In Silico (Q)SAR
*Correspondence to: Luis G. Valerio Jr, Office of Pharmaceutical Science, Center for Drug Evaluation and Research, US Food and Drug Administration, White Oak 51, Room 4128, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002, USA. E-mail: luis.valerio@fda.hhs.gov

The opinions expressed in this article are the authors' personal opinions and do not reflect those of FDA, DHHS or the Federal Government. The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the article.

a Science and Research Staff, Office of Pharmaceutical Science, Center for Drug Evaluation and Research, US Food and Drug Administration, White Oak 51, Room 4128, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002, USA

b Division of Biotechnology and GRAS Notice Review, Office of Food Additive Safety, Center for Food Safety and Applied Nutrition, US Food and Drug Administration, White Oak 51, Room 4128, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002, USA

Background

The characterization of drug and chemical toxicity as well as biological pathway analysis for a substance is of paramount
importance as this information is integrated into larger assessments to protect public health. Moreover, there is a real financial burden for substances that have been subjected to intense preclinical investigation for safety and target analysis, and in many cases have undergone extensive late-stage product development, yet fail owing to toxicity. Therefore, identification of safety liabilities early in the drug discovery and development process can help mitigate compound attrition. For other substances, such as food ingredients, there are no benefit considerations, and the safety assessment plays a critical role before human exposure is permitted. In this situation, the characterization of potential toxicity relies principally on preclinical toxicology studies and human exposure assessments, and if there are data gaps then these may be flagged as requiring additional information. In this light, in silico methods such as computational QSARs or human expert SAR systems may be able to offer predictive results to help fill data gaps in safety assessments of chemical substances. Use-case scenarios where this is being permitted or investigated at the US Food and Drug Administration (FDA)/Center for Drug Evaluation and Research (CDER) and Center for Food Safety and Applied Nutrition (CFSAN) will be described in this review.

QSAR Models

The use of QSAR in predicting compound safety liabilities has its roots in drug discovery technologies (Merlot, 2010). The basis for interest in QSAR approaches derives from the need to identify drugs with toxicity at the clinical dose much earlier in development, before significant resources and time are invested in compound development. For a number of years the number of drugs reaching the market slowly declined, to 21 new molecular entities approved by FDA/CDER in 2010 (FDA, 2010). However, in 2011, 31 new molecular entity drug products were approved at FDA/CDER and CBER (Mullard, 2012). Thus, alternative, efficient and reliable approaches are sought to further enhance drug development, modernize safety signal detection and increase the probability of identifying drug toxicity early, so as to reduce attrition rates in drug development. Computational toxicology techniques, including but not limited to in silico QSARs, have been suggested as a potential means to help address these needs (Merlot, 2008; Boyer, 2009). Recently, U.S. regulatory agencies have invested time and resources in evaluating the utility of computational methods oriented toward safety signal detection, such as toxicity-based QSAR models (Yang et al., 2009; Arvidson et al., 2010; Valerio, 2011). At the FDA, such efforts took shape as structure-based models targeted at toxicity endpoints that cannot be tested in humans (Matthews and Contrera, 2007; Contrera et al., 2008). These include QSAR models for genetic toxicity (Contrera et al., 2005a, 2008), reproductive and developmental toxicity (Matthews et al., 2007) and carcinogenicity (Contrera et al., 2003; Matthews et al., 2008). According to the Organisation for Economic Co-operation and Development (OECD), there are general principles that help determine whether a QSAR model is appropriate for regulatory use (OECD, 2004).
These are often referred to as the Setubal principles, because they were derived at an OECD meeting of QSAR experts in Setubal, Portugal. The principles are: (1) a defined endpoint; (2) an unambiguous algorithm; (3) a defined domain of applicability; (4) appropriate measures of goodness-of-fit, robustness and predictivity; and (5) a mechanistic interpretation, if possible (OECD, 2004). Despite these general principles, the literature shows that QSAR model building is segmented according to the philosophy of the model builder. For example, there are models built on training set chemicals of high structural diversity, intended to predict new molecules that may share substructural features present in the model. There are also models based on mechanistic data that consist of training set chemicals that are structurally congeneric (Benigni, 2005; Benigni et al., 2007; Benigni and Bossa, 2008a, 2008b). There are models built using only public data, which facilitates interpretability and transparency (Hansen et al., 2009), but the disadvantage is that these data may not reflect the proprietary chemistry being developed by a pharmaceutical or chemical company. One example pointed out by Naven et al. is the alkyl halide chemical class, which often occurs in data sets of industrial chemicals but to a lesser extent in drug-like molecules (Naven et al., 2010). QSAR models built using in-house proprietary data can be advantageous, at least for the user, as these are customized models based upon the chemical structures frequently seen during development, with algorithms and software specialized for the organization's own use (Boyer, 2009; Bercu et al., 2010; Naven et al., 2010). The disadvantage of such systems is that, because of their proprietary nature, there is little interpretability or verification available to regulatory authorities upon submission of computational toxicology data to support the safety of a substance under review. Recent discussion in a proposed draft ICH M7 guideline surrounding the use of predictive in silico tools such as QSARs and SAR expert systems for the qualification of genotoxic impurities in drug products helps address this issue. A proposal to standardize external validation data for models designed to predict Salmonella (Ames assay) mutagenicity could potentially become the method of choice for establishing the acceptability of a QSAR model for regulatory use, regardless of the source of the model and training set compounds (private, commercial, or truly public and cost-free).

Computational Algorithms

The available pool of data mining algorithms useful for QSAR modeling is large; an excellent review on this subject is available (Nantasenamat et al., 2010). Table 1 summarizes some of the better-known machine learning algorithms. However, it is not uncommon for an algorithm to be modified and customized for a particular purpose. It is well known that some algorithms are better fits for particular types of data than others; therefore, selection of a suitable machine learning algorithm depends upon the nature of the data set being mined and the type of learning task (e.g. linear or non-linear regression, classification, clustering, kernel methods; Vert and Jacob, 2008). In general, neither reliable algorithms nor software programs for supporting computational toxicology predictions are difficult to find. The greater issue is finding quality data for modeling.
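As an illustration of how such a model might be assembled, the following minimal sketch pairs a handful of molecular descriptors with a random forest classifier, one of the machine learning algorithms listed in Table 1. It assumes RDKit and scikit-learn are available; the SMILES strings, activity labels and descriptor choices are hypothetical stand-ins for a curated training set, not a real regulatory model.

```python
"""Minimal QSAR classification sketch (illustrative only)."""
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: (SMILES, 1 = active/toxic, 0 = inactive).
training_set = [
    ("c1ccccc1N", 1), ("CCO", 0), ("c1ccc2ccccc2c1", 1), ("CC(=O)O", 0),
    ("c1ccccc1", 0), ("O=[N+]([O-])c1ccccc1", 1), ("CCCCCC", 0), ("ClCCl", 1),
]

def featurize(smiles):
    """Compute a small descriptor vector for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.RingCount(mol)]

X = [featurize(smi) for smi, _ in training_set]
y = [label for _, label in training_set]

# Fit the model and predict a query structure.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([featurize("Nc1ccc(C)cc1")]))  # hypothetical query molecule
```

In practice, of course, the training data, descriptor selection, applicability domain and external validation would all require the kind of curation and objective assessment discussed above before any regulatory use could be contemplated.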
Table 1. Machine learning algorithms available for building toxicological quantitative structure–activity relationships

Stochastic gradient boosting: Friedman (2002)
Discriminant analysis: Contrera et al. (2005b); Perez-Garrido et al. (2010)
Partial least squares: Hellburg et al. (1987); Wold et al. (2004)
Random forest: Breiman (2001)
Artificial neural network: Baskin et al. (2008); Cartwright (2008)
Support vector machines: Cristianini and Shawe-Taylor (2000); Nantasenamat et al. (2008); Vert and Jacob (2008)
Kohonen's self-organizing map: Kohonen (1990); Bienfait (1994)
Decision tree: Al-Razzak and Glen (1992)
Naïve Bayes: O'Brien and de Groot (2005); Hoare (2008); Langdon et al. (2010)
Gaussian process: Obrezanova and Segall (2010)
k Nearest neighbor: Golbraikh et al. (2003)
Optimization algorithms:
  Genetic algorithm: Rogers and Hopfinger (1994); Matthews et al. (2008)
  Ant colony: Izrailev and Agrafiotis (2002)
  Simulated annealing: Sutter et al. (1995); Oloff et al. (2005)
  Particle swarms: Agrafiotis and Cedeño (2002)
feature with toxicity for the endpoint being predicted. Unfortunately, specific doses for the adverse effects are not predicted using this approach. Because the SAR prediction is based on a knowledge base of information, predictions are supported by references or citations, which can be followed up to judge their relevance and suitability of fit for the risk assessment being conducted. Predictions are derived from human expert judgment of scientific evidence and are not generated from a statistical QSAR calculation. Development of the knowledge base or rule base obviously requires careful analysis of large bodies of data, and essentially represents a consensus-based approach. The main advantage is that the knowledge base used to derive predictions is supported by empirical evidence, usually linked to a mechanistic understanding of toxicity, before a rule is entered into the system. Moreover, usefully, some form of human judgment or reasoning is usually provided with the prediction of toxicity (e.g. a degree of uncertainty or a classification to aid in analysis); this is sometimes referred to as a reasoning engine (Valerio and Long, 2010). A disadvantage is that generalizing the prediction from a substructural feature that is only one part of the larger set of features on a molecule may tend to overpredict the toxicity without expert interpretation of potential mitigating features or modulating factors. Another disadvantage inherent to this approach is that human experts can be wrong or even inadvertently biased in their determinations (Guzelian et al., 2005). A frequent misunderstanding of the human knowledge-based SAR approaches is that the software program contains a database of toxicity information. These software systems are not database management systems, but computer programs with expert knowledge and expert rules derived from analysis of toxicology study data. Two very well-known programs using the knowledge-based SAR approach are Derek Nexus and ToxTree (Marchant et al., 2008; Mostrag-Szlichtyng et al., 2010). Another interesting point is that there is really no negative prediction with the human knowledge-based approaches. If no structural alert is identified on the query molecule, then the program delivers the prediction "nothing to report". This simply means there are no human rules or knowledge that have been encoded in the software to alert of a toxicity based on in silico analysis of the compound's molecular structure, because either: (1) the alerting rule has not yet been identified and reported in the scientific literature; or (2) the alert has not been captured by the consensus-based approach and encoded in the knowledge base of the expert system. If the alert has not yet been identified, it may be because the substructural feature is not sufficiently represented in the available literature under review to point to it as a feature portending toxicity. However, a versatile aspect of the knowledge-based approach is that most software supports customized editing of the rules, so one can build into the program one's own rule base derived from an organization's in-house proprietary test data, a feature that is likely to be employed in real risk assessment and toxicity screening situations.
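To make the alert-based mechanics above concrete, here is a minimal sketch of a rule-based screen, assuming RDKit. The two SMARTS alerts (aromatic nitro, primary aromatic amine) are classic genotoxicity alerts used purely as examples; real expert systems such as Derek Nexus or ToxTree encode far larger, evidence-backed rule bases with reasoning support, and nothing here reproduces their actual rules.

```python
"""Sketch of a knowledge-based structural-alert check (illustrative)."""
from rdkit import Chem

# Hypothetical rule base: alert name -> SMARTS pattern.
ALERTS = {
    "aromatic nitro": "[$([NX3](=O)=O),$([NX3+](=O)[O-])]c",
    "primary aromatic amine": "[NX3;H2]c",
}

def screen(smiles):
    """Return the alerts matched by a query structure, if any."""
    mol = Chem.MolFromSmiles(smiles)
    hits = [name for name, smarts in ALERTS.items()
            if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]
    # Mirrors the expert-system behaviour described above: absence of a
    # matching rule yields "nothing to report", not a negative prediction.
    return hits or ["nothing to report"]

print(screen("O=[N+]([O-])c1ccccc1"))  # nitrobenzene -> aromatic nitro alert
print(screen("CCO"))                   # ethanol -> nothing to report
```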
More recently, high-throughput screening (HTS) is being applied as a tool, integrated with chemoinformatics methods, to study siRNA, antisense and gene overexpression through the study of chemical libraries. In this way, chemoinformatics is linked as a critical tool for chemical genomic studies. There are other areas within drug discovery where methods are being actively developed for the integration of chemoinformatics and chemical genomics. For example, Okuno (2008) reports on the development of new mining and in silico screening methods for G-protein coupled receptor (GPCR) ligands. The approach is to provide a public GPCR chemical genomics database. The database, called GLIDA, is reported to find correlations of information between GPCRs and their ligands (Okuno, 2008). In addition, there are other databases for virtual screening of human GPCRs, such as FINDSITE(X) (Zhou and Skolnick, 2012). Because of their widespread prevalence as cell surface receptors and their role in transmitting chemical signals to an array of cell types, GPCRs are well recognized as a major class of targets for investigational drugs (Zhang and Xie, 2012). Therefore, high-throughput screening assays and in silico structure-based docking models of GPCRs have been developed to aid the discovery of molecules selective for these widely used targets (Shoichet and Kobilka, 2012; Zhang and Xie, 2012).
Figure 1. Despite the availability of predictive technologies such as QSAR and toxicity data mining systems, time and resources dictate that regulatory scientists prioritize which approaches are worth pursuing. The figure outlines such a strategic approach, which always begins with identifying a regulatory need. [Flow diagram: Assessment and Identification of a Regulatory Need (Informatics and Safety Science Needs; Applied Research) → Technology Evaluation → Independent Validation → Regulatory Application → Updating.]
examination of any potential data use restrictions. In either case, once the software is deployed, data collection and testing commence. This step is not only time-consuming and arduous, but requires substantial oversight of precisely which data are employed for the construction of intellectual property such as training sets for computational models. After a computational model is built with the technology, validation testing should be performed using independent data sets and objective assessments. Independent validation testing lends rigor to the scientific validity of a model and instills confidence in its performance. Conceivably, regulatory application of the model could be envisioned after the validation testing; equally important is continuous updating of the model, as well as of the software technology.
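As a sketch of the kind of objective performance assessment that independent validation entails, the function below computes standard two-class statistics (sensitivity, specificity and concordance) from paired observations and predictions; the test-set labels are hypothetical and serve only to show the arithmetic.

```python
"""External-validation statistics for a binary toxicity model (sketch)."""

def performance_statistics(observed, predicted):
    """Sensitivity, specificity and concordance from paired 0/1 labels."""
    tp = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 1)
    tn = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 0)
    fp = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 1)
    fn = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),             # fraction of toxicants found
        "specificity": tn / (tn + fp),             # fraction of non-toxicants found
        "concordance": (tp + tn) / len(observed),  # overall agreement
    }

# Hypothetical independent test set results (1 = toxic, 0 = non-toxic).
observed  = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(performance_statistics(observed, predicted))
```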
It is estimated that, of the 20 000 to 25 000 human genes thought to encode ~3000 potential drug targets, only ~800 proteins have so far been investigated by the pharmaceutical industry. In parallel, of the several million non-redundant chemical structures synthesized, only ~1000 (much less than 0.1%) have been approved as drugs (Rognan, 2007). Accordingly, studies have been conducted to explore the possibility of connecting structure–activity signatures with gene expression signatures. The following discussion will focus mainly on the chemical genomics aspect of chemoinformatics.

High-throughput Assay Formats, Profiling the Specificity of Target Binding, and the Use of Gene Expression Signatures to Connect Small Molecules, Genes and Disease

Chemical genomics or chemogenomics is the study of the interactions of functional biological systems with exogenous small molecules. Chemogenomics therefore draws on conventional pharmaceutical approaches, since it involves the screening of libraries of small molecules in high-throughput assays for their effects on biological targets (Marchal, 2008). The ability to synthesize a large number of molecules through combinatorial chemistry, and to screen them subsequently, is at the heart of the recent advances in chemoinformatics and chemical genomics. An example of such an effort is provided by the small-molecule screening performed at the Harvard Institute of Chemistry and Cell Biology (MacBeath, 2001). In this format, libraries of small molecules are synthesized on polystyrene beads. The beads are arrayed in 384-well plates, one per well. Next, the compounds are cleaved from the beads, dissolved in solvent and transferred to 384-well plates to yield stock solutions. Minute amounts of these compounds can be introduced into microtiter plates or arrayed on glass slides to create small-molecule microarrays. These microarrays can then be probed with labeled ligands (such as proteins) to identify potential targets. Information on the target specificity of a small molecule is of immense value: how many targets does a small molecule bind, one or multiple? Does it trigger some kind of signaling from every target it binds, and if so, does that biological effect translate to specific and measurable gene expression signatures? In other words, a gene expression signature as an endpoint can be used to determine the specificity of an upstream target-binding event. At present, gene expression profiling is the best way to do this, although it is far from ideal (MacBeath, 2001). Gene expression profiling in a cell where the small molecule binds to its target can be compared with that in an otherwise isogenic cell in which the binding target has been deleted (Marton et al., 1998). Being able to extend the gene expression profiling signature to a protein expression profiling signature would make such studies more complete and meaningful, but the state of the technology has limited such efforts so far (MacBeath, 2001).

Examples of the Proof of Principle: the Compendium and the Connectivity Map Resources

An understanding of the relationship among diseases, physiological processes and the action of therapeutic small molecules is sorely needed in biomedicine. Hughes et al. (2000) constructed a reference database or compendium of expression profiles corresponding to 300 diverse mutations and chemical treatments in the yeast Saccharomyces cerevisiae, and showed that the cellular pathways affected can be determined by pattern matching, even among very subtle profiles. The utility of this approach was further validated by examining profiles caused by deletions of uncharacterized genes. The authors identified and experimentally confirmed that eight uncharacterized open reading frames encoded proteins required for sterol metabolism, cell wall function, mitochondrial respiration or protein synthesis. They also showed that the compendium can be used to characterize pharmacological perturbations, by identifying a novel target of the commonly used drug dyclonine. The authors generated 300 expression profiles in S. cerevisiae using a two-color cDNA microarray hybridization assay. The transcript levels of a mutant or compound-treated culture were compared with those of a wild-type or mock-treated culture. Two hundred and seventy-six deletion mutants, 11 tetracycline-regulatable alleles of essential genes and 13 well-characterized compounds were profiled. Sixty-nine of the 276 deletions were uncharacterized open reading frames (ORFs). In parallel with the 300-experiment data set, a series of 63 negative control experiments was conducted to ensure that the observed transcriptional alterations were caused by the mutations or treatments, and not by random fluctuations or systematic biases. It was found that the genes with the highest variance in the 63 control experiments were among those that fluctuated the most in the 300 compendium experiments as well. The authors concluded that the juxtaposition of functionally related mutants on the profile index of the clustering analysis supports the idea that a compendium of profiles could serve as a systematic tool for the identification of gene functions, because mutants that display similar expression profiles are likely to share cellular functions. When treatment with an uncharacterized pharmacological compound elicits a response mimicking that of a mutation of the target, the pathways affected by the uncharacterized compound can be determined as well. The authors next designed experiments to determine whether the cellular functions of the uncharacterized S. cerevisiae ORFs could be predicted by comparing the expression profiles of the corresponding deletion mutants with profiles of known mutants in the compendium, and whether an unknown drug target could be identified in the same way. The protein products of all eight uncharacterized ORFs were identified. The work by Hughes et al. therefore demonstrated the utility of determining the gene expression profile caused by an uncharacterized perturbation and comparing it with a large and diverse set of reference profiles. In this effort, the pathway(s) perturbed is determined by matching the profile caused by the uncharacterized perturbation to profiles that correspond to disturbance of a known cellular pathway. Success of the method depends on the existence of distinct expression profiles that identify different pathways. The authors also found that many classes of mutants display recognizable and distinct expression profiles. Importantly, it was found that meaningful expression patterns can involve groups of transcripts whose relative abundance changes at levels considerably less than 2-fold. It is therefore apparent that the success of this method is directly dependent on the robustness of the database and reference information. In discussing the advantages of the compendium approach, Hughes et al. (2000), using their own results and other supporting citations, pointed out that the fundamental advantage of the
compendium approach over conventional assays is that it substitutes a single genome-wide expression profile for many conventional assays that each measure only a single cellular parameter. A compendium of expression profiles also has many applications; the same compendium used to characterize mutants can also be used to characterize other perturbations, such as treatments with pharmaceutical compounds, as well as potential disease states. One of the advantages of the compendium approach is that, rather than using a simple assay to examine many compounds for a single activity, each compound can be examined for many possible activities in a single assay. The fact that multiple molecular consequences of a given perturbation can be discerned simultaneously suggests that a compendium might be used to discover unanticipated activities of drugs, or to ensure that only the desired treatment effects are occurring in patients. In order to pursue a systematic approach to the discovery of functional connections among disease, genetic perturbation and drug action, Lamb et al. (2006) developed a reference collection of gene expression profiles from cultured human cells treated with bioactive small molecules, together with pattern-matching software to mine these data. This repository/resource, which the authors termed a connectivity map, can be used to find connections among small molecules sharing a mechanism of action, chemicals and physiological processes, and diseases and drugs. Using the connectivity map resource, a researcher studying a drug candidate, a gene or a disease state can compare its signature with the database to discover unexpected connections among drugs, genes and diseases (hence the resource's name; Lamb et al., 2006). Thus, connectivity mapping has the potential to identify novel pharmacological and toxicological properties in new molecular entities. Using small molecules termed perturbagens (which create specific perturbations in cell culture model systems), the authors created a first-generation connectivity map. These compounds were chosen to represent a broad range of activities, and they could be grouped on the basis of activities or properties. For example, some of these compounds share the same molecular target (e.g. histone deacetylase inhibitors); some have the same clinical indication (e.g. antidiabetics); whereas others have a similar regulatory impact on gene expression. The map provides a tool that will help discover hitherto unknown relationships (connections) among small molecules, diseases and the biological pathways that join them. Using the gene expression profile of an uncharacterized compound, new insight can be gained into the compound's possible mechanisms of action with the help of the connectivity map. Alternatively, if a new bioactive compound reverses the gene expression profile that is associated with a disease phenotype, then it can be assumed that the bioactive compound may have therapeutic potential. In this way the connectivity map can be used to identify potential therapeutic compounds as well. Querying the connectivity map with the gene expression signature of a chemical whose mechanism of action is known can help identify new compounds with a similar mode of action. For example, by querying the connectivity map with a gene expression signature induced by estradiol-17β in the MCF-7 cell line, Lamb et al. (2006) identified other estrogen receptor modulators, both agonists and antagonists.
A high positive connectivity score was observed for genistein, a phytoestrogen, whereas the highest negative connectivity score was observed for fulvestrant, a known anti-estrogenic drug. Other estrogen receptor agonists and antagonists were also identified in this way. The utility of the method was further corroborated when it was applied to phenothiazine antipsychotics. Querying the connectivity map revealed that phenothiazines can inhibit prostaglandin synthesis, a finding consistent with the known anti-inflammatory and pro-inflammatory actions of phenothiazines and prostaglandins, respectively. When Lamb et al. (2006) performed a high-throughput gene expression-based screen for small molecules capable of abrogating the gene expression signature of androgen receptor activation in prostate cancer cells, they identified gedunin as a candidate small molecule. Gedunin is a tetranortriterpenoid isolated from the Indian neem tree, Azadirachta indica. The authors stated that the mechanism by which gedunin abrogated androgen receptor activity was unknown at the time. By defining the gene expression signature of gedunin and querying the connectivity map for high connectivity scores, it was found that known HSP90 inhibitors showed marked connectivity to the gedunin signature. Further studies led the authors to conclude that gedunin, although structurally dissimilar to the known HSP90 inhibitors used as references in the study, exerts its antiproliferative action by inhibiting HSP90. Gedunin's ability to inhibit HSP90, and its antiproliferative role through the inhibition of HSP90, was later confirmed (Brandt et al., 2008). In connectivity mapping, similarity is as valuable as diversity, thereby offering the potential to uncover new therapeutically important molecular entities to treat human diseases. For example, when a small molecule produces signatures that mimic a diseased state, it provides clues to the pathways involved in the development of the disease. Conversely, if a new bioactive compound reverses the gene expression profile that is associated with a disease phenotype, then it can be assumed that the bioactive compound may have therapeutic potential. Based on this premise, the authors identified 4,5-dianilinophthalimide (DAPH) as a potential therapeutic molecule for Alzheimer's disease, because it could reverse the Alzheimer's disease signature in cultured cells. This finding was consistent with the report that DAPH can reverse the in vitro formation of neurotoxic Aβ42 fibrils, a hallmark of the pathogenesis of this disease. Lamb et al. (2006) also uncovered a potentially new approach to the treatment of acute lymphoblastic leukemia (ALL). Children with ALL are known to develop resistance to dexamethasone (DEX). By comparing bone-marrow leukemia cells from patients exhibiting DEX sensitivity or resistance, the authors first obtained a gene expression signature that defines DEX sensitivity. Next, by querying the connectivity map with the DEX-sensitivity signature, the authors found strong connectivity to sirolimus, suggesting that sirolimus may confer DEX sensitivity and thus reverse DEX resistance. This hypothesis was indeed confirmed when the authors found that sirolimus treatment of the lymphoid cell line CEM-c1, which is normally resistant, made the cells sensitive to DEX-mediated apoptosis. Therefore, the combination of DEX and sirolimus may provide an effective approach to treating ALL.
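A minimal sketch of the rank-based matching idea behind such queries is shown below. It is not the Kolmogorov–Smirnov-style statistic actually used by Lamb et al. (2006); it simply scores whether the up- and down-regulated genes of a query signature sit at the expected ends of a reference profile, with gene names invented for illustration.

```python
"""Simplified sketch of connectivity-map style signature matching."""

def connectivity_score(ranked_genes, up_genes, down_genes):
    """Score a query signature against one reference ranked profile."""
    n = len(ranked_genes)
    rank = {g: i / (n - 1) for i, g in enumerate(ranked_genes)}  # 0 = most induced
    up = sum(rank[g] for g in up_genes) / len(up_genes)
    down = sum(rank[g] for g in down_genes) / len(down_genes)
    # down - up lies in [-1, 1]: positive when the reference profile
    # mimics the query signature, negative when it reverses it.
    return down - up

# Hypothetical reference profile, ordered most induced -> most repressed.
reference = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(connectivity_score(reference, up_genes=["g1", "g2"], down_genes=["g5", "g6"]))  # +0.8, mimic
print(connectivity_score(reference, up_genes=["g5", "g6"], down_genes=["g1", "g2"]))  # -0.8, reverse
```

The published method ranks the full genome-wide profile and applies a nonparametric enrichment statistic; this toy mean-rank difference only conveys the sign convention of a mimicking versus a reversing connection.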
The usefulness of connectivity mapping in predicting the properties of chemical entities, based on positive or negative connections with the database of reference compounds, is of utmost importance in the discovery of a new chemical entity, whether a drug or a new chemical to be used in the environment. The connectivity map approach represents a direct way to link small molecules, molecular targets and the corresponding gene expression signatures, and to support database building and the establishment of reliable training sets. Together, all these pieces of the puzzle can aid in the accurate prognostication of the biological effects of a new molecular entity. The concept of a connectivity map has already been put to the test. Using this concept, various groups
have developed resources (databases or tools) to connect chemical structure/class, gene expression signatures and associated biological effects. These include a large-scale chemogenomics database developed from in vivo treated rats to understand and predict the mechanisms of chemical toxicity and action (Ganter et al., 2005); sscMap (statistically significant connections map), a Java application designed to undertake connectivity mapping tasks (Zhang and Gant, 2009); a similar-compounds searching system built on a toxicogenomics database (Toyoshiba et al., 2009); disease-specific drug–protein connectivity maps (Li et al., 2009); phenotype-associated gene expression signatures to identify candidate therapeutic agents for hepatocellular cancer (Braconi et al., 2009); and GEM-TREND, a web tool for finding similar gene expression data (Feng et al., 2009). Very recently, Keiser et al. (2009) compared 3665 approved and investigational drugs against hundreds of targets, defining each target by its ligands. They found that chemical similarities between drugs and ligand sets predicted thousands of unanticipated associations, of which 30 were experimentally tested. Overall, 23 new drug–target associations were confirmed, five of which were potent (<100 nM). The authors commented that such a chemical similarity approach is systematic and comprehensive, and may suggest side effects and new indications for many drugs.

Complex Bioactivity Databases

More databases that will be useful in specific chemogenomic applications are being developed. Two types of complex database can be recognized: (1) databases containing information about a large number of molecules tested against a single target; and (2) databases containing information about a series of compounds tested concurrently in multiple assays (Oprea et al., 2011). An example of the first kind is the WOMBAT database, a large chemogenomics database whose 2009 version contains 295 435 entries representing 1966 unique targets, captured from 14 367 papers published between 1975 and 2008 (Oprea et al., 2011). Another example is the publicly available Distributed Structure-Searchable Toxicity (DSSTox) database of the U.S. Environmental Protection Agency (EPA). DSSTox contains data on chemical structures associated with toxicity, such as tumor target site (single target) and carcinogenic potencies (single endpoint). In recent years, as more and more data have become available, the second type of database has been developing rapidly (Oprea et al., 2011). For example, using high-throughput screening, NIH's Molecular Libraries Program (an NIH Roadmap Initiative) aims to obtain small-molecule probes effective at modulating a given biological process or disease state. Another example of mining methods for chemical genomics data based on the integration of bioinformatics and chemoinformatics is the publicly available GPCR–ligand database GLIDA (Okuno, 2008). GPCRs form the largest class of cell surface receptors and are involved in important cellular functions. A large fraction of GPCRs are orphans with no known ligands; in humans, for example, over 200 GPCRs are orphan receptors. Because GPCRs provide attractive targets for drug discovery, identifying ligands for these orphan receptors and determining the signaling pathways and cellular functions they control would significantly broaden the scope of drug target identification and drug discovery.
These two types of databases are complementary, and together they can provide annotated chemical libraries for lead and drug discovery.
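As a sketch of the chemical-similarity idea underlying approaches such as that of Keiser et al. (2009), the snippet below compares Morgan fingerprints by Tanimoto similarity using RDKit; the molecules are arbitrary examples, and the ligand-set aggregation and statistical corrections of the published method are omitted.

```python
"""Ligand-based similarity searching (illustrative sketch)."""
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """ECFP4-like Morgan fingerprint as a 2048-bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

query = fingerprint("CC(=O)Nc1ccc(O)cc1")  # acetaminophen as an example query

# Hypothetical ligand set annotated to some target of interest.
ligand_set = ["Cc1ccc(O)cc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]

for smi in ligand_set:
    sim = DataStructs.TanimotoSimilarity(query, fingerprint(smi))
    print(f"{smi}\tTanimoto = {sim:.2f}")
```

Aggregating such pairwise similarities over all ligands annotated to a target gives a crude target-level score; the published approach additionally corrects for ligand-set size and the similarity expected by chance.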
MicroRNA-based Biological Markers Associated with Diseased States

Recent advances show that non-mRNA-based biological markers can also be incorporated into such an integrated scenario. For example, drug-induced liver injury (DILI) is a frequent side effect of many drugs. DILI poses a significant threat to patient health, and therefore has an enormous adverse impact on the
economics and progress of drug development. Efforts to identify direct early genetic biomarkers of DILI have not made much progress. One potential way to approach this problem is to study the association among specific chemical structure, non-mRNA-based target expression and the development of a phenotype, such as a pathological condition. Using acetaminophen overdose-induced liver injury in the mouse, Wang et al. (2009) observed highly significant differences in the spectrum and levels of microRNAs (miRNAs) in both liver tissue and plasma between control and acetaminophen-treated animals. They found that miR-122 and miR-192 were enriched in the liver and exhibited dose- and exposure duration-dependent changes in the plasma that parallel serum aminotransferase levels and the histopathology of liver degeneration; the miRNA changes, however, can be detected significantly earlier. These findings demonstrate the utility of specific circulating microRNAs as sensitive and informative biomarkers for DILI. In other words, the expression of these specific miRNAs can be used to predict the hepatotoxic effect of chemicals that are structurally similar to acetaminophen. Clayton et al. (2009) demonstrated that, just like gene expression or miRNA signature data, pharmacometabolomics data can also be used for such prognostication purposes. For example, acetaminophen and p-cresol, both aromatic phenols, are structurally quite similar, and both compete for sulfation during metabolism. Clayton et al. (2009) demonstrated that individuals with high predose urinary levels of p-cresol sulfate had low postdose urinary ratios of acetaminophen sulfate to acetaminophen glucuronide. In other words, individuals excreting comparatively high concentrations of p-cresol-O-sulfate were prone to excrete relatively less acetaminophen-O-sulfate and larger amounts of acetaminophen-O-glucuronide than people excreting low amounts of p-cresol-O-sulfate. Several other examples also strongly suggest an association between specific pathological states and the expression of tissue-specific and/or circulating miRNA levels. For example, a large number of studies have focused on the role of miRNAs in carcinogenesis. The proto-oncogenic miRNA cluster miR-17-92 is overexpressed and amplified in many cancers, such as lymphomas, lung cancers and hepatocyte proliferation, and its introduction accelerates tumorigenicity. Thus, increased expression of the miR-17-92 cluster could be used as a marker for carcinogenesis. Several studies have implicated the miR-34 family of miRNAs (miR-34a, 34b and 34c) in the p53 tumor suppressor network; their expression is robustly induced by DNA damage and oncogenic stress in a p53-dependent manner. Further examples of tumor suppressor miRNAs are miR-15a and miR-16-1, which have been studied in great detail for their mechanism of action. These miRNAs are negative regulators of the antiapoptotic protein Bcl2, and their mechanism of action involves miRNA-mediated translational repression of the Bcl2 protein (see review by Choudhuri, 2010). However, the utility of DNA methylation and specific histone modifications in such a chemical structure–epigenomic modification–phenotypic outcome prediction paradigm is yet to be established. The accuracy of prediction will definitely depend on the extent of available biological and chemical information.
In other words, robust databases encompassing information on the characterization of various biological pathways, in vitro/in vivo models and various perturbagens, as well as other details such as dose and time response, are all essential. Using all this information, expanded connectivity maps can be generated. Lamb et al. (2006) asserted that, at the present moment, even an incomplete connectivity map will probably accelerate progress in characterizing new chemical entities, finding new uses for existing drugs and understanding the molecular mechanisms of disease. A challenge of chemical genomics is that biological assays increasingly monitor multiple parameters and are thus becoming more complex to analyze. Additionally, investigating the effects of compound combinations, including synergies, or helping delineate possible mechanisms of action also poses unique challenges (Kümmerl and Parker, 2011).
REFERENCES
Agrafiotis DK, Cedeño W. 2002. Feature selection for structure–activity correlation using binary particle swarms. J. Med. Chem. 45: 1098–1107.
Agrafiotis DK, Bandyopadhyay D, Wegner JK, van Vlijmen H. 2007. Recent advances in chemoinformatics. J. Chem. Inf. Model. 47: 1279–1293.
Al-Razzak M, Glen RC. 1992. Applications of rule-induction in the derivation of quantitative structure–activity relationships. J. Comput. Aided Mol. Des. 6: 349–383.
Arvidson KB, Chanderbhan R, Muldoon-Jacobs K, Mayer J, Ogungbesan A. 2010. Regulatory use of computational toxicology tools and databases at the United States Food and Drug Administration's Office of Food Additive Safety. Expert Opin. Drug Metab. Toxicol. 6: 793–796.
Baskin I, Palyulin VA, Zefirov NS. 2008. Neural networks in building QSAR models. Meth. Mol. Biol. 458: 137–158.
Benigni R. 2005. Structure–activity relationship studies of chemical mutagens and carcinogens: mechanistic investigations and prediction approaches. Chem. Rev. 105: 1767–1800.
Benigni R, Bossa C. 2008a. Predictivity of QSAR. J. Chem. Inf. Model. 48: 971–980.
Benigni R, Bossa C. 2008b. Predictivity and reliability of QSAR models: the case of mutagens and carcinogens. Toxicol. Mech. Meth. 18: 137–147.
Benigni R, Bossa C, Netzeva T, Rodomonte A, Tsakovska I. 2007. Mechanistic QSAR of aromatic amines: new models for discriminating between homocyclic mutagens and nonmutagens, and validation of models for carcinogens. Environ. Mol. Mutagen. 48: 754–771.
Bercu JP, Morton SM, Deahl JT, Gombar VK, Callis CM, van Lier RB. 2010. In silico approaches to predicting cancer potency for risk assessment of genotoxic impurities in drug substances. Regul. Toxicol. Pharmacol. 57: 300–306.
Bienfait B. 1994. Applications of high-resolution self-organizing maps to retrosynthetic and QSAR analysis. J. Chem. Inform. Comput. Sci. 34: 890–898.
Braconi C, Meng F, Swenson E, Khrapenko L, Huang N, Patel T. 2009. Candidate therapeutic agents for hepatocellular cancer can be identified from phenotype-associated gene expression signatures. Cancer 115: 3738–3748.
Brandt G, Schmidt M, Prisinzano T, Blagg B. 2008. Gedunin, a novel Hsp90 inhibitor: semisynthesis of derivatives and preliminary structure–activity relationships. J. Med. Chem. 51: 6495–6502.
Breiman L. 2001. Random forests. Mach. Learn. 45: 5–32.
Brown F. 1998. Chemoinformatics: what is it and how does it impact drug discovery. Annu. Rep. Med. Chem. 33: 375–384.
Boyer S. 2009. The use of computer models in pharmaceutical safety evaluation. Altern. Lab. Anim. 37: 467–475.
Callinan P, Feinberg A. 2006. The emerging science of epigenomics. Hum. Mol. Genet. 15: R95–R101.
Cartwright HM. 2008. Artificial neural networks in biology and chemistry: the evolution of a new analytical tool. Meth. Mol. Biol. 458: 1–13.
Chen WL. 2006. Chemoinformatics: past, present, and future. J. Chem. Inf. Model. 46: 2230–2255.
Cheng T, Choudhuri S, Jacobs K. 2012. Epigenetic targets of some toxicologically relevant metals: a review of the literature. J. Appl. Toxicol. in press.