A Tour of Structural Genomics (En)
A TOUR OF STRUCTURAL GENOMICS
Steven E. Brenner
Structural genomics projects aim to provide an experimental or computational three-
dimensional model structure for all of the tractable macromolecules that are encoded by
complete genomes. To this end, pilot centres worldwide are now exploring the feasibility of
large-scale structure determination. Their experimental structures and computational models
are expected to yield insight into the molecular function and mechanism of thousands of
proteins. The pervasiveness of this information is likely to change the use of structure in
molecular biology and biochemistry.
Department of Plant and Microbial Biology, University of California, 461A Koshland Hall, Berkeley, California 94720-3102, USA. e-mail: brenner@compbio.berkeley.edu
The explosive growth of genetic sequence information has offered us comprehensive collections of the protein sequences found in many living organisms. Most of these are not experimentally characterized. Although half of the proteins that are encoded in sequenced eukaryotic genomes have computationally recognized homology to at least one well-characterized domain1,2, functional interpretation of these matches is fraught with difficulty. Functional changes over evolutionary time3,4 and database errors5 confound reliable computational prediction of the precise roles of newly discovered genes. Even proteins with recognized domains are often scattered with regions of unmatched sequence. So, most of the residues in putative gene products lack any computational annotation, and there exists no general experimental approach to directly ascertain their molecular role.
The challenge of understanding these gene products has led to the development of functional genomics methods, which collectively aim to imbue the raw sequence with biological understanding. Structural genomics is one such approach, with unique promise to reveal the molecular function6 of protein domains.
Protein structure represents a powerful means of discovering function, because structure is well conserved over evolutionary time, and it therefore provides the opportunity to recognize homology that is undetectable by sequence comparison. This became apparent with the first two protein structures that were determined, because their common ancestry was clear from the three-dimensional fold7 (FIG. 1), although their sequences did not contain recognizable similarity8. (Modern sequence analysis, however, would now detect their similarity.) Today, the literature is rich with celebrated cases of homology inferred from structure, including the unexpected similarity between actin and the 70-kDa heat-shock cognate protein9, the TopRim domain shared between some topoisomerases, primases and nucleases10,11, and the highly similar constant and variable domains of immunoglobulins. Indeed, most evolutionary relationships cannot be detected from sequence12.
In addition, the three-dimensional structure of a protein can yield direct insight into its molecular mechanism. For example, the structure of the TATA-box-binding protein (TBP) when it is bound to DNA provides not only a sense of how these molecules interact in general, but also some fascinating clues about DNA-binding specificity. Furthermore, structural understanding of recognition mechanisms in major histocompatibility complex molecules and T-cell receptors helped to make immunology comprehensible at a molecular level13,14. Structural genomics efforts plan to extend structural insight to a broad repertoire of proteins, using large-scale high-throughput techniques15–26.
While the term ‘structural genomics’ is sometimes loosely used to encompass disparate large-scale efforts to determine protein structure, by international agreement it has come to have a relatively specific meaning (see link to the Airlie Agreement for ‘Agreed Principles
[Figure 1, panels a–d: images not reproduced.]
e | Sequence alignment:
4hhba VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS.....HGSAQVKGHGKKVA
1mbd_ VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL

4hhba DALTNAVAHVD..DMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR......
1mbd_ TALGAILKK.K.GHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Figure 1 | Structure similarity without sequence similarity. The first two protein structures that were solved — sperm-
whale myoglobin and horse haemoglobin — were recognizable as homologues even at low resolution, even though their
sequences were more different than similar. a | Papier mâché model of sperm-whale myoglobin. b | Baked and painted foam
model of horse haemoglobin. Modern representations of these structures clearly show the areas of structural similarity
(highlighted in red in c and d). c | Myoglobin (Protein Data Bank (PDB) code 1mbd)117. d | Human haemoglobin (PDB code
4hhb)118. e | Alignment of sperm-whale myoglobin and human α-haemoglobin sequences119 shows little sequence similarity. Photos
taken of the structures at the MRC Laboratory of Molecular Biology by S.E.B. Computer images were generated using
Rasmol120, Molscript121 and Raster3D122.
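As a rough illustration of what "little sequence similarity" means for the alignment in panel e, the sketch below computes percent identity for a pair of aligned rows. It is only a minimal sketch: the function name `percent_identity`, the choice to count identities over gap-free columns only, and the use of `.` as the gap character are assumptions made here for illustration, not a method described in the article.

```python
GAP = "."  # gap character as it appears in the Figure 1e alignment


def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == GAP or b == GAP:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared


# First block of the Figure 1e alignment, copied from the figure.
hhb_alpha = "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS.....HGSAQVKGHGKKVA"
myoglobin = "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL"
print(f"{percent_identity(hhb_alpha, myoglobin):.1f}% identity over gap-free columns")
```

Other definitions of identity (for example, dividing by the length of the shorter sequence) would give somewhat different numbers, so any value from this sketch should be read as indicative only.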
and Procedures’). In this more purist sense, structural genomics is an effort to create a representative set of experimental macromolecular structures, which will be augmented by computational methods to provide model structures for most tractable macromolecules. Although this reflects a primary focus on surveying the structures of different families, agreed goals of structural genomics include the study of biologically interesting molecules, such as those from model organisms and those with medical importance. In addition, structural genomics specifically aims to derive function from the structures.
Because structural genomics is in its infancy, its course might change over the next several years; indeed, the experiences of the current pilot centres will inform future directions. However, the relatively precise definition of structural genomics includes several hints about the limitations and scope of the field. For example, structural genomics efforts often study individual protein domains, rather than whole proteins or complexes, because domains are the fundamental units of protein structure and evolution. For the time being, proteins and other macromolecules that are not tractable for high-throughput characterization will largely be left unconsidered by structural genomics efforts. Over time, however, the gamut of molecules suitable for large-scale studies is likely to increase; one can already imagine what structural genomics of RNA might involve27, although no such projects are underway at present. Moreover, rather than solving the structures of all domains, the general intent at present is to solve experimentally the structure of one representative domain from each family, and use computational comparative modelling to provide the COORDINATES for related proteins. In this way, current structural genomics is a conjoined experimental and computational effort, which expects to provide a comprehensive repertoire of models of soluble globular-protein domains. This review outlines how proteins are selected for structural genomics and how they are experimentally characterized in a typical pilot centre, discusses some early results, and suggests what they might mean for the future of the field.
The process
The principles of experimental structural genomics are largely the same as those for traditional structural biology, but differ in motivation, automation and scale. The key to the success of this scientific venture is the ability to optimize the structure-determination process, so as to reap economies of scale as centres increase their throughput.
COORDINATES
A set of numbers that specify the X, Y and Z positions for each atom in a protein. Together, they describe the molecular structure.
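To make the COORDINATES glossary entry concrete, the following sketch stores a structure as a list of atoms, each with X, Y and Z positions, and derives two simple quantities from them. The `Atom` class, the helper names and the coordinate values are all hypothetical and invented for illustration; they are not taken from any real structure or file format.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    """One atom: a label plus X, Y and Z positions (in angstroms)."""
    name: str
    x: float
    y: float
    z: float


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms."""
    return dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def centroid(atoms: list) -> tuple:
    """Mean X, Y and Z position of a non-empty list of atoms."""
    n = len(atoms)
    return (
        sum(a.x for a in atoms) / n,
        sum(a.y for a in atoms) / n,
        sum(a.z for a in atoms) / n,
    )


# Hypothetical coordinates, invented for illustration only.
model = [
    Atom("N", 0.0, 0.0, 0.0),
    Atom("CA", 3.0, 4.0, 0.0),
    Atom("C", 3.0, 4.0, 12.0),
]
print(distance(model[0], model[1]))
print(centroid(model))
```

A comparative model for a related protein would, in this picture, simply be another such list of atoms whose positions were inferred from a template rather than measured.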
[Figure 2 flowchart: image not reproduced. Box labels recovered from the figure are listed below.]
Main path: Choose targets; Cloning coding sequences; Expression; Soluble; Purify; Quality assurance/biophysical analysis; Likely to crystallize?; Crystallization trials (or NMR); Microcrystals; Diffraction-quality crystals; Contains methionine?; Obtain SeMet crystals and MAD data collection, or MIR search and MIR data collection; Phasing, model building, refinement; Deposit structure in PDB, create homology models, annotate structure.
Side branches: Choose another family member?; Other expression systems?; Solubilize (refolding, detergents, metals, cofactors, etc.); Identify and correct problem (further purification, subclone, add metal or cofactor, other?); Disseminate clones; Disseminate proteins; Abandon.
Figure 2 | Processes involved in high-throughput structural genomics using X-ray crystallography. N indicates that a
process has failed and Y that it has succeeded. (MIR, multiple isomorphous replacement, an alternative to multiple anomalous
dispersion (MAD) phasing for structure determination; NMR, nuclear magnetic resonance; SeMet, selenomethionine.) (Modified with
permission from REF. 16.)
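One decision point in Figure 2 can be sketched in code: the choice of phasing route once diffraction-quality crystals are in hand. The sketch below assumes, from the box labels alone, that a protein containing methionine is routed to selenomethionine (SeMet) crystals and MAD data collection, and that one without methionine is routed to an MIR search. The function name and the returned strings are hypothetical, and the real flowchart includes fallbacks between the branches that are not modelled here.

```python
def choose_phasing_route(sequence: str) -> str:
    """Pick a phasing branch from the 'Contains methionine?' check (assumed wiring)."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("sequence must not be empty")
    if "M" in residues:
        # Methionine present: SeMet substitution can supply anomalous scatterers.
        return "SeMet crystals, then MAD data collection"
    # No methionine: fall back to heavy-atom derivatives.
    return "MIR search, then MIR data collection"


# Hypothetical one-letter sequences, invented for illustration only.
print(choose_phasing_route("MKVLAAGIT"))
print(choose_phasing_route("GSVLAAGIT"))
```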
Experimental structural genomics faces no single bottleneck to overcome: nearly every stage of the process needs to be refined and optimized. Moreover, many individual proteins are expected to be intractable without specialized extensive effort. Therefore, parallel studies on related proteins are being relied on to increase the likelihood of readily solving a structure for a family of proteins. The progress of individual protein targets through the experimental process will be like a funnel, with many targets starting at the same time, and a fraction failing at each stage of the process. The slope of the funnel is dependent on the effort devoted at each step, which is, in turn, a consequence of the specific motivations of the particular structural genomics centre.
Although the detailed processes of scaling up the procedures involved in structure determination are unique to each centre for structural genomics, several characteristics are shared among most centres (FIG. 2). The experimental process begins with the cloning of selected target sequences, frequently with recombination-based vectors that allow the creation of many different constructs. These vectors incorporate different affinity tags, such as HIS-TAGS and glutathione-S-transferase (GST), to aid purification, as well as promoters that allow trials in different expression systems28. Expressing high levels of soluble protein is a particular challenge, so there is considerable interest in fusions between the target protein and green fluorescent protein that fluoresce only when soluble and folded, therefore indicating folded proteins in solution29. Cell-free expression systems hold great promise for improving yields and allowing the production of toxic proteins30. Another optimization is the use of hyperthermophilic proteins, which are easier to purify when expressed in MESOPHILIC hosts, as they are resistant to heat that will denature most of the proteins of the host.
The expressed proteins might have their domain boundaries identified by proteolysis and mass spectrometry, and several groups subject samples to DYNAMIC LIGHT SCATTERING to detect when proteins have formed heterogeneously sized oligomers that are unlikely to crystallize. In some centres, the proteins are studied by a heteronuclear single-quantum coherence nuclear magnetic resonance (HSQC NMR) experiment, because this technique gives insight into the ‘foldedness’ of a protein31,32. Any promising purified soluble proteins are then subjected to crystallization trials or NMR experiments.
HIS-TAG
A series of histidine residues fused to a protein that aids protein purification because of its strong binding to nickel columns.
MESOPHILE
An organism that grows at moderate temperature.
DYNAMIC LIGHT SCATTERING
A technique for determining apparent molecular size, in which laser light is shone on a solution. Its scatter corresponds to the diffusion rate and, therefore, the size of the molecules in solution.
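The funnel picture of target attrition described in this section can be sketched numerically. In the sketch below, each stage keeps a fixed fraction of the targets that reach it, so the expected number remaining is the running product of those fractions. The stage names follow the text, but every success fraction is a hypothetical placeholder chosen for illustration; the article gives no such figures, and real rates differ between centres.

```python
def funnel(n_targets: int, stage_success: list) -> list:
    """Expected number of targets remaining after each (stage, success fraction)."""
    remaining = float(n_targets)
    survivors = []
    for stage, fraction in stage_success:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"success fraction for {stage!r} must be in [0, 1]")
        remaining *= fraction  # a fraction of targets fails at each stage
        survivors.append((stage, remaining))
    return survivors


# Hypothetical success fractions, invented for illustration only.
stages = [
    ("cloned", 0.9),
    ("expressed", 0.5),
    ("soluble", 0.5),
    ("purified", 0.8),
    ("crystallized", 0.25),
    ("structure solved", 0.5),
]
for stage, count in funnel(1000, stages):
    print(f"{stage:>16}: {count:7.1f}")
```

Raising the fraction at any one stage steepens or flattens the funnel at that point, which is one way to read the remark that its slope depends on the effort devoted to each step.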
Several centres are investing in considerable automation to allow parallel large-scale expression trials and parallel crystallization trials (TABLE 1); for example, the Joint Center for Structural Genomics hopes to be able to analyse up to 130,000 crystallization experiments per day33. To ensure optimal use of precious SYNCHROTRON time, BEAMLINE AUTOMATION is crucial34. In addition, careful tracking of laboratory results and analyses can be used to predict better which proteins will be most successful35; this information might then be fed into the target-selection process to improve future results.
Crystallography has benefited from many technologies, including the brilliance of synchrotron radiation and its tunability for multiple anomalous dispersion (MAD) PHASING36. Other improvements include charge-coupled device detectors, as well as the enhanced stability provided by cryocrystallography. NMR has seen similar advances, including cryogenic probes and higher-field magnets, as well as new techniques such as transverse relaxation-optimized spectroscopy (TROSY)32,37. Consequently, although early plans for structural genomics focused primarily on crystallography, NMR has already proved to have great value for the field32,38. At this time, most centres in the United States have NMR spectroscopists, and roughly half of the structural genomics effort in Japan uses NMR30,39.
The refinement of crystallographic structures has been reported to be the slowest step in structure determination (S.-H. Kim, personal communication), and the advent of highly automated structure-determination software for both crystallography40,41 and NMR42,43 is therefore likely to have a marked effect on increasing the speed of solution of structures.
SYNCHROTRON
A device that accelerates particles of atomic size through an electric field; it is used to produce synchronous packets of particles.
BEAMLINE AUTOMATION
Technologies to reduce human intervention on synchrotron beamlines, such as robots for mounting and centring crystals in the X-ray beam.
MAD PHASING
(Multiple anomalous dispersion). An approach to determining the phases of a crystal structure by relying on the anomalous scattering of X-rays near the absorption edge of the atom (such as selenium). It allows determination of phase from several sets of data collected from a single crystal.
Target selection: which proteins and how many?
It would be desirable to have an experimental molecular structure for every known protein, such as the ~600,000 in the protein sequence databases SWISS-PROT and TrEMBL44. However, practicalities dictate a compromise, whereby a more modest number of structures are solved, and these are used as templates for the comparative modelling of most soluble protein domains. A rough consensus indicates that it could be feasible for 10,000 structures to be experimentally solved over the next decade45.
Dennis Vitkup and colleagues have shown that this number of experimental structures is insufficient to provide templates for high-quality models of all protein domains46. To determine how many structure
1. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
3. Devos, D. & Valencia, A. Practical limits of function prediction. Proteins 41, 98–107 (2000).
4. Todd, A. E., Orengo, C. A. & Thornton, J. M. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307, 1113–1143 (2001).
5. Brenner, S. E. Errors in genome annotation. Trends Genet. 15, 132–133 (1999).
6. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25, 25–29 (2000).
7. Perutz, M. F. et al. Structure of hæmoglobin. A three-dimensional Fourier synthesis at 5.5 Å resolution, obtained by X-ray analysis. Nature 185, 416–422 (1960).
8. Kendrew, J. C. & Watson, H. C. Comparison between amino-acid sequences of sperm whale myoglobin and of human haemoglobin. Nature 190, 670 (1961).
9. Flaherty, K. M., McKay, D. B., Kabsch, W. & Holmes, K. C. Similarity of the three-dimensional structures of actin and the ATPase fragment of a 70-kDa heat shock cognate protein. Proc. Natl Acad. Sci. USA 88, 5041–5045 (1991).
10. Aravind, L., Leipe, D. D. & Koonin, E. V. Toprim — a conserved catalytic domain in type IA and II topoisomerases, DnaG-type primases, OLD family nucleases and RecR proteins. Nucleic Acids Res. 26, 4205–4213 (1998).
11. Berger, J. M., Fass, D., Wang, J. C. & Harrison, S. C. Structural similarities between topoisomerases that cleave one or both DNA strands. Proc. Natl Acad. Sci. USA 95, 7876–7881 (1998).
12. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA 95, 6073–6078 (1998).
13. Bjorkman, P. J. et al. Structure of the human class I histocompatibility antigen, HLA-A2. Nature 329, 506–512 (1987).
14. Wilson, I. A. & Garcia, K. C. T-cell receptor structure and TCR complexes. Curr. Opin. Struct. Biol. 7, 839–848 (1997).
15. Blundell, T. L. & Mizuguchi, K. Structural genomics: an overview. Prog. Biophys. Mol. Biol. 73, 289–295 (2000).
16. Burley, S. K. et al. Structural genomics: beyond the human genome project. Nature Genet. 23, 151–157 (1999).
17. Domingues, F. S., Koppensteiner, W. A. & Sippl, M. J. The role of protein structure in genomics. FEBS Lett. 476, 98–102 (2000).
18. Gaasterland, T. Structural genomics: bioinformatics in the driver’s seat. Nature Biotechnol. 16, 625–627 (1998).
19. Kim, S. H. Shining a light on structural genomics. Nature Struct. Biol. 5, 643–645 (1998).
20. Mittl, P. R. & Grutter, M. G. Structural genomics: opportunities and challenges. Curr. Opin. Chem. Biol. 5, 402–408 (2001).
21. Montelione, G. T. & Anderson, S. Structural genomics: keystone for a Human Proteome Project. Nature Struct. Biol. 6, 11–12 (1999).
22. Sali, A. 100,000 protein structures for the biologist. Nature Struct. Biol. 5, 1029–1032 (1998).
23. Shapiro, L. & Lima, C. D. The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure 6, 265–267 (1998).
24. Smith, T. A new era. Nature Struct. Biol. 7, 927 (2000).
The introduction to a supplement to Nature Structural Biology devoted to structural genomics, which contains 20 articles that address different aspects of the field.
25. Teichmann, S. A., Chothia, C. & Gerstein, M. Advances in structural genomics. Curr. Opin. Struct. Biol. 9, 390–399 (1999).
26. Teichmann, S. A., Murzin, A. G. & Chothia, C. Determination of protein function, evolution and interactions by structural genomics. Curr. Opin. Struct. Biol. 11, 354–363 (2001).
This review includes an analysis of 32 structural genomics proteins and presents lessons learned in each case.
27. Doudna, J. A. Structural genomics of RNA. Nature Struct. Biol. 7, 954–956 (2000).
28. Edwards, A. M. et al. Protein production: feeding the crystallographers and NMR spectroscopists. Nature Struct. Biol. 7, 970–972 (2000).
29. Waldo, G. S., Standish, B. M., Berendzen, J. & Terwilliger, T. C. Rapid protein-folding assay using green fluorescent protein. Nature Biotechnol. 17, 691–695 (1999).
30. Yokoyama, S. et al. Structural genomics projects in Japan. Prog. Biophys. Mol. Biol. 73, 363–376 (2000).
31. Christendat, D. et al. Structural proteomics of an archaeon. Nature Struct. Biol. 7, 903–909 (2000).
Describes the determination of ten protein structures from M. thermoautotrophicum, using the principle of finding proteins that are most amenable to structural characterization.
32. Montelione, G. T., Zheng, D., Huang, Y. J., Gunsalus, K. C. & Szyperski, T. Protein NMR spectroscopy in structural genomics. Nature Struct. Biol. 7, 982–985 (2000).
33. Terwilliger, T. C. Structural genomics in North America. Nature Struct. Biol. 7, 935–939 (2000).
34. Abola, E., Kuhn, P., Earnest, T. & Stevens, R. C. Automation of X-ray crystallography. Nature Struct. Biol. 7, 973–977 (2000).
35. Bertone, P. et al. SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucleic Acids Res. 29, 2884–2898 (2001).
36. Hendrickson, W. A. Synchrotron crystallography. Trends Biochem. Sci. 25, 637–643 (2000).
37. Wider, G. & Wuthrich, K. NMR spectroscopy of large molecules and multimolecular assemblies in solution. Curr. Opin. Struct. Biol. 9, 594–601 (1999).
38. Prestegard, J. H., Valafar, H., Glushka, J. & Tian, F. Nuclear magnetic resonance in the era of structural genomics. Biochemistry 40, 8677–8685 (2001).
39. Yokoyama, S. et al. Structural genomics projects in Japan. Nature Struct. Biol. 7, 943–945 (2000).
40. Adams, P. D. & Grosse-Kunstleve, R. W. Recent developments in software for the automation of crystallographic macromolecular structure determination. Curr. Opin. Struct. Biol. 10, 564–568 (2000).
41. Lamzin, V. S. & Perrakis, A. Current state of automated crystallographic data analysis. Nature Struct. Biol. 7, 978–981 (2000).
42. Helgstrand, M., Kraulis, P., Allard, P. & Hard, T. Ansig for Windows: an interactive computer program for semiautomatic assignment of protein NMR spectra. J. Biomol. NMR 18, 329–336 (2000).
43. Zimmerman, D. E. et al. Automated analysis of protein NMR assignments using methods from artificial intelligence. J. Mol. Biol. 269, 592–610 (1997).
44. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
45. Norvell, J. C. & Machalek, A. Z. Structural genomics programs at the US National Institute of General Medical Sciences. Nature Struct. Biol. 7, 931 (2000).
46. Vitkup, D., Melamud, E., Moult, J. & Sander, C. Completeness in structural genomics. Nature Struct. Biol. 8, 559–566 (2001).
This paper predicts the number of structure determinations necessary to provide three-dimensional models of all (or most) families of proteins.
47. Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 28, 263–266 (2000).
48. Kim, K. K., Hung, L. W., Yokota, H., Kim, R. & Kim, S. H. Crystal structures of eukaryotic translation initiation factor 5A from Methanococcus jannaschii at 1.8 Å resolution. Proc. Natl Acad. Sci. USA 95, 10419–10424 (1998).
A report of one of the first structural genomics proteins solved; it represented inadvertent duplication of effort, as the same structure was independently solved in the next reference.
49. Peat, T. S., Newman, J., Waldo, G. S., Berendzen, J. & Terwilliger, T. C. Structure of translation initiation factor 5A from Pyrobaculum aerophilum at 1.75 Å resolution. Structure 6, 1207–1214 (1998).
50. Sinha, S. et al. Crystal structure of Bacillus subtilis YabJ, a purine regulatory protein and member of the highly conserved YjgF family. Proc. Natl Acad. Sci. USA 96, 13074–13079 (1999).
51. Volz, K. A test case for structure-based functional assignment: the 1.2 Å crystal structure of the YjgF gene product from Escherichia coli. Protein Sci. 8, 2428–2437 (1999).
52. Smaglik, P. Protein structure groups seek to draft common ground rules. Nature 403, 691 (2000).
53. Brenner, S. E. Target selection for structural genomics. Nature Struct. Biol. 7, 967–969 (2000).
54. Kuroda, Y., Tani, K., Matsuo, Y. & Yokoyama, S. Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics. Protein Sci. 9, 2313–2321 (2000).
55. Dietmann, S. et al. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29, 55–57 (2001).
An introduction to one of the most popular systems for automatically comparing proteins of known structure.
56. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).
The SCOP database is a comprehensive expert-curated hierarchical evolutionary classification of protein domains using structural information.
57. Pearl, F. M. et al. A rapid classification protocol for the CATH Domain Database to support structural genomics. Nucleic Acids Res. 29, 223–227 (2001).
An introduction to CATH, a largely automated hierarchical classification of protein domain structures.
58. Siddiqui, A. S., Dengler, U. & Barton, G. J. 3Dee: a database of protein structural domains. Bioinformatics 17, 200–201 (2001).
59. Apic, G., Gough, J. & Teichmann, S. A. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325 (2001).
60. Apic, G., Gough, J. & Teichmann, S. A. An insight into domain combinations. Bioinformatics 17 (Suppl. 1), S83–S89 (2001).
61. Saha, S. et al. Solution structure of the LDL receptor EGF-AB pair. A paradigm for the assembly of tandem calcium binding EGF domains. Structure 9, 451–456 (2001).
62. Gerstein, M. Integrative database analysis in structural genomics. Nature Struct. Biol. 7, 960–963 (2000).
63. Fischer, D. Rational structural genomics: affirmative action for ORFans and the growth in our structural knowledge. Protein Eng. 12, 1029–1030 (1999).
This paper describes interesting features of genes without homologues and the ability of structural genomics to elucidate their provenance.
64. Galperin, M. Y. Conserved ‘hypothetical’ proteins: new hints and new puzzles. Comp. Funct. Genomics 2, 14–18 (2001).
65. Linial, M. & Yona, G. Methodologies for target selection in structural genomics. Prog. Biophys. Mol. Biol. 73, 297–320 (2000).
66. Mallick, P., Goodwill, K. E., Fitz-Gibbon, S., Miller, J. H. & Eisenberg, D. Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing. Proc. Natl Acad. Sci. USA 97, 2450–2455 (2000).
67. Erlandsen, H., Abola, E. E. & Stevens, R. C. Combining structural genomics and enzymology: completing the picture in metabolic pathways and enzyme active sites. Curr. Opin. Struct. Biol. 10, 719–730 (2000).
68. Lewis, H. A. et al. A structural genomics approach to the study of quorum sensing. Crystal structures of three LuxS orthologs. Structure 9, 527–537 (2001).
69. Terwilliger, T. C. et al. Class-directed structure determination: foundation for a protein structure initiative. Protein Sci. 7, 1851–1856 (1998).
70. Shapiro, L. & Harris, T. Finding function through structural genomics. Curr. Opin. Biotechnol. 11, 31–35 (2000).
71. Skolnick, J., Fetrow, J. S. & Kolinski, A. Structural genomics and its importance for gene function analysis. Nature Biotechnol. 18, 283–287 (2000).
72. Thornton, J. M. From genome to function. Science 292, 2095–2097 (2001).
73. Thornton, J. M., Todd, A. E., Milburn, D., Borkakoti, N. & Orengo, C. A. From structure to function: approaches and limitations. Nature Struct. Biol. 7, 991–994 (2000).
74. Berman, H. M. et al. The Protein Data Bank and the challenge of structural genomics. Nature Struct. Biol. 7, 957–959 (2000).
75. Gibrat, J. F., Madej, T. & Bryant, S. H. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6, 377–385 (1996).
76. Orengo, C. A. & Taylor, W. R. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol. 266, 617–635 (1996).
77. Shindyalov, I. N. & Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11, 739–747 (1998).
78. Subbiah, S., Laurents, D. V. & Levitt, M. Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr. Biol. 3, 141–149 (1993).
79. Brenner, S. E. & Levitt, M. Expectations from structural genomics. Protein Sci. 9, 197–200 (2000).
Uses historical data to predict the fraction of new folds and new superfamilies to be discovered by structural genomics.
80. Koppensteiner, W. A., Lackner, P., Wiederstein, M. & Sippl, M. J. Characterization of novel proteins based on known protein structures. J. Mol. Biol. 296, 1139–1152 (2000).
81. Cort, J. R., Yee, A., Edwards, A. M., Arrowsmith, C. H. & Kennedy, M. A. Structure-based functional classification of hypothetical protein MTH538 from Methanobacterium
thermoautotrophicum. J. Mol. Biol. 302, 189–203 (2000).
82. Cort, J. R., Yee, A., Edwards, A. M., Arrowsmith, C. H. & Kennedy, M. A. NMR structure determination and structure-based functional characterization of conserved hypothetical protein MTH1175 from Methanobacterium thermoautotrophicum. J. Struct. Funct. Genomics 1, 15–25 (2001).
83. Fetrow, J. S., Godzik, A. & Skolnick, J. Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. J. Mol. Biol. 282, 703–711 (1998).
84. Wallace, A. C., Borkakoti, N. & Thornton, J. M. TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci. 6, 2308–2323 (1997).
85. Wei, L. & Altman, R. B. Recognizing protein binding sites using statistical descriptions of their 3D environments. Pac. Symp. Biocomput. 4, 497–508 (1998).
86. Lichtarge, O., Bourne, H. R. & Cohen, F. E. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257, 342–358 (1996).
87. Sowa, M. E. et al. Prediction and confirmation of a site critical for effector regulation of RGS domain activity. Nature Struct. Biol. 8, 234–237 (2001).
88. Boggon, T. J., Shan, W. S., Santagata, S., Myers, S. C. & Shapiro, L. Implication of tubby proteins as transcription factors by structure-based functional analysis. Science 286, 2119–2125 (1999).
This paper predicts the DNA-binding function of tubby proteins on the basis of examination of the surface electrostatics of the structure.
89. Teplova, M. et al. The structure of the YrdC gene product from Escherichia coli reveals a new fold and suggests a role in RNA binding. Protein Sci. 9, 2557–2566 (2000).
90. Hwang, K. Y., Chung, J. H., Kim, S. H., Han, Y. S. & Cho, Y. Structure-based identification of a novel NTPase from Methanococcus jannaschii. Nature Struct. Biol. 6, 691–696 (1999).
91. Minasov, G. et al. Functional implications from crystal
97. Dunker, A. K. et al. Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac. Symp. Biocomput. 473–484 (1998).
98. Wootton, J. C. & Federhen, S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266, 554–571 (1996).
99. Wright, P. E. & Dyson, H. J. Intrinsically unstructured proteins: re-assessing the protein structure–function paradigm. J. Mol. Biol. 293, 321–331 (1999).
100. Schaffer, A. A. et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29, 2994–3005 (2001).
101. Fowler, C. A., Tian, F., Al-Hashimi, H. M. & Prestegard, J. H. Rapid determination of protein folds using residual dipolar couplings. J. Mol. Biol. 304, 447–460 (2000).
102. Potts, B. C. & Chazin, W. J. Chemical shift homology in proteins. J. Biomol. NMR 11, 45–57 (1998).
103. Young, M. M. et al. High throughput protein fold identification by using experimental constraints derived from intramolecular cross-links and mass spectrometry. Proc. Natl Acad. Sci. USA 97, 5802–5806 (2000).
In this work, cross-linking and mass spectrometry were used to glean limited structural information, sufficient to predict a protein fold.
104. Simons, K. T., Strauss, C. & Baker, D. Prospects for ab initio protein structural genomics. J. Mol. Biol. 306, 1191–1199 (2001).
105. Wuthrich, K. Protein recognition by NMR. Nature Struct. Biol. 7, 188–189 (2000).
106. Baumeister, W. & Steven, A. C. Macromolecular electron microscopy in the era of structural genomics. Trends Biochem. Sci. 25, 624–631 (2000).
107. Heinemann, U. Structural genomics in Europe: slow start, strong finish? Nature Struct. Biol. 7, 940–942 (2000).
108. Butler, D. Wellcome discusses structural genomics effort with industry. . . but data release remains an open question. Nature 406, 923–924 (2000).
109. Williamson, A. R. Creating a structural genomics consortium. Nature Struct. Biol. 7, 953 (2000).
117. Phillips, S. E. & Schoenborn, B. P. Neutron diffraction reveals oxygen–histidine hydrogen bond in oxymyoglobin. Nature 292, 81–82 (1981).
118. Fermi, G., Perutz, M. F., Shaanan, B. & Fourme, R. The crystal structure of human deoxyhaemoglobin at 1.74 Å resolution. J. Mol. Biol. 175, 159–174 (1984).
119. Bashford, D., Chothia, C. & Lesk, A. M. Determinants of a protein fold. Unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199–216 (1987).
120. Sayle, R. A. & Milner-White, E. J. RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20, 374 (1995).
121. Kraulis, P. J. Molscript: a program to produce both detailed and schematic plots of protein structure. J. Appl. Crystallography 24, 946–950 (1991).
122. Merritt, E. A. & Bacon, D. J. Raster3d: photorealistic molecular graphics. Methods Enzymol. 277, 505–524 (1997).
123. Eisenstein, E. et al. Biological function made crystal clear — annotation of hypothetical proteins via structural genomics. Curr. Opin. Biotechnol. 11, 25–30 (2000).
124. Heinemann, U. et al. An integrated approach to structural genomics. Prog. Biophys. Mol. Biol. 73, 347–362 (2000).
125. Dry, S., McCarthy, S. & Harris, T. Structural genomics in the biotechnology sector. Nature Struct. Biol. 7, 946–949 (2000).
Acknowledgements
This work is supported by NIH grants and a Searle Scholarship. S.E.B. is grateful to J.-M. Chandonia, L. Lo Conte and R. Peters for critical review of the manuscript.
Online Links
DATABASES
The following terms in this article are linked online to:
InterPro: http://www.ebi.ac.uk/interpro/
TIM | TopRim
LocusLink: http://www.ncbi.nlm.nih.gov/LocusLink/
TBP | tubby
structures of the conserved Bacillus subtilis protein Maf 110. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids OMIM: http://www.ncbi.nlm.nih.gov/Omim/
with and without dUTP. Proc. Natl Acad. Sci. USA 97, Res. 28, 235–242 (2000). retinitis pigmentosa type 14
6328–6333 (2000). 111. Orengo, C. A. et al. The CATH database provides insights
92. Lim, K. et al. Crystal structure of YecO from Haemophilus into protein structure/function relationships. Nucleic Acids FURTHER INFORMATION
influenzae (HI0319) reveals a methyltransferase fold and a Res. 27, 275–279 (1999). Airlie Agreement:
bound S-adenosylhomocysteine. Proteins (in the press). 112. Brenner, S. E., Barken, D. & Levitt, M. The PRESAGE
http://www.nigms.nih.gov/news/meetings/airlie.html#agree
93. Zarembinski, T. I. et al. Structure-based assignment of the database for structural genomics. Nucleic Acids Res. 27,
Airlie Conference:
biochemical function of a hypothetical protein: a test case 251–253 (1999).
http://www.nigms.nih.gov/news/meetings/airlie.html
of structural genomics. Proc. Natl Acad. Sci. USA 95, 113. Sanchez, R. & Sali, A. ModBase: a database of
CATH: http://www.biochem.ucl.ac.uk/bsm/cath_new/
15189–15193 (1998). comparative protein structure models. Bioinformatics 15,
Dali: http://www.ebi.ac.uk/dali/
This paper reports that a bound ATP that was found 1060–1061 (1999).
in the solved structure indicated that this 114. Huynen, M. et al. Homology-based fold predictions for ModBase: http://pipe.rockefeller.edu/modbase/
hypothetical protein is a molecular switch. Mycoplasma genitalium proteins. J. Mol. Biol. 280, National Institute of General Medical Sciences (NIGMS):
94. Sanchez, R. et al. Protein structure modeling for structural 323–326 (1998). http://www.nigms.nih.gov
genomics. Nature Struct. Biol. 7, 986–990 (2000). 115. Rychlewski, L., Zhang, B. & Godzik, A. Functional insights Pfam: http://www.sanger.ac.uk/Software/Pfam/
95. Friedberg, I., Kaplan, T. & Margalit, H. Evaluation of PSI- from structural predictions: analysis of the Escherichia coli PRESAGE: http://presage.berkeley.edu
BLAST alignment accuracy in comparison to structural genome. Protein Sci. 8, 614–624 (1999). Protein Data Bank: http://www.rcsb.org/pdb/
alignments. Protein Sci. 9, 2278–2284 (2000). 116. Teichmann, S. A., Park, J. & Chothia, C. Structural SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
96. Sauder, J. M., Arthur, J. W. & Dunbrack, R. L. Jr Large-scale assignments to the Mycoplasma genitalium proteins show SNP Consortium: http://snp.cshl.org
comparison of protein sequence alignment algorithms with extensive gene duplications and domain rearrangements. Structuralgenomics.org: http://www.structuralgenomics.org
structure alignments. Proteins 40, 6–22 (2000). Proc. Natl Acad. Sci. USA 95, 14658–14663 (1998). SWISS-PROT and TrEMBL: http://www.expasy.ch/sprot/