Protein Modeling: Protein Structure Prediction Other Topics
June 7, 2007
Protein Architecture
Proteins are polymers consisting of amino acids linked by peptide bonds.
Each amino acid consists of:
  a central carbon atom
  an amino group (NH2)
  a carboxyl group (COOH)
  a side chain
Differences in side chains distinguish different amino acids.
Peptide Bonds
[Figure labels: amino group, side chain, carboxyl group]
Levels of Description
Protein structure is often described at four different scales:
  primary structure
  secondary structure
  tertiary structure
  quaternary structure
Secondary Structure
Secondary structure refers to certain common repeating structures.
It is a local description of structure.
Two common secondary structures:
  helices
  strands/sheets
A third category, called coil or loop, refers to everything else.
Helices
carbon
Sheets
Motivation
We want to identify the function of the genes we find, and what different mutations/alleles do
One gene = one protein (sort of)
Function of protein = function of gene
But these take time, and are prone to mistakes
Goal: if we can determine the structure of every protein, learning their functions isn't too far away
Similar problems
Straight-up 3D prediction is hard (Nobel awaits)
Subproblem 1: Identify patterns in sequence
Profile HMMs, multiple sequence alignments
http://www.ludwig.edu.au/course/course2002/
DNA
Myoglobin
From www.inst.bnl.gov/GasDetectorLab/x-rays/SRI94.htm
Myoglobin
S.E.V. Phillips. "Structure and refinement of oxymyoglobin at 1.6 Å resolution." J. Mol. Biol. 1980, 142, 531.
NMR
Nuclear Magnetic Resonance Spectroscopy
Unlike X-ray crystallography, cannot handle large proteins
Exploits the chemical environment to return distances between atoms
Can use knowledge of restraints to identify positions of atoms that produce peaks
Protein structure determination in solution by NMR spectroscopy Wuthrich K. J Biol Chem. 1990 December 25;265(36):22059-62
Experimental Methods
Very expensive and time-consuming
Computational methods can help with time (Frank DiMaio)
More motivation
There is a large sequence-structure gap:
  158K protein sequences in the SwissProt database
  27K protein structures in the PDB database
Key question: can we predict structures by computational means instead?
2D Prediction Approaches
use secondary structure predictions to predict short-range contacts (e.g. hydrogen bonds in helices)
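A minimal sketch of how a secondary structure prediction can be turned into short-range contact predictions. In an alpha helix, the backbone C=O of residue i typically hydrogen-bonds to the N-H of residue i+4, so a predicted helical stretch suggests (i, i+4) contacts. The three-state string (H = helix, E = strand, C = coil) and the function name helix_contacts are assumptions made for illustration, not part of the original slides.

```python
def helix_contacts(ss, offset=4):
    """Return (i, i + offset) index pairs lying entirely inside a predicted helix.

    ss is a hypothetical three-state prediction string: 'H' helix, 'E' strand, 'C' coil.
    Indices are 0-based positions in the sequence.
    """
    contacts = []
    for i in range(len(ss) - offset):
        # Require every residue from i through i + offset to be predicted helical.
        if all(state == "H" for state in ss[i:i + offset + 1]):
            contacts.append((i, i + offset))
    return contacts


predicted_ss = "CCHHHHHHCCEEEC"  # toy prediction, not real data
print(helix_contacts(predicted_ss))
```

Requiring the whole window to be helical is a deliberately conservative choice; a real predictor would more likely weight contacts by per-residue confidence.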
Prediction in 3D
homology modeling
  given: a query sequence Q, a database of protein structures
  do: find protein P such that
    the structure of P is known
    P has high sequence similarity to Q
  return P's structure as an approximation to Q's structure
fold recognition
  given: a query sequence Q, a database of known folds
  do: find fold F such that Q can be aligned with F in a highly compatible manner
  return F as an approximation to Q's structure
ab initio prediction
  given: a query sequence Q (assuming no similar sequence or fold is known)
  do: return a predicted structure S for Q
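The homology modeling recipe above (find a protein P of known structure with high sequence similarity to Q) can be sketched as a template search. This is a simplified illustration: it assumes the sequences are already aligned to equal length, uses percent identity as the similarity measure, and uses a 30% cutoff as an illustrative threshold. The function names, toy sequences, and template names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical residues over aligned, non-gap positions ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(x == y for x, y in aligned)
    return 100.0 * matches / len(aligned)


def best_template(query, templates, min_identity=30.0):
    """Return (name, identity) of the most similar template, or None if below the cutoff."""
    best_name, best_pid = None, 0.0
    for name, seq in templates.items():
        pid = percent_identity(query, seq)
        if pid > best_pid:
            best_name, best_pid = name, pid
    if best_name is None or best_pid < min_identity:
        return None
    return best_name, best_pid


# Toy sequences standing in for proteins whose structures are known.
templates = {"templateA": "MKTAYLAKQR", "templateB": "GSSPLVVTNE"}
print(best_template("MKTAYIAKQR", templates))
```

A practical pipeline would first compute the alignment (for example with a dynamic programming aligner or a profile search) rather than assume one is given.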
Homology Modeling
Most pairs of proteins with similar structure are remote homologs (< 25% sequence identity).
Homology modeling usually doesn't work for remote homologs; most pairs of proteins with < 25% sequence identity are unrelated.
[Figure: sequence identity scale from 0% to 100% with ticks at 20% and 30%; regions labeled, in order, probably unrelated, remote homologs, homologs]
Threading
Form of fold recognition
prediction.ppt
From ai.stanford.edu/~serafim/CS262_2006/Slides/
Proteomics
Microarrays are useful primarily because mRNA concentrations serve as a surrogate for protein concentrations
We would like to measure protein concentrations directly, but at present cannot do so in the same high-throughput manner
Proteins do not have obvious direct complements
Could build molecules that bind, but binding is greatly affected by protein structure
Time-of-Flight Demonstration
[Figure sequence, frames 0-9. Labels: Sample, Sample Plate (+10 kV), Matrix Molecules, Protein Molecules, Positive Charge, Laser, Detector]
Lots of protons are kicked off matrix ions, giving rise to more positively charged molecules
The high positive potential under the sample plate causes positively charged molecules to accelerate towards the detector
Smaller-mass molecules hit the detector first, while heavier ones are detected later
The incident time is measured from when the laser is pulsed until the molecule hits the detector
[Figure: resulting spectrum, Intensity vs. M/Z]
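Why lighter ions arrive first can be made concrete with an idealized calculation. The sketch below assumes a simple linear instrument: an ion of charge z gains kinetic energy z·e·V from the accelerating potential and then drifts a fixed distance at constant speed, so flight time scales with the square root of m/z. The 10 kV potential comes from the demonstration; the 1 m drift length, the neglect of time spent accelerating, and the function name flight_time are assumptions for illustration.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mass_da, charge=1, voltage=10_000.0, drift_length=1.0):
    """Idealized drift time in seconds for an ion of given mass (Da) and charge state.

    Energy balance: z * e * V = 0.5 * m * v**2, then t = drift_length / v.
    """
    mass_kg = mass_da * DALTON
    speed = math.sqrt(2.0 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return drift_length / speed


for mass in (1000.0, 4000.0, 16000.0):
    print(f"{mass:8.0f} Da -> {flight_time(mass) * 1e6:.2f} microseconds")
```

Because the time depends only on m/z, a doubly charged ion of mass 2m arrives together with a singly charged ion of mass m, which is why the spectrum axis is M/Z rather than mass.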
Trypsin-Treated Spectra
[Figure: spectrum, Frequency vs. M/Z]
Challenges (Continued)
Better results if we partially digest proteins (break them into smaller peptides) first
Can be difficult to determine what proteins we have from the spectrum
Isotopic peaks: C13 and N15 atoms in varying numbers cause multiple peaks for a single peptide
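To illustrate the isotopic peaks point, here is a rough sketch that considers carbon only. If each carbon atom is independently C13 with probability about 1.07% (the approximate natural abundance), the number of C13 atoms in a peptide follows a binomial distribution, and each count k gives a peak shifted by roughly k mass units. Ignoring N15 and other isotopes is a simplifying assumption, and the function name is hypothetical.

```python
from math import comb


def isotope_peak_fractions(n_carbons, p_c13=0.0107, max_extra=4):
    """Fraction of molecules carrying k C13 atoms, for k = 0 .. max_extra (carbon only)."""
    return [
        comb(n_carbons, k) * p_c13 ** k * (1.0 - p_c13) ** (n_carbons - k)
        for k in range(max_extra + 1)
    ]


# A peptide with 50 carbon atoms (an arbitrary illustrative size).
for k, fraction in enumerate(isotope_peak_fractions(50)):
    print(f"+{k} peak: {fraction:.3f}")
```

For larger peptides the peak with zero C13 atoms is no longer the tallest, which is part of what makes interpreting spectra difficult.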
Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]
Ovarian cancer is difficult to detect early, often leading to poor prognosis
Trained and tested on mass spectra from blood serum
100 training cases, 50 with cancer
Held-out test set of 116 cases, 50 with cancer
100% sensitivity, 95% specificity (63/66) on held-out test set
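The reported figures follow directly from the test-set counts. The sketch below recomputes them, assuming that 100% sensitivity means all 50 cancer cases were flagged and that 63/66 means 63 of the 66 non-cancer cases were correctly called negative. The function names are illustrative.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that the classifier flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that the classifier calls negative."""
    return true_neg / (true_neg + false_pos)


# Held-out test set: 116 cases, 50 with cancer, 66 without.
print(f"sensitivity = {sensitivity(50, 0):.1%}")
print(f"specificity = {specificity(63, 3):.1%}")
```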
Not So Fast
Data mining methodology seems sound
But Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps data were preprocessed differently too
If we run cancer samples Monday and normals Wednesday, we could get differences from machine breakdown or nearby electrical equipment that's running on Monday but not Wednesday
Lesson: tell collaborators they must randomize samples for the entire processing phase, and of course all our preprocessing must be the same
Debate is still raging; results not replicated in trials
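One simple way to act on the randomization lesson is to shuffle all samples together before assigning them to processing runs, so that cancer and normal samples are mixed within each run instead of being separated by day. The sketch below is an assumption about how this might be done; the sample identifiers, batch names, and function name are hypothetical.

```python
import random


def assign_batches(sample_ids, n_batches, seed=0):
    """Shuffle samples, then deal them round-robin into n_batches processing runs."""
    rng = random.Random(seed)  # fixed seed so the run sheet is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return {f"batch_{b}": shuffled[b::n_batches] for b in range(n_batches)}


samples = [f"cancer_{i}" for i in range(5)] + [f"normal_{i}" for i in range(5)]
for batch, members in assign_batches(samples, n_batches=2).items():
    print(batch, members)
```

Plain shuffling does not guarantee equal class counts per run; a stratified scheme that shuffles cancer and normal samples separately and then interleaves them would give tighter balance.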
Each node represents a gene product (protein)
Blue edges show direct protein-protein interactions
Yellow edges show interactions in which one protein binds to DNA and affects the expression of another
Protein-Protein Interactions
Yeast 2-Hybrid
Immunoprecipitation
Antibodies (immuno) are made from combinations of certain proteins
Millions of antibodies can be made, to recognize a wide variety of different antigens (invaders), often by recognizing specific proteins
[Figure labels: antibody, protein]
Immunoprecipitation
[Figure label: antibody]
Co-Immunoprecipitation
[Figure label: antibody]
ChIP-Chip Data
Immunoprecipitation can also be done to identify proteins interacting with DNA rather than other proteins
Chromatin immunoprecipitation (ChIP): grab a sample of DNA bound to a particular protein (transcription factor)
ChIP-Chip: run this sample of DNA on a microarray to see which DNA was bound
Example of analysis of such new data: Keles et al., 2006
Metabolomics
Measures the concentration of each low-molecular-weight molecule in a sample
These typically are metabolites, or small molecules produced or consumed by reactions in biochemical pathways
These reactions are typically catalyzed by proteins (specifically, enzymes)
This data typically also comes from mass spectrometry, though it could also come from NMR
Lipomics
Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites
Could potentially help induce biochemical pathway information, or help with disease diagnosis or treatment choice
To Design a Drug:
Identify Target Protein
  Knowledge of proteome/genome
  Relevant biochemical pathways
Determine Target Site Structure
  Crystallography, NMR
  Difficult if Membrane-Bound
Synthesize a Molecule that Will Bind
  Imperfect modeling of structure
  Structures may change at binding
  And even then
Inactive
Active
PatientID   Date   Lab Test   Result
P1                            42
P1                            45
PatientID   Date Prescribed   Date Filled   Physician   Medication
P1          5/17/98           5/18/98       Jones       prilosec
Final Wrap-up
Molecular biology is collecting lots and lots of data in the post-genome era
Opportunity to connect molecular-level information to diseases and treatment
Need analysis tools to interpret
Data mining opportunities abound
Hopefully this tutorial provided a solid start toward applying data mining to high-throughput biological data