
Protein Modeling: Protein Structure Prediction Other Topics

This document provides a summary of a lecture on protein modeling. It discusses the levels of protein structure from primary to quaternary structure. It also describes common secondary structures like alpha helices and beta strands. While the structure of a protein is determined by its amino acid sequence, computational methods are needed to predict structure due to the large sequence-structure gap. Approaches to protein structure prediction include predictions in 1D, 2D and 3D using techniques like homology modeling, fold recognition, and ab initio prediction. Mass spectrometry is also discussed as a method for analyzing proteins on a large scale.

Uploaded by

uma-chen
Copyright
© Attribution Non-Commercial (BY-NC)

Lecture 6 Protein Modeling

June 7, 2007

Protein Structure Prediction Other topics

Protein Architecture
proteins are polymers consisting of amino acids linked by peptide bonds
each amino acid consists of
  a central carbon atom
  an amino group (NH2)
  a carboxyl group (COOH)
  a side chain
differences in side chains distinguish different amino acids

Peptide Bonds
Figure labels: amino group; side chain; carboxyl group; α-carbon (common reference point for coordinates of a structure)

Amino Acid Side Chains


side chains vary in: shape, size, charge, polarity

Levels of Description
protein structure is often described at four different scales
  primary structure
  secondary structure
  tertiary structure
  quaternary structure

Secondary Structure
secondary structure refers to certain common repeating structures
it is a local description of structure
two common secondary structures
  helices
  strands/sheets
a third category, called coil or loop, refers to everything else

Helices
Figure labels: α-carbon; individual amino acid; hydrogen bond

Sheets

Ribbon Diagram Showing Secondary Structures

The Protein Folding Problem


we know that the function of a protein is determined in large part by its 3D shape (fold, conformation)
can we predict the 3D shape of a protein given only its amino-acid sequence?
in general, NO: current methods cannot do this accurately
but the methods can often provide a partial description of the 3D structure, which is often helpful

Motivation
Want to identify the function of genes we find, and what different mutations/alleles do
One gene = one protein (sort of)
  Function of protein = function of gene
Function can be determined in many ways
  Gene expression, knockouts, etc.
But these take time, and are prone to mistakes
Goal: if we can determine the structure of every protein, learning their functions isn't too far away

Thornton et al 2000 (Nature)

Similar problems
Straight-up 3D prediction is hard (Nobel awaits)
Subproblem 1: Identify patterns in sequence
  Profile HMMs, multiple sequence alignments
Subproblem 2: Identify common motifs
  Various methods
Subproblem 3: Identify classes of proteins
  SCOP
Subproblem 4: Identify homologs
  BLAST

http://www.ludwig.edu.au/course/course2002/

What Determines Conformation?


in general, the amino-acid sequence of a protein determines the 3D shape of a protein [Anfinsen et al., 1950s]
but some exceptions
  all proteins can be denatured
  some proteins are inherently disordered (i.e. lack a regular structure)
  some proteins get folding help from chaperones
  there are various mechanisms through which the conformation of a protein can be changed in vivo
    post-translational modifications such as phosphorylation
    prions
    etc.

What Determines Conformation?


what physical properties of the protein determine its fold?
  rigidity of the protein backbone
  interactions among amino acids, including
    electrostatic interactions
    van der Waals forces
    volume constraints
    hydrogen bonds, disulfide bonds
  interactions of amino acids with water

Determining Protein Structures


protein structures can be determined experimentally (in many cases) by
  x-ray crystallography
  nuclear magnetic resonance (NMR)

DNA

Picture by Anthony North

Myoglobin

From www.inst.bnl.gov/GasDetectorLab/x-rays/SRI94.htm

Myoglobin

S.E.V. Phillips. "Structure and refinement of oxymyoglobin at 1.6 Å resolution." J. Mol. Biol. 1980, 142, 531.

NMR
Nuclear Magnetic Resonance Spectroscopy
Cannot handle proteins as large as X-ray crystallography can
Exploits the chemical environment to return distances between atoms
  Can use knowledge of restraints to identify positions of atoms that produce peaks

Protein structure determination in solution by NMR spectroscopy Wuthrich K. J Biol Chem. 1990 December 25;265(36):22059-62


Experimental Methods
Very expensive and time-consuming
Computational methods can help with time (Frank DiMaio)

Many protein structures still cannot be determined in this manner

More motivation
there is a large sequence-structure gap
  158K protein sequences in SwissProt database
  27K protein structures in PDB database
key question: can we predict structures by computational means instead?

Approaches to Protein Structure Prediction


prediction in 1D
  secondary structure
  solvent accessibility (which residues are exposed to water, which are buried)
  transmembrane helices (which residues span membranes)
prediction in 2D
  inter-residue/strand contacts
prediction in 3D
  homology modeling
  fold recognition (e.g. via threading)
  ab initio prediction (e.g. via molecular dynamics)

Prediction in 1D, 2D and 3D


predicted secondary structure and solvent accessibility

known secondary structure (E = beta strand) and solvent accessibility


Figure from B. Rost, Protein Structure in 1D, 2D, and 3D, The Encyclopaedia of Computational Chemistry, 1998

2D Prediction Approaches
use secondary structure predictions to predict short-range contacts (e.g. hydrogen bonds in helices)

use secondary structure predictions to predict strand alignments

use correlated mutations to predict contacts
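
One common way to quantify "correlated mutations" is the mutual information between two columns of a multiple sequence alignment; column pairs that co-vary strongly are candidate contacts. The sketch below assumes this particular measure (the slides do not say which one is meant), and the function name and toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    msa is a list of equal-length aligned sequences (assumed input format).
    """
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 1 always change together, column 2 never changes.
toy_msa = ["AKL", "AKL", "GEL", "GEL"]
mi_covarying = column_mutual_information(toy_msa, 0, 1)
mi_conserved = column_mutual_information(toy_msa, 0, 2)
```

In practice such scores are noisy for small alignments and are usually corrected for phylogenetic bias; this sketch omits any correction.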

Prediction in 3D
homology modeling
  given: a query sequence Q, a database of protein structures
  do:
    find protein P such that
      structure of P is known
      P has high sequence similarity to Q
    return P's structure as an approximation to Q's structure
fold recognition
  given: a query sequence Q, a database of known folds
  do:
    find fold F such that Q can be aligned with F in a highly compatible manner
    return F as an approximation to Q's structure
ab initio prediction
  given: a query sequence Q (assuming no similar sequence or fold is known)
  do: return a predicted structure S for Q

Homology Modeling
most pairs of proteins with similar structure are remote homologs (< 25% sequence identity)
homology modeling usually doesn't work for remote homologs; most pairs of proteins with < 25% sequence identity are unrelated

Figure: axis of pairwise sequence identity
  0% to 20%: probably unrelated
  20% to 30%: remote homologs
  30% to 100%: homologs
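
The template-selection step of homology modeling and the identity zones above can be sketched as follows. This is a minimal illustration, not a real pipeline: it assumes the query and each candidate are already aligned to the same length (a real system would run an alignment tool first), and the function names, toy sequences, and the use of the 20% and 30% axis ticks as hard cutoffs are assumptions.

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def identity_zone(pid):
    """Map percent identity to the zones on the slide's axis (cutoffs assumed)."""
    if pid >= 30:
        return "homologs"
    if pid >= 20:
        return "remote homologs"
    return "probably unrelated"


def best_template(query, templates):
    """Return (name, percent identity, zone) of the most similar known structure.

    templates maps a protein name to its aligned sequence.
    """
    name = max(templates, key=lambda n: percent_identity(query, templates[n]))
    pid = percent_identity(query, templates[name])
    return name, pid, identity_zone(pid)


# Hypothetical toy data: P1 differs from the query at one of ten positions.
query = "ACDEFGHIKL"
templates = {"P1": "ACDEFGHIKV", "P2": "AWWWWWWWWW"}
result = best_template(query, templates)
```

If the best hit lands in the "remote homologs" or "probably unrelated" zone, the slides suggest falling back to fold recognition or ab initio prediction.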

Threading
Form of fold recognition

prediction.ppt
From ai.stanford.edu/~serafim/CS262_2006/Slides/

Proteomics
Microarrays are useful primarily because mRNA concentrations serve as a surrogate for protein concentrations
Would like to measure protein concentrations directly, but at present cannot do so in the same high-throughput manner
Proteins do not have obvious direct complements
Could build molecules that bind, but binding is greatly affected by protein structure

Time-of-Flight (TOF) Mass Spectrometry (thanks Sean McIlwain)


Measures the time for an ionized particle, starting from the sample plate, to hit the detector
Figure labels: detector; laser; sample; +V

Time-of-Flight (TOF) Mass Spectrometry 2


Matrix-Assisted Laser Desorption-Ionization (MALDI)
Crystalloid structures made using proton-rich matrix molecule
Hitting crystalloid with laser causes molecules to ionize and fly towards detector
Figure labels: detector; laser; sample; +V

Time-of-Flight Demonstration 0
Figure labels: sample plate

Time-of-Flight Demonstration 1
Figure labels: matrix molecules

Time-of-Flight Demonstration 2
Figure labels: protein molecules

Time-of-Flight Demonstration 3
Figure labels: laser; detector; +10 kV; positive charge

Time-of-Flight Demonstration 4
Laser pulsed directly onto sample
Proton kicked off matrix molecule onto another molecule

Time-of-Flight Demonstration 5
Lots of protons kicked off matrix ions, giving rise to more positively charged molecules

Time-of-Flight Demonstration 6
The high positive potential under the sample plate causes positively charged molecules to accelerate towards the detector

Time-of-Flight Demonstration 7
Smaller-mass molecules hit the detector first, while heavier ones are detected later

Time-of-Flight Demonstration 8
The flight time is measured from when the laser is pulsed until the molecule hits the detector

Time-of-Flight Demonstration 9
Experiment repeated a number of times, counting frequencies of flight times
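
Why lighter ions arrive first can be seen from an idealized model: an ion of charge z accelerated through potential V gains kinetic energy zeV, so its speed is sqrt(2zeV/m) and its drift time grows with the square root of m/z. The sketch below assumes a field-free drift tube of 1 m (a hypothetical length, not from the slides) and ignores the time spent in the acceleration region; the 10 kV default matches the figure labels.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19   # coulombs
DALTON_KG = 1.66053906660e-27           # kilograms per dalton


def flight_time_s(mass_da, charge=1, voltage_v=10_000.0, drift_length_m=1.0):
    """Idealized time of flight in seconds.

    Assumes all kinetic energy comes from the accelerating potential
    (z * e * V = 0.5 * m * v**2) followed by a field-free drift.
    drift_length_m is an illustrative assumption.
    """
    mass_kg = mass_da * DALTON_KG
    velocity = math.sqrt(2.0 * charge * ELEMENTARY_CHARGE_C * voltage_v / mass_kg)
    return drift_length_m / velocity


light = flight_time_s(1_000.0)   # a 1 kDa singly charged ion
heavy = flight_time_s(4_000.0)   # a 4 kDa singly charged ion
```

Because the time scales with the square root of mass, quadrupling the mass doubles the flight time, which is how a measured time is converted back to M/Z.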

Example Spectra from a Competition by Lin et al. at Duke


These are different fractions from the same sample.

Figure axes: intensity vs. M/Z

Trypsin-Treated Spectra

Figure axes: frequency vs. M/Z

Many Challenges Raised by Mass Spectrometry Data


Noise: extra peaks from handling of sample, from machine and environment (electrical noise), etc.
M/Z values may not align exactly across spectra (resolution ~0.1%)
Intensities not calibrated across spectra: quantification is difficult
Cannot get all proteins: typically only several hundred. To improve odds of getting the ones we want, may fractionate our sample by 2D gel electrophoresis or liquid chromatography.

Challenges (Continued)
Better results if we partially digest proteins (break them into smaller peptides) first
Can be difficult to determine what proteins we have from the spectrum
Isotopic peaks: C13 and N15 atoms in varying numbers cause multiple peaks for a single peptide
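
The isotopic-peak effect can be illustrated with a simple binomial model for carbon alone: if each carbon atom is independently C13 with probability about 1.07% (the natural abundance), the chance that a peptide with n carbons carries k heavy carbons is binomial. This sketch ignores N15 and other isotopes, and the function name is hypothetical.

```python
from math import comb

C13_NATURAL_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13


def c13_peak_probabilities(n_carbons, max_extra=4, p=C13_NATURAL_ABUNDANCE):
    """Probability of exactly k C13 atoms, for k = 0..max_extra.

    Each entry corresponds to one isotopic peak, shifted by about
    k mass units from the all-C12 (monoisotopic) peak.
    """
    return [
        comb(n_carbons, k) * p**k * (1.0 - p) ** (n_carbons - k)
        for k in range(max_extra + 1)
    ]


small_peptide = c13_peak_probabilities(50)
large_peptide = c13_peak_probabilities(100)
```

For a peptide with about 50 carbons the monoisotopic peak is still the tallest, but with about 100 carbons the peak with one C13 atom overtakes it, which is one reason larger molecules are harder to identify from a spectrum.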

Handling Noise: Peak Picking


Want to pick peaks that stand out significantly from the noise signal
Want to use these as features in our learning algorithms.
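
One simple way to make "significant relative to the noise" concrete is to estimate the noise level robustly and keep only local maxima well above it. The sketch below uses the median and the median absolute deviation (MAD) with a threshold of k robust standard deviations; this particular rule, the value k = 5, and the toy spectrum are assumptions, not the method used in the lecture.

```python
from statistics import median


def pick_peaks(intensities, k=5.0):
    """Return indices of local maxima that exceed median + k * 1.4826 * MAD.

    1.4826 scales the MAD so it estimates a standard deviation
    under Gaussian noise.
    """
    baseline = median(intensities)
    mad = median(abs(y - baseline) for y in intensities)
    threshold = baseline + k * 1.4826 * mad

    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y > threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append(i)
    return peaks


# Hypothetical toy spectrum: low-level noise with two clear peaks.
spectrum = [3, 5, 4, 6, 5, 40, 4, 3, 5, 6, 4, 35, 5, 4, 3]
peak_indices = pick_peaks(spectrum)
```

The selected peak positions and heights can then serve as the feature vector for a learning algorithm. If the MAD is zero (a perfectly flat baseline), every local maximum above the median passes, so real data may need a minimum noise floor.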

Many Supervised Learning Tasks


Learn to predict proteins from spectra, when the organism's proteome is known
Learn to identify isotopic distributions
Learn to predict disease from either proteins, peaks or isotopic distributions as features
Construct pathway models

Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002]
Ovarian cancer is difficult to detect early, often leading to poor prognosis
Trained and tested on mass spectra from blood serum
100 training cases, 50 with cancer
Held-out test set of 116 cases, 50 with cancer
100% sensitivity, 95% specificity (63/66) on held-out test set
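
The reported figures follow directly from the held-out counts: 50 of 50 cancer cases detected, and 63 of the 66 non-cancer cases (116 minus 50) correctly called normal. A small sketch of the arithmetic, with hypothetical function names:

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were correctly cleared."""
    return true_neg / (true_neg + false_pos)


# Counts implied by the slide: 116 test cases, 50 with cancer, 63/66 normals correct.
cancer_cases = 50
normal_cases = 116 - cancer_cases
sens = sensitivity(true_pos=50, false_neg=0)
spec = specificity(true_neg=63, false_pos=normal_cases - 63)
```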

Not So Fast
Data mining methodology seems sound
But Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps data were preprocessed differently too
If we run cancer samples Monday and normals Wednesday, could get differences from machine breakdown or nearby electrical equipment that's running on Monday but not Wednesday
Lesson: tell collaborators they must randomize samples for the entire processing phase, and of course all our preprocessing must be the same
Debate is still raging; results not replicated in trials
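
Randomizing the processing order is simple to do in software before samples reach the instrument. The sketch below shuffles cancer and normal samples into a single run order so that day and machine effects are not confounded with the class label; the sample identifiers and the function name are hypothetical.

```python
import random


def randomized_run_order(sample_ids, seed=None):
    """Return the samples in a random processing order.

    Passing a seed makes the order reproducible, which helps when the
    run sheet has to be shared with collaborators.
    """
    rng = random.Random(seed)
    order = list(sample_ids)
    rng.shuffle(order)
    return order


# Hypothetical sample labels: three cancer and three normal serum samples.
samples = ["cancer_01", "cancer_02", "cancer_03", "normal_01", "normal_02", "normal_03"]
run_order = randomized_run_order(samples, seed=0)
```

The same idea applies to every later step (plate position, operator, preprocessing batch), not only the day of the run.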

Other Proteomics: Interactions

Figure from Ideker et al., Science 292(5518):929-934, 2001

each node represents a gene product (protein)
blue edges show direct protein-protein interactions
yellow edges show interactions in which one protein binds to DNA and affects the expression of another

Protein-Protein Interactions
Yeast 2-Hybrid
Immunoprecipitation
  Antibodies (immuno) are made by combinatorial assembly of certain proteins
  Millions of antibodies can be made, to recognize a wide variety of different antigens (invaders), often by recognizing specific proteins
Figure labels: antibody; protein

Protein-Protein Interactions

Immunoprecipitation
antibody

Co-Immunoprecipitation
antibody

Many Supervised Learning Tasks


Learn to predict protein-protein interactions: protein 3D structures may be critical
Use protein-protein interactions in construction of pathway models
Learn to predict protein function from interaction data

ChIP-Chip Data
Immunoprecipitation can also be done to identify proteins interacting with DNA rather than other proteins
Chromatin immunoprecipitation (ChIP): grab sample of DNA bound to a particular protein (transcription factor)
ChIP-Chip: run this sample of DNA on a microarray to see which DNA was bound
Example of analysis of such new data: Keles et al., 2006

Metabolomics
Measures concentration of each low-molecular-weight molecule in sample
These typically are metabolites, or small molecules produced or consumed by reactions in biochemical pathways
These reactions are typically catalyzed by proteins (specifically, enzymes)
These data typically also come from mass spectrometry, though NMR could also be used

Lipomics
Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites
Could potentially help induce biochemical pathway information, or help with disease diagnosis or treatment choice

To Design a Drug:
Identify Target Protein
  Knowledge of proteome/genome
  Relevant biochemical pathways
Determine Target Site Structure
  Crystallography, NMR
  Difficult if Membrane-Bound
Synthesize a Molecule that Will Bind
  Imperfect modeling of structure
  Structures may change at binding
  And even then...

Molecule Binds Target But May:


Bind too tightly or not tightly enough.
Be toxic.
Have other effects (side-effects) in the body.
Break down as soon as it gets into the body, or not leave the body soon enough.
Not get to where it should in the body (e.g., crossing blood-brain barrier).
Not diffuse from gut to bloodstream.

And Every Body is Different:


Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials).
A molecule may work for some people but not others.
A molecule may cause harmful side-effects in some people but not others.

Typical Practice when Target Structure is Unknown


High-Throughput Screening (HTS): Test many molecules (1,000,000) to find some that bind to target (ligands).
Infer (induce) shape of target site from 3D structural similarities.
Shared 3D substructure is called a pharmacophore.
Perfect example of a machine learning task with spatial target.

An Example of Structure Learning

Inactive

Active

Common Data Mining Approaches


Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)
Represent each low-energy conformer by feature vector and use multiple-instance learning (e.g., Jain et al., 1998)
Relational learning
  Inductive logic programming (e.g., Finn et al., 1998)
  Graph mining

Supervised Learning Task


Given: a set of molecules, each labeled by activity -- binding affinity for target protein -- and a set of low-energy conformers for each molecule
Do: Learn a model that accurately predicts activity (may be Boolean or real-valued)
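
The shape of this task can be sketched as a data structure plus a prediction rule. Under the usual multiple-instance assumption, a molecule is as active as its best-binding conformer, so a molecule-level prediction is the maximum of the conformer-level scores. Everything below is illustrative: the class, the linear scorer, the weights, and the feature vectors are hypothetical, and no learning step is shown.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Molecule:
    name: str
    conformers: List[Sequence[float]]  # one feature vector per low-energy conformer
    activity: float                    # label: measured binding affinity (or 0/1)


def conformer_score(weights, features):
    """Score a single conformer with a linear model (an assumed model family)."""
    return sum(w * x for w, x in zip(weights, features))


def predict_activity(weights, molecule):
    """Multiple-instance rule: the molecule's score is its best conformer's score."""
    return max(conformer_score(weights, c) for c in molecule.conformers)


# Hypothetical molecule with two conformers described by two features each.
example = Molecule(name="mol_A", conformers=[[0.0, 1.0], [2.0, 0.0]], activity=1.0)
predicted = predict_activity([1.0, 0.5], example)
```

A learner would adjust the weights so that predicted values match the activity labels across the training molecules, without ever being told which conformer is the one that binds.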

Clinical Databases of the Future (Dramatically Simplified)


PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  ...  SNP500K
P1         AA    AB    ...  BB
P2         AB    BB    ...  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months

Final Wrap-up
Molecular biology is collecting lots and lots of data in the post-genome era
Opportunity to connect molecular-level information to diseases and treatment
Need analysis tools to interpret
Data mining opportunities abound
Hopefully this tutorial provided a solid start toward applying data mining to high-throughput biological data
