Secondary Structure Prediction

The document discusses various computational techniques for predicting the secondary structure of proteins, focusing on methods like Chou-Fasman and GOR, which categorize amino acids based on their propensity to form helices or sheets. It also explores advanced approaches such as neural networks, Hidden Markov Models, and Profile HMMs, which utilize statistical and biophysical principles to enhance prediction accuracy. The accuracy of modern methods exceeds 70%, significantly improving upon earlier techniques.

NAME : MAHALAKSHMI S

REG NO : 810022509005
COURSE : M. Tech (BIOTECHNOLOGY)
SUB NAME : COMPUTATIONAL BIOLOGY
SUB CODE : BY4202
TITLE : SECONDARY STRUCTURE PREDICTION
Secondary Structure Prediction

When a novel protein is of interest and its structure is unknown, a solid method
for predicting its secondary (and eventually tertiary) structure is desired. A
variety of computational techniques are employed in making secondary structure
predictions for a particular protein sequence, and all work with the goal of
differentiating between helix (H), strand (S), and coil/loop (C) regions. Though these
three classifications are a simplification of the terminology used in presenting solved
protein structures, they provide enough information to characterize the general structure
of a protein. Sequence alignment is an important tool in secondary structure prediction,
as highly conserved regions of related sequences generally correspond to specific
secondary structure elements that are necessary for proper protein function. In this
light, it is important to consider that many methods which use sequence alignment as an
initial step in developing prediction models are observing trends in sequence
conservation (which residues can be conservatively substituted for each other, etc.).
Such intuition is a strong first step in developing a solid prediction model that fits
the test data used. Early secondary structure prediction methods (such as Chou-Fasman
and GOR, outlined below) had a 3-state accuracy of 50-60%. (They initially reported
higher accuracy, but this was found to be inflated once they were tested against
proteins outside the training set.) Today's methods have an accuracy of over 70%.

The Chou-Fasman method


If you were asked to determine whether an amino acid in a protein of interest is part
of an α-helix or a β-sheet, you might think to look in a protein database and see which
secondary structures amino acids in similar contexts belonged to. The Chou-Fasman
method (1978) is a combination of such statistics-based methods and rule-based
methods. Here are the steps of the Chou-Fasman algorithm:

1. Calculate propensities from a set of solved structures. For all 20 amino acids i,
calculate the propensity for each structure type (for example, helix) by:

   P(i, helix) = Pr[i | helix] / Pr[i]

That is, we determine the probability that amino acid i is in each structure, normalized
by the background probability that i occurs at all. For example, say there are 20,000
amino acids in the database, of which 2000 are serine, and there are 5000 amino acids in
helical conformation, of which 500 are serine. Then Pr[serine | helix] = 500/5000 = 0.1
and Pr[serine] = 2000/20000 = 0.1, so the helical propensity for serine is 1.0. The
propensities can equivalently be defined as:

   P(i, helix) = Pr[helix | i] / Pr[helix]

By Bayes' rule, these two formulations are equal.
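As a concrete illustration, the propensity calculation can be sketched in Python (a minimal sketch; the toy residue and structure strings below are hypothetical, not real database counts):

```python
from collections import Counter

def propensities(residues, structures, state="H"):
    """Chou-Fasman-style propensity: Pr[aa | state] / Pr[aa],
    equivalently Pr[state | aa] / Pr[state]."""
    total = len(residues)
    aa_counts = Counter(residues)
    in_state = [aa for aa, s in zip(residues, structures) if s == state]
    state_counts = Counter(in_state)
    n_state = len(in_state)
    return {aa: (state_counts[aa] / n_state) / (aa_counts[aa] / total)
            for aa in aa_counts}

# Toy database mirroring the serine example: serine makes up the same
# fraction of helical residues as of all residues, so its propensity is 1.0.
res = list("SSAASSAA")
struct = list("HHHHCCCC")
print(propensities(res, struct)["S"])  # 1.0
```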


2. Once the propensities are calculated, each amino acid is categorized using the
propensities as one of: helix-former, helix-breaker, or helix-indifferent. (That is,
helix-formers have high helical propensities, helix-breakers have low helical
propensities, and helix-indifferents have intermediate propensities.) Each amino
acid is also categorized as one of: sheet-former, sheet-breaker, or sheet-indifferent.
For example, it was found (as expected) that glycine and proline are helix-breakers.
3. When a sequence is input, find nucleation sites. These are short subsequences with
a high concentration of helix-formers (or sheet-formers). These sites are found with
some heuristic rule (e.g. “a sequence of 6 amino acids with at least 4 helix-formers,
and no helix-breakers”).
4. Extend the nucleation sites, adding residues at the ends, maintaining an average
propensity greater than some threshold.
5. Step 4 may create overlaps; finally, we deal with these overlaps using some
heuristic rules.

Figure: In the Chou-Fasman method, nucleation sites are found along the protein using a
heuristic rule, and then extended.
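Step 3's nucleation-site search can be sketched as follows; the "4 of 6 formers, no breakers" rule matches the example heuristic above, but the residue classes used here are illustrative stand-ins for Chou and Fasman's actual tables:

```python
def find_nucleation_sites(seq, formers, breakers, win=6, min_formers=4):
    """Return start indices of windows with at least min_formers
    helix-formers and no helix-breakers (heuristic nucleation rule)."""
    sites = []
    for i in range(len(seq) - win + 1):
        w = seq[i:i + win]
        if sum(c in formers for c in w) >= min_formers and \
           not any(c in breakers for c in w):
            sites.append(i)
    return sites

# Illustrative classes: treat E, A, L, M as helix-formers; P, G as breakers.
formers, breakers = set("EALM"), set("PG")
print(find_nucleation_sites("AELAMKPAEL", formers, breakers))  # [0]
```

Only the first window qualifies here; every later window contains the proline, so nucleation never starts there.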

The GOR method


To determine the structure at a given amino acid position j, the GOR method
(named for the authors Garnier, Osguthorpe, and Robson) looks at a window of 8 amino
acids before and 8 after the position of interest. Suppose a_j is the amino acid that
we are trying to categorize. GOR looks at the residues a_{j-8} a_{j-7} ... a_j ... a_{j+7} a_{j+8}.
Intuitively, it assigns a structure based on probabilities it has calculated from
protein databases. These probabilities are of the form

Pr[amino acid j is α | a_{j-8}, ..., a_j, ..., a_{j+8}]

Pr[amino acid j is β | a_{j-8}, ..., a_j, ..., a_{j+8}]

That is, each corresponds to the probability that an amino acid has a particular
structure, given the sequence around it. The GOR method thus looks at a window of 17
amino acids: the 8 residues before position j, the residue at j itself, and the 8
residues after.
There are far too many possible sequences of length 17 to make calculating the above
probabilities feasible. Instead, it is assumed that these probabilities can be estimated using
just pairwise probabilities. We omit details but the overall idea is similar to the log-odds
ratios we have studied before, except that pairwise dependencies are considered.
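A minimal sketch of the windowed scoring, assuming a trained table of per-offset log-odds values (the simple additive combination below stands in for GOR's pairwise treatment, whose details are omitted above; the table entries are hypothetical):

```python
def gor_score(seq, j, info, half=8):
    """Score residue j by summing per-position information terms over a
    window of `half` residues on each side, then pick the best state.
    info[state][(offset, aa)] holds log-odds values from training."""
    scores = {}
    for state, table in info.items():
        s = 0.0
        for off in range(-half, half + 1):
            k = j + off
            if 0 <= k < len(seq):  # clip the window at sequence ends
                s += table.get((off, seq[k]), 0.0)
        scores[state] = s
    return max(scores, key=scores.get)

# Tiny hypothetical table: alanine at the center votes for helix.
info = {"H": {(0, "A"): 1.0}, "E": {(0, "A"): -1.0}, "C": {}}
print(gor_score("GGAGG", 2, info))  # H
```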
There are two main schools of thought on secondary structure prediction: statistical
analysis, and reliance on biophysical principles to predict structure. These methods
are explored below.
Probability Based Methods

Single Amino Acid Based Methods


In single amino acid based methods, a 0th-order Markov model is used, where each amino
acid is looked at by itself to classify its position as part of a helix, strand, or coil.
Once these predictions are made, a "clean up" algorithmic step goes back over the
predicted sequence to unify the data into a single prediction. Here is a simple example
of how this may work (the sequences are illustrative). Suppose the secondary structure
is predicted as follows:

HHHHCHHHH

Then the prediction will be cleaned up to its most likely structural assignment, which
in this case would be entirely a helix:

HHHHHHHHH

This type of correction is needed to make the predictions logical. Proteins contain
conserved domains, with each domain consisting of a certain length of amino acids,
sometimes as short as three or four, with others extending to 30, 40, or more. It is
impossible to come up with a Markov transition matrix that will always correctly extend
domains in a 0th-order model, which makes taking a consensus over a region necessary to
make biologically relevant predictions, as shown in the above example. Models of this
type are limiting, which led to the development of the more involved models presented
next.
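The clean-up step can be sketched as a majority-vote smoothing pass (a minimal sketch; the window size of 5 is an illustrative choice):

```python
def clean_up(pred, win=5):
    """Relabel each position with the most common state in a small
    window around it -- a simple consensus 'clean up' pass."""
    out = []
    for i in range(len(pred)):
        w = pred[max(0, i - win // 2): i + win // 2 + 1]
        out.append(max(set(w), key=w.count))
    return "".join(out)

# The lone coil prediction inside a run of helix votes is overruled.
print(clean_up("HHHHCHHHH"))  # HHHHHHHHH
```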

Window Based Methods


Here, higher-order Markov models are used to calculate the probability of secondary
structure in an amino acid sequence. A window is used to examine amino acids upstream
and downstream of the current residue when calculating the probability of each secondary
structure type. For example, using a 5-tuple, the model looks at the two residues
upstream and the two residues downstream of the current residue, and a probability is
assigned to the region indicating whether it is a helix, strand, or coil. In this manner
less clean up is necessary, as regions are assigned a structure type based on the
highest-scoring probabilities for groups rather than for individual residues. The main
remaining problems are resolving the transitions between regions, which can become
slightly ambiguous in this method, and catching small loop regions, which may be missed,
especially if they are only four or five residues long and consist of very flexible
residues that could plausibly fit into a longer predicted element. For example
(illustrative), a true structure of

HHHHHCCHHHHH

might be predicted as

HHHHHHHHHHHH

because the short loop region is lost to too large a window size. This is the largest
problem with the window based method, and it can be improved by using multiple window
sizes in conjunction with each other and developing a heuristic to find a consensus
structural prediction for a given stretch of sequence. As with most structural
prediction methods, developing a biologically sound scoring system is crucial to
algorithmic success.
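A minimal sketch of 5-tuple window prediction, assuming a trained table mapping 5-tuples to state probabilities (the table entry here is hypothetical, and unseen tuples fall back to coil):

```python
def predict_window(seq, probs, default="C"):
    """Assign each residue the highest-probability state for its
    5-tuple (two residues either side); probs maps 5-tuples to
    state-probability dicts learned from training data."""
    pred = []
    for i in range(len(seq)):
        tup = seq[max(0, i - 2): i + 3]
        dist = probs.get(tup)
        pred.append(max(dist, key=dist.get) if dist else default)
    return "".join(pred)

probs = {"AELAM": {"H": 0.8, "E": 0.1, "C": 0.1}}
print(predict_window("KAELAMK", probs))  # CCCHCCC
```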

Biophysical Principles
Beyond simply looking at which residues follow which others in structures, it has also
become important to incorporate biophysical principles into structure prediction. From
electrostatic interactions to simple geometric observations (e.g., you simply can't
place multiple prolines in a row in a helix, as the backbone cannot turn fast enough),
there are many ways that algorithms can be improved by combining these principles with
simple sequence information.

Nearest Neighbor Methods


The nearest neighbor method looks at each n-tuple in a query and maps it to labelled
points in a given training data set. The idea behind the approach is to predict the
secondary structure of the center residue in an n-tuple window based on the known
secondary structure of homologous segments from proteins with already characterized
tertiary structure. This method was pioneered by Yi and Lander (1993) and Rost and
Sander (1994), and later improved by Salamov and Solovyev (1995) [6]. To deal with
ambiguity in prediction, the method looks at the nearest k neighbors from the training
data and takes the consensus of the k neighbors found. Information from a variety of
window sizes, usually in the 4- to 6-tuple range, is combined to arrive at a more
selective and sensitive prediction. The important trick in determining the accuracy of
these methods is how the scoring system is implemented.

As mentioned before, one way to improve the development of scoring matrices is to use
multiple sequence alignment data in training rather than just single-sequence data, as
demonstrated by Salamov and Solovyev (1995), who improved the previous accuracy of 68
percent to 72 percent.
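The k-nearest-neighbor consensus can be sketched as follows; simple identity similarity stands in for the substitution-matrix scoring a real implementation would use, and the training segments are hypothetical:

```python
from collections import Counter

def knn_predict(window, training, k=3):
    """Predict the center residue's state as the consensus of the k
    training segments most similar to `window`."""
    def similarity(a, b):
        return sum(x == y for x, y in zip(a, b))  # identity matches
    ranked = sorted(training, key=lambda t: similarity(window, t[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical segments labelled with the center residue's known state.
train = [("AELAM", "H"), ("AELAK", "H"), ("GPSGP", "C"), ("VIVIV", "E")]
print(knn_predict("AELAV", train, k=3))  # H
```

The two helix-labelled segments dominate the 3-nearest set, so the helix consensus wins even though one strand segment sneaks in.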
Neural Networks
A newer method for predicting secondary structure is the use of neural networks. These
networks consist of multiple layers composed of nodes. Each node has some number of
inputs, anywhere from 1 to n, and one output. The majority of neural networks used in
biological sequence analysis, for DNA and amino acid sequences alike, consist of an
input layer, a middle "hidden" layer, and an output layer. This architecture is
depicted in the figures below.

Fig: A single node

Fig: A neural network


As you can see, many nodes are combined into a network, with each edge leading to or
from a node given a certain statistical weight for interpretation by its destination
node. These weights are determined from training data fed through the neural network
algorithm, consisting of sets of protein sequences whose structures are already known.
Each node, no matter which layer it is in, returns an output value. Depending on the
weight of each edge, these values will be interpreted differently. A node's output is
calculated from its inputs x_i and edge weights w_i by an equation of the form

y = f(sum_i w_i * x_i)

where f is a threshold or sigmoid activation function.

Perceptrons, also called threshold units, are a simple method for classifying
input vectors, or examples, into one of two categories. They function similarly
to one-layer neural networks; in fact, we will see that full neural networks are
essentially built from many threshold units.
If the weights are not known in advance, the perceptron must be trained. For
that we need a training set: a set S of input vectors for which we know the
desired (target) answer. Ideally, the goal of training is to find a set of weights
such that the perceptron returns the correct answer for all of the training
examples, with the hope that such a perceptron will have good performance on
examples it has never seen. The training set should contain both positive and
negative examples. For example, if we were to build a perceptron to recognize
α-helices, then we should have sequences that are part of α-helices (positive
examples), as well as sequences that are not (negative examples).

Hidden Markov Models


This is another newer method for secondary structure prediction, implemented in a
fashion similar to gene prediction with Hidden Markov Models. The intuition is to use
multiple sequence alignment data from sequences whose secondary structures are already
known to develop an HMM, with probabilities determined from the training alignment. An
unknown sequence can then be fed through the HMM to get a prediction of its secondary
structure.

HMMs can also be used in a more complex application in secondary structure prediction,
similar to the idea of threading. To predict the secondary structure state for each
position in a sequence, the state-specific probability distribution is multiplied by
the position-specific state probability over all possible states. The predicted
secondary structure (helix, strand, or coil) is the one with the highest value in the
resulting distribution. The model works by analyzing the input sequence profile from a
sequence perspective (much like the Markov models discussed above) while also looking
at the probabilities of sequences fitting particular structural motifs (a simplified
version of threading). In doing so, a more accurate and sensitive model is created for
predicting the correct state of the sequence. This calculation uses a two-tiered
voting scheme.
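The two-tiered combination described above can be sketched roughly as follows, with hypothetical emission and profile probabilities (a simplification of the actual HMM computation, not a full decoding algorithm):

```python
def predict_states(profile, emit):
    """For each position, combine the position-specific state
    probabilities (from the profile) with the state-specific residue
    distributions, then pick the highest-scoring state."""
    pred = []
    for aa, state_probs in profile:
        scores = {s: p * emit[s].get(aa, 0.0)
                  for s, p in state_probs.items()}
        pred.append(max(scores, key=scores.get))
    return pred

# Hypothetical numbers: state-specific residue distributions (tier one)
# and position-specific state probabilities (tier two).
emit = {"H": {"A": 0.6, "G": 0.1}, "C": {"A": 0.2, "G": 0.5}}
profile = [("A", {"H": 0.7, "C": 0.3}), ("G", {"H": 0.4, "C": 0.6})]
print(predict_states(profile, emit))  # ['H', 'C']
```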
Profile HMM
Using consensus secondary structure information taken from a particular family of
proteins (tyrosine kinases, for example), a Profile HMM can be created which contains
probabilities for secondary structure based on the consensus sequence data (see Lecture
15 notes for a further explanation of profile HMMs). In this manner, new proteins
thought to belong to a particular family via primary sequence alignment (easily and
accurately calculated using blastp or a similar tool) can be run through a Profile HMM,
which will then assign a set of predicted secondary structure elements to the primary
sequence based on the consensus of its family members. This provides a very accurate
prediction of secondary structure, but its usefulness is limited to analyzing sequences
of proteins that can be characterized as part of a well understood and documented
protein family. The CATH system is used to measure the accuracy of secondary structure
prediction (see Lecture 19 notes for a detailed discussion of CATH). Secondary
structure of novel proteins cannot be classified in this manner, and running one
through a Profile HMM of an unrelated family is likely to result in a highly incorrect
secondary structure prediction.
References

[1] Bystroff C, Thorsson V, and Baker D. (2000) HMMSTR: a hidden Markov model
for local sequence-structure correlations in proteins. Journal of Molecular Biology,
301, 173-190.
[2] Durbin R, Eddy S, Krogh A, and Mitchison G. Biological Sequence Analysis. United
Kingdom, Cambridge University Press, 2002.
See Chapter 6 for information on multiple sequence alignments, and Chapter 7 for
information on dealing with complex phylogenetic relationships. A review of the
Feng-Doolittle alignment can also be found on pages 145-146.
[3] Jones DT. (1999) Protein Secondary Structure Prediction Based on Position-specific
Scoring Matrices. Journal of Molecular Biology, 292, 195-202.
[4] Koehl P. (2001) Protein structure similarities. Current Opinion in Structural
Biology, 11, 348-353.
[5] Presnell SR, Cohen BI, and Cohen FE. (1992) A Segment-based Approach to Protein
Secondary Structure Prediction. Biochemistry, 31, 983-993.
[6] Salamov AA, Solovyev VV. (1995) Prediction of Protein Secondary Structure by
Combining Nearest-neighbor Algorithms and Multiple Sequence Alignments. J.
Mol. Biol., 247, 11-15.
[7] Sternberg M. Protein Structure Prediction. IRL Press, 1996.
