Secondary Structure Prediction
REG NO : 810022509005
COURSE : M. Tech (BIOTECHNOLOGY)
SUB NAME : COMPUTATIONAL BIOLOGY
SUB CODE : BY4202
TITLE : SECONDARY STRUCTURE PREDICTION
Secondary Structure Prediction
When a novel protein is the topic of interest and its structure is unknown, a solid method
for predicting its secondary (and eventually tertiary) structure is desired. A variety of
computational techniques are employed in making secondary structure predictions for a
particular protein sequence, and all work with the goal of differentiating between
helix (H), strand (S), and coil/loop (C) regions. Though these three classifications are a
simplification of the terminology used in presenting solved protein structures, they provide
enough information to characterize the general structure of a protein. Sequence alignment
is an important tool in secondary structure prediction, as highly conserved regions of
related sequences generally correlate with specific secondary structure elements that are
necessary for proper protein function. In this light, it is important to consider that many
methods which use sequence alignment as an initial step in developing models for prediction
are observing trends in sequence conservation (which residues can be conservatively
substituted for each other, etc.). Such intuition is a strong first step in developing a solid
prediction model that fits the utilized test data. Early secondary structure prediction
methods (such as Chou-Fasman and GOR, outlined below) had a 3-state accuracy of
50-60%. (They initially reported higher accuracy, but this was found to be inflated once
they were tested against proteins outside of the training set.) Today's methods have an
accuracy of > 70%.
1. Calculate propensities from a set of solved structures. For each of the 20 amino acids i
and each structure type s, calculate the propensity

P(i, s) = (n_{i,s} / n_s) / (n_i / N)

where n_{i,s} is the number of residues of type i found in structure s, n_s is the total
number of residues in structure s, n_i is the total number of residues of type i, and N is
the total number of residues. That is, we determine the probability that amino acid i is in
each structure, normalized by the background probability that i occurs at all. For example,
suppose there are 20,000 amino acids in the database, of which 2,000 are serine, and there
are 5,000 amino acids in helical conformation, of which 500 are serine. Then the helical
propensity for serine is (500/5000)/(2000/20000) = 0.1/0.1 = 1. A propensity above 1
indicates that an amino acid favors a structure; a propensity below 1 indicates that it
disfavors it.
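The propensity calculation above can be sketched directly from the counts. This is a minimal illustration; the function and variable names are my own, not from any standard library:

```python
# Sketch of the propensity calculation described above.
def propensity(n_res_in_struct, n_struct, n_res, n_total):
    """(Fraction of the structure made of this residue) divided by
    (background fraction of this residue in the whole database)."""
    return (n_res_in_struct / n_struct) / (n_res / n_total)

# Serine example from the text: 20,000 residues total, 2,000 serines,
# 5,000 helical residues, 500 of which are serine.
p_helix_ser = propensity(500, 5000, 2000, 20000)
print(p_helix_ser)  # (500/5000)/(2000/20000) = 0.1/0.1 = 1.0
```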
Figure: In the Chou-Fasman method, nucleation sites are found along the protein using a
heuristic rule, and then extended.
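The nucleation step in the figure can be sketched as a sliding-window scan: a 6-residue window in which at least 4 residues are helix formers (propensity > 1) marks a nucleation site, which is then extended. The sketch below covers only the nucleation scan, with a tiny made-up propensity table standing in for the full Chou-Fasman values:

```python
# Illustrative helical propensities for a few residues (made-up
# stand-ins, not the published Chou-Fasman table).
P_HELIX = {"A": 1.42, "E": 1.51, "L": 1.21, "G": 0.57, "P": 0.57, "S": 0.77}

def find_helix_nuclei(seq, win=6, min_formers=4):
    """Return start indices of windows with enough helix formers.
    Extension of each nucleus outward along the sequence is omitted."""
    nuclei = []
    for i in range(len(seq) - win + 1):
        formers = sum(P_HELIX.get(aa, 1.0) > 1.0 for aa in seq[i:i + win])
        if formers >= min_formers:
            nuclei.append(i)
    return nuclei

print(find_helix_nuclei("AELAELGPGPSS"))  # nuclei in the helix-rich prefix
```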
The GOR method estimates the probability that an amino acid has a particular structure,
given the sequence around it. It looks at a window of 17 amino acids: the central residue
at position j together with the 8 residues on either side. That is, it estimates

P(S_j | a_{j-8}, ..., a_j, ..., a_{j+8})

where S_j is the structural state at position j and a_k denotes the amino acid at
position k.
There are far too many possible sequences of length 17 to make calculating the above
probabilities feasible. Instead, it is assumed that these probabilities can be estimated using
just pairwise probabilities. We omit the details, but the overall idea is similar to the
log-odds ratios we have studied before, except that pairwise dependencies are considered.
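As a rough sketch of this windowed scoring idea, simplified to single-residue information terms (as in the original GOR method; the pairwise version adds analogous terms for pairs of window positions), with made-up log-odds values:

```python
import math

# `info` maps (offset from center, amino acid) -> log-odds contribution
# toward helix; the values below are illustrative, not trained ones.
HALF_WIN = 8  # window of 17 = center residue + 8 on each side

def gor_score(seq, j, info):
    """Sum the log-odds contributions of residues in the window at j."""
    score = 0.0
    for d in range(-HALF_WIN, HALF_WIN + 1):
        k = j + d
        if 0 <= k < len(seq):
            score += info.get((d, seq[k]), 0.0)
    return score

info = {(0, "A"): math.log(1.4), (0, "G"): math.log(0.6),
        (-1, "E"): math.log(1.5)}
print(gor_score("EAG", 1, info))  # contributions from E at -1 and A at 0
```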
There are two main schools of thought on secondary structure prediction: statistical
analysis, and reliance on biophysical principles to predict structure from sequence. These
methods are explored below.
Probability Based Methods
A predicted state sequence is then cleaned up to its most likely structural assignment;
for example, a helical run interrupted by a single spurious coil state would be corrected
to an unbroken helix.
This type of correction is needed to make the predictions logical. Proteins contain
conserved domains, with each domain spanning a certain number of amino acids, sometimes as
few as three or four, with others extending to 30, 40, or more. It is impossible to come up
with a Markov transition matrix that will always extend domains correctly in a model of
this order, which makes taking a consensus over a region necessary for biologically
relevant predictions, as in the example above. Models of this type are limiting, which led
to the development of the more involved models presented below.
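The consensus clean-up described above can be approximated with a simple majority filter over the predicted state string; the window size of 5 is an assumption for illustration, not a value from the text:

```python
from collections import Counter

def smooth(states, win=5):
    """Reassign each position to the majority state in a small window,
    absorbing isolated states that break up a structural run."""
    half = win // 2
    out = []
    for i in range(len(states)):
        window = states[max(0, i - half): i + half + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return "".join(out)

print(smooth("HHHHCHHHH"))  # the isolated coil state is absorbed
```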
With too large a window size, however, a short loop region is lost. This is the largest
problem with the window-based method; it can be improved by using multiple window sizes in
conjunction with each other and developing a heuristic to find a consensus structural
prediction for a given stretch of sequence. As with most structure prediction methods,
developing a biologically sound scoring system is crucial to algorithmic success.
Biophysical Principles
Beyond simply looking at which residues follow which others in structures, it has also
become important to examine biophysical principles in structure prediction. From
electrostatic interactions to simple geometric observations (e.g., you simply can't place
multiple prolines in a row in a helix, as the chain cannot turn tightly enough), there are
many ways that algorithms can be improved by using these principles in addition to simple
sequence information.
As mentioned before, one way to improve the development of scoring matrices is to use multiple
sequence alignment data in training rather than just single sequence data, demonstrated by Salamov and
Solovyev (1995), who improved previous accuracy of 68 percent to 72 percent.
Neural Networks
A newer approach to predicting secondary structure is the use of neural networks. These
networks consist of multiple layers composed of nodes. Each node has a variable number of
inputs, anywhere from 1 to n, and one output. The majority of neural networks used in
biological sequence analysis, for both DNA and amino acid sequences, consist of an input
layer, a middle "hidden" layer, and an output layer. This neural network architecture is
depicted in the figure below.
Figure: A three-layer neural network; each node i combines its weighted inputs into a
single output.
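A minimal sketch of the forward pass through such a three-layer network, with arbitrary illustrative weights and sigmoid activations (the layer sizes and values here are assumptions for illustration only):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Input layer -> one hidden layer -> a single output node."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)))

x = [1.0, 0.0, 1.0]  # e.g. a simple numeric encoding of a residue window
w_hidden = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
w_out = [1.0, -1.0]
print(forward(x, w_hidden, w_out))  # a value in (0, 1)
```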
Perceptrons, also called threshold units, are a simple method for classifying
input vectors, or examples, into one of two categories. They function similarly
to one-layer neural networks; in fact, we will see that full neural networks are
essentially built from many threshold units.
If the weights are not known in advance, the perceptron must be trained. For
that we need a training set: a set S of input vectors for which we know the
desired (target) answer. Ideally, the goal of training is to find a set of weights
such that the perceptron returns the correct answer for all of the training
examples, with the hope that such a perceptron will have good performance on
examples it has never seen. The training set should contain both positive and
negative examples. For example, if we were to build a perceptron to recognize
α-helices, then we should have sequences that are part of α-helices (positive
examples), as well as sequences that are not (negative examples).
[1] Bystroff C, Thorsson V, and Baker D. (2000) HMMSTR: a Hidden Markov Model for
Local Sequence-Structure Correlations in Proteins. Journal of Molecular Biology,
301, 173-190.
[2] Durbin R, Eddy S, Krogh A, and Mitchison G. (2002) Biological Sequence Analysis.
Cambridge, United Kingdom: Cambridge University Press.
See Chapter 6 for information on multiple sequence alignments, and Chapter 7 for
information on dealing with complex phylogenetic relationships. A review of the
Feng-Doolittle alignment can also be found on pages 145-146.
[3] Jones DT. (1999) Protein Secondary Structure Prediction Based on Position-specific
Scoring Matrices. Journal of Molecular Biology, 292, 195-202.
[4] Koehl P. (2001) Protein Structure Similarities. Current Opinion in Structural
Biology, 11, 348-353.
[5] Presnell SR, Cohen BI, and Cohen FE. (1992) A Segment-based Approach to Protein
Secondary Structure Prediction. Biochemistry, 31, 983-993.
[6] Salamov AA, Solovyev VV. (1995) Prediction of Protein Secondary Structure by
Combining Nearest-neighbor Algorithms and Multiple Sequence Alignments. Journal of
Molecular Biology, 247, 11-15.
[7] Sternberg M. (1996) Protein Structure Prediction. IRL Press.