Fundamentals of Bioinformatics

Lecture 5
Dr. Marwa N.M.E. Sanad



Determining homology

• Alignment
• Ancestor
• Identity
• Similarity
• Homology
• Analogous
• Ortholog
• Paralog
Homology (Common ancestor)

http://evolution.berkeley.edu/evolibrary/article/0_0_0/similarity_ms_06
Homology (Common ancestor)

http://www.ncbi.nlm.nih.gov/books/NBK62051/
Analogy (Convergent evolution)

Fish Mammals
Sequence Alignment

Pairwise alignment (2 sequences)
• Dot plot: for DNA/protein sequences of less than 500 residues
• Sliding alignment
   o Global: good for large or short sequences
   o Local: good for short sequences

Multiple alignment (more than 2 sequences)
• Sliding alignment
   o Local: good for large or short sequences

Dotlet
Sequence Alignment

Global alignment
• Pairwise alignment
• Sliding alignment

Local alignment
• Pairwise alignment
• Multiple alignment
• Smith-Waterman algorithm
• Sliding alignment

cgggta--tccaa
ccc-taggtccca

The run "--" in the first sequence is a gap; the single "-" in the second sequence is an indel.

Indel: Could be insertion or deletion


Gap: A sequence of consecutive indels

A scoring scheme:
Used to discriminate between good and bad alignments.
Score of alignment = Σ (identities, mismatches) - Σ (gap penalties)
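
The scoring formula above can be sketched in a few lines of Python. This is only an illustration: the match, mismatch and gap values are assumed for the example and are not taken from the lecture, and the example alignment is the one from the gap/indel slide read as cgggta--tccaa over ccc-taggtccca.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap_penalty=2):
    """Score two already-aligned sequences of equal length ('-' marks an indel).

    The match/mismatch/gap values are illustrative assumptions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    substitution_total = 0  # sum over identities and mismatches
    gap_total = 0           # sum of gap penalties
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gap_total += gap_penalty      # one penalty per indel position
        elif a == b:
            substitution_total += match   # identity
        else:
            substitution_total += mismatch
    return substitution_total - gap_total


# 7 identities, 3 mismatches and 3 indel positions: 7 - 3 - 6 = -2
print(alignment_score("cgggta--tccaa", "ccc-taggtccca"))
```

A linear penalty per indel position is used here for simplicity; BLAST itself uses separate costs for opening and extending a gap.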
Substitution Matrices

Mismatches

NUCLEOTIDES AMINO ACIDS


Sequence Alignment
Substitution Matrices

• Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution
• Constructed by measuring the relative frequency of amino acid changes in a set of homologous protein sequences
The substitution matrices
• PAM (Percent Accepted Mutation)
• BLOSUM (BLOcks SUbstitution Matrix)

PAM matrix      0     30    80    110   200   250
% identity      100   75    50    40    25    20

BLOSUM matrix   80    62    30
% identity      80    62    30
Determining the substitution matrices

            Closely related sequences    Distantly related sequences
PAM         Lower PAM                    Higher PAM
BLOSUM      Higher BLOSUM                Lower BLOSUM

http://www.nature.com/nmeth/journal/v7/n3s/fig_tab/nmeth.1434_F2.html
Function Prediction

• Aligned sequences with identity below 25% (aa) or 70% (nt) are considered to be in the twilight region, where homology cannot be claimed
• Identify conserved domains/elements in sequences
• Compare regions of similarity among multiple
organisms.
• Identify Low complexity regions.
• Predict structural/functional relationships
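
The twilight-region rule in the first bullet can be written as a small check. This is a rough sketch: the thresholds are the 25% (aa) and 70% (nt) figures from the slide, while the function names and the way percent identity is computed (identical columns divided by alignment length) are assumptions made for the example.

```python
# Thresholds taken from the slide: below these, homology cannot be claimed.
TWILIGHT_THRESHOLDS = {"aa": 25.0, "nt": 70.0}


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns that hold the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("need two aligned sequences of equal, non-zero length")
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * identical / len(aligned_a)


def in_twilight_region(identity, sequence_type):
    """True when the identity falls below the slide's threshold for 'aa' or 'nt'."""
    return identity < TWILIGHT_THRESHOLDS[sequence_type]


identity = percent_identity("cgggta--tccaa", "ccc-taggtccca")
print(round(identity, 1), in_twilight_region(identity, "nt"))
```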
Substitution Matrices

• Substitutions hardly alter protein function/structure
• Substitutions in evolution can be predicted
through constructed matrices
Learning Outcomes

1. Understanding which BLAST program is appropriate
2. Understanding the most important parameters that might change your alignment or cause errors in it
3. Understanding how to interpret the E-value and your output data
Basic Local Alignment Search Tool (BLAST)

• It is an algorithm for comparing primary biological sequence information.
• It is a heuristic approach to local sequence alignment that searches for high-scoring segment pairs (HSPs).
• It finds other sequences that are similar to the query nucleotide or amino acid sequence.
• It is most commonly run from the NCBI website.
High Scoring Segment Pair (HSP)

http://en.wikipedia.org/wiki/BLAST


Why use an alignment scoring function?

• To distinguish bad alignments from good alignments.
• To choose the alignment that has the maximum score.


Five components of a BLAST search
(1) Select the BLAST program
(2) Retrieve, then upload your sequence (query)
➢ Example: enter the accession number JQ680980, paste the sequence in FASTA format, or upload a sequence file
(3) Choose the search set
➢ Choose or exclude the [database, organism]
(4) Choose the program selection
➢ For nucleotide [megablast, discontiguous megablast, blastn]
➢ For protein [blastp, PSI-BLAST, PHI-BLAST, DELTA-BLAST]
(5) Choose optional parameters
Step1: BLAST Programs
• blastp:
compares an amino acid query sequence against a protein sequence
database.
• blastn
compares a nucleotide query sequence against a nucleotide sequence
database.
• blastx
compares the six-frame conceptual translation products of a nucleotide
query sequence against a protein sequence database
• tblastn
compares a protein query sequence against a nucleotide sequence
database dynamically translated in all six reading frames (both strands).
• tblastx
compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database.
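
To make the "six-frame conceptual translation" used by blastx, tblastn and tblastx concrete, here is a minimal sketch using the standard genetic code. It is an illustration only, not how BLAST is implemented; the function names are made up for this example, and incomplete or ambiguous codons are simply rendered as X.

```python
# Standard genetic code, written in the usual T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate three frames on the given strand and three on the reverse strand."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCATTGTAATGGGCCGCTGA").items():
    print(frame, protein)
```

Because tblastx translates both the query and every database sequence in this way, it compares 6 x 6 = 36 frame combinations, which is why it is the most expensive of the five programs.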
Step1: BLAST Programs (query type, database type, number of reading-frame comparisons)

Program    Query      Database   Frame comparisons
blastn     DNA        DNA        1
blastp     protein    protein    1
blastx     DNA        protein    6
tblastn    protein    DNA        6
tblastx    DNA        DNA        36


Step1: Guidance for BLAST Programs

• Which BLAST to use:

o BLASTn: Interested in non-coding DNA

o tBLASTx: Discover new proteins

o BLASTx: Discover proteins encoded in my DNA sequence

o BLASTx: Unsure of DNA quality
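
The program choice can be summarised as a lookup on the query type, the database type, and whether nucleotide sequences are translated before comparison. The sketch below only encodes the five programs described in this lecture; the function name and the string labels are hypothetical.

```python
# (query type, database type, translated search?) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_blast_program(query_type, database_type, translated=False):
    """Return the BLAST program for a query/database combination, or raise ValueError."""
    key = (query_type, database_type, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return BLAST_PROGRAMS[key]


# A DNA query searched against a protein database needs a translated search.
print(choose_blast_program("nucleotide", "protein", translated=True))
```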



Step2: Upload your sequence

1. Run BLAST directly from the NCBI page
2. Enter the accession number of your sequence
3. Save the sequence in FASTA format to a file, then browse to the file
4. Copy and paste the sequence
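
Before uploading or pasting a sequence it can help to check that the FASTA text is what you expect. The following is a minimal sketch of a FASTA reader; the record name and sequence in the example are invented, and real files may need stricter validation.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record, used only to show the format.
example = """>example_query hypothetical record
ATGGCCATTGTAATG
GGCCGCTGA
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```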



Step3: Search Set
a. Choose the database
Default database choice: nucleotide collection (nr/nt)
nr = non-redundant = the most general database: "GenBank, EMBL, DDBJ, PDB and RefSeq, excluding EST, STS, GSS, WGS and TSA"

b. Choose the organism
Include or exclude organisms in your search set
Step4: program selection

Nucleotide
• Megablast: highly similar sequences
• Discontiguous megablast: more dissimilar sequences
• Blastn: somewhat similar sequences

Peptide
• Blastp: protein-protein BLAST
• PSI-BLAST: Position-Specific Iterated BLAST
• PHI-BLAST: Pattern-Hit Initiated BLAST
• DELTA-BLAST: Domain Enhanced Lookup Time Accelerated BLAST
Step 5: The algorithm parameters
[a] General properties: Expect threshold, word size
1- Expect (E) value:
Controls the expected number of chance matches that are reported.

2- Expect threshold:
Lower Expect thresholds are more stringent, leading to fewer chance matches being reported.


[a] General properties (continued)
3- Word size (k-letter word):
The default is 11 for nucleotides and 3 for proteins, but a different word size can be chosen from the menu. A smaller word size makes the search more sensitive but slower; a larger word size makes it faster.

4- Max matches in a query range:
Limits the number of matches to a query range. This option is useful if many strong matches to one part of a query may prevent BLAST from presenting weaker matches to another part of the query.
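
The effect of the word size can be seen by listing the overlapping k-letter words of a query, which is roughly what BLAST looks up in the database before extending a hit. This is a simplified sketch (the function name is made up, and the real seeding step also uses score thresholds for proteins).

```python
def query_words(sequence, word_size):
    """Return every overlapping word of length word_size in the sequence."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


query = "ACGTACGGTCATGCA"
for word_size in (11, 7, 3):
    words = query_words(query, word_size)
    # Smaller words give more seeds, so more candidate hits and a slower search.
    print(word_size, len(words))
```

A query shorter than the word size produces no words at all, which is one reason very short queries can return no hits with the default settings.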

Step 5: The optional parameters (blastn)

Step 5: The optional parameters (blastp)


Accepted Parameters for Other Advanced Field
The default parameters are not always the right parameters for your sequence.

Flag   Parameter [type]                             Default
G      Cost to open a gap [Integer]                 5 for nucleotides / 11 for proteins
E      Cost to extend a gap [Integer]               2 for nucleotides / 1 for proteins
q      Penalty for nucleotide mismatch [Integer]    -3
r      Reward for nucleotide match [Integer]        1
e      Expect value [Real]                          10
W      Word size [Integer]                          11 for nucleotides / 28 for megablast / 3 for proteins
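
For searches run locally instead of on the web page, the same parameters can be passed on the command line. The sketch below assumes the NCBI BLAST+ `blastn` program and its long option names (`-evalue`, `-word_size`, `-gapopen`, `-gapextend`, `-penalty`, `-reward`), which correspond to the single-letter fields in the table; the query file and database names are placeholders. It only builds the argument list and does not run anything.

```python
def blastn_command(query_path, database, evalue=10, word_size=11,
                   gap_open=5, gap_extend=2, penalty=-3, reward=1):
    """Build a BLAST+ blastn argument list using the nucleotide defaults from the table."""
    return [
        "blastn",
        "-task", "blastn",           # plain blastn rather than megablast
        "-query", query_path,
        "-db", database,
        "-evalue", str(evalue),      # e: expect value
        "-word_size", str(word_size),  # W: word size
        "-gapopen", str(gap_open),   # G: cost to open a gap
        "-gapextend", str(gap_extend),  # E: cost to extend a gap
        "-penalty", str(penalty),    # q: mismatch penalty
        "-reward", str(reward),      # r: match reward
    ]


# "query.fasta" and "nt" are placeholder names for this illustration.
print(" ".join(blastn_command("query.fasta", "nt", evalue=1e-6)))
```

The resulting list could be handed to `subprocess.run` on a machine where BLAST+ and a local database are installed; not every combination of reward, penalty and gap costs is accepted by the program.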

[b] Scoring parameters

For nucleotide sequences

1. Match and mismatch:
Reward and penalty for matching and mismatching bases.

2. Gap costs
• Existence and extension: the cost to create and to extend a gap in an alignment.
• Increasing the gap costs will decrease the number of gaps introduced.
• Linear costs are available only with megablast and are determined by the match/mismatch scores.
[b] Scoring parameters (continued)

For amino acid sequences

1. Matrix:
Substitution matrices (PAM and BLOSUM).

2. Gap costs
• Existence and extension: the cost to create and to extend a gap in an alignment.

3. Compositional adjustment (associated with DELTA-BLAST)


[c] Filters and Masking
1. Filter (complexity)
• Masks off regions of the query sequence that have low compositional complexity.
• Masks repeating sequences, speeding up the search.

2. Mask
• Masking for look-up tables is experimental and eliminates hits based on low-complexity sequences.
• Mask lower case: search only the upper-case parts of the sequence.
Low-complexity sequence
• Unusual composition
• Can often be recognized by visual inspection
• For example:
   o Protein sequence: PPCDPPPPPKDKKKKDDGPP
   o Nucleotide sequence: AAATAAAAAAAATAAAAAAT
• Filters are used to remove low-complexity sequence because it can cause artifactual hits.
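
One simple way to put a number on "unusual composition" is the Shannon entropy of the residue frequencies. This is only an illustration of the idea; it is not the method used by BLAST's own filters, and the comparison sequence is invented for the example.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


low_complexity_dna = "AAATAAAAAAAATAAAAAAT"   # example from the slide
mixed_dna = "ACGTACGTACGTACGTACGT"            # invented comparison sequence
low_complexity_protein = "PPCDPPPPPKDKKKKDDGPP"  # example from the slide

# A lower entropy means fewer kinds of residues dominate the sequence.
print(round(shannon_entropy(low_complexity_dna), 2))
print(round(shannon_entropy(mixed_dna), 2))
print(round(shannon_entropy(low_complexity_protein), 2))
```

For DNA the maximum is 2 bits (all four bases equally frequent), so a window scoring well below that is a candidate low-complexity region.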

Low-complexity sequence
Note:
A low-complexity sequence is a region of a sequence composed of few kinds of elements. These regions might give high scores that prevent the program from finding the truly significant sequences in the database, so they should be filtered out. The regions are marked with an X (protein sequences) or N (nucleic acid sequences) and are then ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the DUST program is used for DNA sequences. The XNU program is used to mask off tandem repeats in protein sequences.
Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
UPPER CASE- LOWER CASE

• An upper case letter in a DNA consensus sequence indicates that the nucleotide is conserved at that position in the sequences used to make the consensus.
• A lower case letter is the most common nucleotide at a variable position.
• Protein sequences are always written in upper case letters.
Mask lower case:
• With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
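
The effect of the "Mask lower case" option can be imitated with a one-line function: every lower-case residue is replaced by the masking character (N for nucleotides, X for proteins), so only the upper-case regions take part in the comparison. The function name and example sequences are made up for this sketch.

```python
def mask_lower_case(sequence, mask_char="N"):
    """Replace every lower-case residue with mask_char, leaving upper-case residues intact."""
    return "".join(mask_char if residue.islower() else residue for residue in sequence)


# Nucleotide query: the lower-case stretch is the part to be filtered.
print(mask_lower_case("ACGTTGCAaaataaaaaaaatGGCATC"))
# Protein query: use X as the masking character.
print(mask_lower_case("MKTLLVppppkdkkkkAGHE", mask_char="X"))
```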
Interpreting Results
The Expect value (E)

➢ Describes the number of hits one can "expect" to see by chance. It decreases exponentially as the Score (S) of the match increases.

➢ Is the expected number of HSP matches in a database of n sequences.

➢ Describes the random background noise.

➢ Gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used.
➢ The lower the E-value, the more significant the hit. If you want to be certain of homology, your E-value must be lower than 10^-4 to 10^-6.

➢ An alignment with an E-value of 0.05 means that about 0.05 matches this good are expected by chance alone, which is roughly a 5 in 100 (1 in 20) chance.

➢ Identical short alignments have relatively high E-values, because the calculation of the E-value takes into account the length of the query sequence.
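
These statements follow from the usual relation between the E-value, the bit score S' and the size of the search space, E = m x n x 2^(-S'), where m is the query length and n the total database length. The sketch below uses that relation in its simplest form (real BLAST also corrects the lengths for edge effects); the numbers in the example are invented.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'): expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E); close to E when E is small."""
    return -math.expm1(-e_value)


database_length = 1_000_000_000  # invented database size, in residues
# The same query scores higher for a long match than for a short one,
# so the short match ends up with a much larger E-value.
print(expect_value(bit_score=80, query_length=400, database_length=database_length))
print(expect_value(bit_score=30, query_length=400, database_length=database_length))
print(round(chance_probability(0.05), 4))
```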

➢ Shorter sequences have a higher probability of occurring in the database purely by chance.

➢ E-values are not easily compared between searches of different databases.

➢ The E-value is a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages.


Interpreting Results (continued)

• The % identity:
   o Sometimes used as a rough substitute for the E-value.
   o The fraction of aligned residues that are identical. Residues that are similar but not identical are marked (+) and counted separately as positives.

• Length:
   o This is the length of the alignment, which indicates how long the two segments of your sequences that BLAST has aligned are.
   o Note: very short alignments can come up with high E-values and not be very meaningful.
• Generally:
   o Bit scores below 50 are unreliable
   o E-values greater than 0.0001 are often close to the twilight zone

• Note: Although programs like BLAST search databases through pairwise comparisons, these programs are optimized for speed, not for alignment accuracy.


The Bit score

The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.

In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.

A key element in this calculation is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The BLOSUM62 matrix is the default for most BLAST programs, the exceptions being blastn and MegaBLAST (programs that perform nucleotide-nucleotide comparisons and hence do not use protein-specific matrices). Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used.
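
The normalization mentioned above is usually written as S' = (lambda x S - ln K) / ln 2, where S is the raw score and lambda and K are statistical parameters of the scoring system. The sketch below applies that formula; the lambda and K values in the example are illustrative assumptions of the kind reported for BLOSUM62 with gap costs 11/1, and should be read from the BLAST output for a real search.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameters only (assumed, not taken from the lecture).
LAMBDA = 0.267
K = 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because lambda and K absorb the scale of the scoring matrix and gap costs, two bit scores can be compared even when the raw scores came from different matrices.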
Troubleshooting
ERROR 1: "No significant similarity found"

Possible problem 1: Short query sequences. Short alignments may have Expect values above the default threshold, which is 10 on most pages, and are therefore not displayed.
Solution: Try increasing the Expect threshold (under 'Algorithm parameters').

Possible problem 2: Low-complexity regions are not allowed to initiate alignments, so if your query is largely low complexity, the filter may prevent all hits to the database.
Solution: On the Basic BLAST pages, adjust the filter settings in the section 'Filters and Masking', under 'Algorithm parameters'.
Troubleshooting
ERROR 2: "An error has occurred on the server, Too many HSPs to save all"

Possible problem: The total number of high-scoring segment pairs (HSPs) is far too large for the BLAST servers to return the results. This is rare, as the results have to be several hundred megabytes of information for this to happen. However, certain searches can generate a huge amount of data. Most typically this error occurs when the default filters are turned off or when the query sequences contain repeat elements.

Solutions:
1. Enable the species-specific repeats filter, if applicable.
2. If using tblastx, try blastx instead. The tblastx program is very CPU intensive, as it translates not only the query in six reading frames but every database sequence as well. Often, using tblastx is a measure of last resort; a blastx search against a database of known proteins may provide what you need.
3. Search a smaller database, such as refseq_rna. Larger databases contain more sequences, and for some queries this results in numerous "background" hits. If you want a database of known mRNAs (and their translations), then refseq_rna is a good choice.


4. Break up large queries into smaller pieces and submit each piece in a separate search. A common cause of errors in BLAST is searching with a huge sequence, like a complete chromosome, against a large database like nr. This is better accomplished in portions rather than as one large, continuous sequence.

5. Limit the database by taxonomy. Start with large groups, such as mammals, bacteria, etc. Any taxonomic node or tax id number that you can find in the Taxonomy browser can be used in the 'Organism' text box.

6. You may be hitting a large number of 'PREDICTED' or 'hypothetical protein' records. If you do not want these hits, use an Entrez Query such as: all[filter] NOT predicted[title].

7. For megablast and blastn searches, try increasing the word size and/or decreasing the Expect threshold.
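
Breaking a large query into smaller pieces (solution 4) is easy to script. The sketch below cuts a sequence into fixed-length pieces with an optional overlap, so that a match spanning a cut point is not lost; the function name, piece length and overlap are assumptions for illustration, not NCBI recommendations.

```python
def split_query(sequence, piece_length, overlap=0):
    """Split a sequence into (start, piece) tuples of at most piece_length residues."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    if not 0 <= overlap < piece_length:
        raise ValueError("overlap must be between 0 and piece_length - 1")
    step = piece_length - overlap
    pieces = []
    for start in range(0, len(sequence), step):
        pieces.append((start, sequence[start:start + piece_length]))
        if start + piece_length >= len(sequence):
            break  # the last piece already reaches the end of the sequence
    return pieces


# Each piece would be submitted as its own search; the start offset lets
# hit coordinates be mapped back onto the original sequence.
for start, piece in split_query("ACGTACGGTCATGCATTGACCA", piece_length=10, overlap=2):
    print(start, piece)
```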