
04B. Bioinformatics-Lecture 4 (Alternative) - Blast

The document provides information about BLAST (Basic Local Alignment Search Tool) and the BLAST algorithm. It discusses how BLAST is used to identify similar sequences by finding local alignments between a query sequence and large databases. The BLAST algorithm works by breaking up query sequences into words and using heuristics to find high-scoring matches in the database, then extending these matches into high-scoring segment pairs (HSPs) while evaluating their statistical significance.

Lecture 3b: BLAST

Ly Le, PhD
School of Biotechnology
Email: ly.le@hcmiu.edu.vn
Office: Rm705, HCM International University
BLAST and SIMILARITY SEARCH
Why is similarity important

• One sequence by itself is not informative; it
must be analyzed by comparative methods
against existing sequence databases to develop
hypotheses concerning relatives and function.
• Similar sequences (homologues) often derive
from the same ancestor, share the same
structure, and have similar biological function.
Evolution
• Evolution has duplicated and shuffled bits and
pieces of molecules to produce new linear
arrangements that combine function in novel
ways.
• Regions of similarity often suggest an
evolutionary tie and/or common functional
properties between very different molecules.
Homology

• Shared morphology does NOT necessarily imply
common ancestry
• When similarity is due to common ancestry, we
call it homology
Similarity judgments should be based on

• The types of changes or mutations that occur
within sequences.
• Characteristics of those different types of
mutations.
• The frequency of those mutations.
Common similarity problems

• Start with a query sequence with unknown
properties and search within a database of
millions of sequences to find those which
share similarity with the query.
• Start with a small set of sequences and identify
similarities and differences among them.
• In many sequences or very long sequences,
detect commonly occurring patterns.
Crude similarity thresholds

• Proteins – 25% similarity
• Nucleic acids – 75% similarity
• Sequences must have more than 1200 residues
• Below 25%/75% is the twilight zone: further
study is needed to reach a reliable conclusion
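
The crude thresholds above can be applied mechanically once two sequences are aligned. The following is a minimal sketch, assuming the alignment is already given as two equal-length strings with "-" marking gaps. The function names are hypothetical, and the sketch counts exact identities over gap-free columns, which is only one of several possible readings of "similarity".

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def in_twilight_zone(identity: float, is_protein: bool) -> bool:
    """Apply the crude 25% (protein) / 75% (nucleic acid) cut-offs."""
    threshold = 25.0 if is_protein else 75.0
    return identity < threshold


# Toy nucleotide alignment: 8 identical columns out of 9 gap-free columns.
identity = percent_identity("ACGT-ACGTA", "ACGTTACCTA")
print(round(identity, 1), in_twilight_zone(identity, is_protein=False))
```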
BLAST
• Basic Local Alignment Search Tool
• A set of sequence comparison algorithms
developed in 1990 and 1997 (S. Altschul)
• A heuristic method for performing local
alignments through searches of high-scoring
segment pairs (HSPs)
– Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic
local alignment search tool.” J. Mol. Biol. 215:403-410.
– Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs.” Nucleic Acids Res. 25:3389-3402.

BLAST TYPES
• BLAST at NCBI
http://www.ncbi.nlm.nih.gov/BLAST
and BLAST at EMBnet
http://www.ch.embnet.org/software/aBLAST.html
use different databases, so they yield slightly different
results.
• Standard BLAST – uses a substitution matrix (e.g.
PAM or BLOSUM) to reward identical matches,
give positive points for similar amino acids, and
penalize different amino acids.
FASTA vs. BLAST
FASTA
• Lipman and Pearson (1985), improved in 1988
• Compares query to every sequence in a database
• Uses heuristics, such as “hot spots”, “best diagonal runs” and
execution of a dynamic programming algorithm in a narrow band
around a hot spot
• Marked improvement over the dynamic programming algorithms
BLAST
• Altschul et al. (1990)
• Faster than FASTA
• More sensitive
• Based on theoretical foundations
BLAST is a Heuristic

• BLAST does not use Needleman-Wunsch (global algorithm)
or Smith-Waterman (local algorithm)
• BLAST approximates dynamic programming methods
• BLAST is not guaranteed to give a mathematically
optimal alignment
• BLAST does not explore the complete search space
• BLAST uses heuristics (loosely defined rules) to refine
high-scoring segment pairs (HSPs)
• BLAST reports only “statistically significant” alignments
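
For contrast with the heuristic, here is a minimal sketch of the exact Smith-Waterman local alignment score that BLAST approximates. It fills the complete dynamic programming table, which is the full search space that BLAST avoids exploring. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions for illustration, not BLAST defaults.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score by full dynamic programming, O(len(a) * len(b))."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# The shared segment "ACGT" gives 4 matches at +2 each.
print(smith_waterman_score("TTTACGTTT", "GGACGTGG"))  # 8
```

Every cell of the table is computed, so the cost grows with the product of the two sequence lengths; this is what makes the exact method expensive against a database of millions of sequences.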
How does BLAST work?
• BLAST allows the user to select one sequence
(termed the query) and perform pairwise
alignment between the query and an entire
database (termed the target).

⇒ This means that millions of alignments are
analyzed in a BLAST search, and only the most
closely related matches are returned.
BLAST’s main page
What BLAST tells you ...
• BLAST reports surprising alignments
- Different than chance

• Assumptions
- Random sequences
- Constant composition

• Conclusions
- Surprising similarities imply evolutionary homology

Evolutionary homology: descent from a common ancestor.
It does not always imply similar function.
BLAST Algorithm

1. Remove low-complexity regions or sequence repeats in the
query sequence.
A "low-complexity region" is a region of a sequence
composed of few kinds of elements. Such regions can give
high scores that make it harder for the program to find the
truly significant sequences in the database, so they should be
filtered out.
The regions are marked with X (protein sequences) or
N (nucleic acid sequences) and are then ignored by the
BLAST program.
LCRs (low-complexity regions)
• Watch out for…
– transmembrane or signal peptide regions
– coiled-coil regions
– short amino acid repeats (collagen, elastin)
– homopolymeric repeats
• BLAST uses SEG to mask amino acids
• BLAST uses DUST to mask bases
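
The idea of masking can be illustrated with a simple sliding-window entropy filter. This is a sketch only: it is not the SEG or DUST algorithm, and the window size, entropy cut-off, and function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace every residue that falls inside a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Toy protein with a homopolymeric glutamine run in the middle.
print(mask_low_complexity("ACDEFGHIKL" + "Q" * 15 + "MNPRSTVWYA"))
```

Passing mask_char="N" would give the nucleotide convention mentioned in step 1.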
2. Make a k-letter word list of the
query sequence. Taking k=3 as an
example, we list the words of
length 3 in the query protein
sequence (k is usually 11 for a
DNA sequence) sequentially,
until the last letter of the query
sequence is included. The method
is illustrated in figure 1.
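
Step 2 amounts to sliding a window of length k along the query one position at a time. A minimal sketch follows; the function name and the toy query are assumptions for illustration.

```python
def word_list(query: str, k: int = 3) -> list[tuple[int, str]]:
    """Return (offset, word) for every overlapping k-letter word in the query."""
    return [(i, query[i:i + k]) for i in range(len(query) - k + 1)]


words = word_list("PQGEFG", k=3)
print(words)  # [(0, 'PQG'), (1, 'QGE'), (2, 'GEF'), (3, 'EFG')]
```

A query of length n yields n - k + 1 words, so the last word ends on the last letter of the query.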
3. List the possible matching words.
BLAST only cares about the high-scoring words. The scores are
created by comparing each word in the list from step 2 with all
possible 3-letter words, using the scoring matrix to score the
comparison of each residue pair.
For example, the scores obtained by comparing PQG with PEG and
PQA are 15 and 12, respectively.
After that, a neighborhood word score threshold T is used to reduce
the number of possible matching words. The words whose scores are
greater than the threshold T remain in the possible matching
words list, while those with lower scores are discarded. For
example, PEG is kept, but PQA is abandoned when T is 13.
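
The PQG example can be reproduced with a small sketch. To stay short, it uses only the BLOSUM62 scores for five residues (P, Q, E, G, A), so the candidate words are drawn from that reduced alphabet rather than all 20 amino acids; the function names are hypothetical.

```python
from itertools import product

# BLOSUM62 scores restricted to five residues (an illustrative subset, not the full matrix).
ALPHABET = "PQEGA"
_PAIRS = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("E", "E"): 5, ("G", "G"): 6, ("A", "A"): 4,
    ("P", "Q"): -1, ("P", "E"): -1, ("P", "G"): -2, ("P", "A"): -1,
    ("Q", "E"): 2, ("Q", "G"): -2, ("Q", "A"): -1,
    ("E", "G"): -2, ("E", "A"): -1, ("G", "A"): 0,
}
# Make the lookup symmetric so ("E", "Q") works as well as ("Q", "E").
SCORES = {**_PAIRS, **{(b, a): s for (a, b), s in _PAIRS.items()}}


def word_score(query_word: str, candidate: str) -> int:
    """Sum of residue-pair scores between two words of equal length."""
    return sum(SCORES[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word: str, threshold: int) -> dict[str, int]:
    """All candidate words over ALPHABET scoring strictly above the threshold T."""
    hits = {}
    for letters in product(ALPHABET, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score > threshold:
            hits[candidate] = score
    return hits


print(word_score("PQG", "PEG"), word_score("PQG", "PQA"))  # 15 12
print(neighborhood("PQG", threshold=13))  # {'PQG': 18, 'PEG': 15}
```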
4. Organize the remaining high-scoring words into an
efficient search tree. This allows the program to rapidly
compare the high-scoring words to the database sequences.
5. Repeat steps 3 to 4 for each k-letter word in the query
sequence.
6. Scan the database sequences for exact matches with the
remaining high-scoring words. The BLAST program
scans the database sequences for the remaining high-scoring
words, such as PEG, at each position. If an exact match is
found, this match is used to seed a possible un-gapped
alignment between the query and database sequences.
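
Steps 4 to 6 can be sketched as an index lookup. Note the substitution: the text describes a search tree, while this sketch uses a Python dictionary (a hash table) because it is the shortest way to show the same exact-match lookup. The function names and the toy neighborhoods are assumptions.

```python
from collections import defaultdict


def build_word_index(neighborhoods: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each high-scoring word to the query offsets it was derived from."""
    index = defaultdict(list)
    for query_offset, words in neighborhoods.items():
        for word in words:
            index[word].append(query_offset)
    return dict(index)


def find_seeds(index: dict[str, list[int]], subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query_offset, subject_offset) for every exact word hit in the subject."""
    seeds = []
    for subject_offset in range(len(subject) - k + 1):
        word = subject[subject_offset:subject_offset + k]
        for query_offset in index.get(word, []):
            seeds.append((query_offset, subject_offset))
    return seeds


# Hypothetical step 3 output for the first two query positions.
index = build_word_index({0: ["PQG", "PEG"], 1: ["QGE"]})
print(find_seeds(index, "AAPEGEAQGE"))  # [(0, 2), (1, 7)]
```

Each seed records where the hit lies in both sequences, which is what the extension step needs as its starting point.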
7. Extend the exact matches to
high-scoring segment pairs
(HSPs).
BLAST stretches a longer alignment between
the query and the database sequence in the left
and right directions, from the position where
the exact match occurred. The extension does
not stop until the accumulated total score of
the HSP begins to decrease.
[Figure: Extending the high-scoring segment pair (HSP), showing the minimum score (S) and the neighborhood score threshold (T)]
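
Un-gapped extension of a seed can be sketched as follows. The sketch keeps extending in each direction while remembering the best score seen, and gives up once the running score has fallen more than a tolerance below that best. The tolerance `x_drop`, the match and mismatch scores, and the function name are assumptions for illustration; a real search would score residue pairs with a substitution matrix.

```python
def extend_seed(query: str, subject: str, q_start: int, s_start: int, k: int,
                match: int = 5, mismatch: int = -4, x_drop: int = 8):
    """Extend a k-letter seed without gaps; return (score, q_begin, s_begin, length)."""
    def pair(i: int, j: int) -> int:
        return match if query[i] == subject[j] else mismatch

    seed_score = sum(pair(q_start + n, s_start + n) for n in range(k))

    # Extend to the right, remembering the best-scoring end point.
    best_right, running, right = 0, 0, 0
    i, j = q_start + k, s_start + k
    while i < len(query) and j < len(subject):
        running += pair(i, j)
        if running > best_right:
            best_right, right = running, i - (q_start + k) + 1
        elif best_right - running > x_drop:
            break
        i, j = i + 1, j + 1

    # Extend to the left in the same way.
    best_left, running, left = 0, 0, 0
    i, j = q_start - 1, s_start - 1
    while i >= 0 and j >= 0:
        running += pair(i, j)
        if running > best_left:
            best_left, left = running, q_start - i
        elif best_left - running > x_drop:
            break
        i, j = i - 1, j - 1

    return seed_score + best_left + best_right, q_start - left, s_start - left, k + left + right


# Seed "GTA" at offset 4 in both sequences grows into the 7-letter HSP "ACGTACG".
print(extend_seed("TTACGTACGGA", "GGACGTACGTC", q_start=4, s_start=4, k=3))  # (35, 2, 2, 7)
```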
8. List all of the HSPs in the database whose
score is high enough to be considered.

9. Evaluate the significance of the HSP score.
The Gumbel extreme value distribution (EVD) gives
the probability p of observing a score S equal to
or greater than x under a given scoring matrix.
(For local alignments containing gaps, the
applicability of the Gumbel EVD has not been proved.)
10. Make two or more HSP regions into a longer
alignment.
11. Show the gapped Smith-Waterman local
alignments of the query and each of the
matched database sequences.
12. Report every match whose expect score is
lower than a threshold parameter E.
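
Steps 9 and 12 can be sketched with the standard Karlin-Altschul form, in which the expected number of chance HSPs with score at least S is E = K·m·n·e^(-λS) and the corresponding probability is p = 1 - e^(-E). Here m and n are the query and database lengths. The values used for λ and K, the threshold of 10, the toy scores, and the function names are all assumptions for illustration; real values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(score: float, query_len: int, db_len: int, lam: float, k: float) -> float:
    """Expected number of chance HSPs scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance HSP: 1 - exp(-E), stable for tiny E."""
    return -math.expm1(-e_value)


def report(hsps, query_len: int, db_len: int, lam: float = 0.318,
           k: float = 0.134, e_threshold: float = 10.0):
    """Keep HSPs whose expect value is below the threshold, best first."""
    scored = [(name, expect_value(s, query_len, db_len, lam, k)) for name, s in hsps]
    return sorted((hit for hit in scored if hit[1] < e_threshold), key=lambda hit: hit[1])


# Two hypothetical HSPs with raw scores 80 and 30.
hits = report([("hitA", 80), ("hitB", 30)], query_len=250, db_len=1_000_000)
for name, e in hits:
    print(f"{name}: E = {e:.2e}, p = {p_value(e):.2e}")
```

With these assumed parameters only the higher-scoring HSP passes the threshold; the lower score is expected to occur by chance many times in a database of this size.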
Use of BLAST

• Finding genes in a genome
• Predicting a protein function
• Predicting a protein 3-D structure
• Finding protein family members
BLAST SEARCH STEPS

• Step 1: specify sequence of interest
• Step 2: select BLAST program
• Step 3: select a database
• Step 4: select optional search parameters
Step 1: specify sequence of interest

The query input can take one of two forms:
• DNA or protein sequence in FASTA format
• An accession number or GI (GenInfo Identifier)
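
A FASTA record is a header line starting with ">" followed by one or more lines of sequence. A minimal parsing sketch follows; the function name and the example records are made up for illustration.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}; the header is the text after '>'."""
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = []
        elif current is not None:
            # Sequence lines may be wrapped; collect them and join at the end.
            records[current].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


example = """>query1 hypothetical protein
MKVLAAGIVG
LCQQ
>query2
ACGTACGT
"""
print(parse_fasta(example))
```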
Step 2: select BLAST program
Step 3: select a database

• For protein database searches (blastp and blastx), the default option is the
nonredundant (nr) database (GenBank, the Protein Data Bank (PDB),
SwissProt, PIR, and PRF). Another option is to search only
RefSeq proteins.

• For DNA database searches (blastn, tblastn, tblastx), the default option is to
search the human (or mouse) genomic plus transcript database. Other
commonly used options include the nucleotide nr database (GenBank, EMBL,
DDBJ, and PDB).
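
The split above follows from which program accepts which kind of query and database. A small sketch of that choice, with a hypothetical lookup table and function name:

```python
# (query type, database type) -> BLAST program name
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",   # query is translated before the search
    ("protein", "nucleotide"): "tblastn",  # database is translated before the search
}


def choose_program(query_type: str, database_type: str, translate_both: bool = False) -> str:
    """Pick a BLAST program from the query and database sequence types."""
    if translate_both:
        # tblastx compares a translated nucleotide query with a translated nucleotide database.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    return PROGRAMS[(query_type, database_type)]


print(choose_program("nucleotide", "protein"))  # blastx
```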
Step 4: select optional search parameters
BLAST result
