04B. Bioinformatics-Lecture 4 (Alternative) - Blast

The document provides information about BLAST (Basic Local Alignment Search Tool) and the BLAST algorithm. It discusses how BLAST is used to identify similar sequences by finding local alignments between a query sequence and large databases. The BLAST algorithm works by breaking up query sequences into words and using heuristics to find high-scoring matches in the database, then extending these matches into high-scoring segment pairs (HSPs) while evaluating their statistical significance.

Uploaded by

LinhNguye

Lecture 3b: BLAST

Ly Le, PhD
School of Biotechnology
Email: ly.le@hcmiu.edu.vn
Office: Rm705, HCM International University
BLAST and SIMILARITY SEARCH
Why is similarity important?

• One sequence by itself is not informative; it must be analyzed by comparative methods against existing sequence databases to develop hypotheses concerning relatives and function.
• Similar sequences (homologues) often derive from the same ancestor, share the same structure, and have similar biological function.
Evolution
• Evolution has duplicated and shuffled bits and
pieces of molecules to produce new linear
arrangements that combine function in novel
ways.
• Regions of similarity often suggest an
evolutionary tie and/or common functional
properties between very different molecules.
Homology

• Shared morphology does NOT necessarily imply common ancestry
• When similarity is due to common ancestry, we call it homology
Similarity judgments should be based on

• The types of changes or mutations that occur within sequences.
• The characteristics of those different types of mutations.
• The frequency of those mutations.
Common similarity problems

• Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query.
• Start with a small set of sequences and identify similarities and differences among them.
• In many sequences or very long sequences, detect commonly occurring patterns.
Crude similarity thresholds

• Proteins – 25% similarity
• Nucleic acids – 75% similarity
• Sequences must have more than 1200 residues
• Below 25%/75% is the twilight zone → further study is needed to reach a reliable conclusion
BLAST
• Basic Local Alignment Search Tool
• A set of sequence comparison algorithms developed in 1990 and 1997 (S. Altschul)
• A heuristic method for performing local alignments by searching for high-scoring segment pairs (HSPs)
– Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
– Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res. 25:3389-3402.

BLAST TYPES
• BLAST at NCBI
http://www.ncbi.nlm.nih.gov/BLAST
• and BLAST at EMBnet
http://www.ch.embnet.org/software/aBLAST.html
use different databases and therefore yield slightly different results.
• Standard BLAST – uses a substitution matrix (e.g. PAM or BLOSUM) to reward identical matches, give positive points for similar amino acids (aa), and penalize dissimilar aa.
FASTA vs. BLAST
FASTA
• Lipman and Pearson (1985), improved in 1988
• Compares the query to every sequence in a database
• Uses heuristics, such as “hot spots”, “best diagonal runs” and execution of a dynamic programming algorithm in a narrow band around a hot spot
• Marked improvement over the dynamic programming algorithms
BLAST
• Altschul et al. (1990)
• Faster than FASTA
• More sensitive
• Based on theoretical foundations
BLAST is a Heuristic

• BLAST does not use Needleman-Wunsch (global algorithm) or Smith-Waterman (local algorithm)
• BLAST approximates dynamic programming methods
• BLAST is not guaranteed to give a mathematically optimal alignment
• BLAST does not explore the complete search space
• BLAST uses heuristics (loosely defined rules) to refine High-scoring Segment Pairs (HSPs)
• BLAST reports only “statistically significant” alignments
How does BLAST work?
• BLAST allows the user to select one sequence (termed the query) and perform pairwise alignment between the query and an entire database (termed the target).

⇒ This means that millions of alignments are analyzed in a BLAST search, and only the most closely related matches are returned.
BLAST’s main page
What BLAST tells you ...
• BLAST reports surprising alignments
- Different than chance

• Assumptions
- Random sequences
- Constant composition

• Conclusions
- Surprising similarities imply evolutionary homology

Evolutionary Homology: descent from a common ancestor

Does not always imply similar function
BLAST Algorithm

1. Remove low-complexity regions or sequence repeats in the query sequence.
A "low-complexity region" is a region of a sequence composed of few kinds of elements. Such regions can give high scores that make it harder for the program to find the truly significant sequences in the database, so they should be filtered out.
The regions are marked with an X (protein sequences) or N (nucleic acid sequences) and are then ignored by the BLAST program.
LCR’s (low complexity)
• Watch out for…
– transmembrane or signal peptide regions
– coiled-coil regions
– short amino acid repeats (collagen, elastin)
– homopolymeric repeats
• BLAST uses SEG to mask amino acids
• BLAST uses DUST to mask bases
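The masking idea can be illustrated with a sliding-window entropy filter. This is only a minimal sketch: it is not the SEG or DUST algorithm, and the function names, the window length of 12, and the entropy cutoff of 2.2 bits are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the letters in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace every residue covered by a low-entropy window with mask_char.

    Illustrative stand-in for SEG/DUST; window and min_entropy are assumptions.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # A serine run flanked by more varied sequence; use mask_char="N" for DNA.
    print(mask_low_complexity("MKTAYIAKQR" + "S" * 14 + "DEGHLVNWPC"))
```

Because whole windows are masked, a few residues flanking the repeat are masked as well; real filters handle the boundaries more carefully.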
2. Make a k-letter word list of the query sequence. Taking k=3 as an example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) sequentially, until the last letter of the query sequence is included. The method is illustrated in figure 1.
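Step 2 amounts to sliding a window of length k along the query. A minimal sketch follows; the function name `word_list` and the example query are hypothetical.

```python
def word_list(query: str, k: int = 3) -> list[tuple[int, str]]:
    """Return (start position, word) for every overlapping k-letter word."""
    return [(i, query[i:i + k]) for i in range(len(query) - k + 1)]


if __name__ == "__main__":
    for position, word in word_list("PQGEFG", k=3):
        print(position, word)
```

A query of length n yields n - k + 1 words, so the last word ends on the last letter of the query.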
3. List the possible matching words.
BLAST only cares about the high-scoring words. The scores are obtained by comparing each word in the list from step 2 with all possible 3-letter words, using the scoring matrix to score each residue pair.
For example, comparing PQG with PEG and with PQA gives scores of 15 and 12, respectively.
A neighborhood word score threshold T is then used to reduce the number of possible matching words. Words whose scores are greater than T remain in the list of possible matching words, while those with lower scores are discarded. For example, when T is 13, PEG is kept but PQA is discarded.
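The PQG example can be reproduced with a small sketch. It assumes the scoring matrix is BLOSUM62 (the slides do not name the matrix, but the scores 15 and 12 match it) and, to stay short, embeds only the entries for five residues; the names `BLOSUM62_SUBSET`, `word_score`, and `neighborhood` are hypothetical.

```python
from itertools import product

# BLOSUM62 entries for five residues only (A, E, G, P, Q); an illustrative subset.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("A", "E"): -1, ("A", "G"): 0, ("A", "P"): -1, ("A", "Q"): -1,
    ("E", "E"): 5, ("E", "G"): -2, ("E", "P"): -1, ("E", "Q"): 2,
    ("G", "G"): 6, ("G", "P"): -2, ("G", "Q"): -2,
    ("P", "P"): 7, ("P", "Q"): -1,
    ("Q", "Q"): 5,
}
ALPHABET = "AEGPQ"


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(w1: str, w2: str) -> int:
    """Sum the residue-pair scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def neighborhood(word: str, threshold: int) -> dict[str, int]:
    """All words over ALPHABET whose score against `word` is greater than T."""
    hits = {}
    for letters in product(ALPHABET, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score > threshold:
            hits[candidate] = score
    return hits


if __name__ == "__main__":
    print(word_score("PQG", "PEG"), word_score("PQG", "PQA"))
    print(neighborhood("PQG", threshold=13))
```

With the full 20-letter alphabet the enumeration covers 8000 three-letter words per query word, which is why the threshold T matters for speed.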
4. Organize the remaining high-scoring words into an efficient search tree. This allows the program to rapidly compare the high-scoring words to the database sequences.
5. Repeat steps 3 to 4 for each k-letter word in the query sequence.
6. Scan the database sequences for exact matches with the remaining high-scoring words. The BLAST program scans the database sequences for each remaining high-scoring word, such as PEG, at each position. If an exact match is found, it is used to seed a possible ungapped alignment between the query and database sequences.
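Steps 4 to 6 can be sketched as a lookup of every database word in an index of the high-scoring words. Note the substitution: this sketch uses a Python dictionary (a hash table) in place of the search tree named in step 4, and the function name `find_seeds` and the example sequences are hypothetical.

```python
def find_seeds(word_positions: dict[str, list[int]], subject: str,
               k: int = 3) -> list[tuple[int, int]]:
    """Return (query position, subject position) for every exact word hit.

    word_positions maps each high-scoring word to the query positions
    whose neighborhood it belongs to.
    """
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in word_positions.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    # PQG sits at query position 0; PEG is one of its neighborhood words.
    index = {"PQG": [0], "PEG": [0]}
    print(find_seeds(index, "MAPEGLLPQGK"))
```

Each returned pair marks a diagonal on which an ungapped alignment can be seeded.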
7. Extend the exact matches to high-scoring segment pairs (HSPs).
BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease.
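The extension in step 7 can be sketched as follows. Two things here are assumptions rather than content of the slides: the extension tolerates a limited fall below the best score seen so far (the parameter `x_drop`) instead of stopping at the first decrease, and the example uses a simple match/mismatch score in place of a substitution matrix. All function names are hypothetical.

```python
from typing import Callable


def extend_one_direction(query: str, subject: str, q: int, s: int, step: int,
                         score_fn: Callable[[str, str], int],
                         x_drop: int) -> tuple[int, int]:
    """Walk from (q, s) in direction `step`; return (best gain, residues kept)."""
    best_gain = gain = 0
    kept = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += score_fn(query[q], subject[s])
        length += 1
        if gain > best_gain:
            best_gain, kept = gain, length
        elif best_gain - gain > x_drop:
            break  # score has fallen too far below the best seen so far
        q += step
        s += step
    return best_gain, kept


def extend_seed(query: str, subject: str, q_pos: int, s_pos: int, k: int,
                score_fn: Callable[[str, str], int],
                x_drop: int = 5) -> tuple[int, int, int, int]:
    """Ungapped extension of a seed; returns (score, q_start, q_end, s_start)."""
    seed = sum(score_fn(query[q_pos + i], subject[s_pos + i]) for i in range(k))
    right_gain, right_len = extend_one_direction(
        query, subject, q_pos + k, s_pos + k, +1, score_fn, x_drop)
    left_gain, left_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, score_fn, x_drop)
    return (seed + left_gain + right_gain,
            q_pos - left_len, q_pos + k + right_len, s_pos - left_len)


if __name__ == "__main__":
    def simple(a: str, b: str) -> int:
        return 2 if a == b else -3  # hypothetical match/mismatch scores

    score, q_start, q_end, s_start = extend_seed(
        "AAPQGWW", "TTCAPQGWC", q_pos=2, s_pos=4, k=3, score_fn=simple)
    print(score, "AAPQGWW"[q_start:q_end], s_start)
```

The resulting segment is kept as an HSP only if its score is high enough, as described in the next step.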
[Figure: Extending the High-scoring Segment Pair (HSP). Labels: Minimum Score (S); Neighborhood Score Threshold (T)]
8. List all of the HSPs in the database whose score is high enough to be considered.

9. Evaluate the significance of the HSP score. BLAST assesses each HSP score using the Gumbel extreme value distribution (EVD); for local alignments containing gaps, it has not been proved that scores follow this distribution. In accordance with the Gumbel EVD, one computes the probability p of observing a score S equal to or greater than x for the scoring matrix in use.
10. Combine two or more HSP regions into a longer alignment.
11. Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
12. Report every match whose expect score is lower than a threshold parameter E.
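The significance test in steps 9 and 12 can be sketched with the standard Karlin-Altschul forms: the expect value E = K·m·n·e^(−λS) and the Gumbel tail probability p(S ≥ x) = 1 − exp(−e^(−λ(x−μ))) with μ = ln(K·m·n)/λ. The slides do not show these formulas, so treat the code as an illustration under that assumption; the λ and K values, the sequence lengths, and the cutoff below are hypothetical example numbers, and the function names are invented for this sketch.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S): expected number of chance HSPs."""
    return k * query_len * db_len * math.exp(-lam * score)


def gumbel_p_value(x: float, lam: float, mu: float) -> float:
    """P(S >= x) = 1 - exp(-exp(-lambda * (x - mu))) under the Gumbel EVD."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def report(hsp_scores: dict[str, float], query_len: int, db_len: int,
           lam: float, k: float, e_threshold: float) -> dict[str, float]:
    """Keep only the matches whose expect value is below the threshold E."""
    evalues = {name: expect_value(s, query_len, db_len, lam, k)
               for name, s in hsp_scores.items()}
    return {name: e for name, e in evalues.items() if e < e_threshold}


if __name__ == "__main__":
    lam, k = 0.318, 0.13            # hypothetical statistical parameters
    m, n = 250, 1_000_000           # hypothetical query and database lengths
    mu = math.log(k * m * n) / lam  # Gumbel location parameter
    print(expect_value(60, m, n, lam, k), gumbel_p_value(60, lam, mu))
    print(report({"hit_a": 90, "hit_b": 40}, m, n, lam, k, e_threshold=0.01))
```

With μ chosen this way, p = 1 − e^(−E), so for small values the p-value and the expect value are nearly equal.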
Use of BLAST

• Finding genes in a genome
• Predicting a protein function
• Predicting a protein 3-D structure
• Finding protein family members
BLAST SEARCH STEPS

• Step 1: specify sequence of interest
• Step 2: select BLAST program
• Step 3: select a database
• Step 4: select optional search parameters
Step 1: specify sequence of interest

The data input (query) can take two forms:

• DNA or protein sequence in FASTA format
• An accession number or GI (GenBank identifier)
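FASTA format is a header line starting with ">" followed by one or more lines of sequence. A minimal parsing sketch is shown here; the function name `parse_fasta` and the example records are hypothetical.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 example\nACGT\nTTGA\n>seq2\nMKT\n"
    for name, sequence in parse_fasta(example):
        print(name, sequence)
```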
Step 2: select BLAST program
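The five programs named in this lecture (blastn, blastp, blastx, tblastn, tblastx) differ in the type of query and database they compare. The lookup below is a small sketch of that choice; the names `PROGRAMS` and `programs_for` are hypothetical and do not belong to any BLAST interface.

```python
# program -> (query type, database type); "translated" means the nucleotide
# sequence is translated in six reading frames before comparison.
PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide (translated)", "protein"),
    "tblastn": ("protein", "nucleotide (translated)"),
    "tblastx": ("nucleotide (translated)", "nucleotide (translated)"),
}


def programs_for(query_type: str, db_type: str) -> list[str]:
    """List programs for a query/database pair ("nucleotide" or "protein")."""
    return sorted(name for name, (q, d) in PROGRAMS.items()
                  if q.startswith(query_type) and d.startswith(db_type))


if __name__ == "__main__":
    print(programs_for("nucleotide", "protein"))
    print(programs_for("nucleotide", "nucleotide"))
```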
Step 3: select a database

• For protein database searches (blastp and blastx), the default option is the nonredundant (nr) database (GenBank, the Protein Data Bank (PDB), SwissProt, PIR, and PRF). Another option is to search only RefSeq proteins.

• For DNA database searches (blastn, tblastn, tblastx), the default option is to search the human (or mouse) genomic plus transcript database. Other commonly used options include the nucleotide nr database (GenBank, EMBL, DDBJ, and PDB).
Step 4: select optional search parameters
BLAST result
