05 CAP5510 Fall21

This document discusses database search methods for biological sequences. It begins by explaining what a database search is and some of the key issues involved. It then describes several popular heuristic methods: FASTA, BLAST, and suffix trees. FASTA and BLAST use hash tables to quickly find local alignments. Statistical significance of results is also discussed. Variations of BLAST as well as other sequence comparison tools are mentioned.

Uploaded by

Arman Singhal

CAP5510 – Bioinformatics

Database Searches for Biological Sequences
Tamer Kahveci
CISE Department
University of Florida

Goals
• Understand how major heuristic methods for sequence comparison work
  – FASTA
  – BLAST
• Understand how search results are evaluated

What is Database Search?
[Figure: a query is searched against either many long sequences or one giant sequence]

What is Database Search?
[Figure: two giant sequences compared with each other]

What is Database Search?
• Find a particular (usually short) sequence in a database of sequences (or in one huge sequence).
• The problem is the same as local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
  – A database search always returns some kind of hit. How much attention should be paid to the result?
• A similar problem is the global alignment of two large sequences.
• General idea: good alignments contain high-scoring regions.

Database Search Issues
• How can we search a massive space quickly?
• How can we evaluate the significance of the result?

Database Search Methods
• Hash table based methods
  – FASTA family
    • FASTP, FASTA, TFASTA, FASTX, FASTY
  – BLAST family
    • BLASTP, BLASTN, TBLASTN, BLASTX, BLAT, BLASTZ, MegaBLAST, PSI-BLAST, PHI-BLAST
  – Others
    • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
  – MUMmer, AVID, REPuter, MGA, QUASAR

Hash Table
• K-gram = substring (contiguous subsequence) of length K
• A^K entries
  – A is the alphabet size
• Linear-time construction
• Constant lookup time
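
As a minimal sketch of the k-gram table described above (not the indexing code of any particular tool), the Python below maps every k-gram of a sequence to its start positions. The function name `build_kgram_index` and the 0-based positions are assumptions made for illustration. A dictionary stands in for the A^K-entry table, so only k-grams that actually occur are stored.

```python
from collections import defaultdict

def build_kgram_index(sequence, k):
    """Map each k-gram to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    # A single pass over the sequence gives linear-time construction.
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index

index = build_kgram_index("TGAGTGCGA", k=2)
print(index["TG"])           # [0, 4]
print(index.get("CC", []))   # [] because CC never occurs
```

Looking up a k-gram is a single dictionary access, which is the constant-time lookup the slide refers to.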

FASTP
Lipman & Pearson, 1985

FASTP
• Three-phase algorithm
  1. Find short good matches using k-grams
     – K = 1 or 2
  2. Find start and end positions for good matches
  3. Use DP to align good matches

FASTP: Phase 1 (1)
position    1  2  3  4  5  6  7  8  9  10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

amino acid   position in   position in   offset
             protein 1     protein 2     (pos 1 - pos 2)
-----------------------------------------------------
a            6             6              0
c            2             7             -5
k            -             11
n            1             -
p            4             9             -5
r            -             10
s            3             8             -5
t            5             -
-----------------------------------------------------
Note the common offset (-5) for the 3 amino acids c, s and p.
A possible alignment can be found quickly:
protein 1   n  c  s  p  t  a
               |  |  |
protein 2   a  c  s  p  r  k
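
The offset table above can be reproduced with a small lookup-table sketch. This is an illustration only, not the original FASTP code: the function name `offset_histogram`, the use of `.` for the unspecified residues, and the 1-based positions are assumptions chosen to match the slide.

```python
from collections import Counter, defaultdict

def offset_histogram(seq_a, seq_b, k=1, skip="."):
    """Count shared k-grams per offset (position in seq_a - position in seq_b)."""
    positions_a = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        gram = seq_a[i:i + k]
        if skip not in gram:
            positions_a[gram].append(i + 1)      # 1-based, as on the slide
    histogram = Counter()
    for j in range(len(seq_b) - k + 1):
        gram = seq_b[j:j + k]
        if skip in gram:
            continue
        for pos_a in positions_a.get(gram, ()):
            histogram[pos_a - (j + 1)] += 1
    return histogram

protein_1 = "ncspta....."
protein_2 = ".....acsprk"
hist = offset_histogram(protein_1, protein_2)
best_offset, votes = hist.most_common(1)[0]
print(best_offset, votes)   # -5 3
```

The offset with the most votes (-5, shared by c, s and p) identifies the diagonal on which the two proteins are locally similar.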

FASTP: Phase 1 (2)
• Similar to a dot plot
• Offsets range from 1 - m to n - 1
• Each offset is scored as
  – # matches - # mismatches
• Diagonals (offsets) with a large score show local similarities

• How does it depend on

FASTP: Phase 2
• The 5 best diagonal runs are found
• Rescore these 5 regions using PAM250.
  – Initial score
• Indels are not considered yet
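
One way to picture the rescoring of a diagonal is as a search for the best-scoring ungapped run along it. The sketch below is an assumption-laden illustration, not FASTP's actual procedure: `best_diagonal_run` is a hypothetical helper, it uses a maximum-sum scan over one diagonal, and `toy_score` (+2 match, -1 mismatch) is a made-up stand-in for a real PAM250 lookup.

```python
def toy_score(x, y):
    """Made-up scoring used in place of PAM250 for illustration only."""
    return 2 if x == y else -1

def best_diagonal_run(seq_a, seq_b, offset, score):
    """Best ungapped run on the diagonal i - j == offset (0-based positions).

    Returns (best_score, start_in_a, end_in_a); indels are not considered.
    """
    best = (0, None, None)
    run_score, run_start = 0, None
    for i in range(len(seq_a)):
        j = i - offset
        if j < 0 or j >= len(seq_b):
            continue
        if run_score <= 0:                 # a non-positive prefix never helps
            run_score, run_start = 0, i
        run_score += score(seq_a[i], seq_b[j])
        if run_score > best[0]:
            best = (run_score, run_start, i)
    return best

# The two six-residue windows from the Phase 1 example, on diagonal 0.
print(best_diagonal_run("ncspta", "acsprk", 0, toy_score))   # (6, 1, 3)
```

The run found covers c, s and p; its score plays the role of the initial score of that region.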

FASTP: Phase 3
• Sort the aligned regions by descending score
• Optimize these alignments using Needleman-Wunsch
• Report the results
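
For reference, a minimal sketch of the Needleman-Wunsch score computation follows. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions for illustration; a real search would use a substitution matrix such as PAM250 and would also recover the alignment itself, not only its score.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("GATTACA", "GATTACA"))   # 7
print(needleman_wunsch_score("AC", "AGC"))            # 0
```

Keeping only two rows of the DP table gives memory linear in the length of the second sequence, which is enough when only the score is needed.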

FASTP - Discussion
• Results are not optimal. Why?
• How does performance compare to Smith-Waterman?
• What is the impact of k?
• How does this idea work for DNA?
  – K = 4 or 6 for DNA

FASTA – Improvement Over FASTP
Pearson 1995

FASTA (1)
• Phase 2: Choose 10 best diagonal runs instead of 5

FASTA (2)
• Phase 2.5
  – Eliminate diagonals that score less than some given threshold.
  – Combine matches to find longer matches. Joining incurs a join penalty, similar to a gap penalty.

BLAST
Altschul, Gish, Miller, Myers, Lipman, 1990

BLAST (or BLASTP)
• BLAST – Basic Local Alignment Search Tool
• An approximation of Smith-Waterman
• Designed for database searches
  – Short query sequence against a long database sequence or a database of many sequences
• Sacrifices search sensitivity for speed

BLAST Algorithm (1)
• Eliminate low-complexity regions from the query sequence.
  – Replace them with X (protein) or N (DNA)
• Build a hash table on the query sequence.
  – K = 3 for proteins
• Example: the query MCGPFILGTYC yields the k-grams MCG, CGP, ...

BLAST Algorithm (2)
• For each k-gram, find all k-grams that align with a score of at least cutoff T using BLOSUM62
  – 20^K candidates
  – ~50 on average per k-gram
  – ~50n for the entire query
• Build the hash table
• Example: the sequence PQGMCGPFILGTYC yields the k-grams PQG, QGM, ...; candidate neighbors of PQG with T = 13:

k-gram   score
PQG      18
PEG      15
PRG      14
PSG      13
-------------- T = 13
PQA      12
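
The neighbor scores in the PQG example can be checked with a short sketch. Only the handful of BLOSUM62 entries needed for these five words are included, so `BLOSUM62_SUBSET` is deliberately incomplete, and the helper names are assumptions for illustration. A real implementation would score all 20^K candidate words against the full matrix.

```python
# Only the BLOSUM62 entries needed for this example (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(x, y):
    """Look up a residue pair; raises KeyError outside the subset above."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]

def word_score(word, candidate):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(x, y) for x, y in zip(word, candidate))

def neighbors(word, candidates, threshold):
    """Keep the candidates whose score against word is at least threshold."""
    scored = [(c, word_score(word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]

candidates = ["PQG", "PEG", "PRG", "PSG", "PQA"]
hits = neighbors("PQG", candidates, threshold=13)
print(hits)   # [('PQG', 18), ('PEG', 15), ('PRG', 14), ('PSG', 13)]
```

PQA scores 12 and is dropped, matching the T = 13 line in the example.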

BLAST Algorithm (3)
• Sequentially scan the database and locate each k-gram in the hash table
• Each match is a seed for an ungapped alignment.

BLAST Algorithm (4)
• HSP (High-scoring Segment Pair) = a match between a query word and the database
• Find a “hit”: two non-overlapping HSPs on a diagonal within distance A
• Extend the hit until the score falls more than a threshold value, X, below the best score seen so far
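
The extension step can be sketched as follows, under stated assumptions: only rightward ungapped extension is shown, `toy_score` (+1 match, -1 mismatch) is a made-up stand-in for BLOSUM62, and the stopping rule is read as "stop once the running score has dropped more than X below the best score so far". The function name and parameters are hypothetical.

```python
def toy_score(x, y):
    """Made-up scoring used in place of BLOSUM62 for illustration only."""
    return 1 if x == y else -1

def extend_right(query, subject, q_start, s_start, seed_len, x_drop, score):
    """Extend a seed to the right without gaps; return (best_score, best_length)."""
    current = sum(score(query[q_start + i], subject[s_start + i])
                  for i in range(seed_len))
    best, best_len = current, seed_len
    i = seed_len
    while q_start + i < len(query) and s_start + i < len(subject):
        current += score(query[q_start + i], subject[s_start + i])
        i += 1
        if current > best:
            best, best_len = current, i
        elif best - current > x_drop:
            break                      # fell too far below the best score
    return best, best_len

query = "MCGPFILGTYC"
subject = "PQGMCGPFILGTYC"
# Seed: the word MCG at query position 0 and subject position 3 (0-based).
print(extend_right(query, subject, 0, 3, 3, x_drop=2, score=toy_score))   # (11, 11)
```

A larger `x_drop` lets the extension cross longer stretches of mismatches before giving up, at the cost of more work per seed.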

BLAST Algorithm (5)
• Keep only the extended matches that have a score of at least S.
• Determine the statistical significance of the result

What is Statistical Significance?
• Two one-on-one games, two scores: 13 : 15 and 13 : 15.
• Which result is more significant?
• Expected: may be a random result.
• Unexpected: significant, may have a significant meaning.

Statistical Significance
• E-value: the expected number of matches with score at least S
  – $E = K m n e^{-\lambda S}$
  – m, n: sequence lengths
  – S: alignment score
  – K, $\lambda$: normalization parameters
• P-value: the probability of having at least one match with score at least S
  – $P = 1 - e^{-E}$
• The smaller these values are, the more significant the result
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
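
The two formulas translate directly into code. In the sketch below the function names are assumptions, and the values passed for K and lambda are placeholders rather than parameters of any real scoring system; in practice they depend on the substitution matrix and gap penalties in use.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of matches with score >= score: E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such match: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e)

# Placeholder parameters, for illustration only.
e = e_value(score=60, m=300, n=1_000_000, K=0.1, lam=0.3)
print(e, p_value(e))
```

For small E the two values are nearly equal, since 1 - exp(-E) is approximately E when E is close to zero.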

BLAST - Analysis
• K (k-gram)
  – Lower: more sensitive. Slower.
• T (neighbor cutoff)
  – Lower: finds distant neighbors. Introduces noise.
• X (extension cutoff)
  – Higher: lower chance of getting stuck in a local minimum. Slower.

Sample Query
• http://www.ncbi.nlm.nih.gov/BLAST/

Dhal_ecoli

IDRAMSAARGVFERGDWSLSSPAKRKAVLNKLADLMEAH
AEELALLETLDTGKPIRHSLRDDIPGAARAIRWYAEAIDK
VYGEVATTSSHELAMIVREPVGVIAAIVPWNFPLLLTCW
KLGPALAAGNSVILKPSEKSPLSAIRLAGLAKEAGLPDGVL
NVVTGFGHEAGQALSRHNDIDAIAFTGSTRTGKQLLKDA
GDSNMKRVWLEAGGKSANIVFADCPDLQQAASATAAGI
FYNQGQVCIAGTRLLLEESIADEFLALLKQQAQNWQPG
HPLDPATTMGTLIDCAHADSVHSFIREGESKGQLLLDGR
NAGLAAAIGPTIFVDVDPNASLSREEIFGPVLVVTRFTSE
EQALQLANDSQYGLGAAVWTRDLSRAHRMSRRLKAGSV
FVNNYNDGDMTVPFGGYKQSGNGRDKSLHALEKFTELKT
IWI

BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood search.

BLAST Variations
Program   Query          Target         Type
BLASTP    Protein        Protein        Gapped
BLASTN    Nucleic acid   Nucleic acid   Gapped
BLASTX    Nucleic acid   Protein        Gapped
TBLASTN   Protein        Nucleic acid   Gapped
TBLASTX   Nucleic acid   Nucleic acid   Gapped

Even More Variations
– PSI-BLAST (iterative)
– BLAT, BLASTZ, MegaBLAST
– FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
– Main differences are
  • Seed choice (k, gapped seeds)
  • Additional data structures

Suffix Trees

Suffix Tree
• Tree structure that contains all suffixes of the input sequence

• TGAGTGCGA
• GAGTGCGA
• AGTGCGA
• GTGCGA
• TGCGA
• GCGA
• CGA
• GA
• A

Suffix Tree Example

Suffix Tree Analysis
• O(n) space and construction time
  – 10n to 70n space usage reported
• O(m) search time for an m-letter sequence
• Good for
  – Small data
  – Exact matches

Suffix Array
• 5 bytes per letter
• O(m log n) search time
• Better space usage
• Slower search
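
To make the O(m log n) search concrete, here is a minimal suffix array sketch. It is an illustration under simplifying assumptions: the construction simply sorts all suffixes, which is far slower than the specialized algorithms used by real tools, and the function names are hypothetical. The search does two binary searches, each comparing at most m letters per step.

```python
def build_suffix_array(text):
    """Start positions of all suffixes in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """All 0-based start positions of pattern, found by binary search."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:                       # first suffix whose m-prefix >= pattern
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                       # first suffix whose m-prefix > pattern
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa)                                 # [8, 2, 6, 7, 1, 5, 3, 0, 4]
print(find_occurrences(text, sa, "GA"))   # [1, 7]
```

The array stores one integer per letter, which is where the space advantage over a suffix tree comes from.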

MUMmer

Other Sequence Comparison Tools
• REPuter, MGA, AVID
• QUASAR (suffix array)
