
Fundamentals of Bioinformatics

Lecture 5
Dr. Marwa N.M.E. Sanad

Determining homology

• Alignment
• Ancestor
• Identity
• Similarity
• Homology
• Analogous
• Ortholog
• Paralog
Homology (common ancestor)

Figure sources:
http://evolution.berkeley.edu/evolibrary/article/0_0_0/similarity_ms_06
http://www.ncbi.nlm.nih.gov/books/NBK62051/

Analogy (convergent evolution, no common ancestor)

[Figure: fish and mammals]

Sequence Alignment
• Pairwise alignment (using 2 sequences)
   o Dot plot: for fewer than 500 residues of DNA/protein sequence
   o Sliding (global or local): good for large or short sequences
• Multiple alignment (using more than 2 sequences)
   o Sliding (local): good for large or short sequences

Dotlet
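
The idea behind a dot plot can be sketched in a few lines. This is a minimal illustration and not Dotlet's own method: it marks every position where a residue of one sequence equals a residue of the other, whereas dot-plot tools typically score a sliding window. The function name `dot_plot` and the two short sequences are made up for the example.

```python
def dot_plot(seq1, seq2):
    """Return one text row per residue of seq1; '*' marks a match with seq2."""
    rows = []
    for x in seq1:
        rows.append("".join("*" if x == y else "." for y in seq2))
    return rows


# Two hypothetical sequences that differ at one position.
for row in dot_plot("GATTACA", "GATCACA"):
    print(row)
```

Similar regions show up as runs of `*` along a diagonal; a break in the diagonal marks a mismatch.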
Sequence Alignment
Global alignment
• Pairwise alignment
• Sliding alignment
Local alignment
• Pairwise alignment
• Multiple alignment
• Smith-Waterman algorithm
• Sliding alignment
cgggta--tccaa    ("--" is a gap)
ccc-taggtccca    ("-" is an indel)

Indel: an insertion or a deletion

Gap: a run of consecutive indels

A scoring scheme:
Used to discriminate between good and bad alignments.
Score of alignment =
Σ (scores of identities and mismatches) − Σ (gap penalties)
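
The scoring formula can be tried on the example alignment from the gap/indel slide. The sketch below is only an illustration: the scores (+1 per identity, -1 per mismatch, 2 per gap position) are assumed values, not ones given in the lecture, and `alignment_score` is a hypothetical helper name.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap_penalty=2):
    """Score = sum(identity and mismatch scores) - sum(gap penalties)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    substitution_sum = 0
    gap_sum = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gap_sum += gap_penalty          # one penalty per gap position
        elif x == y:
            substitution_sum += match       # identity
        else:
            substitution_sum += mismatch    # mismatch
    return substitution_sum - gap_sum


# 7 identities, 3 mismatches, 3 gap positions: 7 - 3 - 6
print(alignment_score("cgggta--tccaa", "ccc-taggtccca"))  # -2
```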
Substitution Matrices

[Figure: substitution matrices for scoring mismatches between nucleotides and between amino acids]

• Substitution matrices should reflect the true probabilities of mutations occurring over a period of evolution
• They are constructed by measuring the relative frequency of amino acid changes in a set of homologous protein sequences
The substitution matrices
• PAM (Point Accepted Mutation, also written Percent Accepted Mutation)
• BLOSUM (BLOcks SUbstitution Matrix)

PAM matrix      0     30    80    110   200   250
% identity      100   75    50    40    25    20

BLOSUM matrix   80    62    30
% identity      80    62    30
Determining the substitution matrices

          Closely related sequences    Distantly related sequences
PAM       Lower PAM                    Higher PAM
BLOSUM    Higher BLOSUM                Lower BLOSUM

http://www.nature.com/nmeth/journal/v7/n3s/fig_tab/nmeth.1434_F2.html
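
The rule "closely related: lower PAM, higher BLOSUM" can be expressed as a small lookup. This is a rough sketch, not an official selection procedure: it uses only a subset of the (matrix, % identity) pairs listed on the slides, and picking the nearest entry is an assumption made for illustration. The names `nearest` and `suggest_matrices` are hypothetical.

```python
# (matrix number, approximate % identity) pairs, a subset of the slide values
PAM_IDENTITY = [(30, 75), (80, 50), (200, 25), (250, 20)]
BLOSUM_IDENTITY = [(80, 80), (62, 62), (30, 30)]


def nearest(table, percent_identity):
    """Return the matrix number whose listed identity is closest."""
    return min(table, key=lambda row: abs(row[1] - percent_identity))[0]


def suggest_matrices(percent_identity):
    """Suggest one PAM and one BLOSUM matrix for an expected % identity."""
    return ("PAM%d" % nearest(PAM_IDENTITY, percent_identity),
            "BLOSUM%d" % nearest(BLOSUM_IDENTITY, percent_identity))


print(suggest_matrices(85))  # closely related
print(suggest_matrices(22))  # distantly related
```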
Function Prediction

• Aligned sequences with identity below 25% (amino acids) or 70% (nucleotides) fall in the twilight zone, where homology cannot be claimed from the alignment alone
• Identify conserved domains/elements in sequences
• Compare regions of similarity among multiple organisms
• Identify low-complexity regions
• Predict structural/functional relationships
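
The twilight-zone thresholds in the first bullet can be written as a simple check. The thresholds are the ones stated on the slide; the function name and the dictionary keys are made up for this sketch.

```python
# Identity thresholds (percent) below which homology cannot be claimed
TWILIGHT_THRESHOLD = {"protein": 25.0, "nucleotide": 70.0}


def in_twilight_zone(percent_identity, sequence_type):
    """True if the alignment identity is below the threshold for its type."""
    return percent_identity < TWILIGHT_THRESHOLD[sequence_type]


print(in_twilight_zone(22.0, "protein"))     # True
print(in_twilight_zone(65.0, "nucleotide"))  # True
print(in_twilight_zone(40.0, "protein"))     # False
```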
Substitution Matrices

• Conservative substitutions often hardly alter protein function/structure
• Substitutions in evolution can be predicted through constructed matrices
Learning Outcomes

1. Understand which BLAST program is appropriate for a given search.
2. Understand the most important parameters that can change your alignment or cause errors in it.
3. Understand how to interpret the E-value and your output data.
Basic Local Alignment Search Tool (BLAST)

• It is an algorithm for comparing primary biological sequence information
• It is a heuristic approach to local sequence alignment that searches for high-scoring segment pairs (HSPs)
• It finds other sequences that are similar to the query nucleotide or amino acid sequence
• It is most commonly run from the NCBI website
High-Scoring Segment Pair (HSP)

http://en.wikipedia.org/wiki/BLAST

Why use an alignment scoring function?

• To distinguish a bad alignment from a good alignment
• To choose the alignment that has the maximum score

Five components to a BLAST search
(1) Select the BLAST program
(2) Retrieve, then upload your sequence (query)
➢ Example: use accession JQ680980, FASTA format, or an uploaded sequence file
(3) Choose the search set
➢ Choose or exclude the [database, organism]
(4) Choose the program selection
➢ For nucleotide [megablast, discontiguous megablast, blastn]
➢ For protein [blastp, PSI-BLAST, PHI-BLAST, DELTA-BLAST]
(5) Choose optional parameters
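
The same five choices appear as options of a command-line search. The sketch below assumes a local installation of the NCBI BLAST+ `blastn` program, whereas the lecture itself works with the NCBI web page. The file names `query.fasta` and `results.txt` are hypothetical, and the command is only assembled here, not run.

```python
def build_blastn_command(query_file, database, task, evalue, word_size, out_file):
    """Assemble (but do not run) a BLAST+ blastn command as an argument list."""
    return [
        "blastn",                      # (1) the BLAST program
        "-query", query_file,          # (2) the query sequence, in FASTA format
        "-db", database,               # (3) the search set (database)
        "-task", task,                 # (4) program selection, e.g. megablast
        "-evalue", str(evalue),        # (5) optional parameters
        "-word_size", str(word_size),
        "-out", out_file,
    ]


command = build_blastn_command("query.fasta", "nt", "megablast", 10, 28, "results.txt")
print(" ".join(command))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without shell-quoting problems.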
Step1: BLAST Programs
• blastp:
compares an amino acid query sequence against a protein sequence
database.
• blastn
compares a nucleotide query sequence against a nucleotide sequence
database.
• blastx
compares the six-frame conceptual translation products of a nucleotide
query sequence against a protein sequence database
• tblastn
compares a protein query sequence against a nucleotide sequence
database dynamically translated in all six reading frames (both strands).
• tblastx
compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database.
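
The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be illustrated as follows. This is a simplified sketch that assumes the standard genetic code and an unambiguous A/C/G/T sequence. It is not BLAST's implementation, and all names are made up for the example.

```python
# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' is a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Three reading frames on each of the two strands."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frame_translations("ATGGCCTAA"))
```

Each nucleotide sequence yields six protein sequences, which is why translated searches involve many more comparisons than blastn or blastp.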
Step1: BLAST Programs

Program   Query     Database   Comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
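
The program table can be read as a lookup, and the comparison counts follow from the number of reading frames on each side (6 x 6 = 36 for tblastx). A minimal sketch, with hypothetical names `PROGRAMS`, `choose_program` and `comparisons`:

```python
# program: (query type, database type, query frames, database frames)
PROGRAMS = {
    "blastn":  ("DNA", "DNA", 1, 1),
    "blastp":  ("protein", "protein", 1, 1),
    "blastx":  ("DNA", "protein", 6, 1),
    "tblastn": ("protein", "DNA", 1, 6),
    "tblastx": ("DNA", "DNA", 6, 6),
}


def choose_program(query, database, translated):
    """Pick the program for a query/database type; translated=True means
    nucleotide sequences are compared as six-frame translations."""
    for name, (q, d, query_frames, db_frames) in PROGRAMS.items():
        if q == query and d == database and (query_frames * db_frames > 1) == translated:
            return name
    raise ValueError("no BLAST program for this combination")


def comparisons(name):
    """Number of frame-by-frame comparisons per pair of sequences."""
    _, _, query_frames, db_frames = PROGRAMS[name]
    return query_frames * db_frames


print(choose_program("DNA", "protein", True), comparisons("blastx"))
print(choose_program("DNA", "DNA", True), comparisons("tblastx"))
```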

Step1: Guidance for BLAST Programs

• Which BLAST to use:

o blastn: interested in non-coding DNA

o tblastx: discover new proteins

o blastx: discover proteins encoded in my DNA sequence

o blastx: unsure of DNA quality

Step2: Upload your sequence

1. Run BLAST directly from the NCBI page
2. Enter the accession number of your sequence
3. Save the sequence in FASTA format to a file, then browse to the file
4. Copy and paste the sequence
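
For options 3 and 4 the query must be in FASTA format: a header line starting with `>` followed by one or more sequence lines. The sketch below reads such text into a dictionary. The record names and sequences are invented for illustration, and `parse_fasta` is a hypothetical helper, not part of any BLAST tool.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>example_1 hypothetical record
ACGTACGT
ACGT
>example_2 another hypothetical record
TTGACA
"""
records = parse_fasta(example)
print(records)
```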


Step3: Search Set
a. Choose the database
Default database choice: nucleotide collection (nr/nt)
nr = non-redundant = the most general database
Includes GenBank, EMBL, DDBJ, PDB and RefSeq sequences; excludes EST, STS, GSS, WGS and TSA sequences

b. Choose the organism
Choose an organism to include in, or exclude from, your search set
Step4: Program selection

Nucleotide
• Megablast: highly similar sequences
• Discontiguous megablast: more dissimilar sequences
• Blastn: somewhat similar sequences

Peptide
• Blastp: protein-protein BLAST
• PSI-BLAST: Position-Specific Iterated BLAST
• PHI-BLAST: Pattern-Hit Initiated BLAST
• DELTA-BLAST: Domain Enhanced Lookup Time Accelerated BLAST
Step 5: The algorithm parameters
[a] General properties: word size, threshold
1- Expect (E) value:
The number of matches expected to occur by chance in a search

2- Expect threshold:
Lower Expect thresholds are more stringent, leading to fewer chance matches being reported.

[a] General properties
3- Word size (k-letter word):
The default is 11 for nucleotides and 3 for proteins. Another word size can be chosen from the menu: a smaller word size makes the search more sensitive but slower, and a larger one makes it faster.

4- Max matches in a query range:
Limits the number of matches to a query range. This option is useful if many strong matches to one part of a query may prevent BLAST from presenting weaker matches to another part of the query.
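
The effect of the word size can be seen by listing the k-letter words of two sequences and counting the ones they share. This is a simplified sketch of exact-word seeding only. The two sequences are invented (they differ at a single position), and the helper names are hypothetical.

```python
def words(seq, k):
    """All overlapping k-letter words of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_words(query, subject, k):
    """Words of length k present in both sequences (possible seeds)."""
    return sorted(set(words(query, k)) & set(words(subject, k)))


query = "ACGTACGTTAGC"
subject = "ACGTACCTTAGC"   # one mismatch relative to the query

print(shared_words(query, subject, 3))  # short words: several seeds survive
print(shared_words(query, subject, 7))  # long words: the mismatch removes them all
```

With the larger word size no seed is found, so the search is faster but this similarity would be missed.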

Step 5: The optional parameters (blastn)


Step 5: The optional parameters (blastp)


Accepted parameters for the "Other advanced" field
The default parameters are not always the right parameters for your sequence.

Option   Meaning                                       Default
G        Cost to open a gap [Integer]                  5 for nucleotides / 11 for proteins
E        Cost to extend a gap [Integer]                2 for nucleotides / 1 for proteins
q        Penalty for a nucleotide mismatch [Integer]   -3
r        Reward for a nucleotide match [Integer]       1
e        Expect value [Real]                           10
W        Word size [Integer]                           11 for nucleotides / 28 for megablast / 3 for proteins

[b] Scoring parameters

For nucleotide sequences

1. Match and mismatch:
Reward and penalty for matching and mismatching bases

2. Gap cost
• Written as Existence : Extension, the cost to create a gap and the cost to extend it in an alignment
• Increasing the gap costs decreases the number of gaps introduced
• Linear costs are available only with megablast and are determined by the match/mismatch scores
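
How the nucleotide scoring parameters combine can be sketched with the default values listed earlier (reward 1, penalty -3, gap existence 5, gap extension 2). The sketch assumes the usual affine convention in which a gap of length k costs existence + k x extension, and it reuses the example alignment from the gap/indel slide. `raw_score` is a hypothetical name.

```python
def raw_score(a, b, reward=1, penalty=-3, gap_open=5, gap_extend=2):
    """Raw score of a gapped nucleotide alignment with affine gap costs."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_in = None                   # which sequence held the last gap
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            gap_in = "a" if x == "-" else "b"
            if gap_in != previous_gap_in:
                score -= gap_open            # a new gap starts here
            score -= gap_extend              # every gap position is extended
            previous_gap_in = gap_in
        else:
            score += reward if x == y else penalty
            previous_gap_in = None
    return score


# 7 matches (+7), 3 mismatches (-9), one gap of length 1 (-7), one of length 2 (-9)
print(raw_score("cgggta--tccaa", "ccc-taggtccca"))  # -18
```

Raising `gap_open` or `gap_extend` lowers the score of gapped alignments, which is why higher gap costs lead to fewer gaps being introduced.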
[b] Scoring parameters

For amino acid sequences

1. Matrix:
Substitution matrices (PAM and BLOSUM)

2. Gap cost
• Written as Existence : Extension
• Cost to create and extend a gap in an alignment

3. Composition adjustment (associated with DELTA-BLAST)

[c] Filters and Masking
1. Filter (complexity)
• Masks off regions of the query sequence that have low compositional complexity
• Masks repeating sequences, speeding up the search

2. Mask
• Masking for the look-up table only is experimental and eliminates hits based on low-complexity sequences
• Mask lower case: search only the upper-case sequence
Low-complexity sequence
• Unusual composition
• Can often be recognized by visual inspection
• For example:
   o Protein sequence: PPCDPPPPPKDKKKKDDGPP
   o Nucleotide sequence: AAATAAAAAAAATAAAAAAT
• Filters are used to remove low-complexity sequence because it can cause artifactual hits.

Low-complexity sequence
Note:
A low-complexity sequence is a region of a sequence composed of few kinds of elements. These regions might give high scores that keep the program from finding the actually significant sequences in the database, so they should be filtered out. The regions are marked with an X (protein sequences) or N (nucleic acid sequences) and are then ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the DUST program is used for DNA sequences. The XNU program, on the other hand, is used to mask off tandem repeats in protein sequences.
Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
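
One simple way to see what "composed of few kinds of elements" means is to measure the Shannon entropy of a window of sequence. The sketch below is not the SEG or DUST algorithm. It is an assumed, simplified stand-in that masks every window whose entropy falls below a threshold, and the window size and threshold are arbitrary illustration values.

```python
from collections import Counter
from math import log2


def shannon_entropy(s):
    """Entropy in bits per residue; low values mean few kinds of residues."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=10, threshold=1.0, mask_char="N"):
    """Replace every residue lying in a low-entropy window by mask_char."""
    seq = seq.upper()
    masked = list(seq)
    w = min(window, len(seq))
    for start in range(len(seq) - w + 1):
        if shannon_entropy(seq[start:start + w]) < threshold:
            for i in range(start, start + w):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("AAATAAAAAAAATAAAAAAT"))   # fully masked
print(mask_low_complexity("ACGTACGTACGTACGTACGT"))   # left unchanged
```

For protein sequences the same function could be called with `mask_char="X"`, matching the marking convention described in the note.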
Upper case and lower case

• An upper-case letter in a DNA consensus sequence indicates that the nucleotide is conserved at that position among the sequences used to make the consensus.
• A lower-case letter is the most common nucleotide at a variable position.
• Protein sequences are always written in upper-case letters.
Mask lower case:
• With this option selected you can cut and paste a FASTA sequence in upper-case characters and denote the areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
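
The effect of the "Mask lower case" option can be imitated in a couple of lines: lower-case residues are treated as masked and upper-case residues are kept. This is only an illustration of the idea, with a made-up sequence and a hypothetical function name.

```python
def mask_lower_case(seq, mask_char="N"):
    """Replace lower-case residues by mask_char; keep upper-case residues."""
    return "".join(mask_char if ch.islower() else ch for ch in seq)


# The lower-case stretch is the region the user wants filtered.
print(mask_lower_case("ACGTacgtACGT"))  # ACGTNNNNACGT
```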
Interpreting Results
The Expect value (E)

➢ Describes the number of hits one can "expect" to see by chance. It decreases as the score (S) of the match increases.
➢ Is the expected number of matches (HSPs) with at least that score in a database of n sequences.
➢ Describes the random background noise.
➢ Gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used.
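
These properties follow from the standard Karlin-Altschul relation between the E-value and the bit score S', E = m x n x 2^(-S'), where m is the query length and n the total database length. The relation is not written on the slides. The sketch below uses it with invented lengths and ignores the effective-length corrections that BLAST applies.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: 100-residue query against a 1000-residue database
print(expect_value(20, 100, 1000))   # about 0.095
print(expect_value(30, 100, 1000))   # higher score, much lower E
print(expect_value(20, 100, 2000))   # larger database, higher E
```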
Interpreting Results
The Expect value (E)

➢ The lower the value, the more significant the hit. If you want to be certain of homology, your E-value must be lower than 10^-4 to 10^-6.
➢ A sequence alignment that has an E-value of 0.05 means that this similarity has roughly a 5 in 100 (1 in 20) chance of occurring by chance alone.
➢ Identical short alignments have relatively high E-values. This is because the calculation of the E-value takes into account the length of the query sequence.
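
The "5 in 100" reading of E = 0.05 rests on the usual Poisson assumption for chance hits, under which the probability of at least one chance hit is P = 1 - e^(-E). This relation is a standard result rather than something stated on the slides. For small E the two numbers are nearly equal, while for large E they diverge.

```python
from math import expm1


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -expm1(-e_value)


print(p_value_from_e_value(0.05))   # close to 0.05
print(p_value_from_e_value(10))     # close to 1: a chance hit is almost certain
```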

Interpreting Results
The Expect value (E)

➢ Shorter sequences have a higher probability of occurring in the database purely by chance.
➢ It is not easily compared between searches of different databases.
➢ Used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages.

Interpreting Results (continued)

• The % identity:
o A complement to the E-value.
o The percentage of aligned residues that are identical. Residues that are either identical or similar are reported separately as positives (+).

• Length:
o This is the length of the alignment, which indicates how long the two segments of your sequences that BLAST has aligned are.
o Note: very short alignments can come up with high E-values and not be very meaningful.
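
Percent identity and alignment length can be computed directly from an aligned pair. The sketch below uses the example alignment from the gap/indel slide and divides identical columns by the full alignment length, gaps included. That choice of denominator is an assumption of this sketch, and `percent_identity` is a hypothetical name.

```python
def percent_identity(a, b):
    """Identical columns as a percentage of the alignment length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * identical / len(a)


aligned_a = "cgggta--tccaa"
aligned_b = "ccc-taggtccca"
print(len(aligned_a))                                    # alignment length: 13
print(round(percent_identity(aligned_a, aligned_b), 1))  # 7 of 13 columns: 53.8
```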
Interpreting Results
• Generally:
o Bit scores below 50 are unreliable
o E-values greater than 0.0001 are often close to the twilight zone

• Note: Although programs like BLAST search databases through pairwise comparisons, these programs are optimized for speed, not for alignment accuracy.

Interpreting Results

The Bit score

The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.
In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.
A key element in this calculation is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The BLOSUM62 matrix is the default for most BLAST programs, the exceptions being blastn and MegaBLAST (programs that perform nucleotide–nucleotide comparisons and hence do not use protein-specific matrices). Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used.
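
The normalization mentioned above is usually written as S' = (lambda x S - ln K) / ln 2, where S is the raw score and lambda and K are statistical parameters of the scoring system. The formula is the standard Karlin-Altschul one and is not given on the slides. The lambda and K values below are assumed, illustrative values of the kind reported for gapped BLOSUM62 searches, not values taken from the lecture.

```python
from math import log


def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - log(k)) / log(2)


# Assumed, illustrative parameters for a gapped protein search
LAMBDA = 0.267
K = 0.041

print(round(bit_score(100, LAMBDA, K), 1))   # about 43.1 bits
```

Because lambda and K absorb the scale of the scoring system, bit scores obtained with different matrices can be compared directly.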
Troubleshooting
ERROR 1: "No significant similarity found"

Possible problem 1: Short query sequences. Short alignments may have Expect values above the default threshold, which is 10 on most pages, and are therefore not displayed.
Solution: Try increasing the Expect threshold (under 'Algorithm parameters').

Possible problem 2: Low-complexity regions are not allowed to initiate alignments, so if your query is largely low complexity, the filter may prevent all hits to the database.
Solution: On the Basic BLAST pages, adjust the filter settings in the section 'Filters and Masking', under 'Algorithm parameters'.
Troubleshooting
ERROR 2: "An error has occurred on the server, Too many HSPs to save all"

Possible problem 1: The total number of high-scoring segment pairs (HSPs) is far too many for the BLAST servers to return the results. This is rare, as the results have to amount to several hundred megabytes of information for this to happen. However, certain searches can generate a huge amount of data. Most typically this error occurs when the default filters are turned off or when the query sequences have repeat elements in them.

Solutions:
1. Enable the species-specific repeats filter, if applicable.
2. If using tblastx, try blastx instead. The tblastx program is very CPU intensive, as it translates not only the query in six reading frames but every database sequence as well. Often, using tblastx is a measure of last resort; a blastx search against a database of known proteins may provide what you need.
3. Search a smaller database, such as refseq_rna. Larger databases contain more sequences, and for some queries this results in numerous "background" hits. If you want a database of known mRNAs (and their translations), then refseq_rna is a good choice.
4. Break up large queries into smaller pieces and submit each piece in a separate search. A common cause of errors in BLAST is searching with a huge sequence, like a complete chromosome, against a large database like nr. This is better accomplished in portions rather than as one large, continuous sequence.
5. Limit the database by taxonomy. Start with large groups, such as mammals, bacteria, etc. Any taxonomic node or tax id number that you can find in the Taxonomy browser can be used in the 'Organism' text box.
6. You may be hitting a large number of 'PREDICTED' or 'hypothetical protein' records. If you do not want these hits, use an Entrez Query such as: all[filter] NOT predicted[title].
7. For megablast and blastn searches, try increasing the word size and/or decreasing the Expect threshold.
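
Breaking a large query into pieces, as suggested above, can be sketched as follows. The overlap between neighbouring pieces is an assumption added here so that a match spanning a cut point is not lost. The piece length, overlap and function name are illustrative choices only.

```python
def split_query(seq, piece_length, overlap=0):
    """Return (start position, piece) tuples covering the whole sequence."""
    if piece_length <= overlap:
        raise ValueError("piece_length must be larger than overlap")
    step = piece_length - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append((start, seq[start:start + piece_length]))
        if start + piece_length >= len(seq):
            break                        # the last piece reaches the end
    return pieces


# Toy example: 10 residues, pieces of 4 with an overlap of 1
print(split_query("ABCDEFGHIJ", 4, overlap=1))
```

Keeping the start position of each piece makes it possible to map hit coordinates back onto the original sequence.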
