Bif501 Handouts PDF Bif
MUHAMMAD IMRAN
BIF501 - Bioinformatics II
Topic - 1 Applications of Bioinformatics
Applications of Bioinformatics
1. Drug Development
2. Crop Improvement
3. Microbial Genome
4. Gene Therapy
5. Biotechnology
6. Comparative Study
7. Evolutionary Studies
8. Veterinary Science
9. Molecular Medicine
10. Personalized Medicine
11. Preventive Medicine
12. Waste Cleanup
13. Antibiotic Resistance
14. Alternate Energy Science
15. Insect Resistance
16. Climate Change Studies
17. Nutritional Quality
Drug development
✓ Drugs target only about 500 proteins
✓ Understanding disease mechanisms and using computational tools helps to identify and
validate new drug targets
Crop improvement
✓ Comparative genetics of the plant genomes
✓ Information obtained from the model crop systems can be used to suggest
improvements to other food crops.
✓ At present the complete genomes of Arabidopsis thaliana (thale cress) and
Oryza sativa (rice) are available.
extreme conditions.
Gene Therapy
➢ Gene therapy-used to treat, cure or even prevent disease
➢ Clinical trials
Biotechnology
➢ Archaeoglobus fulgidus and Thermotoga maritima
➢ Corynebacterium glutamicum
➢ Xanthomonas campestris
➢ Lactococcus lactis
Evolutionary studies
➢ The sequencing of genomes from all three domains of life; eukaryota, bacteria
and archaea
Personalized medicine
➢ Pharmacogenomics
➢ Sequence variants in DNA
➢ Trial and error to find the best drug
➢ Patient's genetic profile
➢ With the specific details of the genetic mechanisms of diseases
being unraveled, the development of diagnostic tests to measure a
person's susceptibility to different diseases may become a distinct
reality.
➢ Preventative actions such as change of lifestyle or having treatment
at the earliest possible stages when they are more likely to be
successful, could result in huge advances in our struggle to conquer
disease.
Waste cleanup
➢ Deinococcus radiodurans
➢ Potential usefulness in cleaning up waste sites that contain radiation
and toxic chemicals
Antibiotic Resistance
➢ Enterococcus faecalis
➢ Virulence region-resistant genes
➢ The discovery of the region, known as a pathogenicity island
Insect resistance
➢ Bacillus thuringiensis
➢ Control serious pests of cotton, maize and potatoes
➢ Insecticides can be reduced and hence the nutritional quality of the
crops is increased
Climate change Studies
➢ Increasing levels of carbon dioxide emission-global climate change.
➢ Study the genomes of microbes that use carbon dioxide as their sole
carbon source.
Improve nutritional quality
Classification:
Biological databases can be classified as
➢ Primary databases (these store the primary sequences)
➢ Secondary databases (these hold annotated versions of the primary sequences)
➢ Specialized databases (these are dedicated to a specific organism or hold data on a
particular disease)
Biological databases can also be classified on the basis of the type of data they contain,
such as:
➢ Nucleotide databases
➢ Protein databases
➢ RNA databases
➢ Genome databases
➢ Expression databases (Gene Expression Databases)
Issues:
Biological databases share the issues generally found in other databases. These issues are
related to quality assurance techniques advancing more slowly than the pace at which new
data is emerging. They are as follows:
Due to limited Q/A:
➢ Redundancy
➢ Inconsistency
➢ Incompatibility (format, terminology, data types, etc.)
Here, we have a diagram of a genomic DNA that has different exons (we know that in
eukaryotes, genes contain exons and introns). The gene is transcribed and the introns are
spliced out, so the mature mRNA carries only the exons. We can get cDNA from this mRNA
through reverse transcription and then store this cDNA in our databases, whereas the ESTs
are subsets within those cDNAs.
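The flow from genomic DNA to cDNA described above can be sketched in a few lines of Python. This is a toy illustration only: the genomic string, the exon coordinates and the function names are invented for the example, and real splicing and reverse transcription involve far more than string operations.

```python
# Toy model of exon splicing, transcription and reverse transcription.
# The sequence and the exon coordinates are hypothetical.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def splice_exons(genomic, exons):
    """Join the exon slices (0-based, end-exclusive) and drop the introns."""
    return "".join(genomic[start:end] for start, end in exons).upper()


def transcribe(coding_dna):
    """Write the coding strand as mRNA by replacing T with U."""
    return coding_dna.replace("T", "U")


def reverse_transcribe(mrna):
    """Return first-strand cDNA: the reverse complement of the mRNA, as DNA."""
    dna = mrna.replace("U", "T")
    return "".join(COMPLEMENT[base] for base in reversed(dna))


# Exons in upper case, intron in lower case (hypothetical gene).
genomic = "ATGCCgtaagtTTTAA"
exons = [(0, 5), (11, 16)]

mrna = transcribe(splice_exons(genomic, exons))
cdna = reverse_transcribe(mrna)
print(mrna)  # AUGCCUUUAA
print(cdna)  # TTAAAGGCAT
```

An EST would correspond to a short stretch read from one end of such a cDNA.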
Origin:
The nucleotide sequence databases were first assembled into GenBank (1982) at Los Alamos
National Laboratory (LANL), New Mexico, under the leadership of Walter Goad. GenBank is
now maintained under the umbrella of NCBI (National Center for Biotechnology Information).
NCBI is the central repository that stores multiple types of biological data, including
genomes, their assemblies, their sequencing data and their expression data. In this
diagram, we can see the page where you can search for any kind of data; a drop-down list
provides you with various options. The link to this page is
http://www.ncbi.nlm.nih.gov/.
Here is the page for GenBank. If you want to search for nucleotide and genome
sequences, this is the best resource.
NCBI was established in the United States.
National Center for Biotechnology Information
GenBank
INSDC:
• GenBank, DDBJ and EMBL joined together in the International Nucleotide Sequence
Database Collaboration (INSDC)
You can see in this diagram that NCBI (National Center for Biotechnology Information),
DDBJ (DNA Data Bank of Japan) and EBI (European Bioinformatics Institute) / ENA (European
Nucleotide Archive) form an international collaboration known as INSDC (International
Nucleotide Sequence Database Collaboration).
EMBL established EBI to deal with bioinformatics, and within EBI it established ENA to
maintain the DNA sequence datasets.
Here is the page of INSDC (International Nucleotide Sequence Database Collaboration), and
you can observe that all three collaborators' logos are there. Similarly, if you look at the
data, we have Next Generation reads, capillary reads, and information about samples and
annotated sequences, all on this first page.
(We'll discuss it later).
Growth of GenBank:
If we look at the growth of GenBank as shown in the figure below (left), we can see that the
number of bases in GenBank is now in the trillions, a huge number, with the record starting
in 1982. In these curves, blue is the growth of GenBank and red is the whole genome
sequences (WGS, which we are comparing), which start around 2003 or 2004, after the
publication of the Human Genome Project.
The number of bases appears to double roughly every 18 months, which means the growth is
huge and exponential.
Similarly, the number of sequences (right figure) was in the thousands in 1982, but now
there are more than a hundred million sequences in GenBank.
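To make the doubling claim concrete, here is a minimal sketch of exponential growth with a fixed doubling time. The 18-month figure comes from the text above; the starting values and function names are placeholders chosen for illustration, not actual GenBank statistics.

```python
import math


def projected_bases(initial_bases, months_elapsed, doubling_months=18.0):
    """Size after months_elapsed if the size doubles every doubling_months."""
    return initial_bases * 2 ** (months_elapsed / doubling_months)


def months_until(initial_bases, target_bases, doubling_months=18.0):
    """Months needed to grow from initial_bases to target_bases."""
    return doubling_months * math.log2(target_bases / initial_bases)


# Hypothetical numbers: 36 months is two doublings, so the size quadruples.
print(projected_bases(1_000_000, 36))  # 4000000.0
# Growing 1024-fold takes ten doublings, i.e. 180 months (15 years).
print(months_until(1, 1024))  # 180.0
```

At this rate a thousand-fold increase takes only about 15 years, which is why the curves look flat at first and then climb steeply.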
Figure legend: GenBank (blue curve), WGS (red curve).
Source: http://www.ncbi.nlm.nih.gov/genbank/statistics
Conclusions:
In the end, we conclude the following:
➢ Biological databases store biological data.
➢ INSDC is a joint venture of NCBI, EMBL and DDBJ.
➢ Growth of bases in GenBank is exponential, doubling about every 18 months.
Origin:
The first sequences to be collected were proteins (before nucleotide sequences), using Sanger
and Tuppy's method (1951). Common protein families like cytochromes were sequenced (as
in that era people were focusing on sequences from cytochrome molecules).
An atlas of protein sequences (mainly cytochromes) was assembled by Margaret Dayhoff and her
collaborators at the National Biomedical Research Foundation (NBRF) in the 1960s.
PIR (Protein Information Resource):
The collection (of Dayhoff and co-workers) became PIR (Protein Information Resource), which
is now a collaboration of NBRF, the Munich Information Center for Protein Sequences (MIPS)
and the Japan International Protein Information Database (JIPID).
Protein Sequences:
Here is the page of PIR (a member of the UniProt consortium). You can see 3 main sections:
the Protein Ontology, labelled as PRO; iProClass, where we can find the sequences; and
iProLINK, which tells us about the literature.
Here we look at PRO, the Protein Ontology. An ontology classifies proteins on the basis of
their functions, and the functions are arranged in a hierarchy, so there is a major (general)
function at the top and a progression towards more specific functions below it.
Here, we can see a PRO hierarchy in this example.
http://pir.georgetown.edu/
In this figure, we have iProLINK which provides literature information and most of the
research papers can be found here.
In this figure, we can see SCOP (Structural Classification of Proteins), a similar effort that
uses the structural elements of proteins and classifies them on the basis of those
elements into families, superfamilies, folds and then classes.
Class is the highest level in the SCOP hierarchy, and there are several major classes.
The link is: http://scop.mrc-lmb.cam.ac.uk/scop/
For example, we have the class that contains the all-alpha proteins; alpha helices are formed
by a particular arrangement of amino acids. A protein sequence is just a linear sequence of
amino acids; when it folds back on itself, it forms secondary structures, which are recognized
as alpha helices and beta sheets (we are not going into the details; you can take a molecular
biology course or search online for alpha helices and beta sheets).
The main idea here is that SCOP classifies proteins on the basis of those structures.
For example:
➢ alpha (the class of proteins that are made up of alpha helices)
➢ beta (the class of proteins that are made up of beta sheets)
➢ alpha/beta (alpha helices and beta strands alternate, one after the other)
➢ alpha + beta (the alpha helices are stacked together in one region and the beta strands
in a separate region)
Conclusions:
We conclude that:
✓ First sequences to be collected were Protein sequences.
✓ Protein databases are classified on the basis of sequences, motifs, structures and
structural alignments.
✓ Growth of sequences in protein databases is exponential (just as in the nucleotide
databases).
(Same Durbin whose book “Biological Sequence Analysis” we’ll consider in the latter
half of the course).
Here is the figure of the AceDB webpage, where you find C. elegans (a worm) and other
organisms.
Link for this page is: http://www.acedb.org/.
Examples:
TAIR (The Arabidopsis Information Resource), which is a database for Arabidopsis
(http://www.arabidopsis.org/), and SGD (Saccharomyces Genome Database,
http://www.yeastgenome.org/) use the AceDB system.
Once those genomes are available, we want graphical views where we can get reports and
see where different genes are located. For this purpose genome browsers were developed:
webpages where we can look at the different features within our genomes. The UCSC
(University of California, Santa Cruz) Genome Browser is one example (shown on the left)
and is one of the most widely used genome browsers.
The link to this browser is http://genome.ucsc.edu/.
In the figure of the UCSC Genome Browser, we see a chromosome at the top and, below it,
various lines known as tracks (for SNPs, genes, ESTs etc.). We can look at or zoom into
different regions of the genome by using such genome browsers.
Conclusion:
In the end, we conclude the following:
✓ Success of the Haemophilus influenzae genome project paved the way for other genome
sequencing projects
✓ Human Genome Project was accomplished by NHGRI and Celera (they were
working independently from one another).
✓ Genome browsers help in exploring different regions of the genome.
Here is the webpage of GEO, the Gene Expression Omnibus, which runs under NCBI (you
can reach the GEO database from the NCBI site). It holds different datasets with
expression profiles, where we can see the change in expression of genes across different
treatments, and we can also analyze this expression data. There is a tool called GEO2R, and
we can also use BLAST in it. (We'll discuss this later)
http://www.ncbi.nlm.nih.gov/geo/
GEO Architecture:
GEO has four kinds of records or data files (keeping in view the MIAME rules), as
follows:
✓ Sample (GSM) – these files store the sample information, such as how the samples were
prepared, how the treatments were given and how the experimental design was established.
✓ Platform (GPL) – these files describe the platform, so here we can see
whether it is microarray data or RNA-seq data (there are different protocols coming from
different agencies, so we can have that information).
✓ Series (GSE) – sometimes different treatments are recorded as separate files, so GSE
files group the related sample files together in the form of a series (a set of samples
that are somehow related).
✓ Datasets (GDS) – the actual data is stored as GDS files, which are sample data
collections assembled by GEO.
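The four record types above can be told apart by the three-letter prefix of the accession. The following is a minimal sketch of that idea; the function name and the example accession numbers are made up for illustration and do not refer to real GEO records.

```python
# Map GEO accession prefixes to the record types listed above.
GEO_RECORD_TYPES = {
    "GSM": "Sample",
    "GPL": "Platform",
    "GSE": "Series",
    "GDS": "Dataset",
}


def geo_record_type(accession):
    """Return the record type for an accession such as 'GSE12345'."""
    prefix, number = accession[:3].upper(), accession[3:]
    if prefix not in GEO_RECORD_TYPES or not number.isdigit():
        raise ValueError("not a recognised GEO accession: " + repr(accession))
    return GEO_RECORD_TYPES[prefix]


# Hypothetical accessions, used only to exercise the function.
for accession in ["GSM100", "GPL200", "GSE300", "GDS400"]:
    print(accession, "->", geo_record_type(accession))
```

A check like this is handy when a script receives accessions from a publication and has to decide which kind of record to request.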
Here is the Gene Expression Omnibus page. If we look at the different types of records
it has, we can see Series (on the top left side of the right figure) and different records for
Platforms and Samples. If you look at the types of series, you can see there are expression
profiling by array and expression profiling by high throughput sequencing (in our course we'll
be getting some RNA-seq data, which falls under expression profiling by high throughput
sequencing). Similarly, there are various other techniques for measuring expression, which are
listed below in the Series section, and the number of datasets available is shown in the
column called count.
If you want to look for a dataset, you can simply type into the search bar. For example, if you
write colon cancer RNAseq data, you get a set of matching records, and when we click on one of
them the page appears (shown below).
There are six samples in total in this dataset, and the individual samples are listed together,
labelled as GSM.
We can also download the sequence expression counts or values in different formats, and there
are also some normalized counts as shown in the figure; these are compressed files. The raw
read data is also available under an SRP accession in the SRA (Sequence Read Archive), which
stores the raw reads.
Funding and publication agencies demand that your data be submitted and
shared with the community. Here is an example (in the figure shown) where we can see a
publication in which the authors have included the GSE accession, which helps other scientists
to access this data using the ID number highlighted in the figure. If you are submitting
your paper, you need to provide this information to the publisher, which is an
essential consideration.
Conclusion:
So we sum up that GEO is a public repository for the archiving and distribution of gene
expression data, and it is the best resource for microarray and Next Generation Sequencing
(RNA-seq) data.
MEDLINE
• MEDLINE is the primary resource for biomedical journal articles
• Millions of citations to articles in biomedical journals
• MEDLINE uses the MeSH vocabulary
Other Databases
MEDLINE is the primary resource, but other databases may also be helpful
• Academic OneFile
• CINAHL (Cumulative Index to Nursing and Allied Health Literature)
• PsycINFO
• Web of Knowledge
Academic OneFile
• Academic OneFile lists articles from journals covering a broad range of
subjects
• While it does not primarily focus on medical topics, useful articles can
still be found here
PsycINFO
• PsycINFO searches the psychological literature
• While it does not primarily focus on medical topics, useful articles can still
be found here
• http://www.apa.org/pubs/databases/psycinfo/coverage.aspx
Web of Science
• Major source for articles in a wide range of fields, including the sciences,
social sciences, and humanities.
• Excellent place to find articles from scientific journals that may not be
included in MEDLINE
Conclusions
Informatics in health care may be called health informatics
• Medical databases deal with the acquisition, storage, retrieval, and use of
information in health and biomedicine.
NCBI:
NCBI has two options for sequence submission:
BankIt - for simple sequences and annotations (not related to downstream analysis).
Submission is done through the web (if the datasets are small) and does not require any
advanced tools.
Sequin - for complex sequences and annotations. It is also suitable for off-line
submissions, typically of large datasets, and it can be used later with advanced tools
(for analysis) and graphical reports.
http://www.ncbi.nlm.nih.gov/WebSub/?tool=genbank
The figures above give a glance at the BankIt and Sequin webpages.
UniProt:
For protein sequences, UniProt has a tool similar to the NCBI ones, called
SPIN. It is a web-based tool for submitting directly sequenced protein sequences
and biological annotations to the knowledgebase.
Shown in this figure, is the webpage of SPIN.
We can register here and then we can submit our data.
https://www.ebi.ac.uk/swissprot/Submissions/spin
Conclusion:
We conclude that sequences are stored in databases in specific format and when we want to
submit them into a database then we need to follow the guidelines provided by those
databases.
Here is the webpage of NCBI. For example, say you want to search for the p53 gene, a tumour
suppressor gene. We write p53 in the search bar and get the results. Here we find
many entries (around 9000); we are looking only at the first page, and from it we
choose the first two. Let's click the first one, p53, whose ID is 2768677. There is a
description of what sort of gene it is; it comes from Drosophila melanogaster,
its location is chromosome 3, and we see some aliases, the alternative names of this
gene. The link to NCBI is http://www.ncbi.nlm.nih.gov/.
When we click on the first gene as shown in the figure above, we come to this webpage,
which is a large page that is divided into different sections.
In this figure (on the left), we can see the summary of this gene.
The official symbol is p53, provided by FlyBase, which is also listed as the primary source
(FlyBase is the database that stores the genome of the fruit fly Drosophila). Then come the
locus tag, the gene type (protein coding) and the RefSeq status, which says REVIEWED (genes
are submitted and then reviewed by other scientists, so this gene has been reviewed). In the
organism section, we see the classification of the organism, and the aliases are written
beneath it.
In this figure, we can look at the structure of this gene and its genomic
coordinates, which show where it is located. We can also see its
orientation, the direction in which it runs (down below).
Finally, we scroll until we reach the word ORIGIN. Here we can see the actual nucleotide
sequence, starting from position 1 and running to the last nucleotide. The sequence ends
with a double slash sign (//).
Conclusions:
So, we conclude that DNA sequences are stored in DNA sequence databases in specified
formats, and the GenBank format is a standard format.
So, let's check the first one, and here we reach the record for this protein.
After scrolling down the same webpage (shown in the figure on the left), we can see the
feature keys, and one of them is labelled as a site (there are unique sites in different
proteins that have specific properties; this one is a single amino acid
in this protein that interacts with the DNA). Similarly, there are several
metal binding sites, and we can see that the protein mainly binds zinc. The
positions of the amino acids are shown here, so these are the regions where it interacts
with the metal.
Down below, we can also see the DNA binding region; for example here, it spans amino
acids 102 to 292, and that is also shown in the graphical view.
GO stands for Gene Ontology. Gene ontologies are
functional annotation terms that define different functions;
they are grouped into molecular functions, biological processes and
cellular components. Here we see only the molecular function, which tells us the
functions the protein performs as shown in the figure: mainly ATP binding and p53
binding, along with various other functions like DNA binding. So all the functions
related to this protein are listed under the heading GO - Molecular function.
We move further (as shown in the figure on the left) until we reach its taxonomy.
At the top, we can see an entry under Protein family or group databases,
which is TCDB. This is another classification, in which proteins are
classified as transporter proteins, so this protein is associated with
transport across membranes. There is a 5-digit number,
a specific classification code that is given to each protein, and this protein has the
code shown in the figure.
Then we have the names and taxonomy section, where the protein names are listed, and
the taxonomy of the organism can be seen in the Taxonomic lineage row. Let's
see how we reach its sequence, as shown in the figure below:
In this figure, we can see the sequence of the protein, which is found at the end of the page.
Here it says Isoform 1. Proteins can have different isoforms (alternative
splice variants); this is Isoform 1, as shown by its identifier P04637-1,
the first isoform. We can see that the sequence of the protein starts
with a methionine (the first amino acid in these proteins) and ends at the
393rd amino acid. So, it's a 393 aa long protein and the sequence is right here.
You can click on the FASTA button at the top to get this output in
FASTA format (we'll discuss it later).
NCBI:
We can also get the same protein from NCBI (as shown in the figure on the left).
In NCBI, the sequence is pretty similar and the arrangement is slightly different: the
sequence starts after the word ORIGIN and ends at the two slashes
(//). So, we can get the protein sequence from NCBI as well, and the link to this
website is http://www.ncbi.nlm.nih.gov/.
PDB:
PDB gives us the structures, so we can go to the PDB webpage (as shown in the
figure on the left) and search for the same ID, i.e. P04637. It shows the sections
or regions of the sequence that make up specific structures.
You can see the turns in the Annotations section: the black lines are empty regions
where no secondary structure is formed, the blue ones show bends, and the
orange ones mark alpha helix regions. So in PDB, we can view
structures in this format as well as 3D structures, as shown in the figure
below:
Example:
LOCUS AAU03518 237 bp DNA PRI 04-FEB-1995
DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and
18S rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION U03518
BASE COUNT 41 a 77 c 67 g 52 t
ORIGIN
1 aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgtc
61 tattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctg
121 ccccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtc
181 tgagttgattgaatgcaatcagttaaaactttcaacaatggatctcttggttccggc
//
Here is the GenBank format. It starts with the word ‘LOCUS’, followed by the record's ‘ID’; the
sequence is 237 base pairs long, the molecule type is ‘DNA’, then comes ‘PRI’, the GenBank
division code (PRI denotes primate sequences), and the date ‘04-FEB-1995’.
Then we have a ‘DEFINITION’ line, which gives a description/explanation of
the sequence. Then we have the ‘ACCESSION’ number. The record also provides the ‘BASE
COUNT’ (i.e. how many A’s (adenines), C’s (cytosines), G’s (guanines) and T’s (thymines) are
there).
Finally, the word ‘ORIGIN’ tells us that the actual sequence follows. The sequence is written
in lines of 60 bases, which in standard practice are separated into chunks of 10 bases.
The sequence ends with the two slashes (//).
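The layout just described (keyword lines, then ORIGIN, then the sequence up to //) is simple enough to read with a short script. The sketch below is an illustration under stated assumptions: it uses an invented toy record rather than a real GenBank entry, the function name is hypothetical, and it handles only the keywords discussed above, not the full GenBank flat file format.

```python
from collections import Counter

# Invented toy record that follows the layout described above.
TOY_RECORD = """LOCUS       TOY0001   16 bp   DNA   PRI   01-JAN-2000
DEFINITION  Toy record for illustration.
ACCESSION   TOY0001
ORIGIN
        1 agctagctag ctagct
//
"""


def parse_genbank_like(text):
    """Collect the locus, definition, accession and sequence from one record."""
    record = {"locus": None, "definition": "", "accession": None}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # termination line
            break
        if in_origin:
            # Keep only the letters; drop position numbers and spaces.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(sequence_parts)
    return record


record = parse_genbank_like(TOY_RECORD)
print(record["accession"], len(record["sequence"]))
print(Counter(record["sequence"]))  # base count, comparable to BASE COUNT
```

Comparing the computed base counts with the BASE COUNT line is a quick sanity check that a record was read completely.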
EMBL Format:
This format is similar to the GenBank format. An example sequence in EMBL format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgtc 60
tattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctg 120
ccccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtc 180
tgagttgattgaatgcaatcagttaaaactttcaacaatggatctcttggttccggc 237
//
Here, we have the ID, accession number (AC) and description (DE) lines, and the sequence
starts after the line beginning with ‘SQ’. We can observe that the sequence lines are very
similar to those in the previous example. Finally, the sequence ends with a double slash, the
same as in GenBank format.
SwissProt Format:
SwissProt protein sequence format is similar to EMBL format but there is considerably more
information about physical and biochemical properties of a protein (as you can see below there
is more description).
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
// - Termination line.
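Because every line starts with a two-letter code, such a record can be grouped by line code with very little logic. The following is a rough sketch, not a complete SwissProt parser: the toy record, its identifiers and the function name are invented for illustration, and it assumes that sequence lines begin with blank characters, as in the EMBL example above.

```python
from collections import defaultdict

# Invented toy record using a few of the line codes listed above.
TOY_ENTRY = """ID   TOY_TEST   Reviewed;   8 AA.
AC   X00001;
DE   Toy protein for illustration.
OS   Hypothetical organism.
SQ   SEQUENCE   8 AA;
     MEEPQSDP
//
"""


def parse_swissprot_like(text):
    """Group line contents by two-letter code and collect the sequence."""
    fields = defaultdict(list)
    sequence_parts = []
    for line in text.splitlines():
        if line.startswith("//"):  # termination line
            break
        code, value = line[:2], line[2:].strip()
        if code == "  ":
            # Sequence lines carry no line code, only leading blanks.
            sequence_parts.append(value.replace(" ", ""))
        elif code.strip():
            fields[code].append(value)
    return dict(fields), "".join(sequence_parts)


fields, sequence = parse_swissprot_like(TOY_ENTRY)
print(fields["AC"], fields["OS"])
print(sequence, len(sequence))
```

Storing each code as a list keeps repeated lines (for example several DE or RA lines) together in their original order.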
XML Format:
It is a modern practice in which we put sequences into a structured, machine-readable format.
XML stands for Extensible Markup Language. The format is similar to HTML (the markup language
for web pages).
The good part is that this format is readable by both machines and humans, so it is fairly
easy to write code for it.
It is becoming a standard data format for transferring genome data.
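As a rough illustration of why XML is convenient to code against, the sketch below reads sequence records from a small XML document with Python's standard library. The element and attribute names (sequences, sequence, residues and so on) are invented for this example and do not follow any particular genome data schema.

```python
import xml.etree.ElementTree as ET

# Invented toy document; the tag names are assumptions for illustration.
TOY_XML = """<sequences>
  <sequence id="seq1" type="DNA">
    <description>Toy record one</description>
    <residues>agctagctagctagct</residues>
  </sequence>
  <sequence id="seq2" type="DNA">
    <description>Toy record two</description>
    <residues>aactaactaactaact</residues>
  </sequence>
</sequences>"""


def read_sequences(xml_text):
    """Return a dictionary of records keyed by the id attribute."""
    root = ET.fromstring(xml_text)
    records = {}
    for element in root.findall("sequence"):
        records[element.get("id")] = {
            "type": element.get("type"),
            "description": element.findtext("description", default="").strip(),
            "residues": element.findtext("residues", default="").strip(),
        }
    return records


records = read_sequences(TOY_XML)
for identifier, record in records.items():
    print(identifier, record["type"], len(record["residues"]))
```

The parser does the tokenizing for us, so the script only names the fields it wants instead of counting columns or line prefixes.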
Example:
<xsd:annotation>
<xsd:documentation>
XML Schema for SBOL core data model compatible with RDF/XML serialization.
<dc:date>2012-01-19</dc:date>
<dc:creator>EvrenSirin</dc:creator>
<dc:contributor>Michal Galdzicki</dc:contributor>
</xsd:documentation>
</xsd:annotation>
This format seems pretty weird but not for the people with computer science
background.
NBRF Format:
>DL;seq1
seq1, 16 bases, 2688 checksum.
agctagctagctagct*
>DL;seq2
seq2, 16 bases, 25C8 checksum.
aactaactaactaact*
The format is very similar to FASTA, but in addition it gives us the checksum value.
(Checksum: in computers every character has a corresponding ASCII value, so we can take the
values of the nucleotides and add them together to
come up with a number known as the checksum. It is useful to have this number,
because someone who downloads the sequence can recompute the checksum on their own computer;
if the two values are equal, the sequence was downloaded correctly,
otherwise there must have been some issue with the download.)
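The idea of adding up character values can be sketched as follows. This is a simplified illustration of the principle only: it is not the checksum algorithm actually used by NBRF or GCG tools, so it will not reproduce values such as 2688 or 25C8 from the examples, and the function name is made up.

```python
def ascii_checksum(sequence):
    """Add up the ASCII code of every character in the sequence."""
    return sum(ord(character) for character in sequence)


original = "agctagctagctagct"
damaged = "agctagctagctagca"  # last base changed during a faulty download

print(ascii_checksum(original))  # 1660
print(ascii_checksum(original) == ascii_checksum(damaged))  # False
```

A plain sum does not notice two bases swapping places, which is one reason real tools use more elaborate checksum schemes.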
GCG FORMAT:
GCG stands for Genetics Computer Group (a group of scientists who
helped the biological community by developing software and training programs
for biological sequence analysis problems; they also came up with sequence
formats). This format is similar to the NBRF format (we have a checksum and the length of
the sequence, but no greater-than (>) sign as in FASTA). There can be multiple
sequences in one file.
Example:
seq1 seq1 Length: 16 Check: 9864 ..
1 agctagctagctagct
seq2 seq2 Length: 16 Check: 9672 ..
1 aactaactaactaact
Sequence converters:
Sometimes, we need to convert sequences between formats. You can write your own script
or code for this, and there are also programs meant for this
purpose alone. READSEQ is a useful sequence converter (developed by D.G. Gilbert at
Indiana University, USA); it recognizes DNA or protein sequence files and
interconverts them between different formats.
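As an example of writing your own converter, the sketch below turns NBRF-style records into FASTA. It is a minimal illustration, not a replacement for READSEQ: the function name is hypothetical, and it assumes every record has a header line with a semicolon, one description line, and sequence lines ending with an asterisk, as in the NBRF example shown earlier.

```python
NBRF_TEXT = """>DL;seq1
seq1, 16 bases, 2688 checksum.
agctagctagctagct*
>DL;seq2
seq2, 16 bases, 25C8 checksum.
aactaactaactaact*
"""


def nbrf_to_fasta(nbrf_text, width=60):
    """Convert NBRF-style records to FASTA text with lines of `width` bases."""
    lines = [line.strip() for line in nbrf_text.splitlines() if line.strip()]
    output = []
    index = 0
    while index < len(lines):
        if not lines[index].startswith(">"):
            index += 1
            continue
        identifier = lines[index].split(";", 1)[1]  # text after '>DL;'
        description = lines[index + 1]
        index += 2
        residues = []
        while index < len(lines) and not lines[index].startswith(">"):
            residues.append(lines[index])
            index += 1
        sequence = "".join(residues).rstrip("*")  # drop the end marker
        output.append(">" + identifier + " " + description)
        for start in range(0, len(sequence), width):
            output.append(sequence[start:start + width])
    return "\n".join(output) + "\n"


fasta_text = nbrf_to_fasta(NBRF_TEXT)
print(fasta_text)
```

The same pattern (read into a neutral in-memory record, then write in the target layout) extends to the other formats in this topic.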
Conclusions:
What we conclude at the end of this lecture is the following:
• Databases store sequences in specified formats
• GenBank, DDBJ and EMBL have similar formats
• Different software needs sequences in different formats
• We might convert the sequences into other formats on our own, or we can simply use one
of the programs available for converting, like READSEQ
Here is the page of Entrez, which allows you to search for anything using the search bar at
the top. It links to different resources: Literature, Health
databases, Genomes, Genes databases, Proteins and Chemicals.
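Entrez searches can also be issued from a script through NCBI's E-utilities web service. The sketch below only builds the search URL and does not contact the server; the helper name and the chosen search terms are illustrative, and the parameters shown (db, term, retmax) are just a small subset of what the service accepts.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, retmax=20):
    """Build an ESearch URL for the given Entrez database and search term."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return ESEARCH_URL + "?" + query


# Search the gene database for p53, as done by hand in the NCBI example.
print(build_esearch_url("gene", "p53"))
```

Opening the printed URL in a browser returns the matching record IDs as XML, which a script can then parse.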
Bulk Data Retrieval:
Sometimes, we need to obtain data in bulk. For this purpose we normally use Linux,
but for Windows users there are packages or programs available known as FTP
clients. The best option is to use FTP. The File Transfer Protocol
(FTP) is a standard network protocol used to transfer files via the command line or
application programs like FTP clients (we'll be using it).
Once we get the data, it is often not in the required format, and different software requires
different specific formats, so we might want to use a programming language to help
convert the data into the required format. Programming languages like Perl and Python
are good for processing biological data in bioinformatics.
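A download over FTP can also be scripted. The following is a minimal sketch using Python's standard ftplib module with an anonymous login; the function name, the ftp_factory parameter and the host, directory and file names in the usage comment are placeholders for illustration, not real locations.

```python
from ftplib import FTP


def download_file(host, remote_dir, filename, local_path, ftp_factory=FTP):
    """Fetch one file over FTP (anonymous login) and save it locally.

    ftp_factory exists only so the function can be tried with a stand-in
    class instead of a real network connection.
    """
    with ftp_factory(host) as ftp:
        ftp.login()            # anonymous login
        ftp.cwd(remote_dir)    # change to the remote directory
        with open(local_path, "wb") as handle:
            # RETR sends the file; each received block is written to disk.
            ftp.retrbinary("RETR " + filename, handle.write)
    return local_path


# Usage with placeholder names (not run here, as it needs a network):
# download_file("ftp.example.org", "/pub/genomes", "toy.fa.gz", "toy.fa.gz")
```

For bulk retrieval the same call can be placed in a loop over a list of file names.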
Conclusions:
We have learned that :
• Data is transferred over the internet.
• Data needs to be transformed or processed before handing it over to any software.
Genome Informatics:
It is about genome sequencing, which provides the sequences of all the genes of an organism.
The major application of bioinformatics is the analysis of full genomes that have been
sequenced. The challenge is to identify the particular genes that are predicted to
have a specific biological function.
Genomics definition:
NHGRI (National Human Genome Research Institute) defines Genomics as:
“Study of all of a person's genes (the Genome), including interactions of those genes with
each other and with the person's environment.”
So, Genome Informatics can be defined as:
It is the field in which computational and statistical techniques are applied to derive biological
information from genome sequences.
Genome Analysis:
In genome analysis we mainly perform the following tasks:
➢ Sequencing
➢ Assembly (since the sequencing is done in a way that whole genome is broken
down into short fragments and once those fragments are sequenced, we need to put
them together, this step in genome analysis is known as Assembly).
➢ Repeat identification and masking (once we assemble the genome, we try to
find the regions that contain large numbers of repeats, because assemblies
get jumbled where those repeats occur. Finding these regions is
an important task, and we examine the assemblies while keeping
those repeat regions in mind).
➢ Gene prediction (after we have assembled a finished genome, now we can go for
the prediction of the genes where we can find the genes by using different patterns
or features of those genes).
➢ Looking for EST (Expressed Sequence Tag) and cDNA (complementary DNA)
sequences.
(ESTs and cDNAs originate from the DNA: the genes that are
expressed are transcribed into mRNA, which is then reverse transcribed back into
cDNA. With the help of cDNA, we can see where the expressed regions
are in the genes, which gives us an idea of the gene expression or the
regions from which the mRNAs are made).
➢ Genome annotation (in which we can find out similar functions performed by
different genes)
➢ Expression analysis (once we have the idea about the regions of the gene in which
we can have the gene expression then we can explore the quantification i.e. how
much those genes are being expressed).
➢ Metabolic pathways and regulation studies (once those genes are expressed, their
products interact with each-other and then they perform different metabolic roles in
the shape of different metabolic pathways and networks).
➢ Functional genomics (where we are actually looking into the different functions
performed by different regions of the genome that are under the control of different
genes and what exactly would be the effect of changes in those genes specifically if
we want to study about the genes related to diseases).
➢ Gene location/gene map identification (map the location of those genes on the
chromosomes).
➢ Comparative genomics (in which we can take one genome and compare it with
another genome, where we can find the comparative features; what is present in the
first genome and not in the second one and the intersections between them, etc.).
MUHAMMAD IMRAN 33
➢ Identify clusters of functionally related genes (those genes they might be having
similar structures, sequences and also performs similar functions, which can give us
the idea about the evolution).
➢ Evolutionary modeling (so the identification of the clusters of functionally related
gene can help us in making an evolutionary model).
➢ Self-comparison of proteome (sometimes we are interested in finding genes which
are kind of duplicated within the same organism, so in order to do that this self-
comparison of proteome is made, where proteome is the collection of the proteins
which are derived from those genomes. Therefore, the whole collection of one
organism’s proteins can be termed as proteome and we can compare it with itself
and can find about those sequences which are being duplicated in it).
• Model organisms:
In most genome sequencing projects our objective is to find a cure for a disease, to improve
a crop variety to raise its yield, or to find drugs against particular organisms. It is
therefore a good idea to have model organisms in which various processes can be studied in
the lab. There is a range of model organisms, which includes:
➢ E. coli – bacterium
➢ S. cerevisiae – yeast
➢ C. elegans – worm
➢ D. melanogaster – fly
➢ Danio rerio – zebrafish
➢ Mus musculus – mouse
➢ Homo sapiens – you and me
➢ Arabidopsis – plant
In this diagram we see a universal tree of life built from the small-subunit ribosomal RNA.
It divides all living organisms into three groups: Bacteria at the top, Archaea (special
organisms that live under harsh conditions) and Eucarya (the biggest of the groups). We pick
model organisms from important branches of this tree of life: for example, E. coli from the
bacteria, yeast as an example of fungi, worms, flies, fish and mice from the animals, and
Arabidopsis, rice and soybean from the plants. In other words, we try to choose the best
representatives of the different classes from important branches of the tree of life.
Conclusions:
We conclude the following:
• Sequencing and analysis of full genomes pave the way for future discoveries.
• Model organisms are a good way to explore our own genome and to extrapolate the results
to higher organisms.
In this diagram we can see a typical eukaryotic cell, which is densely packed compared
to a prokaryotic one. We have the nucleus in the middle, channels coming out of the
nucleus known as the endoplasmic reticulum (helps in transportation), ribosomes (for
protein synthesis), mitochondria (energy production), the cytoskeleton, which keeps the
structure of the cell intact, and the Golgi apparatus (concerned with secretion). So
complicated membrane-bound organelles are present in the eukaryotes.
In this diagram we see the connection between DNA and the chromosomes.
On the left-hand side we see a DNA strand, which is 2 nm wide. The DNA wraps around protein
complexes (histones, labelled 1, 2, 3, ...), and this structure is known as a nucleosome. The
nucleosomes then coil into a wider structure, the 30 nm filament (third section). These
structures supercoil on themselves to make progressively bigger fibres until, at the level of
the chromosome, the width is 1,400 nm. If we look at the chromosome, we can recognize its
arms, known as sister chromatids (remember that this is just one chromosome, but with two
chromatids). Somewhere in the middle we see a constricted part known as the centromere,
whereas the ends are known as telomeres (remember this nomenclature when we discuss the
heterochromatin and euchromatin parts).
Staining with dyes:
When chromosomes are stained with dyes they show different coloring patterns, from which we
can distinguish the following:
• Dense heterochromatin (dense regions take up more color)
• Light euchromatin (loosely packed regions take up less color)
In terms of gene expression, the heterochromatic regions are tightly packed, so the enzymatic
machinery cannot reach them; hence these regions are poorly transcribed (expressed).
The euchromatic regions, in contrast, are highly expressed because they are loosely packed and
the enzymatic machinery can easily reach them.
In this diagram we can see the relationship between heterochromatin and euchromatin. Look at
the nucleosomes (combinations of DNA and histones): in heterochromatin they are packed tightly
against one another, so enzymes cannot access the DNA embedded in between. Different
modifications of the DNA or the histones bring about these structures (shown at the top of the
diagram). For example, some histone methylations (in which methyl groups are added to the
histones) move the system downwards in the diagram, towards euchromatin, while methylations
on other amino acids move it back in the opposite direction. Histone methylation, histone
deacetylation and the binding of some other protein complexes give us the heterochromatin
region, and the reverse processes give us the euchromatin region. In euchromatin the histones
are well spaced and the DNA is accessible. This is why the euchromatin region is expressed
more than the heterochromatin region.
Conclusions:
We conclude that:
• Eukaryotes are distinguished by the presence of prominent nuclei
• Eukaryotes have larger genomes, tandem repeats and introns in their protein-coding
genes (i.e. they are complicated).
Prokaryotic EEs:
1. Plasmids
2. Self-replicating
3. Additional rings
4. Bacteriophages
5. Host colonization
6. Transposons
7. Parasitic DNA elements
Eukaryotic EEs:
Eukaryotes have extra organelles that contain their own genome (DNA), which we call
organellar DNA.
Examples are:
Mitochondrial DNA (in both animals and plants) and chloroplast DNA (in plants). These are
membrane-bound organelles that may be present in hundreds to thousands of copies per cell (so
there are also multiple copies of these genomes). The mitochondrion is the site of respiration,
whereas the chloroplast is the site of photosynthesis. Their DNA is labelled mtDNA and cpDNA
respectively.
Plasmids (for example in yeast), transposons, viral genomes and retroviruses are other
examples of eukaryotic EEs, although they are not organellar DNA.
Endosymbiont hypothesis:
How did these organelles evolve?
There is a hypothesis known as the endosymbiont hypothesis. According to this hypothesis,
these organelles originated as separate prokaryotic organisms that were taken inside a
primordial eukaryotic cell. Such symbiotic relationships, in which two species depend
upon one another to varying extents, served as crucial elements of the evolutionary progression
of eukaryotic cells.
This hypothesis was originally proposed in 1883 by Andreas Schimper and was extended by Lynn
Margulis in the 1980s.
So according to this theory, mitochondria and chloroplasts are derived from endosymbiotic
bacteria (that were incorporated into the cells).
Organelle Genome:
Organelle genome (of mtDNA/cell or cpDNA/cell) features are as following:
➢ Circular
➢ Double stranded
➢ Supercoiled
➢ No histones
➢ Multiple copies
In this table we can see the sizes of these genomes. For example, the plant chloroplast genome
is a 150 kb circular genome, plant mitochondrial genomes are 150-2000 kb and multipartite, the
human mitochondrial genome is 17 kb and circular, and the Saccharomyces mitochondrial genome is
75 kb and circular.
*Most of these genomes are circular.
As far as the expression of these organelle genomes is concerned, it has been observed that
their function depends on the nuclear genome (they cannot provide all of their functions by
themselves).
They encode only a subset of genes required to elaborate a functional organelle like rRNAs,
tRNAs, ribosomal proteins, membrane-associated respiratory or photosynthetic components.
Other components which are encoded by nuclear genome are translated in the cytosol of the
cell and are imported into the organelle. It has been observed that 10% of nuclear genes are
devoted to mitochondrial function whereas 15% to plastid function.
Conclusions:
We conclude the following:
• The organelle genome is similar to that of prokaryotes.
• It is present in high copy number.
• The mtDNA and cpDNA depend on the nuclear DNA (genome) for their function.
Sequence Repeats:
Repeats skew the base composition. Normally the relative proportions of A, T, G and C are
similar to one another, but if repeats of the same type are present (for example runs of GC),
they change the proportions of the different bases. This leads to differences in buoyant
density, so the fragments can be separated on the basis of their different densities.
Repeat-containing DNA can thus be separated as satellite DNA on the basis of these densities.
http://mcb1.ims.abdn.ac.uk/djs/web/lectures/repeats1.html#anchor10305
Satellite DNA:
Satellite DNA has the following features:
➢ The repeat unit may be one to several thousand bp long, and the units can be present in
tandem arrays up to 100 million bases long.
➢ It is found near the centromeres and telomeres.
➢ It can be classified into mini-satellites and micro-satellites.
Mini-satellite:
Mini-satellite features are as follows:
• The repeat unit is about 15 bases long, in arrays of several hundred to thousands of kb.
• They are typically present in euchromatin regions.
• An example is the VNTR (variable number tandem repeat), which is used to identify human
individuals in forensics.
So, mainly in humans and Z. mays, they are present as a major proportion.
Conclusions:
We conclude that:
➢ Large proportion of eukaryotic genome is composed of repeats
➢ Different repeats act as markers to detect genetic variation (of organisms) and are
also used to study evolution of those organisms
Conclusions:
We conclude the following about Transposable elements:
➢ They make up a significant part of an organism’s genome, especially in
eukaryotes.
➢ They move within and across genomes.
➢ They cause genome expansion.
Introns can be distinguished by the presence of GT at the 5’ end and AG at the 3’ end
(GT-------------AG), and this pattern is highly conserved all over the genome.
This diagram shows the structure of a typical eukaryotic gene. We see the chromosome, and the
gene is a region on it with specific patterns: the promoter region (at the beginning of the
gene), the exons (blue) and the introns (orange).
Transcription starts at the position marked by the first black line and ends at exon 3 (marked
by the second black line); this whole region is transcribed into mRNA. We can see a 5’-UTR
(untranslated region): this region is transcribed into mRNA but is not translated, i.e. no
protein is formed from it. Similarly, we also have a 3’-UTR.
The ORF is the region from the start codon (initiator) to the stop codon; the codons in between
specify the amino acids, so this is the region that is translated to give a protein.
After transcription, the transcript is known as the primary RNA transcript, and we can see that
it still contains the introns. These are later removed through a process called splicing,
which gives the mature RNA transcript; that transcript is then translated into protein.
The mature RNA transcript is also recognized by the presence of a poly-A tail (a long run of
A’s).
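The ORF idea described above (read from a start codon to the first in-frame stop codon) can be sketched in Python. This is only a minimal sketch under stated assumptions: it uses the standard start codon ATG and stop codons TAA, TAG and TGA, scans only the forward strand of an already spliced sequence, and the function name find_orfs and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    start is the 0-based index of the A in ATG; end is the index just
    past the stop codon. Only the first in-frame stop codon is used.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the reading frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs

# Hypothetical toy sequence: ATG GCT TGA embedded in flanking bases.
print(find_orfs("CCATGGCTTGATTT"))  # [(2, 11)]
```

Real gene finders also check the reverse strand, splice sites and codon usage; the sketch only illustrates the start-to-stop scan.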
Intron origin
There are two theories about the origin of introns:
• Introns-early: according to this theory, introns were present from the start and genes were
assembled from already existing exons (the exons were brought together and these structures
became the genes).
• Introns-late: according to this theory, the exons were already joined to one another and
introns were inserted into them later (i.e. they broke up previously continuous genes).
Number of Genes:
Now we talk about the degree of compactness. In compact genomes the genome size is small
and the relative proportion of genes is higher, which contributes to the variation in gene
density. In short, compact genomes have a higher gene density.
Conclusions:
In the end, we conclude the following:
➢ Eukaryotic genes have exons and introns.
➢ Introns make up a significant portion of the genomes of higher organisms (e.g. the
human genome).
➢ Pseudogenes are non-functional genes.
➢ Genes that are similar in function make up gene families.
• If we have a good match between the query sequence and some other sequence, we can
suspect that the two are paralogs (because they are present within the same organism).
– Distance in alignment (proteins that are more similar are grouped together and
distant proteins are kept apart from them, so that we obtain sub-groups or clusters
in our data).
There are different clustering methods; they are explained briefly in the section below.
Clustering by subgraph:
Clustering or grouping by the sub-graph method works as follows:
➢ Each sequence is a vertex (vertices are the points or dots that the edges (links) of a
graph connect; a vertex with no edge connected to it is known as an isolated vertex).
➢ A significant alignment score is an edge (we put our sequences on the vertices and our
alignment scores on the edges).
➢ The graph is trimmed by removing weak edges (edges with a high P or E value are removed).
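The three steps above can be sketched in Python as a connected-components search. This is a minimal sketch, not the method of any particular tool: the function name cluster_by_subgraph, the E-value cut-off of 1e-10, and the protein names and E-values in the example are all hypothetical.

```python
def cluster_by_subgraph(hits, max_evalue=1e-10):
    """Group sequences into connected components of the alignment graph.

    hits: iterable of (seq_a, seq_b, evalue) pairwise alignment results.
    Edges with an E-value above max_evalue are treated as weak and dropped.
    """
    neighbours = {}
    for a, b, evalue in hits:
        neighbours.setdefault(a, set())   # every sequence is a vertex
        neighbours.setdefault(b, set())
        if evalue <= max_evalue:          # keep only significant edges
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first walk of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical all-against-all results; the P3-P4 edge is weak and is trimmed.
hits = [("P1", "P2", 1e-50), ("P2", "P3", 1e-30),
        ("P3", "P4", 0.5), ("P4", "P5", 1e-20)]
print(cluster_by_subgraph(hits))  # [['P1', 'P2', 'P3'], ['P4', 'P5']]
```

A vertex whose edges are all trimmed ends up as a cluster of its own, which corresponds to the isolated vertex mentioned above.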
Single Linkage:
Linkage is done by the following method:
• A group of sequences from the all-against-all comparison is subjected to MSA (the related
proteins are first aligned with multiple sequence alignment and then their distances are
calculated).
• Create a distance matrix (using the distances just calculated).
• Neighbor joining is then used to do the clustering (trees are built from the distance
matrix using the neighbor-joining method, which will be discussed later).
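The "create distance matrix" step can be sketched as follows. The source does not say which distance measure is used, so this sketch assumes the simple p-distance (the fraction of aligned, gap-free columns that differ); the function names and the three aligned toy sequences are hypothetical, and the neighbor-joining step itself is not shown.

```python
def p_distance(a, b):
    """Fraction of aligned, non-gap columns at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(aligned):
    """aligned: dict name -> aligned sequence (all of the same length)."""
    names = list(aligned)
    matrix = [[p_distance(aligned[i], aligned[j]) for j in names] for i in names]
    return names, matrix

# Hypothetical multiple sequence alignment of three short sequences.
aligned = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TCGAAC-T"}
names, matrix = distance_matrix(aligned)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

The resulting symmetric matrix is the input that a tree-building method such as neighbor joining would take.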
As mentioned earlier, this helps in finding orthologs, gene families and domains
(between different organisms). Between-proteome comparison searches are also significant
for the following reasons:
• Proteins that have a highly significant alignment score can be suspected to be orthologs.
• Proteins that are related to core biological functions (basic functions of life) are the
most likely to be orthologs.
Then we have the ‘number of groups with more than two members’ (as shown).
Lastly, we have the ‘percentage of yeast’ and ‘percentage of worm’, i.e. the share of each
organism’s proteins (yeast and worm) that fall into these groups. For example, we have
40 percent and 19 percent at a P-value cut-off of < 10^-10, and 5 percent and 2 percent at a
P-value cut-off of < 10^-100 (a stricter criterion).
So, in this way we can group similar proteins at different P-value cut-offs and obtain
different results.
Proteomes to EST databases:
Sometimes we take proteomes and align them with Expressed Sequence Tags (ESTs), which are
cDNA copies of a cell’s mRNA sequences. We do this for organisms whose genome sequences are
not available.
ESTs are single DNA reads and are mostly 3’ biased. We get them from mRNA, and mRNA
extraction protocols rely on capturing the mRNA by the poly-A tail at its 3-prime end, which
is why the reads are oriented towards the 3-prime end.
ESTs may be incomplete because they depend entirely on gene expression: if we have no genome
and only ESTs, we are biased towards the genes that are expressed.
The BLAST program that is frequently used for this purpose is TBLASTN.
Family and Domain Analysis:
Proteins are organized into domains that represent modules of structure or function
(domains are specific arrangements of amino acids). Domain comparisons are sometimes
correlated with biological function.
Conclusions
➢ We can also explore the links to genetic maps where they are located on the
chromosomes.
➢ We can look into the location of the repeats.
➢ We can also look into the location of STS (sequence tag sites).
➢ We can also look into the location of sequence polymorphisms.
➢ And we can find the significant alignment to some protein sequences of known function in
databases (by comparison).
Annotations steps:
Annotations are divided into two types and are as follows:
• Structural annotation
• Functional annotation
Structural Annotation:
Structural annotation is where we try to identify certain gene features, such as:
• Promoters
• Terminators
• Shine-Dalgarno sites (the ribosomal binding sites used during protein synthesis)
• DNA motifs (patterns of nucleotides within the genes)
• Co-transcription units
• Operons in microbes (in micro-organisms, groups of genes that are transcribed together are
known as operons)
Annotation Tools:
Two tools are worth mentioning here, MAGPIE and GENEQUIZ; both are designed to assist with
genome annotation.
MAGPIE (Multipurpose Automated Genome Project Investigation Environment): an automated
genome analysis tool that is used for structural annotation.
GENEQUIZ: focuses on deriving a predicted protein function from the available evidence,
including an evaluation of similarity to the closest homologue in the database (i.e. it is
a good tool for functional annotation).
Attributing biological information to genes is called functional annotation. It can be based
on:
➢ Biological function
➢ Biochemical function
➢ Gene expression (transcription of the gene is considered gene expression)
➢ Regulation and interactions among different genes
8 Group Classifications:
There are different functional classification schemes. This one classifies genes and their
products into one of the following groups:
➢ Enzymes
➢ Transporters
➢ Regulators
➢ Membranes
➢ Structural elements
➢ Protein factors
➢ Leader peptides (control transcription and translation)
➢ Carriers (transporters)
In this way, scientists have seen that 90% of the E. coli genes fit into these categories
(so their annotations can be explained).
Enzyme Commission (EC) numbers:
This is another scheme, put forward by the Enzyme Commission (EC) working under the IUBMB
(International Union of Biochemistry and Molecular Biology). Enzymes are classified on the
basis of the reactions they catalyze, using a four-number scheme known as the Enzyme
Commission number: EC a.b.c.d
‘a’ (first number) tells us which of the 6 classes of biochemical reactions the enzyme
belongs to.
‘b’ (second number) tells us the group of substrate (the molecule the enzyme acts on).
‘c’ (third number) tells us the acceptor molecule.
‘d’ (fourth number) gives the details of the biochemical reaction.
For example, tripeptide aminopeptidase: EC 3.4.11.4
Here 3 tells us that it is a hydrolase (uses water to break the substrate).
3.4 tells us that it is a hydrolase that acts on peptide bonds.
3.4.11 tells us that it is a hydrolase that cleaves the amino-terminal amino acid of a
polypeptide.
Putting everything together, EC 3.4.11.4 tells us that it is a hydrolase that cleaves the
amino-terminal amino acid of a tripeptide.
With the Enzyme Commission scheme, 70% of E. coli genes were found to share a and b (the
first two numbers), which means that they catalyze the same type of biochemical reaction.
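The way an EC number is read level by level can be sketched in Python. This is a minimal sketch: the function names parse_ec and share_first_two_levels are hypothetical, the class names are those of the six classes referred to above, and the second EC number in the usage example (3.4.21.1) is used purely as an illustration of two enzymes sharing the first two numbers.

```python
EC_CLASSES = {
    1: "Oxidoreductase",
    2: "Transferase",
    3: "Hydrolase",
    4: "Lyase",
    5: "Isomerase",
    6: "Ligase",
}

def parse_ec(ec):
    """Split an EC number such as 'EC 3.4.11.4' into its four levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete EC number: {ec!r}")
    a, b, c, d = (int(p) for p in parts)
    return {
        "class": a,
        "class_name": EC_CLASSES.get(a, "unknown"),
        "first_two": (a, b),      # a.b
        "first_three": (a, b, c), # a.b.c
        "fourth": d,
    }

def share_first_two_levels(ec1, ec2):
    """True if two EC numbers agree in a and b."""
    return parse_ec(ec1)["first_two"] == parse_ec(ec2)["first_two"]

info = parse_ec("EC 3.4.11.4")
print(info["class_name"], info["first_two"], info["first_three"])
print(share_first_two_levels("EC 3.4.11.4", "EC 3.4.21.1"))  # True
```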
Three Groups Scheme:
This is another classification scheme, known as the ‘Three Groups Scheme’, in which we divide
all functions into those related to the following:
• Energy
• Information
• Communication
It was found that plants devote half of their genome to energy metabolism (they make
food), whereas animals devote half of their genome to communication (they talk a lot :D )
Conclusions:
We conclude the following:
• Finding genes and their coding regions is an important task in Genome annotations.
• Functional annotations correlate the genes to different classes of functions
MS2 genome
The MS2 genome uses 49 different codons of the genetic code to specify the sequence of the
129-amino-acid coat polypeptide (the virus has a protein coat on the outside and an RNA
genome inside).
Here we see the virus genome (5'- bacteriophage MS2 RNA - 3'), which has the genes mat (helps
in assembly, i.e. putting the proteins together), cp (codes for the coat protein), rep (codes
for the replicase protein) and lys (codes for the lysis protein, which breaks the host cell).
Looking at cp, rep and lys, we can see that the lys gene lies across the boundary between cp
and rep and overlaps both, so here we have genes within genes.
Leroy E. Hood
Institute for Systems Biology
Seattle, Washington
Conclusions:
We conclude the following:
• Genome sequencing involves recognizing and determining the precise order of
nucleotides in a genome.
• Advances in sequencing technologies have revolutionized the pace of scientific discovery.
0 1 2 3 4 5 6 7 8 9 10
0 L W L W L W L W L W L
1 W W W W W W W W W W W
2 L W L W L W L W L W L
3 W W W W W W W W W W W
4 L W L W L W L W L W L
5 W W W W W W W W W W W
6 L W L W L W L W L W L
7 W W W W W W W W W W W
8 L W L W L W L W L W L
9 W W W W W W W W W W W
10 L W L W L W L W L W L
FASTBLOCK(n, m)
1. if n and m are both even
2.     return L
3. else
4.     return W
Examples of table entries R(n, m):
R(2, 2) = L
R(4, 4) = L
R(4, 5) = W
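The parity rule in FASTBLOCK can be checked against the W/L table by filling the table bottom-up. The game rules are not stated in this part of the handout, so the sketch below assumes the usual two-pile game behind such a table: on each turn a player removes one item from either pile or one from each pile, and the player who takes the last item wins (so a player facing two empty piles has lost). The function names are hypothetical.

```python
def fastblock(n, m):
    """Parity rule from the pseudocode: 'L' if n and m are both even, else 'W'."""
    return "L" if n % 2 == 0 and m % 2 == 0 else "W"

def outcome_table(size):
    """Fill R[n][m] bottom-up under the assumed game rules."""
    R = [[None] * (size + 1) for _ in range(size + 1)]
    for n in range(size + 1):
        for m in range(size + 1):
            reachable = []
            if n > 0:
                reachable.append(R[n - 1][m])      # take one from the first pile
            if m > 0:
                reachable.append(R[n][m - 1])      # take one from the second pile
            if n > 0 and m > 0:
                reachable.append(R[n - 1][m - 1])  # take one from each pile
            # A position is winning if some move leaves the opponent losing.
            R[n][m] = "W" if "L" in reachable else "L"
    return R

table = outcome_table(10)
assert all(table[n][m] == fastblock(n, m)
           for n in range(11) for m in range(11))
print(" ".join(table[4]))  # L W L W L W L W L W L
```

The table needs time proportional to n × m to fill, whereas the parity rule answers in constant time, which is the point of the faster version.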
ATGTTTGCATTACGATAGAATTCCGTCAAAGTGCTAG
TACAAACGTAATGCTATCTTAAGGCAGTTTCACGATC
GCCGTTATACGCTGGATTTAAATTGCTGTGAAATGGT
CGGCAATATGCGACCTAAATTTAACGACACTTTACCA
TACTGCCAAGACCGAATTCCTGCGAGTGCTGAAACG
ATGACGGTTCTGGCTTAAGGACGCTCACGACTTTGC
GCGATATTACGAATGTGCTTACAGCACCGAATTCATC
CGCTATAAAGCTTACACGAATGTCGTGGCTTAAGTAG
{2, 2, 2, 3, 3, 4, 5}
If X = {x1 = 0, x2, . . . , xn}, then
∆X = {xj − xi : 1 ≤ i < j ≤ n}
For example, if X = {0, 2, 4, 7, 10}, then ∆X = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}.
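The definition of ∆X translates directly into a few lines of Python. This is a minimal sketch; the function name delta is our own, and the multiset is returned as a sorted list so that repeated differences are kept.

```python
from itertools import combinations

def delta(points):
    """Multiset of pairwise differences x_j - x_i (i < j), as a sorted list."""
    xs = sorted(points)
    return sorted(b - a for a, b in combinations(xs, 2))

print(delta({0, 2, 4, 7, 10}))  # [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
```

A set of n points always gives n(n − 1)/2 differences, here 5 × 4 / 2 = 10.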
Representation of ∆X
     0   2   4   7  10
 0       2   4   7  10
 2           2   5   8
 4               3   6
 7                   3
10
∆A is equal to ∆(A ⊕ {v}), where A ⊕ {v} is defined to be {a + v : a ϵ A}.
∆A is also equal to ∆(−A), where −A = {−a : a ϵ A}; for A = {0, 2, 4, 7, 10},
−A = {−10, −7, −4, −2, 0}.
{0, 1, 3, 8, 9, 11, 12, 13, 15} and {0, 1, 3, 4, 5, 7, 12, 13, 15}

Differences for {0, 1, 3, 8, 9, 11, 12, 13, 15}:
     0   1   3   8   9  11  12  13  15
 0       1   3   8   9  11  12  13  15
 1           2   7   8  10  11  12  14
 3               5   6   8   9  10  12
 8                   1   3   4   5   7
 9                       2   3   4   6
11                           1   2   4
12                               1   3
13                                   2
15

Differences for {0, 1, 3, 4, 5, 7, 12, 13, 15}:
     0   1   3   4   5   7  12  13  15
 0       1   3   4   5   7  12  13  15
 1           2   3   4   6  11  12  14
 3               1   2   4   9  10  12
 4                   1   3   8   9  11
 5                       2   7   8  10
 7                           5   6   8
12                               1   3
13                                   2
15
{1^4, 2^4, 3^4, 4^3, 5^2, 6^2, 7^2, 8^3, 9^2, 10^2, 11^2, 12^3, 13, 14, 15}
(the exponents give the multiplicity of each difference)
U ⊕ V = {u + v : u ϵ U, v ϵ V }
U ⊝ V = {u − v : u ϵ U, v ϵ V }

U ⊕ V   -6    2    6
  6      0    8   12
  7      1    9   13
  9      3   11   15

U ⊝ V   -6    2    6
  6     12    4    0
  7     13    5    1
  9     15    7    3
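The two tables can be reproduced, and the claim that the two resulting sets have the same ∆ can be checked, with a short Python sketch. U = {6, 7, 9} and V = {−6, 2, 6} are read off the row and column labels of the tables; the function names are our own.

```python
from collections import Counter
from itertools import combinations

def delta(points):
    """Multiset of pairwise differences, as a Counter (value -> multiplicity)."""
    xs = sorted(points)
    return Counter(b - a for a, b in combinations(xs, 2))

def oplus(U, V):
    return {u + v for u in U for v in V}

def ominus(U, V):
    return {u - v for u in U for v in V}

U, V = {6, 7, 9}, {-6, 2, 6}
A = oplus(U, V)
B = ominus(U, V)
print(sorted(A))             # [0, 1, 3, 8, 9, 11, 12, 13, 15]
print(sorted(B))             # [0, 1, 3, 4, 5, 7, 12, 13, 15]
print(delta(A) == delta(B))  # True
```

Two different sets with the same multiset of differences are the reason the partial digest problem can have more than one solution.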
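BRUTEFORCEPDP(L, n)
1. M ← maximum element in L
2. for every set of n − 2 integers 0 < x2 < · · · < xn−1 < M
3.     X ← {0, x2, . . . , xn−1, M}
4.     Form ∆X from X
5.     if ∆X = L
6.         return X
7. output “No Solution”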
ANOTHERBRUTEFORCEPDP(L, n)
1. M ← maximum element in L
2. for every set of n − 2 integers 0 < x2 < · · · < xn−1 < M from L
3.     X ← {0, x2, . . . , xn−1, M}
4.     Form ∆X from X
5.     if ∆X = L
6.         return X
7. output “No Solution”
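ANOTHERBRUTEFORCEPDP can be written in Python almost line for line. This is a minimal sketch: the function names are our own, the candidate points are taken from the distinct values in L, and None is returned where the pseudocode outputs “No Solution”.

```python
from collections import Counter
from itertools import combinations

def delta(points):
    """Multiset of pairwise differences, as a Counter."""
    xs = sorted(points)
    return Counter(b - a for a, b in combinations(xs, 2))

def another_brute_force_pdp(L, n):
    """Return a sorted list X of n points with delta(X) == L, or None."""
    target = Counter(L)
    M = max(L)
    # Every inner point x_i equals the difference x_i - 0, so it must occur in L.
    candidates = sorted({x for x in L if 0 < x < M})
    for middle in combinations(candidates, n - 2):
        X = [0, *middle, M]
        if delta(X) == target:
            return X
    return None

L = [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
print(another_brute_force_pdp(L, 5))  # [0, 2, 4, 7, 10]
```

Restricting the candidates to values from L is what separates this version from BRUTEFORCEPDP, which tries every integer between 0 and M.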
X = {0, 10}
“The Gold-Bug” by Edgar Allan Poe provides some clues for finding DNA motifs: one
of the characters finds a parchment with the following text written on it:
53++!305))6*;4826)4+.)4+);806*;48!8‘60))85;]8*:+*8!83(88)5*!;46(;88*96*?;8)
*+(;485);5 *!2:*+(;4956*2(5*-)8‘8*;4069285) ;)6!8)
4++;1(+9;48081;8:8+1;48!85;4)485!
528806*81(+9;48;(88;4(+?34;48)4+;161;:188;+?;
Topic # 40 Profiles 1
Conserved Pattern
32 nucleotide
7 sequences
1. CGGGGCTGGGTCGTCACATTCCCCTTTCGATA
2. TTTGAGGGTGCCCAATAACCAAAGCGGACAAA
3. GGGATGCCGTTTGACGACCTAAATCAACGGCC
4. AAGGCCAGGAGCGCCTTTGCTGGTTCTACCTG
5. AATTTTCTAAAAAGATTATAATGTCGGTCCTC
6. CTGCTGTACAACTGAGATCATGCTGCTTCAAC
7. TACATGATCTTTTGTGGATGAGGGAATGATGC
Figure 1
P = ATGCAACT
l=8
1. CGGGGCTATGCAACTGGGTCGTCACATTCCCCTTTCGATA
2. TTTGAGGGTGCCCAATAAATGCAACTCCAAAGCGGACAAA
3. GGATGCAACTGATGCCGTTTGACGACCTAAATCAACGGCC
4. AAGGATGCAACTCCAGGAGCGCCTTTGCTGGTTCTACCTG
5. AATTTTCTAAAAAGATTATAATGTCGGTCCATGCAACTTC
6. CTGCTGTACAACTGAGATCATGCTGCATGCAACTTTCAAC
7. TACATGATCTTTTGATGCAACTTGGATGAGGGAATGATGC
Figure 2
P = ATGCAACT
l=8
1. CGGGGCTATGCAACTGGGTCGTCACATTCCCCTTTCGATA
2. TTTGAGGGTGCCCAATAAATGCAACTCCAAAGCGGACAAA
3. GGATGCAACTGATGCCGTTTGACGACCTAAATCAACGGCC
4. AAGGATGCAACTCCAGGAGCGCCTTTGCTGGTTCTACCTG
5. AATTTTCTAAAAAGATTATAATGTCGGTCCATGCAACTTC
6. CTGCTGTACAACTGAGATCATGCTGCATGCAACTTTCAAC
7. TACATGATCTTTTGATGCAACTTGGATGAGGGAATGATGC
Figure 3
P = ATGCAACT
l=8
7 x (32 + 8) = 280 nucleotides
Probability = 280/4^8 ≈ 0.004 (4^8 = 65,536)
1. CGGGGCTATcCAgCTGGGTCGTCACATTCCCCTTTCGATA
2. TTTGAGGGTGCCCAATAAggGCAACTCCAAAGCGGACAAA
3. GGATGgAtCTGATGCCGTTTGACGACCTAAATCAACGGCC
4. AAGGAaGCAACcCCAGGAGCGCCTTTGCTGGTTCTACCTG
5. AATTTTCTAAAAAGATTATAATGTCGGTCCtTGgAACTTC
6. CTGCTGTACAACTGAGATCATGCTGCATGCcAtTTTCAAC
7. TACATGATCTTTTGATGgcACTTGGATGAGGGAATGATGC
Figure 4
Topic # 41 Profiles 2
Conserved Pattern
1- position 8 - Sequence 1
2- position 19 - Sequence 2
3- position 3- Sequence 3
4- position 5- Sequence 4
5- position 31- Sequence 5
6- position 27- Sequence 6
7- position 15- Sequence 7
1- CGGGGCTATcCAgCTGGGTCGTCACATTCCCCTT
2- TTTGAGGGTGCCCAATAAggGCAACTCCAAAGCGGACAAA
3- GGATGgAtCTGATGCCGTTTGACGACCTA
4- AAGGAaGCAACcCCAGGAGCGCCTTTGCTGG
5- AATTTTCTAAAAAGATTATAATGTCGGTCCtTGgAACTTC
6- CTGCTGTACAACTGAGATCATGCTGCATGCcAtTTTCAAC
7- TACATGATCTTTTGATGgcACTTGGATGAGGGAATGATGC
Figure 6
Topic # 42 Profiles 3
Conserved Pattern
ATTGTC
: x : x : :
ACTCTC
Let s = (s1, s2, . . . , st) be a set of starting positions and let v be some l-mer. We write
dH(v, s) for the total Hamming distance between v and the l-mers starting at positions s:
dH(v, s) = dH(v, s1) + dH(v, s2) + · · · + dH(v, st)
where dH(v, si) is the Hamming distance between v and the l-mer that starts at si in the ith
DNA sequence.
TotalDistance(v, DNA) = min over s of dH(v, s), i.e. the smallest value of dH(v, s) over all
choices of starting positions s.
Finding TotalDistance(v, DNA) is a simple problem: find the best match for v in the first DNA
sequence (i.e., a position minimizing dH(v, s1) for 1 ≤ s1 ≤ n − l + 1), then the best match
in the second sequence, and so on.
We define the median string for DNA as the string v that minimizes TotalDistance(v, DNA); this
minimization is performed over all 4^l strings v of length l.
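The Hamming distance and TotalDistance can be sketched in Python as follows. This is a minimal sketch: the function names are our own and the three short sequences in the second example are hypothetical toy data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(v, dna):
    """Sum, over all sequences, of the smallest Hamming distance
    between v and any l-mer of that sequence."""
    l = len(v)
    return sum(
        min(hamming(v, seq[s:s + l]) for s in range(len(seq) - l + 1))
        for seq in dna
    )

print(hamming("ATTGTC", "ACTCTC"))  # 2

# Hypothetical toy data: ACGT matches exactly once and within one mismatch twice.
print(total_distance("ACGT", ["TTACGTTT", "ACGAGGGG", "CCCCTCGT"]))  # 2
```

Because the best position is chosen independently in each sequence, TotalDistance needs only about t × n comparisons of l-mers rather than a search over all combinations of starting positions.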
Median String Problem:
Given a set of DNA sequences, find a median string.
Input: A t × n matrix DNA, and l, the length of the pattern to find.
Output: A string v of l nucleotides that minimizes TotalDistance(v, DNA) over all strings
of that length.
Double minimization: finding a string v that minimizes TotalDistance(v,DNA),
which is in turn the smallest distance among all choices of starting points s in the
DNA sequences.
min over all l-mers v of ( min over all starting positions s of dH(v, s) )
The Median String problem is a minimization problem, while the Motif Finding problem is a
maximization problem; the two are computationally equivalent.
Let s be a set of starting positions with consensus score Score(s, DNA), and let w be
the consensus string of the corresponding profile. Then dH(w, s) = l·t − Score(s, DNA).
For example, with l = 8, t = 7 and Score(s, DNA) = 42: dH(w, s) = 7 × 8 − 42 = 14
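The relation dH(w, s) = l·t − Score(s, DNA) can be checked on the seven mutated sequences of Figure 4, using the starting positions listed under Topic 41 (8, 19, 3, 5, 31, 27, 15). This is a minimal sketch; the function names are our own, and the lower-case letters of the figure are simply converted to upper case.

```python
from collections import Counter

DNA = [s.upper() for s in [
    "CGGGGCTATcCAgCTGGGTCGTCACATTCCCCTTTCGATA",
    "TTTGAGGGTGCCCAATAAggGCAACTCCAAAGCGGACAAA",
    "GGATGgAtCTGATGCCGTTTGACGACCTAAATCAACGGCC",
    "AAGGAaGCAACcCCAGGAGCGCCTTTGCTGGTTCTACCTG",
    "AATTTTCTAAAAAGATTATAATGTCGGTCCtTGgAACTTC",
    "CTGCTGTACAACTGAGATCATGCTGCATGCcAtTTTCAAC",
    "TACATGATCTTTTGATGgcACTTGGATGAGGGAATGATGC",
]]
STARTS = [8, 19, 3, 5, 31, 27, 15]  # 1-based starting positions
L_MER = 8

def motif_rows(dna, starts, l):
    """Cut the l-mer that begins at each 1-based start position."""
    return [seq[s - 1:s - 1 + l] for seq, s in zip(dna, starts)]

def consensus_and_score(rows):
    """Most frequent base per column, and the sum of those counts."""
    consensus, score = [], 0
    for column in zip(*rows):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        score += count
    return "".join(consensus), score

rows = motif_rows(DNA, STARTS, L_MER)
w, score = consensus_and_score(rows)
distance = sum(sum(x != y for x, y in zip(w, row)) for row in rows)
print(w, score, distance)  # ATGCAACT 42 14
assert distance == L_MER * len(DNA) - score
```

Each column contributes its majority count to the score and the remaining letters to the distance, so the two always add up to l·t.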
Figure 1
AA· · · AA
AA· · · AT
AA· · · AG
AA· · · AC
AA· · · TA
AA· · · TT
AA· · · TG
AA· · · TC
...
CC· · · GG
CC· · · GC
CC· · · CA
CC· · · CT
CC· · · CG
CC· · · CC
All 4^l l-mers
Figure 2
(1, 1, . . . , 1, 1)
(1, 1, . . . , 1, 2)
(1, 1, . . . , 1, 3)
(1, 1, . . . , 1, 4)
(1, 1, . . . , 2, 1)
(1, 1, . . . , 2, 2)
(1, 1, . . . , 2, 3)
(1, 1, . . . , 2, 4)
...
(4, 4, . . . , 3, 3)
(4, 4, . . . , 3, 4)
(4, 4, . . . , 4, 1)
(4, 4, . . . , 4, 2)
(4, 4, . . . , 4, 3)
(4, 4, . . . , 4, 4)
1 for A
2 for T
3 for G
4 for C
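The numeric encoding above (1 for A, 2 for T, 3 for G, 4 for C) lets us enumerate all 4^l l-mers by counting, and a brute-force median string search is then a single minimization over that list. This is a minimal sketch: the function names are our own and the three short sequences are hypothetical toy data in which ACGT is the only 4-mer shared by all of them.

```python
from itertools import product

ALPHABET = "ATGC"  # 1 for A, 2 for T, 3 for G, 4 for C

def all_lmers(l):
    """Yield (digits, word) for all 4^l l-mers in the order of the list above."""
    for digits in product(range(1, 5), repeat=l):
        yield digits, "".join(ALPHABET[d - 1] for d in digits)

def total_distance(v, dna):
    """Sum over sequences of the best Hamming distance between v and any l-mer."""
    l = len(v)
    return sum(
        min(sum(x != y for x, y in zip(v, seq[s:s + l]))
            for s in range(len(seq) - l + 1))
        for seq in dna
    )

def median_string(dna, l):
    """Brute force: try every l-mer and keep the one with the smallest distance."""
    return min((word for _, word in all_lmers(l)),
               key=lambda word: total_distance(word, dna))

print([word for _, word in all_lmers(2)][:5])  # ['AA', 'AT', 'AG', 'AC', 'TA']
print(median_string(["TTACGTTT", "ACGTGGGG", "CCCCACGT"], 4))  # ACGT
```

The search visits 4^l words regardless of the sequence length n, which is why the median string formulation is attractive when l is small.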
L = 4, k = 2
1. (-,-,-,-)
2. (1,-,-,-)
3. (1,1,-,-)
4. (1,1,1,-)
5. (1,1,1,1)
6. (1,1,1,2)
7. (1,1,2,-)
8. (1,1,2,1)
9. (1,1,2,2)
10. (1,2,-,-)
11. (1,2,1,-)
12. (1,2,1,1)
13. (1,2,1,2)
14. (1,2,2,-)
15. (1,2,2,1)
16. (1,2,2,2)
17. (2,-,-,-)
18. (2,1,-,-)
19. (2,1,1,-)
20. (2,1,1,1)
21. (2,1,1,2)
22. (2,1,2,-)
23. (2,1,2,1)
24. (2,1,2,2)
25. (2,2,-,-)
26. (2,2,1,-)
27. (2,2,1,1)
28. (2,2,1,2)
29. (2,2,2,-)
30. (2,2,2,1)
31. (2,2,2,2)
When i < L, NEXTVERTEX (a, i, L, k) moves down to the next lower level and
explores that subtree of a. If i = L, NEXTVERTEX either moves along the lowest
level as long as aL < k or jumps back up in the tree.
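The behaviour described above can be sketched in Python and checked against the 31 vertices listed for L = 4 and k = 2. This is a minimal sketch of NEXTVERTEX as described in the text, using a 0-based Python list for the 1-based vector a; the helper functions show and walk are our own.

```python
def next_vertex(a, i, L, k):
    """Return the next vertex (a, i) in a depth-first walk of the tree.

    a holds L values in 1..k; only the first i of them are meaningful.
    i == 0 on return means the walk is finished.
    """
    a = list(a)
    if i < L:
        a[i] = 1               # move down: set a_(i+1) to 1
        return a, i + 1
    for j in range(L, 0, -1):  # at a leaf: move along or jump back up
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def show(a, i):
    """Format a vertex the way the list above does, e.g. (1,1,2,-)."""
    parts = [str(x) for x in a[:i]] + ["-"] * (len(a) - i)
    return "(" + ",".join(parts) + ")"

def walk(L, k):
    a, i = [1] * L, 0
    vertices = [show(a, i)]    # the root (-,-,...,-)
    while True:
        a, i = next_vertex(a, i, L, k)
        if i == 0:
            return vertices
        vertices.append(show(a, i))

vertices = walk(4, 2)
print(len(vertices))              # 31
print(vertices[6], vertices[16])  # (1,1,2,-) (2,-,-,-)
```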
➢ A computer is less intelligent but can perform simple steps quickly and
reliably
➢ An algorithm must be rephrased in a programming language
➢ Pseudocode: a language often used to describe algorithms
➢ Complex operations are grouped together into mini-algorithms called
subroutines
➢ A variable is written as x or total
➢ An array of n elements is an ordered collection of n variables a1,
a2, ..., an
➢ An algorithm in pseudocode is denoted by a name, followed by the list
of arguments
A tree that has uninteresting subtrees. The numbers next to a leaf represent the
“score” for that L-mer. Scores at internal vertices represent the maximum score in
the subtree rooted at that vertex. To improve the brute force algorithm, we can
“prune” subtrees that do not contain high-scoring leaves. For example, since the
score of the very first leaf is 24, it does not make sense to analyze the 4th, 5th, or
6th leaves whose scores are 20, 4, and 5, respectively. Therefore, the subtree
containing these vertices can be ignored.
1. s ← (1, . . . , 1)
2. bestScore ← 0
3. i ← 1
4. while i > 0
5.     if i < t
6.         (s, i) ← NEXTVERTEX(s, i, t, n − l + 1)
7.     else
8.         if Score(s, DNA) > bestScore
9.             bestScore ← Score(s, DNA)
10.            bestMotif ← (s1, s2, . . . , st)
11.        (s, i) ← NEXTVERTEX(s, i, t, n − l + 1)
12. return bestMotif
Simple Motif Search Algorithm
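Some sets of starting positions can be ruled out early.
If the first i of the t starting positions, i.e., (s1, s2, . . . , si), form a weak profile, then no
choice of starting positions in sequences i+1, i+2, . . . , t can make the full set optimal.
For s = (s1, s2, . . . , st), define the partial consensus score Score(s, i, DNA) as the consensus
score of the i × l alignment matrix formed by the first i sequences.
Given the partial consensus score for s1, . . . , si, the remaining t − i rows can only improve the
consensus score by at most (t − i) · l.
So the score of any alignment matrix that begins with the first i starting positions
(s1, . . . , si) is at most Score(s, i, DNA) + (t − i) · l.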
6.         optimisticScore ← Score(s, i, DNA) + (t − i) · l
7.         if optimisticScore < bestScore
8.             (s, i) ← BYPASS(s, i, t, n − l + 1)
9.         else
10.            (s, i) ← NEXTVERTEX(s, i, t, n − l + 1)
11.    else
12.        if Score(s, DNA) > bestScore
13.            bestScore ← Score(s, DNA)
14.            bestMotif ← (s1, s2, . . . , st)
15.        (s, i) ← NEXTVERTEX(s, i, t, n − l + 1)
16. return bestMotif
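The branch-and-bound motif search can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it assumes all sequences have the same length n, uses 1-based starting positions as in the pseudocode, and the function names (`score`, `bypass`, `branch_and_bound_motif_search`) are illustrative. The three short sequences at the end are made-up test data.

```python
from collections import Counter


def score(s, i, dna, l):
    """Partial consensus score of the first i starting positions (1-based)."""
    total = 0
    for col in range(l):
        column = Counter(dna[r][s[r] - 1 + col] for r in range(i))
        total += max(column.values())
    return total


def next_vertex(a, i, L, k):
    a = list(a)
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def bypass(a, i, L, k):
    """Skip the subtree rooted at the current vertex."""
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0


def branch_and_bound_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    k = n - l + 1
    s, i = [1] * t, 1
    best_score, best_motif = 0, None
    while i > 0:
        if i < t:
            # The remaining t - i rows can add at most (t - i) * l.
            optimistic = score(s, i, dna, l) + (t - i) * l
            if optimistic < best_score:
                s, i = bypass(s, i, t, k)
            else:
                s, i = next_vertex(s, i, t, k)
        else:
            current = score(s, t, dna, l)
            if current > best_score:
                best_score, best_motif = current, tuple(s)
            s, i = next_vertex(s, i, t, k)
    return best_motif, best_score


# Made-up data: the 3-mer ACG is planted in every sequence.
print(branch_and_bound_motif_search(["TTACGTT", "ACGTTTT", "TTTTACG"], 3))
```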
A search tree for the Median String problem. Each branching point can give
rise to only four children, as opposed to the n−l+1 children in the Motif
Finding problem.
SIMPLEMEDIANSEARCH(DNA, t, n, l)
1. s ← (1, 1, . . . , 1)
2. bestDistance ← ∞
3. i ← 1
4. while i > 0
5.     if i < l
6.         (s, i) ← NEXTVERTEX(s, i, l, 4)
7.     else
8.         word ← nucleotide string corresponding to (s1, s2, . . . , sl)
9.         if TOTALDISTANCE(word, DNA) < bestDistance
10.            bestDistance ← TOTALDISTANCE(word, DNA)
11.            bestWord ← word
12.        (s, i) ← NEXTVERTEX(s, i, l, 4)
13. return bestWord
BRANCHANDBOUNDMEDIANSEARCH(DNA, t, n, l)
1. s ← (1, 1, . . . , 1)
2. bestDistance ← ∞
3. i ← 1
4. while i > 0
5.     if i < l
6.         prefix ← nucleotide string corresponding to (s1, s2, . . . , si)
7.         optimisticDistance ← TOTALDISTANCE(prefix, DNA)
8.         if optimisticDistance > bestDistance
9.             (s, i) ← BYPASS(s, i, l, 4)
10.        else
11.            (s, i) ← NEXTVERTEX(s, i, l, 4)
12.    else
13.        word ← nucleotide string corresponding to (s1, s2, . . . , sl)
14.        if TOTALDISTANCE(word, DNA) < bestDistance
15.            bestDistance ← TOTALDISTANCE(word, DNA)
16.            bestWord ← word
17.        (s, i) ← NEXTVERTEX(s, i, l, 4)
18. return bestWord
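The same pruning idea can be sketched in Python. This is a minimal sketch that swaps the NEXTVERTEX/BYPASS bookkeeping for an equivalent recursive depth-first search; it assumes TOTALDISTANCE is the sum, over all sequences, of the smallest Hamming distance between the word and any substring of the same length. The function names and the sample sequences are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def total_distance(word, dna):
    """Sum over sequences of the best Hamming distance to any window."""
    k = len(word)
    return sum(
        min(hamming(word, seq[p:p + k]) for p in range(len(seq) - k + 1))
        for seq in dna
    )


def median_string(dna, l, alphabet="ACGT"):
    best_word, best_distance = None, float("inf")

    def visit(prefix):
        nonlocal best_word, best_distance
        if len(prefix) == l:
            d = total_distance(prefix, dna)
            if d < best_distance:
                best_word, best_distance = prefix, d
            return
        # The distance of a prefix is a lower bound for all its extensions.
        if prefix and total_distance(prefix, dna) > best_distance:
            return  # prune this subtree
        for letter in alphabet:
            visit(prefix + letter)

    visit("")
    return best_word, best_distance


print(median_string(["TTACGTT", "ACGTTTT", "TTTTACG"], 3))
```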
Mouse X chromosome
Human X chromosome
Transformation of the mouse gene order into the human gene order on the X
chromosome
6.             bestMotif1 ← s1
7.             bestMotif2 ← s2
8. s1 ← bestMotif1
9. s2 ← bestMotif2
10. for i ← 3 to t
11.     for si ← 1 to n − l + 1
12.         if Score(s, i, DNA) > Score(bestMotif, i, DNA)
13.             bestMotifi ← si
14.     si ← bestMotifi
15. return bestMotif
Approximation algorithm
The two closest l-mers form a 2 × l seed matrix; finding them takes l(n − l + 1)^2 operations
Introduction
When a new gene is discovered, biologists usually have no idea of its function
A common approach is to find similarities with genes of known function
The newly discovered cancer-causing v-sis oncogene matched a normal gene involved
in growth and development called platelet-derived growth factor
v-sis is the oncogene of the simian sarcoma virus
Scientists became suspicious that cancer might be caused by a normal growth gene
Discovery of the cystic fibrosis gene
Instead of solving the Manhattan Tourist problem directly, that is, finding the
longest path from source (0, 0) to sink (n,m), we solve a more general problem:
find the longest path from source to an arbitrary vertex (i, j) with
0 ≤ i ≤ n, 0 ≤ j ≤ m. We will denote the length of such a best path as si,j , noticing
that sn,m is the weight of the path that represents the solution to the Manhattan
Tourist problem
It is commonly the case that the ith symbol in one sequence corresponds to a symbol at a
different position in the other. DNA mutates during evolution: errors in DNA
replication cause substitutions, insertions, and deletions of nucleotides, leading to “edited”
DNA texts. Because of insertions and deletions, we cannot assume that the ith symbol in one
DNA sequence corresponds to the ith symbol in the other.
Topic # 62 Alignment
The alignment of the strings v (of n characters) and w (of m characters, with m not
necessarily the same as n) is a two-row matrix such that the first row contains the
characters of v in order while the second row contains the characters of w in
order, where spaces may be interspersed throughout the strings in different places.
As a result, the characters in each string appear in order, though not necessarily
adjacently.
No column of the alignment matrix contains spaces in both rows, so that the
alignment may have at most n + m columns.
A T -- G T T A T --
A T C G T -- A -- C
Columns that contain the same letter in both rows are called matches, while
columns containing different letters are called mismatches. The columns of the
alignment containing one space are called indels, with the columns containing a
space in the top row called insertions and the columns with a space in the bottom
row deletions. The alignment above has five matches, zero mismatches, and four indels. The
number of matches plus the number of mismatches plus the number of indels is equal to the
length of the alignment matrix and is at most n + m
A T -- G T T A T --
A T C G T -- A -- C
Each of the two rows in the alignment matrix is represented as a string interspersed
by space symbols “−”; for example AT--GTTAT-- is a representation of the row
corresponding to v = ATGTTAT, while ATCGT--A--C is a representation of the
row corresponding to w = ATCGTAC
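The column types defined above can be tallied with a few lines of Python. This is a minimal sketch: it assumes each space symbol is written as a single "-" character (the handout prints it as "--"), and the function name `classify_columns` is illustrative.

```python
def classify_columns(top, bottom, gap="-"):
    """Count matches, mismatches, insertions and deletions in an alignment."""
    counts = {"matches": 0, "mismatches": 0, "insertions": 0, "deletions": 0}
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("a column may not contain spaces in both rows")
        if a == gap:
            counts["insertions"] += 1   # space in the top row
        elif b == gap:
            counts["deletions"] += 1    # space in the bottom row
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


# The alignment of v = ATGTTAT and w = ATCGTAC shown above.
print(classify_columns("AT-GTTAT-", "ATCGT-A-C"))
```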
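$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$
The first term corresponds to the case when vi is not present in the LCS
of the i-prefix of v and j-prefix of w (a deletion of vi); the second term corresponds to the case
when wj is not present in this LCS (an insertion of wj); and the third term corresponds to the
case when both vi and wj are present in the LCS (vi matches wj).
These recurrences can be rewritten by adding some zeros here and there as
$$s_{i,j} = \max \begin{cases} s_{i-1,j} + 0 \\ s_{i,j-1} + 0 \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$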
The length of an LCS between v and w can be read from the element (n,m) of the
dynamic programming table, but to reconstruct the LCS from the dynamic
programming table, one must keep some additional information about which of the
three quantities, si−1,j , si,j−1, or si−1,j−1 + 1, corresponds to the maximum in the
recurrence for si,j .
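The recurrence and the backtracking step can be sketched in Python. This is a minimal sketch: instead of storing explicit back-pointers, it recovers which of the three quantities produced each maximum by re-checking the table during backtracking. The function name `lcs` is illustrative.

```python
def lcs(v, w):
    """Return one longest common subsequence of strings v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    # Backtrack from (n, m) to recover the subsequence itself.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            out.append(v[i - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("ATGTTAT", "ATCGTAC"))
```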
$$d_{i,j} = \min \begin{cases} d_{i-1,j} + 1 \\ d_{i,j-1} + 1 \\ d_{i-1,j-1}, & \text{if } v_i = w_j \end{cases}$$
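This recurrence counts only insertions and deletions. A minimal Python sketch follows, assuming the boundary values d(i, 0) = i and d(0, j) = j, which the handout does not state explicitly; the function name is illustrative.

```python
def indel_distance(v, w):
    """Edit distance when only insertions and deletions are allowed."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete the first i symbols of v
    for j in range(1, m + 1):
        d[0][j] = j                      # insert the first j symbols of w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1)
            if v[i - 1] == w[j - 1]:
                d[i][j] = min(d[i][j], d[i - 1][j - 1])
    return d[n][m]


print(indel_distance("ATGTTAT", "ATCGTAC"))
```

For v = ATGTTAT and w = ATCGTAC the result equals the four indels of the alignment shown earlier.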
Some types of changes are common and others rare; an amino acid scoring matrix captures
this for sequence alignment
Related amino acid sequences may share very few matches, so they are compared with a scoring
matrix δ(i, j) that reflects how often amino acid ‘i’ substitutes for amino acid ‘j’
Building it amounts to counting how many times amino acid ‘i’ is aligned with amino acid
‘j’
But constructing those alignments itself requires knowing a scoring matrix
Met Ala Phe Ser Gly Asp Glu Ser. . . . . . .
Met Ala Phe Ser -- Asp Glu Ser. . . . . . .
If the proteins are 90% identical, a simple scheme of +1 for matches and -1 for mismatches and
indels will do the job. The resulting “obvious” alignments are then used to
compute the scoring matrix δ.
This simplified description hides subtle details that are important in the construction of
scoring matrices
Ser Phe Try Phe
Related proteins in mouse and rat (about 15 million years apart): LESS change than in related
proteins in mouse and human (about 80 million years apart)
The best scoring matrix to compare two proteins depends on the similarity of the
organisms they come from
Also define g(i,j) as where f(i) is the frequency of the amino acid i in
all proteins from data set.
g(i,j) defines the probability that an amino acid i mutates into amino acids j within
1 PAM unit. The (i,j) entry of the PAM 1 matrix is defined as δ(i,j) =
This path contains so many indels that it is unlikely to be the highest scoring
alignment.
Biologically irrelevant diagonal paths – likely have score- mismatches are
Topic # 72
same TOPIC 71
Global Sequence Alignment: finding the longest path between vertices (0,0) and
(n,m) in the edit graph
Local Alignment: finding the longest path among the paths between arbitrary
vertices (i, j) and (iˊ, jˊ) in the edit graph.
A direct approach finds the longest path between every pair of vertices (i, j) and (iˊ, jˊ) and
then selects the longest of these computed paths. Instead of finding the longest path from each
(i, j) to each (iˊ, jˊ), the Local Alignment Problem (LAP) can be reduced to finding the longest
path from the source (0,0) to every other vertex by adding edges of weight 0 from the source
to every other vertex
The motivation for the choice of the closest strings at the early steps of the algorithm
is that close strings often provide the most reliable information about a real
alignment
Many popular iterative multiple alignment algorithms including the tool
CLUSTAL, use similar strategies
ATGTCATATTCGGAC
ATGTCATATTCGGAC
ATGTCATATTCGGAC
ATG- CATA
ATGTCATA
A problem with progressive multiple alignment algorithms such as CLUSTAL is that they may be
misled by a spuriously strong pairwise alignment, in effect a bad seed. An error
in the initial pairwise alignment will propagate all the way through to the whole
multiple alignment. Many multiple alignment algorithms have been proposed; even with their
systematic deficiencies, they are quite useful in computational biology
Multiple alignment for k sequences
Multiple alignment is a generalization of the Pairwise Alignment problem. It assumes the
existence of a k-dimensional scoring matrix, but k-dimensional scoring matrices are not very
practical. We describe two other scoring approaches that are more biologically relevant. The
choice of the scoring function can drastically affect the quality of the resulting alignment,
and no single scoring approach is perfect in all circumstances.
A multiple alignment of k sequences corresponds to a path of edges in a k-dimensional
Manhattan grid-like edit graph.
The weights of the edges are determined by the scoring function
Intuitively, we assign higher scores to the columns with a low variation in letters, so that
high scores correspond to highly conserved sequences
In the Multiple Longest Common Subsequence problem, the score of a column is set to 1
if all the characters in the column are the same, and 0 if even one character disagrees
Hexon mRNA, mRNA was hybridized to adenovirus DNA- the hybrid molecules-
electron microscopy.
mRNA-DNA hybrids-three loop structures-continuous duplex segment- classic
continuous gene model
Know a human protein, and we want to discover the exon structure of the related
gene in the mouse genome. The more sequence data we collect, the more accurate
and reliable similarity based
methods become. Consequently, the trend in gene prediction has recently shifted
from statistically motivated approaches to similarity-based algorithms
Some more facts about genetic code and codon usage in humans
Biologically oriented
Approach
Recognize the locations of splicing signals at exon-intron junctions
There exists a weakly conserved sequence of eight
nucleotides at the boundary of an exon and an intron (donor splice site) and a
sequence of four nucleotides at the boundary of an intron and exon (acceptor splice
site)
Profiles for splice sites are weak and have had limited success. Hidden Markov Model (HMM)
approaches capture statistical dependencies between sites. GENSCAN,
developed by Chris Burge and Samuel Karlin, combines coding region
and splicing signal predictions into a single framework.
For splice site prediction, a coding region should appear on one side of the site
Such statistics are used in the HMM framework of GENSCAN that merges splicing
site statistics, coding region statistics, and motifs near the start of the gene
The accuracy of GENSCAN decreases for genes with many short exons or with
unusual codon usage
Overlapping
Model a putative exon as a weighted interval in the genomic sequence, with
parameters (l, r, w)
l is the left-hand position, r is the right-hand position, and w is the weight of the
putative exon
The weight w may be, for example, a local alignment score that reflects the
likelihood that this interval is an exon
A chain is any set of non-overlapping weighted intervals.
The total weight of a chain is the sum of the weights of its intervals
A maximum chain is a chain with maximum total weight among all possible chains
Five weighted intervals, (2, 3, 3), (4, 8, 6), (9, 10, 1), (11, 15, 7), and (16, 18, 4),
shown by bold edges, form an optimal solution to the Exon Chaining problem. The
array at the bottom shows the values s1, s2, . . . , s2n generated by the
EXONCHAINING algorithm
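A maximum chain can be computed by dynamic programming. The sketch below is not the EXONCHAINING procedure with its 2n-endpoint array; it is an equivalent formulation over intervals sorted by right end, chosen because it is short. The two overlapping intervals (1, 5, 5) and (7, 17, 12) are hypothetical additions for illustration, since the other intervals from the figure are not listed in the text.

```python
import bisect


def max_chain_weight(intervals):
    """Maximum total weight of a set of non-overlapping (l, r, w) intervals."""
    ivs = sorted(intervals, key=lambda iv: iv[1])
    rights = [r for _, r, _ in ivs]
    # best[k] = best chain weight using only the first k sorted intervals
    best = [0] * (len(ivs) + 1)
    for k, (l, r, w) in enumerate(ivs, start=1):
        # p = number of intervals that end strictly before this one starts
        p = bisect.bisect_left(rights, l)
        best[k] = max(best[k - 1], best[p] + w)
    return best[-1]


exons = [(2, 3, 3), (4, 8, 6), (9, 10, 1), (11, 15, 7), (16, 18, 4),
         (1, 5, 5), (7, 17, 12)]   # the last two are hypothetical
print(max_chain_weight(exons))
```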
Reading Frames:
Depending on the start point, we can define different reading frames. To read triplet
codons we can start at any of the first three nucleotides, which gives three
possibilities for one strand. Since DNA has two strands, there are six reading frames in
total in which ORFs can occur (three on the strand written 5’ to 3’ and three on the
complementary strand, which appears 3’ to 5’ when written underneath it).
*Three on the forward strand and three on the complementary strand
The complementary strand runs in an anti-parallel fashion: written under the forward strand, it
starts with its 3’ end and ends at its 5’ end. We can define reading frames on it in the same
way, starting from position 1, 2 or 3.
Conclusions:
─ ORFs provide important evidence in gene finding.
─ Generally longer ORFs are preferred.
─ However, the presence of an ORF does not necessarily mean that the region is translated into
a functional product
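The six reading frames can be listed with a short Python sketch. It assumes an uppercase DNA string over A, C, G and T; the function names and the example sequence are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complementary strand, written in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Three frames on the forward strand, then three on the complement."""
    rc = reverse_complement(seq)
    return [codons(seq, o) for o in range(3)] + [codons(rc, o) for o in range(3)]


for frame in six_frames("ATGAAATAG"):
    print(frame)
```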
Coding Potential:
Hexamer frequencies in coding versus non-coding regions may provide important insights
The frequency score of nucleotide X (A, G, C or T) at position i is
Fi (X) = log(Ci (X)/Ni (X))
(the log of the ratio of the count of that particular nucleotide at that particular position to
the total count at that position)
Based upon the frequency equation, we can come up with a frequency table as shown in
the figure. The frequencies shown there indicate the presence of a true TSS, or at least that
some Transcription Start Sites (TSS) can be expected there.
The table shows which nucleotides occur at positions -4, -3, -2 and -1 and at positions +3, +4,
+5 and +6, together with their percentages.
Example
Which one is more probable to be a Translation Start?
Solution
We can use the frequency table and the following scoring function:
Si = log (Fi (X)/0.25)
that is, the frequency from the frequency table divided by the expected frequency (0.25 for each
of the four nucleotides), converted to a log scale so that the per-position scores can be added.
We can call this quantity the Information Content (IC)
The frequencies are picked from the frequency table.
So, which sequence has the stronger evidence of being a translation start site? In our case, we
prefer the one with the higher score, so it is probably the green sequence.
Algorithm:
➢ Build a mathematical model, based on collected translation start sequence
➢ For each candidate translation start sequence, apply the model and get a score
➢ If the score is larger than zero, predict it is a “translation start”; the higher score, the
higher the probability the prediction is true
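The scoring step can be sketched in Python. This is a minimal sketch under stated assumptions: the logarithm is taken to base 2 (the handout does not give the base), and the two-position frequency table is hypothetical, made up only to illustrate the calculation.

```python
import math

# Hypothetical position-specific frequencies Fi(X) for a two-position site.
FREQ = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.5, "T": 0.25},
]


def site_score(seq, freq, background=0.25):
    """Sum of log2(Fi(X) / background) over the positions of the candidate."""
    return sum(math.log2(freq[i][base] / background)
               for i, base in enumerate(seq))


def is_predicted_start(seq, freq):
    """Predict a start site when the total score is larger than zero."""
    return site_score(seq, freq) > 0


for candidate in ("AG", "GA", "CT"):
    print(candidate, site_score(candidate, FREQ), is_predicted_start(candidate, FREQ))
```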
Conclusions:
• TSS prediction can be an important step in gene prediction
References:
Biological Sequence Analysis
R Durbin, S Eddy, A Krogh and G Mitchison
Cambridge University Press, 1998.
Bioinformatics The machine learning approach
P Baldi and S Brunak
The MIT Press, 1998
Post-Genome Informatics
M Kanehisa
Oxford University Press, 2000
Acceptor
• (introns ends)YAG | coding region
-Y can be any pyrimidine.
• Canonical form
• GT-AG: 99.24%
Like TSS, the flanks of splice junctions show “biased” distributions of nucleotides in certain
positions
• These biased distributions of nucleotides are the basis for prediction of splice junctions
Sequence LOGOS:
─ A visual representation of a position-specific distribution
─ Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions.
Algorithm:
Mathematical model: Fi (X): frequency of X (A, C, G, T) in position i
Score a segment as a candidate donor/acceptor site by
log (Fi (X)/0.25)
For each candidate sequence, apply the model and get a score
If the score is larger than zero, predict it is “donor/acceptor”; the higher the score, the higher
the probability that the prediction is true
Conclusions:
Like TSS, the flanks of splice junctions show “biased” distributions of nucleotides in certain
positions
• These biased distributions of nucleotides can be used for prediction of splice junctions
Approach:
For each segment [acceptor, donor], we get three scores (coding potential, donor score,
acceptor score)
Various possibilities
─ all three scores are high – probably true exon.
─ all three scores are low – probably not a real exon.
─ all in the middle -- ?.
─ some scores are high and some are low -- ??
So here, we can get the evidence by the help of that information which we gathered from those
splice sites.
Prediction:
➢ Collect a set of exons and non-exons
➢ Score them using our scoring schemes
➢ Plot them as follows
➢ “draw” a separating line between exons and non-exons
For example, if we look into this picture, there is the coding region and the non-coding region,
and there is a line in the middle that tries to separate these two, which helps in discrimination.
We can draw a central line or linear regression line; its equation is shown in the figure. The
line is fitted so that it matches most of the data points, and it gives us the prediction.
For a data point whose class we do not know, we can predict it using this prediction line (the
angled line in the figure).
Conclusion:
➢ Collect a set of exons and non-exons
➢ Score them using our scoring schemes
➢ Plot them as follows
➢ “draw” a separating line between exons and non-exons
Ab initio methods:
Based on sequence alone
• Gene prediction algorithms (e.g. AUGUSTUS, Glimmer, GeneMark)
• RepeatMasker(repeat families)
Evidence-based Methods:
• Require transcriptome data for the target organism (the more the better)
Biological Annotations:
BLAST of gene models against protein databases
• Sequence similarity to known proteins
– GO terms
– BLAST2GO
Pattern Finding:
─ Much of the data processing in bioinformatics involves searching and recognizing
certain patterns within DNA, RNA or protein sequences.
─ In biology this means finding motifs in DNA or proteins, while in computational terms it
means finding a pattern in a string
Conclusions:
After a genome is assembled, genome annotation is performed to identify genes and other
features in the genome
(InDels).
Rank possible matches according to a weight function and keep matches above a certain
threshold
Generalized Algorithm:
Goal: Finding all occurrences of a pattern in a text
Input:
Pattern p = [p1…pn] of length n
Text t = [t1…tm] of length m
Output:
An indication of whether pattern P exists in text T, and where
In the example, the pattern matches at the seventh index, so the result is that P is a
substring of T, spanning the 7th to the 12th index.
Conclusion:
Pattern searching algorithms search specific sequences in strands of DNA, RNA and proteins
having important biological meaning
The matching needs to be exact, which means that the exact word or pattern is found
Exact Pattern Matching Algorithms:
➢ Naïve Brute Force algorithm
➢ Boyer-Moore algorithm
➢ Knuth Morris Pratt algorithm
Suffix Trees:
It is a compressed tree containing all the suffixes and allows many problems on strings to be
solved quickly
Conclusions:
➢ Exact searching or pattern matching methods
➢ Approximate searching or pattern matching methods
➢ Position weight matrices.
➢ Suffix trees
Working:
➢ Searches patterns by going through the whole sequence nucleotide per nucleotide.
➢ Always shifts the window by exactly one position to right
➢ Requires 2n expected text character comparisons
When a mismatch occurs, the comparison stops and starts again after moving the pattern one
position forward
Algorithm:
Brute_Force(T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n − m
        do if P[1..m] = T[s+1..s+m]
            then print “pattern occurs at position” s+1
Here, Brute_Force is the function, which takes two arguments (T = text and P = pattern); n
records the length of T and m records the length of P. The for loop runs s from 0 to n − m;
for example, if T has length 10 and P has length 5, s starts from 0 and goes up to 5.
Do if P[1..m]= T[s+1…s+m]
This line compares the whole pattern, from nucleotide 1 to nucleotide m, with the window of
the text that starts just after shift s. Since s starts at 0, we add one, so the first window runs
from nucleotide 0+1 = 1 of the text to nucleotide m.
If we find an occurrence of the pattern, the print statement reports it, and s+1 gives its
position.
Drawback:
The repeated comparison of the same residues leads to a worst-case runtime of O(mn), which
makes it slow on long sequences
Conclusion:
Brute force is an exhaustive search method that takes a long time because it compares the
sequences nucleotide by nucleotide
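The brute force search can be sketched in Python. This is a minimal sketch that returns the 1-based positions instead of printing them; the function name and the example strings are illustrative.

```python
def brute_force(text, pattern):
    """Return all 1-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    for s in range(n - m + 1):          # shift the window one position at a time
        if text[s:s + m] == pattern:    # compare the whole window with the pattern
            positions.append(s + 1)
    return positions


print(brute_force("ACGTACGTAC", "GTAC"))
```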
Components:
• Π
The Prefix Function
Conclusions:
A linear time algorithm for string matching
• avoid useless shifts of the pattern ‘p’
7 do q ← Π[q]
9 then q ← q + 1
Let us execute the KMP algorithm to find whether ‘p’ occurs in ‘S’.
For ‘p’ the prefix function, Π was computed previously and is as follows:
q 1 2 3 4 5 6 7
p a t a t a c a
Π 0 0 1 2 3 0 1
Complexity
➢ O(m) - It is to compute the prefix function values.
➢ O(n) - It is to compare the pattern to the text.
➢ Total of O(m + n) run time.
Advantages
➢ The running time of the KMP algorithm is optimal (O(m + n)), which is very fast
➢ The algorithm never needs to move backwards that makes the algorithm good for
processing very large files
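The prefix function and the matching loop can be sketched in Python. This is a minimal sketch assuming a non-empty pattern; it uses 0-based lists internally and reports 1-based match positions. The function names and the example text are illustrative.

```python
def prefix_function(p):
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text, p):
    """Return all 1-based positions where p occurs in text."""
    pi = prefix_function(p)
    q = 0                       # number of pattern characters matched so far
    hits = []
    for i, ch in enumerate(text):
        while q > 0 and p[q] != ch:
            q = pi[q - 1]       # fall back without re-reading the text
        if p[q] == ch:
            q += 1
        if q == len(p):
            hits.append(i - len(p) + 2)
            q = pi[q - 1]
    return hits


print(prefix_function("atataca"))
print(kmp_search("gatatatacata", "atataca"))
```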
Drawback
96. Scoring Scheme
Scoring System: Introduction:
• Total score assigned to an alignment is sum of terms for each aligned pair of residues,
plus terms for each gap
∑ S(xi,yj) + d
d = linear gap penalty
Simple Alignment Scores:
• A simple way (but not the best) to score an alignment is to count 1 for each match and 0
for each mismatch
Substitution Matrices:
For a set of well known proteins:
• Align the sequences
Positive Score:
The amino acids are similar; mutations from one into the other occur more often than expected
by chance during evolution
Negative Score:
The amino acids are dissimilar; mutations from one into the other occur less often than expected
by chance during evolution
Conclusion
• Substitution matrices are the log-odds matrices used for scoring amino acid substitutions
in pairwise alignments
97. Substitution Matrices
Introduction:
• Substitution scores can be derived from probabilistic model
Notations:
➢ Let x and y be a pair of sequences of length n and m
➢ Let xi be the ith symbol in x and yj be the jth symbol in y
➢ Symbols are from an alphabet A
o For DNA, A = {A,T,G,C}
o For proteins, A = {twenty amino acids}
➢ Symbols from the alphabet are denoted a, b, etc.
Objective:
Given a pair of aligned sequences, we want to assign a score to the alignment that gives a
measure of the relative likelihood that the sequences are related as opposed to being unrelated
Unrelated or random Model R:
Letter a occurs independently with frequency qa, and the probability of the two sequences is
the product of the probabilities of each residue
Match Model M:
─ Aligned pairs occur with a joint probability pab.
─ pab can be thought of as the probability that the residues a and b have been independently
derived from some unknown original residue c in their common ancestor.
─ The probability of the alignment under this model is P(x,y|M), and under the random model it
is P(x,y|R)
Odds ratio:
─ The ratio of the two likelihoods, P(x,y|M) / P(x,y|R), is the odds ratio. Taking its logarithm
gives an additive score:
S = ∑ S(xi,yj)
where S(a,b) = log(pab/(qa qb))
is the log likelihood ratio of the pair (a,b) occurring as an aligned pair as opposed to an
unaligned pair
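The log-odds score can be sketched in Python. This is a minimal sketch under stated assumptions: the logarithm is taken to base 2, and the two-letter alphabet with its probabilities pab and qa is hypothetical, chosen only so that the joint probabilities sum to 1.

```python
import math

# Hypothetical background frequencies q and joint probabilities p.
Q = {"A": 0.5, "B": 0.5}
P = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def log_odds(p_ab, q_a, q_b):
    """S(a,b) = log(pab / (qa * qb)), here with a base-2 logarithm."""
    return math.log2(p_ab / (q_a * q_b))


def alignment_score(x, y, p, q):
    """Total score of an ungapped alignment: the sum of the pair scores."""
    return sum(log_odds(p[(a, b)], q[a], q[b]) for a, b in zip(x, y))


print(log_odds(P[("A", "A")], Q["A"], Q["A"]))   # positive: seen more often than chance
print(log_odds(P[("A", "B")], Q["A"], Q["B"]))   # negative: seen less often than chance
print(alignment_score("AAB", "ABB", P, Q))
```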
Dayhoff PAM matrices:
Dayhoff, Schwartz and Orcutt (1978) presented their famous PAM (point accepted mutation)
matrices, using substitution data from similar proteins and then extrapolating this information
to longer evolutionary distances
S(a,b) = log(pab/(qa qb))
Incorporating the evolutionary time t:
S(a,b|t) = log(P(b|a,t)/qb)
since pab/qa = P(b|a)
Values are rounded to the nearest integer for computational convenience
• PAM250 is scaled by 3/log 2 to give scores in third-bits
BLOSUM matrices:
─ Dayhoff matrices do not capture true difference between short time substitutions and long
term ones.
─ PAM matrices do not perform well in case of distantly related proteins.
─ BLOSUM matrices are derived from set of aligned, ungapped regions from protein
families called BLOCKS database (Henikoff&Henikoff 1992).
─ Sequences from each block were clustered together when their score was > L%.
─ Matrices with L= 62 and L= 50 known as BLOSUM62 and BLOSUM50 respectively.
─ BLOSUM62 is good for ungapped alignments and BLOSUM50 is good for gapped
alignments
Conclusions
• Substitution scores can be derived from probabilistic models
98. Optimal Algorithms
Introduction:
Finding the path whose total score is maximal will give the best sequence alignment
• Two methods
– Local alignment
– Global alignment
Global Alignment:
It is an alignment that essentially spans the full extents of input sequences
• Hence it covers the entire length of sequences involved
• The Needleman-Wunsch algorithm finds best global alignment between two sequences
Local Alignment:
• It only covers parts of the sequences to be aligned
• Smith-Waterman algorithm finds the best local alignment between two sequences
Dynamic Programming:
• Dynamic programming is used to find an optimal alignment of two sequences and its
scores
• It is a method by which a larger problem may be solved by first solving smaller, partial
versions of the problem
• Initialization
• Traceback (alignment)
─ Initialization: Create a matrix with M+1 columns and N+1 rows, where M and N
correspond to the sizes of the sequences to be aligned
─ Trace back: Move from the last corner and follow the arrows
Conclusion:
• Dynamic programming is used to find an optimal alignment of two sequences and its
scores
99. Needleman-Wunsch Algorithm
Introduction:
It performs global alignment on two sequences
• The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and
published in 1970
Basic idea is
• to build up the best alignment by using optimal alignments of smaller subsequences
Steps:
Three steps
1. Initialization
2. Matrix filling
3. Traceback
Initialization:
• The cell of first row and first column of the matrix is initially filled with zero
Matrix Fill:
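• Move through the cells row by row, calculating the score for each cell as the maximum of
three candidate scores
• The match score is the sum of the diagonal cell score and the score for a match or mismatch
• The horizontal gap score is the sum of the score of the cell to the left and the gap score
• The vertical gap score is the sum of the score of the cell above and the gap score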
Traceback:
➢ The final step in the algorithm is the trace back for the best alignment
➢ Start at the bottom-right corner
➢ Follow where maximum value comes from
➢ F(M, N) is the optimal score, and the optimal alignment can be traced back
from Ptr(M, N)
Scoring Scheme:
• Scoring scheme introduced can be user defined
• It contains specific scores for match and mismatch residues as well as gap
Conclusions:
• The Needleman-Wunsch algorithm performs global alignment on two sequences using a
dynamic programming approach
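The three steps can be sketched in Python. This is a minimal sketch under stated assumptions: the scoring scheme (+1 match, −1 mismatch, −1 gap) is an example of a user-defined scheme, and the first row and column are filled with cumulative gap penalties, a common convention that starts from the zero in the top-left cell. Instead of a separate pointer matrix, the traceback re-checks which neighbour produced each score.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # Initialization
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Matrix filling
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match or mismatch
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    # Traceback from the bottom-right corner
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))
```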
100. Smith-Waterman Algorithm
Introduction:
• Finds the best local alignment between two sequences
Steps:
➢ Initialization
➢ Matrix filling
➢ Traceback or alignment
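Trace-back:
The optimal score is the maximum value anywhere in the matrix F, and the optimal local
alignment is traced back from that cell using the pointers Ptr until a cell with score zero is reached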
Creating the Matrix:
• Initial matrix is created with M+1 columns and N+1 rows
Initialization:
• First row and first column of the matrix is filled with zero
Matrix Filling:
Traceback:
• Traceback starts with maximum value in the matrix and then go backwards
Alignment:
Conclusion:
• Finds the best local alignment between two sequences
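The local alignment steps can be sketched in Python. This is a minimal sketch assuming an example scoring scheme of +1 match, −1 mismatch and −1 gap; scores are never allowed to drop below zero, and the traceback starts at the maximum cell and stops at the first zero. The function name and the example sequences are illustrative.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay zero
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Traceback from the maximum value until a zero is reached
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTACGTT", "GGACGGG"))
```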
Sequencing of a 5386-nucleotide virus, DNA sequence data, Human Genome Project, 3-billion-nucleotide
sequence, DNA sequencing technology, Sequencing reads, Continuous genome,
DNA reading process, DNA sequencing machines, Single DNA fragment.
Shotgun sequencing
Sonicated, Inserts, Vector, Bacterial host, Cloning proce, DNA sequencing- inserts and
computational
20% of human genome Alu sequence-million times Repeats occur at many scales Human T-cell
receptor locus-Trypsinogen gene (4 kb).1 million Alu repeats ( 300 bp) and 200,000 LINE repeats (
1000 bp) 3 billion-letter sequence 500-letter reads, large number of repeats.
Inserts of length L- both ends are sequenced, Mates at distance L- length is larger than most repeats
Mass spectrum of a peptide is a collection of masses of these fragments- Derive the sequence of a
peptide given its mass spectrum. For an ideal fragmentation process the peptide sequencing problem is
simple. The fragmentation process is not ideal, and mass spectrometers measure mass with some
imprecision.
SEQUEST Algorithm
Output: A protein of mass m with the best match to spectrum S that is at most k
modifications away from an entry in the database
Modified Protein Identification problem P1 and P2- S1 and S2, Notion of spectral similarity
Shared peaks count, Limitations in detecting similarities by database search.
Background
Complex protein structures enable proteins to perform complex functions. We know over a million
protein sequences but only about 100,000 protein structures.
Why are there only about 100,000 structures for over a million protein sequences?
Determining exact protein structures is very difficult. It is difficult to crystallize proteins. Even
if we manage to get a protein’s X-ray data, reconstructing the structure from it is extremely complex
Introduction
What if we could somehow predict protein structures?
• Since we know so many sequences, they can be used for predicting protein structures. This
indeed is possible and helpful.
To computationally predict protein structures, we need to copy or mimic the natural folding! What
are the steps in protein folding and structure formation?
To fold we must learn the steps
Introduction
What if we survey the entire PDB and check the presence of each amino acid in each type of
secondary structure?
• If we know which amino acid is found in which specific secondary structure,
then we can use it for prediction!
Conclusion
• Several algorithms have been designed to predict 2’ given an amino acid sequence
• Product of propensity values is computed for overall propensity for each 2’ structure
Introduction
• An important point to note here is that 2’ structures are formed due to hydrogen
bonding between amino acids
Conclusion
• The structure with the highest net propensity is the most probable secondary structure to be
formed!
Conclusion
• For Alpha Helices, 4 contiguous amino acids are required
• Let’s see how Beta sheets are evaluated using Chou Fasman Algorithm
Conclusion
• Alpha Helices can be finalized if their propensity is higher than the propensity for
Beta Sheets in regions of 5 amino acids
• For those regions where that is not the case, further evaluation is required
Conclusion
• Using the strategy of higher propensity, alpha helices and beta sheets can be
completely resolved
• Assignments for each beta sheet and alpha helix can be finalized
• Let’s see how we can find the loops using the Chou Fasman Algorithm
Conclusion
• Chou Fasman Algorithm helps predict Alpha Helices, Beta Sheets and Turns
• Chou Fasman Algorithm helps predict secondary structures such as Alpha Helices,
Beta Sheets and Turns. Step by step flowchart of the entire algorithm. Beta sheets
can be predicted from primary amino acid sequences
• Chou Fasman Algorithm helps predict secondary structures from amino acid
sequences. Step by step flowchart of the algorithm for extracting Alpha Helices
• Alpha helices, beta sheets and turns can be predicted using Chou Fasman Algorithm.
This algorithm is based on statistical analysis of amino acid occurrences in proteins.
• Secondary structure propensity values of alpha helix, beta sheet and turns should be
recalculated with the latest protein data sets.
Conclusion
Chou Fasman can be improved to better predict secondary structures by incorporating
biochemical factors and updated statistics!
Structure Classification
Structure Prediction
A. Why are protein structures important?
B. Why have so few structures been reported to date?
C. Benefits of predicting structures
Conclusion
Structure visualization, classification and prediction equip us to perform functional evaluation of
proteins! This is important for understanding disease and designing drugs for treating them
Proteins are 3D molecules with their own unique structures. Protein structure is reflective of
the protein function. Protein structure includes 1’, 2’, 3’ and 4’ structures. 1’ structure of
proteins is the sequence of proteins and can be obtained by mass spectrometry. 2’ structures
formed by proteins are the helices, beta sheets, loops and coils. 3’ structure of proteins is the
combination of 2’ structures such that the overall protein structure is formed. 4’ protein
structure is formed when two or more proteins complex together. X-Ray Crystallography and
NMR Spectroscopy are used to find the structures of proteins. However, these methods are
difficult and expensive. Solution: Prediction of structures. Protein sequence gives rise to its
structure. If another protein with a similar sequence has a known structure, the
structure of the unknown protein can be predicted from that similar protein. It is then
possible to identify unknown protein structures just by examining homologous protein
sequences.
Conclusions
• Sequence Identity
• Alignment Length
Which combination of identity and alignment length is best suited for structure
prediction?
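Both quantities can be read directly from a pairwise alignment. The sketch below assumes the two sequences are already aligned and use "-" for gaps; the function name is hypothetical, and identity is computed over ungapped columns only, which is one of several conventions in use.

```python
def identity_and_length(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, number of ungapped aligned columns)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    length = len(columns)
    matches = sum(a == b for a, b in columns)
    identity = 100.0 * matches / length if length else 0.0
    return identity, length


if __name__ == "__main__":
    print(identity_and_length("ACD-EF", "ACDGEY"))
```

A long alignment with high identity is a stronger basis for homology modelling than a short one with the same identity.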
Conclusions
• High sequence identity over a good alignment makes it more likely that homology modelling will give
accurate results
1. Homology Modelling
2. Threading/Fold Recognition
3. Ab Initio Modelling
Let’s start by looking at Homology Modelling. There are seven salient steps in any Homology
Modelling pipeline. Definition of Template (known) & Target (unknown). Homology modeling of the
target structure can be done as follows:
7. Model validation
Introduction
• A protein fold is defined by the way the secondary structure elements of the structure are
arranged relative to each other in space.
• Common folds include 4-helix bundle and the TIM barrel.
• 5,000 stable folds in nature
• Fold recognition: Finding the best fit of a sequence to a set of candidate folds
Fold recognition is also called threading. It is a technique for predicting protein structures,
employed when homology modelling cannot produce quality structures.
Threading involves “passing” the amino acid sequence through each fold in the database. The best
match is computed using a scoring function. Combinations of secondary structures come together to
form the best prediction. Scoring typically involves using a Z-Score function based on energy of a
molecule.
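A Z-score tells how far one fold's energy lies from the average over all candidate folds, measured in standard deviations. The sketch below assumes that lower energy means a better fit and that the energies have already been computed by some scoring function; the fold energies in the example are made-up numbers, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(energies):
    """Map fold name -> Z-score of its energy relative to all candidate folds."""
    values = list(energies.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {fold: 0.0 for fold in energies}
    return {fold: (e - mu) / sigma for fold, e in energies.items()}


def best_fold(energies):
    """Pick the fold with the most negative Z-score (assumed: lower is better)."""
    scores = z_scores(energies)
    return min(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical energies for one sequence threaded through three folds.
    example = {"4-helix bundle": -120.0, "TIM barrel": -95.0, "other fold": -90.0}
    print(z_scores(example))
    print(best_fold(example))
```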
Advantages
Threading helps predict secondary structures of proteins as a step towards tertiary structure prediction. In
the “Twilight Zone” of low alignment quality and identity, threading is used.
Disadvantages
Novel proteins cannot be predicted using threading. Fewer than 30% of the predicted first hits are
true remote homologues. Validation of each result is necessary.
• Accuracy and applicability are limited by our understanding of the protein folding
problem
Limitation
• Computationally expensive
Ab initio methods rely on computing the energies of folded proteins. The protein
structures with the lowest energy are declared as plausible predictions
Rationale
Sometimes not even slightly homologous proteins are available. This renders
homology modelling and threading/fold recognition futile. Also, newer protein structures continue
to be discovered every day. These could not have been identified by methods which only rely on
matching with available structures. Lastly, homology / fold recognition predict protein structures
without computing fundamental physical/chemical properties of the mechanisms and driving forces in
structure formation. Ab initio methods, in contrast, base their predictions on physical models for these
mechanisms. Energy released during the folding process is computed for predicting structure.
2. Define an energy function mapping structures to energy values. We have to minimize this
later!!
Native structure not always at the global minimum. No clear way of choosing among
alternative structures that are generated
Advantages
• Ab Initio methods can fold any target sequence using only physical atomic properties
• Predictions are mostly accurate and correctly describe the natural folding process
Disadvantages
• Ab initio methods are very difficult to design (energy function)
• Fold Recognition
• Ab Initio Modelling
2. Alignment correction
3. Backbone generation
4. Loop modeling
5. Side-chain modeling
6. Model optimization
7. Model validation
Conclusion
• Homology modelling is performed in cases of high identity and alignment score
• For low identity and alignment scores, a “Twilight zone” for structure prediction
exists
• For cases where even the fold libraries do not give any high scoring matches, Ab
Initio strategies can help model the structure
153.Review of Phylogenetics
• Important Concepts
Molecular Evolution
Insertions, Deletions, Substitutions
Phylogenetic Trees
Scaled Trees, Unscaled Trees
Phylogenetic Trees
Rooted Trees, Unrooted Trees
• Protein Structure Database – PDB, Online tools for predicting structures by using
proteins sequences
• We studied the basic algorithms for each topic. With the evolution and growth of
Bioinformatics, newer and better algorithms are now also available!
• For advanced study in Genomics, you may take “Computational Genomics” course
Topics:
Genome Assembly, Gene Finding, Annotation, GWAS etc
• For advanced study in Integrative Biology, you may take “Systems Biology”
course. Topics: Metabolomics, Transcriptomics, Network Biology etc
160.Careers in Bioinformatics
Pakistan as an infrastructure-limited country. The onset of digital revolution. Emergence of data as
the most precious commodity, globally. Specifically, health data as a key commodity of the future.
Health and disease as the primordial challenge of mankind
• Unique opportunity for us in Pakistan
You can take public databases and design drugs. One man vs. Roche?
BIGDATA
You can make a startup company which manages and processes health BIGDATA. All it needs is basic
software development skills coupled with Bioinformatics
The next disruption
The next Google, Facebook and Uber are going to emerge from Health and Bioinformatics.
Pharmaceutical companies are investing in bioinformatics human resource development
Jobs Market
Pharmaceutical Giants, Research Centers & Universities, Hospital & Diagnostic IT departments,
Your own startup company
• RNA folding
Base Pairing
• A-U
• G-C
• “wobble” pairing
• G-U
• I-U
transfer RNA (tRNA), ribosomal RNA (rRNA), small interfering RNA (siRNA), micro RNA (miRNA), small
nucleolar RNA (snoRNA)
Aligning bases, based on pairing with each other gives an algorithmic approach to
determining the optimal structure
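One simple algorithmic approach is base-pair maximization by dynamic programming (often called a Nussinov-style algorithm). The sketch below is an illustration under stated assumptions: it counts only A-U, G-C and G-U pairs, ignores pseudoknots and energies, and requires at least `min_loop` unpaired bases inside a hairpin (the default of 3 is an assumption). The function name is hypothetical.

```python
# Allowed pairs: Watson-Crick plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in an RNA sequence."""
    n = len(seq)
    # best[i][j] holds the best pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j stays unpaired.
            score = best[i][j - 1]
            # Case 2: base j pairs with base k, splitting the problem in two.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

Maximizing the number of pairs is a crude stand-in for stability; energy minimization replaces the "+1 per pair" score with thermodynamic contributions.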
RNA Folding
RNA is produced as a single stranded molecule (unlike DNA)
• Strand folds upon itself to form base pairs & secondary structures
RNA Structure
Structures are more conserved than sequences
Covariation
Pseudoknots
• Base-Pair Maximization
1. Energy minimization
Approach
Energy minimization algorithm predicts secondary structure by minimizing the free energy (ΔG). ΔG is
calculated as the sum of individual contributions of:
• Loops
• stacking
Energy minimization
• Thermodynamic Stability
Drawbacks
Computes only one optimal structure. Usual drawbacks of purely mathematical approaches
164.RNASeq
Calculating transcript abundance and prevalence by ultra-high-throughput cDNA sequencing
(Mortazavi et al., 2008)
The sequence reads are individually mapped to the source genome and counted to obtain the number and
density of reads corresponding to RNA from each
• known exon
• splice event
Procedures
Isolation of all mRNA, Convert to cDNA using reverse transcriptase, Sequence the cDNA, Map
sequences to the genome.
The more times a given sequence is detected, the more abundantly transcribed it is.
If enough sequences are generated (> 40 million), a comprehensive and quantitative view of the entire
transcriptome of an organism or tissue can be obtained (Mortazavi et al., 2008)
Data analysis
Mapping reads, Visualization (genome browser), De novo assembly, Quantification, Differential
Gene Expression, Functional Analysis, Gene Networks
In RNASeq, transcript abundance and prevalence are calculated using ultra-high-throughput cDNA
sequencing
165.RNASeq Normalization
Sequencing reactions may vary across different sequencing platforms as well as between different
lanes of the same sequencer. Transcript lengths also vary
Raw read counts may vary
RNASeq challenges
Uniformity of sequence coverage, Quantity of sequence required to reliably detect RNAs of lower
abundance classes, Quantification and conversion of relative quantification to absolute RNA
concentrations, Transcriptomes of organisms with large genomes, containing genes with more
complicated structure, present some special challenges
Mapping Biases
Read counts will be higher if the sequencer produces more reads. Longer genes are likely to have
more reads mapped to them than shorter ones
RPKM (reads per kilobase of transcript (or exon model) per million mapped reads)
Method for quantification of transcript levels. RPKM measure of read density reflects the molar
concentration of a transcript in the starting sample by normalizing for RNA length and for the total read
number in the measurement. This facilitates transparent comparison of transcript levels both within and
between samples
RPKM
Number of reads mapped per gene length in kb per total mapped reads in that sample in millions:
RPKM = C / (L × M)
C = Count of reads mapped to the transcript
L = Length of transcript (in kilobases)
M = Total mapped reads of the sample (in millions)
RPKM
How many reads are required to map at 1 RPKM with a transcript of 2Kb length from a total of 40
Million Mapped reads?
(Trapnell et al 2010)
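The question above can be worked through with a short calculation. The sketch below takes RPKM as reads per kilobase of transcript per million mapped reads, i.e. RPKM = 10^9 × C / (length in bp × total mapped reads); the function names are hypothetical.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """RPKM for a transcript: count / (length in kb * mapped reads in millions)."""
    return 1e9 * count / (length_bp * total_mapped_reads)


def reads_for_rpkm(target_rpkm, length_bp, total_mapped_reads):
    """Invert the formula: how many reads give the target RPKM?"""
    return target_rpkm * (length_bp / 1e3) * (total_mapped_reads / 1e6)


if __name__ == "__main__":
    # 1 RPKM, 2 kb transcript, 40 million mapped reads.
    needed = reads_for_rpkm(1, 2000, 40_000_000)
    print(needed)                          # 1 * 2 * 40 = 80 reads
    print(rpkm(needed, 2000, 40_000_000))  # back to 1 RPKM
```

So a 2 kb transcript needs 80 mapped reads to reach 1 RPKM in a sample of 40 million mapped reads.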
RPM
While comparing the same gene's expression across different samples (treatments), normalizing for
RPKM reflects the molar concentration of transcript in starting sample normalized for
• Length of RNA
166.Neural Network
The human brain can be described as a biological neural network: an interconnected web of neurons
transmitting elaborate patterns of electrical signals. A neural network is a “connectionist”
computational system. Information is processed collectively, in parallel, throughout a network of nodes.
Complex adaptive system.
• Learning processes in biological systems.
• a consequent (then)
… …
tn: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket
I: The set of all items sold in the store
A transaction: Items purchased in a basket; it may have TID (transaction ID)
A transactional dataset: A set of transactions
168.Clustering
Clustering is “a process of organizing objects into groups whose members are similar in some
way”. A cluster is therefore a collection of objects which are “similar” to one another and
“dissimilar” to the objects belonging to other clusters.
• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process
Hierarchical agglomerative general algorithm
• Find the 2 closest objects and merge them into a cluster
• Find and merge the next two closest points, where a point is either an individual object
or a cluster of objects
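The two steps above can be sketched as a loop that keeps merging the closest pair. This is a minimal sketch, assuming Euclidean distance and single linkage (distance between clusters = distance between their closest members); other linkage rules are equally valid, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(dist(p, q) for p in c1 for q in c2)


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    merges = []                       # record of which clusters were merged
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
    final, history = agglomerate(pts, n_clusters=3)
    print(final)
```

The recorded merge history is what a dendrogram draws: each merge joins two branches at the distance where they were combined.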
Applications
• For administrative purposes
169.Machine Learning
Programming computers to optimize a performance criterion using example data or past experience.
When to learn? There is no need to learn to calculate payroll. Learning is used when the solution needs to be adapted to particular cases (user biometrics)
When To Learn
Human expertise does not exist (navigating on Mars), Humans are unable to explain their expertise
(speech recognition), Solution changes in time (routing on a computer network)
Model
Build a model that is a good and useful approximation to the Data
Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, & ultimately
understandable patterns in data
Applications
Retail, Finance, Manufacturing, Medicine, Telecommunications, Bioinformatics, Web mining
Retail: Market basket analysis, Customer relationship management (CRM)
Finance: Credit scoring, fraud detection
Manufacturing: Optimization, troubleshooting
Medicine: Medical diagnosis
Telecommunications: Quality of service optimization
Bioinformatics: Motifs, alignment
Web mining: Search engines
Machine Learning
Study of algorithms that improve performance at some task with experience. Role of Statistics, Role of CS
Applications of ML
Speech recognition, NLP, Computer vision, Medical outcomes analysis/Computational biology, Robot
control
170.ML Concepts
❖ Association Analysis
❖ Supervised Learning
• Classification
• Regression/Prediction
❖ Unsupervised Learning
❖ Reinforcement Learning
Learning Associations
Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y where X and Y are
products/services.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
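P(Y | X) can be estimated directly from the table above by counting baskets. The sketch below hard-codes the five transactions from the table; the function name is hypothetical.

```python
# The five transactions from the table, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def conditional_probability(x, y, baskets):
    """Estimate P(y | x): the share of baskets containing x that also contain y."""
    with_x = [items for items in baskets if x in items]
    if not with_x:
        return 0.0
    return sum(y in items for items in with_x) / len(with_x)


if __name__ == "__main__":
    baskets = TRANSACTIONS.values()
    print(conditional_probability("Diaper", "Beer", baskets))  # 3 of 4 baskets
```

Diaper appears in four baskets (TID 2 to 5) and Beer in three of those, so P(Beer | Diaper) = 3/4 = 0.75.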
Classification Apps
Character recognition:
Speech recognition:
Web Advertising
171.ML Applications
Supervised Learning
Un-supervised Learning
No output
Clustering: Grouping
similar instances
Applications
Reinforcement Learning
No supervised output
Delayed reward
Applications:
172.Forensic Science
The cells are kept warm long enough for the DNA to be freed from the cells. Salt causes proteins & other
cellular debris to clump together. Place the tube into a microcentrifuge. As the tube spins inside the
centrifuge, the debris & heavy proteins sink to the bottom of the tube while the DNA strands remain distributed
throughout the liquid.
4. Isolate concentrated DNA:
Transfer the liquid containing the DNA to a separate tube. Now add isopropyl alcohol to the tube. DNA is not soluble
in this alcohol, so it comes out of solution and can be seen with the naked eye. The DNA collects at the bottom of the tube after
centrifugation.
• RFLP analysis.
• PCR analysis.
• STR analysis.
• Amp FLP.
• Y-chromosome analysis.
• Restriction Fragment length Polymorphism (RFLP)
RFLP
• It analyzes the length of strands of DNA that include repeating base pairs (VNTRs).
• A repeated sequence in the human genome can be the same between people, but the number of times it is repeated
varies from person to person.
RFLP analysis requires investigators to dissolve DNA in an enzyme that breaks the strand at specific
points. The number of repeats affects the length of each resulting strand of DNA. Investigators compare
samples by comparing the lengths of the strands.
Example: In one person, CAT may be repeated 13 times in a row. In somebody else, it might be 12 or 14
times.
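Counting how many times a unit such as CAT is repeated in a row is easy to sketch in code. The example sequence below is made up for illustration, and the function name is hypothetical; real RFLP and STR analysis measures fragment lengths in the laboratory rather than reading the sequence.

```python
import re


def longest_tandem_run(dna, unit):
    """Return the largest number of back-to-back copies of unit found in dna."""
    runs = re.findall(f"(?:{re.escape(unit)})+", dna)
    return max((len(run) // len(unit) for run in runs), default=0)


if __name__ == "__main__":
    # Made-up samples: the same repeat unit, different repeat counts.
    person_a = "GG" + "CAT" * 13 + "TTAG"
    person_b = "GG" + "CAT" * 12 + "TTAG"
    print(longest_tandem_run(person_a, "CAT"), longest_tandem_run(person_b, "CAT"))
```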
Limitation: RFLP analysis requires a fairly large sample of DNA that hasn't been contaminated with dirt.
Polymerase Chain Reaction (PCR)
Replicate a small amount of DNA to create a larger sample for analysis. First, a heat-stable DNA polymerase
(a special enzyme that binds to the DNA and allows it to replicate) is added. Next, the DNA sample is
heated to 200 degrees F (93 degrees C) to separate the strands. Then the sample is cooled and reheated.
Each cycle of cooling and reheating doubles the number of copies. After the process is repeated about 30 times, there is enough DNA for
further analysis.
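The effect of repeated doubling can be checked with a short calculation. This is an idealized sketch: it assumes every cycle copies every molecule, and the `efficiency` parameter (1.0 means perfect doubling) is an illustrative assumption, not a measured value.

```python
def copies_after(cycles, starting_copies=1, efficiency=1.0):
    """Number of DNA copies after a given number of PCR cycles.

    efficiency = 1.0 means every molecule is copied in every cycle.
    """
    return starting_copies * (1 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (1, 10, 30):
        print(n, copies_after(n))
```

Under perfect doubling, one starting molecule becomes 2^30 copies after 30 cycles, which is just over a billion.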
Analyzing STRs
PCR is the first step in analyzing STRs (Short Tandem Repeats), which are very small, specific
alleles in a variable number tandem repeat (VNTR).
Analyzing STRs is more accurate than the RFLP technique because their small size makes them easier to
separate. If you want to create a fingerprint, you might look at 20 different STRs at different places in
order to create a profile.
STRs
• It is extremely unlikely for two unrelated persons to have the same number of repeats at every STR examined.
Y-chromosome Analysis
STRs on the Y-chromosome. Useful if the sample has mixed DNA, and in gender analysis cases. It is processed just
like simple STR analysis.
AmpFLP
Amplified fragment length polymorphism is another technique that uses PCR to replicate DNA. Like RFLP,
it first uses a restriction enzyme. Then, the fragments are amplified using PCR and sorted using gel
electrophoresis. It can be automated and doesn't cost very much. The DNA sample must be of high quality, otherwise
errors may result, which is the case with most DNA analysis techniques.
Drug: to the media or general public, an agent used for its psychoactive effect. Even abused drugs have
their activity. No drug is completely safe: a suitable quantity cures, while an excess can be poisonous. E.g.
aspirin and paracetamol can be toxic in excess.
Combinatorial Chemistry
Laboratory technique in which millions of molecular constructions can be
synthesized and tested for biological activity.
179.Pharmacogenetics
Definition:-
It is the branch of pharmacology concerned with the effect of genetic factors on reactions to drugs:
how people respond to medicines, correlating heritable genetic variation to drug response.
Applications:-
1. Detection of genetic variability of drug effects on the genome level
2. Agent selection
3. Analysis of drug reactions and drug toxicity on gene expression
4. Development of new indications for already approved drugs
5. Discovery of new drug targets
6. Identification of (non) responders in clinical trials of phase I-IV
7. Identification of genotype dependent adverse drug reactions
8. Identification of individuals at risk for severe adverse drug effects
180.Pharmacogenomic applications
How genes affect a person's response to drugs. Combines pharmacology (the science of drugs) and genomics (the study of
genes and their functions) to develop effective, safe medications & doses tailored to a person’s genetic
makeup.
Applications:-
Improve drug safety, Reduce ADRs, Tailor treatments to meet patients' unique genetic
predisposition, Optimal dosing, Improve drug discovery and Improve proof of principle for
efficacy trials.
Future
Blessing in research. As a simple example, for nearly a decade the ability to store more information on
a hard drive has enabled us to investigate a human genome sequence cheaper.
• Pharmacology & Toxicology
Current:
Diseases are controlled at molecular & physiological level. Information of Human Genome
Methods for DD
✓ Random Screening
✓ Molecular Manipulation
✓ Molecular Designing
✓ Drug Metabolites
✓ Serendipity
Random Screening
Higher/crude plants, opium, senna, reserpine, etc. Penicillin microorganism. Antibacterials with
improved therapeutic profiles.
Molecular Manipulation
Drug Metabolism
readily excreted hydrophilic products. Rate of metabolism determines duration & intensity of a drug's
pharmacological action
Serendipity
Prototype psychotropic drugs
Development of psychiatry
Finding of one thing while looking for something else
1. Understanding data
2. context information
3. background knowledge
4. curated databases &
5. Literature extensive
Dataset for extraction of disease/treatment entities relations. Corpora are usually constructed for
training or evaluation purposes during the development of particular system
Annotation Consumers
The linguistic community typically uses annotation as training data or for specific tasks. An abundance of
tools that can produce annotations in the specific format of those resources. Biomedical annotation
typically used for gene set enrichment analysis
Information deluge
Curators struggling to process scientific literature. Discovery of facts & events crucial for gaining insights
in biosciences: need for text mining
Data Mining
IR (information retrieval): yields all relevant documents (corpora); gathers, selects and filters documents that may prove useful. Finds
what is known
IE (information extraction): extracts facts & events of interest to the user. Finds relevant concepts and facts about concepts. Finds
only what we are looking for
✓ Text documents
✓ Retrieval/storage: Index, access relevant storage
✓ Text Processing: word filters, pattern filters, lexicon matching, ontology, NLP
parsing, etc.
Feature Extractions:
etc …
• Physiology-based approach
• Target-based approach
1. Physiology-based approach
A disease-centric approach in which the target is not identified and multiple targets may be involved. In vivo
screening is done by using drugs, siRNA or antisense oligonucleotides.
2. Target-centric approach
Target based discovery starts with the identification of genes and their protein products. Aim to
develop drugs affecting one gene or a molecular mechanism. The identification of disease-relevant
genes in in vitro cellular models has been possible due to several tools. Gene-suppression tools are used to
link genes with disease.
1. Genetic
2. Mechanistic
Genetic targets are represented by genes and genes products. Mechanistic targets include mechanism
based targets such as receptors, enzymes or genes, identified on the basis of the disease state.
• Dorzolamide
• Captopril
• Imatinib
• Zanamivir
• It requires thorough knowledge of the disease processes and characterization of genes.
• Expression profiling - Proteomic approach to identify disease-related genes based on differential
EP, homology and post-translational modification
• Biochemical and cell biological assays - To identify genes and proteins linked with disease
pathways
• Cell-based genetics - Leads to the discovery of targets by disturbing gene function in whole
organisms and correlation with phenotypes.
Cell-based Genetics
Cell-based assays may lead to the identification of genes involved in cellular transformation, activation,
migration and a host of biological processes relevant to a human disease.
Genetic-based Target Identification
It has some methods:
• Positional Cloning- Laboratory technique used to locate the position of a disease associated
gene along the chromosomes.
To identify complex disease-linked genes through SNP markers. 10 million in HGP and 3 million
identified.
Target class genetic approach
• Is applied to drug target gene families such as proteases, ion channels and GPCRs.
• The best candidates are selected from the gene family for genetic analysis.
• A validated target is the one that can be manipulated with drugs to produce positive clinical
effects in humans.
Requirements for TV
Genetic Approach
Gene to disease correlation in animal model
• Forward Genetics
• Reverse Genetics
• Antisense agents
• Ribozymes
Antisense Technology
Ribozymes
• Small RNA molecules that cleave other RNA molecules at sequence-specific sites.
• Hairpin Ribozyme: GUC
189.Lead Identification
• Compound for biological activity on target.
• Potency threshold
• Libraries of molecules
Virtual Screening:
Protein structure , docking Chemical similarity search, Knowledge of compounds against receptor,
receptor structure & receptor ligand interactions
Visual Screening
MLCC: Multilevel chemical Compatibility scoring
Pharmacophore Mapping
Identify lead compounds against a desired target
Definition: 3-D arrange… Usage:
interaction of receptor & ligand
DB concept
• QSAR
• SAR:
• Nuclear Magnetic Resonance
• 3-D potential DC & tertiary structure of Proteins
• Need of prior information
• SHAPES
• WaterLOGSY
Chemical Genetics
• Knockouts
• Cell cycle - arresting agents
LO Methods
De novo drug design
Charge distribution, lipophilicity or pKa of side chains and H-bond donors and acceptors
SBDD
Structure based drug Design. Effective: 3-D structure of inhibitor with target known. Large no of
medicines. Molecular recognition in protein ligand complexes
Drug Like properties
• ADMET in phase 1
• Filters
• Bioavailability, PK, CNS
Pre-Clinical Pharmacology & Toxicity
• Animals testing
• Xenograft models
• ADME/T testing and validation