Module 1

Bioinformatics is an interdisciplinary field that combines computer science, statistics, and biology to analyze and interpret biological data. It involves the development of algorithms and tools for managing and analyzing data from various biological sources, including genetic sequences and protein structures. The goals of bioinformatics include understanding biological processes, identifying disease mechanisms, and improving drug discovery through computational techniques.

Uploaded by

tixabi3785
Copyright
© © All Rights Reserved
Module 1:

Introduction: Emergence of Bioinformatics, Applications in the field - Biological Databases Formats -
Nucleic acid and Protein sequence Databases, Structure Databases, Chemical Databases, Literature Databases

INTRODUCTION TO BIOINFORMATICS

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding
biological data. As an interdisciplinary field of science, bioinformatics combines computer science,
statistics, mathematics, and engineering to analyze and interpret biological data. Bioinformatics has been
used for in silico analyses of biological queries using mathematical and statistical techniques.
Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the
information stored in the genetic code, but also experimental results from various sources, patient statistics,
and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and
analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly
interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry,
biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and
medicine.
Bioinformatics: Research, development, or application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health data, including those to acquire, store,
organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods,
mathematical modeling and computational simulation techniques to the study of biological, behavioral, and
social systems.
"Classical" bioinformatics: "The mathematical, statistical and computing methods that aim to solve
biological problems using DNA and amino acid sequences and related information."
The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer science, and information technology
merge into a single discipline. There are three important sub-disciplines within bioinformatics: the
development of new algorithms and statistics with which to assess relationships among members of large
data sets; the analysis and interpretation of various types of data including nucleotide and amino acid
sequences, protein domains, and protein structures; and the development and implementation of tools that
enable efficient access and management of different types of information."

Even though the three terms bioinformatics, computational biology and bioinformation infrastructure are
often used interchangeably, they may broadly be defined as follows:
1. bioinformatics refers to database-like activities, involving persistent sets of data that are maintained in a
consistent state over essentially indefinite periods of time;
2. computational biology encompasses the use of algorithmic tools to facilitate biological analyses; while
3. bioinformation infrastructure comprises the entire collective of information management systems,
analysis tools and communication networks supporting biology. Thus, the latter may be viewed as a
computational scaffold of the former two.

There are three important sub-disciplines within bioinformatics:


 the development of new algorithms and statistics with which to assess relationships among
members of large data sets;
 the analysis and interpretation of various types of data including nucleotide and amino acid
sequences, protein domains, and protein structures;
 and the development and implementation of tools that enable efficient access and management
of different types of information

Bioinformatics definition - other sources


o Bioinformatics or computational biology is the use of mathematical and informational
techniques, including statistics, to solve biological problems, usually by creating or using
computer programs, mathematical models or both. One of the main areas of bioinformatics
is the data mining and analysis of the data gathered by the various genome projects. Other
areas are sequence alignment, protein structure prediction, systems biology, protein-protein
interactions and virtual evolution. (source: www.answers.com)
o Bioinformatics is the science of developing computer databases and algorithms for the
purpose of speeding up and enhancing biological research. (source: www.whatis.com)

 "Biologists using computers, or the other way around. Bioinformatics is more of a tool than a
discipline." (source: An Understandable Definition of Bioinformatics, The O'Reilly
Bioinformatics Technology Conference, 2003) (4)
 The application of computer technology to the management of biological information. Specifically,
it is the science of developing computer databases and algorithms to facilitate and expedite
biological research. (source: Webopedia)
 Bioinformatics: a combination of Computer Science, Information Technology and Genetics to
determine and analyze genetic information. (Definition from BitsJournal.com)
 Bioinformatics is the application of computer technology to the management and analysis of
biological data. The result is that computers are being used to gather, store, analyse and merge
biological data. (EBI - 2can resource)
 Bioinformatics is concerned with the creation and development of advanced information and
computational technologies to solve problems in biology.
 Bioinformatics uses techniques from informatics, statistics, molecular biology and high-
performance computing to obtain information about genomic or protein sequence data.

 Bioinformaticist versus a Bioinformatician


 A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows
how to write interfaces for effective use of the tools.
 A bioinformatician, on the other hand, is a trained individual who knows only how to use bioinformatics
tools, without a deeper understanding of how they work.

Aims of Bioinformatics
In general, the aims of bioinformatics are three-fold.

1. The first aim of bioinformatics is to store biological data in an organized form, as a database. This
gives researchers easy access to existing information and lets them submit new entries. These data
must be annotated to give them meaning or to assign functional characteristics. The databases
must also be able to correlate different hierarchies of information. For example: GenBank
for nucleotide and protein sequence information, Protein Data Bank for 3D macromolecular
structures, etc.

2. The second aim is to develop tools and resources that aid in the analysis of data. For example:
BLAST to find similar nucleotide/amino-acid sequences, ClustalW to align two or more
nucleotide/amino-acid sequences, Primer3 to design primers and probes for PCR techniques, etc.

3. The third and most important aim of bioinformatics is to exploit these computational tools to
analyze the biological data and interpret the results in a biologically meaningful manner.
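
Sequence comparison tools of the kind named in the second aim rest on dynamic programming. As a minimal sketch (not the implementation used by BLAST or ClustalW), the Python function below computes a global alignment score in the style of the Needleman-Wunsch algorithm. The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values taken from these notes.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Scoring values are illustrative assumptions; real tools use
    substitution matrices and more elaborate gap penalties.
    """
    # prev[j] = best score for aligning the prefix of a seen so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)  # pair a[i-1] with b[j-1]
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
    print(needleman_wunsch_score("ACGT", "ACT"))
```

Only the score is returned here; recovering the alignment itself would require keeping the full table and tracing back through it.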

Goals
The goal of bioinformatics is thus to provide scientists with a means to explain:
1. Normal biological processes
2. Malfunctions in these processes which lead to diseases
3. Approaches to improving drug discovery
To study how normal cellular activities are altered in different disease states, the biological data must be
combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has
evolved such that the most pressing task now involves the analysis and interpretation of various types of
data. This includes nucleotide and amino acid sequences, protein domains, and protein structures. The
actual process of analyzing and interpreting data is referred to as computational biology.
Important sub-disciplines within bioinformatics and computational biology include:
 Development and implementation of computer programs that enable efficient access to, use and
management of, various types of information
 Development of new algorithms (mathematical formulas) and statistical measures that assess
relationships among members of large data sets. For example, there are methods to locate a gene within a
sequence, to predict protein structure and/or function, and to cluster protein sequences into families of
related sequences.

The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it
apart from other approaches, however, is its focus on developing and applying computationally intensive
techniques to achieve this goal. Examples include: pattern recognition, data mining, machine learning
algorithms, and visualization. Major research efforts in the field include sequence alignment, gene finding,
genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction,
prediction of gene expression and protein–protein interactions, genome-wide association studies, the
modeling of evolution and cell division/mitosis.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational and
statistical techniques, and theory to solve formal and practical problems arising from the management and
analysis of biological data.

Tools: Used in three areas


 Molecular Sequence Analysis
 Molecular Structural Analysis
 Molecular Functional Analysis

Over the past few decades, rapid developments in genomic and other molecular research technologies and
developments in information technologies have combined to produce a tremendous amount of information
related to molecular biology. Bioinformatics is the name given to these mathematical and computing
approaches used to glean understanding of biological processes.
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning
DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.
Bioinformatics encompasses the use of tools and techniques from three separate disciplines: molecular
biology (the source of the data to be analyzed), computer science (which supplies the hardware for running
analyses and the networks to communicate the results), and the data analysis algorithms which strictly
define bioinformatics. For this reason, the editors have decided to incorporate events from these areas into a
brief history of the field.

A SHORT HISTORY OF BIOINFORMATICS

1933 A new technique, electrophoresis, is introduced by Tiselius for separating proteins in solution.
1951 Pauling and Corey propose the structure for the alpha-helix and beta-sheet (Proc. Natl. Acad. Sci.
USA, 37: 205-211, 1951; Proc. Natl. Acad. Sci. USA, 37: 729-740, 1951).
1953 Watson and Crick propose the double helix model for DNA based on x-ray data obtained by Franklin
and Wilkins (Nature, 171: 737-738, 1953).
1954 Perutz's group develops heavy atom methods to solve the phase problem in protein crystallography.
1955 The sequence of the first protein to be analyzed, bovine insulin, is announced by F. Sanger.
1969 The ARPANET is created by linking computers at Stanford and UCLA.
1970 The details of the Needleman-Wunsch algorithm for sequence comparison are published.
1972 The first recombinant DNA molecule is created by Paul Berg and his group.
1973 The Brookhaven Protein Data Bank is announced (Acta. Cryst. B, 1973, 29: 1746).
Robert Metcalfe receives his Ph.D. from Harvard University. His thesis describes Ethernet.
1974 Vint Cerf and Robert Kahn develop the concept of connecting networks of computers into an
"internet" and develop the Transmission Control Protocol (TCP).
1975 Microsoft Corporation is founded by Bill Gates and Paul Allen.
Two-dimensional electrophoresis, where separation of proteins on SDS polyacrylamide gel is combined
with separation according to isoelectric points, is announced by P. H. O'Farrell (J. Biol. Chem., 250: 4007-
4021, 1975).
E. M. Southern published the experimental details for the Southern Blot technique of specific sequences of
DNA (J. Mol. Biol., 98: 503-517, 1975).
1977 The full description of the Brookhaven PDB (http://www.pdb.bnl.gov) is published (Bernstein, F.C.;
Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.;
Tasumi, M.J.; J. Mol. Biol., 1977, 112: 535).
Allan Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical Research Council), report
methods for sequencing DNA.
1980 The first complete genome sequence for an organism (ΦX174) is published. The genome consists of 5,386
base pairs which code for nine proteins.
Wüthrich et al. publish a paper detailing the use of multi-dimensional NMR for protein structure
determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.; Biochem. Biophys. Res. Comm., 1980, 95: 1).
IntelliGenetics, Inc. founded in California. Their primary product is the IntelliGenetics Suite of programs
for DNA and protein sequence analysis.
1981 The Smith-Waterman algorithm for sequence alignment is published.
IBM introduces its Personal Computer to the market.
1982 Genetics Computer Group (GCG) created as a part of the University of Wisconsin
Biotechnology Center. The company's primary product is The Wisconsin Suite of molecular biology tools.
1983 The Compact Disk (CD) is launched.
1984 Jon Postel's Domain Name System (DNS) is placed on-line.
The Macintosh is announced by Apple Computer.
1985 The FASTP algorithm is published.
The PCR reaction is described by Kary Mullis and co-workers.
1986 The term "Genomics" appeared for the first time to describe the scientific discipline of mapping,
sequencing, and analyzing genes. The term was coined by Thomas Roderick as a name for the new journal.
Amoco Technology Corporation acquires IntelliGenetics.
NSFnet debuts.
The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of
Geneva and the European Molecular Biology Laboratory (EMBL).
1987 The use of yeast artificial chromosomes (YAC) is described (David T. Burke, et al., Science, 236:
806-812).
The physical map of E. coli is published (Y. Kohara, et al., Cell 51: 319-337).
1988 The National Center for Biotechnology Information (NCBI) is established at the National Library of
Medicine, part of the National Institutes of Health.
The Human Genome Initiative is started (Commission on Life Sciences, National Research Council.
Mapping and Sequencing the Human Genome, National Academy Press: Washington, D.C.), 1988.
The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
A new program, an Internet computer virus designed by a student, infects 6,000 military computers in the
US.
1989 The Genetics Computer Group (GCG) becomes a private company.
Oxford Molecular Group, Ltd. (OMG) founded in Oxford, UK by Anthony Marchington, David Ricketts,
James Hiddleston, Anthony Rees, and W. Graham Richards. Primary products: Anaconda, Asp, Cameleon
and others (molecular modeling, drug design, protein design).
1990 The BLAST program (Altschul, et al.) is implemented.
Molecular Applications Group is founded in California by Michael Levitt and Chris Lee. Their primary
products are Look and SegMod which are used for molecular modeling and protein design.
InforMax is founded in Bethesda, MD. The company's products address sequence analysis, database and
data management, searching, publication graphics, clone construction, mapping and primer design.
1991 The research institute in Geneva (CERN) announces the creation of the protocols which make up the
World Wide Web.
The creation and use of expressed sequence tags (ESTs) is described (J. Craig Venter, et al., Science, 252:
1651-1656).
Incyte Pharmaceuticals, a genomics company headquartered in Palo Alto, California, is formed.
Myriad Genetics, Inc. is founded in Utah. The company's goal is to lead in the discovery of major common
human disease genes and their related pathways. The Company has discovered and sequenced, with its
academic collaborators, the following major genes: BRCA1, BRCA2, CHD1, MMAC1, MMSC1, MMSC2,
CtIP, p16, p19, and MTS2.
1992 Human Genome Sciences, Gaithersburg, Maryland, is formed by William Haseltine.
The Institute for Genomic Research (TIGR) is established by Craig Venter.
Genome Therapeutics announces its incorporation.
Mel Simon and coworkers announce the use of BACs for cloning.
1993 CuraGen Corporation is formed in New Haven, CT.
Affymetrix begins independent operations in Santa Clara, California.
1994 Netscape Communications Corporation is founded and releases Navigator, the commercial successor to
NCSA's Mosaic.
Gene Logic is formed in Maryland.
The PRINTS database of protein motifs is published by Attwood and Beck.
Oxford Molecular Group acquires IntelliGenetics.
1995 The Haemophilus influenzae genome (1.8 Mb) is sequenced.
The Mycoplasma genitalium genome is sequenced.
1996 Oxford Molecular Group acquires the MacVector product from Eastman Kodak.
The genome for Saccharomyces cerevisiae (baker's yeast, 12.1 Mb) is sequenced.
The Prosite database is reported by Bairoch, et al.
Affymetrix produces the first commercial DNA chips.
1997 The genome for E. coli (4.7 Mbp) is published.
Oxford Molecular Group acquires the Genetics Computer Group.
LION bioscience AG founded as an integrated genomics company with strong focus on bioinformatics. The
company is built from IP out of the European Molecular Biology Laboratory (EMBL), the European
Bioinformatics Institute (EBI), the German Cancer Research Center (DKFZ), and the University of
Heidelberg.
Paradigm Genetics Inc., a company focussed on the application of genomic technologies to enhance
worldwide food and fiber production, is founded in Research Triangle Park, NC.
deCode genetics publishes a paper that describes the location of the FET1 gene, which is responsible for
familial essential tremor, on chromosome 13 (Nature Genetics).
1998 The genomes for Caenorhabditis elegans and baker's yeast are published.
The Swiss Institute of Bioinformatics is established as a non-profit foundation.
Craig Venter forms Celera in Rockville, Maryland.
PE Informatics was formed as a Center of Excellence within PE Biosystems. This center brings together
and leverages the complementary expertise of PE Nelson and Molecular Informatics, to further complement
the genetic instrumentation expertise of Applied Biosystems.
Inpharmatica, a new Genomics and Bioinformatics company, is established by University College London,
the Wolfson Institute for Biomedical Research, five leading scientists from major British academic centers
and Unibio Limited.
GeneFormatics, a company dedicated to the analysis and prediction of protein structure and function, is
formed in San Diego.
Molecular Simulations Inc. is acquired by Pharmacopeia.
1999 deCode genetics maps the gene linked to pre-eclampsia as a locus on chromosome 2p13.
2000 The genome for Pseudomonas aeruginosa (6.3 Mbp) is published.
The A. thaliana genome (100 Mb) is sequenced.
The D. melanogaster genome (180 Mb) is sequenced.
Pharmacopeia acquires Oxford Molecular Group.
2001 The human genome (3,000 Mbp) is published.
2002 Chang Gung Genomic Research Center established.
-Bioinformatics Center, -Proteomics Center, -Microarray Center
All the applications of bioinformatics are carried out at the user level. Here, biologists, including
students at various levels, can use particular applications and apply the output in their research or
study. Bioinformatics applications can be categorized under the following groups:
 Sequence Analysis
 Function Analysis
 Structure Analysis

Sequence Analysis: All the applications that analyze various types of sequence information and
compare similar types of information are grouped under Sequence Analysis.
Function Analysis: These applications analyze the function encoded within the sequences and help
predict the functional interactions between various proteins or genes. Expression analysis of
various genes is also a prime topic for research these days.
Structure Analysis: For RNA and proteins, structure plays a vital role
in interactions with other molecules. This gave rise to a whole new branch, termed Structural
Bioinformatics, which is devoted to predicting the structures of proteins and RNA and the possible
roles of these structures.

Sequence Analysis:
Sequence analysis uses sequencing information to identify the regions of a genome that encode peptides or
act as regulatory sequences. Many powerful software tools and computers are available for analyzing
the genomes of various organisms. These tools also detect DNA mutations in an organism and identify
sequences which are related. Shotgun sequencing techniques are also used for sequence analysis of
numerous fragments of DNA. Special software is used to find the overlaps between fragments and to
assemble them.
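
The overlap-and-assemble step for shotgun fragments can be sketched in a few lines of Python. This is a toy greedy approach under simplifying assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical, and real assemblers use far more robust methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps; leave the fragments as separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))
```

Fragments that share no sufficiently long overlap are returned unmerged, which mirrors how an assembly can end as several contigs rather than one sequence.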
Prediction of Protein Structure:-
The primary structure of a protein, its amino acid sequence, is easy to determine from the coding
sequence of the DNA molecule, but it is difficult to determine the secondary, tertiary or quaternary
structures of proteins. For this purpose either experimental methods such as crystallography are used, or
bioinformatics tools are used to predict the complex protein structures.
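
To illustrate why the primary structure is the easy part, the sketch below translates a coding DNA sequence into its amino acid sequence with the standard genetic code. The names `CODON_TABLE` and `translate` are hypothetical, and the choice to mark unrecognised codons with "X" is an assumption made for this example.

```python
from itertools import product

# Standard genetic code, listed with the bases in T, C, A, G order
# (first codon position varies slowest); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, read from its first base, into one-letter amino acids.

    Translation stops at the first stop codon; codons containing
    unrecognised characters are reported as "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGAAATTTGGGTAA"))
```

Nothing comparable exists for the higher levels of structure: going from this string of letters to a folded three-dimensional shape is the hard prediction problem the paragraph describes.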
Genome Annotation:-
In genome annotation, genomes are marked up to identify the regulatory sequences and protein-coding
regions. It is a very important part of the human genome project, as it locates the regulatory sequences.
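
One elementary annotation step is locating candidate protein-coding regions. The following sketch scans the three forward reading frames for open reading frames that run from an ATG start codon to the first in-frame stop codon. The function name `find_orfs` and the `min_codons` cutoff are hypothetical; real annotation pipelines also examine the reverse strand and use statistical gene models, which this example does not attempt.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    end is exclusive and includes the stop codon, so dna[start:end] is the ORF.
    min_codons counts the codons before the stop codon, ATG included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    seq = "CCATGAAATTTTAGCC"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```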
Comparative Genomics:-
Comparative genomics is the branch of bioinformatics which studies the relationships in genome structure
and function between different biological species. For this purpose, intergenomic maps are
constructed which enable the scientists to trace the processes of evolution that occur in genomes of
different species. These maps contain the information about the point mutations as well as the
information about the duplication of large chromosomal segments.
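
The point mutations recorded in such maps can be listed, in the simplest case, by comparing two sequences that have already been aligned to equal length. The sketch below does exactly that; the function name `point_mutations` and the use of "-" as the gap character are assumptions for this example, and it does not detect duplications of large segments.

```python
def point_mutations(seq_a, seq_b):
    """List (position, base_in_a, base_in_b) for each substitution.

    Both inputs must be pre-aligned and of equal length; columns that
    contain a gap character "-" are skipped, since they are insertions
    or deletions rather than point substitutions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and "-" not in (a, b)
    ]


if __name__ == "__main__":
    print(point_mutations("GATTACA", "GACTATA"))
```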
Health and Drug discovery:
The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease management.
Complete sequencing of human genes has enabled scientists to develop medicines and drugs which
can target more than 500 genes. Computational tools and newly identified drug targets have made drug
delivery easier and more specific, because the cells that are diseased or
mutated can now be targeted selectively. It is also easier to understand the molecular basis of a disease.

APPLICATION OF BIOINFORMATICS IN VARIOUS FIELDS

 Molecular medicine
The human genome will have profound effects on the fields of biomedical research and clinical
medicine. Every disease has a genetic component. This may be inherited (as is the case with an
estimated 3000-4000 hereditary diseases, including cystic fibrosis and Huntington's disease) or a
result of the body's response to an environmental stress which causes alterations in the genome (e.g.
cancers, heart disease, diabetes). The completion of the human genome means that we can search for
the genes directly associated with different diseases and begin to understand the molecular basis of
these diseases more clearly. This new knowledge of the molecular mechanisms of disease will enable
better treatments, cures and even preventative tests to be developed.
 Personalised medicine
Clinical medicine will become more personalised with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's
response to drugs. At present, some drugs fail to make it to the market because a small percentage of
the clinical patient population show adverse effects to a drug due to sequence variants in their DNA.
As a result, potentially life saving drugs never make it to the marketplace. Today, doctors have to use
trial and error to find the best drug to treat a particular patient as those with the same clinical
symptoms can show a wide range of responses to the same treatment. In the future, doctors will be
able to analyse a patient's genetic profile and prescribe the best available drug therapy and dosage
from the beginning.
 Preventative medicine
With the specific details of the genetic mechanisms of diseases being unravelled, the development of
diagnostic tests to measure a person's susceptibility to different diseases may become a distinct
reality. Preventative actions such as change of lifestyle or having treatment at the earliest possible
stages when they are more likely to be successful, could result in huge advances in our struggle to
conquer disease.
 Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may become a
reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the
expression of a person's genes. Currently, this field is in its infancy, with clinical trials for many
different types of cancer and other diseases ongoing.
 Drug development
At present all drugs on the market target only about 500 proteins. With an improved understanding of
disease mechanisms and using computational tools to identify and validate new drug targets, more
specific medicines that act on the cause, not merely the symptoms, of the disease can be developed.
These highly specific drugs promise to have fewer side effects than many of today's medicines.
 Microbial genome applications
Microorganisms are ubiquitous, that is they are found everywhere. They have been found
surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are
present in the environment, our bodies, the air, food and water. Traditionally, use has been made
of a variety of microbial properties in the baking, brewing and food industries. The arrival of the
complete genome sequences and their potential to provide a greater insight into the microbial
world and its capacities could have broad and far reaching implications for environment, health,
energy and industrial applications. For these reasons, in 1994, the US Department of Energy
(DOE) initiated the MGP (Microbial Genome Project) to sequence genomes of bacteria useful in
energy production, environmental cleanup, industrial processing and toxic waste reduction. By
studying the genetic material of these organisms, scientists can begin to understand these
microbes at a very fundamental level and isolate the genes that give them their unique abilities to
survive under extreme conditions.
 Waste cleanup
Deinococcus radiodurans is known as the world's toughest bacterium and it is the most radiation
resistant organism known. Scientists are interested in this organism because of its potential
usefulness in cleaning up waste sites that contain radiation and toxic chemicals.
 Climate change Studies
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for
energy, are thought to contribute to global climate change. Recently, the DOE (Department of
Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One method of
doing so is to study the genomes of microbes that use carbon dioxide as their sole carbon source.
 Alternative energy sources
Scientists are studying the genome of the microbe Chlorobium tepidum which has an unusual
capacity for generating energy from light.
 Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential for
practical applications in industry and government-funded environmental remediation. These
microorganisms thrive in water temperatures above the boiling point and therefore may provide the
DOE, the Department of Defence, and private companies with heat-stable enzymes suitable for use in
industrial processes. Other industrially useful microbes include Corynebacterium glutamicum, which
is of high industrial interest as a research object because it is used by the chemical industry for the
biotechnological production of the amino acid lysine. Lysine is one of the essential amino acids in
animal nutrition. Biotechnologically produced lysine is added to feed concentrates as a source of
protein, and is an alternative to soybeans or meat and bonemeal. Xanthomonas campestris pv. is grown
commercially
to produce the exopolysaccharide xanthan gum, which is used as a viscosifying and stabilising agent
in many industries. Lactococcus lactis is one of the most important micro-organisms involved in the
dairy industry. It is a non-pathogenic bacterium that is critical for
manufacturing dairy products like buttermilk, yogurt and cheese. This bacterium, Lactococcus lactis
ssp., is also used to prepare pickled vegetables, beer, wine, some breads and sausages and other
fermented foods. Researchers anticipate that understanding the physiology and genetic make-up of
this bacterium will prove invaluable for food manufacturers as well as the pharmaceutical industry,
which is exploring the capacity of L. lactis to serve as a vehicle for delivering drugs.
 Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of bacterial
infection among hospital patients. They have discovered a virulence region made up of a number of
antibiotic-resistant genes that may contribute to the bacterium's transformation from harmless gut
bacteria to a menacing invader. The discovery of the region, known as a pathogenicity island, could
provide useful markers for detecting pathogenic strains and help to establish controls to prevent the
spread of infection in wards.
 Forensic analysis of microbes
Scientists used their genomic tools to help distinguish the strain of Bacillus anthracis that
was used in the summer of 2001 terrorist attack in Florida from closely related anthrax strains.
 The reality of bioweapon creation
Scientists have recently built the poliomyelitis virus using entirely artificial means. They did this
using genomic data available on the Internet and materials from a mail-order chemical supply. The
research was financed by the US Department of Defence as part of a biowarfare response program to
prove to the world the reality of bioweapons. The researchers also hope their work will discourage
officials from ever relaxing programs of immunisation. This project has been met with very mixed
feelings.
 Evolutionary studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea means
that evolutionary studies can be performed in a quest to determine the tree of life and the last
universal common ancestor.
 Crop improvement
Comparative genetics of the plant genomes has shown that the organisation of their genes has
remained more conserved over evolutionary time than was previously believed. These findings
suggest that information obtained from the model crop systems can be used to suggest improvements
to other food crops. At present the complete genomes of Arabidopsis thaliana (thale cress) and
Oryza sativa (rice) are available.
 Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been successfully
transferred to cotton, maize and potatoes. This new ability of the plants to resist insect attack means
that the amount of insecticides being used can be reduced and hence the nutritional quality of the
crops is increased.
 Improve nutritional quality
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin A, iron
and other micronutrients. This work could have a profound impact in reducing occurrences of
blindness and anaemia caused by deficiencies in Vitamin A and iron respectively. Scientists have
inserted a gene from yeast into the tomato, and the result is a plant whose fruit stays longer on the
vine and has an extended shelf life.
 Development of Drought resistance varieties
Progress has been made in developing cereal varieties that have a greater tolerance for soil alkalinity,
free aluminium and iron toxicities. These varieties will allow agriculture to succeed in poorer soil
areas, thus adding more land to the global production base. Research is also in progress to produce
crop varieties capable of tolerating reduced water conditions.
 Veterinary Science
Sequencing projects of many farm animals including cows, pigs and sheep are now well under
way in the hope that a better understanding of the biology of these organisms will have huge
impacts for improving the production and health of livestock and ultimately have benefits for
human nutrition.
 Comparative Studies
Analysing and comparing the genetic material of different species is an important method for
studying the functions of genes, the mechanisms of inherited diseases and species evolution.
Bioinformatics tools can be used to make comparisons between the numbers, locations and
biochemical functions of genes in different organisms.
Organisms that are suitable for use in experimental research are termed model organisms. They have
a number of properties that make them ideal for research purposes: they have short life spans,
reproduce rapidly, are easy to handle and inexpensive, and can be manipulated at the genetic level.
An example of a model organism for humans is the mouse. Mouse and human are very closely related
(>98%) and for the most part we see a one-to-one correspondence between genes in the two species.
Manipulation of the mouse at the molecular level and genome comparisons between the two species
are revealing detailed information on the functions of human genes, the evolutionary
relationship between the two species and the molecular mechanisms of many human diseases.

Definitions of Fields Related to Bioinformatics

Bioinformatics has various applications in research in medicine, biotechnology, agriculture etc.


The following research fields have bioinformatics as an integral component:
1. Computational Biology: The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation techniques to the study of biological,
behavioral, and social systems.
2. Genomics: Genomics is any attempt to analyze or compare the entire genetic complement of one
or more species. It is, of course, also possible to compare genomes by comparing more-or-less
representative subsets of genes within genomes.
3. Proteomics: Proteomics is the study of proteins: their location, structure and function. It is the
identification, characterization and quantification of all proteins involved in a particular pathway,
organelle, cell, tissue, organ or organism, which can be studied in concert to provide accurate and
comprehensive data about that system. The study of the proteome now covers not only all the
expressed proteins in any given cell, but also the set of all protein isoforms and modifications, the
interactions between them, the structural description of proteins and their higher-order complexes,
and for that matter almost everything 'post-genomic'.
4. Pharmacogenomics: Pharmacogenomics is the application of genomic approaches and
technologies to the identification of drug targets. In short, pharmacogenomics uses genetic
information to predict whether a drug will help a patient or cause harm. It studies how genes
influence the response of humans to drugs, from the population to the molecular level.
5. Pharmacogenetics: Pharmacogenetics is the study of how the actions of and reactions to drugs
vary with the patient's genes. All individuals respond differently to drug treatments; some positively,
others with little obvious change in their conditions and yet others with side effects or allergic
reactions. Much of this variation is known to have a genetic basis. Pharmacogenetics is a subset of
pharmacogenomics which uses genomic/bioinformatic methods to identify genomic correlates, for
example SNPs (Single Nucleotide Polymorphisms), characteristic of particular patient response
profiles and uses those markers to inform the administration and development of therapies. Strikingly,
such approaches have been used to "resurrect" drugs previously thought to be ineffective but
subsequently found to work in a subset of patients, and to optimize the doses of chemotherapy for
particular patients.

6. Cheminformatics:
Chemical informatics: 'Computer-assisted storage, retrieval and analysis of chemical information,
from data to chemical knowledge.' This definition is distinct from chemoinformatics, which focuses
on drug design.
Chemometrics: The application of statistics to the analysis of chemical data (from organic,
analytical or medicinal chemistry) and the design of chemical experiments and simulations.
Computational chemistry: A discipline using mathematical methods for the calculation of molecular
properties or for the simulation of molecular behavior. It also includes, e.g., synthesis planning,
database searching and combinatorial library manipulation.
7. Structural genomics or structural bioinformatics refers to the analysis of macromolecular
structure, particularly of proteins, using computational tools and theoretical frameworks. One of the
goals of structural genomics is to extend the idea of genomics by obtaining accurate
three-dimensional structural models for all known protein families, protein domains or protein folds.
Structural alignment is a tool of structural genomics.
8. Comparative genomics: The study of human genetics by comparisons with model organisms
such as mice, the fruit fly, and the bacterium E. coli.
9. Biophysics: The British Biophysical Society defines biophysics as: "an interdisciplinary field
which applies techniques from the physical sciences to understanding biological structure and
function".
10. Biomedical informatics / Medical informatics: "Biomedical Informatics is an emerging
discipline that has been defined as the study, invention, and implementation of structures and
algorithms to improve communication, understanding and management of medical information."
11. Mathematical Biology: Mathematical biology also tackles biological problems, but the methods
it uses to tackle them need not be numerical and need not be implemented in software or hardware. It
includes things of theoretical interest which are not necessarily algorithmic, not necessarily
molecular in nature, and are not necessarily useful in analyzing collected data.
12. Computational chemistry: Computational chemistry is the branch of theoretical chemistry
whose major goals are to create efficient computer programs that calculate the properties of
molecules (such as total energy, dipole moment, vibrational frequencies) and to apply these programs
to concrete chemical objects. It is also sometimes used to cover the areas of overlap between
computer science and chemistry.
13. Functional genomics: Functional genomics is a field of molecular biology that is attempting to
make use of the vast wealth of data produced by genome sequencing projects to describe genome
function. Functional genomics uses high-throughput techniques like DNA microarrays, proteomics,
metabolomics and mutation analysis to describe the function and interactions of genes.
14. Pharmacoinformatics: Pharmacoinformatics concentrates on the aspects of bioinformatics
dealing with drug discovery.
15. In silico ADME-Tox Prediction: Drug discovery is a complex and risky treasure hunt to find the
most efficacious molecule that has no toxic effects and at the same time has the desired
pharmacokinetic profile. The hunt starts when researchers examine the binding affinity of the
molecule to its target. A huge amount of research is needed to arrive at a molecule
with a reliable binding profile. Once candidate molecules have been identified, traditional
methodologies subject them to further optimization with the aim of improving efficacy.
The molecules that show better binding are then evaluated for their toxicity and pharmacokinetic
profiles. It is at this stage that most candidates fail in the race to become a successful drug.
16. Agroinformatics / Agricultural informatics:
Agroinformatics concentrates on the aspects of bioinformatics dealing with plant genomes.
II. DATABASES
Biological databases
Biological databases are libraries of life sciences information, collected from scientific
experiments, published literature, high-throughput experiment technology, and computational
analysis. They contain information from research areas including genomics, proteomics,
metabolomics, microarray gene expression, and phylogenetics. Information contained in
biological databases includes gene function, structure, localization (both cellular and
chromosomal), clinical effects of mutations as well as similarities of biological sequences and
structures.
Why databases?
• Means to handle and share large volumes of biological data
• Support large-scale analysis efforts
• Make data access easy and updated
• Link knowledge obtained from various fields of biology and medicine
Features
• Most of the databases have a web-interface to search for data
• Common mode to search is by Keywords
• Users can choose to view the data or save it to their computer
• Cross-references help to navigate from one database to another easily

Biological databases can be broadly classified into sequence and structure databases. Sequence
databases store nucleic acid and protein sequences, while structure databases store the
three-dimensional structures of macromolecules, mainly proteins. These databases are important
tools that help scientists analyze and explain a host of biological phenomena, from the structure of
biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of
species. This knowledge helps in the fight against diseases, assists in the development of
medications and the prediction of certain genetic diseases, and helps uncover basic relationships
among species in the history of life.
Classification of databases

Types of biological Database


Types of Biological Databases (Based on data)
There are three basic types of biological databases, as follows.
1. Primary databases :
 It can also be called an archival database since it archives the experimental results submitted
by scientists. The primary database is populated with experimentally derived data such as
genome sequences, macromolecular structures, etc. The data entered here remain
uncurated (no modifications are performed on the data).
 It holds original data obtained in the laboratory, and these data are made accessible to
ordinary users without any change.
 The data are given accession numbers when they are entered into the database. The same data
can later be retrieved using the accession number. Accession number identifies each data
uniquely and it never changes.
Examples –
 Nucleic acid databases: GenBank, EMBL and DDBJ
 Protein databases: PDB, SWISS-PROT, PIR, TrEMBL, MetaCyc, etc.

2. Secondary Database :
 The data stored in these types of databases are the analyzed result of the primary database.
Computational algorithms are applied to the primary database and meaningful and
informative data is stored inside the secondary database.
 The data here are highly curated (that is, processed before being presented in the database). A
secondary database therefore contains more refined and informative knowledge than the primary
database.
Examples –
Examples of Secondary databases are as follows.
 InterPro (protein families, motifs, and domains)
 UniProt Knowledgebase (sequence and functional information on proteins)
3. Composite Databases :
 The data entered in these types of databases are first compared and then filtered based on
desired criteria.
 The initial data are taken from the primary database, and then they are merged together based
on certain conditions.
 It helps in searching sequences rapidly. Composite Databases contain non-redundant data.
Examples –
Examples of Composite Databases are as follows.
 OWL, NRDB and SWISS-PROT + TrEMBL
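
The merge-and-filter idea behind a composite database can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `build_composite`, the record layout (accession, sequence) and the sample accessions are hypothetical, and real composite databases apply their own priority and redundancy criteria.

```python
def build_composite(sources):
    """Merge records from several source databases into a non-redundant list.

    `sources` is a list of (source_name, records) pairs in priority order.
    Each record is an (accession, sequence) tuple. When the same sequence
    occurs more than once, only the first occurrence is kept.
    """
    seen = set()
    composite = []
    for source_name, records in sources:
        for accession, sequence in records:
            key = sequence.upper()  # compare sequences case-insensitively
            if key in seen:
                continue  # redundant entry: already present from an earlier source
            seen.add(key)
            composite.append((source_name, accession, sequence))
    return composite


# Hypothetical toy data, for illustration only
curated = [("P00001", "MKTAYIAK"), ("P00002", "MSLLTEVE")]
translated = [("Q90001", "mktayiak"), ("Q90002", "MADEEKLP")]

composite = build_composite([("curated", curated), ("translated", translated)])
for source_name, accession, sequence in composite:
    print(source_name, accession, sequence)
```

Listing the sources in priority order means that, when two databases hold the same sequence, the entry from the preferred (for example, better annotated) source is the one that survives.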

GenBank
The GenBank sequence database is an open-access, annotated collection of nucleotide sequences and
their protein translations, including mRNA sequences with coding regions, segments of genomic DNA
with a single gene or multiple genes, and ribosomal RNA gene clusters. GenBank is produced and
maintained by the National Center for Biotechnology Information (NCBI) as part of an international
collaboration with the EMBL Data Library from the EBI and the DNA Data Bank of Japan (DDBJ).
Individual laboratories can submit sequence data, and large-scale sequencing centres can make bulk
submissions, directly to GenBank using BankIt or Sequin. BankIt is a web-based form and Sequin is a
stand-alone software tool developed by the NCBI for submitting and updating sequences in the
GenBank, EMBL and DDBJ databases. After a sequence is submitted, the GenBank staff assign an
accession number to the newly entered sequence and perform quality assurance checks. The newly
submitted sequence is then released to the database. Data stored in GenBank can be retrieved through
Entrez or downloaded by File Transfer Protocol (FTP).
GenBank also holds Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey
Sequence (GSS) and High-Throughput Genome Sequence (HTGS) data, as well as complete microbial
genome sequences. GenBank can be accessed through the server
http://www.ncbi.nlm.nih.gov/genbank/.
There are several ways to search and retrieve data from GenBank as given under –
• Search GenBank for sequence identifiers and annotations with Entrez Nucleotide , which is
divided into three divisions:
CoreNucleotide (the main collection), dbEST (Expressed Sequence Tags), and dbGSS (Genome
Survey Sequences).
• Search and align GenBank sequences to a query sequence using BLAST.
• Search, link, and download sequences programmatically using NCBI e-utilities
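
As a rough sketch of the programmatic route, the snippet below builds an E-utilities `efetch` request URL for a GenBank flatfile. The helper name `efetch_url` is an assumption made for illustration, and the accession `U12345` is only a placeholder in the accession format, not a record chosen for its content. No network request is made here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an efetch URL that requests one record as a plain-text flatfile."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


url = efetch_url("U12345")  # placeholder accession
print(url)
```

The resulting URL could then be opened with, for example, `urllib.request.urlopen(url)`; NCBI asks heavy users to limit their request rate and to identify themselves, so a real script should follow the current E-utilities usage guidelines.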
Only original sequences can be submitted to GenBank. Direct submissions are made to
GenBank using BankIt, which is a Web-based form, or the stand-alone submission
program, Sequin. Upon receipt of a sequence submission, the GenBank staff examines the
originality of the data and assigns an accession number to the sequence and performs
quality assurance checks. The submissions are then released to the public database, where
the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of
Expressed Sequence Tag (EST), Sequence-Tagged Site (STS), Genome Survey Sequence
(GSS), and High-Throughput Genome Sequence (HTGS) data are most often submitted
by large-scale sequencing centers. The GenBank direct submissions group also processes
complete microbial genome sequences.

THE GENBANK FLATFILE: A DISSECTION


The GenBank flatfile (GBFF) is the elementary unit of information in the GenBank database. It
is one of the most commonly used formats in the representation of biological sequences. At the
time of this writing, it is the format of exchange from GenBank to the DDBJ and EMBL
databases and vice versa. The DDBJ flat file format and the GBFF format are now nearly
identical. Subtle differences exist in the formatting of the definition line
and the use of the gene feature. EMBL uses line-type prefixes, which indicate the type of
information present in each line of the record.
The GBFF can be separated into three parts: the header, which contains the information
(descriptors) that apply to the whole record; the features, which are the annotations on the record;
and the nucleotide sequence itself. All major nucleotide database flat files end with // on the last
line of the record. The header is the most database-specific part of the record.
The various databases are not obliged to carry the same information in this segment, and minor
variations exist, but some effort is made to ensure that the same information is carried from one
to the other.
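
The three-part layout described above can be illustrated with a short Python sketch. The record below is entirely hypothetical (its locus name, accession and sequence are invented for illustration), and the function name `split_flatfile` is an assumption; a production parser would need to handle many more cases.

```python
# Hypothetical, heavily shortened flatfile used only to show the layout
SAMPLE_RECORD = """\
LOCUS       EXAMPLE1                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ACCESSION   XX000000
VERSION     XX000000.1
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
     CDS             1..24
ORIGIN
        1 atggccaaag gtctgaccga ataa
//
"""


def split_flatfile(record):
    """Split one flatfile record into header lines, feature lines and sequence."""
    header, features, sequence_lines = [], [], []
    section = header
    for line in record.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # feature table starts here
            section = features
        elif line.startswith("ORIGIN"):  # sequence starts on the next line
            section = sequence_lines
            continue
        section.append(line)
    # Drop the position numbers and spaces, keeping only the residue letters
    residues = "".join(
        ch for line in sequence_lines for ch in line if ch.isalpha()
    )
    return header, features, residues


header, features, residues = split_flatfile(SAMPLE_RECORD)
print(len(header), "header lines,", len(features), "feature lines")
print(residues)
```

Switching sections on the `FEATURES` and `ORIGIN` keywords and stopping at `//` mirrors the structure of the record itself, which keeps the sketch short at the cost of ignoring multi-record files.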
The first line of all GBFFs is the Locus line:
Locus name
The locus name was originally designed to help group entries with similar
sequences: the first three characters usually designated the organism; the fourth and
fifth characters were used to show other group designations, such as gene product;
for segmented entries, the last character was one of a series of sequential integers.
Sequence length
Number of nucleotide base pairs (or amino acid residues) in the sequence record.
Molecule Type
The type of molecule that was sequenced.
GenBank division
The GenBank division to which a record belongs is indicated with a three-letter
abbreviation. In this example, the GenBank division is PRI.
Modification date
The date in the LOCUS field is the date of last modification. The sample record
shown here was last modified on
Definition
Brief description of sequence; includes information such as source organism, gene
name/protein name, or some description of the sequence's function
Accession
The unique identifier for a sequence record. An accession number applies to the
complete record and is usually a combination of a letter(s) and numbers, such as a
single letter followed by five digits (e.g., U12345) or two letters followed by six
digits (e.g., AF123456). Accession numbers do not change, even if information in
the record is changed at the author's request.
Version
If there is any change to the sequence data (even a single base), the version number
will be increased, e.g., U12345.1 → U12345.2, but the accession portion will
remain stable.
GI
"GenInfo Identifier" sequence identification number, in this case, for the nucleotide
sequence. If a sequence changes in any way, a new GI number will be assigned. GI
sequence identifiers run parallel to the new accession.version system of sequence
identifiers
Keywords
Word or phrase describing the sequence. If no keywords are included in the entry,
the field contains only a period.
Source
Free-format information including an abbreviated form of the organism name,
sometimes followed by a molecule type.
Features
Information about genes and gene products, as well as regions of biological
significance reported in the sequence. These can include regions of the sequence
that code for proteins and RNA molecules, as well as a number of other features.
The location of each feature is provided as well, and can be a single base, a
contiguous span of bases, a joining of sequence spans, and other representations. If
a feature is located on the complementary strand, the word "complement" will
appear before the base span
Source: Mandatory feature in each record that summarizes the length of the
sequence, scientific name of the source organism, and Taxon ID number. Can also
include other information such as map location, strain, clone, tissue type, etc., if
provided by submitter.
Taxon: A stable unique identification number for the taxon of the source organism.
A taxonomy ID number is assigned to each taxon
CDS
Coding sequence; region of nucleotides that corresponds with the sequence of
amino acids in a protein (location includes start and stop codons). The CDS feature
includes an amino acid translation.
<1..206
Base span of the biological feature indicated to the left, in this case, a CDS feature.
Gene
Origin
The ORIGIN may be left blank, may appear as "Unreported," or may give a local
pointer to the sequence start, usually involving an experimentally determined
restriction cleavage site or the genetic locus (if available). This information is
present only in older records.
The sequence data begin on the line immediately below ORIGIN.
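
The fields dissected above can be pulled out of a record with simple string handling. The sketch below is illustrative only: the function names are assumptions, the LOCUS line is a hypothetical nucleotide record (protein records lay out this line differently), and the location parser covers only a plain base span, optionally partial or on the complementary strand, not joins or single bases.

```python
import re


def parse_locus(line):
    """Extract the main fields from a nucleotide LOCUS line."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule_type": tokens[4],
        "division": tokens[-2],            # three-letter division code
        "modification_date": tokens[-1],
    }


def parse_version(line):
    """Split the accession.version identifier on a VERSION line."""
    identifier = line.split()[1]
    accession, _, version = identifier.partition(".")
    return accession, int(version)


def parse_simple_location(location):
    """Parse a span such as '<1..206' or 'complement(10..50)'."""
    complement = location.startswith("complement(") and location.endswith(")")
    if complement:
        location = location[len("complement("):-1]
    match = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", location)
    if match is None:
        raise ValueError("unsupported location: " + location)
    return {
        "start": int(match.group(2)),
        "end": int(match.group(4)),
        "partial_start": match.group(1) == "<",
        "partial_end": match.group(3) == ">",
        "complement": complement,
    }


# Hypothetical lines, for illustration only
locus = parse_locus(
    "LOCUS       EXAMPLE1     1234 bp    mRNA    linear   PRI 01-JAN-2000"
)
accession, version = parse_version("VERSION     U12345.2")
span = parse_simple_location("<1..206")
print(locus["division"], accession, version, span["start"], span["end"])
```

Keeping the accession and the version as separate values reflects the rule stated above: the version number changes with the sequence, while the accession portion remains stable.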
DNA Data Bank of Japan (DDBJ)
DDBJ is a nucleotide sequence data bank that receives nucleotide sequences from
researchers and issues accession numbers to data submitters. DDBJ collects sequence data
mainly from Japanese researchers, but it also accepts data from, and issues accession numbers
to, researchers in other countries. DDBJ began data bank activities in 1986 at the National
Institute of Genetics (NIG). Currently, DDBJ is in operation at NIG in Mishima, Japan.
The main activities of DDBJ are:
i) as a member of INSDC, DDBJ collects nucleotide sequence data from
researchers, issues accession numbers to the data submitters and exchanges the
collected data with EMBL-Bank and GenBank on a daily basis;
ii) DDBJ manages bioinformatics tools for data submission and retrieval;
iii) DDBJ develops tools for the analysis of biological data; and
iv) DDBJ organizes a Bioinformatics Training Course in Japanese to teach how to
analyze biological data.
Information on DDBJ can be accessed through the server http://www.ddbj.nig.ac.jp.

European Molecular Biology Laboratory (EMBL)

European Bioinformatics Institute (EMBL-EBI)
The European Bioinformatics Institute (EBI) is part of the European Molecular Biology Laboratory
(EMBL). The EMBL nucleotide sequence database, now known as EMBL-Bank, was established in
1980 at the EMBL in Heidelberg, Germany. It was the world's first nucleotide sequence database.
EMBL-EBI provides freely available data from life science experiments, performs basic research in
computational biology and offers an extensive user training programme for researchers. EMBL-EBI
stores data on DNA and RNA (genes, genomes and variation), gene expression (RNA, protein and
metabolite expression), protein (sequence, families and motifs), structure (molecular and cellular
structures), systems (reaction, interaction, pathways), chemical biology (chemogenomics and
metabolomics), ontologies (taxonomies and controlled vocabularies) and literature (scientific
publications and patents). EMBL-EBI can be accessed through the server http://www.ebi.ac.uk.

PRIMARY DATABASES OF PROTEIN

The primary protein databases hold protein sequences, most of which are inferred from the
conceptual translation of nucleotide sequences rather than determined experimentally. Such
sequences are the result of an interpretation of the nucleotide sequence information and must
therefore be treated as potentially containing misinterpreted information. There are a number of
primary protein sequence databases, and each requires some specific consideration.
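
To make "conceptual translation" concrete, here is a minimal Python sketch that translates a coding sequence with the standard genetic code. The function name `translate_cds` and the input sequence are assumptions for illustration; the sketch reads a single forward frame from the first base and ignores alternative genetic codes, introns and partial codons.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence codon by codon until the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks unknown codons
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate_cds("atggccaaaggtctgaccgaataa")  # hypothetical CDS
print(protein)  # MAKGLTE
```

Because the protein is computed rather than observed, any error in the annotated coding region (a wrong start, a missed intron, a frameshift) is carried straight into the translated sequence, which is why such entries call for caution.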
a. Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD):
• The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich
Information Centre for Protein Sequences, Germany) and the JIPID (Japan International Protein
Information Database, Japan).
• The PIR-PSD is now a comprehensive, non-redundant, expertly annotated database held in an
object-relational DBMS.
• A unique characteristic of the PIR-PSD is its classification of protein sequences based on
the superfamily concept.
• The sequences in PIR-PSD are also classified based on homology domains and sequence
motifs.
• Homology domains may correspond to evolutionary building blocks, while sequence
motifs represent functional sites or conserved regions.
• The classification approach allows a more complete understanding of the
sequence-function-structure relationship.
b. SWISS-PROT
• SWISS-PROT (1) is an annotated protein sequence database, which was created at the
Department of Medical Biochemistry of the University of Geneva and has been a collaborative
effort of the Department and the European Molecular Biology Laboratory (EMBL), since 1987.
SWISS-PROT is now an equal partnership between the EMBL and the Swiss Institute of
Bioinformatics (SIB). The EMBL activities are carried out by its Hinxton Outstation, the
European Bioinformatics Institute (EBI) (2).
• The data in each entry can be considered separately as core data and annotation.
• The core data consists of the sequences entered in common single letter amino acid code,
and the related references and bibliography. The taxonomy of the organism from which the
sequence was obtained also forms part of this core information.
• The annotation contains information on the function or functions of the protein;
post-translational modifications such as phosphorylation, acetylation, etc.; functional and
structural domains and sites, such as calcium-binding regions, ATP-binding sites, zinc fingers, etc.;
known secondary structural features, for example alpha helix, beta sheet, etc.; the quaternary
structure of the protein; similarities to other proteins, if any; and diseases associated with
deficiencies in the protein. Sequence conflicts that arise because different authors published
different sequences for the same protein, or because of mutations in different strains of an
organism, are also described as part of the annotation.
• The SWISS-PROT protein sequence database consists of sequence entries. Sequence
entries are composed of different line types, each with their own format.
• The SWISS-PROT database distinguishes itself from other protein sequence databases by
three distinct criteria: (i) annotations, (ii) minimal redundancy and (iii) integration with other
databases.
Annotation
In SWISS-PROT two classes of data can be distinguished: the core data and the annotation. For
each sequence entry the core data consists of the sequence data; the citation information
(bibliographical references) and the taxonomic data (description of the biological source of the
protein), while the annotation consists of the description of the following items:
 Function(s) of the protein
• Post-translational modification(s). For example carbohydrates, phosphorylation,
acetylation, GPI-anchor, etc.
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc
fingers, homeoboxes, SH2 and SH3 domains, etc.
• Secondary structure. For example alpha helix, beta sheet, etc.
• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiency(ies) in the protein
• Sequence conflicts, variants, etc.
We try to include as much annotation information as possible in SWISS-PROT. To obtain this
information we use, in addition to the publications reporting new sequence data, review articles
to periodically update the annotations of families or groups of proteins. We also make use of
external experts who have been recruited to send us their comments and updates concerning
specific groups of proteins (see http://www.expasy.ch/cgi-bin/experts ).
We believe that the systematic recourse both to publications other than those reporting the core
data and to subject referees represents a unique and beneficial feature of SWISS-PROT. In
SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT)
and in the keyword lines (KW). Most comments are classified by ‘topics’; this approach permits
the easy retrieval of specific categories of data from the database.
Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which
correspond to different literature reports. In SWISS-PROT we try as much as possible to merge
all these data so as to minimise the redundancy of the database. If conflicts exist between various
sequencing reports, they are indicated in the feature table of the corresponding SWISS-PROT
entry.
Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration
between the three types of sequence-related databases (nucleic acid sequences, protein sequences
and protein tertiary structures) as well as with specialised data collections. Cross-references are
provided in the form of pointers to information related to SWISS-PROT entries and found in
data collections other than SWISS-PROT. For example the sample sequence mentioned above
contains, among others, DR (Databank Reference) lines that point to EMBL, PDB, OMIM, Pfam
and PROSITE. In this particular example it is therefore possible to retrieve the nucleic acid
sequence(s) that codes for that protein (EMBL), the description of genetic disease(s) associated
with that protein (OMIM), the 3D structure (PDB) or information specific to the protein family
to which it belongs (PROSITE and Pfam).
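
The line-type layout and the DR cross-references can be illustrated with a small Python sketch. The entry below is hypothetical throughout (its identifiers do not refer to real records) and the function names are assumptions; the sketch relies only on each line starting with a two-letter code followed by three spaces, and it leaves out the sequence lines that follow the annotation in a full entry.

```python
from collections import defaultdict

# Hypothetical entry in SWISS-PROT-like format, for illustration only
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN           Reviewed;         120 AA.
AC   P00000;
DE   RecName: Full=Hypothetical example protein;
CC   -!- FUNCTION: Hypothetical function text.
DR   EMBL; X00000; CAA00000.1; -; mRNA.
DR   PDB; 0XXX; X-ray; 2.00 A; A=1-120.
DR   PROSITE; PS00000; EXAMPLE; 1.
KW   Hypothetical keyword.
//
"""


def group_by_line_type(entry):
    """Group the lines of one entry by their two-letter line code."""
    lines = defaultdict(list)
    for line in entry.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, content = line[:2], line[5:]
        lines[code].append(content)
    return lines


def cross_references(lines):
    """Map each database named on a DR line to its primary identifiers."""
    references = defaultdict(list)
    for content in lines.get("DR", []):
        fields = [field.strip() for field in content.rstrip(".").split(";")]
        references[fields[0]].append(fields[1])
    return dict(references)


line_types = group_by_line_type(SAMPLE_ENTRY)
xrefs = cross_references(line_types)
print(sorted(line_types))
print(xrefs)
```

With the lines grouped this way, the comment (CC), feature table (FT) and keyword (KW) lines that carry most of the annotation can each be processed separately, and the DR mapping gives the pointers needed to follow an entry into other databases.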
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is
released as a supplement to SWISS-PROT. It contains the translation of all coding sequences
present in the EMBL Nucleotide database, which have not been fully annotated. Thus it may
contain the sequence of proteins that are never expressed and never actually identified in the
organisms. Ongoing genome sequencing and mapping projects have dramatically increased the
number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to
dilute the quality standards of SWISS-PROT by incorporating sequences without proper
sequence analysis and annotation, we cannot speed up the incorporation of new incoming data
indefinitely. However, as we also want to make the sequences available as fast as possible we
will introduce with SWISS-PROT release 33 an unannotated supplement to SWISS-PROT. This
supplement consists of entries in SWISS-PROT-like format derived from the translation of all
coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already
included in SWISS-PROT.
We name this supplement TREMBL (TRanslation from EMBL), since the translation tools used
to create translations of the CDS are based on the program ‘TREMBL’ written by Thure Etzold
at the EMBL in Heidelberg.
Translation of all CDS in the EMBL nucleotide sequence database release 44 resulted in the
creation of 145 000 TREMBL pre-entries. Around 65 000 of these pre-entries were already
present as sequence reports in SWISS-PROT and were excluded from TREMBL. The remaining
∼80 000 sequence entries have been automatically merged whenever possible, to reduce
redundancy in TREMBL. This step led to ∼70 000 TREMBL entries, which supplement SWISS-
PROT.
We have split TREMBL into two main sections, SP-TREMBL and REM-TREMBL. SP-
TREMBL (SWISS-PROT TREMBL) contains entries (∼55 000) which should be incorporated
into SWISS-PROT. SWISS-PROT accession numbers have been assigned to these entries. SP-
TREMBL is partially redundant against SWISS-PROT, since ∼30 000 of these SP-TREMBL
entries are only additional sequence reports of proteins already in SWISS-PROT. We will try to
merge these sequence reports as fast as possible with the already existing SWISS-PROT entries
for these proteins, so as to make SWISS-PROT and TREMBL completely non-redundant. REM-
TREMBL (REMaining TREMBL) contains those entries (∼15 000) that we do not wish to
include in SWISS-PROT. This section is organized into four subsections.
• Most REM-TREMBL entries are immunoglobulins and T-cell receptors. We have
stopped entering immunoglobulins and T-cell receptors into SWISS-PROT, because we want to
keep only germ line gene-derived translations of these proteins in SWISS-PROT and not all
known somatic recombinant variations of these proteins. At the moment there are >10 000
immunoglobulins and T cell receptors in TREMBL. We would like to create a specialized
database dealing with these sequences as a further supplement to SWISS-PROT and keep only a
representative cross-section of these proteins in SWISS-PROT.
• Another category of data which will not be included in SWISS-PROT is synthetic
sequences. Again, we do not want to leave these entries in TREMBL. Ideally one should build a
specialized database for artificial sequences as a further supplement to SWISS-PROT.
• A third subsection consists of fragments with less than seven amino acids.
• The last subsection consists of CDS translations where we have strong evidence to
believe that these CDS are not coding for real proteins.
The creation of TREMBL as a supplement to SWISS-PROT was not only for the purpose of
producing a more complete and up to date protein sequence collection. We used this task to also
achieve a deeper integration of the EMBL nucleotide sequence database with SWISS-PROT +
TREMBL.
We used the PID, the Protein IDentification number found in the /db_xref qualifier tagged to
every CDS in the EMBL nucleotide sequence database, as the ID of the TREMBL entries created
from these CDS. In all 65 000 cases where an EMBL nucleotide sequence database CDS was
already present as a sequence report in SWISS-PROT, the DR lines of the
corresponding SWISS-PROT entries have been updated by citing the EMBL AC number as
primary identifier and the PID as secondary identifier. In all cases where a PID is already
integrated into SWISS-PROT, a /db_xref qualifier citing the corresponding SWISS-PROT entry is
added to the EMBL nucleotide sequence database CDS labelled with this PID.
This approach enables us to point precisely from a given SWISS-PROT entry to one of
potentially many CDS in the corresponding EMBL entry, and vice versa. This change will allow
the development of software tools that automatically retrieve that part of a nucleotide sequence
entry that codes for a specific protein. This will be especially useful in the context of the World
Wide Web, as it will render obsolete the current situation where, for example, one needs to
retrieve the complete sequence of a yeast chromosome when one wants the nucleotide sequence
coding for a specific protein encoded on that chromosome.

Secondary Databases of Protein


The secondary databases are so termed because they contain the
results of analysis of the sequences held in primary databases.
Many secondary protein databases are the result of looking for
features that relate different proteins. Some commonly used
secondary databases of sequence and structure are as follows:
a. PROSITE:
• Some databases collect the patterns found in protein
sequences rather than the complete sequences. PROSITE is one
such pattern database.
• Protein motifs and patterns are encoded as “regular
expressions”.
• Each PROSITE entry has two parts: the pattern itself and the
related descriptive text.
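
As a rough illustration of how such a pattern can be treated as a regular expression, the Python sketch below translates the usual PROSITE pattern notation (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, < and > for the sequence ends) into Python regex syntax. The function name prosite_to_regex and the test sequence are made up for this example, and the translation covers only the common cases; the pattern shown is the widely cited N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax (common cases only)."""
    pattern = pattern.strip().rstrip(".")  # patterns are usually terminated by a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")") and "(" in element:  # x(2) or x(2,4)
            element, count = element[:-1].split("(", 1)
            repeat = "{" + count + "}"
        if element == "x":  # any amino acid
            core = "."
        elif element.startswith("{") and element.endswith("}"):  # excluded residues
            core = "[^" + element[1:-1] + "]"
        else:  # a single residue or a [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


N_GLYCOSYLATION = "N-{P}-[ST]-{P}."
regex = re.compile(prosite_to_regex(N_GLYCOSYLATION))
sequence = "MNGTAANPSA"  # made-up sequence for illustration
hits = [(m.start(), m.group()) for m in regex.finditer(sequence)]
print(regex.pattern, hits)
```

Note that re.finditer reports non-overlapping matches only, so a real scanning tool would need extra care where pattern occurrences overlap.
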
b. PRINTS:
• In the PRINTS database, protein sequence patterns are stored
as ‘fingerprints’. A fingerprint is a set of motifs or patterns rather
than a single one.
• The information in a PRINTS entry may be divided into
three sections. In addition to the entry name, accession number and
number of motifs, the first section contains cross-links to other
databases that have more information about the characterized
family.
• The second section provides a table showing how many of the
sequences in the family contain how many of the motifs that make
up the fingerprint.
• The last section of the entry contains the actual fingerprints,
stored as multiple aligned sets of sequences. The alignments are
made without gaps, so there is one set of aligned
sequences for each motif.
c. MHCPep:
• MHCPep is a database comprising over 13000 peptide sequences
known to bind the Major Histocompatibility Complex (MHC) of the
immune system.
• Each entry in the database contains the peptide sequence, which
may be 8 to 10 amino acids long, together with the specific MHC
molecules to which it binds, the experimental method used to assay
the peptide, the degree of activity and the binding affinity
observed, the source protein that gave rise to the peptide when
broken down, the positions along the peptide where it anchors on
the MHC molecules, and references and cross-links to other
information.
d. Pfam:
• Pfam contains profiles built using hidden Markov models (HMMs).
• An HMM models a pattern as a series of match,
substitute, insert or delete states, with scores assigned for
moving from one state to another along the alignment.
• Each family or pattern defined in Pfam consists of four
elements. The first is the annotation, which records the source
used to make the entry, the method used and some
numbers that serve as figures of merit.
• The second is the seed alignment, which is used to bootstrap the
rest of the sequences into the multiple alignment and then the
family.
• The third is the HMM profile.
• The fourth element is the complete alignment of all the
sequences identified in that family.
CAMBRIDGE STRUCTURAL DATABASE
The Cambridge Structural Database (CSD) is both a repository and a validated and curated
resource for the three-dimensional structural data of molecules generally containing at least
carbon and hydrogen, comprising a wide range of organic, metal-organic and organometallic
molecules. The specific entries are complementary to the other crystallographic databases such
as the Protein Data Bank (PDB), Inorganic Crystal Structure Database and International Centre
for Diffraction Data. The data, typically obtained by X-ray crystallography and less frequently
by electron diffraction or neutron diffraction, and submitted by crystallographers and chemists
from around the world, are freely accessible (as deposited by authors) on the Internet via the
website of the CSD's parent organization, the Cambridge Crystallographic Data Centre (CCDC).
The CSD is overseen by the CCDC, a not-for-profit incorporated company.

The inside of the CCDC headquarters Cambridge, UK


The CSD is a widely used repository for small-molecule organic and metal-organic crystal
structures for scientists. Structures deposited with Cambridge Crystallographic Data Centre
(CCDC) are publicly available for download at the point of publication or with the consent of the
depositor. They are also scientifically enriched and included in the database used by software
offered by the centre. Targeted subsets of the CSD are also freely available to support teaching
and other activities.

https://www.ccdc.cam.ac.uk

RCSB https://www.rcsb.org/

(The Research Collaboratory for Structural Bioinformatics)

This resource is powered by the Protein Data Bank archive: information about the 3D shapes of
proteins, nucleic acids, and complex assemblies that helps students and researchers understand
all aspects of biomedicine and agriculture, from protein synthesis to health and disease.

As a member of the wwPDB, the RCSB PDB curates and annotates PDB data.
The RCSB PDB builds upon the data by creating tools and resources for research and education
in molecular biology, structural biology, computational biology, and beyond.

The Research Collaboratory for Structural Bioinformatics Protein Data Bank, the US data center for the
global PDB archive, makes PDB data freely available to all users, from structural biologists to
computational biologists and beyond. New tools and resources have been added to the RCSB PDB web
portal in support of a ‘Structural View of Biology.’ Recent developments have improved the user
experience, including the high-speed NGL Viewer that provides 3D molecular visualization in any web
browser, improved support for data file download and enhanced organization of website pages for query,
reporting and individual structure exploration. Structure validation information is now visible for all
archival entries. PDB data have been integrated with external biological resources, including
chromosomal position within the human genome; protein modifications; and metabolic pathways. PDB-
101 educational materials have been reorganized into a searchable website and expanded to include new
features such as the Geis Digital Archive.
Protein Data Bank (PDB): the PDB is a collection of experimentally determined structures (for
example, crystal structures) of biological macromolecules. It is coordinated by a consortium located
in Europe, Japan and the USA. As of August 2013, the database contained 93043 structures, which
include proteins, nucleic acids, and protein-nucleic acid or protein-small molecule complexes
(http://www.rcsb.org/pdb/home/home.do). A PDB ID or a keyword can be used to search the
database. The result summarizes the information related to the structure, such as the
crystallization conditions and the reference to the journal article in which the findings were published.

Introduction to PDB Data

The PDB archive is a repository of atomic coordinates and other information describing proteins
and other important biological macromolecules. Structural biologists use methods such as X-ray
crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of
each atom relative to each other in the molecule. They then deposit this information, which is
then annotated and publicly released into the archive by the wwPDB.

The constantly-growing PDB is a reflection of the research that is happening in laboratories
across the world. This can make it both exciting and challenging to use the database in research
and education. Structures are available for many of the proteins and nucleic acids involved in the
central processes of life, so you can go to the PDB archive to find structures for ribosomes,
oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the
information that you need, since the PDB archives so many different structures. You will often
find multiple structures for a given molecule, or partial structures, or structures that have been
modified or inactivated from their native form.
The Guide to Understanding PDB Data is designed to help you get started with charting a path
through this material, and to help you avoid a few common pitfalls. Its chapters are intertwined
with one another; the main topics are summarized below:

• PDB Data

The primary information stored in the PDB archive consists of coordinate files for
biological molecules. These files list the atoms in each protein, and their 3D location in
space. These files are available in several formats (PDB, mmCIF, XML). A typical PDB
formatted file includes a large "header" section of text that summarizes the protein,
citation information, and the details of the structure solution, followed by the sequence
and a long list of the atoms and their coordinates. The archive also contains
the experimental observations that are used to determine these atomic coordinates.

• Visualizing Structures

While you can view PDB files directly using a text editor, it is often most useful to use a
browsing or visualization program to look at them. Online tools, such as the ones on the
RCSB PDB website, allow you to search and explore the information under the PDB
header, including information on experimental methods and the chemistry and biology of
the protein. Once you have found the PDB entries that you are interested in, you may
use visualization programs to allow you to read in the PDB file, display the protein
structure on your computer, and create custom pictures of it. These programs also often
include analysis tools that allow you to measure distances and bond angles, and identify
interesting structural features.

• Reading Coordinate Files

When you start exploring the structures in the PDB archive, you will need to know a few
things about the coordinate files. In a typical entry, you will find a diverse mixture of
biological molecules, small molecules, ions, and water. Often, you can use the names and
chain IDs to help sort these out. In structures determined from crystallography, atoms are
annotated with temperature factors that describe their vibration and occupancies that
show if they are seen in several conformations. NMR structures often include several
different models of the molecule.

• Potential Challenges

You may run into several challenges as you explore the PDB archive. For example, many
structures, particularly those determined by crystallography, only include information about
part of the functional biological assembly. Fortunately, the PDB can help with this. Also,
many PDB entries are missing portions of the molecule that were not observed in the
experiment. These include structures that include only alpha carbon positions, structures
with missing loops, structures of individual domains, or subunits from a larger molecule.
In addition, most of the crystallographic structure entries do not have information on
hydrogen atoms.
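
The coordinate records described above can be read with a few lines of Python. The sketch below assumes the fixed-column layout of the legacy PDB file format for ATOM and HETATM records (atom name, residue name, chain ID, residue number, x/y/z coordinates, occupancy and temperature factor). The names Atom, parse_atoms and SAMPLE are hypothetical, the coordinates are made up, and the sample lines are cut off after the temperature-factor column; it is meant as a starting point, not a complete parser.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    record: str       # "ATOM" or "HETATM"
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float   # temperature factor


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract ATOM/HETATM records using the fixed columns of the PDB format."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # skip header, remark and other record types
        atoms.append(Atom(
            record=record,
            serial=int(line[6:11]),
            name=line[12:16].strip(),
            res_name=line[17:20].strip(),
            chain_id=line[21],
            res_seq=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
            occupancy=float(line[54:60]),
            b_factor=float(line[60:66]),
        ))
    return atoms


# Made-up sample: two protein atoms and one water molecule.
SAMPLE = """\
HEADER    HYPOTHETICAL EXAMPLE
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38
HETATM  100  O   HOH A 201      10.000  12.500  -3.250  0.50 35.20
"""

atoms = parse_atoms(SAMPLE)
waters = [a for a in atoms if a.res_name == "HOH"]
print(len(atoms), "atoms,", len(waters), "water")
```

Filtering on the residue name and chain ID, as in the last lines, is one simple way to separate the macromolecule from water, ions and small molecules in an entry.
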
NDB
The Nucleic Acid Database (NDB) (http://ndbserver.rutgers.edu) is a web portal providing
access to information about 3D nucleic acid structures and their complexes. In addition to
primary data, the NDB contains derived geometric data, classifications of structures and motifs,
standards for describing nucleic acid features, as well as tools and software for the analysis of
nucleic acids. A variety of search capabilities are available, as are many different types of
reports. This article describes the recent redesign of the NDB Web site with special emphasis on
new RNA-derived data and annotations and their implementation and integration into the search
capabilities.

The Nucleic Acid Database (NDB) was founded in 1991 to assemble and distribute structural
information about nucleic acids (1). In addition to the primary structural data that are contained
in the archival Protein Data Bank (PDB) (2), the NDB contains annotations specific to nucleic
acid structure and function, as well as tools that enable users to search, download, analyze and
learn more about nucleic acids. NDB is thus a value-added database providing services
specifically for the nucleic acid community.
The NDB contains primary structural information about nucleic acid containing structures
obtained from the PDB as well as classifications and derived data. Manually annotated nucleic
acid classifications as well as derived and calculated data regarding structural features of RNA
are managed separately from the primary structure entries; these data are recorded and stored as
external reference files (ERFs).

SCOP: SCOP (Structural Classification of Proteins) is based on the idea that proteins with
similar biological functions that are evolutionarily related to each other should have a similar
structure. The database classifies the structure of a known protein into families, superfamilies
and folds. Proteins belong to the same family if their sequence identity is at least 30% over
the total length of the sequence. Proteins with structural or functional similarity but low sequence
identity are grouped into superfamilies, whereas proteins with a similar arrangement of secondary
structure elements belong to the same fold.

CATH: Similar to SCOP, CATH classifies proteins at four levels: Class (C), Architecture (A),
Topology (T), and Homologous superfamily (H). A protein is assigned to a Class according to the
proportion of its secondary structure elements rather than their arrangement. There are 4 classes:
helices (α-class), sheets (β-class), helix-sheet (α/β class) and proteins with few secondary
structures. The arrangement of secondary structure elements in a protein structure is used for
classification at the Architecture level. The connection of secondary structure elements is used
for classification at the Topology level. The Homologous superfamily level considers the presence
of similar domains in two protein structures.

FORMATS FOR SEQUENCES

FASTA format

A sequence file in FASTA format can contain several sequences.


Each sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must
begin with a greater-than (">") symbol in the first column.

An example sequence in FASTA format is:

>AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC
CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC
CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG
AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC
CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG
TTTAATTACAGACCTGAA
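
A minimal Python sketch of reading this format is shown below. It relies only on the two rules stated above (a description line starting with ">" followed by lines of sequence data); the function name read_fasta and the short sample records are made up for illustration.

```python
from typing import Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record sample.
SAMPLE = """\
>seq1 first example
ACAAGATGCC
ATTGTCCCCC
>seq2
GGCCTCCTGC
"""

records = list(read_fasta(SAMPLE))
for description, sequence in records:
    print(description, len(sequence))
```
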

GCG format

A sequence file in GCG format contains exactly one sequence, begins with annotation lines and the start of the sequence is
marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier, the sequence length
and a checksum. This format should only be used if the file was created with the GCG package.

An example sequence in GCG format is:

ID AB000263 standard; RNA; PRI; 368 BP.
XX
AC AB000263;
XX
DE Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ Sequence 368 BP;
AB000263 Length: 368 Check: 4514 ..
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa
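
The layout described above (annotation lines, then a line ending in ".." that carries the identifier, length and checksum, then the numbered sequence) could be read along the following lines in Python. This is only a sketch: read_gcg and the sample record are hypothetical, the Check value in the sample is a placeholder rather than a real checksum, and the checksum is read but not verified.

```python
import re
from typing import Dict, Union


def read_gcg(text: str) -> Dict[str, Union[str, int]]:
    """Parse a single-sequence GCG file into name, length, checksum and sequence."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break  # this line separates the annotation from the sequence
    else:
        raise ValueError("no line ending with '..' found")
    divider = lines[index]
    name = divider.split()[0]
    length = int(re.search(r"Length:\s*(\d+)", divider).group(1))
    check = int(re.search(r"Check:\s*(\d+)", divider).group(1))
    # Keep letters only: this drops the position numbers and the spacing.
    sequence = "".join(ch for ch in "".join(lines[index + 1:]) if ch.isalpha())
    return {"name": name, "length": length, "check": check, "sequence": sequence}


# Made-up sample; the Check value is a placeholder.
SAMPLE = """\
ID   EXAMPLE1 standard; DNA; 20 BP.
XX
EXAMPLE1  Length: 20  Check: 1234  ..
       1  acaagatgcc attgtccccc
"""

entry = read_gcg(SAMPLE)
print(entry["name"], entry["length"], len(entry["sequence"]))
```

Comparing the declared length with the number of residues actually read is a cheap sanity check for truncated files.
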

GenBank format

A sequence file in GenBank format can contain several sequences.


One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start
of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").

An example sequence in GenBank format is:

LOCUS AB000263 368 bp mRNA linear PRI 05-FEB-1999
DEFINITION Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION AB000263
ORIGIN
1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
361 gacctgaa
//
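
The three markers named above (LOCUS, ORIGIN and "//") are enough to pull the sequences out of a GenBank file. The Python sketch below does just that and ignores all other annotation; read_genbank and the two sample records are made up for illustration.

```python
from typing import List, Tuple


def read_genbank(text: str) -> List[Tuple[str, str]]:
    """Return (locus name, sequence) pairs from GenBank-formatted text."""
    records = []
    locus, in_origin, chunks = None, False, []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            locus = line.split()[1]  # the name follows the LOCUS keyword
            in_origin, chunks = False, []
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence lines start after this marker
        elif line.startswith("//"):
            if locus is not None:
                records.append((locus, "".join(chunks)))
            locus, in_origin = None, False
        elif in_origin:
            # Keep letters only: this drops the position numbers and the spacing.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return records


# Made-up two-record sample.
SAMPLE = """\
LOCUS       EXAMPLE1  20 bp  DNA  linear
DEFINITION  Made-up record one.
ORIGIN
        1 acaagatgcc attgtccccc
//
LOCUS       EXAMPLE2  10 bp  DNA  linear
ORIGIN
        1 ggcctcctgc
//
"""

records = read_genbank(SAMPLE)
for name, sequence in records:
    print(name, len(sequence))
```
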
