2.3 - History of Biological Databases
Sequence databases cover both nucleic acid and protein sequences, whereas structure databases
hold three-dimensional structures, chiefly of proteins. The first sequence database was created
in the years after the insulin protein sequence became available in the mid-1950s. Insulin was
the first protein to be sequenced, and its sequence consists of just 51 residues. Around the
mid-1960s, the first nucleic acid sequence, that of a yeast tRNA with 77 bases, was determined.
During this period, three-dimensional structures of proteins were also being studied, and the
well-known Protein Data Bank (PDB) was established in 1971 as the first protein structure
database, with seven entries. It has since grown into a large database with over 150,000 entries.
While the initial databases of protein sequences were maintained in individual laboratories, the
development of a consolidated formal database, the SWISS-PROT protein sequence database, was
initiated in 1986. It grew to about 70,000 protein sequences from more than 5,000 organisms, a
small fraction of all known organisms, and has continued to expand. This wide variety of data
resources is now available for study and research by both academic institutions and industry.
They are made available as public-domain information, in the larger interest of the research
community, through the internet (www.ncbi.nlm.nih.gov) and on CD-ROM (on request from
www.rcsb.org). These databases are constantly updated with additional entries.
The practice of bioinformatics can be traced as far back as the 1960s. This is when
Margaret Oakley Dayhoff, who is sometimes referred to as the mother of bioinformatics,
developed a computer program to aid in the determination of protein sequences. Dr. Dayhoff
developed the one-letter amino acid codes to make sequences easier to input into a computer
using punch cards. Her single-letter codes are still used to this day.
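As a small illustration of why the one-letter codes are convenient for computer input, the sketch below converts a sequence written in three-letter codes into the one-letter form. The function name `to_one_letter` and the hyphen-separated input format are assumptions made for this example only; the table itself is the standard set of codes for the 20 common amino acids.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_sequence):
    """Convert a hyphen-separated three-letter sequence to one-letter codes.

    The hyphen-separated input format is an assumption of this sketch.
    """
    residues = three_letter_sequence.split("-")
    # capitalize() normalizes "GLY" or "gly" to "Gly" before the lookup.
    return "".join(THREE_TO_ONE[residue.capitalize()] for residue in residues)


# The first five residues of the insulin A chain, as an example input.
print(to_one_letter("Gly-Ile-Val-Glu-Gln"))  # prints GIVEQ
```

A 51-residue protein such as insulin shrinks from roughly 200 characters in three-letter notation to 51 characters in one-letter notation, which mattered a great deal when every character had to be punched onto a card.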
The term “bioinformatics” has been around since at least 1970, when Paulien Hogeweg and Ben
Hesper used it to describe “the study of informatic processes in biotic
systems”. From then through the 1980s, however, the concept of bioinformatics shifted away
from generally describing biochemical networks to become synonymous with sequence analysis
using algorithms to compare data. In this phase of its history, two of the most important
contributors were Elvin Kabat and Tai Te Wu. They collected and aligned antibody amino acid
sequences from humans and mice. Kabat and Wu “used a simple mathematical formula to calculate the
various amino acid substitutions at each position and predict the precise locations of segments of
the [protein]”. Their database was released in print throughout the late 1970s and 1980s until it
became so expansive that it was impossible to print.
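The text does not spell out the formula itself. The sketch below assumes the variability index commonly attributed to Wu and Kabat: for each position in an alignment, the number of different amino acids observed, divided by the relative frequency of the most common one. The function names and the toy alignment are hypothetical, and gaps are not handled.

```python
from collections import Counter


def wu_kabat_variability(column):
    """Variability of one alignment column.

    Assumed definition: (number of distinct residues) divided by
    (relative frequency of the most common residue).
    """
    counts = Counter(column)
    most_common_count = counts.most_common(1)[0][1]
    return len(counts) / (most_common_count / len(column))


def variability_profile(aligned_sequences):
    """Variability at every position of equal-length aligned sequences."""
    columns = zip(*aligned_sequences)
    return [wu_kabat_variability(column) for column in columns]


# Toy alignment (made up for illustration, not real antibody data).
toy_alignment = ["ACDE", "ACDF", "ACEG", "ATKH"]
print(variability_profile(toy_alignment))
```

Under this definition a fully conserved position scores 1, and the score rises as more residue types appear and none dominates, so peaks in the profile point to the most variable segments of the protein.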
By the end of the 1990s, bioinformatics had become known as the use of computational methods
for comparative analysis in biology. This is more in line with today’s definition, but in the 1990s
sequence analysis was still the major focus, largely because bioinformatics gained public
attention during the Human Genome Project (HGP). An argument can be made that the HGP was
a springboard for bioinformatics, as the project became a dramatic scientific race. The HGP was
initiated in 1990 as a publicly funded project. With the technology of the time, sequencing all 3
billion base pairs in the human genome was a huge challenge. Scientists had to map a gene,
sequence it in small segments, and reconstruct the sequences into a whole using the map. Suffice
it to say, it was a slow process! A privately owned company called Celera arose to compete with
the public project in 1998. Headed by Dr. J. Craig Venter, Celera was a biotech company that used
computational methods to automatically match the overlapping sections of sequences, with no more
mapping or slow, grueling human assembly. This is what bioinformatics is all about!
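To make the idea of matching overlapping sections concrete, here is a minimal sketch of greedy overlap assembly on toy fragments. It is not Celera's actual software, which had to cope with sequencing errors, repeats, and both DNA strands; the function names, the `min_overlap` parameter, and the fragments are all made up for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` characters exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(fragments)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left; the remaining pieces stay separate
        # Join the two reads, keeping the shared region only once.
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three toy fragments that overlap end to end.
fragments = ["ATGGCGT", "GCGTACG", "TACGTTA"]
print(greedy_assemble(fragments))  # prints ['ATGGCGTACGTTA']
```

The greedy strategy is the simplest possible choice and can be misled by repeated sequences, which is one reason real assemblers are far more elaborate.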
In the years since the completion of the HGP, the use of computers in biological research
has only increased. Bioinformatics has grown to encompass a huge variety of fields, from
immunology to cardiology to neuroscience and more. People working in all of these fields use
computer science to advance our understanding of life science every day! As bioinformaticians
do the work to hitch progress in biochemistry and medicine to the rapid pace of improvements in
computer processing power, we have begun to approach a world where medical science
improves at pace with Moore’s law.
With that, we have arrived at our answer, at least as it is understood today: bioinformatics
is the creation, advancement, and understanding of immense sets of data using mathematical and
computational techniques, in order to improve the quality and pace of new discoveries.
The RCSB Protein Data Bank is a resource that provides access to information about the three-
dimensional structures of proteins, nucleic acids, and complex assemblies. It was established in
1971 as a repository for structural data and has since grown to contain over 150,000 structures.
The database provides a valuable resource for researchers studying the structure and function of
biological macromolecules and has played a vital role in advancing the field of structural
biology. In recent years, the database has also begun incorporating drug discovery and design
data, making it an essential tool for the pharmaceutical industry.
GenBank, a database of nucleotide sequences, was established in 1982 with funding from the
National Institutes of Health (NIH) to store and share genetic information. It grew out of a
sequence database that Walter Goad had started at Los Alamos National Laboratory. Today, the
database contains millions of sequences from a wide range of organisms, and it has been an
essential tool for researchers in the field of bioinformatics.
GenBank has undergone significant changes since its creation. In the early days, the database
was maintained manually, with researchers submitting their sequences on paper forms. However,
as the number of submissions grew, this approach became impractical.
In 1986, the NIH began accepting electronic submissions; by 1988, the entire GenBank database
was available in electronic form. The database continued to grow in the following years, and new
features were added to make it easier to search and analyze the data.
In 1992, GenBank was made available over the internet, which made it accessible to researchers
all over the world. Access over the internet was a significant milestone in the history of
bioinformatics, as it allowed researchers to share and access genetic information more efficiently
than ever before.
Since then, GenBank has continued to evolve, with new data types being added and new tools
being developed to analyze the data. Today, it remains one of the most important resources for
researchers in bioinformatics, and it continues to play a critical role in advancing our
understanding of genetics and genomics.
The PIR-International Protein Sequence Database, established in 1984, was one of the earliest
sequence databases. It was an essential milestone in bioinformatics, allowing researchers to
analyze and compare protein sequences on a large scale. The database was later incorporated into
the UniProt Knowledgebase, which is now one of the world's most widely used protein sequence
databases.
UniProt, which stands for Universal Protein Resource, is a comprehensive protein database
created in 2002 by merging three separate databases: Swiss-Prot, TrEMBL, and PIR-PSD.
Swiss-Prot was created as a protein sequence database in 1986 by Amos Bairoch and his
team at the University of Geneva. TrEMBL was a computer-annotated supplement to Swiss-Prot,
created in 1996.
Today, UniProt is one of the largest protein databases in the world, containing information on
millions of proteins from a wide range of species. Researchers in the field of bioinformatics
widely use it for a variety of applications, including protein identification, characterization, and
annotation. UniProt also provides many tools and resources to help researchers analyze and
interpret protein data, making it an invaluable resource for the scientific community.