
2.3 - History of Biological Databases

Biological databases are categorized into sequence and structure databases, with significant milestones dating back to the 1960s, including the development of the first protein sequence database by Margaret Dayhoff. The Human Genome Project, launched in 1990, marked a pivotal moment for bioinformatics, leading to advancements in computational methods for biological analysis. Today, bioinformatics encompasses various fields, utilizing AI and machine learning to analyze large datasets and improve discoveries in biology and medicine.

Biological databases can be broadly classified into sequence and structure databases.

Sequence databases cover both nucleic acid sequences and protein sequences, whereas structure
databases were originally devoted to proteins. The first database was created within a short
period after the insulin protein sequence was made available in 1956. Incidentally, insulin was
the first protein to be sequenced, and its sequence consists of just 51 residues. Around the
mid-1960s, the first nucleic acid sequence, that of a yeast tRNA with 77 bases, was determined.
During this period, three-dimensional structures of proteins were also being studied, and the
well-known Protein Data Bank (PDB) was established in 1971 as the first protein structure
database, with fewer than a dozen entries. It has since grown into a large database with over
150,000 entries. While the initial protein sequence databases were maintained at individual
laboratories, the development of a consolidated formal database, the SWISS-PROT protein sequence
database, was initiated in 1986; it later grew to about 70,000 protein sequences from more than
5,000 organisms, a small fraction of all known organisms. This wide variety of data resources is
now available for study and research by both academic institutions and industry. The data are
made available as public-domain information, in the larger interest of the research community,
through the internet (www.ncbi.nlm.nih.gov) and on CD-ROM (on request from www.rcsb.org). These
databases are constantly updated with additional entries.

The practice of bioinformatics can be traced as far back as the 1960s. This is when
Margaret Oakley Dayhoff, who is sometimes referred to as the mother of bioinformatics,
developed a computer program to aid in the determination of protein sequences. Dr. Dayhoff
developed the one-letter amino acid codes to make sequences easier to input into a computer
using punch cards. Her single-letter codes are still used to this day.
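
As a small illustration of why the one-letter codes make sequences easier to enter and store, the sketch below maps the standard three-letter residue names to their one-letter codes. The names `THREE_TO_ONE` and `to_one_letter` are made up for this example and do not come from Dayhoff's software.

```python
# Standard one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert an iterable of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues)


# First five residues of the insulin A chain: Gly-Ile-Val-Glu-Gln
print(to_one_letter(["Gly", "Ile", "Val", "Glu", "Gln"]))  # GIVEQ
```

Fifteen characters of three-letter names shrink to five, which mattered a great deal when every character had to be punched onto a card.
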
The term “bioinformatics” has been in use since at least 1970, when Ben Hesper and Paulien
Hogeweg used it to describe “the study of informatic processes in biotic systems”. From then
through the 1980s, however, the concept of bioinformatics shifted away from generally describing
biochemical networks and became synonymous with sequence analysis, that is, using algorithms to
compare sequence data. In this phase of its history, two of the most important contributors were
Elvin Kabat and Tai Te Wu. They collected and aligned amino acid sequences from humans and mice.
Kabat and Wu “used a simple mathematical formula to calculate the various amino acid
substitutions at each position and predict the precise locations of segments of the [protein]”.
Their database was released in print throughout the late 1970s and 1980s, until it became so
expansive that it was impossible to print.
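
The text does not spell out the formula. One measure commonly attributed to Wu and Kabat is a per-position variability index: the number of different amino acids seen at a position divided by the frequency of the most common one. The sketch below assumes that definition and a gap-free alignment; the function names and the toy sequences are invented for illustration.

```python
from collections import Counter


def variability(column):
    """Variability index for one alignment column (assumed Wu-Kabat definition).

    Number of distinct residues divided by the frequency of the most
    common residue. A fully conserved column scores 1.0.
    """
    counts = Counter(column)
    most_common_freq = counts.most_common(1)[0][1] / len(column)
    return len(counts) / most_common_freq


def variability_profile(aligned_sequences):
    """Variability at every position of equal-length, gap-free aligned sequences."""
    return [variability(col) for col in zip(*aligned_sequences)]


# Toy alignment (made-up sequences): positions 1-2 conserved, position 3 variable.
print(variability_profile(["QVA", "QVG", "QVT", "QVA"]))
```

Positions with a high index are candidates for the variable segments of the protein, while positions scoring close to 1 are conserved.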

By the end of the 1990s, bioinformatics had come to mean the use of computational methods
for comparative analysis in biology. This is more in line with today’s definition, but in the
1990s sequence analysis was still the major focus, largely because bioinformatics gained public
attention during the Human Genome Project (HGP). An argument can be made that the HGP was a
springboard for bioinformatics, as the effort became a dramatic scientific race. The HGP was
initiated in 1990 as a publicly funded project. With the technology of the time, sequencing all 3
billion base pairs in the human genome was a huge challenge! Scientists had to map a gene,
sequence it in small segments, and reconstruct the sequences into a whole using the map. Suffice
it to say, it was a slow process! A privately owned company called Celera arose to compete with
the public project in 1998. Headed by Dr. J. Craig Venter, Celera was a biotech company that used
computational methods to automatically match the overlapping sections of sequences, with no more
mapping or slow, grueling human assembly. This is what bioinformatics is all about!
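
To make "matching the overlapping sections of sequences" concrete, here is a toy greedy sketch: it repeatedly merges the two fragments with the longest suffix-to-prefix overlap. This is only an illustration under simplifying assumptions (error-free reads, no repeats, a single strand); it is not Celera's assembler, and the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise, always joining the pair with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; leave the fragments separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGCGT", "CGTTAG", "TAGGCA"]))  # ['ATGCGTTAGGCA']
```

Real genomes contain repeats and sequencing errors, which is why production assemblers need far more elaborate strategies than this greedy merge.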

In the years since the completion of the HGP, the use of computers in biological research
has only increased. Bioinformatics has grown to encompass a huge variety of fields, from
immunology to cardiology to neuroscience and more. People working in all of these fields use
computer science to advance our understanding of life science every day! As bioinformaticians
do the work to hitch progress in biochemistry and medicine to the rapid pace of improvements in
computer processing power, we have begun to approach a world where medical science
improves at pace with Moore’s law.

With that, we have arrived at our answer, at least as it is understood today: bioinformatics
is the creation, advancement, and understanding of immense sets of data using mathematical and
computational techniques, in order to improve the quality and pace of new discoveries.

Major milestones in bioinformatics


Over the past few decades, numerous milestones in bioinformatics have significantly impacted
our understanding of biology and the development of new therapies and treatments.
• 1965: Margaret Dayhoff developed the first protein sequence database, the Atlas of Protein
Sequence and Structure. This was a major step towards understanding the relationship between
protein structure and function.
• 1970: Saul B. Needleman and Christian D. Wunsch published the first sequence alignment
method, a dynamic programming algorithm that aligns two protein or nucleotide sequences along
their full length.
• 1971: The Protein Data Bank (PDB), today the RCSB Protein Data Bank, was established.
• 1977: Frederick Sanger developed a rapid method for determining the base sequence of DNA.
The method was later automated, and it paved the way for the Human Genome Project.
• 1981: The Smith-Waterman algorithm for local sequence alignment was published. It is useful
in identifying regions of similarity that may indicate functional, structural, or evolutionary
relationships between two sequences.
• 1982: GenBank, a database of nucleotide sequences, was created by the National Institutes of
Health (NIH) as a way to store and share genetic information.
• 1984: The PIR-International Protein Sequence Database was established.
• 1990: The Human Genome Project was launched. This ambitious project aimed to sequence the
entire human genome, and it was completed in 2003.
• 1996: TrEMBL, a computer-annotated supplement to the SWISS-PROT protein sequence database
(itself started in 1986), was created. These databases contain information about protein
sequences, functions, and structures.
• Late 1990s and early 2000s: The field of metagenomics was established. This field focuses on
studying the genetic material of entire microbial communities, rather than just individual
organisms.
• 2001: The first draft of the human genome was published. This was a major breakthrough in our
understanding of human biology, and it opened up new avenues for research and drug development.
• 2002: The UniProt protein sequence database was created.
• 2010: The first cell controlled by a chemically synthesized genome was created. This was a
landmark achievement in the field of synthetic biology, and it paved the way for the creation of
new organisms with custom-designed genomes.
• 2012: The CRISPR-Cas9 system was shown to work as a programmable genome-editing tool. This
revolutionary technology allows scientists to edit genomes with unprecedented precision.
• 2023: The integration of artificial intelligence (AI) and machine learning (ML) into
bioinformatics tools and workflows is revolutionizing the field. AI and ML are being used to
analyze large datasets, predict protein structures, and develop new drugs.
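
The 1970 and 1981 alignment milestones in the list above rest on the same dynamic programming idea. The sketch below computes only the best alignment score (no traceback) with a simple linear gap penalty; the scoring values and the function name are assumptions chosen for illustration, not the parameters of the original papers.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False: global alignment over the full length (Needleman-Wunsch style).
    local=True:  best-scoring local region (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(alignment_score("GATTACA", "GATTACA"))                     # 7
print(alignment_score("CCCCGATTACA", "GATTACATTTT", local=True))  # 7
```

The only differences between the two modes are the initialization of the first row and column and the floor at zero, which lets a local alignment ignore poorly matching flanks.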

The RCSB Protein Data Bank is a resource that provides access to information about the
three-dimensional structures of proteins, nucleic acids, and complex assemblies. It was
established in 1971 as a repository for structural data and has since grown to contain over
150,000 structures. The database is a valuable resource for researchers studying the structure
and function of biological macromolecules and has played a vital role in advancing the field of
structural biology. In recent years, the database has also begun incorporating drug discovery and
design data, making it an essential tool for the pharmaceutical industry.

In 1982, GenBank, a database of nucleotide sequences, was created by the National Institutes of
Health (NIH) to store and share genetic information. It built on a database that Walter Goad had
started at Los Alamos National Laboratory. Today, the database contains millions of sequences
from a wide range of organisms, and it has been an essential tool for researchers in the field of
bioinformatics.

GenBank has undergone significant changes since its creation. In the early days, the database
was maintained manually, with researchers submitting their sequences on paper forms. However,
as the number of submissions grew, this approach became impractical.

In 1986, the NIH began accepting electronic submissions; by 1988, the entire GenBank database
was available in electronic form. The database continued to grow in the following years, and new
features were added to make it easier to search and analyze the data.

In 1992, GenBank was made available over the internet, which made it accessible to researchers
all over the world. This was a significant milestone in the history of bioinformatics, as it
allowed researchers to share and access genetic information more efficiently than ever before.

Since then, GenBank has continued to evolve, with new data types being added and new tools
being developed to analyze the data. Today, it remains one of the most important resources for
researchers in bioinformatics, and it continues to play a critical role in advancing our
understanding of genetics and genomics.

The PIR-International Protein Sequence Database, established in 1984, was one of the earliest
protein sequence databases. It was an essential milestone in bioinformatics, allowing researchers
to analyze and compare protein sequences on a large scale. The database was later incorporated
into the UniProt Knowledgebase, which is now one of the world's most widely used protein sequence
databases.

UniProt, which stands for Universal Protein Resource, is a comprehensive protein database
created in 2002 by merging three separate databases: Swiss-Prot, TrEMBL, and PIR-PSD.
Swiss-Prot was created as a protein sequence database in 1986 by Amos Bairoch and his team at
the University of Geneva. TrEMBL was a computer-annotated supplement to Swiss-Prot that was
created in 1996.

Today, UniProt is one of the largest protein databases in the world, containing information on
millions of proteins from a wide range of species. Researchers in bioinformatics use it widely
for a variety of applications, including protein identification, characterization, and
annotation. UniProt also provides many tools and resources to help researchers analyze and
interpret protein data, making it an invaluable resource for the scientific community.
