
Biological Database 1

Biological databases are collections of biological information organized so they can be easily accessed and updated. They store data from experiments, literature, and computational analysis. The key types are primary databases containing original sequence data, secondary databases with additional annotation, and composite databases combining data from multiple primary sources. Major databases include GenBank, EMBL, DDBJ for nucleic acid sequences and PIR, SWISS-PROT, and TrEMBL for protein sequences, which collaborate internationally for data sharing and standardization.

Uploaded by

Muhammad uzair

Biological Databases

Databases
• A database is a collection of information that is organized so that it can be easily
accessed, managed and updated

• Databases are hosted at different sites

• They exchange information on a daily basis so that they are up-to-date


Biological databases
• Biological databases are libraries of life sciences information, collected from
scientific experiments, published literature, high-throughput experiment
technology, and computational analysis
• They store biological data in electronic form
Purpose
• Systematization of biological data
• Availability of biological data
• Analysis of computed biological data
THE ‘PERFECT’ DATABASE
• Comprehensive, but easy to search

• Annotated

• A simple, easy to understand structure

• Cross-referenced

• Minimum redundancy

• Easy retrieval of data


HISTORY
• Insulin was the first protein to be sequenced; it is composed of 51 amino acids
• The sequence was published in the “Atlas of Protein Sequence and Structure” in 1965 by Margaret Dayhoff
• The Atlas became the basis for the PIR database
• The first nucleic acid to be sequenced was yeast alanine tRNA, composed of 77 nucleotides
• The first free-living organism whose genome was sequenced was the bacterium Haemophilus influenzae, in 1995, by Craig Venter and colleagues
CLASSIFICATION
Types of database
Primary database
• It stores original sequence data
• It gives information about sequence of DNA nucleotides or protein amino acids
Secondary database
• It stores secondary structure information or results derived from searches of the primary databases
Composite database
• They compile and filter sequence data from different primary databases to
produce combined non-redundant sets that are more complete than the individual
databases
SEQUENCE DATABASES
• Sequence Databases are classified as:
 Genome sequence databases
 Nucleic acid sequence databases
 Protein sequence databases
 Amino acid sequence databases
• Sequence databases also fall into three database categories:
 Primary databases
 Secondary databases
 Composite databases
FUNDAMENTAL ELEMENTS OF SEQUENCE DATABASES
• All of the following elements represent the “ideal minimal content of annotation entry in a
Sequence Database”
 Name: LOCUS, ENTRY, ID (all unique identifiers)
 Definition: A brief, one-line, textual sequence description
 Accession: A constant data identifier
 Version
 GenInfo Identifier (GI)
 Comments & Keywords
 Source
 Organism & Taxonomy Information
 Literature References
 Features table
 Base count & Origin
 The Sequence itself
FUNDAMENTAL ELEMENTS OF
SEQUENCE DATABASES
• The LOCUS field: It consists of five different subfields, namely:
 1a Locus Name (e.g. HSHFE) - It is a tag for grouping similar sequences. The first two or three letters usually
designate the organism. In this case HS stands for Homo sapiens. The last several characters are associated
with another group designation, such as gene product. In this example, the last three characters represent the gene
symbol, HFE. Currently, the only requirement for assigning a locus name to a record is that it is unique
 1b Sequence Length (12146 bp) – It is the total number of nucleotide base pairs (or amino acid residues) in
the sequence record
 1c Molecule Type (e.g. DNA) - Type of molecule that was sequenced. All sequence data in an entry must be
of the same type
 1d GenBank Division (PRI) - GenBank has different divisions. In this example, PRI stands for primate
sequences. Other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant,
fungal, and algal sequences), & BCT (bacterial sequences)
 1e Modification Date (23-July-1999) - Date of most recent modification made to the record. The date of first
public release is not available in the sequence record. This information can be obtained only by contacting
NCBI at info@ncbi.nlm.nih.gov
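As a rough illustration of the five LOCUS subfields, the sketch below splits a LOCUS line on whitespace. The example line is assembled from the HFE values quoted above, so its exact layout is an assumption; real GenBank records may carry extra columns that this sketch does not handle. `parse_locus` is a hypothetical helper name, not part of any NCBI tool.

```python
def parse_locus(line):
    """Split a simplified LOCUS line into its five subfields."""
    tokens = line.split()
    # Expected tokens: LOCUS, name, length, unit, molecule type, division, date
    if len(tokens) != 7 or tokens[0] != "LOCUS":
        raise ValueError("unexpected LOCUS layout: " + line)
    _, name, length, unit, molecule, division, date = tokens
    return {
        "locus_name": name,          # 1a
        "length": int(length),       # 1b
        "unit": unit,                # bp for nucleotides, aa for proteins
        "molecule_type": molecule,   # 1c
        "division": division,        # 1d
        "modification_date": date,   # 1e
    }

# Illustrative line built from the HFE example in the text (layout assumed)
example = "LOCUS       HSHFE      12146 bp    DNA     PRI       23-JUL-1999"
record = parse_locus(example)
print(record["locus_name"], record["length"], record["division"])
```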
FUNDAMENTAL ELEMENTS OF
SEQUENCE DATABASES
DEFINITION:
 It is a brief description of the sequence
 The description may include source organism name, gene or protein name, or designation as
untranscribed or untranslated sequences (e.g., a promoter region)
 For sequences containing a coding region (CDS), the definition field may also contain a
“completeness” qualifier such as “complete CDS” or “exon 1”

ACCESSION (Z92910):
 It is a unique identifier assigned to a complete sequence record
 This number never changes, even if the record is modified
 An “accession number” is a combination of letters and numbers that are usually in the format of one
letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456)
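The two accession layouts named above can be checked with a regular expression. This is a minimal sketch that covers only those two layouts (one letter plus five digits, two letters plus six digits); other layouts exist and are rejected here. `looks_like_accession` is a hypothetical helper name.

```python
import re

# One letter + five digits, or two letters + six digits (the two layouts in the text)
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")

def looks_like_accession(text):
    """Return True if text matches one of the two accession layouts."""
    return ACCESSION_RE.fullmatch(text) is not None

for candidate in ["Z92910", "M12345", "AC123456", "Z92910.1"]:
    print(candidate, looks_like_accession(candidate))
```

The last candidate is rejected because it carries a version suffix, which belongs to the VERSION field and not to the accession itself.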
FUNDAMENTAL ELEMENTS OF
SEQUENCE DATABASES
VERSION (Z92910.1) :
 It is an identification number assigned to a single, specific sequence in the database
 This number is in the format “accession.version”
 If any changes are made to the sequence data, the version part of the number will increase by one
 E.g. U12345.1 becomes U12345.2
 A version number of Z92910.1 for this HFE sequence indicates that the sequence data has not been
altered since the original submission
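The “accession.version” scheme can be sketched in a few lines. The helper names `split_version` and `next_version` are hypothetical, and the code only illustrates the numbering rule described above; it does not reflect how the databases assign versions internally.

```python
def split_version(identifier):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = identifier.partition(".")
    if not accession or not version.isdigit():
        raise ValueError("expected accession.version, got: " + identifier)
    return accession, int(version)

def next_version(identifier):
    """Return the identifier a record would carry after one sequence change."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"

print(split_version("Z92910.1"))
print(next_version("U12345.1"))
```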
GenInfo Identifier (GI) (1890179) :
 Also a sequence identification number
 Whenever a sequence is changed, the version number is increased and a new GI is assigned
 If a nucleotide sequence record contains a protein translation of the sequence, the translation will have
its own GI number
FUNDAMENTAL ELEMENTS OF
SEQUENCE DATABASES
KEYWORDS (haemochromatosis; HFE gene) :
A “keyword” can be “any word or phrase used to describe the sequence”
SOURCE (human):
Usually contains an abbreviated or common name of the source organism
ORGANISM (Homo sapiens) :
The scientific name (usually genus & species) & phylogenetic lineage
REFERENCE :
It is a citation of publications by sequence authors that supports information
presented in the sequence record
The FEATURES Table
FUNDAMENTAL ELEMENTS OF
SEQUENCE DATABASES
BASE COUNT:
Base Count gives the total number of adenine (A), cytosine (C), guanine (G), and
thymine (T) bases in the sequence
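A base count of this kind is straightforward to compute. The sketch below assumes the sequence is given as a plain string of A, C, G and T; any other characters are ignored. `base_count` is a hypothetical helper name.

```python
from collections import Counter

def base_count(sequence):
    """Count A, C, G and T in a nucleotide sequence (case-insensitive)."""
    counts = Counter(sequence.upper())
    # Report the four bases in a fixed order, with 0 for any that are absent
    return {base: counts.get(base, 0) for base in "ACGT"}

print(base_count("acgtAC"))
```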
ORIGIN:
Origin contains the sequence data, which begins on the line immediately below the
field title
Major Primary DB
Nucleic Acid                     Protein
EMBL (Europe)                    PIR - Protein Information Resource
GenBank (USA)                    MIPS
DDBJ (DNA Data Bank of Japan)    SWISS-PROT (University of Geneva, now with EBI)
                                 TrEMBL (a supplement to SWISS-PROT)
                                 NRL-3D
TYPES OF NUCLEIC ACID
DATABASES
PRIMARY NUCLEIC ACID DATABASES:
 Contain complete annotations of all the nucleic acid sequence information of
organisms whose genomes have been successfully sequenced
 Examples include GenBank, DDBJ and EMBL
International Nucleotide Sequence Database
Collaboration (INSDC)
• Together, GenBank, DDBJ and EMBL make up the International Nucleotide Sequence Database
Collaboration (INSDC)
• INSDC keeps the GenBank, DDBJ and EMBL databases synchronized
• Properties of INSDC include:
 Consistent Accession numbers;
 No legal restrictions on use, although some patented sequences are stored and
managed
 Holds both sequences submitted directly by scientists and genome sequencing
groups & sequences taken from literature & patents
 Has very limited error checking, so there is a fair amount of redundancy
 Access is provided via ftp & www interfaces
 Sequences are listed in the 5’-3’ orientation
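Because entries are listed in the 5’-3’ orientation, reading the opposite strand in the same orientation means taking the reverse complement. The sketch below assumes a plain DNA string containing only A, C, G and T; `reverse_complement` is a hypothetical helper name.

```python
# Translation table mapping each base to its complement (both cases)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Return the opposite strand, written 5'-3'."""
    return sequence.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))
```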
DNA Data Bank of Japan
Overview
DNA Data Bank of Japan (DDBJ)
The DNA Data Bank of Japan is a biological database that collects DNA sequences
It has collected and supplied DNA data since its inception in 1986
Data entry is the same as in GenBank
It is also a member of the International Nucleotide Sequence Database
Collaboration or INSDC
DDBJ exchanges data via the SINET3 computer network
European Molecular
Biology Laboratory
Overview
European Molecular Biology Laboratory
(EMBL)
It is a comprehensive database of DNA and RNA sequences collected from the scientific
literature and patent applications and directly submitted by researchers and sequencing groups
Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan
(DDBJ)
It doubles in size every 18 months and as of June 1994 it contained nearly 2 million bases from
182,615 sequence entries
It is maintained by the European Bioinformatics Institute (EBI)
Data entry is friendly both to computers and humans
Standard English used (explanations, descriptions etc)
Sequences are stored in the database as they would occur in the biological state
THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV

NCBI DATABASES AND SERVICES
• GenBank primary sequence database
• Free public access to biomedical literature
 PubMed free Medline (3 million searches per day)
 PubMed Central full text online access
• Entrez integrated molecular and literature databases
PubMed

• PubMed is a free search engine accessing primarily the MEDLINE database of
references and abstracts on life sciences and biomedical topics

• The United States National Library of Medicine (NLM) at the National Institutes
of Health maintains the database as part of the Entrez system of information
retrieval
PubMed: click on the drop-down menu and select the PubMed option, then type the topic you
want to find in the search box
After the topic of interest is entered, a list of matching research papers appears,
from which specific papers can be selected for study
Entrez : A retrieval system
• Capable of accessing integrated information by searching many of the NCBI databases
with just one query
• This avoids searching only one database per query and then repeating the same query
to find information on the same topic in another NCBI database
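For scripted access, NCBI exposes Entrez through the E-utilities web service, where an ESearch request targets one database at a time. The sketch below only builds the request URLs for the same term across several databases and sends nothing over the network. The search term and the choice of databases are illustrative assumptions, and `esearch_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_urls(term, databases):
    """Build one ESearch URL per database for the same search term."""
    return {
        db: ESEARCH_URL + "?" + urlencode({"db": db, "term": term})
        for db in databases
    }

# Illustrative term and databases (assumed, not taken from the slides)
urls = esearch_urls("HFE gene", ["pubmed", "nuccore", "protein"])
for db, url in urls.items():
    print(db, url)
```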
Genetic Sequence Databank
• GenBank is one of the fastest growing repositories of known genetic sequences
• Records are text files that are readable & downloadable
• It is maintained by the National Center for Biotechnology Information (NCBI)
• Entry data contains information on:
 The sequence
 Accession numbers
 The scientific and gene names
 Taxonomy/phylogenetic classification of the source organism
 A features table that identifies coding regions
 References to published literature
 Transcription units
 Mutation sites
TYPES OF NUCLEIC ACID DATABASES
• SECONDARY NUCLEIC ACID DATABASES

 They contain additional information derived from analysis of data available in
primary repositories

 They deal with particular classes of sequences

 Examples include UniGene, the HIV sequence database and REBASE


SECONDARY NUCLEIC ACID DATABASES
• UniGene
 It has records with unique gene clusters
 Each cluster contains sequences that represent a unique gene, together with related
information, e.g. the tissue types in which the gene has been expressed
 The database is populated with Expressed Sequence Tags (ESTs)

• HIV SEQUENCE DATABASE
 The HIV Sequence Database (HSD) collects, curates & annotates HIV sequence
data.
Protein Sequence Database
PROTEIN SEQUENCE DATABASES
• They consist of:

 All the proteins that have been translated from RNA sequences, together with proteins
that have been sequenced directly

• Three (3) types of protein sequence databases exist:

 Primary protein databases

 Secondary protein databases

 Composite protein databases


PRIMARY PROTEIN SEQUENCE
DATABASES
• Primary protein sequence databases are:
 SWISS-PROT
 PIR (protein information resource)
• Both SWISS-PROT & PIR are curated
This means groups of designated curators (database managers) prepare the entries
from literature and/ or contacts with external experts prior to submission into the
respective databases
Swiss-Prot
• It provides high-level annotations describing:
 Functions of a protein
 Protein domain structure
 Post-translational modifications
 Protein variants and other variables
• It also provides a minimum level of redundancy & a high level of integration with
other databases
• It has legal restrictions in that entries are copyrighted, but freely accessible and
usable by academic researchers
PROTEIN INFORMATION RESOURCE
(PIR)
• It is a division of the National Biomedical Research Foundation (NBRF) in the US
• It is a database that produces the NRL-3D (a database of sequences extracted from the
three dimensional structures in the Protein Databank (PDB))
• Its existence allows sequence information in PDB to be available for similarity searches
& retrieval & provides cross reference information for use with other PIR Protein
Sequence databases
• It provides comprehensive, well organized, & accurate information about proteins such as
sequence similarity
SECONDARY PROTEIN SEQUENCE
DATABASES
• Major examples of Secondary protein sequence databases are:
 TrEMBL
 Prosite
 Pfam
 TrEMBL:
• TrEMBL stands for Translation of EMBL nucleotide sequence database
• It is a computer-annotated supplement of SWISS-PROT
• It contains all translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
• TrEMBL speeds new sequence information to the public
SECONDARY PROTEIN SEQUENCE
DATABASES
 PROSITE
• It is a database of protein families and domains
• It consists of biologically significant sites, patterns and profiles that help to reliably identify to
which known protein family (if any) a new sequence belongs
• It is maintained alongside Swiss-Prot and in much the same way
• It is based on regular expressions describing characteristic sub-sequences of specific protein
families or domains

 Pfam
• It is a database of protein families defined as domains
• It can be searched and used to identify domains in sequence
• It is licensed under the GNU General Public License making it available to anyone
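To make the regular-expression idea behind PROSITE concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and searches a sequence with it. It is a simplified converter that handles only the common syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends). The test sequences are made up, and `prosite_to_regex` is a hypothetical helper name.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B, C"
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count
        element = element.replace("(", "{").replace(")", "}")
        # x is a wildcard; < and > anchor the pattern to the sequence ends
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

# N-glycosylation site pattern written in PROSITE syntax
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
print(re.search(regex, "MKNGSAL") is not None)  # made-up sequence
```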
COMPOSITE DATABASES
• They compile and filter sequence data from different primary databases to
produce combined non-redundant sets that are more complete than the individual
databases
• An example of a composite database is OWL
• OWL combines 4 publicly available primary sources:
 SWISS-PROT
 PIR
 GenBank
 NRL-3D
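The compile-and-filter step of a composite database can be sketched as a merge that keeps each distinct sequence once. This is a toy illustration, not OWL's actual procedure: the priority order of the sources, the record identifiers and the sequences below are all assumptions made for the example, and `build_composite` is a hypothetical helper name.

```python
def build_composite(sources):
    """Merge (source name, {entry id: sequence}) pairs into a non-redundant list.

    Sources are processed in the order given, so an entry from an earlier
    source wins when the same sequence appears again later.
    """
    seen = set()
    composite = []
    for source_name, records in sources:
        for entry_id, sequence in records.items():
            if sequence in seen:
                continue  # identical sequence already taken from an earlier source
            seen.add(sequence)
            composite.append((source_name, entry_id, sequence))
    return composite

# Made-up records for illustration only
sources = [
    ("SWISS-PROT", {"P1": "MKTAYIAK"}),
    ("PIR", {"A1": "MKTAYIAK", "A2": "MAAGLLQ"}),
]
composite = build_composite(sources)
for row in composite:
    print(row)
```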
Databases
Database type          Examples
Sequence database      a) Nucleotide database: GenBank, EMBL-Bank
                       b) Protein database: Swiss-Prot, PIR
Structure database     PDB, NDB, DALI, MSD
Microarray database    ArrayExpress, MIAME
Chemical database      PubChem
Pathway database       KEGG, BioSilico
Enzyme database        ExPASy, REBASE
Disease database       OMIM, OMIA
Literature database    PubMed, Scopus
Your Assignment
• What is curated data?
• What is a patent?
• What are a domain and a motif?
• In how many formats can nucleotide and protein sequences be downloaded?
• What are the FASTA and GFF formats?
• What is redundancy?
• What is GRCh38?
Practical
Genes
• AMY2B • SGLT3
• MGAM • ARID1B
• GP2 • CRYM
• TAS2R38 • FRMD6
• ACMSD • GALR1
• METAP2 • GPR139
• FABP5 • GRIK3
• SGLT1
Select one gene
• Go to NCBI and collect the following information for human:
 Gene full name
 Gene ID
 Gene type
 Gene location
 Number of exons
 Number of A, T, G, C bases
 Length of gene
 Download the gene, mRNA and protein sequences in FASTA format
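Once a FASTA file has been downloaded, it can be read with a few lines of code. In FASTA format each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The sketch below parses FASTA text held in a string. The sample text is made up, and `parse_fasta` is a hypothetical helper name.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record
    return {name: "".join(parts) for name, parts in records.items()}

# Made-up FASTA text for illustration only
sample = ">seq1 demo\nACGT\nAC\n>seq2\nGG\n"
sequences = parse_fasta(sample)
for name, seq in sequences.items():
    print(name, len(seq))
```

The length printed for each record corresponds to the "Length of gene" item above when the parsed file holds the gene sequence.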
