Biological Database 1
Databases
• A database is a collection of information that is organized so that it can be easily
accessed, managed and updated
• Annotated
• Cross-referenced
• Minimum redundancy
FUNDAMENTAL ELEMENTS OF
SEQUENCE DATABASES
ACCESSION (Z92910):
It is a unique identifier assigned to a complete sequence record
This number never changes, even if the record is modified
An “accession number” is a combination of letters and numbers, usually in the format of one
letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456)
VERSION (Z92910.1) :
It is an identification number assigned to a single, specific sequence in the database
This number is in the format “accession.version”
If any changes are made to the sequence data, the version part of the number will increase by one
E.g. U12345.1 becomes U12345.2
A version number of Z92910.1 for this HFE sequence indicates that the sequence data have not been
altered since the original submission
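The “accession.version” rule above can be sketched in a few lines of Python. This is only an illustration: the function names `split_version` and `bump_version` are made up for this example, and the pattern checks just the two accession layouts mentioned on the slide (other layouts exist in practice).

```python
import re
from typing import Tuple

# The two layouts named in the text: 1 letter + 5 digits, or 2 letters + 6 digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")

def split_version(identifier: str) -> Tuple[str, int]:
    """Split an 'accession.version' string into (accession, version)."""
    accession, _, version = identifier.partition(".")
    if not ACCESSION_RE.match(accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    if not version.isdigit():
        raise ValueError(f"missing or non-numeric version: {identifier!r}")
    return accession, int(version)

def bump_version(identifier: str) -> str:
    """Return the identifier a record would get after one sequence change."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"

print(split_version("Z92910.1"))
print(bump_version("U12345.1"))
```

Note that the accession part stays fixed while only the number after the dot grows, which is why the accession alone always points to the latest version of a record.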
GenInfo Identifier (GI) (1890179) :
Also a sequence identification number
Whenever a sequence is changed, the version number is increased and a new GI is assigned
If a nucleotide sequence record contains a protein translation of the sequence, the translation will have
its own GI number
KEYWORDS (haemochromatosis; HFE gene) :
A “keyword” can be “any word or phrase used to describe the sequence”
SOURCE (human):
Usually contains an abbreviated or common name of the source organism
ORGANISM (Homo sapiens) :
The scientific name (usually genus & species) & phylogenetic lineage
REFERENCE :
It is a citation of the publications by the sequence authors that support the information
presented in the sequence record
The FEATURES Table
BASE COUNT:
Base Count gives the total number of adenine (A), cytosine (C), guanine (G), and
thymine (T) bases in the sequence
ORIGIN:
Origin contains the sequence data, which begins on the line immediately below the
field title
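As a rough illustration of how the BASE COUNT field relates to the ORIGIN field, the sketch below counts bases in a sequence string. The fragment used here is made up for the example and is not taken from any real record; the helper name `base_count` is also hypothetical.

```python
from collections import Counter

def base_count(sequence: str) -> dict:
    """Count a, c, g and t in a nucleotide sequence (case-insensitive)."""
    counts = Counter(sequence.lower())
    return {base: counts.get(base, 0) for base in "acgt"}

# ORIGIN blocks are usually printed in space-separated chunks,
# so anything that is not a letter is dropped before counting.
origin = "gatcctccat atacaacggt"  # made-up fragment, for illustration only
cleaned = "".join(ch for ch in origin if ch.isalpha())

print(base_count(cleaned))
```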
Major Primary DB
Nucleic Acid: EMBL (Europe)
Protein: PIR (Protein Information Resource)
Protein: TrEMBL (a supplement to SWISS-PROT)
Protein: NRL-3D
TYPES OF NUCLEIC ACID
DATABASES
PRIMARY NUCLEIC ACID DATABASES:
Contain complete annotations of all the nucleic acid sequence information of
organisms whose genomes have been successfully sequenced
Examples include GenBank, DDBJ and EMBL
International Nucleotide Sequence Database
Collaboration (INSDC)
• Together, these three databases make up the International Nucleotide Sequence Database
Collaboration (INSDC)
INSDC is a synchronization of GenBank, DDBJ and EMBL databases
• Properties of INSDC include:
Consistent Accession numbers;
No legal restrictions on use, although some patented sequences are stored and
managed
Holds both sequences submitted directly by scientists and genome sequencing
groups & sequences taken from literature & patents
Has very limited error checking, so there is a fair amount of redundancy
Access is provided via ftp & www interfaces
Sequences are listed in the 5’-3’ orientation
DNA Data Bank of Japan (DDBJ)
The DNA Data Bank of Japan is a biological database that collects DNA sequences
It has collected and supplied DNA data since its inception in 1986
Data entry as in GenBank
It is also a member of the International Nucleotide Sequence Database
Collaboration or INSDC
DDBJ exchanges data via the SINET3 computer network
European Molecular Biology Laboratory (EMBL)
It is a comprehensive database of DNA and RNA sequences collected from the scientific
literature and patent applications and directly submitted from researchers and sequencing groups
Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan
(DDBJ)
It doubles in size every 18 months and as of June 1994 it contained nearly 2 million bases from
182,615 sequence entries
It is maintained by the European Bioinformatics Institute (EBI)
Data entry is friendly both to computers and humans
Standard English used (explanations, descriptions etc)
Sequences are stored in the database as they would occur in the biological state
THE NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Bethesda, MD
NCBI DATABASES AND SERVICES
• GenBank primary sequence database
• Free public access to biomedical literature
PubMed: free Medline (3 million searches per day)
PubMed Central: full-text online access
• Entrez integrated molecular and literature databases
PubMed
• The United States National Library of Medicine (NLM) at the National Institutes
of Health maintains the database as part of the Entrez system of information
retrieval
PubMed: click on the drop-down menu and select the
PubMed option. Type the topic you want to find
in the search box
After you type the topic of interest, a list of research papers will appear
in the window, from which you can select the specific papers for your study
Entrez : A retrieval system
• Capable of accessing integrated
information by searching many
of the NCBI databases with just
one query
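The idea of sending one query to several NCBI databases can be sketched with the Entrez E-utilities `esearch` endpoint. The sketch below only builds the request URLs and does not contact NCBI; the helper name `build_esearch_urls`, the search term and the choice of databases are assumptions made for illustration. Unlike the Entrez web page, which searches many databases at once, this sketch simply repeats the same term once per database.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_urls(term, databases, retmax=5):
    """Build one esearch URL per database for the same search term."""
    urls = {}
    for db in databases:
        params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
        urls[db] = f"{ESEARCH_URL}?{urlencode(params)}"
    return urls

# Hypothetical query: the same term sent to three Entrez databases.
urls = build_esearch_urls("HFE haemochromatosis", ["pubmed", "nucleotide", "protein"])
for db, url in urls.items():
    print(db, url)
```

Each URL can then be opened in a browser or fetched with any HTTP client; the response lists the identifiers of the matching records in that database.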
All the proteins that have been translated from RNA sequences, and proteins that have been
sequenced directly
PROSITE
• It consists of biologically significant sites, patterns and profiles that help to reliably identify to
which known protein family (if any) a new sequence belongs
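To show how a pattern can assign a new sequence to a family, here is a minimal sketch that assumes the patterns are written in PROSITE-style syntax (elements joined by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats). The converter name `prosite_to_regex`, the example pattern and the test sequence are all illustrative; the sequence is made up and is not a real protein.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) and (n,m) are repeat counts.
        element = element.replace("(", "{").replace(")", "}")
        # x matches any residue; < and > anchor the sequence ends.
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

# Illustrative zinc-finger-like pattern in PROSITE style.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)

# Made-up sequence constructed so that it contains one match.
sequence = "MKCAACAAALAAAAAAAAHAAAHGG"
match = re.search(regex, sequence)
print(regex)
print(match.group(0) if match else "no match")
```

A hit only suggests family membership; profiles and manual annotation are still needed to confirm it.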
Pfam
• It is a database of protein families defined as domains
• It can be searched and used to identify domains in a sequence
• It is licensed under the GNU General Public License making it available to anyone
COMPOSITE DATABASES
• They compile and filter sequence data from different primary databases to
produce combined non-redundant sets that are more complete than the individual
databases
• An example of a composite database is OWL
• OWL combines 4 publicly available primary sources:
SWISS-PROT
PIR
GenBank
NRL-3D
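The "compile and filter" step of a composite database can be sketched as a merge that keeps only the first copy of each sequence. This is a simplified illustration, not the actual OWL procedure: the function name `build_composite`, the priority order and the toy records are all assumptions, and real composite databases use more elaborate redundancy criteria than exact sequence identity.

```python
def build_composite(sources):
    """Merge records from several sources into a non-redundant list.

    `sources` is a list of (source_name, records) pairs in priority order,
    where each record is a (record_id, sequence) pair. When the same
    sequence appears more than once, the first copy encountered is kept.
    """
    seen = set()
    composite = []
    for source_name, records in sources:
        for record_id, sequence in records:
            key = sequence.upper()
            if key in seen:
                continue  # redundant entry already supplied by an earlier source
            seen.add(key)
            composite.append((source_name, record_id, sequence))
    return composite

# Toy records with made-up identifiers and sequences.
sources = [
    ("SWISS-PROT", [("sp1", "MKTAYIAK"), ("sp2", "MSLLTEVE")]),
    ("PIR", [("pir1", "MKTAYIAK"), ("pir2", "MGDVEKGK")]),
    ("GenBank", [("gb1", "mslltEVE")]),
    ("NRL-3D", [("nrl1", "MGDVEKGK")]),
]

for entry in build_composite(sources):
    print(entry)
```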
Databases
Sequence database: a) Nucleotide database: GenBank, EMBL-Bank
b) Protein database: Swiss-Prot, PIR
Structure database: PDB, NDB, DALI, MSD
Microarray database: ArrayExpress, MIAME
Chemical database: PubChem
Pathway database: KEGG, BioSilico
Enzyme database: ExPASy, REBASE
Disease database: OMIM, OMIA
Literature database: PubMed, Scopus
Your Assignment
• What is curated data?
• What is a patent?
• What are a domain and a motif?
• In how many formats can nucleotide and protein sequences be downloaded?
• What are the FASTA and GFF formats?
• What is redundancy?
• What is GRCh38?
Practical
Genes
• AMY2B • SGLT3
• MGAM • ARID1B
• GP2 • CRYM
• TAS2R38 • FRMD6
• ACMSD • GALR1
• METAP2 • GPR139
• FABP5 • GRIK3
• SGLT1
Select one gene
• Go to NCBI and collect the following information for human: