PC#1 Exercises Introduction To NCBI 2020-Solved
I. Explore the National Center for Biotechnology Information (NCBI) website and get familiar
with its design, its environment, and its major databases.
https://www.ncbi.nlm.nih.gov/
What is NCBI? How many databases are hosted by the NCBI?
59 https://www.ncbi.nlm.nih.gov/guide/all/#databases
Which of the following NCBI databases could be considered as primary databases? Protein,
Nucleotide, CDD, PubMed, Gene, Genomes, Refseq, BioProjects.
Protein, Nucleotide, PubMed, BioProjects. PubMed is a special case: it does not hold
“experimental data” in the strict sense, but its records are stored “as is”, with no post-processing.
Perform a Global Query at the NCBI through Entrez using the expression (all[Filter]). Which
database contains the largest number of records?
https://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
https://www.ncbi.nlm.nih.gov/genbank/statistics/
https://www.ncbi.nlm.nih.gov/search/all/?term=all[Filter]
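The comparison across databases can also be sketched programmatically. The following is a minimal illustration, assuming the E-utilities ESearch endpoint with `rettype=count`; the function names are hypothetical and not part of the exercise. The network call is kept in its own function so that the URL building and parsing can be checked offline.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Public E-utilities ESearch endpoint (assumed reachable from your network).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(db, term):
    """Build an ESearch URL that asks only for the number of matching records."""
    params = {"db": db, "term": term, "rettype": "count", "retmode": "xml"}
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_count(xml_text):
    """Extract the integer inside the top-level <Count> element of an ESearch reply."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


def largest_database(counts):
    """Return the (database, count) pair with the highest record count."""
    return max(counts.items(), key=lambda item: item[1])


def fetch_counts(term, dbs):
    """Query each database in turn (requires network access)."""
    counts = {}
    for db in dbs:
        with urllib.request.urlopen(esearch_count_url(db, term)) as reply:
            counts[db] = parse_count(reply.read().decode("utf-8"))
    return counts
```

With network access, `largest_database(fetch_counts("all[Filter]", ["pubmed", "nuccore", "protein"]))` would compare three Entrez databases; the list is only a small subset chosen for illustration.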
Now we want to find all the information at the NCBI related to the group of human diseases
known as cancer. Type the word “cancer” in the search box on the NCBI homepage
and run the search (Global Query). Note that the query is interpreted differently in different
databases.
How many scientific papers contain this word?
https://www.ncbi.nlm.nih.gov/pubmed/?term=cancer 4180400
How many cancer-related functional genomics studies have been stored at NCBI?
https://www.ncbi.nlm.nih.gov/bioproject/?term=cancer 28765
Why does the Taxonomy database give us one record? (For discussion in class)
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=6754&lvl=3&lin=f&keep=1&srchmode=1&unlock
“Cancer” is also the name of a crustacean genus in the family Cancridae.
Perform a new Global Query but using the word “human”.
How many entries (records) have been obtained for the different databases?
https://www.ncbi.nlm.nih.gov/search/all/?term=human
Would we get the same results if we performed a Global Query using each of the following
search expressions?
(homo sapiens)
https://www.ncbi.nlm.nih.gov/search/all/?term=homo%20sapiens
(human[organism])
https://www.ncbi.nlm.nih.gov/search/all/?term=human%5Borganism%5D
Why?
In some databases “human” appears to be treated as an alias of “homo sapiens”, but not in all
of them (e.g. the PubChem databases).
When the [organism] tag is included, the term is searched only in the “organism” field of each
database. Some databases use controlled vocabularies for this field; others do not.
Beware that the counts on the main results screen do not necessarily match the results in the
database itself (after clicking the link), e.g. Taxonomy.
If you are interested in studying human cancer, which of the following strategies would
produce a more useful set of results in a Global Query at the NCBI?
cancer OR human[organism]
“cancer AND human” searches for both terms in any field, so we can find cases like:
https://www.ncbi.nlm.nih.gov/gene/39645575 a gene from a “Klebsiella pneumoniae” isolated
from a cancer patient.
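How Boolean operators and field tags end up in a Global Query link can be sketched as follows. This is a minimal illustration: the helper names are hypothetical, and the expression `cancer AND human[organism]` is used only as an example of combining a free-text term with a field-restricted one.

```python
from urllib.parse import quote

# Base of the global-search links used in this sheet.
GLOBAL_SEARCH = "https://www.ncbi.nlm.nih.gov/search/all/?term="


def tagged(term, field=None):
    """Attach an Entrez field tag such as [organism] to a term."""
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join terms with an uppercase Boolean operator (AND, OR, NOT)."""
    return f" {operator.upper()} ".join(terms)


def global_query_url(expression):
    """Percent-encode the expression so spaces and brackets survive in a URL."""
    return GLOBAL_SEARCH + quote(expression)


# OR keeps records matching either term; AND keeps only records matching both.
broad = combine("or", "cancer", tagged("human", "organism"))
narrow = combine("and", "cancer", tagged("human", "organism"))
print(global_query_url(broad))
print(global_query_url(narrow))
```

The encoded form of `human[organism]` produced here is the same `human%5Borganism%5D` that appears in the link given earlier in this sheet.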
III. At NCBI each record is assigned a UID (“unique integer identifier”) for internal tracking.
In sequence databases, records are also identified by an accession number.
SRX4644664 SRA
PRJNA490405 BioProject
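The prefix of an identifier often reveals the database it belongs to. The sketch below uses a small, deliberately incomplete prefix table (an assumption for illustration; NCBI uses many more prefixes) to classify identifiers that appear in this sheet. The function name is hypothetical.

```python
import re

# Deliberately incomplete prefix table, for illustration only.
PREFIX_RULES = [
    (re.compile(r"^SRX\d+$"), "SRA"),            # SRA experiment
    (re.compile(r"^PRJNA\d+$"), "BioProject"),   # BioProject record
    (re.compile(r"^PMC\d+$"), "PubMed Central"), # full-text article
]


def guess_database(identifier):
    """Return the database suggested by the identifier prefix, or 'unknown'."""
    for pattern, database in PREFIX_RULES:
        if pattern.match(identifier):
            return database
    return "unknown"


for identifier in ("SRX4644664", "PRJNA490405"):
    print(identifier, guess_database(identifier))
```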
IV. Open the NCBI entry with accession number NG_011877 and get familiar with the format
and the different fields used to store sequence information. This will open in the GenBank
Flat File Format.
What does this entry represent? Do you think this entry provides cross-references (links) to
other databases? From which organism was this sequence obtained? What is the UID or
identifier for this organism in the Taxonomy database? What does the underscore “_” in the
accession number stand for? Display the entry in FASTA format. What happened?
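Switching the display format corresponds to requesting the same record with a different return type. A minimal sketch, assuming the E-utilities EFetch endpoint and its `rettype` values `gb` and `fasta` for the nucleotide database; the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype):
    """Build an EFetch URL for one nucleotide record in the given text format."""
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_text(url):
    """Download and decode a record (requires network access)."""
    with urlopen(url) as reply:
        return reply.read().decode("utf-8")


def fasta_header(fasta_text):
    """Return the definition line (without '>') of the first FASTA record."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            return line[1:].strip()
    return None


genbank_url = efetch_url("NG_011877", "gb")    # GenBank flat file
fasta_url = efetch_url("NG_011877", "fasta")   # definition line + sequence only
print(genbank_url)
print(fasta_url)
```

Comparing `fetch_text(genbank_url)` with `fetch_text(fasta_url)` shows that the FASTA view keeps only a definition line and the sequence, while the annotation fields of the flat file are not displayed.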
V. In the Taxonomy database explore all the information related to the organism Homo
sapiens.
Look at the lineage for this taxon. What order do humans belong to? Primates. What is the txid
for this mammalian order? 9443.
How many human protein sequences are there today at the NCBI? 1421783
With which of these strategies will you find all the human sequences stored in the nucleotide
database at the NCBI?
A. txid9606[Primary organism]
C. homo sapiens[porgn]
D. human[porgn]
Use Batch Entrez to upload a file of GIs or accession numbers from the Nucleotide or Protein
databases, or upload a list of record identifiers from other Entrez databases. Batch Entrez will
automatically download the corresponding records.
In this exercise we will retrieve from the NCBI database all sequences related to Homo sapiens
tumor protein 53 (TP53) published in a paper with PubMed Central accession number PMC3675194.
This flat text file has a list of the accession numbers referenced in this paper.
https://www.ncbi.nlm.nih.gov/sites/batchentrez
3. Select the database from which the list of accessions will be queried.
4. Use the “Browse” button to select the file containing the list of identifiers from
your system directory.
5. After pressing the “Retrieve” button you will see a list of record summaries. Retrieve
them!
6. Optionally, select a format in which to display the data for viewing, and/or saving.
Select “Send to file” to save the file.
If it does not work you can obtain the same results with this link:
https://www.ncbi.nlm.nih.gov/nuccore/?term=KC820708:KC820786[pacc]
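The range expression in the link above covers consecutive accessions. A sketch of how such a range could be expanded into a list with one identifier per line, suitable for saving as a plain text file for Batch Entrez; the function names are hypothetical, and the file-writing helper is defined but not called here.

```python
import re


def expand_accession_range(first, last):
    """List every accession from first to last (same letter prefix, inclusive)."""
    pattern = re.compile(r"^([A-Z]+)(\d+)$")
    m_first, m_last = pattern.match(first), pattern.match(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share a letter prefix")
    prefix = m_first.group(1)
    width = len(m_first.group(2))  # preserve leading zeros, if any
    start, stop = int(m_first.group(2)), int(m_last.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


def write_accession_list(path, accessions):
    """Write one accession per line to a plain text file."""
    with open(path, "w") as handle:
        handle.write("\n".join(accessions) + "\n")


accessions = expand_accession_range("KC820708", "KC820786")
print(len(accessions), accessions[0], accessions[-1])
```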
You can download all these sequences as FASTA files using E-utilities, under Linux / OSX / Cygwin,
etc.:
Note that the command “b” starts with “cat” and ends with “text”. The command “a” is
important; you should know why.
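As a separate illustration (this is not the commands “a” and “b” referred to above), the same kind of download could be sketched in Python with the E-utilities EFetch endpoint. The function names and the batch size are assumptions made for this example; the identifiers are sent by POST, which EFetch accepts for longer lists.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(text):
    """Return the non-empty, stripped lines of an accession list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def efetch_body(accessions):
    """Encode the form data for one EFetch request returning FASTA text."""
    params = {"db": "nuccore", "id": ",".join(accessions),
              "rettype": "fasta", "retmode": "text"}
    return urlencode(params).encode("ascii")


def download_fasta(accessions, out_path, batch_size=50):
    """POST each batch to EFetch and write the FASTA text to out_path (needs network)."""
    with open(out_path, "w") as out:
        for chunk in batches(accessions, batch_size):
            with urlopen(EFETCH, data=efetch_body(chunk)) as reply:
                out.write(reply.read().decode("utf-8"))
```

Given the text of an accession list file, `download_fasta(read_accessions(text), "sequences.fasta")` would save all records to one file; the output file name is a placeholder.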